The Billion-Dollar Question Your AI Vendor Can't Answer: What is the Lawful Basis for Your Training Data?

Every company integrating AI is asking the same questions: "How can this technology make us more efficient? How can it help us build better products?" But they are failing to ask the most important question of all, the one that carries a billion-dollar compliance risk: What is the legal justification for the data this model was trained on?

The power of Large Language Models (LLMs) comes from the vast datasets they are trained on, often a significant portion of the public internet. This data inevitably contains billions of data points that qualify as personal data under the GDPR, and Article 6 of the GDPR requires a valid lawful basis for every processing operation involving personal data.

The problem? For most major AI models, this lawful basis is, at best, a legal grey area and, at worst, a complete compliance black hole. At Janus, we see this as the single most significant supply chain risk facing SMEs in 2025. This is not just a problem for the companies that build AI models; it is a risk inherited by every company that uses their tools.

The Six Lawful Bases: A Process of Elimination

Under the GDPR, there are six possible lawful bases for processing personal data. Let's analyze them in the context of training an LLM on a web-scale dataset:

  1. Consent: This is impossible at this scale. You cannot obtain freely given, specific, informed, and unambiguous consent from the billions of individuals whose blog posts, forum comments, and public profiles were scraped to create the training data.

  2. Contractual Necessity: This does not apply. The model developer has no contract with the individuals whose data was scraped.

  3. Legal Obligation: This does not apply to the commercial development of an AI model.

  4. Vital Interests: This is only relevant in life-or-death situations.

  5. Public Task: This is for public authorities, not private companies.

This leaves only one possible candidate:

  6. Legitimate Interests: This is the basis that most AI developers are forced to rely on. However, "legitimate interests" is not a get-out-of-jail-free card. It requires the data controller to perform a rigorous three-part test, known as a Legitimate Interests Assessment (LIA).

The Flaw in the "Legitimate Interests" Argument

An LIA has three parts: the controller must identify a legitimate interest, show that the processing is necessary to achieve it, and balance that interest against the fundamental rights and freedoms of the individuals whose data is being processed. For web-scraped training data, the necessity and balancing parts of this test are incredibly difficult to pass.

  • Was it Necessary? Could the model have been trained on a smaller, more controlled, or fully anonymized dataset?

  • What was the Impact? Did the individuals whose data was used reasonably expect that their public posts would be used to train a commercial AI model, potentially for purposes they would not agree with?

For many of the datasets used to train foundational models, the honest answers to these questions are damaging: the web-scale scraping was rarely strictly necessary, and the impact on individuals' rights and reasonable expectations is significant. Without a robust, defensible LIA, the entire legal foundation for the AI model collapses.

The Strategic Lesson for Your Business: De-Risking Your AI Supply Chain

As a business using a third-party AI tool, you are a data controller. You have a legal obligation to ensure that your vendors (your data processors) are compliant. You cannot simply trust their marketing claims.

  1. Demand the LIA: When conducting due diligence on an AI vendor, do not just ask, "Are you GDPR compliant?" Ask the specific, difficult question: "Can you provide us with your documented Legitimate Interests Assessment for the training data used in this model?" The answer will tell you everything you need to know about their compliance maturity.

  2. Prioritize Fine-Tuning over Foundational Training: The safest and most compliant way to leverage AI is often to use a pre-trained foundational model and then fine-tune it on a smaller, cleaner, and fully compliant dataset that you own and control.

  3. Implement a Data Firewall: Before sending any of your own business data to a third-party AI model for analysis, implement a technical pseudonymization layer. This "data firewall" strips out personal identifiers, allowing you to benefit from the AI's intelligence without exposing unnecessary personal data. A minimal sketch of this approach follows below.
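
To make that third point concrete, here is a minimal sketch of a pseudonymization layer in Python. The regular expression patterns are illustrative assumptions only, and send_to_ai_vendor is a hypothetical placeholder for whichever third-party API you actually call; a production data firewall would pair pattern matching with named-entity recognition and proper governance of the re-identification mapping.

    import re
    import uuid

    # Illustrative patterns only (an assumption for this sketch): a real data
    # firewall would combine pattern matching with named-entity recognition.
    PATTERNS = {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def pseudonymize(text):
        # Replace personal identifiers with opaque tokens before the text
        # leaves your environment; the mapping stays on your side.
        mapping = {}
        for label, pattern in PATTERNS.items():
            def swap(match, label=label):
                token = "[{}_{}]".format(label, uuid.uuid4().hex[:8])
                mapping[token] = match.group(0)
                return token
            text = pattern.sub(swap, text)
        return text, mapping

    def reidentify(text, mapping):
        # Restore the original identifiers in the vendor's response, locally.
        for token, original in mapping.items():
            text = text.replace(token, original)
        return text

    if __name__ == "__main__":
        prompt = "Summarise the complaint from jane.doe@example.com (+44 20 7946 0958)."
        redacted, mapping = pseudonymize(prompt)
        print(redacted)                       # safe to pass to the third-party model
        # response = send_to_ai_vendor(redacted)   # hypothetical vendor call
        print(reidentify(redacted, mapping))  # re-identification happens locally

Because the token-to-identifier mapping never leaves your environment, you retain the ability to re-identify the model's response locally, while the vendor only ever sees opaque placeholders.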

Conclusion: From Liability to Leadership

The unresolved legal questions at the heart of the AI revolution represent a significant risk for any business that fails to act with diligence. But for those that do act, they represent an opportunity.

By demanding a higher standard of compliance from your AI vendors and implementing a robust internal data governance framework, you are not only mitigating risk but are also building your business on a foundation of trust and positioning yourself as a leader in the responsible use of artificial intelligence.

At Janus Compliance, this is our area of expertise. We provide the outsourced DPO services that allow you to navigate this complex legal and technical landscape with confidence.
