Apr 8, 2026

AI Training Data and Copyright: What Your Company Needs to Know Before You Build

Learn how AI training data and copyright law affect your business. Understand legal risks, fair use, and how to avoid copyright infringement when building AI models.

Diagram showing AI training data copyright law across three panels: left shows training-phase and output-phase infringement theories plus the four fair use factors; center lists five active cases including NYT v. OpenAI and Getty v. Stability AI; right shows a five-level risk spectrum from web scraping at highest risk to fine-tuning on proprietary data at lowest

Before you train an AI model on data scraped from the web, licensed datasets, or user-generated content, you need to understand the legal exposure you may be creating. The question of whether using copyrighted material to train AI systems constitutes infringement is one of the most actively litigated issues in intellectual property law right now, and the outcomes will shape how AI companies operate for the next decade.

This is not a theoretical risk. The New York Times, Getty Images, the major record labels, and a growing list of authors and publishers have filed lawsuits against the largest AI developers in the world. None of these cases has produced a final merits ruling yet, but the legal theories they are testing are live and the stakes are enormous.

If your company is building an AI product, fine-tuning a foundation model, or ingesting third-party content into a training pipeline, this is what you need to know.

The core legal question

Copyright gives the owner of a protected work the exclusive right to reproduce it, distribute it, create derivative works, and publicly display or perform it. Training an AI model on copyrighted content involves making copies of that content, processing it, and in some cases storing it in a form that can influence model outputs. The question courts are wrestling with is whether any or all of those activities infringe the copyright owner's exclusive rights.

There are two distinct infringement theories in play.

Training-phase infringement focuses on whether copying data into a training pipeline, even temporarily, constitutes reproduction under the Copyright Act. If it does, companies that scraped or used copyrighted content without authorization may have infringed the reproduction right regardless of what the model does with the content afterward.

Output-phase infringement focuses on whether the model's outputs reproduce or are substantially similar to protected expression from the training data. This is a harder claim to prove because most outputs are not verbatim reproductions, but cases involving memorization — where models reproduce training content nearly word for word — have given plaintiffs strong factual ground to stand on.

Both theories are being tested simultaneously in current litigation.

The fair use question

The primary defense AI companies assert against training data copyright claims is fair use. Fair use is a doctrine that permits the use of copyrighted material without authorization in certain circumstances, evaluated through a four-factor test.

Factor 1: Purpose and character of the use. Courts look at whether the use is transformative — whether it adds new meaning, expression, or message rather than simply substituting for the original. AI developers argue that training is transformative because the model learns patterns and relationships rather than reproducing the content itself. Rights holders argue it is reproductive because the model ingests and stores the content.

Factor 2: Nature of the copyrighted work. Creative works receive stronger copyright protection than factual or functional ones. Training on novels, photographs, and music raises harder fair use questions than training on datasets of factual information.

Factor 3: Amount and substantiality of the portion used. AI training typically uses entire works rather than excerpts, which weighs against fair use. The fact that the entire text of a book or the full resolution of a photograph is ingested is meaningful to this analysis.

Factor 4: Effect on the market for the original work. This is often the most important factor. If AI-generated content substitutes for the original work in the market, depriving the rights holder of licensing revenue, that weighs heavily against fair use. The New York Times lawsuit is particularly focused on this factor, arguing that AI outputs that answer queries with Times content eliminate the economic incentive to visit the Times website.

No court has yet fully applied the four-factor test to AI training data in a final merits ruling. The outcomes will likely vary by the type of content, the type of model, and how the model's outputs relate to the training data. Blanket claims that AI training is always fair use, or that it never is, are both legally unsupportable at this stage.

The major cases shaping this area

The New York Times v. OpenAI and Microsoft is the highest-profile case. The Times alleges that OpenAI trained its models on millions of Times articles without authorization and that the resulting models can reproduce Times content nearly verbatim in response to certain prompts. The complaint includes documented examples of ChatGPT reproducing entire article passages. The market harm theory is central: the Times argues that AI-generated summaries and reproductions reduce traffic and licensing revenue. This case is proceeding in the Southern District of New York.

Getty Images v. Stability AI alleges that Stability AI copied over 12 million Getty images to train Stable Diffusion without authorization or compensation. Getty has presented evidence that Stable Diffusion can generate images bearing Getty's watermark, which it argues demonstrates direct copying from the training set. The case is proceeding in parallel in the US and UK.

Andersen v. Stability AI, Midjourney, and DeviantArt is a class action brought by visual artists who allege their work was used to train image generation models without consent or compensation. The district court dismissed several claims but allowed core infringement claims to proceed, and the case is ongoing.

UMG Recordings v. Suno and Udio involves the major record labels suing AI music generation platforms for allegedly training on copyrighted sound recordings. The labels are seeking significant statutory damages. This case tests whether audio training raises different considerations than text or image training.

The Authors Guild cases involve a class of authors whose books were allegedly used without authorization to train large language models. These cases are testing both the reproduction and market substitution theories in the text context.

The common thread across all of these cases is that rights holders are asserting claims across every content category, and the litigation is moving fast enough that outcomes in some cases could arrive within the next 12 to 18 months.

What the Copyright Office has said

The US Copyright Office has been actively studying AI and copyright in a multi-part report, and the part addressing generative AI training was released in pre-publication form in May 2025. Key takeaways from the Office's current position:

Training AI models on copyrighted works raises serious copyright questions that existing fair use doctrine may not cleanly resolve. The Office stopped short of declaring training infringement per se, but it also declined to endorse the AI industry's broad fair use argument.

The Office has recommended that Congress consider a licensing framework for AI training, which would allow rights holders to receive compensation for the use of their content while giving AI developers a clear legal path to training data. No such legislation has passed as of 2026, but the recommendation signals where regulatory thinking is heading.

The Office has also been receptive to the argument that rights holders should have opt-out rights from AI training, analogous to the opt-out mechanisms that exist in some other contexts. Several major platforms have begun implementing opt-out systems, though their legal significance is uncertain.

The practical risk framework for companies building AI

Not every company using training data faces the same level of risk. Here is a framework for evaluating your exposure.

Highest risk: Scraping copyrighted content from the web without authorization and training a model on it, particularly if the model's outputs can reproduce or closely substitute for the original content. This is the fact pattern at the center of the major litigation and represents the greatest legal exposure.

Significant risk: Licensing data from aggregators or data brokers without verifying that the underlying content was properly licensed from rights holders. Downstream purchasers of infringing data can face liability even if they didn't do the scraping themselves.

Moderate risk: Using publicly available datasets that have uncertain provenance. Many commonly used training datasets were assembled before the current legal environment crystallized, and their copyright status is not always clear.

Lower risk: Training on content you created, content that is genuinely in the public domain, content licensed directly from rights holders with terms that explicitly permit AI training use, or synthetic data generated for training purposes.

Lowest risk: Fine-tuning a foundation model on your own proprietary data rather than training from scratch on third-party content. Fine-tuning of this kind raises fewer copyright questions because the only new data entering your pipeline is data you own; the underlying model was trained by its developer, not by you.
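For engineering teams, the framework above can be sketched as a simple lookup used to triage datasets before legal review. This is an illustration, not a legal determination. The source labels and the `RiskTier` names are hypothetical, the five levels follow the five-level spectrum in the diagram at the top of this article, and the rule that a pipeline is only as safe as its riskiest source is an assumption made for the sketch.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    """Hypothetical ordering of the exposure levels described above."""
    LOWEST = 1
    LOWER = 2
    MODERATE = 3
    SIGNIFICANT = 4
    HIGHEST = 5


# Hypothetical source labels mapped to the tiers in this article.
TIER_BY_SOURCE = {
    "web_scrape_unauthorized": RiskTier.HIGHEST,
    "broker_unverified": RiskTier.SIGNIFICANT,
    "public_dataset_uncertain_provenance": RiskTier.MODERATE,
    "own_content": RiskTier.LOWER,
    "public_domain": RiskTier.LOWER,
    "direct_license_ai_training": RiskTier.LOWER,
    "synthetic": RiskTier.LOWER,
    "fine_tune_proprietary": RiskTier.LOWEST,
}


def pipeline_risk(sources):
    """Return the highest tier among the sources.

    Unknown labels are treated as HIGHEST so that undocumented data
    is never silently assumed to be safe.
    """
    tiers = (TIER_BY_SOURCE.get(s, RiskTier.HIGHEST) for s in sources)
    return max(tiers, default=RiskTier.LOWEST)
```

A pipeline that mixes `own_content` with `broker_unverified` would be triaged as `SIGNIFICANT`, which reflects the point that one poorly documented source can drive the exposure of the whole pipeline.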

Steps to take before you build or expand your training pipeline

Conduct a training data audit. Identify every dataset in your training pipeline and document its source, the terms under which it was obtained, and whether those terms explicitly authorize AI training use. Many data licenses predate the AI training context and are silent on the question. Silence is not authorization.
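As a starting point, the audit can be kept as a structured inventory rather than a spreadsheet of free text. The sketch below is a minimal example under stated assumptions: the `DatasetRecord` fields and the sample entries are hypothetical, and the only rule it encodes is the one above, that a license which is silent on AI training is treated as not authorizing it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """One row of a hypothetical training-data inventory."""
    name: str
    source: str                             # where the data came from
    license_terms: Optional[str]            # None if no terms were recorded
    ai_training_authorized: Optional[bool]  # None means the license is silent


def audit(records):
    """Return the names of datasets that need legal review.

    A dataset is flagged unless its terms are documented and explicitly
    authorize AI training. Silence (None) is treated as not authorized.
    """
    return [
        r.name
        for r in records
        if r.license_terms is None or r.ai_training_authorized is not True
    ]


# Hypothetical inventory for illustration only.
inventory = [
    DatasetRecord("support_tickets", "internal", "company-owned", True),
    DatasetRecord("news_feed", "syndication agreement", "2019 contract", None),
    DatasetRecord("forum_dump", "unknown", None, None),
]
flagged = audit(inventory)
print(flagged)  # ['news_feed', 'forum_dump']
```

The output of a check like this is a list of questions for counsel, not an answer. Its value is that every dataset has a recorded source and a recorded position on AI training rights.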

Verify that your licenses cover AI training. Standard commercial data licenses, content syndication agreements, and API terms of service often do not include AI training rights. If you are relying on a license to authorize your training use, have counsel review whether that license actually covers what you are doing.

Implement opt-out mechanisms where applicable. If your training pipeline includes web-scraped content, implementing robots.txt compliance and content opt-out mechanisms reduces your exposure and may be relevant to a fair use defense. It also positions you better for a regulatory environment where opt-out rights may become mandatory.
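For teams that operate their own crawler, robots.txt compliance can be checked in code before a URL is fetched. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The crawler name `ExampleTrainingBot`, the sample robots.txt content, and the URLs are hypothetical, and a real pipeline would fetch each site's live robots.txt rather than a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls):
    """Keep only the URLs that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt that blocks one crawler from /articles/.
ROBOTS = """\
User-agent: ExampleTrainingBot
Disallow: /articles/

User-agent: *
Allow: /
"""

urls = [
    "https://example.com/articles/story-1",
    "https://example.com/about",
]
kept = allowed_urls(ROBOTS, "ExampleTrainingBot", urls)
print(kept)  # ['https://example.com/about']
```

Logging which URLs were skipped, and why, also creates a record that the pipeline honored disallow rules at the time of collection.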

Assess your output-phase exposure. Test whether your model can reproduce substantial portions of training data verbatim or near-verbatim. Output memorization is one of the strongest factual bases for plaintiff claims and is something you can identify and address in model development.
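One simple way to start testing for memorization is to measure the longest run of words that a model output shares verbatim with a known training document. The sketch below uses Python's standard `difflib` module. The function names and the eight-word threshold are illustrative assumptions, not a legal standard for substantial similarity, and a production test would compare outputs against an index of the full training set rather than a single document.

```python
from difflib import SequenceMatcher


def longest_shared_run(training_text: str, model_output: str) -> int:
    """Length, in words, of the longest word sequence both texts share."""
    a = training_text.lower().split()
    b = model_output.lower().split()
    # autojunk=False so frequent words are not ignored in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def looks_memorized(training_text, model_output, threshold_words=8):
    """Flag outputs that share a long verbatim run with a training document.

    threshold_words is an arbitrary illustrative cutoff.
    """
    return longest_shared_run(training_text, model_output) >= threshold_words


# Hypothetical texts for illustration only.
training = "the quick brown fox jumps over the lazy dog near the river bank"
output = "He wrote that the quick brown fox jumps over the lazy dog yesterday"
print(longest_shared_run(training, output))  # 9
print(looks_memorized(training, output))     # True
```

Outputs flagged this way are candidates for closer review and for mitigation during model development, such as deduplicating training data or filtering outputs.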

Document your fair use analysis. If your position is that your training use qualifies as fair use, document the analysis contemporaneously. A well-reasoned, contemporaneous fair use assessment supports a good-faith defense and demonstrates that you considered the legal question seriously rather than ignoring it.

Consider training data insurance. A small number of insurers now offer IP liability coverage specifically for AI training data claims. Whether this makes sense depends on your risk profile, the size of your training dataset, and the commercial stakes of your product.

Frequently asked questions

Does using publicly available content mean I can train on it freely?

No. Public availability does not mean copyright-free. Most content on the internet is protected by copyright regardless of whether it carries a copyright notice. The fact that content is accessible does not mean the rights holder has authorized AI training use.

What is the robots.txt file and does it matter legally?

The robots.txt file is a web standard that allows website owners to communicate to crawlers which content should not be scraped. Ignoring robots.txt disallowances does not automatically constitute copyright infringement, but it is relevant to the willfulness analysis and to fair use. Several courts have indicated that disregarding robots.txt is a factor weighing against fair use.

Can I use content licensed under Creative Commons for AI training?

It depends on the specific Creative Commons license. CC0 (public domain dedication) and CC BY (attribution) licenses generally permit broad reuse including AI training. CC BY-NC (non-commercial) licenses may prohibit commercial AI training. CC BY-SA (share-alike) licenses impose conditions on derivative works that may complicate AI training use. Review the specific license terms and have counsel evaluate whether your intended use is within scope.
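The distinctions above can be sketched as a first-pass triage for datasets that carry Creative Commons metadata. This is a simplified illustration, not a substitute for reading the license: the `Triage` labels and the `triage_cc` function are hypothetical, the mapping only restates the general positions described in this answer, and any license the sketch does not recognize is routed to counsel by default.

```python
from enum import Enum


class Triage(Enum):
    LIKELY_IN_SCOPE = "likely in scope"
    NEEDS_COUNSEL = "needs counsel review"


def triage_cc(license_id: str, commercial_use: bool) -> Triage:
    """First-pass sort of a Creative Commons license for AI training use."""
    # Normalize "cc by-nc" and "CC BY-NC" to the same key: "CC BY NC".
    key = " ".join(license_id.upper().replace("-", " ").split())
    if key in ("CC0", "CC BY"):
        return Triage.LIKELY_IN_SCOPE
    if key == "CC BY NC":
        # Non-commercial terms may prohibit commercial AI training.
        return Triage.NEEDS_COUNSEL if commercial_use else Triage.LIKELY_IN_SCOPE
    # CC BY-SA and anything unrecognized goes to counsel.
    return Triage.NEEDS_COUNSEL
```

For example, `triage_cc("CC BY-NC", commercial_use=True)` returns `Triage.NEEDS_COUNSEL`, while `triage_cc("CC0", commercial_use=True)` returns `Triage.LIKELY_IN_SCOPE`. Even for licenses sorted as likely in scope, attribution and other conditions still need to be met.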

If I fine-tune an existing model rather than train from scratch, do I have the same exposure?

Fine-tuning on proprietary or properly licensed data significantly reduces your training data copyright exposure because you are not ingesting third-party copyrighted content into your training pipeline. However, you should still verify that your fine-tuning data is properly licensed and review the terms of the foundation model you are building on, as those terms may restrict downstream use.

What happens if the major cases go against AI developers?

If courts find that AI training on copyrighted content constitutes infringement without fair use protection, the practical consequences would include: mandatory licensing of training data, retroactive liability for companies that trained on unlicensed content, and a restructuring of how foundation models are built and accessed. The industry would likely shift toward licensed data pipelines, which would increase training costs and potentially favor larger companies with resources to negotiate broad licensing deals.

The legal landscape around AI training data is unsettled, but the direction of travel is clear: rights holders are organized, the litigation is moving forward, and regulatory guidance is pointing toward a more structured framework for training data use. Companies that wait for final court rulings before thinking about their training data legal position are taking on risk they don't need to take.

If you want to assess your company's training data exposure, review your data licenses, or build a compliance framework for your AI development pipeline, contact Ana Law to schedule a strategy session.