The rise of generative AI has spawned a slew of lawsuits brought by content owners and copyright holders against tech companies for using their copyrighted works as data to train AI systems without permission.
In the following table, VIP+ provides all the active lawsuits filed in the U.S. by content owners. It details the name of the suit, the date and state where it was originally filed, the type of data it relates to and a brief description of the litigation claim.
Litigation Claims
For the creative community, the source of AI training data is a chief frustration with generative AI. All of the lawsuits by content owners primarily frame their complaints as copyright infringement. Some cases have also claimed unfair competition, unjust enrichment, trademark dilution and commercial misappropriation, among other complaints.
Meanwhile, AI developers commonly hold that training on copyrighted works is fair use, the exception under U.S. copyright law that allows for certain uses of copyrighted material without permission.
Plaintiffs & Defendants
Cases so far have been brought against tech companies and AI firms by a range of content producers. Many types of content producers are notably not yet represented, including Hollywood studios, video game companies and book publishers.
Plaintiffs across the various suits include the following:
- News publishers, notably including The New York Times Co., eight newspaper publishers owned by Alden Global Capital such as the New York Daily News, and others have filed suits against Microsoft and OpenAI.
- Book authors, notably including Sarah Silverman, John Grisham, Jodi Picoult, David Baldacci, George R.R. Martin, Ta-Nehisi Coates, Junot Diaz, Paul Tremblay and Mona Awad, as well as the Authors Guild, have filed suits against OpenAI, Meta, Databricks and Nvidia.
- Visual artists, notably including Sarah Andersen, Kelly McKernan, Karla Ortiz and others, have filed suits against Stability AI and Google.
- Stock image company Getty Images has filed suit against Stability AI.
- Major music labels, including Universal Music Group, Warner Music Group and Sony Music Entertainment, have filed against Anthropic, Suno and Uncharted Labs (developer of the AI music generator Udio). The suit against Anthropic pertained specifically to song lyrics, not music.
Among the defendants are Big Tech companies and large AI firms, each developing large language model-based commercial generative AI tools and products. Key firms include Microsoft and OpenAI, Meta, Google, Nvidia, Stability AI, Anthropic and Databricks. Music publishers have brought their infringement claims against the developers of AI music generators, including Suno and Uncharted Labs (developer of Udio).
Transparency on Training Data
Incentives at tech companies work against transparency. AI developers don’t reveal complete details about the data ingested by their models, arguing it is proprietary information whose disclosure would expose them to competitive harm. Silence on training data also limits legal risk, as detailed knowledge of what was ingested would invite more infringement cases from rights holders.
Yet such transparency is likely critical to effective litigation claims. Without it, rights holders are hard-pressed to prove their copyrighted works were in fact used for training.
Instead, rights holders and their lawyers in some of these cases have been able to reasonably infer that training occurred by demonstrating that a prompt to the AI model is capable of outputting verbatim or substantially similar material relative to the copyrighted work.
In other cases, plaintiffs can point to references to specific datasets in research whitepapers that AI developers published for a given model, which can sometimes be traced back to the original dataset publisher, where more details are often found. Barring these methods, press reports based on leaks from within tech companies can indicate when specific data has been used for AI training, as recent 404 Media reports on Runway and Nvidia have done.
Legislative pushes to force greater transparency are mounting. New federal bills have been introduced that would require model developers and dataset publishers, respectively, to disclose certain information: the AI Foundation Model Transparency Act in December and the Generative AI Copyright Disclosure Act in April.
What’s Next
The market is still waiting for meaningful answers on copyright and AI training. It will take time for litigation to make its way through the courts, but case outcomes could give the market its earliest signal on whether training AI on copyrighted works is fair use or infringement. This single open question is arguably the most significant and potentially existential one facing generative AI, as the free use of web-scraped data at scale has enabled today’s generative AI models.
In the U.S., additional guidance is also expected to come from the Copyright Office later this year. Having released the first of three reports in its study on artificial intelligence after requesting public comments last fall, the U.S. Copyright Office will release the third part sometime this fall, addressing the legal implications of training AI models on copyrighted works, licensing considerations and the allocation of any potential liability.
If courts rule that AI companies need permission to train on copyrighted material, that would usher in a stricter paradigm broadly requiring developers to license content. Without immediate answers, copyright concerns pose a problematic barrier to the safe adoption of generative AI in the media and entertainment industry.