Adobe is facing a proposed class-action lawsuit alleging that it illegally used copyrighted books, including works by author Elizabeth Lyon, to train one of its artificial intelligence (AI) models. The dispute centers on Adobe's SlimLM program, a language model designed for document assistance, and the dataset used to develop it.

The Allegations: Pirated Books Fueling AI

The lawsuit claims Adobe leveraged a dataset containing pirated material: specifically, a manipulated version of the controversial “Books3” collection, which consists of 191,000 books. This dataset was integrated into SlimPajama-627B, the open-source dataset Adobe used to pre-train SlimLM. Lyon asserts that her own copyrighted guidebooks were included in this illegally sourced training data.

Why This Matters: The AI Copyright Debate

This case highlights a growing concern within the tech industry: the ethical and legal implications of using copyrighted material to train AI models. Many generative AI systems rely on massive datasets scraped from the internet, often including books, articles, and images without explicit permission from copyright holders. The legality of such practices remains contested, with multiple lawsuits now challenging the industry’s approach.

Broader Legal Challenges in the AI Space

Adobe is not alone in facing scrutiny. Similar lawsuits have been filed against Apple and Salesforce, both accused of training their AI models on copyrighted content sourced from datasets such as RedPajama, which is linked to Books3. These cases are testing the boundaries of fair use and copyright law in the age of generative AI.

The central question is whether companies can profit from AI trained on stolen intellectual property. The outcome of these lawsuits could reshape the future of AI development.

The lawsuit against Adobe underscores the growing legal risk for tech companies eager to deploy AI without resolving the underlying copyright issues. As more authors and creators challenge these practices, the industry may be forced to adopt more transparent and legally compliant data-sourcing methods.