Technology giants are back in court over the data used to train their artificial intelligence models. The company Chicken Soup for the Soul has filed a massive copyright infringement lawsuit against major tech firms for the unauthorized use of the data repository known as “The Pile.” The high-profile defendants include Apple, Meta, xAI, Google, Anthropic, OpenAI, Perplexity, and NVIDIA.
AI Training Crisis: Apple and Google Under Fire
The dataset at the heart of the lawsuit, “The Pile,” contains a vast amount of information scraped from the internet. At the center of the controversy is a specific subset called “Books3,” often described as a shadow library. The Pile as a whole includes a large volume of copyrighted content, ranging from books that have already figured in other lawsuits to YouTube transcript files. The suit alleges that the defendants used this pirated data to train their respective AI tools.

Apple stands out as a particularly interesting name in the lawsuit. The company previously stated unequivocally that it did not use this controversial dataset to train its flagship Apple Intelligence models. Apple researchers used the dataset only in a publicly accessible open-source project called OpenELM. In 2024, Apple positioned itself as one of the few companies attempting to train AI models through legal and ethical channels. Consequently, there is a strong possibility that Apple will be dismissed from the case at a later stage.
The “Gemini” Connection: Indirect Legal Risks
The situation for other companies is not as clear-cut. Firms like Perplexity, which crawl the web extensively, argue that they have a right to use public information. However, another detail complicates matters for Apple: some of its newer foundation models were reportedly trained using Google Gemini. If Google is found liable for using this pirated dataset, Apple could be pulled back into the legal process indirectly through its reliance on Google’s technology.
The training of AI models using human-generated data and copyrighted books has become one of the most significant debates in modern technology.

