OpenAI faces a critical legal setback in a class-action lawsuit brought by authors who allege ChatGPT was illegally trained on their copyrighted works: a US magistrate judge has ordered the company to disclose internal communications about the deletion of two controversial datasets, “Books 1” and “Books 2.” The datasets, compiled in 2021 by former OpenAI employees including Dario Amodei, drew heavily on the shadow library Library Genesis (LibGen) via open web scraping. OpenAI deleted them before ChatGPT’s 2022 launch, but the authors suspect the deletions were motivated by awareness of copyright infringement, evidence that could support a finding of willful infringement and raise statutory damages to as much as $150,000 per work.
Authors’ Lawsuit Centers on Dataset Deletion
The core dispute concerns OpenAI’s rationale for excising the datasets, a decision first discussed in a Slack channel named “excise-libgen,” which was later renamed “project-clear” at in-house lawyer Jason Kwon’s suggestion. The authors argue OpenAI flip-flopped: it first cited “non-use” as the reason for deletion, then retracted that explanation as protected by attorney-client privilege, prompting intensified discovery demands. Judge Ona Wang ruled that this inconsistency waived privilege, ordering OpenAI to produce all in-house lawyer communications about the deletions, along with unredacted LibGen references, by December 8, 2025, and to make the lawyers available for depositions by December 19.
Judge Rejects OpenAI’s Privilege Claims
Wang scrutinized OpenAI’s privilege assertions, noting that most of the Slack messages neither requested legal advice nor reflected counsel’s input and were therefore not privileged, even where lawyers participated in the channel or created it. OpenAI’s shifting positions, first treating the “non-use” explanation as non-privileged and then as privileged, strained credulity: the company publicly stated reasons it later sought to shield. The judge emphasized that OpenAI cannot dodge discovery by editing prior filings after more than a year on the docket, which puts the company’s good-faith defense and state of mind squarely at issue for a potential willfulness finding.
Willful Infringement Implications
Revealing the motives behind the deletions could expose OpenAI’s awareness of infringing activity, or its reckless disregard for authors’ rights, both central to willful infringement under copyright law. Authors’ counsel Christopher Young highlighted the risk to OpenAI if the evidence shows it avoided the datasets in later models out of legal fear, or masked continued use of them under new names. Wang noted that OpenAI’s recent edits to its filings, dropping the phrases “good faith,” “innocent,” and “reasonably believed,” strengthen the case for discovery, since a jury deserves insight into the company’s mindset.
Critique of OpenAI’s Fair Use Arguments
OpenAI misrepresented a prior ruling against Anthropic by Judge William Alsup, falsely claiming he had deemed it lawful to download pirated books for LLM training as long as they were discarded afterward. Wang clarified Alsup’s actual holding: pirating available copies remains inherently infringing, even when followed by transformative fair use and subsequent deletion, a scenario that directly mirrors OpenAI’s conduct. This contradiction undermines OpenAI’s good-faith narrative, especially given parallel evidence that Anthropic hesitated over pirated training data for legal reasons before its $1.5 billion settlement, the largest AI copyright class-action payout to date.
Dario Amodei’s Role and Broader Stakes
The authors also seek testimony from Anthropic CEO Dario Amodei, who helped create the datasets while at OpenAI and has direct knowledge of their destruction; a March ruling compelled his deposition over OpenAI’s opposition. Wang identified a “fundamental conflict” in OpenAI’s position: it asserts good faith based on counsel’s advice while using privilege to block inquiry into that advice, which weakens its defense. OpenAI says it disagrees with the order and plans to appeal, but compliance looms, and the withheld messages could yield smoking-gun proof akin to Anthropic’s internal shift away from pirated books “for legal reasons.”
Potential Outcomes for OpenAI Litigation
Forced disclosures may pressure OpenAI toward settlement, mirroring Anthropic’s path after authors uncovered its internal reluctance over pirated data. Proving willfulness through the deletion discussions could sharply amplify damages and reshape expectations around AI training transparency and liability. As discovery unfolds, OpenAI’s handling of these communications will test its litigation strategy against growing scrutiny of reliance on shadow libraries in model development.
This escalating battle underscores the tension between AI innovation and creators’ rights, and OpenAI’s dataset saga could set precedents for fair use in generative models. The authors’ persistence in piercing privilege highlights the risks of training on unlicensed scraped content, even content that is later discarded. The December deadlines make this a pivotal moment in AI copyright enforcement.