The OpenAI Copyright MDL Has Become a Data-Governance Case

Current through June 9, 2026.

The OpenAI copyright multidistrict litigation is no longer important merely because many lawsuits were consolidated in one court. It has produced a record of claim-specific rulings and unusually consequential discovery disputes involving model-development records, privilege, and large sets of ChatGPT conversation logs.

No court has entered a final judgment that resolves whether OpenAI’s model training is fair use. The litigation nevertheless offers a practical warning for companies deploying generative AI: ordinary decisions about retention, deletion, vendor terms, and internal records can become central evidence in a major lawsuit.

Where the MDL stands

The Judicial Panel on Multidistrict Litigation created MDL No. 3143 on April 3, 2025, transferring four actions to the Southern District of New York for coordinated or consolidated pretrial proceedings. Additional actions have since been transferred. The proceeding is captioned In re OpenAI, Inc. Copyright Infringement Litigation, No. 25-md-3143, before District Judge Sidney H. Stein and Magistrate Judge Ona T. Wang.

The consolidated cases are not identical. They include claims by authors, newspapers, publishers, and other rights holders involving model training, allegedly infringing outputs, contributory infringement, and provisions of the Digital Millennium Copyright Act concerning copyright-management information.

An MDL coordinates overlapping pretrial work, but it does not turn every plaintiff’s theory into one claim or decide that any claim will succeed.

The case has moved beyond consolidation

In a December 15, 2025 ruling involving Ziff Davis, the court allowed several claims to proceed past a motion to dismiss, including contributory-infringement and certain DMCA copyright-management-information claims. It dismissed other theories, including the claim that disregarding robots.txt instructions constituted circumvention of an effective technological measure. The ruling did not decide liability, but it showed that the litigation will turn on specific claims, facts, and model behavior rather than one sweeping answer about AI and copyright.

Discovery has become just as important as the pleading rulings:

In January 2026, Judge Stein upheld orders requiring production of a sample of 20 million de-identified ChatGPT conversation logs.
In February 2026, Judge Stein set aside a magistrate judge’s ruling that OpenAI had waived attorney-client privilege over certain 2022 communications concerning the Books1 and Books2 datasets and Library Genesis.
In March 2026, Magistrate Judge Wang granted in part a request involving additional reservoirs of 78 million and 10 million logs, subject to a protocol addressing de-identification and user privacy.
In May 2026, the court ordered OpenAI to produce deposition testimony from separate litigation involving Sam Altman, Greg Brockman, Satya Nadella, and an OpenAI corporate designee after finding the narrowed request relevant and proportional.

Those developments do not establish infringement. They do show what an AI copyright case can demand once it reaches coordinated discovery.

The clearest governance lesson is about data

The MDL has made retention and discovery policy part of the copyright-risk discussion.

For enterprise users, the immediate lesson is not that their prompts will necessarily become evidence in this case. It is that vendor data practices, contractual promises, litigation holds, and court-ordered discovery can interact in ways that are easy to overlook during procurement.

Legal, privacy, security, and procurement teams should ask:

What prompts, outputs, metadata, and logs does the vendor retain?
Which retention settings are defaults, and which require an enterprise plan or approval?
Can ordinary deletion schedules be suspended by a legal hold, court order, or regulatory obligation?
What data may be used to improve models, and what requires an affirmative opt-in?
Which internal repositories or systems can the tool access?
Can the company preserve its own audit trail without collecting more sensitive content than it needs?

OpenAI states that business-product and API data are not used to train its models by default. It also states that API inputs and outputs are generally removed after 30 days unless retention is legally required, and that eligible customers may request zero-data-retention controls for qualifying endpoints. Those are meaningful controls, but buyers still need to understand exceptions, product-specific settings, and what happens when litigation or another legal obligation changes the ordinary retention rules.
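The interaction between a default deletion window, a zero-retention option, and a legal hold can be sketched in a few lines of Python. This is a minimal illustration, not a description of how OpenAI or any other vendor implements retention. The `RetentionPolicy` fields, the `scheduled_deletion` function, and the rule that a hold overrides every other setting are assumptions made for the example; the actual order of precedence is something to confirm with the vendor.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RetentionPolicy:
    """Hypothetical record of one product's retention terms."""
    default_days: int             # ordinary deletion window, e.g. 30
    zero_retention_enabled: bool  # True only if the option is actually switched on
    legal_hold_active: bool       # litigation hold, court order, or similar


def scheduled_deletion(created: date, policy: RetentionPolicy) -> Optional[date]:
    """Return the date a record is due for deletion, or None if deletion is suspended."""
    if policy.legal_hold_active:
        # Assumption: a hold suspends the ordinary schedule entirely.
        return None
    if policy.zero_retention_enabled:
        # Assumption: nothing is kept beyond the day of the request.
        return created
    return created + timedelta(days=policy.default_days)


if __name__ == "__main__":
    ordinary = RetentionPolicy(default_days=30, zero_retention_enabled=False, legal_hold_active=False)
    on_hold = RetentionPolicy(default_days=30, zero_retention_enabled=False, legal_hold_active=True)
    print(scheduled_deletion(date(2026, 1, 1), ordinary))  # 2026-01-31
    print(scheduled_deletion(date(2026, 1, 1), on_hold))   # None
```

The useful part of the exercise is the `None` branch: a procurement record that stores only the default window has no way to express that the ordinary schedule has stopped applying.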

Procurement terms need operational follow-through

Contract review remains important, but contract language alone is not a governance program.

Buyers should evaluate vendor representations about training data, output restrictions, indemnity, retention, confidentiality, security, and cooperation during disputes. They should also make sure internal settings and workflows match the negotiated terms.

A contract may promise strong business-data protections while employees continue using consumer accounts. A zero-retention option does little if it was never enabled for the relevant endpoint. A restriction on confidential data will not help if the organization has no practical rule for deciding which repositories or matters an AI coding or research tool may access.
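The gap between negotiated terms and actual configuration can be checked mechanically once both are written down. The following Python sketch compares what a contract promises with what an administrator has recorded for a product. The function name `find_gaps` and the setting names are hypothetical, and the comparison is deliberately naive: it flags any mismatch or missing record for a person to investigate and makes no legal judgment.

```python
from typing import Dict, List


def find_gaps(contract: Dict[str, object], observed: Dict[str, object]) -> List[str]:
    """List settings where the recorded configuration differs from the contract."""
    gaps = []
    for setting, promised in contract.items():
        if setting not in observed:
            # No one has recorded the real setting, which is itself a finding.
            gaps.append(f"{setting}: promised {promised!r}, not recorded")
        elif observed[setting] != promised:
            gaps.append(f"{setting}: promised {promised!r}, observed {observed[setting]!r}")
    return gaps


if __name__ == "__main__":
    # Hypothetical setting names, for illustration only.
    contract_terms = {
        "zero_retention": True,
        "training_on_business_data": False,
        "consumer_accounts_allowed": False,
    }
    recorded_config = {
        "zero_retention": False,             # option negotiated but never enabled
        "training_on_business_data": False,  # matches the contract
    }
    for gap in find_gaps(contract_terms, recorded_config):
        print(gap)
```

In this example the check reports two items: the zero-retention option that was never enabled, and the consumer-account rule for which no actual practice has been recorded.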

Code-generation controls remain useful, but they are a separate issue

AI-generated code presents a related but distinct set of copyright and open-source-license risks. Companies should not treat the OpenAI MDL as proof that generated code infringes or that any particular compliance control is legally required.

Still, software companies have practical reasons to treat AI-generated code like third-party code until it is reviewed:

use approved enterprise coding tools and accounts;
enable public-code matching or reference features where available;
require human review for long or unusually specific generated snippets;
scan generated code for open-source and snippet-level risks before release;
document remediation of flagged code; and
restrict sensitive repository and file access.

GitHub documents a policy that can block Copilot suggestions matching publicly available code or allow them with code references. Cursor states that Privacy Mode enables zero data retention for model providers, while also explaining that some code data may still be stored to provide additional features and that codebase indexing involves uploads for embedding. These controls address different risks. Public-code matching is not a confidentiality control, and privacy settings are not license-clearance tools.
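The review steps listed above can be expressed as a simple pre-release check. The Python sketch below is an assumption-laden illustration, not the behavior of GitHub Copilot, Cursor, or any scanning product. The `Snippet` fields, the `release_blockers` function, and the 25-line threshold are hypothetical, and the inputs are assumed to come from whatever matching and scanning tools the organization already uses.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoff for what counts as a "long" snippet; set by policy, not by law.
REVIEW_LINE_THRESHOLD = 25


@dataclass
class Snippet:
    """Hypothetical record for one block of AI-generated code."""
    line_count: int
    matches_public_code: bool     # as reported by a public-code matching feature
    license_scan_done: bool       # open-source / snippet-level scan completed
    human_reviewed: bool
    remediation_documented: bool  # relevant only when a match was flagged


def release_blockers(s: Snippet) -> List[str]:
    """Return the reasons this snippet should not ship yet (empty list means clear)."""
    blockers = []
    if not s.license_scan_done:
        blockers.append("license scan not run")
    needs_review = s.line_count >= REVIEW_LINE_THRESHOLD or s.matches_public_code
    if needs_review and not s.human_reviewed:
        blockers.append("human review required")
    if s.matches_public_code and not s.remediation_documented:
        blockers.append("remediation of flagged match not documented")
    return blockers


if __name__ == "__main__":
    flagged = Snippet(line_count=40, matches_public_code=True,
                      license_scan_done=True, human_reviewed=False,
                      remediation_documented=False)
    print(release_blockers(flagged))
```

The check addresses license and review risk only. Consistent with the distinction drawn above, it says nothing about confidentiality, which depends on account, retention, and repository-access settings.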

What legal teams should do now

Map which AI products are in use, including consumer accounts and tools embedded in other software.
Record the actual retention, training, access, and deletion settings for each approved product.
Reconcile vendor contracts with technical configuration and employee practice.
Decide what AI-use records are genuinely needed for audit, compliance, or litigation readiness.
Apply separate controls for confidential data, generated code, external-facing content, and high-risk decisions.
Revisit the program when vendor terms, product settings, or major court rulings change.

What to watch next

The most consequential developments will be rulings that clarify training-copy theories, output-based claims, fair use, DMCA liability, and the permissible scope of discovery. Additional transfer orders also matter because they determine which disputes enter the coordinated proceeding.

For enterprise users, the discovery fights may be as instructive as the eventual merits rulings. They show that AI governance is not limited to deciding whether employees may use a tool. It includes knowing what the tool retains, what the company can control, and what may have to be preserved or produced when litigation begins.