Sneak Peek: Copyright Office Releases Pre-Publication Version of Third Report on Generative AI

By Sigrid Jernudd on July 4, 2025

On May 9, 2025, the U.S. Copyright Office released a “pre-publication” version of Part Three of its planned reports on the intersection between copyright and generative artificial intelligence (“AI”).  Titled “Part 3: Generative AI Training,” the Report addresses whether AI companies may use copyrighted material to train their products.  Part One, discussed previously on this blog, focused on digital replicas.  Part Two, also discussed previously on this blog, focused on the copyrightability of AI-generated outputs.  All three publications follow a lengthy notice-and-comment period. 

The extent to which the Report will become Copyright Office policy remains to be seen.  The Copyright Office has yet to issue a final report after the Trump administration fired both Librarian of Congress Carla Hayden and Register of Copyrights Shira Perlmutter.  In a suit seeking her reinstatement, Ms. Perlmutter takes pains to flag that she was terminated the day after the Copyright Office issued the “pre-publication” Report.[1] The future will tell who will head the Copyright Office and if new leadership will embrace the pre-publication version of the Report.

While the Report declines to take a firm stance on whether uses of copyrighted material to train generative AI tools should be treated as “fair use,” it suggests that there are a number of situations where fair use should not apply.  Instead, AI companies should license materials from their copyright owners.  However, the Report also concludes that there is no need for Congress to enact a compulsory license scheme to enable the progress of generative AI, finding that voluntary collective licensing schemes should be adequate to make this new licensing market work.

The Technical Background of Generative AI

The Report starts by outlining how companies develop generative AI and defining several key terms:

  • Machine Learning—A “field of artificial intelligence focused on designing computer systems that can automatically learn and improve based on data or experience, without relying on explicitly programmed rules.”[2] The AI learns by “creating a statistical model using examples of inputs and expected outputs, called ‘training data,’ along with a metric of how well the model performs.”[3]
  • Generative Language Models—A mathematical “statistical model of language . . . represented by the probability of the next word given all the preceding words or ‘context.’”[4] These models are trained using “generative pre-training,” in which “text examples serve as both the input and expected output, with performance measured by how well the model predicts each next token (output) based on preceding tokens (input).”[5]
  • Training Data—Generative AI models typically need an extremely large dataset to engage in machine learning, creating a massive demand for data.[6] Generative AI models also need high-quality data, which may include images without watermarks and text that is “edited, factually rich, and cover[s] diverse topics.”[7] Finally, generative AI models need data that “align[s] with the expected use of the model.”[8] Developers acquire this data in a variety of ways, including “scraping” data from the internet, “pirate sources” (such as “shadow libraries” collecting “full, published books”), or through interactions with customers or users.[9] Raw data then needs to be curated and processed before it can be used.[10]
  • Training—The “iterative process” of developing the generative AI model.[11] Training can involve different stages, with different data needs.[12] The Report notes that commentators dispute the extent to which trained model weights reflect “memorization” of the training data, but even a small memorization rate of one percent could be significant.[13]
  • Deployment—Most individual models “are deployed in larger AI systems,” with which users interact.[14] The same AI model can be used in systems that “perform very differently,” which impacts the materials they use to generate outputs.[15]
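The “generative pre-training” objective described above, in which each next token is predicted from the preceding context, can be sketched with a toy word-level model.  This is a deliberately simplified illustration of the statistical idea, not the architecture of any actual system; all function and variable names below are hypothetical:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count next-word frequencies: a minimal statistical language model.

    Each (previous word, next word) pair in the training text serves as
    both input and expected output, mirroring the way text examples are
    used in generative pre-training.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(model, context_word):
    """Return the most probable next word given the preceding word."""
    followers = model.get(context_word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

A production model replaces the frequency table with billions of learned weights and conditions on the entire preceding context rather than a single word, which is why the Report's discussion of massive, high-quality training datasets matters.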

Generative AI Systems Can Infringe Copyrighted Works

The Report concludes that “[c]reating and deploying a generative AI system using copyright-protected material involves multiple acts that, absent a license or other defense, may infringe one or more rights.”[16]  A prima facie case of copyright infringement requires that a plaintiff show “(1) ownership of a valid copyright, and (2) copying of constituent elements of the work that are original.”[17]  The Report examines four stages of AI development to reach this conclusion:

  • Data Collection and Curation—Developers of AI systems often make “multiple copies” of works and “[m]ost commentators agreed with or did not dispute that copying during the acquisition and curation process implicates the reproduction right.”[18]
  • Training—Training generative AI systems necessarily “implicates the right of reproduction,” as (i) developers need to copy and download the dataset for training; (ii) developers “temporarily” reproduce works to show them to the model “in batches”; and (iii) the training process may result in “model weights that contain copies of works in the training data,” sometimes providing outputs that are so similar they infringe the works.[19]
  • RAG (Retrieval-Augmented Generation)—RAG is a feature of many generative AI systems, and similarly involves making reproductions.[20] RAG improves Large Language Models (LLMs) by “grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information,” ensuring the information the model provides is up-to-date and “users have access to the model’s sources.”[21]
  • Outputs—Some generative AI models produce “near exact replicas” of the copyrighted works used as inputs.[22]
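The RAG mechanism described above can be illustrated with a minimal sketch, assuming a toy keyword-overlap retriever in place of the vector similarity search a real system would use; all names below are hypothetical:

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a stand-in for the
    vector search a production RAG system would perform)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Ground the model on external sources: the retrieved text is copied
    verbatim into the prompt, which is why RAG, like training, involves
    making reproductions of the source material."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

sources = [
    "The Copyright Act grants authors exclusive rights of reproduction.",
    "Transformers process tokens in parallel using attention.",
]
prompt = build_prompt("What rights does the Copyright Act grant?", sources)
```

The sketch makes the Report's point concrete: even before the model generates anything, the retrieved passage has already been reproduced into the prompt.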

The Fair Use Defense Only Applies In Limited Circumstances

“Fair use” is a potential defense to a claim of copyright infringement, contained in section 107 of the Copyright Act itself.  The Report outlines the debate among commentators pitting artists’ and creators’ ability to thrive, on the one hand, against the development of a “critical technology,” on the other hand.[23]  The Copyright Office examines each of the four fair use statutory factors in the context of generative AI in turn:

  • Purpose and Character of the Use—Factor One examines “whether an allegedly infringing use has a further purpose or different character,” stressing “transformativeness and commerciality,” but also “whether the defendant had lawful access to the work.”[24] This “must also be evaluated in the context of the overall use” of the material.[25] Ultimately, the Report concludes that “training a generative AI foundation model on a large and diverse dataset will often be transformative,” but how transformative “will depend on the functionality of the model and how it is deployed.”[26] Generative AI companies can influence this factor by, for example, adding restrictions preventing the model from reproducing copyrighted works.[27] The Report rejects the notion that generative AI training is inherently transformative, finding such uses to be “expressive” and rejecting the analogy to human learning.[28] The Report also rejects the idea that even non-profit AI companies (like OpenAI) are “non-commercial.”[29] The Report also notes that “the knowing use of a dataset that consists of pirated or illegally accessed works should weigh against fair use without being determinative.”[30]
  • The Nature of the Copyrighted Work—Factor Two recognizes that “some works are closer to the core of intended copyright protection than others,” and the use of these works—such as “novels, movies, art, or music”—is “less likely to be fair use than factual or functional works.”[31] The Report finds that this factor will depend on the nature of the works.[32]
  • The Amount and Substantiality of the Portion Used—Factor Three considers “how much of each work is used; the reasonableness of the amount in light of the purpose of the use; and the amount made accessible to the public.”[33] The Report concludes that “the use of entire copyrighted works is less clearly justified in the context of AI training or a thumbnail image search” than search engines like Google Books, but it may nonetheless be “practically necessary” and “reasonable” if there is a transformative purpose.[34] This factor weighs more heavily in favor of fair use where the generative AI company takes steps to prevent copyrighted material from being made available to the public.[35]
  • Effect of the Use On the Market for the Copyrighted Work—Factor Four considers harm to the market for both the original and derivative works.[36] This factor considers “actual or potential market substitution” (i.e., lost sales), which the Report recognizes as very possible in the context of generative AI.[37] It also considers market dilution. The Report resolves a dispute among commentators over whether harm to the market extends to an “overall body of work” or is limited to “markets for the specific copyrighted work,” positing that it should broadly include any effect on the market, a particular risk where AI is trained to copy certain styles.[38] Factor Four also considers lost revenue in “actual or potential licensing markets.”[39] A number of AI companies have begun purchasing licenses to use copyrighted material, a complex issue dependent on how markets evolve.[40] Finally, Factor Four considers “the public benefits that the defendant’s use is likely to produce, considering how these benefits relate to the goals of copyright and their relative importance.”[41] The Report recognizes “strong claims to public benefits on both sides” but concludes that this consideration does not change the fair use analysis in the context of generative AI.[42]

Fair use analysis in the context of generative AI “will depend on the facts and circumstances of the particular case,” and the Report concludes that many uses will not qualify as fair use.[43] Although it claims not to prejudge litigation outcomes, the Report strongly suggests that AI companies will face an uphill battle if they seek to rely on the fair use defense. The Report notes that “the first and fourth factors can be expected to assume considerable weight in the analysis,” adding that “[d]ifferent uses of copyrighted works in AI training will be more transformative than others” and that “the impact on the market for copyrighted works could be of unprecedented scale”—presenting a serious obstacle for AI companies seeking to prevail on Factor Four.[44] The Report provides a scale from “uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs”—fair use—to “copying of expressive work from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is readily available”—not fair use.[45]

The Future of Licensing to Train AI

The Report contains an expansive discussion of potential licensing regimes.  Rejecting the notion that requiring AI companies to license copyrighted works for training will prevent competition in the field,[46] the Report provides guidance to AI companies considering licensing, outlining the pros and cons of both “voluntary licensing”—which it states is feasible, if potentially costly—and a statutory licensing scheme.[47]

Ultimately, the Report recommends further developing “voluntary collective licensing for the AI context,” though the Department of Justice may need to weigh in on antitrust concerns.[48]  The Report criticizes the idea of a compulsory licensing regime, which “can set practices in stone,” stifle “the development of flexible and creative market-based solutions,” and “take years to develop.”[49]  The Report finds that an opt-out scheme, whereby copyright owners would be required to take the initiative if they want their works excluded from AI training sets, “is inconsistent with the basic principle that consent is required for uses within the scope of their statutory rights.” It cites concerns about “the effectiveness and availability of opt-outs,” but nevertheless concedes that opt-outs could provide one way for Congress to put in place limitations on the use of copyrighted works in AI.[50]

Conclusion

The Report concludes by revisiting its fair use analysis, writing that “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.”[51]  While the Report is not binding, the Copyright Office is well-respected, and the Report or its rationale is likely to be cited in the various cases where generative AI companies assert a fair use defense.[52]  The Report also supports licensing—though recommends against Congress creating a statutory scheme—which, if adopted widely to avoid copyright challenges, could raise the costs of developing generative AI systems.[53]  Because the Report does not conclusively weigh against fair use in all circumstances, and instead recognizes that the analysis will be fact-specific and dependent on a number of factors, its overall impact is likely to be mixed—both generative AI companies and copyright holders have something to point to in their favor.


[1] Complaint for Declaratory and Injunctive Relief at ¶ 19, Perlmutter v. Blanche, No. 25-cv-1659 (D.D.C. May 22, 2025) (ECF No. 1).

[2] Report at 4.

[3] Report at 4.

[4] Report at 6.

[5] Report at 7-8.

[6] Report at 10-11.

[7] Report at 11-12.

[8] Report at 12.

[9] Report at 13-14.

[10] Report at 15-16.

[11] Report at 17.

[12] Report at 17.

[13] Report at 19-21.

[14] Report at 21.

[15] Report at 22.

[16] Report at 26.

[17] Report at 26 (quoting Feist Publ’ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 361 (1991)).

[18] Report at 26-27.

[19] Report at 27-30.

[20] Report at 30-31.

[21] Kim Martineau, What is retrieval-augmented generation, IBM Research (Aug. 22, 2023), https://research.ibm.com/blog/retrieval-augmented-generation-RAG.

[22] Report at 31.

[23] Report at 32-34.

[24] Report at 35 (citations omitted).

[25] Report at 36.

[26] Report at 45-46.

[27] Report at 46-47.

[28] Report at 47-48.

[29] Report at 50-51.

[30] Report at 52.

[31] Report at 53 (citation omitted).

[32] Report at 54.

[33] Report at 54.

[34] Report at 57.

[35] Report at 59-60.

[36] Report at 61.

[37] Report at 62-63.

[38] Report at 64-66.

[39] Report at 66.

[40] Report at 70-71.

[41] Report at 71.

[42] Report at 73.

[43] Report at 74.

[44] Report at 74.

[45] Report at 74.  Meta and Anthropic PBC are both defendants in litigation asserting that they improperly relied on books downloaded from so-called “shadow libraries” without their copyright owners’ permission in training their AI systems.  See Third Amended Consolidated Complaint, Kadrey v. Meta Platforms, Inc., No. 3:23-cv-03417-VC (N.D. Cal. Jan. 21, 2025) (ECF No. 407); First Amended Class Action Complaint, Bartz v. Anthropic PBC, No. 3:24-cv-05417-WHA (N.D. Cal. Dec. 4, 2024) (ECF No. 70).  The court in Bartz recently found that the “fair use” defense would not apply to books obtained from “pirated” sources, but did apply to the use of books that were legitimately purchased and then scanned by Anthropic.  See Order on Fair Use, Bartz v. Anthropic PBC, No. 3:24-cv-05417-WHA (N.D. Cal. June 23, 2025) (ECF No. 231).  The court in Kadrey found that Meta could also rely on the fair use defense.  See Order, Kadrey v. Meta Platforms, Inc., No. 3:23-cv-03417-VC (N.D. Cal. June 25, 2025) (ECF No. 601). 

[46] Report at 74-75.

[47] Report at 86-100.

[48] Report at 104. 

[49] Report at 104-05.

[50] Report at 105.

[51] Report at 107.

[52] See, e.g., The Intercept Media, Inc. v. OpenAI, Inc., No. 1:24-cv-01515-JSR (S.D.N.Y.); Concord Music Grp., Inc. v. Anthropic PBC, No. 5:24-cv-03811 (N.D. Cal.); Advance Local Media LLC v. Cohere Inc., No. 1:25-cv-01305 (S.D.N.Y.).

[53] Report at 107.

Sigrid Jernudd

Sigrid Jernudd is an associate in the New York office of Hughes Hubbard & Reed, where she focuses on litigation and international arbitration. She has represented both domestic and international clients in a range of industries. She also has experience in antitrust matters.

  • Posted in:
    Corporate & Commercial, Intellectual Property
  • Blog:
    HHR Art Law
  • Organization:
    Hughes Hubbard & Reed LLP
