The News/Media Alliance has produced a White Paper, “How the Pervasive Copying of Expressive Works to Train and Fuel Generative Artificial Intelligence Systems Is Copyright Infringement and Not a Fair Use.”
The Alliance also filed a comprehensive submission on copyright and artificial intelligence with the U.S. Copyright Office, to aid the Office in its study and to inform all branches of government on these issues.
About the White Paper and Copyright Office Comments
On October 30, 2023, the News/Media Alliance published a White Paper, including an incorporated technical analysis, and comments submitted to the Copyright Office focusing on generative Artificial Intelligence (AI) developers’ unauthorized use of publisher content.
Together, the White Paper and the Technical Analysis make multiple findings, including:
- Developers have copied and used news, magazine and digital media content to train LLMs.
- Popular curated datasets underlying LLMs significantly overweight publisher content by a factor ranging from over 5 to almost 100 as compared to the generic collection of content that the well-known entity Common Crawl has scraped from the web.
- Other studies show that news and digital media ranks third among all categories of sources in Google’s C4 training set, which was used to develop Google’s generative AI-powered products like Bard. Half of the top ten sites represented in the dataset are news outlets.
- LLMs also copy and use publisher content in their outputs. LLMs can reproduce the content on which they were trained, demonstrating that the models retain and can memorize the expressive content of the training works.
The Alliance’s comments to the Copyright Office address further questions related to the use of publisher content in generative AI products and services. These include the potential for licensed solutions, whether individual or voluntary and collective; the existing legal standards for determining when textual outputs may be substantially similar to news and media articles; and methods of obtaining consent from copyright owners for the use of their materials in AI training.
Based on the conclusions drawn from these findings, recommendations from the Alliance include:
- The Copyright Office should clarify publicly that use of publishers’ expressive content for commercial generative AI training and development is likely to compete with and harm publisher businesses, which is disfavored as a fair use.
- Substantial transparency measures should be developed around the ingestion of copyrighted materials for use in generative AI technologies.
- Further development of relevant licensing models should be encouraged, including by acknowledging the potential feasibility of voluntary collective licensing for the ingestion of news and media materials for generative AI purposes.
- The Copyright Office should swiftly promulgate an updated registration option to enable online news publishers to register groups of news articles published online.
- Considering the large bargaining power disparity between media publishers and very large online platforms, measures to correct this negotiating disparity, such as the Journalism Competition and Preservation Act, should be supported.
- Measures to address the scraping of protected content from third-party pirate websites should be adopted.
Members of the News/Media Alliance staff have contributed to this post.