Computers, Privacy & the Constitution

GENERATIVE AI MODELS: POTENTIAL MASSIVE COPYRIGHT INFRINGEMENT

-- By PedroLondono - 28 Feb 2024


In 2023 prominent generative AI developers were hit with three significant lawsuits revolving around copyright infringement. In September, the Authors Guild sued OpenAI for massive infringement of protected works in the training of its large language model (LLM), GPT. In November, the Authors Guild filed another class action, this time against OpenAI and Microsoft, on the same grounds as the first suit. In December, The New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft. These lawsuits are likely to challenge the survival of generative AI models as we now know them. This essay explores the possible outcomes of these lawsuits, explains why the training processes of these LLMs have blatantly violated the Copyright Act, and concludes that the only way out for these big AI developers is for judges to stretch the fair use doctrine.

GENERATIVE ARTIFICIAL INTELLIGENCE


In June 2018 OpenAI first introduced its generative AI LLM, called “Generative Pre-trained Transformer” (GPT). Its premise is not complex: users write prompts requesting something specific (summarizing a book, revising a text, drafting an email, etc.), and the language model delivers what the user requested in seconds, as an “output”. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years (Moglen and Choudhari).

Perhaps you should have checked how Mishi's name is spelled.

In any case, the key underlying element of these generative LLMs is their training process. Much as human beings learn throughout their lives, LLMs are trained on data and information provided to them, so that they can analyze, process and use that information in what is called the “machine learning” process.

INTELLECTUAL PROPERTY ISSUE


Copyright is a legal tool that promotes the progress of the arts by securing limited exclusive rights over an author’s work. Section 106 of the US Copyright Act establishes the exclusive rights to which the copyright owner is entitled. One of the main exclusive rights is the right to reproduce (or authorize third parties to reproduce) the copyrighted work.

Based on the information disclosed by OpenAI, its LLMs are trained on datasets comprising publicly available texts and information from the internet, licensed content from third parties, and user-generated information. However, based on the assertions made by the Plaintiffs (Authors Guild and The New York Times) in their claims, it is very likely that these datasets also included copyrighted works used without a license from their owners.

Just as with a copying machine or scanner, the most feasible way in which OpenAI could have fed its LLMs the copyrighted works is by reproducing them in a tangible medium, to which its language model then had access for training. Taking these works and reproducing them (in whatever way they did it) without the copyright owners’ authorization clearly violates §106(1) of the Copyright Act. The actions filed by the Plaintiffs are all grounded on the alleged massive infringement of their reproduction right, and seek relief from OpenAI and Microsoft for this unauthorized use.

FAIR USE – THE DEFENDANT’S SAFE HARBOR


Unless discovery in these proceedings proves otherwise, it seems fairly clear that in training their LLMs the Defendants contravened §106(1) of the Copyright Act. Nonetheless, US courts have often applied the fair use doctrine to excuse copyright infringements and immunize defendants from liability under some specific circumstances (especially when it comes to big and powerful tech companies, as happened in the Authors Guild v. Google case regarding the Google Books project).

Incorporated through §107 of the Copyright Act, the fair use doctrine sets out specific circumstances under which the violation of an exclusive right would not be considered an infringement. This doctrine enables certain uses of copyrighted works for specific purposes. As established in the statute, the fair use defense is determined by taking into account four factors: (i) the purpose and character of the use, (ii) the nature of the copyrighted work, (iii) the amount and substantiality of what the infringer used from the copyrighted work and (iv) the effect of the unauthorized use on the potential market or target audience of the copyrighted work (Henderson et al.). These four factors are non-exhaustive, and fair use is an equitable doctrine, so a judge may accept the defense even if not all four factors are shown to weigh in the defendant’s favor (Balganesh et al.).

Through the years, case law has focused the first fair use factor on the “transformativeness” of the infringing work (Leval). In its recent decision in the Warhol v. Goldsmith case, the Supreme Court stated that the fair use assessment should focus on whether the allegedly infringing work has a sufficiently transformative use and purpose vis-à-vis the copyrighted work.

In this case, deep pockets and sophisticated lawyering can persuade courts that the defendants have a safe harbor under the fair use doctrine. Based on the Google Books precedent, it would not be surprising if judges determined that, under the first fair use factor and for policy reasons along the lines of “promoting technological developments” (just like in the legendary Sony case), the Defendants are not liable for copyright infringement. This outcome, though farfetched and a stretch of the fair use doctrine where the other factors weigh heavily against fair use, may be the only way in which generative AI systems will survive over time, even if their alleged genius is all based on false promises (Chomsky et al.). Otherwise, ruling for the Plaintiffs would force compulsory licensing schemes for the training of LLMs, which could potentially cost billions and deter investment in their development.

The route to improvement here is to edit stringently the profuse introductory material. We need only a couple of sentences to understand the claim that training models on copyrighted material constitutes infringement.

But the legal analysis that follows needs to be expanded, or rather reconsidered, because it is wrong. Making a copy in the course of training might be infringement under non-US copyright law, but it isn't correct to say that transient copying in the course of otherwise permitted activity is infringement. Nor is the fair use analysis any good. You missed the issues. Given that you had a starting point in the op-ed Mishi and I wrote, I'm sure a second try, based on the Authors' Guild complaint and a look at Wikipedia's Creative Commons BY-SA license would get you much closer.




r2 - 20 Apr 2024 - 15:15:50 - EbenMoglen
All material on this collaboration platform is the property of the contributing authors.
All material marked as authored by Eben Moglen is available under the license terms CC-BY-SA version 4.