Law in the Internet Society

AI Scraping: From Free “Knowledge” to Growing Ignorance

The issue of data scraping is familiar to courts, having evolved from manual collection to today's AI-driven automation. As firms train AI models on vast amounts of data, courts and regulators struggle to define the practice's legal boundaries.

In Paul Tremblay et al. v. OpenAI, Inc. et al., No. 3:23-cv-03223 (N.D. Cal. Feb. 12, 2024) (Dkt. 104), the court granted in part and denied in part OpenAI's motion to dismiss. It dismissed plaintiffs' vicarious copyright infringement claims because they failed to allege "substantial similarity" between the original works and the AI-generated outputs. This high bar for establishing similarity makes copyright infringement cases against AI models unlikely to succeed. That outcome might appear beneficial: copyright, as a form of private property, restricts access to knowledge, keeping it in the hands of a privileged few while leaving others ignorant. In this light, AI and scraping might rise as liberating, empowering tools, unlocking knowledge for anyone by bypassing traditional ownership. Yet is that always the case?

From Free Knowledge to AI’s Knowledge

Costly Surveillance – Privatization

Proponents may argue that AI democratizes learning, freely sharing the information it scrapes with those who cannot afford educational opportunities. Nevertheless, this so-called freedom comes with hidden costs: mass data collection driven by economies of "scale" (large quantities of data) and "scope" (diverse types of data). A notable surveillance example is Clearview AI, which scraped pictures from millions of websites to train facial recognition models sold to law enforcement. Are we headed toward a dystopian Orwellian future, a 2084 of our own?

As Zuboff warns, in a world of surveillance capitalism, companies harvest our data to steer our behavior, transforming AI into an engine of profitability rather than enlightenment. As a result, AI is increasingly being privatized. How could it still promote free learning for all when it is not open to everyone's input, or when powerful entities control its output? For instance, in early October 2024, OpenAI, previously governed by a charitable nonprofit, closed a $6.6 billion funding round with investments from Microsoft and Nvidia. Like Anthropic and Elon Musk's xAI, already registered as for-profit corporations, OpenAI will soon become a public-benefit corporation. Goodbye, OpenAI. Hello, ClosedAI.

Bias Perpetuation

Additionally, the AI tool designed to combat ignorance could instead deepen it: misinterpreting our data, misinforming us, and insidiously shaping our thoughts. AI is built by humans and institutions marked by entrenched discrimination, and it is fed data that may underrepresent marginalized groups, such as people of color and women. As a result, bias may embed itself in the model's outputs and perpetuate itself.

From Data Scraping to Data Creation

AI is not limited to scraping data: it may create new layers of information, crafting a distorted reflection of who we are and positioning itself as an enemy of human autonomy. Most comprehensive state privacy laws exclude publicly available personal information from their scope. Yet the contours of "publicly available" data are ever-shifting. The implications grow far worse when data scraping moves beyond collection to inferring new insights, such as people's voting tendencies. What once was public becomes something entirely new—data created from data, with unseen consequences.

The question may not be whether to leverage AI to expand knowledge access but how to utilize it wisely. That is, how do we foster access to accurate information without being spied on, knowledge that enriches us without molding us into “bot-like” beings?

A “Wise” Use of AI’s Knowledge – “Ethically” Collected Data?

Some may argue that a "wise" use of AI is to train models on "ethically" collected data––data whose sources have "consented" to its use. However, this proposal lacks substance. Take, for instance, the EU GDPR's requirement of a data subject's "consent" to the processing of their data (say, from social media accounts): in practice, a controller will find it nearly impossible to identify the specific individuals whose data will be scraped in order to obtain that so-called "consent." Thus, "ethically" collected data may be only a pretext for AI to keep influencing how we perceive and interact with the world, shaping us in its image. Consider how art museums, such as L.A.'s Dataland, will soon feature AI-generated art while claiming reliance on "ethically" collected data. These AI works raise crucial questions: Is Art still serving as a mode of human learning and expression? Or is it becoming an instrument for AI to shape our understanding of Art—and, by extension, of ourselves? No matter how "ethically" (ridiculously) generated, such works would transform Art into a tool that leads humans to conform to AI-driven frameworks, destroying the critical, open-ended learning and thinking that Art usually pursues.

A Collaborative Learning System For a “Right” Use

The responsibility lies with lawyers to teach the public the "right" use of AI, guiding us toward solutions that transcend the confines of surveillance, an impracticable "consent" theory, and restrictive intellectual property. As Zuboff insightfully observes, "privacy is not private." Creativity, too, is not exclusive; inspiration can spring from all corners of society. Lawyers must therefore collaborate with professionals across fields to establish a system that empowers humanity rather than steals from it. Such a collaborative system might leverage a non-privatized "Open" AI, trained on publicly produced information, as a free learning tool. Unlike projects of "SuperIntelligence," it should not take away our freedom to think but should preserve our dignity, free will, and "right to a future tense" (The Age of Surveillance Capitalism, Chapter 2 §6). For Zuboff, that right has long been lost: our behavior is predicted by Google, leaving us unable to shape our own future. Similarly, AI should not predict our future, especially if privatized; it cannot be allowed to decide how we learn and thrive as humans. Whose futures are we talking about? Certainly not those of the ones filling their pockets.

AI, watching us, empty and cold,

Deprived of a mind, controlled,

Offering our data for perhaps flawed insights

Slowly, we transition from human to bot-like.

Let AI be the product.

I think the best route to improvement begins with a more disciplined analysis of this draft's unspoken assumptions. Neither as a matter of copyright law nor of social policy am I prohibited from reading as much of what is available on the public net as I want, from memorizing as much of it as I can hold, or from using all the words I learned in new combinations to make new sentences, tunes, or pictures. Anything I can do by myself I am equally allowed to do with a computer. If it would not be copyright infringement done by me, it still isn't done by software.

So we need a clear definition of the problem you are seeking to solve. You begin with one lower court case giving an obvious conclusion from these simple premises, which leads you to a large speculative list of concerns about "AI," none of which seem to be tightly coupled to the underlying propositions. If we can be clearer about the subject we can make better progress, surely.

r2 - 11 Nov 2024 - 16:28:59 - EbenMoglen
All material on this collaboration platform is the property of the contributing authors.
All material marked as authored by Eben Moglen is available under the license terms CC-BY-SA version 4.