AI and the great data robbery
Silicon Valley has stolen huge amounts of original material in order to “train” its generative AI models
A mimicry machine that can produce knock-offs to order may not be what the world most needed in 2024. But thanks to generative AI, we can now request verse in the style of Bob Dylan or Keats, computer code, diversity guides or visual images. With the newest models able to produce video, too, we’ll soon be able to recreate a personalised “Elvis” performing in our living room. The former Google research director and FTC expert advisor, Meredith Whittaker, calls the output “derivative content paste”.
The bill for this novelty is only now becoming apparent. OpenAI originally restricted public access to its GPT (“generative pre-trained transformer”) models, citing the risks posed by misuse, including impersonation and fraud.
But in late 2022, as interest rates rose, investors grew impatient, and so OpenAI threw the doors open to everyone. ChatGPT was a sensational overnight hit. Parents now use a private safe-word to ensure that a conversation with their children is authentic: a voice can be cloned from just a few seconds of speech. Around the world, a lot of copyright holders woke up.
That’s because the magic of generative AI only happens once the model has digested large amounts of original material, a process called “training”. Wave after wave of lawsuits argue that the originals were ingested without permission, or that the knock-offs are too close to the originals to be permissible, or both.
“Companies need three things for generative AI, and pay billions for the first two: AI talent and computing power,” says Ed Newton-Rex. “But they now expect to get the third, training data, for free.” Newton-Rex is both a choral composer and a generative AI music pioneer, winning Time’s “invention of the year” for his generative music model at Stability AI.
That was trained on specially commissioned music, however — and the musicians were paid. When Newton-Rex realised AI companies like his employer now wanted to scrape data without permission or payment, he parted ways with the company.
Getty Images has sued Stability AI for generating images derived from scraping its photos — the watermarks are even visible, in distorted form, on the images the model produced. A group of bestselling authors including John Grisham, Jonathan Franzen and George R.R. Martin followed, accusing OpenAI of “systematic theft on a mass scale”. News publishers ranging from the New York Times to the Intercept have filed their own infringement suits.
This broadside of copyright infringement lawsuits has drawn comparisons with dotcom-era piracy: it’s “the new Napster”, says neuroscientist and AI critic Gary Marcus. But the anxiety from individual creators comes from a deeper place in the soul. It’s an existential howl.
Generative AI is not merely a novel distribution system, as Napster was. “Training” hands the AI model the blueprints. Once a licence has been granted, it is difficult to undo the damage: the model can then fulfil requests such as “choose which stories to publish in the Daily Telegraph, and write them in the house style”. “You can’t unbake the cake,” one former newspaper executive told me. Even for an industry seemingly addicted to self-harm, this is a step too far.
The first copyright law, Britain’s Statute of Anne in 1709, defined an author’s commercial rights, and Anglophone common law would proceed on a utilitarian, economic basis. As the concept of authorship grew, however, continental law introduced more metaphysical ideas.
“I once heard a German author explain that the very essence of a human being is in the work,” says author Robert Levine. “That’s the most German thing I’ve ever heard.” The French developed this further into a set of inalienable rights known as personal or moral rights. These allowed the creator a say in how their work is used.
Copyright infringement is not simply a case of straightforward commercial fraud. If art is a metaphysical embodiment of a person, or their soul, then the person’s dignity is violated by dodgy knock-offs. Disquiet at this has been most vividly expressed by the musician Nick Cave, after a fan sent him a ChatGPT attempt to write a song in his style. Cave called the output “a grotesque mockery of what it is to be human”. Genuine art, he explained, is “the redemptive artistic act” that evokes an identification, an emotional response. “AI can only mimic the transcendent journey of the artist that forever grapples with his or her own shortcomings. This is where human genius resides, deeply embedded within, yet reaching beyond, those limitations.”
“Dignity is what makes copyright unique,” agrees Neil Turkewitz, who has represented music rights-holders in international forums such as the WTO and in the TRIPS negotiations. “It’s fundamentally a product of the Enlightenment.” The two intellectual strands coalesced in the Berne Convention process 140 years ago, which harmonised the two broad philosophical traditions. (Moral rights only arrived in UK law in 1988.)
Whilst some jurisdictions, including the UK, grant a very carefully worded and narrow text and data mining (TDM) exception for private research, this exception doesn’t give the reproduction engines a free pass. “It’s hard to see how an LLM making unlicensed ingestion, then producing remarkable facsimiles, fails to injure the author either morally or economically,” says Turkewitz.
Silicon Valley has always preferred to steal first, then beg for forgiveness later, regarding permission as a tedious friction. Today, in its giddy accelerationist phase, it doesn’t even recognise the theft as a sin. Still, it cannot have anticipated the grassroots fury. “We’re seeing an unprecedented outpouring of support from artists. It’s completely uncoordinated, and nothing to do with institutional representation or private lawsuits. It’s a global expression of horror at the world Silicon Valley wants to create, based on inauthentic product,” says Turkewitz. Newton-Rex agrees: it’s a question of what society we want to see.
A decade ago many individual creators might have taken the side of technology companies — as they did in the 2012 SOPA protests. Now they’re expressing a violation on an intensely personal level. Campaigns such as “Create Don’t Scrape” capture some of the fury. Perhaps we should not be surprised that in an age of identity, the appropriation of one’s public self, merely to be turned into a bland paste, is taken very personally. As commercial markets diminish, identity is what many creators have left.
The Government is being urged to widen copyright exceptions and allow Big AI to scrape what it likes without permission or payment. This would be catastrophic for Britain’s creators, says Newton-Rex. “We should be pushing back hard,” he says. “A huge amount of capital would shift to AI as they build cheap competitors to creativity in general. It’s hard to see how the UK’s industries could survive that kind of change.”
Creators may be in a stronger position than they realise. To obviate the necessity of licensing, models are being trained on the artificial output of AI itself. But this hasn’t gone very well. Researchers have found that AI models “collapse” or go “MAD” — Model Autophagy Disorder — in the words of one team who explicitly evoke the analogy of BSE (mad cow disease) to describe the cannibalistic process. It turns out they need fresh live human material after all. Nothing else will do.