Copyright & AI

Gwern

The moral application of copyright and intellectual property to AI is the minimum protection necessary to promote the progress of science and the useful arts.

2023-11-26–2025-03-18; in progress; certainty: possible; importance: 6

The latest spate of debates over AI & copyright law has focused either on whether generative models are copyright violations, or on whether the speaker would like them to be; less has been said about whether, for the purposes of the copyright system as a whole, they should be.

Many artists believe copyright should secure them a living, or police a perceived moral right to control all possible uses of their work forever, or that the harder they work the more valuable their outputs must be, or at least that, should some entity somewhere be better off because of their work, the copyright owner should be empowered to dip their fingers into that entity’s wallet.

However, copyright is not any sort of transcendental human right instituted by God or the United Nations, mysteriously overlooked until a few centuries ago, but a legal gimmick invented recently for narrow pragmatic purposes: first for the purpose of state/church censorship of the public, then as an indirect state subsidy for research & creation, where the government infringes on the freedoms of every person to enforce rents paid to an IP owner.

Intellectual property is, in general, a severe infringement of human rights and liberty; consider what Thomas Jefferson famously wrote, in the context of an extensive criticism of the idea of intellectual property:

He who receives an idea from me, receives instruction himself, without lessening mine; as he who lights his taper at mine, receives light without darkening me.¹

The US federal Constitution, often vague or unclear, is admirably clear on the purpose of US copyright: it is explicitly and solely limited to the second, economic, purpose. To quote the Copyright Clause in its entirety:

[the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

So, from the perspective of US copyright, the only question about how a copyright regime should work is the clear (but not easily-answered) question: does it “promote the progress of Science and useful Arts”? Other copyright systems may introduce concepts like moral rights or religion or ‘social harmony’ as part of their purpose, but not the US one—it is about promoting progress. Period.

If someone complains about GPT-3 being trained on scraped web text without permission of copyright owners, the question is not whether this is ‘immoral’ (whosoever is defining that), but whether creating GPT-3 that way hinders the progress of science or useful arts, and whether their progress would be accelerated if GPT-3’s creators had to, say, pay $1 per training data token or get pre-emptive permission from everyone on the Internet. Likewise, Stable Diffusion—no matter how many artists lose a commission to a competing Stable Diffusion, that is not what copyright is for, and the question is, did a machine doing that commission advance science & useful arts more than paying an artist 1,000× more to do it?

These are not easy questions to answer, but they are rarely asked in these discussions, and asking them would clarify things—people are always apt to confuse what they would like with what is or should be legal, and make claims about the law which are not even wrong.

So, with that in mind: do generative models promote, or demote, the progress of science and useful arts?

First, their effects before release: the effect of generative models in the present on the past has been nil. Outside a few circles of ML enthusiasts, there was no expectation worldwide that generative models would so abruptly become so good. (This is why the release of Stable Diffusion & ChatGPT-3.5 in late 2022 caused such shockwaves.) As one of those enthusiasts, I thought people were foolish for not anticipating photorealistic image generation ~2021–2023, increasingly human-level text-generation/programming, and soon, video—but this shortsightedness has the silver lining that since no one expected it, they couldn’t’ve refused to progress for reasons like expecting to not get paid or being morally outraged. So there could have been no disincentivizing in the past: everyone who wrote something online, or released FLOSS software, or posted a drawing, had adequate incentive to do so, because they did so.

As for their effect at release, they obviously promote ‘science’, both in the narrow current form and the older broad meaning of ‘knowledge’: generative models are already highly scientifically useful, and we have learned fascinating things from generative models. They have revolutionized deep learning, and AI, and are showing up everywhere from psychology & philosophy of mind to particle physics to biology. They have been just as influential on the useful arts, like coding or writing.

For the most part, they have not demoted either area, not even in the sense of disemployment—whatever the future technological unemployment effects may be of more advanced & comprehensive AI systems, the current systems largely remain complements to human usage. Aside from some commission artists, the clearcut cases of technological unemployment thus far remain niches, and often ones of little social value. (For example, academic ghostwriting for cheating students: the disemployed ghostwriters are hardly sympathetic as the hard work in faking homework demotes progress by rendering credentials meaningless, and if we are concerned about any demoting of progress, it’s because the generative models increase cheating by making it so much cheaper & easier. Likewise, while generative models wreak havoc on pornographic artists, who now compete against floods of generated images, one hesitates to say this is either promoting or demoting: the nature of pornography is to be as ephemeral as a Kleenex, and, this daily need satisfied, the end-consumer benefits minimally from there being 100,000 pieces of pornography rather than 10,000, and it is irrelevant who provides them.)

What about future disincentives?

Artists complain about AIs being able to imitate ‘styles’ and moot the idea of somehow being able to copyright ‘styles’ and extract royalties from any work which looks vaguely like theirs forever. Practical issues aside, this is not a clear case for copyright: the point of copyright is not to ensure them sinecures, but to create progress, particularly by competition, and for that progress to then become universally accessible to maximize gains. An artist losing a sale to a rival who paints a similar but better painting is not a problem, but competition at work to spur the artist to paint progressively better; only if the rival copies a superior painting entirely (and can undercut the superior painting without being able to create it) is there demotion as the superior painter hangs up their brush. (This was reinforced by the recent Prince copyright case on derivative works & transformativeness, Andy Warhol Foundation v. Goldsmith (2023): the transformed painting could be, and in fact was, paid for by customers as an exact substitute & alternative to the original photograph, thereby disincentivizing the original photographer.) This is why the clause specifies “limited” times, and implies a public domain, and why patents require publication: the entire point is to ensure that creators must keep creating and innovating and suffering competition after a certain limited time (originally, very limited, to just a few years), and cannot rest on their laurels. So, if a style is so widespread & famous, or can be so easily named & imitated, that an AI can create it, then lack of protection is a feature and not a bug, as far as the copyright clause is concerned: the humans need to innovate a new style.

Newspapers come to mind. Newspapers are incentivized to report by selling subscriptions & advertising; their reporting serves many important useful functions, and definitely promotes the progress of science & arts. Generative models were irrelevant to them, and have not disincentivized any reporting in the past, and probably do not disincentivize articles right now: who would rather ask GPT-3 for its speculation, based on knowledge that cut off in 2019, about today’s current events, than go and read an actual newspaper article? LLMs are simply not a substitute for newspapers, and so do not demote.

However, this changes when retrieval is added: if an AI can download a copy of the current newspaper, and write an up-to-date summary of the current news, then even without any direct quotes of the usual copyright-violating sort, this can substitute completely for reading the newspaper and thus for a subscription or advertising. For the most part, the actual writing of a newspaper article is unimportant compared to the new facts inside it, and so a long summary can be a complete replacement. This sort of paraphrasing has long been an issue that online publishers have complained about, whether done by publishers themselves with a fig leaf of some added commentary or background, or done by fly-by-night content mills churning out rewrites, and AIs, by automating content mills, could make it much worse. At scale, this could choke off newspaper revenues, and demote progress, and is thus a serious concern for how US copyright should deal with it. So unlike artistic styles, there’s a concern here.

How about books? Books are in much less danger. Books tend to be too long to be meaningfully summarized, as they are filled with details and often an experience in their own right. (A summary of Hamlet hardly replaces the text of the play for anything but superficial uses like school reports.) And no one is going to try to ask an AI to print out, line by line, a book they want to read like Harry Potter and the Philosopher’s Stone, even if it could do so accurately without silently veering into confabulation—slow, tedious, & expensive, that must be just about the worst way to read a good novel!

AIs can endanger books as references, by looking up key passages and extracting key facts, thereby replacing the need to purchase or read the entire book. However, aside from the observation that the ‘lost sales’ here are similar to those lost due to libraries or lending books or other books quoting them and no one thinks all of that should be outlawed to boost book sales, there is a trilemma for the demotion claim:

if a book can be replaced by a summary, then it could not have represented much progress at all (perhaps it should’ve been a blog post or newspaper article, or shouldn’t’ve been written at all), and there is little loss;
if the AI cannot replace the book by a summary because it contains so many relevant facts and the AI fails to adequately cover them all, then the original book is not replaced, and there is little loss;
and if the AI is good enough at extracting the relevant facts that the summary can in fact replace the book, then, because those facts usually come from other works or the author’s thoughts, the AI ought to be good at looking them up wherever the book got them from in the first place, and can do so in a superior way (across larger corpuses, on the fly, customized, used by other AIs, etc.). The book being replaceable by AI, disincentivizing similar new books is then not a big deal, because AI will just do the same thing but better, and authors will shift to making use of the retrieved facts or generated thoughts to compete with AIs, thereby promoting progress.

So it’s not obvious from the perspective of promoting-progress that retrieval over large book corpuses is a bad thing after all.

Cure worse than disease? I wonder if Richard Stallman still holds to his anti-copyright maximalist position? Abolishing copyright would probably be good on net, but seems like it’d be terrible for software right now given that it would destroy the GPL, AGPL, etc. (and what about legal problems like implied warranties?).

Local code is open by default. In the world Stallman grew up in, with non-networked machines, either Lisp or assembler programmers, no DRM etc., abolishing copyright on software would be useful because software wasn’t where people made money so much as hardware and the software was just there as a commodity to make the hardware useful; you built a new hardware platform, and hired some guys to write a new OS & userland for it, no big deal. So users could just de-compile or copy anything that anyone shipped, and open source practices take care of much of what one might need.

Remote code defaults closed. In today’s world, all that would happen is the first generation of ‘public domain’ binaries & code would be the last. That is, it would be legal for you to copy any binary or source code you could get your hands on, which would be approximately zero of them. Because the reaction would be: (1) everything would immediately go behind SaaS & APIs (hosted on, of course, major clouds like AWS/Azure/GCP), no matter how well it’d run locally, because everyone would regard shipping a binary anywhere near a user as a business death sentence and (2) smartphones like Apple’s ultra-locked down & secured DRM platform would be how everything local was implemented—good luck beating Apple’s security to jailbreak anything! (Every year, they patch another hole, and they’ve been at it long enough and monomaniacally enough that you aren’t going to beat them; it’s no longer the 1990s where DRM was a joke—DRM has won.) And anything useful developed in the ‘open’ would be cloned behind paywalls instantly: now freed of any concern whatsoever about attribution or virality, all the tech giants would simply switch to copying anything FLOSS behind their SaaS APIs etc., without any of the stuff they contribute back today.

¹ Thomas Jefferson to Isaac McPherson, 13 August 1813; it is worth reading the passage in full, which makes clear that Jefferson is criticizing a patent lacking much novelty and has thought extensively about how adequate intellectual property mechanisms are for their only justification, the “advantage of society”.
