
AI Cannibalism Can Be Good

Training new LLMs on old LLM outputs is a good way to get a lot more data and recycle compute, and is neither paradoxically self-defeating nor necessarily harmful.

One might think that training LLMs on older LLMs’ outputs is self-contradictory and paradoxical: how could such outputs have any value? Surely future AI researchers will need to eliminate all old LLM text in their datasets, lest it waste ever more compute or even poison the new models with the flaws of the old ones.

Nevertheless, it is a common thing to do, does not seem especially harmful, and in principle can be a very good thing as it shades into reinforcement learning approaches like AlphaZero. (We should remember that cannibalism is such a common, attractive dietary strategy because the fellow members of your species, almost by definition, contain the optimal nutrients for yourself.)


‘Bone appetite!’


Regardless of whether older LLM outputs are the most efficient data to train on, a lot of people ask: ‘how can it be useful—even in principle—to train on that?’

While it’s true that “model collapse” results are inapplicable to any real-world scenario of training LLMs (we never throw out all real data to train solely on randomly-sampled model outputs), surely training on old LLM outputs is at best wasteful: would not such text be full of redundancy and paraphrases, where it is not outright erroneous and filled with confabulations?

We can taxonomize the benefits of (correctly!) training on older LLM outputs into, broadly, additional data and additional compute in roughly ‘temporal’ order:

  1. the old LLM outputs can be intrinsically valuable as LLM outputs per se—because they are other, older LLM outputs:

    an older LLM like GPT-3 is a major historical event in its own right, quite apart from how much text generated or influenced by it now exists. Learning its stylistic tics, major focuses, blindspots, reasoning errors etc. is as valid a thing to learn as, say, Shakespeare.

    This is justifiable as part of learning AI & human history, and also for self-defense: an LLM needs to recognize GPT-3 outputs, so it knows to not take them too seriously or unquestioningly believe everything written in them, and to help it “see through” the LLM part to what might be of value, like the human-generated corrections or hidden information.

  2. older LLMs can embody unavailable information:

    One might think that new LLMs are strictly superior to old ones, as they will have an ever-accumulating superset of all training data and much greater sample-efficiency in learning.

    • training data is not monotonically accumulating: much training data is proprietary (and will not or cannot be shared), and LLMs trained by different people will have access to different data, so their outputs are indirectly a kind of sequence-level knowledge distillation.

      Further, even LLMs trained by the same people may have different datasets, due to issues like copyright licenses expiring or changing priorities or bugs or privacy law: OpenAI, for example, states that it has deleted the original GPT-3’s training datasets, to avoid book copyright lawsuits—so GPT-5 may not know about some books that GPT-3 knew about. (But given that LLM outputs are likely ‘transformative’, training on GPT-3’s outputs may be legally allowed even if training GPT-3 on those books turns out to have been disallowed after all.)

      Much of this data cannot be recreated in the present, because it has been destroyed or altered by time (eg. in 2025, you cannot hire human raters for RLHF who will rate “as if it was 2020”; that is just not something they can do… but a 2020 LLM can).

    • runtime data is not monotonic either: users are also providing what we might call “dark data”, unavailable to the LLM’s creators.

      When you use LLMs in the real world, you will be providing novel text, proprietary data, and all sorts of interesting inputs to the frozen LLM, which will be reflected in the output, and may not be available in any other way. If I RAG over my private database of documents and generate a blog post using GPT-3, there is no particular reason that any other LLM will ever have access to that database (I might be hit by a bus the next day) except as embodied in that blog post.

  3. older LLM outputs can embody additional information:

    RLHF is the most famous example of this. LLM outputs can be selected, edited, commented on, voted on, reshared, endorsed, and built on by humans (or other LLMs). A random GPT-3 output may be entirely useless to a future LLM on average, but it could be useful if it was not an average sample but was selected by a human out of 1,024 samples (adding ≤10 bits of information; see the best-of-n sketch after this list), say, or if a human left a comment pointing out a mistake in that output, or uncritically cited it (implying it is of acceptable quality), or anything like that.

    Pointing out mistakes is especially valuable information.

  4. older LLM outputs can embody additional computation:

    The “data processing inequality” (for a Markov chain X → Y → Z, post-processing cannot increase information: I(X; Z) ≤ I(X; Y)), like most impossibility proofs, means much less than meets the eye, and is sometimes abused to focus solely on data and imply that all AIs must be pretty much the same—after all, the data processing inequality proves that they can’t get any more information out of the data than was in it to begin with! But in reality, the optimal inference algorithms are impossible to run, and the inequality is irrelevant: what we get out of a dataset also depends on how much compute we spend. (The exact tradeoff between more data and more compute is a major topic of scaling law research: one can regard a result like Chinchilla scaling as implying that compute is more scarce than data.)

    The final output may reflect a lot of computation omitted from the text itself, especially if the LLM is able to call out to a tool (eg. if you play chess with an LLM which can call out to Stockfish), and so LLM text outputs could require literally superhuman intelligence to accurately predict each token without going through the same compute process yourself. That is one way you can train an LLM to play pretty good bullet chess even with no ‘search’: generate a lot of superhuman game transcripts using Stockfish, and train on those (see the Stockfish sketch after this list); there is no ‘information’ in those transcripts beyond what is already embodied in the rules of chess and a short, simple tree-search algorithm (given enough compute), and so what one amortizes into the LLM is not information, but computation.

    Few AI systems achieve perfection in any task, and certainly not LLMs; they can benefit from using more compute in various ways, such as generating n samples and selecting the best, even if they do not currently have any way to unboundedly improve (the way some AI systems can: a chess program searching the game tree will select the optimal move, given astronomical resources). When we do best-of-n sampling with an LLM and then self-distill it on the outputs, we are training it to be as smart as an ensemble of n copies of itself, and to be robust to its own random errors; or when an o1-style reasoning model tediously brute-forces a bunch of possible answers and happens to pick the right one, future o-series models get trained on that transcript to learn how to better ‘think aloud’ and to start trying to solve harder problems.

    When we use an ensemble of judges who are all trained on more or less the same data, like if we use several open-source LLMs, this can be useful, because they share fewer blindspots and will compute in slightly different ways, and so there is a computational benefit. (An ideal model would not benefit as it would not have any blindspots… but when you have a bunch of blind men feeling an elephant, the more the better.) An ensemble of separate agents like the AlphaStar League can be seen as an exorbitantly expensive way to gain the blessings of scale in multi-agent RL, approximating a larger model used more effectively, like prompting AlphaZero to play in different ‘styles’.

    Similarly, synthetic data generation, like flipping or cropping images or generating ‘rare’ data samples, cannot add any new ‘information’; what it does is use expert knowledge to spend compute in order to teach a cheap model what an ideal model would already have learned from the data & generalized: “all these transformed versions of that image are the same in this way” (see the augmentation sketch after this list).

    And when we can appeal to an oracle like the game tree itself by simply spending more compute to select better moves, we can fix our own systematic errors too, and like AlphaZero, slowly bootstrap ourselves to arbitrarily-good performance without needing any ‘information’ from anywhere else, like from a human Go champion.

    Or perhaps we simply used a very large context window to process a lot of data to generate a short output (eg. ‘context compression’: we might use a large context window to pick out the most relevant text to excerpt/summarize, and then generate an answer from those excerpts; future LLMs can benefit from processing only the compact package of excerpts → answer, as sketched below).
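
To make the selection argument of points 3 and 4 concrete, here is a minimal Python sketch of best-of-n sampling followed by self-distillation on the winners. generate and score are hypothetical stand-ins for an LLM sampler and a reward model (or human judge), not real APIs.

    import math
    import random

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for sampling one completion from a frozen LLM."""
        return f"{prompt} [draft {random.random():.6f}]"

    def score(completion: str) -> float:
        """Hypothetical stand-in for a reward model or human preference rating."""
        return random.random()

    def best_of_n(prompt: str, n: int = 1024) -> str:
        """Pick the best of n samples; the pick conveys at most log2(n) bits."""
        samples = [generate(prompt) for _ in range(n)]
        return max(samples, key=score)

    prompt = "Summarize the causes of model collapse."
    winner = best_of_n(prompt)
    # Selecting 1 winner out of 1,024 adds at most log2(1024) = 10 bits, yet
    # self-distilling on (prompt, winner) pairs trains the model to match an
    # ensemble of 1,024 copies of itself and to shrug off its random errors.
    distillation_pair = (prompt, winner)
    print(f"selection adds at most {math.log2(1024):.0f} bits per example")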
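And a sketch of point 4’s chess example: generating superhuman transcripts whose tokens embody search compute rather than new information. This assumes the python-chess package is installed and a Stockfish binary is on the $PATH.

    import chess
    import chess.engine

    def stockfish_transcript(time_per_move: float = 0.05) -> str:
        """Play Stockfish against itself and return the game as move text."""
        board = chess.Board()
        moves = []
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            while not board.is_game_over():
                result = engine.play(board, chess.engine.Limit(time=time_per_move))
                moves.append(board.san(result.move))  # record in SAN before pushing
                board.push(result.move)
        finally:
            engine.quit()
        # Nothing in this text exceeds the rules of chess plus tree search;
        # an LLM trained on many such transcripts amortizes the computation.
        return " ".join(moves)

    print(stockfish_transcript())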
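Likewise for point 4’s synthetic-data remark, a small NumPy sketch: the transforms spend compute to encode the expert knowledge that flips and crops preserve an image’s label, which an ideal learner would have generalized from the raw data alone.

    import numpy as np

    def augment(image: np.ndarray) -> list[np.ndarray]:
        """Return label-preserving variants of one H x W training image."""
        return [
            image,                 # original
            np.fliplr(image),      # horizontal flip
            np.flipud(image),      # vertical flip
            image[2:-2, 2:-2],     # central crop
        ]

    rng = np.random.default_rng(0)
    image = rng.random((32, 32))
    print([variant.shape for variant in augment(image)])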
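Finally, a sketch of the ‘context compression’ idea; llm is a hypothetical completion function standing in for a long-context model, not a real API.

    def llm(prompt: str) -> str:
        """Hypothetical stand-in for a long-context LLM completion call."""
        return "[model output for: " + prompt[:40] + "...]"

    def compress_and_answer(documents: list[str], question: str) -> str:
        """Spend one expensive long-context pass; keep only the compact result."""
        corpus = "\n\n".join(documents)
        # Pass 1: the large context window picks out the relevant excerpts.
        excerpts = llm(f"Excerpt the passages relevant to '{question}':\n\n{corpus}")
        # Pass 2: answer from the compact excerpts alone.
        answer = llm(f"Using only these excerpts, answer '{question}':\n\n{excerpts}")
        # Future LLMs can train on just (excerpts, answer), inheriting the
        # long-context computation without re-processing the full corpus.
        return f"{excerpts}\n\nANSWER: {answer}"

    print(compress_and_answer(["doc one...", "doc two..."], "What changed in v2?"))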

And all of these separate points can overlap: a now-untrainable GPT-3 could be used on private data to generate a database entry which is checked and verified and found to be erroneous in a GPT-3-style way; this datapoint could simultaneously teach a future LLM what GPT-3 errors look like and what the non-trivial correct answer was, expose hidden data it would otherwise never see, and improve its ability to generate such database entries.


So ‘old’ outputs may be useful in a bunch of ways, and it is not a big deal if ‘fake data’ increasingly outweighs ‘real’ data: the fake data can be real in its own ways.

Real isn’t how you are made… It’s a thing that happens to you. When a child loves you for a long, long time—not just to play with, but really loves you—then you become Real.

The Velveteen Rabbit, Margery Williams (1921)
