---
title: Human-like Neural Nets by Catapulting
description: "Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence."
created: 2024-04-21
modified: 2025-05-24
status: finished
confidence: unlikely
importance: 10
css-extension: dropcaps-yinit
...

<!-- Timestamp: `$ echo "hash precommitment for 'Human-like Neural Nets By Catapulting': 2024-04-21, prompted by _Science_ article on childhood amnesia; Proposal to create artificial neural nets with human-like performance by high-LR training of highly parameterized NNs to trigger catapulting/grokking. the difference between human brains and NNs (particularly LLMs) may be due to human brains adopting a scaling strategy of extremely high-temperature training of extremely overparameterized models.
This approach would lead to sample-efficiently & compute-efficiently finding a highly-generalizing basin in the loss landscape, while performing poorly up until the end and failing to memorize much data." | sha256sum` -->
<!-- gwtag /llm-catapult neuroscience adversarial grokking savant -->

<div class="abstract">
There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?

I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias.
Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets.
This approach would lead to sample-efficiently and compute-efficiently traveling (or **catapulting**) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data.

If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly.

Such a 'catapulted LLM' would generalize much better than existing NNs, be immune to adversarial attacks, have better economics & be more resistant to cloning, could potentially enable extremely efficient [MLP](!W "Multi-layer perceptron") architectures, and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the *right* reasons.

This could be feasibly tested by training multi-trillion-parameter models for relatively few steps at high cyclical learning rate schedules, and benchmarking adversarial and hard examples on tasks like arithmetic and small-image classification.

---

This is a companion piece to ["Guardian Angels: LLM Personalization for Productivity and Security"](/guardian-angel)
</div>

Because DL has continued to scale up and smash through benchmarks and begun to look like it really will be the final AI paradigm, and thus in some sense the same thing as human 'intelligence', to a considerable degree, we can regard 'intelligence' as solved: intelligence is sufficient compute applied to search over programs (like Turing machines or circuits) to predict or optimize where the optimal solution is a relatively long program.

# Intelligence, Broadly

A scaling-centric view might be summed up like this:

---

[**The Master Synthesis**](/newsletter/2021/05#master-synthesis){.include}

# Anomalies

But this paradigm, as broadly correct as it now seems to be, doesn't explain *everything*.
We still have many specific problems that this paradigm is too general to explain.

While current NNs, and LLMs in particular, are by far the most human-like AI software ever created, in having human-like strengths & weaknesses, there are a number of anomalies in machine & biological intelligence that have no good answers.

We have many puzzles here, but they all feel connected, somehow.

## Artificial

### Sample Inefficiency

Why do **NNs require [Chinchilla](https://arxiv.org/abs/2203.15556#deepmind "‘Chinchilla: Training Compute-Optimal Large Language Models’, Hoffmann et al 2022")-style scaling of data and compute**, when humans appear to learn from multiple orders of magnitude less data, and it is increasingly plausible (given various estimates of human-brain equivalents) that they learn from less total compute? Why, as so many connectionist pioneers like Alan Turing expected, do we not [train AI like children](/doc/ai/nn/2017-proudfoot.pdf "‘Child machines’, Proudfoot 2017"), with a curriculum and clear developmental stages?

There are many answers offered, none satisfactory. (And what should we make of theoretical results like [Rosenfeld 2021's](https://arxiv.org/abs/2108.07686 "‘Scaling Laws for Deep Learning’, Rosenfeld 2021") ["Nyquist learners"](https://arxiv.org/pdf/2108.07686.pdf#page=85){#nyquist-learner}?)

- **Multi-modality**: while useful, multi-modality has failed to yield any major change of scaling law exponents; unimodal models work shockingly well, and language models turn out to already encode a large amount of visual knowledge and can easily be plugged into vision models (eg. [Flamingo](https://arxiv.org/abs/2204.14198#deepmind "‘Flamingo: a Visual Language Model for Few-Shot Learning’, Alayrac et al 2022"), [Tsimpoukelli et al 2021](https://arxiv.org/abs/2106.13884#deepmind "Multimodal Few-Shot Learning with Frozen Language Models")).
- **Human sensory input is actually large**: Another common explanation is to deny that humans learn from less data, and argue from raw sensory bandwidth: if vision+sound+touch is such-and-such bits per second and you accumulate over an adult's lifetime, it can look much more comparable to the trillions of tokens we train an LLM on.

    This is unconvincing because the raw sensory bitrate is meaningless: the input is *extremely* redundant & predictable for the most part.
    (Imagine sitting in a room staring at a computer screen.)
    Attempts at quantifying the information content of images, video, or sound, usually indicate that they boil down to the equivalent of [a few hundred](https://arxiv.org/abs/2010.11929#google "‘Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale’, Dosovitskiy et al 2020") or thousand tokens and those modalities are easily learned by small models (eg. [iGPT](https://openai.com/index/image-gpt/ "‘Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples’, Chen et al 2020")/[DALL·E 1](https://openai.com/index/dall-e/ "‘DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language’, Ramesh et al 2021")).
    The asymmetry is particularly striking in text-to-image generative models, where the text encoder (usually an afterthought) is often far bigger than the image generator itself.

    And on the human side, disabled people are not much less intelligent than normal humans: deaf/blind people are much worse at language tasks, but their fluid intelligence often remains normal.
If the sensory bandwidth were *so* critical, this would be impossible.
- **[Active Learning](https://burrsettles.com/pub/settles.activelearning.pdf)**: human children, unlike models confined to offline imitation learning, can *choose* what to learn about by exploring their environment or asking questions.
In theory, [active learning](!W) & optimal exploration can be far more sample-efficient than indiscriminate training (exponential rather than power law, at a minimum), and this could account for the entire gap.

    However, if we look at the things children actually choose, the data in question doesn't appear all that amazing.
Further, in stark violation of any notion of optimal Bayesian exploration, children often choose to learn on the *same* data point, eg. watching the same YouTube video hundreds of times.[^Mario] Or if we watch them 'explore' a game or computer, they appear to act largely at random, and an adult would learn far faster by more carefully thought-out exploration.^[Similarly, children do not seem to learn languages all *that* much more efficiently than adults, or to have some special sample-efficiency; an adult who is serious (ie. spends hours a day studying using total immersion & spaced repetition, not pretending to learn using Duolingo) will not take 18 years to reach native proficiency.
Like learning new systems, the reason children 'beat' adults seems to come down mostly to a willingness to **do the work**—to be embarrassed, to fail, and to just spend as many hours as it takes obsessing over the new thing rather than resting on one's laurels.]

    [^Mario]: A personal example: my mother tells me that when we got our first _Mario Brothers_ video game, I spent *weeks* playing it by running Mario to the first pit and deliberately jumping into it just to hear the sound effects of Mario dying (which I apparently found hysterically funny), and that it drove her crazy.

        What could my brain have possibly learned from the 5,000^th^ repetition of the sound effect? And if I wasted weeks on learning nothing, then how did I manage to be *even more* sample-efficient in the rest of my life?
- **Embodiment**: a closely-related topic is the idea of ["embodied cognition"](!W), which used to be quite popular as an explanation for the weaknesses of AI—AI models simply lacked commonsense & generalization for lack of a body and an appropriate environment.

    But thus far, 'embodiment' like training on robotics data (eg. [Gato](https://arxiv.org/abs/2205.06175#deepmind "‘Gato: A Generalist Agent’, Reed et al 2022")) has exhibited *zero* transfer to other tasks, never mind massive [scaling law](https://arxiv.org/abs/2306.13575 "‘Scaling MLPs: A Tale of Inductive Bias’, Bachmann et al 2023") gains, and ironically, it is, in fact, embodied tasks like robotics models which have been greatly benefiting from non-embodied pretrained models ([including LLMs!](https://arxiv.org/abs/2204.01691#google "‘Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances’, Ahn et al 2022")).
- **Architecture Magic**: Perhaps in some way, _Homo sapiens_-style biological neurons are just some near-perfect architecture, and this explains most of the gap; someday we will understand how all artificial neurons are severely hobbled by mistakes that will seem as tragically obvious in hindsight as earlier mistakes like not using [backpropagation](!W) or using sigmoid activation functions now seem to us, but they remain a mystery for now.

    This view was highly plausible until recently, but has been running into many problems.

    For starters, we simply have not found any architecture magic.
    The most obvious place to find magic would be the [learning rule](!W) for biological NNs, whatever they use in place of backpropagation...
    But while people have proposed many biologically-plausible learning rules since Hebb proposed [the first learning rule](!W "Hebbian theory") in 1949, which respect the requirements like locality, in every case, those learning rules perform worse than, or at best similar to, backprop!
    To [quote Geoff Hinton](https://www.technologyreview.com/2023/05/02/1072528/geoffrey-hinton-google-why-scared-ai/ "‘Geoffrey Hinton tells us why he’s now scared of the tech he helped build: ‘I have suddenly switched my views on whether these things are going to be more intelligent than us.’’, Heaven 2023"):

    > So maybe it’s [\[GPT-4\]]{.editorial} actually got a much better learning algorithm than us.

    And if biological NNs are not so good but there is something special about humans which does make them much better, then why do _Homo sapiens_ not appear to have any major neuroscientific breakthroughs compared to our primate relatives?
    Why are we so genetically similar, why have we failed in the search for major novel mutations that create humans, and why do human brains seem increasingly like nothing but ["a scaled-up primate brain"](/doc/psychology/neuroscience/2012-herculanohouzel.pdf "‘The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost’, Herculano-Houzel 2012")?
    If human (or primate) brains are so uniquely efficiently tuned by evolution, why are [*bird brains*](/doc/psychology/animal/bird/neuroscience/index) so much more efficient in size & [thermodynamics](/doc/psychology/animal/bird/neuroscience/2022-voneugen.pdf "‘Avian neurons consume 3× less glucose than mammalian neurons’, Eugen et al 2022") than primate brains, with clear [genetic changes](https://www.jbc.org/article/S0021-9258), and better scaling to the point where small bird brains like ravens or parrots or [vultures](/doc/psychology/animal/bird/neuroscience/2021-vanoverveld.pdf "‘Vultures as an overlooked model in cognitive ecology’, Overveld et al 2021") exhibit eerie levels of intelligence & behavioral complexity almost on par with dolphins, chimpanzees—or humans?
    Why does human intelligence exhibit so many bizarre drawbacks or anomalies if it is so optimized, like [childhood amnesia](#childhood-amnesia) or [lurching through developmental phases](#stages) over decades?

    If the story of NNs is one of us gradually recapitulating evolution's perfect neural networks, why does neuroscience provide so little useful inspiration (as is emphasized constantly by neuroscientists & AI researchers, even the "neuroscientific" inspiration for things like self-attention is a *very* loose inspiration)?
    Why don't all major improvements come from success in reverse-engineering the human brain to ever greater biologically-realistic detail?
    Why is the rapid progress in neuroscience, like scanning entire connectomes, completely irrelevant to cutting-edge NN models?
    Why do all the scaling laws for CNNs & Transformers look so similar in the exponent, and durable improvement so difficult, to the point where 7 years after [Vaswani et al 2017](https://arxiv.org/abs/1706.03762#google "Attention Is All You Need"), it is still a relevant baseline?
    And why, if we have such fundamentally inferior architectures, are the scaling laws so smooth & reliable, instead of breaking frequently or predicted to asymptote at levels far below human?^[When we dare to project out scaling laws all the way to human or superhuman performance, they generally do not require absurd amounts like quadrillions of parameters, which would appear to be implied by most of the biological-supremacy views.]

    - Conversely, why do *NNs provide little insight into biological brains*? In the other direction, neuroscience & individual psychology have hardly benefited from DL; DL has provided *tools*, and has provided good *predictive models* of brains, but that is about it.
One could open an issue of a psychometrics or neuroscience journal, and note that, over a decade into the deep learning revolution, if it had never happened, that issue would look about the same and would be completely intelligible to a researcher from 2010.

        It is impressive that we can now turn fMRI scans into crude visualizations of what a person is looking at, or we can use LLMs to generate possible questions for surveys, but DL has provided essentially no major insights into such fundamental questions as "what is fluid intelligence? why is the _g_ factor so general? how do neuroanatomic traits like neuron count or network properties cause intelligence or other cognitive abilities?"

        Isn't this astounding?
        We can now create models of enormous generality like Gato or GPT-4o from scratch to match or exceed humans without the hand-engineering of GOFAI, which seem so eerily human-like in many ways, and which recapitulate so many aspects of human cognition down to heuristics & biases, and our ability to create these artificial intelligences tells us... *nothing important* about the human brain? Really?

        So what, all the DL is just a dead-end and *coincidentally* capable of all that, and sometime in the future we'll discover the *real* route to brain-like intelligence...?
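
The active-learning claim in the list above (exponential rather than power-law sample efficiency) can be made concrete with the textbook case of learning a 1-D threshold. The following is a toy sketch under stated assumptions, not a model of how children learn: the task, the function names, and the threshold value `0.3` are all hypothetical choices made for illustration. A passive learner is handed uniformly random points to label; an active learner picks each query itself by bisection.

```python
import random

def passive_uncertainty(threshold, n_labels, rng):
    """Width of the interval still consistent with n_labels randomly drawn labeled points."""
    lo, hi = 0.0, 1.0
    for _ in range(n_labels):
        x = rng.random()        # the learner has no say in which point it sees
        if x < threshold:
            lo = max(lo, x)     # labeled 0: the threshold lies above x
        else:
            hi = min(hi, x)     # labeled 1: the threshold lies at or below x
    return hi - lo

def active_uncertainty(threshold, n_labels):
    """Width of the consistent interval when the learner chooses each query by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(n_labels):
        mid = (lo + hi) / 2     # query the point that halves the remaining uncertainty
        if mid < threshold:
            lo = mid
        else:
            hi = mid
    return hi - lo

def mean_passive_uncertainty(threshold, n_labels, trials=200, seed=0):
    """Average the passive learner's remaining uncertainty over several random runs."""
    rng = random.Random(seed)
    total = sum(passive_uncertainty(threshold, n_labels, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, mean_passive_uncertainty(0.3, n), active_uncertainty(0.3, n))
```

In this toy setting the passive learner's uncertainty shrinks roughly in proportion to 1/*n*, while the active learner's halves with every query; that is the kind of gap the active-learning explanation appeals to, and the kind of gap that children's actual choices do not obviously exploit.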

### Sample Efficiency

Why do NNs require so much data to pretrain, when **NNs are as sample-efficient as humans in narrow comparisons**?

Despite the huge amount of indiscriminate data used in NN pretraining, we are puzzled because if we examine how well models do in apples-to-apples comparisons, NNs *appear* to be roughly as good as humans or biological neural networks at learning from small data.

The simplest answer to the exorbitant data-scaling would be that NNs do a bad job of learning each datapoint—but that doesn't seem to be true when we compare things like in-context learning (even [GPT-3](#brown-et-al-2020)), transfer learning [smoothly scaling](https://arxiv.org/abs/2102.01293#openai "‘Scaling Laws for Transfer’, Hernandez et al 2021"), child-sized datasets (eg. [SAYCam](/doc/ai/nn/cnn/2024-vong.pdf "‘Grounded language acquisition through the eyes and ears of a single child’, Vong et al 2024"), [BabyLM](https://arxiv.org/abs/2311.02265 "‘Not all layers are equally as important: Every Layer Counts BERT’, Charpentier & Samuel 2023"), [TinyStories](https://arxiv.org/abs/2305.07759#microsoft "‘TinyStories: How Small Can Language Models Be and Still Speak Coherent English?’, Eldan & Li 2023")), compare total [AlphaZero](/doc/reinforcement-learning/model/alphago/2018-silver.pdf#deepmind "‘A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play’, Silver et al 2018") Go/chess games to total games played by human pros over history, or [disable human priors](/doc/reinforcement-learning/exploration/2023-brandle.pdf "‘Empowerment contributes to exploration behavior in a creative video game’, Brändle et al 2023") [to compare learning speed](https://arxiv.org/abs/1802.10217 "‘Investigating Human Priors for Playing Video Game’, Dubey et al 2018") (cf. 
[human vs Transformer meta-learning](https://www.nature.com/articles/s41562-025-02359-3 "‘Shared sensitivity to data distribution during learning in humans and Transformer networks’, Lerousseau & Summerfield 2025")), and remarkable experiments like [learning an unknown language from a book](https://arxiv.org/abs/2309.16575 "‘MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book’, Tanzer et al 2023") or [raising chickens in virtual reality](https://arxiv.org/abs/2312.02843 "‘Are Vision Transformers More Data Hungry Than Newborn Visual Systems?’, Pandey et al 2023") & comparing their visual capabilities with a Transformer (or, vice-versa, how a human can have profoundly mistaken beliefs about the world [when raised passively in a bubble](/doc/psychology/vision/1985-murphy.pdf "‘Looking Out from the Isolator: David’s Perception of the World’, Murphy & Vogel 1985")).

How can we reconcile this apparent contradiction?

### Smallness

Why are NNs **so intelligent, and yet so small**?

A NN like [GPT-3](https://arxiv.org/abs/2005.14165#openai "‘GPT-3: Language Models are Few-Shot Learners’, Brown et al 2020"){#brown-et-al-2020} is ~0.1t parameters (which are absurdly simplified caricatures of biological neurons), while the human brain is extremely loosely estimated at 100t 'parameters', and many neuroscience results are taken as implying that each of those 'parameters' translates to thousands of equivalent parameters, at a minimum; an LLM is further disadvantaged by lack of recurrency & online learning, and is generally unable to engage in adaptive computation or 'ponder', the way that a brain can continually learn and ruminate.
And yet, despite all these severe disadvantages, LLMs seem *too good*.
It does not *feel* like a GPT-3 knows or does less than 1⧸100,000^th^ of what the human brain does; but if we argue that a more plausible number like 1⧸1,000^th^ is true, that appears to commit us to the position that human brains are highly wasteful of parameter-equivalents (despite the touted superiority of biological brain architecture & evolutionary pressure for efficiency).
And if a GPT-3 really does know that little, then what are those >99,999 things missing from a GPT-3—which *each* require an entire GPT-3 to handle?
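
The ratios in the preceding paragraph follow from simple arithmetic on the two rough figures given there (~0.1t parameters for GPT-3, ~100t 'parameters' for the brain). A minimal sketch of that arithmetic is below; the per-synapse multipliers swept over (1, 100, 1,000) are illustrative assumptions rather than figures from any source, and the function name is hypothetical.

```python
GPT3_PARAMS = 0.1e12      # "~0.1t parameters", as stated in the text
BRAIN_SYNAPSES = 100e12   # "extremely loosely estimated at 100t 'parameters'"

def gpt3_fraction_of_brain(params_per_synapse):
    """GPT-3's size as a fraction of the brain, if each synapse is worth
    `params_per_synapse` artificial parameters (an assumed multiplier)."""
    return GPT3_PARAMS / (BRAIN_SYNAPSES * params_per_synapse)

if __name__ == "__main__":
    for multiplier in (1, 100, 1000):
        inverse = round(1 / gpt3_fraction_of_brain(multiplier))
        print(f"{multiplier:>5} params per synapse: GPT-3 is 1/{inverse:,} of a brain")
```

Under these assumptions, a multiplier of 1 reproduces the 1⧸1,000 figure, a multiplier of 100 gives 1⧸100,000, and taking "thousands" literally pushes the ratio to 1⧸1,000,000 or beyond, which only sharpens the puzzle.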

This observation becomes even more extreme as a NN like GPT-3 is well-known to be highly overparameterized and can be shrunk down by orders of magnitude, and it is also known that small NNs can be trained for a long time to achieve performance of large models (albeit compute-inefficiently).
We appear to be between Scylla & Charybdis.

Alternate version: why are *human brains so overparameterized*?
Many humans get by with much smaller brains than others; there is a real and causal correlation between intelligence and brain volume, but even allowing for how crude a proxy brain volume is for the underlying neural nets' capacity & speed, the effect size seems surprisingly small.
Biological brains can survive shocking amounts of physical damage or brain loss, as long as it occurs early in development (eg. [hemispherectomy](!W "Hemispherectomy#Outcomes"); see also [hydrocephalus](/hydrocephalus "‘Hydrocephalus and Intelligence: The Hollow Men’, Gwern 2015")), and it's surprisingly [hard to find brain lesions which damage IQ](/doc/iq/2021-protzko.pdf "‘Testing the structure of human cognitive ability using evidence obtained from the impact of brain lesions over abilities’, Protzko & Colom 2021") instead of specific cognitive abilities.
Cross-species, there is clear allometric scaling and the rankings make sense, but the slope is shallow: chimpanzees with less than a third of the brain often seem competitive with some humans in terms of intelligence and ingenuity.
And remarkable instances like [_Portia_ spiders](/doc/biology/portia/index) or the ability of the dragonfly to home in on prey [using barely 16 neurons](https://www.pnas.org/doi/full/10.1073/pnas.1210489109 "‘8 pairs of descending visual neurons in the dragonfly give wing motor centers accurate population vector of prey direction’, Gonzalez-Bellido et al 2013") further raise the question of why *so* many neurons are necessary (as does, indeed, power law scaling in general).

### Superhuman Prediction

In particular, why do even tiny LLMs like GPT-2 appear to **already be superhuman at next-token prediction** in terms of perplexity/bits-per-character, and yet still greatly underperform humans on benchmarks & real-world tasks, and benefit in every way from scaling up next-token prediction pretraining?

Next-token prediction pretraining clearly works (and continues to work), and yet, also something is clearly missing from [the simple story we tell](/scaling-hypothesis#why-does-pretraining-work) about it.
Humans appear to predict the more important tokens better; why and how?
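
For reference, the two metrics named above are both simple averages of per-token log-probabilities. The sketch below shows how they are conventionally computed; the function names and the example probabilities are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def bits_per_character(token_probs, n_chars):
    """Total code length in bits, divided by the number of characters covered."""
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / n_chars

if __name__ == "__main__":
    # Hypothetical: 4 tokens spanning 8 characters, each predicted with probability 0.25.
    probs = [0.25, 0.25, 0.25, 0.25]
    print(perplexity(probs), bits_per_character(probs, n_chars=8))
```

Because both metrics average over *all* tokens equally, a predictor can win on the average while still losing on the small number of tokens that matter most, which is one way to read the puzzle above.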

### Persistent Adversarial Examples

Why are **NN adversarial examples still unsolved** over a decade later, while the best efforts at human adversarial examples are debatable?

Scarcely any problems from 2013--2014 DL remain, and yet, in 2024, adversarial examples are not merely still around, but as powerful as ever on standard NN models.
Countless solutions have been proposed—and failed.
The best defenses are weak & compromise performance, like adversarial training [damaging generalization](https://arxiv.org/abs/1805.12152 "‘Robustness May Be at Odds with Accuracy’, Tsipras et al 2018") (but oddly, seeming more [human-like](https://arxiv.org/abs/1906.00945 "‘Adversarial Robustness as a Prior for Learned Representations’, Engstrom et al 2019")), or unable to scale, like robustness certification.
SOTA NNs rarely even attempt to reduce adversarial examples, LLMs are highly susceptible (and have [rough decision boundaries](https://arxiv.org/abs/2406.11233 "‘Probing the Decision Boundaries of In-context Learning in Large Language Models’, Zhao et al 2024")), and DL practitioners usually accept them as a fact of life & work around them.
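
To make concrete what an adversarial example is, here is a minimal sketch of the fast gradient sign method (FGSM, from Goodfellow et al 2014, which the text above does not itself discuss) applied to a toy logistic-regression classifier. Everything in the setup is a hypothetical illustration: the 1,000-dimensional input, the ±1 weights, and the perturbation size `eps=0.01` are assumptions chosen so the effect is easy to verify by hand.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(w, b, x):
    """Linear classifier logit; the predicted class is 1 when this is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input coordinate by eps in the direction
    that increases the log-loss for the true label y (0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-score(w, b, x)))
    grad = [(p - y) * wi for wi in w]   # gradient of the log-loss with respect to x
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy setup (all values hypothetical).
d = 1000
w = [1.0 if i % 2 == 0 else -1.0 for i in range(d)]
b = 0.0
x = [0.005 * wi for wi in w]            # confidently class 1: logit is about +5
x_adv = fgsm(w, b, x, y=1, eps=0.01)    # each coordinate moves by at most 0.01

if __name__ == "__main__":
    print(score(w, b, x), score(w, b, x_adv))
```

In this toy case a per-coordinate change of 0.01 flips a confident prediction, because the tiny shifts add up across all 1,000 dimensions; this accumulation in high dimensions is one commonly cited account of why such attacks are so cheap against standard models.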

Meanwhile, attempts to transfer adversarial examples to humans, or to create new human-specific ones... work to some degree, but not much, and pushing them to the point of meaningful human error requires such large distortions that they no longer seem meaningful.

## Biological

### Human Amnesia

Why are **humans so forgetful** compared to NNs?

If forgetting is critical to flexible decision-making in biological brains (as in rodent studies), as [has been argued](https://www.sciencedirect.com/science/article/pii/S0896627317303653 "‘The Persistence and Transience of Memory’, Richards & Frankland 2017"), in part by analogy to successful machine learning regularization techniques like [weight decay](!W) or [dropout](!W "Dropout (neural networks)"), why do the most successful NNs still remember *so* much compared to humans?
For example, an LLM like GPT-3 can casually memorize entire pages of text after training on it once.
This is not an extraordinary capability, [but](https://arxiv.org/abs/2305.00118 "‘Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4’, Chang et al 2023") [happens routinely](https://arxiv.org/abs/2205.10770 "‘Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models’, Tirumala et al 2022") [with](https://arxiv.org/abs/2202.07646 "‘Quantifying Memorization Across Neural Language Models’, Carlini et al 2022") [LLMs](https://www.fast.ai/posts/2023-09-04-learning-jumps/). And the bigger and better they get, the more they are able to memorize.

Whereas for humans, even those who pride themselves on their memory for text (like myself), memorizing any page of text is effectively impossible without extensive practice in mnemonic techniques, and we tell awed stories of those few human beings who appear to have [photographic memories](!W) (and who demonstrate that it *is* possible for human brains to do such things; they just don't usually).
The usual human's lack of a photographic memory, and the [shallowness of our understanding](/doc/psychology/cognitive-bias/illusion-of-depth/index), can be demonstrated with hilariously simple examples, [like](/doc/psychology/cognitive-bias/illusion-of-depth/2018-wong.pdf "‘The devil’s in the <em>g</em>--tails: Deficient letter-shape knowledge and awareness despite massive visual experience’, Wong et al 2018") "how is the letter 'g' written?" or "which way does Lincoln face on [the American penny](/doc/psychology/cognitive-bias/illusion-of-depth/1979-nickerson.pdf "‘Long-term memory for a common object [a penny]’, Nickerson & Adams 1979")?"
Our memories are *so* bad, in fact, that in addition to struggling to remember what we ate for breakfast yesterday or [learning basic general knowledge](https://link.springer.com/article/10.3758/s13428-012-0307-9 "‘General knowledge norms: Updated and expanded from the Nelson & Narens 1980 norms’, Tauber et al 2013") like ["the capital of Russia"](https://link.springer.com/article/10.3758/s13428-012-0307-9 "‘General knowledge norms: Updated and expanded from the Nelson & Narens 1980 norms’, Tauber et al 2013"), [we forget almost our entire childhood](!W "Childhood amnesia"){#childhood-amnesia}, and treat this **childhood amnesia** as completely normal and may even deny that fact.

This is particularly puzzling given that: we never have more neural connections than we do in childhood, as they continually die or are pruned away; it is not universal cross-species; the amnesia doesn't seem to damage more core knowledge like motor ability or language; important traumatic memories [can be retained](/doc/psychology/spaced-repetition/2001-peterson.pdf "‘5 years later: children’s memory for medical emergencies’, Peterson & Whalen 2001"); and that those lost episodic memories are, to a considerable degree, [still there](https://www.science.org/content/article/are-your-earliest-childhood-memories-still-lurking-your-mind-or-gone-forever "‘The fading memories of youth: The mystery of “infantile amnesia” suggests memory works differently in the developing brain’, Reardon 2024")—detectable by implicit measures like lower reaction times or faster recognition of blurred images or needing fewer hints/cues.

Nor is this because memorizing facts is useless to learning; memorization is critical to learning, and that is true even for eg. [spaced repetition of abstract topics](/spaced-repetition#abstraction).

But then, if memorizing is so useful and more knowledge is better, why do the most successful human memorizers seem to underperform, and why is exceptional memory a mixed blessing at best? Why are examples of photographic memory anecdotally associated with odd limitations in creativity or generalization, like John von Neumann or [Luria's Solomon Shereshevsky](!W "Solomon Shereshevsky#Challenges"), when experimentally in spaced-repetition memory research, better memorization appears to [help generalization & understanding](/spaced-repetition#abstraction) without such drawbacks? Why are normal humans capable of memorizing *some* things, like [the Koran](!W "Hafiz (Quran)") or [_Paradise Lost_](/doc/psychology/spaced-repetition/2010-seamon.pdf "‘Memorizing Milton’s <em>Paradise Lost</em>: A study of a septuagenarian exceptional memorizer’, Seamon et al 2010") or the [Homeric epics](/doc/history/1933-parry.pdf "‘Whole Formulaic Verses in Greek and Southslavic Heroic Song’, Parry 1933"), given enough time & motivation & tricks like [memory palaces](!W "Method of loci"), but not routinely?

The most extreme cases [are idiot savants](#savantism), whose overall performance profile and pre-DL descriptions bear an uncanny resemblance to LLMs.
The feats of savants are too well-documented to doubt, but are impossible to fit comfortably into standard frameworks: if any humans could defeat LLMs in their strong points, it is savants, but then why does savantism seem to only come with such severe costs?

### Human Ignorance

And why are **humans so ignorant** compared to NNs, overall?

Not only do we not recall much easily, we don't *know* much either: the breadth of knowledge of even a low-quality LLM in 2024 vastly surpasses that of any individual human.
Individual expert humans still usually beat the best LLMs in their narrow niche of expertise, but if we quizzed humans with the same breadth of questions that we benchmark LLMs, I don't think there is a human alive who would come anywhere near them.

### Human Intelligence

So why are **humans so smart in general**, ie. why do humans generalize more robustly (eschewing the use of [non-robust features](https://arxiv.org/abs/1905.02175 "‘Adversarial Examples Are Not Bugs, They Are [non-robust] Features’, Ilyas et al 2019")/['dimpled' manifolds](https://arxiv.org/abs/2106.10151 "‘The Dimpled Manifold Model of Adversarial Examples in Machine Learning’, Shamir et al 2021"){#dimpled-manifold} like [GAN steganography](https://arxiv.org/abs/1712.02950 "‘CycleGAN, a Master of Steganography’, Chu et al 2017")^[Aside from not making use of non-robust features, despite their power & omnipresence, humans appear to be almost completely blind to them:]), are more creative, and largely immune to adversarial examples?

One of the most counterintuitive aspects of NNs is that for all the incredible fluency & genuine capabilities, they nevertheless still occasionally make simple blatant mistakes.
This is especially true of LLMs: even excluding technical issues caused by [BPEs](/gpt-3#bpes "‘GPT-3 Creative Fiction § BPEs’, Gwern 2020"), RLHF & other safety tuning, greedy sampling, and so on, they still make baffling, fatal errors which cause them to go in circles and fail to generalize on things where there seems to be ample training data.

Another version: given that NNs seem to so easily memorize & overfit (and famously [can memorize all training data](https://arxiv.org/abs/1611.03530#google "‘Understanding deep learning requires rethinking generalization’, Zhang et al 2016")), and operate by ["direct fitting"](/doc/ai/scaling/2020-hasson.pdf "‘Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks’, Hasson et al 2020") to 'interpolate' between the ["underspecified"](https://arxiv.org/abs/2011.03395#google "‘Underspecification Presents Challenges for Credibility in Modern Machine Learning’, D’Amour et al 2020") training data, why do they generalize *at all*?

### Need to Sleep

Why do **humans need to sleep**—so much so that we usually die if sleep-deprived harshly enough?

In contrast, there is no known equivalent of sleep in current DL, nor any apparent need for it.

Even for continual learning or reinforcement learning replay, ordinary minibatch [SGD](!W) suffices.
There is no need for extended periods of offline inactivity or non-interactive computation simply to train on some new data.

### Stage-Wise Development {#stages}

Why does **human development spurt through stages** to a much greater degree than NNs do?

Children seem to often develop in phases, with sudden spurts ([Van Der Maas et al 2006](/doc/iq/2006-vandermaas.pdf "‘A dynamical model of general intelligence: The positive manifold of intelligence by mutualism’, Maas et al 2006"); [van Geert 1991](/doc/psychology/neuroscience/1991-vangeert.pdf "‘A dynamic systems model of cognitive and language growth’, Geert 1991"); cf. [phase transition](/doc/psychology/neuroscience/2009-spivey.pdf "‘The Phase Transition In Human Cognition’, Spivey et al 2009") neural dynamics).
Further, children say, believe, and do all sorts of crazy or stupid or evil things—while people make fun of the bizarre things LLMs sometimes say, LLMs have nothing on the things *little kids* routinely say.
(It makes one wonder how human brains reach the adult stage when they apparently must all go through the little kid stages, where if an adult said such things, they would be committed to a mental hospital or feared as a future serial killer.)

That is strikingly different from NNs.
NNs demonstrate various kinds of phases within-training and [emergence](/doc/ai/scaling/emergence/index) across-scaling, but many of the identified phase transitions, like ['induction heads'](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#anthropic "‘In-context Learning and Induction Heads’, Olsson et al 2022"), barely even change the training loss or more than a few benchmarks.
In general, NN training appears much *smoother* than human development does, aside from changes in [learning rates](https://arxiv.org/abs/1608.03983 "‘SGDR: Stochastic Gradient Descent with Warm Restarts’, Loshchilov & Hutter 2016"), particularly cyclical learning rates.

### Slow Development

Why does **intelligent development take so long**?

Since larger models are more sample-efficient, and many intelligent animals (and definitely humans) are massively larger, why does it take so long?

One of the most universal observations in NN scaling is that, whatever the task, "larger parameter-count models are more sample-efficient": more precisely, they reduce the loss more per data point than a smaller model does, and so, when the training is plotted comparing loss with data processed, larger models show a distinctly 'steeper' plunge than small models do.^[We do not simply scale up models to the largest one that fits in our hardware because they cost more compute per-step, so a compute-optimal small model will process much more data with the same compute budget and eventually beat them.] (This raises important questions that I'm not sure we know the answers to: "How sample-efficient are extremely, extremely large models---multiple orders of magnitude larger than we now train? How large do they have to become before they stop becoming more sample-efficient? If they do stop at some point, can better regularization fix them? Why do we not train these if we are data-limited in some regimes?")
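To make "larger models are more sample-efficient" concrete, here is a minimal sketch using the additive parametric loss form L(N, D) = E + A/N^α + B/D^β. The constants are approximately those fitted in the Chinchilla paper and should be treated as illustrative assumptions, not as a claim about any particular model; the helper names `loss` and `tokens_to_reach` are hypothetical. In this parametric family, the advantage of size shows up as fewer tokens needed to reach a fixed loss.

```python
# Parametric loss fit of the form L(N, D) = E + A / N^alpha + B / D^beta.
# Constants are approximately those reported by Hoffmann et al 2022 ("Chinchilla");
# they are used here purely as illustrative assumptions.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters after n_tokens of data."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_to_reach(target: float, n_params: float) -> float:
    """Tokens needed to hit `target` loss; infinite if the model can never get there."""
    gap = target - E - A / n_params**ALPHA
    if gap <= 0:
        return float("inf")
    return (B / gap) ** (1 / BETA)


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N={n:.0e}: tokens to reach loss 2.5 = {tokens_to_reach(2.5, n):.2e}")
```

Under these assumed constants, each 10× increase in parameters cuts the data needed to hit the same loss, which is the sense of "sample-efficient" used above; the fit says nothing about models many orders of magnitude beyond the range it was estimated on.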

Since many animal brains appear to be so much vastly more parameterized than any DL model ever, and to have a large compute-budget and be bottlenecked by processing samples sequentially, we would naively expect an even steeper plunge.
The expense of intelligence would be running the brain, not wallclock time.

But everywhere we look, like primates or cetaceans or elephants or birds, we see that intelligence tends to be associated with prolonged childhoods & longevity & low reproductive rates.
These long developmental periods don't seem explicable in purely metabolic terms: you cannot produce a college-level human intelligence by force-feeding toddlers, and while we do see changes like the average age of puberty dropping like a stone from contemporary caloric abundance, that doesn't seem to accelerate brain development much.

And why does the development stop? Why don't we go on being neuroplastic our whole lives but seem to coast after young adulthood, appearing fully functional even if we've almost stopped learning like many old people?

#### Human Slowness

Why can **human development take so long** for many things that other animals learn quickly?

For example, some prey animals are able to walk and even run within minutes or hours of birth (thanks in part to [fetal training of motor control via twitching](https://www.newyorker.com/science/elements/what-are-dreams-for "‘What Are Dreams For? Converging lines of research suggest that we might be misunderstanding something we do every night of our lives’, Gefter 2023")), while a human child is both physically & mentally incapable of crawling for >5,000 hours.

And where are the informative priors from evolution?
Evolutionary psychology in particular suggested that many complicated social behaviors could be traced to genes and the human brain would be found to have "massive modularity"; this research program has failed completely.

### Qualitative Differences in Training Dynamics

Where are the **missing equivalents for humans of major NN dynamics** like [deep](https://openai.com/research/deep-double-descent "‘Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.’, Nakkiran et al 2019") [double descent](https://arxiv.org/abs/1812.11118 "‘Reconciling modern machine learning practice and the bias-variance trade-off’, Belkin et al 2018") (true even with [1,000-degree polynomial splines](https://windowsontheory.org/2019/12/05/deep-double-descent/)) & "epoch-wise double descent", [super-convergence](https://arxiv.org/abs/1708.07120 "‘Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates’, Smith & Topin 2017"), and [grokking](/doc/ai/nn/fully-connected/2021-power.pdf#openai "‘Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets’, Power et al 2021")/[catapult](https://arxiv.org/abs/2003.02218 "‘The large learning rate phase of deep learning: the catapult mechanism’, Lewkowycz et al 2020") or [patient-teachers](https://arxiv.org/abs/2106.05237#google "‘Knowledge distillation: A good teacher is patient and consistent’, Beyer et al 2021")?

And if there are none, why not?

### Sleep

[Tononi's](/doc/zeo/2006-tononi.pdf "‘Sleep function and synaptic homeostasis’, Tononi & Cirelli 2006") [SHY](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3921176/ "‘Sleep and the price of plasticity: from synaptic and cellular homeostasis to memory consolidation and integration’, Tononi & Cirelli 2014") is one of the most interesting theories attempting to explain sleep in general (rather than dreaming).
In SHY, 'sleep' is forced by neural net local learning, where the neural net 'weights' increase in size over time as they learn; this is bad because it directly increases the energetic demands of firing a synapse, and because (just like an artificial NN), it makes it easier to memorize rather than generalize.
The 'weights' all need to be multiplied by a fraction, to shrink them back down to normal energy consumption and to help regularize the brain; but it can't be done one by one, because they are all in use.

Sleep, then, is when the *entire* brain is taken offline, to shrink them all simultaneously.
So, you can skip sleep for a while, and your brain keeps chugging along, weights growing slowly and becoming more and more inefficient, but it causes increasing problems, and eventually something diverges.
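The SHY picture can be caricatured in a few lines. This is only a toy sketch under my own assumptions (a flat vector of hypothetical synaptic strengths, multiplicative potentiation during 'wake', and total strength as a stand-in for the energy budget); it is not a model taken from Tononi & Cirelli.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(0.5, 1.5, size=1000)  # hypothetical synaptic strengths
baseline_total = weights.sum()              # stand-in for the sustainable energy budget


def wake(w, n_events=5000, gain=0.02):
    """Local learning: each event potentiates one randomly chosen synapse."""
    w = w.copy()
    for i in rng.integers(0, w.size, size=n_events):
        w[i] *= 1.0 + gain
    return w


def sleep(w, target_total):
    """Global downscaling: multiply every weight by the same fraction."""
    return w * (target_total / w.sum())


after_wake = wake(weights)
after_sleep = sleep(after_wake, baseline_total)
```

The point of the sketch is that the rescaling restores the total while leaving every ratio between weights untouched, so what was learned while awake is preserved in relative terms; it also needs the global sum, which is why it cannot be done synapse by synapse.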

(Dreams, then, are not essential; they are probably more of an enhancement for additional sample-efficiency, by doing a [more extreme form of experience replay](https://arxiv.org/abs/2007.09560 "‘The Overfitted Brain: Dreams evolved to assist generalization’, Hoel 2020") than can be done during waking.)

### Savantism

Savants are often capable of remarkable feats of memory or perception, like making near-photorealistic drawings from a single glance at a photograph or memorizing books.

How is it possible for the (non-autistic) [Kim Peek](!W "Kim Peek#Early life") to ["at age 16--20 months...memorize every book that was read to him.
His parents moved Kim's finger along each sentence being read."](https://web.archive.org/web/20080916062059/http://www.wisconsinmedicalsociety.org/savant_syndrome/savant_profiles/kim_peek)?
He could hardly have developed, as an infant, memory palace techniques or other adult-like memory tricks like hours of repetition; instead, he seems to simply memorize books in a single epoch, so to speak, to parrot them back like an LLM.

Further, why do Peek & other savants seem to avoid childhood amnesia entirely (accounts emphasize that their memory tends to be lifelong, and Peek seems to remember books read as a young child), while also appearing to not go through childhood development stages?

Why do they usually suffer from severe neurological (associated with left [hemisphere](!W "Cerebral hemisphere") trauma) & physical problems, and demonstrate remarkable deficits in abstract understanding and general intelligence?
For example, ["he \[Peek\] cannot, for example, explain many commonplace proverbs"](https://www.scientificamerican.com/article/inside-the-mind-of-a-sava/ "'Inside the Mind of a Savant: Kim Peek—the inspiration for <em>Rain Man</em>—possesses one of the most extraordinary memories ever recorded. Until we can explain his abilities, we cannot pretend to understand human cognition', Darold A. Treffert & Daniel D. Christensen 2006-06-01"), and even this level of abstraction (understanding *some* proverbs) is considered unusual for a savant!
As far back as [Down 1887](/doc/psychiatry/autism/1887-down.pdf "<em>On some of the mental affections of childhood and youth</em>: Lecture 3: Idiot Savants"), observers were struck by the contrast between the lifelong "verbal adhesion" of many savants and their understanding, or general intelligence.
([Treffert & Wallace 2002](/doc/psychology/neuroscience/memory/savant/2002-treffert.pdf "Islands of Genius: Artistic brilliance and a dazzling memory can sometimes accompany autism and other developmental disorders") note that of the ~100 well-described cases, the *highest* IQ was 114.)
Strikingly for LLM parallels, savants can also be capable of [confabulation](!W): Down 1887 describes an illiterate child who would "take up a book...and improvise stories of all kinds with a great deal of skill, and in any variety, to suit the supposed tastes of his auditors".^[This is much less remarked on in the savant literature.
Is confabulation under-documented because no one expects it, and so ignores it or fails to test for it (eg. by asking trick questions)? Or do savants refuse to answer such questions, if it is not in one of their special interests or if they feel confused when they try to answer it? (Down also notes that savants are not always cooperative, and one can imagine that it might take a long period of building rapport to reach the point where one could ask anything, instead of the usual displays of skill which they may [take great pride in](/doc/psychology/neuroscience/memory/savant/1979-hoffman.pdf "‘An idiot savant with unusual mechanical ability’, Hoffman & Reeves 1979").)]

In cases of mathematical calculation feats like calendar dates or taking roots, the underlying algorithm can be inferred to [some](/doc/psychology/neuroscience/memory/savant/1962-hunter.pdf "‘An Exceptional Talent For Calculative Thinking’, Hunter 1962") [degree](/doc/psychiatry/autism/1990-hermelin.pdf "‘Factors and primes: a specific numerical ability’, Hermelin & O’Connor 1990"), and appears to mix huge amounts of memorization & distilled intuition with explicit algorithms.
These all appear to benefit from [disabling higher levels of the brain](https://rstb.royalsocietypublishing.org/content/364/1522/1399.full).

Savants are often intellectually-disabled or otherwise abnormal.
An important example in this context is [Luria's](!W "Alexander Luria") famous mnemonist, [Solomon Shereshevsky](!W) ([_The Mind of a Mnemonist_](https://archive.org/details/LuriaTheMindOfAMnemonist)), who simply remembered everything he saw or heard: in part by using extensive visualization & mnemonic approaches like the [memory palace](!W), but his level of [synesthesia](!W) also meant that everything was inextricably associated for him.[^S]

[^S]: [Synesthesia](https://www.newyorker.com/books/page-turner/the-mystery-of-s-the-man-with-an-impossible-memory "'The Mystery of S., the Man with an Impossible Memory', Reed Johnson 2017-08-12"):

    > S.’s case, as many readers have noted, resembles the [Jorge Luis Borges](!W) story [“Funes the Memorious”](!W), a fictional work about a man plagued by the persistence of his memory.
    > “To think is to forget a difference, to generalize, to abstract”, Borges writes.
    > “In the overly replete world of Funes there were nothing but details, almost contiguous details.”
    > Similarly, Luria writes that for S. almost every word, every thought, was freighted with excessive detail.
    > When he heard “restaurant”, for example, he would picture an entrance, customers, a Romanian orchestra tuning up to play for them, and so on.
    > Like Funes, S. had a sort of private language to catalogue the richness of his mental associations.
    > The word for “roach” in Yiddish could also mean, in his mind, a dent in a metal chamber pot, a crust of black bread, and the light cast by a lamp that fails to push back all the darkness in a room.

He [struggled with non-literal interpretations](!W "Solomon Shereshevsky#Challenges"), reading, or even recognizing faces—he could remember exactly how a person's face looked previously, of course, but then he could not generalize to how their face looked *now*.
Why did he not benefit from the ability to recall every version of a person's face, and gain uncanny [super-recognizer](!W) powers?
(See also [hyperthymesia](!W): people with HSAM do not have an intrinsically superior memory for *learning*, but for *not forgetting*---they seem to struggle to focus on daily life and [cope with](https://www.ctvnews.ca/w5/why-18-year-old-canadian-emily-nash-is-sharing-her-unique-brain-with-science-1.6818765) [anxiety or memories](https://www.thisamericanlife.org/585/transcript) of bad days.)
Shereshevsky himself had what sounds like [maladaptive daydreaming](!W), and ultimately died of alcoholism.

On a similar note, one of [John von Neumann's](!W) famous party tricks was memorizing pages of books, and being able to recite books on command—just one of his many remarkable feats of cognition which have [entered legend](/doc/math/1973-halmos.pdf "‘The Legend of John Von Neumann’, Halmos 1973").
But [von Neumann was criticized](!W "John von Neumann#Preferred problem-solving techniques") for pedantry, explicitness, inelegance, brute-force calculation, and comfort with following complicated lines of reasoning by relying on his photographic memory.
[Eugene Wigner](!W), who knew von Neumann well, while describing his feats, [struck a cautionary note](https://www.amazon.com/The-Recollections-Of-Eugene-Wigner/dp/0738208868#unorignality "‘<em>The Recollections of Eugene P. Wigner as Told to Andrew Szanton</em> § Von Neumann’s Originality’, Szanton & Wigner 1992"): none of them were truly *original*, and von Neumann himself reportedly said that ["\[I\] will be forgotten while Kurt Gödel is remembered with Pythagoras"](http://at.yorku.ca/t/o/p/d/03.htm "'Once over lightly', John L. Kelley 1989").
Was von Neumann ever original & creative enough to be truly *great* like Pythagoras or Gödel?[^Wigner]
Was he held back by his great knowledge & calculating ability (if not exactly ["library work"](/doc/science/1986-hamming#library-work)), and unable to drill to the essence of matters like a Grothendieck?

If we combine the ideas of lack of abstraction or later stages of childhood cognitive development, memorization, synesthesia (failing to separate modalities), injuries to the left hemisphere (particularly the [temporal lobe](!W)), ability to confabulate, and general extreme imbalances in cognitive capabilities compared to normal humans, I am left with the impression that savants are the "LLM version" of humans.
Savants are what can sometimes happen when a key node in the brain is knocked out and higher level processes become less effective, exposing more of the brain's low-level raw intelligence, which is satisfied by simpler forms of learning like massive memorization.

## Grokking

A good example is the much-discussed phenomenon of "grokking".
In the [original grokking paper](#power-et-al-2021), a small NN is trained on a simple arithmetic problem; as expected, it quickly memorizes the training data, achieving 0% error while failing to generalize with ~100% error on the held-out data.
What is more surprising is that after training for long enough, sometimes, apparently at random, it gradually begins to improve on the held-out data despite the ~0% training error.
The odds of grokking are improved the more the NN is regularized—such as by [oversampling implied data](https://arxiv.org/abs/2405.15071 "‘Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization’, Wang et al 2024"){#wang-et-al-2024-ratio} which cannot be easily memorized, and [particularly](https://arxiv.org/abs/2405.20233 "‘Grokfast: Accelerated Grokking by Amplifying Slow Gradients’, Lee et al 2024") [by](https://arxiv.org/abs/2210.01117 "‘Omnigrok: Grokking Beyond Algorithmic Data’, Liu et al 2022"){#omnigrok} [weight decay](https://arxiv.org/abs/1711.05101 "‘Decoupled Weight Decay Regularization’, Loshchilov & Hutter 2017"), which tries to shrink the size of the NN weights.
(The larger the weights, the easier it is to encode information into them, and memorize the training data.)
What is going on?
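For concreteness, the kind of setup being described can be sketched as follows. This is a minimal illustration under my own assumptions (modular addition with a hypothetical modulus `p=97`, a 50% split, and a plain SGD step with decoupled weight decay); it is not the exact configuration of any of the cited papers, and the function names are invented for the example.

```python
import numpy as np


def modular_addition_split(p=97, train_frac=0.5, seed=0):
    """All p*p equations a + b = c (mod p), shuffled and split into train / held-out."""
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    inputs = np.stack([a.ravel(), b.ravel()], axis=1)
    labels = (inputs[:, 0] + inputs[:, 1]) % p
    order = np.random.default_rng(seed).permutation(len(inputs))
    n_train = int(train_frac * len(inputs))
    tr, te = order[:n_train], order[n_train:]
    return (inputs[tr], labels[tr]), (inputs[te], labels[te])


def sgd_step_with_weight_decay(w, grad, lr=1e-3, weight_decay=1.0):
    """Decoupled weight decay: shrink weights toward 0 independently of the loss gradient."""
    return w - lr * grad - lr * weight_decay * w
```

Note that once the training loss is ~0, the gradient term nearly vanishes and the decay term is almost the only force still acting on the weights: every step multiplies them by `1 - lr * weight_decay`, steadily eroding whatever large-weight memorizing solution was found first.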

Mechanistically, what seems to happen when it groks is that the initial hyperparameters are 'poorly' chosen and the NN quickly finds a nearby local optimum in the model-space ([loss landscape](https://arxiv.org/abs/1712.09913 "‘Visualizing the Loss Landscape of Neural Nets’, Li et al 2017")), representing the 'memorization' solution, but that after enough training, the weight decay or other regularization makes the local optimum too 'narrow' to contain the NN forever, and then it randomly walks out, or is ["*catapulted*"](#lewkowycz-et-al-2020) out of the original solution, and eventually reaches a new region of model-space.

This new region still has ~0% training error... but it corresponds to an algorithm completely different from memorization, which is [simpler in an intrinsic](https://arxiv.org/abs/2309.02390 "‘Explaining grokking through circuit efficiency’, Varma et al 2023") [sense](https://arxiv.org/abs/2401.10463 "‘Critical Data Size of Language Models from a Grokking Perspective’, Zhu et al 2024") than "memorizing all the training data" and so unsurprisingly, generalizes better, even in the presence of [data error](https://arxiv.org/abs/2310.13061 "‘To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets’, Doshi et al 2023").
(It could in fact be many algorithms—there are a [surprising number of valid algorithms](https://arxiv.org/abs/2306.17844 "‘The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks’, Zhong et al 2023") for that particular arithmetic problem.)
And since it is simpler, there are many more possible models which all correspond to the same thing, so this new region of model-space is ['wider'](/doc/ai/nn/rnn/1997-hochreiter-2.pdf#schmidhuber "‘Flat Minima’, Hochreiter & Schmidhuber 1997") than the memorization region, and so the model generally does not randomly walk out of that one; or if it does, it will probably find a third, bigger & better region, and so on.
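The 'catapult' from a narrow region to a wide one can be seen in about the smallest model possible. What follows is a scalar caricature in the spirit of the two-layer linear model analyzed in the catapult paper, not a reproduction of it: the loss is L = (u·v)²/2 with target 0, and u² + v² plays the role of the curvature (it equals the top Hessian eigenvalue at any minimum). The initial values and learning rates are assumptions chosen so that one run is below the classical stability threshold 2/curvature and the other is above it.

```python
def run_gd(lr, u=1.5, v=0.1, steps=60):
    """Gradient descent on L = (u*v)^2 / 2; returns per-step losses and curvatures."""
    losses, sharpness = [], []
    for _ in range(steps):
        f = u * v                                # model output; the target is 0
        losses.append(0.5 * f * f)
        sharpness.append(u * u + v * v)          # tangent-kernel / curvature proxy
        u, v = u - lr * f * v, v - lr * f * u    # simultaneous gradient step
    return losses, sharpness


small_losses, small_sharp = run_gd(lr=0.5)  # lr < 2 / curvature: ordinary descent
large_losses, large_sharp = run_gd(lr=1.2)  # 2 / curvature < lr < 4 / curvature: catapult
```

With the small learning rate, the loss falls monotonically and the run settles into a minimum about as sharp as where it started. With the large one, the loss first rises by an order of magnitude, the weights shrink, and the run lands in a much flatter minimum, one whose curvature is low enough to be stable at that learning rate. Both end at ~0 training loss, but in regions of very different width.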

This is a quite intriguing result, but it's not obvious how to apply it to anything else, like a full-size LLM.
It would doubtless be possible to train an LLM like GPT-3 to zero training error if it were trained for enough epochs, but it would be impossibly expensive, and then you would still have to train long enough past *that* to trigger "grokking".
Given that it can take hours to trigger "grokking" on even the tiniest possible toy dataset, true "LLM grokking" looks infeasible with this approach.
But the phenomenon of grokking is interesting.

## Cyclical Learning Rates

A related theme is struck by studies of [cyclical learning rates](https://arxiv.org/abs/1506.01186 "‘Cyclical Learning Rates for Training Neural Networks’, Smith 2015"), like [super-convergence](#smith-topin-2017) or cosine learning rates, as used in eg. [Chinchilla](#hoffmann-et-al-2022)/[MiniCPM](https://arxiv.org/abs/2404.06395 "‘MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies’, Hu et al 2024") (which yielded a large improvement over [Kaplan et al 2020](https://arxiv.org/abs/2001.08361#openai "Scaling Laws for Neural Language Models")), and which appear critical in [continual learning](/doc/reinforcement-learning/meta-learning/continual-learning/index).

Cyclical learning rate schedules look like [simulated annealing](!W) in having distinct loss curves: regularly spiking up (very) high when the LR goes high, but then dropping rapidly when the LR is lowered and achieving the best results yet.
These often correspond to big capability gains; one can think of the high-learning-rate period as 'exploring' to create a new model, while the low-learning-rate periods 'exploit' and finetune the new model, only to start all over.
(And interestingly, cyclical LRs can be run an indefinite number of times.)
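As a reference point, one common cyclical schedule is cosine annealing with warm restarts (the SGDR schedule). The sketch below assumes a fixed cycle length and hypothetical values for `lr_max`, `lr_min`, and `period`; the original formulation also allows cycles to lengthen over time, which is omitted here.

```python
import math


def cosine_restart_lr(step, lr_max=1e-3, lr_min=1e-5, period=1000):
    """Cosine annealing with warm restarts: decay from lr_max to lr_min, then jump back up."""
    t = step % period  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))
```

Each cycle starts with the 'explore' phase at `lr_max` and ends in the 'exploit' phase near `lr_min`; because the schedule depends only on `step % period`, it can be repeated for as many cycles as one likes.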

Thus, cyclical LRs can look bad if not properly evaluated, because they trade off short-term loss minimization for long-term gains—if you get them right and run them long enough compared to other LR schedules, "the curves cross", but until then...

In this respect, cyclical LRs look remarkably like childhood development, complete with periods of high errors while trying to learn new abilities (eg. when children learn sarcasm, it is usually preceded by a period where they make many major errors because they are unable to tell when a negation of an obvious fact is an effective use of sarcasm vs simply false).

## Adversarial Examples & Isoperimetry

Consider also another [blessing of scale](/scaling-hypothesis#blessings-of-scale): adversarial examples and the [isoperimetry paper](https://arxiv.org/abs/2105.12806 "‘A Universal Law of Robustness via Isoperimetry’, Bubeck & Sellke 2021").

Adversarial examples, 10 years after discovery, are still not solved: every defense has been broken, and adversarial examples even transfer in many ways, like across models or modalities.
Adversarial examples seem to get weaker with parameter scale, but not necessarily with regular compute-optimal scaling.
The isoperimetry paper advances the thesis that the reason for this is that because current NNs can be seen as subdividing the latent space into linear volumes, defined piecewise, parameter by parameter, the boundaries between each volume are linear and fragile: instead of 'curving around' like they ought to, they simply go straight.
Therefore, it is easy to 'nudge' any particular data point in a variety of small ways, such that they get pushed just over the artificially simple linear boundaries and into the wrong volume.
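How small a 'nudge' suffices against a straight boundary can be checked directly. This is a toy sketch with a single random linear boundary and a random input, both hypothetical, using the standard sign-of-the-weights perturbation; it illustrates the geometry and is not an attack on any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3072                               # e.g. a flattened 32x32 RGB image (assumption)
w = rng.normal(size=dim) / np.sqrt(dim)  # hypothetical linear boundary: w.x + b = 0
b = 0.0
x = rng.normal(size=dim)                 # hypothetical data point, coordinates of scale ~1

margin = w @ x + b                       # signed score; its sign is the predicted class
# Smallest per-coordinate step that just crosses the boundary (with 1% to spare):
eps = 1.01 * abs(margin) / np.abs(w).sum()
x_adv = x - np.sign(margin) * eps * np.sign(w)
```

Because every one of the thousands of coordinates contributes a little in the same direction, the per-coordinate change `eps` needed to flip the sign is tiny compared to the scale of the data, and it shrinks further as the dimension grows.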

The true, correct, curve can be approximated to arbitrary accuracy by simply a lot of small boundary lines, but it may take a *lot* of them, and thus, a lot of parameters (and data?) to define them all.
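The cost of approximating a curve with straight pieces is easy to measure in one dimension. As a sketch, assuming a sine arc as the stand-in 'true' curve and equally spaced knots (the function name is invented for the example):

```python
import numpy as np


def polyline_max_error(n_segments, n_test=100_001):
    """Worst-case gap between sin on [0, pi] and its n-segment piecewise-linear interpolant."""
    knots = np.linspace(0.0, np.pi, n_segments + 1)
    xs = np.linspace(0.0, np.pi, n_test)
    approx = np.interp(xs, knots, np.sin(knots))
    return float(np.max(np.abs(np.sin(xs) - approx)))


errors = {n: polyline_max_error(n) for n in (10, 20, 40, 80)}
```

The error falls only as roughly 1/n² in the number of segments, so each further factor of 100 in accuracy costs about 10× more pieces, and that is for a single smooth curve in one dimension; a decision boundary in a high-dimensional input space needs far more.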

This is why all the defenses, like [adversarial training](https://adversarial-ml-tutorial.org/adversarial_training/), keep failing: they do not solve the basic geometric problem—it's not a few parameters which are bad, it's all of them, there are countless boundary lines which are too straight and open for exploitation.
Such defenses can, at best, eliminate specific examples and maybe make it harder for a specific attacker to find an example, but they may only push the problem elsewhere, or provide security-by-obscurity.
It is also why examples are so stubborn and transfer: models will tend to conceptualize and think in roughly similar ways, and they are all highly under-parameterized, so they will tend to share similar vulnerabilities.

The apparent message of the isoperimetry paper is that you need to scale up your model sizes.
And not by a little: their ballpark estimates suggest that models for things like [ImageNet](https://www-cs-faculty.stanford.edu/groups/vision/documents/ImageNet_CVPR2009.pdf "‘ImageNet: A Large-Scale Hierarchical Image Database’, Deng 2009") are at least 2 OOMs too small.
You can imagine what that must imply about LLMs! A GPT-3-175b-scale dense Transformer model might need to be, say, 100 trillion parameters.
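As a hedged paraphrase of the paper's main result (dropping constants and logarithmic factors, and assuming noisy labels and a smoothly parameterized model class): any model $f$ with $p$ parameters which fits $n$ data points in $d$ dimensions to below the noise level must have a Lipschitz constant of at least

$$\mathrm{Lip}(f) \gtrsim \sqrt{\frac{n d}{p}}$$

so a robust fit, with $\mathrm{Lip}(f) = O(1)$, needs roughly $p \gtrsim n d$ parameters: a factor of $d$ more than the $p \approx n$ which is enough to merely interpolate the data. Here $d$ is best read as an effective dimension of the data rather than the raw input size, which is one reason the resulting estimates are only ballpark figures.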

Unfortunately, whether you use Kaplan or Chinchilla scaling laws, a 100t-parameter dense LLM would appear to be out of the question for the foreseeable future.
To train a model which is able to fit all of those tiny line-segments defined by those 100t parameters, one by one, would require at least as much data—it would require astronomically more data than plausibly exists, and then the training cost of training a model that size on that much data would be equally astronomical.

At least... if you try to fit those parameters with the standard 'sharp' (greedily myopic first-order) SGD, looking at each parameter individually.
If instead you crossed your eyes, and stared, all of those pixelated lines might *blur into smooth curves*, and if you thought about it long enough, suddenly at some point the pattern of curves may snap into the drawing of a Dalmatian dog.

This is how NNs catapulting to new model basins may learn intrinsically more robust models which have the blessings of isoperimetry without the brute-force approach of fitting the smooth curves directly to data.

# How the Brain Works

<div class="epigraph">
> [[Geoff Hinton](!W)] tells of coming home from work one day in a state of great excitement, exclaiming, "I did it! I've figured out how the brain works!"
>
> His daughter replied, "Oh Dad, not again!"
>
> ---[Pedro Domingos](https://www.wired.com/story/master-algorithm-pedro-domingos/) (cf. [NIPS 2010](https://www.youtube.com/watch?v=mlXzufEk-2E "The Deep Learning Saga"))
</div>

What all of these anomalies seem to share is a core scaling-law-like relationship among parameters, memorization, generalization, and training: a multi-way [bias-variance tradeoff](!W), where different systems hit different points on a Pareto frontier, with NNs having low [variance](!W) in-sample but then high [bias](!W "Bias of an estimator") out-of-sample or on hard problems, and biological brains at the other extreme (with an unhappy valley of intermediates which impress no one).
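For reference, the textbook squared-error form of the tradeoff, which the argument here borrows only loosely as an analogy: for data $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and a learned predictor $\hat{f}$ which depends on the random training set,

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$$

where the first term is the squared bias, the second is the variance, and the third is irreducible noise.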

I suggest that the core insight here is that too-extensive memorization is the enemy of abstraction, by leading a model to a local optimum which minimizes error but encodes fundamentally the wrong algorithm.
Instead, we must, paradoxically, defy intuitions about overfitting by training as large a model as possible in order to handle as small data as available to a human without overfitting.

What happens when we train a current DL model is that it is lazy, and so it rapidly homes in on the nearest loss local minimum it can find.
This minimum tends to be one where it has highly-efficiently memorized the data and learned all the [non-robust features](#non-robust-features) and statistical shortcuts; this assemblage of tricks is genuinely effective and intelligent, and is not "cheating" (non-robust features really *are* present in the heldout data etc.), but they do not generalize *far beyond* that, because they have not hit on the true underlying algorithm or latent manifold.

Why do grokking NNs seem to need to "memorize to generalize"?
Why might it be hard to make progress towards the true target before the memorization phase?

Perhaps because LLMs both memorize *and forget* datapoints [constantly](https://arxiv.org/abs/2406.11813 "‘How Do Large Language Models Acquire Factual Knowledge During Pretraining?’, Chang et al 2024") during training (which is why the number of copies in the dataset matters: a datapoint with more copies is more likely to have been seen recently before the end)---but this forgetting happens for no good reason? "Repetition is the mother of learning" (which is why [spaced repetition is good for abstraction](/spaced-repetition#abstraction), not just brute facts): it is difficult to generalize even the simplest proposition like A + B = C if a NN is constantly forgetting & relearning each part.

Grokking appears to proceed in two phases: first the memorization of all available training data, then the gradual internal development of an increasingly-refined generalizing algorithm.

With that in mind, we might describe the benefit of the memorization as the "learning facts" phase of pedagogy, and then the generalization phase as "the NN thinking about or pondering the facts it has learned until it *gets* it".
Each minibatch is another 'thought' about the data, as the NN struggles to understand the _gestalt_ of the data as more than a bunch of brute facts, and the gradient descent slowly homes in on the generalizing algorithm.
(And then memorizing *too many* facts can sabotage grokking by 'squeezing out' the generalizing algorithm because too much needs to be memorized---memorizing a few well-chosen examples is more useful than memorizing countless redundant pieces of trivia.)

That true target may be 'very distant' in the loss landscape, and getting there may require an exorbitant amount of data—each data point painfully pushing it ever so slightly out of its comfort zone until one day, it finally is forced by the overwhelming weight of long-tail anomalies to turn into the right model.

The right algorithm will lie in a distant part of the model loss-landscape, but to reach it using a reasonable amount of training data requires the model to travel far (as a kind of grokking/catapult/[super-convergence](#smith-topin-2017)), which is only possible if the model is *so* overparameterized that it can encode smooth paths (like saddle-points, as all [models](https://arxiv.org/abs/1802.10026 "‘Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs’, Garipov et al 2018") [are](https://arxiv.org/abs/1803.00885 "‘Essentially No Barriers in Neural Network Energy Landscape’, Draxler et al 2018") [linear](https://arxiv.org/abs/1912.05671 "‘Linear Mode Connectivity and the Lottery Ticket Hypothesis’, Frankle et al 2019") [mode](https://arxiv.org/abs/1611.01540 "‘Topology and Geometry of Half-Rectified Network Optimization’, Freeman & Bruna 2016") [connected](https://arxiv.org/abs/2104.14421#google "‘What Are Bayesian Neural Network Posteriors Really Like?’, Izmailov et al 2021") [seemingly](https://arxiv.org/abs/2110.06296#google "‘The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks’, Entezari et al 2021")) and it can ensemble over extremely large families of models which 'blur out' to a smooth abstraction of the posterior.
And then high-learning-rate training is critical to kick it along the [frequent saddle-points & plains](https://arxiv.org/abs/1406.2572 "‘Identifying and attacking the saddle point problem in high-dimensional non-convex optimization’, Dauphin et al 2014") that slow down optimization, oscillating between high learning rates to escape the current local optimum and lower learning rates to consolidate the gains and find the new escape route.
These models inherently require long serial training with many time-steps, and cannot be easily 'parallelized' or absorb large amounts of information, and benefit from many long periods of inactivity ('sleep') to globally repair the damage from learning or 'catch up' on backlogged steps.
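
To make the 'kick' concrete, here is a minimal sketch of a catapult on the smallest model I can think of: a two-parameter linear 'network' `f = u * v` fit to a target of 0, in the spirit of the two-layer linear toy models used in the catapult literature (eg. Lewkowycz et al 2020). The initialization, learning rates, and step count are illustrative assumptions, and `u*u + v*v` is used as the curvature (sharpness) proxy. Below the usual stability limit of `2 / curvature`, gradient descent just descends; between `2 / curvature` and `4 / curvature`, the loss first climbs, the curvature collapses, and the run settles in a much flatter region.

```python
def train(u, v, lr, steps):
    """Gradient descent on L = (u*v)**2 / 2; returns per-step (loss, curvature)."""
    history = []
    for _ in range(steps):
        f = u * v                                  # model output; the target is 0
        history.append((0.5 * f * f, u * u + v * v))
        # Simultaneous update, using dL/du = f*v and dL/dv = f*u.
        u, v = u - lr * f * v, v - lr * f * u
    f = u * v
    history.append((0.5 * f * f, u * u + v * v))
    return history

# Illustrative (assumed) initialization: initial curvature is 2.0**2 + 0.1**2 = 4.01.
u0, v0 = 2.0, 0.1
small = train(u0, v0, lr=0.1, steps=200)   # lr < 2/curvature: ordinary descent
large = train(u0, v0, lr=0.7, steps=200)   # 2/curvature < lr < 4/curvature: catapult

for name, run in (("small LR", small), ("large LR", large)):
    peak = max(loss for loss, _ in run)
    print(name, "peak loss:", peak, "final loss:", run[-1][0], "final curvature:", run[-1][1])
```

The small-LR run converges almost where it started in curvature terms, while the large-LR run's loss rises well above its starting value before converging at a curvature below `2 / lr`. This is only a cartoon of the mechanism, not evidence that it scales.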

Such a model will memorize little of its training data, *because* that would require rigid, fragile, precise parameters—but those parameters all need to be recycled and explore strange new model loss-landscapes in order to eventually arrive at the promised land.
So during training, and even afterwards, such a model will forget and perform badly on benchmarks that reward memorization (such as declarative knowledge)—even though it will avoid adversarial examples (because none of its boundaries are extremely low-parameter linear lines dependent on [inhumanly](https://arxiv.org/abs/2302.05442#google "‘Scaling Vision Transformers to 22 Billion Parameters’, Dehghani et al 2023") high-frequency [texture-biased](https://arxiv.org/abs/1911.09071 "‘The Origins and Prevalence of Texture Bias in Convolutional Neural Networks’, Hermann et al 2019") [non-robust features](https://arxiv.org/abs/1905.02175 "‘Adversarial Examples Are Not Bugs, They Are Features’, Ilyas et al 2019"){#non-robust-features}) and will generalize well to hard problems (which by definition make up little of the standard benchmarks) by learning all sub-skills and achieving ['slingshot generalization'](https://arxiv.org/abs/2307.15936 "‘A Theory for Emergence of Complex Skills in Language Models’, Arora & Goyal 2023") by mastering all combinations of skills, and resolve many issues with contemporary DL.

Such models will be hard to discover because of the use of early stopping and the general greediness of DL training (and R&D in general), even though the core ideas are well-known and have large associated literatures: like the original DL scaling ideas, the payoff of catapulting simply takes too long to come.

This intelligence is not that hard to evolve^[In fact, this sort of delayed or 'deceptive' payoff is an optimization task on which [evolutionary strategies](https://arxiv.org/abs/1712.06567#uber "‘Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning’, Such et al 2017") [are competitive](https://arxiv.org/abs/1712.06560#uber "‘Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents’, Conti et al 2017") with gradient-based methods, because they optimize the net fitness over a lifetime; it does not matter to evolution how useless an infant looks on benchmarks (yielding a deceptive gradient) if the converged adult has higher fitness.], but it usually is not worthwhile.
So it falls into an uneasy ecological niche: such robust generalization is *not* useful for many animals compared to simpler, cheaper forms of learning, such as association or imitation, or [hardwiring directly](!W "Baldwin effect") into genetics.
It can only pay back its exorbitant costs if the environment rewards robust generalization appropriately by providing enough high-payoff opportunities which are predictable but not *too* predictable.

### Hardware {.collapse}

<div class="abstract-collapse">A serial catapulting regime would render extremely large GPU-clusters much less useful, as they will not be able to step through each iteration much faster; it would instead place a high premium on more exotic hardware like Cerebras chips, which can execute a training step in a small fraction of the time, and hence, wallclock.</div>

[**See main comment**](https://www.reddit.com/r/gwern/comments/1eyqu0s/hardware_hedging_against_scaling_regime_shifts/){.include-annotation .redirect-from-id}

Incidentally, the use of low-latency hardware would also open up more exotic neural net architectures like [AUNN/IFNN](/aunn "‘Absolute Unit NNs: Regression-Based MLPs for Everything’, Gwern 2023").

## Training a Catapulted LLM

<div class="admonition note">
<div class="admonition-title">What Is the *Largest* Useful LLM?</div>

LLMs are more sample-efficient the larger they are.
At what parameter scale does this stop holding true?
And why don't we know the answer to this question already?
</div>

What would a 'human-sized LLM' (or **HLLM**) look like?

First, it would need to be highly overparameterized relative to the problem as a whole, in order to provide a smoothly-connected loss landscape.
Second, because it is so overparameterized, standard hyperparameters will result in overfitting but not catapulting; heavy regularization will be required, and weight decay is probably not enough (as it doesn't result in much model movement) but a high LR schedule may be adequate regularization given prior art like [super-convergence](#smith-topin-2017) & human parallels.

The final full-scale HLLM would probably look something like a dense Transformer or [MLP](#mlps) which is multiple orders of magnitude larger than currently trained^[In the terminology of [Huang et al 2024](https://arxiv.org/abs/2402.15175 "Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition"), we normally train models in the "progression"/"semi-grokking" regimes (underfitting), but by increasing the model size, we can move it to the "grokking" regime, where it may memorize but then generalizes much better than progressing models; and if we filter the data stringently enough to minimize the ability to memorize, this 'ratio' will move the crossover point much earlier in training (see also Huang et al 2024's observation about how adding a pure-memorization task "poisons" grokking, which parallels [Dohmatob et al 2024](https://arxiv.org/abs/2402.07043 "A Tale of Tails: Model Collapse as a Change of Scaling Laws") about noisy/low-quality synthetic data).] (so >10-trillion parameters, possibly >100-trillion^[Such large networks have been demonstrated to run on GPU clusters, although they have all been [mixture](https://arxiv.org/abs/2110.03888#alibaba "‘M6--10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining’, Lin et al 2021")-[of](https://arxiv.org/abs/2112.06905#google "‘GLaM: Efficient Scaling of Language Models with Mixture-of-Experts’, Du et al 2021")-[experts](https://spectrum.ieee.org/china-us-militarized-ai "‘U.S. versus China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race’, Smith 2021") or [recommender models](https://arxiv.org/abs/2111.05897 "‘Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters’, Lian et al 2021").]), in order to be highly-overparameterized compared to the full distribution.
(In grokking papers or the isoperimetry estimates, the NNs are generally several OOMs larger than 'reasonable', so if we ballpark GPT-4-level models at ~1t, we would weakly expect the catapulting regime to be ~100t.)
The NN should probably be a "skinny" one, [emphasizing depth over width](https://arxiv.org/abs/2405.19454 "‘Deep Grokking: Would Deep Neural Networks Generalize Better?’, Fan et al 2024"); a long-standing trend in DL is that 'wide' networks tend to memorize more heavily and poorly express more algorithmic/computational reasoning, and that DL NNs tend to look rather un-biological in using so little recurrence/iterative computation or composing reasoning (while RNNs often do better than Transformers in specific cases, generalizing better despite their overall inferiority).^[Whether RNNs can grok/catapult, and if that could help them leapfrog Transformers, is an interesting question that I don't believe has been tried much, given the focus on fully-connected/CNN/Transformers.]
Very deep networks tend to be avoided due to overfitting or instability rendering their theoretical advantages moot, but catapulting would potentially fix that, and benefit from the inductive biases.
This is consistent with the most recent work on sample-efficient LLMs, like [Kim et al 2025](https://arxiv.org/abs/2509.14786 "Pre-training under infinite compute") or [NanoGPT Slowrun](https://qlabs.sh/slowrun), which emphasize increasing parameter size (eg. via ensembling) and heavy regularization over many epochs.

The NN is trained on small text corpuses at [BabyLM](https://arxiv.org/abs/2301.11796 "‘The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus’, Warstadt et al 2023") scale (~0.1b words).
One benefit here is that because the sample size has to be limited, that means it can be filtered extremely heavily for quality/deduplication: because the data distribution decides whether a model will meta-learn, the dataset should be as diverse as possible, and penalize memorization as much as possible (see [Wang et al 2024](#wang-et-al-2024-ratio)).
Each minibatch should sample as many distinct datapoints as possible, and likewise, diversity should be maximized across batches, so each catapulting step is catapulting for as many 'skills' or 'capabilities' simultaneously as possible.
This will delay their learning, because only the bare minimum of data is available per skill, so there is no overkill, and they will tend to be learned simultaneously---which leads to 'emergence' as multi-step processes [suddenly start](https://arxiv.org/abs/2311.12997 "‘Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks’, Ramesh et al 2023") [becoming possible](https://arxiv.org/abs/2310.09336 "‘Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task’, Okawa et al 2023").
(Because humans can replay memories, both in short-term & long-term memory and through the hippocampus, it would be reasonable to do multiple epochs if there is not enough high-quality diverse data for single-epoch training.)
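
As a rough sketch of what 'maximally diverse minibatches' could mean in practice, the following assumes each datapoint already carries some `skill` label (a hypothetical stand-in for whatever clustering, tagging, or embedding-based grouping is actually available). Each batch takes at most one example per skill, drawing first from the skills with the most data left, so that no skill is over-represented within a batch and diversity persists across batches.

```python
import random
from collections import defaultdict

def diverse_batches(examples, batch_size, seed=0):
    """Yield batches of (skill, text) pairs with no skill repeated within a batch.

    `examples` is a list of (skill_label, text) pairs; the labels are an
    assumption of this sketch, not something the corpus provides for free.
    A batch is shorter than `batch_size` when fewer skills remain.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for skill, text in examples:
        pools[skill].append(text)
    for texts in pools.values():
        rng.shuffle(texts)

    while pools:
        # Largest remaining pools first; ties are broken randomly.
        skills = sorted(pools, key=lambda s: (-len(pools[s]), rng.random()))
        batch = []
        for skill in skills[:batch_size]:
            batch.append((skill, pools[skill].pop()))
            if not pools[skill]:
                del pools[skill]           # this skill is exhausted
        yield batch

examples = [("add", "1+1"), ("add", "2+2"), ("mul", "2*3"), ("sub", "5-2")]
for batch in diverse_batches(examples, batch_size=3):
    print(batch)
```

The one-example-per-skill rule is deliberately strict; a softer version would merely cap repeats per batch.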

The catapulting itself is due to a cyclical learning rate schedule like [super-convergence](#smith-topin-2017), perhaps combined with heavy weight decay.^[Or would it make more sense to instead have a *constant* large learning rate, and instead cycle the *weight decay*, to let weights grow and then brutally prune them back, to increase the analogy to ["sleep"](#sleep)? [Smith 2022](https://arxiv.org/abs/2202.08835 "General Cyclical Training of Neural Networks") claims that a high-learning-rate+cyclical-WD schedule works & may increase sample-efficiency.]
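
A minimal sketch of the two schedule variants under discussion: a triangular cyclical learning rate with constant heavy weight decay, and the reverse (constant high learning rate, cyclical weight decay). The cycle length and all numeric ranges are placeholder assumptions, not tuned or recommended values; the triangular shape is the simplest cyclical form and a real run might use a one-cycle or cosine variant instead.

```python
def cyclical(step, cycle_len, low, high):
    """Triangular cycle: ramp low -> high over the first half, then back to low."""
    phase = (step % cycle_len) / cycle_len      # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)          # rises 0 -> 1, then falls 1 -> 0
    return low + (high - low) * tri

def schedule(step, cycle_len=1000, cycle_lr=True):
    """Return the hyperparameters for `step` (all ranges are assumed placeholders)."""
    if cycle_lr:
        # Variant 1: cyclical learning rate, constant heavy weight decay.
        return {"lr": cyclical(step, cycle_len, 1e-4, 1e-2), "weight_decay": 0.1}
    # Variant 2: constant high learning rate, cyclical weight decay
    # (weights grow while decay is low, then get pruned back as it rises).
    return {"lr": 1e-2, "weight_decay": cyclical(step, cycle_len, 0.0, 0.5)}

for step in (0, 250, 500, 750, 1000):
    print(step, schedule(step), schedule(step, cycle_lr=False))
```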

So, what would happen when training our oversized LLM on our highly-diverse memorization/repetition-purified data with cyclical schedules for heavy regularization?
We would observe the classic cyclical training loss behaviors of spikes followed by reaching new lows, but with stagnant performance on the truly held-out data, as the LLM goes through the memorization phase and eventually reaches a new regime where it begins to transition from memorization to generalization over many tasks simultaneously, which will then suddenly 'emerge'.
Each cycle builds up a new set of atomic skills, dependent on the skills learned in previous cycles (analogous to developmental phases).

### Prototyping with Arithmetic

It might be feasible to test LLM catapulting on small-scale tasks where current LLMs clearly generalize poorly, like arithmetic.
Arithmetic is about the smallest, simplest, easiest-to-generate problem that LLMs still fail on in oddly brittle ways, so it's a great testbed.

Arithmetic *is* learnable with [appropriate formatting](https://arxiv.org/abs/2307.03381 "‘Teaching Arithmetic to Small Transformers’, Lee et al 2023") by small cheap LLMs, but standard LLMs (trained on natural arithmetic text data) continue to not implement *true* arithmetic, even at the capability frontier like [GPT-4](https://x.com/mattshumer_/status/1636512490195501056), and arithmetic problems are easy to generate, benchmark, understand, and even do neural net interpretability on; so one could pilot catapulting on a pretrained LLM by looking for training schedules which make it find true arithmetic much faster than standard finetuning does.

<div id="catapulting-scaling-law-sweep" class="collapse"><span class="abstract-collapse">More specifically, one would filter for 'hard' arithmetic problems and then search for catapult training recipes which reduce the exponent in the scaling law compared to the 'standard' training recipe.</span>
If one used regular arithmetic problems, the gain on rare hard problems—the sort which expose the fact that the LLM has only learned a collection of partial heuristics, approximations, and memorized answers—would be hopelessly masked by the average case.
(It is entirely possible that before perfect arithmetic performance generalization, a catapulted LLM, which has mostly succeeded in learning true arithmetic, would be outperformed on average by the regular LLM which has memorized as much as possible.)

So one would do something like filtering stringently for the 0.1% (since arithmetic is so big) of the hardest arithmetic problems (as evaluated by an existing LLM or by testing for generalization past _n_ digits), and then use *that* as the heldout data that one runs scaling law sweeps on for all training recipes.
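
A sketch of how such a held-out set might be built for addition. Scoring difficulty with an existing LLM is the better option; as a cheap stand-in, this uses the length of the longest carry chain as a hardness proxy, which is purely an assumption of the sketch. The second helper implements the "generalization past _n_ digits" split. The problem count, operand range, and function names are all illustrative.

```python
import random

def carry_chain(a, b):
    """Length of the longest run of consecutive carries in a + b (base 10)."""
    longest = run = carry = 0
    while a or b or carry:
        carry = (a % 10 + b % 10 + carry) // 10
        run = run + 1 if carry else 0
        longest = max(longest, run)
        a, b = a // 10, b // 10
    return longest

def hardest(problems, fraction=0.001):
    """Keep the hardest `fraction` of problems under the proxy score."""
    ranked = sorted(problems, key=lambda p: carry_chain(*p), reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

def split_by_digits(problems, n):
    """Train on operands of at most n digits; hold out anything longer."""
    train = [p for p in problems if max(p) < 10 ** n]
    heldout = [p for p in problems if max(p) >= 10 ** n]
    return train, heldout

rng = random.Random(0)
problems = [(rng.randrange(10 ** 8), rng.randrange(10 ** 8)) for _ in range(20_000)]
hard_eval = hardest(problems)              # the 20 problems with the longest carry chains
print(len(hard_eval), carry_chain(*hard_eval[0]))
```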

The scaling laws would ignore the average-case performance from the training runs, and also the constant factor on the hard data, and look for changes in the *exponent* of the scaling laws for the hard data.

Ideally, one would find something like a training recipe where after many epochs, the catapulted small LLMs are improving more rapidly on the hard data than the standard LLMs are, and that even if the catapulted LLMs are substantially worse everywhere, the more rapid improvement means that at some point "the curves cross", and the catapulted LLMs are superior.
This would then be proof of concept that catapulting is not merely possible on a more complex problem like arithmetic that continues to challenge even SOTA LLMs, but that it changes the *scaling laws*.
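
The comparison can be sketched as a log-log fit of hard-set loss against scale, followed by solving for where the two fitted curves cross. The numbers below are synthetic, invented purely to exercise the code (they are not results), and the choice of 'scale' axis (compute, epochs, or parameters) is left open as an assumption.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return math.exp(my - slope * mx), -slope

def crossover(a1, b1, a2, b2):
    """Scale x at which a1 * x**(-b1) == a2 * x**(-b2); None if exponents match."""
    if b1 == b2:
        return None
    return (a1 / a2) ** (1.0 / (b1 - b2))

# Synthetic hard-set losses: the 'catapult' recipe has a worse constant factor
# but a better exponent, so it loses everywhere in the observed range.
scales = [10.0 ** k for k in range(15, 20)]
standard = [2.0 * s ** -0.05 for s in scales]
catapult = [20.0 * s ** -0.10 for s in scales]

a_std, b_std = fit_power_law(scales, standard)
a_cat, b_cat = fit_power_law(scales, catapult)
x_cross = crossover(a_cat, b_cat, a_std, b_std)
print("exponents:", b_std, b_cat, "curves cross near scale", x_cross)
```

Only the fitted exponents and the implied crossing point matter here; the constant factors are deliberately ignored when ranking recipes.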

Then to verify this result, one could apply interpretability research to it (like [Zhong et al 2023](#zhong-et-al-2023)): the final catapulted LLM should clearly express a valid arithmetic algorithm where the standard LLM fails to, and there should be phase transitions across the catapulted LLM's checkpoints from standard LLM-like pseudo-arithmetics to true arithmetic.

With this proof of concept in hand, one can work on further optimizing the catapult training, and start attempting to infer what catapult training methods might scale up to SOTA LLMs like a GPT-4.
</div>

### MLPs {.collapse}

<span id="mlp"></span> <div class="abstract-collapse">Why might MLPs be especially suited here?
Because extreme regularization may help fix their persistent overfitting problems and provide superior scaling.</div>

Use of sparsity like mixture-of-experts would tend to reduce the effective parameter-size and the connectivity of the NN landscape, and so would be somewhat risky.
However an intriguing possibility is that catapulting might make [fully-connected MLP architectures](/doc/ai/nn/fully-connected/index#convolution-learning) viable.

MLP architectures are much simpler, more general, parameter-efficient, more hardware-friendly than CNNs/Transformers, and look like the logical next candidate for [Bitter-Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html "‘The Bitter Lesson’, Sutton 2019")-ing the status quo NN architecture—but MLPs still fall far behind.
Why?

The main reason seems to be that they are *too* powerful, and overfit.
(They are to Transformers/CNNs as those architectures are to humans, one might say.)
[Zhao et al 2021](https://arxiv.org/abs/2108.13002#microsoft "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP") & [Bachmann et al 2023](https://arxiv.org/abs/2306.13575 "Scaling MLPs: A Tale of Inductive Bias") demonstrate that MLPs scale well and can be competitive if enough regularization (like bottleneck layers) is added.
In particular, the more regularization (and consequently, generalization downstream), the more they learn sensible convolution-like features, rather than extremely high-frequency & [non-robust features](#non-robust-features).

However, it is still unclear what "regularization" would preserve all the MLP benefits without crippling the architecture—and catapulting fits the bill! The high LR & catapult trajectory would suppress those MLP pathologies the same way it suppresses the milder versions in Transformers. (See [Liu et al 2022](#omnigrok), and [Fan et al 2024](#fan-et-al-2024) on how narrow deep MLPs seem to grok in unusual & better ways; likewise, [He et al 2024](https://arxiv.org/abs/2406.02550 "Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks") on deeper Transformers.)

### Prototyping with Image Classification

For non-LLMs like CNNs/MLPs trained on CIFAR-10/CIFAR-100 or ImageNet-1k (MNIST being too trivial to be convincing), we would similarly expect more human-like classifications.
In image classification, it's harder to define 'hard' than arithmetic, and the standard NN accuracy is so high that simply filtering for errors might not yield enough 'reasonable' errors at this point.

So we would instead use one of the many 'post-ImageNet' image datasets designed to stress-test classifiers, like
[ImageNet-A](https://arxiv.org/abs/1907.07174 "‘ImageNet-A: Natural Adversarial Examples’, Hendrycks et al 2020"), [ImageNet-C](https://arxiv.org/abs/1903.12261 "‘Benchmarking Neural Network Robustness to Common Corruptions and Perturbations’, Hendrycks & Dietterich 2019"), [ImageNet-Hard](https://arxiv.org/abs/2304.05538 "‘ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification’, Taesiri et al 2023"), [ImageNet-R](https://arxiv.org/abs/2006.16241 "‘The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization’, Hendrycks et al 2020"), & [ImageNet-Sketch](https://arxiv.org/abs/1905.13549 "‘ImageNet-Sketch: Learning Robust Global Representations by Penalizing Local Predictive Power’, Wang et al 2019").
These would be used as the target in scaling law sweeps [as described previously](#catapulting-scaling-law-sweep).

#### Adversarial Robustness

A possible alternative would be to look at *[adversarial robustness](https://arxiv.org/abs/2006.14536#google "‘Smooth Adversarial Training’, Xie et al 2020")* instead of a standard benchmark.^[One could try to test this with LLMs too, but adversarial examples are not as well understood or easy with LLMs compared to the standard CNN/ViT image classification adversarial examples, so would not be as good for prototyping.]

If the [dimpled manifold](#dimpled-manifold) thesis is correct, then I interpret it as predicting that: while HLLMs might be intrinsically robust per [isoperimetry](#adversarial-examples-isoperimetry), so too might be tiny models, as long as they found the *right* manifold which did not require the "dimples" (ie. unprincipled, ad hoc, memorized tweaks to the decision boundaries to get the right answer).
This true manifold would potentially be found by catapulting.

So if one successfully catapulted a small CNN, even on CIFAR-10, it might demonstrate adversarial robustness *and* generalization (rather than the usual small NN choice of either robustness or generalization).
This might be especially the case [with MLPs](#mlps), and while an additional research risk, the efficiency of MLPs would allow extensive testing on just 1 GPU (eg. [Bachmann et al 2023](#bachmann-et-al-2023)).

No theory of adversarial examples other than "non-robust features" & "dimpled manifold" predicts that small models might be adversarially robust if simply trained for a long time with an odd learning rate schedule, so any large improvement in adversarial robustness is an important finding.
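
For concreteness, here is a minimal sketch of the measurement itself: accuracy under a one-step sign-gradient (FGSM-style) attack as a function of the perturbation budget `eps`. FGSM is swapped in here as the simplest standard attack, and a hand-written logistic-regression classifier stands in for the CNN/MLP so the input gradient can be written analytically; the weights and datapoints are toy assumptions. A real experiment would use an autograd framework and a stronger multi-step attack.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step L-infinity attack: move x by eps along the sign of dLoss/dx."""
    p = predict(w, b, x)
    # For cross-entropy loss on a logistic model, dLoss/dx = (p - y) * w.
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def robust_accuracy(w, b, data, eps):
    """Fraction of (x, y) pairs still classified correctly after the attack."""
    correct = 0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, eps)
        correct += int((predict(w, b, x_adv) > 0.5) == bool(y))
    return correct / len(data)

# Toy setup: two large-margin points and two small-margin points.
w, b = [1.0, -1.0], 0.0
data = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([0.3, 0.0], 1), ([0.0, 0.3], 0)]
for eps in (0.0, 0.2):
    print(eps, robust_accuracy(w, b, data, eps))
```

The quantity to track across checkpoints is the whole accuracy-versus-`eps` curve, not a single budget.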

### Prior Art

There is essentially no research on training >10t-parameter LLMs, or cyclical LRs on large LLMs (as opposed to small [GPT-2](/doc/ai/nn/transformer/gpt/2/2019-radford.pdf#openai "‘Language Models are Unsupervised Multitask Learners’, Radford et al 2019")-scale ones).

Historically, this is due to a mix of academic fashion/prejudices and capability gains by smaller models.
Work on training highly-overparameterized LLMs, or on the [equiparameterization regime](https://arxiv.org/abs/2210.16859 "‘A Solvable Model of Neural Scaling Laws’, Maloney et al 2022")[^overparameterization], was largely killed by the release of the Chinchilla paper, which provided the perfect excuse for everyone to immediately halt parameter-scaling (since it no longer led *immediately* to SOTA results), as they have always wanted to, for a mix of good and bad reasons.
Similarly, the excuse of "optimizing for inference-optimality" by overtraining small models has become popular, by optimizing scaling laws which assume that the trained model will be naively deployed as-is, without the actual [pruning](/doc/ai/nn/sparsity/pruning/index), [quantizing](/doc/ai/nn/sparsity/low-precision/index), & [distilling](/doc/ai/nn/sparsity/knowledge-distillation/index) everyone does anyway.

[^overparameterization]: One of the reasons to favor overparameterized models is that an extreme level of parameterization (compared to Chinchilla) keeps showing up in scaling theory papers as optimal, and in particular, a 1:1 equiparameterization ratio---one datapoint, one feature.

    This seems especially intuitive if datapoints [are optimally selected](/tool-ai#active-learning-sample-efficiency) (or synthesized), and can in theory achieve much more favorable scaling like exponential decrease in error.

This means that the field of high-energy DL is wide-open: this proposal will be highly unpopular, it is much less likely than usual that anyone would independently investigate this direction, and they will be discouraged by poor preliminary results when the training runs appear to have simply failed (because of subtle bugs, poor hyperparameters, or simply inadequate training time to catapult into a region where better results can be benchmarked—assuming the right benchmarks are being used to begin with).

### Benchmarking

This is such a different training regime that previous scaling law sweeps are inapplicable.
Further, the goal here may be one that existing benchmarks actively mislead on.
They test mostly common easy questions: the sort that 'direct fit'-like thinking does best on, by definition.

So a major question here would be whether scaling laws should target perplexity as usual, or if they need to target a custom benchmark which tries to test human-like generalization rather than memorization.

Given the difficulty we have in constructing non-trivia-heavy benchmarks which existing LLMs can't beat, this might be one of the hardest parts!
But I suspect that after training a reasonable HLLM, possibly via trial-and-error, and interacting with it for a while to get an idea of how it acts qualitatively, the right metric might become more obvious.

The benchmarks might include adversarial robustness, hard negative mining (ie. the hardest problems that the best LLMs still get wrong), meta-learning, checking how much sets of models add to an ensemble's accuracy[^ensemble], or use a metric which rewards performance while penalizing memorization of training data.

[^ensemble]: To the extent that normal models are all trapped in the same nearby basins, they will make the same sorts of errors and ensembling will not improve metrics much.

    To the extent that they are exploring the full loss landscape and finding far away models, they should have much less correlated errors and so ensemble better than expected.
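
The ensembling check can be sketched with two simple statistics: how much a majority-vote ensemble beats the average member, and how much two models' error sets overlap. The prediction lists below are toy assumptions chosen to show the two extremes (identical errors versus disjoint errors); the function names are mine.

```python
from collections import Counter

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def majority_vote(all_preds):
    """Combine per-model prediction lists by plurality vote on each example."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]

def ensemble_gain(all_preds, labels):
    """Ensemble accuracy minus the mean accuracy of its members."""
    mean_single = sum(accuracy(p, labels) for p in all_preds) / len(all_preds)
    return accuracy(majority_vote(all_preds), labels) - mean_single

def error_overlap(preds_a, preds_b, labels):
    """Of the examples either model gets wrong, the fraction both get wrong."""
    wrong_a = {i for i, (p, y) in enumerate(zip(preds_a, labels)) if p != y}
    wrong_b = {i for i, (p, y) in enumerate(zip(preds_b, labels)) if p != y}
    union = wrong_a | wrong_b
    return len(wrong_a & wrong_b) / len(union) if union else 0.0

labels = [1, 1, 1, 1, 1, 1]
same_basin = [[0, 0, 1, 1, 1, 1]] * 3                       # identical errors
far_apart = [[0, 0, 1, 1, 1, 1], [1, 1, 0, 0, 1, 1], [1, 1, 1, 1, 0, 0]]
print(ensemble_gain(same_basin, labels), ensemble_gain(far_apart, labels))
```

Every member has the same individual accuracy in both toy sets; only the set with decorrelated errors gains anything from ensembling.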

### Capabilities

The final model should generalize much better—possibly achieving the [Nyquist learner](#nyquist-learner) limit of *perfectly* modeling the true (non-dimpled?) latent manifold, and thereby constructively answering Rosenfeld's question about how a band-limited Nyquist learner could be implemented in current NN architectures.

Per the [lottery ticket hypothesis](https://arxiv.org/abs/1803.03635 "‘The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks’, Frankle & Carbin 2018"), once the true generalizing algorithm has been found, and has been further trained as desired (perhaps on some large trivia-heavy corpus), we can prune it down to a much smaller, faster, more feasible-to-use model, in an example of ["train large, then compress"](https://arxiv.org/abs/2002.11794 "‘Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers’, Li et al 2020").
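
As a sketch of the compression step, the following does plain iterative magnitude pruning on a flat list of weights: prune a fraction of the surviving smallest-magnitude weights, retrain, repeat. This omits the weight-rewinding step of the lottery-ticket procedure, and `retrain` is a hypothetical hook (a no-op by default) standing in for whatever finetuning happens between rounds.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask zeroing the `sparsity` fraction of smallest-|w| weights."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def iterative_prune(weights, rounds, per_round, retrain=lambda w, mask: w):
    """Repeatedly prune `per_round` of the surviving weights, retraining in between."""
    mask = [1] * len(weights)
    for _ in range(rounds):
        alive = [i for i, m in enumerate(mask) if m]
        sub_mask = magnitude_mask([weights[i] for i in alive], per_round)
        for i, keep in zip(alive, sub_mask):
            mask[i] = keep
        # Apply the mask, then let the (hypothetical) retraining hook adjust survivors.
        weights = retrain([w * m for w, m in zip(weights, mask)], mask)
    return weights, mask

weights = [0.5, -0.1, 2.0, 0.05, -1.5, 0.3, 0.01, -0.7]
pruned, mask = iterative_prune(weights, rounds=2, per_round=0.5)
print(mask, pruned)
```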

Given the capacity of even small models, it may be possible to finetune these smaller models on arbitrarily large amounts of data to exploit [the scaling laws of transfer](https://arxiv.org/abs/2102.01293#openai "‘Scaling Laws for Transfer’, Hernandez et al 2021").
Because they start with the human-like generalizing prior, they will learn & memorize the data as appropriate, but without the pathologies of 'direct fit' to the data as when trained from scratch on the same data—thereby achieving the best of both worlds.

#### Economic Implications

If catapulting LLMs is enough to close the gaps with humans and solve AGI, then their economics simply turn into a discussion of AGI; but what if they are much better, yet still far from AGI?

As of 2024, the economics of the largest SOTA LLMs are poor, because model capabilities can be so easily cloned by using the cheap APIs to create large training corpuses, enabling behavior-cloning of the superior models.
This makes it possible to 'clone' the most expensive multi-billion-dollar LLM into a cheap or even FLOSS LLM (justified by [commoditize-your-complement dynamics](/complement "‘Laws of Tech: Commoditize Your Complement’, Gwern 2018")), eroding margins within months of release, even if the cloned model is broadly inferior.
AI scaling companies are trapped in a race to spend the most capital on ever-larger training runs & deploy dirt-cheap distilled versions to gain market-share before the cloners erode their edge, in the hope that they can achieve some network effect or 'escape velocity' (like creating AGI).

For a catapulted LLM, however, this would still be the case, but much more so.
For catch-up players, catapulting eliminates data-constraints, but the compute-cost (and difficulty) of training a catapulted LLM may be insuperable: the training run itself might not be too expensive, but the trial-and-error and proprietary trade-secret knowledge of the special sauce of how *exactly* to make catapulting work may be extremely compute-expensive.

And they can continue to do the cloning as usual, but whereas in 2024, the cloned models are not that much inferior and share all the same weaknesses as the SOTA model they are cloning (suffering from adversarial examples, confabulations, bizarre simple errors and brittleness etc.), in a catapult scaling regime, this would not be the case: there would be a clear qualitative difference.
In fact, the cloned models may even be superior on many benchmarks, because they are trained so heavily the 'standard' low-learning-rate way and enjoy all the benefits of *that* approach, but still be disfavored by users, who can't afford their unreliability, confabulations, and blandness.

However, the catapulters remain able to do all the usual tricks and can create their own superior 'clone' models.
(And these clone models can themselves presumably be catapulted for regimes or tasks where there is insufficient data to train the low-learning-rate direct-fit way.)
Given the economics of scaling DL, where the compute-cost can be dropped by extremely large amounts while amortizing the initial training cost over many users who provide high usage of GPUs, this further means the first-mover in catapulting potentially can drop prices enough to discourage competition.

So a catapult LLM creator, if the catapulted LLM has enough of a human-like edge in reliability & quality, may be able to maintain high margins much longer than 2024 LLMs do.

### Alignment

Speculating further, on the premise of the above: most capability improvements do not help AI alignment much.
This is because they either are specific to a capability in a narrow domain, or they enhance broad capabilities but only in a 'brittle' non-generalizing way, which can be highly economically valuable but doesn't help 'alignment' much, because we are interested in alignment not for its current narrow in-domain behavior but for how it generalizes.
We need the kind of alignment which doesn't help a chatbot *say* socially-acceptable things today (or do the right things for the wrong reasons), but which would make chatbots in charge of civilization *do* socially-acceptable things in the future (and do the right things, for the right reasons, so they will keep on doing the right things indefinitely).

But this catapult or Nyquist learner may be the exception, because what it helps with is true generalization.
A catapulted LLM trained on large amounts of morality-related text has not learned purely an assemblage of memorized fragments, heuristics, tricks, and statistical associations, with any underlying algorithm begrudgingly forced by scaling, or learned deception & situated reasoning to maximize a reward, or been adversarially selected by use of 'interpretability' techniques to learn to think in nonlinear opaque ways which do not raise any red flags; instead, it has learned the underlying value-manifold the hard way, like a highly-intelligent, grown, moral adult human has.

Even if the capability improvement turns out to be beaten by standard LLM scaling approaches (like simply brute-force annotating every error), true generalization would be invaluable to alignment.

It does not solve the alignment problem in generality—but it might provide a way to create a **genuinely moral AI**, and that is a good starting point.

### Interpretability

Further, because the native neural net way of thinking is a large complicated pastiche of memorization & heuristics, while the overparameterized grokked LLM has distilled out an algorithmic core for the key tasks, such a catapulted LLM ought to be much more useful for *interpretability* work.
We can more easily validate that they are genuinely moral AI.

Once the overparameterization has been removed, what is left should be natively much closer to simple, verifiable, interpretable, and *extractable* algorithms.
These can then be formally analyzed & verified (possibly with the assistance of the genuinely-moral AI LLMs, which are still risky but not too risky if they are probably moral to begin with, and we confine them to carefully-checked formal outputs and tasks like algorithmic equivalence of a sparsified neural network and extracted sub-algorithms).

# Appendix
## Dynamic Grokking {.collapse}

<div class="abstract">
An alternative to grokking of full models might be grokking of *specific problems*.
Pretraining might be the cheap but incomplete way of training a brain, followed by invoking a much more expensive "pondering" process which spot-fixes important errors.

How would you grok a specific problem in an LLM?
You might do it by training it repeatedly on the problem, finetuning the model on a very small data sample (ie. dynamic evaluation).

Done repeatedly, this could lead the model to explore deeply in the loss landscape, eventually reaching new basins, generating qualitatively different outputs and even having 'creative breakthroughs'.
</div>

Inducing grokking on a pretraining corpus may be neither desirable nor possible because it would require too much training on all problems.
However, we may still want to grok on *some* problems---particularly hard ones.

At present, there is not really any way to take a hard problem that an LLM fails to solve, and let the LLM "think about it".
If a bunch of inner-monologue completions do not solve it, there's no straightforward way to "ponder".
You cannot spend a lot of money or compute on an important problem to solve it; even [OpenAI's o1](https://openai.com/index/introducing-openai-o1-preview/) appears to have a hard limit, set by its context window size.
There is currently no AI equivalent of a human thinking for a long time, seemingly getting nowhere, but then ([incubation effect](!W)) having an insight and solving it (or at least making useful progress).
Grokking, however, looks an awful lot like "a neural net thinking about a problem for a very long time until it starts to understand it and then starts to have insights & solve it".

What would it look like to try to have an incubation effect in AI?

### Pondering ≠ Tree Search

Usually, such 'search' is imagined as a kind of MCTS-esque tree search, where the human brain runs a deep tree search 'in the background' and eventually it finds a solution and that bursts into consciousness as a eureka moment or [_L'esprit de l'escalier_](!W).

But are humans running some sort of MCTS-esque tree search?
I have my doubts.
A tree search requires expanding many nodes and building a large deep tree---*where* is this all happening in the human brain, with its highly-limited working memory?
If it's so deep and requires days or months of search, where is it stored all that time while being constantly updated & traversed, while being robust to all the other activities we do, like sleep?
And why does this sort of thinking seem to benefit so much from sleep, even short periods? The kind of hippocampal ['experience replay'](https://arxiv.org/abs/2104.04132 "‘Replay in Deep Learning: Current Approaches and Missing Biological Elements’, Hayes et al 2021") we know the brain is doing has a highly spatial dimension and is repetitive, and so doesn't seem too much like we'd expect from a tree search.
Why does it seem to require a consistent level of effort over a long time, often with no progress and just beating one's head against the wall as one goes around in intellectual circles?
If we are so good at it that it can be totally unconscious, why do we find conscious tree search, like thinking through lines of play in chess, so difficult?
Why can't we simply run these searches in the background indefinitely, stop thinking about a problem entirely, and then years later be startled by sudden Nobel-Prize-worthy insights out of the blue?

What this looks like to me is closer to an effect of [*neuroplasticity*](!W): the constant repeated thoughts force neuroplasticity and neurological changes, which are then heavily regularized during sleep; if not done often enough, the benefits disappear, but if done often enough, despite no apparent change in the outputs, the brain keeps changing as it tries to find the optimal 'solution' to compressing/predicting the problem inputs (and all intermediate results of value); eventually the brain has updated enough times to move to a new basin and experiences a phase shift in the solutions it has been futilely trying, and comes up with a valuable novel output, which suddenly appears in consciousness without any apparent cause---because the cause is the low-level neuron changes which are totally inaccessible to conscious introspection.

### Neuroplasticity = Dynamic Evaluation

What would be the closest thing here in LLMs?
Neuroplasticity is equivalent to finetuning, or [**dynamic evaluation**](/doc/ai/nn/dynamic-evaluation/index)---finetuning a model at runtime on the new inputs, which can [greatly boost LLM performance](https://arxiv.org/abs/2403.01518#deepmind "‘Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models’, Rannen-Triki et al 2024").

It has been observed that self-attention is equivalent to gradient descent: it updates the model at runtime by doing a constricted form of gradient descent encoded into the forward pass as "fast" weights computed by the slow weights per-problem.
Enough self-attention is equivalent to gradient descent on the same data... but it becomes increasingly inefficient.
Dynamic evaluation is similar, and hits a middle point between self-attention and fullblown training.
Notably, there is a 3-way tradeoff between model scale, context window scale, and use of dynamic evaluation: dynamic evaluation can boost a small model up to the equivalent of a model which is much larger or where there is much more relevant data in a sufficiently-large context window.^[Note that dynamic evaluation is better than the alternatives in the sense that we can *always* do dynamic evaluation on a model we are using, whereas we usually cannot use a larger model nor put more into the context window. (And if we could, then we could do dynamic evaluation on that one instead.)]

Dynamic evaluation works for supervised learning like image classification, but it was introduced in RNNs for unsupervised learning by next-token prediction, and does not require labels or known answers: eg. you can do dynamic evaluation on the *questions* in a dataset, to improve prediction of the still-unknown answers.
Using dynamic evaluation lets a model learn and think more deeply about inputs, by letting it do multiple implicit gradient descent steps to update itself per problem: the self-attention tries to analyze the inputs, and then their results are distilled into the weights by the dynamic evaluation, which improves the next self-attention pass, and so on---the learning done during the self-attention gets amortized into the model weights, instead of being thrown away only to be re-computed again, taking up time during the next forward pass.
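
To make the mechanics concrete, here is a minimal sketch of dynamic evaluation on a toy bigram softmax model, in pure Python. Everything in it is an illustrative assumption rather than anyone's actual implementation: the function name `evaluate`, the all-zero 'pretrained' weights, the 3-token vocabulary, and the learning rate are all made up. The only point is the loop structure: score each token with the current weights, *then* take a gradient step on it, so no answer is ever seen before it is predicted.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def evaluate(tokens, vocab_size, lr=0.0):
    """Mean next-token loss (nats) of a toy bigram softmax model.

    lr=0 is ordinary static evaluation; lr>0 is dynamic evaluation:
    after scoring each token, take one SGD step on it before moving on.
    """
    # logits[prev][nxt]; all zeros = uniform predictions (assumed stand-in for pretrained weights)
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    total_loss = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(logits[prev])
        total_loss += -math.log(probs[nxt])  # score BEFORE updating: no peeking
        # gradient of cross-entropy w.r.t. the logits is (probs - onehot)
        for k in range(vocab_size):
            grad = probs[k] - (1.0 if k == nxt else 0.0)
            logits[prev][k] -= lr * grad
    return total_loss / (len(tokens) - 1)

tokens = [0, 1, 2] * 20  # a repetitive 'document' the model has never seen
static_loss = evaluate(tokens, vocab_size=3, lr=0.0)
dynamic_loss = evaluate(tokens, vocab_size=3, lr=1.0)
print(static_loss, dynamic_loss)
```

The static model stays at uniform-guess loss for the whole document, while the dynamically-evaluated copy adapts to the document's regularities as it reads; a real LLM would do the same thing with backpropagation through the full network instead of a lookup table.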

Dynamic evaluation is usually done as a single pass for convenience & efficiency, but nothing inherently requires that, and for difficult inputs, you might want to run multiple steps of dynamic evaluation.
This is because two gradient descent steps are not necessarily the same as one step with a larger learning rate, and the more iterative "looks" at the data, the more the model can "change its mind" in a way that a single step can't.
(An analogy would be experience replay, where a RL model can learn from a datapoint [dozens of times](https://openreview.net/forum?id=OpC-9aBBVJe "‘Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier’, D’Oro et al 2023"), or [expert iteration](https://arxiv.org/abs/1712.01815#deepmind "‘Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm’, Silver et al 2017") like [AlphaZero](/doc/reinforcement-learning/model/alphago/2018-silver.pdf#deepmind "‘A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play’, Silver et al 2018"): the model and the search bootstrap each other up, which is highly non-stationary, where one alone would simply [sigmoid hard](https://arxiv.org/pdf/2104.03113#page=5).)
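
The point about two gradient descent steps not being the same as one step with a larger learning rate can be checked by hand on a toy 1-parameter loss. The loss $w^4$, the starting point, and the learning rates below are arbitrary choices for illustration; the second small step uses a gradient recomputed at the new location, which the single big step never gets to see.

```python
def grad(w):
    return 4.0 * w ** 3  # derivative of the toy loss w**4

def sgd(w, lr, steps):
    for _ in range(steps):
        w -= lr * grad(w)  # gradient is re-evaluated at the current w each step
    return w

two_small = sgd(1.0, lr=0.1, steps=2)  # 1.0 -> 0.6 -> 0.5136
one_big = sgd(1.0, lr=0.2, steps=1)    # 1.0 -> 0.2
print(two_small, one_big)
```

Both runs spend the same total 'learning rate budget' but end in different places, because the iterated run got a second "look" after its first update.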

One might then think that this can't be useful because given the sample-efficiency of LLMs, one or two steps is probably enough to memorize the current input, and that one risks [catastrophic forgetting](https://openreview.net/forum?id=GhVS8_yPeEa#google "‘Effect of scale on catastrophic forgetting in neural networks’, Ramasesh et al 2022")... but that is an intuition which seems to be wrong in practice: LLMs memorize many things and yet generalize; catastrophic forgetting just doesn't happen with large pretrained models, due to their extreme overparameterization & sample-efficiency enabling [continual learning](/doc/reinforcement-learning/meta-learning/continual-learning/index); and grokking & related phenomena show that there is still benefit to training long after it would appear to be useless.

### Repeated Neuroplasticity as Implicit Search

So if we put this together, we can imagine using dynamic evaluation to try to trigger a kind of pondering which avoids tree search and can potentially lead to grokking: call it **dynamic grokking**.

In dynamic grokking, we spend a lot of compute to try to solve a specific hard problem.
We repeatedly do dynamic evaluation on the input to update the model, corresponding to a "learning" neuroplasticity step, and roll out a new completion, corresponding to a "pondering" step.
If the completion is correct, or we run out of compute, we stop; otherwise, we repeat.
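
The loop just described can be sketched as a small driver function. This is only a control-flow skeleton under loud assumptions: `dynamic_grok` and its arguments are hypothetical names, the `update`/`rollout`/`is_correct` callables stand in for a real dynamic-evaluation gradient step, a real sampling procedure, and some external checker, and the toy 'model' is a single integer counter.

```python
from typing import Callable, Optional, Tuple, TypeVar

M = TypeVar("M")  # model state (weights)

def dynamic_grok(
    model: M,
    prompt: str,
    update: Callable[[M, str], M],      # one dynamic-evaluation step on the prompt ("learning")
    rollout: Callable[[M, str], str],   # generate a completion from current weights ("pondering")
    is_correct: Callable[[str], bool],  # external checker, assumed to be available
    max_steps: int,                     # compute budget
) -> Tuple[M, Optional[str], int]:
    """Alternate update and rollout until a completion checks out or the budget runs out."""
    for step in range(1, max_steps + 1):
        model = update(model, prompt)
        completion = rollout(model, prompt)
        if is_correct(completion):
            return model, completion, step
    return model, None, max_steps

# Toy stand-ins (purely illustrative): each update nudges a counter, and the
# greedy 'completion' only flips once the counter crosses a threshold.
def toy_update(weights: int, prompt: str) -> int:
    return weights + 1

def toy_rollout(weights: int, prompt: str) -> str:
    return "insight" if weights >= 100 else "stuck"

model, answer, steps = dynamic_grok(
    0, "hard problem", toy_update, toy_rollout,
    is_correct=lambda completion: completion == "insight",
    max_steps=1000,
)
print(answer, steps)
```

In the toy run, the visible output is identical for the first 99 iterations and then changes abruptly, even though the underlying state was moving the whole time.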

During dynamic grokking, the LLM will update possibly thousands of times, immediately memorizing the input but hopefully gradually changing deep within, eventually generating novel completions and solutions that the original unimproved 'base' model could not have generated even with tens of thousands of brute-forced completions.
These final results can then be given to the user, used as the seed for other search techniques or a new dynamic grokking, trained on to improve base models, etc.

Extensions to this would include: periodically doing dynamic evaluation but with heavy [weight decay](https://arxiv.org/abs/1711.05101 "‘Decoupled Weight Decay Regularization’, Loshchilov & Hutter 2017") regularization, corresponding to a "sleep" step; updating the input with the best completions or parts of a completion (possibly using a second prompt to generate a useful summary of one completion); attempting to systematically vary the context to produce less redundant updates (eg. dropping out parts of the input, retrieving different documents from a database, injecting noise into the forward passes); and saving the model snapshots to merge together into a better base model (which, aggregated over many problems, may steer the model towards a [MAML](https://arxiv.org/abs/1703.03400 "‘MAML: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks’, Finn et al 2017")-esque meta-learning model which learns how to dynamically learn).
Given the repeated iteration of training, this would benefit from hardware setups [focused on minimizing latency](https://www.reddit.com/r/mlscaling/comments/1eyophn/hardware_hedging_against_scaling_regime_shifts/ "‘Hardware Hedging Against Scaling Regime Shifts’, Gwern 2024"), like Cerebras.
(It is possible that a small model using dynamic grokking on Cerebras-like hardware might be able to outperform time-to-solution of even a highly-parallelized but static large model, depending on how poorly static models scale their searches.)
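
As one illustration of the "sleep" step mentioned above, a periodic heavy weight decay could be wired into the update rule roughly like this. The schedule (`decay_for`, every 10th step, decay strength 5.0) is entirely hypothetical, and plain SGD with decoupled decay is used only because it is the simplest case to write down.

```python
def sgd_step(weights, grads, lr, weight_decay):
    # Decoupled weight decay: shrink the weights directly, separately from the loss gradient.
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

def decay_for(step, sleep_every=10, wake_decay=0.0, sleep_decay=5.0):
    # Hypothetical schedule: every `sleep_every`-th update is a heavily-regularized 'sleep' step.
    return sleep_decay if step % sleep_every == 0 else wake_decay

weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # zero loss gradient, to isolate the effect of the decay term
awake = sgd_step(weights, zero_grads, lr=0.1, weight_decay=decay_for(step=3))
asleep = sgd_step(weights, zero_grads, lr=0.1, weight_decay=decay_for(step=10))
print(awake, asleep)
```

On an ordinary step the weights are untouched by decay, while on a 'sleep' step they are pulled halfway back towards zero; how heavy and how frequent such steps should be is an open empirical question.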

As far as I am aware, there is no prior art relevant to dynamic grokking, because dynamic evaluation is almost completely unused, and the few inner-monologue approaches which do gradient descent at all will usually only do it with something like a subset of completions filtered for quality.

While dynamic grokking can be researched or tested with the usual toy problems and approaches, it's unclear how you could, even in principle, get an idea if dynamic grokking is working on a specific problem.
The externally-observable outputs (tokens or logits) might not change at all for many iterations; and all the 'internal' metrics for grokking I've seen are unscalable, architecture/problem-specific, noisy, or otherwise flawed.