The Unreasonable Selectiveness of Mathematics
A meta-mathematical discussion of why math is open-ended and generalizes, from the perspective of meta-reinforcement learning and contrasted with fields like speedrunning: mathematics culturally selects for feasibly discovering reusable results about short, general Turing machines.
the main issue with video games is that a guy who, if he lived in 1820s germany, would have done something like document every type of beetle in his local province instead ends up making a 26 part youtube series about how to get all the rings in every sonic game
owen cyclops, 2020-07-30
While funny, the quip is wrong on purely statistical and psychometric grounds: we can be highly confident that the YouTuber in question would not have done anything particularly remarkable, so the series did not come at the expense of our next Linnaeus or Einstein. The discussion of why will not interest most readers, so I’ll move on to a different topic.
If there is an argument here, it must be about something else; OC here also implicitly dismisses the value of the YouTube series, and there I find something more thought-provoking.
A better example than a YouTube documentary series would be a video gamer who was not documenting how to get all the rings—which, after all, might be quite trivial: presumably by this point there are wikis or walkthroughs documenting them—but who was a speedrunner competing to collect the rings faster than the current speedrunning record.
A speedrunner may dump staggering amounts of effort and analysis into trying to set a new record for playing a video game in some way. The amounts are particularly staggering given that the records can be so bizarrely specific and niche. For example, one of the most interesting YouTube videos I have watched on speedrunning is about playing Mario 64 levels… without fully pressing the jump button. (Specifically, pannenkoek2012’s ‘parallel universes’ Super Mario 64 hack, which avoids fully pressing the jump button; also noteworthy is speedrunning the game by exploiting an integer-overflow bug & modulo wraparound to accelerate Mario to near-infinite speed, passing through the entire map multiple times, in order to stop at the right place.) Another remarkable example is a TAS (tool-assisted speedrun) of bruteforcing Breakout by building a simplified simulation to explore every possible action in a level, for 6 CPU-years, in order to find the optimal sequence of actions that just somehow works (similar to chess endgame databases). Then there are the speedrunners who may simply play a game 10,000× (literally), hoping that they will get lucky with the random number generator, execute their strategy flawlessly, and shave a second or two off the record.
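To make concrete what such a brute-force search involves, here is a minimal sketch in Python (a made-up toy ‘game’ with a one-dimensional position and four actions, standing in for a real emulator, so all names and numbers here are illustrative): enumerate every action sequence, shortest first, simulate each on a simplified model, and keep the first that reaches the goal. Real TAS searches differ mainly in scale and in the fidelity of the simulation.

```python
from itertools import product

ACTIONS = ["left", "right", "jump", "wait"]  # toy action set, not any real game's inputs

def step(x, action):
    """Simplified deterministic 'game engine': the state is just an x-position."""
    if action == "left":
        return x - 1
    if action == "right":
        return x + 1
    if action == "jump":
        return x + 2   # a jump skips ahead two units
    return x           # "wait" does nothing

def simulate(actions, start=0):
    x = start
    for a in actions:
        x = step(x, a)
    return x

def brute_force(goal=7, max_len=6):
    """Try every action sequence up to max_len, shortest first;
    return the first one that reaches the goal."""
    for length in range(1, max_len + 1):
        for seq in product(ACTIONS, repeat=length):
            if simulate(seq) == goal:
                return seq
    return None

print(brute_force())  # ('right', 'jump', 'jump', 'jump'): reaches x=7 in 4 inputs instead of 7
```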
There is, of course, no possible practical justification for speedrunning. Despite being an area which has occupied countless nerdy minds for decades (sometimes full-time), which is highly motivated (and highly successful), and which has motivated the application of some impressive machinery from other fields like AI, decision theory, or probability theory, I am aware of exactly 0 instances of a breakthrough in speedrunning, per se, ever yielding anything important or useful to other STEM fields. It is purely l’art pour l’art at its best. If you do not like it (I have to confess I do not), then you just have to leave it to those who do, and admire it as a social phenomenon. Who would ever have thought that speedrunning could become such a thing? (As a kid, I would not have believed you if you had told me people would make a career out of playing through my NES or N64 games as fast as possible.)
The most interesting part to me (besides TASes as AI risk analogies) is that speedrunning doesn’t settle for the obvious goal of “complete a game as fast as possible”. It keeps inventing new and deliriously niche categories of speedruns. If you’re not interested in a new game, then simply tack on an arbitrary requirement to an old game: you speedran it on easy; now do it on hard. Or do it without taking any damage from enemies. Or do it with only the starting weapon. You can beat Mario 64 in a few minutes—fine, let’s see you beat it without jumping! Because the constraints are so arbitrary, it is a major problem when subtle differences in hardware undermine a record (similar to how drug doping in sports is a bad thing, not a good thing); hence, if at some point you realize that there is a SNES in your refrigerator and this is critically important, you must accept that this is your life now.
Notably, this is not necessarily true of many kinds of games. Regular tournament chess is not superseded each year by a bunch of weird new chess variants with rules like ‘you can castle, but only if you can lift the pieces using a single pinky finger without any pieces touching’, nor is 19×19 Go about to be replaced with an 18×20 variant just for the heck of it.
As it happens, I was recently reading about mathematical breakthroughs in a geometric conjecture about how to efficiently twirl a pen in place, the Kakeya Conjecture. As I read through the discussion, from the original, concrete, intuitive Kakeya Conjecture to more abstract versions generalizing to higher dimensions and entirely different quantities, I became increasingly lost, and puzzled by how fascinating mathematicians found these puzzles. (“What does any of this have to do with physics?”, to paraphrase one ex-theoretical physicist, who despaired of the mathematics.) A mathematician is never at a loss for a puzzle to solve, because every solved problem is just something to add additional rules to or otherwise tweak, and then they have a new puzzle to solve.
Indeed, doesn’t mathematics somewhat resemble… speedrunning?
Well, I’m sure there is more than one difference between math and speedrunning, but the first one that comes to mind is: math is sometimes useful, while speedrunning never is.
This is a little odd. Both seem to consist of often arbitrary-seeming games. Both can drive a lot of deep analysis of a rather arbitrary object. (Consider all the analyses of a game like chess, which depend utterly on the exact rules of the game, and would not apply to an even slightly different game of ‘chmess’; eg. if a knight leapt 2 squares and then 2 squares, rather than 2 and then 1, a large fraction of all chess research would be immediately irrelevant.) Why does twirling a pen on your school desk lead to centuries of profound mathematical analysis, while profound analysis of Mario 64 leads mostly to the depressing observation that yet another programmer was unable to program safely in C? (Or similarly, why is analysis of security flaws or bugs so singularly intellectually unrewarding? Preventing or detecting them can be highly challenging and bring in advanced mathematics, but the bugs themselves tend to range from trivially boring to enraging.)
This is of course the classic philosophy of mathematics question: the unreasonable effectiveness of mathematics.
Why should or does human math work? When mathematics encompasses every possible kind of math, and it is incredibly easy to do nothing but churn out meaningless theorems of no value whatsoever (just ask an automated theorem prover!), and so much math is pursued without any visible connection to the real world, how does math become the queen of sciences?
Why do these silly math games about things like ‘twirling an abstract pen’ keep turning out, perhaps millennia later and in completely unexpected ways, to be important, where twirling an actual pen, or a pen in a video game, never does? Why does one kind of silly game ‘work’, while the others remain merely silly games?
In short: what does math have that speedrunning doesn’t?
(This is a question which takes on some additional urgency right now, as we think about the rapid pace of progress in LLMs, where creativity and new ideas come from, and why LLMs, despite rapidly growing math prowess and superhuman breadth of knowledge, still do not contribute genuinely new ideas and cannot stand as peers to the humans who are their inferiors on almost every benchmark.)
One thought that occurs to me about speedrunning is that video games are both too complex and too simple.
They are extremely complex ways of implementing what, at their core, are often quite simple mechanics. An RPG might have thousands of unique items, hardly any of which matter, or tens of thousands of lines of dialogue, none of which change anything and which are just scenery. You might have millions of lines of code and gigabytes of fancy art to implement a game which amounts to playing rock-paper-scissors again and again, or which measures your ability to press the ‘A’ button really fast. A speedrun record might amount to simply being willing to roll the dice enough times to get a lucky number, and that’s about it.
Much of the challenge comes from unseeing the intended game UI/UX, and finding tricks like integer overflow to manipulate the underlying game engine. But ultimately that teaches you little about any other game—there’s hardly any far transfer. There will be integer overflows in the next game you play, but they will manifest in different ways in different places and may well be useless to speedrunning.
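As a toy illustration of the kind of bug being exploited (a hypothetical sketch, not code from any actual game): many older engines store quantities like speed in fixed-width integers, and arithmetic past the representable range silently wraps around, which a player can abuse if they find some way to keep incrementing the value.

```python
def to_int16(x):
    """Emulate how a 16-bit signed integer wraps around (two's complement)."""
    x &= 0xFFFF                      # keep only the low 16 bits
    return x - 0x10000 if x >= 0x8000 else x

speed = 0
for frame in range(40_000):          # keep 'accelerating', one unit per frame
    speed = to_int16(speed + 1)

print(speed)  # -25536: past 32,767 the counter silently wrapped negative
```

In a real engine the wrapped value then feeds back into position updates, which is what lets a runner cross the entire map in a frame; but the particular variable, its width, and the trick for incrementing it are unique to that one game’s code.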
Meanwhile in mathematics, many of the greatest discoveries are about discovering hidden unities: where a discovery about one thing turns out to be translatable and equivalent to a discovery about a completely different-looking thing. The translation or equivalence is usually not strict and complete, but about some part, or with some catch. This isn’t a problem, of course, because no problems are problems for a mathematician: just a new game to play. But it happens extensively enough to be important, and sometimes, the hidden unity turns out to involve the real world too—this dry-as-dust number theory problem about efficiently factoring certain large composite integers actually turns out to be about whether the NSA can read all your emails.
(How silly it would sound if this were ever true in video games—if tomorrow a streamer were to announce their discovery that the game of Tetris you’d been playing was, all along, actually secretly Breakout, just played rotated in 3D space, with a rule negated and some minor scoring adjustments; or if your spaceship game had actually been you commanding an interstellar armada on behalf of humanity and unwittingly committing genocide.)
We might say that the problem with speedrunning is that it is playing large, complex, hyper-specific games which do not appear elsewhere, while with math, even if the problems appear ‘complex’, they are actually simple enough that they keep turning up, like a bad penny. To paraphrase von Neumann: if people do not think math is simple, it’s because they don’t appreciate how complex everything else is.
Something like an integer is so simple that patterns which are integer-like will appear everywhere. Something which is Mario-like in being accelerated via integer overflow by running in place in the castle moat will appear… in not many places aside from that Mario game.
Is this a full explanation? I don’t think it works; math is simple, but it’s not that simple, and there seem to be plenty of simple statements which are not the subject of math research.
If it’s just about reuse of patterns, why aren’t short arithmetical statements overwhelmingly the subject of math research, and why does research spend so much time going off on strange safari expeditions to exotic locales? There are surely plenty of short additions or multiplications which are not written down anywhere.
Well, one reason is that if you need those, you can just do it yourself. Way back when, mathematicians did spend quite a lot of effort working on reference books of logarithms, interest rates, normal distributions, etc., which were useful to them and to others; but they don’t now, because you can just compute it yourself with software.
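The ‘just compute it yourself’ point is literal: a modern standard library reproduces, in microseconds, the kinds of entries that once filled reference volumes (a trivial Python illustration):

```python
import math
from statistics import NormalDist

print(math.log10(7.3))             # a line from a book of common logarithms
print((1 + 0.05 / 12) ** 12 - 1)   # an entry from a compound-interest table
print(NormalDist().cdf(1.96))      # ~0.975, a cell from a table of the normal distribution
```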
Let’s take a step back here; if deep learning and AI are for real, then one of the things they should tell us a lot about is humans and mathematics—how do we do it? Why? And what makes it work?
From the big picture, when we look at how neural nets scale and at their internals (as best we can interpret them), what they seem to be is a giant mish-mash of little heuristics, memorized factoids, and short, simple ‘circuits’ or programs, all of which are flexibly mixed together on the fly; and the more you scale them appropriately, the more little widgets they have to recombine to solve problems. And I think this is true of human neural networks too. To brutally summarize it:
the master synthesis of human & artificial intelligence is lots of tiny Turing machines.
Now, from this perspective, what is mathematics?
Mathematics resembles the study of pure Turing machines, formal pattern and computation liberated from the messy real-world details. We do not study a long line of, say, alternating apples and oranges, nor do we even study a sequence of integers; we study a binary sequence, which is computed by a very short, simple Turing machine which alternates between two arbitrary but distinct outputs.
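A minimal sketch of the kind of ‘very short Turing machine’ meant here (written as a tiny two-state transducer in Python rather than a formal tape machine, purely for illustration):

```python
def alternator(n, symbols=("0", "1")):
    """A 2-state machine: emit the symbol for the current state, then flip states.
    The symbols themselves are arbitrary; only the alternating pattern matters."""
    state, out = 0, []
    for _ in range(n):
        out.append(symbols[state])
        state = 1 - state
    return "".join(out)

print(alternator(8))                        # 01010101
print(alternator(4, ("apple", "orange")))   # appleorangeappleorange
```

The machine neither knows nor cares whether its two symbols are bits, fruit, or anything else; the pattern is the object of study.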
Alright, if we want, we can think of math this way and contort any definitions or concepts as necessary (Church-Turing, universal Turing machines, ultrafinitism etc). What does this change of perspective buy us? Instead of asking why mathematicians pick ‘useful’ theorems or axioms or topics, now we have to ask why they pick ‘useful’ Turing machines, which doesn’t seem too helpful.
In light of the view of intelligence as just a bunch of Turing machines mashed up as necessary, suddenly this looks a little different. If a mathematician deeply investigates a Turing machine, they may discover some important property of it, like being able to predict its halting output without the expense of computing it indefinitely, or ruling out certain outputs ever appearing, or relating it to another well-studied Turing machine. Indeed, simply being able to say, “this Turing machine X looks a lot like that other Turing machine Y”, is itself quite useful.
To use a programming metaphor, math begins to look like a giant standard library of highly-optimized & debugged code. That can save a lot of time & effort if you have a problem which happens to closely match something in the standard library. Even if the standard library functions can’t solve the problem completely, maybe because they don’t match exactly or you don’t understand them, they can usually shorten your final solution and reduce how much boring repetitive boilerplate code you write.
What makes a good standard library? Well, a standard library can be judged by how much it shortens your code in general. If it saves you a lot of code, it’s good: a well-designed ‘batteries included’ standard library may be able to make many tasks trivial and reducible to a handful of lines of code. These libraries may be very hard to write—it’s shocking how hard it is to write a good pocket calculator—but are well worthwhile as they save millions of programmers huge amounts of time for decades to come, between the easier writing of their code and (even more importantly) easier reading of their code.
A bad one will have lots of functions… none of which are relevant, or which have nasty limitations, or which are just defined in the wrong way for your purposes, time and again. If your code is about the same length as it would be written the hard way in the ‘core’ language, then it’s a bad standard library—why bother with it, and why did anyone bother spending the time to write it? And you have to write it all out every time, and everyone else has to read your idiosyncratic version every time, too.
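A trivial illustration of the point (Python here, but it is language-agnostic): with a good ‘batteries included’ library, a summary statistic is one already-debugged line; without it, everyone re-derives their own slightly different, slightly buggier version, which everyone else then has to read.

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# With a good standard library: one line, written and debugged once for everybody.
print(statistics.pstdev(xs))        # 2.0

# Without it: re-derive the same thing by hand, in every codebase, forever.
mean = sum(xs) / len(xs)
variance = sum((x - mean) ** 2 for x in xs) / len(xs)
print(variance ** 0.5)              # 2.0 again, but more code to write, read, and get wrong
```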
That is, good code is amortized over many uses, like running or reading; bad code is not—it may be run only once (or perhaps never), while being hard to read. A good math theory or theorem or axiom is like that as well: if it never gets used anywhere else for anything, is it useful, or just trivial—as useful as recording the result of adding two randomly-chosen large integers?
We can draw an analogy here to “amortized inference”, like neural networks: unlike many machine learning or statistical inference methods, a neural network is expensive to train (often it must be run billions of times), but then it is cheap to run once (which is necessary, because you may be running it billions of times for just the training!). The cost of training is ‘amortized’ over all of the future uses: if you do not use the neural net enough, you will not amortize it over many uses, and each use will be expensive; if you do, then it can be highly efficient.
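A back-of-the-envelope way to see the trade-off (the numbers below are invented purely for illustration):

```python
# Hypothetical costs, in arbitrary 'compute units'.
train_cost = 1_000_000   # one-time cost to train the network (or prove the theorem)
run_cost   = 1           # cost per use afterwards
naive_cost = 100         # cost of solving each instance from scratch instead

def cost_per_use(n_uses):
    """Average cost per use once the one-time cost is amortized over n_uses."""
    return train_cost / n_uses + run_cost

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} uses: {cost_per_use(n):>8.1f} per use (vs {naive_cost} from scratch)")
# 1,000 uses:      1001.0 per use -- far worse than solving from scratch
# 100,000 uses:      11.0 per use -- now ~10x cheaper
# 10,000,000 uses:    1.1 per use -- the up-front cost has effectively vanished
```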
So, we can see math as “amortized inference over Turing machines”, in a way which is explicit and symbolic, rather than implicit and hidden in the weights of a giant neural network.
OK, this analogy makes some sense, but still doesn’t answer the question: you can often gauge the quality of a new standard library function or a new math result by looking at the existing corpus and seeing how much it can help, like if you have just proved a conjecture that many other results depend on. That is easy. But it doesn’t explain the real problem of how math results keep turning up like a bad penny elsewhere for problems or topics that no one even imagined existed at that time.
Going back to neural networks, a major example of amortized inference is meta-reinforcement learning: not just learning to solve a problem, but learning to solve a whole class of similar problems, each perhaps never seen before and never seen again. A neural network can learn to meta-learn by being dropped into constantly changing problems, and it will learn how to update and how to explore; it can learn heuristics like looking into each corner of a maze just to see what happens.
A meta-RL agent does not have to know what it is doing or why, and just does; what it does may be the Bayes-optimal solution to a very complicated, difficult, computationally expensive problem, but over many episodes, it has slowly learned to approximate the optimal solution. (It may not even have ‘learned’ in any familiar sense like stochastic gradient descent; it could have been evolved, by mutating a neural network’s weights and keeping the ones that perform better than average.)
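Here is a minimal sketch of that idea (a toy 2-armed bandit, with hill-climbing ‘evolution’ of a single exploration parameter standing in for meta-RL proper; everything here is illustrative rather than any standard benchmark). The agent never learns the payout probabilities of any particular episode, but across many episodes the inherited heuristic, how much to explore before exploiting, gets tuned toward a sensible trade-off.

```python
import random

def bandit_episode(epsilon, n_pulls=100):
    """One episode: a fresh 2-armed bandit with random payout probabilities.
    The agent runs epsilon-greedy with the given exploration rate and returns
    its total reward; it never sees the true probabilities."""
    p = [random.random(), random.random()]           # this episode's hidden arm probabilities
    counts, values = [0, 0], [0.0, 0.0]
    total = 0
    for _ in range(n_pulls):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(2)                 # explore
        else:
            arm = 0 if values[0] > values[1] else 1   # exploit the current estimate
        reward = 1 if random.random() < p[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # running mean of observed payouts
        total += reward
    return total

def average_reward(epsilon, episodes=300):
    return sum(bandit_episode(epsilon) for _ in range(episodes)) / episodes

# 'Evolution' of the inherited heuristic: mutate the exploration rate and
# keep mutations that do better on average across many fresh episodes.
epsilon = 0.9                # start out exploring almost blindly
best = average_reward(epsilon)
for generation in range(30):
    candidate = min(1.0, max(0.0, epsilon + random.gauss(0, 0.1)))
    score = average_reward(candidate)
    if score > best:
        epsilon, best = candidate, score

print(epsilon, best)  # epsilon typically ends up far below 0.9 (roughly 0.1-0.3): explore a little, then exploit
```

Nothing in this loop ‘knows’ why a lower exploration rate wins; the knowledge lives entirely in which mutations happened to survive.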
So if we imagine mathematicians as meta-RL agents learning and evolving over many generations, what are they optimizing, and what is the reward?
I would say that they are optimizing for results that either are or will become useful: mathematicians who are able to prove a key theorem early in a developing area will get more citations, fame, funding, grad students, and prestige from fellow mathematicians, who will flock to them and imitate them. Note that ‘becomes useful’ does not need to happen in a mathematician’s lifetime; you do not have to be alive for future mathematicians to imitate you.
That is, they are rewarded not for proving stuff about short Turing machines or for writing short proofs; they are rewarded for proving stuff which turns out to additionally shorten proofs. The reward is the second derivative of compression, which is also, as it happens, the Schmidhuber interpretation of optimal novelty generation/exploration for creativity.
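One way to state this more explicitly (a hedged formalization loosely following Schmidhuber’s compression-progress framing; the notation is mine, not standard): let $C(t)$ be the total description length of the accumulated corpus of results under the best theory available at time $t$. Compression progress is the first derivative,

$$r(t) = C(t-1) - C(t) \approx -\frac{dC}{dt},$$

and rewarding results which themselves go on to shorten future proofs amounts to rewarding an increase in that rate, i.e. something like

$$R(t) \propto \frac{dr}{dt} \approx -\frac{d^{2}C}{dt^{2}}.$$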
At some point, there may just not be much that can be said about an area like Euclidean geometry without bringing in some novel angle from elsewhere, and it ‘dies’ as everyone finds greener pastures. Fortunately, if there is not much more to be said, then there is no loss from that.
But there are also many areas which are ‘dead’ because there is nothing that can be said about them. They are statements which are uncomputable or undecidable, or they require too much computation to think about, or they are ‘chaotic’ in the sense of having no easily-predictable patterns. They appear to us as ‘brute facts’, with no shorter description than the fact itself; like trying to explain a move in a chess endgame—there is not, and cannot be, for most endgame moves, any explanation other than ‘that is simply the optimal move in chess as defined by these rules, and that is that’. From computability theory, we know that such brute facts are, in some sense, almost all facts. And also fortunately, such brute facts are not useful to know: we could neither recognize them in advance, nor will they recur.
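The counting argument behind ‘almost all facts are brute facts’ is worth spelling out (a standard Kolmogorov-complexity observation, sketched informally): there are $2^n$ binary strings of length $n$, but fewer than $2^{n-k}$ descriptions shorter than $n-k$ bits, so

$$\#\{x \in \{0,1\}^n : K(x) < n - k\} \le \sum_{i=0}^{n-k-1} 2^{i} < 2^{n-k},$$

i.e. at most a $2^{-k}$ fraction of $n$-bit strings can be compressed by even $k$ bits; the overwhelming majority have no description meaningfully shorter than themselves.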
So we reach a kind of anthropic principle like the ‘bet on sparsity’ principle: the only Turing machines worth analyzing are short Turing machines which are from classes of Turing machines where the second derivative of compression is positive, because those are the ones which are not yet analyzed, are analyzable, and will possibly be reused in future analysis.
There are, of course, indefinitely many such classes of Turing machines, but mathematicians have evolved to pick rewarding ones; those who chose poorly, and went down blind alleys, failed to reproduce their particular brand of ‘math’. (Quine, 1969: “Creatures inveterately wrong in their inductions have a pathetic but praiseworthy tendency to die before reproducing their kind.”)
Over many generations, heuristics, attitudes, rules-of-thumb, a large corpus of past successful data, publishing practices, and so on and so forth, will accumulate. That is what we call ‘taste’ or ‘beauty’; like many kinds of esthetic or intellectual appreciation, they are unnatural and must be learned—few people can automatically appreciate a ‘beautiful’ or ‘elegant’ theorem, just as few children automatically appreciate a great coffee or opera. (This may be why those properties are so notoriously hard to define and require ‘a wordless transmission outside the scriptures’: there may not actually be any single property, and they may reduce down to uninteresting facts like ‘we like coffee so much because there is a slight mutation in our muscles which means that an insecticide like caffeine doesn’t kill us, but stimulates us’.) Mathematicians are not ‘trying to optimize for reward in this lifetime’, because they do not necessarily know what they are doing or why their heuristic choices work (and may be deeply mistaken about those things—especially if being mistaken is adaptive, like quasi-religious Platonic ideologies driving them to work harder in the absence of any earthly reward); they simply do. “The heart has its reasons, which reason knows not”, and we might say that mathematics has unreasons that its reasons know not. These heuristics amortize the work of past generations of mathematicians, and are distilled into the current generation—who are then set free into the world to make their way.
Most will fail, like fish spawn trying to swim to the sea, ‘dying’ along the way. They blindly executed their inherited package of amortized-inference heuristics, bet big on some specific topic or method—and failed to produce anything useful. They will have fewer mathematical descendants, and will serve as a cautionary example against that topic or method. Entire schools (species) of mathematics may go extinct as they exhaust their topic and the years pass with no one interested in learning or using their results; they too are replaced by more productive schools. Indeed, large areas may effectively go defunct. But a few will thrive: they will execute their heuristics, and, perhaps after decades of computation, be rewarded by stumbling on a goldmine—a clever analogy, a strange axiom, a key conjecture, an unexpected & hugely important real-world application—and spawn. Those will bequeath their own slightly tweaked, and on average better, package of ‘taste’ to their offspring.
Further, those are the ones who continue to justify the existence of ‘mathematics’ as a profession: they bring in the funding, and the cultural cachet.
The outsiders start the bootstrap: they observe mathematics as a giant opaque blackbox into which they throw money, and out of which, occasionally, something magical comes. How does this work?
So we have a hierarchy of reinforcement learning here, where it is meta-learning all the way down (see my backstop essay), and where each level optimizes with increasing insight, but is constrained by the level above it to eventually, on average, yield results:
- From the perspective of countries: those which have funded ‘math’ have done better than those which did not, and they know “We need to fund strong STEM”—but they do not know what good math is, and must delegate the funding of ‘math’ to organizations like universities or non-profits.
  Over a long timespan, like centuries, they observe the outputs; if magical things stop coming out, and they see other blackboxes producing more, they may decide to simply stop funding their blackbox, wipe it out, and replace it with another. (An example of this might be the indigenous math traditions in Japan and China, often fairly sophisticated, being replaced wholesale by imported European higher-education organizations & mathematics.)
- From the perspective of funders: they can reward practical applications that outsiders can judge by whether they work or not; but those are so rare and so often inapplicable that they must base their funding on indirect metrics decided by mathematicians, like number of citations, math awards, or simply prestigious reputation among mathematicians collectively.
  Over a medium-term timespan, like decades to centuries, funders who fail to pick winners will gradually lose interest or find some other, more rewarding place to spend their funds.
- From the perspective of mathematicians collectively: they often cannot understand the details of another area or even of a specific paper (which may take years to work through in complete detail), but they can observe whether those results show up elsewhere and get used in other math, and whether they seem to lead to further results which themselves show up elsewhere, and so on.
  Over a short-to-medium-term timespan, like generations, a ‘school’ or ‘area’ may be given a long leash; but if its members beaver industriously away for decades and still no one outside the area cares or sees signs of progress, it will gradually wither away, as people decide it has succumbed to the pathology of l’art pour l’art, involution, and decay.
- From the perspective of mathematicians in an area, or of the original mathematician: they can understand the work properly, but to assess whether it is valuable, they deploy their internal neural-net blackbox of things like ‘taste’ or ‘beauty’ or ‘elegance’, and directly reason their way through the problem.
  Over a short timespan, like months or years, if something just isn’t working out, they will try a different approach.
This produces mathematics as we know it: with only occasional feedback from the real world, groups of mathematicians can autonomously explore an endless succession of new problems while refining their heuristics about what areas of new problems will be potentially soluble and at least occasionally relevant to the real world at some point, while avoiding getting trapped by spiraling into recherché navel-gazing niches which eventually degenerate into intellectual masturbation and numerology.
Now we can finally tie this back to our starting point: why do the abstract games of mathematics ‘work’, but the games of speedrunning are always art/entertainment?
Because speedrunning is focused on the analysis of Turing machines which are a mix of the extremely simple (‘this game mechanic is just rock-paper-scissors’) and brute facts (‘there is a buffer overflow at this exact position in the level we can use to skip forward, because of the RAM padding on a SNES chip, where they cut off the last kilobyte to save 0.1¢ during manufacturing back in 1989’). The simplicity of the game Turing machines means that they are so easily figured out that there is little need for anyone else to know about them. (It would be like trying to explain rock-paper-scissors to someone by telling them it’s like a particular set of weapons/armor in an RPG from 1981—there would be no point; you could explain rock-paper-scissors from scratch in the time it takes you to explain the RPG.) And then further progress depends on the brute facts, which do not re-appear anywhere else. (No game will ever have that exact sequence of button pushes or buffer overflows, unless by design.) So speedrunning is doomed to being like History: “one d—n thing after another”.
And because there is no transfer elsewhere, there cannot be any grounding for the blackbox of speedrunning culture. Speedrunners can borrow scientific or mathematical methodology and culture to some extent, but not much. Thus, speedrunning niches are doomed to l’art pour l’art, and to eventually being exhausted, becoming passé and unfashionable, and everyone moving on to a new game or a newly-invented constraint. (We need not fear for them; there has never been a shortage of new games to speedrun.)
Turning back to a more important topic: AI. LLMs in particular have been accused of lacking open-endedness and creativity; despite their superhuman knowledge of every area, there is a lack of impressive novel insights. The best LLMs, as of this writing, like Gemini-2.5-pro, can offer surprisingly high-quality feedback and suggestions, but only in response to an input. Unrelated, spontaneous insights or ideas seem to effectively never happen. No one has ever observed an “incubation effect” in LLMs where a system like o1-pro spontaneously interrupts a task to report a neat idea that just occurred to it. There is nothing in LLMs equivalent to a community of mathematicians; you cannot run a math LLM indefinitely and once in a while get out something unexpected.
If we analogize math-oriented LLMs to mathematics, the LLM is closest to a knowledgeable student who has studied textbooks and homework problems, but has never done research. You can set them a problem which has an answer, and they may well be able to find the answer. But at no point have they ever learned to solve the problem of coming up with problems. That is written down in no textbook, nor is there any homework problem for it (almost by definition, despite occasional valiant efforts like Pólya’s How to Solve It).
And it is hard to learn this by simply sitting down and doing a tree search over theorems; you may learn superhumanly well to prove specific theorems, like AlphaZero learns to play superhuman Go, but how do you learn how to come up with entirely new theorems—not lemmas for a pre-specified theorem—which are worth solving? (No matter how well it plays Go, AlphaZero does not know how to come up with a Go variant worth playing.)
The good news is that while ‘taste’ or ‘elegance’ are written down in no textbook or dataset, they can’t be that complex, because they must be broad principles which cover large areas of Turing machine space, and cannot contain much detail. Further, meta-learning struggles to learn complicated things: the RL feedback is too uninformative, noisy, sparse, and delayed (sometimes by centuries); and at the community level, there has not been much time or selective pressure, so there cannot have been too much optimization. We can also note that mathematicians are trained quickly, and can make major contributions when young.
So similar to RLHF or instruction-tuning or related topics, we can expect ‘taste’ to be ‘superficial’ in the sense of something the LLMs are already capable of, and which only needs to be elicited by tweaking just a few parameters.