[replication & extension] The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.
By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the test-time and train-time compute available to an agent can be traded off while maintaining performance.
Figure 5: Each training run (each faint line) of each differently-sized agent follows a sigmoid, starting at random play and progressing up to some plateau. The frontiers (dark lines) formed by taking a maximum across training runs have a similar form across board sizes (colors).
Figure 6: The compute-performance frontier follows the same sigmoid for each board size 3 through 9, just scaled and shifted. The dotted lines give the fitted curves.
Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.
A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.
Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.
Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.
Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.
Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.
…Train-test trade-off: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see in Figure 8 that larger tree searches at test time can substantially improve the performance of an agent.
Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in Figure 9, which shows that the trade-off is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection however, we believe the key distinction is that an optimization at test-time needs only optimize over one sample, while train-time compute meanwhile must optimize over the entire distribution of samples.
Figure 9: The trade-off between train-time compute and test-time compute. Each dotted line gives the minimum train-test compute required for a certain Elo on a 9 × 9 board.
…the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behavior is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins3. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behavior holds in other games.
Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As is however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.