“A Solvable Model of Neural Scaling Laws”, Alexander Maloney, Daniel A. Roberts, James Sully, 2022-10-30:

[Twitter; see also Bahri et al 2021] Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource.

To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model—a joint generative data model and random feature model—that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (1) the statistical structure of datasets and tasks that lead to scaling laws, (2) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (3) the optimality of the equiparameterization scaling of training sets and parameters, and (4) whether such scaling laws can break down and how they behave when they do.

Key findings are (1) the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and (2) how the finite extent of the data’s spectral power law causes the model’s performance to plateau.


An important insight is the role of a new scale that can be understood as the size of the latent space from which the data is generated. If the model size or training set size exceeds this scale, the model enters a new regime of behavior not yet observed in the LLM experiments.

Also, for generalized linear models, scaling the model size linearly with the training set size is optimal if regularization is used, not overparameterization! Intuitively, each additional sample + parameter pair can be used to learn an additional component in the latent space.


One of our main results is a lack of universality of scaling laws across differently structured data generation processes: datasets that lead to scaling laws have a particular power-law structure in their spectral statistics, which ultimately leads to a power-law scaling of the test loss when there are no resource bottlenecks present. Moreover, we find that an essential role of nonlinear feature maps is extending the power law in the spectrum of the representation as a function of the number of features. This ability to extend the power law differentiates the performance of different deep neural network (DNN) models and, although we don’t investigate it here, is presumably an important reason why—from the perspective of this analysis—transformers enable neural scaling law phenomenology. Finally, for generalized linear models—i.e. linear regressions of potentially nonlinear feature maps—we learn that exact equiparameterization—scaling the number of features identically with the size of the training set—is optimal when some kind of regularization is applied. [see also Michaud et al 2023, Bahri et al 2021]

Intuitively, for the sort of data that leads to scaling laws, each additional sample can be used to learn about an additional feature in the latent feature space, and the model should have an additional parameter in order to represent the information from this new latent feature.

This is consistent with the finding of Chinchilla, though is slightly counter to the initial empirical results in Kaplan et al 2020. However, both of those references concern empirical investigations of LLMs, while our analysis concerns generalized linear models and may not apply in the same way for nonlinear models that learn representations. (See §5 under the subheading “Representation Learning?” for further discussion.)

An important insight that emerges from our analysis is the role of a new scale that determines when the empirical behavior found by Kaplan et al 2020 breaks down. This scale can be understood as the size of the latent space from which the data is generated and must be much larger than both the size of the training set and the number of parameters of the model in order to observe the power-law scaling and bottleneck behavior of Kaplan et al 2020.

(This is perhaps surprising given a general expectation that natural data should live on a manifold of smaller intrinsic dimension than its embedding dimension; see §4.3 for further discussion. [see Sharma & Kaplan 2020 on the manifold hypothesis, and Spigler et al 2020])

If either of these two resource scales exceeds the size of the latent space, our analysis shows a new regime of different behavior for the test loss that has not yet been seen in the LLM experiments. Since we have a generative model of the data, we control this scale directly in our analysis, but it would be extremely interesting to understand this scale in natural data, such as images or text.


In §4, we interpret our calculations from the previous section and expand on our results. Most importantly, in §4.1, we characterize the breakdown of neural scaling law behavior in our model by considering our result from §3 in the limit where the size of the latent space becomes smaller than either the size of the training set or the number of features in the model. We also confirm the validity of our calculation in this limit by comparing against numerical simulations in the same regime. Then, in §4.2 we explain the optimality of the equiparameterized regime for neural scaling, contrasting with the overparameterized regime and discussing the double descent phenomenon, while in §4.3 we further consider our new scale that controls the size of the latent space and the breakdown of scaling laws in the context of traditional notions of dimensionality reduction. We close in §4.4 by discussing some limitations of our minimal power-law spectral data model that could be improved in future analyses.


4. Discussion of Results

Now that we have a statistical model of scaling laws that we understand for jointly large-but-finite model size N, training set size T, and latent space size M, in this section we discuss what we can learn from it.

  1. In §4.1, we interpret our results from §3 in the limit that the model size or training set size approaches the size of the latent space, N, T ~ M, and the neural scaling law phenomenology of §1 breaks down.

  2. In §4.2, we discuss how the regime of scaling laws, of large training set size and large model size, pushes resource efficient and properly regularized models towards the equiparameterization regime, and how the phenomenon of double descent is not really relevant for such models.

  3. In §4.3, we try to reconcile the large latent space, M, required for datasets that allow for neural scaling laws with the traditional idea that input datasets are embedded in high-dimensional spaces, N_in, and can be compressed to a latent space with much smaller intrinsic dimension, d_in.

    We note that there are a number of notions of dimensionality, and the particular power-law structure of the datasets that give rise to scaling laws makes different notions meaningful for different questions.

  4. Finally, in §4.4, we identify some limitations of our generative data model that could be improved in future analyses.


No noise: First, let’s consider the case without noise. In the left panel of Figure 9 we plot a simulation of our statistical model with no regularization (γ = 0) and also plot our RMT calculation of the test loss, (Equation 163), for models for which the number of features is much larger than the size of the latent space and the number of samples in the training set (N > T, M); in the right panel, we use optimal regularization (γ = γ✱) in the simulation and also plot a new fit that we will discuss below.

In both panels, we learn that the “breakdown” of scaling laws—without noise—is a lot like a singularity: all of a sudden, at T = M, the test loss drops very rapidly to zero! [cf. Rosenfeld 2021]
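This singular behavior is easy to reproduce in a stripped-down version of the setup (a sketch with assumed toy sizes, not the paper’s RMT calculation): with noiseless linear data from an M-dimensional latent space, least squares pins down the teacher exactly once T ≥ M, and the test loss collapses to numerical zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy numbers: noiseless linear data from an M-dimensional latent
# space with a power-law spectrum. Past T = M the teacher is fully
# determined, so the unregularized test loss drops to ~0.
M = 64
spectrum = np.arange(1, M + 1) ** (-2.0)  # power-law latent spectrum
w_true = rng.standard_normal(M)

def loss(T, n_test=1000):
    X = rng.standard_normal((T, M)) * np.sqrt(spectrum)
    w_hat, *_ = np.linalg.lstsq(X, X @ w_true, rcond=None)  # min-norm fit
    Xt = rng.standard_normal((n_test, M)) * np.sqrt(spectrum)
    return float(np.mean((Xt @ (w_hat - w_true)) ** 2))

under, over = loss(M // 2), loss(2 * M)
print(under, over)  # finite loss below the latent scale; ~0 above it
```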


Noise: Now, let’s turn on the label noise.
 [Figure 10] 
Most notably, for large T we see that in both the unregularized case (Equation 186) and the regularized case (Equation 188) there’s a universal ~ 1⁄T falloff of the test loss when the model size and training set size jointly exceed the size of the latent space (N, T > M). If such a transition in powers appears in a model at an otherwise undefined scale, it could be suggestive of a breakdown associated with having reached the size of the latent space.
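The ~ 1⁄T falloff matches the classical behavior of ordinary least squares once the latent space is exhausted. A quick toy check (assumed sizes and noise level, standing in for Equations 186/188, not reproducing them): past T ≫ M, doubling T roughly halves the excess test loss.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy numbers: noisy linear data, ordinary least squares, T >> M.
M, sigma = 32, 0.5
spectrum = np.arange(1, M + 1) ** (-2.0)
w_true = rng.standard_normal(M)

def excess_loss(T, n_trials=20):
    vals = []
    for _ in range(n_trials):
        X = rng.standard_normal((T, M)) * np.sqrt(spectrum)
        y = X @ w_true + sigma * rng.standard_normal(T)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        vals.append(spectrum @ (w_hat - w_true) ** 2)  # population excess risk
    return float(np.mean(vals))

# Classical OLS gives excess risk ~ sigma^2 * M / T, so doubling T
# should roughly halve it.
ratio = excess_loss(500) / excess_loss(1000)
print(ratio)
```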

Interestingly, the consequences of this for the practitioner now depend on the size of the power-law exponent α: for α > 1, this transition would limit the model’s performance gains with increasing resources N and T, while for exponents α < 1, it would enhance such gains.

Figure 11: Sketch of test losses of our statistical model from §3 on a log-log scale for different fixed training set sizes for both unregularized or ridgeless (solid) and optimally regularized (dashed) models. The solid blue curve exhibits the double descent phenomenon, with a local minimum of performance in the under-parameterized region (black star, N = N✱) and with performance further improving asymptotically in the overparameterized region. The two blue curves illustrate how the double-descent peak is an artifact of the ridgeless [weight-decay-less?] limit (γ = 0), with performance monotonically improving through the point of equiparameterization (vertical dotted lines) when the models are properly regularized. Comparison of the dashed blue (T = T0) and orange (T = 8T0) curves illustrates the optimality of near-equiparameterization when using regularization properly: the best performance boost results from scaling the model jointly with the size of the training set (N ~ T).


With this regularization it’s apparent that the slow asymptotic improvement for very large models at fixed training set size is simply the plateau region of (Equation 1). Furthermore, comparing the unregularized and regularized models—from our numerical simulations and exhibited in the figure by the solid and dashed blue curves, respectively—we see that regularization only seems to be important around the equiparameterization peak. From this perspective, the double descent phenomenon is an artifact of not using regularization [like early stopping] in a small region around N ~ T.
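The role of the ridgeless limit in the double-descent peak can be sketched as follows (toy sizes, a stand-in for the simulations behind Figure 11, not those simulations themselves): random-feature regression with γ = 0 spikes at N = T, while a modest ridge at the same N removes the peak.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy numbers: random-feature regression on power-law latent data.
# Ridgeless (gamma = 0) test loss spikes at N = T; ridge smooths it away.
M, T, alpha = 128, 64, 1.0
spectrum = np.arange(1, M + 1) ** (-1.0 - alpha)
w_true = rng.standard_normal(M)

def run(N, gamma, n_test=1000, n_trials=10):
    losses = []
    for _ in range(n_trials):
        W = rng.standard_normal((M, N)) / np.sqrt(M)
        def draw(n):
            x = rng.standard_normal((n, M)) * np.sqrt(spectrum)
            return x @ W, x @ w_true
        F, y = draw(T)
        if gamma == 0.0:
            theta, *_ = np.linalg.lstsq(F, y, rcond=None)  # ridgeless / min-norm
        else:
            theta = np.linalg.solve(F.T @ F + gamma * np.eye(N), F.T @ y)
        Ft, yt = draw(n_test)
        losses.append(np.mean((Ft @ theta - yt) ** 2))
    return float(np.median(losses))

left, peak, right = run(T // 2, 0.0), run(T, 0.0), run(2 * T, 0.0)
ridged = run(T, 1e-1)
print(left, peak, right)  # ridgeless loss spikes at N = T
print(ridged)             # ridge at N = T removes the spike
```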

Thus, when using proper regularization, we can achieve the best test loss by jointly scaling the size of the model and the size of the training set. Comparison of the two regularized (dashed) curves in Figure 11 illustrates the way that increasing model size alone eventually encounters a plateau, while increasing the size of the training set extends the power-law portion of the performance gains.

As originally pointed out by Kaplan et al 2020, the reason for the optimality of this joint scaling is avoiding resource bottlenecks: each additional sample in the training set is informative about one additional eigen-feature in the power-law portion of the data’s latent spectrum (Equation 23), and we need an additional feature in our model to represent that eigen-feature.

As such, given finite resources and an ability to both scale models as well as gather training points, an optimal allocation involves a kind of joint near-equiparameterization scaling: for generalized linear models such as our statistical model, there is only a single exponent, α, that controls the test loss power-law behavior in both the number of features and size of the training set, and for this model scaling the number of features of the model to equal the size of the training set, N(T) = T, will avoid the plateau region; for other models such as the LLMs discussed in Kaplan et al 2020, we may have a more general scaling relation, N(T) ~ Tp, as in (Equation 6). Even in this more general case, jointly scaling both training data and model size pushes the performance away from the tails of the test loss curves and back towards the termination of the power-law region—back towards the non-analytic peak of the unregularized model—which is the region of large-data and large-parameter equiparameterization. Altogether, we conclude that this regime with proper regularization—and not the overparameterization regime—is the practical setting of interest for deep learning.
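As a rough numerical illustration of this allocation argument (assumed toy sizes and a fixed, hand-picked ridge rather than the optimal γ✱): scaling the model alone or the training set alone plateaus against the other bottleneck, while scaling them jointly continues to improve.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy numbers: ridge-regularized random-feature regression on
# power-law latent data. Growing N alone or T alone hits a bottleneck;
# growing them jointly (N = T) does not.
M, alpha, gamma = 1024, 1.0, 3e-2
spectrum = np.arange(1, M + 1) ** (-1.0 - alpha)
w_true = rng.standard_normal(M)

def loss(N, T, n_test=1000):
    W = rng.standard_normal((M, N)) / np.sqrt(M)
    def draw(n):
        x = rng.standard_normal((n, M)) * np.sqrt(spectrum)
        return x @ W, x @ w_true
    F, y = draw(T)
    theta = np.linalg.solve(F.T @ F + gamma * np.eye(N), F.T @ y)
    Ft, yt = draw(n_test)
    return float(np.mean((Ft @ theta - yt) ** 2))

model_only, data_only, joint = loss(512, 64), loss(64, 512), loss(512, 512)
print(model_only, data_only, joint)  # joint scaling gives the lowest loss
```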

Interestingly, a curated dataset of machine learning systems taken from highly-cited and highly influential papers from 1952–2021 gives strong evidence that skilled practitioners have always been implicitly working in this jointly large-training-set-and-large-model-size equiparameterized regime: plotting the parameter counts vs. training set sizes of the models in this dataset on a log-log scale gives a linear fit with a slope extremely close to unity. We thank Ben Adlam for bringing this point to our attention.