“Scaling Laws for Autoregressive Generative Modeling”, Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya A. Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish (2020-10-28):

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image ↔︎ text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.
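
[To make the “power-law plus constant” form concrete, here is a minimal curve-fitting sketch, assuming the functional form L(x) = L∞ + (x₀⁄x)^α implied by the abstract; the (model size, loss) pairs and initial guesses are illustrative placeholders, not the paper’s data.]

```python
# Minimal sketch: fitting the power-law-plus-constant form L(x) = L_inf + (x0/x)**alpha.
# The (model size, loss) pairs and initial guesses below are illustrative, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, L_inf, x0, alpha):
    """Irreducible loss L_inf plus a reducible power-law term."""
    return L_inf + (x0 / x) ** alpha

N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # model sizes (parameters), illustrative
L = np.array([4.2, 3.5, 3.0, 2.7, 2.55])   # observed losses (nats/token), illustrative

(L_inf, x0, alpha), _ = curve_fit(scaling_law, N, L, p0=[2.0, 1e6, 0.3], maxfev=10_000)
print(f"irreducible loss ≈ {L_inf:.2f} nats/token, reducible-loss exponent ≈ {alpha:.2f}")
```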

The cross-entropy loss has an information theoretic interpretation as S(True) + D_KL(True‖Model), and the empirical scaling laws suggest a prediction for both the true data distribution’s entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e., D_KL) in nats/image for other resolutions.
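
[A short sketch of the entropy/KL reading: once a reducible-loss law of the form (N₀⁄N)^α is fitted, it can be inverted to forecast the model size needed for a target D_KL. The constants N₀ and α below are made-up placeholders, used only to show the inversion and the nats/image → bits/dim conversion for an 8×8 RGB image.]

```python
# Minimal sketch: inverting a fitted reducible-loss law (N0/N)**alpha to forecast the
# model size N needed to reach a target D_KL, and converting nats/image to bits/dim
# for an 8x8 RGB image. N0 and alpha are made-up placeholders, not the paper's fit.
import math

def model_size_for_kl(target_kl_nats, N0, alpha):
    """Solve (N0/N)**alpha == target_kl_nats for N."""
    return N0 / target_kl_nats ** (1.0 / alpha)

N0, alpha = 5e4, 0.25                      # illustrative constants only
for kl in (10.0, 1.0, 0.1):                # target reducible loss, nats/image
    bits_per_dim = kl / math.log(2) / (8 * 8 * 3)
    print(f"D_KL = {kl:4.1f} nats/image ({bits_per_dim:.4f} bits/dim) "
          f"-> N ≈ {model_size_for_kl(kl, N0, alpha):.2e} parameters")
```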

We find a number of additional scaling laws in specific domains: (1) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question “Is a picture worth a thousand words?”; (2) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (3) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
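
[One way to make point (1) concrete: if the caption/image mutual information is estimated as a difference of cross-entropy losses (caption loss without the image minus caption loss given the image), the “thousand words” question reduces to dividing that difference by the per-word cost of text. The helper and all numbers below are hypothetical, not the paper’s measurements.]

```python
# Minimal sketch of the "Is a picture worth a thousand words?" comparison, assuming the
# caption/image mutual information is estimated as a difference of cross-entropy losses:
# (caption loss without the image) minus (caption loss given the image). All names and
# numbers are hypothetical placeholders, not the paper's measurements.

def mutual_info_nats(caption_loss_alone, caption_loss_given_image):
    """Loss reduction from conditioning on the image, in nats per caption."""
    return caption_loss_alone - caption_loss_given_image

caption_loss_alone = 120.0        # nats/caption for a text-only model (illustrative)
caption_loss_given_image = 85.0   # nats/caption when the image is provided (illustrative)
nats_per_word = 3.0               # rough per-word cost of English text (assumption)

info = mutual_info_nats(caption_loss_alone, caption_loss_given_image)
print(f"mutual information ≈ {info:.0f} nats ≈ {info / nats_per_word:.0f} words' worth of text")
```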

…As we increase model and dataset sizes, optimization becomes increasingly efficient, until eventually learning curves begin to merge with the L(D) trend, so that there are no benefits to be gained from training for more than a single epoch [Komatsuzaki2019].

…We have argued that a single neural architecture, the Transformer, can be applied to the generative modeling of images, videos, multimodal data, and math, along with language [Kaplan et al 2020, Brown et al 2020]. We identified common scaling laws for the loss achieved on all data modalities as a function of both model size and compute budget. As in the case of language, these results imply that larger models become more sample-efficient. Furthermore, we found that in some important cases, fine-tuned performance on downstream tasks also follows similar scaling laws. This suggests that trends in the generative modeling loss translate into advantages in practical capabilities.

A greater surprise was the universal trend (Figure 2) for optimal model size as a function of the training compute budget—we did not anticipate that the exponent in N_opt ∝ C^0.7 would be largely independent of the data distribution. This trend implies a dual trend for the number of tokens elapsed during optimized training, as a function of C or N, and leads to the conclusion that larger compute budgets should be “spent” mostly on larger models, rather than much longer training runs. So this lesson from language modeling [Kaplan et al 2020] generalizes. These empirical regularities beg for theoretical explanation—why do these scaling relations hold? The scaling laws also suggest a shift in perspective away from the particularities of neural architectures, loss functions, and training algorithms and towards the broader commonalities that appear when machine learning is studied across a large hierarchy of model, data, and compute scales. Work in ML often involves identifying specific deficiencies in current capabilities and remedying them through the alteration of models and algorithms. Perhaps many capabilities simply lie on a spectrum that can be continuously unlocked through increasing scale, as might be suggested by the meta-learning capabilities of the GPT-3 model [Brown et al 2020].
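
[A back-of-the-envelope sketch of the “dual trend”: combining N_opt ∝ C^0.7 with the standard transformer compute estimate C ≈ 6·N·D from Kaplan et al 2020 implies the optimal token count D grows only as ~C^0.3. The prefactor k below is an illustrative assumption, not a fit from the paper.]

```python
# Back-of-the-envelope sketch of the "dual trend": combining N_opt ∝ C^0.7 with the
# standard transformer compute estimate C ≈ 6*N*D (Kaplan et al 2020) implies the
# optimal token count D grows only as ~C^0.3. The prefactor k is an illustrative
# assumption, not a fit from the paper.

PF_DAY_FLOPS = 1e15 * 86_400   # FLOPs in one petaflop/s-day

def optimal_allocation(C_pf_days, k=1.3e9, p=0.7):
    """Hypothetical N_opt = k * C^p (C in petaflop-days); tokens D = C / (6*N)."""
    N = k * C_pf_days ** p
    D = (C_pf_days * PF_DAY_FLOPS) / (6 * N)
    return N, D

for C in (1, 10, 100, 1000):
    N, D = optimal_allocation(C)
    print(f"C = {C:4d} pf-days -> N_opt ≈ {N:.2e} params, D_opt ≈ {D:.2e} tokens")
```

[Under these assumptions, each 10× in compute buys roughly 5× more parameters (10^0.7) but only ~2× more tokens (10^0.3), which is the sense in which extra compute should be “spent” mostly on model size.]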

[Perlis: “39. Re graphics: A picture is worth 10K words—but only those to describe the picture. Hardly any sets of 10K words can be adequately described with pictures.” cf. emoji writing exercises like Book from the Ground/Emoji Dick; “To Understand Language is to Understand Generalization”. DL newbies are always shocked at how large LLMs are compared to image models or models in other modalities like DRL. The 2nd-most interesting problem in philosophy of mind, language, & epistemology right now is the asymmetry between language models and everything else: LMs transfer to other domains (eg. SayCan!), but not vice-versa.]

Figure 1: Smooth scaling of reducible loss across domains—We show power-law scaling laws for the reducible loss L−L∞ as a function of compute, where the irreducible loss L∞ is a fitted domain-dependent constant. Under plausible assumptions concerning the infinite data and compute limits, the irreducible loss estimates the entropy of the underlying data distribution, while the reducible loss approximates the KL divergence between the data and model distributions. In the case of language we use results from [BMR+20], and only show the full loss L.
Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with N_opt(C), with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while N_opt(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language.
Figure 2: Optimal model size is consistent across domains—We display the optimal model size N_opt as a function of the training compute budget C. Not only does N_opt(C) behave as a power-law, but the behavior is remarkably similar for all data modalities.
Figure 31: Q&A—We show the progression of simple Q&A capabilities of GPT-3 family models as we increase the parameter count [BMR+20]. We ask the model who the first and second presidents of the United States were. · Tiny models appear to have trouble understanding the question, and don’t place any substantial probability on the correct answer. Larger models understand that we’re requesting a US president, but fail to understand that the “second president” and “first president” are different requests, placing most of their weight for both questions on “George Washington”. Only the largest models understand both aspects of the questions, answering both correctly.

[cf.: Figure 3 & Figure 11.]