“Scaling Laws for Neural Language Models”, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020-01-23):

We study empirical scaling laws for language model performance on the cross-entropy loss.

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.
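For concreteness, here is a minimal sketch (mine, not the paper's code) of evaluating the paper's combined model-size/dataset-size fit, L(N, D) = [(N~c~/N)^(α~N~/α~D~) + D~c~/D]^(α~D~), using the fitted constants the paper reports (α~N~ ≈ 0.076, α~D~ ≈ 0.095, N~c~ ≈ 8.8×10^13^, D~c~ ≈ 5.4×10^13^):

```python
# Fitted constants reported in Kaplan et al 2020:
ALPHA_N = 0.076   # model-size exponent
ALPHA_D = 0.095   # dataset-size exponent
N_C = 8.8e13      # critical non-embedding parameter count
D_C = 5.4e13      # critical token count

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with
    `n_params` non-embedding parameters trained on `n_tokens` tokens,
    per the paper's combined L(N, D) fit."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# e.g. a GPT-2-sized model (~1.5e9 params) on ~2.2e10 tokens:
print(f"{loss(1.5e9, 2.2e10):.3f} nats/token")
```

Holding D fixed while growing N reproduces the overfitting behavior the equations describe: the loss improves with model size until the D~c~/D term dominates.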

Larger models are substantially more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping substantially before convergence.
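A rough numeric illustration of what that allocation implies. The L(C~min~) fit and the 0.73 exponent below are the paper's; the prefactor used for N~opt~ is an assumption of mine, chosen only to produce plausible magnitudes:

```python
def loss_from_compute(c_pf_days: float) -> float:
    """Loss attainable with optimally-allocated compute, per the paper's
    L(C_min) = (C_c / C_min)^0.050 fit, with C_c ~ 3.1e8 PF-days."""
    return (3.1e8 / c_pf_days) ** 0.050

def optimal_model_size(c_pf_days: float, k: float = 1.3e9) -> float:
    """Compute-optimal non-embedding parameter count, N ~ C_min^0.73.
    The exponent 0.73 is the paper's; the prefactor `k` is an
    illustrative assumption, not a value from the paper."""
    return k * c_pf_days ** 0.73

for c in (1.0, 1e2, 1e4):  # compute budgets in petaflop-days
    print(f"C={c:8.0f} PF-days -> L ~ {loss_from_compute(c):.2f} nats/token, "
          f"N_opt ~ {optimal_model_size(c):.2e} params")
```

Most of any additional budget goes into model size; the tokens processed and serial steps grow far more slowly, which is why compute-efficient training stops well short of convergence.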

Figure 1: Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all 3 factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two. [The distinct bump at the tiniest models is probably due to the lack of induction heads.]
Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations for L(C~min~) and L(D) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
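One can locate this intersection numerically. In the sketch below the exponents are hedged from the paper's fits (the paper finds data needs during compute-efficient training grow slowly, roughly D ∝ C^0.26^), while the data-growth prefactor `d0` is a made-up round number, so the crossover it prints is illustrative only:

```python
# Constants hedged from the paper's fits:
ALPHA_C, C_C = 0.050, 3.1e8    # L(C_min) = (C_C / C)^ALPHA_C, C in PF-days
ALPHA_D, D_C = 0.095, 5.4e13   # L(D) = (D_C / D)^ALPHA_D, D in tokens

# Illustrative assumption: D(C) = d0 * C**0.26; the exponent is roughly
# the paper's fitted rate, the prefactor d0 is invented for this sketch.
d0 = 2e10

def gap(log_c: float) -> float:
    c = 10 ** log_c
    l_compute = (C_C / c) ** ALPHA_C                  # compute-trend loss
    l_data = (D_C / (d0 * c ** 0.26)) ** ALPHA_D      # best loss the data allows
    return l_compute - l_data

# Bisection on log10(C): the two trends cross where gap() changes sign,
# i.e. where the compute trend predicts a loss the data cannot support.
lo, hi = 0.0, 12.0
for _ in range(60):
    mid = (lo + hi) / 2
    if gap(lo) * gap(mid) <= 0:
        hi = mid
    else:
        lo = mid
print(f"trends cross near C ~ 1e{lo:.1f} PF-days")
```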
3.2.1: Comparing to LSTMs and Universal Transformers: In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count N. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns. [see Khandelwal et al 2018 on the rapid forgetting of RNNs, SSMs as a possible optimization fix, and “Scaling Laws for Acoustic Models” for another direct LSTM-RNN vs Transformer comparison.]
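The shape of the Appendix D.5 fits is straightforward to replicate. A sketch on synthetic numbers, assuming the per-token loss follows an additive floor plus a power law in context position T (this functional form is my assumption for illustration, not a quotation of the appendix):

```python
import numpy as np

# Hypothetical per-token losses by context position (synthetic numbers,
# just to show the fitting procedure, not measurements from the paper).
positions = np.arange(1, 1025)
losses = 3.0 + 2.5 * positions ** -0.25   # pretend measured curve

# Assuming the floor is known, the power-law fit L(T) = floor + a * T^(-p)
# becomes linear in log-log space:
floor = 3.0
slope, intercept = np.polyfit(np.log(positions), np.log(losses - floor), 1)
print(f"fitted exponent p ~ {-slope:.3f}")   # ~0.25 by construction
```

A steeper fitted exponent means the model extracts more benefit from each additional token of context, which is the sense in which larger models "quickly recognize patterns".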
Appendix A: Summary of Power Laws
Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with N~opt~(C), with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while N~opt~(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language. [This is an updated scaling power-law summary from Henighan et al 2020.]
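The fits in this table follow equation (1.1) of Henighan et al 2020: a reducible power-law term plus an irreducible loss, L(x) = L~∞~ + (x~0~/x)^α^, where L~∞~ is the estimate of the data's intrinsic entropy. A minimal sketch of evaluating such a fit (the constants below are placeholders for illustration, not values from the table):

```python
def scaling_fit(x: float, l_inf: float, x0: float, alpha: float) -> float:
    """Equation (1.1) of Henighan et al 2020: a reducible power-law loss
    plus an irreducible term l_inf (nats/token)."""
    return l_inf + (x0 / x) ** alpha

# Placeholder constants for illustration only (not the table's values):
for n in (1e6, 1e8, 1e10):
    print(f"x={n:.0e} -> L = {scaling_fit(n, l_inf=0.5, x0=1e13, alpha=0.08):.3f}")
```

For language the reducible term still dominates at all model sizes studied, which is why no meaningful estimate of L~∞~ (the entropy of natural language) falls out of the fit.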