“Scaling Laws for Transfer”, Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish (2021-02-02):

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero.

We calculate the effective data “transferred” from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
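To make that measurement concrete, here is a minimal sketch of the procedure under stated assumptions: given a from-scratch loss-versus-data curve for one model size, invert it at the fine-tuned model's loss to find how much data the from-scratch model would have needed, then subtract the fine-tuning dataset size. All numbers below are illustrative placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical from-scratch learning curve for one fixed model size:
# python characters trained on -> python cross-entropy loss.
# (Illustrative placeholder values, not data from the paper.)
from_scratch_chars = np.array([1e5, 1e6, 1e7, 1e8, 1e9])
from_scratch_loss = np.array([3.0, 2.4, 1.9, 1.5, 1.2])

def effective_data_transferred(finetuned_loss, finetune_chars):
    """D_T: additional python characters a from-scratch model of the same
    size would have needed to match the fine-tuned model's loss."""
    # Invert the from-scratch curve (interpolating in log-data space) to find
    # the total data D_E at which it reaches `finetuned_loss`.
    log_d_e = np.interp(finetuned_loss,
                        from_scratch_loss[::-1],          # np.interp needs increasing x
                        np.log10(from_scratch_chars)[::-1])
    d_e = 10.0 ** log_d_e
    return d_e - finetune_chars                           # D_T = D_E - D_F

# e.g. a model fine-tuned on 3e5 python characters that reaches loss 1.8:
print(f"{effective_data_transferred(1.8, 3e5):.2e}")
```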

We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.

The effective data transferred is well-described by a power-law in the low-data regime: We use DT to represent the effective data transferred, i.e. the amount of additional python data that a model of the same size trained on only python would have needed to achieve the same loss on python as a model pre-trained on language. Our notation is indicated visually in Figure 1. The scaling law for transfer in equation 1.1 is at the core of many key insights and predictions in this work. We find the simplicity of this result very intriguing:

DT = effective data transferred = k (DF)^α (N)^β

where N is the number of non-embedding model parameters, and DF is the size of the fine-tuning dataset.
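As a sketch of how equation 1.1 is used, the snippet below evaluates DT and the resulting effective data multiplier (DF + DT)/DF, the factor by which pre-training effectively multiplies the fine-tuning dataset; the constants k, alpha, and beta here are illustrative placeholders, not the paper's fitted values (the actual fit is shown in Figure 2).

```python
def effective_data_transferred(d_f, n, k=1.0e4, alpha=0.2, beta=0.4):
    """Equation 1.1: D_T = k * (D_F)^alpha * (N)^beta.

    k, alpha, beta are placeholder values, NOT the paper's fitted constants.
    d_f: fine-tuning dataset size (python characters); n: non-embedding parameters.
    """
    return k * d_f**alpha * n**beta

def effective_data_multiplier(d_f, n, **fit):
    """(D_F + D_T) / D_F: factor by which pre-training effectively
    multiplies the fine-tuning dataset."""
    return (d_f + effective_data_transferred(d_f, n, **fit)) / d_f

# e.g. a 40M-parameter model fine-tuned on 3e5 python characters
# (order-of-magnitude illustration only; the result depends on the true fit):
print(effective_data_multiplier(3e5, 40e6))
```

With the paper's actual fitted constants, this is the kind of calculation behind multipliers like the ~1000× example quoted in Figure 1.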

Figure 1: We display the performance of a 40M parameter Transformer model on python, both trained from scratch on python and pre-trained on text then fine-tuned on python. DT is the number of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as a fine-tuned model. In the labeled example, we see that for a 40M parameter transformer fine-tuned on 3e5 characters, DT is ~1000× bigger than DF. The less fine-tuning data is available, the more pre-training helps.
Figure 2: In the low-data regime, we observe a good fit for over 4 orders of magnitude in model size and 3 orders of magnitude in fine-tuning dataset size. The fit equation is shown above in terms of DT for simplicity, but the fractional form is given by equation B.2. We show the omitted high data regime points in Appendix D. Details for the approach used to generate these fits are shown in Appendix C.