“Extrapolating GPT-N Performance”, Lukas Finnveden, 2020-12-18:

[Improving estimates of Transformer scaling curves: the graph usually shown for how GPT-3 model scaling leads to better performance on benchmark tasks is misleading, because the tasks have different ceilings & floors, and because the smaller models were trained for an unfair amount of time—every model was trained on the same (very large) fixed amount of text data, even though this is extremely wasteful of FLOPs, is not how practitioners would want to train models, and results in the smallest models performing much better than they ‘ought’ to. Finnveden rescales each benchmark task to a 0–100% scale (random–perfect), and considers models trained in compute-optimal fashion.
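The rescaling step can be sketched as follows (the specific scores and baselines here are illustrative, not Finnveden’s actual numbers):

```python
def rescale(raw_score, random_floor, ceiling=1.0):
    """Map a raw benchmark score onto a 0-100% scale, where
    0% = the random-guessing baseline and 100% = perfect performance."""
    return 100 * (raw_score - random_floor) / (ceiling - random_floor)

# e.g. a 4-way multiple-choice task: random guessing already scores 25%,
# so a raw accuracy of 55% is only 40% of the way from random to perfect.
print(rescale(0.55, random_floor=0.25))  # 40.0
```

Without this normalization, tasks with high chance baselines (e.g. binary-choice tasks starting at 50%) look artificially easy compared to tasks where random guessing scores near 0%.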

Plotting against loss.

When he re-analyzes the reported benchmark performance, he finds that GPT-3 scaling is far smoother in model size than the original graphs would indicate, with few exceptions. (Two of these, I believe, are simply BPE-caused problems; the third benchmark was adversarially collected to target language model weaknesses, and GPT models may just be starting to solve it.)

Extrapolating.

Finnveden further considers extrapolating the scaling law and cross-referencing with model sizes, budgets, and estimated limits on dataset sizes.]
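The crudest version of such an extrapolation can be sketched as a straight-line fit of normalized performance against log model size; the data points and coefficients below are made up for illustration, not Finnveden’s fits:

```python
import numpy as np

# Hypothetical (parameter count, normalized benchmark %) points,
# purely illustrative -- not Finnveden's actual data.
sizes = np.array([1.3e9, 6.7e9, 13e9, 175e9])
scores = np.array([20.0, 35.0, 42.0, 60.0])

# Fit normalized score as linear in log10(model size),
# the simplest smooth trend through the points.
slope, intercept = np.polyfit(np.log10(sizes), scores, 1)

def predict(n_params):
    """Extrapolated normalized score for a model of n_params parameters."""
    return slope * np.log10(n_params) + intercept

# Extrapolate to a hypothetical 10-trillion-parameter model.
print(predict(1e13))
```

Note that a linear-in-log fit eventually exceeds 100%, which is one reason to instead fit a curve that saturates at the task ceiling (and to plot against loss rather than raw parameter count) before cross-referencing against compute budgets and dataset limits.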

Conclusion: