[Improving estimates of Transformer scaling curves: the graph usually shown for how GPT-3 model scaling leads to better performance on benchmark tasks is misleading, because the tasks have different ceilings & floors, and because the smaller models were trained for an unfair amount of time—every model was trained on the same (very large) fixed amount of text data, even though this is extremely wasteful of FLOPs, not how practitioners would want to train models, and results in the smallest models performing much better than they ‘ought’ to. Finnveden rescales each benchmark task to 0–100% (random–perfect), and considers models trained in compute-optimal fashion.
Plotting against loss.
When he re-analyzes the reported benchmark performance, he finds that GPT-3 scaling is far smoother in model size than the original graphs would indicate, with few exceptions. (Two of the exceptions are explained, I believe, as simply BPE-caused problems; the third was adversarially collected to target language model weaknesses, and GPT models may just be starting to solve them.)
Extrapolating.
Finnveden further considers extrapolating the scaling law and cross-referencing with model sizes, budgets, and estimated limits on dataset sizes.]
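The rescaling step is a simple linear normalization: map chance-level accuracy to 0 and perfect accuracy to 1, so tasks with different random baselines become comparable. A minimal sketch (the function name and interface are illustrative, not from the post):

```python
def rescale(raw_acc, random_baseline, ceiling=1.0):
    """Map raw benchmark accuracy onto a 0-1 scale where 0 = chance
    performance and 1 = the task's ceiling (perfect performance)."""
    return (raw_acc - random_baseline) / (ceiling - random_baseline)

# A 4-way multiple-choice task has a random baseline of 0.25, so a raw
# accuracy of 62.5% is exactly halfway between chance and perfect:
print(rescale(0.625, random_baseline=0.25))  # 0.5
```

This removes the artifact where a task with a 50% random baseline appears to show "better" absolute performance than a task with a 25% baseline at the same level of real capability.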
Conclusion:
On benchmark performance, GPT-3 seems to be in line with performance predicted by smaller sizes, and doesn’t seem to particularly break or accelerate the trend…
Close-to-optimal performance on these benchmarks seems like it’s at least ~3 orders of magnitude compute away (costing around $1b at current prices). This means that I’d be somewhat surprised if a 100× scaling brought us there immediately; but another 100× scaling after that might do it (for reference, a 10,000× increase in compute would correspond to a bit more than 100× increase in size, which is the difference between GPT-2 and GPT-3). If we kept scaling these models naively, I’d think it’s more likely than not that we’d get there after increasing the training FLOP by ~5–6 orders of magnitude (costing $100b–$1t at current prices).
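The compute-to-size arithmetic above can be checked under the assumption (roughly Kaplan-style compute-optimal scaling, with data scaled in proportion to parameters) that training compute C grows as roughly N², so an x-fold compute increase buys about a √x-fold parameter increase:

```python
import math

def size_multiplier(compute_multiplier):
    """Rough parameter-count multiplier bought by a given compute multiplier,
    assuming training compute scales as parameters squared (an assumption,
    not an exact law)."""
    return math.sqrt(compute_multiplier)

print(size_multiplier(10_000))  # 100.0 -- i.e. 10,000x compute ~ 100x size
print(175e9 / 1.5e9)            # ~117x: GPT-2 (1.5b params) to GPT-3 (175b)
```

Under this approximation, 10,000× compute corresponds to ~100× parameters, which matches the GPT-2 → GPT-3 size jump of roughly 117×.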
Taking into account both software improvements and potential bottlenecks like data, I’d be inclined to update that downwards, maybe an order of magnitude or so (for a total cost of ~$10–$100b). Given hardware improvements in the next 5–10 years, I would expect that to fall further to ~$1–$10b.
I think this would be more than sufficient for automating the tasks mentioned above—though rolling out changes in practice could still take years. (Note that some of these tasks could already be automated with today’s model sizes, if sufficient engineering work were spent on fine-tuning them properly. I’m making the claim that automation will quite easily be doable by this point, if it hasn’t already been done.)
Assuming that hardware and algorithmic progress have reduced the cost of inference by at least 10×, this will cost less than 1 cent per word.
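A back-of-the-envelope check of the sub-1-cent claim, using the standard ~2 FLOPs-per-parameter-per-token forward-pass estimate. The price per FLOP and tokens-per-word ratio below are hypothetical placeholders (not figures from the post), chosen to represent costs after a ~10× reduction:

```python
params = 175e9                # GPT-3-sized model
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token (forward pass)
tokens_per_word = 1.33        # rough BPE tokens-per-word ratio (assumption)
usd_per_flop = 1e-17          # hypothetical: $1 per 10^17 FLOPs after cost cuts

cents_per_word = flops_per_token * tokens_per_word * usd_per_flop * 100
print(cents_per_word)  # well under 1 cent per word
```

Even with the placeholder price off by an order of magnitude in either direction, the result stays comfortably below 1 cent per word for a GPT-3-scale model.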
I think this would probably not be enough to automate the majority of human economic activity or otherwise completely transform society (but I think we should be investing substantial resources in preparing for that eventuality).
If I adopt the framework from Ajeya Cotra’s draft report—where a model with the right number of parameters can become ~human-equivalent at tasks with a certain horizon length if trained on the right number of data points of that horizon length—I’m inclined to treat these extrapolations as a guess for how many parameters will be required for ~human-equivalence. Given that the median parameter count in Cotra’s model is close to my best guess of where near-optimal performance is achieved, the extrapolations do not contradict the model’s estimates, and constitute some evidence that the median is roughly right.