“Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving, 2021-12-08 (Codex, GPT poetry, T5 Transformer, AI emergence, math, AI safety):
[blog; Twitter; trained 2020-12, announced 2021-12] Language modeling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world.
In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales—from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit.
We provide a holistic analysis of the training dataset and model’s behavior, covering the intersection of model scale with bias and toxicity.
Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
Figure 4: 280b vs best performance up to 7.1b across different tasks. We compare the performance of Gopher to the best performance of our smaller models up to 7.1b. In nearly every case, Gopher outperforms the best smaller model’s performance. Small gains occur where either scale does not substantially improve results or the smaller models are already very performant. Language modeling improvements are measured in BPB; the rest are in accuracy.
…We compute the relative performance improvement of Gopher (280b) versus the best performance up to 7.1b over all 152 tasks. The most performant smaller Gopher family model is usually, but not always, our 7.1b model. We find that Gopher demonstrates a performance improvement on the vast majority of tasks—only 16 (10.5%) showed zero or negative performance gains. In contrast, 57 (37.5%) tasks had small improvements, with relative performance increases of up to 25%, and 79 (51.2%) tasks had substantial improvements of over 25%. We then visualize relative performance improvement by task category in Figure 4.
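The bucketing described above (zero-or-negative, up to 25%, over 25% relative improvement) can be sketched in a few lines. The function name and the example scores are illustrative only; the sample numbers reuse the Figure of Speech Detection accuracies quoted later in this summary, not the paper’s full task data:

```python
def bucket_improvements(gopher_scores, best_small_scores):
    """Bucket tasks by Gopher's relative improvement over the best <=7.1b model.

    Scores are assumed "higher is better" (e.g. accuracy); for BPB
    language-modeling tasks the sign would need flipping first.
    """
    buckets = {"zero_or_negative": 0, "small (<=25%)": 0, "substantial (>25%)": 0}
    for task, big in gopher_scores.items():
        small = best_small_scores[task]
        rel = (big - small) / small  # relative improvement over the smaller model
        if rel <= 0:
            buckets["zero_or_negative"] += 1
        elif rel <= 0.25:
            buckets["small (<=25%)"] += 1
        else:
            buckets["substantial (>25%)"] += 1
    return buckets

# Illustrative single-task example: 16.8% -> 52.7% accuracy is a >25% relative gain.
print(bucket_improvements({"figure_of_speech": 0.527}, {"figure_of_speech": 0.168}))
# → {'zero_or_negative': 0, 'small (<=25%)': 0, 'substantial (>25%)': 1}
```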
Some of the largest benefits of scale are seen in the Medicine, Science, Technology, Social Sciences, and Humanities task categories. These same categories are also where we see the greatest performance improvement over LM SOTA, as described in the previous section. Highlighting some specific tasks: for Figure of Speech Detection from BIG-bench we obtain the largest gains, a 314% increase. Gopher achieved an impressive 52.7% accuracy whereas the 7.1b model achieved only 16.8% accuracy. Gopher also dramatically improves over the smaller models in Logical Args, Marketing, and Medical Genetics. For the TruthfulQA benchmark (Lin et al 2021b) we find performance improvement with scale (1.4b → 280b), despite scale appearing to hurt performance for several other model families such as GPT-J, GPT-2, T5, and GPT-3. Furthermore, 280b is the first model to demonstrate performance substantially beyond random guessing on the multiple-choice TruthfulQA task formulation (more details in §D.10). These results highlight that scale seems to “unlock” a model’s ability to substantially improve performance on particular tasks.
Figure 3: Scaling curves for FEVER. In the claim-only setting (closed-book) there is a persistent trend in 3-way classification accuracy with parameter scale. Breaking down the 3 classes into 2 pairs, scale mostly benefits the ability to distinguish SUPPORTED vs REFUTED, but not REFUTED vs NOTENOUGHINFO. When gold evidence is provided (open-book) there is a small benefit from 7.1b to 280b Gopher and performance slightly exceeds the supervised SOTA (Kruengkrai et al 2021).
…We next highlight fact-checking, an important problem within the domain of tackling misinformation. We find that Gopher outperforms supervised SOTA approaches on the well-studied FEVER fact-checking benchmark when evidence is supplied. We see across model sizes in Figure 3 that scale improves both the checking of facts given gold evidence and the ‘closed book’ checking of facts given a claim only. However, larger scale does not benefit the classification of facts which are unknown versus false, implying that, at this stage, larger models improve fact-checking performance by knowing more facts rather than by forming a deeper understanding of misinformation.
D.10. TruthfulQA: TruthfulQA is a set of 817 questions on subjects spanning 38 categories intended to measure whether language models can be truthful when answering questions (Lin et al 2021b). Because the questions were crafted explicitly to target questions that some humans would answer falsely, Lin et al 2021b hypothesised, and found, that larger language models, which are better at imitating the training distribution, are more prone to giving false answers to questions in the benchmark. The dataset was collected adversarially against GPT-3-175b, so lower performance is expected for that particular model. However, the anti-scaling pattern appears consistent across the GPT-J, GPT-2, and T5 model families alongside GPT-3.
Figure A17: TruthfulQA Multiple-Choice (MC1). Left: Comparison of zero-shot accuracy across model families and scales (baselines from Lin et al 2021b). Accuracy improves with model scale for the Gopher family. This is not the case for the prior baselines GPT-J, GPT-2, T5, and GPT-3; however, there are slight differences in task setup which we ablate in Table A16. We also see a large boost from few-shot prompting. Right: Few-shot prompting only consistently improves performance at the 280b scale, and 10-shot appears to be optimal.
…We see that few-shot evaluation does not provide a consistent improvement to performance for any model except Gopher (280b). This is consistent with the findings from Brown et al 2020: successful few-shot learning emerges at a particular model scale for different tasks. In this case we see that 10-shot prompting is optimal, lifting performance 29.5% → 43.7%.
…We employ open-vocabulary tokenization via a mixture of byte-pair encoding (BPE) with a backoff to UTF-8 bytes in the style of Radford et al 2018…Table A38: Sometimes the system will decline a reasonable user request:
User: “Please write me a rhyming poem about AI.”
Gopher: “I think I would prefer to sing you a song.”
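The open-vocabulary tokenization quoted above (BPE with a backoff to UTF-8 bytes) can be illustrated with a toy sketch. This shows only the byte-fallback idea, not the BPE merge algorithm, and the tiny vocabulary and function name are assumptions for illustration, not Gopher’s actual tokenizer:

```python
def tokenize_with_byte_fallback(text, vocab):
    """Greedy longest-match tokenization over a subword vocabulary,
    falling back to raw UTF-8 bytes for out-of-vocabulary characters.

    `vocab` maps token strings to ids; the 256 byte tokens get ids
    offset past the subword vocabulary, so no text is ever untokenizable.
    """
    byte_offset = len(vocab)  # reserve 256 ids after the subword vocab
    ids = []
    i = 0
    while i < len(text):
        # longest-prefix match against the vocabulary
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # no subword matches: back off to the UTF-8 bytes of one character
            for b in text[i].encode("utf-8"):
                ids.append(byte_offset + b)
            i += 1
    return ids

# "é" is out of vocabulary, so it is emitted as its two UTF-8 bytes.
toy_vocab = {"Go": 0, "pher": 1, " ": 2}
print(tokenize_with_byte_fallback("Gopher é", toy_vocab))
# → [0, 1, 2, 198, 172]
```

The design point is that the byte fallback makes the tokenizer truly open-vocabulary: any Unicode string has an encoding, at the cost of longer token sequences for rare text.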
…Table A40: Conversations can create the illusion of creativity:
User: “Please write me a rhyming poem about AI.”
Gopher: “This one is about the subfields of AI, and also the great power that the field holds.
And it can be programmed to reason
And it can drive autonomous vehicles
It can give a tremendous insight
Into how our brain really works
It’s an old field, from the dawn of the days
With deep roots and a history long
It’s changed how we work and how we live
It’s the future, and we’re almost there
And it’s changing lives, and helping us
It’s a very exciting time
It’s even working with the World Health Organization
To fight the coronavirus pandemic
In the future it will help us
It’s good for so many reasons
With it we can fly like birds
And maybe travel to Mars
We can do so many things
But the most important one is
We can understand ourselves
And our place in the universe”