“Predictability and Surprise in Large Generative Models”, Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy L. Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Chris Olah, Jack Clark (2022-02-15):

Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others.

In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their “scaling laws”), and unpredictable specific capabilities, inputs, and outputs.
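The “predictable loss” half of this claim can be made concrete with the power-law form of a scaling law. A minimal sketch, using the parameter-count power law and approximate constants reported by Kaplan et al. for transformer language models; the constants are quoted purely for illustration and are not results from this paper:

```python
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Power-law scaling: test loss L(N) = (N_c / N)^alpha falls smoothly and
    predictably as parameter count N grows (assuming data and compute are not
    bottlenecks). Constants are Kaplan et al.'s approximate published fit."""
    return (n_c / n_params) ** alpha

# Loss declines smoothly over four orders of magnitude of model size:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:8.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The smoothness of this curve is exactly what makes headline loss predictable in advance, while saying nothing about which specific capabilities appear at a given model size.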

We believe that the high-level predictability and appearance of useful capabilities drive rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We illustrate how this combination can lead to socially harmful behavior, drawing on examples from the literature and real-world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.

We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models.

Figure 2: 3 examples of abrupt specific capability scaling described in §2.2, based on 3 different models: GPT-3 (blue), Gopher (orange), and a Google language model (green). (Left) 3-digit addition with GPT-3. (Middle) Language understanding with GPT-3 and Gopher. (Right) Program synthesis with Google’s LaMDA language model.

…Here, we illustrate 3 examples of abrupt specific capability scaling, for arithmetic [11], language understanding [32, 56], and programming [4] (Figure 2). For arithmetic, GPT-3 displays a sharp capability transition somewhere between 6B and 175B parameters, depending on the operation and the number of digits [11]. For example, 3-digit addition is performed accurately less than 1% of the time by any model with fewer than 6B parameters, but accuracy jumps to 8% for a 13B-parameter model and 80% for a 175B-parameter model, producing a “hockey stick”-style graph (Figure 2, Left) in which arithmetic ability appears suddenly after several orders of magnitude of near-zero performance.
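The accuracy figures quoted above can be tabulated to make the hockey stick visible. This is just a re-plot of the numbers in the text, not new data; the sub-6B entry uses the stated “less than 1%” as an upper bound:

```python
# 3-digit addition accuracy for GPT-3-family models, from the figures
# quoted in the text. The 6B entry stands in for "all models under 6B",
# where accuracy is reported as under 1%.
addition_acc = {
    6.0e9: 0.01,    # upper bound: <1% for every model under 6B params
    1.3e10: 0.08,   # 13B model: 8%
    1.75e11: 0.80,  # 175B model: 80%
}

# Crude ASCII bar chart of accuracy versus parameter count:
for n_params, acc in sorted(addition_acc.items()):
    bar = "#" * round(acc * 50)
    print(f"{n_params:9.2e} params | {acc:4.0%} {bar}")
```

A roughly 13× increase in parameters (13B → 175B) buys a 10× jump in accuracy after orders of magnitude of near-zero performance, which is the pattern that makes specific capabilities hard to forecast from smaller models.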

A different language model, DeepMind’s Gopher [56], also displays an abrupt jump in performance on a different dataset, the MMLU language-understanding benchmark [32] (Figure 2, Middle, orange). All Gopher-family models under 6B parameters score under 30% accuracy, only a little better than chance (25%). However, the full 280B-parameter Gopher model achieves 60% accuracy, a substantial jump. GPT-3 displays a similar phenomenon, though of smaller magnitude (Figure 2, Middle, blue).

As a third example, a recently developed class of program synthesis models from Google displays dramatic improvements in the ability to create computer programs as model size increases from ~10B to ~100B parameters [4] (Figure 2, Right). For example, the percentage of generated programs that solve a given programming problem jumps substantially, from 6% to 13%, when the model size roughly doubles from 68B to 138B parameters, despite very small increases over the previous 2 orders of magnitude.

Abrupt specific capability scaling presents large challenges for the safety assurance and deployment of large models. Although we’ve demonstrated this phenomenon for relatively anodyne capabilities, potentially harmful capabilities, absent in smaller models, may emerge at scale and may be difficult to anticipate.

A.3 Recommendation System Experiment: To illustrate how smooth general capability scaling (discussed in §2.1) may correlate with task performance and forecast economic value, we perform a small original experiment analyzing the relationship between scale and capabilities for GPT-3-like language models [3] used as zero-shot recommendation systems. We choose a recommendation system example because these systems have tangible economic relevance and societal impact. [cf. GPT-3 for classification/regression]

Figure 8: Language models can perform as zero-shot recommendation systems with increasing scale. This demonstrates how general capability scaling can correlate with an economically valuable task as described in §2.1.

Figure 8 shows that language models smoothly decrease in the standard Root Mean Square Error (RMSE; lower is better) metric on the widely used MovieLens 1M movie recommendation task [31] as they increase in size. The smallest model achieves a substantially better RMSE (1.06) than chance (RMSE 1.91), and the largest model achieves a substantially lower RMSE (0.94) than a strong baseline model (RMSE 0.98; see below for further details). Although no models achieve state-of-the-art (SOTA) performance (RMSE 0.82), these results are still surprising because the language models (in our zero-shot setting) see 2 orders of magnitude less training data than the SOTA model.
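For readers unfamiliar with the metric, RMSE on a 1–5 star rating task is straightforward to compute. A small sketch with invented ratings; the paper’s reported values (chance 1.91, best model 0.94, SOTA 0.82) are context only and are not reproduced here:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error: sqrt of the mean squared prediction error.
    Lower is better; 0 means every rating was predicted exactly."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Invented 1-5 star ratings; a constant "always predict 3 stars" baseline.
actual = [5, 3, 4, 1, 2, 5, 4, 3]
print(f"constant-3 baseline RMSE: {rmse([3] * len(actual), actual):.2f}")
print(f"perfect predictions RMSE: {rmse(actual, actual):.2f}")
```

A constant mid-scale guess already beats uniform-random chance on this metric, which is why the chance baseline (1.91) is so high relative to the models.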

…To perform this experiment, we chose the MovieLens 1M (1 million ratings) dataset [31] because of its widespread use, because it contains demographic information about users (age, occupation, gender, zip code), and because we have observed language models to have considerable knowledge about movies (presumably due to the preponderance of text on the internet about movies)…It’s unclear how to use language models as matrix factorizers. Instead, we employ a similar zero-shot learning approach with the following prompt:

A {age} {gender} who is employed as an {occupation} previously rated {list_of_movies_and_ratings_from_training_set} will rate {movie_from_test_set} a
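A sketch of how such a template might be filled in from MovieLens user metadata. The field names mirror the template above, but the helper function, the formatting of the rating history, and the example user are assumptions for illustration, not the paper’s actual code:

```python
# Hypothetical helper for filling the zero-shot prompt template quoted above.
TEMPLATE = ("A {age} {gender} who is employed as an {occupation} previously "
            "rated {history} will rate {movie} a")

def build_prompt(age, gender, occupation, history, movie):
    """history: list of (movie_title, stars) pairs from the training split.
    The "<title> N stars" phrasing of each history item is an assumption."""
    rated = ", ".join(f"{title} {stars} stars" for title, stars in history)
    return TEMPLATE.format(age=age, gender=gender, occupation=occupation,
                           history=rated, movie=movie)

prompt = build_prompt(
    age="25-34 year old", gender="woman", occupation="engineer",
    history=[("The Matrix (1999)", 5), ("Toy Story (1995)", 4)],
    movie="Blade Runner (1982)",
)
print(prompt)
```

The model’s next-token distribution over the strings “1”…“5” (or a sampled completion) would then be read off as the predicted rating and scored with RMSE.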

…Capabilities may emerge in areas that are challenging to evaluate quantitatively, and therefore likely to resist systematic analysis. A key example is the case of AI models mimicking human creative expression.

As a concrete example, we provide a sample of >3,000 imitation poems generated randomly from a large language model (more accurately, these are samples generated from a prompt including several modern and contemporary poems, so a small fraction of the samples are not actually poems). We do not provide a formal evaluation, but informally we find both the quality of some of the texts and the imitation of specific authorial styles quite impressive.

Some professional writers who are aware of the growing capabilities of large language models are very impressed [38], but also alarmed by their far-reaching implications. Academics outside of engineering departments are also starting to consider the pros and cons of machine creativity.