“Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition”, Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun, 2024-02-23:

Recent studies have uncovered intriguing phenomena in deep learning, such as double descent, grokking, and emergent abilities in large language models, which challenge human intuition and are crucial for a deeper understanding of neural models.

In this paper, we present a comprehensive framework that provides a unified view of these 3 phenomena, focusing on the competition between memorization & generalization circuits (Varma et al 2023). This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes.

Our framework delineates 4 distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results.

Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithm tasks can be turned into emergent abilities. This offers a novel perspective to understand emergent abilities in large language models.

Figure 1: The increasing memorization capacity and decreasing critical dataset size for larger models split the figure into 4 distinct zones including progression, ungrokking, grokking and semi-grokking. Each zone will show a specific training dynamic.

…The reverse relationships with model size inevitably lead to an intersection point of the two curves (black star in Figure 1). At this intersection, the model’s memorization limit is reached, and the efficacy of this extreme memorization circuit is comparable to its generalization circuit. Besides, the two curves create 4 distinct zones on the graph, each reflecting a unique training dynamic in our experiments as shown in Figure 2.

  1. Excess training data or a reduced model size prevents complete memorization of the training data, causing the model to first memorize part of the training data with zero validation performance, and then generalize to part of the validation data while train accuracy continues to rise, a dynamic we name progression.

  2. Insufficient data makes memorization more efficient than generalization, causing the model to choose pure memorization without generalizing, consistent with the ungrokking described by Varma et al 2023.

  3. Conversely, increasing model size diminishes memorization efficiency relative to generalization, causing the model to transition from memorization to generalization after enough training steps, which is exactly grokking (Power et al 2022).

  4. When the number of training data points approximates the critical dataset size, the model exhibits a phenomenon named semi-grokking, characterized by moderate generalization capabilities. This behavior was first identified by Varma et al 2023.
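The 4 zones above can be summarized in a toy classifier. The two curves' functional forms below are illustrative assumptions (a memorization capacity that grows with model size, a critical dataset size that shrinks with it), not the curves fitted in the paper:

```python
def classify_zone(model_size, data_size,
                  mem_capacity=lambda n: 100 * n,      # hypothetical: grows with model size
                  crit_dataset=lambda n: 10_000 / n,   # hypothetical: shrinks with model size
                  tol=0.05):
    """Toy classification of the 4 training-dynamic zones of Figure 1.

    The capacity/criticality curves are stand-ins for the real (empirical)
    curves; only the zone logic mirrors the framework described above.
    """
    cap = mem_capacity(model_size)
    crit = crit_dataset(model_size)
    if data_size > cap:
        # too much data to memorize fully: partial memorization, then partial generalization
        return "progression"
    if abs(data_size - crit) / crit < tol:
        # near the critical dataset size: only moderate generalization
        return "semi-grokking"
    if data_size < crit:
        # memorization is cheaper than generalization: pure memorization
        return "ungrokking"
    # memorize first, then transition to generalization
    return "grokking"
```

Note that the semi-grokking band is checked before the ungrokking condition, since points just below the critical dataset size would otherwise be absorbed into the ungrokking zone.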

…[see also Wang et al 2024] Further, we extend our experiments to the multi-task learning paradigm, where an algorithm task and a pure memorization task are mixed to train the model. Interestingly, adding a pure memorization task largely hinders the model from forming generalization circuits for the algorithm task. As model size increases, the model achieves near-zero validation performance until a relatively large model size is reached, about 1,570× larger than when training solely on the algorithm task.

This phenomenon reminds us of the “emergent abilities” in LLMs (Wei et al 2022a). The pretraining stage can also be seen as a multi-task learning process, in which the model must memorize a great deal of world knowledge while developing general rules and abilities, such as reasoning. Our study suggests that this multi-task learning feature may be one important cause of emergent abilities in LLMs.

Figure 3: Final validation accuracy across various training dataset sizes and model hidden sizes. Larger models are represented in green, while smaller models are in blue. This figure demonstrates that models with larger hidden sizes attain near-perfect validation accuracy with comparatively less training data, indicating a reduced critical dataset size for these models.

…From the analysis presented in Figure 3, it is evident that models with larger hidden sizes tend to exhibit grokking with smaller datasets. For instance, a model with a hidden size dh of 64 achieves near-perfect validation accuracy with just 3,000 training samples. In contrast, a model with a smaller hidden size of 32 requires 4,000 training samples to achieve similar validation accuracy. This trend is consistent across models with intermediate hidden sizes ranging from 32 to 64.

5. Multi-Task Learning Leads to Emergent Ability

In this section, we expand our research to the multi-task learning paradigm, which combines an algorithm task, such as modular addition, with a task focused solely on memorization. This approach reveals that the model’s generalization ability on the algorithm task remains negligible until the model reaches a substantially larger size, specifically 1,570× larger than when training on the single task alone. As a result, the validation performance on the algorithm task shows an emergent phenomenon relative to model size.

Experiment Setup: Our experiments use the modular addition task as the algorithm component. For the memorization task, we assign random labels to a subtraction task, compelling the model to memorize these associations. By incorporating different calculation symbols (+, −) into the input tokens, we ensure that the memorization and algorithm tasks have no overlapping inputs.

Our experiments involve 3,000 data points from the modular addition task and a varying number of memorization data points, ranging from 3,000 to 6,000. We adjust the model size from a 1-layer transformer with a hidden size of 64 to an 8-layer transformer with a hidden size of 1,024. Each experiment is conducted 3×, and we report the highest validation accuracy for each configuration to showcase each model’s optimal potential. The results are presented in Figure 8:
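The mixed dataset described in this setup can be sketched as follows. The modulus `p = 113` and the exact tokenization are assumptions for illustration; only the structure (true labels under “+”, random labels under “−”, disjoint inputs via the operator token) follows the setup above:

```python
import random

def build_mixed_dataset(p=113, n_add=3000, n_mem=3000, seed=0):
    """Sketch of the multi-task dataset: a modular-addition task with true
    labels, plus a "subtraction" task whose labels are random, so the only
    way to fit it is memorization. Details (p, encoding) are assumptions.
    """
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(p) for b in range(p)]
    add_pairs = rng.sample(pairs, n_add)
    mem_pairs = rng.sample(pairs, n_mem)
    # algorithm task: "a + b" -> (a + b) mod p  (a generalizable rule exists)
    data = [((a, "+", b), (a + b) % p) for a, b in add_pairs]
    # memorization task: "a - b" -> random label  (no generalizable rule)
    data += [((a, "-", b), rng.randrange(p)) for a, b in mem_pairs]
    rng.shuffle(data)
    return data
```

The distinct operator tokens guarantee the two tasks’ inputs never collide, so the model can in principle route them to different circuits.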

Figure 8: Adding a pure memorization task into the modular addition task makes it become an emergent ability.

TAKEAWAY #7: We observe that incorporating pure memorization data substantially impedes smaller models from developing generalization circuits. The emergence of generalization ability in the modular addition task is notable, with models typically displaying substantial validation performance only at relatively larger sizes, ~1,570× larger than those trained solely on the modular addition task. Additionally, the volume of memorization data appears to have minimal impact on the emergent model size. [cf. ray interference?]

Discussion: From the perspective of the competition between memorization and generalization circuits, the presence of a pure memorization task prevents the model from transitioning from memorization to generalization after mastering all training data, as there are no generalization circuits for the pure memorization task. However, once a sufficiently large model size is attained, the model’s memorization capacity substantially exceeds the training data volume, allowing it to allocate extra parameters for generalization in the algorithm task.

The experiment in Appendix B, which reduces the emergent model size by allocating separate parameters for algorithm and memorization data, further supports this hypothesis. This phenomenon echoes the emergent abilities observed in current LLMs. Considering that the pretraining stage resembles a multi-task learning scenario—where the model must retain a vast array of world knowledge while acquiring general rules and capabilities such as in-context learning (Brown et al 2020) and multistep reasoning (Wei et al 2022b)—our experiments may provide fresh insights into the emergent abilities in LLMs. This observation further elucidates the hypothesis of Hu et al 2023, which proposes that emergent abilities are formed through the competition of different neural circuits.

…To investigate this further, we designed experiments to segregate parameters dedicated to memorization and generalization. Prior studies suggest that the feed-forward layer in the Transformer architecture predominantly facilitates memorization (Geva et al 2021). Additionally, the feed-forward layer is also crucial for generalization in the modular addition task (Nanda et al 2023; Chughtai et al 2023). Consequently, we partitioned the feed-forward layer into two specialized sections by dividing the intermediate dimension, akin to the current MoE architecture (Fedus et al 2022). In this setup, one section exclusively processes the modular addition task data, while the other focuses solely on the memorization task data.
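The partitioning idea can be sketched with a single FFN layer written as plain matrices. The routing by task token, the ReLU nonlinearity, and the toy dimensions are our assumptions, not the paper’s code; only the hard split of the intermediate dimension mirrors the setup described above:

```python
def partitioned_ffn(x, W_in, W_out, task, d_split):
    """Route an input through only one slice of the FFN intermediate dimension.

    x       : input vector (list of floats), length d_model
    W_in    : d_inter rows, each of length d_model (up-projection)
    W_out   : d_model rows, each of length d_inter (down-projection)
    task    : "add" uses intermediate units [0, d_split); "mem" uses the rest.

    This is a hand-made, hard version of MoE routing: units outside the
    task's slice are zeroed, so the two tasks never share FFN parameters.
    """
    d_inter = len(W_in)
    lo, hi = (0, d_split) if task == "add" else (d_split, d_inter)
    # ReLU activation, masked to the task's slice of the intermediate dimension
    h = [max(0.0, sum(w * xi for w, xi in zip(W_in[j], x))) if lo <= j < hi else 0.0
         for j in range(d_inter)]
    return [sum(w * hj for w, hj in zip(row, h)) for row in W_out]
```

Because the two slices are disjoint, gradient updates from memorization data cannot overwrite the units available for the generalization circuit, which is the mechanism the appendix experiment credits for the reduced emergent model size.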

Our experiments, conducted on 3,000 instances each of modular addition and memorization data, are depicted in Figure 10:

Figure 10: Experiments on multi-task learning with a pure memorization task and modular addition task. This study delves into the effects of manually constructing a sparse model to separately address pure memorization and modular addition data. Through this approach, we substantially accelerate the emergence of the modular addition capability in terms of model size. This finding highlights the crucial role of functional differentiation within neural models in fostering the emergence of new abilities.

The results demonstrate that manually creating a sparse network substantially accelerates the emergence of modular addition capability with the same training dataset. This finding underscores the critical role of functional differentiation in neural models for the development of diverse abilities.