“Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra, 2021-05-01:

[Poster; paper; official code (reimplementation, 2); learning phases; Yannic Kilcher video; mechanism speculation; cf. emergence / phase transitions; commentary] In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail.

We show that, in some situations, neural networks learn through a process of grokking a pattern in the data, improving generalization performance from random-chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting.

We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.

…We find that adding weight decay has a very large effect on data efficiency, more than halving the number of samples needed compared to most other interventions. We found that weight decay towards the initialization of the network is also effective, but not quite as effective as weight decay towards the origin. This suggests that the prior that near-zero weights are suitable for small algorithmic tasks explains part, but not all, of the superior performance of weight decay. Adding some noise to the optimization process (e.g. gradient noise from using minibatches, Gaussian noise applied to weights before or after computing the gradients) is beneficial for generalization, consistent with the idea that such noise might induce the optimization to find flatter minima that generalize better. We found that the learning rate had to be tuned within a relatively narrow window (about 1 order of magnitude) for generalization to happen.
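The two weight-decay variants compared above can be written as a single decoupled-decay update step that pulls the weights toward a chosen reference point (the origin, or the initial weights). This is a minimal illustrative sketch, not the paper's implementation; the function name, the `decay_target` switch, and all hyperparameter values are assumptions for exposition:

```python
import numpy as np

def sgd_step(w, grad, w_init, lr=1e-3, wd=1.0, decay_target="origin"):
    """One SGD step with decoupled weight decay toward a reference point.

    decay_target="origin": standard weight decay, pulls weights toward 0.
    decay_target="init":   decays toward the initialization w_init instead
    (the paper found this also works, but slightly less well).
    """
    reference = np.zeros_like(w) if decay_target == "origin" else w_init
    # Decoupled (AdamW-style) decay: the decay term is applied directly
    # to the weights rather than folded into the loss gradient.
    return w - lr * grad - lr * wd * (w - reference)
```

With `decay_target="init"`, weights that never move from initialization feel no decay pull at all, which is one way to see why the ~0-weights prior only partially explains weight decay's benefit.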

Figure 1: Left: ‘Grokking’: A dramatic example of generalization far after overfitting on an algorithmic dataset. We train on the binary operation of division mod 97 with 50% of the data in the training set. Each of the 97 residues is presented to the network as a separate symbol, similar to the representation in the figure to the right. The red curves show training accuracy and the green ones show validation accuracy. Training accuracy becomes close to perfect at <10³ optimization steps, but it takes close to 10⁶ steps for validation accuracy to reach that level, and we see very little evidence of any generalization until 10⁵ steps. Center: Training time required to reach 99% validation accuracy increases rapidly as the training data fraction decreases. Right: An example of a small binary operation table. We invite the reader to make their guesses as to which elements are missing.
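The division-mod-97 setup from the caption is small enough to construct in a few lines. A minimal sketch, assuming the task is to label each pair (a, b) with the residue c such that b·c ≡ a (mod p); the paper actually feeds each equation to a small transformer as a sequence of symbol tokens, whereas this sketch only emits labeled pairs and the 50% split:

```python
import itertools
import random

def make_division_dataset(p=97, train_frac=0.5, seed=0):
    """Build the 'a / b (mod p)' dataset for prime p: one example per
    pair (a, b) with b != 0, labeled c = a * b^-1 mod p. Each residue
    acts as a distinct symbol (represented here as an int token)."""
    examples = []
    for a, b in itertools.product(range(p), range(1, p)):
        # b^(p-2) mod p is the modular inverse of b (Fermat's little
        # theorem), valid because p is prime.
        c = (a * pow(b, p - 2, p)) % p
        examples.append(((a, b), c))
    random.Random(seed).shuffle(examples)
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]
```

For p = 97 this gives 97 × 96 = 9,312 equations, so a 50% split leaves 4,656 held-out cells of the operation table for validation.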

[cf.: blessings of scale; super-convergence; catapult phase/slingshot mechanism; double descent; flooding; wide flat basins; Neelakantan et al 2015/small batch generalization; Zhang et al 2016; on the benefits of weight decay: GPT-3/“Scaling Laws for Autoregressive Generative Modeling”/“One Epoch Is All You Need”; patient teachers. Anecdote. A case of grad student descent: “Incidentally, according to Ethan Caballero, at their poster they said how they happened to discover such a weird thing; apparently it was by accidentally letting their NNs run too long!”]