“Predicting Grokking Long Before It Happens: A Look into the Loss Landscape of Models Which Grok”, Pascal Junior Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, Guillaume Dumas (2023-06-23):

[cf. Žunkovič & Ilievski 2022, slingshot mechanism/catapulting, Jiang et al 2019] This paper focuses on predicting the occurrence of grokking in neural networks, a phenomenon in which perfect generalization emerges long after signs of overfitting or memorization are observed. It has been reported that grokking can only be observed with certain hyper-parameters. This makes it critical to identify the parameters that lead to grokking. However, since grokking occurs after a large number of epochs, searching for the hyper-parameters that lead to it is time-consuming.

In this paper, we propose a low-cost method to predict grokking without training for a large number of epochs. In essence, by studying the learning curve of the first few epochs, we show that one can predict whether grokking will occur later on. Specifically, if certain oscillations occur in the early epochs, one can expect grokking to occur if the model is trained for a much longer period of time.

To detect the presence of such oscillations, we propose using the spectral signature of the learning curve: applying the Fourier transform to the curve and quantifying the amplitude of its low-frequency components.
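A minimal sketch of this kind of spectral signature, assuming the learning curve is sampled once per epoch; the `cutoff` threshold and the toy curves below are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def low_frequency_amplitude(loss_curve, cutoff=0.1):
    """Spectral signature of a learning curve: total amplitude of the
    Fourier components below `cutoff` (in cycles per epoch).
    `cutoff` is an illustrative threshold, not the paper's choice."""
    curve = np.asarray(loss_curve, dtype=float)
    curve = curve - curve.mean()                # drop the DC offset
    spectrum = np.abs(np.fft.rfft(curve))       # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(curve), d=1.0)  # epoch spacing = 1
    return spectrum[freqs < cutoff].sum()

# A smooth decaying curve vs. one with slow oscillations superimposed:
epochs = np.arange(400)
smooth = np.exp(-epochs / 100)
oscillating = smooth + 0.2 * np.sin(2 * np.pi * epochs / 50)  # period 50 epochs
```

On curves like these, the oscillating one yields a larger low-frequency amplitude, which is the kind of early signal the paper proposes as a grokking predictor.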

We also present additional experiments aimed at explaining the cause of these oscillations and characterizing the loss landscape.

…Regarding catastrophic forgetting: when learning many tasks, there is a link between forgetting and the sharpness of the optimum for each task, such that the slightest update for one task pushes the optimum out of its basin of attraction for the other tasks. Mirzadeh et al 2020 & Mirzadeh et al 2021 formalize this.

6. Summary and Discussion: We made the following observations:

  1. The memorization phase is characterized by a perturbed landscape, and it is separated from comprehension by a perturbed valley of bad solutions.

    Small data results in the slow progression of SGD in this region, causing a delay in generalization. During the comprehension phase, the loss and accuracy of training and validation show a periodic perturbation. Thilak et al 2022 named a related phenomenon the slingshot mechanism. We found that these perturbation points are characterized at the level of the loss by a sudden increase then decrease (and at the level of accuracy, a sudden decrease then increase), at the level of the model weights by a sudden variation in the relative cosine similarity, and at the level of the loss landscape by obstacles.

    This last point goes against what Goodfellow & Vinyals 2015 observed, namely that a variety of state-of-the-art neural networks never encounter any substantial obstacles from initialization to solution. The slingshot mechanism also contradicts the idea that SGD spends most of its time exploring the flat region at the bottom of the valley surrounding a flat minimizer (Goodfellow & Vinyals 2015), since, for many datasets, the slingshot accompanies the model from the confusion phase to the terminal phase of training, even after the model has generalized.

  2. The Hessian of the grokking loss function is characterized by larger condition numbers, leading to a slower convergence of gradient descent.

    We observed that more than 98% of the total variance in the parameter space occurs in the first 2 PCA modes, far fewer than the total number of weights, suggesting that the optimization dynamics are embedded in a low-dimensional space (Li et al 2018; Feng & Tu 2021). Moreover, the model remains in a lazy training regime (Chizat et al 2019; Berner et al 2021) most of the time, as the cosine distance between the model weights from one training step to the next remains almost constant, except at the slingshot location.
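The two measurements in point 2, the PCA variance of the weight trajectory and the cosine distance between consecutive weight vectors, can be sketched as follows. This is a toy check on a synthetic trajectory; the 2-D subspace construction and all names are illustrative, and a real run would instead log the flattened model weights at each step:

```python
import numpy as np

rng = np.random.default_rng(0)
steps, n_weights = 200, 50

# Synthetic trajectory that mostly moves inside a 2-D subspace, plus small noise:
basis = rng.standard_normal((2, n_weights))
coords = np.cumsum(rng.standard_normal((steps, 2)), axis=0)  # 2-D random walk
weights = coords @ basis + 0.01 * rng.standard_normal((steps, n_weights))

# PCA via SVD of the centered trajectory: variance explained by the top 2 modes.
centered = weights - weights.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
var = svals**2
top2_ratio = var[:2].sum() / var.sum()  # close to 1 for this trajectory

# Cosine distance between consecutive weight vectors ("lazy" when ~constant):
def cosine_distance(u, v):
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

dists = [cosine_distance(weights[t], weights[t + 1]) for t in range(steps - 1)]
```

On a real grokking run, the paper's observation corresponds to `top2_ratio` exceeding 0.98 while `dists` stays nearly flat except at the slingshot points.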

From the point of view of the landscape, grokking seems a bit clearer: landscape geometry has an effect on generalization, and it can reveal, in the early stages of training, whether the model will eventually generalize, just by looking at a microscopic quantity characteristic of that landscape, such as the empirical risk.
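The claim in point 2, that a larger Hessian condition number slows gradient descent, can be illustrated on a quadratic loss, where the Hessian is constant and the convergence rate is exactly governed by the condition number κ. The function and step-size choice below are a textbook sketch, not taken from the paper:

```python
import numpy as np

def gd_steps_to_converge(hessian_diag, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(w) = 0.5 * w^T H w with diagonal H, from
    w0 = ones; counts steps until ||w|| < tol. With the classic step
    size 2/(mu + L), the contraction factor is (kappa-1)/(kappa+1)."""
    h = np.asarray(hessian_diag, dtype=float)
    mu, L = h.min(), h.max()
    lr = 2 / (mu + L)
    w = np.ones_like(h)
    for step in range(1, max_steps + 1):
        w = w - lr * h * w  # gradient of f is H w
        if np.linalg.norm(w) < tol:
            return step
    return max_steps

well_conditioned = gd_steps_to_converge([1.0, 2.0])    # kappa = 2
ill_conditioned = gd_steps_to_converge([1.0, 100.0])   # kappa = 100
```

Here `ill_conditioned` requires many times more steps than `well_conditioned`, mirroring the paper's observation that the grokking loss function's larger condition numbers lead to slower convergence.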