“Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee (2024-05-30)⁠:

One puzzling artifact in machine learning, dubbed grokking, is where delayed generalization is achieved tens of times more iterations after near-perfect overfitting to the training data.

Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under the grokking phenomenon. By regarding a series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: the fast-varying, overfitting-yielding component and the slow-varying, generalization-inducing component.

This analysis allows us to accelerate the grokking phenomenon more than 50× with only a few lines of code [Grokfast] that amplify the slow-varying components of the gradients.

The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, making this peculiar artifact of sudden generalization practically usable.

Our code is available on GitHub.


5.1 Difference between Algorithm 2 and the Momentum in Typical Optimizers

Lines 7–8 of Algorithm 2 take a similar form to the momentum variable frequently used in optimizers in deep learning frameworks. However, there are notable differences:

  1. Instead of using the scaled momentum as the parameter update, we use the smoothed gradient as a residual, which is added to the gradient before it is fed into the optimizer. The formula is closer to Nesterov’s momentum; however, the filtering is applied before the optimizer, which differs from typical applications of Nesterov’s momentum such as NAdam [Dozat2016].

  2. Lines 7–8 are applied to the gradients independently of the underlying optimizer. The optimizer can be of any type as long as it is a first-order, gradient descent-based method.

  3. Low-pass filtering the gradients g(t) has the same effect as filtering the post-optimizer parameter updates u(t), as mathematically explained in Appendix A for SGD and its variants, and empirically demonstrated in the previous sections with the Adam [Kingma & Ba2014] and AdamW [Loshchilov & Hutter2018] optimizers.
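The residual form described in item 1 can be sketched in a few lines of plain Python. This is a toy, one-parameter illustration, not the paper's implementation: the filter coefficients `alpha` and `lamb` below are illustrative placeholders rather than tuned hyperparameters, and the per-parameter bookkeeping of a real training loop is collapsed to a single scalar gradient.

```python
def gradfilter_ema(grad, ema, alpha=0.98, lamb=2.0):
    """Low-pass filter one gradient sample with an exponential moving
    average, then add the amplified slow component back as a residual.
    Returns (filtered_grad, updated_ema). alpha/lamb are illustrative."""
    ema = alpha * ema + (1 - alpha) * grad
    return grad + lamb * ema, ema

# A toy gradient signal: a slow drift (0.1) plus a fast +/-1 oscillation.
ema, history = 0.0, []
for t in range(1000):
    g = 0.1 + (1.0 if t % 2 == 0 else -1.0)
    filtered, ema = gradfilter_ema(g, ema)
    history.append(filtered)

# The EMA tracks the slow component (~0.1) while largely cancelling the
# oscillation, so the mean filtered gradient is amplified toward
# 0.1 * (1 + lamb) = 0.3 while the fast component passes through
# nearly unchanged.
```

Because the filtering happens before the optimizer sees the gradient (item 2), the same snippet would sit between `loss.backward()` and `optimizer.step()` regardless of which first-order optimizer is used.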

Q3. Synergistic effect with weight decay: Apart from our gradient filtering approach, the authors of Omnigrok (Liu et al 2022b) have suggested that the weight decay hyperparameter is a critical determinant of the grokking phenomenon. According to their report, grokking appears, and even becomes faster, as the weight decay grows larger. We therefore conduct additional experiments to find out how the two approaches affect the model when applied together.

The results are summarized in Figure 7. Compared with the result from GROKFAST-MA with no weight decay (orange), applying weight decay (red) generally yields even faster generalization. The maximum acceleration appears at weight decay = 0.01, with generalization 3.72× faster than GROKFAST-MA with no weight decay. We choose this result, 50.49× faster grokking than the unmodified baseline, as our main demonstration in Figure 2a.

Interestingly, Figure 7 also reveals that applying the same weight decay without GROKFAST-MA (brown) makes the training unstable. The results demonstrate that applying our gradient filtering together with a properly chosen weight decay gives synergistic benefits.
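The GROKFAST-MA variant discussed here replaces the exponential average with a fixed-width windowed average of recent gradients. A minimal sketch in the same one-parameter toy setting follows; the window width and amplification `lamb` are again illustrative placeholders, not the paper's tuned values.

```python
from collections import deque

def gradfilter_ma(grad, window, lamb=5.0):
    """Windowed moving-average low-pass filter: average the most recent
    gradient samples (the deque's maxlen bounds the window) and add the
    amplified average back as a residual. Returns the filtered gradient."""
    window.append(grad)  # deque(maxlen=w) evicts the oldest sample itself
    return grad + lamb * sum(window) / len(window)

# Same toy signal: slow drift 0.1 plus a fast +/-1 oscillation.
window, out = deque(maxlen=100), []
for t in range(500):
    g = 0.1 + (1.0 if t % 2 == 0 else -1.0)
    out.append(gradfilter_ma(g, window))

# Once the window is full, the oscillation averages out exactly over the
# even-width window, so the slow component is amplified from 0.1 to
# 0.1 * (1 + lamb) = 0.6 in the mean filtered gradient.
```

The MA form makes the memory cost explicit: it stores the last `w` gradients per parameter, whereas the EMA variant stores only one running average, which is one reason one might prefer EMA at scale.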

Figure 7: The acceleration effect of GROKFAST-MA is greatly enhanced when accompanied by an appropriate value of weight decay. However, weight decay alone does not always yield beneficial results.

[I am not convinced this is not simply equivalent to momentum / weight decay. Tuning weight decay leads to drastic differences in grokking speed, and the speedup left after that is not much.]