“Grokking Phase Transitions in Learning Local Rules With Gradient Descent”, 2022-10-26:
We discuss two solvable grokking (generalization beyond overfitting) models in a rule learning scenario.
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show that grokking is a consequence of the locality of the teacher model.
As an example, we analyse the cellular automata learning task, numerically determine the critical exponent and the grokking time distributions, and compare them with the predictions of the proposed grokking model.
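For a concrete feel of the task: the paper studies learning local (cellular-automaton) teacher rules, though the excerpt does not pin down the exact setup. Below is a minimal Python sketch of such a dataset, assuming an elementary radius-1 CA teacher; the rule number, lattice width, and sample count are illustrative choices, not the paper's:

```python
import numpy as np

def ca_step(state: np.ndarray, rule: int) -> np.ndarray:
    """One step of an elementary (radius-1) cellular automaton with
    periodic boundaries; `rule` is the Wolfram rule number."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    # Each site's neighbourhood (left, centre, right) indexes a bit of `rule`.
    idx = 4 * left + 2 * state + right
    return (rule >> idx) & 1

def make_dataset(n_samples: int, width: int, rule: int, seed: int = 0):
    """Teacher data: random binary strings and their one-step CA update.
    Each target bit depends only on a local 3-site window of the input,
    which is the 'locality of the teacher' the paper exploits."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, width))
    y = np.stack([ca_step(x, rule) for x in X])
    return X, y

X, y = make_dataset(n_samples=1000, width=16, rule=30)
```

Because each output bit is a function of only 3 input sites, a student that discovers this locality can generalize far beyond the training set, which is the intuition behind the paper's claim that grokking follows from teacher locality.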
Finally, we numerically analyse the connection between structure formation and grokking.
…Our analytical results and numerical experiments show a large difference between 𝓁1 and 𝓁2 [weight decay] regularizations. The 𝓁1 regularized models have a larger grokking probability, shorter grokking time, shorter generalization time, and smaller effective dimension compared to 𝓁2 regularized models.
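To make the 𝓁1/𝓁2 contrast concrete, here is a toy (sub)gradient-descent sketch of the two penalties on a perceptron-style student; the hinge loss, learning rate, and step count are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def train(X, y, l1=0.0, l2=0.0, lr=0.1, steps=2000, seed=0):
    """(Sub)gradient descent on a mean hinge loss with optional
    l1/l2 penalties. Labels y are +/-1. A toy sketch only."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        mask = margins < 1  # samples violating the margin
        grad = -(y[mask, None] * X[mask]).mean(axis=0) if mask.any() else 0.0
        # l1 subgradient pushes weights exactly to zero; l2 only shrinks them.
        grad = grad + l1 * np.sign(w) + l2 * 2 * w
        w -= lr * grad
    return w
```

The qualitative difference matches the quoted finding: the 𝓁1 subgradient λ1·sign(w) zeroes out irrelevant weights (a smaller effective dimension), while the 𝓁2 term 2λ2·w shrinks all weights proportionally and leaves them nonzero.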
…Further, we show that spikes in the loss (which often occur during training of deep neural networks) correspond to latent-space structural changes [phase shifts] that can be beneficial or detrimental for generalization. Assuming the same holds in deep networks, the latent-space effective dimension can be used to decide whether to revert the model to its pre-spike state or to continue training from the current one.
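One common proxy for a latent space's effective dimension is the participation ratio of the activation covariance spectrum; whether this matches the paper's definition is an assumption, as is the revert heuristic sketched in the comments (the checkpoint names are hypothetical):

```python
import numpy as np

def effective_dimension(H: np.ndarray) -> float:
    """Participation-ratio estimate of effective dimension from a batch
    of hidden activations H (n_samples x n_features): (sum λ)^2 / sum λ^2
    over the eigenvalues λ of the covariance matrix."""
    eig = np.linalg.eigvalsh(np.cov(H, rowvar=False))
    eig = np.clip(eig, 0.0, None)
    return eig.sum() ** 2 / (eig ** 2).sum()

# Hypothetical training-loop fragment: after a loss spike, compare the
# effective dimension before and after. Assuming (as the l1 results
# suggest) that lower effective dimension goes with generalization:
#
# if loss_spiked and effective_dimension(H_after) > effective_dimension(H_before):
#     model.load_state_dict(pre_spike_checkpoint)  # spike looks detrimental: revert
# else:
#     pass  # spike may be a beneficial phase shift: keep training
```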
…We find a similar distinction between the 𝓁1 and 𝓁2 regularizations as in the simple 1D case. At ε = 1 and λ1 = 0, the grokking probability vanishes for any value of λ2. In contrast, for λ1 > 0 the grokking probability can rise above 90% for any D ≥ 2. Interestingly, the grokking probability increases with the dimensionality D of the data distribution. In fact, in the limit D → ∞ the grokking probability becomes 100% whenever 0 < λ1 < ε. This result is a consequence of the concentration of measure of the uniform distribution ‘around the equator’. Similarly, by using the lower bound (Equation 33) we estimate the best value of λ1 for any ε, D, and N, and find that the maximum grokking probability is always larger than 0.915. In contrast, in the λ1 = 0 case the grokking probability becomes exponentially small in D, independent of the remaining parameter values. We make similar observations if we relax the condition λ2 ≫ 1 (see Appendix B).
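The ‘concentration around the equator’ is easy to check numerically: a single coordinate of a point drawn uniformly from the unit sphere S^{D−1} has variance 1/D, so in high D almost all of the mass hugs any given equator. A quick Monte-Carlo confirmation (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (2, 10, 100, 1000):
    x = rng.normal(size=(100_000, D))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform on the sphere
    print(D, x[:, 0].std())  # standard deviation of one coordinate: ~1/sqrt(D)
```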