Bibliography (21):

  1. Grokking phase transitions in learning local rules with gradient descent

  2. The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

  3. The large learning rate phase of deep learning: the catapult mechanism

  4. https://arxiv.org/abs/1912.02178

  5. https://www.lesswrong.com/posts/YjQ8yY8AA6Ye2rLTN/grokking

  6. Understanding the Role of Training Regimes in Continual Learning

  7. Wide Neural Networks Forget Less Catastrophically

  8. Visualizing the Loss Landscape of Neural Nets

  9. Qualitatively characterizing neural network optimization problems

  10. The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima

  11. On Lazy Training in Differentiable Programming

  12. The Modern Mathematics of Deep Learning