Bibliography (14):

  1. Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets

  2. Explaining grokking through circuit efficiency

  3. Attention Is All You Need

  4. Omnigrok: Grokking Beyond Algorithmic Data

  5. Towards Understanding Grokking: An Effective Theory of Representation Learning

  6. https://arxiv.org/pdf/2401.10463#page=16

  7. PassUntil: Predicting Emergent Abilities with Infinite Resolution Evaluation

  8. Decoupled Weight Decay Regularization

  9. https://arxiv.org/pdf/2401.10463#page=18

  10. https://arxiv.org/pdf/2401.10463#page=14