Bibliography (21):

  1. https://github.com/OSU-NLP-Group/GrokkedTransformer

  2. https://x.com/BoshiWang2/status/1795294846212567089

  3. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

  4. https://openai.com/index/gpt-4-research/

  5. Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition

  6. Wang (2024), Figure 4: grokking phase transition of the compositional circuit in a Transformer (image: 2024-wang-figure4-grokkingphasetransitionofcompositionalcircuitintransformer.jpg)

  7. Wang (2024), Figure 5: grokking phase transition of the comparison circuit in a Transformer (image: 2024-wang-figure5-grokkingphasetransitionofcomparisoncircuitintransformer.png)

  8. Omnigrok: Grokking Beyond Algorithmic Data

  9. Critical Data Size of Language Models from a Grokking Perspective

  10. Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

  11. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  12. Explaining grokking through circuit efficiency

  13. Decoupled Weight Decay Regularization

  14. https://arxiv.org/pdf/2405.15071#page=19

  15. https://arxiv.org/pdf/2405.15071#page=17

  16. https://x.com/OwainEvans_UK/status/1804931838529638896

  17. Taken out of context: On measuring situational awareness in LLMs

  18. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data