-
Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets
-
Explaining grokking through circuit efficiency
-
Attention Is All You Need
-
Omnigrok: Grokking Beyond Algorithmic Data
-
Towards Understanding Grokking: An Effective Theory of Representation Learning
-
https://arxiv.org/pdf/2401.10463#page=16
-
PassUntil: Predicting Emergent Abilities with Infinite Resolution Evaluation
-
Decoupled Weight Decay Regularization
-
https://arxiv.org/pdf/2401.10463#page=18
-
https://arxiv.org/pdf/2401.10463#page=14
-