Bibliography (18):

  1. Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time

  2. https://gwern.net/doc/ai/nn/fully-connected/2021-power.pdf#openai

  3. Explaining grokking through circuit efficiency

  4. https://arxiv.org/pdf/2402.15175#page=2

  5. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]

  6. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

  7. Emergent Abilities of Large Language Models

  8. Ray Interference: a Source of Plateaus in Deep Reinforcement Learning

  9. https://arxiv.org/pdf/2402.15175#page=12

  10. GPT-3: Language Models are Few-Shot Learners

  11. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

  12. PassUntil: Predicting Emergent Abilities with Infinite Resolution Evaluation

  13. Attention Is All You Need

  14. Transformer Feed-Forward Layers Are Key-Value Memories

  15. Progress measures for grokking via mechanistic interpretability

  16. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

  17. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity