Bibliography (18):

Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time
https://gwern.net/doc/ai/nn/fully-connected/2021-power.pdf#openai
Explaining grokking through circuit efficiency
https://arxiv.org/pdf/2402.15175#page=2
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Emergent Abilities of Large Language Models
Ray Interference: a Source of Plateaus in Deep Reinforcement Learning
https://arxiv.org/pdf/2402.15175#page=12
GPT-3: Language Models are Few-Shot Learners
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
PassUntil: Predicting Emergent Abilities with Infinite Resolution Evaluation
Attention Is All You Need
Transformer Feed-Forward Layers Are Key-Value Memories
Progress measures for grokking via mechanistic interpretability
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Wikipedia Bibliography:
1. https://en.wikipedia.org/wiki/Modular_addition :
  
  https://en.wikipedia.org/wiki/Modular_addition