https://x.com/danintheory/status/1587461257745022976
Explaining Neural Scaling Laws
The Quantization Model of Neural Scaling
Chinchilla: Training Compute-Optimal Large Language Models
Scaling Laws for Neural Language Models
https://arxiv.org/pdf/2210.16859#page=17
https://arxiv.org/pdf/2210.16859#page=64
Scaling Laws from the Data Manifold Dimension
Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm
https://arxiv.org/pdf/2210.16859#page=57
https://arxiv.org/pdf/2210.16859#page=58
https://arxiv.org/pdf/2210.16859#page=62
Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time
https://arxiv.org/pdf/2210.16859#page=65
https://arxiv.org/pdf/2210.16859#page=69
https://arxiv.org/pdf/2210.16859#page=7
https://arxiv.org/pdf/2210.16859#page=59
https://arxiv.org/pdf/2210.16859#page=51
Scaling Laws for Deep Learning
https://arxiv.org/pdf/2210.16859#page=61
https://arxiv.org/pdf/2210.16859#page=19
Parameter Counts in Machine Learning
Parameter Count vs Training Dataset Size (1952–2021)
https://web.archive.org/web/20220908153010/https://www.benadlam.com/
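Several of the papers above ("Scaling Laws for Neural Language Models", Chinchilla, "Scaling Laws for Deep Learning") fit losses with a power-law-plus-offset form, roughly L(N) ≈ E + A/N^α in the model-size-limited regime; Chinchilla extends this to a joint fit L(N, D) = E + A/N^α + B/D^β over parameters N and tokens D. As a minimal sketch of what such a fit looks like in practice, here is a Python snippet against synthetic data; the specific numbers, starting guesses, and bounds are illustrative assumptions, not values taken from any of the papers.

    # Minimal sketch: fit a power-law-plus-offset scaling curve L(N) = E + A / N**alpha.
    # All data below is synthetic, generated from the model itself for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(n_params, E, A, alpha):
        """Irreducible loss E plus a power-law term that shrinks with parameter count."""
        return E + A * n_params ** (-alpha)

    # Hypothetical (parameter count, loss) points; not measurements from any paper.
    rng = np.random.default_rng(0)
    n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
    loss = scaling_law(n_params, E=2.0, A=50.0, alpha=0.25) + rng.normal(0.0, 0.01, n_params.shape)

    # Bounded fit keeps alpha in [0, 1] so the optimizer stays in a sensible regime.
    popt, _ = curve_fit(
        scaling_law, n_params, loss,
        p0=[1.0, 10.0, 0.2],
        bounds=([0.0, 0.0, 0.0], [10.0, 1e6, 1.0]),
    )
    E_hat, A_hat, alpha_hat = popt
    print(f"fitted irreducible loss E={E_hat:.2f}, A={A_hat:.1f}, exponent alpha={alpha_hat:.3f}")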