- https://x.com/danintheory/status/1587461257745022976
- Explaining Neural Scaling Laws
- The Quantization Model of Neural Scaling
- Chinchilla: Training Compute-Optimal Large Language Models
- Scaling Laws for Neural Language Models
- https://arxiv.org/pdf/2210.16859#page=17
- https://arxiv.org/pdf/2210.16859#page=64
- Scaling Laws from the Data Manifold Dimension
- Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm
- https://arxiv.org/pdf/2210.16859#page=57
- https://arxiv.org/pdf/2210.16859#page=58
- https://arxiv.org/pdf/2210.16859#page=62
- Deep Double Descent: the double descent phenomenon occurs in CNNs, ResNets, and transformers; performance first improves, then gets worse, then improves again as model size, data size, or training time increases (see the sketch after this list)
- https://arxiv.org/pdf/2210.16859#page=65
- https://arxiv.org/pdf/2210.16859#page=69
- https://arxiv.org/pdf/2210.16859#page=7
- https://arxiv.org/pdf/2210.16859#page=59
- https://arxiv.org/pdf/2210.16859#page=51
- Scaling Laws for Deep Learning
- https://arxiv.org/pdf/2210.16859#page=61
- https://arxiv.org/pdf/2210.16859#page=19
- Parameter Counts in Machine Learning
- Parameter Count vs Training Dataset Size (1952–2021)
- https://web.archive.org/web/20220908153010/https://www.benadlam.com/
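
The improve/worsen/improve shape described in the Deep Double Descent entry can be reproduced with a toy model. The sketch below is illustrative only (it is not taken from any paper linked above): it fits min-norm least squares on random Fourier features, with the feature count standing in for model size; the feature map, frequencies, and widths are arbitrary choices made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: noisy samples of a smooth target function.
n_train, n_test, noise = 30, 500, 0.3
x_train = rng.uniform(-1, 1, n_train)
x_test = np.linspace(-1, 1, n_test)
target = lambda x: np.sin(2 * np.pi * x)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
y_test = target(x_test)

def random_fourier_features(x, width, seed):
    """Map scalar inputs to `width` random cosine features; width is the model-size knob."""
    frng = np.random.default_rng(seed)
    w = frng.standard_normal(width) * 4.0          # random frequencies
    b = frng.uniform(0, 2 * np.pi, width)          # random phases
    return np.cos(np.outer(x, w) + b) / np.sqrt(width)

print(f"{'width':>6}  {'train MSE':>10}  {'test MSE':>10}")
for width in [2, 5, 10, 20, 25, 30, 35, 40, 60, 120, 500, 2000]:
    phi_tr = random_fourier_features(x_train, width, seed=1)
    phi_te = random_fourier_features(x_test, width, seed=1)
    # lstsq returns the minimum-norm solution in the over-parameterized regime.
    coef, *_ = np.linalg.lstsq(phi_tr, y_train, rcond=None)
    train_mse = np.mean((phi_tr @ coef - y_train) ** 2)
    test_mse = np.mean((phi_te @ coef - y_test) ** 2)
    print(f"{width:>6}  {train_mse:>10.4f}  {test_mse:>10.4f}")
```

In runs of this kind, test MSE typically rises as the width approaches the number of training points (the interpolation threshold) and falls again well past it, tracing the double descent curve along the model-size axis.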