Bibliography (30):

  1. https://x.com/danintheory/status/1587461257745022976

  2. Explaining Neural Scaling Laws

  3. The Quantization Model of Neural Scaling

  4. Chinchilla: Training Compute-Optimal Large Language Models

  5. Scaling Laws for Neural Language Models

  6. https://arxiv.org/pdf/2210.16859#page=17

  7. https://arxiv.org/pdf/2210.16859#page=64

  8. Scaling Laws from the Data Manifold Dimension

  9. Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm

  10. https://arxiv.org/pdf/2210.16859#page=57

  11. https://arxiv.org/pdf/2210.16859#page=58

  12. https://arxiv.org/pdf/2210.16859#page=62

  13. Deep Double Descent: Where Bigger Models and More Data Hurt

  14. https://arxiv.org/pdf/2210.16859#page=65

  15. https://arxiv.org/pdf/2210.16859#page=69

  16. https://arxiv.org/pdf/2210.16859#page=7

  17. https://arxiv.org/pdf/2210.16859#page=59

  18. https://arxiv.org/pdf/2210.16859#page=51

  19. Scaling Laws for Deep Learning

  20. https://arxiv.org/pdf/2210.16859#page=61

  21. https://arxiv.org/pdf/2210.16859#page=19

  22. Parameter Counts in Machine Learning

  23. Parameter Count vs Training Dataset Size (1952–2021)

  24. https://web.archive.org/web/20220908153010/https://www.benadlam.com/