Bibliography (160):

  1. scaling-hypothesis#blessings-of-scale

  2. https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf

  3. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  4. https://arxiv.org/pdf/1603.05691.pdf#page=7

  5. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

  6. Deep Learning Scaling is Predictable, Empirically

  7. Learning Visual Features from Large Weakly Supervised Data

  8. Exploring the Limits of Weakly Supervised Pretraining

  9. SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models

  10. WebVision Challenge: Visual Learning and Understanding With Web Data

  11. WebVision Database: Visual Learning and Understanding from Web Data

  12. CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

  13. Measuring the Effects of Data Parallelism on Neural Network Training

  14. An Empirical Model of Large-Batch Training

  15. A Constructive Prediction of the Generalization Error Across Scales

  16. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  17. One Epoch Is All You Need

  18. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  19. Small Data, Big Decisions: Model Selection in the Small-Data Regime

  20. Scaling Laws for Neural Language Models

  21. Scaling Laws from the Data Manifold Dimension

  22. Scaling Laws for Autoregressive Generative Modeling

  23. Broken Neural Scaling Laws

  24. GPT-3: Language Models are Few-Shot Learners

  25. MMLU: Measuring Massive Multitask Language Understanding

  26. Measuring Mathematical Problem Solving With the MATH Dataset

  27. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

  28. Scaling Laws for Transfer

  29. Scaling Laws for Language Transfer Learning

  30. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

  31. Scaling Laws for Neural Machine Translation

  32. Data and Parameter Scaling Laws for Neural Machine Translation

  33. Unsupervised Neural Machine Translation with Generative Language Models Only

  34. Data Scaling Laws in NMT: The Effect of Noise and Architecture

  35. How Many Data Points is a Prompt Worth?

  36. Recursively Summarizing Books with Human Feedback

  37. Evaluating Large Language Models Trained on Code

  38. https://github.com/features/copilot/

  39. Solving Linear Algebra by Program Synthesis

  40. Solving Probability and Statistics Problems by Program Synthesis

  41. Program Synthesis with Large Language Models

  42. Show Your Work: Scratchpads for Intermediate Computation with Language Models

  43. Few-Shot Self-Rationalization with Natural Language Prompts

  44. Scarecrow: A Framework for Scrutinizing Machine Text

  45. A Recipe For Arbitrary Text Style Transfer with Large Language Models

  46. ‘instruct-tuning LLMs’ directory

  47. M6–10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

  48. Training Verifiers to Solve Math Word Problems

  49. Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  50. An Explanation of In-context Learning as Implicit Bayesian Inference

  51. Recipes for building an open-domain chatbot

  52. SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners

  53. iGPT: Generative Pretraining from Pixels

  54. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  55. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  56. Exploring Sparse Expert Models and Beyond

  57. On the Predictability of Pruning Across Scales

  58. ‘NN pruning’ directory

  59. How Big Should My Language Model Be?

  60. When Do You Need Billions of Words of Pretraining Data?

  61. Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

  62. Probing Across Time: What Does RoBERTa Know and When?

  63. CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3

  64. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

  65. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  66. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

  67. Multimodal Few-Shot Learning with Frozen Language Models

  68. GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

  69. Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

  70. Zero-Shot Text-to-Image Generation

  71. DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language

  72. M6: A Chinese Multimodal Pretrainer

  73. Improved Denoising Diffusion Probabilistic Models

  74. Denoising Diffusion Probabilistic Models

  75. Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

  76. Scaling Laws for Acoustic Models

  77. Unsupervised Cross-lingual Representation Learning for Speech Recognition

  78. Scaling End-to-End Models for Large-Scale Multilingual ASR

  79. Scaling ASR Improves Zero and Few Shot Learning

  80. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

  81. Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation

  82. Toward a realistic model of speech processing in the brain with self-supervised learning

  83. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

  84. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

  85. https://openai.com/index/whisper/

  86. SEER: Self-supervised Pretraining of Visual Features in the Wild

  87. Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

  88. Fast and Accurate Model Scaling

  89. Revisiting ResNets: Improved Training and Scaling Strategies

  90. Unsupervised Cross-lingual Representation Learning at Scale

  91. XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling

  92. Facebook AI WMT21 News Translation Task Submission

  93. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  94. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning

  95. Flamingo: a Visual Language Model for Few-Shot Learning

  96. Scaling Vision Transformers

  97. CoAtNet: Marrying Convolution and Attention for All Data Sizes

  98. BEiT: BERT Pre-Training of Image Transformers

  99. MAE: Masked Autoencoders Are Scalable Vision Learners

  100. A Universal Law of Robustness via Isoperimetry

  101. Exploring the Limits of Out-of-Distribution Detection

  102. Partial success in closing the gap between human and machine vision

  103. Effect of scale on catastrophic forgetting in neural networks

  104. On the Opportunities and Risks of Foundation Models

  105. Exploring the Limits of Large Scale Pre-training

  106. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

  107. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials

  108. WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

  109. CT0: Fine-tuned Language Models are Continual Learners

  110. DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

  111. High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)

  112. Make Every Feature Binary: A 135B Parameter Sparse Neural Network for Massively Improved Search Relevance

  113. Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

  114. Scaling Law for Recommendation Models: Towards General-purpose User Representations

  115. Understanding Scaling Laws for Recommendation Models

  116. ‘MLP NN’ directory

  117. MLP-Mixer: An all-MLP Architecture for Vision

  118. Pay Attention to MLPs

  119. Fine-Tuning Language Models from Human Preferences

  120. Learning to summarize from human feedback

  121. Measuring hardware overhang

  122. Scaling Scaling Laws with Board Games

  123. Computer Optimization: Your Computer Is Faster Than You Think

  124. MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model

  125. From Motor Control to Team Play in Simulated Humanoid Football

  126. Open-Ended Learning Leads to Generally Capable Agents

  127. Procedural Generalization by Planning with Self-Supervised World Models

  128. Collaborating with Humans without Human Data

  129. Gato: A Generalist Agent

  130. Multi-Game Decision Transformers

  131. Does Learning Require Memorization? A Short Tale about a Long Tail

  132. Generalization bounds for deep learning

  133. The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

  134. Explaining Neural Scaling Laws

  135. Learning Curve Theory

  136. [AN #140]: Theoretical Models That Predict Scaling Laws

  137. The Shape of Learning Curves: a Review

  138. A mathematical theory of semantic development in deep neural networks

  139. The Shape of Learning Curves: a Review: 6. Ill-Behaved Learning Curves: 6.1. Phase Transitions

  140. The Phase Transition In Human Cognition § Phase Transitions in Language Processing

  141. Acquisition of Chess Knowledge in AlphaZero

  142. https://arxiv.org/pdf/2111.09259.pdf#page=19

  143. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

  144. A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

  145. Toward A Universal Law Of Generalization For Psychological Science

  146. Scaling to Very Very Large Corpora for Natural Language Disambiguation

  147. https://papers.nips.cc/paper/2003/file/9fb7b048c96d44a0337f049e0a61ff06-Paper.pdf

  148. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis

  149. Large Language Models in Machine Translation

  150. Six Challenges for Neural Machine Translation

  151. 2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png

  152. The Unreasonable Effectiveness of Data

  153. The Tradeoffs of Large-Scale Learning

  154. Large-Scale Machine Learning Revisited [Slides]

  155. ML Scaling subreddit

  156. It Looks Like You’re Trying To Take Over The World

  157. ‘AI scaling’ directory