scaling-hypothesis#blessings-of-scale
More Is Different (Anderson 1972): https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf
Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models
WebVision Challenge: Visual Learning and Understanding With Web Data
WebVision Database: Visual Learning and Understanding from Web Data
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
Measuring the Effects of Data Parallelism on Neural Network Training
A Constructive Prediction of the Generalization Error Across Scales
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Small Data, Big Decisions: Model Selection in the Small-Data Regime
Measuring Mathematical Problem Solving With the MATH Dataset
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Data and Parameter Scaling Laws for Neural Machine Translation
Unsupervised Neural Machine Translation with Generative Language Models Only
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Solving Probability and Statistics Problems by Program Synthesis
Show Your Work: Scratchpads for Intermediate Computation with Language Models
A Recipe For Arbitrary Text Style Transfer with Large Language Models
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
An Explanation of In-context Learning as Implicit Bayesian Inference
SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
Unsupervised Cross-lingual Representation Learning for Speech Recognition
Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation
Toward a realistic model of speech processing in the brain with self-supervised learning
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
SEER: Self-supervised Pretraining of Visual Features in the Wild
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
Revisiting ResNets: Improved Training and Scaling Strategies
XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
LEMON: Scaling Up Vision-Language Pre-training for Image Captioning
CoAtNet: Marrying Convolution and Attention for All Data Sizes
Partial success in closing the gap between human and machine vision
Effect of scale on catastrophic forgetting in neural networks
Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers
E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials
WebFace260M: A Benchmark for Million-Scale Deep Face Recognition
DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)
Scaling Law for Recommendation Models: Towards General-purpose User Representations
Computer Optimization: Your Computer Is Faster Than You Think
MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model
From Motor Control to Team Play in Simulated Humanoid Football
Procedural Generalization by Planning with Self-Supervised World Models
Does Learning Require Memorization? A Short Tale about a Long Tail
The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
A mathematical theory of semantic development in deep neural networks
The Shape of Learning Curves: A Review § 6. Ill-Behaved Learning Curves § 6.1. Phase Transitions
The Phase Transition In Human Cognition § Phase Transitions in Language Processing
Toward A Universal Law Of Generalization For Psychological Science
Scaling to Very Very Large Corpora for Natural Language Disambiguation
https://papers.nips.cc/paper/2003/file/9fb7b048c96d44a0337f049e0a61ff06-Paper.pdf
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis
Koehn 2017, Figure 3: BLEU scores with varying amounts of training data (2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png)
Wikipedia Bibliography:
Kuaishou: