Bibliography (160):

  1. scaling-hypothesis#blessings-of-scale

  2. https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf

  3. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  4. https://arxiv.org/pdf/1603.05691.pdf#page=7

  5. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

  6. Deep Learning Scaling is Predictable, Empirically

  7. Learning Visual Features from Large Weakly Supervised Data

  8. Exploring the Limits of Weakly Supervised Pretraining

  9. SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models

  10. WebVision Challenge: Visual Learning and Understanding With Web Data

  11. WebVision Database: Visual Learning and Understanding from Web Data

  12. CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

  13. Measuring the Effects of Data Parallelism on Neural Network Training

  14. An Empirical Model of Large-Batch Training

  15. A Constructive Prediction of the Generalization Error Across Scales

  16. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  17. One Epoch Is All You Need

  18. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  19. Small Data, Big Decisions: Model Selection in the Small-Data Regime

  20. Scaling Laws for Neural Language Models

  21. Scaling Laws from the Data Manifold Dimension

  22. Scaling Laws for Autoregressive Generative Modeling

  23. Broken Neural Scaling Laws

  24. GPT-3: Language Models are Few-Shot Learners

  25. MMLU: Measuring Massive Multitask Language Understanding

  26. Measuring Mathematical Problem Solving With the MATH Dataset

  27. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

  28. Scaling Laws for Transfer

  29. Scaling Laws for Language Transfer Learning

  30. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

  31. Scaling Laws for Neural Machine Translation

  32. Data and Parameter Scaling Laws for Neural Machine Translation

  33. Unsupervised Neural Machine Translation with Generative Language Models Only

  34. Data Scaling Laws in NMT: The Effect of Noise and Architecture

  35. How Many Data Points is a Prompt Worth?

  36. Recursively Summarizing Books with Human Feedback

  37. Evaluating Large Language Models Trained on Code

  38. https://github.com/features/copilot/

  39. Solving Linear Algebra by Program Synthesis

  40. Solving Probability and Statistics Problems by Program Synthesis

  41. Program Synthesis with Large Language Models

  42. Show Your Work: Scratchpads for Intermediate Computation with Language Models

  43. Few-Shot Self-Rationalization with Natural Language Prompts

  44. Scarecrow: A Framework for Scrutinizing Machine Text

  45. A Recipe For Arbitrary Text Style Transfer with Large Language Models

  46. ‘instruct-tuning LLMs’ directory

  47. M6–10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

  48. Training Verifiers to Solve Math Word Problems

  49. Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  50. An Explanation of In-context Learning as Implicit Bayesian Inference

  51. Recipes for building an open-domain chatbot

  52. SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners

  53. iGPT: Generative Pretraining from Pixels

  54. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  55. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  56. Exploring Sparse Expert Models and Beyond

  57. On the Predictability of Pruning Across Scales

  58. ‘NN pruning’ directory

  59. How Big Should My Language Model Be?

  60. When Do You Need Billions of Words of Pretraining Data?

  61. Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

  62. Probing Across Time: What Does RoBERTa Know and When?

  63. CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3

  64. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

  65. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  66. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

  67. Multimodal Few-Shot Learning with Frozen Language Models

  68. GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

  69. Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

  70. Zero-Shot Text-to-Image Generation

  71. DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language

  72. M6: A Chinese Multimodal Pretrainer

  73. Improved Denoising Diffusion Probabilistic Models

  74. Denoising Diffusion Probabilistic Models

  75. Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

  76. Scaling Laws for Acoustic Models

  77. Unsupervised Cross-lingual Representation Learning for Speech Recognition

  78. Scaling End-to-End Models for Large-Scale Multilingual ASR

  79. Scaling ASR Improves Zero and Few Shot Learning

  80. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

  81. Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation

  82. Toward a realistic model of speech processing in the brain with self-supervised learning

  83. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

  84. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

  85. https://openai.com/index/whisper/

  86. SEER: Self-supervised Pretraining of Visual Features in the Wild

  87. Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

  88. Fast and Accurate Model Scaling

  89. Revisiting ResNets: Improved Training and Scaling Strategies

  90. Unsupervised Cross-lingual Representation Learning at Scale

  91. XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling

  92. Facebook AI WMT21 News Translation Task Submission

  93. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  94. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning

  95. Flamingo: a Visual Language Model for Few-Shot Learning

  96. Scaling Vision Transformers

  97. CoAtNet: Marrying Convolution and Attention for All Data Sizes

  98. BEiT: BERT Pre-Training of Image Transformers

  99. MAE: Masked Autoencoders Are Scalable Vision Learners

  100. A Universal Law of Robustness via Isoperimetry

  101. Exploring the Limits of Out-of-Distribution Detection

  102. Partial success in closing the gap between human and machine vision

  103. Effect of scale on catastrophic forgetting in neural networks

  104. On the Opportunities and Risks of Foundation Models

  105. Exploring the Limits of Large Scale Pre-training

  106. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

  107. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials

  108. WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

  109. CT0: Fine-tuned Language Models are Continual Learners

  110. DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

  111. High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)

  112. Make Every Feature Binary: A 135B Parameter Sparse Neural Network for Massively Improved Search Relevance

  113. Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

  114. Scaling Law for Recommendation Models: Towards General-purpose User Representations

  115. Understanding Scaling Laws for Recommendation Models

  116. ‘MLP NN’ directory

  117. MLP-Mixer: An all-MLP Architecture for Vision

  118. Pay Attention to MLPs

  119. Fine-Tuning Language Models from Human Preferences

  120. Learning to summarize from human feedback

  121. Measuring hardware overhang

  122. Scaling Scaling Laws with Board Games

  123. Computer Optimization: Your Computer Is Faster Than You Think

  124. MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model

  125. From Motor Control to Team Play in Simulated Humanoid Football

  126. Open-Ended Learning Leads to Generally Capable Agents

  127. Procedural Generalization by Planning with Self-Supervised World Models

  128. Collaborating with Humans without Human Data

  129. Gato: A Generalist Agent

  130. Multi-Game Decision Transformers

  131. Does Learning Require Memorization? A Short Tale about a Long Tail

  132. Generalization bounds for deep learning

  133. The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

  134. Explaining Neural Scaling Laws

  135. Learning Curve Theory

  136. [AN #140]: Theoretical Models That Predict Scaling Laws

  137. The Shape of Learning Curves: a Review

  138. A mathematical theory of semantic development in deep neural networks

  139. The Shape of Learning Curves: a Review: 6. Ill-Behaved Learning Curves: 6.1. Phase Transitions

  140. The Phase Transition In Human Cognition § Phase Transitions in Language Processing

  141. Acquisition of Chess Knowledge in AlphaZero

  142. https://arxiv.org/pdf/2111.09259.pdf#page=19

  143. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

  144. A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

  145. Toward A Universal Law Of Generalization For Psychological Science

  146. Scaling to Very Very Large Corpora for Natural Language Disambiguation

  147. https://papers.nips.cc/paper/2003/file/9fb7b048c96d44a0337f049e0a61ff06-Paper.pdf

  148. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis

  149. Large Language Models in Machine Translation

  150. Six Challenges for Neural Machine Translation

  151. 2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png

  152. The Unreasonable Effectiveness of Data

  153. The Tradeoffs of Large-Scale Learning

  154. Large-Scale Machine Learning Revisited [Slides]

  155. ML Scaling subreddit

  156. It Looks Like You’re Trying To Take Over The World

  157. ‘AI scaling’ directory