‘AI scaling’ directory

Bibliography of ML scaling papers showing smooth scaling of neural net performance in general with increasingly large parameters, data, & compute.

⁠The Blessings of Scale⁠: when ⁠more is different⁠
“Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”⁠, Urban et al 2016 (negative result, particularly ⁠on scaling⁠—wrong, but why?)
“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”⁠, Sun et al 2017
“Deep Learning Scaling is Predictable, Empirically”⁠, Hestness et al 2017
“Learning Visual Features from Large Weakly Supervised Data”⁠, Joulin et al 2015; “Exploring the Limits of Weakly Supervised Pretraining”⁠, Mahajan et al 2018; “Revisiting Weakly Supervised Pre-Training of Visual Perception Models”⁠, Singh et al 2022 (CNNs⁠ scale to billions of hashtagged⁠ Instagram images)
WebVision: “WebVision Challenge: Visual Learning and Understanding With Web Data”⁠, Li et al 2017a/“WebVision Database: Visual Learning and Understanding from Web Data”⁠, Li et al 2017b/“CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images”⁠, Guo et al 2018
“Measuring the Effects of Data Parallelism on Neural Network Training”⁠, Shallue et al 2018
“Gradient Noise Scale: An Empirical Model of Large-Batch Training”⁠, McCandlish et al 2018
“A Constructive Prediction of the Generalization Error Across Scales”⁠, Rosenfeld et al 2019
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, Tan & Le 2019
“One Epoch Is All You Need”, Komatsuzaki 2019
“Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers”⁠, Li et al 2020
“Small Data, Big Decisions: Model Selection in the Small-Data Regime”⁠, Bornschein et al 2020
Key GPT papers:
- “Scaling Laws for Neural Language Models”⁠, Kaplan et al 2020
- “Scaling Laws from the Data Manifold Dimension”, Sharma & Kaplan 2020
- “Scaling Laws for Autoregressive Generative Modeling”⁠, Henighan et al 2020 (noise & resolution); “Broken Neural Scaling Laws”⁠, Caballero et al 2022
- “GPT-3: Language Models are Few-Shot Learners”⁠, Brown et al 2020
- “Measuring Massive Multitask Language Understanding”⁠, Hendrycks et al 2020; “Measuring Mathematical Problem Solving With the MATH Dataset”⁠, Hendrycks et al 2021
- “Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”⁠, Hendricks et al 2021
- “Scaling Laws for Transfer”⁠, Hernandez et al 2021; “Scaling Laws for Language Transfer Learning”⁠, Christina Kim (Hernandez et al 2021 followup: smooth scaling for En → De/Es/Zh); “When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method”⁠, Zhang et al 2024
- “Scaling Laws for Neural Machine Translation”⁠, Ghorbani et al 2021; “Data and Parameter Scaling Laws for Neural Machine Translation”⁠, Gordon et al 2021; “Unsupervised Neural Machine Translation with Generative Language Models Only”⁠, Han et al 2021; “Data Scaling Laws in NMT: The Effect of Noise and Architecture”⁠, Bansal et al 2022
- “How Many Data Points is a Prompt Worth?”, Le Scao & Rush 2021
- “Recursively Summarizing Books with Human Feedback”⁠, Wu et al 2021
- “Codex: Evaluating Large Language Models Trained on Code”⁠, Chen et al 2021 (small versions of GitHub Copilot⁠, solves simple linear algebra⁠/statistics problems⁠ too); “Program Synthesis with Large Language Models”⁠, Austin et al 2021; “Show Your Work: Scratchpads for Intermediate Computation with Language Models”⁠, Anonymous et al 2021; “Few-Shot Self-Rationalization with Natural Language Prompts”⁠, Marasović et al 2021
- “Scarecrow: A Framework for Scrutinizing Machine Text”⁠, Dou et al 2021
- “A Recipe For Arbitrary Text Style Transfer with Large Language Models”⁠, Reif et al 2021
- Instruction tuning⁠/multi-task finetuning
- “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”⁠, Lin et al 2021
- “Training Verifiers to Solve Math Word Problems”⁠, Cobbe et al 2021
- “Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”⁠, West et al 2021
- “An Explanation of In-Context Learning as Implicit Bayesian Inference”⁠, Xie et al 2021
“Blender: Recipes for building an open-domain chatbot”⁠, Roller et al 2020
“Big Self-Supervised Models are Strong Semi-Supervised Learners”⁠, Chen et al 2020a
“iGPT: Generative Pretraining from Pixels”⁠, Chen et al 2020b
“GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding”⁠, Lepikhin et al 2020; “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”⁠, Fedus et al 2021; “Exploring Sparse Expert Models and Beyond”⁠, Yang et al 2021
- “On the Predictability of Pruning Across Scales”⁠, Rosenfeld et al 2020 (scaling laws for sparsity⁠: initially large size reductions are free, then power-law⁠ worsening, then plateau at tiny but bad models)
“How big should my language model be?”, Huggingface 2020
“When Do You Need Billions of Words of Pretraining Data?”, Zhang et al 2020; “Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)”, Warstadt et al 2020; “Probing Across Time: What Does RoBERTa Know and When?”, Liu et al 2021
CLIP⁠; “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”⁠, Jia et al 2021 (see also CC-12M⁠; EfficientNet trained on 1.8 billion images on a TPUv3-1024⁠); “WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”⁠, Huo et al 2021; “Multimodal Few-Shot Learning with Frozen Language Models”⁠, Tsimpoukelli et al 2021; “GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce”⁠, Bell et al 2020; “Billion-Scale (Pinterest) Pretraining with Vision Transformers for Multi-Task Visual Representations”⁠, Beale et al 2021
“DALL·E 1: Zero-Shot Text-to-Image Generation”⁠, Ramesh et al 2021 (blog⁠); “M6: A Chinese Multimodal Pretrainer”⁠, Lin et al 2021 (Chinese DALL·E 1: 1.9TB images/0.29TB text for 10b-parameter dense/100b-parameter MoE Transformer; shockingly fast Chinese replication of DALL·E 1/CLIP)
“Improved Denoising Diffusion Probabilistic Models”, Nichol & Dhariwal 2021 (DDPM scaling laws for FID & likelihood)
“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
“Scaling Laws for Acoustic Models”, Droppo & Elibol 2021
- “XLSR: Unsupervised Cross-lingual Representation Learning for Speech Recognition”⁠, Conneau et al 2020
- “Scaling End-to-End Models for Large-Scale Multilingual ASR”⁠, Li et al 2021; “Scaling ASR Improves Zero and Few Shot Learning”⁠, Xiao et al 2021
- “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation”⁠, Wang et al 2021; “wav2vec: Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation”⁠, Wang et al 2021 (fMRI⁠); “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale”⁠, Babu et al 2021
- “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units”⁠, Hsu et al 2021
- Whisper⁠
“SEER: Self-supervised Pretraining of Visual Features in the Wild”⁠, Goyal et al 2021; “Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision”⁠, Goyal et al 2022
“Fast and Accurate Model Scaling”⁠, Dollár et al 2021; “Revisiting ResNets: Improved Training and Scaling Strategies”⁠, Bello et al 2021
“XLM-R: Unsupervised Cross-lingual Representation Learning at Scale”⁠, Conneau et al 2019; “XLM-R XL/XLM-R XXL: Larger-Scale Transformers for Multilingual Masked Language Modeling”⁠, Goyal et al 2021; “Facebook AI WMT21 News Translation Task Submission”⁠, Tran et al 2021
“ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”⁠, Sun et al 2021
“LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”⁠, Hu et al 2021
“Flamingo: a Visual Language Model for Few-Shot Learning”⁠, Alayrac et al 2022
“Scaling Vision Transformers”⁠, Zhai et al 2021
“CoAtNet: Marrying Convolution and Attention for All Data Sizes”⁠, Dai et al 2021
“BEiT: BERT Pre-Training of Image Transformers”⁠, Bao et al 2021; “Masked Autoencoders Are Scalable Vision Learners”⁠, He et al 2021
“A Universal Law of Robustness via Isoperimetry”, Bubeck & Sellke 2021; “Exploring the Limits of Out-of-Distribution Detection”, Fort et al 2021; “Partial success in closing the gap between human and machine vision”, Geirhos et al 2021
“Effect of scale on catastrophic forgetting in neural networks”, Anonymous 2021
“On the Opportunities and Risks of Foundation Models”⁠, Bommasani et al 2021 (review)
“Exploring the Limits of Large Scale Pre-training”⁠, Abnar et al 2021
“Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers”⁠, Prato et al 2021
“E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials”⁠, Batzner et al 2021
Face recognition: “WebFace260M: A Benchmark for Million-Scale Deep Face Recognition”⁠, Zhu et al 2022
“Fine-tuned Language Models are Continual Learners”, Scialom et al 2022
Embeddings: “DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications”, Zeng et al 2020; “DLRM: High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models”, Mudigere et al 2021; “Make Every feature Binary (MEB): A 135b-parameter sparse neural network for massively improved search relevance”; “Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters”, Lian et al 2021 (Kuaishou)
- “Scaling Law for Recommendation Models: Towards General-purpose User Representations”⁠, Shin et al 2021; “Understanding Scaling Laws for Recommendation Models”⁠, Ardalani et al 2022
MLPs/FCs: from the “Fully-Connected Neural Nets” bibliography⁠: ⁠Urban et al 2016⁠; “MLP-Mixer: An all-MLP Architecture for Vision”⁠, Tolstikhin et al 2021; “gMLP: Pay Attention to MLPs”⁠, Liu et al 2021
Reinforcement Learning:
- “Fine-Tuning Language Models from Human Preferences”⁠, Ziegler et al 2019; “Learning to summarize from human feedback”⁠, Stiennon et al 2020
- “Measuring hardware overhang”, hippke (the curves cross: “with today’s [trained] algorithms, computers would have beat the world chess champion already in 1994 on a contemporary desk computer”)
- “Scaling Scaling Laws with Board Games”, Jones 2021 (AlphaZero/Hex: highly-optimized GPU implementation enables showing smooth scaling across 6 OOM of compute: an agent with 2× the FLOPS wins ~66% of games; training compute can be amortized against runtime tree-search, where 10× training ≈ 15× runtime)
- “MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model”⁠, Schrittwieser et al 2021
- “From Motor Control to Team Play in Simulated Humanoid Football”⁠, Liu et al 2021
- “Open-Ended Learning Leads to Generally Capable Agents”⁠, Open Ended Learning Team et al 2021; “Procedural Generalization by Planning with Self-Supervised World Models”⁠, Anand et al 2021
- “Fictitious Co-Play: Collaborating with Humans without Human Data”⁠, Strouse et al 2021
- “Gato: A Generalist Agent”⁠, Reed et al 2022 (small Decision Transformer⁠ can learn >500 tasks; scaling smoothly)
- “Multi-Game Decision Transformers”⁠, Lee et al 2022 (near-human offline single-checkpoint ALE agent with scaling & rapid transfer)
Theory:
- “Does Learning Require Memorization? A Short Tale about a Long Tail”, Feldman 2019
- “Generalization bounds for deep learning”, Valle-Pérez & Louis 2020
- “The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers”⁠, Nakkiran et al 2020
- “Explaining Neural Scaling Laws”⁠, Bahri et al 2021
- “Learning Curve Theory”, Hutter 2021 (Rohin Shah commentary; more on the manifold hypothesis)
- “The Shape of Learning Curves: a Review”, Viering & Loog 2021
- “A mathematical theory of semantic development in deep neural networks”, Saxe et al 2019 (are jumps in NN capabilities to be expected when scaling? see also Viering & Loog 2021’s discussion of phase transitions & averaging of exponentials giving power-laws, human “vocabulary spurts”, and “Acquisition of Chess Knowledge in AlphaZero”, McGrath et al 2021 §6 “Rapid increase of basic knowledge”); sequential learning in OpenFold
- “A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning”⁠, Dar et al 2021
Historical:
- “Toward A Universal Law Of Generalization For Psychological Science”, Shepard 1987
- “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, Banko & Brill 2001
- “Large Scale Online Learning”, Bottou & LeCun 2003 (“We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm.”)
- “Tree Induction versus Logistic Regression: A Learning-Curve Analysis”⁠, Perlich et al 2003
- “Large Language Models in Machine Translation”, Brants et al 2007; Koehn & Knowles 2017 (Figure 3)
- “The Unreasonable Effectiveness of Data”, Halevy et al 2009
- “The Tradeoffs of Large-Scale Learning”, Bottou & Bousquet 2007/2012; “Large-Scale Machine Learning Revisited [slides]”, Bottou 2013
See Also: For more ML scaling research, follow the /r/MLScaling⁠ subreddit; “It Looks Like You’re Trying To Take Over The World”⁠

Gwern

“GPT-3 2nd Anniversary & Looking Forward 2 Years”, Gwern 2022

“Is OpenAI OK?”, Gwern 2024

“Gwern Branwen—How an Anonymous Researcher Predicted AI’s Trajectory”, Gwern & Patel 2024

“‘Winning’ AI Arms Races: Then What?”, Gwern 2024

“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023

“Scaling ‘Diminishing Returns’”, Gwern 2024

“Research Ideas”, Gwern 2017

“GPT-3 Creative Fiction”, Gwern 2020

“GANs Didn’t Fail, They Were Abandoned”, Gwern 2022

“The Scaling Hypothesis”, Gwern 2020

“ML Scaling Subreddit”, Gwern 2020

“WBE & DRL: a Middle Way of Imitation Learning on Brains”, Gwern 2018

“Computer Optimization: Your Computer Is Faster Than You Think”, Gwern 2021

“Technology Forecasting: The Garden of Forking Paths”, Gwern 2014

Links

“Cyc: Obituary for the Greatest Monument to Logical AGI”, Liu 2025

“Emuru: Zero-Shot Styled Text Image Generation, but Make It Autoregressive”, Pippi et al 2025

“Compute-Optimal LLMs Provably Generalize Better With Scale”, Finzi et al 2025

“My Thoughts on the Future of ‘AI’”, Carlini 2025

“Deep Learning Is Not So Mysterious or Different”, Wilson 2025

“SSI Israel Hires First Senior Researchers”, Gilead 2025

“Obscure Scientific Facts Benchmark”, Azulay 2025

“LLaDA: Large Language Diffusion Models”, Nie et al 2025

“[NanoGPT optimization experience curve]”, tamaybes 2025-02-13

“Over-Tokenized Transformer: Vocabulary Is Generally Worth Scaling”, Huang et al 2025

“Do Generative Video Models Learn Physical Principles from Watching Videos?”, Motamed et al 2025

“Emergent Effects of Scaling on the Functional Hierarchies within Large Language Models”, Foop 2025

“What’s the Deal With Mid-Training?”, Doria 2025

“Things We Learned about LLMs in 2024”

“2024 Letter [On LLM Benchmarking]”, Wang 2024

“Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference”, Warner et al 2024

“Byte Latent Transformer (BLT): Patches Scale Better Than Tokens”, Pagnoni et al 2024

“Densing Law of LLMs”, Xiao et al 2024

“Liquid: Language Models Are Scalable and Unified Multi-Modal Generators”, Wu et al 2024

“PaliGemma 2: A Family of Versatile VLMs for Transfer”, Steiner et al 2024

“Best-of-N Jailbreaking”, Hughes et al 2024

“Drowning in Documents: Consequences of Scaling Reranker Inference”, Jacob et al 2024

“Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?”, Jeong et al 2024

“How Far Is Video Generation from World Model: A Physical Law Perspective”, Kang et al 2024

“Scaling up Masked Diffusion Models on Text”, Nie et al 2024

“ABBYY’s Bitter Lesson: How Linguists Lost the Last Battle for NLP”, Skorinkin 2024

“CT Foundation: Taking Medical Imaging Embeddings 3D”, Kiraly & Traverse 2024

“Inference Scaling for Long-Context Retrieval Augmented Generation”, Yue et al 2024

“Strategic Insights from Simulation Gaming of AI Race Dynamics”, Gruetzemacher et al 2024

“How Feature Learning Can Improve Neural Scaling Laws”, Bordelon et al 2024

“Dwarkesh Podcast Progress Update”, Patel 2024

“Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?”, Ren et al 2024

“Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process”, Ye et al 2024

“Scaling Law in Neural Data: Non-Invasive Speech Decoding With 175 Hours of EEG Data”, Sato et al 2024

“Future Events As Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs”, Price et al 2024

“Resolving Discrepancies in Compute-Optimal Scaling of Language Models”, Porian et al 2024

“Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024

“Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Zhao et al 2024

“How Do Large Language Models Acquire Factual Knowledge During Pretraining?”, Chang et al 2024

“Explore the Limits of Omni-Modal Pretraining at Scale”, Zhang et al 2024

“Self-Consuming Generative Models With Curated Data Provably Optimize Human Preferences”, Ferbach et al 2024

“Beyond Model Collapse: Scaling Up With Synthesized Data Requires Reinforcement”, Feng et al 2024

“Attention As a Hypernetwork”, Schug et al 2024

“Training Compute-Optimal Protein Language Models”, Cheng et al 2024

“AI Will Become Mathematicians’ ‘Co-Pilot’: Fields Medalist Terence Tao Explains How Proof Checkers and AI Programs Are Dramatically Changing Mathematics”, Drösser & Tao 2024

“Regularization Properties of Polynomial Bases”, Shtoff 2024 (https://alexshtf.github.io/2024/06/03/PolynomialBasesRegProps.html)

“The Scaling Law in Stellar Light Curves”, Pan et al 2024

“AstroPT: Scaling Large Observation Models for Astronomy”, Smith et al 2024

“xLSTM: Extended Long Short-Term Memory”, Beck et al 2024

“Position: Understanding LLMs Requires More Than Statistical Generalization”, Reizinger et al 2024

“GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al 2024

“Scaling and Renormalization in High-Dimensional Regression”, Atanasov et al 2024

“CatLIP: CLIP-Level Visual Recognition Accuracy With 2.7× Faster Pre-Training on Web-Scale Image-Text Data”, Mehta et al 2024

“Test-Time Augmentation to Solve ARC”, Cole 2024

“Compression Represents Intelligence Linearly”, Huang et al 2024

“Chinchilla Scaling: A Replication Attempt”, Besiroglu et al 2024

“Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies”, Li et al 2024

“Why Do Small Language Models Underperform? Studying Language Model Saturation via the Softmax Bottleneck”, Godey et al 2024

“Language Imbalance Can Boost Cross-Lingual Generalization”, Schäfer et al 2024

“CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge”, Chiu et al 2024

“Conformer-1: Robust ASR via Large-Scale Semi-Supervised Bootstrapping”, Zhang et al 2024

“MiniCPM: Unveiling the Potential of Small Language Models With Scalable Training Strategies”, Hu et al 2024

“Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction”, Tian et al 2024

“Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data”, Gerstgrasser et al 2024

“Scaling Laws For Dense Retrieval”, Fang et al 2024

“Long-Form Factuality in Large Language Models”, Wei et al 2024

“Mechanistic Design and Scaling of Hybrid Architectures”, Poli et al 2024

“8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Levy 2024

“Inflection-2.5: Meet the World’s Best Personal AI”, Inflection 2024

“Actions Speak Louder Than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU)”, Zhai et al 2024

“When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method”, Zhang et al 2024

“Investigating Continual Pretraining in Large Language Models: Insights and Implications”, Yıldız et al 2024

“The Era of 1-Bit LLMs: All Large Language Models Are in 1.58 Bits”, Ma et al 2024

“StructLM: Towards Building Generalist Models for Structured Knowledge Grounding”, Zhuang et al 2024

“How to Train Data-Efficient LLMs”, Sachdeva et al 2024

“Weaver: Foundation Models for Creative Writing”, Wang et al 2024

“Arrows of Time for Large Language Models”, Papadopoulos et al 2024

“Can AI Assistants Know What They Don’t Know?”, Cheng et al 2024

“I Am a Strange Dataset: Metalinguistic Tests for Language Models”, Thrush et al 2024

“TF-T2V: A Recipe for Scaling up Text-to-Video Generation With Text-Free Videos”, Wang et al 2023

“Generative Multimodal Models Are In-Context Learners”, Sun et al 2023

“Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023

“Seamless: Multilingual Expressive and Streaming Speech Translation”, Seamless Communication et al 2023

“Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”, Nguyen et al 2023

“Instruction-Tuning Aligns LLMs to the Human Brain”, Aw et al 2023

“Mamba: Linear-Time Sequence Modeling With Selective State Spaces”, Gu & Dao 2023

“Sequential Modeling Enables Scalable Learning for Large Vision Models”, Bai et al 2023

“UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition”, Ding et al 2023

“Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets”, Blattmann et al 2023

“In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search”, Li et al 2023

“First Tragedy, Then Parse: History Repeats Itself in the New Era of Large Language Models”, Saphra et al 2023

“I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models”, Zhang et al 2023

“A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models”, Eisape et al 2023

“Sam Altman Accepts the 2023 Hawking Fellowship Award § Is There Another Breakthrough That’s Needed to Reach AGI?”, Altman 2023

“ConvNets Match Vision Transformers at Scale”, Smith et al 2023

“Evidence of Interrelated Cognitive-Like Capabilities in Large Language Models: Indications of Artificial General Intelligence or Achievement?”, Ilić & Gignac 2023

“PaLI-3 Vision Language Models: Smaller, Faster, Stronger”, Chen et al 2023

“GeoLLM: Extracting Geospatial Knowledge from Large Language Models”, Manvi et al 2023

“Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition ”, Chen et al 2023

Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition⁠

“Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning ”, Xia et al 2023

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning⁠

“FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation ”, Vu et al 2023

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation⁠

“Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors ”, Amos et al 2023

Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors⁠

“MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book ”, Tanzer et al 2023

MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book⁠

“Intriguing Properties of Generative Classifiers ”, Jaini et al 2023

Intriguing properties of generative classifiers⁠

“Taken out of Context: On Measuring Situational Awareness in LLMs ”, Berglund et al 2023

Taken out of context: On measuring situational awareness in LLMs⁠

“SeamlessM4T: Massively Multilingual & Multimodal Machine Translation ”, Communication et al 2023

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation⁠

“Simple Synthetic Data Reduces Sycophancy in Large Language Models ”, Wei et al 2023

Simple synthetic data reduces sycophancy in large language models⁠

“Scaling Relationship on Learning Mathematical Reasoning With Large Language Models ”, Yuan et al 2023

⁠Scaling Relationship on Learning Mathematical Reasoning with Large Language Models⁠

“LLaMA-2: Open Foundation and Fine-Tuned Chat Models ”, Touvron et al 2023

LLaMA-2: Open Foundation and Fine-Tuned Chat Models⁠

“Measuring Faithfulness in Chain-of-Thought Reasoning”, Lanham et al 2023

Measuring Faithfulness in Chain-of-Thought Reasoning⁠

“Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration ”, Wang et al 2023

Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration⁠

“Introducing Superalignment ”, Leike & Sutskever 2023

Introducing Superalignment⁠

“Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You? ”, Hofstadter & Kim 2023

Gödel, Escher, Bach author Douglas Hofstadter on the state of AI today § What about AI terrifies you?⁠

“Pretraining Task Diversity and the Emergence of Non-Bayesian In-Context Learning for Regression ”, Raventós et al 2023

Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression⁠

“Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs Are Pre-Trained on Formally Diverse Data”, Lee et al 2023

Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data⁠

“Scaling MLPs: A Tale of Inductive Bias ”, Bachmann et al 2023

Scaling MLPs: A Tale of Inductive Bias⁠

“Understanding Social Reasoning in Language Models With Language Models ”, Gandhi et al 2023

Understanding Social Reasoning in Language Models with Language Models⁠

“Image Captioners Are Scalable Vision Learners Too ”, Tschannen et al 2023

Image Captioners Are Scalable Vision Learners Too⁠

“PaLI-X: On Scaling up a Multilingual Vision and Language Model ”, Chen et al 2023

PaLI-X: On Scaling up a Multilingual Vision and Language Model⁠

“The False Promise of Imitating Proprietary LLMs ”, Gudibande et al 2023

The False Promise of Imitating Proprietary LLMs⁠

“Scaling Data-Constrained Language Models ”, Muennighoff et al 2023

Scaling Data-Constrained Language Models⁠

“FST: Improving Speech Translation by Fusing Speech and Text ”, Yin et al 2023

⁠FST: Improving speech translation by fusing speech and text⁠

“Scaling Laws for Language Encoding Models in fMRI”, Antonello et al 2023

Scaling laws for language encoding models in fMRI⁠

“LIMA: Less Is More for Alignment ”, Zhou et al 2023

LIMA: Less Is More for Alignment⁠

“Google’s Newest AI Model Uses Nearly 5× More Text Data for Training Than Its Predecessor ”, Elias 2023

Google’s newest AI model uses nearly 5× more text data for training than its predecessor⁠

“TorToise: Better Speech Synthesis through Scaling ”, Betker 2023

TorToise: Better speech synthesis through scaling⁠

“TinyStories: How Small Can Language Models Be and Still Speak Coherent English? ”, Eldan & Li 2023

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?⁠

“ImageBind: One Embedding Space To Bind Them All ”, Girdhar et al 2023

ImageBind: One Embedding Space To Bind Them All⁠

“Finding Neurons in a Haystack: Case Studies With Sparse Probing ”, Gurnee et al 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing⁠

“Geoffrey Hinton Tells Us Why He’s Now Scared of the Tech He Helped Build: ‘I Have Suddenly Switched My Views on Whether These Things Are Going to Be More Intelligent Than Us.’ ”, Heaven 2023

Geoffrey Hinton tells us why he’s now scared of the tech he helped build: ‘I have suddenly switched my views on whether these things are going to be more intelligent than us.’⁠

“Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 ”, Chang et al 2023

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4⁠

“Google’s DeepMind-Brain Merger: Tech Giant Regroups for AI Battle ”, Murgia 2023

Google’s DeepMind-Brain merger: tech giant regroups for AI battle⁠

“CLaMP: Contrastive Language-Music Pre-Training for Cross-Modal Symbolic Music Information Retrieval ”, Wu et al 2023

CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval⁠

“Emergent and Predictable Memorization in Large Language Models ”, Biderman et al 2023

Emergent and Predictable Memorization in Large Language Models⁠

“Power Law Trends in Speedrunning and Machine Learning ”, Erdil & Sevilla 2023

Power Law Trends in Speedrunning and Machine Learning⁠

“Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023

Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: today’s hearing on AI covered AI regulation and challenges, and the infamous open letter, which nearly everyone in the room thought was unwise

“DINOv2: Learning Robust Visual Features without Supervision ”, Oquab et al 2023

DINOv2: Learning Robust Visual Features without Supervision⁠

“Segment Anything ”, Kirillov et al 2023

Segment Anything⁠

“Humans in Humans Out: On GPT Converging Toward Common Sense in Both Success and Failure ”, Koralus & Wang-Maścianica 2023

Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure⁠

“Sigmoid Loss for Language Image Pre-Training ”, Zhai et al 2023

Sigmoid Loss for Language Image Pre-Training⁠

“‘AI’ on a Calculator: Part 1 [MNIST CNN on a TI-84 Graphing Calculator] ”, Mitchell 2023

‘AI’ on a Calculator: Part 1 [MNIST CNN on a TI-84 graphing calculator]

“How Well Do Large Language Models Perform in Arithmetic Tasks? ”, Yuan et al 2023

How well do Large Language Models perform in Arithmetic tasks?⁠

“GPT-4 Technical Report ”, OpenAI 2023

GPT-4 Technical Report⁠

“Securing Liberal Democratic Control of AGI through UK Leadership ”, Phillips 2023

Securing Liberal Democratic Control of AGI through UK Leadership⁠

“GigaGAN: Scaling up GANs for Text-to-Image Synthesis”, Kang et al 2023

GigaGAN: Scaling up GANs for Text-to-Image Synthesis⁠

“Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1) ”, Huang et al 2023

Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)⁠

“Why Didn’t DeepMind Build GPT-3? ”, Godwin 2023

Why didn’t DeepMind build GPT-3?⁠

“Is Multimodal Vision Supervision Beneficial to Language? ”, Madasu & Lal 2023

⁠Is Multimodal Vision Supervision Beneficial to Language?⁠

“Scaling Vision Transformers to 22 Billion Parameters ”, Dehghani et al 2023

Scaling Vision Transformers to 22 Billion Parameters⁠

“John Carmack’s ‘Different Path’ to Artificial General Intelligence ”, Carmack 2023

John Carmack’s ‘Different Path’ to Artificial General Intelligence

“Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, Nay 2023

Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards⁠

“ClimaX: A Foundation Model for Weather and Climate ”, Nguyen et al 2023

ClimaX: A foundation model for weather and climate⁠

“StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis”, Sauer et al 2023

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis⁠

“MUG: Vision Learners Meet Web Image-Text Pairs ”, Zhao et al 2023

MUG: Vision Learners Meet Web Image-Text Pairs⁠

“GPT-3 as Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities”, Bommarito et al 2023

GPT-3 as Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities⁠

“Scaling Laws for Generative Mixed-Modal Language Models ”, Aghajanyan et al 2023

Scaling Laws for Generative Mixed-Modal Language Models⁠

“VALL-E: Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers ”, Wang et al 2023

VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers⁠

“GPT-3 Takes the Bar Exam”, Bommarito & Katz 2022

GPT-3 Takes the Bar Exam⁠

“Cramming: Training a Language Model on a Single GPU in One Day ”, Geiping & Goldstein 2022

Cramming: Training a Language Model on a Single GPU in One Day⁠

“Evolutionary-Scale Prediction of Atomic Level Protein Structure With a Language Model ”, Lin et al 2022

Evolutionary-scale prediction of atomic level protein structure with a language model⁠

“Discovering Language Model Behaviors With Model-Written Evaluations ”, Perez et al 2022

Discovering Language Model Behaviors with Model-Written Evaluations⁠

“One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR) ”, Su et al 2022

One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)⁠

“Reproducible Scaling Laws for Contrastive Language-Image Learning ”, Cherti et al 2022

Reproducible scaling laws for contrastive language-image learning⁠

“ERNIE-Code: Beyond English-Centric Cross-Lingual Pretraining for Programming Languages ”, Chai et al 2022

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages⁠

“VideoCoCa: Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners ”, Yan et al 2022

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners⁠

“VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022

VindLU: A Recipe for Effective Video-and-Language Pretraining⁠

“Whisper: Robust Speech Recognition via Large-Scale Weak Supervision ”, Radford et al 2022

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision⁠

“Scaling Language-Image Pre-Training via Masking ”, Li et al 2022

Scaling Language-Image Pre-training via Masking⁠

“Galactica: A Large Language Model for Science ”, Taylor et al 2022

Galactica: A Large Language Model for Science⁠

“Large Language Models Struggle to Learn Long-Tail Knowledge ”, Kandpal et al 2022

Large Language Models Struggle to Learn Long-Tail Knowledge⁠

“EVA: Exploring the Limits of Masked Visual Representation Learning at Scale ”, Fang et al 2022

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale⁠

“MMDialog: A Large-Scale Multi-Turn Dialogue Dataset Towards Multi-Modal Open-Domain Conversation ”, Feng et al 2022

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation⁠

“Adversarial Policies Beat Superhuman Go AIs ”, Wang et al 2022

Adversarial Policies Beat Superhuman Go AIs⁠

“Increments Podcast: #45—4 Central Fallacies of AI Research (With Melanie Mitchell) ”, Mitchell & Chugg 2022

Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)⁠

“A Solvable Model of Neural Scaling Laws ”, Maloney et al 2022

A Solvable Model of Neural Scaling Laws⁠

“Will We Run out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning ”, Villalobos et al 2022

Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning⁠

“Evaluating Parameter Efficient Learning for Generation ”, Xu et al 2022

Evaluating Parameter Efficient Learning for Generation⁠

“FLAN: Scaling Instruction-Finetuned Language Models ”, Chung et al 2022

FLAN: Scaling Instruction-Finetuned Language Models⁠

“BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining ”, Luo et al 2022

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining⁠

“Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends ”, Gan et al 2022

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends⁠

“Foundation Transformers ”, Wang et al 2022

Foundation Transformers⁠

“Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle) ”, Press et al 2022

Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)⁠

“The Lie Derivative for Measuring Learned Equivariance ”, Gruver et al 2022

⁠The Lie Derivative for Measuring Learned Equivariance⁠

“GLM-130B: An Open Bilingual Pre-Trained Model ”, Zeng et al 2022

GLM-130B: An Open Bilingual Pre-trained Model⁠

“Ask Me Anything (AMA): A Simple Strategy for Prompting Language Models ”, Arora et al 2022

Ask Me Anything (AMA): A simple strategy for prompting language models⁠

“Do Current Multi-Task Optimization Methods in Deep Learning Even Help? ”, Xin et al 2022

Do Current Multi-Task Optimization Methods in Deep Learning Even Help?⁠

“Monolith: Real Time Recommendation System With Collisionless Embedding Table ”, Liu et al 2022

Monolith: Real Time Recommendation System With Collisionless Embedding Table⁠

“Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?”, Choudhury et al 2022

Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?

“PaLI: A Jointly-Scaled Multilingual Language-Image Model ”, Chen et al 2022

PaLI: A Jointly-Scaled Multilingual Language-Image Model⁠

“Using Large Language Models to Simulate Multiple Humans ”, Aher et al 2022

Using Large Language Models to Simulate Multiple Humans⁠

“Understanding Scaling Laws for Recommendation Models ”, Ardalani et al 2022

Understanding Scaling Laws for Recommendation Models⁠

“`LLM.int8()`: 8-Bit Matrix Multiplication for Transformers at Scale ”, Dettmers et al 2022

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale⁠

“Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP ”, Nguyen et al 2022

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP⁠

“Efficient Training of Language Models to Fill in the Middle ”, Bavarian et al 2022

Efficient Training of Language Models to Fill in the Middle⁠

“Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? ”, Grinsztajn et al 2022

Why do tree-based models still outperform deep learning on tabular data?⁠

“PIXEL: Language Modeling With Pixels ”, Rust et al 2022

PIXEL: Language Modeling with Pixels⁠

“High-Performing Neural Network Models of Visual Cortex Benefit from High Latent Dimensionality ”, Elmoznino & Bonner 2022

High-performing neural network models of visual cortex benefit from high latent dimensionality⁠

“Exploring Length Generalization in Large Language Models ”, Anil et al 2022

Exploring Length Generalization in Large Language Models⁠

“Language Models (Mostly) Know What They Know ”, Kadavath et al 2022

Language Models (Mostly) Know What They Know⁠

“On-Device Training Under 256KB Memory ”, Lin et al 2022

On-Device Training Under 256KB Memory⁠

“Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning ”, Sorscher et al 2022

Beyond neural scaling laws: beating power law scaling via data pruning⁠

“ProGen2: Exploring the Boundaries of Protein Language Models ”, Nijkamp et al 2022

ProGen2: Exploring the Boundaries of Protein Language Models⁠

“RST: reStructured Pre-Training”, Yuan & Liu 2022

RST: reStructured Pre-training⁠

“Limitations of the NTK for Understanding Generalization in Deep Learning ”, Vyas et al 2022

Limitations of the NTK for Understanding Generalization in Deep Learning⁠

“Modeling Transformative AI Risks (MTAIR) Project—Summary Report ”, Clarke et al 2022

Modeling Transformative AI Risks (MTAIR) Project—Summary Report⁠

“LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks ”, Dinh et al 2022

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks⁠

“BigVGAN: A Universal Neural Vocoder With Large-Scale Training ”, Lee et al 2022

BigVGAN: A Universal Neural Vocoder with Large-Scale Training⁠

“An Improved One Millisecond Mobile Backbone ”, Vasu et al 2022

An Improved One millisecond Mobile Backbone⁠

“A Neural Corpus Indexer for Document Retrieval ”, Wang et al 2022

A Neural Corpus Indexer for Document Retrieval⁠

“Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning ”, Millet et al 2022

Toward a realistic model of speech processing in the brain with self-supervised learning⁠

“Teaching Models to Express Their Uncertainty in Words ”, Lin et al 2022

Teaching Models to Express Their Uncertainty in Words⁠

“Why Robust Generalization in Deep Learning Is Difficult: Perspective of Expressive Power ”, Li et al 2022

Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power⁠

“M3AE: Multimodal Masked Autoencoders Learn Transferable Representations ”, Geng et al 2022

M3AE: Multimodal Masked Autoencoders Learn Transferable Representations⁠

“InstructDial: Improving Zero and Few-Shot Generalization in Dialogue through Instruction Tuning ”, Gupta et al 2022

InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning⁠

“Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models ”, Tirumala et al 2022

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models⁠

“Least-to-Most Prompting Enables Complex Reasoning in Large Language Models”, Zhou et al 2022

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models⁠

“Continual Pre-Training Mitigates Forgetting in Language and Vision ”, Cossu et al 2022

Continual Pre-Training Mitigates Forgetting in Language and Vision⁠

“Dialog Inpainting: Turning Documents into Dialogues ”, Dai et al 2022

Dialog Inpainting: Turning Documents into Dialogues⁠

“UL2: Unifying Language Learning Paradigms ”, Tay et al 2022

UL2: Unifying Language Learning Paradigms⁠

“Building Machine Translation Systems for the Next Thousand Languages ”, Bapna et al 2022

Building Machine Translation Systems for the Next Thousand Languages⁠

“When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet ”, Vasudevan et al 2022

When does dough become a bagel? Analyzing the remaining mistakes on ImageNet⁠

“CoCa: Contrastive Captioners Are Image-Text Foundation Models ”, Yu et al 2022

CoCa: Contrastive Captioners are Image-Text Foundation Models⁠

“Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP) ”, Fang et al 2022

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)⁠

“Continual Learning With Foundation Models: An Empirical Study of Latent Replay ”, Ostapenko et al 2022

Continual Learning with Foundation Models: An Empirical Study of Latent Replay⁠

“Flamingo: a Visual Language Model for Few-Shot Learning ”, Alayrac et al 2022

Flamingo: a Visual Language Model for Few-Shot Learning⁠

“WebFace260M: A Benchmark for Million-Scale Deep Face Recognition ”, Zhu et al 2022

WebFace260M: A Benchmark for Million-Scale Deep Face Recognition⁠

“What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? ”, Wang et al 2022

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?⁠

“DeepMind: The Podcast—Excerpts on AGI ”, Kiely 2022

DeepMind: The Podcast—Excerpts on AGI⁠

“Can Language Models Learn from Explanations in Context? ”, Lampinen et al 2022

Can language models learn from explanations in context?⁠

“Chinchilla: Training Compute-Optimal Large Language Models ”, Hoffmann et al 2022

Chinchilla: Training Compute-Optimal Large Language Models⁠

“A Roadmap for Big Model ”, Yuan et al 2022

A Roadmap for Big Model⁠

“A Conversational Paradigm for Program Synthesis ”, Nijkamp et al 2022

A Conversational Paradigm for Program Synthesis⁠

“Self-Consistency Improves Chain-of-Thought Reasoning in Language Models”, Wang et al 2022

Self-Consistency Improves Chain-of-Thought Reasoning in Language Models⁠

“Effect of Scale on Catastrophic Forgetting in Neural Networks ”, Ramasesh et al 2022

Effect of scale on catastrophic forgetting in neural networks⁠

“Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer ”, Yang et al 2022

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer⁠

“FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours ”, Cheng et al 2022

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours⁠

“Variational Autoencoders Without the Variation ”, Daly et al 2022

Variational Autoencoders Without the Variation⁠

“Performance Reserves in Brain-Imaging-Based Phenotype Prediction ”, Schulz et al 2022

Performance reserves in brain-imaging-based phenotype prediction⁠

“Self-Distilled StyleGAN: Towards Generation from Internet Photos ”, Mokady et al 2022

Self-Distilled StyleGAN: Towards Generation from Internet Photos⁠

“UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training”, Khashabi et al 2022

UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training⁠

“Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision ”, Goyal et al 2022

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision⁠

“Brains and Algorithms Partially Converge in Natural Language Processing ”, Caucheteux & King 2022

Brains and algorithms partially converge in natural language processing⁠

“Quantifying Memorization Across Neural Language Models ”, Carlini et al 2022

Quantifying Memorization Across Neural Language Models⁠

“Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework ”, Gu et al 2022

Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework⁠

“OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, Wang et al 2022

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework⁠

“Data Scaling Laws in NMT: The Effect of Noise and Architecture ”, Bansal et al 2022

Data Scaling Laws in NMT: The Effect of Noise and Architecture⁠

“Webly Supervised Concept Expansion for General Purpose Vision Models ”, Kamath et al 2022

Webly Supervised Concept Expansion for General Purpose Vision Models⁠

“StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets ”, Sauer et al 2022

StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets⁠

“Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model ”, Smith et al 2022

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model⁠

“Reasoning Like Program Executors ”, Pi et al 2022

Reasoning Like Program Executors⁠

“Text and Code Embeddings by Contrastive Pre-Training ”, Neelakantan et al 2022

Text and Code Embeddings by Contrastive Pre-Training⁠

“LaMDA: Language Models for Dialog Applications ”, Thoppilan et al 2022

LaMDA: Language Models for Dialog Applications⁠

“SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models ”, Singh et al 2022

SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models⁠

“CM3: A Causal Masked Multimodal Model of the Internet ”, Aghajanyan et al 2022

CM3: A Causal Masked Multimodal Model of the Internet⁠

“ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization ”, Xu et al 2022

ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization⁠

“A High-Dimensional Sphere Spilling out of a High-Dimensional Cube despite Exponentially Many Constraints ”, Fort 2022

A high-dimensional sphere spilling out of a high-dimensional cube despite exponentially many constraints

“ConvNeXt: A ConvNet for the 2020s ”, Liu et al 2022

ConvNeXt: A ConvNet for the 2020s⁠

“The Defeat of the Winograd Schema Challenge ”, Kocijan et al 2022

The Defeat of the Winograd Schema Challenge⁠

“Robust Self-Supervised Audio-Visual Speech Recognition ”, Shi et al 2022

Robust Self-Supervised Audio-Visual Speech Recognition⁠

“AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction ”, Shi et al 2022

AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction⁠

“Self-Supervised Learning from 100 Million Medical Images ”, Ghesu et al 2022

Self-supervised Learning from 100 Million Medical Images⁠

“The Evolution of Quantitative Sensitivity ”, Bryer et al 2021

The evolution of quantitative sensitivity⁠

“ERNIE 3.0 Titan: Exploring Larger-Scale Knowledge Enhanced Pre-Training for Language Understanding and Generation ”, Wang et al 2021

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation⁠

“XGLM: Few-Shot Learning With Multilingual Language Models ”, Lin et al 2021

XGLM: Few-shot Learning with Multilingual Language Models⁠

“An Empirical Investigation of the Role of Pre-Training in Lifelong Learning ”, Mehta et al 2021

An Empirical Investigation of the Role of Pre-training in Lifelong Learning⁠

“Few-Shot Instruction Prompts for Pretrained Language Models to Detect Social Biases ”, Prabhumoye et al 2021

Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases⁠

“Knowledge-Rich Self-Supervised Entity Linking ”, Zhang et al 2021

Knowledge-Rich Self-Supervised Entity Linking⁠

“You Only Need One Model for Open-Domain Question Answering ”, Lee et al 2021

You Only Need One Model for Open-domain Question Answering⁠

“EBERT: Epigenomic Language Models Powered by Cerebras ”, Trotter et al 2021

EBERT: Epigenomic language models powered by Cerebras⁠

“MAGMA—Multimodal Augmentation of Generative Models through Adapter-Based Finetuning ”, Eichenberg et al 2021

MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning⁠

“Improving Language Models by Retrieving from Trillions of Tokens ”, Borgeaud et al 2021

Improving language models by retrieving from trillions of tokens⁠

“MLP Architectures for Vision-and-Language Modeling: An Empirical Study”, Nie et al 2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study⁠

“LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning ”, Hu et al 2021

LEMON: Scaling Up Vision-Language Pre-training for Image Captioning⁠

“Sparse Is Enough in Scaling Transformers ”, Jaszczur et al 2021

Sparse is Enough in Scaling Transformers⁠

“Can Pre-Trained Language Models Be Used to Resolve Textual and Semantic Merge Conflicts? ”, Zhang et al 2021

Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?⁠

“ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning ”, Aribandi et al 2021

ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning⁠

“L-Verse: Bidirectional Generation Between Image and Text ”, Kim et al 2021

L-Verse: Bidirectional Generation Between Image and Text⁠

“RedCaps: Web-Curated Image-Text Data Created by the People, for the People ”, Desai et al 2021

RedCaps: web-curated image-text data created by the people, for the people⁠

“Florence: A New Foundation Model for Computer Vision ”, Yuan et al 2021

Florence: A New Foundation Model for Computer Vision⁠

“BASIC: Combined Scaling for Open-Vocabulary Image Classification ”, Pham et al 2021

BASIC: Combined Scaling for Open-Vocabulary Image Classification⁠

“Swin Transformer V2: Scaling Up Capacity and Resolution ”, Liu et al 2021

Swin Transformer V2: Scaling Up Capacity and Resolution⁠

“XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale ”, Babu et al 2021

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale⁠

“Solving Linear Algebra by Program Synthesis ”, Drori & Verma 2021

Solving Linear Algebra by Program Synthesis⁠

“Covariate Shift in High-Dimensional Random Feature Regression ”, Tripuraneni et al 2021

Covariate Shift in High-Dimensional Random Feature Regression⁠

“Solving Probability and Statistics Problems by Program Synthesis ”, Tang et al 2021

Solving Probability and Statistics Problems by Program Synthesis⁠

“Few-Shot Self-Rationalization With Natural Language Prompts ”, Marasović et al 2021

Few-Shot Self-Rationalization with Natural Language Prompts⁠

“INTERN: A New Learning Paradigm Towards General Vision ”, Shao et al 2021

INTERN: A New Learning Paradigm Towards General Vision⁠

“Scaling Law for Recommendation Models: Towards General-Purpose User Representations ”, Shin et al 2021

Scaling Law for Recommendation Models: Towards General-purpose User Representations⁠

“MAE: Masked Autoencoders Are Scalable Vision Learners ”, He et al 2021

MAE: Masked Autoencoders Are Scalable Vision Learners⁠

“Persia: An Open, Hybrid System Scaling Deep Learning-Based Recommenders up to 100 Trillion Parameters ”, Lian et al 2021

Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters⁠

“Scaling ASR Improves Zero and Few Shot Learning ”, Xiao et al 2021

Scaling ASR Improves Zero and Few Shot Learning⁠

“Turing-Universal Learners With Optimal Scaling Laws ”, Nakkiran 2021

Turing-Universal Learners with Optimal Scaling Laws⁠

“LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs ”, Schuhmann et al 2021

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs⁠

“Training Verifiers to Solve Math Word Problems ”, Cobbe et al 2021

Training Verifiers to Solve Math Word Problems⁠

“Wide Neural Networks Forget Less Catastrophically ”, Mirzadeh et al 2021

Wide Neural Networks Forget Less Catastrophically⁠

“When in Doubt, Summon the Titans: Efficient Inference With Large Models ”, Rawat et al 2021

When in Doubt, Summon the Titans: Efficient Inference with Large Models⁠

“The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail ”, Bowman 2021

The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail⁠

“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models ”, West et al 2021

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models⁠

“LFPT5: A Unified Framework for Lifelong Few-Shot Language Learning Based on Prompt Tuning of T5 ”, Qin & Joty 2021

LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5⁠

“Scaling Laws for the Few-Shot Adaptation of Pre-Trained Image Classifiers ”, Prato et al 2021

Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers⁠

“Unsupervised Neural Machine Translation With Generative Language Models Only ”, Han et al 2021

Unsupervised Neural Machine Translation with Generative Language Models Only⁠

“Yuan 1.0: Large-Scale Pre-Trained Language Model in Zero-Shot and Few-Shot Learning ”, Wu et al 2021

Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning⁠

“Universal Paralinguistic Speech Representations Using Self-Supervised Conformers ”, Shor et al 2021

Universal Paralinguistic Speech Representations Using Self-Supervised Conformers⁠

“M6–10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining ”, Lin et al 2021

M6–10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining⁠

“A Few More Examples May Be Worth Billions of Parameters ”, Kirstain et al 2021

A Few More Examples May Be Worth Billions of Parameters⁠

“Exploring the Limits of Large Scale Pre-Training ”, Abnar et al 2021

Exploring the Limits of Large Scale Pre-training⁠

“Show Your Work: Scratchpads for Intermediate Computation With Language Models ”, Nye et al 2021

Show Your Work: Scratchpads for Intermediate Computation with Language Models⁠

“Mining for Strong Gravitational Lenses With Self-Supervised Learning ”, Stein et al 2021

Mining for strong gravitational lenses with self-supervised learning⁠

“Stochastic Training Is Not Necessary for Generalization ”, Geiping et al 2021

Stochastic Training is Not Necessary for Generalization⁠

“Evaluating Machine Accuracy on ImageNet ”, Shankar et al 2021

Evaluating Machine Accuracy on ImageNet⁠

“BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition ”, Zhang et al 2021

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition⁠

“Scale Efficiently: Insights from Pre-Training and Fine-Tuning Transformers ”, Tay et al 2021

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers⁠

“Scaling Laws for Neural Machine Translation ”, Ghorbani et al 2021

Scaling Laws for Neural Machine Translation⁠

“What Changes Can Large-Scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-Scale Korean Generative Pretrained Transformers ”, Kim et al 2021

What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers⁠

“A Recipe For Arbitrary Text Style Transfer With Large Language Models ”, Reif et al 2021

A Recipe For Arbitrary Text Style Transfer with Large Language Models⁠

“TruthfulQA: Measuring How Models Mimic Human Falsehoods ”, Lin et al 2021

TruthfulQA: Measuring How Models Mimic Human Falsehoods⁠

“A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning ”, Dar et al 2021

A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning⁠

“General-Purpose Question-Answering With Macaw ”, Tafjord & Clark 2021

General-Purpose Question-Answering with Macaw⁠

“An Empirical Exploration in Quality Filtering of Text Data ”, Gao 2021

An Empirical Exploration in Quality Filtering of Text Data⁠

“A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP ”, Zhao et al 2021

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP⁠

“Want To Reduce Labeling Cost? GPT-3 Can Help ”, Wang et al 2021

Want To Reduce Labeling Cost? GPT-3 Can Help⁠

“Data and Parameter Scaling Laws for Neural Machine Translation ”, Gordon et al 2021

Data and Parameter Scaling Laws for Neural Machine Translation⁠

“Do Vision Transformers See Like Convolutional Neural Networks? ”, Raghu et al 2021

Do Vision Transformers See Like Convolutional Neural Networks?⁠

“Modeling Protein Using Large-Scale Pretrain Language Model ”, Xiao et al 2021

Modeling Protein Using Large-scale Pretrain Language Model⁠

“Scaling Laws for Deep Learning ”, Rosenfeld 2021

Scaling Laws for Deep Learning⁠

“Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations ”, Beal et al 2021

Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations⁠

“Facebook AI WMT21 News Translation Task Submission ”, Tran et al 2021

Facebook AI WMT21 News Translation Task Submission⁠

“EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training ”, Zhou et al 2021

EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training⁠

“The History of Speech Recognition to the Year 2030 ”, Hannun 2021

The History of Speech Recognition to the Year 2030⁠

“The History of Speech Recognition to the Year 2030 [Blog] ”, Hannun 2021

The History of Speech Recognition to the Year 2030 [blog]: /doc/www/awni.github.io/4cd4c7eaf803a808b8ea623005f67672af20a2fd.html

“A Field Guide to Federated Optimization ”, Wang et al 2021

A Field Guide to Federated Optimization⁠

“HTLM: Hyper-Text Pre-Training and Prompting of Language Models ”, Aghajanyan et al 2021

HTLM: Hyper-Text Pre-Training and Prompting of Language Models⁠

“Brain-Like Functional Specialization Emerges Spontaneously in Deep Neural Networks ”, Dobs et al 2021

Brain-like functional specialization emerges spontaneously in deep neural networks⁠

“ERNIE 3.0: Large-Scale Knowledge Enhanced Pre-Training for Language Understanding and Generation ”, Sun et al 2021

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation⁠

“Scarecrow: A Framework for Scrutinizing Machine Text ”, Dou et al 2021

Scarecrow: A Framework for Scrutinizing Machine Text⁠

“The Dimpled Manifold Model of Adversarial Examples in Machine Learning ”, Shamir et al 2021

The Dimpled Manifold Model of Adversarial Examples in Machine Learning⁠

“Revisiting the Calibration of Modern Neural Networks ”, Minderer et al 2021

Revisiting the Calibration of Modern Neural Networks⁠

“Partial Success in Closing the Gap between Human and Machine Vision ”, Geirhos et al 2021

Partial success in closing the gap between human and machine vision⁠

“HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units ”, Hsu et al 2021

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units⁠

“Scaling Laws for Acoustic Models ”, Droppo & Elibol 2021

Scaling Laws for Acoustic Models⁠

“CoAtNet: Marrying Convolution and Attention for All Data Sizes ”, Dai et al 2021

CoAtNet: Marrying Convolution and Attention for All Data Sizes⁠

“Scaling Vision Transformers ”, Zhai et al 2021

Scaling Vision Transformers⁠

“Exploring the Limits of Out-Of-Distribution Detection ”, Fort et al 2021

Exploring the Limits of Out-of-Distribution Detection⁠

“Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images ”, Cherti & Jitsev 2021

Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images⁠

“A Universal Law of Robustness via Isoperimetry ”, Bubeck & Sellke 2021

A Universal Law of Robustness via Isoperimetry⁠

“Naver Unveils First ‘Hyperscale’ AI Platform ”, Jae-eun 2021

Naver unveils first ‘hyperscale’ AI platform

“Unsupervised Speech Recognition ”, Baevski et al 2021

Unsupervised Speech Recognition⁠

“One4all User Representation for Recommender Systems in E-Commerce ”, Shin et al 2021

One4all User Representation for Recommender Systems in E-commerce⁠

“RecPipe: Co-Designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance ”, Gupta et al 2021

RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance⁠

“Google Details New AI Accelerator Chips ”, Wiggers 2021

Google details new AI accelerator chips⁠

“MLP-Mixer: An All-MLP Architecture for Vision ”, Tolstikhin et al 2021

MLP-Mixer: An all-MLP Architecture for Vision⁠

“XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling ”, Goyal et al 2021

XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling⁠

“Scaling End-To-End Models for Large-Scale Multilingual ASR ”, Li et al 2021

Scaling End-to-End Models for Large-Scale Multilingual ASR⁠

“DINO: Emerging Properties in Self-Supervised Vision Transformers ”, Caron et al 2021

DINO: Emerging Properties in Self-Supervised Vision Transformers⁠

“What Are Bayesian Neural Network Posteriors Really Like? ”, Izmailov et al 2021

What Are Bayesian Neural Network Posteriors Really Like?⁠

“Machine Learning Scaling ”, Gwern 2021

⁠Machine Learning Scaling

“[Ali Released PLUG: 27 Billion Parameters, the Largest Pre-Trained Language Model in the Chinese Community] ”, Yuying 2021

[Ali released PLUG: 27 billion parameters, the largest pre-trained language model in the Chinese community]

“The Power of Scale for Parameter-Efficient Prompt Tuning ”, Lester et al 2021

The Power of Scale for Parameter-Efficient Prompt Tuning⁠

“Revealing Persona Biases in Dialogue Systems ”, Sheng et al 2021

Revealing Persona Biases in Dialogue Systems⁠

“CrossFit: A Few-Shot Learning Challenge for Cross-Task Generalization in NLP ”, Ye et al 2021

CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP⁠

“Probing Across Time: What Does RoBERTa Know and When? ”, Liu et al 2021

Probing Across Time: What Does RoBERTa Know and When?⁠

“Memorization versus Generalization in Pre-Trained Language Models ”, Tänzer et al 2021

Memorization versus Generalization in Pre-trained Language Models⁠

“Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation ”, Wang et al 2021

Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation⁠

“Scaling Laws for Language Transfer Learning ”, Kim 2021

Scaling Laws for Language Transfer Learning⁠

“Adapting Language Models for Zero-Shot Learning by Meta-Tuning on Dataset and Prompt Collections ”, Zhong et al 2021

Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections⁠

“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network ”, Chan et al 2021

SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network⁠

“Understanding Robustness of Transformers for Image Classification ”, Bhojanapalli et al 2021

Understanding Robustness of Transformers for Image Classification⁠

“UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark ”, Lourie et al 2021

UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark⁠

“Controllable Generation from Pre-Trained Language Models via Inverse Prompting ”, Zou et al 2021

Controllable Generation from Pre-trained Language Models via Inverse Prompting⁠

“The Shape of Learning Curves: a Review ”, Viering & Loog 2021

The Shape of Learning Curves: a Review⁠

“Efficient Visual Pretraining With Contrastive Detection ”, Hénaff et al 2021

Efficient Visual Pretraining with Contrastive Detection⁠

“Revisiting ResNets: Improved Training and Scaling Strategies ”, Bello et al 2021

Revisiting ResNets: Improved Training and Scaling Strategies⁠

“Learning from Videos to Understand the World ”, Zweig et al 2021

Learning from videos to understand the world⁠

“WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training ”, Huo et al 2021

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training⁠

“Fast and Accurate Model Scaling ”, Dollár et al 2021

Fast and Accurate Model Scaling⁠

“Pretrained Transformers As Universal Computation Engines ”, Lu et al 2021

Pretrained Transformers as Universal Computation Engines⁠

“Greedy Hierarchical Variational Autoencoders (GHVAEs) for Large-Scale Video Prediction ”, Wu et al 2021

Greedy Hierarchical Variational Autoencoders (GHVAEs) for Large-Scale Video Prediction⁠

“Measuring Mathematical Problem Solving With the MATH Dataset ”, Hendrycks et al 2021

Measuring Mathematical Problem Solving With the MATH Dataset⁠

“A Law of Robustness for Two-Layers Neural Networks ”, Bubeck et al 2021

A law of robustness for two-layers neural networks⁠

“SEER: Self-Supervised Pretraining of Visual Features in the Wild ”, Goyal et al 2021

SEER: Self-supervised Pretraining of Visual Features in the Wild⁠

“M6: A Chinese Multimodal Pretrainer ”, Lin et al 2021

M6: A Chinese Multimodal Pretrainer⁠

“Zero-Shot Text-To-Image Generation ”, Ramesh et al 2021

Zero-Shot Text-to-Image Generation⁠

“Improved Denoising Diffusion Probabilistic Models ”, Nichol & Dhariwal 2021

Improved Denoising Diffusion Probabilistic Models⁠

“Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts ”, Changpinyo et al 2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts⁠

“A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes ”, Nado et al 2021

A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes⁠

“Explaining Neural Scaling Laws ”, Bahri et al 2021

Explaining Neural Scaling Laws⁠

“ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision ”, Jia et al 2021

ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision⁠

“NFNet: High-Performance Large-Scale Image Recognition Without Normalization ”, Brock et al 2021

NFNet: High-Performance Large-Scale Image Recognition Without Normalization⁠

“Learning Curve Theory ”, Hutter 2021

Learning Curve Theory⁠

“1-Bit Adam: Communication Efficient Large-Scale Training With Adam’s Convergence Speed ”, Tang et al 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed⁠

“Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling ”, Lazaridou et al 2021

Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling⁠

“Scaling Laws for Transfer ”, Hernandez et al 2021

Scaling Laws for Transfer⁠

“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning ”, Lee et al 2021

Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning⁠

“Muppet: Massive Multi-Task Representations With Pre-Finetuning ”, Aghajanyan et al 2021

Muppet: Massive Multi-task Representations with Pre-Finetuning⁠

“Language Processing in Brains and Deep Neural Networks: Computational Convergence and Its Limits ”, Caucheteux & King 2021

Language processing in brains and deep neural networks: computational convergence and its limits⁠

“Meta Pseudo Labels ”, Pham et al 2021

Meta Pseudo Labels⁠

“CLIP: Learning Transferable Visual Models From Natural Language Supervision ”, Radford et al 2021

CLIP: Learning Transferable Visual Models From Natural Language Supervision⁠

“VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation ”, Wang et al 2021

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation⁠

“CDLM: Cross-Document Language Modeling ”, Caciularu et al 2021

CDLM: Cross-Document Language Modeling⁠

“VinVL: Revisiting Visual Representations in Vision-Language Models ”, Zhang et al 2021

VinVL: Revisiting Visual Representations in Vision-Language Models⁠

“Parameter Count vs Training Dataset Size (1952–2021) ”, Adlam 2021

Parameter count vs Training dataset size (1952–2021): /doc/ai/scaling/2021-adlam.pdf

“Process for Adapting Language Models to Society (PALMS) With Values-Targeted Datasets ”, Solaiman & Dennison 2021

Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets⁠

“Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning ”, Aghajanyan et al 2020

⁠Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning⁠

“Extrapolating GPT-N Performance ”, Finnveden 2020

Extrapolating GPT-N performance⁠

“Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences ”, Rives et al 2020

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences⁠

“CPM: A Large-Scale Generative Chinese Pre-Trained Language Model ”, Zhang et al 2020

CPM: A Large-scale Generative Chinese Pre-trained Language Model⁠

“Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images ”, Child 2020

Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images⁠

“When Do You Need Billions of Words of Pretraining Data? ”, Zhang et al 2020

When Do You Need Billions of Words of Pretraining Data?⁠

“Scaling Laws for Autoregressive Generative Modeling ”, Henighan et al 2020

Scaling Laws for Autoregressive Generative Modeling⁠

“Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus ”, Caswell et al 2020

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus⁠

“mT5: A Massively Multilingual Pre-Trained Text-To-Text Transformer ”, Xue et al 2020

mT5: A massively multilingual pre-trained text-to-text transformer⁠

“Beyond English-Centric Multilingual Machine Translation ”, Fan et al 2020

Beyond English-Centric Multilingual Machine Translation⁠

“Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition ”, Zhang et al 2020

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition⁠

“Towards End-To-End In-Image Neural Machine Translation ”, Mansimov et al 2020

Towards End-to-End In-Image Neural Machine Translation⁠

“The First AI Model That Translates 100 Languages without Relying on English Data ”, Fan 2020

The first AI model that translates 100 languages without relying on English data⁠

“The Deep Bootstrap Framework: Good Online Learners Are Good Offline Generalizers ”, Nakkiran et al 2020

The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers⁠

“Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) ”, Warstadt et al 2020

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)⁠

“The Neural Architecture of Language: Integrative Reverse-Engineering Converges on a Model for Predictive Processing ”, Schrimpf et al 2020

The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing⁠

“Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples ”, Gowal et al 2020

Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples⁠

“Fast Stencil-Code Computation on a Wafer-Scale Processor ”, Rocki et al 2020

Fast Stencil-Code Computation on a Wafer-Scale Processor⁠

“Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale ”, Dosovitskiy et al 2020

Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale⁠

“Small Data, Big Decisions: Model Selection in the Small-Data Regime ”, Bornschein et al 2020

Small Data, Big Decisions: Model Selection in the Small-Data Regime⁠

“New Report on How Much Computational Power It Takes to Match the Human Brain ”, Carlsmith 2020

New Report on How Much Computational Power It Takes to Match the Human Brain⁠

“Generative Language Modeling for Automated Theorem Proving ”, Polu & Sutskever 2020

Generative Language Modeling for Automated Theorem Proving⁠

“GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce ”, Bell et al 2020

GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce⁠

“Accuracy and Performance Comparison of Video Action Recognition Approaches ”, Hutchinson et al 2020

Accuracy and Performance Comparison of Video Action Recognition Approaches⁠

“Generative Models Are Unsupervised Predictors of Page Quality: A Colossal-Scale Study ”, Bahri et al 2020

Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study⁠

“Matt Botvinick on the Spontaneous Emergence of Learning Algorithms ”, Scholl 2020

Matt Botvinick on the spontaneous emergence of learning algorithms⁠

“Self-Supervised Learning through the Eyes of a Child ”, Orhan et al 2020

Self-supervised learning through the eyes of a child⁠

“On Robustness and Transferability of Convolutional Neural Networks ”, Djolonga et al 2020

On Robustness and Transferability of Convolutional Neural Networks⁠

“Hopfield Networks Is All You Need ”, Ramsauer et al 2020

Hopfield Networks is All You Need⁠

“ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing ”, Elnaggar et al 2020

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing⁠

“NVAE: A Deep Hierarchical Variational Autoencoder ”, Vahdat & Kautz 2020

NVAE: A Deep Hierarchical Variational Autoencoder⁠

“Measuring Robustness to Natural Distribution Shifts in Image Classification ”, Taori et al 2020

Measuring Robustness to Natural Distribution Shifts in Image Classification⁠

“WinoGrande: An Adversarial Winograd Schema Challenge at Scale ”, Sakaguchi et al 2020

WinoGrande: An Adversarial Winograd Schema Challenge at Scale⁠

“Is SGD a Bayesian Sampler? Well, Almost ”, Mingard et al 2020

Is SGD a Bayesian sampler? Well, almost⁠

“Unsupervised Cross-Lingual Representation Learning for Speech Recognition ”, Conneau et al 2020

Unsupervised Cross-lingual Representation Learning for Speech Recognition⁠

“Logarithmic Pruning Is All You Need ”, Orseau et al 2020

Logarithmic Pruning is All You Need⁠

“wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations ”, Baevski et al 2020

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations⁠

“Denoising Diffusion Probabilistic Models ”, Ho et al 2020

Denoising Diffusion Probabilistic Models⁠

“On the Predictability of Pruning Across Scales ”, Rosenfeld et al 2020

On the Predictability of Pruning Across Scales⁠

“iGPT: Generative Pretraining from Pixels ”, Chen et al 2020

iGPT: Generative Pretraining from Pixels⁠

“SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments ”, Caron et al 2020

SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments⁠

“SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners ”, Chen et al 2020

SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners⁠

“Image GPT (iGPT): We Find That, Just As a Large Transformer Model Trained on Language Can Generate Coherent Text, the Same Exact Model Trained on Pixel Sequences Can Generate Coherent Image Completions and Samples ”, Chen et al 2020

Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples⁠

“Are We Done With ImageNet? ”, Beyer et al 2020

Are we done with ImageNet?⁠

“OpenAI API ”, Brockman et al 2020

OpenAI API⁠

“Object Segmentation Without Labels With Large-Scale Generative Models ”, Voynov et al 2020

Object Segmentation Without Labels with Large-Scale Generative Models⁠

“How Big Should My Language Model Be? ”, Scao 2020

How Big Should My Language Model Be?⁠

“GPT-3 Paper § Figure F.1: Four Uncurated Completions from a Context Suggesting the Model Compose a Poem in the Style of Wallace Stevens With the Title ‘Shadows on the Way’ ”, GPT-3 2020 (page 48)

GPT-3 paper § Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace Stevens with the title ‘Shadows on the Way’⁠

“Danny Hernandez on Forecasting and the Drivers of AI Progress ”, Koehler et al 2020

Danny Hernandez on forecasting and the drivers of AI progress⁠

“Powered by AI: Advancing Product Understanding and Building New Shopping Experiences ”, Berg et al 2020

Powered by AI: Advancing product understanding and building new shopping experiences⁠

“ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale ”, Team 2020

ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale⁠

“Measuring the Algorithmic Efficiency of Neural Networks ”, Hernandez & Brown 2020

Measuring the Algorithmic Efficiency of Neural Networks⁠

“Pushing the Limit of Molecular Dynamics With ab Initio Accuracy to 100 Million Atoms With Machine Learning ”, Jia et al 2020

Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning⁠

“Jukebox: We’re Introducing Jukebox, a Neural Net That Generates Music, including Rudimentary Singing, As Raw Audio in a Variety of Genres and Artist Styles. We’re Releasing the Model Weights and Code, along With a Tool to Explore the Generated Samples. ”, Dhariwal et al 2020

Jukebox: We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.⁠

“Blender: A State-Of-The-Art Open Source Chatbot ”, Roller et al 2020

Blender: A state-of-the-art open source chatbot⁠

“A Review of Winograd Schema Challenge Datasets and Approaches ”, Kocijan et al 2020

A Review of Winograd Schema Challenge Datasets and Approaches⁠

“Scaling Laws from the Data Manifold Dimension ”, Sharma & Kaplan 2020

Scaling Laws from the Data Manifold Dimension⁠

“DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications ”, Zeng et al 2020

DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications⁠

“PALM: Pre-Training an Autoencoding & Autoregressive Language Model for Context-Conditioned Generation ”, Bi et al 2020

PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation⁠

“Deep Learning Training in Facebook Data Centers: Design of Scale-Up and Scale-Out Systems ”, Naumov et al 2020

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems⁠

“TTTTTackling WinoGrande Schemas ”, Lin et al 2020

TTTTTackling WinoGrande Schemas⁠

“A Metric Learning Reality Check ”, Musgrave et al 2020

A Metric Learning Reality Check⁠

“Zoom In: An Introduction to Circuits—By Studying the Connections between Neurons, We Can Find Meaningful Algorithms in the Weights of Neural Networks ”, Olah et al 2020

Zoom In: An Introduction to Circuits—By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks⁠

“Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited ”, Maddox et al 2020

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited⁠

“Rethinking Bias-Variance Trade-Off for Generalization of Neural Networks ”, Yang et al 2020

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks⁠

“Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers ”, Li et al 2020

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers⁠

“The Messy, Secretive Reality behind OpenAI’s Bid to Save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the Inside Story of How Competitive Pressure Eroded That Idealism ”, Hao 2020

The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism⁠

“The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence ”, Marcus 2020

The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence⁠

“A Simple Framework for Contrastive Learning of Visual Representations ”, Chen et al 2020

A Simple Framework for Contrastive Learning of Visual Representations⁠

“How Much Knowledge Can You Pack Into the Parameters of a Language Model? ”, Roberts et al 2020

How Much Knowledge Can You Pack Into the Parameters of a Language Model?⁠

“Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft ”, Rosset 2020

Turing-NLG: A 17-billion-parameter language model by Microsoft⁠

“Quasi-Equivalence of Width and Depth of Neural Networks ”, Fan et al 2020

Quasi-Equivalence of Width and Depth of Neural Networks⁠

“Impact of ImageNet Model Selection on Domain Adaptation ”, Zhang & Davison 2020

Impact of ImageNet Model Selection on Domain Adaptation⁠

“Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks ”, Hasson et al 2020

Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks⁠

“Towards a Conversational Agent That Can Chat About…Anything ”, Adiwardana & Luong 2020

Towards a Conversational Agent that Can Chat About…Anything⁠

“Towards a Human-Like Open-Domain Chatbot ”, Adiwardana et al 2020

Towards a Human-like Open-Domain Chatbot⁠

“Scaling Laws for Neural Language Models ”, Kaplan et al 2020

Scaling Laws for Neural Language Models⁠

“Scaling Laws for Neural Language Models: Figure 15: Far beyond the Model Sizes We Study Empirically, We Find a Contradiction between Our Equations § Pg17 ”, Kaplan et al 2020 (page 17)

Scaling Laws for Neural Language Models: Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations § pg17: /doc/www/arxiv.org/20d126b9c3baf640f8d1d5dff3e253faac2e8242.pdf#page=17&org=openai

“The Importance of Deconstruction ”, Weinberger 2020

The Importance of Deconstruction⁠

“Big Transfer (BiT): General Visual Representation Learning ”, Kolesnikov et al 2019

Big Transfer (BiT): General Visual Representation Learning⁠

“More Data Can Hurt for Linear Regression: Sample-Wise Double Descent ”, Nakkiran 2019

⁠More Data Can Hurt for Linear Regression: Sample-wise Double Descent⁠

“12-In-1: Multi-Task Vision and Language Representation Learning ”, Lu et al 2019

12-in-1: Multi-Task Vision and Language Representation Learning⁠

“Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time ”, Nakkiran et al 2019

Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time⁠

“Deep Double Descent: Where Bigger Models and More Data Hurt ”, Nakkiran et al 2019

Deep Double Descent: Where Bigger Models and More Data Hurt⁠

“What’s Hidden in a Randomly Weighted Neural Network? ”, Ramanujan et al 2019

What’s Hidden in a Randomly Weighted Neural Network?⁠

“Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks ”, Morcos & Tian 2019

Understanding the generalization of ‘lottery tickets’ in neural networks⁠

“The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design ”, Dean 2019

The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design⁠

“Momentum Contrast for Unsupervised Visual Representation Learning ”, He et al 2019

Momentum Contrast for Unsupervised Visual Representation Learning⁠

“SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning ”, Wang et al 2019

SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning⁠

“Self-Training With Noisy Student Improves ImageNet Classification ”, Xie et al 2019

Self-training with Noisy Student improves ImageNet classification⁠

“CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB ”, Schwenk et al 2019

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB⁠

“CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs ”, El-Kishky et al 2019

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs⁠

“XLM-R: State-Of-The-Art Cross-Lingual Understanding through Self-Supervision ”, FAIR 2019

XLM-R: State-of-the-art cross-lingual understanding through self-supervision⁠

“High Fidelity Video Prediction With Large Stochastic Recurrent Neural Networks ”, Villegas et al 2019

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks⁠

“Unsupervised Cross-Lingual Representation Learning at Scale ”, Conneau et al 2019

Unsupervised Cross-lingual Representation Learning at Scale⁠

“T5: Exploring the Limits of Transfer Learning With a Unified Text-To-Text Transformer ”, Raffel et al 2019

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer⁠

“ZeRO: Memory Optimizations Toward Training Trillion Parameter Models ”, Rajbhandari et al 2019

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models⁠

“Environmental Drivers of Systematicity and Generalization in a Situated Agent ”, Hill et al 2019

Environmental drivers of systematicity and generalization in a situated agent⁠

“A Constructive Prediction of the Generalization Error Across Scales ”, Rosenfeld et al 2019

A Constructive Prediction of the Generalization Error Across Scales⁠

“Large-Scale Pretraining for Neural Machine Translation With Tens of Billions of Sentence Pairs ”, Meng et al 2019

Large-scale Pretraining for Neural Machine Translation with Tens of Billions of Sentence Pairs⁠

“UNITER: UNiversal Image-TExt Representation Learning ”, Chen et al 2019

UNITER: UNiversal Image-TExt Representation Learning⁠

“Exascale Deep Learning for Scientific Inverse Problems ”, Laanait et al 2019

Exascale Deep Learning for Scientific Inverse Problems⁠

“Simple, Scalable Adaptation for Neural Machine Translation ”, Bapna et al 2019

Simple, Scalable Adaptation for Neural Machine Translation⁠

“CTRL: A Conditional Transformer Language Model For Controllable Generation ”, Keskar et al 2019

CTRL: A Conditional Transformer Language Model For Controllable Generation⁠

“Show Your Work: Improved Reporting of Experimental Results ”, Dodge et al 2019

Show Your Work: Improved Reporting of Experimental Results⁠

“MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism ”, ADLR 2019

MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism⁠

“RoBERTa: A Robustly Optimized BERT Pretraining Approach ”, Liu et al 2019

RoBERTa: A Robustly Optimized BERT Pretraining Approach⁠

“Robustness Properties of Facebook’s ResNeXt WSL Models ”, Orhan 2019

Robustness properties of Facebook’s ResNeXt WSL models⁠

“Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges ”, Arivazhagan et al 2019

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges⁠

“Large Scale Adversarial Representation Learning ”, Donahue & Simonyan 2019

Large Scale Adversarial Representation Learning⁠

“One Epoch Is All You Need ”, Komatsuzaki 2019

One Epoch Is All You Need⁠

“Does Learning Require Memorization? A Short Tale about a Long Tail ”, Feldman 2019

Does Learning Require Memorization? A Short Tale about a Long Tail⁠

“Intriguing Properties of Adversarial Training at Scale ”, Xie & Yuille 2019

Intriguing properties of adversarial training at scale⁠

“Scaling Autoregressive Video Models ”, Weissenborn et al 2019

Scaling Autoregressive Video Models⁠

“A Mathematical Theory of Semantic Development in Deep Neural Networks ”, Saxe et al 2019

A mathematical theory of semantic development in deep neural networks⁠

“Adversarially Robust Generalization Just Requires More Unlabeled Data ”, Zhai et al 2019

Adversarially Robust Generalization Just Requires More Unlabeled Data⁠

“ICML 2019 Notes ”, Abel 2019

ICML 2019 Notes⁠

“Are Labels Required for Improving Adversarial Robustness? ”, Uesato et al 2019

Are Labels Required for Improving Adversarial Robustness?⁠

“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks ”, Tan & Le 2019

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks⁠

“SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers ”, Fedorov et al 2019

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers⁠

“Asymptotic Learning Curves of Kernel Methods: Empirical Data versus Teacher-Student Paradigm ”, Spigler et al 2019

Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm⁠

“UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation ”, Dong et al 2019

UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation⁠

“Adversarial Examples Are Not Bugs, They Are Features ”, Ilyas et al 2019

Adversarial Examples Are Not Bugs, They Are Features⁠

“Billion-Scale Semi-Supervised Learning for Image Classification ”, Yalniz et al 2019

Billion-scale semi-supervised learning for image classification⁠

“VideoBERT: A Joint Model for Video and Language Representation Learning ”, Sun et al 2019

VideoBERT: A Joint Model for Video and Language Representation Learning⁠

“Benchmarking Neural Network Robustness to Common Corruptions and Perturbations ”, Hendrycks & Dietterich 2019

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations⁠

“Surprises in High-Dimensional Ridgeless Least Squares Interpolation ”, Hastie et al 2019

Surprises in High-Dimensional Ridgeless Least Squares Interpolation⁠

“The Bitter Lesson ”, Sutton 2019

The Bitter Lesson

“GPT-2 As Step Toward General Intelligence ”, Alexander 2019

GPT-2 As Step Toward General Intelligence⁠

“Deep Learning Hardware: Past, Present, & Future ”, LeCun 2019

⁠Deep Learning Hardware: Past, Present, & Future⁠ :

⁠/doc/ai/scaling/2019-02-18-lecun-isscc-talk-deeplearninghardwarepastpresentandfuture.pdf⁠

“Language Models Are Unsupervised Multitask Learners ”, Radford et al 2019

Language Models are Unsupervised Multitask Learners⁠

“Better Language Models and Their Implications ”, Radford et al 2019

Better Language Models and Their Implications⁠

“Do ImageNet Classifiers Generalize to ImageNet? ”, Recht et al 2019

Do ImageNet Classifiers Generalize to ImageNet?⁠

“Cross-Lingual Language Model Pretraining ”, Lample & Conneau 2019

Cross-lingual Language Model Pretraining⁠

“Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified ”, Mitchell 2019

Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified

“High Fidelity Video Prediction With Large Stochastic Recurrent Neural Networks: Videos ”, Villegas et al 2019

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks: Videos⁠

“Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off ”, Belkin et al 2018

Reconciling modern machine learning practice and the bias-variance trade-off⁠

“Nocaps: Novel Object Captioning at Scale ”, Agrawal et al 2018

nocaps: novel object captioning at scale⁠

“On Lazy Training in Differentiable Programming ”, Chizat et al 2018

On Lazy Training in Differentiable Programming⁠

“How AI Training Scales ”, McCandlish et al 2018

How AI Training Scales⁠

“Is Science Slowing Down? ”, Alexander 2018

Is Science Slowing Down?⁠

“Large Scale GAN Training for High Fidelity Natural Image Synthesis ”, Brock et al 2018

Large Scale GAN Training for High Fidelity Natural Image Synthesis⁠

“BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis § 5.2 Additional Evaluation On JFT-300M ”, Brock et al 2018 (page 8 org deepmind)

BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis § 5.2 Additional Evaluation On JFT-300M⁠

“Measurement Invariance Explains the Universal Law of Generalization for Psychological Perception ”, Frank 2018

Measurement invariance explains the universal law of generalization for psychological perception⁠

“CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images ”, Guo et al 2018

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images⁠

“Large-Scale Visual Speech Recognition ”, Shillingford et al 2018

Large-Scale Visual Speech Recognition⁠

“Troubling Trends in Machine Learning Scholarship ”, Lipton & Steinhardt 2018

Troubling Trends in Machine Learning Scholarship⁠

“Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations ”, Hendrycks & Dietterich 2018

Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations⁠

“Neural Scene Representation and Rendering ”, Eslami et al 2018

Neural scene representation and rendering⁠

“GPT-1: Improving Language Understanding With Unsupervised Learning ”, OpenAI 2018

GPT-1: Improving Language Understanding with Unsupervised Learning⁠

“GPT-1: Improving Language Understanding by Generative Pre-Training ”, Radford et al 2018

GPT-1: Improving Language Understanding by Generative Pre-Training⁠

“GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications ”, Radford et al 2018 (page 5)

GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications⁠

“Do CIFAR-10 Classifiers Generalize to CIFAR-10? ”, Recht et al 2018

Do CIFAR-10 Classifiers Generalize to CIFAR-10?⁠

“Deep Learning Generalizes Because the Parameter-Function Map Is Biased towards Simple Functions ”, Valle-Pérez et al 2018

Deep learning generalizes because the parameter-function map is biased towards simple functions⁠

“Google DeepMind Founder and Leader in Artificial Intelligence Returns to Hamilton ”, Tantau 2018

Google DeepMind founder and leader in artificial intelligence returns to Hamilton⁠

“Exploring the Limits of Weakly Supervised Pretraining ”, Mahajan et al 2018

Exploring the Limits of Weakly Supervised Pretraining⁠

“One Big Net For Everything ”, Schmidhuber 2018

One Big Net For Everything⁠

“Sensitivity and Generalization in Neural Networks: an Empirical Study ”, Novak et al 2018

Sensitivity and Generalization in Neural Networks: an Empirical Study⁠

“The Description Length of Deep Learning Models ”, Blier & Ollivier 2018

⁠The Description Length of Deep Learning Models⁠

“Learning and Memorization ”, Chatterjee 2018

Learning and Memorization⁠

“ULMFiT: Universal Language Model Fine-Tuning for Text Classification ”, Howard & Ruder 2018

ULMFiT: Universal Language Model Fine-tuning for Text Classification⁠

“GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism § Pg4 ”, Huang 2018 (page 4 org google)

⁠GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism § pg4⁠ :

⁠/doc/www/arxiv.org/a8efcc8272af6f434119f87a00c2edaf84241597.pdf#page=4&org=google⁠

“Deep Image Reconstruction from Human Brain Activity ”, Shen et al 2017

Deep image reconstruction from human brain activity⁠

“Deep Learning Scaling Is Predictable, Empirically ”, Hestness et al 2017

Deep Learning Scaling is Predictable, Empirically⁠

“Are GANs Created Equal? A Large-Scale Study ”, Lucic et al 2017

Are GANs Created Equal? A Large-Scale Study⁠

“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN ”, Gao et al 2017

Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN⁠

“Rethinking Generalization Requires Revisiting Old Ideas: Statistical Mechanics Approaches and Complex Learning Behavior ”, Martin & Mahoney 2017

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior⁠

“There’s No Fire Alarm for Artificial General Intelligence ”, Yudkowsky 2017

There’s No Fire Alarm for Artificial General Intelligence⁠

“The Devil Is in the Tails: Fine-Grained Classification in the Wild ”, Horn & Perona 2017

The Devil is in the Tails: Fine-grained Classification in the Wild⁠

“WebVision Database: Visual Learning and Understanding from Web Data ”, Li et al 2017

WebVision Database: Visual Learning and Understanding from Web Data⁠

“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era ”, Sun et al 2017

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era⁠

“Towards Deep Learning Models Resistant to Adversarial Attacks ”, Madry et al 2017

Towards Deep Learning Models Resistant to Adversarial Attacks⁠

“Gradient Diversity: a Key Ingredient for Scalable Distributed Learning ”, Yin et al 2017

Gradient Diversity: a Key Ingredient for Scalable Distributed Learning⁠

“Learning to Learn from Noisy Web Videos ”, Yeung et al 2017

Learning to Learn from Noisy Web Videos⁠

“Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour ”, Goyal et al 2017

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour⁠

“A Simple Neural Network Module for Relational Reasoning ”, Santoro et al 2017

A simple neural network module for relational reasoning⁠

“Deep Learning Is Robust to Massive Label Noise ”, Rolnick et al 2017

Deep Learning is Robust to Massive Label Noise⁠

“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset ”, Carreira & Zisserman 2017

Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset⁠

“WebVision Challenge: Visual Learning and Understanding With Web Data ”, Li et al 2017

WebVision Challenge: Visual Learning and Understanding With Web Data⁠

“Geometry of Optimization and Implicit Regularization in Deep Learning ”, Neyshabur et al 2017

Geometry of Optimization and Implicit Regularization in Deep Learning⁠

“On the Impossibility of Supersized Machines ”, Garfinkel et al 2017

On the Impossibility of Supersized Machines⁠

“Parallel Multiscale Autoregressive Density Estimation ”, Reed et al 2017

Parallel Multiscale Autoregressive Density Estimation⁠

“Universal Representations: The Missing Link between Faces, Text, Planktons, and Cat Breeds ”, Bilen & Vedaldi 2017

Universal representations: The missing link between faces, text, planktons, and cat breeds⁠

“Estimation of Gap Between Current Language Models and Human Performance ”, Shen et al 2017

Estimation of Gap Between Current Language Models and Human Performance⁠

“Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles ”, Lakshminarayanan et al 2016

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles⁠

“Understanding Deep Learning Requires Rethinking Generalization ”, Zhang et al 2016

Understanding deep learning requires rethinking generalization⁠

“Why Does Deep and Cheap Learning Work so Well? ”, Lin et al 2016

Why does deep and cheap learning work so well?⁠

“The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context ”, Paperno et al 2016

The LAMBADA dataset: Word prediction requiring a broad discourse context⁠

“Residual Networks Behave Like Ensembles of Relatively Shallow Networks ”, Veit et al 2016

Residual Networks Behave Like Ensembles of Relatively Shallow Networks⁠

“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional? ”, Urban et al 2016

Do Deep Convolutional Nets Really Need to be Deep and Convolutional?⁠

“PlaNet—Photo Geolocation With Convolutional Neural Networks ”, Weyand et al 2016

PlaNet—Photo Geolocation with Convolutional Neural Networks⁠

“Exploring the Limits of Language Modeling ”, Jozefowicz et al 2016

Exploring the Limits of Language Modeling⁠

“The Singularity: A Philosophical Analysis ”, Chalmers 2016

The Singularity: A Philosophical Analysis⁠

“Microsoft Researchers Win ImageNet Computer Vision Challenge ”, Linn 2015

Microsoft researchers win ImageNet computer vision challenge⁠

“The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition ”, Krause et al 2015

The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition⁠

“Net2Net: Accelerating Learning via Knowledge Transfer ”, Chen et al 2015

Net2Net: Accelerating Learning via Knowledge Transfer⁠

“Generative Concatenative Nets Jointly Learn to Write and Classify Reviews ”, Lipton et al 2015

Generative Concatenative Nets Jointly Learn to Write and Classify Reviews⁠

“Learning Visual Features from Large Weakly Supervised Data ”, Joulin et al 2015

Learning Visual Features from Large Weakly Supervised Data⁠

“LSUN: Construction of a Large-Scale Image Dataset Using Deep Learning With Humans in the Loop ”, Yu et al 2015

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop⁠

“Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification ”, Xiao et al 2015

Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification⁠

“The Unreasonable Effectiveness of Recurrent Neural Networks ”, Karpathy 2015

The Unreasonable Effectiveness of Recurrent Neural Networks

“LSTM: A Search Space Odyssey ”, Greff et al 2015

LSTM: A Search Space Odyssey⁠

“YFCC100M: The New Data in Multimedia Research ”, Thomee et al 2015

YFCC100M: The New Data in Multimedia Research⁠

“Machine Intelligence, Part 1 ”, Altman 2015

Machine intelligence, part 1⁠

“Evolution of the Human Brain: From Matter to Mind ”, Hofman 2015

Evolution of the Human Brain: From Matter to Mind⁠

“In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning ”, Neyshabur et al 2014

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning⁠

“Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article] ”, Cambria & White 2014

Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]⁠

“Neural Networks, Manifolds, and Topology ”, Olah 2014

Neural Networks, Manifolds, and Topology

“Computing’s Energy Problem (And What We Can Do about It) ”, Horowitz 2014b

Computing’s Energy Problem (and what we can do about it)⁠

“On the Number of Linear Regions of Deep Neural Networks ”, Montúfar et al 2014

On the Number of Linear Regions of Deep Neural Networks⁠

“N-Gram Counts and Language Models from the Common Crawl ”, Buck et al 2014

N-gram Counts and Language Models from the Common Crawl⁠

“Evolution of the Human Brain: When Bigger Is Better ”, Hofman 2014

Evolution of the human brain: when bigger is better⁠

“One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling ”, Chelba et al 2013

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling⁠

“Algorithmic Progress in Six Domains ”, Grace 2013

Algorithmic Progress in Six Domains⁠

“Large–Scale Machine Learning Revisited [Slides] ”, Bottou 2013

⁠Large–Scale Machine Learning Revisited [slides]⁠ :

⁠/doc/ai/scaling/2013-bottou.pdf⁠

“20 Years of Bitext ”, Brown et al 2013

⁠20 Years of Bitext⁠

“Intelligence Explosion Microeconomics ”, Yudkowsky 2013

Intelligence Explosion Microeconomics⁠

“Scalable Modified Kneser-Ney Language Model Estimation ”, Heafield et al 2013

Scalable Modified Kneser-Ney Language Model Estimation⁠

“Large Scale Language Modeling in Automatic Speech Recognition ”, Chelba et al 2012

Large Scale Language Modeling in Automatic Speech Recognition⁠

“The Remarkable, yet Not Extraordinary, Human Brain As a Scaled-Up Primate Brain and Its Associated Cost ”, Herculano-Houzel 2012

The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost⁠

“Advantages of Artificial Intelligences, Uploads, and Digital Minds ”, Sotala 2012

Advantages of Artificial Intelligences, Uploads, and Digital Minds⁠

“Recurrent Neural Network Based Language Model ”, Mikolov et al 2010

Recurrent Neural Network Based Language Model⁠

“Understanding Sources of Inefficiency in General-Purpose Chips ”, Hameed et al 2010

Understanding sources of inefficiency in general-purpose chips⁠

“The Teenies ”, Legg 2009

The Teenies⁠

“Tick, Tock, Tick, Tock… BING ”, Legg 2009

Tick, tock, tick, tock… BING⁠

“Halloween Nightmare Scenario, Early 2020’s ”, Wood 2009

Halloween nightmare scenario, early 2020’s

“Matrix Factorization Techniques for Recommender Systems ”, Koren et al 2009

Matrix factorization techniques for recommender systems⁠

“The Unreasonable Effectiveness of Data ”, Halevy et al 2009

The Unreasonable Effectiveness of Data⁠

“Economics Of The Singularity: Stuffed into Skyscrapers by the Billion, Brainy Bugbots Will Be the Knowledge Workers of the Future ”, Hanson 2008

Economics Of The Singularity: Stuffed into skyscrapers by the billion, brainy bugbots will be the knowledge workers of the future⁠

“Large Language Models in Machine Translation ”, Brants et al 2007

Large Language Models in Machine Translation⁠

“The Tradeoffs of Large-Scale Learning ”, Bottou & Bousquet 2007

The Tradeoffs of Large-Scale Learning⁠

“Cellular Scaling Rules for Primate Brains ”, Herculano-Houzel et al 2007

Cellular scaling rules for primate brains⁠

“Robot Predictions Evolution ”, Moravec 2004

Robot Predictions Evolution⁠

“Tree Induction versus Logistic Regression: A Learning-Curve Analysis ”, Perlich et al 2003

Tree Induction versus Logistic Regression: A Learning-Curve Analysis⁠

“Analytic and Algorithmic Solution of Random Satisfiability Problems ”, Mezard et al 2002

Analytic and Algorithmic Solution of Random Satisfiability Problems⁠

“A Bit of Progress in Language Modeling ”, Goodman 2001

A Bit of Progress in Language Modeling⁠

“Scaling to Very Very Large Corpora for Natural Language Disambiguation ”, Banko & Brill 2001

Scaling to Very Very Large Corpora for Natural Language Disambiguation⁠

“On Discriminative versus Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes ”, Ng & Jordan 2001

On Discriminative versus Generative Classifiers: A comparison of logistic regression and naive Bayes⁠

“A Survey of Methods for Scaling Up Inductive Algorithms ”, Provost & Kolluri 1999

A Survey of Methods for Scaling Up Inductive Algorithms⁠

“On The Effect of Data Set Size on Bias And Variance in Classification Learning ”, Brain & Webb 1999

On The Effect of Data Set Size on Bias And Variance in Classification Learning⁠

“The Anatomy of a Large-Scale Hypertextual Web Search Engine ”, Brin & Page 1998

The Anatomy of a Large-Scale Hypertextual Web Search Engine⁠

“The Effects of Training Set Size on Decision Tree Complexity ”, Oates & Jensen 1997

The Effects of Training Set Size on Decision Tree Complexity⁠

“Rigorous Learning Curve Bounds from Statistical Mechanics ”, Haussler et al 1996

Rigorous Learning Curve Bounds from Statistical Mechanics⁠

“Scaling up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid ”, Kohavi 1996

Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid⁠

“Reflections After Refereeing Papers for NIPS ”, Breiman 1995

Reflections After Refereeing Papers for NIPS⁠

“Biological Limits to Information Processing in the Human Brain ”, Cochrane et al 1995

Biological limits to information processing in the human brain⁠

“Handwritten Character Classification Using Nearest Neighbor in Large Databases ”, Smith et al 1994

⁠Handwritten Character Classification Using Nearest Neighbor in Large Databases⁠

“Building a Large Annotated Corpus of English: The Penn Treebank ”, Marcus et al 1993

Building a Large Annotated Corpus of English: The Penn Treebank⁠

“Statistical Theory of Learning Curves under Entropic Loss Criterion ”, Amari & Murata 1993

Statistical Theory of Learning Curves under Entropic Loss Criterion⁠

“Learning Curves: Asymptotic Values and Rate of Convergence ”, Cortes et al 1993

Learning Curves: Asymptotic Values and Rate of Convergence⁠

“Exhaustive Learning ”, Schwartz et al 1990

Exhaustive Learning⁠

“Don’t Worry—It Can’t Happen ”, Harrington 1940

Don’t Worry—It Can’t Happen⁠

“Eric Michaud on Neural Quantum Interpretability ”

⁠Eric Michaud on Neural Quantum Interpretability :

⁠/doc/www/theinsideview.ai/3d4ef31011b49fa3442733759bb92f0b3bb8b6c5.html#the-quantization-model-of-neural-scaling⁠

“Billion-Scale Semi-Supervised Learning for State-Of-The-Art Image and Video Classification ”

⁠Billion-scale semi-supervised learning for state-of-the-art image and video classification⁠ :

⁠/doc/www/ai.meta.com/95a63d763a16bfbb572a5262b01b97751a797dc0.html⁠

“No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides. ”

No physics? No problem. AI weather forecasting is already making huge strides.⁠

“Report Describes Apple’s ‘Organizational Dysfunction’ and ‘Lack of Ambition’ in AI ”

Report describes Apple’s ‘organizational dysfunction’ and ‘lack of ambition’ in AI⁠

“StyleGAN-2 512px Trained on Danbooru2019 ”

⁠StyleGAN-2 512px trained on Danbooru2019

“Blake Bordelon ”, Bordelon 2025

⁠Blake Bordelon :

⁠/doc/www/blakebordelon.github.io/331c92ad2086bba413920a6a3b4e40d57e52e33a.html⁠

“Inside the CodeBot: A Gentle Introduction to How LLMs Understand Nullability ”

⁠Inside the CodeBot: A Gentle Introduction to How LLMs Understand Nullability

“Komodo 8: the Smartphone vs Desktop Challenge ”

Komodo 8: the smartphone vs desktop challenge

“Trading Off Compute in Training and Inference § Pruning ”

Trading Off Compute in Training and Inference § Pruning

“Eric Tang ”

⁠Eric Tang :

⁠/doc/www/erictang000.github.io/53e33b4414445bd98b072e0cf38a6904b118594f.html⁠

“How Can We Make Robotics More like Generative Modeling? ”

⁠How Can We Make Robotics More like Generative Modeling? :

⁠/doc/www/evjang.com/a3524d3155b3ef44b83dfc99082aeb52e87a9bdc.html⁠

“Compression Represents Intelligence Linearly [Code] ”

⁠⁠Compression Represents Intelligence Linearly [code]⁠ :

⁠/doc/www/github.com/f8d2badbe21687a400993008fc11ca12d5fc6552.html⁠

“Inverse-Scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling ”

inverse-scaling/prize: A prize for finding tasks that cause large language models to show inverse scaling⁠

“Scaling up StyleGAN-2 ”

Scaling up StyleGAN-2⁠

“Finally, a Replacement for BERT: Introducing ModernBERT ”

⁠Finally, a Replacement for BERT: Introducing ModernBERT⁠

“Llm-Compression Data ”

⁠llm-compression data⁠

“Semi Supervised Learning ”

⁠Semi Supervised Learning :

⁠/doc/www/lilianweng.github.io/2a16890d3828767743c0e7a177a4036828957ff4.html⁠

“Homepage of Paul F. Christiano ”, Christiano 2025

Homepage of Paul F. Christiano⁠

“Statistical Modeling: The Two Cultures”, Breiman 2001

Statistical Modeling: The Two Cultures⁠

“Jared Kaplan ”

Jared Kaplan

“Safe Superintelligence Inc. ”

Safe Superintelligence Inc.

“OpenAI Disbands Its Robotics Research Team ”

⁠OpenAI disbands its robotics research team :

⁠https://venturebeat.com/business/openai-disbands-its-robotics-research-team/

“Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks ”

⁠Google workloads for consumer devices: mitigating data movement bottlenecks⁠ :

⁠https://web.archive.org/web/20190131021221/https://blog.acolyer.org/2018/04/18/google-workloads-for-consumer-devices-mitigating-data-movement-bottlenecks/⁠

“The Uneasy Relationship between Deep Learning and (Classical) Statistics ”

⁠The uneasy relationship between deep learning and (classical) statistics :

⁠/doc/www/windowsontheory.org/ed98775344f67ec385a16cd234c9c7888602e97f.html⁠

“Parameter Counts in Machine Learning ”

⁠Parameter counts in Machine Learning⁠ :

⁠https://www.alignmentforum.org/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning⁠

“Can LLMs Learn from a Single Example? ”

⁠Can LLMs learn from a single example?⁠ :

⁠/doc/www/www.fast.ai/5c73cf7b7ebdb67c15013107c0ba82613c5661ef.html⁠

“Deciphering China’s AI Dream ”

⁠Deciphering China’s AI Dream

“Jason Wei ”

“Appendix: More Is Different In Other Domains ”

⁠Appendix: More Is Different In Other Domains⁠ :

⁠https://www.lesswrong.com/posts/3daPPjWbjYNP6nbre/appendix-more-is-different-in-other-domains⁠

“Understanding ‘Deep Double Descent’ ”

⁠Understanding ‘Deep Double Descent’⁠ :

⁠https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent⁠

“How Much Compute Was Used to Train DeepMind’s Generally Capable Agents? ”

⁠How much compute was used to train DeepMind’s generally capable agents?⁠ :

⁠https://www.lesswrong.com/posts/KaPaTdpLggdMqzdyo/how-much-compute-was-used-to-train-deepmind-s-generally⁠

“Why Neural Networks Generalise, and Why They Are (Kind Of) Bayesian ”

⁠Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian⁠ :

⁠https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of⁠

“What’s the Backward-Forward FLOP Ratio for Neural Networks? ”

⁠What’s the backward-forward FLOP ratio for Neural Networks?⁠ :

⁠https://www.lesswrong.com/posts/fnjKpBoWJXcSDwhZk/what-s-the-backward-forward-flop-ratio-for-nns⁠

“Optimality Is the Tiger, and Agents Are Its Teeth ”

⁠Optimality is the tiger, and agents are its teeth⁠

“What Next? A Dozen Information-Technology Research Goals: 3. Turing’s Vision of Machine Intelligence ”

⁠What Next? A Dozen Information-Technology Research Goals: 3. Turing’s vision of machine intelligence⁠ :

⁠/doc/www/www.microsoft.com/5620cc2a603069db2406c32006715aa6535d051b.pdf#page=11⁠

“Was Linguistic AI Created by Accident? ”

Was Linguistic AI Created by Accident?⁠

“Ilya Sutskever: Deep Learning | AI Podcast #94 With Lex Fridman ”

⁠Ilya Sutskever: Deep Learning | AI Podcast #94 with Lex Fridman⁠

“A Universal Law of Robustness ”

⁠A Universal Law of Robustness⁠ :

⁠https://www.youtube.com/watch?v=OzGguadEHOU⁠

“Greg Brockman: OpenAI and AGI ”, Brockman 2025

⁠Greg Brockman: OpenAI and AGI⁠ :

⁠https://www.youtube.com/watch?v=bIrEM2FbOLU&t=2740⁠

“Season 1 Ep. 22 OpenAI’s Ilya Sutskever: The Man Who Made AI Work ”

⁠Season 1 Ep. 22 OpenAI’s Ilya Sutskever: The man who made AI work⁠ :

⁠https://www.youtube.com/watch?v=fCoavgGZ64Y&t=2796s⁠

“A Law of Robustness and the Importance of Overparameterization in Deep Learning ”

⁠A law of robustness and the importance of overparameterization in deep learning⁠ :

⁠https://www.youtube.com/watch?v=ujMvnQpP528⁠

“WELM ”

“Yuxi on the Wired ”, Liu 2025

⁠⁠Yuxi on the Wired :

⁠https://yuxi-liu-wired.github.io/

Sort By Magic

Annotations sorted by machine learning into ⁠inferred 'tags'⁠. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
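As a rough illustration of the ordering described above, the sketch below chains items by repeatedly jumping to the nearest not-yet-visited embedding, starting from the newest one. It is not the site's actual implementation: the function names, the choice of cosine distance, the greedy chaining, and the toy 2-dimensional vectors are all assumptions made for the example, and the clustering and auto-labeling steps are not covered.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sort_by_similarity(embeddings, start=0):
    """Greedy nearest-neighbor chain (hypothetical helper).

    Begin at index `start` (assumed to be the newest annotation), then
    repeatedly append the unvisited item closest to the last one appended.
    Ties are broken by the lower index so the result is deterministic.
    """
    remaining = set(range(len(embeddings)))
    remaining.remove(start)
    order = [start]
    while remaining:
        last = embeddings[order[-1]]
        nearest = min(
            remaining,
            key=lambda i: (cosine_distance(last, embeddings[i]), i),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Toy embeddings (made up): items 0 and 2 point one way, 1 and 3 another.
toy_embeddings = [
    [1.0, 0.0],  # 0: newest annotation
    [0.0, 1.0],  # 1
    [0.9, 0.1],  # 2
    [0.1, 0.9],  # 3
]
order = sort_by_similarity(toy_embeddings)
print(order)
```

On the toy data the chain visits the item most similar to the newest one first and only then drifts toward the other group, which is the "progression of topics" effect the description refers to.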

`neural-scaling`

`video-prediction`

`speech-synthesis, representation-learning, text-generation, robust-neural, uncertainty-estimation, text-to-image-generation`

`visual-embedding`

`robustness-scaling`

`agi-ethics`

Wikipedia (8)

Miscellaneous

Bibliography

https://arxiv.org/abs/2503.17074: “Emuru: Zero-Shot Styled Text Image Generation, but Make It Autoregressive”, Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli …, Alessio Tonioni, Rita Cucchiara
https://arxiv.org/abs/2502.09992: “LLaDA: Large Language Diffusion Models”, Shen Nie, Fengqi Zhu, Zebin You …, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
https://arxiv.org/abs/2501.09038#deepmind: “Do Generative Video Models Learn Physical Principles from Watching Videos?”, Saman Motamed, Laura Culp, Kevin Swersky …, Priyank Jaini, Robert Geirhos
https://arxiv.org/abs/2412.04332: “Liquid: Language Models Are Scalable and Unified Multi-Modal Generators”, Junfeng Wu, Yi Jiang, Chuofan Ma …, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
https://arxiv.org/abs/2410.18514: “Scaling up Masked Diffusion Models on Text”, Shen Nie, Fengqi Zhu, Chao Du …, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li
https://research.google/blog/taking-medical-imaging-embeddings-3d/: “CT Foundation: Taking Medical Imaging Embeddings 3D”, Atilla Kiraly, Madeleine Traverse
https://arxiv.org/abs/2407.04108: “Future Events As Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs”, Sara Price, Arjun Panickssery, Samuel R. Bowman, Asa Cooper Stickland
https://arxiv.org/abs/2406.13121#google: “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Jinhyuk Lee, Anthony Chen, Zhuyun Dai …, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Siyan Zhao, Tung Nguyen, Aditya Grover
https://www.biorxiv.org/content/10.1101/2024.06.06.597716.full: “Training Compute-Optimal Protein Language Models”, Xingyi Cheng, Bo Chen, Pan Li …, Jing Gong, Jie Tang, Le Song
https://arxiv.org/abs/2405.14930: “AstroPT: Scaling Large Observation Models for Astronomy”, Michael J. Smith, Ryan J. Roberts, Eirini Angeloudi, Marc Huertas-Company
https://arxiv.org/abs/2405.00332#scale: “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Hugh Zhang, Jeff Da, Dean Lee …, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue
https://lab42.global/community-interview-jack-cole/: “Test-Time Augmentation to Solve ARC”, Jack Cole
https://arxiv.org/abs/2404.09937: “Compression Represents Intelligence Linearly”, Yuzhen Huang, Jinghan Zhang, Zifei Shan, Junxian He
https://arxiv.org/abs/2404.06664: “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”, Yu Ying Chiu, Liwei Jiang, Maria Antoniak …, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
https://arxiv.org/abs/2404.02905#bytedance: “Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction”, Keyu Tian, Yi Jiang, Zehuan Yuan …, Bingyue Peng, Liwei Wang
https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models”, Jerry Wei, Chengrun Yang, Xinying Song …, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
https://arxiv.org/abs/2403.17844: “Mechanistic Design and Scaling of Hybrid Architectures”, Michael Poli, Armin W. Thomas, Eric Nguyen …, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli
https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/: “8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Steven Levy
https://inflection.ai/inflection-2-5: “Inflection-2.5: Meet the World’s Best Personal AI”, Inflection
https://arxiv.org/abs/2402.17152#facebook: “Actions Speak Louder Than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU)”, Jiaqi Zhai, Lucy Liao, Xing Liu …, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, Yinghai Lu, Yu Shi
https://arxiv.org/abs/2402.16671: “StructLM: Towards Building Generalist Models for Structured Knowledge Grounding”, Alex Zhuang, Ge Zhang, Tianyu Zheng …, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W. Huang, Jie Fu, Xiang Yue, Wenhu Chen
https://arxiv.org/abs/2312.15770#alibaba: “TF-T2V: A Recipe for Scaling up Text-To-Video Generation With Text-Free Videos”, Xiang Wang, Shiwei Zhang, Hangjie Yuan …, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang
https://arxiv.org/abs/2312.04927: “Zoology: Measuring and Improving Recall in Efficient Language Models”, Simran Arora, Sabri Eyuboglu, Aman Timalsina …, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré
https://arxiv.org/abs/2312.03876: “Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”, Tung Nguyen, Rohan Shah, Hritik Bansal …, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Aditya Grover
https://arxiv.org/abs/2312.00752: “Mamba: Linear-Time Sequence Modeling With Selective State Spaces”, Albert Gu, Tri Dao
https://arxiv.org/abs/2311.15599#tencent: “UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition”, Xiaohan Ding, Yiyuan Zhang, Yixiao Ge …, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
https://arxiv.org/abs/2311.04145#alibaba: “I2VGen-XL: High-Quality Image-To-Video Synthesis via Cascaded Diffusion Models”, Shiwei Zhang, Jiayu Wang, Yingya Zhang …, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
https://arxiv.org/abs/2310.16764#deepmind: “ConvNets Match Vision Transformers at Scale”, Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De
https://arxiv.org/abs/2310.09199#google: “PaLI-3 Vision Language Models: Smaller, Faster, Stronger”, Xi Chen, Xiao Wang, Lucas Beyer …, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
https://arxiv.org/abs/2310.06213: “GeoLLM: Extracting Geospatial Knowledge from Large Language Models”, Rohin Manvi, Samar Khanna, Gengchen Mai …, Marshall Burke, David Lobell, Stefano Ermon
https://arxiv.org/abs/2310.06694: “Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning”, Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
https://arxiv.org/abs/2310.03214#google: “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”, Tu Vu, Mohit Iyyer, Xuezhi Wang …, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc V. Le, Thang Luong
https://arxiv.org/abs/2310.02980: “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”, Ido Amos, Jonathan Berant, Ankit Gupta
https://arxiv.org/abs/2309.00667: “Taken out of Context: On Measuring Situational Awareness in LLMs”, Lukas Berglund, Asa Cooper Stickland, Mikita Balesni …, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
https://arxiv.org/abs/2308.11596#facebook: “SeamlessM4T: Massively Multilingual & Multimodal Machine Translation”, Seamless Communication, Loïc Barrault, Yu-An Chung …, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang
https://arxiv.org/abs/2308.03958#deepmind: “Simple Synthetic Data Reduces Sycophancy in Large Language Models”, Jerry Wei, Da Huang, Yifeng Lu …, Denny Zhou, Quoc V. Le
https://arxiv.org/abs/2307.05300#microsoft: “Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration”, Zhenhailong Wang, Shaoguang Mao, Wenshan Wu …, Tao Ge, Furu Wei, Heng Ji
https://openai.com/index/introducing-superalignment/: “Introducing Superalignment”, Jan Leike, Ilya Sutskever
https://www.youtube.com/watch?v=lfXxzAVtdpU&t=1763s: “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Douglas Hofstadter, Amy Jo Kim
https://arxiv.org/abs/2306.13575: “Scaling MLPs: A Tale of Inductive Bias”, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
https://arxiv.org/abs/2306.15448: “Understanding Social Reasoning in Language Models With Language Models”, Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman
https://arxiv.org/abs/2305.15717: “The False Promise of Imitating Proprietary LLMs”, Arnav Gudibande, Eric Wallace, Charlie Snell …, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song
https://arxiv.org/abs/2305.11863: “Scaling Laws for Language Encoding Models in fMRI”, Richard Antonello, Aditya Vaidya, Alexander G. Huth
https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html: “Google’s Newest AI Model Uses Nearly 5× More Text Data for Training Than Its Predecessor”, Jennifer Elias
https://arxiv.org/abs/2305.07759#microsoft: “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Ronen Eldan, Yuanzhi Li
https://arxiv.org/abs/2305.05665#facebook: “ImageBind: One Embedding Space To Bind Them All”, Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu …, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
https://www.ft.com/content/f4f73815-6fc2-4016-bd97-4bace459e95e: “Google’s DeepMind-Brain Merger: Tech Giant Regroups for AI Battle”, Madhumita Murgia
https://arxiv.org/abs/2304.07193#facebook: “DINOv2: Learning Robust Visual Features without Supervision”, Maxime Oquab, Timothée Darcet, Théo Moutakanni …, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
https://arxiv.org/abs/2303.15343#google: “Sigmoid Loss for Language Image Pre-Training”, Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer
https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan …, Wei Wang, Songfang Huang
https://jameswphillips.substack.com/p/securing-liberal-democratic-control: “Securing Liberal Democratic Control of AGI through UK Leadership”, James W. Phillips
https://arxiv.org/abs/2303.05511#adobe: “GigaGAN: Scaling up GANs for Text-To-Image Synthesis”, Minguk Kang, Jun-Yan Zhu, Richard Zhang …, Jaesik Park, Eli Shechtman, Sylvain Paris, Taesung Park
https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”, Mostafa Dehghani, Josip Djolonga, Basil Mustafa …, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335945: “Large Language Models As Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, John Nay
https://arxiv.org/abs/2301.09515#nvidia: “StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-To-Image Synthesis”, Axel Sauer, Tero Karras, Samuli Laine …, Andreas Geiger, Timo Aila
https://arxiv.org/abs/2301.07088#bytedance: “MUG: Vision Learners Meet Web Image-Text Pairs”, Bingchen Zhao, Quan Cui, Hao Wu …, Osamu Yoshie, Cheng Yang
https://arxiv.org/abs/2301.04408: “GPT-3 As Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities”, Jillian Bommarito, Michael Bommarito, Daniel Martin Katz, Jessica Katz
https://arxiv.org/abs/2301.03728#facebook: “Scaling Laws for Generative Mixed-Modal Language Models”, Armen Aghajanyan, Lili Yu, Alexis Conneau …, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer
https://arxiv.org/abs/2301.02111#microsoft: “VALL-E: Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers”, Chengyi Wang, Sanyuan Chen, Yu Wu …, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
https://arxiv.org/abs/2212.14402: “GPT-3 Takes the Bar Exam”, Michael Bommarito II, Daniel Martin Katz
https://arxiv.org/abs/2212.09741: “One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)”, Hongjin Su, Weijia Shi, Jungo Kasai …, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah Smith, Luke Zettlemoyer, Tao Yu
https://arxiv.org/abs/2212.07143: “Reproducible Scaling Laws for Contrastive Language-Image Learning”, Mehdi Cherti, Romain Beaumont, Ross Wightman …, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev
https://arxiv.org/abs/2212.04979#google: “VideoCoCa: Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Shen Yan, Tao Zhu, Zirui Wang …, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei …, David Crandall, Mohit Bansal, Gedas Bertasius
https://arxiv.org/abs/2212.04356#openai: “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision”, Alec Radford, Jong Wook Kim, Tao Xu …, Greg Brockman, Christine McLeavey, Ilya Sutskever
https://arxiv.org/abs/2211.09085#facebook: “Galactica: A Large Language Model for Science”, Ross Taylor, Marcin Kardas, Guillem Cucurull …, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic
https://arxiv.org/abs/2211.08411: “Large Language Models Struggle to Learn Long-Tail Knowledge”, Nikhil Kandpal, Haikang Deng, Adam Roberts …, Eric Wallace, Colin Raffel
https://arxiv.org/abs/2211.07636#baai: “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Yuxin Fang, Wen Wang, Binhui Xie …, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
https://arxiv.org/abs/2211.00241: “Adversarial Policies Beat Superhuman Go AIs”, Tony T. Wang, Adam Gleave, Tom Tseng …, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
https://www.youtube.com/watch?v=Q-TJFyUoenc&t=2444s: “Increments Podcast: #45—4 Central Fallacies of AI Research (With Melanie Mitchell)”, Melanie Mitchell, Benny Chugg
https://arxiv.org/abs/2210.16859: “A Solvable Model of Neural Scaling Laws”, Alexander Maloney, Daniel A. Roberts, James Sully
https://arxiv.org/abs/2210.13673#nvidia: “Evaluating Parameter Efficient Learning for Generation”, Peng Xu, Mostofa Patwary, Shrimai Prabhumoye …, Virginia Adams, Ryan J. Prenger, Wei Ping, Nayeon Lee, Mohammad Shoeybi, Bryan Catanzaro
https://arxiv.org/abs/2210.11416#google: “FLAN: Scaling Instruction-Finetuned Language Models ”⁠, Hyung Won Chung, Le Hou, Shayne Longpre …, ⁠Barret Zoph, ⁠Yi Tay, William Fedus⁠, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu⁠, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi⁠, Jeff Dean⁠, Jacob Devlin, Adam Roberts⁠, ⁠Denny Zhou, Quoc V. Le⁠, Jason Wei
https://arxiv.org/abs/2210.10341#microsoft: “BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining ”⁠, Renqian Luo, Liai Sun, Yingce Xia …, Tao Qin⁠, Sheng Zhang, Hoifung Poon, Tie-Yan Liu⁠
https://arxiv.org/abs/2210.06423#microsoft: “Foundation Transformers ”⁠, Hongyu Wang, Shuming Ma, Shaohan Huang …, Li Dong⁠, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei⁠
https://arxiv.org/abs/2210.03350#allen: “Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle) ”⁠, Ofir Press, Muru Zhang, Sewon Min …, Ludwig Schmidt⁠, ⁠Noah A. Smith, Mike Lewis⁠
https://arxiv.org/abs/2210.02414#baai: “GLM-130B: An Open Bilingual Pre-Trained Model ”⁠, Aohan Zeng, Xiao Liu, Zhengxiao Du …, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu⁠, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang⁠
https://arxiv.org/abs/2210.02441: “Ask Me Anything (AMA): A Simple Strategy for Prompting Language Models ”⁠, Simran Arora, Avanika Narayan, Mayee F. Chen …, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré⁠
https://arxiv.org/abs/2208.05516: “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP ”⁠, Thao Nguyen, Gabriel Ilharco, ⁠Mitchell Wortsman …, Sewoong Oh, Ludwig Schmidt⁠
https://arxiv.org/abs/2207.06991: “PIXEL: Language Modeling With Pixels ”⁠, Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello …, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott⁠
https://arxiv.org/abs/2207.05221#anthropic: “Language Models (Mostly) Know What They Know ”⁠, Saurav Kadavath⁠, Tom Conerly, ⁠Amanda Askell …, Tom Henighan, Dawn Drain, ⁠Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston⁠, Sheer El-Showk, ⁠Andy L. Jones, ⁠Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai⁠, ⁠Samuel R. Bowman, Stanislav Fort, ⁠Deep Ganguli, Danny Hernandez⁠, Josh Jacobson, ⁠Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei⁠, Tom B. Brown⁠, ⁠Jack Clark⁠, Nicholas Joseph, Ben Mann, Sam McCandlish⁠, Chris Olah, Jared Kaplan
https://arxiv.org/abs/2206.15472: “On-Device Training Under 256KB Memory ”⁠, Ji Lin, Ligeng Zhu, Wei-Ming Chen …, Wei-Chen Wang, Chuang Gan, Song Han
https://arxiv.org/abs/2206.04658#nvidia: “BigVGAN: A Universal Neural Vocoder With Large-Scale Training ”⁠, Sang-gil Lee, Wei Ping, Boris Ginsburg …, Bryan Catanzaro⁠, Sungroh Yoon
https://arxiv.org/abs/2206.01685: “Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning ”⁠, Juliette Millet, Charlotte Caucheteux, Pierre Orhan …, Yves Boubenec, Alexandre Gramfort, Ewan Dunbar, Christophe Pallier, Jean-Remi King
https://arxiv.org/abs/2205.14204#google: “M3AE: Multimodal Masked Autoencoders Learn Transferable Representations ”⁠, Xinyang Geng, Hao Liu, Lisa Lee⁠ …, Dale Schuurmans, Sergey Levine⁠, Pieter Abbeel⁠
https://arxiv.org/abs/2205.10625#google: “Least-To-Most Prompting Enables Complex Reasoning in Large Language Models ”⁠, ⁠Denny Zhou, Nathanael Schärli, Le Hou …, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc V. Le⁠, Ed Chi⁠
https://arxiv.org/abs/2205.09073#google: “Dialog Inpainting: Turning Documents into Dialogues ”⁠, Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao …, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
https://arxiv.org/abs/2205.05131#google: “UL2: Unifying Language Learning Paradigms ”⁠, ⁠Yi Tay, Mostafa Dehghani, Vinh Q. Tran …, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, ⁠Neil Houlsby, Donald Metzler
https://arxiv.org/abs/2205.03983#google: “Building Machine Translation Systems for the Next Thousand Languages ”⁠, Ankur Bapna, Isaac Caswell, Julia Kreutzer …, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao⁠, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu⁠, Macduff Hughes
https://arxiv.org/abs/2205.04596#google: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet ”⁠, Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes …, Sara Fridovich-Keil, Rebecca Roelofs
https://arxiv.org/abs/2205.01917#google: “CoCa: Contrastive Captioners Are Image-Text Foundation Models ”⁠, Jiahui Yu, Zirui Wang, Vijay Vasudevan …, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu⁠
https://arxiv.org/abs/2205.01397: “Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP) ”⁠, Alex Fang, Gabriel Ilharco, ⁠Mitchell Wortsman …, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt⁠
https://arxiv.org/abs/2204.14198#deepmind: “Flamingo: a Visual Language Model for Few-Shot Learning ”⁠, Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc …, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock⁠, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals⁠, Andrew Zisserman⁠, Karen Simonyan⁠
https://arxiv.org/abs/2204.10149: “WebFace260M: A Benchmark for Million-Scale Deep Face Recognition ”⁠, Zheng Zhu⁠, Guan Huang, Jiankang Deng …, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Dalong Du, Jiwen Lu, Jie Zhou
https://www.lesswrong.com/posts/SbAgRYo8tkHwhd9Qx/deepmind-the-podcast-excerpts-on-agi: “DeepMind: The Podcast—Excerpts on AGI ”⁠, William Kiely
https://arxiv.org/abs/2203.15556#deepmind: “Chinchilla: Training Compute-Optimal Large Language Models ”⁠, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch …, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan⁠, Erich Elsen, Jack W. Rae, Oriol Vinyals⁠, Laurent Sifre⁠
https://arxiv.org/abs/2203.11171#google: “Self-Consistency Improves Chain-Of-Thought Reasoning in Language Models ”⁠, Xuezhi Wang, Jason Wei, Dale Schuurmans …, Quoc V. Le⁠, Ed Chi⁠, ⁠Denny Zhou
https://arxiv.org/abs/2203.03466#microsoft: “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer ”⁠, Greg Yang, Edward J. Hu, Igor Babuschkin …, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, ⁠Jianfeng Gao⁠
https://arxiv.org/abs/2203.00854: “FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours ”⁠, Shenggan Cheng, Ruidong Wu, Zhongming Yu …, Binrui Li, Xiwen Zhang, Jian Peng, Yang You⁠
https://arxiv.org/abs/2202.12211#google: “Self-Distilled StyleGAN: Towards Generation from Internet Photos ”⁠, Ron Mokady, Michal Yarom, Omer Tov …, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani⁠, Inbar Mosseri
https://www.nature.com/articles/s42003-022-03036-1: “Brains and Algorithms Partially Converge in Natural Language Processing ”⁠, Charlotte Caucheteux, Jean-Rémi King
https://arxiv.org/abs/2202.06767#huawei: “Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework ”⁠, Jiaxi Gu, Xiaojun Meng, Guansong Lu …, Lu Hou⁠, Minzhe Niu, Hang Xu, Xiaodan Liang, Wei Zhang, Xin Jiang⁠, Chunjing Xu
https://arxiv.org/abs/2202.03052#alibaba: “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework ”⁠, Peng Wang, An Yang⁠, Rui Men …, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang
https://arxiv.org/abs/2202.02317#allen: “Webly Supervised Concept Expansion for General Purpose Vision Models ”⁠, Amita Kamath, Christopher Clark⁠, Tanmay Gupta …, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi
https://arxiv.org/abs/2202.00273: “StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets ”⁠, Axel Sauer, Katja Schwarz, Andreas Geiger
https://arxiv.org/abs/2201.11990#microsoftnvidia: “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model ”⁠, Shaden Smith, Mostofa Patwary, Brandon Norick …, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, ⁠Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary⁠, Bryan Catanzaro⁠
https://arxiv.org/abs/2201.11473#microsoft: “Reasoning Like Program Executors ”⁠, Xinyu Pi, Qian Liu⁠, Bei Chen …, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, Weizhu Chen
https://arxiv.org/abs/2201.10005#openai: “Text and Code Embeddings by Contrastive Pre-Training ”⁠, Arvind Neelakantan, Tao Xu, Raul Puri …, Alec Radford⁠, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, ⁠Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger⁠, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson⁠, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder⁠, ⁠Lilian Weng
https://arxiv.org/abs/2201.08371#facebook: “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models ”⁠, Mannat Singh, Laura Gustafson, Aaron Adcock …, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick⁠, Piotr Dollár, ⁠Laurens van der Maaten
https://arxiv.org/abs/2201.07520#facebook: “CM3: A Causal Masked Multimodal Model of the Internet ”⁠, Armen Aghajanyan, Bernie Huang, Candace Ross …, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis⁠, Luke Zettlemoyer⁠
https://arxiv.org/abs/2201.06910: “ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization ”⁠, Hanwei Xu, Yujun Chen, Yulun Du …, Nan Shao, Yanggang Wang, Haiyu Li, Zhilin Yang⁠
https://arxiv.org/abs/2201.03545#facebook: “ConvNeXt: A ConvNet for the 2020s ”⁠, ⁠Zhuang Liu, Hanzi Mao, Chao-Yuan Wu …, Christoph Feichtenhofer, Trevor Darrell⁠, Saining Xie
https://royalsocietypublishing.org/doi/10.1098/rstb.2020.0529: “The Evolution of Quantitative Sensitivity ”⁠, Margaret A. H. Bryer, Sarah E. Koopman, Jessica F. Cantlon⁠ …, ⁠Steven T. Piantadosi, Evan L. MacLean, Joseph M. Baker⁠, Michael J. Beran, Sarah M. Jones, Kerry E. Jordan, Salif Mahamane, Andreas Nieder, Bonnie M. Perdue, Friederike Range, Jeffrey R. Stevens, Masaki Tomonaga, Dorottya J. Ujfalussy, Jennifer Vonk
https://arxiv.org/abs/2112.05253: “MAGMA—Multimodal Augmentation of Generative Models through Adapter-Based Finetuning ”⁠, Constantin Eichenberg, Sidney Black, Samuel Weinbach …, Letitia Parcalabescu, Anette Frank
https://arxiv.org/abs/2112.04426#deepmind: “Improving Language Models by Retrieving from Trillions of Tokens ”⁠, Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann …, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, ⁠Geoffrey Irving, Oriol Vinyals⁠, Simon Osindero, Karen Simonyan⁠, Jack W. Rae, Erich Elsen, Laurent Sifre⁠
https://arxiv.org/abs/2111.12233#microsoft: “LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning ”⁠, Xiaowei Hu, Zhe Gan, Jianfeng Wang …, Zhengyuan Yang, Zicheng Liu⁠, Yumao Lu, Lijuan Wang
https://arxiv.org/abs/2111.12763#google: “Sparse Is Enough in Scaling Transformers ”⁠, Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin …, Łukasz Kaiser⁠, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
https://arxiv.org/abs/2111.11904#microsoft: “Can Pre-Trained Language Models Be Used to Resolve Textual and Semantic Merge Conflicts? ”⁠, Jialu Zhang, Todd Mytkowicz, Mike Kaufman …, Ruzica Piskac, Shuvendu K. Lahiri
https://arxiv.org/abs/2111.11133: “L-Verse: Bidirectional Generation Between Image and Text ”⁠, Taehoon Kim, Gwangmo Song, Sihaeng Lee …, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae
https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision ”⁠, Lu Yuan, Dongdong Chen, Yi-Ling Chen …, Noel Codella, Xiyang Dai, ⁠Jianfeng Gao⁠, Houdong Hu, Xuedong Huang⁠, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu⁠, Yumao Lu, Yu Shi⁠, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
https://arxiv.org/abs/2111.10050#google: “BASIC: Combined Scaling for Open-Vocabulary Image Classification ”⁠, Hieu Pham, Zihang Dai⁠, Golnaz Ghiasi …, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu⁠, Mingxing Tan, Quoc V. Le⁠
https://arxiv.org/abs/2111.08267: “Solving Probability and Statistics Problems by Program Synthesis ”⁠, Leonard Tang, Elizabeth Ke, Nikhil Singh …, Nakul Verma⁠, Iddo Drori
https://arxiv.org/abs/2111.06377#facebook: “MAE: Masked Autoencoders Are Scalable Vision Learners ”⁠, Kaiming He⁠, Xinlei Chen, Saining Xie …, Yanghao Li, Piotr Dollár, Ross Girshick⁠
https://arxiv.org/abs/2111.02114#laion: “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs ”⁠, Christoph Schuhmann, Richard Vencu, Romain Beaumont …, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki
https://arxiv.org/abs/2110.14168#openai: “Training Verifiers to Solve Math Word Problems ”⁠, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian …, ⁠Jacob Hilton, Reiichiro Nakano, Christopher Hesse, ⁠John Schulman
https://arxiv.org/abs/2110.11526#deepmind: “Wide Neural Networks Forget Less Catastrophically ”⁠, Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin …, Huiyi Hu, ⁠Razvan Pascanu⁠, Dilan Gorur, Mehrdad Farajtabar
https://arxiv.org/abs/2110.02095#google: “Exploring the Limits of Large Scale Pre-Training ”⁠, Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
https://arxiv.org/abs/2109.10686#google: “Scale Efficiently: Insights from Pre-Training and Fine-Tuning Transformers ”⁠, ⁠Yi Tay, Mostafa Dehghani, Jinfeng Rao …, William Fedus⁠, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani⁠, Donald Metzler
https://arxiv.org/abs/2109.07958: “TruthfulQA: Measuring How Models Mimic Human Falsehoods ”⁠, Stephanie Lin⁠, ⁠Jacob Hilton, ⁠Owain Evans
https://arxiv.org/abs/2109.02593#allen: “General-Purpose Question-Answering With Macaw ”⁠, Oyvind Tafjord, Peter Clark
https://arxiv.org/abs/2108.13002#microsoft: “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP ”⁠, Yucheng Zhao, Guangting Wang, Chuanxin Tang …, Chong Luo, Wenjun Zeng⁠, Zheng-Jun Zha
https://arxiv.org/abs/2108.08810#google: “Do Vision Transformers See Like Convolutional Neural Networks? ”⁠, Maithra Raghu, Thomas Unterthiner, Simon Kornblith …, Chiyuan Zhang, Alexey Dosovitskiy
https://arxiv.org/abs/2108.07686: “Scaling Laws for Deep Learning ”⁠, Jonathan S. Rosenfeld⁠
https://arxiv.org/abs/2107.02137#baidu: “ERNIE 3.0: Large-Scale Knowledge Enhanced Pre-Training for Language Understanding and Generation ”⁠, ⁠Yu Sun, Shuohuan Wang, Shikun Feng …, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun⁠, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang
https://arxiv.org/abs/2107.01294#allen: “Scarecrow: A Framework for Scrutinizing Machine Text ”⁠, Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski …, ⁠Noah A. Smith, Yejin Choi⁠
https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision ”⁠, Robert Geirhos⁠, Kantharaju Narayanappa, Benjamin Mitzkus …, Tizian Thieringer, Matthias Bethge⁠, Felix A. Wichmann, Wieland Brendel
https://arxiv.org/abs/2106.04803#google: “CoAtNet: Marrying Convolution and Attention for All Data Sizes ”⁠, Zihang Dai⁠, Hanxiao Liu, Quoc V. Le⁠, Mingxing Tan
https://arxiv.org/abs/2106.04560#google: “Scaling Vision Transformers ”⁠, Xiaohua Zhai⁠, Alexander Kolesnikov, ⁠Neil Houlsby, Lucas Beyer⁠
https://arxiv.org/abs/2106.03004#google: “Exploring the Limits of Out-Of-Distribution Detection ”⁠, Stanislav Fort, Jie Ren, Balaji Lakshminarayanan
https://arxiv.org/abs/2106.00116: “Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images ”⁠, Mehdi Cherti, Jenia Jitsev
https://arxiv.org/abs/2105.12806: “A Universal Law of Robustness via Isoperimetry ”⁠, Sébastien Bubeck⁠, Mark Sellke
https://m.koreaherald.com/view.php?ud=20210525000824#naver: “Naver Unveils First ‘Hyperscale’ AI Platform ”, Kang Jae-eun
https://arxiv.org/abs/2105.11084#facebook: “Unsupervised Speech Recognition ”⁠, Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli
https://venturebeat.com/ai/google-details-new-ai-accelerator-chips/: “Google Details New AI Accelerator Chips ”⁠, Kyle Wiggers
https://arxiv.org/abs/2105.01601#google: “MLP-Mixer: An All-MLP Architecture for Vision ”⁠, Ilya Tolstikhin, ⁠Neil Houlsby, Alexander Kolesnikov …, Lucas Beyer⁠, Xiaohua Zhai⁠, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit⁠, Mario Lucic, Alexey Dosovitskiy
https://arxiv.org/abs/2105.00572#facebook: “XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling ”⁠, Naman Goyal, Jingfei Du, Myle Ott …, Giri Anantharaman, Alexis Conneau
https://arxiv.org/abs/2104.14294#facebook: “DINO: Emerging Properties in Self-Supervised Vision Transformers ”⁠, Mathilde Caron, Hugo Touvron, Ishan Misra …, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin⁠
abstract: ⁠“Machine Learning Scaling ”⁠, ⁠Gwern⁠
https://arxiv.org/abs/2104.02133#google: “SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network ”⁠, ⁠William Chan, Daniel Park, Chris Lee …, Yu Zhang, Quoc V. Le⁠, Mohammad Norouzi⁠
https://arxiv.org/abs/2103.14586#google: “Understanding Robustness of Transformers for Image Classification ”⁠, Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner …, Daliang Li, Thomas Unterthiner, Andreas Veit
https://arxiv.org/abs/2103.13009#allen: “UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark ”⁠, Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi⁠
https://arxiv.org/abs/2103.10957#deepmind: “Efficient Visual Pretraining With Contrastive Detection ”⁠, Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac …, Aaron van den Oord, Oriol Vinyals⁠, João Carreira
https://arxiv.org/abs/2103.07579#google: “Revisiting ResNets: Improved Training and Scaling Strategies ”⁠, Irwan Bello, William Fedus⁠, Xianzhi Du …, Ekin D. Cubuk, Aravind Srinivas⁠, Tsung-Yi Lin, Jonathon Shlens, ⁠Barret Zoph
https://ai.meta.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World ”⁠, Geoffrey Zweig, Polina Kuznetsova⁠, Michael Auli, Francois Fagan
https://arxiv.org/abs/2103.06561: “WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training ”⁠, Yuqi Huo, Manli Zhang, Guangzhen Liu …, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li⁠, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu⁠, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
https://arxiv.org/abs/2103.01988#facebook: “SEER: Self-Supervised Pretraining of Visual Features in the Wild ”⁠, Priya Goyal, Mathilde Caron, Benjamin Lefaudeux …, Min Xu⁠, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin⁠, Piotr Bojanowski
https://arxiv.org/abs/2102.09672#openai: “Improved Denoising Diffusion Probabilistic Models ”⁠, Alex Nichol, ⁠Prafulla Dhariwal
https://arxiv.org/abs/2102.05918#google: “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision ”⁠, Chao Jia, Yinfei Yang, Ye Xia⁠ …, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le⁠, Yunhsuan Sung, Zhen Li, Tom Duerig
https://arxiv.org/abs/2102.06171#deepmind: “NFNet: High-Performance Large-Scale Image Recognition Without Normalization ”⁠, Andrew Brock⁠, Soham De, Samuel L. Smith⁠, Karen Simonyan⁠
https://arxiv.org/abs/2102.02888#microsoft: “1-Bit Adam: Communication Efficient Large-Scale Training With Adam’s Convergence Speed ”⁠, Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan …, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He
https://arxiv.org/abs/2102.01951#scaling&org=deepmind: “Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling ”⁠, Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya …, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom
https://arxiv.org/abs/2003.10580#google: “Meta Pseudo Labels ”⁠, Hieu Pham, Zihang Dai⁠, Qizhe Xie …, Minh-Thang Luong, Quoc V. Le⁠
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “CLIP: Learning Transferable Visual Models From Natural Language Supervision ”⁠, Alec Radford⁠, ⁠Jong Wook Kim, Chris Hallacy …, Aditya A. Ramesh⁠, Gabriel Goh⁠, Sandhini Agarwal⁠, Girish Sastry, ⁠Amanda Askell, Pamela Mishkin⁠, ⁠Jack Clark⁠, Gretchen Krueger⁠, Ilya Sutskever⁠
https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance: “Extrapolating GPT-N Performance ”⁠, Lukas Finnveden
https://arxiv.org/abs/2012.00413: “CPM: A Large-Scale Generative Chinese Pre-Trained Language Model ”⁠, Zhengyan Zhang, Xu Han⁠, Hao Zhou⁠ …, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, ⁠Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang⁠, Juanzi Li, Xiaoyan Zhu, ⁠Maosong Sun
https://arxiv.org/abs/2011.10650#openai: “Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images ”⁠, ⁠Rewon Child
https://arxiv.org/abs/2010.14701#openai: “Scaling Laws for Autoregressive Generative Modeling ”⁠, Tom Henighan, Jared Kaplan, Mor Katz …, ⁠Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown⁠, ⁠Prafulla Dhariwal, Scott Gray⁠, Chris Hallacy, Benjamin Mann, Alec Radford⁠, Aditya A. Ramesh⁠, Nick Ryder, Daniel M. Ziegler, ⁠John Schulman, Dario Amodei⁠, Sam McCandlish⁠
https://arxiv.org/abs/2010.14571#google: “Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus ”⁠, Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
https://arxiv.org/abs/2010.10504#google: “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition ”⁠, Yu Zhang, James Qin, Daniel S. Park …, Wei Han⁠, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le⁠, Yonghui Wu⁠
https://ai.meta.com/blog/introducing-many-to-many-multilingual-machine-translation/: “The First AI Model That Translates 100 Languages without Relying on English Data ”⁠, Angela Fan
https://arxiv.org/abs/2010.11929#google: “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale ”⁠, Alexey Dosovitskiy, Lucas Beyer⁠, Alexander Kolesnikov …, Dirk Weissenborn, Xiaohua Zhai⁠, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit⁠, ⁠Neil Houlsby
https://www.openphilanthropy.org/research/new-report-on-how-much-computational-power-it-takes-to-match-the-human-brain/: “New Report on How Much Computational Power It Takes to Match the Human Brain ”⁠, Joseph Carlsmith
https://arxiv.org/abs/2009.03393#openai: “Generative Language Modeling for Automated Theorem Proving ”⁠, Stanislas Polu, Ilya Sutskever⁠
https://arxiv.org/abs/2008.09037: “Accuracy and Performance Comparison of Video Action Recognition Approaches ”⁠, Matthew Hutchinson⁠, Siddharth Samsi, William Arcand …, David Bestor, Bill Bergeron, Chansup Byun, Michael Houle, Matthew Hubbell, Michael Jones, Jeremy Kepner, Andrew Kirby, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout⁠, Antonio Rosa, Albert Reuther, Charles Yee, Vijay Gadepally
https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning: “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms ”⁠, Adam Scholl
https://arxiv.org/abs/2008.02217: “Hopfield Networks Is All You Need ”⁠, Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner …, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter⁠
https://arxiv.org/abs/2007.06225: “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing ”⁠, Ahmed Elnaggar⁠, Michael Heinzinger, Christian Dallago …, Ghalia Rihawi, Yu Wang, Llion Jones⁠, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger⁠, Debsindhu Bhowmik, Burkhard Rost⁠
https://arxiv.org/abs/2007.03898#nvidia: “NVAE: A Deep Hierarchical Variational Autoencoder ”⁠, Arash Vahdat, Jan Kautz
https://arxiv.org/abs/2006.11477#facebook: “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations ”⁠, Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
https://arxiv.org/abs/2006.10621: “On the Predictability of Pruning Across Scales ”⁠, Jonathan S. Rosenfeld⁠, ⁠Jonathan Frankle, ⁠Michael Carbin⁠, Nir Shavit⁠
2020-chen-2.pdf#openai: “IGPT: Generative Pretraining from Pixels ”⁠, ⁠Mark Chen, Alec Radford⁠, ⁠Rewon Child …, Jeff Wu, Heewoo Jun, ⁠Prafulla Dhariwal, David Luan, Ilya Sutskever⁠
https://arxiv.org/abs/2006.09882#facebook: “SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments ”⁠, Mathilde Caron, Ishan Misra, Julien Mairal …, Priya Goyal, Piotr Bojanowski, Armand Joulin⁠
https://openai.com/index/image-gpt/: “Image GPT (IGPT): We Find That, Just As a Large Transformer Model Trained on Language Can Generate Coherent Text, the Same Exact Model Trained on Pixel Sequences Can Generate Coherent Image Completions and Samples ”⁠, ⁠Mark Chen, Alec Radford⁠, Ilya Sutskever⁠
https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/: “ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale ”⁠, DeepSpeed Team
https://openai.com/research/jukebox: “Jukebox: We’re Introducing Jukebox, a Neural Net That Generates Music, including Rudimentary Singing, As Raw Audio in a Variety of Genres and Artist Styles. We’re Releasing the Model Weights and Code, along With a Tool to Explore the Generated Samples. ”⁠, ⁠Prafulla Dhariwal, Heewoo Jun, Christine Payne⁠ …, ⁠Jong Wook Kim, Alec Radford⁠, Ilya Sutskever⁠
https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/: “Blender: A State-Of-The-Art Open Source Chatbot ”⁠, Stephen Roller, Jason Weston⁠, Emily Dinan
https://arxiv.org/abs/2004.08366#google: “DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications ”⁠, Yun Zeng, Siqi Zuo, Dongcai Shen
https://arxiv.org/abs/2004.07159#alibaba: “PALM: Pre-Training an Autoencoding & Autoregressive Language Model for Context-Conditioned Generation ”⁠, Bin Bi, Chenliang Li, Chen Wu …, Ming Yan⁠, Wei Wang, Songfang Huang, Fei Huang, Luo Si
https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/: “The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism ”⁠, Karen Hao⁠
https://arxiv.org/abs/2002.05709#google: “A Simple Framework for Contrastive Learning of Visual Representations ”⁠, Ting Chen, Simon Kornblith, Mohammad Norouzi⁠, Geoffrey Hinton⁠
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/: “Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft ”⁠, Corby Rosset
https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/: “Towards a Conversational Agent That Can Chat About…Anything ”⁠, Daniel Adiwardana, Thang Luong
https://arxiv.org/abs/2001.08361#openai: “Scaling Laws for Neural Language Models ”⁠, Jared Kaplan, Sam McCandlish⁠, Tom Henighan …, Tom B. Brown⁠, Benjamin Chess, ⁠Rewon Child, Scott Gray⁠, Alec Radford⁠, Jeffrey Wu⁠, Dario Amodei⁠
https://www.youtube.com/watch?v=kY2NHSKBi10: “The Importance of Deconstruction ”⁠, ⁠Kilian Q. Weinberger
https://openai.com/research/deep-double-descent: “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time ”⁠, Preetum Nakkiran, ⁠Gal Kaplun, Yamini Bansal⁠ …, Tristan Yang⁠, Boaz Barak⁠, Ilya Sutskever⁠
https://arxiv.org/abs/1911.13299: “What’s Hidden in a Randomly Weighted Neural Network? ”⁠, Vivek Ramanujan, ⁠Mitchell Wortsman, Aniruddha Kembhavi …, Ali Farhadi⁠, Mohammad Rastegari
https://arxiv.org/abs/1911.05722#facebook: “Momentum Contrast for Unsupervised Visual Representation Learning ”⁠, Kaiming He⁠, Haoqi Fan, Yuxin Wu …, Saining Xie, Ross Girshick⁠
https://arxiv.org/abs/1911.04252#google: “Self-Training With Noisy Student Improves ImageNet Classification ”⁠, Qizhe Xie, Minh-Thang Luong, Eduard Hovy⁠, Quoc V. Le⁠
https://arxiv.org/abs/1911.02116#facebook: “Unsupervised Cross-Lingual Representation Learning at Scale ”⁠, Alexis Conneau, Kartikay Khandelwal, Naman Goyal …, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer⁠, Veselin Stoyanov⁠
https://arxiv.org/abs/1910.02054#microsoft: “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models ”⁠, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
https://arxiv.org/abs/1909.11740: “UNITER: UNiversal Image-TExt Representation Learning ”⁠, Yen-Chun Chen, Linjie Li, Licheng Yu …, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
https://arxiv.org/abs/1909.05858#salesforce: “CTRL: A Conditional Transformer Language Model For Controllable Generation ”⁠, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney …, ⁠Caiming Xiong, Richard Socher
https://nv-adlr.github.io/MegatronLM: “MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism”, NVIDIA ADLR
https://arxiv.org/abs/1907.11692#facebook: “RoBERTa: A Robustly Optimized BERT Pretraining Approach ”⁠, Yinhan Liu, Myle Ott, Naman Goyal …, Jingfei Du, Mandar Joshi, Danqi Chen⁠, Omer Levy⁠, Mike Lewis⁠, Luke Zettlemoyer⁠, Veselin Stoyanov⁠
https://arxiv.org/abs/1907.07640: “Robustness Properties of Facebook’s ResNeXt WSL Models ”⁠, A. Emin Orhan
https://arxiv.org/abs/1907.02544: “Large Scale Adversarial Representation Learning ”⁠, Jeff Donahue, Karen Simonyan⁠
https://arxiv.org/abs/1906.06669: “One Epoch Is All You Need ”⁠, Aran Komatsuzaki
https://david-abel.github.io/notes/icml_2019.pdf: “ICML 2019 Notes ”⁠, David Abel
https://arxiv.org/abs/1905.11946#google: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks ”⁠, Mingxing Tan, Quoc V. Le⁠
https://arxiv.org/abs/1905.10843: “Asymptotic Learning Curves of Kernel Methods: Empirical Data versus Teacher-Student Paradigm ”⁠, Stefano Spigler, Mario Geiger, Matthieu Wyart
https://arxiv.org/abs/1905.03197: “UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation ”⁠, Li Dong⁠, Nan Yang, Wenhui Wang …, Furu Wei⁠, Xiaodong Liu, Yu Wang, ⁠Jianfeng Gao⁠, Ming Zhou, Hsiao-Wuen Hon⁠
https://arxiv.org/abs/1905.00546#facebook: “Billion-Scale Semi-Supervised Learning for Image Classification ”⁠, I. Zeki Yalniz, Hervé Jégou, Kan Chen …, Manohar Paluri, Dhruv Mahajan
http://www.incompleteideas.net/IncIdeas/BitterLesson.html: “The Bitter Lesson ”, Rich Sutton⁠
https://openai.com/index/better-language-models/: “Better Language Models and Their Implications ”⁠, Alec Radford⁠, Jeffrey Wu⁠, Dario Amodei⁠ …, Daniela Amodei⁠, ⁠Jack Clark⁠, ⁠Miles Brundage, Ilya Sutskever⁠
https://melaniemitchell.me/aibook/: “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified ”, Melanie Mitchell⁠
https://openai.com/research/how-ai-training-scales: “How AI Training Scales ”⁠, Sam McCandlish⁠, Jared Kaplan, Dario Amodei⁠
https://slatestarcodex.com/2018/11/26/is-science-slowing-down-2/: “Is Science Slowing Down? ”⁠, ⁠Scott Alexander⁠
https://arxiv.org/pdf/1809.11096#page=8&org=deepmind: “BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis § 5.2 Additional Evaluation On JFT-300M ”⁠, Andrew Brock⁠, Jeff Donahue, Karen Simonyan⁠
https://arxiv.org/abs/1808.01097: “CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images ”⁠, Sheng Guo, Weilin Huang, Haozhi Zhang …, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5: “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications ”⁠, Alec Radford⁠, Karthik Narasimhan, ⁠Tim Salimans⁠, Ilya Sutskever⁠
https://arxiv.org/abs/1805.00932#facebook: “Exploring the Limits of Weakly Supervised Pretraining ”⁠, Dhruv Mahajan, Ross Girshick⁠, Vignesh Ramanathan …, Kaiming He⁠, Manohar Paluri, Yixuan Li⁠, Ashwin Bharambe, ⁠Laurens van der Maaten
https://arxiv.org/abs/1801.06146: “ULMFiT: Universal Language Model Fine-Tuning for Text Classification ”⁠, Jeremy Howard, Sebastian Ruder
https://arxiv.org/abs/1706.06083: “Towards Deep Learning Models Resistant to Adversarial Attacks ”⁠, ⁠Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt⁠ …, Dimitris Tsipras, Adrian Vladu
https://arxiv.org/abs/1706.01427#deepmind: “A Simple Neural Network Module for Relational Reasoning ”⁠, Adam Santoro⁠, David Raposo, David G. T. Barrett …, Mateusz Malinowski⁠, ⁠Razvan Pascanu⁠, Peter Battaglia, Timothy Lillicrap⁠
https://arxiv.org/abs/1705.07750#deepmind: “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset ”⁠, Joao Carreira, Andrew Zisserman⁠
https://arxiv.org/abs/1705.05640: “WebVision Challenge: Visual Learning and Understanding With Web Data ”⁠, Wen Li, Limin Wang, Wei Li⁠ …, Eirikur Agustsson, Jesse Berent, Abhinav Gupta, Rahul Sukthankar, Luc Van Gool
https://blogs.microsoft.com/ai/microsoft-researchers-win-imagenet-computer-vision-challenge/: “Microsoft Researchers Win ImageNet Computer Vision Challenge ”⁠, Allison Linn
https://arxiv.org/abs/1511.06789#google: “The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition ”⁠, Jonathan Krause⁠, Benjamin Sapp, Andrew Howard …, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, Li Fei-Fei⁠
https://arxiv.org/abs/1511.02251#facebook: “Learning Visual Features from Large Weakly Supervised Data ”⁠, Armand Joulin⁠, ⁠Laurens van der Maaten, Allan Jabri, Nicolas Vasilache
https://openaccess.thecvf.com/content_cvpr_2015/papers/Xiao_Learning_From_Massive_2015_CVPR_paper.pdf#baidu: “Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification ”⁠, Tong Xiao, Tian Xia⁠, Yi Yang …, Chang Huang, Xiaogang Wang⁠
https://arxiv.org/abs/1402.1869: “On the Number of Linear Regions of Deep Neural Networks ”⁠, Guido Montúfar, ⁠Razvan Pascanu⁠, ⁠Kyunghyun Cho, Yoshua Bengio⁠
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf: “N-Gram Counts and Language Models from the Common Crawl ”⁠, Christian Buck, Kenneth Heafield, Bas van Ooyen
https://aclanthology.org/P13-2121.pdf: “Scalable Modified Kneser-Ney Language Model Estimation ”⁠, Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, Philipp Koehn⁠
2010-mikolov.pdf: “Recurrent Neural Network Based Language Model ”⁠, Tomas Mikolov⁠, Martin Karafiat⁠, Lukas Burget …, Jan Cernocky, Sanjeev Khudanpur
2010-hameed.pdf: “Understanding Sources of Inefficiency in General-Purpose Chips ”⁠, Rehan Hameed, Wajahat Qadeer, Megan Wachs …, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis⁠, Mark Alan Horowitz⁠
https://dw2blog.com/2009/11/02/halloween-nightmare-scenario-early-2020s/: “Halloween Nightmare Scenario, Early 2020’s ”, David Wood
2009-koren.pdf: “Matrix Factorization Techniques for Recommender Systems ”⁠, Yehuda Koren, Robert Bell, Chris Volinsky
https://web.archive.org/web/20230718144747/https://frc.ri.cmu.edu/~hpm/project.archive/robot.papers/2004/Predictions.html: “Robot Predictions Evolution ”⁠, Hans Moravec⁠
2003-perlich.pdf: “Tree Induction versus Logistic Regression: A Learning-Curve Analysis ”⁠, Claudia Perlich, Foster Provost⁠, Jeffrey S. Simonoff
http://infolab.stanford.edu/~backrub/google.html: “The Anatomy of a Large-Scale Hypertextual Web Search Engine ”⁠, Sergey Brin⁠, Lawrence Page⁠
https://paulfchristiano.com/: “Homepage of Paul F. Christiano ”⁠, Paul F. Christiano