Bibliography:

  1. Machine Learning Scaling

  2. ‘AI’ tag

  3. ‘AI economics’ tag

  4. ‘grokking (NN)’ tag

  5. ‘AI emergence’ tag

  6. ‘AI hardware’ tag

  7. ‘MoE NN’ tag

  8. ‘ML dataset’ tag

  9. ‘Highleyman’s AI’ tag

  10. ‘diffusion model’ tag

  11. ‘BigGAN’ tag

  12. ‘CLIP’ tag

  13. ‘DALL·E’ tag

  14. ‘instruct-tuning LLMs’ tag

  15. ‘masked autoencoder’ tag

  16. ‘video analysis’ tag

  17. ‘video generation’ tag

  18. ‘active learning’ tag

  19. ‘continual learning’ tag

  20. ‘Decision Transformer’ tag

  21. ‘MARL’ tag

  22. ‘robotics’ tag

  23. ‘RL scaling’ tag

  24. Is OpenAI alright? How would we know and what would it look like?

  25. What do you do after ‘winning’ an AI arms race?

  26. Absolute Unit NNs: Regression-Based MLPs for Everything

  27. What do we mean by ‘diminishing returns’ in scaling?

  28. Research Ideas

  29. GPT-3 Creative Fiction

  30. GANs Didn’t Fail, They Were Abandoned

  31. The Scaling Hypothesis

  32. ML Scaling subreddit

  33. WBE and DRL: a Middle Way of imitation learning from the human brain

  34. Computer Optimization: Your Computer Is Faster Than You Think

  35. Fully-Connected Neural Nets

  36. Machine Learning Scaling

  37. Technology Forecasting: The Garden of Forking Paths

  38. PaliGemma 2: A Family of Versatile VLMs for Transfer

  39. Best-of-N Jailbreaking

  40. Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

  41. ABBYY’s Bitter Lesson: How Linguists Lost the Last Battle for NLP

  42. CT Foundation: Taking medical imaging embeddings 3D

  43. Inference Scaling for Long-Context Retrieval Augmented Generation

  44. Strategic Insights from Simulation Gaming of AI Race Dynamics

  45. How Feature Learning Can Improve Neural Scaling Laws

  46. Dwarkesh Podcast Progress Update

  47. 224cbb52c0b355315b030704db6347009e2ab1e0.html

  48. Gwern Branwen—How an Anonymous Researcher Predicted AI’s Trajectory

  49. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

  50. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

  51. Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

  52. Resolving Discrepancies in Compute-Optimal Scaling of Language Models

  53. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

  54. Probing the Decision Boundaries of In-context Learning in Large Language Models

  55. How Do Large Language Models Acquire Factual Knowledge During Pretraining?

  56. Explore the Limits of Omni-modal Pretraining at Scale

  57. Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences

  58. Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

  59. Attention as a Hypernetwork

  60. Training Compute-Optimal Protein Language Models

  61. AI Will Become Mathematicians’ ‘Co-Pilot’: Fields Medalist Terence Tao explains how proof checkers and AI programs are dramatically changing mathematics

  62. The Scaling Law in Stellar Light Curves

  63. AstroPT: Scaling Large Observation Models for Astronomy

  64. xLSTM: Extended Long Short-Term Memory

  65. Position: Understanding LLMs Requires More Than Statistical Generalization

  66. GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

  67. CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data

  68. Test-Time Augmentation to solve ARC

  69. Chinchilla Scaling: A replication attempt

  70. Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

  71. Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

  72. Language Imbalance Can Boost Cross-lingual Generalization

  73. CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge

  74. Conformer-1: Robust ASR via Large-Scale Semi-supervised Bootstrapping

  75. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

  76. Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction

  77. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

  78. Long-form factuality in large language models

  79. Mechanistic Design and Scaling of Hybrid Architectures

  80. 8 Google Employees Invented Modern AI. Here’s the Inside Story: They met by chance, got hooked on an idea, and wrote the Transformers paper—the most consequential tech breakthrough in recent history

  81. Inflection-2.5: meet the world’s best personal AI

  82. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU)

  83. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

  84. Investigating Continual Pretraining in Large Language Models: Insights and Implications

  85. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

  86. StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

  87. How to Train Data-Efficient LLMs

  88. Weaver: Foundation Models for Creative Writing

  89. Arrows of Time for Large Language Models

  90. Can AI Assistants Know What They Don’t Know?

  91. I am a Strange Dataset: Metalinguistic Tests for Language Models

  92. TF-T2V: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

  93. Generative Multimodal Models are In-Context Learners

  94. Zoology: Measuring and Improving Recall in Efficient Language Models

  95. Seamless: Multilingual Expressive and Streaming Speech Translation

  96. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

  97. Instruction-tuning Aligns LLMs to the Human Brain

  98. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

  99. Sequential Modeling Enables Scalable Learning for Large Vision Models

  100. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

  101. In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search

  102. First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

  103. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

  104. A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

  105. Sam Altman accepts the 2023 Hawking Fellowship Award § Is there another breakthrough that’s needed to reach AGI?

  106. ConvNets Match Vision Transformers at Scale

  107. Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?

  108. PaLI-3 Vision Language Models: Smaller, Faster, Stronger

  109. GeoLLM: Extracting Geospatial Knowledge from Large Language Models

  110. Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

  111. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

  112. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

  113. Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors

  114. MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book

  115. Intriguing properties of generative classifiers

  116. Taken out of context: On measuring situational awareness in LLMs

  117. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

  118. Simple synthetic data reduces sycophancy in large language models

  119. LLaMA-2: Open Foundation and Fine-Tuned Chat Models

  120. Measuring Faithfulness in Chain-of-Thought Reasoning

  121. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

  122. Introducing Superalignment

  123. Gödel, Escher, Bach author Douglas Hofstadter on the state of AI today § What about AI terrifies you?

  124. Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression

  125. Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

  126. Scaling MLPs: A Tale of Inductive Bias

  127. Understanding Social Reasoning in Language Models with Language Models

  128. Image Captioners Are Scalable Vision Learners Too

  129. PaLI-X: On Scaling up a Multilingual Vision and Language Model

  130. The False Promise of Imitating Proprietary LLMs

  131. Scaling Data-Constrained Language Models

  132. Scaling laws for language encoding models in fMRI

  133. LIMA: Less Is More for Alignment

  134. Google’s newest AI model uses nearly 5× more text data for training than its predecessor

  135. TorToise: Better speech synthesis through scaling

  136. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

  137. ImageBind: One Embedding Space To Bind Them All

  138. Finding Neurons in a Haystack: Case Studies with Sparse Probing

  139. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

  140. Google’s DeepMind-Brain merger: tech giant regroups for AI battle

  141. CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval

  142. Emergent and Predictable Memorization in Large Language Models

  143. Power Law Trends in Speedrunning and Machine Learning

  144. Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: today’s hearing on AI covered AI regulation and challenges, and the infamous open letter, which nearly everyone in the room thought was unwise

  145. DINOv2: Learning Robust Visual Features without Supervision

  146. Segment Anything

  147. Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure

  148. Sigmoid Loss for Language Image Pre-Training

  149. How well do Large Language Models perform in Arithmetic tasks?

  150. GPT-4 Technical Report

  151. Securing Liberal Democratic Control of AGI through UK Leadership

  152. GigaGAN: Scaling up GANs for Text-to-Image Synthesis

  153. Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)

  154. Why didn’t DeepMind build GPT-3?

  155. Scaling Vision Transformers to 22 Billion Parameters

  156. John Carmack’s ‘Different Path’ to Artificial General Intelligence

  157. Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards

  158. ClimaX: A foundation model for weather and climate

  159. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

  160. MUG: Vision Learners Meet Web Image-Text Pairs

  161. GPT-3 as Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities

  162. Scaling Laws for Generative Mixed-Modal Language Models

  163. VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

  164. GPT-3 Takes the Bar Exam

  165. Cramming: Training a Language Model on a Single GPU in One Day

  166. Evolutionary-scale prediction of atomic level protein structure with a language model

  167. Discovering Language Model Behaviors with Model-Written Evaluations

  168. One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)

  169. Reproducible scaling laws for contrastive language-image learning

  170. ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

  171. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

  172. VindLU: A Recipe for Effective Video-and-Language Pretraining

  173. Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

  174. Scaling Language-Image Pre-training via Masking

  175. MultiRay: Optimizing efficiency for large-scale AI models

  176. Galactica: A Large Language Model for Science

  177. Large Language Models Struggle to Learn Long-Tail Knowledge

  178. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

  179. MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

  180. Adversarial Policies Beat Superhuman Go AIs

  181. Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)

  182. A Solvable Model of Neural Scaling Laws

  183. Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

  184. Evaluating Parameter Efficient Learning for Generation

  185. FLAN: Scaling Instruction-Finetuned Language Models

  186. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

  187. Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

  188. Foundation Transformers

  189. Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)

  190. GLM-130B: An Open Bilingual Pre-trained Model

  191. Ask Me Anything (AMA): A simple strategy for prompting language models

  192. Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

  193. Monolith: Real Time Recommendation System With Collisionless Embedding Table

  194. Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?

  195. PaLI: A Jointly-Scaled Multilingual Language-Image Model

  196. Using Large Language Models to Simulate Multiple Humans

  197. Understanding Scaling Laws for Recommendation Models

  198. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

  199. Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

  200. Efficient Training of Language Models to Fill in the Middle

  201. Why do tree-based models still outperform deep learning on tabular data?

  202. PIXEL: Language Modeling with Pixels

  203. High-performing neural network models of visual cortex benefit from high latent dimensionality

  204. Exploring Length Generalization in Large Language Models

  205. Language Models (Mostly) Know What They Know

  206. On-Device Training Under 256KB Memory

  207. Beyond neural scaling laws: beating power law scaling via data pruning

  208. ProGen2: Exploring the Boundaries of Protein Language Models

  209. RST: reStructured Pre-training

  210. Limitations of the NTK for Understanding Generalization in Deep Learning

  211. Modeling Transformative AI Risks (MTAIR) Project—Summary Report

  212. BigVGAN: A Universal Neural Vocoder with Large-Scale Training

  213. An Improved One millisecond Mobile Backbone

  214. A Neural Corpus Indexer for Document Retrieval

  215. Toward a realistic model of speech processing in the brain with self-supervised learning

  216. Teaching Models to Express Their Uncertainty in Words

  217. Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power

  218. M3AE: Multimodal Masked Autoencoders Learn Transferable Representations

  219. InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning

  220. Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

  221. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

  222. Continual Pre-Training Mitigates Forgetting in Language and Vision

  223. Dialog Inpainting: Turning Documents into Dialogues

  224. Unifying Language Learning Paradigms

  225. Building Machine Translation Systems for the Next Thousand Languages

  226. When does dough become a bagel? Analyzing the remaining mistakes on ImageNet

  227. CoCa: Contrastive Captioners are Image-Text Foundation Models

  228. Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

  229. Continual Learning with Foundation Models: An Empirical Study of Latent Replay

  230. Flamingo: a Visual Language Model for Few-Shot Learning

  231. WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

  232. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

  233. DeepMind: The Podcast—Excerpts on AGI

  234. Can language models learn from explanations in context?

  235. Chinchilla: Training Compute-Optimal Large Language Models

  236. A Roadmap for Big Model

  237. A Conversational Paradigm for Program Synthesis

  238. Self-Consistency Improves Chain-of-Thought Reasoning in Language Models

  239. Effect of scale on catastrophic forgetting in neural networks

  240. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

  241. FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

  242. Variational Autoencoders Without the Variation

  243. Performance reserves in brain-imaging-based phenotype prediction

  244. Self-Distilled StyleGAN: Towards Generation from Internet Photos

  245. UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training

  246. Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

  247. Brains and algorithms partially converge in natural language processing

  248. Quantifying Memorization Across Neural Language Models

  249. Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

  250. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  251. Data Scaling Laws in NMT: The Effect of Noise and Architecture

  252. Webly Supervised Concept Expansion for General Purpose Vision Models

  253. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

  254. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

  255. Reasoning Like Program Executors

  256. Text and Code Embeddings by Contrastive Pre-Training

  257. LaMDA: Language Models for Dialog Applications

  258. SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models

  259. CM3: A Causal Masked Multimodal Model of the Internet

  260. ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization

  261. A High-Dimensional Sphere Spilling out of a High-Dimensional Cube despite Exponentially Many Constraints

  262. e3e25cb54a89d63575071a99ca0ae7e925e62326.html

  263. ConvNeXt: A ConvNet for the 2020s

  264. The Defeat of the Winograd Schema Challenge

  265. Robust Self-Supervised Audio-Visual Speech Recognition

  266. AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

  267. Self-supervised Learning from 100 Million Medical Images

  268. The evolution of quantitative sensitivity

  269. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  270. XGLM: Few-shot Learning with Multilingual Language Models

  271. An Empirical Investigation of the Role of Pre-training in Lifelong Learning

  272. Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases

  273. Knowledge-Rich Self-Supervised Entity Linking

  274. You Only Need One Model for Open-domain Question Answering

  275. EBERT: Epigenomic language models powered by Cerebras

  276. MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning

  277. Improving language models by retrieving from trillions of tokens

  278. MLP Architectures for Vision-and-Language Modeling: An Empirical Study

  279. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning

  280. Sparse is Enough in Scaling Transformers

  281. Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?

  282. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

  283. L-Verse: Bidirectional Generation Between Image and Text

  284. RedCaps: web-curated image-text data created by the people, for the people

  285. Florence: A New Foundation Model for Computer Vision

  286. BASIC: Combined Scaling for Open-Vocabulary Image Classification

  287. Swin Transformer V2: Scaling Up Capacity and Resolution

  288. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

  289. Solving Linear Algebra by Program Synthesis

  290. Covariate Shift in High-Dimensional Random Feature Regression

  291. Solving Probability and Statistics Problems by Program Synthesis

  292. Few-Shot Self-Rationalization with Natural Language Prompts

  293. INTERN: A New Learning Paradigm Towards General Vision

  294. Scaling Law for Recommendation Models: Towards General-purpose User Representations

  295. MAE: Masked Autoencoders Are Scalable Vision Learners

  296. Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

  297. Scaling ASR Improves Zero and Few Shot Learning

  298. Turing-Universal Learners with Optimal Scaling Laws

  299. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

  300. Training Verifiers to Solve Math Word Problems

  301. Wide Neural Networks Forget Less Catastrophically

  302. When in Doubt, Summon the Titans: Efficient Inference with Large Models

  303. The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

  304. Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  305. LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5

  306. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

  307. Unsupervised Neural Machine Translation with Generative Language Models Only

  308. Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning

  309. Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

  310. M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

  311. A Few More Examples May Be Worth Billions of Parameters

  312. Exploring the Limits of Large Scale Pre-training

  313. Show Your Work: Scratchpads for Intermediate Computation with Language Models

  314. Mining for strong gravitational lenses with self-supervised learning

  315. Stochastic Training is Not Necessary for Generalization

  316. Evaluating Machine Accuracy on ImageNet

  317. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

  318. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

  319. Scaling Laws for Neural Machine Translation

  320. What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

  321. A Recipe For Arbitrary Text Style Transfer with Large Language Models

  322. TruthfulQA: Measuring How Models Mimic Human Falsehoods

  323. A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

  324. General-Purpose Question-Answering with Macaw

  325. An Empirical Exploration in Quality Filtering of Text Data

  326. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

  327. Want To Reduce Labeling Cost? GPT-3 Can Help

  328. Data and Parameter Scaling Laws for Neural Machine Translation

  329. Do Vision Transformers See Like Convolutional Neural Networks?

  330. Modeling Protein Using Large-scale Pretrain Language Model

  331. Scaling Laws for Deep Learning

  332. Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

  333. Facebook AI WMT21 News Translation Task Submission

  334. EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

  335. The History of Speech Recognition to the Year 2030

  336. The History of Speech Recognition to the Year 2030

  337. A Field Guide to Federated Optimization

  338. HTLM: Hyper-Text Pre-Training and Prompting of Language Models

  339. Brain-like functional specialization emerges spontaneously in deep neural networks

  340. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  341. Scarecrow: A Framework for Scrutinizing Machine Text

  342. The Dimpled Manifold Model of Adversarial Examples in Machine Learning

  343. Revisiting the Calibration of Modern Neural Networks

  344. Partial success in closing the gap between human and machine vision

  345. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

  346. Scaling Laws for Acoustic Models

  347. CoAtNet: Marrying Convolution and Attention for All Data Sizes

  348. Scaling Vision Transformers

  349. Exploring the Limits of Out-of-Distribution Detection

  350. Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images

  351. A Universal Law of Robustness via Isoperimetry

  352. Naver unveils first ‘hyperscale’ AI platform

  353. Unsupervised Speech Recognition

  354. One4all User Representation for Recommender Systems in E-commerce

  355. RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance

  356. Google details new AI accelerator chips

  357. MLP-Mixer: An all-MLP Architecture for Vision

  358. XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling

  359. Scaling End-to-End Models for Large-Scale Multilingual ASR

  360. DINO: Emerging Properties in Self-Supervised Vision Transformers

  361. What Are Bayesian Neural Network Posteriors Really Like?

  362. [Ali released PLUG: 27 billion parameters, the largest pre-trained language model in the Chinese community]

  363. The Power of Scale for Parameter-Efficient Prompt Tuning

  364. Revealing Persona Biases in Dialogue Systems

  365. CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP

  366. Probing Across Time: What Does RoBERTa Know and When?

  367. Memorization versus Generalization in Pre-trained Language Models

  368. Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation

  369. Scaling Laws for Language Transfer Learning

  370. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

  371. SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

  372. Understanding Robustness of Transformers for Image Classification

  373. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

  374. Controllable Generation from Pre-trained Language Models via Inverse Prompting

  375. The Shape of Learning Curves: a Review

  376. Efficient Visual Pretraining with Contrastive Detection

  377. Revisiting ResNets: Improved Training and Scaling Strategies

  378. Learning from videos to understand the world

  379. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

  380. Fast and Accurate Model Scaling

  381. Pretrained Transformers as Universal Computation Engines

  382. Greedy Hierarchical Variational Autoencoders (GHVAEs) for Large-Scale Video Prediction

  383. Measuring Mathematical Problem Solving With the MATH Dataset

  384. A law of robustness for two-layers neural networks

  385. SEER: Self-supervised Pretraining of Visual Features in the Wild

  386. M6: A Chinese Multimodal Pretrainer

  387. Zero-Shot Text-to-Image Generation

  388. Improved Denoising Diffusion Probabilistic Models

  389. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  390. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

  391. Explaining Neural Scaling Laws

  392. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

  393. NFNet: High-Performance Large-Scale Image Recognition Without Normalization

  394. Learning Curve Theory

  395. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

  396. Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling

  397. Scaling Laws for Transfer

  398. Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

  399. Muppet: Massive Multi-task Representations with Pre-Finetuning

  400. Language processing in brains and deep neural networks: computational convergence and its limits

  401. Meta Pseudo Labels

  402. CLIP: Learning Transferable Visual Models From Natural Language Supervision

  403. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

  404. CDLM: Cross-Document Language Modeling

  405. VinVL: Revisiting Visual Representations in Vision-Language Models

  406. Parameter Count vs Training Dataset Size (1952–2021)

  407. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

  408. Extrapolating GPT-N performance

  409. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

  410. CPM: A Large-scale Generative Chinese Pre-trained Language Model

  411. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

  412. When Do You Need Billions of Words of Pretraining Data?

  413. Scaling Laws for Autoregressive Generative Modeling

  414. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

  415. mT5: A massively multilingual pre-trained text-to-text transformer

  416. Beyond English-Centric Multilingual Machine Translation

  417. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

  418. Towards End-to-End In-Image Neural Machine Translation

  419. The first AI model that translates 100 languages without relying on English data

  420. WinoGrande: An Adversarial Winograd Schema Challenge at Scale

  421. The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

  422. Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

  423. The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing

  424. Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples

  425. Fast Stencil-Code Computation on a Wafer-Scale Processor

  426. Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  427. Small Data, Big Decisions: Model Selection in the Small-Data Regime

  428. New Report on How Much Computational Power It Takes to Match the Human Brain

  429. Generative Language Modeling for Automated Theorem Proving

  430. GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

  431. Accuracy and Performance Comparison of Video Action Recognition Approaches

  432. Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

  433. Matt Botvinick on the spontaneous emergence of learning algorithms

  434. Self-supervised learning through the eyes of a child

  435. On Robustness and Transferability of Convolutional Neural Networks

  436. Hopfield Networks is All You Need

  437. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing

  438. NVAE: A Deep Hierarchical Variational Autoencoder

  439. Measuring Robustness to Natural Distribution Shifts in Image Classification

  440. Is SGD a Bayesian sampler? Well, almost

  441. Unsupervised Cross-lingual Representation Learning for Speech Recognition

  442. Logarithmic Pruning is All You Need

  443. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

  444. Denoising Diffusion Probabilistic Models

  445. On the Predictability of Pruning Across Scales

  446. iGPT: Generative Pretraining from Pixels

  447. SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

  448. SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners

  449. Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples

  450. Are we done with ImageNet?

  451. OpenAI API

  452. Object Segmentation Without Labels with Large-Scale Generative Models

  453. How Big Should My Language Model Be?

  454. GPT-3 paper § Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace Stevens with the title ‘Shadows on the Way’

  455. Danny Hernandez on forecasting and the drivers of AI progress

  456. Powered by AI: Advancing product understanding and building new shopping experiences

  457. ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale

  458. Measuring the Algorithmic Efficiency of Neural Networks

  459. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning

  460. Jukebox: We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.

  461. Blender: A state-of-the-art open source chatbot

  462. A Review of Winograd Schema Challenge Datasets and Approaches

  463. Scaling Laws from the Data Manifold Dimension

  464. DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

  465. PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation

  466. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

  467. TTTTTackling WinoGrande Schemas

  468. A Metric Learning Reality Check

  469. Zoom In: An Introduction to Circuits—By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks

  470. Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

  471. Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

  472. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  473. The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism

  474. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence

  475. A Simple Framework for Contrastive Learning of Visual Representations

  476. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

  477. Turing-NLG: A 17-billion-parameter language model by Microsoft

  478. Quasi-Equivalence of Width and Depth of Neural Networks

  479. Impact of ImageNet Model Selection on Domain Adaptation

  480. Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks

  481. Towards a Conversational Agent that Can Chat About…Anything

  482. Towards a Human-like Open-Domain Chatbot

  483. Scaling Laws for Neural Language Models

  484. Scaling Laws for Neural Language Models: Figure 15: Far beyond the Model Sizes We Study Empirically, We Find a Contradiction between Our Equations § Pg17

  485. 20d126b9c3baf640f8d1d5dff3e253faac2e8242.pdf#page=17&org=openai

  486. The Importance of Deconstruction

  487. Big Transfer (BiT): General Visual Representation Learning

  488. 12-in-1: Multi-Task Vision and Language Representation Learning

  489. Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time

  490. Deep Double Descent: Where Bigger Models and More Data Hurt

  491. What’s Hidden in a Randomly Weighted Neural Network?

  492. Understanding the generalization of ‘lottery tickets’ in neural networks

  493. The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design

  494. Momentum Contrast for Unsupervised Visual Representation Learning

  495. SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning

  496. Self-training with Noisy Student improves ImageNet classification

  497. CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

  498. CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

  499. XLM-R: State-of-the-art cross-lingual understanding through self-supervision

  500. High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

  501. Unsupervised Cross-lingual Representation Learning at Scale

  502. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

  503. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

  504. Environmental drivers of systematicity and generalization in a situated agent

  505. A Constructive Prediction of the Generalization Error Across Scales

  506. Large-scale Pretraining for Neural Machine Translation with Tens of Billions of Sentence Pairs

  507. UNITER: UNiversal Image-TExt Representation Learning

  508. Exascale Deep Learning for Scientific Inverse Problems

  509. Simple, Scalable Adaptation for Neural Machine Translation

  510. CTRL: A Conditional Transformer Language Model For Controllable Generation

  511. Show Your Work: Improved Reporting of Experimental Results

  512. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  513. RoBERTa: A Robustly Optimized BERT Pretraining Approach

  514. Robustness properties of Facebook’s ResNeXt WSL models

  515. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

  516. Large Scale Adversarial Representation Learning

  517. One Epoch Is All You Need

  518. Does Learning Require Memorization? A Short Tale about a Long Tail

  519. Intriguing properties of adversarial training at scale

  520. Scaling Autoregressive Video Models

  521. A mathematical theory of semantic development in deep neural networks

  522. Adversarially Robust Generalization Just Requires More Unlabeled Data

  523. ICML 2019 Notes

  524. Are Labels Required for Improving Adversarial Robustness?

  525. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  526. SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

  527. Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm

  528. UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation

  529. Adversarial Examples Are Not Bugs, They Are Features

  530. Billion-scale semi-supervised learning for image classification

  531. VideoBERT: A Joint Model for Video and Language Representation Learning

  532. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

  533. Surprises in High-Dimensional Ridgeless Least Squares Interpolation

  534. The Bitter Lesson

  535. GPT-2 As Step Toward General Intelligence

  536. Deep Learning Hardware: Past, Present, & Future

  537. Language Models are Unsupervised Multitask Learners

  538. Better Language Models and Their Implications

  539. Do ImageNet Classifiers Generalize to ImageNet?

  540. Cross-lingual Language Model Pretraining

  541. Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified

  542. High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks: Videos

  543. Reconciling modern machine learning practice and the bias-variance trade-off

  544. nocaps: novel object captioning at scale

  545. On Lazy Training in Differentiable Programming

  546. How AI Training Scales

  547. Is Science Slowing Down?

  548. Large Scale GAN Training for High Fidelity Natural Image Synthesis

  549. BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis § 5.2 Additional Evaluation On JFT-300M

  550. Measurement invariance explains the universal law of generalization for psychological perception

  551. CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

  552. Large-Scale Visual Speech Recognition

  553. Troubling Trends in Machine Learning Scholarship

  554. Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations

  555. Neural scene representation and rendering

  556. GPT-1: Improving Language Understanding with Unsupervised Learning

  557. GPT-1: Improving Language Understanding by Generative Pre-Training

  558. GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications

  559. Do CIFAR-10 Classifiers Generalize to CIFAR-10?

  560. Deep learning generalizes because the parameter-function map is biased towards simple functions

  561. Google DeepMind founder and leader in artificial intelligence returns to Hamilton

  562. Exploring the Limits of Weakly Supervised Pretraining

  563. One Big Net For Everything

  564. Sensitivity and Generalization in Neural Networks: an Empirical Study

  565. Learning and Memorization

  566. ULMFiT: Universal Language Model Fine-tuning for Text Classification

  567. GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism § Pg4

  568. a8efcc8272af6f434119f87a00c2edaf84241597.pdf#page=4&org=google

  569. Deep image reconstruction from human brain activity

  570. Deep Learning Scaling is Predictable, Empirically

  571. Are GANs Created Equal? A Large-Scale Study

  572. Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN

  573. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

  574. There’s No Fire Alarm for Artificial General Intelligence

  575. WebVision Database: Visual Learning and Understanding from Web Data

  576. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

  577. Towards Deep Learning Models Resistant to Adversarial Attacks

  578. Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

  579. Learning to Learn from Noisy Web Videos

  580. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

  581. A simple neural network module for relational reasoning

  582. Deep Learning is Robust to Massive Label Noise

  583. Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset

  584. WebVision Challenge: Visual Learning and Understanding With Web Data

  585. Geometry of Optimization and Implicit Regularization in Deep Learning

  586. On the Impossibility of Supersized Machines

  587. Parallel Multiscale Autoregressive Density Estimation

  588. Universal representations: The missing link between faces, text, planktons, and cat breeds

  589. Estimation of Gap Between Current Language Models and Human Performance

  590. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

  591. Understanding deep learning requires rethinking generalization

  592. Why does deep and cheap learning work so well?

  593. The LAMBADA dataset: Word prediction requiring a broad discourse context

  594. Residual Networks Behave Like Ensembles of Relatively Shallow Networks

  595. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  596. PlaNet—Photo Geolocation with Convolutional Neural Networks

  597. Exploring the Limits of Language Modeling

  598. The Singularity: A Philosophical Analysis

  599. Microsoft researchers win ImageNet computer vision challenge

  600. The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition

  601. Net2Net: Accelerating Learning via Knowledge Transfer

  602. Generative Concatenative Nets Jointly Learn to Write and Classify Reviews

  603. Learning Visual Features from Large Weakly Supervised Data

  604. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

  605. Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification

  606. The Unreasonable Effectiveness of Recurrent Neural Networks

  607. LSTM: A Search Space Odyssey

  608. YFCC100M: The New Data in Multimedia Research

  609. Machine intelligence, part 1

  610. Evolution of the Human Brain: From Matter to Mind

  611. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

  612. Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]

  613. Neural Networks, Manifolds, and Topology

  614. Computing’s Energy Problem (and what we can do about it)

  615. N-gram Counts and Language Models from the Common Crawl

  616. Evolution of the human brain: when bigger is better

  617. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

  618. Algorithmic Progress in Six Domains

  619. Large-Scale Machine Learning Revisited [Slides]

  620. Intelligence Explosion Microeconomics

  621. Scalable Modified Kneser-Ney Language Model Estimation

  622. The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost

  623. Advantages of Artificial Intelligences, Uploads, and Digital Minds

  624. Recurrent Neural Network Based Language Model

  625. Understanding sources of inefficiency in general-purpose chips

  626. The Teenies

  627. Tick, tock, tick, tock… BING

  628. Halloween nightmare scenario, early 2020’s

  629. The Unreasonable Effectiveness of Data

  630. Economics Of The Singularity: Stuffed into skyscrapers by the billion, brainy bugbots will be the knowledge workers of the future

  631. Large Language Models in Machine Translation

  632. The Tradeoffs of Large-Scale Learning

  633. Cellular scaling rules for primate brains

  634. Robot Predictions Evolution

  635. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis

  636. Analytic and Algorithmic Solution of Random Satisfiability Problems

  637. A Bit of Progress in Language Modeling

  638. Scaling to Very Very Large Corpora for Natural Language Disambiguation

  639. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

  640. A Survey of Methods for Scaling Up Inductive Algorithms

  641. On The Effect of Data Set Size on Bias And Variance in Classification Learning

  642. The Anatomy of a Large-Scale Hypertextual Web Search Engine

  643. The Effects of Training Set Size on Decision Tree Complexity

  644. Rigorous Learning Curve Bounds from Statistical Mechanics

  645. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid

  646. Reflections After Refereeing Papers for NIPS

  647. Building a Large Annotated Corpus of English: The Penn Treebank

  648. Statistical Theory of Learning Curves under Entropic Loss Criterion

  649. Learning Curves: Asymptotic Values and Rate of Convergence

  650. Exhaustive Learning

  651. Computing with Connections

  652. Don’t Worry—It Can’t Happen

  653. Eric Michaud on Neural Quantum Interpretability

  654. 3d4ef31011b49fa3442733759bb92f0b3bb8b6c5.html#the-quantization-model-of-neural-scaling

  655. Billion-Scale Semi-Supervised Learning for State-Of-The-Art Image and Video Classification

  656. No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.

  657. Report Describes Apple’s ‘Organizational Dysfunction’ and ‘Lack of Ambition’ in AI

  658. StyleGAN-2 512px Trained on Danbooru2019

  659. Blake Bordelon

  660. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

  661. 71dc2bc8a6a0dd83f257bfd6d7fff056307131a2.html

  662. Komodo 8: the Smartphone vs Desktop Challenge

  663. Trading Off Compute in Training and Inference § Pruning

  664. Eric Tang

  665. How Can We Make Robotics More like Generative Modeling?

  666. a3524d3155b3ef44b83dfc99082aeb52e87a9bdc.html

  667. Inverse-Scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling

  668. Scaling up StyleGAN-2

  669. Semi Supervised Learning

  670. 2a16890d3828767743c0e7a177a4036828957ff4.html

  671. Homepage of Paul F. Christiano

  672. Statistical Modeling: The Two Cultures

  673. Jared Kaplan

  674. Safe Superintelligence Inc.

  675. OpenAI Disbands Its Robotics Research Team

  676. The Uneasy Relationship between Deep Learning and (classical) Statistics

  677. ed98775344f67ec385a16cd234c9c7888602e97f.html

  678. Parameter Counts in Machine Learning

  679. Can LLMs Learn from a Single Example?

  680. 5c73cf7b7ebdb67c15013107c0ba82613c5661ef.html

  681. Deciphering China’s AI Dream

  682. Jason Wei

  683. Appendix: More Is Different In Other Domains

  684. Understanding ‘Deep Double Descent’

  685. How Much Compute Was Used to Train DeepMind’s Generally Capable Agents?

  686. Why Neural Networks Generalise, and Why They Are (Kind Of) Bayesian

  687. What’s the Backward-Forward FLOP Ratio for Neural Networks?

  688. Optimality Is the Tiger, and Agents Are Its Teeth

  689. What Next? A Dozen Information-Technology Research Goals: 3. Turing’s Vision of Machine Intelligence

  690. 5620cc2a603069db2406c32006715aa6535d051b.pdf#page=11

  691. Was Linguistic A.I. Created by Accident?

  692. Ilya Sutskever: Deep Learning | AI Podcast #94 With Lex Fridman

  693. A Universal Law of Robustness

  694. Greg Brockman: OpenAI and AGI

  695. Season 1 Ep. 22 OpenAI’s Ilya Sutskever: The Man Who Made AI Work

  696. A Law of Robustness and the Importance of Overparameterization in Deep Learning

  697. WELM

  698. design#future-tag-features

  699. 2024-01-01-gwern-reddit-rmachinelearning-screenshotshowingscalingcentricdiscussions.png

  700. 2024-lin-figure2-inversescalingontruthfulqa.jpg

  701. 2024-smith-figure2-validationlossesofgalaxyimagepredictiontransformershowingscalingcurves.png

  702. 2024-smith-figure4-downstreamperformanceinastronomytasksfromgalaxypretrainedgpt2.png

  703. 2024-wang-figure1-writebenchcreativewritingscalingwithmodelsizeshowingweaveroutlier.jpg

  704. 2023-eldan-figure23-scalinglawoftinystoriesgpttransformermodelswithtrainingflops.jpg

  705. 2023-manvi-figure4-llmvstabularmachinelearningscalingofpredictionperformanceinsamplesize.png

  706. 2023-nguyen-figure6-stormerweatherforecastingscalesinmodelsizeanddatagranularity.png

  707. 2023-vu-figure2-largermorepowerfulllmsperformbetteronfastchangingquestionsorfalsepremisesinfreshqa.jpg

  708. 2023-wang-figure9-videodatascalingoftft2vvideogeneration.png

  709. 2023-bachmann-figure1-mlpcomputescalingoncifar100.jpg

  710. 2023-bachmann-figure4-mlpsscalewellwithincreasingbatchsize.jpg

  711. 2023-bachmann-figure5-scalingofmlpsoncifar10andimagenet1k.png

  712. 2023-bachmann-figure6-powerlawincifar100losswhenconstrainingparametersordatasetsize.jpg

  713. 2023-bachmann-figure7-suprachinchilladatascalingformlpsoncifar100loss.jpg

  714. 2023-girdhar-figure6-imagebindscalingofperformancewithincreasingclipimageencodersize.png

  715. 2022-10-06-robert-lesswrongmoreaudiblepodcast-itlookslikeyouretryingtotakeovertheworld.mp3

  716. 2022-maloney-figure11-equiparameterizationhypothesisshows1to1parameterdatascalingratioisoptimal.jpg

  717. 2022-press-figure1-scalingofgpt3modelperformanceoncompositionalcelebritiesdatasetshowingincreasingperformanceofbothsingleand2stepquestions.png

  718. 2022-zhu-figure9-webface260mcnnfacerecognitionscalingbyn.png

  719. 2022-radford-figure4-correlationofpretraininglanguagedatawithtranslationperformance.jpg

  720. 2022-radford-figure8-whisperscalingbymodelsize.png

  721. 2022-radford-figure9-crossoverinmonolingualvsmultilingualtrainingscalingshowseventualtransfer.jpg

  722. 2021-10-11-xinzhiyuan-inspursource10gpt245b.html

  723. 2021-goyal-figure1-seerscalinginparameters.png

  724. 2021-goyal-figure6-seerscalingindatan.jpg

  725. 2021-hernandez-transferlearning-figure1-transfervsfinetuning.png

  726. 2021-hu-figure1-lemontransformerscalingonmscocoimagecaptioning.png

  727. 2021-hu-figure2-a-datascalingfinetuningperformanceonmscoco.jpg

  728. 2021-lazaridou-figure3-incorrectverysmallscalescalingoftransformerxlmodelsdoesnotleadtolargeperformancegainsontemporaldriftbenchmark.png

  729. 2021-zhang-figure1a-conformermodelworderrorscalingindatasetsize.jpg

  730. 2021-zhang-figure2-conformerpmodelworderrorscalingratesindatasetsize.png

  731. 2021-schrittwieser-figure1-mspacmanmuzerologrewardscaling.jpg

  732. 2020-carlsmith-figure5-flopsbudgetestimates.png

  733. 2020-chrisdyer-aacl2020-machinetranslationscaling-ngramsvsrnns.jpg

  734. 2020-finnveden-extrapolationwcomparisons.png

  735. 2020-finnveden-normalizedlosscurves.jpg

  736. 2020-rosset-turingnlg-nlpmodelparametercountovertime.png

  737. 2019-liu-table4-robertabenefitsfromscalingdatasets10xoverbert.png

  738. 2018-howard-figure3-datascalingofrnnpretrainingfortextclassification.jpg

  739. 2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png

  740. 2015-krause-figure11-cub2002011imageclassificationlogarithmcscalinginnoisywebimagedatasetsize.png

  741. 2015-krause-table1-effectivenessofscalingupcnnsonlargenoisywebdatasetsvscompetitors.png

  742. 2014-cambria-figure1-hypotheticalnlpprogresscurves.png

  743. 2012-bottou-figure13-1-sgdtrainingtimetestlossvstron.png

  744. 2012-bottou-figure13-2-sgdtrainingtimetestlossvsconjugategradients.png

  745. 2011-torralba-table3-positivetransfervalueofimageclassificationdatasetsacrosstasksforsvmhogs.png

  746. 2009-12-07-shanelegg-supercomputerlinpackoverpast50years.png

  747. 2001-banko-figure1-scalingcurve.png

  748. 1987-sejnowski-figure1-historyofsupercomputersextrapolationvshumanbraincomputepower.jpg

  749. http://www.incompleteideas.net/Talks/UBC-2016.pdf

  750. f72c9193ec0797e087c54b37c78c937f371c14e1.pdf

  751. https://ai.facebook.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it

  752. https://ai.meta.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it/

  753. https://cacm.acm.org/research/the-decline-of-computers-as-a-general-purpose-technology/

  754. https://chinamediaproject.org/2024/05/27/goldfish-memories/

  755. https://github.com/Dicklesworthstone/the_lighthill_debate_on_ai

  756. https://github.com/features/copilot/

  757. https://karpathy.github.io/2022/03/14/lecun1989/

  758. https://markovbio.github.io/biomedical-progress/

  759. https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/

  760. https://nonint.com/2024/03/03/learned-structures/

  761. af8224aa65fc8deeae3d7f01dfdb6c757fc9e81e.html

  762. https://people.eecs.berkeley.edu/~hendrycks/

  763. https://research.google/blog/large-scale-matrix-factorization-on-tpus/

  764. https://research.google/blog/scalable-deep-reinforcement-learning-for-robotic-manipulation/

  765. https://scienceblogs.de/klausis-krypto-kolumne/2019/12/19/bigram-750-challenge-solved-new-world-record-set/

  766. https://thezvi.substack.com/p/on-openais-preparedness-framework

  767. https://time.com/6556168/when-ai-outsmart-humans/

  768. 6ed9020336d83d99170b949cec916aa4956131be.html

  769. https://towardsdatascience.com/deep-neural-networks-are-biased-at-initialisation-towards-simple-functions-a63487edcb99

  770. 0f162b51fc7a757ab7d88e33488ec7aba8b3cade.html

  771. https://towardsdatascience.com/neural-networks-are-fundamentally-bayesian-bee9a172fad8

  772. 9acc9e0ee1c122dbcde7f29ff81cac1a8fe86cca.html

  773. https://web.archive.org/web/20210415022657/http://starcraft.blizzplanet.com/blog/comments/blizzcon-2018-starcraft-ii-whats-next-panel-transcript

  774. 43c3fdb781650f965548c44b58fbdfbdb88fb557.html

  775. https://windowsontheory.org/2019/12/05/deep-double-descent/

  776. 02f979bd9e31739befbdf5901ec37946585f4c70.html

  777. https://www.beren.io/2022-08-06-The-scale-of-the-brain-vs-machine-learning/

  778. dd8fd35146fefa9309d4f7035b2bc7d3521c6ca1.html

  779. https://www.dwarkeshpatel.com/p/demis-hassabis#%C2%A7timestamps

  780. https://www.dwarkeshpatel.com/p/will-scaling-work

  781. https://www.lesswrong.com/posts/75o8oja43LXGAqbAR/palm-2-and-gpt-4-in-extrapolating-gpt-n-performance

  782. https://www.lesswrong.com/posts/B8Djo44WtZK6kK4K5/outreach-success-intro-to-ai-risk-that-has-been-successful

  783. https://www.lesswrong.com/posts/KbRxdBCcJqwtbiPzm/whisper-s-wild-implications-1

  784. https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai

  785. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=JptpWoG5DwNDXxykC

  786. bc345dee3ab06d698397aa1ad37545ce5cc3d6fc.html

  787. https://www.lesswrong.com/posts/dLXdCjxbJMGtDBWTH/no-one-in-my-org-puts-money-in-their-pension

  788. 364673ae891789274ebc60f881f0462b89431b03.html

  789. https://www.lesswrong.com/posts/qdStMFDMrWAnTqNWL/gpt-4-predictions

  790. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

  791. https://www.reddit.com/r/mlscaling/comments/1ggr0j4/neural_network_recognizer_for_handwritten_zip/

  792. https://www.reddit.com/r/reinforcementlearning/comments/nsi7bf/what_could_make_ai_conscious_with_wojciech/

  793. 65c3b9a7d465e757fd24e27abecc1eb424840ded.html

  794. https://x.com/AxSauer/status/1644264940218327042

  795. https://x.com/RichardSocher/status/1736161332259614989

  796. https://x.com/ShaneLegg/status/1648340576545169410

  797. https://x.com/andrewwhite01/status/1634728559506870274

  798. https://x.com/borgeaud_s/status/1780988694163321250

  799. https://x.com/davisblalock/status/1542929841338494976

  800. https://x.com/fluffykittnmeow/status/1737639861350269213

  801. https://x.com/olivercameron/status/1622802466470514688

  802. CT Foundation: Taking medical imaging embeddings 3D

  803. https://research.google/blog/taking-medical-imaging-embeddings-3d/

  804. Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

  805. Sam Bowman

  806. https://arxiv.org/abs/2407.04108

  807. Resolving Discrepancies in Compute-Optimal Scaling of Language Models

  808. https://arxiv.org/abs/2406.19146

  809. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

  810. https://arxiv.org/abs/2406.13121#google

  811. Probing the Decision Boundaries of In-context Learning in Large Language Models

  812. Aditya Grover

  813. https://arxiv.org/abs/2406.11233

  814. Training Compute-Optimal Protein Language Models

  815. https://www.biorxiv.org/content/10.1101/2024.06.06.597716.full

  816. AstroPT: Scaling Large Observation Models for Astronomy

  817. https://arxiv.org/abs/2405.14930

  818. GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

  819. https://arxiv.org/abs/2405.00332#scale

  820. Test-Time Augmentation to solve ARC

  821. https://lab42.global/community-interview-jack-cole/

  822. Chinchilla Scaling: A replication attempt

  823. https://arxiv.org/abs/2404.10102

  824. CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge

  825. https://arxiv.org/abs/2404.06664

  826. Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction

  827. https://arxiv.org/abs/2404.02905#bytedance

  828. Long-form factuality in large language models

  829. https://arxiv.org/abs/2403.18802#deepmind

  830. Mechanistic Design and Scaling of Hybrid Architectures

  831. Stefano Ermon

  832. https://arxiv.org/abs/2403.17844

  833. 8 Google Employees Invented Modern AI. Here’s the Inside Story: They met by chance, got hooked on an idea, and wrote the Transformers paper—the most consequential tech breakthrough in recent history

  834. https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/

  835. Inflection-2.5: meet the world’s best personal AI

  836. https://inflection.ai/inflection-2-5

  837. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU)

  838. https://arxiv.org/abs/2402.17152#facebook

  839. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

  840. Furu Wei

  841. https://arxiv.org/abs/2402.17764

  842. StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

  843. https://arxiv.org/abs/2402.16671

  844. TF-T2V: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

  845. https://arxiv.org/abs/2312.15770#alibaba

  846. Zoology: Measuring and Improving Recall in Efficient Language Models

  847. https://arxiv.org/abs/2312.04927

  848. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

  849. Aditya Grover

  850. https://arxiv.org/abs/2312.03876

  851. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

  852. Albert Gu

  853. Tri Dao

  854. https://arxiv.org/abs/2312.00752

  855. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

  856. https://arxiv.org/abs/2311.15599#tencent

  857. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

  858. https://arxiv.org/abs/2311.04145#alibaba

  859. ConvNets Match Vision Transformers at Scale

  860. https://arxiv.org/abs/2310.16764#deepmind

  861. PaLI-3 Vision Language Models: Smaller, Faster, Stronger

  862. Lucas Beyer

  863. https://arxiv.org/abs/2310.09199#google

  864. GeoLLM: Extracting Geospatial Knowledge from Large Language Models

  865. Stefano Ermon

  866. https://arxiv.org/abs/2310.06213

  867. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

  868. https://arxiv.org/abs/2310.06694

  869. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

  870. Jason Wei

  871. https://arxiv.org/abs/2310.03214#google

  872. Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors

  873. https://arxiv.org/abs/2310.02980

  874. Taken out of context: On measuring situational awareness in LLMs

  875. Owain Evans, AI Alignment Researcher

  876. https://arxiv.org/abs/2309.00667

  877. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

  878. https://arxiv.org/abs/2308.11596#facebook

  879. Simple synthetic data reduces sycophancy in large language models

  880. https://arxiv.org/abs/2308.03958#deepmind

  881. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

  882. Furu Wei

  883. https://arxiv.org/abs/2307.05300#microsoft

  884. Introducing Superalignment

  885. Jan Leike

  886. https://openai.com/index/introducing-superalignment/

  887. Gödel, Escher, Bach author Douglas Hofstadter on the state of AI today § What about AI terrifies you?

  888. https://www.youtube.com/watch?v=lfXxzAVtdpU&t=1763s

  889. Scaling MLPs: A Tale of Inductive Bias

  890. https://arxiv.org/abs/2306.13575

  891. Understanding Social Reasoning in Language Models with Language Models

  892. https://arxiv.org/abs/2306.15448

  893. The False Promise of Imitating Proprietary LLMs

  894. Sergey Levine

  895. https://arxiv.org/abs/2305.15717

  896. Scaling laws for language encoding models in fMRI

  897. https://arxiv.org/abs/2305.11863

  898. Google’s newest AI model uses nearly 5× more text data for training than its predecessor

  899. https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html

  900. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

  901. https://arxiv.org/abs/2305.07759#microsoft

  902. ImageBind: One Embedding Space To Bind Them All

  903. Zhuang Liu’s Homepage

  904. https://arxiv.org/abs/2305.05665#facebook

  905. Google’s DeepMind-Brain merger: tech giant regroups for AI battle

  906. https://www.ft.com/content/f4f73815-6fc2-4016-bd97-4bace459e95e

  907. DINOv2: Learning Robust Visual Features without Supervision

  908. https://arxiv.org/abs/2304.07193#facebook

  909. Sigmoid Loss for Language Image Pre-Training

  910. Lucas Beyer

  911. https://arxiv.org/abs/2303.15343#google

  912. How well do Large Language Models perform in Arithmetic tasks?

  913. https://arxiv.org/abs/2304.02015#alibaba

  914. Securing Liberal Democratic Control of AGI through UK Leadership

  915. https://jameswphillips.substack.com/p/securing-liberal-democratic-control

  916. GigaGAN: Scaling up GANs for Text-to-Image Synthesis

  917. https://arxiv.org/abs/2303.05511#adobe

  918. Scaling Vision Transformers to 22 Billion Parameters

  919. Robert Geirhos

  920. Lucas Beyer

  921. Yi Tay

  922. Neil Houlsby

  923. https://arxiv.org/abs/2302.05442#google

  924. Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards

  925. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335945

  926. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

  927. https://arxiv.org/abs/2301.09515#nvidia

  928. MUG: Vision Learners Meet Web Image-Text Pairs

  929. https://arxiv.org/abs/2301.07088#bytedance

  930. GPT-3 as Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities

  931. https://arxiv.org/abs/2301.04408

  932. Scaling Laws for Generative Mixed-Modal Language Models

  933. Omer Levy

  934. Luke Zettlemoyer

  935. https://arxiv.org/abs/2301.03728#facebook

  936. VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

  937. Furu Wei

  938. https://arxiv.org/abs/2301.02111#microsoft

  939. GPT-3 Takes the Bar Exam

  940. https://arxiv.org/abs/2212.14402

  941. Cramming: Training a Language Model on a Single GPU in One Day

  942. https://arxiv.org/abs/2212.14034

  943. One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)

  944. Yizhong Wang—University of Washington

  945. Luke Zettlemoyer

  946. https://arxiv.org/abs/2212.09741

  947. Reproducible scaling laws for contrastive language-image learning

  948. https://arxiv.org/abs/2212.07143

  949. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

  950. https://arxiv.org/abs/2212.04979#google

  951. VindLU: A Recipe for Effective Video-and-Language Pretraining

  952. Mohit Bansal

  953. https://arxiv.org/abs/2212.05051

  954. Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

  955. Alec Radford

  956. Jong Wook Kim

  957. https://arxiv.org/abs/2212.04356#openai

  958. MultiRay: Optimizing efficiency for large-scale AI models

  959. https://ai.facebook.com/blog/multiray-large-scale-AI-models/

  960. Galactica: A Large Language Model for Science

  961. https://arxiv.org/abs/2211.09085#facebook

  962. Large Language Models Struggle to Learn Long-Tail Knowledge

  963. Colin Raffel

  964. https://arxiv.org/abs/2211.08411

  965. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

  966. https://arxiv.org/abs/2211.07636#baai

  967. Adversarial Policies Beat Superhuman Go AIs

  968. Sergey Levine

  969. https://arxiv.org/abs/2211.00241

  970. Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)

  971. https://www.youtube.com/watch?v=Q-TJFyUoenc&t=2444s

  972. A Solvable Model of Neural Scaling Laws

  973. https://arxiv.org/abs/2210.16859

  974. Evaluating Parameter Efficient Learning for Generation

  975. https://arxiv.org/abs/2210.13673#nvidia

  976. FLAN: Scaling Instruction-Finetuned Language Models

  977. Barret Zoph

  978. Yi Tay

  979. Jason Wei

  980. https://arxiv.org/abs/2210.11416#google

  981. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

  982. https://arxiv.org/abs/2210.10341#microsoft

  983. Foundation Transformers

  984. Furu Wei

  985. https://arxiv.org/abs/2210.06423#microsoft

  986. Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)

  987. Noah A. Smith

  988. Mike Lewis

  989. https://arxiv.org/abs/2210.03350#allen

  990. GLM-130B: An Open Bilingual Pre-trained Model

  991. https://arxiv.org/abs/2210.02414#baai

  992. Ask Me Anything (AMA): A simple strategy for prompting language models

  993. https://arxiv.org/abs/2210.02441

  994. Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

  995. https://arxiv.org/abs/2208.05516

  996. PIXEL: Language Modeling with Pixels

  997. https://arxiv.org/abs/2207.06991

  998. Language Models (Mostly) Know What They Know

  999. Saurav Kadavath

  1000. About Me

  1001. Andy Jones

  1002. Sam Bowman

  1003. https://jack-clark.net/about/

  1004. Sam McCandlish

  1005. Jared Kaplan

  1006. https://arxiv.org/abs/2207.05221#anthropic

  1007. On-Device Training Under 256KB Memory

  1008. https://arxiv.org/abs/2206.15472

  1009. Beyond neural scaling laws: beating power law scaling via data pruning

  1010. Robert Geirhos

  1011. https://arxiv.org/abs/2206.14486

  1012. BigVGAN: A Universal Neural Vocoder with Large-Scale Training

  1013. https://arxiv.org/abs/2206.04658#nvidia

  1014. Toward a realistic model of speech processing in the brain with self-supervised learning

  1015. https://arxiv.org/abs/2206.01685

  1016. M3AE: Multimodal Masked Autoencoders Learn Transferable Representations

  1017. Sergey Levine

  1018. https://arxiv.org/abs/2205.14204#google

  1019. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

  1020. Jason Wei

  1021. https://arxiv.org/abs/2205.10625#google

  1022. Dialog Inpainting: Turning Documents into Dialogues

  1023. https://arxiv.org/abs/2205.09073#google

  1024. Unifying Language Learning Paradigms

  1025. Yi Tay

  1026. Neil Houlsby

  1027. https://arxiv.org/abs/2205.05131#google

  1028. Building Machine Translation Systems for the Next Thousand Languages

  1029. https://arxiv.org/abs/2205.03983#google

  1030. When does dough become a bagel? Analyzing the remaining mistakes on ImageNet

  1031. https://arxiv.org/abs/2205.04596#google

  1032. CoCa: Contrastive Captioners are Image-Text Foundation Models

  1033. https://arxiv.org/abs/2205.01917#google

  1034. Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

  1035. https://arxiv.org/abs/2205.01397

  1036. Flamingo: a Visual Language Model for Few-Shot Learning

  1037. Karen Simonyan

  1038. https://arxiv.org/abs/2204.14198#deepmind

  1039. WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

  1040. https://arxiv.org/abs/2204.10149

  1041. DeepMind: The Podcast—Excerpts on AGI

  1042. https://www.lesswrong.com/posts/SbAgRYo8tkHwhd9Qx/deepmind-the-podcast-excerpts-on-agi

  1043. Chinchilla: Training Compute-Optimal Large Language Models

  1044. Karen Simonyan

  1045. https://arxiv.org/abs/2203.15556#deepmind

  1046. Self-Consistency Improves Chain-of-Thought Reasoning in Language Models

  1047. Jason Wei

  1048. https://arxiv.org/abs/2203.11171#google

  1049. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

  1050. Jianfeng Gao at Microsoft Research

  1051. https://arxiv.org/abs/2203.03466#microsoft

  1052. FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

  1053. https://arxiv.org/abs/2203.00854

  1054. Self-Distilled StyleGAN: Towards Generation from Internet Photos

  1055. https://arxiv.org/abs/2202.12211#google

  1056. Brains and algorithms partially converge in natural language processing

  1057. https://www.nature.com/articles/s42003-022-03036-1

  1058. Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

  1059. https://arxiv.org/abs/2202.06767#huawei

  1060. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  1061. https://arxiv.org/abs/2202.03052#alibaba

  1062. Webly Supervised Concept Expansion for General Purpose Vision Models

  1063. https://arxiv.org/abs/2202.02317#allen

  1064. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

  1065. https://arxiv.org/abs/2202.00273

  1066. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

  1067. https://arxiv.org/abs/2201.11990#microsoftnvidia

  1068. Reasoning Like Program Executors

  1069. https://arxiv.org/abs/2201.11473#microsoft

  1070. Text and Code Embeddings by Contrastive Pre-Training

  1071. Alec Radford

  1072. Jong Wook Kim

  1073. Gretchen Krueger

  1074. Lil'Log

  1075. https://arxiv.org/abs/2201.10005#openai

  1076. SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models

  1077. Ross Girshick

  1078. Laurens Van Der Maaten

  1079. https://arxiv.org/abs/2201.08371#facebook

  1080. CM3: A Causal Masked Multimodal Model of the Internet

  1081. Mike Lewis

  1082. Luke Zettlemoyer

  1083. https://arxiv.org/abs/2201.07520#facebook

  1084. ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization

  1085. Zhilin Yang

  1086. https://arxiv.org/abs/2201.06910

  1087. ConvNeXt: A ConvNet for the 2020s

  1088. Zhuang Liu’s Homepage

  1089. https://arxiv.org/abs/2201.03545#facebook

  1090. The evolution of quantitative sensitivity

  1091. Steven T. Piantadosi

  1092. https://royalsocietypublishing.org/doi/10.1098/rstb.2020.0529

  1093. MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning

  1094. https://arxiv.org/abs/2112.05253

  1095. Improving language models by retrieving from trillions of tokens

  1096. Karen Simonyan

  1097. https://arxiv.org/abs/2112.04426#deepmind

  1098. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning

  1099. https://arxiv.org/abs/2111.12233#microsoft

  1100. Sparse is Enough in Scaling Transformers

  1101. Łukasz Kaiser

  1102. https://arxiv.org/abs/2111.12763#google

  1103. Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?

  1104. https://arxiv.org/abs/2111.11904#microsoft

  1105. L-Verse: Bidirectional Generation Between Image and Text

  1106. https://arxiv.org/abs/2111.11133

  1107. Florence: A New Foundation Model for Computer Vision

  1108. Jianfeng Gao at Microsoft Research

  1109. https://arxiv.org/abs/2111.11432#microsoft

  1110. BASIC: Combined Scaling for Open-Vocabulary Image Classification

  1111. Zihang Dai

  1112. https://arxiv.org/abs/2111.10050#google

  1113. Solving Probability and Statistics Problems by Program Synthesis

  1114. https://arxiv.org/abs/2111.08267

  1115. Scaling Law for Recommendation Models: Towards General-purpose User Representations

  1116. https://arxiv.org/abs/2111.11294

  1117. MAE: Masked Autoencoders Are Scalable Vision Learners

  1118. Ross Girshick

  1119. https://arxiv.org/abs/2111.06377#facebook

  1120. Turing-Universal Learners with Optimal Scaling Laws

  1121. Preetum Nakkiran

  1122. https://arxiv.org/abs/2111.05321

  1123. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

  1124. https://arxiv.org/abs/2111.02114#laion

  1125. Training Verifiers to Solve Math Word Problems

  1126. Jacob Hilton's Homepage

  1127. John Schulman’s Homepage

  1128. https://arxiv.org/abs/2110.14168#openai

  1129. Wide Neural Networks Forget Less Catastrophically

  1130. https://sites.google.com/view/razp/home

  1131. https://arxiv.org/abs/2110.11526#deepmind

  1132. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

  1133. https://arxiv.org/abs/2110.06990

  1134. Exploring the Limits of Large Scale Pre-training

  1135. Behnam Neyshabur

  1136. https://arxiv.org/abs/2110.02095#google

  1137. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

  1138. Yi Tay

  1139. https://arxiv.org/abs/2109.10686#google

  1140. TruthfulQA: Measuring How Models Mimic Human Falsehoods

  1141. Jacob Hilton's Homepage

  1142. Owain Evans, AI Alignment Researcher

  1143. https://arxiv.org/abs/2109.07958

  1144. General-Purpose Question-Answering with Macaw

  1145. https://arxiv.org/abs/2109.02593#allen

  1146. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

  1147. https://arxiv.org/abs/2108.13002#microsoft

  1148. Do Vision Transformers See Like Convolutional Neural Networks?

  1149. https://arxiv.org/abs/2108.08810#google

  1150. Scaling Laws for Deep Learning

  1151. Jonathan S. Rosenfeld

  1152. https://arxiv.org/abs/2108.07686

  1153. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  1154. Yu Sun

  1155. https://arxiv.org/abs/2107.02137#baidu

  1156. Scarecrow: A Framework for Scrutinizing Machine Text

  1157. Noah A. Smith

  1158. https://arxiv.org/abs/2107.01294#allen

  1159. Partial success in closing the gap between human and machine vision

  1160. Robert Geirhos

  1161. Matthias Bethge

  1162. https://arxiv.org/abs/2106.07411

  1163. Scaling Laws for Acoustic Models

  1164. https://arxiv.org/abs/2106.09488#amazon

  1165. CoAtNet: Marrying Convolution and Attention for All Data Sizes

  1166. Zihang Dai

  1167. https://arxiv.org/abs/2106.04803#google

  1168. Scaling Vision Transformers

  1169. Neil Houlsby

  1170. Lucas Beyer

  1171. https://arxiv.org/abs/2106.04560#google

  1172. Exploring the Limits of Out-of-Distribution Detection

  1173. https://arxiv.org/abs/2106.03004#google

  1174. Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images

  1175. https://arxiv.org/abs/2106.00116

  1176. A Universal Law of Robustness via Isoperimetry

  1177. https://arxiv.org/abs/2105.12806

  1178. Naver unveils first ‘hyperscale’ AI platform

  1179. https://m.koreaherald.com/view.php?ud=20210525000824#naver

  1180. Unsupervised Speech Recognition

  1181. https://arxiv.org/abs/2105.11084#facebook

  1182. Google details new AI accelerator chips

  1183. https://venturebeat.com/ai/google-details-new-ai-accelerator-chips/

  1184. MLP-Mixer: An all-MLP Architecture for Vision

  1185. Neil Houlsby

  1186. Lucas Beyer

  1187. Jakob Uszkoreit

  1188. https://arxiv.org/abs/2105.01601#google

  1189. XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling

  1190. https://arxiv.org/abs/2105.00572#facebook

  1191. DINO: Emerging Properties in Self-Supervised Vision Transformers

  1192. https://arxiv.org/abs/2104.14294#facebook

  1193. Understanding Robustness of Transformers for Image Classification

  1194. https://arxiv.org/abs/2103.14586#google

  1195. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

  1196. https://arxiv.org/abs/2103.13009#allen

  1197. Efficient Visual Pretraining with Contrastive Detection

  1198. https://arxiv.org/abs/2103.10957#deepmind

  1199. Revisiting ResNets: Improved Training and Scaling Strategies

  1200. Aravind Srinivas

  1201. Barret Zoph

  1202. https://arxiv.org/abs/2103.07579#google

  1203. Learning from videos to understand the world

  1204. Polina Kuznetsova

  1205. https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/

  1206. SEER: Self-supervised Pretraining of Visual Features in the Wild

  1207. https://arxiv.org/abs/2103.01988#facebook

  1208. Improved Denoising Diffusion Probabilistic Models

  1209. https://arxiv.org/abs/2102.09672#openai

  1210. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

  1211. https://arxiv.org/abs/2102.05918#google

  1212. NFNet: High-Performance Large-Scale Image Recognition Without Normalization

  1213. Karen Simonyan

  1214. https://arxiv.org/abs/2102.06171#deepmind

  1215. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

  1216. https://arxiv.org/abs/2102.02888#microsoft

  1217. Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling

  1218. https://arxiv.org/abs/2102.01951#scaling&org=deepmind

  1219. Meta Pseudo Labels

  1220. Zihang Dai

  1221. https://arxiv.org/abs/2003.10580#google

  1222. CLIP: Learning Transferable Visual Models From Natural Language Supervision

  1223. Alec Radford

  1224. Jong Wook Kim

  1225. Aditya A. Ramesh

  1226. Sandhini Agarwal

  1227. About Me

  1228. https://jack-clark.net/about/

  1229. Gretchen Krueger

  1230. https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf

  1231. Extrapolating GPT-N performance

  1232. https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance

  1233. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

  1234. https://arxiv.org/abs/2011.10650#openai

  1235. Scaling Laws for Autoregressive Generative Modeling

  1236. Jared Kaplan

  1237. Speaker Details: EmTech MIT 2023

  1238. Alec Radford

  1239. Aditya A. Ramesh

  1240. John Schulman’s Homepage

  1241. Sam McCandlish

  1242. https://arxiv.org/abs/2010.14701#openai

  1243. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

  1244. https://arxiv.org/abs/2010.14571#google

  1245. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

  1246. https://arxiv.org/abs/2010.10504#google

  1247. The first AI model that translates 100 languages without relying on English data

  1248. https://ai.meta.com/blog/introducing-many-to-many-multilingual-machine-translation/

  1249. Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  1250. Lucas Beyer

  1251. Jakob Uszkoreit

  1252. Neil Houlsby

  1253. https://arxiv.org/abs/2010.11929#google

  1254. New Report on How Much Computational Power It Takes to Match the Human Brain

  1255. https://www.openphilanthropy.org/research/new-report-on-how-much-computational-power-it-takes-to-match-the-human-brain/

  1256. Generative Language Modeling for Automated Theorem Proving

  1257. https://arxiv.org/abs/2009.03393#openai

  1258. Accuracy and Performance Comparison of Video Action Recognition Approaches

  1259. https://arxiv.org/abs/2008.09037

  1260. Matt Botvinick on the spontaneous emergence of learning algorithms

  1261. https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning

  1262. Hopfield Networks is All You Need

  1263. https://arxiv.org/abs/2008.02217

  1264. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing

  1265. Llion Jones

  1266. https://arxiv.org/abs/2007.06225

  1267. NVAE: A Deep Hierarchical Variational Autoencoder

  1268. https://arxiv.org/abs/2007.03898#nvidia

  1269. On the Predictability of Pruning Across Scales

  1270. Jonathan S. Rosenfeld

  1271. Jonathan Frankle—Chief Neural Network Scientist at Databricks

  1272. Michael Carbin

  1273. https://arxiv.org/abs/2006.10621

  1274. iGPT: Generative Pretraining from Pixels

  1275. Speaker Details: EmTech MIT 2023

  1276. Alec Radford

  1277. /doc/ai/nn/transformer/gpt/dall-e/1/2020-chen-2.pdf#openai

  1278. SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

  1279. https://arxiv.org/abs/2006.09882#facebook

  1280. Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples

  1281. Speaker Details: EmTech MIT 2023

  1282. Alec Radford

  1283. https://openai.com/index/image-gpt/

  1284. ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale

  1285. https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/

  1286. Jukebox: We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.

  1287. Jong Wook Kim

  1288. Alec Radford

  1289. https://openai.com/research/jukebox

  1290. Blender: A state-of-the-art open source chatbot

  1291. https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/

  1292. Scaling Laws from the Data Manifold Dimension

  1293. Jared Kaplan

  1294. https://arxiv.org/abs/2004.10802

  1295. DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

  1296. https://arxiv.org/abs/2004.08366#google

  1297. PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation

  1298. https://arxiv.org/abs/2004.07159#alibaba

  1299. The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism

  1300. https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/

  1301. A Simple Framework for Contrastive Learning of Visual Representations

  1302. https://arxiv.org/abs/2002.05709#google

  1303. Turing-NLG: A 17-billion-parameter language model by Microsoft

  1304. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

  1305. Towards a Conversational Agent that Can Chat About…Anything

  1306. https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/

  1307. Scaling Laws for Neural Language Models

  1308. Jared Kaplan

  1309. Sam McCandlish

  1310. Alec Radford

  1311. https://arxiv.org/abs/2001.08361#openai

  1312. The Importance of Deconstruction

  1313. Welcome

  1314. https://www.youtube.com/watch?v=kY2NHSKBi10

  1315. Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time

  1316. Preetum Nakkiran

  1317. Yamini Bansal

  1318. https://openai.com/research/deep-double-descent

  1319. What’s Hidden in a Randomly Weighted Neural Network?

  1320. https://arxiv.org/abs/1911.13299

  1321. Momentum Contrast for Unsupervised Visual Representation Learning

  1322. Ross Girshick

  1323. https://arxiv.org/abs/1911.05722#facebook

  1324. Self-training with Noisy Student improves ImageNet classification

  1325. https://arxiv.org/abs/1911.04252#google

  1326. Unsupervised Cross-lingual Representation Learning at Scale

  1327. Luke Zettlemoyer

  1328. https://arxiv.org/abs/1911.02116#facebook

  1329. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

  1330. https://arxiv.org/abs/1910.02054#microsoft

  1331. UNITER: UNiversal Image-TExt Representation Learning

  1332. https://arxiv.org/abs/1909.11740

  1333. CTRL: A Conditional Transformer Language Model For Controllable Generation

  1334. Caiming Xiong—Home Page

  1335. Richard Socher

  1336. https://arxiv.org/abs/1909.05858#salesforce

  1337. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  1338. https://nv-adlr.github.io/MegatronLM

  1339. RoBERTa: A Robustly Optimized BERT Pretraining Approach

  1340. Omer Levy

  1341. Mike Lewis

  1342. Luke Zettlemoyer

  1343. https://arxiv.org/abs/1907.11692#facebook

  1344. Large Scale Adversarial Representation Learning

  1345. Karen Simonyan

  1346. https://arxiv.org/abs/1907.02544

  1347. One Epoch Is All You Need

  1348. https://arxiv.org/abs/1906.06669

  1349. ICML 2019 Notes

  1350. https://david-abel.github.io/notes/icml_2019.pdf

  1351. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  1352. https://arxiv.org/abs/1905.11946#google

  1353. Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm

  1354. https://arxiv.org/abs/1905.10843

  1355. UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation

  1356. Furu Wei

  1357. Jianfeng Gao at Microsoft Research

  1358. https://arxiv.org/abs/1905.03197

  1359. Billion-scale semi-supervised learning for image classification

  1360. https://arxiv.org/abs/1905.00546#facebook

  1361. Better Language Models and Their Implications

  1362. Alec Radford

  1363. https://jack-clark.net/about/

  1364. Miles Brundage—About Me

  1365. https://openai.com/index/better-language-models/

  1366. Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified

  1367. https://melaniemitchell.me/aibook/

  1368. How AI Training Scales

  1369. Sam McCandlish

  1370. Jared Kaplan

  1371. https://openai.com/research/how-ai-training-scales

  1372. Is Science Slowing Down?

  1373. Scott Alexander

  1374. https://slatestarcodex.com/2018/11/26/is-science-slowing-down-2/

  1375. CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

  1376. https://arxiv.org/abs/1808.01097

  1377. GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications

  1378. Alec Radford

  1379. Tim Salimans

  1380. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5

  1381. Exploring the Limits of Weakly Supervised Pretraining

  1382. Ross Girshick

  1383. Laurens Van Der Maaten

  1384. https://arxiv.org/abs/1805.00932#facebook

  1385. ULMFiT: Universal Language Model Fine-tuning for Text Classification

  1386. https://arxiv.org/abs/1801.06146

  1387. Towards Deep Learning Models Resistant to Adversarial Attacks

  1388. Homepage: Aleksander Mądry

  1389. https://arxiv.org/abs/1706.06083

  1390. A simple neural network module for relational reasoning

  1391. https://sites.google.com/view/razp/home

  1392. https://arxiv.org/abs/1706.01427#deepmind

  1393. Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset

  1394. https://arxiv.org/abs/1705.07750#deepmind

  1395. WebVision Challenge: Visual Learning and Understanding With Web Data

  1396. https://arxiv.org/abs/1705.05640

  1397. Microsoft researchers win ImageNet computer vision challenge

  1398. https://blogs.microsoft.com/ai/microsoft-researchers-win-imagenet-computer-vision-challenge/

  1399. The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition

  1400. Jonathan Krause

  1401. https://arxiv.org/abs/1511.06789#google

  1402. Learning Visual Features from Large Weakly Supervised Data

  1403. Laurens Van Der Maaten

  1404. https://arxiv.org/abs/1511.02251#facebook

  1405. Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification

  1406. https://openaccess.thecvf.com/content_cvpr_2015/papers/Xiao_Learning_From_Massive_2015_CVPR_paper.pdf#baidu

  1407. N-gram Counts and Language Models from the Common Crawl

  1408. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf

  1409. Scalable Modified Kneser-Ney Language Model Estimation

  1410. https://aclanthology.org/P13-2121.pdf

  1411. Recurrent Neural Network Based Language Model

  1412. /doc/ai/nn/rnn/2010-mikolov.pdf

  1413. Understanding sources of inefficiency in general-purpose chips

  1414. /doc/cs/hardware/2010-hameed.pdf

  1415. Halloween nightmare scenario, early 2020’s

  1416. https://dw2blog.com/2009/11/02/halloween-nightmare-scenario-early-2020s/

  1417. Robot Predictions Evolution

  1418. https://web.archive.org/web/20230718144747/https://frc.ri.cmu.edu/~hpm/project.archive/robot.papers/2004/Predictions.html

  1419. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis

  1420. /doc/ai/scaling/2003-perlich.pdf

  1421. The Anatomy of a Large-Scale Hypertextual Web Search Engine

  1422. http://infolab.stanford.edu/~backrub/google.html

  1423. Homepage of Paul F. Christiano

  1424. https://paulfchristiano.com/

  1425. Wikipedia Bibliography:

    1. Algorithmic Information Theory

    2. Curse of Dimensionality § Blessing of Dimensionality

    3. Power Law

    4. Scale Invariance

    5. Stockfish (chess) § Fishtest

    6. Ludwig Schmidt

    7. Anthony Chen

    8. Jie Tang

    9. Yejin Choi

    10. Yi Jiang

    11. Quoc V. Le

    12. Kristian Kersting

    13. Christopher Ré

    14. Steven Levy

    15. Yu Shi

    16. Ying Shan

    17. Kang Zhao

    18. Samuel L. Smith

    19. Andrew Brock

    20. Xi Chen

    21. Xiao Wang

    22. David Lobell

    23. Danqi Chen

    24. Daniel Kokotajlo

    25. David Dale

    26. Jean Maillard

    27. Anna Sun

    28. Kevin Tran

    29. Yilin Yang

    30. Ann Lee

    31. Juan Pino

    32. Heng Ji

    33. Ilya Sutskever

    34. Douglas Hofstadter

    35. Thomas Hofmann

    36. Eric Wallace

    37. Charlie Snell

    38. Pieter Abbeel

    39. Dawn Song

    40. Ronen Eldan

    41. Jinyu Li

    42. Tom Goldstein

    43. Mari Ostendorf

    44. Noah Smith (writer)

    45. Ross Wightman

    46. Tao Zhu

    47. Yuan Cao

    48. Mi Zhang

    49. Soham Ghosh

    50. Greg Brockman

    51. Nikhil Gupta

    52. Michael Gschwind

    53. Ross Taylor

    54. Wen Wang

    55. Melanie Mitchell

    56. Ed H. Chi

    57. Jeff Dean

    58. Tao Qin

    59. Tie-Yan Liu

    60. Yifan Xu

    61. Desmond Elliott

    62. Scott Johnston

    63. Danny Hernandez

    64. Dario Amodei

    65. Surya Ganguli

    66. Lisa Lee

    67. Ed Chi

    68. Oriol Vinyals

    69. Andrew Zisserman

    70. Zheng Zhu

    71. Yang You

    72. Michal Irani

    73. Lu Hou

    74. Xin Jiang

    75. An Yang

    76. Christopher Clark

    77. Saurabh Tiwary

    78. Qian Liu

    79. Madeleine Thompson

    80. Trevor Darrell

    81. Jessica F. Cantlon

    82. Joseph M. Baker

    83. Zicheng Liu

    84. Xuedong Huang

    85. Nakul Verma

    86. Kaiming He

    87. Ashish Vaswani

    88. Stephanie Lin

    89. Wenjun Zeng

    90. Peng Sun

    91. Sébastien Bubeck

    92. Min Xu

    93. Ye Xia

    94. Wei Han

    95. Matthew Hutchinson

    96. Andrew Prout

    97. Sepp Hochreiter

    98. Ahmed Elnaggar

    99. Martin Steinegger

    100. Burkhard Rost

    101. Nir Shavit

    102. Christine Payne

    103. Jason Weston

    104. Ming Yan

    105. Karen Hao

    106. Mohammad Norouzi

    107. Geoffrey Hinton

    108. Boaz Barak

    109. Ali Farhadi

    110. Eduard Hovy

    111. Veselin Stoyanov

    112. Hsiao-Wuen Hon

    113. Daniela Amodei

    114. Yixuan Li

    115. Mateusz Malinowski

    116. Timothy Lillicrap

    117. Li Fei-Fei

    118. Tian Xia

    119. Philipp Koehn

    120. Tomas Mikolov

    121. Martin Karafiat

    122. Christos Kozyrakis

    123. Mark Horowitz

    124. Hans Moravec

    125. Foster Provost

    126. Sergey Brin

    127. Lawrence Page