Is OpenAI alright? How would we know and what would it look like?
WBE and DRL: a Middle Way of imitation learning from the human brain
Computer Optimization: Your Computer Is Faster Than You Think
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?
ABBYY’s Bitter Lesson: How Linguists Lost the Last Battle for NLP
Inference Scaling for Long-Context Retrieval Augmented Generation
Strategic Insights from Simulation Gaming of AI Race Dynamics
Gwern Branwen—How an Anonymous Researcher Predicted AI’s Trajectory
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Probing the Decision Boundaries of In-context Learning in Large Language Models
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
AI Will Become Mathematicians’ ‘Co-Pilot’: Fields Medalist Terence Tao explains how proof checkers and AI programs are dramatically changing mathematics
Position: Understanding LLMs Requires More Than Statistical Generalization
GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge
Conformer-1: Robust ASR via Large-Scale Semi-supervised Bootstrapping
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
8 Google Employees Invented Modern AI. Here’s the Inside Story: They met by chance, got hooked on an idea, and wrote the Transformers paper—the most consequential tech breakthrough in recent history
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU)
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Investigating Continual Pretraining in Large Language Models: Insights and Implications
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
I am a Strange Dataset: Metalinguistic Tests for Language Models
TF-T2V: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Zoology: Measuring and Improving Recall in Efficient Language Models
Seamless: Multilingual Expressive and Streaming Speech Translation
Scaling transformer neural networks for skillful and reliable medium-range weather forecasting
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Sequential Modeling Enables Scalable Learning for Large Vision Models
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models
Sam Altman accepts the 2023 Hawking Fellowship Award § Is there another breakthrough that’s needed to reach AGI?
Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?
GeoLLM: Extracting Geospatial Knowledge from Large Language Models
Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book
Taken out of context: On measuring situational awareness in LLMs
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Simple synthetic data reduces sycophancy in large language models
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Gödel, Escher, Bach author Douglas Hofstadter on the state of AI today § What about AI terrifies you?
Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
Understanding Social Reasoning in Language Models with Language Models
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Google’s newest AI model uses nearly 5× more text data for training than its predecessor
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Google’s DeepMind-Brain merger: tech giant regroups for AI battle
CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval
Emergent and Predictable Memorization in Large Language Models
Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s hearing on AI covered AI regulation and challenges, and the infamous open letter, which nearly everyone in the room thought was unwise
DINOv2: Learning Robust Visual Features without Supervision
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure
How well do Large Language Models perform in Arithmetic tasks?
Securing Liberal Democratic Control of AGI through UK Leadership
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)
John Carmack’s ‘Different Path’ to Artificial General Intelligence
Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards
StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
GPT-3 as Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Cramming: Training a Language Model on a Single GPU in One Day
Evolutionary-scale prediction of atomic level protein structure with a language model
Discovering Language Model Behaviors with Model-Written Evaluations
One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)
Reproducible scaling laws for contrastive language-image learning
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
VindLU: A Recipe for Effective Video-and-Language Pretraining
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Large Language Models Struggle to Learn Long-Tail Knowledge
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)
Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)
Ask Me Anything (AMA): A simple strategy for prompting language models
Do Current Multi-Task Optimization Methods in Deep Learning Even Help?
Monolith: Real Time Recommendation System With Collisionless Embedding Table
Machine Reading, Fast and Slow: When Do Models "Understand" Language?
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Efficient Training of Language Models to Fill in the Middle
Why do tree-based models still outperform deep learning on tabular data?
High-performing neural network models of visual cortex benefit from high latent dimensionality
Beyond neural scaling laws: beating power law scaling via data pruning
ProGen2: Exploring the Boundaries of Protein Language Models
Limitations of the NTK for Understanding Generalization in Deep Learning
Modeling Transformative AI Risks (MTAIR) Project—Summary Report
BigVGAN: A Universal Neural Vocoder with Large-Scale Training
Toward a realistic model of speech processing in the brain with self-supervised learning
Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power
M3AE: Multimodal Masked Autoencoders Learn Transferable Representations
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Continual Pre-Training Mitigates Forgetting in Language and Vision
Building Machine Translation Systems for the Next Thousand Languages
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
CoCa: Contrastive Captioners are Image-Text Foundation Models
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Continual Learning with Foundation Models: An Empirical Study of Latent Replay
WebFace260M: A Benchmark for Million-Scale Deep Face Recognition
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Chinchilla: Training Compute-Optimal Large Language Models
Self-Consistency Improves Chain-of-Thought Reasoning in Language Models
Effect of scale on catastrophic forgetting in neural networks
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
Performance reserves in brain-imaging-based phenotype prediction
Self-Distilled StyleGAN: Towards Generation from Internet Photos
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
Brains and algorithms partially converge in natural language processing
Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Webly Supervised Concept Expansion for General Purpose Vision Models
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models
ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
A High-Dimensional Sphere Spilling out of a High-Dimensional Cube despite Exponentially Many Constraints
AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
An Empirical Investigation of the Role of Pre-training in Lifelong Learning
Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases
You Only Need One Model for Open-domain Question Answering
MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Improving language models by retrieving from trillions of tokens
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
LEMON: Scaling Up Vision-Language Pre-training for Image Captioning
Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
RedCaps: web-curated image-text data created by the people, for the people
BASIC: Combined Scaling for Open-Vocabulary Image Classification
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Covariate Shift in High-Dimensional Random Feature Regression
Solving Probability and Statistics Problems by Program Synthesis
Few-Shot Self-Rationalization with Natural Language Prompts
Scaling Law for Recommendation Models: Towards General-purpose User Representations
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
When in Doubt, Summon the Titans: Efficient Inference with Large Models
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5
Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers
Unsupervised Neural Machine Translation with Generative Language Models Only
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning
Universal Paralinguistic Speech Representations Using Self-Supervised Conformers
M6–10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Mining for strong gravitational lenses with self-supervised learning
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
A Recipe For Arbitrary Text Style Transfer with Large Language Models
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
An Empirical Exploration in Quality Filtering of Text Data
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
Data and Parameter Scaling Laws for Neural Machine Translation
Do Vision Transformers See Like Convolutional Neural Networks?
Modeling Protein Using Large-scale Pretrain Language Model
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
Brain-like functional specialization emerges spontaneously in deep neural networks
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
The Dimpled Manifold Model of Adversarial Examples in Machine Learning
Partial success in closing the gap between human and machine vision
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
CoAtNet: Marrying Convolution and Attention for All Data Sizes
Effect of Pre-Training Scale on Intra/Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images
One4all User Representation for Recommender Systems in E-commerce
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
XLM-R XL: Larger-Scale Transformers for Multilingual Masked Language Modeling
Scaling End-to-End Models for Large-Scale Multilingual ASR
DINO: Emerging Properties in Self-Supervised Vision Transformers
[Alibaba released PLUG: 27 billion parameters, the largest pre-trained language model in the Chinese community]
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
Memorization versus Generalization in Pre-trained Language Models
Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation
Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
Understanding Robustness of Transformers for Image Classification
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Controllable Generation from Pre-trained Language Models via Inverse Prompting
Revisiting ResNets: Improved Training and Scaling Strategies
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Greedy Hierarchical Variational Autoencoders (GHVAEs) for Large-Scale Video Prediction
Measuring Mathematical Problem Solving With the MATH Dataset
SEER: Self-supervised Pretraining of Visual Features in the Wild
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
NFNet: High-Performance Large-Scale Image Recognition Without Normalization
1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed
Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
Muppet: Massive Multi-task Representations with Pre-Finetuning
Language processing in brains and deep neural networks: computational convergence and its limits
CLIP: Learning Transferable Visual Models From Natural Language Supervision
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
VinVL: Revisiting Visual Representations in Vision-Language Models
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
CPM: A Large-scale Generative Chinese Pre-trained Language Model
Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
mT5: A massively multilingual pre-trained text-to-text transformer
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
The first AI model that translates 100 languages without relying on English data
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing
Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Small Data, Big Decisions: Model Selection in the Small-Data Regime
New Report on How Much Computational Power It Takes to Match the Human Brain
Generative Language Modeling for Automated Theorem Proving
GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce
Accuracy and Performance Comparison of Video Action Recognition Approaches
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Matt Botvinick on the spontaneous emergence of learning algorithms
On Robustness and Transferability of Convolutional Neural Networks
ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing
Measuring Robustness to Natural Distribution Shifts in Image Classification
Unsupervised Cross-lingual Representation Learning for Speech Recognition
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners
Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples
Object Segmentation Without Labels with Large-Scale Generative Models
GPT-3 paper § Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace Stevens with the title ‘Shadows on the Way’
Danny Hernandez on forecasting and the drivers of AI progress
Powered by AI: Advancing product understanding and building new shopping experiences
ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale
Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning
Jukebox: We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.
A Review of Winograd Schema Challenge Datasets and Approaches
DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications
PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
Zoom In: An Introduction to Circuits—By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism
The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
A Simple Framework for Contrastive Learning of Visual Representations
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Turing-NLG: A 17-billion-parameter language model by Microsoft
Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks
Towards a Conversational Agent that Can Chat About…Anything
Scaling Laws for Neural Language Models: Figure 15: Far beyond the Model Sizes We Study Empirically, We Find a Contradiction between Our Equations § Pg17
Big Transfer (BiT): General Visual Representation Learning
12-in-1: Multi-Task Vision and Language Representation Learning
Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time
Deep Double Descent: Where Bigger Models and More Data Hurt
Understanding the generalization of ‘lottery tickets’ in neural networks
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
Momentum Contrast for Unsupervised Visual Representation Learning
SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning
Self-training with Noisy Student improves ImageNet classification
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
XLM-R: State-of-the-art cross-lingual understanding through self-supervision
High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks
Unsupervised Cross-lingual Representation Learning at Scale
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Environmental drivers of systematicity and generalization in a situated agent
A Constructive Prediction of the Generalization Error Across Scales
Large-scale Pretraining for Neural Machine Translation with Tens of Billions of Sentence Pairs
Simple, Scalable Adaptation for Neural Machine Translation
CTRL: A Conditional Transformer Language Model For Controllable Generation
Show Your Work: Improved Reporting of Experimental Results
MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Does Learning Require Memorization? A Short Tale about a Long Tail
A mathematical theory of semantic development in deep neural networks
Adversarially Robust Generalization Just Requires More Unlabeled Data
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers
Asymptotic learning curves of kernel methods: empirical data versus Teacher-Student paradigm
UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation
Billion-scale semi-supervised learning for image classification
VideoBERT: A Joint Model for Video and Language Representation Learning
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
Surprises in High-Dimensional Ridgeless Least Squares Interpolation
Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified
High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks: Videos
Reconciling modern machine learning practice and the bias-variance trade-off
Large Scale GAN Training for High Fidelity Natural Image Synthesis
BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis § 5.2 Additional Evaluation On JFT-300M
Measurement invariance explains the universal law of generalization for psychological perception
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations
GPT-1: Improving Language Understanding with Unsupervised Learning
GPT-1: Improving Language Understanding by Generative Pre-Training
GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications
Deep learning generalizes because the parameter-function map is biased towards simple functions
Google DeepMind founder and leader in artificial intelligence returns to Hamilton
Sensitivity and Generalization in Neural Networks: an Empirical Study
ULMFiT: Universal Language Model Fine-tuning for Text Classification
GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism § Pg4
Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
WebVision Database: Visual Learning and Understanding from Web Data
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
Towards Deep Learning Models Resistant to Adversarial Attacks
Gradient Diversity: a Key Ingredient for Scalable Distributed Learning
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset
WebVision Challenge: Visual Learning and Understanding With Web Data
Geometry of Optimization and Implicit Regularization in Deep Learning
Universal representations: The missing link between faces, text, planktons, and cat breeds
Estimation of Gap Between Current Language Models and Human Performance
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Understanding deep learning requires rethinking generalization
The LAMBADA dataset: Word prediction requiring a broad discourse context
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
PlaNet—Photo Geolocation with Convolutional Neural Networks
Microsoft researchers win ImageNet computer vision challenge
The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition
Generative Concatenative Nets Jointly Learn to Write and Classify Reviews
Learning Visual Features from Large Weakly Supervised Data
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
Clothing-1M: Learning from Massive Noisy Labeled Data for Image Classification
The Unreasonable Effectiveness of Recurrent Neural Networks
In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning
Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost
Advantages of Artificial Intelligences, Uploads, and Digital Minds
Understanding sources of inefficiency in general-purpose chips
Economics Of The Singularity: Stuffed into skyscrapers by the billion, brainy bugbots will be the knowledge workers of the future
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis
Analytic and Algorithmic Solution of Random Satisfiability Problems
Scaling to Very Very Large Corpora for Natural Language Disambiguation
On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
On The Effect of Data Set Size on Bias And Variance in Classification Learning
The Anatomy of a Large-Scale Hypertextual Web Search Engine
The Effects of Training Set Size on Decision Tree Complexity
Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid
Building a Large Annotated Corpus of English: The Penn Treebank
Statistical Theory of Learning Curves under Entropic Loss Criterion
Learning Curves: Asymptotic Values and Rate of Convergence
The Quantization Model of Neural Scaling
Billion-Scale Semi-Supervised Learning for State-Of-The-Art Image and Video Classification
No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.
Report Describes Apple’s ‘Organizational Dysfunction’ and ‘Lack of Ambition’ in AI
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Inverse-Scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling
The Uneasy Relationship between Deep Learning and (classical) Statistics
How Much Compute Was Used to Train DeepMind's Generally Capable Agents?
Why Neural Networks Generalise, and Why They Are (Kind Of) Bayesian
What’s the Backward-Forward FLOP Ratio for Neural Networks?
What Next? A Dozen Information-Technology Research Goals: 3. Turing’s Vision of Machine Intelligence
Ilya Sutskever: Deep Learning | AI Podcast #94 With Lex Fridman
Season 1 Ep. 22 OpenAI's Ilya Sutskever: The Man Who Made AI Work
A Law of Robustness and the Importance of Overparameterization in Deep Learning
2024-01-01-gwern-reddit-rmachinelearning-screenshotshowingscalingcentricdiscussions.png
2024-smith-figure2-validationlossesofgalaxyimagepredictiontransformershowingscalingcurves.png
2024-smith-figure4-downstreamperformanceinastronomytasksfromgalaxypretrainedgpt2.png
2024-wang-figure1-writebenchcreativewritingscalingwithmodelsizeshowingweaveroutlier.jpg
2023-eldan-figure23-scalinglawoftinystoriesgpttransformermodelswithtrainingflops.jpg
2023-manvi-figure4-llmvstabularmachinelearningscalingofpredictionperformanceinsamplesize.png
2023-nguyen-figure6-stormerweatherforecastingscalesinmodelsizeanddatagranularity.png
2023-vu-figure2-largermorepowerfulllmsperformbetteronfastchangingquestionsorfalsepremisesinfreshqa.jpg
2023-wang-figure9-videodatascalingoftft2vvideogeneration.png
2023-bachmann-figure4-mlpsscalewellwithincreasingbatchsize.jpg
2023-bachmann-figure5-scalingofmlpsoncifar10andimagenet1k.png
2023-bachmann-figure6-powerlawincifar100losswhenconstrainingparametersordatasetsize.jpg
2023-bachmann-figure7-suprachinchilladatascalingformlpsoncifar100loss.jpg
2023-girdhar-figure6-imagebindscalingofperformancewithincreasingclipimageencodersize.png
2022-10-06-robert-lesswrongmoreaudiblepodcast-itlookslikeyouretryingtotakeovertheworld.mp3
2022-maloney-figure11-equiparameterizationhypothesisshows1to1parameterdatascalingratioisoptimal.jpg
2022-press-figure1-scalingofgpt3modelperformanceoncompositionalcelebritiesdatasetshowingincreasingperformanceofbothsingleand2stepquestions.png
2022-zhu-figure9-webface260mcnnfacerecognitionscalingbyn.png
2022-radford-figure4-correlationofpretraininglanguagedatawithtranslationperformance.jpg
2022-radford-figure9-crossoverinmonolingualvsmultilingualtrainingscalingshowseventualtransfer.jpg
2021-hernandez-transferlearning-figure1-transfervsfinetuning.png
2021-hu-figure1-lemontransformerscalingonmscocoimagecaptioning.png
2021-hu-figure2-a-datascalingfinetuningperformanceonmscoco.jpg
2021-lazaridou-figure3-incorrectverysmallscalescalingoftransformerxlmodelsdoesnotleadtolargeperformancegainsontemporaldriftbenchmark.png
2021-zhang-figure1a-conformermodelworderrorscalingindatasetsize.jpg
2021-zhang-figure2-conformerpmodelworderrorscalingratesindatasetsize.png
2021-schrittwieser-figure1-mspacmanmuzerologrewardscaling.jpg
2020-chrisdyer-aacl2020-machinetranslationscaling-ngramsvsrnns.jpg
2019-liu-table4-robertabenefitsfromscalingdatasets10xoverbert.png
2018-howard-figure3-datascalingofrnnpretrainingfortextclassification.jpg
2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png
2015-krause-figure11-cub2002011imageclassificationlogarithmcscalinginnoisywebimagedatasetsize.png
2015-krause-table1-effectivenessofscalingupcnnsonlargenoisywebdatasetsvscompetitors.png
2012-bottou-figure13-2-sgdtrainingtimetestlossvsconjugategradients.png
2011-torralba-table3-positivetransfervalueofimageclassificationdatasetsacrosstasksforsvmhogs.png
2009-12-07-shanelegg-supercomputerlinpackoverpast50years.png
1987-sejnowski-figure1-historyofsupercomputersextrapolationvshumanbraincomputepower.jpg
https://ai.meta.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it/
https://cacm.acm.org/research/the-decline-of-computers-as-a-general-purpose-technology/
https://chinamediaproject.org/2024/05/27/goldfish-memories/
https://github.com/Dicklesworthstone/the_lighthill_debate_on_ai
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
https://research.google/blog/large-scale-matrix-factorization-on-tpus/
https://research.google/blog/scalable-deep-reinforcement-learning-for-robotic-manipulation/
https://scienceblogs.de/klausis-krypto-kolumne/2019/12/19/bigram-750-challenge-solved-new-world-record-set/
https://thezvi.substack.com/p/on-openais-preparedness-framework
https://towardsdatascience.com/deep-neural-networks-are-biased-at-initialisation-towards-simple-functions-a63487edcb99
https://towardsdatascience.com/neural-networks-are-fundamentally-bayesian-bee9a172fad8
https://web.archive.org/web/20210415022657/http://starcraft.blizzplanet.com/blog/comments/blizzcon-2018-starcraft-ii-whats-next-panel-transcript
https://windowsontheory.org/2019/12/05/deep-double-descent/
https://www.beren.io/2022-08-06-The-scale-of-the-brain-vs-machine-learning/
https://www.dwarkeshpatel.com/p/demis-hassabis#%C2%A7timestamps
https://www.lesswrong.com/posts/75o8oja43LXGAqbAR/palm-2-and-gpt-4-in-extrapolating-gpt-n-performance
https://www.lesswrong.com/posts/B8Djo44WtZK6kK4K5/outreach-success-intro-to-ai-risk-that-has-been-successful
https://www.lesswrong.com/posts/KbRxdBCcJqwtbiPzm/whisper-s-wild-implications-1
https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai
https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=JptpWoG5DwNDXxykC
https://www.lesswrong.com/posts/dLXdCjxbJMGtDBWTH/no-one-in-my-org-puts-money-in-their-pension
https://www.lesswrong.com/posts/qdStMFDMrWAnTqNWL/gpt-4-predictions
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
https://www.reddit.com/r/mlscaling/comments/1ggr0j4/neural_network_recognizer_for_handwritten_zip/
https://www.reddit.com/r/reinforcementlearning/comments/nsi7bf/what_could_make_ai_conscious_with_wojciech/
https://research.google/blog/taking-medical-imaging-embeddings-3d/
Jonathan Frankle—Chief Neural Network Scientist at Databricks
Wikipedia Bibliography: