Efficient Attention: Breaking The Quadratic Transformer Bottleneck
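The bottleneck named in the title: vanilla self-attention compares every query token against every key token, so an n-token sequence materializes an n × n score matrix, and both compute and memory grow quadratically with context length. A minimal NumPy sketch (illustrative only; the function and variable names are mine, not drawn from any of the papers listed below) makes that n² intermediate explicit:

```python
# Illustrative sketch of why vanilla self-attention is quadratic in sequence length n:
# the full n x n score matrix is materialized before the softmax.
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V each have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): O(n^2 d) time, O(n^2) memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d)

n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (2048, 64); the score matrix alone holds n**2 ≈ 4.2M entries
```

The papers and posts collected below attack that n² term from many directions: sparse, kernelized, or low-rank approximations; recurrent and state-space replacements; IO-aware exact attention; and distillation or architectural pruning.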
Hymba: A Hybrid-head Architecture for Small Language Models
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Ask, and it shall be given: Turing completeness of prompting
Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
A Theoretical Understanding of Self-Correction through In-context Alignment
Retrieval Head Mechanistically Explains Long-Context Factuality
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Towards smaller, faster decoder-only transformers: Architectural variants and their implications
Streamlining Redundant Layers to Compress Large Language Models
8 Google Employees Invented Modern AI. Here’s the Inside Story: They met by chance, got hooked on an idea, and wrote the Transformers paper—the most consequential tech breakthrough in recent history
How Well Can Transformers Emulate In-context Newton’s Method?
RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
A phase transition between positional and semantic learning in a solvable model of dot-product attention
Attention versus Contrastive Learning of Tabular Data—A Data-centric Benchmarking
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Efficient Transformer Knowledge Distillation: A Performance Review
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
On prefrontal working memory and hippocampal episodic memory: Unifying memories stored in weights and activation slots
LSS Transformer: Ultra-Long Sequence Distributed Transformer
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
Not all layers are equally as important: Every Layer Counts BERT
Implicit Chain-of-Thought Reasoning via Knowledge Distillation
Training Dynamics of Contextual N-Grams in Language Models
The Impact of Depth and Width on Transformer Language Model Generalization
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Characterizing Mechanisms for Factual Recall in Language Models
Linear Representations of Sentiment in Large Language Models
Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?
Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Activation Addition: Steering Language Models Without Optimization
Linearity of Relation Decoding in Transformer Language Models
The Hydra Effect: Emergent Self-repair in Language Model Computations
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
Transformers learn to implement preconditioned gradient descent for in-context learning
Blockwise Parallel Transformer for Long Context Large Models
TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Finding Neurons in a Haystack: Case Studies with Sparse Probing
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Tighter Bounds on the Expressivity of Transformer Encoders
Tracr: Compiled Transformers as a Laboratory for Interpretability
Skip-Attention: Improving Vision Transformers by Paying Less Attention
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
What learning algorithm is in-context learning? Investigations with linear models
Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Transformers Implement First-Order Logic with Majority Quantifiers
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Multitrack Music Transformer: Learning Long-Term Dependencies in Music with Diverse Instruments
Log-Precision Transformers are Constant-Depth Uniform Threshold Circuits
Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
General-purpose, long-context autoregressive modeling with Perceiver AR
The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention
An Explanation of In-context Learning as Implicit Bayesian Inference
Long-Range Transformers for Dynamic Spatiotemporal Forecasting
Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation
Do Vision Transformers See Like Convolutional Neural Networks?
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
Not All Images are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
RoFormer: Enhanced Transformer with Rotary Position Embedding
ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Do Transformer Modifications Transfer Across Implementations and Applications?
Unlocking Pixels for Reinforcement Learning via Implicit Attention
AdnFM: An Attentive DenseNet based Factorization Machine for CTR Prediction
Inductive Biases for Deep Learning of Higher-Level Cognition
Long Range Arena (LRA): A Benchmark for Efficient Transformers
Current Limitations of Language Models: What You Need is Retrieval
HiPPO: Recurrent Memory with Optimal Polynomial Projections
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Synthesizer: Rethinking Self-Attention in Transformer Models
Generalization through Memorization: Nearest Neighbor Language Models
The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives
Pay Less Attention with Lightweight and Dynamic Convolutions
On the Turing Completeness of Modern Neural Network Architectures
Character-Level Language Modeling with Deeper Self-Attention
Get To The Point: Summarization with Pointer-Generator Networks
Hybrid computing using a neural network with dynamic external memory
Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
Iterative Alternating Neural Attention for Machine Reading
Neural Machine Translation by Jointly Learning to Align and Translate
Hierarchical Object Detection With Deep Reinforcement Learning
The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)
Learning to Combine Foveal Glimpses With a Third-Order Boltzmann Machine
Show, Attend and Tell: Neural Image Caption Generation With Visual Attention
A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer
FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision
Figure (Charles Foster, 2023-09-08): AUNN variant with causal decoder attention
Figure 2 (Trockman 2023): attention-map patterns by initialization and level of training show priors
Figure 3 (Hassid 2022): larger Transformers make more use of attention, when ablated, on benchmark performance
Figure 5 (Tay 2022): scaling of model by MLP feed-forward parameters
Meme (Gwern, 2020-08-11): two astronauts, "Hopfield networks are all you need"
https://bclarkson-code.github.io/posts/llm-from-scratch-scalar-autograd/post.html
https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
https://mehta-rohan.com/writings/blog_posts/attention.html
https://nostalgebraist.tumblr.com/post/740164510909890560/information-flow-in-transformers
https://www.beren.io/2024-03-03-Linear-Attention-as-Iterated-Hopfield-Networks/
https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
https://www.lesswrong.com/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms
https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/takeaways-from-a-mechanistic-interpretability-project-on
https://www.lesswrong.com/posts/K7AyY8LMrcKhwfbyj/no-really-attention-is-all-you-need-attention-can-do
https://www.lesswrong.com/posts/euam65XjigaCJQkcN/an-analogy-for-understanding-transformers
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
https://www.lesswrong.com/posts/kobJymvvcvhbjWFKe/laying-the-foundations-for-vision-and-multimodal-mechanistic
https://www.lesswrong.com/posts/nuJFTS5iiJKT5G5yh/polysemantic-attention-head-in-a-4-layer-transformer
https://www.lesswrong.com/posts/thePw6qdyabD8XR4y/interpreting-openai-s-whisper
https://www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only
https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
https://www.perfectlynormal.co.uk/blog-induction-heads-illustrated