“‘Self-Attention’ Tag”, 2019-12-17:
Bibliography for tag ai/nn/transformer/attention (most recent first): 9 related tags, 176 annotations, & 44 links.
- See Also
- Gwern
- Links
- “Hymba: A Hybrid-Head Architecture for Small Language Models”, Dong et al 2024
- “Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models”, Ruis et al 2024
- “Long Context RAG Performance of Large Language Models”, et al 2024
- “Ask, and It Shall Be Given: Turing Completeness of Prompting”, et al 2024
- “Tackling the Abstraction and Reasoning Corpus With Vision Transformers: the Importance of 2D Representation, Positions, and Objects”, et al 2024
- “Differential Transformer”, Ye et al 2024
- “Were RNNs All We Needed?”, Feng et al 2024
- “NGPT: Normalized Transformer With Representation Learning on the Hypersphere”, Loshchilov et al 2024
- “Masked Mixers for Language Generation and Retrieval”, 2024
- “The Mamba in the Llama: Distilling and Accelerating Hybrid Models”, Wang et al 2024
- “When Can Transformers Count to n?”, Yehudai et al 2024
- “What Matters in Transformers? Not All Attention Is Needed”, et al 2024
- “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, et al 2024
- “An Empirical Study of Mamba-Based Language Models”, Waleffe et al 2024
- “Attention As a Hypernetwork”, et al 2024
- “Scalable Matmul-Free Language Modeling”, Zhu et al 2024
- “A Theoretical Understanding of Self-Correction through In-Context Alignment”, et al 2024
- “Attention As an RNN”, et al 2024
- “Your Transformer Is Secretly Linear”, Razzhigaev et al 2024
- “Retrieval Head Mechanistically Explains Long-Context Factuality”, et al 2024
- “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models”, Pfau et al 2024
- “Towards Smaller, Faster Decoder-Only Transformers: Architectural Variants and Their Implications”, Suresh & P 2024
- “ReFT: Representation Finetuning for Language Models”, Wu et al 2024
- “Do Language Models Plan Ahead for Future Tokens?”, et al 2024
- “Streamlining Redundant Layers to Compress Large Language Models”, et al 2024
- “Long-Form Factuality in Large Language Models”, Wei et al 2024
- “Mechanistic Design and Scaling of Hybrid Architectures”, Poli et al 2024
- “8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Levy 2024
- “How Well Can Transformers Emulate In-Context Newton’s Method?”, et al 2024
- “RNNs Are Not Transformers (Yet): The Key Bottleneck on In-Context Retrieval”, et al 2024
- “A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention”, et al 2024
- “Rethinking Patch Dependence for Masked Autoencoders”, et al 2024
- “Attention versus Contrastive Learning of Tabular Data—A Data-Centric Benchmarking”, et al 2024
- “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”
- “SwitchHead: Accelerating Transformers With Mixture-Of-Experts Attention”, Csordás et al 2023
- “Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models”, 2023
- “Can a Transformer Represent a Kalman Filter?”, 2023
- “Efficient Transformer Knowledge Distillation: A Performance Review”, et al 2023
- “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, et al 2023
- “In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, et al 2023
- “On Prefrontal Working Memory and Hippocampal Episodic Memory: Unifying Memories Stored in Weights and Activation Slots”, et al 2023
- “LSS Transformer: Ultra-Long Sequence Distributed Transformer”, et al 2023
- “Simplifying Transformer Blocks”, He & Hofmann 2023
- “GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling”, Katsch 2023
- “Not All Layers Are Equally As Important: Every Layer Counts BERT”, 2023
- “Implicit Chain-Of-Thought Reasoning via Knowledge Distillation”, et al 2023
- “Training Dynamics of Contextual N-Grams in Language Models”, et al 2023
- “The Impact of Depth and Width on Transformer Language Model Generalization”, et al 2023
- “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, et al 2023
- “Characterizing Mechanisms for Factual Recall in Language Models”, et al 2023
- “Linear Representations of Sentiment in Large Language Models”, et al 2023
- “Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages”, et al 2023
- “How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?”, et al 2023
- “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”, et al 2023
- “Vision Transformers Need Registers”, Darcet et al 2023
- “Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, et al 2023
- “Replacing Softmax With ReLU in Vision Transformers”, Wortsman et al 2023
- “One Wide Feedforward Is All You Need”, et al 2023
- “Activation Addition: Steering Language Models Without Optimization”, Turner et al 2023
- “Linearity of Relation Decoding in Transformer Language Models”, Hernandez et al 2023
- “The Hydra Effect: Emergent Self-Repair in Language Model Computations”, McGrath et al 2023
- “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023
- “FlashAttention-2: Faster Attention With Better Parallelism and Work Partitioning”, Dao 2023
- “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, et al 2023
- “Lost in the Middle: How Language Models Use Long Contexts”, Liu et al 2023
- “Trainable Transformer in Transformer”, et al 2023
- “Transformers Learn to Implement Preconditioned Gradient Descent for In-Context Learning”, et al 2023
- “White-Box Transformers via Sparse Rate Reduction”, et al 2023
- “Blockwise Parallel Transformer for Long Context Large Models”, 2023
- “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, 2023
- “Brainformers: Trading Simplicity for Efficiency”, et al 2023
- “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, Ainslie et al 2023
- “Mimetic Initialization of Self-Attention Layers”, Trockman & Kolter 2023
- “Toeplitz Neural Network for Sequence Modeling”, et al 2023
- “Finding Neurons in a Haystack: Case Studies With Sparse Probing”, Gurnee et al 2023
- “How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model”, Hanna et al 2023
- “Coinductive Guide to Inductive Transformer Heads”, 2023
- “Tighter Bounds on the Expressivity of Transformer Encoders”, et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
- “Skip-Attention: Improving Vision Transformers by Paying Less Attention”, et al 2023
- “Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Fu et al 2022
- “Scalable Adaptive Computation for Iterative Generation”, et al 2022
- “Pretraining Without Attention”, et al 2022
- “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
- “Transformers Learn In-Context by Gradient Descent”, von Oswald et al 2022
- “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
- “Efficiently Scaling Transformer Inference”, Pope et al 2022
- “Transformers Learn Shortcuts to Automata”, Liu et al 2022
- “Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling”, et al 2022
- “Transformers Implement First-Order Logic With Majority Quantifiers”, 2022
- “Relaxed Attention for Transformer Models”, et al 2022
- “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
- “Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, et al 2022
- “N-Grammer: Augmenting Transformers With Latent n-Grams”, et al 2022
- “Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022
- “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, et al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
- “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, et al 2022
- “Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022
- “It’s Raw! Audio Generation With State-Space Models”, Goel et al 2022
- “General-Purpose, Long-Context Autoregressive Modeling With Perceiver AR”, Hawthorne et al 2022
- “Transformer Memory As a Differentiable Search Index”, Tay et al 2022
- “The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Irie et al 2022
- “Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021
- “An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
- “Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, et al 2021
- “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press et al 2021
- “Do Vision Transformers See Like Convolutional Neural Networks?”, Raghu et al 2021
- “Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, et al 2021
- “RASP: Thinking Like Transformers”, Weiss et al 2021
- “On the Distribution, Sparsity, and Inference-Time Quantization of Attention Values in Transformers”, et al 2021
- “SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli et al 2021
- “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, et al 2021
- “Less Is More: Pay Less Attention in Vision Transformers”, et al 2021
- “FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
- “RoFormer: Enhanced Transformer With Rotary Position Embedding”, Su et al 2021
- “ALD: Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, 2021
- “Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially With Depth”, Dong et al 2021
- “Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang et al 2021
- “Linear Transformers Are Secretly Fast Weight Programmers”, Schlag et al 2021
- “Unlocking Pixels for Reinforcement Learning via Implicit Attention”, et al 2021
- “Transformer Feed-Forward Layers Are Key-Value Memories”, Geva et al 2020
- “AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, et al 2020
- “Inductive Biases for Deep Learning of Higher-Level Cognition”, Goyal & Bengio 2020
- “Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay et al 2020
- “Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020
- “Efficient Transformers: A Survey”, Tay et al 2020
- “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu et al 2020
- “Pre-Training via Paraphrasing”, Lewis et al 2020
- “Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski et al 2020
- “GPT-3: Language Models Are Few-Shot Learners”, Brown et al 2020
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al 2020
- “Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
- “PowerNorm: Rethinking Batch Normalization in Transformers”, et al 2020
- “REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al 2020
- “Rethinking Attention With Performers”, Choromanski et al 2020
- “Dynamic Convolution: Attention over Convolution Kernels”, Chen et al 2019
- “Generalization through Memorization: Nearest Neighbor Language Models”, Khandelwal et al 2019
- “Multiplicative Interactions and Where to Find Them”, Jayakumar et al 2019
- “The Bottom-Up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Voita et al 2019
- “Large Memory Layers With Product Keys”, Lample et al 2019
- “What Does BERT Look At? An Analysis of BERT’s Attention”, Clark et al 2019
- “Are 16 Heads Really Better Than One?”, Michel et al 2019
- “Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu et al 2019
- “On the Turing Completeness of Modern Neural Network Architectures”, Pérez et al 2019
- “Music Transformer”, Huang et al 2018
- “Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
- “Attention Is All You Need”, Vaswani et al 2017
- “A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017
- “Get To The Point: Summarization With Pointer-Generator Networks”, See et al 2017
- “RAM: Dynamic Computational Time for Visual Attention”, et al 2017
- “Hybrid Computing Using a Neural Network With Dynamic External Memory”, Graves et al 2016
- “Scaling Memory-Augmented Neural Networks With Sparse Reads and Writes”, Rae et al 2016
- “Modeling Human Reading With Neural Attention”, 2016
- “Iterative Alternating Neural Attention for Machine Reading”, Sordoni et al 2016
- “Adaptive Computation Time for Recurrent Neural Networks”, Graves 2016
- “Foveation-Based Mechanisms Alleviate Adversarial Examples”, et al 2015
- “Generating Images from Captions With Attention”, Mansimov et al 2015
- “DRAW: A Recurrent Neural Network For Image Generation”, Gregor et al 2015
- “Neural Turing Machines”, Graves et al 2014
- “Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al 2014
- “On Learning Where To Look”, Ranzato 2014
- “Generating Sequences With Recurrent Neural Networks”, Graves 2013
- “Efficient Transformers: A Survey § Table 1”
- “Attention and Augmented Recurrent Neural Networks”
- “Hierarchical Object Detection With Deep Reinforcement Learning”
- “The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
- “Learning to Combine Foveal Glimpses With a Third-Order Boltzmann Machine”
- “Show, Attend and Tell: Neural Image Caption Generation With Visual Attention”
- “Recurrent Models of Visual Attention”
- “Can Active Memory Replace Attention?”
- “Dzmitry Bahdanau”
- “Monitor: An AI-Driven Observability Interface”
- “A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer”
- “FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision”
- Miscellaneous
- Bibliography