- See Also
- Links
- “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023
- “In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
- “LSS Transformer: Ultra-Long Sequence Distributed Transformer”, Wang et al 2023
- “Not All Layers Are Equally As Important: Every Layer Counts BERT”, Charpentier & Samuel 2023
- “GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling”, Katsch 2023
- “Simplifying Transformer Blocks”, He & Hofmann 2023
- “Training Dynamics of Contextual N-Grams in Language Models”, Quirke et al 2023
- “The Impact of Depth and Width on Transformer Language Model Generalization”, Petty et al 2023
- “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
- “How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?”, Wu et al 2023
- “Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, Zhou et al 2023
- “Replacing Softmax With ReLU in Vision Transformers”, Wortsman et al 2023
- “Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023
- “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023
- “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
- “Lost in the Middle: How Language Models Use Long Contexts”, Liu et al 2023
- “Trainable Transformer in Transformer”, Panigrahi et al 2023
- “White-Box Transformers via Sparse Rate Reduction”, Yu et al 2023
- “Blockwise Parallel Transformer for Long Context Large Models”, Liu & Abbeel 2023
- “Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
- “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, Hardt & Sun 2023
- “Toeplitz Neural Network for Sequence Modeling”, Qin et al 2023
- “Finding Neurons in a Haystack: Case Studies With Sparse Probing”, Gurnee et al 2023
- “How Does GPT-2 Compute Greater-than?: Interpreting Mathematical Abilities in a Pre-trained Language Model”, Hanna et al 2023
- “Coinductive Guide to Inductive Transformer Heads”, Nemecek 2023
- “Tighter Bounds on the Expressivity of Transformer Encoders”, Chiang et al 2023
- “Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Fu et al 2022
- “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
- “Pretraining Without Attention”, Wang et al 2022
- “Efficiently Scaling Transformer Inference”, Pope et al 2022
- “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
- “Transformers Learn Shortcuts to Automata”, Liu et al 2022
- “Transformers Implement First-Order Logic With Majority Quantifiers”, Merrill & Sabharwal 2022
- “Relaxed Attention for Transformer Models”, Lohrenz et al 2022
- “Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, Dong et al 2022
- “N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
- “Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022
- “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Irie et al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
- “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge et al 2022
- “Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022
- “It’s Raw! Audio Generation With State-Space Models”, Goel et al 2022
- “General-purpose, Long-context Autoregressive Modeling With Perceiver AR”, Hawthorne et al 2022
- “The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Irie et al 2022
- “Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021
- “Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, Grigsby et al 2021
- “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press et al 2021
- “Do Vision Transformers See Like Convolutional Neural Networks?”, Raghu et al 2021
- “Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, Luo et al 2021
- “RASP: Thinking Like Transformers”, Weiss et al 2021
- “SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli et al 2021
- “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Wang et al 2021
- “Less Is More: Pay Less Attention in Vision Transformers”, Pan et al 2021
- “FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
- “RoFormer: Enhanced Transformer With Rotary Position Embedding”, Su et al 2021
- “Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
- “Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang et al 2021
- “Linear Transformers Are Secretly Fast Weight Programmers”, Schlag et al 2021
- “Unlocking Pixels for Reinforcement Learning via Implicit Attention”, Choromanski et al 2021
- “AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020
- “Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay et al 2020
- “Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020
- “Efficient Transformers: A Survey”, Tay et al 2020
- “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu et al 2020
- “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern 2020
- “Pre-training via Paraphrasing”, Lewis et al 2020
- “GPT-3 Creative Fiction”, Gwern 2020
- “Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski et al 2020
- “GPT-3: Language Models Are Few-Shot Learners”, Brown et al 2020
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al 2020
- “Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
- “PowerNorm: Rethinking Batch Normalization in Transformers”, Shen et al 2020
- “REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al 2020
- “Rethinking Attention With Performers”, Choromanski & Colwell 2020
- “Generalization through Memorization: Nearest Neighbor Language Models”, Khandelwal et al 2019
- “Large Memory Layers With Product Keys”, Lample et al 2019
- “What Does BERT Look At? An Analysis of BERT’s Attention”, Clark et al 2019
- “Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu et al 2019
- “On the Turing Completeness of Modern Neural Network Architectures”, Pérez et al 2019
- “Music Transformer”, Huang et al 2018
- “Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
- “Attention Is All You Need”, Vaswani et al 2017
- “A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017
- “Get To The Point: Summarization With Pointer-Generator Networks”, See et al 2017
- “RAM: Dynamic Computational Time for Visual Attention”, Li et al 2017
- “Research Ideas”, Gwern 2017
- “Scaling Memory-Augmented Neural Networks With Sparse Reads and Writes”, Rae et al 2016
- “Hybrid Computing Using a Neural Network With Dynamic External Memory”, Graves et al 2016
- “Modeling Human Reading With Neural Attention”, Hahn & Keller 2016
- “Iterative Alternating Neural Attention for Machine Reading”, Sordoni et al 2016
- “Adaptive Computation Time for Recurrent Neural Networks”, Graves 2016
- “Foveation-based Mechanisms Alleviate Adversarial Examples”, Luo et al 2015
- “Generating Images from Captions With Attention”, Mansimov et al 2015
- “DRAW: A Recurrent Neural Network For Image Generation”, Gregor et al 2015
- “Neural Turing Machines”, Graves et al 2014
- “Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al 2014
- “On Learning Where To Look”, Ranzato 2014
- “Generating Sequences With Recurrent Neural Networks”, Graves 2013
- “Efficient Transformers: A Survey § Table 1”
- “Attention and Augmented Recurrent Neural Networks”
- “Hierarchical Object Detection With Deep Reinforcement Learning”
- “The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
- “Learning to Combine Foveal Glimpses With a Third-order Boltzmann Machine”
- “Show, Attend and Tell: Neural Image Caption Generation With Visual Attention”
- “Recurrent Models of Visual Attention”
- “Can Active Memory Replace Attention?”
- “A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer”
- Sort By Magic
- Miscellaneous
- Link Bibliography
See Also
Links
“Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023
“In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
“LSS Transformer: Ultra-Long Sequence Distributed Transformer”, Wang et al 2023
“Not All Layers Are Equally As Important: Every Layer Counts BERT”, Charpentier & Samuel 2023
“GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling”, Katsch 2023
“Simplifying Transformer Blocks”, He & Hofmann 2023
“Training Dynamics of Contextual N-Grams in Language Models”, Quirke et al 2023
“The Impact of Depth and Width on Transformer Language Model Generalization”, Petty et al 2023
“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
“How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?”, Wu et al 2023
“Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, Zhou et al 2023
“Replacing Softmax With ReLU in Vision Transformers”, Wortsman et al 2023
“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023
“Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023
“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
“Lost in the Middle: How Language Models Use Long Contexts”, Liu et al 2023
“Trainable Transformer in Transformer”, Panigrahi et al 2023
“White-Box Transformers via Sparse Rate Reduction”, Yu et al 2023
“Blockwise Parallel Transformer for Long Context Large Models”, Liu & Abbeel 2023
“Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
“TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, Hardt & Sun 2023
“Toeplitz Neural Network for Sequence Modeling”, Qin et al 2023
“Finding Neurons in a Haystack: Case Studies With Sparse Probing”, Gurnee et al 2023
“How Does GPT-2 Compute Greater-than?: Interpreting Mathematical Abilities in a Pre-trained Language Model”, Hanna et al 2023
“Coinductive Guide to Inductive Transformer Heads”, Nemecek 2023
“Tighter Bounds on the Expressivity of Transformer Encoders”, Chiang et al 2023
“Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Fu et al 2022
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
“Pretraining Without Attention”, Wang et al 2022
“Efficiently Scaling Transformer Inference”, Pope et al 2022
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
“Transformers Learn Shortcuts to Automata”, Liu et al 2022
“Transformers Implement First-Order Logic With Majority Quantifiers”, Merrill & Sabharwal 2022
“Relaxed Attention for Transformer Models”, Lohrenz et al 2022
“Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, Dong et al 2022
“N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
“Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022
“Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Irie et al 2022
“FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
“TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge et al 2022
“Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022
“It’s Raw! Audio Generation With State-Space Models”, Goel et al 2022
“General-purpose, Long-context Autoregressive Modeling With Perceiver AR”, Hawthorne et al 2022
“The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Irie et al 2022
“Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021
“Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, Grigsby et al 2021
“Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press et al 2021
“Do Vision Transformers See Like Convolutional Neural Networks?”, Raghu et al 2021
“Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, Luo et al 2021
“RASP: Thinking Like Transformers”, Weiss et al 2021
“SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli et al 2021
“Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Wang et al 2021
“Less Is More: Pay Less Attention in Vision Transformers”, Pan et al 2021
“FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021
“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
“RoFormer: Enhanced Transformer With Rotary Position Embedding”, Su et al 2021
“Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
“Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang et al 2021
“Linear Transformers Are Secretly Fast Weight Programmers”, Schlag et al 2021
“Unlocking Pixels for Reinforcement Learning via Implicit Attention”, Choromanski et al 2021
“AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020
“Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay et al 2020
“Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020
“Efficient Transformers: A Survey”, Tay et al 2020
“HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu et al 2020
“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern 2020
“Pre-training via Paraphrasing”, Lewis et al 2020
“GPT-3 Creative Fiction”, Gwern 2020
“Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski et al 2020
“GPT-3: Language Models Are Few-Shot Learners”, Brown et al 2020
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al 2020
“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
“PowerNorm: Rethinking Batch Normalization in Transformers”, Shen et al 2020
“REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al 2020
“Rethinking Attention With Performers”, Choromanski & Colwell 2020
“Generalization through Memorization: Nearest Neighbor Language Models”, Khandelwal et al 2019
“Large Memory Layers With Product Keys”, Lample et al 2019
“What Does BERT Look At? An Analysis of BERT’s Attention”, Clark et al 2019
“Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu et al 2019
“On the Turing Completeness of Modern Neural Network Architectures”, Pérez et al 2019
“Music Transformer”, Huang et al 2018
“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
“Attention Is All You Need”, Vaswani et al 2017
“A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017
“Get To The Point: Summarization With Pointer-Generator Networks”, See et al 2017
“RAM: Dynamic Computational Time for Visual Attention”, Li et al 2017
“Research Ideas”, Gwern 2017
“Scaling Memory-Augmented Neural Networks With Sparse Reads and Writes”, Rae et al 2016
“Hybrid Computing Using a Neural Network With Dynamic External Memory”, Graves et al 2016
“Modeling Human Reading With Neural Attention”, Hahn & Keller 2016
“Iterative Alternating Neural Attention for Machine Reading”, Sordoni et al 2016
“Adaptive Computation Time for Recurrent Neural Networks”, Graves 2016
“Foveation-based Mechanisms Alleviate Adversarial Examples”, Luo et al 2015
“Generating Images from Captions With Attention”, Mansimov et al 2015
“DRAW: A Recurrent Neural Network For Image Generation”, Gregor et al 2015
“Neural Turing Machines”, Graves et al 2014
“Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al 2014
“On Learning Where To Look”, Ranzato 2014
“Generating Sequences With Recurrent Neural Networks”, Graves 2013
“Efficient Transformers: A Survey § Table 1”
“Attention and Augmented Recurrent Neural Networks”
“Hierarchical Object Detection With Deep Reinforcement Learning”
“The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
“Learning to Combine Foveal Glimpses With a Third-order Boltzmann Machine”
“Show, Attend and Tell: Neural Image Caption Generation With Visual Attention”
“Recurrent Models of Visual Attention”
“Can Active Memory Replace Attention?”
“A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
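Concretely, the greedy embedding-based ordering described above might look like the following minimal sketch (a hypothetical illustration only, not the site's actual implementation; the `embeddings` mapping is assumed to come from some external text-embedding model):

```python
# Sketch of "sort by magic": greedy nearest-neighbor ordering of annotations
# by embedding similarity, starting from the newest item, so adjacent entries
# form a rough progression of topics. Clustering the same embeddings into
# labeled sections would be a separate step.
import numpy as np

def sort_by_similarity(embeddings: dict[str, np.ndarray], newest: str) -> list[str]:
    """Return annotation titles ordered by greedy nearest-neighbor walk from `newest`."""
    remaining = dict(embeddings)          # copy so we can pop visited items
    order = [newest]
    current = remaining.pop(newest)
    while remaining:
        titles = list(remaining)
        vecs = np.stack([remaining[t] for t in titles])
        sims = vecs @ current             # cosine similarity if vectors are unit-normalized
        best = titles[int(np.argmax(sims))]
        order.append(best)
        current = remaining.pop(best)
    return order

# Toy usage with random "embeddings"; real ones would come from a language model.
rng = np.random.default_rng(0)
fake = {f"paper-{i}": v / np.linalg.norm(v)
        for i, v in enumerate(rng.normal(size=(5, 8)))}
print(sort_by_similarity(fake, "paper-0"))
```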
- attention-mechanisms
- incontext-learning
- gpt-advancements
- transformer-architecture
Miscellaneous
- /doc/ai/nn/transformer/attention/2023-09-08-charlesfoster-aunn-variantwithcausaldecoderattention.jpg
- /doc/ai/nn/transformer/attention/2022-tay-figure5-scalingofmodelbymlpfeedforwardparameters.png
- /doc/ai/nn/transformer/attention/2022-tay-figure4-scalingofmodelbydepth.png
- /doc/ai/nn/transformer/attention/2020-tay-table1-efficienttransformermodels.png
- /doc/ai/nn/transformer/attention/2020-tay-figure2-efficientattentiontaxonomy.png
- /doc/ai/nn/transformer/attention/2020-longrangearena-figure3-performancefrontier.png
- https://twitter.com/BrendanBycroft/status/1731042957149827140
- https://twitter.com/LouisKnightWebb/status/1724510794514157668
- https://twitter.com/arankomatsuzaki/status/1622666312219598864
- https://twitter.com/mathemagic1an/status/1636121914849792000
- https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
- https://www.lesswrong.com/posts/euam65XjigaCJQkcN/an-analogy-for-understanding-transformers
- https://www.perfectlynormal.co.uk/blog-induction-heads-illustrated
Link Bibliography
- https://arxiv.org/abs/2309.10713: “Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, Chong Zhou, Chen Change Loy, Bo Dai
- https://arxiv.org/abs/2309.08586: “Replacing Softmax With ReLU in Vision Transformers”, Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
- aunn: “Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern
- https://arxiv.org/abs/2306.00008#google: “Brainformers: Trading Simplicity for Efficiency”
- https://arxiv.org/abs/2305.18466: “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, Moritz Hardt, Yu Sun
- https://arxiv.org/abs/2212.14052: “Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
- https://arxiv.org/abs/2212.10544: “Pretraining Without Attention”, Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush
- https://arxiv.org/abs/2211.05102#google: “Efficiently Scaling Transformer Inference”
- https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz
- https://arxiv.org/abs/2206.01649#schmidhuber: “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Kazuki Irie, Francesco Faccio, Jürgen Schmidhuber
- https://arxiv.org/abs/2205.14135: “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- https://arxiv.org/abs/2204.03638#facebook: “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh
- https://arxiv.org/abs/2202.09729: “It’s Raw! Audio Generation With State-Space Models”, Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
- https://arxiv.org/abs/2202.07765#deepmind: “General-purpose, Long-context Autoregressive Modeling With Perceiver AR”
- https://arxiv.org/abs/2108.12409#facebook: “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Ofir Press, Noah A. Smith, Mike Lewis
- https://arxiv.org/abs/2108.08810#google: “Do Vision Transformers See Like Convolutional Neural Networks?”, Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
- https://arxiv.org/abs/2106.06981: “RASP: Thinking Like Transformers”, Gail Weiss, Yoav Goldberg, Eran Yahav
- https://arxiv.org/abs/2105.15075: “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
- https://arxiv.org/abs/2105.14217: “Less Is More: Pay Less Attention in Vision Transformers”, Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai
- https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
- https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Luke Melas-Kyriazi
- https://openreview.net/forum?id=qVyeW-grC2k#google: “Long Range Arena (LRA): A Benchmark for Efficient Transformers”
- https://arxiv.org/abs/2009.06732#google: “Efficient Transformers: A Survey”, Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
- https://arxiv.org/abs/2008.07669: “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re
- attention: “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern
- gpt-3: “GPT-3 Creative Fiction”, Gwern
- https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
- https://arxiv.org/abs/2003.07845: “PowerNorm: Rethinking Batch Normalization in Transformers”, Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
- idea: “Research Ideas”, Gwern