- See Also
- Links
- “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai Et Al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot Et Al 2022
- “Efficiently Scaling Transformer Inference”, Pope Et Al 2022
- “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid Et Al 2022
- “Transformers Learn Shortcuts to Automata”, Liu Et Al 2022
- “Relaxed Attention for Transformer Models”, Lohrenz Et Al 2022
- “Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, Dong Et Al 2022
- “N-Grammer: Augmenting Transformers With Latent N-grams”, Roy Et Al 2022
- “Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022
- “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Irie Et Al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao Et Al 2022
- “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge Et Al 2022
- “Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022
- “It’s Raw! Audio Generation With State-Space Models”, Goel Et Al 2022
- “General-purpose, Long-context Autoregressive Modeling With Perceiver AR”, Hawthorne Et Al 2022
- “Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021
- “Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, Grigsby Et Al 2021
- “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press Et Al 2021
- “Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, Luo Et Al 2021
- “RASP: Thinking Like Transformers”, Weiss Et Al 2021
- “SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli Et Al 2021
- “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Wang Et Al 2021
- “Less Is More: Pay Less Attention in Vision Transformers”, Pan Et Al 2021
- “FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp Et Al 2021
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
- “Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
- “Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang Et Al 2021
- “Linear Transformers Are Secretly Fast Weight Programmers”, Schlag Et Al 2021
- “Unlocking Pixels for Reinforcement Learning via Implicit Attention”, Choromanski Et Al 2021
- “Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay Et Al 2020
- “Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020
- “Efficient Transformers: A Survey”, Tay Et Al 2020
- “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu Et Al 2020
- “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Branwen 2020
- “Pre-training via Paraphrasing”, Lewis Et Al 2020
- “GPT-3 Creative Fiction”, Branwen 2020
- “Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski Et Al 2020
- “GPT-3: Language Models Are Few-Shot Learners”, Brown Et Al 2020
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis Et Al 2020
- “Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay Et Al 2020
- “PowerNorm: Rethinking Batch Normalization in Transformers”, Shen Et Al 2020
- “REALM: Retrieval-Augmented Language Model Pre-Training”, Guu Et Al 2020
- “Rethinking Attention With Performers”, Choromanski Et Al 2020
- “Large Memory Layers With Product Keys”, Lample Et Al 2019
- “What Does BERT Look At? An Analysis of BERT’s Attention”, Clark Et Al 2019
- “Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu Et Al 2019
- “On the Turing Completeness of Modern Neural Network Architectures”, Pérez Et Al 2019
- “Music Transformer”, Huang Et Al 2018
- “Efficient Transformers: A Survey § Table 1”
- “Attention and Augmented Recurrent Neural Networks”
- “The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
- “A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer”
- Miscellaneous
- Link Bibliography
See Also
Links
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai Et Al 2022
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers”, 2022-12-20 (similar)
“Efficient Transformers With Dynamic Token Pooling”, Nawrot Et Al 2022
“Efficient Transformers with Dynamic Token Pooling”, 2022-11-17 (similar)
“Efficiently Scaling Transformer Inference”, Pope Et Al 2022
“Efficiently Scaling Transformer Inference”, 2022-11-09 (similar; bibliography)
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid Et Al 2022
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, 2022-11-07 (backlinks; similar; bibliography)
“Transformers Learn Shortcuts to Automata”, Liu Et Al 2022
“Transformers Learn Shortcuts to Automata”, 2022-10-19 (similar)
“Relaxed Attention for Transformer Models”, Lohrenz Et Al 2022
“Relaxed Attention for Transformer Models”, 2022-09-20 (similar)
“Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, Dong Et Al 2022
“Multitrack Music Transformer: Learning Long-Term Dependencies in Music with Diverse Instruments”, 2022-07-14 (similar)
“N-Grammer: Augmenting Transformers With Latent N-grams”, Roy Et Al 2022
“N-Grammer: Augmenting Transformers with latent n-grams”, 2022-07-13 (similar)
“Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022
“Log-Precision Transformers are Constant-Depth Uniform Threshold Circuits”, 2022-07-02 (similar)
“Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Irie Et Al 2022
“Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, 2022-06-03 (similar; bibliography)
“FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao Et Al 2022
“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, 2022-05-27 (backlinks; similar; bibliography)
“TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge Et Al 2022
“TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer”, 2022-04-07 (similar; bibliography)
“Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022
“Overcoming a Theoretical Limitation of Self-Attention”, 2022-02-24 (similar)
“It’s Raw! Audio Generation With State-Space Models”, Goel Et Al 2022
“It’s Raw! Audio Generation with State-Space Models”, 2022-02-20 (backlinks; similar; bibliography)
“General-purpose, Long-context Autoregressive Modeling With Perceiver AR”, Hawthorne Et Al 2022
“General-purpose, long-context autoregressive modeling with Perceiver AR”, 2022-02-15 (similar; bibliography)
“Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021
“Attention Approximates Sparse Distributed Memory”, 2021-11-10 (similar)
“Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, Grigsby Et Al 2021
“Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, 2021-09-24 (similar)
“Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press Et Al 2021
“Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation”, 2021-08-27 (similar; bibliography)
“Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, Luo Et Al 2021
“Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding”, 2021-06-23 (backlinks; similar)
“RASP: Thinking Like Transformers”, Weiss Et Al 2021
“RASP: Thinking Like Transformers”, 2021-06-13 (backlinks; similar; bibliography)
“SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli Et Al 2021
“SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, 2021-06-02 (similar)
“Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Wang Et Al 2021
“Not All Images are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, 2021-05-31 (backlinks; similar; bibliography)
“Less Is More: Pay Less Attention in Vision Transformers”, Pan Et Al 2021
“Less is More: Pay Less Attention in Vision Transformers”, 2021-05-29 (backlinks; similar; bibliography)
“FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp Et Al 2021
“FNet: Mixing Tokens with Fourier Transforms”, 2021-05-09 (similar; bibliography)
“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, 2021-05-06 (backlinks; similar; bibliography)
“Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
“Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation”, 2021-04-04 (backlinks; similar)
“Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang Et Al 2021
“Do Transformer Modifications Transfer Across Implementations and Applications?”, 2021-02-23 (similar)
“Linear Transformers Are Secretly Fast Weight Programmers”, Schlag Et Al 2021
“Linear Transformers Are Secretly Fast Weight Programmers”, 2021-02-22 (similar)
“Unlocking Pixels for Reinforcement Learning via Implicit Attention”, Choromanski Et Al 2021
“Unlocking Pixels for Reinforcement Learning via Implicit Attention”, 2021-02-08 (backlinks; similar)
“Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay Et Al 2020
“Long Range Arena (LRA): A Benchmark for Efficient Transformers”, 2020-09-28 (similar; bibliography)
“Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020
“Current Limitations of Language Models: What You Need is Retrieval”, 2020-09-15 (backlinks; similar)
“Efficient Transformers: A Survey”, Tay Et Al 2020
“Efficient Transformers: A Survey”, 2020-09-14 (similar; bibliography)
“HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu Et Al 2020
“HiPPO: Recurrent Memory with Optimal Polynomial Projections”, 2020-08-17 (backlinks; similar; bibliography)
“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Branwen 2020
“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, 2020-07-25 (backlinks; similar; bibliography)
“Pre-training via Paraphrasing”, Lewis Et Al 2020
“Pre-training via Paraphrasing”, 2020-06-26 (similar)
“GPT-3 Creative Fiction”, Branwen 2020
“GPT-3 Creative Fiction”, 2020-06-19 (backlinks; similar; bibliography)
“Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski Et Al 2020
“Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, 2020-06-05 (similar)
“GPT-3: Language Models Are Few-Shot Learners”, Brown Et Al 2020
“GPT-3: Language Models are Few-Shot Learners”, 2020-05-28 (similar)
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis Et Al 2020
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, 2020-05-22 (similar)
“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay Et Al 2020
“Synthesizer: Rethinking Self-Attention in Transformer Models”, 2020-05-02 (similar; bibliography)
“PowerNorm: Rethinking Batch Normalization in Transformers”, Shen Et Al 2020
“PowerNorm: Rethinking Batch Normalization in Transformers”, 2020-03-17 (similar)
“REALM: Retrieval-Augmented Language Model Pre-Training”, Guu Et Al 2020
“REALM: Retrieval-Augmented Language Model Pre-Training”, 2020-02-10 (similar)
“Rethinking Attention With Performers”, Choromanski Et Al 2020
“Large Memory Layers With Product Keys”, Lample Et Al 2019
“Large Memory Layers with Product Keys”, 2019-07-10 (similar)
“What Does BERT Look At? An Analysis of BERT’s Attention”, Clark Et Al 2019
“What Does BERT Look At? An Analysis of BERT’s Attention”, 2019-06-11 (backlinks; similar)
“Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu Et Al 2019
“Pay Less Attention with Lightweight and Dynamic Convolutions”, 2019-01-29 (similar)
“On the Turing Completeness of Modern Neural Network Architectures”, Pérez Et Al 2019
“On the Turing Completeness of Modern Neural Network Architectures”, 2019-01-10 (similar)
“Music Transformer”, Huang Et Al 2018
“Music Transformer”, 2018-09-12 (similar)
“Efficient Transformers: A Survey § Table 1”
“Attention and Augmented Recurrent Neural Networks”
“The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
“A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer”
“A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention with Linear Complexities · Transformers are RNNs · ETC · Longformer” (backlinks)
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2211.05102#google: “Efficiently Scaling Transformer Inference”
- https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz
- https://arxiv.org/abs/2206.01649#schmidhuber: “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Kazuki Irie, Francesco Faccio, Jürgen Schmidhuber
- https://arxiv.org/abs/2205.14135: “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- https://arxiv.org/abs/2204.03638#facebook: “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh
- https://arxiv.org/abs/2202.09729: “It’s Raw! Audio Generation With State-Space Models”, Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
- https://arxiv.org/abs/2202.07765#deepmind: “General-purpose, Long-context Autoregressive Modeling With Perceiver AR”
- https://arxiv.org/abs/2108.12409#facebook: “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Ofir Press, Noah A. Smith, Mike Lewis
- https://arxiv.org/abs/2106.06981: “RASP: Thinking Like Transformers”, Gail Weiss, Yoav Goldberg, Eran Yahav
- https://arxiv.org/abs/2105.15075: “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
- https://arxiv.org/abs/2105.14217: “Less Is More: Pay Less Attention in Vision Transformers”, Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai
- https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
- https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Luke Melas-Kyriazi
- https://openreview.net/forum?id=qVyeW-grC2k#google: “Long Range Arena (LRA): A Benchmark for Efficient Transformers”
- https://arxiv.org/abs/2009.06732#google: “Efficient Transformers: A Survey”, Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
- https://arxiv.org/abs/2008.07669: “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re
- attention: “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern Branwen
- gpt-3: “GPT-3 Creative Fiction”, Gwern Branwen
- https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng