See Also
Links
“Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?”, Tay et al 2022 (2022-07-21)
“Random Feature Attention”, Peng et al 2022 (2022-02-10)
“Sparse is Enough in Scaling Transformers”, Jaszczur et al 2021 (2021-11-24)
“You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021 (2021-11-18)
“Scatterbrain: Unifying Sparse and Low-rank Attention Approximation”, Chen et al 2021 (2021-10-28)
“Combiner: Full Attention Transformer with Sparse Computation Cost”, Ren et al 2021 (2021-07-12)
“OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021 (2021-03-01)
“Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021 (2021-02-07)
“Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, Zhou et al 2020 (2020-12-14)
“SMYRF: Efficient Attention using Asymmetric Clustering”, Daras et al 2020 (2020-10-11)
“FAVOR+: Rethinking Attention with Performers”, Choromanski et al 2020 (2020-09-30; an illustrative random-feature sketch appears at the end of this page)
“Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding”, Wang et al 2020 (2020-09-13)
“DeepSpeed Sparse Attention”, 2020 (2020-09-08)
“BigBird: Transformers for Longer Sequences”, Zaheer et al 2020 (2020-07-28)
“Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020 (2020-03-17)
“Efficient Content-Based Sparse Attention with Routing Transformers”, Roy et al 2020 (2020-03-12)
“Sparse Sinkhorn Attention”, Tay et al 2020 (2020-02-26)
“Reformer: The Efficient Transformer”, Kitaev et al 2020 (2020-01-13)
“The Reformer—Pushing the Limits of Language Modeling”, 2020
“Axial Attention in Multidimensional Transformers”, Ho et al 2019 (2019-12-20)
“Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”, Li et al 2019 (2019-06-29)
“Scaling Autoregressive Video Models”, Weissenborn et al 2019 (2019-06-06)
“Adaptive Attention Span in Transformers”, Sukhbaatar et al 2019 (2019-05-19)
“Generative Modeling with Sparse Transformers: We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30× longer than possible previously”, 2019 (2019-04-23)
“Generating Long Sequences with Sparse Transformers”, Child et al 2019 (2019-04-23; an illustrative banded-mask sketch follows this list)
“Star-Transformer”, Guo et al 2019 (2019-02-25)
“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018 (2018-11-28)
“Image Transformer”, Parmar et al 2018 (2018-02-15)
“Constructing Transformers For Longer Sequences With Sparse Attention Methods”
“A Deep Dive into the Reformer”
“Optimal Transport and the Sinkhorn Transformer”
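Most of the links above restrict each query to a small subset of keys (a local band, fixed strides, blocks, LSH buckets, or learned clusters) instead of computing the full n×n softmax. As a rough illustration of the simplest such pattern, here is a minimal NumPy sketch of a banded causal ("local") attention mask; it is not taken from any of the linked papers, and the function names and the window size `w` are illustrative choices only.

```python
# Minimal illustrative sketch (not from any linked paper): banded causal attention.
# The window size `w` and all names here are arbitrary choices for exposition.
import numpy as np

def local_causal_mask(n: int, w: int) -> np.ndarray:
    """Boolean (n, n) mask: query i may attend to keys i-w+1 .. i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

def banded_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, w: int) -> np.ndarray:
    """Dense reference implementation of banded causal attention.
    A real sparse kernel skips the masked entries entirely, cutting the
    cost from O(n^2 * d) to O(n * w * d)."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)                       # (n, n) scaled dot-product scores
    scores = np.where(local_causal_mask(n, w), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the allowed band only
    return weights @ v                                    # (n, d) attention output

# Usage: 128 tokens, 16-dim head, attention window of 8 positions.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
print(banded_attention(q, k, v, w=8).shape)               # (128, 16)
```

This reference version still materializes the dense score matrix for clarity; practical implementations (e.g., the block-sparse kernels behind DeepSpeed Sparse Attention or the Sparse Transformer) only ever compute the entries inside the allowed pattern.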
Link Bibliography
- https://arxiv.org/abs/2207.10551#google : “Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?”
- https://arxiv.org/abs/2111.12763#google : “Sparse is Enough in Scaling Transformers”
- https://arxiv.org/abs/2111.09714 : “You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh
- https://arxiv.org/abs/2110.15343#facebook : “Scatterbrain: Unifying Sparse and Low-rank Attention Approximation”, Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
- https://arxiv.org/abs/2103.01075#google : “OmniNet: Omnidirectional Representations from Transformers”
- https://arxiv.org/abs/2102.03902 : “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh
- https://arxiv.org/abs/2010.05315 : “SMYRF: Efficient Attention Using Asymmetric Clustering”, Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
- https://arxiv.org/abs/2003.07853#google : “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- https://arxiv.org/abs/2003.05997#google : “Efficient Content-Based Sparse Attention With Routing Transformers”, Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier
- https://arxiv.org/abs/2001.04451#google : “Reformer: The Efficient Transformer”, Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
- https://arxiv.org/abs/1811.11721 : “CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang
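The kernelized family among the links above (FAVOR+/Performers, Random Feature Attention, and relatives) avoids the n×n attention matrix altogether by rewriting softmax(QK^T)V as a feature-map factorization φ(Q)·(φ(K)^T·V). The sketch below is only a schematic of that idea under simplifying assumptions: it uses plain ReLU random features rather than the positive orthogonal features of FAVOR+, and the feature count `m`, the projection `proj`, and the `eps` stabilizer are illustrative, not values from the papers.

```python
# Schematic sketch of kernelized ("linear") attention; not the FAVOR+ algorithm itself.
import numpy as np

def relu_features(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Random feature map phi(x) = relu(x @ proj) / sqrt(m); non-negative by construction."""
    return np.maximum(x @ proj, 0.0) / np.sqrt(proj.shape[1])

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     proj: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Approximates softmax(QK^T)V as phi(Q) (phi(K)^T V): the n x n score
    matrix is never formed, so cost is O(n * m * d) instead of O(n^2 * d)."""
    qf, kf = relu_features(q, proj), relu_features(k, proj)  # (n, m) feature maps
    kv = kf.T @ v                                            # (m, d) one-pass key/value summary
    z = qf @ kf.sum(axis=0)[:, None]                         # (n, 1) per-query normalizer
    return (qf @ kv) / (z + eps)

# Usage: 1,024 tokens, 16-dim head, 64 random features.
rng = np.random.default_rng(0)
n, d, m = 1024, 16, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
proj = rng.standard_normal((d, m))
print(linear_attention(q, k, v, proj).shape)                 # (1024, 16)
```

Because the keys and values are summarized once into an m×d matrix, the cost grows linearly in sequence length, which is the property these papers exploit for long sequences.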