‘Transformer matrix optimizations’ directory
- See Also
- Links
 - “EvaByte: Efficient Byte-Level Language Models at Scale: Introducing EvaByte, an Efficient and Strong Byte-Level Language Model”, Zheng et al 2025
 - “LoLCATs: On Low-Rank Linearizing of Large Language Models”, Zhang et al 2024
 - “SANA: Efficient High-Resolution Image Synthesis With Linear Diffusion Transformers”, Xie et al 2024
 - “Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers”, Gu et al 2024
 - “RWKV: Reinventing RNNs for the Transformer Era”, Peng et al 2023
 - “CosFormer: Rethinking Softmax in Attention”, Qin et al 2022
 - “Self-Attention Does Not Need 𝒪(n²) Memory”, Rabe & Staats 2021
 - “Skyformer: Remodel Self-Attention With Gaussian Kernel and Nyström Method”, Chen et al 2021
 - “On Learning the Transformer Kernel”, Chowdhury et al 2021
 - “A Dot Product Attention Free Transformer”, Zhai et al 2021
 - “AFT: An Attention Free Transformer”, Zhai et al 2021
 - “Luna: Linear Unified Nested Attention”, Ma et al 2021
 - “Beyond Self-Attention: External Attention Using Two Linear Layers for Visual Tasks (EAMLP)”, Guo et al 2021
 - “Sub-Linear Memory: How to Make Performers SLiM”, Likhosherstov et al 2020
 - “LambdaNetworks: Modeling Long-Range Interactions without Attention”, Bello 2020
 - “Transformers Are RNNs: Fast Autoregressive Transformers With Linear Attention”, Katharopoulos et al 2020
 - “Linformer: Self-Attention With Linear Complexity”, Wang et al 2020
 - “Efficient Attention: Attention With Linear Complexities”, Shen et al 2018
 - “Efficient Attention: Attention With Linear Complexities [Blog]”
- Sort By Magic
- Miscellaneous
- Bibliography
 
See Also
Links
“EvaByte: Efficient Byte-Level Language Models at Scale: Introducing EvaByte, an Efficient and Strong Byte-Level Language Model”, Zheng et al 2025
“LoLCATs: On Low-Rank Linearizing of Large Language Models”, Zhang et al 2024
“SANA: Efficient High-Resolution Image Synthesis With Linear Diffusion Transformers”, Xie et al 2024
“Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers”, Gu et al 2024
“RWKV: Reinventing RNNs for the Transformer Era”, Peng et al 2023
“CosFormer: Rethinking Softmax in Attention”, Qin et al 2022
“Self-Attention Does Not Need 𝒪(n²) Memory”, Rabe & Staats 2021
“Skyformer: Remodel Self-Attention With Gaussian Kernel and Nyström Method”, Chen et al 2021
“On Learning the Transformer Kernel”, Chowdhury et al 2021
“A Dot Product Attention Free Transformer”, Zhai et al 2021
“AFT: An Attention Free Transformer”, Zhai et al 2021
“Luna: Linear Unified Nested Attention”, Ma et al 2021
“Beyond Self-Attention: External Attention Using Two Linear Layers for Visual Tasks (EAMLP)”, Guo et al 2021
“Sub-Linear Memory: How to Make Performers SLiM”, Likhosherstov et al 2020
“LambdaNetworks: Modeling Long-Range Interactions without Attention”, Bello 2020
“Transformers Are RNNs: Fast Autoregressive Transformers With Linear Attention”, Katharopoulos et al 2020
“Linformer: Self-Attention With Linear Complexity”, Wang et al 2020
“Efficient Attention: Attention With Linear Complexities”, Shen et al 2018
“Efficient Attention: Attention With Linear Complexities [Blog]”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
efficient-attention
attention-free
linear-attention
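As a rough illustration of the nearest-neighbor ordering described above, here is a minimal Python sketch; it is not the site's actual implementation, and it assumes unit-normalized annotation embeddings with a toy random stand-in dataset (the clustering into tags would be a separate step, e.g. k-means over the same embeddings):

```python
import numpy as np

def nearest_neighbor_order(embeddings: np.ndarray, start: int = 0) -> list[int]:
    """Greedy nearest-neighbor chaining over annotation embeddings:
    begin at `start` (the newest annotation) and repeatedly append the
    most similar annotation not yet placed, giving a topic progression."""
    n = len(embeddings)
    sims = embeddings @ embeddings.T  # pairwise cosine similarities (unit vectors)
    order, visited = [start], {start}
    while len(order) < n:
        last = order[-1]
        # pick the most similar annotation not yet in the ordering
        _, nxt = max((sims[last, j], j) for j in range(n) if j not in visited)
        order.append(nxt)
        visited.add(nxt)
    return order

# Toy usage: 5 random stand-in embeddings, index 0 playing the newest annotation.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(nearest_neighbor_order(emb))
```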
Miscellaneous
Bibliography
https://arxiv.org/abs/2410.10629#nvidia: “SANA: Efficient High-Resolution Image Synthesis With Linear Diffusion Transformers”
https://arxiv.org/abs/2305.13048: “RWKV: Reinventing RNNs for the Transformer Era”
https://openreview.net/forum?id=JVR4JswsEM: “A Dot Product Attention Free Transformer”
https://arxiv.org/abs/1812.01243#sensetime: “Efficient Attention: Attention With Linear Complexities”