Bibliography:

  1. Neural Net Sparsity

  2. ‘self-attention’ tag

  3. ‘NN sparsity’ tag

  4. ‘Jukebox’ tag

  5. When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

  6. AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What goes on in artificial neural networks is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse

  7. Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution

  8. Zoology: Measuring and Improving Recall in Efficient Language Models

  9. HyperAttention: Long-context Attention in Near-Linear Time

  10. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

  11. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

  12. Unlimiformer: Long-Range Transformers with Unlimited Length Input

  13. How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  14. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

  15. Random Feature Attention

  16. Sparse is Enough in Scaling Transformers

  17. You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

  18. Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

  19. Combiner: Full Attention Transformer with Sparse Computation Cost

  20. OmniNet: Omnidirectional Representations from Transformers

  21. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

  22. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

  23. SMYRF: Efficient Attention using Asymmetric Clustering

  24. FAVOR+: Rethinking Attention with Performers

  25. Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

  26. DeepSpeed Sparse Attention

  27. BigBird: Transformers for Longer Sequences

  28. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

  29. Efficient Content-Based Sparse Attention with Routing Transformers

  30. Sparse Sinkhorn Attention

  31. Reformer: The Efficient Transformer

  32. The Reformer—Pushing the Limits of Language Modeling

  33. Axial Attention in Multidimensional Transformers

  34. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

  35. Scaling Autoregressive Video Models

  36. Adaptive Attention Span in Transformers

  37. Generating Long Sequences with Sparse Transformers

  38. Generative Modeling with Sparse Transformers: We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30× longer than possible previously

  39. Star-Transformer

  40. CCNet: Criss-Cross Attention for Semantic Segmentation

  41. Image Transformer

  42. Constructing Transformers For Longer Sequences With Sparse Attention Methods

  43. A Deep Dive into the Reformer

  44. Optimal Transport and the Sinkhorn Transformer

  45. https://www.lesswrong.com/posts/kzc3qNMsP2xJcxhGn/gated-attention-blocks-preliminary-progress-toward-removing-1