- See Also
- Links
- “When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024
- “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Work Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
- “Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Mahdavi et al 2024
- “Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023
- “HyperAttention: Long-Context Attention in Near-Linear Time”, Han et al 2023
- “LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models”, Chen et al 2023
- “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models”, Zhang et al 2023
- “Unlimiformer: Long-Range Transformers With Unlimited Length Input”, Bertsch et al 2023
- “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
- “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- “Random Feature Attention”, Peng et al 2022
- “Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
- “You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021
- “Scatterbrain: Unifying Sparse and Low-Rank Attention Approximation”, Chen et al 2021
- “Combiner: Full Attention Transformer With Sparse Computation Cost”, Ren et al 2021
- “OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
- “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021
- “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, Zhou et al 2020
- “SMYRF: Efficient Attention Using Asymmetric Clustering”, Daras et al 2020
- “FAVOR+: Rethinking Attention With Performers”, Choromanski et al 2020
- “Cluster-Former: Clustering-Based Sparse Transformer for Long-Range Dependency Encoding”, Wang et al 2020
- “DeepSpeed Sparse Attention”, Team 2020
- “BigBird: Transformers for Longer Sequences”, Zaheer et al 2020
- “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020
- “Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
- “Sparse Sinkhorn Attention”, Tay et al 2020
- “Reformer: The Efficient Transformer”, Kitaev et al 2020
- “The Reformer—Pushing the Limits of Language Modeling”, Platen 2020
- “Axial Attention in Multidimensional Transformers”, Ho et al 2019
- “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”, Li et al 2019
- “Scaling Autoregressive Video Models”, Weissenborn et al 2019
- “Adaptive Attention Span in Transformers”, Sukhbaatar et al 2019
- “Generating Long Sequences With Sparse Transformers”, Child et al 2019
- “Generative Modeling With Sparse Transformers: We’ve Developed the Sparse Transformer, a Deep Neural Network Which Sets New Records at Predicting What Comes Next in a Sequence—Whether Text, Images, or Sound. It Uses an Algorithmic Improvement of the Attention Mechanism to Extract Patterns from Sequences 30× Longer Than Possible Previously”, Child & Gray 2019
- “Star-Transformer”, Guo et al 2019
- “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
- “Image Transformer”, Parmar et al 2018
- “Constructing Transformers For Longer Sequences With Sparse Attention Methods”
- “A Deep Dive into the Reformer”
- “Optimal Transport and the Sinkhorn Transformer”
- Sort By Magic
- Miscellaneous
- Bibliography
See Also
Links
“When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024
“AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Work Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
“Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Mahdavi et al 2024
“Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023
“HyperAttention: Long-Context Attention in Near-Linear Time”, Han et al 2023
“LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models”, Chen et al 2023
“H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models”, Zhang et al 2023
“Unlimiformer: Long-Range Transformers With Unlimited Length Input”, Bertsch et al 2023
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
“Random Feature Attention”, Peng et al 2022
“Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
“You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021
“Scatterbrain: Unifying Sparse and Low-Rank Attention Approximation”, Chen et al 2021
“Combiner: Full Attention Transformer With Sparse Computation Cost”, Ren et al 2021
“OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
“Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021
“Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, Zhou et al 2020
“SMYRF: Efficient Attention Using Asymmetric Clustering”, Daras et al 2020
“FAVOR+: Rethinking Attention With Performers”, Choromanski et al 2020
“Cluster-Former: Clustering-Based Sparse Transformer for Long-Range Dependency Encoding”, Wang et al 2020
“DeepSpeed Sparse Attention”, Team 2020
“BigBird: Transformers for Longer Sequences”, Zaheer et al 2020
“Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020
“Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
“Sparse Sinkhorn Attention”, Tay et al 2020
“Reformer: The Efficient Transformer”, Kitaev et al 2020
“The Reformer—Pushing the Limits of Language Modeling”, Platen 2020
“Axial Attention in Multidimensional Transformers”, Ho et al 2019
“Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”, Li et al 2019
“Scaling Autoregressive Video Models”, Weissenborn et al 2019
“Adaptive Attention Span in Transformers”, Sukhbaatar et al 2019
“Generating Long Sequences With Sparse Transformers”, Child et al 2019
“Generative Modeling With Sparse Transformers: We’ve Developed the Sparse Transformer, a Deep Neural Network Which Sets New Records at Predicting What Comes Next in a Sequence—Whether Text, Images, or Sound. It Uses an Algorithmic Improvement of the Attention Mechanism to Extract Patterns from Sequences 30× Longer Than Possible Previously”, Child & Gray 2019
“Star-Transformer”, Guo et al 2019
“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
“Image Transformer”, Parmar et al 2018
“Constructing Transformers For Longer Sequences With Sparse Attention Methods”
“A Deep Dive into the Reformer”
“Optimal Transport and the Sinkhorn Transformer”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
- sparse-transformer clustering efficiency inductive-bias long-range attentive-architectures
- efficient-sequences
- sparse-attention lightweight-attention efficient-transformers routing-attention long-context attention-optimization
- image-transformer
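As a concrete illustration of the embedding-based ordering described above, here is a minimal sketch in Python. It assumes each annotation has already been reduced to a fixed-length embedding vector and orders them by a greedy nearest-neighbor walk starting from the newest annotation; the function name `magic_sort` and the use of plain cosine similarity are illustrative assumptions, not the site’s actual implementation.

```python
import numpy as np

def magic_sort(embeddings: np.ndarray, newest_index: int = 0) -> list[int]:
    """Greedily order items so that each annotation is followed by its most
    similar unvisited neighbor (cosine similarity), starting from the newest.
    `embeddings` is an (n_items, dim) array; returns a list of item indices."""
    # Normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = [newest_index]
    remaining = set(range(unit.shape[0])) - {newest_index}
    while remaining:
        last = unit[order[-1]]
        # Pick the unvisited annotation most similar to the one just placed.
        best = max(remaining, key=lambda i: float(unit[i] @ last))
        order.append(best)
        remaining.remove(best)
    return order

# Toy usage: 5 fake 8-dimensional annotation embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 8))
print(magic_sort(fake_embeddings))  # e.g. [0, 3, 1, 4, 2], a topic progression
```

The greedy walk yields the “progression of topics”; the labeled clusters above would then come from a separate clustering of the same embeddings, with auto-generated labels per cluster.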
Miscellaneous
Bibliography
- https://arxiv.org/abs/2406.13131: “When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024
- https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/: “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Work Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
- https://ieeexplore.ieee.org/abstract/document/10446522: “Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Mahdavi et al 2024
- https://arxiv.org/abs/2312.04927: “Zoology: Measuring and Improving Recall in Efficient Language Models”, Arora et al 2023
- https://arxiv.org/abs/2306.14048: “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models”, Zhang et al 2023
- https://arxiv.org/abs/2305.01625: “Unlimiformer: Long-Range Transformers With Unlimited Length Input”, Bertsch et al 2023
- https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
- https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- https://arxiv.org/abs/2111.12763#google: “Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
- https://arxiv.org/abs/2111.09714: “You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, Zeng et al 2021
- https://arxiv.org/abs/2110.15343#facebook: “Scatterbrain: Unifying Sparse and Low-Rank Attention Approximation”, Chen et al 2021
- https://arxiv.org/abs/2103.01075#google: “OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
- https://arxiv.org/abs/2102.03902: “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021
- https://arxiv.org/abs/2010.05315: “SMYRF: Efficient Attention Using Asymmetric Clustering”, Daras et al 2020
- https://arxiv.org/abs/2003.07853#google: “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, Wang et al 2020
- https://arxiv.org/abs/2003.05997#google: “Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
- https://arxiv.org/abs/2001.04451#google: “Reformer: The Efficient Transformer”, Kitaev et al 2020
- https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018