Bibliography:

  1. Efficient Attention: Breaking The Quadratic Transformer Bottleneck

  2. ‘Transformer’ tag

  3. ‘compressed Transformers’ tag

  4. ‘multi-scale Transformers’ tag

  5. ‘Transformer matrix optimizations’ tag

  6. ‘recurrent Transformers’ tag

  7. ‘sparse Transformers’ tag

  8. ‘retrieval AI’ tag

  9. ‘RNN’ tag

  10. ‘LM tokenization’ tag

  11. ‘video generation’ tag

  12. Absolute Unit NNs: Regression-Based MLPs for Everything

  13. Research Ideas

  14. GPT-3 Creative Fiction

  15. Efficient Attention: Breaking The Quadratic Transformer Bottleneck

  16. Hymba: A Hybrid-head Architecture for Small Language Models

  17. Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

  18. Long Context RAG Performance of Large Language Models

  19. Ask, and it shall be given: Turing completeness of prompting

  20. Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

  21. Differential Transformer

  22. Were RNNs All We Needed?

  23. nGPT: Normalized Transformer with Representation Learning on the Hypersphere

  24. Masked Mixers for Language Generation and Retrieval

  25. The Mamba in the Llama: Distilling and Accelerating Hybrid Models

  26. When Can Transformers Count to n?

  27. What Matters in Transformers? Not All Attention is Needed

  28. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

  29. An Empirical Study of Mamba-based Language Models

  30. Attention as a Hypernetwork

  31. Scalable Matmul-free Language Modeling

  32. A Theoretical Understanding of Self-Correction through In-context Alignment

  33. Attention as an RNN

  34. Your Transformer is Secretly Linear

  35. Retrieval Head Mechanistically Explains Long-Context Factuality

  36. Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models

  37. Towards smaller, faster decoder-only transformers: Architectural variants and their implications

  38. ReFT: Representation Finetuning for Language Models

  39. Do language models plan ahead for future tokens?

  40. Streamlining Redundant Layers to Compress Large Language Models

  41. Long-form factuality in large language models

  42. Mechanistic Design and Scaling of Hybrid Architectures

  43. 8 Google Employees Invented Modern AI. Here’s the Inside Story: They met by chance, got hooked on an idea, and wrote the Transformers paper—the most consequential tech breakthrough in recent history

  44. How Well Can Transformers Emulate In-context Newton’s Method?

  45. RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval

  46. A phase transition between positional and semantic learning in a solvable model of dot-product attention

  47. Rethinking Patch Dependence for Masked Autoencoders

  48. Attention versus Contrastive Learning of Tabular Data—A Data-centric Benchmarking

  49. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

  50. SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

  51. Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models

  52. Can a Transformer Represent a Kalman Filter?

  53. Efficient Transformer Knowledge Distillation: A Performance Review

  54. Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

  55. In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering

  56. On prefrontal working memory and hippocampal episodic memory: Unifying memories stored in weights and activation slots

  57. LSS Transformer: Ultra-Long Sequence Distributed Transformer

  58. Simplifying Transformer Blocks

  59. GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

  60. Not all layers are equally as important: Every Layer Counts BERT

  61. Implicit Chain-of-Thought Reasoning via Knowledge Distillation

  62. Training Dynamics of Contextual N-Grams in Language Models

  63. The Impact of Depth and Width on Transformer Language Model Generalization

  64. Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

  65. Characterizing Mechanisms for Factual Recall in Language Models

  66. Linear Representations of Sentiment in Large Language Models

  67. Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages

  68. How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

  69. Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors

  70. Vision Transformers Need Registers

  71. Interpret Vision Transformers as ConvNets with Dynamic Convolutions

  72. Replacing softmax with ReLU in Vision Transformers

  73. One Wide Feedforward is All You Need

  74. Activation Addition: Steering Language Models Without Optimization

  75. Linearity of Relation Decoding in Transformer Language Models

  76. The Hydra Effect: Emergent Self-repair in Language Model Computations

  77. Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

  78. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

  79. One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

  80. Lost in the Middle: How Language Models Use Long Contexts

  81. Trainable Transformer in Transformer

  82. Transformers learn to implement preconditioned gradient descent for in-context learning

  83. White-Box Transformers via Sparse Rate Reduction

  84. Blockwise Parallel Transformer for Long Context Large Models

  85. TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models

  86. Brainformers: Trading Simplicity for Efficiency

  87. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

  88. Mimetic Initialization of Self-Attention Layers

  89. Toeplitz Neural Network for Sequence Modeling

  90. Finding Neurons in a Haystack: Case Studies with Sparse Probing

  91. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

  92. Coinductive guide to inductive transformer heads

  93. Tighter Bounds on the Expressivity of Transformer Encoders

  94. Tracr: Compiled Transformers as a Laboratory for Interpretability

  95. Skip-Attention: Improving Vision Transformers by Paying Less Attention

  96. Hungry Hungry Hippos: Towards Language Modeling with State Space Models

  97. Scalable Adaptive Computation for Iterative Generation

  98. Pretraining Without Attention

  99. Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers

  100. Transformers learn in-context by gradient descent

  101. What learning algorithm is in-context learning? Investigations with linear models

  102. Efficiently Scaling Transformer Inference

  103. Transformers Learn Shortcuts to Automata

  104. Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling

  105. Transformers Implement First-Order Logic with Majority Quantifiers

  106. Relaxed Attention for Transformer Models

  107. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

  108. Multitrack Music Transformer: Learning Long-Term Dependencies in Music with Diverse Instruments

  109. N-Grammer: Augmenting Transformers with latent n-grams

  110. Log-Precision Transformers are Constant-Depth Uniform Threshold Circuits

  111. Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules

  112. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  113. TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

  114. Overcoming a Theoretical Limitation of Self-Attention

  115. It’s Raw! Audio Generation with State-Space Models

  116. General-purpose, long-context autoregressive modeling with Perceiver AR

  117. Transformer Memory as a Differentiable Search Index

  118. The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention

  119. Attention Approximates Sparse Distributed Memory

  120. An Explanation of In-context Learning as Implicit Bayesian Inference

  121. Long-Range Transformers for Dynamic Spatiotemporal Forecasting

  122. Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation

  123. Do Vision Transformers See Like Convolutional Neural Networks?

  124. Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

  125. RASP: Thinking Like Transformers

  126. On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

  127. SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

  128. Not All Images are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition

  129. Less is More: Pay Less Attention in Vision Transformers

  130. FNet: Mixing Tokens with Fourier Transforms

  131. Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

  132. RoFormer: Enhanced Transformer with Rotary Position Embedding

  133. ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

  134. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

  135. Do Transformer Modifications Transfer Across Implementations and Applications?

  136. Linear Transformers Are Secretly Fast Weight Programmers

  137. Unlocking Pixels for Reinforcement Learning via Implicit Attention

  138. Transformer Feed-Forward Layers Are Key-Value Memories

  139. AdnFM: An Attentive DenseNet based Factorization Machine for CTR Prediction

  140. Inductive Biases for Deep Learning of Higher-Level Cognition

  141. Long Range Arena (LRA): A Benchmark for Efficient Transformers

  142. Current Limitations of Language Models: What You Need is Retrieval

  143. Efficient Transformers: A Survey

  144. HiPPO: Recurrent Memory with Optimal Polynomial Projections

  145. Pre-training via Paraphrasing

  146. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

  147. GPT-3: Language Models are Few-Shot Learners

  148. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  149. Synthesizer: Rethinking Self-Attention in Transformer Models

  150. PowerNorm: Rethinking Batch Normalization in Transformers

  151. REALM: Retrieval-Augmented Language Model Pre-Training

  152. Rethinking Attention With Performers

  153. Dynamic Convolution: Attention over Convolution Kernels

  154. Generalization through Memorization: Nearest Neighbor Language Models

  155. Multiplicative Interactions and Where to Find Them

  156. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

  157. Large Memory Layers with Product Keys

  158. What Does BERT Look At? An Analysis of BERT’s Attention

  159. Are 16 Heads Really Better than One?

  160. Pay Less Attention with Lightweight and Dynamic Convolutions

  161. On the Turing Completeness of Modern Neural Network Architectures

  162. Music Transformer

  163. Character-Level Language Modeling with Deeper Self-Attention

  164. Attention Is All You Need

  165. A Deep Reinforced Model for Abstractive Summarization

  166. Get To The Point: Summarization with Pointer-Generator Networks

  167. RAM: Dynamic Computational Time for Visual Attention

  168. Hybrid computing using a neural network with dynamic external memory

  169. Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

  170. Modeling Human Reading with Neural Attention

  171. Iterative Alternating Neural Attention for Machine Reading

  172. Adaptive Computation Time for Recurrent Neural Networks

  173. Foveation-based Mechanisms Alleviate Adversarial Examples

  174. Generating Images from Captions with Attention

  175. DRAW: A Recurrent Neural Network For Image Generation

  176. Neural Turing Machines

  177. Neural Machine Translation by Jointly Learning to Align and Translate

  178. On Learning Where To Look

  179. Generating Sequences With Recurrent Neural Networks

  180. Efficient Transformers: A Survey § Table 1

  181. Attention and Augmented Recurrent Neural Networks

  182. Hierarchical Object Detection With Deep Reinforcement Learning

  184. The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)

  186. 100M Token Context Windows

  187. Learning to Combine Foveal Glimpses With a Third-Order Boltzmann Machine

  189. Show, Attend and Tell: Neural Image Caption Generation With Visual Attention

  191. Recurrent Models of Visual Attention

  193. Can Active Memory Replace Attention?

  195. Dzmitry Bahdanau

  196. Scaling Automatic Neuron Description

  197. Monitor: An AI-Driven Observability Interface

  198. A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer

  200. FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision

  202. 2023-09-08-charlesfoster-aunn-variantwithcausaldecoderattention.jpg

  203. 2023-trockman-figure2-attentionmappatternsbyinitializationandleveloftrainingshowpriors.png

  204. 2023-trockman-figure7-gpt2attentionmatrixpatterns.png

  205. 2022-hassid-figure3-largertransformersmakemoreuseofattentionwhennablatedtomlbenchmarkperformance.jpg

  206. 2022-tay-figure4-scalingofmodelbydepth.jpg

  207. 2022-tay-figure5-scalingofmodelbymlpfeedforwardparameters.jpg

  208. 2020-08-11-gwern-meme-twoastronauts-hopfieldnetworksareallyouneed.jpg

  209. 2020-longrangearena-figure3-performancefrontier.jpg

  210. 2020-tay-figure2-efficientattentiontaxonomy.png

  211. 2020-tay-table1-efficienttransformermodels.png

  212. https://bbycroft.net/llm

  213. https://bclarkson-code.github.io/posts/llm-from-scratch-scalar-autograd/post.html

  215. https://e2eml.school/transformers.html

  217. https://github.com/haizelabs/thorn-in-haizestack

  219. https://github.com/montemac/activation_additions

  220. https://lilianweng.github.io/posts/2018-06-24-attention/

  222. https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention

  224. https://mehta-rohan.com/writings/blog_posts/attention.html

  226. https://nian.llmonpy.ai/

  228. https://nostalgebraist.tumblr.com/post/740164510909890560/information-flow-in-transformers

  230. https://shyam.blog/posts/beyond-self-attention/

  232. https://vgel.me/posts/handmade-transformer/

  234. https://vgel.me/posts/representation-engineering/

  236. https://www.anthropic.com/index/100k-context-windows

  238. https://www.anthropic.com/news/claude-2-1-prompting

  239. https://www.beren.io/2024-03-03-Linear-Attention-as-Iterated-Hopfield-Networks/

  241. https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/

  243. https://www.lesswrong.com/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms

  244. https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/takeaways-from-a-mechanistic-interpretability-project-on

  245. https://www.lesswrong.com/posts/K7AyY8LMrcKhwfbyj/no-really-attention-is-all-you-need-attention-can-do

  246. https://www.lesswrong.com/posts/euam65XjigaCJQkcN/an-analogy-for-understanding-transformers

  247. https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

  249. https://www.lesswrong.com/posts/kobJymvvcvhbjWFKe/laying-the-foundations-for-vision-and-multimodal-mechanistic

  250. https://www.lesswrong.com/posts/nuJFTS5iiJKT5G5yh/polysemantic-attention-head-in-a-4-layer-transformer

  251. https://www.lesswrong.com/posts/thePw6qdyabD8XR4y/interpreting-openai-s-whisper

  252. https://www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only

  253. https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  254. https://www.perfectlynormal.co.uk/blog-induction-heads-illustrated

  255. https://x.com/BrendanBycroft/status/1731042957149827140

  256. https://x.com/GregKamradt/status/1722386725635580292

  257. https://x.com/LouisKnightWebb/status/1724510794514157668

  258. https://x.com/arankomatsuzaki/status/1622666312219598864

  259. https://x.com/karpathy/status/1864023344435380613

  260. https://x.com/mathemagic1an/status/1636121914849792000

  261. https://x.com/swyx/status/1722441535235768372
