When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models
AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What goes on inside artificial neural networks is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse.
Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution
Zoology: Measuring and Improving Recall in Efficient Language Models
HyperAttention: Long-context Attention in Near-Linear Time
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Unlimiformer: Long-Range Transformers with Unlimited Length Input
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
Combiner: Full Attention Transformer with Sparse Computation Cost
OmniNet: Omnidirectional Representations from Transformers
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Efficient Content-Based Sparse Attention with Routing Transformers
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Generative Modeling with Sparse Transformers: We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30× longer than was previously possible (a minimal sketch of this kind of sparse attention follows this list).
Constructing Transformers For Longer Sequences With Sparse Attention Methods
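As a rough illustration of the attention sparsification the Sparse Transformer blurb above alludes to, here is a minimal NumPy sketch (my own, not OpenAI's code; the names `strided_sparse_mask` and `sparse_attention` are invented for this example). Each query attends only to a local causal window plus every stride-th earlier "summary" position, so with stride ≈ √n each query touches roughly O(√n) keys instead of O(n). For clarity the sketch masks a dense score matrix; a real implementation would compute only the unmasked entries.

```python
import numpy as np

def strided_sparse_mask(n: int, stride: int) -> np.ndarray:
    """Boolean (n, n) mask: True where query position i may attend to key position j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                          # autoregressive constraint
    local = (i - j) < stride                 # recent window of `stride` tokens
    summary = (j % stride) == (stride - 1)   # every stride-th earlier position
    return causal & (local | summary)

def sparse_attention(q, k, v, stride):
    """Masked softmax attention over (n, d) arrays; dense compute, sparse pattern."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(strided_sparse_mask(n, stride), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage on random data.
rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = sparse_attention(q, k, v, stride=4)
print(out.shape)  # (16, 8)
```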
[Figure 1b, Tay et al. 2022: compute–performance overview of 10 diverse neural-network architectures by downstream accuracy, showing a wide spread and convergence at scale]
[Figure 2, Tay et al. 2022: worse scaling of all variant architectures compared to the original simple Transformer]
[Figure 1, Jaszczur et al. 2021: log-perplexity of Scaling Transformers on the C4 dataset vs. baselines]
https://www.lesswrong.com/posts/kzc3qNMsP2xJcxhGn/gated-attention-blocks-preliminary-progress-toward-removing-1
When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models
AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What goes on inside artificial neural networks is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse.
https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/
Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution
https://ieeexplore.ieee.org/abstract/document/10446522
Zoology: Measuring and Improving Recall in Efficient Language Models
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Unlimiformer: Long-Range Transformers with Unlimited Length Input
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
https://arxiv.org/abs/2207.10551#google
https://arxiv.org/abs/2111.12763#google
You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
https://arxiv.org/abs/2110.15343#facebook
OmniNet: Omnidirectional Representations from Transformers
https://arxiv.org/abs/2103.01075#google
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
https://arxiv.org/abs/2003.07853#google
Efficient Content-Based Sparse Attention with Routing Transformers
https://arxiv.org/abs/2003.05997#google
https://arxiv.org/abs/2001.04451#google