State-space models can learn in-context by gradient descent
xT: Nested Tokenization for Larger Context in Large Images
A long-context language model for the generation of bacteriophage genomes
HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
Bytes Are All You Need: Transformers Operating Directly On File Bytes
Landmark Attention: Random-Access Infinite Context Length for Transformers
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Parallel Context Windows Improve In-Context Learning of Large Language Models
Structured Prompting: Scaling In-Context Learning to 1,000 Examples
Accurate Image Restoration with Attention Retractable Transformer (ART)
Co-Writing Screenplays and Theatre Scripts with Language Models (Dramatron): An Evaluation by Industry Professionals
Investigating Efficiently Extending Transformers for Long Input Summarization
ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
ViS4mer: Long Movie Clip Classification with State-Space Video Models
LongT5: Efficient Text-To-Text Transformer for Long Sequences
Simple Local Attentions Remain Competitive for Long-Context Tasks
Restormer: Efficient Transformer for High-Resolution Image Restoration
Hourglass: Hierarchical Transformers Are More Efficient Language Models
AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity
Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision
A Multi-Level Attention Model for Evidence-Based Fact Checking
Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Coordination Among Neural Modules Through a Shared Global Workspace
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries
Transformer-QL: A Step Towards Making Transformer Network Quadratically Large
Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Conformer: Convolution-augmented Transformer for Speech Recognition
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
BP-Transformer: Modeling Long-Range Context via Binary Partitioning
Hierarchical Transformers for Multi-Document Summarization
https://magenta.tensorflow.org/blog/2017/06/01/waybackprop
LongNet: Scaling Transformers to 1,000,000,000 Tokens
https://arxiv.org/abs/2307.02486#microsoft
Bytes Are All You Need: Transformers Operating Directly On File Bytes
https://arxiv.org/abs/2306.00238#apple
Co-Writing Screenplays and Theatre Scripts with Language Models (Dramatron): An Evaluation by Industry Professionals
https://arxiv.org/abs/2209.14958#deepmind
LongT5: Efficient Text-To-Text Transformer for Long Sequences
https://arxiv.org/abs/2112.07916#google
Hourglass: Hierarchical Transformers Are More Efficient Language Models
https://arxiv.org/abs/2110.13711#nvidia
Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision
https://arxiv.org/abs/2107.02192#nvidia
https://arxiv.org/abs/2106.07631#google
Pay Attention to MLPs
https://arxiv.org/abs/2105.08050#google
Multiscale Vision Transformers
https://arxiv.org/abs/2104.11227#facebook
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
https://arxiv.org/abs/2010.10504#google
Conformer: Convolution-augmented Transformer for Speech Recognition
https://arxiv.org/abs/2005.08100#google