‘multi-scale Transformers’ tag
- See Also
- Gwern
- “Fully-Connected Neural Nets”, Gwern 2021
- Links
- “State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
- “XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
- “A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
- “HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
- “Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
- “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
- “MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
- “Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
- “Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
- “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
- “Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
- “Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
- “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- “NAT: Neighborhood Attention Transformer”, Hassani et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “Transformer Quality in Linear Time”, Hua et al 2022
- “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
- “Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
- “Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
- “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
- “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
- “Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
- “AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
- “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
- “Global Filter Networks for Image Classification”, Rao et al 2021
- “HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
- “A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
- “Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
- “Aggregating Nested Transformers”, Zhang et al 2021
- “Pay Attention to MLPs”, Liu et al 2021
- “MViT: Multiscale Vision Transformers”, Fan et al 2021
- “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- “Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
- “Generative Adversarial Transformers”, Hudson & Zitnick 2021
- “LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
- “CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
- “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
- “Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
- “Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
- “Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
- “Progressive Generation of Long Text”, Tan et al 2020
- “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
- “Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
- “Multi-Scale Transformer Language Models”, Subramanian et al 2020
- “Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
- “Lite Transformer With Long-Short Range Attention”, Wu et al 2020
- “ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
- “Longformer: The Long-Document Transformer”, Beltagy et al 2020
- “BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
- “Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
- “Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
- “Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
- “A Clockwork RNN”, Koutník et al 2014
- Miscellaneous
- Bibliography
See Also
Gwern
“Fully-Connected Neural Nets”, Gwern 2021
Links
“State-Space Models Can Learn In-Context by Gradient Descent”, Sushma et al 2024
“XT: Nested Tokenization for Larger Context in Large Images”, Gupta et al 2024
“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023
“HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
“Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
“LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
“MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023
“Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
“Structured Prompting: Scaling In-Context Learning to 1,000 Examples”, Hao et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
“Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
“DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
“Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
“Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
“NAT: Neighborhood Attention Transformer”, Hassani et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
“Hierarchical Perceiver”, Carreira et al 2022
“Transformer Quality in Linear Time”, Hua et al 2022
“LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
“Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
“Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
“Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
“Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
“Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
“AdaMRA: Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
“Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
“Global Filter Networks for Image Classification”, Rao et al 2021
“HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
“A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
“Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
“Aggregating Nested Transformers”, Zhang et al 2021
“Pay Attention to MLPs”, Liu et al 2021
“MViT: Multiscale Vision Transformers”, Fan et al 2021
“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Generative Adversarial Transformers”, Hudson & Zitnick 2021
“LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
“CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
“Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
“Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
“Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
“Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
“Progressive Generation of Long Text”, Tan et al 2020
“Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
“Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
“Multi-Scale Transformer Language Models”, Subramanian et al 2020
“Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
“Lite Transformer With Long-Short Range Attention”, Wu et al 2020
“ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
“Longformer: The Long-Document Transformer”, Beltagy et al 2020
“BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
“Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
“Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
“Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
“A Clockwork RNN”, Koutník et al 2014
Miscellaneous
Bibliography
- https://arxiv.org/abs/2307.02486#microsoft: “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
- https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- https://arxiv.org/abs/2305.16300: “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
- https://arxiv.org/abs/2209.14958#deepmind: “Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
- https://arxiv.org/abs/2209.15001: “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
- https://arxiv.org/abs/2209.10655: “Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
- https://arxiv.org/abs/2206.05852: “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- https://arxiv.org/abs/2204.10670: “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- https://arxiv.org/abs/2204.07143: “NAT: Neighborhood Attention Transformer”, Hassani et al 2022
- https://arxiv.org/abs/2112.07916#google: “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
- https://arxiv.org/abs/2110.13711#nvidia: “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
- https://arxiv.org/abs/2107.02192#nvidia: “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
- https://arxiv.org/abs/2107.00645: “Global Filter Networks for Image Classification”, Rao et al 2021
- https://arxiv.org/abs/2106.07631#google: “HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
- https://arxiv.org/abs/2105.08050#google: “Pay Attention to MLPs”, Liu et al 2021
- https://arxiv.org/abs/2104.11227#facebook: “MViT: Multiscale Vision Transformers”, Fan et al 2021
- https://arxiv.org/abs/2103.14030: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- https://arxiv.org/abs/2010.10504#google: “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
- https://arxiv.org/abs/2005.08100#google: “Conformer: Convolution-Augmented Transformer for Speech Recognition”, Gulati et al 2020
- https://arxiv.org/abs/2004.05150: “Longformer: The Long-Document Transformer”, Beltagy et al 2020