Bibliography:

  1. ‘self-attention’ tag

  2. Fully-Connected Neural Nets

  3. State-space models can learn in-context by gradient descent

  4. xT: Nested Tokenization for Larger Context in Large Images

  5. A long-context language model for the generation of bacteriophage genomes

  6. HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling

  7. Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

  8. LongNet: Scaling Transformers to 1,000,000,000 Tokens

  9. Bytes Are All You Need: Transformers Operating Directly On File Bytes

  10. Landmark Attention: Random-Access Infinite Context Length for Transformers

  11. MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

  12. Parallel Context Windows Improve In-Context Learning of Large Language Models

  13. Structured Prompting: Scaling In-Context Learning to 1,000 Examples

  14. Efficient Transformers with Dynamic Token Pooling

  15. Accurate Image Restoration with Attention Retractable Transformer (ART)

  16. Co-Writing Screenplays and Theatre Scripts with Language Models (Dramatron): An Evaluation by Industry Professionals

  17. DiNAT: Dilated Neighborhood Attention Transformer

  18. Mega: Moving Average Equipped Gated Attention

  19. Investigating Efficiently Extending Transformers for Long Input Summarization

  20. ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths

  21. Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention

  22. NAT: Neighborhood Attention Transformer

  23. ViS4mer: Long Movie Clip Classification with State-Space Video Models

  24. MaxViT: Multi-Axis Vision Transformer

  25. Hierarchical Perceiver

  26. Transformer Quality in Linear Time

  27. LongT5: Efficient Text-To-Text Transformer for Long Sequences

  28. Simple Local Attentions Remain Competitive for Long-Context Tasks

  29. Restormer: Efficient Transformer for High-Resolution Image Restoration

  30. Swin Transformer V2: Scaling Up Capacity and Resolution

  31. Hourglass: Hierarchical Transformers Are More Efficient Language Models

  32. Fastformer: Additive Attention Can Be All You Need

  33. AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity

  34. Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision

  35. Global Filter Networks for Image Classification

  36. HiT: Improved Transformer for High-Resolution GANs

  37. A Multi-Level Attention Model for Evidence-Based Fact Checking

  38. Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

  39. Aggregating Nested Transformers

  40. Pay Attention to MLPs

  41. MViT: Multiscale Vision Transformers

  42. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  43. Coordination Among Neural Modules Through a Shared Global Workspace

  44. Generative Adversarial Transformers

  45. LazyFormer: Self Attention with Lazy Update

  46. CDLM: Cross-Document Language Modeling

  47. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

  48. Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries

  49. Transformer-QL: A Step Towards Making Transformer Network Quadratically Large

  50. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

  51. Progressive Generation of Long Text

  52. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

  53. Conformer: Convolution-augmented Transformer for Speech Recognition

  54. Multi-scale Transformer Language Models

  55. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

  56. Lite Transformer with Long-Short Range Attention

  57. ETC: Encoding Long and Structured Inputs in Transformers

  58. Longformer: The Long-Document Transformer

  59. BP-Transformer: Modeling Long-Range Context via Binary Partitioning

  60. Blockwise Self-Attention for Long Document Understanding

  61. Hierarchical Transformers for Multi-Document Summarization

  62. Hierarchical Multiscale Recurrent Neural Networks

  63. A Clockwork RNN

  64. https://magenta.tensorflow.org/blog/2017/06/01/waybackprop

  65. https://x.com/IntuitMachine/status/1722727424947859896
