- See Also
- Links
- “HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
- “Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
- “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
- “MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
- “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
- “Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
- “Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
- “Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
- “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- “NAT: Neighborhood Attention Transformer”, Hassani et al 2022
- “MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “Transformer Quality in Linear Time”, Hua et al 2022
- “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
- “Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
- “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
- “Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
- “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
- “Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
- “Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
- “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
- “Global Filter Networks for Image Classification”, Rao et al 2021
- “HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
- “Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
- “A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
- “Aggregating Nested Transformers”, Zhang et al 2021
- “Pay Attention to MLPs”, Liu et al 2021
- “Fully-Connected Neural Nets”, Gwern 2021
- “MViT: Multiscale Vision Transformers”, Fan et al 2021
- “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- “Generative Adversarial Transformers”, Hudson & Zitnick 2021
- “Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
- “LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
- “CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
- “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
- “Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
- “Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
- “Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
- “Progressive Generation of Long Text”, Tan et al 2020
- “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
- “Conformer: Convolution-augmented Transformer for Speech Recognition”, Gulati et al 2020
- “Multi-scale Transformer Language Models”, Subramanian et al 2020
- “Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
- “Lite Transformer With Long-Short Range Attention”, Wu et al 2020
- “ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
- “Longformer: The Long-Document Transformer”, Beltagy et al 2020
- “BP-Transformer: Modelling Long-Range Context via Binary Partitioning”, Ye et al 2019
- “Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
- “Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
- “Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
- “A Clockwork RNN”, Koutník et al 2014
- Sort By Magic
- Miscellaneous
- Link Bibliography
See Also
Links
“HGRN: Hierarchically Gated Recurrent Neural Network for Sequence Modeling”, Qin et al 2023
“Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer”, Zhang et al 2023
“LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Ding et al 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“Landmark Attention: Random-Access Infinite Context Length for Transformers”, Mohtashami & Jaggi 2023
“MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
“Parallel Context Windows Improve In-Context Learning of Large Language Models”, Ratner et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Accurate Image Restoration With Attention Retractable Transformer (ART)”, Zhang et al 2022
“DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
“Co-Writing Screenplays and Theatre Scripts With Language Models (Dramatron): An Evaluation by Industry Professionals”, Mirowski et al 2022
“Mega: Moving Average Equipped Gated Attention”, Ma et al 2022
“Investigating Efficiently Extending Transformers for Long Input Summarization”, Phang et al 2022
“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
“NAT: Neighborhood Attention Transformer”, Hassani et al 2022
“MaxViT: Multi-Axis Vision Transformer”, Tu et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“Hierarchical Perceiver”, Carreira et al 2022
“Transformer Quality in Linear Time”, Hua et al 2022
“LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
“Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
“Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
“Restormer: Efficient Transformer for High-Resolution Image Restoration”, Zamir et al 2021
“Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
“Fastformer: Additive Attention Can Be All You Need”, Wu et al 2021
“Adaptive Multi-Resolution Attention With Linear Complexity”, Zhang et al 2021
“Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
“Global Filter Networks for Image Classification”, Rao et al 2021
“HiT: Improved Transformer for High-Resolution GANs”, Zhao et al 2021
“Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
“A Multi-Level Attention Model for Evidence-Based Fact Checking”, Kruengkrai et al 2021
“Aggregating Nested Transformers”, Zhang et al 2021
“Pay Attention to MLPs”, Liu et al 2021
“Fully-Connected Neural Nets”, Gwern 2021
“MViT: Multiscale Vision Transformers”, Fan et al 2021
“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
“Generative Adversarial Transformers”, Hudson & Zitnick 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“LazyFormer: Self Attention With Lazy Update”, Ying et al 2021
“CDLM: Cross-Document Language Modeling”, Caciularu et al 2021
“Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2020
“Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020
“Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, Hajra 2020
“Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size”, Yoshida et al 2020
“Progressive Generation of Long Text”, Tan et al 2020
“Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
“Conformer: Convolution-augmented Transformer for Speech Recognition”, Gulati et al 2020
“Multi-scale Transformer Language Models”, Subramanian et al 2020
“Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching”, Yang et al 2020
“Lite Transformer With Long-Short Range Attention”, Wu et al 2020
“ETC: Encoding Long and Structured Inputs in Transformers”, Ainslie et al 2020
“Longformer: The Long-Document Transformer”, Beltagy et al 2020
“BP-Transformer: Modelling Long-Range Context via Binary Partitioning”, Ye et al 2019
“Blockwise Self-Attention for Long Document Understanding”, Qiu et al 2019
“Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019
“Hierarchical Multiscale Recurrent Neural Networks”, Chung et al 2016
“A Clockwork RNN”, Koutník et al 2014
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, the sort uses each annotation's embedding to chain it to its nearest-neighbor annotations, creating a progression of topics (a minimal sketch of this ordering follows the tag list below). For more details, see the link.
contextual-learning
neural-attention
nested-attention
transformer-efficiency
longform
hierarchical-modeling
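The description above amounts to a greedy nearest-neighbor walk over annotation embeddings. Below is a minimal sketch of that ordering step, assuming unit-normalized embeddings and a newest-first input list; the function name `magic_sort` and all variable names are hypothetical illustrations, not the site's actual implementation (which additionally clusters and auto-labels the resulting sections).

```python
# Hypothetical sketch (not the site's actual code): greedy nearest-neighbor
# ordering of annotations by embedding, starting from the newest annotation.
import numpy as np

def magic_sort(annotations, embeddings):
    """Reorder `annotations` so each item is followed by the most similar
    not-yet-visited annotation.

    annotations: list of annotation objects, assumed pre-sorted newest-first.
    embeddings:  (n, d) array of unit-normalized embeddings, one per annotation.
    """
    if not annotations:
        return []
    order = [0]                        # start the walk at the newest annotation
    remaining = list(range(1, len(annotations)))
    while remaining:
        current = order[-1]
        # cosine similarity reduces to a dot product on unit-normalized vectors
        sims = embeddings[remaining] @ embeddings[current]
        nearest = remaining.pop(int(np.argmax(sims)))
        order.append(nearest)
    return [annotations[i] for i in order]
```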
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2307.02486#microsoft: “LongNet: Scaling Transformers to 1,000,000,000 Tokens”, Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei
- https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
- https://arxiv.org/abs/2305.16300: “Landmark Attention: Random-Access Infinite Context Length for Transformers”, Amirkeivan Mohtashami, Martin Jaggi
- https://arxiv.org/abs/2209.15001: “DiNAT: Dilated Neighborhood Attention Transformer”, Ali Hassani, Humphrey Shi
- https://arxiv.org/abs/2209.10655: “Mega: Moving Average Equipped Gated Attention”, Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer
- https://arxiv.org/abs/2206.05852: “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Ruslan Khalitov, Tong Yu, Lei Cheng, Zhirong Yang
- https://arxiv.org/abs/2204.10670: “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang
- https://arxiv.org/abs/2204.07143: “NAT: Neighborhood Attention Transformer”, Ali Hassani, Steven Walton, Jiachen Li, Shen Li, Humphrey Shi
- https://arxiv.org/abs/2112.07916#google: “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang
- https://arxiv.org/abs/2111.09883: “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021
- https://arxiv.org/abs/2110.13711#nvidia: “Hourglass: Hierarchical Transformers Are More Efficient Language Models”, Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski
- https://arxiv.org/abs/2107.02192#nvidia: “Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro
- https://arxiv.org/abs/2107.00645: “Global Filter Networks for Image Classification”, Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, Jie Zhou
- https://arxiv.org/abs/2106.07631#google: “HiT: Improved Transformer for High-Resolution GANs”, Long Zhao, Zizhao Zhang, Ting Chen, Dimitris N. Metaxas, Han Zhang
- https://arxiv.org/abs/2105.08050#google: “Pay Attention to MLPs”, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
- fc: “Fully-Connected Neural Nets”, Gwern
- https://arxiv.org/abs/2104.11227#facebook: “MViT: Multiscale Vision Transformers”, Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
- https://arxiv.org/abs/2103.14030: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
- https://arxiv.org/abs/2010.10504#google: “Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition”, Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu
- https://arxiv.org/abs/2005.08100#google: “Conformer: Convolution-augmented Transformer for Speech Recognition”, Gulati et al 2020
- https://arxiv.org/abs/2004.05150: “Longformer: The Long-Document Transformer”, Iz Beltagy, Matthew E. Peters, Arman Cohan