See Also
Links
“Fine-Tuning Pre-trained Transformers into Decaying Fast Weights”, Mao 2022 (2022-10-09)
“Simple Recurrence Improves Masked Language Models”, Lei et al 2022 (2022-05-23)
“Block-Recurrent Transformers”, Hutchins et al 2022 (2022-03-11)
“S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Gu et al 2021 (2021-10-31)
“LSSL: Combining Recurrent, Convolutional, and Continuous-time Models With Linear State-Space Layers”, Gu et al 2021 (2021-10-26)
“Do Long-Range Language Models Actually Use Long-Range Context?”, Sun et al 2021 (2021-09-19)
“Finetuning Pretrained Transformers into RNNs”, Kasai et al 2021 (2021-03-24)
“When Attention Meets Fast Recurrence: Training SRU++ Language Models With Reduced Compute”, Lei 2021 (2021-02-24)
“Shortformer: Better Language Modeling Using Shorter Inputs”, Press et al 2020 (2020-12-31)
“Untangling Tradeoffs between Recurrence and Self-attention in Neural Networks”, Kerg et al 2020 (2020-06-16)
“Addressing Some Limitations of Transformers With Feedback Memory”, Fan et al 2020 (2020-02-21)
“DEQ: Deep Equilibrium Models”, Bai et al 2019 (2019-09-03)
“XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Yang et al 2019 (2019-06-19)
“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019 (2019-01-09)
“Transformer-XL—Combining Transformers and RNNs Into a State-of-the-art Language Model”, 2019
“Universal Transformers”, Dehghani et al 2018 (2018-07-10)
“So I tried out GPT-3’s trick of conditioning on training data with XLNet. While it doesn’t do as well as the 175B GPT-3, it does much better than the version which is the same size as XLNet (0.4B). The visual below is from their paper on WinoGrande—I added the squares for XLNet.”
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2210.04243: “Fine-Tuning Pre-trained Transformers into Decaying Fast Weights”, Huanru Henry Mao
- https://arxiv.org/abs/2203.07852: “Block-Recurrent Transformers”, DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur
- https://arxiv.org/abs/2111.00396: “S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Albert Gu, Karan Goel, Christopher Ré
- https://arxiv.org/abs/2109.09115: “Do Long-Range Language Models Actually Use Long-Range Context?”, Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer