See Also
Links
“Fine-Tuning Pretrained Transformers into Decaying Fast Weights”, Mao 2022
“Fine-Tuning Pretrained Transformers into Decaying Fast Weights”, 2022-10-09
“Simple Recurrence Improves Masked Language Models”, Lei et al 2022
“Simple Recurrence Improves Masked Language Models”, 2022-05-23
“Block-Recurrent Transformers”, Hutchins et al 2022
“Block-Recurrent Transformers”, 2022-03-11
“S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Gu et al 2021
“S4: Efficiently Modeling Long Sequences with Structured State Spaces”, 2021-10-31
“LSSL: Combining Recurrent, Convolutional, and Continuous-time Models With Linear State-Space Layers”, Gu et al 2021
“LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers”, 2021-10-26
“Do Long-Range Language Models Actually Use Long-Range Context?”, Sun et al 2021
“Do Long-Range Language Models Actually Use Long-Range Context?”, 2021-09-19
“Finetuning Pretrained Transformers into RNNs”, Kasai et al 2021
“Finetuning Pretrained Transformers into RNNs”, 2021-03-24
“When Attention Meets Fast Recurrence: Training SRU++ Language Models With Reduced Compute”, Lei 2021
“When Attention Meets Fast Recurrence: Training SRU++ Language Models with Reduced Compute”, 2021-02-24
“Shortformer: Better Language Modeling Using Shorter Inputs”, Press et al 2020
“Shortformer: Better Language Modeling using Shorter Inputs”, 2020-12-31
“Untangling Tradeoffs between Recurrence and Self-attention in Neural Networks”, Kerg et al 2020
“Untangling tradeoffs between recurrence and self-attention in neural networks”, 2020-06-16
“Addressing Some Limitations of Transformers With Feedback Memory”, Fan et al 2020
“Addressing Some Limitations of Transformers with Feedback Memory”, 2020-02-21
“DEQ: Deep Equilibrium Models”, Bai et al 2019
“DEQ: Deep Equilibrium Models”, 2019-09-03
“XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Yang et al 2019
“XLNet: Generalized Autoregressive Pretraining for Language Understanding”, 2019-06-19
“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019
“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, 2019-01-09
“Transformer-XL—Combining Transformers and RNNs Into a State-of-the-art Language Model”, 2019
“Universal Transformers”, Dehghani et al 2018
“Universal Transformers”, 2018-07-10
“So I Tried out GPT-3’s Trick of Conditioning on Training Data With XLNet. While It Doesn’t Do as Well as the 175B GPT-3, It Does Much Better Than the Version Which Is the Same Size As XLNet (0.4B). The Visual Below Is from Their Paper on WinoGrande—I Added the Squares for XLNet.”
Miscellaneous
Link Bibliography

https://arxiv.org/abs/2210.04243: “Fine-Tuning Pretrained Transformers into Decaying Fast Weights”, Huanru Henry Mao
https://arxiv.org/abs/2203.07852: “Block-Recurrent Transformers”, DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur
https://arxiv.org/abs/2111.00396: “S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Albert Gu, Karan Goel, Christopher Ré
https://arxiv.org/abs/2109.09115: “Do Long-Range Language Models Actually Use Long-Range Context?”, Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer