RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
Think before you speak: Training Language Models With Pause Tokens
Retentive Network: A Successor to Transformer for Large Language Models
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
S4: Efficiently Modeling Long Sequences with Structured State Spaces
LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
Do Long-Range Language Models Actually Use Long-Range Context?
When Attention Meets Fast Recurrence: Training SRU++ Language Models with Reduced Compute
Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation
Shortformer: Better Language Modeling using Shorter Inputs
Untangling tradeoffs between recurrence and self-attention in neural networks
Addressing Some Limitations of Transformers with Feedback Memory
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model
So I Tried out GPT-3’s Trick of Conditioning on Training Data With XLNet. While It Doesn’t Do as well as the 175B GPT-3, It Does Much Better Than the Version Which Is the Same Size As XLNet (0.4B). The Visual below Is from Their Paper on WinoGrande—I Added the Squares for XLNet.
Hutchins et al 2022, Figure 6: Transformer-XL vs. Block-Recurrent Transformer over increasing context length vs. number of long documents available to train on
Lazaridou et al 2021, Figure 3: dynamic evaluation improves temporal drift of small Transformer-XL models
https://arxiv.org/abs/2212.02475#google
https://arxiv.org/pdf/2102.01951#page=7&org=deepmind
Wikipedia Bibliography: