‘recurrent Transformer’ directory
- See Also
- Links
 - “ATLAS: Learning to Optimally Memorize the Context at Test Time [DeepTransformers]”, Behrouz et al 2025
 - “Scaling up Test-Time Compute With Latent Reasoning: A Recurrent Depth Approach”, Geiping et al 2025
 - “Titans: Learning to Memorize at Test Time”, Behrouz et al 2024
 - “Byte Latent Transformer (BLT): Patches Scale Better Than Tokens”, Pagnoni et al 2024
 - “Transformers Can Do Arithmetic With the Right Embeddings”, McLeish et al 2024
 - “RecurrentGemma: Moving Past Transformers for Efficient Open Language Models”, Botev et al 2024
 - “Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models”, Rannen-Triki et al 2024
 - “Transformers Are Multi-State RNNs”, Oren et al 2024
 - “Think Before You Speak: Training Language Models With Pause Tokens”, Goyal et al 2023
 - “Retentive Network: A Successor to Transformer for Large Language Models”, Sun et al 2023
 - “Block-State Transformers”, Fathi et al 2023
 - “Looped Transformers As Programmable Computers”, Giannou et al 2023
 - “FWL: Meta-Learning Fast Weight Language Models”, Clark et al 2022
 - “Fine-Tuning Pre-Trained Transformers into Decaying Fast Weights”, Mao 2022
 - “Simple Recurrence Improves Masked Language Models”, Lei et al 2022
 - “Block-Recurrent Transformers”, Hutchins et al 2022
 - “S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Gu et al 2021
 - “LSSL: Combining Recurrent, Convolutional, and Continuous-Time Models With Linear State-Space Layers”, Gu et al 2021
 - “Do Long-Range Language Models Actually Use Long-Range Context?”, Sun et al 2021
 - “Finetuning Pretrained Transformers into RNNs”, Kasai et al 2021
 - “When Attention Meets Fast Recurrence: Training SRU++ Language Models With Reduced Compute”, Lei 2021
 - “Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation”, Lazaridou et al 2021
 - “Shortformer: Better Language Modeling Using Shorter Inputs”, Press et al 2020
 - “Untangling Tradeoffs between Recurrence and Self-Attention in Neural Networks”, Kerg et al 2020
 - “Addressing Some Limitations of Transformers With Feedback Memory”, Fan et al 2020
 - “DEQ: Deep Equilibrium Models”, Bai et al 2019
 - “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Yang et al 2019
 - “Dynamic Evaluation of Transformer Language Models”, Krause et al 2019
 - “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019
 - “Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model”, Horev 2019
 - “Universal Transformers”, Dehghani et al 2018
 - “Hyperbolic Attention Networks”, Gulcehre et al 2018
 - “Improving Neural Language Models With a Continuous Cache”, Grave et al 2016
 - “Context Caching”
 - joeddav
- Miscellaneous
- Bibliography
 
See Also
Links
“ATLAS: Learning to Optimally Memorize the Context at Test Time [DeepTransformers]”, Behrouz et al 2025
“Scaling up Test-Time Compute With Latent Reasoning: A Recurrent Depth Approach”, Geiping et al 2025
“Titans: Learning to Memorize at Test Time”, Behrouz et al 2024
“Byte Latent Transformer (BLT): Patches Scale Better Than Tokens”, Pagnoni et al 2024
“Transformers Can Do Arithmetic With the Right Embeddings”, McLeish et al 2024
“RecurrentGemma: Moving Past Transformers for Efficient Open Language Models”, Botev et al 2024
“Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models”, Rannen-Triki et al 2024
“Transformers Are Multi-State RNNs”, Oren et al 2024
“Think Before You Speak: Training Language Models With Pause Tokens”, Goyal et al 2023
“Retentive Network: A Successor to Transformer for Large Language Models”, Sun et al 2023
“Block-State Transformers”, Fathi et al 2023
“Looped Transformers As Programmable Computers”, Giannou et al 2023
“FWL: Meta-Learning Fast Weight Language Models”, Clark et al 2022
“Fine-Tuning Pre-Trained Transformers into Decaying Fast Weights”, Mao 2022
“Simple Recurrence Improves Masked Language Models”, Lei et al 2022
“Block-Recurrent Transformers”, Hutchins et al 2022
“S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Gu et al 2021
“LSSL: Combining Recurrent, Convolutional, and Continuous-Time Models With Linear State-Space Layers”, Gu et al 2021
“Do Long-Range Language Models Actually Use Long-Range Context?”, Sun et al 2021
“Finetuning Pretrained Transformers into RNNs”, Kasai et al 2021
“When Attention Meets Fast Recurrence: Training SRU++ Language Models With Reduced Compute”, Lei 2021
“Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation”, Lazaridou et al 2021
“Shortformer: Better Language Modeling Using Shorter Inputs”, Press et al 2020
“Untangling Tradeoffs between Recurrence and Self-Attention in Neural Networks”, Kerg et al 2020
“Addressing Some Limitations of Transformers With Feedback Memory”, Fan et al 2020
“DEQ: Deep Equilibrium Models”, Bai et al 2019
“XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Yang et al 2019
“Dynamic Evaluation of Transformer Language Models”, Krause et al 2019
“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019
“Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model”, Horev 2019
“Universal Transformers”, Dehghani et al 2018
“Hyperbolic Attention Networks”, Gulcehre et al 2018
“Improving Neural Language Models With a Continuous Cache”, Grave et al 2016
“Context Caching”
joeddav
Miscellaneous
Bibliography
https://arxiv.org/abs/2310.02226: “Think Before You Speak: Training Language Models With Pause Tokens”
https://arxiv.org/abs/2212.02475#google: “FWL: Meta-Learning Fast Weight Language Models”
https://arxiv.org/abs/2210.04243: “Fine-Tuning Pre-Trained Transformers into Decaying Fast Weights”
https://arxiv.org/abs/2203.07852: “Block-Recurrent Transformers”
https://arxiv.org/abs/2111.00396: “S4: Efficiently Modeling Long Sequences With Structured State Spaces”
https://arxiv.org/abs/2109.09115: “Do Long-Range Language Models Actually Use Long-Range Context?”
https://arxiv.org/pdf/2102.01951#page=7&org=deepmind: “Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation”
https://arxiv.org/abs/1906.08237: “XLNet: Generalized Autoregressive Pretraining for Language Understanding”
https://arxiv.org/abs/1904.08378: “Dynamic Evaluation of Transformer Language Models”
https://arxiv.org/abs/1901.02860: “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”
https://arxiv.org/abs/1807.03819#googledeepmind: “Universal Transformers”