Bibliography:

  1. ‘self-attention’ tag

  2. ‘compressed Transformers’ tag

  3. RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

  4. Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

  5. Transformers are Multi-State RNNs

  6. Think before you speak: Training Language Models With Pause Tokens

  7. Retentive Network: A Successor to Transformer for Large Language Models

  8. Block-State Transformers

  9. Looped Transformers as Programmable Computers

  10. FWL: Meta-Learning Fast Weight Language Models

  11. Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

  12. Simple Recurrence Improves Masked Language Models

  13. Block-Recurrent Transformers

  14. S4: Efficiently Modeling Long Sequences with Structured State Spaces

  15. LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

  16. Do Long-Range Language Models Actually Use Long-Range Context?

  17. Finetuning Pretrained Transformers into RNNs

  18. When Attention Meets Fast Recurrence: Training SRU++ Language Models with Reduced Compute

  19. Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation

  20. Shortformer: Better Language Modeling using Shorter Inputs

  21. Untangling tradeoffs between recurrence and self-attention in neural networks

  22. Addressing Some Limitations of Transformers with Feedback Memory

  23. DEQ: Deep Equilibrium Models

  24. XLNet: Generalized Autoregressive Pretraining for Language Understanding

  25. Dynamic Evaluation of Transformer Language Models

  26. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

  27. Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model

  28. Universal Transformers

  29. Hyperbolic Attention Networks

  30. Improving Neural Language Models with a Continuous Cache

  31. Context Caching

  32. So I Tried out GPT-3’s Trick of Conditioning on Training Data With XLNet. While It Doesn’t Do as well as the 175B GPT-3, It Does Much Better Than the Version Which Is the Same Size As XLNet (0.4B). The Visual below Is from Their Paper on WinoGrande—I Added the Squares for XLNet.

  33. Hutchins et al 2022 (Block-Recurrent Transformers), Figure 6: Transformer-XL vs. Block-Recurrent Transformer over increasing context length, vs. number of long documents available to train on

  34. Lazaridou et al 2021 (Mind the Gap), Figure 3: dynamic evaluation improves temporal drift of small Transformer-XL models

  35. https://reasoning-tokens.ghost.io/reasoning-tokens/

  36. https://x.com/arankomatsuzaki/status/1639000379978403853
