- https://github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- SRU: Simple Recurrent Units for Highly Parallelizable Recurrence
Wikipedia Bibliography: