- https://github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- SRU: Simple Recurrent Units for Highly Parallelizable Recurrence
Wikipedia Bibliography: