BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models