Bibliography (6):

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  2. Language Models are Unsupervised Multitask Learners

  3. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  4. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

  5. DeepSpeed: https://github.com/microsoft/DeepSpeed

  6. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models