Bibliography (3):

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  2. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  3. Turing-NLG: A 17-billion-parameter language model by Microsoft