Bibliography (6):

  1. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  3. GPT-3: Language Models are Few-Shot Learners