BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models