Bibliography (6):
Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPT-3: Language Models are Few-Shot Learners
Wikipedia Bibliography:
Stochastic gradient descent
https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam
Variance
https://en.wikipedia.org/wiki/Variance