Deep contextualized word representations (Peters et al., 2018)
Turing-NLG: A 17-billion-parameter language model by Microsoft (Microsoft Research blog)
DeepSpeed: https://github.com/microsoft/DeepSpeed
PyTorch: https://pytorch.org/
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2020)
ZeRO & DeepSpeed blog post: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/