Bibliography (7):

  1. Deep contextualized word representations (Peters et al., 2018)

  2. Turing-NLG: A 17-billion-parameter language model by Microsoft (Microsoft Research Blog)

  3. DeepSpeed (GitHub repository): https://github.com/microsoft/DeepSpeed

  4. PyTorch: https://pytorch.org/

  5. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2019)

  6. ZeRO & DeepSpeed announcement (Microsoft Research Blog): https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

  7. Wikipedia: Long short-term memory