Bibliography (6):

  1. Peters et al., "Deep contextualized word representations," NAACL 2018 (arXiv:1802.05365)

  2. Microsoft DeepSpeed repository: https://github.com/microsoft/DeepSpeed

  3. PyTorch: https://pytorch.org/

  4. Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," SC 2020 (arXiv:1910.02054)

  5. "ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters," Microsoft Research Blog: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

  6. Wikipedia: "Long short-term memory"