Bibliography (5):

  1. Language Models are Unsupervised Multitask Learners (Radford et al., 2019)

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)

  3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)

  4. Wikipedia Bibliography:

    1. Statistical significance

    2. Cross-entropy