Bibliography (7):

  1. Attention Is All You Need

  2. Language Models are Unsupervised Multitask Learners

  3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  4. Layer Normalization

  5. Pointer Sentinel Mixture Models

  6. The LAMBADA dataset: Word prediction requiring a broad discourse context

  7. RACE: Large-scale ReAding Comprehension Dataset From Examinations