Bibliography (7):

Attention Is All You Need
Language Models are Unsupervised Multitask Learners
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Layer Normalization
Pointer Sentinel Mixture Models
The LAMBADA dataset: Word prediction requiring a broad discourse context
RACE: Large-scale ReAding Comprehension Dataset From Examinations