Bibliography (5):

  1. Language Models are Unsupervised Multitask Learners (Radford et al., 2019)

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)

  3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)

  4. Wikipedia Bibliography:

    1. Statistical significance

    2. Cross-entropy