Bibliography (5):
Language Models are Unsupervised Multitask Learners
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Wikipedia Bibliography:
Statistical-significance
Cross-entropy