Bibliography (8):
Attention Is All You Need
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Language Models are Unsupervised Multitask Learners
Wikipedia Bibliography:
Transformer (deep learning architecture)
Neural architecture search (https://en.wikipedia.org/wiki/Neural_architecture_search)
Pareto front
Nvidia
OpenAI § GPT-2 (https://en.wikipedia.org/wiki/OpenAI#GPT-2)