Bibliography (3):

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  2. XLNet: Generalized Autoregressive Pretraining for Language Understanding

  3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context