Bibliography (7):

  1. RoBERTa examples in the fairseq repository: https://github.com/facebookresearch/fairseq/tree/main/examples/roberta

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  3. Training Compute-Optimal Large Language Models (Chinchilla)

  4. XLNet: Generalized Autoregressive Pretraining for Language Understanding

  5. RoBERTa: A Robustly Optimized BERT Pretraining Approach: https://arxiv.org/pdf/1907.11692.pdf

  6. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding