“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, 2019-07-26:

Language model pretraining has led to large performance gains, but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have a large impact on the final results.

We present a replication study of BERT pretraining (Devlin et al. 2019) that carefully measures the impact of many key hyperparameters and training data size. [and switches to byte-level BPEs]
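[The BPE switch can be illustrated with a toy sketch. Below is a minimal, hand-rolled version of the BPE merge-learning loop on character symbols; the names (`bpe_merges`, the toy word-frequency dict) are illustrative only, and RoBERTa's actual tokenizer is GPT-2's byte-level BPE with a ~50K vocabulary, which operates on raw bytes rather than characters.]

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a word-frequency dict (toy sketch).

    Each word starts as a tuple of single-character symbols; each round
    merges the most frequent adjacent symbol pair across the corpus.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus: the learned merges pick up frequent suffix fragments.
merges = bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

[Byte-level BPE guarantees no out-of-vocabulary tokens, since any string decomposes into bytes; this is what lets RoBERTa drop BERT's character-level WordPiece vocabulary without preprocessing heuristics.]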

We find that BERT was significantly undertrained, and that it can match or exceed the performance of every model published after it. [Main ingredient: increasing the dataset size 10×, from 16GB → 160GB of text; cf. Chinchilla.]

Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows above. RoBERTa matches the architecture and training objective of BERT-Large. Results for BERT-Large and XLNet-Large are from Devlin et al. 2019 and Yang et al. 2019, respectively. Complete results on all GLUE tasks can be found in the Appendix.

Our best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.

We release our models and code.