Language model pretraining has led to large performance gains, but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have a large impact on the final results.
We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. [and switches to byte-level BPEs]
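The BPE switch mentioned in the note can be illustrated with a minimal sketch of byte-pair-encoding merge learning. This is not the exact GPT-2/RoBERTa tokenizer (which adds regex pre-tokenization and a learned vocabulary file); it only shows the core idea: start from raw UTF-8 bytes, so no unknown tokens are ever needed, then repeatedly merge the most frequent adjacent pair into a new token id.

```python
from collections import Counter

def byte_pair_merges(text: str, num_merges: int):
    """Learn BPE merges over the UTF-8 bytes of `text`.

    Illustrative sketch only, assuming a single training string;
    real implementations train over a corpus with pre-tokenization.
    """
    # Byte-level start: token ids 0-255 are the 256 possible byte values.
    seq = list(text.encode("utf-8"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append(best)
        new_id = 256 + len(merges) - 1  # new ids start after the byte range
        # Replace every occurrence of the best pair with the new token id.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges
```

For example, on the string `"aaabdaaabac"` the first learned merge is the byte pair `(97, 97)` (i.e. `"aa"`), shrinking the sequence from 11 tokens to 9.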
We find that BERT was undertrained, and can match or exceed the performance of every model published after it. [Main ingredient: increasing the dataset size 10×, 16GB → 160GB of text; cf. Chinchilla]
Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows above. RoBERTa matches the architecture and training objective of BERT-Large. Results for BERT-Large and XLNet-Large are from Devlin et al. (2019) and Yang et al. (2019), respectively. Complete results on all GLUE tasks can be found in the Appendix.
Our best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.