“T5: Exploring the Limits of Transfer Learning With a Unified Text-To-Text Transformer”, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2019-10-23:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.

In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
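To make the unified framework concrete, here is a minimal sketch of how diverse tasks are cast into the same text-to-text format by prepending a task prefix to the input and treating the target as plain text. The specific prefixes and example strings below are illustrative, modeled on the kinds of examples the paper gives, not an exact reproduction:

```python
# Illustrative sketch: every task becomes (input text, target text),
# distinguished only by a task prefix on the input.
examples = [
    # translation
    ("translate English to German: That is good.", "Das ist gut."),
    # grammatical-acceptability classification (labels emitted as text)
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # abstractive summarization
    ("summarize: authorities dispatched emergency crews ...", "emergency crews dispatched ..."),
]

for inp, target in examples:
    # A single seq-to-seq model maps input text to target text for all tasks.
    print(f"input:  {inp}")
    print(f"target: {target}")
```

Because the labels themselves are emitted as text, classification, regression (via discretized numeric strings), translation, and summarization can all share one model, one loss, and one decoding procedure.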

By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus (C4), we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Table 9: Measuring the effect of repeating data during pre-training. In these experiments, we only use the first N tokens from C4 (with varying values of N shown in the first column) but still pre-train over 2³⁵ tokens. This results in the data set being repeated over the course of pre-training (with the number of repeats for each experiment shown in the second column), which may result in memorization (see Figure 6).
Number of tokens  Repeats  GLUE   CNNDM  SQuAD  SGLUE  EnDe   EnFr   EnRo
★ Full data set   0        83.28  19.24  80.88  71.36  26.98  39.82  27.65
2²⁹               64       82.87  19.19  80.97  72.03  26.83  39.74  27.63
2²⁷               256      82.62  19.20  79.78  69.97  27.02  39.71  27.33
2²⁵               1,024    79.55  18.57  76.27  64.76  26.38  39.56  26.80
2²³               4,096    76.34  18.33  70.92  59.29  26.37  38.84  25.81
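The repeat counts in the second column of Table 9 follow directly from the fixed pre-training budget: training covers 2³⁵ tokens, so a data set truncated to 2ⁿ tokens is seen 2³⁵⁻ⁿ times. A quick sketch of that arithmetic:

```python
# Pre-training covers a fixed budget of 2**35 tokens (from Table 9's setup).
PRETRAIN_TOKENS = 2 ** 35

# Truncated data set sizes from the first column of Table 9.
for n in (29, 27, 25, 23):
    dataset_tokens = 2 ** n
    repeats = PRETRAIN_TOKENS // dataset_tokens  # = 2**(35 - n)
    print(f"2^{n} tokens -> {repeats:,} repeats")
```

This reproduces the 64, 256, 1,024, and 4,096 repeats listed in the table, and makes clear why smaller truncations imply proportionally more passes over the same data.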
Figure 6: Pre-training loss for our original C4 data set as well as 4 artificially truncated versions. The sizes listed refer to the number of tokens in each data set. The 4 sizes considered correspond to repeating the data set between 64 and 4,096× over the course of pre-training. Using a smaller data set size results in smaller training loss values, which may suggest some memorization of the unlabeled data set.