Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus (C4), we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
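The text-to-text framing described above can be illustrated with a small sketch. The task prefixes below ("translate English to German:", "summarize:", "cola sentence:") follow the paper's examples, but the helper function itself is hypothetical, shown only to make the input/target convention concrete:

```python
def to_text_to_text(task, **fields):
    """Hypothetical helper: cast an NLP task into a (input text, target text)
    pair by prepending a task prefix, in the style of the text-to-text framework."""
    if task == "translate_en_de":
        return (f"translate English to German: {fields['source']}",
                fields["target"])
    if task == "summarize":
        return (f"summarize: {fields['document']}", fields["summary"])
    if task == "cola":
        # Classification targets are emitted as literal label words,
        # so every task shares one decoder vocabulary.
        return (f"cola sentence: {fields['sentence']}",
                "acceptable" if fields["label"] == 1 else "unacceptable")
    raise ValueError(f"unknown task: {task}")

inp, tgt = to_text_to_text("translate_en_de",
                           source="That is good.", target="Das ist gut.")
print(inp)  # translate English to German: That is good.
print(tgt)  # Das ist gut.
```

Because every task is reduced to mapping one string to another, the same model, loss, and decoding procedure can be reused across translation, summarization, and classification.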
Table 9: Measuring the effect of repeating data during pre-training. In these experiments, we only use the first N tokens from C4 (with varying values of N shown in the first column) but still pre-train over 2^35 tokens. This results in the data set being repeated over the course of pre-training (with the number of repeats for each experiment shown in the second column), which may result in memorization (see Figure 6).
| Number of tokens | Repeats | GLUE  | CNNDM | SQuAD | SGLUE | EnDe  | EnFr  | EnRo  |
|------------------|---------|-------|-------|-------|-------|-------|-------|-------|
| ★ Full data set  | 0       | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 2^29             | 64      | 82.87 | 19.19 | 80.97 | 72.03 | 26.83 | 39.74 | 27.63 |
| 2^27             | 256     | 82.62 | 19.20 | 79.78 | 69.97 | 27.02 | 39.71 | 27.33 |
| 2^25             | 1,024   | 79.55 | 18.57 | 76.27 | 64.76 | 26.38 | 39.56 | 26.80 |
| 2^23             | 4,096   | 76.34 | 18.33 | 70.92 | 59.29 | 26.37 | 38.84 | 25.81 |
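The repeat counts in the second column follow directly from the caption's arithmetic: pre-training covers 2^35 tokens total, so a data set truncated to 2^k tokens is seen 2^(35−k) times. A minimal check (variable names are illustrative):

```python
# Total tokens processed during pre-training, per the Table 9 caption.
TOTAL_PRETRAIN_TOKENS = 2 ** 35

def num_repeats(dataset_tokens: int) -> int:
    """Number of full passes over a truncated data set during pre-training."""
    return TOTAL_PRETRAIN_TOKENS // dataset_tokens

for exp in (29, 27, 25, 23):
    print(f"2^{exp} tokens -> {num_repeats(2 ** exp):>5,} repeats")
# 2^29 tokens ->    64 repeats
# 2^27 tokens ->   256 repeats
# 2^25 tokens -> 1,024 repeats
# 2^23 tokens -> 4,096 repeats
```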
Figure 6: Pre-training loss for our original C4 data set as well as 4 artificially truncated versions. The sizes listed refer to the number of tokens in each data set. The 4 sizes considered correspond to repeating the data set 64 to 4,096 times over the course of pre-training. Using a smaller data set size results in smaller training loss values, which may suggest some memorization of the unlabeled data set.