Finetuning language models on a collection of datasets phrased as instructions (instruction finetuning) has been shown to improve model performance and generalization to unseen tasks.
In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU.
We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B.
Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
…In this paper we advance instruction finetuning in several ways. First, we study the impact of scaling on instruction finetuning. Our experiments show that instruction finetuning scales well with both the number of tasks and the size of the model; their respective scaling behaviors suggest that future research should scale up both even further. Second, we study the effect of finetuning on the ability of models to perform reasoning tasks. Our experiments show that whereas prior instruction finetuning methods that do not include chain-of-thought (CoT; Wei et al., 2022b) severely degrade performance on CoT evaluations, adding just 9 CoT datasets into the finetuning mixture enables better performance on all evaluations.
Based on these findings, we train Flan-PaLM by using a 540B-parameter model, increasing the number of finetuning tasks to 1.8K, and including CoT data. Flan-PaLM outperforms PaLM, achieving new state-of-the-art on several benchmarks. For instance, Flan-PaLM’s improved reasoning abilities enable it to leverage CoT and self-consistency (Wang et al., 2022c) to achieve 75.2% on Massive Multi-task Language Understanding (MMLU; Hendrycks et al., 2020). Flan-PaLM also has improved multilingual abilities compared to PaLM, such as a 14.9% absolute improvement on one-shot TyDiQA (Clark et al., 2020) and 8.1% on arithmetic reasoning in under-represented languages (Shi et al., 2022). In human rater evaluations, Flan-PaLM substantially outperforms PaLM on a challenging set of open-ended generation questions, suggesting improved usability. Moreover, we found that instruction finetuning also improves performance across several responsible AI evaluation benchmarks.
In addition, we also instruction-finetune Flan-T5 models (80M to 11B). These checkpoints have strong zero-shot, few-shot, and CoT abilities, outperforming prior public checkpoints such as T5 (Raffel et al., 2020). For example, Flan-T5 11B outperforms T5 11B by double-digit improvements and even outperforms PaLM 62B on some challenging BIG-Bench tasks (Srivastava et al., 2022). Overall, our results underscore how instruction finetuning can improve performance across a range of models, prompting setups, and evaluation tasks.
Table 1: Average 5-shot MMLU scores (%) for 57 tasks with model and human accuracy comparisons (Hendrycks et al., 2020). Forecasts were made in July 2022 by competitive human forecasters, regarding a single model (Steinhardt, 2021); see Hypermind & Metaculus. CoT + SC: chain-of-thought prompting with self-consistency (Wang et al., 2022b).
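The "CoT + SC" setting samples several chain-of-thought completions at nonzero temperature and majority-votes their final answers. A minimal sketch of the voting step (the sampler here is a toy stand-in for a real model call, not the paper's actual pipeline):

```python
from collections import Counter

def self_consistency(sample_cot_answer, question, n_samples=40):
    """Majority-vote over sampled chain-of-thought final answers.

    `sample_cot_answer` stands in for one temperature-sampled model
    call that returns only the final answer string.
    """
    answers = [sample_cot_answer(question) for _ in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Toy sampler: 3 of 5 sampled reasoning paths agree on "8".
fake_paths = iter(["8", "6", "8", "8", "7"])
print(self_consistency(lambda q: next(fake_paths), "2 + 2*3 = ?", n_samples=5))
# majority answer: "8"
```

The intuition is that many distinct reasoning paths converging on the same answer is stronger evidence than one greedy decode.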
…In this paper we scale to 1,836 finetuning tasks by combining 4 mixtures from prior work: Muffin, T0-SF, NIV2, and CoT, as summarized in Figure 2:
Figure 2: Our finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks. Details for the tasks used in this paper are given in Appendix F.
Table 2: Across several models, instruction finetuning costs only a small amount of compute relative to pre-training. T5: Raffel et al., 2020. PaLM and cont-PaLM (also known as PaLM 62B at 1.3T tokens): Chowdhery et al., 2022. U-PaLM: Tay et al., 2022b.
…3. Scaling to 540B parameters and 1.8K tasks:
Figure 4 shows the joint effect of scaling these two variables on the normalized average of held-out benchmarks; individual benchmark results are reported in Table 3. First, we see that for all three model sizes shown, multi-task instruction finetuning improves performance by a large margin compared to no finetuning, with gains ranging from 9.4% to 15.5%.
Second, increasing the number of finetuning tasks improves performance, although the majority of the improvement comes from using up to 282 tasks. There are two potential explanations for the small gain after 282 tasks. One is that the additional tasks are not particularly diverse, and so they do not provide the model with new knowledge. Another is that most of the gains from multi-task instruction finetuning come from the model learning to better express knowledge it already acquired during pretraining, so more than 282 tasks adds little. This second explanation is plausible given that the pre-training data consists of 780B tokens, while instruction finetuning uses only 1.4B tokens (0.2% of the pre-training tokens).
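The 0.2% figure follows directly from the two token counts quoted above:

```python
pretrain_tokens = 780e9   # PaLM pre-training tokens (from the text)
finetune_tokens = 1.4e9   # instruction-finetuning tokens (from the text)

share = finetune_tokens / pretrain_tokens
print(f"{share:.2%}")     # 0.18%, i.e. roughly the 0.2% quoted above
```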
Finally, we see that increasing model scale by an order of magnitude (i.e., 8B → 62B or 62B → 540B) improves performance substantially for both finetuned and non-finetuned models. Note that it is not straightforward to say whether instruction finetuning helps small or large models more (compared with the no-finetuning baseline). For example, although the absolute gain was larger for the 8B model than for the 540B model (15.5% for 8B vs. 9.4% for 540B), the relative reduction in error rate was larger for the 540B model (18.4% for 540B vs. 16.6% for 8B).
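The two metrics can be related with a short calculation. The baseline scores below are not taken from a table in the text; they are back-solved from the quoted absolute gains and relative reductions, so treat them as approximate assumptions for illustration:

```python
def relative_error_reduction(baseline_acc, finetuned_acc):
    """Fraction of the baseline error eliminated by finetuning."""
    before, after = 1 - baseline_acc, 1 - finetuned_acc
    return (before - after) / before

# Baselines back-solved from the quoted numbers (assumed, approximate):
# 8B:   ~6.6% -> 22.1%  (absolute gain 15.5%)
# 540B: ~48.9% -> 58.3% (absolute gain 9.4%)
print(f"{relative_error_reduction(0.066, 0.221):.1%}")  # ~16.6%
print(f"{relative_error_reduction(0.489, 0.583):.1%}")  # ~18.4%
```

The 540B model starts from a much lower error rate, so the same-order absolute gain removes a larger fraction of the remaining error.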
Plotting such scaling curves provides insights into how scaling the model size and the number of tasks even further might improve performance. Scaling model size by another order of magnitude (though challenging) is expected to provide substantial performance gains. Scaling the number of finetuning tasks should also improve performance, though likely only incrementally. Overall, the scaling curves indicate that future work should continue scaling instruction finetuning… Moreover, the margin of improvement for instruction finetuning versus models without finetuning does not seem to decrease, which suggests that instruction finetuning will likely continue to be meaningful for future models.
Figure 4: Scaling behavior of multi-task instruction finetuning with respect to model size (# parameters) and number of finetuning tasks. The x-axes are log scale. The benchmark suites are MMLU (57 tasks), BBH (23 tasks), TydiQA (8 languages), and MGSM (10 languages). The evaluation metric on all 4 benchmark suites is few-shot prompted accuracy (exact match), where we take an unweighted average over all tasks. As an aggregate metric we report the normalized average of MMLU-direct, MMLU-CoT, BBH-direct, BBH-CoT, TydiQA, and MGSM. These evaluation benchmarks are held-out (not included in the finetuning data).
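Per the caption, each suite's score is an unweighted (macro) average over its tasks, and the aggregate averages the six per-setting scores. A sketch with hypothetical per-task accuracies (the paper's exact normalization step is not spelled out here, so this shows only the macro-averaging structure):

```python
def suite_score(task_accuracies):
    # Unweighted (macro) average over a suite's tasks, as in the caption.
    return sum(task_accuracies) / len(task_accuracies)

# Hypothetical per-task accuracies for two of the six settings.
mmlu_direct = [0.71, 0.64, 0.80]   # would be 57 tasks in reality
bbh_direct  = [0.45, 0.52]         # would be 23 tasks in reality
settings = [suite_score(mmlu_direct), suite_score(bbh_direct)]

# Aggregate: average the per-setting scores (sketch with two of six).
aggregate = sum(settings) / len(settings)
```

Macro-averaging keeps a suite with many tasks (e.g. MMLU's 57) from dominating the aggregate.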
…We also note that Flan-PaLM does not achieve SOTA compared to certain specialized models. For example, on BBH-algo, which comprises tasks that require only symbolic manipulation (e.g., keeping track of the order of a shuffled list of objects, or sorting a list of words alphabetically), Flan-PaLM does not outperform code-davinci-002, even with CoT + SC. Moreover, although Flan-PaLM outperforms PaLM by 14.9% on one-shot TyDiQA, it is still not on par with ByT5 finetuned on the TyDiQA training set (Xue et al., 2022).
OpenAI’s text-davinci-003 follows instructions better. Is it also better on academic benchmarks? Summary: (1) text-davinci-003 beats text-davinci-002 but is not as good as code-davinci-002; it is also behind Google Brain’s PaLM and Flan-U-PaLM.