[retitled "Super-NaturalInstructions: Generalization via Declarative Instructions on 1,600+ NLP Tasks"] How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions?
To facilitate progress toward this goal, we introduce Natural-Instructions v2, a benchmark of 1,600+ diverse language tasks and their expert-written instructions. It covers 70+ distinct task types, such as tagging, in-filling, and rewriting. These tasks were collected with contributions from NLP practitioners in the community and vetted through an iterative peer-review process to ensure their quality. With this large and diverse collection of tasks, we can rigorously benchmark the cross-task generalization of models: training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model size.
Based on these insights, we introduce Tk-Instruct, an encoder-decoder T5 Transformer trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples), which outperforms existing larger models on our benchmark.
We hope this benchmark facilitates future progress toward more general-purpose language understanding models.
Figure 5: Scaling trends of model performance as a function of (a) the number of training tasks; (b) the number of instances per training task; (c) model size. The x-axes are in log scale. It is promising that performance improves log-linearly with an increasing number of observed tasks and with model size. However, performance does not benefit from an increasing number of instances per task.
…5.2 Scaling Trends of Generalization: In this section, we study Tk-Instruct's generalization performance with respect to three important scaling factors: the number of training tasks, the number of instances per task, and model size. We conduct these experiments only on our English tasks. Figure 5 presents how performance changes as each factor is scaled.
More observed tasks improve the generalization:
To study how many tasks are required to train an instruction-following system, we finetune Tk-Instruct with different numbers of tasks randomly sampled from the whole training set (Figure 5a).
The model's generalization performance grows log-linearly as we increase the number of tasks used for training. Previous work (Mishra et al., 2022b; Sanh et al., 2022; Wei et al., 2022) has made similar observations at a much smaller scale, whereas we show that this trend holds even with 757 diverse training tasks.
A large number of training instances does not help generalization:
We then vary the number of instances per task used for finetuning Tk-Instruct (Figure 5b).
While the conventional wisdom in supervised learning is that more training instances usually help (Banko & Brill, 2001; Sun et al., 2017; Hestness et al., 2017), in our cross-task generalization setup the model's performance saturates at only 64 instances per task. A larger number of training instances instead leads to longer training time and a higher risk of overfitting to the training tasks.
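The instance-budget experiment amounts to capping each task's training pool at a fixed size. A minimal sketch of such capping, with hypothetical task names and a helper of our own invention:

```python
import random

def cap_instances(tasks, max_per_task=64, seed=0):
    """Subsample at most `max_per_task` training instances from each task.

    `tasks` maps task name -> list of training instances. Tasks with fewer
    instances than the cap are kept whole.
    """
    rng = random.Random(seed)
    capped = {}
    for name, instances in tasks.items():
        if len(instances) > max_per_task:
            capped[name] = rng.sample(instances, max_per_task)
        else:
            capped[name] = list(instances)
    return capped

# Hypothetical task pools of very different sizes
tasks = {"task_a": list(range(1000)), "task_b": list(range(40))}
capped = cap_instances(tasks)
print({name: len(pool) for name, pool in capped.items()})
# -> {'task_a': 64, 'task_b': 40}
```

Capping per task (rather than globally) keeps the training mix balanced across tasks, which matters when the goal is cross-task generalization rather than per-task accuracy.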
Larger models consistently lead to gains:
We study the effect of model scaling by initializing Tk-Instruct from pretrained T5 checkpoints of different sizes, including the small, base, large, and XL sizes (Figure 5c). We find that increasing model size consistently brings large improvements (log-linear in parameter count). This finding contradicts Xu et al. (2022)'s claim that "model size has little impact on performance with an extremely large number of tasks." Combining Figures 5a and 5c, one can draw a correspondence between model size and number of training tasks. For example, a T5-large model trained on 757 tasks achieves performance (48.0 ROUGE-L) comparable to a T5-3B model trained on 128 tasks (48.4 ROUGE-L), indicating that increasing the diversity of training tasks is an alternative to scaling model size.
…Benefit of instructional elements: As shown in Figure 1, NATINSTv2 provides multiple elements for instructing a new task. To find the optimal instruction encoding, we train multiple models with different combinations of these elements. The diagonal entries of Table 4 show the performance of our models when trained and evaluated on the same instruction encoding. Based on these diagonal numbers, we see that including the task definition is necessary for successful generalization. Moreover, task definitions and demonstration examples are complementary: combining them yields large improvements. Adding more demonstration examples, however, brings little additional gain.
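The different instruction encodings amount to different templates assembled from the same elements. The sketch below shows one plausible way to combine a task definition with k in-context examples; the field names and the exact template text are illustrative assumptions, not the benchmark's actual encoding:

```python
def encode_instance(definition, pos_examples, input_text, k=2):
    """Assemble a prompt from a task definition plus k in-context examples.

    `pos_examples` is a list of {"input": ..., "output": ...} dicts; the
    template wording here is hypothetical, not the paper's exact format.
    """
    parts = [f"Definition: {definition}"]
    for ex in pos_examples[:k]:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    # The test instance is appended last, with its output left blank
    # for the model to complete.
    parts.append(f"Input: {input_text}\nOutput:")
    return "\n\n".join(parts)

prompt = encode_instance(
    definition="Answer the question based on the passage.",
    pos_examples=[{"input": "Q1 ...", "output": "A1"},
                  {"input": "Q2 ...", "output": "A2"}],
    input_text="Q3 ...",
)
print(prompt)
```

Dropping `definition` or setting `k=0` recovers the ablated encodings compared along the diagonal of Table 4.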