“Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks”, Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, Daniel Khashabi (2022-04-16):

[retitled “Super-NaturalInstructions: Generalization via Declarative Instructions on 1,600+ NLP Tasks”] How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions?

To facilitate progress toward this goal, we introduce Natural-Instructions v2 [v1], a benchmark of 1,600+ diverse language tasks and their expert-written instructions. It covers 70+ distinct task types, such as tagging, in-filling, and rewriting. These tasks are collected with contributions from NLP practitioners in the community and through an iterative peer review process to ensure their quality. With this large and diverse collection of tasks, we are able to rigorously benchmark cross-task generalization of models—training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes.
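To make the setup concrete, here is a minimal sketch (not the authors' code) of a Natural-Instructions-style task record and of a cross-task split that holds out whole tasks for evaluation; the field names, example content, and task counts below are illustrative assumptions rather than the benchmark's exact schema.

```python
import random

# Illustrative task record: a plain-language definition, demonstration
# examples, and evaluation instances (field names are assumed, not official).
example_task = {
    "Definition": "Given a sentence, label each word with its part-of-speech tag.",
    "Positive Examples": [
        {"input": "Dogs bark loudly.", "output": "NOUN VERB ADV PUNCT"},
    ],
    "Instances": [
        {"input": "Cats sleep all day.", "output": "NOUN VERB DET NOUN PUNCT"},
    ],
}

def cross_task_split(task_names, num_eval_tasks, seed=0):
    """Hold out whole tasks (not just instances), so evaluation tasks are unseen."""
    rng = random.Random(seed)
    shuffled = list(task_names)
    rng.shuffle(shuffled)
    return shuffled[num_eval_tasks:], shuffled[:num_eval_tasks]

# Hypothetical split over 1,600 task names, holding out 100 tasks for evaluation.
train_tasks, eval_tasks = cross_task_split([f"task_{i:04d}" for i in range(1600)], 100)
print(len(train_tasks), "training tasks;", len(eval_tasks), "held-out evaluation tasks")
```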

Based on these insights, we introduce Tk-Instruct, an encoder-decoder T5 Transformer trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples), which outperforms existing larger models on our benchmark.
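For readers who want to try this out, here is a minimal inference sketch using the Hugging Face transformers library; the checkpoint name "allenai/tk-instruct-3b-def-pos" and the prompt layout are assumptions for illustration, not the paper's exact preprocessing code.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "allenai/tk-instruct-3b-def-pos"  # assumed public checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Instruction = plain-language task definition plus a 1-shot demonstration,
# followed by the instance to complete (assumed template).
prompt = (
    "Definition: Given a product review, classify its sentiment as Positive or Negative.\n"
    "Positive Example 1 - Input: The battery lasts all day. Output: Positive\n"
    "Now complete the following example - Input: It broke after one week. Output:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```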

We hope this benchmark facilitates future progress toward more general-purpose language understanding models.

Figure 5: Scaling trends of model performance as a function of (a) the number of training tasks; (b) the number of instances per training task; (c) model size. x-axes are in log scale. It is promising to see that performance improves log-linearly with an increasing number of observed tasks and with model size. Performance does not, however, improve with a larger number of instances per task.

…5.2 Scaling Trends of Generalization: In this section, we study Tk-Instruct's generalization performance with respect to three important scaling factors: the number of training tasks, the number of instances per task, and model size. We conduct these experiments only on our English tasks. Figure 5 presents how performance changes as each factor is scaled.

…Benefit of instructional elements: As shown in Figure 1, NATINSTv2 provides multiple elements for instructing a new task. To find the optimal instruction encoding, we trained multiple models with different combinations of these elements. The diagonal entries of Table 4 show the performance of our models when trained and evaluated on the same instruction encoding. Based on these diagonal numbers, we can see that including the task definition is necessary for successful generalization. Moreover, the definition and demonstration examples are complementary: combining them yields a large improvement. Adding more demonstration examples does not seem to make much difference.
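As a rough illustration of what these instruction encodings look like in practice, here is a minimal sketch, under assumed field names, of how definition-only, example-only, and definition-plus-k-examples inputs might be assembled; it is not the authors' exact template.

```python
def encode_instance(task, instance_input, use_definition=True, num_pos_examples=0):
    """Assemble one model input from the chosen instruction elements (assumed template)."""
    parts = []
    if use_definition:
        parts.append(f"Definition: {task['Definition']}")
    for i, ex in enumerate(task["Positive Examples"][:num_pos_examples], start=1):
        parts.append(f"Positive Example {i} - Input: {ex['input']} Output: {ex['output']}")
    parts.append(f"Now complete the following example - Input: {instance_input} Output:")
    return "\n".join(parts)

# Definition plus two positive examples -- the combination the text above finds
# most effective; adding further demonstrations changes little.
# encode_instance(example_task, "Cats sleep all day.", use_definition=True, num_pos_examples=2)
```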