Figure 4: The effects of explanations on average accuracy, across model sizes. Untuned explanations lead to modest increases in accuracy for the largest model, relative to few-shot prompts without explanations. Selecting both the few-shot examples and their explanations using a small validation set improves substantially over selecting only the examples. Note that these results aggregate across all 40 tasks, which vary in difficulty and in the number of possible answers. See Appendix D.6 for per-task plots. (Error bars are bootstrap 95% CIs. Points are offset horizontally to avoid overlap.)
Large language models can perform new tasks by adapting to a few in-context examples. For humans, rapid learning from examples can benefit from explanations that connect examples to task principles.
We therefore investigate whether explanations of few-shot examples can allow language models to adapt more effectively. We annotate a set of 40 challenging tasks from BIG-Bench with explanations of answers to a small subset of questions, as well as a variety of matched control explanations. We evaluate the effects of various zero-shot and few-shot prompts that include different types of explanations, instructions, and controls on the performance of a range of large language models. We analyze these results using statistical multilevel modeling techniques that account for the nested dependencies among conditions, tasks, prompts, and models.
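To make these prompt conditions concrete, here is a minimal sketch of how a few-shot prompt with explanations might be assembled. The field names and the "Q:/A:/Explanation:" template are illustrative assumptions, not the exact format used in our experiments.

```python
# Minimal sketch of assembling a few-shot prompt with an optional task
# instruction and optional per-example explanations. The template below
# is a hypothetical format, not the paper's exact one.

def build_prompt(examples, query, instruction=None, with_explanations=False):
    """Join few-shot examples (and optionally their explanations) into a prompt."""
    parts = [instruction] if instruction else []
    for ex in examples:
        block = f"Q: {ex['question']}\nA: {ex['answer']}"
        if with_explanations:  # append the annotated explanation for this example
            block += f"\nExplanation: {ex['explanation']}"
        parts.append(block)
    parts.append(f"Q: {query}\nA:")  # the target question the model must answer
    return "\n\n".join(parts)

examples = [{"question": "2 + 2 = ?", "answer": "4",
             "explanation": "Adding two and two gives four."}]
print(build_prompt(examples, "3 + 3 = ?", with_explanations=True))
```

Matched control conditions can be generated the same way, e.g. by substituting a non-explanatory string of comparable length and vocabulary for each explanation.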
We find that explanations of examples can improve performance. Adding untuned explanations to a few-shot prompt offers a modest improvement in performance: about one-third the effect size of adding few-shot examples, but twice the effect size of task instructions. We then show that explanations tuned for performance on a small validation set offer substantially larger benefits; building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone, and hand-tuning explanations can further improve performance on challenging tasks. Furthermore, even untuned explanations outperform carefully matched controls, suggesting that the benefits stem from the link between an example and its explanation rather than from lower-level features of the language used. However, only large models benefit from explanations.
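The validation-set tuning described above can be sketched as a simple search over candidate example sets, each scored by validation accuracy. In the sketch below, `score_fn` is a stand-in for building a prompt and querying a language model; the random sampling strategy and hyperparameters are assumptions for illustration, not our exact procedure.

```python
import random

def select_shots(annotated_pool, validation_set, score_fn,
                 n_shots=5, n_candidates=15, seed=0):
    """Pick the set of annotated few-shot examples (with explanations)
    that maximizes accuracy on a small validation set. `score_fn` is
    assumed to build a prompt from `shots`, query the model on each
    validation question, and return the resulting accuracy."""
    rng = random.Random(seed)
    best_shots, best_score = None, float("-inf")
    for _ in range(n_candidates):
        shots = rng.sample(annotated_pool, n_shots)  # candidate example set
        score = score_fn(shots, validation_set)
        if score > best_score:
            best_shots, best_score = shots, score
    return best_shots, best_score
```

Selecting examples alone corresponds to running the same search with the explanations stripped from each candidate set, which is the baseline the jointly tuned condition is compared against.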
In summary, explanations can support the in-context learning abilities of large language models on challenging tasks.
Figure 3: Explanations can have substantial benefits, especially when tuned, but their effects are variable. This plot compares the improvement from explanations relative to matched prompts without explanations, in three conditions. In purple we show the effect of untuned explanations relative to few-shot prompts with no explanations; in darker green, the benefit of selecting examples with explanations over selecting examples alone; and in light green, the benefit of hand-tuned explanations over prompts with no explanations. Note that we hand-tuned explanations on only a handful of tasks, so there are relatively few observations. (Points and lines at top are means and bootstrap 95% CIs. Curves are smoothed density estimates based on the individual observed differences, which are plotted as points below. The bi-modality for hand-tuned explanations is probably due to the small number of observations.)