[blog; reinvention of inner-monologue, originally discovered by AI Dungeon 2/4chan users] Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning.
This paper explores the ability of language models to generate a coherent chain-of-thought—a series of short sentences that mimic the reasoning process a person might have when responding to a question [inner monologue].
Experiments show that inducing a chain-of-thought via prompting [referred to as ‘chain-of-thought’ (CoT)] can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
Figure 2: Even on math word problems where scaling up the model alone already improves performance, chain-of-thought prompting does as well or better.
Figure 3: Employing chain-of-thought enables language models to solve problems for which standard prompting has a mostly flat scaling curve.
…As seen in Figure 3, increasing model scale for standard prompting does not improve performance on these datasets—the scaling curve is mostly flat. With chain-of-thought prompting, however, performance now increases with model scale. Notably, chain-of-thought prompting does better than standard prompting only at the scale of ~100B parameters; models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting [emergence].
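Mechanically, the two prompting regimes differ only in what the few-shot exemplars contain: a bare final answer versus worked-out intermediate steps. A minimal sketch of prompt construction, using the paper's well-known tennis-ball exemplar as the single few-shot example (the paper's full prompt uses 8 such exemplars):

```python
# Minimal sketch of standard vs. chain-of-thought few-shot prompting.
# A single illustrative exemplar is used; the paper's prompts contain 8.

STANDARD_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(question: str, chain_of_thought: bool) -> str:
    """Prepend few-shot exemplars to a test question; the language model
    is then asked to continue the text after the final 'A:'."""
    exemplar = COT_EXEMPLAR if chain_of_thought else STANDARD_EXEMPLAR
    return exemplar + f"Q: {question}\nA:"
```

With the chain-of-thought exemplar in context, a sufficiently large model imitates the step-by-step format before emitting its final answer; with the standard exemplar, it jumps straight to an answer.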
To better understand why chain-of-thought prompting works, we manually examine model-generated chains of thought for both correct and incorrect answers in the GSM8K dataset. Some examples are shown in Table 1: Of 50 random examples where the model returned the correct final answer, all of the generated chains of thought were also logically and mathematically correct except for one, which coincidentally arrived at the correct answer through incorrect reasoning. We also examine 50 random samples for which the model gave the wrong answer. The summary of this analysis is that 46% of the chains of thought were almost correct, barring minor mistakes (calculator error, symbol mapping error, or one reasoning step missing), and that the other 54% of the chains of thought had major errors in semantic understanding or coherence.
Figure 5: For 3 symbolic reasoning tasks, employing chain-of-thought facilitates good performance when standard few-shot prompting is insufficient. Examples of model-produced chains of thought are shown in Table 12–Table 14 in the Appendix.
…4. Symbolic Reasoning: We next investigate the ability of language models to perform symbolic reasoning tasks. Although the symbolic reasoning tasks we consider are simple for humans, language models typically exhibit a flat scaling curve on them. We show that solving intermediate steps of a symbolic reasoning task via chain-of-thought allows models to perform tasks that are not solvable with standard prompting alone.
[concatenate the last letters of each word in a name; reverse a list of everyday objects; whether a flipped or not flipped coin changes which side is up]
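The three symbolic tasks bracketed above have trivially computable ground truth, which is what makes them clean probes of reasoning. Reference implementations (function names are ours, not the paper's):

```python
# Reference implementations of the three symbolic reasoning tasks,
# used here only to generate ground-truth answers.

def last_letter_concat(name: str) -> str:
    """Concatenate the last letters of each word, e.g. 'Amy Brown' -> 'yn'."""
    return "".join(word[-1] for word in name.split())

def reverse_list(items: list) -> list:
    """Reverse a list of everyday objects."""
    return items[::-1]

def coin_heads_up(flips: list) -> bool:
    """A coin starts heads up; each True in `flips` is a flip.
    It ends heads up iff the number of flips is even."""
    return sum(flips) % 2 == 0
```

Because answers are deterministic, arbitrarily many evaluation examples can be generated, including out-of-domain examples longer than those seen in the few-shot exemplars.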
…5.1. Datasets and Prompts: For evaluation, we choose 4 datasets to cover a diverse range of commonsense reasoning types.
Figure 6: Compared with standard prompting, chain-of-thought prompting also improves performance on various types of commonsense reasoning tasks. Examples of model-produced chains of thought are shown in Table 15–Table 18 in the Appendix.
…Chain-of-thought prompting appears to expand the set of tasks that large language models can perform successfully—in other words, our work underscores that standard prompting provides only a lower bound on the capabilities of large language models. This observation likely raises more questions than it answers—for instance, how much more can we expect reasoning ability to improve with a further increase in model scale? What other prompting methods might expand the range of tasks that language models can solve?
…As another robustness test, we check that chain-of-thought prompting works not just for the set of language models used in the main results, but also for GPT-3 language models (Brown et al 2020). We evaluate the MultiArith and GSM8K results using the OpenAI GPT-3 API, comparing standard prompting and chain-of-thought prompting. As shown in Figure 8, the overall finding remains unchanged for GPT-3 davinci versus the 137B [LaMDA] language model we use.
Figure 8: Chain-of-thought prompting outperforms standard prompting for both LaMDA and GPT-3 davinci.
…A.5. Comparison with Finetuned Models: We further compare chain-of-thought prompting to directly finetuning the same 137B model on a training dataset. We focus on GSM8K (Cobbe et al 2021), which has a training set annotated with intermediate steps to arrive at the final answer (similar to a chain-of-thought, except slightly more terse). We use an external calculator on generated chains of thought for both prompting and finetuning. We compare with models finetuned both on the full GSM8K train set and on subsets limited by the number of hops in the train examples, which allows us to test the model’s generalization ability on problems that require more hops. The results are shown in Table 9: All 8 exemplars we composed for chain-of-thought prompting have at most 2 hops but achieved a solve rate of 19.5%, which is much higher than the finetuned results with ≤2 hops from the GSM8K train set (2,261 examples; solve rate = 14.0%). Chain-of-thought prompting is also comparable with finetuned results for ≤3 hops (4,080 examples; solve rate = 20.5%).
This comparison suggests that, compared with finetuning, prompting can exhibit better generalizability to harder problems while being far more sample-efficient.
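The "external calculator" mentioned above can be understood as a simple post-processing pass: scan the generated chain of thought for arithmetic equations and overwrite the model's right-hand side with the correctly computed value. A hedged sketch, assuming only single-binary-operation equations (the paper's exact mechanism may differ):

```python
import re

# Sketch of external-calculator post-processing: find simple arithmetic
# equations of the form "a <op> b = c" in a generated chain of thought
# and replace the model's (possibly wrong) result c with the true value.
# The regex and the set of handled operators are our simplification.

EQUATION = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)

def fix_arithmetic(chain: str) -> str:
    def repair(m: re.Match) -> str:
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        value = int(result) if result == int(result) else result
        return f"{m.group(1)} {op} {m.group(3)} = {value}"
    return EQUATION.sub(repair, chain)
```

This corrects the "calculator error" category of near-miss chains noted in the error analysis, while leaving the model's reasoning structure untouched.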