“Self-Consistency Improves Chain-of-Thought Reasoning in Language Models”, Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Denny Zhou (2022-03-21):

Figure 5b: GSM8K accuracy. Self-consistency improves performance across language model scales.

[cf. wisdom of the crowd, “inner crowds”] We explore a simple ensemble strategy, self-consistency, that substantially improves the reasoning accuracy of large language models.

The idea is to sample a diverse set of outputs from a language model and return the most consistent answer [majority vote] in the set. Such an ensembling method improves reasoning accuracy when combined with chain-of-thought prompting.
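A minimal sketch of this majority-vote scheme, assuming a caller-supplied `sample_answer` function that runs one temperature-sampled chain-of-thought query and returns only the final extracted answer (the function name and the mock sampler below are hypothetical, not from the paper):

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_samples: int = 10) -> str:
    """Sample n independent reasoning paths and return the majority-vote answer."""
    answers = [sample_answer() for _ in range(n_samples)]
    # Ties are broken by first occurrence, as in Counter.most_common.
    return Counter(answers).most_common(1)[0][0]

# Stand-in for an LM call: a noisy "model" that is right ~70% of the time.
def mock_sample() -> str:
    return random.choice(["18"] * 7 + ["26", "20", "8"])

if __name__ == "__main__":
    random.seed(0)
    print(self_consistency(mock_sample, n_samples=25))
```

The vote is over final answers only, not over the reasoning paths themselves, so different chains of thought that reach the same answer reinforce each other.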

On arithmetic and commonsense reasoning benchmarks, we find that self-consistency yields substantial accuracy improvements across a variety of datasets, such as GSM8K (+10%), SVAMP (+14%), MultiArith (+24%), CommonsenseQA (+5%) and ARC (easy +4%, challenge +5%).

[See also: best-of ranking; self-distillation in math (SVAMP), theorem-proving, translation, and Q&A.]

Figure 2: Self-consistency (blue) substantially helps reasoning accuracy on math word problems across problem types, and on more recent arithmetic reasoning benchmarks, including a challenge math dataset (SVAMP) and grade-school math (GSM8K).