Figure 5b: GSM8K accuracy. Self-consistency improves performance across language model scales.
Inspired by the "wisdom of the crowd" effect (cf. "inner crowds"), we explore a simple ensemble strategy, self-consistency, that substantially improves the reasoning accuracy of large language models.
The idea is to sample a diverse set of outputs from a language model and return the most consistent answer in the set, chosen by majority vote. This ensembling method improves reasoning accuracy when combined with chain-of-thought prompting.
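The sample-then-vote procedure can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `sample_fn` stands in for a stochastic decode of a language model (returning only the parsed final answer of each sampled reasoning path), and the toy answer sequence is fabricated for demonstration.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, prompt, n_samples=10):
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    # Marginalize out the reasoning paths: majority vote over final answers.
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for sampled model outputs (final answers only):
# three of the five sampled paths agree on 18, so the vote returns 18.
fake_answers = cycle([18, 26, 18, 9, 18])
result = self_consistency(lambda prompt: next(fake_answers), "Q: ...", n_samples=5)
print(result)  # 18
```

In practice each sample would be drawn with temperature or nucleus sampling so the reasoning paths are diverse; the vote then rewards answers that many distinct paths converge on.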
On arithmetic and commonsense reasoning benchmarks, we find that self-consistency yields substantial accuracy improvements across a variety of datasets, including GSM8K (+10%), SVAMP (+14%), MultiArith (+24%), CommonsenseQA (+5%), and ARC (easy +4%, challenge +5%).
Figure 2: Self-consistency (blue) substantially improves reasoning accuracy on math word problems across problem types, and on more recent arithmetic reasoning benchmarks, including a challenge math dataset (SVAMP) and a grade-school math dataset (GSM8K).