We evaluate the reasoning abilities of large language models in multilingual settings.
We introduce the Multilingual Grade School Math (MGSM) benchmark, created by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili.
Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
The MGSM benchmark is publicly available on GitHub.
…Effect of model scale. We analyze the effect of model scale (i.e., the number of model parameters and the computational resources used for training) on multilingual arithmetic reasoning ability (Figure 4). As the models scale up, performance generally improves for both the GPT-3 and PaLM model series across all languages. Neither series achieves a substantial solve rate until a certain scale (text-davinci-001 for GPT-3 and PaLM-62B for PaLM); multilingual reasoning can therefore be considered an emergent ability of large language models (Wei et al., 2022a). It is worth noting that for PaLM, the amount of training data per language is constant across model scales—the fact that scale alone facilitates reasoning suggests that further scaling may continue to improve the multilingual reasoning ability of large language models.
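The solve-rate evaluation described above can be sketched as a simple loop: build a few-shot chain-of-thought prompt, query the model, extract the final numeric answer, and compare it against the gold label. The sketch below is illustrative only; `query_model`, the exemplar format, and the answer-extraction heuristic are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
# Minimal sketch of few-shot chain-of-thought evaluation on MGSM-style data.
# `query_model` is a hypothetical stand-in for a real LLM API call.
import re

def build_cot_prompt(exemplars, question):
    """Concatenate (question, step-by-step solution) exemplars, then the test question."""
    parts = []
    for q, solution in exemplars:
        parts.append(f"Question: {q}\nStep-by-step answer: {solution}\n")
    parts.append(f"Question: {question}\nStep-by-step answer:")
    return "\n".join(parts)

def extract_final_number(completion):
    """Take the last number in the completion as the predicted answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(numbers[-1]) if numbers else None

def solve_rate(problems, exemplars, query_model):
    """Fraction of problems whose extracted answer matches the gold label."""
    correct = 0
    for question, gold in problems:
        completion = query_model(build_cot_prompt(exemplars, question))
        pred = extract_final_number(completion)
        if pred is not None and abs(pred - gold) < 1e-6:
            correct += 1
    return correct / len(problems)
```

Repeating this loop for each model size in a family yields the scale-versus-accuracy curves of Figure 4; taking the last number in the completion is a common, though imperfect, answer-extraction heuristic for chain-of-thought outputs.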
Figure 4: MGSM accuracy with different model scales. The letters A, B, C, D1, and D2 denote text-ada-001, text-babbage-001, text-curie-001, text-davinci-001 [InstructGPT], and text-davinci-002 in the GPT-3 family (Brown et al., 2020; Ouyang et al., 2022), respectively. While the number of parameters in each GPT-3 model is not publicly available, we order the models alphabetically by name. Detailed numbers can be found in Table 8.
Figure 5: MGSM accuracy of PaLM-540B with different numbers of few-shot exemplars. Detailed numbers can be found in Table 8.