Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability.
To investigate this claim rigorously, we [Scale AI] commission Grade School Math 1,000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more.
When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several model families (e.g., Phi and Mistral [also 01.AI's Yi]) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini, GPT-4, and Claude), show minimal signs of overfitting.
Further analysis finds a positive relationship (Spearman's r² = 0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
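The correlation analysis above can be sketched in miniature. The model names, log-likelihoods, and accuracies below are entirely invented for illustration; Spearman's rank correlation is computed by hand with the classic d² formula, which is valid here because the toy data contains no ties:

```python
# Illustrative sketch of the correlation analysis: rank-correlate each
# model's per-character log-likelihood on GSM8k (a proxy for its probability
# of generating GSM8k examples) with its GSM8k-to-GSM1k accuracy gap.
# All model names and numbers are made up.

def ranks(xs):
    # rank 1 = smallest value; assumes no ties (true of the toy data)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    # classic d^2 formula, valid when there are no ties
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# (per_char_loglik_on_gsm8k, gsm8k_accuracy, gsm1k_accuracy) -- invented
toy_models = {
    "model-a": (-0.45, 0.82, 0.80),
    "model-b": (-0.38, 0.79, 0.71),
    "model-c": (-0.52, 0.68, 0.67),
    "model-d": (-0.33, 0.88, 0.78),
    "model-e": (-0.49, 0.75, 0.72),
}

logliks = [v[0] for v in toy_models.values()]
gaps = [v[1] - v[2] for v in toy_models.values()]  # positive gap = overfit

rho = spearman_rho(logliks, gaps)
print(f"Spearman rho = {rho:.2f}, rho^2 = {rho ** 2:.2f}")
```

A positive rho here would mirror the paper's finding: models that assign higher likelihood to GSM8k text tend to show larger GSM8k-to-GSM1k gaps.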
…We do not intend to release GSM1k publicly at this time to prevent a similar problem of data contamination occurring in the future. However, we plan to run recurring evaluations of all major open- and closed-source releases and to continually update our results. We will also open-source our entire evaluation code so that the public version of our results can be reproduced. Additionally, we commit to open-sourcing the entire benchmark when either (1) the top open source models score over 95% on GSM1k or (2) the end of 2025 arrives, whichever comes earlier.
Figure 1: Notable models arranged by their drop in performance between GSM8k and GSM1k (lower is worse).
We notice that Mistral and Phi top the list of overfit models, with almost 10% drops on GSM1k compared to GSM8k, while models such as Gemini, GPT, and Claude show little to no signs of overfitting.
Figure 5: Models with over 70% accuracy on GSM8k compared to the line of no overfit. This plot is zoomed into the relevant region (70–100% accuracy). Note that some models, especially the Claude family, perform above the 45° line, which is consistent with our findings in §3 that GSM1k is slightly easier than GSM8k. In contrast, many models, especially the Mistral and Phi families, lie well below the line.
…Lesson 2: Other Models, Especially Frontier Models, Show No Signs of Overfitting: Nevertheless, we find that many models, throughout all regions of performance, show minimal signs of being overfit. In particular, we find that all frontier or close-to-frontier models (including the proprietary Mistral Large) appear to perform similarly on both GSM8k and GSM1k. We posit two potential hypotheses for this: (1) frontier models have sufficiently advanced reasoning capability that they can generalize to new problems even if they have already seen GSM8k problems in their training set, or (2) frontier model builders may be more careful about data contamination.
While it is impossible to know for certain without inspecting the training set of each model, one piece of evidence in favor of the former is that Mistral Large is the only model in the Mistral family to show no signs of overfitting. Since the hypothesis that Mistral took unique care in ensuring that only their largest model was free from data contamination seems unlikely, we lean instead towards the hypothesis that sufficiently strong LLMs also learn elementary reasoning ability during training. If a model learns strong enough reasoning capabilities to solve problems of a given difficulty, it will be able to generalize to new problems even if GSM8k has appeared in its training set.
5.3 Lesson 3: Overfit Models Are Still Capable of Reasoning: One worry about model overfitting is that overfit models are incapable of reasoning and are merely memorizing answers seen in the training data. Our results do not support this conjecture. The fact that a model is overfit does not mean that it is poor at reasoning, merely that it is not as good as the benchmarks might indicate it to be.
In fact, we find that many of the most overfit models are still capable of reasoning and solving novel problems. For example, while Phi-3 has an almost 10% drop in accuracy between GSM8k and GSM1k, we find that it is still able to correctly solve over 68% of GSM1k problems, which are certain not to have appeared in its training distribution. This performance is similar to that of much larger models such as dbrx-instruct, which contains almost 35× as many parameters. Similarly, Mistral models remain some of the strongest open source models, even accounting for their overfitting.
This provides additional evidence for our lesson that sufficiently strong models learn elementary reasoning, even if benchmark data accidentally leaked into the training distribution, as is likely to be the case for the most overfit models.
5.4 Lesson 4: Data Contamination Is Likely Not The Full Explanation for Overfitting: …Nevertheless, data contamination is likely not the full story. We observe this via the presence of several outliers, which cause the r² = 0.32 value to be relatively low. Examining these outliers carefully reveals that the model with the lowest per-character log-likelihood (Mixtral-8×22b) and the model with the highest per-character log-likelihood (Mixtral-8×22b-Instruct) are not only variations of the same model, but also have similar levels of overfit (Jiang et al., 2024).
Perhaps more intriguingly, the most overfit model we discovered (Math-Shepherd-Mistral-7B-RL, Yu et al., 2023) had a relatively low per-character log-likelihood. Math-Shepherd trains a reward model on process-level data using synthetic data. As such, we hypothesize that the reward modeling process may have leaked information about the correct reasoning chains for GSM8k even if the problems themselves never appeared in the dataset. Finally, we observe that the Llemma models (Azerbayev et al., 2024) have both high log-likelihoods and minimal overfit. These models are open-sourced alongside their training data, and the authors report finding a very small number of GSM8k examples in the training corpus. Nevertheless, they also find (and our study supports) that these few instances do not lead to overfitting.
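The per-character log-likelihood statistic used in this analysis can be sketched as follows. Normalizing by character count rather than token count makes models with different tokenizers comparable. The example text and token log-probabilities below are invented; in practice the log-probabilities would come from the model's forward pass over the benchmark example:

```python
# Hedged sketch of the per-character log-likelihood statistic: a model's
# total log-probability of a benchmark example, divided by the example's
# character count so that models with different tokenizers are comparable.
# The token log-probs below are invented placeholders.

def per_char_loglik(text, token_logprobs):
    """Sum of token log-probabilities divided by the number of characters."""
    return sum(token_logprobs) / len(text)

example = "A farmer has 12 apples and gives away 5."
fake_token_logprobs = [-2.1, -0.4, -1.3, -0.2, -3.0, -0.9, -0.5, -1.8]

score = per_char_loglik(example, fake_token_logprobs)
print(f"per-character log-likelihood: {score:.3f}")
```

A less negative score means the model assigns higher probability to the text, which, under the contamination hypothesis, should correlate with a larger GSM8k-to-GSM1k gap.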
The existence of these outliers suggests that overfitting on GSM8k is not purely due to data contamination, but rather may occur through other, indirect means, such as model builders collecting data similar in nature to benchmarks as training data or selecting final model checkpoints based on benchmark performance, even if the model itself never saw the GSM8k dataset at any point during training. The reverse also holds: small amounts of data contamination do not necessarily lead to overfitting.
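The checkpoint-selection mechanism is easy to demonstrate in isolation: even with zero contamination, picking the checkpoint that scores highest on one benchmark inflates that score relative to a fresh benchmark of the same difficulty. The toy simulation below (all parameters invented) sketches this selection bias:

```python
import random

# Toy simulation of benchmark-based checkpoint selection, one of the
# contamination-free mechanisms for overfitting. Every "checkpoint" has the
# same true ability, so any gap between the selected checkpoint's score on
# benchmark A and a fresh benchmark B is pure selection bias.
random.seed(0)

TRUE_ACC = 0.75       # hypothetical true ability of every checkpoint
N_QUESTIONS = 1000    # questions per benchmark
N_CHECKPOINTS = 20

def eval_on_benchmark(true_acc, n):
    # each question is answered correctly with probability true_acc
    return sum(random.random() < true_acc for _ in range(n)) / n

# Score every checkpoint on "benchmark A" and keep the best scorer.
scores_a = [eval_on_benchmark(TRUE_ACC, N_QUESTIONS) for _ in range(N_CHECKPOINTS)]
best_score_a = max(scores_a)

# Re-evaluate the selected checkpoint on a fresh "benchmark B".
score_b = eval_on_benchmark(TRUE_ACC, N_QUESTIONS)

# best_score_a tends to exceed score_b even though nothing was memorized
print(f"selected checkpoint: A = {best_score_a:.3f}, B = {score_b:.3f}")
```

Because the selected score is the maximum of many noisy draws, it is biased upward, while the fresh-benchmark score regresses to the true ability; this mimics an "overfit" gap without any data contamination.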