Bibliography (4):
Measuring Mathematical Problem Solving With the MATH Dataset
https://github.com/consequentai/fneval/
GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Identifying Statistical Bias in Dataset Replication