Bibliography (4):

  1. Measuring Mathematical Problem Solving With the MATH Dataset

  2. https://github.com/consequentai/fneval/

  3. GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

  4. Identifying Statistical Bias in Dataset Replication