Bibliography (16):

  1. RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

  2. Beyond neural scaling laws: beating power law scaling via data pruning

  3. https://openwebtext2.com/

  4. Measuring Mathematical Problem Solving With the MATH Dataset

  5. https://arxiv.org/pdf/2404.07965#page=4&org=microsoft

  6. https://arxiv.org/pdf/2404.07965#page=3&org=microsoft

  7. https://arxiv.org/pdf/2404.07965#page=20&org=microsoft

  8. https://arxiv.org/pdf/2404.07965#page=19&org=microsoft

  9. Top-K Training of GANs: Improving GAN Performance by Throwing Away Bad Samples

  10. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

  11. https://arxiv.org/pdf/2404.07965#page=26&org=microsoft

  12. Deep Double Descent: Where Bigger Models and More Data Hurt

  13. https://arxiv.org/pdf/2404.07965#page=9&org=microsoft