We study the performance of transformers as a function of the number of repetitions of training examples, using algorithmically generated datasets.
On three problems of mathematics (greatest common divisor, modular multiplication, and matrix eigenvalues), we show that, for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples.
We also demonstrate that two-set training—repeated use of a small random subset of examples, alongside normal sampling from the rest of the training set—yields faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity.
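Two-set training can be sketched as a batching scheme that mixes a small, fixed subset of repeated examples with a stream of single-use examples. The function and parameter names below are illustrative, and the mixing fraction is an assumption, not the paper's exact hyper-parameter:

```python
import random

def two_set_batches(fresh_examples, repeated_set, p_repeat=0.25, batch_size=64):
    """Yield mini-batches mixing a small fixed 'repeated' subset with
    fresh single-use examples, in the spirit of two-set training.

    fresh_examples: iterable of never-repeated examples
    repeated_set:   small list of examples reused throughout training
    p_repeat:       expected fraction of each batch drawn from the repeated set
    """
    fresh = iter(fresh_examples)
    while True:
        batch = []
        for _ in range(batch_size):
            if random.random() < p_repeat:
                # draw (with replacement) from the small repeated subset
                batch.append(random.choice(repeated_set))
            else:
                try:
                    batch.append(next(fresh))
                except StopIteration:
                    return  # single-use stream exhausted: stop training
        yield batch
```

Note that repeated and single-use examples are mixed within the same mini-batch, which the ablations below identify as necessary for two-set training to work.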
These datasets and problems provide a controlled setting to shed light on the still poorly understood interplay between generalization and memorization in deep learning.
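For concreteness, an algorithmically generated dataset of the kind used here can be produced on the fly; the sketch below generates GCD training triples, with an illustrative operand range rather than the paper's exact settings:

```python
import math
import random

def gcd_examples(n, max_int=1_000_000, seed=0):
    """Generate n (a, b, gcd(a, b)) triples as a toy stand-in for the
    paper's algorithmically generated GCD dataset.

    max_int (operand range) and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a = rng.randint(1, max_int)
        b = rng.randint(1, max_int)
        examples.append((a, b, math.gcd(a, b)))
    return examples
```

Because examples are cheap to generate and labels are exact, the number of repetitions per example can be controlled precisely, which is what makes this a clean setting for studying memorization versus generalization.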
…In ablation experiments, we show that the performance of two-set training cannot be improved by curating the set of repeated examples, or by refreshing it as training proceeds. This sets us apart from curriculum learning, and strengthens the observation that repetition of a few random examples is really all we need. We also show that mixing repeated and non-repeated examples in the same mini-batches is required for two-set training to work. Finally, we propose a smooth extension of two-set training, introducing a probability distribution over the training set.
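The smooth extension can be sketched by replacing the hard two-set split with a skewed sampling distribution over the training set, so that a few examples are drawn often and most are drawn rarely. The power-law form below is an illustrative choice of distribution, not necessarily the one used in the paper:

```python
import numpy as np

def smooth_repetition_sampler(n_examples, n_draws, alpha=1.0, seed=0):
    """Sample training indices from a power-law distribution over the
    training set: a smooth generalization of two-set training in which
    repetition frequency decays with rank instead of being two-valued.

    alpha: skew of the distribution (alpha=0 recovers uniform sampling).
    """
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_examples + 1)
    probs = ranks ** (-alpha)       # heavier weight on low-rank examples
    probs /= probs.sum()            # normalize to a probability distribution
    return rng.choice(n_examples, size=n_draws, p=probs)
```

With alpha large, this concentrates draws on a handful of examples (approximating two-set training); with alpha = 0, every example is equally likely (single-use sampling in expectation).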
…In all three cases, the benefits of repetition are substantial, but they come in different flavors: improving performance and accelerating learning (GCD), allowing a new task to be learned at all (multiplication), or making it accessible to smaller models (eigenvalues). Alternatively, small random subsets of the data repeated at high frequency can elicit similar effects. These findings have profound implications and should lead to a paradigm shift in which the training set size becomes a mere hyper-parameter, no longer governed solely by the availability of data and the belief that more is always better.
Figure 1a: Repetition Helps: Performance as a function of repetition for a fixed training budget (600M).
GCD (blue). Models trained on smaller datasets, repeated 30×, perform much better than models trained on 1–4 epochs.
Multiplication mod 67 (red). Models trained for 1–4 epochs do not learn. Learning “emerges” when models are trained on smaller data budgets, with increased repetition.
Figure 1b: Two-set training: For a fixed data budget, splitting the data into two random subsets and increasing the training frequency of one greatly improves performance.
GCD (left): repeating 50k examples 3,000× for a training budget of 600M improves performance from 37 to 69 at 100M examples.
Modular multiplication (right): Models trained on 600M single-use examples do not learn. With 25M examples repeated 18× plus 150M single-use examples, accuracy is 92%; with 2.5M examples repeated 60× plus 450M single-use examples, accuracy is 68%. Smooth distributions of repetition over the training set achieve 70% accuracy.
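As a sanity check on the figures above, each configuration consumes the same fixed 600M data budget; the helper below just makes the accounting explicit:

```python
def total_budget(n_repeated, reps, n_single_use):
    """Total examples seen in training: the repeated subset counted
    once per repetition, plus the single-use examples."""
    return n_repeated * reps + n_single_use

M = 1_000_000
# 25M examples repeated 18x, plus 150M single-use examples
assert total_budget(25 * M, 18, 150 * M) == 600 * M
# 2.5M examples repeated 60x, plus 450M single-use examples
assert total_budget(int(2.5 * M), 60, 450 * M) == 600 * M
```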
…On the surface, grokking shares similarities with our work: a small training dataset is iterated over for many epochs, the phenomenon is isolated in clean experiments on synthetic data, and it contradicts traditional wisdom regarding overfitting. But there are important differences: in grokking, learning is delayed, whereas we observe no such delay; grokking occurs for "tiny" training samples (hundreds or thousands of examples), whereas our models use millions (even for modular multiplication); grokking is very sensitive to the choice of optimizer, whereas our findings are robust across optimizers (Appendix C.5); and, of course, no two-set approach has been documented in the grokking setting.