Training on web-scale data can take months [cf. Sorscher et al., 2022]. But most computation and time are wasted on redundant and noisy points that are already learnt or not learnable.
To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select "hard" (e.g. high-loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes "easy" points, but such points need not be trained on once they are learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt.
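The selection rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each candidate point's score is its current training loss minus an "irreducible" holdout loss (the loss of a separate model trained only on holdout data), and the function name and array inputs are hypothetical.

```python
import numpy as np

def select_rho_loss(train_losses, irreducible_losses, k):
    """Select the k points with the highest reducible holdout loss.

    Score = training loss - irreducible (holdout) loss. Noisy points
    have high irreducible loss, so they score low despite high training
    loss; already-learnt points have low training loss, so they also
    score low. What remains is learnable, worth learning, not yet learnt.
    """
    rho = np.asarray(train_losses) - np.asarray(irreducible_losses)
    # Indices of the top-k scores, highest first.
    return np.argsort(rho)[-k:][::-1]
```

For example, with training losses `[2.0, 3.0, 0.1, 1.5]` and irreducible losses `[0.2, 2.9, 0.05, 0.3]`, the scores are `[1.8, 0.1, 0.05, 1.2]`: the high-loss but noisy second point is skipped, and the top-2 selection is points 0 and 3.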
RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18× fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
Figure 1: Speedup on large-scale classification of web-scraped data (Clothing-1M). RHO-LOSS trains all architectures with fewer gradient steps than standard uniform data selection (i.e. shuffling), helping reduce training time. Thin lines: ResNet-50, MobileNetV2, DenseNet121, Inception v3/GoogLeNet, mean across seeds. Bold lines: mean across all architectures.