“No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-Based Language Models”, Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner (2023-07-12):

The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training.

In this work, we revisit 3 categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO-LOSS), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate.

We define an evaluation protocol that enables experiments to be run on arbitrary machines by mapping all computation time onto a reference machine, a measure we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: Github.
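The mapping onto a reference machine can be sketched as a simple rescaling of measured wall-clock time by relative hardware speed. The function name and the throughput-ratio calibration below are assumptions for illustration; the paper's actual protocol may calibrate differently.

```python
def to_reference_system_time(wall_clock_seconds: float,
                             machine_throughput: float,
                             reference_throughput: float) -> float:
    """Hypothetical conversion to 'reference system time' (RST):
    scale the wall-clock time measured on this machine by the ratio
    of its throughput (e.g., training steps/sec on a fixed workload)
    to the reference machine's throughput, so compute budgets are
    comparable across heterogeneous hardware."""
    return wall_clock_seconds * machine_throughput / reference_throughput
```

For example, an hour of compute on a machine twice as fast as the reference counts as two reference hours, so a "12-hour RST budget" consumes only 6 wall-clock hours there.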

…Case-Study 2: Data Selection:

…5.1 Selective Backprop: Due to its simplicity, we choose selective backprop—outlined in Algorithm 3. The high-level idea is to compute the backward pass only on the training examples with the highest loss. To construct such batches, we first compute the loss for each example in a uniformly-sampled batch via a forward pass, then select a subset ranked by loss percentile relative to the historical losses of recently ingested sequences.
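The selection step can be sketched as follows. This is a minimal illustration of the percentile-against-history idea, not the paper's Algorithm 3: the history size, the selection power `beta`, and the probabilistic-selection rule are assumptions borrowed from the original selective-backprop formulation.

```python
import random
from collections import deque

class SelectiveBackprop:
    """Sketch of selective backprop: after a forward pass over a
    uniformly-sampled batch, keep only high-loss examples for the
    backward pass, ranking each loss against a window of recent losses."""

    def __init__(self, history_size: int = 1024, beta: float = 2.0):
        self.history = deque(maxlen=history_size)  # recent per-example losses
        self.beta = beta  # sharpness of the preference for high-loss examples

    def percentile(self, loss: float) -> float:
        # Fraction of recent losses that this loss meets or exceeds.
        if not self.history:
            return 1.0  # no history yet: treat everything as high-loss
        return sum(l <= loss for l in self.history) / len(self.history)

    def select(self, losses: list) -> list:
        """Given per-example losses from a forward pass, return the
        indices of examples chosen for the backward pass."""
        chosen = []
        for i, loss in enumerate(losses):
            p = self.percentile(loss) ** self.beta  # higher loss -> more likely
            if random.random() < p:
                chosen.append(i)
            self.history.append(loss)
        return chosen
```

Raising `beta` makes the filter more aggressive: a median-loss example survives with probability 0.5^beta, while top-percentile examples are almost always kept.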

…5.2 RHO-LOSS: Mindermann et al. (2022) argue that prioritizing high training losses results in prioritizing two types of examples that are unwanted: (1) mislabeled and ambiguous data, as commonly found in noisy, web-crawled data; and (2) outliers, which are less likely to appear at test time. The authors propose down-weighting such data via a selection objective called Reducible Holdout (RHO) loss.
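The RHO-LOSS selection rule scores each candidate by its reducible holdout loss: the current training loss minus an "irreducible" loss estimated by a model trained on held-out data. A minimal sketch, assuming the irreducible losses are precomputed (the function name and top-k selection are illustrative):

```python
import numpy as np

def rho_loss_select(train_losses, irreducible_losses, k):
    """Score candidates by reducible holdout loss and keep the top-k.

    Mislabeled/ambiguous points have high irreducible loss (even the
    holdout model cannot fit them), and already-learned points have low
    training loss; both kinds score low and are skipped, while learnable
    not-yet-learned points score high."""
    rho = np.asarray(train_losses) - np.asarray(irreducible_losses)
    return np.argsort(-rho)[:k]  # indices of the k highest-scoring examples
```

For instance, a point with training loss 2.0 and irreducible loss 1.9 (likely mislabeled) scores only 0.1 and is out-ranked by a point with training loss 2.0 and irreducible loss 0.2.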

…5.3 Results: We assume that the effects of selecting better training data should be largely agnostic to whether we pre-train a BERT or T5 model. Hence, instead of training both architectures, we pre-train only BERT models and vary the datasets and budgets as follows.

Figure 3: Validation losses of the data selection methods (selective backprop and RHO-LOSS) on different datasets, for a 12-hour RST budget.

For the first set of experiments, we fix the budget to 12 hours and consider 3 different datasets: (1) C4, consisting only of web-page text which, despite being regularly used for pre-training, is known to have some quality issues; (2) BookCorpus & Wikipedia, which contain polished, book(-like) text; and (3) MiniPile, a subset of the diverse Pile pre-training corpus, containing code, mathematics, books, webpages, and other scientific articles…We find that both data selection methods underperform the baseline. Next, we investigate downstream performance: we fix C4 as the pre-training corpus and vary the budgets (6, 12, and 24 hours). Figure 2 & Figure 16 show the results, and we again observe no noticeable difference between the methods.