“Simple and Scalable Strategies to Continually Pre-Train Large Language Models”, Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish (2024-03-13):

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data.

In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is:

sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English → English) and a stronger distribution shift (English → German) at the 405M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM.

Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute.
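The two ingredients above can be sketched concretely. The snippet below is a minimal illustration, not the authors' code: a standard cosine schedule with linear warmup, which is simply run again from scratch for the continual phase ("re-warming" and "re-decaying"), plus a sampler that draws each example from the previous dataset with some replay probability (the paper uses e.g. 5% for SlimPajama and 25% for German). Function names and signatures here are hypothetical.

```python
import math
import random

def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    """Cosine decay with linear warmup. For continual pre-training,
    the same schedule is restarted on the new dataset, which re-warms
    the LR to max_lr and then re-decays it to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def sample_batch(new_data_iter, old_data_iter, replay_frac, rng=random):
    """Draw from the previous-distribution data with probability
    replay_frac (e.g. 0.05), otherwise from the new data."""
    return next(old_data_iter) if rng.random() < replay_frac else next(new_data_iter)
```

At the end of warmup the schedule reaches `max_lr` exactly, and at `total_steps` it bottoms out at `min_lr`; restarting it on new data reproduces the re-warming that, combined with replay, matches the full re-training baseline in the paper's experiments.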

Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
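One such alternative family keeps the learning rate on a constant plateau after an initial warmup and cooldown, applying a short anneal only when training actually stops, so the schedule never commits to a fixed token budget. The sketch below is an assumed simplification of that idea (parameter names are hypothetical, and the exact phase shapes in the paper may differ):

```python
def infinite_lr(step, warmup_steps, cooldown_steps, max_lr, const_lr,
                min_lr, anneal_start=None, anneal_steps=1):
    """Budget-free ("infinite") schedule sketch: linear warmup to max_lr,
    linear cooldown to a constant plateau const_lr, then hold indefinitely.
    A short linear anneal to min_lr is applied only when a stopping point
    (anneal_start) is chosen, e.g. to release a checkpoint."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + cooldown_steps:
        t = (step - warmup_steps) / cooldown_steps
        return max_lr + t * (const_lr - max_lr)
    if anneal_start is not None and step >= anneal_start:
        t = min(1.0, (step - anneal_start) / max(1, anneal_steps))
        return const_lr + t * (min_lr - const_lr)
    return const_lr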

Figure 1: Continual pre-training decreases the computational cost of updating the model while maintaining similar final validation and evaluation performance. We report results for a baseline model trained on the union of both datasets, Pile ∪ SlimPajama (SP) / German (Ger.) [based on RedPajama], which we consider an upper bound on performance. We also report performance for two continually pre-trained models. “PT on Pile” starts from a pre-trained Pile checkpoint and only uses learning rate re-warming and re-decaying, while “Replay (PT on Pile)” re-warms the learning rate, re-decays it, and uses 5% replay for SlimPajama and 25% replay for German. We observe that the combination of LR re-warming, re-decaying, and replay allows our continually pre-trained model to attain similar performance to the baseline model while requiring substantially less compute. We note that this setting assumes that a pre-trained model is available (e.g. via the Hugging Face Hub, or an in-house model designed to be continually pre-trained).