[cf. RHO-LOSS; Sorscher et al., 2022] Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens.
Challenging this norm, we posit that "not all tokens in a corpus are equally important for language model training". Our initial analysis examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens.
Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that are aligned with the desired distribution. This approach scores pretraining tokens using a reference model and then trains the language model with a loss focused on the higher-scoring tokens.
When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and Rho-1-7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.
Figure 1: We continually pretrain 1B and 7B LMs with 15B OpenWebMath tokens.
RHO-1 is trained with our proposed Selective Language Modeling (SLM): we first train a reference model on a curated, high-quality dataset; this model then scores the loss of each token in the pretraining corpus; finally, we train the language model selectively, focusing on tokens with high excess loss between the training model and the reference model. Baselines are trained using causal language modeling. SLM improves average few-shot accuracy on GSM8k and MATH by over 16%, reaching baseline performance 5–10× faster.
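The selection criterion in this pipeline can be written compactly. The notation below is our shorthand for illustration ($\mathcal{L}_{\theta}$, $\mathcal{L}_{\mathrm{RM}}$, and $\mathcal{T}_{k}$ are assumed names), matching the description rather than quoting the paper's exact equations:

```latex
% Excess loss of token x_i: training-model loss minus reference-model loss
S(x_i) = \mathcal{L}_{\theta}(x_i) - \mathcal{L}_{\mathrm{RM}}(x_i)

% SLM objective: average cross-entropy over the set T_k of top-k% tokens by S
\mathcal{L}_{\mathrm{SLM}}(\theta) = \frac{1}{|\mathcal{T}_{k}|}
    \sum_{x_i \in \mathcal{T}_{k}} -\log p_{\theta}\left(x_i \mid x_{<i}\right)
```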
Figure 2:
Upper: Even an extensively filtered pretraining corpus contains token-level noise.
Left: Previous Causal Language Modeling (CLM) trains on all tokens.
Right: Our proposed Selective Language Modeling (SLM) selectively applies the loss to useful, clean tokens.
…To explore how language models learn at the token level, we first examined training dynamics, in particular how token-level loss evolves during standard pretraining. In §2.1, we evaluate the model's token perplexity at different checkpoints and categorize tokens into distinct types. Our findings reveal that significant loss reduction is limited to a select group of tokens: many are "easy tokens" that have already been learned, and some are "hard tokens" that exhibit variable losses and resist convergence. These tokens can lead to numerous ineffective gradient updates.
Figure 3: The loss of 4 categories of tokens during pretraining.
(a) shows the loss of H → H, L → H, H → L, and L → L tokens during pretraining.
(b, c) show 3 cases of fluctuating tokens’ loss in L → L and H → H during pretraining, respectively.
…2.1 Not All Tokens Are Equal: Training Dynamics of Token Loss
Our investigation begins with a critical look at how individual tokens' losses evolve during standard pre-training. We continue pre-training TinyLlama-1B with 15B tokens from OpenWebMath, saving checkpoints after every 1B tokens. We then evaluate token-level loss at these intervals on a validation set of ~320,000 tokens. Figure 3(a) reveals a striking pattern: tokens fall into 4 categories based on their loss trajectories—persistent high loss (H → H), increasing loss (L → H), decreasing loss (H → L), and consistently low loss (L → L). For further details on these categories, see §D.1.
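The four-way labeling can be sketched as follows. The function name and the 1.0-nat threshold are hypothetical choices for illustration; the paper's exact criterion is described in §D.1:

```python
def categorize_token(first_loss: float, last_loss: float, threshold: float = 1.0) -> str:
    """Label a token's loss trajectory between the first and last checkpoint.

    A loss above `threshold` counts as high (H), otherwise low (L),
    yielding the four classes H->H, L->H, H->L, and L->L.
    The threshold value is an assumption for illustration.
    """
    first = "H" if first_loss > threshold else "L"
    last = "H" if last_loss > threshold else "L"
    return f"{first}->{last}"
```

For example, a token whose loss falls from 2.4 to 0.3 over training is labeled H->L, while one that stays near 0.4 is L->L.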
Our analysis uncovers that a mere 26% of tokens show a notable loss reduction (H → L), while the majority (51%) remain in the L → L category, indicating they have already been learned. Interestingly, 11% of the tokens are persistently challenging (H → H), likely due to high aleatoric uncertainty [Hüllermeier and Waegeman, 2021]. Additionally, 12% of tokens experience an unexpected loss increase (L → H) during training.
Our second observation is that a substantial number of token losses exhibit persistent fluctuations and resist convergence. The losses of many L → L and H → H tokens, as depicted in Figures 3b and 3c, show high variance during training. In §D.2, we visualize and analyze the content of these tokens and find that many of them are noisy, consistent with our hypothesis. We therefore conclude that the loss of an individual token does not decrease smoothly like the overall loss; instead, complex training dynamics play out across different tokens. If we can select the appropriate tokens for the model to focus on during training, we may be able to stabilize the training trajectory and enhance its efficiency.
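One simple way to flag such non-converging tokens is to measure the loss variance over the final stretch of checkpoints: a converged token settles to a stable value, while a fluctuating one keeps a high tail variance. This sketch is our own illustration, not the paper's analysis code; the function name and the tail fraction are assumptions:

```python
def tail_variance(losses, tail=0.5):
    """Variance of a token's loss over the last `tail` fraction of checkpoints.

    A token that has converged shows low tail variance; a fluctuating
    token keeps a high one. `tail=0.5` is an illustrative default.
    """
    k = max(2, int(len(losses) * tail))
    window = losses[-k:]
    mean = sum(window) / k
    return sum((x - mean) ** 2 for x in window) / k
```

A smoothly converging trajectory such as [2.0, 1.0, 0.5, 0.3] thus scores far lower than an oscillating one such as [0.2, 1.8, 0.3, 1.5].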
…In practice, token selection can be implemented online by ranking the tokens in a batch according to their excess loss and using only the top k% of tokens for training (similar in spirit to top-k training of GANs). This process removes the loss on undesired tokens without incurring additional cost during pretraining, making our approach both efficient and easy to integrate.
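The batch-wise selection step might look like the following sketch, assuming per-token losses from the training and reference models have already been computed. The function name and the 60% default ratio are illustrative, not the paper's implementation:

```python
def select_top_excess(train_losses, ref_losses, keep_ratio=0.6):
    """Keep the top `keep_ratio` fraction of tokens by excess loss.

    Excess loss = training-model loss - reference-model loss.
    Only the kept tokens contribute to the averaged SLM training loss;
    all other tokens are simply excluded from the backward pass.
    """
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(keep_ratio * len(excess)))
    # Indices of the k tokens with the highest excess loss.
    order = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    kept = set(order[:k])
    slm_loss = sum(train_losses[i] for i in kept) / k
    return kept, slm_loss
```

For instance, with training losses [2.0, 0.5, 3.0, 1.0], reference losses [1.0, 0.4, 0.5, 1.2], and keep_ratio=0.5, tokens 2 and 0 (excess 2.5 and 1.0) are kept and the SLM loss averages their training losses to 2.5.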
Note, however, that dropping token losses alone yields limited efficiency gains, since the forward pass still processes every token; larger savings would require dropping entire data points.
Figure 6: The dynamics of pretraining loss and downstream loss.
(a, c) represent the loss of tokens selected/unselected by SLM during pretraining under both the SLM and CLM methods, while (b) represents the loss of the SLM and CLM methods on MetaMath (Yu et al., 2024).
We verified the above observations in a pretraining run totaling 4 billion tokens.
…What Tokens are Selected with SLM? To further explore its working mechanism, we analyze the tokens selected by the SLM method during pretraining. To this end, we visualize the token selection process during the training of RHO-1 on the OpenWebMath corpus. In §G.1's Figure 13, we highlight in blue the tokens retained during actual pretraining. We observe that the majority of tokens chosen by the SLM method are closely related to mathematics, effectively training the model on the parts of the original corpus that are pertinent to mathematical content.
Figure 8: The PPL of tokens selected by different checkpoints.
We test the PPL of the tokens selected at the 2B, 5B, 8B, 11B, and 14B checkpoints.
Furthermore, we investigated how token filtering differs across checkpoints during training and tested the perplexity of these tokens at each checkpoint. As illustrated in Figure 8, we found that tokens selected by later checkpoints tend to have higher perplexity in the later stages of training and lower perplexity in the earlier stages. This may suggest that the model first optimizes tokens with a larger learnable space, thereby increasing learning efficiency. Moreover, we noticed a sample-wise "double descent" (Nakkiran et al., 2021) in the loss of selected tokens, where the selected tokens' perplexity initially increases before decreasing. This might be an effect of selecting tokens by excess loss, which targets the tokens most in need of training at each checkpoint.
Effect of Token Selection Ratio: We investigate the impact of the token selection ratio in SLM. Generally, the ratio is set by heuristic rules, similar to the masking ratios previously used in training Masked Language Models (MLMs) (Devlin et al., 2019; Liu et al., 2019). As shown in Figure 9, selecting about 60% of the original tokens works best.