“Why Do Small Language Models Underperform? Studying Language Model Saturation via the Softmax Bottleneck”, Nathan Godey, Éric de la Clergerie, Benoît Sagot (2024-04-11):

Recent advances in language modeling consist of pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training, followed by a plateau.

In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon.

We measure the effect of the softmax bottleneck in various settings and find that models based on fewer than 1,000 hidden dimensions tend to adopt degenerate latent representations late in pretraining, which leads to reduced evaluation performance.
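The rank mismatch the paper describes can be illustrated numerically: with a linear prediction head, the logit matrix over any set of contexts has rank at most the hidden dimension *d*, so when the vocabulary *V* (and the rank of the target distribution) exceeds *d*, some target distributions are unreachable. A minimal sketch (all sizes and names here are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sizes: hidden dimension d much smaller than vocabulary size V.
d, V = 8, 100
rng = np.random.default_rng(0)

# Linear LM head: logits for each context are W @ h, with W of shape (V, d).
W = rng.standard_normal((V, d))

# Stack hidden states for many contexts into H of shape (n_contexts, d).
H = rng.standard_normal((500, d))
logits = H @ W.T  # shape (500, V)

# No matter how many contexts we feed in, the logit matrix has rank <= d,
# so it cannot match an arbitrary full-rank (rank-V) target log-probability
# matrix -- the softmax bottleneck.
print(np.linalg.matrix_rank(logits))  # at most d, here 8
```

(The classic softmax-bottleneck result, from Yang et al. 2017, is slightly finer: after the per-row softmax normalization, the achievable log-probability matrices have rank at most *d* + 1, but the qualitative point is the same.)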

[large BPE vocab tokenization can destroy LLM scaling by blocking training after enough steps: an implication here is that if there were extreme supra-Chinchilla scaling laws, and you used a standard BPE vocab (never mind the extremely large BPE vocabularies of 1 million+ that some groups experiment with), you might not find them, because the necessary number of training steps would take you into the saturation regime where the minor technical detail of tokenization starts degrading your scaling. (You wouldn’t have to be totally saturated to start falling off optimal scaling and so derive misleading scaling laws.)

Whereas if you used character/byte tokenization, you’d never even know this was a problem. But on the gripping hand, if you used BPEs and were affected by saturation, you might never realize that, at scale, better tokenization would change your scaling laws…]