“Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models”, Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul (2024-05-30):

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that data pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning, and how pruning is affected by the domain composition of the data being pruned.

We demonstrate that, for multiple dataset compositions, perplexity-based pruning of pretraining data can substantially improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average downstream task performance of a 3 billion parameter model by up to 2.04, and achieves up to a 1.45× reduction in pretraining steps needed to reach commensurate baseline performance.
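The core procedure can be sketched as follows: score every candidate document with a small reference language model, then keep a fraction of the data based on those perplexity scores. This is a minimal illustration, not the paper's implementation: the toy add-one-smoothed unigram model stands in for the 125M-parameter reference model, and the `mode` parameter is an assumption reflecting that different ends of the perplexity distribution can be selected.

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total, vocab_size):
    # Toy stand-in for the small reference model: add-one smoothed
    # unigram LM over whitespace tokens. In the paper, perplexity
    # comes from a 125M-parameter transformer instead.
    tokens = text.split()
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens), 1))

def prune_by_perplexity(docs, score_fn, keep_fraction=0.5, mode="low"):
    # Score every document with the reference model, sort by score,
    # and keep `keep_fraction` of the corpus. `mode` picks which end
    # of the perplexity distribution to keep ("low" or "high").
    scored = sorted(docs, key=score_fn, reverse=(mode == "high"))
    k = max(1, int(len(docs) * keep_fraction))
    return scored[:k]
```

For example, training the toy unigram counts on ordinary English text and pruning with `mode="low"` would retain fluent documents and discard gibberish; the larger model is then pretrained only on the retained subset.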

Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.

4.2 How Pruning Affects Domain Composition: We can also interpret the effect that perplexity-based data pruning has on a dataset by examining how pruning affects each domain’s proportion of the total dataset. We plot the pre- and post-pruning domain compositions for the Pile and Dolma in Figure 4.

Interestingly, for all datasets, pruning increases the proportion of data coming from web-scraped domains while decreasing the proportion of data coming from highly specific technical domains such as code or scientific papers. This trend is more pronounced in the Pile, where the proportions of Pile-CC and OpenWebText2 nearly double, while the proportions of domains such as PubMed Central, arXiv, and GitHub are all reduced by at least a factor of 3.
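The domain-composition comparison above amounts to measuring each domain's share of the corpus before and after pruning. A minimal sketch, assuming each document is tagged with its source domain (the `(domain, text)` pairing is an illustrative representation, not the paper's data format):

```python
from collections import Counter

def domain_composition(docs):
    # docs: iterable of (domain, text) pairs.
    # Returns each domain's fraction of the total document count.
    counts = Counter(domain for domain, _ in docs)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}

# Comparing domain_composition(pretraining_docs) against
# domain_composition(pruned_docs) reproduces the kind of shift shown
# in Figure 4: web-scraped shares rise, technical shares fall.
```

Note that shares could equally be computed by token count rather than document count; the figure-level conclusion (web domains gain share, code and scientific papers lose it) is the same kind of comparison either way.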

Future work should investigate how perplexity-based pruning affects a model’s performance on downstream tasks that are in the same category as the highly pruned domains.