"Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP"
How well do image-text models trained on a given dataset generalize to other datasets? [1/12]
The answer is: it’s complicated. Different pretraining datasets work better for different downstream datasets. [2/12]
One interesting but inconvenient result is that mixing in more upstream datasets doesn't necessarily help: the benefits of the best-matched dataset get diluted by the others. [3/12]
But the good news is that you can use a model pretrained on a given dataset as a proxy for training on that dataset. In particular, rather than train on convex… [4/12]
…combinations of datasets, you can take convex combinations of the predictions from their associated models to estimate downstream performance. [5/12]
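Here's a minimal sketch of that idea (my own illustration, not the authors' code; the file names and arrays are hypothetical): blend per-model predicted probabilities on a downstream eval set with convex weights and score the blend, instead of pretraining on the blended data.

```python
import numpy as np

# Hypothetical inputs: predicted class probabilities on the same downstream
# eval set from two models, each pretrained on a different upstream dataset.
probs_a = np.load("probs_model_a.npy")   # shape (n_examples, n_classes)
probs_b = np.load("probs_model_b.npy")   # shape (n_examples, n_classes)
labels  = np.load("eval_labels.npy")     # shape (n_examples,)

def blended_accuracy(alpha):
    """Accuracy of the convex combination alpha*A + (1 - alpha)*B of predictions."""
    blend = alpha * probs_a + (1.0 - alpha) * probs_b
    return (blend.argmax(axis=1) == labels).mean()

# Sweep mixture weights as a cheap proxy for pretraining on mixed data.
for alpha in np.linspace(0.0, 1.0, 5):
    print(f"alpha={alpha:.2f}  est. accuracy={blended_accuracy(alpha):.3f}")
```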
You can take this even further and estimate trend lines for how well a model would perform given more data using just the predictions of the pretrained models for relevant dataset subsets. [6/12]
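A rough sketch of that kind of extrapolation (again my own illustration, with made-up numbers and an assumed log-linear fit, not necessarily the paper's exact functional form): take proxy accuracy estimates for subsets of increasing size, fit a line in log(dataset size), and read off the trend at a larger scale.

```python
import numpy as np

# Hypothetical proxy estimates: (subset size, estimated downstream accuracy),
# where each accuracy comes from the prediction-combination proxy above
# rather than from actually pretraining on that subset.
sizes = np.array([1e6, 3e6, 1e7, 3e7])
accs  = np.array([0.41, 0.47, 0.53, 0.58])

# Transfer accuracy is often close to linear in log(dataset size),
# so fit a line in log-space and extrapolate.
slope, intercept = np.polyfit(np.log10(sizes), accs, deg=1)

target_size = 1e8  # a larger, not-yet-collected dataset
predicted = slope * np.log10(target_size) + intercept
print(f"Predicted accuracy at {target_size:.0e} examples: {predicted:.3f}")
```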
Among other things, these findings highlight the importance of using multiple downstream tasks when assessing the quality of a pretraining dataset. [7/12]
My mental model for pretraining datasets after reading this: yes, more data helps, but the relevance of the pretraining dataset matters much more than its diversity. You get far more benefit from one highly relevant corpus than from many low-relevance ones. [8/12]
In fact, given fixed pretraining compute, you might (?) be better off pretraining separate models for different downstream tasks rather than one big one. This is a pretty different paradigm from the one most people seem to assume. [9/12]
Also—wow, these trends are so linear. We’ve seen this in several papers now, so linearity in transfer accuracy seems to be a real thing. [10/12]
Paper: arxiv.org/abs/2208.05516
If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @HaThaoNguyen @gabriel_ilharco @Mitchnw… [11/12]
…@sewoong79 @lschmidt3
For more paper summaries, you might like following @mosaicml, me, or my newsletter: bit.ly/3OXJbDs
As always, comments and corrections welcome! [12/12]