"Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP"
How well do image-text models trained on a given dataset generalize to other datasets? [1/12]
The answer is: it’s complicated. Different pretraining datasets work better for different downstream datasets. [2/12]
One interesting but inconvenient result is that mixing in more upstream datasets doesn't necessarily help: the benefits of the best-matched dataset get diluted by the others. [3/12]
But the good news is that you can use a model pretrained on a given dataset as a proxy for training on that dataset. In particular, rather than train on convex… [4/12]
…combinations of datasets, you can take convex combinations of the predictions from their associated models to estimate downstream performance. [5/12]
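Here's a minimal sketch of that idea (my own illustration, not the authors' code; the file names and arrays are hypothetical): blend per-model predicted probabilities on a downstream eval set with convex weights and score the blend, instead of pretraining on the blended data.

```python
import numpy as np

# Hypothetical inputs: predicted class probabilities on the same downstream
# eval set from two models, each pretrained on a different upstream dataset.
probs_a = np.load("probs_model_a.npy")   # shape (n_examples, n_classes)
probs_b = np.load("probs_model_b.npy")   # shape (n_examples, n_classes)
labels  = np.load("eval_labels.npy")     # shape (n_examples,)

def blended_accuracy(alpha):
    """Accuracy of the convex combination alpha*A + (1 - alpha)*B of predictions."""
    blend = alpha * probs_a + (1.0 - alpha) * probs_b
    return (blend.argmax(axis=1) == labels).mean()

# Sweep mixture weights as a cheap proxy for pretraining on mixed data.
for alpha in np.linspace(0.0, 1.0, 5):
    print(f"alpha={alpha:.2f}  est. accuracy={blended_accuracy(alpha):.3f}")
```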
You can take this even further and estimate trend lines for how well a model would perform given more data using just the predictions of the pretrained models for relevant dataset subsets. [6/12]
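A rough sketch of that kind of extrapolation (again my own illustration, with made-up numbers and an assumed log-linear fit, not necessarily the paper's exact functional form): take proxy accuracy estimates for subsets of increasing size, fit a line in log(dataset size), and read off the trend at a larger scale.

```python
import numpy as np

# Hypothetical proxy estimates: (subset size, estimated downstream accuracy),
# where each accuracy comes from the prediction-combination proxy above
# rather than from actually pretraining on that subset.
sizes = np.array([1e6, 3e6, 1e7, 3e7])
accs  = np.array([0.41, 0.47, 0.53, 0.58])

# Transfer accuracy is often close to linear in log(dataset size),
# so fit a line in log-space and extrapolate.
slope, intercept = np.polyfit(np.log10(sizes), accs, deg=1)

target_size = 1e8  # a larger, not-yet-collected dataset
predicted = slope * np.log10(target_size) + intercept
print(f"Predicted accuracy at {target_size:.0e} examples: {predicted:.3f}")
```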
Among other things, these findings highlight the importance of using multiple downstream tasks when assessing the quality of a pretraining dataset. [7/12]
My mental model for pretraining datasets after reading this: yes, more data helps, but the relevance of the pretraining dataset matters much more than its diversity. You get far more benefit from one highly relevant corpus than from many low-relevance ones. [8/12]
In fact, given fixed pretraining compute, you might (?) be better off pretraining separate models for different downstream tasks rather than one big one. This is a pretty different paradigm from the one most people seem to assume. [9/12]
Also—wow, these trends are so linear. We’ve seen this in several papers now, so linearity in transfer accuracy seems to be a real thing. [10/12]
Paper: arxiv.org/abs/2208.05516
If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @HaThaoNguyen @gabriel_ilharco @Mitchnw… [11/12]
…@sewoong79 @lschmidt3
For more paper summaries, you might like following @mosaicml, me, or my newsletter: bit.ly/3OXJbDs
As always, comments and corrections welcome! [12/12]