“Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP)”, Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt, 2022-05-03:

Contrastively trained image-text models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these image-text models differ from previous training approaches in several ways, an important question is what causes the large robustness gains.

We answer this question via a systematic experimental investigation. Concretely, we study five possible causes for the robustness gains: (1) the training set size, (2) the training distribution, (3) language supervision at training time, (4) language supervision at test time, and (5) the contrastive loss function.
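For context on cause (5), CLIP trains with a symmetric contrastive (InfoNCE) objective over a batch of matched image-text pairs: each image's embedding should be closest to its own caption's embedding and far from the other captions in the batch, and vice versa. A minimal numpy sketch of that loss (the `temperature` default and function names are illustrative, not the paper's implementation):

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs, text_embs: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pair sits on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # stabilize the softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Perfectly aligned pairs (identical, orthogonal embeddings) drive the loss toward zero, while mismatched pairs drive it up; the paper's point is that swapping this objective for ordinary supervised cross-entropy on the same images does not change robustness much.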

Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little or no additional robustness.

Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr [YFCC], to enable further controlled experiments of language-image training.

…In this paper, we answer the question of what drives CLIP’s robustness via a series of controlled experiments that test the five possible causes listed above. Our main result is that CLIP’s robustness is determined almost exclusively by the training distribution. Language supervision at training time does not make the resulting models more robust than standard supervised learning when the images in the training set are the same. Hence language supervision only has an indirect effect on robustness. In particular, language supervision simplifies training on a diverse distribution of images by removing the need for consistent annotation with class labels. The more diverse training distribution—not the language supervision—then leads to more robust representations.
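Cause (4), language supervision at test time, refers to CLIP's zero-shot classification scheme: class labels are turned into text prompts, and the predicted class is the one whose prompt embedding is most similar to the image embedding. A minimal sketch of that inference step, where `embed_text` and the prompt template stand in for the model's actual text encoder and prompt engineering (both hypothetical here):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, embed_text):
    """Pick the class whose prompt embedding is nearest (by cosine
    similarity) to the image embedding. `embed_text` is a stand-in
    for a trained text encoder."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([embed_text(p) for p in prompts])
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    return class_names[int(np.argmax(text_embs @ image_emb))]
```

The paper finds this test-time mechanism is not itself the source of the robustness gains; it is the diversity of the images seen in training that matters.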