"Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", 2022-07-15 ():
[Twitter] The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
Our work studies this question through a carefully controlled comparison of two approaches [CLIP vs SimCLR] in terms of their ability to learn representations that generalize to downstream classification tasks.
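As a rough sketch (not the paper's code), both approaches optimize an InfoNCE-style contrastive loss; the difference is what forms a positive pair. In CLIP it is an image and its caption; in SimCLR it is two augmented views of the same image. The function name and the temperature value below are illustrative assumptions:

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over two batches of embeddings.

    For a CLIP-style objective, `a` holds image embeddings and `b` the
    matching caption embeddings; for a SimCLR-style objective, `a` and
    `b` are embeddings of two augmented views of the same images.
    Either way, the positive pair for row i is (a[i], b[i]).
    """
    # L2-normalize so the dot product is cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # all pairwise similarities
    labels = np.arange(len(a))      # the matching index is the positive

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

The only structural difference between the two objectives in this sketch is the source of `b`, which is what makes a controlled comparison of the two supervision signals possible.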
We find that when the pre-training dataset meets certain criteria (it is sufficiently large and contains descriptive captions with low variability), image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, and the added supervision from captions is actually detrimental.
Motivated by our findings, we devise simple prescriptions that enable CLIP to better leverage the language information present in existing pre-training datasets [via data augmentation: generating multiple text captions per image using GPT-J, to overcome extraneous variation between captions].
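The augmentation prescription can be sketched as a caption bank: each image stores several generated variants, and training samples one per step, so the model is less sensitive to the idiosyncratic wording of any single caption. This is a self-contained sketch, not the paper's implementation; the paper uses GPT-J for paraphrasing, which is replaced here by a trivial template stub so the example runs standalone, and all function names are hypothetical:

```python
import random

def paraphrase_stub(caption, n):
    """Placeholder for an LM paraphraser (the paper uses GPT-J); here we
    emit trivial template variants so the sketch is self-contained."""
    templates = ["{}", "a photo of {}", "an image showing {}"]
    return [templates[i % len(templates)].format(caption) for i in range(n)]

def build_caption_bank(dataset, n_variants=3, paraphrase=paraphrase_stub):
    """Map each image id to several caption variants.

    `dataset` is an iterable of (image_id, caption) pairs; the returned
    dict lets the training loop sample a different phrasing each step.
    """
    return {img_id: paraphrase(cap, n_variants) for img_id, cap in dataset}

def sample_caption(bank, img_id, rng=random):
    """At each training step, draw one stored variant uniformly at random."""
    return rng.choice(bank[img_id])
```

In a real pipeline, `paraphrase_stub` would be swapped for a call to a language model, and the sampled caption would be fed to the text encoder in place of the original caption.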