“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (2021-01-05):

[Blog] State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.

We demonstrate that the simple pre-training [contrastive learning] task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the CLIP model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.
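The pre-training task above amounts to a symmetric contrastive objective: score every (image, text) pairing in a batch by cosine similarity and apply cross-entropy in both directions so that matched pairs score highest. A minimal NumPy sketch, assuming pre-computed embeddings (the real model uses learned image/text encoders and a learnable temperature, both elided here):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) embeddings.

    Row i of image_emb is assumed to pair with row i of text_emb.
    """
    # L2-normalize so dot products are cosine similarities.
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = i @ t.T / temperature          # all pairwise similarity scores
    labels = np.arange(len(logits))         # correct pairing lies on the diagonal

    def xent(l):
        # Numerically stable cross-entropy of the diagonal (correct) entries.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly aligned embeddings the loss approaches zero; shuffling the pairings drives it up, which is the signal the encoders are trained on.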

The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
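The zero-shot transfer described above works by embedding each class name (via a prompt such as “a photo of a {label}”) with the text encoder and picking the class whose text embedding is most similar to the image embedding. A hedged sketch, again assuming the embeddings are already computed:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding best matches the image.

    class_text_embs: one row per class, e.g. the embedding of
    "a photo of a {label}" for each label (encoders not shown here).
    """
    i = image_emb / np.linalg.norm(image_emb)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ i))  # highest cosine similarity wins
```

No dataset-specific training is involved: swapping in a new set of class-name embeddings is all that is needed to target a new task.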

Figure 4: Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4× more compute with the baseline zero-shot method but is “free” when amortized over many predictions.
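The ensembling in Figure 4 can be sketched as averaging the embeddings of one class name rendered through several prompt templates (e.g. “a photo of a {label}”, “a drawing of a {label}”) and renormalizing; this averaged vector then stands in for the class in the zero-shot classifier. A minimal sketch, assuming the per-template embeddings already exist:

```python
import numpy as np

def ensemble_class_embedding(template_embs):
    """Combine one class's embeddings across prompt templates.

    template_embs: one row per template rendering of the same class name.
    Returns a single unit-norm class embedding.
    """
    e = template_embs / np.linalg.norm(template_embs, axis=1, keepdims=True)
    mean = e.mean(axis=0)          # average over templates
    return mean / np.linalg.norm(mean)  # renormalize to the unit sphere
```

Because the averaging happens in embedding space, the ensemble costs one text-encoder pass per template at setup time and nothing extra per prediction, hence “free” when amortized.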
Figure 5: Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.
Figure 9: Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44× range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.
Figure 13: Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this “robustness gap” by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101.
Figure 21: Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect prediction is colored orange.

[Evaluations: Food101 · SUN397 · Youtube-BB · EuroSAT · PatchCamelyon (PCam) · ImageNet-A (Adversarial) · CIFAR-10 · CLEVR Count · Facial Emotion Recognition 2013 (FER2013) · UCF101 · Caltech 101 · ImageNet-R (Rendition) · Oxford-IIIT Pets · CIFAR-100 · ImageNet-V2 Matched Frequency · FGVC Aircraft · Country211 · RESISC45 · Stanford Cars · SUN · Kinetics-700 · Flowers-102 · ImageNet · Birdsnap · aYahoo · ObjectNet ImageNet Overlap · ImageNet Blurry · Describable Textures Dataset (DTD) · PASCAL VOC 2007 · MNIST · Street View House Numbers (SVHN) · ImageNet Vid · ImageNet Sketch · Hateful Memes · Stanford Sentiment Treebank · German Traffic Sign Recognition Benchmark (GTSRB)]