We have trained a new ViT-G/14 CLIP model with OpenCLIP which achieves 80.1% zero-shot accuracy on ImageNet and 74.9% zero-shot image retrieval (Recall@5) on MS COCO. As of January 2023, this is the best open-source CLIP model.
We believe this is interesting because:
CLIP models are useful for zero-shot classification, retrieval, and for guidance/conditioning in generative models (OpenCLIP is used in Stable Diffusion V2, and currently the third most downloaded model on HuggingFace is a CLIP model). The approach underlying CLIP (self-supervised learning on a large, heterogeneous dataset) has been shown to produce models that are more robust and fair.
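At inference time, zero-shot classification with a CLIP-style model reduces to embedding each candidate class name with the text encoder and picking the class whose embedding is most similar to the image embedding. A minimal sketch with toy, hand-made embeddings (the real embeddings come from the OpenCLIP encoders and prompt templates; the names below are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, class_embs):
    """Return the class prompt whose text embedding best matches the image."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))

# Toy embeddings standing in for encoder outputs.
class_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
image_emb = [0.8, 0.2, 0.1]
print(zero_shot_classify(image_emb, class_embs))  # → a photo of a dog
```

The same similarity ranking, applied over a gallery of images for a text query (or vice versa), is what the MS COCO retrieval numbers measure.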
Our new ViT-G model achieves the highest zero-shot ImageNet accuracy among models trained only on naturally occurring image-text pairs, without explicit labels, pseudo-labels, or any pretrained image or text encoders.
Our training run used multiple new techniques, including FLIP to accelerate training and model soups (weight averaging of fine-tuned checkpoints) to surpass 80% accuracy.
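FLIP speeds up training by randomly dropping a large fraction of image patches before they enter the vision encoder, so each step processes far fewer tokens. A minimal sketch of just the masking step, assuming patches have already been extracted (pure Python; the encoder and patchification are stand-ins):

```python
import random

def flip_mask(patches, keep_ratio=0.5, seed=None):
    """Keep a random subset of patches; FLIP trains on only these tokens."""
    rng = random.Random(seed)
    n_keep = int(len(patches) * keep_ratio)
    keep = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep]

# A 14x14 patch grid gives 196 tokens; with keep_ratio=0.5 only 98 survive,
# roughly halving the vision encoder's per-step compute.
patches = list(range(196))
visible = flip_mask(patches, keep_ratio=0.5, seed=0)
print(len(visible))  # → 98
```

Because the encoder never sees the masked tokens during pretraining, a short unmasked fine-tuning phase (described below) closes the train/inference gap.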
…Here is a summary figure comparing G/14 with the previous OpenCLIP model, H/14, made with evals by Romain Beaumont.
Comparison to previous open-source SoTA: Δ zero-shot accuracy (percentage points), OpenCLIP ViT-G/14 vs. ViT-H/14
…To scale the batch size up to 160k, we used gradient checkpointing and A100 GPUs with 80GB of VRAM…
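Gradient checkpointing trades compute for memory: instead of storing every intermediate activation for the backward pass, only segment boundaries are kept and the rest are recomputed, which is what makes very large per-GPU batches fit. A minimal PyTorch sketch of the idea (a toy stand-in trunk, not the actual OpenCLIP training code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical stand-in for a deep transformer trunk.
blocks = nn.Sequential(
    *[nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)]
)

x = torch.randn(4, 64, requires_grad=True)
# Split the trunk into 4 segments; only activations at segment boundaries
# are stored, and each segment is recomputed during backward.
y = checkpoint_sequential(blocks, 4, x, use_reentrant=False)
y.sum().backward()
```

In practice OpenCLIP exposes this as a training flag rather than requiring manual wrapping.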
For our first unmasked fine-tuning run we did not modify the learning-rate schedule, but instead doubled the base LR and extended the number of iterations so that the run would proceed for an additional 2B samples seen; the LR started at 3.8 × 10⁻⁵. For the second run we used an LR of 5.5 × 10⁻⁵ with a full cosine schedule (warmup for roughly 200M samples and a total of 4B samples). The third run had identical hyperparameters to the first but used the LAION-A subset of LAION-2B; LAION-A is a 900M-sample subset of LAION-2B filtered with aesthetic predictor V2 (score 4.5+) and deduplicated with pHash. Instead of waiting for the third run to complete, we used its checkpoint after ~700M samples, which, when "souped" with the final checkpoints from the two preceding runs, already allowed us to surpass our goal of 80% accuracy. This individual checkpoint achieved 79.2%.
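The "soup" itself is just a uniform average of the checkpoints' weights, following the model-soups recipe. A minimal sketch over plain dicts of scalars standing in for real state dicts of tensors (the run names are illustrative):

```python
def soup(checkpoints):
    """Uniformly average parameter dicts with identical keys (a model soup)."""
    n = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n for k in checkpoints[0]}

# Three toy "checkpoints" standing in for the three fine-tuning runs.
run1 = {"w": 0.2, "b": 1.0}
run2 = {"w": 0.4, "b": 2.0}
run3 = {"w": 0.6, "b": 3.0}
print(soup([run1, run2, run3]))  # averages to w ≈ 0.4, b = 2.0
```

Unlike ensembling, the souped model has the same size and inference cost as a single checkpoint, since the averaging happens in weight space.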
Unmasked fine-tuning was done on 512 A100 GPUs at a speed of roughly 10,450 samples/s or 20.4 samples/s/GPU.
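The two throughput figures are consistent with each other, as a quick check shows (numbers taken from the text above):

```python
gpus = 512
per_gpu = 20.4          # samples/s/GPU
total = gpus * per_gpu  # aggregate samples/s across the cluster
print(total)            # → 10444.8, i.e. roughly the quoted 10,450 samples/s

# At this rate, the 2B extra samples of the first fine-tuning run take
# about 2.2 days of wall-clock time.
days = 2e9 / total / 86400
print(days)
```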