“Reaching 80% Zero-Shot Accuracy With OpenCLIP: ViT-G/14 Trained On LAION-2B”, Mitchell Wortsman 2023:

We have trained a new ViT-G/14 CLIP model with OpenCLIP which achieves 80.1% zero-shot accuracy on ImageNet and 74.9% zero-shot image retrieval (Recall@5) on MS COCO. As of January 2023, this is the best open-source CLIP model.

We believe this is interesting because:

…Also see the figure below (figure code by Ross) and our analysis of scaling trends for OpenCLIP models:

[Figure: ImageNet-1k zero-shot accuracy vs. activation cost for OpenCLIP models]

…Here is a summary figure comparing G/14 and [the now-obsolete prior OpenCLIP model] H/14 made with evals by Romain Beaumont.

Comparison to previous open-source SoTA: Δ zero-shot accuracy (percentage points) of OpenCLIP ViT-G/14 vs. ViT-H/14

…To scale up the batch size to 160k, we used gradient checkpointing and 80GB VRAM A100s…
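Gradient checkpointing trades compute for memory: intermediate activations are discarded during the forward pass and recomputed during backward, which is what makes very large global batch sizes feasible. A minimal sketch in PyTorch (the toy `block` is a hypothetical stand-in for one transformer layer, not OpenCLIP's actual code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical toy module standing in for one transformer block.
block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())

x = torch.randn(4, 16, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed
# during the backward pass, cutting peak activation memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```

In a real training loop this wrapping is applied per transformer block, so peak activation memory scales with one block rather than the full depth.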

For our first unmasked fine-tuning run we did not modify the learning-rate schedule, but instead doubled the base LR and extended the number of iterations so that the run would proceed for an additional 2B samples seen; the LR started at 3.8 × 10⁻⁵. For the second run we used an LR of 5.5 × 10⁻⁵ with a full cosine schedule (warmup for roughly 200M samples and a total of 4B samples). The third run had hyperparameters identical to the first but used the LAION-A subset of LAION-2B; LAION-A is a 900M-sample subset of LAION-2B filtered with the aesthetics predictor V2 at a threshold of 4.5+ and deduplicated by pHash. Instead of waiting for the third run to complete, we used its checkpoint after ~700M samples, which, when “souped” (weight-averaged) with the final checkpoints from the two preceding runs, already allowed us to surpass our goal of 80% accuracy. On its own, this ~700M-sample checkpoint achieved 79.2%.
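“Souping” here refers to uniformly averaging the weights of several fine-tuned checkpoints (Wortsman et al.'s “model soups”). A dependency-free sketch of the idea, using plain Python dicts of floats as hypothetical stand-ins for the three runs' state dicts (in practice one averages torch tensors key by key):

```python
def soup(state_dicts):
    """Uniformly average parameter values across checkpoints ('model soup')."""
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in keys}

# Toy stand-ins for the three fine-tuning checkpoints described above;
# real checkpoints map parameter names to tensors, not scalars.
ckpt_a = {"w": 1.0, "b": 0.5}
ckpt_b = {"w": 3.0, "b": 0.1}
ckpt_c = {"w": 2.0, "b": 0.3}

averaged = soup([ckpt_a, ckpt_b, ckpt_c])  # elementwise mean of the three
```

Because the runs share an architecture and a common fine-tuning starting point, their weights stay close enough that this simple average tends to match or beat the best individual checkpoint.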

Unmasked fine-tuning was done on 512 A100 GPUs at a speed of roughly 10,450 samples/s or 20.4 samples/s/GPU.
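The aggregate throughput is just the per-GPU rate times the GPU count, as a quick check of the reported figures shows:

```python
gpus = 512
per_gpu = 20.4           # samples/s/GPU, as reported
total = gpus * per_gpu   # aggregate samples/s, roughly the stated 10,450
```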