Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments become increasingly expensive. However, previous work on scaling laws has primarily used private data and models or focused on uni-modal language or vision learning.
To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks, including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning.
We find that the training distribution plays a key role in scaling laws, as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes (stronger scaling at zero-shot classification and zero-shot retrieval, respectively).
We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible.
Source code and instructions to reproduce this study are available on GitHub.
Figure 1: Relationship between total training compute and performance on zero-shot classification (a) and retrieval (b). We fit a power law on the Pareto frontier of the available models. Since the total compute budgets (measured in GMACs) of different trained models are not exactly aligned, we divide the total compute scale into bins and select the best model performance from each bin.
(a) Relationship between total training compute and zero-shot classification performance on downstream tasks. Left: ImageNet performance. Right: average performance on five ImageNet robustness datasets (ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A). Scaling model size, data size, and samples seen leads to better performance on zero-shot classification. Models trained on OpenAI's WebImageText (WIT) show stronger scaling than models trained on LAION.
(b) Relationship between total training compute and zero-shot image retrieval performance on MS-COCO (left) and Flickr30K (right). Scaling model size, data size, and samples seen leads to better performance on zero-shot image retrieval. Interestingly, in contrast to zero-shot classification (a), models trained on LAION show a stronger scaling trend than OpenAI CLIP models trained on OpenAI's WebImageText (WIT) dataset.
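The binned Pareto-frontier fit used for these figures can be sketched as follows. This is a minimal illustration, not the paper's actual fitting code: the data, bin count, and power-law parameters below are synthetic placeholders.

```python
# Sketch of the binned Pareto-frontier power-law fit: bin models by
# log-compute, keep the best (lowest-error) model per bin, and fit
# error ~ a * compute**b in log-log space. All data here is synthetic.
import numpy as np

def fit_power_law(compute, error, n_bins=8):
    """Fit a power law to the Pareto frontier of (compute, error) points."""
    compute, error = np.asarray(compute, float), np.asarray(error, float)
    bins = np.logspace(np.log10(compute.min()), np.log10(compute.max()), n_bins + 1)
    xs, ys = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (compute >= lo) & (compute <= hi)
        if mask.any():
            best = np.argmin(error[mask])  # best model in this compute bin
            xs.append(compute[mask][best])
            ys.append(error[mask][best])
    # Linear fit in log-log space: log(error) = b * log(compute) + log(a)
    b, log_a = np.polyfit(np.log(xs), np.log(ys), 1)
    return np.exp(log_a), b  # error ~ a * compute**b

# Synthetic example: error decays as compute**-0.1, with noise above the frontier
rng = np.random.default_rng(0)
c = np.logspace(3, 9, 40)
e = 2.0 * c**-0.1 * (1 + rng.uniform(0, 0.2, c.size))
a, b = fit_power_law(c, e)
print(f"fitted exponent: {b:.3f}")  # should recover an exponent near -0.1
```

Selecting only the best model per compute bin approximates the Pareto frontier, so under-trained or mis-sized runs at a given budget do not drag the fit downward.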
…Compared to the original CLIP training procedure, we work with larger batch sizes and adapt the learning rate accordingly. We opt for larger batch sizes to allow for more efficient distributed training; maximizing the local batch size per GPU and using close to 1,000 GPUs leads us to global batch sizes in the range of 86–88K samples. To assess the validity of re-using measurements obtained with different batch sizes, we perform a number of control experiments varying the batch size from 32K to 86–88K, and observe a difference of 0.2–0.5% across different settings (see Appendix §B.2.3), which is small enough not to confound observations on the effect of scale… Using data-parallel training via PyTorch DDP, we conduct experiments with up to 1,520 NVIDIA A100 GPUs. Distributed training was executed on JUWELS Booster, the supercomputer at the Jülich Supercomputing Centre (JSC, Germany), and partly on the Stability AI AWS supercomputer.
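The batch-size bookkeeping above can be illustrated with a short sketch. The linear learning-rate scaling rule used here is a common heuristic for larger batches, not necessarily the exact adaptation used in the study, and all numeric values (local batch, GPU count, base learning rate) are hypothetical placeholders.

```python
# Sketch of global batch size under data-parallel (DDP) training and a
# linear learning-rate adaptation. Values are illustrative placeholders,
# not the study's actual hyperparameters.
def global_batch_size(local_batch, world_size):
    """Global batch = per-GPU batch times number of data-parallel workers."""
    return local_batch * world_size

def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * new_batch / base_batch

# e.g. a hypothetical 96-sample local batch on 896 GPUs
gb = global_batch_size(local_batch=96, world_size=896)
lr = scaled_lr(base_lr=5e-4, base_batch=32_768, new_batch=gb)
print(f"global batch: {gb}, adapted lr: {lr:.2e}")
```

Maximizing the local batch first, then deriving the global batch from the worker count, is what pushes the global batch into the 86–88K range reported above.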
…We also observe bottleneck behaviors [35, 84] that occur when fixing one scaling dimension while increasing others. For instance, OpenCLIP ViT-B/32 and ViT-B/16 are bottlenecked by the number of samples seen at the 13B scale. Increasing the number of samples seen to 34B reveals that LAION-2B brings a clear improvement over LAION-400M, which would remain hidden when fixing the number of samples seen at a lower value. Similar observations may occur along other scaling dimensions. OpenCLIP ViT-L/14 shows an example of a data-scale bottleneck at the LAION-400M scale, as increasing the number of samples seen from 13B to 34B does not lead to improvements. The benefit of using a larger number of samples seen is then revealed when moving to the larger LAION-2B dataset.
…Using the obtained power law, we can predict the performance of a well-tuned ViT-g/14 model at the largest data scale of 2B and samples-seen scale of 34B, giving an error estimate of 20.9% (79.1% top-1 accuracy) on ImageNet. We predict even stronger performance at larger scales. For instance, assuming 68B samples seen, we estimate top-1 accuracies of 79.7%, 80.7%, and 81.9% for ViT-H/14, ViT-g/14, and ViT-G/14, respectively (see also Appendix §B.2.1).
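Mechanically, such extrapolation amounts to evaluating the fitted power law at larger compute budgets. A minimal sketch, with the caveat that the coefficients and compute values below are made-up placeholders rather than the paper's fitted parameters:

```python
# Sketch of extrapolating zero-shot error from a fitted power law
# error(C) = a * C**b. Coefficients `a`, `b` and the compute budgets
# are hypothetical placeholders, not the study's fitted values.
def predict_error(compute_gmacs, a, b):
    """Evaluate the fitted power law at a given total training compute."""
    return a * compute_gmacs ** b

a, b = 3.0, -0.11  # hypothetical fit from smaller-scale runs
for name, compute in [("ViT-H/14", 5e11), ("ViT-g/14", 7e11), ("ViT-G/14", 2e12)]:
    err = predict_error(compute, a, b)
    print(f"{name}: predicted top-1 error {100 * err:.1f}%")
```

Since b is negative, predicted error decreases monotonically with compute, which is what licenses the claim that larger models and more samples seen should yield higher accuracy, provided the power law continues to hold at the extrapolated scale.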