[Followup to Mahajan et al. 2018] Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de-facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches.
This paper revisits weakly-supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of Instagram images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning.
We find our weakly-supervised models to be very competitive across all settings, and that they substantially outperform their self-supervised counterparts. We also include an investigation into whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems.
Our models, Supervised Weakly through hashtAGs (SWAG), are publicly available.
…We note that this means that in a single training epoch, each unique tail image appears multiple times. As a result, the number of unique images in an epoch differs from the total number of samples processed in that epoch. We label our dataset by the number of unique images it contains: our IG-3.6B dataset has ~3.6 billion unique images, but a single training epoch over that dataset processes ~5 billion samples due to our re-sampling procedure. This differs from other datasets we compare with (e.g., JFT-300M), in which the number of unique images equals the total number of samples processed in an epoch…Although our system-level evaluations hamper exact comparisons, our results suggest that the weakly supervised IG-3.6B dataset provides the same amount of supervisory signal as the supervised JFT-300M dataset.
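The tail-image repetition described above can be illustrated with a minimal resampling sketch. The exact re-sampling procedure is not specified in this excerpt; the sketch below assumes inverse-power reweighting of hashtag frequency (with `power=0.5`, i.e. square-root resampling), and all names (`epoch_sample`, the toy data) are hypothetical:

```python
import random
from collections import Counter

def epoch_sample(image_tags, num_samples, power=0.5):
    """Draw one epoch of samples with replacement, upweighting rare tags.

    image_tags: list of (image_id, tag) pairs. Each pair is weighted
    proportionally to freq(tag) ** (power - 1), so with power=0.5 an
    image with a rare tag is sampled far more often than head images —
    hence a unique tail image can appear multiple times per epoch, and
    samples processed per epoch exceeds the number of unique images.
    """
    tag_freq = Counter(tag for _, tag in image_tags)
    weights = [tag_freq[tag] ** (power - 1) for _, tag in image_tags]
    return random.choices(image_tags, weights=weights, k=num_samples)
```

For example, with 100 images of a head tag and 1 image of a tail tag, the single tail image is expected to appear dozens of times in a 1,000-sample epoch.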
…We trained on machines connected to each other via Ethernet; within each machine, 8 GPUs were connected via NVLink. Our largest model was trained for 2 epochs of the IG-3.6B dataset (10 billion samples seen during training) using 128 Nvidia V100 32GB GPUs across 16 machines.
…We perform transfer-learning experiments on ImageNet-1k that compare our weakly-supervised learner with SimCLRv2,13 SEER,27 and BEiT.3 The comparison with SEER is of particular interest: because it is trained on a similar collection of Instagram images, we can readily compare both learning paradigms on the same data distribution…Our results show that weakly-supervised learning substantially outperforms current self-supervised learners, particularly in low-shot transfer settings. This result is likely due to the fact that our weakly-supervised learners receive much more learning signal per sample. Moreover, our results show that weakly-supervised learners benefit from their zero-shot initialization abilities in low-shot transfer settings. We note that our observations may change if self-supervised learners are scaled further.
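To make the low-shot transfer setting concrete, the sketch below classifies frozen pre-trained features with a nearest-class-mean rule, a simple stand-in for the probes used in such evaluations. This is illustrative only: the paper's exact transfer protocol is not given in this excerpt, and `nearest_class_mean` and the synthetic data are our own hypothetical names:

```python
import numpy as np

def nearest_class_mean(train_feats, train_labels, test_feats):
    """Low-shot classification on frozen features: assign each test
    feature to the class whose (L2-normalised) mean feature is most
    cosine-similar. With only a few labelled examples per class, the
    quality of the pre-trained features dominates accuracy."""
    classes = np.unique(train_labels)
    means = np.stack([train_feats[train_labels == c].mean(axis=0)
                      for c in classes])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    # Cosine similarity to each class mean; pick the best class per row.
    return classes[(test @ means.T).argmax(axis=1)]
```

With well-separated features (as a strong pre-trained backbone would provide), even five labelled examples per class suffice for this rule, which is the intuition behind the low-shot results above.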
Figure 3: Scaling model and dataset sizes. ImageNet top-1 linear classifier accuracy for various model sizes as a function of the number of pre-training samples (left) and the training budget (right). Larger models are more sample-efficient for a given number of pre-training samples, and additional training samples further improve performance. Training time is calculated by dividing the total number of samples by the training speeds from Table 4.
…Comparing our models with CLIP,57 we observe that the CLIP ViT L/14 model slightly outperforms our model in zero-shot transfer to the IN-1k dataset, whereas the smaller RN50×64 CLIP model underperforms ours. On some datasets, the ALIGN37 model performs slightly better still. However, the results are not fully consistent: our models do obtain the best performance on the ImageNet-v2 dataset.60 Because these experiments perform system-level comparisons, it is difficult to articulate what drives these differences in performance. Nonetheless, our results provide further evidence that weakly-supervised approaches like ours, CLIP, and ALIGN provide a promising path towards the development of open-world visual-recognition models.33
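One way a hashtag-supervised model can perform zero-shot transfer is to map each target class name to related hashtags and aggregate the hashtag classifier's probabilities. The sketch below shows this idea under stated assumptions: the mapping, the function name `zero_shot_scores`, and the toy probabilities are all hypothetical, not the paper's actual procedure:

```python
def zero_shot_scores(hashtag_probs, class_to_hashtags, hashtag_index):
    """Aggregate hashtag-classifier probabilities into target-class scores.

    hashtag_probs: per-hashtag probabilities from the pre-trained model.
    class_to_hashtags: hypothetical mapping from each target class name
    to its associated hashtags (the key ingredient for zero-shot use).
    hashtag_index: hashtag -> position in hashtag_probs.
    Returns the predicted class and the per-class score dictionary.
    """
    scores = {}
    for cls, tags in class_to_hashtags.items():
        # Sum probability mass over all hashtags mapped to this class.
        scores[cls] = sum(hashtag_probs[hashtag_index[t]]
                          for t in tags if t in hashtag_index)
    return max(scores, key=scores.get), scores
```

For instance, mapping the class "cat" to both "#cat" and "#kitten" pools their probability mass, so no target-dataset labels are needed at prediction time.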