“SEER: Self-Supervised Pretraining of Visual Features in the Wild”, Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski (2021-03-02):

Figure 1: Performance of large pretrained models on ImageNet. We pretrain our SEER models on uncurated and random images. They are RegNet architectures [40] trained with the SwAV self-supervised method [7]. We compare with the original models trained in Caron et al [7], as well as with models pretrained on curated data (SimCLRv2 and ViT); the network architectures differ. We report top-1 accuracy after finetuning on ImageNet.

Recently, self-supervised learning methods like MoCo, SimCLR, BYOL, and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, that is, the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset.
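[For reference, the sketch below shows the InfoNCE (NT-Xent) objective underlying SimCLR-style contrastive methods: two augmented views of each image are pulled together while all other images in the batch are pushed apart. This is an illustrative reimplementation, not the authors' code, and the function names are ours.]

```python
# Minimal sketch of the NT-Xent / InfoNCE loss behind SimCLR-style methods.
# Illustrative only; not SEER's training code.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temp=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same images."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, D), unit-norm
    sim = z @ z.t() / temp                        # cosine similarities as logits
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    B = z1.shape[0]
    # the positive for sample i is its other view: i+B for the first half,
    # i-B for the second half
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(z1, z2))
```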

In this work, we explore if self-supervision lives up to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting. Interestingly, we also observe that self-supervised models are good few-shot learners, achieving 77.9% top-1 with access to only 10% of ImageNet. Code: this URL. [cf. “DetCon: Efficient Visual Pretraining with Contrastive Detection”, Hénaff et al 2021; “DINO: Emerging Properties in Self-Supervised Vision Transformers”, Caron et al 2021]
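[SEER itself is trained with SwAV's swapped-prediction objective rather than plain InfoNCE: cluster assignments ("codes") computed from one view supervise the prediction made from the other view. Below is a minimal sketch of that loss, assuming two augmented views per image; the Sinkhorn-Knopp routine and all names are paraphrases of the published method, not the released code.]

```python
# Minimal sketch of SwAV's swapped-prediction loss (illustrative only).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Sinkhorn-Knopp: turn prototype scores into soft cluster assignments
    ("codes") with roughly equal prototype usage across the batch."""
    Q = torch.exp(scores / eps).t()               # (K prototypes, B samples)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # normalize rows
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # normalize columns
    return (Q * B).t()                            # (B, K), each row sums to 1

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Swapped prediction: the codes of one view supervise the other view."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    s1, s2 = z1 @ prototypes.t(), z2 @ prototypes.t()   # prototype scores
    q1, q2 = sinkhorn(s1), sinkhorn(s2)                 # targets (no grad)
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    # predict view 1's code from view 2, and vice versa
    return -0.5 * ((q1 * p2).sum(dim=1) + (q2 * p1).sum(dim=1)).mean()

# usage: embeddings from a trunk + projection head, 3,000 prototypes (as in SwAV)
B, D, K = 64, 128, 3000
prototypes = F.normalize(torch.randn(K, D), dim=1)
z1, z2 = torch.randn(B, D), torch.randn(B, D)
print(swav_loss(z1, z2, prototypes))
```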

Figure 6: (left) Impact of the number of updates. We compare the quality of a RegNetY-128GF after different numbers of updates of online pretraining on 1B images. For both studies, we report the relative improvement in top-1 accuracy for a linear evaluation of frozen features on ImageNet. (right) Impact of the number of unique images. We compare the impact of the size of the training set for a RegNetY-8GF and a RegNetY-16GF pretrained for the same number of updates. The number of updates corresponds to 1 epoch for 1B images, 32 epochs for 32M images, and 1K epochs for 1M images.
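[A quick back-of-the-envelope check of the epoch counts in the right panel, assuming a fixed update budget equivalent to one pass over the 1B-image set, with batch size cancelling out:]

```python
# Epochs seen under a fixed update budget: budget / number of unique images.
budget = 1_000_000_000  # one epoch's worth of samples over the 1B-image set
for n_unique in (1_000_000_000, 32_000_000, 1_000_000):
    print(f"{n_unique:>13,} unique images -> {budget / n_unique:,.1f} epochs")
# 1,000,000,000 unique images -> 1.0 epochs
#    32,000,000 unique images -> 31.2 epochs  (the caption rounds to 32)
#     1,000,000 unique images -> 1,000.0 epochs
```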