“Exploring the Limits of Weakly Supervised Pretraining”, Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten, 2018-05-02:

State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate.

In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions [3.5b] of social media [Instagram] images.

Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

Figure 2: Classification accuracies on ImageNet-{1k, 5k, 9k} and CUB2011 target tasks as a function of the number of Instagram images used for pretraining for 3 network architectures (colors) and 2 hashtag vocabularies (dashed / solid lines). Only the linear classifier is trained on the target task. Higher is better.

…In line with prior results [16, 17], we observe near log-linear behavior: each time we multiply the amount of training data by a factor of x, we observe a fixed increase y in classification accuracy. While the scaling behavior is consistent across hashtag vocabulary sizes and models, the accuracy increase y is larger for higher-capacity networks: across all figures, the lines corresponding to ResNeXt-101 32×16d networks (purple) are steeper than those corresponding to the 32×8d and 32×4d models. This result suggests that when training convolutional networks on billions of training images, current network architectures are prone to underfitting. We also observe log-linear scaling break down in two regimes: (1) because accuracy is bounded, endless log-linear scaling is impossible, and on datasets like IN-1k and CUB2011 the ceiling effect necessarily produces sub-log-linear scaling; (2) we observe a deviation from log-linear scaling in the 1B to 3.5B image regime even without apparent ceiling effects on IN-{5k, 9k}.
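One concrete reading of “multiply data by x, gain a fixed y”: accuracy is approximately linear in log(dataset size), so the slope of a least-squares fit in log-space gives the accuracy gain per order of magnitude. A minimal sketch with made-up numbers (the data points below are illustrative, not the paper's measurements):

```python
import math

# Hypothetical (pretraining images, top-1 accuracy %) pairs chosen to
# illustrate the log-linear trend; these are NOT the paper's numbers.
points = [(1e6, 70.0), (1e7, 73.0), (1e8, 76.0), (1e9, 79.0)]

# Fit accuracy = intercept + slope * log10(n) by ordinary least squares.
xs = [math.log10(n) for n, _ in points]
ys = [acc for _, acc in points]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Each 10x increase in pretraining data adds `slope` points of accuracy.
print(f"gain per 10x data: {slope:.2f} points")
```

Under this model, the "ceiling effect" in regime (1) shows up as the fit over-predicting once `intercept + slope * log10(n)` approaches 100%, which is why sub-log-linear scaling is inevitable on nearly-saturated benchmarks.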

These plots also illustrate an interesting effect of the hashtag vocabulary on the transfer task accuracy. On IN-1k, networks pretrained on the target-task-aligned 1.5k hashtags outperform those trained using a larger hashtag vocabulary, because the 1.5k hashtags were selected to match the ImageNet synsets. However, as the matching between hashtag vocabulary and target classes disappears and the visual variety in the transfer task increases, networks pretrained to recognize a larger number of hashtags increasingly outperform networks pretrained on fewer hashtags: on the IN-9k transfer task, the difference in accuracy between networks trained on 1.5k and those trained on 17k hashtags is ~7%.

3.1.3 What is the effect of hashtag label noise on model accuracy?

…we investigate the effect of injecting additional label noise on the accuracy of our networks. To do so, we pretrain ResNeXt-101 32×16d networks on a version of IG-1B-17k in which we randomly replaced p% of the hashtags with hashtags sampled from the marginal distribution over hashtags (excluding the tag being replaced). …The results suggest that the networks are remarkably resilient to label noise: a noise level of p = 10% leads to a loss of less than 1% in classification accuracy, and at p = 25% label noise, the reduction in accuracy is around 2%. These results suggest that label noise may be a limited issue when networks are trained on billions of images.
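The noise-injection procedure described above (replace p% of labels with draws from the marginal distribution, excluding the original tag) can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the use of rejection sampling are assumptions:

```python
import random
from collections import Counter

def inject_label_noise(labels, p, seed=0):
    """Replace a fraction p of labels with draws from the empirical
    marginal label distribution, excluding the label being replaced.
    (Hypothetical helper illustrating the paper's noise-injection setup.)"""
    counts = Counter(labels)
    tags = list(counts)
    weights = [counts[t] for t in tags]
    if len(tags) < 2:
        return list(labels)  # nothing to swap to
    rng = random.Random(seed)
    noisy = []
    for tag in labels:
        if rng.random() < p:
            # Rejection-sample from the marginal until we get a different tag.
            new = tag
            while new == tag:
                new = rng.choices(tags, weights=weights, k=1)[0]
            noisy.append(new)
        else:
            noisy.append(tag)
    return noisy
```

Sampling from the marginal (rather than uniformly over the vocabulary) means frequent hashtags also dominate the injected noise, which matches how real mislabeling on a heavy-tailed tag distribution would look.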