“Learning Visual Features from Large Weakly Supervised Data”, Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache (2015-11-06)⁠:

Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos & captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.

AlexNet takes up to 2 weeks to train on a setup with 4 GPUs, while training a GoogLeNet takes up to 3 weeks [ie. just <84 GPU-days].

Figure 2: Left-hand side: Precision@10 of weakly supervised AlexNets trained on Flickr datasets of different sizes, measured on a held-out test set, using K = 1,000 (in red) and a single crop. For reference, we also show the precision@10 of logistic regression trained on features from convolutional networks trained on ImageNet with and without jittering (in blue and black, respectively). Right-hand side: Mean average precision on the Pascal VOC 2007 dataset obtained by logistic regressors trained on features extracted from AlexNet trained on Flickr (in red) and ImageNet with and without jittering (in blue and black). Higher values are better.

…To investigate the performance of our models as a function of the amount of training data, we also performed experiments in which we varied the Flickr training set size. The left-hand side of Figure 2 presents the resulting learning curves for the AlexNet architecture with K = 1,000. The figure shows that there is a clear benefit of training on larger datasets: the word prediction performance of the networks increases substantially when the training set is increased beyond 1 million images (which is roughly the size of ImageNet); for our networks, it only levels out after ~50 million images.
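[The precision@10 metric reported in Figure 2 is just the fraction of an image’s top-10 predicted caption words that appear among its ground-truth words, averaged over images. A minimal sketch of how such a metric might be computed (function and array names are illustrative, not from the paper):]

```python
import numpy as np

def precision_at_k(scores, targets, k=10):
    """Precision@k: the fraction of the k highest-scoring predicted labels
    that are ground-truth labels, averaged over all examples.

    scores:  (n_examples, n_labels) float array of prediction scores
    targets: (n_examples, n_labels) binary array; 1 = ground-truth label
    """
    # Indices of the k highest-scoring labels per example.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # 1 wherever a top-k predicted label is actually a ground-truth label.
    hits = np.take_along_axis(targets, topk, axis=1)
    return hits.mean()

# Toy example: top-2 predictions are labels 0 and 2, but only label 0 is correct.
scores = np.array([[0.9, 0.1, 0.8, 0.2]])
targets = np.array([[1, 0, 0, 1]])
print(precision_at_k(scores, targets, k=2))  # 0.5
```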

[Joulin et al 2015 didn’t scale up model capacity accordingly, probably because they were underfitting, and so they don’t get the log-linear curve we know goes out to billions of images, but merely an asymptote at n ~ 0.05b.]