“The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition”, Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, Li Fei-Fei (2015-11-20):

Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model using this data.

Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability.

We demonstrate its efficacy on 4 fine-grained datasets, greatly exceeding existing state-of-the-art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories.

Quantitatively, we achieve top-1 accuracies of 92.3% on CUB-200-2011, 85.4% on Birdsnap [Berg et al 2014], 93.4% on FGVC-Aircraft, and 80.8% on Stanford Dogs without using their annotated training sets… In total, for all four datasets, we obtained 9.8 million images for 26,458 categories, requiring 151.8GB of disk space.
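A quick back-of-the-envelope check makes the scale of this web-scraped corpus concrete. The totals (9.8 million images, 26,458 categories, 151.8GB) are from the paper; the per-category and per-image averages below are just derived arithmetic:

```python
# Derived averages from the dataset statistics quoted above.
images = 9_800_000
categories = 26_458
disk_gb = 151.8

images_per_category = images / categories
kb_per_image = disk_gb * 1024**2 / images  # GB -> KB

print(f"~{images_per_category:.0f} images per category")   # ~370
print(f"~{kb_per_image:.1f} KB per image on average")      # ~16 KB (stored resized/compressed)
```

So each of the ~26k fine-grained categories averages only a few hundred web images, and the images are stored at roughly 16KB apiece, consistent with downscaled training-resolution JPEGs rather than full-resolution photos.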

We compare our approach to an active learning approach for expanding fine-grained datasets…Surprisingly, performance is very similar, with only a 0.4% advantage for the cleaner, annotated active learning data, highlighting the effectiveness of noisy web data despite the lack of manual annotation. If we furthermore augment the filtered web images with the Stanford Dogs training set, which the active learning method notably used both as training data and as its seed set of images, performance improves to be slightly better than the manually-annotated active learning data (a 0.5% improvement).

Table 2: Comparison with prior work on CUB-200-2011. We only include methods which use no annotations at test time. Here “GT” refers to using Ground Truth category labels in the training set of CUB, “BBox” indicates using bounding boxes, and “Parts” additionally uses part annotations.

How Much Data is Really Necessary? In order to better understand the utility of noisy web data for fine-grained recognition, we perform a control experiment on the web data for CUB. Using the filtered web images as a base, we train models using progressively larger subsets of the results as training data, taking the top-ranked images across categories for each experiment. Performance versus the amount of training data is shown in Figure 11. Surprisingly, relatively few web images are required to do as well as training on the CUB training set, and adding more noisy web images always helps, even at the limit of available search results. Based on this analysis, we estimate that one noisy web image for CUB categories is “worth” 0.507 ground truth training images.
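The control experiment above can be sketched in a few lines: keep the top-ranked fraction of search results in each category, train, and repeat for larger fractions. This is an illustrative sketch only; the data structure and the commented-out `train_and_eval` function are hypothetical stand-ins, not the authors' code.

```python
# Sketch of the data-scaling control experiment: train on progressively
# larger subsets of the filtered web images, always taking the
# top-ranked search results per category.

def top_ranked_subset(ranked_by_category, fraction):
    """Keep the top `fraction` of search-ranked images in each category.

    `ranked_by_category` maps category name -> list of images,
    ordered best-ranked first.
    """
    subset = {}
    for category, ranked_images in ranked_by_category.items():
        k = max(1, int(len(ranked_images) * fraction))
        subset[category] = ranked_images[:k]
    return subset

# Sweep the subset size in multiples of the CUB training-set size
# (5,994 images), as in Figure 11 (hypothetical driver loop):
# for fraction in (0.1, 0.25, 0.5, 1.0):
#     data = top_ranked_subset(web_images, fraction)
#     accuracy = train_and_eval(data)  # hypothetical
```

The reported ~0.507 "exchange rate" then falls out of comparing such curves: the number of ground-truth CUB images needed to reach a given accuracy, divided by the number of noisy web images needed to reach the same accuracy.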

Figure 11: Number of web images used for training vs. performance on CUB-200-2011. We vary the amount of web training data in multiples of the CUB training set size (5,994 images). Also shown is performance when training on the ground truth CUB training set (CUB-GT).