Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets that started out as data capture efforts aimed at representing the visual world have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world).
With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets.
We present a comparison study using a set of popular datasets [evaluated by training SVM classifiers and HOG-based detectors], based on a number of criteria, including: relative data bias, cross-dataset generalization, effects of the closed-world assumption, and sample value.
The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols.
But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected issue.
…The lesson from this toy experiment is that, despite the best efforts of their creators, the datasets appear to have a strong built-in bias. Of course, much of the bias can be accounted for by the divergent goals of the different datasets: some captured more urban scenes, others more rural landscapes; some collected professional photographs, others amateur snapshots from the Internet; some focused on entire scenes, others on single objects, etc. Yet even if we try to control for these capture biases by isolating specific objects of interest, we find that the biases are still present in some form. As a demonstration, we applied the same analysis that we did for full images to object crops of cars from 5 datasets where car bounding boxes have been provided (PASCAL, ImageNet, SUN09, LabelMe, Caltech101). Interestingly, the classifier was still quite good at telling the different datasets apart, achieving 61% accuracy (at 20% chance). Visually examining the most discriminable cars (Figure 4), we observe some subtle but important differences: Caltech101 has a strong preference for side views, while ImageNet is into racing cars; PASCAL has cars at non-canonical viewpoints; SUN09 and LabelMe cars appear similar, except that LabelMe cars are often occluded by small objects; etc. Clearly, whatever we, as a community, are trying to do to get rid of dataset bias is not quite working.
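The "name the dataset" experiment described above can be sketched in a few lines. The following is a minimal, self-contained illustration of the protocol, not the paper's actual pipeline: real image features (e.g. HOG or GIST descriptors) are replaced by synthetic stand-in features with a small per-dataset shift, precisely to mimic the kind of capture bias the experiment detects.

```python
# Sketch of the "name the dataset" experiment: train a multiclass classifier
# to predict which dataset a sample came from. Features here are synthetic
# stand-ins (real experiments would use HOG/GIST features of images or crops);
# each dataset's features get a small mean shift to simulate capture bias.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
datasets = ["PASCAL", "ImageNet", "SUN09", "LabelMe", "Caltech101"]
n_per_dataset, dim = 200, 64

# Stand-in features: a slightly shifted mean per dataset mimics built-in bias.
X = np.vstack([rng.normal(loc=0.2 * i, size=(n_per_dataset, dim))
               for i in range(len(datasets))])
y = np.repeat(np.arange(len(datasets)), n_per_dataset)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearSVC(C=1.0, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"dataset-identification accuracy: {acc:.2f} "
      f"(chance = {1 / len(datasets):.2f})")
```

If the datasets were unbiased samples of the same visual world, accuracy should hover near chance; accuracy well above chance is the signature of dataset-specific bias.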
…In general, there is a dramatic drop in performance across all tasks and classes when testing on a different dataset. For instance, for the “car” classification task, the average performance obtained when training and testing on the same dataset is 53.4%, which drops to 27.5% when testing on the other datasets. This drop is severe enough that, for instance, a method ranking first in the PASCAL competition could become one of the worst. Figure 5 shows a typical example of car classification gone bad: a classifier trained on MSRC “cars” has been applied to 6 datasets, but it can only find cars in one of them, MSRC itself.
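The cross-dataset evaluation protocol behind these numbers is a simple train/test grid: train one classifier per dataset, test it on every dataset, and compare the diagonal (same-dataset) entries with the off-diagonal (cross-dataset) ones. Below is a hedged toy sketch of that grid; the features, the additive per-dataset bias, and all constants are fabricated for illustration, not taken from the paper's experiments.

```python
# Sketch of the cross-dataset generalization grid: train a binary
# "car vs. background" classifier on each toy dataset, test on all of them,
# and compare same-dataset accuracy to cross-dataset accuracy.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
names = ["PASCAL", "ImageNet", "SUN09", "LabelMe", "Caltech101", "MSRC"]
dim, n = 32, 300

def make_split(bias, seed):
    """Toy data: cars and background separated along one direction, plus a
    dataset-specific additive bias that does not generalize across datasets."""
    r = np.random.default_rng(seed)
    cars = r.normal(size=(n, dim)) + 1.0
    background = r.normal(size=(n, dim)) - 1.0
    X = np.vstack([cars, background]) + bias  # dataset-specific shift
    y = np.array([1] * n + [0] * n)
    return X, y

biases = [rng.normal(scale=4.0, size=dim) for _ in names]
train = [make_split(b, 10 + i) for i, b in enumerate(biases)]
test = [make_split(b, 20 + i) for i, b in enumerate(biases)]

perf = np.zeros((len(names), len(names)))
for i, (Xtr, ytr) in enumerate(train):
    clf = LinearSVC(random_state=0).fit(Xtr, ytr)
    for j, (Xte, yte) in enumerate(test):
        perf[i, j] = clf.score(Xte, yte)

self_perf = perf.trace() / len(names)
cross_perf = (perf.sum() - perf.trace()) / (len(names) ** 2 - len(names))
print(f"mean self-test accuracy:  {self_perf:.2f}")
print(f"mean cross-test accuracy: {cross_perf:.2f}")
```

Even in this toy setting, the diagonal of the grid beats the off-diagonal: each classifier partially latches onto its dataset's bias rather than onto "car-ness" alone, which is the mechanism behind the 53.4% to 27.5% drop reported above.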
…For instance, 1 LabelMe car sample is worth 0.26 PASCAL car samples on the PASCAL benchmark. This means that if we want a modest increase (say, 10% AP) in performance over a car detector trained with the 1,250 PASCAL samples available in PASCAL VOC 2007, we will need 1/0.26 × 1,250 × 10 ≈ 50,000 LabelMe samples!…Table 3 shows the “market value” of training samples from different datasets. One observation is that the sample values are always smaller than 1: each training sample gets devalued if it is used on a different dataset. There is no theoretical reason why this should be the case; it is only due to the strong biases present in actual datasets. So, what is the value of current datasets when used to train algorithms that will be deployed in the real world? The answer that emerges can be summarized as: “better than nothing, but not by much”.
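The back-of-the-envelope calculation above can be written out explicitly. The 10× factor is the text's assumption that a modest (~10% AP) improvement requires roughly an order of magnitude more training data; the exact result is about 48,000, which the text rounds to 50,000.

```python
# "Market value" arithmetic from the text, made explicit.
value_labelme_on_pascal = 0.26  # PASCAL samples per LabelMe sample (Table 3)
pascal_samples = 1250           # car samples in PASCAL VOC 2007
scale_for_modest_gain = 10      # assumed ~10x data for a modest (~10% AP) gain

needed = pascal_samples * scale_for_modest_gain / value_labelme_on_pascal
print(f"LabelMe samples needed: {needed:,.0f}")  # ~48,077, rounded to 50,000
```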
Table 3: “Market Value” for a “car” sample across datasets.
[Is the data-scaling glass half-full or half-empty? It’s definitely closer to ‘half-empty’ when you’re training SVMs in 2011, anyway…]