“When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, Rebecca Roelofs (2022-05-09):

Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community, yet innovations continue to contribute gains to performance, with today’s largest models achieving 90%+ top-1 accuracy.

To help contextualize progress on ImageNet and provide a more meaningful evaluation for today’s state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make in order to provide insight into the long-tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today’s best models achieve upwards of 97% top-1 accuracy.
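The multi-label evaluation described above can be sketched in a few lines: instead of comparing a model's top-1 prediction against a single ground-truth label, it is counted correct if it matches *any* of the valid labels annotated for that image (function and example data are illustrative, not from the paper):

```python
def multilabel_top1_accuracy(predictions, label_sets):
    """Top-1 accuracy under multi-label evaluation: a prediction is
    correct if it appears anywhere in that example's set of valid labels."""
    correct = sum(pred in labels for pred, labels in zip(predictions, label_sets))
    return correct / len(predictions)

# Toy example: the second prediction misses because "dough" is not
# among that image's valid labels.
preds = ["bagel", "dough", "beagle"]
labels = [{"bagel", "pretzel"}, {"bagel"}, {"beagle", "dog"}]
acc = multilabel_top1_accuracy(preds, labels)  # 2 of 3 correct
```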

Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels, demonstrating that, without careful review, we are substantially underestimating the performance of these models. On the other hand, we also find that today’s best models still make a substantial number of mistakes (40%) that are obviously wrong to human reviewers.

To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example “major error” slice of the obvious mistakes made by today’s top models—a slice where models should achieve near perfection, but today are far from doing so.

…We used a standard ViT model scaled to 3B parameters (ViT-3B), pre-trained on JFT-3B and fine-tuned on ImageNet-1K, achieving a top-1 accuracy of 89.5%. Details on this model can be found in Appendix A. We also later review mistakes made by the Greedy Soups model.

…In Appendix D we provide many additional examples of each mistake severity, examples where the model is actually correct (a label was missing from the multi-label annotations), and problematic examples that should be removed from the validation set (e.g. because the original label was incorrect).

…After reviewing the original 676 mistakes, we found that for 298 of them the model's prediction was correct or the image was unclear, or the original ground-truth label was itself incorrect or problematic. Our evaluation of the ViT-3B model on this re-labeled dataset is shown in Table 1: the model makes a total of 378 mistakes. In other words, ~44% of this model's initial mistakes turned out not to be genuine mistakes at all!
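The re-scoring step amounts to partitioning the apparent mistakes by review outcome and keeping only the genuine ones. A minimal sketch (function name and outcome labels are hypothetical; the counts below are the paper's 676/298/378 figures):

```python
def rescore_mistakes(review_outcomes):
    """Given one review outcome per apparent mistake, return the number of
    genuine mistakes remaining and the fraction that were vindicated.
    An apparent mistake is vindicated if the prediction was actually correct
    (a missing multi-label), the image was unclear, or the original
    ground-truth label was itself wrong or problematic."""
    vindicating = {"correct", "unclear", "bad_ground_truth"}
    vindicated = sum(1 for o in review_outcomes if o in vindicating)
    remaining = len(review_outcomes) - vindicated
    return remaining, vindicated / len(review_outcomes)

# With the paper's numbers: 298 of 676 apparent mistakes vindicated.
outcomes = ["correct"] * 298 + ["genuine_mistake"] * 378
remaining, frac = rescore_mistakes(outcomes)  # 378 remaining, ~44% vindicated
```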

…How do models that were not used to select this dataset perform? We evaluate the suite of 70 models from Shankar et al 2020 on this dataset, in addition to 4 recent top models not directly used to help filter the ImageNet-M set: a ViT-G/14 model (90.5% top-1), a BASIC model fine-tuned on ImageNet (90.7% top-1), an ALIGN model fine-tuned on ImageNet (88.1% top-1), and a CoCa model fine-tuned on ImageNet (91.0% top-1). The plot shown here shows that most models, from AlexNet through the ResNets, get between 10–25 of the 68 examples correct, but recent high-accuracy models such as ViT-G/14, BASIC-FT, and CoCa-FT are starting to solve more of these ‘major’ mistakes: CoCa-FT gets 42 of the 68 examples correct. Reviewing the mistakes made by these 4 models yielded a total of 5 novel predictions: 4 were verified to be wrong (and major), and 1 was a new valid prediction, for which we updated the label set accordingly…Overall, we found that:

  1. when a large, high-accuracy model makes a novel prediction not made by other models, it ends up being a correct new multi-label almost half of the time;

  2. higher accuracy models do not demonstrate an obvious pattern in our categories and severities of mistakes they solve;

  3. SOTA models today are largely matching or beating the performance of the best expert human on the human-evaluated multi-label subset;

  4. noisy training data and under-specified classes may be a factor limiting the effective measurement of improvements in image classification.
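Finding 1 hinges on identifying "novel" predictions: examples where a new model's top-1 prediction differs from every other model's prediction on that example, so it cannot have been vetted during the original review. A sketch of that filter (function name and data layout are assumptions, not from the paper):

```python
def novel_predictions(candidate_preds, other_models_preds):
    """Return the subset of a candidate model's top-1 predictions that no
    other model makes on the same example. Per the paper, such novel
    predictions turned out to be valid new multi-labels almost half the
    time, so each one warrants manual review before being scored wrong."""
    novel = {}
    for idx, pred in candidate_preds.items():
        if all(preds.get(idx) != pred for preds in other_models_preds):
            novel[idx] = pred
    return novel

# Example 1's prediction "b" is novel: no other model predicts it there.
candidate = {0: "a", 1: "b"}
others = [{0: "a", 1: "c"}, {0: "a", 1: "d"}]
found = novel_predictions(candidate, others)  # {1: "b"}
```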