“Partial Success in Closing the Gap between Human and Machine Vision”, 2021-06-14:
A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines “in the wild” and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision?
To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the “missing human baseline” by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B images).
Our findings are threefold. (1) The long-standing robustness gap between humans and CNNs is closing: the best models now match or exceed human performance on most OOD datasets. (2) There is still a substantial image-level consistency gap: humans make different errors than models do. In contrast, most models systematically agree in their categorization errors, even substantially different ones such as contrastive self-supervised vs. standard supervised models. (3) In many cases, human-to-model consistency improves when the training dataset size is increased by one to three orders of magnitude.
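Image-level consistency of the kind discussed in (2) can be quantified as error consistency: Cohen's kappa computed on trial-by-trial correct/incorrect decisions, which measures how much two observers' errors overlap beyond what their accuracies alone would predict. A minimal sketch of that metric (not the benchmark's own implementation; the function name and interface here are illustrative):

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on binary correct/incorrect decisions of two observers.

    correct_a, correct_b: equal-length sequences of truthy/falsy values,
    one entry per trial (True = observer classified that image correctly).
    Returns 1.0 for perfectly overlapping errors, 0.0 for chance-level
    overlap given the two accuracies, negative values for systematic
    disagreement.
    """
    assert len(correct_a) == len(correct_b), "need one decision per trial for both observers"
    a = [bool(x) for x in correct_a]
    b = [bool(x) for x in correct_b]
    n = len(a)

    # Observed agreement: fraction of trials where both are right or both wrong.
    c_obs = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance from the two accuracies alone.
    p_a = sum(a) / n
    p_b = sum(b) / n
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Note: undefined (division by zero) if both observers are at 0% or 100%
    # accuracy, since then c_exp == 1.
    return (c_obs - c_exp) / (1 - c_exp)
```

For example, two observers that err on exactly the same trials get kappa = 1.0 even at moderate accuracy, while matching accuracies with non-overlapping errors yield a negative kappa.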
Our results give reason for cautious optimism: while there is still much room for improvement, the behavioral difference between human and machine vision is narrowing. To measure future progress, we provide 17 OOD datasets with image-level human behavioral data as a benchmark: https://github.com/bethgelab/model-vs-human/.