DeepAI AI Chat
Log In Sign Up

Partial success in closing the gap between human and machine vision

by   Robert Geirhos, et al.

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the "missing human baseline" by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here:


page 4

page 8

page 17

page 18

page 22

page 23

page 24


On the surprising similarities between supervised and self-supervised models

How do humans learn to acquire a powerful, flexible and robust represent...

Progress and limitations of deep networks to recognize objects in unusual poses

Deep networks should be robust to rare events if they are to be successf...

Trivial or impossible – dichotomous data difficulty masks model differences (on ImageNet and beyond)

"The power of a generalization system follows directly from its biases" ...

An Empirical Study of Training Self-Supervised Vision Transformers

This paper does not describe a novel method. Instead, it studies a strai...

Masked autoencoders are effective solution to transformer data-hungry

Vision Transformers (ViTs) outperforms convolutional neural networks (CN...

Learning Online Visual Invariances for Novel Objects via Supervised and Self-Supervised Training

Humans can identify objects following various spatial transformations su...