Partial success in closing the gap between human and machine vision

06/14/2021
by Robert Geirhos, et al.

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the "missing human baseline" by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/
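The "image-level consistency gap" in finding (2.) is typically quantified with error consistency: a Cohen's-kappa-style statistic over trial-by-trial correctness that asks whether two decision makers are right and wrong on the same images more often than their accuracies alone would predict. Below is a minimal NumPy sketch of that statistic, assuming aligned boolean correctness vectors for the same trials; it is not code from the model-vs-human repository, and the example arrays and the degenerate-case handling are illustrative assumptions.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Error consistency (kappa) between two decision makers.

    correct_a, correct_b: boolean arrays, True where the respective
    observer classified that trial correctly (same trials, same order).
    Returns kappa in [-1, 1]: 0 means agreement at the level expected
    from the two accuracies alone, 1 means identical error patterns.
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    assert correct_a.shape == correct_b.shape

    # Observed consistency: fraction of trials where both are right
    # or both are wrong.
    c_obs = np.mean(correct_a == correct_b)

    # Expected consistency under independent errors, given only the
    # two accuracies p_a and p_b.
    p_a, p_b = correct_a.mean(), correct_b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Degenerate case (both observers at 0% or 100%): kappa is
    # undefined; returning 0 here is an illustrative choice.
    if np.isclose(c_exp, 1.0):
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)

# Hypothetical usage: compare a model's trial-by-trial errors against
# a human observer's on the same OOD images (toy data).
model_correct = np.array([1, 1, 0, 0, 1, 0, 1, 1], dtype=bool)
human_correct = np.array([1, 0, 0, 1, 1, 0, 1, 1], dtype=bool)
print(error_consistency(model_correct, human_correct))
```

With this statistic, two systems can match each other's accuracy on every OOD dataset yet still score near zero, which is exactly the gap the abstract describes between humans and current models.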


Related research

10/16/2020 · On the surprising similarities between supervised and self-supervised models
How do humans learn to acquire a powerful, flexible and robust represent...

07/16/2022 · Progress and limitations of deep networks to recognize objects in unusual poses
Deep networks should be robust to rare events if they are to be successf...

10/12/2021 · Trivial or impossible – dichotomous data difficulty masks model differences (on ImageNet and beyond)
"The power of a generalization system follows directly from its biases" ...

04/05/2021 · An Empirical Study of Training Self-Supervised Vision Transformers
This paper does not describe a novel method. Instead, it studies a strai...

12/12/2022 · Masked autoencoders are effective solution to transformer data-hungry
Vision Transformers (ViTs) outperform convolutional neural networks (CN...

10/04/2021 · Learning Online Visual Invariances for Novel Objects via Supervised and Self-Supervised Training
Humans can identify objects following various spatial transformations su...

09/23/2021 · How much "human-like" visual experience do current self-supervised learning algorithms need to achieve human-level object recognition?
This paper addresses a fundamental question: how good are our current se...
