From Recognition to Cognition: Visual Commonsense Reasoning

11/27/2018
by Rowan Zellers, et al.

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. In this paper, we formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true. We introduce a new dataset, VCR, consisting of 290k multiple-choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple-choice questions with minimal bias. To move towards cognition-level image understanding, we present a new reasoning engine, called Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art models struggle (~45%). Our R2C helps narrow this gap (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

research
11/26/2019

PIQA: Reasoning about Physical Commonsense in Natural Language

To apply eyeshadow without a brush, should I use a cotton swab or a toot...
research
10/15/2020

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Natural language rationales could provide intuitive, higher-level explan...
research
12/14/2022

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

From a visual scene containing multiple people, human is able to disting...
research
08/16/2018

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Given a partial description like "she opened the hood of the car," human...
research
10/16/2022

COFAR: Commonsense and Factual Reasoning in Image Search

One characteristic that makes humans superior to modern artificially int...
research
05/19/2019

HellaSwag: Can a Machine Really Finish Your Sentence?

Recent work by Zellers et al. (2018) introduced a new task of commonsens...
research
05/16/2018

Modeling Naive Psychology of Characters in Simple Commonsense Stories

Understanding a narrative requires reading between the lines and reasoni...
