Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

by   Jae Sung Park, et al.

Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away. We propose VisualComet, the novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs that consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 60,000 images, each paired with short video summaries of before and after. In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integration between visual and textual commonsense reasoning is the key and wins over non-integrative alternatives.


page 6

page 14

page 16

page 19

page 20

page 24

page 25

page 26


Inferring Commonsense Explanations as Prompts for Future Event Generation

Future Event Generation aims to generate fluent and reasonable future ev...

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Answering complex questions about images is an ambitious goal for machin...

MERLOT: Multimodal Neural Script Knowledge Models

As humans, we understand events in the visual world contextually, perfor...

Event2Mind: Commonsense Inference on Events, Intents, and Reactions

We investigate a new commonsense inference task: given an event describe...

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Natural language rationales could provide intuitive, higher-level explan...

Modeling Naive Psychology of Characters in Simple Commonsense Stories

Understanding a narrative requires reading between the lines and reasoni...

From Recognition to Cognition: Visual Commonsense Reasoning

Visual understanding goes well beyond object recognition. With one glanc...

Code Repositories


VisualCOMET: Reasoning about the Dynamic Context of a Still Image

view repo