Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

10/15/2020
by Ana Marasović, et al.

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just explicit content at the pixel level, but also contextual content at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
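The abstract describes conditioning a pretrained language model on visual inputs so that generated rationales reflect the image. A common way to realize this kind of visual adaptation is to project visual features into the language model's embedding space and prepend them as a prefix to the text sequence. The sketch below illustrates that idea in plain NumPy; the dimensions, the projection matrix `W_proj`, and the function name `build_prefix_input` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical dimensions (assumptions for illustration, not from the paper).
VISUAL_DIM = 2048   # e.g., pooled object-detector region features
EMBED_DIM = 768     # e.g., hidden size of a GPT-2-scale language model

rng = np.random.default_rng(0)
# Learned linear projection mapping visual features into the LM embedding space.
W_proj = rng.standard_normal((VISUAL_DIM, EMBED_DIM)) * 0.02

def build_prefix_input(visual_features, token_embeddings):
    """Prepend projected visual features to the token embedding sequence,
    so the language model can condition its generated rationale on them."""
    visual_prefix = visual_features @ W_proj          # (n_regions, EMBED_DIM)
    return np.concatenate([visual_prefix, token_embeddings], axis=0)

# Example: 3 detected object regions plus a 5-token question/answer prompt.
regions = rng.standard_normal((3, VISUAL_DIM))
prompt = rng.standard_normal((5, EMBED_DIM))
sequence = build_prefix_input(regions, prompt)
print(sequence.shape)  # (8, 768)
```

In this setup the decoder attends over both the visual prefix and the textual prompt when generating each rationale token, which is one standard pattern for grounding free-text generation in image content.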


Related Research

12/16/2021

Commonsense Knowledge-Augmented Pretrained Language Models for Causal Reasoning Classification

Commonsense knowledge can be leveraged for identifying causal relations ...
01/27/2021

Knowledge-driven Natural Language Understanding of English Text and its Applications

Understanding the meaning of a text is a fundamental challenge of natura...
11/27/2018

From Recognition to Cognition: Visual Commonsense Reasoning

Visual understanding goes well beyond object recognition. With one glanc...
06/01/2021

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

We propose PIGLeT: a model that learns physical commonsense knowledge th...
07/23/2022

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Visual Entailment with natural language explanations aims to infer the r...
04/17/2022

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...
04/22/2020

Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

Even from a single frame of a still image, people can reason about the d...

Code Repositories

visual-reasoning-rationalization

Code associated with the "Natural Language Rationales with Full-Stack Visual Reasoning" EMNLP Findings 2020 paper
