Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

by   Ana Marasović, et al.

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.


page 2

page 3

page 4

page 7

page 19

page 20


Commonsense Knowledge-Augmented Pretrained Language Models for Causal Reasoning Classification

Commonsense knowledge can be leveraged for identifying causal relations ...

Knowledge-driven Natural Language Understanding of English Text and its Applications

Understanding the meaning of a text is a fundamental challenge of natura...

From Recognition to Cognition: Visual Commonsense Reasoning

Visual understanding goes well beyond object recognition. With one glanc...

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

We propose PIGLeT: a model that learns physical commonsense knowledge th...

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Visual Entailment with natural language explanations aims to infer the r...

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...

Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

Even from a single frame of a still image, people can reason about the d...

Code Repositories


Code associated with the "Natural Language Rationales with Full-Stack Visual Reasoning" EMNLP Findings 2020 paper

view repo