Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

07/23/2022
by   Qian Yang, et al.

Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and to generate a sentence explaining the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform relation inference and on a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image and ignore the high-level semantic alignment between phrases (chunks) and visual content, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to the vision-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language, together with the corresponding image regions, to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG uses lexical constraints to explicitly incorporate the words or chunks the relation inferrer focuses on into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the results show that CALeC significantly outperforms competing models in both inference accuracy and the quality of the generated explanations.
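To illustrate the general idea behind lexically constrained generation (not the paper's actual LeCG implementation), the sketch below shows a minimal greedy decoder in which tokens the relation inferrer attended to receive a logit bonus until they appear in the output, nudging the explanation to mention the decision-critical words. The `toy_lm` scoring function and the `bonus` value are purely illustrative assumptions.

```python
def constrained_greedy_decode(step_logits, constraints, bonus=5.0,
                              eos="</s>", max_len=20):
    """Greedy decoding with soft lexical constraints.

    step_logits(prefix) -> dict mapping each next token to a logit.
    Each unmet constraint token gets a logit bonus, encouraging the
    decoder to include it in the generated explanation.
    """
    output = []
    unmet = set(constraints)
    while len(output) < max_len:
        logits = dict(step_logits(output))
        for tok in unmet:
            if tok in logits:
                logits[tok] += bonus  # boost constraint tokens not yet emitted
        tok = max(logits, key=logits.get)
        if tok == eos:
            break
        output.append(tok)
        unmet.discard(tok)
    return output


def toy_lm(prefix):
    """Tiny hand-crafted 'language model' for demonstration only."""
    scores = {"a": 3.0, "man": 2.0, "sleeping": 1.0, "surfing": 0.5, "</s>": -1.0}
    for t in prefix:
        scores[t] = -10.0  # this toy model never repeats a token
    if len(prefix) >= 3:
        scores["</s>"] = 10.0  # stop after three tokens
    return scores


# Without the constraint the toy model prefers "a man sleeping";
# constraining on "surfing" forces that token into the output.
baseline = constrained_greedy_decode(toy_lm, set())
constrained = constrained_greedy_decode(toy_lm, {"surfing"})
```

In practice, hard-constrained decoding is usually implemented with constrained beam search rather than a simple logit bonus; the bonus variant shown here is the simplest way to see how decision-critical tokens can be steered into the generated text.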


