Visual Entailment: A Novel Task for Fine-Grained Image Understanding

01/20/2019
by   Ning Xie, et al.
0

Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71 accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/ necla-ml/SNLI-VE.

READ FULL TEXT
research
11/26/2018

Visual Entailment Task for Visually-Grounded Language Learning

We introduce a new inference task - Visual Entailment (VE) - which diffe...
research
04/12/2020

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...
research
03/29/2022

Fine-Grained Visual Entailment

Visual entailment is a recently proposed multimodal reasoning task where...
research
06/23/2019

Investigating Biases in Textual Entailment Datasets

The ability to understand logical relationships between sentences is an ...
research
05/16/2016

Joint Learning of Sentence Embeddings for Relevance and Entailment

We consider the problem of Recognizing Textual Entailment within an Info...
research
04/07/2020

e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations

The recently proposed SNLI-VE corpus for recognising visual-textual enta...
research
04/08/2021

How Transferable are Reasoning Patterns in VQA?

Since its inception, Visual Question Answering (VQA) is notoriously know...

Please sign up or login with your details

Forgot password? Click here to reset