Visual Entailment Task for Visually-Grounded Language Learning

by Ning Xie, et al.
Wright State University

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than a natural language sentence. We propose a novel dataset for VE, SNLI-VE, built from the Stanford Natural Language Inference corpus and Flickr30K. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA based models perform.



1 Introduction

Multimodal inference, reasoning, and fact entailment across image data and text have the potential to solve problems in which the veracity of a text statement must be judged against visual facts. Representative applications include fake news detection and court cross-examination. The former aims to detect contradictions between the news text and visual facts, such as an image or video clip, in order to reduce the influence of misleading news. The latter aims to check testimony for contradictions with visual evidence in support of a fair judgment.

Recent progress in visual reasoning using datasets such as the Visual Question Answering (VQA) dataset (Antol et al., 2015) and CLEVR (Johnson et al., 2017a) has been encouraging. However, the high accuracy achieved on these datasets is often a result of dataset bias. For the VQA dataset, there is a question-conditioned bias (Goyal et al., 2017): questions may hint at the answers, so that the correct answer can be inferred without even considering the visual information. The subsequent version of the VQA dataset (Goyal et al., 2017) reduces the bias by pairing questions with similar images that lead to different answers. Even so, the sentence structures in the VQA dataset remain simple, and the yes/no questions are insufficient for training entailment tasks that include the neutral case. CLEVR, on the other hand, is designed for fine-grained reasoning, but its synthetic nature introduces uniformity in image and text structures, resulting in very high accuracy models (Hudson and Manning, 2018) that may not generalize well to real-world settings. Hence, we need a more challenging inference task that requires learning grounded representations from cross-modal (image, text) pairs, where the same image is used for multiple natural language sentences, each of which may correspond to a different answer. Motivated by this, we propose a new Visual Entailment (VE) task in this paper.

Prior to VE, the Textual Entailment (TE) task has been extensively studied in the natural language processing (NLP) community as part of natural language inference (NLI). In the TE task, given a text premise P and a text hypothesis H, the goal is to determine if P implies H. A TE model outputs one of three class labels, entailment, neutral, or contradiction, based on the relation conveyed by the text pair. Entailment holds if there is enough evidence in P to conclude that H is true. Contradiction is concluded whenever H contradicts P. Otherwise, the relation is neutral, indicating that the evidence in P is insufficient to draw a conclusion about H. We extend TE to the visual domain by replacing each text premise with a corresponding real-world image. Figure 1 illustrates a VE example where, given an image premise, the three text hypotheses lead to three different class labels.

Figure 1: A VE example in which pairing one image with different hypotheses leads to different labels.

In contrast to existing yes/no VQA problems, our VE task is more challenging because it requires the model to recognize the neutral case, in which the available information is insufficient. To the best of our knowledge, there is no well-annotated dataset for VE. We therefore build a new dataset, SNLI-VE, by replacing the premises in the Stanford Natural Language Inference corpus (SNLI) (Bowman et al., 2015), a TE dataset, with the corresponding images in Flickr30K (Young et al., 2014), an image captioning dataset. This adaptation is possible because the premises in SNLI are Flickr30K image captions, which are automatically entailed by the corresponding images. By transitivity of entailment, the hypotheses entailed by the text premises are also entailed by the original caption images. The neutral and contradiction relations, however, may change, because the images may include other entities that unexpectedly alter neutral and contradiction conclusions. Recent work (Vu et al., 2018) that combines images and captions as premises suggests that the effect of such changes is tolerable.
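The premise replacement described above can be sketched as follows. This is a minimal illustration, not the authors' tooling. It assumes SNLI records in JSON Lines form with `captionID`, `sentence2`, and `gold_label` fields, and it assumes that the part of `captionID` before the `#` names the Flickr30K image file. The function name `snli_to_ve` and the sample record are hypothetical.

```python
import json

LABELS = {"entailment", "neutral", "contradiction"}


def snli_to_ve(snli_line):
    """Turn one SNLI JSON record into an (image, hypothesis, label) example.

    Assumption: `captionID` looks like "<image file>#<caption index>".
    Returns None for records without a consensus gold label.
    """
    record = json.loads(snli_line)
    if record["gold_label"] not in LABELS:
        return None  # pairs without annotator consensus carry the label "-"
    image_file = record["captionID"].split("#")[0]
    return {
        "image": image_file,  # the image premise replaces the caption text
        "hypothesis": record["sentence2"],
        "label": record["gold_label"],
    }


# Hypothetical record, for illustration only.
sample = json.dumps({
    "captionID": "0000000000.jpg#0",
    "sentence1": "Two people sit on a bench.",
    "sentence2": "People are outdoors.",
    "gold_label": "neutral",
})
print(snli_to_ve(sample))
```

The text premise (`sentence1`) is dropped on purpose: in VE the model sees only the image and the hypothesis.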

Related work.

The most relevant task to VE is VQA (Antol et al., 2015; Goyal et al., 2017; Zhu et al., 2016; Ren et al., 2015; Johnson et al., 2017b; Hudson and Manning, 2018; Anderson et al., 2018; Fukui et al., 2016), which is a representative multimodal task in machine learning that involves both images and text. State-of-the-art VQA models commonly apply the attention mechanism (Kim et al., 2018; Anderson et al., 2018; Hudson and Manning, 2018) to relate image regions with specific text features. Our model tackles the VE task by further employing self-attention (Vaswani et al., 2017) to find the inner relationships in both the image and text feature spaces, as well as text-image attention to ground relevant image regions.

2 The EVE Architecture

Figure 2: EVE architecture. EVE determines if a hypothesis (text input) is entailed by an image premise (image input). The bottom half shows two methods of image feature extraction, either from the CNN feature maps or from object detection ROIs.

We develop a new Explainable Visual Entailment architecture (EVE), shown in Figure 2. EVE uses the Attention Top-Down/Bottom-Up model (Anderson et al., 2018) as a starting point. The architecture consists of two branches. The text branch applies self-attention (Vaswani et al., 2017) to the word embeddings of a given text hypothesis, then passes the weighted word embedding sequence through gated recurrent units to extract the text features. Depending on the image feature extraction in the image branch, there are two EVE variants: EVE-Image and EVE-ROI. The image features captured by EVE-Image come from the feature maps of a pre-trained convolutional neural network (CNN); the feature vector at each pixel position across the feature maps represents an image region. In contrast, EVE-ROI uses region of interest (ROI) proposals from Mask-RCNN (He et al., 2017) to locate prominent objects in images. The image regions from either EVE-Image or EVE-ROI are also self-attended and further weighted by the text-image attention. The text and image features are finally fused for prediction.

We apply self-attention to capture the hidden relations between elements in the text and the image feature spaces respectively. The intuition behind self-attention is that, for a long and complex hypothesis, it becomes increasingly necessary for the model to attend to only the most relevant words. The effect of self-attention on the image is similar: image regions that jointly benefit the current prediction receive more attention. The text-image attention, in turn, allows the model to select relevant image regions conditioned on the given text hypothesis.
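The two attention types can be illustrated with a small dot-product attention sketch in plain Python. Dot-product scoring is an assumption here, chosen because it is a common formulation; this section does not spell out EVE's exact scoring function. The names `attend` and `self_attend` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, items):
    """Weight `items` by their dot-product similarity to `query`.

    Returns the attention weights and the weighted sum of the items.
    """
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return weights, pooled


def self_attend(items):
    """Self-attention: every item queries all items in the same feature space."""
    return [attend(item, items)[1] for item in items]


# Toy 2-d features (hypothetical): three image regions and one text vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [2.0, 0.0]

contextual_regions = self_attend(regions)                   # image self-attention
weights, image_summary = attend(text, contextual_regions)   # text-image attention
print([round(w, 3) for w in weights])
```

Self-attention mixes information within one modality, while the second call conditions the region weights on the text, which mirrors the order of operations described for the image branch.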

Lastly, we adopt the split-transform-merge technique from the VQA 2017 winner (Teney et al., 2017). The fused features are split and transformed through the Text MLP and the Image MLP respectively, in the expectation that the added representational power leads to better conclusions. Unlike in VQA tasks, however, there are no pre-trained MLP weights corresponding to the answers, because the VE class labels are not associated with particular images or text. Those MLP weights are therefore initialized randomly.

3 Evaluation on SNLI-VE

Model Name            Val Acc (%)   Val Acc Per Class (%)   Test Acc (%)   Test Acc Per Class (%)
Hypothesis Only       67.04         65.45  63.36  72.31     67.01          65.85  63.78  71.40
Image Captioning      68.14         67.3   63.12  73.99     67.47          66.75  63.56  72.07
Relational Network    67.81         68.01  63.94  71.49     68.39          69.13  65.58  70.45
Attention Top-Down    70.59         72.94  66.88  71.96     70.3           72.94  66.63  71.34
Attention Bottom-Up   69.79         71.56  64.25  73.57     69.34          70.56  64.49  72.96
EVE-Image*            71.40         70.48  66.88  76.83     71.36          70.61  67.17  76.31
EVE-ROI*              71.11         66.41  68.2   78.69     70.21          65.63  68.83  76.16
Table 1: Model performance on the SNLI-VE dataset

We evaluate the performance of EVE against several other baselines over SNLI-VE including the existing state-of-the-art VQA based models. Details about the dataset and our experiments are discussed in the supplemental materials. The performance results, as listed in Table 1, involve comparisons between the following models:

Hypothesis Only: This model uses the hypotheses alone, without the image premises. Without premises, the model would be expected to perform at chance level, yet the resulting accuracy reaches 67%, consistent with results reported by others (Gururangan et al., 2018; Vu et al., 2018). Any model that uses the image premise must therefore exceed this 67% lower bound for its result to be meaningful.

Image Captioning: Many captioning models predate VE (Karpathy and Fei-Fei, 2015; Vinyals et al., 2017; Chen et al., 2017). They can serve as a useful baseline by generating an image caption as the premise and then applying existing TE models for classification. For this baseline, we use a PyTorch implementation (Choi) which extracts the image features with a pre-trained ResNet152 backbone and generates the captions using an LSTM. The generated text premise is encoded alongside the input text hypothesis, and both text features are concatenated for classification. The model achieves a marginally higher accuracy of 68.14% and 67.47% on the validation and test sets respectively, implying that the generated caption premise does not help much. After reviewing the generated captions, we find it possible that they are of too low quality or miss the information the TE classifier needs. The captioning could be improved with more sophisticated models such as dense captioning (Johnson et al., 2016), but there is no guarantee that every image detail potentially described by the hypothesis would be covered. Even then, the TE classifier could still perform poorly because of the increased length of the caption premises.

Relational Network: The Relational Network (RN), proposed to tackle the CLEVR dataset, considers pairwise feature fusions between different image regions in the CNN feature maps and the question embedding (Santoro et al., 2017). Although RN provides high accuracy on CLEVR, it achieves only a marginal improvement on SNLI-VE, with accuracies of 67.81% and 68.39% on the validation and test splits.

Attention Top-Down: We also adopt the model from the winner (Anderson et al., 2018) of the VQA challenge 2017, which applies text-image attention to the image regions in the CNN feature maps based on the question embedding. The weighted image features are then projected and fused with the question embedding using a dot product for classification. This attention based VQA model achieves the best accuracy among the baselines, with 70.59% and 70.3% on the validation and test splits respectively, implying that attention can make effective use of the image premise features.

Attention Bottom-Up: The model design for Attention Bottom-Up is quite similar to Attention Top-Down, except that the image features used are the ROIs extracted by a Mask-RCNN (He et al., 2017) implementation (Matterport). The best performance achieved is 69.79% and 69.34% accuracy on the validation and test splits respectively. Although we also evaluate the model with more than 10 ROIs, we observe no significant improvement.

EVE-Image and EVE-ROI: We finally evaluate our model, EVE, as described in Section 2. EVE-Image achieves the best performance, with 71.4% and 71.36% accuracy on the validation and test splits. EVE-ROI achieves a slightly lower accuracy of 71.11% and 70.21%, which is still better than its counterpart, Attention Bottom-Up. The improvement, although marginal, is likely attributable to the introduction of self-attention, which captures the hidden relations within the same feature space. We do not find evidence that the split-transform-merge construct contributes much to VE tasks.
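To make the size of these margins concrete, the short script below computes the accuracy gains in percentage points from the overall accuracies in Table 1. The values are copied from the table; the helper name `margin` is only illustrative.

```python
# Overall accuracies (validation, test) copied from Table 1.
ACCURACY = {
    "Hypothesis Only": (67.04, 67.01),
    "Image Captioning": (68.14, 67.47),
    "Relational Network": (67.81, 68.39),
    "Attention Top-Down": (70.59, 70.30),
    "Attention Bottom-Up": (69.79, 69.34),
    "EVE-Image": (71.40, 71.36),
    "EVE-ROI": (71.11, 70.21),
}


def margin(model, reference):
    """Accuracy gain of `model` over `reference` in percentage points."""
    return tuple(round(m - r, 2) for m, r in zip(ACCURACY[model], ACCURACY[reference]))


for model, reference in [
    ("EVE-Image", "Attention Top-Down"),
    ("EVE-ROI", "Attention Bottom-Up"),
    ("EVE-Image", "Hypothesis Only"),
]:
    val_gain, test_gain = margin(model, reference)
    print(f"{model} vs {reference}: val +{val_gain}, test +{test_gain}")
```

Each EVE variant gains roughly one point over its attention counterpart, and the best model stays within about four and a half points of the hypothesis-only baseline.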

4 Conclusion

This work introduces visual entailment, a novel multimodal task to determine if a text hypothesis is entailed based on the visual information in the image premise. We build the SNLI-VE dataset providing real-world images from Flickr30K as premises, and the corresponding text hypotheses from SNLI. To address VE, we develop EVE and demonstrate its performance over several baselines, including the existing state-of-the-art VQA based models. The inherent language-bias induced by SNLI (Gururangan et al., 2018) serves as a strong baseline. The SNLI-VE dataset is scheduled to be publicly available.


Ning Xie and Derek Doran were supported by the Ohio Federal Research Network project Human-Centered Big Data. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the author(s) and do not necessarily reflect the views of the Ohio Federal Research Network.


Supplementary Materials

Dataset statistics.

The original SNLI dataset split does not take the original caption images into account. As a result, the same image may appear in both the training and test sets if the split is adapted directly to VE. To address this issue, we partition SNLI-VE into image-disjoint splits following (Gong et al., 2014) and make sure that the class instances are balanced across the training, validation, and test sets, as shown in Table 2.

                   Training   Validation   Testing
#Images            29,783     1,000        1,000
#Entailment        176,932    5,959        5,973
#Neutral           176,045    5,960        5,964
#Contradiction     176,550    5,939        5,964
Vocabulary Size    29,550     6,576        6,592
Table 2: SNLI-VE statistics: number of images, per-class examples, and vocabulary size by split.
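The idea of an image-disjoint partition can be sketched as below. This is only an illustration under stated assumptions: the paper follows the fixed splits of (Gong et al., 2014), whereas this sketch assigns images at random, and it does not enforce class balance. The function name `split_by_image` and the toy examples are hypothetical.

```python
import random
from collections import defaultdict


def split_by_image(examples, n_val, n_test, seed=0):
    """Partition examples so that no image appears in more than one split.

    `examples` is a list of dicts with an "image" key; `n_val` and `n_test`
    are numbers of images (not examples) held out for validation and testing.
    """
    by_image = defaultdict(list)
    for example in examples:
        by_image[example["image"]].append(example)

    images = sorted(by_image)
    random.Random(seed).shuffle(images)  # random stand-in for the fixed splits

    groups = {
        "val": images[:n_val],
        "test": images[n_val:n_val + n_test],
        "train": images[n_val + n_test:],
    }
    return {
        name: [ex for image in chosen for ex in by_image[image]]
        for name, chosen in groups.items()
    }


# Toy data (hypothetical): 5 images with 3 hypotheses each.
toy = [{"image": f"img{i % 5}.jpg", "label": "neutral"} for i in range(15)]
splits = split_by_image(toy, n_val=1, n_test=1)
print({name: len(items) for name, items in splits.items()})
```

Grouping by image before splitting is what prevents the leakage described above: all hypotheses attached to one image move together.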

Implementation details.

The proposed EVE model is implemented in PyTorch. We use the pre-trained GloVe.6B.300D vectors (Pennington et al., 2014) for word embedding, where 6B is the corpus size and 300D is the embedding dimension. The image features used for EVE-Image are generated from a pre-trained ResNet101. The ROI features used for EVE-ROI are extracted using the Mask-RCNN implementation (Matterport). The Adam optimizer is used for training with a batch size of 64. An adaptive learning rate is applied, with the initial learning rate and the weight decay both set to 0.0001.
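As a small illustration of the word embedding step, the sketch below reads vectors in the plain-text GloVe layout, where each line holds a token followed by its space-separated floats. It is not the authors' loading code. The helper names, the vocabulary filtering, and the toy 3-d lines are assumptions made for the example.

```python
def parse_glove_line(line):
    """Parse one line of a GloVe text file: a token followed by its floats."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]


def load_glove(lines, vocabulary, dim=300):
    """Collect vectors for the words in `vocabulary` from an iterable of lines.

    Words missing from the file are left out; how EVE handles such words is
    not stated in the paper.
    """
    table = {}
    for line in lines:
        word, vector = parse_glove_line(line)
        if word in vocabulary and len(vector) == dim:
            table[word] = vector
    return table


# Toy 3-d lines stand in for the real 300-d file (hypothetical values).
toy_lines = ["dog 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
print(load_glove(toy_lines, {"dog"}, dim=3))
```

With the actual vectors, the same function could be fed an open file handle in place of `toy_lines`, keeping the default `dim=300`.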