Visual Entailment: A Novel Task for Fine-Grained Image Understanding

01/20/2019 ∙ by Ning Xie, et al. ∙ Wright State University

Existing visual reasoning datasets, such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image, or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning, but the dataset is synthetic and consists of similar objects and sentence structures throughout. In this paper, we introduce a new inference task, Visual Entailment (VE), consisting of image-sentence pairs in which the premise is defined by an image rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset, SNLI-VE, based on the Stanford Natural Language Inference corpus and the Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA-based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at necla-ml/SNLI-VE.




1 Introduction

The pursuit of “visual intelligence” is a long-lasting theme of the machine learning community. While the performance of image classification and object detection has significantly improved in recent years [42, 63, 65, 26], progress in higher-level scene reasoning tasks such as scene understanding is relatively limited.


Recently, several datasets, such as VQA-v1.0 [2], VQA-v2.0 [23], CLEVR [32], Visual7w [81], Visual Genome [41], COCO-QA [57], and models [33, 60, 29, 31, 1, 67, 17, 37] have been used to measure progress in understanding the interaction between the vision and language modalities. However, the quality of the widely used VQA-v1.0 dataset [2] suffers from a natural bias [23]. Specifically, there is a long-tailed distribution of answers and also a question-conditioned bias, where questions may hint at the answers such that the correct answer may be inferred without even considering the visual information. For instance, for questions of the form “Do you see a …?”, the model may be biased towards the answer “Yes”, since it is correct 87% of the time during training. Besides, many questions in the VQA-v1.0 dataset are simple and straightforward and do not require compositional reasoning from the trained model. VQA-v2.0 [23] has been proposed to considerably reduce the dataset “bias” in VQA-v1.0 by associating each question with relatively balanced different answers. However, the questions are rather straightforward and require limited fine-grained reasoning.

The CLEVR dataset [32] is designed for fine-grained reasoning and consists of compositional questions such as “What size is the cylinder that is left of the brown metal thing that is left of the big sphere?”. Questions of this kind require learning fine-grained reasoning based on visual information. However, CLEVR is a synthetic dataset, and its visual information and sentence structures are very similar across the dataset. Hence, models that perform well on CLEVR may not generalize to real-world settings.

To address the above limitations, we propose a novel inference task, Visual Entailment (VE), which requires fine-grained reasoning in real-world settings. The design is derived from the Textual Entailment (TE) [12] task. In our VE task, a real-world image premise $P_{image}$ and a natural language hypothesis $H_{text}$ are given, and the goal is to determine if $H_{text}$ can be concluded given the information provided by $P_{image}$. Three labels, entailment, neutral, or contradiction, are assigned based on the relationship conveyed by the pair $(P_{image}, H_{text})$.

  • Entailment holds if there is enough evidence in $P_{image}$ to conclude that $H_{text}$ is true.

  • Contradiction holds if there is enough evidence in $P_{image}$ to conclude that $H_{text}$ is false.

  • Otherwise, the relationship is neutral, implying the evidence in $P_{image}$ is insufficient to draw a conclusion about $H_{text}$.

The main difference between the VE and TE tasks is that the premise in TE is a natural language sentence $P_{text}$ instead of an image premise $P_{image}$. Note that the existence of “neutral” makes the VE task more challenging than previous “yes-no” VQA tasks, since “neutral” requires the model to recognize the uncertainty between “entailment (yes)” and “contradiction (no)”. Figure 1 illustrates a VE example from the SNLI-VE dataset we propose below: given one image premise, three different text hypotheses lead to three different labels.

Figure 1: An Example from SNLI-VE dataset

We build the SNLI-VE dataset to illustrate the VE task, based on Stanford Natural Language Inference (SNLI) [4], which is a widely used text-entailment dataset, and Flickr30k [76], which is an image captioning dataset. The combination of SNLI and Flickr30k is straightforward since SNLI is created using Flickr30k. The detailed process of creating the SNLI-VE dataset is discussed in Section 3.2.

We develop an Explainable Visual Entailment (EVE) model to address the VE task. EVE captures the interaction within and between the image premise and the text hypothesis through attention. We evaluate EVE against several other state-of-the-art (SOTA) visual question answering (VQA) baselines and an image captioning based model on the SNLI-VE dataset. The interpretability of EVE is demonstrated using attention visualizations.

In summary, the contributions of our work are:

  • We propose a novel inference task, Visual Entailment, that requires a systematic cross-modal understanding between vision and a natural language.

  • We build a VE dataset, SNLI-VE, consisting of real-world image and natural language sentence pairs for VE tasks. The dataset is publicly available.

  • We design a VE model, EVE, to solve the VE task with interpretable attention visualizations.

  • We evaluate EVE against other SOTA VQA and image captioning based baselines.

2 Related Work

Our work is inspired by previous work on NLI, VQA, image captioning, and interpretable models.

Natural Language Inference.

We focus on textual entailment as our NLI task [18, 11, 3, 12, 46]. Annotated corpora for TE were limited in size until SNLI [4] was proposed, which is based on the Flickr30k [76] image captions. Since then, several neural-network based methods have been proposed over SNLI that either use sentence encoding models to individually encode the hypothesis and premise, or attention based models that encode the sentences together and align similar words in the hypothesis and premise [8, 50, 62, 59]. Our paper extends the TE task to the visual domain, allowing future work on our SNLI-VE task to build new models on recent progress in SNLI and VQA. Our work is different from the recent work [71] that combines both images and captions as premises.

Visual Question Answering.

Recent work on VQA includes datasets [32, 2, 23, 81, 41, 57, 47, 19, 66] and models [33, 60, 29, 31, 1, 67, 17, 37]. The goal of VQA is to answer natural language questions based on the provided visual information. The VQA-v2.0 [23] and CLEVR [32] datasets are designed to address the bias and reasoning limitations of VQA-v1.0, respectively. Recent work on compositional reasoning systems has achieved nearly 100% accuracy on CLEVR [29], but the SOTA performance on VQA-v2.0 is no more than 75% [15], implying that learning multi-modal feature interaction using natural images has room for improvement. A large number of models and approaches have been proposed to address the VQA task. These include simple linear models using ranking loss [16, 36], bi-linear pooling methods [45, 20, 55, 17, 37], attention-based methods [1, 52, 64], and reasoning based approaches [54, 27, 33, 38, 29] on the CLEVR and VQA-v1.0 datasets.

Image Captioning.

The problem of image captioning explores the generation of natural language sentences to best depict input image content. A common approach for these tasks is to use temporal models over convolutional features [36, 70, 7]. Recent work has also explored generating richer captions to describe images in a more fine-grained manner [34]. EVE differs from image captioning since it must classify an image-hypothesis pair into one of three classes, which requires discerning fine-grained information about the image conditioned on the hypothesis. However, existing image-captioning methods can serve as a baseline, where the output class label is based on a distance measure between the generated caption and the input hypothesis.

Visual Relationship Detection.

Relationship detection among image constituents uses separate branches in a ConvNet to model objects, humans, and their interactions  [5, 21]. A distinct approach in Santoro et al.  [60] treats each of the cells across channels in convolutional feature maps as an object and the relationships are modeled by a pairwise concatenation of the feature representations of individual cells.

Scene graph based relationship modeling, using a structured representation for describing object relationships and their attributes  [35, 43, 44, 74] has been extensively studied. Furthermore, pairing different objects in a scene [13, 28, 60, 78] is also common. However, a scene with many objects may have only a few individual interacting objects. Hence, it can be inefficient to model all relationships across all individual object pairs [80], making these methods computationally expensive for complex scene understanding tasks such as VE.

Our model, EVE, instead uses self-attention to efficiently learn the relationships between various scene elements and words, rather than the bi-gram or tri-gram based modeling used in previous work.


Interpretability.

As deep neural networks have become widespread in real-world applications, there has been an increasing focus on interpretability and transparency. Recent work addresses this requirement through saliency-map visualizations [61, 77, 49], attention mechanisms [75, 79, 51, 14], or other analysis [30, 39, 56, 58]. Our work demonstrates interpretability via attention visualizations.

3 Visual Entailment Task

Figure 2: More examples from SNLI-VE dataset

3.1 Formal Definition

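We introduce a dataset $D$ for the VE task structured as $\{(i_1, h_1, l_1), (i_2, h_2, l_2), \ldots, (i_n, h_n, l_n)\}$, where $(i_k, h_k, l_k)$ is an instance from $D$, with $i_k$, $h_k$, and $l_k$ denoting an image premise, a text hypothesis, and a class label, respectively. It is worth noting that each image $i_k$ is used multiple times with different labels given distinct hypotheses $h_k$.

Three labels, $e$, $n$, or $c$, are assigned based on the relationship conveyed by $(i_k, h_k)$. Specifically, i) $e$ (entailment) is assigned if $i_k \models h_k$, ii) $n$ (neutral) is assigned if $i_k \not\models h_k \wedge i_k \not\models \neg h_k$, iii) $c$ (contradiction) is assigned if $i_k \models \neg h_k$.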

3.2 Visual Entailment Dataset

3.2.1 Dataset criteria

Based on the vision community’s experience with SNLI, VQA-v1.0, VQA-v2.0, and CLEVR, there are four criteria in developing an effective dataset:

  1. Structured set of real-world images. The dataset should be based on real-world images and the same image can be paired with different hypotheses to form different labels.

  2. Fine-grained. The dataset should enforce fine-grained reasoning about subtle changes in hypotheses that could lead to distinct labels.

  3. Sanitization. No instance should overlap across different dataset partitions. Each image can exist in only a single partition.

  4. Account for any bias. Measure the dataset bias and provide baselines to serve as the performance lower bound for potential future evaluations.

3.2.2 SNLI-VE Construction

We now describe how we construct SNLI-VE, which is a dataset for VE tasks.

We build the dataset SNLI-VE based on two existing datasets, Flickr30k [76] and SNLI [4]. Flickr30k is a widely used image captioning dataset containing 31,783 images and 158,915 corresponding captions. The images in Flickr30k consist of everyday activities, events, and scenes [76], with 5 captions per image generated via crowdsourcing. SNLI is a large annotated TE dataset built upon Flickr30k captions. Each image caption in Flickr30k is used as a text premise in SNLI. The authors of SNLI collect multiple hypotheses in the three classes (entailment, neutral, and contradiction) for a given premise via Amazon Mechanical Turk [68], resulting in about 570K (text premise, text hypothesis) pairs $(P_{text}, H_{text})$. Data validation is conducted in SNLI to measure the label agreement. Specifically, each pair is assigned a gold label, indicating that the label is agreed upon by a majority of crowdsourcing workers (at least 3 out of 5). If such a consensus is not reached, the gold label is marked as “-”.

Since SNLI was constructed using Flickr30k captions, for each (text premise, text hypothesis) pair $(P_{text}, H_{text})$ in SNLI, it is feasible to find the corresponding Flickr30k image through the annotations in SNLI. This enables us to create a structured VE dataset based on both. Specifically, for each pair in SNLI with an agreed gold label, we replace the text premise with its corresponding Flickr30k image $P_{image}$, resulting in a $(P_{image}, H_{text})$ pair in SNLI-VE. Figures 1 and 2 illustrate examples from the SNLI-VE dataset. SNLI-VE naturally meets the aforementioned criteria 1 and 2: each image in SNLI-VE is a real-world one and is associated with distinct labels given different hypotheses. Furthermore, Flickr30k and SNLI are well-studied datasets, allowing the community to focus on the new task that our paper introduces rather than spending time becoming familiar with the idiosyncrasies of a new dataset.
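The premise replacement described above can be sketched as follows. This is a minimal illustration and not the released construction code. It assumes each SNLI record is a dictionary with the `captionID`, `sentence2`, and `gold_label` fields of the public SNLI JSONL release, and that the Flickr30k image file name is the part of `captionID` before the `#`. The helper name `to_snli_ve` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List

VALID_LABELS = {"entailment", "neutral", "contradiction"}


def to_snli_ve(snli_records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Swap each text premise for its source image; drop pairs without a gold label."""
    pairs = []
    for rec in snli_records:
        # A gold label of "-" means the annotators did not reach a consensus.
        if rec["gold_label"] not in VALID_LABELS:
            continue
        # Assumed captionID layout: "<image file>#<caption index>".
        image_file = rec["captionID"].split("#")[0]
        pairs.append(
            {
                "image": image_file,
                "hypothesis": rec["sentence2"],
                "label": rec["gold_label"],
            }
        )
    return pairs


# Hypothetical records, for illustration only.
sample = [
    {"captionID": "1000.jpg#0", "sentence2": "A person is outdoors.", "gold_label": "entailment"},
    {"captionID": "1000.jpg#0", "sentence2": "A person is asleep indoors.", "gold_label": "contradiction"},
    {"captionID": "1000.jpg#1", "sentence2": "A person waits for a friend.", "gold_label": "-"},
]
snli_ve_pairs = to_snli_ve(sample)
```

In this sketch the same image ends up paired with several hypotheses and labels, which is the property criterion 1 asks for.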

| | Training | Validation | Testing |
|---|---|---|---|
| #Image | 29,783 | 1,000 | 1,000 |
| #Entailment | 176,932 | 5,959 | 5,973 |
| #Neutral | 176,045 | 5,960 | 5,964 |
| #Contradiction | 176,550 | 5,939 | 5,964 |
| Vocabulary Size | 29,550 | 6,576 | 6,592 |

Table 1: SNLI-VE dataset

A sanity check is applied to the SNLI-VE dataset partitions in order to guarantee criterion 3. We notice that the original SNLI dataset partitions do not consider the arrangement of the original caption images. If SNLI-VE directly adopted the original partitions from SNLI, all images in the validation or testing partitions would also exist in the training partition, violating criterion 3. To amend this, we partition SNLI-VE disjointly by image, following the partition in [22], and make sure that instances with different labels are of similar numbers across the training, validation, and testing partitions, as shown in Table 1.
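A partition by image and the corresponding sanity check could look like the sketch below. It is an illustration under assumptions: the image-to-partition mapping (which the paper takes from [22]) is represented here by a hypothetical dictionary `image_to_split`, and the function names and toy pairs are invented for the example.

```python
from collections import Counter, defaultdict


def partition_by_image(pairs, image_to_split):
    """Place every (image, hypothesis, label) pair in the split that owns its image."""
    splits = defaultdict(list)
    for pair in pairs:
        splits[image_to_split[pair["image"]]].append(pair)
    return dict(splits)


def check_image_disjoint(splits):
    """Return True if no image appears in more than one split (criterion 3)."""
    owner = {}
    for name, split_pairs in splits.items():
        for pair in split_pairs:
            # setdefault records the first split seen for this image.
            if owner.setdefault(pair["image"], name) != name:
                return False
    return True


def label_counts(splits):
    """Per-split label histogram, to compare class balance across splits."""
    return {name: Counter(p["label"] for p in ps) for name, ps in splits.items()}


# Toy data, for illustration only.
pairs = [
    {"image": "a.jpg", "hypothesis": "h1", "label": "entailment"},
    {"image": "a.jpg", "hypothesis": "h2", "label": "neutral"},
    {"image": "b.jpg", "hypothesis": "h3", "label": "contradiction"},
]
image_to_split = {"a.jpg": "train", "b.jpg": "test"}
splits = partition_by_image(pairs, image_to_split)
```

Splitting on the image key rather than on individual pairs is what keeps all hypotheses of one image in the same partition.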

Regarding criterion 4, since SNLI has already been extensively studied, we are aware that there exists a hypothesis-conditioned bias in SNLI, as recently reported by Gururangan et al. [24]. Although the labels in SNLI-VE are distributed evenly across dataset partitions, SNLI-VE inevitably inherits this bias. Therefore, we provide a hypothesis-only baseline in Section 5.1 to serve as a performance lower bound.

3.3 SNLI-VE and VQA Datasets

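Figure 3: Question Length Distribution

| | SNLI-VE | VQA-v2.0 | CLEVR |
|---|---|---|---|
| Partition Size: Training | 529,527 | 443,757 | 699,989 |
| Partition Size: Validation | 17,858 | 214,354 | 149,991 |
| Partition Size: Testing | 17,901 | 555,187 | 149,988 |
| Question Length: Mean | 7.4 | 6.1 | 18.4 |
| Question Length: Median | 7.0 | 6.0 | 17.0 |
| Question Length: Mode | 6 | 5 | 14 |
| Question Length: Max | 56 | 23 | 43 |
| Vocabulary Size | 32,191 | 19,174 | 87 |

Table 2: Dataset Comparison Summary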

We further compare our SNLI-VE dataset with two widely used VQA datasets, VQA-v2.0 and CLEVR. The comparison focuses on the questions (for the SNLI-VE dataset, we consider a hypothesis as a question). Table 2 is a statistical summary of the questions from the three datasets. Before generating Table 2, questions are preprocessed in three steps: i) splitting into words, ii) lower-casing all words, iii) removing the punctuation symbols {‘’“”,.-?!}. Figure 3 depicts a detailed question length distribution.
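The three preprocessing steps and the summary statistics of Table 2 can be sketched as below. This is an illustration, not the script used to produce the table: splitting on whitespace, dropping tokens that consist only of punctuation, and the names `preprocess` and `length_summary` are assumptions.

```python
import statistics

# Punctuation symbols listed in the text.
PUNCTUATION = "‘’“”,.-?!"
_STRIP = str.maketrans("", "", PUNCTUATION)


def preprocess(question):
    """Split into words, lower-case them, and remove the listed punctuation."""
    words = (w.lower().translate(_STRIP) for w in question.split())
    return [w for w in words if w]  # drop tokens that were punctuation only


def length_summary(questions):
    """Question-length statistics and vocabulary size over a list of questions."""
    tokenized = [preprocess(q) for q in questions]
    lengths = [len(t) for t in tokenized]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        "max": max(lengths),
        "vocabulary_size": len({w for t in tokenized for w in t}),
    }


# Two hypotheses quoted in the paper plus one made-up question.
examples = [
    "A human playing guitar",
    "Two children are swimming in the ocean.",
    "Is the dog running?",
]
summary = length_summary(examples)
```

Because the hyphen is in the removal set, a hyphenated word collapses into a single token under this sketch, which slightly affects both length and vocabulary counts.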

According to Table 2, among the three datasets, our SNLI-VE dataset, which contains the smallest total number of questions (summing the training, validation, and testing partitions), has the largest vocabulary size. The maximum question length in SNLI-VE is 56, which is the largest among the three datasets and reflects real-world descriptions. Both the mean and median lengths are larger than those of the VQA-v2.0 dataset. The question length distribution of SNLI-VE, as shown in Figure 3, is quite heavy-tailed in contrast to the others. These observations indicate that the text in SNLI-VE may be more difficult to handle than that in VQA-v2.0 for certain models. As for the CLEVR dataset, even though most of its sentences are much longer than those in SNLI-VE, as shown in Figure 3, its vocabulary size is only 87. We believe this is due to the synthetic nature of CLEVR, which also indicates that models achieving high accuracy on CLEVR may not be able to generalize to our SNLI-VE dataset.

4 EVE: Explainable Visual Entailment System

Figure 4: Our model EVE combines image and ROI information to model fine-grained cross-modal information

The design of our explainable VE architecture, as shown in Figure 4, is based on the Attention Top-Down/Bottom-Up model, the winner of the VQA Challenge 2017, which is discussed in Section 5.4. Similar to the Attention Top-Down/Bottom-Up model, our EVE architecture is composed of a text branch and an image branch. The text branch extracts features from the input text hypothesis through an RNN. The image branch generates image features from the image premise $P_{image}$. The features produced by the two branches are then fused and projected through fully-connected (FC) layers towards predicting the final conclusion. The image features can be configured to take the feature maps from a pre-trained convolutional neural network (CNN) or ROI-pooled image regions from a region of interest (ROI) proposal network (RPN).

We build two model variants, EVE-Image and EVE-ROI, for image and ROI features, respectively. EVE-Image incorporates a pre-trained ResNet101 [26], which generates $k$ feature maps of size $d \times d$. For each feature map position, the feature vector across all the $k$ feature maps is considered as an object. As a result, there are a total of $d \times d$ objects of feature size $k$ for an input image. In contrast, the EVE-ROI variant takes ROIs as objects extracted from a pre-trained Mask R-CNN [48].

In order to accurately solve this cross-modal VE task, we need both a mechanism to identify the salient features in the image and text inputs and a cross-modal embedding to effectively learn the image-text interactions. EVE addresses these two needs with self-attention and text-image attention, respectively. We next describe the design and implementation of these mechanisms in the EVE model.

4.1 Self-Attention

EVE utilizes self-attention [69] in both the text and image branches, as highlighted with the dotted blue frame in Figure 4. Since the hypothesis in SNLI-VE can be relatively long and complex, self-attention helps focus on important keywords in a sentence that relate to each other. The text branch applies self-attention to the projected word embeddings from a multi-layer perceptron (MLP). It is worth noting that although the word embeddings, either from GloVe or other existing models, may be fixed, the MLP transformation can be trained to generate adaptive projected word embeddings. Similarly, the image branch applies self-attention to projected image regions, either from the aforementioned feature maps or ROIs, in expectation of capturing the hidden relations between elements in the same feature space.

Specifically, we use the scaled dot product (SDP) attention in [69] to capture this hidden information:

$$Att_{sdp} = \mathrm{softmax}\left(\frac{R Q^{T}}{\sqrt{d_k}}\right) \qquad (1)$$

$$Q_{att} = Att_{sdp}\, Q \qquad (2)$$

where $Q \in \mathbb{R}^{M \times d_k}$ is the query feature matrix and $R \in \mathbb{R}^{N \times d_k}$ is the reference feature matrix. $M$ and $N$ represent the number of feature vectors in the matrices $Q$ and $R$ respectively, and $d_k$ denotes the dimension of each feature vector. $Att_{sdp} \in \mathbb{R}^{N \times M}$ is the resulting attention mask for $Q$ given $R$. Each element $a_{ij}$ in $Att_{sdp}$ represents how much weight (before being scaled by $1/\sqrt{d_k}$ and normalized by softmax) the model should put on each query feature vector $q_j$ in $Q$ w.r.t. each reference feature vector $r_i$ in $R$. The attended query feature matrix $Q_{att} \in \mathbb{R}^{N \times d_k}$ is the weighted and fused version of the original query feature matrix $Q$, calculated by the matrix dot product between the attention mask $Att_{sdp}$ and the query feature matrix $Q$. Note that for self-attention, the query matrix $Q$ and the “reference” matrix $R$ are the same matrix.
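The computation described above, an attention mask built from scaled dot products between reference and query vectors followed by a matrix product with the query matrix, can be sketched in plain Python. This is an illustration rather than the EVE implementation: matrices are lists of rows, the softmax is assumed to normalize over the query index so that each reference row's weights sum to 1, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sdp_attention(Q, R):
    """Scaled dot product attention for a query matrix Q given a reference R.

    Q has M rows and R has N rows, all of dimension d_k.
    Returns (att, q_att): att is N x M, q_att is N x d_k.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    # att[i][j]: normalized weight of query vector j w.r.t. reference vector i.
    att = [
        softmax([scale * sum(r * q for r, q in zip(r_i, q_j)) for q_j in Q])
        for r_i in R
    ]
    # Each output row is a weighted mix of the query rows.
    q_att = [
        [sum(a * q_j[c] for a, q_j in zip(att_i, Q)) for c in range(d_k)]
        for att_i in att
    ]
    return att, q_att


# Self-attention: the query and reference matrices are the same.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_att, X_att = sdp_attention(X, X)
```

For text-image attention the same function would be called with the image features as `Q` and the text features as `R`, provided both are projected to the same dimension first.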

4.2 Text-Image Attention

Multi-modal tasks such as phrase grounding [6] demonstrate that high-quality cross-modal feature interactions improve overall performance. The area highlighted by the dotted red frame in Figure 4 shows that EVE applies text-image attention to relevant image regions based on the text embedding from the GRU. The feature interaction between the text and image regions is computed using the same SDP technique introduced in Section 4.1 and serves as the attention weights. The weighted features of image regions are then fused with the text features for further decision making. Specifically, for the text-image attention, the query matrix $Q$ is the image features while the “reference” matrix $R$ is the text features. Note that although $Q$ and $R$ are from different feature spaces, the dimension of each feature vector is projected to be the same in the respective branches for ease of the attention calculation.

5 Experiments

| Model Name | Val Acc Overall (%) | Val Acc Per Class (%) | Test Acc Overall (%) | Test Acc Per Class (%) |
|---|---|---|---|---|
| Hypothesis Only | 66.68 | 67.54 / 66.90 / 65.60 | 66.71 | 67.60 / 67.71 / 64.83 |
| Image Captioning | 67.83 | 66.61 / 69.23 / 67.65 | 67.67 | 66.25 / 70.69 / 66.08 |
| Relational Network | 67.56 | 67.86 / 67.80 / 67.02 | 67.55 | 67.29 / 68.86 / 66.50 |
| Attention Top-Down | 70.53 | 70.23 / 68.66 / 72.71 | 70.30 | 69.72 / 69.33 / 71.86 |
| Attention Bottom-Up | 69.34 | 71.26 / 70.10 / 66.67 | 68.90 | 70.52 / 70.96 / 65.23 |
| EVE-Image* | 71.56 | 71.04 / 70.55 / 73.10 | 71.16 | 71.56 / 70.52 / 71.39 |
| EVE-ROI* | 70.81 | 68.55 / 68.78 / 75.10 | 70.47 | 67.69 / 69.45 / 74.25 |

Table 3: Model Performance on SNLI-VE dataset

In this section, we evaluate EVE as well as several other baseline models on SNLI-VE. Most of the baselines are existing or previous SOTA VQA architectures. The performance results of all models are listed in Table 3.

All models are implemented in PyTorch. We use the pre-trained GloVe.6B.300D for word embedding, where 6B is the corpus size and 300D is the embedding dimension. Input hypotheses are padded to the maximum sentence length in a batch. Note that we do not truncate the sentences because, unlike VQA where the beginning of a question typically indicates what is asked about, labels in the VE task may depend on keywords or small details at the end of sentences. For example, truncating the hypothesis “The person who is standing next to the tree and wearing a blue shirt is playing _____” inevitably loses the key detail and changes the conclusion. In addition, the maximum sentence length in SNLI is 56, which is much larger than the 23 in VQA-v2.0, as shown in Table 2. Always padding to the dataset maximum is not necessarily efficient for training. As a consequence, we opt for padding to the batch-wise maximum sentence length.
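Batch-wise padding without truncation can be sketched as follows. This is a generic illustration, not the authors' data loader: the padding index `0`, the function name `pad_batch`, and the token IDs are assumptions.

```python
def pad_batch(token_id_batch, pad_id=0):
    """Right-pad every sequence to the longest one in this batch (no truncation).

    Returns the padded batch and the original lengths, which a recurrent
    encoder can use to ignore the padded positions.
    """
    max_len = max(len(seq) for seq in token_id_batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in token_id_batch]
    lengths = [len(seq) for seq in token_id_batch]
    return padded, lengths


# Hypothetical token IDs for two hypotheses of different lengths.
padded, lengths = pad_batch([[5, 6, 7], [8]])
```

With this scheme a batch of short hypotheses is padded only to its own longest member instead of to the dataset maximum of 56 tokens.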

Unless explicitly mentioned, all models are trained using a cross-entropy loss function optimized by the Adam optimizer with a batch size of 64. We use an adaptive learning rate scheduler which reduces the learning rate whenever there is no improvement on the validation dataset for a period of time. The initial learning rate and weight decay are set to the same value. The maximum number of training epochs is set to 100. We save a checkpoint whenever the model achieves a higher overall validation accuracy. The final model checkpoint selected for testing is the one whose lowest per-class accuracy is the highest, in case the model performance is biased towards particular classes. The batch size is set to 32 for validation and testing. In the following, we discuss the details of each baseline.
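The checkpoint rule above has two stages, which the sketch below makes explicit: keep a checkpoint each time the overall validation accuracy improves, then test the saved checkpoint with the best worst-class accuracy. The record layout, function names, and accuracy values are made up for illustration and are not results from the paper.

```python
def saved_checkpoints(history):
    """Keep the epochs at which overall validation accuracy reached a new best."""
    best, saved = float("-inf"), []
    for record in history:
        if record["overall"] > best:
            best = record["overall"]
            saved.append(record)
    return saved


def select_for_testing(saved):
    """Pick the saved checkpoint whose lowest per-class accuracy is highest."""
    return max(saved, key=lambda r: min(r["per_class"].values()))


# Made-up validation history, for illustration only.
history = [
    {"epoch": 1, "overall": 60.0, "per_class": {"e": 60.0, "n": 60.0, "c": 60.0}},
    {"epoch": 2, "overall": 58.0, "per_class": {"e": 70.0, "n": 55.0, "c": 49.0}},
    {"epoch": 3, "overall": 65.0, "per_class": {"e": 80.0, "n": 65.0, "c": 50.0}},
    {"epoch": 4, "overall": 66.0, "per_class": {"e": 68.0, "n": 66.0, "c": 64.0}},
    {"epoch": 5, "overall": 67.0, "per_class": {"e": 85.0, "n": 66.0, "c": 50.0}},
]
chosen = select_for_testing(saved_checkpoints(history))
```

In this toy history epoch 5 has the best overall accuracy but a weak third class, so the rule prefers epoch 4, whose three classes are more balanced.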

5.1 Hypothesis Only

This baseline verifies the existing data bias in the SNLI dataset, as mentioned by Gururangan et al. [24] and Vu et al. [71], by using hypotheses only, without the image premise information.

The model consists of a text processing component followed by two FC layers. The text processing component is used to extract the text feature from the given hypothesis. It first generates a sequence of word-embeddings for the given text hypothesis. The embedding sequence is then fed into a GRU  [10] to output the text features of dimension 300. The input and output dimensions of the two FC layers are [300, 300] and [300, 3] respectively.

Without any premise information, this baseline is supposed to make a random guess out of the three classes but the resulting accuracy is up to 67%, implying the existence of a dataset bias. We do not intend to rewrite the hypotheses in SNLI to reduce the bias but instead, aim at using the premise (image) features to outperform the hypothesis only baseline.
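As a reference point for "random guess", the label counts in Table 1 give the accuracy of always predicting the most frequent class. The short calculation below uses only the Table 1 numbers; the variable and function names are hypothetical.

```python
# Label counts copied from Table 1.
TABLE_1 = {
    "training": {"entailment": 176932, "neutral": 176045, "contradiction": 176550},
    "validation": {"entailment": 5959, "neutral": 5960, "contradiction": 5939},
    "testing": {"entailment": 5973, "neutral": 5964, "contradiction": 5964},
}


def majority_class_accuracy(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


chance = {split: majority_class_accuracy(c) for split, c in TABLE_1.items()}
totals = {split: sum(c.values()) for split, c in TABLE_1.items()}
```

The per-partition totals computed here match the SNLI-VE partition sizes in Table 2, and the majority-class accuracy is roughly one third on every partition, so a hypothesis-only accuracy of about 67% is about twice what label frequencies alone would give.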

5.2 Image Captioning

Since the original SNLI premises are image captions, a straightforward idea to address VE is to first apply an image caption generator to convert image premises to text premises and then apply a TE classifier. In particular, we adopt the PyTorch tutorial implementation as a caption generator. A pre-trained ResNet152 serves as the image encoder while the caption decoder is a long short-term memory (LSTM) network. Once the image caption is generated, the image premise is replaced with the caption and the original VE task is reduced to a TE task. Similar to the Hypothesis-Only baseline, the TE classifier is composed of two text processing components to extract text features from both the premise and the hypothesis. The text features are fused and go through two FC layers with input and output dimensions of [600, 300] and [300, 3] for the final prediction.

The resulting accuracy of 67.83% and 67.67% on the validation and testing partitions is only slightly higher than that of the Hypothesis-Only baseline, implying that the generated image caption premise does not help much. We suspect that the generated captions may not cover the information in the image that the hypothesis requires to reach the correct conclusion. This is possible in a complex scene where exhaustive enumeration of captions may be needed to cover every detail potentially described by the hypothesis.

5.3 Relational Network

The Relational Network (RN) baseline is based on [60], which was proposed to tackle the CLEVR dataset with high accuracy. The model has an image branch and a text branch. The image branch extracts image features in a similar manner to EVE, as described in Section 4, but without self-attention. The text branch generates the hypothesis embedding through an RNN. The highlight of RN is that it captures pairwise feature interactions between image regions and the text embedding. Each pair of image region feature and hypothesis embedding goes through an MLP. The final classification takes as input the element-wise sum of the MLP outputs over all pairs.
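The pairwise step of this baseline can be sketched in plain Python. It is a toy illustration under assumptions: the MLP is replaced by a hypothetical single ReLU layer with hand-picked weights, a pair is formed by concatenating the region feature with the text embedding, and all names and numbers are invented.

```python
def relu_layer(weights, bias):
    """Build a toy one-layer stand-in for the MLP: x -> ReLU(Wx + b)."""
    def g(x):
        return [
            max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)
        ]
    return g


def relational_pooling(region_feats, text_emb, g):
    """Apply g to every (region, text) pair and sum the outputs element-wise."""
    pair_outputs = [g(list(region) + list(text_emb)) for region in region_feats]
    return [sum(column) for column in zip(*pair_outputs)]


# Two 2-d region features and a 1-d text embedding (made-up values).
regions = [[1.0, 2.0], [3.0, 4.0]]
text = [10.0]
g = relu_layer(weights=[[1.0, 0.0, 0.0], [0.0, -1.0, 1.0]], bias=[0.0, 0.0])
pooled = relational_pooling(regions, text, g)
```

The pooled vector has a fixed size regardless of how many regions the image contributes, which is what lets a classifier consume it directly.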

Despite the high accuracy on the synthetic dataset CLEVR, RN only achieves a marginal improvement on SNLI-VE at the accuracy of 67.56% and 67.55% on the validation and testing partitions. This may be attributed to the limited representational power of RN that fails to produce effective cross-modal feature fusion of the natural image premises and the free-form text hypothesis input from SNLI-VE.

5.4 Attention Top-Down and Bottom-Up

We consider the Attention Top-Down and Attention Bottom-Up baselines based on the winner of the VQA Challenge 2017 [1]. Similar to the RN baseline, there is an image branch and a text branch. The difference between the image branches in Attention Top-Down and Attention Bottom-Up mirrors the difference between our two EVE variants. The image features of Attention Top-Down come from the feature maps generated by a pre-trained CNN. As for Attention Bottom-Up, the image features are the top 10 ROIs extracted from a pre-trained Mask R-CNN implementation [25]. No self-attention is applied in either the image or the text branch. Moreover, the text-image attention is implemented by feeding the concatenation of both image and text features into an FC layer to derive the attention weights, rather than using SDP as described in Section 4.1. Then the attended image features and text features are projected separately and fused by dot product. The fused features go through two different MLPs. The element-wise sum of both MLP outputs serves as the final features for classification.

The SOTA VQA winner model, Attention Top-Down, achieves an accuracy of 70.53% and 70.30% on the validation and testing partitions respectively, implying that cross-modal attention is the key to effectively leveraging image premise features. The Attention Bottom-Up model using ROIs also achieves a good accuracy of 69.34% and 68.90% on the validation and testing partitions. The reason why Attention Bottom-Up performs worse than Attention Top-Down is possibly a lack of background information in the ROI features and the quality of the ROI features. It is not guaranteed that the top ROIs cover the necessary details described by the hypothesis. However, even with more than 10 ROIs, we observe no significant improvement in performance.

5.5 EVE-Image and EVE-ROI

The details of our EVE architecture have been described in Section 4. EVE-Image achieves the best performance, with 71.56% and 71.16% accuracy on the validation and testing partitions respectively. EVE-ROI performs similarly, with an accuracy of 70.81% and 70.47%, possibly suffering from the same issues as the Attention Bottom-Up model. The improvement over the Attention baselines is likely due to the introduction of self-attention and text-image attention through SDP, which potentially capture the hidden relations within the same feature space and yield better-attended cross-modal feature interactions.

Figure 5: An attention visualization for EVE-Image
Figure 6: An attention visualization for EVE-ROI

Attention Visualization. The explainability of EVE is attained by visualizing the attention over the areas of interest in the image premise given the hypothesis. Figures 5 and 6 illustrate two visualization examples of the text-image attention from EVE-Image and EVE-ROI respectively. The image premise of the EVE-Image example is shown on the left of Figure 5, and the corresponding hypothesis is “A human playing guitar”. On the right of Figure 5, our EVE-Image model successfully attends to the guitar area, leading to the correct conclusion: entailment. In Figure 6, our EVE-ROI focuses on the children and the sand area in the image premise, leading to the contradiction conclusion for the given hypothesis “Two children are swimming in the ocean.”

5.6 Discussion

In this section, we discuss why existing VQA and CLEVR models achieve only modest performance on the SNLI-VE dataset, and possible future directions based on our experience. VQA models are not trained to distinguish fine-grained information. Furthermore, since the same image is present across all three classes in the SNLI-VE dataset, SNLI-VE removes any bias that may originate from the image premise information alone, and an effectively fused representation is important for high accuracy. In addition, models that perform well on CLEVR may not work on SNLI-VE since these models have rather simplistic image processing pipelines, often with a couple of convolutional layers that may be sufficient to process synthetic images but work poorly on real images. More importantly, the sentences in the SNLI-VE dataset are not synthetic. As a result, building compositional reasoning modules over SNLI-VE hypotheses is out of reach for existing models.

To effectively address SNLI-VE, we believe three approaches can be beneficial. First, using external knowledge beyond pre-trained models and/or visual entity extraction can be beneficial. If the external knowledge can provide information allowing the model to learn relationships between the entities that may be obvious to humans but difficult or impossible to learn from the dataset (such as “two women in the image are sisters”), it will improve the model performance over SNLI-VE.

Second, it is possible for the hypothesis to contain multiple class labels assigned to its different entities or relationships w.r.t. the premise. However, SNLI-VE lacks annotations for localizing the labels to specific entities in the hypothesis (as is often provided in synthetic datasets like bAbI [72]). Since the hypothesis can be broken down into individual entities and relationships between pairs of entities, providing fine-grained labels for each target in the hypothesis would likely facilitate strongly-supervised training.

Finally, a third possible approach is to build effective attention based models, as done in TE, that encode the two inputs together and align similar words in the hypothesis and premise, instead of a late fusion of separately encoded modalities. Hence, the active research on visual grounding can benefit efforts to address the SNLI-VE task.

6 Conclusion

We introduce a novel task, visual entailment, that requires fine-grained reasoning over the image and text. We build the SNLI-VE dataset for VE using real-world images from Flickr30k as premises, and the corresponding text hypotheses from SNLI. We then develop the EVE architecture to address VE and evaluate against multiple baselines, including existing SOTA VQA based models. We expect more effort to be devoted to generating fine-grained VE annotations for large image datasets such as the Visual Genome [41] and Open Images Dataset [40] as well as improved models on fine-grained visual reasoning.


Ning Xie and Derek Doran were supported by the Ohio Federal Research Network project Human-Centered Big Data. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the author(s) and do not necessarily reflect the views of the Ohio Federal Research Network.


Supplementary: Additional Examples

Figure 7 shows random examples from SNLI-VE with predictions from our EVE-Image. Each example consists of an image premise and three selected hypotheses with different labels. Note that for each image premise, the total number of hypotheses is not limited to three.

Figure 7: Random examples from SNLI-VE with prediction results from our best-performing EVE-Image