Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of texts in the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model's ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with textual contrastive adversarial samples. These samples are synthesized using linguistic rules and the WordNet knowledge base. The construction procedure is both syntax- and semantics-aware. The samples force the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known-type adversarial attacks. We release the code at https://github.com/ExplorerFreda/VSE-C.



1 Introduction

The visual grounding of language plays an indispensable role in our daily lives. We use language to name, refer to, and describe objects, their properties, and, more generally, visual concepts. Distributional semantics (e.g., global word embeddings [Pennington et al.2014]) based on large-scale corpora has shown great success in modeling the functionality and correlation of words in the natural language domain. This further contributes to the success of numerous natural language processing (NLP) tasks such as language modeling [Cheng et al.2016, Inan et al.2017], sentiment analysis [Cheng et al.2016, Kumar et al.2016], and reading comprehension [Cheng et al.2016, Chen et al.2016, Shen et al.2017]. However, effective and efficient grounding of distributional embeddings remains challenging. Being ignorant of the corresponding visual concepts, purely textual embeddings demonstrate inferior performance when incorporated with visual inputs. Typical tasks include image/video captioning, multi-modal retrieval/understanding, and visual reasoning, some of which are studied extensively in this paper.

Visual concepts and their links with textual semantics, as a cognitive alignment, provide rich supervision to learning systems. Introduced by Kiros et al. (2014), Visual-Semantic Embedding (VSE) aims at building a bridge between natural language and the underlying visual world by jointly optimizing and aligning the embedding spaces of images and their descriptive texts (captions). Nevertheless, even for large-scale datasets such as MS-COCO [Lin et al.2014], the number of image-caption pairs is far smaller than the number of possible constitutions of real-world semantics, making the dataset inevitably sparse and biased.

To reveal this, we begin by constructing textual adversarial samples to attack the state-of-the-art system VSE++ [Faghri et al.2017]. Specifically, we study the composition of sentences from two aspects: (1) content words, including nouns and numerals, and (2) prepositions indicating spatial relations (e.g., in, on, above, below). As shown in Figure 1, we manipulate the original caption to construct hard negative captions with similar structure but completely contradictory semantics. We found that the models are easily confused, suffering a noticeable drop in confidence or even making wrong predictions in the caption retrieval task.

Figure 1: An overview of our textual contrastive adversarial samples. For each caption, we have three paradigms to generate contrastive adversarial samples, i.e., noun, numeral and relation. For a given image, we expect the model to distinguish the real captions from the generated adversarial ones.

We propose VSE-C, which enforces the learning of correlation and correspondence between textual semantics and visual concepts by providing contrastive adversarial samples during the training procedure, combined with intra-pair hard negative sample mining. Rather than merely defending against adversarial attacks, we focus on studying the limitations of current visual-semantic datasets and the transferability of the learned embeddings. To bridge the large gap between the number of parallel image-caption pairs and the expressiveness of natural language, we augment the data by employing a set of heuristic rules to generate large sets of contrastive negative captions, as demonstrated in Figure 1. The candidates are selectively used for training via an intra-pair hard-example mining technique. VSE-C alleviates the bias of the dataset and provides rich and effective samples on par with the original image captions. This strengthens the link between text and visual concepts by requiring models to detect a mismatch at the level of precise concepts.

VSE-C learns discriminative and visually-grounded word embeddings on the MS-COCO dataset [Lin et al.2014]. It is extensively compared with existing works through rich experiments and analyses. Most importantly, we explore the transferability of the learned embeddings in several real-world applications, both qualitatively and quantitatively, including image-to-text retrieval and bidirectional word-to-concept retrieval. Furthermore, VSE-C demonstrates a general framework for augmenting textual inputs while preserving semantic consistency. The introduction of human priors and knowledge bases alleviates the sparsity and non-contiguity of language. We release our code and data at https://github.com/ExplorerFreda/VSE-C.

2 Related works

Joint embeddings

Joint embedding is a common technique for a wide range of tasks incorporating multiple domains, including audio-video embeddings for unsupervised representation learning [Ngiam et al.2011], shape-image embeddings [Li et al.2015] for shape inference, bilingual word embeddings for machine translation [Zou et al.2013], human pose-image embeddings for pose inference [Li2011], image-text embeddings for visual description [Reed et al.2016], and global representation learning from multiple domains [Castrejon et al.2016]. These embeddings map multiple domains into a joint vector space which describes the semantic relations between inputs (e.g., distance, correlation).

We focus on the visual-semantic embedding [Mao et al.2016, Kiros et al.2014, Faghri et al.2017], learning word embeddings with visually-grounded semantics. Examples of related applications include image caption retrieval and generation [Kiros et al.2014, Karpathy and Fei-Fei2015], and visual question-answering [Malinowski et al.2015].

Image-to-text translation

Canonical Correlation Analysis (CCA) [Hotelling1936] is a statistical method that linearly projects two views into a common space to maximize their correlation. Andrew et al. (2013) propose a deep learning framework that extends CCA so that it can learn nonlinear projections and scales better to relatively large datasets.

In state-of-the-art frameworks, pairwise ranking is often adopted to learn a distance metric [Socher et al.2014, Niu et al.2017, Nam et al.2017]. Frome et al. (2013) propose a cross-modal feature embedding framework that uses a CNN and Skip-Gram [Mikolov et al.2013] to extract representations for images and texts respectively; an objective is then applied to ensure that the distance between a matched image-text pair is smaller than that between mismatched pairs. A similar framework by Kiros et al. (2014) uses a Gated Recurrent Unit (GRU) as the sentence encoder. Wang et al. (2016) use a bidirectional loss function with structure-preserving constraints. An attention mechanism on both image and caption is used by Nam et al. (2017), whose model estimates the similarity between images and texts by sequentially focusing on subsets of image regions and words that share semantics. Huang et al. (2017) utilize a multi-modal context-modulated attention mechanism to compute the similarity between an image and a caption. Faghri et al. (2017) propose a loss that penalizes the hard negatives, i.e., the closest mismatched pairs, instead of averaging the individual violations across all negatives as in Kiros et al. (2014).

Adversarial attack in text domain

Adversarial attacks have recently drawn significant attention in the deep learning community. They span multiple domains, including image classification [Nguyen et al.2015], image segmentation and object detection [Xie et al.2017], textual reading comprehension [Jia and Liang2017], and deep reinforcement learning [Kos and Song2017].

In this paper, we present textual adversarial attacks on image-to-text translation systems such as image captioning frameworks. While focusing on the problem of learning visually-grounded semantics, the adversarial attack brings new solutions for bridging the gap between limited training data and the numerous constitutions of natural language. With extensive experiments on the effects of the adversarial samples, we reach the conclusion that current visual-semantic embeddings are "insensitive" to the underlying semantics. The proposed VSE-C shows advantages across multiple visual-semantic tasks.

3 Method

3.1 Preliminaries

Word embeddings

We manually split the embedding of each word into two parts: distributional embeddings and visually-grounded embeddings. We use GloVe [Pennington et al.2014], pre-trained on large-scale corpora without supervision, as the distributional embeddings. We focus on the visually-grounded embeddings of words. The embeddings are optimized using the visual-semantic embedding (VSE) technique.

Visual-semantic embeddings

VSE optimizes and aligns the latent spaces of the visual and textual domains. Parallel data are typically obtained from image captioning datasets such as Flickr30K [Young et al.2014] or MS-COCO [Lin et al.2014]. The training set contains image-caption pairs $\{(i_n, c_n)\}$. Typically, all captions $c_m$ and images $i_m$ with $m \neq n$ form the negative samples for a specific pair $(i_n, c_n)$.

Following the notations of Kiros et al. (2014), domain-specific encoders are first employed to extract latent features of images and captions. We use ResNet-152 [He et al.2016] as the visual-domain encoder and a GRU as the text-domain sentence encoder, both of which are effective for VSE. The features are projected into a joint latent space with a linear transformation. A hinge loss with margin $\alpha$ is employed to optimize the alignment:

$$\ell(i, c) = \sum_{c'} [\alpha - s(i, c) + s(i, c')]_+ + \sum_{i'} [\alpha - s(i, c) + s(i', c)]_+,$$

where $[x]_+ = \max(x, 0)$, and $s(i, c)$ measures the similarity between the projected image embedding and caption embedding. The summations are taken over all image-caption pairs within a sampled batch.
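As a concrete illustration, the sum-over-negatives hinge objective can be sketched over a batch similarity matrix. This is a toy NumPy version with our own variable names, assuming the matched pairs sit on the diagonal of the matrix:

```python
import numpy as np

def vse_hinge_loss(S, margin=0.2):
    """Sum-over-negatives VSE hinge loss for one batch.

    S: (N, N) similarity matrix with S[n, m] = s(i_n, c_m);
    the diagonal holds the matched (positive) pairs.
    """
    pos = np.diag(S)                                    # s(i_n, c_n)
    cost_c = np.maximum(0, margin - pos[:, None] + S)   # contrast captions c'
    cost_i = np.maximum(0, margin - pos[None, :] + S)   # contrast images i'
    # a pair is not its own negative: zero out the diagonal terms
    np.fill_diagonal(cost_c, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_c.sum() + cost_i.sum()
```

When every positive pair beats all negatives by at least the margin, the loss vanishes, which is the intended fixed point of the alignment.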

3.2 Generating contrastive adversarial samples

Class Original Caption Contrastive Adversarial Example
Noun A person feeding a cat with a banana. A person feeding a dog with a banana.
Numeral A person feeding a cat with a banana. A person feeding five cats with a banana.
Relation-1 A person feeding a cat with a banana. A cat feeding a person with a banana.
Relation-2 A person feeding a cat with a banana. A person feeding a cat in a banana.
Table 1: Examples of contrastive adversarial samples generated with our heuristic rules and knowledge from WordNet. The samples fall into four types: noun replacement, numeral replacement, relation shuffling, and relation replacement.

Our contrastive adversarial samples can be split into three classes: noun, numeral and relation. Each class of samples is generated separately.


We extract a list of heads [Zwicky1985] of noun phrases in the MS-COCO dataset and label those with frequency larger than 200 as frequent heads. In addition, since images usually reflect concrete concepts better than abstract ones, we compute the concreteness of words following Turney et al. (2011), and only consider heads whose concreteness exceeds a threshold. Only frequent concrete heads may be replaced, and only by other frequent concrete heads with different meanings, to form contrastive adversarial samples.

While replacing, we utilize the hypernymy/hyponymy relations in WordNet [Miller1995] to ensure that the original noun and its replacement are semantically different. Only words without hypernymy or hyponymy relations to the original can be used as replacements for adversarial sample generation. For example, "animal" is a hypernym of "cat". Therefore, "A person feeding an animal with a banana" is not a valid contrastive adversarial caption for an image captioned "A person feeding a cat with a banana."
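The filtering rule can be sketched as follows. The tiny hypernym table and helper names here are our own illustrative stand-ins; the actual pipeline queries WordNet's hypernym/hyponym closures instead:

```python
# Toy hypernym table standing in for WordNet (an assumption for
# illustration): each noun maps to the set of its hypernyms.
HYPERNYMS = {
    "cat": {"feline", "animal"},
    "dog": {"canine", "animal"},
    "animal": set(),
}

def related(a, b):
    """True if a and b stand in a hypernymy/hyponymy relation."""
    return b in HYPERNYMS.get(a, set()) or a in HYPERNYMS.get(b, set())

def noun_replacements(noun, frequent_concrete_heads):
    """Candidate replacements: frequent concrete heads that are neither
    the noun itself nor one of its hypernyms/hyponyms."""
    return [h for h in frequent_concrete_heads
            if h != noun and not related(noun, h)]
```

With this rule, "animal" is rejected as a replacement for "cat" (hypernym), while "dog" is accepted (semantically disjoint).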


For each caption, we detect numerals and replace them with other numerals indicating different quantities to form contrastive adversarial samples. Note that “a” and “an” are treated as “one” here, though they are (indefinite) articles instead of numerals. Meanwhile, we singularize or pluralize the corresponding nouns when necessary.
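A minimal sketch of numeral replacement, with a naive add/strip-"s" inflection rule standing in for proper singularization/pluralization tools (the function and table names are ours):

```python
# Numerals recognized by the toy rule; "a"/"an" are treated as "one".
NUMERALS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3,
            "four": 4, "five": 5}

def replace_numeral(tokens, idx, new_word, new_count):
    """Replace the numeral at position idx with new_word and re-inflect
    the noun immediately after it (naive "s" rule for illustration)."""
    assert tokens[idx].lower() in NUMERALS
    out = list(tokens)
    out[idx] = new_word
    if idx + 1 < len(out):
        noun = out[idx + 1]
        if new_count > 1 and not noun.endswith("s"):
            out[idx + 1] = noun + "s"       # pluralize
        elif new_count == 1 and noun.endswith("s"):
            out[idx + 1] = noun[:-1]        # singularize
    return out
```

For example, replacing the article before "cat" with "five" turns "a person feeding a cat with a banana" into "a person feeding five cats with a banana".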


The relation class includes two different paradigms.

The first paradigm can be viewed as a shuffle of non-interchangeable noun phrases. After extracting the noun phrases of a caption, we shuffle them and put them back into the original positions. Although the bag-of-words features of the two captions remain the same, the semantic meaning is altered by this process.
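The shuffling paradigm can be sketched as below, assuming the noun-phrase spans have already been detected (e.g., by SpaCy); the function name and span format are illustrative:

```python
import random

def shuffle_noun_phrases(tokens, np_spans, rng=None):
    """Shuffle the noun phrases of a caption (relation shuffling).

    np_spans: list of (start, end) token-index spans of the noun
    phrases, in order. A non-identity permutation is drawn so the
    output caption differs from the input, while all tokens between
    noun phrases stay in place.
    """
    if len(np_spans) < 2:
        return list(tokens)
    rng = rng or random.Random(0)
    phrases = [tokens[s:e] for s, e in np_spans]
    order = list(range(len(phrases)))
    while order == sorted(order):      # reject the identity permutation
        rng.shuffle(order)
    out, prev = [], 0
    for (s, e), j in zip(np_spans, order):
        out.extend(tokens[prev:s])     # text between noun phrases
        out.extend(phrases[j])         # a different noun phrase
        prev = e
    out.extend(tokens[prev:])
    return out
```

The output keeps exactly the same bag of words as the input, yet swaps the roles of the entities, which is what makes the sample contrastive.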

The second paradigm is the replacement of prepositions. We extract the prepositions with frequency higher than 200 in the MS-COCO dataset. We then manually annotate a semantic overlap table, which can be found in Appendix A. Words in the same set of this table may have semantic overlap with each other, e.g., "by" and "with", or "in" and "among".
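A sketch of how such an overlap table can be used to pick safe replacement prepositions; the small OVERLAP list here is a hypothetical stand-in for the annotated table in Appendix A:

```python
# Hypothetical overlap sets: prepositions in the same set may describe
# the same spatial configuration, so they never replace one another.
OVERLAP = [{"by", "with"}, {"in", "among", "inside"}, {"on", "above"}]
PREPOSITIONS = {"by", "with", "in", "among", "inside", "on", "above", "under"}

def safe_replacements(prep):
    """Frequent prepositions that do not semantically overlap with prep."""
    group = next((g for g in OVERLAP if prep in g), {prep})
    return sorted(PREPOSITIONS - group - {prep})
```

For instance, "with" is never used to replace "by" (they overlap), while "under" is a valid contrastive replacement for "in".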

The noun phrase detection, preposition detection and numeral detection mentioned above are performed with SpaCy [Honnibal and Johnson2015]. Examples of different classes of contrastive adversarial sample generation are shown in Table 1.

3.3 Intra-pair hard negative mining

We extend the online hard example mining (OHEM) technique used by VSE++ [Faghri et al.2017]. The original hinge loss is modified by choosing the hardest sample within a batch (inter-pair). Mathematically,

$$\ell(i, c) = \max_{c'} [\alpha - s(i, c) + s(i, c')]_+ + \max_{i'} [\alpha - s(i, c) + s(i', c)]_+.$$

There are two major concerns regarding in-batch hard negative mining. On one hand, mining negatives from a single batch is inefficient when the batch size is not comparable with the size of the dataset. On the other hand, for real-world datasets, taking the max in the loss function is very sensitive to label noise, which produces false negative samples.

In contrast, given an image-caption pair $(i, c)$, we employ human heuristics and the WordNet knowledge base to generate a set of contrastive negative samples $C'(c)$. To utilize these candidate caption sets, we employ an intra-pair hard negative mining strategy. Specifically, during optimization, we add an extra loss term:

$$\ell_{\text{intra}}(i, c) = \max_{c' \in C'(c)} [\alpha - s(i, c) + s(i, c')]_+.$$

In our implementation, the candidate set contains a large number of samples per caption. In each iteration, we randomly sample a subset of negatives from it. This simple sampling technique is effective and computation-friendly in our empirical studies.
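The intra-pair term can be sketched for a single image-caption pair as follows (a toy NumPy version; the sampled-subset size k and all variable names are our own):

```python
import numpy as np

def intra_pair_loss(s_pos, s_negs, margin=0.2, k=5, rng=None):
    """Intra-pair hard-negative hinge for one (image, caption) pair.

    s_pos:  similarity of the matched pair, s(i, c).
    s_negs: similarities of i against the contrastive captions C'(c).
    A random subset of k candidates is drawn, and only the hardest one
    (highest similarity) contributes to the hinge.
    """
    rng = rng or np.random.default_rng(0)
    sample = rng.choice(s_negs, size=min(k, len(s_negs)), replace=False)
    return max(0.0, margin - s_pos + sample.max())
```

Sampling a subset before taking the max keeps the cost per iteration constant regardless of how many contrastive captions were generated.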

4 Experiments

We begin our experiments with an extensive study on the effect of adversarial samples on the baseline models. Even trained with hard negative mining techniques, VSE++ fails to discriminate words with completely contradictory visually-grounded semantics. Furthermore, we study the improvement brought by the introduction of contrastive adversarial samples on a diverse set of tasks.

4.1 Adversarial attacks

We select 1,000 images for testing from the MS-COCO 5k test split following Karpathy and Fei-Fei (2015). Each image is associated with five captions. Each caption in the selected test set can be manipulated to generate at least 20 contrastive adversarial samples of every type (noun, numeral, and relation adversary). The image-to-caption retrieval task is defined as ranking the candidate captions by the distance between their semantics and the given image.

We follow the metrics used by Faghri et al. (2017), computing R@1, R@10, median rank, and mean rank w.r.t. the top-ranked correct caption for each image. For each image, the retrieval database contains the full set of 5,000 captions, of which only 5 are labeled as positive. The R@k metric measures the percentage of images for which the set of top-k ranked captions contains at least one positive caption.
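For reference, these ranking metrics can be computed from a similarity matrix as sketched below (our own minimal NumPy implementation, not the released evaluation code):

```python
import numpy as np

def retrieval_metrics(sims, positives, ks=(1, 10)):
    """Caption-retrieval metrics over a set of image queries.

    sims: (num_images, num_captions) similarity scores.
    positives: list of sets; positives[q] holds the caption indices
    labeled positive for image q. The rank of an image is the best
    (lowest) rank achieved by any of its positive captions.
    """
    ranks = []
    for q, sim in enumerate(sims):
        order = np.argsort(-sim)  # captions sorted by descending similarity
        rank = min(np.where(np.isin(order, list(positives[q])))[0]) + 1
        ranks.append(rank)
    ranks = np.array(ranks)
    out = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    out["med_r"] = float(np.median(ranks))
    out["mean_r"] = float(np.mean(ranks))
    return out
```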

We attack the existing models with adversarial samples. We extend each caption with 60 adversarial samples (20 noun-typed, 20 numeral-typed, and 20 relation-typed). Therefore, each image has 60 × 5 = 300 contrastive adversarial samples in total, and the candidate retrieval set for each image grows from 5,000 to 5,300 captions. We discuss the experimental results as follows:

VSE-C is more robust to known-type adversarial attacks than VSE and VSE++.

We compare the performance of VSE-C with VSE [Kiros et al.2014] and VSE++ [Faghri et al.2017] in Table 2. Both VSE and VSE++ suffer a significant drop in performance after adding adversarial samples, while VSE-C, trained with contrastive adversarial samples, is less vulnerable to the attacks. This phenomenon reflects that the text encoders of VSE and VSE++ do not make good use of the image encodings, as the image encodings are fixed in all experiments.

Detailed attack results are shown in Table 3. The three hyper-columns show the models' ability to defend against noun-, numeral-, and relation-typed adversarial attacks, respectively. Among the three types of attacks, VSE and VSE++ suffer least from the noun attack. As the construction of the dataset ensures that the words used for replacement are frequent in the entire dataset, the visual grounding of these frequent nouns is easy to obtain. However, the semantics of relations (including prepositions and entity relations) and numbers are not diverse enough in the dataset, leading to the poor performance of VSE against these attacks.

Numeral-typed VSE-C improves the counting ability of models.

As shown in Table 3, numeral-typed contrastive adversarial samples improve the counting ability of models. However, it is not immediately clear where the gain comes from, as the creation of numeral-typed samples may change the form (i.e., singular or plural) of nouns to keep a sentence plausible. Does the gain come merely from an improved ability to distinguish singulars from plurals?

We conduct the following evaluation to study the counting ability. We extract all the images associated with captions containing plurals from our test split of MS-COCO, forming a plural split of the dataset, and generate only numeral-typed contrastive adversarial samples (changing the numerals) w.r.t. the plurals in the captions. We report the performance of VSE++ and numeral-typed VSE-C on this plural split in Table 4. It clearly shows that VSE-C not only distinguishes singulars from plurals, but also, at least to some extent, distinguishes plurals from other plurals (e.g., 3 vs. 5).

It is worth noting that such counting ability is still not evaluated completely due to the limitations of the current MS-COCO test split. We find that 99.8% of the plurals in the MS-COCO test set come from one of "two", "three", "four", and "five". This may reduce the counting problem to a much simpler classification one.

Model MS-COCO Test MS-COCO Test (w/. adversarial)
R@1 R@10 Med r. Mean r. R@1 R@10 Med r. Mean r.
VSE 47.7 87.8 2.0 5.8 28.0 71.6 4.0 11.7
VSE++ 55.7 92.4 1.0 4.3 35.6 72.5 3.0 11.8
VSE-C (+n.) 50.7 90.7 1.0 5.2 40.3 80.2 2.0 9.2
VSE-C (+num.) 53.3 90.2 1.0 5.8 46.9 86.3 2.0 6.9
VSE-C (+rel.) 52.4 89.0 1.0 5.7 42.3 82.5 2.0 7.2
VSE-C (+all) 50.2 89.8 1.0 5.2 47.4 88.8 2.0 5.5
Table 2: Evaluation on image-to-caption retrieval. Although VSE++ [Faghri et al.2017] obtains the best performance on the original MS-COCO test set, it is more vulnerable to the caption-specific adversarial attack than the proposed VSE-C, as is VSE [Kiros et al.2014].
Model MS-COCO Test (+n.) MS-COCO Test (+num.) MS-COCO Test (+rel.)
R@1 R@10 Mean r. R@1 R@10 Mean r. R@1 R@10 Mean r.
VSE 37.6 85.8 6.9 38.5 82.3 7.7 30.7 76.7 8.8
VSE++ 45.7 89.1 5.5 45.9 82.3 7.2 42.3 80.0 7.6
VSE-C (+n.) 49.2 88.4 5.7 42.1 80.3 9.1 40.4 83.3 7.1
VSE-C (+num.) 51.0 89.5 6.1 53.3 90.2 5.8 49.0 87.0 6.6
VSE-C (+rel.) 48.0 88.8 5.3 45.4 83.9 6.7 50.1 90.2 4.9
VSE-C (+all.) 49.4 89.3 5.3 49.9 89.6 5.2 47.9 89.4 5.3
Table 3: Detailed results on each type of adversarial attack. Training VSE-C on one class yields the best robustness against the adversarial attack of that class itself. In addition, training with numeral-typed adversarial samples helps improve the robustness against noun-typed and relation-typed attacks. We hypothesize that this is attributable to the singularization or pluralization of the corresponding nouns during numeral-typed adversarial sample generation.
Model MS-COCO Test (plural split, +plurals)
R@1 R@10 Mean r.
VSE++ 43.7 78.3 9.1
VSE-C (+num.) 50.6 84.4 7.8
Table 4: Results on plural-typed adversarial attack to the plural split of MS-COCO test set. This split consists of 205 images, together with the 1,025 original captions. VSE-C outperforms VSE++ by a large margin on all the three considered metrics.

4.2 Saliency visualization

Figure 2: Saliency analysis on adversarial samples. The left column shows the saliency of VSE-C on the image (what is the difference between the image and the image you imagine from the caption), while the right column shows the saliency of both VSE++ and VSE-C on the caption (what is the difference between the caption and the caption you summarize from the image). The magnitude of values indicates the level of saliency. For better visualization, the image and caption saliencies are normalized. For all three classes of textual adversarial samples, the image encoding model (ResNet-152) focuses almost exclusively on the main part of the image, i.e., the elephant. For numeral-typed and relation-typed adversarial samples, VSE-C pays much more attention than VSE++ to the manipulated segments of the sentence.

Given an image-caption pair and its corresponding textual adversarial samples, we are interested in the following question: what determines the semantic distance between the image and an adversarial caption? In other words, which part of the image or caption, in particular, makes them semantically different?

We visualize the saliency on input images and captions w.r.t. changes in sentence semantics. Specifically, given an image-caption pair $(i, c)$, we manually modify the semantics of the caption with the techniques introduced in Section 3.2, obtaining an adversarial caption $c'$. We compute the saliency of $i$ or $c$ w.r.t. this change by visualizing the Jacobians:

$$\frac{\partial}{\partial i}\left[s(i, c) - s(i, c')\right] \quad \text{and} \quad \frac{\partial}{\partial c}\left[s(i, c) - s(i, c')\right],$$

where $s(\cdot, \cdot)$ is the similarity metric for image-caption pairs.
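To make the quantity concrete, a finite-difference version of this saliency can be sketched on a toy bilinear similarity (the bilinear form and all names here are illustrative assumptions; the actual implementation backpropagates through the encoders):

```python
import numpy as np

def similarity(i, c, W):
    """Toy bilinear similarity s(i, c) = i^T W c, a stand-in for the
    learned similarity metric (an assumption for illustration)."""
    return i @ W @ c

def caption_saliency(i, c, c_adv, W, eps=1e-5):
    """Central-difference estimate of d/dc [s(i, c) - s(i, c_adv)],
    evaluated coordinate by coordinate at the caption features c."""
    diff = lambda cc: similarity(i, cc, W) - similarity(i, c_adv, W)
    grad = np.zeros_like(c)
    for d in range(len(c)):
        bump = np.zeros_like(c)
        bump[d] = eps
        grad[d] = (diff(c + bump) - diff(c - bump)) / (2 * eps)
    return grad
```

Coordinates of c with large gradient magnitude correspond to the parts of the caption that drive the semantic mismatch.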

As shown in Figure 2, for captions, VSE-C captures the change in sentence semantics and thus exhibits large saliency on the manipulated words. In contrast, although trained with hard-negative mining, VSE++ finds it difficult to capture differences other than nouns.

Interestingly, the saliency of images shows a less correlated response to semantic changes when the replaced word is not a major component of the image. We attribute this to the image embedding extractor, ResNet, which is pre-trained on the ImageNet classification task. As ResNet learns to produce shift-invariant features focusing on the major components (or concepts) of images, it inevitably learns less about secondary (and other) concepts.

4.3 Correlating words and objects

As only textual adversarial samples are provided during training, the model might overfit the training samples by memorizing incorrect co-occurrences of words or concepts. To quantitatively evaluate the learned word embeddings, we conduct experiments on word-level image-to-word retrieval. Specifically, we first examine how each noun is linked with a visual object. This task probes the concrete link between words and image concepts, which supports the effectiveness of adversarial samples in enforcing the learning of visually-grounded semantics beyond co-occurrence memorization.


Based on the captions, we extract positive objects for each image in the MS-COCO dataset by detecting the heads of noun phrases using SpaCy. As mentioned in Section 3.2, only objects without a direct hypernymy/hyponymy relation to the positive objects of an image are used as its negative objects, to avoid ambiguity. Table 5 shows an example of the construction of the image-object dataset.

Image Captions
A table with a huge glass vase and fake flowers come out of it.
A plant in a vase sits at the end of a table.
A vase with flowers in it with long stems sitting on a table with candles.
A large centerpiece that is sitting on the edge of a dining table.
Flowers in a clear vase sitting on a table.
Positive Objects: table, plant, vase.
Negative Objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.
Table 5: An example of the image-to-word retrieval dataset. We extract objects by detecting heads of noun phrases in captions. We only collect the “object” words with frequency higher than 200 in MS-COCO full dataset as available positive/negative objects for each image.


Inspired by Gong et al. (2018), we train an image-word alignment network through the interaction space, since this structure reflects the property of "alignment" better than simply concatenating the feature vectors of the word and the image. In the training stage, the network is fed with batches of samples in the form of (image, word, label), where the label is 0 or 1, indicating whether the word is a negative or positive object of the image. Let $w$ denote the embedding of a word and $v$ denote the feature vector of an image extracted by ResNet-152 [He et al.2016]. As shown in Figure 3, we use the full interaction matrix $w v^\top$ as the feature for object retrieval. Both the image and word features are fixed; we tune only the parameters of a multi-layer perceptron (MLP).
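The interaction-space scoring can be sketched as follows; the flattened outer-product feature and the tiny MLP are minimal illustrative stand-ins for the actual architecture:

```python
import numpy as np

def interaction_features(w, v):
    """Full interaction matrix w v^T between a word embedding w and an
    image feature v, flattened as the input of the alignment MLP."""
    return np.outer(w, v).ravel()

def mlp_score(x, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU producing an alignment logit.
    Only these MLP parameters are trained; w and v stay fixed."""
    h = np.maximum(0, W1 @ x + b1)
    return float(W2 @ h + b2)
```

Every coordinate of the interaction matrix pairs one word-embedding dimension with one image-feature dimension, which is why this representation captures "alignment" better than plain concatenation.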

Figure 3: Model structure for image-to-word retrieval. The network is trained through the interaction space. Note that only parameters in the MLP are tuned during training.
Model MAP
GloVe 58.7
VSE 61.7
VSE++ 61.1
VSE-C (+all) 62.2
VSE-C (+n.) 62.8
VSE-C (+rel.) 62.3
VSE-C (+num.) 62.0
Table 6: Evaluation result (MAP in percentage) on image-to-word retrieval.


We use mean average precision (MAP), a widely applied metric in information retrieval, to evaluate the performance of the word embeddings. Each image is treated as a query. The average precision (AP) is defined by

$$\text{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(k),$$

where $n$ is the number of objects in the database (i.e., both positive and negative objects), $R$ is the number of positive objects, $P(k)$ is the precision at cut-off $k$ in the ranked list, and $\text{rel}(k)$ is an indicator function equal to 1 if the object at rank $k$ is a positive one and 0 otherwise [Turpin and Scholer2006].

Based on the definition of AP, MAP is computed as $\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$, where $Q$ is the query set, i.e., the image set. It is worth noting that the retrieval database of each query may differ from the others, similar to Section 4.1.
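The AP and MAP computations can be sketched directly from ranked relevance labels (a minimal Python version of the standard definitions; the function names are ours):

```python
def average_precision(ranked_labels):
    """AP for one query: ranked_labels gives the relevance (1/0) of the
    retrieved objects in ranked order."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / k          # precision at cut-off k
    return total / max(1, sum(ranked_labels))

def mean_average_precision(queries):
    """MAP over a set of queries, each with its own ranked relevance list
    (databases may differ per query)."""
    return sum(average_precision(q) for q in queries) / len(queries)
```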


The evaluation results are shown in Table 6. As expected, VSE-C (+n.) achieves the best performance on the image-object retrieval task. All of the VSE-C models outperform the baselines produced by VSE [Kiros et al.2014], VSE++ [Faghri et al.2017], and GloVe [Pennington et al.2014], showing the concrete link between learned word semantics and visual concepts. Surprisingly, VSE-C with only relation adversarial samples shows performance comparable to VSE-C with noun adversarial samples. This further supports the effectiveness of sentence-level manipulation (relation shuffling in Figure 1) in strengthening the link.

4.4 Concept to word

We quantitatively evaluate concept-to-word retrieval by introducing a sentence completion task. Given an image-caption pair, we manually replace concept words (nouns and relational words) with blanks. A separate model is trained to fill in the blanks.

Dataset and implementation details

Based on the captions, we extract nouns and relational words for each image in the MS-COCO dataset using SpaCy. These selected words are marked as "concept" representatives. During training, we randomly sample a word from the representative set and mask it as a blank to be filled. Given the image and the remaining words, the model is trained to predict the embedding of the masked word.


The sentence with a blank is encoded by two unidirectional GRU layers: the words before the blank and the words after the blank are encoded separately, by a forward GRU and a backward GRU respectively. The image feature extracted by a pre-trained ResNet-152 is then concatenated with the last output of both GRUs. The prediction of the embedding is made by a two-layer MLP taking the concatenated feature as input. We use cosine similarity as the loss function. Figure 4 illustrates our fill-in-the-blank model.

Figure 4: Model structure for fill-in-the-blank.


We present in Table 7 the performance of the proposed VSE-C on filling in both nouns and prepositions. VSE-based models outperform GloVe, which lacks visual grounding, and concretely correlate word semantics with image embeddings. We find only small gaps between VSE++ and VSE-C on preposition filling, which again shows the limited diversity of visual relations within the dataset.

Model Noun Filling Prep. Filling All (n. + prep.)
R@1 R@10 R@1 R@10 R@1 R@10
GloVe 23.2 58.8 23.3 79.9 23.3 66.6
VSE++ 25.0 61.7 34.9 84.9 28.4 68.1
VSE-C (ours) 27.3 62.9 35.2 85.2 30.0 70.98
Table 7: Evaluation result on the fill-in-the-blank task (in percentage). The word embeddings learned by VSE-C with all classes of contrastive adversarial samples help reach a better performance than those learned by VSE++ [Faghri et al.2017].

5 Discussion and conclusion

In this paper, we focus on the problem of learning visually-grounded semantics from parallel image-text data. With extensive experiments on adversarial attacks against existing frameworks [Kiros et al.2014, Faghri et al.2017], we obtain new insights into the limitations of both datasets and frameworks. (1) Even for large-scale datasets such as MS-COCO captioning, a large gap remains between the number of possible constitutions of real-world visual semantics and the size of the dataset. (2) Existing models are not powerful enough to fully capture or extract the information contained in visual embeddings.

We propose VSE-C, introducing contrastive adversarial samples in the text domain and an intra-pair hard-example mining technique. To delve deeper into the embedding space and its transferability, we study a set of multi-modal tasks both qualitatively and quantitatively. Beyond being robust to adversarial attacks on image-to-caption retrieval tasks, experimental results on image-to-word retrieval and fill-in-the-blank reveal the correlation between the learned word embeddings and visual concepts.

VSE-C also demonstrates a general framework for augmenting textual inputs while preserving semantic consistency. The introduction of human priors and knowledge bases alleviates the sparsity and non-contiguity of language. We hope the framework and the released data are beneficial for building more robust and data-efficient models.


  • [Andrew et al.2013] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep Canonical Correlation Analysis. In Proc. of ICML.
  • [Castrejon et al.2016] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Learning Aligned Cross-Modal Representations from Weakly Aligned Data. In Proc. of CVPR.
  • [Chen et al.2016] Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proc. of ACL.
  • [Cheng et al.2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long Short-Term Memory-Networks for Machine Reading. In Proc. of EMNLP.
  • [Faghri et al.2017] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improved Visual-Semantic Embeddings. arXiv preprint arXiv:1707.05612.
  • [Frome et al.2013] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A Deep Visual-Semantic Embedding Model. In Proc. of NIPS.
  • [Gong et al.2018] Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural Language Inference over Interaction Space. In Proc. of ICLR.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of CVPR.
  • [Honnibal and Johnson2015] Matthew Honnibal and Mark Johnson. 2015. An Improved Non-Monotonic Transition System for Dependency Parsing. In Proc. of EMNLP.
  • [Hotelling1936] Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.
  • [Huang et al.2017] Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In Proc. of CVPR.
  • [Inan et al.2017] Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In Proc. of ICLR.
  • [Jia and Liang2017] Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proc. of EMNLP.
  • [Karpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proc. of CVPR.
  • [Kingma and Ba2015] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.
  • [Kiros et al.2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539.
  • [Kos and Song2017] Jernej Kos and Dawn Song. 2017. Delving into Adversarial Attacks on Deep Policies. arXiv preprint arXiv:1705.06452.
  • [Kumar et al.2016] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In Proc. of ICML.
  • [Li et al.2015] Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. 2015. Joint Embeddings of Shapes and Images via CNN Image Purification. ACM Trans. Graph.
  • [Li2011] Hang Li. 2011. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113.
  • [Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proc. of ECCV.
  • [Malinowski et al.2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images. In Proc. of ICCV.
  • [Mao et al.2016] Junhua Mao, Jiajing Xu, Kevin Jing, and Alan L Yuille. 2016. Training and Evaluating Multimodal Word Embeddings with Large-Scale Web Annotated Images. In Proc. of NIPS.
  • [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
  • [Miller1995] George A Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM.
  • [Nam et al.2017] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual Attention Networks for Multimodal Reasoning and Matching. In Proc. of CVPR.
  • [Ngiam et al.2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal Deep Learning. In Proc. of ICML.
  • [Nguyen et al.2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Proc. of CVPR.
  • [Niu et al.2017] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding. In Proc. of CVPR.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP.
  • [Reed et al.2016] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. 2016. Learning Deep Representations of Fine-Grained Visual Descriptions. In Proc. of CVPR.
  • [Shen et al.2017] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to Stop Reading in Machine Comprehension. In Proc. of SIGKDD.
  • [Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded Compositional Semantics for Finding and Describing Images with Sentences. TACL.
  • [Turney et al.2011] Peter D Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and Metaphorical Sense Identification through Concrete and Abstract Context. In Proc. of EMNLP.
  • [Turpin and Scholer2006] Andrew Turpin and Falk Scholer. 2006. User Performance versus Precision Measures for Simple Search Tasks. In Proc. of SIGIR.
  • [Wang et al.2016] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In Proc. of CVPR.
  • [Xie et al.2017] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. 2017. Adversarial Examples for Semantic Segmentation and Object Detection. In Proc. of ICCV.
  • [Young et al.2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL.
  • [Zou et al.2013] Will Y Zou, Richard Socher, Daniel Cer, and Christopher D Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proc. of EMNLP.
  • [Zwicky1985] Arnold M Zwicky. 1985. Heads. Journal of Linguistics.

Appendix A Semantic Overlap Table of Frequent Prepositions

Set #  Words in the Semantic Set
1 towards, toward, beyond, to
2 behind, after, past
3 outside, out
4 underneath, under, beneath, down, below
5 on, upon, up, un, atop, onto, over, above, beyond
6 in, within, among, at, during, into, inside, from, between
7 if, while
8 with, by, beside
9 around, like
10 to, for, of
11 about, within
12 because, as, for
13 as, like
14 near, next, beside
15 though
16 thru, through
17 besides, along
18 against, next, to
19 along, during, across, while
20 off, out
21 without
22 than
23 before
Table 8: Manually annotated semantic overlap table.

Table 8 shows our manually annotated semantic overlap sets. Prepositions in the same row overlap in semantics, i.e., can to some degree substitute for one another. A preposition may appear in several sets.
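For illustration, a table of this kind can drive contrastive preposition replacement: a substitute is valid only if it shares no semantic set with the original word, so the perturbed caption is guaranteed to change the described spatial relation. The sketch below uses a hypothetical subset of the rows in Table 8, and all function names are our own:

```python
import random

# A few rows from the semantic overlap table (illustrative subset).
SEMANTIC_SETS = [
    {"towards", "toward", "beyond", "to"},
    {"behind", "after", "past"},
    {"underneath", "under", "beneath", "down", "below"},
    {"in", "within", "among", "at", "during", "into", "inside", "from", "between"},
]
PREPOSITIONS = sorted(set().union(*SEMANTIC_SETS))

def contrastive_preposition(word, rng=random):
    """Sample a preposition that shares NO semantic set with `word`,
    so substituting it yields a semantically contrastive caption."""
    # Union of every set the original word belongs to.
    overlapping = set().union(*(s for s in SEMANTIC_SETS if word in s))
    candidates = [p for p in PREPOSITIONS if p not in overlapping and p != word]
    return rng.choice(candidates) if candidates else None
```

Because a preposition may appear in several sets, the union over all of its sets is taken before filtering, which is exactly why overlapping rows in Table 8 are harmless for sample construction.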

Appendix B Training Details of VSE-C

In all experiments, we use Adam [Kingma and Ba2015] as the optimizer, with the learning rate set to 1e-3 and a batch size of 128. The learning rate is decayed by a fixed multiplicative factor after every 15 epochs. We do not apply any regularization or dropout. Word embeddings are initialized with the 300-dimensional GloVe vectors [Pennington et al.2014] (http://nlp.stanford.edu/data/glove.840B.300d.zip). The text encoder is a 1-layer bidirectional GRU with 512 dimensions per direction (1,024 in total). The dimensionality of the joint (multimodal) embedding is also 1,024. Empirically, with training data and hyper-parameters fixed, different random seeds for the sampling cause no significant variance in performance.
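The step-decay schedule described above can be sketched as a simple function of the epoch index. Note that the decay factor `gamma=0.1` is an assumed placeholder, since the exact multiplier is not specified here:

```python
def stepped_lr(initial_lr, epoch, decay_every=15, gamma=0.1):
    """Step-decay schedule: multiply the learning rate by `gamma`
    once every `decay_every` epochs.

    `gamma=0.1` is an assumed value, not taken from the paper.
    """
    return initial_lr * gamma ** (epoch // decay_every)
```

With `initial_lr=1e-3` and `decay_every=15`, the rate stays at 1e-3 for epochs 0-14 and drops by a factor of `gamma` at epoch 15, 30, and so on.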