Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

04/11/2019 ∙ by Hao Wu, et al. ∙ Tsinghua University ∙ ByteDance Inc. ∙ Fudan University

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences.




1 Introduction

We study the problem of establishing accurate and generalizable alignments between visual concepts and textual semantics efficiently, based upon rich but few, paired but noisy, or even biased visual-textual inputs (e.g., image-caption pairs). Consider the image-caption pair A shown in Fig. 1: “A white clock on the wall is above a wooden table”. The alignments are formed at multiple levels: This short sentence can be decomposed into a rich set of semantic components Abend et al. (2017): objects (clock, table and wall) and relations (clock above table, and clock on wall). These components are linked with different parts of the scene.

This motivates our work to introduce Unified Visual-Semantic Embeddings (Unified VSE for short). As shown in Fig. 2, Unified VSE bridges visual and textual representations in a joint embedding space that unifies the embeddings for objects (noun phrases vs. visual objects), attributes (prenominal phrases vs. visual attributes), relations (verbs or prepositional phrases vs. visual relations) and scenes (sentence vs. image).

Figure 1: Two exemplar image-caption pairs. Humans are able to establish accurate and generalizable alignments between vision and language at different levels: objects, relations and full sentences. Pairs A and B form a pair of contrastive examples for the concepts clock and basin.

There are two major challenges in establishing such a factorized alignment. First, the link between the textual description of an object and the corresponding image region is ambiguous: A visual scene consists of multiple objects, and thus it is unclear to the learner which object should be aligned with the description. Second, it could be problematic to directly learn a neural network that combines various semantic components in a caption and forms an encoding for the full sentence, with the training objective of maximizing the cross-modal retrieval performance on the training set (e.g., in You et al. (2018); Ma et al. (2015); Shi et al. (2018)). As reported by Shi et al. (2018), because of the inevitable bias in the dataset (e.g., two objects may co-occur in most cases; see the table and the wall in Fig. 1 as an example), the learned sentence encoders usually pay attention to only part of the sentence. As a result, they are vulnerable to text-domain adversarial attacks: Adversarial captions constructed from original captions by adding small perturbations (e.g., by changing wall to shelf) can easily fool the model Shi et al. (2018); Shekhar et al. (2017).

We resolve the aforementioned challenges by a natural combination of two ideas: cross-situational learning and the enforcement of semantic coverage that regularizes the encoder. Cross-situational learning, or learning from contrastive examples Fazly et al. (2010), uses contrastive examples in the dataset to resolve the referential ambiguity of objects: Looking at both Pair A and Pair B in Fig. 1, we know that clock should refer to an object that occurs only in scene A but not in scene B. Meanwhile, to alleviate dataset biases such as object co-occurrence, we present an effective approach that enforces semantic coverage: The meaning of a caption is a composition of all semantic components in the sentence Abend et al. (2017). Accordingly, the embedding of a caption should cover all semantic components, and changing any of them should affect the global caption embedding.

Figure 2: We build a visual-semantic embedding space, which unifies the embeddings for objects, attributes, relations and full scenes.

Conceptually and empirically, Unified VSE makes the following three contributions.

First, the explicit factorization of the visual-semantic embedding space enables us to build a fine-grained correspondence between visual and textual data, which further benefits a set of downstream visual-textual tasks. We achieve this through a contrastive example mining technique that uniformly applies to different semantic components, in contrast to the sentence or image-level contrastive samples used by existing visual-semantic learning You et al. (2018); Ma et al. (2015); Faghri et al. (2018).

Second, we propose a caption encoder that ensures coverage of all semantic components that appear in the sentence. We show that this regularization helps our model learn a robust semantic representation for captions and effectively defends against adversarial attacks in the text domain.

Furthermore, we show how our learned embeddings can provide visual cues to assist the parsing of novel sentences, including determining content word dependencies and labelling semantic roles for certain verbs. It turns out that our model can build reliable connections between vision and language using given semantic cues and, in return, bootstrap the acquisition of language.

2 Related work

Visual semantic embedding. Visual semantic embedding Frome et al. (2013) is a common technique for learning a joint representation of vision and language. The embedding space empowers a set of cross-modal tasks such as image captioning Vinyals et al. (2015); Xu et al. (2015); Donahue et al. (2015) and visual question answering Antol et al. (2015); Xu and Saenko (2016).

A fundamental technique proposed in Frome et al. (2013) for aligning two modalities is to use the pairwise ranking to learn a distance metric from similar and dissimilar cross-modal pairs Wang et al. (2016); Ren et al. (2016); Kiros et al. (2014a); Eisenschtat and Wolf (2017); Liu et al. (2017); Kiros et al. (2014b). As a representative, VSE++ Faghri et al. (2018) uses the online hard negative mining (OHEM) strategy Shrivastava et al. (2016) for data sampling and shows the performance gain. VSE-C Shi et al. (2018), based on VSE++, enhances the robustness of the learned visual-semantic embeddings by incorporating rule-generated textual adversarial samples as hard negatives during training. In this paper, we present a contrastive learning approach based on semantic components.

There are multiple VSE approaches that also use linguistically-aware techniques for sentence encoding and learning. Hierarchical multimodal LSTM (HM-LSTM) Niu et al. (2017) and Xiao et al. (2017), as two examples, both leverage the constituency parsing tree. Multimodal-CNN (m-CNN) Ma et al. (2015) and CSE You et al. (2018) apply convolutional neural networks to the caption and extract a hierarchical representation of sentences. Our model differs from them in two aspects. First, Unified VSE is built upon a factorized semantic space instead of syntactic knowledge. Second, we employ a contrastive example mining approach that uniformly applies to different semantic components. It substantially improves the learned embeddings, while the related works use only sentence-level contrastive examples.

The learning of object-level alignment in Unified VSE is also related to Karpathy and Fei-Fei (2015); Karpathy et al. (2014); Ren et al. (2017), where the authors incorporate pre-trained object detectors for the semantic alignment. Engilberge et al. (2018) propose a selective pooling technique for the aggregation of object features. Compared with them, Unified VSE presents a more general approach that embeds concepts of different levels, while still requiring no extra supervision.

Structured representation for vision and language. We connect visual and textual representations in a structured embedding space. The design of its structure is partially motivated by the papers on relational visual representations (scene graphs) Lu et al. (2016); Johnson et al. (2015, 2018), where a scene is represented by a set of objects and their relations. Compared with them, our model does not rely on labelled graphs during training.

Researchers have designed various types of representations Banarescu et al. (2013); Montague (1970) as well as different models Liang et al. (2013); Zettlemoyer and Collins (2005) for translating natural language sentences into structured representations. In this paper, we show how incorporating such semantic parsing into visual-semantic embedding facilitates the learning of the embedding space. Moreover, we show how the learned VSE can, in return, help the parser resolve parsing ambiguities using visual cues.

3 Unified Visual-Semantic Embeddings

We now describe the overall architecture and training paradigm for the proposed Unified Visual-Semantic Embeddings. Given an image-caption pair, we first parse the caption into a structured meaning representation, composed of a set of semantic components: object nouns, prenominal modifiers, and relational dependencies. We encode different types of semantic components with type-specific encoders. A caption encoder combines the embeddings of the semantic components into a caption semantic embedding. Jointly, we encode images with a convolutional neural network (CNN) into the same, unified VSE space. The distance between the image embedding and the sentential embedding measures the semantic similarity between the image and the caption.

We employ a multi-task learning approach for the joint learning of embeddings for semantic components (as the “basis” of the VSE space) as well as the caption encoder (as the combiner of semantic components).

3.1 Visual-Semantic Embedding: A Revisit

We begin the section with an introduction to the two-stream VSE approach. It jointly learns the embedding spaces of two modalities: vision and language, and aligns them using parallel image-text pairs (e.g., image and captions from the MS-COCO dataset Lin et al. (2014)).

Let $v$ be the representation of the image and $u$ be the representation of a caption matching this image, both encoded by neural modules. To achieve the alignment, a bidirectional margin-based ranking loss has been widely applied Faghri et al. (2018); You et al. (2018); Huang et al. (2017). Formally, for an image (caption) embedding $v$ ($u$), denote the embedding of its matched caption (image) as $u^+$ ($v^+$). A negative (unmatched) caption (image) is sampled whose embedding is denoted as $u^-$ ($v^-$). We define the bidirectional ranking loss $\ell_{sent}$ between captions and images as:

$$\ell_{sent} = \sum_{u} F_{v^-}\big(|\delta + s(u, v^-) - s(u, v^+)|_+\big) + \sum_{v} F_{u^-}\big(|\delta + s(v, u^-) - s(v, u^+)|_+\big) \quad (1)$$

where $\delta$ is a predefined margin, $|x|_+ = \max(x, 0)$ is the traditional ranking loss and $F_x(\cdot) = \max_x(\cdot)$ denotes the hard negative mining strategy Faghri et al. (2018); Shrivastava et al. (2016). $s(\cdot, \cdot)$ is a similarity function between two embeddings and is usually implemented as cosine similarity Faghri et al. (2018); Shi et al. (2018); You et al. (2018).
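As an illustration, the bidirectional ranking loss with hardest-negative mining can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the authors' implementation: negatives are taken from the other pairs in the same batch, the similarity is cosine similarity, and the function names and the default margin of 0.2 are chosen here for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def bidirectional_ranking_loss(img, cap, margin=0.2):
    """Hardest-negative bidirectional ranking loss over a batch.

    img, cap: arrays of shape (n, d); row i of img matches row i of cap.
    margin: hypothetical default, not a value taken from the paper.
    """
    sim = l2_normalize(img) @ l2_normalize(cap).T   # sim[i, j] = s(image i, caption j)
    pos = np.diag(sim)                              # similarities of matched pairs
    matched = np.eye(len(sim), dtype=bool)

    # Hinge cost of every negative caption for each image (rows) and of
    # every negative image for each caption (columns).
    cost_cap = np.maximum(0.0, margin + sim - pos[:, None])
    cost_img = np.maximum(0.0, margin + sim - pos[None, :])
    cost_cap[matched] = 0.0
    cost_img[matched] = 0.0

    # Hard negative mining: keep only the most violating negative.
    return float(cost_cap.max(axis=1).sum() + cost_img.max(axis=0).sum())
```

With perfectly separated pairs (for example, orthogonal one-hot embeddings) the loss is zero; when all embeddings coincide, every image and every caption contributes exactly the margin.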

3.2 Semantic Encodings

The encoding of a caption is made up of three steps. As an example, consider the caption “A white clock on the wall is above a wooden table”. 1) We extract a structured meaning representation as a collection of three types of semantic components: objects (clock, wall, table), attribute-object dependencies (white clock, wooden table) and relational dependencies (clock above table, clock on wall). 2) We encode each component as well as the full sentence with type-specific encoders into the unified VSE space. 3) We represent the embedding of the caption by combining the semantic components.

Semantic parsing. We implement a semantic parser of image captions (https://github.com/vacancy/SceneGraphParser) based on Schuster et al. (2015). Given the input sentence, the parser first performs syntactic dependency parsing. A set of rules is then applied to the dependency tree to extract the object entities that appear in the sentence, the adjectives that modify the object nouns, and the subjects/objects of the verbs and prepositional phrases. For simplicity, we consider only single-word nouns for objects and single-word adjectives for object attributes.

Encoding objects and attributes. We use a unified object encoder $\phi$ for nouns and adjective-noun pairs. For each word $w$ in the vocabulary, we initialize a basic semantic embedding $w^{(basic)}$ and a modifier semantic embedding $w^{(modif)}$.

For a single noun word $w_n$ (e.g., clock), we define its embedding as $w_n^{(basic)} \oplus w_n^{(modif)}$, where $\oplus$ means the concatenation of vectors. For an (adjective, noun) pair $(w_a, w_n)$ (e.g., (white, clock)), its embedding is defined as $w_n^{(basic)} \oplus w_a^{(modif)}$, where $w_a^{(modif)}$ encodes the attribute information. In implementation, the basic semantic embedding is initialized from GloVe Pennington et al. (2014). The modifier semantic embeddings (both $w_n^{(modif)}$ and $w_a^{(modif)}$) are randomly initialized and jointly learned. $w_n^{(modif)}$ can be regarded as an intrinsic modifier for each noun.

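To fuse the embeddings of basic and modifier semantics, we employ a gated fusion function:

$$\phi(w_{n,a}) = \mathrm{Norm}\big(\sigma(W_1 w_{n,a} + b_1) \odot \tanh(W_2 w_{n,a} + b_2)\big), \quad w_{n,a} = w_n^{(basic)} \oplus w_a^{(modif)},$$

where $W_1, W_2$ and $b_1, b_2$ are learnable parameters. Throughout the text, $\sigma$ denotes the sigmoid function, $\sigma(x) = 1/(1 + \exp(-x))$, and $\mathrm{Norm}$ denotes the L2 normalization, i.e., $\mathrm{Norm}(w) = w / \|w\|_2$. One may interpret $\phi$ as a GRU cell Chung et al. (2014) taking no historical state.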
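The gated fusion of a basic embedding with a modifier embedding can be sketched as below. This is a hedged illustration only: it assumes the gate takes the sigmoid-times-tanh form of a GRU cell without history, and the parameter names (`W_gate`, `W_cand`), the dimensions and the random initialization are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(basic, modif, W_gate, b_gate, W_cand, b_cand, eps=1e-12):
    """Fuse a basic and a modifier embedding into one unit-norm vector."""
    x = np.concatenate([basic, modif])        # concatenation of the two parts
    gate = sigmoid(W_gate @ x + b_gate)       # element-wise gate in (0, 1)
    cand = np.tanh(W_cand @ x + b_cand)       # candidate embedding in (-1, 1)
    fused = gate * cand
    return fused / max(np.linalg.norm(fused), eps)   # L2 normalization

# Toy usage with random (untrained) parameters; all sizes are made up.
rng = np.random.default_rng(0)
d_basic, d_modif, d_out = 4, 2, 5
params = (rng.normal(size=(d_out, d_basic + d_modif)), np.zeros(d_out),
          rng.normal(size=(d_out, d_basic + d_modif)), np.zeros(d_out))
clock_basic = rng.normal(size=d_basic)    # stands in for the GloVe vector of "clock"
white_modif = rng.normal(size=d_modif)    # stands in for the modifier vector of "white"
white_clock = gated_fusion(clock_basic, white_modif, *params)
```

The output always lies on the unit sphere, which is what allows cosine similarity to be computed as a plain dot product downstream.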

Encoding relations and full sentence. Since relations and sentences are composed of objects, we encode them with a neural combiner $\psi$, which takes the embeddings of word-level semantics encoded by $\phi$ as input. In practice, we implement $\psi$ as a uni-directional GRU Chung et al. (2014), and pick the L2-normalized last state as the output.

To obtain a visual-semantic embedding for a relational triple $(w_s, w_r, w_o)$ (e.g., (clock, above, table)), we first extract the word embeddings for the subject, relational word and the object using $\phi$. We then feed the encoded word embeddings in the same order into $\psi$ and take the L2-normalized last state of the GRU cell. Mathematically, $u_{rel} = \psi(\{\phi(w_s), \phi(w_r), \phi(w_o)\})$.

The embedding of a sentence $u_{sent}$ is computed over the word sequence $w_1, w_2, \ldots, w_k$ of the caption:

$$u_{sent} = \psi(\{\phi(w_1), \phi(w_2), \ldots, \phi(w_k)\}),$$

where for any word $w$, $\phi(w) = \phi(w^{(basic)} \oplus w^{(modif)})$. Note that we share the weights of the encoders $\psi$ and $\phi$ among the encoding processes of all semantic levels. This allows our encoders of various types of components to bootstrap the learning of each other.

Combining all of the components. A straightforward implementation of the caption encoder is to directly use the sentence embedding $u_{sent}$, as it has already combined the semantics of components in a contextually-weighted manner Levy et al. (2018). However, it has been revealed in Shi et al. (2018) that such a combination is vulnerable to adversarial attacks: Because of the biases in the dataset, the combiner usually focuses on only a small set of the semantic components that appear in the caption.

We alleviate such biases by enforcing the coverage of the semantic components that appear in the sentence. Specifically, to form the caption embedding $u_{cap}$, the sentence embedding $u_{sent}$ is combined with an explicit bag-of-components embedding $u_{comp}$. Mathematically, we define $u_{comp}$ as an unweighted aggregation of all components in the sentence:

$$u_{comp} = \mathrm{Norm}\big(\Phi(\{u_{obj}\} \cup \{u_{attr}\} \cup \{u_{rel}\})\big),$$

and encode the caption as $u_{cap} = \alpha\, u_{sent} + (1 - \alpha)\, u_{comp}$, where $0 \le \alpha \le 1$ is a scalar weight and $\Phi$ denotes the unweighted aggregation. The presence of $u_{comp}$ prevents any of the components from being ignored in the final caption embedding $u_{cap}$.
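A small sketch may help to see why the bag-of-components term enforces coverage. The code below assumes mean pooling as the unweighted aggregation and a convex combination weighted by a scalar `alpha`; both are assumptions made for illustration, and the three-dimensional toy vectors are invented.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / max(np.linalg.norm(x), eps)

def caption_embedding(u_sent, components, alpha):
    """Blend the sentence embedding with a bag-of-components embedding.

    u_sent: (d,) sentence embedding; components: (n, d) embeddings of the
    objects, attribute-object pairs and relations parsed from the caption.
    """
    u_comp = l2_normalize(np.mean(components, axis=0))  # mean pooling: an assumption
    return alpha * u_sent + (1.0 - alpha) * u_comp

# A sentence encoder that ignores the second component yields the same
# u_sent for the original and the perturbed caption.
u_sent = np.array([1.0, 0.0, 0.0])
original = np.array([[0.0, 1.0, 0.0],     # e.g. "clock"
                     [0.0, 0.0, 1.0]])    # e.g. "wall"
perturbed = np.array([[0.0, 1.0, 0.0],
                      [0.0, 0.0, -1.0]])  # "wall" replaced by another noun
```

With `alpha = 1` the two captions are indistinguishable, whereas any `alpha < 1` separates them because the changed component moves the bag-of-components embedding.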

3.3 Image Encodings

We use a CNN to encode the input RGB image into the unified VSE space. Specifically, we choose a ResNet-152 model He et al. (2016) pretrained on ImageNet Russakovsky et al. (2015) as the image encoder. We apply a layer of convolution on top of the last convolution layer (i.e., conv5_3) and obtain a convolutional feature map of shape $H \times W \times d$ for each image, where $H \times W$ is the spatial resolution and $d$ denotes the dimension of the unified VSE space.

The feature map, denoted as $V$, can be viewed as the embeddings of local regions in the image. The embedding $v$ for the whole image is defined as the aggregation of the embeddings at all regions through a global spatial pooling operator.

3.4 Learning Paradigm

In this section, we present how to align vision and language into the unified space using contrastive learning on different semantic levels. We start from the generation of contrastive examples for different semantic components.

Negative example sampling. It has been discussed in Shi et al. (2018) that, to explore a large compositional space of semantics, directly sampling negative captions from a human-built dataset (e.g., MS-COCO captions) is not sufficient. In this paper, instead of manually defining rules that augment the training data as in Shi et al. (2018), we address this problem by sampling contrastive negative examples in the explicitly factorized semantic space. The generation does not require manually labelled data, and can be easily applied to any dataset. For a specific caption, we generate the following four types of contrastive negative samples.

  • Nouns. We sample negative noun words from all nouns that do not appear in the caption. (For the MS-COCO dataset, a negative noun must not appear in any of the 5 captions associated with the same image. This also applies to other components.)

  • Attribute-noun pairs. We sample negative pairs by randomly substituting the adjective by another adjective or substituting the noun.

  • Relational triples. We sample negative triples by randomly substituting the subject, or the relation, or the object. Moreover, we also sample the whole relational triples of captions in the dataset which describe other images, as the negative triples.

  • Sentences. We sample negative sentences from the whole dataset. Meanwhile, following Frome et al. (2013); Faghri et al. (2018), we also sample negative images from the whole dataset as contrastive images.
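The component-level sampling rules listed above can be sketched with the standard library alone. The helper names, the uniform choice between substitution slots and the toy vocabularies below are assumptions for illustration; the sketch also omits the sampling of whole triples from captions of other images.

```python
import random

def sample_negative_noun(caption_nouns, noun_vocab, rng):
    """Pick a noun that does not appear in the caption(s) of the image."""
    candidates = sorted(set(noun_vocab) - set(caption_nouns))
    return rng.choice(candidates)

def sample_negative_pair(pair, adjectives, nouns, rng):
    """Substitute either the adjective or the noun of an (adjective, noun) pair."""
    adj, noun = pair
    if rng.random() < 0.5:
        adj = rng.choice([a for a in adjectives if a != adj])
    else:
        noun = rng.choice([n for n in nouns if n != noun])
    return (adj, noun)

def sample_negative_triple(triple, entities, relations, rng):
    """Substitute the subject, the relation or the object of a triple."""
    slot = rng.randrange(3)                       # 0: subject, 1: relation, 2: object
    pool = relations if slot == 1 else entities
    new = list(triple)
    new[slot] = rng.choice([w for w in pool if w != triple[slot]])
    return tuple(new)

rng = random.Random(0)
nouns = ["clock", "wall", "table", "basin", "dog"]
neg_noun = sample_negative_noun({"clock", "wall", "table"}, nouns, rng)
neg_pair = sample_negative_pair(("white", "clock"), ["white", "wooden", "black"], nouns, rng)
neg_triple = sample_negative_triple(("clock", "above", "table"), nouns, ["above", "on", "under"], rng)
```

Each negative differs from its positive counterpart in exactly one slot, so the contrast isolates a single semantic component.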

The key motivation behind our visual-semantic alignment is that: an object appears in a local region of the image, while the aggregation of all local regions should be aligned with the full semantics of a caption.

Local region-level alignment. In detail, we propose a relevance-weighted alignment mechanism for linking textual object descriptors and local image regions. As shown in Fig. 3, consider the embedding of a positive textual object descriptor $u_o^+$, a negative textual object descriptor $u_o^-$ and the set of local region embeddings $V_i$ extracted from the image, where $i$ indexes the regions. We generate a relevance map $M$, with $M_i$ representing the relevance between $u_o^+$ and $V_i$, computed as in Eq. (2). We compute the loss for noun and (adjective, noun) pairs by:

$$M_i = \frac{\exp(s(u_o^+, V_i))}{\sum_j \exp(s(u_o^+, V_j))}, \quad (2)$$

$$\ell_{obj} = \sum_i M_i \cdot |\delta + s(u_o^-, V_i) - s(u_o^+, V_i)|_+, \quad (3)$$

where $\delta$ is the margin, $s(\cdot, \cdot)$ is the similarity function and $|x|_+ = \max(x, 0)$, as in Eq. (1). The intuition behind the definition is that we explicitly try to align the embedding at each image region with $u_o^+$. The losses are weighted by the matching score, thus reinforcing the correspondence between $u_o^+$ and the matched region. This technique is related to multi-instance learning Wu et al. (2015).
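The relevance-weighted alignment can be sketched as follows. This is an illustrative sketch under stated assumptions: the relevance map is taken to be a softmax over the cosine similarities between the positive descriptor and each region, and the margin default and function names are hypothetical.

```python
import numpy as np

def cosine_to_regions(u, regions, eps=1e-12):
    """Cosine similarity between a vector u (d,) and each row of regions (r, d)."""
    u = u / max(np.linalg.norm(u), eps)
    norms = np.maximum(np.linalg.norm(regions, axis=1), eps)
    return (regions @ u) / norms

def relevance_weighted_loss(u_pos, u_neg, regions, margin=0.2):
    """Per-region hinge loss, weighted by how relevant each region is to u_pos."""
    s_pos = cosine_to_regions(u_pos, regions)
    s_neg = cosine_to_regions(u_neg, regions)
    e = np.exp(s_pos - s_pos.max())
    relevance = e / e.sum()                        # softmax over regions (assumed form)
    hinge = np.maximum(0.0, margin + s_neg - s_pos)
    return float(np.sum(relevance * hinge))

# Two toy regions; the first one matches the positive descriptor exactly.
regions = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
good = relevance_weighted_loss(np.array([1.0, 0.0]), np.array([-1.0, 0.0]), regions)
bad = relevance_weighted_loss(np.array([-1.0, 0.0]), np.array([1.0, 0.0]), regions)
```

Regions that match the positive descriptor well receive a larger weight, so violations there dominate the loss, while poorly matching regions contribute little.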

Figure 3: An illustration of our relevance-weighted alignment mechanism. The relevance map $M$ shows the similarity of each region with the object embedding $u_o^+$. We weight the alignment loss with the map to reinforce the correspondence between $u_o^+$ and its matched region.

| Model | Object attack R@1 | R@5 | R@10 | rsum | Attribute attack R@1 | R@5 | R@10 | rsum | Relation attack R@1 | R@5 | R@10 | rsum | total sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VSE++ | 32.3 | 69.6 | 81.4 | 183.3 | 19.8 | 59.4 | 76.0 | 155.2 | 26.1 | 66.8 | 78.7 | 171.6 | 510.1 |
| VSE-C | 41.1 | 76.0 | 85.6 | 202.7 | 26.7 | 61.0 | 74.3 | 162.0 | 35.5 | 71.1 | 81.5 | 188.1 | 552.8 |
| UniVSE ($u_{sent}$ + $u_{comp}$) | 45.3 | 78.3 | 87.3 | 210.9 | 35.3 | 71.5 | 83.1 | 189.9 | 39.0 | 76.5 | 86.7 | 202.2 | 603.0 |
| UniVSE ($u_{sent}$) | 40.7 | 76.4 | 85.5 | 202.6 | 30.0 | 70.5 | 80.6 | 181.1 | 32.6 | 72.6 | 83.5 | 188.7 | 572.4 |
| UniVSE ($u_{sent}$ + $u_{obj}$) | 42.9 | 77.2 | 85.6 | 205.7 | 30.1 | 69.0 | 79.8 | 178.9 | 34.0 | 71.2 | 83.6 | 188.8 | 573.4 |
| UniVSE ($u_{sent}$ + $u_{attr}$) | 40.1 | 73.9 | 83.3 | 197.3 | 37.4 | 72.0 | 81.9 | 191.3 | 30.5 | 70.0 | 81.9 | 182.4 | 571.0 |
| UniVSE ($u_{sent}$ + $u_{rel}$) | 45.4 | 77.1 | 85.5 | 208.0 | 29.2 | 68.1 | 78.5 | 175.8 | 42.8 | 77.5 | 85.6 | 205.9 | 589.7 |

Table 1: Results on the image-to-sentence retrieval task with text-domain adversarial attacks. For each caption, we generate 5 adversarial fake captions which do not match the images. Thus, the models need to retrieve 5 positive captions from 30,000 candidate captions.

Global image-level alignment. For relational triples $u_{rel}$, semantic component aggregations $u_{comp}$ and sentences $u_{sent}$, their semantics usually cover multiple objects. Thus, we align them with the full image embedding $v$ via bidirectional ranking losses as in Eq. (1) (only textual negative samples are used for $\ell_{rel}$). The alignment losses are denoted as $\ell_{rel}$, $\ell_{comp}$ and $\ell_{sent}$, respectively.

We want to highlight that, during training, we separately align the two types of semantic representations of the caption, i.e., $u_{sent}$ and $u_{comp}$, with the image. This differs from the inference-time computation of the caption embedding, which combines the two with the weight $\alpha$. Recall that $\alpha$ can be viewed as a factor that balances the training objective and the enforcement of semantic coverage. This allows us to flexibly adjust $\alpha$ during inference.

4 Experiments

We evaluate our model on the MS-COCO Lin et al. (2014) dataset. It contains 82,783 training images with each image annotated by 5 captions. We use the common 1K validation and test split from Karpathy and Fei-Fei (2015).

We first validate the effectiveness of enforcing the semantic coverage of caption embeddings by comparing models on cross-modal retrieval tasks with adversarial examples. We then propose a unified text-to-image retrieval task to support the contrastive learning on various semantic components. We end this section with an application of using visual cues to facilitate the semantic parsing of novel sentences. We include two baselines for comparison: VSE++ Faghri et al. (2018) and VSE-C Shi et al. (2018).

4.1 Retrieval under text-domain adversarial attack

Recent works Shi et al. (2018); Shekhar et al. (2017) have raised concerns about the robustness of the learned visual-semantic embeddings. They show that existing models are vulnerable to text-domain adversarial attacks (i.e., using adversarial captions) and can be easily fooled. This is closely related to the bias in small datasets over a large, compositional semantic space Shi et al. (2018). To examine the robustness of the learned Unified VSE, we further conduct experiments on the image-to-sentence retrieval task with text-domain adversarial attacks. Following Shi et al. (2018), we first design several types of adversarial captions by adding perturbations to existing captions.

  1. Object attack: Randomly replace an object in the original caption with an irrelevant one, or append an irrelevant object.

  2. Attribute attack: Randomly replace / add an irrelevant attribute modifier for one object in the original caption.

  3. Relational attack: 1) Randomly replace the subject/relation/object word by an irrelevant one. 2) Randomly select an entity as a subject/object and add an irrelevant relational word and object/subject.
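As a concrete illustration of the first attack type, the replacement variant of the object attack can be sketched as below. The tokenization, the notion of an "irrelevant" noun (any vocabulary noun not associated with the image) and the function name are simplifying assumptions, not the procedure used to build the benchmark.

```python
import random

def object_attack(tokens, caption_nouns, image_nouns, noun_vocab, rng):
    """Replace one noun of the caption by a noun unrelated to the image."""
    irrelevant = sorted(set(noun_vocab) - set(image_nouns))
    target = rng.choice(sorted(caption_nouns))    # noun to be replaced
    fake = rng.choice(irrelevant)                 # irrelevant substitute
    return [fake if t == target else t for t in tokens]

rng = random.Random(0)
tokens = "a white clock on the wall".split()
adversarial = object_attack(tokens,
                            caption_nouns={"clock", "wall"},
                            image_nouns={"clock", "wall", "table"},
                            noun_vocab=["clock", "wall", "table", "dog", "basin"],
                            rng=rng)
```

The adversarial caption keeps the original length and word order and differs only in the substituted noun, which is what makes it a hard negative for a sentence encoder that attends to only part of the sentence.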

The results are shown in Table 1 where different columns represent different types of attacks. VSE++ performs worst as it is only optimized for the retrieval performance on the dataset. Its sentence encoder is insensitive to a small perturbation in the text. VSE-C explicitly generates the adversarial captions based on human-designed rules as hard negative examples during training, which makes it relatively robust to those adversarial attacks. Unified VSE shows strong robustness across all types of adversarial attacks and outperforms all baselines.
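For reference, the R@K and rsum figures reported in Table 1 can be computed along the following lines. This is a minimal sketch that assumes R@K counts an image as a hit when at least one of its matched captions is ranked among the top K candidates; the function names are illustrative.

```python
import numpy as np

def recall_at_k(sim, positives, ks=(1, 5, 10)):
    """Recall@K in percent for image-to-sentence retrieval.

    sim: (n_images, n_captions) similarity matrix.
    positives: list where positives[i] is the set of caption indices matching image i.
    """
    order = np.argsort(-sim, axis=1)              # captions sorted by decreasing similarity
    result = {}
    for k in ks:
        hits = [len(positives[i] & set(order[i, :k].tolist())) > 0
                for i in range(sim.shape[0])]
        result[k] = 100.0 * float(np.mean(hits))
    return result

def rsum(recalls):
    """Sum of the recall values, as in the rsum columns."""
    return sum(recalls.values())

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8]])
recalls = recall_at_k(sim, positives=[{0}, {1}], ks=(1, 3))
```

In this toy case the first image retrieves its caption at rank 1 while the second only finds it at rank 3, so R@1 is 50 and R@3 is 100.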

The ability of Unified VSE to defend against adversarial texts comes almost for free: we present zero adversarial captions during training. Unified VSE builds fine-grained semantic alignments via the contrastive learning of semantic components. It uses the explicit aggregation of the components to alleviate the dataset biases.

Ablation study: semantic components. We now delve into the effectiveness of different semantic components by choosing different combinations of components for the caption embedding. As shown in Table 1, we use different subsets of the semantic components to form the bag-of-components embedding $u_{comp}$. For example, in UniVSE ($u_{sent}$ + $u_{obj}$), only object nouns are selected and aggregated as $u_{comp}$.

The results demonstrate the effectiveness of the enforcement of semantic coverage: even if the semantic components have obtained fine-grained alignment with visual concepts, directly using $u_{sent}$ as the caption encoding still degrades the robustness against adversarial examples. Consistent with the intuition, enforcing the coverage of a certain type of component (e.g., objects) helps the model defend against adversarial attacks of the same type (e.g., adversarial attacks on nouns). Combining all components leads to the best performance.

Figure 4: The performance of UniVSE on cross-modal retrieval tasks with different combination weights $\alpha$. (a) Normal cross-modal retrieval (5,000 captions). (b) Adversarially attacked image-to-sentence retrieval (30,000 captions). Our model can effectively defend against adversarial attacks, with no sacrifice in performance on other tasks, by choosing a reasonable $\alpha$ (we therefore fix $\alpha$ to one such value in all other experiments).

Choice of the combination factor $\alpha$. We study the choice of $\alpha$ by conducting experiments on both the normal retrieval tasks and the adversarial one. Fig. 4 shows the R@1 performance under the normal and adversarial retrieval scenarios w.r.t. different choices of $\alpha$. We observe that the $u_{comp}$ term contributes little on the normal cross-modal retrieval tasks but largely on tasks with adversarial attacks. Recall that $\alpha$ can be viewed as a factor that balances the training objective and the enforcement of semantic coverage. By choosing $\alpha$ from a reasonable region (e.g., from 0.6 to 0.8), our model can effectively defend against adversarial attacks, with no sacrifice in overall performance.

Figure 5: The relevance maps and grounded areas obtained from the retrieved images w.r.t. three queries. The relevance map is visualized with a temperature-scaled softmax. Pixels in white indicate a higher matching score. Note that the third image of the query “black dog” contains two dogs, while our model successfully locates the black one (on the left). It also succeeds in finding the white dog in the first image of “white dog”. Moreover, for the query “player swing bat”, although there are many players in the image, our model only attends to the man swinging the bat.
Figure 6: An example showing that our model can leverage images to assist semantic parsing when there is ambiguity in the sentence. We can infer that the matching score of “girl eat burger” is much higher than that of “sweater eat burger”, which helps to eliminate the ambiguity. Note that the other components in the scene graph are also correctly inferred by our model.

4.2 Unified Text-to-Image Retrieval

| Task | obj | attr | rel | obj (det) | sum |
|---|---|---|---|---|---|
| VSE++ | 29.95 | 26.64 | 27.54 | 50.57 | 134.70 |
| VSE-C | 27.48 | 28.76 | 26.55 | 46.20 | 128.99 |
| UniVSE (all) | 39.49 | 33.43 | 39.13 | 58.37 | 170.42 |
| UniVSE (obj) | 39.71 | 33.37 | 34.38 | 56.84 | 164.30 |
| UniVSE (attr) | 31.31 | 37.51 | 34.73 | 52.26 | 155.81 |
| UniVSE (rel) | 37.55 | 32.70 | 39.57 | 59.12 | 168.94 |

Table 2: The mAP performance on the unified text-to-image retrieval task. Please refer to the text for details.

We extend the word-to-scene retrieval used by Shi et al. (2018) into a general unified text-to-image retrieval task. In this task, models receive queries of different semantic levels, including single words (e.g., “Clock.”), noun phrases (e.g., “White clock.”), relational phrases (e.g., “Clocks on wall”) and full sentences. For all baselines, the texts of different types are treated as full sentences. The results are presented in Table 2.

We generate positive image-text pairs by randomly choosing an image and a semantic component from the 5 captions matched with the chosen image. It is worth mentioning that the semantic components extracted from captions may not cover all visual concepts in the corresponding image, which makes the annotation noisy. To address this, we also leverage the MS-COCO detection annotations to facilitate the evaluation (see the obj (det) column). We treat the labels of the detection bounding boxes as the annotation of objects in the scene.

Ablation study: contrastive learning of components. We evaluate the effectiveness of using contrastive samples for different semantic components. In Table 2, UniVSE (obj) denotes the model trained with only contrastive samples of noun components. The same notation applies to other models. The UniVSE trained with a certain type of contrastive examples (e.g., UniVSE (obj) with contrastive nouns) consistently improves the retrieval performance of the same type of queries (e.g., retrieving images from a single noun). UniVSE trained with all kinds of contrastive samples performs best overall and shows a significant gap w.r.t. the baselines.

Visualization of the semantic alignment. We visualize the semantic-relevance map on an image w.r.t. a given query for a qualitative evaluation of the alignment performance of various semantic components. The map is computed as the similarity between each image region and the query embedding, in a similar way as Eq. (2). As shown in Fig. 5, this visualization helps to verify that our model successfully aligns different semantic components with the corresponding image regions.

4.3 Semantic Parsing with Visual Cues

As a side application, we show how the learned unified VSE space can provide visual cues to help the semantic parsing of sentences. Fig. 6 shows the general idea. When parsing a sentence, ambiguity may occur, e.g., the subject of the relational word eat may be sweater or burger. It is not easy for a textual parser to decide which one is correct because of the innate syntactic ambiguity. However, we can use the image depicted by this sentence to assist the parsing. This is related to previous works on using image segmentation models to facilitate sentence parsing Christie et al. (2016).

This motivates us to design two tasks: 1) recovering the dependency between attributes and entities, and 2) recovering the relational triples. In detail, we first extract the entities, attributes and relational words from the raw sentence without knowing their dependencies. For each possible combination of a certain semantic component, our model computes its embedding in the unified joint space. E.g., in Fig. 6, there are multiple possible dependencies for eat. We choose the combination with the highest matching score with the image to decide the subject/object dependencies of the relation eat. We use parsed semantic components as the ground truth and report the accuracy, defined as the ratio of the number of correct dependency resolutions to the total number of attributes/relations.
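The selection step for relational triples can be sketched as an exhaustive search over candidate subject/object assignments. In this hedged sketch, `score_fn` is a hypothetical stand-in for the similarity between a triple embedding and the image embedding, and the toy scores are invented for illustration.

```python
from itertools import permutations

def resolve_relation(relation, entities, score_fn):
    """Return the (subject, relation, object) triple with the highest matching score."""
    candidates = [(s, relation, o) for s, o in permutations(entities, 2)]
    return max(candidates, key=score_fn)

# Invented scores standing in for image-triple matching scores.
toy_scores = {("girl", "eat", "burger"): 0.8,
              ("sweater", "eat", "burger"): 0.2}
best = resolve_relation("eat", ["girl", "sweater", "burger"],
                        lambda triple: toy_scores.get(triple, 0.0))
```

The same pattern applies to attribute-object dependencies by enumerating (adjective, noun) pairs instead of ordered entity pairs.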

Table 3 reports the results on assisting semantic parsing with visual cues, compared with other baselines. Fig. 6 shows a real case in which we successfully resolve the textual ambiguity.

| Task | attributed object | relational phrase |
|---|---|---|
| Random | 37.41 | 31.90 |
| VSE++ | 41.12 | 43.31 |
| VSE-C | 43.44 | 41.08 |
| UniVSE | 64.82 | 62.69 |

Table 3: The accuracy of different models on recovering word dependencies with visual cues. In the “Random” baseline, we randomly assign the word dependencies.

5 Conclusion

We present a unified visual-semantic embedding approach that learns a joint representation space of vision and language in a factorized manner: Different levels of textual semantic components such as objects and relations get aligned with regions of images. A contrastive learning approach for semantic components is proposed for the efficient learning of the fine-grained alignment. We also introduce the enforcement of semantic coverage: each caption embedding should cover all semantic components in the sentence. Unified VSE shows superiority on multiple cross-modal retrieval tasks and can effectively defend against text-domain adversarial attacks. We hope the proposed approach can empower machines that learn vision and language jointly, efficiently and robustly.


  • Abend et al. (2017) Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. 2017. Bootstrapping language acquisition. Cognition, 164:116–143.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
  • Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.
  • Christie et al. (2016) Gordon Christie, Ankit Laddha, Aishwarya Agrawal, Stanislaw Antol, Yash Goyal, Kevin Kochersberger, and Dhruv Batra. 2016. Resolving language and vision ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes. arXiv preprint arXiv:1604.02125.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
  • Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634.
  • Eisenschtat and Wolf (2017) Aviv Eisenschtat and Lior Wolf. 2017. Linking Image and Text with 2-way Nets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1855–1865.
  • Engilberge et al. (2018) Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. 2018. Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3984–3993.
  • Faghri et al. (2018) Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.
  • Fazly et al. (2010) Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. 2010. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063.
  • Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems (NIPS), pages 2121–2129.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
  • Huang et al. (2017) Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7254–7262.
  • Johnson et al. (2018) Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image Generation from Scene Graphs.
  • Johnson et al. (2015) Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2015. Image Retrieval using Scene Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137.
  • Karpathy et al. (2014) Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Conference on Neural Information Processing Systems (NIPS), pages 1889–1897.
  • Kiros et al. (2014a) Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014a. Multimodal Neural Language Models. In International Conference on Machine Learning (ICML), pages 595–603.
  • Kiros et al. (2014b) Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
  • Levy et al. (2018) Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. 2018. Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum. arXiv preprint arXiv:1805.03716.
  • Liang et al. (2013) Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning Dependency-Based Compositional Semantics. Computational Linguistics, 39(2):389–446.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755.
  • Liu et al. (2017) Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a Recurrent Residual Fusion Network for Multimodal Matching. In IEEE International Conference on Computer Vision (ICCV), pages 4127–4136.
  • Lu et al. (2016) Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European Conference on Computer Vision (ECCV), pages 852–869. Springer.
  • Ma et al. (2015) Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal Convolutional Neural Networks for Matching Image and Sentence. In IEEE International Conference on Computer Vision (ICCV), pages 2623–2631.
  • Montague (1970) Richard Montague. 1970. Universal Grammar. Theoria, 36(3):373–398.
  • Niu et al. (2017) Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding. In IEEE International Conference on Computer Vision (ICCV), pages 1899–1907.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Ren et al. (2016) Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2016. Joint Image-Text Representation by Gaussian Visual-Semantic Embedding. In ACM Multimedia (ACM-MM), pages 207–211.
  • Ren et al. (2017) Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2017. Multiple Instance Visual-Semantic Embedding. In British Machine Vision Conference (BMVC).
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
  • Schuster et al. (2015) Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language (VL15), Lisbon, Portugal. Association for Computational Linguistics.
  • Shekhar et al. (2017) Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find One Mismatch between Image and Language Caption. arXiv preprint arXiv:1705.01359.
  • Shi et al. (2018) Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning Visually-Grounded Semantics from Contrastive Adversarial Samples. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3715–3727.
  • Shrivastava et al. (2016) Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5253–5262.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.
  • Wang et al. (2016) Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013.
  • Wu et al. (2015) Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xiao et al. (2017) Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5945–5954.
  • Xu and Saenko (2016) Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In European Conference on Computer Vision (ECCV), pages 451–466. Springer.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning (ICML), pages 2048–2057.
  • You et al. (2018) Quanzeng You, Zhengyou Zhang, and Jiebo Luo. 2018. End-to-End Convolutional Semantic Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5735–5744.
  • Zettlemoyer and Collins (2005) Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 658–666.