Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

by   Hassan Akbari, et al.
Columbia University

We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. This common space is instantiated at multiple layers of a Deep Convolutional Neural Network by exploiting its feature maps, as well as contextualized word-level and sentence-level embeddings extracted from a character-based language model. After dedicated non-linear mappings for the visual features at each level and for the word and sentence embeddings, we obtain a common space in which comparisons between the target text and the visual content at any semantic level can be performed simply with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at different semantic levels. The best level is selected to be compared with the text content, maximizing the pertinence scores of ground-truth image-sentence pairs. Experiments conducted on three publicly available benchmarks show significant performance gains in phrase localization over the state of the art and set a new performance record on those datasets. We also provide a detailed ablation study showing the contribution of each element of our approach.







1 Introduction

Phrase grounding [38, 31] is the task of localizing, within an image, a given natural language input phrase, as illustrated in Figure 1. This ability to link text and image content is a key component of many visual semantic tasks such as image captioning [10, 21, 18], visual question answering [2, 29, 48, 52, 11], text-based image retrieval [12, 39], and robotic navigation [44]. It is especially challenging as it requires a good representation of both the visual and textual domains and an effective way of linking them.

Figure 1: The phrase grounding task in the pointing game setting. Given the sentence on top and the image on the left, the goal is to point (illustrated by the stars here) to the correct location of each natural language query (colored text). An actual example of our method's results on Flickr30k.

On the visual side, most works exploit Deep Convolutional Neural Networks but often rely on bounding box proposals [38, 41, 15] or use a global feature of the image [10], limiting the localization ability and freedom of the method. On the textual side, methods rely on a closed vocabulary or try to train their own language model on small image-caption pair datasets [17, 59, 53, 9]. Finally, the mapping between the two modalities is often performed with a weak linear strategy [38, 51]. We argue that approaches in the literature have not fully leveraged the potential of the more powerful visual and textual models developed recently, and that there is room for developing more sophisticated representations and mapping approaches.

In this work, we propose to explicitly learn a non-linear mapping of the visual and textual modalities into a common space, and do so at different granularities for each domain. Indeed, different layers of a deep network encode each region of the image with gradually increasing levels of discriminativeness and context awareness; similarly, single words and whole sentences carry increasing levels of semantic meaning. This common space mapping is trained with weak supervision and exploited at test time with a multi-level multimodal attention mechanism, whose natural formalism for computing attention heatmaps at each level, attended features, and pertinence scores enables us to solve the phrase grounding task elegantly and effectively. We evaluate our model on three datasets commonly used in the textual grounding literature and show that it sets a new state-of-the-art performance by a large margin.

Our contributions in this paper are as follows:

  • We learn, with weak-supervision, a non-linear mapping of visual and textual features to a common region-word-sentence semantic space, where comparison between any two semantic representations can be performed with a simple cosine similarity;

  • We propose a multi-level multimodal attention mechanism, which can produce either word-level or sentence-level attention maps at different semantic levels, enabling us to choose the most representative attended visual feature among different semantic levels;

  • We set new state-of-the-art performance on three commonly used datasets, and give detailed ablation results showing how each part of our method contributes to the final performance.

In the following section, we provide a brief overview of related works in the literature; we then elaborate on our method in the sequel.

2 Related works

2.1 Grounding natural language in images

The earliest works on the textual grounding task [38, 41, 15] tried to tackle the problem by finding the right bounding box out of a set of proposals, usually obtained from pre-specified models [62, 45]. The ranking of these proposals, for each text query, can be performed using scores estimated from a reconstruction [41] or sentence generation [15] procedure, or using distances in a common space [38]. However, relying on a fixed set of pre-defined concepts and proposals may not be optimal, and the quality of the bounding boxes defines an upper bound [15, 46] on the achievable performance. Therefore, several methods [6, 61] have proposed to integrate the proposal step into their framework to improve bounding box quality. Works relying on bounding boxes often operate in a fully supervised setting [5, 53, 57, 11, 6], where the mapping between sentences and bounding boxes has to be provided at training time, which is not always available and is costly to gather. It is also worth mentioning that methods based on bounding boxes often extract features separately for each bounding box [15, 4, 46], inducing a high computational cost.

Some works [40, 17, 59, 47, 54] therefore choose not to rely on bounding boxes and propose to formalize the localization problem as finding a spatial heatmap for the referring expression. This setting is mostly weakly-supervised, where at training time only the image and the text (describing either the whole image or some parts of it) are provided but not the corresponding bounding box or segmentation mask for each description. This is the more general setting we are addressing in this paper. The top-down approaches [40, 59] and the attention-based approach [17] learn to produce a heatmap for each word of a vocabulary. At test time, all these methods produce the final heatmap by averaging the heatmaps of all the words in the query that exist in the vocabulary. Several grounding works have also explored the use of additional knowledge, such as image [46] and linguistic [47, 37] structures, phrase context [5] and exploiting pre-trained visual models predictions [4, 54].

In contrast to many works in the literature, we use no pre-defined word or image concepts in our method. We also do not leverage any knowledge from classification or object detection tasks, nor explicitly exploit image or sentence structures. We instead rely on a character-based language model with contextualized embeddings, which can handle any unseen word by considering its context in the sentence. As we explain in the sequel, the sentence and each word in it are assigned a spatial heatmap reflecting their similarity to different regions of the image at different visual semantic levels.

2.2 Mapping to common space

It is a common approach to extract visual and language features independently and fuse them before the prediction [9, 4, 6]. Current works usually apply a multi-layer perceptron (MLP) [6, 4], element-wise multiplication [14], or cosine similarity [9] to combine representations from different modalities. Other methods have used Canonical Correlation Analysis (CCA) [37, 38], which finds linear projections that maximize the correlation between projected vectors from the two views of heterogeneous data. [11] introduced the Multimodal Compact Bilinear (MCB) pooling method, which fuses visual and language features with a compressed representation of the outer product of the two feature vectors. Attention methods can also measure the matching of an image-sentence feature pair. In [51, 33], attention maps are generated from the dot product of linear projections of the visual and language features. In contrast, we use separate non-linear mappings of both the visual features (at multiple semantic levels) and the textual embeddings (both contextualized word and sentence embeddings), and use multi-level attention with a multimodal loss to learn the mapping weights.

2.3 Attention mechanisms

Attention has proved its effectiveness in many visual and language tasks [23, 1, 7, 52, 50]; it is designed to capture a better representation of image-sentence pairs based on their interactions. The Accumulated Attention method [8] proposes to estimate attention on sentences, objects, and visual feature maps in an iterative fashion, where at each iteration the attention over the other two modalities is exploited as guidance. A dense co-attention mechanism is explored in [33] to solve the Visual Question Answering task, using a fully symmetric architecture between visual and language representations. Their attention mechanism adds a dummy location to the attention map, to be attended when no region or word is relevant, along with a softmax. In AttnGAN [51], a deep attention multimodal similarity model is proposed to compute a fine-grained image-text matching loss. In contrast to these works, we remove the softmax on top of the attention maps and let the model decide which word-region pairs are related, guided by the multimodal loss. Since we map the visual features to a multi-level visual representation, we give the model the freedom to choose any location at any level for either the sentence or each word. In other words, each word can choose which level of representation (and which region in that representation) to attend to, and the same freedom is provided for the sentence. We directly calculate this attention map by cosine similarity in the common space we learn for word, sentence, and multi-level semantic visual representations. We show that this approach significantly outperforms all state-of-the-art approaches on three commonly used datasets, setting a new state-of-the-art performance.

Figure 2: Overview of our method: the textual input is processed with a pre-trained text model followed by a non-linear mapping to the common semantic space. Similarly for the image input, we use a pre-trained visual model to extract visual features maps at multiple levels and learn a non-linear mapping for each of them to the common semantic space. A multi-level attention mechanism followed by a feature level selection produces the pertinence score between the image and the sentence. We train our model using only the weak supervision of image-sentence pairs.

3 Method

In this section, we describe our method (illustrated in Figure 2) for addressing the textual grounding task and elaborate on each part in detail. In Section 3.1, we explain how we extract multi-level visual features from an image and word/sentence embeddings from the text, and how we map them to a common space. In Section 3.2, we describe how we calculate a multi-level multimodal attention map and an attended visual feature for each word/sentence. In Section 3.3, we describe how we choose the visual feature level most representative of the given text. Finally, in Section 3.4, we define a multimodal loss to train the whole model using weak supervision.

3.1 Feature Extraction and Common Space

Visual Feature Extraction: In contrast to many vision tasks, where the last layer of a pre-trained CNN is used as the visual representation of an image, we use feature maps from different layers and map them separately to a common space, obtaining a multi-level set of feature maps to be compared with text. Intuitively, different levels of visual representation are necessary for covering a wide range of visual concepts and patterns [26, 55, 58]. Thus, we extract sets of feature maps from $L$ different levels of a visual network, upsample them by bi-linear interpolation (as transposed convolution produces checkerboard artifacts [34]) to a fixed resolution for all the levels, and then apply 3 layers of 1x1 convolution (with LeakyReLU [30]) with $D$ filters to map them into equal-sized feature maps. Finally, we stack these feature maps and space-flatten them to obtain an overall image representation tensor $V \in \mathbb{R}^{L \times N \times D}$, where $N$ is the number of spatial locations at each level. This tensor is finally normalized by the $\ell_2$-norm of its last dimension. An overview of the feature extraction and common space mapping for the image can be seen in the left part of Figure 3.


In this work, we use VGG [42] as a baseline for fair comparison with other works in the literature [10, 47, 17], and the state-of-the-art CNN PNASNet-5 [28] to study the ability of our model to exploit a more powerful visual model. We detail the feature maps selected for each model in Section 4.2.
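As a rough illustration of the visual pathway described above, the NumPy sketch below (our own code, not the authors') maps multi-level feature maps into a common space. For simplicity it substitutes nearest-neighbour resizing for bilinear interpolation and plain channel-wise matrix multiplies for the learned 1x1 convolutions; all names and shapes are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize by the l2-norm along `axis`, as done for the common-space tensor."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def nearest_resize(x, out_hw):
    """Nearest-neighbour resize, used only to keep this sketch dependency-free
    (the paper uses bilinear interpolation)."""
    h, w, _ = x.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return x[rows][:, cols]

def map_feature_maps_to_common_space(feature_maps, level_weights, out_hw):
    """Map feature maps from L CNN levels into a common space.

    feature_maps: list of (H_l, W_l, C_l) arrays, one per level.
    level_weights: per level, a list of channel-projection matrices standing in
                   for the three 1x1 convolutions (a 1x1 conv acts per pixel).
    Returns V of shape (L, N, D) with N = out_h * out_w, l2-normalized over D.
    """
    levels = []
    for fmap, ws in zip(feature_maps, level_weights):
        x = nearest_resize(fmap, out_hw)         # fixed resolution for all levels
        for i, W in enumerate(ws):
            x = x @ W                            # 1x1 convolution as matmul
            if i < len(ws) - 1:
                x = np.where(x > 0, x, 0.1 * x)  # LeakyReLU between layers
        levels.append(x.reshape(-1, x.shape[-1]))  # space-flatten: (N, D)
    return l2_normalize(np.stack(levels, axis=0))  # (L, N, D)
```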

Figure 3: Left: we choose feature maps from different convolutional blocks of a CNN model, resize them to the same spatial dimensions using bi-linear interpolation, and map them to feature maps of the same size. Right: word and sentence embeddings are mapped to the common space from the pre-trained ELMo [36] model. The green pathway is for word embedding, the red pathway for sentence embedding. All the orange boxes (the 1x1 convolutional layers of the visual mapping, the linear combination, and the two sets of fully connected layers of the textual mapping) are the trainable parameters of our projection to the common space.

Textual Feature Extraction: State-of-the-art works in grounding use a variety of approaches for textual feature extraction. Some use LSTMs or BiLSTMs pre-trained on big datasets (e.g., Google 1 Billion [3]) on top of either word2vec [32] or GloVe [35] representations. Others train a BiLSTM solely on image-caption datasets (mostly MSCOCO) and argue that it is necessary to train it from scratch to distinguish between visual concepts which may not be distinguishable in language (e.g., red and green are different in vision but similar in language, as both are colors) [33, 51, 17, 47, 9, 14, 61, 38, 57, 8]. These works either use the recurrent network outputs at each step as word-level representations, use their last output (in each direction for a BiLSTM) as a sentence-level representation, or use a combination of both.

In this paper, however, we use ELMo [36], a 3-layer network pre-trained on 5.5B tokens, which computes word representations on the fly (based on a character-level CNN, similar to [19, 60]) and feeds them to 2 layers of BiLSTMs that produce contextualized representations. Thus, for a given sentence, the model outputs three representations for each token (split by white space). We take a linear combination of the three representations and feed it to 2 fully connected layers (with weights shared among words), each with $D$ nodes and LeakyReLU as the non-linearity between them, to obtain each word representation $e_t$ (green pathway in the right part of Figure 3). The resulting word-based text representation for an entire sentence is the tensor $E \in \mathbb{R}^{T \times D}$ built by stacking the word representations. The sentence-level text representation is calculated from the concatenation of the last outputs of the BiLSTMs in each direction. Similarly, we apply a linear combination to the two sentence-level representations and map the result to the common space with 2 fully connected layers of $D$ nodes, producing the sentence representation $\bar{e}$ (red pathway in the right part of Figure 3). The word tensor and the sentence vector are normalized by the $\ell_2$-norm of their last dimension before being fed to the multimodal attention block.
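A minimal NumPy sketch of this textual pathway (all function and parameter names are ours, and the shapes are illustrative): a learned linear combination of the three ELMo layer outputs, followed by two shared fully connected layers with a LeakyReLU in between, and a final l2 normalization.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def map_words_to_common_space(elmo_layers, layer_weights, W1, b1, W2, b2):
    """Project contextualized word embeddings into the common space.

    elmo_layers: (3, T, C) array -- the three ELMo representations per token.
    layer_weights: (3,) learned linear-combination weights over the layers.
    W1, b1, W2, b2: the two fully connected layers, shared across words.
    Returns (T, D) word representations, l2-normalized per word.
    """
    e = np.tensordot(layer_weights, elmo_layers, axes=1)  # (T, C) combination
    e = leaky_relu(e @ W1 + b1)                           # first FC + LeakyReLU
    e = e @ W2 + b2                                       # second FC
    return e / np.linalg.norm(e, axis=-1, keepdims=True)  # l2-normalize
```

The sentence pathway is analogous, starting from the concatenated last BiLSTM outputs instead of per-token representations.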

3.2 Multi-Level Multimodal Attention Mechanism

Given the image and sentence, our task is to estimate the correspondences between spatial regions ($n \in \{1,\dots,N\}$) in the image at different levels ($l \in \{1,\dots,L\}$) and words in the sentence at different positions ($t \in \{1,\dots,T\}$). We seek a correspondence measure $a_{l,n,t}$ between each word and each region at each level. We define this correspondence by the cosine similarity between word and image region representations at different levels in the common space:

$$a_{l,n,t} = \max\big(0,\ \langle v_{l,n},\, e_t \rangle\big), \qquad (1)$$

where $v_{l,n}$ is the normalized visual representation of region $n$ at level $l$ and $e_t$ the normalized representation of word $t$. The tensor $A = [a_{l,n,t}]$ represents a multi-level multimodal attention map which can be directly used for calculating either the visual or the textual attended representation. We apply ReLU to the attention map to zero out dissimilar word-visual region pairs, and deliberately avoid applying softmax on any dimension of the heatmap tensor. Note that this choice is very different in spirit from the commonly used approach of applying softmax to attention maps [50, 49, 8, 33, 17, 51, 40]. Indeed, for irrelevant image-sentence pairs the attention maps would be almost all zeros, while the softmax process would always force the attention to be a distribution over the image/words summing to 1. Furthermore, a group of words forming a phrase could share the same attention area, which is again hard to achieve given the competition among regions/words induced by applying softmax on the heatmap. We analyze the influence of this choice experimentally in our ablation study.
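When both modalities are already l2-normalized, the attention map of Eq. 1 reduces to a single tensor contraction followed by a ReLU; a sketch under that assumption (our code, not the authors'):

```python
import numpy as np

def attention_heatmap(V, E):
    """Multi-level multimodal attention map (Eq. 1).

    V: (L, N, D) l2-normalized visual tensor in the common space.
    E: (T, D) l2-normalized word embeddings.
    Returns H of shape (L, N, T): ReLU of cosine similarities. No softmax is
    applied, so for an irrelevant pair the map can be (almost) all zeros.
    """
    H = np.einsum('lnd,td->lnt', V, E)  # cosine similarity of unit vectors
    return np.maximum(H, 0.0)           # ReLU zeroes dissimilar pairs
```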

Given the heatmap tensor, we calculate the attended visual feature for the $l$-th level and $t$-th word as

$$a^{v}_{l,t} = \frac{\sum_{n} a_{l,n,t}\, v_{l,n}}{\sum_{n} a_{l,n,t}}, \qquad (2)$$

which is basically a weighted average over the visual representations of the $l$-th level with the attention heatmap values as weights. In other words, $a^{v}_{l,t}$ is a vector in the hyperplane spanned by a subset of the visual representations in the common space, this subset being selected based on the heatmap tensor. An overview of our multi-level multimodal attention mechanism for calculating the attended visual feature can be seen in Figure 4. In the sequel, we describe how we use this attended feature to choose the most representative hyperplane, and calculate a multimodal loss to be minimized with the weak supervision of image-sentence relevance labels.
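The attended visual feature can be computed as a heatmap-weighted average over regions; a sketch (the exact normalization convention here is our assumption):

```python
import numpy as np

def attended_visual_feature(V, H, eps=1e-8):
    """Attended visual feature per level and word (Eq. 2).

    V: (L, N, D) visual tensor; H: (L, N, T) attention heatmap.
    Returns (L, T, D): for each level l and word t, a weighted average of the
    level-l region vectors, with the heatmap values as weights.
    """
    num = np.einsum('lnt,lnd->ltd', H, V)   # heatmap-weighted sum of regions
    den = H.sum(axis=1)[:, :, None] + eps   # total weight per (level, word)
    return num / den
```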

Figure 4: For each word feature $e_t$, we compute an attention map and an attended visual feature $a^{v}_{l,t}$ at each level $l$. We choose the level that maximizes the similarity between the attended visual feature and the textual feature in the common space to produce the pertinence score $R_t$. This is equivalent to finding the hyperplane (spanned by the visual feature vectors of each level in the common space) that best matches the textual feature.

3.3 Feature Level Selection

Once we have the attended visual feature, we calculate the word-image pertinence score at level $l$ using the cosine similarity between each word and its attended visual feature:

$$R_{l,t} = \cos\big(a^{v}_{l,t},\, e_t\big). \qquad (3)$$

Intuitively, each visual feature map level can carry different semantic information; thus, for each word we apply a hard level attention, taking the score from the level contributing the most:

$$R_t = \max_{l}\, R_{l,t}. \qquad (4)$$

This procedure can be seen as projecting the textual embeddings onto the hyperplanes spanned by the visual features of the different levels and choosing the one that maximizes this projection. Intuitively, the chosen hyperplane is a better representation of the visual feature space attended by word $t$. This can be seen in the top central part of Figure 2, where selecting the maximum pertinence score over levels is equivalent to selecting the hyperplane with the smallest angle with the $t$-th word representation (i.e., the highest similarity between the attended visual feature and the textual feature), and thus the most representative hyperplane (or visual feature level).

Once we have the best word-image pertinence score, similar to [51] and inspired by the minimum classification error [20], we compute the overall (word-based) sentence-image pertinence score as:

$$R_w = \log\left(\Big(\sum_{t} \exp(\gamma_1 R_t)\Big)^{1/\gamma_1}\right), \qquad (5)$$

where $\gamma_1$ is a smoothing hyperparameter.
Similarly, for the sentence we repeat the same procedure (except that we no longer need Eq. 5, since there is a single sentence representation) to obtain the sentence attention map, attended visual feature, and sentence-image pertinence score, respectively:

$$a_{l,n,s} = \max\big(0,\ \langle v_{l,n},\, \bar{e} \rangle\big), \quad a^{v}_{l,s} = \frac{\sum_{n} a_{l,n,s}\, v_{l,n}}{\sum_{n} a_{l,n,s}}, \quad R_s = \max_{l}\, \cos\big(a^{v}_{l,s},\, \bar{e}\big). \qquad (6a,\ 6b,\ 6c)$$
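Putting Eqs. 3-5 together, a NumPy sketch of the level selection and the word-based aggregation (the value of the smoothing hyperparameter below is illustrative, not taken from the paper):

```python
import numpy as np

def pertinence_scores(A, E, gamma=5.0, eps=1e-8):
    """Word-image pertinence with hard level selection (Eqs. 3-5).

    A: (L, T, D) attended visual features; E: (T, D) word embeddings.
    gamma: smoothing hyperparameter of the log-sum-exp aggregation
           (illustrative value, not the paper's).
    Returns (R_t, R_w): per-word scores after the max over levels, and the
    word-based sentence-image pertinence score.
    """
    An = A / (np.linalg.norm(A, axis=-1, keepdims=True) + eps)
    En = E / (np.linalg.norm(E, axis=-1, keepdims=True) + eps)
    R_lt = np.einsum('ltd,td->lt', An, En)  # cosine per level and word (Eq. 3)
    R_t = R_lt.max(axis=0)                  # hard level attention (Eq. 4)
    R_w = np.log(np.exp(gamma * R_t).sum()) / gamma  # smooth max over words (Eq. 5)
    return R_t, R_w
```

Note that the log-sum-exp in Eq. 5 acts as a smooth maximum over words, so a single strongly matching word can dominate the sentence-image score.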
3.4 Multimodal Loss

In this paper, we only use weak supervision in the form of binary image-caption relevance. Thus, similar to [10, 16, 51], we train the network on a batch of image-caption pairs and force it to produce high sentence-image pertinence scores for related pairs and low scores for unrelated pairs. Considering a pertinence score $R$ (either $R_w$ or $R_s$), we calculate the posterior probability of the sentence $S_i$ being matched with the image $I_i$ by applying competition among all sentences in the batch:

$$P(S_i \mid I_i) = \frac{\exp\big(\gamma_2 R(I_i, S_i)\big)}{\sum_{j=1}^{B} \exp\big(\gamma_2 R(I_i, S_j)\big)}. \qquad (7)$$

Similarly, the posterior probability of $I_i$ being matched with $S_i$ is calculated as:

$$P(I_i \mid S_i) = \frac{\exp\big(\gamma_2 R(I_i, S_i)\big)}{\sum_{j=1}^{B} \exp\big(\gamma_2 R(I_j, S_i)\big)}. \qquad (8)$$

Then, similarly to [10, 51], we define the loss using the negative log posterior probability over the relevant image-sentence pairs:

$$\mathcal{L}_R = -\sum_{i=1}^{B} \Big( \log P(S_i \mid I_i) + \log P(I_i \mid S_i) \Big). \qquad (9)$$

As we want to train a common semantic space for both words and sentences, we combine the word loss $\mathcal{L}_w$ (computed with the word-based relevance $R_w$) and the sentence loss $\mathcal{L}_s$ (obtained with $R_s$) to define our final loss:

$$\mathcal{L} = \mathcal{L}_w + \mathcal{L}_s. \qquad (10)$$

This loss is minimized over a batch of images along with their related sentences. In preliminary experiments on held-out validation data, we found hyperparameter values that work well, and we keep them fixed for our experiments. In the next section, we evaluate our proposed model on different datasets and present an ablation study justifying the choices made in our model.
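A compact sketch of the batch-wise loss: given a B x B matrix of pertinence scores whose diagonal holds the ground-truth pairs, the two posteriors are row-wise and column-wise softmaxes (the sharpening factor is an assumed placeholder, and the code is ours):

```python
import numpy as np

def multimodal_loss(R, gamma=10.0):
    """Batch-wise matching loss (Eqs. 7-9) for one pertinence score.

    R: (B, B) matrix with R[i, j] the pertinence score between image i and
    sentence j; the diagonal holds the ground-truth pairs.
    gamma: softmax sharpening factor (an assumed placeholder value).
    Returns the summed negative log posterior in both directions.
    """
    logits = gamma * R
    # P(S_i | I_i): competition among the batch's sentences for each image
    log_p_s = logits.diagonal() - np.log(np.exp(logits).sum(axis=1))
    # P(I_i | S_i): competition among the batch's images for each sentence
    log_p_i = logits.diagonal() - np.log(np.exp(logits).sum(axis=0))
    return -(log_p_s.sum() + log_p_i.sum())
```

The final loss of Eq. 10 would then be `multimodal_loss(R_w_matrix) + multimodal_loss(R_s_matrix)`.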

4 Experiments

In this section, we will first present the datasets we use and our experimental setup. We then evaluate our approach comparing with the state-of-the-art, and further present ablation studies showing the influence of each step of our method.

4.1 Datasets

We here describe the publicly available datasets we have used for training and testing our approach.

MSCOCO 2014

[27] consists of 82,783 training images and 40,504 validation images. Each image is associated with five captions describing it. While an image may be associated with multiple bounding box annotations of objects, there is no textual description for these bounding boxes and we do not exploit them in any way. We use the train split of this dataset for training our model.


Flickr30k

[56] is a dataset of approximately 31K images with 150K descriptive captions. There are 25,380 images in the training set, 2,985 images in the validation set, and 2,984 images in the test set. In the Flickr30k Entities dataset [38], each image caption is divided into multiple phrases, each linked to a specific bounding box in the image, for a total of 244K phrases describing localized bounding boxes. We use the validation split of this dataset for validating our model during training (on MSCOCO) and its test split for evaluating our model after training.


Visual Genome

[25] contains 77,398 images in the training set, and a validation and test set of 5,000 images each. Each image includes multiple bounding box annotations, with a region description associated with each bounding box. Similar to [18], we preprocess the text and remove noisy characters and punctuation from the region descriptions. All bounding box annotations without a description, and all descriptions longer than ten words, have been ignored. Certain images were also found to have bounding box annotations exceeding the image size, which we fixed to lie within the image boundaries. Note that we use the bounding boxes only to evaluate our method and not in training, as our method is weakly supervised. Since the type of region description in this dataset differs from MSCOCO and Flickr30k, we separately train our model on the training split of this dataset.


ReferIt

consists of 20,000 images from the IAPR TC-12 dataset [13], along with 99,535 segmented image regions from the SAIAPR-12 dataset [6]. Images are associated with descriptions for the entire image as well as for localized image regions, collected in a two-player game [22] providing approximately 130K isolated entity descriptions. In our work, we have used only the unique descriptions associated with each region, and any region without an associated description has been ignored. We have used a split similar to [15], which contains 9,000 training images, 1,000 validation images, and 10,000 test images. We use the validation split of this dataset for validating our model while training on the Visual Genome (VG) dataset, and its test split for evaluating the model after training (on VG).

4.2 Experimental Setup

We train on batches of image-caption pairs in which each image (caption) is related to exactly one caption (image). Image-caption pairs are sampled randomly with a uniform distribution. We train the network for 20 epochs with the Adam optimizer [24], dividing the learning rate by 2 once at the 10th epoch and again at the 15th epoch. The common-space mapping dimension $D$ and the LeakyReLU negative slope used in the non-linear mappings are kept fixed, and we regularize the mapping weights with $\ell_2$ regularization. Finally, we list the visual feature map levels selected for both VGG16 and PNASNet in Table 1. Both the visual and textual network weights are frozen during training; only the common space mapping weights are trainable. In the ablation study, we use 10 epochs without dividing the learning rate, while the rest of the settings remain the same.

Network Layer
VGG conv4_1
VGG conv4_3
VGG conv5_1
VGG conv5_3
PNASNet Cell 5
PNASNet Cell 7
PNASNet Cell 9
PNASNet Cell 11
Table 1: List of the layers used as our 4 feature map levels for each network.
Method Settings Flickr30K ReferIt VG
Baseline Random 27.24 24.30 11.15
Baseline Center 49.20 30.40 20.55
FCVC [10] VGG 29.03 33.52 14.03
TD [59] Inception-V2 42.40 31.97 19.31
CGVS [40] Inception-V3 50.10 - -
VGLS [47] VGG - - 24.40
SSS [17] VGG 49.10 39.98 30.03
Ours VGG 61.66 60.01 48.76
Ours PNASNet 69.19 61.89 55.16
Table 2: Phrase localization accuracy (pointing game) on Flickr30K, ReferIt and VisualGenome (VG) compared to state of the art.

4.3 Phrase Localization Evaluation

Figure 5: Image-sentence pair from Flickr30K with four queries (colored text) and the corresponding heatmaps and selected max values (stars). Notably, the model shows an understanding of what is being described: here it points only to the man who is pushing his motocross bike up.

As stated in Section 4.1, we use the train split of MSCOCO for training our model and evaluate it on the test split of Flickr30k. Since the queries in ReferIt and Visual Genome are referring expressions, and not parts of a complete sentence as in Flickr30k, we separately train the model on the train split of Visual Genome and evaluate it on the test splits of ReferIt and Visual Genome. For evaluation on Flickr30k, we feed a complete sentence to the model and take a weighted average of the attention heatmaps of the words in each query, with the word-image pertinence scores of Eq. 4 as weights. For ReferIt and Visual Genome, we treat each query as a single sentence and take its sentence-level attention heatmap as the final query pointing heatmap. Once the pointing heatmaps are calculated, we take the max location as the pointing location for the given query and evaluate the model by the pointing game accuracy: $\mathrm{Acc} = \#\mathrm{hits} / (\#\mathrm{hits} + \#\mathrm{misses})$.
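The pointing game metric itself is simple to implement; a sketch (the box format and names are our own conventions):

```python
import numpy as np

def pointing_game_accuracy(heatmaps, gt_boxes):
    """Pointing game: a query counts as a hit when the heatmap argmax falls
    inside the ground-truth box; Acc = #hits / (#hits + #misses).

    heatmaps: list of (H, W) arrays, one per query.
    gt_boxes: list of (x0, y0, x1, y1) boxes, inclusive pixel coordinates.
    """
    hits = 0
    for hm, (x0, y0, x1, y1) in zip(heatmaps, gt_boxes):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # max (pointing) location
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits += 1
    return hits / len(heatmaps)
```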

Results for the pointing game accuracy can be found in Table 2 for the Flickr30k, ReferIt, and Visual Genome datasets. The results show that our method significantly outperforms all state-of-the-art methods in all conditions and on all datasets. For fair comparison, we used VGG16 as in [10, 17], and the model still gives an absolute pointing game accuracy improvement over the best previous VGG-based results of 12.56% on Flickr30k, 20.03% on ReferIt, and 18.73% on VisualGenome, corresponding to relative improvements of 25.6%, 50.1%, and 62.4%, respectively. Results with the more recent PNASNet model are even better, especially on Flickr30K and VisualGenome. In the next section, we break our method into its different parts to study the efficacy of each of our choices, and elaborate on the parts of the model contributing most to these results.

4.4 Ablation Study

We trained multiple configurations of our approach on MSCOCO and evaluated them on Flickr30K, with a PNASNet visual model, to better understand which aspects of our method affect the performance positively or negatively. We report these results in Table 3, sorted by performance to show the most successful combinations.

We specifically evaluated: the efficacy of using multi-level feature maps and level selection; the influence of using softmax on the attention maps; the use of ELMo for text embedding versus the commonly used approach of training a Bi-LSTM; the use of a linear or non-linear mapping into the common space for the text and visual features (NLT and NLV); and finally the choice of the visual layer (M: middle layer, L: last layer, ML: multi-level feature maps) compared to the word and sentence embeddings (WL and SL) when level attention is not used. We used Cell 7 as the middle layer and Cell 11 as the last layer (to be compared with the word and sentence embeddings in Eq. 1 and Eq. 6a, respectively).

Comparing the results in the table, we see that the level-attention mechanism based on multi-level feature maps significantly improves the performance over separate visual-textual feature comparison. We also see that the non-linear mappings in our model are crucial: replacing either of them with a linear one significantly degrades the performance. The non-linear mapping appears more important on the visual side, but the best results are obtained with non-linear mappings on both the text and visual sides. We further find that applying softmax on the heatmaps has a very negative effect on the performance of the model. This makes sense since, as elaborated in Section 3.2, this commonly used approach forces the heatmap to follow an unnecessary distribution over either words or regions. Finally, the results show the importance of a strong contextualized text embedding: when we replace the pre-trained BiLSTMs of the ELMo model with a trainable BiLSTM (on top of the ELMo word embeddings) and feed the BiLSTM outputs directly to the attention model, the performance again drops significantly. It is worth mentioning that we conducted the same experiments with a different visual network (Inception-V4 [43]) and observed the same trend for the baseline choices.

Figure 6: Some image-sentence pairs from Flickr30K, with two queries (colored text) and corresponding heatmaps and selected max value (stars).
Figure 7: Some failure cases of our model, where it makes semantic mistakes when pointing to regions.
 #   WL   SL   Acc.
 1   ML   ML   67.73
 2   M    L    62.67
 3   M    L    58.40
 4   M    L    56.92
 5   M    L    56.42
 6   M    L    54.75
 7   M    L    47.20
 8   M    L    44.83
Table 3: Ablation study results on Flickr30K using PNASNet. SA: Softmax Attention; NLT: Non-Linear Text mapping; NLV: Non-Linear Visual mapping; WL: Word-Layer; SL: Sentence-Layer; Acc.: pointing accuracy.

4.5 Qualitative results

Figures 5, 6, and 7 show example heatmaps generated for queries from the Flickr30K dataset. Specifically, we upsample the heatmaps from their original resolution to the original image size by bilinear interpolation. We observe that the max (pointing) location in each heatmap points to the correct location in the image, and the heatmaps often capture the relevant part of the image for each query. The model can deal with persons, context, and objects even when they are described with very specific words (e.g., "bronco"), which shows the power of using a character-based contextualized text embedding. Finally, Figure 7 shows some localization failures involving concepts that are semantically close, or challenging capture conditions. For example, the picture frames are mistakenly pointed to for the query "window", which is overexposed.

5 Conclusion

In this paper, we present a weakly supervised method for phrase localization that relies on a multi-level attention mechanism on top of multi-level visual semantic features and contextualized text embeddings. We non-linearly map both the contextualized text embeddings and the multi-level visual semantic features into a common space and compute a multi-level attention map to choose the best representative visual semantic level for the text and for each word in it. We show that this combination sets a new state-of-the-art performance and provide quantitative results demonstrating the importance of (1) a proper common-space mapping, (2) strong contextualized text embeddings, and (3) the freedom of each word to choose the appropriate visual semantic level. Future work lies in studying other applications such as Visual Question Answering and Image Captioning.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, volume 3, page 6, 2018.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
  • [3] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [4] K. Chen, J. Gao, and R. Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [5] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. Msrc: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval, 7(1):17–28, 2018.
  • [6] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [7] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
  • [8] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7746–7755, 2018.
  • [9] M. Engilberge, L. Chevallier, P. Pérez, and M. Cord. Finding beans in burgers: Deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3984–3993, 2018.
  • [10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1473–1482, 2015.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
  • [12] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
  • [13] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In Int. Workshop OntoImage, volume 5, 2006.
  • [14] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In European Conference on Computer Vision. Springer, 2018.
  • [15] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564, 2016.
  • [16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2333–2338. ACM, 2013.
  • [17] S. A. Javed, S. Saxena, and V. Gandhi. Learning unsupervised visual grounding through semantic self-supervision. arXiv preprint arXiv:1803.06506, 2018.
  • [18] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
  • [19] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • [20] B.-H. Juang, W. Hou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio processing, 5(3):257–265, 1997.
  • [21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [22] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [23] M. Khademi and O. Schulte. Image caption generation with hierarchical contextual visual spatial attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1943–1951, 2018.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [26] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [28] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
  • [29] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [30] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer, 2013.
  • [31] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [32] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
  • [33] D.-K. Nguyen and T. Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
  • [35] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [36] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237, 2018.
  • [37] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proc. ICCV, 2017.
  • [38] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [39] F. Radenović, G. Tolias, and O. Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pages 3–20. Springer, 2016.
  • [40] V. Ramanishka, A. Das, J. Zhang, and K. Saenko. Top-down visual saliency guided by captions. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
  • [44] J. Thomason, J. Sinapov, and R. Mooney. Guiding interaction behaviors for multi-modal grounded language learning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 20–24, 2017.
  • [45] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
  • [46] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In European Conference on Computer Vision, pages 696–711. Springer, 2016.
  • [47] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
  • [48] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
  • [49] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • [50] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [51] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint, 2018.
  • [52] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [53] R. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Advances in Neural Information Processing Systems, pages 1912–1922, 2017.
  • [54] R. A. Yeh, M. N. Do, and A. G. Schwing. Unsupervised textual grounding: Linking words to image concepts. In Proc. CVPR, volume 8, 2018.
  • [55] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
  • [56] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [57] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [58] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [59] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
  • [60] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
  • [61] F. Zhao, J. Li, J. Zhao, and J. Feng. Weakly supervised phrase localization with multi-scale anchored transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [62] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European conference on computer vision, pages 391–405. Springer, 2014.