Instance grounding and merging component developed for the DARPA AIDA program.
We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. This common space is instantiated at multiple layers of a Deep Convolutional Neural Network by exploiting its feature maps, as well as contextualized word-level and sentence-level embeddings extracted from a character-based language model. Following a dedicated non-linear mapping for the visual features at each level and for the word and sentence embeddings, we obtain a common space in which comparisons between the target text and the visual content at any semantic level can be performed simply with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at different semantic levels. The best level is chosen to be compared with the text content, maximizing the pertinence scores of the ground-truth image-sentence pairs. Experiments conducted on three publicly available benchmarks show significant performance gains in phrase localization and set a new performance record on those datasets. We also provide a detailed ablation study to show the contribution of each element of our approach.
Phrase grounding [38, 31] is the task of localizing within an image a given natural language input phrase, as illustrated in Figure 1. This ability to link text and image content is a key component of many visual semantic tasks such as image captioning [10, 21, 18], visual question answering [2, 29, 48, 52, 11], text-based image retrieval [12, 39], and robotic navigation. It is especially challenging as it requires a good representation of both the visual and textual domains and an effective way of linking them.
On the visual side, most works exploit Deep Convolutional Neural Networks but often rely on bounding box proposals [38, 41, 15] or use a global feature of the image, limiting the localization ability and freedom of the method. On the textual side, methods rely on a closed vocabulary or try to train their own language model using small image-caption datasets [17, 59, 53, 9]. Finally, the mapping between the two modalities is often performed with a weak linear strategy [38, 51]. We argue that approaches in the literature have not fully leveraged the potential of the more powerful visual and textual models developed recently, and that there is room for developing more sophisticated representations and mapping approaches.
In this work, we propose to explicitly learn a non-linear mapping of the visual and textual modalities into a common space, and to do so at different granularities for each domain. Indeed, different layers of a deep network encode each region of the image with gradually increasing levels of discriminativeness and context awareness; similarly, single words and whole sentences carry increasing levels of semantic meaning. This common space mapping is trained with weak supervision and exploited at test time with a multi-level multimodal attention mechanism, where a natural formalism for computing attention heatmaps at each level, attended features, and pertinence scores enables us to solve the phrase grounding task elegantly and effectively. We evaluate our model on three datasets commonly used in the textual grounding literature and show that it sets a new state-of-the-art performance by a large margin.
Our contributions in this paper are as follows:
We learn, with weak supervision, a non-linear mapping of visual and textual features to a common region-word-sentence semantic space, where comparison between any two semantic representations can be performed with a simple cosine similarity;
We propose a multi-level multimodal attention mechanism, which can produce either word-level or sentence-level attention maps at different semantic levels, enabling us to choose the most representative attended visual feature among different semantic levels;
We set new state-of-the-art performance on three commonly used datasets, and give detailed ablation results showing how each part of our method contributes to the final performance.
In the following section, we provide a brief overview of related works in the literature, and then elaborate on our method.
The earliest works on solving the textual grounding task [38, 41, 15] tried to tackle the problem by finding the right bounding box out of a set of proposals, usually obtained from pre-specified models [62, 45]. The ranking of these proposals, for each text query, can be performed using scores estimated from a reconstruction or sentence generation procedure, or using distances in a common space. However, relying on a fixed set of pre-defined concepts and proposals may not be optimal, and the quality of the bounding boxes defines an upper bound [15, 46] on the performance that can be achieved. Therefore, several methods [6, 61] have proposed to integrate the proposal step in their framework to improve the bounding box quality. Works relying on bounding boxes often operate in a fully supervised setting [5, 53, 57, 11, 6], where the mapping between sentences and bounding boxes has to be provided at training time; this mapping is not always available and is costly to gather. It is also worth mentioning that methods based on bounding boxes often extract features separately for each bounding box [15, 4, 46], inducing a high computational cost.
Some works [40, 17, 59, 47, 54] therefore choose not to rely on bounding boxes and propose to formalize the localization problem as finding a spatial heatmap for the referring expression. This setting is mostly weakly supervised: at training time only the image and the text (describing either the whole image or some parts of it) are provided, but not the corresponding bounding box or segmentation mask for each description. This is the more general setting we are addressing in this paper. The top-down approaches [40, 59] and the attention-based approach learn to produce a heatmap for each word of a vocabulary. At test time, all these methods produce the final heatmap by averaging the heatmaps of all the words in the query that exist in the vocabulary. Several grounding works have also explored the use of additional knowledge, such as image and linguistic [47, 37] structures, phrase context, and the predictions of pre-trained visual models [4, 54].
In contrast to many works in the literature, we don't use pre-defined word or image concepts in our method. We also don't leverage any knowledge from classification or object detection tasks, or explicitly exploit image or sentence structures. We instead rely on a character-based language model with contextualized embeddings, which can handle any unseen word by considering its context in the sentence. As we further explain below, the sentence and each word in it are assigned a spatial heatmap explaining their similarity to different regions of the image at different visual semantic levels.
Current works usually apply a multi-layer perceptron (MLP) [6, 4], element-wise multiplication, or cosine similarity to combine representations from different modalities. Other methods have used Canonical Correlation Analysis (CCA) [37, 38], which finds linear projections that maximize the correlation between projected vectors from the two views of heterogeneous data. The Multimodal Compact Bilinear (MCB) pooling method fuses visual and language features by using a compressed feature from the outer product of the two vectors. Attention methods can also measure the matching of an image-sentence feature pair. In [51, 33], attention maps are generated from the dot product of linear projections of visual and language features. In contrast, we use a non-linear mapping of both visual features (at multiple semantic levels) and textual embeddings (both contextualized word and sentence embeddings) separately, and use multi-level attention with a multimodal loss to learn the mapping weights.
Attention has proved its effectiveness in many visual and language tasks [23, 1, 7, 52, 50]; it is designed to capture a better representation of image-sentence pairs based on their interactions. The Accumulated Attention method proposes to estimate attention on sentences, objects, and visual feature maps in an iterative fashion, where at each iteration the attention of the other two modalities is exploited as guidance. A dense co-attention mechanism has been explored to solve the Visual Question Answering task by using a fully symmetric architecture between visual and language representations. In that attention mechanism, a dummy location is added to the attention map, along with a softmax, for cases where there is no region or word the model should attend to. In AttnGAN, a deep attention multimodal similarity model is proposed to compute a fine-grained image-text matching loss. In contrast to these works, we remove the softmax on top of the attention maps to let the model decide, guided by the multimodal loss, which word-region pairs could be related to each other. Since we map the visual features to a multi-level visual representation, we give the model the freedom to choose any location at any level for either the sentence or a word. In other words, each word can choose which level of representation (and which region in that representation) to attend to. The same freedom is provided for the sentence. We directly calculate this attention map by cosine similarity in the common space we learn for the word, sentence, and multi-level semantic visual representations. We show that this approach significantly outperforms all the state-of-the-art approaches on three commonly used datasets and sets a new state-of-the-art performance.
In this section, we describe our method (illustrated in Figure 2) for addressing the textual grounding task and elaborate on each part in detail. In Section 3.1, we first explain how we extract multi-level visual features from an image and word/sentence embeddings from the text, and then describe how we map them to a common space. In Section 3.2, we describe how we calculate the multi-level multimodal attention map and the attended visual feature for each word/sentence. Then, in Section 3.3, we describe how we choose which visual feature level is most representative for the given text. Finally, in Section 3.4, we define a multimodal loss to train the whole model using weak supervision.
In contrast to many vision tasks where the last layer of a pre-trained CNN is used as the visual representation of an image, we use feature maps from different layers and map them separately to a common space to obtain a multi-level set of feature maps to be compared with text. Intuitively, different levels of visual representation are needed to cover a wide range of visual concepts and patterns [26, 55, 58]. Thus, we extract sets of feature maps from different levels of a visual network, upsample them by bilinear interpolation (as transposed convolution produces checkerboard artifacts) to a fixed resolution for all the levels, and then apply 3 layers of 1x1 convolution (with LeakyReLU) to map them into equal-sized feature maps. Finally, we stack these feature maps and flatten their spatial dimensions to obtain an overall image representation tensor. This tensor is finally normalized by the L2-norm of its last dimension. An overview of the feature extraction and common space mapping for the image can be seen in the left part of Figure 3.
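The visual mapping described above can be illustrated with a small plain-Python sketch. It is not the authors' implementation: it assumes the feature maps have already been upsampled and flattened to a common number of regions, treats a 1x1 convolution as the same linear map applied at every spatial location, applies the non-linearity only between layers, and uses hypothetical names (`map_location`, `build_image_tensor`), toy weights, and a placeholder LeakyReLU slope.

```python
import math

def leaky_relu(x, slope=0.1):
    # The slope is a placeholder value, not taken from the paper.
    return x if x > 0 else slope * x

def linear(vec, weights):
    # weights: list of output rows, each a list over input channels.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def map_location(vec, layers):
    # A 1x1 convolution is the same linear map at every spatial location,
    # so a stack of them can be applied to one location at a time.
    for i, weights in enumerate(layers):
        vec = linear(vec, weights)
        if i < len(layers) - 1:  # assumption: non-linearity between layers only
            vec = [leaky_relu(x) for x in vec]
    return vec

def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def build_image_tensor(levels, level_layers):
    # levels[l] is a flattened feature map: a list of per-region channel vectors.
    # Returns V with V[n][l] = unit-length common-space vector of region n at level l.
    n_regions = len(levels[0])
    return [
        [l2_normalize(map_location(levels[l][n], level_layers[l]))
         for l in range(len(levels))]
        for n in range(n_regions)
    ]

# Toy example: 2 levels with different channel counts, 4 regions, common dimension 2.
levels = [
    [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0], [2.0, 1.0, 1.0]],
    [[1.0, -1.0], [0.0, 2.0], [3.0, 1.0], [-1.0, 0.5]],
]
level_layers = [
    [[[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]], [[1.0, 0.2], [-0.3, 1.0]], [[0.5, 0.5], [1.0, -1.0]]],
    [[[1.0, 1.0], [0.5, -1.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]],
]
V = build_image_tensor(levels, level_layers)
```

Each level has its own stack of weights, so maps with different channel counts end up as vectors of the same size, which is what allows them to be stacked and compared with text by a dot product.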
In this work, we use VGG as a baseline for fair comparison with other works in the literature [10, 47, 17], and the state-of-the-art CNN PNASNet-5 to study the ability of our model to exploit this more powerful visual model. We detail the feature maps selected in each model in Section 4.2.
State-of-the-art works in grounding use a variety of approaches for textual feature extraction. Some use LSTMs or BiLSTMs pre-trained on big datasets (e.g. Google 1 Billion) based on either word2vec or GloVe representations. Some train a BiLSTM solely on image-caption datasets (mostly MSCOCO) and argue that it is necessary to train from scratch to distinguish between visual concepts which may not be distinguishable in language (e.g. red and green are different in vision but similar in language, as they are both colors) [33, 51, 17, 47, 9, 14, 61, 38, 57, 8]. These works use either the recurrent network outputs at each state as word-level representations, or the last output (in each direction for a BiLSTM) as the sentence-level representation, or a combination of both.
In this paper, however, we use ELMo, a 3-layer network pre-trained on 5.5B tokens, which calculates word representations on the fly (based on a CNN over characters, similar to [19, 60]) and then feeds them to 2 layers of BiLSTMs which produce contextualized representations. Thus, for a given sentence the model outputs three representations for each token (split by whitespace). We take a linear combination of the three representations and feed it to 2 fully connected layers (with weights shared among words), with LeakyReLU as the non-linearity between the layers, to obtain each word representation (green pathway in the right part of Figure 3). The resulting word-based text representation for an entire sentence is a tensor built by stacking the word representations. The sentence-level text representation is calculated by concatenating the last outputs of the BiLSTMs in each direction. Similarly, we apply a linear combination to the two sentence-level representations and map it to the common space by feeding it to 2 fully connected layers, producing the sentence representation (red pathway in the right part of Figure 3). The word tensor and the sentence vector are normalized by the L2-norm of their last dimension before being fed to the multimodal attention block.
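The word pathway described above (a linear combination of the three per-token representations, two fully connected layers shared across words, and normalization) could be sketched as follows. The function names, toy dimensions, weights, and LeakyReLU slope are all assumptions made for illustration; in practice the three inputs per token would come from a pre-trained language model rather than hand-written lists.

```python
import math

def leaky_relu(x, slope=0.1):
    # Placeholder slope, not taken from the paper.
    return x if x > 0 else slope * x

def linear(vec, weights):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def mix_layers(layer_outputs, mix_weights):
    # Linear combination of the per-layer representations of one token.
    dim = len(layer_outputs[0])
    return [sum(w * layer[d] for w, layer in zip(mix_weights, layer_outputs))
            for d in range(dim)]

def embed_token(layer_outputs, mix_weights, fc1, fc2):
    x = mix_layers(layer_outputs, mix_weights)
    x = [leaky_relu(h) for h in linear(x, fc1)]  # non-linearity between the two layers
    return l2_normalize(linear(x, fc2))

# Toy sentence of 2 tokens, 3 layer outputs per token, input dim 2, common dim 2.
sentence = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, -1.0], [2.0, 0.0], [0.0, 0.5]],
]
mix_weights = [0.2, 0.3, 0.5]   # learned in practice; fixed here
fc1 = [[1.0, -1.0], [0.5, 0.5]]
fc2 = [[1.0, 0.0], [1.0, 1.0]]

# The same fc1/fc2 are reused for every token (shared weights among words).
S = [embed_token(tok, mix_weights, fc1, fc2) for tok in sentence]
```

The sentence pathway would follow the same pattern with its own mixing weights and layers, applied to the two sentence-level representations instead of per-token ones.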
Given the image and sentence, our task is to estimate the correspondences between spatial regions in the image at different levels and words in the sentence at different positions. We seek to estimate a correspondence measure between each word and each region at each level. We define this correspondence by the cosine similarity between the word and image region representations at different levels in the common space. The resulting tensor represents a multi-level multimodal attention map which can be used to calculate either a visual or a textual attended representation. We apply ReLU to the attention map to zero out dissimilar word-visual region pairs, and simply avoid applying softmax on any dimension of the heatmap tensor. Note that this choice is very different in spirit from the commonly used approach of applying softmax to attention maps [50, 49, 8, 33, 17, 51, 40]. Indeed, for irrelevant image-sentence pairs the attention maps would be almost all zeros, whereas softmax would always force the attention to be a distribution over the image regions or words summing to 1. Furthermore, a group of words forming a phrase could share the same attention area, which is again hard to achieve given the competition among regions or words that softmax induces on the heatmap. We analyze the influence of this choice experimentally in our ablation study.
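To make the contrast with softmax concrete, here is a minimal sketch of a ReLU-thresholded cosine attention map. The index layout `H[n][t][l]` (region, word, level) and the names `attention_map`, `V`, and `S` are assumptions for this example; the inputs are taken to be unit-length vectors, so a dot product equals the cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_map(V, S):
    # V[n][l]: unit vector of region n at level l; S[t]: unit vector of word t.
    # ReLU zeroes out dissimilar pairs; no softmax is applied on any dimension.
    return [[[max(0.0, dot(v_nl, s_t)) for v_nl in V_n] for s_t in S] for V_n in V]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

V = [[[1.0, 0.0]], [[0.0, 1.0]]]    # 2 regions, 1 level
S = [[1.0, 0.0], [-0.6, -0.8]]      # word 0 matches region 0; word 1 matches nothing

H = attention_map(V, S)
relu_word1 = [H[n][1][0] for n in range(2)]                      # all zeros
softmax_word1 = softmax([dot(V[n][0], S[1]) for n in range(2)])  # still sums to 1
```

For the unrelated word, the ReLU map stays empty, while the softmax version is forced to spread a full unit of attention over the regions.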
Given the heatmap tensor, we calculate the attended visual feature for each level and each word as a weighted average over the visual representations of that level, with the attention heatmap values as weights. In other words, the attended visual feature is a vector in the hyperplane spanned by a subset of visual representations in the common space, this subset being selected based on the heatmap tensor. An overview of our multi-level multimodal attention mechanism for calculating the attended visual feature can be seen in Figure 4. Next, we describe how we use this attended feature to choose the most representative hyperplane, and calculate a multimodal loss to be minimized by weak supervision of image-sentence relevance labels.
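The attended visual feature can be sketched as a heatmap-weighted combination of region vectors. This example assumes the same hypothetical layout as before (`H[n][t][l]` for the heatmap, `V[n][l]` for region vectors) and computes an unnormalized weighted sum; dividing by the sum of the weights would give a weighted average pointing in the same direction, so a subsequent cosine score would be unaffected.

```python
def attended_features(H, V):
    # H[n][t][l]: non-negative attention of word t on region n at level l.
    # V[n][l]: common-space vector of region n at level l.
    # Returns A with A[t][l] = sum over regions n of H[n][t][l] * V[n][l].
    n_regions, n_words, n_levels = len(H), len(H[0]), len(H[0][0])
    dim = len(V[0][0])
    return [[[sum(H[n][t][l] * V[n][l][d] for n in range(n_regions))
              for d in range(dim)]
             for l in range(n_levels)]
            for t in range(n_words)]

# Toy example: 3 regions, 1 word, 1 level, dimension 2.
V = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.6, 0.8]]]
H = [[[0.5]], [[0.0]], [[1.0]]]

A = attended_features(H, V)   # A[0][0] combines regions 0 and 2 only
```

Regions with zero attention drop out of the sum entirely, which is the sense in which the heatmap selects the subset of visual representations spanning the attended feature.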
Once we find the attended visual feature, we calculate the word-image pertinence score at each level as the cosine similarity between each word and its attended visual feature at that level. Intuitively, each visual feature map level could carry different semantic information; thus, for each word we apply a hard level-attention and take the score from the level contributing the most, i.e., the maximum pertinence score over levels.
This procedure can be seen as finding the projection of the textual embeddings on hyperplanes spanned by visual features from different levels and choosing the one that maximizes this projection. Intuitively, the chosen hyperplane can be a better representation of the visual feature space attended by the word. This can be seen in the top central part of Figure 2, where selecting the maximum pertinence score over levels is equivalent to selecting the hyperplane with the smallest angle to the word representation (or the highest similarity between the attended visual feature and the textual feature), and thus to selecting the most representative hyperplane (or visual feature level).
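The hard level-attention step can be written compactly: score every level by cosine similarity and keep the best one. The names `word_scores`, `A`, and `S` below are assumptions for this sketch, with `A[t][l]` the attended visual feature of word `t` at level `l` and `S[t]` the word vector.

```python
import math

def cosine(a, b, eps=1e-12):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)   # an all-zero attended feature scores 0

def word_scores(A, S):
    # For each word, return (best pertinence score, index of the chosen level).
    results = []
    for a_t, s_t in zip(A, S):
        per_level = [cosine(a_tl, s_t) for a_tl in a_t]
        best = max(range(len(per_level)), key=per_level.__getitem__)
        results.append((per_level[best], best))
    return results

# Toy example: one word, two levels. Level 1 is better aligned with the word.
A = [[[1.0, 1.0], [2.0, 0.0]]]
S = [[1.0, 0.0]]
scores = word_scores(A, S)
```

Because cosine similarity ignores vector length, the level whose attended feature makes the smallest angle with the word vector wins, regardless of how much total attention mass that level received.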
Similarly, for the sentence we can repeat the same procedure (except that we no longer need Eq. 5) to find the attention map, the attended visual feature, and the sentence-image pertinence score.
In this paper, we only use weak supervision in the form of binary image-caption relevance. Thus, similar to [10, 16, 51], we train the network on a batch of image-caption pairs and force it to have a high sentence-image pertinence score for related pairs and a low score for unrelated pairs. Considering a pertinence score (either word-based or sentence-based), we calculate the posterior probability of a sentence being matched with an image by applying competition among all sentences in the batch.
Similarly, the posterior probability of an image being matched with a sentence can be calculated by applying competition among all images in the batch.
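One common way to implement this kind of in-batch competition is a softmax over the scores of all candidates. The sketch below assumes a square score matrix `R` where `R[i][j]` is the pertinence score of sentence `i` with image `j` and matched pairs sit on the diagonal; the scaling factor `gamma` and the function names are hypothetical, not values or identifiers from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_given_image(R, gamma=1.0):
    # For each image j, all sentences in the batch compete (softmax over column j).
    n = len(R)
    return [softmax([gamma * R[i][j] for i in range(n)])[j] for j in range(n)]

def image_given_sentence(R, gamma=1.0):
    # For each sentence i, all images in the batch compete (softmax over row i).
    return [softmax([gamma * r for r in row])[i] for i, row in enumerate(R)]

# Toy batch of 2 pairs with higher scores for the matched (diagonal) pairs.
R = [[0.9, 0.1],
     [0.2, 0.8]]
p_s_given_i = sentence_given_image(R)
p_i_given_s = image_given_sentence(R)
```

A larger `gamma` sharpens the competition, pushing the matched pair's probability toward 1 whenever its score is the highest in its row or column.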
As we want to train a common semantic space for both words and sentences, we combine the word loss (computed from the word-based pertinence score) and the sentence loss (computed from the sentence-based pertinence score) to define our final loss.
This loss is minimized over a batch of images along with their related sentences. We selected the hyperparameter values of this loss in preliminary experiments on held-out validation data and keep them fixed for our experiments. In the next section, we evaluate our proposed model on different datasets and present an ablation study to justify the choices made in our model.
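A loss of this kind can be sketched as the sum of negative log posteriors in both matching directions, added up for the word-based and sentence-based scores. This is an assumed formulation for illustration only: the exact form and weighting of the terms, the scaling factor `gamma`, and the names `matching_loss` and `total_loss` are not taken from the text.

```python
import math

def log_softmax_at(xs, k):
    # Numerically stable log of the k-th softmax probability.
    m = max(xs)
    return xs[k] - m - math.log(sum(math.exp(x - m) for x in xs))

def matching_loss(R, gamma=1.0):
    # R[i][j]: pertinence score of sentence i with image j; pair k is (k, k).
    n = len(R)
    loss = 0.0
    for k in range(n):
        column = [gamma * R[i][k] for i in range(n)]  # sentences compete for image k
        row = [gamma * R[k][j] for j in range(n)]     # images compete for sentence k
        loss -= log_softmax_at(column, k) + log_softmax_at(row, k)
    return loss

def total_loss(R_word, R_sentence, gamma=1.0):
    # Assumed combination: plain sum of the word-based and sentence-based losses.
    return matching_loss(R_word, gamma) + matching_loss(R_sentence, gamma)

R_good = [[0.9, 0.1], [0.2, 0.8]]   # matched pairs score highest
R_bad = [[0.1, 0.9], [0.8, 0.2]]    # mismatched pairs score highest
loss_good = total_loss(R_good, R_good)
loss_bad = total_loss(R_bad, R_bad)
```

Minimizing such a loss raises the scores of related image-sentence pairs relative to the unrelated pairs in the same batch, which is the only supervision signal in the weakly supervised setting.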
In this section, we first present the datasets we use and our experimental setup. We then evaluate our approach against the state of the art, and further present ablation studies showing the influence of each step of our method.
We here describe the publicly available datasets we have used for training and testing our approach.
MSCOCO consists of 82,783 training images and 40,504 validation images. Each image is associated with five captions describing the image. While an image may be associated with multiple bounding box annotations of objects, there is no textual description for these bounding boxes and we do not exploit them in any way. We use the train split of this dataset for training our model.
Flickr30k is a dataset of approximately 31K images with 150K descriptive captions. There are 25,380 images in the training set, 2,985 images in the validation set, and 2,984 images in the test set. In the Flickr30k Entities dataset, each image caption is divided into multiple phrases that are each linked to a specific bounding box on the image, with a total of 244K phrases describing localized bounding boxes. We use the validation split of this dataset for validating our model during training (on MSCOCO) and its test split for evaluating our model after training.
VisualGenome contains 77,398 images in the training set, and a validation and test set of 5,000 images each. Each image comes with multiple bounding box annotations and a region description associated with each bounding box. Following prior work, we preprocess the text and remove noisy characters and punctuation from region descriptions in VisualGenome. All bounding box annotations without any description and all descriptions longer than ten words are ignored. Certain images were also found to have bounding box annotations exceeding the image size, which we have fixed to lie within the image boundaries. Note that we use the bounding boxes only to evaluate our method and not in training, as our method is weakly supervised. Since the type of region description in this dataset is different from MSCOCO and Flickr30k, we separately train our model on the training split of this dataset.
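The region filtering and box clipping steps described above might look like the following sketch. The box format `(x, y, width, height)`, the definition of "noisy characters" as non-printable characters, and the helper names are assumptions made for this example, not details given in the text.

```python
import string

def clean_description(text):
    # Replace punctuation and non-printable characters by spaces, then
    # collapse runs of whitespace. What counts as "noisy" is an assumption.
    kept = "".join(
        " " if (not ch.isprintable() or ch in string.punctuation) else ch
        for ch in text
    )
    return " ".join(kept.split())

def clip_box(box, width, height):
    # box = (x, y, w, h) in pixels; the returned box lies inside the image.
    x, y, w, h = box
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return (x0, y0, max(0, x1 - x0), max(0, y1 - y0))

def preprocess_regions(regions, width, height, max_words=10):
    # regions: list of (box, description) pairs for one image.
    kept = []
    for box, text in regions:
        text = clean_description(text or "")
        if not text or len(text.split()) > max_words:
            continue  # drop regions with no description or an overly long one
        kept.append((clip_box(box, width, height), text))
    return kept

regions = [
    ((-5, 10, 50, 40), "A red car!"),
    ((0, 0, 10, 10), ""),
    ((90, 90, 30, 30), "one two three four five six seven eight nine ten eleven"),
]
cleaned = preprocess_regions(regions, width=100, height=100)
```

Only the first region survives here: the second has no description and the third exceeds ten words, while the surviving box is trimmed to start at the left image border.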
ReferIt consists of 20,000 images from the IAPR TC-12 dataset along with 99,535 segmented image regions from the SAIAPR-12 dataset. Images are associated with descriptions for the entire image as well as for localized image regions, collected in a two-player game and providing approximately 130k isolated entity descriptions. In our work, we have used only the unique descriptions associated with each region, and any region without an associated description has been ignored. We have used a split similar to prior work, which contains 9,000 training images, 1,000 validation images, and 10,000 test images. We use the validation split of this dataset for validating our model while it is trained on the Visual Genome (VG) dataset, and use its test split for evaluating the model after training (on VG).
In each batch of image-caption pairs, each image (caption) is related to only one caption (image). Image-caption pairs are sampled randomly with a uniform distribution. We train the network for 20 epochs with the Adam optimizer, where the learning rate is divided by 2 once at the 10th epoch and once at the 15th epoch. The common space mapping dimension and the LeakyReLU slope of the non-linear mappings are hyperparameters of the model, and we regularize the weights of the mappings. Finally, we detail the visual feature map level selection for both VGG16 and PNASNet in Table 1. The weights of both the visual and textual networks are fixed during training, and only the common space mapping weights are trainable. In the ablation study, we train for 10 epochs without dividing the learning rate, while the rest of the settings remain the same.
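The step schedule described above is simple enough to state as a function. The sketch assumes 1-based epochs and that each halving takes effect from the start of the named epoch; `base_lr` is a placeholder, not a value from the paper.

```python
def learning_rate(epoch, base_lr):
    # epoch is 1-based; the rate is halved at the 10th and again at the 15th epoch.
    lr = base_lr
    if epoch >= 10:
        lr /= 2
    if epoch >= 15:
        lr /= 2
    return lr

# Relative schedule over the 20 training epochs (base_lr=1.0 is a placeholder).
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in range(1, 21)]
```

With this convention, epochs 1 to 9 run at the base rate, epochs 10 to 14 at half of it, and epochs 15 to 20 at a quarter.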
|Network||Name||Original Dimensions||Multi-Level Feature Maps|
As stated in Section 4.1, we use the train split of MSCOCO for training our model and evaluate it on the test split of Flickr30k. Since the queries in ReferIt and VisualGenome are referring expressions, and not part of a complete sentence as in Flickr30k, we separately train the model on the train split of the Visual Genome dataset and evaluate it on the test splits of ReferIt and Visual Genome. For evaluation on Flickr30k, we feed a complete sentence to the model and, for each query, take the weighted average of the attention heatmaps of its words, with the word-image pertinence scores in Eq. 4 as weights. For ReferIt and Visual Genome, we treat each query as a single sentence and take its sentence-level attention heatmap as the final query pointing heatmap. Once the pointing heatmaps are calculated, we find the max location (as the pointing location for the given query) and evaluate the model by the pointing game accuracy: the number of hits divided by the total number of hits and misses, where a hit means that the pointing location lies within the ground-truth bounding box.
Results for the pointing game accuracy can be found in Table 2 for the Flickr30k, ReferIt, and Visual Genome datasets. The results in the table show that our method significantly outperforms all state-of-the-art methods in all conditions and on all datasets. For fair comparison, we used VGG16 similar to [10, 17], and yet the model gives both absolute and relative improvements in pointing game accuracy on Flickr30k, ReferIt, and VisualGenome. Results with the more recent PNASNet model are even better, especially for Flickr30K and VisualGenome. In the next section, we break our method into different parts to study the efficacy of each of the choices we have made, and elaborate on the most influential parts of the model contributing to these results.
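The evaluation protocol above can be sketched in a few functions: combine per-word heatmaps into a query heatmap, take the location of the maximum, and count a hit when it falls inside the ground-truth box. The coordinate conventions (points as `(row, col)`, boxes as inclusive `(x0, y0, x1, y1)` in heatmap pixels) and all function names are assumptions for this example.

```python
def weighted_heatmap(word_heatmaps, word_scores):
    # Weighted average of per-word heatmaps (2D lists) for one query.
    rows, cols = len(word_heatmaps[0]), len(word_heatmaps[0][0])
    total = sum(word_scores)
    if total == 0:
        return [[0.0] * cols for _ in range(rows)]
    return [[sum(w * h[r][c] for w, h in zip(word_scores, word_heatmaps)) / total
             for c in range(cols)]
            for r in range(rows)]

def argmax_location(heatmap):
    # (row, col) of the maximum value; the first maximum wins on ties.
    cells = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])

def is_hit(point, box):
    r, c = point
    x0, y0, x1, y1 = box
    return x0 <= c <= x1 and y0 <= r <= y1

def pointing_game_accuracy(heatmaps, boxes):
    # Every query is either a hit or a miss, so this is hits / (hits + misses).
    hits = sum(is_hit(argmax_location(h), b) for h, b in zip(heatmaps, boxes))
    return hits / len(heatmaps)

h1 = [[0.1, 0.9], [0.0, 0.2]]   # maximum at row 0, col 1
h2 = [[0.0, 0.1], [0.7, 0.2]]   # maximum at row 1, col 0
boxes = [(1, 0, 1, 0), (1, 1, 1, 1)]
accuracy = pointing_game_accuracy([h1, h2], boxes)
combined = weighted_heatmap([h1, h2], [1.0, 3.0])
```

In this toy case the first query points inside its box and the second does not, so half of the queries are hits.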
We trained on MSCOCO and evaluated on Flickr30K multiple configurations of our approach, with a PNASNet visual model, to better understand which aspects of our method affect the performance positively or negatively. We report these results in Table 3. Results are sorted by performance to show the most successful combinations.
We specifically evaluated: the efficacy of using multi-level feature maps and level selection; the influence of the use of softmax on attention maps; the use of ELMo for text embedding versus the commonly used approach of training a Bi-LSTM; the use of a linear or non-linear mapping into the common space for the text and visual features (NLT and NLV); and finally the choice of the visual layer (M: middle layer, L: last layer, ML: multi-level feature maps) for comparison to word and sentence embeddings (WL and SL) when we don't use level attention. We used Cell 7 as the middle layer and Cell 11 as the last layer (to be compared with the word and sentence embeddings in Eq. 1 and Eq. 6a, respectively).
By comparing the results in the table, we can see that using the level-attention mechanism based on multi-level feature maps significantly improves the performance over separate visual-textual feature comparison. By comparing the rows with linear and non-linear mappings, we can see that the non-linear mapping in our model is really important, and replacing any of the mappings with a linear one significantly degrades the performance. We can also see that the non-linear mapping seems more important on the visual side, but the best results are obtained with both text and visual non-linear mappings. By comparing the rows with and without softmax, we find that applying softmax on the heatmaps has a very negative effect on the performance of the model. This makes sense since, as we elaborated in Section 3.2, this commonly used approach forces the heatmap to have an unnecessary distribution over either words or regions. Finally, the rows comparing text embeddings show the importance of using a strong contextualized text embedding. In this case, we only replaced the pre-trained BiLSTMs of the ELMo model with a trainable BiLSTM (on top of the word embeddings of ELMo), and directly fed the BiLSTM outputs to the attention model. As we can see from the table, the performance drops significantly again. It is worth mentioning that we conducted the same experiment with a different visual network (Inception-V4) and observed the same trend for the baseline choices.
We give in Figures 5, 6, and 7 some examples of heatmaps generated for some queries of the Flickr30K dataset. Specifically, we upsample the heatmaps from their original size to the original image size by bilinear interpolation. We can observe that the max (pointing) location in the heatmaps points to the correct location in the image, and that the heatmaps often capture the relevant part of the image for each query. The model can deal with persons, context, and objects even if they are described with some very specific words (e.g. "bronco"), which shows the power of using a character-based contextualized text embedding. Finally, Figure 7 shows some localization failures involving concepts that are semantically close, and in challenging capture conditions. For example, the frames are mistakenly pointed at for the query "window", which is overexposed.
In this paper, we present a weakly supervised method for phrase localization which relies on a multi-level attention mechanism on top of multi-level visual semantic features and contextualized text embeddings. We non-linearly map both the contextualized text embeddings and the multi-level visual semantic features to a common space and calculate a multi-level attention map for choosing the most representative visual semantic level for the text and for each word in it. We show that this combination sets a new state-of-the-art performance, and provide quantitative results showing the importance of (1) using the correct common space mapping, (2) strong contextualized text embeddings, and (3) the freedom of each word to choose the correct visual semantic level. Future work lies in studying other applications such as Visual Question Answering and Image Captioning.
International Conference on Computer Vision (ICCV), 2015.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer, 2013.
Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
International conference on machine learning, pages 2397–2406, 2016.
Weakly supervised phrase localization with multi-scale anchored transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.