Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

by   Liwei Wang, et al.

Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object class names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on the task of visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.


1 Introduction

Visual phrase grounding, i.e., finding image regions associated with phrases in a sentence description of the image, is an important problem at the intersection of computer vision and natural language processing. Most existing approaches Fukui et al. (2016); Plummer et al. (2015); Wang et al. (2018) follow a fully supervised paradigm that requires the labeling of bounding boxes for each phrase. This fine-grained annotation is unfortunately expensive to obtain and thus difficult to scale. Consequently, weakly supervised grounding has recently received considerable attention Rohrbach et al. (2016); Yeh et al. (2017, 2018); Zhao et al. (2018); Chen et al. (2018); Zhang et al. (2018a, b); Fang et al. (2019); Wang and Specia (2019). In this setting, only images and their sentence descriptions are given at training time, and a method has to link image regions to sentence phrases at test time.

A major challenge of weakly supervised grounding is to distinguish among many “concurrent” visual concepts. For example, the region of a dog and that of its head are likely to co-occur in images associated with the phrase “a running puppy.” Without knowing the ground-truth region-phrase matching, learning to link the region of dog (but not dog head) to its corresponding phrase becomes very challenging. To address this challenge, recent methods Chen et al. (2018); Fang et al. (2019); Wang and Specia (2019) leverage generic object detectors during training and inference. Object detection provides high quality object regions, as well as their object names that can be further matched to candidate phrases, thereby bringing in external knowledge about region-phrase matching and thus helping to disambiguate those “concurrent” concepts.

In this paper, we focus on developing a principled approach to distill knowledge from a generic object detector for weakly supervised grounding. To this end, we present a simple method under the framework of contrastive learning. Specifically, our model learns a score function between region-phrase pairs, guided by two levels of similarity constraints in the form of noise-contrastive estimation (NCE) loss Gutmann and Hyvärinen (2010) during training. The first level of region-phrase similarity is distilled from object detection outputs. This is done by aligning predicted region-phrase scores to a set of soft targets, computed by matching object names and candidate phrases. The second level of image-sentence similarity is computed from a greedy matching between all region-phrase pairs, and supervised by ground-truth image-sentence pairs. During inference, our method compares each image region to candidate phrases using the learned score function. Thanks to knowledge distillation, our method does not require an object detector during inference and thus significantly reduces inference time.

To evaluate our method, we systematically vary the components of our model and demonstrate several best practices for weakly supervised phrase grounding. Moreover, we conduct extensive experiments on Flickr30K Entities Plummer et al. (2015) and ReferItGame Kazemzadeh et al. (2014) datasets, and compare our results to the latest methods of weakly supervised phrase grounding. Our experiments show that our method establishes new state-of-the-art results and outperforms the latest methods, including those that use object detectors at test time Fang et al. (2019); Wang et al. (2018); Yeh et al. (2018), and remains very efficient during inference without using object detectors. For example, on Flickr30K Entities, our method improves the result from Fang et al. (2019) by 2.3% and even outperforms the latest method Wang and Specia (2019) that combines multiple strong object detectors during inference. We hope that our simple yet strong method will shed light on new ideas and practices for weakly supervised image-text grounding.

2 Related Work

We discuss relevant work on weakly supervised phrase grounding and provide a brief review of contrastive learning and knowledge distillation—the two main pillars of our method.

Weakly Supervised Phrase Grounding

Grounding of textual phrases, also referred to as phrase localization, has received considerable attention recently. Several new datasets, e.g., Flickr30K Entities Plummer et al. (2015) and Visual Genome Krishna et al. (2017), have been constructed to capture dense phrase-region correspondences. Building on these datasets, recent approaches learn a similarity function between regions and phrases by using the ground-truth region-phrase pairs Wang et al. (2018); Fukui et al. (2016). This fully supervised paradigm has shown impressive results, yet requires labor-intensive annotations of bounding boxes for all phrases.

Recent work explores a weakly supervised setting. Specifically, these methods learn from only images and their paired sentence descriptions, without explicit region-to-phrase correspondence. For example, Rohrbach et al. (2016) learn visual phrase grounding by reconstructing the input phrases using an attention module over image regions. Yeh et al. (2018, 2017) show that weakly supervised phrase localization can benefit from side information, such as object segmentation or detection. UTG Yeh et al. (2018) links words in text and detection classes using co-occurrence statistics from paired captions. Moreover, Xiao et al. (2017) look into the linguistic structure of the sentences and propose a structure loss to model the compositionality of the phrases and their attention masks. Zhao et al. (2018) jointly learn to propose object regions and to match the regions to phrases. Fang et al. (2019) decompose weakly supervised grounding into several modules and use additional information, such as a color module, to improve performance. A more recent work, WPT Wang and Specia (2019), makes use of off-the-shelf models to detect objects, scenes and colors in images, and grounds phrases by measuring the semantic similarity between the categories of detected visual elements and the sentence phrases.

Our method shares the key idea behind these previous works: we seek an explicit alignment between regions and phrases given image-sentence pairs. Our method differs from previous works by explicitly modeling knowledge distillation from external off-the-shelf object detectors within a unified contrastive learning framework. Our method also differs from WPT Wang and Specia (2019), although both rely on object detectors: our approach distills detector knowledge during training, which frees our model from detectors at test time, while WPT requires detectors during inference.

A concurrent work Gupta et al. (2020) also proposed to use contrastive loss for weakly supervised phrase grounding. Different from Gupta et al. (2020), our work moves beyond the contrastive loss and focuses on knowledge distillation from an external object detector under a contrastive learning framework.

Contrastive Learning

Contrastive learning has recently been explored in a variety of tasks. Among them, Contrastive Predictive Coding (CPC) Oord et al. (2018) learns representations for sequential data. Deep InfoMax Hjelm et al. (2018) performs unsupervised learning by maximizing the mutual information between the input and output of networks. SimCLR Chen et al. (2020) is proposed for image classification with only a limited amount of data with class labels via contrastive learning. Multiview coding Tian et al. (2019) extends the input to more than two views. These methods are all based on similar contrastive objectives related to Noise Contrastive Estimation (NCE) Gutmann and Hyvärinen (2010), but differ in the tasks they address. Our work is related to these works since our mathematical framework is also built upon the general idea of InfoNCE Oord et al. (2018) and NCE Gutmann and Hyvärinen (2010). However, as far as we know, we are the first to extend this framework by integrating knowledge distillation into the cross-view weakly supervised grounding task.

Knowledge Distillation

Knowledge distillation was proposed and popularized by Buciluǎ et al. (2006); Hinton et al. (2015); Ba and Caruana (2014); Romero et al. (2014); Zagoruyko and Komodakis (2017). Several recent works Gupta et al. (2016); Garcia et al. (2018); Luo et al. (2018); Li et al. (2017) have explored knowledge distillation for multiple modalities. Knowledge distillation has also shown its effectiveness in various vision-language tasks, such as VQA Mun et al. (2018); Do et al. (2019), grounded image captioning Zhou et al. (2020), video captioning Pan et al. (2020); Zhang et al. (2020), etc. Different from previous approaches, we consider knowledge distillation for region-phrase grounding by matching the outputs of a region-phrase score function to soft targets computed from object detection results.

3 Approach

Consider $\mathcal{I} = \{I_i\}$ as the set of images and $\mathcal{T} = \{T_j\}$ as the set of sentences. Each image $I_i$ consists of a set of regions with their features $\{r^i_m\}$. Similarly, each sentence $T_j$ includes multiple phrase features $\{p^j_n\}$. Thus, $i, j$ index images and sentences, and $m, n$ index regions and phrases. Oftentimes, we have multiple sentences describing the same image and many more image regions than sentence phrases. Moreover, with minor abuse of notation, we denote $P(I_i, T_j)$ as the probability of a valid image-sentence pair $(I_i, T_j)$, i.e., $P(I_i, T_j) = 1$ if and only if $T_j$ can be used to describe $I_i$. Similarly, we use $P(r^i_m, p^j_n)$ as the probability of a valid region-phrase pair $(r^i_m, p^j_n)$.

Our goal is to learn a score function that measures the similarity between region features and phrase features. However, the learning of this function only has access to ground-truth image-sentence pairs, without knowing the matching between regions and phrases. To address this challenge of weakly supervised grounding, we leverage a generic object detector to label candidate image regions, based on which “pseudo” labels of region-phrase correspondence can be generated by matching the region object labels to the sentence phrases. Therefore, our key innovation is the design of a contrastive loss that learns to distill from object detection outputs. A major advantage of using knowledge distillation is that our method no longer requires object detection at inference time and thus is very efficient during inference. Fig. 1 presents an overview of our method.

We now present the details of our method by first introducing the design of our score functions for image-text matching, followed by our contrastive learning loss using knowledge distillation.

Figure 1: Overview of our method. We propose to distill from object detection outputs for weakly supervised phrase grounding. A contrastive learning framework is designed to account for both region-phrase and image-sentence matching. During training, region-phrase matching is learned by distilling from object detection outputs, while image-sentence matching is supervised by ground-truth image-sentence pairs. At inference time, our method no longer requires object detectors, achieving state-of-the-art results with significantly reduced inference cost.

3.1 Score Functions for Image-Text Matching

Our model builds on a two-branch network Wang et al. (2018) for image-text matching at both region-phrase and image-sentence levels. The key idea is to learn a score function that matches region-phrase pairs. Based on the region-phrase matching scores, we further construct an image-sentence similarity score. Specifically, our network has two branches $f(\cdot)$ and $g(\cdot)$ that take as input the region features $r^i_m$ and the phrase features $p^j_n$, respectively. Each branch is realized by a deep network that stacks multiple fully connected layers with ReLU activations in between, followed by an L2 normalization at the end. We define the similarity between a region-phrase pair $(r^i_m, p^j_n)$ as the cosine similarity between the transformed features $f(r^i_m)$ and $g(p^j_n)$, given by

$$ s(r^i_m, p^j_n) = \frac{f(r^i_m)^\top g(p^j_n)}{\|f(r^i_m)\|_2 \, \|g(p^j_n)\|_2}. \tag{1} $$

We further aggregate the region-phrase matching scores $s(r^i_m, p^j_n)$ into a similarity score between an image-sentence pair $(I_i, T_j)$, defined as

$$ S(I_i, T_j) = \sum_{n} \max_{m} \, s(r^i_m, p^j_n). \tag{2} $$

This image-sentence score is computed using greedy matching. Concretely, for each phrase $p^j_n$ in the sentence $T_j$, we find its best matching region among all candidates in the image $I_i$. The scores of the best matching regions are further summed across all phrases. Note that phrases and regions are not interchangeable in this score function, i.e., $S(I_i, T_j) \neq S(T_j, I_i)$. This is because each phrase must be matched to at least one region, while a region, e.g., a background region, might not be matched to any phrase. A similar image-sentence score function was also discussed in Karpathy et al. (2014); Zhou et al. (2018) for image-sentence retrieval.
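The two score functions described above can be illustrated with a minimal sketch in plain Python. The toy vectors stand in for the outputs of the two network branches, and the function names (`l2_normalize`, `region_phrase_score`, `image_sentence_score`) are our own illustrative choices, not names from the original implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def region_phrase_score(region_emb, phrase_emb):
    """Cosine similarity between an embedded region and an embedded phrase."""
    r = l2_normalize(region_emb)
    p = l2_normalize(phrase_emb)
    return sum(a * b for a, b in zip(r, p))

def image_sentence_score(region_embs, phrase_embs):
    """Greedy matching: each phrase takes its best region; scores are summed."""
    return sum(
        max(region_phrase_score(r, p) for r in region_embs)
        for p in phrase_embs
    )

# Toy embeddings (assumed values) standing in for the branch outputs.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phrases = [[2.0, 0.0], [0.0, 3.0]]
score = image_sentence_score(regions, phrases)
```

Because the maximum is taken over regions and the sum over phrases, swapping the two arguments generally changes the result, which mirrors the asymmetry noted in the text.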

3.2 Contrastive Learning with Knowledge Distillation

A major challenge of weakly supervised grounding is the lack of ground-truth region-phrase pairs. Our key idea is to make use of an object detector during training that can provide “pseudo” labels for learning region-phrase matching. We now describe how we generate the pseudo labels and how we supervise the learning of region-phrase matching.

Pseudo Labels for Region-Phrase Matching

An object detector predicts a distribution of object labels $o$, in the form of nouns (including “background”), for each candidate region $r^i_m$, i.e., $P(o \mid r^i_m)$. Each label $o$ can be further matched to a phrase $p^j_n$, e.g., using similarity scores between the object noun and the headnoun of the phrase. Let $P(p^j_n \mid o)$ be the matching probability between $p^j_n$ and $o$. We propose to approximate the unknown region-phrase matching ground truth $P(r^i_m, p^j_n)$ by a soft “pseudo” label $\hat{P}(r^i_m, p^j_n)$, given by

$$ \hat{P}(r^i_m, p^j_n) = \sum_{o} P(p^j_n \mid o) \, P(o \mid r^i_m). \tag{3} $$

$\hat{P}(r^i_m, p^j_n)$ can be considered as a soft target distribution given by the matching between object detection outputs and the candidate phrases.
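As an illustration of how such soft targets might be assembled, the sketch below combines a detector's label distribution with a label-to-phrase matching table by marginalizing over object labels. The marginalization, the function name `pseudo_labels`, and all numbers are assumptions made for this example.

```python
def pseudo_labels(det_probs, match_probs):
    """Soft region-phrase targets.

    det_probs[m][o]   : detector probability of object label o for region m.
    match_probs[o][n] : matching probability between label o and phrase n.
    Returns targets[m][n] = sum over o of match_probs[o][n] * det_probs[m][o].
    """
    num_phrases = len(match_probs[0])
    return [
        [sum(d * match_probs[o][n] for o, d in enumerate(region))
         for n in range(num_phrases)]
        for region in det_probs
    ]

# Hypothetical labels: 0 = "dog", 1 = "person", 2 = "background".
det_probs = [[0.8, 0.1, 0.1],
             [0.1, 0.7, 0.2]]
# Hypothetical phrases: 0 = "a running puppy", 1 = "a man".
match_probs = [[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]]
targets = pseudo_labels(det_probs, match_probs)
```

In this toy case the first region receives most of its target mass on the "puppy" phrase and the second on the "man" phrase, while probability assigned to "background" contributes to neither.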

Distilling Knowledge from Pseudo Labels

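We propose to distill from the pseudo label by aligning the region-phrase matching scores $s(r^i_m, p^j_n)$ to the soft pseudo label $\hat{P}(r^i_m, p^j_n)$. Specifically, given a matching image-sentence pair $(I_i, T_j)$, we propose the following distillation loss function for region-phrase matching

$$ \mathcal{L}_{reg}(I_i, T_j) = -\sum_{n} \sum_{m} \hat{P}(r^i_m, p^j_n) \log \frac{\exp\left(s(r^i_m, p^j_n)/\tau\right)}{\exp\left(s(r^i_m, p^j_n)/\tau\right) + \sum_{r' \in \mathcal{R}_m} \exp\left(s(r', p^j_n)/\tau\right)}, \tag{4} $$

where $\tau$ is the temperature scale factor (0.5 in all our experiments). $\mathcal{R}_m$ controls how we select the contrasting regions $r'$. A simple choice is to use all regions in $I_i$ except $r^i_m$, i.e., $\mathcal{R}_m = \{r^i_k : k \neq m\}$. In this case, our loss can be interpreted as the cross entropy loss, where the normalized output of the score function is trained to mimic the soft target given by object detection outputs, thereby resembling the same idea as knowledge distillation Hinton et al. (2015).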
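The cross-entropy reading of the distillation loss can be sketched as follows, assuming the simple choice in which each phrase's scores are normalized over all regions of the same image. The function name `distillation_loss` and the toy score and target values are illustrative assumptions; only the temperature of 0.5 comes from the text.

```python
import math

def distillation_loss(scores, targets, tau=0.5):
    """Cross entropy between softmax-normalized scores and soft targets.

    scores[m][n]  : region-phrase score for region m and phrase n.
    targets[m][n] : soft pseudo label for the same pair.
    For each phrase, scores are normalized over all regions of the image.
    """
    num_regions, num_phrases = len(scores), len(scores[0])
    loss = 0.0
    for n in range(num_phrases):
        logits = [scores[m][n] / tau for m in range(num_regions)]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        for m in range(num_regions):
            loss -= targets[m][n] * (logits[m] - log_z)
    return loss

# Assumed toy values: scores that agree with the targets vs. scores that do not.
targets = [[0.8, 0.1], [0.1, 0.7]]
aligned = distillation_loss([[0.9, 0.1], [0.2, 0.8]], targets)
flipped = distillation_loss([[0.1, 0.9], [0.8, 0.2]], targets)
```

Scores that rank regions the same way as the soft targets yield the smaller loss, which is the behavior the distillation term is meant to encourage.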

Contrastive Learning for Image Sentence Matching

Moving beyond region-phrase matching, we enforce additional constraints on the image-sentence matching scores $S(I_i, T_j)$, where the ground-truth pairs $P(I_i, T_j)$ are readily available. To this end, we make use of the noise contrastive estimation loss Gutmann and Hyvärinen (2010) to contrast samples from the data distribution (matched pairs) and the noise distribution (non-matched pairs). The NCE loss for image-sentence matching is thus given by

$$ \mathcal{L}_{sent}(T_j) = -\sum_{I_i \in \mathcal{I}} P(I_i, T_j) \log \frac{\exp\left(S(I_i, T_j)/\tau\right)}{\exp\left(S(I_i, T_j)/\tau\right) + \sum_{I' \in \mathcal{N}_j} \exp\left(S(I', T_j)/\tau\right)}, \tag{5} $$

where $\tau$ is again the temperature scale factor (0.5). $P(I_i, T_j)$ is reduced to binary values during training, i.e., $P(I_i, T_j) = 1$ if and only if $(I_i, T_j)$ is a ground-truth image-sentence pair. $\mathcal{N}_j$ includes a set of negative samples, i.e., those images not matched to the current sentence $T_j$, sampled from the set of images $\mathcal{I}$. In practice, we always sample a fixed number of negative pairs from the current mini-batch.
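A minimal sketch of the image-sentence contrastive loss with in-batch negatives is given below. It assumes a square score matrix in which the diagonal holds the ground-truth pairs and every other image in the mini-batch acts as a negative; averaging over the batch and the name `nce_loss` are our own choices for this example.

```python
import math

def nce_loss(score_matrix, tau=0.5):
    """In-batch contrastive loss over image-sentence scores.

    score_matrix[i][j] : image-sentence score for image i and sentence j.
    Pair (j, j) is assumed to be the ground-truth match; all other images
    in the batch serve as negatives for sentence j.
    """
    batch = len(score_matrix)
    loss = 0.0
    for j in range(batch):
        logits = [score_matrix[i][j] / tau for i in range(batch)]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss -= logits[j] - log_z
    return loss / batch

# Assumed toy scores: matched pairs on the diagonal score higher.
separated = nce_loss([[1.0, 0.0], [0.0, 1.0]])
uninformative = nce_loss([[0.0, 0.0], [0.0, 0.0]])
```

With a mini-batch of 32 pairs, as used in the experiments, each sentence would be contrasted against 31 negative images under this scheme.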

We note that Eq. 4 and Eq. 5 share a similar form and can be both considered as a variant of contrastive loss. Concretely, the two loss functions seek to align the normalized scores in the form of NCE to a target distribution. The difference is how the target distribution is defined and how the samples are selected for normalization. For region-phrase matching, the target distribution is pseudo labels from object detection and local image regions are used for normalization. For image-sentence matching, the target distribution is defined by ground-truth image-sentence pairs and non-matched image-sentence pairs are sampled for normalization.

Training and Inference

For training, our final loss function is a weighted sum of the region-phrase matching loss $\mathcal{L}_{reg}$ and the image-sentence matching loss $\mathcal{L}_{sent}$, given by

$$ \mathcal{L} = \mathcal{L}_{sent} + \lambda \, \mathcal{L}_{reg}, \tag{6} $$

where $\lambda$ is the coefficient that balances the two loss terms. During training, we gradually increase the coefficient $\lambda$, such that our model learns to optimize image-sentence matching during the early stage of training, and to focus on region-phrase matching during the late stage of training.
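The text does not specify how the balancing coefficient is increased. One possible realization, shown purely as an assumption, is a linear ramp from zero to a maximum value; the names `ramp_coefficient` and `joint_loss` and the ramp length are hypothetical.

```python
def ramp_coefficient(step, total_steps, max_value=1.0):
    """Linearly increase the balancing coefficient from 0 to max_value.

    The linear shape and max_value are assumptions; the paper only states
    that the coefficient is gradually increased during training.
    """
    if total_steps <= 0:
        return max_value
    return max_value * min(1.0, max(0.0, step / total_steps))

def joint_loss(sentence_loss, region_loss, coefficient):
    """Image-sentence loss plus the weighted region-phrase loss."""
    return sentence_loss + coefficient * region_loss

early = joint_loss(1.0, 2.0, ramp_coefficient(0, 100))    # region term off
late = joint_loss(1.0, 2.0, ramp_coefficient(100, 100))   # region term fully on
```

Early in training the objective reduces to image-sentence matching, and the region-phrase term gains weight as the ramp progresses.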

During inference, given an input image-sentence pair, we apply the learned region-phrase score function to every region-phrase pair. For each phrase, the image region with the highest score is then selected as the grounding result. We point out that inference with our model does not require running object detection; our method is therefore very efficient at test time.
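The inference rule amounts to an argmax over regions for each phrase. A small sketch follows, assuming the scores have already been computed and that each region comes with a box in `(x1, y1, x2, y2)` form; the function name `ground_phrases` and the sample values are illustrative.

```python
def ground_phrases(scores, boxes):
    """Pick, for every phrase, the box of the highest-scoring region.

    scores[m][n] : learned score of region m for phrase n.
    boxes[m]     : (x1, y1, x2, y2) box of region m.
    """
    num_phrases = len(scores[0])
    results = []
    for n in range(num_phrases):
        best = max(range(len(scores)), key=lambda m: scores[m][n])
        results.append(boxes[best])
    return results

# Assumed toy inputs: three candidate regions, two phrases.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
scores = [[0.2, 0.7],
          [0.9, 0.1],
          [0.4, 0.3]]
grounded = ground_phrases(scores, boxes)
```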

4 Experiments

We now present our experiments and results. We first discuss our datasets and implementation details, followed by an ablation study of our model, and finally a comparison of our results to latest methods.

Datasets and Experiment Setup

Our experiments are conducted on two major visual grounding datasets: Flickr30K Entities Plummer et al. (2015) and ReferItGame Kazemzadeh et al. (2014). Flickr30K Entities includes around 30K images, each associated with five sentences. We follow the same train/val/test splits as Plummer et al. (2015). For the ReferItGame dataset, we follow the standard split of Rohrbach et al. (2016). We follow the setting of weakly supervised grounding, and do not use the region-phrase annotations of either dataset during training. Following standard evaluation protocols in Chen et al. (2018); Rohrbach et al. (2016); Wang and Specia (2019), we report accuracy as the evaluation metric. Accuracy is defined as the fraction of query phrases whose predicted bounding box overlaps the ground-truth box with IoU > 0.5. For methods that select the predicted bounding box from a set of region proposals, the metric is equivalent to top-1 recall.
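For concreteness, the accuracy metric can be sketched as below, assuming boxes in `(x1, y1, x2, y2)` form and the commonly used intersection-over-union (IoU) threshold of 0.5. The helper names `iou` and `grounding_accuracy` are our own.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of phrases whose predicted box exceeds the IoU threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Assumed toy boxes: one exact hit and one box with IoU of 1/3.
predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
ground_truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
accuracy = grounding_accuracy(predicted, ground_truth)
```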

4.1 Implementation Details

We now describe our implementation details, including the features and object detectors, the network architecture and training scheme, and the details of object-phrase matching.

Features and Object Detectors

To establish a fair comparison to previous work using region features extracted from different backbones, we benchmark our method by varying the backbone networks. We follow the same settings as Chen et al. (2018); Wang and Specia (2019) to extract activations from the last layer before the classification head in Faster R-CNN Ren et al. (2015) with VGG16 and ResNet-101 backbones pre-trained on PASCAL VOC (PV) Everingham et al. (2010) or MS COCO (CC) Lin et al. (2014). To further compare with the recent work of WPT Wang and Specia (2019) using object detectors trained on Open Images Dataset Krasin et al. (2017), we also extract classifier logits from Faster R-CNN with Inception-ResNet-V2 (IRV2) backbone pre-trained on Open Images Dataset (OI). We denote these feature choices as “VGG16”, “Res101”, and “IRV2”, respectively, plus the object dataset when reporting our results. For example, “IRV2 OI” means that the backbone is Inception-ResNet-V2 (IRV2) pre-trained on the Open Images (OI) Dataset.

Network Architecture

We normalized the last-layer activations to zero mean and unit variance using statistics estimated on training samples. We find that this normalization helps our model converge faster. For phrase representation, we used an LSTM Hochreiter and Schmidhuber (1997) encoder with GloVe embeddings Pennington et al. (2014). The embedding vocabulary contains the most frequent 13K tokens from the Flickr30K Entities training split. The same vocabulary is used for ReferItGame. The LSTM has two layers, with both embedding and hidden space dimension as 300. Max pooling is applied over the hidden states of all tokens, followed by two fully connected layers (1024->512) for the phrase representation. For visual representation, we attached two fully connected layers (1024->512) on top of the region features.

Training Details

We trained our model using Adam Kingma and Ba (2015) with a learning rate of 0.0001. We used a mini-batch size of 32 image-sentence pairs (31 negative images per sentence for the contrastive loss). Unlike Fang et al. (2019), we did not fine-tune our vision backbone during training, for efficiency. Similarly, the GloVe embeddings Pennington et al. (2014) also stayed unchanged during training. We observed in our experiments that the model converges quickly within a few epochs on both Flickr30K Entities and ReferItGame datasets. Our implementation is in TensorFlow and will be made publicly available.

Object-Phrase Matching

We made use of the WordNet Miller (1995) to define a similarity score between object labels of image regions and sentence phrases. We empirically observe that using WordNet is more reliable than using word embedding for noun matching. Specifically, the headnoun for each phrase was first identified using the off-the-shelf POS tagger provided by NLTK Bird and Loper (2004), which uses the Penn Treebank tag set. If the headnoun matches one of the detector class names, the phrase was further mapped to the class. If not, the headnoun was looked up in WordNet Miller (1995) to find its corresponding synset, as well as the synset’s lemmas and hypernyms. If any of them exists in the object classes, the phrase was mapped to the corresponding class. If there are multiple synsets for a phrase, the most frequent one was considered. The WordNet synset was able to resolve phrases such as “spectators” to “person” and “sweater” to “clothing”. With the classes in Open Images Dataset Krasin et al. (2017), our matching strategy expanded the mapping to k out of k unique phrases in Flickr30k Entities and k out of k in ReferItGame training set.
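The lookup order described above (exact class match first, then hypernyms) can be sketched with a toy stand-in for WordNet. The class set and the `TOY_HYPERNYMS` table below are hypothetical and deliberately tiny; a real pipeline would query WordNet through NLTK and would also lemmatize plural nouns, which this sketch omits.

```python
# Hypothetical detector classes and a toy hypernym table standing in for WordNet.
DETECTOR_CLASSES = {"person", "clothing", "dog"}
TOY_HYPERNYMS = {
    "spectator": ["viewer", "person"],
    "sweater": ["garment", "clothing"],
    "puppy": ["dog"],
}

def map_headnoun_to_class(headnoun):
    """Return the detector class for a headnoun, or None if nothing matches."""
    noun = headnoun.lower()
    if noun in DETECTOR_CLASSES:          # direct match with a class name
        return noun
    for ancestor in TOY_HYPERNYMS.get(noun, []):
        if ancestor in DETECTOR_CLASSES:  # first matching hypernym wins
            return ancestor
    return None

mapped = [map_headnoun_to_class(n) for n in ["Dog", "spectator", "sweater", "sky"]]
```

Phrases whose headnoun cannot be mapped, such as "sky" in this toy setup, simply receive no detector class.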

4.2 Ablation Study

| Method | Flickr30K ACC (%) | ReferItGame ACC (%) |
| --- | --- | --- |
| Max Margin | 42.11 | 22.94 |
| NCE | 48.35 | 26.63 |
| Distill | 45.05 | 17.25 |
| NCE+Distill | 50.96 | 27.59 |

Table 1: Ablation results of our proposed methods on Flickr30K Entities and ReferItGame. Region features are from the Faster R-CNN ResNet-101 model pre-trained on COCO. Classifier logits for distillation are from the Faster R-CNN Inception-ResNet-V2 model pre-trained on Open Images Dataset.

To fully understand our model, we conduct ablation studies on both Flickr30K Entities and ReferItGame datasets. Specifically, we consider four different variants of our model: (1) our model with only image-sentence score function (Eq. 2) supervised by a max margin loss following  Karpathy et al. (2014); Zhao et al. (2018), denoted as “Max Margin”; (2) our model with only image-sentence score function (Eq. 2) supervised by the NCE loss (Eq. 5), denoted as “NCE”; (3) our model with only region-phrase score function (Eq. 1) supervised by the distillation loss (Eq. 4), denoted as “Distill”; and (4) our full model with both region-phrase and image-sentence score functions supervised by our joint loss (Eq. 6), denoted as “NCE+Distill”.

Table 1 presents our ablation results. First, we observe that NCE loss substantially outperforms the standard max margin loss by +6.2%/+3.7% on Flickr30K Entities and ReferItGame, respectively. These results suggest the effectiveness of contrastive learning, as also demonstrated in the concurrent work Gupta et al. (2020). Moreover, using only distillation loss for region-phrase matching (Distill) under-performs NCE. However, our full model that combines both region-phrase and image-sentence matching using the joint loss brings a large boost over NCE. We conjecture that NCE and Distill provide complementary information for phrase grounding. Finally, Figure 2 visualizes the grounding results of NCE and NCE+Distill. Our full model (NCE+Distill) can better locate objects corresponding to the current phrase.

Figure 2: Visualization of region-phrase matching. We compare results of using only NCE loss (left), and using our full model NCE+Distill (right). For each pixel, we compute a matching score by averaging scores from all proposals covering the pixel. The red color corresponds to high matching scores. Our knowledge distillation for region-phrase matching helps to better identify the extent of objects.

| Method | Backbone | Requires detector (train) | Requires detector (inference) | Detector | ACC (%) |
| --- | --- | --- | --- | --- | --- |
| GroundeR Rohrbach et al. (2016) | VGG16 PV | N/A | N/A | N/A | 28.94 |
| MATN Zhao et al. (2018) | VGG16 PV | N/A | N/A | N/A | 33.10 |
| UTG Yeh et al. (2018) | N/A | Yes | Yes | VGG16 PV | 35.90 |
| | N/A | Yes | Yes | YOLOv2 CC | 36.93 |
| KAC Chen et al. (2018) | VGG16 PV | Yes | Yes | VGG16 PV | 36.14 |
| | VGG16 PV | Yes | Yes | VGG16 CC | 38.71 |
| MTG Fang et al. (2019) | Res101 CC+Res50 CC+Res50 CL | N/A | N/A | N/A | 48.66 |
| WPT Wang and Specia (2019) (w2v-max union) | N/A | Yes | Yes | IRV2 CC | 37.57 |
| | N/A | Yes | Yes | IRV2 CC+IRV2 OI | 48.20 |
| | N/A | Yes | Yes | IRV2 CC+IRV2 OI | 50.49 |
| NCE+Distillation (Ours) | VGG16 PV | Yes | No | VGG16 CC | 40.38 |
| | Res101 CC | Yes | No | IRV2 OI | 50.96 |

Table 2: Results on Flickr30K Entities. We report phrase localization accuracy and list the settings of different methods. “Backbone” denotes the visual backbone used to extract region features. “Detector” denotes the detector that provides external knowledge. Dataset notations: PV=PASCAL VOC, CC=COCO, OI=Open Images, CL=Color Name, and PL=Place365.

4.3 Comparison to Latest Methods

We further compare our results to latest methods of weakly supervised phrase grounding on both Flickr30K Entities and ReferItGame.


We consider a number of baselines. Our main competitors are those methods using object detectors, including KAC Chen et al. (2018), UTG Yeh et al. (2018), MTG Fang et al. (2019) and WPT Wang and Specia (2019). Among these methods, KAC and UTG used detectors during both training and inference. MTG made use of detectors during training and WPT applied detectors during inference. While these baselines have very different sets of detectors and backbones, we try to match their settings in our experiments. Moreover, to make a fair comparison with WPT, we handle plural head nouns following their “union” strategy for grounding multiple instances. For example, given a plural head noun, such as “men”, we report the minimum bounding box enclosing the top 5 ranked proposals as the predicted bounding box. We detect such phrases automatically using the NLTK WordNet morphy library. Our baselines also include previous methods that do not use object detectors, such as MATN Zhao et al. (2018) and GroundeR Rohrbach et al. (2016), for completeness.
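The “union” strategy for plural head nouns can be sketched as follows, under the assumption that it returns the smallest box enclosing the top-ranked proposals. Boxes are in `(x1, y1, x2, y2)` form, and the function name `union_box` is our own.

```python
def union_box(boxes, scores, top_k=5):
    """Smallest box enclosing the top_k highest-scoring proposals."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    top = [box for _, box in ranked[:top_k]]
    return (
        min(b[0] for b in top),
        min(b[1] for b in top),
        max(b[2] for b in top),
        max(b[3] for b in top),
    )

# Assumed toy proposals; with top_k=2 the low-scoring third box is ignored.
boxes = [(0, 0, 1, 1), (2, 2, 3, 3), (5, 5, 6, 6)]
scores = [0.9, 0.8, 0.1]
merged = union_box(boxes, scores, top_k=2)
```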

| Method | Backbone | Requires detector (train) | Requires detector (inference) | Detector | ACC (%) |
| --- | --- | --- | --- | --- | --- |
| GroundeR Rohrbach et al. (2016) | VGG16 PV | N/A | N/A | N/A | 10.70 |
| MATN Zhao et al. (2018) | VGG16 PV | N/A | N/A | N/A | 13.61 |
| UTG Yeh et al. (2018) | N/A | Yes | Yes | VGG16 CC+YOLOv2 CC | 20.91 |
| KAC Chen et al. (2018) | VGG16 PV | Yes | Yes | VGG16 PV | 13.38 |
| | VGG16 PV | Yes | Yes | VGG16 CC | 15.83 |
| WPT Wang and Specia (2019) (w2v-max union) | N/A | Yes | Yes | IRV2 CC | 15.40 |
| | N/A | Yes | Yes | IRV2 CC+IRV2 OI | 26.48 |
| NCE+Distillation (Ours) | VGG16 PV | Yes | No | VGG16 CC | 24.52 |
| | Res101 CC | Yes | No | IRV2 OI | 27.59 |

Table 3: Results on ReferItGame. We report phrase localization accuracy and settings of different methods. “Backbone” denotes the visual backbone used to extract region features. “Detector” denotes the detector that provides external knowledge. Dataset notations: PV=PASCAL VOC, CC=COCO, OI=Open Images, CL=Color Name, and PL=Place365.


Our results are summarized in Table 2 (Flickr30K Entities) and Table 3 (ReferItGame). Tables 2 and 3 compare both the settings of different methods and their phrase localization accuracy. Not surprisingly, methods using object detectors perform better than those not using detectors. Among all methods, our NCE+Distillation achieves the best performance on both datasets. Specifically, in comparison to UTG and KAC, our method removes the need for an object detector at inference time, and shows a large performance boost (+3.4%/+3.6% on Flickr30K Entities and ReferItGame for UTG and +1.7%/+8.7% on Flickr30K Entities and ReferItGame for KAC) when using the same backbones and detectors, as well as similar pre-training schemes. Our inference setting is similar to MTG. However, our results are significantly better (+2.3% on Flickr30K Entities, 50.96% vs. 48.66% in Table 2), even though our method only uses a single backbone network during inference (vs. three backbones in MTG).

When using a stronger backbone (Res101 CC) and a better detector pre-trained on a larger-scale dataset (IRV2 OI), our results are further improved by 10.6% and 3.0% on Flickr30K Entities and ReferItGame, respectively. Our final results thus outperform the latest method of WPT under a similar training setting. In comparison to WPT, our method does not require object detectors during inference, and is thus more applicable to real-world deployment. Finally, our results also outperform the results of a concurrent work from Gupta et al. (2020) (50.96% vs. 47.88%) when trained on the Flickr30K Entities dataset. A better result (51.67%) was reported in Gupta et al. (2020) by training on the COCO Caption dataset Chen et al. (2015) and using a strong language model (BERT Devlin et al. (2019)). We conjecture that the same practices can help to further improve the performance of our model.

5 Conclusion

In this paper, we presented a novel contrastive learning framework for weakly supervised visual phrase grounding. The key idea of our method is to learn a score function measuring the similarity between region-phrase pairs, distilled from object detection outputs and further supervised by image-sentence pairs. Once learned, this score function can be used for visual grounding without the need of object detectors at test time. While conceptually simple, our method demonstrated strong results on major benchmarks, surpassing state-of-the-art methods that use expensive object detectors. Our work thus offers a principled approach to leverage object information, as well as an efficient method for weakly supervised grounding. We believe that our work provides a step forward towards modeling the link between vision and language.


  • J. Ba and R. Caruana (2014) Do deep nets really need to be deep? In NeurIPS, pp. 2654–2662. Cited by: §2.
  • S. Bird and E. Loper (2004) NLTK: the natural language toolkit. In ACL Interactive Poster and Demonstration Sessions, pp. 214–217. Cited by: §4.1.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In SIGKDD, Cited by: §2.
  • K. Chen, J. Gao, and R. Nevatia (2018) Knowledge aided consistency for weakly supervised phrase grounding. In CVPR, Cited by: §1, §1, §4, §4.1, §4.3, Table 2, Table 3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §4.3.
  • T. Do, T. Do, H. Tran, E. Tjiputra, and Q. D. Tran (2019) Compact trilinear interaction for visual question answering. In ICCV, pp. 392–401. Cited by: §2.
  • M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes VOC dataset and challenge. IJCV. Cited by: §4.1.
  • Z. Fang, S. Kong, C. Fowlkes, and Y. Yang (2019) Modularized textual grounding for counterfactual resilience. In CVPR, pp. 6378–6388. Cited by: §1, §1, §1, §2, §4.1, §4.3, Table 2.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, Cited by: §1, §2.
  • N. Garcia, P. Morerio, and V. Murino (2018) Modality distillation with multiple stream networks for action recognition. In ECCV, Cited by: §2.
  • S. Gupta, J. Hoffman, and J. Malik (2016) Cross modal distillation for supervision transfer. In CVPR, pp. 2827–2836. Cited by: §2.
  • T. Gupta, A. Vahdat, G. Chechik, X. Yang, J. Kautz, and D. Hoiem (2020) Contrastive learning for weakly supervised phrase grounding. arXiv preprint arXiv:2006.09920. Cited by: §2, §4.2, §4.3.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS, pp. 297–304. Cited by: §1, §2, §3.2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2, §3.2.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • A. Karpathy, A. Joulin, and L. Fei-Fei (2014) Deep fragment embeddings for bidirectional image sentence mapping. In NeurIPS, Cited by: §3.1, §4.2.
  • S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) ReferItGame: referring to objects in photographs of natural scenes. In EMNLP, pp. 787–798. Cited by: §1, §4.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, et al. (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2 (3), pp. 18. Cited by: §4.1, §4.1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §2.
  • J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong (2017) Large-scale domain adaptation via teacher-student learning. arXiv preprint arXiv:1708.05466. Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §4.1.
  • Z. Luo, J. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei (2018) Graph distillation for action detection with privileged modalities. In ECCV, Cited by: §2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §4.1.
  • J. Mun, K. Lee, J. Shin, and B. Han (2018) Learning to specialize with knowledge distillation for visual question answering. In NeurIPS, pp. 8081–8091. Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • B. Pan, H. Cai, D. Huang, K. Lee, A. Gaidon, E. Adeli, and J. C. Niebles (2020) Spatio-temporal graph for video captioning with knowledge distillation. In CVPR, Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, Cited by: §4.1, §4.1.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30K entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: §1, §1, §2, §4.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §4.1.
  • A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In ECCV, Cited by: §1, §2, §4, §4.3, Table 2, Table 3.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §2.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.
  • J. Wang and L. Specia (2019) Phrase localization without paired training examples. In CVPR, pp. 4663–4672. Cited by: §1, §1, §1, §2, §2, §4, §4.1, §4.3, Table 2, Table 3.
  • L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE TPAMI. Cited by: §1, §1, §2, §3.1.
  • F. Xiao, L. Sigal, and Y. J. Lee (2017) Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, Cited by: §2.
  • R. Yeh, M. N. Do, and A. G. Schwing (2018) Unsupervised textual grounding: linking words to image concepts. In CVPR, Cited by: §1, §1, §2, §4.3, Table 2, Table 3.
  • R. Yeh, J. Xiong, W. Hwu, M. Do, and A. Schwing (2017) Interpretable and globally optimal prediction for textual grounding using image concepts. In NeurIPS, Cited by: §1, §2.
  • S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, Cited by: §2.
  • H. Zhang, Y. Niu, and S. Chang (2018a) Grounding referring expressions in images by variational context. In CVPR, Cited by: §1.
  • J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018b) Top-down neural attention by excitation backprop. IJCV. Cited by: §1.
  • Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha (2020) Object relational graph with teacher-recommended learning for video captioning. In CVPR, Cited by: §2.
  • F. Zhao, J. Li, J. Zhao, and J. Feng (2018) Weakly supervised phrase localization with multi-scale anchored transformer network. In CVPR, Cited by: §1, §2, §4.2, §4.3, Table 2, Table 3.
  • L. Zhou, N. Louis, and J. J. Corso (2018) Weakly-supervised video object grounding from text by loss weighting and object interaction. In BMVC, Cited by: §3.1.
  • Y. Zhou, M. Wang, D. Liu, Z. Hu, and H. Zhang (2020) More grounded image captioning by distilling image-text matching model. In CVPR, Cited by: §2.