Contrastive Learning for Weakly Supervised Phrase Grounding

by   Tanmay Gupta, et al.

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a ∼10% absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on Flickr30K Entities benchmark.





1 Introduction

Humans can learn from captioned images because of their ability to associate words to image regions. For instance, humans perform such word-region associations while acquiring facts from news photos, making a diagnosis from MRI scans and radiologist reports, or enjoying a movie with subtitles. This word-region association problem is called word or phrase grounding, and it is a crucial capability needed for downstream applications like visual question answering, image captioning, and text-image retrieval.

Figure 1: Overview of our contrastive learning framework. We begin by extracting region and word features using an object detector and a language model respectively. Contrastive learning trains a word-region attention mechanism as part of a compatibility function between the set of region features from an image and individual contextualized word representations. The compatibility function is trained to maximize a lower bound on mutual information with two losses. For a given caption word, the image loss L_img learns to produce a higher compatibility for the true image than for a negative image in the mini-batch. The language loss L_lang learns to produce a higher compatibility of an image with a true caption word than with a word in a negative caption. We construct negative captions by substituting a noun word like "donut" in the true caption with contextually plausible but untrue words like "cookie" using a language model.

Existing object detectors can detect and represent object regions in an image, and language models can provide contextualized representations for noun phrases in the caption. However, learning a mapping between these continuous, independently trained visual and textual representations is challenging in the absence of explicit region-word annotations. We focus on learning this mapping from weak supervision in the form of paired image-caption data without requiring laborious grounding annotations.

Current state-of-the-art approaches [11, 1, 33] formulate weakly supervised phrase grounding as a multiple instance learning (MIL) problem [25, 18]. The image can be viewed as a bag of regions. For a given phrase, all images with captions containing the phrase are treated as positive bags while the remaining images are treated as negatives. Models aggregate per-region features or phrase scores to construct image-level predictions that can be supervised with image-level labels in the form of phrases or captions. Common aggregation approaches include max or mean pooling, noisy-OR [13], and attention [11, 18]. Popular training objectives include a binary classification loss [13] (whether the image contains the phrase), a caption reconstruction loss [33] (a generalization of binary classification to caption prediction), or ranking objectives [1, 11] (do true image-caption or image-phrase pairs score higher than negative pairs?).

Fig. 1 provides an overview of our proposed contrastive training. We propose a novel formulation of the weakly supervised phrase grounding problem as that of maximizing a lower bound on mutual information between the set of region features extracted from an image and contextualized word representations. We use pretrained region and word representations from an object detector and a language model, and perform optimization over the parameters of word-region attention instead of optimizing the region and word representations themselves. Intuitively, to compute mutual information with a word's representation, attention must discard nuisance regions in the word-conditional attended visual representation, thereby selecting regions that match the word. For any given word, the learned attention thus functions as a soft selection or grounding mechanism over regions.

Since computing MI is intractable, we maximize the recently introduced InfoNCE lower bound [30] on mutual information. The InfoNCE bound requires a compatibility score between each caption word and the image to contrast positive image and caption-word pairs with negative pairs in a minibatch. We use two objectives. The first objective (L_img in Fig. 1) contrasts a positive pair with negative pairs with the same caption word but different image regions. The second objective (L_lang in Fig. 1) contrasts a positive pair with negative pairs with the same image but different captions. We show empirically that sampling negative captions randomly from the training data to optimize L_lang does not yield any gains over optimizing L_img alone. Instead of random sampling, we propose to use a language model to construct context-preserving negative captions by substituting a single noun word in the caption.

We design the compatibility function using a query-key-value attention mechanism. The queries and keys, computed from words and regions respectively, are used to compute a word-specific attention over each region which acts as a soft alignment or grounding between words and regions. The compatibility score between the regions and a word is computed by comparing the attended visual representation with the word representation.

Our key contributions are: (i) a novel MI based contrastive training framework for weakly supervised phrase grounding; (ii) an InfoNCE compatibility function between a set of regions and a caption word designed for phrase grounding; and (iii) a procedure for constructing context-preserving negative captions that provides a ∼10% absolute gain in grounding performance.

1.1 Related Work

Our work is closely related to three active areas of research. We now provide an overview of prior art in each.

Weakly Supervised Phrase Grounding.

Weakly supervised phrase localization is typically posed as a multiple instance learning (MIL) problem [25, 18] where each image is considered as a bag of region proposals. Images whose captions mention a word or a phrase are treated as positive bags while the rest of the images are treated as negatives for that word or phrase. Features or scores for a phrase or the entire caption are aggregated across all regions to make a prediction for the image. Common methods of aggregation are max or average pooling, noisy-OR [13], or attention [33, 18]. With the ability to produce image-level scores for pairs of images and phrases or captions, the problem becomes an image-level fully-supervised phrase classification problem [13] or an image-caption retrieval problem [1, 11]. An alternative to the MIL formulations is the approach of Ye et al. [44], which uses a statistical hypothesis testing approach to link concepts detected in an image and words mentioned in the sentence. While all the above approaches assume paired image-caption data, Wang et al. [42] recently address the problem of phrase grounding without access to image-caption pairs. Instead they assume access to a set of scene and color classifiers and object detectors to detect concepts in the scene, and use word2vec [27] similarity between concept labels and caption words to achieve grounding.

MI-based Representation Learning.

Recently, MI-based approaches have shown promising results on a variety of representation learning problems. Computing the MI between two representations is challenging as we often have access to samples but not the underlying joint distribution that generated the samples. Thus, recent efforts rely on variational estimation of MI [3, 20, 6, 30]. An overview of such estimators is provided in [31, 40] while their statistical limitations are reviewed in [26, 34].

In practice, MI-based representation learning models are often trained by maximizing an estimate of MI across different transformations of data. For example, Deep InfoMax [17] maximizes MI between local and global representations using MINE [6]. Contrastive predictive coding [30, 16], inspired by noise contrastive estimation [14, 29], assumes an order in the features extracted from an image and uses summary features to predict future features. Contrastive multiview coding [39] maximizes MI between different color channels or data modalities, while Augmented Multiscale Deep InfoMax [5] and SimCLR [8] extract views using different augmentations of data points. Since the InfoNCE loss is limited by the batch size, several previous works rely on memory banks [43, 28, 15] to increase the set of negative instances.

Joint Image-Text Representation Learning.

With the advances in both visual analysis and natural language understanding, there has been a recent shift towards learning representations jointly from both visual and textual domains [23, 35, 24, 37, 38, 45, 22, 9, 2, 36]. ViLBERT [24] and LXMERT [38] learn representations from both modalities using two-stream transformers applied to image and text independently. In contrast, UNITER [9], VisualBERT [23], Unicoder-VL [22], VL-BERT [35], and B2T2 [2] propose a unified single architecture that learns representations jointly from both domains. Our method is similar to the first group, but differs in its fundamental goal. Instead of focusing on learning a task-agnostic representation for a range of downstream tasks, we are interested in the quality of the region-phrase grounding that emerges from maximizing mutual information. Moreover, we rely on the language modality as a weak training signal for grounding, and we perform phrase grounding without any further finetuning.

2 Method

Consider the set of region features and the contextualized word representations as two multivariate random variables. Intuitively, estimating MI between them requires extracting the information content shared by these two variables. We model this MI estimation as maximizing a lower bound on MI with respect to the parameters of a word-region attention model. This maximization forces the attention model to downweight regions from the image that do not match the word, and to attend to the image regions that contain the most shared information with the word representation.

Sec. 2.1 describes MI and the InfoNCE lower bound. Sec. 2.2 introduces notation and InfoNCE based objective for learning phrase grounding from paired image caption data. Sec. 2.3 presents the design of a word-region attention based compatibility function which is part of the InfoNCE objective.

2.1 InfoNCE Lower Bound on Mutual Information

Let x and y be random variables drawn from a joint distribution with density p(x, y). The MI between x and y measures the amount of information that these two variables share:

I(x; y) = E_{p(x,y)} [ log ( p(x, y) / (p(x) p(y)) ) ],

which is also the Kullback-Leibler divergence from p(x, y) to p(x) p(y).
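To make the definition concrete, the MI of a small discrete joint distribution can be computed directly from this formula. The sketch below is purely illustrative and not part of the paper's pipeline:

```python
import math

def mutual_information(joint):
    """MI of a discrete joint distribution given as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(x; y) = sum over (x, y) of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly correlated binary variables share log(2) nats of information:
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # ≈ 0.693 (log 2)
```

Independent variables, e.g. the uniform joint over all four (x, y) pairs, give zero MI under the same formula.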

However, computing MI is intractable in general because it requires complete knowledge of the joint and marginal distributions. Among the existing MI estimators, the InfoNCE [30] lower bound provides a low-variance estimate of MI for high dimensional data, albeit a biased one [31]. The appealing variance properties of this estimator may explain its recent success in representation learning [8, 30, 16, 36]. InfoNCE defines a lower bound on MI by:

I(x; y) ≥ log(B) − L_NCE.

Here, L_NCE is the InfoNCE objective defined in terms of a compatibility function f_θ(x, y) parametrized by θ. The lower bound is computed over a mini-batch of size B, consisting of one positive pair (x_1, y) and B − 1 negative pairs (x_j, y) where j = 2, ..., B:

L_NCE = −E [ log ( exp(f_θ(x_1, y)) / Σ_{j=1}^{B} exp(f_θ(x_j, y)) ) ].

Oord et al. [30] showed that maximizing the lower bound on MI by minimizing L_NCE with respect to θ leads to a compatibility function that obeys

exp(f*(x, y)) ∝ p(x | y) / p(x),

where f* is the optimal f_θ obtained by minimizing L_NCE.
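As a minimal illustration of the bound (not the paper's implementation), the InfoNCE loss for one positive pair and its in-batch negatives is a softmax cross-entropy over compatibility scores; `info_nce` below is a hypothetical helper:

```python
import math

def info_nce(scores):
    """InfoNCE loss for one positive pair.

    scores[0] is the compatibility f_theta(x, y) of the positive pair;
    scores[1:] are f_theta(x_j, y) for the B - 1 in-batch negatives.
    Returns -log softmax(scores)[0].
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

B = 4
loss = info_nce([2.0, 0.1, -0.3, 0.5])
# log(B) - loss recovers the InfoNCE lower bound on I(x; y) for this batch
mi_lower_bound = math.log(B) - loss
```

When the scores are uninformative (all equal), the loss is exactly log(B) and the MI estimate is zero, which is why the bound saturates at log of the batch size.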

2.2 InfoNCE for Phrase Grounding

Recent work [11] has shown that pre-trained object detectors such as FasterRCNN [32] and language models such as BERT [12] provide rich representations in the visual and textual domains for the phrase grounding problem. Inspired by this, we aim to maximize mutual information between region features generated by an object detector and contextualized word representations extracted by a language model.

Let us denote the region features for an image by R = {r_1, ..., r_k}, where k is the number of regions in the image and each r_i is a region feature vector. Similarly, caption word representations are denoted as W = {w_1, ..., w_T}, where T is the number of words in the caption and w_t is the representation of the t-th word.

We maximize the InfoNCE lower bound on MI between the image regions and each individual word representation, denoted I(R; w_t). Thus, using Eq. 2, we maximize the following lower bound:

Σ_{t=1}^{T} I(R; w_t) ≥ Σ_{t=1}^{T} [ log(B) − L_NCE(R, w_t) ].
We empirically show that maximizing the lower bound in Eq. 5 with an appropriate choice of compatibility function results in learning phrase grounding without strong grounding supervision. The following section details the design of the compatibility function.

Figure 2: Compatibility function with word-region attention. The figure shows compatibility computation between the set of image regions and the word "mug" in the caption. The compatibility function consists of learnable query, key, and value functions. The query constructed from the contextualized representation of the word "mug" is compared to keys created from region features to compute attention scores. The attention scores are used as weights to linearly combine values created from region features into an attended visual representation for "mug". The compatibility is defined by the dot product of the attended visual representation and the value representation for "mug".

2.3 Compatibility Function with Attention

The InfoNCE loss in our phrase grounding formulation requires a compatibility function between the set of region feature vectors R and the contextualized word representation w_t. To define the compatibility function, we propose to use a query-key-value attention mechanism [41]. Specifically, we define neural modules ψ_k and ψ_v to map each image region to keys and values, and ψ_q and ψ_w to compute a query and a value for each word. The query vector q_t = ψ_q(w_t) for each word is used to compute the attention score for every region given that word using

a_{ti} = softmax_i ( q_t · ψ_k(r_i) / sqrt(d) ),

where d is the dimension of the query and key vectors. The attention scores are used as a soft selection mechanism to compute a word-specific visual representation using a linear combination of region values:

v_t^att = Σ_{i=1}^{k} a_{ti} ψ_v(r_i).

Finally, the compatibility function is defined as f_θ(R, w_t) = v_t^att · ψ_w(w_t), where θ refers to the parameters of the neural modules ψ_q, ψ_k, ψ_v, and ψ_w, implemented using simple feed-forward MLPs. Following Eqs. 3-5, the InfoNCE loss for phrase grounding is defined as

L_img = −E [ log ( exp(f_θ(R, w_t)) / Σ_{j=1}^{B} exp(f_θ(R_j, w_t)) ) ],

which is marked with the subscript img as negative pairs are created by replacing the image regions from a positive pair with regions extracted from a negative instance in the mini-batch.


We enforce compatibility between each word and all image regions using I(R; w_t) in Eq. 5, but not between a region and all caption words (i.e., we do not use I(r_i; W)). This is because the words only describe part of the image, so there will be regions with no corresponding word in the caption.
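The attention-based compatibility above can be sketched in a few lines of plain Python. As a simplifying assumption for illustration only, the key and value networks are identity maps here, so keys and values equal the raw region features:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compatibility(regions, word_query, word_value):
    """Attention-based compatibility between a region set and one word.

    Assumption (illustration only): key/value networks are identity maps,
    so keys = values = raw region features.
    """
    d = len(word_query)
    # word-specific attention over regions (scaled query-key dot products)
    attn = softmax([dot(word_query, r) / math.sqrt(d) for r in regions])
    # attended visual representation: attention-weighted sum of values
    attended = [sum(a * r[i] for a, r in zip(attn, regions)) for i in range(d)]
    # compatibility: dot product with the word's value vector
    return dot(attended, word_value), attn

regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy region features
score, attn = compatibility(regions, [5.0, 0.0], [1.0, 0.0])
# the word's query matches region 0, so attention concentrates there
```

Because the attention weights form a distribution over regions, reading them off per word directly yields the soft grounding used at evaluation time.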

2.4 Context-Preserving Negative Captions

Figure 3: Context-preserving negative captions. We construct negative captions which share the same context as the true caption but substitute a noun word. We choose the substitute using a language model such that it is plausible in the context, but reject potential synonyms or hypernyms of the original word via a re-ranking procedure.

The objective in Eq. 8 trains the compatibility function by contrasting positive region-word pairs against pairs with replaced image regions. We now propose a complementary objective function that contrasts the positive pairs against negative pairs whose captions are replaced with plausible negative captions. However, extracting negative captions that are related to a caption is challenging as it requires semantic understanding of the words in the caption. Here, we leverage BERT, a pretrained bidirectional language model, to extract such negative captions.

For a caption with a noun word w and context c, we define a context-preserving negative caption as one which has the same context c but a different noun w' with the following properties: (i) w' should be plausible in the context c; and (ii) the new caption defined by the pair (w', c) should be untrue for the image. For example, consider the caption "A man is walking on a beach" where w is chosen as "man" and c is defined by "A [MASK] is walking on a beach", where [MASK] is the token that denotes a missing word. A potential candidate for a context-preserving negative caption might be "A woman is walking on a beach", where w' is "woman". However, "A car is walking on a beach" and "A person is walking on a beach" are not negative captions because "car" is not plausible given the context, and the statement with "person" is still true given that the original caption is true for the image.

Constructing context-preserving negative captions.

We propose to use a pre-trained BERT language model to construct context-preserving negative captions for a given true caption. Our approach for extracting such words consists of two steps. First, we feed the context c into the language model to extract the most likely candidates {w'_1, ..., w'_N} for the masked word using the probabilities p(w' | c) predicted by BERT. Intuitively, these words are those that can fill in the masked word in the caption according to BERT. However, the original masked word or its synonyms may be present in this set as well. Thus, in the second step, we pass the original caption into BERT to compute p(w' | c, w), which we use as a proxy for how true w' is given that w is true. We re-rank the candidates using this score, rejecting those that remain likely given w, and keep the top K captions as negatives for the original caption.
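The two-step procedure above can be sketched with stubbed language-model scores; the function and probability tables below are hypothetical toy values standing in for BERT's predictions, not the paper's code:

```python
def context_preserving_negatives(p_mask, p_given_orig, n_candidates=3, k=2):
    """Two-step negative-noun selection with stubbed LM probabilities.

    p_mask[w']       : p(w' | masked context), step 1
    p_given_orig[w'] : proxy for how true w' is given the original noun, step 2
    Step 1 keeps the nouns most plausible in the context; step 2 re-ranks
    them so likely synonyms/hypernyms (high p_given_orig) are rejected.
    """
    candidates = sorted(p_mask, key=p_mask.get, reverse=True)[:n_candidates]
    ranked = sorted(candidates, key=lambda w: p_given_orig.get(w, 0.0))
    return ranked[:k]

# Toy scores for "A [MASK] is walking on a beach" with original noun "man":
p_mask = {"woman": 0.30, "person": 0.25, "dog": 0.20, "car": 0.01}
p_given_orig = {"person": 0.9, "woman": 0.1, "dog": 0.05}
negatives = context_preserving_negatives(p_mask, p_given_orig)
# "person" is plausible but likely still true, so re-ranking rejects it;
# "car" never survives step 1 because it is implausible in the context
```

In practice both probability tables would come from a masked language model (e.g., BERT's fill-mask predictions), but the selection logic is the same.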

We empirically find that the proposed approach is effective in extracting context-preserving negative captions. Fig. 3 shows context-preserving negatives for a set of captions along with candidates that were rejected after re-ranking. Note that the selected candidates match the context, while the rejected candidates are often synonyms or hypernyms of the true noun.

Training with context-preserving negative captions.

Given the context-preserving negative captions, we can train our compatibility function by contrasting the positive pairs against negative pairs with plausible negative captions. We use a loss function similar to InfoNCE to encourage a higher compatibility score of an image with the true caption than with any negative caption. Let w_t denote the contextualized representation of the positive word and {w'_t1, ..., w'_tK} those of the corresponding negative noun words. The language loss is defined as

L_lang = −E [ log ( exp(f_θ(R, w_t)) / ( exp(f_θ(R, w_t)) + Σ_{k=1}^{K} exp(f_θ(R, w'_tk)) ) ) ].
For captions with multiple noun words, we randomly select one of the noun words for simplicity.

2.5 Implementation Details

Regions and Visual Features.

We use the Faster-RCNN object detector provided by Anderson et al. [4], which is also used to extract visual features in the current state-of-the-art phrase grounding approach Align2Ground [11]. The detector is trained jointly on Visual Genome object and attribute annotations, and we use a maximum of 30 top scoring bounding boxes per image with ROI-pooled region features.

Contextualized Word Representations.

We use a pretrained BERT language model to extract contextualized word representations for each caption word. Note that BERT is trained on large text corpora using masked language model training, where words are randomly replaced by a [MASK] token in the input and the likelihood of the masked word is maximized in the distribution over vocabulary words predicted at the output. Thus, BERT is trained to model the distribution over words given context, and is hence suitable for modeling the word-given-context distribution used in Sec. 2.4 for constructing context-preserving negative captions.

Query-Key-Value Networks.

We use an MLP with 1 hidden layer for each of the query, key, and value networks in all experiments except the ablation in Fig. 4. We use BatchNorm [19] and ReLU activations after the first linear layer. The hidden layer of each network has the same number of neurons as its input dimension, and all networks share a common output dimension.


Since we only care about grounding noun phrases, for computational efficiency we compute the loss only for noun and adjective words in the captions, as identified by a POS tagger, instead of for all caption words.


We optimize the total loss computed over batches of 50 image-caption pairs using the ADAM optimizer [21] with a fixed learning rate. We compute L_img for each image using the other images in the batch as negatives.

Attention to phrase grounding.

We use the BERT tokenizer to convert captions into individual word or sub-word tokens. Attention is computed per token. For evaluation, the phrase-level attention score for each region is computed as the maximum attention score assigned to the region by any of the tokens in the phrase. The regions are then ranked according to this phrase level score.
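The max-over-tokens phrase scoring described above is straightforward to implement; `rank_regions_for_phrase` below is a hypothetical helper operating on a token-by-region attention matrix:

```python
def rank_regions_for_phrase(token_attn, phrase_token_ids):
    """Rank regions for a phrase from per-token attention.

    token_attn[t][r] is the attention of token t on region r.
    A region's phrase-level score is the maximum attention any phrase
    token assigns to it; regions are returned best-first.
    """
    n_regions = len(token_attn[0])
    scores = [max(token_attn[t][r] for t in phrase_token_ids)
              for r in range(n_regions)]
    return sorted(range(n_regions), key=lambda r: scores[r], reverse=True)

# Two tokens of one phrase attending over three regions:
attn = [[0.1, 0.7, 0.2],   # first sub-word token
        [0.6, 0.3, 0.1]]   # second sub-word token
ranking = rank_regions_for_phrase(attn, [0, 1])  # -> [1, 0, 2]
```

Taking the max rather than the mean lets any single sub-word token of the phrase claim a region, which is useful when only one token carries the noun's meaning.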

3 Experiments

Our experiments compare our approach to the state of the art on weakly supervised phrase localization (Sec. 3.2), ablate the gains due to pretrained language representations and context-preserving negative sampling using a language model (Sec. 3.3), and analyse the relation between phrase grounding performance and the InfoNCE bound that we optimize as a proxy for phrase grounding (Sec. 3.4).

3.1 Datasets and Metrics

We train our models on image-caption pairs from the COCO training set, and use the COCO validation set for part of our analysis. Each image is accompanied by 5 captions. For evaluation, we use the Flickr30K Entities validation set for model selection (early stopping) and the test set for reporting final performance; each Flickr30K Entities image also has 5 captions. We report two metrics:

Recall@k, which is the fraction of phrases for which the ground truth bounding box has an IOU of at least 0.5 with any of the top-k predicted boxes.

Pointing accuracy, which requires the model to predict a single point location per phrase; the prediction is counted as correct if it falls within the ground truth bounding box for the phrase. Unlike recall@k, pointing accuracy does not require identifying the extent of the object. Since our model selects one of the detected regions in the image, we use the center of the selected bounding box as the prediction for each phrase when computing pointing accuracy.
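Both metrics reduce to simple box arithmetic. The sketch below (hypothetical helpers, with boxes as (x1, y1, x2, y2) tuples and the standard 0.5 IOU threshold) shows how a prediction can satisfy pointing accuracy while failing recall@1:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
    return inter / (area(a) + area(b) - inter)

def recall_at_k(gt_box, ranked_boxes, k, thresh=0.5):
    """True if any of the top-k predicted boxes overlaps the GT enough."""
    return any(iou(gt_box, b) >= thresh for b in ranked_boxes[:k])

def pointing_hit(gt_box, pred_box):
    """True if the center of the predicted box falls inside the GT box."""
    cx = (pred_box[0] + pred_box[2]) / 2
    cy = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]

# A small centered box points correctly but has IOU 0.36, below 0.5:
gt, pred = (0, 0, 10, 10), (2, 2, 8, 8)
```

This is exactly why pointing accuracy is the more forgiving of the two metrics: it rewards localizing the object's center without requiring its extent.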

3.2 Performance on Flickr30K Entities

Method Training Data Visual Features R@1 R@5 R@10 Accuracy
GroundeR (2015) [33] Flickr30K Entities VGG-det (VOC) 28.94 - - -
Yeh et al.(2018) [44] Flickr30K Entities VGG-cls (IN) 22.31 - - -
Yeh et al.(2018) [44] Flickr30K Entities VGG-det (VOC) 35.90 - - -
Yeh et al.(2018) [44] Flickr30K Entities YOLO (COCO) 36.93 - - -
KAC Net+Soft KBP (2018) [7] Flickr30K Entities VGG-det (VOC) 38.71 - - -
Fang et al.(2015) [13] COCO VGG-cls (IN) - - - 29.00
Akbari et al.(2019) [1] COCO VGG-cls (IN) - - - 61.66
Akbari et al.(2019) [1] COCO PNAS Net (IN) - - - 69.19
Align2Ground (2019) [11] COCO Faster-RCNN (VG) - - - 71.00
Ours Flickr30K Entities Faster-RCNN (VG) 47.88 76.63 82.91 74.94
Ours COCO Faster-RCNN (VG) 51.67 77.69 83.25 76.74
Table 1: Grounding performance on the Flickr30K Entities test set. We make our approach directly comparable to the current state of the art, Align2Ground [11]. The performance of older methods is reported for completeness, but the use of different visual features makes direct comparison difficult.

Tab. 1 compares the performance of our method to existing weakly supervised phrase grounding approaches on the Flickr30K Entities test set. A few existing approaches train on the Flickr30K Entities train set and report recall@1, while recent methods use the COCO train set and report pointing accuracy. Further, all approaches use different visual features, making direct comparison difficult. For a fair comparison to the state of the art, we use the Faster-RCNN trained on Visual Genome object and attribute annotations used in Align2Ground [11], and report performance for models trained on either dataset using both recall and pointing accuracy metrics.

Using the same training data and visual feature architecture, our model shows a 5.7% absolute gain in pointing accuracy over Align2Ground. Learning with our contrastive formulation is also quite sample efficient, as can be seen from the only 1.8 to 3.8 point drop in performance when the model is trained on the much smaller Flickr30K Entities train set, which has approximately one-third as many image-caption pairs as COCO.

3.3 Benefits of Language Modeling

Negative Captions Language Model R@1 R@5 R@10 Accuracy
None BERT (Random) 25.66 59.57 75.16 57.37
None BERT (Pretrained) 35.74 72.91 82.07 66.89
Random BERT (Pretrained) 36.32 72.42 81.81 66.92
Contextually plausible BERT (Pretrained) 48.05 76.78 82.97 74.91
Excluding near-synonyms & hypernyms BERT (Pretrained) 51.67 77.69 83.25 76.74
Table 2: Benefits of language modeling. The first two rows show the gains due to pretrained language representations. The next three rows show gains from each step in our proposed context-preserving negative caption construction.

Our approach benefits from language modeling in two ways: (i) using the pretrained language model to extract contextualized word representations, and (ii) using the language model to sample context-preserving negative captions. Tab. 2 evaluates along both of these dimensions.

Gains from pretrained word representations.

In Tab. 2, BERT (Random) refers to the BERT architecture initialized with random weights and finetuned on COCO image-caption data along with parameters of the attention mechanism. BERT (Pretrained) refers to the off-the-shelf pretrained BERT model which is used as a contextualized word feature extractor during contrastive learning without finetuning. We observe a 10% absolute gain in both recall@1 and pointing accuracy by using pretrained word representations from BERT.

Gains from context-preserving negative caption sampling.

Our context-preserving negative sampling has two steps. The first step is drawing negative noun candidates given the context provided by the true caption. The second step is re-ranking the candidates to filter out likely synonyms or hypernyms that are also true for the image.

First, note that randomly sampling negative captions from the training data for computing L_lang performs similarly to training with L_img alone. The model trained with contextually plausible negatives significantly outperforms random sampling, with gains of roughly 12 points in recall@1 and 8 points in pointing accuracy. Excluding near-synonyms and hypernyms yields a further 3.6 points in recall@1 and 1.8 points in accuracy.

3.4 Is InfoNCE a good proxy for learning phrase grounding?

Figure 4: Relation between the InfoNCE lower bound and phrase grounding performance over training iterations for 3 different choices of key-value modules in the compatibility function. The scattered points visualize the measured quantities during training. The dashed lines are created by applying a moving average to highlight the trend.

The fact that optimizing our InfoNCE objective results in learning phrase grounding is intuitive but not trivial. Fig. 4 shows how maximizing the InfoNCE lower bound correlates with phrase grounding performance on a heldout dataset. We make several interesting observations: (i) As training progresses (from left to right), the InfoNCE lower bound (Eq. 5) mostly keeps increasing on the validation set. This indicates that there is no overfitting in terms of the InfoNCE bound. (ii) With the increase in the InfoNCE lower bound, phrase grounding performance first increases until peak performance and then starts decreasing. This shows that the InfoNCE bound is correlated with the grounding performance, but maximizing it fully does not necessarily yield the best grounding. A similar observation has been made in [39] for representation learning. (iii) The peak performance and the number of iterations needed to reach it depend on the choice of key-value-query modules. One and two layer MLPs hit the peak faster and perform better than linear functions.

3.5 Qualitative Results

Figure 5: Visualization of attention. We show all detected regions and top-3 attended regions with attention scores for two words highlighted in each caption.

Fig. 5 visualizes the word-region attention learned by our model. The qualitative results demonstrate the following abilities: (i) localizing different objects mentioned in the same caption with varying degrees of semantic relatedness, e.g., man and canine in row 1 vs. man and woman in row 3; (ii) disambiguation between two instances of the same object category using caption context. For example, boy and another in row 4 and bride and groom from other men and women in row 3; (iii) localizing object parts such as toddler’s shirt in row 2 and instrument’s mouthpiece in row 5; (iv) handling occlusion, e.g., table covered with toys in row 6; (v) handling uncommon words or categories like ponytail and mouthpiece in row 5 and hose in row 7.

These results show that given rich visual and contextualized word representations, contrastive learning causes our attention model to learn phrase grounding.

4 Limitations and Future Work

The empirical examination of our framework reveals the following limitations:

Pretrained representations.

Like prior work, our approach relies on a pretrained object detector and a language model to represent regions and caption words. Ideally, we would want to learn such representations from scratch, or to improve existing region and word representations directly from image-caption data.

Need for fully-labeled validation set.

In Fig. 4, we observe that early stopping based on validation performance is required to choose the best model for phrase grounding. While this is common practice in weakly supervised learning [10], and the Flickr30K Entities validation set we use is much smaller than the COCO training set, it nevertheless translates to using full supervision for a small set of images.

Bounds on MI.

While L_img in Eq. 8 is a valid lower bound on MI, our L_lang in Eq. 9 is no longer a lower bound on MI because it oversamples negative words related to a caption. A valid bound would require randomly sampling captions from the training data; however, our context-preserving negative captions lead to much better performance.

5 Conclusion

In this work, we offer a novel perspective on weakly supervised phrase grounding from paired image-caption data which has traditionally been cast as a multiple instance learning problem. We formulate the problem as that of estimating mutual information between image regions and caption words. We demonstrate that maximizing a lower bound on mutual information with respect to parameters of a region-word attention mechanism results in learning to ground words in images. We also show that language models can be used to generate context-preserving negative captions which greatly improve learning in comparison to randomly sampling negatives from training data.


  • [1] H. Akbari, S. Karaman, S. Bhargava, B. Chen, C. Vondrick, and S. Chang (2018) Multi-level multimodal common semantic space for image-phrase grounding. CVPR. Cited by: §1.1, §1, Table 1.
  • [2] C. Alberti, J. Ling, M. Collins, and D. Reitter (2019) Fusion of detected objects in text for visual question answering. In EMNLP, Cited by: §1.1.
  • [3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In ICLR, Cited by: §1.1.
  • [4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2017) Bottom-up and top-down attention for image captioning and visual question answering. CVPR. Cited by: §2.5.
  • [5] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §1.1.
  • [6] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In ICML, Cited by: §1.1, §1.1.
  • [7] K. Chen, J. Gao, and R. Nevatia (2018) Knowledge aided consistency for weakly supervised phrase grounding. In CVPR, Cited by: Table 1.
  • [8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1.1, §2.1.
  • [9] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: §1.1.
  • [10] J. Choe, S. J. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim (2020) Evaluating weakly supervised object localization methods right. arXiv preprint. Cited by: §4.
  • [11] S. Datta, K. Sikka, A. Roy, K. Ahuja, D. Parikh, and A. Divakaran (2019) Align2Ground: weakly supervised phrase grounding guided by image-caption alignment. ICCV. Cited by: §0.A.3, §1.1, §1, §2.2, §2.5, §3.2, Table 1.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. NAACL-HLT. Cited by: §0.A.2, §2.2.
  • [13] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig (2014) From captions to visual concepts and back. CVPR. Cited by: §1.1, §1, Table 1.
  • [14] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS, Cited by: §1.1.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.1.
  • [16] O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §1.1, §2.1.
  • [17] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: §1.1.
  • [18] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. In ICML, Cited by: §1.1, §1.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML. Cited by: §2.5.
  • [20] H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, Cited by: §1.1.
  • [21] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §2.5.
  • [22] G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou (2019) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066. Cited by: §1.1.
  • [23] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1.1.
  • [24] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1.1.
  • [25] O. Maron and T. Lozano-Pérez (1998) A framework for multiple-instance learning. In NeurIPS, Cited by: §1.1, §1.
  • [26] D. McAllester and K. Stratos (2018) Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251. Cited by: §1.1.
  • [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: §1.1.
  • [28] I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: §1.1.
  • [29] A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In NeurIPS, Cited by: §1.1.
  • [30] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv. Cited by: §1.1, §1.1, §1, §2.1, §2.1.
  • [31] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In ICML, Cited by: §1.1, §2.1.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §2.2.
  • [33] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In ECCV, Cited by: §1.1, §1, Table 1.
  • [34] J. Song and S. Ermon (2020) Understanding the limitations of variational mutual information estimators. In ICLR, Cited by: §1.1.
  • [35] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) Vl-bert: pre-training of generic visual-linguistic representations. In ICLR, Cited by: §1.1.
  • [36] C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §1.1, §2.1.
  • [37] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In ICCV, Cited by: §1.1.
  • [38] H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §1.1.
  • [39] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.1, §3.4.
  • [40] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In ICLR, Cited by: §1.1.
  • [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §0.A.2, §2.3.
  • [42] J. Wang and L. Specia (2019) Phrase localization without paired training examples. ICCV. Cited by: §1.1.
  • [43] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §1.1.
  • [44] R. A. Yeh, M. N. Do, and A. G. Schwing (2018) Unsupervised textual grounding: linking words to image concepts. In CVPR, Cited by: §1.1, Table 1.
  • [45] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao (2020) Unified vision-language pre-training for image captioning and vqa. In AAAI, Cited by: §1.1.

Appendix 0.A

0.A.1 Advantages of Context-Preserving Negative Sampling

Commonly used negative sampling strategies for contrastive learning include randomly sampling captions from the training data and mining hard negatives from a randomly sampled mini-batch. In our experiments (Tab. 2), random sampling showed no significant gains over a model trained without negative captions. This is because the sampled negatives often describe an entirely different context than the image and the positive caption, which makes it too easy for the model to assign a low compatibility score to these negatives.

In contrast, context-preserving negative sampling shows significant gains in pointing accuracy over random sampling (Tab. 2). This is because we construct harder negative captions, which yield a more informative training signal. We construct negatives by substituting a single word in the caption while preserving the rest of the context from the positive caption. The substitutions are chosen to be plausible given the context, while likely synonyms and hypernyms are discarded. Unlike random sampling, whose success depends on informative negative captions occurring in the training data and on the likelihood of sampling such a negative for a given positive caption in the same mini-batch, our approach can construct effective negatives for any positive caption.
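The substitution procedure above can be sketched as follows. In the paper, a pretrained masked language model proposes plausible fill-ins for the masked position; here, for a self-contained illustration, the candidate list is supplied directly, and the function name and filtering interface are ours.

```python
def make_negative_captions(caption, target_idx, candidate_fill_ins, banned):
    """Construct context-preserving negatives by substituting one word.

    caption: list of tokens of the positive caption.
    target_idx: index of the word to replace (e.g., a noun).
    candidate_fill_ins: words a masked language model ranks as plausible
        for the masked position (supplied directly here; in practice a
        pretrained BERT proposes them).
    banned: synonyms/hypernyms of the original word, which would make
        the "negative" caption still true for the image.
    """
    original = caption[target_idx]
    negatives = []
    for word in candidate_fill_ins:
        if word == original or word in banned:
            continue  # discard the original word and its synonyms/hypernyms
        neg = list(caption)
        neg[target_idx] = word  # single-word substitution keeps the context
        negatives.append(" ".join(neg))
    return negatives

caption = "a man is walking his dog".split()
# Plausible fill-ins for "dog", as a masked LM might rank them (illustrative).
proposals = ["dog", "cat", "puppy", "horse", "bike"]
print(make_negative_captions(caption, 5, proposals, banned={"puppy"}))
# → ['a man is walking his cat', 'a man is walking his horse', 'a man is walking his bike']
```

Because every generated negative differs from the positive caption in exactly one word, the model must attend to the substituted word's region to tell the pair apart, which is what makes these negatives informative.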

0.A.2 Relation between our query-key-value attention and self-attention in Transformers

Our query-key-value attention mechanism is related to the attention mechanism used in transformer-based [41] architectures like BERT [12]. Transformers use the mechanism for self-attention where queries, keys, and values are computed for each word in the input sentence and the attention scores are used for contextualization. In contrast, we use the attention mechanism for word-region alignment. Specifically, we compute queries for each contextualized word, keys for each region, and values for regions as well as words (using separate value networks for regions and words).
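The word-region direction of this attention can be sketched as follows, with word-derived queries attending over region-derived keys and values. For brevity this sketch shows only the region-side value network (the paper also computes values for words); the shapes and function name are illustrative assumptions.

```python
import numpy as np

def word_region_attention(words, regions, Wq, Wk, Wv):
    """Cross-modal attention: queries from words, keys/values from regions.

    words:   (T, d_w) contextualized word features (e.g., from BERT)
    regions: (R, d_r) region features (e.g., from an object detector)
    Returns the (T, R) attention matrix (each word's soft grounding over
    regions) and the (T, d_v) attended region values, one per word.
    """
    Q = words @ Wq                           # (T, d) word queries
    K = regions @ Wk                         # (R, d) region keys
    V = regions @ Wv                         # (R, d_v) region values
    logits = Q @ K.T / np.sqrt(Q.shape[1])   # scaled dot-product scores
    logits -= logits.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)  # normalize over regions
    return attn, attn @ V

# Illustrative dimensions: 6 words, 10 regions, BERT/detector feature sizes.
rng = np.random.default_rng(0)
T, R, d_w, d_r, d = 6, 10, 768, 2048, 64
attn, attended = word_region_attention(
    rng.normal(size=(T, d_w)) * 0.01,
    rng.normal(size=(R, d_r)) * 0.01,
    rng.normal(size=(d_w, d)), rng.normal(size=(d_r, d)),
    rng.normal(size=(d_r, d)))
print(attn.shape, attended.shape)  # (6, 10) (6, 64)
```

In self-attention, by contrast, `words` would play all three roles (queries, keys, and values over the same sequence); here the attention rows are directly interpretable as each word's grounding distribution over image regions.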

0.A.3 Comparison to Align2Ground

While we use the same visual features as the previous state of the art, Align2Ground [11], the two approaches use different textual features. Align2Ground uses a bi-GRU, whereas we choose BERT, since transformer-based (as opposed to RNN-based) language models have become more prevalent in the vision-language community. To estimate the gain due to pretrained language representations, Tab. 2 compares the grounding performance of randomly initialized BERT to that of pretrained BERT; negative sampling brings further gains.