Humans can learn from captioned images because of their ability to associate words with image regions. For instance, humans perform such word-region associations while acquiring facts from news photos, making a diagnosis from MRI scans and radiologist reports, or enjoying a movie with subtitles. This word-region association problem is called word or phrase grounding.
Existing object detectors can detect and represent object regions in an image, and language models can provide contextualized representations for noun phrases in the caption. However, learning a mapping between these continuous, independently trained visual and textual representations is challenging in the absence of explicit region-word annotations. We focus on learning this mapping from weak supervision in the form of paired image-caption data without requiring laborious grounding annotations.
Current state-of-the-art approaches [11, 1, 33] formulate weakly supervised phrase grounding as a multiple instance learning (MIL) problem [25, 18]. The image can be viewed as a bag of regions. For a given phrase, all images with captions containing the phrase are treated as positive bags, while the remaining images are treated as negatives. Models aggregate per-region features or phrase scores to construct image-level predictions that can be supervised with image-level labels in the form of phrases or captions. Common aggregation approaches include max or mean pooling, noisy-OR, and attention [11, 18]. Popular training objectives include a binary classification loss (does the image contain the phrase?), a caption reconstruction loss (a generalization of binary classification to caption prediction), or ranking objectives [1, 11] (do true image-caption or image-phrase pairs score higher than negative pairs?).
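As a concrete illustration, the MIL aggregation operators mentioned above can be sketched in a few lines. This is a toy sketch with made-up per-region scores, not the implementation of any of the cited methods:

```python
import numpy as np

def aggregate_region_scores(scores, method="max"):
    """Aggregate per-region phrase probabilities (shape: [num_regions])
    into a single image-level probability, as in MIL-style
    weakly supervised grounding."""
    scores = np.asarray(scores, dtype=np.float64)
    if method == "max":
        return scores.max()           # image score = best-matching region
    if method == "mean":
        return scores.mean()          # average over all regions
    if method == "noisy_or":
        # P(phrase in image) = 1 - P(no region matches)
        return 1.0 - np.prod(1.0 - scores)
    raise ValueError(f"unknown aggregation: {method}")
```

The image-level score can then be supervised with the image-level label (phrase present or absent), which is what makes the formulation weakly supervised.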
Fig. 1 provides an overview of our proposed contrastive training. We propose a novel formulation of the weakly supervised phrase grounding problem as that of maximizing a lower bound on the mutual information between the set of region features extracted from an image and contextualized word representations of the caption. We use pretrained region and word representations from an object detector and a language model, and perform optimization over the parameters of word-region attention instead of optimizing the region and word representations themselves. Intuitively, to compute mutual information with a word's representation, attention must discard nuisance regions in the word-conditional attended visual representation, thereby selecting regions that match the word. For any given word, the learned attention thus functions as a soft selection or grounding mechanism over regions.
Since computing MI is intractable, we maximize the recently introduced InfoNCE lower bound [30] on mutual information. The InfoNCE bound requires a compatibility score between each caption word and the image to contrast positive image and caption word pairs with negative pairs in a minibatch. We use two objectives. The first objective ($\mathcal{L}_{\text{img}}$ in Fig. 1) contrasts a positive pair with negative pairs with the same caption word but different image regions. The second objective ($\mathcal{L}_{\text{lang}}$ in Fig. 1) contrasts a positive pair with negative pairs with the same image but different captions. We show empirically that sampling negative captions randomly from the training data to optimize $\mathcal{L}_{\text{lang}}$ does not yield any gains over optimizing $\mathcal{L}_{\text{img}}$ alone. Instead of random sampling, we propose to use a language model to construct context-preserving negative captions by substituting a single noun word in the caption.
We design the compatibility function using a query-key-value attention mechanism. The queries and keys, computed from words and regions respectively, are used to compute a word-specific attention score over each region, which acts as a soft alignment or grounding between words and regions. The compatibility score between the regions and a word is computed by comparing the attended visual representation with the word representation.
Our key contributions are: (i) a novel MI-based contrastive training framework for weakly supervised phrase grounding; (ii) an InfoNCE compatibility function between a set of regions and a caption word designed for phrase grounding; and (iii) a procedure for constructing context-preserving negative captions that provides a significant absolute gain in grounding performance.
1.1 Related Work
Our work is closely related to three active areas of research. We now provide an overview of prior art in each.
Weakly Supervised Phrase Grounding.
Weakly supervised phrase localization is typically posed as a multiple instance learning (MIL) problem [25, 18] where each image is considered a bag of region proposals. Images whose captions mention a word or phrase are treated as positive bags, while the rest of the images are treated as negatives for that word or phrase. Features or scores for a phrase or the entire caption are aggregated across all regions to make a prediction for the image. Common methods of aggregation are max or average pooling, noisy-OR, or attention [33, 18]. With the ability to produce image-level scores for pairs of images and phrases or captions, the problem becomes an image-level fully-supervised phrase classification problem or an image-caption retrieval problem [1, 11]. An alternative to the MIL formulation is the approach of Yeh et al. [44], which uses a statistical hypothesis testing approach to link concepts detected in an image to words mentioned in the sentence. While all the above approaches assume paired image-caption data, Wang et al. [42] recently addressed the problem of phrase grounding without access to image-caption pairs. Instead, they assume access to a set of scene and color classifiers and object detectors to detect concepts in the scene, and use word2vec similarity between concept labels and caption words to achieve grounding.
MI-based Representation Learning.
Recently, MI-based approaches have shown promising results on a variety of representation learning problems. Computing the MI between two representations is challenging, as we often have access to samples but not to the underlying joint distribution that generated them. Thus, recent efforts rely on variational estimation of MI [3, 20, 6, 30]. An overview of such estimators is provided in [31, 40], while their statistical limitations are reviewed in [26, 34].
In practice, MI-based representation learning models are often trained by maximizing an estimate of MI across different transformations of data. For example, Deep InfoMax [17] maximizes MI between local and global representations using MINE [6]. Contrastive predictive coding [30, 16], inspired by noise-contrastive estimation [14, 29], assumes an order in the features extracted from an image and uses summary features to predict future features. Contrastive multiview coding [39] maximizes MI between different color channels or data modalities, while augmented multiscale Deep InfoMax [5] and SimCLR [8] extract views using different augmentations of data points. Since the InfoNCE loss is limited by the batch size, several previous works rely on memory banks [43, 28, 15] to increase the set of negative instances.
Joint Image-Text Representation Learning.
With advances in both visual analysis and natural language understanding, there has been a recent shift towards learning representations jointly from the visual and textual domains [23, 35, 24, 37, 38, 45, 22, 9, 2, 36]. ViLBERT [24] and LXMERT [38] learn representations from both modalities using two-stream transformers applied to image and text independently. In contrast, UNITER [9], VisualBERT [23], Unicoder-VL [22], VL-BERT [35] and B2T2 [2] propose unified single-stream architectures that learn representations jointly from both domains. Our method is similar to the first group, but differs in its fundamental goal. Instead of focusing on learning a task-agnostic representation for a range of downstream tasks, we are interested in the quality of the region-phrase grounding that emerges from maximizing mutual information. Moreover, we rely on the language modality as a weak training signal for grounding, and we perform phrase grounding without any further finetuning.
Consider the set of region features and the contextualized word representations as two multivariate random variables. Intuitively, estimating the MI between them requires extracting the information content shared by these two variables. We model this estimation as maximizing a lower bound on MI with respect to the parameters of a word-region attention model. This maximization forces the attention model to downweight regions of the image that do not match the word, and to attend to the image regions that share the most information with the word representation.
Sec. 2.1 describes MI and the InfoNCE lower bound. Sec. 2.2 introduces notation and the InfoNCE-based objective for learning phrase grounding from paired image-caption data. Sec. 2.3 presents the design of the word-region attention based compatibility function that is part of the InfoNCE objective.
2.1 InfoNCE Lower Bound on Mutual Information
Let $X$ and $Y$ be random variables drawn from a joint distribution with density $p(x, y)$. The MI between $X$ and $Y$ measures the amount of information that these two variables share:
$$I(X; Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right], \tag{1}$$
which is also the Kullback–Leibler divergence from the joint distribution $p(x, y)$ to the product of marginals $p(x)\,p(y)$.
However, computing MI is intractable in general because it requires complete knowledge of the joint and marginal distributions. Among the existing MI estimators, the InfoNCE estimator exhibits appealing variance properties [31], which may explain its recent success in representation learning [8, 30, 16, 36]. InfoNCE defines a lower bound on MI by:
$$I(X; Y) \ge \log(B) - \mathcal{L}_{\text{NCE}}. \tag{2}$$
Here, $\mathcal{L}_{\text{NCE}}$ is the InfoNCE objective defined in terms of a compatibility function $f_\theta$ parametrized by $\theta$. The lower bound is computed over a mini-batch of size $B$, consisting of one positive pair $(x_1, y_1)$ and $B - 1$ negative pairs $(x_b, y_1)$ where $b \neq 1$:
$$\mathcal{L}_{\text{NCE}} = -\mathbb{E}\left[\log \frac{e^{f_\theta(x_1, y_1)}}{\sum_{b=1}^{B} e^{f_\theta(x_b, y_1)}}\right]. \tag{3}$$
Oord et al. [30] showed that maximizing the lower bound on MI by minimizing $\mathcal{L}_{\text{NCE}}$ with respect to $\theta$ leads to a compatibility function that obeys
$$e^{f_{\theta^\ast}(x, y)} \propto \frac{p(x \mid y)}{p(x)}, \tag{4}$$
where $\theta^\ast$ is the optimal $\theta$ obtained by minimizing $\mathcal{L}_{\text{NCE}}$.
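To make the bound concrete, the following is a minimal numerical sketch of the InfoNCE estimator over a mini-batch of compatibility scores. The NumPy version, the function name, and the matrix layout (diagonal entries as positive pairs) are our illustrative assumptions, not the paper's code:

```python
import numpy as np

def info_nce(scores):
    """InfoNCE lower-bound estimate of MI from a [B, B] matrix of
    compatibility scores f(x_b, y_j), where diagonal entries are the
    positive pairs. Returns log(B) minus the average NCE loss."""
    B = scores.shape[0]
    # Numerically stable log-softmax over each row's B candidates
    # (1 positive, B-1 negatives).
    row_max = scores.max(axis=1, keepdims=True)
    logsumexp = row_max + np.log(np.exp(scores - row_max).sum(axis=1, keepdims=True))
    log_probs = scores - logsumexp
    nce_loss = -np.mean(np.diag(log_probs))   # -E[log p(positive)]
    return np.log(B) - nce_loss               # lower bound on I(X; Y)
```

Note that the estimate can never exceed log(B), which is why the bound is limited by the batch size, as discussed in Sec. 1.1.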
2.2 InfoNCE for Phrase Grounding
Recent work [11] has shown that pre-trained object detectors such as Faster-RCNN [32] and language models such as BERT [12] provide rich representations in the visual and textual domains for the phrase grounding problem. Inspired by this, we aim to maximize the mutual information between region features generated by an object detector and contextualized word representations extracted by a language model.
Let us denote the region features for an image by $R = \{r_1, \ldots, r_K\}$, where $K$ is the number of regions in the image and each $r_i \in \mathbb{R}^{d_v}$. Similarly, caption word representations are denoted by $w = \{w_1, \ldots, w_T\}$, where $T$ is the number of words in the caption and each word is represented as $w_j \in \mathbb{R}^{d_w}$.
We maximize the InfoNCE lower bound on the MI between the image regions $R$ and each individual word representation $w_j$, denoted by $I(R; w_j)$. Thus, using Eq. 2, we maximize the following lower bound:
$$\sum_{j=1}^{T} I(R; w_j) \ge \sum_{j=1}^{T} \left[\log(B) - \mathcal{L}_{\text{NCE}}(R, w_j)\right]. \tag{5}$$
We empirically show that maximizing the lower bound in Eq. 5 with an appropriate choice of compatibility function results in learning phrase grounding without strong grounding supervision. The following section details the design of the compatibility function.
2.3 Compatibility Function with Attention
The InfoNCE loss in our phrase grounding formulation requires a compatibility function between the set of region feature vectors $R$ and the contextualized word representation $w_j$. To define the compatibility function, we propose to use a query-key-value attention mechanism [41]. Specifically, we define neural modules that map each image region $r_i$ to a key $k_i$ and a value $v_i$, and that compute a query $q_j$ and a value $u_j$ for each word $w_j$. The query vector for each word is used to compute an attention score for every region given the word using
$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{i'=1}^{K} \exp(s_{i'j})}, \qquad s_{ij} = \frac{q_j^\top k_i}{\sqrt{d}}, \tag{6}$$
where $d$ is the dimensionality of the queries and keys. The attention scores are used as a soft selection mechanism to compute a word-specific visual representation as a linear combination of the region values:
$$a_j = \sum_{i=1}^{K} \alpha_{ij} v_i . \tag{7}$$
Finally, the compatibility function is defined as $f_\theta(R, w_j) = a_j^\top u_j$, where $\theta$ refers to the parameters of the key, value, and query neural modules, implemented using simple feed-forward MLPs. Following Eqs. 3 & 5, the InfoNCE loss for phrase grounding is defined as
$$\mathcal{L}_{\text{img}} = -\frac{1}{T} \sum_{j=1}^{T} \log \frac{\exp(f_\theta(R_1, w_j))}{\sum_{b=1}^{B} \exp(f_\theta(R_b, w_j))}, \tag{8}$$
which is marked with the subscript img since negative pairs are created by replacing the image regions of a positive pair with regions extracted from negative instances in the mini-batch.
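The query-key-value attention and the resulting compatibility score described above can be sketched as follows. Plain linear maps stand in for the MLP modules, and all dimensions, weights, and inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                    # key/query/value dimensionality (illustrative)
K, dv, dw = 5, 12, 16    # regions, region-feature dim, word-feature dim

# Hypothetical linear "neural modules" standing in for the MLPs.
W_key, W_val = rng.normal(size=(dv, d)), rng.normal(size=(dv, d))
W_query, W_wval = rng.normal(size=(dw, d)), rng.normal(size=(dw, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compatibility(regions, word):
    """Word-conditional attention over regions, then a dot product
    between the attended visual value and the word value."""
    keys, values = regions @ W_key, regions @ W_val       # [K, d] each
    query, word_value = word @ W_query, word @ W_wval     # [d] each
    attn = softmax(keys @ query / np.sqrt(d))             # [K], soft grounding
    attended = attn @ values                              # word-specific visual rep
    return attended @ word_value, attn

regions = rng.normal(size=(K, dv))
word = rng.normal(size=dw)
score, attn = compatibility(regions, word)
```

The attention vector `attn` is exactly the quantity that is read out at test time as the soft grounding of the word over the image regions.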
We enforce compatibility between each word and all image regions via $I(R; w_j)$ in Eq. 5, but not between a region and all caption words (i.e., we do not maximize $I(r_i; w)$). This is because the words describe only part of the image, so there will be regions with no corresponding word in the caption.
2.4 Context-Preserving Negative Captions
The objective in Eq. 8 trains the compatibility function by contrasting positive region-word pairs against pairs with replaced image regions. We now propose a complementary objective that contrasts the positive pairs against negative pairs whose captions are replaced with plausible negative captions. However, constructing negative captions that are related to a given caption is challenging, as it requires semantic understanding of the words in the caption. We leverage BERT, a pretrained bidirectional language model, to construct such negative captions.
For a caption with a noun word $n$ and context $c$, we define a context-preserving negative caption as one that has the same context $c$ but a different noun $n'$ with the following properties: (i) $n'$ should be plausible in the context; and (ii) the new caption defined by the pair $(n', c)$ should be untrue for the image. For example, consider the caption "A man is walking on a beach", where $n$ is chosen as "man" and $c$ is "A [MASK] is walking on a beach", with [MASK] denoting the missing word. A potential context-preserving negative caption is "A woman is walking on a beach", where $n'$ is "woman". However, "A car is walking on a beach" and "A person is walking on a beach" are not valid negative captions: "car" is not plausible given the context, and the statement with "person" is still true given that the original caption is true for the image.
Constructing context-preserving negative captions.
We propose to use a pre-trained BERT language model to construct context-preserving negative captions for a given true caption. Our approach for extracting such words consists of two steps. First, we feed the context $c$ into the language model to extract the most likely candidates $n'$ for the masked word using the probabilities $p(n' \mid c)$ predicted by BERT. Intuitively, these are the words that best fill in the masked word in the caption according to BERT. However, the original masked word or its synonyms may be present in this candidate set as well. Thus, in the second step, we pass the original caption into BERT to compute $p(n' \mid n, c)$, which we use as a proxy for how true $(n', c)$ is given that $(n, c)$ is true. We re-rank the candidates using this score, so that candidates likely to remain true for the image are demoted, and keep the top-ranked captions as negatives for the original caption $(n, c)$.
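The two-step mining procedure can be sketched with toy probability tables standing in for BERT's masked-word distributions. All numbers, the helper name `mine_negatives`, and its parameters are hypothetical, chosen only to mirror the beach example above:

```python
# Toy stand-ins for BERT's distributions on "A [MASK] is walking on a beach".
p_given_context = {   # step 1: plausibility of the fill-in given context c
    "woman": 0.20, "dog": 0.15, "man": 0.30, "person": 0.18, "car": 0.001,
}
p_given_caption = {   # step 2 proxy: is the candidate still true for the image
    "woman": 0.05, "dog": 0.03, "man": 0.60, "person": 0.25, "car": 0.001,
}

def mine_negatives(true_word, num_candidates=4, top_k=2):
    # Step 1: keep only the most plausible fill-ins for the masked slot,
    # which rules out implausible words like "car".
    cands = sorted(p_given_context, key=p_given_context.get, reverse=True)
    cands = [w for w in cands[:num_candidates] if w != true_word]
    # Step 2: re-rank so that words likely to still be true
    # (synonyms/hypernyms such as "person") fall to the bottom.
    cands.sort(key=p_given_caption.get)
    return cands[:top_k]
```

With these toy tables, `mine_negatives("man")` keeps "dog" and "woman" while rejecting both the implausible "car" and the still-true "person".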
We empirically find that the proposed approach is effective in extracting context-preserving negative captions. Fig. 3 shows context-preserving negatives for a set of captions along with candidates that were rejected after re-ranking. Note that the selected candidates match the context, and the rejected candidates are often synonyms or hypernyms of the true noun.
Training with context-preserving negative captions.
Given the context-preserving negative captions, we can train our compatibility function by contrasting the positive pairs against negative pairs with plausible negative captions. We use a loss function similar to InfoNCE to encourage a higher compatibility score for an image with its true caption than with any negative caption. Let $w^+$ and $\{w^-_m\}_{m=1}^{M}$ denote the contextualized representations of the positive noun word $n$ and the corresponding negative noun words $n'_m$. The language loss is defined as
$$\mathcal{L}_{\text{lang}} = -\log \frac{\exp(f_\theta(R, w^+))}{\exp(f_\theta(R, w^+)) + \sum_{m=1}^{M} \exp(f_\theta(R, w^-_m))}. \tag{9}$$
For captions with multiple noun words, we randomly select one of the noun words for simplicity.
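A minimal sketch of such a language loss, assuming the compatibility scores have already been reduced to scalars for the positive word and each negative word:

```python
import numpy as np

def language_loss(pos_score, neg_scores):
    """Softmax cross-entropy that asks the true caption word to score
    higher than its context-preserving negatives (an illustrative
    sketch of a language-side contrastive loss, not the exact code)."""
    logits = np.concatenate([[pos_score], np.asarray(neg_scores, dtype=np.float64)])
    # -log softmax probability of the positive entry (index 0).
    m = logits.max()
    return -(pos_score - (m + np.log(np.exp(logits - m).sum())))
```

The loss goes to zero as the positive score dominates the negatives, and grows when any negative caption word scores higher than the true one.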
2.5 Implementation Details
Regions and Visual Features.
We use the Faster-RCNN object detector provided by Anderson et al. [4], which is also used for extracting visual features in the current state-of-the-art phrase grounding approach Align2Ground [11]. The detector is trained jointly on Visual Genome object and attribute annotations, and we use a maximum of 30 top-scoring bounding boxes per image with ROI-pooled region features.
Contextualized Word Representations.
We use a pretrained BERT language model to extract contextualized word representations for each caption word. Note that BERT is trained on a large text corpus using masked language model training, where words in the input are randomly replaced by a [MASK] token and the likelihood of the masked word is maximized in the distribution over vocabulary words predicted at the output. Thus, BERT is trained to model the distribution over words given their context, and is hence suitable for modeling $p(n' \mid c)$ as defined in Sec. 2.4 for constructing context-preserving negative captions.
Since we only care about grounding noun phrases, for computational efficiency we compute the loss only for noun and adjective words in the captions, as identified by a POS tagger, instead of for all caption words.
We optimize the combined objective computed over batches of 50 image-caption pairs using the ADAM optimizer [21]. We compute $\mathcal{L}_{\text{img}}$ for each image using the other images in the batch as negatives.
Attention to phrase grounding.
We use the BERT tokenizer to convert captions into individual word or sub-word tokens, and attention is computed per token. For evaluation, the phrase-level attention score for each region is the maximum attention score assigned to the region by any of the tokens in the phrase. Regions are then ranked according to this phrase-level score.
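This max-over-tokens readout can be sketched as follows (the attention matrix, function name, and shapes here are illustrative):

```python
import numpy as np

def rank_regions_for_phrase(token_attn, phrase_token_ids):
    """token_attn: [num_tokens, num_regions] attention matrix.
    A region's phrase-level score is the maximum attention it receives
    from any token in the phrase; regions are ranked by that score."""
    phrase_scores = token_attn[phrase_token_ids].max(axis=0)
    ranking = np.argsort(-phrase_scores)   # best region first
    return ranking, phrase_scores
```

For example, with two phrase tokens attending over three regions, the region that any token attends to most strongly is ranked first.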
Our experiments compare our approach to the state-of-the-art on weakly supervised phrase localization (Sec. 3.2), ablate the gains due to pretrained language representations and context-preserving negative sampling using a language model (Sec. 3.3), and analyse the relationship between phrase grounding performance and the InfoNCE bound that we optimize as a proxy for phrase grounding (Sec. 3.4).
3.1 Datasets and Metrics
We train our models on image-caption pairs from the COCO training set and use the COCO validation set for part of our analysis; each image is accompanied by 5 captions. For evaluation, we use the Flickr30K Entities validation set for model selection (early stopping) and its test set for reporting final performance; both splits provide 5 captions per image. We report two metrics:
Recall@k. The fraction of phrases for which the ground truth bounding box has a sufficiently high IOU with any of the top-k predicted boxes.

Pointing accuracy. This metric requires the model to predict a single point location per phrase; the prediction is counted as correct if it falls within the ground truth bounding box for the phrase. Unlike recall@k, pointing accuracy does not require identifying the extent of the object. Since our model selects one of the detected regions in the image, we use the center of the selected bounding box as the prediction for each phrase when computing pointing accuracy.
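Both metrics can be sketched in a few lines. The 0.5 IOU threshold and the helper names are illustrative assumptions, not values stated in this section:

```python
def iou(box_a, box_b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(gt_box, ranked_boxes, k, thresh=0.5):
    """1 if any of the top-k predicted boxes overlaps ground truth enough."""
    return float(any(iou(gt_box, b) >= thresh for b in ranked_boxes[:k]))

def pointing_hit(gt_box, pred_box):
    """1 if the predicted box's center lies inside the ground truth box."""
    cx, cy = (pred_box[0] + pred_box[2]) / 2, (pred_box[1] + pred_box[3]) / 2
    x1, y1, x2, y2 = gt_box
    return float(x1 <= cx <= x2 and y1 <= cy <= y2)
```

Averaging these per-phrase indicators over the test set yields the recall@k and pointing accuracy numbers reported below.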
3.2 Performance on Flickr30K Entities
| Method | Training Data | Visual Features | R@1 | R@5 | R@10 | Accuracy |
|---|---|---|---|---|---|---|
| GroundeR (2015) [33] | Flickr30K Entities | VGG-det (VOC) | 28.94 | - | - | - |
| Yeh et al. (2018) [44] | Flickr30K Entities | VGG-cls (IN) | 22.31 | - | - | - |
| Yeh et al. (2018) [44] | Flickr30K Entities | VGG-det (VOC) | 35.90 | - | - | - |
| Yeh et al. (2018) [44] | Flickr30K Entities | YOLO (COCO) | 36.93 | - | - | - |
| KAC Net+Soft KBP (2018) [7] | Flickr30K Entities | VGG-det (VOC) | 38.71 | - | - | - |
| Fang et al. (2015) [13] | COCO | VGG-cls (IN) | - | - | - | 29.00 |
| Akbari et al. (2019) [1] | COCO | VGG-cls (IN) | - | - | - | 61.66 |
| Akbari et al. (2019) [1] | COCO | PNAS Net (IN) | - | - | - | 69.19 |
| Align2Ground (2019) [11] | COCO | Faster-RCNN (VG) | - | - | - | 71.00 |
| Ours | Flickr30K Entities | Faster-RCNN (VG) | 47.88 | 76.63 | 82.91 | 74.94 |
Tab. 1 compares the performance of our method to existing weakly supervised phrase grounding approaches on the Flickr30K Entities test set. Some existing approaches train on the Flickr30K Entities train set and report recall@1, while more recent methods use the COCO train set and report pointing accuracy. Further, all approaches use different visual features, making direct comparison difficult. For a fair comparison to the state-of-the-art, we use the Faster-RCNN trained on Visual Genome object and attribute annotations used in Align2Ground [11], and report performance for models trained on either dataset using both the recall and pointing accuracy metrics.
Using the same training data and visual feature architecture, our model shows a significant absolute gain in pointing accuracy over Align2Ground. Learning with our contrastive formulation is also quite sample efficient, as can be seen from the modest drop in performance when the model is trained on the much smaller Flickr30K Entities train set, which has approximately one-third as many image-caption pairs as COCO.
3.3 Benefits of Language Modeling
| Negative Captions | Language Model | R@1 | R@5 | R@10 | Accuracy |
|---|---|---|---|---|---|
| Contextually plausible | BERT (Pretrained) | 48.05 | 76.78 | 82.97 | 74.91 |
| Excluding near-synonyms & hypernyms | BERT (Pretrained) | 51.67 | 77.69 | 83.25 | 76.74 |
Our approach benefits from language modeling in two ways: (i) using the pretrained language model to extract contextualized word representations, and (ii) using the language model to sample context-preserving negative captions. Tab. 2 evaluates along both of these dimensions.
Gains from pretrained word representations.
In Tab. 2, BERT (Random) refers to the BERT architecture initialized with random weights and finetuned on COCO image-caption data along with parameters of the attention mechanism. BERT (Pretrained) refers to the off-the-shelf pretrained BERT model which is used as a contextualized word feature extractor during contrastive learning without finetuning. We observe a 10% absolute gain in both recall@1 and pointing accuracy by using pretrained word representations from BERT.
Gains from context-preserving negative caption sampling.
Our context-preserving negative sampling has two steps. The first step is drawing negative noun candidates given the context provided by the true caption. The second step is re-ranking the candidates to filter out likely synonyms or hypernyms that are also true for the image.
First, note that randomly sampling negative captions from the training data to compute $\mathcal{L}_{\text{lang}}$ performs similarly to training with $\mathcal{L}_{\text{img}}$ alone. A model trained with contextually plausible negatives significantly outperforms random sampling, with an 8% gain in recall@1 and pointing accuracy. Excluding near-synonyms and hypernyms yields a further gain of about 3 points in recall@1 and accuracy.
3.4 Is InfoNCE a good proxy for learning phrase grounding?
The fact that optimizing our InfoNCE objective results in learning phrase grounding is intuitive but not trivial. Fig. 4 shows how maximizing the InfoNCE lower bound correlates with phrase grounding performance on a heldout dataset. We make several interesting observations: (i) As training progresses (from left to right), the InfoNCE lower bound (Eq. 5) mostly keeps increasing on the validation set, indicating no overfitting in terms of the InfoNCE bound. (ii) As the InfoNCE lower bound increases, phrase grounding performance first rises to a peak and then starts decreasing. This shows that the InfoNCE bound is correlated with grounding performance, but maximizing it fully does not necessarily yield the best grounding. A similar observation has been made for representation learning [39]. (iii) The peak performance and the number of iterations needed to reach it depend on the choice of the key-value-query modules. One- and two-layer MLPs reach the peak faster and perform better than linear functions.
3.5 Qualitative Results
Fig. 5 visualizes the word-region attention learned by our model. The qualitative results demonstrate the following abilities: (i) localizing different objects mentioned in the same caption with varying degrees of semantic relatedness, e.g., man and canine in row 1 vs. man and woman in row 3; (ii) disambiguation between two instances of the same object category using caption context. For example, boy and another in row 4 and bride and groom from other men and women in row 3; (iii) localizing object parts such as toddler’s shirt in row 2 and instrument’s mouthpiece in row 5; (iv) handling occlusion, e.g., table covered with toys in row 6; (v) handling uncommon words or categories like ponytail and mouthpiece in row 5 and hose in row 7.
These results show that, given rich visual and contextualized word representations, contrastive learning causes our attention model to learn phrase grounding.
4 Limitations and Future Works
The empirical examination of our framework reveals the following limitations:
Dependence on pretrained representations.

Like prior art, our approach relies on a pretrained object detector and a language model to represent regions and caption words. Ideally, we would want to learn region and word representations from scratch, or improve existing ones, directly from image-caption data.
Need for fully-labeled validation set.
In Fig. 4, we observe that early stopping based on validation performance is required to choose the best model for phrase grounding. While this is common practice for weakly supervised learning, and the Flickr30K Entities validation set we use is smaller than the COCO training set, it nevertheless amounts to using full supervision for a small set of images.
Bounds on MI.
While $\mathcal{L}_{\text{img}}$ in Eq. 8 is a valid lower bound on MI, our $\mathcal{L}_{\text{lang}}$ in Eq. 9 is no longer a lower bound on MI, since it oversamples negative words related to a caption. A valid bound would involve randomly sampling captions from the training data; however, our context-preserving negative captions lead to much better performance.
In this work, we offer a novel perspective on weakly supervised phrase grounding from paired image-caption data which has traditionally been cast as a multiple instance learning problem. We formulate the problem as that of estimating mutual information between image regions and caption words. We demonstrate that maximizing a lower bound on mutual information with respect to parameters of a region-word attention mechanism results in learning to ground words in images. We also show that language models can be used to generate context-preserving negative captions which greatly improve learning in comparison to randomly sampling negatives from training data.
- [1] Multi-level multimodal common semantic space for image-phrase grounding. CVPR, 2018.
- [2] Fusion of detected objects in text for visual question answering. EMNLP, 2019.
- [3] Deep variational information bottleneck. ICLR, 2017.
- [4] Bottom-up and top-down attention for image captioning and visual question answering. CVPR, 2017.
- [5] Learning representations by maximizing mutual information across views. arXiv:1906.00910, 2019.
- [6] Mutual information neural estimation. ICML, 2018.
- [7] Knowledge aided consistency for weakly supervised phrase grounding. CVPR, 2018.
- [8] A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020.
- [9] UNITER: learning universal image-text representations. arXiv:1909.11740, 2019.
- [10] Evaluating weakly supervised object localization methods right. arXiv, 2020.
- [11] Align2Ground: weakly supervised phrase grounding guided by image-caption alignment. ICCV, 2019.
- [12] BERT: pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 2018.
- [13] From captions to visual concepts and back. CVPR, 2014.
- [14] Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. AISTATS, 2010.
- [15] Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722, 2019.
- [16] Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272, 2019.
- [17] Learning deep representations by mutual information estimation and maximization. ICLR, 2019.
- [18] Attention-based deep multiple instance learning. ICML, 2018.
- [19] Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML, 2015.
- [20] Disentangling by factorising. ICML, 2018.
- [21] Adam: a method for stochastic optimization. ICLR, 2015.
- [22] Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. arXiv:1908.06066, 2019.
- [23] VisualBERT: a simple and performant baseline for vision and language. arXiv:1908.03557, 2019.
- [24] ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 2019.
- [25] A framework for multiple-instance learning. NeurIPS, 1998.
- [26] Formal limitations on the measurement of mutual information. arXiv:1811.04251, 2018.
- [27] Distributed representations of words and phrases and their compositionality. NIPS, 2013.
- [28] Self-supervised learning of pretext-invariant representations. arXiv:1912.01991, 2019.
- [29] Learning word embeddings efficiently with noise-contrastive estimation. NeurIPS, 2013.
- [30] Representation learning with contrastive predictive coding. arXiv, 2018.
- [31] On variational bounds of mutual information. ICML, 2019.
- [32] Faster R-CNN: towards real-time object detection with region proposal networks. NeurIPS, 2015.
- [33] Grounding of textual phrases in images by reconstruction. ECCV, 2016.
- [34] Understanding the limitations of variational mutual information estimators. ICLR, 2020.
- [35] VL-BERT: pre-training of generic visual-linguistic representations. 2020.
- [36] Contrastive bidirectional transformer for temporal representation learning. arXiv:1906.05743, 2019.
- [37] VideoBERT: a joint model for video and language representation learning. ICCV, 2019.
- [38] LXMERT: learning cross-modality encoder representations from transformers. EMNLP, 2019.
- [39] Contrastive multiview coding. arXiv:1906.05849, 2019.
- [40] On mutual information maximization for representation learning. ICLR, 2020.
- [41] Attention is all you need. NeurIPS, 2017.
- [42] Phrase localization without paired training examples. ICCV, 2019.
- [43] Unsupervised feature learning via non-parametric instance discrimination. CVPR, 2018.
- [44] Unsupervised textual grounding: linking words to image concepts. CVPR, 2018.
- [45] Unified vision-language pre-training for image captioning and VQA. AAAI, 2020.
Appendix 0.A Appendix
0.a.1 Advantages of Context-Preserving Negative Sampling
Commonly used strategies for negative sampling for contrastive learning include randomly sampling captions from the training data as negatives or mining hard-negatives from a randomly sampled mini-batch. In our experiments (Tab. 2), random sampling showed no significant gains over a model trained without negative captions. This is because the sampled negatives often have an entirely different context as compared to the image and the positive caption which makes it too easy for the model to produce a low compatibility score for these negatives.
In contrast, context-preserving negative sampling shows significant gains in pointing accuracy over random sampling (Tab. 2). This is because we construct harder negative captions, which yield a more informative training signal than random sampling. We construct negatives by substituting only a single word in the caption while preserving the context of the positive caption. The substitutions are further chosen to be plausible given the context while discarding likely synonyms and hypernyms. Unlike random sampling approaches, whose success depends on the occurrence of informative negative captions in the training data and on the likelihood of sampling such negatives for a positive caption in the same minibatch, our approach can construct effective negatives for any positive caption.
0.a.2 Relation between our query-key-value attention and self-attention in Transformers
Our query-key-value attention mechanism is related to the attention mechanism used in transformer-based architectures [41] like BERT [12]. Transformers use the mechanism for self-attention, where queries, keys, and values are computed for each word in the input sentence and the attention scores are used for contextualization. In contrast, we use the attention mechanism for word-region alignment: we compute queries for each contextualized word, keys for each region, and values for both regions and words (using separate value networks for the two).
0.a.3 Comparison to Align2Ground
While we use the same visual features as the previous state-of-the-art, Align2Ground [11], the two approaches use different textual features. Align2Ground uses a bi-GRU, whereas we chose BERT, a transformer-based language model, since transformer-based models have become more prevalent than RNN-based ones in the vision-language community. To estimate the gain due to pretrained language representations, Tab. 2 compares the grounding performance of randomly initialized BERT to that of pretrained BERT. Negative sampling brings further gains.