More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints

by   Yuxiao Chen, et al.
Rutgers University

Attention mechanisms have been widely applied to cross-modal tasks such as image captioning and information retrieval, and have achieved remarkable improvements due to their capability to learn fine-grained relevance across different modalities. However, existing attention models can be sub-optimal and imprecise because no direct supervision is involved during training. In this work, we propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation. These constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations. Additionally, we introduce three metrics, namely Attention Precision, Attention Recall, and Attention F1-Score, to quantitatively evaluate attention quality. We evaluate the proposed constraints on the cross-modal retrieval (image-text matching) task. Experiments on both the Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves model performance in terms of both retrieval accuracy and attention metrics.


1 Introduction

Recently, attention mechanisms have been introduced for various cross-modal vision-language tasks, such as image-text matching [liu2019focus, lee2018stacked, nam2017dual, huang2017instance, huang2018bi], Visual Question Answering [yang2016stacked, noh2016image], and image captioning [xu2015show, you2016image]. These attention-based approaches have achieved remarkable improvements because they can learn fine-grained cross-modal relevance. A sentence and its corresponding image are first represented as fragments (individual words and image regions). We refer to the fragments of the context modality as query fragments, and the fragments of the attended modality as key fragments. For example, when generating the attention weights of local image regions given each word, the image and text fragments are defined as the key and query fragments, respectively.

Figure 1: Visualization of the attention maps of the SCAN model learned without and with our proposed constraints.
Figure 2: Overview of (a) cross-modal attention mechanism, and our proposed attention constraints (b) Contrastive Content Re-sourcing and (c) Contrastive Content Swapping.

In ideal cases, a well-trained attention model will “attend” to all semantically relevant key fragments by assigning large attention weights to them, and “ignore” irrelevant fragments by outputting small attention weights. For example, attention models should output large attention weights for all image regions containing the dog when the word “dog” is used as a query fragment, and small attention weights for other image fragments, as shown in Figure 1 (b). However, since these attention models are trained in a data-driven manner and do not receive any explicit supervision or constraints, they are likely to be driven by biased co-occurrences in the training dataset, and may not precisely perceive the desired contents. As shown in Figure 1 (a), the model fails to attend to relevant fragments containing the dog. This example illustrates the problem of low attention “recall”. Additionally, attention models can suffer from low attention “precision”. As shown in Figure 1 (c), the attention weights are driven by context (human body, background, etc.) instead of capturing the concept of “helmet”. Admittedly, context provides clues that are critical to detecting small objects. However, context also involves a certain level of noise, which makes it more difficult to precisely capture the desired contents. A possible solution to these limitations is to rely on manual annotations, and then generate attention map ground truth [qiao2018exploring, zhang2019interpretable]. However, annotating attention distributions is an ill-defined task, and would be labor-intensive.

To this end, we propose two learning constraints, namely Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS), to supervise the learning of cross-modal attentions. Figure 2 gives an overview of our methods. The CCR constraint guides attention models by enforcing the key fragments with high attention weights (referred to as attended key fragments) to be more relevant to the corresponding query fragment than the key fragments with low attention weights (referred to as ignored key fragments). Consider Figure 1 (a) and (b): the CCR constraint forces the attention model to assign large attention weights to the regions containing the dog. In contrast, the CCS constraint further encourages an attention model to distinguish the content from context or background more precisely. It does so by constraining a query fragment’s attended information, which is encoded as the weighted sum of key fragment features, to be more relevant to the query fragment than to a modified query fragment generated by a “swapping” operation, which is explained in Section 3.3. In the example shown in Figure 1 (c) and (d), the high attention weights on contextual regions including the human body and background are penalized so that more accurate detection is achieved. The proposed constraints are training strategies and can be easily integrated into existing attention-based cross-modal models, as discussed in Section 3.

We evaluate the performance of these constraints using the image-text matching task, and incorporate them into two state-of-the-art attention-based image-text matching networks [lee2018stacked, liu2019focus]. Image-text matching aims to retrieve the most semantically relevant images (texts) of a text (image) query. The experimental results on both MS-COCO [lin2014microsoft] and Flickr30K [plummer2015flickr30k] demonstrate that these constraints significantly improve performances of these methods in terms of accuracy and generalization ability. Additionally, in order to provide fair comparisons between different models in terms of attention correctness, we extend the most widely used qualitative attention evaluations in previous studies by exploring three new metrics, namely Attention Recall, Attention Precision and Attention F1-Score, to quantitatively evaluate the correctness of learned attention models.

Our main contributions are:

  • We propose two learning constraints to supervise the training of cross-modal attention models in a contrastive manner. They do not require additional annotations and can be generalized to different cross-modal problems.

  • We introduce three attention metrics to quantitatively evaluate the performance of learned attention models in terms of precision, recall, and F1-Score.

2 Related Work

Cross-modal attention models in various tasks. Attention-based models have become a mainstream approach for various vision-language tasks including image captioning [anderson2018bottom, chen2018factual], image-text matching [nam2017dual, huang2018bi], visual question answering [yang2016stacked, yu2017multi], etc. Attention models were first applied to image captioning to aggregate visual signals by targeting local image regions related to the given caption [xu2015show, you2016image]. For image-text matching, [lee2018stacked] introduced a Stacked Cross Attention Network (SCAN) to infer fine-grained relevance between words and image regions by leveraging both image-to-text and text-to-image attentions. [liu2019focus] extended SCAN by eliminating irrelevant fragments that are identified by intra-modality relations. Although these methods have achieved promising results, the learning process of these attention models could be sub-optimal due to the lack of direct supervision, as discussed in Section 1.

Supervision for learning attention models. The task of properly training cross-modal attention models with supervision has drawn growing interest. The main challenge lies in defining and collecting supervision signals. [qiao2018exploring] first trained an attention map generator on a human attention dataset, and then applied the generator’s attention map predictions as weak annotations. [liu2017attention] leveraged human-annotated alignments between a word and corresponding image regions as supervision. Similar to [liu2017attention], image local region descriptions and object annotations in Visual Genome [krishna2017visual] were leveraged for generating attention supervision [zhang2019interpretable]. These methods obtain supervision from certain forms of human annotations, such as word-image correspondence and image local region annotations, whereas our approach provides supervision by constructing pair-wise samples for contrastive learning.

3 Methodology

3.1 Attention in Image-Text Matching

In image-text matching, given an image-sentence pair, the image $I$ is encoded by a set of vectors, each of which represents a local region, and the sentence $T$ is encoded by a sequence of vectors which embed the semantics of each word. In a nutshell, an attention model takes these vectors as input, computes the cross-modal correspondence between each query fragment and all key fragments, and eventually outputs a similarity of the image-sentence pair.

Let $q_i$ and $k_j$ refer to the feature representations of the $i$-th query and $j$-th key fragments. The attention model first calculates $k_j$’s attention weight with respect to $q_i$ as follows:

$$e_{i,j} = f_{att}(q_i, k_j), \qquad w_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j' \in K} \exp(e_{i,j'})} \tag{1}$$

where $f_{att}$ is the attention function whose output is a scalar that measures the relevance between $q_i$ and $k_j$; $K$ is the set of indexes of all key fragments; $w_{i,j}$ is $k_j$’s attention weight with respect to $q_i$. Therefore, $q_i$’s attended information is summarized as the vector $a_i$, which is the weighted sum of key fragment features, as shown in Equation 2.

$$a_i = \sum_{j \in K} w_{i,j}\, k_j \tag{2}$$

The similarity score between the image $I$ and the sentence $T$ is then defined as:

$$S(I, T) = \operatorname{AGG}_{i \in Q}\, s(q_i, a_i) \tag{3}$$

where $Q$ denotes the set of indexes of all query fragments; $s(\cdot, \cdot)$ is the similarity function; $\operatorname{AGG}$ is a function that aggregates similarity scores among all query fragments.
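The attention pipeline described above (attention weights, attended vectors, and an aggregated pair similarity) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by SCAN or BFAN: the cosine-based attention function, the `temperature` parameter, and mean aggregation are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def attend(queries, keys, temperature=1.0):
    """Return attention weights (n_q, n_k) and attended vectors (n_q, d).

    Assumption: the attention function is a scaled cosine similarity
    followed by a softmax over the key fragments.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = temperature * (q @ k.T)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    attended = weights @ keys  # weighted sum of key fragment features
    return weights, attended

def cosine(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def pair_similarity(queries, keys, temperature=1.0):
    """Image-sentence similarity; mean aggregation is an assumption."""
    _, attended = attend(queries, keys, temperature)
    return float(np.mean(cosine(queries, attended)))
```

With words as queries and image regions as keys, `pair_similarity(word_features, region_features)` would give one score per image-sentence pair, which is the quantity the ranking loss operates on.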

The most widely used loss function for this task is a triplet ranking loss with hard negative sampling [faghri2017vse++, lee2018stacked, liu2019focus], defined in Equation 4:

$$\mathcal{L}_{rank} = [\gamma_1 - S(I, T) + S(I, \hat{T})]_+ + [\gamma_1 - S(I, T) + S(\hat{I}, T)]_+ \tag{4}$$

where $\gamma_1$ is a margin parameter and $[x]_+ = \max(x, 0)$; the image $I$ and the sentence $T$ form a positive sample pair, and $\hat{I}$ and $\hat{T}$ are the hardest negatives [faghri2017vse++] for the positive sample pair. $\mathcal{L}_{rank}$ enforces the similarity between an anchor image $I$ and its matched sentence $T$ to be larger than the similarity between the anchor image and an unmatched sentence by a margin $\gamma_1$, and vice versa for the sentence $T$.
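As a rough sketch of the ranking loss with hardest negatives, the following NumPy function operates on a square similarity matrix for a mini-batch. It assumes that matched pairs lie on the diagonal and that the hardest negatives are selected within the batch, as in VSE++; the function name and the default margin are illustrative choices, not values taken from this paper.

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """Triplet ranking loss with in-batch hardest negatives.

    sim[i, j] is the similarity between image i and sentence j;
    the diagonal holds the matched (positive) pairs.
    """
    n = sim.shape[0]
    positives = np.diag(sim)
    # Mask out the positives so they are never picked as negatives.
    negatives = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    hardest_sentence = negatives.max(axis=1)  # per anchor image
    hardest_image = negatives.max(axis=0)     # per anchor sentence
    cost_sentence = np.maximum(0.0, margin - positives + hardest_sentence)
    cost_image = np.maximum(0.0, margin - positives + hardest_image)
    return float(np.sum(cost_sentence + cost_image))
```

Note that the loss only sees pair-level similarities; nothing in it constrains which key fragments receive the attention weight.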

However, this loss function works at the similarity-level and does not provide any supervision for linking cross-modal contents at the attention-level. In other words, learning cross-modal attentions is a pure data-driven approach and lacks supervision. As a result, the learned attention model could be sub-optimal.

3.2 Contrastive Content Re-sourcing

A property of a well-learned attention model is that, given a query fragment, it will assign large attention weights to key fragments that are relevant to the query fragment, and output small attention weights to irrelevant key fragments. The main idea of CCR is to explicitly constrain attention models to learn this property in a contrastive learning manner, as shown in Figure 2 (b).

Specifically, given a query fragment $q_i$, the key fragments are divided into two groups based on their attention weights with respect to $q_i$. One group is the attended key fragments, denoted as $K_i^+$, where $K_i^+ = \{ j \mid j \in K,\ H(w_{i,j}) = 1 \}$; $H(\cdot)$ is a function that returns 1 if $w_{i,j}$ is large, and returns 0 otherwise. The other group consists of the ignored key fragments, denoted as $K_i^-$, where $K_i^- = \{ j \mid j \in K,\ H(w_{i,j}) = 0 \}$.

We calculate the feature vector $a_i^+$ for the key fragments in $K_i^+$ following Equation 5:

$$a_i^+ = \sum_{j \in K_i^+} w_{i,j}\, k_j \tag{5}$$

where $a_i^+$ can be regarded as $q_i$’s attended feature after removing the ignored key fragments in $K_i^-$ from $K$. We can extract the feature $a_i^-$ for $K_i^-$ in the same way. The relevance between the query fragment and either the attended or ignored fragments is measured by the similarity function $s(\cdot, \cdot)$. Therefore, we define the loss function for CCR as:

$$\mathcal{L}_{CCR} = \sum_{i \in Q} [\gamma_2 - s(q_i, a_i^+) + s(q_i, a_i^-)]_+ \tag{6}$$

where $\gamma_2$ is a margin parameter that controls the similarity difference margin. $\mathcal{L}_{CCR}$ bridges the gap between the task’s objective (similarity score for image-text pairs) and the intermediate learning process (attention weights). In other words, the model is encouraged to assign higher weights to the key fragments whose contents are more relevant to the query fragments, which reinforces the motivation for applying attention models in the first place.
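The CCR idea can be illustrated for a single query fragment as follows. This is a sketch under stated assumptions rather than the authors’ code: the split between attended and ignored key fragments uses a simple threshold (defaulting to the uniform weight `1 / n_keys`), the similarity function is cosine similarity, and all names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def ccr_loss(query, keys, weights, margin=0.2, threshold=None):
    """Hinge loss for one query fragment.

    query: (d,), keys: (n_k, d), weights: (n_k,) attention weights.
    Assumption: a key fragment counts as attended when its weight
    exceeds `threshold` (uniform weight by default).
    """
    if threshold is None:
        threshold = 1.0 / len(weights)
    attended = weights > threshold
    # Attended and ignored features: weighted sums over each group.
    a_pos = (weights[attended, None] * keys[attended]).sum(axis=0)
    a_neg = (weights[~attended, None] * keys[~attended]).sum(axis=0)
    return max(0.0, margin - cosine(query, a_pos) + cosine(query, a_neg))
```

The loss is zero when the attended group is already more similar to the query than the ignored group by at least the margin, and grows when large weights sit on fragments unrelated to the query.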

3.3 Contrastive Content Swapping

The main idea of Contrastive Content Swapping is similar to triplet loss based metric learning, where the similarity between a positive data pair (anchor image and its caption) is enforced to be higher than the similarity between a negative data pair (anchor image and an unmatched caption). CCS applies the same training logic at the fragment level. We first generate negative samples, referred to as swapped query fragments, by “swapping” the contents in the original query fragments with visually or semantically different contents from the same modality. The attended information of a query fragment is enforced to be more relevant to the query fragment than to any swapped query fragment, as shown in Figure 2 (c). The CCS loss function is defined as:

$$\mathcal{L}_{CCS} = \sum_{i \in Q} \sum_{\hat{q}_{i,u} \in \hat{Q}_i} [\gamma_3 - s(q_i, a_i) + s(\hat{q}_{i,u}, a_i)]_+ \tag{7}$$

where $\gamma_3$ is the margin parameter; $\hat{Q}_i$ is the set of swapped query fragments for $q_i$; $\hat{q}_{i,u}$ is the embedding of the $u$-th fragment in $\hat{Q}_i$.

The motivation behind the CCS constraint is that, in order to minimize $\mathcal{L}_{CCS}$, attention models will learn to diminish the attention weights of the key fragments that are relevant to $\hat{q}_{i,u}$. As a result, the information that is relevant to $\hat{q}_{i,u}$, and thus irrelevant to $q_i$, is eliminated. However, computing $\mathcal{L}_{CCS}$ over all query fragments and all of their swapped fragments is intractable in practice. To move beyond this limitation, we only sample one query fragment and its corresponding swapped query fragment to calculate $\mathcal{L}_{CCS}$, as shown in Equation 8.

$$\mathcal{L}_{CCS} = [\gamma_3 - s(q_i, a_i) + s(\hat{q}_{i,u}, a_i)]_+ \tag{8}$$

where the $i$-th query fragment and its corresponding $u$-th swapped query fragment are sampled.
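The sampled form of the CCS constraint can be sketched as below for text queries. The uniform random choice of a replacement word is an assumption made for illustration; the paper only states that the swapped content should be visually or semantically different, and the helper names and the toy vocabulary are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def sample_swapped_word(word, vocabulary, rng):
    """Pick a different word uniformly at random (assumed strategy)."""
    candidates = [w for w in vocabulary if w != word]
    return candidates[rng.integers(len(candidates))]

def ccs_loss(query, swapped_query, attended, margin=0.2):
    """Hinge loss for one sampled (query, swapped query) pair.

    `attended` is the attended vector computed for the ORIGINAL query;
    it should be closer to the query than to the swapped query.
    """
    return max(0.0, margin - cosine(query, attended)
               + cosine(swapped_query, attended))
```

For example, with toy embeddings `{"dog": [1, 0], "helmet": [0, 1]}`, an attended vector of `[0.9, 0.1]` for the query "dog" incurs no loss, whereas `[0.5, 0.5]` is equally close to both words and is penalized by the full margin.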

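By incorporating the CCR and CCS constraints for image-text matching, we obtain the complete loss function in Equation 9, where $\lambda_{CCR}$ and $\lambda_{CCS}$ are scalars that control the contributions of CCR and CCS, respectively:

$$\mathcal{L} = \mathcal{L}_{rank} + \lambda_{CCR}\, \mathcal{L}_{CCR} + \lambda_{CCS}\, \mathcal{L}_{CCS} \tag{9}$$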

3.4 Attention Metrics

Previous studies [lee2018stacked, liu2019focus] focus on qualitatively evaluating the attention models by visualizing attention maps. These approaches can not serve as a standard metric for comparing attention correctness among different models. Therefore, we propose Attention Precision, Attention Recall and Attention F1-Score, to quantitatively evaluate the performance of learned attention models. Attention Precision is the fraction of attended key fragments that are relevant to the correspondent query fragment, and Attention Recall is the fraction of relevant key fragments that are attended. Attention F1-Score is a combination of the Attention Precision and Attention Recall that provides an overall evaluation of the attention model correctness. In our experiments, we only evaluate the attention models that use texts as the query fragments. This is because the text encoders used in our models [lee2018stacked, liu2019focus] are GRUs [chung2014empirical], where semantics among text fragments propagate and cause false positives when using text fragments as the key fragments.

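Given a matched image-sentence pair, the $j$-th image fragment $k_j$ is labeled as a relevant fragment of the $i$-th text fragment $q_i$ if the Intersection over Union (IoU) between $k_j$ and the corresponding region of $q_i$ is larger than a threshold $T_{IoU}$. In addition, $k_j$ is regarded as an attended fragment of $q_i$ if $k_j$’s attention weight with respect to $q_i$ is larger than a threshold $T_{att}$. Let $A_i$ and $R_i$ denote the sets of attended and relevant image fragments of $q_i$. The Attention Precision $AP_i$, Attention Recall $AR_i$, and Attention F1-Score $AF_i$ of $q_i$ are thus defined as:

$$AP_i = \frac{|A_i \cap R_i|}{|A_i|}, \qquad AR_i = \frac{|A_i \cap R_i|}{|R_i|}, \qquad AF_i = \frac{2\, AP_i\, AR_i}{AP_i + AR_i} \tag{10}$$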

The Flickr30k Entities dataset [plummer2015flickr30k] provides annotations that link noun phrases to image regions. A noun phrase may contain multiple words, and different words could correspond to the same image region. In order to obtain the overall attention metrics on Flickr30K, we first calculate the attention metrics at the word level, and use the maximal values within each phrase as the phrase-level metrics. The overall attention metrics are then obtained by averaging the phrase-level metrics.
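The word-level metrics can be sketched as follows. This is an illustrative version under simplifying assumptions: each word is linked to a single ground-truth box, boxes are given as `(x1, y1, x2, y2)`, empty sets yield a score of 0, and the function names are hypothetical.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def attention_metrics(weights, region_boxes, gt_box, iou_thr=0.5, att_thr=None):
    """Attention Precision, Recall, and F1 for one query word."""
    if att_thr is None:
        att_thr = 1.0 / len(weights)  # assumed default: uniform weight
    attended = {j for j, w in enumerate(weights) if w > att_thr}
    relevant = {j for j, b in enumerate(region_boxes) if iou(b, gt_box) > iou_thr}
    hits = len(attended & relevant)
    precision = hits / len(attended) if attended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

Phrase-level scores would then be the maximum over the words of a phrase, and the reported numbers the average over phrases, following the aggregation described above.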

4 Experiments

4.1 Datasets and Settings

Datasets. We evaluate our method on two public image-text matching benchmarks: Flickr30K [young2014image] and MS-COCO [lin2014microsoft]. The Flickr30K dataset contains 31K images collected from the Flickr website. Each of these images is annotated with 5 captions. Following the setting of [liu2019focus, lee2018stacked], we split the dataset into 29k training images, 1k validation images and 1k testing images. The MS-COCO dataset used for image-text matching consists of 123,287 images, each of which has 5 human-annotated descriptions. Following [liu2019focus, lee2018stacked], the dataset is divided into 113,283 images for training, 5,000 images for validation and 5,000 for testing.

Evaluation Metrics. Following [liu2019focus, lee2018stacked], we calculate Recall@K (K = 1, 5, 10) for both the Image Retrieval (using sentences to retrieve images) and Sentence Retrieval (using images to retrieve sentences) tasks. Recall@K is the proportion of queries for which at least one matching item appears among the top-K retrieved items. We also report rsum, which is the sum of all Recall@K values for each model. On the Flickr30K dataset, we report results on the 1K testing images. On the MS-COCO dataset, we report results averaged over five folds of 1K test images (referred to as MS-COCO 1K) or on the full 5K test images (referred to as MS-COCO 5K), following [chen2020adaptive].
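For concreteness, Recall@K for one retrieval direction can be computed from a query-by-item similarity matrix as in the sketch below. The function name and the representation of ground truth as a list of index sets are assumptions made for this example.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent for one retrieval direction.

    sim: (n_queries, n_items) similarity matrix.
    gt:  list where gt[i] is the set of item indices matching query i.
    """
    ranking = np.argsort(-sim, axis=1)  # best item first
    results = {}
    for k in ks:
        hits = [bool(gt[i] & set(ranking[i, :k].tolist()))
                for i in range(sim.shape[0])]
        results[k] = 100.0 * float(np.mean(hits))
    return results
```

rsum is then the sum of the Recall@K values over both directions, for example `sum(i2t.values()) + sum(t2i.values())` where `i2t` and `t2i` are the dictionaries returned for sentence retrieval and image retrieval.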

For the attention metrics, $T_{IoU}$ is set to 0.5; $T_{att}$ is set to 1/36 (the average of image fragment attention weights with respect to a text fragment) for SCAN; $T_{att}$ is set to 0 for BFAN, because this model sets the attention weights of irrelevant regions to 0 [liu2019focus].

Baselines. We evaluate the proposed constraints by incorporating them into the following state-of-the-art attention-based image-text matching models:

  • SCAN [lee2018stacked] infers latent semantic alignments between words and image regions dynamically by attending to words with respect to regions or attending to regions with respect to words, which can generate differential attentions to important image regions and words according to the context.

  • BFAN [liu2019focus] first infers cross-modality relevance based on conventional attention mechanism, and then removes irrelevant fragments to eliminate noise from attended features.

For the attention models that use regions as query fragments, the swapping strategy we implement is to swap text fragments instead of image fragments. The reason is that images are by nature in a continuous space, whereas texts are discrete. Interestingly, a recent study shows that a copy-paste (swapping) strategy can function as data augmentation and improve performance in semantic segmentation [ghiasi2020simple], which suggests that the CCS constraint can be generalized to swapping visual contents. Further implementation details are discussed in the supplementary materials.

                 Sentence Retrieval       Image Retrieval
Method           R@1    R@5    R@10       R@1    R@5    R@10      rsum
SCAN             67.4   90.7   94.9       47.8   77.4   85.3      463.5
  + CCR          71.4   92.7   95.5       50.6   79.4   86.2      475.8
  + CCS          69.8   91.4   95.9       51.1   79.2   86.4      473.8
  + CCR + CCS    70.1   92.0   96.0       52.3   79.9   86.8      477.1
BFAN             73.6   93.2   96.3       53.0   79.7   86.7      482.5
  + CCR          73.7   93.8   96.7       54.6   81.3   87.3      487.4
  + CCS          73.1   93.3   96.9       53.9   80.6   87.6      485.4
  + CCR + CCS    75.3   93.6   96.7       55.4   81.3   87.7      490.0
Table 1: Results of sentence retrieval and image retrieval tasks on the Flickr30K test set. R@K refers to Recall@K.

                 Sentence Retrieval       Image Retrieval
Method           R@1    R@5    R@10       R@1    R@5    R@10      rsum
1K Test Images
SCAN             69.4   93.5   97.4       52.2   85.1   93.1      490.7
  + CCR          71.3   94.2   97.8       56.4   87.2   94.1      501.0
  + CCS          70.4   94.2   98.0       56.8   87.4   94.2      501.0
  + CCR + CCS    70.9   94.3   98.0       57.3   87.6   94.3      502.4
BFAN             73.9   95.0   98.4       59.2   88.4   94.6      509.5
  + CCR          75.7   95.2   98.3       60.1   88.6   94.7      512.6
  + CCS          75.1   95.1   98.2       59.3   88.3   94.5      510.5
  + CCR + CCS    75.4   95.3   98.5       60.3   88.6   94.6      512.7
5K Test Images
SCAN             47.2   77.6   87.7       34.7   65.2   77.3      389.7
  + CCR          47.7   78.3   88.2       36.2   66.6   78.2      395.2
  + CCS          46.5   78.5   88.0       36.5   66.6   78.3      394.4
  + CCR + CCS    47.9   78.1   88.2       36.9   66.9   78.4      396.4
BFAN             51.2   80.4   89.5       37.4   66.5   78.2      403.2
  + CCR          53.5   81.5   90.1       38.3   67.8   78.5      409.7
  + CCS          52.8   81.1   89.8       37.9   67.0   78.2      406.8
  + CCR + CCS    53.1   81.8   90.2       38.8   67.8   78.6      410.3
Table 2: Results of sentence retrieval and image retrieval tasks on the MS-COCO test set. R@K refers to Recall@K.

4.2 Experiments on Image-Text Matching

The image retrieval and sentence retrieval results on the Flickr30K and MS-COCO datasets are shown in Tables 1 and 2, respectively. Applying the CCR and CCS constraints individually raises rsum for both models on both datasets, with gains on most individual Recall@K values. Integrating both constraints into SCAN and BFAN improves rsum by 3.2 to 13.6 points over the baseline methods and gives the best overall performance. Comparing the two constraints when used separately, we observe that CCR yields slightly better sentence retrieval R@1 than CCS when used with SCAN on both datasets. One possible explanation is that CCR is designed to leverage image-sentence similarity directly, and thus is more relevant to the objective of the image-text matching task.

4.3 Experiments on Generalization Ability

We evaluate the models’ generalization ability in a transfer learning setting: we train the models on the MS-COCO dataset and then test them on the Flickr30K dataset, as shown in Table 3. Due to the page limit, we only present the results for the baselines and our complete models. We obtain better performance in every metric for both retrieval tasks, and achieve rsum improvements of about 5 to 10 points. The improvements are consistent with the results in Tables 1 and 2, illustrating that the proposed constraints can improve the retrieval performance of attention models without compromising their generalization ability.

                 Sentence Retrieval       Image Retrieval
Method           R@1    R@5    R@10       R@1    R@5    R@10      rsum
SCAN             55.0   80.7   88.2       41.8   69.9   78.7      414.3
  + CCR + CCS    55.3   82.4   90.2       45.1   71.5   79.7      424.2
BFAN             57.5   84.7   91.5       45.0   71.7   80.0      430.4
  + CCR + CCS    59.9   85.7   92.0       46.3   72.6   80.9      435.2
Table 3: Results of testing models on Flickr30K that are trained on MS-COCO. R@K refers to Recall@K.
Figure 3: Examples showing attended image regions with respect to the given words for the SCAN model.
Method           Attention Precision   Attention Recall   Attention F1-Score
SCAN             0.201                 0.858              0.305
  + CCR          0.221                 0.914              0.333
  + CCS          0.210                 0.883              0.317
  + CCR + CCS    0.238                 0.916              0.352
BFAN             0.315                 0.815              0.419
  + CCR          0.356                 0.784              0.448
  + CCS          0.329                 0.848              0.437
  + CCR + CCS    0.372                 0.805              0.466
Table 4: Attention Precision, Attention Recall, and Attention F1-Score of the SCAN and BFAN models trained on the Flickr30K dataset.

4.4 Attention Evaluation

Quantitative Analysis. Note that explicit annotations that link text and image fragments are required for calculating the proposed attention metrics. To obtain these metrics on MS-COCO, additional manual work is needed to provide a mapping between bounding box annotations and words/phrases for each sample. Therefore, we only present results on Flickr30K in Table 4. Similar to our observations in Section 4.2, applying CCR and CCS individually yields higher Attention Precision and Attention F1-Score than both baseline methods; Attention Recall also improves, except for BFAN with CCR. Comparing the F1-Scores, CCR yields slightly better performance than CCS, and the attention models obtained under both constraints achieve the best scores. Additionally, the improvements obtained on attention metrics are consistent with the improvements on image-text matching in Section 4.2, indicating that learning more accurate attention models could potentially benefit the retrieval tasks.

Qualitative Analysis. We visualize the attention weights with respect to given query words in Figure 3. In SCAN (b), some irrelevant regions are assigned large attention weights. SCAN+CCR (c) assigns large attention weights more accurately to relevant regions. However, in the second example, CCR does not fully diminish the attention weights assigned to regions that are irrelevant to the query “table”. One possible explanation is that, if all relevant key fragments are attended together with a small number of irrelevant key fragments, the CCR loss will then be 0, so CCR does not penalize the attention model for attending to irrelevant fragments. In contrast, the attention weights assigned to irrelevant regions are greatly diminished in SCAN+CCS (d), especially for “table”. In the first example, for the query “barbecue”, applying both constraints decreases the attention weights of background regions, such as the surrounding areas of the barbecue grill, more significantly. These results indicate the benefits of combining CCR and CCS.

4.5 Discussions

We conduct extensive experiments to demonstrate the effectiveness of the proposed constraints using the image-text matching task. These constraints can also be applied to other cross-modal attention-based models. However, the implementation of these constraints can be less intuitive for tasks such as image captioning, where cross-modal attentions are calculated at each word-generation step. To implement CCR for image captioning, we first calculate the attention weights and visual features for the attended and ignored key fragments, using the previous predicted or ground-truth word as the query fragment. Then, the hidden state vector generated from the attended key fragments’ features is enforced to be more relevant to the ground truth than the hidden state vector generated from the ignored key fragments’ features. To implement CCS for image captioning, we use both the previous word and a swapped word, each fused with the original attended visual features, at each step. Similarly, the word generated from the original previous word should be more relevant to the ground-truth word than the word generated from the swapped word. The relevance can be defined as distances in the feature space.

5 Conclusions

To overcome the lack of direct supervision in learning cross-modal attention models, we introduce CCR and CCS to provide supervision in a contrastive manner. These constraints are generic learning strategies that can be integrated into attention-based models for various applications. In addition, in order to quantitatively measure the attention correctness we propose three new attention metrics. Our extensive experiments on image-text matching and attention correctness evaluation indicate that these constraints improve the cross-modal retrieval performances as well as the attention correctness when integrated into two state-of-the-art attention-based models. Although we observe consistent improvements obtained by applying proposed constraints in terms of attention correctness and image-text matching accuracy, more explorations are still needed to conclude the causality between these two objectives.