More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints

05/20/2021
by   Yuxiao Chen, et al.
21

Attention mechanisms have been widely applied to cross-modal tasks such as image captioning and information retrieval, and have achieved remarkable improvements due to its capability to learn fine-grained relevance across different modalities. However, existing attention models could be sub-optimal and lack preciseness because there is no direct supervision involved during training. In this work, we propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address such limitation. These constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations. Additionally, we introduce three metrics, namely Attention Precision, Recall and F1-Score, to quantitatively evaluate the attention quality. We evaluate the proposed constraints with cross-modal retrieval (image-text matching) task. The experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance in terms of both retrieval accuracy and attention metrics.

READ FULL TEXT

page 1

page 2

page 6

research
11/14/2022

Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero...
research
02/13/2023

CLIP-RR: Improved CLIP Network for Relation-Focused Cross-Modal Information Retrieval

Relation-focused cross-modal information retrieval focuses on retrieving...
research
03/21/2023

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

The CLIP model has been recently proven to be very effective for a varie...
research
02/28/2020

Exploring and Distilling Cross-Modal Information for Image Captioning

Recently, attention-based encoder-decoder models have been used extensiv...
research
05/06/2023

Keyword-Based Diverse Image Retrieval by Semantics-aware Contrastive Learning and Transformer

In addition to relevance, diversity is an important yet less studied per...
research
04/20/2023

Is Cross-modal Information Retrieval Possible without Training?

Encoded representations from a pretrained deep learning model (e.g., BER...
research
05/02/2023

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

In this paper, we present TMR, a simple yet effective approach for text ...

Please sign up or login with your details

Forgot password? Click here to reset