MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

03/20/2023
by   Ye Wang, et al.

Multifold observations are common across data modalities: a 3D shape can be represented by multi-view images, and an image can be described by several different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suited to such multifold data, as they consider only one positive pair and treat all other pairs as negatives when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities, yielding more comprehensive representations. At the core of MXM-CLR is a novel multifold-aware hybrid loss which considers multiple positive observations when computing the hard and soft relationships for cross-modal data pairs. We conduct quantitative and qualitative comparisons with SOTA baselines on cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also perform extensive evaluations of the adaptability and generalizability of MXM-CLR, as well as ablation studies on the loss design and the effect of batch size. The results show the superiority of MXM-CLR in learning better representations for multifold data. The code is available at https://github.com/JLU-ICL/MXM-CLR.
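To make the key idea concrete: in a CLIP-style loss each anchor has exactly one positive, whereas with multifold data each anchor (e.g., one view of a shape) can have several positives (e.g., all captions of that shape). The sketch below is a minimal multi-positive cross-modal loss written from the abstract's description alone; the function name, its arguments, and the simple averaging over positives are assumptions for illustration, and MXM-CLR's actual hybrid loss additionally weighs hard and soft relationships as described in the paper.

import torch
import torch.nn.functional as F

def multifold_contrastive_loss(feats_a, feats_b, ids_a, ids_b, temperature=0.07):
    """Multi-positive contrastive loss sketch (NOT the authors' exact loss).

    feats_a: (N, D) embeddings from modality A (e.g., shape views)
    feats_b: (M, D) embeddings from modality B (e.g., captions)
    ids_a, ids_b: (N,), (M,) instance labels; all cross-modal pairs of
    observations sharing an instance label are treated as positives.
    """
    feats_a = F.normalize(feats_a, dim=-1)
    feats_b = F.normalize(feats_b, dim=-1)
    logits = feats_a @ feats_b.t() / temperature            # (N, M) similarities
    pos_mask = (ids_a[:, None] == ids_b[None, :]).float()   # multiple positives per row
    log_prob = F.log_softmax(logits, dim=1)                 # softmax over modality-B items
    # Average log-likelihood over all positives of each anchor,
    # instead of a single positive as in CLIP-style InfoNCE.
    loss_per_anchor = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss_per_anchor.mean()

# Toy usage: 2 instances, each with 2 views and 2 captions.
ids = torch.tensor([0, 0, 1, 1])
loss = multifold_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128), ids, ids)

Averaging over positives is only one way to aggregate multiple positive observations; the paper's hybrid loss instead distinguishes hard and soft relationships among the positive pairs.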


Related research:

05/05/2022 · Cross-modal Contrastive Learning for Speech Translation
How can we learn unified representations for spoken utterances and their...

11/21/2022 · Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Multi-modal reasoning in visual question answering (VQA) has witnessed r...

08/26/2020 · Delving into Inter-Image Invariance for Unsupervised Visual Representations
Contrastive learning has recently shown immense potential in unsupervise...

04/28/2022 · Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
We present an approach to learn voice-face representations from the talk...

05/09/2023 · Exploiting Pseudo Image Captions for Multimodal Summarization
Cross-modal contrastive learning in vision language pretraining (VLP) fa...

08/21/2023 · Deep Metric Loss for Multimodal Learning
Multimodal learning often outperforms its unimodal counterparts by explo...

10/11/2022 · Improving Dense Contrastive Learning with Dense Negative Pairs
Many contrastive representation learning methods learn a single global r...
