Connecting Multi-modal Contrastive Representations

05/22/2023
by   Zehan Wang, et al.

Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data, called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them to a new space and use data from the overlapping modality B to align the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned via the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completion of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. We take the field of audio-visual contrastive learning as an example to demonstrate the effectiveness of C-MCR. We connect pre-trained CLIP and CLAP models via texts to derive audio-visual contrastive representations. Remarkably, without using any paired audio-visual data or further tuning, C-MCR achieves state-of-the-art performance on six datasets across three audio-visual downstream tasks.
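The core idea of connecting two MCRs via an overlapping modality can be sketched with a contrastive objective over the shared modality's embeddings. The following is a minimal NumPy illustration, not the authors' implementation: the projection matrices, dimensions, and the plain symmetric InfoNCE loss are all illustrative stand-ins (the paper's actual method adds semantic enhancement and an intra-MCR term on top of this).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d1, d2 are the embedding sizes of two pre-trained
# MCRs (e.g. CLIP's and CLAP's text towers); d is the new shared space.
d1, d2, d = 512, 768, 256
W1 = rng.normal(scale=0.02, size=(d1, d))  # learnable projection for MCR 1
W2 = rng.normal(scale=0.02, size=(d2, d))  # learnable projection for MCR 2

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def inter_mcr_loss(t1, t2, tau=0.07):
    """Symmetric InfoNCE aligning the projected embeddings of the same
    shared-modality inputs (e.g. the same captions encoded by both MCRs);
    matched pairs sit on the diagonal of the logit matrix."""
    z1 = l2norm(t1 @ W1)
    z2 = l2norm(t2 @ W2)
    logits = z1 @ z2.T / tau                 # (B, B) similarity matrix
    i = np.arange(len(z1))
    loss_12 = -(logits - logsumexp(logits, axis=1))[i, i].mean()
    loss_21 = -(logits - logsumexp(logits, axis=0))[i, i].mean()
    return 0.5 * (loss_12 + loss_21)

# Stand-ins for real text embeddings of the same 8 captions from each MCR.
t_clip = rng.normal(size=(8, d1))
t_clap = rng.normal(size=(8, d2))
loss = inter_mcr_loss(t_clip, t_clap)
```

Because each original MCR already aligns its own modality pair (image-text in CLIP, audio-text in CLAP), driving this loss down on text alone transfers the connection to the non-overlapping audio-visual pair without any paired audio-visual data.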

