COBRA: Contrastive Bi-Modal Representation Algorithm

05/07/2020
by Vishaal Udandarao, et al.

A wide range of applications involve multi-modal data, such as cross-modal retrieval, visual question-answering, and image captioning. Such applications depend primarily on aligned distributions of their constituent modalities. Existing approaches learn latent embeddings for the modalities jointly by representing them in a common manifold. However, these joint embedding spaces fail to sufficiently reduce the modality gap, which hurts performance in downstream tasks. We hypothesize that these embeddings retain the intra-class relationships but are unable to preserve the inter-class dynamics. In this paper, we present COBRA, a novel framework that jointly trains two modalities (image and text), inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms, which preserve both inter- and intra-class relationships. We empirically show that this framework significantly reduces the modality gap and generates a robust, task-agnostic joint embedding space. We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
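To make the contrastive idea concrete, the sketch below computes a generic InfoNCE-style loss between paired image and text embeddings. It is not COBRA's actual objective, which the abstract does not specify; the function names (`l2_normalize`, `dot`, `info_nce`), the `temperature` parameter, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(image_embs, text_embs, temperature=0.1):
    """Mean InfoNCE loss in the image-to-text direction.

    Row i of image_embs and row i of text_embs are treated as a matched
    (positive) pair; every other text row serves as a negative for image i.
    This pairing convention is an assumption of the sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, txt) / temperature for txt in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matched text.
        total += log_denom - logits[i]
    return total / len(images)


# Toy embeddings: two classes on orthogonal axes (hypothetical data).
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned, aligned)    # matched pairs agree
loss_shuffled = info_nce(aligned, shuffled)  # matched pairs disagree
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is near zero when each image embedding coincides with its paired text embedding and large when the pairs are swapped, which is the pressure a contrastive objective uses to pull matched samples together and push mismatched ones apart.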


