Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

11/21/2022
by Peng Jin, et al.

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarity of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that features can be concisely represented as linear combinations of these bases. This feature decomposition reduces the rank of the latent space, increasing its representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets demonstrate that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can also boost the performance of existing approaches, either as a jointly trained layer or as an out-of-the-box inference module requiring no extra training, making it easy to incorporate into existing methods.
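For intuition, here is a minimal PyTorch sketch of the kind of EM-based feature decomposition the abstract describes. Everything in it is an assumption rather than the authors' implementation: the function name emcl_decompose, the number of bases, the iteration count, the temperature, the random initialization, and the softmax E-step are illustrative choices in the spirit of EM-style attention, not EMCL's exact algorithm.

```python
import torch
import torch.nn.functional as F

def emcl_decompose(features, num_bases=32, num_iters=5, temperature=0.05):
    """Low-rank re-estimation of features via an EM-style loop (illustrative).

    features: (N, D) batch of video and text embeddings stacked together.
    Returns an (N, D) reconstruction in which every feature is a linear
    combination of `num_bases` shared bases, i.e. a rank-at-most-num_bases
    approximation of the original feature matrix.
    """
    N, D = features.shape
    # Random, L2-normalized initialization of the bases (an assumption).
    bases = F.normalize(torch.randn(num_bases, D, device=features.device), dim=-1)

    for _ in range(num_iters):
        # E-step: soft-assign each feature to the bases.
        responsibilities = (features @ bases.t() / temperature).softmax(dim=-1)  # (N, K)
        # M-step: re-estimate each basis as the responsibility-weighted
        # average of the features, then re-normalize.
        weights = responsibilities / (responsibilities.sum(dim=0, keepdim=True) + 1e-6)
        bases = F.normalize(weights.t() @ features, dim=-1)  # (K, D)

    # Final assignment against the converged bases, then reconstruct each
    # feature as a linear combination of those bases.
    responsibilities = (features @ bases.t() / temperature).softmax(dim=-1)
    return responsibilities @ bases
```

In a jointly trained setup, the reconstructed features would replace the raw embeddings inside the usual contrastive (e.g., InfoNCE) loss; the abstract's "out-of-the-box inference module" reading would instead apply such a decomposition to the embeddings of an already-trained model at retrieval time.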


Related research

07/11/2022 · LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Video-text retrieval is a class of cross-modal representation learning p...

05/05/2021 · MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering
We present Mixture of Contrastive Experts (MiCE), a unified probabilisti...

09/16/2023 · Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Cross-modal retrieval (CMR) has been extensively applied in various doma...

07/31/2019 · Expectation-Maximization Attention Networks for Semantic Segmentation
Self-attention mechanism has been widely used for various tasks. It is d...

05/11/2020 · Prototypical Contrastive Learning of Unsupervised Representations
This paper presents Prototypical Contrastive Learning (PCL), an unsuperv...

10/06/2020 · Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations – noise co...

06/06/2022 · OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression
This paper presents a language-powered paradigm for ordinal regression. ...
