LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

07/11/2022
by Jinbin Bai et al.

Video-text retrieval is a cross-modal representation learning problem whose goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, demonstrating the power of a joint latent space. Nevertheless, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information within each single modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space 𝒮 to a target modality space 𝒯 without requiring a joint latent space, bridging the gap between the visual and textual domains. Furthermore, to keep the translations cycle-consistent, we adopt a cycle loss involving both the forward translation from 𝒮 to the predicted target space 𝒯' and the backward translation from 𝒯' back to 𝒮. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
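The cycle-consistency idea in the abstract can be illustrated with a minimal sketch: translate source-modality features into the predicted target space, translate them back, and penalize the reconstruction error. The linear translators `W_fwd` and `W_bwd` below are hypothetical stand-ins for illustration only; the paper's actual translators are learned networks, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_s, d_t = 4, 4  # dimensions of the source (e.g. video) and target (text) spaces

# Hypothetical linear translators (illustrative stand-ins, not the paper's model):
W_fwd = rng.normal(size=(d_t, d_s)) * 0.1   # forward:  S  -> T'
W_bwd = rng.normal(size=(d_s, d_t)) * 0.1   # backward: T' -> S

def forward(s):
    """Translate source features into the predicted target space T'."""
    return s @ W_fwd.T

def backward(t):
    """Translate predicted-target features back toward the source space S."""
    return t @ W_bwd.T

def cycle_loss(s):
    """Cycle-consistency loss: S -> T' -> S should reconstruct the input."""
    s_rec = backward(forward(s))
    return float(np.mean((s_rec - s) ** 2))

batch = rng.normal(size=(8, d_s))  # a batch of source-modality features
loss = cycle_loss(batch)           # non-negative; zero only for perfect cycles
```

In training, this term would be minimized jointly with a retrieval objective so that the forward translation stays faithful to the source modality rather than collapsing information.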


