Masking Modalities for Cross-modal Video Retrieval

11/01/2021
by   Valentin Gabeur, et al.
2

Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.

READ FULL TEXT

page 1

page 8

research
06/27/2023

Large-scale unsupervised audio pre-training for video-to-speech synthesis

Video-to-speech synthesis is the task of reconstructing the speech signa...
research
11/03/2022

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Self-supervised pre-training recently demonstrates success on large-scal...
research
09/01/2022

Video-Guided Curriculum Learning for Spoken Video Grounding

In this paper, we introduce a new task, spoken video grounding (SVG), wh...
research
05/23/2023

Training Transitive and Commutative Multimodal Transformers with LoReTTa

Collecting a multimodal dataset with two paired modalities A and B or B ...
research
08/18/2020

AssembleNet++: Assembling Modality Representations via Attention Connections

We create a family of powerful video models which are able to: (i) learn...
research
06/17/2023

Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language

Our paper focuses on making use of deep neural network models to accurat...
research
01/27/2023

Pre-training for Speech Translation: CTC Meets Optimal Transport

The gap between speech and text modalities is a major challenge in speec...

Please sign up or login with your details

Forgot password? Click here to reset