Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

12/02/2022
by Fangxun Shu, et al.

We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, which improves pre-training efficiency. Compared with conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder, as a form of sparse spatial sampling. For consistency, we apply the same mask sampling technique to text inputs. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is twofold: video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: it reduces FLOPs (by 60%), accelerates pre-training (by 3x), and improves performance. MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input modalities: with minimal modifications, we achieve competitive results on image-text retrieval tasks.
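To make the sparse spatial sampling idea concrete, the following is a minimal sketch of random patch masking in which only the visible patches would be passed to an encoder. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_visible_patches` and `mask_video`, the 14x14 patch grid, the mask ratio of 0.75, and the choice to mask each frame independently are all hypothetical (the abstract only says "a high ratio").

```python
import random


def sample_visible_patches(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches that stay visible after random masking."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    # Keep at least one patch so the encoder never receives an empty frame.
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))


def mask_video(frames, mask_ratio, seed=0):
    """Keep only the visible patch tokens of every frame.

    frames: list of frames, each a list of patch tokens (any Python objects).
    Assumption: each frame is masked independently; the source does not say
    whether masks are shared across frames.
    """
    rng = random.Random(seed)
    visible = []
    for patches in frames:
        keep = sample_visible_patches(len(patches), mask_ratio, rng)
        visible.append([patches[i] for i in keep])
    return visible


# Toy input: 4 frames with 196 patch tokens each (a hypothetical 14x14 grid).
# Tokens are just (frame, patch) ids standing in for patch embeddings.
frames = [[(t, p) for p in range(196)] for t in range(4)]
visible = mask_video(frames, mask_ratio=0.75)  # 0.75 is an assumed value

tokens_before = sum(len(f) for f in frames)
tokens_after = sum(len(f) for f in visible)
print(tokens_before, tokens_after)
```

In this toy setting the encoder would see a quarter of the original tokens. Because self-attention cost grows faster than linearly with sequence length, dropping masked tokens before the encoder is where this kind of efficiency gain would come from; the specific savings reported in the abstract depend on the authors' actual configuration, which this sketch does not reproduce.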

research
11/21/2022

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Video-language pre-training is crucial for learning powerful multi-modal...
research
08/18/2023

Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning

With the success of self-supervised learning, multimodal foundation mode...
research
06/05/2023

End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

Pronunciation assessment is a major challenge in the computer-aided pron...
research
06/09/2023

Exploring Effective Mask Sampling Modeling for Neural Image Compression

Image compression aims to reduce the information redundancy in images. M...
research
04/18/2021

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Video-text retrieval plays an essential role in multi-modal research and...
research
03/28/2023

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Video Foundation Models (VFMs) have received limited exploration due to ...
research
10/26/2022

IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text

We present IMU2CLIP, a novel pre-training approach to align Inertial Mea...
