VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

by Hu Xu, et al.

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
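
The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_mask`, the probabilities `whole_modality_prob` and `token_prob`, and their default values are all assumptions chosen for illustration. The sketch only shows how a training example might sometimes have one whole modality masked (so it must be predicted from the other) and otherwise have a random subset of tokens masked in both.

```python
import random


def sample_mask(num_video, num_text, rng, whole_modality_prob=0.5, token_prob=0.15):
    """Return (video_mask, text_mask) as lists of booleans; True means masked.

    Hypothetical helper: the probabilities are illustrative assumptions,
    not values taken from the paper.
    """
    if rng.random() < whole_modality_prob:
        # Mask one entire modality so it must be predicted from the other one.
        if rng.random() < 0.5:
            return [True] * num_video, [False] * num_text
        return [False] * num_video, [True] * num_text
    # Otherwise mask a random subset of tokens in both modalities.
    video_mask = [rng.random() < token_prob for _ in range(num_video)]
    text_mask = [rng.random() < token_prob for _ in range(num_text)]
    return video_mask, text_mask


if __name__ == "__main__":
    rng = random.Random(0)
    video_mask, text_mask = sample_mask(num_video=8, num_text=6, rng=rng)
    print("video mask:", video_mask)
    print("text mask: ", text_mask)
```

In an actual model, the masked positions would be replaced by a mask embedding and the training loss computed only at those positions; how text masks are matched to video embeddings is described in the paper itself and is not reproduced here.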


OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross...

Masking Modalities for Cross-modal Video Retrieval

Pre-training on large scale unlabelled datasets has shown impressive per...

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Building a universal video-language model for solving various video unde...

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

We propose Unicoder-VL, a universal encoder that aims to learn joint rep...

Contextual Grounding of Natural Language Entities in Images

In this paper, we introduce a contextual grounding approach that capture...

Graph-Text Multi-Modal Pre-training for Medical Representation Learning

As the volume of Electronic Health Records (EHR) sharply grows, there ha...

MultiMAE: Multi-modal Multi-task Masked Autoencoders

We propose a pre-training strategy called Multi-modal Multi-task Masked ...