VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

05/20/2021
by   Hu Xu, et al.

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masked text tokens to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all of the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
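As an illustration of what such a cross-modal masking objective could look like, here is a minimal PyTorch sketch in which masked text positions are regressed onto their nearest video embeddings. The function name, tensor shapes, and the MSE regression target are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a text-to-video masking objective: masked text positions
# are trained to predict their closest video embedding. Names are illustrative.
import torch
import torch.nn.functional as F


def masked_text_to_video_loss(text_emb, video_emb, mask_ratio=0.15):
    """text_emb: (B, T, D) text token embeddings; video_emb: (B, V, D) video clip embeddings."""
    B, T, D = text_emb.shape

    # Randomly select text positions to mask (span-style selection is omitted here).
    mask = torch.rand(B, T, device=text_emb.device) < mask_ratio

    # For each text token, find the closest video embedding by cosine similarity.
    text_n = F.normalize(text_emb, dim=-1)
    video_n = F.normalize(video_emb, dim=-1)
    sim = torch.einsum("btd,bvd->btv", text_n, video_n)   # (B, T, V)
    nearest = sim.argmax(dim=-1)                          # (B, T)
    targets = torch.gather(
        video_emb, 1, nearest.unsqueeze(-1).expand(-1, -1, D)
    )                                                     # (B, T, D)

    # Regress masked text positions onto their nearest (detached) video targets.
    if mask.any():
        return F.mse_loss(text_emb[mask], targets[mask].detach())
    return text_emb.new_zeros(())


# Toy usage with random features standing in for encoder outputs.
text = torch.randn(2, 12, 256, requires_grad=True)
video = torch.randn(2, 8, 256)
loss = masked_text_to_video_loss(text, video, mask_ratio=0.5)
loss.backward()
```

In a full model the masked text tokens would be re-encoded jointly with the video before prediction; the sketch only shows how the cross-modal targets could be selected and supervised.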
