Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

11/19/2021
by Hongwei Xue, et al.

We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 38.5% in R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 53.6% on the high-resolution dataset LSMDC. The learned VL embedding is also effective at generating visually pleasing and semantically relevant results in text-to-visual manipulation and super-resolution tasks.
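
The abstract describes a two-stage design: a hybrid Transformer that turns high-resolution frames into spatiotemporal features, followed by a multimodal Transformer that fuses those features with text. Below is a minimal, hypothetical PyTorch sketch of that shape. The module names (HybridVideoEncoder, CrossModalEncoder), the tiny convolutional stem, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a hybrid video encoder + multimodal fusion
# Transformer, in the spirit of the HD-VILA description above.
import torch
import torch.nn as nn

class HybridVideoEncoder(nn.Module):
    """Per-frame spatial features, then a temporal Transformer over frames."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in spatial backbone (assumption): a real system would use a
        # much stronger image encoder on the high-resolution frames.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.spatial(frames.flatten(0, 1)).view(b, t, -1)
        return self.temporal(x)                # (B, T, feat_dim)

class CrossModalEncoder(nn.Module):
    """Joint Transformer over concatenated video and text tokens."""
    def __init__(self, feat_dim=512, vocab=30522, n_heads=8, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.joint = nn.TransformerEncoder(layer, n_layers)

    def forward(self, video_feats, text_ids):
        text = self.text_embed(text_ids)              # (B, L, feat_dim)
        tokens = torch.cat([video_feats, text], dim=1)
        return self.joint(tokens)                     # (B, T + L, feat_dim)

# Usage: 8 frames of 224x224 video paired with a 16-token caption.
video = torch.randn(2, 8, 3, 224, 224)
text_ids = torch.randint(0, 30522, (2, 16))
fused = CrossModalEncoder()(HybridVideoEncoder()(video), text_ids)
print(fused.shape)  # torch.Size([2, 24, 512])
```

The fused token sequence could then feed pre-training objectives (e.g., video-text matching for retrieval) or the paper's text-to-visual generation heads; those losses are omitted here.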

Related research

02/15/2020 · UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
We propose UniViLM: a Unified Video and Language pre-training Model for ...

01/13/2022 · BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions
Pre-training a model to learn transferable video-text representation for...

06/07/2023 · Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
To promote the development of Vision-Language Pre-training (VLP) and mul...

11/24/2021 · NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
This paper presents a unified multimodal pre-trained model called NÜWA t...

10/13/2021 · CLIP4Caption: CLIP for Video Caption
Video captioning is a challenging task since it requires generating sent...

04/26/2022 · MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopt the "du...

06/26/2022 · RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Seas of videos are uploaded daily with the popularity of social channels...
