InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

07/13/2023
by Yi Wang, et al.

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. InternVid contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), demonstrating its efficacy for learning video-language representations at scale. Specifically, we use a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks such as recognition and retrieval, our dataset and model have broad applications: they are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
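To make the multi-scale captioning idea concrete, the sketch below outlines one plausible reading of the pipeline. This is a minimal illustration, not the authors' released code: `caption_frame` (an off-the-shelf image captioner) and `summarize_captions` (an LLM call that fuses per-frame captions) are hypothetical placeholders standing in for the coarse- and fine-scale components described above.

```python
def describe_clip(frames, caption_frame, summarize_captions):
    """Produce clip-level descriptions from sampled video frames.

    Coarse scale: caption a single representative (middle) frame.
    Fine scale: caption several frames, then let an LLM merge them
    into one temporally coherent description.

    caption_frame and summarize_captions are hypothetical callables
    supplied by the user (image captioner and LLM summarizer).
    """
    # Coarse scale: one caption for the whole clip.
    coarse = caption_frame(frames[len(frames) // 2])

    # Fine scale: per-frame captions (roughly 8 evenly spaced frames),
    # merged by an LLM into a single description.
    step = max(1, len(frames) // 8)
    per_frame = [caption_frame(f) for f in frames[::step]]
    fine = summarize_captions(per_frame)

    return coarse, fine
```

For ViCLIP's contrastive objective, a standard CLIP-style symmetric InfoNCE loss over paired video and text embeddings is the natural reading. The sketch below assumes mean-pooled per-frame ViT features and a fixed temperature; both are implementation details not specified in this abstract.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_emb, text_emb: (batch, dim) tensors, where video_emb is
    assumed to be mean-pooled per-frame ViT features (an assumption,
    not stated in the abstract).
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits between every video and every caption.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

In this setup each caption serves as the positive for its own clip and as a negative for every other clip in the batch, which is what drives the transferable video-text alignment reported above.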

