GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

04/30/2021
by Chenfei Wu, et al.

Generating videos from text is challenging because training is computationally expensive and evaluation must cope with an unbounded set of valid outputs. Existing work typically experiments on simple or small datasets, which limits generalization. In this work, we propose GODIVA, an open-domain text-to-video pretrained model that generates videos from text auto-regressively using a three-dimensional sparse attention mechanism. We pretrain our model on HowTo100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA can not only be fine-tuned on downstream video generation tasks but also shows good zero-shot capability on unseen texts. We also propose a new metric, Relative Matching (RM), to automatically evaluate video generation quality. Several challenges are listed and discussed as future work.
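
The abstract names a three-dimensional sparse attention mechanism but does not spell out the pattern. One common way to sparsify attention over a video token grid is axial attention, where each token attends only along the time, row, and column axes. The sketch below illustrates that idea under this assumption; it is not the paper's implementation, and the function name `axial_causal_mask` and the grid sizes are hypothetical.

```python
from itertools import product


def axial_causal_mask(frames, height, width):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are indexed in raster order (frame, then row, then column).
    Assumption (not taken from the paper): token i sees an earlier-or-equal
    token j only if the two positions differ along at most one axis.
    """
    coords = list(product(range(frames), range(height), range(width)))
    n = len(coords)
    mask = [[False] * n for _ in range(n)]
    for i, (t, r, c) in enumerate(coords):
        # Causal constraint: only positions j <= i are candidates.
        for j, (t2, r2, c2) in enumerate(coords[: i + 1]):
            shared = (t == t2) + (r == r2) + (c == c2)
            # Sharing two coordinates means j lies on an axis through i;
            # sharing all three means j is i itself.
            mask[i][j] = shared >= 2
    return mask


mask = axial_causal_mask(frames=2, height=2, width=2)
dense = sum(i + 1 for i in range(len(mask)))  # entries in full causal attention
sparse = sum(sum(row) for row in mask)        # entries kept by the axial mask
```

For a grid of T frames with H x W tokens each, every token in this sketch keeps at most T + H + W - 2 attention entries instead of up to T * H * W, which is the kind of saving that makes auto-regressive generation over long token sequences more tractable.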

research
07/13/2023

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

This paper introduces InternVid, a large-scale video-centric multimodal ...
research
06/02/2023

Probabilistic Adaptation of Text-to-Video Models

Large text-to-video models trained on internet-scale data have demonstra...
research
10/05/2022

Phenaki: Variable Length Video Generation From Open Domain Textual Description

We present Phenaki, a model capable of realistic video synthesis, given ...
research
12/31/2022

Translating Text Synopses to Video Storyboards

A storyboard is a roadmap for video creation which consists of shot-by-s...
research
04/25/2023

TCR: Short Video Title Generation and Cover Selection with Attention Refinement

With the widespread popularity of user-generated short videos, it become...
research
01/13/2022

BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions

Pre-training a model to learn transferable video-text representation for...
research
08/16/2023

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Controllable video generation has gained significant attention in recent...
