Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

03/23/2023
by Levon Khachatryan, et al.

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce the new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) that leverages the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications are (i) enriching the latent codes of the generated frames with motion dynamics, keeping the global scene and the background temporally consistent; and (ii) reprogramming frame-level self-attention as a new cross-frame attention of each frame on the first frame, preserving the context, appearance, and identity of the foreground object. Experiments show that this leads to low-overhead yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis: it also applies to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably to, and sometimes better than, recent approaches, despite not being trained on additional video data. Our code will be open-sourced at https://github.com/Picsart-AI-Research/Text2Video-Zero.
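For intuition, here is a minimal PyTorch sketch of the two modifications described above. It is not the authors' released implementation: the helper names (warp_latents, cross_frame_attention), the tensor layouts, and the fixed per-frame translation delta are illustrative assumptions standing in for the paper's actual motion field and attention wiring.

```python
import torch


def warp_latents(latents, delta=(1.0, 1.0)):
    """(i) Motion dynamics: shift each frame's initial latent code by a
    cumulative global translation so the scene and background move
    coherently across frames.

    latents: (num_frames, C, H, W) noisy latents, all derived from the
             first frame's latent code.
    delta:   assumed per-frame translation (dx, dy) in latent pixels.
    """
    frames = []
    for k, z in enumerate(latents):
        dx, dy = int(round(k * delta[0])), int(round(k * delta[1]))
        # torch.roll applies a simple wrap-around translation of the grid.
        frames.append(torch.roll(z, shifts=(dy, dx), dims=(-2, -1)))
    return torch.stack(frames)


def cross_frame_attention(q, k, v, num_frames):
    """(ii) Cross-frame attention: every frame's queries attend to the
    keys and values of the FIRST frame instead of its own (as plain
    self-attention would), anchoring the foreground object's appearance.

    q, k, v: (num_frames * heads, tokens, dim), the layout used by
             Stable Diffusion's attention layers.
    """
    heads = q.shape[0] // num_frames
    k = k.reshape(num_frames, heads, *k.shape[1:])
    v = v.reshape(num_frames, heads, *v.shape[1:])
    # Broadcast frame 0's keys/values to every frame.
    k = k[:1].expand(num_frames, -1, -1, -1).reshape(-1, *k.shape[2:])
    v = v[:1].expand(num_frames, -1, -1, -1).reshape(-1, *v.shape[2:])
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Note that only the keys and values are swapped; the pretrained text-to-image weights are reused unchanged, which is what makes the approach zero-shot.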

Related research

03/30/2023 · Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Large-scale text-to-image diffusion models achieve unprecedented success...

05/10/2023 · Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
The proliferation of video content demands efficient and flexible neural...

05/23/2023 · Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation
In the paradigm of AI-generated content (AIGC), there has been increasin...

08/07/2023 · DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
In recent years, diffusion models have emerged as the most powerful appr...

05/22/2023 · ControlVideo: Training-free Controllable Text-to-Video Generation
Text-driven diffusion models have unlocked unprecedented abilities in im...

08/12/2023 · ModelScope Text-to-Video Technical Report
This paper introduces ModelScopeT2V, a text-to-video synthesis model tha...

03/22/2023 · Pix2Video: Video Editing using Image Diffusion
Image diffusion models, trained on massive image collections, have emerg...
