Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

05/23/2023
by Weifeng Chen, et al.

This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to enable the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates videos of superior quality and consistency with fine-grained control. Extensive experiments demonstrate its success in various video generation tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: https://controlavideo.github.io/
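To make the residual-based noise initialization idea concrete, the sketch below illustrates one plausible reading of it: the per-frame starting noise mixes a base noise sample shared across frames, an independent per-frame component, and a scaled residual between consecutive frames of the input video, so the initial latents already carry a rough motion prior. This is a minimal illustration assuming a (T, C, H, W) frame tensor and a hypothetical mixing weight `alpha`; it is not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of a residual-based noise
# initialization that injects a motion prior from an input video.
import torch

def residual_noise_init(frames: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """frames: (T, C, H, W) tensor of input-video frames (or latents).
    alpha is an illustrative mixing weight, not a value from the paper."""
    T, C, H, W = frames.shape
    base = torch.randn(1, C, H, W).expand(T, -1, -1, -1)   # noise shared by all frames
    per_frame = torch.randn_like(frames)                    # independent per-frame noise
    residual = torch.zeros_like(frames)
    residual[1:] = frames[1:] - frames[:-1]                 # frame-to-frame motion residual
    noise = base + (1 - alpha) * per_frame + alpha * residual
    # roughly renormalize so the sampler still sees ~standard-normal input
    return (noise - noise.mean()) / (noise.std() + 1e-6)

# Example usage with dummy latent frames:
frames = torch.rand(8, 4, 64, 64)
init_noise = residual_noise_init(frames, alpha=0.3)
```

The design intent, under these assumptions, is that consecutive frames start from correlated noise whose differences follow the input video's motion, which encourages temporally coherent generations.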


