Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

06/01/2023
by Jinbo Xing, et al.

Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advances in text-to-video synthesis have unveiled the potential to achieve this with prompts alone. While text is convenient for conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation by using text as the context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, performs joint-conditional video generation with a Latent Diffusion Model that is pre-trained for still-image synthesis and then promoted to video generation through the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves performance by transferring the rich concepts available in image datasets to video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which effectively mitigates potential quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate its potential for practical usage.
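To make the causal-masking idea concrete, the following is a minimal sketch of a frame-wise causal attention mask: each frame may attend only to itself and earlier frames (optionally within a sliding window), which is the general mechanism that allows a temporal attention module to be rolled forward over more frames than it was trained on. The function name, the windowing option, and the boolean-matrix representation are illustrative assumptions, not the paper's actual implementation.

```python
from typing import List, Optional


def causal_temporal_mask(num_frames: int,
                         window: Optional[int] = None) -> List[List[bool]]:
    """Build a frame-wise causal attention mask.

    mask[i][j] is True when frame i may attend to frame j, i.e. j <= i,
    optionally restricted to the most recent `window` frames. This is a
    sketch of causal temporal masking in general, not the paper's exact
    scheme.
    """
    mask = []
    for i in range(num_frames):
        # Earliest frame that frame i is allowed to see.
        lo = 0 if window is None else max(0, i - window + 1)
        mask.append([lo <= j <= i for j in range(num_frames)])
    return mask
```

In a temporal attention layer, such a mask would be applied to the frame-to-frame attention logits (masked positions set to negative infinity before the softmax), so generation quality stays stable as the frame count grows beyond the training length.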


