Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models

08/26/2023
by Hao Fei, et al.

Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have shown promisingly stronger performance than earlier approaches. While existing state-of-the-art DMs are capable of high-resolution video generation, they can suffer from key limitations (e.g., disordered action occurrence, crude video motion) in modeling intricate temporal dynamics, a crux of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which (step 1) extracts from the input text the key actions with proper temporal ordering, (step 2) transforms the action schedule into a dynamic scene graph (DSG) representation, and (step 3) enriches the scenes in the DSG with sufficient, reasonable details. By leveraging powerful existing LLMs (e.g., ChatGPT) via in-context learning, Dysen achieves (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action-scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets show that our framework consistently outperforms prior art by significant margins, especially in scenarios with complex actions. Project page: https://haofei.vip/Dysen-VDM
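
To make the three-step pipeline concrete, below is a minimal Python sketch of how a Dysen-style dynamic scene manager could be wired together. Everything here is an illustrative assumption rather than the authors' implementation: the Action/SceneGraph data structures, the JSON prompt formats, and the fake_llm stub standing in for a real ChatGPT call are all hypothetical. The sketch stops before the final stage described in the abstract, where the enriched DSG would be encoded into spatio-temporal features for the diffusion backbone.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Action:
    subject: str
    predicate: str
    obj: str
    start: int  # ordinal start step on the event timeline
    end: int    # ordinal end step (inclusive)

@dataclass
class SceneGraph:
    # (subject, predicate, object) triples describing one timeline step
    triples: List[Tuple[str, str, str]] = field(default_factory=list)

def extract_action_schedule(text: str, llm: Callable[[str], str]) -> List[Action]:
    """Step 1: ask the LLM for the key actions with proper time ordering."""
    reply = llm(
        "Extract the key actions in the following text as a JSON list of "
        'objects with keys "subject", "predicate", "object", "start", "end", '
        f"ordered by time: {text}"
    )
    return [Action(r["subject"], r["predicate"], r["object"], r["start"], r["end"])
            for r in json.loads(reply)]

def schedule_to_dsg(schedule: List[Action]) -> List[SceneGraph]:
    """Step 2: expand the action schedule into one scene graph per step."""
    horizon = max(a.end for a in schedule)
    dsg = [SceneGraph() for _ in range(horizon + 1)]
    for a in schedule:
        for t in range(a.start, a.end + 1):
            dsg[t].triples.append((a.subject, a.predicate, a.obj))
    return dsg

def enrich_dsg(dsg: List[SceneGraph], llm: Callable[[str], str]) -> List[SceneGraph]:
    """Step 3: prompt the LLM to add plausible per-step scene details."""
    for step, sg in enumerate(dsg):
        reply = llm(
            f"Scene graph at step {step}: {sg.triples}. Add reasonable detail "
            "triples (attributes, spatial relations) as a JSON list of "
            "[subject, predicate, object] triples."
        )
        sg.triples.extend(tuple(t) for t in json.loads(reply))
    return dsg

# Canned stand-in for a real LLM API call, so the sketch runs end to end.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Extract"):
        return json.dumps([
            {"subject": "dog", "predicate": "runs to", "object": "ball",
             "start": 0, "end": 1},
            {"subject": "dog", "predicate": "picks up", "object": "ball",
             "start": 2, "end": 2},
        ])
    return json.dumps([["grass", "is", "green"]])

schedule = extract_action_schedule("a dog runs to a ball and picks it up", fake_llm)
dsg = enrich_dsg(schedule_to_dsg(schedule), fake_llm)
for step, sg in enumerate(dsg):
    print(step, sg.triples)
```

Keeping one scene graph per timeline step (rather than a single static graph) is what lets the downstream model see how relations change over time, which is the "dynamics awareness" the abstract emphasizes.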


