SSGVS: Semantic Scene Graph-to-Video Synthesis

11/11/2022
by   Yuren Cong, et al.

As a natural extension of the image synthesis task, video synthesis has attracted considerable interest recently. Many image synthesis works use class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with multiple contrastive multi-modal losses. Building on the pre-trained VSG encoder, a VQ-VAE, and an auto-regressive Transformer, we propose a semantic scene graph-to-video synthesis framework (SSGVS) that synthesizes a video given an initial scene image and a variable number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the benefit of video scene graphs for video synthesis. The source code will be released.
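The abstract states that the VSG encoder is pre-trained with contrastive multi-modal losses. The paper does not spell out the loss here, but a standard choice for aligning two modalities (graph embeddings and frame embeddings) is a symmetric InfoNCE objective. The sketch below is a generic illustration under that assumption, not the paper's exact formulation; the function name, temperature value, and embedding shapes are hypothetical.

```python
import numpy as np

def info_nce_loss(graph_emb, frame_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between scene-graph and frame
    embeddings (generic sketch; SSGVS's exact losses may differ).
    Rows at the same index are positive pairs; all others are negatives."""
    # L2-normalize both sets of embeddings
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    logits = g @ f.T / temperature  # (N, N) cosine-similarity matrix
    labels = np.arange(len(g))      # matched pairs lie on the diagonal

    def cross_entropy(lg):
        # numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average of graph-to-frame and frame-to-graph directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually distinct embeddings the loss is near zero, while mismatched pairings drive it up; during pre-training this pulls each graph representation toward its corresponding frame and away from the others.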

