Text2Performer: Text-Driven Human Video Generation

04/17/2023
by Yuming Jiang, et al.

Text-driven content creation has evolved into a transformative technique that is revolutionizing creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation must maintain the appearance of the synthesized human while it performs complex motions. In this work, we present Text2Performer, which generates vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in a discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy masks the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute the Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512×256 resolution) with diverse appearances and flexible motions.
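To make the abstract's pipeline concrete, the PyTorch sketch below illustrates the general shape of the three named ideas: splitting per-frame latents into a video-level appearance code plus per-frame pose embeddings, masking the pose embeddings spatio-temporally, and denoising the masked continuous embeddings conditioned on appearance. All module names (DecomposedEncoder, PoseDenoiser, motion_aware_mask), layer sizes, and the simple time-pooling and random-masking schemes are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal, illustrative sketch of the abstract's components. Everything here
# is an assumption for exposition; it is NOT the Text2Performer implementation.
import torch
import torch.nn as nn


class DecomposedEncoder(nn.Module):
    """Encode a clip into one appearance code plus per-frame pose codes."""

    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_appearance = nn.Linear(dim, dim)
        self.to_pose = nn.Linear(dim, dim)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        # Appearance is shared across the clip: pooling over time keeps it
        # from carrying per-frame motion information.
        appearance = self.to_appearance(feats.mean(dim=1))   # (B, D)
        pose = self.to_pose(feats)                           # (B, T, D)
        return appearance, pose


def motion_aware_mask(pose, mask_ratio=0.5):
    """Randomly mask whole frames' pose embeddings -- a crude stand-in for
    the paper's motion-aware spatio-temporal masking."""
    b, t, _ = pose.shape
    keep = torch.rand(b, t, 1) > mask_ratio      # True = keep this frame
    return pose * keep, keep


class PoseDenoiser(nn.Module):
    """Toy stand-in for the continuous VQ-diffuser: predicts clean pose
    embeddings from masked/noisy ones, conditioned on the appearance code."""

    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, noisy_pose, appearance):   # (B, T, D), (B, D)
        cond = appearance.unsqueeze(1).expand_as(noisy_pose)
        return self.net(torch.cat([noisy_pose, cond], dim=-1))


if __name__ == "__main__":
    frames = torch.randn(2, 8, 3, 64, 64)        # tiny fake clip
    enc = DecomposedEncoder()
    appearance, pose = enc(frames)
    masked_pose, keep = motion_aware_mask(pose)
    recon = PoseDenoiser()(masked_pose, appearance)
    print(appearance.shape, pose.shape, recon.shape)
```

The design point the sketch tries to capture is the division of labor: the appearance code is computed once per video so identity stays fixed, while the denoiser only has to model how the continuous pose embeddings evolve over time.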


research · 04/23/2023
LaMD: Latent Motion Diffusion for Video Generation
Generating coherent and natural movement is the key challenge in video g...

research · 08/16/2022
StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3
Realistic generative face video synthesis has long been a pursuit in bot...

research · 10/21/2019
DwNet: Dense warp-based network for pose-guided human video generation
Generation of realistic high-resolution videos of human subjects is a ch...

research · 07/04/2022
TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts
Inspired by the strong ties between vision and language, the two intimat...

research · 10/06/2022
Text-driven Video Prediction
Current video generation models usually convert signals indicating appea...

research · 06/30/2023
DisCo: Disentangled Control for Referring Human Dance Generation in Real World
Generative AI has made significant strides in computer vision, particula...

research · 07/26/2018
Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
We consider the problem of image-to-video translation, where an input im...
