Audio-Driven Co-Speech Gesture Video Generation

12/05/2022
by   Xian Liu, et al.
Co-speech gestures are crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study the challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate a speaker's image sequence driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from the implicit motion representation into codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. A demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
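At the heart of the VQ-Motion Extractor is vector quantization: a continuous motion feature is snapped to its nearest entry in a learned codebook, so recurring gesture patterns are summarized as discrete codes. The sketch below illustrates only this generic nearest-codebook lookup; the codebook values, dimensions, and function names are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch of the vector-quantization step that a VQ-based
# motion extractor relies on: a continuous motion feature is replaced by
# its nearest codebook entry (L2 distance). All values here are toy data.

def quantize(feature, codebook):
    """Return (index, code) of the codebook entry closest to `feature`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return idx, codebook[idx]

# Toy codebook of three "gesture pattern" codes in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
idx, code = quantize([0.9, 1.2], codebook)
print(idx, code)  # prints: 1 [1.0, 1.0]
```

Downstream, a sequence of such discrete indices is what an autoregressive model like the paper's Co-Speech GPT can predict from audio, with a refinement stage restoring the fine-grained dynamics that quantization discards.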

research
01/14/2021

Generating coherent spontaneous speech and gesture from text

Embodied human communication encompasses both verbal (speech) and non-ve...
research
03/24/2022

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

Generating speech-consistent body and gesture movements is a long-standi...
research
07/23/2022

Audio-driven Neural Gesture Reenactment with Video Motion Graphs

Human speech is often accompanied by body gestures including arm and han...
research
08/18/2021

Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates

Co-speech gesture generation is to synthesize a gesture sequence that no...
research
03/08/2019

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

This paper presents a novel framework for automatic speech-driven gestur...
research
11/17/2022

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Diffusion models have experienced a surge of interest as highly expressi...
research
08/29/2023

C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model

Co-speech gesture generation is crucial for automatic digital avatar ani...