Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

08/15/2021
by Jing Li, et al.

Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, resulting in plain, boring motions at inference time. To overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as synchronized audio and motion beats), while the motion-specific code captures diverse motion information that is independent of the audio. However, splitting the latent code into two parts makes the VAE harder to train. To train it better, we design a mapping network that facilitates random sampling, along with other techniques including a relaxed motion loss, a bicycle constraint, and a diversity loss. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, both quantitatively and qualitatively. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
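To make the split-latent idea concrete, below is a minimal PyTorch sketch of a conditional VAE whose latent code is divided into a shared part (predictable from audio) and a motion-specific part (sampled at inference). This is an illustrative sketch under assumed shapes, not the authors' architecture: the class name, the simple MLP encoders, all layer sizes, and the deterministic audio branch are hypothetical, and the mapping network and the extra losses (relaxed motion loss, bicycle constraint, diversity loss) are omitted. See the project page above for the official code.

```python
# Hypothetical sketch of a split-latent conditional VAE; all names and
# dimensions are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class SplitLatentVAE(nn.Module):
    def __init__(self, audio_dim=128, motion_dim=72,
                 shared_dim=32, specific_dim=32):
        super().__init__()
        # Audio branch: audio can only predict the shared code
        # (deterministic here for brevity).
        self.audio_enc = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, shared_dim))
        # Motion branch: motion yields both shared and motion-specific
        # codes, as mean and log-variance for the reparameterization trick.
        self.motion_enc = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * (shared_dim + specific_dim)))
        # Decoder reconstructs motion from the concatenated codes.
        self.dec = nn.Sequential(
            nn.Linear(shared_dim + specific_dim, 256), nn.ReLU(),
            nn.Linear(256, motion_dim))
        self.shared_dim = shared_dim
        self.specific_dim = specific_dim

    @staticmethod
    def reparameterize(mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I).
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, audio, motion):
        # Shared code predicted from audio.
        z_shared_audio = self.audio_enc(audio)
        # Shared + motion-specific codes inferred from motion.
        mu, logvar = self.motion_enc(motion).chunk(2, dim=-1)
        z = self.reparameterize(mu, logvar)
        z_shared_motion, z_specific = z.split(
            [self.shared_dim, self.specific_dim], dim=-1)
        # Training decodes from the motion-derived codes; at inference the
        # shared code comes from audio and z_specific is sampled (e.g. via
        # a mapping network), which is what makes the outputs diverse.
        recon = self.dec(torch.cat([z_shared_motion, z_specific], dim=-1))
        return recon, z_shared_audio, z_shared_motion, mu, logvar


# Toy usage with random features (batch of 8).
audio = torch.randn(8, 128)   # per-frame audio features
motion = torch.randn(8, 72)   # per-frame pose vectors
recon, z_a, z_m, mu, logvar = SplitLatentVAE()(audio, motion)
```

In a full training loop one would add a reconstruction loss on `recon`, a KL term on `(mu, logvar)`, and a term tying `z_a` to `z_m` so that the audio branch learns the shared code; the paper's relaxed motion loss, bicycle constraint, and diversity loss refine this basic recipe.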
