Listen, denoise, action! Audio-driven motion synthesis with diffusion models

11/17/2022
by   Simon Alexanderson, et al.

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, for example co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved accuracy. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Gesture-generation experiments on the Trinity Speech-Gesture and ZeroEGGS datasets confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise dance motion and path-driven locomotion using the same model architecture. Finally, we extend the guidance procedure to perform style interpolation in a manner that is appealing for synthesis tasks and has connections to product-of-experts models, a contribution we believe is of independent interest. Video examples are available at https://www.speech.kth.se/research/listen-denoise-action/
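As a rough illustration of the guidance idea mentioned above, the following sketch shows the standard classifier-free guidance combination, in which a guidance weight scales the difference between a style-conditional and an unconditional noise prediction. It is only a minimal sketch: the function names `guided_noise` and `combined_guidance`, the array shapes, and the weight name `gamma` are hypothetical and not taken from the paper. The multi-style combination is one plausible reading of guidance-based style interpolation (it corresponds to multiplying an unconditional density by weighted ratios of conditional to unconditional densities, in the spirit of a product of experts); the abstract does not confirm that this is the authors' exact formulation.

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, gamma):
    """Standard classifier-free guidance on noise predictions.

    gamma = 0 ignores the style, gamma = 1 reproduces the conditional
    prediction, and gamma > 1 exaggerates the style (extrapolation).
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + gamma * (eps_cond - eps_uncond)


def combined_guidance(eps_uncond, eps_styles, gammas):
    """Assumed multi-style variant: add one weighted guidance term per style.

    With weights such as (0.5, 0.5) this blends two styles; this is an
    illustrative assumption, not the paper's confirmed procedure.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    out = eps_uncond.copy()
    for eps_s, g in zip(eps_styles, gammas):
        out += g * (np.asarray(eps_s, dtype=float) - eps_uncond)
    return out


# Toy usage with random stand-ins for a denoiser's outputs
# (hypothetical shape: 4 frames, 3 pose features).
rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 3))   # unconditional prediction
eps_a = rng.normal(size=(4, 3))   # prediction conditioned on style A
eps_b = rng.normal(size=(4, 3))   # prediction conditioned on style B

stronger_a = guided_noise(eps_u, eps_a, gamma=1.5)
blend_ab = combined_guidance(eps_u, [eps_a, eps_b], [0.5, 0.5])
```

In a real sampler, the combined prediction would replace the denoiser output at every reverse-diffusion step; the random arrays here merely stand in for network outputs.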

research
09/15/2022

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

We present ZeroEGGS, a neural network framework for speech-driven gestur...
research
03/16/2023

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Animating virtual avatars to make co-speech gestures facilitates various...
research
12/05/2022

Audio-Driven Co-Speech Gesture Video Generation

Co-speech gesture is crucial for human-machine interaction and digital e...
research
06/15/2023

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

With read-aloud speech synthesis achieving high naturalness scores, ther...
research
12/16/2022

Unifying Human Motion Synthesis and Style Transfer with Denoising Diffusion Probabilistic Models

Generating realistic motions for digital humans is a core but challengin...
research
05/16/2019

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

Data-driven modelling and synthesis of motion data is an active research...
research
12/07/2022

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

In this paper, we introduce a simple and novel framework for one-shot au...
