Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

09/11/2023
by Anna Deichler, et al.

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of capturing the semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries, indicating that our system is a promising approach for generating human-like, semantically meaningful co-speech gestures in embodied agents.
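The abstract describes a CSMP module that learns a joint speech–gesture embedding through contrastive pretraining, which is then used to condition the diffusion model. As an illustration of that class of objective, here is a minimal NumPy sketch of a CLIP-style symmetric InfoNCE loss over paired speech and motion embeddings; the function names, batch layout, and temperature value are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a CLIP-style symmetric InfoNCE loss for paired
# speech/motion embeddings. Names and the temperature value are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce_loss(speech_emb, motion_emb, temperature=0.07):
    """Symmetric contrastive loss; matched pairs share a row index."""
    # L2-normalize each embedding (batch, dim) so logits are cosine similarities.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = s @ m.T / temperature  # (batch, batch) pairwise similarities
    n = logits.shape[0]
    # Cross-entropy in both directions; matched pairs lie on the diagonal.
    log_p_sm = logits - _logsumexp(logits, axis=1)  # speech -> motion
    log_p_ms = logits - _logsumexp(logits, axis=0)  # motion -> speech
    return -(np.trace(log_p_sm) + np.trace(log_p_ms)) / (2 * n)
```

Training with such an objective pulls co-occurring speech and gesture segments together in the joint space, so the resulting embedding can serve as a semantically informed conditioning signal for the downstream diffusion model.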


research · 01/24/2023
DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
Speech-driven gesture synthesis is a field of growing interest in virtua...

research · 01/25/2020
Gesticulator: A framework for semantically-aware speech-driven gesture generation
During speech, people spontaneously gesticulate, which plays a key role ...

research · 09/06/2023
MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
The objective of the multi-condition human motion synthesis task is to i...

research · 06/20/2023
EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model
Although previous co-speech gesture generation methods are able to synth...

research · 08/25/2021
Integrated Speech and Gesture Synthesis
Text-to-speech and co-speech gesture synthesis have until now been treat...

research · 06/15/2023
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
With read-aloud speech synthesis achieving high naturalness scores, ther...

research · 03/26/2023
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
The automatic generation of stylized co-speech gestures has recently rec...
