Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

07/24/2020
by   Chaitanya Ahuja, et al.

How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: http://chahuja.com/mix-stage
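To make the mixture-of-generators idea concrete, below is a minimal, illustrative sketch of a style-conditioned mixture of pose generators, loosely inspired by the description above. All module names, feature dimensions, and the GRU audio encoder are hypothetical placeholders chosen for the example, not the authors' implementation; see http://chahuja.com/mix-stage for the actual code.

    # Hedged sketch: style-conditioned mixture of pose generators.
    # Dimensions, module choices, and names are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class MixtureOfGenerators(nn.Module):
        def __init__(self, num_speakers=2, num_mixtures=4,
                     audio_dim=128, style_dim=16, pose_dim=104, hidden=256):
            super().__init__()
            # One learned style embedding per speaker (the style space).
            self.style = nn.Embedding(num_speakers, style_dim)
            # Shared content encoder: maps audio features to a style-agnostic code.
            self.content_enc = nn.GRU(audio_dim, hidden, batch_first=True)
            # A bank of sub-generators, each mapping content to a pose sequence.
            self.generators = nn.ModuleList(
                [nn.Linear(hidden, pose_dim) for _ in range(num_mixtures)])
            # Style embedding -> mixture weights over the sub-generators.
            self.mixture_weights = nn.Linear(style_dim, num_mixtures)

        def forward(self, audio, speaker_id):
            content, _ = self.content_enc(audio)                     # (B, T, hidden)
            style = self.style(speaker_id)                           # (B, style_dim)
            w = torch.softmax(self.mixture_weights(style), dim=-1)   # (B, M)
            poses = torch.stack([g(content) for g in self.generators], dim=-1)
            # Combine sub-generator outputs with style-dependent mixture weights.
            return (poses * w[:, None, None, :]).sum(-1)             # (B, T, pose_dim)

    # Style transfer amounts to swapping the speaker id (and thus the style
    # embedding) while keeping the same input speech:
    model = MixtureOfGenerators()
    audio = torch.randn(1, 64, 128)                  # dummy audio features
    pose_style_A = model(audio, torch.tensor([0]))   # speech rendered in speaker 0's style
    pose_style_B = model(audio, torch.tensor([1]))   # same speech in speaker 1's style

The key design point this sketch tries to capture is that the content pathway is shared across speakers while only the style embedding and its mixture weights change, which is what lets the same input speech be re-rendered in a different gesturing style.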

