Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

04/18/2022
by Evonne Ng et al.

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
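
To make the abstract's two main components concrete, below is a minimal, illustrative PyTorch sketch of (1) a vector-quantization bottleneck of the kind a motion-encoding VQ-VAE uses to discretize listener motion, and (2) a cross-attention block in which speaker motion features attend over speech-audio features. This is not the authors' implementation: all module names, dimensions, and hyperparameters here are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code) of the two ideas the abstract
# names: a VQ codebook bottleneck over listener motion, and motion-audio
# cross attention on the speaker side. Shapes and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                       # z: (batch, time, code_dim)
        flat = z.reshape(-1, z.shape[-1])       # (batch*time, code_dim)
        # Squared distance from each latent to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                   # nearest code per latent
        q = self.codebook(idx).view_as(z)
        # Straight-through estimator: gradients flow from q back into z.
        q_st = z + (q - z).detach()
        commit_loss = F.mse_loss(z, q.detach())
        return q_st, idx.view(z.shape[:-1]), commit_loss


class SpeakerFusion(nn.Module):
    """Cross attention: speaker motion queries attend over speech audio."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats, audio_feats):
        fused, _ = self.attn(motion_feats, audio_feats, audio_feats)
        return self.norm(motion_feats + fused)  # residual + norm


if __name__ == "__main__":
    B, T, D = 2, 32, 64                         # assumed batch/time/feature sizes
    vq = VectorQuantizer()
    fusion = SpeakerFusion()
    listener_motion = torch.randn(B, T, D)      # stand-in encoder output
    speaker_motion = torch.randn(B, T, D)
    speaker_audio = torch.randn(B, T, D)
    q, codes, loss = vq(listener_motion)
    ctx = fusion(speaker_motion, speaker_audio)
    print(q.shape, codes.shape, ctx.shape)
```

In the full method, as the abstract describes it, a predictor would autoregressively emit discrete listener codes conditioned on the fused speaker context, and the VQ-VAE decoder would map sampled codes back to 3D facial motion; sampling different code sequences is what produces multiple plausible listener responses rather than a single deterministic one.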
