Emotional Speech-Driven Animation with Content-Emotion Disentanglement

06/15/2023
by Radek Danecek, et al.

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Instead, they focus on modeling the correlations between speech and facial motion, resulting in animations that are unemotional or do not match the input emotion. We observe that facial animation is driven by two contributing factors: the speech and the emotion. We exploit this insight in EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip sync while enabling explicit control over the expression of emotion. Because no high-quality emotional 3D face dataset aligned with speech exists, EMOTE is trained from an emotional video dataset (i.e., MEAD). To achieve this, we match the speech content of generated sequences to the target videos differently from the emotion content. Specifically, we train EMOTE with additional supervision in the form of a lip-reading objective to preserve the speech-dependent content (spatially local and of high temporal frequency), while applying emotion supervision at the sequence level (spatially global and of low frequency). Furthermore, we employ a content-emotion exchange mechanism to supervise a different emotion on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Extensive qualitative, quantitative, and perceptual evaluations demonstrate that EMOTE produces state-of-the-art speech-driven facial animations, with lip sync on par with the best methods while offering additional, high-quality emotional control.
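
The abstract describes the disentangled supervision only in prose. The sketch below is a minimal, hypothetical PyTorch-style illustration of how such a content-emotion exchange step could be composed; it is not the authors' implementation. All module and function names (PlaceholderEncoder, PlaceholderDecoder, lip_reading_loss, emotion_loss) are assumptions made for this example, and the temporal-VAE motion prior and perceptual rendering losses mentioned above are omitted.

```python
# Illustrative sketch only: a content-emotion exchange training step loosely
# following the supervision described in the abstract. Module names, shapes,
# and loss functions are hypothetical placeholders, not EMOTE's actual code.
import torch
import torch.nn as nn


class PlaceholderEncoder(nn.Module):
    """Maps a feature sequence (B, T, d_in) to per-frame codes or one pooled code."""
    def __init__(self, d_in, d_out, pool=False):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.pool = pool  # pooled -> a single sequence-level (emotion) code

    def forward(self, x):
        h = self.proj(x)                      # (B, T, d_out)
        return h.mean(dim=1) if self.pool else h


class PlaceholderDecoder(nn.Module):
    """Combines per-frame content codes with one emotion code into 3D face motion."""
    def __init__(self, d_content, d_emotion, d_motion):
        super().__init__()
        self.out = nn.Linear(d_content + d_emotion, d_motion)

    def forward(self, content, emotion):      # content: (B, T, Dc), emotion: (B, De)
        emotion = emotion.unsqueeze(1).expand(-1, content.shape[1], -1)
        return self.out(torch.cat([content, emotion], dim=-1))  # (B, T, d_motion)


def exchange_step(speech_enc, emotion_enc, decoder,
                  audio_a, video_a, video_b,
                  lip_reading_loss, emotion_loss):
    """Decode clip A's speech content with clip B's emotion.

    The lip-reading term supervises content locally in time (against clip A),
    while the emotion term is applied at the sequence level (against clip B).
    """
    content_a = speech_enc(audio_a)            # per-frame speech content of clip A
    emotion_b = emotion_enc(video_b)           # single emotion code of clip B
    motion_ab = decoder(content_a, emotion_b)  # A's words spoken with B's emotion
    return lip_reading_loss(motion_ab, video_a) + emotion_loss(motion_ab, video_b)
```

The split between per-frame content codes and a single pooled emotion code in this sketch mirrors the distinction drawn in the abstract between spatially local, high-temporal-frequency speech content and spatially global, low-frequency emotion.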


