FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

03/09/2023
by Kazi Injamamul Haque, et al.

This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion, and hesitation). It is also very robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning, taking both audio and text as input to generate facial animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues with accurate lip-syncing, expressivity, person-specific information, and generalizability. We effectively employ the self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and speaker identity allows the network to distinguish even the subtlest facial motions. We carried out extensive objective and subjective evaluations in comparison to ground truth and state-of-the-art work. A perceptual user study demonstrates that our approach produces superior results with respect to the realism of the animation 78% of the time. In addition, our method is 4 times faster, as it eliminates the use of complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code via a linked GitHub repository.
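The abstract describes the pipeline at a high level: a pretrained HuBERT model encodes raw audio (capturing both lexical and non-lexical cues), and training is guided by a binary emotion condition and a speaker identity before a non-transformer sequence model decodes facial motion. The sketch below is not the authors' implementation; the checkpoint name, tensor shapes, mesh resolution, and the GRU decoder are illustrative assumptions, with the GRU chosen only as one lightweight alternative consistent with the abstract's note that the method avoids transformers.

import numpy as np
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Pretrained HuBERT as the audio encoder (checkpoint is an assumption;
# any 16 kHz HuBERT checkpoint would serve the same role).
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def encode_audio(waveform: np.ndarray, sample_rate: int = 16000) -> torch.Tensor:
    """Encode a mono waveform into frame-level HuBERT features, shape (1, T, 768)."""
    inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        return hubert(inputs.input_values).last_hidden_state

def condition_features(audio_feats: torch.Tensor, emotion: int,
                       speaker_one_hot: torch.Tensor) -> torch.Tensor:
    """Append a binary emotion flag and a one-hot speaker id to every audio frame."""
    _, T, _ = audio_feats.shape
    emo = torch.full((1, T, 1), float(emotion))            # 0 = neutral, 1 = expressive
    spk = speaker_one_hot.view(1, 1, -1).expand(1, T, -1)  # same speaker id per frame
    return torch.cat([audio_feats, emo, spk], dim=-1)

class VertexDecoder(torch.nn.Module):
    """Maps conditioned audio features to per-frame vertex offsets.
    A GRU stands in here as one lightweight, non-transformer sequence model."""
    def __init__(self, in_dim: int, num_vertices: int):
        super().__init__()
        self.gru = torch.nn.GRU(in_dim, 256, batch_first=True)
        self.out = torch.nn.Linear(256, num_vertices * 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(x)
        return self.out(h)  # (1, T, num_vertices * 3)

# Dummy end-to-end pass: 1 s of silence, expressive condition, speaker 3 of 8.
feats = encode_audio(np.zeros(16000, dtype=np.float32))
cond = condition_features(feats, emotion=1, speaker_one_hot=torch.eye(8)[3])
decoder = VertexDecoder(cond.shape[-1], num_vertices=5023)  # mesh size is illustrative
offsets = decoder(cond)

Concatenating the emotion flag and speaker one-hot to every frame is one simple way to realize the "guiding the training with a binary emotion condition and speaker identity" described above; the paper itself should be consulted for the exact conditioning scheme and decoder architecture.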


Related research

09/20/2023
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Speech-driven 3D facial animation synthesis has been a challenging task ...

04/16/2021
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
This paper presents a generic method for generating full facial 3D anima...

06/15/2023
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
To be widely adopted, 3D facial avatars need to be animated easily, real...

08/03/2020
Audiovisual Speech Synthesis using Tacotron2
Audiovisual speech synthesis is the problem of synthesizing a talking fa...

12/04/2021
Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation
Speech-driven 3D facial animation with accurate lip synchronization has ...

12/10/2021
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
Speech-driven 3D facial animation is challenging due to the complex geom...

11/15/2022
Towards an objective characterization of an individual's facial movements using Self-Supervised Person-Specific-Models
Disentangling facial movements from other facial characteristics, partic...
