Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

07/20/2018
by Hang Zhou, et al.

Talking face generation aims to synthesize a sequence of face images that correspond to given speech semantics. However, when people talk, the subtle movements of their face region are usually a complex combination of the intrinsic face appearance of the subject and the extrinsic speech to be delivered. Existing works focus on either the former, constructing a specific face appearance model for a single subject, or the latter, modeling the identity-agnostic transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning a disentangled audio-visual representation. We assume the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. The disentangled representation has an additional advantage: either audio or video can serve as the source of speech information for generation. Extensive experiments show that our proposed approach can generate realistic talking face sequences for arbitrary subjects with much clearer lip motion patterns. We also demonstrate that the learned audio-visual representation is extremely useful for applications such as automatic lip reading and audio-video retrieval.
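To make the core idea concrete, here is a minimal NumPy sketch of the abstract's two-space composition: a subject-related identity code and a speech-related content code are encoded separately, so either audio or video can supply the speech code before the two are recombined for generation. All dimensions, encoder names, and the linear stand-in "encoders" are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): the identity space and the
# speech-content space are kept separate so they can be freely recombined.
D_ID, D_SPEECH, D_FRAME = 64, 32, 128

# Stand-in "encoders" and "decoder": random linear maps, illustration only.
W_id  = rng.normal(size=(256, D_ID))      # face feature -> identity code
W_v2s = rng.normal(size=(256, D_SPEECH))  # face feature -> speech code (visual)
W_a2s = rng.normal(size=(40,  D_SPEECH))  # audio feature -> speech code (audio)
W_dec = rng.normal(size=(D_ID + D_SPEECH, D_FRAME))  # codes -> output frame

def encode_identity(face):            # subject-related information
    return face @ W_id

def encode_speech_from_video(face):   # speech-related information, visual source
    return face @ W_v2s

def encode_speech_from_audio(mfcc):   # speech-related information, audio source
    return mfcc @ W_a2s

def decode(pid, speech):              # recombine the two disentangled codes
    return np.concatenate([pid, speech]) @ W_dec

face = rng.normal(size=256)  # an arbitrary subject's face feature
mfcc = rng.normal(size=40)   # one audio frame's acoustic feature

pid = encode_identity(face)
# Once the spaces are disentangled, either modality can drive generation:
frame_from_audio = decode(pid, encode_speech_from_audio(mfcc))
frame_from_video = decode(pid, encode_speech_from_video(face))
assert frame_from_audio.shape == frame_from_video.shape == (D_FRAME,)
```

In the paper's actual method the disentanglement is enforced by associative-and-adversarial training rather than given by construction as here; the sketch only shows why a shared speech space makes both audio-driven and video-driven generation possible.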

Related research

11/29/2020 · Audio-visual Speech Separation with Adversarially Disentangled Visual Representation
Speech separation aims to separate individual voice from an audio mixtur...

12/06/2021 · One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
Audio-driven one-shot talking face generation methods are usually traine...

04/22/2021 · Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
While accurate lip synchronization has been achieved for arbitrary-subje...

10/02/2019 · Animating Face using Disentangled Audio Representations
All previous methods for audio-driven talking head generation assume the...

04/13/2018 · Talking Face Generation by Conditional Recurrent Adversarial Network
Given an arbitrary face image and an arbitrary speech clip, the proposed...

02/13/2022 · Lip Movements Information Disentanglement for Lip Sync
The lip movements information is critical for many audio-visual tasks. H...

03/28/2018 · Lip Movements Generation at a Glance
Cross-modality generation is an emerging topic that aims to synthesize d...
