One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning

12/06/2021
by Suzhen Wang, et al.

Audio-driven one-shot talking face generation methods are usually trained on video footage of many different speakers. However, the videos they create often suffer from unnatural mouth shapes and asynchronous lip motion, because these methods struggle to learn a consistent speech style across speakers. We observe that a consistent speech style is much easier to learn from a single speaker, which in turn yields authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that learns consistent audio-visual correlations from a specific speaker and then transfers the audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that infers talking motions, represented as keypoint-based dense motion fields, from input audio. In particular, since the audio may come from different identities at deployment time, we represent the audio signal with phonemes; in this manner, AVCT inherently generalizes to audio spoken by other identities. Moreover, because speakers are represented by face keypoints, AVCT is agnostic to the training speaker's appearance and thus readily allows us to animate face images of different identities. Since different face shapes lead to different motions, a motion field transfer module is employed to reduce the gap between the audio-driven dense motion fields of the training identity and those of the one-shot reference. Once the dense motion fields of the reference image are obtained, an image renderer generates its talking face video from the audio clip. Thanks to the consistent speaking style learned from a single speaker, our method produces authentic mouth shapes and vivid facial movements. Extensive experiments demonstrate that our synthesized videos outperform the state of the art in terms of visual quality and lip synchronization.
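
The abstract describes a pipeline: phonemes and face keypoints feed an Audio-Visual Correlation Transformer that predicts keypoint-based dense motion fields, a motion field transfer module adapts those fields to the one-shot reference's face shape, and an image renderer warps the reference image into video frames. The PyTorch sketch below shows one way such a pipeline could be wired together; the module architecture, dimensions, and the simple standard-deviation-based motion rescaling are hypothetical assumptions for illustration, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the pipeline described in the abstract.
# All module names, dimensions, and the motion-transfer heuristic are assumptions.
import torch
import torch.nn as nn


class AudioVisualCorrelationTransformer(nn.Module):
    """Maps a phoneme sequence (speaker-agnostic audio representation) plus the
    reference face keypoints to per-frame keypoint displacements, i.e. a
    keypoint-based parameterization of dense motion fields."""

    def __init__(self, num_phonemes=44, num_keypoints=10, d_model=256):
        super().__init__()
        self.phoneme_embed = nn.Embedding(num_phonemes, d_model)
        self.keypoint_proj = nn.Linear(num_keypoints * 2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.to_motion = nn.Linear(d_model, num_keypoints * 2)

    def forward(self, phoneme_ids, ref_keypoints):
        # phoneme_ids: (B, T) phoneme index per audio frame
        # ref_keypoints: (B, K, 2) keypoints detected on the reference image
        b, t = phoneme_ids.shape
        x = self.phoneme_embed(phoneme_ids)                # (B, T, D)
        kp = self.keypoint_proj(ref_keypoints.flatten(1))  # (B, D)
        x = x + kp.unsqueeze(1)                            # condition audio on keypoints
        h = self.encoder(x)                                # (B, T, D)
        return self.to_motion(h).view(b, t, -1, 2)         # (B, T, K, 2) displacements


def transfer_motion(driving_motion, train_kp, ref_kp):
    """Illustrative motion-field transfer: rescale the driving keypoint
    displacements by the shape difference between the training identity's
    keypoints and the one-shot reference's keypoints."""
    scale = ref_kp.std(dim=1) / (train_kp.std(dim=1) + 1e-6)  # (B, 2)
    return driving_motion * scale[:, None, None, :]


if __name__ == "__main__":
    avct = AudioVisualCorrelationTransformer()
    phonemes = torch.randint(0, 44, (1, 50))   # 50 audio frames
    ref_kp = torch.rand(1, 10, 2)              # keypoints of the one-shot reference
    train_kp = torch.rand(1, 10, 2)            # keypoints of the training speaker
    motion = avct(phonemes, ref_kp)            # (1, 50, 10, 2)
    motion = transfer_motion(motion, train_kp, ref_kp)
    # `motion` would then drive an image renderer that warps the reference
    # image into the talking-face frames.
    print(motion.shape)
```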


Related research

07/20/2021
Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
We propose an audio-driven talking-head method to generate photo-realist...

08/23/2022
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
We propose StyleTalker, a novel audio-driven talking head generation mod...

07/20/2018
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images tha...

08/29/2022
StableFace: Analyzing and Improving Motion Stability for Talking Face Generation
While previous speech-driven talking face generation methods have made s...

03/28/2018
Lip Movements Generation at a Glance
Cross-modality generation is an emerging topic that aims to synthesize d...

08/31/2023
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Existing automated dubbing methods are usually designed for Professional...

06/02/2023
Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation
This paper presents a novel approach for generating 3D talking heads fro...
