StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

08/23/2022
by Dongchan Min et al.

We propose StyleTalker, a novel audio-driven talking head generation model that synthesizes a video of a talking person from a single reference image, with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization, 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that motions and lip movements can be manipulated independently while preserving the identity, and 3) an autoregressive prior augmented with normalizing flow to learn a complex, multi-modal audio-to-motion latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way, when another motion source video is given, but also in a completely audio-driven manner, by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos of impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
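A contrastive lip-sync discriminator of the kind described above can be pictured as an InfoNCE-style critic over paired audio and lip embeddings: matching audio windows and lip crops are pulled together in a shared space, while mismatched pairs in the batch serve as negatives. The PyTorch sketch below is illustrative only; the class name ContrastiveLipSyncDiscriminator, the encoder layouts, the feature dimensions (audio_dim, lip_dim, embed_dim), the temperature, and the symmetric InfoNCE loss are our assumptions, not details confirmed by the abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContrastiveLipSyncDiscriminator(nn.Module):
        """Minimal sketch of a contrastive lip-sync discriminator.

        Hypothetical architecture: the abstract only states that such a
        discriminator exists; the projection heads, embedding size, and
        InfoNCE-style objective here are illustrative assumptions.
        """

        def __init__(self, audio_dim=80, lip_dim=512, embed_dim=256, temperature=0.07):
            super().__init__()
            # Project per-window audio features (e.g. mel-spectrogram frames)
            # and lip-region image features into a shared embedding space.
            self.audio_proj = nn.Sequential(
                nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
            )
            self.lip_proj = nn.Sequential(
                nn.Linear(lip_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
            )
            self.temperature = temperature

        def forward(self, audio_feats, lip_feats):
            """audio_feats: (B, audio_dim); lip_feats: (B, lip_dim).

            Treats the i-th audio window and the i-th lip crop as a positive
            pair and all other pairings in the batch as negatives.
            """
            a = F.normalize(self.audio_proj(audio_feats), dim=-1)
            v = F.normalize(self.lip_proj(lip_feats), dim=-1)
            logits = a @ v.t() / self.temperature  # (B, B) similarity matrix
            targets = torch.arange(a.size(0), device=a.device)
            # Symmetric contrastive loss over audio-to-lip and lip-to-audio.
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))

    # Usage: penalize the generator when synthesized lip crops drift from the audio.
    disc = ContrastiveLipSyncDiscriminator()
    loss = disc(torch.randn(8, 80), torch.randn(8, 512))
    loss.backward()

In this formulation, larger batches supply more in-batch negatives, which is what sharpens the audio-lip correspondence the discriminator enforces.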


Related research

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion (07/20/2021)
We propose an audio-driven talking-head method to generate photo-realist...

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (11/22/2022)
Generating talking head videos through a face image and a piece of speec...

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors (12/07/2022)
In this paper, we introduce a simple and novel framework for one-shot au...

One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (12/06/2021)
Audio-driven one-shot talking face generation methods are usually traine...

DisCoHead: Audio-and-Video-Driven Talking Head Generation by Disentangled Control of Head Pose and Facial Expressions (03/14/2023)
For realistic talking head generation, creating natural head motion whil...

Learning Variational Motion Prior for Video-based Motion Capture (10/27/2022)
Motion capture from a monocular video is fundamental and crucial for us...

Autoregressive GAN for Semantic Unconditional Head Motion Generation (11/02/2022)
We address the task of unconditional head motion generation to animate s...
