DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis

01/10/2023
by   Shuai Shen, et al.

Talking head synthesis is a promising approach for the video production industry. Recently, much effort has been devoted to this research area, either to improve generation quality or to enhance model generalization. However, few works address both issues simultaneously, which is essential for practical applications. To this end, we turn our attention to the emerging and powerful Latent Diffusion Models, and model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). More specifically, instead of employing the audio signal as the sole driving factor, we investigate the control mechanism of the talking face and incorporate reference face images and landmarks as additional conditions for personality-aware, generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos synchronized with the source audio, and, more importantly, it naturally generalizes across different identities without any further fine-tuning. Additionally, DiffTalk can be gracefully tailored for higher-resolution synthesis at negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity, audio-driven talking head videos for novel identities. For more video results, please refer to this demonstration <https://cloud.tsinghua.edu.cn/f/e13f5aad2f4c4f898ae7/>.
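The core idea above, conditional denoising in a latent space, can be illustrated with a toy DDPM-style reverse loop. This is a minimal sketch, not the paper's actual architecture: the real DiffTalk uses a trained U-Net over autoencoder latents, whereas here the epsilon-predictor is a hypothetical linear stand-in, and the fused audio/reference/landmark condition is a random vector with assumed dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 16   # compressed face-frame latent (assumed size)
COND_DIM = 16     # fused audio + reference-face + landmark feature (assumed size)
T = 50            # number of diffusion steps

# Standard DDPM linear noise schedule.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(z_t, t, cond, W):
    """Stand-in noise predictor: a linear map of [latent, condition, timestep].
    In the actual model this would be a conditional U-Net."""
    x = np.concatenate([z_t, cond, [t / T]])
    return W @ x

def reverse_step(z_t, t, cond, W):
    """One DDPM reverse step z_t -> z_{t-1}, conditioned on cond."""
    eps = denoiser(z_t, t, cond, W)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (z_t - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(LATENT_DIM) if t > 0 else np.zeros(LATENT_DIM)
    return mean + np.sqrt(betas[t]) * noise

# Random "weights" and condition vector, just to run the loop end to end.
W = rng.standard_normal((LATENT_DIM, LATENT_DIM + COND_DIM + 1)) * 0.01
cond = rng.standard_normal(COND_DIM)  # would come from audio/image/landmark encoders

z = rng.standard_normal(LATENT_DIM)   # start from pure Gaussian noise
for t in reversed(range(T)):
    z = reverse_step(z, t, cond, W)

print(z.shape)  # (16,)
```

Because the conditions (audio window, reference frame, landmarks) enter the denoiser at every step rather than being baked into the weights, swapping in features for a new identity drives the same trained network, which is what enables generalization without fine-tuning.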


Related research

- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis (01/31/2023). Generating photo-realistic video portrait with arbitrary speech audio is...
- Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis (07/24/2022). Talking head synthesis is an emerging technology with wide applications ...
- Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models (06/29/2023). The Video-to-Audio (V2A) model has recently gained attention for its pra...
- Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model (05/04/2023). Multimodal-driven talking face generation refers to animating a portrait...
- HeadGAN: Video-and-Audio-Driven Talking Head Synthesis (12/15/2020). Recent attempts to solve the problem of talking head synthesis using a s...
- Perceptual Conversational Head Generation with Regularized Driver and Enhanced Renderer (06/26/2022). This paper reports our solution for MultiMedia ViCo 2022 Conversational ...
- A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis (10/07/2022). Audio driven talking head synthesis is a challenging task that attracts ...
