
High-Resolution Talking Face Generation via Mutual Information Approximation

by   Aihua Zheng, et al.
Anhui University

Given an arbitrary speech clip and a facial image, talking face generation aims to synthesize a talking face video with precise lip synchronization and smooth facial motion over the entire speech. Most existing methods focus on either disentangling the information in a single image or learning temporal information between frames. However, speech audio and video exhibit cross-modality coherence that has not been well exploited during synthesis. This paper therefore proposes a novel high-resolution talking face generation model for arbitrary identities that discovers this cross-modality coherence via Mutual Information Approximation (MIA). Assuming the modality gap between audio and video is larger than that between real and generated video, we estimate the mutual information between real audio and video, and then use a discriminator to push the generated video distribution toward the real video distribution. Furthermore, we introduce a dynamic attention mechanism on the mouth region to enhance robustness during training. Experimental results on the benchmark LRW dataset surpass state-of-the-art methods on prevalent metrics, with robustness to gender and pose variations and to high-resolution synthesis.
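To make the MIA idea concrete, mutual information between paired modalities is commonly estimated with a Donsker-Varadhan lower bound (the MINE-style estimator): score genuinely paired samples with a critic network and contrast against shuffled pairs drawn from the product of marginals. The sketch below is a minimal illustration of that bound, not the paper's implementation; the fixed bilinear critic and the toy Gaussian stand-ins for audio/video embeddings are assumptions for demonstration.

```python
import numpy as np

def mi_lower_bound(critic, x, y, rng):
    """Donsker-Varadhan lower bound on I(X;Y):
    E_joint[T(x,y)] - log E_marginal[exp(T(x,y'))].
    """
    joint = critic(x, y)                       # scores for truly paired samples
    perm = rng.permutation(len(y))
    marginal = critic(x, y[perm])              # shuffled pairs ~ product of marginals
    return joint.mean() - np.log(np.exp(marginal).mean())

# Toy correlated data standing in for audio/video feature pairs.
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 1))
y = x + 0.1 * rng.normal(size=(5000, 1))       # y strongly depends on x

# Hypothetical fixed critic; in MINE this would be a trained network.
bilinear = lambda a, b: 0.5 * (a * b).sum(axis=1)

est = mi_lower_bound(bilinear, x, y, rng)      # positive for dependent x, y
```

In a full estimator the critic is trained to maximize this bound, so the value tightens toward the true mutual information; here the fixed critic only demonstrates that dependent pairs score above the shuffled baseline.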



