Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

by Chao Xu, et al.

Multimodal-driven talking face generation refers to animating a portrait with the pose, expression, and gaze transferred from a driving image or video, or estimated from text and audio. However, existing methods ignore the potential of the text modality, and their generators mainly follow a source-oriented feature-rearrangement paradigm coupled with unstable GAN frameworks. In this work, we first represent emotion as a text prompt, which inherits rich semantics from CLIP and allows flexible, generalized emotion control. We further reorganize these tasks as target-oriented texture transfer and adopt Diffusion Models. More specifically, given a textured face as the source and a rendered face projected from the desired 3DMM coefficients as the target, our proposed Texture-Geometry-aware Diffusion Model (TGDM) decomposes the complex transfer problem into a multi-conditional denoising process, where a Texture Attention-based module accurately models the correspondences between the appearance and geometry cues contained in the source and target conditions, and incorporates extra implicit information for high-fidelity talking face generation. Additionally, TGDM can be gracefully tailored for face swapping. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes. Extensive experiments demonstrate the superiority of our method.
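To make the multi-conditional denoising process concrete, the following is a minimal sketch of DDPM-style sampling conditioned on two inputs (a source texture map and a target geometry rendering), in the spirit described above. It is not the paper's actual architecture: `toy_denoiser` is a hypothetical stand-in for the conditional U-Net with the Texture Attention module, and all function names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule, as in standard DDPM formulations."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, src_texture, tgt_geometry):
    """Hypothetical stand-in for a conditional noise-prediction network.

    A real model would fuse the noisy image x_t with the source-texture and
    target-geometry conditions (e.g. via cross-attention); here we simply
    blend the conditions to produce a placeholder noise estimate.
    """
    return x_t - 0.5 * (src_texture + tgt_geometry)

def sample(shape, src_texture, tgt_geometry, T=50, seed=0):
    """Ancestral sampling: start from Gaussian noise and denoise step by
    step, feeding both conditions to the denoiser at every timestep."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t, src_texture, tgt_geometry)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Usage: generate a toy 8x8 "image" conditioned on two fixed condition maps.
src = np.full((8, 8), 0.2)
tgt = np.full((8, 8), -0.2)
out = sample((8, 8), src, tgt)
```

The design point this illustrates is that conditioning enters at every denoising step, so the source (appearance) and target (geometry) signals jointly steer the whole reverse trajectory rather than being applied once at the end.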
