Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

08/18/2023
by Soumik Mukhopadhyay, et al.

The task of lip synchronization (lip-sync) is to match the lip movements of a face in a video to a given audio track. It has applications in the film industry, as well as in creating virtual avatars and in video conferencing. This is a challenging problem because one must introduce detailed, realistic lip movements while simultaneously preserving the subject's identity, pose, emotions, and image quality. Many previous methods suffer from degraded image quality due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model that performs lip synchronization in the wild while preserving these qualities. We train our model on VoxCeleb2, a dataset of in-the-wild talking-face videos. Extensive studies show that our method outperforms popular methods such as Wav2Lip and PC-AVS on the Fréchet Inception Distance (FID) metric and in Mean Opinion Scores (MOS) from user studies. We show results in both the reconstruction setting (same audio and video inputs) and the cross setting (different audio and video inputs) on the VoxCeleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).
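For readers unfamiliar with how audio conditioning enters a diffusion model's sampling loop, the sketch below illustrates the general idea in Python. It is not the Diff2Lip architecture: the toy denoiser, feature dimensions, and noise schedule are hypothetical stand-ins, and the audio embedding is simply concatenated with the noisy frame and timestep before the network predicts the noise.

```python
# Illustrative-only sketch (not the authors' code): a minimal DDPM-style
# reverse process whose denoiser is conditioned on an audio embedding.
import torch
import torch.nn as nn

T = 50                                   # number of diffusion steps (hypothetical)
betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule (hypothetical)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class AudioConditionedDenoiser(nn.Module):
    """Toy epsilon-predictor: concatenates a flattened noisy frame with an
    audio embedding and the timestep, then regresses the added noise."""
    def __init__(self, frame_dim=3 * 32 * 32, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + audio_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, x_t, audio_emb, t):
        t_feat = t.float().unsqueeze(-1) / T               # scalar timestep feature
        inp = torch.cat([x_t, audio_emb, t_feat], dim=-1)
        return self.net(inp)                                # predicted noise epsilon

@torch.no_grad()
def sample(model, audio_emb, frame_dim=3 * 32 * 32):
    """Ancestral sampling: start from Gaussian noise and iteratively denoise,
    with the audio embedding steering the generated frame toward the speech."""
    x = torch.randn(audio_emb.shape[0], frame_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
        eps = model(x, audio_emb, t_batch)
        a_t, ab_t = alphas[t], alpha_bars[t]
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x.view(-1, 3, 32, 32)                            # generated frame

if __name__ == "__main__":
    model = AudioConditionedDenoiser()
    audio_emb = torch.randn(2, 128)      # stand-in for per-frame audio features
    frames = sample(model, audio_emb)
    print(frames.shape)                  # torch.Size([2, 3, 32, 32])
```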


