Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

06/29/2023
by Simian Luo, et al.

Video-to-Audio (V2A) synthesis has recently gained attention for its practical applications, such as generating audio directly from silent videos in video and film production. However, previous V2A methods suffer from limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn temporally and semantically aligned features, then train an LDM conditioned on the CAVP-aligned visual features in a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtle audio-visual correlations through a cross-attention module. We further improve sample quality significantly with "double guidance". Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset, and we demonstrate its practical applicability and generalization capabilities via downstream finetuning. Project page: https://diff-foley.github.io/
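The abstract names two key ingredients: a contrastive audio-visual pretraining objective (CAVP) and "double guidance" at sampling time. The sketch below is a minimal, illustrative PyTorch rendering of both ideas; the function names, the symmetric InfoNCE form of the CAVP loss, and the assumption that double guidance combines classifier-free guidance with an alignment-classifier gradient are ours and are not taken from the Diff-Foley code.

# Illustrative sketch only; names and exact formulas are assumptions, not Diff-Foley's API.
import torch
import torch.nn.functional as F

def cavp_contrastive_loss(video_feats, audio_feats, temperature=0.07):
    """Symmetric InfoNCE loss between pooled video and audio clip embeddings.

    video_feats, audio_feats: (batch, dim) embeddings of paired clips;
    matching (video, audio) pairs lie on the diagonal of the similarity matrix.
    """
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    logits = v @ a.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives on the diagonal
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

def double_guidance_eps(unet, classifier, z_t, t, cond, w_cfg=4.5, w_cls=1.0):
    """Combine classifier-free guidance with an alignment-classifier gradient.

    unet(z, t, cond) -> predicted noise; `cond` holds CAVP visual features and is
    passed as None (or a learned null embedding) for the unconditional branch.
    classifier(z, t, cond) -> logit that the noisy latent z_t is audio-visually
    aligned with `cond`; its gradient nudges sampling toward synchronized audio.
    The usual sqrt(1 - alpha_bar_t) scale on the classifier term is folded into
    w_cls here for brevity.
    """
    eps_cond = unet(z_t, t, cond)
    eps_uncond = unet(z_t, t, None)
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)   # classifier-free guidance

    with torch.enable_grad():
        z_in = z_t.detach().requires_grad_(True)
        log_p = F.logsigmoid(classifier(z_in, t, cond)).sum()
        grad = torch.autograd.grad(log_p, z_in)[0]       # alignment-classifier gradient
    return eps - w_cls * grad                            # shift the noise estimate

In this reading, the classifier-free term mainly steers semantic relevance while the classifier gradient adds an extra push toward audio-visual alignment; the weights and scaling used by Diff-Foley itself may differ.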


Related Research

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment (05/22/2023)
Text-to-audio (TTA) generation is a recent popular problem that aims to ...

AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion (05/06/2023)
Recent advances in diffusion models have showcased promising results in ...

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization (08/18/2023)
The task of lip synchronization (lip-sync) seeks to match the lips of hu...

LumièreNet: Lecture Video Synthesis from Audio (07/04/2019)
We present LumièreNet, a simple, modular, and completely deep-learning b...

Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild (03/08/2022)
Talking face generation with great practical significance has attracted ...

DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis (01/10/2023)
Talking head synthesis is a promising approach for the video production ...

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models (06/16/2023)
Recent work has studied text-to-audio synthesis using large amounts of p...
