DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

05/22/2023
by   Shentong Mo, et al.

Text-to-audio (TTA) generation is a recently popular task that aims to synthesize general audio from text descriptions. Previous methods use latent diffusion models to learn audio embeddings in a latent space, conditioned on text embeddings. However, they ignore the synchronization between audio and visual content in a video and tend to generate audio that does not match the video frames. In this work, we propose DiffAVA, a novel, personalized text-to-sound generation approach with visual alignment based on latent diffusion models. DiffAVA fine-tunes only lightweight visual-text alignment modules, keeping the modality-specific encoders frozen, to produce visual-aligned text embeddings that serve as the condition. Specifically, DiffAVA uses a multi-head attention transformer to aggregate temporal information from video features and a dual multi-modal residual network to fuse the temporal visual representations with text embeddings. A contrastive learning objective then matches the visual-aligned text embeddings with audio features. Experimental results on the AudioCaps dataset show that DiffAVA achieves competitive performance on visual-aligned text-to-audio generation.
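
The three components named in the abstract (temporal aggregation of video features, residual fusion with text embeddings, and a contrastive objective against audio features) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The abstract does not specify layer shapes or the exact loss, so the sketch assumes single-head attention pooling with one query vector (the paper uses a multi-head attention transformer), a single tanh residual block for fusion, and a symmetric InfoNCE loss with temperature 0.07. All function names, parameters, and dimensions are hypothetical, and random arrays stand in for the outputs of the frozen encoders.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def aggregate_frames(frames, query):
    """Attention-pool per-frame features (B, T, D) into one vector per clip (B, D).

    Assumption: a single query vector and a single head.
    """
    scores = frames @ query / np.sqrt(frames.shape[-1])   # (B, T)
    weights = np.exp(log_softmax(scores, axis=-1))        # attention over time
    return (weights[..., None] * frames).sum(axis=1)      # (B, D)

def fuse(text, visual, w_t, w_v):
    """Residual fusion: the text embedding plus a learned visual-text update.

    Assumption: one tanh layer; the paper's residual network is not specified here.
    """
    return text + np.tanh(text @ w_t + visual @ w_v)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(cond, audio, temperature=0.07):
    """Symmetric InfoNCE: the i-th condition should match the i-th audio clip."""
    logits = l2_normalize(cond) @ l2_normalize(audio).T / temperature  # (B, B)
    idx = np.arange(logits.shape[0])
    cond_to_audio = log_softmax(logits, axis=1)[idx, idx].mean()
    audio_to_cond = log_softmax(logits, axis=0)[idx, idx].mean()
    return -(cond_to_audio + audio_to_cond) / 2

# Random stand-ins for frozen encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
B, T, D = 4, 8, 16
frames = rng.normal(size=(B, T, D))   # per-frame video features
text = rng.normal(size=(B, D))        # text embeddings
audio = rng.normal(size=(B, D))       # audio features

# Trainable alignment parameters (the only parts that would be fine-tuned).
query = rng.normal(size=D)
w_t = 0.1 * rng.normal(size=(D, D))
w_v = 0.1 * rng.normal(size=(D, D))

visual = aggregate_frames(frames, query)   # temporal aggregation
cond = fuse(text, visual, w_t, w_v)        # visual-aligned text embedding
loss = contrastive_loss(cond, audio)
```

In this sketch only `query`, `w_t`, and `w_v` would receive gradients, mirroring the idea of fine-tuning lightweight alignment modules while the encoders stay frozen. The resulting `cond` plays the role of the condition passed to the latent diffusion model, which is outside the scope of this example.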


research
06/29/2023

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

The Video-to-Audio (V2A) model has recently gained attention for its pra...
research
11/19/2022

VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

Video to sound generation aims to generate realistic and natural sound g...
research
05/29/2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Large diffusion models have been successful in text-to-audio (T2A) synth...
research
12/31/2020

A Multi-modal Deep Learning Model for Video Thumbnail Selection

Thumbnail is the face of online videos. The explosive growth of videos b...
research
09/08/2023

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

In recent years, video generation has become a prominent generative tool...
research
03/21/2023

ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers

Lack of audio-video synchronization is a common problem during televisio...
research
09/19/2023

FoleyGen: Visually-Guided Audio Generation

Recent advancements in audio generation have been spurred by the evoluti...
