V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

08/18/2023
by Heng Wang, et al.

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. The representative and generative abilities these models learn from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without training from scratch. However, leveraging FMs for cross-modal generation remains under-researched when the audio modality is involved. Meanwhile, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution that leverages foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge this gap by translating the visual input from CLIP space to CLAP space. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity, visually aligned sound. Compared to previous approaches, our method requires only a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper, and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches: trained with 86% fewer parameters, it achieves a 53% improvement in FD and CS, the evaluation metrics for fidelity and relevance.
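The pipeline described above is simple enough to sketch in code. Below is a minimal, hypothetical PyTorch illustration of the regression-style mapper: an MLP that translates a frozen CLIP image embedding into CLAP space, trained with a cosine loss against CLAP embeddings of paired audio. The class name, layer widths, and the 512-dimensional embedding sizes are illustrative assumptions rather than the paper's exact configuration; in the full system, the translated embedding would then condition the frozen AudioLDM decoder.

```python
# Minimal sketch of a regression-style V2A-Mapper (assumed architecture,
# not the paper's exact config). Dimensions and names are illustrative.
import torch
import torch.nn as nn

class V2AMapper(nn.Module):
    """Translates a visual CLIP embedding into the auditory CLAP space."""
    def __init__(self, clip_dim: int = 512, clap_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, clap_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        pred = self.net(clip_emb)
        # CLIP/CLAP embeddings are typically L2-normalized, so normalize the
        # mapper output before it conditions the audio generator.
        return pred / pred.norm(dim=-1, keepdim=True)

# One training step: regress toward CLAP embeddings of paired audio.
mapper = V2AMapper()
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)

clip_emb = torch.randn(8, 512)     # placeholder batch of CLIP image embeddings
clap_target = torch.randn(8, 512)  # placeholder CLAP embeddings of paired audio
clap_target = clap_target / clap_target.norm(dim=-1, keepdim=True)

optimizer.zero_grad()
loss = 1 - nn.functional.cosine_similarity(mapper(clip_emb), clap_target).mean()
loss.backward()
optimizer.step()
```

Only the small mapper is trained here; CLIP, CLAP, and AudioLDM all stay frozen, which is what keeps the approach lightweight.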



Related research

05/24/2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have been an emerging paradigm ...

04/26/2017
Deep Cross-Modal Audio-Visual Generation
Cross-modal audio-visual perception has been a long-lasting topic in psy...

03/30/2023
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
How does audio describe the world around us? In this paper, we propose a...

02/05/2021
Learning Audio-Visual Correlations from Variational Cross-Modal Generation
People can easily imagine the potential sound while seeing an event. Thi...

08/23/2023
AdVerb: Visually Guided Audio Dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that u...

03/29/2023
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Artificial Intelligence (AI) has made incredible progress recently. On t...

10/17/2021
Taming Visually Guided Sound Generation
Recent advances in visually-induced audio generation are based on sampli...
