I Hear Your True Colors: Image Guided Audio Generation

11/06/2022
by Roy Sheffer, et al.

We propose Im2Wav, an image-guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model. We first produce a low-level audio representation using a language model. Then, we upsample the audio tokens using an additional language model to generate a high-fidelity audio sample. We use the rich semantics of a pre-trained CLIP embedding as a visual representation to condition the language model. In addition, to steer the generation process towards the conditioning image, we apply the classifier-free guidance method. Results suggest that Im2Wav significantly outperforms the evaluated baselines in both fidelity and relevance evaluation metrics. Additionally, we provide an ablation study to better assess the impact of each of the method's components on overall performance. Lastly, to better evaluate image-to-audio models, we propose an out-of-domain image dataset, denoted as ImageHear. ImageHear can be used as a benchmark for evaluating future image-to-audio models. Samples and code can be found inside the manuscript.
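
The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used formulation for autoregressive token models, in which the next-token log-probabilities computed with and without the conditioning signal are linearly extrapolated and then renormalized. The abstract does not state the exact formula or guidance scale that Im2Wav uses, so the function names, the `scale` parameter, and the toy logits below are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional next-token distributions.

    Assumed formulation (not taken from the paper):
        mixed = uncond + scale * (cond - uncond)
    scale = 1 recovers the conditional distribution; scale > 1 pushes
    probability mass towards tokens favoured by the conditioning signal.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize so probabilities sum to 1


# Toy logits over a 3-token audio codebook (illustrative values only).
cond_logits = [2.0, 0.5, -1.0]    # model conditioned on the image embedding
uncond_logits = [1.0, 1.0, 1.0]   # same model with the condition dropped

plain = cfg_log_probs(cond_logits, uncond_logits, 1.0)
guided = cfg_log_probs(cond_logits, uncond_logits, 3.0)
```

In this sketch, a model would be trained with the image embedding randomly dropped so that both sets of logits are available at sampling time; `guided` would then replace the plain conditional distribution when drawing each audio token.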


