Taming Visually Guided Sound Generation

10/17/2021
by Vladimir Iashin, et al.

Recent advances in visually induced audio generation are based on sampling short, low-fidelity, single-class sounds. Moreover, sampling one second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos, in less time than it takes to play them, on a single GPU. We train a transformer to sample a new spectrogram from a pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Given the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics, FID and MKL, based on a novel sound classifier called Melception and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of the generated samples. We also compare our model to the state of the art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
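The generation scheme the abstract describes — an autoregressive transformer that samples discrete codebook indices conditioned on video features, which are then decoded back into a spectrogram — can be sketched in miniature as follows. Everything here (the codebook size, the scoring function, the greedy decoding) is a hypothetical toy stand-in for illustration, not the paper's actual SpecVQGAN code:

```python
import random

# Toy stand-ins (all hypothetical): the real model uses a GPT-style
# transformer conditioned on video features and a VQGAN codebook of
# spectrogram patches.
CODEBOOK_SIZE = 8
CODE_DIM = 4
random.seed(0)
codebook = [[random.random() for _ in range(CODE_DIM)]
            for _ in range(CODEBOOK_SIZE)]

def transformer_logits(video_feats, prefix):
    """Hypothetical autoregressive prior: scores every codebook entry
    given the conditioning video features and the codes sampled so far."""
    return [sum(video_feats) + len(prefix) * 0.1 + i * 0.01
            for i in range(CODEBOOK_SIZE)]

def sample_codes(video_feats, n_codes):
    """Greedily sample spectrogram code indices one at a time."""
    prefix = []
    for _ in range(n_codes):
        logits = transformer_logits(video_feats, prefix)
        prefix.append(max(range(CODEBOOK_SIZE), key=lambda i: logits[i]))
    return prefix

video_feats = [0.2, 0.5]                    # toy conditioning features
codes = sample_codes(video_feats, n_codes=5)
spectrogram = [codebook[i] for i in codes]  # decode: codebook lookup
print(codes)
```

In the full system, the decoded spectrogram would then be passed to the window-based GAN vocoder to produce a waveform; a real implementation would also sample from the softmax over logits rather than taking the argmax.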
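The FID metric mentioned above compares the Gaussian statistics of classifier features (here, Melception features) extracted from real and generated samples via the Fréchet distance. A minimal sketch of the univariate case — the full metric applies the multivariate form, with a matrix square root, to feature vectors — using made-up toy numbers:

```python
import math

def frechet_distance_1d(mu1, var1, mu2, var2):
    """Frechet distance between two univariate Gaussians:
    (mu1 - mu2)^2 + var1 + var2 - 2 * sqrt(var1 * var2)."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)

def fit_gaussian(xs):
    """Fit mean and (population) variance to a sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

# Toy 1-D "classifier features" for real vs. generated audio.
real = [0.9, 1.1, 1.0, 1.2]
fake = [0.4, 0.6, 0.5, 0.7]
fid = frechet_distance_1d(*fit_gaussian(real), *fit_gaussian(fake))
print(round(fid, 4))  # ≈ 0.25: the means differ, the variances match
```

Lower values indicate that the generated distribution is closer to the real one; the MKL metric from the paper instead compares the classifier's class posteriors per sample.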

Related research

- I Hear Your True Colors: Image Guided Audio Generation (11/06/2022)
  We propose Im2Wav, an image guided open-domain audio generation system. ...

- Diverse and Vivid Sound Generation from Text Descriptions (05/03/2023)
  Previous audio generation mainly focuses on specified sound classes such...

- WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU (05/15/2020)
  In this paper, we propose WG-WaveNet, a fast, lightweight, and high-qual...

- Hierarchical Timbre-Painting and Articulation Generation (08/30/2020)
  We present a fast and high-fidelity method for music generation, based o...

- CaloFlow II: Even Faster and Still Accurate Generation of Calorimeter Showers with Normalizing Flows (10/21/2021)
  Recently, we introduced CaloFlow, a high-fidelity generative model for G...

- Efficient Neural Audio Synthesis (02/23/2018)
  Sequential models achieve state-of-the-art results in audio, visual and ...

- V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models (08/18/2023)
  Building artificial intelligence (AI) systems on top of a set of foundat...
