Diverse and Vivid Sound Generation from Text Descriptions

05/03/2023
by Guangwei Li, et al.

Previous work on audio generation has mainly focused on specific sound classes such as speech or music, whose form and content are heavily restricted. In this paper, we go beyond class-specific audio generation by using natural language descriptions as cues to generate a broad range of sounds. Unlike visual information, a text description is concise by nature but carries rich hidden meaning, which leaves more possibilities, and more complexity, for the audio to be generated. A Variation-Quantized GAN is used to train a codebook that learns discrete representations of spectrograms. For a given text description, its pre-trained embedding is fed to a Transformer, which samples codebook indices; these are decoded into a spectrogram, which a MelGAN vocoder then converts into a waveform. The generated waveform has high quality and fidelity and corresponds closely to the given text. Experiments show that the proposed method can generate natural, vivid audio and achieves strong quantitative and qualitative results.
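The pipeline described in the abstract (text embedding, autoregressive sampling of codebook indices, decoding) can be illustrated with a toy sketch. This is not the authors' implementation: the codebook is random rather than learned, `toy_next_index_logits` is a hypothetical stand-in for the Transformer, `decode` only looks up code vectors instead of producing a spectrogram, and the vocoder step is omitted. All names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the paper's actual dimensions are not given here.
CODEBOOK_SIZE, CODE_DIM, SEQ_LEN, TEXT_DIM = 16, 8, 12, 32

# Stand-in for a learned codebook (random here, assumed learned by the GAN).
codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))


def quantize(latents):
    """Map each latent vector to the index of its nearest codebook entry."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def toy_next_index_logits(text_embedding, prefix, weights):
    """Hypothetical stand-in for the Transformer.

    Produces logits over codebook indices from the text embedding and the
    code vector of the most recently sampled index.
    """
    last = codebook[prefix[-1]] if prefix else np.zeros(CODE_DIM)
    features = np.concatenate([text_embedding, last])
    return features @ weights


def sample_indices(text_embedding, weights, seq_len=SEQ_LEN):
    """Autoregressively sample a sequence of codebook indices."""
    indices = []
    for _ in range(seq_len):
        logits = toy_next_index_logits(text_embedding, indices, weights)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        indices.append(int(rng.choice(CODEBOOK_SIZE, p=probs)))
    return np.array(indices)


def decode(indices):
    """Look up code vectors; a real decoder would map these to a spectrogram."""
    return codebook[indices]


# Random vector standing in for a pre-trained text embedding.
text_embedding = rng.normal(size=TEXT_DIM)
weights = rng.normal(size=(TEXT_DIM + CODE_DIM, CODEBOOK_SIZE)) * 0.1

indices = sample_indices(text_embedding, weights)
latent_grid = decode(indices)
```

Because sampling draws from a probability distribution rather than taking the most likely index, the same text embedding can yield different index sequences, which is one plausible source of the diversity the title refers to.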

research
05/11/2023

V2Meow: Meowing to the Visual Beat via Music Generation

Generating high quality music that complements the visual content of a v...
research
10/17/2021

Taming Visually Guided Sound Generation

Recent advances in visually-induced audio generation are based on sampli...
research
04/28/2021

AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

This paper proposes a neural network that performs audio transformations...
research
03/19/2023

Audio-Text Models Do Not Yet Leverage Natural Language

Multi-modal contrastive learning techniques in the audio-text domain hav...
research
01/26/2023

MusicLM: Generating Music From Text

We introduce MusicLM, a model generating high-fidelity music from text d...
research
09/05/2023

Generating Realistic Images from In-the-wild Sounds

Representing wild sounds as images is an important but challenging task ...
research
09/21/2021

Audio Interval Retrieval using Convolutional Neural Networks

Modern streaming services are increasingly labeling videos based on thei...
