Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

01/30/2023
by Rongjie Huang, et al.

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio, a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity by using language-free audio to build orders of magnitude more concept compositions; and 2) leveraging a spectrogram autoencoder to predict self-supervised audio representations instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluations. Moreover, we demonstrate its controllability and generalization for X-to-Audio under the principle of "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io

research
06/16/2023

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Recent work has studied text-to-audio synthesis using large amounts of p...
research
04/17/2023

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
research
03/08/2023

New Audio Representations Image Gan Generation from BriVL

Recently, researchers have gradually realized that in some cases, the se...
research
05/22/2023

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech(TTS) has undergone remarkable improvements in performance...
research
08/23/2023

Audio Generation with Multiple Conditional Diffusion Model

Text-based audio generation models have limitations as they cannot encom...
research
09/20/2023

A Large-scale Dataset for Audio-Language Representation Learning

The AI community has made significant strides in developing powerful fou...
research
08/12/2023

Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Spotting user-defined/flexible keywords represented in text frequently u...
