Audio Generation with Multiple Conditional Diffusion Model

08/23/2023
by Zhifang Guo, et al.

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
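The abstract describes a frozen pre-trained text-to-audio backbone extended with a trainable condition encoder (for timestamp, pitch contour, and energy contour) and a trainable Fusion-Net that merges those condition embeddings with the text features. The sketch below illustrates that wiring in plain Python. All names (`encode_text`, `ConditionEncoder`, `FusionNet`) and the toy projections are assumptions for illustration, not the paper's actual implementation:

```python
def encode_text(prompt):
    # Stand-in for the FROZEN pre-trained text encoder: maps a prompt
    # to a fixed-length feature vector (toy hash-based embedding).
    return [float(ord(c) % 7) for c in prompt[:8]]

class ConditionEncoder:
    """Trainable encoder for the extra conditions: timestamps (content)
    plus pitch and energy contours (style). Here each condition sequence
    is collapsed to one scalar; the real encoder is a learned network."""
    def __call__(self, timestamps, pitch, energy):
        return [sum(seq) / max(len(seq), 1) for seq in (timestamps, pitch, energy)]

class FusionNet:
    """Trainable module that fuses condition embeddings into the frozen
    text features. Simple additive fusion stands in for the learned one."""
    def __call__(self, text_emb, cond_emb):
        offset = sum(cond_emb)
        return [t + offset for t in text_emb]

# Wiring: only ConditionEncoder and FusionNet would be updated in training;
# encode_text (the pre-trained model) stays frozen.
text_emb = encode_text("a dog barks twice")
cond_emb = ConditionEncoder()([0.0, 1.5], [220.0, 230.0], [0.4, 0.6])
fused = FusionNet()(text_emb, cond_emb)
```

Keeping the backbone frozen preserves the diversity of the pre-trained model while the small trainable modules add fine-grained control, which is the design choice the abstract emphasizes.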

