Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

05/29/2023
by Jiawei Huang, et al.

Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, the 2D spatial structures widely used in T2A work lead to unsatisfactory audio quality when generating variable-length audio samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-An-Audio 2, a latent-diffusion-based T2A method that builds on the success of Make-An-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce a second, structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve variable-length generation and enhance temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets, alleviating the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models on both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
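The abstract does not include code, so the snippet below is only a minimal sketch of the first technique: prompting a pre-trained LLM to parse a free-form caption into structured <event & order> pairs. The prompt wording, the order vocabulary (start/mid/end/all), and the `llm` callable are illustrative assumptions, not the authors' implementation.

```python
# Sketch of LLM-based caption parsing into <event & order> pairs.
# The prompt text, order vocabulary, and `llm` callable are hypothetical
# stand-ins; the paper specifies the idea (structured parsing via a
# pre-trained LLM) but not this exact interface.

PARSE_PROMPT = """Extract the sound events from the audio caption below and
list each as '<event> & <order>', where order is one of: start, mid, end,
all. Write one pair per line.

Caption: {caption}
Pairs:"""


def parse_caption(caption: str, llm) -> list[tuple[str, str]]:
    """Ask an LLM to turn a free-form caption into (event, order) pairs."""
    raw = llm(PARSE_PROMPT.format(caption=caption))
    pairs = []
    for line in raw.strip().splitlines():
        if "&" in line:
            event, order = line.split("&", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs


if __name__ == "__main__":
    # Stub LLM so the sketch runs without network access.
    def fake_llm(prompt: str) -> str:
        return "a dog barks & start\na man speaks & all\nbirds chirp & end"

    print(parse_caption("A dog barks while a man speaks, then birds chirp",
                        fake_llm))
```

The resulting pairs would then be serialized and fed to the structured-text encoder alongside the original caption during denoising.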
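The feed-forward Transformer denoiser can be sketched in the same spirit. The point of the 1D design is that self-attention over a temporal latent sequence is length-agnostic, so one module serves audio of any duration. All layer sizes, the sinusoidal timestep embedding, and the prepended pooled-text conditioning below are assumptions chosen for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a feed-forward Transformer diffusion denoiser over a
# 1D temporal latent sequence. Dimensions, timestep embedding, and text
# conditioning are illustrative assumptions, not the paper's design.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0)
                      * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class FFTDenoiser(nn.Module):
    """Predicts the noise in a latent sequence x_t given t and text features."""

    def __init__(self, latent_dim=8, model_dim=256, text_dim=256,
                 num_layers=4, num_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.cond_proj = nn.Linear(text_dim, model_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim), nn.SiLU(),
            nn.Linear(model_dim, model_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads,
            dim_feedforward=4 * model_dim, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(model_dim, latent_dim)

    def forward(self, x_t, t, text_feats):
        # x_t: (B, T, latent_dim); text_feats: (B, T_text, text_dim).
        h = self.in_proj(x_t)
        h = h + self.time_mlp(timestep_embedding(t, h.shape[-1]))[:, None, :]
        # Prepend pooled text conditioning as a simple illustrative choice.
        cond = self.cond_proj(text_feats).mean(dim=1, keepdim=True)
        h = self.encoder(torch.cat([cond, h], dim=1))[:, 1:, :]
        return self.out_proj(h)


if __name__ == "__main__":
    model = FFTDenoiser()
    x = torch.randn(2, 100, 8)       # any sequence length works
    t = torch.randint(0, 1000, (2,))
    text = torch.randn(2, 20, 256)
    print(model(x, t, text).shape)   # torch.Size([2, 100, 8])
```

Because nothing in the module depends on a fixed sequence length, latents of any duration pass through unchanged, which is the property the abstract attributes to replacing 2D spatial structures with a temporal Transformer.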

Related research

05/22/2023 - DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment
Text-to-audio (TTA) generation is a recent popular problem that aims to ...

06/08/2023 - Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning
Recent works have shown the potential of diffusion models in computer vi...

12/19/2022 - Latent Diffusion for Language Generation
Diffusion models have achieved great success in modeling continuous data...

11/19/2022 - VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Video to sound generation aims to generate realistic and natural sound g...

06/12/2019 - CogCompTime: A Tool for Understanding Time in Natural Language Text
Automatic extraction of temporal information in text is an important com...

08/23/2023 - Audio Generation with Multiple Conditional Diffusion Model
Text-based audio generation models have limitations as they cannot encom...

02/12/2023 - SemanticAC: Semantics-Assisted Framework for Audio Classification
In this paper, we propose SemanticAC, a semantics-assisted framework for...
