Noise2Music: Text-conditioned Music Generation with Diffusion Models

02/08/2023
by Qingqing Huang, et al.

We introduce Noise2Music, in which a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion model are trained and applied in succession: a generator model, which produces an intermediate representation conditioned on text, and a cascader model, which produces high-fidelity audio conditioned on the intermediate representation and possibly the text. We explore two options for the intermediate representation, one using a spectrogram and the other using lower-fidelity audio. We find that the generated audio not only faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era, but also grounds fine-grained semantics of the prompt. Pretrained large language models play a key role: they are used to generate paired text for the audio in the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
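
The generator-then-cascader data flow described in the abstract can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the authors' implementation: `embed_text`, `toy_denoise`, `generator`, `cascader`, and `generate` are hypothetical names, the "denoiser" is a fixed averaging rule rather than a trained diffusion model, and the intermediate representation is modeled as a short list of floats standing in for lower-fidelity audio.

```python
import random


def embed_text(prompt, dim=8):
    # Hypothetical stand-in for a pretrained language-model text encoder:
    # a deterministic pseudo-embedding seeded by the prompt string.
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_denoise(noisy, cond, steps=10):
    # Placeholder for a learned reverse-diffusion loop (assumption):
    # each step moves the sample halfway toward the conditioning signal.
    x = list(noisy)
    for _ in range(steps):
        x = [0.5 * xi + 0.5 * cond[i % len(cond)] for i, xi in enumerate(x)]
    return x


def generator(text_emb, length, rng):
    # Stage 1: start from Gaussian noise and produce a low-fidelity
    # intermediate representation conditioned on the text embedding.
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    return toy_denoise(noise, text_emb)


def cascader(intermediate, factor, rng, text_emb=None):
    # Stage 2: start from fresh noise at the higher resolution and
    # condition on the upsampled intermediate. The text embedding is
    # accepted because the abstract says it is "possibly" used, but
    # this toy version ignores it.
    upsampled = [v for v in intermediate for _ in range(factor)]
    noise = [rng.gauss(0.0, 1.0) for _ in upsampled]
    return toy_denoise(noise, upsampled)


def generate(prompt, low_len=16, factor=4, seed=0):
    # Run the two stages in succession, as the abstract describes.
    rng = random.Random(seed)
    text_emb = embed_text(prompt)
    intermediate = generator(text_emb, low_len, rng)
    return cascader(intermediate, factor, rng, text_emb)
```

Calling `generate("upbeat jazz with piano")` returns a list of `low_len * factor` samples. The point of the sketch is the staging: the second model never sees the raw prompt alone, only the first model's output and, optionally, the text embedding.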

research
08/02/2023

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Deep generative models can generate high-fidelity audio conditioned on v...
research
05/16/2023

Generating coherent comic with rich story using ChatGPT and Stable Diffusion

Past work demonstrated that using neural networks, we can extend unfinis...
research
01/26/2023

MusicLM: Generating Music From Text

We introduce MusicLM, a model generating high-fidelity music from text d...
research
01/16/2023

Msanii: High Fidelity Music Synthesis on a Shoestring Budget

In this paper, we present Msanii, a novel diffusion-based model for synt...
research
04/07/2023

What does ChatGPT return about human values? Exploring value bias in ChatGPT using a descriptive value theory

There has been concern about ideological basis and possible discriminati...
research
05/22/2023

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

In recent years, image generation has shown a great leap in performance,...
research
05/30/2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Binaural audio plays a significant role in constructing immersive augmen...
