Multi-instrument Music Synthesis with Spectrogram Diffusion

06/11/2022
by Curtis Hawthorne, et al.

An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments and raw waveform models that can train on all of music but offer minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
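To make the DDPM idea mentioned in the abstract concrete, the following is a minimal NumPy sketch of the generic forward-noising step and the noise-prediction loss that a DDPM decoder would be trained with. It is not the authors' implementation: the linear noise schedule, the number of steps, the spectrogram shape, and all function names are assumptions chosen for illustration, and the MIDI-conditioned Transformer that would predict the noise is omitted.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the paper's actual schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_eps - true_eps) ** 2))


# Cumulative signal-retention factor alpha_bar_t = prod_{s <= t} (1 - beta_s).
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
# Hypothetical spectrogram segment with shape (frames, mel bins).
spectrogram = rng.standard_normal((256, 128))

# Corrupt the clean spectrogram at an intermediate diffusion step.
x_t, eps = forward_noise(spectrogram, t=500, alpha_bar=alpha_bar, rng=rng)

# In training, a network conditioned on the MIDI encoding would predict eps
# from (x_t, t); here a placeholder prediction of zeros stands in for it.
placeholder_loss = denoising_loss(np.zeros_like(eps), eps)
```

At generation time, a trained model would run the reverse process, starting from pure noise and denoising step by step into a spectrogram, which the second stage (the GAN spectrogram inverter in the abstract) would then convert to audio.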

research
11/01/2018

Neural Music Synthesis for Flexible Timbre Control

The recent success of raw audio waveform synthesis models like WaveNet m...
research
12/17/2021

MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Musical expression requires control of both what notes are played, and h...
research
10/23/2018

SING: Symbol-to-Instrument Neural Generator

Recent progress in deep learning for audio synthesis opens the way to mo...
research
06/28/2018

GenerationMania: Learning to Semantically Choreograph

Beatmania is a rhythm action game where players play the role of a DJ th...
research
03/13/2020

Audio inpainting with generative adversarial network

We study the ability of Wasserstein Generative Adversarial Network (WGAN...
research
07/13/2021

The Piano Inpainting Application

Autoregressive models are now capable of generating high-quality minute-...
research
11/04/2021

MT3: Multi-Task Multitrack Music Transcription

Automatic Music Transcription (AMT), inferring musical notes from raw au...
