MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

by Yusong Wu, et al.

Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that lets individuals intervene at each level or use trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By using an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
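The three-level hierarchy described in the abstract can be illustrated with a small toy sketch. This is not the MIDI-DDSP implementation or its API: every class and function name below is hypothetical, and the two "priors" are hand-written placeholder rules standing in for the trained models. The sketch only shows the control flow in which a user may supply values at any level and the remaining levels are filled in from the level above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Note:
    """Top level: what is played (hypothetical container)."""
    pitch: int       # MIDI note number
    duration: float  # seconds


@dataclass(frozen=True)
class Performance:
    """Middle level: how a note is played, each value in [0, 1].
    Attribute names follow the abstract; the scaling is an assumption."""
    vibrato: float
    dynamics: float
    articulation: float


@dataclass(frozen=True)
class SynthesisParams:
    """Bottom level: simplified stand-in for DDSP synthesis parameters."""
    f0_hz: float
    amplitude: float


def performance_prior(notes: List[Note]) -> List[Performance]:
    # Placeholder for a trained prior (performance given notes).
    # Toy rule: longer notes receive more vibrato.
    return [Performance(min(1.0, n.duration), 0.5, 0.5) for n in notes]


def synthesis_prior(notes: List[Note],
                    perf: List[Performance]) -> List[SynthesisParams]:
    # Placeholder for a trained prior (synthesis given performance).
    # Uses the standard MIDI-to-frequency mapping with A4 = 440 Hz.
    return [
        SynthesisParams(440.0 * 2 ** ((n.pitch - 69) / 12), p.dynamics)
        for n, p in zip(notes, perf)
    ]


def render(notes: List[Note],
           performance: Optional[List[Performance]] = None,
           synthesis: Optional[List[SynthesisParams]] = None
           ) -> List[SynthesisParams]:
    """Fill in any level the user did not specify from the level above."""
    if performance is None:
        performance = performance_prior(notes)
    if synthesis is None:
        synthesis = synthesis_prior(notes, performance)
    return synthesis


notes = [Note(pitch=69, duration=0.5), Note(pitch=72, duration=2.0)]

# Fully automatic: both lower levels come from the placeholder priors.
auto = render(notes)

# Intervene at the performance level: raise dynamics, keep the rest.
edited = [replace(p, dynamics=0.9) for p in performance_prior(notes)]
louder = render(notes, performance=edited)
```

In this toy version, editing the performance level changes only the amplitude passed down to synthesis, while the note level is untouched, which mirrors the kind of independent, per-level intervention the abstract describes.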




Multi-instrument Music Synthesis with Spectrogram Diffusion

An ideal music synthesizer should be both interactive and expressive, ge...

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Generative models in vision have seen rapid progress due to algorithmic ...

Hierarchical Linear Dynamical System for Representing Notes from Recorded Audio

We seek to develop simultaneous segmentation and classification of notes...

Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders

Generative models have thrived in computer vision, enabling unprecedente...

Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-Supervised Learning

We propose a system for rendering a symbolic piano performance with flex...

Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis (DWTS) is a technique for neural audi...

Vocal Tract Area Estimation by Gradient Descent

Articulatory features can provide interpretable and flexible controls fo...
