Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

by Seokjin Lee, et al.

Deep generative models for audio synthesis have improved significantly in recent years. However, modeling raw waveforms remains difficult, especially for audio and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. RAVE is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in its ability to reproduce polyphonic music with a wide pitch range. Therefore, to enhance reconstruction performance, we supply pitch activation data to the RAVE model as auxiliary information. To handle this auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test, using the MAESTRO dataset. The results indicate that the proposed model offers a marked improvement in performance and stability over the conventional RAVE model.
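As a rough illustration of the conditioning idea described in the abstract, the following toy sketch passes a pitch activation vector through a fully-connected layer and appends the result to a sampled latent vector before it would reach a decoder. This is not the authors' implementation: the 88-key activation layout, the layer sizes, the tanh nonlinearity, the concatenation point, and all function names are assumptions made for illustration, and the encoder outputs are stand-in values.

```python
import math
import random

N_PITCHES = 88   # assumption: one activation per piano key
LATENT_DIM = 4   # assumption: toy latent size
COND_DIM = 3     # assumption: size of the projected pitch embedding


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) for a toy fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply the fully-connected layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def condition_latent(z, pitch_activation, cond_layer):
    """Project the pitch activation and append it to the latent vector."""
    embedding = [math.tanh(v) for v in linear(cond_layer, pitch_activation)]
    return z + embedding


rng = random.Random(0)
cond_layer = make_linear(N_PITCHES, COND_DIM, rng)

# Hypothetical frame: a C major triad (MIDI 60, 64, 67) on a roll
# whose first entry is assumed to correspond to MIDI note 21.
pitch_activation = [0.0] * N_PITCHES
for midi_note in (60, 64, 67):
    pitch_activation[midi_note - 21] = 1.0

# Stand-ins for encoder outputs; a real encoder would compute these from audio.
mean = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM

z = reparameterize(mean, log_var, rng)
decoder_input = condition_latent(z, pitch_activation, cond_layer)
print(len(decoder_input))  # LATENT_DIM + COND_DIM
```

In a conditional variational autoencoder the decoder sees both the latent sample and the condition, so pitch content no longer has to be encoded entirely in the latent vector. Where exactly the paper injects the condition is not stated in the abstract.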


