HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models

06/12/2023
by Ji-Sang Hwang, et al.

Denoising diffusion models have recently demonstrated remarkable performance among generative models in various domains. In the speech domain, however, applying diffusion models to synthesize time-varying audio is limited by complexity and controllability, because speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that encodes audio into a compressed codec representation and reconstructs high-fidelity audio from the resulting low-dimensional latent vector. Subsequently, we use latent diffusion models to sample a latent representation conditioned on a musical score. In addition, we extend the proposed model to an unsupervised singing voice learning framework, HiddenSinger-U, which can be trained on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices for speakers seen only in unlabeled data during training.
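The two-stage idea in the abstract (compress audio to a short latent sequence, then run diffusion on that latent rather than on raw samples) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the averaging "encoder", the repeating "decoder", the linear beta schedule, and all function names are hypothetical stand-ins for the learned autoencoder and the score-conditioned denoiser, neither of which is specified in the abstract. The sketch only shows the generic DDPM-style forward noising step applied to a latent that is much shorter than the waveform.

```python
import math
import random


def toy_encode(audio, hop):
    """Hypothetical stand-in encoder: average each block of `hop` samples into one latent value."""
    return [sum(audio[i:i + hop]) / hop for i in range(0, len(audio) - hop + 1, hop)]


def toy_decode(latent, hop):
    """Hypothetical stand-in decoder: repeat each latent value `hop` times."""
    return [z for z in latent for _ in range(hop)]


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under an assumed linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(latent, t, num_steps, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(a_bar) * z_0, (1 - a_bar) * I)."""
    a_bar = alpha_bar(t, num_steps)
    return [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for z in latent]


rng = random.Random(0)
# 0.1 s toy tone at an assumed 16 kHz sampling rate.
audio = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]
latent = toy_encode(audio, hop=160)   # 1600 samples -> 10 latent values
noisy = forward_diffuse(latent, t=500, num_steps=1000, rng=rng)
recon = toy_decode(latent, hop=160)   # back to 1600 samples
```

In this toy setting the diffusion process operates on 10 values instead of 1600, which is the kind of dimensionality reduction the abstract points to as the motivation for working in a latent space. A real system would replace the stand-in encoder and decoder with a trained autoencoder and would train a network to reverse the noising step given the musical score.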


research
08/02/2023

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Deep generative models can generate high-fidelity audio conditioned on v...
research
10/14/2022

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Recent progress in deep generative models has improved the quality of ne...
research
05/30/2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Binaural audio plays a significant role in constructing immersive augmen...
research
11/19/2021

Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis (DWTS) is a technique for neural audi...
research
05/06/2021

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Singing voice synthesis (SVS) system is built to synthesize high-quality...
research
11/02/2022

Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

This paper proposes an expressive singing voice synthesis system by intr...
research
01/16/2023

Msanii: High Fidelity Music Synthesis on a Shoestring Budget

In this paper, we present Msanii, a novel diffusion-based model for synt...
