Hierarchical Diffusion Models for Singing Voice Neural Vocoder

10/14/2022
by   Naoya Takahashi, et al.

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging because singing exhibits a much wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates: the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, conditioned on the data at the lower sampling rate and on acoustic features. Experimental results show that the proposed method produces high-quality singing voices for multiple singers, outperforming state-of-the-art neural vocoders at a similar range of computational cost.
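The coarse-to-fine generation described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the sampling rates, the noise schedule, the nearest-neighbour upsampling, the standard DDPM ancestral sampler, and the hypothetical `placeholder_eps_model`, which stands in for a trained noise-prediction network and simply returns zeros.

```python
import numpy as np


def placeholder_eps_model(x_t, t, mel, lower_up):
    """Hypothetical stand-in for a trained noise predictor.

    A real model would use the acoustic features `mel` and the
    upsampled lower-rate waveform `lower_up`; this placeholder only
    checks that the conditioning is aligned and predicts zero noise.
    """
    if lower_up is not None:
        assert lower_up.shape == x_t.shape
    return np.zeros_like(x_t)


def reverse_diffusion(length, betas, mel, lower_up, rng):
    """Standard DDPM ancestral sampling (assumed sampler) for one rate."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(length)  # start from Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps = placeholder_eps_model(x, t, mel, lower_up)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(length)
    return x


def hierarchical_sample(rates, n_lowest, mel, betas, seed=0):
    """Generate waveforms from the lowest to the highest sampling rate.

    `rates` must be ascending, with each rate an integer multiple of
    the previous one. `n_lowest` is the number of samples generated at
    the lowest rate.
    """
    rng = np.random.default_rng(seed)
    outputs = {}
    lower, length = None, n_lowest
    for i, sr in enumerate(rates):
        lower_up = None
        if lower is not None:
            factor = sr // rates[i - 1]
            length = len(lower) * factor
            # Nearest-neighbour upsampling, chosen only for simplicity.
            lower_up = np.repeat(lower, factor)
        lower = reverse_diffusion(length, betas, mel, lower_up, rng)
        outputs[sr] = lower
    return outputs


mel = np.zeros((4, 80))                  # hypothetical acoustic features
betas = np.linspace(1e-4, 0.02, 20)      # hypothetical noise schedule
waves = hierarchical_sample([6000, 12000, 24000], 60, mel, betas)
print({sr: w.shape for sr, w in waves.items()})
```

Because the placeholder predicts zero noise, the output is not audio. The sketch only shows the control flow: the lowest-rate stage is conditioned on acoustic features alone, and each later stage additionally receives the upsampled output of the stage below it.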


research
06/12/2023

HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models

Recently, denoising diffusion models have demonstrated remarkable perfor...
research
09/03/2020

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

High-fidelity singing voices usually require higher sampling rate (e.g.,...
research
09/21/2022

Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

Singing voice synthesis (SVS) is the computer production of a human-like...
research
09/28/2021

Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

Voice conversion is a common speech synthesis task which can be solved i...
research
09/06/2023

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

We propose a highly controllable voice manipulation system that can perf...
research
06/22/2023

A prior regularized full waveform inversion using generative diffusion models

Full waveform inversion (FWI) has the potential to provide high-resoluti...
research
05/12/2019

Improving Opus Low Bit Rate Quality with Neural Speech Synthesis

The voice mode of the Opus audio coder can compress wideband speech at b...
