VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

10/17/2021
by Yongmao Zhang, et al.

In this paper, we propose VISinger, a complete end-to-end, high-quality singing voice synthesis (SVS) system that directly generates audio waveforms from lyrics and a musical score. Our approach is inspired by VITS, which adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to obtain the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech + neural vocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the individual contributions.
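
To make two of the ideas above concrete, the following is a minimal sketch, not the authors' implementation, of how a phoneme-to-note duration ratio could be turned into frame counts and how a length regulator could then expand phoneme-level features to frame level. The function names, the rounding scheme, and the renormalization of ratios within each note are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def durations_from_ratios(note_frames: Sequence[int],
                          phoneme_to_note: Sequence[int],
                          ratios: Sequence[float]) -> List[int]:
    """Convert per-phoneme duration ratios into integer frame counts.

    Assumption: ratios are renormalized within each note, so the phonemes
    sung on one note always fill exactly that note's length in frames.
    """
    durations = [0] * len(ratios)
    for note_idx, total in enumerate(note_frames):
        # Indices of the phonemes that belong to this note.
        members = [i for i, n in enumerate(phoneme_to_note) if n == note_idx]
        if not members:
            continue
        ratio_sum = sum(ratios[i] for i in members)
        assigned = 0       # frames already given to earlier phonemes
        cumulative = 0.0   # running share of the note covered so far
        for i in members:
            cumulative += ratios[i] / ratio_sum
            # Round cumulative boundaries (not individual durations) so the
            # counts add up; the last phoneme always ends at the note's end.
            boundary = total if i == members[-1] else round(cumulative * total)
            durations[i] = boundary - assigned
            assigned = boundary
    return durations


def length_regulate(phoneme_features: Sequence[T],
                    durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level feature for its number of frames."""
    frames: List[T] = []
    for feature, duration in zip(phoneme_features, durations):
        frames.extend([feature] * duration)
    return frames
```

For example, with two notes of 40 and 20 frames, three phonemes mapped to notes `[0, 0, 1]`, and ratios `[0.25, 0.75, 1.0]`, `durations_from_ratios` yields `[10, 30, 20]`, and `length_regulate` expands three phoneme features into 60 frame-level entries. In the system described above, a frame prior network would then operate on such a frame-level sequence; that network is not sketched here.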

research
06/11/2021

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Several recent end-to-end text-to-speech (TTS) models enabling single-st...
research
09/21/2022

Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

Singing voice synthesis (SVS) is the computer production of a human-like...
research
06/11/2020

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

This paper presents XiaoiceSing, a high-quality singing voice synthesis ...
research
10/28/2022

Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

Several fully end-to-end text-to-speech (TTS) models have been proposed ...
research
06/08/2023

VIFS: An End-to-End Variational Inference for Foley Sound Synthesis

The goal of DCASE 2023 Challenge Task 7 is to generate various sound cli...
research
08/05/2023

A Systematic Exploration of Joint-training for Singing Voice Synthesis

There has been a growing interest in using end-to-end acoustic models fo...
research
05/09/2022

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Text to speech (TTS) has made rapid progress in both academia and indust...
