VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

11/05/2022
by Yongmao Zhang, et al.

The end-to-end singing voice synthesis (SVS) model VISinger achieves better performance than typical two-stage models while using fewer parameters. However, VISinger has three problems: (1) the text-to-phase problem, in which the end-to-end model learns a meaningless mapping from text to phase; (2) the glitch problem, in which the harmonic components corresponding to the periodic signal in voiced segments change abruptly, producing audible artefacts; and (3) a low sampling rate, since its 24 kHz sampling rate does not meet the needs of high-fidelity, full-band generation (44.1 kHz or higher). In this paper, we propose VISinger 2, which addresses these issues by integrating digital signal processing (DSP) methods into VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer, which generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. The synthesizer supervises the posterior encoder so that it extracts a latent representation without phase information, which keeps the prior encoder from modelling a text-to-phase mapping. To avoid glitch artefacts, HiFi-GAN is modified to take the waveforms generated by the DSP synthesizer as a condition when producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 generates 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.
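As a rough illustration of the harmonic-plus-noise idea described above, the following NumPy sketch builds a periodic signal from a sum of sinusoids at multiples of F0 and an aperiodic signal from spectrally shaped white noise. It is not the authors' implementation: in VISinger 2 the synthesizer parameters are predicted from the latent representation z and the outputs condition the modified HiFi-GAN, whereas here the frame-level inputs are plain arrays. The function names, the hop size of 512 samples, and the per-frame filtering scheme are all assumptions made for this example.

```python
import numpy as np

SAMPLE_RATE = 44100  # full-band rate mentioned in the abstract
HOP = 512            # hypothetical frame hop in samples (assumption)


def upsample(frames, hop):
    """Linearly interpolate frame-level values (T,) to sample level (T * hop,)."""
    n_frames = frames.shape[0]
    t_frames = np.arange(n_frames) * hop
    t_samples = np.arange(n_frames * hop)
    return np.interp(t_samples, t_frames, frames)


def harmonic_synth(f0, amps, sample_rate=SAMPLE_RATE, hop=HOP):
    """Periodic part: sum of K sinusoids at integer multiples of F0.

    f0:   (T,)   frame-level fundamental frequency in Hz (0 for unvoiced)
    amps: (T, K) frame-level amplitude of each harmonic
    """
    f0_s = upsample(f0, hop)
    # Integrate instantaneous frequency to obtain a continuous phase.
    phase = 2.0 * np.pi * np.cumsum(f0_s) / sample_rate
    out = np.zeros_like(f0_s)
    for k in range(amps.shape[1]):
        harmonic = k + 1
        a = upsample(amps[:, k], hop)
        # Silence harmonics at or above the Nyquist frequency to avoid aliasing.
        a = np.where(f0_s * harmonic < sample_rate / 2, a, 0.0)
        out += a * np.sin(harmonic * phase)
    return out


def noise_synth(band_gains, hop=HOP, seed=0):
    """Aperiodic part: white noise shaped frame by frame in the frequency domain.

    band_gains: (T, B) frame-level magnitude for B coarse frequency bands
    """
    rng = np.random.default_rng(seed)
    n_frames, n_bands = band_gains.shape
    out = np.zeros(n_frames * hop)
    for t in range(n_frames):
        noise = rng.uniform(-1.0, 1.0, hop)
        spec = np.fft.rfft(noise)  # hop // 2 + 1 frequency bins
        # Stretch the B band gains across all FFT bins.
        gains = np.interp(
            np.linspace(0, n_bands - 1, spec.shape[0]),
            np.arange(n_bands),
            band_gains[t],
        )
        out[t * hop:(t + 1) * hop] = np.fft.irfft(spec * gains, n=hop)
    return out


if __name__ == "__main__":
    f0 = np.full(10, 220.0)            # ten frames of a steady 220 Hz tone
    amps = np.full((10, 4), 0.1)       # four harmonics, equal amplitude
    gains = np.full((10, 8), 0.05)     # flat, quiet noise floor
    dsp_wave = harmonic_synth(f0, amps) + noise_synth(gains)
    print(dsp_wave.shape)
```

Filtering each frame independently keeps the sketch short; a practical synthesizer would typically window and overlap-add the noise frames to avoid discontinuities at frame boundaries.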

