A Systematic Exploration of Joint-training for Singing Voice Synthesis

08/05/2023
by Yuning Wu, et al.

There is growing interest in end-to-end acoustic models for singing voice synthesis (SVS). These models typically require a separate vocoder to transform the generated acoustic features into the final waveform. However, because the acoustic model and the vocoder are not jointly optimized, a mismatch can arise between the two, leading to suboptimal performance. Although text-to-speech (TTS) systems have addressed a similar problem through joint training or by replacing acoustic features with a latent representation, adapting these approaches to SVS is not straightforward, and how to improve joint training for SVS systems has not been well explored. In this paper, we systematically investigate how to better jointly train an acoustic model and a vocoder for SVS. Through extensive experiments, we demonstrate that our joint-training strategy outperforms baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.

