UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

12/03/2022
by   Yi Lei, et al.

Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS in a single system is crucial for applications that require both. Existing methods suffer from notable limitations: they rely either on both singing and speaking data from the same person or on cascaded models for the two tasks. To address these problems, this paper proposes UniSyn, a simple and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice both speak and sing given only singing or speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces under speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, a supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are leveraged to further disentangle speaker timbre and style. Experiments on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation systems, which proves the effectiveness and advantages of UniSyn.
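To make the idea of two conditional latent sub-spaces concrete, below is a minimal sketch in PyTorch. It is not the paper's architecture: the module names, layer sizes, Gaussian KL terms, and the simple mel-frame reconstruction loss are illustrative assumptions, and the paper's supervised guided-VAE, timbre perturbation, and Wasserstein distance constraint are omitted.

```python
# Minimal sketch of a multi-conditional VAE (MC-VAE) with two independent
# conditional latent sub-spaces, one tied to speaker identity and one to
# style (speak vs. sing). Names, dimensions, and the simple Gaussian KL
# regularizers are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ConditionalLatent(nn.Module):
    """Encodes an input into a Gaussian latent conditioned on an embedding."""

    def __init__(self, in_dim, cond_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * z_dim),  # predicts mean and log-variance
        )

    def forward(self, x, cond):
        mu, logvar = self.net(torch.cat([x, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl


class MCVAE(nn.Module):
    """Two independent latent sub-spaces: speaker-related and style-related."""

    def __init__(self, feat_dim=80, spk_num=4, z_dim=16):
        super().__init__()
        self.spk_emb = nn.Embedding(spk_num, 64)
        self.style_emb = nn.Embedding(2, 64)          # 0 = speak, 1 = sing
        self.spk_latent = ConditionalLatent(feat_dim, 64, z_dim)
        self.style_latent = ConditionalLatent(feat_dim, 64, z_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * z_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim),
        )

    def forward(self, feats, spk_id, style_id):
        z_spk, kl_spk = self.spk_latent(feats, self.spk_emb(spk_id))
        z_sty, kl_sty = self.style_latent(feats, self.style_emb(style_id))
        recon = self.decoder(torch.cat([z_spk, z_sty], dim=-1))
        loss = (recon - feats).pow(2).mean() + kl_spk + kl_sty
        return recon, loss


if __name__ == "__main__":
    model = MCVAE()
    feats = torch.randn(8, 80)              # e.g. mel-spectrogram frames
    spk = torch.randint(0, 4, (8,))
    style = torch.randint(0, 2, (8,))
    _, loss = model(feats, spk, style)
    loss.backward()                          # sanity check: trains end to end
```

Because the speaker and style latents are conditioned and regularized independently, the style condition can in principle be switched at inference time (e.g., from speak to sing) for a speaker whose training data contains only speech, which is the behavior the abstract describes.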

Related research

02/22/2022  nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech
03/08/2021  CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge
10/07/2021  Cloning one's voice using very limited data in the wild
12/11/2018  Learning latent representations for style control and transfer in end-to-end speech synthesis
11/17/2022  NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
08/06/2019  Adversarially Trained End-to-end Korean Singing Voice Synthesis System
10/08/2021  KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms
