UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

06/06/2022
by Jiachen Lian, et al.

In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer developed from the perspective of disentangled speech representation learning. The framework offers flexible choices of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts the FA to the unsupervised alignment (UA). Finally, a Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE), serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to a waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech with high naturalness and intelligibility, as measured by both human and objective evaluations.
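The inference pipeline described above (lexicon lookup, duration-based expansion to a forced alignment, FA-to-UA mapping, C-DSVAE acoustic model, vocoder) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names, the callable interfaces for `duration_model`, `fa_to_ua`, `c_dsvae`, and `vocoder`, and the data shapes are assumptions made for clarity.

```python
# Hedged sketch of the UTTS inference pipeline from the abstract.
# Module names and interfaces are illustrative assumptions only.

def text_to_phonemes(text, lexicon):
    """Map input text to a phoneme sequence via lexicon lookup."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon[word])
    return phonemes

def expand_to_forced_alignment(phonemes, durations):
    """Expand phonemes to a frame-level forced alignment (FA):
    each phoneme is repeated for its predicted number of frames."""
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

def utts_inference(text, lexicon, duration_model, fa_to_ua,
                   c_dsvae, vocoder, speaker_embedding):
    """End-to-end UTTS inference: text -> phonemes -> FA -> UA
    -> mel spectrogram -> waveform."""
    phonemes = text_to_phonemes(text, lexicon)
    durations = duration_model(phonemes)       # speaker-dependent durations
    fa = expand_to_forced_alignment(phonemes, durations)
    ua = fa_to_ua(fa)                          # alignment mapping: FA -> UA
    mel = c_dsvae(ua, speaker_embedding)       # C-DSVAE as the TTS AM
    return vocoder(mel)                        # neural vocoder -> waveform
```

In the actual system each stage would be a trained neural module; here the point is only the data flow, in which the acoustic model never sees text directly, only the unsupervised alignment and the speaker embedding.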

Related research

- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations (04/01/2021)
- Supervised Speech Representation Learning for Parkinson's Disease Classification (06/01/2021)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis (06/15/2021)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection (04/06/2023)
- CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network (05/17/2019)
- SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model (04/23/2023)
- Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling (06/17/2019)
