Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

10/12/2022
by   Byoung Jin Choi, et al.
0

Several recently proposed text-to-speech (TTS) models achieved to generate the speech samples with the human-level quality in the single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice with a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), is still a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem upon the speech generation of a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates an additional speech of a query speaker using the external untranscribed datasets at each training iteration. Then, the model learns to consistently generate the speech sample of the same speaker as the corresponding speaker embedding vector by employing an adversarial learning scheme. The experimental results show that the proposed method is effective compared to the baseline in terms of the quality and speaker similarity in ZSM-TTS.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/30/2022

SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate ...
research
08/28/2023

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

For personalized speech generation, a neural text-to-speech (TTS) model ...
research
04/24/2022

Few-Shot Speaker Identification Using Depthwise Separable Convolutional Network with Channel Attention

Although few-shot learning has attracted much attention from the fields ...
research
06/24/2022

Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

The cloning of a speaker's voice using an untranscribed reference sample...
research
03/29/2022

Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Training a text-to-speech (TTS) model requires a large scale text labele...
research
11/07/2021

Speaker Generation

This work explores the task of synthesizing speech in nonexistent human-...
research
06/10/2021

Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

Text-to-speech systems recently achieved almost indistinguishable qualit...

Please sign up or login with your details

Forgot password? Click here to reset