Multi-Speaker End-to-End Speech Synthesis

by Jihyun Park, et al.

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristics of different voices, low-dimensional trainable speaker embeddings are shared across each component of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
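The conditioning scheme the abstract describes — one low-dimensional trainable embedding per speaker, shared by every component of the model — can be sketched as follows. This is an illustrative toy in NumPy, not the paper's implementation; the component names, dimensions, and concatenation-based conditioning are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared speaker-embedding table: one low-dimensional
# trainable vector per speaker (values here are random placeholders).
num_speakers, embed_dim = 4, 16
speaker_table = rng.normal(scale=0.1, size=(num_speakers, embed_dim))

def condition(features, speaker_id):
    """Concatenate the speaker's embedding onto every frame of `features`.

    Because every component looks up the same `speaker_table`, gradients
    from all components would flow into one embedding per speaker,
    which is what joint end-to-end training of the embeddings means.
    """
    emb = speaker_table[speaker_id]                        # (embed_dim,)
    tiled = np.broadcast_to(emb, (features.shape[0], embed_dim))
    return np.concatenate([features, tiled], axis=-1)

# The same table conditions two hypothetical components' inputs:
text_frames = rng.normal(size=(10, 32))    # e.g. encoder features
wave_frames = rng.normal(size=(50, 8))     # e.g. vocoder features
enc_in = condition(text_frames, speaker_id=2)   # shape (10, 48)
voc_in = condition(wave_frames, speaker_id=2)   # shape (50, 24)
print(enc_in.shape, voc_in.shape)
```

In a real model the table would be a learnable parameter (e.g. an embedding layer) updated by backpropagation; the point of the sketch is only that a single per-speaker vector conditions all components.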





ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

In recent years, neural network based methods for multi-speaker text-to-...

Identify Speakers in Cocktail Parties with End-to-End Attention

In scenarios where multiple speakers talk at the same time, it is import...

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

We explore pretraining strategies including choice of base corpus with t...

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Recent advances in neural multi-speaker text-to-speech (TTS) models have...

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

In this work, we propose an alternative solution for parallel wave gener...

End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification

Speech 'in-the-wild' is a handicap for speaker recognition systems due t...

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Lip motion reflects behavior characteristics of speakers, and thus can b...