Multi-Speaker End-to-End Speech Synthesis

07/09/2019
by Jihyun Park, et al.

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristics of different voices, low-dimensional trainable speaker embeddings are shared across all components of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
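To illustrate the conditioning scheme the abstract describes, here is a minimal PyTorch sketch of a single trainable speaker-embedding table shared across a model's components and optimized jointly with the rest of the network. This is not the authors' implementation: the module names, the dimensions NUM_SPEAKERS and SPEAKER_DIM, and the toy linear stand-ins for ClariNet's encoder and wave generator are all hypothetical.

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 10   # hypothetical corpus size
SPEAKER_DIM = 16    # low-dimensional speaker embedding

class ToyTextToWave(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table shared by every component of the model.
        self.speaker_emb = nn.Embedding(NUM_SPEAKERS, SPEAKER_DIM)
        self.encoder = nn.Linear(8 + SPEAKER_DIM, 32)   # toy text encoder
        self.vocoder = nn.Linear(32 + SPEAKER_DIM, 1)   # toy wave generator

    def forward(self, text_feats, speaker_id):
        s = self.speaker_emb(speaker_id)                      # (B, SPEAKER_DIM)
        s_t = s.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        # Condition the encoder on the speaker vector...
        h = torch.tanh(self.encoder(torch.cat([text_feats, s_t], dim=-1)))
        # ...and condition the wave generator on the same vector.
        wave = self.vocoder(torch.cat([h, s_t], dim=-1))
        return wave.squeeze(-1)

model = ToyTextToWave()
text = torch.randn(2, 5, 8)        # (batch, time, text features)
spk = torch.tensor([3, 7])         # speaker indices
out = model(text, spk)

# A single waveform-level loss backpropagates into the shared embedding
# table, so speaker vectors are learned jointly with the rest of the model.
loss = out.pow(2).mean()
loss.backward()
print(model.speaker_emb.weight.grad is not None)  # True
```

Because every component reads from the same table, the single end-to-end loss shapes one speaker vector per voice, which is the joint optimization the abstract credits for the naturalness gains.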



Related research

03/20/2022
ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis
In recent years, neural network based methods for multi-speaker text-to-...

05/22/2020
Identify Speakers in Cocktail Parties with End-to-End Attention
In scenarios where multiple speakers talk at the same time, it is import...

11/10/2020
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
We explore pretraining strategies including choice of base corpus with t...

06/29/2021
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
Recent advances in neural multi-speaker text-to-speech (TTS) models have...

07/19/2018
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
In this work, we propose an alternative solution for parallel wave gener...

03/13/2020
End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification
Speech 'in-the-wild' is a handicap for speaker recognition systems due t...

10/13/2020
Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition
Lip motion reflects behavior characteristics of speakers, and thus can b...