SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

04/02/2021 ∙ by Edresson Casanova, et al.

In this paper, we propose SC-GlowTTS, an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture built on a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional encoder, a gated convolutional encoder, and a transformer-based encoder. Additionally, we show that adapting a GAN-based vocoder to the spectrograms predicted by the TTS model on the training dataset can significantly improve similarity and speech quality for new speakers. Our model converges when trained with only 11 speakers, reaching state-of-the-art similarity for new speakers as well as high speech quality.

Code Repositories

SC-GlowTTS
