
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

by Erica Cooper et al.

We explore pretraining strategies, including the choice of base corpus, with the aim of identifying the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine the choice of neural vocoder for waveform synthesis, as well as the acoustic configurations used for mel spectrograms and the final audio output. We find that fine-tuning a multi-speaker model on found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers. Additionally, we find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of comparable quality to WaveNet with faster inference.
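To make the "acoustic configuration" comparison concrete, the sketch below derives the frame-level parameters that change when moving from 16 kHz to 24 kHz output. The window, hop, and mel-band values here are common illustrative defaults, not the paper's exact settings, and the mel conversion uses the standard HTK formula.

```python
import math

def hz_to_mel(hz):
    # HTK-style mel scale: mel = 2595 * log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def acoustic_config(sample_rate, win_ms=50.0, hop_ms=12.5, n_mels=80):
    """Derive mel-spectrogram frame parameters for a given sampling rate.

    win_ms, hop_ms, and n_mels are illustrative assumptions, not the
    paper's published configuration.
    """
    win_length = int(sample_rate * win_ms / 1000.0)   # samples per analysis window
    hop_length = int(sample_rate * hop_ms / 1000.0)   # samples per frame shift
    nyquist = sample_rate / 2.0                       # highest representable frequency
    return {
        "sample_rate": sample_rate,
        "win_length": win_length,
        "hop_length": hop_length,
        "n_mels": n_mels,
        "fmax_hz": nyquist,
        "fmax_mel": round(hz_to_mel(nyquist), 1),
    }

for sr in (16000, 24000):
    print(acoustic_config(sr))
```

At 24 kHz the same 50 ms window covers 1200 samples instead of 800, and the Nyquist limit rises from 8 kHz to 12 kHz, so the mel filterbank spans a wider frequency range; this is the kind of configuration difference the listening tests compare.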




Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion


Speaker-independent raw waveform model for glottal excitation


Multi-Speaker End-to-End Speech Synthesis


nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech


Conditional End-to-End Audio Transforms
