Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

11/10/2020
by Erica Cooper, et al.

We explore pretraining strategies, including the choice of base corpus, with the aim of finding the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine the choice of neural vocoder for waveform synthesis, as well as the acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers. Additionally, we find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of comparable quality to WaveNet with faster inference.
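To make the two sampling rates in the abstract concrete, the sketch below shows how an acoustic configuration might translate into a Nyquist frequency and frame sizes in samples. The class name `AcousticConfig` and the 12.5 ms frame shift and 50 ms frame length are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticConfig:
    """Hypothetical acoustic configuration; defaults are assumptions, not the paper's settings."""
    sample_rate_hz: int
    frame_shift_ms: float = 12.5   # assumed hop between analysis frames
    frame_length_ms: float = 50.0  # assumed analysis window length

    @property
    def nyquist_hz(self) -> float:
        # Highest frequency representable at this sampling rate.
        return self.sample_rate_hz / 2

    @property
    def hop_samples(self) -> int:
        # Frame shift converted from milliseconds to samples.
        return round(self.sample_rate_hz * self.frame_shift_ms / 1000)

    @property
    def window_samples(self) -> int:
        # Frame length converted from milliseconds to samples.
        return round(self.sample_rate_hz * self.frame_length_ms / 1000)


if __name__ == "__main__":
    for rate in (16000, 24000):
        cfg = AcousticConfig(sample_rate_hz=rate)
        print(rate, cfg.nyquist_hz, cfg.hop_samples, cfg.window_samples)
```

Under these assumed frame settings, moving from 16 kHz to 24 kHz raises the representable bandwidth from 8 kHz to 12 kHz and enlarges every frame by a factor of 1.5 in samples, so the mel spectrogram settings and the vocoder both have to be configured for the chosen rate.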


research
07/05/2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

The zero-shot scenario for speech generation aims at synthesizing a nove...
research
09/21/2023

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speake...
research
04/25/2018

Speaker-independent raw waveform model for glottal excitation

Recent speech technology research has seen a growing interest in using W...
research
03/18/2022

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Recently, speech representation learning has improved many speech-relate...
research
09/01/2023

The FruitShell French synthesis system at the Blizzard 2023 Challenge

This paper presents a French text-to-speech synthesis system for the Bli...
research
10/22/2020

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

In this paper, we present AISHELL-3, a large-scale and high-fidelity mul...
research
03/30/2018

Conditional End-to-End Audio Transforms

We present an end-to-end method for transforming audio from one style to...
