Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

02/07/2023
by Eugene Kharitonov, et al.

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
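The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`synthesize`, `toy_read`, `toy_speak`), the integer token type, and the way the prompt is passed are all assumptions made for illustration, and the toy stages are arbitrary arithmetic stand-ins for trained sequence-to-sequence models. The only point shown is the composition: text goes to semantic tokens ("reading"), and semantic tokens plus an optional short audio prompt go to acoustic tokens ("speaking").

```python
from typing import Callable, List, Optional

Tokens = List[int]  # assumption: discrete speech tokens represented as ints


def synthesize(
    text: str,
    read: Callable[[str], Tokens],
    speak: Callable[[Tokens, Optional[Tokens]], Tokens],
    prompt: Optional[Tokens] = None,
) -> Tokens:
    """Compose the two stages: text -> semantic tokens -> acoustic tokens."""
    semantic = read(text)           # "reading": needs parallel text/audio data
    return speak(semantic, prompt)  # "speaking": trainable on audio-only data


def toy_read(text: str) -> Tokens:
    """Hypothetical placeholder for the text-to-semantic model."""
    return [ord(ch) % 50 for ch in text]


def toy_speak(semantic: Tokens, prompt: Optional[Tokens] = None) -> Tokens:
    """Hypothetical placeholder for the semantic-to-acoustic model.

    The prompt (tokens of a short audio sample) shifts the output, standing
    in for speaker conditioning without any speaker-id label.
    """
    offset = sum(prompt) % 7 if prompt else 0
    acoustic: Tokens = []
    for tok in semantic:
        # Emit two acoustic tokens per semantic token (arbitrary toy ratio).
        acoustic.extend([2 * tok + offset, 2 * tok + 1 + offset])
    return acoustic
```

Calling `synthesize("hi", toy_read, toy_speak)` and then repeating the call with `prompt=[3, 5]` produces two different acoustic sequences from the same semantic tokens. That mirrors the design choice in the abstract: speaker identity enters only through the prompt at the "speaking" stage, so the "reading" stage never needs speaker labels.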

research
06/13/2023

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

The utilization of discrete speech tokens, divided into semantic tokens ...
research
06/29/2021

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Recent advances in neural multi-speaker text-to-speech (TTS) models have...
research
03/18/2022

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Recently, speech representation learning has improved many speech-relate...
research
09/19/2023

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Discrete audio representation, aka audio tokenization, has seen renewed ...
research
11/29/2022

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

In this paper, we present a novel method for phoneme-level prosody contr...
research
04/12/2021

Deep Learning for Prominence Detection in Children's Read Speech

Expressive reading, considered the defining attribute of oral reading fl...
research
04/10/2023

Modeling Speaker-Listener Interaction for Backchannel Prediction

We present our latest findings on backchannel modeling novelly motivated...
