FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

03/06/2023
by Ruiqing Xue, et al.

Neural text-to-speech (TTS) systems generally use either a cascaded architecture, with a separately optimized acoustic model and vocoder, or an end-to-end architecture in which continuous mel-spectrograms or self-extracted speech frames serve as the intermediate representation bridging acoustic model and vocoder. Both suffer from two limitations: 1) continuous acoustic frames are hard to predict from phonemes alone, so additional acoustic information such as duration or pitch is needed to resolve the one-to-many problem, which does not scale easily to large-scale and noisy datasets; 2) producing diverse speech output from continuous speech features usually requires complex VAE- or flow-based models. In this paper, we propose FoundationTTS, a new speech synthesis system that combines a neural audio codec for discrete speech token extraction and waveform reconstruction with a large language model that generates discrete speech tokens from linguistic (phoneme) tokens. Specifically, 1) we propose a hierarchical codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN), whose fine-grained codec first extracts continuous frame-level speech representations and whose coarse-grained codec then extracts one discrete token from each continuous speech frame; 2) we jointly optimize speech tokens, linguistic tokens, and a speaker token with a large language model and predict the discrete speech tokens autoregressively. Experiments show that FoundationTTS achieves a MOS gain of +0.14 over the baseline system. On ASR customization tasks, our method achieves 7.09% and 10.35% WERR respectively over two strong customized ASR baselines.
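The coarse-grained quantization step described above — mapping each continuous frame-level representation to a discrete token — can be sketched as a nearest-neighbor lookup into a learned codebook. This is a minimal illustrative sketch, not the paper's implementation; the codebook size (1024) and frame dimension (256) are assumptions for demonstration only, and the codebook here is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 1024 entries, each a 256-dim vector.
# In a trained VQ-GAN these entries are learned jointly with the encoder.
codebook = rng.normal(size=(1024, 256))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map continuous frames of shape (T, 256) to discrete token ids (T,)."""
    # Squared Euclidean distance from every frame to every codebook entry.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Each frame is replaced by the index of its nearest codebook entry.
    return dists.argmin(axis=1)

frames = rng.normal(size=(10, 256))  # 10 continuous speech frames
tokens = quantize(frames)
print(tokens.shape)  # one discrete token per frame -> (10,)
```

The resulting token ids are what the language model would then predict autoregressively, conditioned on the linguistic (phoneme) and speaker tokens.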


