MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription

01/19/2022
by   Dabiao Ma, et al.
0

Neural network based end-to-end Text-to-Speech (TTS) has greatly improved the quality of synthesized speech. While how to use massive spontaneous speech without transcription efficiently still remains an open problem. In this paper, we propose MHTTS, a fast multi-speaker TTS system that is robust to transcription errors and speaking style speech data. Specifically, we introduce a multi-head model and transfer text information from high-quality corpus with manual transcription to spontaneous speech with imperfectly recognized transcription by jointly training them. MHTTS has three advantages: 1) Our system synthesizes better quality multi-speaker voice with faster inference speed. 2) Our system is capable of transferring correct text information to data with imperfect transcription, simulated using corruption, or provided by an Automatic Speech Recogniser (ASR). 3) Our system can utilize massive real spontaneous speech with imperfect transcription and synthesize expressive voice.

READ FULL TEXT
research
03/20/2022

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

In recent years, neural network based methods for multi-speaker text-to-...
research
05/22/2019

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural network based end-to-end text to speech (TTS) has significantly i...
research
10/12/2022

Can we use Common Voice to train a Multi-Speaker TTS system?

Training of multi-speaker text-to-speech (TTS) systems relies on curated...
research
09/08/2021

Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis ai...
research
02/26/2022

Revisiting Over-Smoothness in Text to Speech

Non-autoregressive text to speech (NAR-TTS) models have attracted much a...
research
06/08/2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) model (e.g., Transformer TTS <cit...
research
10/31/2022

Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

In current two-stage neural text-to-speech (TTS) paradigm, it is ideal t...

Please sign up or login with your details

Forgot password? Click here to reset