Triple M: A Practical Neural Text-to-Speech System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet

01/30/2021
by   Shilun Lin, et al.

In this work, a robust and efficient text-to-speech system, named Triple M, is proposed for large-scale online application. The key components of Triple M are: 1) A seq2seq model with multi-guidance attention, which achieves stable feature generation and robust long-sentence synthesis by learning from multiple guidance attention mechanisms. Multi-guidance attention improves the robustness and naturalness of long-sentence synthesis without any in-domain performance loss or online service modification. Compared with our best result obtained with a single attention mechanism (GMM-based attention), the word error rate of long-sentence synthesis decreases by 23.5% when the multi-guidance attention mechanism is applied. 2) An efficient multi-band multi-time LPCNet, which reduces the computational complexity of LPCNet by combining multi-band and multi-time strategies (from 2.8 to 1.0 GFLOPs). Thanks to these strategies, vocoder speed increases by 2.75x on a single CPU without much MOS degradation (4.57 vs. 4.45).
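The GMM-based attention used as the baseline above follows the Graves-style scheme: at each decoder step the network predicts mixture weights, a positive location increment, and scales, so the Gaussian means can only move forward over the encoder positions, which is what makes the alignment monotonic and robust on long inputs. The sketch below is a minimal NumPy illustration of one such attention step (the function name and parameterization are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def gmm_attention_weights(prev_means, log_omega, log_delta, log_sigma, enc_len):
    """One step of Graves-style GMM attention (illustrative sketch).

    prev_means: (K,) mixture means carried over from the previous decoder step.
    log_omega, log_delta, log_sigma: (K,) unnormalized mixture parameters
        predicted by the decoder at the current step.
    Returns (weights over enc_len encoder positions, updated means).
    """
    omega = np.exp(log_omega)
    omega = omega / omega.sum()        # mixture weights, normalized to sum to 1
    delta = np.exp(log_delta)          # exp() keeps the step positive ...
    sigma = np.exp(log_sigma)          # ... and the scales positive
    means = prev_means + delta         # means only move forward -> monotonic
    pos = np.arange(enc_len)[:, None]  # encoder positions, shape (enc_len, 1)
    # evaluate the Gaussian mixture at every encoder position
    weights = (omega * np.exp(-0.5 * ((pos - means) / sigma) ** 2)).sum(axis=1)
    return weights, means
```

Because `delta` is strictly positive, the attention window cannot jump backward or stall indefinitely, which is why location-based GMM attention tends to degrade more gracefully on out-of-domain long sentences than purely content-based attention.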

Related research:

06/05/2023 — Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis
  Regressive Text-to-Speech (TTS) system utilizes attention mechanism to g...

09/04/2019 — DurIAN: Duration Informed Attention Network For Multimodal Synthesis
  In this paper, we present a generic and robust multimodal synthesis syst...

05/12/2020 — FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction
  In this paper, we propose the FeatherWave, yet another variant of WaveRN...

05/01/2020 — Multi-head Monotonic Chunkwise Attention For Online Speech Recognition
  The attention mechanism of the Listen, Attend and Spell (LAS) model requ...

10/23/2019 — Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
  Despite the ability to produce human-level speech for in-domain text, at...

10/23/2020 — GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis
  Attention-based end-to-end text-to-speech synthesis (TTS) is superior to...

10/05/2020 — Transformer-Based Neural Text Generation with Syntactic Guidance
  We study the problem of using (partial) constituency parse trees as synt...
