High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

11/17/2021
by   Nikolaos Ellinas, et al.
0

This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for arbitrary sentence length generation. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency which is independent from the sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency about 31 times faster than real-time on a computer CPU and 6.5 times on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-to-end system can generate almost natural quality speech, which is verified by listening tests.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/12/2018

FPUAS : Fully Parallel UFANS-based End-to-End Acoustic System with 10x Speed Up

A lightweight end-to-end acoustic system is crucial in the deployment of...
research
11/02/2020

FeatherTTS: Robust and Efficient attention based Neural TTS

Attention based neural TTS is elegant speech synthesis pipeline and has ...
research
04/01/2021

Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

Typical high quality text-to-speech (TTS) systems today use a two-stage ...
research
05/02/2019

High quality, lightweight and adaptable TTS using LPCNet

We present a lightweight adaptable neural TTS system with high quality o...
research
09/04/2019

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

In this paper, we present a generic and robust multimodal synthesis syst...
research
12/23/2020

Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model

Text-to-speech (TTS) synthesis, a technique for artificially generating ...
research
08/06/2020

PPSpeech: Phrase based Parallel End-to-End TTS System

Current end-to-end autoregressive TTS systems (e.g. Tacotron 2) have out...

Please sign up or login with your details

Forgot password? Click here to reset