Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

04/01/2021
by   Qing He, et al.
0

Typical high quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term (BLSTM) units. While these models can produce high quality speech, they often incur O(L) increase in both latency and real-time factor (RTF) with respect to input length L. In other words, longer inputs leads to longer delay and slower synthesis speed, limiting its use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to groundtruth 4.48), low latency, and low RTF at the same time. Meanwhile, both latency and RTF of the proposed system stay constant regardless of input lengths, making it ideal for real-time applications.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/17/2021

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

This paper presents an end-to-end text-to-speech system with low latency...
research
07/07/2021

SoundStream: An End-to-End Neural Audio Codec

We present SoundStream, a novel neural audio codec that can efficiently ...
research
07/27/2023

Turning Whisper into Real-Time Transcription System

Whisper is one of the recent state-of-the-art multilingual speech recogn...
research
10/25/2022

Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

While recent research advances in speaker diarization mostly focus on im...
research
07/08/2022

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Unconstrained lip-to-speech synthesis aims to generate corresponding spe...
research
09/22/2021

Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

Incremental text-to-speech (TTS) synthesis generates utterances in small...
research
10/21/2020

Real-time Speech Frequency Bandwidth Extension

In this paper we propose a lightweight model for frequency bandwidth ext...

Please sign up or login with your details

Forgot password? Click here to reset