DurIAN: Duration Informed Attention Network For Multimodal Synthesis

by Chengzhu Yu, et al.

In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expressions simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, which accounts for various unavoidable artifacts in those systems. Furthermore, DurIAN can be used to generate high-quality facial expressions that can be synchronized with the generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN reduces the total computational complexity from 9.8 to 5.5 GFLOPS and can generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while avoiding the word skipping and repeating errors found in those systems. Finally, a simple yet effective approach for fine-grained control of the expressiveness of speech and facial expressions is introduced.
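
To make the idea of duration-derived alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a duration model has already predicted an integer number of acoustic frames per input unit; the function names `expand_by_duration` and `alignment_matrix` and the example phoneme labels are hypothetical. Each input unit is repeated for its predicted number of frames, which yields a hard, monotonic alignment rather than one learned by soft attention.

```python
from typing import List, Sequence


def expand_by_duration(states: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each input state for its predicted number of frames.

    `states` stands in for per-phoneme encoder outputs (strings here for
    readability); `durations` are assumed integer frame counts.
    """
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    frames: List[str] = []
    for state, n_frames in zip(states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([state] * n_frames)
    return frames


def alignment_matrix(durations: Sequence[int]) -> List[List[int]]:
    """Build the 0/1 alignment implied by the durations.

    Row i marks the contiguous block of frames assigned to input unit i.
    """
    total = sum(durations)
    rows: List[List[int]] = []
    start = 0
    for n_frames in durations:
        row = [0] * total
        for t in range(start, start + n_frames):
            row[t] = 1
        rows.append(row)
        start += n_frames
    return rows


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]   # hypothetical input units
    durations = [2, 3, 1, 4]             # hypothetical predicted frame counts
    print(expand_by_duration(phonemes, durations))
    for row in alignment_matrix(durations):
        print(row)
```

In this sketch every input unit with a non-zero duration occupies exactly one contiguous block of frames, in input order, so no unit can be skipped or revisited. That property is one way to read the claim that a duration-informed alignment avoids word skipping and repeating errors.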



FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction

In this paper, we propose the FeatherWave, yet another variant of WaveRN...

Audiovisual Speech Synthesis using Tacotron2

Audiovisual speech synthesis is the problem of synthesizing a talking fa...

Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time Lpcnet

In this work, a robust and efficient text-to-speech system, named Triple...

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

This paper presents an end-to-end text-to-speech system with low latency...

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

We propose a lightweight end-to-end text-to-speech model using multi-ban...

Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

In this paper, we propose multi-band MelGAN, a much faster waveform gene...

The Importance of Accurate Alignments in End-to-End Speech Synthesis

Unit selection synthesis systems required accurate segmentation and labe...
