s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis

11/17/2020
by Xi Wang, et al.

Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention-based model, e.g. Transformer, to characterize a speech utterance, has achieved significant improvements in speech synthesis. However, it is still challenging to handle sentences of different lengths, particularly long sentences, where the sequence model is limited by its effective context length. We propose a novel segment-Transformer (s-Transformer), which models speech at the segment level, reusing recurrence via cached memories for both the encoder and the decoder. Long-range contexts can be captured by the extended memory, while the encoder-decoder attention operates on segments, which are much easier to handle. In addition, we employ a modified relative positional self-attention to generalize to sequence lengths possibly unseen in the training data. Comparing the proposed s-Transformer with the standard Transformer: on short sentences, both achieve the same MOS of 4.29, which is very close to the 4.32 of the recordings; on long sentences, they obtain similar scores of 4.22 vs. 4.2; and on extra-long sentences, the s-Transformer is significantly better, with a gain of 0.2 in MOS. Since the cached memory is updated over time, the s-Transformer generates natural and coherent speech over long stretches of an utterance.
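The abstract names two key mechanisms: segment-level attention with cached memories, and a modified relative positional self-attention. The paper text is not included here, so the PyTorch snippet below is a rough, hypothetical sketch of one common way such a layer could be implemented; the class `SegmentSelfAttention`, its hyper-parameters, and the memory-update rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): segment-level self-attention
# with a cached memory and a clipped relative-position bias.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentSelfAttention(nn.Module):
    """Attend over the current segment plus a cache of past-segment states."""

    def __init__(self, d_model=256, n_heads=4, mem_len=64, max_rel_dist=128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.mem_len, self.max_rel_dist = mem_len, max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned relative-position bias, clipped to +/- max_rel_dist so the
        # layer can be applied to sequence lengths unseen in training.
        self.rel_bias = nn.Parameter(torch.zeros(n_heads, 2 * max_rel_dist + 1))

    def forward(self, x, memory=None):
        # x: (batch, seg_len, d_model); memory: (batch, <=mem_len, d_model) or None
        if memory is None:
            memory = x.new_zeros(x.size(0), 0, x.size(-1))
        context = torch.cat([memory, x], dim=1)  # keys/values span memory + segment
        B, T_q, _ = x.shape
        T_k = context.size(1)

        q = self.q_proj(x).view(B, T_q, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(context).chunk(2, dim=-1)
        k = k.view(B, T_k, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T_k, self.n_heads, self.d_head).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, H, T_q, T_k)

        # Relative positions: query i sits at offset (T_k - T_q + i) inside `context`.
        q_pos = torch.arange(T_q, device=x.device) + (T_k - T_q)
        k_pos = torch.arange(T_k, device=x.device)
        rel = (k_pos[None, :] - q_pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        scores = scores + self.rel_bias[:, rel + self.max_rel_dist]  # (H, T_q, T_k) bias

        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T_q, -1)

        # Carry the most recent mem_len frames forward as the next segment's memory;
        # detach so gradients do not flow across segment boundaries.
        new_memory = context[:, -self.mem_len:].detach()
        return self.out(y), new_memory


if __name__ == "__main__":
    layer = SegmentSelfAttention()
    utterance = torch.randn(1, 300, 256)        # one long utterance of frame features
    memory, outputs = None, []
    for segment in utterance.split(50, dim=1):  # process 50-frame segments in order
        y, memory = layer(segment, memory)
        outputs.append(y)
    print(torch.cat(outputs, dim=1).shape)      # torch.Size([1, 300, 256])
```

A decoder-side variant would add a causal mask inside each segment; the clipped relative-position bias is what lets the layer run at positions longer than anything seen during training.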


research · 05/21/2020
SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
End-to-end speech recognition has become popular in recent years, since ...

research · 07/07/2021
Efficient Transformer for Direct Speech Translation
The advent of Transformer-based models has surpassed the barriers of tex...

research · 02/16/2021
Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation
We present a novel large-context end-to-end automatic speech recognition...

research · 05/16/2020
Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory
Transformer-based acoustic modeling has achieved great success for both...

research · 11/01/2019
Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding
Transformer showed promising results in many sequence to sequence transf...

research · 06/14/2023
When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
We present the first unified study of the efficiency of self-attention-b...

research · 10/23/2020
GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis
Attention-based end-to-end text-to-speech synthesis (TTS) is superior to...
