Fine-grained style control in Transformer-based Text-to-speech Synthesis

10/12/2021
by Li-Wei Chen, et al.

In this paper, we present a novel architecture that realizes fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our cross-attention blocks, which fuse and align content with style. Because the fusion is performed alongside a skip connection, the cross-attention block provides a good inductive bias for gradually infusing the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating the LST during training and by using wav2vec 2.0 features. Experiments show that, with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
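The two mechanisms named in the abstract, cross-attention fusion with a skip connection and random truncation of the LST during training, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the random projection matrices, the dimensions, and the choice of a contiguous random window for truncation are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_truncate(lst, rng, min_len=1):
    """Keep a random contiguous window of the style-token sequence.

    Assumption: truncation is a contiguous crop with a random start and
    length; the abstract only says the LST is "randomly truncated".
    """
    n = lst.shape[0]
    length = rng.integers(min_len, n + 1)      # length in [min_len, n]
    start = rng.integers(0, n - length + 1)    # start so the window fits
    return lst[start:start + length]


def cross_attention_block(phonemes, style, w_q, w_k, w_v):
    """Single-head cross-attention: phonemes query the style tokens.

    The skip connection adds the attended style back onto the phoneme
    representation, so each block only nudges the content toward the style.
    """
    q = phonemes @ w_q                          # (T_text, d)
    k = style @ w_k                             # (T_style, d)
    v = style @ w_v                             # (T_style, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T_text, T_style)
    attn = softmax(scores, axis=-1)             # soft alignment per phoneme
    return phonemes + attn @ v, attn            # skip connection


# Hypothetical sizes: 5 phonemes, 12 local style tokens, width 8.
rng = np.random.default_rng(0)
d = 8
phonemes = rng.normal(size=(5, d))
style = rng.normal(size=(12, d))

# Training-time truncation of the style sequence.
style_train = random_truncate(style, rng)

# Stack a few blocks so the style is infused gradually.
fused = phonemes
for _ in range(3):
    w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    fused, attn = cross_attention_block(fused, style_train, w_q, w_k, w_v)
```

In this sketch each row of `attn` is a distribution over the (truncated) style tokens for one phoneme, which is what lets content and style sequences of different lengths be aligned without an explicit duration model.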

