Fine-grained robust prosody transfer for single-speaker neural text-to-speech

07/04/2019
by Viacheslav Klimkov, et al.
We present a neural text-to-speech system for fine-grained prosody transfer from one speaker to another. Conventional approaches to end-to-end prosody transfer typically encode the reference signal with either a fixed-dimensional prosody embedding or a variable-length prosody embedding obtained via a secondary attention mechanism. However, when trained on a single-speaker dataset, these conventional systems are not sufficiently robust to speaker variability, especially when the reference signal comes from an unseen speaker. We therefore propose decoupling the alignment of the reference signal from the overall system. To this end, we pre-compute phoneme-level time stamps and use them to aggregate prosodic features per phoneme, which we then inject into a sequence-to-sequence text-to-speech system. We incorporate a variational auto-encoder to further enhance the latent representation of the prosody embeddings. We show that the proposed approach is significantly more stable and achieves reliable prosody transplantation from an unseen speaker. We also propose a solution for the use case in which the transcription of the reference signal is absent. We evaluate all proposed methods using both objective metrics and subjective listening tests.
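The per-phoneme aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which prosodic features are used or how they are aggregated, so the following are assumptions made only for illustration: frame-level features such as log-F0 and energy, a fixed frame shift, aggregation by averaging, and the hypothetical function name `aggregate_per_phoneme`.

```python
import numpy as np


def aggregate_per_phoneme(frame_features, boundaries_sec, frame_shift_sec=0.0125):
    """Average frame-level prosodic features inside each phoneme.

    frame_features: array of shape (num_frames, num_features), for example
        log-F0 and energy per frame (assumed features, not from the paper).
    boundaries_sec: list of (start, end) times in seconds, one pair per
        phoneme, taken from a pre-computed alignment.
    frame_shift_sec: hop between consecutive frames (assumed value).
    Returns an array of shape (num_phonemes, num_features).
    """
    frame_features = np.asarray(frame_features, dtype=float)
    num_frames, num_features = frame_features.shape
    pooled = np.zeros((len(boundaries_sec), num_features))
    for i, (start, end) in enumerate(boundaries_sec):
        # Convert time stamps to frame indices.
        lo = int(round(start / frame_shift_sec))
        hi = int(round(end / frame_shift_sec))
        # Clamp to the valid range and keep at least one frame per phoneme.
        lo = min(max(lo, 0), num_frames - 1)
        hi = min(max(hi, lo + 1), num_frames)
        # Mean pooling is an assumption; other statistics could be used.
        pooled[i] = frame_features[lo:hi].mean(axis=0)
    return pooled
```

Because the time stamps are computed ahead of time, the resulting matrix has exactly one row per phoneme and can be concatenated with the phoneme encodings of a sequence-to-sequence model without a learned secondary attention over the reference signal.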

research
04/30/2020

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Prosody Transfer (PT) is a technique that aims to use the prosody from a...
research
06/20/2023

eCat: An End-to-End Model for Multi-Speaker TTS Many-to-Many Fine-Grained Prosody Transfer

We present eCat, a novel end-to-end multispeaker model capable of: a) ge...
research
11/06/2018

Robust and fine-grained prosody control of end-to-end speech synthesis

We propose prosody embeddings for emotional and expressive speech synthe...
research
06/27/2022

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) s...
research
05/27/2021

Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling

Generating natural speech with diverse and smooth prosody pattern is a c...
research
06/10/2021

Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

Text-to-speech systems recently achieved almost indistinguishable qualit...
research
06/08/2021

Speech BERT Embedding For Improving Prosody in Neural TTS

This paper presents a speech BERT model to extract embedded prosody info...
