eCat: An End-to-End Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

06/20/2023
by Ammar Abbas, et al.

We present eCat, a novel end-to-end multi-speaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained in two stages. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7%, and improves target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference for eCat.
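The "reduces the gap in naturalness" figure is a relative measure: it describes how much of the distance between a baseline system and human recordings a new system closes. The abstract does not give the formula, so the following is only a minimal sketch of one common way to compute such a number. The function name `gap_reduction` and all scores used here are hypothetical and are not taken from the paper.

```python
def gap_reduction(baseline: float, system: float, reference: float) -> float:
    """Return the percentage of the baseline-to-reference gap closed by `system`.

    Assumption: higher scores are better (e.g. a listening-test naturalness
    score), and the reference (human recordings) scores above the baseline.
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score higher than baseline")
    return 100.0 * (system - baseline) / gap


# Made-up scores for illustration only (not results from the paper):
# the baseline scores 60, the new system 66, human recordings 72.
print(gap_reduction(baseline=60.0, system=66.0, reference=72.0))  # 50.0
```

Under this definition, a value of 50 means the new system sits halfway between the baseline and human recordings, and a value of 100 would mean it matches them.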

research
06/27/2022

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) s...
research
11/04/2020

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

In this paper, we introduce Kathaka, a model trained with a novel two-st...
research
07/04/2019

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

We present a neural text-to-speech system for fine-grained prosody trans...
research
06/29/2022

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Generating expressive and contextually appropriate prosody remains a cha...
research
04/30/2020

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Prosody Transfer (PT) is a technique that aims to use the prosody from a...
research
11/07/2020

TB-Net: A Three-Stream Boundary-Aware Network for Fine-Grained Pavement Disease Segmentation

Regular pavement inspection plays a significant role in road maintenance...
research
05/27/2021

Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling

Generating natural speech with diverse and smooth prosody pattern is a c...
