CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

06/27/2022
by   Sri Karlapati, et al.
0

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by 22.79%. In fine-grained prosody transfer evaluations, it obtains a relative improvement of 33.15% in target speaker similarity.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/20/2023

eCat: An End-to-End Model for Multi-Speaker TTS Many-to-Many Fine-Grained Prosody Transfer

We present eCat, a novel end-to-end multispeaker model capable of: a) ge...
research
04/30/2020

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Prosody Transfer (PT) is a technique that aims to use the prosody from a...
research
11/04/2020

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

In this paper, we introduce Kathaka, a model trained with a novel two-st...
research
11/12/2014

Deep Multi-Instance Transfer Learning

We present a new approach for transferring knowledge from groups to indi...
research
05/27/2021

Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling

Generating natural speech with diverse and smooth prosody pattern is a c...
research
07/20/2022

Fine-grained Early Frequency Attention for Deep Speaker Recognition

Attention mechanisms have emerged as important tools that boost the perf...
research
07/04/2019

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

We present a neural text-to-speech system for fine-grained prosody trans...

Please sign up or login with your details

Forgot password? Click here to reset