CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

04/30/2020
by Sri Karlapati, et al.

Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of 47% in the quality of prosody transfer and 14% in preserving the target speaker identity, while still maintaining the same naturalness.

research
06/27/2022

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) s...
research
07/04/2019

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

We present a neural text-to-speech system for fine-grained prosody trans...
research
06/20/2023

eCat: An End-to-End Model for Multi-Speaker TTS Many-to-Many Fine-Grained Prosody Transfer

We present eCat, a novel end-to-end multispeaker model capable of: a) ge...
research
11/29/2022

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

In this paper, we present a novel method for phoneme-level prosody contr...
research
06/28/2022

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Typically, singing voice conversion (SVC) depends on an embedding vector...
research
11/19/2021

Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

This paper presents a method for phoneme-level prosody control of F0 and...
research
05/09/2023

Who is Speaking Actually? Robust and Versatile Speaker Traceability for Voice Conversion

Voice conversion (VC), as a voice style transfer technology, is becoming...