Do Prosody Transfer Models Transfer Prosody?

03/07/2023
by Atli Thor Sigurgeirsson, et al.

Some recent models for Text-to-Speech synthesis aim to transfer the prosody of a reference utterance to the generated target synthetic speech. This is done by conditioning speech generation on a learned embedding of the reference utterance. During training, the reference utterance is identical to the target utterance. Yet, during synthesis, these models are often used to transfer prosody from a reference that differs in text or speaker from the utterance being synthesized. To address this inconsistency, we propose to use a different, but prosodically-related, utterance as the reference during training too. We believe this should encourage the model to learn to transfer only those characteristics that the reference and target have in common. If prosody transfer methods do indeed transfer prosody, they should be trainable in the way we propose. However, results show that a model trained under these conditions performs significantly worse than one trained using the target utterance as a reference. To explain this, we hypothesize that prosody transfer models do not learn a transferable representation of prosody, but rather an utterance-level representation which is highly dependent on both the reference speaker and reference text.
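To make the training-time mismatch concrete, the sketch below contrasts the two set-ups described in the abstract: conditioning the decoder on an embedding of the target utterance itself (the standard regime) versus on an embedding of a different but prosodically related utterance (the proposed regime). This is a minimal illustration under stated assumptions, not the authors' model; the module structure, dimensions, and the `ReferenceEncoder`, `ToyTTS`, and `training_step` names are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of reference conditioning
# in prosody-transfer TTS training. All sizes and module choices are
# illustrative assumptions.
import torch
import torch.nn as nn


class ReferenceEncoder(nn.Module):
    """Summarise a reference mel-spectrogram into a fixed-size prosody embedding."""

    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):               # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)              # h: (1, batch, emb_dim)
        return h.squeeze(0)               # (batch, emb_dim)


class ToyTTS(nn.Module):
    """Toy acoustic model: text embedding + prosody embedding -> mel frames."""

    def __init__(self, vocab=64, n_mels=80, emb_dim=128):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, emb_dim)
        self.ref_enc = ReferenceEncoder(n_mels, emb_dim)
        self.decoder = nn.Linear(2 * emb_dim, n_mels)

    def forward(self, text_ids, reference_mel):
        txt = self.text_emb(text_ids)                       # (B, T, D)
        pro = self.ref_enc(reference_mel)                   # (B, D)
        pro = pro.unsqueeze(1).expand(-1, txt.size(1), -1)  # broadcast over time
        return self.decoder(torch.cat([txt, pro], dim=-1))  # (B, T, n_mels)


def training_step(model, text_ids, target_mel, related_mel=None):
    # Standard set-up: the reference IS the target utterance.
    # Proposed set-up: the reference is a different, prosodically related
    # utterance, passed in as `related_mel`.
    reference = target_mel if related_mel is None else related_mel
    pred = model(text_ids, reference)
    return nn.functional.mse_loss(pred, target_mel)


# Example with random tensors standing in for real data
# (one mel frame per text token, purely for shape simplicity).
model = ToyTTS()
text = torch.randint(0, 64, (2, 20))     # (batch, text length)
target = torch.randn(2, 20, 80)          # target utterance mel
related = torch.randn(2, 20, 80)         # prosodically related reference mel
loss_standard = training_step(model, text, target)           # reference = target
loss_proposed = training_step(model, text, target, related)  # reference differs from target
```

The paper's finding, in these terms, is that training with the second loss (reference differing from the target) degrades performance relative to the first, which is the basis for the authors' hypothesis about what the reference embedding actually encodes.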

Related research

03/24/2018 · Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture th...

11/19/2020 · Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
Speaker extraction uses a pre-recorded reference speech as the reference...

11/21/2019 · Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
This paper presents a simple yet effective method to achieve prosody tra...

03/29/2022 · DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Most text-to-speech (TTS) methods use high-quality speech corpora record...

09/14/2022 · Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Non-reference speech quality models are important for a growing number o...

08/04/2020 · Expressive TTS Training with Frame and Style Reconstruction Loss
We propose a novel training strategy for Tacotron-based text-to-speech (...

04/04/2019 · Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis
Speech style control and transfer techniques aim to enrich the diversity...
