Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

10/28/2022
by Jason Fong, et al.

Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is its reliance on finetuning to optimise performance: this requires further model training on data from the target speaker, which is costly and may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach, which avoids finetuning altogether and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the edited synthetic speech and the unedited original recording in the zero-shot setting.
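The abstract does not spell out how the utterance-level information reaches the TTS model, so the following is only a minimal sketch of one common way such conditioning could be wired up, not the authors' implementation. It assumes a pretrained speaker verification embedding is available as a plain vector, stands in a simple mean pool over reference frames for the jointly trained reference encoder, and appends the resulting utterance-level vector to every text encoder state. The names `mean_pool` and `condition_encoder_states` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average frame-level features over time into one utterance-level vector.

    Assumption: this fixed pooling is a stand-in for a learned reference encoder.
    """
    if not frames:
        raise ValueError("need at least one reference frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def condition_encoder_states(
    encoder_states: List[Vector],
    speaker_embedding: Vector,
    reference_frames: List[Vector],
) -> List[Vector]:
    """Append the same utterance-level vector to every text encoder state.

    The utterance-level vector is the speaker embedding concatenated with the
    pooled reference embedding (an illustrative choice, not from the paper).
    """
    reference_embedding = mean_pool(reference_frames)
    utterance_vector = speaker_embedding + reference_embedding
    return [state + utterance_vector for state in encoder_states]


# Toy example: two encoder states, a 1-dim speaker embedding, two reference frames.
states = [[1.0, 2.0], [3.0, 4.0]]
speaker = [0.5]
reference = [[0.0, 2.0], [2.0, 4.0]]
conditioned = condition_encoder_states(states, speaker, reference)
```

Because no speaker-specific weights are updated, a sketch like this needs only a reference recording at inference time, which is the property the zero-shot setting relies on.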

research
06/24/2022

Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

The cloning of a speaker's voice using an untranscribed reference sample...
research
09/12/2021

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Given a piece of speech and its transcript text, text-based speech editi...
research
01/25/2022

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

With recent advancements in voice cloning, the performance of speech syn...
research
07/01/2022

Automatic Evaluation of Speaker Similarity

We introduce a new automatic evaluation method for speaker similarity as...
research
06/07/2022

FlexLip: A Controllable Text-to-Lip System

The task of converting text input into video content is becoming an impo...
research
07/14/2023

Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

Zero-shot text-to-speech aims at synthesizing voices with unseen speech ...
research
09/23/2022

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Rhythm

Recent developments in neural speech synthesis and vocoding have sparked...
