Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

09/12/2021
by Chuanxin Tang, et al.

Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: they synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem with this framework is that VC is itself a challenging task that usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework that generates natural and coherent target speech without any training data from the target speaker. In particular, we perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both the text embedding and the speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that, despite the lack of training data for the speaker, our method achieves satisfactory results and outperforms a recent zero-shot TTS engine by a large margin.
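
To make the duration-regulation step more concrete, here is a minimal NumPy sketch of one way predicted durations could align text and speech on a common frame axis. It is not the authors' implementation: the function names, the array shapes, and the choice to fill the inserted region of the speech stream with zeros are all assumptions made for illustration, and the duration predictor and transformer decoder are left out entirely.

```python
import numpy as np


def length_regulate(token_emb, durations):
    """Repeat each token embedding durations[i] times along the time axis.

    token_emb: (num_tokens, dim) array; durations: (num_tokens,) frame counts.
    """
    token_emb = np.asarray(token_emb, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if token_emb.shape[0] != durations.shape[0]:
        raise ValueError("need one duration per token")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(token_emb, durations, axis=0)


def align_for_insertion(token_emb, durations, context_mel, ins_start, ins_end):
    """Build frame-aligned text and speech inputs for an inserted token span.

    Tokens in [ins_start, ins_end) are the inserted text (hypothetical indexing);
    their durations are assumed to come from a duration predictor. context_mel
    holds the mel frames of the original speech, i.e. of all other tokens.
    """
    durations = np.asarray(durations, dtype=int)
    context_mel = np.asarray(context_mel, dtype=float)
    frames_before = int(durations[:ins_start].sum())
    frames_new = int(durations[ins_start:ins_end].sum())
    frames_after = int(durations[ins_end:].sum())
    if context_mel.shape[0] != frames_before + frames_after:
        raise ValueError("context_mel length does not match context durations")

    # Text stream: every token expanded to its frame count.
    text_frames = length_regulate(token_emb, durations)

    # Speech stream: original frames with a zero-filled gap to be generated
    # (the zero placeholder is an assumption of this sketch).
    gap = np.zeros((frames_new, context_mel.shape[1]))
    speech_frames = np.concatenate(
        [context_mel[:frames_before], gap, context_mel[frames_before:]], axis=0
    )

    # True where the decoder would have to produce new mel frames.
    generate_mask = np.zeros(speech_frames.shape[0], dtype=bool)
    generate_mask[frames_before:frames_before + frames_new] = True
    return text_frames, speech_frames, generate_mask
```

Both returned streams share the same number of frames, so they could be combined frame by frame before being passed to a decoder; how the paper actually fuses them is not specified in the abstract.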

research
10/28/2022

Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Text-based voice editing (TBVE) uses synthetic output from text-to-speec...
research
06/28/2022

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

This paper proposes a new "decompose-and-edit" paradigm for the text-bas...
research
07/05/2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

The zero-shot scenario for speech generation aims at synthesizing a nove...
research
05/28/2023

Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS

Flow-based generative models are widely used in text-to-speech (TTS) sys...
research
01/10/2023

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Text-to-speech (TTS) and voice conversion (VC) are two different tasks b...
research
02/21/2022

CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

The text-based speech editor allows the editing of speech through intuit...
research
04/23/2023

DiffVoice: Text-to-Speech with Latent Diffusion

In this work, we present DiffVoice, a novel text-to-speech model based o...
