From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

05/10/2020
by Zexin Cai, et al.

In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a multispeaker speech synthesis system with a feedback constraint. We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network during synthesis training. The constraint is imposed through an added speaker-identity loss, which pushes the synthesized speech toward its natural reference audio in terms of speaker similarity. The model is trained and evaluated on publicly available datasets. Experimental results, including visualizations of the speaker embedding space, show significant improvement in speaker identity cloning at the spectrogram level. Synthesized samples are available online for listening. (https://caizexin.github.io/mlspk-syn-samples/index.html)
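As a rough illustration of such a feedback constraint, the minimal PyTorch sketch below adds a speaker-embedding similarity term to a standard spectrogram reconstruction loss. The names (`tts_model`, `speaker_encoder`), the L1 reconstruction term, the cosine-based similarity loss, and the weighting factor are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(tts_model, speaker_encoder, text, ref_mel, alpha=1.0):
    """Spectrogram reconstruction loss plus a speaker-identity feedback term.

    tts_model       -- synthesizer mapping (text, speaker embedding) to a mel spectrogram
    speaker_encoder -- pretrained speaker verification network returning one embedding per utterance
    text, ref_mel   -- input token ids and the natural reference mel spectrogram (teacher-forced,
                       so predicted and reference spectrograms share the same length)
    alpha           -- weight of the speaker-identity term (assumed hyperparameter)
    """
    # Embedding of the natural reference audio; kept fixed as the target identity.
    with torch.no_grad():
        ref_embed = speaker_encoder(ref_mel)

    # Synthesize a mel spectrogram conditioned on the reference speaker's embedding.
    pred_mel = tts_model(text, ref_embed)

    # Standard reconstruction loss at the spectrogram level.
    recon_loss = F.l1_loss(pred_mel, ref_mel)

    # Feedback constraint: the verification network re-embeds the *synthesized*
    # spectrogram, and a cosine-similarity loss pulls that embedding toward the
    # reference embedding.
    syn_embed = speaker_encoder(pred_mel)
    speaker_loss = 1.0 - F.cosine_similarity(syn_embed, ref_embed, dim=-1).mean()

    return recon_loss + alpha * speaker_loss
```

Because the speaker loss is computed on the synthesized spectrogram, its gradient flows back through the verification network into the synthesizer, which is what lets the verification model act as feedback on speaker identity rather than only supplying a fixed conditioning embedding.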

