ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

03/20/2022
by Jinlong Xue, et al.

In recent years, neural-network-based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress. However, the speaker encoders used in these methods still do not capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that generates high-quality speech with improved speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which originates from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that the proposed method achieves better naturalness and similarity. To evaluate the synthesized speech efficiently, we are the first to adopt deep-learning-based automatic MOS evaluation methods to assess our results, and these methods show great potential for automatic speech quality assessment.
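The data flow between the three components can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `synthesize` function, the type aliases, and the three `toy_*` stand-ins are hypothetical names introduced here, and the stand-ins only mimic the shapes of the interfaces (reference audio to speaker embedding, text plus embedding to mel-spectrogram frames, frames to waveform). In a real system they would be replaced by trained ECAPA-TDNN, FastSpeech2, and HiFi-GAN models.

```python
from typing import Callable, List

Waveform = List[float]
Embedding = List[float]
Mel = List[List[float]]  # one list of mel-bin values per frame


def synthesize(
    text: str,
    reference: Waveform,
    speaker_encoder: Callable[[Waveform], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], Waveform],
) -> Waveform:
    """Chain the three separately trained components (hypothetical interface)."""
    embedding = speaker_encoder(reference)  # speaker identity from reference audio
    mel = synthesizer(text, embedding)      # text + speaker embedding -> mel frames
    return vocoder(mel)                     # mel frames -> waveform samples


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_speaker_encoder(reference: Waveform) -> Embedding:
    # Summarize the reference audio as a fixed-size vector: [mean, mean energy].
    n = max(len(reference), 1)
    mean = sum(reference) / n
    energy = sum(x * x for x in reference) / n
    return [mean, energy]


def toy_synthesizer(text: str, embedding: Embedding) -> Mel:
    # Emit one frame per input character, conditioned on the embedding.
    return [list(embedding) for _ in text]


def toy_vocoder(mel: Mel) -> Waveform:
    # Flatten the frames into a sample sequence.
    return [value for frame in mel for value in frame]


waveform = synthesize(
    "hi", [0.0, 1.0], toy_speaker_encoder, toy_synthesizer, toy_vocoder
)
```

Because the speaker embedding is computed from reference audio at inference time, the same call works for a speaker not seen during training, which is the setting in which the quality of the speaker encoder matters most.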

research
06/12/2018

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synth...
research
07/09/2019

Multi-Speaker End-to-End Speech Synthesis

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end...
research
01/19/2022

MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription

Neural network based end-to-end Text-to-Speech (TTS) has greatly improve...
research
07/01/2022

Automatic Evaluation of Speaker Similarity

We introduce a new automatic evaluation method for speaker similarity as...
research
03/02/2022

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

In this paper, we propose a method of speaker adaption with intuitive pr...
research
06/03/2021

Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

Building multispeaker neural network-based text-to-speech synthesis syst...
research
07/07/2021

Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

In multi-speaker speech synthesis, data from a number of speakers usuall...