DeepAI AI Chat
Log In Sign Up

Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

by   Qinghua Wu, et al.

In multi-speaker speech synthesis, data from a number of speakers usually tends to have great diversity due to the fact that the speakers may differ largely in their ages, speaking styles, speeds, emotions, and so on. The diversity of data will lead to the one-to-many mapping problem <cit.>. It is important but challenging to improve the modeling capabilities for multi-speaker speech synthesis. To address the issue, this paper researches into the effective use of control information such as speaker and pitch which are differentiated from text-content information in our encoder-decoder framework: 1) Design a representation of harmonic structure of speech, called excitation spectrogram, from pitch and energy. The excitation spectrogrom is, along with the text-content, fed to the decoder to guide the learning of harmonics of mel-spectrogram. 2) Propose conditional gated LSTM (CGLSTM) whose input/output/forget gates are re-weighted by speaker embedding to control the flow of text-content information in the network. The experiments show significant reduction in reconstruction errors of mel-spectrogram in the training of multi-speaker generative model, and a great improvement is observed in the subjective evaluation of speaker adapted model, e.g, the Mean Opinion Score (MOS) of intelligibility increases by 0.81 points.


MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) model (e.g., Transformer TTS <cit...

SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate

The mapping of text to speech (TTS) is non-deterministic, letters may be...

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

In recent years, neural network based methods for multi-speaker text-to-...

Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

Multi-speaker speech synthesis is a technique for modeling multiple spea...

Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

Adapting generic speech recognition models to specific individuals is a ...

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

In this paper, we present a novel method for phoneme-level prosody contr...