Building a mixed-lingual neural TTS system with only monolingual data

by Liumeng Xue, et al.

When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is synthesizing Chinese utterances that contain embedded English words or phrases. This paper examines the problem in the encoder-decoder framework when only monolingual data from the target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance, and naturalness. We start the investigation with an Average Voice Model built from multi-speaker monolingual data, i.e., Mandarin and English data. Building on that model, we examine speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and we study the choice of data for model training. We report the findings and discuss the challenges of building a mixed-lingual TTS system with only monolingual data.
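To make the two ingredients named above more concrete, here is a minimal sketch of how a speaker embedding and a phoneme embedding might be combined at the input of an encoder-decoder model. It is an illustration under stated assumptions, not the architecture used in the paper: the joint phoneme inventory, the table sizes, the random initialization, and the function name `condition_on_speaker` are all hypothetical. The only idea it shows is that one speaker vector is attached to every phoneme of a mixed Mandarin-English sequence, so the speaker condition does not change at a language switch.

```python
import random

random.seed(0)

# Hypothetical joint phoneme inventory covering both languages
# (pinyin-style units with tones, ARPAbet-style units with stress).
PHONEMES = ["<pad>", "n", "i3", "h", "ao3", "HH", "AH0", "L", "OW1"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}

PHONEME_DIM = 8   # assumed size of each phoneme embedding
SPEAKER_DIM = 4   # assumed size of each speaker embedding
NUM_SPEAKERS = 3  # assumed number of training speakers


def make_table(rows, dim):
    """Randomly initialized embedding table; a real model would learn it."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


phoneme_table = make_table(len(PHONEMES), PHONEME_DIM)
speaker_table = make_table(NUM_SPEAKERS, SPEAKER_DIM)


def condition_on_speaker(phonemes, speaker_id):
    """Look up each phoneme embedding and append the same speaker vector."""
    speaker_vec = speaker_table[speaker_id]
    return [phoneme_table[PHONEME_TO_ID[p]] + speaker_vec for p in phonemes]


# A mixed-lingual input: Mandarin phonemes followed by English phonemes.
mixed = ["n", "i3", "h", "ao3", "HH", "AH0", "L", "OW1"]
features = condition_on_speaker(mixed, speaker_id=1)
```

Each row of `features` has `PHONEME_DIM + SPEAKER_DIM` values, and the trailing `SPEAKER_DIM` values are identical across the Mandarin and English positions. Concatenation is only one possible conditioning choice; the abstract does not say how the embeddings are injected.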




