Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

06/05/2022
by Ziyue Jiang, et al.

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and to the many languages spoken worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online dictionary (prior information that already exists for natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module that matches the semantic patterns of the input text sequence against the prior semantics in the dictionary and obtains the corresponding pronunciations. The S2PA module can be trained jointly with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in pronunciation accuracy and improves the prosody modeling of TTS systems. Further analyses with different linguistic encoders demonstrate that each design choice in Dict-TTS is effective. Audio samples are available at <https://dicttts.github.io/DictTTS-Demo/>.
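
The matching step that the abstract attributes to S2PA can be pictured with a toy sketch. This is not the authors' implementation: the function `s2pa_lookup`, the three-dimensional vectors, and the English "bass" entries are all hypothetical, chosen only to show how attention over dictionary senses could select a pronunciation. A context vector for the polyphone acts as the query, each dictionary sense supplies a key (an embedding of its gloss), and the pronunciation attached to the sense is the value.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def s2pa_lookup(context_vec, entries):
    """Hypothetical helper: attend over dictionary senses of one polyphone.

    context_vec: semantic embedding of the word in its sentence (query).
    entries: list of (pronunciation, gloss_vec) pairs from the dictionary.
    Returns the pronunciation with the highest weight and all weights.
    """
    scale = math.sqrt(len(context_vec))  # scaled dot-product attention
    scores = [dot(context_vec, gloss_vec) / scale for _, gloss_vec in entries]
    weights = softmax(scores)
    best = max(range(len(entries)), key=weights.__getitem__)
    return entries[best][0], weights


# Toy dictionary for the English polyphone "bass" (assumed vectors,
# ARPAbet pronunciations).
BASS_ENTRIES = [
    ("B EY1 S", [1.0, 0.0, 0.2]),  # sense: low-pitched sound or instrument
    ("B AE1 S", [0.0, 1.0, 0.1]),  # sense: a kind of fish
]

music_context = [0.9, 0.1, 0.0]  # assumed embedding of "bass" in "bass guitar"
fish_context = [0.1, 0.9, 0.0]   # assumed embedding of "bass" in "caught a bass"

music_pron, music_weights = s2pa_lookup(music_context, BASS_ENTRIES)
fish_pron, fish_weights = s2pa_lookup(fish_context, BASS_ENTRIES)
```

Because the attention weights are a smooth function of the embeddings, a module of this shape could in principle be trained from the TTS loss alone, which is consistent with the abstract's claim that no annotated phoneme labels are needed. In a real model the embeddings would be learned rather than fixed by hand.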


research
07/03/2022

M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

End-to-end speech-to-text translation models are often initialized with ...
research
05/24/2023

LMs with a Voice: Spoken Language Modeling beyond Speech Tokens

We present SPECTRON, a novel approach to adapting pre-trained language m...
research
10/29/2018

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language

End-to-end speech synthesis is a promising approach that directly conver...
research
04/15/2021

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on n...
research
11/04/2018

Towards Unsupervised Speech-to-Text Translation

We present a framework for building speech-to-text translation (ST) syst...
research
09/16/2015

amLite: Amharic Transliteration Using Key Map Dictionary

amLite is a framework developed to map ASCII transliterated Amharic text...
research
09/30/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 ...
