A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

08/03/2022
by Qibing Bai, et al.

In human speech, a speaker's attitude cannot be fully expressed by textual content alone; it is also carried by intonation. Declarative questions are common in everyday Cantonese conversation, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems cannot synthesize rising intonation for these sentences because semantic information is lost. Although it has become more common to complement such systems with additional language models, their performance in modeling rising intonation has not been well studied. In this paper, we propose to complement the Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach achieves the best generalization performance and feasibility.
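
As a rough illustration of the idea in the abstract, the sketch below shows one way a separately trained statement/question classifier could feed a sentence-type label to a TTS front end as an extra conditioning feature. It is not the authors' implementation. All names (`TTSInput`, `toy_classifier`, `build_tts_input`) are hypothetical, and the classifier is a trivial punctuation rule standing in for the BERT-based model, which the abstract does not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, List

# Sentence-type labels (assumed encoding, not taken from the paper).
STATEMENT = 0
QUESTION = 1


@dataclass
class TTSInput:
    """Hypothetical container for what a TTS acoustic model would receive."""
    tokens: List[str]    # character-level tokens of the input text
    sentence_type: int   # STATEMENT or QUESTION, used as a conditioning feature


def toy_classifier(text: str) -> int:
    """Placeholder for the BERT-based statement/question classifier.

    Assumption: a real system would run a fine-tuned language model here.
    This stand-in only checks for a trailing question mark.
    """
    return QUESTION if text.rstrip().endswith(("?", "？")) else STATEMENT


def build_tts_input(text: str,
                    classifier: Callable[[str], int] = toy_classifier) -> TTSInput:
    """Run the classifier first, then attach its label to the TTS input.

    Keeping the classifier as a separate callable mirrors the idea of
    training it apart from the TTS model and combining the two at inference.
    """
    label = classifier(text)
    tokens = [ch for ch in text if not ch.isspace()]
    return TTSInput(tokens=tokens, sentence_type=label)
```

In a setup like this, the TTS model would map `sentence_type` to a learned embedding and use it to decide whether to produce rising intonation. Because the classifier is passed in as a plain callable, it can be swapped or retrained without touching the synthesis side.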

research
08/20/2018

Multimodal speech synthesis architecture for unsupervised speaker adaptation

This paper proposes a new architecture for speaker adaptation of multi-s...
research
09/19/2023

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Speaker diarization has gained considerable attention within speech proc...
research
06/13/2023

PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling

Although text-to-speech (TTS) systems have significantly improved, most ...
research
05/06/2021

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

Whereas much of the success of the current generation of neural language...
research
12/12/2022

Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features

Speech-to-speech translation directly translates a speech utterance to a...
research
09/09/2019

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Text-to-speech systems are typically evaluated on single sentences. When...
