Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

08/31/2023
by Weiqin Li, et al.

The spontaneous behavior that often occurs in conversations makes speech sound more human-like than reading-style speech. However, synthesizing spontaneous-style speech is challenging because high-quality spontaneous datasets are scarce and labeling spontaneous behavior is costly. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavior labels. During semi-supervised learning, both text and speech information are used to detect spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationships among the sentences in a conversation. Experimental results indicate that the proposed method achieves superior expressive speech synthesis performance, with the ability to model spontaneous behavior in spontaneous-style speech and to predict reasonable spontaneous behavior from text.
