Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis

06/11/2021
by   Jingbei Li, et al.

For conversational text-to-speech (TTS) systems, it is vital that the system can adjust the spoken style of synthesized speech according to the content and spoken styles in the historical conversation. However, research on learning spoken styles from historical conversations is still in its infancy: existing work considers only the transcripts of the historical conversation, neglecting the spoken styles in the historical speech itself. Moreover, only global-aspect interactions between speakers are modeled, missing the party-aspect self-interactions within each speaker's own utterances. In this paper, to achieve better spoken style learning for conversational TTS, we propose a spoken style learning approach with multi-modal hierarchical context encoding. The textual information and spoken styles in the historical conversation are processed through multiple hierarchical recurrent neural networks to learn spoken-style-related features in both the global and party aspects. An attention mechanism is further employed to summarize these features into a conversational context encoding. Experimental results demonstrate the effectiveness of the proposed approach, which outperforms a baseline method using a context encoding learned only from transcripts in the global aspect, with the MOS score on the naturalness of synthesized speech increasing from 3.138 to 3.408 and the ABX preference rate exceeding the baseline method by 36.45%.
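The architecture described above — utterance-level encoders feeding separate global and party-level recurrent networks, whose outputs are pooled by attention into a single context vector — can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: all module names, dimensions, and the use of GRUs and additive attention are assumptions for illustration.

```python
# Hypothetical sketch of multi-modal hierarchical context encoding.
# All names and hyperparameters here are assumptions, not from the paper.
import torch
import torch.nn as nn

class HierarchicalContextEncoder(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Utterance-level RNN: encodes each utterance's text/style features
        # (in the multi-modal setting these would combine both modalities).
        self.utterance_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Conversation-level RNN over the interleaved utterance sequence
        # (global aspect: interactions between speakers).
        self.global_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Conversation-level RNN over one speaker's own utterances
        # (party aspect: self-interactions inside each speaker).
        self.party_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Additive attention scorer to summarize all features into one vector.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, utterances, speaker_ids, target_speaker):
        # utterances: (num_utts, num_frames, feat_dim), speaker_ids: (num_utts,)
        _, h = self.utterance_rnn(utterances)           # h: (1, num_utts, hidden)
        utt_vecs = h.squeeze(0)                         # (num_utts, hidden)
        global_feats, _ = self.global_rnn(utt_vecs.unsqueeze(0))
        own = utt_vecs[speaker_ids == target_speaker]   # target speaker's turns
        party_feats, _ = self.party_rnn(own.unsqueeze(0))
        feats = torch.cat([global_feats, party_feats], dim=1).squeeze(0)
        weights = torch.softmax(self.attn(feats), dim=0)  # attention weights
        return (weights * feats).sum(dim=0)             # context encoding

enc = HierarchicalContextEncoder()
utts = torch.randn(6, 20, 64)                # 6 past utterances, 20 frames each
speakers = torch.tensor([0, 1, 0, 1, 0, 1])  # alternating two-party dialogue
ctx = enc(utts, speakers, target_speaker=0)  # one vector conditioning the TTS
```

The resulting `ctx` vector would then condition the TTS acoustic model, letting the synthesized style depend on both what was said and how it was said earlier in the conversation.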

