FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis

10/27/2022
by   Yifan Hu, et al.

Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. Prior work has used the correlation between the current utterance and the dialogue history at the utterance level to improve the expressiveness of synthesized speech. However, fine-grained information in the dialogue history at the word level also has an important impact on the prosodic expression of an utterance, and it has not been well studied. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders to exploit word-level and utterance-level context dependencies. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. The experimental results show that the proposed method outperforms all baselines and generates more expressive speech that is contextually appropriate. We release the source code at: https://github.com/walker-hyf/FCTalker.
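
To make the distinction between the two context granularities concrete, the following is a minimal, illustrative sketch and not the FCTalker implementation. It assumes toy word vectors given as plain Python lists. As stand-ins for the learned encoders, it uses mean pooling for the coarse-grained (utterance-level) context and scaled dot-product attention over history words for the fine-grained (word-level) context; the abstract states that the actual fine-grained encoder is built on a dialogue BERT model. All function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def coarse_grained_context(history_utterances):
    """Utterance-level context: pool each history utterance into one vector,
    then pool across utterances (assumption: mean pooling stands in for a
    learned utterance-level encoder)."""
    return mean_pool([mean_pool(utt) for utt in history_utterances])


def fine_grained_context(current_words, history_words):
    """Word-level context: each current word attends over every history word
    (assumption: plain scaled dot-product attention, not dialogue BERT)."""
    dim = len(history_words[0])
    contexts = []
    for query in current_words:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in history_words])
        contexts.append(
            [sum(w * key[i] for w, key in zip(weights, history_words)) for i in range(dim)]
        )
    return contexts


def fuse_contexts(current_words, history_utterances):
    """Concatenate each current word vector with its fine-grained context and
    the shared coarse-grained context."""
    history_words = [word for utt in history_utterances for word in utt]
    fine = fine_grained_context(current_words, history_words)
    coarse = coarse_grained_context(history_utterances)
    return [word + f + coarse for word, f in zip(current_words, fine)]


# Toy dialogue: two history utterances (2 words and 1 word), 2-dim word vectors.
history = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
current = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_contexts(current, history)
```

In this sketch every word of the current utterance receives the same coarse-grained vector but its own fine-grained vector, which is the property the abstract attributes to modeling both granularities at once.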
