M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

05/03/2023
by Jinlong Xue, et al.

Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the preceding conversation. However, modeling the conversation comprehensively remains a challenge: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, since acoustic features also carry rich prosodic information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the conversation history and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.
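
The abstract does not specify the encoders behind each module, so the following is only a minimal sketch of the multi-scale, multi-modal idea under stated assumptions, not the M2-CTTS implementation. It assumes mean pooling as a stand-in for the coarse-grained (global) summary, scaled dot-product attention as a stand-in for the fine-grained (local) alignment, and text and acoustic vectors of the same dimension. All function names are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one query over keys (keys double as values)."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def context_features(history, current_tokens):
    """history: list of utterances, each a list of token (or frame) vectors.

    Returns (coarse, fine): one global vector for the whole history, and one
    context vector per token of the current utterance.
    """
    # Coarse-grained (assumed): pool each utterance, then pool across utterances.
    coarse = mean_pool([mean_pool(utterance) for utterance in history])
    # Fine-grained (assumed): each current token attends over all history tokens.
    flat = [vec for utterance in history for vec in utterance]
    fine = [attend(token, flat) for token in current_tokens]
    return coarse, fine


def multi_scale_multi_modal(text_history, audio_history, current_tokens):
    """Concatenate textual and acoustic context at both scales (assumed fusion)."""
    text_coarse, text_fine = context_features(text_history, current_tokens)
    audio_coarse, audio_fine = context_features(audio_history, current_tokens)
    coarse = text_coarse + audio_coarse  # list concatenation, not addition
    fine = [t + a for t, a in zip(text_fine, audio_fine)]
    return coarse, fine
```

In this sketch the coarse vector would condition the whole reply, while the per-token fine vectors could modulate local prosody such as emphasis on individual words. Concatenation is chosen only for simplicity; a learned fusion layer would be a natural alternative.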

