Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

11/04/2022
by Detai Xin, et al.

We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, namely preceding acoustic context and bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, enabling the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights into the choices of context (modality, lateral information, and length) for audiobook TTS that have not previously been discussed in the literature.
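
To make the distinction between the two context types concrete, the following is a minimal sketch of how context sentences might be selected for one target sentence. It is not the authors' implementation: the function name `select_context`, the parameters `acoustic_len` and `textual_len`, and the choice to exclude the target sentence from its own textual context are all assumptions made for illustration. It only shows that acoustic context is taken from preceding sentences, while textual context is taken from both sides.

```python
from typing import List, Sequence, Tuple


def select_context(
    sentences: Sequence[str],
    index: int,
    acoustic_len: int = 2,
    textual_len: int = 2,
) -> Tuple[List[int], List[int]]:
    """Return (acoustic, textual) context indices for sentences[index].

    Hypothetical helper, not from the paper.
    - Acoustic context is unilateral: only the preceding sentences.
    - Textual context is bilateral: sentences before and after the target.
    The target sentence itself is excluded here (an assumption).
    """
    if not 0 <= index < len(sentences):
        raise IndexError("index out of range")

    # Preceding sentences only, clipped at the start of the book.
    acoustic = list(range(max(0, index - acoustic_len), index))

    # Sentences on both sides, clipped at both ends of the book.
    lo = max(0, index - textual_len)
    hi = min(len(sentences), index + textual_len + 1)
    textual = [i for i in range(lo, hi) if i != index]

    return acoustic, textual


if __name__ == "__main__":
    book = ["s0", "s1", "s2", "s3", "s4"]
    print(select_context(book, 2, acoustic_len=2, textual_len=1))
```

In a full system, the selected indices would be used to look up the audio of the preceding sentences and the text of the surrounding ones before passing them to the respective context encoders. Varying the two length parameters corresponds to the context-length choice mentioned in the abstract.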


