Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

08/11/2020
by Rui Liu, et al.

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech still needs improvement, especially for long sentences, where prosodic phrasing errors occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that the proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
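
The idea of training one model on both the Mel spectrum and phrase breaks can be illustrated with a combined objective. The sketch below is only an assumption about how such a loss might be formed, since the abstract does not give the loss functions or their weighting: it uses mean squared error for the Mel spectrum, cross-entropy for per-token phrase-break labels, and a hypothetical `break_weight` to balance the two. All function and parameter names are illustrative and not taken from the paper.

```python
import math


def mel_loss(pred, target):
    """Mean squared error over all frames and Mel bins (assumed loss form)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def break_loss(logits, labels):
    """Mean cross-entropy over tokens; each label is a class index
    (for example 0 = no break, 1 = phrase break)."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
    return total / len(labels)


def multi_task_loss(mel_pred, mel_target, break_logits, break_labels,
                    break_weight=0.5):
    """Weighted sum of the two task losses; break_weight is hypothetical."""
    return (mel_loss(mel_pred, mel_target)
            + break_weight * break_loss(break_logits, break_labels))


if __name__ == "__main__":
    # Two Mel frames with three bins each, and three input tokens.
    mel_pred = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.4]]
    mel_target = [[0.1, 0.1, 0.3], [0.1, 0.5, 0.5]]
    break_logits = [[2.0, -1.0], [0.3, 0.1], [-0.5, 1.5]]
    break_labels = [0, 0, 1]
    print(multi_task_loss(mel_pred, mel_target, break_logits, break_labels))
```

In a setup like this, the shared encoder would receive gradients from both terms, which is the usual motivation for multi-task training; how the paper actually combines the two objectives should be taken from the full text.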
