Into-TTS : Intonation Template based Prosody Control System

04/04/2022
by   JIhwan Lee, et al.
0

Intonations take an important role in delivering the intention of the speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to the acoustic model training, speech data are automatically grouped into intonation templates by k-means clustering, according to their sentence-final F0 contour. Two proposed modules are added to the end-to-end TTS framework: intonation classifier and intonation encoder. The intonation classifier recommends a suitable intonation template to the given text. The intonation encoder, attached to the text encoder output, synthesizes speech abiding the requested intonation template. Main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in wrapping speech in a requested intonation with improved pitch distance and MOS; and (c) feasibility to future integration between TTS and NLP, TTS being able to utilize contextual information. Audio samples are available at https://srtts.github.io/IntoTTS.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/21/2020

TED: Triple Supervision Decouples End-to-end Speech-to-text Translation

An end-to-end speech-to-text translation (ST) takes audio in a source la...
research
10/09/2021

Using multiple reference audios and style embedding constraints for speech synthesis

The end-to-end speech synthesis model can directly take an utterance as ...
research
02/04/2020

Variational Template Machine for Data-to-Text Generation

How to generate descriptions from structured data organized in tables? E...
research
08/30/2018

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

Although end-to-end text-to-speech (TTS) models such as Tacotron have sh...
research
02/21/2022

CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

The text-based speech editor allows the editing of speech through intuit...
research
03/05/2015

Do We Need More Training Data?

Datasets for training object recognition systems are steadily increasing...
research
11/07/2020

Template Controllable keywords-to-text Generation

This paper proposes a novel neural model for the understudied task of ge...

Please sign up or login with your details

Forgot password? Click here to reset