End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

06/24/2022
by Kentaro Mitsui, et al.

Recent text-to-speech (TTS) systems have achieved quality comparable to that of human speech; however, their application to spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. The proposed dialogue TTS is then trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; each introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from the dialogue history. During inference, the speaking style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
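The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the training objective or the predictor architecture, so the code assumes the common VAE formulation (diagonal Gaussian posterior, standard normal prior, reparameterized sampling) for the utterance-level style latent, and every function name below (`sample_style`, `kl_to_standard_normal`, `predict_style`, `synthesize`) is hypothetical. The style predictor in particular is replaced by a plain average over past style vectors purely to show where its output enters the pipeline.

```python
import math
import random
from typing import Dict, List, Sequence


def sample_style(mu: Sequence[float], log_var: Sequence[float],
                 rng: random.Random) -> List[float]:
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I).

    Assumption: the style encoder outputs a diagonal Gaussian (mu, log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float],
                          log_var: Sequence[float]) -> float:
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), the usual VAE regularizer.

    Assumption: a standard normal prior (plain VAE case, not the GMVAE case).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def predict_style(history: List[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the second-stage style predictor.

    Averages the style vectors of previous utterances; the real predictor
    is a trained model over the dialogue history.
    """
    dim = len(history[0])
    return [sum(h[i] for h in history) / len(history) for i in range(dim)]


def synthesize(text: str, style: Sequence[float]) -> Dict[str, object]:
    """Placeholder for the first-stage TTS model conditioned on a style vector.

    Returns the conditioning inputs instead of audio.
    """
    return {"text": text, "style": list(style)}


# Inference: predict a style from the dialogue history, then condition TTS on it.
history_styles = [[1.0, 2.0], [3.0, 4.0]]
request = synthesize("Sounds good to me.", predict_style(history_styles))
```

In this sketch the only link between the two stages is the style vector: the first stage learns what the vector means, and the second stage learns to choose it from context.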

