Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

10/28/2022
by   Yuma Shirahata, et al.

Several fully end-to-end text-to-speech (TTS) models have been proposed that outperform cascade models, in which the acoustic model and the vocoder are trained separately. However, they often generate unstable pitch contours with audible artifacts when the dataset contains emotional attributes, i.e., a large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
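To illustrate the idea of a sample-level sinusoidal source derived from frame-level pitch and voicing flags, here is a minimal sketch in plain Python. It is not the authors' periodicity generator: the function name `sine_source`, the `hop_length` and `sample_rate` defaults, the nearest-neighbor upsampling, and the choice to emit zeros for unvoiced frames are all assumptions made for illustration.

```python
import math


def sine_source(f0_frames, voiced_flags, hop_length=256, sample_rate=24000):
    """Turn frame-level F0 (Hz) and voicing flags into a sample-level sine wave.

    Assumptions (not taken from the paper): each frame is held constant for
    `hop_length` samples, and unvoiced frames produce silence.
    """
    if len(f0_frames) != len(voiced_flags):
        raise ValueError("f0_frames and voiced_flags must have the same length")

    samples = []
    phase = 0.0  # running phase keeps the sinusoid continuous across frames
    for f0, voiced in zip(f0_frames, voiced_flags):
        for _ in range(hop_length):
            if voiced and f0 > 0:
                # Advance the phase by one sample's worth of the current F0.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                samples.append(math.sin(phase))
            else:
                samples.append(0.0)
    return samples


# One voiced frame at 100 Hz followed by one unvoiced frame.
source = sine_source([100.0, 0.0], [True, False], hop_length=240, sample_rate=24000)
print(len(source))
```

Accumulating the phase, rather than computing each frame's sine independently, avoids discontinuities at frame boundaries when the pitch changes. In a model of this kind, such a signal would be passed to the waveform decoder as an additional conditioning input.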

research
03/31/2022

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

In neural text-to-speech (TTS), two-stage system or a cascade of separat...
research
02/24/2023

PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS

Previous pitch-controllable text-to-speech (TTS) models rely on directly...
research
10/17/2021

VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

In this paper, we propose VISinger, a complete end-to-end high-quality s...
research
11/06/2020

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

We describe a sequence-to-sequence neural network which can directly gen...
research
11/15/2017

Emotional End-to-End Neural Speech Synthesizer

In this paper, we introduce an emotional speech synthesizer based on the...
research
07/30/2018

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Generating versatile and appropriate synthetic speech requires control o...
research
07/19/2018

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

In this work, we propose an alternative solution for parallel wave gener...
