Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS

05/28/2023
by Sewade Ogun, et al.
Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features (e.g., Mel-spectrograms) given the input tokens, and to sample from this distribution to generate diverse utterances. However, in the zero-shot multi-speaker TTS scenario, the generated utterances lack diversity and naturalness. In this paper, we propose to improve the diversity of utterances by explicitly learning the distribution of each speaker's fundamental frequency sequences (pitch contours) during training with a stochastic flow-based pitch predictor, then conditioning the model on generated pitch contours during inference. Experimental results show that a Glow-TTS model with explicit stochastic pitch prediction generates significantly more natural and diverse speech than both a Glow-TTS baseline and an improved Glow-TTS model that uses a stochastic duration predictor.
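
To make the idea of a stochastic flow-based pitch predictor concrete, the following is a minimal sketch and not the architecture used in the paper. It assumes the simplest possible flow, a single per-frame affine transform, with fixed hypothetical values for `mean` and `log_scale`; in a real system these would presumably come from a network conditioned on the text and a speaker embedding. All function names are illustrative. The sketch shows the two directions a flow provides: an exact log-likelihood for training, and sampling of different pitch contours for the same input at inference.

```python
import math
import random


def affine_flow_forward(f0, mean, log_scale):
    """Map a pitch contour to latent space: z = (f0 - mean) / exp(log_scale).

    Returns the latent sequence and log|det dz/df0| of the transform.
    """
    z = [(x - m) * math.exp(-s) for x, m, s in zip(f0, mean, log_scale)]
    log_det = -sum(log_scale)
    return z, log_det


def affine_flow_inverse(z, mean, log_scale):
    """Map latent noise back to a pitch contour: f0 = mean + exp(log_scale) * z."""
    return [m + math.exp(s) * zi for zi, m, s in zip(z, mean, log_scale)]


def log_likelihood(f0, mean, log_scale):
    """Exact log p(f0) via change of variables with a standard normal prior."""
    z, log_det = affine_flow_forward(f0, mean, log_scale)
    log_prior = sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)
    return log_prior + log_det


def sample_pitch(mean, log_scale, rng, temperature=1.0):
    """Draw one pitch contour by pushing Gaussian noise through the inverse flow."""
    z = [rng.gauss(0.0, temperature) for _ in mean]
    return affine_flow_inverse(z, mean, log_scale)


# Hypothetical conditioning values (per-frame pitch in Hz); an assumption
# made for illustration, not data from the paper.
rng = random.Random(0)
mean = [120.0, 125.0, 130.0, 128.0]
log_scale = [math.log(10.0)] * 4

contour_a = sample_pitch(mean, log_scale, rng)
contour_b = sample_pitch(mean, log_scale, rng)  # a different contour, same input
```

Because each call draws fresh noise, repeated sampling yields different contours for the same conditioning, which is the source of diversity the abstract refers to. Setting `temperature=0.0` collapses the sketch to a deterministic predictor that always returns `mean`.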
