FastPitch: Parallel Text-to-speech with Pitch Prediction

06/11/2020
by   Adrian Łańcucki, et al.
0

We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and generates speech that could be further controlled with predicted contours. FastPitch can thus change the perceived emotional state of the speaker or put emphasis on certain lexical units. We find that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the quality of synthesized speech, making it comparable to state-of-the-art. It does not introduce an overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture of FastSpeech with a similar speed of mel-scale spectrogram synthesis, orders of magnitude faster than real-time.

READ FULL TEXT

page 6

page 7

research
04/16/2021

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

We propose TalkNet, a non-autoregressive convolutional neural model for ...
research
10/20/2017

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

We present Deep Voice 3, a fully-convolutional attention-based neural te...
research
12/17/2020

Parallel WaveNet conditioned on VAE latent vectors

Recently the state-of-the-art text-to-speech synthesis systems have shif...
research
03/15/2022

Text-free non-parallel many-to-many voice conversion using normalising flows

Non-parallel voice conversion (VC) is typically achieved using lossy rep...
research
01/09/2021

Emotion transplantation through adaptation in HMM-based speech synthesis

This paper proposes an emotion transplantation method capable of modifyi...
research
06/08/2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) model (e.g., Transformer TTS <cit...
research
07/18/2021

Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors

Listening in noisy environments can be difficult even for individuals wi...

Please sign up or login with your details

Forgot password? Click here to reset