Differentiable Duration Modeling for End-to-End Text-to-Speech

03/21/2022
by Bac Nguyen, et al.

Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis. However, such models typically require external alignment models, which are not jointly trained with the decoder and are therefore not necessarily optimized for it. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce a direct text-to-waveform TTS model that produces raw audio as output instead of relying on neural vocoding. Our model learns to perform high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
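The abstract does not spell out the soft-duration mechanism, so the sketch below is not the authors' method. It only illustrates the general idea of turning per-token durations into a soft, monotonic token-to-frame alignment that stays differentiable with respect to the durations, using Gaussian-kernel interpolation as a stand-in technique. The function names, the `sigma` parameter, and the example durations are all hypothetical, and NumPy is used purely to show the arithmetic (a real model would use an autodiff framework).

```python
import numpy as np

def soft_alignment(durations, num_frames, sigma=1.0):
    """Soft alignment weights of shape (num_frames, num_tokens).

    Stand-in technique (assumption): Gaussian-kernel interpolation around
    each token's center position, normalized over tokens with a softmax.
    """
    durations = np.asarray(durations, dtype=np.float64)
    ends = np.cumsum(durations)              # end position of each token, in frames
    centers = ends - 0.5 * durations         # center position of each token
    frames = np.arange(num_frames, dtype=np.float64) + 0.5  # frame midpoints
    # Squared distance between every frame and every token center.
    logits = -((frames[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def total_duration_loss(durations, target_frames):
    """Absolute gap between the summed durations and the ground-truth length."""
    return abs(float(np.sum(durations)) - float(target_frames))

# Hypothetical predicted durations (in frames) for a three-token input.
durations = [2.0, 3.5, 1.5]
align = soft_alignment(durations, num_frames=7)
loss = total_duration_loss(durations, target_frames=7)
```

Each row of `align` is a distribution over input tokens for one output frame, so multiplying it by a matrix of token encodings yields frame-level features. The total-duration term mirrors the abstract's "matching the total ground-truth duration": only the utterance length is supervised, not the per-token durations.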

research
06/05/2020

End-to-End Adversarial Text-to-Speech

Modern text-to-speech synthesis pipelines typically involve multiple pro...
research
06/11/2021

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Several recent end-to-end text-to-speech (TTS) models enabling single-st...
research
03/26/2021

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

This paper introduces Parallel Tacotron 2, a non-autoregressive neural t...
research
11/19/2021

Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis (DWTS) is a technique for neural audi...
research
08/22/2016

Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

This paper proposes a new approach to duration modelling for statistical...
research
06/05/2022

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

Understanding the underlying relationship between tongue and oropharynge...
research
06/08/2020

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Advanced text to speech (TTS) models such as FastSpeech can synthesize s...
