JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment

05/15/2020
by   Dan Lim, et al.

We propose the Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor, jointly trained without explicit alignments in order to generate an acoustic feature sequence from an input text. In this work, inspired by the recent success of duration-informed networks such as FastSpeech and DurIAN, we further simplify their sequential, two-stage training pipeline into a single training stage. Specifically, we extract phoneme durations from the autoregressive Transformer on the fly during joint training, instead of pre-training the autoregressive model and using it as a phoneme duration extractor. To the best of our knowledge, this is the first implementation to jointly train the feed-forward Transformer without relying on a pre-trained phoneme duration extractor, in a single training pipeline. We evaluate the effectiveness of the proposed model on the publicly available Korean Single Speaker Speech (KSS) dataset, comparing it against baseline text-to-speech (TTS) models trained with ESPnet-TTS.
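The core idea can be sketched roughly as follows. The sketch below is a minimal PyTorch illustration under our own assumptions, not the authors' code: the names ARTeacher, FeedForwardTTS, durations_from_attention, and train_step are hypothetical, and the modules are toy stand-ins for the actual architectures and losses. It only shows the single-stage coupling: the autoregressive model's attention is read at every step, converted to phoneme durations, and immediately used to length-regulate the feed-forward model and to supervise its duration predictor.

```python
# Minimal sketch of single-stage joint training with on-the-fly duration
# extraction. NOT the authors' implementation: ARTeacher, FeedForwardTTS,
# durations_from_attention and train_step are hypothetical toy stand-ins;
# dimensions and losses are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ARTeacher(nn.Module):
    """Toy autoregressive aligner: predicts mel frames from previous frames
    and returns the phoneme-to-frame attention weights."""

    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.phone_proj = nn.Linear(d_model, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, phone_emb, mel_prev):
        q = self.mel_proj(mel_prev)            # queries: previous mel frames
        k = v = self.phone_proj(phone_emb)     # keys/values: phoneme encodings
        ctx, attn = self.attn(q, k, v)         # attn: (B, T_mel, T_phone)
        return self.out(ctx), attn


def durations_from_attention(attn):
    """Count, per phoneme, how many mel frames attend to it most strongly."""
    idx = attn.argmax(dim=-1)                            # (B, T_mel)
    one_hot = F.one_hot(idx, num_classes=attn.size(-1))  # (B, T_mel, T_phone)
    return one_hot.sum(dim=1).float()                    # (B, T_phone)


class FeedForwardTTS(nn.Module):
    """Toy feed-forward acoustic model with a duration predictor."""

    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.decoder = nn.Linear(d_model, n_mels)

    def forward(self, phone_emb, durations):
        log_dur_pred = self.duration_predictor(phone_emb).squeeze(-1)
        # Length regulation: repeat each phoneme encoding by its duration.
        expanded = [torch.repeat_interleave(e, d.long().clamp(min=1), dim=0)
                    for e, d in zip(phone_emb, durations)]
        expanded = nn.utils.rnn.pad_sequence(expanded, batch_first=True)
        return self.decoder(expanded), log_dur_pred


def train_step(teacher, student, phone_emb, mel, optimizer):
    """One joint step: durations come from the AR attention on the fly,
    so no separate duration-extractor pre-training stage is needed."""
    mel_prev = torch.cat([torch.zeros_like(mel[:, :1]), mel[:, :-1]], dim=1)
    mel_ar, attn = teacher(phone_emb, mel_prev)
    durations = durations_from_attention(attn.detach())
    mel_ff, log_dur_pred = student(phone_emb, durations)
    T = min(mel_ff.size(1), mel.size(1))
    loss = (F.l1_loss(mel_ar, mel)                       # AR reconstruction
            + F.l1_loss(mel_ff[:, :T], mel[:, :T])       # feed-forward reconstruction
            + F.mse_loss(log_dur_pred, torch.log(durations + 1)))  # duration loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    teacher, student = ARTeacher(), FeedForwardTTS()
    opt = torch.optim.Adam(
        list(teacher.parameters()) + list(student.parameters()), lr=1e-4)
    phones = torch.randn(2, 12, 256)   # dummy phoneme encodings
    mels = torch.randn(2, 100, 80)     # dummy target mel-spectrograms
    print(train_step(teacher, student, phones, mels, opt))
```

In this sketch the attention-derived durations are detached before being used for length regulation and as duration-predictor targets, so the two models are coupled only through the extracted alignments within one optimizer step; this is one plausible way to realize the single-stage pipeline described in the abstract, not a claim about the paper's exact design.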
