DiscreTalk: Text-to-Speech as a Machine Translation Problem

05/12/2020
by Tomoki Hayashi, et al.

This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform to a sequence of discrete symbols, and the Transformer-NMT model is then trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully data-driven manner, we do not need to tune the feature-extraction hyperparameters required by conventional E2E-TTS models. Because the targets are discrete symbols, we can apply various techniques developed for NMT and automatic speech recognition (ASR), such as beam search, subword units, and fusion with a language model. Furthermore, we can avoid the over-smoothing problem of predicted features, which is one of the common issues in TTS. The experimental evaluation with the JSUT corpus shows that the proposed method outperforms the conventional Transformer-TTS model with a non-autoregressive neural vocoder in naturalness, achieving performance comparable to the reconstruction of the VQ-VAE model.
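To make the discretization step concrete, the following is a minimal sketch of nearest-neighbor vector quantization, the operation at the core of a VQ-VAE bottleneck. It is not the authors' implementation: in the actual model the encoder, decoder, and codebook are learned from speech waveforms, whereas here the codebook and frames are toy values, and the names `quantize`, `dequantize`, `codebook`, and `frames` are hypothetical. The resulting integer sequence is the kind of target an NMT-style model could be trained to predict from text.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous frame to the index of its nearest codebook vector."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(symbols: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete symbol."""
    return [list(codebook[k]) for k in symbols]


# Toy values (assumptions for illustration only): a 3-entry codebook of
# 2-dimensional vectors and four "encoder output" frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.4]]

symbols = quantize(frames, codebook)  # discrete symbol sequence
reconstructed = dequantize(symbols, codebook)
print(symbols)
```

Because each frame becomes a single integer, sequence-level tools such as beam search or subword segmentation can operate on `symbols` exactly as they would on text tokens.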


