Data augmentation is a ubiquitous technique used to provide robustness t...
This paper introduces WaveGrad 2, a non-autoregressive generative model ...
Supervised neural network training has led to significant progress on
si...
We describe a sequence-to-sequence neural network which can directly gen...
We propose a multitask training method for attention-based end-to-end sp...
This paper introduces WaveGrad, a conditional model for waveform generat...
In recent years, rapid progress has been made on the problem of
single-c...
This paper proposes a hierarchical, fine-grained and interpretable laten...
Recent neural text-to-speech (TTS) models with fine-grained latent featu...
We present a multispeaker, multilingual text-to-speech (TTS) synthesis m...
We present an attention-based sequence-to-sequence neural network which ...
We describe Parrotron, an end-to-end-trained speech-to-speech conversion...
This paper introduces a new speech corpus called "LibriTTS" designed for...
Lingvo is a Tensorflow framework offering a complete solution for
collab...
Attention-based sequence-to-sequence models for speech recognition joint...
We consider the task of unsupervised extraction of meaningful latent
rep...
End-to-end Speech Translation (ST) models have many potential advantages...
This paper proposes a neural end-to-end text-to-speech (TTS) model which...
In this paper, we present a novel system that separates the voice of a t...
Texture synthesis techniques based on matching the Gram matrix of featur...
We describe a neural network-based system for text-to-speech (TTS) synth...
We present an extension to the Tacotron speech synthesis architecture th...
Inspired by recent work on neural network image generation which rely on...
This paper describes Tacotron 2, a neural network architecture for speec...
Attention-based encoder-decoder architectures such as Listen, Attend, an...
Training a conventional automatic speech recognition (ASR) system to sup...
Recurrent neural network models with an attention mechanism have proven ...
A text-to-speech synthesis system typically consists of multiple stages,...
We present a recurrent encoder-decoder deep neural network architecture ...
Convolutional Neural Networks (CNNs) have proven very effective in image...