Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

04/08/2019
by Fadi Biadsy, et al.

We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker, regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
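
The abstract outlines the architecture: an encoder over the input spectrogram, a spectrogram decoder that emits the target-voice spectrogram, an auxiliary phoneme decoder, and a separate vocoder for waveform synthesis. The sketch below illustrates that encoder/dual-decoder layout in PyTorch; the class name ParrotronSketch, all layer sizes, and the frame-synchronous LSTM decoder (standing in for the paper's attention-based decoder) are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class ParrotronSketch(nn.Module):
    """Minimal sketch of an encoder / dual-decoder spectrogram-to-spectrogram model.
    Layer sizes and the simple frame-synchronous LSTM decoder are illustrative
    assumptions; the paper uses an attention-based, Tacotron-style decoder."""

    def __init__(self, n_mels=80, enc_dim=256, n_phonemes=100):
        super().__init__()
        # Encoder: bidirectional LSTM stack over input spectrogram frames.
        self.encoder = nn.LSTM(n_mels, enc_dim, num_layers=3,
                               bidirectional=True, batch_first=True)
        # Spectrogram decoder: predicts target-voice spectrogram frames.
        self.spec_decoder = nn.LSTM(2 * enc_dim, enc_dim, num_layers=2,
                                    batch_first=True)
        self.spec_proj = nn.Linear(enc_dim, n_mels)
        # Auxiliary phoneme decoder: here reduced to a per-frame phoneme
        # classifier over encoder outputs, used only as a training-time loss.
        self.phone_proj = nn.Linear(2 * enc_dim, n_phonemes)

    def forward(self, src_spec):
        enc_out, _ = self.encoder(src_spec)        # (B, T, 2*enc_dim)
        dec_out, _ = self.spec_decoder(enc_out)    # (B, T, enc_dim)
        pred_spec = self.spec_proj(dec_out)        # predicted target spectrogram
        phone_logits = self.phone_proj(enc_out)    # auxiliary phoneme logits
        return pred_spec, phone_logits
```

In training, the predicted spectrogram would be compared against the target speaker's spectrogram (e.g., with an L1 or L2 loss) while the phoneme logits carry an auxiliary cross-entropy loss; the vocoder that converts the predicted spectrogram into a time-domain waveform is a separate component.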
