Translatotron 2: Robust direct speech-to-speech translation

07/19/2021
by   Ye Jia, et al.
0

We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects all the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pause. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.

READ FULL TEXT

page 5

page 8

research
04/12/2019

Direct speech-to-speech translation with a sequence-to-sequence model

We present an attention-based sequence-to-sequence neural network which ...
research
07/20/2021

SVSNet: An End-to-end Speaker Voice Similarity Assessment Model

Neural evaluation metrics derived for numerous speech generation tasks h...
research
04/07/2022

Correcting Misproducted Speech using Spectrogram Inpainting

Learning a new language involves constantly comparing speech productions...
research
06/09/2022

Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

In this paper, we propose a neural end-to-end system for voice preservin...
research
04/08/2019

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

We describe Parrotron, an end-to-end-trained speech-to-speech conversion...
research
09/20/2023

TRAVID: An End-to-End Video Translation Framework

In today's globalized world, effective communication with people from di...
research
05/22/2023

Can we hear physical and social space together through prosody?

When human listeners try to guess the spatial position of a speech sourc...

Please sign up or login with your details

Forgot password? Click here to reset