UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

12/15/2022
by   Hirofumi Inaguma, et al.

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, offers faster inference and a simpler pipeline than cascaded approaches. We present UnitY, a novel two-pass direct S2ST architecture that first generates textual representations and subsequently predicts discrete acoustic units. We improve model performance through subword prediction in the first-pass decoder, an advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on a self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods also boost performance when predicting a spectrogram in the second pass; however, predicting discrete units achieves a 2.51x decoding speed-up over that setting.
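
The two-pass control flow described in the abstract can be illustrated with a minimal sketch. This is not the UnitY implementation: the function names, the shared `EOS` id, greedy search, and the toy step functions are all assumptions made for illustration. For simplicity the second pass here is conditioned on hard first-pass tokens, whereas the abstract only says that the model generates "textual representations" first.

```python
from typing import Any, Callable, List, Sequence, Tuple

EOS = 0  # hypothetical end-of-sequence id shared by both toy vocabularies

StepFn = Callable[[Any, List[int]], int]


def greedy_decode(step_fn: StepFn, context: Any, max_len: int) -> List[int]:
    """Call step_fn(context, prefix) repeatedly until it returns EOS or max_len is hit."""
    prefix: List[int] = []
    for _ in range(max_len):
        token = step_fn(context, prefix)
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def two_pass_translate(
    speech_features: Sequence[int],
    text_step: StepFn,
    unit_step: StepFn,
    max_text_len: int = 50,
    max_unit_len: int = 500,
) -> Tuple[List[int], List[int]]:
    """First pass: decode text tokens. Second pass: decode discrete units given the text."""
    text_tokens = greedy_decode(text_step, speech_features, max_text_len)
    # Assumption: the second pass sees the source features and the first-pass output.
    unit_context = (speech_features, text_tokens)
    units = greedy_decode(unit_step, unit_context, max_unit_len)
    return text_tokens, units


# Toy stand-ins for trained decoders (purely illustrative, not learned models).
def toy_text_step(features: Sequence[int], prefix: List[int]) -> int:
    i = len(prefix)
    # Emit one "subword" per input frame, then stop.
    return features[i] + 1 if i < len(features) else EOS


def toy_unit_step(context: Any, prefix: List[int]) -> int:
    _, text_tokens = context
    i = len(prefix)
    # Emit two "units" per text token, then stop.
    return text_tokens[i // 2] * 10 if i < 2 * len(text_tokens) else EOS


text, units = two_pass_translate([1, 2, 3], toy_text_step, toy_unit_step)
print(text, units)
```

In this sketch the unit sequence is longer than the text sequence, which mirrors the general situation in which acoustic units outnumber subwords; the intermediate text output is also available as a by-product of the first pass.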


research
07/12/2021

Direct speech-to-speech translation with discrete units

We present a direct speech-to-speech translation (S2ST) model that trans...
research
09/14/2023

Direct Text to Speech Translation System using Acoustic Units

This paper proposes a direct text to speech translation system using dis...
research
04/06/2022

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

Direct speech-to-speech translation (S2ST) models suffer from data scarc...
research
05/15/2023

Back Translation for Speech-to-text Translation Without Transcripts

The success of end-to-end speech-to-text translation (ST) is often achie...
research
03/31/2023

Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

Conformer models maintain a large number of internal states, the vast ma...
research
09/27/2022

Direct Speech Translation for Automatic Subtitling

Automatic subtitling is the task of automatically translating the speech...
research
05/25/2022

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Direct speech-to-speech translation (S2ST) systems leverage recent progr...
