Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

09/09/2021
by   Hirofumi Inaguma, et al.

This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. E2E-ST models have several advantages over traditional cascade systems, such as reduced inference latency. However, conventional autoregressive (AR) decoding methods are still not fast enough because each token is generated incrementally. In contrast, NAR models accelerate decoding by generating multiple tokens in parallel on the basis of a token-wise conditional independence assumption. We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of a shared encoder. The auxiliary shallow AR decoder selects the best hypothesis by rescoring multiple candidates generated by the NAR decoder in parallel (parallel AR rescoring). We adopt a conditional masked language model (CMLM) and a connectionist temporal classification (CTC)-based model as the NAR decoder for Orthros, referred to as Orthros-CMLM and Orthros-CTC, respectively. We also propose two training methods to enhance the CMLM decoder. Experimental evaluations on three benchmark datasets with six language directions demonstrated that Orthros achieved large improvements in translation quality with very small overhead compared with the baseline NAR model. Moreover, the Conformer encoder architecture enabled large quality improvements, especially for the CTC-based model. Orthros-CTC with the Conformer encoder increased the decoding speed by 3.63x on CPU while achieving translation quality comparable to that of an AR model.
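The core decoding idea, rescoring several NAR-generated candidates with a shallow AR decoder and keeping the best one, can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: nar_generate and ar_score are hypothetical callables standing in for the NAR decoder and the auxiliary AR decoder, both operating on the shared encoder output.

    # Minimal sketch of parallel AR rescoring (illustrative only, not the authors' code).
    # Hypothetical interfaces: `nar_generate` returns several candidate token sequences
    # from one parallel NAR pass (e.g., for different target lengths), and `ar_score`
    # returns the log-probability the shallow AR decoder assigns to a candidate given
    # the shared encoder output. Each score is a single teacher-forced pass, so all
    # candidates can be scored together rather than decoded token by token.
    from typing import Callable, List, Sequence, Tuple

    def parallel_ar_rescoring(
        encoder_out: Sequence[float],
        nar_generate: Callable[[Sequence[float]], List[List[int]]],
        ar_score: Callable[[Sequence[float], List[int]], float],
    ) -> Tuple[List[int], float]:
        """Generate candidates with the NAR decoder, then return the candidate
        (and its score) that the auxiliary AR decoder ranks highest."""
        candidates = nar_generate(encoder_out)
        scored = [(cand, ar_score(encoder_out, cand)) for cand in candidates]
        return max(scored, key=lambda item: item[1])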
