Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

09/27/2021
by   Hirofumi Inaguma, et al.

The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model that decomposes the overall task into ASR and machine translation sub-tasks. However, its decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigate two types of NAR HI: (1) parallel HI, using an autoregressive Transformer ASR decoder, and (2) masked HI, using Mask-CTC, which combines CTC with a conditional masked language model. To reduce the mismatch in the ASR decoder between teacher forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD decodes about 2x faster on GPU and 4x faster on CPU than the naïve MD model, with comparable translation quality. Adopting a Conformer encoder and an intermediate CTC loss further boosts translation quality without sacrificing decoding speed.
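Since the abstract describes how Fast-MD replaces ASR beam search with a single pass over CTC outputs, a minimal PyTorch sketch of the parallel-HI variant is given below. The module names (encoder, ctc_head, asr_decoder, mt_decoder), the return_hidden flag, and the sampling probability are illustrative assumptions rather than the authors' released implementation; the MT decoder would still run beam search conditioned on the returned hidden intermediates.

```python
# Minimal sketch of Fast-MD's parallel hidden intermediates (HI), assuming
# hypothetical module interfaces; this is illustrative, not the authors' code.
import torch
import torch.nn as nn


class FastMD(nn.Module):
    def __init__(self, encoder, ctc_head, asr_decoder, mt_decoder, blank_id=0):
        super().__init__()
        self.encoder = encoder          # speech encoder (e.g. Conformer)
        self.ctc_head = ctc_head        # linear projection to the CTC vocabulary
        self.asr_decoder = asr_decoder  # Transformer ASR decoder
        self.mt_decoder = mt_decoder    # Transformer MT decoder (beam search)
        self.blank_id = blank_id

    def _collapse_ctc(self, ids):
        """Standard CTC post-processing: merge repeated labels, drop blanks."""
        ids = torch.unique_consecutive(ids)
        return ids[ids != self.blank_id]

    @torch.no_grad()
    def generate_hidden_intermediates(self, speech):
        """Parallel HI: one non-autoregressive pass of the ASR decoder
        conditioned on greedy CTC outputs, replacing ASR beam search."""
        enc = self.encoder(speech)                    # (1, T, D)
        ctc_path = self.ctc_head(enc).argmax(-1)[0]   # greedy CTC alignment
        hyp = self._collapse_ctc(ctc_path)            # CTC hypothesis tokens
        # All positions are fed at once; the decoder hidden states (not the
        # token ids) are passed on to the MT decoder as hidden intermediates.
        hi = self.asr_decoder(hyp.unsqueeze(0), enc, return_hidden=True)
        return hi, enc

    def sample_ctc_condition(self, enc, gold_ids, sample_prob=0.3):
        """Training-time CTC sampling (assumed form): with some probability,
        condition the ASR decoder on collapsed CTC predictions instead of the
        gold transcript, reducing the train/test mismatch."""
        if torch.rand(1).item() < sample_prob:
            ctc_path = self.ctc_head(enc).argmax(-1)[0]
            return self._collapse_ctc(ctc_path).unsqueeze(0)
        return gold_ids
```

In this sketch, only the MT sub-task retains beam search at inference time, which is where the reported GPU/CPU speedups over the naïve two-pass MD model would come from.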
