Improved Mask-CTC for Non-Autoregressive End-to-End ASR

10/26/2020
by Yosuke Higuchi, et al.

For real-world deployment of automatic speech recognition (ASR), a system should offer fast inference while keeping computational requirements low. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, meets this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods that introduce an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% → 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models with no degradation of inference speed (< 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
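
As a rough illustration of the starting point that mask-predict with CTC builds on, the sketch below collapses a greedy frame-level CTC output into tokens and replaces low-confidence tokens with a mask placeholder, which a decoder would then fill in. This is a minimal sketch and not the authors' implementation: the blank index, the `<mask>` placeholder, the threshold value, the function names, and the choice to score each token by its highest frame probability are all assumptions made for illustration. The length-prediction objective and the Conformer encoder described in the abstract are not modeled here.

```python
from typing import List, Tuple, Union

BLANK = 0        # assumed index of the CTC blank symbol
MASK = "<mask>"  # assumed placeholder for tokens to be re-predicted


def ctc_greedy_collapse(
    frame_ids: List[int], frame_probs: List[float]
) -> List[Tuple[int, float]]:
    """Merge repeated frame labels and drop blanks.

    Each surviving token keeps the highest probability among the
    consecutive frames that were merged into it (an assumption of
    this sketch).
    """
    tokens: List[Tuple[int, float]] = []
    prev = None
    for tok, prob in zip(frame_ids, frame_probs):
        if tok != prev:
            # A new run starts; blanks only act as separators.
            if tok != BLANK:
                tokens.append((tok, prob))
        elif tok != BLANK:
            # Same label as the previous frame: update the run's score.
            last_tok, last_prob = tokens[-1]
            tokens[-1] = (last_tok, max(last_prob, prob))
        prev = tok
    return tokens


def mask_low_confidence(
    tokens: List[Tuple[int, float]], threshold: float
) -> List[Union[int, str]]:
    """Replace tokens whose score is below the threshold with MASK."""
    return [tok if prob >= threshold else MASK for tok, prob in tokens]


if __name__ == "__main__":
    # Hypothetical per-frame argmax labels and their probabilities.
    ids = [0, 3, 3, 0, 5, 5, 5, 0, 3]
    probs = [0.9, 0.6, 0.95, 0.9, 0.4, 0.5, 0.45, 0.8, 0.99]
    collapsed = ctc_greedy_collapse(ids, probs)
    print(mask_low_confidence(collapsed, threshold=0.9))
```

In this toy input the middle token never exceeds the threshold, so it is the only one masked; the masked positions are what a mask-predict decoder would re-estimate conditioned on the confident tokens around them.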

research
05/18/2020

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

We present Mask CTC, a novel non-autoregressive end-to-end automatic spe...
research
09/27/2021

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

The multi-decoder (MD) end-to-end speech translation model has demonstra...
research
01/25/2022

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

While Transformers have achieved promising results in end-to-end (E2E) a...
research
10/28/2020

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

Non-autoregressive (NAR) transformer models have achieved significantly ...
research
12/21/2022

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

The network architecture of end-to-end (E2E) automatic speech recognitio...
research
08/16/2022

Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

Optimization of modern ASR architectures is among the highest priority t...
research
11/10/2019

Non-Autoregressive Transformer Automatic Speech Recognition

Recently very deep transformers start showing outperformed performance t...
