Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

05/18/2020 ∙ by Yosuke Higuchi, et al.

We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. Non-autoregressive models, in contrast, can generate tokens simultaneously within a constant number of iterations, which results in a significant reduction of inference time and better suits end-to-end ASR models for real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioning on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% → 12.1% WER on WSJ) and approaches the results of autoregressive models, while requiring much less inference time using CPUs (0.07 RTF in a Python implementation). All of our code will be publicly available.







1 Introduction

Owing to the rapid development of neural sequence-to-sequence modeling [sutskever2014sequence, bahdanau2014neural], deep neural network (DNN)-based end-to-end automatic speech recognition (ASR) systems have become almost as effective as the traditional hidden Markov model-based systems [chiu2018state, luscher2019rwth, karita2019a]. Various models and approaches have been proposed for improving the performance of the autoregressive (AR) end-to-end ASR model with the encoder-decoder architecture based on recurrent neural networks (RNNs) [chorowski2015attention, chan2016listen, kim2017joint] and Transformers [vaswani2017attention, dong2018speech, karita2019improving].

Contrary to the autoregressive framework, non-autoregressive (NAR) sequence generation has attracted attention, including the revisiting of connectionist temporal classification (CTC) [graves2006connectionist, libovicky2018end] and the growing interest in the non-autoregressive Transformer (NAT) [gu2017non]. While an autoregressive model requires L iterations to generate an L-length target sequence, a non-autoregressive model costs only a constant number of iterations K, independent of the length of the target sequence. Despite this limitation on decoding iterations, some recent studies in neural machine translation have successfully shown the effectiveness of non-autoregressive models, achieving results comparable to autoregressive models. Different types of non-autoregressive models have been proposed based on iterative refinement decoding [lee2018deterministic], insertion- or edit-based sequence generation [stern2019insertion, gu2019levenshtein], the masked language model objective [ghazvininejad2019mask, ghazvininejad2020semi, saharia2020non], and generative flow [ma2019flowseq].

Some attempts have also been made to realize non-autoregressive models in speech recognition. CTC introduces a frame-wise latent alignment to represent the alignment between the input speech frames and the output tokens [graves2014towards]. While CTC makes use of dynamic programming to efficiently calculate the most probable alignment, its strong conditional independence assumption between output tokens results in poor performance compared to autoregressive models [battenberg2017exploring]. On the other hand, [chen2019non] trains a Transformer encoder-decoder in a mask-predict manner [ghazvininejad2019mask]: target tokens are randomly masked and predicted conditioning on the unmasked tokens and the input speech. To generate the output sequence in parallel during inference, the target sequence is initialized as all masked tokens and the output length is predicted by finding the position of the end-of-sequence token. However, with this prediction of the output length, the model is known to be vulnerable to long output sequences: at the beginning of decoding, the model is likely to make more mistakes in predicting a long masked sequence, propagating the errors to later decoding steps. [chan2020imputer] proposes Imputer, which performs mask prediction over CTC's latent alignment to get rid of the output length prediction. However, unlike mask-predict, Imputer requires more computation in each iteration, proportional to the square of the input length in the self-attention layers, and the total computational cost can be very large.

Our work aims to obtain a non-autoregressive end-to-end ASR model that generates the sequence at the token level with low computational cost. The proposed Mask CTC framework trains a Transformer encoder-decoder model with both CTC and mask-predict objectives. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. The masked low-confidence tokens are then predicted conditioning on the high-confidence tokens, not only in the past but also in the future context. The advantages of Mask CTC are summarized as follows.

No requirement for output length prediction: Predicting the output token length from input speech is rather challenging because the length of the input utterances varies greatly depending on the speaking rate or the duration of silence. By initializing the target sequence with the CTC outputs, Mask CTC does not need to predict the output length at the beginning of decoding.

Accurate and fast decoding: We observed that the CTC outputs themselves are quite accurate. Mask CTC not only retains the correct tokens from the CTC outputs but also recovers the output errors by considering the entire context. Token-level iterative decoding with a small number of masks makes the model well-suited for use in real-world scenarios.

2 Mask CTC framework

The objective of end-to-end ASR is to model the joint probability P(Y|X) of an L-length output sequence Y = (y_1, ..., y_L) given a T-length input sequence X = (x_1, ..., x_T). Here, y_l is an output token at position l in the vocabulary V, and x_t is a D-dimensional acoustic feature at frame t.

The following subsections first explain a conventional autoregressive framework based on attention-based encoder-decoder and CTC. Then a non-autoregressive model trained with mask prediction is explained and finally, the proposed Mask CTC decoding method is introduced.

2.1 Attention-based encoder-decoder

Attention-based encoder-decoder models the joint probability of Y given X by factorizing the probability based on the probabilistic left-to-right chain rule as follows:

P_att(Y|X) = ∏_{l=1}^{L} P(y_l | y_{<l}, X),   (1)

where y_{<l} = (y_1, ..., y_{l-1}). The model estimates the output token y_l at each time-step conditioning on the previously generated tokens in an autoregressive manner. In general, the ground-truth tokens are used as the history tokens during training, and the predicted tokens are used during inference.

2.2 Connectionist temporal classification

CTC predicts a frame-level alignment between the input sequence X and the output sequence Y by introducing a special <blank> token. The alignment A = (a_1, ..., a_T) is predicted with the conditional independence assumption between the output tokens as follows:

P_ctc(A|X) = ∏_{t=1}^{T} P(a_t | X).   (2)

Considering the probability distribution over all possible alignments, CTC models the joint probability of Y given X as follows:

P_ctc(Y|X) = Σ_{A ∈ β^{-1}(Y)} P(A|X),   (3)

where β^{-1}(Y) returns all possible alignments compatible with Y. The summation of the probabilities over all of the alignments can be computed efficiently by using dynamic programming.
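To make the collapsing function β concrete, the following sketch (our illustration; treating index 0 as <blank> is an assumption) performs greedy CTC decoding: take the best token per frame, then collapse repeated tokens and remove blanks.

```python
BLANK = 0  # assumed index of the <blank> token

def ctc_greedy_collapse(frame_token_ids):
    """Collapse a frame-level alignment A into an output sequence Y by
    removing repeated tokens and <blank> symbols (the function beta)."""
    output = []
    prev = None
    for t in frame_token_ids:
        # Keep a token only when it starts a new run and is not blank.
        if t != prev and t != BLANK:
            output.append(t)
        prev = t
    return output
```

For example, the alignments (blank, 3, 3, blank, 3, 5, 5) and (3, blank, blank, 3, blank, 5, blank) both collapse to (3, 3, 5), which is why Eq. (3) sums over β^{-1}(Y).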

To achieve robust alignment training and fast convergence, an end-to-end ASR model based on the attention-based encoder-decoder framework is trained jointly with CTC [kim2017joint, karita2019improving]. The objective of the autoregressive joint CTC-attention model is defined as follows by combining Eq. (1) and Eq. (3):

L_AR = λ log P_ctc(Y|X) + (1 − λ) log P_att(Y|X),   (4)

where λ (0 ≤ λ ≤ 1) is a tunable parameter.

2.3 Joint CTC-CMLM non-autoregressive ASR

Mask CTC adopts non-autoregressive speech recognition [chen2019non] based on a conditional masked language model (CMLM) [ghazvininejad2019mask], where the model is trained to predict masked tokens in the target sequence [devlin2019bert]. (Note that the CMLM is used here as an ASR decoder network conditioned on the encoder output, which is different from an external language model often used in shallow fusion during decoding.) Taking advantage of the Transformer's parallel computation [vaswani2017attention], the CMLM can predict any arbitrary subset of masked tokens in the target sequence by attending to the entire sequence, including tokens in both the past and the future.

The CMLM predicts a set of masked tokens Y_mask conditioning on the input sequence X and the observed (unmasked) tokens Y_obs as follows:

P_cmlm(Y_mask | Y_obs, X) = ∏_{y ∈ Y_mask} P(y | Y_obs, X),   (5)

where Y_obs = Y \ Y_mask. During training, the ground-truth tokens are randomly replaced by a special <MASK> token, and the CMLM is trained to predict the original tokens conditioning on the input sequence X and the unmasked tokens Y_obs. The number of tokens to be masked is sampled from a uniform distribution between 1 and L, as in [ghazvininejad2019mask]. During inference, the target sequence is gradually generated in a constant number of iterations K by the iterative decoding algorithm [ghazvininejad2019mask], which repeatedly masks and predicts subsets of the target sequence.
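The training-time masking described above can be sketched as follows. This is a minimal illustration with hypothetical names; a real implementation would operate on batched token-id tensors.

```python
import random

MASK = "<MASK>"  # special mask token (string form for illustration)

def random_mask(tokens, rng=random):
    """CMLM-style masking: sample the number of masked positions
    uniformly from 1..L, then replace a random subset with <MASK>."""
    n_mask = rng.randint(1, len(tokens))          # uniform over 1..L
    positions = rng.sample(range(len(tokens)), n_mask)
    y_obs = list(tokens)
    for p in positions:
        y_obs[p] = MASK
    return y_obs, sorted(positions)
```

The model is then trained to recover the original tokens at the returned positions, conditioning on the remaining tokens and the encoder output.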

We observed that applying the original CMLM to non-autoregressive speech recognition shows poor performance, with problems of skipping and repeating output tokens. To deal with this, we found that joint training with CTC, similar to [kim2017joint], explicitly provides the model with absolute positional information (via the conditional independence assumption) and improves the model performance reasonably well. With the CTC objective from Eq. (3) and Eq. (5), the objective of joint CTC-CMLM training for the non-autoregressive ASR model is defined as follows:

L_NAR = γ log P_ctc(Y | X) + (1 − γ) log P_cmlm(Y_mask | Y_obs, X),

where γ (0 ≤ γ ≤ 1) is a tunable parameter.
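As a minimal sketch, the interpolation of the two log-likelihoods (negated to give a loss for minimization) looks like this; γ = 0.3 follows the setting reported in Section 3.2, and the inputs are assumed to be precomputed log-probabilities.

```python
def joint_nar_loss(log_p_ctc, log_p_cmlm, gamma=0.3):
    """Joint CTC-CMLM objective as a loss: the negative weighted sum of
    the CTC and CMLM log-probabilities (gamma is a tunable weight)."""
    return -(gamma * log_p_ctc + (1.0 - gamma) * log_p_cmlm)
```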

2.4 Mask CTC decoding

Figure 1: Overview of Mask CTC predicting “CAT” based on CTC outputs. The model is trained with the joint CTC and mask-predict objectives. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. The masked low-confidence tokens are predicted conditioning on the high-confidence tokens.

Non-autoregressive models must know the length of the output sequence to predict the entire sequence in parallel. For example, at the beginning of CMLM decoding, the output length must be given to initialize the target sequence with masked tokens. To deal with this problem in machine translation, the output length is predicted by training a fertility model [gu2017non] or by introducing a special <LENGTH> token in the encoder [ghazvininejad2019mask]. In speech recognition, however, due to the different characteristics of the input acoustic signals and the output linguistic symbols, predicting the output length appears to be rather challenging; e.g., the length of input utterances with the same transcription varies greatly depending on the speaking rate or the duration of silence. [chen2019non] simply makes the decoder predict the position of the <EOS> token to deal with the output length. However, their analysis showed that this prediction is vulnerable to long output sequences, because the model is likely to make more mistakes when predicting a long masked sequence, and the errors propagate to later decoding steps, degrading recognition performance. To compensate for this problem, they use beam search with CTC and a language model to obtain reasonable performance, which slows down the overall decoding and makes the advantage of the non-autoregressive framework less effective.

To tackle this problem regarding the initialization of the target sequence, we consider using the CTC outputs as the initial sequence for decoding. Figure 1 shows Mask CTC decoding based on the inference of CTC. The CTC outputs are first obtained through a single calculation of the encoder, and the decoder then serves to refine the CTC outputs by attending to the whole sequence.

In this work, we use the "greedy" result of CTC, Ŷ = (ŷ_1, ..., ŷ_L), which is obtained without using prefix search [graves2006connectionist], to keep the inference algorithm non-autoregressive. The errors caused by the conditional independence assumption are expected to be corrected by the CMLM decoder. The posterior probability of each token ŷ_l is approximately calculated from the frame-level CTC probabilities as follows:

P̂(ŷ_l | X) = max_j P(a_j | X),   (6)

where j ranges over the consecutive identical alignments a_j that correspond to the aggregated token ŷ_l. Then, part of Ŷ is masked out based on the confidence P̂(ŷ_l | X) as follows:

Y_mask = { ŷ_l ∈ Ŷ : P̂(ŷ_l | X) < P_thres },   Y_obs = Ŷ \ Y_mask,   (7)

where P_thres is a threshold used to decide whether a target token is masked or not. Finally, Y_mask is predicted conditioning on the high-confidence tokens Y_obs and the input sequence X, as in Eq. (5).
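The confidence computation and thresholding described above can be sketched as follows. This is our illustration with assumed conventions: blank index 0, a sentinel MASK id, and max-pooling of frame probabilities over each token's run of frames as one simple way to aggregate the frame-level CTC probabilities.

```python
BLANK = 0   # assumed index of the <blank> token
MASK = -1   # hypothetical sentinel id for the <MASK> token

def mask_ctc_init(frame_ids, frame_probs, p_thres):
    """Collapse a greedy frame-level alignment into tokens, score each
    token by the max frame probability over its run, and replace
    low-confidence tokens with MASK."""
    tokens, confs = [], []
    prev = None
    for t, p in zip(frame_ids, frame_probs):
        if t == BLANK:
            prev = None                     # blank separates runs, allowing repeats
            continue
        if t == prev:
            confs[-1] = max(confs[-1], p)   # extend the current token's run
        else:
            tokens.append(t)
            confs.append(p)
            prev = t
    return [t if c >= p_thres else MASK for t, c in zip(tokens, confs)]
```

Tokens kept by the threshold form Y_obs; the MASK positions form Y_mask and are handed to the CMLM decoder for refinement.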

We also investigated applying one of the iterative decoding methods, called easy-first [goldberg2010efficient, chen2019non]. Starting with the masked CTC output, the masked tokens are gradually predicted in order of confidence based on the CMLM probability. In the n-th decoding iteration, each masked position l is filled as follows:

ŷ_l = argmax_w P_cmlm(y_l = w | Y_obs^(n), X),   (8)

where Y_obs^(n) denotes the tokens observed (unmasked) at the n-th iteration. The top C masked tokens with the highest probabilities are predicted in each iteration. By defining C = ⌈#mask / K⌉, the total number of decoding iterations can be controlled to a constant K.
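This easy-first loop can be sketched as follows. It is our illustration, not the paper's implementation; `predict_fn` is a hypothetical model call returning, for each currently masked position, its best token and probability.

```python
import math

MASK = "<MASK>"

def easy_first_decode(predict_fn, y_init, num_iterations):
    """Refine a partially masked sequence: in each iteration, predict all
    masked positions but commit only the C most confident predictions,
    with C chosen so the loop finishes in num_iterations rounds."""
    y = list(y_init)
    n_mask = y.count(MASK)
    if n_mask == 0 or num_iterations <= 0:
        return y
    c = math.ceil(n_mask / num_iterations)
    for _ in range(num_iterations):
        if MASK not in y:
            break
        preds = predict_fn(y)  # {pos: (token, prob)} for masked positions
        # Commit the top-C most confident predictions in this round.
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:c]
        for pos, (tok, _prob) in best:
            y[pos] = tok
    return y
```

Setting num_iterations equal to the number of masks recovers the one-mask-per-iteration setting reported in the experiments (#mask iterations).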

With the proposed non-autoregressive training and decoding of Mask CTC, the model does not have to predict the output length. Moreover, decoding by refining the CTC outputs with mask prediction is expected to compensate for the errors that come from the conditional independence assumption.

3 Experiments

To evaluate the effectiveness of Mask CTC, we conducted speech recognition experiments comparing different end-to-end ASR models using ESPnet [watanabe2018espnet]. The performance of the models was evaluated based on character error rates (CERs) or word error rates (WERs) without relying on external language models.

Model                            Iterations   dev93   eval92   RTF
CTC-attention [kim2017joint]     L            14.4    11.3     0.97
 + beam search                   L            13.5    10.9     4.62
CTC                              1            22.2    17.9     0.03
Mask CTC (P_thres = 0)           1            16.3    12.9     0.03
Mask CTC                         1            15.7    12.5     0.04
Mask CTC                         5            15.5    12.2     0.05
Mask CTC                         10           15.5    12.1     0.07
Mask CTC                         #mask        15.4    12.1     0.13
CTC [chan2020imputer]            1            -       15.2     -
Imputer (IM) [chan2020imputer]   8            -       16.5     -
Imputer (DP) [chan2020imputer]   8            -       12.7     -
Table 1: Word error rates (WERs) and real time factor (RTF) for WSJ (English).
Model                            Iterations   Dev    Test
CTC-attention [kim2017joint]     L            35.5   35.5
 + beam search                   L            35.4   35.7
CTC                              1            53.8   56.1
Mask CTC (P_thres = 0)           1            41.6   40.2
Mask CTC                         1            41.3   40.1
Mask CTC                         5            40.7   39.4
Mask CTC                         10           40.5   39.2
Mask CTC                         #mask        40.4   39.0
Table 2: Word error rates (WERs) for Voxforge (Italian).
Figure 2: Decoding example for utterance 443c040i in WSJ eval92. The target sequence is initialized as the CTC outputs and some tokens are replaced with masks (“_”) based on the CTC confidence. Then, the masked tokens are iteratively predicted conditioning on the other unmasked tokens. Red indicates characters with errors and blue indicates ones recovered by Mask CTC decoding.

3.1 Datasets

The experiments were carried out on three tasks with different languages and amounts of training data: the 81-hour Wall Street Journal (WSJ) in English [paul1992design], the 581-hour Corpus of Spontaneous Japanese (CSJ) [maekawa2003corpus], and the 16-hour Voxforge in Italian [voxforge]. For the network inputs, we used 80 mel-scale filterbank coefficients with three-dimensional pitch features and applied SpecAugment [park2019specaugment] during model training. For the tokenization of the targets, we used characters: the Latin alphabet for English and Italian, and Japanese syllable characters (Kana) and Chinese characters (Kanji) for Japanese.

3.2 Experimental setup

For the experiments on all of the tasks, we adopted the same encoder-decoder architecture as [karita2019improving], which consists of Transformer self-attention layers with 4 attention heads, 256 hidden units, and a 2048-dimensional feed-forward inner layer. The encoder consisted of 12 self-attention layers with convolutional layers for downsampling, and the decoder consisted of 6 self-attention layers. With the mask-predict objective, training the Mask CTC model required more epochs to converge (about 200 – 500) than the autoregressive models (about 50 – 100). The final autoregressive model was obtained by averaging the model parameters of the last 10 epochs, as in [karita2019a]. For the Mask CTC model, we found that the performance was significantly improved by averaging the model parameters of the 10 – 30 epochs with the top validation accuracies. For the threshold P_thres in Eq. (7), we used 0.999, 0.999, and 0.9 for WSJ, Voxforge, and CSJ, respectively. For all of the tasks, the loss weight λ in Eq. (4) and the loss weight γ in the objective of Section 2.3 were both set to 0.3.

3.3 Evaluated models

  • CTC-attention: An autoregressive model trained with the joint CTC-attention objective as in Eq. (4). During inference, the joint CTC-attention decoding is applied with beam search [hori2017joint].

  • CTC: A non-autoregressive model simply trained with the CTC objective.

  • Mask CTC: A non-autoregressive model trained with the joint CTC-CMLM objective described in Section 2.3. During inference, the proposed decoding based on masking the CTC outputs (explained in Section 2.4) is applied. Note that when P_thres = 0 in Eq. (7), the greedy output of CTC was used as the decoded result.

3.4 Results

Model                            Eval1 (CER / SER)   Eval2 (CER / SER)   Eval3 (CER / SER)
CTC-attention [kim2017joint]     6.37 / 57.0         4.76 / 53.7         5.40 / 39.6
 + beam search                   6.21 / 56.8         4.50 / 53.4         5.15 / 40.1
CTC                              6.51 / 59.7         4.71 / 59.5         5.49 / 44.5
Mask CTC                         6.56 / 60.3         4.69 / 57.0         4.97 / 41.9
Mask CTC ()                      6.56 / 58.7         4.57 / 55.5         4.96 / 40.7
Table 3: Character error rates (CERs) and sentence error rates (SERs) for CSJ (Japanese).

Table 1 shows the results for WSJ in terms of WERs and real time factors (RTFs), the latter measured for decoding eval92 on an Intel(R) Core(TM) i9-7980XE CPU at 2.60 GHz. Comparing the non-autoregressive models, we can see that the greedy CTC outputs of Mask CTC outperformed the simple CTC model, thanks to training with the mask-predict objective. Applying the refinement based on the proposed CTC masking steadily improved the model performance. The performance was further improved by increasing the number of decoding iterations, with the best performance obtained with #mask iterations, i.e., one mask predicted per iteration. The results of Mask CTC are reasonable compared to those of prior work [chan2020imputer]. Our models also approached the results of the autoregressive models, starting from the initial CTC result. In terms of decoding speed measured in RTF, Mask CTC is, at most, 116 times faster than the autoregressive models. Since most of the CTC outputs are fairly accurate and the number of masks is quite small, the decoding speed did not degrade much as the number of decoding iterations was increased.

Figure 2 shows an example decoding process for a sample in the WSJ evaluation set. Here, we can see that the CTC outputs include errors mainly coming from substitutions due to incomplete word spellings. By applying Mask CTC decoding, the spelling errors were successfully recovered by considering the conditional dependence between characters at the word level. However, as can be seen in the error for "sifood," Mask CTC cannot recover errors derived from character-level insertions or deletions, because the length allocated to each word is fixed by the CTC outputs.

Table 2 shows WERs for Voxforge. Mask CTC yielded better scores than the standard CTC model, similarly to the WSJ results, demonstrating that our model can be applied to other languages with a relatively small amount of training data.

Table 3 shows character error rates (CERs) and sentence error rates (SERs) for CSJ. While Mask CTC achieved CERs quite close to, or even better than, those of the autoregressive model, the results showed little improvement over the simple CTC model compared to the results on the aforementioned tasks. Since Japanese has a large number of characters and a single character often forms a word by itself, the simple CTC model seems to handle the short-range dependence between characters reasonably well, yielding almost the same scores without Mask CTC. However, looking at the sentence-level results, we observed clear improvements on all of the evaluation sets, again showing that our model effectively recovers CTC errors by considering the conditional dependence between tokens.

These experimental results on different tasks indicate that the Mask CTC framework is especially effective for languages whose tokens are small units (i.e., the Latin alphabet and other phonemic scripts). Investigating its effectiveness when byte-pair encodings (BPEs) [sennrich2016neural] are used for such languages is left for future work.

4 Conclusions

This paper proposed Mask CTC, a novel non-autoregressive end-to-end speech recognition framework, which generates a sequence by refining the CTC outputs based on mask prediction. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence masked tokens are iteratively refined conditioning on the other unmasked tokens and the input speech features. The experimental comparisons demonstrated that Mask CTC outperforms the standard CTC model while keeping decoding fast. Mask CTC approached the results of the autoregressive models; for CSJ in particular, the results were comparable or even better. Our future plan is to reduce the mismatch between the masking strategies used in training (random masking) and inference (CTC-based masking). Furthermore, we plan to explore the integration of external language models (e.g., BERT [devlin2019bert]) into the Mask CTC framework.