CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

10/28/2020
by Ruchao Fan, et al.

We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains information about (a) the number of tokens for the decoder input, and (b) the time span of acoustics for each token. This information is used to extract an acoustic representation for each token in parallel, referred to as a token-level acoustic embedding, which substitutes for the word embedding in the autoregressive transformer (AT) to achieve parallel generation in the decoder. During inference, an error-based alignment sampling method is applied to the CTC output space, reducing the WER while retaining parallelism. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on the Librispeech test clean/other sets without an external LM, and a CER of 5.8% on the Aishell1 Mandarin corpus. Compared to the AT baseline, the CASS-NAT has a performance reduction on WER, but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of WER without LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.
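As a rough illustration of the two pieces of information the abstract attributes to a CTC alignment, the sketch below reads a frame-level alignment and returns the token count and a frame span per token. It is not the authors' implementation: the function name `alignment_to_spans`, the blank id `0`, and the choice to leave blank frames unassigned to any token are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed blank label id; the paper's actual id is not given here


def alignment_to_spans(alignment, blank=BLANK):
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    Frame indices are inclusive. Consecutive repeats of a label form one
    token; blank frames separate tokens and are not assigned to any span
    (an assumption of this sketch).
    """
    spans = []
    t = 0  # index of the first frame in the current run
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != blank:
            spans.append((label, t, t + length - 1))
        t += length
    return spans


# Hypothetical 8-frame alignment: blank, 7, 7, blank, 7, 3, 3, blank
alignment = [0, 7, 7, 0, 7, 3, 3, 0]
spans = alignment_to_spans(alignment)
num_tokens = len(spans)  # (a) number of tokens for the decoder input
print(num_tokens, spans)  # (b) time span of acoustics for each token
```

Because every span is known before decoding starts, the acoustic frames for all tokens can be gathered at once, which is what allows the decoder inputs to be built in parallel rather than one token at a time.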

research
04/15/2023

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Recently, end-to-end models have been widely used in automatic speech re...
research
06/18/2021

An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Non-autoregressive mechanisms can significantly decrease inference time ...
research
09/14/2021

Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition

Non-autoregressive (NAR) transformer models have been studied intensivel...
research
04/10/2021

Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR

Continuous integrate-and-fire (CIF) based models, which use a soft and m...
research
10/16/2022

Acoustic-aware Non-autoregressive Spell Correction with Mask Sample Decoding

Masked language model (MLM) has been widely used for understanding tasks...
research
11/25/2021

LET-Decoder: A WFST-based Lazy-evaluation Token-group Decoder with Exact Lattice Generation

We propose a novel lazy-evaluation token-group decoding algorithm with o...
research
09/15/2023

Unimodal Aggregation for CTC-based Speech Recognition

This paper works on non-autoregressive automatic speech recognition. A u...
