A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

04/15/2023
by   Ruchao Fan, et al.

Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and are thus relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs using the acoustic boundary information offered by the CTC alignment. TAEs can be obtained in parallel, resulting in parallel generation of output tokens. During training, the Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT achieves a WER close to that of the AT on various ASR tasks, while providing a 24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to the AT. We find that TAEs have functionality similar to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
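The core idea above can be illustrated with a small sketch (not the authors' code, and the function name `tae_from_alignment` is hypothetical): a CTC alignment assigns each encoder frame a token id or blank, so frames belonging to the same output token can be mean-pooled into one token-level acoustic embedding. Blanks end a segment, and a change of token id starts a new one, following the standard CTC collapse rule.

```python
def tae_from_alignment(encoder_out, alignment, blank=0):
    """Hedged sketch of token-level acoustic embedding (TAE) extraction.

    encoder_out: list of T frame vectors (each a list of floats)
    alignment:   length-T CTC path of token ids (blank marks no emission)
    Returns one mean-pooled embedding per collapsed output token.
    """
    def mean(frames):
        n = len(frames)
        return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

    embeddings, seg, prev = [], [], blank
    for frame, tok in zip(encoder_out, alignment):
        if tok == blank:
            # A blank closes the current token segment, if any.
            if seg:
                embeddings.append(mean(seg))
                seg = []
        else:
            # A different non-blank token starts a new segment.
            if tok != prev and seg:
                embeddings.append(mean(seg))
                seg = []
            seg.append(frame)
        prev = tok
    if seg:
        embeddings.append(mean(seg))
    return embeddings

# Example: path [a, a, -, b, -, b] collapses to three tokens (a, b, b),
# so three pooled embeddings are produced, one per token.
enc = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
tae = tae_from_alignment(enc, [1, 1, 0, 2, 0, 2])
print(tae)  # [[1.0, 2.0], [6.0, 7.0], [10.0, 11.0]]
```

Because every segment's frame indices are known in advance from the alignment, all TAEs can be computed in one pass (or one batched pooling op in practice), which is what enables single-step parallel token generation.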

Related research
06/18/2021

An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Non-autoregressive mechanisms can significantly decrease inference time ...
10/28/2020

CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

We propose a CTC alignment-based single step non-autoregressive transfor...
02/15/2021

Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT

Attention-based encoder-decoder (AED) models have achieved promising per...
06/28/2023

Accelerating Transducers through Adjacent Token Merging

Recent end-to-end automatic speech recognition (ASR) systems often utili...
06/16/2022

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Transformers have recently dominated the ASR field. Although able to yie...
05/27/2019

CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition

Automatic speech recognition (ASR) system is undergoing an exciting path...
04/10/2021

Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR

Continuous integrate-and-fire (CIF) based models, which use a soft and m...
