A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

10/11/2021 ∙ by Yosuke Higuchi, et al. ∙ 0

Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in the state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the last decade, deep learning has brought remarkable success in automatic speech recognition (ASR) 

[1, 2], which has become a central user interface in various IoT applications. Much of the recent research progress is attributed to improvement in the end-to-end system [3, 4, 5], where an ASR model is trained to directly optimize speech-to-text conversion. Owing to the well-established sequence-to-sequence modeling techniques [6, 7, 8]

and more sophisticated neural network architectures 

[9, 10, 11], end-to-end ASR systems have achieved comparable results with those of the conventional hybrid systems [12, 13, 14].

Current state-of-the-art end-to-end ASR systems are based on autoregressive (AR) models [6, 8], where each token prediction is conditioned on the previously generated tokens (Figure 1 left). Such a generation process can lead to slow inference speed, requiring -step incremental calculations to generate an -length sequence. Non-autoregressive (NAR) models [15, 16], in contrast, permit generating multiple tokens in parallel, which significantly speeds up the decoding process (Figure 1 right). However, such simultaneous predictions often prevent the NAR models from learning the conditional dependencies between output tokens, worsening the recognition accuracy compared to AR models.

Figure 1: Illustrations of autoregressive and non-autoregressive ASR.

Fast inference is one of the crucial factors for deploying deep learning models to real-world applications. ASR systems, in particular, are desired to be fast and light in numerous scenes, such as in spoken dialogue systems, where users prefer quick interactions with a conversational agent. Accordingly, various attempts have been actively made to develop and improve NAR end-to-end ASR models 

[17, 18, 19, 20, 21, 22, 23]

, inspired by the great success of NAR modeling techniques in neural machine translation 

[24, 25, 26, 27, 28, 29, 30]. However, while an increasing number of NAR models have been proposed and shown their effectiveness, the research community lacks a comprehensive study for comparing different NAR methods in a fair experimental setting. Hence, it remains unclear what the advantages and disadvantages each model has when compared to other models.

Our work aims to conduct a comparative study on NAR modeling methods for end-to-end ASR. We have made an effort to cover a wide variety of methods, including a standard connectionist temporal classification (CTC)-based model [3]; Mask-CTC [19] and Improved Mask-CTC [31] based on masked language modeling [32, 28]; Align-Denoise [33] based on refinement training [25]; Insertion Transformer and KERMIT [21] based on insertion-based modeling [26, 34]; intermediate CTC [35] and self-conditioned CTC [36] based on regularization techniques; and a continuous integrate-and-fire (CIF)-based NAR model (CIF-NA) [37]. All the models are fairly evaluated in the state-of-the-art setup using ESPnet [38], adopting Conformer [11] for the network architecture.

The contributions of this work are summarized as follows:

  • We conduct comparative experiments on a variety of NAR ASR models using various ASR tasks. In addition to comparing the accuracy and speed, we further analyze the results to help develop a deep understanding of NAR ASR.

  • We show that different NAR techniques can be combined to improve the performance and applied to other NAR speech-to-text tasks, e.g., end-to-end speech translation.

  • We provide reproducible implementations and recipes used in our experiments, hoping to encourage further research in NAR speech processing.

2 Related works

Model #iter Processing unit CTC
A-CMLM [17] 3 token
Imputer [18] 8 frame
LASO [20] 1 token
Spike-Triggered [22] 1 token
Mask-CTC [19] 10 token
Improved Mask-CTC [31] 5 token
Align-Refine [23] 5 frame
Align-Denoise [33] 1 frame
Insertion Transformer [21] token
KERMIT [21] token
Intermediate CTC [35] 1 frame
Self-conditioned CTC [36] 1 frame
CIF-NA [39] 1 token
Table 1: Comparison of various NAR end-to-end ASR models.

In Table 1, we list several NAR end-to-end ASR models to help compare and understand them at a glance. Here, we include some important aspects for the comparison. The number of decoding iterations (#iter) can be increased to improve an output sequence at the expense of extra computations, which enables a model to generate more valid tokens conditioned on previously generated tokens in a semi-autoregressive manner [25, 28]. Processing unit

is a unique property in NAR ASR, which can be either frame-level or token-level. Token-level processing reduces the speed and computational cost during inference, which is especially important when a model performs the iterative prediction. However, it needs additional efforts to estimate or adjust the length of an output sequence 

[16]. Frame-level processing, on the other hand, is prone to slow inference with the requirement of computational resources, but it does not require the length prediction. CTC indicates the usage of connectionist temporal classification (CTC) [15], which is a promising NAR modeling method for ASR.

CTC is the very fundamental method for realizing NAR end-to-end ASR. CTC makes a strong conditional independence assumption between token frame predictions, enabling the model to perform fast inference while limiting the recognition accuracy compared to other end-to-end ASR models [40]. Inspired by the conditional masked language model (CMLM) [28], Audio-CMLM (A-CMLM) [17] effectively learns the conditional distribution of output tokens over a partially observed sequence through the NAR mask prediction task [32]. Imputer [18] and Mask-CTC [19, 31] combine CTC with CMLM to improve frame-level or token-level CTC predictions, getting rid of the cumbersome length prediction required in the previous approach. While Imputer and Mask-CTC suffer from the mismatch between training and testing conditions, Align-Refine [23] and Align-Denoise [33] introduce iterative refinement [25] to optimize the refinement process of CTC predictions directly.

Some of the recent efforts in NAR end-to-end ASR focus on improving the performance of the standard CTC-based model itself. Intermediate CTC [35] and self-conditioned CTC [36] apply auxiliary CTC losses to intermediate layers as in [41], which effectively enhances the intermediate representations and leads to improved CTC performance. Convolution-based neural network architectures have been shown to improve the CTC-based and the other end-to-end ASR models in general [31, 42, 43]. When a large amount of speech data is available for pre-training, powerful speech representations learned by wav2vec 2.0 [44] can significantly boost the performance of CTC [42].

Another direction for NAR ASR is based on insertion-based modeling, which permits the model for generating tokens in an arbitrary order without the left-to-right constraint in AR models. Showing promising results in neural machine translation, Insertion Transformer [26] and Kontextuell Encoder Representations Made by Insertion Transformations (KERMIT) [34] are successfully adopted for end-to-end ASR [21].

3 Non-autoregressive ASR

This section reviews NAR modeling methods for end-to-end ASR compared in our study, including CTC, Mask-CTC, Improved Mask-CTC, Align-Denoise, Insertion Transformer, KERMIT, intermediate CTC, self-conditioned CTC, and CIF-NA. We have made an effort to cover a wide variety of methods, each of which has a unique capability as an NAR model, as described in Section 2.

Notations: We formulate end-to-end ASR as a sequence mapping between a -length input sequence and an -length output sequence . Here, is a -dimensional acoustic feature at frame , an output token at position , and a vocabulary.

3.1 Connectionist temporal classification (CTC)

CTC [15] predicts a frame-level alignment sequence , which is obtained by introducing a special blank token into the output sequence

. Based on the conditional independence assumption per token frame prediction, CTC models the conditional probability

by marginalizing over all paths (frame alignments) as:


where denotes all possible paths compatible with . The CTC objective is defined by the negative log-likelihood of Eq. (1):


During inference, we use the best path decoding [15] to generate an output sequence, where the most probable tokens are selected at each frame, and an output sequence is obtained by suppressing repeated tokens and removing blank symbols.

3.2 Mask-CTC

Mask-CTC [19] is built upon an encoder-decoder structure, where the CTC loss (Eq. (1)) is applied to the encoder output, and the decoder adopts the conditional masked language model (CMLM) [28, 17]. The CMLM decoder is trained to predict output tokens , given a partially observed ground-truth sequence :


where are obtained by randomly replacing ground-truth tokens with a special mask token <MASK>. The objective of the CMLM decoder is defined as:


The final loss of Mask-CTC is defined as a weighted sum of the standard CTC loss in Eq. (2) and in Eq. (4). During inference, an output sequence of CTC is first obtained from the encoder, and then the decoder refines the CTC output through the mask prediction process based on Eq. (3). By masking low-confidence tokens in the CTC output and predicting the masked tokens based on the other high-confidence unmasked tokens, errors from the conditional independence assumption are expected to be recovered.

3.3 Improved Mask-CTC

One limitation of Mask-CTC is that the length of an output sequence cannot be changed from that of the CTC output during inference, making it difficult to recover deletion and insertion errors. To overcome this problem, Mask-CTC is enhanced by introducing a length prediction network in the CMLM decoder [31]. The length prediction network is trained to predict the length of a partial target sequence from a masked token. For example, given a ground-truth sequence and its masked sequence , symbols 2 and 0 are predicted from each masked position, indicating the length of the corresponding partial sequence in . With this length prediction network, during inference, the number of <MASK> at each masked position is first modified based on the predicted length. Each mask is then predicted conventionally as in Eq. (3), allowing the model to change the sequence length dynamically by deleting and inserting mask tokens.

3.4 Align-Denoise

Align-Denoise [33] is built on the prior work of Align-Refine [23] and Imputer [18] and adopts a similar encoder-decoder structure on frames. CTC is applied to both the encoder output and decoder output. Instead of taking multiple steps to get the intermediate results as Align-Refine, Align-Denoise generates noisy CTC alignment during the training and requires a single iteration for decoding. The training objective is based on the CTC loss in Eq. (2) and a ground-truth alignment :



is the noisy function for generating the noisy alignment. The idea is similar to the denoising autoencoder 

[45] (DAE), where the input to the decoder is the ground truth alignment with a certain level of noise.

3.5 Intermediate CTC

Intermediate CTC [35] extends the CTC-based modeling with a regularization loss. During training, a sequence of intermediate representations is obtained from an intermediate layer of the encoder, and its intermediate CTC loss is computed as:


The final loss is a weighted sum of the original CTC loss in Eq. (2) and intermediate loss. During inference, the intermediate loss is unused, and the model is treated as an ordinary CTC model.

3.6 Self-conditioned CTC

Self-conditioned CTC [36] extends intermediate CTC by exploiting the intermediate representations for conditioning the subsequent encoder layers. During both training and inference, each token posterior distribution in the intermediate layers is fed back to the input of the next layer, making the subsequent encoder layers conditioned on the intermediate predictions. The self-conditioned CTC loss is defined as:


The final loss is a weighted sum of the intermediate CTC loss in Eq. (6) and the self-conditioned CTC loss.

3.7 Insertion Transformer

In the case of insertion-based NAR models, the conditional probability is modeled by marginalizing over insertion order :


Insertion order represents the permutation of tokens in a sequence , e.g., if and , .

During training, an upper bound of negative log-likelihood is minimized under a prior distribution over insertion order :


Insertion Transformer [26] trains Transformer, with an encoder-decoder structure, to predict a token and its position to be inserted, which aims to model in Eq. (8). When is defined by the balanced binary tree (BBT)-based insertion order [26], decoding finishes empirically with iterations. The BBT order is to insert the centermost tokens of the current hypothesis. For example, given an output sequence with , the hypothesis grows based on the tree structure like .

3.8 Kermit

KERMIT [34] is a variant of Insertion Transformer. Its basic formulation is the same but only the Transformer encoder is used to predict a token and its position to be inserted.

KERMIT can be trained with the CTC loss in a multi-task learning manner [21]

, which makes the CTC alignment prediction conditioned on a partial hypothesis from the insertion-based decoding. Given the posterior probability of a CTC alignment at

-th decoding step as , the objective of KERMIT is to minimize the following negative log-likelihood:


where is a tunable weight on the CTC loss. During inference, the CTC decoding is performed using in Eq. (10), and it can be terminated at any number of iterations.

3.9 Cif-Na

Continuous integrate-and-fire (CIF)[37] provides a soft and monotonic alignment in the encoder-decoder framework. CIF first learns information weights from the encoder output and accumulates the weights from left to right to locate the acoustic boundary. Then, the acoustic embedding for each target token is obtained by integrating the encoder sates based on their estimated weights. In this paper, we investigated CIF with an NAR decoder, noted as CIF-NA. Following the prior work [39], both the encoder states and acoustic embeddings are fed into the decoder to predict the probability of output tokens in parallel:


CIF also adopts a quantity loss to supervise the model to predict the quantity of the integrated embeddings closer to the length of a target sequence, defined as . The final loss of CIF-NA is a weighted sum of the CTC loss in Eq. (2), and .

Model #iter Inference speed LS-100 (WER) TED2 (WER) CSJ-APS (CER)
RTF Speedup dev test dev test eval1 eval2 eval3
clean other clean other
AR CTC/attention 0.341 1.00 06.8 18.8 07.4 19.0 11.6 8.7 5.4 4.0 09.8
  + beam-search 3.419 0.10 06.3 18.2 06.8 18.5 10.4 8.4 5.1 3.8 09.0
Transducer 0.069 4.94 07.3 19.9 07.6 19.9 09.6 9.2 6.3 4.5 10.6
  + beam-search 0.234 1.46 06.4 18.8 06.8 18.9 08.6 8.2 5.2 4.1 10.0


plus1fil minus1fil

NAR CTC 1 0.059 5.78 07.4 20.5 07.8 20.8 08.9 8.6 5.4 4.0 09.6
Mask-CTC 10 0.063 5.41 07.2 20.3 07.5 20.6 08.9 8.5 5.6 4.0 09.6
Improved Mask-CTC 5 0.072 4.74 07.0 19.8 07.3 20.2 08.8 8.3 5.5 4.0 09.5
Align-Denoise 1 0.073 4.67 08.0 22.3 08.4 22.5 09.0 8.7 5.4 3.7 09.1
Intermediate CTC 1 0.059 5.78 06.9 19.7 07.1 20.2 08.5 8.3 5.6 4.1 09.8
Self-conditioned CTC 1 0.059 5.78 06.6 19.4 06.9 19.7 08.7 8.0 5.3 3.7 09.1
KERMIT 0.361 1.06 07.1 19.7 07.4 20.2 09.1 8.2 5.4 3.7 09.5
Insertion Transformer 0.083 4.11 16.0 27.3 16.2 27.4
CIF-NA 1 0.073 4.67 15.4 34.0 15.7 34.6

According to discussions with the author of CIF [37], CIF-NA suffers from the degradation due to the difficulty of acoustic boundary decisions.

Table 2: Word error rate (WER) or character error rate (CER) on LibriSpeech-100h (LS-100), TEDLIUM2 (TED2), and CSJ-APS. #iter denotes the number of iterations required to generate each token in an output sequence. Real time factor (RTF) was used to measure the inference speed and was evaluated on the LS-100 “test-other” set using CPU.

4 Experiments

Aiming to compare the NAR models in Section 3, we conducted ASR experiments using ESPnet [38, 46]. In addition to the NAR models, we evaluated autoregressive (AR) models, including the attention-based sequence-to-sequence model with the CTC/attention objectives [47, 48] and Conformer-Transducer [49]. The recognition accuracy was evaluated based on word error rate (WER) or character error rate (CER), depending on a task, and the inference speed was measured using real time factor (RTF).

4.1 Experimental setup

Data: The main experiments were carried out using three datasets, including LibriSpeech (LS) [50], TEDLIUM2 (TED2) [51], and Corpus of Spontaneous Japanese (CSJ) [52]. LS consists of utterances from read English audiobooks, and we used the 100-hour subset (LS-100) for training. TED2 contains utterances from English Ted Talks, and we used the 210-hour training data. CSJ includes Japanese public speeches on different academic topics, and we used the 271-hour subset of academic presentation speech (CSJ-APS) for training. For LS-100 and TED2, we used the standard validation and test sets for tuning hyper-parameters and evaluating performance, respectively. Specifically, for LS-100, the validation and test sets are divided into “clean” and “other” based on the quality of the recorded utterances. For CSJ-APS, we used a part of the training set as a validation set and the official evaluation sets (“eval1”, “eval2”, and “eval3”) for testing. For LS-100 and TED2, we used 300 to 500 subwords for tokenizing output texts, which were constructed from each training set using SentencePiece [53]. For CSJ-APS, we used Japanese syllable characters (Kana) and Chinese characters (Kanji), which resulted in 2753 distinct tokens.

We also evaluated several models under noisy conditions using CHiME-4 [54]. CHiME-4 contains English recordings in everyday noisy environments, including a cafe, a street junction, public transport, and a pedestrian area. We combined the CHiME-4 and Wall Street Journal (WSJ) [55] datasets to obtain a 190-hour training set. We used the validation and test sets provided by CHiME4 for tuning hyper-parameters and evaluating performance, respectively. Characters (Latin alphabets) were used for tokenizing target texts.

As input speech features, we extracted 80 mel-scale filterbank coefficients with three-dimensional pitch features using Kaldi [56]. To avoid overfitting, we applied speed perturbation [57] and SpecAugment [58] to the input speech from LS-100, TED2, and CSJ-APS, and only SpecAugment to the input speech from CHiME4.

Network architecture: All the end-to-end ASR models were constructed based on the Conformer-based architecture [11, 49]. For the self-attention module, the number of heads , the dimension of a self-attention layer , and the dimension of a feed-forward network

were set to 4, 256, and 1024, respectively. The kernel size of the convolution module was set to 15. For the models with the encoder-decoder structure (i.e., CTC/attention, Mask-CTC, Improved Mask-CTC, Align-Denoise, Insertion Transformer, and CIF-NA), the encoder consisted of two convolutional neural network (CNN)-based downsampling layers followed by a stack of 12 Conformer encoder blocks. The decoder consisted of 6 Transformer decoder blocks, where the self-attention module had the same configuration as the Conformer block, except

was increased to 2048 to match the number of parameters in the Conformer block. For the models without the decoder (i.e., CTC, KERMIT, Intermediate CTC, Self-conditioned CTC), we used two CNN-based downsampling layers followed by a stack of 18 Conformer encoder blocks. Conformer-Transducer consisted of the 18-layer encoder and a single long short-term memory (LSTM) layer for the decoder. Note that the number of parameters in all the ASR models was around 30M.

Training and decoding configurations: The ASR models were trained using the Adam optimizer [59] with , , , and the Noam learning rate scheduling [60]. Warmup steps (e.g., 4k to 25k) and a learning rate factor (e.g., 1.0 to 10.0) were tuned for each model. We followed the same setup as in [14, 49]

for regularization hyperparameters (e.g., dropout rate and label-smoothing weight). The models were trained up to 300 epochs until convergence. For evaluation, a final model was obtained by averaging model parameters over 5 to 30 checkpoints with the best validation performance. For decoding with the CTC-based NAR models, we performed the best path decoding of CTC 

[15] to keep the inference process NAR. For evaluating the AR models, we applied beam search decoding with beam sizes of 1 (i.e., greedy decoding) and 10. All of the decodings were done without using external language models (LMs). Decoding hyper-parameters that are unique to each model were tuned following the previous works. To evaluate the inference speed, RTF was measured using Intel(R) Xeon(R) Silver 4114 CPU, 2.20GHz on the same machine environment. All of the implementations are publicly available111 Mask-CTC, Improved Mask-CTC, Align-Denoise, Intermediate CTC, Self-conditioned CTC, Insertion Transformer, KERMIT, CIF-NA .

4.2 Main results

Table 2 shows the results on LS-100, TED2, and CSJ-APS.

LibriSpeech-100h (LS-100):

By comparing the results obtained from the NAR models on LS-100, all of our NAR models, except Insertion Transformer and CIF-NA, outperformed the standard CTC-based model. Moreover, on the “clean” sets, Improved Mask-CTC, intermediate CTC, self-conditioned CTC, and KERMIT achieved comparable performances with those of the AR models. On the “other” sets, in contrast, the NAR models resulted in worse performances than the CTC/attention model. Since the “other” sets include utterances with lower quality than the “clean” sets, a model is required to capture dependencies between output tokens to compensate for information loss in the low-quality speech. With the language model mechanism included in the network structure, the CTC/attention model resulted in better performances on “other” sets, effectively modeling dependencies between output tokens.

TEDLIUM2 (TED2): On TED2, all the NAR models, including the standard CTC-based model, achieved results competitive to those obtained from the AR models. Especially on the development set, the NAR models outperformed the CTC/attention model by a large margin. We further discuss this interesting outcome in Section 4.5.

CSJ-APS: All the NAR models achieved comparable performance with the AR models on CSJ-APS, indicating the effectiveness independent of the language. Especially, the performances of Align-Denoise and Self-conditioned CTC aligned with the searched results obtained from the CTC/attention model.

Discussions on Insertion Transformer and CIF-NA: Insertion Transformer resulted in the performance drop because of its two unique characteristics. The first is that it is difficult to judge whether more tokens to be inserted or not. It is controlled by a special token that represents “no more token to be inserted”. However, partial hypothesis including recognition errors leads to misjudging of token insertion. The second is that once a hypothesis encounters recognition error, it can not be recovered during inference. The error is propagated to later iterations and leads to more errors. CIF is a novel alignment mechanism that can be adopted for various ASR scenarios. However, different from the original AR version, CIF-NA resulted in the performance degradation due to the difficulty of estimating acoustic boundaries in the English speech dataset. According to [37, 39], it could achieve competitive performance in monosyllable languages with clear acoustic boundaries (e.g., Chinese Mandarin).

From the above results, it can be concluded that CTC is the key technique to realizing effective NAR end-to-end ASR, as all the well-performing models are based on CTC. Overall, self-conditioned CTC resulted in the best performance among the NAR models, filling the gap against the AR performance.

4.3 Recognition accuracy vs. inference speed

We study the trade-off between recognition accuracy and inference speed. According to RTF results in Table 2, which are measured on the “test-other” set of LS-100, all the NAR models except KERMIT achieved fast inference speed compared to the AR models. This is because of the non-autoregressive nature of NAR models, i.e., it can generate multiple tokens at a single iteration step. The RTF improvement is achieved at the expense of quality-drop in WER. Especially on the “other” sets of LS-100, NAR models did not reach the performance obtained from the CTC/attention model. In contrast, on the “clean” sets, several NAR models (e.g., self-conditioned CTC) achieved better WERs than the greedy results of the CTC/attention model. Improving the recognition of difficult utterances is an important problem for NAR models to be solved.

4.4 Error analysis

Model test-clean test-other
sub del ins sub del ins


CTC/attention 5.3 1.2 0.9 14.5 2.4 2.1
  + beam-search 5.4 0.6 0.8 14.7 1.8 2.1


plus1fil minus1fil


CTC 6.3 0.7 0.8 16.6 2.2 1.9
Improved Mask-CTC 5.9 0.6 0.8 16.2 2.0 2.0
Intermediate CTC 5.7 0.6 0.8 16.2 1.9 2.1
Self-conditioned CTC 5.6 0.6 0.7 15.8 2.0 1.9
KERMIT 6.0 0.6 0.8 16.2 2.0 1.9
Table 3: Error analysis on LS-100. WERs on Table 2 are split into substition (sub), deletion (del), and insertion (ins) error rates.

In Table 3, we break down WERs on LS-100 reported in Table 2. By comparing the results obtained from CTC and the other NAR models, it appeared that the major improvements on our NAR models are attributed to reducing substitution errors. Deletion and insertion error rates, in contrast, were kept relatively low and stayed almost the same from the CTC results, which can often be reduced by using wordpieces. The AR model handled substitution errors more effectively than the NAR models. However, deletion and insertion errors were higher than those of the NAR models, indicating that the NAR models are more effective at generating a sequence with a correct length. From these results, it can be suggested that the key for further improvement of NAR models is to mitigate substitution errors.

4.5 Robustness against output sequence length

Figure 2: Comparison of NAR and AR (CTC/attention) models evaluated on different output sequence length. WERs are compared among the models (top), and the error components are compared between AR and Self-conditioned CTC (bottom).

Figure 2 shows the correlation between output sequence length and error rate for the LS-100 test-clean set. Here, we observed that the performance of the AR model is prone to degradation when the length of an output sequence is long (Figure 2 top). On the other hand, the NAR models successfully recognized the long sequence without having such a severe quality drop. To further investigate the results, we compared the error components between the AR and self-conditioned CTC models (Figure 2 bottom). While the substitution and insertion errors were in the same range between the two models, the deletion error in the AR model got significantly high as the output sequence length increased, where we observed consecutive tokens in the sequence are completely skipped as reported in [61]. This explains why the performance of the AR models on the TED2 development set fell behind those of the NAR models (Table 2), as the set includes long-form utterances up to 40 seconds.

4.6 Evaluation under noisy condition

Model dt05 et05
real simu real simu


CTC/attention 14.0 16.0 22.2 24.1
  + beam-search 13.9 15.7 22.4 23.7


plus1fil minus1fil


CTC 16.5 18.1 25.9 27.0
Improved Mask-CTC 14.9 16.8 24.6 25.2
Intermediate CTC 15.2 16.8 24.2 25.2
Self-conditioned CTC 14.6 16.6 23.7 24.3
Table 4: Word error rate on CHiME4.

Table 4 shows results under noisy conditions based on the CHiME4 task. Note that we only focused on the models achieving promising results on the main tasks. The evaluations were performed on the 1-channel track, where the development (dt05) and test (et05) sets included real and simulated (simu) utterances recorded from one of a single microphones on the tablet device [54]. Comparing the results obtained from the NAR models, all of our NAR models outperformed the standard CTC-based model. However, the results were not comparable with those obtained from the AR model. With the nonstationary noises included in the input speech, it is crucial for an ASR model to attend to dependency among output tokens for generating an accurate sequence. As observed in the LS-100 task in Section 4.2, the results suggest that the NAR models are likely to depend more on acoustic information, having difficulty capturing the token dependencies when the quality of an input speech is low.

4.7 Combination of different techniques

Model LS-100 CHiME4
clean other real simu
Mask-CTC 7.5 20.6 24.9 25.8
  + Intermediate CTC 7.2 20.4 24.9 25.0
Table 5: Word error rate (WER) on LS-100 and CHiME4 test sets for Mask-CTC and its extension with intermediate CTC.

As intermediate CTC is a regularization method for CTC modeling and does not require any architectural changes, it is possible to augment other CTC-based NAR modeling with intermediate CTC. We combine Mask-CTC and intermediate CTC by adding an intermediate CTC loss to the encoder of Mask-CTC, as proposed in [35]. Table 5 shows the comparison of Mask-CTC and its intermediate CTC extension on LS-100 and CHiME-4. In all cases, intermediate CTC improved Mask-CTC, similar to experimental results reported by [35], while requiring no extra computation during inference.

4.8 Application to end-to-end speech translation (E2E-ST)

Model Fisher CallHome
dev dev2 test  devtest evltest
CTC 51.0 51.6 50.8 18.0 18.7
Mask-CTC 51.1 51.7 50.6 17.9 18.3
Intermediate CTC 51.3 51.4 51.0 19.0 19.0
Self-conditioned CTC 50.7 51.2 50.5 19.1 19.2
Orthros () 50.1 50.6 48.7 19.5 19.8
Orthros () 51.6 52.4 50.5 20.5 20.7
Table 6: BLEU () scores of speech translation models on Fisher-CallHome Spanish. #iter is the number of iterations.

We applied Mask-CTC, intermediate CTC, self-conditioned CTC, and Orthros [62] to the NAR E2E-ST task on Fisher-CallHome Spanish corpus [63]. Orthros is based on CMLM but has an additional AR decoder on the same encoder to select the most probable translation among multiple length candidates. All models used the Conformer encoder and 16k wordpieces as output units. The encoder parameters were initialized with a pre-trained AR ASR encoder. We followed the setting in [62] and applied sequence-level knowledge distillation [64]. With the help of the NAR modeling methods, we observed some gains over the CTC-based results (Table 6). While self-conditioned CTC was the most effective method for ASR, it had smaller impact on the E2E-ST task, and Orthros resulted in the best performance. Since input-output alignments are not monotonic in this task, token-level iterative refinement is required to make the sequence generation conditioned on a translated sequence, rather than only depending on acoustic information from input speech.

5 Conclusions

This paper presented a comprehensive study of NAR modeling methods for end-to-end ASR. Various NAR models were compared based on different ASR tasks. The results provided several interesting findings, including the accuracy-speed trade-off and robustness against long-form speech utterances. We also showed that different NAR techniques could be combined to improve the performance and applied to end-to-end speech translation. We believe that the reproducible implementations and recipes used in this paper will accelerate further NAR research in speech processing.

6 Acknowledgement

This work was partly supported by ASAPP. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [65], which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system [66], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The authors would like to thank Linhao Dong and Florian Boyer for helpful discussions.


  • [1] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, 2012.
  • [2] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton,

    “Speech recognition with deep recurrent neural networks,”

    in Proc. ICASSP, 2013, pp. 6645–6649.
  • [3] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. ICML, 2014, pp. 1764–1772.
  • [4] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Proc. NeurIPS, 2015, pp. 577–585.
  • [5] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
  • [6] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [7] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Proc. NeurIPS, 2014, pp. 3104–3112.
  • [8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2014.
  • [9] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, 2018, pp. 5884–5888.
  • [10] Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang, “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in Proc. ICASSP, 2020, pp. 6124–6128.
  • [11] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
  • [12] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018, pp. 4774–4778.
  • [13] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “RWTH ASR systems for LibriSpeech: Hybrid vs attention,” in Proc. Interspeech, 2019, pp. 231–235.
  • [14] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., “A comparative study on Transformer vs RNN in speech applications,” in Proc. ASRU, 2019, pp. 449–456.
  • [15] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
  • [16] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher, “Non-autoregressive neural machine translation,” in Proc. ICLR, 2018.
  • [17] Nanxin Chen, Shinji Watanabe, Jesus Antonio Villalba, Piotr Zelasko, and Najim Dehak, “Non-autoregressive transformer for speech recognition,” IEEE Signal Process. Lett., 2020.
  • [18] William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, and Navdeep Jaitly, “Imputer: Sequence modelling via imputation and dynamic programming,” in Proc. ICML, 2020, pp. 1403–1413.
  • [19] Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi, “Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict,” in Proc. Interspeech, 2020, pp. 3655–3659.
  • [20] Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, and Shuai Zhang, “Listen attentively, and spell once: Whole sentence generation via a non-autoregressive architecture for low-latency speech recognition,” in Proc. Interspeech, 2020, pp. 3381–3385.
  • [21] Yuya Fujita, Shinji Watanabe, Motoi Omachi, and Xuankai Chang, “Insertion-based modeling for end-to-end automatic speech recognition,” in Proc. Interspeech, 2020, pp. 3660–3664.
  • [22] Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, and Zhengqi Wen, “Spike-triggered non-autoregressive Transformer for end-to-end speech recognition,” in Proc. Interspeech, 2020, pp. 5026–5030.
  • [23] Ethan A Chi, Julian Salazar, and Katrin Kirchhoff, “Align-Refine: Non-autoregressive speech recognition via iterative realignment,” in Proc. NAACL-HLT, 2021, pp. 1920–1927.
  • [24] Jindřich Libovický and Jindřich Helcl, “End-to-end non-autoregressive neural machine translation with connectionist temporal classification,” in Proc. EMNLP, 2018, pp. 3016–3021.
  • [25] Jason Lee, Elman Mansimov, and Kyunghyun Cho, “Deterministic non-autoregressive neural sequence modeling by iterative refinement,” in Proc. EMNLP, 2018, pp. 1173–1182.
  • [26] Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit, “Insertion Transformer: Flexible sequence generation via insertion operations,” in Proc. ICML, 2019, pp. 5976–5985.
  • [27] Jiatao Gu, Changhan Wang, and Junbo Zhao, “Levenshtein Transformer,” in Proc. NeurIPS, 2019, pp. 11181–11191.
  • [28] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer, “Mask-predict: Parallel decoding of conditional masked language models,” in Proc. EMNLP-IJCNLP, 2019, pp. 6114–6123.
  • [29] Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi, “Non-autoregressive machine translation with latent alignments,” in Proc. EMNLP, 2020, pp. 1098–1108.
  • [30] Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy, “FlowSeq: Non-autoregressive conditional sequence generation with generative flow,” in Proc. EMNLP-IJCNLP, 2019, pp. 4273–4283.
  • [31] Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, and Tetsunori Kobayashi, “Improved Mask-CTC for non-autoregressive end-to-end ASR,” in Proc. ICASSP, 2021, pp. 8363–8367.
  • [32] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
  • [33] Nanxin Chen, Piotr Żelasko, Laureano Moro-Velázquez, Jesús Villalba, and Najim Dehak, “Align-Denoise: Single-pass non-autoregressive speech recognition,” in Proc. Interspeech, 2021.
  • [34] William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit, “KERMIT: Generative insertion-based modeling for sequences,” arXiv preprint arXiv:1906.01604, 2019.
  • [35] Jaesong Lee and Shinji Watanabe, “Intermediate loss regularization for CTC-based speech recognition,” in Proc. ICASSP, 2021, pp. 6224–6228.
  • [36] Jumon Nozaki and Tatsuya Komatsu, “Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions,” in Proc. Interspeech, 2021.
  • [37] Linhao Dong and Bo Xu, “CIF: Continuous integrate-and-fire for end-to-end speech recognition,” in Proc. ICASSP, 2020, pp. 6079–6083.
  • [38] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
  • [39] Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, and Shilei Zhang, “Boundary and context aware training for CIF-based non-autoregressive end-to-end ASR,” arXiv preprint arXiv:2104.04702, 2021.
  • [40] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. ASRU, 2017, pp. 206–213.
  • [41] Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, and Geoffrey Zweig,

    “Deja-vu: Double feature presentation and iterated loss in deep transformer networks,”

    in Proc. ICASSP, 2020, pp. 6899–6903.
  • [42] Edwin G Ng, Chung-Cheng Chiu, Yu Zhang, and William Chan, “Pushing the limits of non-autoregressive speech recognition,” in Proc. Interspeech, 2021.
  • [43] Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, and Boris Ginsburg, “Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition,” arXiv preprint arXiv:2104.01721, 2021.
  • [44] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli,

    “wav2vec 2.0: A framework for self-supervised learning of speech representations,”

    in Proc. NeurIPS, 2020.
  • [45] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol,

    “Extracting and composing robust features with denoising autoencoders,”

    in Proc. ICML, 2008, pp. 1096–1103.
  • [46] Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, et al., “The 2020 ESPNet update: New features, broadened applications, performance improvements, and future plans,” in

    2021 IEEE Data Science and Learning Workshop (DSLW)

    , 2021, pp. 1–6.
  • [47] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [48] Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani, “Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. Interspeech, 2019, pp. 1408–1412.
  • [49] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al., “Recent developments on ESPnet toolkit boosted by Conformer,” in Proc. ICASSP, 2021, pp. 5874–5878.
  • [50] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [51] Anthony Rousseau et al., “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in Porc. LREC, 2014, pp. 3935–3939.
  • [52] Kikuo Maekawa, Hanae Koiso, Sadaoki Furui, and Hitoshi Isahara, “Spontaneous speech corpus of Japanese.,” in Proc. LREC, 2000, pp. 1–5.
  • [53] Taku Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Proc. ACL, 2018.
  • [54] Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech & Language, vol. 46, pp. 535–557, 2017.
  • [55] Douglas B Paul and Janet M Baker, “The design for the wall street journal-based CSR corpus,” in Proc. Workshop on Speech and Natural Language, 1992, pp. 357–362.
  • [56] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
  • [57] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech, 2015, pp. 3586–3589.
  • [58] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
  • [59] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • [60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 5998–6008.
  • [61] Jan Chorowski and Navdeep Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Proc. Interspeech, 2017, pp. 523–527.
  • [62] Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe, “Orthros: Non-autoregressive end-to-end speech translation with dual-decoder,” in Proc. ICASSP, 2021, pp. 7503–7507.
  • [63] Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur, “Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus,” in Proc. IWSLT, 2013.
  • [64] Yoon Kim and Alexander M. Rush, “Sequence-level knowledge distillation,” in Proc. EMNLP, 2016, pp. 1317–1327.
  • [65] John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D Peterson, et al., “XSEDE: Accelerating scientific discovery,” Computing in Science & Engineering, vol. 16, no. 5, pp. 62–74, 2014.
  • [66] Nicholas A Nystrom, Michael J Levine, Ralph Z Roskies, and J Ray Scott, “Bridges: a uniquely flexible HPC resource for new communities and data analytics,” in Proc. the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, 2015, pp. 1–8.