Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

05/18/2020 ∙ by Yosuke Higuchi, et al.

We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. Non-autoregressive models, in contrast, can generate tokens simultaneously within a constant number of iterations, which results in a significant reduction of inference time and better suits end-to-end ASR models for real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioning on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% → 12.1% WER on WSJ) and approaches the results of autoregressive models, while requiring much less inference time using CPUs (0.07 RTF in a Python implementation). All of our code will be publicly available.







1 Introduction

Owing to the rapid development of neural sequence-to-sequence modeling [sutskever2014sequence, bahdanau2014neural], deep neural network (DNN)-based end-to-end automatic speech recognition (ASR) systems have become almost as effective as the traditional hidden Markov model-based systems [chiu2018state, luscher2019rwth, karita2019a]. Various models and approaches have been proposed for improving the performance of the autoregressive (AR) end-to-end ASR model with the encoder-decoder architecture based on recurrent neural networks (RNNs) [chorowski2015attention, chan2016listen, kim2017joint] and Transformers [vaswani2017attention, dong2018speech, karita2019improving].

Contrary to the autoregressive framework, non-autoregressive (NAR) sequence generation has attracted attention, including the revisiting of connectionist temporal classification (CTC) [graves2006connectionist, libovicky2018end] and the growing interest in the non-autoregressive Transformer (NAT) [gu2017non]. While an autoregressive model requires L iterations to generate an L-length target sequence, a non-autoregressive model costs only a constant number of iterations K, independent of the length of the target sequence. Despite this limitation on decoding iterations, some recent studies in neural machine translation have successfully shown the effectiveness of non-autoregressive models, achieving results comparable to autoregressive models. Different types of non-autoregressive models have been proposed based on iterative refinement decoding [lee2018deterministic], insertion- or edit-based sequence generation [stern2019insertion, gu2019levenshtein], the masked language model objective [ghazvininejad2019mask, ghazvininejad2020semi, saharia2020non], and generative flow [ma2019flowseq].

Some attempts have also been made to realize non-autoregressive models in speech recognition. CTC introduces a frame-wise latent alignment to represent the alignment between the input speech frames and the output tokens [graves2014towards]. While CTC makes use of dynamic programming to efficiently calculate the most probable alignment, its strong conditional independence assumption between output tokens results in poor performance compared to autoregressive models [battenberg2017exploring]. On the other hand, [chen2019non] trains a Transformer encoder-decoder in a mask-predict manner [ghazvininejad2019mask]: target tokens are randomly masked and predicted conditioning on the unmasked tokens and the input speech. To generate the output sequence in parallel during inference, the target sequence is initialized as all masked tokens and the output length is predicted by finding the position of the end-of-sequence token. However, with this prediction of the output length, the model is known to be vulnerable to long output sequences: at the beginning of decoding, the model is likely to make more mistakes in predicting a long masked sequence, propagating the errors to later decoding steps. [chan2020imputer] proposes Imputer, which performs mask prediction over CTC's latent alignment to get rid of the output length prediction. However, unlike mask-predict, Imputer requires more computation in each iteration, proportional to the square of the input length in the self-attention layers, and the total computational cost can be very large.

Our work aims to obtain a non-autoregressive end-to-end ASR model that generates the sequence at the token level with low computational cost. The proposed Mask CTC framework trains a Transformer encoder-decoder model with both CTC and mask-predict objectives. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. The masked low-confidence tokens are then predicted conditioning on the high-confidence tokens, not only in the past but also in the future context. The advantages of Mask CTC are summarized as follows.

No requirement for output length prediction: Predicting the output token length from input speech is rather challenging because the length of the input utterances varies greatly depending on the speaking rate or the duration of silence. By initializing the target sequence with the CTC outputs, Mask CTC does not need to predict the output length at the beginning of decoding.

Accurate and fast decoding: We observed that the CTC outputs themselves are quite accurate. Mask CTC not only retains the correct tokens from the CTC outputs but also recovers the output errors by considering the entire context. Token-level iterative decoding with a small number of masks makes the model well-suited for use in real-world scenarios.

2 Mask CTC framework

The objective of end-to-end ASR is to model the joint probability P(Y|X) of an L-length output sequence Y = (y_1, ..., y_L) given a T-length input sequence X = (x_1, ..., x_T). Here, y_l is an output token at position l in the vocabulary V, and x_t is a D-dimensional acoustic feature at frame t.

The following subsections first explain a conventional autoregressive framework based on attention-based encoder-decoder and CTC. Then a non-autoregressive model trained with mask prediction is explained and finally, the proposed Mask CTC decoding method is introduced.

2.1 Attention-based encoder-decoder

Attention-based encoder-decoder models the joint probability of Y given X by factorizing the probability based on the probabilistic left-to-right chain rule as follows:

P_att(Y|X) = ∏_{l=1}^{L} P(y_l | y_{<l}, X),   (1)

where y_{<l} = (y_1, ..., y_{l-1}). The model estimates the output token y_l at each time-step conditioning on the previously generated tokens in an autoregressive manner. In general, the ground-truth tokens are used as the history tokens during training, and the predicted tokens are used during inference.

2.2 Connectionist temporal classification

CTC predicts a frame-level alignment between the input sequence X and the output sequence Y by introducing a special <blank> token. The alignment A = (a_1, ..., a_T) is predicted with the conditional independence assumption between the output tokens as follows:

P_ctc(A|X) = ∏_{t=1}^{T} P(a_t | X).   (2)

Considering the probability distribution over all possible alignments, CTC models the joint probability of Y given X as follows:

P_ctc(Y|X) = Σ_{A ∈ β^{-1}(Y)} P(A|X),   (3)

where β^{-1}(Y) returns all possible alignments compatible with Y. The summation of the probabilities over all of the alignments can be computed efficiently by using dynamic programming.
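To make the collapsing function β concrete, the following sketch (our illustration; treating index 0 as <blank> is an assumption) performs greedy CTC decoding: take the best token per frame, then collapse repeated tokens and remove blanks.

```python
BLANK = 0  # assumed index of the <blank> token

def ctc_greedy_collapse(frame_token_ids):
    """Collapse a frame-level alignment A into an output sequence Y by
    removing repeated tokens and <blank> symbols (the function beta)."""
    output = []
    prev = None
    for t in frame_token_ids:
        # Keep a token only when it starts a new run and is not blank.
        if t != prev and t != BLANK:
            output.append(t)
        prev = t
    return output
```

For example, the alignments (blank, 3, 3, blank, 3, 5, 5) and (3, blank, blank, 3, blank, 5, blank) both collapse to (3, 3, 5), which is why Eq. (3) sums over β^{-1}(Y).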

To achieve robust alignment training and fast convergence, an end-to-end ASR model based on the attention-based encoder-decoder framework is trained jointly with CTC [kim2017joint, karita2019improving]. The objective of the autoregressive joint CTC-attention model is defined as follows by combining Eq. (1) and Eq. (3):

L_AR = λ log P_ctc(Y|X) + (1 − λ) log P_att(Y|X),   (4)

where λ (0 ≤ λ ≤ 1) is a tunable parameter.

2.3 Joint CTC-CMLM non-autoregressive ASR

Mask CTC adopts non-autoregressive speech recognition [chen2019non] based on a conditional masked language model (CMLM) [ghazvininejad2019mask], where the model is trained to predict masked tokens in the target sequence [devlin2019bert]. (Note that the CMLM is used here as an ASR decoder network conditioned on the encoder output, which is different from an external language model often used in shallow fusion during decoding.) Taking advantage of the Transformer's parallel computation [vaswani2017attention], the CMLM can predict any arbitrary subset of masked tokens in the target sequence by attending to the entire sequence, including tokens in both the past and the future.

The CMLM predicts a set of masked tokens Y_mask conditioning on the input sequence X and the observed (unmasked) tokens Y_obs as follows:

P_cmlm(Y_mask | Y_obs, X) = ∏_{y ∈ Y_mask} P(y | Y_obs, X),   (5)

where Y_obs = Y \ Y_mask. During training, the ground-truth tokens are randomly replaced by a special <MASK> token, and the CMLM is trained to predict the original tokens conditioning on the input sequence X and the unmasked tokens Y_obs. The number of tokens to be masked is sampled from a uniform distribution between 1 and L, as in [ghazvininejad2019mask]. During inference, the target sequence is gradually generated in a constant number of iterations K by the iterative decoding algorithm [ghazvininejad2019mask], which repeatedly masks and predicts subsets of the target sequence.
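The training-time masking described above can be sketched as follows. This is a minimal illustration with hypothetical names; a real implementation would operate on batched token-id tensors.

```python
import random

MASK = "<MASK>"  # special mask token (string form for illustration)

def random_mask(tokens, rng=random):
    """CMLM-style masking: sample the number of masked positions
    uniformly from 1..L, then replace a random subset with <MASK>."""
    n_mask = rng.randint(1, len(tokens))          # uniform over 1..L
    positions = rng.sample(range(len(tokens)), n_mask)
    y_obs = list(tokens)
    for p in positions:
        y_obs[p] = MASK
    return y_obs, sorted(positions)
```

The model is then trained to recover the original tokens at the returned positions, conditioning on the remaining tokens and the encoder output.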

We observed that applying the original CMLM to non-autoregressive speech recognition shows poor performance, with problems of skipping and repeating output tokens. To deal with this, we found that joint training with CTC, similar to [kim2017joint], explicitly provides the model with absolute positional information (via the conditional independence assumption) and improves the model performance reasonably well. With the CTC objective from Eq. (3) and Eq. (5), the objective of joint CTC-CMLM training for the non-autoregressive ASR model is defined as follows:

L_NAR = γ log P_ctc(Y | X) + (1 − γ) log P_cmlm(Y_mask | Y_obs, X),

where γ (0 ≤ γ ≤ 1) is a tunable parameter.
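As a minimal sketch, the interpolation of the two log-likelihoods (negated to give a loss for minimization) looks like this; γ = 0.3 follows the setting reported in Section 3.2, and the inputs are assumed to be precomputed log-probabilities.

```python
def joint_nar_loss(log_p_ctc, log_p_cmlm, gamma=0.3):
    """Joint CTC-CMLM objective as a loss: the negative weighted sum of
    the CTC and CMLM log-probabilities (gamma is a tunable weight)."""
    return -(gamma * log_p_ctc + (1.0 - gamma) * log_p_cmlm)
```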

2.4 Mask CTC decoding

Figure 1: Overview of Mask CTC predicting “CAT” based on CTC outputs. The model is trained with the joint CTC and mask-predict objectives. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. The masked low-confidence tokens are predicted conditioning on the high-confidence tokens.

Non-autoregressive models must know the length of the output sequence to predict the entire sequence in parallel. For example, at the beginning of CMLM decoding, the output length must be given to initialize the target sequence with masked tokens. To deal with this problem in machine translation, the output length is predicted by training a fertility model [gu2017non] or by introducing a special <LENGTH> token in the encoder [ghazvininejad2019mask]. In speech recognition, however, due to the different characteristics of the input acoustic signals and the output linguistic symbols, predicting the output length appears to be rather challenging; e.g., the length of input utterances with the same transcription varies greatly depending on the speaking rate or the duration of silence. [chen2019non] simply makes the decoder predict the position of the <EOS> token to deal with the output length. However, their analysis showed that this prediction is vulnerable to long output sequences, because the model is likely to make more mistakes when predicting a long masked sequence, and the errors propagate to later decoding steps, degrading recognition performance. To compensate for this problem, they use beam search with CTC and a language model to obtain reasonable performance, which slows down the overall decoding and makes the advantage of the non-autoregressive framework less effective.

To tackle this problem regarding the initialization of the target sequence, we consider using the CTC outputs as the initial sequence for decoding. Figure 1 shows Mask CTC decoding based on the inference of CTC. The CTC outputs are first obtained through a single calculation of the encoder, and the decoder then serves to refine the CTC outputs by attending to the whole sequence.

In this work, we use the "greedy" result of CTC, Ŷ = (ŷ_1, ..., ŷ_L), which is obtained without using prefix search [graves2006connectionist], to keep the inference algorithm non-autoregressive. The errors caused by the conditional independence assumption are expected to be corrected by the CMLM decoder. The posterior probability of each token ŷ_l is approximately calculated from the frame-level CTC probabilities as follows:

P̂(ŷ_l | X) = max_j P(a_j | X),   (6)

where j ranges over the consecutive identical alignments a_j that correspond to the aggregated token ŷ_l. Then, part of Ŷ is masked out based on the confidence P̂(ŷ_l | X) as follows:

Y_mask = { ŷ_l ∈ Ŷ : P̂(ŷ_l | X) < P_thres },   Y_obs = Ŷ \ Y_mask,   (7)

where P_thres is a threshold used to decide whether a target token is masked or not. Finally, Y_mask is predicted conditioning on the high-confidence tokens Y_obs and the input sequence X, as in Eq. (5).
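The confidence computation and thresholding described above can be sketched as follows. This is our illustration with assumed conventions: blank index 0, a sentinel MASK id, and max-pooling of frame probabilities over each token's run of frames as one simple way to aggregate the frame-level CTC probabilities.

```python
BLANK = 0   # assumed index of the <blank> token
MASK = -1   # hypothetical sentinel id for the <MASK> token

def mask_ctc_init(frame_ids, frame_probs, p_thres):
    """Collapse a greedy frame-level alignment into tokens, score each
    token by the max frame probability over its run, and replace
    low-confidence tokens with MASK."""
    tokens, confs = [], []
    prev = None
    for t, p in zip(frame_ids, frame_probs):
        if t == BLANK:
            prev = None                     # blank separates runs, allowing repeats
            continue
        if t == prev:
            confs[-1] = max(confs[-1], p)   # extend the current token's run
        else:
            tokens.append(t)
            confs.append(p)
            prev = t
    return [t if c >= p_thres else MASK for t, c in zip(tokens, confs)]
```

Tokens kept by the threshold form Y_obs; the MASK positions form Y_mask and are handed to the CMLM decoder for refinement.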

We also investigated applying one of the iterative decoding methods, called easy-first [goldberg2010efficient, chen2019non]. Starting with the masked CTC output, the masked tokens are gradually predicted in order of confidence based on the CMLM probability. In the n-th decoding iteration, each masked position l is filled as follows:

ŷ_l = argmax_w P_cmlm(y_l = w | Y_obs^(n), X),   (8)

where Y_obs^(n) denotes the tokens observed (unmasked) at the n-th iteration. The top C masked tokens with the highest probabilities are predicted in each iteration. By defining C = ⌈#mask / K⌉, the total number of decoding iterations can be controlled to a constant K.
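This easy-first loop can be sketched as follows. It is our illustration, not the paper's implementation; `predict_fn` is a hypothetical model call returning, for each currently masked position, its best token and probability.

```python
import math

MASK = "<MASK>"

def easy_first_decode(predict_fn, y_init, num_iterations):
    """Refine a partially masked sequence: in each iteration, predict all
    masked positions but commit only the C most confident predictions,
    with C chosen so the loop finishes in num_iterations rounds."""
    y = list(y_init)
    n_mask = y.count(MASK)
    if n_mask == 0 or num_iterations <= 0:
        return y
    c = math.ceil(n_mask / num_iterations)
    for _ in range(num_iterations):
        if MASK not in y:
            break
        preds = predict_fn(y)  # {pos: (token, prob)} for masked positions
        # Commit the top-C most confident predictions in this round.
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:c]
        for pos, (tok, _prob) in best:
            y[pos] = tok
    return y
```

Setting num_iterations equal to the number of masks recovers the one-mask-per-iteration setting reported in the experiments (#mask iterations).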

With the proposed non-autoregressive training and decoding of Mask CTC, the model does not have to predict the output length. Moreover, decoding by refining the CTC outputs with mask prediction is expected to compensate for the errors that come from the conditional independence assumption.

3 Experiments

To evaluate the effectiveness of Mask CTC, we conducted speech recognition experiments comparing different end-to-end ASR models using ESPnet [watanabe2018espnet]. The performance of the models was evaluated based on character error rates (CERs) or word error rates (WERs) without relying on external language models.

Model                            Iterations   dev93   eval92   RTF
CTC-attention [kim2017joint]     L            14.4    11.3     0.97
 + beam search                   L            13.5    10.9     4.62
CTC                              1            22.2    17.9     0.03
Mask CTC (P_thres = 0)           1            16.3    12.9     0.03
Mask CTC                         1            15.7    12.5     0.04
Mask CTC                         5            15.5    12.2     0.05
Mask CTC                         10           15.5    12.1     0.07
Mask CTC                         #mask        15.4    12.1     0.13
CTC [chan2020imputer]            1            -       15.2     -
Imputer (IM) [chan2020imputer]   8            -       16.5     -
Imputer (DP) [chan2020imputer]   8            -       12.7     -
Table 1: Word error rates (WERs) and real time factor (RTF) for WSJ (English).
Model                            Iterations   Dev    Test
CTC-attention [kim2017joint]     L            35.5   35.5
 + beam search                   L            35.4   35.7
CTC                              1            53.8   56.1
Mask CTC (P_thres = 0)           1            41.6   40.2
Mask CTC                         1            41.3   40.1
Mask CTC                         5            40.7   39.4
Mask CTC                         10           40.5   39.2
Mask CTC                         #mask        40.4   39.0
Table 2: Word error rates (WERs) for Voxforge (Italian).
Figure 2: Decoding example for utterance 443c040i in WSJ eval92. The target sequence is initialized as the CTC outputs and some tokens are replaced with masks (“_”) based on the CTC confidence. Then, the masked tokens are iteratively predicted conditioning on the other unmasked tokens. Red indicates characters with errors and blue indicates ones recovered by Mask CTC decoding.

3.1 Datasets

The experiments were carried out on three tasks with different languages and amounts of training data: the 81-hour Wall Street Journal (WSJ) in English [paul1992design], the 581-hour Corpus of Spontaneous Japanese (CSJ) [maekawa2003corpus], and the 16-hour Voxforge in Italian [voxforge]. For the network inputs, we used 80 mel-scale filterbank coefficients with three-dimensional pitch features and applied SpecAugment [park2019specaugment] during model training. For the tokenization of the targets, we used characters: the Latin alphabet for English and Italian, and Japanese syllable characters (Kana) and Chinese characters (Kanji) for Japanese.

3.2 Experimental setup

For the experiments on all of the tasks, we adopted the same encoder-decoder architecture as [karita2019improving], which consists of Transformer self-attention layers with 4 attention heads, 256 hidden units, and a 2048-dimensional feed-forward inner layer. The encoder consisted of 12 self-attention layers with convolutional layers for downsampling, and the decoder consisted of 6 self-attention layers. With the mask-predict objective, training the Mask CTC model required more epochs to converge (about 200 – 500) than the autoregressive models (about 50 – 100). The final autoregressive model was obtained by averaging the model parameters of the last 10 epochs, as in [karita2019a]. For the Mask CTC model, we found that the performance was significantly improved by averaging the model parameters of the 10 – 30 epochs with the top validation accuracies. For the threshold P_thres in Eq. (7), we used 0.999, 0.999, and 0.9 for WSJ, Voxforge, and CSJ, respectively. For all of the tasks, the loss weight λ in Eq. (4) and the loss weight γ in the objective of Section 2.3 were both set to 0.3.

3.3 Evaluated models

  • CTC-attention: An autoregressive model trained with the joint CTC-attention objective as in Eq. (4). During inference, the joint CTC-attention decoding is applied with beam search [hori2017joint].

  • CTC: A non-autoregressive model simply trained with the CTC objective.

  • Mask CTC: A non-autoregressive model trained with the joint CTC-CMLM objective described in Section 2.3. During inference, the proposed decoding based on masking the CTC outputs (explained in Section 2.4) is applied. Note that when P_thres = 0 in Eq. (7), the greedy output of CTC was used as the decoded result.

3.4 Results

Model                            Eval1 (CER / SER)   Eval2 (CER / SER)   Eval3 (CER / SER)
CTC-attention [kim2017joint]     6.37 / 57.0         4.76 / 53.7         5.40 / 39.6
 + beam search                   6.21 / 56.8         4.50 / 53.4         5.15 / 40.1
CTC                              6.51 / 59.7         4.71 / 59.5         5.49 / 44.5
Mask CTC                         6.56 / 60.3         4.69 / 57.0         4.97 / 41.9
Mask CTC ()                      6.56 / 58.7         4.57 / 55.5         4.96 / 40.7
Table 3: Character error rates (CERs) and sentence error rates (SERs) for CSJ (Japanese).

Table 1 shows the results for WSJ in terms of WERs and real time factors (RTFs), the latter measured for decoding eval92 on an Intel(R) Core(TM) i9-7980XE CPU at 2.60 GHz. Comparing the non-autoregressive models, we can see that the greedy CTC outputs of Mask CTC outperformed the simple CTC model, thanks to training with the mask-predict objective. Applying the refinement based on the proposed CTC masking steadily improved the model performance. The performance was further improved by increasing the number of decoding iterations, with the best performance obtained with #mask iterations, i.e., one mask predicted per iteration. The results of Mask CTC are reasonable compared to those of prior work [chan2020imputer]. Our models also approached the results of the autoregressive models, starting from the initial CTC result. In terms of decoding speed measured in RTF, Mask CTC is, at most, 116 times faster than the autoregressive models. Since most of the CTC outputs are fairly accurate and the number of masks is quite small, the decoding speed did not degrade much as the number of decoding iterations was increased.

Figure 2 shows an example decoding process for a sample in the WSJ evaluation set. Here, we can see that the CTC outputs include errors mainly coming from substitutions due to incomplete word spellings. By applying Mask CTC decoding, the spelling errors were successfully recovered by considering the conditional dependence between characters at the word level. However, as can be seen in the error for "sifood," Mask CTC cannot recover errors derived from character-level insertions or deletions, because the length allocated to each word is fixed by the CTC outputs.

Table 2 shows WERs for Voxforge. Mask CTC yielded better scores than the standard CTC model, similarly to the WSJ results, demonstrating that our model can be applied to other languages with a relatively small amount of training data.

Table 3 shows character error rates (CERs) and sentence error rates (SERs) for CSJ. While Mask CTC achieved CERs quite close to, or even better than, those of the autoregressive model, the results showed little improvement over the simple CTC model compared to the results on the aforementioned tasks. Since Japanese has a large number of characters and a single character often forms a word by itself, the simple CTC model seems to handle the short-range dependence between characters reasonably well, yielding almost the same scores without Mask CTC. However, looking at the sentence-level results, we observed clear improvements on all of the evaluation sets, again showing that our model effectively recovers CTC errors by considering the conditional dependence between tokens.

These experimental results on different tasks indicate that the Mask CTC framework is especially effective for languages whose tokens are small units (i.e., the Latin alphabet and other phonemic scripts). Investigating its effectiveness when byte-pair encodings (BPEs) [sennrich2016neural] are used for such languages is left for future work.

4 Conclusions

This paper proposed Mask CTC, a novel non-autoregressive end-to-end speech recognition framework, which generates a sequence by refining the CTC outputs based on mask prediction. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence masked tokens are iteratively refined conditioning on the other unmasked tokens and the input speech features. The experimental comparisons demonstrated that Mask CTC outperforms the standard CTC model while keeping decoding fast. Mask CTC approached the results of the autoregressive models; for CSJ in particular, the results were comparable or even better. Our future plan is to reduce the mismatch between the masking strategies used in training (random masking) and inference (CTC-based masking). Furthermore, we plan to explore the integration of external language models (e.g., BERT [devlin2019bert]) into the Mask CTC framework.