Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

07/20/2021 ∙ by Tianzi Wang, et al. ∙ Yahoo ∙ Johns Hopkins University ∙ Carnegie Mellon University

Non-autoregressive (NAR) modeling has gained increasing attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve promising real-time factor (RTF) improvements with only a small degradation in accuracy compared to autoregressive (AR) models. However, inference must wait for the completion of a full speech utterance, which limits the application of these models in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise-attention with connectionist temporal classification with mask-predict (Mask-CTC) NAR. During inference, the input audio is separated into small blocks and processed in a blockwise streaming way. To address the insertion and deletion errors at the edges of each block's output, we apply an overlapping decoding strategy with a dynamic mapping trick that produces more coherent sentences. Experimental results show that the proposed method improves online ASR recognition in low-latency conditions compared to vanilla Mask-CTC. Moreover, it achieves much faster inference than the AR attention-based models. All of our code will be publicly available at




1 Introduction

Over the past years, advances in deep learning have dramatically boosted the performance of end-to-end (E2E) automatic speech recognition (ASR) [graves2006connectionist, graves2012sequence, chorowski2014end]. Most of these E2E-ASR studies were based on autoregressive (AR) models, which achieved state-of-the-art performance [gulati2020Conformer]. However, AR models have the disadvantage that the inference time increases linearly with the output length. Recently, non-autoregressive (NAR) models have gained increasing attention in sequence-to-sequence tasks, including machine translation [gu2017non, libovicky2018end, ghazvininejad2019mask], speech recognition [graves2006connectionist, chan2020imputer, fujita2020insertion, chi2020align, fan2020cass], and speech translation [inaguma2020orthros]. In contrast to AR modeling, NAR modeling predicts the output tokens concurrently, so its inference is dramatically faster. In particular, connectionist temporal classification (CTC) is a popular and simple NAR model [graves2006connectionist, libovicky2018end]. However, CTC makes a strong conditional independence assumption between the predicted tokens, leading to inferior performance compared to the AR attention-based models [battenberg2017exploring].

To overcome this issue, several NAR approaches have been proposed in the ASR field. A-FMLM [chen2019listen] is designed to predict the masked tokens conditioned on the unmasked ones and the input speech. However, it first needs to predict the output length, which is difficult and easily leads to an overly long output sequence. Imputer [chan2020imputer] directly uses the length of the input feature sequence to address this issue and achieves performance comparable to AR models, but its computational cost can be very large. ST-NAR [tian2020spike] uses CTC to predict the target length and to guide the decoding. It is fast but suffers from a large accuracy degradation compared with AR models. Different from these methods, Mask-CTC [higuchi2020mask] first generates the output tokens with greedy CTC decoding, and then refines the low-confidence tokens with a mask-predict decoder [ghazvininejad2019mask]. Mask-CTC usually predicts sequences of reasonable length and achieves fast inference, 7x faster than the AR attention-based model.

In addition to fast inference, latency is an important factor for an ASR system used in a real-time speech interface. There have been many prior studies on low-latency E2E ASR based on the Recurrent Neural Network Transducer (RNN-T) [graves2012sequence, battenberg2017exploring, tripathi2020transformer, jain2019rnn] and on online attention-based encoder-decoder (AED) AR models, such as monotonic chunkwise attention (MoChA) [chiu2017monotonic, inaguma2020enhancing], triggered attention [moritz2020streaming], and blockwise-attention [miao2020transformer, tsunoo2020streaming].

Motivated by such online AR AED studies and emergent NAR research trends, we propose a novel end-to-end streaming NAR speech recognition system that combines the blockwise-attention and Mask-CTC models. During inference, the input audio is first separated into small blocks with overlap between consecutive blocks. CTC first predicts the preliminary tokens per block with an efficient greedy forward pass based on the output of a blockwise-attention encoder. To address the insertion and deletion errors that frequently appear in CTC outputs at the boundary of each block, we apply a dynamic overlapping strategy [chiu2019comparison] to produce coherent sentences. Then, low-confidence tokens are masked and re-predicted by a mask-predict NAR decoder [higuchi2020improved] conditioned on the remaining tokens. The greedy CTC, dynamic overlapping decoding, and mask-prediction steps are all very fast, and together they achieve a low RTF. We evaluated our approach on TEDLIUM2 [rousseau2014enhancing] and AISHELL1 [bu2017aishell]. Compared to vanilla full-attention Mask-CTC, our proposed method decreases the online ASR recognition error rate in a low-latency condition, with very fast inference speed. To the best of our knowledge, this is the first work that extends the NAR mechanism to streaming ASR.

1.1 Relationship with other streaming ASR studies

The most important recent success in streaming ASR is RNN-T and its variants. RNN-T based systems achieve state-of-the-art ASR performance for streaming applications and are successfully deployed in production systems [tripathi2020transformer, jain2019rnn, li2020towards, mahadeokar2021alignment]. However, the recurrent mechanism predicts the token of the current input frame based on all previous tokens using recurrent layers, to which NAR cannot be easily applied. In addition, several ideas in this paper are inspired by online AR AED architectures [chiu2019comparison, miao2020transformer, tsunoo2020streaming], since they share more technical connections with NAR models through the similar encoder-decoder framework.

2 Mask-CTC

Mask-CTC is a non-autoregressive model trained with both a CTC objective and a mask-prediction objective [higuchi2020mask], where the mask-predict decoder predicts the masked tokens based on the CTC output tokens and the encoder output. CTC predicts a frame-level input-output alignment based on a conditional independence assumption between frames. It models the probability $P_{ctc}(Y \mid X)$ of the output sequence by summing over all possible alignments, where $Y$ denotes the output sequence and $X$ denotes the input audio. However, due to the conditional independence assumption, CTC loses the ability to model correlations between output tokens and consequently loses performance.
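The "sum over all possible alignments" can be made concrete on a toy input. The following is a minimal brute-force sketch, not the dynamic-programming algorithm used in practice; the blank id `BLANK = 0` and the helper names are assumptions introduced for illustration.

```python
from itertools import product

BLANK = 0  # assumed id of the CTC blank symbol


def collapse(path):
    """Map a frame-level alignment to an output sequence:
    merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for tok in path:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


def ctc_probability(frame_probs, target):
    """Sum the probability of every alignment that collapses to `target`.

    frame_probs[t][v] is the posterior of token v at frame t. Frames are
    treated as conditionally independent, as CTC assumes. The loop visits
    V**T paths, so this is only usable for tiny examples.
    """
    num_tokens = len(frame_probs[0])
    total = 0.0
    for path in product(range(num_tokens), repeat=len(frame_probs)):
        if collapse(path) == list(target):
            p = 1.0
            for t, tok in enumerate(path):
                p *= frame_probs[t][tok]
            total += p
    return total


# Two frames, vocabulary {blank, 1}: the alignments (1, 1), (1, blank)
# and (blank, 1) all collapse to the output [1].
probs = [[0.4, 0.6], [0.5, 0.5]]
print(ctc_probability(probs, [1]))
```

Because each frame is scored independently, nothing in this sum lets one output token influence another, which is the limitation the mask-predict decoder is meant to address.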

Mask-CTC was designed to mitigate this issue by adopting an attention-based decoder as a masked language model (MLM) [ghazvininejad2019mask, chen2019nonautoregressive] and iteratively refining the output of CTC greedy decoding. During training, tokens in the ground truth are randomly selected and replaced by a special <MASK> token. The decoder is then trained to predict the actual tokens at the masked positions, $Y_{mask}$, conditioned on the remaining unmasked tokens, $Y_{obs}$, and the attention-based encoder output:

$$P_{mlm}(Y_{mask} \mid Y_{obs}, X) = \prod_{y \in Y_{mask}} P_{mlm}(y \mid Y_{obs}, \mathrm{MHSAEncoder}(X)) \tag{1}$$

where $\mathrm{MHSAEncoder}(\cdot)$ denotes a multi-headed self-attention based encoder. The Mask-CTC model is optimized by a weighted sum of the CTC and MLM objectives:

$$\mathcal{L}_{NAR} = \alpha \mathcal{L}_{ctc} + (1 - \alpha) \mathcal{L}_{mlm} \tag{2}$$

where $\alpha$ is a tunable hyper-parameter.
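The training recipe above (random masking of the ground truth plus a weighted sum of two losses) can be sketched as follows. This is an illustration under assumptions: the number of masked tokens is drawn uniformly here, and `random_mask` and `joint_loss` are hypothetical helper names, not functions of any toolkit.

```python
import random

MASK = "<MASK>"


def random_mask(tokens, rng):
    """Replace a random, non-empty subset of ground-truth tokens with MASK.

    Returns (decoder_input, masked_positions). Drawing the number of masked
    tokens uniformly from 1..len(tokens) is an assumption of this sketch.
    """
    num_mask = rng.randint(1, len(tokens))
    positions = sorted(rng.sample(range(len(tokens)), num_mask))
    decoder_input = list(tokens)
    for pos in positions:
        decoder_input[pos] = MASK
    return decoder_input, positions


def joint_loss(loss_ctc, loss_mlm, alpha):
    """Weighted sum of the two objectives: alpha * L_ctc + (1 - alpha) * L_mlm."""
    return alpha * loss_ctc + (1.0 - alpha) * loss_mlm


rng = random.Random(0)
decoder_input, masked_positions = random_mask(["a", "b", "c", "d"], rng)
print(decoder_input, masked_positions)
print(joint_loss(loss_ctc=2.0, loss_mlm=4.0, alpha=0.3))
```

The decoder is only penalized at `masked_positions`; the unmasked tokens serve as the conditioning context.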

3 Streaming NAR

The overall architecture of the proposed E2E streaming NAR speech recognition system is shown in Figure 1. The main difference compared with Mask-CTC is that the normal MHSA-based encoder (e.g. Transformer/Conformer) as shown in Eq. (1) is replaced by a blockwise-attention encoder to make the model streamable.

Figure 1: Architecture of proposed streaming NAR

3.1 Blockwise-attention Encoder

To build a streaming AED-based ASR system, the encoder is only allowed to access limited future context. We use a blockwise-attention (BA) based encoder [miao2020transformer, tsunoo2020streaming] instead of normal multi-headed self-attention (MHSA). In a BA-based encoder, the input sequence $X$ is divided into fixed-length blocks

$$X_b = (x_t \mid t = bL_{block} + 1, \ldots, (b + 1)L_{block}) \tag{3}$$

where $b \in \{0, \ldots, B\}$. Here $b$ is the block index, $B$ is the index of the last block in the whole input, and $L_{block}$ is the block length. In the computation of blockwise-attention, each block only attends to the former layer's output within the current block and the previous block. Blockwise-attention at the $b$-th block is defined as:

$$Z^{i}_{b} = \mathrm{BA}(Z^{i-1}_{b}, [Z^{i-1}_{b-1}, Z^{i-1}_{b}], [Z^{i-1}_{b-1}, Z^{i-1}_{b}]) \tag{4}$$

where $Z^{i}_{b}$ is the output of encoder layer $i$ at the $b$-th block, and $Z^{0}_{b} = X_b$. The three arguments of $\mathrm{BA}(\cdot)$ are the query, key, and value matrix variables, respectively. Likewise, the blockwise-depthwise-convolution (BDC) in the Conformer encoder is defined as:

$$Z^{i}_{b} = \mathrm{BDC}(Z^{i-1}_{b}) = \mathrm{Conv}(\mathrm{BPad}(Z^{i-1}_{b})) \tag{5}$$

where $\mathrm{Conv}(\cdot)$ is a 1D depth-wise convolution and $\mathrm{BPad}(\cdot)$ refers to blockwise-padding, which pads zeros at the right edge and pads $Z^{i-1}_{b-1}$ at the left edge of $Z^{i-1}_{b}$ to keep the input/output dimension identical. The remaining operations, such as point-wise convolution and the activation function, are the same as in Conformer [gulati2020Conformer].
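The two blockwise restrictions described in this section can be illustrated with plain Python lists. This is a minimal sketch, not the ESPnet implementation: frames are represented as scalars, and padding by `(kernel_size - 1) // 2` frames on each side is an assumption chosen so that a "valid" convolution keeps the block length.

```python
def blockwise_attention_mask(num_frames, block_len):
    """mask[q][k] is True when query frame q may attend to key frame k.

    A query in block b sees every frame of block b and of block b - 1,
    and nothing from later blocks.
    """
    mask = []
    for q in range(num_frames):
        q_block = q // block_len
        row = []
        for k in range(num_frames):
            k_block = k // block_len
            row.append(k_block in (q_block - 1, q_block))
        mask.append(row)
    return mask


def block_pad(prev_block, cur_block, kernel_size):
    """Pad a block for a depth-wise convolution of odd `kernel_size`.

    Left padding is taken from the tail of the previous block (zeros for
    the first block); right padding is zeros, so no future frame is used.
    Assumes each block holds at least (kernel_size - 1) // 2 frames.
    """
    half = (kernel_size - 1) // 2
    if half == 0:
        return list(cur_block)
    if prev_block is None:
        left = [0.0] * half
    else:
        left = list(prev_block[-half:])
    return left + list(cur_block) + [0.0] * half


for row in blockwise_attention_mask(num_frames=6, block_len=2):
    print("".join("x" if allowed else "." for allowed in row))
print(block_pad([1, 2, 3, 4], [5, 6, 7, 8], kernel_size=3))
```

The printed mask is block lower-bidiagonal: the look-ahead of any frame is bounded by the end of its own block, which is what makes the encoder streamable.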


3.2 Blockwise Mask-CTC

As shown in Figure 1, the proposed E2E streaming NAR speech recognition system consists of a blockwise-attention based encoder, a CTC module, a dynamic mapping process, and an MLM decoder. During training, we use the full audio as input for convenience. However, the CTC output within a block depends only on the input in the current and previous blocks, since the computation of the encoder is blockwise. In this paper, following the NAR manner, we apply greedy decoding for CTC, which selects the token with the highest probability at each time step. The output of each block from greedy CTC decoding is:

$$\hat{y}_{b,t} = \arg\max_{y} P_{ctc}(y \mid t, \mathrm{BAEncoder}(X_{b-1}, X_b)) \tag{6}$$

where $\hat{Y}_b = (\hat{y}_{b,t})$, $\hat{y}_{b,t}$ denotes the output at the $t$-th time step belonging to the $b$-th block, $t \in \{1, \ldots, L_{block}\}$. BAEncoder refers to the blockwise-attention encoder described in Sec. 3.1; for the first block, $X_{b-1}$ is set to be a zero matrix with the same size as $X_b$ during computation. At the same time, the ground-truth target tokens are randomly masked and re-predicted by the MLM decoder.
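The greedy CTC step can be sketched on one block of frame-level posteriors. The blank id and the rule for token-level confidence (the maximum frame probability within a run of repeated tokens) are assumptions of this sketch; the text above only specifies that the highest-probability token is chosen at each time step.

```python
BLANK = 0  # assumed id of the CTC blank symbol


def ctc_greedy(frame_probs):
    """Pick the most probable token at each frame and keep its probability."""
    tokens, confidences = [], []
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda v: probs[v])
        tokens.append(best)
        confidences.append(probs[best])
    return tokens, confidences


def normalize_with_confidence(tokens, confidences):
    """Merge consecutive repeats and drop blanks.

    Each surviving token keeps the highest frame probability of its run,
    which can later be compared against a masking threshold.
    """
    merged = []
    prev = None
    for tok, conf in zip(tokens, confidences):
        if tok == prev:
            if tok != BLANK:
                last_tok, last_conf = merged[-1]
                merged[-1] = (last_tok, max(last_conf, conf))
        elif tok != BLANK:
            merged.append((tok, conf))
        prev = tok
    return merged


block = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
frame_tokens, frame_conf = ctc_greedy(block)
print(frame_tokens)
print(normalize_with_confidence(frame_tokens, frame_conf))
```

Since every frame is decoded with a single `max`, the cost of this step is linear in the block length and independent of the output length.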

During inference, the input is segmented online into fixed-length blocks with 50% overlap and fed into the encoder in a streaming way. The encoder forward pass and the CTC decoding proceed in the same way as in training. In Figure 1, the dotted and solid lines denote the CTC outputs of two consecutive, overlapping input blocks. A dynamic mapping trick is applied to the overlapping tokens of the two CTC outputs, $\hat{Y}_{b-1}$ and $\hat{Y}_b$, to produce a coherent output. More details are given in Sec. 3.3. Then the tokens with low confidence scores in the CTC decoding output, $\hat{Y}_{mask}$, are masked and re-predicted by the MLM decoder. The prediction is done in several iterations. In each iteration, tokens with higher prediction probability are filled into masked positions, and the refilled sequence is used as input for the next iteration.


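$$\hat{Y}^{(k)}_{mask} = \arg\max_{Y_{mask}} P_{mlm}(Y_{mask} \mid \hat{Y}^{(k-1)}, \mathrm{BAEncoder}(X)) \tag{7}$$

where $\hat{Y}^{(0)}$ denotes the masked token sequence from the CTC output, $k$ denotes the iteration number of re-prediction, and $\hat{Y}^{(k)}_{mask}$ denotes the re-predicted tokens in the $k$-th iteration, which are then infilled into the predicted sequence at the corresponding masked positions to form the prediction $\hat{Y}^{(k)}$. $\hat{Y}^{(K)}$ is the final output, where $K$ is the total number of iterations.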
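The masking and iterative refilling can be sketched with a stand-in decoder. Here `predict_fn` is a hypothetical callable playing the role of the MLM decoder, the confidence `threshold` is an assumed hyper-parameter, and splitting the remaining masks evenly over the remaining iterations is one possible schedule, not necessarily the one used in the experiments.

```python
import math

MASK = "<MASK>"


def mask_low_confidence(tokens, confidences, threshold):
    """Replace tokens whose CTC confidence is below `threshold` with MASK."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


def mask_predict(masked, predict_fn, num_iterations):
    """Fill masked positions over at most `num_iterations` decoder passes.

    predict_fn(sequence) must return one (token, probability) pair per
    position. In every pass the most confident predictions are filled in;
    the last pass fills whatever is still masked.
    """
    seq = list(masked)
    for k in range(1, num_iterations + 1):
        masked_pos = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked_pos:
            break
        predictions = predict_fn(seq)
        if k == num_iterations:
            fill = masked_pos
        else:
            per_iter = math.ceil(len(masked_pos) / (num_iterations - k + 1))
            ranked = sorted(masked_pos,
                            key=lambda i: predictions[i][1], reverse=True)
            fill = ranked[:per_iter]
        for i in fill:
            seq[i] = predictions[i][0]
    return seq


def toy_decoder(seq):
    """Stand-in for the MLM decoder: always proposes the same tokens."""
    return [("a", 0.9), ("b", 0.5), ("c", 0.8), ("d", 0.7)]


masked = mask_low_confidence(["a", "x", "c", "y"], [0.99, 0.2, 0.3, 0.1], 0.9)
print(masked)
print(mask_predict(masked, toy_decoder, num_iterations=2))
```

The number of decoder passes is fixed by `num_iterations` rather than by the output length, which is where the speed advantage over AR decoding comes from.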

| Model | Encoder type | Decode mode | WER on dev | WER on test | Latency (ms) | RTF |
|---|---|---|---|---|---|---|
| Attention-based AR | Transformer | full-context | 11.7 | 9.9 | 5220 | 0.46 |
| | + beamsize=10 | full-context | 10.9 | 9.1 | 34160 | 3.01 |
| | Conformer | full-context | 11.0 | 8.4 | 6130 | 0.54 |
| | + beamsize=10 | full-context | 11.1 | 8.1 | 37780 | 3.33 |
| Mask-CTC | Transformer | full-context | 11.0 | 10.7 | 790 | 0.07 |
| | Conformer | full-context | 10.0 | 8.8 | 1070 | 0.09 |
| | Transformer | streaming | 18.2 | 16.4 | 300 | 0.20 |
| | Conformer | streaming | 23.5 | 21.3 | 310 | 0.26 |
| Streaming NAR (Proposed) | Transformer | full-context | 12.2 | 11.2 | 910 | 0.08 |
| | Conformer | full-context | 10.4 | 9.4 | 1030 | 0.09 |
| | Transformer | streaming | 14.2 | 14.0 | 120 | 0.22 |
| | Conformer | streaming | 12.1 | 11.7 | 140 | 0.32 |

Table 1: TEDLIUM2: WERs on dev/test, averaged latency, and RTF. 640 ms input segments are used for all streaming decode modes. The attention-based AR and Mask-CTC models are trained with conventional attention, while the proposed streaming NAR uses blockwise attention with a 640 ms block length.

3.3 Dynamic mapping for overlapping inference

Although splitting the input audio into small fixed-length blocks is a straightforward approach to streaming ASR, it causes severe performance degradation at the block boundaries. A segment boundary may fall in the middle of a token, so that the token is recognized twice or not at all across two consecutive blocks. Segmentation based on voice activity detection (VAD) is one way to address the issue, but it is sensitive to the threshold and may lead to large latency.
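The fixed-length segmentation with 50% overlap used at inference (Sec. 3.2) can be sketched as a generator over frames. The function name and the handling of a shorter final block are assumptions of this sketch.

```python
def overlapping_blocks(frames, block_len):
    """Yield fixed-length blocks whose start advances by half a block.

    Interior frames are therefore covered by two consecutive blocks.
    The final block may be shorter than `block_len`.
    """
    if block_len < 2 or block_len % 2:
        raise ValueError("block_len must be an even number >= 2")
    hop = block_len // 2
    start = 0
    while start < len(frames):
        yield frames[start:start + block_len]
        if start + block_len >= len(frames):
            break
        start += hop


for block in overlapping_blocks(list(range(10)), block_len=4):
    print(block)
```

With this layout, the second half of one block and the first half of the next cover the same audio, which gives two hypotheses for every token near a boundary.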

In this paper, we apply overlapping inference with the dynamic mapping trick of [chiu2019comparison] to recover the erroneous output at the boundary, as shown in Algorithm 1. During inference, we use 50% overlap when segmenting the input audio, which ensures that any frame of the input audio is predicted twice by the encoder and CTC. By locating the token index in each CTC output that is closest to the center point of its block, we dynamically search for the best alignment between the overlapping tokens of two consecutive blocks ($T_{b-1}$ and $T_b$). Normalize refers to removing repeated and blank tokens from the CTC output. Following the scoring function $\mathrm{Score}(t_{b,i}) = -|i - L_{block}/2|$, we select one token from each token pair on the best alignment path as the output of the overlapped segment. Here $t_{b,i}$ refers to the $i$-th token in the $b$-th block, and $L_{block}$ is the block length.

Algorithm 1 CTC overlap decode and dynamic map

    Y = []
    X_iter = audio blocks iterator with overlap
    for b = 0 to B do
        Ŷ_b = CTC_Predict(BAEncoder(X_b))
        if b > 0 then
            T_{b-1}, T_b = remove repeated tokens in Ŷ_{b-1}, Ŷ_b
            idx_{b-1}, idx_b = token index in T_{b-1}, T_b that is closest to the block center
            A = Alignment(T_{b-1}[idx_{b-1}:], T_b[:idx_b])
            for (t_{b-1,i}, t_{b,j}) in A do
                token = t_{b-1,i} if Score(t_{b-1,i}) ≥ Score(t_{b,j}) else t_{b,j}
                append token to Y_b
            end for
            Y_b = Normalize(Y_b)
        else
            Y_b = Normalize(Ŷ_b)
        end if
        append Y_b to Y
    end for
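The selection step of Algorithm 1 can be sketched for a single overlapped region. This is a simplified illustration, not the authors' implementation: `difflib.SequenceMatcher` is used as a stand-in for the alignment search, each token is assumed to carry its frame position inside its block, and unmatched tokens are kept only when they lie closer to their own block center than to the neighboring one (the `tie` value below).

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def score(position, block_len):
    """Higher is better: tokens near the block center are trusted more."""
    return -abs(position - block_len / 2)


def merge_overlap(prev_tail, cur_head, block_len):
    """Merge two hypotheses for the same overlapped audio.

    prev_tail: (token, position) pairs from the second half of block b-1.
    cur_head:  (token, position) pairs from the first half of block b.
    Positions are frame indices inside the respective block.
    """
    # At a distance of block_len / 4 from the center, a frame is equally
    # far from the centers of both blocks that cover it.
    tie = -block_len / 4
    prev_tokens = [tok for tok, _ in prev_tail]
    cur_tokens = [tok for tok, _ in cur_head]
    matcher = SequenceMatcher(a=prev_tokens, b=cur_tokens, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(prev_tokens[i1:i2])
            continue
        # Disagreement: compare tokens pairwise and keep the better-scored one.
        for left, right in zip_longest(prev_tail[i1:i2], cur_head[j1:j2]):
            left_score = tie if left is None else score(left[1], block_len)
            right_score = tie if right is None else score(right[1], block_len)
            winner = left if left_score >= right_score else right
            if winner is not None:
                merged.append(winner[0])
    return merged


# Block length 8: the previous block cut "cat" short at its right edge,
# the current block cut "the" short at its left edge.
prev_tail = [("the", 4), ("ca", 7)]
cur_head = [("he", 0), ("cat", 3)]
print(merge_overlap(prev_tail, cur_head, block_len=8))
```

In the example, each truncated token loses to the version of itself that sits nearer the center of the other block, so both boundary errors are repaired.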

4 Experiment

4.1 Experimental setup and Dataset

We evaluate our proposed model on both English and Chinese speech corpora: TEDLIUM2 [rousseau2014enhancing] and AISHELL1 [bu2017aishell]. For all experiments, the input features are 80-dimensional log-mel filter-banks with pitch, computed with a frame length of 25 ms and a frame shift of 10 ms. We use the Kaldi toolkit [Povey_ASRU2011] for feature extraction. We also apply speed perturbation (speed rate = 0.9, 1.0, 1.1) and spectrum augmentation [park2019specaugment] for data augmentation. Models are evaluated with both full-context decoding and streaming decoding.

All experiments are conducted using the open-source E2E speech processing toolkit ESPnet [watanabe2018espnet, karita2019comparative, guo2020recent]. The encoder first contains 2 CNN blocks to downsample the input to , followed by 12 MHSA-based layers. In streaming NAR, BA-based layers (Eq. 4, 5) are used in the encoder to replace MHSA. Decoders have a similar stacked-block structure with 6 layers and only full-attention Transformer blocks. For any self-attention block in this paper, we use parallel attention heads, with dimension . For the feed-forward layer, we use dimensionality and apply swish as the activation function. In self-attention, relative position embedding is added to the input [dai2019transformer]. We use as the kernel size for Conformer convolution. Models are trained for 200 epochs on TEDLIUM2 and for 150 epochs on AISHELL1. The evaluation is done on the model averaged over the best 10 checkpoints. We do not integrate a language model during decoding.
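Averaging a model over its best checkpoints amounts to an element-wise mean of the parameters. The sketch below uses plain dictionaries of flat float lists; the function names and the use of a validation score to rank checkpoints are assumptions for illustration, since the selection criterion is not stated above.

```python
def best_checkpoints(scored_checkpoints, k=10):
    """Return the k checkpoints with the highest validation score.

    scored_checkpoints: list of (score, parameters) pairs, where
    parameters maps a parameter name to a flat list of floats.
    """
    ranked = sorted(scored_checkpoints, key=lambda item: item[0], reverse=True)
    return [params for _, params in ranked[:k]]


def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


history = [
    (0.80, {"w": [1.0, 2.0]}),
    (0.90, {"w": [3.0, 4.0]}),
    (0.10, {"w": [100.0, 100.0]}),
]
print(average_checkpoints(best_checkpoints(history, k=2)))
```

Averaging only the top-ranked checkpoints keeps poorly performing snapshots, such as the third entry above, out of the final model.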

| Model | Encoder type | Decode mode | WER on dev | WER on test | Latency (ms) | RTF |
|---|---|---|---|---|---|---|
| Attention-based AR | Transformer | full-context | 6.7 | 7.6 | 2040 | 0.45 |
| | + beamsize=10 | full-context | 6.6 | 7.4 | 11640 | 2.56 |
| Mask-CTC | Transformer | full-context | 6.9 | 7.8 | 220 | 0.05 |
| | Transformer | streaming | 9.0 | 10.4 | 280 | 0.18 |
| Streaming NAR (Proposed) | Transformer | full-context | 8.6 | 9.9 | 230 | 0.05 |
| | Transformer | streaming | 8.6 | 9.9 | 320 | 0.20 |

Table 2: AISHELL1: WERs, averaged latency, and RTF. 1280 ms input segments are used for all streaming inference. The attention-based AR and Mask-CTC models work with conventional attention, while the proposed streaming NAR uses blockwise attention with a 1280 ms block length.

4.2 Results

Table 1 shows the TEDLIUM2 results, including word error rates (WERs) on the dev/test sets, averaged latency, and real-time factor (RTF). Latency and RTF were measured on a CPU platform (Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.3GHz) with 8 parallel jobs. We set the block length to correspond to 640 ms of audio before subsampling. The full-context evaluation is done at the utterance level, while the streaming evaluation is done on the unsegmented streaming audio for each speaker. Since it is hard to define latency for long unsegmented audio, we calculated the averaged latency per utterance on the dev set:

$$\mathrm{Latency} = \frac{1}{N} \sum_{i=1}^{N} \left( t^{last}_{i} - t^{end}_{i} \right) \tag{8}$$

where $N$ is the number of utterances, $t^{last}_{i}$ is the last token emitted time, i.e. the time stamp at which the model predicts the last token of utterance $i$, and $t^{end}_{i}$ is the end of speech, determined by forced alignment with an external model. This latency accounts for both look-ahead latency and computation time. For comparison, we also report the latency of full-context decoding, measured as the average decoding time per utterance, since in the full-context case decoding cannot start before the entire audio is fed.
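The two timing metrics reported in the tables can be computed as follows. This is a minimal sketch with made-up time stamps; the RTF definition used here (processing time divided by audio duration) is the standard one and is assumed to match the measurement above.

```python
def average_latency_ms(last_token_times_ms, end_of_speech_times_ms):
    """Mean of (last token emitted time - end of speech time) over utterances."""
    if len(last_token_times_ms) != len(end_of_speech_times_ms):
        raise ValueError("need one time stamp pair per utterance")
    if not last_token_times_ms:
        raise ValueError("need at least one utterance")
    gaps = [emitted - ended
            for emitted, ended in zip(last_token_times_ms, end_of_speech_times_ms)]
    return sum(gaps) / len(gaps)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Hypothetical time stamps for two utterances (not taken from the experiments).
print(average_latency_ms([1100, 2300], [1000, 2100]))
print(real_time_factor(processing_seconds=2.0, audio_seconds=10.0))
```

Latency measures how long a user waits after finishing speaking, while RTF measures total compute, so a configuration can improve one and worsen the other.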

| Encoder | Dev | Test | BL (ms) | Latency (ms) | RTF |
|---|---|---|---|---|---|
| BA-TF | 12.8 | 12.0 | 5120 | 1530 | 0.17 |
| | 12.8 | 11.9 | 2560 | 700 | 0.18 |
| | 13.0 | 12.2 | 1280 | 290 | 0.18 |
| | 14.2 | 14.0 | 640 | 120 | 0.22 |
| BA-CF | 10.5 | 10.3 | 5120 | 1700 | 0.29 |
| | 10.6 | 10.4 | 2560 | 790 | 0.29 |
| | 11.2 | 10.7 | 1280 | 380 | 0.30 |
| | 12.1 | 11.7 | 640 | 140 | 0.32 |

Table 3: WERs, latency, and RTF on TEDLIUM2 for the proposed streaming NAR model with different block lengths (BL). All experiments are conducted in streaming mode.

Compared to the vanilla Mask-CTC, the proposed streaming NAR model performs much better in streaming mode. With the Conformer encoder, the dev WER of Mask-CTC is 23.5 with 310 ms averaged latency, while that of streaming NAR is 12.1 with 140 ms averaged latency. In terms of RTF, the proposed model is about 2x/10x faster than the AR attention-based model with beam size 1/10, respectively. Mask-CTC models are trained with full-context input and work better with full future information. Their significant WER degradation in streaming decoding mode is due to the input mismatch between training and decoding. The proposed streaming NAR, on the other hand, can recognize the current frame with only a small future context and thus fits streaming inference well. However, the WERs of the proposed model still increase from full-context to streaming mode (from 10.4/9.4 to 12.1/11.7). The likely reason is the errors arising at the segment boundaries. From our observation, these errors can be relieved but not completely removed by dynamic mapping, and they still exist even with large block lengths.
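The speed and accuracy comparisons in this paragraph can be checked directly against the Table 1 entries. The sketch below only redoes that arithmetic for the streaming rows; the helper names are introduced here for illustration.

```python
def speedup(baseline_rtf, proposed_rtf):
    """How many times faster the proposed system is, in terms of RTF."""
    return baseline_rtf / proposed_rtf


def relative_reduction(baseline_wer, proposed_wer):
    """Relative WER reduction of the proposed system over the baseline."""
    return (baseline_wer - proposed_wer) / baseline_wer


# RTF values from Table 1 (TEDLIUM2); the proposed model is in streaming mode.
print("Transformer, beam 1: ", round(speedup(0.46, 0.22), 1))
print("Transformer, beam 10:", round(speedup(3.01, 0.22), 1))
print("Conformer, beam 1:   ", round(speedup(0.54, 0.32), 1))
print("Conformer, beam 10:  ", round(speedup(3.33, 0.32), 1))

# Dev WER of streaming Mask-CTC vs. streaming NAR with the Conformer encoder.
print("Relative WER reduction:", round(100 * relative_reduction(23.5, 12.1), 1), "%")
```

The ratios range from roughly 1.7x to 2.1x against greedy AR decoding and from roughly 10x to 14x against beam size 10, consistent with the "about 2x/10x" summary.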

Table 2 shows the results on AISHELL1. Since the hyper-parameters for Conformer and Mask-CTC need careful tuning and the model overfitted easily in our preliminary experiments, we only report Transformer results on the AISHELL1 task. The vanilla Mask-CTC works better when full context is provided during decoding, while streaming NAR is better in streaming decoding, but the benefits are smaller than those on TEDLIUM2. A possible reason is that the training utterances in AISHELL1 are much shorter than those in TEDLIUM2, so a full-context attention model can also learn to recognize with only a short future context.

To understand the performance of streaming NAR under different latencies, Table 3 compares the WERs with different block lengths for the blockwise-attention Transformer (BA-TF) and the blockwise-attention Conformer (BA-CF) on TEDLIUM2. We observe that as the block length gets shorter, the latency becomes smaller while the RTF rises, since shorter blocks require more iterations to process the whole input audio. Moreover, when the block length decreases from 5120 ms to 1280 ms, the WER barely changes. This indicates that in streaming ASR, near future context matters much more than distant future context.

5 Conclusion

In this paper, we proposed a novel E2E streaming NAR speech recognition system. Specifically, we combined a blockwise-attention based encoder with Mask-CTC. In addition, we applied dynamic overlapping inference to mitigate the errors at the block boundaries. Compared to vanilla Mask-CTC, the proposed streaming NAR model achieves competitive performance in full-context decoding and outperforms vanilla Mask-CTC in streaming decoding with very low utterance latency. Moreover, the decoding speed of the proposed model is about 2x/10x faster than the AR attention-based model with beam size 1/10, respectively. In future work, we plan to develop a better boundary localization method to replace the overlapping inference, and to integrate an external language model during decoding.

6 Acknowledgements

We would like to thank Mr. Yusuke Higuchi of Waseda University for his valuable information about Mask-CTC.