Log In Sign Up

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

by   Keyu An, et al.
Tsinghua University

History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.


page 1

page 2

page 3

page 4


Universal ASR: Unify and Improve Streaming ASR with Full-context Modeling

Streaming automatic speech recognition (ASR) aims to emit each hypothesi...

Multi-mode Transformer Transducer with Stochastic Future Context

Automatic speech recognition (ASR) models make fewer errors when more su...

Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context

The use of future contextual information is typically shown to be helpfu...

Learning a Dual-Mode Speech Recognition Model via Self-Pruning

There is growing interest in unifying the streaming and full-context aut...

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the tri...

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

In this paper, we present TrimTail, a simple but effective emission regu...

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

There is often a trade-off between performance and latency in streaming ...

1 Introduction

Recently, self-attention based neural networks such as Transformer 

[self-attention] and Conformer [Conformer]

have shown excellent performance in automatic speech recognition (ASR) 

[Conformer, Transformer-Transducer, Speech-Transformer], particularly for being used as acoustic encoders to extract high-level representations from speech. Self-attention architectures capture temporal dependencies in a sequence by computing pairwise attention weights, and thus are superior in leveraging contextual information for acoustic encoders. However, the full-sequence attention mechanism involves sequence-level fully-connected computation, and thus is unsuitable for streaming ASR, where each word must be recognized shortly after it was spoken. For streaming ASR, chunk-based self-attention networks, which use chunk-level input to control the latency, are adopted in many previous works [SCAMA, SAA, Online-CTC/Attention, CBP, CAT, Streaming-Transformer] and show better ASR performance when compared to other streaming schemes such as causal self-attention [MERL-DCN].

In chunk-based latency controlled models, a certain number of left and right contextual frames are often spliced to each chunk, which is found to benefit the acoustic encoder and produce improved ASR performance [SAA, Online-CTC/Attention, CSC, CAT]. For example, Dong  [SAA] reported a 10%15% relative character error rate reduction (CERR) on HKUST dataset when splicing left and right contextual frames to each chunk, and similar results are reported in other end-to-end models. However, the use of right contextual frames (i.e., look-ahead frames) needs to wait until the arrival of the right contextual frames, and thus brings additional latency.

In this paper, we introduce a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming ASR, which is illustrated in Figure 1. Inspired from predictive coding [APC, CPC], a new simulation module, consisting of lightweight simulation encoder and predictor, is introduced to recursively simulate the future contextual frames, based only on history frames. Thus, the ASR model does not need to wait for future context. The CUSIDE framework can be easily applied to existing ASR models via multi-task training. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF [crf] as used in our experiments. The CUSIDE framework can be further enhanced with several recently developed techniques for streaming ASR, such as weight sharing and joint training of a streaming model with a full-context model [dual-mode-asr], chunk size jitter [CAT, U2++] and stochastic future context [MMTT] in training. Experiments demonstrate that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset [aishell].

The paper is organized as follows. Section 2 outlines related work. Section 3 details the components of CUSIDE. Experiments are shown in Section 4. Section 5 gives the conclusion. The code will be released upon the acceptance of the paper.

Figure 1: The CUSIDE framework. We split the utterance into non-overlapping chunks, and iteratively run simulation encoder and predictor to simulate a certain number of future frames. Current chunk is spliced with some history frames and the simulated future frames, and they are together fed to the acoustic encoder to produce the prediction for the current chunk. The simulation encoder, predictor, and acoustic encoder are implemented in our experiments as, but generally not limited to, GRU, feed-forward (FF) layer, and Conformer respectively.

2 Related work

2.1 Low latency ASR

Building ASR models with low latency has been studied in conventional DNN-HMM hybrid systems [CSC]. Recently, three classes of end-to-end ASR models, namely CTC [ctc], RNN-T [rnn-t] and attention-based encoder-decoder (AED) [LAS], have demonstrated excellent performance, which raises the demand to build streaming end-to-end ASR. For CTC and RNN-T, both can use frame-synchronous decoding which involves no latency, thus the challenge for streaming ASR is to control the latency of the underlying acoustic encoder while maintaining recognition accuracy being close to a full-context model. Streaming acoustic encoders can be built with uni-directional recurrent networks and causal self-attention networks, which do not use any future context, but with accuracy sacrifice. Chunk-based acoustic encoders are attractive and adopted in many previous works, where bi-directional recurrent networks or fully-connected self-attention networks can be used within a chunk [SAA, CAT, MERL-DCN], realizing full-context utilization in a chunk. Truncated self-attention and time restricted self-attention [Transformer-Transducer, resctricted-sa] are proposed to limit the contexts available for self-attention. For AED, the previously mentioned methods for low-latency acoustic encoders can be applied as well, but an additional challenge is that the decoding in AED is label-synchronous, which needs to attend to the entire sequence of representations extracted by the acoustic encoder. To address this, many efforts have been made to convert full sequence soft attention into local attention [SCAMA, mocha, cif].

2.2 Unifying streaming and non-streaming ASR

Recently, there has been a growing interest in unifying streaming and non-streaming ASR, with initial motivation to simplify the workflow of training and deploying streaming and full-context ASR. Interestingly, dual-mode ASR [dual-mode-asr], based on RNN-T, finds that weight sharing and joint training of a streaming model with a full-context model can benefit the low-latency and accuracy of streaming ASR. U2 [U2++], based on hybrid CTC-AED, adopts dynamic chunk sizes in training, which are uniformly drawn from 1 to some maximum size. In this manner, different AED models, varying from causal attention to full attention, are jointly trained with shared weights. Multi-mode Transformer Transducer [MMTT] proposes to consider stochastic future contexts during training, so that the trained model is capable to serve in various latency budget scenarios without significant accuracy deterioration.

Additionally, different two-pass decoding strategies have been used to unify streaming and non-streaming ASR, such as in Universal ASR [universal], U2++ [U2++] and RNN-T with Cascaded encoders  [cascaded]. The main idea is that during inference, a streaming model is used first to generate intermediate results (features or hypotheses), which are then further processed by a non-streaming model. The encoders can be shared [universal, U2++] or the decoders are shared [cascaded]. In all the unified models mentioned above, the non-streaming model introduces extra parameters on top of the streaming model.

2.3 Predictive coding

The idea of predicting future frames has been explored in recent self-supervised speech representation learning such as autoregressive predictive coding (APC) [APC] and contrastive predictive coding (CPC) [CPC]. Beyond predicting the next few frames by exploiting the local smoothness of speech signal, it is shown in [APC, CPC] that it is possible to learn more global structures in speech by neural networks so that the prediction can be as far as 200ms. This success of predictive coding motivates the introduction of the simulation module in CUSIDE to simulate future context. The difference is that these predictive coding methods adopt much larger encoders, which are trained on large amounts of unlabeled speech and used as speech representation learners for various downstream tasks such as speaker identification and speech translation. In CUSIDE, the purpose of predicting future is to provide context information for acoustic encoders for streaming ASR.

Figure 2: The dependency map of different types of chunking used in inference. (a) Chunking w/o context. (b) Context sensitive chunking with left and right contexts, which are colored with yellow and red, respectively. (c) The proposed chunking with simulated future context. Colored [i, j] denotes whether output[i] depends on input[j]. White means there is no dependency. Dashed means that for output[i], input[j] is simulated. Here the sizes of left and right context are plotted to be equal to the sizes of the chunk, which is not necessary in practice.

3 Method

The CUSIDE framework consists of context sensitive chunking, simulating future context and decoding. It can be easily applied to existing ASR models via multi-task training as shown below. Different neural network architectures (e.g., Bi-LSTM, Transformer and Conformer) can be used as acoustic encoders, and we can use different decoding methods, such as WFST based decoding for CTC or CTC-CRF [crf], beam-search for RNN-T or AED [SCAMA, mocha]. To be specific, we adopt state-of-the-art Conformer encoder [Conformer] for acoustic modeling and WFST based decoding for CTC-CRF in our experiments.

3.1 Context sensitive chunking

Full-sequence self-attention based encoders such as Transformer and Conformer require the entire speech utterance as input, thus unsuitable for streaming ASR. To realize streaming ASR, we adopt context sensitive chunking (CSC) [CSC, CAT] for the Conformer encoder, as illustrated in Figure 2(b). Specifically, the entire utterance is firstly split into non-overlapping chunks. For each chunk, a certain number of frames to the left and right of the chunk are spliced as contextual frames. The spliced frames are collectively called a context sensitive chunk, which together is fed into the Conformer encoder. Note that the output from contextual frames is discarded at the output layer of the Conformer. Though with some extra computation, CSC is attractive and has been shown with promising results for streaming ASR [CSC, CAT, SAA, Online-CTC/Attention], due to its use of both past and future context. However, acquiring future context brings additional latency, as the model needs to wait until the right context frames come before it can produce current output during inference. To address this drawback, we propose CSC with simulated future context for low latency ASR.

3.2 Simulating future context

As illustrated in Figure 1, in CUSIDE, the new arriving chunk is firstly encoded by a simulation encoder, which is a uni-directional GRU whose hidden state is initialized by the last hidden state from the previous chunk. The GRU hidden state at the right boundary of the current chunk is then fed into a simulation predictor, which is a simple feed-forward (FF) layer, to simulate the right context frames , i.e., the log Mel-spectrograms:


where , is the number the simulated right context frames, and is the dimension of the log Mel-spectrograms. is of shape (, ), and is of shape (). In this paper, to reduce the overhead of parameters and computation due to the introduction of simulating future context, lightweight GRU and feed-forward layer are used as simulation encoder and predictor, respectively.

In both training and recognition, the simulated future context is spliced to the chunk with the left context, as shown in Figure 2(c), which together are fed to the acoustic encoder as in usual CSC. During training, the L1 loss between the simulated future frames and the real future frames is added to the ASR loss with a scaling factor . Thus, the simulation module is trained by the usual (supervised) ASR loss and the (self-supervised) simulation loss.

3.3 Weight sharing and joint training with full-context model

Inspired by dual-mode ASR [dual-mode-asr], we train a single ASR model with shared weights for both streaming and non-streaming ASR. Notably, the forward calculation in the streaming mode of CUSIDE is actually in full-context within each context sensitive chunk, thus can be performed using exactly the same parameters and operations as in non-streaming ASR. In contrast, the forward calculation in the streaming mode of dual-model ASR uses causal operators such as causal attention and convolution, which are different from non-causal operators used in the non-streaming mode. And the non-streaming model introduces extra parameters on top of the streaming model. Moreover, we adopt chunk size jitter for different batches in training, which can improve the generalization capability of the streaming ASR model as shown in [CAT, U2++]

. Specifically, the chunk size for every batch is sampled from a uniform distribution

(C-A, C+A), where C is the default chunk size and A is the dynamic range.

The whole training in CUSIDE as shown in Algorithm 1

is in multi-task training, which involves three tasks - simulation future context, streaming and non-streaming ASR. We simulate future context for all chunks at the same time, and use the simulated frames and the ground-truth future frames (obtained by shifting the real frames backward) to calculate the simulation loss. For the streaming ASR loss, after chunk-wise forward calculation, we reshape the chunk-wise posterior predictions to construct the utterance-level posterior predictions, which are used to compute the streaming ASR loss. The non-streaming ASR loss is calculated with Conformer encoder as usual.

# Requires: dataloader, AcousticEncoder, SimuEncoder and SimuPredictor.
for X, y in dataloader do # Load the speech feature X and the label y.
      nonstreamingpred AcousticEncoder(X) # Perform full-context recognition.
      nonstreamingloss asrloss(nonstreamingpred, y) # Calculate utterance-level non-streaming loss.
      simulatedcontext = SimuPredictor(SimuEncoder(X)) # Simulate the future context.
      simuloss = L1Loss(simulatedcontext, Shift(X)) # Calculate simulation loss. Shift(X) outputs the ground-truth future context, by shifting X backward.
       = formatchunk(X, simulatedcontext) # Construct K context sensitive chunks with simulated future context. K =

. The chunk size can be randomly drawn. We use zero padding for the last chunk.

       = AcousticEncoder() # Perform batch forward for all chunks and discard output from contextual frames.
      streamingpred = formatutt() # Construct the utterance-level posterior prediction for the streaming model.
      streamingloss = asrloss(streamingpred, y) # Calculate utterance-level streaming loss.
      totalloss = nonstreamingloss + streamingloss + *simuloss # Compute total loss.
      totalloss.backward() # Update parameters.
end for
Algorithm 1 Pseudocode of model training in CUSIDE.

4 Experiments

4.1 Experiment settings

We use the CTC-CRF based ASR Toolkit - CAT [CAT] to conduct the experiments. CTC-CRF [crf] is a CRF with CTC topology, which eliminates the conditional independence assumption in CTC and performs significantly better than CTC. Experiments are conducted on Aishell-1 [aishell]

, which consists of a total of 178 hours of labeled speech. The official lexicon  

111 is used.

All experiments use 80-dim filterbank features with 3-dim pitch features, which are extracted with a 25ms sliding window at a 10ms frame rate. We apply online mean and variance normalization to the input feature. 3-fold speed perturbation and SpecAugment 

[specaug] are used for data augmentation. Unless otherwise specified, the acoustic model is a 12-layer Conformer, and the number of attention heads, attention dimension, and feed-forward dimension are 4, 256, and 2048 respectively. For simulating the future context, we use a 3-layer GRU with 256 hidden units, and 1 feed forward layer. The GRU and feed forward layer have about 5% of the total model size (M). Thus, the parameter overhead is negligible.

In training, we use the Adam optimizer and follow the Transformer learning rate scheduler [self-attention]

. The learning rate will decay with a factor of 0.1 if the loss does not decrease on the validation set. We stop training when the learning rate is less than a given threshold. The final model averages the top 5 checkpoints with the best validation loss on the development set. Gradient clipping is applied to stabilize the training. We further adopt a curriculum learning strategy in the first several training epochs, i.e., we sort the input utterances from short to long, as model training tends to be easier over short sequences. The scaling factor

for the simulation loss is set to 100. In decoding, a word-level 3-gram language model (LM) trained on the transcripts is used to build the decoding WFST.

4.2 Experiment results

The Character Error Rates (CERs) and average time costs per chunk when using different contextual information are shown in Table  1. From the table we observe that: 1) For the streaming ASR model, the lack of right context leads to a significant degradation in recognition accuracy (Exp 1 vs Exp 2, and Exp 6 vs Exp 7), and the accuracy degradation can not be compensated with more left context (Exp 1 vs Exp 7); 2) Using stochastic future context at training yields better performance at inference for the models without right context (Exp 4 and Exp 9) and with simulated right context (Exp 5 and Exp 10). 3) With stochastic future context, CUSIDE (Exp 5 and Exp 10) outperforms the model without right context (Exp 4 and Exp 9) consistently, with only a small overhead over parameter and time cost2222ms per chunk in our experiments when testing on a GeForce GTX 1080 GPU., especially when the CSC uses fewer left frames as history information. Notably, CUSIDE obtains equally good, or even better results compared with the model using real right context.

left ctx
right ctx at
training (ms)
right ctx at
inference (ms)
1 400 400 400 6.09
2 400 0 0 7.27
3 400 [400] [400] 7.28
4 400 400 or 0 0 7.15
5 400 400 or 0 or [400] [400] 6.14
6 800 400 400 6.05
7 800 0 0 6.82
8 800 [400] [400] 6.73
9 800 400 or 0 0 5.94
10 800 400 or 0 or [400] [400] 5.83
Table 1: CER and average time cost per chunk on AISHELL-1 test set. For the right context, denotes that these frames are simulated. For all experiments, the chunk size is 400ms by default. ctx is shorthand for context.

The results of CUSIDE with chunk size jitter, weight sharing and joint training with a full-context model are shown in Table 2. During training, the chunk size is randomly sampled from (200ms, 600ms) with C=400ms and A=200ms. The results demonstrate the effectiveness of these two training strategies. Notably, the performance of our streaming model is very close to the full-context model (5.47 vs 5.28). It is found that weight sharing and joint training also improve the non-streaming model (results in the parentheses). Presumably, this is because that the full-context model tends to overfit the training data, which could be mitigated by joint training with a chunk based model.

model configuration CER
Exp 5 in Table 1 6.14
 + chunk size jitter in training 5.96
  + weight sharing & joint training 5.83 (4.98)
Exp 10 in Table 1 5.83
 + chunk size jitter in training 5.72
  + weight sharing & joint training 5.47 (5.01)
full-context model 5.28
Table 2: Effect of chunk size jitter, weight sharing and joint training with a full-context model. The numbers in the parentheses are the full-context recognition results of the unified model.

We further compare the accuracy and latency of CUSIDE with other models from literature. Notably, some existing streaming models use two-pass decoding. For comparison, a 3-layer Transformer LM on the training transcripts in AISHELL-1 is used for rescoring. The results are shown in Table  3. CUSIDE obtains state-of-the-art results on AISHELL-1 with minimal overall latency.

model latency (ms) CER
SCAMA [SCAMA] 600 7.39
MMA narrow chunk  [MMA] 960 7.5
MMA wide chunk  [MMA] 1920 6.6
HS-DACS Transformer [HS-DACS] 1280 6.8
U2++ [U2++] 640 5.81
U2++ w/ rescoring [U2++] 640 + 5.05
WNARS w/ rescoring [WNARS] 640 + 5.22
CUSIDE 400 + 2 5.47
 + NNLM rescoring 400 + 2 + 4.79
Table 3: Comparison with other streaming models from literature on AISHELL-1 test set. Following [U2++], the latency is defined as the chunk size plus the right context (if any). is the additional latency introduced by rescoring the first-pass hypotheses, which is typically less than 100ms for a utterance.
Figure 3: The real spectrogram (left, with mean and variance normalized) and the simulated spectrogram (right) for the utterance (BAC009S0764W0121) in the AISHELL-1 test set.

4.3 Discussion

To observe the quality of the simulated frames, we visualize the simulated log Mel-spectrogram in Fig 3. It can be seen that the simulated result roughly follows the spectrogram patterns, e.g., the different magnitudes of the speech and non-speech segments, but may not be able to capture finer details, e.g., the formant peaks in middle and high frequencies. As we only use limited data to train the simulation module, further improving the simulation module such as self-supervised pretraining on larger datasets would be an interesting extension of our work.

5 Conclusions

In this paper, we introduce the CUSIDE framework, which uses simulated future context to improve the recognition accuracy of chunk-based streaming ASR. With CUSIDE, we achieve new state-of-the-art streaming ASR results on AISHELL-1 in terms of accuracy and latency. While we use Conformer based CTC-CRF as our ASR model, our approach can be applied to other end-to-end ASR models such as RNN-T and attention based encoder-decoder, which will be interesting future work.