1 Introduction
End-to-end (E2E) models [1, 2, 3, 4, 5, 6] have gained considerable popularity for automatic speech recognition (ASR) over the last few years. These models, which combine the acoustic, pronunciation and language models into a single network, have shown competitive results compared to conventional ASR systems.
Recently, there has been considerable interest in training E2E models for streaming ASR [7, 8, 9, 10, 11]. It has been shown that access to future context when encoding the current frame significantly improves speech recognition accuracy. Bidirectional encoders take advantage of future context; however, inference can only be performed once the full input sequence is available. Streaming recognition can instead be achieved by specifying a limited right context, at the cost of some accuracy degradation.
Streaming and non-streaming ASR models are usually trained separately. In the streaming condition in particular, several models are trained to reach a suitable trade-off between latency and accuracy. In this work, we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Moreover, in streaming mode, we achieve dynamic latency ASR with only one model. Specifically, the non-streaming model can be used with an arbitrary right context during inference. At inference time, chunk-based incremental decoding is applied to the non-streaming model, whose history encoder and decoder states can be asynchronously revised to achieve dynamic latency ASR. When performing asynchronous revision during inference, the encoder network may receive incomplete right context. To alleviate this mismatch between training and inference, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. We show that the dynamic latency model gives 8%-14% relative improvements over streaming models with the same latency.
2 Related Work
There has been growing interest in building streaming ASR systems based on RNN-T [1, 2]. Compared with attention-based encoder-decoder (AED) [3] models, RNN-T models are naturally streamable and have shown great potential for low-latency streaming ASR. In this work, we mainly focus on RNN-T based models.
To improve the latency of AED models, partial hypothesis selection is proposed in [12]. Both that work and ours use chunk-based incremental decoding during inference; the main difference is that we achieve truly streaming ASR based on RNN-T, while they use a global attention mechanism. Similar work is explored in simultaneous (speech) translation [13, 14]. Variable context training is used in the Y-model [15] architecture. At inference time, the context length for the variable context layers can be changed. However, the latency of the Y-model is predefined in the training stage, so only a limited set of latency configurations can be used at inference. Universal ASR [16] is a unified framework to train a single E2E ASR model for both streaming and non-streaming speech recognition. Nevertheless, only one streaming mode is available.
Unlike these approaches, our work not only explores the unification of streaming and non-streaming ASR models, but also achieves dynamic latency ASR with only one model. In other words, with the asynchronous revision decoding technique, the non-streaming model can be used with any right context during inference.
3 Dynamic Latency ASR
Modern E2E ASR systems have an encoder-decoder structure [1, 3]. The encoder network encodes Mel-spectrogram features into a hidden representation containing semantic information. Then, the decoder network makes predictions based on the current output of the encoder and the history prediction states. In this work, we focus on the most commonly used RNN-T model.
3.1 Frame Synchronous Decoding
It is clear that the decoder in RNN-T is an autoregressive model in both streaming and non-streaming ASR models. When performing streaming inference, the encoder needs its complete input context. Commonly, for incremental inference, we divide the input utterance into fixed-size chunks and decode every time a new chunk arrives. As shown in Figure 1, when performing incremental decoding, the encoder state is conditioned on the current chunk and the history states (memory). Meanwhile, the decoder state is based on the current encoder state and the previous decoder state.
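The chunk-based loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `split_into_chunks`, `incremental_decode`, and the two toy step functions are hypothetical names, and the "networks" are replaced by simple arithmetic so that only the flow of states is shown.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(frames: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Split an utterance into consecutive fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def incremental_decode(
    chunks: Sequence[Sequence[float]],
    encode_step: Callable[[Sequence[float], float], float],
    decode_step: Callable[[float, float], Tuple[float, float]],
    enc_init: float = 0.0,
    dec_init: float = 0.0,
) -> List[float]:
    """Decode chunk by chunk, carrying encoder memory and decoder state forward."""
    enc_state, dec_state = enc_init, dec_init
    outputs = []
    for chunk in chunks:
        # Encoder state depends on the current chunk and the history (memory).
        enc_state = encode_step(chunk, enc_state)
        # Decoder state depends on the current encoder state and the previous decoder state.
        dec_state, output = decode_step(enc_state, dec_state)
        outputs.append(output)
    return outputs


# Toy stand-ins for the real encoder and decoder (assumptions, purely illustrative).
def toy_encode_step(chunk: Sequence[float], memory: float) -> float:
    return memory + sum(chunk)


def toy_decode_step(enc_state: float, prev_dec_state: float) -> Tuple[float, float]:
    new_state = prev_dec_state + enc_state
    return new_state, new_state
```

In a real system the step functions would wrap the encoder and the RNN-T prediction and joint networks; the loop structure is the only part this sketch is meant to convey.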
3.2 Asynchronous Revision Decoding
For a non-streaming model, it is hard to perform incremental decoding, since the long right context leads to large latency. What if we limit the right context? Figure 2 depicts the limited context decoding process for a non-streaming model. Suppose the right context of the encoder network covers two chunks. Figure 2 (a) shows the initial states when decoding chunk C0. Since the right context of E0 is not complete, when chunk C1 arrives we need to revise the encoder state E0 along with the decoder state D0. As shown in Figure 2 (b), if we only revise the last decoder state, then D0 is finalized. After chunk C2 arrives, the encoder state E0 has its full input context, so we revise E0 and finalize it, as shown in part (c). The finalized states will not be changed when new chunks arrive, as shown in part (d). Temporary states will be revised in the next decoding step.
Since the numbers of revision steps for the encoder and decoder states may differ, we refer to this technique as asynchronous revision decoding. For a non-streaming model, more revisions mean more available right context, which gives better results. However, more revisions also lead to larger latency. Revision thus provides a trade-off between latency and accuracy. In other words, the non-streaming model can be used with any right context during inference, which achieves arbitrary latency. Thus, the latency of an ASR model can be arbitrarily changed after training, which we call dynamic latency.
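The bookkeeping in the walk-through above (two encoder revisions, one decoder revision) can be sketched as a schedule of which states are recomputed and which are finalized after each chunk arrives. This is a sketch under our reading of the example, not the authors' code: `revision_schedule` and its dictionary keys are hypothetical, and the end-of-utterance step, where all remaining states become final, is omitted.

```python
from typing import Dict, List


def revision_schedule(num_chunks: int, enc_rev: int, dec_rev: int) -> List[Dict[str, List[int]]]:
    """For each decoding step t (chunk C_t has just arrived), list the chunk
    indices whose encoder/decoder states are recomputed in that step and the
    indices whose states are finalized once the step is done.

    Assumption: a state for chunk i is final once `rev` later chunks have
    arrived, i.e. when t - i >= rev.
    """
    steps = []
    for t in range(num_chunks):
        steps.append({
            # The `rev` most recent history chunks are revised, plus the new chunk.
            "encoder_recomputed": list(range(max(0, t - enc_rev), t + 1)),
            "decoder_recomputed": list(range(max(0, t - dec_rev), t + 1)),
            # Everything up to and including index t - rev is now fixed.
            "encoder_finalized": list(range(0, max(0, t - enc_rev + 1))),
            "decoder_finalized": list(range(0, max(0, t - dec_rev + 1))),
        })
    return steps


if __name__ == "__main__":
    for t, step in enumerate(revision_schedule(4, enc_rev=2, dec_rev=1)):
        print(t, step)
```

With `enc_rev=2` and `dec_rev=1`, D0 is finalized when C1 arrives while E0 is finalized only when C2 arrives, which is the asynchrony between encoder and decoder that the technique is named for.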
3.3 Segment Cropping Training
When training a non-streaming ASR model, the full utterance is provided, so at each frame the encoder has a complete context. When performing asynchronous revision during inference, the finalized encoder states may have incomplete right context. As shown in Figure 2, suppose the right context covers five chunks; the finalized encoder state E0 only sees two chunks to the right, since the encoder is revised for two steps. Then, E0 provides partial memory to the next state E1, and so on. This leads to a mismatch between inference and training.
Prefix training is used to alleviate partial inputs in simultaneous (speech) translation [14, 13]. However, it does not work for RNN-T models because of their intrinsic monotonic alignment property. In our framework, revision is applied to each chunk during inference; if the right context of a finalized encoder state is not the same as in training, a mismatch occurs. We call such an incomplete right context state partial memory. To alleviate the partial memory problem in incremental decoding, we propose a training strategy, segment cropping, which randomly splits the input sequence into several segments with forward connections. As shown in Figure 3, every chunk has full right context in normal training. At inference, partial memory occurs in every chunk if there is no revision. With segment cropping, we can simulate partial memory in training.
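One possible reading of segment cropping is sketched below. It assumes that "forward connections" means left context (memory) still flows across segment boundaries while right context is cut at the end of each segment; the paper does not spell this out, so that interpretation, along with the names `random_segments` and `visible_right_context`, is an assumption.

```python
import random
from typing import List, Tuple


def random_segments(num_frames: int, num_segments: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Randomly split [0, num_frames) into contiguous, non-empty [start, end) segments."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Choose distinct interior cut points, then pair up consecutive boundaries.
    cuts = sorted(rng.sample(range(1, num_frames), num_segments - 1))
    bounds = [0] + cuts + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))


def visible_right_context(frame: int, right_context: int, segments: List[Tuple[int, int]]) -> int:
    """Number of future frames a frame can see when the right context is cropped
    at the end of its segment (assumed behaviour, see the note above)."""
    for start, end in segments:
        if start <= frame < end:
            return min(right_context, end - 1 - frame)
    raise IndexError("frame is outside all segments")
```

Frames near the end of a segment see less right context than the model nominally allows, which is how this sketch would expose the encoder to partial memory during training.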
4 Experiments
4.1 Data
For our experiments, we use a 10K-hour internal Mandarin speech dataset. The test set consists of 21k utterances, each less than 30 seconds long, from various applications. The training and test sets are all anonymized and hand-transcribed. The input speech waveforms are framed using a 25 ms window with a 10 ms shift. We use 80-dimensional filter bank features. We report model performance using Character Error Rate (CER).
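For reference, CER is commonly computed as the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below assumes corpus-level aggregation (total errors over total reference characters); the paper does not state its exact scoring script, and the function names are ours.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate in percent (aggregation is an assumption)."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_errors / total_chars
```

For Mandarin, each Chinese character counts as one unit, so a single wrong character in a five-character reference gives a CER of 20%.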
4.2 Model Structures and Hyperparameters
Without loss of generality, our encoder network is based on a temporal convolutional network, DFSMN [17], which can be extended to a self-attentional network such as Conformer [6]. Our decoder network consists of two LSTM layers with 1024 units. At the front of the encoder network, we use two convolutional subsampling layers, each with a stride of 2. For the non-streaming model, the encoder has 30 DFSMN layers with 2048 hidden units, and the input context for each convolution layer covers 20 left frames and 20 right frames along with the current frame. For streaming models, we use three different context configurations: one for the 0.4 second (0.4s for short) latency model, one for the 1.2s latency model and one for the 2.4s latency model. The only difference among those models is the right context configuration. For simplicity, we use greedy search to decode our models.
4.3 Asynchronous Revision on Non-Streaming Model
We perform asynchronous revision decoding on the baseline non-streaming model (CER 9.45%) with various numbers of revisions. At inference, we divide the input utterances into fixed-size chunks of 40 frames and decode every time a new chunk arrives. Hence, the number of revisions is the number of history chunks to be revised.
Figure 4 shows the relationship between the number of revisions and CER. With the same number of decoder revisions, more encoder revisions have little impact on model performance. As we can see, more decoder revisions bring better results, while the marginal improvement becomes smaller.
We have also run experiments with fewer encoder revisions; the results are similar to those in Figure 4. Besides, when there is no decoder revision (revise=0), the decoder network sees partial memory without right context, which leads to poor performance (CER ). We will not report the results with no revision in the following experiments.
4.4 Asynchronous Revision on Streaming Models
The proposed framework also works on streaming models. As shown in Figure 5, we perform asynchronous revision on two different streaming models, along with the baseline non-streaming model. For simplicity, we use the same number of revisions for the encoder and decoder networks. It is evident that the number of revisions is positively related to performance, whatever the model latency is. Meanwhile, a large number of revisions gives little relative improvement, since the right context provided by revision is then close to that in training. Interestingly, when the number of revisions is small (e.g. revise=1), a lower-latency model gives a better result. We believe this is because partial memory has a larger impact on higher-latency models.
4.5 Dynamic Latency Versus Streaming
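Table 1: CER (%) of streaming and dynamic latency models at different latencies.

| Model | Revisions | Latency | CER (%) |
|---|---|---|---|
| Streaming | - | 0.4s | 11.63 |
| | - | 1.2s | 10.63 |
| | - | 2.4s | 10.20 |
| Dynamic Latency | 1 | 0.4s | 11.75 |
| | 3 | 1.2s | 10.28 |
| | 6 | 2.4s | 9.75 |
| | - | 24s | 9.45 |
| Dynamic Latency + 3 Segments Cropping | 1 | 0.4s | 10.29 |
| | 3 | 1.2s | 9.55 |
| | 6 | 2.4s | 9.32 |
| | - | 24s | 9.26 |
| Dynamic Latency + 5 Segments Cropping | 1 | 0.4s | 10.03 |
| | 3 | 1.2s | 9.55 |
| | 6 | 2.4s | 9.37 |
| | - | 24s | 9.29 |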
As discussed in Section 3.2, the latency of an ASR model can be arbitrarily changed with asynchronous revision during inference. Without loss of generality, the decoding chunk size is 40 frames. Revising one chunk corresponds to 0.4 seconds of latency, 3 chunks to 1.2 seconds and 6 chunks to 2.4 seconds. We refer to the non-streaming model with asynchronous revision applied during inference as the dynamic latency model.
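The mapping from revisions to latency follows from the chunk size and the 10 ms frame shift given in Section 4.1. The helper below is a small sketch of that arithmetic; the function name and its defaults are ours, and it assumes the 40-frame chunk is counted at the 10 ms feature frame rate, which matches the figures quoted above.

```python
def revision_latency_seconds(revisions: int, chunk_frames: int = 40, frame_shift_ms: float = 10.0) -> float:
    """Latency introduced by waiting for `revisions` future chunks before finalizing a state."""
    if revisions < 0:
        raise ValueError("revisions must be non-negative")
    return revisions * chunk_frames * frame_shift_ms / 1000.0


if __name__ == "__main__":
    for r in (1, 3, 6):
        print(r, revision_latency_seconds(r))
```

Because the latency is a function of an inference-time parameter only, a different operating point can be selected without retraining.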
Table 1 presents the results of the dynamic latency model versus the streaming models. The streaming models are described in Section 4.2. The dynamic latency model gives better or similar results compared with the streaming models in the same latency configuration. To alleviate the partial memory mismatch, we apply segment cropping training. We use two cropping strategies, randomly splitting input utterances into 3 or 5 segments. As shown in the last two blocks of Table 1, both strategies give similar results. Compared with the streaming models, the dynamic latency model with segment cropping training gives 13.76%, 10.16% and 8.14% relative improvements at 0.4s, 1.2s and 2.4s latency, respectively. Furthermore, the performance of the dynamic latency model is extremely close (less than 1% rel.) to that of the non-streaming model in the case of 2.4 second latency.
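The relative improvements quoted above can be reproduced from the CER values in Table 1, using the 5-segment cropping rows against the streaming rows. The sketch below uses the standard relative-reduction formula; the function name is ours.

```python
def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# (streaming CER, dynamic latency + 5 segments cropping CER) per latency, from Table 1.
pairs = {"0.4s": (11.63, 10.03), "1.2s": (10.63, 9.55), "2.4s": (10.20, 9.37)}
improvements = {lat: round(relative_improvement(b, n), 2) for lat, (b, n) in pairs.items()}
# improvements == {"0.4s": 13.76, "1.2s": 10.16, "2.4s": 8.14}
```

The same function, applied to the 2.4s and 24s rows of either cropping block, gives a gap below 1%, consistent with the last sentence of the paragraph above.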
5 Conclusion
We propose an inference framework, asynchronous revision, to perform streaming and non-streaming speech recognition in one model. Moreover, in streaming mode, the latency of a non-streaming model can be arbitrarily changed, which we call dynamic latency. Asynchronous revision can thus be generally applied as an inference technique without requiring extra training support. We also propose segment cropping training to alleviate the partial memory problem during inference. We show that the dynamic latency model gives 8%-14% relative improvements over the streaming models under the same latency. We also show that in the 2.4 second latency setting, the performance of asynchronous revision decoding is very close to that of full context decoding. A limitation of our framework is that more revisions mean more computation. Our future work will focus on improving the computational efficiency of asynchronous revision.
References
[1] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[2] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP, 2013, pp. 6645–6649.
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
[4] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. ICASSP. IEEE, 2017, pp. 4835–4839.
[5] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018, pp. 4774–4778.
[6] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.

[7] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
[8] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in Proc. ICASSP. IEEE, 2019, pp. 6381–6385.
[9] Niko Moritz, Takaaki Hori, and Jonathan Le Roux, “Streaming automatic speech recognition with the transformer model,” in Proc. ICASSP. IEEE, 2020, pp. 6074–6078.
[10] Tara N Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, et al., “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in Proc. ICASSP. IEEE, 2020, pp. 6059–6063.
[11] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. ICASSP. IEEE, 2020, pp. 7829–7833.
[12] Danni Liu, Gerasimos Spanakis, and Jan Niehues, “Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection,” arXiv preprint arXiv:2005.11185, 2020.
[13] Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster, “Re-translation versus streaming for simultaneous translation,” arXiv preprint arXiv:2004.03643, 2020.
[14] Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster, “Re-translation strategies for long form, simultaneous, spoken language translation,” in Proc. ICASSP. IEEE, 2020, pp. 7919–7923.
[15] Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, and Hasim Sak, “Transformer transducer: One model unifying streaming and non-streaming speech recognition,” arXiv preprint arXiv:2010.03192, 2020.
[16] Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N Sainath, Yonghui Wu, and Ruoming Pang, “Universal ASR: Unify and improve streaming ASR with full-context modeling,” arXiv preprint arXiv:2010.06030, 2020.
[17] Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, “Deep-FSMN for large vocabulary continuous speech recognition,” in Proc. ICASSP. IEEE, 2018, pp. 5869–5873.