End-to-end (E2E) models [1, 2, 3, 4, 5, 6] have gained great popularity for ASR over the last few years. These models, which combine the acoustic, pronunciation and language models into a single network, have shown competitive results compared to conventional ASR systems.
Recently, there has been considerable interest in training E2E models for streaming ASR [7, 8, 9, 10, 11]. It has been shown that having access to future context when encoding the current frame significantly improves speech recognition accuracy. Bidirectional encoders take advantage of future context; however, inference can only be performed once the full input sequence is provided. Streaming recognition can therefore be achieved by specifying a limited right context, at the cost of some accuracy degradation.
Typically, streaming and non-streaming ASR models are trained separately. In particular, under streaming conditions, multiple models must be trained to reach different latency-accuracy trade-offs. In this work, we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Moreover, in streaming mode, we achieve dynamic latency ASR with only one model: the non-streaming model can be used with arbitrary right context during inference. At inference time, chunk-based incremental decoding is applied to the non-streaming model, whose history encoder and decoder states can be asynchronously revised to achieve dynamic latency ASR. When performing asynchronous revision during inference, the encoder network may see incomplete right context. To alleviate this training-inference mismatch, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. We show that the dynamic latency model gives 8%-14% relative improvements over streaming models with the same latency.
2 Related Work
There has been growing interest in building streaming ASR systems based on RNN-T [1, 2]. Compared with attention-based encoder-decoder (AED) models, RNN-T models are naturally streamable and have shown great potential for low-latency streaming ASR. In this work, we mainly focus on RNN-T based models.
To improve the latency of AED models, partial hypothesis selection is proposed in [12]. We both use chunk-based incremental decoding during inference; the main difference is that we achieve truly streaming ASR based on RNN-T, while they use a global attention mechanism. Similar work has been explored in simultaneous (speech) translation [13, 14]. Variable context training is used in the Y-model architecture [15]. At inference time, the context length of the variable context layers can be changed. However, the latency of the Y-model is predefined at training time, so only a limited set of latency configurations is available at inference. Universal ASR [16] is a unified framework that trains a single E2E ASR model for both streaming and non-streaming speech recognition. Nevertheless, only one streaming mode is available.
Unlike these approaches, our work not only explores the unification of streaming and non-streaming ASR models, but also achieves dynamic latency ASR with only one model. In other words, with the asynchronous revision decoding technique, the non-streaming model can be used with any right context during inference.
3 Dynamic Latency ASR
The encoder network encodes Mel-spectrogram features into hidden representations containing semantic information. The decoder network then makes predictions based on the current encoder output and the history prediction states. In this work, we focus on the most commonly used RNN-T model.
3.1 Frame Synchronous Decoding
It is clear that the decoder in RNN-T is an auto-regressive model in both streaming and non-streaming ASR models. Standard inference requires the complete input context for the encoder. For incremental inference, we instead divide the input utterance into fixed-size chunks and decode every time a new chunk arrives. As shown in figure 1, when performing incremental decoding, the encoder state is conditioned on the current chunk and the history states (memory), while the decoder state is conditioned on the current encoder state and the previous decoder state.
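This chunk-based loop can be sketched as follows; `encode_chunk` and `decode_step` are hypothetical scalar placeholders for the encoder and decoder networks, not our actual model code.

```python
import numpy as np

CHUNK = 40  # frames per chunk (0.4 s at a 10 ms frame shift)

def encode_chunk(chunk, enc_memory):
    # Placeholder encoder: the new state depends on the current chunk
    # and the cached history states (memory).
    prev = 0.0 if enc_memory is None else enc_memory
    return float(chunk.mean() + 0.5 * prev)

def decode_step(enc_state, dec_state):
    # Placeholder decoder: conditioned on the current encoder state
    # and the previous decoder state (auto-regressive).
    return enc_state + 0.5 * dec_state

def incremental_decode(features):
    """Frame-synchronous decoding: emit every time a new chunk arrives."""
    enc_memory, dec_state, emitted = None, 0.0, []
    for start in range(0, len(features), CHUNK):
        enc_memory = encode_chunk(features[start:start + CHUNK], enc_memory)
        dec_state = decode_step(enc_memory, dec_state)
        emitted.append(dec_state)
    return emitted
```

The key point is the flow of state: each encoder state carries the chunk history as memory, and each decoder state chains auto-regressively off the previous one.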
3.2 Asynchronous Revision Decoding
For a non-streaming model, it is hard to perform incremental decoding directly, since the long right context leads to large latency. What if we limit the right context? Figure 2 depicts the limited-context decoding process for a non-streaming model. Suppose the right context of the encoder network covers two chunks; figure 2 part a shows the initial states when decoding chunk C0. Since the right context of E0 is not complete, when chunk C1 arrives we need to revise the encoder state E0 along with the decoder state D0. As shown in figure 2 part b, if we only revise the last decoder state, then D0 is finalized. After chunk C2 arrives, the encoder state E0 gets its full input context, and we revise E0 one last time to finalize it, as shown in part c. Finalized states are not changed when new chunks arrive, as shown in part d; temporary states are revised in the next decoding step.
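A minimal sketch of this revision bookkeeping, assuming scalar placeholder states (the real networks operate on hidden vectors): after chunk t arrives, every temporary state is recomputed with the newly available right context, and a state is frozen once its revision window has passed.

```python
def enc_state(chunks, i, t, enc_rev):
    # Placeholder encoder: the state for chunk i sees up to `enc_rev`
    # chunks of right context, limited by what has arrived so far (t).
    right = min(i + enc_rev, t)
    return sum(chunks[i:right + 1]) / (right - i + 1)

def async_revision_decode(chunks, enc_rev=2, dec_rev=1):
    """Asynchronous revision sketch: after chunk t arrives, encoder
    states with index <= t - enc_rev have full right context and are
    finalized; newer (temporary) states are revised at each step.
    Decoder states are revised analogously with their own window."""
    enc, dec = [], []
    for t in range(len(chunks)):
        # Revise temporary encoder states and append the new one.
        for i in range(max(0, t - enc_rev), t + 1):
            s = enc_state(chunks, i, t, enc_rev)
            if i < len(enc):
                enc[i] = s          # revised temporary state
            else:
                enc.append(s)       # new temporary state
        # Revise temporary decoder states (auto-regressive chain).
        for i in range(max(0, t - dec_rev), t + 1):
            prev = dec[i - 1] if i > 0 else 0.0
            d = 0.5 * prev + enc[i]  # placeholder decoder step
            if i < len(dec):
                dec[i] = d
            else:
                dec.append(d)
    return enc, dec
```

Note how, with `dec_rev < enc_rev`, a decoder state is finalized before its encoder state, matching the order of events in figure 2.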
Since the numbers of revision steps for the encoder and decoder states may differ, we refer to this technique as asynchronous revision decoding. For the non-streaming model, more revisions means more available right context and therefore better results. However, more revisions also lead to larger latency; revision thus provides a trade-off between latency and accuracy. In other words, the non-streaming model can be used with any right context during inference, which achieves arbitrary latency. The latency of an ASR model can thus be arbitrarily changed after training, which we call dynamic latency.
3.3 Segment Cropping Training
When training a non-streaming ASR model, the full utterance is provided, so the encoder has complete context at every frame. When performing asynchronous revision during inference, the finalized encoder states may get incomplete right context. As shown in figure 2, suppose the right context covers five chunks; the finalized encoder state E0 only sees two chunks on the right, since the encoder revision is two steps. E0 then provides partial memory to the next state E1, and so on. This leads to a mismatch between inference and training.
Prefix training is used to alleviate partial inputs in simultaneous (speech) translation [14, 13]. However, it does not work for RNN-T models because of their intrinsic monotonic alignment property. In our framework, revision is applied to each chunk during inference; if the right context of a finalized encoder state is not the same as in training, a mismatch occurs. We call this incomplete right-context state partial memory. To alleviate the partial memory problem in incremental decoding, we propose a training strategy, segment cropping, which randomly splits the input sequence into several segments with forward connections. As shown in figure 3, every chunk has full right context in normal training, while at inference partial memory occurs in every chunk if no revision is applied. Segment cropping lets us simulate partial memory during training.
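A sketch of how cropping points might be drawn, with hypothetical helper names; the essential property is that right context is clipped at segment boundaries while left-to-right (forward) connections still cross them.

```python
import random

def segment_crop_points(num_frames, num_segments, seed=None):
    """Randomly split an utterance of `num_frames` frames into
    `num_segments` contiguous segments (a sketch of segment cropping)."""
    rng = random.Random(seed)
    cuts = sorted(rng.sample(range(1, num_frames), num_segments - 1))
    bounds = [0] + cuts + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]

def right_context_limit(frame, segments, full_context):
    # During segment-cropping training, the right context of `frame`
    # is clipped at the end of its segment, simulating the partial
    # memory the encoder sees during asynchronous revision decoding.
    for start, end in segments:
        if start <= frame < end:
            return min(frame + full_context, end - 1) - frame
    raise ValueError("frame index out of range")
```

Frames near a segment boundary thus train with truncated right context, while left context (forward connections) is never cut.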
4 Experiments

For our experiments we use a 10K-hour internal Mandarin speech dataset. The test set consists of 21k utterances, each shorter than 30 seconds, drawn from various applications. The training and test sets are all anonymized and hand-transcribed. The input speech waveforms are framed using a 25 ms window with a 10 ms shift, and we extract 80-dimensional filter bank features. We report model performance using the Character Error Rate (CER).
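For reference, the frame count implied by this framing setup can be computed as below; the 16 kHz sample rate is an assumption, as it is not stated above.

```python
def num_frames(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Number of frames produced by a 25 ms window with a 10 ms shift.
    The 16 kHz sample rate and no-padding convention are assumptions."""
    win = sample_rate * win_ms // 1000      # 400 samples per window
    shift = sample_rate * shift_ms // 1000  # 160 samples per shift
    return 0 if num_samples < win else 1 + (num_samples - win) // shift
```

One second of audio thus yields roughly 100 frames, each carrying an 80-dimensional filter bank vector.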
4.2 Model Structures and Hyperparameters
Our decoder network consists of two LSTM layers with 1024 units. At the front of the encoder network, we use two convolution subsampling layers, each with a stride of 2. For the non-streaming model, the encoder has 30 DFSMN layers with 2048 hidden units; the input context of each convolution layer covers 20 left frames and 20 right frames along with the current frame. For the streaming models, we use three different right-context configurations, yielding 0.4 second (0.4s for short), 1.2s and 2.4s latency models. The only difference between these models is the right context configuration. For simplicity, we use greedy search to decode our models.
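As a quick sanity check on the subsampling arithmetic, two stride-2 convolution layers reduce the frame rate by 4x, so a 40-frame chunk yields 10 encoder frames (the ceil-division padding behavior below is an assumption):

```python
def encoder_output_frames(input_frames, conv_strides=(2, 2)):
    """Frames surviving the convolution front-end. Two stride-2
    layers give 4x subsampling; ceil division assumes each layer
    pads so that no trailing frames are dropped."""
    n = input_frames
    for s in conv_strides:
        n = -(-n // s)  # ceil division
    return n
```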
4.3 Asynchronous Revision on Non-Streaming Model
We perform asynchronous revision decoding on the baseline non-streaming model (CER 9.45%) with various numbers of revisions. At inference, we divide the input utterances into fixed-size chunks of 40 frames and decode every time a new chunk arrives. The number of revisions therefore corresponds to the number of history chunks to be revised.
Figure 4 shows the relationship between revisions and CER. With the same number of decoder revisions, more encoder revisions have little impact on model performance. As we can see, more decoder revisions bring better results, though the marginal improvement becomes smaller.
We have also run experiments with smaller encoder revisions; the results are similar to those in figure 4. Besides, when there is no decoder revision (revise=0), the decoder network sees partial memory without any right context, which leads to poor performance. We do not report results with no revision in the following experiments.
4.4 Asynchronous Revision on Streaming Models
The proposed framework also works on streaming models. As shown in figure 5, we perform asynchronous revision on two different streaming models, along with the baseline non-streaming model. For simplicity, we use the same number of revisions for the encoder and decoder networks. It is evident that the number of revisions is positively related to performance regardless of the model latency. Meanwhile, large numbers of revisions give little relative improvement, since the right context provided by revision is then close to that seen in training. Interestingly, when the number of revisions is small (e.g. revise=1), the lower-latency model gives a better result. We believe this is because partial memory has a larger impact on the higher-latency model.
4.5 Dynamic Latency Versus Streaming
Table 1 (excerpt):

|Model|Revisions|Latency|CER|
|+3 Segments Cropping|6|2.4s|9.32|
|+5 Segments Cropping|6|2.4s|9.37|
As discussed in section 3.2, the latency of an ASR model can be arbitrarily changed with asynchronous revision during inference. Without loss of generality, the decoding chunk size is 40 frames: revising one chunk corresponds to 0.4 seconds of latency, 3 chunks to 1.2 seconds, and 6 chunks to 2.4 seconds. We denote the non-streaming model with asynchronous revision applied during inference as the dynamic latency model.
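The latency arithmetic is simply revisions x chunk size x frame shift:

```python
FRAME_SHIFT_MS = 10   # feature frame shift
CHUNK_FRAMES = 40     # decoding chunk size used in our experiments

def revision_latency_s(num_revisions):
    """Latency induced by waiting for `num_revisions` future chunks
    before a state is finalized."""
    return num_revisions * CHUNK_FRAMES * FRAME_SHIFT_MS / 1000
```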
Table 1 presents the results of the dynamic latency model versus the streaming models described in section 4.2. The dynamic latency model gives better or similar results compared with the streaming models under the same latency configuration. To alleviate the partial memory mismatch, we apply segment cropping training, using two cropping strategies that randomly split input utterances into 3 or 5 segments. As shown in the last two blocks of table 1, both strategies give similar results. Compared with the streaming models, the dynamic latency model with segment cropping training gives 13.76%, 10.16% and 8.14% relative improvements at 0.4s, 1.2s and 2.4s latency respectively. Furthermore, the performance of the dynamic latency model is extremely close (less than 1% rel.) to that of the non-streaming model in the 2.4 second latency case.
5 Conclusion

We propose an inference framework, asynchronous revision, to perform streaming and non-streaming speech recognition with one model. Moreover, in streaming mode, the latency of a non-streaming model can be arbitrarily changed, which we call dynamic latency; asynchronous revision can thus be generally applied as an inference technique without requiring extra training support. We also propose segment cropping training to alleviate the partial memory problem during inference. We show that the dynamic latency model gives 8%-14% relative improvements over streaming models under the same latency, and that in the 2.4 second latency setting, the performance of asynchronous revision decoding is very close to that of full-context decoding. A limitation of our framework is that more revisions mean more computation. Our future work will focus on improving the computational efficiency of asynchronous revision.
References

-  Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
-  A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP, 2013, pp. 6645–6649.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
-  Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proc. ICASSP. IEEE, 2017, pp. 4835–4839.
-  C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018, pp. 4774–4778.
-  Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
-  Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,” in Proc. ASRU. IEEE, 2017, pp. 193–199.
-  Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in Proc. ICASSP. IEEE, 2019, pp. 6381–6385.
-  Niko Moritz, Takaaki Hori, and Jonathan Le Roux, “Streaming automatic speech recognition with the transformer model,” in Proc. ICASSP. IEEE, 2020, pp. 6074–6078.
-  Tara N Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, et al., “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in Proc. ICASSP. IEEE, 2020, pp. 6059–6063.
-  Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in Proc. ICASSP. IEEE, 2020, pp. 7829–7833.
-  Danni Liu, Gerasimos Spanakis, and Jan Niehues, “Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection,” arXiv preprint arXiv:2005.11185, 2020.
-  Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster, “Re-translation versus streaming for simultaneous translation,” arXiv preprint arXiv:2004.03643, 2020.
-  Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster, “Re-translation strategies for long form, simultaneous, spoken language translation,” in Proc. ICASSP. IEEE, 2020, pp. 7919–7923.
-  Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, and Hasim Sak, “Transformer transducer: One model unifying streaming and non-streaming speech recognition,” arXiv preprint arXiv:2010.03192, 2020.
-  Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N Sainath, Yonghui Wu, and Ruoming Pang, “Universal asr: Unify and improve streaming asr with full-context modeling,” arXiv preprint arXiv:2010.06030, 2020.
-  Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, “Deep-fsmn for large vocabulary continuous speech recognition,” in Proc. ICASSP. IEEE, 2018, pp. 5869–5873.