Streaming Transformer ASR with Blockwise Synchronous Inference

by Emiru Tsunoo et al.

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute self-attention. We have proposed a block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps encode not only local acoustic information but also global linguistic, channel, and speaker attributes. In this paper, we extend block processing towards an entire streaming E2E ASR system without additional training, by introducing a blockwise synchronous decoding process inspired by a neural transducer into the Transformer decoder. We further apply a knowledge distillation technique with which training of the streaming Transformer is guided by the ordinary batch Transformer model. Evaluations of the HKUST and AISHELL-1 Mandarin tasks and LibriSpeech English task show that our proposed streaming Transformer outperforms conventional online approaches including monotonic chunkwise attention (MoChA). We also confirm that the knowledge distillation technique improves the accuracy further. Our streaming ASR models achieve comparable/superior performance to the batch models and other streaming-based transformer methods in all the tasks.




1 Introduction

End-to-end (E2E) automatic speech recognition (ASR) has been attracting attention as a method of directly integrating acoustic models and language models (LMs) because of the simple training and efficient decoding procedures. In recent years, various models have been studied, including connectionist temporal classification (CTC) [1, 2, 3], attention-based encoder–decoder models [4, 5, 6], their hybrid models [7], and the RNN-transducer [8, 9]. Transformer [10] has been successfully introduced into E2E ASR by replacing RNNs [11, 12, 13], and it outperforms bidirectional RNN models in most tasks [14]. Transformer has multihead self-attention network (SAN) layers, which can leverage a combination of information from completely different positions of the input.

However, similarly to bidirectional RNN models [15], Transformer has a drawback in that the entire utterance is required to compute self-attention, making it difficult to utilize in streaming ASR systems. Also, the memory and computational requirements of Transformer grow quadratically with the input sequence length, which makes it difficult to apply to longer speech utterances. This problem is now being tackled. Streaming ASR can be realized by simply introducing blockwise processing as in [11, 16, 17, 18, 19]; furthermore, Miao et al. [20] also utilized the previous chunk, inspired by Transformer-XL [21]. A triggered attention mechanism was introduced in [16], which requires complicated training using CTC forced alignment. Monotonic chunkwise attention (MoChA) [22] is a popular approach to achieving online processing [23, 24, 25]. However, MoChA degrades the performance [24, 25], and there is no guarantee that the latency is well controlled.

We have proposed a block processing method for the encoder–decoder Transformer model by introducing a context-aware inheritance mechanism combined with MoChA [26, 27]. The encoder is processed blockwise as in [20]. In addition, a context embedding vector handed over from the previously processed block helps encode not only local acoustic information but also global linguistic, channel, and speaker attributes. MoChA is modified for the source–target attention (STA) and utilized in the Transformer decoder. However, MoChA significantly degrades the performance.

In this paper, we propose a simple blockwise synchronous inference of the decoder, inspired by a neural transducer [18]. The decoder receives encoded blocks one by one from the contextual block encoder; each block is then decoded synchronously until the end-of-sequence token, “eos,” appears. Our contributions are as follows. 1) The contextual block processing of the encoder is incorporated with the blockwise synchronous inference of the decoder in the scheme of CTC/attention hybrid decoding. 2) We further apply the knowledge distillation technique [28, 29, 30] to the streaming Transformer, guided by the original batch Transformer. 3) Our proposed streaming Transformer is compared with conventional approaches including MoChA, and it outperforms them on the HKUST and AISHELL-1 Mandarin tasks and the LibriSpeech English task.

2 Relation with Prior Work

Among various approaches toward streaming processing for Transformer, such as the time-restricted Transformer [16, 17], Miao et al. [20] adopted a chunkwise self-attention encoder (Chunk SAE) inspired by Transformer-XL [21], in which not only the current chunk but also the previous chunk is utilized for streaming encoding. Although this encoder is similar to that in our earlier work [26, 27], in our case, not only the previous chunk but also a long history of chunks is efficiently referred to by introducing context embeddings.

Tian et al. [31] applied a neural transducer [18] to the synchronous Transformer decoder, which decodes sequences in a similar manner to our approach in this paper. The synchronous Transformer has to be trained using a special forward–backward algorithm, similarly to the training of a neural transducer using dynamic programming alignment. In this paper, instead, training is carried out with the ordinary batch decoder; we use its parameters as they are and directly apply them to the proposed blockwise synchronous inference, since our preliminary experiments show that this does not degrade the performance. Rather, knowledge distillation [28, 29, 30] is applied to improve the integrated performance of the streaming encoder and decoder. In addition, while the encoder in [31] attends to all previous blocks, as mentioned above, we adopt the contextual block encoder, which efficiently reduces the computational cost.

3 Blockwise Synchronous Inference

3.1 Transformer ASR

The baseline Transformer ASR follows that in [14], which is based on the encoder–decoder architecture. The encoder transforms a T-length speech feature sequence x = (x_1, ..., x_T) into a U-length intermediate representation h = (h_1, ..., h_U), where U < T owing to downsampling. Given h and the previously emitted character outputs y_{1:i-1} = (y_1, ..., y_{i-1}), a decoder estimates the next character y_i:

    y_i ~ p(y_i | y_{1:i-1}, h).    (1)

The encoder consists of two convolutional layers with stride 2 for downsampling, a linear projection layer, and positional encoding, followed by N_e encoder layers and layer normalization. Each encoder layer has a multihead SAN followed by a position-wise feedforward network, both of which have residual connections. In the SAN, attention weights are formed from queries (Q) and keys (K), and applied to values (V) as

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,    (2)

where d_k is the dimension of the queries and keys; typically d_k = d_model / M for the number of heads M. We utilize multihead attention, denoted as the MHD(.) function, as follows:

    MHD(Q, K, V) = Concat(head_1, ..., head_M) W^O,
    head_m = Attention(Q W_m^Q, K W_m^K, V W_m^V).    (3)

In (2) and (3), each layer is computed with its own projection matrices W_m^Q, W_m^K, W_m^V, and W^O. For all the SANs in the encoder, Q, K, and V are the same matrices, namely the inputs of the SAN. The position-wise feedforward network is a stack of linear layers.

The decoder predicts the probability of the following character from the previous output characters y_{1:i-1} and the encoder output h, i.e., p(y_i | y_{1:i-1}, h). The character history sequence is converted to character embeddings. Then, N_d decoder layers are applied, followed by the linear projection and Softmax function. Each decoder layer consists of a SAN and a STA, followed by a position-wise feedforward network. The first SAN in each decoder layer applies attention weights to the input character sequence, where Q, K, and V are all set to the input sequence of the SAN. The following STA then attends to the entire encoder output sequence by setting K and V to be h.

Transformer can leverage a combination of information from completely different positions of the input. This is due to the multiple heads and residual connections of the layers that complement each other, i.e., some attend monotonically and locally while others attend globally. Transformer requires the entire speech utterance for both the encoder and the decoder; thus, they are processed only after the end of the utterance, which causes a huge delay. To realize a streaming ASR system, both the encoder and decoder are processed online synchronously.
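To make the attention computation of Sec. 3.1 concrete, the following is a minimal NumPy sketch of scaled dot-product and multihead attention; the function and matrix names are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    w = softmax(Q @ K.T / np.sqrt(d_k))
    return w @ V

def multihead(Q, K, V, Wq, Wk, Wv, Wo, M):
    # split the model dimension into M heads, attend per head,
    # concatenate the heads, and apply the output projection Wo
    d = Q.shape[-1]
    dk = d // M
    heads = []
    for m in range(M):
        sl = slice(m * dk, (m + 1) * dk)
        heads.append(attention(Q @ Wq[:, sl], K @ Wk[:, sl], V @ Wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo
```

For an encoder SAN, Q, K, and V would all be the same layer input; for the decoder STA, K and V would be the encoder output h.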

Figure 1: Context inheritance mechanism of the encoder.

3.2 Contextual Block Processing of Encoder

A simple way to process the encoder online is blockwise computation, as in [11, 16, 17, 18, 19, 20]. However, the global channel, speaker, and linguistic context are also important for local phoneme classification. We have proposed a context inheritance mechanism for block processing by introducing an additional context embedding vector [26]. As shown by the tilted arrows in Fig. 1, the context embedding vector is computed in each layer of each block and handed over to the upper layer of the following block. Thus, the SAN in each layer is applied to the block input sequence together with the context embedding vector. A similar idea was also proposed in image and natural language processing around the same time [32].

Note that the blocks can overlap. We originally proposed a half-overlapping approach in [26], in which the central frames of block b, denoted h_b, are computed from the blocked input, which includes past frames as well as a look-ahead of future frames. The numbers of frames used for the left/center/right parts in [26], counted after downsampling, gave half-overlapping blocks. This can be easily extended to use more frames, with left/center/right sizes equivalent to the parameters in [20].

The context embedding vector is introduced into the original formulation in Sec. 3.1. Denoting the context embedding vector of block b at layer n as c_b^n, the input of each SAN is augmented so that its queries, keys, and values cover both the block input and the context embedding vector of the previous block of the previous layer, c_{b-1}^{n-1}. The output Z_b^n of the n-th encoder layer of block b is computed simultaneously with the context embedding vector as

    [Z_b^n, c_b^n] = EncoderLayer_n([Z_b^{n-1}, c_{b-1}^{n-1}]),

where each layer contains trainable projection matrices and biases. Thus, the encoded output of block b is described as

    h_b = Center(Z_b^{N_e}),

where Center(.) selects the central frames from Z_b^{N_e}. The output of the SAN not only encodes input acoustic features but also delivers the context information to the succeeding layer, as shown by the tilted red arrows in Fig. 1.
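As an illustration of the blockwise scheme above, here is a toy NumPy sketch of overlapping block slicing with a context vector handed across blocks. The tanh "layer" is a stand-in for a real SAN layer, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def split_blocks(x, n_left, n_center, n_right):
    """Slice a (T, d) feature sequence into overlapping blocks.

    Each block sees n_left past frames and n_right look-ahead frames
    around its n_center central frames; consecutive blocks hop by
    n_center, so n_left = n_right = n_center gives half-overlapping
    inputs.  Returns (block, center_offset, n_valid) triples.
    """
    T = len(x)
    blocks = []
    for start in range(0, T, n_center):
        lo = max(0, start - n_left)
        hi = min(T, start + n_center + n_right)
        blocks.append((x[lo:hi], start - lo, min(n_center, T - start)))
    return blocks

def encode_stream(x, n_left, n_center, n_right):
    """Toy blockwise 'encoder' with a context vector handed across blocks."""
    d = x.shape[1]
    ctx = np.zeros(d)                  # context embedding from the previous block
    outputs = []
    for block, c0, nc in split_blocks(x, n_left, n_center, n_right):
        # augment the block input with the context vector, as in Fig. 1
        z = np.tanh(np.concatenate([block, ctx[None, :]], axis=0))
        ctx = z[-1]                    # hand the context embedding to the next block
        outputs.append(z[c0:c0 + nc])  # emit only the central frames
    return np.concatenate(outputs, axis=0)
```

Only the central frames of each block are emitted, so the concatenated output covers the whole utterance exactly once, while each block still conditions on past frames, look-ahead frames, and the inherited context vector.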

3.3 Blockwise Synchronous Inference of Decoder

3.3.1 Synchronous Decoding

The original Transformer decoder requires the entire output of the encoder h; thus, it is not suitable for streaming processing as is. We previously proposed the use of MoChA [22], tailored for the STA [27]. However, the accuracy drops significantly when MoChA is applied to decoder layers, as was also observed in other studies [24, 25]. In addition, there is no guarantee that the latency is well controlled. Instead, we propose blockwise synchronous decoding inspired by a neural transducer [18], as a similar approach has proved effective for a streaming Transformer in [31].

When the encoder in Sec. 3.2 outputs a block h_b, the decoder starts decoding using the outputs encoded so far, h_{1:b}, as the ordinary Transformer does in Sec. 3.1, until the end-of-sequence token, “eos,” appears in a beam. The decoder then waits for the next output h_{b+1} and resumes from just before the last decoded outputs, including the eos hypothesis. All the hypotheses are maintained to be used for decoding the following block. While only the last block output from the encoder, computed from the entire history, was used in [31], we use all the encoded outputs computed from each input block with a contextual embedding vector.

Our Transformer has a lower computational cost than that in [31] in a typical setup, where the number of encoder layers is greater than that of decoder layers, or the number of output tokens is smaller than the number of encoder outputs. The synchronous decoding process is shown in Fig. 2.

Figure 2: Blockwise synchronous inference of the decoder.
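The blockwise synchronous beam search described above can be sketched as the following control loop. Here `score_fn` is a hypothetical stand-in for the Transformer decoder plus CTC prefix scorer; the structure (decode until eos appears in the beam, keep all hypotheses, resume when the next block arrives) follows the text, while the details are illustrative.

```python
EOS = "<eos>"

def decode_blockwise(encoded_blocks, score_fn, beam=2, max_len=20):
    """Blockwise synchronous beam search (illustrative sketch).

    For each newly encoded block, hypotheses are extended until every
    hypothesis in the beam has emitted <eos>; hypotheses are kept and
    decoding resumes from just before the <eos> when the next block
    arrives.  score_fn(hyp, memory) returns (token, log_prob) pairs.
    """
    hyps = [((), 0.0)]  # (token tuple, accumulated log prob)
    memory = []

    def done(h):
        return (len(h) > 0 and h[-1] == EOS) or len(h) >= max_len

    for block in encoded_blocks:
        memory.append(block)  # the decoder sees all blocks encoded so far
        # resume from just before the previously emitted <eos>
        hyps = [(h[:-1] if h and h[-1] == EOS else h, s) for h, s in hyps]
        while not all(done(h) for h, _ in hyps):
            cand = []
            for h, s in hyps:
                if done(h):
                    cand.append((h, s))
                    continue
                for tok, lp in score_fn(h, memory):
                    cand.append((h + (tok,), s + lp))
            hyps = sorted(cand, key=lambda p: p[1], reverse=True)[:beam]
    best = max(hyps, key=lambda p: p[1])
    return [t for t in best[0] if t != EOS]
```

In the real system, `score_fn` would combine the attention decoder probability with the blockwise CTC prefix score of Sec. 3.3.2, computed over the encoded blocks held in `memory`.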

3.3.2 On-the-fly CTC Prefix Scoring

Decoding is carried out jointly with CTC as in [7]. Originally, for each hypothesis g, the CTC prefix score is computed over the entire T-frame input as

    p_ctc(g | x_{1:T}) = γ_T^(n)(g) + γ_T^(b)(g),    (8)

where the superscripts (n) and (b) denote CTC paths ending with a nonblank or blank symbol, respectively. Thus, the entire input is required for accurate computation. However, in the case of blockwise synchronous inference, the beam search is carried out with a limited input length. Therefore, the CTC prefix score is computed from the blocks that are already encoded, as follows:

    p_ctc(g | x_{1:τ_b}) = γ_{τ_b}^(n)(g) + γ_{τ_b}^(b)(g),    (9)

where τ_b is the last frame of the currently processed block b. When a new block output is emitted by the encoder, the decoder resumes the CTC prefix score computation according to Algorithm 2 in [7]. Equation (9) requires more computation as the input sequence grows; however, it can be computed efficiently using the technique described in [33].
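The resumable nature of the blockwise CTC score can be illustrated with the standard CTC forward recursion, whose state (the forward variables) can be saved and advanced frame by frame as new encoder blocks arrive. This is a generic log-domain sketch of the CTC forward algorithm, not Algorithm 2 of [7] itself.

```python
import numpy as np

BLANK = 0  # blank symbol index (assumed)

def ctc_init(labels):
    """Extended label sequence with blanks, and the initial forward state."""
    ext = [BLANK]
    for l in labels:
        ext += [l, BLANK]
    alpha = np.full(len(ext), -np.inf)  # log-domain forward variables
    return ext, alpha

def ctc_step(ext, alpha, log_probs, first):
    """Advance the forward variables by one frame of log posteriors.

    Because only `alpha` is carried between calls, the score up to
    frame tau_b can be stored and resumed when the next block arrives.
    """
    S = len(ext)
    new = np.full(S, -np.inf)
    if first:
        new[0] = log_probs[ext[0]]
        if S > 1:
            new[1] = log_probs[ext[1]]
        return new
    for s in range(S):
        terms = [alpha[s]]                       # stay
        if s > 0:
            terms.append(alpha[s - 1])           # advance
        if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
            terms.append(alpha[s - 2])           # skip a blank
        new[s] = np.logaddexp.reduce(terms) + log_probs[ext[s]]
    return new

def ctc_score(alpha):
    # paths may end with the last nonblank or the final blank
    return np.logaddexp(alpha[-1], alpha[-2]) if len(alpha) > 1 else alpha[-1]
```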

3.4 Knowledge Distillation Training

Our preliminary experiments show that the parameters trained for the ordinary batch decoder perform well, without significant degradation, when they are directly used in the blockwise synchronous inference of the decoder. Therefore, instead of using a special dynamic programming or forward–backward training method as in [18, 31], we propose applying knowledge distillation [28, 29, 30] to the streaming Transformer, guided by the ordinary batch Transformer model, for further improvement.


Let q(y_i | y_{1:i-1}, x) be a probability distribution computed by a teacher batch model trained with the same dataset, and p(y_i | y_{1:i-1}, x) be the distribution predicted by the student streaming Transformer model. Then, the latter is forced to mimic the former distribution by minimizing the cross-entropy, which is written as

    L_KD = - Σ_i Σ_{v ∈ V} q(v | y_{1:i-1}, x) log p(v | y_{1:i-1}, x),

where V is the set of vocabulary. The aggregated loss function for the attention encoder and decoder is calculated as

    L = (1 - λ) L_att + λ L_KD,

where λ is a controllable interpolation weight. Then, this loss is combined with the CTC loss as in [7].
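Under the notation above, the distillation loss and its interpolation with the attention loss can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def distillation_loss(teacher_probs, student_log_probs):
    """Cross-entropy between the teacher (batch) distribution q and the
    student (streaming) distribution p, summed over output positions
    and the vocabulary: L_KD = -sum_i sum_v q(v) log p(v)."""
    return -np.sum(teacher_probs * student_log_probs)

def aggregated_loss(att_loss, kd_loss, lam):
    # interpolate the ordinary attention loss with the distillation loss
    return (1.0 - lam) * att_loss + lam * kd_loss
```

When the teacher distribution is a one-hot label, the distillation loss reduces to the ordinary negative log-likelihood, so λ smoothly trades hard-label training against mimicking the batch teacher.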

4 Experiments

4.1 Experimental Setup

We carried out experiments on the HKUST [34] and AISHELL-1 [35] Mandarin tasks, and, as an English task, we trained and evaluated on the LibriSpeech dataset [36]. The input acoustic features were 80-dimensional filter banks and the pitch. We used {3655, 4231} character classes for the {HKUST, AISHELL-1} Mandarin setups. For LibriSpeech, we adopted byte-pair encoding (BPE) subword tokenization [37] with 5000 token classes.

For the training, we utilized multitask learning with the CTC loss as in [7, 14], with a weight of 0.3. A linear layer was added onto the encoder output to project it to the character probabilities for the CTC. The Transformer models were trained over 50 epochs with the Adam optimizer and Noam learning rate decay as in [10]. SpecAugment [38] was applied except to AISHELL-1.

The encoder had N_e layers with 2048 units and the decoder had N_d layers with 2048 units, except for AISHELL-1, where the model size was chosen to enable a comparison with [31]. The dimension and the number of heads of the multihead attentions were set separately for the Mandarin and English tasks. The input blocks were overlapped with the same left/center/right parameters as in [20] to enable a comparison, as explained in Sec. 3.2. We trained the contextual block processing encoder (CBP-ENC) with the batch decoder. The parameters of the batch decoder were directly used in the proposed blockwise synchronous decoder (BS-DEC) for inference. The training was carried out using ESPnet [39] with the PyTorch backend; the training and inference implementations will be made publicly available.

The decoding was performed jointly with the CTC using a beam search, with beam widths of 10 for the Mandarin tasks and 30 for the English task, and CTC weights set per task. For the two Mandarin tasks, an external word-level LM, a single-layer LSTM with 1000 units, was used for rescoring via shallow fusion [40]. For LibriSpeech, a 4-layer LSTM LM with 2048 units or a 16-layer Transformer LM with 2048 units and 8 heads was fused.

4.2 Results

4.2.1 Hkust

The results are listed in Table 1. For comparison, we implemented the Chunk SAE [20], which is similar to our CBP-ENC approach except that it does not use the contextual embedding introduced in Sec. 3.2. Although we were unable to reproduce the original score in [20], the implemented model performed reasonably. Comparing CBP-ENC with the Chunk SAE, both used with the batch decoder, we confirm that our contextual embedding approach performed better. SpecAugment [38] brought further improvement. For streaming processing, the combination of CBP-ENC and BS-DEC outperformed the combination of CBP-ENC and the MoChA decoder [27]. The knowledge distillation training in Sec. 3.4 further improved the performance. Our proposed method achieved state-of-the-art performance as a streaming E2E approach.

Dev Test
Batch processing
Transformer [14] (reprod.) 24.0 23.5
    + SpecAugment 21.2 21.4
Chunk SAE + Batch Dec. [20] (reprod.) 25.8 25.0
CBP-ENC + Batch Dec. [26] 25.3 24.6
    + SpecAugment 22.3 22.1
    + Knowledge Distillation 22.1 22.3
Streaming processing
CIF + Chunk-hopping [41] 23.6
CBP-ENC + MoChA Dec. [27]
    + SpecAugment 28.1 26.1
CBP-ENC + BS-DEC (proposed)
    + SpecAugment 22.6 22.6
    + Knowledge Distillation 22.2 22.4
Table 1: Character error rates (CERs) in the HKUST task.

4.2.2 Aishell-1

We also conducted an evaluation on AISHELL-1, comparing our approach with the Sync-Transformer [31], which is the most similar to ours. The results are shown in Table 2, along with the results for the RNN-T evaluated in [42]. As can be seen, our approach outperformed the MoChA decoder, especially when knowledge distillation was applied, and it also outperformed the Sync-Transformer [31]. We also trained a larger model, with which we confirmed that the proposed blockwise synchronous inference suppressed the performance degradation from batch decoding.

Dev Test
Batch processing
Transformer [14] (reprod.) 7.4 8.1
CBP-ENC + Batch Dec. [26] 7.6 8.4
    + Knowledge Distillation 7.6 8.3
    — large 6.4 7.2
Streaming processing
RNN-T [42] 10.1 11.8
Sync-Transformer [31] 7.9 8.9
CBP-ENC + MoChA Dec. [27] 9.7 9.7
CBP-ENC + BS-DEC (proposed) 7.6 8.5
    + Knowledge Distillation 7.6 8.4
    — large 6.4 7.3
Table 2: Character error rates (CERs) in the AISHELL-1 task.

4.2.3 LibriSpeech

Lastly, we carried out an evaluation on LibriSpeech, in which BPE was utilized. The results are shown in Table 3. Although we did not tune the parameters as carefully as in [14, 16] and did not apply knowledge distillation, similar trends were observed. Our proposed method achieved better performance than CTC decoding [26] and continuous integrate-and-fire (CIF) online E2E ASR [41], which indicates that the blockwise synchronous decoder also works with BPE tokenization. It also outperformed the state-of-the-art streaming E2E ASR using triggered attention [16], which was well tuned and trained for 120 epochs. We obtained further improvements using the Transformer LM.

Dev Test
clean other clean other
Batch processing
Transformer [14] (reprod.) 2.5 6.3 2.8 6.4
    + Transformer LM 2.4 5.9 2.7 6.1
CBP-ENC + Batch Dec. [26] 2.7 7.2 2.9 7.3
Streaming processing
CBP-ENC + CTC [26] 3.2 9.0 3.3 9.1
CIF + Chunk-hopping [41] 3.3 9.6
Triggered Attention [16] (SOTA) 2.6 7.2 2.8 7.3
CBP-ENC + BS-DEC (proposed) 2.5 6.8 2.7 7.1
    + Transformer LM 2.3 6.5 2.6 6.7
Table 3: Word error rates (WERs) in the LibriSpeech task, which confirms the applicability of the proposed method to BPE tokenization.

5 Conclusions

We extended our previously proposed contextual block processing for the Transformer encoder towards an entire streaming E2E ASR system without additional training, by introducing blockwise synchronous decoding inspired by a neural transducer into the Transformer decoder. The decoder synchronously applies self-attention networks to each encoded block output until the end-of-sequence token appears. Evaluations of the HKUST and AISHELL-1 Mandarin tasks and LibriSpeech English task showed that our proposed streaming Transformer outperforms conventional online approaches including MoChA, especially when we applied the knowledge distillation technique.


  • [1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. of 23rd International Conference on Machine Learning, 2006, pp. 369–376.
  • [2] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Proc. of ASRU Workshop, 2015, pp. 167–174.
  • [3] D. Amodei et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. of 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 173–182.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. of NIPS, 2015, pp. 577–585.
  • [5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP, 2016, pp. 4960–4964.
  • [6] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP, 2018, pp. 4774–4778.
  • [7] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [8] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of ICASSP, 2013, pp. 6645–6649.
  • [9] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in Proc. of ASRU Workshop, 2017, pp. 193–199.
  • [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of NeurIPS, 2017, pp. 5998–6008.
  • [11] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Self-attentional acoustic models,” in Proc. of Interspeech, 2018, pp. 3723–3727.
  • [12] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in Proc. of ICASSP, 2019, pp. 7115–7119.
  • [13] Y. Zhao, J. Li, X. Wang, and Y. Li, “The Speechtransformer for large-scale Mandarin Chinese speech recognition,” in Proc. of ICASSP, 2019, pp. 7095–7099.
  • [14] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., “A comparative study on transformer vs RNN in speech applications,” in Proc. of ASRU Workshop, 2019, pp. 449–456.
  • [15]

    M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”

    Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [16] N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” in Proc. of ICASSP, 2020, pp. 6074–6078.
  • [17] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A time-restricted self-attention layer for ASR,” in Proc. of ICASSP, 2018, pp. 5874–5878.
  • [18] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in Proc. of NIPS, 2016, pp. 5067–5075.
  • [19] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in Proc. of ICASSP, 2019, pp. 5656–5660.
  • [20] H. Miao, G. Cheng, Z. Pengyuan, and Y. Yan, “Transformer online CTC/attention end-to-end speech recognition architecture,” in Proc. of ICASSP, 2020, pp. 6084–6088.
  • [21] Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
  • [22] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” arXiv preprint arXiv:1712.05382, 2017.
  • [23] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, “An online attention-based model for speech recognition,” Proc. of Interspeech, pp. 4390–4394, 2019.
  • [24] K. Kim, K. Lee, D. Gowda, J. Park, S. Kim, S. Jin, Y.-Y. Lee, J. Yeo, D. Kim, S. Jung et al., “Attention based on-device streaming speech recognition with large speech corpus,” in Proc. of ASRU Workshop, 2019, pp. 956–963.
  • [25] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, “Minimum latency training strategies for streaming sequence-to-sequence ASR,” in Proc. of ICASSP, 2020, pp. 6064–6068.
  • [26] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Transformer ASR with contextual block processing,” in Proc. of ASRU Workshop, 2019, pp. 427–433.
  • [27] ——, “Towards online end-to-end transformer automatic speech recognition,” arXiv preprint arXiv:1910.11871, 2019.
  • [28] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size DNN with output-distribution-based criteria,” in Proc of 15th Annual Conference of the International Speech Communication Association, 2014.
  • [29] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [30] L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in Proc. of ICASSP, 2017, pp. 4820–4824.
  • [31] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous transformers for end-to-end speech recognition,” in Proc. of ICASSP, 2020, pp. 7884–7888.
  • [32] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019.
  • [33] H. Seki, T. Hori, S. Watanabe, N. Moritz, and J. Le Roux, “Vectorized beam search for ctc-attention-based speech recognition,” in Proc. of Interspeech, 2019, pp. 3825–3829.
  • [34] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, “HKUST/MTS: A very large scale Mandarin telephone speech corpus,” in International Symposium on Chinese Spoken Language Processing.   Springer, 2006, pp. 724–735.
  • [35]

    H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in

    Oriental COCOSDA, 2017, pp. 1–5.
  • [36] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
  • [37]

    R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in

    Proc. of the Association for Computational Linguistics, vol. 1, 2016, pp. 1715–1725.
  • [38] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. of Interspeech, 2019.
  • [39] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. of Interspeech, 2019, pp. 2207–2211.
  • [40] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP, 2018, pp. 5824–5828.
  • [41] L. Dong and B. Xu, “CIF: Continuous integrate-and-fire for end-to-end speech recognition,” in Proc. of ICASSP, 2020, pp. 6079–6083.
  • [42] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, “Self-attention transducers for end-to-end speech recognition,” in Proc. of Interspeech, 2019, pp. 4395–4399.