E2E ASR models simplify hybrid DNN/HMM ASR models by replacing the acoustic, pronunciation and language models with a single deep neural network, and thus transcribe speech to text directly. To date, E2E ASR models have achieved significant improvements in the ASR field [4, 5, 6]. The hybrid Connectionist Temporal Classification (CTC) / attention E2E ASR architecture
has attracted much attention because it combines the advantages of CTC models and attention models. During training, the CTC objective is attached to the attention-based encoder-decoder model as an auxiliary task. During decoding, the joint CTC/attention decoding approach is adopted in the beam search. However, it is difficult to deploy the CTC/attention E2E ASR architecture online, because the global attention mechanisms and CTC prefix scores [6, 9] depend on the entire input speech. Our prior work [10, 11] streamed this architecture from both the model structure and the decoding algorithm aspects. On the model structure aspect, we proposed monotonic truncated attention (MTA) to stream the attention mechanisms, and applied the latency-controlled bidirectional long short-term memory (LC-BLSTM) as a low-latency encoder. On the decoding aspect, we proposed an online joint decoding approach, which includes truncated CTC (T-CTC) prefix scores and the dynamic waiting joint decoding (DWJD) algorithm.
Transformer-based models are parallelizable and competitive with recurrent neural networks. However, the vanilla Transformer is inapplicable to online tasks for two reasons: first, the self-attention encoder (SAE) computes the attention weights over the whole input frames; second, the self-attention decoder (SAD) computes the attention weights over the whole outputs of the SAE.
In this paper, we stream the Transformer and integrate it into the CTC/attention E2E ASR architecture. On the SAE aspect, we propose the chunk-SAE, which splits the input speech into isolated chunks of fixed length. Inspired by Transformer-XL [17], we further propose the state reuse chunk-SAE, which reuses the stored states of the previous chunks to reduce the computational cost. On the SAD aspect, we propose the MTA based SAD, which performs attention on the truncated historical outputs of the SAE. Finally, we propose the Transformer-based online CTC/attention E2E ASR architecture via the online joint decoding approach. Our experiments show that the proposed online model with a 320 ms latency achieves 23.66% character error rate (CER) on HKUST, with only 0.19% absolute CER degradation compared with the offline baseline.
The rest of this paper is organized as follows. In Section 2, we describe the online CTC/attention E2E architecture proposed in our prior work [10, 11]. In Section 3, we introduce the Transformer architecture. In Section 4, we describe the online Transformer-based CTC/attention architecture. The experiments and conclusions are presented in Sections 5 and 6, respectively.
2 Online CTC/attention E2E Architecture
In our prior work [10, 11], we proposed an online hybrid CTC/attention E2E ASR architecture, which consists of the LC-BLSTM encoder, MTA and an LSTM decoder. During training, we introduce the CTC objective as an auxiliary task, and the loss function is defined by:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}} \quad (1)$$

where $\lambda \in [0, 1]$ is a hyperparameter, and $\mathcal{L}_{\mathrm{att}}$ and $\mathcal{L}_{\mathrm{CTC}}$ are the loss functions from the decoder and CTC, respectively. During decoding, we adopt the online joint decoding approach, which is defined by:
$$\hat{Y} = \underset{Y}{\arg\max}\,\big\{(1-\lambda)\log p_{\mathrm{att}}(Y|X) + \lambda \log p_{\mathrm{tctc}}(Y|X) + \gamma \log p_{\mathrm{lm}}(Y)\big\} \quad (2)$$

where $p_{\mathrm{att}}(Y|X)$ and $p_{\mathrm{tctc}}(Y|X)$ are the probabilities of the hypothesis $Y$ conditioned on the input frames $X$ from the decoder and T-CTC, respectively, and $p_{\mathrm{lm}}(Y)$ is the language model probability. The hyperparameters $\lambda$ and $\gamma$ are tunable. For online decoding, we proposed the DWJD algorithm to 1) coordinate the forward propagation in the encoder and the beam search in the decoder; and 2) address the unsynchronized predictions of the MTA-based decoder and CTC outputs.
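As an illustration of the joint decoding score, the following minimal sketch combines the three log-probabilities; the weights `lam` and `gamma` here are placeholders, not the paper's tuned values:

```python
import math

def joint_score(att_logp, tctc_logp, lm_logp, lam=0.5, gamma=0.3):
    """Joint decoding score: weighted sum of the attention decoder,
    T-CTC and language model log-probabilities of one hypothesis."""
    return (1.0 - lam) * att_logp + lam * tctc_logp + gamma * lm_logp

# Rank two partial hypotheses in a beam by their joint score.
h1 = joint_score(math.log(0.6), math.log(0.5), math.log(0.4))
h2 = joint_score(math.log(0.3), math.log(0.2), math.log(0.5))
assert h1 > h2  # the first hypothesis scores higher
```

In the beam search, this score is evaluated for every partial hypothesis and only the top-scoring beams are kept.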
MTA, which performs attention on top of the truncated historical encoder outputs, plays a major role in our online system. Formally, we denote $s_i$ and $h_j$ as the $i$-th decoder state and the $j$-th encoder output, respectively. Similar to monotonic chunkwise attention [18], MTA defines the probability of truncating the encoder outputs at position $j$ as:

$$p_{i,j} = \mathrm{sigmoid}\left(g\,\frac{v^{\top}\tanh(W_{s}s_{i} + W_{h}h_{j} + b)}{\lVert v\rVert} + r\right) \quad (3)$$

where the matrices $W_{s}$, $W_{h}$, the vectors $v$, $b$, and the scalars $g$, $r$ are trainable parameters. Then, the attention weight is computed by:

$$\alpha_{i,j} = p_{i,j}\prod_{k=1}^{j-1}(1 - p_{i,k}) \quad (4)$$
where $\alpha_{i,j}$ indicates the probability of truncating the encoder outputs at position $j$ and skipping the encoder outputs before $j$. During decoding, MTA determines a truncation end-point $t_i$ for the $i$-th decoder step by:

$$t_{i} = \min\{\, j \mid z_{i,j} = 1,\; j \geq t_{i-1} \,\}, \qquad z_{i,j} = \mathbb{1}(p_{i,j} \geq 0.5) \quad (5)$$
where $z_{i,j}$ denotes the indicator of truncating or not truncating the encoder outputs at position $j$, and $\mathbb{1}(\cdot)$ represents an indicator function. By the condition $j \geq t_{i-1}$ in Eq. 5, MTA enforces the end-point to move in a left-to-right mode. Once $z_{i,j} = 1$ for some $j$, MTA sets $t_i$ to $j$. Finally, MTA performs attention on the truncated encoder outputs:

$$c_{i} = \sum_{j=1}^{t_{i}} \alpha_{i,j} h_{j} \quad (6)$$
where $c_i$ is the letter-wise hidden vector for the $i$-th decoder step.
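The decoding-time MTA step above can be sketched in numpy as follows; `mta_decode_step` and its interface are illustrative, not the paper's implementation, and the truncation probabilities are assumed to be precomputed:

```python
import numpy as np

def mta_decode_step(p_row, h, t_prev):
    """One MTA decoding step: find the truncation end-point t_i (the first
    position j >= t_prev with p_{i,j} >= 0.5, Eq. 5), then attend over the
    truncated encoder outputs h[:t] with the weights of Eq. 4 (Eq. 6)."""
    t = len(p_row)                       # fall back to all encoder outputs
    for j in range(t_prev, len(p_row)):
        if p_row[j] >= 0.5:              # z_{i,j} = 1
            t = j + 1
            break
    p = p_row[:t]
    # alpha_{i,j} = p_{i,j} * prod_{k<j} (1 - p_{i,k})  (exclusive cumprod)
    alpha = p * np.cumprod(np.concatenate(([1.0], 1.0 - p[:-1])))
    c = alpha @ h[:t]                    # letter-wise hidden vector c_i
    return c, t

# Example: truncation fires at the second encoder output.
c, t = mta_decode_step(np.array([0.1, 0.6, 0.9]), np.eye(3), 0)
assert t == 2
```

Because the end-point only moves rightwards across decoder steps, the loop resumes from `t_prev` rather than scanning from the beginning.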
During training, MTA performs attention on the whole encoder outputs:

$$c_{i} = \sum_{j=1}^{T} \alpha_{i,j} h_{j} \quad (7)$$

where $T$ denotes the number of encoder outputs.
3 Transformer Architecture
Transformer [12] follows the encoder-decoder architecture, using stacked self-attention and position-wise feed-forward layers for both the encoder and the decoder. We briefly introduce the Transformer architecture in this section.
3.1 Multi-head attention
Transformer adopts the scaled dot-product attention to map a query and a set of key-value pairs to an output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \quad (8)$$

where the matrices $Q \in \mathbb{R}^{n_{q} \times d}$, $K \in \mathbb{R}^{n_{k} \times d}$ and $V \in \mathbb{R}^{n_{k} \times d}$ denote the queries, keys and values, $n_{q}$ and $n_{k}$ denote the numbers of queries and keys (or values), and $d$ denotes the representation dimension.
Instead of performing a single attention function, Transformer uses multi-head attention, which jointly learns diverse relationships between queries and keys from different representation sub-spaces:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})W^{O} \quad (9)$$

$$\mathrm{head}_{i} = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}) \quad (10)$$

where $h$ denotes the number of heads and $d_{k} = d / h$. The matrices $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in \mathbb{R}^{d \times d_{k}}$ and $W^{O} \in \mathbb{R}^{d \times d}$ are trainable parameters.
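A minimal numpy sketch of Eqs. 8-10. For brevity, the per-head projections $W_i^Q, W_i^K, W_i^V$ are fused into single $d \times d$ matrices whose outputs are then split into heads, which is mathematically equivalent to per-head projections:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v (Eq. 8)."""
    d = q.shape[-1]
    s = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))   # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def multi_head_attention(q, k, v, wq, wk, wv, wo, n_head):
    """Project q/k/v, split into n_head sub-spaces, attend per head,
    concatenate the heads and apply the output projection wo (Eqs. 9-10)."""
    def split(x):                                   # (n, d) -> (n_head, n, d_k)
        n, d = x.shape
        return x.reshape(n, n_head, d // n_head).transpose(1, 0, 2)
    heads = attention(split(q @ wq), split(k @ wk), split(v @ wv))
    n = q.shape[0]
    return heads.transpose(1, 0, 2).reshape(n, -1) @ wo
```

With $d = 8$ and 4 heads, each head attends in an independent 2-dimensional sub-space before the heads are recombined.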
Because Transformer itself does not model the sequence order, the work in [12] suggested using sine and cosine functions of different frequencies to perform positional encoding.
3.2 Self-attention encoder (SAE)
The SAE consists of a stack of identical layers, each of which has two sub-layers, i.e. one self-attention layer and one position-wise feed-forward layer. The inputs of the SAE are acoustic frames in ASR tasks. The self-attention layer employs multi-head attention, in which the queries, keys and values are the outputs of the previous layer. Besides, the SAE uses residual connections [19] and layer normalization [20] after each sub-layer.
3.3 Self-attention decoder (SAD)
The SAD also consists of a stack of identical layers, each of which has three sub-layers, i.e. one self-attention layer, one encoder-decoder attention layer and one position-wise feed-forward layer. The inputs of the SAD are the embeddings of the right-shifted output labels. To prevent access to future output labels in the self-attention, the subsequent positions are masked. In the encoder-decoder attention, the queries are the current layer inputs, while the keys and values are the SAE outputs. Besides, the SAD also uses residual connections and layer normalization after each sub-layer.
4 Transformer-based Online CTC/attention E2E Architecture
In this section, we propose the Transformer-based online E2E model, which consists of the chunk-SAE with or without reusing stored states and MTA based SAD. The Transformer-based online CTC/attention E2E architecture is shown in Fig. 1.
4.1 Chunk-SAE

To stream the SAE, we first propose the chunk-SAE, which splits the input speech into non-overlapping, isolated chunks of central length $N_{c}$. To acquire contextual information, we splice $N_{l}$ frames before each chunk as the historical context and $N_{r}$ frames after it as the future context. The spliced frames only act as contexts and give no output. With the predefined parameters $N_{l}$, $N_{c}$ and $N_{r}$, the receptive field of each chunk-SAE output is restricted to $N_{l} + N_{c} + N_{r}$ frames and the latency of the chunk-SAE is limited to $N_{c} + N_{r}$ frames.
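The chunk splitting above can be sketched as a small index-range computation; `split_into_chunks` is an illustrative helper, not the paper's code:

```python
def split_into_chunks(n_frames, n_left, n_center, n_right):
    """Index ranges for the chunk-SAE: each chunk covers
    [center_start, center_end) and is spliced with the historical context
    [hist_start, center_start) and the future context [center_end, fut_end).
    Only the central frames produce outputs; the contexts are discarded."""
    chunks = []
    for start in range(0, n_frames, n_center):
        end = min(start + n_center, n_frames)
        chunks.append((max(0, start - n_left), start, end,
                       min(n_frames, end + n_right)))
    return chunks

# A 10-frame input with N_l=2, N_c=4, N_r=1 yields three chunks.
assert split_into_chunks(10, 2, 4, 1) == [(0, 0, 4, 5), (2, 4, 8, 9), (6, 8, 10, 10)]
```

The online latency comes from waiting for the $N_c + N_r$ frames to the right of each chunk's first output.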
4.2 State reuse chunk-SAE
In the chunk-SAE, the historical context is re-computed for each chunk. To reduce the computational cost, we store the hidden states computed in the central context. Then, when computing a new chunk, we reuse the stored hidden states from the previous chunks at the same positions as the historical context, which is inspired by Transformer-XL [17]. Fig. 2 illustrates the difference between the chunk-SAE with and without reusing hidden states. Formally, let $\tilde{h}_{i}^{n}$ and $h_{i}^{n}$ denote the stored and newly-computed hidden states for the $i$-th chunk in the $n$-th layer, respectively. Then, the queries, keys and values for the $i$-th chunk in the $n$-th self-attention layer are defined as follows:

$$q_{i}^{n} = h_{i}^{n-1} \quad (11)$$

$$k_{i}^{n} = v_{i}^{n} = \left[\mathrm{SG}(\tilde{h}_{i}^{n-1}) \circ h_{i}^{n-1}\right] \quad (12)$$

where $\circ$ denotes concatenation along the time axis.
In Eq. 12, the function $\mathrm{SG}(\cdot)$ stands for stop-gradient. Therefore, the complexity of the state reuse chunk-SAE is reduced by a factor of $\frac{N_{l} + N_{c} + N_{r}}{N_{c} + N_{r}}$ relative to the chunk-SAE.
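The key/value construction for one self-attention layer of the state reuse chunk-SAE can be sketched as follows; since numpy carries no gradients, the stop-gradient is represented by a plain copy, and `reuse_keys_values` is an illustrative helper:

```python
import numpy as np

def reuse_keys_values(stored_hist, current):
    """Queries come only from the newly computed chunk states, while
    keys/values prepend the stored (gradient-detached) historical states
    of the previous chunks, so the historical context is never recomputed."""
    q = current
    kv = np.concatenate([stored_hist.copy(), current], axis=0)  # SG(h~) ++ h
    return q, kv
```

Only the central and future frames of each chunk need a forward pass, which is the source of the speed-up quoted above.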
Moreover, the state reuse chunk-SAE captures long-term dependency beyond the chunks. Suppose the state reuse chunk-SAE consists of $M$ layers; then the receptive field on the left side extends to as far as $M \times N_{l}$ frames, which is much broader than that of the chunk-SAE.
4.3 MTA based SAD
To stream the SAD, we propose the MTA based SAD, which truncates the receptive field in a monotonic left-to-right way and performs attention on the truncated outputs of the SAE. Specifically, we substitute MTA for the encoder-decoder attention in each SAD layer, as shown in Fig. 2. Suppose the representation dimension is $d_{k}$; during training, MTA performs in parallel as follows:

$$A = P \odot \mathrm{cumprod}(1 - P) \quad (13)$$

$$P = \mathrm{sigmoid}\left(\frac{(QW^{Q})(KW^{K})^{\top}}{\sqrt{d_{k}}} + r + \mathcal{E}\right) \quad (14)$$
where the matrices $W^{Q}$, $W^{K}$ and the scalar bias $r$ are trainable parameters, and $\mathcal{E}$ denotes the noise. We define $P$ as the truncation probability matrix, where $P_{i,j}$ indicates the probability of truncating the $j$-th SAE output in order to predict the $i$-th output label. In Eq. 13, the cumulative product function $\mathrm{cumprod}(x) = [1, x_{1}, x_{1}x_{2}, \ldots, \prod_{k=1}^{|x|-1} x_{k}]$ applies to the rows of $1 - P$. The notation $\odot$ indicates the element-wise product.
MTA learns an appropriate offset for the pre-sigmoid activations in Eq. 14 via the trainable scalar $r$. To prevent the truncation probabilities from vanishing to zeros, we initialize $r$ to a negative value in our experiments. To encourage the discreteness of the truncation probabilities, we simply add zero-mean, unit-variance Gaussian noise $\mathcal{E}$ to the pre-sigmoid activations, only during training.
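The parallel training-time computation of Eqs. 13-14 can be sketched as follows; the offset `r=-1.0` is an illustrative negative initialization, not the paper's value, and the pre-sigmoid energies are assumed given:

```python
import numpy as np

def mta_train_weights(energies, r=-1.0, train=True, seed=0):
    """Training-time MTA in parallel: add the scalar offset r and, during
    training only, zero-mean unit-variance Gaussian noise to the
    pre-sigmoid activations (Eq. 14), then form the attention weights
    A = P * cumprod(1 - P) row-wise with an exclusive cumprod (Eq. 13)."""
    rng = np.random.default_rng(seed)
    e = energies + r
    if train:
        e = e + rng.standard_normal(e.shape)   # encourages discreteness
    p = 1.0 / (1.0 + np.exp(-e))               # truncation probabilities P
    ones = np.ones((p.shape[0], 1))
    # exclusive cumprod along rows: [1, (1-p1), (1-p1)(1-p2), ...]
    a = p * np.cumprod(np.concatenate([ones, 1.0 - p[:, :-1]], axis=1), axis=1)
    return p, a
```

Each row of `a` sums to at most 1, matching the probabilistic interpretation of the truncation process.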
During decoding, we have to compute the elements of $P^{n}$ row by row, where $P^{n}$ is the truncation probability matrix in the $n$-th layer. We define $t_{i}^{n}$ as the truncation end-point belonging to the $n$-th layer when predicting the $i$-th output label. Then, the end-point is determined by:

$$t_{i}^{n} = \min\{\, j \mid z_{i,j}^{n} = 1,\; j \geq t_{i-1}^{n} \,\}, \qquad z_{i,j}^{n} = \mathbb{1}(P_{i,j}^{n} \geq 0.5) \quad (15)$$
where $z_{i,j}^{n}$ denotes the indicator of truncating or not truncating the $j$-th SAE output in the $n$-th layer, and $\mathbb{1}(\cdot)$ represents an indicator function. Once $z_{i,j}^{n} = 1$ for some $j$, we set $t_{i}^{n}$ to $j$, which means that the receptive field of the $n$-th layer is restricted to the first $t_{i}^{n}$ SAE outputs. Suppose the MTA based SAD consists of $M$ layers; then there will be $M$ end-points at each decoding step. The number of truncated SAE outputs in one layer does not affect the other layers. Therefore, we define the maximum of the end-points, $\max_{n} t_{i}^{n}$, as the receptive field of the MTA based SAD.
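The per-layer end-points and their maximum can be sketched as follows; `sad_receptive_field` is an illustrative helper operating on precomputed truncation probability rows:

```python
def sad_receptive_field(prob_rows, t_prevs):
    """Compute the truncation end-point of each SAD layer at one decoding
    step (Eq. 15) and return them with their maximum, which defines the
    receptive field of the whole MTA based SAD."""
    end_points = []
    for p_row, t_prev in zip(prob_rows, t_prevs):
        t = len(p_row)                  # no truncation fired: use everything
        for j in range(t_prev, len(p_row)):
            if p_row[j] >= 0.5:         # z^n_{i,j} = 1
                t = j + 1
                break
        end_points.append(t)
    return end_points, max(end_points)

# Two SAD layers truncating at different points: receptive field = 4.
eps, rf = sad_receptive_field([[0.1, 0.7, 0.0], [0.0, 0.2, 0.3, 0.8]], [0, 0])
assert (eps, rf) == ([2, 4], 4)
```

Because the layers truncate independently, the decoder must wait only for the SAE outputs up to the largest end-point.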
5 Experiments

5.1 Corpus

We evaluated our models on the HKUST Mandarin Chinese conversational telephone speech recognition corpus [21]. HKUST consists of a train set of about 200 hours and a test set of about 5 hours. We extracted 4000 utterances from the train set as our development set. To improve the recognition accuracy, we applied speed perturbation to the rest of the train set with factors 0.9 and 1.1.
5.2 Model descriptions
We built all the online models using the ESPnet toolkit [22]. For the input, we used 83-dimensional features, including 80-dimensional filter banks, pitch, delta-pitch and Normalized Cross-Correlation Functions. The features were computed with a 25 ms window and shifted every 10 ms. For the output, we adopted a 3655-sized vocabulary set, including 3623 Mandarin Chinese characters, 26 English characters, as well as 6 non-language symbols denoting laughter, noise, vocalized noise, blank, unknown-character and sos/eos.
We used a 2-layer convolutional neural network (CNN) as the front-end. Each CNN layer had 256 filters with a 3 × 3 kernel and a stride of 2, and thus the time reduction of the front-end was 4. The SAE and SAD had 12 and 6 layers, respectively. All sub-layers, as well as the embedding layers, produced outputs of dimension 256. In the multi-head attention networks, the number of heads was 4. In the position-wise feed-forward networks, the inner dimension was 2048. Besides, we trained a 2-layer 1024-dimensional LSTM network on the HKUST transcriptions as the external language model, adopting the above 3655-sized vocabulary set.
During training, we used the CTC/attention joint training and the Adam optimizer with the Noam learning rate schedule (25000 warm-up steps), and trained for 30 epochs. To prevent overfitting, we used dropout [23] in each sub-layer, uniform label smoothing [24] in the output layer, and a model averaging approach that averages the parameters of the models from the last 10 epochs. During decoding, we adopted the online joint decoding approach, combining T-CTC prefix scores and language model scores to prune the hypotheses, and the beam size was 10.
5.3 Chunk-SAE with or without reusing states
In Table 1, we compare the speed and performance of the chunk-SAE with and without reusing states. The context configuration remained the same for all online models during the comparison. Firstly, we measured the speed of the various encoders during decoding on a server with an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz. For a clear comparison, we set the speed of the chunk-SAE to 1 and give the speed ratios of the other encoders. In lines 1 and 2 of Table 1, the chunk-SAE was slower than the SAE due to the redundant computation of the historical and future contexts. In lines 2 and 3 of Table 1, we observed that the state reuse chunk-SAE was 1.5x faster than the chunk-SAE, which is consistent with the theoretical analysis in Section 4.2. In addition to its faster speed, the state reuse chunk-SAE outperformed the chunk-SAE, with relative CER reductions on both the HKUST development and test sets. Because of its faster speed and better performance, we employed the state reuse chunk-SAE in our subsequent experiments.
5.4 Context investigation
In Table 2, we investigated the performance of our online model while varying the historical, central and future context lengths. Firstly, comparing lines 2-4 in Table 2, we can see that the future context brought more improvement than the historical context, which indicates that the future context is more crucial to the performance of our online models. Secondly, comparing lines 5-7 in Table 2, we found that increasing the length of the historical context was effective when we intended to reduce the latency of the state reuse chunk-SAE while maintaining the recognition accuracy. Thirdly, comparing lines 7 and 8 in Table 2, we found that the CER decreased when we increased the length of the central context.
Table 3: Comparison with other published ASR models on HKUST.

| Model | CER (%) |
| --- | --- |
| TDNN-hybrid, lattice-free MMI [25] | 23.69 |
| Offline Self-attention Aligner [26] | 24.12 |
| Online Self-attention Aligner (no speed perturb) [26] | 26.52 |
| Offline BLSTM-based CTC/attention model [10] | 27.43 |
| Online LC-BLSTM-based CTC/attention model [10] | 27.84 |
| Online Transformer-based CTC/attention model (ours) | 23.66 |
Finally, our best online model, with a 640 ms latency, achieved a lower CER, with only a small absolute CER degradation compared with the offline baseline in line 1 of Table 2. In Table 3, we also compare our online Transformer-based model with other published ASR models. For a fair comparison, the latency of the online E2E models listed in Table 3 is 320 ms. We can see that our proposed online model, with only 31M parameters, achieved better results.
6 Conclusions

In this paper, we propose a Transformer-based online E2E ASR model, which consists of the state reuse chunk-SAE and the MTA based SAD, and integrate it into the CTC/attention ASR architecture. Compared with the simple chunk-SAE, the state reuse chunk-SAE performs better and requires less computational cost, because it obtains broader historical context by storing the states of previous chunks. Compared with the SAD, the MTA based SAD truncates the SAE outputs in a monotonic left-to-right way and performs attention on the truncated SAE outputs, making it applicable to online recognition. We evaluated the proposed Transformer-based online CTC/attention E2E models on HKUST and achieved a 23.66% CER with a 320 ms latency, which outperforms our prior LSTM-based online E2E models. In the future, we plan to adopt a teacher-student learning approach to further reduce the model latency.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006, ICML ’06, pp. 369–376, ACM.
-  J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, 2015, NIPS’15, pp. 577–585, MIT Press.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 4960–4964.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
-  C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 4774–4778.
-  S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec 2017.
-  T. Hori, S. Watanabe, and J. Hershey, “Joint CTC/attention decoding for end-to-end speech recognition,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, July 2017, pp. 518–529, Association for Computational Linguistics.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
-  K. Kawakami, Supervised sequence labelling with recurrent neural networks, Ph.D. thesis, Technical University of Munich, 2008.
-  H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, “Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 2623–2627.
-  H. Miao, G. Cheng, P. Zhang, and Y. Yan, “Online hybrid CTC/attention end-to-end automatic speech recognition architecture,” submitted for journal publication.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, USA, 2017, NIPS’17, pp. 6000–6010, Curran Associates Inc.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: A norecurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
-  S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, “Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,” in Proc. Interspeech 2019, 2019, pp. 1408–1412.
-  N. Pham, T. Nguyen, J. Niehues, M. Müller, and A. Waibel, “Very Deep Self-Attention Networks for End-to-End Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 66–70.
-  S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. Yalta, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, “A comparative study on transformer vs rnn in speech applications,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
-  Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” CoRR, vol. abs/1901.02860, 2019.
-  C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
-  L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
-  Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, “HKUST/MTS: A very large scale mandarin telephone speech corpus,” in International Conference on Chinese Spoken Language Processing, 2006.
-  S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, and N. Chen, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in INTERSPEECH, 2016.
-  L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 5656–5660.