A conventional hybrid automatic speech recognition (ASR) system usually consists of an acoustic model, a lexicon and a language model that are built and optimized separately. During decoding, these components are compiled into a weighted finite-state transducer. In contrast, an ASR system based on a sequence-to-sequence model can decode the input speech directly into text with a single neural network, without a finite-state transducer or a lexicon. Sequence-to-sequence models are therefore becoming increasingly popular in ASR. Compared with conventional hybrid systems with separate acoustic models, lexicons and language models, a sequence-to-sequence model can be trained in a much simpler way. In addition, it allows joint optimization of all the components and yields a more compact model.
There have been a variety of sequence-to-sequence models, such as the "Listen, Attend and Spell" (LAS) model and the recurrent neural network transducer (RNN-T) [8, 7, 9]. Generally, a sequence-to-sequence model processes an input sequence with an encoder recurrent neural network (RNN) to produce a sequence of hidden states. A decoder RNN is then used to autoregressively produce the output sequence. Unlike conventional ASR, no frame-level alignment is needed during training. Recent results show that the LAS model, which incorporates an attention mechanism between the encoder and the decoder, achieves performance comparable to conventional hybrid systems [4, 14]. However, because LAS attends to all hidden states at every output timestep, it must wait until the whole input sequence has been processed before producing any output, which makes it inapplicable to online speech recognition. Although an RNN-T model can perform streaming recognition, its performance still lags behind that of a large hybrid model.
To make the LAS model suitable for online decoding, monotonic attention has been proposed. However, the monotonic constraint also limits the expressivity of the model. Therefore, Chiu and Raffel proposed monotonic chunk-wise attention (MoChA). Unfortunately, experiments show that there is still a gap between MoChA and a vanilla LAS model. In this paper, we aim to improve the MoChA-based streaming model, especially for online large vocabulary continuous speech recognition (LVCSR).
Specifically, in this work we first propose a weight-sharing multi-head MoChA (MTH-MoChA) mechanism to improve model expressivity. The proposed MTH-MoChA model incorporates a multi-head monotonic attention mechanism, and the parameters of the different attention heads are shared to improve model robustness. A variety of optimization training strategies are also used to obtain further improvement on top of the proposed model structure. Firstly, to reduce the computational complexity, we use pooling between LSTM layers to reduce the length of the sequence. Secondly, SpecAugment and label smoothing are adopted during model training. Finally, discriminative minimum word error rate (MWER) training, which is more closely related to the word error rate, is applied for further improvement. The effectiveness of the above structure and training strategies is demonstrated on both an open-source corpus with limited training data and an in-car data set with 18,000 hours of transcribed speech. In the experiments on the 18,000-hour in-car speech assistant data set, the MTH-MoChA model obtains a 7.28% character error rate (CER), which is much better than a chain model trained with the lattice-free maximum mutual information (LF-MMI) criterion.
The rest of the paper is organized as follows. Section 2 describes the details of our proposed methods and training strategies adopted in this paper. Experiments and results are presented in Section 3, followed by a conclusion in Section 4.
2 Method details
In this section, we first briefly introduce the MoChA model. The proposed MTH-MoChA, as well as various optimization training strategies, are then described in detail.
2.1 Monotonic Chunk-wise Attention (MoChA)
Monotonic attention explicitly enforces a hard monotonic input-output alignment. Monotonic chunkwise attention (MoChA) relaxes the hard monotonicity constraint and increases the flexibility of the model. MoChA splits the input sequence into small chunks over which the attention is computed. Specifically, given the decoder's state $s_{i-1}$ at output timestep $i$ and the sequence of the encoder's hidden states $h = (h_1, \dots, h_T)$, the "energy" used to calculate the attention context is defined as
$$e_{i,j} = g \frac{v^{\top}}{\lVert v \rVert} \tanh\left(W_s s_{i-1} + W_h h_j + b\right) + r, \qquad (2)$$
where $W_s$, $W_h$, $v$, $b$, $g$ and $r$ are learnable parameters and $d$ is a hyperparameter (i.e. the hidden dimensionality of the energy function).
Then, the energy scalars $e_{i,j}$ for $j = t_{i-1}, t_{i-1}+1, \dots$, where $t_i$ is the index of the hidden state chosen at output timestep $i$, are passed to a logistic sigmoid function to produce the "selection probabilities":
$$p_{i,j} = \sigma\left(e_{i,j} + \epsilon\right), \qquad \epsilon \sim \mathcal{N}(0, 1).$$
Adding zero-mean, unit-variance Gaussian noise to the logistic sigmoid function's activations during training, as is common practice, causes the model to produce effectively binary $p_{i,j}$. During decoding, $z_{i,j} = \mathbb{1}(p_{i,j} \geq 0.5)$ is then used to select the hidden state $h_{t_i}$ that the context attends to. The model is trained with respect to the expected value of the context vector $c_i$.
In monotonic attention, the decoder can only attend to a single entry $h_{t_i}$ of the hidden states at each output timestep. In contrast, MoChA allows the model to perform soft attention over a length-$w$ window of hidden states preceding and including $h_{t_i}$:
$$u_{i,k} = \text{ChunkEnergy}\left(s_{i-1}, h_k\right), \qquad \beta_{i,k} = \frac{\exp(u_{i,k})}{\sum_{l=t_i-w+1}^{t_i} \exp(u_{i,l})}, \qquad c_i = \sum_{k=t_i-w+1}^{t_i} \beta_{i,k} h_k.$$
Similar to monotonic attention, MoChA can be trained using the expected value of $c_i$.
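To make the decoding procedure concrete, the following is a minimal NumPy sketch of one MoChA decoding step in the hard monotonic (inference) mode. The parameter names and the dictionary layout are illustrative assumptions, not taken from any released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(s, h_j, p):
    """Energy of Eq. (2): g * (v/||v||) . tanh(W_s s + W_h h_j + b) + r."""
    return p["g"] * (p["v"] / np.linalg.norm(p["v"])) @ np.tanh(
        p["W_s"] @ s + p["W_h"] @ h_j + p["b"]) + p["r"]

def mocha_step(s_prev, h, t_prev, mono_p, chunk_p, w=2):
    """One MoChA decoding step. h: (T, d) encoder states; s_prev: previous
    decoder state; t_prev: index selected at the previous output step.
    mono_p / chunk_p hold the monotonic- and chunk-energy parameters."""
    T = len(h)
    for j in range(t_prev, T):
        # Hard selection z_{i,j} = 1(p_{i,j} >= 0.5), no noise at test time.
        if sigmoid(energy(s_prev, h[j], mono_p)) >= 0.5:
            lo = max(0, j - w + 1)  # length-w chunk ending at t_i = j
            u = np.array([energy(s_prev, h[k], chunk_p)
                          for k in range(lo, j + 1)])
            beta = np.exp(u - u.max())
            beta /= beta.sum()          # softmax over the chunk
            return beta @ h[lo:j + 1], j  # context c_i, selected index t_i
    return np.zeros_like(h[0]), T       # nothing selected: zero context
```

During training, the hard selection above is replaced by its expectation so that the model remains differentiable.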
2.2 Multi-head MoChA
The main difference between MoChA and the proposed multi-head MoChA (MTH-MoChA) lies in the monotonic energy calculation. As can be seen in Eq. (2), $e_{i,j}$ represents the correlation between $s_{i-1}$ and $h_j$. However, the relationship between two high-dimensional vectors is very complex, and Eq. (2) is not able to capture it sufficiently. Therefore, we propose to split $s_{i-1}$ and $h_j$ into $K$ heads and calculate the energies between the corresponding heads in order to recover the complex dependency between $s_{i-1}$ and $h_j$. Formally, we have
$$s_{i-1} = [s_{i-1}^1; s_{i-1}^2; \dots; s_{i-1}^K], \qquad h_j = [h_j^1; h_j^2; \dots; h_j^K].$$
For the $k$'th head, the monotonic energy can be calculated as
$$e_{i,j}^k = g \frac{v^{\top}}{\lVert v \rVert} \tanh\left(W_s s_{i-1}^k + W_h h_j^k + b\right) + r,$$
where $W_s$, $W_h$, $v$, $b$, $g$ and $r$ are learnable parameters shared among the different heads. On the one hand, sharing weights among the heads reduces the number of parameters; on the other hand, it also helps share information among the heads. Because each head calculates a different attention context vector, the proposed MTH-MoChA method can utilize more context information than MoChA.
Finally, a context vector $c_i^k$ is obtained for each head as in MoChA. The final context vector is then obtained by averaging the per-head vectors:
$$c_i = \frac{1}{K} \sum_{k=1}^{K} c_i^k.$$
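The head splitting, weight sharing and context averaging can be sketched as follows. This is a simplified soft-attention version that omits the hard monotonic selection, and all parameter names are illustrative assumptions:

```python
import numpy as np

def head_energy(sk, hk, p):
    # Shared-weight energy for one head, same form as Eq. (2); the single
    # parameter set p (sized for one head) scores every head.
    return p["g"] * (p["v"] / np.linalg.norm(p["v"])) @ np.tanh(
        p["W_s"] @ sk + p["W_h"] @ hk + p["b"]) + p["r"]

def mth_context(s, H, K, p):
    """Per-head attention contexts averaged into the final context.
    s: decoder state (d,); H: encoder states (T, d); K: number of heads."""
    T, d = H.shape
    s_heads = np.split(s, K)            # K slices of size d/K
    H_heads = np.split(H, K, axis=1)    # matching slices of the encoder states
    contexts = []
    for sk, Hk in zip(s_heads, H_heads):
        e = np.array([head_energy(sk, Hk[j], p) for j in range(T)])
        a = np.exp(e - e.max())
        a /= a.sum()                    # softmax attention weights per head
        contexts.append(a @ Hk)         # per-head context c_i^k
    return np.mean(contexts, axis=0)    # average over the K heads
```

With 4 heads and a 1024-dimensional state, each head operates on a 256-dimensional slice, matching the configuration used in our experiments.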
2.3 Optimization strategies
As will be seen in our experiments, we use various optimization training strategies to further improve the model.
2.3.1 SpecAugment
SpecAugment is a simple but powerful data augmentation method for speech recognition. Attention-based sequence-to-sequence models such as LAS and the Transformer are prone to over-fitting. Augmentation helps convert an over-fitting problem into an under-fitting problem, and model performance improves greatly after applying SpecAugment.
Generally, the SpecAugment policy consists of time warping, time masking and frequency masking operations. Apart from the warping step, which is less important than the other operations, SpecAugment can be regarded as a special block-based dropout layer.
To perform time masking, the length $t$ of the masked block is first drawn uniformly from $[0, T_{max}]$, where $T_{max}$ is the maximum time masking size. The starting position $t_0$ is then chosen from $[0, L - t]$, where $L$ is the length of the input sequence. Finally, the data between $[t_0, t_0 + t)$ is dropped. A similar operation is applied along the frequency axis for frequency masking.
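The masking policy above can be sketched as follows; the function and its parameter names are illustrative, not the SpecAugment reference implementation (the defaults mirror the settings used in our experiments):

```python
import numpy as np

def spec_augment(spec, max_t=40, max_f=27, max_t_ratio=0.2, rng=None):
    """Time and frequency masking on a (T, F) log-mel spectrogram.
    Time warping is omitted, as in our setup."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    T, F = out.shape
    # Time masking: block length t ~ U[0, max_t], capped at a ratio of T.
    t = min(int(rng.integers(0, max_t + 1)), int(max_t_ratio * T))
    t0 = int(rng.integers(0, max(T - t, 1)))    # start position in [0, T - t]
    out[t0:t0 + t, :] = 0.0
    # Frequency masking: the same scheme along the frequency axis.
    f = int(rng.integers(0, max_f + 1))
    f0 = int(rng.integers(0, max(F - f, 1)))
    out[:, f0:f0 + f] = 0.0
    return out
```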
2.3.2 Pooling between LSTMs
In ASR, the length of the input feature sequence is much larger than that of the output word sequence. To reduce the length of the input sequence and improve computational efficiency, max-pooling along time between LSTM outputs is adopted. Pooling between LSTMs significantly reduces the computational complexity, and it can also improve the model's performance because it removes redundant information from the hidden sequences.
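A minimal sketch of max-pooling along time over a (T, d) sequence of LSTM outputs is given below; this is illustrative, not the Lingvo implementation:

```python
import numpy as np

def max_pool_time(x, width=2):
    """Max-pool a (T, d) hidden-state sequence along the time axis,
    halving the sequence length for width=2. A trailing remainder
    frame that does not fill a full window is kept via its own max."""
    T, d = x.shape
    n = T // width * width                       # frames in full windows
    pooled = x[:n].reshape(-1, width, d).max(axis=1)
    if n < T:                                    # leftover frames, if any
        pooled = np.vstack([pooled, x[n:].max(axis=0, keepdims=True)])
    return pooled
```

Inserting this between encoder layers (after the 2nd and 4th LSTM in our setup) shortens the sequences the upper layers and the attention must process.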
2.3.3 Minimum Word Error Rate (MWER) Training
Although cross-entropy (CE) training is effective, it is not closely related to the metric that we really care about, namely the word error rate. The MWER training criterion instead directly minimizes the expected number of word errors. The loss function of MWER is as follows:
$$\mathcal{L}_{\text{MWER}} = \sum_{y_i \in \text{NBest}(x)} \hat{P}(y_i \mid x)\,\big(W(y_i, y^*) - \hat{W}\big),$$
where $\text{NBest}(x)$ is the N-best hypothesis list for input $x$, $\hat{P}(y_i \mid x)$ is the model's probability renormalized over the N-best list, $W(y_i, y^*)$ is the number of word errors of hypothesis $y_i$ with respect to the reference $y^*$, and $\hat{W}$ is the average number of word errors over the N-best list.
The MWER loss is interpolated with the standard cross-entropy loss to stabilize training. When characters are used as the modeling units, MWER is equivalent to minimum character error rate (MCER) training.
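The expected-error computation over an N-best list can be sketched as follows; the function name and the choice of the unweighted N-best average as the baseline $\hat{W}$ follow the description above, and the rest is an illustrative assumption:

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """Expected relative word errors over an N-best list.
    log_probs: model scores log P(y_i | x) of the N hypotheses.
    word_errors: edit distances W(y_i, y*) against the reference."""
    log_probs = np.asarray(log_probs, dtype=float)
    w = np.asarray(word_errors, dtype=float)
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()                    # P_hat(y_i | x): renormalized over N-best
    w_hat = w.mean()                # baseline: average N-best word errors
    return float(p @ (w - w_hat))   # sum_i P_hat_i * (W_i - W_hat)
```

In practice this term is interpolated with the cross-entropy loss, and the gradient pushes probability mass toward hypotheses with fewer errors than the N-best average.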
3 Experimental setup and results
3.1 Data sets
We conducted experiments on three Mandarin data sets with different recording settings and sizes to evaluate the effectiveness of the proposed MTH-MoChA model as well as the other training strategies. Table 1 summarizes the key information about the three data sets.
AISHELL-1 is an open-source corpus recorded with high-fidelity microphones in quiet environments such as normal living rooms and recording studios. The training set, about 150 hours in total, contains 120,098 utterances from 340 speakers. The text contents are chosen from 11 domains, including finance, science and sports. We used the standard development and test sets of AISHELL-1 to tune and evaluate all the models.
The InCar2000 and InCar18000 data sets, with about 2,000 and 18,000 hours of speech respectively, are collected from Tencent's in-car speech assistant products. The speech contents include enquiries, navigation commands and conversations. The average number of characters per utterance is smaller than in the AISHELL-1 data set. All the data are anonymized and hand-transcribed. The development and test sets contain 4,998 and 4,271 recordings respectively.
3.2 Model details
We followed the model configurations described in . The encoder consists of 2 convolutional layers with 32 channels and a stride of 2 each, followed by 4 unidirectional LSTM layers with 1024 units. The two convolutional layers together result in a sub-sampling factor of 4 in time. Batch normalization is applied after each CNN and LSTM layer. For the LSTM pooling operation, a pooling layer is inserted after the 2nd and the 4th LSTM layer of the encoder to perform max-pooling along time with width 2.
The decoder consists of 2 unidirectional LSTM layers with 1024 units. The implementation of the attention follows that in . The attention context vector is fed into every layer of the decoder. For MoChA, we use chunk-wise attention with a hidden dimension of 1024. For MTH-MoChA, 4 heads are used in all experiments, so the hidden dimension of each head is 256.
To avoid model over-fitting, label smoothing with an uncertainty probability of 0.1 is used. For SpecAugment, no time warping is used. The maximum masking block sizes for frequency and time masking are 27 and 40 respectively. In addition, the time masking block size is limited to at most 20% of the total sequence length.
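For reference, label smoothing with uncertainty eps spreads that probability mass uniformly over the vocabulary; a minimal sketch (illustrative, not the Lingvo implementation) is:

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """One-hot targets with label smoothing: each row keeps (1 - eps) on
    the true class plus a uniform eps / num_classes over all classes,
    so every row still sums to 1."""
    onehot = np.eye(num_classes)[targets]
    return onehot * (1.0 - eps) + eps / num_classes
```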
80-dimensional log-mel features, computed every 10 ms with a 25 ms window, are used as the input. Since the in-car data sets contain some English words (e.g. English song names in the enquiries), both Chinese and English characters are adopted as modeling units. All the experiments are conducted with the Lingvo toolkit.
We also conducted experiments with state-of-the-art hybrid models on the in-car data sets for comparison. TDNN-OPGRU models were trained with the LF-MMI criterion using Kaldi, following the model structure and training scripts described in . An 18 GB 5-gram language model was used for the TDNN-OPGRU models, whereas no language model was used to rescore the output of the sequence-to-sequence models.
3.3 Results on Aishell-1
We first conducted experiments on AISHELL-1. Table 2 gives the results of the baseline LAS model, the vanilla MoChA model and the proposed MTH-MoChA model, together with results on this data set reported in other work. In Table 2, label smoothing and SpecAugment are used for all end-to-end models. The vanilla non-streaming LAS model obtains a 7.51% character error rate (CER) on the development set and 9.29% CER on the test set. The performance of the online MoChA model is comparable with that of the LAS model.
The proposed MTH-MoChA performs slightly better than MoChA on this data set. In addition, the optimization training strategies described in Section 2.3 significantly improve the model performance. The best sequence-to-sequence model achieves 7.68% CER on the test set, which is comparable to a strong hybrid model reported in .
3.4 Results on both InCar2000 and InCar18000 corpora
To further verify the effectiveness of the MTH-MoChA model and the other optimization training strategies for online attention models, we conducted experiments on two larger data sets, InCar2000 and InCar18000. Table 3 shows the results. TDNN-OPGRU is a hybrid model trained with SpecAugment and LF-MMI. From the results on InCar2000, we find that MoChA performs much worse than the vanilla LAS model. A possible reason is that most recordings in the in-car corpus are speech enquiries, whose average number of characters is much smaller, so the average duration of the utterance segments is shorter. Unlike LAS, MoChA cannot use the whole audio segment to calculate the attention context, so the information available for attention context calculation is much more limited, especially for short recordings.
On the InCar2000 data set, MTH-MoChA obtains 10.42% and 15.30% CER on the development and test sets respectively, improving over MoChA by 21.62% relative on the test data. When combined with the other optimization training strategies, we obtain another 9.09% relative reduction in CER. We did not train the LAS and MoChA models on the InCar18000 data set because of the long training time. As can be seen from Table 3, the proposed MTH-MoChA model significantly outperforms the TDNN-OPGRU model trained with the LF-MMI criterion on the InCar18000 data set.
4 Conclusion
In this paper, we propose multi-head monotonic chunkwise attention (MTH-MoChA) for online large vocabulary speech recognition. MTH-MoChA splits the input sequence into small chunks and computes a multi-head attention context over the chunks. Our experiments show that MTH-MoChA outperforms the original MoChA model. In addition, the other optimization training strategies are necessary to push the performance of MTH-MoChA to the state of the art. When a large amount of training data is available, MTH-MoChA, without any additional language model, significantly outperforms the best conventional hybrid system.
In the future, we would like to investigate other training strategies, such as scheduled sampling (SS), focal loss (FL), CTC joint training and language model (LM) rescoring, to further improve the performance of the proposed model.
-  (2019) Learn spelling from teachers: transferring knowledge from language models to sequence-to-sequence speech recognition. arXiv:1907.06017. Cited by: Table 2.
-  Output-gate projected gated recurrent unit for speech recognition. In Interspeech, pp. 1793–1797. Cited by: §3.2.
-  (2018) Monotonic chunkwise attention. In International Conference on Learning Representations. Cited by: §1, §2.1.
-  (2018) State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP, pp. 4774–4778. Cited by: §1, §4.
-  (2018) Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin. In Interspeech, pp. 816–820. Cited by: §1, §2.3.2.
-  (2013) Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649. Cited by: §1.
-  (2012) Sequence transduction with recurrent neural networks. arXiv:1211.3711. Cited by: §1.
-  (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP, pp. 6381–6385. Cited by: §1.
-  (2017) AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Cited by: §3.3, Table 2.
-  (2019) Model unit exploration for sequence-to-sequence speech recognition. arXiv:1902.01955. Cited by: §3.2.
-  (2019) A comparative study on transformer vs rnn in speech applications. arXiv:1909.06317. Cited by: Table 2, §4.
-  (2019) The speechtransformer for large-scale mandarin chinese speech recognition. In ICASSP, pp. 7095–7099. Cited by: §4.
-  (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv:1904.08779. Cited by: §1, §1, §2.3.1.
-  (2016) Purely sequence-trained neural networks for asr based on lattice-free mmi.. In Interspeech, pp. 2751–2755. Cited by: §1.
-  (2018) Minimum word error rate training for attention-based sequence-to-sequence models. In ICASSP, pp. 4839–4843. Cited by: §1, §2.3.3.
-  (2016) Lower frame rate neural network acoustic models. In Interspeech, pp. 22–26. Cited by: §1.
-  (2017) Online and linear-time attention by enforcing monotonic alignments. In International Conference on Machine Learning, pp. 2837–2846. Cited by: §1, §2.1.
-  (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv:1902.08295. Cited by: §3.2.
-  (2019) Adversarial regularization for attention based end-to-end robust speech recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1826–1838. Cited by: Table 2.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1, §3.2.
-  (2017) Sequence-to-sequence models can directly translate foreign speech. In Interspeech, pp. 2625–2629. Cited by: §3.2.