End-to-end models [1, 2, 3, 4, 5] have greatly simplified the automatic speech recognition (ASR) system by combining an acoustic model, a language model and an acoustic-to-text alignment mechanism (e.g. attention [2, 3, 4], CTC-like [1, 5]) in a unified neural network. As a common component of these models, recurrent neural networks (RNNs) [6, 7, 8] have demonstrated their sequential modelling power both in capturing acoustic dependencies (acoustic modelling) and in recurrently emitting text units (language modelling). However, RNNs may generate “confusing” internal states (memory) after passing through noisy segments (e.g. long silences or noise in speech utterances). Besides, the sequential nature of RNNs leads to low parallelization and slow computation. These shortcomings may restrict the performance and efficiency of RNN-based end-to-end models in ASR tasks.
Recently, an attention-based feedforward neural network, called the self-attention network, has shown promising performance in a variety of NLP tasks, including neural machine translation [9], reading comprehension [10], etc. This network captures positional dependencies of a sequence by computing pairwise attention weights, which can be small for unrelated (e.g. noisy) positions so that they are bypassed; it may therefore leverage the related context information more effectively. In addition, the self-attention network computes in a fully feedforward manner and is thus highly parallelizable. These advantages make it a potential alternative to RNNs.
However, replacing RNNs with self-attention networks for end-to-end modelling in ASR raises some challenges. Firstly, speech sequences often contain hundreds or even more than one thousand frames, and it is not clear how the self-attention network can best encode over such a long range. Secondly, the self-attention network decodes output units in an auto-regressive manner, and it is unclear whether it can be effectively combined with CTC-like alignments or an extra language model (LM). Thirdly, the self-attention network relates all pairwise positions in a sequence, which means the entire utterance must be available first, making online recognition difficult. To explore these challenges, we introduce self-attention networks into a simplified recurrent neural aligner (RNA) framework [11]. Our contributions are as follows:
We construct an encoder that relies only on self-attention and shallow convolutional networks. Pooling layers between the self-attention networks, together with the front-end strided convolutions, offer effective temporal down-sampling for speech utterances. A 5.5% relative character error rate (CER) reduction on the HKUST dataset demonstrates the superiority of our encoder over a strong RNN-based encoder.
We present a self-attention decoder, which emits output units in an auto-regressive manner. It works well with the CTC-like alignment and provides a 2.4% relative CER reduction over an RNN-based decoder.
We combine the proposed encoder and decoder for end-to-end training, and term the integrated model the self-attention aligner (SAA). We find that the SAA model performs competitively on two Mandarin ASR datasets. Moreover, after joint training with a self-attention network LM, it obtains further performance benefits.
We propose a chunk-hopping mechanism, which enables the SAA to support online speech recognition. Results show the chunk-hopping allows the SAA model to have only a 2.5% relative CER degradation with a 320ms latency, increasing the diversity of application scenarios for the SAA model.
2 Relations to prior work
Self-attention networks have been applied to ASR in several prior works [12, 13, 14, 15]. Povey et al. [12] proposed a time-restricted self-attention layer, which improves the performance of the LF-MMI model when combined with the TDNN or TDNN-LSTM structure. In [13, 14], the self-attention network is used within the transformer framework, which relies entirely on the attention mechanism and transcribes speech utterances in a sequence-to-sequence manner. Sperber et al. [15] applied the self-attention network to the encoder of the LAS model [2] and proposed several improvements for effective acoustic modelling with self-attention.
In this paper, we aim to explore the combination of self-attention networks with the CTC-like alignment mechanism, which differs from the HMM alignment in [12] and the attention alignment in [13, 14, 15]. The CTC-like alignment mechanism offers the potential for online recognition, which the above attention-based models lack. Besides, we use multiple vanilla self-attention networks as in [9], in contrast to the single time-restricted self-attention layer in [12], which is placed towards the end of the TDNN or TDNN-LSTM and provides latency control by attending to limited future high-level context. Our model instead attends to segmented chunks of input frames one after another, thus controlling the latency more directly and independently of the configuration of the neural networks used.
3 Self-attention Network
Self-attention is an attention mechanism that computes the representation of a single sequence by relating different positions in it. In this work, we employ the scaled dot-product self-attention of the transformer [9] and use its encoder block as the self-attention network (SAN), which contains two sub-networks: a multi-head self-attention network and a position-wise feed-forward network. In addition, layer normalization, dropout and residual connections are introduced in the SAN for effective training.
Let the input be a sequence of T vectors, where T is the sequence length and each vector has the hidden size of the SAN. The first sub-network is the multi-head self-attention network and the second is the position-wise feed-forward network; each has its own input and output. The computation of the SAN is formulated as follows:
Here, h is the number of heads in the multi-head self-attention network, which jointly attends to information from different subspaces given by the query, key and value projections of each head. Each head relates different positions by computing pairwise dot-product values, which are scaled and then added to a bias term that shapes the attention pattern. Since the speech signal is consecutive, in this work we encourage attention to closer positions by adding a proximity bias to each position pair according to the distance between the two positions. The dropout that appears in equations (4) and (6) is the residual dropout. Another dropout, called the attention dropout, is applied to the softmax weights in equation (2) but is not shown in the equation.
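As an illustration of the biased attention described above, the following is a minimal single-head sketch in plain Python. The bias form -log(1 + |i - j|), the function names, and the omission of the learned projections, layer normalization and dropout are assumptions made for brevity, not details confirmed by this text.

```python
import math


def proximity_bias(length):
    """Additive bias that penalises distant position pairs.

    The form -log(1 + |i - j|) is an assumption for illustration.
    """
    return [[-math.log1p(abs(i - j)) for j in range(length)] for i in range(length)]


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, bias):
    """Single-head scaled dot-product attention with a bias added to the logits."""
    d_k = len(keys[0])
    weights = []
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k) + bias[i][j]
            for j, key in enumerate(keys)
        ]
        weights.append(softmax(logits))
    width = len(values[0])
    outputs = [
        [sum(w * value[c] for w, value in zip(row, values)) for c in range(width)]
        for row in weights
    ]
    return outputs, weights


# Four identical frames: without a bias the attention would be uniform,
# so any preference for nearby positions comes from the proximity bias alone.
frames = [[1.0, 0.0]] * 4
outputs, weights = self_attention(frames, frames, frames, proximity_bias(4))
```

In this toy input every position attends most strongly to itself and progressively less to more distant positions, which is the behaviour the proximity bias is meant to encourage.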
4 Model Architecture
4.1 Self-attention Aligner
The self-attention aligner (SAA) is an RNN-free end-to-end model that contains three parts: an encoder, a decoder and a CTC-like alignment mechanism. The encoder transforms speech features, which can also be viewed as 2-dimensional spectrograms, into high-level acoustic representations. Then, together with the CTC-like alignment mechanism, the decoder learns to predict output units (e.g. characters, word pieces) by leveraging the encoded acoustic information and the previously decoded outputs. The detailed model architecture is illustrated in figure 1.
The encoder, shown in the left half of figure 1, transforms speech sequences using only self-attention and shallow convolutional networks. The convolutional front-end follows a previously proposed structure: it uses a strided convolutional layer to offer translational invariance while halving the sequence length, and a multiplicative unit (MU) to further capture distinguishable acoustic details. Its 2-dimensional outputs are then flattened and projected to representations with the hidden size of the SANs, which serve as the input of the self-attention encoder, a stack of self-attention networks (SANs). Since the proximity bias in the SANs already provides relative position information, we abandon the sinusoidal position encoding of [9]. Between the stacked SANs, we place temporal pooling layers to conduct down-sampling. The motivation is twofold: (1) it encourages effective encoding at different temporal resolutions; (2) it further shortens the acoustic representations, thus promoting faster alignment during decoding. After the entire encoding, the acoustic representations are obtained.
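To make the effect of the temporal down-sampling concrete, the sketch below pools frames over time and tracks how the sequence length shrinks. Max pooling, a pooling width of 2, two pooling layers and the function names are all assumptions chosen for illustration; the text only states that the strided convolution halves the length and that pooling layers sit between the stacked SANs.

```python
import math


def temporal_max_pool(frames, width=2):
    """Max-pool feature vectors over non-overlapping windows along the time axis."""
    return [
        [max(column) for column in zip(*frames[start:start + width])]
        for start in range(0, len(frames), width)
    ]


def length_after(num_frames, conv_stride=2, num_poolings=2, pool_width=2):
    """Sequence length after the strided convolution and the pooling layers."""
    length = math.ceil(num_frames / conv_stride)
    for _ in range(num_poolings):
        length = math.ceil(length / pool_width)
    return length


pooled = temporal_max_pool([[1, 5], [3, 2], [4, 4]])
shortened = length_after(1000)
```

Under these assumed settings, a 1000-frame utterance would reach the decoder as 125 acoustic representations, which is why fewer alignment steps are needed during decoding.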
The decoder, illustrated in the right half of figure 1, is also built from stacked SANs. Unlike in the encoder, the SANs in the decoder are computed in an auto-regressive manner, which restricts each position to attend to positions up to and including itself. Thus, at each step, we cache the computed self-attention states of all heads for the dependency modelling of later positions. Additionally, in order to make full use of the acoustic information during decoding, we concatenate the acoustic representation with the embedding of the previously predicted label as the input of the self-attention decoder. We also concatenate the acoustic representation with the output of the SANs and project the result to logits of size L + 1, where L is the number of real output labels and the extra one is the blank label, which is used for the acoustic-to-text alignment.
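The caching of self-attention states during auto-regressive decoding can be sketched as follows. This is a simplified single-head illustration in plain Python: the class and function names are hypothetical, and the learned projections, multiple heads and the concatenation with acoustic representations are omitted.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the cached positions."""
    d_k = len(query)
    logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k) for key in keys]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    width = len(values[0])
    return [sum(e / total * value[c] for e, value in zip(exps, values)) for c in range(width)]


class CachedSelfAttention:
    """Keeps the keys and values of earlier steps so each new step attends
    only to positions up to and including itself."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the current position before attending, so it can see itself.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


layer = CachedSelfAttention()
first = layer.step([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
second = layer.step([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
```

Because only the cache grows from step to step, earlier positions never need to be recomputed, and no position can attend to a later one.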
The alignment mechanism aims to find an alignment between the acoustic representations and the target sequence. Here, we utilize a simplified RNA alignment mechanism, in which the conditional distribution of each alignment label depends on the acoustic representations and on the label with the maximum probability at the previous step. This mechanism simplifies the computation of the RNA decoder [5] while keeping the computation consistent between training and inference. The loss function to be minimized is the negative log-probability of the target sequence, obtained by summing over all alignments that a “CTC-like mapping function” maps to it; this function maps an alignment to the corresponding target sequence by just removing the blank labels.
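The mapping described above can be sketched in a few lines of Python. The function name and the `<blank>` token string are hypothetical; the behaviour follows the description in the text, namely that only blank labels are removed.

```python
BLANK = "<blank>"


def ctc_like_map(alignment, blank=BLANK):
    """Map an alignment to its label sequence by removing blank labels only."""
    return [label for label in alignment if label != blank]


alignment = [BLANK, "a", BLANK, "b", "b", BLANK]
labels = ctc_like_map(alignment)
```

Note that, unlike the standard CTC collapsing rule, repeated labels are kept here, so the two consecutive "b" labels in the example both survive.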
4.2 Joint Training with a SAN-LM
We find that a self-attention network language model (SAN-LM) obtains better perplexity than a recurrent neural network language model (RNN-LM) at the character level (the details are in section 5.2). In order to leverage more effective language information, we combine the SAA model with a pre-trained SAN-LM, following an existing joint training mechanism.
At each step, the predicted label is used as the input to the SAA decoder and the SAN-LM to calculate the corresponding SAA state and LM state, respectively. However, the predicted label is likely to be the blank label, which is not seen in the training of the SAN-LM. Thus, a blank prediction is replaced before being fed to the SAN-LM, with <sos>, a special start label shared by all sentences, used where required. In the calculation of the LM state, we introduce a masking bias so that the SAN-LM attends only to positions whose original predicted label is non-blank. We also abandon the proximity bias in the SAN-LM, because the positional distances differ between the separate LM training and the joint training. An example of the above setting is illustrated in figure 2. After obtaining the LM state, we follow an existing fusion structure to get the logits, and only the fusion structure is optimized during the joint training.
4.3 Chunk-hopping Mechanisms
The calculation of SANs needs to relate all pairwise positions in a sequence, so the SAA model can only start recognizing after the entire utterance has been obtained. To address this problem, we propose a chunk-hopping mechanism, which enables the SAA to support online recognition by encoding segmented frame chunks sequentially (shown in figure 3).
We first segment the entire utterance into several overlapping chunks, each of which contains three parts. The current part is the one whose output is used as the output of the chunk; the past and future parts provide context for the calculation of the current part. After one chunk has been calculated, we hop to the next chunk, with a hop size equal to the size of the current part. For the chunks at the beginning and end of the utterance, zeros are padded to fill the missing context.
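The segmentation described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: frames are represented as plain numbers rather than feature vectors, the function names are hypothetical, and the past size is taken to be whatever remains of the chunk after the current (hop-sized) and future parts.

```python
import math


def chunk_hop(frames, chunk_size, hop_size, future_size, pad=0.0):
    """Split frames into overlapping chunks laid out as [past | current | future].

    The current part has hop_size frames; missing context at the beginning
    and end of the utterance is filled with `pad`.
    """
    past_size = chunk_size - hop_size - future_size
    if past_size < 0:
        raise ValueError("chunk_size must cover the current and future parts")
    n_chunks = math.ceil(len(frames) / hop_size)
    right_pad = n_chunks * hop_size + future_size - len(frames)
    padded = [pad] * past_size + list(frames) + [pad] * right_pad
    chunks = []
    for k in range(n_chunks):
        start = k * hop_size  # offset of chunk k inside the padded sequence
        chunks.append(padded[start:start + chunk_size])
    return chunks, past_size


def current_part(chunk, past_size, hop_size):
    """Select the part of a chunk whose output is kept."""
    return chunk[past_size:past_size + hop_size]


frames = list(range(1, 11))
chunks, past_size = chunk_hop(frames, chunk_size=8, hop_size=4, future_size=2, pad=0)
recovered = [f for chunk in chunks for f in current_part(chunk, past_size, 4)][:len(frames)]
```

Concatenating the current parts of all chunks covers every frame exactly once, so the encoder output has the same extent as in full-sequence encoding, while each chunk only has to wait for its own future part before it can be processed.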
5.1 Experimental Setups
We experiment with two Mandarin Chinese conversational telephone speech recognition (MTS) datasets: the Mandarin ASR benchmark (HKUST) [16] and a larger-scale dataset (CasiaMTS).
The HKUST dataset has 5413 utterances (5 hours) for evaluation. From the original training set of 197387 utterances (173 hours), we extract 6017 utterances (5 hours) as our development set and use the rest as our training set. The input features are 40-dimensional filterbanks extracted from a 25ms window shifted every 10ms, extended with delta and delta-delta, followed by per-speaker and global normalization. The output units comprise 3673 classes: 3642 Chinese characters, 26 lowercase letters, 3 special characters (noise etc.), the <sos> label and the blank label.
In the convolutional front-end, the filter number is set to 64, and layer normalization [17] is applied after the convolutions. The self-attention encoder and decoder use the same hidden size and head number, and the residual dropout and the attention dropout are set to 0.1. The inner size is set to 1280, except in the experiments on the augmented data, where it is set to 2560. A confidence penalty regularizer, weighted by a hyper-parameter, is also introduced. The SAN-LM used in our experiments contains 3 self-attention networks with the same dimension settings as the SAA, but with a larger dropout value of 0.2.
The RNN-based baseline is the Extended-RNA model [11], which uses a 4-layer bidirectional LSTM (BLSTM) [18] as the encoder and a 1-layer LSTM [19] as the decoder. The RNN-LM follows the setting of prior work and uses a 1-layer LSTM with 640 cells.
The CasiaMTS dataset has four representative test sets, which contain 1315, 967, 2280 and 17793 utterances, respectively. The development set contains 20000 utterances and the training set has 1109696 utterances (745 hours). The output units comprise 4622 classes: 4594 Chinese characters, 26 uppercase letters, the <sos> label and the blank label. We use the same SAA model as for the HKUST dataset, except that the output layer has 4622 units.
We first explore the effects of replacing RNNs with SANs. The corresponding results are shown in table 1, where n and k have the same meaning as in figure 1: n is the number of SANs at each temporal resolution of the encoder, and k is the number of SANs in the decoder. We find that replacing the LSTM-encoder and the LSTM-decoder of the RNN-based baseline with our SAN-encoder and SAN-decoder yields a 5.5% and a 2.4% relative character error rate (CER) reduction, respectively. We also find that the SAA model performs better as the number of SANs increases, but beyond a certain number of layers the improvements become limited or performance even degrades. In the rest of the paper, we use the best performing model in table 1, with n=5 and k=2, as the default SAA model.
|Model||n||k||CER|
|SAN-encoder + LSTM-decoder||3||-||27.25|
Besides the performance improvements, our SAA model also obtains speed improvements not only in the training stage but also in the inference stage (in table 2), showing the advantages of the replacement of RNNs by SANs.
|Model name||steps/sec (training)||utts/sec (inference)|
Then, we compare the performance of jointly training the SAA model with different language models (table 3), and find that the SAN-LM not only obtains lower perplexity but also provides more additional language information for the SAA model. The SAA combined with the SAN-LM is used for further comparison with other results.
|LM type||LM perplexity||CER|
Table 4 shows the investigation of the chunk-hopping mechanism. Except in row 8, the past and future parts contain the same number of frames. We first compare different chunk sizes (rows 2-4) under the same hop size of 32. In line with intuition, wider chunks give better results. Then, under the same chunk size, we compare different hop sizes (rows 4-7) and find that a hop size of 64 performs best, which underlines the importance of suitable context information. Next, we widen the chunk size to 192 to further explore the effect of changing the context, and find that widening the past and future parts at the same time (row 9) performs better than widening only the past part (row 8). Even so, the chunk-hopping setting in row 8 achieves a 26.52% CER with a latency of 320ms, a 2.47% relative degradation compared with the full-sequence result.
|chunk-hopping||chunk size||hop size||future size||CER|
|yes||32 frames||32 frames||0 frames||31.99|
|yes||64 frames||32 frames||16 frames||28.80|
|yes||128 frames||32 frames||48 frames||27.15|
|yes||128 frames||64 frames||32 frames||27.09|
|yes||128 frames||96 frames||16 frames||27.57|
|yes||128 frames||128 frames||0 frames||28.53|
|yes||192 frames||64 frames||32 frames||26.52|
|yes||192 frames||64 frames||64 frames||26.28|
Table 5 compares the SAA model with other published models [20, 11, 14, 21] on the HKUST dataset. To the best of our knowledge, the results of all compared models in table 5 are the latest. For a fair comparison, we also augment the training data by linearly scaling the audio lengths by factors of 0.9 and 1.1 (speed perturb), which brings a 0.8 absolute CER reduction. Finally, our SAA model obtains a 24.12% CER, which improves on the best end-to-end result, from the transformer, by over 2% absolute CER, but still leaves a small performance gap to the LF-MMI model, which uses the left-to-right alignment of the HMM rather than an end-to-end alignment.
Joint CTC-attention model / ESPNet (speed perturb)
|Extended-RNA (speed perturb) ||26.8|
|Transformer (speed perturb) ||26.6|
|TDNN-hybrid, lattice-free MMI (speed perturb) ||23.7|
|SAA model (speed perturb)||24.1|
In addition, we find that the SAA model obtains an 8.4%-10.2% CER reduction over its RNN-based baseline on the CasiaMTS dataset (table 6), further validating the effectiveness of the SAA.
6 Conclusion
In this work, we explore the replacement of RNNs by self-attention networks (SANs) in a simplified RNA framework. We find that the SANs can (1) effectively represent speech utterances with temporal down-sampling in the encoder; (2) work with the CTC-like alignment mechanism in the decoder. We term the resulting RNN-free model the self-attention aligner (SAA). Compared with an RNN-based baseline on two Mandarin conversational telephone ASR datasets, the SAA model (1) obtains an 8.4%-10.2% CER reduction; (2) achieves faster computation during training and inference; (3) supports latency-controlled recognition with little performance degradation. These advantages demonstrate the effectiveness of replacing RNNs with SANs in the ASR field.
References
- [1] Alex Graves, “Sequence transduction with recurrent neural networks,” Computer Science, vol. 58, no. 3, pp. 235–242, 2012.
- [2] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
- [3] Navdeep Jaitly, Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio, “An online sequence-to-sequence model using partial conditioning,” in Advances in Neural Information Processing Systems, 2016, pp. 5067–5075.
- [4] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, “Online and linear-time attention by enforcing monotonic alignments,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 2837–2846.
- [5] Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Proc. of Interspeech, 2017.
- [6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
- [7] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- [8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
- [9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
- [10] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le, “Qanet: Combining local convolution with global self-attention for reading comprehension,” arXiv preprint arXiv:1804.09541, 2018.
- [11] Linhao Dong, Shiyu Zhou, Wei Chen, and Bo Xu, “Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin,” arXiv preprint arXiv:1806.06342, 2018.
- [12] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for asr,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
- [13] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
- [14] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese,” arXiv preprint arXiv:1805.06239, 2018.
- [15] Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel, “Self-attentional acoustic models,” arXiv preprint arXiv:1803.09519, 2018.
- [16] Yi Liu, Pascale Fung, Yongsheng Yang, Christopher Cieri, Shudong Huang, and David Graff, “Hkust/mts: A very large scale mandarin telephone speech corpus,” in Chinese Spoken Language Processing, pp. 724–735. Springer, 2006.
- [17] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [18] Mike Schuster and Kuldip K Paliwal, Bidirectional recurrent neural networks, IEEE Press, 1997.
- [19] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [20] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
- [21] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in Interspeech, 2016, pp. 2751–2755.