Transformer models have been successfully applied to several NLP tasks and proven more effective than recurrent neural networks, e.g., LSTMs. The two key ingredients are sinusoidal positional encoding and the self-attention mechanism, which makes the input word embeddings context-aware. Recently, transformer models and their variants have also been actively investigated for speech recognition [2, 3, 4, 5]. To work well for ASR modeling, the transformer architecture requires some revisions; the key points have been summarized by previous work.
First, because a speech utterance typically lasts a couple of seconds while speech frames are extracted with a window of several milliseconds, downsampling the acoustic input frames to widen the temporal context is consistently beneficial. Architecture variants have also been explored that use either TDNN or BiLSTM front blocks to extract high-level features for the self-attention layers. Second, the transformer encoder is effective both with the CTC loss and within the Listen, Attend and Spell (LAS) framework: the CTC loss has been applied to optimize the transformer encoder structure for ASR, and in [8, 5] the entire encoder-decoder structure of the original Transformer was examined on Mandarin Chinese speech recognition tasks. Last but not least, a common observation with self-attention networks is that the longer the selected context, the better the performance. To support online speech recognition, a chunk-hopping mechanism has been proposed; the experimental results showed that, compared with using the whole utterance as context, chunk-hopping always degrades performance. In this work, priority is given to exploiting all available information to improve recognition accuracy; streaming speech recognition, where latency is an important factor, is out of the scope of this study.
To make further improvements on the self-attention layer, a natural idea is to exploit information beyond the whole-utterance context. Recently, it was proposed to augment self-attention with persistent memory by introducing a new layer that merges the self-attention and feed-forward sub-layers into a single unified attention layer. It has been shown that the additional persistent memory block, in the form of key-value vectors, stores global information that does not depend on the context. This work sheds light on achieving the above motivation.
In this paper, we first explore a variant model architecture that combines DFSMN and self-attention layers. Experiments show this architecture outperforms a standard transformer encoder. We then apply the memory augmenting method to its self-attention layers, propose an improved memory structure, and run comparison experiments. Our contributions are as follows. First, we further verify that self-attention is not necessary in the low-level front layers; adding only a few self-attention layers at higher levels achieves competitive performance. Second, we apply augmented persistent memory to an ASR model and propose an improved variant of the memory structure that is more compact and competitive in recognition performance. All experiments are performed on large-scale training data, i.e., over 10,000 hours.
2 Model Architecture
A deep feed-forward sequential memory network (DFSMN) provides a non-recurrent architecture to model long-term dependency. Compared with a self-attention network, the FSMN layer structure is much simpler and focuses on a local range of neighbouring frames, while ignoring the relative dependency between different positions of a sequence. A multi-head self-attention layer can model this relative dependency by gathering information from the whole context of a sequence. In view of this, we explore the combination of FSMN layers and multi-head self-attention layers, and conjecture that the proposed architecture can achieve a better trade-off between modeling efficiency and capturing the long-term relative dependency.
DFSMN can conceptually be viewed as a standard feedforward fully connected neural network augmented with FIR-like filters. The formulation of the FIR-like filters takes the following form:

$\tilde{p}_t^{\ell} = p_t^{\ell} + \sum_{i=0}^{N_1} a_i^{\ell} \odot p_{t-i}^{\ell} + \sum_{j=1}^{N_2} c_j^{\ell} \odot p_{t+j}^{\ell}$    (1)

where $N_1$ and $N_2$ denote the look-back and look-ahead order respectively, $p_t^{\ell}$ is the hidden representation of layer $\ell$ at time $t$, $a_i^{\ell}$ and $c_j^{\ell}$ are learnable filter coefficients, and $\odot$ denotes element-wise multiplication. From Equation (1), we can observe that the learnable FIR-like filters in DFSMN encode long context information into a fixed-size representation ($\tilde{p}_t^{\ell}$), which allows the DFSMN to capture long-term dependency. However, the relative dependency between different positions is ignored in this architecture.
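As a concrete illustration, the FIR-like memory filter can be sketched in a few lines of NumPy; the function name and array shapes below are illustrative, not taken from the original implementation.

```python
import numpy as np

def fsmn_memory_block(p, a, c):
    """Sketch of the FIR-like filter of Eq. (1) (hypothetical helper).

    p : (T, d) hidden sequence of one layer
    a : (N1 + 1, d) look-back filter taps (index 0 is the current frame)
    c : (N2, d) look-ahead filter taps
    Returns the memory-augmented representation of shape (T, d).
    """
    T, d = p.shape
    n1, n2 = a.shape[0] - 1, c.shape[0]
    p_tilde = p.copy()
    for t in range(T):
        for i in range(n1 + 1):          # look-back terms, element-wise products
            if t - i >= 0:
                p_tilde[t] += a[i] * p[t - i]
        for j in range(1, n2 + 1):       # look-ahead terms
            if t + j < T:
                p_tilde[t] += c[j - 1] * p[t + j]
    return p_tilde
```

With all filter taps set to zero the block reduces to the identity, which makes the residual structure of the filter easy to see.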
A self-attention network has two sub-modules: a multi-head attention layer and a position-wise feedforward layer. In addition, dropout, residual connections, and layer normalization are applied after both the self-attention and feed-forward layers. The computation of self-attention is formulated as follows:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $K$, $V$ and $Q$ are the key, value and query matrices of size $T \times d_{model}$, and $h$ is the number of heads in the self-attention layer. Multi-head attention works by attending to information from different subspaces mapped by $W_i^Q$, $W_i^K$ and $W_i^V$; $W^O$ is the output weight matrix. It is clear that self-attention can explore information from different representations at different positions. In other words, this architecture has the ability to model the relative dependency.
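The multi-head self-attention computation described above can be sketched in NumPy as follows; this is a minimal illustration in which the per-head split is done by slicing the projected matrices, one common convention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Minimal multi-head self-attention sketch; names are illustrative.

    X  : (T, d_model) input sequence
    Wq, Wk, Wv, Wo : (d_model, d_model) projection / output matrices
    h  : number of heads
    """
    T, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):                   # attend within each subspace separately
        s = slice(i * d_k, (i + 1) * d_k)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo
```

Because each head's output is a convex combination of value rows, with identity projections every output column stays within the range of the corresponding input column.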
We propose the DFSMN-SAN model, in which multi-head self-attention layers (red block in Fig.1) are combined with the DFSMN model. Similar to previously explored combinations of TDNN and SAN, we argue that combining DFSMN and SAN can achieve a better trade-off between modeling efficiency and capturing the long-term relative dependency. Two types of combination are empirically evaluated: the first simply stacks all the self-attention layers on top of the DFSMN model, and the second inserts self-attention layers into the DFSMN in an alternating pattern. In our pilot experiments, the latter consistently performs better, so the alternating combination is used in this paper, as shown in Fig. 1: after every 10 consecutive FSMN layers, a self-attention layer is inserted.
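The alternating layout described above can be made explicit with a small illustrative helper (not from the paper's code):

```python
def build_dfsmn_san_layout(num_fsmn=30, fsmn_per_block=10):
    """Layer layout of the alternating DFSMN-SAN stacking (illustrative).

    After every `fsmn_per_block` consecutive FSMN layers, one self-attention
    layer is inserted, matching the arrangement in Fig. 1.
    """
    layout = []
    for i in range(num_fsmn):
        layout.append("fsmn")
        if (i + 1) % fsmn_per_block == 0:
            layout.append("self-attention")
    return layout
```

With the defaults this yields 30 FSMN layers interleaved with 3 self-attention layers, matching the model sizes used in the experiments below.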
3 Augmenting Self-Attention Layers With Persistent Memory
In this section, we further propose to apply self-attention layers augmented with persistent memory to the DFSMN-SAN model. The motivation is to investigate whether information beyond the whole-utterance level can be exploited beneficially. These memory vectors are randomly initialized at the beginning of training and updated over the whole training corpus; we believe they can learn and store global information useful for the ASR task. We also propose a new memory structure different from that of prior work. The two types of memory structures are described as follows.
3.1 Key-Value Memory structure
Fig.2(a) shows the self-attention layer augmented with persistent memory vectors as previously proposed. More precisely, these persistent memory vectors are a set of $N$ key-value vector pairs, stacked in two $N \times d_k$-dimensional matrices $M_k$ and $M_v$.
These persistent memory vectors are simply concatenated to the pool of key-value vectors:

$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\ [K W_i^K, M_k],\ [V W_i^V, M_v]\big)$

where the position encoding corresponding to a persistent vector is set to zero, and $[K W_i^K, M_k]$ denotes the concatenation of the key vectors with the corresponding $N$ persistent vectors. As in general multi-head self-attention layers, the persistent memory vectors are split into multiple heads and are not shared between heads.
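The key-value memory augmentation can be sketched as follows; this is a single-head simplification (per-head splitting and position encodings are omitted), and all names are illustrative:

```python
import numpy as np

def attention_with_kv_memory(Q, K, V, Mk, Mv, d_k):
    """Self-attention with key-value persistent memory (sketch of Fig. 2(a)).

    Mk, Mv : (N, d_k) persistent key/value vectors; in training these would be
    learnable parameters, here they are just passed-in arrays.
    """
    K_aug = np.concatenate([K, Mk], axis=0)   # [K, Mk]: key pool grows by N
    V_aug = np.concatenate([V, Mv], axis=0)   # [V, Mv]: value pool grows by N
    scores = Q @ K_aug.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax over frames and memory
    return w @ V_aug
```

Note that each query now distributes its attention mass over both the utterance frames and the $N$ context-independent memory slots.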
3.2 Input-Embedding Memory structure
In this paper, we propose a new type of memory structure, shown in Fig.2(b). Different from the key-value memory structure, the persistent memory vectors are directly concatenated to the input vectors:

$K = [X, M]\, W^K, \qquad V = [X, M]\, W^V$
where $M$ denotes the persistent memory vectors and $X$ the input sequence. Obviously, $K$ and $V$ share the same persistent memory vectors. That is to say, this structure has fewer parameters than the key-value memory structure.
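A matching sketch of the proposed input-embedding memory structure, under the same single-head simplifications; note that one memory matrix enters both the key and value projections:

```python
import numpy as np

def attention_with_input_memory(X, M, Wq, Wk, Wv):
    """Input-embedding persistent memory (sketch of Fig. 2(b)).

    The N memory vectors M are concatenated to the input X before the key and
    value projections, so keys and values share one memory matrix and no
    separate key/value memory parameters are needed.
    """
    d_k = Wk.shape[1]
    X_aug = np.concatenate([X, M], axis=0)    # [X, M]
    Q = X @ Wq                                # queries come from X only
    K, V = X_aug @ Wk, X_aug @ Wv             # shared memory enters K and V
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Compared with the key-value variant, which adds roughly $2 N d_k$ memory parameters, this structure adds only $N d_{model}$, which is the compactness advantage claimed above.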
4 Experimental Setup
4.1 Training setup
The feature vectors used in all experiments are 40-dimensional log-Mel filterbank energy features appended with their first-order and second-order derivatives. The log-Mel filterbank energy features are computed with a 25ms window shifted every 10ms. We stack 8 consecutive frames and subsample the input frames by a factor of 3. Global mean and variance normalization is applied to each frame. All experiments are based on the CTC learning framework, using the CI-syllable-based acoustic modeling method. The target labels of CTC learning include 1394 Mandarin syllables, 39 English phones, and a blank. Character error rate (CER) results are measured on the test sets. We use a pruned, first-pass, 5-gram language model. All systems use a vocabulary of millions of words, and decoding is performed with a beam search algorithm using weighted finite-state transducers (WFSTs).
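The frame stacking and subsampling step described above can be sketched as follows (a NumPy illustration; the exact windowing of the production front-end may differ):

```python
import numpy as np

def stack_and_subsample(frames, stack=8, subsample=3):
    """Frame stacking and subsampling sketch.

    frames : (T, 40) log-Mel features (with deltas this would be (T, 120)).
    Each output frame concatenates `stack` consecutive input frames, and the
    stacked sequence is then taken every `subsample` steps, widening the
    temporal context while reducing the frame rate.
    """
    T, d = frames.shape
    n = T - stack + 1                    # number of full stacks
    stacked = np.stack([frames[t:t + stack].reshape(-1) for t in range(n)])
    return stacked[::subsample]
```

For a 40-dimensional feature this yields 320-dimensional input vectors at one third of the original 100 Hz frame rate.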
Our training corpus consists of mixed data sets collected from several different application domains, all in Mandarin. To improve system robustness, a set of simulated room impulse responses (RIRs) is created with different rectangular room sizes, speaker positions, and microphone positions, following previous work. Together, the data sets contain a total of 10k hours of speech.
4.2 Test sets
To evaluate the performance of the proposed method, we report results on 3 types of test sets consisting of hand-transcribed anonymized utterances extracted from reading speech (1001 utterances), conversational speech (1665 utterances), and spontaneous speech (2952 utterances). We refer to them as Read, Chat, and Spon, respectively. In addition, to provide a public benchmark, we also use the AISHELL-2 development set (2500 utterances), recorded with a high-fidelity microphone, as a test set.
4.3 Acoustic Model
For the first experiment, we present our work with the DFSMN, self-attention, and DFSMN-SAN models. The DFSMN system uses 30 DFSMN components with 1024 hidden units, each with a projection layer of 512 units. The self-attention model contains 10 multi-head self-attention sublayers, with a size comparable to the 30-layer DFSMN model; the model dimension and the number of heads are set accordingly. The DFSMN-SAN model consists of 30 DFSMN components and 3 multi-head self-attention sublayers.
For the second experiment, we improve the DFSMN-SAN by augmenting the self-attention sublayer with persistent memory. We set the number of heads to 8 for key-value memory vectors. The position embedding is shared across all the heads.
5 Experimental Results
5.1 Model Architecture Experiments
In this section, we compare different variations of our acoustic models. Table 1 shows the performance comparison of 3 different acoustic models: line 2 presents the results of the baseline system (30-layer FSMN), line 3 the results of the 10-layer self-attention model, and the last line the results of the DFSMN-SAN model. The results show that the self-attention model performs better than the DFSMN model at a comparable model size. As expected, the DFSMN-SAN model, which combines DFSMN layers with self-attention layers, performs best among the three. This indicates that the relative dependency learned by self-attention layers improves system performance, while the self-attention mechanism is not necessary in the low-level front layers; adding only a few self-attention layers at higher levels achieves competitive performance. Notably, the DFSMN-SAN model achieves up to a 34.4% relative CER improvement over the baseline model on the Read test set.
5.2 Augmenting Memory Experiments
Table 2 shows the experimental results of models augmented with the key-value memory structure. Different numbers of persistent memory vectors were examined, and the performance does not vary much as $N$ is set to 64, 128, and 256. The key-value memory structure achieves a 9.5% relative CER improvement over the baseline model on the AISHELL test set.
Table 3 shows that the models augmented with the input-embedding memory structure also significantly outperform the baseline system. Both types of augmented memory introduce a very modest increase in model size. The proposed input-embedding memory structure achieves performance competitive with the key-value memory structure using fewer parameters.
6 Conclusions
In this work, we first explore a variant model architecture that combines DFSMN layers with self-attention layers. By inserting multi-head self-attention layers into the DFSMN in an alternating pattern, consistent improvements are obtained over both the original DFSMN model and the SAN model with pure self-attention layers. More importantly, we apply self-attention layers with persistent memory vectors to the DFSMN-SAN model. Two types of augmented memory are evaluated, and both provide further improvements in CER. Notably, our proposed input-embedding memory structure achieves performance comparable to the key-value memory structure with fewer parameters. To make the experimental results more convincing, all experiments in this paper are performed on large data sets and evaluated on four different test sets. Future work includes visualization analysis of the memory vectors to better understand what they have learned.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for asr,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5874–5878.
-  M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Self-attentional acoustic models,” arXiv preprint arXiv:1803.09519, 2018.
-  Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5884–5888.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
-  Julian Salazar, Katrin Kirchhoff, and Zhiheng Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7115–7119.
-  Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,” arXiv preprint arXiv:1804.10752, 2018.
-  Linhao Dong, Feng Wang, and Bo Xu, “Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5656–5660.
-  Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, “Deep-fsmn for large vocabulary continuous speech recognition,” arXiv preprint arXiv:1803.05030, 2018.
-  Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, and Armand Joulin, “Augmenting self-attention with persistent memory,” CoRR, vol. abs/1907.01470, 2019.
-  Zhongdi Qu, Parisa Haghani, Eugene Weinstein, and Pedro Moreno, “Syllable-based acoustic modeling with ctc-smbr-lstm,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 173–177.
-  I. Himawan, P. Motlicek, D. Imseng, B. Potard, N. Kim, and J. Lee, “Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition,” in International Conference on Acoustics, Speech and Signal Processing, 2015.
-  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
-  K. Chen and Q. Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” in ICASSP. IEEE, 2016, pp. 5880–5884.