DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

10/28/2019 ∙ by Zhao You, et al.

Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and have achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies. One of the key ingredients is the self-attention mechanism, which can be effectively performed at the whole-utterance level. In this paper, we investigate whether even more information beyond the whole-utterance level can be exploited and be beneficial. We propose to apply self-attention layers with augmented memory to ASR. Specifically, we first propose a variant model architecture that combines the deep feed-forward sequential memory network (DFSMN) with self-attention layers to form a better baseline model than a purely self-attention network. Then, we propose and compare two kinds of additional memory structures added into the self-attention layers. Experiments on large-scale LVCSR tasks show that, on four individual test sets, the DFSMN-SAN architecture outperforms the vanilla SAN encoder by at least 5% relative in character error rate (CER), and the additional memory structure provides a further relative CER reduction of up to about 11%.




1 Introduction

Transformer models [1] have been successfully applied and proven to be more effective than recurrent neural networks, e.g., LSTMs, in several NLP tasks. The two key ingredients are sinusoidal positional encoding and the self-attention mechanism, which makes the model context-aware on input word embeddings. Recently, transformer models and their variants have also been actively investigated for speech recognition [2, 3, 4, 5]. To work well for ASR modeling, the transformer architecture needs some revision. Some key points have been summarized by previous work.

First, because a speech utterance typically lasts a couple of seconds while speech frames are extracted with a window of several milliseconds, downsampling the acoustic input frames to widen the temporal context is always beneficial. Some model variants have also been explored that use either TDNN [5] or BiLSTM [3] as front blocks to extract high-level features for the self-attention layers. Second, the transformer encoder is effective both with the CTC loss [6] and within the Listen, Attend and Spell (LAS) framework. In [7], the CTC loss was applied to optimize the Transformer encoder structure for ASR. In [8, 5], the entire encoder-decoder structure of the original Transformer was examined in the context of Mandarin Chinese speech recognition tasks. Last but not least, a common observation with self-attention networks is that the longer the selected context, the better the performance. In [9], a chunk-hopping mechanism is proposed to support online speech recognition. The experimental results showed that, compared with the whole-utterance context, chunk-hopping always leads to performance degradation. In this work, priority is given to exploring all possible information to improve speech recognition accuracy; streaming speech recognition, where latency is an important factor, is outside the scope of this study.

To improve further on the self-attention layer, a natural idea is to explore more information beyond the whole-utterance context length. Recently, in [10], the authors proposed to augment self-attention with persistent memory by introducing a new layer that merges the self-attention and feed-forward sub-layers into a single unified attention layer. It has been shown that the additional persistent memory block, in the form of key-value vectors, stores some global information that does not depend on the context. This work sheds light on how to achieve the motivation stated above.

In this paper, we first explore a variant model architecture that combines DFSMN and self-attention layers. Experiments show that this architecture outperforms a standard transformer encoder. Then we apply the memory-augmenting method to its self-attention layers. We further propose an improved memory structure and run comparison experiments. Our contributions are as follows. First, we further verify that self-attention layers are not necessary for the low-level front layers, and that only a few self-attention layers added at the high level can achieve competitive performance. Second, we apply augmented persistent memory to the ASR model and propose an improved variant of the memory structure, which is more compact and competitive in recognition performance. All experiments are performed on large-scale training data, i.e., over 10,000 hours.

2 Model Architecture

A deep feed-forward sequential memory network [11] provides a non-recurrent architecture to model long-term dependency. Compared with a self-attention network, the FSMN layer structure is much simpler and focuses more on a local range of neighbouring frames, while it ignores the relative dependency between different positions of a sequence. A multi-head self-attention layer can model the relative dependency by gathering information from the whole context of a sequence. In view of this, we explore the combination of FSMN layers and multi-head self-attention layers. We conjecture that the proposed architecture can achieve a better trade-off between modeling efficiency and capturing the long-term relative dependency.

2.1 DFSMN

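DFSMN can conceptually be viewed as a standard feed-forward fully connected neural network augmented with FIR-like filters. The FIR-like filters take the following form:

$$\tilde{h}_t^{\ell} = \sum_{i=0}^{N_1} a_i^{\ell} \odot h_{t-i}^{\ell} + \sum_{j=1}^{N_2} c_j^{\ell} \odot h_{t+j}^{\ell} \qquad (1)$$

where $N_1$ and $N_2$ denote the look-back and look-ahead order respectively, $h_t^{\ell}$ is the hidden activation of layer $\ell$ at frame $t$, $a_i^{\ell}$ and $c_j^{\ell}$ are the learnable filter coefficients, and $\odot$ is element-wise multiplication. From equation 1, we can observe that the learnable FIR-like filters in DFSMNs can be used to encode long context information into a fixed-size representation ($\tilde{h}_t^{\ell}$), which lets the DFSMN capture long-term dependency. However, the relative dependency between different positions is ignored in this architecture.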
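As a minimal sketch of the FIR-like filter described above, the following NumPy function computes, for each frame, a weighted sum of the current frame, a fixed number of past frames, and a fixed number of future frames. The function name `fsmn_memory`, the coefficient arrays `a` and `c`, and the zero padding at utterance boundaries are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def fsmn_memory(h, a, c):
    """FIR-like memory filter over a sequence of hidden vectors.

    h: (T, D) hidden activations for T frames.
    a: (n1 + 1, D) look-back coefficients (row 0 weights the current frame).
    c: (n2, D) look-ahead coefficients (row 0 weights frame t + 1).
    Frames outside the utterance are treated as zeros (an assumption).
    """
    T, _ = h.shape
    out = np.zeros_like(h, dtype=float)
    for t in range(T):
        for i in range(a.shape[0]):           # past context, i = 0..n1
            if t - i >= 0:
                out[t] += a[i] * h[t - i]
        for j in range(1, c.shape[0] + 1):    # future context, j = 1..n2
            if t + j < T:
                out[t] += c[j - 1] * h[t + j]
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 4))
a = rng.standard_normal((3, 4))   # look-back order 2
c = rng.standard_normal((1, 4))   # look-ahead order 1
memory = fsmn_memory(h, a, c)
print(memory.shape)  # (6, 4)
```

The output has the same shape as the input regardless of utterance length, which is the fixed-size property the text refers to; the weights depend only on the frame offset, not on the content of the frames.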

2.2 SAN

A self-attention network [1] has two sub-modules: a multi-head attention layer and a position-wise feed-forward layer. In addition, dropout, residual connection, and layer normalization are applied after both the self-attention and feed-forward layers. The computation of self-attention is formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}$$

where $K$, $V$ and $Q$ are the key, value and query matrices of size $T \times d_{model}$, with $T$ the number of frames, $d_k = d_{model}/h$, and $h$ the number of heads in the self-attention layer. Multi-head attention attends to information from the different subspaces mapped by $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$. $W^{O}$ is the output weight matrix. It is clear that self-attention can explore information from different representations at different positions. In other words, this architecture is able to model the relative dependency.
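The multi-head computation described above can be sketched in NumPy as follows. This is an illustrative sketch of standard scaled dot-product multi-head self-attention, not the authors' implementation; the function names, the use of one full-size projection matrix per role that is then split into heads, and the toy dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (T, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    T, d_model = x.shape
    d_k = d_model // num_heads

    def split(m):
        # (T, d_model) -> (heads, T, d_k)
        return m.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, T, T)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    heads = weights @ v                                 # (heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
T, d_model, num_heads = 5, 8, 2
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head produces a full frame-by-frame weight matrix, so every output frame can draw on every other frame in the utterance; this is the content-dependent weighting that the FIR-like filters of DFSMN lack.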

2.3 DFSMN-SAN

We propose the DFSMN-SAN model, in which multi-head self-attention layers (red block in Fig. 1) are combined with the DFSMN model. Similar to the combination of TDNN and SAN in [2], we argue that the combination of DFSMN and SAN can achieve a better trade-off between modeling efficiency and capturing the long-term relative dependency. Two types of combination are empirically evaluated. The first simply stacks all of the self-attention layers at the end of the DFSMN model, and the second inserts self-attention layers into the DFSMN in an alternating style. In our pilot experiments, we find that the latter consistently performs better. Therefore the alternating combination is used in this paper, as shown in Fig. 1. After every 10 consecutive FSMN layers, a self-attention layer is inserted.
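The two combination styles can be written down as layer orderings. The sketch below is only a schematic of the layer order; the function names and the string labels "fsmn" and "san" are hypothetical, and the 30-layer, 3-layer configuration is used purely as an example.

```python
def alternate_layout(num_fsmn, fsmn_per_block):
    """Insert one self-attention layer after every `fsmn_per_block` FSMN layers."""
    layers = []
    for i in range(1, num_fsmn + 1):
        layers.append("fsmn")
        if i % fsmn_per_block == 0:
            layers.append("san")
    return layers

def stacked_layout(num_fsmn, num_san):
    """Place all self-attention layers after the last FSMN layer."""
    return ["fsmn"] * num_fsmn + ["san"] * num_san

alternating = alternate_layout(30, 10)
stacked = stacked_layout(30, 3)
print(alternating.count("fsmn"), alternating.count("san"))
print([i for i, name in enumerate(alternating) if name == "san"])
```

Both layouts contain the same number of layers of each kind; they differ only in where the self-attention layers sit relative to the FSMN layers.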

Figure 1: DFSMN-SAN model architecture

3 Augmenting Self-Attention Layers With Persistent Memory

In this section, we further propose to apply self-attention layers augmented with persistent memory to the DFSMN-SAN model. The motivation is to investigate whether even more information beyond the whole-utterance level can be exploited and be beneficial. These memory vectors are randomly initialized at the beginning of training and updated with the whole training corpus. We believe that these memory vectors can learn and store some global information useful for the ASR task. Here we propose a new memory structure different from that of [12]. The two types of memory structure are described as follows.

3.1 Key-Value Memory structure

Figure 2: In Fig. 2(a), the persistent memory vectors are concatenated to the key-value vectors. In Fig. 2(b), the persistent memory vectors are directly concatenated to the input vectors. $X_m$ denotes the input vectors $X$ augmented with persistent vectors. The key-value vectors are generated from $X_m$ and the query vectors are generated from $X$ (dotted line in Fig. 2(b)). We represent the memory vectors with a red block for both models. Here, we show both models in the case of a single head. In our experiments, both models have multiple heads.

Fig. 2(a) shows the self-attention layer augmented with persistent memory vectors proposed by [12]. More precisely, these persistent memory vectors are a set of $N$ pairs of key-value vectors, which are stacked in two $N \times d_k$-dimensional matrices $M_k$ and $M_v$.

These persistent memory vectors are simply concatenated to the pool of key-value vectors:

$$K_m = [K, M_k], \qquad V_m = [V, M_v]$$

where the position encoding corresponding to a persistent vector is equal to zero. $[K, M_k]$ denotes the concatenation of the key vectors with the corresponding $N$ persistent vectors. As in general multi-head self-attention layers, the persistent memory vectors are split into multiple heads and are not shared between heads.
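A single-head sketch of attention with persistent key-value vectors is given below. It assumes that the learned memory is simply appended as extra rows to the keys and values before the softmax; the function name, the variable names `m_k` and `m_v`, and the toy dimensions are illustrative choices, and position encoding is left out entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_kv_memory(x, w_q, w_k, w_v, m_k, m_v):
    """x: (T, d); m_k, m_v: (N, d) persistent vectors that do not depend on x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    k_m = np.concatenate([k, m_k], axis=0)   # (T + N, d)
    v_m = np.concatenate([v, m_v], axis=0)   # (T + N, d)
    weights = softmax(q @ k_m.T / np.sqrt(q.shape[-1]))   # (T, T + N)
    return weights @ v_m, weights

rng = np.random.default_rng(0)
T, N, d = 5, 4, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
m_k = rng.standard_normal((N, d))
m_v = rng.standard_normal((N, d))
out, weights = attention_with_kv_memory(x, w_q, w_k, w_v, m_k, m_v)
print(out.shape, weights.shape)  # (5, 8) (5, 9)
```

Each query now distributes its attention over the T frames of the utterance plus the N memory slots, so part of the output can come from vectors that are identical for every utterance.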

3.2 Input-Embedding Memory structure

In this paper, we propose a new type of memory structure, which is shown in Fig. 2(b). Unlike the key-value memory structure, these persistent memory vectors are directly concatenated to the input vectors $X$:

$$X_m = [X, M], \qquad K_m = X_m W^{K}, \qquad V_m = X_m W^{V}, \qquad Q = X W^{Q}$$

where $M$ denotes the persistent memory vectors. Obviously, $K_m$ and $V_m$ share the same persistent memory vectors. That is to say, we have fewer parameters than the key-value memory structure.
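The input-embedding variant can be sketched in the same single-head setting. The memory rows are appended to the layer input before the key and value projections, while the queries are computed from the original input only. All names and dimensions below are assumptions for illustration, and the parameter comparison counts only the memory vectors themselves.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_input_memory(x, w_q, w_k, w_v, m):
    """x: (T, d) layer input; m: (N, d) persistent vectors shared by keys and values."""
    x_m = np.concatenate([x, m], axis=0)   # (T + N, d) augmented input
    q = x @ w_q                            # queries come from the original input only
    k_m = x_m @ w_k
    v_m = x_m @ w_v
    weights = softmax(q @ k_m.T / np.sqrt(q.shape[-1]))   # (T, T + N)
    return weights @ v_m, weights

rng = np.random.default_rng(0)
T, N, d = 5, 4, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
m = rng.standard_normal((N, d))
out, weights = attention_with_input_memory(x, w_q, w_k, w_v, m)

# Memory parameters per layer in this single-head sketch.
input_memory_params = m.size          # one shared set of N vectors
key_value_memory_params = 2 * N * d   # separate key and value sets
print(out.shape, weights.shape)       # (5, 8) (5, 9)
print(input_memory_params, key_value_memory_params)  # 32 64
```

Because one set of vectors feeds both projections, the memory costs half as many parameters as separate key and value memories of the same size, which is the saving the text points to.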

4 Experimental Setup

4.1 Training setup

The feature vectors used in all the experiments are 40-dimensional log-Mel filterbank energy features appended with the first-order and second-order derivatives. The log-Mel filterbank energy features are computed with a 25 ms window shifted every 10 ms. We stack 8 consecutive frames and subsample the input frames by a factor of 3. Global mean and variance normalization is applied to each frame. All the experiments are based on the CTC learning framework. We use the CI-syllable-based acoustic modeling method [13] for CTC learning. The target labels of CTC learning are defined to include 1394 Mandarin syllables, 39 English phones, and a blank. Character error rate results are measured on the test sets. We use a pruned, first-pass, 5-gram language model. All the systems use a vocabulary that consists of millions of words. Decoding is performed with a beam search algorithm using weighted finite-state transducers (WFSTs).
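The frame stacking and subsampling step of this setup can be sketched as follows. The text does not say how the stacked window is aligned or how the end of the utterance is padded, so stacking frames t to t+7 and repeating the last frame are assumptions, as are the function name and the random input.

```python
import numpy as np

def stack_and_subsample(feats, stack=8, step=3):
    """feats: (T, D) per-frame features -> (ceil(T / step), stack * D)."""
    T, _ = feats.shape
    # Repeat the last frame so every window is full (a padding assumption).
    pad = np.repeat(feats[-1:], stack - 1, axis=0)
    padded = np.concatenate([feats, pad], axis=0)
    # Keep one stacked window every `step` frames.
    windows = [padded[t:t + stack].reshape(-1) for t in range(0, T, step)]
    return np.stack(windows)

# 100 frames of 120-dimensional features (40 log-Mel values plus two derivative sets).
feats = np.random.default_rng(0).standard_normal((100, 120))
stacked = stack_and_subsample(feats)
print(stacked.shape)  # (34, 960)
```

With a 10 ms frame shift, keeping every third window gives one input vector per 30 ms, and each vector covers 8 consecutive frames of context.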

4.2 Datasets

Our training corpus is a mixture of data sets collected from several different application domains, all in Mandarin. To improve system robustness, a set of simulated room impulse responses (RIRs) is created with different rectangular room sizes, speaker positions, and microphone positions, as proposed in [14]. Together they contain a total of 10k hours of speech.

To evaluate the performance of our proposed method, we report results on 3 types of test sets, which consist of hand-transcribed anonymized utterances extracted from read speech (1001 utterances), conversational speech (1665 utterances) and spontaneous speech (2952 utterances). We refer to them as Read, Chat, and Spon respectively. In addition, to provide a public benchmark, we also use the AISHELL-2 development set (2500 utterances), recorded with a high-fidelity microphone, as a test set.

                        Test set (CER, %)
Model       Size    Read    Chat    Spon    AISHELL
DFSMN       131M    3.19    31.59   32.82   5.78
SAN         141M    2.66    30.29   30.40   5.24
DFSMN-SAN   143M    2.09    28.56   28.70   4.95

Table 1: Results of the different model architectures.

4.3 Acoustic Model

For the first experiment, we present our work with the DFSMN, self-attention and DFSMN-SAN models. The DFSMN system uses 30 DFSMN components of 1024 hidden units, each with a projection layer of 512 units. The self-attention model contains 10 multi-head self-attention sublayers, with a size comparable to that of the 30-layer DFSMN model. We set the model dimension $d_{model}$ and the number of heads $h$. The DFSMN-SAN model consists of 30 DFSMN components and 3 multi-head self-attention sublayers.

For the second experiment, we improve the DFSMN-SAN by augmenting the self-attention sublayer with persistent memory. We set the number of heads to 8 for key-value memory vectors. The position embedding is shared across all the heads.

For stable CTC learning, we clip gradients to [-1.0, 1.0]. We use the Kaldi [15] toolkit to train models and all models are trained in a distributed manner using BMUF [16] optimization with 8 Tesla P40 GPUs.
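Gradient clipping to a fixed interval can be sketched as below. This reads the clipping as an element-wise clamp of every gradient value, which is an assumption; it is a NumPy illustration with hypothetical names and is not how Kaldi or BMUF implement it.

```python
import numpy as np

def clip_gradients(grads, low=-1.0, high=1.0):
    """Clamp every gradient entry to the closed interval [low, high]."""
    return [np.clip(g, low, high) for g in grads]

grads = [np.array([0.3, -2.5, 1.7]), np.array([[4.0, -0.1]])]
clipped = clip_gradients(grads)
print(clipped[0])  # [ 0.3 -1.   1. ]
```

Values already inside the interval pass through unchanged, so the clamp only affects the occasional large gradient that could otherwise destabilize training.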

5 Experimental Results

5.1 Model Architecture Experiments

In this section, we compare different variations of our acoustic models. Table 1 shows the performance comparison of 3 different acoustic models. The first row presents the results of the baseline system (30-layer DFSMN). The second row presents the results of the 10-layer self-attention model. The last row presents the results of the DFSMN-SAN model. The results show that the self-attention model performs better than the DFSMN model at a comparable model size. As expected, the DFSMN-SAN model, which combines DFSMN layers with self-attention layers, performs best among the three models. This indicates that the relative dependency learned by self-attention layers can improve system performance, while the self-attention mechanism is not necessary for the low-level front layers: adding only a few self-attention layers at the high level can achieve competitive performance. Notably, the DFSMN-SAN model achieves up to 34.4% relative CER improvement over the baseline model on the Read test set.
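The relative improvements discussed here follow from the Table 1 values by simple arithmetic. The short script below recomputes them; the helper name and the dictionary layout are illustrative, and the numbers are copied from Table 1.

```python
def relative_improvement(baseline_cer, new_cer):
    """Relative CER reduction in percent."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer

# CER values (%) from Table 1.
table1 = {
    "DFSMN":     {"Read": 3.19, "Chat": 31.59, "Spon": 32.82, "AISHELL": 5.78},
    "SAN":       {"Read": 2.66, "Chat": 30.29, "Spon": 30.40, "AISHELL": 5.24},
    "DFSMN-SAN": {"Read": 2.09, "Chat": 28.56, "Spon": 28.70, "AISHELL": 4.95},
}

for test_set in ["Read", "Chat", "Spon", "AISHELL"]:
    best = table1["DFSMN-SAN"][test_set]
    over_dfsmn = relative_improvement(table1["DFSMN"][test_set], best)
    over_san = relative_improvement(table1["SAN"][test_set], best)
    print(f"{test_set}: {over_dfsmn:.1f}% over DFSMN, {over_san:.1f}% over SAN")
```

On the Read set the improvement over DFSMN comes out at roughly 34.5%, which agrees with the figure quoted above up to rounding, and the improvement over SAN is above 5% on all four test sets.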

5.2 Augmenting Memory Experiments

Table 2 shows the experimental results of models augmented with the key-value memory structure. Different numbers of persistent memory vectors were examined, and the performance does not vary much when N is set to 64, 128 or 256. The key-value memory structure achieves 9.5% relative CER improvement over the baseline model on the AISHELL test set.

Table 3 shows that the models augmented with the input-embedding memory structure also significantly outperform the baseline system. Both types of augmented memory introduce only a very modest increase in model size. The proposed input-embedding memory structure achieves performance competitive with the key-value memory structure while using fewer parameters.

                        Test set (CER, %)
Model       Size    Read    Chat    Spon    AISHELL
Baseline    143M    2.09    28.56   28.70   4.95
N=64        144M    1.93    27.16   25.41   4.48
N=128       145M    1.95    27.23   25.45   4.47
N=256       146M    1.96    27.08   25.58   4.52

Table 2: Results of the different number of persistent vectors (N) for key-value memory structure.

                        Test set (CER, %)
Model       Size    Read    Chat    Spon    AISHELL
Baseline    143M    2.09    28.56   28.70   4.95
N=64        143M    1.95    27.35   25.72   4.41
N=128       144M    1.99    27.30   25.86   4.43
N=256       145M    2.02    26.89   26.13   4.50

Table 3: Results of the different number of persistent vectors (N) for input-embedding memory structure.

6 Conclusion

In this work, we first explore a variant model architecture that combines DFSMN layers with self-attention layers. By inserting multi-head self-attention layers into DFSMN in an alternating style, consistent improvement can be obtained over both the original DFSMN model and the SAN model with pure self-attention layers. More importantly, we apply self-attention layers with persistent memory vectors to the DFSMN-SAN model. Two types of augmented memory methods are evaluated and both provide further improvements in CER. Notably, we find that our proposed input-embedding memory structure can achieve performance comparable to the key-value memory structure with fewer parameters. To make the experimental results more convincing, all experiments in this paper are performed on large data sets and evaluated on four individual test sets. Future work includes visualization analysis of the memory vectors to better understand what they have learned.