Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention

03/29/2021 ∙ by Chengdong Liang, et al.

Self-attention (SA), which encodes vector sequences according to their pairwise similarity, is widely used in speech recognition due to its strong context modeling ability. However, its accuracy drops when it is applied to long sequences. The cause is that its weighted average operator may disperse the attention distribution, so that the relationship between adjacent signals is ignored. To address this issue, we introduce relative-position-awareness self-attention (RPSA) in this paper. It not only maintains the global-range dependency modeling ability of self-attention, but also improves the localness modeling ability. Because the local window length of the original RPSA is fixed and sensitive to different test data, we propose Gaussian-based self-attention (GSA), whose window length is learnable and adapts to the test data automatically. We further generalize GSA to a new residual Gaussian self-attention (resGSA) to improve performance. We apply RPSA, GSA, and resGSA to Transformer-based speech recognition respectively. Experimental results on the AISHELL-1 Mandarin speech recognition corpus demonstrate the effectiveness of the proposed methods. For example, the resGSA-Transformer achieves a character error rate (CER) of 5.86% on the test set, compared with 6.36% for the SA-Transformer. Although the performance of the proposed resGSA-Transformer is only slightly better than that of the RPSA-Transformer, it does not require the window length to be tuned manually.




1 Introduction

Recently, automatic speech recognition (ASR) based on dot-product self-attention (SA) [1] has been widely studied. SA has a simple mathematical structure. It not only models global dependencies over an input sequence, but also supports parallel computing naturally. Owing to these advantages, it has been applied successfully to hybrid-model-based ASR [2, 3] and Transformer-based ASR [4, 5, 6, 7]. However, its weighted average calculation may disperse the attention distribution, so that the dependency between neighboring signals is modeled insufficiently; we call this the insufficient localness modeling problem. This problem becomes apparent in long utterances because of the excessive flexibility of SA in context modeling. In addition, the computational complexity of SA grows quadratically with the length of its input.

Recently, relative positional embedding [8, 9] has been used to address the insufficient localness modeling problem in the embedding layers of Transformer-based ASR. However, the relative embedding itself does not limit the attention to the neighborhood of a frame, so SA can still attend to distant frames that are not important for ASR. To remedy this weakness, Xu et al. [10] proposed to add a local dense synthesizer attention alongside self-attention, where the local dense synthesizer attention focuses on localness modeling. However, the window length of this method is a hyperparameter. Masking [11] limits the range of SA by a soft Gaussian window. In contrast to relative positional embedding, masking limits the range of the attention structurally. Compared to [10], its advantage is that its window length is determined automatically in the training stage. However, the window length is again fixed in the test stage, which may not be the best choice, since different frames or characters may depend on their neighboring frames to different degrees. Moreover, choosing a suitable window length for each self-attention layer is time-consuming and has to be done case by case for different test data.

To solve the insufficient localness modeling problem in a more flexible way, in this paper we propose alternative Transformer-based ASR models built on relative-position-awareness self-attention (RPSA), Gaussian-based self-attention (GSA), and residual Gaussian-based self-attention (resGSA). Specifically, RPSA, which was originally proposed for machine translation [12], adds a window that limits the range of SA and forces it to concentrate on learning the local connections between frames. However, its window length is fixed, which may not be the best choice, since different frames may depend on their neighboring frames to different degrees. Moreover, choosing a suitable window length is time-consuming. To solve this problem, we further apply GSA [13] to Transformer-based ASR. The window of GSA is dynamic and adapts to the frames in the test stage, i.e., different test frames may have different window lengths and weights. Motivated by [14], we further propose resGSA to improve the performance, where each resGSA layer takes the attention scores from the previous layer as an additive residual to the scores of GSA.

Experimental results on the AISHELL-1 Mandarin dataset show that the proposed resGSA improves the performance without increasing computational complexity. Specifically, the resGSA-Transformer reduces the test character error rate (CER) from 6.36% to 5.86% compared with the SA-Transformer, with roughly the same number of parameters and computational complexity. Among the proposed methods, although the resGSA-Transformer achieves only a slightly lower CER than the RPSA-Transformer, it is more flexible: it does not need extra time to find an appropriate window length for each self-attention layer.

The work most closely related to resGSA for ASR is [11], in which a soft Gaussian mask window is used to address the localness modeling problem. However, its window length, which is fixed at test time, is independent of the input sequences. Different from [11], we use a feed-forward neural network to learn the mean and variance of a Gaussian function, which makes resGSA more flexible than the method in [11].


2 Background

In this section, we first introduce the scaled dot-product self-attention, which causes the insufficient localness modeling problem. Then, we introduce the masking self-attention, which is a solution closely related to our proposed methods.

2.1 Scaled dot-product self-attention

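The attention layer in the Transformer adopts dot-product attention, which has the following form:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $\mathrm{Att}(\cdot)$ is the attention function, and $Q$, $K$ and $V$ are formulated as:

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} \tag{2}$$

where $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{d \times d_k}$ are learnable projection matrices with $d_k$ as the individual dimension of each attention head, and $X = [x_1, \ldots, x_T] \in \mathbb{R}^{T \times d}$ is an input sequence with $T$ as the length of the sequence and $x_t$ as a $d$-dimensional acoustic feature. Note that $d$ is also the size of the SA layer.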

The SA layers in the Transformer usually use multi-head attention to perform attention in parallel. It computes the scaled dot-product attention $h$ times and then projects the concatenation of all outputs to obtain the final attention values. The multi-head self-attention is formulated as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{3}$$

where $W^{O}$ is the weight matrix of the linear projection layer, and $\mathrm{head}_i$ is the output of the $i$-th attention head.
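To make the computation described above concrete, the following NumPy sketch implements scaled dot-product attention and a multi-head wrapper. It is only an illustration: the function names, the toy sizes (`T`, `d`, `h`) and the random weights are assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for one head; Q, K, V have shape (T, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v are lists of h matrices of shape (d, d_k); W_o is (h * d_k, d)."""
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)[0]
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs and apply the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
T, d, h = 6, 8, 2
d_k = d // h
X = rng.standard_normal((T, d))
W_q = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
_, weights = scaled_dot_product_attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
```

Because every row of `weights` is a full distribution over all `T` positions, nothing in this formulation favors neighboring frames, which is the localness issue the following sections address.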

2.2 Masking self-attention

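Sperber et al. used a soft mask to mask the scaled dot-product self-attention [11]:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V \tag{4}$$

where $M$ is the soft mask defined as $M_{jk} = -\frac{(j-k)^2}{2\sigma^2}$, with $\sigma$ as a trainable parameter controlling the window size.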
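As a rough illustration of how such a soft mask behaves, the sketch below builds a Gaussian mask of the assumed form $M_{jk} = -(j-k)^2 / (2\sigma^2)$ and adds it to uniform attention scores. The function name and the value of `sigma` are hypothetical; in the masking method, $\sigma$ is a trained parameter.

```python
import numpy as np

def gaussian_soft_mask(T, sigma):
    """Soft mask M[j, k] = -(j - k)^2 / (2 * sigma^2) for a length-T sequence."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    return -(diff ** 2) / (2.0 * sigma ** 2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, sigma = 7, 2.0                 # hypothetical sequence length and window parameter
M = gaussian_soft_mask(T, sigma)

# With all-zero raw scores, the mask alone decides the attention weights,
# so each row peaks on the diagonal and decays with distance.
weights = softmax(np.zeros((T, T)) + M)
```

A smaller `sigma` yields a narrower window around each frame, while a very large `sigma` recovers nearly uniform attention.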

Figure 1: Example of relative-position-awareness for a given maximum relative distance $m$.

3 Proposed algorithms

In this section, we present RPSA, GSA and resGSA respectively.

3.1 Relative-position-awareness self-attention

Self-attention has a strong global-dependency modeling ability, which can build a connection between any two samples. This flexibility is an advantage in some long-dependency tasks such as machine translation. However, for speech recognition, local information is more important than global characteristics for representing phonetic features, especially in long sequences. To enhance the localness modeling ability, we propose to replace SA by RPSA. RPSA adds neighboring edge connections between two speech frames $x_i$ and $x_j$ that are close to each other, which modifies (1) to the following localness-aware model:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{Q(K + A)^{\top}}{\sqrt{d_k}}\right)V \tag{5}$$

where $A$ is the edge representation for matrix $K$. Let $a_{ij}$ denote the element of matrix $A$, which strengthens the relative contribution of the acoustic feature $x_j$ to the attention weight at time $i$, so that the $(i,j)$-th score is $Q_i (K_j + a_{ij})^{\top} / \sqrt{d_k}$. It is calculated as follows:

$$a_{ij} = w_{\mathrm{clip}(j-i,\,m)}, \qquad \mathrm{clip}(x, m) = \max(-m, \min(m, x)) \tag{6}$$

where $m$ is the maximum length of the relative distance. Figure 1 shows an example of relative-position-awareness. From the figure, we see that, when $j - i > m$ or $j - i < -m$, the representation is fixed as $w_{m}$ and $w_{-m}$ respectively.
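The clipping of relative distances can be sketched as follows. This is a minimal NumPy illustration in the style of the relative position representations of [12]; the function names, the embedding table `rel_emb` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def relative_position_index(T, m):
    """Index matrix with entry (i, j) = clip(j - i, -m, m) + m, in [0, 2m]."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]      # rel[i, j] = j - i
    return np.clip(rel, -m, m) + m         # shift so it can index a table

def rpsa_scores(Q, K, rel_emb, m):
    """Scores e[i, j] = Q_i . (K_j + a_ij) / sqrt(d_k), with a_ij looked up by clipped distance."""
    T, d_k = Q.shape
    A = rel_emb[relative_position_index(T, m)]    # (T, T, d_k) edge representations
    content = Q @ K.T                             # Q_i . K_j
    position = np.einsum("id,ijd->ij", Q, A)      # Q_i . a_ij
    return (content + position) / np.sqrt(d_k)

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
T, d_k, m = 5, 4, 2
index = relative_position_index(T, m)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
rel_emb = rng.standard_normal((2 * m + 1, d_k))   # one learnable vector per clipped distance
scores = rpsa_scores(Q, K, rel_emb, m)
```

All frame pairs farther apart than `m` share the same two embeddings (one per direction), so only `2 * m + 1` vectors are needed regardless of the sequence length.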

3.2 Gaussian-based self-attention

Figure 2: Example of Gaussian bias to enhance local features.

RPSA learns a fixed representation weight for the localness modeling, and the length of its window is fixed as well. Therefore, it is time-consuming to choose a suitable window length for each RPSA layer in practice. In order to achieve dynamic local enhancement, GSA uses a Gaussian distribution as an additive bias of (1):

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + G\right)V \tag{7}$$

where $G \in \mathbb{R}^{T \times T}$ is the Gaussian bias matrix.

In this algorithm, the window size is predicted from each input sequence. Compared with RPSA, the Gaussian distribution naturally focuses more attention on closer positions, as shown in Figure 2.

3.2.1 Learning algorithm of the Gaussian matrix

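Each bias element $G_{ij}$ measures the relation between the current query $Q_i$ and position $j$:

$$G_{ij} = -\frac{(j - P_i)^2}{2\sigma_i^2} \tag{8}$$

where $P_i$ is the central position of the Gaussian window for $Q_i$, $\sigma_i$ is the standard deviation, and $G_{ij} \in (-\infty, 0]$. The mean and deviation of $G_i$, which decide the curve of the distribution, are related to $Q_i$. Adding such a bias before the softmax approximates multiplying by a factor in $(0, 1]$ after the softmax layer. The key problem is how to choose the right $P_i$ and $\sigma_i$. An intuitive choice for $P_i$ is to set $P_i = i$, because the $i$-th attention weight is highly related to the input $x_i$. Instead, we predict the central position from $Q_i$:

$$p_i = U_p^{\top} \tanh(W_p Q_i) \tag{9}$$

where $W_p$ and $U_p$ denote learnable parameters of the feed-forward neural network (FNN), and $T$ is the sequence length. A sigmoid activation is applied to $p_i$, so its output lies in $(0, 1)$. The final predicted position is then $P_i = T \cdot \mathrm{sigmoid}(p_i) \in (0, T)$. The deviation $\sigma_i$ is set as $D_i / 2$, where $D_i = T \cdot \mathrm{sigmoid}(z_i)$ is the window size and, similar to (9), $z_i$ can be predicted by:

$$z_i = U_d^{\top} \tanh(W_p Q_i) \tag{10}$$

The value of $\sigma_i$ determines the steepness of the curve. The larger $\sigma_i$ is, the smoother the distribution. When both $P_i$ and $\sigma_i$ are fixed, the Gaussian bias is a special case of the relative-position-awareness method, in which the window weights follow a Gaussian distribution.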
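The prediction of the window center and width can be sketched in NumPy as below. This is an illustrative reading that follows the formulation of [13], with a shared hidden projection `W_p` and separate output vectors `U_p` and `U_d`; these names, the row-vector convention and the random toy parameters are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_bias(Q, W_p, U_p, U_d):
    """Return (G, P, sigma) for queries Q of shape (T, d_k).

    P[i]     : predicted central position in (0, T)
    sigma[i] : predicted standard deviation, half of the window size D[i]
    G[i, j]  : -(j - P[i])^2 / (2 * sigma[i]^2), always <= 0
    """
    T = Q.shape[0]
    hidden = np.tanh(Q @ W_p)            # (T, d_k), shared by both predictions
    P = T * sigmoid(hidden @ U_p)        # window centers
    D = T * sigmoid(hidden @ U_d)        # window sizes
    sigma = D / 2.0
    j = np.arange(T)
    G = -((j[None, :] - P[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2)
    return G, P, sigma

# Toy example with hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
T, d_k = 8, 4
Q = rng.standard_normal((T, d_k))
W_p = rng.standard_normal((d_k, d_k))
U_p = rng.standard_normal(d_k)
U_d = rng.standard_normal(d_k)
G, P, sigma = gaussian_bias(Q, W_p, U_p, U_d)
```

Since `P` and `sigma` are functions of the query, each frame gets its own window at test time, in contrast to a fixed window length.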

3.3 Residual Gaussian-based self-attention

As shown in Figure 3, each resGSA layer takes the raw attention scores of all attention heads from the previous layer as additive residual scores of the current attention function. The sum of the two scores is then used to compute attention weights via softmax:

$$\mathrm{Att}(Q, K, V, \mathrm{Prev}) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + G + \mathrm{Prev}\right)V \tag{11}$$

where $\mathrm{Prev}$ denotes the attention scores from the previous layer. Finally, the new attention scores are sent to the next layer. ResGSA converges as well as GSA.
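A minimal single-head sketch of the residual score path is given below. It assumes that the scores passed to the next layer are the pre-softmax sum including the Gaussian bias, which the text does not state explicitly; the function name, the fixed placeholder bias `G` and the toy inputs are likewise assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def res_gsa_layer(Q, K, V, G, prev_scores=None):
    """One attention head with Gaussian bias G and residual scores from the previous layer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + G
    if prev_scores is not None:
        scores = scores + prev_scores      # additive residual on raw (pre-softmax) scores
    return softmax(scores) @ V, scores     # scores are handed to the next layer

# Toy two-layer stack with hypothetical inputs.
rng = np.random.default_rng(0)
T, d_k = 5, 4
X = rng.standard_normal((T, d_k))
idx = np.arange(T)
G = -((idx[None, :] - idx[:, None]) ** 2) / (2.0 * 1.5 ** 2)   # placeholder bias

out1, scores1 = res_gsa_layer(X, X, X, G)                       # first layer: no residual
out2, scores2 = res_gsa_layer(out1, out1, out1, G, prev_scores=scores1)
```

The residual adds no parameters: it only reuses a `(T, T)` matrix that the previous layer has already computed.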

Figure 3: The resGSA layer.

4 Experiments

4.1 Model architecture

As shown in Figure 4, the proposed resGSA-Transformer is an improved Speech-Transformer [15], which contains an encoder and a decoder. The encoder is composed of a stack of encoder sub-blocks, each of which contains a resGSA layer and a position-wise feed-forward layer. For the convolution block, we stack two convolution layers with stride 2 in both the time and frequency dimensions to downsample the input features. The decoder is composed of a stack of identical decoder sub-blocks and an embedding layer. In addition to the position-wise feed-forward layer, each decoder sub-block contains a resGSA layer and a multi-head attention layer. The former receives the embedded label sequence and the latter receives the output of the encoder. We add a mask to the resGSA in the decoder sub-block to ensure that the predictions for the current position depend only on the previous positions. The resGSA and feed-forward layers have the same output dimension. The output of each layer in the sub-blocks has a residual connection to the input of the layer, followed by a layer normalization operator. Every attention layer uses the same number of attention heads.


The structures of the proposed RPSA-Transformer and GSA-Transformer are similar to that of the resGSA-Transformer. They replace the resGSA layers in the encoder with RPSA and GSA layers respectively, and their decoder sub-blocks use the SA layer. RPSA and GSA use the same number of heads. The other layers of the encoder are the same as in the resGSA-Transformer.

To demonstrate the effectiveness of the proposed methods in an apples-to-apples comparison, we also constructed a standard dot-product self-attention based Transformer (SA-Transformer) and a masking self-attention based Transformer (masking-Transformer). Their model structures are the same as that of the GSA-Transformer, except that the GSA layers in the encoder are replaced with scaled dot-product self-attention and masking self-attention respectively.

Figure 4: The model architecture of the resGSA-Transformer.

4.2 Experimental setup

We evaluated the proposed models on the publicly available Mandarin speech corpus AISHELL-1 [16], which contains Mandarin speech data recorded from 400 speakers. We used the official partitioning of the corpus into training, validation, and test sets, each with its own utterances and speakers. Each speaker contributes a similar number of utterances. For all experiments, we used 80-dimensional Mel filter bank (Fbank) features as input, computed with a fixed frame length and frame shift. The vocabulary consists of Mandarin characters and four non-language symbols ("unk", "eos", "pad" and "space"), which denote unknown characters, the start or end of a sentence, the padding character and the blank character respectively.

For model training, we used the Adam optimizer with the Noam learning rate schedule with warm-up steps [1]. We used SpecAugment [17] for data augmentation and applied dropout to the attention weights. For the language model, we used a recurrent neural network (RNN) language model consisting of stacked RNN layers. We integrated the language model into beam search by shallow fusion [18], with the same language model weight for all experiments. After training, the parameters of the last epochs were averaged to obtain the final model. In the decoding stage, we used beam search with a width of 5 for all models. We used PyTorch [19] for modeling and Kaldi [20] for data preparation.
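For readers unfamiliar with the Noam schedule of [1], the sketch below shows its usual form: the learning rate grows linearly during warm-up and then decays with the inverse square root of the step. The default values of `d_model`, `warmup_steps` and `factor` are hypothetical placeholders, not the settings used in these experiments.

```python
def noam_learning_rate(step, d_model=256, warmup_steps=25000, factor=1.0):
    """Noam schedule: linear warm-up, then inverse square-root decay.

    d_model, warmup_steps and factor are illustrative defaults only.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks exactly at the end of warm-up.
early = noam_learning_rate(1000)
peak = noam_learning_rate(25000)
late = noam_learning_rate(100000)
```

The warm-up phase keeps early updates small while the Adam moment estimates are still unreliable.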

We compared the proposed models with TDNN-Chain [21], the listen-attend-spell (LAS) model [22], the Self-attention Transducer (SA-Transducer) [23], Speech-Transformer [24] and the hybrid-attention Transformer (HA-Transformer) [10]. TDNN-Chain uses time-delay neural networks (TDNN) as the acoustic model. LAS is an attention-based model, which contains a pyramidal BLSTM encoder and an attention-based decoder. The Self-attention Transducer contains an SA-based encoder and an SA-based prediction network. The structure of Speech-Transformer is the same as that of the SA-Transformer. The HA-Transformer replaces the SA layers in the encoder of the SA-Transformer with hybrid attention, which is a combination of the local dense synthesizer attention and SA.

4.3 Results

Figure 5: CER (no LM) with different $m$.

We first investigated the effect of the window length $m$ of the RPSA-Transformer on AISHELL-1. Figure 5 shows the CER curves of the model with respect to $m$. From the figure, we see that the CER first decreases and then becomes stable as $m$ increases. Based on this finding, we set $m$ to 30 in all of the following comparisons.

Then, we conducted an apples-to-apples comparison between the algorithms described in Sections 2 and 3. From Table 1, we see that the proposed models outperform the baseline models. Specifically, the RPSA-Transformer achieves a lower CER than the baseline models, which demonstrates the effectiveness of localness-aware modeling for ASR. The GSA-Transformer outperforms the RPSA-Transformer, which shows that the dynamic window leads to a better localness-aware model than the fixed window. Finally, the resGSA-Transformer achieves the best performance among all comparison methods: its test CER of 5.86% is lower than the 6.36% of the SA-Transformer and the 6.29% of the masking-Transformer. Although the resGSA-Transformer is only slightly better than the RPSA-Transformer, it is more flexible and does not need extra time to find a suitable window length.

To further verify the effectiveness of the proposed models, we compared them with five representative ASR systems: TDNN-Chain, LAS, the Self-attention Transducer, Speech-Transformer, and the HA-Transformer. From the comparison results in Table 1, we see that the proposed models are superior to the five comparison systems. For example, the resGSA-Transformer, which achieves a test CER of 5.86%, outperforms the strongest reference method, i.e., the HA-Transformer (6.18%).

| Model | Dev CER (%) | Test CER (%) |
|---|---|---|
| TDNN-Chain (kaldi) [21] | - | 7.45 |
| LAS [22] | - | 10.56 |
| Self-attention Transducer [23] | 8.30 | 9.30 |
| Speech-Transformer [24] | 6.57 | 7.37 |
| HA-Transformer [10] | 5.66 | 6.18 |
| SA-Transformer (baseline) | 5.81 | 6.36 |
| Masking-Transformer (baseline) | 5.67 | 6.29 |
| RPSA-Transformer (m=30) (proposed) | 5.53 | 6.12 |
| GSA-Transformer (proposed) | 5.41 | 5.94 |
| resGSA-Transformer (proposed) | 5.38 | 5.86 |

Table 1: CER comparison with the representative ASR systems.
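The relative CER reductions implied by Table 1 can be recomputed directly from the test-set column. The short script below does this for the resGSA-Transformer against the two baselines; the dictionary and function names are illustrative.

```python
# Test-set CER (%) values copied from Table 1.
test_cer = {
    "SA-Transformer": 6.36,
    "Masking-Transformer": 6.29,
    "RPSA-Transformer": 6.12,
    "GSA-Transformer": 5.94,
    "resGSA-Transformer": 5.86,
}

def relative_reduction(baseline, proposed):
    """Relative CER reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline

vs_sa = relative_reduction(test_cer["SA-Transformer"], test_cer["resGSA-Transformer"])
vs_masking = relative_reduction(test_cer["Masking-Transformer"], test_cer["resGSA-Transformer"])
print(f"vs SA-Transformer: {vs_sa:.2f}%, vs masking-Transformer: {vs_masking:.2f}%")
```

The absolute gap between resGSA and the SA baseline is 0.50 CER points, which corresponds to a relative reduction of a little under 8%.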

5 Conclusions

In this paper, we have proposed three attention schemes for the encoder of Transformer-based speech recognition: RPSA, GSA, and resGSA. Specifically, RPSA was proposed to replace the common SA in order to remedy the insufficient localness modeling problem of the SA-Transformer. To overcome the weakness of RPSA in window length selection, we further proposed GSA and an improved version, resGSA, whose window lengths are learnable and closely related to the input sequences. GSA and resGSA achieve dynamic localness modeling, i.e., different test frames may be assigned different window lengths and weights. Experimental results on AISHELL-1 show that the GSA- and resGSA-Transformer achieve better performance than the RPSA-Transformer without any manual tuning of the window length, and that the proposed models outperform the representative ASR systems in Table 1.