State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions

10/01/2019 ∙ by Kyu J. Han, et al. ∙ ASAPP INC 0

Self-attention has been a huge success for many downstream tasks in NLP, which led to exploration of applying self-attention to speech problems as well. The efficacy of self-attention in speech applications, however, seems not fully blown yet since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model architecture, namely multi-stream self-attention, to address the issue thus make the self-attention mechanism more effective for speech recognition. The proposed model architecture consists of parallel streams of self-attention encoders, and each stream has layers of 1D convolutions with dilated kernels whose dilation rates are unique given stream, followed by a self-attention layer. The self-attention mechanism in each stream pays attention to only one resolution of input speech frames and the attentive computation can be more efficient. In a later stage, outputs from all the streams are concatenated then linearly projected to the final embedding. By stacking the proposed multi-stream self-attention encoder blocks and rescoring the resultant lattices with neural network language models, we achieve the word error rate of 2.2 the test-clean dataset of the LibriSpeech corpus, the best number reported thus far on the dataset.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Self-attention is the core component of the neural network architectures recently proposed in NLP [1, 2, 3] to achieve the state-of-the art performances in a number of downstream tasks. Transformer [1]

successfully replaced recurrent neural networks such as LSTMs with sinusoidal positional encoding and the self-attention mechanism to be context-aware on input word embeddings. BERT


took the benefit from the success of Transformer to extend it to the autoencoding based pretraining model, which can be fine-tuned to reach the state-of-the-art performances for various downstream tasks. XLNet

[3], as the very latest state-of-the-art pretraining model, outperformed BERT in a number of downstream tasks from question answering to document ranking, thanks to model training with targets being aware and relative positional encoding like its ancestor of Transformer-XL [4].

With the huge success in NLP, self-attention has been actively investigated for speech recognition as well [5, 6, 7, 8, 9, 10]. In [5]

, time-restricted self-attention was introduced with a one-hot vector representation being exploited as relative positional encoding for given restricted contexts of speech frames. In

[6], the well-known Listen, Attend and Spell (LAS) ASR model [11] employed the multi-head approach to the attention mechanism to further improve its already state-of-the-art accuracy on the large-scale voice search data. In [7], several approaches were explored for better application of self-attention to speech recognition in the LAS framework, e.g., speech frame handling strategies or attention biasing to restrict the locality of the self-attention mechanism. In [8], the CTC loss [12] was applied to optimize the Transformer encoder structure for ASR. In [9, 10], the entire encoder-decoder structure of the original Transformer [1] was examined in the context of Mandarin Chinese speech recognition tasks.

The challenge in terms of applying self-attention to speech recognition is that individual speech frames are not like lexical units such as words. Speech frames do not convey distinct meanings or perform unique functions, which makes it hard for the self-attention mechanism to compute proper attentive weights on speech frames. Considering that adjacent speech frames could form a chunk to represent more meaningful units like phonemes, some sort of pre-processing mechanisms such as convolutions to capture an embedding for a group of nearby speech frames would be helpful for self-attention. In addition, a multi-resolution approach could be beneficial as well since boundaries for such meaningful chunks of speech frames are dependent of many factors, e.g., the type of phonemes (vowel vs. consonant) and the way they are pronounced, affected by gender, speaker, co-articulation and so on. Based on this reasoning, in this paper, we propose a new neural network model architecture for better self-attention, namely multi-stream self-attention. The proposed architecture consists of parallel streams of self-attention. In each stream, input speech frames are processed with a distinct resolution by multiple layers of 1D convolutions with a unique dilation rate, and the convoluted embeddings are fed to a subsequent multi-head self-attention layer. In a later stage, the attentive embeddings from all the streams are concatenated then linearly projected to the final embedding. We achieve the state-of-the-art performances on the LibriSpeech corpus [13] by stacking up these multi-stream self-attention blocks and rescoring the resultant lattices with powerful neural language models. Our WERs on the dev-clean and test-clean sets of 1.8% and 2.2%, respectively, are the best reported numbers thus far on the datasets as far as we know.

Figure 1: Multi-stream self-attention block. : dilation rate for the 1D convolutions in the stream .

This paper is organized in the following structure. In Section 2, we provide the details of the proposed multi-stream self-attention model architecture. In Section 3, we present the experimental setups and discuss the validity of the proposed ideas using the ablation tests. In addition, we compare our ASR system based on the multi-stream self-attention models with the other state-of-the-art systems. In Section 4, we conclude the paper with the summary of our contributions as well as the future directions.

2 Multi-Stream Self-Attention

Fig. 1 shows the proposed multi-stream self-attention block, each stream of which consists of multiple layers of 1D convolutions, one layer of multi-head self-attention, and a feed forward layer sandwiched by two layer normalizations111Each layer normalization has a skip connection (not depicted in the figure) with the input of its previous layer. [14]. The embeddings from all the streams are projected to the final embedding in a later stage. In this section, we detail each component of the multi-stream self-attention architecture.

Figure 2: Examples of the 1D convolutions with the kernels and the different dilation rates (). (a): , (b): , (c): .

2.1 1D convolutions with factorization (1D Conv-F)

Time delay neural networks (TDNNs) have been one of the most popular neural network models for speech recognition [15, 16]. It was introduced to capture the long range temporal dependencies of acoustic events in speech signals [15] by exploiting a modular and incremental design from sub-components. The modified version was recently proposed for better efficiency using the layer-wise subsampling methods [16].

A TDNN is basically a 1D convolution. We use this convolution layer and its kernels with various dilation rates to control the resolution of input speech frames being processed in the parallel streams. In each stream, layers of 1D convolutions with a a unique dilation rate process speech frames in the specified resolution. This can reduce burden on the self-attention mechanism to enable the attentive computation to focus on only one resolution of speech frames in the given stream. Examples of the 1D convolutions which the kernels with the dilation rates of 1, 2, and 3 are applied to are shown in Fig. 2.

In order to make the convolution layers more efficient, we utilize the factorized TDNN [17]

. Singular Value Decomposition (SVD) has been a popular choice to factorize a learned weight matrix into two low-rank factors and reduce the model complexity of neural networks

[18, 19, 20]. The factorized TDNN or factorized 1D convolution (1D Conv-F) layers also utilizes SVD to factorize a 1D convolution parameter matrix into two low-rank matrices. The kernel for each factorized 1D convolution is set to . One of the two factorized convolutions is constrained by the semi-orthogonal condition during training. Consider as one of the factors in the original parameter matrix after SVD. The semi-orthogonal constraint puts the condition to minimize a function :


where . is defined by and

is the identity matrix. This way of factorization with the semi-orthogonal constraint not only leads to less model complexity overall, but also results in even better modeling power. After the factorization, the rectified linear unit (ReLu)


, batch normalization

[22], and dropout [23] are followed by a skip connection [24] between the scaled input embedding and the output of the dropout layer. The scale value is a hyper-parameter.

2.2 Self-attention with factorized feed-forward (FF)

In a given stream of , we can formulate the time-restricted self-attention mechanism [5] in a mathematical manner as follows. We define an input embedding matrix to the stream , , where is the total number of input embeddings restricted by the left and right context and is the dimension of embeddings used inside the self-attention mechanism. Note that downsampling is applied to the input embeddings and the sampling rate is matched to the specified dilation rate () of the 1D Conv-F layers in the stream. For the projected query, key and value matrices, and in the stream , the output for the head is computed as follows:


where and , and , and , , and are the dimensions of query, key and value embeddings, respectively. The multi-head outputs are concatenated and linearly projected, then layer normalization is applied to the projected embedding that is skip-connected with the input embedding:


where is the number of heads in the stream and . We set , given is a fixed value for the total number of multi-heads across the self-attention components of the whole streams.

The multiple streams in the proposed architecture could increase the model complexity significantly as opposed to a single stream approach. To retain the model complexity to a reasonable level as well as avoid the loss of modeling power, we use the factorized feed forward networks with a bottleneck layer in between given stream. The semi-orthogonal constraint discussed in Section 2.1 is also applied here to one of the factorized matrices during training. After the skip connection, the encoder output in the stream can be written as below:


Note that the dimension of the embedding layer between the feed forward networks of the original Transformer encoder [1] is either 1,024 or 2,048. We could reduce the model complexity of the feed forward component by the factor of 8 or 16 if our choice of the bottleneck layer in the factorized version were 128. We can discuss how the factorized feed forward networks with this narrower bottleneck dimension are compared to the original version with no factorization but wider dimension in Section 3.

2.3 Final embedding

The final embedding layer concatenates the encoder output from each stream and linearly projects the concatenated vector to the final embedding. The ReLu non-linear activation, batch normalization, and dropout follows before feeding out the final embedding as the output:


where .

3 Experiments and Results

3.1 Data and experimental setups

For the experiments being discussed in the paper, we use the LibriSpeech corpus [13] as the main training and testing datasets. The LibriSpeech corpus is a collection of approximately 1,000hr audiobooks that are a part of the LibriVox project [25]. Most of the audiobooks come from the Project Gutenberg222

. The training data is split into 3 partitions of 100hr, 360hr, and 500hr sets while the dev and test data are split into the ’clean’ and ’other’ categories, respectively, depending upon how well or challening ASR systems would perform against. Each of the dev and test sets is around 5hr in audio length. This corpus also provides the n-gram language models and the corresponding texts

333Available at excerpted from the Project Gutenberg books, which contain 803M tokens and 977K unique words.

To prepare a lexicon, we selected 522K words among the 977K unique words that occur more than once in the LibriSpeech texts. Using the base lexicon of the CMUdict

444Available at that covered 81K among the selected words of 522K, we trained a G2P model using the Sequitur tool [26] to cover the out-of-vocabulary words. We used the SRILM tooklit [27] to train n-gram language models (LMs). The 4-gram LM was trained initially on the entire texts available with the modified Kneser-Ney smoothing [28, 29], then pruned to the 3-gram LM. The first-pass decoding was conducted with the 3-gram LM and the resultant lattices were rescored with the 4-gram LM later in the second-pass. The lattices were further rescored [30] with the neural network LMs of three TDNN layers and two LSTM layers being interleaved that were trained by the Kaldi toolkit [31].

We used the Kaldi toolkit for the acoustic modeling as well, mostly following the LibriSpeech recipe555 up to the stage of Speaker Adpative Training (SAT) [32]

. We gradually increased the training data size from 100hrs to 960hrs over the course of the GMM training stages, while the neural network training used the entire 960hr data. We first trained GMMs within the framework of 3-state Hidden Markov Models (HMMs). The conventional 39-dimensional MFCC features were spliced over 9 frames and LDA was applied to project the spliced features onto a 40-dimensional sub-space. Further projection was conducted through MLLT for better orthogonality. SAT was applied with feature-space MLLR (fMLLR) to further refine mixture parameters in GMMs. For the neural network acoustic models, we used the 40-dimensional higher-resolution MFCCs appended with 100-dimensional i-vectors

[33] and trained the models having the lattice alignments given by the SAT-ed GMMs as soft targets. The LF-MMI objective was used to optimized the parameters with the three regularization methods of cross-entropy, and leaky HMM [34]. The exponential decrease of learning rates from to was applied to make the entire training procedure stable and have better convergence. The number of nodes in the final layer was determined by the number of tri-phone states in the HMM, which is 6K after the phonetic tree clustering. The trainings were conducted on the Nvidia V100 servers with 8 GPUs.

The dimensions of query, key, and value embeddings in self-attention are set to , and , and . The bottleneck dimension for the factorized convolutions and the factorized feed-forwards is 128. The number of streams and the number of the 1D convolution layers in each stream is from 1 to 5 and 0 to 7, respectively, depending on the experiment type for the ablation tests.

Stream Config. dev-clean dev-other
Single (1) 4.16 - 10.66 -
1-1-1 3.99 4.09 10.14 4.88
1-1-1-1-1 3.95 5.05 10.02 6.00
Table 1: Effect of multiple streams in WER (%) and WERR (%) on dev-clean and dev-other. WER: word error rate, WERR: WER reduction (relative). Lattice-rescored with the 4-gram LM.

3.2 Ablation tests

This section validates the proposed idea of multi-stream self-attention by conducting the ablation tests.

3.2.1 Effect of multiple streams

As we test the validity of multiple streams in the proposed architecture in a fair comparison, we control the number of multi-heads in the self-attention layer in each stream such that while fixing . For example, if the number of streams is 3 (i.e., ) , then the number of multi-heads in the self-attention layer of each stream would be , thus . This is to rule out the possibility that any performance improvement would come from the significant increase of model complexity.

Table 1 shows the effect of having multiple streams in the proposed architecture. The three entries for the system configurations in the table correspond to and 5, respectively, while fixing the dilation rate of 1 across the streams. (For example, 1-1-1 means the three streams with the fixed dilation rate of 1 for the factorized convolutions of all the streams.) It is noticeable that the more streams we had, the better accuracy we would obtain, but without a diverse selection of dilation rates across the streams, the improvement would be limited, which will be shown more clearly in Table 2.

Stream Config. dev-clean dev-other
Single (1) 4.16 - 10.66 -
1-2-3 3.95 5.05 10.05 5.72
1-3-5 3.99 4.09 10.27 3.66
Single (1) 4.16 - 10.66 -
1-2-3-4-5 3.84 7.69 9.94 6.75
1-3-5-7-9 3.85 7.45 10.03 5.91
3-3-3-3-3 3.84 7.69 10.22 4.13
5-5-5-5-5 3.89 6.49 10.30 3.38
Table 2: Effect of dilation rates in WER (%) and WERR (%) on dev-clean and dev-other. Lattice-rescored with the 4-gram LM.

3.2.2 Effect of dilation rates

Table 2 shows the effect of having diverse dilation rates across the multiple streams of the proposed architecture. The various dilation rates across the streams are shown to help improve WERs by the clear margins. However, just mixing any values would not guarantee the performance improvement. For example, the system configuration of 1-3-5 presents the WER on the dev-other set worse than the configuration of 1-1-1 does in Table 1. It seems that a careful mix of the dilation rates would be critical for the proposed model. We found out the best configuration from the 1-2-3-4-5 setup, which has the 5 different dilation rates (by the difference of 1) for the 1D convolutions across the streams, marking 7.69% and 6.75% WERR on dev-clean and dev-other, respectively, as compared to the single stream baseline. This validates the efficacy of the proposed multi-stream strategy of having 1D convolutions with a unique dilation rate in each stream. The proposed architecture seemingly helps the self-attention mechanism better process embeddings in each stream and lead to more accurate results overall.

Stream Config. Param. dev-clean dev-other
Factorized-FF w/ 128-dim 8M 3.84 9.94
FF w/ 1,024-dim 10.5M 3.80 9.91
FF w/ 2,048-dim 13M 3.83 9.67
Table 3: Effect of factorization in the feed-forward networks of the proposed multi-stream architecture in terms of model comlexity and WER (%) on dev-clean and dev-other. Lattice-rescored with the 4-gram LM.
Stream Config. dev-clean dev-other
No Conv-F 4.36 -7.13 11.86 -10.43
1 Conv-F 4.07 - 10.74 -
3 Conv-F 3.84 5.65 9.94 7.45
5 Conv-F 3.76 7.62 9.75 9.22
7 Conv-F 3.73 8.35 9.49 11.64
Table 4: Effect of 1D convolutions in WER (%) and WERR (%) on dev-clean and dev-other. Lattice-rescored with the 4-gram LM.

3.2.3 Effect of factorized feed-forward

The main purpose of having the factorized feed-forward networks in the proposed architecture is to retain the model complexity within a reasonable boundary even with adding more streams. The proposed multi-stream model architecture, otherwise, would increase the model complexity very easily as we add more streams. In Table 3 where we base the same configuration of 1-2-3-4-5 from Table 2, it is shown that the factorization works as expected. The factorized feed-forward networks not only contribute to the model complexity under 10M parameters but also keep the performance in a similar level with the control group that includes the normal (i.e., without factorization) feed-forward networks with the wider bottleneck layer of 1,024- or 2,048-dimension.

3.2.4 Effect of 1-D convolutions

Table 4 highlights the effect of having the 1D convolutions in the proposed multi-stream self-attention model architecture. It presents the importance of the 1D convolutions preceding the self-attention mechanism in the multi-stream framework. The 7-layer Conv-F leads to roughly 15% and 20% WERR for the dev-clean and dev-other dataset, respectively, against the case of having no convolutions. The pattern seems that the more convolution layers, the more performance gain would be obtained, but after the 7 layers we didn’t observe any significant performance boost.

System dev test
clean other clean other
4-gram 2.65 8.21 2.93 8.32
3-T/2-L (1,024) 2.51 7.13 2.69 7.45
3-T/2-L (2,048) 2.23 6.82 2.39 7.01
3-T/2-L (4,096) 2.14 6.51 2.21 6.73
4-LSTM (2,048) 1.84 5.75 2.20 5.82
Table 5:

Comparison of the best configured model in terms of LMs in WER (%). 3-T/2-L: 3-layer TDNNs & 2-layer LSTMs being interleaved. The numbers in the parentheses indicate the size of neurons.

3.3 Comparison with the other state-of-the-arts

In this section, we configure our best model for the LibriSpeech speech recognition task by stacking the proposed multi-stream self-attention blocks. The chosen configuration of the best ASR system for us is

  • 3 layers of multi-stream self-attention blocks

  • 5 streams of self-attention encoders in each block

  • 7 layers of 1D Conv-F’s in each stream

  • Dilation configuration of 1-2-3-4-5 to 1D CONV-F’s across streams

The total number of parameters for this setup is 23M. The lattice-rescored result of this system with the 4-gram LM is presented in the first row in Table 5.

As for the neural network language models, we trained the models of 3 TDNN layers and 2 LSTM layers (uni-directional) being interleaved. The averaged (relative) error rate reduction by the best neural network language model (with the dimention of 4,096) against the 4-gram LM case ranges from 15% to 20%. The embedding dimension seems to matter in terms of performance improvement. We did not try the bigger dimensions than 4,096 due to the prohibitive model training time. The best model in terms of WER improvement is the one with the 4-layered LSTMs, trained with around 10K word pieces.

Table 6 shows some state-of-the-art system performances. The hybrid DNN/HMM approach was used for the different neural network acoustic models of CNN-biLSTM, pyramidal Feedforward Sequential Memory Network (FSMN), and BiLSTM, respectively, in [35, 36, 37]. The Transformer LM was exploited as the best rescoring LM in [37]. In [38, 39] the end-to-end framework was considered. Time Depth Separable (TDS) convolutions were introduced in [38] while the cut out of spectrograms was applied on the fly during training to enhance the noise robustness. In [40], the full Transformer model was studied comparatively in various speech tasks. As compared to the other state-of-the-art system performances, it is shown that we achieve the best performances on both of the dev-clean and test-clean set, while on the dev-other and test-other set the lowest WERs are presented in [37]. The WERs of 1.8% and 2.2% on the datasets by the proposed system are the best numbers thus far reported in the literature.

System dev test
clean other clean other
Han, et al. [35] 3.1 8.3 3.5 8.6
Hannun, et al. [38] 3.0 8.9 3.3 9.8
Yang, et al. [36] 2.6 7.5 3.0 7.5
Karita, et al. [40] 2.2 5.6 2.6 5.7
Park, et al. [39] - - 2.5 5.8
Lüscher, et al. [37] 1.9 4.5 2.3 5.0
Proposed 1.8 5.8 2.2 5.8
Table 6: Comparison with the other state-of-the-art systems in WER (%).

4 Conclusions

We proposed the multi-stream self-attention model architecture preceded by the layers of the factorized 1D convolutions with the unique dilation rate in each stream. This architecture allows input speech frames to be efficiently processed and effectively self-attended. We validated the proposed ideas by performing the ablation tests, and also configured the state-of-the-art ASR system by stacking the multi-stream self-attention model blocks with the strong neural network language models. The WERs on the dev-clean and test-clean set of 1.8% and 2.2% are the best reported numbers found in the literature.

Note that the proposed system has only 23M parameters. The other systems in Table 6 have much higher model complexity, for example, 100M for CNN-biLSTM [35] or 200M for LAS in [39]. This could make our system more practical and appealing to speech engineers or practitioners who would like to deploy ASR models as a service on devices with the limited computing power. Of course, the practicality of the proposed model on on-device ASR would rely upon the usage of much lighter LM models.

We plan to further explore the self-attention mechanism to make it more suitable for speech applications. To enhance the noise robustness especially on the dev-other and test-other data of the LibriSpeech corpus, we will continue to work on the data augmentation side. In this paper, we used speed perturbation as the data augmentation strategy and it is worth considering other types of augmentation such as data dropout like the cut out method used in [39].