ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

by   Jing Pan, et al.

In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has a unique dilation rate for diversity. Trained with the SpecAugment data augmentation method, it achieves relative word error rate (WER) improvements of 4% on test-clean and 14% on test-other. We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.




1 Introduction

Deep neural networks (DNNs) have brought revolutionary changes to the landscape of speech recognition research. Since their advent [1], DNNs ranging from LSTMs [2] to Transformers based on multi-headed self-attention [3] have contributed to inching the performance of speech recognition systems closer to human-level accuracy.

In [4], human transcribers were given the evaluation data (also known as HUB5 eval2000) of the NIST 2000 Evaluation Challenge of Conversational Telephone Speech (CTS) [5]. Their average error rate on the Switchboard (SWBD) portion of the HUB5 eval2000 set was 5.9%. In the same paper an ASR system was proposed, fusing variants of deep CNNs such as VGG [6] or ResNet [7] with layer-wise context expansion and attention (LACE) [8] as well as bi-directional LSTM (bLSTM) trained with the lattice-free MMI (LF-MMI) loss [9]. The system was claimed to have surpassed the human level of accuracy, reaching 5.8% WER on the SWBD eval set. In [10], a similar experiment measuring human speech recognition performance was conducted, suggesting a new human-level error rate of 5.3% on the SWBD eval set. Their proposed ASR system reached 5.5% WER, combining ResNets and bLSTMs with speaker-adversarial multi-task learning. The WER on SWBD has been further pulled down to the state-of-the-art (SOTA) level of 5.1% thanks to various novel DNN architectures, such as CNN-bLSTMs [11, 12], highway LSTMs [13], and densely connected LSTMs [14, 15].

Another active area of research in speech recognition focuses on the well-known LibriSpeech corpus [16], where roughly 1,000 hours of spoken utterances were collected from audio books. Unlike in the SWBD evaluation challenges, a number of end-to-end (E2E) ASR systems have been competitive on the LibriSpeech test sets, even exceeding the performance of hybrid HMM/DNN ASR systems. Listen, Attend and Spell (LAS) [17], which uses a sequence-to-sequence architecture, is the most representative E2E model for ASR. It consists of a pyramidal structure of bLSTMs for the encoder with content-aware attention. Using the SpecAugment method for data augmentation [18], the LAS-based ASR system surpassed the previously reported accuracy on the LibriSpeech test sets, reaching WERs of 2.5% and 5.8% on test-clean and test-other, respectively.

More recent systems, both E2E and hybrid, utilized improvements from powerful Transformer models [3] for both AM and LM [19, 20, 21, 22]. In [19], a hybrid AM with sequence discriminative training was boosted by Transformer LM rescoring. [20] applied a multistream architecture to the hybrid ASR setting, where in each stream the multi-headed self-attention layer, following the shallow layers of TDNN-F [23], was modified by factorizing the feed-forward layer inside. In [21], E2E Transformer models with a sequence-to-sequence loss were trained for both AM and LM, and the outputs of the E2E models were rescored with a Transformer LM as well as a gated CNN (GCNN) LM [24]. Interestingly, 60k hours of extra audio data [25, 26] were used for the semi-supervised AM update to further boost the accuracy of the proposed system. In [22], a deep Transformer architecture was analysed for AM with an iterated loss [27] in the hybrid ASR framework. Recently, a CNN-RNN-Transducer architecture was introduced [28], demonstrating even further improvement using Transformer models on the LibriSpeech benchmark.

This paper presents new benchmark results of 1.75% and 4.46% WER on test-clean and test-other in LibriSpeech, respectively, thanks to novel neural network architectures for AM and LM: multistream CNN [29] and self-attentive simple recurrent unit (SRU). The multistream CNN acoustic model, inspired by [20] but without the multi-headed self-attention layers, processes input speech frames in multiple parallel pipelines where each stream has a unique dilation rate for the convolution kernels of CNNs for diversity. Trained with SpecAugment, it achieves relative WER improvements of 4% on test-clean and 14% on test-other. We further improve the performance with N-best rescoring using a 24-layer self-attentive SRU language model. SRU was proposed in [30] for higher parallelization in recurrence computation. Our variant adds self-attention to the original SRU, not only to replace some of the linear operations in the computation but also to enhance context modeling capability. We rescore the N-best outputs of the lattices that were previously rescored with the TDNN-LSTM language model trained by the Kaldi toolkit [31, 32]. The average relative WER improvement by the self-attentive SRU LM is around 23% on both of the test sets.

This paper is organized as follows. In Section 2, we describe the details of our ASAPP-ASR system focusing on the neural network architectures for AM and LM as well as the LM rescoring strategies leveraged. In Section 3, we provide the experimental setup and discuss the results from various configurations in the proposed system. In Section 4, we conclude the paper with summary remarks and future directions.

Figure 1: Schematic diagram of the multistream CNN acoustic model architecture.

2 ASAPP-ASR: System Descriptions

2.1 Multistream CNN for Acoustic Modeling

For robust acoustic modeling, we leverage the benefits of multistream CNNs [29] (illustrated in Figure 1 above). This novel neural network architecture accommodates diverse temporal resolutions in multiple streams to achieve robustness. For diverse temporal resolution, it considers stream-specific dilation rates on TDNN-F [23], a variant of 1D-CNN. Each stream stacks narrower TDNN-F layers whose kernel has a unique dilation rate when processing input speech frames in parallel. The dilation rate for the TDNN-F layers in each stream is chosen from multiples of the default subsampling rate (3 frames). This offers a seamless integration with the training and decoding process where input speech frames are subsampled. With SpecAugment [18], multistream CNNs provide additional robustness against challenging audio, such as the “other” sets in LibriSpeech.

In the proposed architecture, we position 5 layers of 2D-CNNs in a single-stream fashion to process the input log-mel spectrogram before multiple streams are branched out. The 2D-CNN layers use a filter size of 256, except for the first layer, which uses a filter size of 128. In every other 2D-CNN layer, we apply frequency band subsampling with a rate of 2. In the multistream structure, each stream rolls out 17 TDNN-F layers, where each TDNN-F consists of two factorized convolution matrices, followed by a skip connection, batch normalization and a dropout layer, with 512-dimensional neurons. Consider the embedding vector $x_i$ coming out of the single-stream 2D-CNN layers at a given time step $i$. The output vector $y_i^m$ from stream $m$, going through the stack of TDNN-F layers with the dilation rate $r_m$, can be written as

$$y_i^m = \text{Stacked-TDNN-F}_m\left(x_i \mid [-r_m, r_m]\right) \quad (1)$$

where $[-r_m, r_m]$ denotes a kernel with the dilation rate $r_m$. The output embeddings from the $M$ streams are then concatenated, and followed by ReLU, batch normalization and a dropout layer:

$$z_i = \text{Dropout}\left(\text{BatchNorm}\left(\text{ReLU}\left(\text{Concat}\left(y_i^1, y_i^2, \ldots, y_i^M\right)\right)\right)\right) \quad (2)$$

This embedding vector is projected onto the output layer via a couple of fully connected layers at the end of the network. We employ 3 streams with the dilation configuration of 6-9-12, where the dilation rates of 6, 9 and 12 are applied to the TDNN-F layers across the 3 streams.
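To make the effect of the stream-specific dilation rates concrete, the following sketch estimates how much temporal context each stream covers in the 6-9-12 configuration. It is only an illustration under stated assumptions: every TDNN-F layer is assumed to look at the offsets -r, 0 and +r (counted in frames at the TDNN-F input rate), all 17 layers of a stream share the same rate, and the context added by the 2D-CNN front end is ignored. The function names are hypothetical.

```python
def stream_context(dilation, num_layers=17):
    """Frames of temporal context covered by one stream.

    Assumption: each layer uses the offsets (-dilation, 0, +dilation), so it
    widens the receptive field by 2 * dilation frames.
    """
    return 1 + 2 * dilation * num_layers


def multistream_contexts(dilations=(6, 9, 12), num_layers=17):
    """Map each stream's dilation rate to its receptive field in frames."""
    return {rate: stream_context(rate, num_layers) for rate in dilations}


if __name__ == "__main__":
    for rate, frames in multistream_contexts().items():
        print(f"dilation {rate:2d}: {frames} frames of context")
```

Under these assumptions the three streams see clearly different context widths, which is the kind of multi-resolution diversity that the concatenation step then merges into a single embedding.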

2.2 Self-attentive SRU for Language Modeling

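We train our LMs using a variant of the SRU architecture proposed by [30]. Given an input sequence $X = \{x_1, \ldots, x_L\}$ where each $x_t \in \mathbb{R}^d$ represents a feature vector, a single layer of SRU involves the following recurrence computation:

$$
\begin{aligned}
f_t &= \sigma\left(U_t[0] + v \odot c_{t-1} + b\right) \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot U_t[1] \\
r_t &= \sigma\left(U_t[2] + v' \odot c_{t-1} + b'\right) \\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t
\end{aligned}
$$

where $\sigma$ is the sigmoid activation, $h_t$ is the output state at step $t$, and $c_t$ and $f_t$, $r_t$ are the internal hidden state and sigmoid gates at step $t$, respectively. The vectors $U_t$, whose $k$-th $d$-dimensional slice is written $U_t[k]$, are computed using a linear projection:

$$U^\top = \begin{pmatrix} W \\ W' \\ W'' \end{pmatrix} X^\top \quad (3)$$

given three parameter matrices $W$, $W'$, $W''$ of the SRU layer. Compared to other recurrent networks such as LSTM and GRU, SRU adopts element-wise hidden-to-hidden connections $v \odot c_{t-1}$ and $v' \odot c_{t-1}$. As a consequence, each of the hidden dimensions becomes independent and can be executed in parallel, achieving much faster training speed.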

Several variants of SRU architecture have been successfully employed in speech models [33, 34, 35]. In this work, we use a self-attentive variant to enhance context modeling capacity by substituting the linear operation (Eq. 3) with the multi-head attention operation originally proposed in [3].
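As a minimal sketch of why the SRU recurrence parallelizes well, the code below implements one layer of the original SRU of [30] in plain Python. It covers only the linear-projection form, not the self-attentive variant used in this work, and the argument names (`W`, `W_f`, `W_r`, `v_f`, `v_r`, `b_f`, `b_r`) are chosen here for illustration. The matrix products depend only on the inputs, so a real implementation would compute them for all time steps at once; only the element-wise loop is sequential.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a (d x d) matrix, given as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sru_layer(inputs, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """Run one SRU layer over a sequence of d-dimensional vectors.

    The projections W x_t, W_f x_t and W_r x_t do not depend on the previous
    state; the recurrence itself touches each hidden dimension independently.
    """
    d = len(inputs[0])
    c = [0.0] * d  # initial internal state
    outputs = []
    for x in inputs:
        u, u_f, u_r = matvec(W, x), matvec(W_f, x), matvec(W_r, x)
        # Both gates read the previous state c_{t-1} element-wise.
        f = [sigmoid(u_f[i] + v_f[i] * c[i] + b_f[i]) for i in range(d)]
        r = [sigmoid(u_r[i] + v_r[i] * c[i] + b_r[i]) for i in range(d)]
        # State update, then highway-style mix of state and input.
        c = [f[i] * c[i] + (1.0 - f[i]) * u[i] for i in range(d)]
        h = [r[i] * c[i] + (1.0 - r[i]) * x[i] for i in range(d)]
        outputs.append(h)
    return outputs
```

In the self-attentive variant described above, the projection step would be replaced by an attention operation, while the element-wise recurrence would stay as in this sketch.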

2.3 Language Model Rescoring

We employ multiple stages of LM rescoring in order to obtain the minimum WERs. The initial decoding is based on the decoding graph constructed from the multistream CNN AM and a 3-gram LM, resulting in the initial hypotheses in a lattice format. Lattice rescoring is done with a larger 4-gram LM, followed by a second-pass lattice rescoring with the TDNN-LSTM language model [32]. In the final rescoring stage, we use an interpolated self-attentive SRU LM. We linearly interpolate two self-attentive SRU models, one trained on word pieces using byte-pair encoding (BPE) and the other trained at the word level. With this interpolation, we re-rank the N-best hypotheses from the lattices rescored by the TDNN-LSTM LM in the previous stage. In our experiments, we empirically keep N at 100.

In the final stage of rescoring, the N-best hypotheses are re-ranked by the combination of an estimated AM and LM likelihood for a hypothesized word sequence $W$ given acoustic features $A$,

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} \; p_{AM}(A \mid W) \, p_{LM}(W) \quad (4)$$

where $p_{AM}$ is the AM likelihood estimate, $p_{LM}$ is the LM likelihood estimate, which can be further detailed as below,


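$$p_{LM}(W) = p_{sru}(W \mid \alpha)^{\beta} \cdot p_{rnn}(W)^{\gamma} \quad (5)$$

where $p_{sru}$ is the likelihood estimate from the interpolated SRU LM given an interpolation weight $\alpha$ for the BPE SRU model (i.e., $1 - \alpha$ for the word SRU LM), $p_{rnn}$ is the likelihood estimate from the TDNN-LSTM LM, and $\alpha$, $\beta$ and $\gamma$ are the hyperparameters, which we optimize through a grid search on the dev data in LibriSpeech.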

We further refine the ranking with the minimum expected word error objective [36], defined by:

$$\hat{W} = \arg\min_{W} \mathbb{E}_{P(W' \mid A)}\left[\mathcal{E}(W, W')\right] = \arg\min_{W} \sum_{W'} \bar{P}(W' \mid A)\, \mathcal{E}(W, W') \quad (6)$$

where $\bar{P}(W' \mid A)$ is the normalized term of $P(W' \mid A)$ and $\mathcal{E}(W, W')$ is the WER measure of $W$ with $W'$ as reference. Applying the expected word error minimization can weaken a bias on the sentence-level likelihood maximization and shift the ranking focus toward the local word-level accuracy. In order to reduce the computational complexity, we first rank the N-best hypotheses by the sentence-level likelihood maximization (Eq. 4), and then update the rank of the top 20 hypotheses by the minimum expected word error (Eq. 6).
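The two-step re-ranking described in this section can be sketched as follows. This is an illustrative toy, not the system's implementation: hypotheses are assumed to be dictionaries with hypothetical keys `words`, `am` and `lm` holding log-domain scores, the LM weight is a placeholder, the interpolation of the BPE and word-level models is assumed to happen on sentence-level probabilities, and the error measure is the raw word edit distance rather than a length-normalized rate.

```python
import math


def interpolate_log_probs(logp_bpe, logp_word, lam):
    """log(lam * p_bpe + (1 - lam) * p_word), computed stably; needs 0 < lam < 1."""
    a = math.log(lam) + logp_bpe
    b = math.log(1.0 - lam) + logp_word
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def combined_score(hyp, lm_weight=1.0):
    """Log-domain AM score plus weighted LM score (weight is illustrative)."""
    return hyp["am"] + lm_weight * hyp["lm"]


def rerank(nbest, lm_weight=1.0, top_k=20):
    """Pick the hypothesis with minimum expected word error among the top_k."""
    # Step 1: sentence-level ranking by the combined likelihood.
    ranked = sorted(nbest, key=lambda h: combined_score(h, lm_weight), reverse=True)
    top = ranked[:top_k]
    # Step 2: normalize the shortlisted scores into posteriors.
    best = combined_score(top[0], lm_weight)
    weights = [math.exp(combined_score(h, lm_weight) - best) for h in top]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    def expected_error(candidate):
        # Every shortlisted hypothesis takes a turn as the reference.
        return sum(
            p * edit_distance(candidate["words"], other["words"])
            for p, other in zip(posteriors, top)
        )

    return min(top, key=expected_error)
```

The second step can overturn the first: a hypothesis that is not the single most likely one may still be selected if it sits close, in word edits, to most of the probability mass in the shortlist.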

3 Experimental Setup and Results

3.1 LibriSpeech

We conduct the experiments on the LibriSpeech corpus [16], which is a collection of approximately 1,000 hours of read speech (16 kHz) from the audio books that are part of the LibriVox project [26]. The training data is split into 3 partitions of 100, 360, and 500 hours, while both the dev and test data are split into ‘clean’ and ‘other’ categories, where each category contains around 5 hours of audio. The corpus provides extra written texts of 800M words for LMs. We normalize them to correct typos as well as spelling inconsistencies between British and American English. The same normalization is applied to all the text transcripts of the training, dev and test sets for consistency.

3.2 Acoustic Models and Non-SRU Language Models

We follow the conventional steps to train hybrid GMM/HMM acoustic models using the default Kaldi recipe for LibriSpeech, up to the point where a triphone model is trained with speaker-adaptive training (SAT) with feature-space MLLR (fMLLR) to further refine Gaussian mixture parameters [37]. The alignment of this model is used as the reference label for neural network model training. The multistream CNN AM described in Section 2.1 is trained on the total 960hr training set with the LF-MMI loss, using a decaying learning rate over the span of 6 epochs. The mini-batch size is 64.

To prepare a lexicon, we select the most frequently used 200K words from the 800M word text and add out-of-vocabulary words to the original lexicon provided by the LibriSpeech corpus with the CMU phoneset, resulting in a 203K word list in total. We train a G2P model using the Sequitur tool [38] to generate pronunciations for the out-of-vocabulary words.

We use the PocoLM toolkit to train n-gram LMs by modifying the default recipe for the Switchboard corpus. A 4-gram LM is trained on the 800M word text as well as the entire text transcripts of the 960hr training data, which contain around 10M words. This LM is pruned to a 3-gram, which is used for the 1st-pass decoding. The 4-gram LM is used for n-gram LM rescoring.

The TDNN-LSTM language model is trained on the aforementioned combined text, totaling 810M words, with the default Kaldi RNNLM recipe for LibriSpeech. We modify the dimension of embedding to 4,096 to increase the representational power of contexts in word sequences.

3.3 Self-Attentive SRU Language Models

3.3.1 Dataset

We construct our dataset for language modeling by combining the normalized corpus of 800M words with the text transcripts from the 960hr training data, for a total of about 810M words. All of our self-attentive SRU language models are trained at the utterance level (i.e., the model does not leverage any context past sentence boundaries), with a maximum sequence length of 275 tokens. We train a new 10K BPE vocabulary for our model. We limit the maximum sequence length only during training, not when computing dev set perplexity or when rescoring the N-best hypotheses produced by the acoustic model with the TDNN-LSTM LM. We report perplexity numbers on dev-clean and dev-other, which include start and end of sentence tokens for each utterance.

3.3.2 Model configuration

All the self-attentive SRU LMs are trained using a hidden dimension of 2,048 and a projected dimension of 512 for the self-attention layer. We use no dropout. Optimization is done with the RAdam optimizer [39] using a cosine annealing learning rate schedule. We train SRU models of 12 and 24 layers with slightly different architectures. The 12-layer model is trained across 8 Tesla V100 GPUs with a total batch size of 192 for 10 epochs. For this model, we use an embedding size of 2,048, tie the input and output weights, and use single-headed attention in the self-attentive modules. The 24-layer model is trained on 8 Quadro RTX 8000 GPUs with a total batch size of 512 for 12 epochs. For this model, we use an embedding size of 512, do not tie weights, and use 2 heads in the self-attentive modules.

| Model | Layers | # Params | Dev clean | Dev other |
|---|---|---|---|---|
| Transformer | 12 | 74M | 37.7 | 39.5 |
| SRU | 12 | 77M | 36.2 | 38.3 |
| SRU | 24 | 139M | 34.3 | 36.8 |

Table 1: BPE-level perplexities on dev-clean and dev-other for 12-layer Transformer, 12-layer SRU and 24-layer SRU LMs. We include start and end of sentence tokens on each utterance.

3.3.3 Results

Table 1 shows the perplexities obtained on the dev sets. With our 12 and 24-layer models, we achieve dev-clean perplexities of 36.2 and 34.3 respectively. In the first third of training we see the most improvement, with combined dev set perplexities at 40 for the 12-layer model and 37 for the 24-layer model.
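For reference, the perplexity reported here is the exponential of the average negative log-likelihood per token. The sketch below shows that computation over utterance-level scoring, assuming natural-log token probabilities and that whichever sentence-boundary tokens are scored are already included in each utterance's list. The function name and input layout are hypothetical.

```python
import math


def perplexity(utterance_log_probs):
    """Corpus-level perplexity from per-token natural-log probabilities.

    Each inner list holds the log-probabilities of one utterance's scored
    tokens, including its sentence-boundary tokens.
    """
    total_log_prob = sum(sum(utt) for utt in utterance_log_probs)
    num_tokens = sum(len(utt) for utt in utterance_log_probs)
    return math.exp(-total_log_prob / num_tokens)
```

Because the average is taken over all tokens in the set rather than per utterance, longer utterances weigh proportionally more in the final number.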

For comparison, we also train a 12-layer Transformer model on the 10K BPE vocabulary. We use a model dimension of 768, a feedforward dimension of 2,048, and 8 attention heads. These parameters were chosen because they result in a network with a number of parameters comparable to the 12-layer SRU model. All other settings, such as weight tying, mirror the 12-layer SRU model. A cosine annealing learning rate schedule is used, with a linear warmup for the first 20,000 steps. By adding a self-attentive module between SRU layers, we are able to achieve better perplexity than the Transformer with a comparable number of parameters.
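The schedule used for the Transformer baseline, a linear warmup followed by cosine annealing, can be sketched as below. Only the 20,000 warmup steps come from the text; the base learning rate and the total number of steps are hypothetical arguments, and annealing down to zero is an assumption of this sketch.

```python
import math


def learning_rate(step, base_lr, total_steps, warmup_steps=20_000):
    """Linear warmup to base_lr, then cosine annealing toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the annealing phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(10_000, 1e-3, 1_000_000)` falls in the warmup phase and returns half of the base rate.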

3.3.4 Analysis

We show that our proposed self-attentive SRU not only improves performance over the Transformer architecture but also converges faster. As shown in Figure 2, the 12-layer Transformer model reaches a perplexity of 40 in just under 1.2M steps, while the SRU model needs only 622K training steps, which amounts to roughly a 2x training speedup. In practice this reduced training time by almost 2 days, allowing for faster iteration and greater exploration.

Additionally, we show that the perplexity improvement achieved by the 12-layer SRU model transfers directly to a WER improvement when used for candidate rescoring. In Table 2 we compare WERs when using either the SRU model or the Transformer for the final stage of N-best rescoring, fixing N at 100. On all dev and test sets the 12-layer SRU achieves a lower WER than the Transformer model.

Figure 2: Dev perplexity curves of 12-layer SRU and Transformer models. Vertical lines signify when each model reaches a perplexity of 40. Here perplexity is reported on the combination of dev-clean and dev-other.

| Rescoring Model | Dev clean | Dev other | Test clean | Test other |
|---|---|---|---|---|
| 12-layer Transformer | 1.62 | 4.41 | 1.96 | 4.70 |
| 12-layer SRU | 1.59 | 4.38 | 1.93 | 4.62 |

Table 2: WER (in %) comparison of 12-layer Transformer and 12-layer SRU for N-best rescoring.

3.4 Results and Analysis

Table 3 shows the WER comparison of different experimental setups for the multistream CNN AM and staged rescoring with various LMs. The TDNN-F + 4-gram setup is a baseline with the TDNN-F acoustic model of the Kaldi LibriSpeech recipe rescored with our custom 4-gram LM. Compared to this baseline, multistream CNN achieves a relative WER improvement of 14% on test-other, demonstrating its robustness. The lattice rescoring with the TDNN-LSTM language model further reduces the WER from 2.80% to 2.34% on test-clean and from 7.06% to 6.04% on test-other, showing the better modeling capability of a neural language model.

Regarding self-attentive SRU LMs, we first perform N-best rescoring using the 24-layer BPE SRU model. The language model likelihood is re-estimated by linearly interpolating the TDNN-LSTM and SRU LM. BPE-based LMs can help mitigate out-of-vocabulary issues from word-based models. Also, interpolating LMs with different levels of capacity has proven beneficial to WER reduction in practice. These benefits are reflected in the relative WER improvement of around 23% on the test sets over TDNN-LSTM rescoring alone. When we interpolate the BPE SRU model with the word-level SRU LM, we obtain a slight further improvement, from 1.83% to 1.79% on test-clean and from 4.57% to 4.49% on test-other. Finally, we re-rank the interpolated SRU results by minimizing the expected WER, which lowers the WERs further to 1.75% and 4.46%.

Table 4 compares the WERs between our proposed system and other benchmark systems in the literature. Except on the test-other set, where [28] reports a lower WER, we outperform the other systems in the group by noticeable margins. In comparison with our previous results [20], thanks to multistream CNN for acoustic modeling and multiple stages of LM rescoring with powerful self-attentive SRU language models, we achieve relative WER improvements of 20% and 23% on test-clean and test-other, respectively.

| Setup | Dev clean | Dev other | Test clean | Test other |
|---|---|---|---|---|
| TDNN-F + 4-gram | 2.75 | 8.16 | 2.93 | 8.17 |
| Multistream CNN | 2.62 | 6.78 | 2.80 | 7.06 |
| +TDNN-LSTM LM | 2.14 | 5.82 | 2.34 | 6.04 |
| +24-layer SRU | 1.56 | 4.28 | 1.83 | 4.57 |
| +Interpolated SRU | 1.56 | 4.25 | 1.79 | 4.49 |
| +Expected Word Error Minimization | 1.55 | 4.22 | 1.75 | 4.46 |

Table 3: WER (in %) comparison among different setups.

| Systems | Dev clean | Dev other | Test clean | Test other |
|---|---|---|---|---|
| Park, et al. [18] | - | - | 2.5 | 5.8 |
| Synnaeve, et al. [21] w/o semi-supervision | 2.10 | 4.79 | 2.33 | 5.17 |
| Luscher, et al. [19] | 1.9 | 4.5 | 2.3 | 5.0 |
| Wang, et al. [22] | - | - | 2.26 | 4.85 |
| Han, et al. [20] | 1.84 | 5.75 | 2.20 | 5.82 |
| Zhang, et al. [40] | - | - | 2.0 | 4.6 |
| Han, et al. [28] | - | - | 1.9 | 4.1 |
| ASAPP-ASR | 1.55 | 4.22 | 1.75 | 4.46 |

Table 4: WER (in %) comparison among different systems.
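The relative improvements quoted in this section follow directly from the test WERs in Table 3. A small sketch of that arithmetic is given below; the numbers are copied from Table 3, while the helper name and data layout are only illustrative.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer over baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (setup, test-clean WER, test-other WER) in %, copied from Table 3.
TEST_WERS = [
    ("TDNN-F + 4-gram", 2.93, 8.17),
    ("Multistream CNN", 2.80, 7.06),
    ("+TDNN-LSTM LM", 2.34, 6.04),
    ("+24-layer SRU", 1.83, 4.57),
    ("+Interpolated SRU", 1.79, 4.49),
    ("+Expected Word Error Minimization", 1.75, 4.46),
]

if __name__ == "__main__":
    # Compare each setup with the one listed immediately before it.
    for (_, prev_clean, prev_other), (name, clean, other) in zip(TEST_WERS, TEST_WERS[1:]):
        print(
            f"{name}: {relative_improvement(prev_clean, clean):.1f}% (test-clean), "
            f"{relative_improvement(prev_other, other):.1f}% (test-other)"
        )
```

The same helper applied to the test WERs of [20] in Table 4 (2.20% and 5.82%) against the final system (1.75% and 4.46%) gives the 20% and 23% figures quoted for the comparison with our previous results.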

4 Conclusions

In this work, we proposed a hybrid ASR system that combines a novel acoustic model architecture, multistream CNN, and an efficient language model, self-attentive SRU. Through the multiple stages of LM rescoring and the expected word error minimization for N-best hypotheses re-ranking, we achieved a new state-of-the-art result on test-clean and competitive performance on test-other in the popular LibriSpeech benchmark. Multi-resolution processing in the multistream CNN architecture demonstrated its robustness on test-other, and the self-attentive variant of SRU achieved better perplexity and rescoring WER than a Transformer LM of comparable size.

We will continue improving the robustness of our acoustic model through efficient use of a deep CNN architecture and further optimization of data augmentation methods in training. With the promising results presented by the self-attentive SRU in language modeling, we also plan to leverage similar modeling capacity from SRUs in acoustic modeling in the framework of end-to-end ASR.