
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation

We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both system architectures. Both hybrid DNN/HMM and attention-based systems employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM and Transformer based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best knowledge of the authors, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and % relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures.





1 Introduction

Over the last years, automatic speech recognition (ASR) systems have improved significantly. Especially the rise of deep neural networks (NNs) has accelerated this development immensely [1]. Convolutional NNs and recurrent NNs are the state-of-the-art architectures for most ASR tasks. State-of-the-art systems are largely based on the standard hybrid deep neural network (DNN) architecture. However, the general progress in deep learning/machine learning has also triggered a diversification of ASR architectures into a series of so-called end-to-end approaches. Most notably, this includes the attention-based encoder-decoder architecture, for which good performance has been reported on a number of tasks, including the LibriSpeech task.

The LibriSpeech task comprises English read speech data based on the LibriVox project [2]. Previous results on LibriSpeech using hybrid models are presented in [3, 4]. While [3] uses a Gaussian mixture model/hidden Markov model (GMM/HMM) as the basis for their system, further training is conducted with a hybrid DNN/HMM with a densely connected topology. The densely connected NNs in [3] are composed of different types of NN layers: convolutional NN, time-delay NN (TDNN) and bi-directional LSTM. Lattice-free maximum mutual information (LF-MMI) is applied during training. A recurrent NN language model (LM) is used for rescoring. The final best result in [3] is achieved with a system combination of eight systems. In [4], a lattice-free sMBR training method is used.

End-to-end results on LibriSpeech were presented in [5, 6, 7, 8, 9]. The end-to-end approach in [8] uses the raw waveform and a convolutional NN acoustic model with gated linear units. An end-to-end attention-based encoder-decoder approach with a pretraining scheme is presented in [5, 6]. In [7], a training procedure based on edit distance for sequence-to-sequence model optimization is presented. An exploration of target units (phoneme, grapheme and word-piece) in relation to training size was performed in [9]. A data augmentation method called SpecAugment was presented in [10].

So far, while end-to-end approaches show competitive performance, they are outperformed by hybrid approaches. We compare the conventional hybrid DNN/HMM approach on phone level to the encoder-decoder-attention model, which operates directly on the word or sub-word level and is thus often referred to as an end-to-end model. In addition, we use word-level and subword-level neural language models to further improve the performance of both systems. We describe the development of our hybrid system and show which factors were especially important for its performance. To the best of our knowledge, the results obtained on the LibriSpeech task reflect state-of-the-art performance for both hybrid and attention-based modeling, with a clear margin still for hybrid DNN/HMM modeling when no data augmentation scheme is applied.

2 Hybrid system

2.1 Acoustic model

2.1.1 GMM/HMM system

We use 16-dim. MFCCs with first and second order derivatives, plus energy features, as input for the GMM/HMM system. The transition probabilities are set manually and applied to all HMMs.

The first step is a linear time alignment, where the features are uniformly distributed over the audio. Parameter estimation based on the linear time alignment is repeated five times. Afterwards, we alternate between a non-linear time alignment, which improves the alignment, and parameter estimation. Initially, this process was iterated 10 times. Increasing the number of iterations showed consistent improvement, so we continued adding training iterations until the WER converged. Training of a state-tied triphone GMM/HMM model is the following step. The states are tied using a phonetic classification and regression tree (CART). We experimented with different numbers of CART labels ranging from 9k to 20k plus silence (Table 2). All state-tied triphones use three HMM states. We switch the input features from 16-dim. MFCCs with derivatives to a context window of MFCC features, resulting in a 144-dim. feature vector on which linear discriminant analysis (LDA) is performed. The LDA output has a dimension of 48. After training the state-tied triphone GMM/HMM model, vocal tract length normalization (VTLN) is applied, followed by speaker adaptive training (SAT) with constrained maximum likelihood linear regression (CMLLR) to adapt the Gaussian mixture model parameters to a speaker. After adapting the parameters, a realignment is performed.
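The feature dimensions above can be illustrated with a small sketch. It assumes that the 144-dim. vector comes from stacking a window of 9 consecutive 16-dim. MFCC frames (9 x 16 = 144); the window size is inferred from the dimensions and is not stated in the text. The function names are hypothetical, and the projection matrix is a placeholder standing in for an estimated LDA transform.

```python
def splice_frames(features, context=4):
    """Stack each frame with `context` neighbours on both sides.

    Assumption: a symmetric window of 2 * context + 1 = 9 frames,
    with edge frames repeated at the utterance boundaries.
    """
    num_frames = len(features)
    spliced = []
    for t in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), num_frames - 1)  # clamp at the edges
            window.extend(features[idx])
        spliced.append(window)
    return spliced


def project(frames, matrix):
    """Apply a linear projection; `matrix` holds one weight row per output dim."""
    return [[sum(x * w for x, w in zip(frame, row)) for row in matrix]
            for frame in frames]


# Toy input: 20 frames of 16-dim. "MFCC" features (made-up values).
mfcc = [[float(t)] * 16 for t in range(20)]
spliced = splice_frames(mfcc)                       # 20 frames x 144 dims

# Placeholder for an estimated LDA matrix (48 output dims x 144 input dims).
lda_matrix = [[1.0 / 144] * 144 for _ in range(48)]
reduced = project(spliced, lda_matrix)              # 20 frames x 48 dims
```

In a real system the projection would be estimated from class-labelled training frames; the sketch only shows how the dimensions fit together.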

2.1.2 Hybrid DNN/HMM system

The NN acoustic model architecture is a bi-directional LSTM [11, 12]. This architecture achieves good performance in acoustic modelling [12, 13]. For the hybrid DNN/HMM system we extract several different features: 16-dim. MFCCs with derivatives and energy, 48-dim. features from the triphone system, and Gammatone filters [14] with 25, 50 or 100 dimensions. The extracted 50-dim. Gammatone features had the best performance. All features are used as input to the bi-directional LSTM along with the alignments generated by the GMM/HMM system. We continue to use CART labels for the state-tied phones. The same range of CART labels was used for experimentation. The network topology consists of six bi-directional LSTM layers with 1000 units each for the forward and the backward direction. We experimented with smaller bi-directional LSTM sizes (number of layers and number of units per layer) but found them to perform worse. The output layer is a softmax layer with one output unit per CART label. The frame-wise cross-entropy loss criterion and Adam optimization with Nesterov momentum (Nadam) are used for the mini-batch training of the network [15, 16].

Newbob learning rate scheduling [13] is applied to control the learning rate reduction with a learning rate decay rate of . regularization was used to prevent overfitting. The hyperparameter was set to . Further regularization is done with dropout [17]. We experimented with dropout in the range of % – % and found a dropout of % to work best for us. Gradient noise [18] with a variance of was employed. We experimented with different learning rates and batch sizes in various combinations. So far a batch size of and a learning rate of have shown the best performance. Additionally, learning rate warm-up proved to be helpful. We start with a learning rate of and increase the learning rate to over the first ten subepochs. A subepoch is 1/40th of the training data. The training data is seen times.

During decoding, the LM scale is an important hyperparameter which directly affects the WER. We found a scale between worked best for us.
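The learning rate handling described above can be sketched as follows. This is a simplified illustration, not the RETURNN implementation: the function names are hypothetical, the warm-up is assumed to be linear, and all numeric values in the example are made up because the actual values are not given in the text. The Newbob-style rule shown here reduces the learning rate by a decay factor whenever the relative improvement of the validation error falls below a threshold.

```python
def warmup_learning_rate(subepoch, start_lr, peak_lr, warmup_subepochs=10):
    """Linear warm-up from start_lr to peak_lr over the first subepochs.

    `subepoch` is 1-indexed. Assumption: the increase is linear.
    """
    if subepoch >= warmup_subepochs:
        return peak_lr
    step = (peak_lr - start_lr) / (warmup_subepochs - 1)
    return start_lr + step * (subepoch - 1)


def newbob_step(lr, prev_error, cur_error, decay, threshold):
    """Simplified Newbob-style rule: decay the learning rate when the
    relative improvement of the validation error is below `threshold`."""
    rel_improvement = (prev_error - cur_error) / prev_error
    if rel_improvement < threshold:
        return lr * decay
    return lr


# Hypothetical values, chosen only for illustration.
START_LR, PEAK_LR = 1e-4, 1e-3
schedule = [warmup_learning_rate(s, START_LR, PEAK_LR) for s in range(1, 13)]
decayed = newbob_step(PEAK_LR, prev_error=0.30, cur_error=0.30,
                      decay=0.9, threshold=0.01)
```

With 40 subepochs per pass over the data, the ten warm-up subepochs cover the first quarter of the first epoch.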

2.1.3 Sequence discriminative training

Sequence discriminative training is performed using a lattice-based version of the state-level minimum Bayes risk (sMBR) criterion [19].

The hybrid DNN/HMM model is used to generate lattices for all of the training data. The training is then continued from the hybrid DNN/HMM model with a lower learning rate.

We use cross-entropy smoothing with a smoothing factor of and early stopping to prevent overfitting.
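A common way to write cross-entropy smoothing is as an interpolation between the sequence-level criterion and the frame-wise cross-entropy loss. The form below is a sketch under that assumption; the symbol \alpha for the smoothing factor is introduced here for illustration, and the text does not spell out the exact formulation used.

```latex
\mathcal{L}(\theta) \;=\; (1 - \alpha)\,\mathcal{L}_{\mathrm{sMBR}}(\theta) \;+\; \alpha\,\mathcal{L}_{\mathrm{CE}}(\theta),
\qquad 0 \le \alpha \le 1
```

With \alpha = 0 this reduces to pure sMBR training, while larger values keep the model closer to the frame-wise cross-entropy solution.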

2.2 Language model

We report the performance of hybrid systems using both a 4-gram count-based language model [20] and an LSTM language model [21] in first-pass decoding [22]. We use the 4-gram count model officially distributed with the LibriSpeech dataset [2]. For the LSTM language model, we train our own model using our toolkit RETURNN [23]. Two training datasets are available for language modeling: 800M-word text-only data and the transcriptions of the 960h of audio, which correspond to 10M words of text. These two sets are merged to form one training dataset for language model training. Our LSTM language model has two recurrent layers with 4096 LSTM nodes in each layer, an input projection layer of size 128, and an output softmax layer over the full 200k vocabulary. We train the model using stochastic gradient descent with gradient norm clipping and Newbob learning rate scheduling.

In addition, we rescore the lattices generated with the LSTM language model using a Transformer [24] language model. Our Transformer model has 96 layers with a total self-attention dimension of 512 using 8 heads and an inner feed-forward dimension of 2048 in each layer, which gave the best development perplexity in our preliminary experiments [25]. We use the push-forward algorithm [26] with recombination pruning of order 9. We linearly interpolate the two models with interpolation weights optimized on the development perplexity. We found 0.71 to be the optimal weight on the Transformer model, which gave a development perplexity of 52.3, while the LSTM and Transformer models have individual development perplexities of 60.2 and 53.7, respectively.
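Linear interpolation of two language models and the resulting perplexity can be sketched as below. The weight of 0.71 on the Transformer model is taken from the text; the function names and the word probabilities are made up for illustration, and the sketch assumes both models assign a probability to every word of the same word sequence.

```python
import math


def interpolate(p_transformer, p_lstm, weight=0.71):
    """Linearly interpolate two word probabilities.

    `weight` is the interpolation weight on the Transformer model.
    """
    return weight * p_transformer + (1.0 - weight) * p_lstm


def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Toy per-word probabilities from two hypothetical models.
transformer_probs = [0.20, 0.05, 0.10]
lstm_probs = [0.10, 0.08, 0.12]
mixed_probs = [interpolate(pt, pl)
               for pt, pl in zip(transformer_probs, lstm_probs)]
mixed_ppl = perplexity(mixed_probs)
```

Tuning the weight amounts to evaluating the interpolated perplexity on the development text for a range of weights and keeping the minimum.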

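| phoneme context | acoustic model | VTLN | SAT | sMBR | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|---|---|
| mono | GMM | no | no | no | 24.3 | 52.6 | 24.1 | 56.1 |
| tri | GMM | no | no | no | 12.1 | 34.5 | 12.9 | 36.9 |
| tri | GMM | yes | no | no | 12.0 | 35.1 | 11.2 | 36.4 |
| tri | GMM | no | yes | no | 8.0 | 21.9 | 8.6 | 22.9 |
| tri | GMM | yes | yes | no | 7.6 | 22.0 | 8.4 | 23.1 |
| tri | LSTM | | | no | 4.0 | 9.6 | 4.4 | 10.0 |
| tri | LSTM | | | yes | 3.4 | 8.3 | 3.8 | 8.8 |

Table 1: GMM/HMM and hybrid DNN/HMM results (WER [%]) on LibriSpeech with 12k CART labels, evaluated with the official 4-gram LM.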

3 Encoder-Decoder-Attention system

The encoder-decoder framework with attention was initially introduced for machine translation, where it now dominates the field [27, 28, 29]. Recent investigations have shown promising results when applying the same approach to speech recognition [30, 31, 32, 33, 7, 5]. Among end-to-end approaches for ASR, the attention model seems to perform best [6]. Our model operates on sub-word units obtained via byte-pair encoding (BPE) [34]. 40-dim. MFCC feature vectors are used as input. Our presented results outperform the best LibriSpeech attention system presented in [6]. Compared to the system in [6], we use an extended pretraining variant where we grow not only the encoder depth but also the hidden dimension of the LSTMs. Specifically, we start with 2 layers of dimension 512 in the encoder and increase to 6 layers of dimension 1024. Additionally, we train the first pretraining construction step without dropout. We improved upon that model by slightly tuning the curriculum learning schedule, i.e. we have these 4 steps with different portions of the dataset:

  1. from 25% of the whole data, take only train-clean, and filter randomly such that the max mean number of characters in the transcriptions of each sequence is 50,

  2. from the next 25% of the whole data, take only train-clean, and filter randomly such that the max mean number of characters is 75,

  3. from the next 50% of the whole data, take only train-clean,

  4. from now on, take everything.

Also, in the pretraining, we repeat the first step once more, with 2 layers of dimension 512, without dropout. The next improvement came from just training longer, i.e. we trained with our learning rate scheduling until it converged, then took the best model, and continued training with a reset learning rate scheduling. We repeated this twice. In the first iteration, we went over the whole data 12.5 times, then another 6.6 times and finally another 8.3 times, i.e. in total 27.4 times.

To further enhance the end-to-end system's performance, we train BPE-level language models and apply them to the system by shallow fusion [35, 36]. We report the performance of LSTM based and Transformer based language models separately. Our LSTM model has 4 recurrent layers with 2048 LSTM nodes. We use a 24-layer Transformer model with 8 attention heads, a self-attention dimension of 1024 and a feed-forward dimension of 4096, which we obtained in [25]. We select the language model checkpoints for the recognition experiments based on the development perplexity. For shallow fusion, we apply a single weight on the language model score (the weight on the score of the attention model is 1), and we use a beam size of 64 as well as an end-of-sentence penalty [37]. We optimize the weights separately on the dev-clean and dev-other sets, then apply them to the test-clean and test-other sets, respectively. We found the optimal weights to be similar for both models: 0.5 and 0.56 for the LSTM language model, and 0.52 and 0.54 for the Transformer model, on the clean and other sets respectively.
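The score combination used in shallow fusion can be sketched as follows: at each decoding step, the log-probability of the attention model (weight 1) is added to the weighted log-probability of the language model. The sketch covers a single step over a toy vocabulary; the function name and the probabilities are made up, and beam search as well as the end-of-sentence penalty are left out.

```python
import math


def shallow_fusion_scores(att_log_probs, lm_log_probs, lm_weight):
    """Combine per-token log-probabilities of the attention model and the LM.

    The attention model score has weight 1; only the LM score is scaled.
    """
    return {token: att_log_probs[token] + lm_weight * lm_log_probs[token]
            for token in att_log_probs}


# Toy distributions over two hypothetical BPE tokens.
att = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}

fused = shallow_fusion_scores(att, lm, lm_weight=0.5)
best_without_lm = max(att, key=att.get)
best_with_lm = max(fused, key=fused.get)
```

In this toy case the attention model alone prefers token "a", while the fused score prefers "b" because the language model strongly disfavours "a".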

4 Experimental setup

The two systems, a hybrid DNN/HMM and an attention-based encoder-decoder, are both trained on the 960h training data from the LibriSpeech corpus. For comparison, a 100h subset is also used. Unless specified otherwise, training was performed on the full 960h training set. The data is in English, but the content spans different time periods and different English-speaking countries, so the corpus contains a variety of English styles.

The hybrid model was trained and decoded with RASR [38] and RETURNN [23, 39]. The monophone and triphone systems used to generate the alignments were built in RASR, while the NN model was trained in RETURNN. The decoding process was set up in RASR. Our encoder-decoder-attention model was trained and decoded using RETURNN [23]. Both toolkits are open-source. All the config files used for training and recognition of all our results are publicly available online [40].

We evaluate the models on the dev and test sets provided with the LibriSpeech corpus: dev-clean, dev-other, test-clean and test-other. The difference between clean and other is the quality of the audio and its corresponding transcription: the clean sets are of higher quality than the other sets.

5 Experimental results

The development stages of our acoustic model are shown in Table 1. We start the training of the GMM/HMM model from scratch using linear alignments. Afterwards we utilize non-linear alignments. To further improve the GMM/HMM model we introduce triphones. Adding VTLN on top of the triphone system only shows improvements on clean but degradation on other. However, adding SAT to the triphone system improves the WER. Combining VTLN and SAT gives mixed results: clean improves, other degrades. Introducing a hybrid DNN/HMM improves the WER of the system. Continuing with sequence discriminative training improves the performance even further.

| # of CART labels | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| 9001 | 6.2 | 14.9 | 5.8 | 15.9 |
| 12001 | 4.0 | 9.6 | 4.4 | 10.0 |
| 20001 | 4.9 | 11.3 | 5.4 | 12.3 |

Table 2: Hybrid DNN/HMM results (WER [%]) on LibriSpeech with different numbers of CART labels. For all systems the official 4-gram word LM is used.

We evaluated the influence of the number of CART labels with the hybrid DNN/HMM model and the official 4-gram LM (Table 2). 9k CART labels show the worst performance. 20k CART labels perform better than 9k, but the best performance was obtained with 12k CART labels.

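| training set | model | LM | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|
| train-clean-100 | hybrid | 4-gram | 5.0 | 19.5 | 5.8 | 18.6 |
| train-clean-100 | attention | none | 14.7 | 38.5 | 14.7 | 40.8 |
| train-960 | hybrid | 4-gram | 4.0 | 9.6 | 4.4 | 10.0 |
| train-960 | attention | none | 4.7 | 14.3 | 4.8 | 15.4 |

Table 3: Comparison between hybrid DNN/HMM and encoder-decoder-attention model results (WER [%]) on LibriSpeech with different training corpus sizes. train-clean-100 is an official subset of the training corpus. train-960 is the complete training corpus. (Clustered) context-dependent phones (CDp) are utilized for the hybrid model, and sub-word units for the attention model.

| paper | model | label unit (AM) | label unit (LM) | LM | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|---|---|
| Han et al. [3] | hybrid, seq. disc., single | CDp | word | RNN | 3.0 | 8.8 | 3.6 | 8.7 |
| Han et al. [3] | hybrid, seq. disc., ensemble | CDp | word | RNN | 2.6 | 7.6 | 3.2 | 7.6 |
| Zeghidour et al. [8] | end-to-end GCNN | chars | words | GCNN | 3.2 | 10.1 | 3.4 | 11.2 |
| Irie et al. [9] | end-to-end attention | Word Piece Model | Word Piece Model | LSTM | 3.3 | 10.3 | 3.6 | 10.3 |
| Zeyer et al. [5] | end-to-end attention | BPE | BPE | LSTM | 3.5 | 11.5 | 3.8 | 12.8 |
| this work | end-to-end attention | BPE | BPE | none | 4.3 | 12.9 | 4.4 | 13.5 |
| this work | end-to-end attention | BPE | BPE | LSTM | 2.9 | 8.9 | 3.2 | 9.9 |
| this work | end-to-end attention | BPE | BPE | Transformer | 2.6 | 8.4 | 2.8 | 9.3 |
| this work | hybrid | CDp | word | 4-gr | 4.0 | 9.6 | 4.4 | 10.0 |
| this work | hybrid, seq. disc. | CDp | word | 4-gr | 3.4 | 8.3 | 3.8 | 8.8 |
| this work | hybrid, seq. disc. | CDp | word | + LSTM | 2.2 | 5.1 | 2.6 | 5.5 |
| this work | hybrid, seq. disc. | CDp | word | Transformer resc. | 1.9 | 4.5 | 2.3 | 5.0 |
| Park et al. [10] | end-to-end attention/SpecAugment | Word Piece Model | Word Piece Model | LSTM | - | - | 2.5 | 5.8 |

Table 4: The WER [%] results from our most interesting models and important results from other papers on LibriSpeech 960h. CDp are (clustered) context-dependent phones. BPE are sub-word units. 4-gr is the official 4-gram word LM. GCNN are gated convolutional NNs. RNN are recurrent NNs.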

We compare the hybrid model with the encoder-decoder-attention model (Table 3). We trained both models on the train-clean-100 training subset and on the complete train-960 training set. These are not our best models but baseline models for both approaches. The hybrid DNN/HMM model consistently outperforms the encoder-decoder-attention model, but the difference in performance shrinks substantially with the much larger training set.

Our encoder-decoder-attention model in combination with a Transformer LM gives a WER of 2.8% on test-clean and 9.3% on test-other (Table 4). Evaluating our sequence discriminatively trained acoustic model with our LSTM LM results in a WER of 2.6% on test-clean and 5.5% on test-other. Rescoring with a Transformer language model further improves the performance of our hybrid DNN/HMM system, resulting in a WER of 2.3% on test-clean and 5.0% on test-other. The previous best hybrid system was presented in [3], while the best end-to-end systems without data augmentation were presented in [8, 9] (Table 4). Additionally, we list the best end-to-end system with data augmentation [10]. Our best encoder-decoder-attention model improves the state-of-the-art for end-to-end models without data augmentation by % relative WER on test-clean and by % relative WER on test-other. Our best hybrid DNN/HMM system without Transformer LM rescoring improves the state-of-the-art by % relative WER on test-clean and by % relative WER on test-other. If we add rescoring with a Transformer LM, we improve further by % relative WER on test-clean and by % relative WER on test-other. In comparison, the hybrid DNN/HMM system still outperforms the encoder-decoder-attention system by over % relative WER on test-clean and by over % relative WER on test-other. Our best hybrid model even outperforms the end-to-end attention model with SpecAugment [10] by % relative WER on test-clean and by % relative WER on test-other. These results reflect the state-of-the-art performance for both hybrid and attention-based models on LibriSpeech, to the best of the authors' knowledge.
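Relative WER differences between systems can be recomputed directly from the test-set columns of Table 4. The sketch below shows the arithmetic for the comparison between the best hybrid system and the best attention system; the function name is hypothetical, and the rounding may differ from the rounding used by the authors.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent of the baseline WER."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# (test-clean, test-other) WERs in percent, taken from Table 4.
attention_transformer = (2.8, 9.3)   # this work, attention + Transformer LM
hybrid_transformer = (2.3, 5.0)      # this work, hybrid + Transformer resc.

hybrid_vs_attention = tuple(
    relative_wer_reduction(att, hyb)
    for att, hyb in zip(attention_transformer, hybrid_transformer)
)
```

For these two rows the computation gives roughly 17.9% relative on test-clean and 46.2% relative on test-other in favour of the hybrid system. The same function applies to any other pair of rows.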

The WERs become very small, especially for dev-clean and test-clean. When analyzing the errors, it is noticeable that some of them would not be regarded as genuine errors by a human. These can be categorized as, for example, word contractions or American vs. British English spelling. Examples of such errors are: "I am" vs. "I'm", "tyrannise" vs. "tyrannize", "color" vs. "colour", "oh" vs. "o". So far we have not employed a normalization strategy for these errors.
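To illustrate the effect of such spelling variants on the WER, the sketch below scores a hypothesis with and without a small normalization map. This is not something done in this work (no normalization strategy was employed); the map, built from the examples above, and all function names are hypothetical.

```python
# Hypothetical normalization map built from the examples in the text.
NORMALIZATION = {
    "i'm": "i am",
    "tyrannise": "tyrannize",
    "colour": "color",
    "o": "oh",
}


def normalize(text):
    """Lowercase the text and map known spelling variants to one form."""
    words = [NORMALIZATION.get(w, w) for w in text.lower().split()]
    return " ".join(words).split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = "I am fond of the colour"
hypothesis = "I'm fond of the color"

raw_wer = wer(reference.lower().split(), hypothesis.lower().split())
normalized_wer = wer(normalize(reference), normalize(hypothesis))
```

On this toy pair the raw scoring counts three word errors against six reference words, while the normalized scoring counts none.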

6 Conclusions

In this paper we presented two ASR systems for the LibriSpeech task. One system was a hybrid DNN/HMM system based on a GMM/HMM system, the other was an attention-based encoder-decoder system.

We described how we built the systems and how to incrementally improve them to get competitive results. For the hybrid DNN/HMM system, a large NN acoustic model, sequence discriminative training and the use of an LSTM LM were important for the good performance. The encoder-decoder-attention approach utilized an extended pretraining variant and a tuned curriculum learning schedule. This enabled the model to achieve competitive results in comparison to other end-to-end approaches.

The presented encoder-decoder-attention system showed state-of-the-art performance on the LibriSpeech 960h task in comparison with end-to-end systems without data augmentation. But our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and % relative on the other test sets. Our hybrid system even outperforms previous results presented in the literature. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures. To the best knowledge of the authors, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems presented in this work.


This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project “SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains.

Experiments were partially performed with computing resources granted by RWTH Aachen University under project nova0003.

We thank Wei Zhou for help with generating lattices.