
The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment

by   Wei Zhou, et al.

We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model size and training time. A subsequent sMBR training is applied to fine-tune the final acoustic model, and both LSTM and Transformer language models are trained and evaluated. Our best system achieves a 5.6% WER on the test set, a 27% relative improvement over the previous state of the art.





1 Introduction & Related Work

One of the most common neural network (NN) based acoustic modeling methods is the hybrid hidden Markov model (HMM) approach [1], which still gives state-of-the-art performance, as recently shown for benchmarks like Librispeech [2] and Switchboard [3]. Bi-directional long short-term memory (BLSTM) [4] networks are widely used for acoustic modeling in hybrid HMM systems. Based on the alignment generated from a Gaussian mixture model (GMM)-HMM baseline, cross-entropy (CE) training is usually applied to train the baseline NN models. Additionally, speaker adaptive training (SAT) using i-vectors and sequence discriminative training, e.g. with the state-level minimum Bayes risk (sMBR) [5] criterion, are often applied for further improvements.

Language models (LMs) based on LSTM [6] have been widely applied to automatic speech recognition (ASR). Large improvements are observed for both hybrid HMM systems and end-to-end systems [2]. Transformer [7] based LMs are reported to further improve over LSTM LMs [8]. For hybrid HMM systems, they are usually applied in lattice rescoring [8], but may also be used in single-pass search [9].

SpecAugment [10], as a simple feature augmentation method, has been successfully applied to end-to-end speech recognition systems. With increased model size and training time, end-to-end systems benefit strongly from SpecAugment [11, 12], and large improvements are also reported for end-to-end speech translation [13]. However, its effect on hybrid HMM systems has not been thoroughly studied yet.

In this work, we describe a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus [14] (TED-LIUM-v2). We apply SpecAugment in our training pipeline and obtain further improvement over our best SAT model using i-vectors. By investigating the effect of different maskings on hybrid models, we achieve improvements from SpecAugment without increasing model size and training time, and without additional effort on learning rate scheduling. Subsequent sequence discriminative training using the sMBR criterion is used to fine-tune the final acoustic model. For language modeling, both LSTM and Transformer based LMs are trained and evaluated. Our best system outperforms the previous state of the art by a large margin.

2 Baseline Acoustic Model

2.1 Basic setups

All acoustic models are trained on the 207 hours of training data of TED-LIUM-v2. The official dictionary with roughly 152k words and 160k pronunciations is used. To evaluate all intermediate acoustic models with recognition experiments, we use a fixed, heavily pruned 4-gram LM, denoted as ‘4-gram-small’, whose details are described in Sec. 5. This allows us to simplify the tuning and report the relative improvement of each step. All trainings are done using our NN modeling toolkit RETURNN [15], our ASR toolkit RASR [16] and our workflow manager Sisyphus [17]. All recognition results are obtained with maximum a posteriori (MAP) Viterbi decoding.

2.2 Baseline

We follow the standard steps described in [2] to train the GMM-HMM baseline. Starting with a linear alignment, monophone GMMs are trained on 16-dimensional MFCCs and their first order derivatives. With each triphone modeled by 3 HMM states, generalized triphone states are obtained by state tying using a classification and regression tree (CART). We use 9k CART labels. Generalized triphone state GMMs are then trained on windowed MFCCs with linear discriminant analysis (LDA) transformation. This step is repeated once to refine the CART labels with a better alignment. Subsequently, vocal tract length normalization (VTLN) and SAT using constrained maximum likelihood linear regression are applied to further improve the GMMs. The final alignment from the VTLN-SAT GMM is used in the next step to train the NN baseline with the CE criterion.

We use 80-dimensional logmel features for the NN training. The NN model contains six BLSTM layers with 512 units for each direction. This topology is used in all further steps. The Nesterov-accelerated adaptive moment estimation (Nadam) optimizer [18] with an initial learning rate of 0.0009 is used. Greedy layer-wise pre-training [19] and Newbob learning rate scheduling [20] with a decay factor of 0.9 are applied. CE with focal loss [21] of factor 2 is used as training criterion. The training set is split into 5 subepochs and models converge well within roughly 32 full epochs. Sequences are decomposed into chunks of 64 frames with 50% overlap, and a mini-batch of 128 chunks is used. Additionally, 10% dropout [22] and L2 regularization are applied to all hidden layers.
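As an illustration of the framewise training criterion above, here is a minimal NumPy sketch of CE scaled by the focal loss factor (γ = 2, as in the text). The function name and toy inputs are our own, not RETURNN's API:

```python
import numpy as np

def focal_cross_entropy(posteriors, targets, gamma=2.0):
    """Framewise focal loss [21]: CE weighted by (1 - p_t)^gamma, so that
    well-classified frames (target probability near 1) contribute less.
    posteriors: (T, C) softmax outputs; targets: (T,) CART label indices."""
    p_t = posteriors[np.arange(len(targets)), targets]  # prob. of target label
    return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))

# Toy example: two frames, two classes.
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
targets = np.array([0, 1])
fl = focal_cross_entropy(probs, targets)                 # focal loss, gamma=2
ce = -np.mean(np.log(probs[np.arange(2), targets]))      # plain CE for comparison
```

With γ = 0 the expression reduces to plain CE; with γ = 2, confidently classified frames are strongly down-weighted.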

Table 1 shows the word error rate (WER) results of each of the aforementioned training steps. We also try to use the BLSTM baseline to generate a new alignment and repeat the NN training, but no further improvement is obtained. This new alignment is used in further steps of training.

Unit        Model   Feature        Dev
monophone   GMM     MFCC           41.6
triphone            + LDA          21.8
                     + VTLN        21.0
                     + SAT         19.5
                      + VTLN       19.4
            BLSTM   logmel         10.4
                     + i-vectors    9.8
Table 1: WERs (%) of baseline acoustic models (evaluated with the 4-gram-small LM on the dev set)

2.3 I-Vector Adaptation

We follow [3] to apply SAT using i-vectors as speaker embeddings. The embeddings are concatenated to the logmel features at each frame. The universal background model (UBM) is trained on the whole training set. To train the UBM, logmel features with a context of 9 frames are concatenated and then reduced to a dimension of 60 with LDA. I-vectors are then estimated for each recording separately using all feature frames including non-speech. We follow [3] to use a size of 100 for the i-vectors. As shown in the last two rows of Table 1, 6% relative improvement is achieved by applying SAT with i-vectors. We expect to achieve larger improvements with further tuning of the embedding parameters.
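The frame-level concatenation of speaker embeddings can be sketched as follows (the dimensions 80 and 100 follow the text; the function name is ours):

```python
import numpy as np

def add_ivector(logmel, ivector):
    """Concatenate one fixed per-recording i-vector to every frame of the
    (T, 80) logmel features, yielding (T, 180) speaker-adapted inputs,
    as in the SAT setup described above."""
    T = logmel.shape[0]
    tiled = np.broadcast_to(ivector, (T, ivector.shape[0]))  # repeat per frame
    return np.concatenate([logmel, tiled], axis=1)

# 50 frames of (zeroed) logmel features plus a dummy 100-dim i-vector.
feats = add_ivector(np.zeros((50, 80)), np.ones(100))
```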

3 SpecAugment

The original SpecAugment [10] applies time warping, time masking and frequency masking to logmel features. Since time warping is reported to have only a minor effect, we skip it in our training. This also avoids the additional effort of handling the alignment accordingly. We apply the two maskings to the logmel features concatenated with i-vectors. Since i-vectors are included, we rename frequency masking to feature masking. Both maskings are bounded by the fixed chunk size and feature dimension, and are realized in a similar way as described in [13].

3.1 Time Masking (TM)

With a chunk of T frames, a position t is randomly selected from [0, T-1]. Then a time mask of length l is randomly selected from [1, l_max], where l_max is a predefined maximum time mask length. TM is then applied by setting the l consecutive frames starting at position t (clipped to the chunk boundary) to zero. This procedure is repeated n times, where n is randomly selected from [1, n_max]; n_max is a predefined maximum iteration number for TM. Thus, TM can be controlled by setting l_max and n_max accordingly, which we denote as TM(l_max, n_max).

3.2 Feature Masking (FM)

With features of dimension F, an index f is randomly selected from [0, F-1]. Then a feature mask of length l is randomly selected from [1, f_max], where f_max is a predefined maximum feature mask length. FM is then applied by setting the features within dimensions [f, f+l-1] (clipped to the feature dimension) to zero. This procedure is again repeated n times, where n is randomly selected from [1, n_max]; n_max is a predefined maximum iteration number for FM. Similar to TM, FM can be controlled by setting f_max and n_max accordingly, which we denote as FM(f_max, n_max).
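Taken together, the two maskings of Secs. 3.1 and 3.2 amount to a short routine. Below is a minimal NumPy sketch, assuming masking to zero; the maximum mask lengths and repetition counts used as defaults here are illustrative placeholders, not the paper's tuned settings:

```python
import numpy as np

def spec_augment(x, t_max_len=15, t_max_num=3, f_max_len=18, f_max_num=5, seed=0):
    """Apply time masking (TM) and feature masking (FM) to one chunk x of
    shape (T, F). Mask lengths and repetition counts are drawn uniformly
    up to the given maxima; masked entries are set to zero."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    T, F = x.shape
    for _ in range(rng.integers(1, t_max_num + 1)):   # TM repetitions
        l = rng.integers(1, t_max_len + 1)            # mask length
        t = rng.integers(0, T)                        # start frame
        x[t:t + l, :] = 0.0                           # slice clips at chunk end
    for _ in range(rng.integers(1, f_max_num + 1)):   # FM repetitions
        l = rng.integers(1, f_max_len + 1)
        f = rng.integers(0, F)
        x[:, f:f + l] = 0.0                           # slice clips at dim F
    return x

# One 64-frame chunk of 180-dim features (80 logmel + 100 i-vector).
masked = spec_augment(np.ones((64, 180)))
```

In training, the random selections are drawn independently for each chunk in a batch, so every chunk sees a different masking pattern.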

3.3 SpecAugment on Logmel with I-Vectors

To further improve the previous best baseline model, we directly apply TM and FM on the 80-dimensional logmel features concatenated with the 100-dimensional i-vectors. The random selections in both TM and FM are applied independently for each chunk in a batch. BLSTM models are trained from scratch. The predefined maximum mask lengths and iteration numbers are halved in the first 2000 steps for a more stable pre-training. We set a default f_max to 10% of the feature dimension, which is 18 for the 180-dimensional concatenated features, and choose the default maximum iteration number for FM such that at most 50% of the features can be masked. For hybrid HMM systems, CART labels consume much fewer frames than the label units used in end-to-end systems. With a very large l_max, the evidence of several consecutive CART labels would be masked out, which might be less beneficial. Therefore, we start with a default l_max that matches the maximum duration of a speech CART label based on our experience. Due to the fixed chunk size of 64 frames, the maximum iteration number for TM also has to be limited to keep a reasonable ratio of TM; we set its default such that roughly at most 50% of the frames can be masked.

We first investigate the effect of different TM settings with the default FM. Under the same maximum ratio of TM, we compare a set of different l_max values to find the optimum. As shown in Table 2, a too long TM gives less improvement, which matches our expectation. Surprisingly, a too short TM is also less beneficial, which likely results from the decreased effect of TM. With the optimal l_max, we further vary the maximum iteration number to apply both less and more TM. From Table 2 we see that neither brings further improvement. The best result achieves 7% relative improvement over the SAT baseline using i-vectors.

We then investigate the effect of different FM settings with the best TM. Similarly, under the same 50% maximum ratio of FM, we compare a set of different f_max values to find the optimum. As shown in Table 2, our default FM setting still gives the best result. With the optimal f_max, we train models with varied maximum iteration numbers to change the maximum ratio of FM; both give worse results. Additionally, we investigate the importance of the i-vectors in terms of FM. Since they are fixed for each frame, we train a model with the default FM applied only within the logmel features, leaving the i-vectors untouched. This is reflected by the column ‘FM on Ivec’ in Table 2. For the 80-dimensional logmel features, 10% of the feature dimension results in f_max = 8. The result of applying FM only within the logmel features is much worse. This shows that including the i-vectors for FM is essential, as it also brings some variation into the speaker features.

Finally, we also investigate the effect of continuing training of the i-vector-based SAT baseline with SpecAugment, using the best TM and FM settings obtained so far. In this case, we turn off the pre-training and its corresponding 2000 steps of halved masking. The learning rate is reset to allow an escape from the local optimum. This model converges slightly faster than training from scratch with SpecAugment, but it only reaches the same performance of 9.1% WER in the end. Considering the much longer total training time, there is not much benefit in following this track.

Criterion   SpecAugment              FM on Ivec   Dev
CE          none                     –             9.8
            best TM + FM             yes           9.1
            other TM + FM settings   yes           9.5
            best TM + FM             no            9.6
 + sMBR     none                     –             8.6
Table 2: WERs (%) of further training steps based on logmel features concatenated with i-vectors (evaluated with the 4-gram-small LM on the dev set)

3.4 Discussion

Overall, the improvements from SpecAugment are not as large as reported for end-to-end systems [10]. However, the improvements are obtained without increasing model size or training time: models converge well with roughly the same number of epochs as needed for the baseline training. Additionally, no careful design of learning rate scheduling is needed (only Newbob is applied here), although further improvements might be achievable by tuning it.

In general, end-to-end systems need larger amounts of training data to be competitive with state-of-the-art hybrid HMM systems. This situation is eased by training end-to-end systems with SpecAugment for many more epochs. Together with the results in this work, we tend to infer that, in terms of SpecAugment, end-to-end systems benefit most from effectively seeing more data, whereas hybrid HMM systems benefit from more variation introduced into the data. However, more investigation is needed for a thorough understanding.

Model          Param (M)   PPL Dev   PPL Test
4-gram-small       4        135.0     169.9
4-gram           161        113.2     127.9
LSTM             450         73.5      71.3
Transformer      414         62.0      60.7
Table 3: Perplexity of the word-level LMs. The same 152K vocabulary is used for all models (except for the small 4-gram, which contains 52 words less).
Paper                AM Approach   AM Labels       LM Approach    Dev    Test
Zeyer et al. [11]    E2E           BPE             Transformer    10.3   8.8
Karita et al. [12]   E2E           SentencePiece   RNN             9.3   8.1
Han et al. [23]      hybrid HMM    triphone        word 4-gram     7.7   8.0
Han et al. [24]      hybrid HMM    triphone        4-gram          7.6   8.1
                                                   RNN             7.1   7.7
this work            hybrid HMM    triphone        4-gram          6.8   7.3
                                                   LSTM            5.6   6.0
                                                   Transformer     5.1   5.6
Table 4: WERs (%) of the final acoustic model with different language models on both dev and test sets of TED-LIUM-v2, and a summary of the most relevant results from the literature

4 Sequence Discriminative Training

We follow [2] to further apply sequence discriminative training on the best model from the previous step, i.e. the SAT model using i-vectors trained with the best SpecAugment setting. We use a lattice-based version of the sMBR training criterion to fine-tune the model weights. No SpecAugment is applied in this step. The converged CE model and a bi-gram LM trained on the TED-LIUM-v2 LM training data are used for lattice generation and for the initialization of model training. We then continue training with a small constant learning rate and use early stopping to prevent overfitting on the training data. CE smoothing with a scale of 0.1 is applied. As shown in Table 2, the sequence discriminative training achieves an additional 6% relative improvement.
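The CE smoothing amounts to a simple weighted combination of the two objectives. The one-liner below is purely illustrative (the lattice-based sMBR loss itself is computed inside the ASR toolkit); the 0.1 scale follows the text:

```python
def smoothed_sequence_loss(smbr_loss, ce_loss, ce_scale=0.1):
    """CE-smoothed sequence-discriminative objective: the sMBR loss is
    regularized by a scaled framewise CE term to stabilize training."""
    return smbr_loss + ce_scale * ce_loss

# Toy scalar losses standing in for batch-averaged sMBR and CE values.
total = smoothed_sequence_loss(2.0, 3.0)
```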

5 Language Modeling

The LM training data consists of 7 subsets including the TED-LIUM-v2 training audio transcriptions, with a total of 270 M running words. The small 4-gram LM is trained in a similar way as the Kaldi example recipe [25]. All the rest of our LMs have been described in [26]. We refer readers interested in more details to this paper.

We first train modified Kneser-Ney 4-gram language models [27, 28, 29] on each subset of the training data with the word-level vocabulary of size 152K. We then linearly interpolate these sub-LMs, including a background 4-gram model trained on all training text, using interpolation weights optimized for the development perplexity.
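Weight optimization for such a linear interpolation is commonly done with an EM-style update on the development text. A minimal sketch under that assumption (the toolkit and exact optimization procedure used in the paper may differ):

```python
import numpy as np

def interpolate(probs, lam):
    """probs: (N, K) per-word probabilities from K component LMs on N dev
    words; lam: (K,) interpolation weights summing to 1."""
    return probs @ lam

def optimize_weights(probs, iters=50):
    """Standard EM updates for mixture weights maximizing dev likelihood
    (equivalently, minimizing dev perplexity)."""
    K = probs.shape[1]
    lam = np.full(K, 1.0 / K)                         # uniform initialization
    for _ in range(iters):
        post = probs * lam / (probs @ lam)[:, None]   # component responsibilities
        lam = post.mean(axis=0)                       # re-estimate weights
    return lam

# Toy dev set: 3 words scored by 2 component LMs.
probs = np.array([[0.2, 0.05],
                  [0.1, 0.40],
                  [0.3, 0.01]])
lam = optimize_weights(probs)
ppl = np.exp(-np.mean(np.log(interpolate(probs, lam))))
```

At the optimum, the interpolated model's dev perplexity is no worse than that of the best single component.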

We train both LSTM and Transformer language models. The LSTM LM has 4 layers with 2048 nodes in each layer. The Transformer model has 32 layers with a feed-forward inner dimension of 4096, a self-attention embedding dimension of 768, and 12 attention heads per layer. No positional encoding is used. The input word embedding dimension is 128 for both models. Table 3 shows the corresponding perplexities.

6 Experimental Results

The final acoustic model trained with the sMBR criterion is evaluated with better language models. LM scales are optimized on the development set. A one-pass recognition setup with MAP Viterbi decoding is applied for both the 4-gram LM and the LSTM LM, where the generated lattices from the LSTM LM-based recognition are used for lattice rescoring with the Transformer LM.
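The score combination underlying both one-pass decoding and rescoring is a log-linear sum of acoustic and LM scores. A minimal n-best illustration (the paper rescores lattices, which share hypothesis prefixes but combine scores the same way; the lm_scale value here is a hypothetical stand-in for the dev-tuned LM scale):

```python
def rescore_nbest(hyps, lm_scale=0.7):
    """Pick the hypothesis with the best combined score.
    Each hyp is (text, am_log_score, lm_log_score)."""
    return max(hyps, key=lambda h: h[1] + lm_scale * h[2])

# Two competing hypotheses: the LM prefers the second one strongly enough
# to overturn the acoustic ranking.
best = rescore_nbest([("a b c", -10.0, -4.0),
                      ("a b d", -11.0, -2.0)])
```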

Table 4 shows the WER results of these experiments together with a brief summary of the best results from the literature. These include hybrid HMM systems as well as end-to-end (E2E) systems using different model types, topologies and label units, such as byte pair encoding (BPE) and SentencePiece [30]. We refer readers to the original papers for more details. As shown in the table, the previous best system [24] achieves a 7.7% WER on the test set. Our best result is 5.6% WER on the test set, a 27% relative improvement.

7 Conclusion

In this work, we presented the integration of data augmentation using SpecAugment into the training pipeline of a state-of-the-art ASR system based on the hybrid HMM approach for the TED-LIUM-v2 corpus. SpecAugment provides 7% relative improvement on top of our best SAT model using i-vectors, more precisely from 9.8% to 9.1% WER on the dev set with a small 4-gram LM. We analyzed the effect of different maskings and found that SpecAugment is beneficial in all cases. The major impact comes from the maximum time and feature mask lengths, which have to be optimized. With a good control of the maximum ratio of TM and FM, decent improvements are then achieved without increasing model size and training time. For feature masking, it is essential to include all features, even if the i-vectors are fixed for each frame of the segment. Additionally, we found that training from scratch with SpecAugment directly is more efficient than continuing training with SpecAugment to achieve similar performance. Together with subsequent sMBR training and a Transformer LM, our best hybrid HMM system achieves state-of-the-art performance with 5.6% WER on the test set, improving over the previous best WER of 7.7% by 27% relative.

8 Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 694537, project “SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains.

We thank Albert Zeyer, Christoph Lüscher, Pavel Golik, Peter Vieting and Tobias Menne for useful discussions.


  • [1] Herve A. Bourlard and Nelson Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Norwell, MA, USA, 1993.
  • [2] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “RWTH ASR Systems for LibriSpeech: Hybrid vs Attention,” in Interspeech, Graz, Austria, Sept. 2019, pp. 231–235.
  • [3] Markus Kitza, Pavel Golik, Ralf Schlüter, and Hermann Ney, “Cumulative Adaptation for BLSTM Acoustic Models,” INTERSPEECH, Sept. 2019.
  • [4] Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [5] Matthew Gibson and Thomas Hain, “Hypothesis Spaces for Minimum Bayes Risk Training in Large Vocabulary Speech Recognition,” in INTERSPEECH. 2006, ISCA.
  • [6] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, “LSTM Neural Networks for Language Modeling,” in Proc. Interspeech, Portland, OR, USA, Sept. 2012, pp. 194–197.
  • [7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Long Beach, CA, USA, Dec. 2017.
  • [8] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language Modeling with Deep Transformers,” in Proc. Interspeech, Graz, Austria, Sept. 2019, pp. 3905–3909.
  • [9] Eugen Beck, Wei Zhou, Ralf Schlüter, and Hermann Ney, “LSTM Language Models for LVCSR in First-Pass Decoding and Lattice-Rescoring,”, July 2019.
  • [10] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in INTERSPEECH, Graz, Austria, Sept. 2019.
  • [11] Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “A comparison of Transformer and LSTM encoder decoder models for ASR,” in Proc. ASRU, Sentosa, Singapore, Dec. 2019, pp. 8–15.
  • [12] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang, “A Comparative Study on Transformer vs RNN in Speech Applications,” CoRR, vol. abs/1909.06317, 2019.
  • [13] Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “On Using SpecAugment for End-to-End Speech Translation,” in International Workshop on Spoken Language Translation, Hong Kong, China, Nov. 2019.
  • [14] Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks,” in Proc. LREC, 2014, pp. 26–31.
  • [15] Patrick Doetsch, Albert Zeyer, Paul Voigtlaender, Ilia Kulikov, Ralf Schlüter, and Hermann Ney, “RETURNN: the RWTH extensible training framework for universal recurrent neural networks,” in ICASSP. 2017, pp. 5345–5349, IEEE.
  • [16] Simon Wiesler, Alexander Richard, Pavel Golik, Ralf Schlüter, and Hermann Ney, “RASR/NN: The RWTH Neural Network Toolkit for Speech Recognition,” in ICASSP. 2014, pp. 3281–3285, IEEE.
  • [17] Jan-Thorsten Peter, Eugen Beck, and Hermann Ney, “Sisyphus, a Workflow Manager Designed for Machine Translation and Automatic Speech Recognition,” in Proc. EMNLP, Brussels, Belgium, Nov. 2018, pp. 84–89.
  • [18] Timothy Dozat, “Incorporating Nesterov Momentum into Adam,” in ICLR Workshop Track, 2016.

  • [19] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle, “Greedy Layer-Wise Training of Deep Networks,” in Advances in Neural Information Processing Systems 19, pp. 153–160. 2007.
  • [20] Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney, “A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 2462–2466.
  • [21] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár, “Focal Loss for Dense Object Detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007, 2017.
  • [22] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • [23] Kyu J. Han, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou, “Multi-Stride Self-Attention for Speech Recognition,” in INTERSPEECH, Graz, Austria, Sept. 2019.
  • [24] Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim, and Ian R. Lane, “The CAPIO 2017 Conversational Speech Recognition System,” ArXiv, vol. abs/1801.00059, 2018.
  • [25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
  • [26] Kazuki Irie, Alexander Gerstenberger, Ralf Schlüter, and Hermann Ney, “How Much Self-attention Do We Need? Trading Attention for Feed-forward Layers,” in ICASSP, Barcelona, Spain, May 2020.
  • [27] Reinhard Kneser and Hermann Ney, “Improved backing-off for m-gram language modeling,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, May 1995, pp. 181–184.
  • [28] Stanley F Chen and Joshua Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.
  • [29] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, “On the Estimation of Discount Parameters for Language Model Smoothing,” in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 1433–1436.
  • [30] Taku Kudo and John Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” in EMNLP (Demonstration), 2018, pp. 66–71.