Deep-learning-based approaches, especially recurrent neural networks and their variants, have been among the most active topics in language modeling research over the past few years. Long Short-Term Memory (LSTM) based recurrent language models (LMs) have shown significant perplexity gains on well-established benchmarks such as the Penn Treebank and the more recent One Billion Word corpus. These results validate the potential of deep learning and recurrent models as being key to further progress in language modeling. Since LMs are one of the core components of natural language processing (NLP) tasks such as automatic speech recognition (ASR) and machine translation (MT), improved language modeling techniques have sometimes translated to improvements in overall system performance for these tasks [3, 4, 5].
Enhancements of recurrent neural networks such as deep transition networks, recurrent highway networks, and fast-slow recurrent neural networks, which add non-linear transformations in the time dimension, have shown superior performance, especially in NLP tasks [6, 7, 8]. Inspired by these ideas, we extend LSTMs by adding highway networks inside the LSTM and call the resulting model Highway LSTM (HW-LSTM). The added highway networks further strengthen the LSTM's capability of handling long-range dependencies. To the best of our knowledge, this is the first work that uses HW-LSTM for language modeling in the context of a state-of-the-art speech recognition task.
In this paper, we present extensive empirical results showing the advantage of HW-LSTM LMs in state-of-the-art broadcast news and conversational ASR systems built on publicly available data. We compare multiple variants of HW-LSTM LMs and analyze which configuration achieves the best perplexity and speech recognition accuracy. We also present a training procedure for a HW-LSTM LM initialized with a regular LSTM LM. Our results further show that the regular LSTM LM and the proposed HW-LSTM LMs are complementary and can be combined to obtain additional gains. The proposed methods were instrumental in reaching the current best reported accuracy on the widely-cited Switchboard (SWB) and CallHome (CH) [10, 11] subsets of the NIST Hub5 2000 evaluation test set.
Our paper has three main contributions:
a novel language modeling technique with HW-LSTM,
a training procedure of HW-LSTM LMs initialized with regular LSTM LMs, and
the impact of the above proposed methods in state-of-the-art ASR tasks with publicly available broadcast news and conversational telephone speech data.
This paper is organized as follows. We summarize related work in Section 2 and detail our proposed language modeling with HW-LSTM in Section 3. Next, we confirm the advantage of HW-LSTM LMs through a wide range of speech recognition experiments in Section 4. Finally, we conclude this paper in Section 5.
2 Related Work
In this section, we summarize related work that serves as a basis for our proposed method which is described in Section 3.
2.1 Highway Networks
Highway networks make it easy to train very deep neural networks. A highway network transforms an input $x$ into an output $y$, with the information flow controlled by a transform gate $T$ and a carry gate $C$, as follows:

$$y = T(x) \odot H(x) + C(x) \odot x$$
$$T(x) = \sigma(W_T x + b_T)$$
$$C(x) = \sigma(W_C x + b_C)$$
$$H(x) = \tanh(W_H x + b_H)$$

$W_T$ and $b_T$ are the weight matrix and bias vector for the transform gate, $W_C$ and $b_C$ are the weight matrix and bias vector for the carry gate, and $W_H$ and $b_H$ are the weight matrix and bias vector for the transformation $H$; a non-linearity other than $\tanh$ can be used for $H$. Highway networks have shown strong performance in various applications, including language modeling and image classification, to name a few.
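As a concrete illustration, the transform/carry gating above can be sketched in a few lines of NumPy; the weight and bias names (`W_H`, `b_T`, etc.) simply mirror the description and are not tied to any particular implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T, W_C, b_C):
    # Candidate transform H (tanh here; other non-linearities are possible)
    H = np.tanh(W_H @ x + b_H)
    # Transform gate T and carry gate C control the information flow
    T = sigmoid(W_T @ x + b_T)
    C = sigmoid(W_C @ x + b_C)
    # y = T * H(x) + C * x  (element-wise products)
    return T * H + C * x
```

When the carry gate saturates toward 1 and the transform gate toward 0, the layer passes its input through almost unchanged, which is the property the training procedure in Section 3.2 exploits.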
2.2 Recurrent Highway Networks
A typical Recurrent Neural Network (RNN) has one non-linear transformation from a hidden state $s_{t-1}$ to the next hidden state $s_t$, given by:

$$s_t = \tanh(W x_t + R s_{t-1} + b)$$

$x_t$ is the input to the RNN at time step $t$, $W$ and $R$ are the weight matrices, and $b$ is the bias vector. A Recurrent Highway Network (RHN) was recently proposed by combining an RNN with highway networks. An RHN applies multiple layers of highway networks when transforming $s_{t-1}$ to $s_t$; these multiple layers of highway networks serve as a "memory".
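To make the contrast concrete, here is a minimal sketch of a vanilla RNN step next to an RHN-style step that stacks highway micro-layers between $s_{t-1}$ and $s_t$; the coupled carry gate ($C = 1 - T$) is an assumption borrowed from the RHN formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, s_prev, W, R, b):
    # Vanilla RNN: one non-linear transformation per time step
    return np.tanh(W @ x_t + R @ s_prev + b)

def rhn_step(x_t, s_prev, W_in, layers):
    # RHN-style step: a stack of highway micro-layers between s_{t-1} and s_t.
    # Only the first micro-layer sees the input x_t; `layers` holds
    # (W_H, b_H, W_T, b_T) per layer, with the carry gate coupled as C = 1 - T.
    s = s_prev
    for l, (W_H, b_H, W_T, b_T) in enumerate(layers):
        inp = W_in @ x_t if l == 0 else 0.0
        h = np.tanh(inp + W_H @ s + b_H)
        t = sigmoid(W_T @ s + b_T)
        s = t * h + (1.0 - t) * s
    return s
```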
2.3 LSTM

An LSTM is a specific RNN architecture that avoids the vanishing (or exploding) gradient problem and is easier to train thanks to its internal memory cells and gates. After exploring a few variants of LSTM architectures, we settled on the architecture specified below, which is similar to the architectures of [15, 16] illustrated in Figure 1:

$$i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o)$$
$$g_t = \tanh(W_g x_t + R_g h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

$x_t$ is the input to the LSTM at time step $t$, $W_*$ and $R_*$ are the weight matrices, and $b_*$ are the bias vectors. $\odot$ denotes an element-wise product. $c_t$ and $h_t$ represent the memory cell vector and the hidden vector at time step $t$. Note that this LSTM does not have peephole connections.
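A minimal sketch of this no-peephole LSTM step, assuming the standard input/forget/output gate and cell-candidate formulation; the parameter names (`W_i`, `R_i`, ...) mirror the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM time step (no peephole connections).  P maps names such
    as 'W_i', 'R_i', 'b_i' to the weight matrices and bias vectors for
    the input (i), forget (f), output (o) gates and cell candidate (g)."""
    i = sigmoid(P["W_i"] @ x_t + P["R_i"] @ h_prev + P["b_i"])
    f = sigmoid(P["W_f"] @ x_t + P["R_f"] @ h_prev + P["b_f"])
    o = sigmoid(P["W_o"] @ x_t + P["R_o"] @ h_prev + P["b_o"])
    g = np.tanh(P["W_g"] @ x_t + P["R_g"] @ h_prev + P["b_g"])
    c = f * c_prev + i * g        # memory cell update
    h = o * np.tanh(c)            # hidden state
    return h, c
```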
3 Highway LSTM
Inspired by the improved performance achieved by adding multiple layers of highway networks to regular RNNs, we propose Highway LSTM (HW-LSTM) in this paper. We also introduce a suitable training procedure for HW-LSTM.
3.1 Variants of Highway LSTM
Unlike a regular RNN, an LSTM has two internal states, the memory cell $c_t$ and the hidden state $h_t$. We explore three variants, HW-LSTM-C, HW-LSTM-H, and HW-LSTM-CH, which differ in whether the highway network is added to $c_t$, to $h_t$, or to both, respectively.
In the above explanation, for simplicity, the number of highway network layers was set to one. We define the number of highway layers as the depth. As in deep transition networks and recurrent highway networks [6, 7], we can increase the depth in HW-LSTM by stacking highway network layers inside the LSTM.
In order to reduce the number of parameters in HW-LSTM-C and HW-LSTM-H, we simplified the carry gate to $C = 1 - T$, as in the original highway networks paper. For HW-LSTM-CH, both carry gates are tied in the same way, $C_c = 1 - T_c$ and $C_h = 1 - T_h$.
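Since the variant equations are not reproduced here, the following sketch shows one plausible reading of HW-LSTM-H: a regular LSTM step followed by a stack of highway layers applied to the hidden state $h_t$, with the carry gate tied to the transform gate ($C = 1 - T$).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hw_lstm_h_step(x_t, h_prev, c_prev, P, hw_layers):
    """HW-LSTM-H sketch: an LSTM step, then `depth` highway layers on h_t.
    hw_layers is a list of (W_H, b_H, W_T, b_T); the carry gate is
    coupled to the transform gate (C = 1 - T) to save parameters."""
    i = sigmoid(P["W_i"] @ x_t + P["R_i"] @ h_prev + P["b_i"])
    f = sigmoid(P["W_f"] @ x_t + P["R_f"] @ h_prev + P["b_f"])
    o = sigmoid(P["W_o"] @ x_t + P["R_o"] @ h_prev + P["b_o"])
    g = np.tanh(P["W_g"] @ x_t + P["R_g"] @ h_prev + P["b_g"])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    for W_H, b_H, W_T, b_T in hw_layers:   # depth = len(hw_layers)
        T = sigmoid(W_T @ h + b_T)
        h = T * np.tanh(W_H @ h + b_H) + (1.0 - T) * h
    return h, c
```

With an empty `hw_layers` list this reduces exactly to the regular LSTM step, which is what makes the warm-start conversion in Section 3.2 possible.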
3.2 Training Procedure of Highway LSTM
Due to the additional highway networks, the training of HW-LSTM LMs is slower than that of regular LSTM LMs. To mitigate this, we can: (1) train a regular LSTM LM, (2) convert it to a HW-LSTM by adding highway networks, and (3) conduct additional training of the HW-LSTM LM converted from the regular LSTM LM. In other words, the HW-LSTM LM is initialized with the trained regular LSTM LM. To smoothly convert the regular LSTM LM into the HW-LSTM LM, we set the bias term of the transform gate to a negative value, say -3, so that the added highway connection is initially biased toward carry behavior; as a result, the converted HW-LSTM LM initially behaves almost the same as the regular LSTM LM. We then conduct the additional training of the HW-LSTM LM.
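A sketch of the conversion step; `add_highway_layers` and the weight-initialization scale are hypothetical helpers, but the transform-gate bias of -3 follows the procedure above ($\sigma(-3) \approx 0.05$, so each added layer initially carries its input through almost unchanged).

```python
import numpy as np

def add_highway_layers(lstm_params, hidden_dim, depth, bias_init=-3.0, seed=0):
    """Convert a trained LSTM LM into an HW-LSTM LM by attaching `depth`
    highway layers.  The trained LSTM weights are reused unchanged; only
    the highway parameters are new, and the transform-gate bias starts
    at a negative value so each layer is initially biased toward carry
    behavior (hypothetical helper, not the paper's exact code)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(hidden_dim)
    hw_layers = [
        (rng.standard_normal((hidden_dim, hidden_dim)) * scale,  # W_H
         np.zeros(hidden_dim),                                   # b_H
         rng.standard_normal((hidden_dim, hidden_dim)) * scale,  # W_T
         np.full(hidden_dim, bias_init))                         # b_T: near-carry
        for _ in range(depth)
    ]
    return dict(lstm_params), hw_layers
```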
In LM training, it is common to have a two-stage training procedure where the first stage uses a large generic corpus and the second stage uses a small specific target-domain corpus. Our proposed procedure fits this type of two-stage training.
4 Experiments

We compare the three HW-LSTM variants, HW-LSTM-C, HW-LSTM-H, and HW-LSTM-CH. In a first set of experiments, we use English broadcast news data for this comparison and report perplexity and speech recognition accuracy. Then we conduct experiments with English conversational telephone speech data and compare the speech recognition accuracy with the strong LSTM baseline. For the speech recognition experiments, we generated N-best lists from lattices produced by the baseline system for each task and rescored them with the baseline LSTM and/or the HW-LSTM LMs. The evaluation metric for speech recognition accuracy was Word Error Rate (WER). LM probabilities were linearly interpolated, and the interpolation weights of the LMs were estimated on the heldout data.
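Linear interpolation of LM probabilities, with weights estimated on heldout data, can be sketched as follows; the EM estimation shown is a common choice for this step, not necessarily the exact method used here.

```python
import numpy as np

def interpolate_lm_probs(prob_lists, weights):
    """Linearly interpolate per-word probabilities from several LMs:
    p(w) = sum_k lambda_k * p_k(w), with the lambda_k summing to one."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)
    probs = np.asarray(prob_lists, dtype=float)   # shape: (num_lms, num_words)
    return weights @ probs

def estimate_weights_em(heldout_probs, n_iter=50):
    """Estimate interpolation weights on heldout data with EM.
    heldout_probs: (num_lms, num_words) array of each LM's probability
    for every heldout word."""
    p = np.asarray(heldout_probs, dtype=float)
    lam = np.full(p.shape[0], 1.0 / p.shape[0])   # start uniform
    for _ in range(n_iter):
        post = lam[:, None] * p                   # E-step: responsibility of each LM
        post /= post.sum(axis=0, keepdims=True)
        lam = post.mean(axis=1)                   # M-step: re-estimate weights
    return lam
```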
4.1 Baseline LSTM
Our baseline LSTM LM consists of one word-embedding layer, four LSTM layers, one fully-connected layer, and one softmax layer, as described in Figure 5. The second to fourth LSTM layers and the fully-connected layer have residual connections. Dropout is applied only to the vertical dimension and not to the time dimension. This model minimizes the standard cross-entropy objective during training. The competitiveness of this baseline LSTM LM is detailed in Section 4.4.
To investigate the advantage of HW-LSTM LMs, we replaced the LSTM with the HW-LSTM (HW-LSTM-C, HW-LSTM-H, or HW-LSTM-CH). The rest of the topology is the same as in the baseline LSTM LM.
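A rough sketch of the baseline stack under stated assumptions: residual connections from the second layer on, and dropout applied only in the vertical (between-layer) direction, never on the recurrent connections.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(xs, P):
    """Run one LSTM layer over a sequence (list of input vectors)."""
    d = P["b_i"].shape[0]
    h, c, out = np.zeros(d), np.zeros(d), []
    for x in xs:
        i = sigmoid(P["W_i"] @ x + P["R_i"] @ h + P["b_i"])
        f = sigmoid(P["W_f"] @ x + P["R_f"] @ h + P["b_f"])
        o = sigmoid(P["W_o"] @ x + P["R_o"] @ h + P["b_o"])
        g = np.tanh(P["W_g"] @ x + P["R_g"] @ h + P["b_g"])
        c = f * c + i * g
        h = o * np.tanh(c)
        out.append(h)
    return out

def stacked_forward(xs, layer_params, drop_rate=0.5, rng=None, train=True):
    """LSTM stack with residual connections from the second layer on.
    Dropout is vertical only: a fresh mask per step on the layer input,
    while the recurrent (time) connections are never dropped."""
    rng = rng or np.random.default_rng(0)
    hs = xs
    for l, P in enumerate(layer_params):
        if train:  # inverted dropout on the vertical input to this layer
            hs = [h * (rng.random(h.shape) > drop_rate) / (1 - drop_rate)
                  for h in hs]
        out = lstm_layer(hs, P)
        if l >= 1:  # residual connection: add this layer's input back
            out = [o + h for o, h in zip(out, hs)]
        hs = out
    return hs
```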
4.2 Network configuration and hyper-parameters
The baseline LSTM LM uses word embeddings of dimension 256 and 1,024 units in each hidden layer. The fully-connected layer uses a gated linear unit, and the network is trained with a dropout rate of 0.5.
4.3 Broadcast news
Broadcast news evaluation results were reported on the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) RT'04 test set, which contains approximately 4 hours of data. We used two types of acoustic models. The first is a discriminatively-trained, speaker-adaptive Gaussian Mixture Model (GMM) acoustic model (AM) trained on 430 hours of broadcast news audio. The second is a Convolutional Neural Network (CNN) acoustic model trained on 2,000 hours of broadcast news, meeting, and dictation data with noise-based data augmentation. The CNN-based AM was first trained with cross-entropy training and then with Hessian-free state-level Minimum Bayes Risk (sMBR) sequence training [23, 24].
We trained a conventional word 4-gram model using a total of 350M words from multiple sources, with a vocabulary of 84K words. To compare the baseline LSTM LM and the three types of proposed HW-LSTM LMs, we used a 12M-word subset of the original 350M-word corpus, as in previous work. For reference, we also trained a model-M LM [27, 28, 29] and the baseline LSTM from the same training data. The hyper-parameters were optimized on a heldout data set.
Table 1 shows the perplexity of these models on the heldout set. HW-LSTM-H and HW-LSTM-CH achieved the best perplexity, whereas HW-LSTM-C showed a marginal degradation compared with the baseline LSTM.
Table 2: WER [%] on EARS RT'04 with the GMM and CNN acoustic models. Bold numbers indicate the best WER for each AM.

| LM | GMM AM | CNN AM |
| --- | --- | --- |
| n-gram + Baseline LSTM | 12.2 | 10.2 |
| … | … | … |
Table 2 shows the WER on EARS RT'04 obtained by rescoring N-best lists produced with the two acoustic models. For reference, the first section of Table 2 compares the WER with the n-gram LM and the WER after rescoring with the baseline LSTM over the lattices generated with the n-gram LM. As can be seen, rescoring with the baseline LSTM LM significantly reduces WER for both the GMM AM and the CNN AM. The second section gives the rescoring results of the three types of HW-LSTM LMs over the lattices generated by the n-gram LM. The third section gives the rescoring results of the same HW-LSTM LMs, but after rescoring with the baseline LSTM LM. Comparing the first and second sections, HW-LSTM-H showed better WER than the baseline LSTM for both the GMM AM and the CNN AM. Looking at the second and third sections, using HW-LSTM-H resulted in the best WER for both the GMM AM and the CNN AM. While HW-LSTM-H and HW-LSTM-CH had similar perplexity, HW-LSTM-H showed slightly better WER than HW-LSTM-CH; HW-LSTM-H has fewer parameters than HW-LSTM-CH and is thus less prone to overfitting.
In the experiments with English conversational telephone speech recognition described in the next section, we use the pipeline combining the baseline LSTM and HW-LSTM-H, which achieved the best WER in these broadcast news experiments.
[Table 3: duration, # speakers, and # words for each of the six test sets]
Table 4: WER [%] on the SWB, CH, RT'02, RT'03, RT'04, and DEV'04f test sets.

| LM | SWB | CH | RT'02 | RT'03 | RT'04 | DEV'04f |
| --- | --- | --- | --- | --- | --- | --- |
| n-gram + model-M | 6.1 | 11.2 | 9.4 | 9.4 | 9.0 | 8.8 |
| n-gram + model-M + 4 LSTM + CNN [10, 11] | 5.5 | 10.3 | 8.3 | 8.3 | 8.0 | 8.0 |
| n-gram + model-M + Baseline LSTM | 5.4 | 10.1 | 8.4 | 8.3 | 8.0 | 8.1 |
| + HW-LSTM-H (d=1) | 5.3 | 10.1 | 8.3 | 8.3 | 8.0 | 8.1 |
| + HW-LSTM-H (d=2) | 5.3 | 10.1 | 8.2 | 8.2 | 7.9 | 7.9 |
| + HW-LSTM-H (d=3) | 5.3 | 10.0 | 8.1 | 8.2 | 7.8 | 7.9 |
| + HW-LSTM-H (d=4) | 5.3 | 10.0 | 8.1 | 8.2 | 7.8 | 7.9 |
| + HW-LSTM-H (d=5) | 5.3 | 10.0 | 8.1 | 8.2 | 7.8 | 7.9 |
| + HW-LSTM-H (d=6) | 5.3 | 9.9 | 8.2 | 8.1 | 7.8 | 7.9 |
| + HW-LSTM-H (d=7) | 5.2 | 10.0 | 8.1 | 8.1 | 7.8 | 7.9 |
| + Unsupervised LM Adaptation | 5.1 | 9.9 | 8.2 | 8.1 | 7.7 | 7.7 |
4.4 Conversational telephone speech
To confirm the advantage of HW-LSTM, we conducted experiments with a wide range of conversational telephone speech recognition test sets, including the SWB and CH subsets of the NIST Hub5 2000 evaluation data set as well as the RT'02, RT'03, RT'04, and DEV'04f test sets of the DARPA-sponsored Rich Transcription evaluations. Statistics of these six data sets are given in Table 3.
The acoustic model consists of an LSTM and a ResNet whose posterior probabilities are combined during decoding.
The baseline LSTM LM and the HW-LSTM-H LMs were built with a vocabulary of 85K words. In the first pass, the LMs were trained on a corpus of 560M words consisting of publicly available text data from the LDC, including Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. In a second pass, the model was refined further with just the acoustic transcripts (approximately 24M words) corresponding to the 1,975 hours of audio data used to train the acoustic models. The hyper-parameters were optimized on a heldout data set.
We tried HW-LSTM-H LMs with depths from 1 to 7. Training of HW-LSTM-H LMs is slower than that of the baseline LSTM LMs because of the additional highway connections. Thus, we used the training procedure introduced in Section 3.2. In the first pass, with the larger corpora, we trained the baseline LSTM LM. We then added highway connections to the trained baseline LSTM LM to compose the HW-LSTM-H. To smoothly convert the baseline LSTM LM into the HW-LSTM-H LM, we set the bias term of the transform gate to a negative value, -3, so that the added highway connection was initially biased toward carry behavior; the composed HW-LSTM-H LM thus initially behaved almost the same as the trained baseline LSTM LM. We then conducted the second pass, further training the HW-LSTM-H LM using only the acoustic transcripts.
Again, the interpolation weights for combining the baseline LSTM LM and the HW-LSTM-H LMs of different depths were optimized on a heldout data set.
The WERs on the six test sets are tabulated in Table 4. The first section gives reference numbers from previous papers [10, 11]. In the case of "n-gram + model-M", lattices were generated using an n-gram LM and rescored with model-M [27, 28, 30]. The "n-gram + model-M + 4 LSTM + CNN" line indicates the previously reported results that achieved state-of-the-art WER on the SWB and CH test sets.
The second section, "n-gram + model-M + Baseline LSTM", is the baseline in this paper. The n-gram LM and model-M are identical to those in the first section; however, our baseline LSTM, explained in Section 4.1, has a different architecture from the LSTM in the first section. Note that the WERs obtained with "n-gram + model-M + Baseline LSTM" are comparable with those of "n-gram + model-M + 4 LSTM + CNN", which indicates that our baseline is sufficiently competitive.
The third section contains our main results: we rescored the lattices prepared in the second section ("n-gram + model-M + Baseline LSTM") with HW-LSTM-H LMs. Here, we incrementally applied deeper HW-LSTM-H models as $d$ increases from 1 to 7. Note that $d$ indicates the number of highway networks in HW-LSTM-H, while the number of HW-LSTM-H layers was kept unchanged at four in all cases. With only a few exceptions, adding deeper HW-LSTM-H gradually reduces the WER on all test sets. Compared with our competitive baseline, after adding HW-LSTM-H (d=7) we obtained absolute WER reductions of 0.2%, 0.1%, 0.3%, 0.2%, 0.2%, and 0.2% for the SWB, CH, RT'02, RT'03, RT'04, and DEV'04f test sets, respectively.
Finally, in the fourth section, we conducted unsupervised LM adaptation after rescoring with the HW-LSTM-H LMs. We started by re-estimating the interpolation weights using the rescored results for each test set as a heldout set, and conducted rescoring again. Then, we adapted the model-M LM trained only on acoustic transcripts using the rescored results for each test set obtained in the previous step. We rescored the N-best lists with the adapted model-M and obtained the final results. After all unsupervised LM adaptation steps, we reached 5.1% and 9.9% WER on the SWB and CH subsets of the Hub5 2000 evaluation.
5 Conclusion

In this paper, we proposed language modeling with HW-LSTM and confirmed its advantage through a range of speech recognition experiments. While the gains may appear marginal, they are quite significant in the low-WER scenarios obtained with strong LMs, as is typical when working with very strong baselines (e.g., the 0.2% improvement on the SWB subset was statistically significant; statistical significance was measured by the Matched Pair Sentence Segment test in sc_stats).
It is noteworthy that the 5.1% and 9.9% WER on the SWB and CH subsets of the Hub5 2000 evaluation match the best reported results [9, 10, 11] to date, and were achieved with the same system architecture for both tasks. While a wide-ranging discussion on the human performance of speech recognition is ongoing [3, 9, 10, 32, 33], the achieved WER of 5.1% on the SWB subset is on par with one recently estimated human performance for this subset.
- Among the three variants of HW-LSTM, HW-LSTM-H, which adds a highway network to the hidden state inside the LSTM, is superior to HW-LSTM-C and HW-LSTM-CH (a comparison with a different combination of highway connection and LSTM that uses a different gating mechanism is worth trying).
- The HW-LSTM-H LM reduces WER over the strong baseline LSTM LM in English broadcast news and a wide range of conversational telephone speech recognition tasks.
- The baseline LSTM LM and HW-LSTM-H are complementary, and a combination of the two can further reduce WER.
Acknowledgments

We would like to thank Michael Picheny, Stanley F. Chen, and Hong-Kwang J. Kuo of the IBM T.J. Watson Research Center for their valuable comments and support.
References

-  Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
-  Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016.
-  Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
-  Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Proc. INTERSPEECH, 2010, pp. 1045–1048.
-  Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig, “Joint language and translation modeling with recurrent neural networks.,” in Proc. EMNLP, 2013, pp. 1044–1054.
-  Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, “How to construct deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
-  Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber, “Recurrent highway networks,” arXiv preprint arXiv:1607.03474, 2016.
-  Asier Mujika, Florian Meier, and Angelika Steger, “Fast-slow recurrent neural networks,” arXiv preprint arXiv:1705.08639, 2017.
-  Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke, “The Microsoft 2017 conversational speech recognition system,” arXiv preprint arXiv:1708.06073, 2017.
-  George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, and Phil Hall, “English conversational telephone speech recognition by humans and machines,” in Proc. INTERSPEECH, 2017, pp. 132–136.
-  Gakuto Kurata, Abhinav Sethy, Bhuvana Ramabhadran, and George Saon, “Empirical exploration of novel architectures and objectives for language models,” in Proc. INTERSPEECH, 2017, pp. 279–283.
-  Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
-  Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush, “Character-aware neural language models,” arXiv preprint arXiv:1508.06615, 2015.
-  Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber, “Training very deep networks,” in Proc. NIPS, 2015, pp. 2377–2385.
-  Alex Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
-  Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever, “An empirical exploration of recurrent network architectures,” in Proc. ICML, 2015, pp. 2342–2350.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
-  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
-  Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  Stanley F. Chen, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, Hagen Soltau, and Geoffrey Zweig, “Advances in speech transcription at IBM under the DARPA EARS program,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, pp. 1596–1608, 2006.
-  David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
-  Brian Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. ICASSP, 2009, pp. 3761–3764.
-  Brian Kingsbury, Tara N Sainath, and Hagen Soltau, “Scalable minimum bayes risk training of deep neural network acoustic models using distributed hessian-free optimization.,” in Proc. INTERSPEECH, 2012.
-  Stanley F. Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.
-  Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen, "Bidirectional recurrent neural network language models for automatic speech recognition," in Proc. ICASSP, 2015, pp. 5421–5425.
-  Stanley F. Chen, “Performance prediction for exponential language models,” in Proc. HLT-NAACL, 2009, pp. 450–458.
-  Stanley F. Chen, “Shrinking exponential language models,” in Proc. HLT-NAACL, 2009, pp. 468–476.
-  Stanley F Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy, “Scaling shrinkage-based language models,” IBM Research Report:RC24970, 2010.
-  George Saon, Tom Sercu, Steven Rennie, and Hong-Kwang J. Kuo, “The IBM 2016 English conversational telephone speech recognition system,” in Proc. INTERSPEECH, 2016, pp. 7–11.
-  Laurence Gillick and Stephen J Cox, “Some statistical issues in the comparison of speech recognition algorithms,” in Proc. ICASSP, 1989, pp. 532–535.
-  Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “The Microsoft 2016 conversational speech recognition system,” in Proc. ICASSP, 2017, pp. 5255–5259.
-  Andreas Stolcke and Jasha Droppo, “Comparing human and machine errors in conversational speech transcription,” in Proc. INTERSPEECH, 2017, pp. 137–141.
-  Kazuki Irie, Zoltán Tüske, Tamer Alkhouli, Ralf Schlüter, and Hermann Ney, “LSTM, GRU, highway and a bit of attention: An empirical overview for language modeling in speech recognition,” in Proc. INTERSPEECH, 2016, pp. 3519–3523.