1 Introduction
Language models (LMs) are crucial components in many applications, such as speech recognition and machine translation. The aim of a language model is to compute the probability of any given sentence $w_1^L = (w_1, w_2, \ldots, w_L)$, which can be calculated as
$$P(w_1^L) = P(w_1, w_2, \ldots, w_L) = \prod_{t=1}^{L} P(w_t \mid w_1^{t-1}) \tag{1}$$
The task of an LM is therefore to calculate the probability of word $w_t$ given its previous history $w_1^{t-1} = (w_1, \ldots, w_{t-1})$. $N$-gram LMs [1] and neural network based language models (NNLMs) [2, 3] are two widely used types of language model. In $N$-gram LMs, the most recent $N-1$ words are used as an approximation of the complete history, thus
$$P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-N+1}^{t-1}) \tag{2}$$
This $N$-gram assumption can also be used to construct an $N$-gram feedforward NNLM [2]. In contrast, recurrent neural network LMs (RNNLMs) model the complete history via a recurrent connection.
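The chain rule and the history truncation of Equation 2 can be illustrated with a minimal Python sketch. The function names and the uniform toy model below are hypothetical and chosen only for illustration; any function returning a conditional word probability could be substituted.

```python
import math

def ngram_history(words, t, n):
    """Return the (at most) n-1 words preceding position t, i.e. the truncated history."""
    start = max(0, t - (n - 1))
    return tuple(words[start:t])

def sentence_log_prob(words, cond_prob, n):
    """Chain rule: sum of log P(w_t | truncated history) over all positions."""
    total = 0.0
    for t, word in enumerate(words):
        total += math.log(cond_prob(word, ngram_history(words, t, n)))
    return total

# Hypothetical toy model: a uniform distribution over a 4-word vocabulary.
def uniform_model(word, history):
    return 0.25

sentence = ["the", "cat", "sat", "down"]
print(ngram_history(sentence, 3, 3))  # history used to predict "down" with a 3-gram
print(sentence_log_prob(sentence, uniform_model, 3))
```

With `n` larger than the sentence length the history is never truncated, which corresponds to the full-history case that recurrent models aim to capture.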
Most previous work on language models has focused on utilising history information, and future word context information has not been extensively investigated. There have been several attempts to incorporate future context information into recurrent neural network language models. Individual forward and backward RNNLMs can be built, and these two LMs combined with log-linear interpolation [4]. In [5], succeeding words were incorporated into an RNNLM within a maximum entropy framework. The use of bidirectional RNNLMs (biRNNLMs) for speech recognition was investigated in [6]. For a broadcast news task, sigmoid based RNNLMs gave small gains, while no performance improvement was obtained when using long short-term memory (LSTM) based RNNLMs. More recently, biRNNLMs were shown to produce consistent and significant performance improvements over unidirectional RNNLMs (uniRNNLMs) on a range of speech recognition tasks [7].
Though they can yield performance gains, biRNNLMs pose several challenges for both model training and inference, as they require the complete previous and future word context information to be taken into account. It is difficult to parallelise training efficiently. Lattice rescoring is also complicated for these LMs, as future context needs to be incorporated. This means that the form of approximation used for uniRNNLMs [8] is not suitable. Hence, N-best rescoring is normally used [5, 6, 7]. However, the ability to manipulate lattices is very important in many speech applications. Lattices can be used for a wide range of downstream applications, such as confidence score estimation [9], keyword search [10] and confusion network decoding [11].
In order to address these issues, a novel model structure, succeeding word RNNLMs (suRNNLMs), is proposed in this paper. Instead of using a recurrent unit to capture the complete future word context as in biRNNLMs, a feedforward unit is used to model a small, fixed number of succeeding words. This allows existing efficient training [12] and lattice rescoring [8] algorithms developed for uniRNNLMs to be extended to the proposed suRNNLMs. Using these extended algorithms, compact lattices can be generated with suRNNLMs, supporting lattice based downstream processing.
The rest of this paper is organized as follows. Section 2 gives a brief review of RNNLMs, including both unidirectional and bidirectional RNNLMs. The proposed model with succeeding words (suRNNLMs) is introduced in Section 3, followed by a description of the lattice rescoring algorithm in Section 4. Section 5 discusses the interpolation of language models. The experimental results are presented in Section 6 and conclusions are drawn in Section 7.
2 Uni- and Bidirectional RNNLMs
2.1 Unidirectional RNNLMs
In contrast to feedforward NNLMs, which model only the previous $N-1$ words, recurrent NNLMs [13] represent the full non-truncated history $w_1^{t-1}$ for word $w_t$ using the 1-of-K encoding of the previous word $w_{t-1}$ and a continuous vector $h_{t-2}$ as a compact representation of the remaining context $w_1^{t-2}$. Figure 1 shows an example of this unidirectional RNNLM (uniRNNLM). The most recent word $w_{t-1}$ is used as input and projected into a low-dimensional, continuous space via a linear projection layer. A recurrent hidden layer is used after this projection layer. The recurrent layer can be based on a standard recurrent unit with sigmoid activations [3], or on more complicated forms such as gated recurrent units (GRUs) [14] and long short-term memory (LSTM) units [15]. A continuous vector $h_{t-1}$ representing the complete history information $w_1^{t-1}$ can be obtained using $h_{t-2}$ and the previous word $w_{t-1}$. This vector is used as the input to the recurrent layer for the estimation of the next word. An output layer with a softmax function is used to calculate the probability $P(w_t \mid w_{t-1}, h_{t-2})$. An additional node is often added at the output layer to model the probability mass of out-of-shortlist (OOS) words, which speeds up the softmax computation by limiting the vocabulary size [16]. Similarly, an out-of-vocabulary (OOV) node can be added in the input layer to model OOV words. The probability of the word sequence $w_1^L$ is calculated as
$$P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid w_{t-1}, h_{t-2}) = \prod_{t=1}^{L} P(w_t \mid h_{t-1}) \tag{3}$$
Perplexity (PPL) is a metric widely used to evaluate the quality of language models. According to the definition in [17], the perplexity can be computed from the sentence probabilities as
$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{j=1}^{J}\log P(W_j)\right) = \exp\left(-\frac{1}{T}\sum_{j=1}^{J}\sum_{t=1}^{L_j}\log P(w_t \mid h_{t-1})\right) \tag{4}$$
where $T$ is the total number of words and $J$ is the number of sentences in the evaluation corpus, $W_j$ is the $j$-th sentence, and $L_j$ is the number of words in the $j$-th sentence. From the above equation, the PPL is calculated from the average log probability of each word, which for unidirectional LMs follows directly from the sentence log probabilities.
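As a small worked example of the perplexity definition, the following Python sketch computes PPL from per-word probabilities grouped by sentence. The function name and the toy probabilities are illustrative assumptions, not values from the experiments.

```python
import math

def perplexity(sentence_word_probs):
    """PPL = exp(-(1/T) * sum of log word probabilities), with T the total word count."""
    total_words = sum(len(sentence) for sentence in sentence_word_probs)
    total_log_prob = sum(
        math.log(p) for sentence in sentence_word_probs for p in sentence
    )
    return math.exp(-total_log_prob / total_words)

# Hypothetical evaluation corpus: two sentences with 2 and 3 words.
corpus = [[0.5, 0.25], [0.125, 0.5, 0.5]]
print(perplexity(corpus))

# A model that assigns 1/4 to every word has a perplexity of 4.
print(perplexity([[0.25, 0.25, 0.25]]))
```

Because only the sum of word log probabilities and the total word count enter the formula, the grouping into sentences does not change the result for a unidirectional LM.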
UniRNNLMs can be trained efficiently on Graphics Processing Units (GPUs) by using spliced sentence bunch (i.e. minibatch) mode [12]. Multiple sentences can be concatenated together to form a longer sequence, and sets of these long sequences can then be aligned in parallel from left to right. This data structure is more efficient for minibatch based training, as the resulting sequences have comparable lengths [12]. When using these forms of language models for tasks like speech recognition, N-best rescoring is the most straightforward way to apply uniRNNLMs. Lattice rescoring is also possible by introducing approximations [8] to control the merging and expansion of different paths in the lattice. This will be described in more detail in Section 4.
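The idea of splicing sentences into parallel streams of comparable length can be sketched as follows. This is a minimal illustration only: the greedy "append to the shortest stream" rule, the `</s>` end-of-sentence marker and the `<null>` padding token are assumptions made for the example and are not taken from [12].

```python
def splice_sentences(sentences, num_streams, null_token="<null>"):
    """Concatenate sentences into num_streams long sequences of similar length."""
    streams = [[] for _ in range(num_streams)]
    for sentence in sentences:
        # Greedy balancing (an assumption): extend the currently shortest stream.
        shortest = min(streams, key=len)
        shortest.extend(sentence)
    # Pad the tails so that all streams can be processed as one minibatch.
    longest = max(len(stream) for stream in streams)
    return [stream + [null_token] * (longest - len(stream)) for stream in streams]

sentences = [
    ["a", "b", "</s>"],
    ["c", "</s>"],
    ["d", "e", "f", "</s>"],
    ["g", "</s>"],
]
streams = splice_sentences(sentences, num_streams=2)
for stream in streams:
    print(stream)
```

Each column of the resulting array then holds one word per stream, so a single time step of the recurrent layer can be computed for all streams at once.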
2.2 Bidirectional RNNLMs
Figure 2 illustrates an example of a bidirectional RNNLM (biRNNLM). Unlike uniRNNLMs, both the history word context $w_1^{t-1}$ and the future word context $w_{t+1}^{L}$ are used to estimate the probability of the current word, $P(w_t \mid w_1^{t-1}, w_{t+1}^{L})$. Two recurrent units are used to capture the previous and future information respectively. In the same fashion as uniRNNLMs, $h_{t-1}$ is a compact continuous vector of the history information $w_1^{t-1}$, while $\tilde{h}_{t+1}$ is another continuous vector that encodes the future information $w_{t+1}^{L}$. This future context vector is computed from the next word $w_{t+1}$ and the previous future context vector $\tilde{h}_{t+2}$, which contains the information of $w_{t+2}^{L}$. The concatenation of $h_{t-1}$ and $\tilde{h}_{t+1}$ is then fed into the output layer, with a softmax function, to calculate the output probability. In order to reduce the number of parameters, the projection layer for the previous and future words is often shared.
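The probability of the word sequence $w_1^L$ can be computed using biRNNLMs as
$$P_{bi}(w_1^L) = \frac{1}{Z}\tilde{P}_{bi}(w_1^L) = \frac{1}{Z}\prod_{t=1}^{L} P(w_t \mid w_1^{t-1}, w_{t+1}^{L}) \tag{5}$$
$\tilde{P}_{bi}(w_1^L)$ is the unnormalized sentence probability computed from the individual word probabilities of the biRNNLM. $Z$ is a sentence-level normalization term to ensure the sentence probability is appropriately normalized. This is defined as
$$Z = \sum_{W \in \Theta} \tilde{P}_{bi}(W) \tag{6}$$
where $\Theta$ is the set of all possible sentences. Unfortunately, this normalization term is impractical to calculate for most tasks.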
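A toy example may help to show why the sentence-level normalization term is needed. The Python sketch below uses a hypothetical joint distribution over two-word sentences from a two-word vocabulary; each bidirectional word probability is properly normalized on its own, yet the product over the sentence does not sum to one over all sentences. All names and numbers are assumptions for illustration.

```python
from itertools import product

vocab = ["a", "b"]
# Hypothetical joint distribution P(w1, w2) over all two-word sentences.
joint = {("a", "a"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.4}

def p_first_given_second(w1, w2):
    """P(w1 | future word w2), normalized over w1."""
    return joint[(w1, w2)] / sum(joint[(v, w2)] for v in vocab)

def p_second_given_first(w2, w1):
    """P(w2 | history word w1), normalized over w2."""
    return joint[(w1, w2)] / sum(joint[(w1, v)] for v in vocab)

def unnormalised_prob(w1, w2):
    """Product of the per-word bidirectional probabilities for one sentence."""
    return p_first_given_second(w1, w2) * p_second_given_first(w2, w1)

# Summing over every possible sentence gives the normalization term.
z = sum(unnormalised_prob(w1, w2) for w1, w2 in product(vocab, repeat=2))
print(z)  # approximately 1.36 rather than 1
```

In this toy case the set of sentences has only four members, so the sum is trivial; for a realistic vocabulary and unbounded sentence length the same sum cannot be enumerated.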
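In a similar form to Equation 4, the PPL of biRNNLMs can be calculated based on the sentence probability as
$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{j=1}^{J}\log P_{bi}(W_j)\right) = \exp\left(-\frac{1}{T}\left(-J\log Z + \sum_{j=1}^{J}\log \tilde{P}_{bi}(W_j)\right)\right) \tag{7}$$
where, as in Equation 4, $T$ is the total number of words and $J$ the number of sentences in the evaluation corpus, and $Z$ is the normalization term of Equation 6. However, $Z$ is often infeasible to obtain. As a result, it is not possible to compute a valid perplexity from biRNNLMs. Nevertheless, the average log probability of each word can be used to get a “pseudo” perplexity (PPL),
$$\text{pseudo PPL} = \exp\left(-\frac{1}{T}\sum_{j=1}^{J}\log \tilde{P}_{bi}(W_j)\right) \tag{8}$$
This keeps only the second term of the valid PPL of biRNNLMs shown in Equation 7. It is a “pseudo” PPL because the normalized sentence probability is impossible to obtain and the unnormalized sentence probability $\tilde{P}_{bi}$ is used instead. Hence, the “pseudo” PPL of biRNNLMs is not comparable with the valid PPL of uniRNNLMs. However, the value of the “pseudo” PPL provides information on the average word probability from biRNNLMs, since it is obtained using the word probabilities.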
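In order to achieve good performance for speech recognition, [7] proposed an additional smoothing of the biRNNLM probability at test time. The probability of biRNNLMs is smoothed as
$$P(w_t \mid w_1^{t-1}, w_{t+1}^{L}) = \frac{\exp(\alpha\, y_{w_t})}{\sum_{j=1}^{V}\exp(\alpha\, y_j)} \tag{9}$$
where $y_j$ is the activation before the softmax function for node $j$ in the output layer, $V$ is the number of output layer nodes, and $\alpha$ is an empirical smoothing factor, which is chosen as 0.7 in this paper.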
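The effect of the smoothing factor can be seen in a short Python sketch. It scales the pre-softmax activations by the smoothing factor before normalizing; the function name and the example activations are hypothetical.

```python
import math

def smoothed_softmax(activations, alpha=0.7):
    """Softmax over alpha-scaled activations; alpha < 1 flattens the distribution."""
    scaled = [alpha * y for y in activations]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer activations for a 3-word vocabulary.
activations = [4.0, 1.0, 0.0]
sharp = smoothed_softmax(activations, alpha=1.0)   # standard softmax
smooth = smoothed_softmax(activations, alpha=0.7)  # smoothed, as at test time
print(sharp)
print(smooth)
```

The ranking of the words is unchanged, but the probability mass assigned to the top word is reduced, which makes the distribution less sharp.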
The use of both preceding and following context information in biRNNLMs presents challenges to both model training and inference. First, lattice rescoring is impractical for biRNNLMs, as the computation of word probabilities requires information from the complete sentence. N-best rescoring is therefore normally used for speech recognition [7].
Another drawback of biRNNLMs is the difficulty in training. The complete previous and future context information is required to predict the probability of each word. It is expensive to train biRNNLMs directly sentence by sentence, and difficult to parallelise the training for efficiency. In [6], all sentences in the training corpus were concatenated together to form a single sequence to facilitate minibatch based training. This sequence was then “chopped” into sub-sequences with the average sentence length. BiRNNLMs were then trained on GPU by processing multiple sequences at the same time. This allows biRNNLMs to be trained efficiently. However, issues can arise from the random cutting of sentences, since history and future context vectors may be reset in the middle of a sentence. In [7], the biRNNLMs were trained in a more consistent fashion. Multiple sentences were aligned from left to right to form minibatches during biRNNLM training. In order to handle issues caused by variable sentence length, NULL tokens were appended to the ends of sentences to ensure that the aligned sentences had the same length. These NULL tokens were not used for parameter update. In this paper, this approach is adopted to train biRNNLMs as it gave better performance.
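The NULL-token padding scheme can be sketched in a few lines of Python. The `<null>` token string and the 0/1 mask representation are assumptions made for this example; the mask simply marks which positions would contribute to the parameter update.

```python
def pad_minibatch(sentences, null_token="<null>"):
    """Pad sentences to a common length and return a mask that excludes the padding."""
    max_len = max(len(sentence) for sentence in sentences)
    padded, mask = [], []
    for sentence in sentences:
        num_pad = max_len - len(sentence)
        padded.append(list(sentence) + [null_token] * num_pad)
        # 1 marks a real word, 0 marks a NULL token that is ignored in the update.
        mask.append([1] * len(sentence) + [0] * num_pad)
    return padded, mask

batch = [["hello", "world", "</s>"], ["hi", "</s>"]]
padded, mask = pad_minibatch(batch)
print(padded)
print(mask)
```

Unlike the chopping approach, every row here starts and ends at a sentence boundary, so the history and future context vectors are never reset in the middle of a sentence.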
3 RNNLMs with succeeding words
As discussed above, biRNNLMs are slow to train and difficult to use in lattice rescoring. In order to address these issues, a novel structure, the suRNNLM, is proposed in this paper to incorporate future context information. The model structure is illustrated in Figure 3. In the same fashion as biRNNLMs, the previous history $w_1^{t-1}$ is modeled with recurrent units (e.g. LSTM, GRU). However, instead of modeling the complete future context information, $w_{t+1}^{L}$, using recurrent units, feedforward units are used to capture a finite number, $k$, of succeeding words, $w_{t+1}^{t+k}$. The softmax function is again applied at the output layer to obtain the probability of the current word, $P(w_t \mid w_1^{t-1}, w_{t+1}^{t+k})$. The word embeddings in the projection layer are shared for all input words. When the succeeding words are beyond the sentence boundary, a vector of zeros is used as the word embedding vector. This is similar to the zero padding of feedforward NNLMs at the beginning of each sentence [13].
As the number of succeeding words is finite and fixed for each word, the succeeding words can be organized as a $k$-gram future context and used for minibatch mode training as in feedforward NNLMs [13]. SuRNNLMs can then be trained efficiently in a similar fashion to uniRNNLMs in a spliced sentence bunch mode [12].
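The organization of succeeding words into a fixed-size future context can be sketched as below. This is an illustrative Python fragment under stated assumptions: word ids are plain integers, `None` is used as the padding id, and the embedding table is a hypothetical dictionary; none of these choices come from the toolkit used in the paper.

```python
def future_contexts(word_ids, k, pad_id=None):
    """For each position t, return the k succeeding word ids, padded past the sentence end."""
    contexts = []
    for t in range(len(word_ids)):
        window = word_ids[t + 1 : t + 1 + k]
        contexts.append(window + [pad_id] * (k - len(window)))
    return contexts

def embed(word_id, table, dim):
    """Shared embedding lookup; the padding id maps to an all-zero vector."""
    return [0.0] * dim if word_id is None else table[word_id]

sentence_ids = [5, 2, 7]
contexts = future_contexts(sentence_ids, k=2)
print(contexts)

# Hypothetical 2-dimensional embedding table.
table = {5: [0.1, 0.2], 2: [0.3, 0.4], 7: [0.5, 0.6]}
print([embed(w, table, dim=2) for w in contexts[1]])
```

Since every position has exactly `k` context entries, the future context of a whole minibatch forms a fixed-shape array, which is what allows it to be handled like the input of a feedforward NNLM.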
Compared with Equations 3 and 5, the probability of the word sequence $w_1^L$ can be computed with a suRNNLM as
$$P_{su}(w_1^L) = \frac{1}{Z}\prod_{t=1}^{L} P(w_t \mid w_1^{t-1}, w_{t+1}^{t+k}) \tag{10}$$
where $k$ is the number of succeeding words. Again, the sentence-level normalization term $Z$ is difficult to compute and only a “pseudo” PPL can be obtained. The probabilities of suRNNLMs are also very sharp, which can be seen from the “pseudo” PPLs in Table 2 in Section 6. Hence, the biRNNLM probability smoothing given in Equation 9 is also required for suRNNLMs to achieve good performance at evaluation time.
4 Lattice rescoring
Lattice rescoring with feedforward NNLMs is straightforward [13], whereas approximations are required for uniRNNLM lattice rescoring [8, 18]. As mentioned in Section 2.2, N-best rescoring has previously been used for biRNNLMs. It is not practical to use biRNNLMs for lattice rescoring and generation, as both the complete previous and future context information are required. However, lattices are very useful in many applications, such as confidence score estimation [9], keyword search [10] and confusion network decoding [11]. In contrast, suRNNLMs require a fixed number of succeeding words, instead of the complete future context information. From Figure 3, suRNNLMs can be viewed as a combination of uniRNNLMs for history information and feedforward NNLMs for future context information. Hence, lattice rescoring is feasible for suRNNLMs by extending the lattice rescoring algorithm of uniRNNLMs to consider additional fixed-length future contexts.
4.1 Lattice rescoring of uniRNNLMs
In this paper, the $N$-gram approximation [8] based approach is used for uniRNNLM lattice rescoring. When considering the merging of two paths, if their previous $N-1$ words are identical, the two paths are viewed as “equivalent” and can be merged. This is illustrated in Figure 5. The history information from the best path is kept for the following RNNLM probability computation and the histories of all other paths are discarded. In the example of Figure 5, one of the two paths arriving at the node is kept and the other is discarded.
There are two types of approximation involved in uniRNNLM lattice rescoring, the merge and cache approximations. The merge approximation controls the merging of two paths. In [8], the first path reaching the node was kept and all other paths with the same $N$-gram history were discarded irrespective of the associated scores. This introduces inaccuracies in the RNNLM probability calculation. The merge approximation can be improved by keeping the path with the highest accumulated score. This is the approach adopted in this work. For fast probability lookup in lattice rescoring, $N$-gram probabilities can be cached using the $N-1$ history words as a key. A similar approach can be used with RNNLM probabilities. In [8], RNNLM probabilities were cached based on the previous $N-1$ words, which is referred to as the cache approximation. Thus a word probability obtained from the cache may be derived from another history sharing the same previous $N-1$ words. This introduces another inaccuracy. In order to avoid this inaccuracy yet maintain the efficiency, the cache approximation used in [8] is improved by adopting the complete history as the key for caching RNNLM probabilities. Both modifications yield small but consistent improvements over [8] on a range of tasks.
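The two modifications described above can be sketched in Python. This is a simplified illustration, not the rescoring implementation used in the paper: paths are plain (word list, log score) pairs arriving at a single lattice node, and the class and function names are hypothetical.

```python
def merge_paths(paths, n):
    """Merge paths sharing the same previous n-1 words, keeping the highest score."""
    best = {}
    for words, score in paths:
        key = tuple(words[-(n - 1):]) if n > 1 else ()
        if key not in best or score > best[key][1]:
            best[key] = (words, score)
    return best


class HistoryKeyedCache:
    """Cache LM probabilities using the complete history as the key."""

    def __init__(self, prob_fn):
        self.prob_fn = prob_fn
        self.table = {}
        self.misses = 0

    def prob(self, word, history):
        key = (tuple(history), word)
        if key not in self.table:
            self.misses += 1
            self.table[key] = self.prob_fn(word, history)
        return self.table[key]


# Hypothetical partial paths ending at the same node: (word sequence, log score).
paths = [
    (["a", "b", "c"], -4.0),
    (["x", "b", "c"], -3.5),
    (["a", "d", "c"], -5.0),
]
merged = merge_paths(paths, n=3)
print(merged[("b", "c")])

# Stand-in probability function; a real system would query the RNNLM here.
cache = HistoryKeyedCache(lambda word, history: 1.0 / (1 + len(history)))
cache.prob("c", ["a", "b"])
cache.prob("c", ["a", "b"])  # same complete history: served from the cache
cache.prob("c", ["x", "b"])  # same last word of history, different complete history
print(cache.misses)
```

With a 3-gram approximation the first two paths share the history ("b", "c") and are merged, with the better-scoring one surviving, whereas the cache treats their complete histories as distinct keys.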
4.2 Lattice rescoring of suRNNLMs
For lattice rescoring with suRNNLMs, the $N$-gram approximation can be adopted and extended to support the future word context. In order to handle succeeding words correctly, paths will be merged only if the following $k$ succeeding words are identical. In this way, the path expansion is carried out in both directions. Any two paths with the same $k$ succeeding words and $N-1$ previous words are merged.
Figure 4 shows part of an example lattice generated by a 2-gram LM. In order to apply uniRNNLM lattice rescoring using a 3-gram approximation, the grey shaded node in Figure 4 needs to be duplicated, as its word has two distinct 3-gram histories. Figure 5 shows the lattice after rescoring using a uniRNNLM with the 3-gram approximation. In order to apply suRNNLMs for lattice rescoring, the succeeding words also need to be taken into account. Figure 6 is the expanded lattice using a suRNNLM with 1 succeeding word. The grey shaded nodes in Figure 5 need to be expanded further as they have distinct succeeding words. The blue shaded nodes in Figure 6 are the expanded nodes in the resulting lattice.
Using the $N$-gram history approximation and given $k$ succeeding words, the lattice expansion process is effectively an $(N+k)$-gram lattice expansion for uniRNNLMs. For larger values of $N$ and $k$, the resulting lattices can be very large. This can be addressed by pruning the lattice and doing initial lattice expansion with a uniRNNLM.
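The merging condition for suRNNLMs can be written as an equivalence key over the truncated history and the fixed future context. The Python sketch below is a hedged illustration: `history` is assumed to hold the words before the current word and `future` the words after it along one lattice path, and the example paths are invented.

```python
def merge_key(history, future, n, k):
    """Two paths may be merged only if this key (previous n-1 words, next k words) matches."""
    prev = tuple(history[-(n - 1):]) if n > 1 else ()
    succ = tuple(future[:k])
    return prev, succ

# Hypothetical (history, future) pairs for the same current word on three paths.
path_a = (["we", "will", "meet"], ["today", "at"])
path_b = (["they", "will", "meet"], ["today", "in"])
path_c = (["they", "will", "meet"], ["tomorrow", "at"])

key_a = merge_key(*path_a, n=3, k=1)
key_b = merge_key(*path_b, n=3, k=1)
key_c = merge_key(*path_c, n=3, k=1)
print(key_a == key_b)  # same 2 previous words and same next word
print(key_b == key_c)  # same history but a different succeeding word

# Previous words + current word + succeeding words span n + k positions.
prev, succ = key_a
print(len(prev) + 1 + len(succ))
```

Setting `k=0` reduces the key to the history-only condition used for uniRNNLMs, while increasing `k` splits nodes further, which is the source of the lattice growth noted above.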
5 Language Model Interpolation
For unidirectional language models, such as $N$-gram models and uniRNNLMs, the word probabilities are normally combined using linear interpolation,
$$P(w_t \mid w_1^{t-1}) = (1-\lambda)\, P_{ng}(w_t \mid w_1^{t-1}) + \lambda\, P_{uni}(w_t \mid w_1^{t-1})$$
where $P_{ng}$ and $P_{uni}$ are the probabilities from the $N$-gram and uniRNN LMs respectively, and $\lambda$ is the interpolation weight of the uniRNNLM.
However, it is not valid to directly combine uniLMs (e.g. unidirectional $N$-gram LMs or RNNLMs) and biLMs (or suLMs) using linear interpolation, due to the sentence-level normalisation term required for biLMs (or suLMs) in Equation 5. As described in [7], uniLMs can be log-linearly interpolated with biLMs for speech recognition using
$$P(w_t \mid w_1^{t-1}, w_{t+1}^{L}) = \frac{1}{Z'}\, P_{uni}(w_t \mid w_1^{t-1})^{1-\beta}\, P_{bi}(w_t \mid w_1^{t-1}, w_{t+1}^{L})^{\beta}$$
where $Z'$ is the appropriate normalisation term. The normalisation term can be discarded for speech recognition as it does not affect the hypothesis ranking. $P_{uni}$ and $P_{bi}$ are the probabilities from uniLMs and biRNNLMs respectively, and $\beta$ is the log-linear interpolation weight of biRNNLMs. The issue of the normalisation term in suRNNLMs is similar to that of biRNNLMs, as shown in Equation 10. Hence, log-linear interpolation can also be applied for the combination of suRNNLMs and uniLMs, and is the approach used in this paper.
By default, linear interpolation is used to combine uniRNNLMs and $N$-gram LMs. A two-stage interpolation is used when including biRNNLMs and suRNNLMs. The uniRNNLMs and $N$-gram LMs are first interpolated using linear interpolation. These linearly interpolated probabilities are then log-linearly interpolated with those of biRNNLMs (or suRNNLMs).
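The two-stage interpolation can be sketched as follows. This is a minimal Python illustration under stated assumptions: the exponents `1 - beta` and `beta` in the log-linear stage are an assumed form of the weighting, the normalisation term is dropped as the text permits for hypothesis ranking, and the probabilities and weights in the example are invented.

```python
import math

def linear_interp(p_ngram, p_uni, lam):
    """Stage 1: linear interpolation of N-gram and uniRNNLM word probabilities."""
    return (1.0 - lam) * p_ngram + lam * p_uni

def loglinear_score(p_unilm, p_future, beta):
    """Stage 2: unnormalised log-linear combination (assumed exponents 1-beta and beta)."""
    return (1.0 - beta) * math.log(p_unilm) + beta * math.log(p_future)

def two_stage_log_score(p_ngram, p_uni, p_future, lam, beta):
    """Log score of one word from the N-gram, uniRNNLM and bi/suRNNLM probabilities."""
    return loglinear_score(linear_interp(p_ngram, p_uni, lam), p_future, beta)

# Hypothetical word probabilities from the three models.
p_ngram, p_uni, p_su = 0.2, 0.4, 0.6
print(linear_interp(p_ngram, p_uni, lam=0.75))
print(two_stage_log_score(p_ngram, p_uni, p_su, lam=0.75, beta=0.3))
```

Summing these per-word log scores over a hypothesis gives an unnormalised sentence score, which is sufficient for comparing competing hypotheses of the same utterance.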
6 Experiments
Experiments were conducted using the AMI IHM meeting corpus [19] to evaluate the speech recognition performance of various language models. The Kaldi training data configuration was used. A total of 78 hours of speech was used in acoustic model training. This consists of about 1M words of acoustic transcription. Eight meetings were excluded from the training set and used as the development and test sets.
The Kaldi acoustic model training recipe [20] featuring sequence training [21] was applied for deep neural network (DNN) training. CMLLR transformed MFCC features [22] were used as the input and 4000 clustered context dependent states were used as targets. The DNN was trained with 6 hidden layers, and each layer has 2048 hidden nodes.
The first part of the Fisher corpus, 13M words, was used as additional language model training data. A 49k word decoding vocabulary was used for all experiments. All LMs were trained on the combined (AMI+Fisher) data, 14M words in total. A 4-gram KN smoothed back-off LM without pruning was trained and used for lattice generation. GRU based recurrent units were used for all unidirectional and bidirectional RNNLMs (GRU and LSTM gave similar performance for this task, while GRU LMs are faster for training and evaluation). 512 hidden nodes were used in the hidden layer. An extended version of CUED-RNNLM [23] was developed for the training of uniRNNLMs, biRNNLMs and suRNNLMs. The related code and recipe will be available online at http://mi.eng.cam.ac.uk/projects/cuedrnnlm/. The linear interpolation weight between 4-gram LMs and uniRNNLMs was set to 0.75, as it gave the best performance on the development data. The log-linear interpolation weight for biRNNLMs (or suRNNLMs) was 0.3. The probabilities of biRNNLMs and suRNNLMs were smoothed with a smoothing factor of 0.7, as suggested in [7]. The 3-gram approximation was applied for the history merging of uniRNNLMs and suRNNLMs during lattice rescoring and generation [8].
Table 1 shows the word error rates (WERs) of the baseline system with 4-gram and uniRNN LMs. Lattice rescoring and N-best rescoring are applied to lattices generated by the 4-gram LM. As expected, uniRNNLMs yield a significant performance improvement over 4-gram LMs. Lattice rescoring gives comparable performance to 100-best rescoring. Confusion network (CN) decoding can be applied to lattices generated by uniRNNLM lattice rescoring, and additional performance improvements can be achieved. However, it is difficult to apply confusion network decoding to the 100-best lists (an N-best list can be converted to a lattice and CN decoding can then be applied, but this requires a much larger N-best list, such as the 10K used in [8]).
| LM | rescore | dev Vit | dev CN | eval Vit | eval CN |
|---|---|---|---|---|---|
| ng4 | - | 23.8 | 23.5 | 24.2 | 23.9 |
| +unirnn | 100-best | 21.7 | - | 22.1 | - |
| | lattice | 21.7 | 21.5 | 21.9 | 21.7 |
Table 2 gives the training speed, measured in words per second (w/s), and the (“pseudo”) PPLs of various RNNLMs with different amounts of future word context. When the number of succeeding words is 0, this is the baseline uniRNNLM. When the number of succeeding words is set to $\infty$, a biRNNLM with complete future context information is used. It can be seen that suRNNLMs give a comparable training speed to uniRNNLMs. The additional computational load of the suRNNLMs mainly comes from the feedforward unit for succeeding words, as shown in Figure 3. The computation in this part is much less than that of other parts such as the output layer and GRU layers. However, the training of suRNNLMs is much faster than that of biRNNLMs, as it is difficult to parallelise the training of biRNNLMs efficiently [7]. It is worth mentioning again that the PPLs of uniRNNLMs cannot be compared directly with the “pseudo” PPLs of biRNNLMs and suRNNLMs, although both PPLs and “pseudo” PPLs reflect the average log probability of each word. From Table 2, with an increasing number of succeeding words, the “pseudo” PPLs of the suRNNLMs keep decreasing, yielding values comparable to that of the biRNNLM.
| #succ words | 0 | 1 | 3 | 7 | $\infty$ |
|---|---|---|---|---|---|
| train speed (w/s) | 4.5K | 4.5K | 3.9K | 3.8K | 0.8K |
| (pseudo) PPL | 66.8 | 25.5 | 21.5 | 21.3 | 22.4 |
Table 3 gives the WER results of 100-best rescoring with various language models. For biRNNLMs (or suRNNLMs), it is not possible to use linear interpolation. Thus a two-stage approach is adopted, as described in Section 5. This results in slight differences, in the second decimal place, between the uniRNNLM case and the suRNNLM with 0 future context. Increasing the number of succeeding words consistently reduces the WER. With 1 succeeding word, the WERs were reduced by 0.2% absolute. SuRNNLMs with more than 2 succeeding words gave about 0.5% absolute WER reduction. BiRNNLMs (shown in the bottom line of Table 3) outperform suRNNLMs by 0.1% to 0.2%, as they are able to incorporate the complete future context information with a recurrent connection.
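| LM | #succ words | dev | eval |
|---|---|---|---|
| ng4 | - | 23.8 | 24.2 |
| +unirnn | - | 21.7 | 22.1 |
| +surnn | 0 | 21.7 | 22.1 |
| | 1 | 21.5 | 21.8 |
| | 2 | 21.3 | 21.7 |
| | 3 | 21.3 | 21.6 |
| | 4 | 21.4 | 21.6 |
| | 5 | 21.3 | 21.6 |
| | 6 | 21.3 | 21.6 |
| | 7 | 21.4 | 21.6 |
| | $\infty$ (biRNNLM) | 21.2 | 21.4 |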
Table 4 shows the WERs of lattice rescoring using suRNNLMs. The lattice rescoring algorithm described in Section 4 was applied. SuRNNLMs with 1 and 3 succeeding words were used for lattice rescoring. From Table 4, suRNNLMs with 1 succeeding word give a 0.2% WER reduction, and using 3 succeeding words gives about a 0.5% WER reduction. These results are consistent with the 100-best rescoring results in Table 3. Confusion network decoding can be applied to the rescored lattices, and additional WER improvements of 0.3% to 0.4% are obtained on the dev and eval test sets.
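| LM | #succ words | dev Vit | dev CN | eval Vit | eval CN |
|---|---|---|---|---|---|
| ng4 | - | 23.8 | 23.5 | 24.2 | 23.9 |
| +unirnn | - | 21.7 | 21.5 | 21.9 | 21.7 |
| +surnn | 1 | 21.6 | 21.3 | 21.6 | 21.5 |
| | 3 | 21.3 | 21.0 | 21.4 | 21.1 |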
7 Conclusions
In this paper, the use of future context information in neural network language models has been explored. A novel model structure is proposed to address the issues associated with biRNNLMs, such as slow training speed and difficulties in lattice rescoring. Instead of using a recurrent unit to capture the complete future information, a feedforward unit is used to model a finite number of succeeding words. The existing training and lattice rescoring algorithms for uniRNNLMs are extended to the proposed suRNNLMs. Experimental results show that suRNNLMs achieve slightly worse performance than biRNNLMs, but with much faster training speed. Furthermore, additional performance improvements can be obtained from lattice rescoring and subsequent confusion network decoding. Future work will examine improved pruning schemes to address the lattice expansion issues associated with larger future contexts.
References
[1] Stanley Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[3] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Proc. ISCA INTERSPEECH, 2010.
[4] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
[5] Yangyang Shi, Martha Larson, Pascal Wiggers, and Catholijn Jonker, “Exploiting the succeeding words in recurrent neural network language models,” in Proc. ISCA INTERSPEECH, 2013.
[6] Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen, “Bidirectional recurrent neural network language models for automatic speech recognition,” in Proc. ICASSP. IEEE, 2015, pp. 5421–5425.
[7] Xie Chen, Anton Ragni, Xunying Liu, and Mark Gales, “Investigating bidirectional recurrent neural network language models for speech recognition,” in Proc. ISCA INTERSPEECH, 2017.
[8] Xunying Liu, Xie Chen, Yongqiang Wang, Mark Gales, and Phil Woodland, “Two efficient lattice rescoring methods using recurrent neural network language models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, 2016.
[9] Frank Wessel, Ralf Schlüter, Klaus Macherey, and Hermann Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288–298, 2001.
[10] Xie Chen, Anton Ragni, Jake Vasilakes, Xunying Liu, Kate Knill, and Mark Gales, “Recurrent neural network language models for keyword search,” in Proc. ICASSP. IEEE, 2017, pp. 5775–5779.
[11] Lidia Mangu, Eric Brill, and Andreas Stolcke, “Finding consensus in speech recognition: word error minimization and other applications of confusion networks,” Computer Speech & Language, vol. 14, no. 4, pp. 373–400, 2000.
[12] Xie Chen, Xunying Liu, Yongqiang Wang, Mark Gales, and Phil Woodland, “Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[13] Holger Schwenk, “Continuous space language models,” Computer Speech & Language, vol. 21, no. 3, pp. 492–518, 2007.
[14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[15] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] Junho Park, Xunying Liu, Mark Gales, and Phil Woodland, “Improved neural network based language modelling and adaptation,” in Proc. ISCA INTERSPEECH, 2010.
[17] Frederick Jelinek, “The dawn of statistical ASR and MT,” Computational Linguistics, vol. 35, no. 4, pp. 483–494, 2009.
[18] Martin Sundermeyer, Hermann Ney, and Ralf Schlüter, “From feedforward to recurrent LSTM neural networks for language modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517–529, 2015.
[19] Jean Carletta et al., “The AMI meeting corpus: A pre-announcement,” in Machine Learning for Multimodal Interaction, pp. 28–39. Springer, 2006.
[20] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in ASRU, IEEE Workshop on, 2011.
[21] Karel Veselỳ, Arnab Ghoshal, Lukás Burget, and Daniel Povey, “Sequence-discriminative training of deep neural networks,” in Proc. ISCA INTERSPEECH, 2013.
[22] Mark Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.
[23] Xie Chen, Xunying Liu, Mark Gales, and Phil Woodland, “CUED-RNNLM – an open-source toolkit for efficient training and evaluation of recurrent neural network language models,” in Proc. ICASSP. IEEE, 2015.