Ever since the introduction of Deep Neural Networks (DNNs) to Automatic Speech Recognition (ASR) tasks, researchers have been trying to supplement the raw input features with additional inputs. More representative features are extracted using the first- and second-order derivatives of the raw input features, and features from multiple neighboring frames are concatenated to make use of context information.
Efforts have also been continuously made to design and refine more powerful models. Recurrent Neural Networks (RNNs) were designed for context-sensitive applications, and Convolutional Neural Networks (CNNs) [4] made DNNs more capable of incorporating large amounts of data and making accurate predictions.
In the area of robust ASR, although it is always helpful to incorporate more data, we still lack a model as well designed as the CNN in Computer Vision (CV). Many methods have been proposed on both the front end and the back end. The models in this paper belong to the back-end category.
Inspired by recent progress in Natural Language Processing, we propose the Recurrent Deep Stacking Network (RDSN) and successfully apply it to speech enhancement tasks. RDSN utilizes the phoneme information of previous frames as additional inputs alongside the raw features. From another perspective, this framework transforms the acoustic model into a hybrid model consisting of an acoustic model and a simple phoneme-level N-gram language model.
In the next section, we explain the framework of RDSN and the tricks used to compress the outputs. We then present experimental results and conclude.
2 Recurrent Deep Stacking Networks
2.1 Recurrent Deep Stacking Network
As indicated by its name, the Recurrent Deep Stacking Network stacks and concatenates the outputs of previous frames into the input features of the current frame. If we view acoustic models in ASR systems as functions projecting input features to probability density outputs, the difference between conventional systems and RDSN becomes clearer. Denote the input features at frame $t$ as $x_t$ and the output at frame $t$ as $y_t$. We can see that RDSN tries to model

$$P(y_t \mid x_{t-n}, \ldots, x_{t+n},\; y_{t-k}, \ldots, y_{t-1}),$$

while conventional DNNs try to model

$$P(y_t \mid x_{t-n}, \ldots, x_{t+n}).$$

Note that if we want the RDSN to be causal, we can simplify it to

$$P(y_t \mid x_{t-k}, \ldots, x_t,\; y_{t-k}, \ldots, y_{t-1}),$$

where $k$ in the above formulas represents the number of recurrent frames and $n$ the number of context frames on each side. Figure 1 shows the framework of RDSN.
Adding $y_{t-k}, \ldots, y_{t-1}$ as additional inputs transforms the pure acoustic model into a hybrid model consisting of an acoustic model and a phoneme-level N-gram model that captures the relation between the current phone and previous phones. The phoneme-level N-gram (or, as in the formula above, k-gram) model provides additional phoneme-level information, making the output of the current frame more accurate and more robust to noise and reverberation.
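Concretely, a causal RDSN can be run frame by frame, feeding the compressed outputs of the previous $k$ frames back into the input. The sketch below is illustrative only: `dnn` and `compress` are hypothetical stand-ins for the trained acoustic model and the output-compression step, and the dimensions match the toy setup rather than any particular system.

```python
import numpy as np

def rdsn_forward(feats, dnn, compress, k=9, mono_dim=42):
    """Run a causal RDSN over an utterance frame by frame.

    feats    : (T, d) spliced acoustic features
    dnn      : forward function, (d + k*mono_dim,) -> senone posterior
    compress : maps a senone posterior to a monophone posterior
    k        : number of previous-frame outputs fed back
    """
    history = [np.zeros(mono_dim) for _ in range(k)]  # y_{t-k}..y_{t-1}
    outputs = []
    for t in range(feats.shape[0]):
        x = np.concatenate([feats[t]] + history)      # [x_t, y_{t-k..t-1}]
        y = dnn(x)
        outputs.append(y)
        history = history[1:] + [compress(y)]         # slide the window
    return np.stack(outputs)

# toy check with a random linear stand-in "DNN" and a monophone-sum compressor
rng = np.random.default_rng(1)
W = rng.standard_normal((760 + 9 * 42, 3161)) * 0.01
def dnn(x):
    z = np.exp(x @ W)
    return z / z.sum()
senone2mono = rng.integers(0, 42, size=3161)
compress = lambda y: np.bincount(senone2mono, weights=y, minlength=42)
post = rdsn_forward(rng.random((5, 760)), dnn, compress)
```

At the start of the utterance the recurrent history is all zeros, which mirrors the random/uninformative state the added weights see early in training.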
2.2 Compressing the Outputs
Since the output dimension of an acoustic model is usually in the thousands, adding multiple recurrent outputs would significantly increase the size of the model. We therefore use a compression method based on the correspondence between DNN output dimensions and monophone states: for each output dimension, we find its corresponding monophone and sum its value with those of the other output dimensions that share the same monophone. This process compresses the dimension from thousands to about forty.
Compressing the output dimensions enables us to add multiple recurrent outputs and keep the input vector dense.
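Assuming a lookup array mapping each output dimension to one of the roughly forty monophones (the paper does not specify how this map is stored), the compression step above amounts to a scatter-add:

```python
import numpy as np

def compress_outputs(posteriors, senone2mono, num_mono):
    """Sum senone posteriors that map to the same monophone.

    posteriors  : (num_senones,) DNN output for one frame
    senone2mono : (num_senones,) int array, monophone id per senone
    num_mono    : number of monophone classes (about forty)
    """
    compressed = np.zeros(num_mono)
    np.add.at(compressed, senone2mono, posteriors)  # scatter-add
    return compressed

# toy example: 3161 senone posteriors -> 42 monophone posteriors
rng = np.random.default_rng(0)
senone2mono = rng.integers(0, 42, size=3161)
post = rng.random(3161)
post /= post.sum()                     # normalize to a distribution
mono_post = compress_outputs(post, senone2mono, 42)
```

Because the senone posteriors sum to one and every senone maps to exactly one monophone, the compressed vector is itself a valid monophone posterior.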
2.3 BiPass Stacking Network
Originating from an idea similar to that of RDSN, the BiPass Stacking Network (BPSN) also takes as input both conventional features and the outputs of previous frames. However, BPSN generates the representations of previous frames through a two-pass scheme similar to that of Deep Stacking Networks (DSN).
During the first pass, BPSN sets all the recurrent inputs to zeros and concatenates this zero vector with the extracted features. After obtaining the outputs, we compress them and use the compressed outputs as additional inputs to the second pass. The second pass takes as input both the compressed outputs and the extracted features.
The difference between BPSN and DSN is that instead of stacking representations of different levels all from current frame, BPSN utilizes the information from previous frames as well.
Note that BPSN extends naturally to networks with more than two passes. To add a third pass, we take the outputs of the second pass as additional inputs; by stacking the outputs of the previous passes, we can use as many passes as we want.
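The two-pass scheme could be sketched as follows, again with `dnn` and `compress` as hypothetical stand-ins for the trained acoustic model and the monophone compression step:

```python
import numpy as np

def bpsn_forward(feats, dnn, compress, k=9, mono_dim=42):
    """Two-pass BPSN inference over an utterance of T frames."""
    T = feats.shape[0]
    zero = np.zeros(k * mono_dim)
    # pass 1: recurrent inputs are all zeros
    first = [dnn(np.concatenate([feats[t], zero])) for t in range(T)]
    # compress the first-pass outputs to the monophone level
    mono = [compress(y) for y in first]
    # pass 2: feed in the compressed outputs of the k previous frames
    second = []
    for t in range(T):
        prev = [mono[t - i] if t - i >= 0 else np.zeros(mono_dim)
                for i in range(k, 0, -1)]
        second.append(dnn(np.concatenate([feats[t]] + prev)))
    return np.stack(second)

# toy check with a random linear stand-in "DNN" and compressor
rng = np.random.default_rng(2)
W = rng.standard_normal((760 + 9 * 42, 3161)) * 0.01
def dnn(x):
    z = np.exp(x @ W)
    return z / z.sum()
senone2mono = rng.integers(0, 42, size=3161)
compress = lambda y: np.bincount(senone2mono, weights=y, minlength=42)
post = bpsn_forward(rng.random((4, 760)), dnn, compress)
```

Unlike the causal RDSN loop, each pass here can be run over all frames in parallel, since the recurrent inputs of pass two come entirely from pass one.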
3 Experiments

We conducted experiments on the CHiME-4 dataset, using all of the single-channel utterances. The total length of the dataset is around 100 hours. The training set is a simple mixture of utterances with different background noise types and contains 8738 utterances; the development set consists of 3280 utterances and the test set of 2640. For GMM-HMM model training and decoding, we used the Kaldi recipe.
In our preliminary experiments, the baseline was a conventional DNN with 6 hidden layers, exponential Rectified Linear Units (ReLUs), and dropout. Each hidden layer has 1024 nodes. The input features extracted from the utterances were 40-dimensional MFCC features. We concatenated each frame with the 9 previous and 9 following frames, forming a 760-dimensional input vector. The output has 3161 dimensions.
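The 19-frame splicing described above (9 past + current + 9 future frames of 40-dimensional MFCCs, giving 40 × 19 = 760 dimensions) might be sketched as follows; edge handling by repeating the boundary frames is an assumption, as the paper does not specify it:

```python
import numpy as np

def splice_frames(feats, left=9, right=9):
    """Concatenate each frame with its left/right context; edge frames
    are padded by repeating the first/last frame."""
    T = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel()
                     for t in range(T)])

mfcc = np.random.rand(100, 40)   # 100 frames of 40-dim MFCCs
spliced = splice_frames(mfcc)    # 40 * (9 + 1 + 9) = 760 dims per frame
```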
For both RDSN and BPSN, we took the 9 previous outputs as additional inputs, forming a 1138-dimensional input vector. All other network settings were kept the same as the baseline. We used the 15th epoch of the baseline model as the initial model for RDSN and BPSN.
Some preliminary results are shown in Figure 2.
We can see from Figure 2 that, after a short adjustment period, the cross entropies of RDSN quickly dropped to values substantially lower than those of the baseline system. The adjustment period is likely due to the fact that in the first epoch, the weights corresponding to the additional inputs were all randomly initialized.
4 Conclusion

In this paper, we proposed a Recurrent Deep Stacking Network (RDSN) based speech recognition system and an efficient substitute for RDSN, the BiPass Stacking Network (BPSN). These two models convert a pure acoustic model into a hybrid structure consisting of both an acoustic model and a phoneme-level N-gram model. Note that both RDSN and BPSN can be extended to other types of neural networks such as LSTM RNNs. We tested our models on the CHiME-4 dataset and obtained good results. This performance improvement is likely to be consistent across acoustic model types, because the framework provides additional phoneme-level information on top of the acoustic model. Future work includes scaling up RDSN to compete with state-of-the-art models on the CHiME challenges and building recurrent deep stacking networks from LSTM RNNs.
References

-  Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Arun Narayanan and DeLiang Wang, “Investigation of speech separation as a front-end for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 826–835, 2014.
-  Steve Lawrence, C Lee Giles, and Sandiway Fong, “Natural language grammatical inference with recurrent neural networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 1, pp. 126–140, 2000.
-  Li Deng, Dong Yu, and John Platt, “Scalable stacking and learning for building deep architectures,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 2133–2136.