In recent years, significant progress has been made in speech recognition, and researchers now focus more on real-time problems such as keyword spotting (KWS). In most systems, KWS provides the start signal for ASR, so it requires a high recall rate, a low false alarm rate, and fast computation. Although a large and complex model can guarantee high recall and a low false alarm rate, it also requires substantial computing power. Therefore, many complex models such as LSTMs are not suitable for the KWS problem.
Traditionally, deep neural networks (DNNs) and convolutional neural networks (CNNs) have been adopted to solve the KWS problem and have achieved excellent performance. However, some KWS applications require the model to perform well in various scenarios under different background noise, and many simple KWS models fail to meet this need.
Motivated by this problem, we apply the Hierarchical Neural Network (HNN) to the KWS problem. An HNN uses several models trained at different levels to compute the posterior probability jointly; the final result depends on the combination of the networks from all levels. Each model is trained on data from a different scenario, and low-level models provide bottleneck features to the high-level models. In our experiments, this architecture shows considerable improvement over the traditional network.
Section 2 describes related work. Section 3 presents the KWS algorithm we use. Section 4 introduces the HNN, and the experiments are described in Section 5. The paper concludes with a discussion in Section 6.
2 Related Work
Thomas et al. used multilingual MLP features to build a Large Vocabulary Continuous Speech Recognition (LVCSR) system. Plahl et al. applied hierarchical bottleneck features to LVCSR. Valente et al. applied hierarchical processing of the modulation spectrum to a Mandarin LVCSR system.
There is also plenty of literature on KWS itself. Offline LVCSR systems can be used to detect the keywords of interest. Moreover, Hidden Markov Models (HMMs) are commonly used for online KWS systems. Traditionally, Gaussian Mixture Models (GMMs) were used for acoustic modeling under the HMM framework; over time they have been replaced by Deep Neural Networks (DNNs), and several architectures have been applied.
3 Keyword Spotting
In general, KWS runs on local devices and is a real-time problem; therefore, low latency and a small memory footprint are required to ensure a good user experience and acceptable power consumption. Early KWS was based on offline continuous speech recognition with GMM-HMMs. With the great success of deep neural networks in continuous speech recognition, the traditional GMM-HMM was replaced by the DNN. Recently, Chen et al. designed a KWS strategy without an HMM.
In our research, we use a finite state transducer (FST) to realize KWS by employing word units. An FST consists of a finite number of states connected by transitions labeled with input/output pairs; state transitions depend on the inputs and the transition rules. As an example, consider “hi nomi”, the keyword implemented in our product. We begin by searching the dictionary for the phone units of “hi nomi” and choose “HH AY1 N OW1 M IY1” as its phone sequence. We then collect all its tri-phones, such as “HH-AY1-N” (which may occur in speech data), and cluster them to generate each state. During clustering, tri-phones whose central phone is “HH” (like “HH-HH-AY1” and “sil-HH-AY1”) or “AY1” (like “HH-AY1-AY1” and “AY1-AY1-AY1”) form the first word state “HH-AY1”; “N” and “OW1” form the second word state “N-OW1”; and “M” and “IY1” form the third word state “M-IY1”. “sil” is the silence state, and any input that does not occur on the connected arcs is directed to the “other” state. These labels are generated via forced alignment using our LVCSR system. The “hi nomi” FST is shown in Figure 1. The expressions on the arcs are input/output pairs, such as “HH-AY1”, and the arrows denote state transitions. The device wakes up when the output equals 1, which happens if and only if “hi nomi” occurs.
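As a rough illustration, the search over the three word states can be sketched as a simple state machine. The frame-label interface below is a simplifying assumption for illustration; the real system scores FST arcs with network posteriors rather than hard labels.

```python
# Word states derived from the phone sequence "HH AY1 N OW1 M IY1".
WORD_STATES = ["HH-AY1", "N-OW1", "M-IY1"]

def keyword_fst(frame_labels):
    """Walk through the word states; output 1 (wake up) only if the full
    keyword state sequence is observed. 'sil' keeps the current state, and
    anything unexpected falls into the 'other' state, resetting the search."""
    state = 0  # index of the next word state we expect
    for label in frame_labels:
        if label == "sil":
            continue                              # silence: stay put
        if label == WORD_STATES[state]:
            state += 1                            # matched: advance
            if state == len(WORD_STATES):
                return 1                          # keyword completed -> wake up
        elif state > 0 and label == WORD_STATES[state - 1]:
            pass                                  # self-loop: a state spans many frames
        else:
            state = 0                             # "other" state: restart the search
    return 0
```

A label stream containing the full state sequence, possibly with repeats and leading silence, yields 1; any interruption by an out-of-keyword label resets the search.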
4 Hierarchical Neural Network
In this section, we elaborate on how we apply the Hierarchical Neural Network (HNN) to the KWS problem.
4.1 Training Neural Network
The neural network is trained on data from three environments: quiet (Q), video (V), and in-car (C). The major difference between these three datasets is the background noise. Quiet data has minimal background noise; video data has noise extracted from videos such as movies or television programs; the noise for in-car data consists of road noise and external noise generated by a moving vehicle. The network is trained in the following steps:
A. Train the first level network on the quiet training data. It is a traditional MLP with a bottleneck layer BN1 and randomly initialized weights. The bottleneck layer lets the network learn a low-dimensional feature extracted from the quiet-environment data.
B. Train the second level network on the video training data. Apart from the input feature dimension, it shares the same architecture as the first level. Besides the original input features, it also takes the bottleneck features BN1 from the first level network as input, and it has its own bottleneck layer BN2.
C. Train the third level network on the in-car training data. Its input includes the original input features and the bottleneck features. Unlike the first two levels, the third level is the last one and does not have a bottleneck layer.
The final HNN is the combination of the three level networks. The training architecture of the HNN is shown in Figure 2.
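The level stacking above can be sketched with a toy forward pass. The layer sizes, bottleneck position, and random weights below are illustrative assumptions, not the paper's trained models; the variant shown feeds both BN1 and BN2 to the third level (the "all bottleneck" configuration discussed later).

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """Random weights standing in for a trained network."""
    dims = [in_dim] + hidden + [out_dim]
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x, tap=None):
    """Run the MLP; optionally also return the activation of layer
    `tap` (the bottleneck) alongside the final output."""
    tapped = None
    for i, (w, b) in enumerate(layers):
        x = np.tanh(x @ w + b)
        if i == tap:
            tapped = x
    return x, tapped

D, BN = 440, 64                              # assumed input and bottleneck dims
level1 = mlp(D, [512, BN, 512], 4)           # trained on quiet data, bottleneck BN1
level2 = mlp(D + BN, [512, BN, 512], 4)      # trained on video data, + BN1 input
level3 = mlp(D + 2 * BN, [512, 512], 4)      # trained on in-car data, no bottleneck

x = rng.standard_normal(D)
out1, bn1 = forward(level1, x, tap=1)                     # tap the bottleneck layer
out2, bn2 = forward(level2, np.concatenate([x, bn1]), tap=1)
out3, _ = forward(level3, np.concatenate([x, bn1, bn2]))  # "all bottleneck" input
```

Each level sees the same original features, plus a compact summary of what the lower, cleaner-environment levels learned.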
4.2 Post Processing and Wakeup Decision
Post processing is another issue for the HNN because the three levels have the same input but different outputs. There are two obvious post-processing options: retaining only the third level output, or retaining all outputs. The configurations are shown in Figure 3.
There are also two strategies when retaining all level outputs.
A. Wake up the device as long as the network at any level wakes it up.
B. Use the average of the three levels' outputs to decide whether the device wakes up.
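The two strategies above can be sketched as follows; the threshold value is an assumed placeholder, not the paper's operating point.

```python
THRESHOLD = 0.5  # assumed wake-up threshold, for illustration only

def wake_any(posteriors, threshold=THRESHOLD):
    """Strategy A: wake up if any level's keyword posterior crosses the threshold."""
    return any(p >= threshold for p in posteriors)

def wake_average(posteriors, threshold=THRESHOLD):
    """Strategy B: wake up if the average posterior crosses the threshold."""
    return sum(posteriors) / len(posteriors) >= threshold

# One confident level is enough for strategy A but not for strategy B.
scores = [0.9, 0.2, 0.1]
print(wake_any(scores), wake_average(scores))  # True False
```

The difference matters when the levels disagree: strategy A is more trigger-happy, while strategy B requires broader agreement across levels.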
5 Experiments

We implement our HNN algorithm on the KWS problem, using “hi nomi” as the wake-up keyword. Video and in-car noise is mixed into the training data. The training data consists of 830k utterances, including 520k positive samples and 310k negative samples. 180 hours of data are used to test the recall rate and false alarm (FA) rate. Besides 20k keyword utterances, the test data includes data from various environments such as conference, in-car, and video. We use the KALDI toolkit to train each model in our experiment.
The baseline model is a 3-layer DNN with 4 outputs, and each hidden layer has 512 nodes. The acoustic features are 40-dimensional log-filterbank energies computed every 10 ms over a window of 25 ms. The input context is set to 11 frames in a “5-1-5” format. For a fair comparison, the HNN in our experiment has a similar model size and computational complexity to the baseline model. The model architecture and computational complexity are shown in Table 1 and Table 2.
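The input dimensionality implied by this setup can be sketched as follows; the edge-padding scheme at utterance boundaries is an assumption for illustration.

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """Stack each 40-dim frame with 5 left and 5 right context frames
    (the "5-1-5" format), padding utterance edges by repetition."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    n = frames.shape[0]
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(n)])

feats = np.random.randn(100, 40)   # ~1 s of log-filterbank frames at a 10 ms shift
inputs = stack_context(feats)
print(inputs.shape)  # (100, 440)
```

Eleven stacked 40-dimensional frames give the 440-dimensional network input, one per original frame.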
[Tables 1 and 2: per-level architecture (1st lv / 2nd lv / 3rd lv) and calculation complexity for the all-bottleneck / all-output configurations; table bodies not recoverable from the extraction]
5.1 Bottleneck Architecture
The three levels of the HNN are trained with quiet (Q), video (V), and in-car (C) keyword training data, respectively. We compare the performance of HNNs with different bottleneck architectures.
The ROC curves are shown in Figure 4 and Figure 5. According to the ROC, the HNN performs much better than the baseline DNN model. Moreover, no matter how we process the three levels' outputs, the 1-bottleneck architecture performs better than the all-bottleneck architecture. The reason may be that the first level bottleneck feature is already contained in the second level bottleneck feature, so the more bottleneck features are fed in, the more parameters have to be trained, and these extra parameters may hinder convergence.
5.2 Output Process
Besides the bottleneck architecture, there are three ways to process the outputs:
A. Keep only the final level output.
B. Wake up the device as long as the network at any level wakes it up.
C. Use the average of the three levels' posteriors to decide whether to wake up the device.
We find that the performance of B and C is similar, so we only show the results of A and B in this paper. The ROC is shown in Figure 6. In this part of the experiment, we only use the 1-bottleneck HNN, following the result in Section 5.1.
Most of the time, the all-output HNN performs better than the 1-output architecture, but it performs worse when the FA rate is low. It is easy to see that using all three levels' outputs can increase the recall rate; however, at a low FA operating point, every level is required to perform well, because if any level of the HNN makes a mistake, the network will generate a false alarm.
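This trade-off can be illustrated with a back-of-the-envelope calculation, under the strong (and unrealistic) assumption that the three levels err independently with the same per-level recall r and false-alarm rate f; the numbers below are made up for illustration.

```python
def any_level(r, f):
    """Recall and FA rate when any of the three levels can trigger a wake-up,
    assuming independent, identically performing levels."""
    recall = 1 - (1 - r) ** 3   # a miss requires all three levels to miss
    fa = 1 - (1 - f) ** 3       # a false alarm occurs if any level fires wrongly
    return recall, fa

recall, fa = any_level(r=0.90, f=0.01)
print(round(recall, 4), round(fa, 4))  # 0.999 0.0297
```

Recall improves, but the false-alarm rate is roughly tripled, which matches the observation that the all-output configuration suffers at low-FA operating points.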
5.3 Comparison with CNNs
Besides the DNN baseline, we also compare the HNN with several CNN baselines. HNN1 is chosen for this comparison because it has the best performance in the previous experiments. The CNN architectures are shown in Table 3, and the ROC is shown in Figure 7. According to the ROC, the HNN performs better than most of the CNNs, but some CNNs are still better than the HNN.
[Table 3 kernel notation: [kernel length, kernel height], S[length stride, height stride]]
5.4 Multi Hierarchical Neural Network
Based on the CNN performance and computational complexity, we implement a Multi Hierarchical Neural Network (MHNN) whose first two levels are CNNs and whose third level is a DNN. We choose the same kernel as CNN5 for the CNN layers. The affine configuration of the MHNN and its computational complexity are shown in Table 4 and Table 5, and the ROC is shown in Figure 8.
[Tables 4 and 5: per-level MHNN configuration (1st lv / 2nd lv / 3rd lv) and calculation complexity for the all-bottleneck / all-output configurations; table bodies not recoverable from the extraction]
According to the ROC, most MHNNs perform better than CNN5, which is the best CNN in our experiments.
6 Conclusion

We have applied the Hierarchical Neural Network to the KWS problem. Its performance is better than the original DNN and CNN architectures, and its model size and computational complexity are low enough for real-time problems. Since the CNN baselines perform better than the DNN baseline, we will try a three-level CNN Hierarchical Neural Network in future work.
Samuel Thomas, Sriram Ganapathy and Hynek Hermansky, “Multilingual MLP Features for Low-Resource LVCSR Systems”, in Proc. of IEEE ICASSP, 2012.
Christian Plahl, Ralf Schlüter and Hermann Ney, “Hierarchical Bottleneck Features for LVCSR”, in Interspeech, 2010.
Christian Plahl, Ralf Schlüter and Hermann Ney, “Cross-lingual Portability of Chinese and English Neural Network Features for French and German LVCSR”, in IEEE ASRU, 2011.
Fabio Valente, Mathew Magimai-Doss, Christian Plahl and Suman Ravuri, “Hierarchical Processing of the Modulation Spectrum for GALE Mandarin LVCSR Systems”, in Interspeech, 2009.
David R. H. Miller, Michael Kleber, Chia-Lin Kao, Owen Kimball, Thomas Colthurst, Stephen A. Lowe, Richard M. Schwartz and Herbert Gish, “Rapid and Accurate Spoken Term Detection”, in Interspeech, 2007.
Siddika Parlak and Murat Saraclar, “Spoken Term Detection for Turkish Broadcast News”, in IEEE ICASSP, 2008, pp. 5244-5247.
Richard C. Rose and Douglas B. Paul, “A Hidden Markov Model Based Keyword Recognition System”, in IEEE ICASSP, 1990.
Jay G. Wilpon, Lawrence R. Rabiner, C.-H. Lee and E. R. Goldman, “Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 11, pp. 1870-1878, 1990.
Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups”, IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
Sankaran Panchapagesan, Ming Sun, Aparna Khare, Spyros Matsoukas, Arindam Mandal, Björn Hoffmeister and Shiv Vitaladevuni, “Multi-task Learning and Weighted Cross-entropy for DNN-based Keyword Spotting”, in Interspeech, 2016.
Ming Sun, David Snyder, Yixin Gao, Varun Nagaraja, Mike Rodehorst, Sankaran Panchapagesan, Nikko Strom, Spyros Matsoukas and Shiv Vitaladevuni, “Compressed Time Delay Neural Network for Small Footprint Keyword Spotting”, in Interspeech, 2017.
Guoguo Chen, Sanjeev Khudanpur, Daniel Povey, Jan Trmal, David Yarowsky and Oguz Yilmaz, “Quantifying the Value of Pronunciation Lexicons for Keyword Search in Low Resource Languages”, in IEEE ICASSP, 2013.
Daniel Povey et al., “The Kaldi Speech Recognition Toolkit”, in IEEE ASRU, IEEE Signal Processing Society, 2011.