Keyword spotting (KWS), sometimes also referred to as spoken term detection (STD), is one of the most widely used speech-related techniques; it aims at detecting occurrences of a particular word or multi-word phrase in a given speech utterance. A prominent example of STD is wake-up word detection (WWD), embedded in “Google’s voice search” [1], “Apple’s Siri”, and “Amazon’s Echo”, which has attracted much attention recently.
A conventional approach to STD is based on the keyword/filler hidden Markov model (HMM) [2, 3, 4], which remains strongly competitive to this day [5]. At runtime, Viterbi decoding is used to search for the best path in the decoding graph, which can be computationally expensive depending on the HMM topology. Other approaches to STD rely on pattern matching schemes such as dynamic time warping (DTW) [6] to quantify the similarity between templates, including Gaussian posteriorgrams [7], phoneme posteriorgrams [8], and CNN-based bottleneck features [9]. However, DTW has known inadequacies [10] and is quadratic-time in the duration of the segments [11].
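The quadratic cost of DTW mentioned above can be seen in a minimal sketch of the standard dynamic-programming recursion. This is an illustration only, not the configuration used in the cited systems: the Euclidean frame distance and the function names are assumptions.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature frames (an assumed local cost)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Cumulative DTW alignment cost between two sequences of frames.

    The table has (len(seq_a) + 1) x (len(seq_b) + 1) cells and every cell is
    visited once, which is where the quadratic time comes from.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cumulative cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

By contrast, comparing two fixed-dimensional embeddings with the cosine distance costs time linear in the embedding dimension, independent of the segment durations.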
Recently, many researchers have tried to embed variable-length speech signals into fixed-dimensional vectors called acoustic word embeddings [12, 13, 14, 15, 16, 17, 18], whose similarity can be measured with the cosine distance at a small computational cost compared to DTW. In [12, 13], acoustic word embeddings were generated from a long short-term memory (LSTM) network trained with whole-word output targets, obtaining better performance than DTW-based approaches. Later, in [17, 18], an additional performance gain was achieved by introducing a triplet network [19], which was originally proposed in [20] to solve the problem of signature verification. The network was optimized with a triplet loss function on the LSTM output layer that tries to minimize the distance between embeddings from the same word class and simultaneously maximize the distance between embeddings from different word classes. However, the triplet loss does not consider phonetic information but relies only on the relative relationship between words.
In this paper, we propose a phonetically associated triplet network (PATN) that extends the previous work through a hierarchical multitask learning scheme [21, 22] to utilize phonetic information in the triplet network. Similar to [21, 22], a frame-level cross-entropy loss function is introduced at the lower layer of the triplet network to explicitly impose the concept that different layers encode different levels of information: the lower layer models frame-level variations, while the higher layer describes the relationship among words. Experimental results show that more discriminative embeddings can be obtained from the proposed model trained on a convex combination of the two loss functions.
Denote an input acoustic feature sequence as $X = (x_1, \ldots, x_T)$ and the corresponding labels as $Y = (y_1, \ldots, y_T)$, where $T$ is the number of acoustic frames in a word. In this section, we first review previous work (Figure 1-(a)) and then present our proposed method in detail (Figure 1-(b)).
2.1 Triplet network
A triplet network [19] consists of three identical networks with shared weights. Intuitively, the triplet network is encouraged to find an embedding space where the distance between examples from the same word class ($X_a$ and $X_p$) is smaller than the distance between examples from different word classes ($X_a$ and $X_n$) by at least a margin $m$. Formally, given a triplet of acoustic feature sequences $(X_a, X_p, X_n)$, the triplet network is trained to minimize the triplet loss defined as

$$\mathcal{L}_{\mathrm{triplet}} = \max\{0,\; m + d_{+} - d_{-}\},$$

where $f(\cdot)$ is an acoustic word embedding function, $m$ is the margin constraint, and $d_{+} = d_{\cos}(f(X_a), f(X_p))$ and $d_{-} = d_{\cos}(f(X_a), f(X_n))$ are the cosine distances between acoustic word embeddings that belong to the same and to different word classes, respectively. As in [17, 18], we use the concatenation of the hidden representations from a bidirectional LSTM network [23] as our acoustic word embedding function.
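As a rough illustration of the loss described above, the following sketch computes a margin-based triplet loss on already-extracted embeddings using the cosine distance. The function names and the margin values in the usage lines are hypothetical, and the embeddings are plain Python lists standing in for the BLSTM outputs.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss on embeddings.

    The loss is zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least `margin`.
    """
    d_pos = cosine_distance(anchor, positive)  # same word class
    d_neg = cosine_distance(anchor, negative)  # different word class
    return max(0.0, margin + d_pos - d_neg)


# Well-separated triplet: no penalty (margin 0.5 is an arbitrary example value).
loss_easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=0.5)
# Violating triplet: the negative is closer to the anchor than the positive.
loss_hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], margin=0.5)
```

In training, the same computation would be applied to embeddings produced by the three weight-sharing branches and averaged over a mini-batch.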
2.2 Phonetically associated triplet network (PATN)
Since the triplet network is trained on a word-level criterion, the resulting acoustic word embedding may not be sensitive to small variations within words. We therefore propose the phonetically associated triplet network (PATN), which is jointly trained on both word- and frame-level criteria with the loss

$$\mathcal{L}_{\mathrm{PATN}} = (1-\lambda)\,\mathcal{L}_{\mathrm{triplet}} + \lambda\,\mathcal{L}_{\mathrm{CE}}, \qquad \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} \mathbb{1}(y_n = k)\,\log p(k \mid x_n),$$

where $\mathcal{L}_{\mathrm{CE}}$ is a cross-entropy loss obtained from the softmax with $K$ classes, $p(k \mid x_n)$ is the softmax output for class $k$, $\lambda \in [0, 1]$ is a hyper-parameter which controls the trade-off between the two loss functions, $N$ is the number of data in a mini-batch, and the indicator function $\mathbb{1}(y_n = k)$ is 1 if $y_n$ is equal to class label $k$ and 0 otherwise. Similar to [21, 22], we introduce the cross-entropy loss function at the lower layer of the triplet network, as depicted in Figure 1-(b). Such a low-level auxiliary task explicitly encourages the intuitive and empirically observed notion that different layers encode different levels of information.
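The joint objective can be sketched as follows. This is a minimal illustration under stated assumptions: the word-level triplet term is taken as an already-computed number, the frame-level term is a softmax cross-entropy averaged over the frames of a mini-batch, and the weight is assumed to multiply the cross-entropy term (so a weight of 0 recovers the plain triplet network). All function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax over `logits`."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def frame_cross_entropy(batch_logits, batch_labels):
    """Average frame-level cross-entropy over a mini-batch of frames."""
    losses = [
        softmax_cross_entropy(logits, label)
        for logits, label in zip(batch_logits, batch_labels)
    ]
    return sum(losses) / len(losses)


def patn_loss(triplet_term, cross_entropy_term, weight):
    """Convex combination of the word-level and frame-level losses.

    Assumption: `weight` scales the cross-entropy term and (1 - weight)
    scales the triplet term.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1] for a convex combination")
    return (1.0 - weight) * triplet_term + weight * cross_entropy_term


# Two frames with two state classes each (toy logits, not real model outputs).
ce = frame_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
total = patn_loss(triplet_term=2.0, cross_entropy_term=4.0, weight=0.25)
```

In the full model the cross-entropy term would be computed from a softmax attached to the lower BLSTM layer, while the triplet term uses the embedding from the top layer.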
To confirm whether the embeddings extracted from the proposed architecture represent the characteristics of words, we conducted word discrimination experiments [17, 18, 24, 14]. Word discrimination is a simplified version of wake-up word detection in which the word boundary information is given. As in wake-up word detection, the test consists of two steps: enrollment and verification. In the enrollment phase, a query speech segment is fed into the trained BLSTM, and the enrollment embedding is generated by concatenating the last hidden state vectors of the forward and backward directions of the BLSTM. In the verification phase, test embeddings are generated in the same way, and the cosine distance between the enrollment and test embeddings is measured. To obtain more reliable results, we used the cosine distance averaged over 5 enrollment queries for each keyword. By sweeping a threshold, we obtain the recall at a given false alarm rate, which is commonly used to measure performance in the wake-up word detection task [25, 12, 26].
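The enrollment and verification procedure above can be sketched as follows, operating on precomputed embeddings. This is an illustrative sketch rather than the evaluation code used in the experiments: the function names, the `hours` argument (total duration of the test audio) and the false-alarm budget are assumptions introduced for the example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def averaged_distance(enrollment_embeddings, test_embedding):
    """Mean cosine distance between a test embedding and all enrollment queries."""
    distances = [cosine_distance(e, test_embedding) for e in enrollment_embeddings]
    return sum(distances) / len(distances)


def recall_at_false_alarm_rate(scores, is_keyword, hours, max_fa_per_hour):
    """Best recall over all thresholds that respect the false-alarm budget.

    scores:     averaged distances, one per test segment (lower = more similar)
    is_keyword: True where the test segment really is the enrolled keyword
    """
    positives = sum(is_keyword)
    if positives == 0:
        raise ValueError("at least one true keyword segment is required")
    best_recall = 0.0
    for threshold in sorted(set(scores)):
        # A segment is accepted when its distance does not exceed the threshold.
        accepted = [label for s, label in zip(scores, is_keyword) if s <= threshold]
        hits = sum(accepted)
        false_alarms = len(accepted) - hits
        if false_alarms / hours <= max_fa_per_hour:
            best_recall = max(best_recall, hits / positives)
    return best_recall


score = averaged_distance([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
recall = recall_at_false_alarm_rate(
    scores=[0.1, 0.2, 0.3, 0.4],
    is_keyword=[True, False, True, False],
    hours=1.0,
    max_fa_per_hour=1.0,
)
```

Averaging over several enrollment queries reduces the influence of any single atypical query on the decision.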
Our models were trained on triplets selected from the WSJ [27] SI-284 training set. Each triplet consisted of three word segments randomly chosen from all words in the training set. Note that the minimum duration of the segments was sec, and the segments were extracted using the forced alignment of the transcriptions from a GMM-HMM acoustic model trained on the same training set with the open-source Kaldi toolkit [28]. For testing, we used two kinds of datasets: the test sets (i.e., eval92 and eval93) from the WSJ database (in-domain test) and the training set from the RM database [29] (out-of-domain test). From each test set we selected queries with a high frequency of occurrence; they are listed in Table 1.
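One simple way to draw such triplets is sketched below. The text only states that the segments were chosen at random, so the details here are assumptions: the anchor and positive are two distinct segments of one word, the negative is a segment of a different word, and `segments_by_word` is a hypothetical mapping from word labels to segment identifiers.

```python
import random


def sample_triplet(segments_by_word, rng):
    """Draw (anchor, positive, negative) segment identifiers at random.

    The anchor and positive come from the same word class, which therefore
    needs at least two segments; the negative comes from any other word class
    that has at least one segment.
    """
    eligible = [w for w, segs in segments_by_word.items() if len(segs) >= 2]
    if not eligible:
        raise ValueError("no word has two or more segments")
    word = rng.choice(eligible)
    others = [w for w, segs in segments_by_word.items() if w != word and segs]
    if not others:
        raise ValueError("no other word class available for the negative")
    anchor, positive = rng.sample(segments_by_word[word], 2)
    negative = rng.choice(segments_by_word[rng.choice(others)])
    return anchor, positive, negative


# Toy example with made-up segment identifiers.
toy = {"company": ["c1", "c2"], "dollars": ["d1"]}
anchor, positive, negative = sample_triplet(toy, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.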
3.2 Model details
We represented the speech signal using -dimensional Mel-filterbank log energies calculated with a msec frame size with overlap. For PATN training, the state-level label of each speech frame was generated by the GMM-HMM acoustic model through forced alignment. Note that we utilized both monophone ( states) and tied-triphone ( states) state-level labels. The acoustic word embeddings were generated by concatenating the last hidden state vectors of the forward and backward directions of BLSTMs consisting of hidden layers and hidden units. All models were trained for epochs using the Adam optimization algorithm [30], implemented in the TensorFlow toolkit [31], with a batch size of , learning rate of , , , and .
We first examined how the mixing weight between the two terms in the PATN loss affects performance. To this end, we measured word discrimination performance on the development data in terms of recall at the operating threshold of false alarm (FA) per hour. As can be seen in Figure 2, the proposed method outperformed the baseline for any value of the mixing weight, with the best performance obtained at in . This means that the additional phonetic information, especially with monophone state-level targets, can improve the performance of the triplet network.
Next, Table 2 summarizes the performance of the baseline and the proposed method on both the in-domain and out-of-domain test sets. Here, we used the mixing weight that achieved the highest performance on the development set (see Figure 2). The proposed method clearly outperformed the baseline, consistent with the results on the development set, even though we did not increase the model size. Surprisingly, the proposed method remained effective in the out-of-domain condition, achieving over relative improvement with the same model size.
Table 1: Queries selected from each test set.

|Test set|Queries|
|---|---|
|WSJ (ID)|company, dollars, from, hundred, nineteen, percent, point, seven|
|RM (OOD)|Bismark, coral, displacement, Formosa, frigates, kilometers, Mozambique, Siberian, Thailand, Tonkin, Westpac, Zulu|
Table 2: Recall @ FA/hr on the in-domain (ID) and out-of-domain (OOD) test sets.

|Model|WSJ (ID)|RM (OOD)|
|---|---|---|
|TN ( BLSTM layer)| | |
|TN ( BLSTM layers)| | |
|TN ( BLSTM layers)| | |
|Proposed ( BLSTM layers)| | |
To verify the effectiveness of the proposed method, we plotted a two-dimensional visualization of the embeddings extracted from words that occurred at least times in the out-of-domain test set, using t-distributed stochastic neighbor embedding (t-SNE) [32], a non-linear dimensionality reduction technique particularly well suited to high-dimensional data (see Figure 3). As can be seen, most of the embeddings are clustered into their corresponding word classes for both the baseline and the proposed method. Since the triplet network is trained on the similarity between words, it yields discriminative word representations in the embedding space even for data unseen during training. We can also observe that the confusability between words is reduced in the proposed method, resulting in more separable clusters of word embeddings (e.g., give vs. get, have vs. how, for vs. from, chart vs. track). We therefore conclude that the proposed method increases the discrimination between words while maintaining the generalization power of the triplet network.
In this paper, we proposed a novel architecture called the phonetically associated triplet network (PATN), which learns more discriminative embeddings by injecting phonetic information into the triplet network. Specifically, we applied the hierarchical multitask learning framework to the triplet network by introducing an auxiliary cross-entropy loss function at the lower layer of the LSTM. On the same-different word discrimination task, which is similar to wake-up word detection except that the word boundaries are given, our approach outperformed the previous triplet network architecture, achieving over relative improvement in terms of recall at the operating threshold of false alarm (FA) per hour. Moreover, we showed that our model generalizes to an out-of-domain dataset. Finally, a qualitative comparison of the proposed method and the baseline with t-SNE visualizations indicated that phonetic information helps in generating acoustic word embeddings.
As a future direction, we will extend this work by using a larger amount of training data to improve the performance of the triphone-based PATN mentioned in Section 3.3. To do so, we will also look into ways of reducing the extremely long training times, such as triplet selection [33] and class-wise triplet loss [34]. Given the promising results on the out-of-domain task, and with further considerations such as temporal context information [35], our method may be successfully applied to the personalized wake-up word detection task.
This material is based upon work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10063424, Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots).
[1] J. Schalkwyk et al., “‘Your word is my command’: Google search by voice: A case study,” in Advances in Speech Recognition, pp. 61–90. Springer, 2010.
[2] R. C. Rose and D. B. Paul, “A hidden Markov model based keyword recognition system,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1990, pp. 129–132.
[3] J. G. Wilpon, L. R. Rabiner, C. H. Lee, and E. R. Goldman, “Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870–1878, 1990.
[4] J. G. Wilpon, L. G. Miller, and P. Modi, “Improvements and applications for key word recognition using hidden Markov modeling techniques,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1991, pp. 309–312.
[5] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, “Multi-task learning and weighted cross-entropy for DNN-based keyword spotting,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2016, pp. 760–764.
[6] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[7] Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in Proceedings of the Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 398–403.
[8] T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in Proceedings of the Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 421–426.
[9] H. Lim, Y. Kim, Y. Kim, and H. Kim, “CNN-based bottleneck feature for noise robust query-by-example spoken term detection,” in Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2017, pp. 1278–1281.
[10] L. R. Rabiner, A. Rosenberg, and S. Levinson, “Considerations in dynamic time warping algorithms for discrete word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 6, pp. 575–582, 1978.
[11] K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings,” in Proceedings of the Automatic Speech Recognition & Understanding (ASRU), 2013, pp. 410–415.
[12] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5236–5240.
[13] J. Hou, L. Xie, and Z. Fu, “Investigating neural network based query-by-example keyword spotting approach for personalized wake-up word detection in Mandarin Chinese,” in Proceedings of the International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016, pp. 1–5.
[14] H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4950–4954.
[15] Y. Wang, H. Lee, and L. Lee, “Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6269–6273.
[16] S. Changhao, Z. Junbo, W. Yujun, and X. Lei, “Attention-based end-to-end models for small-footprint keyword spotting,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2018, pp. 2037–2041.
[17] S. Settle and K. Livescu, “Discriminative acoustic word embeddings: Recurrent neural network-based approaches,” in Proceedings of Spoken Language Technology Workshop (SLT), 2016, pp. 503–510.
[18] S. Settle, K. Levin, H. Kamper, and K. Livescu, “Query-by-example search with discriminative neural acoustic word embeddings,” in Proceedings of Interspeech, 2017, pp. 2874–2878.
[19] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proceedings of International Workshop on Similarity-Based Pattern Recognition, 2015, pp. 84–92.
[20] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 1994, pp. 737–744.
[21] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, “Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 3532–3536.
[22] K. Krishna, S. Toshniwal, and K. Livescu, “Hierarchical multitask learning for CTC-based speech recognition,” CoRR, vol. abs/1807.06234, 2018.
[23] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[24] M. A. Carlin, S. Thomas, A. Jansen, and H. Hermansky, “Rapid evaluation of speech representations for spoken term discovery,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2011.
[25] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
[26] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2015.
[27] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the Workshop on Speech and Natural Language, 1992, pp. 357–362.
[28] D. Povey et al., “The Kaldi speech recognition toolkit,” in Proceedings of the Automatic Speech Recognition & Understanding (ASRU), 2011.
[29] P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett, “The DARPA 1000-word resource management database for continuous speech recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1988, pp. 651–654.
[30] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[31] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org.
[32] L. Van Der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, pp. 2579–2605, 2008.
[33] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[34] Z. Ming, J. Chazalon, M. M. Luqman, M. Visani, and J. C. Burie, “Simple triplet loss based on intra/inter-class metric learning for face verification,” in Proceedings of International Conference on Computer Vision Workshop (ICCVW), 2017, pp. 1656–1664.
[35] Y. Yougen, L. Cheung-Chi, X. Lei, C. Hongjie, M. Bin, and L. Haizhou, “Learning acoustic word embeddings with temporal context for query-by-example speech search,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2018, pp. 97–101.