Keyword spotting (KWS), also known as spoken term detection (STD), is the task of detecting predefined keywords in a stream of utterances. It is usually deployed in intelligent agents on mobile phones or smart devices. Recently, deep neural network (DNN) based KWS has led to significant performance improvement over conventional methods. Deep KWS [1] first treats keyword spotting as an audio classification problem. It trains a DNN model to predict the posteriors of predefined keywords: each neuron in the softmax output layer of the DNN model corresponds to a keyword, with an additional “filler” neuron representing all non-keyword segments. This classification-based method achieves significant improvement over keyword/filler hidden Markov models. Later on, a number of classification-based methods [2, 3, 4, 5, 6, 7, 8] were explored to reduce the memory footprint.
However, because the softmax cross entropy loss focuses on maximizing the classification accuracy on the training data, the aforementioned models require a large number of training samples to achieve robust performance against diverse non-keyword segments in the test stage [1, 2, 4]. Because collecting every type of non-keyword segment for model training is expensive and sometimes infeasible, the classification-based models [3, 5, 6, 7, 8] perform particularly poorly in practice. Moreover, using a single “filler” neuron to represent all non-keyword segments does not reflect the diversity of these sounds, which further degrades performance.
Recent studies introduced metric learning into KWS. Metric learning adopts a ranking loss to learn the relative distance between samples. It aims to enlarge the inter-class variance and reduce the intra-class variance in an embedding space. However, directly applying metric learning to KWS causes a significant performance drop, because it ignores the prior knowledge that the target keywords are predefined and fixed. To address this problem, Huh et al. [12] proposed an angular prototypical network with fixed target classes (AP-FC) to enhance the robustness against non-keyword segments. However, they have to use an additional support vector machine (SVM) to make the final decision. In [13], Vygon et al. combined a triplet loss-based embedding extractor with a k-nearest neighbor (kNN) classifier, which achieves higher accuracy than the cross entropy loss based methods. However, their method substantially increases the number of parameters and the computational complexity of the KWS model.
Motivated by works on the open-set recognition problem [14, 15], in this paper we propose a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC), and a confidence-based decision method, which together lead to a robust, small-footprint, and high-accuracy KWS model. Specifically, the proposed multi-class AUC loss simultaneously maximizes the classification accuracy of predefined keywords and the detection AUC of non-keyword segments. We compared the proposed multi-class AUC loss with the softmax cross entropy loss, prototypical loss, AP-FC loss, and triplet loss on the Google Speech Commands dataset v1 and v2. Experimental results demonstrate that our methods outperform the comparison methods in most evaluation metrics. The main contributions of this paper are summarized as follows:
To our knowledge, we reformulate the low resource keyword spotting task as an open-set recognition problem for the first time.
We propose a novel multi-class AUC loss. It outperforms the four representative referenced methods in most evaluation metrics.
We propose a new confidence-based decision method. It helps the proposed method achieve the state-of-the-art performance without using a complex back-end classifier.
The original AUC optimization is designed for binary-class classification only. Therefore, before describing the proposed multi-class AUC loss function, we first take a look at the existing binary AUC optimization.
Given a binary-class dataset where , and a binary-class neural network with being the parameter of the network, we define two new subsets: which is a set of neural network scores for the samples with , and which represents a set of neural network scores for the samples with . Cardinalities of these two subsets are and respectively. As described in , for the finite set of samples , the approximate estimate of the AUC metric is:
where is an indicator function that returns 1 if the statement is true, and 0 otherwise, and and are the elements of and respectively. As  did, we relax (1) by replacing the indicator function by a modified hinge loss function:
where , and is a tunable hyperparameter controlling the distance margin between and . Substituting (2) into (1) transforms the maximization problem of (1) into the following minimization problem:
which can be easily backpropagated throughout the network in a standard procedure.
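To make the relaxation concrete, the following is a minimal NumPy sketch of the pairwise AUC estimate and its hinge-relaxed loss. It is an illustration under assumptions, not the authors' implementation: the function names, the default margin, and the `power` exponent (the text only says "modified hinge loss"; `power=1` gives the plain hinge) are hypothetical.

```python
import numpy as np


def auc_estimate(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs that are ranked correctly."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]  # shape (N+, 1)
    neg = np.asarray(neg_scores, dtype=float)[None, :]  # shape (1, N-)
    return float(np.mean(pos > neg))


def auc_hinge_loss(pos_scores, neg_scores, margin=0.3, power=1):
    """Hinge relaxation of 1 - AUC, averaged over all (positive, negative) pairs.

    `margin` plays the role of the distance-margin hyperparameter in the text;
    `power` is an assumed knob for the "modified" hinge.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    diff = pos - neg  # every pairwise score difference
    return float(np.mean(np.maximum(0.0, margin - diff) ** power))


if __name__ == "__main__":
    pos, neg = [0.9, 0.8], [0.1, 0.85]
    print(auc_estimate(pos, neg))    # 3 of the 4 pairs are ordered correctly
    print(auc_hinge_loss(pos, neg))  # only pairs closer than the margin contribute
```

A pair contributes zero loss once the positive score exceeds the negative score by at least the margin, so minimizing the loss pushes the empirical AUC toward 1.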
3 Algorithm description
3.1 Problem formulation
In this paper, we decompose the KWS task into a non-keyword segments detection subtask and a closed-set classification subtask. Specifically, for a given input sample, we first determine whether it belongs to a predefined keyword set. If so, then we decide which keyword it is. Note that the two subtasks are performed simultaneously in our proposed method.
To formalize the task, suppose there is a dataset where is a high-dimensional acoustic feature of the -th sample, and is the ground-truth label of . Note that, without loss of generality, we always assume that there are categories with class representing non-keyword segments, and the other classes representing keywords respectively.
We aim to train a neural network where is the parameter of the network. It maps the -dimensional input acoustic feature to a -dimensional vector. Each dimension of the vector represents the confidence score of its corresponding keyword. In the test stage, we use to conduct KWS by the following criterion:
where denotes the output scores of the neural network , and is a decision threshold. For simplicity, we denote in the remainder of the paper.
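The test-stage criterion can be sketched as below. This is a hedged reading of the text, assuming the network emits one confidence score per keyword, that label 0 stands for non-keyword segments, and that an input is accepted when its largest score reaches the threshold (whether the comparison is strict is an assumption). The names `kws_decide` and `NON_KEYWORD` are hypothetical.

```python
import numpy as np

NON_KEYWORD = 0  # assumed label for non-keyword segments


def kws_decide(scores, threshold):
    """Return the keyword index (1..K) or NON_KEYWORD for one input.

    `scores` holds the K per-keyword confidence scores of the network.
    """
    scores = np.asarray(scores, dtype=float)
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return best + 1  # keywords are numbered from 1
    return NON_KEYWORD


if __name__ == "__main__":
    print(kws_decide([0.1, 0.7, 0.2], threshold=0.5))  # second keyword is accepted
    print(kws_decide([0.1, 0.3, 0.2], threshold=0.5))  # rejected as non-keyword
```

Both subtasks are resolved from the same score vector: the maximum decides detection, and its position decides the keyword.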
3.2 The proposed multi-class AUC optimization
Several studies have extended the binary AUC optimization to multi-class problems, e.g. [19, 20]. In this work, we propose a new extension that is suitable for most multi-class classification tasks and is computationally straightforward. The key idea of this extension is to modify the two subsets and in the binary AUC optimization to new forms that satisfy the multi-class AUC optimization problem.
Specifically, for the general KWS problem with more than one keyword, we define the subset of positive examples as
and the subset of negative samples with
where is the score at the -th position of the vector , is the maximum value of after removing the score at the -th position of , and represents the set of the output scores of the neural network for the non-keyword segments in .
Algorithm 1 presents the proposed multi-class AUC loss in detail.
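The construction of the two subsets described in Section 3.2 can be sketched in NumPy as follows. This is an illustrative reading rather than a reproduction of Algorithm 1. It assumes a score matrix with one column per keyword (at least two keywords), labels in which 0 marks non-keyword segments and 1..K mark keywords, and a plain hinge; the function name and default margin are hypothetical.

```python
import numpy as np


def multiclass_auc_loss(scores, labels, margin=0.3):
    """Hinge-relaxed AUC loss over the positive and negative score sets.

    Positives: for each keyword sample, the score at its own keyword position.
    Negatives: for each keyword sample, the largest remaining score, plus
    every output score of every non-keyword sample.
    """
    scores = np.asarray(scores, dtype=float)  # shape (N, K), K >= 2
    labels = np.asarray(labels, dtype=int)    # shape (N,), 0 = non-keyword
    is_kw = labels > 0

    kw_scores = scores[is_kw]
    kw_idx = labels[is_kw] - 1
    rows = np.arange(kw_scores.shape[0])

    pos = kw_scores[rows, kw_idx]             # target-keyword scores
    masked = kw_scores.copy()
    masked[rows, kw_idx] = -np.inf            # hide the target position
    hardest_other = masked.max(axis=1)        # best competing keyword score

    neg = np.concatenate([hardest_other, scores[~is_kw].ravel()])
    if pos.size == 0 or neg.size == 0:
        return 0.0                            # no pairs in this mini-batch

    diff = pos[:, None] - neg[None, :]        # all positive/negative pairs
    return float(np.mean(np.maximum(0.0, margin - diff)))


if __name__ == "__main__":
    s = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.4]]
    y = [1, 2, 0]
    print(multiclass_auc_loss(s, y, margin=0.3))
```

Because non-keyword samples only ever appear on the negative side, no output neuron has to represent them, which matches the motivation for avoiding a “filler” class.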
3.3 Confidence based decision for the multi-class AUC loss
In the test stage, the decision threshold is calculated on a validation set by:
where is the size of the validation set.
3.4 Connection to other loss functions
This subsection presents the connection of the proposed multi-class AUC to other loss functions.
3.4.1 Connection to multi-class hinge loss
Under the same supposition in Section 3.1, the multi-class classification hinge loss is presented as:
The connection between the proposed multi-class AUC loss and the multi-class hinge loss is as follows. The multi-class AUC loss is computed over pairs of samples drawn from the whole training set. It essentially learns a ranking of the training samples without resorting to an explicit classification-based loss. In contrast, the multi-class hinge loss computes the objective on each sample separately and then averages over the entire dataset. It also needs to assign all non-keyword segments to a single class.
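For comparison, a per-sample multi-class hinge loss can be sketched as below. The exact form used in the paper is not reproduced here; this sketch assumes the common variant that penalizes only the strongest competing class, with non-keyword segments forming their own class 0 among the output columns. The function name and default margin are hypothetical.

```python
import numpy as np


def multiclass_hinge_loss(scores, labels, margin=1.0):
    """Average over samples of max(0, margin + best wrong score - true score)."""
    scores = np.asarray(scores, dtype=float)  # shape (N, C), column 0 = non-keyword
    labels = np.asarray(labels, dtype=int)    # shape (N,), values in 0..C-1
    rows = np.arange(scores.shape[0])

    true = scores[rows, labels]
    masked = scores.copy()
    masked[rows, labels] = -np.inf            # exclude the true class
    runner_up = masked.max(axis=1)

    return float(np.mean(np.maximum(0.0, margin + runner_up - true)))


if __name__ == "__main__":
    s = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.0]]
    print(multiclass_hinge_loss(s, [0, 2]))
```

Each term depends on a single sample, whereas the AUC loss compares scores across different samples; that is the contrast drawn in the paragraph above.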
3.4.2 Connection to AP-FC loss
The AP-FC loss first arranges the keywords in a predefined order. Then, for each mini-batch, it selects one sample from each keyword, followed by non-keywords. Note that the first samples should be arranged in the predefined order of the keywords.
According to , we rewrite the AP-FC loss as:
where is the extracted feature of the -th sample by the neural network, is the learnable class center of the -th keyword, and are learnable parameters with .
The proposed AUC loss and the AP-FC loss are similar in that neither assigns the widely distributed non-keyword segments to a single “filler” class. However, the implementation of the AP-FC loss imposes a strict constraint on the samples in each mini-batch. Moreover, the AP-FC loss-based model still needs an SVM back-end to make the final decision.
3.4.3 Connection to other multi-class AUC loss
The multi-class AUC optimization in [20] is a natural extension of the binary AUC optimization. Gimeno et al. extended the binary AUC optimization to the multi-class problem through the one-versus-one and one-versus-rest frameworks. The one-versus-one multi-class AUC loss is obtained by averaging the pairwise binary AUC losses. The one-versus-rest multi-class AUC loss decomposes the multi-class classification task into binary tasks. For the -th task, the -th class is viewed as the positive class, and all other classes are merged into a negative class. However, these two methods cannot be directly used for our open-set optimization problem, since they need to assign non-keyword segments to a “filler” class. In addition, our proposed AUC loss requires a single AUC computation rather than one per class or per class pair, which makes it more computationally efficient than the above two methods.
4 Experimental setup
4.1 Data preparation
In our experiments, two popular keyword spotting datasets, Google Speech Commands version 1 (GSC v1) [16] and version 2 (GSC v2) [17], are used for evaluation. GSC v1 consists of 65K one-second-long recordings of 30 words from thousands of different speakers. GSC v2 is an expanded version of GSC v1, which contains 105K utterances of 35 words. In addition, both datasets contain several minute-long background noise files. All signals in the two datasets are sampled at 16 kHz.
Both GSC v1 and GSC v2 include a “validation_list” file and a “testing_list” file. We use the audio files in “validation_list” and “testing_list” as validation and testing data, and the other audio files as training data. Following previous works, we apply random time-shift and noise injection to the training data. Specifically, we first perform a random time-shift of milliseconds to each sample, where . We then add background noise to each sample with a probability of 0.8, where the noise is chosen randomly from the background noises. Note that the random time-shift and noise injection are performed on the fly at each training step. Finally, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features are extracted with a window length of 25 ms and a stride of 10 ms, and stacked over the time axis.
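The on-the-fly augmentation can be sketched as follows. Only the 0.8 noise probability and the 16 kHz sampling rate come from the text; the shift range (`max_shift_ms=100`), the noise scaling (`noise_gain=0.1`), the zero padding after shifting, and the function name are assumptions made for illustration.

```python
import numpy as np


def augment(wave, noises, rng, sample_rate=16000,
            max_shift_ms=100, noise_prob=0.8, noise_gain=0.1):
    """Randomly time-shift a waveform, then mix in background noise with prob. 0.8.

    `noises` is a list of 1-D arrays, each at least as long as `wave`.
    """
    wave = np.asarray(wave, dtype=float)
    max_shift = int(sample_rate * max_shift_ms / 1000)
    shift = int(rng.integers(-max_shift, max_shift + 1))

    shifted = np.zeros_like(wave)             # vacated samples stay silent
    if shift > 0:
        shifted[shift:] = wave[:-shift]       # delay the signal
    elif shift < 0:
        shifted[:shift] = wave[-shift:]       # advance the signal
    else:
        shifted[:] = wave

    if rng.random() < noise_prob:
        noise = np.asarray(noises[int(rng.integers(len(noises)))], dtype=float)
        start = int(rng.integers(0, len(noise) - len(wave) + 1))
        shifted = shifted + noise_gain * noise[start:start + len(wave)]
    return shifted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = np.ones(16000)                     # stand-in for a one-second recording
    background = [np.zeros(32000)]            # stand-in for a noise file
    print(augment(clip, background, rng).shape)
```

Drawing a fresh shift and noise segment on every call means the same recording is seen with different perturbations at each training step.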
4.2 Backbone network
We use res15 [3] as the backbone network. As shown in Figure 1, it starts with a bias-free convolution layer (Conv) with weight , where and are the height and width of the convolution kernel respectively, and is the number of output channels. The output of the first convolution layer is then fed to a chain of residual blocks (Res), followed by a separate non-residual convolution layer. Finally, the output of the network is obtained by an average-pooling layer (Avg-pool). Additionally, a convolution dilation is used to increase the receptive field of the network, and a batch normalization layer (BatchNorm) is added after each convolution layer to help train the deep network. The details of the backbone network are listed in Table 1.
|Loss||Back-end||GSC v1 Total acc||GSC v1 Closed acc||GSC v1 F1 score||GSC v2 Total acc||GSC v2 Closed acc||GSC v2 F1 score|
|Cross entropy||-||89.96%||97.14%||0.8805||92.74%||97.46%||0.9068|
4.3 Tasks and evaluation metrics
The tasks in previous works [3, 5, 7, 8] focus on discriminating the 11 keywords (“yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, “silence”) and a non-keyword “unknown”, where “silence” denotes silence segments and “unknown” represents all other words. In their settings, all unknown words used in the test set have been seen by the model in the training stage, which is not consistent with real-world KWS applications.
To better match real-world KWS applications, our experiments follow the task in , where ten unknown words (“zero”, “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”) are used for testing only. We evaluate our model with the following metrics and compare the results with related works.
Total acc is the classification accuracy on the test set that contains unseen unknown words, which is more likely to reflect the performance of the KWS model in the real world. Note that unseen unknown words represent the above ten unknown words that are used for testing only.
Closed acc is the classification accuracy on the test set that does not contain unseen unknown words.
We also report the F1 score that is extended to multi-class one by “macro” average on the test set that contains unseen unknown words.
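The three scalar metrics can be computed as in the sketch below, assuming integer class labels and a boolean mask marking the test samples that are unseen unknown words. The helper names and the `is_unseen` mask are hypothetical.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))


def kws_metrics(y_true, y_pred, is_unseen):
    """Total acc and macro F1 on the full test set; Closed acc without unseen words."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    closed = ~np.asarray(is_unseen, dtype=bool)
    return {
        "total_acc": accuracy(y_true, y_pred),
        "closed_acc": accuracy(y_true[closed], y_pred[closed]),
        "macro_f1": macro_f1(y_true, y_pred),
    }


if __name__ == "__main__":
    print(kws_metrics([1, 2, 0, 0], [1, 2, 0, 1], [False, False, False, True]))
```

In this toy example the only error is on the unseen unknown sample, so Closed acc stays at 1.0 while Total acc and the macro F1 score drop.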
In addition, we plot the detection error tradeoff (DET) curve of the non-keyword segments detection subtask to evaluate the KWS models.
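The points of a DET curve for the detection subtask can be obtained by sweeping the decision threshold, as in the sketch below. It assumes each input is summarized by its maximum confidence score and is accepted as a keyword when that score reaches the threshold; the function and argument names are hypothetical.

```python
import numpy as np


def det_points(keyword_scores, nonkeyword_scores):
    """Return (thresholds, false_alarm_rate, miss_rate) over all observed scores.

    A miss is a keyword input rejected as non-keyword; a false alarm is a
    non-keyword input accepted as a keyword.
    """
    kw = np.asarray(keyword_scores, dtype=float)
    nk = np.asarray(nonkeyword_scores, dtype=float)
    thresholds = np.unique(np.concatenate([kw, nk]))
    miss = np.array([np.mean(kw < t) for t in thresholds])
    false_alarm = np.array([np.mean(nk >= t) for t in thresholds])
    return thresholds, false_alarm, miss


if __name__ == "__main__":
    th, fa, miss = det_points([0.9, 0.6], [0.7, 0.2])
    for t, a, m in zip(th, fa, miss):
        print(f"threshold={t:.1f} false_alarm={a:.2f} miss={m:.2f}")
```

Plotting the miss rate against the false alarm rate, usually on normal-deviate axes, gives the DET curve; curves closer to the origin indicate better detection.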
4.4 Data sampler
Usually, the training samples in each mini-batch are randomly drawn from the whole training set, which causes the proportion of keywords to non-keywords to vary greatly across mini-batches. We denote this sampling method as the random sampler. However, the variable proportion hinders the convergence of model training for the proposed method. To overcome this problem, we use a fixed proportion sampler, which keeps the proportion of keywords to non-keywords constant in each mini-batch.
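A fixed proportion sampler can be sketched as a generator of index batches. The defaults of 32 keyword and 64 non-keyword samples are the values reported in Section 4.5; the label convention (0 for non-keywords), the dropping of leftover samples, and the function name are assumptions.

```python
import numpy as np


def fixed_proportion_batches(labels, n_keyword=32, n_nonkeyword=64, seed=0):
    """Yield index arrays holding a fixed number of keyword and non-keyword samples."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    kw = rng.permutation(np.flatnonzero(labels > 0))   # shuffled keyword indices
    nk = rng.permutation(np.flatnonzero(labels == 0))  # shuffled non-keyword indices

    # Stop when either pool can no longer fill its share of a batch.
    n_batches = min(len(kw) // n_keyword, len(nk) // n_nonkeyword)
    for b in range(n_batches):
        yield np.concatenate([
            kw[b * n_keyword:(b + 1) * n_keyword],
            nk[b * n_nonkeyword:(b + 1) * n_nonkeyword],
        ])


if __name__ == "__main__":
    toy_labels = np.array([1, 2] * 10 + [0] * 30)
    for batch in fixed_proportion_batches(toy_labels, n_keyword=4, n_nonkeyword=6):
        print(int((toy_labels[batch] > 0).sum()), int((toy_labels[batch] == 0).sum()))
```

Every batch then contributes the same number of positive and negative scores to the AUC loss, which is the property the random sampler lacks.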
4.5 Training details
Each model in our experiments is trained for 60 epochs, using the Adam optimizer. The initial learning rate is set to 0.001 and reduced to 0.0001 after 30 epochs. For the cross entropy loss, we use a mini-batch size of 128 and weight decay of . We use the same hyperparameters in  and  for the prototypical loss, AP-FC loss and triplet loss. We use the validation set to select the best model among different epochs and evaluate the effect of the hyperparameter .
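The step learning-rate schedule described above amounts to the small function below. It assumes zero-based epoch indices, so that the first 30 epochs use 0.001 and the remaining 30 use 0.0001; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, reduced_lr=1e-4, drop_epoch=30):
    """Learning rate for a zero-based epoch index under a single step decay."""
    return base_lr if epoch < drop_epoch else reduced_lr


if __name__ == "__main__":
    schedule = [learning_rate(e) for e in range(60)]
    print(schedule[0], schedule[29], schedule[30], schedule[59])
```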
We evaluate the proposed multi-class AUC loss with both the fixed proportion sampler and the random sampler. For the fixed proportion sampler, the numbers of keywords and non-keywords in each mini-batch are set to 32 and 64, respectively; for the random sampler, the mini-batch size is set to 128, which is the same as for the comparison methods. The hyperparameter is set to 0.3. Following the same training procedure, we run each comparison method five independent times and report the average performance.
5.1 Evaluation of the proposed methods
Table 2 lists the comparison results between the proposed methods and the four baselines. From the table, we see that both variants of the proposed multi-class AUC loss achieve significant improvement in terms of Total acc and F1 score, and achieve results competitive with the best referenced method in terms of Closed acc. We take the result on GSC v1 as an example. Compared with the cross entropy loss, the multi-class AUC loss with the fixed proportion sampler achieves 30.0% and 25.9% relative improvement in Total acc and F1 score respectively. It also achieves a slightly higher Closed acc than the cross entropy loss. Even when compared with the triplet loss with a complex kNN back-end, the proposed method still achieves a relative improvement of 11.1% in Total acc and 9.8% in F1 score while maintaining a similar Closed acc.
To further investigate the effectiveness of the proposed method, we conduct a comparison on GSC v2 using the same settings as on GSC v1. The experimental results again demonstrate the superiority of our method. In addition, all evaluation metrics improve substantially on GSC v2, which we attribute to its larger training set; this is consistent with the experimental phenomenon in . However, although both variants of the proposed multi-class AUC loss achieve better results on GSC v2 than on GSC v1, the improvement with the random sampler is more evident than that with the fixed proportion sampler. This may be because the training data of GSC v2 contains more non-keywords than the training data of GSC v1.
From Table 2 we also see that the AP-FC loss with an SVM back-end and the triplet loss with a kNN back-end outperform the prototypical loss and the cross entropy loss. This suggests that the metric learning-based methods still require a decision back-end to achieve satisfactory performance. In addition, we plot the DET curves of the non-keyword segments detection subtask in Figure 2. The curves are consistent with the results presented in Table 2, and show that the two variants of the proposed multi-class AUC loss outperform the referenced methods.
5.2 Effect of the hyperparameter on performance
This subsection investigates the effect of the hyperparameter on performance. Because there are no unseen unknown words in the validation set, here we only use the Closed acc and F1 score as the evaluation metrics. For simplicity, we show the experimental results on GSC v1 only. Note that the experimental phenomena on the other evaluation dataset are consistent with those on GSC v1. Table 3 lists the result on GSC v1. From the table, one can see that the parameter , which controls the margin of the AUC loss, plays an important role in the performance. Both variants of the multi-class AUC loss outperform the cross entropy baseline in the two evaluation metrics when . It is also observed that the results in both evaluation metrics first increase and then decrease as increases, where the best performance is achieved at .
In this study, we have proposed a robust and highly accurate KWS method based on a novel multi-class AUC loss function and a confidence-based decision method. Our KWS method not only significantly improves the robustness of the model against unseen sounds by optimizing the proposed multi-class AUC loss, but also eliminates the complex back-end processing module by using the simple confidence-based decision method. To our knowledge, this is the first time that the low resource keyword spotting task is formulated as an open-set recognition problem. We compared the proposed method with four representative methods on the two publicly available datasets GSC v1 and GSC v2. Experimental results show that the proposed method significantly outperforms the four representative methods in most evaluations, with a smaller model size and lower computational complexity.
-  Guoguo Chen, Carolina Parada, and Georg Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
-  Sercan Ö Arık, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” Proc. Interspeech 2017, pp. 1606–1610, 2017.
-  Raphael Tang and Jimmy Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5484–5488.
-  Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” Proc. Interspeech 2018, pp. 2037–2041, 2018.
-  Seungwoo Choi, Seokjun Seo, Beomjun Shin, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim, and Sungjoo Ha, “Temporal convolution for real-time keyword spotting on mobile devices,” Proc. Interspeech 2019, pp. 3372–3376, 2019.
-  Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Zhengkun Tian, Chenghao Zhao, and Cunhang Fan, “A time delay neural network with shared weight self-attention for small-footprint keyword spotting,” in INTERSPEECH, 2019, pp. 2190–2194.
-  Menglong Xu and Xiao-Lei Zhang, “Depthwise separable convolutional resnet with squeeze-and-excitation blocks for small-footprint keyword spotting,” Proc. Interspeech 2020, pp. 2547–2551, 2020.
-  Chen Yang, Xue Wen, and Liming Song, “Multi-scale convolution for robust keyword spotting,” Proc. Interspeech 2020, pp. 2577–2581, 2020.
-  Niccolo Sacchi, Alexandre Nanchen, Martin Jaggi, and Milos Cernak, “Open-vocabulary keyword spotting with audio and text embeddings,” in INTERSPEECH 2019-IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019.
-  Yougen Yuan, Zhiqiang Lv, Shen Huang, and Lei Xie, “Verifying deep keyword spotting detection with acoustic word embeddings,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 613–620.
-  Peng Zhang and Xueliang Zhang, “Deep template matching for small-footprint and configurable keyword spotting,” Proc. Interspeech 2020, pp. 2572–2576, 2020.
-  Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, and Joon Son Chung, “Metric learning for keyword spotting,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 133–140.
-  Roman Vygon and Nikolay Mikhaylovskiy, “Learning efficient representations for keyword spotting with triplet loss,” arXiv preprint arXiv:2101.04792, 2021.
-  Abhijit Bendale and Terrance E Boult, “Towards open set deep networks,” in , 2016, pp. 1563–1572.
-  Terrance DeVries and Graham W Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
-  P. Warden, “Speech commands: A public dataset for single-word speech recognition,” Dataset available from http://download.tensorflow.org/data/speech_commands_v0, vol. 1, 2017.
-  P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
-  Zi-Chen Fan, Zhongxin Bai, Xiao-Lei Zhang, Susanto Rahardja, and Jingdong Chen, “AUC optimization for deep learning based voice activity detection,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6760–6764.
-  Zhongxin Bai, Xiao-Lei Zhang, and Jingdong Chen, “Partial auc optimization based deep speaker embeddings with class-center learning for text-independent speaker verification,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6819–6823.
-  Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida, “Generalising auc optimisation to multiclass classification for audio segmentation with limited training data,” IEEE Signal Processing Letters, 2021.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.