AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data

07/13/2021 ∙ by Menglong Xu, et al.

Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, when training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios, where sounds that do not appear in the training data are frequently encountered. Most conventional methods aim to maximize the classification accuracy on the training set, without taking these unseen sounds into account. To enhance the robustness of deep-neural-network-based KWS, in this paper we introduce a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC). The proposed method not only maximizes the classification accuracy of keywords on the closed training set, but also maximizes the AUC score to optimize the detection of non-keyword segments. Experimental results on the Google Speech Commands dataset v1 and v2 show that our method achieves new state-of-the-art performance in terms of most evaluation metrics.




1 Introduction

Keyword spotting (KWS), also known as spoken term detection (STD), is the task of detecting predefined keywords in a stream of utterances. It is commonly used in intelligent agents on mobile phones and smart devices. Recently, deep neural network (DNN) based KWS has led to significant performance improvement over conventional methods. Deep KWS [1] first considers keyword spotting as an audio classification problem. It trains a DNN model to predict the posteriors of predefined keywords, in which each neuron in the softmax output layer of the DNN model corresponds to a keyword, with an additional “filler” neuron representing all other non-keyword segments. This classification-based method achieves significant improvement over the keyword/filler hidden Markov models. Later on, a number of classification-based methods [2, 3, 4, 5, 6, 7, 8] were explored to miniaturize the memory footprint.

However, because the softmax cross entropy loss focuses on maximizing the classification accuracy on the training data, the aforementioned models require a large number of training samples to achieve robust performance against various non-keyword segments in the test stage [1, 2, 4]. Because collecting as many types of non-keyword segments as possible for model training is expensive and sometimes infeasible, the classification-based models [3, 5, 6, 7, 8] perform particularly poorly in practice. Moreover, using a single “filler” neuron to represent all non-keyword segments does not reflect the diversity of these sounds, which further degrades the performance.

Recently, several works [9, 10, 11, 12, 13] introduced metric learning into KWS. Metric learning adopts a ranking loss to learn the relative distance between samples. It aims to enlarge the inter-class variance and reduce the intra-class variance in an embedded space of data. However, directly applying metric learning to KWS results in a significant performance drop, because it ignores the prior knowledge that the target keywords are predefined and fixed. To address the problem, Huh et al. [12] proposed an angular prototypical network with fixed target classes (AP-FC) to enhance the robustness against non-keyword segments. However, they have to use an additional support vector machine (SVM) to make the final decision. In [13], Vygon et al. combined a triplet loss-based embedding extractor with a K-Nearest Neighbor (kNN) classifier, which achieves higher accuracy than the cross entropy loss based methods. However, their method substantially increases the number of parameters and the computational complexity of the KWS model.

Motivated by some works on the open-set recognition problem [14, 15], in this paper, we propose a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC), and a confidence based decision method, which leads to a robust, small-footprint, and high accuracy KWS model. Specifically, the proposed multi-class AUC loss maximizes the classification accuracy of predefined keywords, and the detection AUC of non-keyword segments simultaneously. We compared the proposed multi-class AUC loss with softmax cross entropy loss [3], prototypical loss [12], AP-FC loss [12], and triplet loss [13] on the Google Speech Commands dataset v1 [16] and v2 [17]. Experimental results demonstrate that our methods outperform the comparison methods in most evaluation metrics. The main contributions of this paper are summarized as follows:

  • To our knowledge, we reformulate the low resource keyword spotting task as an open-set recognition problem for the first time.

  • We propose a novel multi-class AUC loss. It outperforms the four representative referenced methods in most evaluation metrics.

  • We propose a new confidence-based decision method. It helps the proposed method achieve the state-of-the-art performance without using a complex back-end classifier.

The remainder of the paper is organized as follows. Section 2 reviews the binary AUC optimization. Section 3 introduces the proposed AUC loss. Sections 4 and 5 present the experimental setup and results respectively. Section 6 concludes the paper.

2 Background

The original AUC optimization is designed for binary-class classification only. Therefore, before describing the proposed multi-class AUC loss function, we first take a look at the existing binary AUC optimization.

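Given a binary-class dataset $\mathcal{X} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ where $y_n \in \{0, 1\}$, and a binary-class neural network $f_\theta(\cdot)$ with $\theta$ being the parameter of the network, we define two new subsets: $\mathcal{D}^+$, which is the set of neural network scores for the samples with $y_n = 1$, and $\mathcal{D}^-$, which is the set of neural network scores for the samples with $y_n = 0$. The cardinalities of these two subsets are $N^+$ and $N^-$ respectively. As described in [18], for the finite set of samples $\mathcal{X}$, the approximate estimate of the AUC metric is:

$$\mathrm{AUC} = \frac{1}{N^+ N^-} \sum_{i=1}^{N^+} \sum_{j=1}^{N^-} \mathbb{I}\left(s_i^+ > s_j^-\right) \qquad (1)$$

where $\mathbb{I}(\cdot)$ is an indicator function that returns 1 if the statement is true, and 0 otherwise, and $s_i^+$ and $s_j^-$ are the elements of $\mathcal{D}^+$ and $\mathcal{D}^-$ respectively. As [19] did, we relax (1) by replacing the indicator function by a modified hinge loss function:

$$\ell_{\mathrm{hinge}}(z) = \max(0, \delta - z) \qquad (2)$$

where $z = s_i^+ - s_j^-$, and $\delta$ is a tunable hyperparameter controlling the distance margin between $s_i^+$ and $s_j^-$. Substituting (2) into (1) transforms the maximization problem of (1) into the following minimization problem:

$$\ell = \frac{1}{N^+ N^-} \sum_{i=1}^{N^+} \sum_{j=1}^{N^-} \max\left(0, \delta - \left(s_i^+ - s_j^-\right)\right) \qquad (3)$$

which can be backpropagated through the network in the standard way.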
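To make the relationship between the pairwise AUC estimate and its hinge relaxation concrete, the following is a minimal pure-Python sketch. It assumes a plain (non-squared) hinge with a margin; the function names, the `margin` argument, and the toy scores are illustrative choices rather than details taken from the paper.

```python
from typing import Sequence


def empirical_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) score pairs that are ranked correctly."""
    correct = sum(1 for sp in pos_scores for sn in neg_scores if sp > sn)
    return correct / (len(pos_scores) * len(neg_scores))


def hinge_auc_loss(pos_scores: Sequence[float], neg_scores: Sequence[float],
                   margin: float = 0.3) -> float:
    """Pairwise hinge surrogate: penalize pairs whose score gap is below `margin`.

    Assumption: a plain hinge max(0, margin - z) with z = s_pos - s_neg.
    """
    total = sum(max(0.0, margin - (sp - sn))
                for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


# Toy scores (made up for illustration).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.1]
auc = empirical_auc(pos, neg)                 # 8 of the 9 pairs are ordered correctly
loss = hinge_auc_loss(pos, neg, margin=0.3)   # small but non-zero: some gaps are below the margin
```

Note that the surrogate can be positive even for correctly ordered pairs, because it asks for a gap of at least `margin` and not just the right order.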

3 Algorithm description

3.1 Problem formulation

In this paper, we decompose the KWS task into a non-keyword segments detection subtask and a closed-set classification subtask. Specifically, for a given input sample, we first determine whether it belongs to a predefined keyword set. If so, then we decide which keyword it is. Note that the two subtasks are performed simultaneously in our proposed method.

To formalize the task, suppose there is a dataset $\mathcal{X} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ where $\mathbf{x}_n$ is a high-dimensional acoustic feature of the $n$-th sample, and $y_n$ is the ground-truth label of $\mathbf{x}_n$. Note that, without loss of generality, we always assume that there are $K+1$ categories with class $0$ representing non-keyword segments, and the other $K$ classes representing keywords respectively.

We aim to train a neural network $f_\theta(\cdot)$ where $\theta$ is the parameter of the network. It maps the $d$-dimensional input acoustic feature to a $K$-dimensional vector. Each dimension of the vector represents the confidence score of its corresponding keyword. In the test stage, we use $f_\theta(\cdot)$ to conduct KWS by the following criterion:

$$\hat{y} = \begin{cases} \arg\max_{k} s_k, & \text{if } \max_{k} s_k \ge \eta \\ 0, & \text{otherwise} \end{cases}$$

where $\mathbf{s} = [s_1, \ldots, s_K]$ is the output scores of the neural network $f_\theta(\mathbf{x})$, and $\eta$ is a decision threshold. For simplicity, we denote $f_\theta(\mathbf{x}_n)$ as $\mathbf{s}_n$ in the remainder of the paper.
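The test-stage criterion can be sketched as a thresholded arg-max over the keyword confidence scores. The sketch below assumes that non-keyword segments carry label 0, that keywords are labelled 1 to K in the order of the score vector, and that a score equal to the threshold counts as a detection; these conventions and the name `decide` are assumptions made for illustration.

```python
from typing import Sequence

NON_KEYWORD = 0  # assumed label for non-keyword segments


def decide(scores: Sequence[float], threshold: float) -> int:
    """Return the keyword label (1..K) with the highest confidence,
    or NON_KEYWORD when that confidence is below the threshold."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    if scores[best] >= threshold:
        return best + 1  # position k in the score vector maps to keyword label k + 1
    return NON_KEYWORD


confident = decide([0.1, 0.8, 0.3], threshold=0.5)   # second keyword is detected
rejected = decide([0.2, 0.4, 0.3], threshold=0.5)    # no score is confident enough
```

With this rule no dedicated “filler” output is needed: rejection is decided purely by how confident the best keyword score is.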

3.2 The proposed multi-class AUC optimization

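Several studies have extended the binary AUC optimization to multi-class problems, e.g., [19, 20]. In this work, we propose a new extension that is suitable for most multi-class classification tasks and computationally straightforward. The key idea of this extension is to modify the two subsets $\mathcal{D}^+$ and $\mathcal{D}^-$ in the binary AUC optimization to new forms that satisfy the multi-class AUC optimization problem.

Specifically, for the general KWS problem with more than one keyword, we define the subset of positive examples as

$$\mathcal{D}^+ = \left\{ s_{n,y_n} \mid y_n \neq 0 \right\}$$

and the subset of negative samples as

$$\mathcal{D}^- = \left\{ \tilde{s}_n \mid y_n \neq 0 \right\} \cup \left\{ \max_{k} s_{n,k} \mid \mathbf{s}_n \in \mathcal{S}_0 \right\}$$

where $s_{n,y_n}$ is the score at the $y_n$-th position of the score vector $\mathbf{s}_n$ of the $n$-th sample, $\tilde{s}_n = \max_{k \neq y_n} s_{n,k}$ is the maximum value of $\mathbf{s}_n$ after removing the score at the $y_n$-th position of $\mathbf{s}_n$, and $\mathcal{S}_0$ represents the set of the output scores of the neural network for the non-keyword segments in $\mathcal{X}$.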

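Algorithm 1 presents the proposed multi-class AUC loss in detail.

Algorithm 1 Multi-class AUC loss for KWS
Input:    a batch of acoustic features, $\mathbf{X}$; the corresponding labels, $\mathbf{y}$; the number of samples in the mini-batch, $N$; predefined hyperparameter, $\delta$;
Output:    loss $\ell$ on the current mini-batch;
1:  ;
2:  ;
3:  Init the positive subset $\mathcal{D}^+$ which contains $N^+$ samples and the negative subset $\mathcal{D}^-$ which contains $N^-$ samples;
4:  ;
5:  for each score vector $\mathbf{s}_n$ in the mini-batch do
6:     if $y_n \neq 0$ then
7:        add $s_{n,y_n}$ to subset $\mathcal{D}^+$;
8:        add the largest element of $\mathbf{s}_n$ except $s_{n,y_n}$ to subset $\mathcal{D}^-$;
9:     else
10:        add the largest element of $\mathbf{s}_n$ to subset $\mathcal{D}^-$;
11:     end if
12:  end for
13:  compute $\ell$ over $\mathcal{D}^+$ and $\mathcal{D}^-$ by (3);
14:  return $\ell$;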
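The subset construction in Algorithm 1 can be sketched in a few lines of plain Python. This is a sketch under stated assumptions, not the authors' implementation: scores are taken as already computed by the network, label 0 marks non-keyword segments, keyword label k refers to position k - 1 of the score vector, and the pairwise term is a plain hinge with margin `margin`. The names `build_subsets` and `multiclass_auc_loss` are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_subsets(scores: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Collect the positive and negative score subsets for one mini-batch."""
    positives: List[float] = []
    negatives: List[float] = []
    for s, y in zip(scores, labels):
        if y != 0:
            target = y - 1  # keyword label k sits at index k - 1 (assumption)
            positives.append(s[target])
            # strongest competing keyword score acts as a negative
            negatives.append(max(v for k, v in enumerate(s) if k != target))
        else:
            # non-keyword segment: its most confident keyword score is a negative
            negatives.append(max(s))
    return positives, negatives


def multiclass_auc_loss(scores: Sequence[Sequence[float]],
                        labels: Sequence[int], margin: float = 0.3) -> float:
    """Average pairwise hinge over all (positive, negative) score pairs."""
    positives, negatives = build_subsets(scores, labels)
    if not positives or not negatives:
        return 0.0  # no pairs to rank in this batch
    total = sum(max(0.0, margin - (sp - sn))
                for sp in positives for sn in negatives)
    return total / (len(positives) * len(negatives))


# Toy batch with K = 3 keywords: two keyword samples and one non-keyword sample.
batch_scores = [[0.9, 0.1, 0.2], [0.3, 0.8, 0.1], [0.4, 0.2, 0.3]]
batch_labels = [1, 2, 0]
pos_set, neg_set = build_subsets(batch_scores, batch_labels)
loss_small_margin = multiclass_auc_loss(batch_scores, batch_labels, margin=0.3)
loss_large_margin = multiclass_auc_loss(batch_scores, batch_labels, margin=0.5)
```

In the toy batch every positive score exceeds every negative score by at least 0.3, so the loss vanishes at that margin and becomes positive only when a larger margin is requested.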

3.3 Confidence based decision for the multi-class AUC loss

In the test stage, the decision threshold is calculated on a validation set by:


where is the size of the validation set.

3.4 Connection to other loss functions

This subsection presents the connection of the proposed multi-class AUC to other loss functions.

3.4.1 Connection to multi-class hinge loss

Under the same supposition in Section 3.1, the multi-class classification hinge loss is presented as:


The connection between the proposed multi-class AUC loss and the multi-class hinge loss is as follows. The multi-class AUC loss calculates the loss on the whole training set. It essentially learns a rank of the training samples without resorting to a classification-based loss explicitly. In contrast, the multi-class hinge loss calculates the optimization objective on each sample respectively and then averages them on the entire dataset. It needs to assign all non-keyword segments to a single class.

3.4.2 Connection to AP-FC loss

The AP-FC loss first arranges the keywords in a predefined order. Then, for each mini-batch, it selects one sample from each keyword, followed by non-keywords. Note that the first samples should be arranged in the predefined order of the keywords.

According to [12], we rewrite the AP-FC loss as:




where is the extracted feature of the -th sample by the neural network, is the learnable class center of the -th keyword, and are learnable parameters with .

The proposed AUC loss and the AP-FC loss are similar in that they do not assign widely distributed non-keyword segments to a single “filler” class. However, the implementation of the AP-FC loss places a strict constraint on the samples in each mini-batch. Moreover, the AP-FC loss-based model still needs an SVM back-end to make the final decision.

3.4.3 Connection to other multi-class AUC loss

The multi-class AUC optimization in [20] is a natural extension of the binary AUC optimization. Gimeno et al. extended the binary AUC optimization to the multi-class problem by the one-versus-one and one-versus-rest frameworks. The one-versus-one multi-class AUC loss is obtained by averaging the pairwise binary AUC losses. The one-versus-rest multi-class AUC loss decomposes the multi-class classification task into one binary task per class. For the $k$-th task, the $k$-th class is viewed as a positive class, and all other classes are merged into a negative class. However, the above two methods cannot be directly used for our open-set optimization problem, since they need to assign non-keyword segments to a “filler” class. In addition, our proposed AUC loss is more computationally efficient than the above two methods.

4 Experimental setup

4.1 Data preparation

In our experiments, two popular keyword spotting datasets, Google Speech Commands version 1 (GSC v1) [16] and version 2 (GSC v2) [17], are used for evaluation. GSC v1 consists of 65K one-second-long recordings of 30 words from thousands of different speakers. GSC v2 is an augmented version of GSC v1, which contains 105K utterances of 35 words. In addition, both datasets contain several minute-long background noise files. The sampling rate of all signals in the two datasets is 16 kHz.

Both GSC v1 and GSC v2 include a “validation_list” file and a “testing_list” file. We use the audio files in the “validation_list” and “testing_list” as validation and testing data, and the other audio files as training data. Following previous works, we apply random time-shift and noise injection to the training data. Specifically, we first shift each sample in time by a random number of milliseconds. We then add background noise to each sample with a probability of 0.8, where the noise is chosen randomly from the background noises. Note that the random time-shift and noise injection are performed on the fly at each training step. Finally, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features are extracted and stacked over the time axis with a window length of 25 ms and a stride of 10 ms.
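The on-the-fly augmentation can be sketched as below. Only the 0.8 noise probability and the 16 kHz sampling rate come from the text; the shift range `max_shift_ms`, the mixing gain `noise_gain`, zero-padding of the shifted waveform, and the function names are assumptions made for this illustration.

```python
import random
from typing import List, Sequence


def time_shift(samples: Sequence[float], shift: int) -> List[float]:
    """Shift a waveform by `shift` samples, zero-padding the vacated end."""
    n = len(samples)
    if shift > 0:   # delay: pad at the front, drop the tail
        return [0.0] * min(shift, n) + list(samples[: max(n - shift, 0)])
    if shift < 0:   # advance: drop the head, pad at the back
        k = min(-shift, n)
        return list(samples[k:]) + [0.0] * k
    return list(samples)


def augment(samples: Sequence[float], noises: Sequence[Sequence[float]],
            rng: random.Random, max_shift_ms: int = 100, sample_rate: int = 16000,
            noise_prob: float = 0.8, noise_gain: float = 0.1) -> List[float]:
    """Random time-shift followed by background-noise injection."""
    max_shift = max_shift_ms * sample_rate // 1000
    out = time_shift(samples, rng.randint(-max_shift, max_shift))
    # only noise recordings at least as long as the clip can be cropped
    candidates = [nz for nz in noises if len(nz) >= len(out)]
    if candidates and rng.random() < noise_prob:
        noise = rng.choice(candidates)
        start = rng.randint(0, len(noise) - len(out))  # random crop of the noise file
        out = [x + noise_gain * noise[start + i] for i, x in enumerate(out)]
    return out


clip = [0.5] * 160
augmented = augment(clip, [[0.0] * 1600], random.Random(0))
```

Calling `augment` inside the data-loading loop, rather than once before training, gives a different shift and noise crop at every training step.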

4.2 Backbone network

We use res15 [3] as the backbone network. As shown in Figure 1, it starts with a bias-free convolution layer (Conv) with weight $\mathbf{W} \in \mathbb{R}^{(m \times r) \times n}$, where $m$ and $r$ are the height and width of the convolution kernel respectively, and $n$ is the number of the output channels. Then, it takes the output of the first convolution layer as the input of a chain of residual blocks (Res), followed by a separate non-residual convolution layer. Finally, the output of the network is obtained by an average-pooling layer (Avg-pool). Additionally, convolution dilation is used to increase the receptive field of the network, and a batch normalization layer (BatchNorm) is added after each convolution layer to help train the deep network. The details of the backbone network are listed in Table 1.

Figure 1: The architecture of the backbone network, with a magnified residual block.
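| Layer | Kernel size | Output channels | Dilation | #Par. | #Mult. |
|---|---|---|---|---|---|
| Conv | 3 × 3 | 45 | 1 × 1 | 405 | 1.52M |
| Res ×6 | 3 × 3 | 45 | | 219K | 824M |
| Conv | 3 × 3 | 45 | 16 × 16 | 18.2K | 68.6M |
| BatchNorm | - | 45 | - | - | 169K |
| Avg-Pool | - | 45 | - | - | 45 |
| Total | - | - | - | 238K | 894M |

Table 1: Parameter setting of res15, along with the number of parameters and multiplies.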
| Loss | Back-end | GSC v1 Total acc | GSC v1 Closed acc | GSC v1 F1 score | GSC v2 Total acc | GSC v2 Closed acc | GSC v2 F1 score |
|---|---|---|---|---|---|---|---|
| Cross entropy [3] | - | 89.96% | 97.14% | 0.8805 | 92.74% | 97.46% | 0.9068 |
| Prototypical [12] | - | 87.89% | 95.88% | 0.8654 | 93.32% | 96.55% | 0.9149 |
| AP-FC [12] | SVM | 91.59% | 96.72% | 0.8962 | 93.77% | 97.11% | 0.9188 |
| Triplet [13] | kNN | 92.09% | 97.28% | 0.9019 | 94.01% | 97.78% | 0.9251 |
| Multi-class AUC_R | - | 92.16% | 97.01% | 0.9031 | 94.87% | 97.39% | 0.9315 |
| Multi-class AUC_F | - | 92.97% | 97.22% | 0.9115 | 94.71% | 97.50% | 0.9312 |

Table 2: Comparison results between the proposed multi-class AUC and four referenced methods. The subscript R indicates the random sampler, and F the fixed proportion sampler.

4.3 Tasks and evaluation metrics

The tasks in previous works [3, 5, 7, 8] focus on discriminating the 11 keywords (“yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, “silence”) and a non-keyword “unknown”, where “silence” denotes silence segments and “unknown” represents all other words. In their settings, all unknown words used in the test set have been seen by the model in the training stage, which is not consistent with real-world KWS applications.

To meet the real-world KWS applications, in our experiments, we consider the task in [12], where ten unknown words (“zero”, “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”) are used for testing only. We evaluate our model by comparing the following metrics with other related works.

  • Total acc is the classification accuracy on the test set that contains unseen unknown words, which is more likely to reflect the performance of the KWS model in the real world. Note that unseen unknown words represent the above ten unknown words that are used for testing only.

  • Closed acc is the classification accuracy on the test set that does not contain unseen unknown words.

  • We also report the F1 score that is extended to multi-class one by “macro” average on the test set that contains unseen unknown words.
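For reference, accuracy and the macro-averaged F1 score listed above can be computed as in the following sketch. Averaging over every class that appears in either the labels or the predictions is a choice made here; the paper does not spell out its exact averaging set, and the toy labels are made up.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of samples whose predicted label equals the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: 0 = non-keyword, 1 and 2 = keywords.
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
acc = accuracy(truth, preds)
f1 = macro_f1(truth, preds)
```

Because the macro average weights each class equally, a model that handles the non-keyword class poorly is penalized even when that class is a minority of the test set.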

In addition, we plot the detection error tradeoff (DET) curve of the non-keyword segments detection subtask to evaluate the KWS models.

4.4 Data sampler

Usually, the training samples in each mini-batch are randomly sampled from the whole training set, so the proportion of keywords to non-keywords varies greatly from one mini-batch to another. We denote this sampling method as the random sampler. However, the variable proportion hinders the convergence of model training with the proposed method. To overcome this problem, we use a fixed proportion sampler, which keeps the proportion of keywords and non-keywords consistent in each mini-batch.
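A fixed proportion sampler can be sketched as an index generator that draws a constant number of keyword and non-keyword samples per mini-batch. The default counts of 32 and 64 follow the training details reported later in the paper; treating label 0 as non-keyword, dropping leftover samples at the end of an epoch, and the function name are assumptions of this sketch.

```python
import random
from typing import Iterator, List, Sequence


def fixed_proportion_batches(labels: Sequence[int], num_keywords: int = 32,
                             num_non_keywords: int = 64,
                             seed: int = 0) -> Iterator[List[int]]:
    """Yield index batches with a constant keyword / non-keyword split."""
    rng = random.Random(seed)
    keyword_idx = [i for i, y in enumerate(labels) if y != 0]
    other_idx = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(keyword_idx)
    rng.shuffle(other_idx)
    # leftover samples that cannot fill a complete batch are dropped
    n_batches = min(len(keyword_idx) // num_keywords,
                    len(other_idx) // num_non_keywords)
    for b in range(n_batches):
        batch = (keyword_idx[b * num_keywords:(b + 1) * num_keywords]
                 + other_idx[b * num_non_keywords:(b + 1) * num_non_keywords])
        rng.shuffle(batch)  # mix keyword and non-keyword positions
        yield batch


toy_labels = [1, 2] * 4 + [0] * 16
toy_batches = list(fixed_proportion_batches(toy_labels, num_keywords=2,
                                            num_non_keywords=4, seed=1))
```

With a constant split, the sizes of the positive and negative score subsets of the AUC loss stay the same from batch to batch, which may be one reason the loss scale is steadier than under random sampling.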

4.5 Training details

Each model in our experiments is trained for 60 epochs using the Adam optimizer [21]. The initial learning rate is set to 0.001 and reduced to 0.0001 after 30 epochs. For the cross entropy loss, we use a mini-batch size of 128 together with weight decay. We use the same hyperparameters as in [12] and [13] for the prototypical loss, AP-FC loss and triplet loss. We use the validation set to select the best model among different epochs and to evaluate the effect of the margin hyperparameter $\delta$ of the AUC loss.

We evaluate the proposed multi-class AUC loss with the fixed proportion sampler and the random sampler. For the fixed proportion sampler, the numbers of keywords and non-keywords in each mini-batch are set to 32 and 64, respectively; for the random sampler, the mini-batch size is set to 128, which is the same as for the other comparison methods. The margin hyperparameter $\delta$ is set to 0.3. Following the same training procedure, we run all comparison methods five times independently and report the average performance.

5 Results

5.1 Evaluation of the proposed methods

Table 2 lists the comparison results between the proposed methods and the four baselines. From the table, we see that both variants of the proposed multi-class AUC loss achieve significant improvement in terms of the Total acc and F1 score, and achieve a competitive result with the best referenced method in terms of the Closed acc. We take the result on GSC v1 as an example. Compared with the cross entropy loss, the multi-class AUC loss with the fixed proportion sampler achieves 30.0% and 25.9% relative improvement in Total acc and F1 score respectively. It also achieves a slightly higher Closed acc than the cross entropy loss. Even when compared with the triplet loss with a complex kNN back-end, the proposed method still achieves a relative improvement of 11.1% in Total acc and 9.8% in F1 score while maintaining a similar Closed acc.
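The relative figures quoted above are reproduced by the GSC v1 entries of Table 2 if "relative improvement" is read as the reduction of the remaining error, that is, (proposed - baseline) / (1 - baseline). This reading is inferred from the numbers rather than stated in the text; the short check below uses a hypothetical helper name.

```python
def relative_error_reduction(baseline: float, proposed: float,
                             best: float = 1.0) -> float:
    """Share of the baseline's remaining error removed by the proposed method."""
    return (proposed - baseline) / (best - baseline)


# GSC v1 values from Table 2 (AUC with the fixed proportion sampler).
total_vs_ce = relative_error_reduction(0.8996, 0.9297)        # vs. cross entropy, Total acc
f1_vs_ce = relative_error_reduction(0.8805, 0.9115)           # vs. cross entropy, F1 score
total_vs_triplet = relative_error_reduction(0.9209, 0.9297)   # vs. triplet + kNN, Total acc
f1_vs_triplet = relative_error_reduction(0.9019, 0.9115)      # vs. triplet + kNN, F1 score

for name, value in [("Total acc vs CE", total_vs_ce), ("F1 vs CE", f1_vs_ce),
                    ("Total acc vs triplet", total_vs_triplet),
                    ("F1 vs triplet", f1_vs_triplet)]:
    print(f"{name}: {100 * value:.1f}%")
```

For example, the Total acc gain over cross entropy is 3.01 absolute points out of a remaining error of 10.04 points.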

(a) Results on GSC v1.
(b) Results on GSC v2.
Figure 2: DET curves of the non-keyword segments detection subtask.
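| Metric | Sampler | AUC 0.1 | AUC 0.2 | AUC 0.25 | AUC 0.3 | AUC 0.35 | AUC 0.4 | AUC 0.5 | Cross entropy |
|---|---|---|---|---|---|---|---|---|---|
| Closed acc | R | 94.52% | 96.29% | 96.53% | 96.85% | 96.72% | 96.49% | 95.54% | 96.20% |
| Closed acc | F | 94.04% | 96.54% | 96.71% | 96.81% | 96.44% | 96.53% | 96.20% | |
| F1 score | R | 0.9426 | 0.9578 | 0.9581 | 0.9615 | 0.9577 | 0.9535 | 0.9429 | 0.9513 |
| F1 score | F | 0.9321 | 0.9599 | 0.9613 | 0.9613 | 0.9553 | 0.9546 | 0.9508 | |

Table 3: Effect of the hyperparameter on performance. The AUC columns are indexed by the value of the hyperparameter; R and F denote the random and fixed proportion samplers.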

To further investigate the effectiveness of the proposed method, we conduct a comparison on GSC v2 using the same settings as those on GSC v1. The experimental results again demonstrate the superiority of our method. In addition, all methods improve substantially in all evaluation metrics when trained on GSC v2, which indicates that its training data is responsible for the improvement and is consistent with the experimental phenomenon in [17]. However, although both variants of the proposed multi-class AUC loss achieve better results on GSC v2 than on GSC v1, the improvement with the random sampler is more evident than that with the fixed proportion sampler. This may be because the training data of GSC v2 contains more non-keywords than the training data of GSC v1.

From Table 2 we also see that the AP-FC loss with an SVM back-end and the triplet loss with a kNN back-end outperform the prototypical loss and the cross entropy loss. This suggests that the metric learning-based methods still require a decision back-end to achieve satisfactory performance. In addition, we plot the DET curves of the non-keyword segments detection subtask in Figure 2. From the figure, we see that these curves are consistent with the results presented in Table 2, and that the two variants of the proposed multi-class AUC loss outperform the referenced methods.

5.2 Effect of the hyperparameter on performance

This subsection investigates the effect of the margin hyperparameter $\delta$ on performance. Because there are no unseen unknown words in the validation set, here we only use the Closed acc and F1 score as the evaluation metrics. For simplicity, we show the experimental results on GSC v1 only. Note that the results on the other evaluation dataset are consistent with those on GSC v1. Table 3 lists the result on GSC v1. From the table, one can see that the parameter $\delta$, which controls the margin of the AUC loss, plays an important role in the performance. Both variants of the multi-class AUC loss outperform the cross entropy baseline in the two evaluation metrics when $0.2 \le \delta \le 0.4$. It is also observed that the results in both evaluation metrics first increase and then decrease as $\delta$ increases, with the best performance achieved at $\delta = 0.3$.

6 Conclusions

In this study, we have proposed a robust and highly accurate KWS method based on a novel multi-class AUC loss function and a confidence based decision method. Our KWS method not only significantly improves the robustness of the model against unseen sounds by optimizing the proposed multi-class AUC loss, but also eliminates the complex back-end processing module by using the simple confidence based decision method. To our knowledge, this is the first time that the low resource keyword spotting task has been formulated as an open-set recognition problem. We compared the proposed method with four representative methods on the two publicly available datasets GSC v1 and GSC v2. Experimental results show that the proposed method significantly outperforms the four representative methods in most evaluations, with smaller model sizes and lower computational complexity.


References

  • [1] Guoguo Chen, Carolina Parada, and Georg Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
  • [2] Sercan Ö Arık, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” Proc. Interspeech 2017, pp. 1606–1610, 2017.
  • [3] Raphael Tang and Jimmy Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5484–5488.
  • [4] Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” Proc. Interspeech 2018, pp. 2037–2041, 2018.
  • [5] Seungwoo Choi, Seokjun Seo, Beomjun Shin, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim, and Sungjoo Ha, “Temporal convolution for real-time keyword spotting on mobile devices,” Proc. Interspeech 2019, pp. 3372–3376, 2019.
  • [6] Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Zhengkun Tian, Chenghao Zhao, and Cunhang Fan, “A time delay neural network with shared weight self-attention for small-footprint keyword spotting.,” in INTERSPEECH, 2019, pp. 2190–2194.
  • [7] Menglong Xu and Xiao-Lei Zhang, “Depthwise separable convolutional resnet with squeeze-and-excitation blocks for small-footprint keyword spotting,” Proc. Interspeech 2020, pp. 2547–2551, 2020.
  • [8] Chen Yang, Xue Wen, and Liming Song, “Multi-scale convolution for robust keyword spotting,” Proc. Interspeech 2020, pp. 2577–2581, 2020.
  • [9] Niccolo Sacchi, Alexandre Nanchen, Martin Jaggi, and Milos Cernak, “Open-vocabulary keyword spotting with audio and text embeddings,” in INTERSPEECH 2019-IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, number CONF.
  • [10] Yougen Yuan, Zhiqiang Lv, Shen Huang, and Lei Xie, “Verifying deep keyword spotting detection with acoustic word embeddings,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 613–620.
  • [11] Peng Zhang and Xueliang Zhang, “Deep template matching for small-footprint and configurable keyword spotting,” Proc. Interspeech 2020, pp. 2572–2576, 2020.
  • [12] Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, and Joon Son Chung, “Metric learning for keyword spotting,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 133–140.
  • [13] Roman Vygon and Nikolay Mikhaylovskiy, “Learning efficient representations for keyword spotting with triplet loss,” arXiv preprint arXiv:2101.04792, 2021.
  • [14] Abhijit Bendale and Terrance E Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
  • [15] Terrance DeVries and Graham W Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
  • [16] Pete Warden, “Speech commands: A public dataset for single-word speech recognition,” Dataset available from http://download.tensorflow.org/data/speech_commands_v0, vol. 1, 2017.
  • [17] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
  • [18] Zi-Chen Fan, Zhongxin Bai, Xiao-Lei Zhang, Susanto Rahardja, and Jingdong Chen, “AUC optimization for deep learning based voice activity detection,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6760–6764.
  • [19] Zhongxin Bai, Xiao-Lei Zhang, and Jingdong Chen, “Partial auc optimization based deep speaker embeddings with class-center learning for text-independent speaker verification,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6819–6823.
  • [20] Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida, “Generalising auc optimisation to multiclass classification for audio segmentation with limited training data,” IEEE Signal Processing Letters, 2021.
  • [21] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.