Keyword spotting (KWS) aims to detect predefined keywords in a stream of user utterances. It is a core technology of virtual assistants, enabling people to interact with smart devices hands-free. Specifically, devices with embedded KWS systems can be woken up by a wake-up word detection module and commanded by a speech command word detection module.
With the advances in deep learning, deep model-based approaches have played increasingly important roles in both academia and industry. Backed by powerful computing resources, deep models learn good representations from large amounts of training data and achieve high performance in a variety of speech tasks. In KWS, basic deep models such as Deep Neural Network (DNN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN) based models have been proposed which outperform traditional HMM-based models [6, 7, 8]. There are also works that modify the above models. The Convolutional Recurrent Neural Network (CRNN) combines the strengths of both convolutional and recurrent layers [9]. Compression methods such as low-rank weight matrices and knowledge distillation have also been applied to DNNs [10]. The deep residual network (ResNet), which performs remarkably well in computer vision, has been applied to KWS as well [11]. Motivated by the widely used attention mechanism, an attention-based end-to-end model has proved useful in KWS [12].
Most existing KWS systems specify which keywords users should use, and one necessary job is to accumulate as many positive and negative training examples as possible. However, this becomes difficult if we want to define a new series of keywords that seldom appear in a large vocabulary. For each defined keyword, much training data is needed, which is not realistic in practical applications, especially when the keywords come from a dialect or even from a non-standard language.
In this paper, we present meta learning for few-shot learning to address the problem of keyword spotting with limited training data. Assuming that a relatively large set of keywords and a small set of keywords not found in the large set are given, we apply Model-Agnostic Meta-Learning (MAML) [13] to learn a good parameter initialization of the base KWS model from the large dataset. Then we adapt the initialized model to the new task quickly, within just a few steps, using the small dataset. To evaluate the model, we add two baselines for comparison. One is a standard feature transfer learning method, which pre-trains a model on the large dataset and fine-tunes it on the small dataset. The other is a fundamental supervised learning method, which learns from the small dataset only. Experiments on the Google Speech Commands dataset show that our proposed approach outperforms both baselines.
In summary, there are two contributions of this paper:
First, we propose a meta-learning based few-shot keyword spotting approach. To the best of our knowledge, this is the first work to perform few-shot learning via meta-learning in a speech task.
Second, we incorporate the negative class as external knowledge and find that it strengthens the ability of meta learning to do keyword spotting.
The rest of the paper is organized as follows. In Section 2 we present the background of MAML. In Section 3 we introduce the few-shot KWS problem. In Section 4 we describe the details of our experiments. The results and discussions are given in Section 5. Section 6 concludes the paper.
2 Background of MAML
2.1 An Introduction to MAML
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few datapoints and training iterations [13]. Different from multi-task learning, which learns a shared parameter space between various tasks through their joint loss [14], MAML tries to learn a starting point of the model across different tasks so that it can adapt efficiently to any new task. When datapoints from a new task arrive, the model can be adapted to good parameters within only a few steps.
Formally, to solve the N-way-K-shot classification task, which contains N classes with only K training examples per class, MAML simulates such tasks during the meta-training process. As Figure 1 shows, meta-training is performed across tasks consisting of labeled examples from the large dataset, while meta-testing is performed on a new task consisting of few labeled examples from the small dataset. Both contain a ‘training data’ part and a ‘test set’ part, which are composed of the same tasks but different examples. In meta-training, the ‘training data’ part trains a base learner and the ‘test set’ part trains a meta learner. In this way the base model is initialized. In meta-testing, the ‘training data’ part is used for fine-tuning the model and the ‘test set’ part for testing the final adapted model.
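To make the task construction concrete, the following sketch (hypothetical Python, not the paper's implementation; `examples_by_class` is an assumed mapping from each keyword to its list of utterance features) builds one N-way-K-shot task with disjoint ‘training data’ and ‘test set’ parts:

```python
import random

def sample_task(examples_by_class, n_way=5, k_shot=5, n_query=5, seed=None):
    """Sample one N-way-K-shot task: a 'training data' (support) part and a
    'test set' (query) part drawn from the same N classes, disjoint examples."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):  # labels are task-local: 0..N-1
        picks = rng.sample(examples_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query
```

Because class labels are assigned per task, the model must rely on the sampled examples rather than memorized class identities, which is what meta-training exploits.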
The base model for a single task $\mathcal{T}_i$ is a parametrized function $f_\theta$ with parameters $\theta$. To update the parameters to adapt to task $\mathcal{T}_i$, a one-step gradient descent is:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$$

where $\theta_i'$ denotes the updated parameters, $\mathcal{L}_{\mathcal{T}_i}$ is the loss function and $\alpha$ is the learning rate. The model parameters are trained by optimizing the performance of $f_{\theta_i'}$ on the new examples from the ‘test set’ part across tasks $\mathcal{T}_i$ sampled from the task distribution $p(\mathcal{T})$. The meta-objective is:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) = \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\big)$$

MAML aims to optimize the parameters $\theta$ so that the model can learn to adapt to new tasks with parameters $\theta_i'$ through only a few gradient steps. The ‘test set’ part of meta-training serves as unseen examples from new tasks in order to learn the meta learner. As the meta-objective across tasks is optimized using Stochastic Gradient Descent (SGD), the parameters $\theta$ are updated as follows:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$$

where $\beta$ is the meta step size.
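As an illustration only, the inner and outer updates can be sketched with a toy linear model and the first-order approximation of MAML (which drops the second-derivative term of the full meta-gradient); the paper's actual base model is a CNN:

```python
import numpy as np

def loss_and_grad(theta, X, y):
    """Squared loss of a toy linear model f_theta(x) = x . theta, and its gradient."""
    err = X @ theta - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def maml_step(theta, tasks, alpha=0.05, beta=0.01):
    """One first-order MAML meta-update over a batch of tasks.

    Each task is ((X_tr, y_tr), (X_te, y_te)): the 'training data' part drives
    the inner adaptation theta_i' = theta - alpha * grad, and the 'test set'
    part supplies the gradient for the outer update with meta step size beta."""
    meta_grad = np.zeros_like(theta)
    for (X_tr, y_tr), (X_te, y_te) in tasks:
        _, g_inner = loss_and_grad(theta, X_tr, y_tr)
        theta_i = theta - alpha * g_inner              # inner-loop adaptation
        _, g_outer = loss_and_grad(theta_i, X_te, y_te)
        meta_grad += g_outer                           # first-order approximation
    return theta - beta * meta_grad / len(tasks)       # outer (meta) update
```

The full MAML update instead differentiates through the inner step, which requires second derivatives; the first-order variant above is a common simplification.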
2.2 MAML with External Knowledge
The originally proposed MAML is a general framework for learning and adapting without knowing any external information about the new tasks. However, if we already have some knowledge about the new tasks, we can incorporate it into the meta-training process. For example, in a few-shot classification task, if we know in advance that a specific class must appear in the new task, we can insert this class into each meta-training task we build, as Figure 1 (b) shows. Moreover, if we know exactly which position the class will occupy in the meta-testing process, we can fix its position in the meta-training tasks, as Figure 1 (c) shows. In our work, MAML with external knowledge performs better than MAML without it, i.e. (b) or (c) outperforms (a). And the setting with a fixed position for the specific class performs better still, i.e. (c) outperforms (b).
3 Keyword Spotting
3.1 Problem Definition
In this paper, we assume that we want to define N keywords for the KWS system, but for each keyword we only have K labeled examples. What we can use for training besides those examples is a relatively large dataset which does not include our defined keywords. Under such conditions, we need an approach that achieves a good model adaptation from the large dataset to the small one.
3.2 Few-shot KWS with a Negative Class
Because a KWS system should take into account distinguishing keywords from non-keywords, a negative class containing a large collection of other words is usually added during the training process.
Our proposed approach makes use of this negative class as external knowledge, as introduced in Section 2.2, to redesign the MAML strategy in two ways. In the first method, in addition to the randomly selected classes for each meta-training task, we add one more negative class each time; for each task, the positions of all classes are shuffled. In the second method, building on the first, we fix the position of the negative class in each task while the positions of the other classes are still shuffled. In this way, we expect the output layer to learn more information about the negative class.
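Assuming a task sampler along the lines of Section 2.1 (hypothetical names; `negative_class` is the merged non-keyword class), the two methods differ only in whether the negative class keeps a fixed task-local label, as this sketch shows:

```python
import random

def sample_task_with_negative(examples_by_class, negative_class,
                              n_way=5, k_shot=5, fix_position=True, seed=None):
    """Build a meta-training task that always contains the negative class.

    With fix_position=True the negative class always gets task-local label 0
    (second method); otherwise its label is shuffled along with the keywords'
    (first method)."""
    rng = random.Random(seed)
    keywords = [c for c in sorted(examples_by_class) if c != negative_class]
    classes = [negative_class] + rng.sample(keywords, n_way - 1)
    if not fix_position:
        rng.shuffle(classes)
    support = [(x, label)
               for label, cls in enumerate(classes)
               for x in rng.sample(examples_by_class[cls], k_shot)]
    return support, classes
```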
3.3 Baselines and Evaluation Metrics
Two baseline models are used for comparison with our proposed approaches. The fundamental supervised learning baseline uses only the few labeled keyword examples for training, while feature transfer learning first pre-trains a model on the large labeled dataset and then fine-tunes it on the small one.
There are several important evaluation metrics for this problem in practical KWS systems – Accuracy, False Alarm Rate (FAR) and False Rejection Rate (FRR). In this paper, we use them to compare our models with the baselines.
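For reference, FAR and FRR can be computed from per-utterance detection scores as in this minimal sketch (an assumed setup with binary labels, 1 for keyword and 0 for non-keyword):

```python
def far_frr(scores, labels, threshold):
    """False Alarm Rate (negatives accepted) and False Rejection Rate
    (keywords rejected) at a given decision threshold."""
    accepts = [s >= threshold for s in scores]
    n_neg = sum(1 for l in labels if l == 0)
    n_pos = len(labels) - n_neg
    false_alarms = sum(1 for a, l in zip(accepts, labels) if a and l == 0)
    false_rejects = sum(1 for a, l in zip(accepts, labels) if not a and l == 1)
    return false_alarms / n_neg, false_rejects / n_pos
```

Accuracy is threshold-free over the classifier's argmax, while FAR and FRR trade off against each other as the threshold moves.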
4 Experimental Setup
4.1 Dataset
We conduct our experiments on a public dataset – the Google Speech Commands dataset [15]. It consists of 65K one-second-long audio clips of 30 keywords, collected from thousands of people. Each clip contains only one keyword.
For the basic experimental setting of few-shot KWS without a negative class, we divide the 30 keywords into three independent sets for training, validation and testing with a ratio of 4:1:1. The training examples are all labeled, while in the validation and test sets only 5 examples per keyword are labeled.
For the experimental setting of few-shot KWS with a negative class, we first merge 10 keywords into the negative class, following the setting of [16]. Of the remaining 20 keywords, 16, together with the negative class, are used for training, and 4, together with the negative class, are used for testing, among which only 5 examples of each keyword are labeled.
4.2 Model Setting
The feature extraction and base model selection follow [16]. We use 10 MFCC features, with a frame length of 40 ms and a frame step of 20 ms. A CNN is adopted as the base model, containing 2 convolutional layers. One has [channels, kernel height, kernel length, height stride, length stride] of [64, 10, 4, 1, 1] and the other of [76, 10, 4, 2, 1]. These are followed by a low-rank linear layer of size 58 and a fully connected layer of size 128. The output layer is a softmax function.
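To make the architecture concrete, the feature-map sizes can be traced with a hypothetical calculation, assuming valid convolutions with no padding and 49 input frames for a 1-second clip at the stated framing:

```python
def conv_out(size, kernel, stride):
    """Output length of a valid (no-padding) convolution along one dimension."""
    return (size - kernel) // stride + 1

# 1-second clip, 40 ms frames with a 20 ms step -> about 49 frames x 10 MFCCs
t, f = 49, 10
t, f = conv_out(t, 10, 1), conv_out(f, 4, 1)  # conv1: 64 ch, 10x4 kernel, 1x1 stride
t, f = conv_out(t, 10, 2), conv_out(f, 4, 1)  # conv2: 76 ch, 10x4 kernel, 2x1 stride
print(t, f)  # prints: 16 4
```

Under these assumptions the second convolution outputs a 16x4 map with 76 channels, which the low-rank linear layer then projects down.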
For the experimental setting without the negative class, we conduct three sets of experiments – MAML, feature transfer learning, and fundamental supervised learning. For each set of experiments, the value of K is set to 1, 5, 10, 15, 20, 30, 50 and 100, so that we can make a comprehensive comparison among the three approaches.
For the experimental setting with the negative class, we design three experiments in addition to the two baseline experiments. The first selects five keywords, excluding the negative class, for each task during the meta-training process. In the second, the negative class is the specific class that appears in every task, and the positions of all classes are shuffled. In the third, the position of the negative class is additionally fixed in each task. All three experiments, as well as the two baselines, perform 5-way-5-shot classification tasks. They are all evaluated on the test set, which contains five keywords including the negative class. Among these, the negative class contains 5000 clips while each of the other four keywords contains 500 clips. Each testing operation is repeated 100 times to weaken the influence of randomness.
5 Results and Discussion
The results of few-shot KWS without the negative class are shown in Figure 2, which plots the trend of accuracy as K changes for the three approaches. We find that MAML outperforms both baselines and that feature transfer learning performs better than the fundamental supervised learning method. When K is small, the performance improvement is quite obvious; as K increases, the gaps between the models narrow. From the viewpoint of feature learning, we believe that incorporating the large dataset helps feature learning on the small labeled dataset, and that MAML learns internal features more suitable for new tasks.
To analyze the results of few-shot KWS with the negative class, we first compare the accuracy and FAR among the five experiments. According to Table 1, MAML performs much better in both accuracy and FAR than the two baselines, and with external knowledge of the negative class the performance improves further. The reason is that, with the negative examples in meta-training, the output layer learns a better decision boundary between the keywords and the negative class.
We draw FAR-FRR curves by adjusting the threshold on the outputs. Figure 3 illustrates that MAML with external knowledge performs better than MAML without it. Moreover, meta-training with a fixed position for the negative class performs better than meta-training with a shuffled position.
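Such curves can be produced by sweeping the threshold over the score range, as in this sketch (a hypothetical helper; per-utterance scores and binary labels are assumed, with 1 for keyword and 0 for non-keyword):

```python
import numpy as np

def far_frr_curve(scores, labels, n_points=101):
    """Sweep the decision threshold and return a list of (FAR, FRR) pairs
    for plotting the FAR-FRR trade-off curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    thresholds = np.linspace(scores.min(), scores.max(), n_points)
    curve = []
    for t in thresholds:
        accept = scores >= t
        far = float(np.mean(accept[labels == 0]))    # negatives wrongly accepted
        frr = float(np.mean(~accept[labels == 1]))   # keywords wrongly rejected
        curve.append((far, frr))
    return curve
```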
We conduct two more experiments using the imbalanced characteristic of the test samples as external knowledge, with results shown in Figure 4. We set the K value of the negative class in meta-training to 50, which is 10 times that of any other class.
However, the results show that this knowledge does not help. According to Figure 4, the imbalanced setting with shuffled positions performs much worse than the other two settings. Comparing the FRR at the same low FAR, the imbalanced setting has a very large FRR. The reason may be that, for each task, the imbalanced setting causes the learning process of the base model to be dominated by the negative class, so the FRR increases. In fact, the balanced setting works similarly to the undersampling method used to address the imbalance problem. If we look at the FRR at a larger FAR, the imbalanced setting with a fixed position for the negative class performs best, representing a trade-off between FAR and FRR.
6 Conclusions and Future work
In this paper, we propose a meta-learning based framework for few-shot keyword spotting. We first compare MAML with feature transfer learning and fundamental supervised learning on the Google Speech Commands dataset and find that MAML outperforms both baselines. Furthermore, to bring the setup closer to the practical KWS problem, we redesign the original dataset to split off one specific negative class. Experiments show that incorporating the negative class into the meta-training process as external knowledge improves performance. In the future we would like to explore more kinds of external knowledge for meta-training, such as imbalanced training data. We also seek a flexible method adaptable to any N-way-K-shot KWS problem. Furthermore, we will consider applying the approach to speaker recognition tasks.
References
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[4] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.
[5] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4845–4849.
[6] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in ICASSP, vol. 14. Citeseer, 2014, pp. 4087–4091.
[7] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5236–5240.
[9] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.
[10] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” in INTERSPEECH, 2016, pp. 1878–1882.
[11] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5484–5488.
[12] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” arXiv preprint arXiv:1803.10916, 2018.
[13] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[14] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification,” in CVPR, vol. 1, no. 2, 2017, p. 6.
[15] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[16] Y. Zhang, N. Suda, L. Lai, and V. Chandra, “Hello edge: Keyword spotting on microcontrollers,” arXiv preprint arXiv:1711.07128, 2017.