Small-footprint keyword spotting (KWS), also known as wake-up word detection, is the task of detecting occurrences of a pre-defined keyword in continuous speech. With the rapid development of mobile devices, smart speakers, and other applications that require a hands-free conversational interface, this technology is attracting increasing attention. Unlike traditional keyword spotting, real-life wake-up word detection is constrained by hardware and must have a small memory footprint and low computational cost. At the same time, it must be highly accurate and robust in complex environments such as noisy and far-field conditions.
Traditional approaches [chen2014small, sun2017compressed] to this task involve Hidden Markov Models (HMMs), which are used to construct the keyword model and the filler/background model. The background model is trained with non-keyword speech as well as background noise and silence. The acoustic modeling schemes for speech units include the Gaussian Mixture Model (GMM), Deep Neural Network (DNN), and Time-Delay Neural Network (TDNN) [sun2017compressed]. After training, Viterbi search is applied to find the optimal path in the decoding graph, and the system triggers whenever the likelihood ratio of the keyword model to the filler model exceeds a pre-defined threshold.
In recent years, many researchers have focused on DNN based keyword spotting systems, which achieve better performance than traditional methods [sainath2015convolutional, arik2017convolutional, sun2016max, liu2019loss, wang2017small, shan2018attention, wang2019adversarial, tang2018deep, coucke2019efficient]. In these approaches, a DNN model is trained to predict words instead of phonemes. Smoothed posterior probabilities are then calculated from the output of the DNN model and used to compute the confidence score. DNN based methods are lightweight and have low latency, which makes them suitable for real-life applications. As for modeling, many structures based on the Convolutional Neural Network (CNN) [sainath2015convolutional], Recurrent Neural Network (RNN), Convolutional Recurrent Neural Network (CRNN) [arik2017convolutional], Long Short-Term Memory (LSTM) [sun2016max], and attention mechanisms [shan2018attention, wang2019adversarial] have been explored. Furthermore, [tang2018deep] adopts the residual network structure to classify speech command words, and [coucke2019efficient] introduces a dilated convolutional structure to model the whole keyword sequence, which also shows good performance.
However, in many real-life applications such as smart speakers, the performance of the KWS system often degrades under low Signal-to-Noise Ratio (SNR) and far-field conditions. Room reverberation and various kinds of noise in this scenario pose great challenges to the DNN model, which is trained mainly on close-talking data because resources for collecting real far-field data are limited or unavailable. A traditional way to tackle this problem is to train DNN models on pooled speech data either collected or simulated from different environments. During training, samples of the same class from different domains are treated identically.
Inspired by the within-sample variability-invariant loss [cai2020within] and the parallel data training mechanisms [qian2016investigation, peddinti2016far, li2018developing] successfully applied to speaker verification and automatic speech recognition in complex environments, in this paper we propose a multi-task learning [caruana_multitask_1997] training scheme with alignment loss for KWS. Specifically, three types of alignment loss, namely the MSE loss, the cosine loss, and the CORAL (Correlation Alignment) loss [sun2016deep], are implemented to reduce the mismatch between close-talking and far-field conditions in a multi-domain joint learning setup. The MSE loss and the cosine distance loss are commonly used to minimize the distortion between features. The CORAL loss is proposed in [sun2016return, sun2016deep] and is computed on the embedding features of a neural network.
The rest of this paper is organized as follows. Section 2 describes the framework of the CNN based KWS system, and Section 3 introduces our proposed multi-task learning method with alignment loss. Section 4 discusses the experimental results, and Section 5 concludes our work.
2 CNN based KWS system
Our baseline is built on the CNN based KWS system proposed in [sainath2015convolutional]. The pipeline has three main components: feature extraction, network prediction, and confidence computation, as illustrated in Figure 1. In the feature extraction step, we extract 40-dimensional log-Mel filterbank energies (Fbank) from 25 ms frames with a shift of 10 ms. To capture context, we stack a window of 40 frames to form each training sample fed to the model.
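The framing and context stacking described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 16 kHz sample rate, the function names `frame_count` and `stack_context`, and the choice to slide the 40-frame window one frame at a time are all assumptions, and random numbers stand in for real Fbank features.

```python
import numpy as np

def frame_count(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Number of full analysis frames for a 25 ms window and a 10 ms shift.

    The 16 kHz sample rate is an assumption, not stated in the text.
    """
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift

def stack_context(fbank, context=40):
    """Slice a (T, 40) Fbank matrix into overlapping (context, 40) windows."""
    num_frames, num_bins = fbank.shape
    if num_frames < context:
        return np.empty((0, context, num_bins), dtype=fbank.dtype)
    return np.stack([fbank[t:t + context]
                     for t in range(num_frames - context + 1)])

# One second of audio gives 98 frames, hence 59 windows of 40 frames each.
num_frames = frame_count(16000)
fbank = np.random.default_rng(0).standard_normal((num_frames, 40))
windows = stack_context(fbank)
print(num_frames, windows.shape)
```

Each window is a 40 x 40 time-frequency patch, which is the input shape the convolutional layers would operate on under these assumptions.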
The structure of our convolutional network contains three convolutional layers, each of which is followed by a max-pooling layer. The convolutional kernels have the size of
with the stride of, and the pooling size is set to be
. Two fully-connected layers followed by a final softmax layer are then used to predict the target keywords. The rectified linear unit (ReLU) is used as the activation function in the hidden layers. The total number of parameters of this network is around 90k, which is relatively low.
After training, the model maps the sequence of acoustic features to a sequence of posterior probabilities. In the confidence computation module, we adopt the method proposed in [liu2019loss, P2015Automatic] to make decisions. In this approach, we define a sliding window of $T_s$ frames over which scores are computed, and denote the input acoustic features in a window as $x = \{x_1, x_2, \ldots, x_{T_s}\}$. $\omega = \{\omega_1, \omega_2, \ldots, \omega_M\}$ represents the word sequence of the pre-defined wake-up word. We smooth the output probabilities over a length of $L$ frames by averaging:

$$ s_{\omega_i}(x_t) = \frac{1}{L} \sum_{j=t-L+1}^{t} p_{\omega_i}(x_j) $$

where $s_{\omega_i}(x_t)$ represents the smoothed probability at time $t$ of word $\omega_i$ and $p_{\omega_i}(x_j)$ refers to the network output of frame $x_j$ at word $\omega_i$. After smoothing, we compute the confidence score as follows:

$$ h(x) = \left[ \max_{1 \le t_1 \le \cdots \le t_M \le T_s} \prod_{i=1}^{M} s_{\omega_i}(x_{t_i}) \right]^{1/M} $$

where $h(x)$ refers to the output confidence score. Compared to previous methods [chen2014small], it has the advantage of considering the order in which the words fire, and at the same time its time complexity is $O(M T_s)$, which is suitable for real-time applications. The system triggers whenever the confidence score exceeds the pre-defined threshold.
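The smoothing and order-aware scoring described above can be sketched with a small dynamic program. This is a sketch under stated assumptions rather than the implementation of [liu2019loss, P2015Automatic]: the function names are hypothetical, the score is taken as the geometric mean of the best in-order product of smoothed posteriors, consecutive words are allowed to fire on the same frame, and the first frames are averaged over fewer than `smooth_len` frames.

```python
import numpy as np

def smooth_posteriors(post, smooth_len):
    """Average each word posterior over the current and preceding frames.

    post: (T, M) network outputs for the M keyword words.
    Early frames use a shorter history (an assumption about edge handling).
    """
    smoothed = np.empty(post.shape, dtype=float)
    for t in range(post.shape[0]):
        start = max(0, t - smooth_len + 1)
        smoothed[t] = post[start:t + 1].mean(axis=0)
    return smoothed

def confidence(smoothed):
    """Geometric mean of the best in-order product of word scores.

    best[i] holds the largest product for words 0..i over the frames seen
    so far, so one pass over the window costs O(M * T).
    """
    num_frames, num_words = smoothed.shape
    best = np.zeros(num_words)
    for t in range(num_frames):
        best[0] = max(best[0], smoothed[t, 0])
        for i in range(1, num_words):
            # word i fires at frame t, after (or together with) word i - 1
            best[i] = max(best[i], best[i - 1] * smoothed[t, i])
    return best[-1] ** (1.0 / num_words)

in_order = np.array([[0.9, 0.1], [0.2, 0.8]])   # word 0 fires, then word 1
reversed_order = in_order[:, ::-1]              # word 1 fires before word 0
print(confidence(in_order), confidence(reversed_order))
```

The second call scores lower because the two words peak in the wrong order, which is the property the text attributes to this scoring scheme.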
3 Multi-task learning with alignment loss
The influence of far-field and noisy conditions in speech processing is commonly noticed in many areas like speech recognition and speaker verification. The mismatch of inner-class feature distributions on different domains contributes to the degradation of prediction performance. Focusing on this scenario, we apply alignment loss to constrain the embedding feature distortions from different domains in the manner of multi-task learning. In particular, we employ the CORAL loss, MSE loss, and cosine loss as alignment loss. In our case, we define the penultimate layer of the neural network as our feature layer for alignment loss computation.
3.1 CORAL loss
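CORAL was proposed to align the second-order statistics of the source and target distributions. [sun2016deep] extends this work to DNN approaches by constructing a differentiable loss function, which can be used to minimize the distance between the embedding-layer outputs of different domains. Denote the features from the source and target domains as $D_S = \{x_i\}$ and $D_T = \{u_i\}$, where $x_i$ and $u_i$ are the feature vectors generated by the feature layer for the respective domain. $D_S^{ij}$ ($D_T^{ij}$) refers to the $j$-th dimension of the $i$-th source (target) feature. We denote the dimension of the feature layer as $d$, and the covariance matrices of the source and target features as $C_S$ and $C_T$, respectively. The CORAL loss can then be defined as

$$ \mathcal{L}_{CORAL} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2 $$

where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm. The covariance matrices of the source and target features [sun2016deep] are

$$ C_S = \frac{1}{n_S - 1} \left( D_S^\top D_S - \frac{1}{n_S} (\mathbf{1}^\top D_S)^\top (\mathbf{1}^\top D_S) \right) $$

$$ C_T = \frac{1}{n_T - 1} \left( D_T^\top D_T - \frac{1}{n_T} (\mathbf{1}^\top D_T)^\top (\mathbf{1}^\top D_T) \right) $$

where $n_S$ and $n_T$ denote the numbers of training samples of the source and target domains, and $\mathbf{1}$ refers to a column vector whose elements are all 1.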
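A minimal NumPy sketch of the CORAL loss in the form given by [sun2016deep] is shown below: the squared Frobenius norm of the difference between the two covariance matrices, scaled by 1/(4d^2). The function names are illustrative, and a real training setup would use an autograd framework so that the loss is differentiable; NumPy is used here only to make the arithmetic explicit.

```python
import numpy as np

def covariance(feats):
    """Covariance of a (n, d) feature batch, written with the all-ones vector."""
    n = feats.shape[0]
    col_sum = np.ones((1, n)) @ feats            # 1^T D, shape (1, d)
    return (feats.T @ feats - (col_sum.T @ col_sum) / n) / (n - 1)

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = source.shape[1]
    diff = covariance(source) - covariance(target)
    return np.sum(diff ** 2) / (4.0 * d * d)

rng = np.random.default_rng(0)
src = rng.standard_normal((32, 8))               # e.g. close-talking embeddings
tgt = 2.0 * rng.standard_normal((32, 8))         # e.g. far-field embeddings
print(coral_loss(src, tgt))
```

Because only second-order statistics are compared, shifting every feature in one domain by a constant leaves the loss unchanged, and the two batches need not contain the same number of samples.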
3.2 MSE loss
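The MSE loss is a regression loss commonly used in regression prediction. In this paper, it is used to constrain the distortion between features. The loss is calculated by averaging the squared $\ell_2$-norm of the difference between inner-class features from the two domains as

$$ \mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - u_i \right\|_2^2 $$

where $x_i$ and $u_i$ denote the source-domain and target-domain feature vectors of the $i$-th sample and $N$ refers to the number of samples in a batch.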
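As a small worked example, the MSE alignment loss can be written in a few lines of NumPy. The sketch assumes that row i of the two batches holds the embeddings of matching source and target samples; the function name is hypothetical.

```python
import numpy as np

def mse_alignment_loss(source, target):
    """Batch mean of the squared L2 distance between paired feature vectors."""
    return np.mean(np.sum((source - target) ** 2, axis=1))

source = np.array([[0.0, 0.0], [1.0, 1.0]])
target = np.array([[3.0, 4.0], [1.0, 1.0]])
# squared distances are 25 and 0, so the batch mean is 12.5
print(mse_alignment_loss(source, target))
```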
3.3 Cosine distance loss
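Cosine distance is often used to measure the similarity between two vectors through the angle between them. In our work, it is used to measure the similarity between the embedding features of data from different domains. The cosine loss function is defined as

$$ \mathcal{L}_{\cos} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{x_i^\top u_i}{\|x_i\|_2 \, \|u_i\|_2} \right) $$

where $x_i$ and $u_i$ denote the source-domain and target-domain feature vectors of the $i$-th sample and $N$ is the number of samples in a batch.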
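One common way to turn cosine similarity into a loss is to average one minus the similarity over paired features. The sketch below takes that form as an assumption, since the exact expression is not recoverable from the text; the function name and the `eps` guard against zero-length vectors are also illustrative.

```python
import numpy as np

def cosine_alignment_loss(source, target, eps=1e-8):
    """Batch mean of (1 - cosine similarity) between paired feature vectors."""
    dot = np.sum(source * target, axis=1)
    norms = np.linalg.norm(source, axis=1) * np.linalg.norm(target, axis=1)
    return np.mean(1.0 - dot / np.maximum(norms, eps))

a = np.array([[1.0, 0.0], [0.0, 2.0]])
print(cosine_alignment_loss(a, 3.0 * a))               # same direction
print(cosine_alignment_loss(a, np.array([[0.0, 1.0], [5.0, 0.0]])))  # orthogonal
```

Unlike the MSE form, this loss ignores the magnitude of the embeddings and penalizes only the angle between them.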
In our work, we compute the alignment loss on the penultimate layer of the CNN. We train the model in a multi-task learning manner, and the joint loss is defined as

$$ \mathcal{L} = \mathcal{L}_{CE}^{S} + \mathcal{L}_{CE}^{T} + \lambda \, \mathcal{L}_{align} $$

where $\lambda$ is a hyper-parameter representing the weight of the alignment loss $\mathcal{L}_{align}$. The cross-entropy losses $\mathcal{L}_{CE}^{S}$ and $\mathcal{L}_{CE}^{T}$ are calculated with the logits of data from the source and target domains, respectively. By minimizing the joint loss, the inner-class embedding feature variability between the close-talking and far-field domains is reduced. The whole framework is illustrated in Figure 2.
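The joint objective can be sketched as the sum of the two cross-entropy terms plus a weighted alignment term. The code below is an illustration under assumptions, not the authors' training code: the function names, the use of one shared label vector for paired source and target samples, and the pluggable `align_fn` argument are all hypothetical choices.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for (N, C) logits and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(src_logits, tgt_logits, labels, src_feats, tgt_feats,
               align_fn, weight):
    """Cross-entropy on both domains plus a weighted alignment loss."""
    ce = cross_entropy(src_logits, labels) + cross_entropy(tgt_logits, labels)
    return ce + weight * align_fn(src_feats, tgt_feats)

def mse_align(src, tgt):
    """Example alignment term: mean squared L2 distance of paired features."""
    return np.mean(np.sum((src - tgt) ** 2, axis=1))

logits = np.zeros((2, 4))                        # uniform predictions
labels = np.array([0, 3])
feats_s = np.zeros((2, 3))
feats_t = np.ones((2, 3))                        # alignment term equals 3
print(joint_loss(logits, logits, labels, feats_s, feats_t, mse_align, 0.5))
```

Setting `weight` to 0 recovers plain pooled training on both domains, which makes the alignment weight easy to sweep.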
4 Experiments and results
Our proposed method is evaluated on a wake-up word dataset whose keyword consists of four Chinese characters, “ni hao, mi ya” (“Hello, Mia” in English) [qin2019himia]. The dataset was first proposed for a far-field text-dependent speaker verification challenge. It includes speech recorded by iPhone, Android, microphones, and microphone arrays at distances of 0.25m, 1m, and 3m. In our work, we use a subset of the dataset to build models, namely the audio recorded with an iPhone at distances of 0.25m, 1m, and 3m. We randomly select 222 speakers for the training set and 41 speakers for the test set. In our experiments, the 0.25m condition is regarded as close-talking (source domain), and the 1m and 3m conditions are regarded as far-field (target domain). See Table 1 for the Signal-to-Noise Ratio of the different domains and Table 2 for more details of the dataset statistics.
4.2 Experiment setup
We determine target word labels by forced alignment with an LVCSR system trained on the AISHELL-2 dataset [du2018aishell]. For the keyword “ni hao, mi ya”, we find the ending time of “ni”, “hao”, and “mi”, and take the previous 20 frames and the next 20 frames around each ending time to construct a window of 40 frames. Log Fbank features are used as the input acoustic features. The baseline system is trained with cross-entropy loss, and our proposed model is jointly trained with cross-entropy and alignment loss. Stochastic gradient descent with Nesterov momentum is used as the optimizer. The learning rate is initialized to 0.01 and decreases by a factor of 0.1 when the training loss plateaus. We train the CNN model for 100 epochs with a batch size of 128. During evaluation, we employ a sliding window of 100 frames to compute the confidence score.
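The construction of a training sample around a word ending can be sketched as follows. The function name is hypothetical, and two details are assumptions not fixed by the text: the frame at the ending index is counted among the "next 20" frames, and windows that would run past either end of the utterance are skipped.

```python
import numpy as np

def keyword_window(fbank, end_frame, left=20, right=20):
    """Return the (left + right, D) window around a word-ending frame.

    Covers frames [end_frame - left, end_frame + right); returns None when
    the window does not fit inside the utterance.
    """
    start, stop = end_frame - left, end_frame + right
    if start < 0 or stop > fbank.shape[0]:
        return None
    return fbank[start:stop]

fbank = np.arange(100 * 40, dtype=float).reshape(100, 40)  # stand-in features
sample = keyword_window(fbank, end_frame=50)
print(sample.shape)
print(keyword_window(fbank, end_frame=10))                 # too close to the start
```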
In our experiments, for multi-task learning, we set the weight of the alignment loss to 0.2, 0.4, 0.6, 0.8, and 1.0 in turn. To evaluate the effectiveness of the proposed scheme, we run two parallel sets of experiments, one on the 1m data and one on the 3m data. Performance is measured by the false reject rate (FRR) at one false alarm (FA) per hour and by a curve relating the FRR to the FA per hour. For the baseline system, we train separate models on the 0.25m, 1m, and 3m data. In addition, we pool data from both close-talking and far-field conditions for training.
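The operating point "FRR at one FA per hour" can be computed from per-utterance confidence scores roughly as follows. This is a sketch under assumptions: the function name is hypothetical, each keyword utterance contributes one score, each entry of `neg_scores` is one candidate false-alarm event in `neg_hours` hours of non-keyword audio, and the system fires only when a score strictly exceeds the threshold.

```python
import numpy as np

def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, fa_per_hour=1.0):
    """False reject rate at the threshold that allows the given FA rate."""
    allowed = int(np.floor(fa_per_hour * neg_hours))
    neg_sorted = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    if allowed < len(neg_sorted):
        # only the `allowed` highest-scoring negatives exceed this value
        threshold = neg_sorted[allowed]
    else:
        threshold = -np.inf
    rejected = np.sum(np.asarray(pos_scores, dtype=float) <= threshold)
    return rejected / len(pos_scores), threshold

pos = [0.95, 0.6, 0.4, 0.8]          # keyword utterances
neg = [0.9, 0.7, 0.5, 0.3]           # events in 2 hours of non-keyword audio
frr, threshold = frr_at_fa_per_hour(pos, neg, neg_hours=2.0)
print(frr, threshold)
```

Sweeping `fa_per_hour` over a range of values and plotting the resulting FRR gives the kind of curve used to compare systems.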
The performance of the baseline system is shown in Table 3. We make the following observations. First, as the recording distance increases, the distortion becomes more severe and the performance of the baseline system degrades. Second, when the training and test sets come from the same domain, the system performs better than under domain mismatch. Third, pooling training data from the close-talking domain and the target domain improves performance on the target-domain test set, while performance on the close-talking condition is maintained.
|Training data||FRR (%) on 0.25m test set||FRR (%) on 1m test set||FRR (%) on 3m test set|
|Mix of 0.25m and 1m||0.91||1.38||-|
|Mix of 0.25m and 3m||1.54||-||5.60|
|Mix of all distances||1.41||1.64||6.33|
|Test set||Ali. loss|
The results of our proposed systems are shown in Table 4 and Table 5. The tables show that applying multi-task learning with alignment loss improves performance on the far-field test sets. Among the systems trained with 0.25m and 1m data, FRR decreases from 0.91% to 0.71% on the 0.25m test set and from 1.38% to 0.94% on the 1m test set using the CORAL loss (). Among the systems trained with 0.25m and 3m data, FRR decreases from 5.60% to 4.23% on the 3m test set using the CORAL loss () and from 1.54% to 0.94% on the 0.25m test set using the cosine loss ().
Comparing the systems trained with 1m and 3m target data, we observe that in the 3m scenario the systems show a larger absolute improvement in the target domain, whereas on the close-talking test set the models trained with 0.25m and 1m data show a better improvement.
Comparing the different alignment losses, we find that the systems with the CORAL loss achieve relatively better results than the other systems, and the results of the MSE systems are close to those of the CORAL systems. The cosine systems are less stable and more sensitive to the weight of the alignment loss.
Figure 3 shows the performance of the baseline system and our proposed systems on the 0.25m and 1m test set. In the figures, the weight of the CORAL system is 0.8, the MSE system is , and the cosine system is 0.6. From Figure 3(a), we find that the CORAL system is slightly better than other systems on the 0.25m test set. Figure 3(b) reveals that on the target 1m test set, our proposed systems outperform the baseline system. The CORAL system and the MSE system are slightly better than the cosine system.
In this paper, we focus on the task of small-footprint keyword spotting in far-field environments. Far-field environments are commonly encountered in real-life speech applications, and they cause severe performance degradation due to room reverberation and various kinds of noise. To cope with these distortions, we employ the CORAL loss, MSE loss, and cosine loss as alignment losses in a multi-task learning manner, which helps reduce the mismatch between the outputs of a DNN layer for inputs from different data domains. Experimental results show that our method maintains performance on the close-talking test set and achieves significant improvement in far-field conditions. We also find that the quality of the far-field training data can influence system performance.
This research was funded by Kunshan Government Research (KGR) Funding in AY 2019/2020.