A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting

06/20/2019 ∙ by Yue Gu, et al. ∙ 0

Robustness against noise is critical for keyword spotting (KWS) in real-world environments. To improve robustness, a speech enhancement front-end is employed. Instead of treating speech enhancement as a separate preprocessing step before the KWS system, in this study a pre-trained speech enhancement front-end and a convolutional neural network (CNN) based KWS system are concatenated, with a feature transformation block that converts the output of the enhancement front-end into the input of the KWS system. The whole model is trained jointly, so that linguistic and other useful information from the KWS system can be back-propagated to the enhancement front-end to improve its performance. To fit small-footprint devices, a novel convolutional recurrent network is proposed, which needs fewer parameters and less computation without degrading performance. Furthermore, changing the input features from the power spectrogram to the Mel-spectrogram yields less computation and better performance. Our experimental results demonstrate that the proposed method significantly improves the noise robustness of the KWS system.




1 Introduction

Keyword spotting (KWS), also called keyword detection (KWD) or spoken term detection (STD), is a crucial technique for human-computer interaction. A typical scenario is wake-up word detection on mobile devices, which detects predefined wake-up words in a continuous audio stream. A good KWS system should have both a low false rejection rate and a low false alarm rate. Moreover, KWS usually runs in “always-on” mode, which requires low power consumption, especially in small-footprint embedded systems.

Currently, KWS systems perform well in relatively quiet environments. For example, recent work [1] achieved an accuracy of 95% on Google’s Speech Commands Dataset [2] with a small model. In noisy environments, however, KWS remains a challenge. In recent years, a widely used method for increasing robustness against noise has been multi-condition training [3, 4, 5], which trains the model directly on noisy utterances. However, to achieve good performance, multi-condition training usually needs a large model, which cannot be deployed on devices with limited resources [6].

Recently, with the rise of deep learning, speech enhancement techniques have made significant progress [7]. In the automatic speech recognition (ASR) community, front-end enhancement techniques have been introduced to improve robust ASR systems: an enhancement front-end enhances the noisy speech before recognition, and the recognizer is then trained either on clean speech [8] or on enhanced speech [9]. Following ASR, front-end enhancement techniques have also been introduced in KWS. In [6], a text-dependent enhancement and KWS method was developed and shown to improve noise robustness. However, its enhancement model is based on bidirectional long short-term memory (BiLSTM), which needs too many parameters and too much computation to fit small-footprint devices.

In this paper, we propose a small-footprint enhancement method for resource-limited KWS. Compared with BiLSTM-based models, the proposed model achieves comparable or even better performance with far fewer parameters and much less computation. Since speech enhancement and keyword spotting are not independent tasks, they can benefit each other. We therefore concatenate them into a larger and deeper model and optimize them jointly to further improve noise robustness.

Experimental results demonstrate that the proposed joint-training method significantly outperforms not only the multi-conditional training method but also the enhancement front-end methods, whether their KWS recognizer is trained on clean speech or on enhanced speech. Our experiments also show that, for the KWS task, the Mel-spectrogram is a better feature than the power spectrogram, leading to better performance and lower computational complexity. We further find that, with the Mel-spectrogram, the KWS system is less sensitive to the number of phonetic symbols in the keywords.

Figure 1: Schematic diagram of the proposed system. Solid and dotted arrows indicate the directions of forward pass and backward pass, respectively. See text for more details.

2 System description

The overall framework of our system is shown in Fig. 1. The proposed system has three components: the speech enhancement model, the feature transformation block, and the keyword spotting (KWS) model. The speech enhancement model is trained to predict the ideal ratio masks (IRMs) [10]. The enhanced spectrogram is obtained by point-wise multiplication of the noisy spectrogram with the predicted masks. The enhanced spectrogram is then transformed into Mel-frequency cepstral coefficients (MFCCs) by the feature transformation block. Given the MFCCs, the KWS model is trained to predict the posterior probability of the keywords. The details of these three components are given below.

2.1 Speech enhancement model

We employ a masking-based speech enhancement method, which has successfully improved perceptual speech quality [11] and the noise robustness of ASR [12]. The loss function of the masking-based method is defined as:

$$\mathcal{L} = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F} \left[ M(t,f) - \hat{M}(t,f) \right]^2 \qquad (1)$$

where $M(t,f)$ and $\hat{M}(t,f)$ are the ideal and predicted time-frequency (T-F) masks at time $t$ and frequency $f$, respectively, and $T$ and $F$ are the total numbers of frames and frequency bins, respectively. The IRM is defined as:


$$M(t,f) = \frac{S(t,f)}{S(t,f) + N(t,f)} \qquad (2)$$

where $S(t,f)$ represents the spectrogram of the clean speech and $N(t,f)$ stands for the spectrogram of the noise signal.

In the test stage, the IRM is predicted from the noisy speech, and the enhanced spectrogram is obtained by:

$$\hat{S} = \hat{M} \odot Y \qquad (3)$$

where $\hat{M}$ is the mask predicted by the enhancement model, $Y$ is the spectrogram of the noisy signal, and $\odot$ represents element-wise matrix multiplication.
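The masking steps described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the ratio form of the mask, the additive mixing of clean and noise energies, the small `eps` stabilizer, the array shapes, and all function names are assumptions made for the example.

```python
import numpy as np

def ideal_ratio_mask(clean, noise, eps=1e-8):
    """Ratio mask on non-negative spectrograms of shape (T, F).

    Assumption: the mask is clean / (clean + noise); eps avoids division by zero.
    """
    return clean / (clean + noise + eps)

def mask_mse(ideal_mask, predicted_mask):
    """Mean squared error averaged over all T x F time-frequency units."""
    return float(np.mean((ideal_mask - predicted_mask) ** 2))

def apply_mask(predicted_mask, noisy):
    """Element-wise product of the predicted mask and the noisy spectrogram."""
    return predicted_mask * noisy

rng = np.random.default_rng(0)
T, F = 98, 40                       # hypothetical frame and frequency-bin counts
clean = rng.random((T, F)) + 0.1    # stand-in for a clean spectrogram
noise = rng.random((T, F)) + 0.1    # stand-in for a noise spectrogram
noisy = clean + noise               # assumption: energies add in the mixture

irm = ideal_ratio_mask(clean, noise)
# Stand-in for a network output: the ideal mask plus a small perturbation.
predicted = np.clip(irm + 0.05 * rng.standard_normal((T, F)), 0.0, 1.0)

loss = mask_mse(irm, predicted)
enhanced = apply_mask(predicted, noisy)
print(f"mask MSE: {loss:.5f}, enhanced shape: {enhanced.shape}")
```

Under these assumptions, applying the ideal mask itself to the noisy spectrogram recovers the clean spectrogram, which is why the mask is a natural regression target.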

The IRM can be defined in different T-F domains. Although the power spectrogram is a common choice in the speech enhancement community, there are better choices for KWS. In the proposed system, the output of the enhancement model is fed into the KWS model, which requires MFCCs as input. When MFCCs are extracted, the frequency bins of the power spectrogram are integrated by the Mel-filter bank, so much of the information contained in the power spectrogram is filtered out. Performing enhancement on the power spectrogram is therefore neither efficient nor necessary. In contrast, MFCCs can be extracted directly from the Mel-spectrogram, so we use the proposed enhancement model to predict the IRM on the Mel-spectrogram. In this case, the spectrograms of the noise $N$, clean speech $S$, and noisy speech $Y$ are all Mel-spectrograms.

To serve the small-footprint purpose, we design a novel convolutional recurrent network (CRN) with limited parameters and computation. The architecture of the CRN is shown as the speech enhancement model in the lower part of Fig. 1. The CRN has two components: a convolutional encoder-decoder, and an RNN with LSTM cells followed by a linear projection layer. Skip connections are added between the corresponding layers of the encoder and decoder. Batch normalization [13] and rectified linear units (ReLUs) [14] are employed in the convolutional layers, and leaky ReLUs (lReLUs) are used in the de-convolutional layers instead of ReLUs. A sigmoid nonlinearity is employed for the output layer. Note that there are two differences between the CRN in [15] and ours. First, to limit parameters and computation, the convolution layers in our CRNs stride on both the time and frequency axes, while the original CRN strides only on the frequency axis. Second, we employ the lReLU at the decoding stage, which guarantees nonzero gradients everywhere and benefits the optimization of the encoding stage.
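The gradient argument for using the lReLU in the decoder can be illustrated with a small NumPy sketch. The negative slope of 0.01 is an assumption chosen for the example (the text does not state the value used), and the function names are hypothetical.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, a small slope otherwise.

    Assumption: negative_slope=0.01 is illustrative, not taken from the paper.
    """
    return np.where(x > 0, 1.0, negative_slope)

pre_activations = np.array([-2.0, -0.5, 0.5, 2.0])
relu_gradients = relu_grad(pre_activations)
leaky_gradients = leaky_relu_grad(pre_activations)

print("ReLU gradients:      ", relu_gradients)
print("leaky ReLU gradients:", leaky_gradients)
```

With ReLU, units with negative pre-activations pass no gradient back toward the encoder, whereas the leaky variant always passes a scaled gradient.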

2.2 Feature transformation block

The input of the KWS system is MFCCs, while the outputs of the enhancement model are spectrograms. To extract MFCCs from the spectrograms, we design the feature transformation blocks (FTBs) shown in Fig. 2. Transforming a Mel-spectrogram into MFCCs requires taking the logarithm and then applying the discrete cosine transform (DCT). For comparison, an enhancement model trained to predict the IRM on the power spectrogram is also built. For this model, the power spectrogram must be transformed into MFCCs: the input first passes through a Mel-filter bank, then the logarithm is taken, and finally a DCT is applied. Note that both the Mel-filter bank filtering and the DCT can be implemented as matrix multiplications, which can in turn be represented as linear layers in a neural network [16]. As a result, the proposed systems, including the FTBs, can be trained with the back-propagation algorithm.
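The two transformation paths described above can be sketched with plain matrix products in NumPy. This is an illustrative sketch under stated assumptions: 40 Mel bands, the common `2595 * log10(1 + f / 700)` Mel formula, triangular filters, an orthonormal DCT-II, and an `eps` inside the logarithm are all choices made for the example, and every function name is hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_mels=40, n_fft=480, sample_rate=16000,
                    f_low=20.0, f_high=4000.0):
    """Triangular Mel filters as a (n_fft // 2 + 1, n_mels) weight matrix."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_mels + 2))
    bank = np.zeros((n_bins, n_mels))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[:, m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_in, n_out):
    """Orthonormal DCT-II basis as a (n_in, n_out) weight matrix."""
    n = np.arange(n_in)[:, None]
    k = np.arange(n_out)[None, :]
    basis = np.cos(np.pi * (n + 0.5) * k / n_in) * np.sqrt(2.0 / n_in)
    basis[:, 0] *= np.sqrt(0.5)
    return basis

def mel_to_mfcc(mel_spec, dct, eps=1e-6):
    """Path (a): logarithm, then DCT as a matrix product."""
    return np.log(mel_spec + eps) @ dct

def power_to_mfcc(power_spec, bank, dct, eps=1e-6):
    """Path (b): Mel filtering as a matrix product, then logarithm, then DCT."""
    return np.log(power_spec @ bank + eps) @ dct

rng = np.random.default_rng(0)
power_spec = rng.random((98, 241))   # hypothetical (frames, STFT bins) input
bank = mel_filter_bank()
dct = dct_matrix(40, 40)

mfcc_from_power = power_to_mfcc(power_spec, bank, dct)
mfcc_from_mel = mel_to_mfcc(power_spec @ bank, dct)
print(mfcc_from_power.shape, mfcc_from_mel.shape)
```

Because both weight matrices are fixed and the logarithm is differentiable, gradients from the KWS loss can flow through either path back to the enhancement model.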

Figure 2: The feature transformation block for (a) Mel-spectrogram to MFCC and (b) power spectrogram to MFCC.

2.3 Keyword spotting system

We employ the model cnn-trad-pool2 developed in [17] as our KWS system. This model diverges slightly from the model cnn-trad-fpool3, which was originally introduced in [5]. In cnn-trad-pool2, the size and stride of the first max-pooling layer are set to $2 \times 2$ and the hidden linear layers are dropped, which leads to better accuracy.

3 Experiments and results

3.1 Experimental settings

We evaluated the proposed models on Google’s Speech Commands Dataset [2], which contains 105,829 one-second-long utterances and 6 background noise recordings (including pink noise, white noise, and daily environmental sounds such as doing the dishes and an exercise bike). Following Google’s implementation, the task is to detect 10 keywords plus the unknown and silence classes. In our experiments, the baseline cnn-trad-pool2 follows exactly the same procedure as the TensorFlow reference. The dataset is split into training, validation, and test sets with a ratio of 8:1:1. Noisy utterances are obtained by mixing the clean utterances with the 6 noises at multiple signal-to-noise ratios (SNRs). There are roughly 812k noisy examples for training and 97.6k each for validation and test. Another 25 keywords, which are not involved in the training phase, are also employed to evaluate the models. In total, the test set contains 210k noisy utterances with a keyword to non-keyword ratio of 1.3:1. To evaluate the generalization of the models, the 100 Nonspeech Sounds corpus (http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html), which is unseen at the training stage, is employed. The resulting unmatched test set contains nearly 3.6M utterances. All utterances are sampled at 16 kHz, and features are extracted with a window length of 30 ms and a shift of 10 ms. A 480-point short-time Fourier transform is employed. The Mel-filter bank spans 20 Hz to 4 kHz. 40 DCT coefficients are used as the MFCC.
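Mixing a clean utterance with noise at a target SNR, as used to build the noisy sets above, can be sketched as follows. This is a minimal sketch under assumptions: the sine wave and Gaussian noise are stand-ins for real recordings, the 5 dB target is only an example value, and `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`.

    Returns the mixture and the scale factor applied to the noise.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise, scale

sample_rate = 16000
rng = np.random.default_rng(0)
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2.0 * np.pi * 440.0 * t)    # stand-in for a one-second utterance
noise = rng.standard_normal(sample_rate)    # stand-in for a noise recording

noisy, scale = mix_at_snr(speech, noise, snr_db=5.0)

# Verify the SNR actually achieved by the scaled noise.
achieved_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((scale * noise) ** 2))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

In practice the noise recording would typically be cropped or looped to the utterance length before mixing; that step is omitted here.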

Accuracy is the main metric, measured simply as the fraction of classification decisions that are correct. We also plot receiver operating characteristic (ROC) curves, where the $x$ and $y$ axes show the false alarm rate (FAR) and false reject rate (FRR), respectively. Methods with a smaller area under the curve (AUC) are better. Equal error rates (EERs) are also used to show the KWS performance with the enhancement models.
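The FAR/FRR curve, its AUC, and the EER can be computed from per-utterance keyword scores as sketched below. The scores and labels are made-up toy values, the threshold sweep and trapezoidal integration are one reasonable choice rather than the evaluation code used for the paper, and all function names are hypothetical.

```python
import numpy as np

def far_frr_curve(scores, labels):
    """Sweep a decision threshold; labels are 1 for keyword, 0 for non-keyword."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    positives = labels == 1
    negatives = labels == 0
    far, frr = [], []
    for threshold in thresholds:
        accepted = scores >= threshold
        far.append(np.mean(accepted[negatives]))    # non-keywords accepted
        frr.append(np.mean(~accepted[positives]))   # keywords rejected
    return np.array(far), np.array(frr)

def area_under_curve(far, frr):
    """Trapezoidal area under the FRR-versus-FAR curve (smaller is better)."""
    x, y = far[::-1], frr[::-1]   # reverse so FAR is ascending
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def equal_error_rate(far, frr):
    """Operating point where FAR and FRR are closest to each other."""
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)

# Toy data: four keyword utterances and four non-keyword utterances.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

far, frr = far_frr_curve(scores, labels)
auc = area_under_curve(far, frr)
eer = equal_error_rate(far, frr)
print(f"AUC: {auc:.4f}, EER: {eer:.4f}")
```

A perfect detector would give an AUC and EER of zero on this kind of curve, which is why lower values indicate better KWS performance.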

All models are trained with the Adam optimizer [18] and a mini-batch size of 256 utterances. We set the learning rate to 0.0001. The mean squared error (MSE) and cross entropy (CE) are the objective functions of the enhancement model and the KWS system, respectively. The best models are selected by the best accuracy on the validation set.

We evaluate the proposed small-footprint CRNs on the power spectrogram and the Mel-spectrogram. For each, we design two models of different sizes. We refer to the full-size models trained on the power and Mel-spectrogram as PowCRN32 and MelCRN32, respectively, and to the narrow models as PowCRN16 and MelCRN16, respectively. The details of the CRNs are shown in Tab. 1. For comparison, an LSTM-based model is also evaluated, which consists of two hidden layers with 384 bidirectional LSTM cells followed by a linear projection layer with 241 units. We refer to this enhancement model as BiLSTM [6]. Tab. 2 lists the number of parameters and the computational complexity, measured by the number of multiply operations, for each model.

Table 1: The architectures of the small-footprint CRNs (columns: layer name, input size, parameter, output size). The layer sizes depend on the number of time frames in the spectrogram, and a width setting distinguishes the narrow and full-size CRNs. For convolution and deconvolution layers, the parameter indicates kernel size, stride, and filter number; a separate symbol gives the number of bidirectional LSTM cells.

Model Name Parameters Multiplies
cnn-trad-pool2 493.7K 95.87M
BiLSTM 5661.0K 432.7M
PowCRN32 724.0K 280.1M
PowCRN16 182.3K 73.0M
MelCRN32 881.3K 115.1M
MelCRN16 221.5K 29.2M
Table 2: The number of parameters and multiplies used for the KWS system and different enhancement models.
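The relative savings implied by Tab. 2 can be worked out with a few lines of Python. The numbers are copied from the table; the dictionary layout and the `reduction_factor` helper are hypothetical names introduced for the example.

```python
# Parameters (in thousands) and multiplies (in millions), copied from Tab. 2.
enhancement_models = {
    "BiLSTM":   {"params_k": 5661.0, "multiplies_m": 432.7},
    "PowCRN32": {"params_k": 724.0,  "multiplies_m": 280.1},
    "PowCRN16": {"params_k": 182.3,  "multiplies_m": 73.0},
    "MelCRN32": {"params_k": 881.3,  "multiplies_m": 115.1},
    "MelCRN16": {"params_k": 221.5,  "multiplies_m": 29.2},
}

def reduction_factor(baseline, candidate):
    """How many times smaller `candidate` is than `baseline`."""
    return baseline / candidate

baseline = enhancement_models["BiLSTM"]
reductions = {}
for name, cost in enhancement_models.items():
    if name == "BiLSTM":
        continue
    reductions[name] = (
        reduction_factor(baseline["params_k"], cost["params_k"]),
        reduction_factor(baseline["multiplies_m"], cost["multiplies_m"]),
    )
    print(f"{name}: {reductions[name][0]:.1f}x fewer parameters, "
          f"{reductions[name][1]:.1f}x fewer multiplies than BiLSTM")
```

For instance, MelCRN16 comes out at roughly 25.6 times fewer parameters and 14.8 times fewer multiplies than the BiLSTM front-end.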

Besides the baseline cnn-trad-pool2, which uses the multi-conditional training technique, we apply three training strategies to the enhancement front-end based models. In all three, the enhancement model is first pre-trained with the MSE loss in Equation (1) and then concatenated to the KWS model through the feature transformation block. In the first strategy, the KWS model is trained alone on the MFCCs of noisy utterances; we refer to this as KWS+{enhancement model}. In the second, the KWS model is trained alone on the MFCCs of the enhanced spectrogram; we refer to this as retrain+{enhancement model}. In the third, the KWS model and the enhancement model are trained together on the noisy spectrogram; we refer to this as joint+{enhancement model}.

Model Test accuracy(%) AUC (%) EER (%)
cnn-trad-pool2 80.89 1.99 7.28
KWS+BiLSTM 87.64 1.30 6.66
retrain+BiLSTM 90.18 1.17 5.92
joint+BiLSTM 91.64 1.01 6.15
KWS+PowCRN32 86.42 1.52 6.67
retrain+PowCRN32 87.69 1.53 6.63
joint+PowCRN32 91.07 1.20 6.27
KWS+PowCRN16 86.20 1.61 6.73
retrain+PowCRN16 87.01 1.67 6.88
joint+PowCRN16 90.68 1.22 6.50
KWS+MelCRN32 87.59 1.59 6.97
retrain+MelCRN32 89.17 1.35 6.10
joint+MelCRN32 93.17 1.19 6.20
KWS+MelCRN16 86.87 1.64 7.00
retrain+MelCRN16 88.20 1.42 6.49
joint+MelCRN16 92.56 1.28 6.39
Table 3: The test accuracy, AUC, and EER of each model under the matched noise condition.

Model Accuracy(%)
cnn-trad-pool2 68.81
joint+BiLSTM 73.74
joint+PowCRN32 75.19
joint+PowCRN16 72.49
joint+MelCRN32 78.12
joint+MelCRN16 75.67
Table 4: The test accuracy of joint-training models under the unmatched noise condition.

Figure 3: ROCs from the perspective of (a) different enhancement models, (b) training strategy, and (c) feature domain; (d) AUC reduction against phonetic symbol length.

3.2 Results

The experimental results are given in Tab. 3 and Fig. 3.

Model comparison: From Tab. 3 and Fig. 3 (a), we can see that all of the compared models outperform the baseline. The BiLSTM-based model performs well, but it has the largest parameter count and computation cost (see Tab. 2), which does not serve the small-footprint purpose. The proposed CRNs have far fewer parameters and need less computation, yet achieve performance comparable to the BiLSTM-based model. The parameters and required multiplies are further reduced in the narrow models (PowCRN16, MelCRN16), which still obtain performance comparable to the BiLSTM-based model.

Training strategy: From Tab. 3 and Fig. 3 (b), we can see that all of the enhancement front-end based models outperform the multi-conditionally trained baseline. Specifically, the KWS model retrained on the enhanced spectrogram is better than the KWS model trained on noisy utterances, and the joint-trained KWS model is better than the retrained one. This is because the mismatch between the enhancement model and the KWS model decreases in the order of the model trained with clean utterances, the retrained model, and the joint-trained model. For the small-footprint enhancement models in particular (PowCRNs and MelCRNs), the joint-training strategy significantly improves the performance.
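The size of the joint-training gain can be read off Tab. 3 with a short calculation. The accuracies below are copied from the table; the dictionary layout and variable names are hypothetical.

```python
# Test accuracy (%) under the matched noise condition, copied from Tab. 3.
accuracy = {
    "BiLSTM":   {"KWS": 87.64, "retrain": 90.18, "joint": 91.64},
    "PowCRN32": {"KWS": 86.42, "retrain": 87.69, "joint": 91.07},
    "PowCRN16": {"KWS": 86.20, "retrain": 87.01, "joint": 90.68},
    "MelCRN32": {"KWS": 87.59, "retrain": 89.17, "joint": 93.17},
    "MelCRN16": {"KWS": 86.87, "retrain": 88.20, "joint": 92.56},
}

# Absolute accuracy gain (percentage points) of joint training over the
# other two strategies, for each enhancement model.
joint_gain_over_kws = {
    name: round(acc["joint"] - acc["KWS"], 2) for name, acc in accuracy.items()
}
joint_gain_over_retrain = {
    name: round(acc["joint"] - acc["retrain"], 2) for name, acc in accuracy.items()
}

for name in accuracy:
    print(f"{name}: +{joint_gain_over_kws[name]:.2f} over KWS+, "
          f"+{joint_gain_over_retrain[name]:.2f} over retrain+")
```

On these numbers the gain of joint training over retraining is 1.46 points for BiLSTM and between 3.38 and 4.36 points for the CRNs, consistent with the observation that the small-footprint models benefit most.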

Mel vs. power spectrogram: From Tab. 3 and Fig. 3 (c), we find that the CRNs trained on the Mel-spectrogram perform better than those trained on the power spectrogram, with a similar number of parameters. Since the dimension of the Mel-spectrogram is much lower than that of the power spectrogram, the multiplies of the enhancement models are significantly reduced. We think the Mel-spectrogram is more suitable for the KWS system. Because the input of a KWS system is, most of the time, silence, background noise, or non-speech, false alarms on such input must be minimized. Under a low-FAR constraint, we find that MelCRN32 achieves a lower FRR than PowCRN32 and BiLSTM. This advantage is retained by the narrow model.

Sensitivity to phonetic symbol length: Since the keywords contain different numbers of phonetic symbols, we examine whether the enhancement models are sensitive to that number. We split the keywords into two sets: those with 2 phonetic symbols and those with more. Fig. 3 (d) shows the AUC reductions for keywords with different numbers of phonetic symbols, where a smaller reduction is better. From the figure, we can see that the Mel-spectrogram based method is less sensitive to the number of phonetic symbols in the keywords than the models based on the power spectrogram.

Noise generalization: Tab. 4 shows the results of the joint-training models under the unmatched noise condition, which contains 100 unseen noises. From the table, we can see that the proposed full-size CRNs generalize better to new noise conditions than the BiLSTM, and that the CRNs in the Mel-spectrogram domain achieve higher accuracy than those in the power spectrogram domain.

4 Conclusions

In this paper, we proposed a small-footprint speech enhancement technique for robust KWS, which integrates a front-end enhancement model and a back-end KWS model. The proposed CRNs achieve better performance under both matched and unmatched noise conditions, and need fewer parameters and less computation than the conventional BiLSTM-based model. We find that the Mel-spectrogram is better than the power spectrogram because it achieves comparable performance with less computation and a similar or smaller model size. In addition, the Mel-spectrogram based method is less sensitive to the phonetic symbol length of the keywords.


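References

  • [1] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5484–5488.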
  • [2] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
  • [3] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” Proc. Interspeech 2018, pp. 2037–2041, 2018.
  • [4] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 4704–4708.
  • [5] T. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of Interspeech, 2015.
  • [6] M. Yu, X. Ji, Y. Gao, L. Chen, J. Chen, J. Zheng, D. Su, and D. Yu, “Text-dependent speech enhancement for small-footprint robust keyword detection,” Proc. Interspeech 2018, pp. 2613–2617, 2018.
  • [7] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  • [8] J. Du, Q. Wang, T. Gao, Y. Xu, L.-R. Dai, and C.-H. Lee, “Robust speech recognition with speech enhanced deep neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [9] M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in 2013 IEEE international conference on acoustics, speech and signal processing.   IEEE, 2013, pp. 7398–7402.
  • [10] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 7092–7096.
  • [11] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
  • [12] Z.-Q. Wang and D. Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796–806, 2016.
  • [13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [14] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010, pp. 807–814.
  • [15] K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement,” in Proceedings of Interspeech, 2018, pp. 3229–3233.
  • [16] T. N. Sainath, B. Kingsbury, A.-r. Mohamed, and B. Ramabhadran, “Learning filter banks within a deep neural network framework,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.   IEEE, 2013, pp. 297–302.
  • [17] R. Tang and J. Lin, “Honk: A pytorch reimplementation of convolutional neural networks for keyword spotting,” arXiv preprint arXiv:1710.06554, 2017.
  • [18] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.