Small-Footprint Keyword Spotting on Raw Audio Data with Sinc-Convolutions

11/05/2019 · by Simon Mittermaier et al.

Keyword Spotting (KWS) enables speech-based user interaction on smart devices. Always-on and battery-powered application scenarios for smart devices put constraints on hardware resources and power consumption, while also demanding high accuracy as well as real-time capability. Previous architectures first extracted acoustic features and then applied a neural network to classify keyword probabilities, optimizing towards memory footprint and execution time. Compared to previous publications, we take additional steps to reduce power and memory consumption without reducing classification accuracy. Power-consuming audio preprocessing and data transfer steps are eliminated by classifying directly from raw audio. For this, our end-to-end architecture extracts spectral features using parametrized Sinc-convolutions. Its memory footprint is further reduced by grouping depthwise separable convolutions. Our network achieves a competitive accuracy of 96.4% with only 62k parameters.




1 Introduction

Speech processing enables natural communication with smartphones or smart home assistants, e.g., Amazon Echo, Google Home. However, continuously performing speech recognition is not energy-efficient and would drain the batteries of smart devices. Instead, most speech recognition systems passively listen for utterances of certain wake words such as “Ok Google”, “Hey Siri”, “Alexa”, etc. to trigger the continuous speech recognition system on demand. This task is referred to as keyword spotting (KWS). There are also uses of KWS where a few simple speech commands (e.g., “on”, “off”) are enough to interact with a device such as a voice-controlled light bulb.

Conventional hybrid approaches to KWS first divide the audio signal into time frames to extract features, e.g., Mel Frequency Cepstral Coefficients (MFCC). A neural network then estimates phoneme or state posteriors of the keyword Hidden Markov Model in order to calculate the keyword probability using a Viterbi search. In recent years, end-to-end architectures have gained traction that directly classify keyword posterior probabilities based on the previously extracted features, e.g., [1, 22, 16, 3, 6].

Typical application scenarios imply that the device is powered by a battery and possesses restricted hardware resources to reduce costs. Therefore, previous works optimized towards memory footprint and operations per second. In contrast to this, we tune our neural network towards energy conservation in microcontrollers, motivated by observations on power consumption, as detailed in Sec. 3.1. To extract meaningful and representative features from raw audio, our architecture uses parametrized Sinc-convolutions (SincConv) from SincNet proposed by Ravanelli et al. [12]. We use Depthwise Separable Convolutions (DSConv) [10, 4] that preserve time-context information while at the same time comparing features across different channels. To further reduce the number of network parameters, which is key for energy efficiency, we group DSConv layers, a technique we refer to as Grouped DSConv (GDSConv).

Our key contributions are:

  • We propose a neural network architecture tuned towards energy efficiency in microcontrollers grounded on the observation that memory access is costly, while computation is cheap [15].

  • Our keyword-spotting network classifies on raw audio employing SincConvs while at the same time reducing the number of parameters using (G)DSConvs.

  • Our base model with 122k parameters performs with the state-of-the-art accuracy of 96.6% on the test set of Google’s Speech Commands dataset, on par with TC-ResNet [3] that has 305k parameters and requires separate preprocessing. Our low-parameter model achieves 96.4% with only 62k parameters.

2 Related Work

Recently, CNNs have been successfully applied to KWS [22, 16, 3]. Zhang et al. evaluated different neural network architectures (such as CNNs, LSTMs, GRUs) in terms of accuracy, computational operations and memory footprint, as well as their deployment on embedded hardware [22]. They achieved their best results using a CNN with DSConvs. Tang et al. explored the use of deep residual networks with dilated convolutions to achieve a high accuracy of 95.8% [16], while keeping the number of parameters comparable to [22]. Choi et al. build on this work, as they also use a ResNet-inspired architecture; instead of using 2D convolution over a time-frequency representation of the data, they convolve along the time dimension and treat the frequency dimension as channels [3].

This bears similarities with our approach, as we use 1D convolution along the time dimension as well. However, all the approaches mentioned classify from MFCCs or similarly preprocessed features; our architecture works directly on raw audio signals. There is a recent trend towards using CNNs directly on raw audio data [12, 13, 19, 20]. Ravanelli and Bengio present an effective method of processing raw audio with CNNs, called SincNet: kernels of the first convolutional layer are restricted to only learn shapes of parametrized sinc functions. This method was first introduced for speaker recognition [12] and later also used for phoneme recognition [13].

To the best of our knowledge, we are the first to apply this method to the task of KWS. The first convolutional layer of our model is inspired by SincNet, and we combine it with DSConv. DSConvs were first introduced in the domain of image processing [4, 8] and have since been applied to other domains: Zhang et al. applied DSConv to KWS [22], and Kaiser et al. used DSConv for neural machine translation [10]. They also introduced the “super-separable” convolution, a DSConv that additionally uses grouping and thus reduces the already small number of parameters of DSConv even further. A similar method is used in ShuffleNet, which combines DSConv with grouping and channel shuffling [21]. The idea of grouped convolutions was first used in AlexNet [11] to reduce parameters and operations and to enable distributed computation of the model over multiple GPUs. We denominate the combination of grouping and DSConv as GDSConv in our work and use it for our smallest model.

3 Model

3.1 Keyword-Spotting on Battery-Powered Devices

Typical application scenarios for smart devices imply that the device is powered by a battery, and possesses restricted hardware resources. The requirements for a KWS system in these scenarios are (1) very low power consumption to maximize battery life, (2) real-time or near real-time capability, (3) low memory footprint and (4) high accuracy to avoid random activations and to ensure responsiveness.

Regarding real-time capability, our model is designed to operate on a single-core microcontroller capable of 50 MOps per second [22]. We assume that in microcontrollers the memory consumption of a KWS neural network is associated with its power consumption: Reading memory values contributes most to power consumption which makes re-use of weights favorable. While in general large memory modules leak more power than small memory modules, one read operation from RAM costs far more energy than the corresponding multiply-and-accumulate computation [7, 15]. In addition to the parameter-reducing approach in this work, further steps may be employed to reduce power consumption such as quantization, model compression or optimization strategies regarding dataflows that depend on the utilized hardware platform [7, 15, 2, 14].

3.2 Feature Extraction using SincConvs

SincNet [12] classifies on raw audio by restricting the filters of the first convolutional layer of a CNN to only learn parametrized sinc functions. One sinc function in the time domain represents a rectangular function in the spectral domain; therefore, two sinc functions can be combined to an ideal band-pass filter:

g[n; f₁, f₂] = 2f₂ · sinc(2πf₂n) − 2f₁ · sinc(2πf₁n),   with sinc(x) = sin(x)/x,

where f₁ and f₂ denote the lower and upper cutoff frequency, respectively.
Performing convolution with such a filter extracts the parts of the input signal that lie within a certain frequency range. SincNet combines Sinc-convolutions with CNNs; as we only use the feature extraction layer of this architecture, we label this layer as SincConv to establish a distinction to SincNet.

Compared to one filter of a regular CNN, whose number of parameters is determined by its kernel width [19], Sinc-convolutions only require two parameters to derive each filter, the lower and upper cutoff frequencies f₁ and f₂, resulting in a small memory footprint. SincConv filters are initialized with the cutoff frequencies of the mel-scale filter bank and then further adjusted during training. Fig. 1 visualizes this adjustment from initialization to after training. SincConv filter banks can be easily interpreted, as the two learned parameters correspond to a specific frequency band. Fig. 2 visualizes how a SincConv layer with 7 filters processes an audio sample containing the word “yes”.
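As a minimal sketch of how a SincConv kernel is derived from its two cutoff parameters, the following numpy code builds one band-pass FIR kernel. The kernel size (101 taps) and the 1–2 kHz band at 16 kHz are illustrative assumptions, and the window function SincNet applies to smooth kernel truncation is omitted for brevity:

```python
import numpy as np

def sinc(x):
    """sinc(x) = sin(x)/x, with the removable singularity at x = 0."""
    out = np.ones_like(x, dtype=float)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

def sinc_bandpass_kernel(f1, f2, kernel_size=101):
    """Band-pass kernel g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n).

    f1 < f2 are cutoff frequencies normalized to the sampling rate
    (cycles per sample); these two scalars are the only learnable
    parameters of one SincConv filter.
    """
    n = np.arange(kernel_size, dtype=float) - (kernel_size - 1) / 2.0
    return (2 * f2 * sinc(2 * np.pi * f2 * n)
            - 2 * f1 * sinc(2 * np.pi * f1 * n))

# example: a 1-2 kHz band-pass filter at a 16 kHz sampling rate
kernel = sinc_bandpass_kernel(1000 / 16000, 2000 / 16000)
```

Convolving the raw waveform with such kernels then extracts one frequency band per filter, as in Fig. 2.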

Figure 1: SincConv filter bank with 7 filters, (a) after initialization with mel-scale weights and (b) after training on the Speech Commands dataset [18].
Figure 2: An audio sample of the keyword “yes” is convolved with the 7-filter SincConv layer from Fig. 1 to extract meaningful features.

3.3 Low-Parameter GDSConv Layers

Figure 3: Steps from a regular convolution to the grouped depthwise separable convolution. (a) Regular 1D convolutional layers perform convolution along the time axis, across all channels. (b) The DSConv convolves all channels separately along the time dimension (depthwise), and then adds a 1x1 convolution (i.e., a pointwise convolution) to combine information across channels. (c) The GDSConv is performed by partitioning the channels into groups and then applying a DSConv on each group. Our base model employs DSConv layers, and our low-parameter model GDSConv layers.

DSConv has been successfully applied to the domains of computer vision [4, 8], neural machine translation [10] and KWS [22]. Fig. 3 provides an overview of the steps from a regular convolution to the GDSConv.

The number of parameters of one DSConv layer amounts to k·c_in + c_in·c_out, with kernel size k and the number of input and output channels c_in and c_out, respectively; the first summand is determined by the depthwise convolution, the second summand by the pointwise convolution [10]. In our model configuration, the depthwise convolution accounts for only a small fraction of the parameters in this layer; the pointwise convolution dominates. We therefore reduced the parameters of the pointwise convolution by grouping its channels into g groups, which divides its parameter count to c_in·c_out/g, rather than reducing the parameters of the depthwise convolution. To allow information exchange between groups, we alternate the number of groups per layer, namely 2 and 3, as proposed in [10].
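The parameter counts above can be checked with a few lines of arithmetic. The kernel size of 9 and the channel count of 160 below are illustrative assumptions, not necessarily the paper's exact layer configuration:

```python
def dsconv_params(k, c_in, c_out, groups=1):
    """Parameter count of one (G)DSConv layer:
    depthwise k*c_in plus (grouped) pointwise c_in*c_out/groups."""
    assert (c_in * c_out) % groups == 0
    return k * c_in + (c_in * c_out) // groups

# illustrative layer: kernel size 9, 160 input and output channels
regular = 9 * 160 * 160                          # plain 1D convolution
dsconv = dsconv_params(9, 160, 160)              # depthwise-separable
gdsconv = dsconv_params(9, 160, 160, groups=2)   # grouped pointwise
```

Because the pointwise term dominates, grouping with g=2 removes almost half of the layer's parameters, while the depthwise term stays untouched.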

3.4 Two Low-Parameter Architectures

The SincConv as the first layer extracts features from the raw input samples, as shown in Fig. 4. As non-linearity after the SincConv we opt to use log-compression, i.e., y = log(|x| + 1), instead of a common activation function (e.g., ReLU). This has also been shown to be effective in other CNN architectures for raw audio processing [19, 20]. Five (G)DSConv layers then process the features further: the first layer has a larger kernel size and scales up the number of channels; the other four layers each have the same number of input and output channels. Each (G)DSConv block contains the (G)DSConv layer, batch normalization [9] and spatial dropout [17] for regularization, as well as average pooling to reduce the temporal resolution. After the (G)DSConv blocks, we use global average pooling to obtain a fixed-length vector that is transformed to class posteriors by a Softmax layer to classify the 12 classes, i.e., ten keywords as well as one class for unknown words and one for silence.
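The log-compression non-linearity is simple enough to state directly; a minimal sketch follows, assuming the common log(|x| + 1) formulation (the offset of 1 keeps the output at zero for silence and non-negative everywhere):

```python
import numpy as np

def log_compress(x):
    """Log-compression non-linearity applied after the SincConv layer.

    Unlike ReLU it is symmetric in sign and compresses large
    magnitudes, which suits the high dynamic range of raw audio.
    """
    return np.log(np.abs(x) + 1.0)
```

Applied elementwise to the SincConv output, it replaces the usual activation function before the (G)DSConv blocks.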

The low-parameter model is obtained by grouping the DSConv layers with an alternating number of groups of 2 and 3. For the configuration shown in Fig. 4, the base model has 122k parameters. After grouping, the number of parameters is reduced to a total of 62k.

Figure 4: The model architecture as described in Sec. 3.4. Parameter configurations of convolutions are given as the number of output channels, kernel length and stride, respectively. In our low-parameter model, convolutions are grouped from the third to the sixth layer.

4 Evaluation

4.1 Training on the Speech Commands Dataset

We train and evaluate our model using Google’s Speech Commands dataset [18], an established dataset for benchmarking KWS systems. The first version of the dataset consists of 65k one-second long utterances of 30 different keywords, spoken by thousands of different speakers. The most common setup is a classification of 12 classes: “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, unknown, or silence. The remaining keywords are labeled as unknown, and samples of the provided background noise files as silence. To ensure the reproducibility of the benchmark, a separate test set was released with a predefined list of samples for the unknown and the silence class. The second version of the dataset contains 105k samples and five additional keywords [18]. However, previous publications on KWS have reported results only on the first version; we therefore focus on the first version and additionally report test results on version 2 of the dataset.

Every sample from the training set is used in training, which leads to a class imbalance, as there are many more samples labeled as unknown. Class weights in the training phase assign a lower weight to samples labeled as unknown, such that their impact on the model is proportional to the other classes. This way, the model can see more unknown word samples during training without becoming biased. Our model is trained for 60 epochs with the Adam optimizer [5], with an initial learning rate of 0.001 and a learning rate decay of 0.5 after 10 epochs; the model with the highest validation accuracy is saved to evaluate accuracy on the test set.
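The training schedule above can be written as a small step-decay function. Reading the decay as recurring every 10 epochs is an assumption; the text only states a decay of 0.5 after 10 epochs:

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.5, step=10):
    """Step-decay schedule: multiply the learning rate by `decay`
    every `step` epochs, starting from `initial_lr`."""
    return initial_lr * decay ** (epoch // step)
```

Over the 60 training epochs this halves the learning rate five times, ending roughly 32x below its initial value.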

4.2 Results and Discussion

The base model composed of DSConv layers without grouping achieves the state-of-the-art accuracy of 96.6% on the Speech Commands test set. The low-parameter model with GDSConv achieves almost the same accuracy of 96.4% with only about half the parameters. This validates the effectiveness of GDSConv for model size reduction.

Table 1 lists these results in comparison with related work. Compared to the DSConv network in [22], our network is more efficient in terms of accuracy for a given parameter count: their biggest model has a 1.2% lower accuracy than our base model while having about 4 times the parameters. Choi et al. [3] report the most competitive results, while we are still able to improve upon their accuracy for a given number of parameters. They also use 1D convolution along the time dimension, which may be evidence that this approach yields better performance for audio processing, or at least for KWS.

As opposed to previous works, our architecture does not use preprocessing to extract features, but is able to extract features from raw audio samples with the SincConv layer. This makes it possible to execute the full inference with floating-point operations alone, without requiring additional hardware modules to process or transfer preprocessed features. Furthermore, we deliberately opted not to use residual connections in our network architecture, considering the memory overhead and the added difficulty for hardware acceleration modules.

For future comparability, we also trained and evaluated our model on the newer version 2 of the Speech Commands data set; see Table 2 for results. On a side note, we observed that models trained on version 2 of the Speech Commands dataset tend to perform better on both the test set for version 2 and the test set for version 1 [18].

| Model                 | Accuracy | Parameters |
|-----------------------|----------|------------|
| DS-CNN-S [22]         | 94.4%    | 38.6k      |
| DS-CNN-M [22]         | 94.9%    | 189.2k     |
| DS-CNN-L [22]         | 95.4%    | 497.8k     |
| ResNet15 [16]         | 95.8%    | 238k       |
| TC-ResNet8 [3]        | 96.1%    | 66k        |
| TC-ResNet14 [3]       | 96.2%    | 137k       |
| TC-ResNet14-1.5 [3]   | 96.6%    | 305k       |
| SincConv+DSConv       | 96.6%    | 122k       |
| SincConv+GDSConv      | 96.4%    | 62k        |

Table 1: Comparison of results on the Speech Commands dataset [18].
| Model             | Accuracy | Parameters |
|-------------------|----------|------------|
| SincConv+DSConv   |          | 122k       |
| SincConv+GDSConv  |          | 62k        |

Table 2: Results on Speech Commands version 2 [18].

5 Conclusion

Always-on, battery-powered devices running keyword spotting require energy efficient neural networks with high accuracy. For this, we identified the parameter count in a neural network as a main contributor to power consumption, as memory accesses contribute far more to power consumption than the computation. Based on this observation, we proposed an energy efficient KWS neural network architecture by combining feature extraction using SincConvs with GDSConv layers.

Starting with the base model composed of DSConvs, which already have fewer parameters than a regular convolution, we achieved state-of-the-art accuracy on Google’s Speech Commands dataset. We further reduced the number of parameters by grouping the convolutional channels into GDSConv layers, resulting in a low-parameter model with only 62k parameters.


References

  • [1] G. Chen, C. Parada, and G. Heigold (2014) Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091.
  • [2] Y. Chen, J. Emer, and V. Sze (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379.
  • [3] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha (2019) Temporal convolution for real-time keyword spotting on mobile devices. In Proc. Interspeech 2019, pp. 3372–3376.
  • [4] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807.
  • [5] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • [6] S. Fernández, A. Graves, and J. Schmidhuber (2007) An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks (ICANN), pp. 220–229.
  • [7] M. Horowitz (2014) 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
  • [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456.
  • [10] L. Kaiser, A. N. Gomez, and F. Chollet (2018) Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations (ICLR).
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).
  • [12] M. Ravanelli and Y. Bengio (2018) Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
  • [13] M. Ravanelli and Y. Bengio (2018) Interpretable convolutional filters with SincNet. In NIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop.
  • [14] M. Shahnawaz, E. Plebani, I. Guaneri, D. Pau, and M. Marcon (2018) Studying the effects of feature extraction settings on the accuracy and memory requirements of neural networks for keyword spotting. In 2018 IEEE 8th International Conference on Consumer Electronics - Berlin (ICCE-Berlin), pp. 1–6.
  • [15] V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
  • [16] R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
  • [17] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 648–656.
  • [18] P. Warden (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv:1804.03209.
  • [19] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schaiz, G. Synnaeve, and E. Dupoux (2018) Learning filterbanks from raw speech for phone recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5509–5513.
  • [20] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux (2018) End-to-end speech recognition from the raw waveform. In Proc. Interspeech 2018, pp. 781–785.
  • [21] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856.
  • [22] Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello Edge: keyword spotting on microcontrollers. arXiv:1711.07128.