Speech processing enables natural communication with smart phones or smart home assistants, e.g., Amazon Echo or Google Home. However, continuously performing speech recognition is not energy-efficient and would drain the batteries of smart devices. Instead, most speech recognition systems passively listen for utterances of certain wake words such as “Ok Google”, “Hey Siri”, or “Alexa” to trigger the continuous speech recognition system on demand. This task is referred to as keyword spotting (KWS). There are also uses of KWS where a few simple speech commands (e.g., “on”, “off”) are enough to interact with a device such as a voice-controlled light bulb.
Conventional hybrid approaches to KWS first divide the audio signal into time frames to extract features, e.g., Mel Frequency Cepstral Coefficients (MFCC). A neural net then estimates phoneme or state posteriors of the keyword Hidden Markov Model in order to calculate the keyword probability using a Viterbi search. In recent years, end-to-end architectures that directly classify keyword posterior probabilities based on the previously extracted features have gained traction, e.g., [1, 22, 16, 3, 6].
Typical application scenarios imply that the device is powered by a battery and possesses restricted hardware resources to reduce costs. Previous works therefore optimized towards memory footprint and operations per second. In contrast, we tune our neural network towards energy conservation in microcontrollers, motivated by observations on power consumption, as detailed in Sec. 3.1. To extract meaningful and representative features from raw audio, our architecture uses parametrized Sinc-convolutions (SincConv) from SincNet, proposed by Ravanelli et al. We use Depthwise Separable Convolutions (DSConv) [10, 4] that preserve time-context information while at the same time comparing features in different channels. To further reduce the number of network parameters, which is key for energy efficiency, we group DSConv layers, a technique we refer to as Grouped DSConv (GDSConv).
Our key contributions are:
- We propose a neural network architecture tuned towards energy efficiency in microcontrollers, grounded on the observation that memory access is costly while computation is cheap.
- Our keyword-spotting network classifies on raw audio employing SincConvs while at the same time reducing the number of parameters using (G)DSConvs.
- Our base model with k parameters achieves the state-of-the-art accuracy of 96.6% on the test set of Google’s Speech Commands dataset, on par with TC-ResNet, which has k parameters and requires separate preprocessing. Our low-parameter model achieves 96.4% with only k parameters.
2 Related Work
Recently, CNNs have been successfully applied to KWS [22, 16, 3]. Zhang et al. evaluated different neural network architectures (such as CNNs, LSTMs, GRUs) in terms of accuracy, computational operations and memory footprint as well as their deployment on embedded hardware. They achieved their best results using a CNN with DSConvs. Tang et al. explored the use of Deep Residual Networks with dilated convolutions to achieve a high accuracy while keeping the number of parameters comparable to prior work. Choi et al. build on this work as they also use a ResNet-inspired architecture. Instead of using 2D convolution over a time-frequency representation of the data, they convolve along the time dimension and treat the frequency dimension as channels.
This bears similarities with our approach, as we are using 1D convolution along the time dimension as well. However, all the approaches mentioned classify from MFCCs or similar preprocessed features. Our architecture works directly on raw audio signals. There is a recent trend towards using CNNs on raw audio data directly [12, 13, 19, 20]. Ravanelli et al. present an effective method of processing raw audio with CNNs, called SincNet. Kernels of the first convolutional layer are restricted to only learn shapes of parametrized sinc functions. This method was first introduced for speaker recognition and later also used for phoneme recognition.
To the best of our knowledge, we are the first to apply this method to the task of KWS. The first convolutional layer of our model is inspired by SincNet and we combine it with DSConv. DSConvs were first introduced in the domain of image processing [4, 8] and have since been applied to other domains: Zhang et al. applied DSConv to KWS. Kaiser et al. used DSConv for neural machine translation. They also introduce the “super-separable” convolution, a DSConv that also uses grouping and thus reduces the already small number of parameters of DSConv even further. A similar method is used by ShuffleNet, which combines DSConv with grouping and channel shuffling. The idea of grouped convolutions was first used in AlexNet to reduce parameters and operations and to enable distributed computing of the model over multiple GPUs. We refer to the combination of grouping and DSConv as GDSConv in our work and use it for our smallest model.
3.1 Keyword-Spotting on Battery-Powered Devices
Typical application scenarios for smart devices imply that the device is powered by a battery, and possesses restricted hardware resources. The requirements for a KWS system in these scenarios are (1) very low power consumption to maximize battery life, (2) real-time or near real-time capability, (3) low memory footprint and (4) high accuracy to avoid random activations and to ensure responsiveness.
Regarding real-time capability, our model is designed to operate on a single-core microcontroller capable of 50 MOps per second. We assume that in microcontrollers the memory consumption of a KWS neural network is associated with its power consumption: reading memory values contributes most to power consumption, which makes re-use of weights favorable. While in general large memory modules leak more power than small memory modules, one read operation from RAM costs far more energy than the corresponding multiply-and-accumulate computation [7, 15]. In addition to the parameter-reducing approach in this work, further steps may be employed to reduce power consumption, such as quantization, model compression or optimization strategies regarding dataflows that depend on the utilized hardware platform [7, 15, 2, 14].
3.2 Feature Extraction using SincConvs
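SincNet classifies on raw audio by restricting the filters of the first convolutional layer of a CNN to only learn parametrized sinc functions, i.e., $\operatorname{sinc}(x) = \sin(x)/x$. One sinc function in the time domain represents a rectangular function in the spectral domain, therefore two sinc functions can be combined to an ideal band-pass filter:
$$g[n, f_1, f_2] = 2 f_2 \operatorname{sinc}(2\pi f_2 n) - 2 f_1 \operatorname{sinc}(2\pi f_1 n),$$
where $f_1$ and $f_2$ denote the lower and upper cut-off frequencies.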
Performing convolution with such a filter extracts the parts of the input signal that lie within a certain frequency range. SincNet combines Sinc-convolutions with CNNs; as we only use the feature extraction layer of this architecture, we label this layer as SincConv to establish a distinction to SincNet.
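As an illustration only, the band-pass construction from two sinc functions can be sketched in plain Python. The function names, the Hamming window used to smooth the truncated kernel, and the example values (kernel width 101, 16 kHz sampling rate, 1–2 kHz band) are assumptions made for this sketch and are not taken from our model configuration.

```python
import math


def sinc(x):
    """Unnormalized sinc function sin(x)/x with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass_kernel(f_low_hz, f_high_hz, kernel_width, sample_rate_hz):
    """Band-pass kernel as the difference of two low-pass sinc filters.

    Assumes an odd kernel_width > 1 so that the kernel is symmetric
    around its center tap.
    """
    f1 = f_low_hz / sample_rate_hz   # lower cut-off in cycles per sample
    f2 = f_high_hz / sample_rate_hz  # upper cut-off in cycles per sample
    half = (kernel_width - 1) / 2
    kernel = []
    for i in range(kernel_width):
        n = i - half
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window (an assumption of this sketch) to reduce ripple
        # caused by truncating the infinitely long ideal filter.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_width - 1))
        kernel.append(ideal * window)
    return kernel


def gain_at(kernel, freq_hz, sample_rate_hz):
    """Magnitude of the kernel's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(h * math.cos(w * i) for i, h in enumerate(kernel))
    im = sum(h * math.sin(w * i) for i, h in enumerate(kernel))
    return math.hypot(re, im)


kernel = sinc_bandpass_kernel(1000.0, 2000.0, 101, 16000.0)
print(gain_at(kernel, 1500.0, 16000.0))  # inside the pass band
print(gain_at(kernel, 4000.0, 16000.0))  # outside the pass band
```

Only the two cut-off frequencies determine the kernel; all taps are derived from them, which is what keeps the layer small.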
In a regular CNN, the number of parameters of one filter is given by its kernel width. In contrast, Sinc-convolutions only require two parameters to derive each filter, the lower and upper cut-off frequencies $(f_1, f_2)$, resulting in a small memory footprint. SincConv filters are initialized with the cut-off frequencies of the mel-scale filter bank and then further adjusted during training. Fig. 1 visualizes this adjustment from initialization to after training. SincConv filter banks can be easily interpreted, as the two learned parameters correspond to a specific frequency band. Fig. 2 visualizes how a SincConv layer with 7 filters processes an audio sample containing the word “yes”.
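A mel-scale initialization of the cut-off frequencies could look like the following sketch. The helper names, the common mel formula $2595 \log_{10}(1 + f/700)$, the choice of adjacent, non-overlapping bands, and the frequency range of 30 Hz to 8 kHz are assumptions for illustration; the exact initialization used for our filters may differ.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_initial_bands(num_filters, f_min_hz, f_max_hz):
    """Return (lower, upper) cut-off pairs equally spaced on the mel scale.

    Adjacent bands share an edge; this layout is an assumption of the sketch.
    """
    mel_min, mel_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (mel_max - mel_min) / num_filters
    edges = [mel_to_hz(mel_min + i * step) for i in range(num_filters + 1)]
    return list(zip(edges[:-1], edges[1:]))


# Seven filters as in the Fig. 2 example; the range is hypothetical.
bands = mel_initial_bands(7, 30.0, 8000.0)
for f1, f2 in bands:
    print(f"{f1:8.1f} Hz .. {f2:8.1f} Hz")
print("trainable parameters:", 2 * len(bands))
```

Bands are narrow at low frequencies and widen towards high frequencies, and the whole filter bank is described by two numbers per filter regardless of the kernel width.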
3.3 Low-Parameter GDSConv Layers
DSConvs have been successfully applied to the domains of computer vision [4, 8], neural translation and KWS. Fig. 3 provides an overview of the steps from a regular convolution to the GDSConv.
The number of parameters of one DSConv layer amounts to $k \cdot c_{in} + c_{in} \cdot c_{out}$, with the kernel size $k$ and the number of input and output channels $c_{in}$ and $c_{out}$, respectively; the first summand is determined by the depthwise convolution, the second summand by the pointwise convolution. In our model configuration, the depthwise convolution accounts for only a small fraction of the parameters in this layer, while the pointwise convolution accounts for the remainder. We therefore reduce the parameters of the pointwise convolution, rather than those of the depthwise convolution, using grouping by a factor $g$ to $c_{in} \cdot c_{out} / g$. To allow information exchange between groups, we alternate the number of groups per layer, namely 2 and 3, following prior work.
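The parameter counts discussed above can be checked with a few lines of Python. This is a sketch that ignores bias terms; the layer sizes in the example (kernel size 9, 120 input and output channels) are hypothetical values chosen so that the channels divide evenly into 2 and 3 groups, not our model configuration.

```python
def conv_params(kernel_size, c_in, c_out):
    """Weights of a regular 1D convolution (biases ignored)."""
    return kernel_size * c_in * c_out


def dsconv_params(kernel_size, c_in, c_out):
    """Depthwise (kernel_size * c_in) plus pointwise (c_in * c_out) weights."""
    return kernel_size * c_in + c_in * c_out


def gdsconv_params(kernel_size, c_in, c_out, groups):
    """DSConv whose pointwise convolution is split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel_size * c_in + (c_in * c_out) // groups


# Hypothetical layer size, for illustration only.
k, c = 9, 120
print(conv_params(k, c, c))        # 129600
print(dsconv_params(k, c, c))      # 15480
print(gdsconv_params(k, c, c, 2))  # 8280
print(gdsconv_params(k, c, c, 3))  # 5880
```

In this example the depthwise part contributes 1080 of the 15480 DSConv weights, so grouping the pointwise part is where most of the savings come from.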
3.4 Two Low-Parameter Architectures
The SincConv as the first layer extracts features from the raw input samples, as shown in Fig. 4. As non-linearity after the SincConv, we opt to use log-compression [19, 20]. Five (G)DSConv layers are then used to process the features further: the first layer has a larger kernel size and scales the number of channels; the other four layers each have the same number of input and output channels. Each (G)DSConv block contains the (G)DSConv layer, batch normalization and spatial dropout for regularization, as well as average pooling to reduce temporal resolution. After the (G)DSConv blocks, we use global average pooling to obtain a fixed-size vector that is classified into 12 classes, i.e., 10 keywords as well as one class each for unknown and for silence.
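The non-parametric steps of this pipeline can be sketched as follows. The log-compression is assumed here to take the form $\log(|x| + 1)$, a common choice for raw-waveform front ends; the function names and the non-overlapping pooling with stride equal to the window size are likewise assumptions of this sketch.

```python
import math


def log_compress(frame):
    """Assumed log-compression log(|x| + 1), applied element-wise."""
    return [math.log(abs(v) + 1.0) for v in frame]


def average_pool(frame, size):
    """Non-overlapping average pooling along time; a trailing remainder is dropped."""
    return [sum(frame[i:i + size]) / size
            for i in range(0, len(frame) - size + 1, size)]


def global_average_pool(channels):
    """Reduce each channel (a list over time) to its mean value."""
    return [sum(ch) / len(ch) for ch in channels]


features = [[0.0, -1.0, 3.0, 1.0], [2.0, 2.0, -2.0, 2.0]]
compressed = [log_compress(ch) for ch in features]
pooled = [average_pool(ch, 2) for ch in compressed]
print(global_average_pool(pooled))  # one value per channel
```

Global average pooling makes the size of the final vector depend only on the number of channels, not on the length of the input.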
The low-parameter model is obtained by grouping the DSConv layers with an alternating number of groups between 2 and 3. For the configuration shown in Fig. 4, the base model has k parameters. After grouping, the number of parameters is reduced to a total of k.
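To make the effect of grouping concrete, the following plain-Python sketch runs a depthwise convolution followed by a grouped pointwise convolution on a toy input. It omits batch normalization, dropout and pooling, uses no padding, and all names and values are hypothetical; it is meant to show which inputs each output channel can see, not to reproduce our implementation.

```python
def depthwise_conv1d(x, dw_kernels):
    """Filter each channel with its own kernel (cross-correlation, no padding).

    x: list of channels, each a list of samples.
    dw_kernels: one kernel (list of weights) per input channel.
    """
    out = []
    for channel, kernel in zip(x, dw_kernels):
        k = len(kernel)
        out.append([
            sum(channel[t + j] * kernel[j] for j in range(k))
            for t in range(len(channel) - k + 1)
        ])
    return out


def grouped_pointwise_conv1d(x, weights, groups):
    """1x1 convolution in which output channel o only sees its group's inputs.

    weights[o] holds len(x) // groups weights for output channel o.
    """
    c_in, c_out = len(x), len(weights)
    in_per_group, out_per_group = c_in // groups, c_out // groups
    length = len(x[0])
    out = []
    for o in range(c_out):
        g = o // out_per_group
        inputs = x[g * in_per_group:(g + 1) * in_per_group]
        out.append([
            sum(w * ch[t] for w, ch in zip(weights[o], inputs))
            for t in range(length)
        ])
    return out


def gdsconv1d(x, dw_kernels, pw_weights, groups):
    """Depthwise convolution followed by a grouped pointwise convolution."""
    return grouped_pointwise_conv1d(depthwise_conv1d(x, dw_kernels),
                                    pw_weights, groups)


x = [[1, 2, 3, 4], [0, 1, 0, 1], [1, 1, 1, 1], [2, 0, 2, 0]]
dw = [[1, 1]] * 4            # one 2-tap kernel per channel
pw = [[1, 1], [1, -1]]       # 2 output channels, 2 groups, 2 inputs each
print(gdsconv1d(x, dw, pw, groups=2))  # [[4, 6, 8], [0, 0, 0]]
```

With 2 groups, the first output channel depends only on input channels 0 and 1 and the second only on channels 2 and 3, which is why alternating the number of groups between layers is needed for information to mix across groups.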
4.1 Training on the Speech Commands Dataset
We train and evaluate our model using Google’s Speech Commands data set , an established dataset for benchmarking KWS systems. The first version of the data set consists of k one-second long utterances of different keywords spoken by different speakers. The most common setup consists of a classification of 12 classes: “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, unknown, or silence. The remaining keywords are labeled as unknown, samples of provided background noise files as silence. To ensure the benchmark reproducibility, a separate test set was released with a predefined list of samples for the unknown and the silence class. The second version of the dataset contains k samples and five additional keywords . However, previous publications on KWS reported only results on the first version, therefore we focused on the first version and additionally report testing results on version 2 of the dataset.
Every sample from the training set is used in training. This leads to a class imbalance, as there are many more samples for unknown. Class weights in the training phase assign a lower weight to samples labeled as unknown such that the impact on the model is proportional to the other classes. This way, the model can see more unknown word samples during training without getting biased. Our model is trained for 60 epochs with the Adam optimizer with an initial learning rate of 0.001 and learning rate decay of 0.5 after 10 epochs; the model with the highest validation accuracy is saved to evaluate accuracy on the test set.
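The class weighting and learning rate schedule can be sketched as follows. Two assumptions are made here: the weights are inverse-frequency weights normalized so that every class contributes the same total weight, and the decay factor of 0.5 is applied every 10 epochs rather than once. The function names and the toy label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights so each class has the same total weight."""
    counts = Counter(labels)
    num_samples, num_classes = len(labels), len(counts)
    return {label: num_samples / (num_classes * count)
            for label, count in counts.items()}


def step_decay_lr(epoch, initial_lr=0.001, decay=0.5, step=10):
    """Learning rate for a 0-based epoch, halved every `step` epochs (assumed)."""
    return initial_lr * decay ** (epoch // step)


# Toy label distribution with an over-represented "unknown" class.
labels = ["yes"] * 2 + ["no"] * 2 + ["unknown"] * 8
weights = balanced_class_weights(labels)
print(weights)  # "unknown" gets the smallest weight
print([step_decay_lr(e) for e in (0, 9, 10, 25)])
```

In the toy example each class ends up with the same summed weight, so the frequent unknown class does not dominate the loss even though all of its samples are used.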
4.2 Results and Discussion
The base model composed of DSConv layers without grouping achieves the state-of-the-art accuracy of 96.6% on the Speech Commands test set. The low-parameter model with GDSConv achieves almost the same accuracy of 96.4% with only about half the parameters. This validates the effectiveness of GDSConv for model size reduction.
Table 1 lists these results in comparison with related work. Compared to the DSConv network of Zhang et al., our network is more efficient in terms of accuracy for a given parameter count. Their biggest model has a 1.2% lower accuracy than our base model while having about 4 times the parameters. Choi et al. have the most competitive results, while we are still able to improve upon their accuracy for a given number of parameters. They use 1D convolution along the time dimension as well, which may be evidence that this yields better performance for audio processing, or at least for KWS.
As opposed to previous works, our architecture does not use preprocessing to extract features, but is able to extract features from raw audio samples with the SincConv layer. That makes it possible to execute a full inference as floating point operations, without requiring additional hardware modules to process or transfer preprocessed features. Furthermore, we deliberately opted to not use residual connections in our network architecture, considering the memory overhead and added difficulty for hardware acceleration modules.
For future comparability, we also trained and evaluated our model on the newer version 2 of the Speech Commands data set; see Table 2 for results. On a side note, we observed that models trained on version 2 of the Speech Commands dataset tend to perform better on both the test set for version 2 and the test set for version 1.
Always-on, battery-powered devices running keyword spotting require energy efficient neural networks with high accuracy. For this, we identified the parameter count in a neural network as a main contributor to power consumption, as memory accesses contribute far more to power consumption than the computation. Based on this observation, we proposed an energy efficient KWS neural network architecture by combining feature extraction using SincConvs with GDSConv layers.
Starting with the base model composed of DSConvs, which already have fewer parameters than a regular convolution, we achieved state-of-the-art accuracy on Google’s Speech Commands dataset. We further reduce the number of parameters by grouping the convolutional channels to GDSConv, resulting in a low-parameter model with only k parameters.
[1] (2014-05) Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091.
[2] Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379.
[3] (2019) Temporal Convolution for Real-Time Keyword Spotting on Mobile Devices. In Proc. Interspeech 2019, pp. 3372–3376.
[4] Xception: deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807.
[5] (2015) Adam: a method for stochastic optimization. In Conference on Learning Representations (ICLR).
[6] An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks (ICANN), pp. 220–229.
[7] (2014-02) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
[8] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
[9] Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456.
[10] (2018) Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations (ICLR).
[11] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).
[12] (2018-12) Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
[13] (2018) Interpretable convolutional filters with sincnet. In NIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop.
[14] (2018-09) Studying the effects of feature extraction settings on the accuracy and memory requirements of neural networks for keyword spotting. In 2018 IEEE 8th International Conference on Consumer Electronics - Berlin (ICCE-Berlin), pp. 1–6.
[15] (2017-12) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
[16] (2018-04) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
[17] (2015-06) Efficient object localization using convolutional networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 648–656.
[18] (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv:1804.03209.
[19] (2018-04) Learning filterbanks from raw speech for phone recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5509–5513.
[20] (2018) End-to-end speech recognition from the raw waveform. In Proc. Interspeech 2018, pp. 781–785.
[21] (2018-06) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856.
[22] (2017) Hello edge: keyword spotting on microcontrollers. arXiv:1711.07128.