With the rapid proliferation of voice-enabled devices, such as the Amazon Echo and Apple iPhone, speech recognition systems are becoming increasingly prevalent in our daily lives. Importantly, these systems improve safety and convenience in hands-free interactions, such as using Apple’s Siri to dial contacts while driving. However, a prominent drawback is that most of these systems perform speech recognition in the cloud, where a remote server receives all audio to be transcribed, as recorded by the device. Clearly, the privacy and security implications are significant: the server may be accessed by other people—authorized or not. Thus, it is important to capture the relevant speech only and not all incoming audio, all the while providing a hands-free experience.
Enter keyword spotting systems. They solve the aforementioned issues by implementing an on-device mechanism to “wake up” the intelligent agent, e.g., “Okay, Google” for triggering the Android assistant. This then allows the device to record and transmit only a limited segment of relevant speech, obviating the need to be always listening. Specifically, the task of keyword spotting (KWS) is to detect the presence of pre-specified phrases in a stream of audio, often with the end goal of wake-word detection or simple command recognition on device. Currently, the state of the art uses lightweight neural networks [1, 2, 3, 4], which can perform inference in real-time even on low-end devices [4, 5].
Despite the popularity of voice-enabled products, web applications have yet to make use of keyword spotting. This is surprising, since modern web applications are supported on billions of devices ranging from desktops to smartphones. Also, an in-browser KWS system would be able to perform the aforementioned simple command recognition and wake-word detection. Thus, we attempt to close the gap between KWS systems and web applications in both the research literature and industrial applications, building and evaluating such an in-browser system. Unfortunately, the browser is a highly inefficient platform for deploying neural networks, mainly due to poorly optimized matrix multiply routines. Fortunately, in recent years, the art of compressing neural networks has made significant advances in both the general [6, 7, 8] and keyword spotting literature [4, 9]. On our task, we demonstrate that network slimming is a simple yet highly effective method to achieve low latency with minimal impact on accuracy.
Thus, our main contributions are as follows: first, we develop a novel web application with an in-browser KWS system based on previous state-of-the-art models. Second, we provide the first set of comprehensive experimental results for the latency of an in-browser KWS system on a broad range of devices. Finally, to the best of our knowledge, we are the first to apply network slimming to examine various accuracy–efficiency operating points of a state-of-the-art KWS model. On the Google Speech Commands dataset, our most accurate in-browser model achieves an accuracy of 94% while performing inference in less than 10 milliseconds. With network slimming, we further reduce latency by 66% while increasing the error rate by only 4%.
2 Background and Related Work
Keyword spotting. KWS is the task of detecting a spoken phrase in audio, applicable to simple command recognition [3, 10] and wake-word detection [2, 1]. A typical requirement is that such a KWS system be small-footprint at inference time, since the target platforms are mobile phones, Internet-of-things (IoT) devices, and other portable electronics. To achieve this goal, resource-efficient architectures using convolutional neural networks (CNNs) [3, 1] and recurrent neural networks (RNNs) have been proposed, while other works make use of low-bitwidth weights [4, 9].
Compressing neural networks. Sparse matrix storage leads to inefficient computation and storage on general-purpose hardware; thus, inducing structured sparsity in neural networks, e.g., across entire rows and columns, has been the cornerstone of various compression techniques [6, 8]. Network slimming is one such state-of-the-art approach that has been applied successfully to CNNs: first, models are trained with an ℓ1 penalty on the scale parameters of the 2D batch normalization layers, which encourages entire channels to approach zero. Then, a fixed percentage of the smallest, and hence least important, scale parameters are removed, along with the corresponding preceding and succeeding filters in the convolution layers (see Figure 1). Finally, the entire network is fine-tuned on the training set; this entire process can optionally be repeated multiple times.
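As a rough illustration of the pruning step (a plain-Python sketch under our own simplifications, not the authors' code), channels whose batch normalization scale parameters have the smallest magnitudes after ℓ1-regularized training are the ones removed:

```python
# Sketch of the channel-selection step in network slimming: training adds
# a penalty lambda * sum(|gamma|) to the loss, after which the channels
# with the smallest |gamma| are pruned. Real implementations operate on
# framework tensors; this version works on a plain list of scales.

def select_channels(gammas, prune_fraction):
    """Return indices of channels to KEEP after pruning `prune_fraction`
    of the smallest-magnitude scale parameters."""
    ranked = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    n_prune = int(len(gammas) * prune_fraction)
    pruned = set(ranked[:n_prune])
    return [i for i in range(len(gammas)) if i not in pruned]

# Example: with 40% pruning, the two smallest of five scales are dropped.
gammas = [0.9, 0.01, 0.5, 0.03, 0.7]
kept = select_channels(gammas, 0.4)  # -> [0, 2, 4]
```

In the full procedure, the convolution filters feeding into and out of the pruned channels are removed as well, and the slimmed network is then fine-tuned.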
3 Data and Implementation
For consistency with past results [3, 5], we train our models on the first version of the Google Speech Commands dataset, which comprises a total of 65,000 spoken utterances across 30 short, one-second phrases. To compare with past work, we pick the following twelve classes: “yes,” “no,” “up,” “down,” “left,” “right,” “on,” “off,” “stop,” “go,” unknown, and silence. The dataset contains roughly 2,000 examples per class, as well as background noise samples of both man-made and artificial noise, e.g., washing dishes and white noise. As is standard in the speech processing literature, all audio is in 16-bit PCM, 16 kHz mono-channel WAV format. We use the standard 80%, 10%, and 10% splits for the training, validation, and test sets, respectively [3, 10].
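The standard splits are derived deterministically from a hash of each file name, so a recording always lands in the same partition across runs. A simplified sketch of this convention (the official release additionally strips per-speaker suffixes before hashing, which we omit here):

```python
# Simplified sketch of deterministic hash-based partitioning, in the
# spirit of the Speech Commands dataset's split convention: each file
# name maps to a bucket in [0, 100), and bucket ranges define the splits.
import hashlib

def which_set(filename, dev_pct=10, test_pct=10):
    """Assign `filename` to a partition based on a stable hash."""
    bucket = int(hashlib.sha1(filename.encode()).hexdigest(), 16) % 100
    if bucket < dev_pct:
        return "validation"
    if bucket < dev_pct + test_pct:
        return "testing"
    return "training"
```

Because the assignment depends only on the file name, adding new recordings never moves existing ones between partitions.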
3.1 Input preprocessing
Following standard practice in audio augmentation, we mix background noise into training samples to augment the dataset and improve the robustness of the model under noisy conditions. Following the official TensorFlow implementation, we also apply a random time shift drawn from Uniform[−100, 100] milliseconds. Then, for the feature extraction step, 40-dimensional Mel-frequency cepstral coefficients (MFCCs) are computed with a window size of 30 milliseconds and a frame shift of 10 milliseconds, yielding the final two-dimensional preprocessed input for each one-second audio sample.
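As a sanity check on the feature dimensions (an illustrative calculation, not part of our pipeline; the exact frame count depends on the library's padding convention), the number of MFCC frames for a one-second clip follows directly from the window and shift sizes:

```python
# Back-of-the-envelope frame count for framed feature extraction:
# a window of `window_ms` slides by `shift_ms` over `total_ms` of audio.

def num_frames(total_ms, window_ms, shift_ms):
    # Number of full windows that fit without padding.
    return (total_ms - window_ms) // shift_ms + 1

frames = num_frames(1000, 30, 10)  # -> 98 full 30 ms windows
mfcc_shape = (frames, 40)          # 40 MFCC coefficients per frame
```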
3.2 Model architecture
We use the res8 and res8-narrow architectures from Tang and Lin as a starting point, which represent the prior state of the art in residual CNNs for KWS. In both models, given the MFCC input, we first expand the input channel-wise by applying a 3×3 2D convolution layer with padding of one on all sides. This step preserves the spatial dimensions of the input, which we then downsample using an average pooling layer. Next, inspired by insights in image classification, the output is passed through a series of three residual blocks comprising convolution and batch normalization layers; Figure 2 illustrates one such block. Finally, we average-pool across the channels and pass the features through a softmax layer over the twelve classes.
In the previous description, we are free to choose the number of feature maps per convolution layer, which dictates the expressiveness and computational footprint of the model: res8 and res8-narrow use 45 and 19 feature maps, respectively. In total, res8 contains 110K parameters and incurs 30 million multiplies per second of audio, while res8-narrow uses 19.9K parameters and incurs 5.65 million multiplies per second.
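To see how the channel width drives these cost figures, consider the standard multiply count for one convolution layer (the helper and the 25×13 feature-map size below are illustrative assumptions, not values from the paper):

```python
# Rough multiply count for a single square convolution, assuming "same"
# padding so the spatial size is unchanged: each output element costs
# kernel * kernel * c_in multiplies.

def conv_multiplies(height, width, c_in, c_out, kernel=3):
    return height * width * c_out * kernel * kernel * c_in

# Widening a layer from 19 to 45 channels scales its cost by (45/19)^2
# (about 5.6x), consistent with res8 incurring roughly 5x the total
# multiplies of res8-narrow.
narrow = conv_multiplies(25, 13, 19, 19)
wide = conv_multiplies(25, 13, 45, 45)
```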
Overall, we successfully enable KWS functionality in the browser without any server-side inference. Since the audio data is processed entirely within the browser, this approach is much more efficient than transferring data over the network for inference. Furthermore, users are freed from security and privacy risks, such as eavesdropping of network traffic and collection of personal speech data.
Network slimming. Since the compression technique relies on the presence of scale parameters in batch normalization layers, we cannot apply slimming as-is to the original res8-* models, which do not use affine transforms. For pruning, we must introduce a scale parameter γ for each batch normalization operation, corresponding to y = γ(x − μ)/σ for input x, mean μ, and standard deviation σ. Note that these new scale parameters are introduced only in the pruned architectures, because they are unnecessary for the full ones. We create two configurations of pruned models: one with 40% of the scale parameters (and their channels) removed, and another with a more aggressive 80% removed, appending -40 and -80 to res8 and res8-narrow according to the level of pruning.
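A minimal plain-Python sketch of this scale-only normalization for a single channel (illustrative only; in practice it is a tensor operation inside the network):

```python
# Scale-only batch normalization: y = gamma * (x - mu) / sigma, with no
# bias term. The gamma values are what network slimming penalizes and
# later inspects to decide which channels to prune.

def bn_scale(x, mu, sigma, gamma):
    return [gamma * (xi - mu) / sigma for xi in x]

out = bn_scale([1.0, 2.0, 3.0], mu=2.0, sigma=1.0, gamma=0.5)  # -> [-0.5, 0.0, 0.5]
```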
Table 1: Latency and accuracy of res8 and res8-narrow across hardware and platforms.

| Compute | Device | Processor | Platform | res8 Latency (ms) | res8 Accuracy (%) | res8-narrow Latency (ms) | res8-narrow Accuracy (%) |
|---|---|---|---|---:|---:|---:|---:|
| GPU | Desktop | GTX 1080 Ti | PyTorch | 1 | 94.34 | 1 | 91.16 |
| GPU | Desktop | GTX 1080 Ti | Firefox | 8 | 94.06 | 7 | 90.91 |
| GPU | Macbook Pro (2017) | Intel Iris Plus 650 | Firefox | 17 | 93.99 | 10 | 90.78 |
| GPU | Macbook Air (2013) | Intel HD 6000 | Firefox | 34 | 93.99 | 19 | 90.78 |
| GPU | Galaxy S8 (2017) | Adreno 540 | Firefox | 60 | 94.06 | 43 | 88.96 |
| CPU | Macbook Pro (2017) | i5-7287U (quad) | PyTorch | 12 | 94.15 | 3 | 91.16 |
| CPU | Macbook Pro (2017) | i5-7287U (quad) | Firefox | 338 | 93.99 | 94 | 90.78 |
| CPU | Macbook Air (2013) | i5-4260U (dual) | Firefox | 485 | 93.99 | 115 | 90.78 |
| CPU | Galaxy S8 (2017) | Snapdragon 835 (octa) | Firefox | 1105 | 94.06 | 265 | 88.96 |
The two main metrics for evaluating a neural network application are accuracy and inference latency. For consistency with the training process, all experiments use the same test set partitioned from the dataset. We conduct experiments and evaluate performance on desktop, laptop, and smartphone configurations to demonstrate the feasibility of our web application on a broad range of devices. First, we evaluate our application on a desktop with 16GB RAM, an i7-4790k CPU, and a GTX 1080 Ti. Then, we use the Macbook Pro (2017) and Macbook Air (2013) for our laptop configurations; the former has a quad-core i5-7287U CPU and an Intel Iris Plus 650 GPU, while the latter has a lighter dual-core i5-4260U CPU and an Intel HD 6000 GPU. Finally, we choose the Galaxy S8 as our smartphone configuration. We select Firefox as the browser and collect results both with and without WebGL to evaluate the benefits of hardware acceleration.
In-browser KWS inference efficiency. Measured on our university WiFi connection, the average round-trip latency to the Google server is about 25 ms, with a standard deviation of 20 ms. Network latency is much higher when audio data must be transferred: with a server written in Python, we measure an average latency of 481 ms, with a standard deviation of 183 ms, for 16 kHz mono-channel audio data. With in-browser inference, we achieve a serverless architecture that no longer suffers from variable network latency.
Table 1 summarizes latency and accuracy results for both res8 and res8-narrow on various devices. Note that results from our PyTorch implementation are included on the laptop and desktop setups for comparison with the standard baseline; the original implementation achieves an accuracy of 94.34% for res8 and 91.16% for res8-narrow (see the first few rows of the table). Slight differences are observed among platforms due to mismatches between the MFCC implementations of LibROSA and Meyda. However, the accuracy of each model is consistent across platforms, confirming that our in-browser web application is indeed robust.
Even though latency is processor-dependent, the res8-narrow model performs inference in real-time on every platform, ranging from 7 to 43 milliseconds on GPU and 94 to 265 milliseconds on in-browser CPU configurations. Given that these delays are perceived by humans to be near-instantaneous, the latency we observe is sufficient for real-time interactive web applications, even considering the in-browser overhead. In fact, it is now feasible to deploy cross-platform neural network web applications even on mobile devices.
Latency–accuracy tradeoff. Given the limited computational resources of mobile devices, network slimming provides an option to trade off accuracy for inference latency. To understand these tradeoffs, we also evaluate res8 and res8-narrow models with 40% and 80% of their batch normalization channels pruned (see Figure 3); to illustrate the trend concisely, the figure includes results on CPU configurations only.
From res8-narrow-80 to res8-narrow-40, accuracy increases by 4% with a minimal increase in latency. Beyond res8-narrow-40, however, the increase in latency is clear, indicating that higher accuracy comes at a cost. The slope of the curve steepens as accuracy increases, yielding tradeoff curves similar to those observed in other works [7, 17]. Between res8-40 and res8, accuracy changes by less than 1% even though the largest increase in latency is observed, from 111 ms to 363 ms. In other words, res8-40 performs as well as res8 while achieving lower latency.
Overall, we achieve a 50% decrease in latency with res8-narrow-80 and 66% with res8-80, at an absolute error rate increase of only 4%. res8-narrow on the Macbook Pro requires 94 ms, dropping to 41 ms with res8-narrow-80. Similarly, latency on the Galaxy S8 starts at 265 ms and decreases to 137 ms. Also, given that both the accuracy and latency of res8-narrow are comparable to those of res8-80, network slimming provides the option to substitute one model for the other depending on the target device.
In this paper, we realize a new paradigm for serving neural network applications by implementing KWS with in-browser inference. The serverless architecture allows our application to be efficient and cross-device compatible, with the additional benefit that users are freed from security and privacy risks, such as eavesdropping of network traffic and collection of personal speech data.
We implement a KWS web application that achieves an accuracy of 94% while maintaining an inference latency of less than 10 ms on modern devices. With the goal of understanding accuracy and inference latency tradeoffs, we also analyze the impact of network slimming on the existing res8 and res8-narrow models. Our study shows that, with network slimming, our model yields a 66% decrease in latency with a minimal increase in error rate of 4%, along with accuracy–efficiency tradeoff curves similar to those observed in past work.
-  Tara N Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH-2015, 2015.
-  Sercan O Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv:1703.05390, 2017.
-  Raphael Tang and Jimmy Lin, “Deep residual learning for small-footprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5484–5488.
-  Javier Fernández-Marqués, Vincent W-S Tseng, Sourav Bhattacharya, and Nicholas D Lane, “BinaryCmd: Keyword spotting with deterministic binary basis,” 2018.
-  Raphael Tang, Weijie Wang, Zhucheng Tu, and Jimmy Lin, “An experimental analysis of the power consumption of convolutional neural networks for keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5479–5483.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang, “Learning efficient convolutional networks through network slimming,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2755–2763.
-  Song Han, Huizi Mao, and William J Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv:1510.00149, 2015.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf, “Pruning filters for efficient convnets,” arXiv:1608.08710, 2016.
-  Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra, “Hello edge: Keyword spotting on microcontrollers,” arXiv:1711.07128, 2017.
-  Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv:1804.03209, 2018.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “Audio augmentation for speech recognition,” in INTERSPEECH-2015, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, 2015, pp. 18–25.
-  Hugh Rawlinson, Nevo Segal, and Jakub Fiala, “Meyda: an audio feature extraction library for the web audio api,” in The 1st Web Audio Conference (WAC), 2015.
-  Robert B. Miller, “Response time in man-computer conversational transactions,” in Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, 1968, AFIPS ’68 (Fall, part I), pp. 267–277.
-  Raphael Tang and Jimmy Lin, “Adaptive pruning of neural language models for mobile devices,” arXiv:1809.10282, 2018.