Most voice-enabled intelligent agents, such as Apple’s Siri and the Amazon Echo, are powered by a combination of two technologies: lightweight keyword spotting (KWS) to detect a few pre-defined phrases within streaming audio (e.g., “Hey Siri”) and full automatic speech recognition (ASR) to transcribe complete user utterances. In this work, we explore a middle ground: techniques for voice query recognition capable of handling a couple of hundred commands.
Why is this an interesting point in the design space? On the one hand, this task is much more challenging than the couple dozen keywords (at most) handled by state-of-the-art KWS systems [1, 2]. Their highly constrained vocabulary limits application to wake-word and simple command recognition. Furthermore, KWS systems only detect whether some audio contains a phrase; they do not produce the exact transcriptions needed for voice query recognition. For example, if “YouTube” were the keyword, a KWS system would make no distinction between the phrases “quit YouTube” and “open YouTube”, which is obviously insufficient since they correspond to different commands. On the other hand, our formulation of voice query recognition was specifically designed to be far more lightweight than full ASR models, which are typically recurrent neural networks that comprise tens of millions of parameters, take weeks to train and fine-tune, and require enormous investment in gathering training data. As a result, full ASR systems typically incur high computational costs at inference time and have large memory footprints.
The context of our work is the Comcast Xfinity X1 entertainment platform, which provides a “voice remote” that accepts spoken queries from users. A user, for example, might initiate a voice query with a button push on the remote and then say “CNN” as an alternative to remembering the exact channel number or flipping through channel guides. Voice queries are a powerful feature, since modern entertainment packages typically have hundreds of channels and remote controls have become too complicated for many users to operate. On average, X1 accepts tens of millions of voice queries per day, totaling 1.7 terabytes of audio, equal to 15,000 spoken hours.
A middle ground between KWS and full ASR is particularly interesting in our application because of the Zipfian distribution of users’ queries. The 200 most popular queries cover a significant portion of monthly voice traffic and account for millions of queries per day. The key contribution of this work is a novel, resource-efficient architecture for streaming voice query recognition on the Comcast X1. We show that existing KWS models are insufficient for this task, and that our models answer queries more than eight times faster than the current full ASR system, with a low false alarm rate (FAR) of 1.0% and query error rate (QER) of 6.0%.
2 Related Work
The typical approach to voice query recognition is to develop a full automatic speech recognition (ASR) system. Open-source toolkits like Kaldi provide ASR models to researchers; however, state-of-the-art commercial systems frequently require thousands of hours of training data and dozens of gigabytes for the combined acoustic and language models. Furthermore, we argue that these systems are excessive for usage scenarios characterized by Zipf’s Law, such as those often encountered in voice query recognition: for example, on the X1, the top 200 queries cover a significant, disproportionate amount of our entire voice traffic. Thus, to reduce the computational requirements associated with training and running a full ASR system, we propose to develop a lightweight model for handling the top-K queries only.
While our task is related to keyword spotting, KWS systems strictly detect only the occurrence of a phrase within audio, not its exact transcription, as in our task. Neural networks with both convolutional and recurrent components have been successfully used in keyword spotting [2, 7]; others use only convolutional neural networks (CNNs) [8, 1] and popular image classification models.
3 Task and Model
Our precise task is to classify an audio clip as one of 201 classes, with 200 labels denoting different voice queries and a single unknown label representing everything else. To improve responsiveness and hence the user experience, we impose the constraint that model inference executes in an on-line, streaming manner, defined as predictions that occur every 100 milliseconds and in constant time and space with respect to the total audio input length. This enables software applications to display on-the-fly transcriptions of real-time speech, which is important for user satisfaction: we immediately begin processing speech input when the user depresses the trigger button on the X1 voice remote.
3.1 Input preprocessing
First, we apply dataset augmentation to reduce generalization error in speech recognition models. In our work, we randomly apply noise, band-pass filtering, and pitch shifting to each audio sample. Specifically, we add a mixture of Gaussian and salt-and-pepper noise; the latter is specifically chosen because the voice remote microphone introduces such artifacts, as we noticed “clicks” while listening to audio samples. For band-pass filtering, we suppress by a factor of 0.5 the frequencies outside a band whose lower and upper endpoints are drawn uniformly at random from fixed low- and high-frequency ranges (in kHz), respectively. For pitch shifting, we apply a random shift of 33 Hz. The augmentation procedure was verified by ear to be reasonable.
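The noise portion of this augmentation can be sketched as follows; the noise standard deviation and click probability below are illustrative assumptions, not values from this work (band-pass filtering and pitch shifting are omitted for brevity):

```python
import numpy as np

def augment(audio: np.ndarray, rng: np.random.Generator,
            gaussian_std: float = 0.005, click_prob: float = 0.001) -> np.ndarray:
    """Add Gaussian noise plus salt-and-pepper 'clicks' to a waveform
    in [-1, 1]. gaussian_std and click_prob are illustrative values."""
    out = audio + rng.normal(0.0, gaussian_std, size=audio.shape)
    # salt-and-pepper: a few samples are replaced by full-scale spikes,
    # mimicking the clicks introduced by the voice remote microphone
    clicks = rng.random(audio.shape) < click_prob
    out = np.where(clicks, rng.choice([-1.0, 1.0], size=audio.shape), out)
    return np.clip(out, -1.0, 1.0)
```

In practice, each training sample would pass through this function (and the other transforms) with freshly drawn randomness on every epoch.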
We then preprocess the dataset from raw audio waveform to forty-dimensional per-channel energy normalized (PCEN) frames, with a window size of 30 milliseconds and a frame shift of 10 milliseconds. PCEN provides robustness to per-channel energy differences between near-field and far-field speech applications, where it has been used to achieve the state of the art in keyword spotting [2, 11]. Conveniently, it handles streaming audio; in our application, the user’s audio is streamed in real-time to our platform. As is standard in speech recognition applications, all audio is recorded at 16 kHz in 16-bit mono-channel PCM format.
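A minimal sketch of PCEN, following the published formulation (Wang et al. [11]); the smoothing and compression constants below are common defaults, not necessarily the ones used here. Because the per-channel smoother M is a one-step IIR filter, frames can be normalized one at a time with constant state, which is what makes PCEN stream-friendly:

```python
import numpy as np

def pcen(frames: np.ndarray, s: float = 0.025, alpha: float = 0.98,
         delta: float = 2.0, r: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Per-channel energy normalization over a (T, 40) energy array.
    M_t = (1 - s) * M_{t-1} + s * E_t is the per-channel smoother;
    PCEN_t = (E_t / (eps + M_t)^alpha + delta)^r - delta^r."""
    out = np.empty_like(frames, dtype=np.float64)
    m = np.zeros(frames.shape[1])
    for t, e in enumerate(frames):
        m = (1.0 - s) * m + s * e
        out[t] = (e / (eps + m) ** alpha + delta) ** r - delta ** r
    return out
```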
3.2 Model Architecture
We draw inspiration from convolutional recurrent neural networks (ConvRNNs) for text modeling, where they have achieved the state of the art in sentence classification. However, the model cannot be applied as-is to our task, since its bi-directional components violate our streaming constraint, and it was originally designed for no more than five output labels. Thus, we use this model as a template only.
We illustrate our architecture in Figure 1. The model is best described as having three sequential components: first, it uses causal convolutions to model short-term speech context; next, it feeds the short-term context into a gated recurrent unit (GRU) layer and pools across time to model long-term context; finally, it feeds the long-term context into a deep neural network (DNN) classifier for our voice query labels.
Short-term context modeling. Given 40-dimensional PCEN inputs x_1, ..., x_T, we first stack the frames to form a 2D input; see Figure 1, label C, where the x-axis represents the 40-dimensional features and the y-axis time. Then, to model short-term context, we use a 2D causal convolution layer (Figure 1, label D) to extract a feature vector c_t at each time step t, where the convolution weight is applied as a valid convolution over the current frame and a fixed number of past frames, with silence padding in the beginning. Finally, we pass the outputs into a rectified linear (ReLU) activation and then a batch normalization layer, as is standard in image classification. Since causal convolutions use a fixed number of past and current inputs only, the streaming constraint is necessarily maintained.
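The streaming property of the causal convolution can be sketched as follows: only the last k input frames (the receptive field) are buffered, so each step runs in constant time and space. The filter shape, frequency stride, and band count below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

class StreamingCausalConv:
    """Streaming 2D causal convolution with ReLU. Keeps a ring buffer of
    the last k frames; the initial zeros act as the silence padding."""

    def __init__(self, weight: np.ndarray, n_freq: int = 40, freq_stride: int = 10):
        self.weight = weight              # shape: (out_ch, k, freq_len)
        self.k = weight.shape[1]
        self.freq_stride = freq_stride
        self.buf = np.zeros((self.k, n_freq))   # O(1) persistent state

    def step(self, frame: np.ndarray) -> np.ndarray:
        # shift in the newest PCEN frame, dropping the oldest
        self.buf = np.roll(self.buf, -1, axis=0)
        self.buf[-1] = frame
        out_ch, _, fl = self.weight.shape
        n_bands = (self.buf.shape[1] - fl) // self.freq_stride + 1
        out = np.empty((out_ch, n_bands))
        for b in range(n_bands):
            lo = b * self.freq_stride
            patch = self.buf[:, lo:lo + fl]   # valid convolution window
            out[:, b] = np.tensordot(self.weight, patch, axes=([1, 2], [0, 1]))
        return np.maximum(out, 0.0)           # ReLU
```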
Long-term context modeling. To model long-term context, we first flatten each short-term context vector c_t into a single vector per time step. Then, we feed these vectors into a single uni-directional GRU layer (Figure 1, label E), yielding hidden outputs h_1, ..., h_T. Following text modeling work, we then use a 1D convolution filter bank with ReLU activation to extract features from the hidden outputs. We max-pool these features across time (see Figure 1, label G) to obtain a fixed-length context vector c_max. Finally, we concatenate c_max and the last hidden output h_T to form the final context vector c, as shown in Figure 1, label H.
Clearly, these operations maintain the streaming constraint, since uni-directional GRUs and max-pooling across time require the storage of only the last hidden and maximum states, respectively. We also experimentally find that the max-pooling operation helps to propagate across time the strongest activations, which may be “forgotten” if only the last hidden output from the GRU were used as the context.
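The constant-space property of max-pooling across time can be sketched as a running elementwise maximum, where only the last maximum state is stored (the feature dimension below is illustrative):

```python
import numpy as np

class StreamingMaxPool:
    """Max-pooling across time with O(1) memory: the state is just the
    elementwise running maximum of all features seen so far."""

    def __init__(self, dim: int):
        self.state = np.full(dim, -np.inf)

    def step(self, features: np.ndarray) -> np.ndarray:
        # fold the new time step into the running maximum
        self.state = np.maximum(self.state, features)
        return self.state
```

This is how a strong activation early in the utterance can survive to the final context vector even if the GRU's last hidden output has "forgotten" it.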
DNN classifier. Finally, we feed the context vector into a small DNN with one hidden layer with ReLU activation, and a softmax output across the voice query labels. For inference on streaming audio, we merely execute the DNN on the final context vector at a desired interval, such as every 100 milliseconds; in our models, we choose the number of hidden units so that the classifier is sufficiently lightweight.
On our specific task, we choose the top 200 queries on the Xfinity X1 platform as our positive classes, altogether covering a significant portion of all voice traffic; this subset corresponds to hundreds of millions of queries to the system per month. For each positive class, we collected 1,500 examples consisting of anonymized real data. For the negative class, we collected a larger set of 670K examples not containing any of the positive keywords. Thus, our dataset contains a total of 970K examples. For the training set, we used the first 80% of each class; for the validation and test sets, we used the next two 10% partitions. Each example was extremely short, only 2.1 seconds on average. All of the transcriptions were created by a state-of-the-art commercial ASR system with a word-error rate (WER) of 5.8% (95% confidence interval) on our dataset; this choice is reasonable because the WER of human annotations is similar, and our deployment approach is to short-circuit and replace the current third-party ASR system where possible.
4.1 Training and Hyperparameters
| # | Type | # Par. | # Mult. | Hyperparameters |
|---|------|--------|---------|-----------------|
| | *Short-term context modeling* | | | |
| | *Long-term context modeling* | | | |
| | *DNN classifier (100 ms interval)* | | | |
For the causal convolution layer, we choose the number of output channels and the filter’s width in time and length in frequency, then stride the entire filter across time and frequency by one and ten steps, respectively. This configuration yields a receptive field of 50 milliseconds across three different frequency bands, which roughly correspond to highs, mids, and lows. For long-term context modeling, we choose 750 hidden dimensions for the GRU, along with a bank of 1D convolution filters. Finally, we choose the hidden layer of the classifier to have 768 units. Table 1 summarizes the footprint and hyperparameters of our architecture; we name this model crnn-750m, with the first “c” representing the causal convolution layer and the trailing “m” max-pooling.
During training, we feed only the final context vector of the entire audio sample into the DNN classifier. For each sample, we obtain a single softmax output across the 201 targets for the cross-entropy loss. The model is then trained using stochastic gradient descent with a momentum of 0.9, a batch size of 48, and weight decay. At epochs 9 and 13, we decrease the learning rate from its initial value, before training finishes for a total of 16 epochs.
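The step schedule above can be sketched as a simple function; the base learning rate and decay factor below are illustrative assumptions, since only the milestone epochs (9 and 13) are given here:

```python
def learning_rate(epoch: int, base_lr: float = 0.1, gamma: float = 0.1) -> float:
    """Step learning-rate schedule with milestones at epochs 9 and 13.
    base_lr and gamma are hypothetical values for illustration."""
    lr = base_lr
    for milestone in (9, 13):
        if epoch >= milestone:
            lr *= gamma   # multiplicative decay at each milestone
    return lr
```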
Model Variants. As a baseline, we adapt the previous state-of-the-art KWS model res8 to our task by increasing the number of outputs in the final softmax to 201 classes. This model requires fixed-length audio, so we pad and trim audio input to a length that is sufficient to cover most of the audio in our dataset. We choose this length to be eight seconds, since 99.9% of queries are shorter.
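The pad-and-trim step for the fixed-length baseline can be sketched as:

```python
import numpy as np

def pad_or_trim(audio: np.ndarray, sr: int = 16000, seconds: float = 8.0) -> np.ndarray:
    """Force audio to exactly `seconds` of samples: trim longer clips,
    zero-pad shorter ones at the end."""
    n = int(sr * seconds)
    if len(audio) >= n:
        return audio[:n]
    return np.concatenate([audio, np.zeros(n - len(audio), dtype=audio.dtype)])
```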
To examine the effect of the causal convolution layer, we train a model without it, feeding the PCEN inputs directly to the GRU layer. We also examine the contribution of max-pooling across time by removing it. We name these variants rnn-750m and crnn-750, respectively.
4.2 Results and Discussion
| Model | FAR (val) | QER (val) | FAR (test) | QER (test) | # Par. | # Mult. |
|-------|-----------|-----------|------------|------------|--------|---------|
The model runs quickly on a commodity GPU machine with one Nvidia GTX 1080: to classify one second of streaming audio, our model takes 68 milliseconds. Clearly, the model is also much more lightweight than a full ASR system, occupying only 19 MB of disk space for the weights and 5 KB of RAM for the persistent state per audio stream. The state consists of the two previous PCEN frames for the causal convolution layer (320 bytes; all zeros for the first two padding frames), the GRU hidden state (3 KB), and the last maximum state for max-pooling across time (1.4 KB).
In our system, we define a false alarm (FA) as a negative misclassification. In other words, a model prediction is counted as an FA if it is misclassified and the prediction is one of the 200 known queries. This is reasonable, since we fall back to the third-party ASR system if the voice query is classified as unknown. We also define a query error (QE) as any misclassified example; then, false alarm rate (FAR) and query error rate (QER) correspond to the number of FAs and QEs, respectively, divided by the number of examples. Thus, the overall query accuracy rate is 1 − QER.
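These two metrics can be computed directly from the definitions above:

```python
def far_qer(predictions, labels, unknown="unknown"):
    """False alarm rate and query error rate. A QE is any misclassified
    example; an FA is a misclassification whose *prediction* is a known
    query (predicting unknown falls back to the full ASR system)."""
    fa = qe = 0
    for pred, gold in zip(predictions, labels):
        if pred != gold:
            qe += 1
            if pred != unknown:
                fa += 1
    n = len(labels)
    return fa / n, qe / n
```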
Initially, the best model, crnn-750m, attains an FAR higher than our production target of 1%; thus, we further threshold the predictions to adjust the specificity of the model. As in our previous work, a simple approach is to classify as unknown all predictions whose probability outputs fall below some global threshold α. In Table 2, we report the results corresponding to our target FAR of 1%, with α determined from the validation set. To draw ROC curves (see Figure 2) on the test set, we sweep α from 0 to 0.9999; here, QER is analogous to the false reject rate (FRR) in the classic keyword spotting literature. We omit res8 due to it having a QER of 29%, which is unusable in practice.
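The global-threshold rejection rule can be sketched as follows:

```python
import numpy as np

def classify(probs: np.ndarray, labels, alpha: float, unknown="unknown"):
    """If the top softmax probability falls below the global threshold
    alpha, override the prediction to unknown (falling back to full ASR)."""
    i = int(np.argmax(probs))
    return labels[i] if probs[i] >= alpha else unknown
```

Sweeping alpha trades QER for FAR: a higher threshold rejects more borderline predictions, lowering false alarms at the cost of more fallbacks to the third-party ASR system.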
After thresholding, our best model with max-pooling and causal convolutions (crnn-750m) achieves an FAR of 1% and a QER of 6% on both the validation and test sets, as shown in Table 2, row 2. Max-pooling across time is effective, resulting in a QER improvement of 0.5% over the ablated model (crnn-750; see row 3). The causal convolution layer is effective as well, though slightly less so than max-pooling: for the same QER (6.4%) on the validation set, the model without the causal convolution layer, rnn-750m, uses 87M fewer multiplies per second than crnn-750 does (see row 4), due to the large decrease in the number of parameters for the GRU, whose input has size 40 in rnn-750m, compared to 750 in crnn-750. We have similar findings for the ROC curves (see Figure 2), where crnn-750m outperforms crnn-750 and rnn-750m, and the two ablated models yield similar curves. All of these models greatly outperform res8, which was originally designed for keyword spotting.
5 Conclusion and Future Work
We describe a novel resource-efficient model for the task of voice query recognition on streaming audio, achieving an FAR of 1.0% and a QER of 6.0% while performing more than eight times faster than the current third-party ASR system. One potential extension of this work is to explore the application of neural network compression techniques, such as intrinsic sparse structures and binary quantization, which could further decrease the footprint of our model.
-  Raphael Tang and Jimmy Lin, “Deep residual learning for small-footprint keyword spotting,” in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
-  Sercan O Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv:1703.05390, 2017.
-  Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Katya Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv:1712.01769, 2017.
-  William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
-  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The Kaldi speech recognition toolkit,” 2011.
-  Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “The Microsoft 2016 conversational speech recognition system,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
-  Dong Wang, Shaohe Lv, Xiaodong Wang, and Xinye Lin, “Gated convolutional LSTM for speech commands recognition,” in ICCS 2018, 2018.
-  Tara N Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH-2015, 2015.
-  Brian McMahan and Delip Rao, “Listening to the world improves speech command recognition,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “Audio augmentation for speech recognition,” in INTERSPEECH-2015, 2015.
-  Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous, “Trainable frontend for robust and far-field keyword spotting,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
-  Chenglong Wang, Feijun Jiang, and Hongxia Yang, “A hybrid framework for text modeling with convolutional RNN,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
-  Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
-  Andreas Stolcke and Jasha Droppo, “Comparing human and machine errors in conversational speech transcription,” arXiv:1708.08615, 2017.
-  Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li, “Learning intrinsic sparse structures within long short-term memory,” in International Conference on Learning Representations, 2018.
-  Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv:1602.02830, 2016.