1 Introduction and Problem Description
At Voicera, we are building Eva (Enterprise Voice Assistant) to collaborate with meeting participants using voice Tawakol (2017). We believe interactions with Eva should feel as natural as interactions with any other participant in the meeting, so we have designed Eva to recognize and respond to voice commands. Eva continuously listens to the conversation during a meeting and verbally acknowledges the wake word "Okay Eva." For this interaction to feel natural, Eva’s keyword spotting (KWS) and audible responses need to be real-time.
Eva continuously predicts whether or not the keywords of interest were uttered in a real-time audio stream. Unsolicited interruptions would make Eva vexing to users, so the false positive rate should be less than 1 per hour. On the other hand, users may abandon the service if Eva doesn’t respond when addressed, so we strive to maximize recall while maintaining a low false positive rate.
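To give a sense of how strict this target is, the following sketch converts an hourly false positive budget into a per-decision rate. It assumes one decision every 10ms (the trigger interval of our KWS system) and treats each decision as an independent chance of a false positive, which is a simplification; the function names are illustrative only.

```python
# Assumption: the KWS system makes one decision every 10 ms.
STRIDE_MS = 10
MS_PER_HOUR = 60 * 60 * 1000


def decisions_per_hour(stride_ms: int = STRIDE_MS) -> int:
    """Number of keyword/non-keyword decisions made in one hour of audio."""
    return MS_PER_HOUR // stride_ms


def max_per_decision_fp_rate(max_fp_per_hour: float,
                             stride_ms: int = STRIDE_MS) -> float:
    """Largest per-decision false positive rate that stays within the hourly budget."""
    return max_fp_per_hour / decisions_per_hour(stride_ms)


if __name__ == "__main__":
    print(decisions_per_hour())           # 360000
    print(max_per_decision_fp_rate(1.0))  # roughly 2.8e-06
```

Under these assumptions, fewer than 1 false positive per hour corresponds to fewer than about 3 false triggers per million decisions.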
In addition to the challenges that other KWS systems face while working with real-time speech, Eva’s KWS system, in order to be ubiquitous, needs to support speech signals carried over the public switched telephone network, which typically uses G.711: a narrow-band (NB) audio codec that operates at a low bit-rate and at a sample rate of 8kHz Gallardo (2015). Both human and automatic speech recognition lose significant accuracy when listening to NB audio Gallardo (2015); Voran (1997); Möller et al. (2014). Moreover, Eva’s KWS system needs to adapt to new microphones, environment settings, speakers, and noise profiles whose characteristics vary drastically, which makes the input signal to the KWS system non-IID.
2 Prior Art
Voice-enabled AI assistants (like Apple’s Siri, Amazon’s Alexa, and Google Assistant) perform similar KWS tasks to enable users to interact with their devices. Google proposed a model that uses Convolutional Neural Networks for performing a KWS task to detect 14 different phrases Sainath and Parada (2015); this architecture showed a 27-44% relative improvement in false positive rate compared to its predecessor that used a Deep Neural Network (DNN) Chen et al. (2014), which in turn showed a 45% relative improvement in confidence score compared to KWS using Hidden Markov Models (HMM) Rohlicek et al. (1989). In Team (2017), Apple uses a DNN to predict scores for 20 sound classes every 10ms, combining these scores in an HMM-like graph to calculate a composite score indicating that "Hey Siri" was uttered. Similar to our approach, which we detail below, the authors used a cascade of two classifiers to conserve power. Given the control Apple, Amazon, and Google have over their hardware specifications, they can acquire wide-band audio, which significantly improves the accuracy of speech recognition systems.
Our approach to the KWS problem is shown in Figure 1. The KWS system is triggered every 10ms. The audio signal from the past 500ms is captured for processing, since we found that the median command duration is 500ms. In this paper, we often call these 500ms windows "examples". Each of our six classifiers (three cascade stages, each composed of two DNNs trained on different feature representations) is a fully connected DNN with two 128-input hidden layers.
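The windowing described above can be sketched as follows. This is a minimal illustration, assuming 8kHz audio held as a plain sequence of samples; `sliding_windows` and `ms_to_samples` are hypothetical helper names, not part of our system.

```python
SAMPLE_RATE_HZ = 8000  # narrow-band audio
WINDOW_MS = 500        # length of one "example"
HOP_MS = 10            # the KWS system is triggered every 10 ms


def ms_to_samples(ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return sample_rate_hz * ms // 1000


def sliding_windows(samples, window_ms=WINDOW_MS, hop_ms=HOP_MS,
                    sample_rate_hz=SAMPLE_RATE_HZ):
    """Yield the most recent `window_ms` of audio at every `hop_ms` step."""
    win = ms_to_samples(window_ms, sample_rate_hz)
    hop = ms_to_samples(hop_ms, sample_rate_hz)
    for end in range(win, len(samples) + 1, hop):
        yield samples[end - win:end]


if __name__ == "__main__":
    one_second = [0] * SAMPLE_RATE_HZ
    windows = list(sliding_windows(one_second))
    # 4000 samples per window, 80 samples per hop, 51 windows in one second
    print(ms_to_samples(WINDOW_MS), ms_to_samples(HOP_MS), len(windows))
```

Consecutive windows overlap by 490ms, so the same stretch of audio is scored many times at slightly different alignments.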
We initially collected 19k user examples from 200 individuals using a crowdsourcing platform. The collection covered a wide variety of speakers living in the continental United States. These examples act as positive examples in our experiments. Negative examples for each cascaded classifier were generated from a repository of audio samples from a variety of meeting recordings that do not contain the keyword. All of the audio examples were either acquired as or converted to NB, 8kHz audio.
80% of the data were designated for training; the remaining 20% were used for evaluation. The data were stratified over individual speakers: each speaker belongs to either the training set or the evaluation set, never both.
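A speaker-disjoint split of this kind can be sketched as below. The function name, the `(speaker_id, example)` pair layout, and the fixed seed are assumptions made for illustration; note also that splitting 80/20 over speakers only approximates an 80/20 split over examples when speakers contribute unequal numbers of recordings.

```python
import random


def split_by_speaker(examples, eval_fraction=0.2, seed=0):
    """Split (speaker_id, example) pairs so that no speaker is in both sets."""
    speakers = sorted({speaker for speaker, _ in examples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(speakers)
    n_eval = max(1, round(len(speakers) * eval_fraction))
    eval_speakers = set(speakers[:n_eval])
    train = [(s, x) for s, x in examples if s not in eval_speakers]
    evaluation = [(s, x) for s, x in examples if s in eval_speakers]
    return train, evaluation


if __name__ == "__main__":
    # Toy data: 10 hypothetical speakers with 5 recordings each.
    data = [(f"spk{i}", f"clip{i}_{j}") for i in range(10) for j in range(5)]
    train, evaluation = split_by_speaker(data)
    train_speakers = {s for s, _ in train}
    eval_speakers = {s for s, _ in evaluation}
    print(len(train_speakers), len(eval_speakers),
          train_speakers & eval_speakers)
```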
3.2 Multi-Representation Feature Extraction
It is challenging to sample negative examples adequately. As such, discriminative classifiers suffer from the so-called "Novelty Detection" problem: a sound not encountered during the training process can be misclassified as a keyword — resulting in a false positive. This happens because the decision boundary learned by the discriminative classifier is undefined in the areas that were not sampled during training.
To overcome this, we train two classifiers on two different representations, Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Prediction (PLP) features; both are commonly used in speech recognition Young et al. (2002). These features are extracted on the same audio input for each stage of the cascade. The two classifiers are ensembled using model-averaging to compute the final probability of an audio window being a keyword. Because the classifiers were trained on different representations of the same data, they concur on areas of the distribution that they were trained on and behave randomly otherwise, so a confident false alarm from one classifier is unlikely to be matched by the other. This results in a lower rate of false positives. In our system, both feature types were extracted from 30ms frames with a stride of 10ms and concatenated for each 500ms example.
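The framing arithmetic and the averaging step can be sketched as follows. This is a minimal illustration that assumes frames are not padded at the window edges and that the two classifiers are averaged with equal weights; both are assumptions, as are the function names and the example probabilities.

```python
def frames_per_example(window_ms: int = 500, frame_ms: int = 30,
                       stride_ms: int = 10) -> int:
    """Number of full frames that fit in one example, assuming no padding."""
    return (window_ms - frame_ms) // stride_ms + 1


def ensemble_probability(p_mfcc: float, p_plp: float) -> float:
    """Equal-weight model averaging of the two classifier outputs."""
    return 0.5 * (p_mfcc + p_plp)


if __name__ == "__main__":
    print(frames_per_example())  # 48

    # On a familiar keyword both classifiers agree, so the average stays high.
    agree = ensemble_probability(0.95, 0.91)
    # On a novel sound the classifiers tend to disagree, pulling the average down.
    disagree = ensemble_probability(0.9, 0.1)
    print(agree > disagree)  # True
```

With these settings each 500ms example yields 48 frames per feature type, and an averaged score crosses a high threshold only when both classifiers are confident.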
Cascaded classifiers are commonly used to deal with highly asymmetric classification problems. They have been popularized through their use in many practical machine learning solutions; the Viola-Jones face detector is one notable example Viola and Jones (2004).
Cascading is an instance of ensemble learning based on the concatenation of several classifiers of increasing complexity. Each classifier is trained on the examples that are not filtered out by the previous classifiers. A threshold is established on the output of each classifier such that a portion of the negative (non-keyword) examples it sees are correctly rejected. The remaining examples are either positive (keyword) examples or hard negative examples.
To train the next classifier of the cascade, we run the previous classifiers of the cascade on a large repository of audio guaranteed not to contain the target keyword and use the discovered false positives as hard negative examples. This allows subsequent classifiers in the cascade to focus on negative examples that previous classifiers confuse with the positive examples. The ratio of positive to negative data becomes more balanced for subsequent classifiers. One drawback of this technique is that extracting hard examples for training subsequent classifiers takes longer as the cascade gets better, because false positives become rarer.
In our experiments below, the first cascade classifier is trained with a ratio of 100 negative examples for every positive example. The second and third cascade classifiers maintain a ratio of 2 negative examples for every positive example. At inference time, cascaded classifiers provide a way to terminate KWS computations early on non-keyword windows.
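The early-termination and hard-negative-mining logic can be sketched as below. This is an illustrative sketch, not our implementation: each stage is modeled as a plain function returning a keyword probability, and the names `passes_cascade` and `mine_hard_negatives`, the toy stages, and the thresholds are all hypothetical.

```python
from typing import Callable, List, Sequence

Stage = Callable[[float], float]  # maps a window to a keyword probability


def passes_cascade(window, stages: Sequence[Stage],
                   thresholds: Sequence[float]) -> bool:
    """Run stages in order and stop at the first one that rejects the window."""
    for stage, threshold in zip(stages, thresholds):
        if stage(window) < threshold:
            return False  # early termination: later stages are never evaluated
    return True


def mine_hard_negatives(keyword_free_windows, stages: Sequence[Stage],
                        thresholds: Sequence[float]) -> List:
    """Collect keyword-free windows that the existing stages wrongly accept."""
    return [w for w in keyword_free_windows
            if passes_cascade(w, stages, thresholds)]


if __name__ == "__main__":
    # Toy stand-ins: a "window" is a single number, stages are simple functions.
    stage1 = lambda w: w
    stage2 = lambda w: w * w
    negatives = [0.1, 0.6, 0.9]

    # False positives of stage 1 become training negatives for stage 2.
    print(mine_hard_negatives(negatives, [stage1], [0.3]))  # [0.6, 0.9]
    # With both stages in place, only the hardest negative still gets through.
    print(mine_hard_negatives(negatives, [stage1, stage2], [0.3, 0.5]))  # [0.9]
```

Because most windows in a meeting are rejected by the first stage, the later stages run only on a small fraction of the audio.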
3.4 Multiple-Instance Learning
The final stage of the pipeline, in Figure 1(iii), aggregates the current and past outputs of the cascaded classifiers to make a final decision about whether the targeted keyword was triggered. The KWS problem, as we modeled it, is innately a Multiple-Instance Learning (MIL) Zhang et al. (2006) problem: the keyword is somewhere in the audio signal, but its location is imprecisely defined and its duration may vary drastically, since humans pronounce the same word in a wide range of ways. We found that the "Okay Eva" utterance can vary from 300ms to 900ms.
In MIL, training examples are not singletons; they come in “bags” such that all of the examples in a bag share a label Zhang et al. (2006); Dietterich et al. (1997). A positive bag of windows means that at least one window in the bag is positive, while a negative bag means that all windows in the bag are negative. The learner must therefore simultaneously determine which examples in the positive bags are positive and fit the parameters of the classifier. In our case, a bag is a group of 500ms windows strided by 10ms.
As such, we have designed our learning process to learn simultaneously on all windows encompassed in a particular positive or negative bag of windows. The outputs corresponding to all of the windows in the bag are aggregated by a "Noisy Or."
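In its standard form, the Noisy-Or of window probabilities p_1, ..., p_n is 1 - (1 - p_1)(1 - p_2)...(1 - p_n): the bag is positive unless every window is negative. The sketch below implements that formula; the bag-level log loss shown alongside it is an assumption added for illustration, not a description of our training objective.

```python
import math


def noisy_or(window_probs) -> float:
    """Probability that at least one window in the bag is a keyword."""
    all_negative = 1.0
    for p in window_probs:
        all_negative *= (1.0 - p)
    return 1.0 - all_negative


def bag_log_loss(window_probs, bag_label: int, eps: float = 1e-7) -> float:
    """Illustrative cross-entropy on the bag probability (assumed objective)."""
    p = min(max(noisy_or(window_probs), eps), 1.0 - eps)
    return -math.log(p) if bag_label == 1 else -math.log(1.0 - p)


if __name__ == "__main__":
    print(noisy_or([0.5, 0.5]))       # 0.75
    print(noisy_or([0.0, 0.0, 0.0]))  # 0.0
    # One confident window is enough to make the whole bag positive.
    print(noisy_or([0.01, 0.02, 0.99]) > 0.99)  # True
```

A positive bag is therefore satisfied by a single well-aligned window, while a negative bag penalizes every window that scores high.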
4 Experiments and Results
We evaluated detection of the phrase "Okay Eva" using the hourly false positive rate plotted against the false negative rate.
The ROC characteristics of a 3-stage cascade with only one feature representation (MFCCs or PLPs individually) compared to a 3-stage cascade with both representations are shown in Figure 3. The multi-representation scheme improves the trade-off between the False Positive and False Reject rates. For example, at a False Reject rate of around 7% (0.07), MFCCs have an hourly False Positive rate of 1.2; PLPs have an hourly False Positive rate of 1.0; the multi-representation has an hourly False Positive rate of 0.55.
We trained one-, two-, and three-stage cascaded classifiers for comparison. Because the first classifier is trained on mostly negative examples, the threshold on the output probability of the first classifier in each cascade was computed to guarantee that most of the positive examples would not be filtered out. The ROC characteristics of each of the stages of the cascaded classifiers are shown in Figure 3. The ROC characteristics improve significantly as more classifiers are cascaded.
We demonstrate significant gains in KWS in narrow-band audio while minimizing computational resource usage. By incorporating multiple feature representations and three cascaded classifiers, we reduce our false positives per hour at 5% false negative rate from 8 to 1.2, a reduction of 85%. Although they are not directly comparable due to differences in datasets, our system performs better on narrow-band 8kHz audio than Google’s DNN system Sainath and Parada (2015) performs on wide-band 16kHz audio.
- Tawakol (2017) O. Tawakol, “Introducing eva by voicera,” Feb 2017, (Accessed 29-Oct-2017). [Online]. Available: https://www.voicera.com/introducing-eva-by-workfit/
- Gallardo (2015) L. Gallardo, Human and Automatic Speaker Recognition over Telecommunication Channels, ser. T-Labs Series in Telecommunication Services. Springer Singapore, 2015.
- Voran (1997) S. Voran, “Listener ratings of speech passbands,” in 1997 IEEE Workshop on Speech Coding for Telecommunications Proceedings. Back to Basics: Attacking Fundamental Problems in Speech Coding, Sep 1997, pp. 81–82.
- Möller et al. (2014) S. Möller, F. Köster, L. F. Gallardo, and M. Wagner, “Comparison of transmission quality dimensions of narrowband, wideband, and super-wideband speech channels,” in 2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS), Dec 2014, pp. 1–6.
- Sainath and Parada (2015) T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015. ISCA, 2015, pp. 1478–1482.
- Chen et al. (2014) G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
- Rohlicek et al. (1989) J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, “Continuous hidden markov modeling for speaker-independent word spotting,” in International Conference on Acoustics, Speech, and Signal Processing, May 1989, pp. 627–630 vol.1.
- Team (2017) S. Team, “Hey siri: An on-device dnn-powered voice trigger for apple’s personal assistant - apple,” Oct 2017, (Accessed 29-Oct-2017). [Online]. Available: https://machinelearning.apple.com/2017/10/01/hey-siri.html
- Young et al. (2002) S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The htk book,” Cambridge university engineering department, vol. 3, 2002.
- Viola and Jones (2004) P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
- Zhang et al. (2006) C. Zhang, J. C. Platt, and P. A. Viola, “Multiple instance boosting for object detection,” in Advances in neural information processing systems, 2006, pp. 1417–1424.
- Dietterich et al. (1997) T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial intelligence, vol. 89, no. 1, pp. 31–71, 1997.