In most voice assistive technologies, keyword spotting (a.k.a. wake word detection [kumatani2017direct]) is a common way to initiate the human-machine conversation (e.g. “OK Google”, “Alexa”, or “Hey Siri”). In recent years, keyword spotting techniques have evolved with many exciting advances, for example, using deep neural networks [chen2014small] or end-to-end models [alvarez2019end, shan2018attention].
However, most modern keyword spotting models are based on a single or a few predefined phrases, often assuming the keyword is covered by a fixed-length window of audio. Supporting a new phrase usually requires re-training the entire system, which can be resource- and time-consuming.
In many scenarios, users would largely prefer a more seamless and natural interaction with the voice assistant without having to say a predefined keyword, especially for simple commands such as “Turn on the lights”. However, these interactions pose new challenges for conventional keyword spotting systems. In particular:
The system must be able to detect a large corpus of keyphrases.
The keyphrases may have variable length, from a single word (e.g. “Stop”) to longer sentences (e.g. “What is the weather tomorrow?”). The audio duration of the keyphrases could also vary depending on the speaker.
The set of recognized keyphrases should be easily customizable without training and deploying new models.
Instead of using a dedicated keyphrase detection model, we explore the possibility of using a generic ASR model that allows user-defined keyphrases, thereby providing greater flexibility to the users. A similar system was previously described in [he2017streaming], where a Recurrent Neural Network Transducer (RNN-T) was trained to predict either phonemes or graphemes as subword units, thus allowing the detection of arbitrary keyphrases. However, a distinct challenge of keyphrase detection that was not addressed in [he2017streaming] is discriminating between the spoken keyphrases and noise in the background. This is especially difficult if the ambient noise includes speech that contains similar keyphrases. For example, a speaker on TV saying “turn off the lights” could easily cause a false trigger.
Recognizing speech in a noisy, multi-talker environment, or the cocktail-party problem, is an active area of research [simpsonCocktailParty, nachmani2020voice]. The human brain has the remarkable ability to identify and separate one person’s voice from another [mcdermott2009cocktail], especially if the speaker is familiar. One way the brain solves the cocktail-party problem is by using top-down attention to identify vocal features from a known speaker, while filtering out other irrelevant ambient sounds [pressnitzer2008perceptual]. In this paper, we represent vocal features of the enrolled speaker with neural network embeddings [wan2018generalized], and use this information to suppress background speech from unknown speakers [Wang2020] in the feature frontend of the speaker verification model.
In addition, on devices where we have multiple microphones separated by a small distance (e.g. smart home speakers), an adaptive noise cancellation algorithm can further enhance the speech signals by suppressing background noise.
The original contributions of this paper include: (1) We adopt the state-of-the-art RNN-T model proposed in [sainath2020e2e] and apply pruning [zhu2018prune] so that it can run continuously on device with significantly reduced CPU usage; (2) We combine the RNN-T based ASR model with speaker verification and speaker separation models to achieve low false trigger and false rejection rates under various noise conditions; (3) We propose Speech Cleaner, an adaptive noise cancellation algorithm that generalizes Hotword Cleaner [huang2019hotword] for generic speech recognition.
The rest of this paper is organized as follows. In Section 2.1, we provide an overview of our keyphrase detection system, followed by detailed descriptions of the ASR model in Section 2.2, speaker verification model in Section 2.3, speaker separation model in Section 2.4, and adaptive noise cancellation in Section 2.5. Two groups of experiments are presented in Section 3. In Section 3.1, we demonstrate that a VoiceFilter-Lite model can largely reduce the Equal Error Rate (EER) of a standard text-independent speaker verification system under multi-talker scenarios. In Section 3.2, we provide end-to-end evaluations of our keyphrase detection system under various noise conditions. Conclusions are drawn in Section 4.
2.1 System overview
A diagram of the proposed keyphrase detection system is provided in Fig. 1.
2.1.1 Feature frontend
A shared feature frontend is used by all speech models in the system. This frontend first applies automatic gain control [prabhavalkar2015automatic] to the input audio, then extracts 32ms-long Hanning-windowed frames with a step of 10ms. For each frame, 128-dimensional log Mel-filterbank energies are computed in the range between 125Hz and 7500Hz. These filterbank energies are then stacked by 4 frames and subsampled by 3 frames, resulting in 512-dimensional final features with a frame period of 30ms.
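The stacking and subsampling step can be illustrated with a minimal NumPy sketch. It assumes that each output frame concatenates the current frame with the next three; the text does not say whether past or future frames are stacked, and the function name `stack_and_subsample` is hypothetical.

```python
import numpy as np

def stack_and_subsample(frames: np.ndarray, stack: int = 4, step: int = 3) -> np.ndarray:
    """Stack `stack` consecutive frames and keep every `step`-th stacked frame.

    frames: array of shape (num_frames, num_mel), one row per 10 ms hop.
    Returns an array of shape (num_out, stack * num_mel).
    """
    num_frames, num_mel = frames.shape
    if num_frames < stack:
        return np.zeros((0, stack * num_mel), dtype=frames.dtype)
    # Start indices of the stacked windows, spaced `step` frames apart.
    starts = np.arange(0, num_frames - stack + 1, step)
    # Each output row concatenates frames[s], ..., frames[s + stack - 1].
    return np.stack([frames[s:s + stack].reshape(-1) for s in starts])

# Random values stand in for 100 frames (about one second) of 128 log-Mel energies.
log_mel = np.random.default_rng(0).standard_normal((100, 128))
features = stack_and_subsample(log_mel)
print(features.shape)  # (33, 512): 4 x 128 dimensions, one frame every 3 x 10 ms
```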
The d-vector is an embedding vector that represents the voice characteristics of the enrolled user. It is obtained by prompting the user to follow an offline voice enrollment process [enrollmentblog, wang2020version]. At runtime, the d-vector is used in two ways: (1) It is used as a side input to the speaker separation model [Wang2020] to remove feature components not from the target speaker; (2) It represents the enrolled speaker in the speaker verification model.
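As an illustration of how an enrollment d-vector might be formed from several utterances, the sketch below averages L2-normalized per-utterance embeddings and renormalizes the result. This averaging rule is an assumption for illustration; the text does not state how enrollment utterances are combined, and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(v) -> np.ndarray:
    """Scale a vector to unit Euclidean length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def enroll(utterance_embeddings) -> np.ndarray:
    """Combine per-utterance embeddings into one enrollment d-vector (assumed rule)."""
    normalized = np.stack([l2_normalize(e) for e in utterance_embeddings])
    return l2_normalize(normalized.mean(axis=0))

# Random vectors stand in for the 256-dimensional model outputs of four utterances.
rng = np.random.default_rng(0)
enrollment_embeddings = [rng.standard_normal(256) for _ in range(4)]
d_vector = enroll(enrollment_embeddings)
print(d_vector.shape)  # (256,)
```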
2.1.3 Keyphrase detection
The keyphrase detection system triggers only when both of the following conditions are met:
The text-independent speaker verification system successfully verifies the audio against the target enrolled user.
The recognized text from the speech recognition model matches one of the predefined keyphrases.
Given this, there are two main sources of errors: (1) False accepts, where either a phrase other than the keyphrase or a keyphrase spoken by an unknown speaker (for example, in the background) triggers the detection system. (2) False rejects, where either the keyphrase was not recognized correctly by the ASR model, or the target user was mis-identified by the speaker verification system.
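The two-condition trigger rule can be sketched as follows. Cosine-similarity scoring, the threshold value of 0.7, the text normalization, and all function names are assumptions made for illustration; the text only specifies that both the speaker check and the phrase match must succeed.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def should_trigger(recognized_text, utterance_dvector, enrolled_dvector,
                   keyphrases, sv_threshold=0.7) -> bool:
    # Condition 1: the utterance embedding is close enough to the enrolled d-vector.
    speaker_ok = cosine_similarity(utterance_dvector, enrolled_dvector) >= sv_threshold
    # Condition 2: the ASR transcript matches one of the user-defined keyphrases.
    phrase_ok = normalize_text(recognized_text) in {normalize_text(k) for k in keyphrases}
    return speaker_ok and phrase_ok

keyphrases = ["Stop", "Turn on the lights"]
enrolled = [1.0, 0.0, 0.0]
print(should_trigger("turn on the lights.", [0.9, 0.1, 0.0], enrolled, keyphrases))  # True
print(should_trigger("turn on the lights.", [0.0, 1.0, 0.0], enrolled, keyphrases))  # False
```

Under this rule, a false accept requires both checks to pass wrongly or a background speaker to pass verification, while a false reject occurs as soon as either check fails.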
2.2 Speech recognition
The speech recognition model is an end-to-end RNN Transducer (RNN-T) model [graves2012sequence] with an architecture similar to that proposed in [he2019streaming, sainath2020e2e]. The target output vocabulary consists of 4096 word-pieces. The encoder network has 8 CIFG-LSTM layers [greff2016lstm] and the prediction network has 2 CIFG-LSTM layers. Each CIFG-LSTM layer has 2048 hidden units followed by a projection size of 640 units. The joint network has 640 hidden units and a softmax layer with 4096 units. Since the speech recognition model needs to run continuously on device, we shrink the model by applying 60% sparsity [zhu2018prune] to each CIFG-LSTM layer in order to reduce the CPU usage, and consequently prolong the life of the device. The total model size is 42MB after sparsification and quantization [shangguan2019optimizing]. The model is trained on 400K hours of multi-domain data including YouTube, voice search, farfield and telephony speech [narayanan2019longform]. We also add domain-ID to the model input during model training and inference, which improves the speech recognition quality in the target domain [sainath2020e2e].
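To make the 60% sparsity figure concrete, the sketch below zeroes the 60% smallest-magnitude entries of a weight matrix. It is a one-shot illustration under assumed settings: the function name `magnitude_prune` and the toy matrix size are hypothetical, and the pruning procedure actually used during training is the one described in [zhu2018prune], which this snippet does not reproduce.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.6) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries set to zero."""
    num_prune = int(round(sparsity * weights.size))
    if num_prune == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    # Indices of the `num_prune` entries with the smallest absolute value.
    prune_idx = np.argpartition(flat, num_prune - 1)[:num_prune]
    mask = np.ones(weights.size, dtype=bool)
    mask[prune_idx] = False
    return weights * mask.reshape(weights.shape)

# Toy weight matrix; real CIFG-LSTM layers are far larger.
weights = np.random.default_rng(0).standard_normal((64, 128))
pruned = magnitude_prune(weights, sparsity=0.6)
print(round(float(np.mean(pruned == 0)), 3))  # fraction of zeroed weights, close to 0.6
```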
In this work, we focus on home automation applications in the evaluation. We therefore combine the voice search and farfield domains with a shared domain-ID during training, and use this ID during inference. However, since the target keyphrases tested in our work are common voice command queries, such as “Stop” or “Turn on the light”, they appear frequently in the target domain training data. This in turn causes the ASR to have an implicit bias towards hypothesizing these keyphrases during inference.
2.3 Speaker verification
Many keyword spotting systems are shipped together with a speaker verification (SV) model. The speaker verification model not only enables features such as personalized queries [multiuser] (e.g. “What’s on my calendar?”), but also largely reduces the false accept rate of the keyword spotting system.
Since conventional keyword spotting systems only support a single or a few keywords (e.g. “OK Google” and “Hey Google”), the speaker verification model shipped with them is also usually text-dependent. However, for a personalized keyphrase detection system that needs to support a theoretically unlimited number of keyphrases, a text-independent speaker verification model must be used.
In this work, we use a text-independent model trained with the generalized end-to-end loss [wan2018generalized]. Most of our training data are from a vendor-collected multi-language speech query dataset covering 37 locales. We also added public datasets including LibriVox, VoxCeleb [nagrani2017voxceleb, chung2018voxceleb2], CN-Celeb [fan2020cn], TIMIT [garofolo1993darpa], VCTK [yamagishi2019cstr], Spoken Wikipedia Corpora [baumann2019spoken] and BookTubeSpeech [pham2020toward] to the training data for domain robustness. Multi-style training (MTR) [lippmann1987multi, ko2017study, kim2017generation] is applied during the training process for noise robustness. The speaker verification model has 3 LSTM layers, each with 768 nodes and a projection size of 256. The output of the last LSTM layer is then linearly transformed to the final 256-dimension d-vector.
2.4 Speaker separation
Since the ASR model is implicitly biased towards the keyphrases via domain-ID, we found that even under noisy background conditions, the false rejection rate of the keyphrase detection is still low. In contrast, speaker verification systems are vulnerable to overlapping speech. For example, when the target user and an interfering speaker speak at the same time, the speaker verification system might reject the utterance, as the d-vector computed from overlapping speech would be very different from the d-vector derived from the target user's speech alone.
Since speaker verification is critical to reducing false triggering, it is important to address the challenge of accurate speaker verification in multi-talker conditions. In this work, we use the VoiceFilter-Lite model [Wang2020] to enhance the input features from the enrolled speaker to the speaker verification model while masking out background speech.
Unlike other speech enhancement or separation models [hershey2016deep, kolbaek2017multitalker, Rao2019, Wang2019], VoiceFilter-Lite has these benefits: (1) It directly enhances filterbank energies instead of the audio waveform, which largely reduces the number of runtime operations; (2) It supports streaming inference with low latency; (3) It uses an adaptive suppression strength, such that it is only effective on overlapping speech, avoiding unnecessary over-suppression; (4) It is optimized for on-device applications [shangguan2019optimizing]. For more details on the training data and model topology of VoiceFilter-Lite, please refer to [Wang2020].
2.5 Adaptive noise cancellation
Many devices, such as smart speakers and mobile phones, have more than one microphone. On these devices, an adaptive noise-cancellation (ANC) algorithm [widrow1975adaptive] can be used to learn a filter that suppresses noise based on the correlation of the audio signals at multiple microphones during noise-only segments. Such an algorithm was proposed in [huang2019hotword] for noise-robust keyword spotting.
For our personalized keyphrase detection system, we use a module code-named Speech Cleaner. Unlike Hotword Cleaner [huang2019hotword], where the adaptive filter coefficients are estimated using a FIFO buffer, in Speech Cleaner the adaptive filter coefficients are determined from a three-second-long period of non-speech audio that precedes the speech signal. These coefficients are then kept frozen in order to suppress noise during the epoch containing speech.
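The estimate-then-freeze idea can be sketched with a toy two-microphone example. Everything specific here is an assumption for illustration: a time-domain FIR filter fitted by least squares, 8 taps, a simulated acoustic path, and a sine wave standing in for speech that reaches only the primary microphone. The actual Speech Cleaner formulation is not given in the text, and the function names are hypothetical.

```python
import numpy as np

def delayed_taps(reference: np.ndarray, num_taps: int) -> np.ndarray:
    """Matrix whose column k holds the reference signal delayed by k samples."""
    n = len(reference)
    cols = [np.concatenate([np.zeros(k), reference[:n - k]]) for k in range(num_taps)]
    return np.stack(cols, axis=1)

def estimate_filter(primary_noise, reference_noise, num_taps=8) -> np.ndarray:
    """Fit FIR coefficients that predict the primary-mic noise from the reference mic."""
    taps = delayed_taps(reference_noise, num_taps)
    coeffs, *_ = np.linalg.lstsq(taps, primary_noise, rcond=None)
    return coeffs

def apply_frozen_filter(primary, reference, coeffs) -> np.ndarray:
    """Subtract the predicted noise using coefficients that are no longer updated."""
    return primary - delayed_taps(reference, len(coeffs)) @ coeffs

rng = np.random.default_rng(0)
sample_rate = 16000
noise = rng.standard_normal(4 * sample_rate)      # 3 s noise-only, then 1 s with speech
reference = noise                                 # reference microphone
primary = np.convolve(noise, [0.6, 0.3, 0.1])[:len(noise)]  # noise via a short toy path
t = np.arange(sample_rate) / sample_rate
speech = np.zeros_like(noise)
speech[3 * sample_rate:] = 0.5 * np.sin(2 * np.pi * 220 * t)  # toy stand-in for speech
primary = primary + speech

n_noise = 3 * sample_rate                         # the non-speech period before the speech
coeffs = estimate_filter(primary[:n_noise], reference[:n_noise])
cleaned = apply_frozen_filter(primary, reference, coeffs)
print(np.round(coeffs[:3], 3))
```

Because the coefficients are fitted only on the noise-only segment and then held fixed, the filter cannot adapt to (and cancel) the speech that follows, which is the motivation for freezing them.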
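Table 1: Speaker verification EER by noise source and SNR (dB), without VoiceFilter-Lite (No VFL) and with VoiceFilter-Lite (With VFL).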
3.1 Multi-talker speaker verification
Our first group of experiments focuses on addressing the multi-talker speaker verification challenge. We evaluate the standard speaker verification task under various noise conditions with and without a VoiceFilter-Lite model, where the noise source can be either non-speech noise or an interfering speaker, and the room condition can be either additive or reverberant to simulate both near-field and far-field devices.
For this evaluation, we use a vendor-provided English speech query dataset. There are 8,069 utterances from 1,434 speakers in the enrollment list, and 194,890 utterances from 1,241 speakers in the test list. The interfering speech is from a separate English dev-set consisting of 220,092 utterances from 958 speakers. The non-speech noises are from various sources such as ambient noises recorded in silent environments, cafes, vehicles, and audio clips of music and sound effects downloaded from gettyimages.com.
In Table 1, we can see that under both clean and non-speech noise conditions, adding VoiceFilter-Lite does not affect the EER of the speaker verification system. This is expected because VoiceFilter-Lite uses an adaptive suppression strength, as explained in [Wang2020]. However, under speech noise conditions, VoiceFilter-Lite largely reduces the EER of the speaker verification system, under both additive and reverberant room conditions, and for various signal-to-noise ratio (SNR) setups. On average, VoiceFilter-Lite offers a relative 67.4% EER reduction under speech noise conditions. Similar results have been reported by researchers using different models and data [Rao2019].
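For readers unfamiliar with the metric, the sketch below shows one simple way to estimate the EER from verification scores: sweep the decision threshold and report the point where the false accept and false reject rates are closest. The function name and the toy scores are hypothetical, and the evaluation tooling used for Table 1 is not described in the text.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Estimate the EER by sweeping the threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, 1.0
    for threshold in thresholds:
        frr = np.mean(target_scores < threshold)      # enrolled speaker rejected
        far = np.mean(nontarget_scores >= threshold)  # other speaker accepted
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)

# Toy scores: higher means more similar to the enrolled speaker.
print(equal_error_rate([0.8, 0.9], [0.1, 0.2]))  # 0.0, perfectly separable
print(equal_error_rate([0.9, 0.4], [0.1, 0.6]))  # 0.5
```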
3.2 Keyphrase detection in the presence of ambient noise
Our second group of experiments focuses on evaluating the overall performance of the keyphrase detection system under various noise conditions. Specifically, we use two key metrics to evaluate the performance: (1) the number of false acceptances per hour (FA/h), which measures how many false phrases are wrongly accepted by the system; and (2) the false rejection rate (FRR), which measures the percentage of true keyphrases that are ignored by the system.
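Both metrics reduce to simple ratios, sketched below. The function names and the counts in the usage lines are hypothetical and serve only to show the units; they are not results from this work.

```python
def false_accepts_per_hour(num_false_accepts: int, audio_hours: float) -> float:
    """FA/h: false triggers divided by the hours of keyphrase-free audio."""
    return num_false_accepts / audio_hours

def false_rejection_rate(num_rejected: int, num_keyphrase_utterances: int) -> float:
    """FRR in percent: share of true keyphrase utterances that did not trigger."""
    return 100.0 * num_rejected / num_keyphrase_utterances

# Hypothetical counts, for illustration only.
print(false_accepts_per_hour(10, 100.0))  # 0.1
print(false_rejection_rate(50, 1000))     # 5.0
```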
To evaluate FA/h, we used a dataset consisting of 156 hours of English speech from curated and hand-annotated YouTube videos [narayanan2019longform]. This dataset is designed to mimic background noise, and contains no true keyphrases. As such, phrases in this dataset that trigger the detection system are considered false accepts. We generated d-vectors by enrolling human speakers from a custom, in-house database of speakers.
To evaluate FRR, we used two datasets containing a set of commonly used keyphrases such as “remind me to set an alarm”, “turn off the lights”, and “set a timer”. First, we synthesized a dataset of 98 Text-to-Speech (TTS) speakers, each with 1000 keyphrases, using a previously published method [shen2018natural]. Additionally, we evaluated the performance on a set of vendor-provided keyphrases consisting of 61,555 utterances from 250 speakers with an average of 240 utterances per speaker. Each utterance was hand transcribed. For both datasets, we generated d-vectors from four enrollment utterances (e.g. “Hey Google, remind me to water my plants”). Each speaker was enrolled separately to mimic single-user devices. The remaining utterances were used for evaluation. We augmented both datasets with either speech or non-speech background noise at three different SNR levels using MTR.
To evaluate adaptive noise cancellation, we prepended each utterance with three seconds of silence before applying MTR. As a result, each utterance had three seconds of pure noise before the start of the main audio, which we used to estimate and freeze the Speech Cleaner filter coefficients. We used the same noise sources and room configurations in all experiments.
The overall performance of our keyphrase detection system on the three above-mentioned datasets is shown in Table 2. We observed that including speaker verification alone significantly decreased FA/h from 0.395 to 0.035 (rel. 91%). Adding a VoiceFilter-Lite (VFL) speaker separation model in the frontend of speaker verification, but not ASR, further reduced FA/h by improving speaker verification accuracy. Therefore, knowledge of speaker identity is sufficient to reduce false triggering.
This reduction in FA/h (no VFL), however, was accompanied by a 46.5% increase in FRR in the multi-talker and a 3.08% increase in the non-speech case (SNR = 0dB) relative to the model with no speaker verification. We observed a similar trend for the vendor-provided data with a rel. 20.6% increase in FRR when speech background noise was added (SNR = 0dB). In both datasets, non-speech background noise resulted in far fewer false rejections than the multi-talker scenario. It is important to note that the increase in false rejections was primarily due to incorrect speaker verification (64% of errors), rather than incorrect speech recognition.
Adding a speaker separation (VFL) model to the feature frontend of speaker verification reduced the FRR from 41.9% to 29.6%, resulting in a 29.4% reduction in FRR in the SNR = 0dB multi-talker case relative to the model with only speaker verification. In particular, for both the TTS and vendor-provided datasets, adding speaker separation mitigated the increase in FRR caused by speaker verification. This improvement was due to the fact that VoiceFilter-Lite is effective at identifying and suppressing overlapping speech from a non-enrolled speaker, which in turn improved speaker verification accuracy. All three models performed similarly in the non-speech background case. Notably, adding speaker separation to the feature frontend of the ASR alone did not produce a similar decrease in FRR, underscoring the fact that the false rejections in this keyphrase detection system are primarily due to speaker verification errors in the presence of speech background noise.
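The relative changes quoted in this section follow from the absolute figures. The short check below recomputes two of them from numbers reported in the text; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent when a metric drops from `before` to `after`."""
    return 100.0 * (before - after) / before

# FA/h with speaker verification: 0.395 -> 0.035
print(round(relative_reduction(0.395, 0.035), 1))  # 91.1
# FRR with VoiceFilter-Lite in the SNR = 0dB multi-talker case: 41.9% -> 29.6%
print(round(relative_reduction(41.9, 29.6), 1))    # 29.4
```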
Finally, to further improve the robustness of our keyphrase detection system to background noise, we included adaptive noise cancellation (ANC) in the feature frontends of both ASR and speaker verification. Relative to the model with only speaker verification, adding ANC reduced FRR by 68.3% in the non-speech and 25.2% in the speech background noise (SNR = 0dB) situations respectively. We refer the reader to Table 3 for a full description of the results.
Altogether, using both TTS and vendor-provided data, we have demonstrated that adding speaker verification, separation and adaptive noise cancellation results in a personalized keyphrase detection system that is robust to both background noise and overlapping speech.
4 Conclusions
We proposed a streaming personalized keyphrase detection system that is highly robust to background noise and overlapping speech. We used an RNN-T based ambient ASR model that was pruned to fit on-device constraints and implicitly biased it towards voice commands via domain-ID. To compensate for false triggering caused by biasing, we used a text-independent speaker verification model that rejected all keyphrases from non-enrolled speakers, which reduced FA/h by 91%. To mitigate the increased false rejections caused by speaker verification in the multi-talker scenario, we added a speaker separation model to the feature frontend of the speaker verification system. This resulted in a 67.4% reduction of speaker verification EER and a 29.4% reduction of FRR when the background contains overlapping speech. We also proposed Speech Cleaner, a multi-microphone adaptive noise cancellation algorithm that further reduced FRR for noisy conditions.