1 Introduction
(Footnote: Work performed at Apple.)
Voice trigger detection for personal devices, such as smartphones, is an important task that enables activating a voice assistant with speech containing a keyword phrase. It is also important to ensure that the keyword phrase is spoken by the owner of the device, which is done by running a speaker verification system.
A typical approach is to cascade speaker independent voice trigger detection and speaker verification [15, 13, 21, 6]. A universal voice trigger detector is trained on speech signals from various speakers to perform speaker independent voice trigger detection; speaker verification is then performed by a speaker recognition model that exploits enrollment utterances spoken by the target user. Various approaches have been proposed for speaker independent voice trigger detection, including ASR-based approaches [33, 19, 36, 22, 11, 2], as well as discriminative approaches with convolutional neural networks (CNNs) [23, 30, 8, 18], recurrent neural networks (RNNs) [9, 29, 4, 16], and attention-based networks [2, 5]. However, such speaker independent voice trigger detectors typically suffer from performance degradation on speech from underrepresented groups such as accented speakers [26, 32]. This is true even when a small amount of adaptation data is available, since adapting a large speaker independent voice trigger detector is challenging with only limited data.
In this work, we propose a novel approach for fast adaptation of the voice trigger detector to reduce the number of false rejections and false positive activations. Our proposed model consists of an encoder that performs speaker independent voice trigger detection and a decoder that performs speaker-adapted voice trigger detection. The decoder summarizes the acoustic information in an utterance and produces a fixed-dimensional embedding. The model is trained using metric learning, where we maximize the distance between embeddings of the keyword phrase and non-keyword phrases. We also minimize the distance between embeddings of the keyword phrase spoken by the same speaker, and maximize the distance between those spoken by different speakers. The metric learning encourages the model to learn not only differences between the keyword and non-keywords, but also differences between keyword phrases spoken by different speakers, thus enabling speaker adaptation. At test time, a speaker-adapted voice trigger score can be obtained as the distance between speaker-specific embeddings extracted from previously seen utterances and the embedding of a test utterance.
Experimental results show that the proposed approach achieves a 38% relative reduction in the false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.
2 Related work
Query-by-example is a popular approach for keyword spotting that can also exploit enrollment utterances [10, 35, 3, 17, 7, 34, 14]. In this approach, an acoustic model converts an audio input into a useful representation, e.g., a phonetic representation, and a similarity between the representations of the enrollment utterances and a test utterance is then computed using a technique such as dynamic time warping [10, 35, 3] or finite-state transducers. Phrase-level embeddings computed by neural networks have also been used as the representation in recent work [7, 34, 14]. Our proposed approach efficiently integrates the essence of the query-by-example approach with a speaker independent voice trigger detector using an encoder-decoder architecture. Moreover, speaker-aware training is performed in our approach using metric learning to explicitly differentiate between speakers and reject keyword phrases from non-target speakers.
Regarding joint modeling of voice trigger detection and speaker verification, Sigtia et al. used multi-task learning (MTL) and trained a single model with two branches for voice trigger detection and speaker verification, respectively. Our proposed approach is an extension of that work, adding extra training objectives to reject non-keyword phrases spoken by the target speaker. Note that a simple speaker verification system cannot suppress non-keyword speech from the target speaker, and thus cannot be used to improve voice trigger detection accuracy.
Acoustic model adaptation can also be performed by feeding a speaker embedding into the acoustic model along with audio features [1, 24, 25]. The speaker embedding can be computed by running a speaker identification model on the enrollment utterances. In contrast, we compare embeddings of known utterances and test utterances for voice trigger detection as we aim to detect whether the two utterances contain the same content, i.e., the keyword phrase, spoken by the same speaker.
3 Proposed approach
We propose a novel MTL approach where an encoder performs speaker independent phoneme prediction and a decoder performs speaker-adapted voice trigger detection. See Figure 1 for an overview of our proposed approach.
3.1 Model architecture
We borrow the model architecture from prior work and adapt it for speaker-adapted voice trigger detection. The model is based on an encoder-decoder Transformer architecture. Our encoder consists of $L$ stacked Transformer encoder blocks with self-attention. The self-attention encoder performs phoneme prediction, transforming the input feature sequence, denoted by $X$, into hidden representations as

$H_l = \mathrm{Enc}_l(H_{l-1}), \quad l = 1, \dots, L, \quad H_0 = X, \qquad (1)$

where $H_l$ denotes the hidden representation after the $l$-th encoder block. A linear layer is applied to the last encoder output $H_L$ to obtain logits for the phoneme classes, which are used to compute a phonetic loss.
Our cross-attention decoder comprises stacked Transformer decoder blocks with cross-attention layers. The decoder takes the encoder output after the $l$-th encoder block, $H_l$, as well as a set of trainable query vectors as inputs. Following prior work, we use an intermediate representation ($l < L$), since the speaker information can be diminished at the top encoder layer. Let $Q = \{q_1, \dots, q_K\}$ denote the set of trainable query vectors, where $q_k \in \mathbb{R}^d$. By feeding the encoder output and the query vectors, a set of decoder embedding vectors is obtained as

$E = \mathrm{Dec}(Q, H_l), \qquad (2)$

where $\mathrm{Dec}(\cdot)$ denotes the output of the stacked Transformer decoder blocks. The set of decoder outputs is then reshaped to form an utterance-wise embedding vector of size $Kd$. Unlike prior work that uses the decoder embedding only for a phrase-level cross-entropy loss, we use the embedding for three different losses for speaker-adapted voice trigger detection. We first branch out at this stage into two task-level linear layers: one linear layer is applied on the embedding to predict a scalar logit for the keyword phrase; another linear layer is applied to obtain logits for speaker verification. Finally, we also use the decoder embedding to perform metric learning within a mini-batch.
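The query-based decoder step can be sketched as a single-head cross-attention in NumPy. This is a minimal illustration, not the actual implementation: multi-head attention, self-attention among queries, feed-forward layers, and layer normalization are omitted, and all weight names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_decoder(H, Q, Wk, Wv):
    """Trainable queries Q (n_q x d) attend over encoder outputs H (T x d);
    the per-query outputs are flattened into one utterance embedding."""
    K = H @ Wk                                   # keys   (T x d)
    V = H @ Wv                                   # values (T x d)
    d = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (n_q x T) attention weights
    E = A @ V                                    # one d-dim vector per query
    return E.reshape(-1)                         # embedding of size n_q * d

rng = np.random.default_rng(0)
T, d, n_q = 50, 256, 4
H = rng.standard_normal((T, d))                  # encoder outputs, one utterance
Q = rng.standard_normal((n_q, d))                # trainable query vectors
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)
emb = query_decoder(H, Q, Wk, Wv)
print(emb.shape)                                 # (1024,) = 4 queries x 256 dims
```

With $K = 4$ queries and $d = 256$, the flattened embedding matches the 1024-dimensional utterance embedding used later in the experiments.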
3.2 Multi-task learning
In contrast to previously proposed MTL frameworks for keyword spotting [20, 27, 28, 12], we introduce a metric-learning loss to obtain a speaker-adapted voice trigger detection score by comparing the decoder embeddings. In our proposed MTL framework, the model is trained with the phonetic loss at the encoder output, while the decoder output has three branches: a keyword-phrase loss, a speaker-identification loss, and the metric-learning loss. The objective function for training can be formulated as

$\mathcal{L} = \mathcal{L}_{\mathrm{ph}} + \lambda_1 \mathcal{L}_{\mathrm{spk}} + \lambda_2 \mathcal{L}_{\mathrm{kw}} + \lambda_3 \mathcal{L}_{\mathrm{metric}}, \qquad (3)$

where $\mathcal{L}_{\mathrm{ph}}$, $\mathcal{L}_{\mathrm{spk}}$, $\mathcal{L}_{\mathrm{kw}}$, and $\mathcal{L}_{\mathrm{metric}}$ denote the phonetic loss, the speaker-identification loss, the keyword-phrase loss, and the metric-learning loss, respectively, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are scaling factors for balancing the losses.
We use a phoneme-level connectionist temporal classification (CTC) loss as the phonetic loss to compute a speaker independent voice trigger detection score from the encoder output. The keyword-phrase loss is a cross-entropy (CE) loss on the scalar logits obtained from the decoder branch with the utterance-wise phrase labels. Similarly, a speaker CE loss is computed using the other decoder branch, which constitutes the speaker-identification loss. The speaker-ID CE loss acts as a regularizer that helps generalize the model (see our ablation study in Section 4.3).
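The two decoder-branch CE losses can be sketched as follows. This is a NumPy illustration of the standard binary and categorical cross-entropy forms; the CTC phonetic loss is more involved and is not shown.

```python
import numpy as np

def keyword_phrase_loss(logit, label):
    """Binary CE on the scalar keyword logit; label is 1 for the keyword
    phrase, 0 otherwise. Uses the numerically stable form
    max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + np.logaddexp(0.0, -abs(logit))

def speaker_id_loss(logits, speaker_id):
    """Categorical CE over speaker classes from the speaker branch."""
    z = logits - logits.max()                    # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[speaker_id]
```

For instance, a zero keyword logit yields a loss of log 2, and uniform speaker logits over N speakers yield log N, the expected chance-level values.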
The metric loss $\mathcal{L}_{\mathrm{metric}}$ is based on a cosine similarity with trainable scale and offset parameters, applied directly on the decoder embedding output. Positive pairs are defined as utterances from the same speaker containing the keyword phrase; negative pairs are utterances from different speakers, or utterances from the same speaker with opposite phrase labels (see Fig. 1). We first convert the cosine similarity into a probability as

$p_{ij} = \sigma\bigl(w \cos(e_i, e_j) + b\bigr), \qquad (4)$

where $\cos(e_i, e_j)$ is the cosine similarity between the decoder embeddings of the $i$-th and $j$-th utterances, $\sigma(\cdot)$ is the sigmoid function, and $w$ and $b$ denote trainable scale and offset parameters, respectively. The metric loss can be computed as

$\mathcal{L}_{\mathrm{metric}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log p_{ij} - \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \log (1 - p_{ij}), \qquad (5)$

where $\mathcal{P}$ and $\mathcal{N}$ denote the sets of positive and negative pairs within a mini-batch, and $|\mathcal{P}|$ and $|\mathcal{N}|$ denote the numbers of positive and negative pairs. We balance the numbers of positive and negative pairs when computing the loss by randomly sub-sampling the negative pairs. The metric-learning loss computes a speaker-adapted voice trigger score in a consistent way during training and inference.
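A minimal NumPy sketch of this pairwise loss is given below. The fixed values of the scale and offset are illustrative (they are trainable in the model), and the random sub-sampling of negative pairs for balance is omitted.

```python
import numpy as np

def metric_loss(emb, labels, w=10.0, b=-5.0):
    """emb: (B, D) decoder embeddings; labels[i] = (speaker_id, is_keyword).
    Positive pair: same speaker, both keyword utterances.
    Negative pair: different speakers, or same speaker with opposite
    phrase labels. Same-speaker non-keyword pairs are left unused, as the
    text does not specify them."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = e @ e.T
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos, neg = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            same_spk = labels[i][0] == labels[j][0]
            if same_spk and labels[i][1] and labels[j][1]:
                pos.append(cos[i, j])
            elif (not same_spk) or (labels[i][1] != labels[j][1]):
                neg.append(cos[i, j])
    loss = 0.0
    if pos:   # -mean log p over positive pairs
        loss -= np.mean([np.log(sigma(w * c + b)) for c in pos])
    if neg:   # -mean log (1 - p) over negative pairs
        loss -= np.mean([np.log(1.0 - sigma(w * c + b)) for c in neg])
    return loss
```

Minimizing this loss pulls same-speaker keyword embeddings together while pushing apart embeddings across speakers or across phrase labels.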
3.3 Data Sampling
We use two sources of data per mini-batch for training the MTL tasks. The first source is a set of anonymized utterances that have either phoneme labels or keyword phrase labels (voice-trigger data), which is mainly used for the phonetic loss and the keyword-phrase loss. Non-keyword utterances from the voice-trigger data are also used for the metric-learning loss as a negative class. This dataset can be obtained by combining an ASR dataset with phoneme labels and a keyword spotting dataset with keyword phrase labels [27, 12]. The other source includes utterances with speaker labels (speaker-ID data), where each utterance contains a keyword phrase followed by a non-keyword sentence. The speaker-ID data are used for all of the losses except the phonetic loss, since there is no transcription for this dataset.
We employ a batch sampling strategy that picks samples from both of these sets for every mini-batch of training. For example, for a batch size of 128, we pick 112 utterances from the speaker-ID data (4 utterances from each of 28 unique speakers), and the rest comes from the voice-trigger data. Also, we randomly drop the keyword phrase segment from the utterances sampled from the speaker-ID data to create negative pairs (keyword vs. non-keyword) for the same speaker, which helps metric learning.
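The sampling strategy above can be sketched as follows. The data-structure layout, the field name `has_keyword`, and the 0.5 keyword-drop probability are assumptions for illustration.

```python
import random

def sample_batch(spk_data, vt_data, batch_size=128,
                 n_speakers=28, utts_per_speaker=4, drop_kw_prob=0.5):
    """spk_data: dict speaker_id -> list of utterances (each contains the
    keyword phrase followed by a command); vt_data: list of utterances
    with phoneme or phrase labels."""
    batch = []
    for spk in random.sample(sorted(spk_data), n_speakers):
        for utt in random.sample(spk_data[spk], utts_per_speaker):
            # randomly strip the keyword segment so the same speaker
            # yields keyword vs. non-keyword negatives for metric learning
            if random.random() < drop_kw_prob:
                utt = {**utt, "has_keyword": False}
            batch.append(utt)
    # fill the remainder (e.g. 128 - 112 = 16) from the voice-trigger data
    batch += random.sample(vt_data, batch_size - len(batch))
    random.shuffle(batch)
    return batch
```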
At inference, we first obtain an anchor embedding as an average of the decoder embeddings from existing utterances of a speaker. Next, we compute the decoder embedding of the test utterance, and then compute the similarity score between the anchor embedding and the test embedding using Eq. (4). This similarity score corresponds to the speaker-adapted voice trigger score. Optionally, we combine the speaker-adapted score and a speaker independent voice trigger score obtained from the encoder output. First the speaker-adapted score $s_{\mathrm{SA}}$ is calibrated as $\hat{s}_{\mathrm{SA}} = (s_{\mathrm{SA}} - \mu) / \sigma$, where $\mu$ and $\sigma$ are the global mean and standard deviation of the scores computed on a validation set. Then we use a simple weighted average to combine the two voice trigger scores:

$s = \gamma \hat{s}_{\mathrm{SA}} + (1 - \gamma)\, s_{\mathrm{SI}}, \qquad (6)$

where $\gamma$ is a weight factor and $s_{\mathrm{SI}}$ is the speaker independent score.
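The inference-time scoring can be sketched in NumPy as below. The fixed scale/offset values and the 0.5 combination weight are placeholders; in practice $w$ and $b$ come from training and the weight is tuned.

```python
import numpy as np

def speaker_adapted_score(anchor_embs, test_emb, w=10.0, b=-5.0):
    """Average existing decoder embeddings into an anchor, then score the
    test utterance by scaled-and-offset cosine similarity as in Eq. (4)."""
    anchor = np.mean(anchor_embs, axis=0)
    cos = anchor @ test_emb / (np.linalg.norm(anchor) * np.linalg.norm(test_emb))
    return 1.0 / (1.0 + np.exp(-(w * cos + b)))

def combined_score(s_sa, s_si, mu, sigma, weight=0.5):
    """Calibrate the speaker-adapted score with validation-set statistics,
    then mix it with the speaker independent score by weighted average."""
    return weight * (s_sa - mu) / sigma + (1.0 - weight) * s_si
```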
4 Experimental evaluation
4.1 Data
The training data are thousands of hours of randomly sampled anonymized utterances from recordings, manually transcribed to obtain phonetic labels (54 phoneme classes). These audio data are augmented with room-impulse responses (RIRs) and echo residuals to obtain a total of approximately 9 million utterances, similar to [28, 12]. We add roughly 65k false triggers and 300k true triggers, which are short anonymized utterances randomly sampled from speakers, for the keyword phrase detection task. The training data for the speaker identification task comprise 15 million utterances. This set contains 131k different anonymized speakers, with a minimum of 100 samples and a median of 115 random samples per speaker. These contain only speaker labels and no phonetic information; however, each training utterance contains the keyword phrase along with meta-information marking the keyword phrase segment. The training data are formed by concatenating these datasets, and we use the batch sampling strategy described in Section 3.3 to ensure each mini-batch contains samples for all tasks.
For evaluation, we use a synthetic dataset, where 7535 positive samples were internally collected under controlled conditions from 72 different speakers, evenly divided between genders. Each utterance contains the keyword phrase followed by a voice command spoken to a smartphone. The acoustic conditions include quiet, external noise from a TV or kitchen appliances, and music playback. To measure false accepts (FA) per hour, we include negative data comprising 2k hours of audio recordings that do not contain the keyword phrase, obtained by playing podcasts, audiobooks, TV, etc. We randomly sample five utterances per speaker for computing the anchor embedding, and we evaluate using the remaining utterances. To estimate the variability, we repeat this five times for each speaker, changing the utterances used to compute the anchor embedding. We report the mean performance over the five runs.
Similar to prior work, we use a two-stage approach to reduce the overall compute cost and accommodate the Transformer-based architecture on device for voice trigger detection. We first run light-weight fully-connected neural networks on continuous audio and obtain audio segments of keyword candidates using hidden Markov model (HMM) alignments. Then only the detected audio segment is fed into the baseline/proposed model and a voice trigger score is recomputed.
4.2 Model training
We use the speaker independent voice trigger detector proposed in prior work as a baseline. The baseline system has an encoder-decoder architecture that is trained with the speaker independent phonetic and keyword-phrase losses on the voice-trigger data. The input features are 40-dimensional log mel-filter bank features with 3 context frames, sub-sampled once every three frames to reduce computational complexity. We also normalize the features using the global mean and variance. The phonetic encoder has 6 layers of Transformer encoder blocks, where each multi-head attention block has a hidden dimension of 256 and 4 heads. The feed-forward network has 1024 hidden units. The final encoder output is projected into 54-dimensional logits using a linear layer. This encoder is trained with the CTC loss using the phonetic labels. The decoder consists of one Transformer decoder block with the same hidden dimensions as the encoder. The query vectors have a dimension of 256, and their number is fixed to 4. The final decoder output embedding is reshaped into a 1024-dimensional vector. The baseline approach has the phrase-level CE loss on the decoder output for keyword phrase detection. We also investigate the metric-based inference described in Section 3.3, even though the baseline model is not trained with the metric-learning loss.
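The feature pipeline can be sketched as follows. Whether the 3 context frames are applied symmetrically (±3) and the edge-padding behavior are assumptions; the sketch shows the symmetric case.

```python
import numpy as np

def stack_and_subsample(feats, context=3, stride=3):
    """feats: (T, 40) log mel-filter bank frames. Concatenate the current
    frame with `context` frames on each side (edges padded by repetition),
    then keep every `stride`-th frame to reduce compute."""
    T, d = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # slice i corresponds to frame offset (i - context) from the center
    stacked = np.concatenate(
        [padded[i:i + T] for i in range(2 * context + 1)], axis=1)
    return stacked[::stride]

x = np.random.randn(100, 40)
y = stack_and_subsample(x)
print(y.shape)   # (34, 280): ceil(100/3) frames, 40 * 7 dims each
```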
For our proposed approach, we add another linear layer with a dropout of 0.6 on top of the decoder for the speaker-ID loss over the 131k speakers. In addition, we initialize our proposed model with the weights of the baseline model and freeze the encoder weights to take advantage of its phonetic performance. We only fine-tune the decoder weights in a transfer-learning fashion with the keyword-phrase CE loss, the speaker-identification CE loss, and the metric-learning loss. We also use the penultimate encoder layer embedding as the decoder input. The scaling factors in Eq. (3) are empirically set to 1, 1, and 0.1, respectively. The optimizer is Adam, where the initial learning rate is linearly increased to 0.001 by epoch 2, and then linearly decayed to 0.0007 over the next 25 epochs. We then exponentially decay the learning rate, with a minimum learning rate of 1e-7, until the last epoch, set at 40. We use 64 GPUs for training with a batch size of 128 per GPU.
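The learning-rate schedule described above might be implemented as follows. Epoch-level granularity and the exact shape of the final exponential decay are assumptions where the text is ambiguous.

```python
def learning_rate(epoch, warmup_end=2, peak=1e-3,
                  decay_end=27, decay_lr=7e-4,
                  final_epoch=40, min_lr=1e-7):
    """Piecewise schedule: linear warmup to `peak` by epoch `warmup_end`,
    linear decay to `decay_lr` by epoch `decay_end` (25 epochs later),
    then exponential decay floored at `min_lr` until `final_epoch`."""
    if epoch <= warmup_end:
        return peak * epoch / warmup_end
    if epoch <= decay_end:
        frac = (epoch - warmup_end) / (decay_end - warmup_end)
        return peak + frac * (decay_lr - peak)
    # per-epoch multiplicative factor reaching min_lr at final_epoch
    rate = (min_lr / decay_lr) ** (1.0 / (final_epoch - decay_end))
    return max(decay_lr * rate ** (epoch - decay_end), min_lr)
```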
4.3 Results
Figure 2 shows the detection error trade-off (DET) curves for the baseline and the proposed approach. The horizontal axis represents FA/hr and the vertical axis represents false reject rates (FRRs). Table 1 shows the FRRs at our FA/hr operating point. The baseline FRR is 3.8% when using the phonetic branch for inference. The phrase branch of the baseline shows regressions compared to the phonetic branch, even when we apply metric-based inference on the decoder embedding. In addition, fine-tuning the decoder on the speaker-ID data with the speaker independent phrase loss does not improve performance. For the proposed model, we can see that the new MTL improves the FRRs. This improvement indicates that the speaker information helps adapt the keyword phrase detection through the speaker-adapted score, and demonstrates the effectiveness of the embedding space structured by the metric learning. By combining the speaker-adapted score and the speaker independent score from the phonetic branch, we see a further improvement in the FRRs. The proposed approach reduces the FRR by 38% relative to the baseline model trained in a speaker independent fashion.
Table 2 shows an ablation study for the proposed approach. When we train our proposed model from scratch, we see a slight improvement with the metric branch over the baseline. We also see that the absence of the phonetic loss fails to generalize the model, similar to previously reported results. Initializing with the baseline model helps retain the phonetic performance; however, the FRR degrades when the last encoder layer is used as the decoder input. This could be because the encoder performs speaker independent phoneme prediction, where speaker information can be diminished at the last layer. Using the intermediate encoder layer, we observe improvements. The rest of Table 2 highlights the importance of the three losses on the decoder. We also observe that any fine-tuning of the encoder with the CTC loss introduces performance degradation.
5 Conclusion
We proposed a novel approach for improving voice trigger detection by adapting to speaker information using metric learning. Our model employs an encoder-decoder architecture, where the encoder performs phoneme prediction for speaker independent voice trigger detection while the decoder predicts an utterance-wise embedding for speaker-adapted voice trigger detection. The speaker-adapted voice trigger score is obtained by computing a similarity between an anchor embedding for each speaker and the decoder embedding of a test utterance. Experimental results show that our proposed approach outperforms the baseline speaker independent voice trigger detector by 38% relative in terms of FRR.
References
- (2013) Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7942–7946.
- (2020) Hybrid transformer/CTC networks for hardware efficient voice triggering. In Interspeech, pp. 3351–3355.
- (2013) Memory efficient subsequence DTW for query-by-example spoken term detection. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
- (2017) Convolutional recurrent neural networks for small-footprint keyword spotting. In Interspeech, pp. 1606–1610.
- (2021) Keyword Transformer: a self-attention model for keyword spotting. arXiv preprint arXiv:2104.00769.
- (2019) Event-driven pipeline for low-latency low-compute keyword spotting and speaker verification system. In ICASSP 2019, pp. 7953–7957.
- (2015) Query-by-example keyword spotting using long short-term memory networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5236–5240.
- (2019) Temporal convolution for real-time keyword spotting on mobile devices.
- (2007) An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks, pp. 220–229.
- (2009) Query-by-example spoken term detection using phonetic posteriorgram templates. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 421–426.
- (2017) Streaming small-footprint keyword spotting using sequence-to-sequence models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481.
- (2021) Multi-task learning with cross attention for keyword spotting. arXiv preprint arXiv:2107.07634.
- (2021) The NPU system for the 2020 personalized voice trigger challenge.
- (2021) Query-by-example keyword spotting system using multi-head attention and soft-triple loss. In ICASSP 2021, pp. 6858–6862.
- The 2020 personalized voice trigger challenge: open database, evaluation metrics and the baseline systems.
- (2021) Tiny-CRNN: streaming wakeword detection in a low footprint setting.
- (2019) Query-by-example on-device keyword spotting. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 532–538.
- (2020) MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition.
- (2007) Rapid and accurate spoken term detection. In Eighth Annual Conference of the International Speech Communication Association.
- (2016) Multi-task learning and weighted cross-entropy for DNN-based keyword spotting. In Interspeech, Vol. 9, pp. 760–764.
- (2021) Personalized keyphrase detection using speaker and environment information.
- (2017) End-to-end speech recognition and keyword search on low-resource languages. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5280–5284.
- (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association.
- (2013) Speaker adaptation of neural network acoustic models using i-vectors. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 55–59.
- (2014) Improving DNN speaker independence with i-vector inputs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 225–229.
- (2019) Personalizing ASR for dysarthric and accented speech with limited data. arXiv preprint arXiv:1907.13511.
- (2020) Multi-task learning for voice trigger detection. In ICASSP 2020, pp. 7449–7453.
- (2020) Multi-task learning for speaker verification and voice trigger detection. In ICASSP 2020, pp. 6844–6848.
- (2016) Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 474–480.
- (2018) Deep residual learning for small-footprint keyword spotting.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30.
- (2019) End-to-end accented speech recognition. In Interspeech, pp. 2140–2144.
- (1993) Keyword-spotting using SRI's DECIPHER large-vocabulary speech-recognition system. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 463–466.
- (2019) Verifying deep keyword spotting detection with acoustic word embeddings. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 613–620.
- (2009) Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 398–403.
- (2016) Unrestricted vocabulary keyword spotting using LSTM-CTC. In Interspeech, pp. 938–942.