
Improving Voice Trigger Detection with Metric Learning

by   Prateeth Nayak, et al.
Apple Inc.

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.



1 Introduction

Footnote 1: Work performed at Apple.

Voice trigger detection for personal devices, such as smart phones, is an important task which enables activating a voice assistant by speech containing a keyword phrase. It is also important to ensure that the keyword phrase is spoken by the owner of the device by running a speaker verification system.

A typical approach is to cascade speaker independent voice trigger detection and speaker verification [15, 13, 21, 6]. A universal voice trigger detector is trained on speech signals from various speakers to perform speaker independent voice trigger detection, then speaker verification is performed by a speaker recognition model exploiting enrollment utterances spoken by the target user. Various approaches have been proposed for speaker independent voice trigger detection including ASR-based approaches [33, 19, 36, 22, 11, 2], as well as discriminative approaches with convolutional neural networks (CNNs) [23, 30, 8, 18], recurrent neural networks (RNNs) [9, 29, 4, 16] and attention-based networks [2, 5]. However, such speaker independent voice trigger detectors typically suffer from performance degradation on speech from underrepresented groups such as accented speakers [26, 32]. This is true even when a small amount of adaptation data is available, since adapting a large speaker independent voice trigger detector is a challenge with only limited data.

In this work, we propose a novel approach for fast adaptation of the voice trigger detector to reduce the number(s) of false rejections and/or false positive activations. Our proposed model consists of an encoder that performs speaker independent voice trigger detection and a decoder that performs speaker-adapted voice trigger detection. The decoder summarizes acoustic information in an utterance and produces a fixed dimensional embedding. The model is trained using metric learning, where we maximize distance between embeddings of a keyword phrase and non-keyword phrases. We also minimize distance between embeddings of a keyword phrase spoken by the same speaker, and maximize the distance of those spoken by different speakers. The metric learning encourages the model to learn not only differences between the keyword and non-keywords, but also those between keyword phrases spoken by different speakers, thus enabling speaker adaptation. At test time, a speaker-adapted voice trigger score can be obtained as the distance between speaker-specific embeddings extracted from previously seen utterances and embeddings from a test utterance.

Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate (FRR) compared to a baseline speaker independent voice trigger model on a voice trigger detection task.

2 Related work

Query-by-example is a popular approach for keyword spotting that can also exploit enrollment utterances [10, 35, 3, 17, 7, 34, 14]. In this approach, an acoustic model converts an audio input into a useful representation, e.g., a phonetic representation, and then a similarity between the representations of the enrollment and test utterances is computed using a technique such as dynamic time warping [10, 35, 3] or finite-state transducers [17]. Phrase-level embeddings computed by neural networks are also used as the representation in recent work [7, 34, 14]. Our proposed approach efficiently integrates the essence of the query-by-example approach with the speaker independent voice trigger detector using an encoder-decoder architecture. Moreover, speaker-aware training is performed in our approach using metric learning to explicitly differentiate between speakers and reject keyword phrases from non-target speakers.
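As an illustration of the dynamic time warping step used by classic query-by-example systems, the following is a minimal sketch of the textbook algorithm. It is not the method of any cited paper: the Euclidean frame distance and the function name `dtw_distance` are assumptions made for the example.

```python
import math


def dtw_distance(query, test):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(query), len(test)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning
    # query[:i] with test[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local frame distance (Euclidean, an assumption for this sketch).
            local = math.dist(query[i - 1], test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query advances, test frame repeats
                cost[i][j - 1],      # test advances, query frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A time-stretched copy of the query aligns with zero cost, which is why such alignment-based similarity tolerates differences in speaking rate.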

Regarding joint modeling for voice trigger detection and speaker verification, Sigtia et al. [28] used multi-task learning (MTL) and trained a single model with two branches for voice trigger detection and speaker verification, respectively. Our proposed approach extends [28] by adding extra training objectives to reject non-keyword phrases spoken by a target speaker. Note that a simple speaker verification system cannot suppress non-keyword speech from the target speaker, and thus cannot be used for improving voice trigger detection accuracy.

Acoustic model adaptation can also be performed by feeding a speaker embedding into the acoustic model along with audio features [1, 24, 25]. The speaker embedding can be computed by running a speaker identification model on the enrollment utterances. In contrast, we compare embeddings of known utterances and test utterances for voice trigger detection as we aim to detect whether the two utterances contain the same content, i.e., the keyword phrase, spoken by the same speaker.

Figure 1: Proposed MTL framework: phonetic encoder (grey) and cross-attention decoder (orange) blocks. The green block is the phonetic branch with CTC loss ($\mathcal{L}_{\text{phone}}$). The blue block is the speaker-identification branch with CE loss ($\mathcal{L}_{\text{spkr}}$). The purple block denotes the keyword-phrase branch with CE loss ($\mathcal{L}_{\text{phrase}}$) and the metric learning branch ($\mathcal{L}_{\text{metric}}$).

3 Proposed approach

We propose a novel MTL approach where an encoder performs a speaker independent phoneme prediction, and a decoder performs speaker-adapted voice trigger detection. See Figure 1 for an overview of our proposed approach.

3.1 Model architecture

We borrow the model architecture from [12] and adapt it for speaker-adapted voice trigger detection. The model is based on an encoder-decoder Transformer architecture [31]. Our encoder consists of $N$ stacked Transformer encoder blocks with self-attention. The self-attention encoder performs phoneme prediction: it transforms the input feature sequence, denoted by $\mathbf{X}$, into hidden representations as

$$\mathbf{H}_n = \mathrm{EncoderBlock}_n(\mathbf{H}_{n-1}), \quad n = 1, \dots, N, \quad \mathbf{H}_0 = \mathbf{X}, \tag{1}$$

where $\mathbf{H}_n$ denotes a hidden representation after the $n$-th encoder block. A linear layer is applied to the last encoder output $\mathbf{H}_N$ to get logits for phoneme classes, which are used to compute a phonetic loss.

Our cross-attention decoder consists of Transformer decoder blocks with attention layers. The decoder takes the encoder output after the $n$-th encoder block, $\mathbf{H}_n$, as well as a set of trainable query vectors as inputs. Following [28], we use an intermediate representation ($n < N$) since the speaker information can be diminished at the top encoder layer. Let $\mathbf{Q} = \{\mathbf{q}_1, \dots, \mathbf{q}_M\}$ denote a set of the trainable vectors, where $\mathbf{q}_m \in \mathbb{R}^d$. By feeding the encoder output and the query vectors, a set of decoder embedding vectors is obtained as

$$\mathbf{Y} = \mathrm{Decoder}(\mathbf{H}_n, \mathbf{Q}), \tag{2}$$

where $\mathrm{Decoder}(\cdot)$ denotes an output of stacked Transformer decoder blocks. The set of the decoder outputs is then reshaped to form an utterance-wise embedding vector of size $Md$. Unlike [12], which uses the decoder embedding only for a phrase-level cross entropy loss, we use the embedding for three different losses for speaker-adapted voice trigger detection. We first branch out at this stage into two task-level linear layers: one linear layer is applied on the embedding to predict a scalar logit for the keyword phrase, and another linear layer is applied to obtain logits for speaker verification. Finally, we also use the decoder embedding to perform metric learning within a mini-batch.

3.2 Multi-task learning

In contrast to the previously proposed MTL frameworks for keyword spotting [20, 27, 28, 12], we introduce the metric-learning loss to obtain a speaker-adapted voice trigger detection score by comparing the decoder embeddings. In our proposed MTL framework, the model is trained using the phonetic loss at the encoder output, and at the decoder output we have three branches: the keyword-phrase loss, the speaker-identification loss and the metric-learning loss. The objective function for the training can be formulated as

$$\mathcal{L} = \mathcal{L}_{\text{phone}} + \alpha \mathcal{L}_{\text{spkr}} + \beta \mathcal{L}_{\text{phrase}} + \gamma \mathcal{L}_{\text{metric}}, \tag{3}$$

where $\mathcal{L}_{\text{phone}}$, $\mathcal{L}_{\text{spkr}}$, $\mathcal{L}_{\text{phrase}}$ and $\mathcal{L}_{\text{metric}}$ denote the phonetic loss, the speaker-identification loss, the keyword-phrase loss and the metric learning loss, respectively, and $\alpha$, $\beta$ and $\gamma$ are the scaling factors for balancing the losses.
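The weighted objective can be sketched in a few lines. This assumes the objective is the phonetic loss plus a weighted sum of the three decoder losses, with one scaling factor per decoder loss in the order listed. The function name and argument names are hypothetical, and the default weights are the values 1, 1 and 0.1 reported in Section 4.2.

```python
def total_loss(l_phone, l_spkr, l_phrase, l_metric,
               alpha=1.0, beta=1.0, gamma=0.1):
    """Weighted multi-task objective (assumed form, see lead-in).

    l_phone:  phonetic (CTC) loss at the encoder output
    l_spkr:   speaker-identification CE loss at the decoder output
    l_phrase: keyword-phrase CE loss at the decoder output
    l_metric: metric learning loss on the decoder embeddings
    """
    return l_phone + alpha * l_spkr + beta * l_phrase + gamma * l_metric
```

When the encoder is frozen, as in the fine-tuning setup of Section 4.2, the phonetic term contributes no gradient and only the three decoder terms drive the update.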

We use a phoneme-level connectionist temporal classification (CTC) loss for the phonetic loss to compute a speaker independent voice trigger detection score from the encoder output. The keyword phrase loss is a cross-entropy (CE) loss on the scalar logits obtained from the decoder branch with the utterance-wise phrase labels. Similarly, a speaker CE loss is computed using the other decoder branch, which constitutes the speaker-identification loss. The speaker-ID CE loss acts as a regularizer, which helps the model generalize (see our ablation study in Section 4.3).

The metric loss $\mathcal{L}_{\text{metric}}$ is based on a cosine similarity with scale and offset parameters and is applied directly on the decoder embedding output. Positive pairs are defined as utterances from the same speaker containing the keyword phrase; negative pairs are utterances from different speakers, or utterances from the same speaker with opposite phrase labels (see Fig. 1). We first convert the cosine similarity into a probability as

$$p_{ij} = \sigma\left(a \, d_{ij} + b\right), \tag{4}$$

where $d_{ij}$ is a cosine distance between the decoder embeddings of the $i$-th and $j$-th utterances, $\sigma(\cdot)$ is the sigmoid function, and $a$ and $b$ denote trainable scale and offset parameters, respectively. The metric loss can be computed as

$$\mathcal{L}_{\text{metric}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log p_{ij} - \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \log\left(1 - p_{ij}\right), \tag{5}$$

where $\mathcal{P}$ and $\mathcal{N}$ denote sets of the positive and negative pairs within a mini-batch, and $|\mathcal{P}|$ and $|\mathcal{N}|$ denote the numbers of positive and negative pairs. We balance the numbers of positive and negative pairs when computing the loss by randomly sub-sampling the negative pairs. The metric-learning loss computes a speaker-adapted voice trigger score in a consistent way during training and inference.
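The pair construction and the balanced loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the probability is assumed to be a sigmoid of the scaled and shifted cosine similarity, the loss is assumed to be a binary cross-entropy averaged separately over positive and negative pairs, and the names `make_pairs`, `pair_probability` and `metric_loss` as well as the default `scale` and `offset` values are hypothetical (in training these two parameters would be learned).

```python
import math
import random


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_probability(u, v, scale, offset):
    """Sigmoid of the scaled and shifted cosine similarity (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(scale * cosine_similarity(u, v) + offset)))


def make_pairs(speakers, is_keyword):
    """Split utterance pairs of a mini-batch into positives and negatives."""
    positives, negatives = [], []
    for i in range(len(speakers)):
        for j in range(i + 1, len(speakers)):
            same_speaker = speakers[i] == speakers[j]
            if same_speaker and is_keyword[i] and is_keyword[j]:
                positives.append((i, j))   # same speaker, both keyword
            elif not same_speaker or is_keyword[i] != is_keyword[j]:
                negatives.append((i, j))   # other speaker or opposite label
    return positives, negatives


def metric_loss(embeddings, speakers, is_keyword,
                scale=10.0, offset=-5.0, seed=0):
    positives, negatives = make_pairs(speakers, is_keyword)
    # Balance the two sets by randomly sub-sampling the negative pairs.
    if len(negatives) > len(positives):
        negatives = random.Random(seed).sample(negatives, len(positives))
    eps = 1e-12  # numerical guard for the logarithm
    pos = sum(-math.log(pair_probability(embeddings[i], embeddings[j],
                                         scale, offset) + eps)
              for i, j in positives)
    neg = sum(-math.log(1.0 - pair_probability(embeddings[i], embeddings[j],
                                               scale, offset) + eps)
              for i, j in negatives)
    return pos / max(len(positives), 1) + neg / max(len(negatives), 1)
```

Pairs of non-keyword utterances from the same speaker fall into neither set in this sketch, since the text defines no label for them.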

3.3 Data Sampling

We use two sources of data per mini-batch for training the MTL tasks. The first source is a set of anonymized utterances that have either phoneme labels or keyword phrase labels (voice-trigger data), which is mainly used for the phonetic loss and the keyword phrase loss. Non-keyword utterances from the voice-trigger data are also used for the metric learning loss as a negative class. The dataset can be obtained by combining an ASR dataset with the phoneme labels and a keyword spotting dataset with the keyword phrase labels [27, 12]. The other dataset includes utterances with speaker labels (speaker-ID data), where each utterance contains a keyword phrase followed by a non-keyword sentence. The speaker-ID data are used for all of the losses except the phonetic loss, since there is no transcription for this dataset.

We employ a batch sampling strategy that picks samples from both of these sets for every mini-batch of training. For example, for a batch size of 128, we pick 112 utterances from the speaker-ID data, comprising 4 utterances from each of 28 unique speakers, and the rest come from the voice-trigger data. Also, we randomly drop the keyword phrase segment from utterances sampled from the speaker-ID data to create negative pairs (keyword vs. non-keyword) for the same speaker, which helps metric learning.
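The batch composition in this example can be sketched as below. The data layout (a dictionary from speaker to utterance IDs and a list of labeled voice-trigger utterances), the function name `sample_batch` and the keyword drop probability `drop_prob` are all assumptions made for illustration; the text does not state how often the keyword segment is dropped.

```python
import random


def sample_batch(speaker_utts, trigger_utts, n_speakers=28,
                 utts_per_speaker=4, batch_size=128, drop_prob=0.5, seed=0):
    """Draw one mini-batch mixing speaker-ID and voice-trigger data.

    speaker_utts: dict mapping speaker ID -> list of utterance IDs
    trigger_utts: list of (utterance ID, has_keyword) tuples
    """
    rng = random.Random(seed)
    batch = []
    for speaker in rng.sample(sorted(speaker_utts), n_speakers):
        for utt in rng.sample(speaker_utts[speaker], utts_per_speaker):
            # Randomly drop the keyword segment so that the same speaker
            # also contributes non-keyword (negative) examples.
            has_keyword = rng.random() >= drop_prob
            batch.append({"utt": utt, "speaker": speaker,
                          "has_keyword": has_keyword})
    # Fill the remainder of the batch from the voice-trigger data,
    # which carries no speaker labels.
    for utt, has_keyword in rng.sample(trigger_utts, batch_size - len(batch)):
        batch.append({"utt": utt, "speaker": None,
                      "has_keyword": has_keyword})
    return batch
```

With the defaults this yields 28 x 4 = 112 speaker-ID utterances and 16 voice-trigger utterances per batch, matching the example in the text.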

3.4 Inference

At inference, we first obtain an anchor embedding as an average of the decoder embeddings from existing utterances for a speaker. Next, we compute the decoder embedding on the test utterance, and then compute the similarity score between the anchor embedding and the test embedding using Eq. (4). The similarity score corresponds to the speaker-adapted voice trigger score. Optionally, we combine the speaker-adapted score and a speaker independent voice trigger score obtained from the encoder output. First, the speaker-adapted score $s_{\text{SA}}$ is calibrated as $\hat{s}_{\text{SA}} = (s_{\text{SA}} - \mu_{\text{SA}}) / \sigma_{\text{SA}}$, where $\mu_{\text{SA}}$ and $\sigma_{\text{SA}}$ are the global mean and standard deviation of the scores computed on a validation set. Then we use a simple weighted average to combine these two voice trigger scores:

$$s = w \, \hat{s}_{\text{SA}} + (1 - w) \, s_{\text{SI}}, \tag{6}$$

where $s_{\text{SI}}$ is the speaker independent score and $w$ is a weight factor.
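The inference steps can be sketched as follows. This is a hedged illustration rather than the authors' code: the similarity is assumed to be a sigmoid of the scaled and shifted cosine similarity, the calibration is assumed to be a standard z-score, and the choice of which score the weight multiplies in `combined_score` is an assumption. All function names and default parameter values are hypothetical.

```python
import math


def mean_embedding(embeddings):
    """Anchor embedding: element-wise average of enrollment embeddings."""
    dim = len(embeddings[0])
    return [sum(e[k] for e in embeddings) / len(embeddings)
            for k in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def speaker_adapted_score(enrollment, test, scale=1.0, offset=0.0):
    """Similarity between the anchor and the test embedding."""
    anchor = mean_embedding(enrollment)
    logit = scale * cosine_similarity(anchor, test) + offset
    return 1.0 / (1.0 + math.exp(-logit))


def combined_score(adapted, independent, mean, std, weight):
    """Calibrate the adapted score, then average it with the other score."""
    calibrated = (adapted - mean) / std  # z-score on validation statistics
    # Assumption: the weight multiplies the speaker-adapted score.
    return weight * calibrated + (1.0 - weight) * independent
```

Here `mean` and `std` stand for the validation-set statistics of the speaker-adapted score, and `independent` for the score derived from the encoder output.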

4 Experimental evaluation

4.1 Data

The training data are thousand hours of randomly sampled anonymized utterances from recordings, manually transcribed for phonetic labels (54-dimensional). These audio data are augmented with room-impulse responses (RIRs) and echo residuals to obtain a total of approximately 9 million utterances, similar to [28, 12]. We add roughly 65k false triggers and 300k true triggers that are short-lived anonymized utterances randomly sampled from speakers for the keyword phrase detection task. The training data for the speaker identification task comprise 15 million utterances. The set contains 131k different anonymized speakers with a minimum of 100 samples and a median of 115 random samples per speaker. These contain only speaker labels and no phonetic information. However, each training utterance contains the keyword phrase and meta information about the keyword phrase segment. The training data are formed by concatenating these datasets, and we use the batch sampling strategy described in Section 3.3 to ensure each mini-batch contains samples for all tasks.

For evaluation, we use a synthetic dataset, where 7535 positive samples are internally collected under controlled conditions from 72 different speakers, evenly divided between genders. Each utterance contains the keyword phrase followed by a voice command spoken to a smartphone. The acoustic conditions include quiet, external noise from TV or kitchen appliances, and music playback. To measure false accepts (FA) per hour, we include negative data of 2k hours of audio recordings that do not contain the keyword phrase, obtained by playing podcasts, audiobooks, TV, etc. We randomly sample five utterances per speaker for computing the anchor embedding, and we evaluate using the remaining utterances. To estimate the variability, we repeat this five times for each speaker, changing the utterances that are used to compute the anchor embedding. We report the mean performance over the five runs.
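The enrollment protocol for one speaker can be sketched as below. The function names and the `evaluate` callback are hypothetical, and whether the five enrollment sets may overlap across repeats is not stated in the text; this sketch draws them independently.

```python
import random
import statistics


def enrollment_trials(utterances, n_enroll=5, n_repeats=5, seed=0):
    """Split one speaker's utterances into enrollment and test sets.

    Returns a list of (enrolled, held_out) pairs, one per repeat.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_repeats):
        chosen = set(rng.sample(range(len(utterances)), n_enroll))
        enrolled = [utterances[i] for i in sorted(chosen)]
        held_out = [u for i, u in enumerate(utterances) if i not in chosen]
        trials.append((enrolled, held_out))
    return trials


def mean_over_trials(trials, evaluate):
    """Average a per-trial metric, e.g. an FRR, over all repeats."""
    return statistics.mean(evaluate(enrolled, held_out)
                           for enrolled, held_out in trials)
```

In practice `evaluate` would compute the anchor embedding from the enrolled utterances and score the held-out ones against it.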

Similar to [12], we use a two-stage approach to reduce the overall compute cost and accommodate the Transformer-based architecture on device for voice trigger detection. We first run light-weight fully-connected neural networks on continuous audio and obtain audio segments of keyword candidates using hidden Markov model (HMM) alignments. Then only the detected audio segment is fed into the baseline/proposed model and a voice trigger score is recomputed. See [12] for more details.

4.2 Model training

We use a speaker independent voice trigger detector proposed in [12] as a baseline. The baseline system has an encoder-decoder architecture that is trained with the speaker independent phonetic and keyword phrase losses on the voice trigger data. The input features are 40-dimensional log mel-filter bank features with ±3 context frames, sub-sampled once per three frames, which reduces computational complexity. We also normalize the features using the global mean and variance. The phonetic encoder has 6 layers of Transformer encoder blocks, where each multi-head attention block has a hidden dimension of 256 and 4 heads. The feed-forward network has 1024 hidden units. The final encoder output is projected into 54-dimensional logits using a linear layer. This encoder is trained with the CTC loss using the phonetic labels. The decoder consists of one Transformer decoder block with the same hidden dimensions as the encoder. Each query vector has a dimension of 256 and the number of query vectors is fixed to 4. The final decoder output is reshaped into a 1024-dimensional embedding. The baseline approach has the phrase-level CE loss on the decoder output for keyword phrase detection. We also investigate the metric-based inference described in Section 3.4 even though the baseline model is not trained with the metric-learning loss.
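The feature preparation can be illustrated with a small sketch of frame splicing followed by sub-sampling. It assumes a symmetric context of three frames on each side of the current frame and pads the utterance edges by repeating the first and last frame; both the edge handling and the function name `splice_and_subsample` are assumptions, not details given in the text.

```python
def splice_and_subsample(frames, context=3, stride=3):
    """Stack each kept frame with its neighbours, keeping every stride-th.

    frames: list of per-frame feature vectors (e.g. 40 log mel values).
    Returns a list of vectors of size (2 * context + 1) * len(frames[0]).
    """
    n = len(frames)
    spliced = []
    for t in range(0, n, stride):
        window = []
        for offset in range(-context, context + 1):
            # Clamp the index so edge frames are repeated as padding.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced
```

Under these assumptions each 40-dimensional frame becomes a 280-dimensional input vector, and the sequence length seen by the encoder drops by a factor of three.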

For our proposed approach, we add another linear layer with a dropout of 0.6 on top of the decoder for the speaker-ID loss with the 131k speakers. In addition, we initialize our proposed model with the weights of the baseline model and fix the encoder weights to take advantage of the phonetic performance. We only fine-tune the decoder weights in a transfer-learning fashion with the keyword-phrase CE loss, the speaker-identification CE loss and the metric learning loss. Also, we use the penultimate encoder layer embedding as the decoder input ($n = 5$). The scaling factors in Eq. (3), $\alpha$, $\beta$ and $\gamma$, are empirically set to 1, 1 and 0.1, respectively. The optimizer is Adam, where the learning rate is linearly increased to 0.001 by epoch 2 and then linearly decayed to 0.0007 over the next 25 epochs. We then exponentially decay the learning rate, with a minimum learning rate of 1e-7, until the last epoch, which is set at 40. We use 64 GPUs for training and the batch size is 128 on each GPU.
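The learning rate schedule can be sketched as a piecewise function of the epoch. Two details are assumptions because the text does not specify them: the warm-up is taken to start from zero, and the exponential phase is taken to reach the minimum learning rate exactly at the last epoch. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, peak=1e-3, warmup_end=2, plateau=7e-4,
                  decay_epochs=25, min_lr=1e-7, last_epoch=40):
    """Warm-up, linear decay, then exponential decay (assumed shape)."""
    if epoch <= warmup_end:
        # Linear warm-up from 0 (assumption) to the peak rate.
        return peak * epoch / warmup_end
    linear_end = warmup_end + decay_epochs
    if epoch <= linear_end:
        # Linear decay from the peak to the plateau value.
        frac = (epoch - warmup_end) / decay_epochs
        return peak + frac * (plateau - peak)
    # Exponential decay from the plateau value down to min_lr, reached
    # at last_epoch (assumption); never go below min_lr.
    frac = (epoch - linear_end) / (last_epoch - linear_end)
    return max(min_lr, plateau * (min_lr / plateau) ** frac)
```

Fractional epochs are accepted, so the same function can be called per step with `epoch = step / steps_per_epoch`.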

4.3 Results

Figure 2: DET Curve for evaluation set. The vertical dotted line indicates an operating point.

Figure 2 shows the detection error trade-off (DET) curves for the baseline and the proposed approach. The horizontal axis represents FA/hr and the vertical axis represents false reject rates (FRRs). Table 1 shows the FRRs at our operating point of 0.01 FA/hr. The baseline FRR is 3.8% when using the phonetic branch for inference. The phrase branch of the baseline shows regressions compared to the phonetic branch even when we apply metric-based inference on the decoder embedding. In addition, fine-tuning the decoder on the speaker-ID data with the speaker independent phrase loss does not improve the performance. In the case of the proposed model, we can see that the new MTL improves FRRs. This improvement indicates that the speaker information helps adapt keyword phrase detection through the speaker-adapted score. Additionally, this speaker-adapted score shows the effectiveness of the embedding space being structured by the metric learning. By combining the speaker-adapted score and the speaker independent score from the phonetic branch, we see further improvement in the FRRs. The proposed approach reduces the FRR by 38% relative to the baseline model trained in a speaker independent fashion.
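Reading an FRR off a DET curve at a fixed FA/hr operating point can be sketched as follows. This is a generic procedure on per-utterance scores, not necessarily how the two-stage system in this work is scored. The function name is hypothetical, and the sketch assumes one score per negative detection candidate and that a score strictly above the threshold triggers.

```python
import math


def frr_at_fa_per_hour(positive_scores, negative_scores,
                       negative_hours, target_fa_per_hour):
    """Return (FRR, threshold) at the given false-accept rate per hour."""
    # Number of false accepts allowed over the whole negative set.
    allowed = math.floor(target_fa_per_hour * negative_hours + 1e-9)
    ranked = sorted(negative_scores, reverse=True)
    # Pick the threshold so that at most `allowed` negatives exceed it.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    false_rejects = sum(1 for s in positive_scores if s <= threshold)
    return false_rejects / len(positive_scores), threshold
```

With 2k hours of negative audio and a target of 0.01 FA/hr, the threshold in this sketch permits at most 20 false accepts.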

Branch FRRs
Baseline [12]
+ fine-tuning w/ spk-ID data
and (=0.4)
and (=0.8)
and (=0.95)
and (=0.99)
Table 1: False reject rates [%] at an operating point of 0.01 FA/hr.

Table 2 shows an ablation study for the proposed approach. When we train our proposed model from scratch with the last encoder layer ($n = 6$) as the decoder input, we see a slight improvement with the metric branch over the baseline. We also see that the model fails to generalize in the absence of the phonetic loss, similar to the results reported in [27]. Initializing with the baseline model helps retain the phonetic performance; however, the FRR degrades when the decoder takes the last encoder layer as input. This could be because the encoder performs speaker independent phoneme prediction, where speaker information can be diminished at the last layer. Using the intermediate encoder layer ($n = 5$), we observe improvements. The rest of Table 2 highlights the importance of the three losses on the decoder. We also observe that any fine-tuning of the encoder with the CTC loss introduces performance degradation.

Init. Branch FRRs
Random 6
Random 6 82.84
Pretrained Fixed 6
Pretrained Fixed 5 2.48
Pretrained Fixed 5 4.53
Pretrained Fixed 5 12.25
Pretrained Fixed 5 7.50
Pretrained Fine-tuned 5
Table 2: Ablation study.

5 Conclusions

We propose a novel approach for improving voice trigger detection by adapting to speaker information using metric learning. Our model employs an encoder-decoder architecture, where the encoder performs phoneme prediction for speaker independent voice trigger detection while the decoder predicts an utterance-wise embedding for speaker-adapted voice trigger detection. The speaker-adapted voice trigger score is obtained by computing a similarity between an anchor embedding for each speaker and the decoder embedding for a test utterance. Experimental results show that our proposed approach outperforms the baseline speaker independent voice trigger detector by 38% relative in terms of FRR.


  • [1] O. Abdel-Hamid and H. Jiang (2013) Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7942–7946.
  • [2] S. Adya, V. Garg, S. Sigtia, P. Simha, and C. Dhir (2020) Hybrid transformer/CTC networks for hardware efficient voice triggering. In Interspeech, pp. 3351–3355.
  • [3] X. Anguera and M. Ferrarons (2013) Memory efficient subsequence DTW for query-by-example spoken term detection. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
  • [4] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates (2017) Convolutional recurrent neural networks for small-footprint keyword spotting. In Interspeech, pp. 1606–1610.
  • [5] A. Berg, M. O’Connor, and M. T. Cruz (2021) Keyword transformer: a self-attention model for keyword spotting. arXiv preprint arXiv:2104.00769.
  • [6] E. Ceolini, J. Anumula, S. Braun, and S. Liu (2019) Event-driven pipeline for low-latency low-compute keyword spotting and speaker verification system. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7953–7957.
  • [7] G. Chen, C. Parada, and T. N. Sainath (2015) Query-by-example keyword spotting using long short-term memory networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5236–5240.
  • [8] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha (2019) Temporal convolution for real-time keyword spotting on mobile devices. arXiv preprint arXiv:1904.03814.
  • [9] S. Fernández, A. Graves, and J. Schmidhuber (2007) An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks, pp. 220–229.
  • [10] T. J. Hazen, W. Shen, and C. White (2009) Query-by-example spoken term detection using phonetic posteriorgram templates. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 421–426.
  • [11] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw (2017) Streaming small-footprint keyword spotting using sequence-to-sequence models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481.
  • [12] T. Higuchi, A. Gupta, and C. Dhir (2021) Multi-task learning with cross attention for keyword spotting. arXiv preprint arXiv:2107.07634.
  • [13] J. Hou, L. Zhang, Y. Fu, Q. Wang, Z. Yang, Q. Shao, and L. Xie (2021) The NPU system for the 2020 personalized voice trigger challenge. arXiv preprint arXiv:2102.13552.
  • [14] J. Huang, W. Gharbieh, H. S. Shim, and E. Kim (2021) Query-by-example keyword spotting system using multi-head attention and soft-triple loss. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6858–6862.
  • [15] Y. Jia, X. Wang, X. Qin, Y. Zhang, X. Wang, J. Wang, and M. Li (2021) The 2020 personalized voice trigger challenge: open database, evaluation metrics and the baseline systems. arXiv preprint arXiv:2101.01935.
  • [16] M. O. Khursheed, C. Jose, R. Kumar, G. Fu, B. Kulis, and S. K. Cheekatmalla (2021) Tiny-CRNN: streaming wakeword detection in a low footprint setting. arXiv preprint arXiv:2109.14725.
  • [17] B. Kim, M. Lee, J. Lee, Y. Kim, and K. Hwang (2019) Query-by-example on-device keyword spotting. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 532–538.
  • [18] S. Majumdar and B. Ginsburg (2020) MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition. arXiv preprint arXiv:2004.08531.
  • [19] D. R. Miller, M. Kleber, C. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish (2007) Rapid and accurate spoken term detection. In Eighth Annual Conference of the International Speech Communication Association.
  • [20] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni (2016) Multi-task learning and weighted cross-entropy for DNN-based keyword spotting. In Interspeech, Vol. 9, pp. 760–764.
  • [21] R. Rikhye, Q. Wang, Q. Liang, Y. He, D. Zhao, Y. Huang, A. Narayanan, and I. McGraw (2021) Personalized keyphrase detection using speaker and environment information. arXiv preprint arXiv:2104.13970.
  • [22] A. Rosenberg, K. Audhkhasi, A. Sethy, B. Ramabhadran, and M. Picheny (2017) End-to-end speech recognition and keyword search on low-resource languages. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5280–5284.
  • [23] T. N. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association.
  • [24] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny (2013) Speaker adaptation of neural network acoustic models using i-vectors. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 55–59.
  • [25] A. Senior and I. Lopez-Moreno (2014) Improving DNN speaker independence with i-vector inputs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 225–229.
  • [26] J. Shor, D. Emanuel, O. Lang, O. Tuval, M. Brenner, J. Cattiau, F. Vieira, M. McNally, T. Charbonneau, M. Nollstadt, et al. (2019) Personalizing ASR for dysarthric and accented speech with limited data. arXiv preprint arXiv:1907.13511.
  • [27] S. Sigtia, P. Clark, R. Haynes, H. Richards, and J. Bridle (2020) Multi-task learning for voice trigger detection. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7449–7453.
  • [28] S. Sigtia, E. Marchi, S. Kajarekar, D. Naik, and J. Bridle (2020) Multi-task learning for speaker verification and voice trigger detection. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6844–6848.
  • [29] M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni (2016) Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 474–480.
  • [30] R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. arXiv preprint arXiv:1710.10361.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
  • [32] T. Viglino, P. Motlicek, and M. Cernak (2019) End-to-end accented speech recognition. In Interspeech, pp. 2140–2144.
  • [33] M. Weintraub (1993) Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 463–466.
  • [34] Y. Yuan, Z. Lv, S. Huang, and L. Xie (2019) Verifying deep keyword spotting detection with acoustic word embeddings. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 613–620.
  • [35] Y. Zhang and J. R. Glass (2009) Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 398–403.
  • [36] Y. Zhuang, X. Chang, Y. Qian, and K. Yu (2016) Unrestricted vocabulary keyword spotting using LSTM-CTC. In Interspeech, pp. 938–942.