
Multi-task Learning with Cross Attention for Keyword Spotting

07/15/2021
by Takuya Higuchi, et al.

Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data. In this approach, an output of an acoustic model is split into two branches for the two tasks, one for phoneme transcription trained with the ASR data and one for keyword classification trained with the KWS data. In this paper, we introduce a cross attention decoder in the multi-task learning framework. Unlike the conventional multi-task learning approach with the simple split of the output layer, the cross attention decoder summarizes information from a phonetic encoder by performing cross attention between the encoder outputs and a trainable query sequence to predict a confidence score for the KWS task. Experimental results on KWS tasks show that the proposed approach outperformed the conventional multi-task learning with split branches and a bi-directional long short-term memory decoder by 12% on average.


1 Introduction

(Work performed at Apple.)

Keyword spotting (KWS) is a task to detect a keyword phrase from audio. KWS enables users to activate voice assistant systems on devices, such as smart phones and smart speakers, by simply speaking the keyword phrase. For usability and privacy of users, it is important to deploy an accurate KWS system on-device.

A typical approach for KWS is to train a keyword-specific acoustic model to predict a confidence score for each keyword phrase. Earlier works use deep neural networks with a hidden Markov model (HMM) [6, 16, 15, 22, 14, 9, 26, 21], and more recent works use convolutional neural networks (CNNs) [18] and recurrent neural networks [7, 3]. Hardware-friendly model architectures have also been investigated for small-footprint KWS [24, 2, 11, 27]. These models are typically trained on a KWS dataset, which consists of pairs of audio and corresponding phrase-level labels. Although these approaches directly optimize for the target task, preparing a large labeled in-domain KWS dataset is challenging in practice due to, e.g., the sparsity of false triggers and privacy concerns.

Another approach is to use an acoustic model of an automatic speech recognition (ASR) system. The ASR acoustic model (e.g., a phoneme classifier) is trained on a transcribed speech dataset (ASR dataset) to perform ASR [28, 17, 10, 1]. At inference, a decoding score for a particular keyword phrase is computed, which corresponds to a confidence score of the presence of the keyword phrase. The advantage of this approach is that a large transcribed ASR dataset can be used for model training, and the keyword phrase is configurable at test time.

Recently, multi-task learning has been applied to KWS [15, 19, 20] to better generalize models by leveraging both large ASR and in-domain KWS datasets. In this framework, the output layer of the acoustic model is split into two branches for the two tasks. The model is then trained on both the ASR and KWS loss functions.

In this paper, we introduce a cross attention decoder in a multi-task learning framework. Unlike conventional multi-task learning, the cross attention decoder summarizes information from the acoustic model (i.e., a phonetic encoder) using attention layers. Hidden representations from the phonetic encoder are fed into the cross attention decoder, and cross attention between the phonetic representations and a query sequence is performed to predict a scalar confidence score for the KWS task. The phonetic encoder and the cross attention decoder are jointly trained using the multi-task learning framework. Experimental results on KWS tasks show that the proposed cross attention decoder outperformed the conventional multi-task learning and a bi-directional long short-term memory (BLSTM)-based decoder by 12% on average.

The remainder of this paper is organized as follows. Section 2 reviews related work and describes the contributions of this paper. Section 3 presents our proposed approach. Section 4 describes experimental evaluation and Section 5 concludes the paper.

2 Related work

T. Bluche et al. also proposed to use a decoder on top of phoneme classifier outputs [5]. A BLSTM keyword encoder was trained to predict phrase level confidence scores based on outputs of an LSTM phoneme classifier and a specified keyword phrase. The LSTM phoneme classifier was pre-trained on an ASR dataset and fixed during training of the keyword encoder. In contrast, our encoder and decoder are jointly trained from scratch, exploiting both ASR and KWS data in the multi-task learning framework. Moreover, we use an attention-based architecture for the decoder, following recent successes of Transformers in ASR.

Transformers were originally proposed in [25] and applied to ASR (e.g., [12]). Adya et al. used the vanilla Transformer as a phoneme classifier for KWS [1]. Unlike the original Transformer decoder used in these works, our cross attention decoder is not an auto-regressive model, since it predicts a scalar confidence score for a keyword phrase given an audio sequence. The length of the sequence of query vectors is fixed for our decoder, and the query vectors are jointly trained with the model parameters.

Tian et al. also applied multi-task learning to KWS, using both ASR and KWS data to train a recurrent neural network transducer model [23]. Only ASR data was used to train the prediction network so that it did not overfit to the KWS data, and both ASR and KWS data were used to train the encoder. Phoneme-level (or syllable-level) labels were used to train the encoder on both the ASR and KWS data. In contrast, our proposed approach uses phoneme-level labels for the ASR data and phrase-level labels for the KWS data. The proposed cross attention decoder summarizes phonetic information and performs a phrase-level prediction, which is used to compute a phrase-level loss on the KWS data. The phrase-level prediction achieves better KWS performance than a phoneme-level prediction, as shown in Section 4.

3 Cross attention for multi-task learning

3.1 Overview

Figure 1: Block diagrams of (a) conventional multi-task learning for KWS [19, 20] and (b) our proposed approach. In the conventional approach, the last layer is simply split into two branches, one for phoneme prediction and one for phrase prediction. In contrast, the proposed approach uses a phonetic encoder for phoneme prediction, and a cross attention decoder is introduced to efficiently summarize phonetic information for phrase prediction.

Figure 1 shows an overview of the conventional multi-task learning approach and the proposed approach. In the conventional approach, the output of the acoustic model is split and fed into two different branches, one for phoneme classification and one for KWS. The model is then trained on both the ASR data and the KWS data using either the phonetic loss or the discriminative loss, depending on which dataset a data sample comes from. Unlike the multi-branch approach of conventional multi-task learning, we introduce a cross attention decoder that operates on top of the phonetic encoder outputs. The decoder takes the outputs of the phoneme classifier as key and value vectors for its attention layers. Cross attention is then performed between the phonetic encoder outputs and a trainable query sequence to predict a score for KWS. The encoder and the decoder are jointly trained in the multi-task learning framework.

3.2 Multi-task learning

In the multi-task learning framework, the model is trained using both a phonetic loss and a phrase loss [15, 19, 20]. Let us assume that we sample $N$ utterances for a mini-batch from a combined set of an ASR dataset and a KWS dataset: $N_{\mathrm{KWS}}$ utterances are sampled from the KWS dataset and $N_{\mathrm{ASR}} = N - N_{\mathrm{KWS}}$ utterances from the ASR dataset. The objective function on a mini-batch for training can be written as

    \mathcal{L} = \sum_{i=1}^{N_{\mathrm{ASR}}} \mathcal{L}^{\mathrm{phone}}_i + \alpha \sum_{j=1}^{N_{\mathrm{KWS}}} \mathcal{L}^{\mathrm{phrase}}_j,    (1)

where $\mathcal{L}^{\mathrm{phone}}_i$ and $\mathcal{L}^{\mathrm{phrase}}_j$ denote the phonetic and phrase losses computed on utterances from the ASR dataset and the KWS dataset, respectively, and $\alpha$ is a scaling factor balancing the phonetic loss and the phrase loss. To train the acoustic model on these two losses, the output layer of the acoustic model is typically split into two branches, one for the phonetic loss and one for the phrase loss.
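As a concrete illustration of Eq. (1), the following is a minimal PyTorch-style sketch of the mixed mini-batch objective; it assumes each utterance is tagged with its source dataset, and the function, field, and loss names are illustrative rather than the authors' implementation (the phrase loss is shown generically; Section 4.3 uses a phrase-level CTC loss for the conventional split-branch model and a cross entropy loss for the proposed decoder).

```python
import torch

def minibatch_objective(batch, model, phonetic_loss_fn, phrase_loss_fn, alpha):
    """Sketch of Eq. (1) on a mixed mini-batch: ASR utterances contribute the
    phonetic loss, KWS utterances contribute the phrase loss, and alpha balances
    the two terms. Names and batch layout are assumptions for illustration."""
    phone_losses, phrase_losses = [], []
    for utt in batch:
        if utt["source"] == "asr":
            # Phonetic loss (e.g., CTC) against phoneme labels.
            phone_losses.append(phonetic_loss_fn(model, utt["feats"], utt["phones"]))
        else:
            # Phrase loss against the utterance-level (positive/negative) label.
            phrase_losses.append(phrase_loss_fn(model, utt["feats"], utt["label"]))

    zero = torch.zeros(())
    loss_phone = torch.stack(phone_losses).sum() if phone_losses else zero
    loss_phrase = torch.stack(phrase_losses).sum() if phrase_losses else zero
    return loss_phone + alpha * loss_phrase
```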

3.3 Cross attention decoder

Figure 2: The proposed cross attention decoder.

We introduce a cross attention decoder to perform phrase-level classification based on phonetic predictions from the acoustic model, i.e., the phonetic encoder. Let $X_i = (\mathbf{x}_{i,1}, \dots, \mathbf{x}_{i,T_i})$ denote the sequence of input features at time frames $1$ to $T_i$ for the $i$-th utterance sampled from the ASR dataset. The phonetic encoder transforms the input feature sequence into a sequence of hidden representations as

    H_i = \mathrm{Encoder}(X_i),    (2)

where $H_i$ denotes the sequence of hidden representations and $\mathrm{Encoder}(\cdot)$ denotes the mapping function of the phonetic encoder, defined with neural networks. A linear layer is then applied to project the hidden representations to logits, and the CTC loss for the $i$-th utterance, $\mathcal{L}^{\mathrm{phone}}_i$, is computed using the logits and the phoneme labels.
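As a sketch of Eq. (2) and the phonetic branch, the module below projects the encoder outputs to phoneme (and blank) logits and computes the CTC loss with PyTorch's nn.CTCLoss; the class name, tensor shapes, and label count are illustrative assumptions consistent with the configuration reported in Section 4.3.

```python
import torch.nn as nn

class PhoneticCTCHead(nn.Module):
    """Sketch of the phonetic branch: encoder hidden representations are projected
    to phoneme/blank logits and scored with CTC. Dimensions are illustrative."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 256, num_labels: int = 54):
        super().__init__()
        self.encoder = encoder                      # maps (B, T, feat_dim) -> (B, T, hidden_dim)
        self.proj = nn.Linear(hidden_dim, num_labels)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, phone_targets, target_lens):
        h = self.encoder(feats)                     # H_i = Encoder(X_i), Eq. (2)
        log_probs = self.proj(h).log_softmax(-1)    # (B, T, num_labels)
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        return self.ctc(log_probs.transpose(0, 1), phone_targets, feat_lens, target_lens)
```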

In addition, a discriminative phrase-level loss is computed using the cross attention decoder on utterances sampled from the KWS dataset. Let us assume that the $j$-th utterance is sampled from the KWS dataset and that $X_j$ denotes the acoustic features of the $j$-th utterance. The features are first processed by the phonetic encoder as

    H_j = \mathrm{Encoder}(X_j)    (3)

to produce a sequence of hidden representations, $H_j$.

Following the recent success of Transformers in ASR, our cross attention decoder is based on Transformer blocks with attention layers. The attention layer for a query matrix $Q$, a key matrix $K$ and a value matrix $V$ can be written as

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,    (4)

where $d_k$ is the dimension of the key and query vectors. Figure 2 shows a detailed block diagram of the cross attention decoder. The decoder takes the encoder output $H_j$ and a sequence of trainable vectors (i.e., query vectors) as inputs. Let $U = (\mathbf{u}_1, \dots, \mathbf{u}_L)$ denote the sequence of query vectors, where $L$ is the length of the query sequence. First, self-attention is performed on $U$ by a multi-head attention layer as

    U' = \mathrm{MultiHead}(U, U, U),    (5)

where $U'$ denotes the output of the self-attention layer and $\mathrm{MultiHead}$ denotes the multi-head attention layer described as

    \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O},    (6)

where $\mathrm{head}_m = \mathrm{Attention}(Q W^{Q}_m, K W^{K}_m, V W^{V}_m)$; the queries, keys and values of each head are obtained by applying affine transformations to the inputs, and different heads use different affine transformations. Then, cross attention is performed between the output of the self-attention layer and the phonetic encoder output by $\mathrm{MultiHead}(U', H_j, H_j)$, where the queries are obtained by applying an affine transformation to $U'$, and the keys and values are obtained by applying affine transformations to $H_j$. The output of the cross attention layer is fed into position-wise feedforward networks. Residual connections and layer normalization [4] are used after each attention block and feedforward block, following the original Transformer. The Transformer block is repeated $N_{\mathrm{dec}}$ times. Finally, the output of the last Transformer block is reshaped and fed into a linear layer to predict a scalar logit for the keyword phrase. Unlike the original Transformer architecture, our decoder for KWS is not an auto-regressive model, and the length of the query sequence, $L$, is fixed, which enables the reshaping so that a linear layer can produce a scalar logit for each keyword phrase per audio sequence. Moreover, positional encoding is not required for the query vectors, since they are jointly optimized with the model parameters and then fixed for any audio input at inference.

The phrase loss for the $j$-th utterance, $\mathcal{L}^{\mathrm{phrase}}_j$, is defined as the cross entropy between the logits from the decoder and the utterance-wise phrase labels. The encoder and the decoder are jointly trained in the multi-task learning framework using Eq. (1).
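To make the decoder concrete, here is a hedged PyTorch sketch of the cross attention decoder described above, using the configuration reported in Section 4.3 (a single decoder block, model dimension 256, 4 heads, a query sequence of length 4, and a 1024-unit feedforward layer); class and layer names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CrossAttentionDecoder(nn.Module):
    """Sketch of the cross attention decoder: a trainable query sequence attends to
    itself, then cross-attends to the phonetic encoder outputs, and the reshaped
    result is projected to utterance-level phrase logits."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_queries: int = 4,
                 ffn_dim: int = 1024, num_classes: int = 2):
        super().__init__()
        # Trainable query sequence U; no positional encoding is needed.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Reshape (L x d_model) -> phrase logits; here L * d_model = 4 * 256 = 1024.
        self.out = nn.Linear(n_queries * d_model, num_classes)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (B, T, d_model) hidden representations H_j from the phonetic encoder.
        B = enc_out.size(0)
        u = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, L, d_model)
        x, _ = self.self_attn(u, u, u)                    # Eq. (5): self-attention on U
        u = self.norm1(u + x)
        x, _ = self.cross_attn(u, enc_out, enc_out)       # cross attention with H_j
        u = self.norm2(u + x)
        u = self.norm3(u + self.ffn(u))                   # position-wise feedforward
        return self.out(u.reshape(B, -1))                 # (B, num_classes) phrase logits

# The phrase loss is the cross entropy between these logits and the phrase label:
# loss_phrase = nn.functional.cross_entropy(decoder(enc_out), phrase_labels)
```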

4 Experimental evaluation

We evaluated the effectiveness of the proposed approach on a KWS task and compared its performance with a self-attention phoneme classifier with and without the conventional multi-task learning, and with a BLSTM decoder trained with the conventional multi-task learning. Although we used our internal datasets in the experiments, our proposed approach is easily applicable to any public ASR and KWS datasets.

4.1 Data

Our ASR training data consisted of approximately 3 million utterances of transcribed near-field speech signals recorded with devices such as smartphones. Data augmentation was then performed by convolving room impulse responses (RIRs) with the speech signals. The RIRs were recorded in various rooms with smart speakers with six microphones. Additionally, echo residuals were added to the augmented data. As a result, we obtained approximately 9 million augmented utterances consisting of the near-field signals, simulated far-field signals, and simulated far-field signals with echo residuals. The KWS data consisted of false triggers and true triggers spoken by anonymous speakers, collected with a reference voice triggering system. The audio signals were recorded with smart speakers. The KWS data were combined with the augmented ASR dataset, and utterances were randomly sampled from the combined dataset for mini-batch training.

For evaluation, we used two different datasets. The first is a structured dataset, where positive samples containing the keyword phrase were internally collected in controlled conditions from 100 participants, approximately evenly divided between males and females. Each subject spoke the keyword phrase followed by prompted voice commands to a smart speaker. The recordings were made in four acoustic conditions: quiet, external noise from a TV or kitchen appliances, music playing from the device at medium volume, and music playing at loud volume. 13000 such positive utterances were collected. For negative data, we used 2000 hours of audio recordings that did not contain the keyword phrase, obtained by playing podcasts, audiobooks, TV play-back, etc. These negative audio samples were also recorded with the same smart speaker. The negative audio data allowed us to compute false accepts (FA) per hour. The second dataset, called the take home evaluation set, is a more realistic and challenging dataset collected at home by employees. Each of the 80 participants volunteered to use the smart speaker daily for two weeks. By applying extra audio logging on device with personal review by the user, audio below the usual on-device trigger threshold was collected. This setup allowed us to collect challenging negative data that was similar to the keyword phrase. We collected 7896 positive and 20919 negative audio samples for evaluation. (The size of this dataset has been increased by additional participants compared to the evaluation dataset used in [1] and [19], so the results reported in this paper are not directly comparable.) This dataset allowed us to compute the absolute number of false accepts (FAs).

4.2 Two-stage approach for efficient KWS

Figure 3: A two-stage approach for efficient KWS [8, 21]. A 1st pass light-weight KWS system is always-on and takes streaming audio signals, where a DNN-HMM system is used to obtain a KWS score and an alignment for an audio segment containing a keyword. Once the 1st pass KWS score exceeds a threshold, the audio segment is passed to a bigger KWS model (so-called checker) and a KWS score is re-computed.

We used a two-stage approach for efficient KWS [8, 21], as shown in Figure 3. A light-weight model was always on and first detected candidate audio segments from streaming audio inputs. Once segments were detected, a bigger model (the so-called checker) was turned on and checked whether the segments actually contained the keyword phrase. This two-stage approach greatly reduces compute cost and battery consumption on-device. For the 1st pass model, we used five layers of fully-connected neural networks with 64 hidden units as the acoustic model. We used 20 target classes for the acoustic model: 18 phoneme classes for the keyword, one for silence, and one for other speech. We computed a 13-dimensional MFCC feature at a rate of 100 frames per second and supplied 19 consecutive frames to the acoustic model. The confidence scores for KWS and the alignments used to extract audio segments were obtained using an HMM. Given the keyword start and end times from the HMM alignment, we extended the segment by fixed margins (in seconds) before the start and after the end to ensure that the segment contained the detected keyword portion. The 1st pass threshold was set to obtain approximately 21 FA/hr on the structured evaluation dataset. We used the same 1st pass system for all the experiments and evaluated the effectiveness of our proposed model as the checker.
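The cascade logic itself is simple; the sketch below illustrates it under stated assumptions: first_pass is assumed to return a score plus HMM-aligned keyword start/end times, checker returns a confidence for the extracted segment, and the thresholds, margin, and sample rate are placeholders rather than values from the paper.

```python
def two_stage_kws(stream, first_pass, checker, first_threshold, checker_threshold,
                  margin_s=0.2, sample_rate=16000):
    """Sketch of the two-stage KWS cascade in Figure 3. The always-on 1st pass model
    proposes a keyword candidate; only if its score exceeds the 1st pass threshold is
    the segment cut out and re-scored by the larger checker model."""
    score, start_s, end_s = first_pass(stream)        # assumed 1st pass interface
    if score < first_threshold:
        return False                                  # checker is never run

    # Extend the HMM-aligned keyword boundaries by a margin on each side
    # (margin value is a placeholder) so the segment contains the full keyword.
    begin = max(0, int((start_s - margin_s) * sample_rate))
    end = min(len(stream), int((end_s + margin_s) * sample_rate))
    segment = stream[begin:end]

    return checker(segment) >= checker_threshold      # final trigger decision
```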

4.3 Model training

For a baseline phoneme classifier, we used a self-attention based acoustic model. The model consisted of 6 Transformer blocks, each of which had a multi-head self-attention layer with a hidden dimension of 256 and 4 heads, followed by a feedforward neural network with 1024 hidden units. Finally, the outputs of the Transformer blocks were projected to 54-dimensional logits for the phonetic and blank labels by a linear layer. The baseline model was trained with the CTC loss. (In [1], the vanilla Transformer decoder was also trained along with the self-attention encoder using a cross entropy loss and used as a regularizer during training. We omitted this regularization for simplicity; it can be applied to all the approaches in our experiments, including the proposed approach.) The same architecture was also used for the conventional multi-task learning [19] by splitting the last layer into 54 outputs for the phonetic CTC loss and three discriminative outputs for a positive class, a negative class and a blank label for the phrase-level CTC loss. For the proposed approach, we used the same self-attention phoneme classifier as the phonetic encoder. The cross attention decoder consisted of a single Transformer decoder block (i.e., $N_{\mathrm{dec}} = 1$) with the same configuration as the Transformer blocks of the encoder, except for the cross attention block. The dimension of the query vectors and the length of the query sequence were set to 256 and 4, respectively. The last linear layer projected the reshaped 1024-dimensional vector to two logits for the positive and negative classes. The encoder and the decoder were jointly trained using the phonetic CTC loss and the phrase-level cross entropy loss (see Section 3). We also explored a BLSTM decoder by replacing the cross attention decoder with a layer of BLSTMs with 256 hidden units, followed by a linear layer that processed the concatenated BLSTM outputs at the first and last frames to predict the logits. The scaling factor in Eq. (1) for the multi-task learning was set experimentally. Log mel-filter bank features with 3 context frames were used as inputs. In addition, we sub-sampled the features once per three frames to reduce computational complexity. All models were trained using the Adam optimizer [13]. The learning rate was first increased linearly to a peak value over the initial epochs, then decayed linearly, and finally decreased exponentially until the last epoch. We used 16 GPUs for training and the batch size was 128 on each GPU.
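The exact learning rates and epoch boundaries did not survive extraction, but the three-phase shape of the schedule (linear warm-up, linear decay, then exponential decay) can be written as a PyTorch LambdaLR sketch; all hyperparameter values below are placeholders.

```python
from torch.optim.lr_scheduler import LambdaLR

def make_lr_schedule(optimizer, warmup_epochs=2, linear_decay_end=10,
                     linear_floor=0.1, exp_gamma=0.7):
    """Sketch of the described schedule: the multiplier rises linearly to 1.0,
    decays linearly to a floor, then decays exponentially until training ends.
    Epoch counts and rates are illustrative, not the paper's values."""
    def lr_lambda(epoch):
        if epoch < warmup_epochs:                    # linear warm-up
            return (epoch + 1) / warmup_epochs
        if epoch < linear_decay_end:                 # linear decay to a floor
            frac = (epoch - warmup_epochs) / max(1, linear_decay_end - warmup_epochs)
            return 1.0 - (1.0 - linear_floor) * frac
        return linear_floor * (exp_gamma ** (epoch - linear_decay_end))  # exponential decay

    return LambdaLR(optimizer, lr_lambda)

# Usage: scheduler = make_lr_schedule(optimizer); call scheduler.step() once per epoch.
```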

4.4 Results

Figure 4: DET curves for structured evaluation set. The vertical dotted line indicates an operating point.
Figure 5: DET curves for take home evaluation set. The vertical dotted line indicates an operating point.
Model                     Branch     Structured evaluation set   Take home evaluation set   Avg.
Phoneme classifier        Phonetic   20.26                       27.72                      23.99
Conventional MTL [19]     Phonetic    5.00                       14.11                       9.56
                          Phrase      3.49                       10.11                       6.80
BLSTM decoder             Phonetic    5.02                       12.36                       8.69
                          Phrase      4.76                        8.89                       6.83
Cross attention decoder   Phonetic    4.64                       13.21                       8.93
                          Phrase      3.82                        8.17                       6.00
Table 1: False reject ratios [%] for the structured evaluation set at an operating point of 1 FA/100 hrs, and for the take home evaluation set at an operating point of 100 FAs.

Figures 4 and 5 show detection error tradeoff (DET) curves for all models evaluated on the structured evaluation dataset and the take home evaluation dataset, respectively. The horizontal axis represents FA/hr for the structured dataset and the absolute number of FAs for the take home dataset. The vertical axis represents false reject ratios (FRRs). Table 1 shows the FRRs obtained with the baseline and proposed models at the operating points. For multi-task learning, results from both the phonetic and phrase branches are reported. First, multi-task learning significantly improved the FRRs compared to the phoneme classifier trained only on the ASR data. This result shows the effectiveness of using both the ASR and the KWS data for KWS model training. Second, the phrase branch always yielded better results than the phonetic branch, presumably because the phrase branch was directly optimized for the target task. Note that although the performance of the phonetic branch was not as good as that of the phrase branch, the phonetic branch has the advantage of flexibility: the keyword phrase is configurable at test time. Lastly, the proposed cross attention decoder with the phrase branch achieved the best performance and reduced the FRRs by 12% on average compared to the conventional multi-task learning and the BLSTM decoder. The cross attention decoder has another advantage over the BLSTM decoder, namely lower training time and runtime cost, as reported in [1].

Even though the proposed decoder can effectively learn from the KWS training data (the cross validation loss with the conventional multi-task learning was higher than that with the cross attention decoder), the proposed approach with the phrase branch did not outperform the conventional multi-task learning on the structured evaluation set. This performance degradation could be due to mismatched conditions/distributions between the KWS training data and the structured evaluation dataset, which was recorded in controlled conditions.

5 Conclusions

We proposed a cross attention decoder in the multi-task learning framework for KWS. The cross attention decoder performs cross attention between the hidden representations from the phonetic encoder and a trainable query sequence, and then predicts a confidence score for the KWS task. The phonetic encoder and the cross attention decoder were jointly trained in the multi-task learning framework, leveraging both the ASR and KWS datasets. The proposed approach outperformed the conventional multi-task learning and the BLSTM decoder by 12% on average. Our future work includes an extension of this approach to open vocabulary KWS.

References

  • [1] S. Adya, V. Garg, S. Sigtia, P. Simha, and C. Dhir (2020) Hybrid transformer/ctc networks for hardware efficient voice triggering. In Interspeech, pp. 3351–3355. Cited by: §1, §2, §4.4, footnote 1, footnote 2.
  • [2] R. Alvarez and H. Park (2019) End-to-end streaming keyword spotting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6336–6340. Cited by: §1.
  • [3] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates (2017) Convolutional recurrent neural networks for small-footprint keyword spotting. In Interspeech, pp. 1606–1610. Cited by: §1.
  • [4] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.3.
  • [5] T. Bluche and T. Gisselbrecht (2020) Predicting detection filters for small footprint open-vocabulary keyword spotting. In Interspeech, pp. 2552–2556. Cited by: §2.
  • [6] G. Chen, C. Parada, and G. Heigold (2014) Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091. Cited by: §1.
  • [7] S. Fernández, A. Graves, and J. Schmidhuber (2007) An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks, pp. 220–229. Cited by: §1.
  • [8] A. Gruenstein, R. Alvarez, C. Thornton, and M. Ghodrat (2017) A cascade architecture for keyword spotting on mobile devices. arXiv preprint arXiv:1712.03603. Cited by: Figure 3, §4.2.
  • [9] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Ström, and A. Mandal (2018) Time-delayed bottleneck highway networks using a dft feature for keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5489–5493. Cited by: §1.
  • [10] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw (2017) Streaming small-footprint keyword spotting using sequence-to-sequence models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481. Cited by: §1.
  • [11] T. Higuchi, M. Ghasemzadeh, K. You, and C. Dhir (2020) Stacked 1d convolutional networks for end-to-end small footprint voice trigger detection. In Interspeech, pp. 2592–2596. Cited by: §1.
  • [12] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al. (2019) A comparative study on transformer vs rnn in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456. Cited by: §2.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • [14] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari, and A. Mandal (2017) Direct modeling of raw audio with dnns for wake word detection. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 252–257. Cited by: §1.
  • [15] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni (2016) Multi-task learning and weighted cross-entropy for dnn-based keyword spotting.. In Interspeech, Vol. 9, pp. 760–764. Cited by: §1, §1, §3.2.
  • [16] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath (2015) Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4704–4708. Cited by: §1.
  • [17] A. Rosenberg, K. Audhkhasi, A. Sethy, B. Ramabhadran, and M. Picheny (2017) End-to-end speech recognition and keyword search on low-resource languages. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5280–5284. Cited by: §1.
  • [18] T. N. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [19] S. Sigtia, P. Clark, R. Haynes, H. Richards, and J. Bridle (2020) Multi-task learning for voice trigger detection. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 7449–7453. External Links: Document Cited by: §1, Figure 1, §3.2, §4.3, Table 1, footnote 1.
  • [20] S. Sigtia, J. Bridle, H. Richards, P. Clark, E. Marchi, and V. Garg (2020) Progressive voice trigger detection: accuracy vs latency. arXiv preprint arXiv:2010.15446. Cited by: §1, Figure 1, §3.2.
  • [21] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle (2018) Efficient Voice Trigger Detection for Low Resource Hardware. In INTERSPEECH, pp. 2092–2096. Cited by: §1, Figure 3, §4.2.
  • [22] M. Sun, D. Snyder, Y. Gao, V. K. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni (2017) Compressed time delay neural network for small-footprint keyword spotting.. In INTERSPEECH, pp. 3607–3611. Cited by: §1.
  • [23] Y. Tian, H. Yao, M. Cai, Y. Liu, and Z. Ma (2021) Improving rnn transducer modeling for small-footprint keyword spotting. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5624–5628. External Links: Document Cited by: §2.
  • [24] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni (2016) Model compression applied to small-footprint keyword spotting.. In INTERSPEECH, pp. 1878–1882. Cited by: §1.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), pp. 1–11. Cited by: §2.
  • [26] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. P. Vitaladevuni, B. Hoffmeister, and A. Mandal (2018) Monophone-based background modeling for two-stage on-device wake word detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5494–5498. Cited by: §1.
  • [27] E. Yılmaz, O. B. Gevrek, J. Wu, Y. Chen, X. Meng, and H. Li (2020) Deep convolutional spiking neural networks for keyword spotting. In Interspeech, pp. 2557–2561. Cited by: §1.
  • [28] Y. Zhuang, X. Chang, Y. Qian, and K. Yu (2016) Unrestricted vocabulary keyword spotting using lstm-ctc.. In Interspeech, pp. 938–942. Cited by: §1.