1 Introduction
Keyword spotting has recently become an essential function of consumer devices, such as mobile phones and smart speakers, because it provides a natural voice user interface. It is mainly used for detecting predefined keywords, e.g., “Alexa”, “Hey Siri”, and “OK Google”, to ready devices to process users’ following commands or queries. Despite the widespread use of this technology in various devices today, keyword spotting remains a challenging problem because it requires a low false rejection rate (FRR) and a low false alarm rate (FAR) while operating with a small memory footprint and low power consumption.
In previous studies, keyword/filler hidden Markov models (HMMs) were proposed, which explicitly model the acoustic characteristics of non-keyword general speech (filler) as well as the target keyword speech [rohlicek1989continuous]. With the recent advances in deep learning, the Gaussian mixture models in the HMMs were replaced with various neural network architectures, such as feed-forward deep neural networks, convolutional neural networks, and convolutional recurrent neural networks (CRNNs) [chen2014small, sainath2015convolutional, arik2017convolutional, lengerich2016end]. Although these deep learning-based approaches significantly improve system performance by increasing modeling capacity, they still require well-predicted, time-aligned labels. Recently, a number of attention-based keyword spotting models have been proposed [he2017streaming, shan2018attention, de2018neural]. While [he2017streaming] used attention in an assistive form to bias RNN-based decoders toward a keyword of interest, [shan2018attention] aggressively deployed the attention mechanism proposed in [chowdhury2017attention]
for direct keyword feature representation, with which a binary classifier discriminates keywords from non-keywords. Since these approaches are based on the basic single-head attention mechanism, it is natural to extend them to multi-head attention. Multi-head attention [vaswani2017attention, chiu2018state] was introduced for joint representation of information in different subspaces while attending to different positions of a sequence. However, as there is no explicit mechanism that guarantees such diversity either in positions or in representational subspaces, each attention head may contain redundant information, which results in inefficiency of the network. [li2018multi] proposed three types of disagreement regularization, i.e., disagreements on subspaces, attended positions, and outputs, to explicitly encourage diversity among attention heads based on cosine similarity.
In this paper, we investigate the use of multi-head attention in keyword spotting tasks and propose an orthogonality-constrained multi-head attention mechanism. The regularization is derived from the constraints that the context vectors and the score vectors of different attention heads be orthogonal to each other, respectively. The regularization by inter-head orthogonality of context and score vectors makes the attention heads less redundant with respect to each other, while the regularization by intra-head non-orthogonality of context vectors gives them consistency across samples for the given task. The regularization presented in this work is related to [li2018multi], but is more oriented toward speech data and keyword spotting tasks. We show that the proposed regularization techniques improve keyword detection performance by reducing false rejection rates with only a small increase in model size.
2 Multi-Head Attention-Based End-to-End Model for Keyword Spotting
2.1 Keyword spotting system description
We extend the single-head attention-based end-to-end network structure for keyword spotting presented in [shan2018attention] to a multi-head attention-based network, as depicted in Fig. 1. The encoder takes as input an acoustic feature $\mathbf{x}_t$, the 40-dimensional Mel-filter bank energies extracted from 16 kHz sampled audio signals with per-channel energy normalization [wang2017trainable], where $t$ is the time frame index, and converts it into a hidden representation $\mathbf{h}_t$. The encoder network consists of a canonical CRNN structure with convolutional and recurrent layers in sequence to capture the spectral and temporal characteristics of the acoustic features. As a base model, we use one convolutional layer, with the kernel size and stride as proposed in [shan2018attention], and one gated recurrent unit (GRU) layer with 64 hidden units. The encoder output is then processed by an attention mechanism in each attention head to produce a context vector $\mathbf{c}_i$, where $i$ denotes the attention head index. Using $H$ attention heads, the context vectors are concatenated as $\mathbf{c} = [\mathbf{c}_1^{\top}, \dots, \mathbf{c}_H^{\top}]^{\top}$, where $^{\top}$ denotes the matrix transpose. Finally, the model performs binary classification with a linear transformation and a softmax operation on $\mathbf{c}$ to compute the posterior probability of the keyword state given the input observation sequence. In the inference stage, we decide that a keyword is detected when this confidence is larger than a preset threshold. Note that this system does not require any graph searching or frame-level alignment of training data, which largely simplifies both training and inference.

2.2 Base attention mechanism
In each attention head, we use the non-linear soft attention mechanism proposed in [chowdhury2017attention] for speaker verification and adopted in [shan2018attention] for keyword spotting. The attention weight $\alpha_{i,t}$ for the $i$-th attention head at the $t$-th time frame is calculated by

$\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{\tau=1}^{T} \exp(e_{i,\tau})},$ (1)

where the scalar score $e_{i,t}$ is calculated by a non-linear scoring function with the parameters shared across time:

$e_{i,t} = \mathbf{v}_i^{\top} \tanh(\mathbf{W}_i \mathbf{h}_t + \mathbf{b}_i).$ (2)

The context vector $\mathbf{c}_i$, the output of each attention head, is then calculated by the weighted sum as follows:

$\mathbf{c}_i = \sum_{t=1}^{T} \alpha_{i,t} \mathbf{h}_t.$ (3)
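As a concrete illustration, the per-head computation described above (scoring, softmax over time, weighted sum) can be sketched in NumPy as follows; the shapes, parameter names, and random initialization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_head(h, W, b, v):
    """One soft attention head over encoder outputs h of shape (T, d):
    scalar scores e_t = v^T tanh(W h_t + b), attention weights via a
    softmax over time, and the context vector as the weighted sum."""
    e = np.tanh(h @ W.T + b) @ v          # (T,) scalar scores
    e = e - e.max()                       # numerical stability for softmax
    alpha = np.exp(e) / np.exp(e).sum()   # (T,) attention weights
    c = alpha @ h                         # (d,) context vector
    return alpha, c

rng = np.random.default_rng(0)
T, d, d_a = 178, 64, 128                  # illustrative sizes only
h = rng.standard_normal((T, d))
W = rng.standard_normal((d_a, d)) * 0.1
b = np.zeros(d_a)
v = rng.standard_normal(d_a) * 0.1
alpha, c = attention_head(h, W, b, v)
```

In a multi-head model, one such head with its own parameters would be run per head and the resulting context vectors concatenated for the classifier.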
3 Orthogonality-Regularized Multi-Head Attention
In speech recognition tasks including keyword spotting, although end-to-end neural networks are very attractive due to the simplicity of their structures and learning procedures, hybrid systems often show competitive or even better performance, as they use explicit sequence models to better leverage the structured information coming from speech subsequences, i.e., phonemes, syllables, or word-pieces [luscher2019rwth]. In this perspective, the multi-head attention mechanism is considered a promising alternative for capturing the structured information from speech subsequences while keeping the end-to-end nature [dong2019self, dong2018speech].
Multi-head attention, proposed in [vaswani2017attention], is capable of learning diverse representations, since different heads can pay attention to different positions in a sequence and give different representations. However, this diversity is not guaranteed in its natural form, as the heads may be redundant in both position and representation. Fig. 2(b) shows an example of multi-head attention weight distributions where 3 of the heads severely overlap with each other. To encourage the diversity of multi-head attention, [li2018multi] proposed three types of disagreement regularization in the context of machine translation, i.e., disagreements on subspaces, attended positions, and outputs, based on maximizing negative cosine similarities. In this section, we propose a regularization technique for training the multi-head attention-based keyword spotting model via orthogonality constraints between attention heads.
3.1 Inter-head orthogonality regularization
We argue that, to capture the temporally structured information in a speech input sequence, the attention heads should pay attention to different parts of the sequence and produce context outputs with minimal redundancy with each other. To achieve this, we introduce regularization of the multi-head attention by orthogonality constraints on context and score vectors between the attention heads. The problem is to find the network parameters that minimize the cross-entropy loss subject to the orthogonality constraints $\mathbf{c}_i^{\top} \mathbf{c}_j = 0$ and $\mathbf{s}_i^{\top} \mathbf{s}_j = 0$ for each pair of heads $i \neq j$, where $\mathbf{s}_i = [e_{i,1}, \dots, e_{i,T}]^{\top}$ denotes the score vector of the $i$-th head. Suppose that we have a training batch of $B$ samples; then we define the regularization terms $L_{C}$ and $L_{S}$ by
$L_{C} = \frac{1}{B} \sum_{b=1}^{B} \left\| \mathbf{C}_b^{\top} \mathbf{C}_b - \mathbf{I} \right\|_F^2,$ (4)

$L_{S} = \frac{1}{B} \sum_{b=1}^{B} \left\| \mathbf{S}_b^{\top} \mathbf{S}_b - \mathbf{I} \right\|_F^2,$ (5)

where $b$ is the sample index, $H$ is the number of attention heads, $\| \cdot \|_F$ denotes the Frobenius norm, and

$\mathbf{C}_b = [\bar{\mathbf{c}}_{b,1}, \dots, \bar{\mathbf{c}}_{b,H}],$ (6)

$\mathbf{S}_b = [\bar{\mathbf{s}}_{b,1}, \dots, \bar{\mathbf{s}}_{b,H}]$ (7)

are the context matrix and the score matrix, respectively, which consist of the normalized context vectors $\bar{\mathbf{c}}_{b,i} = \mathbf{c}_{b,i} / \|\mathbf{c}_{b,i}\|_2$ and the normalized score vectors $\bar{\mathbf{s}}_{b,i} = \mathbf{s}_{b,i} / \|\mathbf{s}_{b,i}\|_2$. Since the columns of these matrices are unit vectors, the off-diagonal entries of $\mathbf{C}_b^{\top} \mathbf{C}_b$ and $\mathbf{S}_b^{\top} \mathbf{S}_b$ are exactly the inter-head cosine similarities to be suppressed.
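Because the context vectors are unit-normalized before stacking, penalizing the deviation of their Gram matrix from the identity penalizes exactly the pairwise inter-head inner products. A minimal NumPy sketch of such an inter-head context regularizer, assuming a (B, H, d) batch of context vectors (the score-vector term is analogous with scores stacked over time):

```python
import numpy as np

def inter_head_orthogonality_loss(contexts, eps=1e-12):
    """contexts: (B, H, d) context vectors, one per sample and head.
    Normalizes each vector to unit length, forms the (H, H) Gram matrix of
    cosine similarities per sample, and returns the batch mean of the
    squared Frobenius norm of its deviation from the identity, i.e., the
    sum of squared inter-head inner products."""
    n = np.linalg.norm(contexts, axis=-1, keepdims=True)
    Cn = contexts / np.maximum(n, eps)
    G = Cn @ Cn.transpose(0, 2, 1)        # (B, H, H) Gram matrices
    num_heads = contexts.shape[1]
    off = G - np.eye(num_heads)           # diagonal is 1 by normalization
    return np.mean(np.sum(off ** 2, axis=(1, 2)))

# Orthogonal heads incur zero loss; identical heads are maximally penalized.
ortho_ctx = np.stack([np.eye(4, 8)] * 2)  # (2, 4, 8): orthonormal head vectors
same_ctx = np.ones((2, 4, 8))             # all four heads identical
```

With H=4 identical heads, all 12 off-diagonal cosine similarities equal one, so the penalty is 12 per sample; with orthogonal heads it vanishes.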
One main difference from the output disagreement regularization in [li2018multi] is that our system does not use value projection and thus directly computes the context vector from the encoder outputs by multiplying them with the attention weights. Since the inter-head context orthogonality constraint could easily be satisfied by an orthogonal value projection in each head, regardless of the encoder outputs, we want such orthogonality to be achieved by the encoder network, not by a subspace projection. This encourages the encoder network to discriminatively represent different subsequences of a keyword utterance, which results in better keyword detection.
3.2 Intra-head non-orthogonality regularization
In contrast, since each attention head finds a specific subsequence with similar content, the context vectors from the same attention head are expected to be similar across different samples. Thus, we add a regularization term which maximizes the similarity, or non-orthogonality, of the context vectors between different samples from the same attention head as follows:
$L_{N} = \frac{1}{H} \sum_{i=1}^{H} \left\| \mathbf{D}_i^{\top} \mathbf{D}_i \right\|_F^2,$ (8)

where

$\mathbf{D}_i = [\bar{\mathbf{c}}_{1,i}, \dots, \bar{\mathbf{c}}_{B,i}].$ (9)
Similar regularization of the score vectors is not considered, as the position of the subsequence attended to by each attention head can vary from sample to sample.
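A matching sketch for the intra-head term, under the same illustrative assumptions (a (B, H, d) batch of context vectors; per head, the normalized context vectors of all samples are compared with each other):

```python
import numpy as np

def intra_head_similarity(contexts, eps=1e-12):
    """contexts: (B, H, d). For each head, form the (B, B) Gram matrix of
    cosine similarities between samples and average its squared Frobenius
    norm over heads. This term is to be *maximized* (it enters the total
    loss with a negative sign), pushing each head toward consistent
    context vectors across samples."""
    n = np.linalg.norm(contexts, axis=-1, keepdims=True)
    Cn = contexts / np.maximum(n, eps)
    D = Cn.transpose(1, 0, 2)             # (H, B, d): per-head sample stacks
    G = D @ D.transpose(0, 2, 1)          # (H, B, B) Gram matrices
    return np.mean(np.sum(G ** 2, axis=(1, 2)))

same_batch = np.ones((3, 2, 5))                    # identical samples per head
ortho_batch = np.stack([np.eye(3, 5)] * 2, axis=1) # (3, 2, 5): orthogonal samples
```

With B=3 identical samples per head the Gram matrix is all ones and the value is 9; with mutually orthogonal samples it drops to its minimum of 3 (the diagonal), so maximizing it rewards cross-sample consistency.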
3.3 Selective regularization
Since the discussion about the orthogonality and non-orthogonality constraints is only valid for positive data, i.e., keyword utterances, we modify (4), (5), and (8) to be selectively calculated, given that the true label of the $b$-th training sample is $y_b = 1$ for positive and $y_b = 0$ for negative, as follows:
$L_{C}^{+} = \frac{1}{B^{+}} \sum_{b=1}^{B} y_b \left\| \mathbf{C}_b^{\top} \mathbf{C}_b - \mathbf{I} \right\|_F^2,$ (10)

$L_{S}^{+} = \frac{1}{B^{+}} \sum_{b=1}^{B} y_b \left\| \mathbf{S}_b^{\top} \mathbf{S}_b - \mathbf{I} \right\|_F^2,$ (11)

$L_{N}^{+} = \frac{1}{H} \sum_{i=1}^{H} \left\| (\mathbf{D}_i \mathbf{P})^{\top} (\mathbf{D}_i \mathbf{P}) \right\|_F^2,$ (12)

where $B^{+} = \sum_{b=1}^{B} y_b$ denotes the number of positive samples and $\mathbf{P} = \mathrm{diag}(y_1, \dots, y_B)$ is the diagonal selection matrix.
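In code, the selection amounts to masking the per-sample penalties with the labels before averaging over the positive samples; a hedged NumPy sketch for the inter-head context term (shapes and labels are illustrative):

```python
import numpy as np

def selective_inter_head_loss(contexts, labels, eps=1e-12):
    """contexts: (B, H, d); labels: (B,) with 1 for keyword (positive)
    samples and 0 for negatives. Computes the per-sample inter-head
    orthogonality penalty, zeroes it for negatives, and normalizes by
    the number of positive samples."""
    n = np.linalg.norm(contexts, axis=-1, keepdims=True)
    Cn = contexts / np.maximum(n, eps)
    G = Cn @ Cn.transpose(0, 2, 1)                        # (B, H, H)
    num_heads = contexts.shape[1]
    per_sample = np.sum((G - np.eye(num_heads)) ** 2, axis=(1, 2))
    y = np.asarray(labels, dtype=float)
    return float((y * per_sample).sum() / max(y.sum(), 1.0))

# Sample 0: orthogonal heads (penalty 0); sample 1: identical heads (penalty 12).
mixed_ctx = np.stack([np.eye(4, 8), np.ones((4, 8))])
```

Marking the redundant sample as negative removes it from the loss entirely, which is exactly the intent of the selective formulation.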
Now we can write the problem as minimization of the cross-entropy loss with the regularization terms as follows:

$L = L_{CE} + \lambda_{C} L_{C}^{+} + \lambda_{S} L_{S}^{+} - \lambda_{N} L_{N}^{+},$ (13)

where each $\lambda$ is a hyperparameter that controls the importance of the corresponding regularization term. Note that the intra-head term has the opposite sign, since this regularization term is to be maximized while the others are to be minimized.

3.4 Semi-supervised salience learning
One interesting perspective of this work is that it roughly provides a semi-supervised way of learning representations of salient features from keyword utterances for the given task. In other words, without sequence-part alignment information such as phoneme labels and frame indices, the encoder finds task-relevant subsequences which play important roles in distinguishing keywords from non-keywords, while only the keyword label is provided. Fig. 2 illustrates examples of attention weights for an utterance of the “Hey Snapdragon” keyword. In Fig. 2(a) and (b), it can be seen that the single-head attention has a wide weight distribution across time with emphasis on the end part of the keyword, while the attention weights from different heads of the plain, i.e., without regularization, multi-head model are distributed at different positions, capturing the encoder output representations of the corresponding subsequences. However, some of them overlap with each other, indicating that the context vectors from those attention heads carry redundant information. With the proposed regularization, it can be seen in Fig. 2(c) that the attention heads pay attention to mutually exclusive sequence parts.
4 Experiments
4.1 Datasets and experimental setup
The target keyword in our experiments is “Hey Snapdragon”, which consists of four English syllables. In order to train the model and evaluate the performance, we collected a number of clean positive and negative samples from 325 speakers. The positive dataset has 12,000 samples from the 325 speakers and is divided into training, validation, and test subsets at a ratio of 10:1:1. For the validation and test datasets, we augmented the keyword utterances with 4 types of noise, i.e., babble, car, music, and office, at signal-to-noise ratios (SNRs) of -6, 0, and 6 dB, and with reverberation using a room impulse response measured in a regular meeting room, so that the total number of each of the positive validation and test sets is 15,000 samples. For negative samples, we collected 400 hours of general English sentences and divided them at a ratio of 1:1:1 for training, validation, and test. We also augmented the negative validation and test datasets with random noises to double their amount, so that the total numbers of negative validation and test samples are 38,000 and 33,000, respectively. Note that there is no duplication and no overlap in speaker, noise sample, or room impulse response between any of the positive and negative training, validation, and test sets. To improve acoustic environmental robustness, we augmented 50% of the positive and negative training samples in an online manner, where each sample is synthetically corrupted during data loading with a room impulse response and a background noise sample randomly selected from 200 hours of noise and reverberation data. We assumed that all data have a fixed length and thus segmented them to 1.8 s while guaranteeing that no utterance in the training set is clipped out in time. This assumption does not restrict on-device usability, as we can apply sliding-window techniques to a continuous audio stream without violating the assumption. From the 1.8 s input audio sequences sampled at 16 kHz, 40-dimensional Mel-filter bank energies with per-channel energy normalization [wang2017trainable] were computed for 30 ms frames every 10 ms by performing the short-time Fourier transform with a 512-point Hamming window, and then fed into the network.
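For concreteness, the framing and spectral analysis described above can be sketched as follows; the Mel filterbank and per-channel energy normalization stages are omitted, and applying the Hamming window to the 480-sample frame before zero-padding to the 512-point FFT is an assumption about the exact windowing:

```python
import numpy as np

def power_spectrogram(x, sr=16000, win_ms=30, hop_ms=10, n_fft=512):
    """Slice a waveform into 30 ms frames every 10 ms, window each frame
    with a Hamming window, and compute the power spectrum with a
    512-point FFT (frames are zero-padded from 480 to 512 samples)."""
    win = sr * win_ms // 1000             # 480 samples per frame
    hop = sr * hop_ms // 1000             # 160 samples between frame starts
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(win)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

x = np.random.default_rng(0).standard_normal(int(1.8 * 16000))  # 1.8 s clip
S = power_spectrogram(x)
```

A 1.8 s clip at 16 kHz yields 178 frames of 257 frequency bins under these settings; a Mel projection would then reduce each frame to the 40 filterbank energies.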
We performed experiments with the network structure described in Section 2.1 while varying the number of attention heads, with empirically chosen regularization weights in (13). All models were trained from scratch with randomly initialized parameters for 200 epochs, which is considered a sufficient number to reach convergence. A mini-batch consisted of 128 shuffled positive and negative training samples at a ratio of 1:3. We used the Adam optimizer [kingma2014adam] with a learning rate that decays at each epoch by a factor of 0.98, while gradients with norm values above 1.0 were clipped. Since each attention head has learnable parameters in the scoring function (2), and the number of nodes in the softmax layer changes due to the concatenation of the context vectors from the attention heads, the 4-head model has 91k parameters while the single-head model has 78k.
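The gradient clipping and per-epoch learning-rate decay can be sketched as follows; the base learning rate in the example is a placeholder, since the actual value is not given here:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their global L2 norm is
    at most max_norm (1.0 in the setup described above)."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads]

def lr_at_epoch(base_lr, epoch, decay=0.98):
    """Exponential per-epoch decay with a factor of 0.98."""
    return base_lr * decay ** epoch

grads = [np.full(4, 1.0), np.full(9, 1.0)]   # global norm sqrt(13) > 1
clipped = clip_by_global_norm(grads)
```

Clipping by the global norm (rather than per-tensor) preserves the relative scale of the gradients across layers while bounding the update size.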
4.2 Regularization loss variation
Fig. 3 shows the regularization losses calculated on the positive validation set during training. It can be observed that the inter-head context and score regularization losses decrease as intended, i.e., the orthogonality between the context vectors and between the score vectors of the attention heads increases, meaning that inter-head redundancy in time and subspace is reduced by the regularization. Meanwhile, as can be seen in Fig. 3(c), the intra-head similarity term increases, which indicates that the output context vectors of each attention head from different positive samples become more similar to each other as training progresses. This is desirable for the classification stage because it is generally beneficial to have less variation of the feature representation, i.e., the context vector, in feature space for the positive samples.
4.3 Performance with different combinations of regularizations
To see how the regularization affects the keyword spotting performance, we compare the keyword spotting test results for different combinations of regularizations applied during training. False rejection rates (FRRs) measured at confidence thresholds corresponding to 1 false alarm per hour (FA/hr) for the corresponding models are used as the performance metric. For simplicity of comparison, we fixed the number of attention heads to 4, motivated by the fact that the keyword has four syllables, and set each nonzero regularization weight to 0.1.


Systems                             FRR (%) at
H    λ_C    λ_S    λ_N      1 FA/hr    2 FA/hr    4 FA/hr
1    -      -      -        5.57       4.33       3.24
4    -      -      -        5.22       4.04       3.13
4    0.1    -      -        4.70       3.79       3.00
4    -      0.1    -        4.37       3.21       2.40
4    -      -      0.1      4.58       3.58       2.75
4    0.1    0.1    -        4.44       3.46       2.59
4    0.1    -      0.1      3.97       2.97       2.27
4    -      0.1    0.1      4.07       3.26       2.37
4    0.1    0.1    0.1      3.91       2.88       2.07
4    0.1    0.1    0.1      5.50       4.17       3.05

From Table 1, we can see that all types of regularization contribute to improving the performance, both individually and in combination, while using all regularization terms gives the lowest FRR. Note that the plain multi-head attention also gives some improvement over the single-head attention model. At the thresholds corresponding to 1 FA/hr, the proposed regularization achieves up to 32.6% and 25.1% relative reductions in FRR over the single-head attention model and the plain multi-head attention model, respectively.
Fig. 4 shows the receiver operating characteristic (ROC) curves of the single-head, the plain 4-head, and the regularized 4-head attention models on the test dataset, where all regularization weights are set to 0.1. It can be seen that the regularized multi-head model consistently and significantly outperforms both the single-head attention model and the plain, i.e., non-regularized, multi-head attention model at all FA/hr operating points. At 1 FA/hr on the test dataset, the FRRs are reduced by 34.4% and 36.0%, respectively.
4.4 Varying the regularization weights
We also examine how the performance changes according to the regularization weight values. To see the change, we varied the weights over a range of values, with all regularization terms sharing the same weight in each training instance for simplicity. The number of attention heads is fixed to 4 as before. From Fig. 5, we can see that the best performance is achieved at a specific value of the weight. Although we did not investigate different combinations of values for the different regularization terms, this result suggests that one can find an optimal point in the hyperparameter space of the regularization weights, for which automated machine learning algorithms can be used.
5 Conclusion
In this paper, we have proposed a multi-head attention-based keyword spotting system trained with regularization derived from orthogonality constraints on the context and score vectors of the attention heads. The inter-head orthogonality regularization of context and score vectors encourages the attention heads to have less redundancy with each other in positions and subspaces, while the intra-head non-orthogonality regularization of context vectors gives them contextual consistency across samples for the given task. The proposed orthogonality-constrained multi-head attention mechanism has been shown to learn exclusive representations of sequence parts, both in position and in subspace, which in turn improves the keyword spotting performance by extracting richer task-relevant information from structured data. In the experiments with the “Hey Snapdragon” keyword, the proposed method reduced the relative false rejection rate by 34.4% and 36.0% at 1 FA/hr over the single-head and plain multi-head attention-based models, respectively, on the test dataset. Our future work includes investigating other criteria for regularizing multi-head attention and extending the idea to other speech tasks such as speaker verification and speech recognition.