Keyword spotting has recently become an essential function of consumer devices such as mobile phones and smart speakers, because it provides a natural voice user interface. It is mainly used for detecting pre-defined keywords, e.g., “Alexa”, “Hey Siri”, and “OK Google”, to get devices ready to process users’ subsequent commands or queries. Despite the widespread use of this technology in various devices today, it remains a challenging problem because it requires a low false rejection rate (FRR) and a low false alarm rate (FAR) while operating with a small memory footprint and low power consumption.
In previous studies, keyword/filler hidden Markov models (HMMs) were proposed, which explicitly model the acoustic characteristics of non-keyword general speech (filler) as well as the target keyword speech [rohlicek1989continuous]. With the recent advances in deep learning, the Gaussian mixture models in the HMMs were replaced with various neural network architectures, such as feed-forward deep neural networks, convolutional neural networks, and convolutional recurrent neural networks (CRNNs) [chen2014small, sainath2015convolutional, arik2017convolutional, lengerich2016end]. Although these deep learning-based approaches significantly improve system performance by increasing modeling capacity, they still require accurately predicted time-aligned labels.
Recently, a number of attention-based keyword spotting models have been proposed [he2017streaming, shan2018attention, de2018neural]. While [he2017streaming] used attention in an assistive form to bias RNN-based decoders toward a keyword of interest, [shan2018attention] aggressively deploys the attention mechanism proposed in [chowdhury2017attention] for direct keyword feature representation, with which a binary classifier discriminates keywords from non-keywords. Since these approaches are based on a basic single-head attention mechanism, it is natural to extend them to multi-head attention. Multi-head attention [vaswani2017attention, chiu2018state] was introduced for joint representation of information in different subspaces while attending to different positions of a sequence. However, as there is no explicit mechanism that guarantees such diversity either in positions or in representational subspaces, the attention heads may contain redundant information, which makes the network inefficient. [li2018multi] proposed three types of disagreement regularization, i.e., disagreements on subspaces, attended positions, and outputs, to explicitly encourage diversity among attention heads based on cosine similarity.
In this paper, we investigate the use of multi-head attention in keyword spotting tasks and propose an orthogonality constrained multi-head attention mechanism. The regularization is derived from constraints that make the context vectors, and likewise the score vectors, of different attention heads orthogonal to each other. The regularization by inter-head orthogonality of context vectors and score vectors makes the attention heads less redundant with each other, while the regularization by intra-head non-orthogonality of context vectors makes each head consistent across samples for the given task. The regularization presented in this work is related to [li2018multi], but it is more oriented to speech data and keyword spotting tasks. We show that the proposed regularization techniques improve keyword detection performance by reducing the false rejection rate with only a small increase in model size.
2 Multi-Head Attention-Based End-to-End Model for Keyword Spotting
2.1 Keyword spotting system description
We extend the single-head attention-based end-to-end network structure for keyword spotting presented in [shan2018attention] to a multi-head attention-based network, as depicted in Fig. 1. The encoder takes as input a sequence of acoustic features $\mathbf{x}_t$, $t = 1, \dots, T$, where $t$ is the time frame index, and converts it into a hidden representation. The features are the 40-dimensional Mel filter bank energies extracted from audio signals sampled at 16 kHz, with per-channel energy normalization [wang2017trainable]. The encoder network consists of a canonical CRNN structure, with convolutional and recurrent layers in sequence to capture the spectral and temporal characteristics of the acoustic features. As a base model, we use one convolutional layer and one gated recurrent unit (GRU) layer with 64 hidden units, as proposed in [shan2018attention]. The encoder output vector $\mathbf{h}_t$ is then processed by an attention mechanism in each attention head to produce a context vector $\mathbf{c}_i$, where $i$ denotes the attention head index. Using $N$ attention heads, the context vectors are concatenated as $\mathbf{c} = [\mathbf{c}_1^\top, \dots, \mathbf{c}_N^\top]^\top$, where $(\cdot)^\top$ denotes the matrix transpose. Finally, the model performs binary classification with a linear transformation and a softmax operation on $\mathbf{c}$ to compute the posterior probability $P(y \mid \mathbf{x}_1, \dots, \mathbf{x}_T)$ of a keyword state $y$ given the input observation. In the inference stage, we decide that a keyword is detected when the confidence is larger than a pre-set threshold. Note that this system does not require any graph search or frame-level alignment of the training data, which largely simplifies both training and inference.
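As a rough illustration of the final stage described above, the following plain-Python sketch concatenates per-head context vectors, applies a linear layer and a two-class softmax, and compares the keyword posterior with a threshold. The function names, the toy weights, and the threshold of 0.5 are assumptions made for illustration; they are not values from this work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keyword_posterior(contexts, weights, biases):
    """Concatenate the head contexts, apply a linear layer, return P(keyword).

    contexts: list of N context vectors (lists of floats), one per head.
    weights:  2 rows (non-keyword, keyword), each of length N * D.
    biases:   2 floats.
    """
    concat = [value for context in contexts for value in context]
    logits = [sum(w * x for w, x in zip(row, concat)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)[1]

def is_keyword(posterior, threshold=0.5):
    """Detection rule: fire when the confidence exceeds the preset threshold."""
    return posterior > threshold

# Toy example with 2 heads and 2-dimensional contexts (hypothetical numbers).
contexts = [[1.0, 0.0], [0.0, 1.0]]
weights = [[0.0, 0.0, 0.0, 0.0],   # non-keyword row
           [1.0, 0.0, 0.0, 1.0]]   # keyword row
biases = [0.0, 0.0]
posterior = keyword_posterior(contexts, weights, biases)
detected = is_keyword(posterior)
```

In practice the threshold would be tuned on held-out data to meet a target false alarm rate rather than fixed at 0.5.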
2.2 Base attention mechanism
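In each attention head, we use the nonlinear soft attention mechanism proposed in [chowdhury2017attention] for speaker verification and adopted in [shan2018attention] for keyword spotting. The attention weight $\alpha_{i,t}$ for the $i$-th attention head at the $t$-th time frame is calculated by
$$\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{\tau=1}^{T} \exp(e_{i,\tau})}, \tag{1}$$
where the scalar score $e_{i,t}$ is calculated by a nonlinear scoring function with the parameters shared across time:
$$e_{i,t} = \mathbf{v}_i^\top \tanh(\mathbf{W}_i \mathbf{h}_t + \mathbf{b}_i). \tag{2}$$
Here $\mathbf{h}_t$ is the encoder output vector at frame $t$, and $\mathbf{W}_i$, $\mathbf{b}_i$, and $\mathbf{v}_i$ are the learnable parameters of head $i$.
The context vector $\mathbf{c}_i$, the output of each attention head, is then calculated by the weighted sum as follows:
$$\mathbf{c}_i = \sum_{t=1}^{T} \alpha_{i,t} \mathbf{h}_t. \tag{3}$$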
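The three steps of one attention head (scoring each frame, normalizing the scores over time, and forming the weighted sum) can be sketched in plain Python as follows. This is a minimal illustration under assumed toy dimensions; the helper names and the numeric values are hypothetical, and a real system would use a tensor library.

```python
import math

def attention_scores(frames, W, b, v):
    """Score each frame with v^T tanh(W h + b); parameters are shared across time."""
    scores = []
    for h in frames:
        hidden = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
                  for row, b_i in zip(W, b)]
        scores.append(sum(v_i * z_i for v_i, z_i in zip(v, hidden)))
    return scores

def attention_weights(scores):
    """Softmax over time, so the weights are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(frames, weights):
    """Weighted sum of the encoder outputs over time."""
    dim = len(frames[0])
    return [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]

# Toy encoder output: T = 3 frames of dimension 2 (hypothetical numbers).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]

scores = attention_scores(frames, W, b, v)
alphas = attention_weights(scores)
context = context_vector(frames, alphas)
```

With these toy values the third frame receives the largest weight, because both of its components contribute to the score.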
3 Orthogonality Regularized Multi-Head Attention
In speech recognition tasks including keyword spotting, although end-to-end neural networks are very attractive due to the simplicity of their structures and learning procedures, hybrid systems often achieve competitive or even better performance, as they use explicit sequence models to better leverage the structured information coming from speech subsequences, i.e., phonemes, syllables, or word-pieces [luscher2019rwth]. From this perspective, the multi-head attention mechanism is considered a promising alternative for capturing the structured information from speech subsequences while keeping the end-to-end nature [dong2019self, dong2018speech].
Multi-head attention, proposed in [vaswani2017attention], is capable of learning diverse representations, since different heads can pay attention to different positions in a sequence and give different representations. However, this diversity is not guaranteed by its natural form, as the heads may be redundant either in position or in representation. Fig. 2(b) shows an example of multi-head attention weight distributions where 3 of them severely overlap with each other. To encourage the diversity of multi-head attention, [li2018multi] proposed three types of disagreement regularization in the context of machine translation, i.e., disagreements on subspaces, attended positions, and outputs, based on maximization of the negative cosine similarities. In this section, we propose a regularization technique for training the multi-head attention-based keyword spotting model using orthogonality constraints between attention heads.
3.1 Inter-head orthogonality regularization
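We argue that, to capture the temporally structured information in a speech input sequence, the attention heads should pay attention to different parts of the sequence and produce context outputs with minimal redundancy with each other. To achieve this, we introduce regularization of the multi-head attention by orthogonality constraints on the context and score vectors between the attention heads. The problem is to find the network parameters that minimize the cross entropy loss subject to the orthogonality constraints $\mathbf{c}_i^\top \mathbf{c}_j = 0$ and $\mathbf{e}_i^\top \mathbf{e}_j = 0$ for each pair of heads $i \neq j$, where $\mathbf{c}_i$ is the context vector of head $i$ and $\mathbf{e}_i$ is its score vector over time. Suppose that we have a training batch of $M$ samples. The regularization terms $\mathcal{L}_{c}^{\mathrm{inter}}$ and $\mathcal{L}_{s}^{\mathrm{inter}}$ are then defined for the context vectors and the score vectors, respectively,
where $m$ is the sample index, $N$ is the number of attention heads, and $\|\cdot\|_F$ denotes the Frobenius norm. The terms are computed from
the context matrix $\mathbf{C}_m$ and the score matrix $\mathbf{S}_m$, which consist of the normalized context vectors and the normalized score vectors of the $N$ heads for sample $m$, respectively.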
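One form consistent with these definitions is given below as an assumed sketch rather than as the original expressions. The symbol names, the squared norm, and the $1/M$ batch average are assumptions. It penalizes the deviation of the Gram matrices of the normalized context and score vectors from the identity matrix $\mathbf{I}$:

```latex
\begin{aligned}
\mathcal{L}_{c}^{\mathrm{inter}} &= \frac{1}{M} \sum_{m=1}^{M}
  \left\| \mathbf{C}_m^\top \mathbf{C}_m - \mathbf{I} \right\|_F^2, &
\mathbf{C}_m &= \left[ \bar{\mathbf{c}}_{1,m}, \dots, \bar{\mathbf{c}}_{N,m} \right], \\
\mathcal{L}_{s}^{\mathrm{inter}} &= \frac{1}{M} \sum_{m=1}^{M}
  \left\| \mathbf{S}_m^\top \mathbf{S}_m - \mathbf{I} \right\|_F^2, &
\mathbf{S}_m &= \left[ \bar{\mathbf{e}}_{1,m}, \dots, \bar{\mathbf{e}}_{N,m} \right],
\end{aligned}
\qquad
\bar{\mathbf{c}}_{i,m} = \frac{\mathbf{c}_{i,m}}{\|\mathbf{c}_{i,m}\|_2}, \quad
\bar{\mathbf{e}}_{i,m} = \frac{\mathbf{e}_{i,m}}{\|\mathbf{e}_{i,m}\|_2}.
```

Because the columns have unit norm, the diagonal of each Gram matrix equals one, so only the inner products between different heads contribute. Each term is therefore zero exactly when the heads are mutually orthogonal for every sample.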
One main difference from the output disagreement regularization in [li2018multi] is that our system does not use a value projection and thus directly computes the context vector from the encoder output by multiplying by the attention weights. Since the inter-head context orthogonality constraint can easily be satisfied by an orthogonal value projection in each head, regardless of the encoder outputs, we want this orthogonality to be achieved by the encoder network, not by a subspace projection. This encourages the encoder network to discriminatively represent different subsequences of a keyword utterance, which results in better keyword detection.
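To make the inter-head penalty concrete, the sketch below computes, for one sample, the squared Frobenius distance between the Gram matrix of unit-normalized head vectors and the identity. It can be applied to context vectors or to score vectors alike. The exact normalization used in this work is not reproduced here, so the function name and the absence of any scaling constant are assumptions.

```python
import math

def normalize(vector):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def inter_head_penalty(head_vectors):
    """Squared Frobenius norm of (G - I), where G is the Gram matrix of the
    unit-normalized per-head vectors of one sample."""
    unit = [normalize(v) for v in head_vectors]
    n_heads = len(unit)
    penalty = 0.0
    for i in range(n_heads):
        for j in range(n_heads):
            gram_ij = sum(a * b for a, b in zip(unit[i], unit[j]))
            target = 1.0 if i == j else 0.0
            penalty += (gram_ij - target) ** 2
    return penalty

# Two heads with orthogonal outputs versus two heads pointing the same way.
orthogonal_heads = [[2.0, 0.0], [0.0, 3.0]]
redundant_heads = [[1.0, 1.0], [2.0, 2.0]]

penalty_orthogonal = inter_head_penalty(orthogonal_heads)
penalty_redundant = inter_head_penalty(redundant_heads)
```

Orthogonal heads incur no penalty, while two heads with parallel outputs incur the maximum penalty for a pair, since both off-diagonal Gram entries equal one.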
3.2 Intra-head non-orthogonality regularization
On the contrary, since each attention head finds a specific subsequence with similar content, the context vectors from the same attention head are expected to be similar across different samples. Thus, we add a regularization term that maximizes the similarity, or non-orthogonality, of the context vectors between different samples from the same attention head.
A similar regularization of the score vectors is not considered, as the position of the subsequence attended by each attention head can vary from sample to sample.
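A simple way to picture the intra-head term is as an average similarity between the context vectors that one head produces for different samples. The sketch below uses the mean squared cosine similarity over all pairs of distinct samples. This particular measure, the pairwise averaging, and the function names are assumptions for illustration and may differ from the expression used in this work.

```python
import math

def normalize(vector):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def intra_head_similarity(contexts_per_sample):
    """Mean squared cosine similarity between the context vectors that a single
    head produces for different samples (all ordered pairs with m != n)."""
    unit = [normalize(c) for c in contexts_per_sample]
    total, pairs = 0.0, 0
    for m in range(len(unit)):
        for n in range(len(unit)):
            if m != n:
                cosine = sum(a * b for a, b in zip(unit[m], unit[n]))
                total += cosine ** 2
                pairs += 1
    # With fewer than two samples there is nothing to compare.
    return total / pairs if pairs else 0.0

# One head that attends to similar content in every sample, and one that does not.
consistent_head = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
scattered_head = [[1.0, 0.0], [0.0, 1.0]]

similarity_consistent = intra_head_similarity(consistent_head)
similarity_scattered = intra_head_similarity(scattered_head)
```

A value near one means the head yields nearly parallel context vectors across samples, which is the behavior this term is meant to encourage.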
3.3 Selective regularization
Since the discussion of the orthogonality and non-orthogonality constraints is only valid for positive data, i.e., keyword utterances, we modify (4), (5) and (8) to be selectively calculated over the positive samples. Here the true label $y_m$ of the $m$-th training sample is 1 for positive and 0 for negative.
The modified terms use the number of positive samples in the batch and a diagonal selection matrix built from these labels.
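The effect of the selection can be sketched as a label-masked average. Each per-sample regularization value is multiplied by its 0/1 label, which plays the role of the diagonal selection matrix, and the sum is divided by the number of positive samples. The function name, the toy values, and the handling of a batch without positives are assumptions.

```python
def selective_mean(per_sample_values, labels):
    """Average a per-sample regularization value over positive samples only.

    labels[m] is 1 for a keyword utterance and 0 otherwise, so negative
    samples are multiplied by zero and drop out of the sum.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        return 0.0  # assumption: a batch without positives contributes nothing
    selected = sum(y * value for y, value in zip(labels, per_sample_values))
    return selected / n_positive

# Hypothetical per-sample penalties; samples 1 and 3 are keyword utterances.
penalties = [0.4, 9.0, 0.2, 7.0]
labels = [1, 0, 1, 0]
masked = selective_mean(penalties, labels)
```

The large values of the two negative samples have no influence on the result, which is the intended behavior of the selective calculation.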
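Now we can write the problem as minimization of the cross entropy loss with the regularization terms as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{c}^{\mathrm{inter}} \mathcal{L}_{c}^{\mathrm{inter}} + \lambda_{s}^{\mathrm{inter}} \mathcal{L}_{s}^{\mathrm{inter}} - \lambda_{c}^{\mathrm{intra}} \mathcal{L}_{c}^{\mathrm{intra}},$$
where $\mathcal{L}_{\mathrm{CE}}$ is the cross entropy loss, $\mathcal{L}_{c}^{\mathrm{inter}}$ and $\mathcal{L}_{s}^{\mathrm{inter}}$ are the selectively calculated inter-head regularization terms for the context and score vectors, $\mathcal{L}_{c}^{\mathrm{intra}}$ is the selectively calculated intra-head term, and each $\lambda$ is a hyperparameter that controls the importance of the corresponding regularization term. Note that $\mathcal{L}_{c}^{\mathrm{intra}}$ has the opposite sign, since this regularization term is to be maximized while the others are to be minimized.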
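The combination of the terms can be written as a short function. The argument names are hypothetical, and the default weights of 0.1 mirror the value used in the experiments of this paper. The intra-head term enters with a minus sign because it is maximized.

```python
def total_loss(cross_entropy, inter_context, inter_score, intra_context,
               lam_inter_context=0.1, lam_inter_score=0.1, lam_intra_context=0.1):
    """Cross entropy plus weighted regularizers; the intra-head term is
    subtracted because it is to be maximized rather than minimized."""
    return (cross_entropy
            + lam_inter_context * inter_context
            + lam_inter_score * inter_score
            - lam_intra_context * intra_context)

# Hypothetical values of the four terms for one mini-batch.
loss = total_loss(cross_entropy=1.0, inter_context=2.0,
                  inter_score=3.0, intra_context=4.0)
```

Setting all three weights to zero recovers plain cross entropy training of the multi-head model.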
3.4 Semi-supervised salience learning
One interesting perspective of this work is that it roughly provides a semi-supervised way of learning representations of salient features from keyword utterances for the given task. In other words, without the sequence part alignment information such as phoneme labels and frame indices, the encoder finds task-relevant subsequences which have important roles for distinguishing keywords from non-keywords while only the keyword label is provided. Fig. 2 illustrates examples of attention weights for an utterance of the “Hey Snapdragon” keyword. In Fig. 2(a) and (b), it can be seen that the single head attention has a wide range of weight distribution across time with emphasis on the keyword end part, while the attention weights from different heads of the plain, i.e., without regularization, multi-head model are distributed in different positions capturing the encoder output representations of the corresponding subsequences. However, some of them overlap with each other, indicating the context vectors from the attention heads have redundant information. With the proposed regularization, it can be seen in Fig. 2(c) that the attention heads pay attention to exclusive sequence parts.
4 Experiments
4.1 Datasets and experimental setup
The target keyword in our experiments is “Hey Snapdragon”, which consists of four English syllables. To train the model and evaluate its performance, we collected a number of clean positive and negative samples from 325 speakers. The positive dataset has 12,000 samples from 325 speakers and is divided into training, validation, and test subsets at a ratio of 10:1:1. For the validation and test datasets, we augmented the keyword utterances with 4 types of noise, i.e., babble, car, music, and office, at signal-to-noise ratios (SNRs) of -6, 0, and 6 dB, and with reverberation using a room impulse response measured in a regular meeting room, so that the total number of positive samples in each of the validation and test sets is 15,000. For negative samples, we collected 400 hours of general English sentences and divided them at a ratio of 1:1:1 for training, validation, and test. We also augmented the negative validation and test datasets with random noises to double the amount, so that the total numbers of negative validation and test samples are 38,000 and 33,000, respectively. Note that there is no duplication and no overlap in speaker, noise sample, or room impulse response between any of the positive and negative training, validation, and test sets.
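The subset sizes implied by these ratios follow from simple arithmetic, sketched below. The helper name is hypothetical. The factor of 15 is derived only from the stated totals; the text names 12 noisy conditions (4 noise types at 3 SNRs) plus reverberation, and how the remaining copies are composed is not specified.

```python
def split_by_ratio(total, ratio):
    """Split `total` items into parts proportional to `ratio` (integer division)."""
    unit = total // sum(ratio)
    return [unit * part for part in ratio]

# Positive utterances: 12,000 samples at 10:1:1 for train / validation / test.
positive_split = split_by_ratio(12000, (10, 1, 1))

# Negative audio: 400 hours at 1:1:1, before the noise augmentation.
negative_hours_split = [400 * part / 3 for part in (1, 1, 1)]

# 4 noise types x 3 SNRs give 12 noisy conditions per clean utterance.
noisy_conditions = 4 * 3

# Growth of each positive evaluation subset implied by the stated total of 15,000.
expansion_factor = 15000 // positive_split[1]
```

This gives 10,000 positive training utterances, 1,000 clean utterances in each evaluation subset, and roughly 133 hours of negative audio per subset before augmentation.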
To improve robustness to the acoustic environment, we augmented 50% of the positive and negative training samples in an online manner, where each sample is synthetically corrupted during data loading with a randomly selected room impulse response and background noise sample from 200 hours of noise and reverberation datasets. We assumed that all data have a fixed length and thus segmented them to a length of 1.8 s while guaranteeing that no utterance in the training set is clipped in time. This assumption does not restrict on-device usability, as sliding window techniques can be applied to a continuous audio stream without violating it. From the 1.8 s input audio sequences sampled at 16 kHz, 40-dimensional Mel filter bank energies with per-channel energy normalization [wang2017trainable] were computed for 30 ms frames at every 10 ms by performing a short-time Fourier transform with a 512-point Hamming window, and then fed into the network.
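The size of the resulting feature sequence can be estimated from the stated frame settings. The sketch below assumes that only full frames are kept and that no padding is applied at the clip edges, which is an assumption; a different padding convention would change the count by a few frames.

```python
SAMPLE_RATE_HZ = 16000
CLIP_SECONDS = 1.8
FRAME_MS = 30
HOP_MS = 10
N_MEL = 40

def frame_count(n_samples, frame_len, hop_len):
    """Number of full frames when no padding is applied (an assumption)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len

n_samples = round(SAMPLE_RATE_HZ * CLIP_SECONDS)   # samples per 1.8 s clip
frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000      # samples per 30 ms frame
hop_len = SAMPLE_RATE_HZ * HOP_MS // 1000          # samples per 10 ms hop
n_frames = frame_count(n_samples, frame_len, hop_len)
feature_shape = (n_frames, N_MEL)
```

Under these assumptions each clip yields 178 frames of 40 coefficients, so the attention weights of each head are distributed over at most that many time steps before any striding in the encoder.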
All models were trained from scratch with randomly initialized parameters for 200 epochs, which is considered sufficient to reach convergence. Each mini-batch consisted of 128 shuffled positive and negative training samples at a ratio of 1:3. We used the Adam optimizer [kingma2014adam] with a learning rate that decays by a factor of 0.98 at each epoch, while gradients with norm values above 1.0 were clipped. Since each attention head has learnable parameters in the scoring function (2), and the number of nodes in the softmax layer changes due to concatenation of the context vectors from the attention heads, the number of parameters of the 4-head model is 91 k while that of the single-head model is 78 k.
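The numbers in this paragraph can be checked with a few lines of arithmetic. The helper names are hypothetical. The parameter estimate assumes a 64-dimensional attention layer in the scoring function and a two-class output, and the clipping shown is by the Euclidean norm of a single gradient vector; both are assumptions.

```python
import math

def batch_composition(batch_size, positive_parts, negative_parts):
    """Split a mini-batch into positive and negative counts at the given ratio."""
    unit = batch_size // (positive_parts + negative_parts)
    return unit * positive_parts, unit * negative_parts

def decay_multiplier(n_decays, decay=0.98):
    """Factor applied to the initial learning rate after n_decays decay steps."""
    return decay ** n_decays

def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the gradient so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]

def extra_parameters(extra_heads, hidden=64, attention_dim=64, classes=2):
    """Parameters added by extra heads: W, b, and v of the scoring function plus
    the wider classifier input. attention_dim=64 is an assumption."""
    scoring = attention_dim * hidden + attention_dim + attention_dim
    classifier = classes * hidden
    return extra_heads * (scoring + classifier)

n_positive, n_negative = batch_composition(128, 1, 3)
remaining_rate = decay_multiplier(200)
clipped = clip_by_norm([3.0, 4.0])
added_by_three_heads = extra_parameters(3)
```

Under these assumptions three additional heads add 13,056 parameters, which is consistent with the stated difference between 78 k and 91 k. After 200 decay steps the learning rate falls to roughly 1.8% of its initial value.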
4.2 Regularization loss variation
Fig. 3 shows the regularization losses calculated from the positive validation set during training. It can be observed that the two inter-head regularization losses, for the context vectors and for the score vectors, decrease as intended, i.e., the orthogonality between the context vectors and between the score vectors of different attention heads increases, meaning that inter-head redundancy in time and subspace is reduced by the regularization. Meanwhile, as can be seen in Fig. 3(c), the intra-head regularization term increases, which indicates that the output context vectors of each attention head from different positive samples become more similar to each other as training progresses. This is desirable for the classification stage because it is generally beneficial to have less variation in the feature representation, i.e., the context vector, in feature space for the positive samples.
4.3 Performance with different combinations of regularizations
To see how the regularization affects keyword spotting performance, we compare the keyword spotting test results for different combinations of regularizations applied during training. The false rejection rate (FRR) measured at the confidence threshold corresponding to 1 false alarm per hour (FA/hr) for each model is used as the performance metric. For simplicity of comparison, we fixed the number of attention heads to 4, motivated by the fact that the keyword has four syllables, and the $\lambda$ value to 0.1.
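The metric can be computed in two steps: choose the threshold from the negative data so that the false alarm budget is met, then count the missed keywords at that threshold. The sketch below is one straightforward way to do this; the function names and the toy scores are hypothetical, and the evaluation code used in this work may differ in details such as tie handling.

```python
def threshold_at_false_alarm_rate(negative_scores, negative_hours, target_fa_per_hour):
    """Lowest threshold at which no more than the allowed number of negative
    scores exceed it (a detection fires when score > threshold)."""
    allowed = int(target_fa_per_hour * negative_hours)  # budget, rounded down
    ranked = sorted(negative_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0  # assumption: scores are posteriors, so the budget is never hit
    return ranked[allowed]

def false_rejection_rate(positive_scores, threshold):
    """Fraction of keyword utterances whose confidence does not exceed the threshold."""
    missed = sum(1 for score in positive_scores if score <= threshold)
    return missed / len(positive_scores)

# Toy data: 2 hours of negative audio and a budget of 1 false alarm per hour.
negative_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
positive_scores = [0.95, 0.7, 0.25, 0.6]

threshold = threshold_at_false_alarm_rate(negative_scores, 2.0, 1.0)
false_alarms = sum(1 for score in negative_scores if score > threshold)
frr = false_rejection_rate(positive_scores, threshold)
```

In this toy case two negative scores exceed the chosen threshold over two hours, which meets the budget of 1 FA/hr, and one of the four keyword utterances is rejected.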
| Systems | FRR (%) at 1 FA/hr | FRR (%) at 2 FA/hr | FRR (%) at 4 FA/hr |
From Table 1, we can see that all types of regularization contribute to improving the performance, both individually and in combination, while using all regularization terms gives the lowest FRR. Note that using plain multi-head attention also gives some improvement over the single-head attention model. At the thresholds corresponding to 1 FA/hr, the proposed regularization yields up to 32.6% and 25.1% relative reduction in FRR over the single-head attention model and the plain multi-head attention model, respectively.
Fig. 4 shows the receiver operating characteristic (ROC) curves of the single-head, the plain 4-head, and the regularized 4-head attention models on the test dataset, where we set all $\lambda$’s to 0.1. It can be seen that the regularized multi-head model consistently and significantly outperforms both the single-head attention model and the plain, i.e., non-regularized, multi-head attention model at all FA/hr values. At 1 FA/hr on the test dataset, FRRs are reduced by 34.4% and 36.0% relative to the single-head and plain multi-head models, respectively.
4.4 Varying $\lambda$ values
We also show how the performance changes with the $\lambda$ values. To see the change, we varied the $\lambda$ values over a range, with all $\lambda$’s sharing the same value within one training run for simplicity. The number of attention heads is fixed to 4 as before. Fig. 5 shows the value at which the best performance is achieved. Although we did not investigate different combinations of $\lambda$ values for the different regularization terms, this result suggests that one can find an optimal point in the hyperparameter space of the
$\lambda$’s, for which automated machine learning algorithms can be used.
5 Conclusion
In this paper, we have proposed a multi-head attention-based keyword spotting system trained with regularization derived from orthogonality constraints on the context and score vectors of the attention heads. The inter-head orthogonality regularization of context vectors and score vectors encourages the attention heads to be less redundant with each other in positions and subspaces, while the intra-head non-orthogonality regularization of context vectors gives each head contextual consistency across samples for the given task. The proposed orthogonality constrained multi-head attention mechanism has been shown to learn exclusive representations of sequence parts, both in position and in subspace, which in turn improves keyword spotting performance by extracting richer task-relevant information from structured data. In the experiment with the “Hey Snapdragon” keyword, the proposed method achieved relative false rejection rate reductions of 34.4% and 36.0% at 1 FA/hr over the single-head and plain multi-head attention-based models, respectively, on the test dataset. Future work includes investigating other criteria for regularizing multi-head attention and extending the idea to other speech tasks such as speaker verification and speech recognition.