Keyword Transformer: A Self-Attention Model for Keyword Spotting

04/01/2021 ∙ by Axel Berg, et al. ∙ ARM

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.




1 Introduction

Recent works in machine learning show that the Transformer architecture, first introduced by Vaswani et al. [1], is competitive not only in language processing, but also in, e.g., image classification [2, 3, 4], image colorization [5], object detection [6], video classification [7] and multi-agent spatiotemporal modeling [8]. This can be seen in the light of a broader trend, where a single neural network architecture generalizes across many domains of data and tasks.

Attention mechanisms have also been explored for speech recognition [9, 10, 11], but only as an extension to other architectures, such as convolutional or recurrent neural networks.

Inspired by the strength of the simple Vision Transformer (ViT) model [2] in computer vision and by the techniques that improve its data efficiency [3], we propose an adaptation of this architecture for keyword spotting and find that it matches or outperforms existing models on the much smaller Google Speech Commands dataset [12] without additional data.

We summarize our main contributions as follows:

  1. An investigation into the application of the Transformer architecture to keyword spotting, finding that applying self-attention is more effective in the time domain than in the frequency domain.

  2. The Keyword Transformer, illustrated in Figure 1: a fully self-attentional architecture inspired by ViT [2] that can be used as a drop-in replacement for existing keyword spotting models, together with a visualization of the learned attention masks and positional embeddings.

  3. An evaluation of this model across several tasks using the Google Speech Commands dataset with comparisons to state-of-the-art convolutional, recurrent and attention-based models.

  4. An analysis of model latency on a mobile phone, showing that the Keyword Transformer is competitive in edge use cases.

Figure 1: The Keyword Transformer architecture. Audio is preprocessed into a mel-scale spectrogram, which is partitioned into non-overlapping patches in the time domain. Together with a learned class token, these form the input tokens for a multi-layer Transformer encoder. As with ViT [2], a learned position embedding is added to each token. The output of the class token is passed through a linear head and used to make the final class prediction.

2 Related Work

Figure 2: The PostNorm (left) and PreNorm (right) Transformer encoder architectures. KWT uses a PostNorm encoder.

2.1 Keyword Spotting

Keyword spotting is used to detect specific words from a stream of audio, typically in a low-power always-on setting such as smart speakers and mobile phones. To achieve this, audio is processed locally on the device. In addition to detecting target words, classifiers may also distinguish between “silence” and “unknown” for words or sounds that are not in the target list.

In recent years, machine learning techniques, such as deep (DNN), convolutional (CNN) and recurrent (RNN) neural networks, have proven to be useful for keyword spotting. These networks are typically used in conjunction with a pre-processing pipeline which extracts the mel-frequency cepstrum coefficients (MFCC) [13]. Zhang et al. [14] investigated several small-scale network architectures and identified depthwise-separable CNN (DS-CNN) as providing the best trade-off between classification accuracy, memory footprint and computational resources. Other works have improved upon this result using synthesized data [15], temporal convolutions [16, 17], and self-attention [9].

Rybakov et al. [10] created a framework for training streaming models that continually accept input and achieved a new state-of-the-art result on Google Speech Commands using MHAtt-RNN, a non-streaming model that combines CNN, RNN and multi-headed (MH) self-attention layers.

2.2 Self-Attention and the Vision Transformer

Dosovitskiy et al. introduced the Vision Transformer (ViT) [2] and showed that Transformers can learn high-level image features by computing self-attention between different image patches. This simple approach outperformed CNNs but required pre-training on large datasets. Touvron et al. [3] improved data efficiency using strong augmentation, careful hyperparameter tuning and token-based distillation.

While self-attention has previously been used on top of convolutional and recurrent feature extractors for keyword spotting, to the best of our knowledge fully-attentional models based on the Transformer architecture [1] have not been investigated. Our approach is inspired by ViT, in the sense that we use patches of the audio spectrogram as input. We restrict ourselves to a non-streaming setting in this work, noting that others have previously investigated extensions of Transformers to a streaming setting [18].

3 The Keyword Transformer

3.1 Model Architecture

Let $X \in \mathbb{R}^{T \times F}$ denote the output of the MFCC spectrogram, with time windows $t = 1, \dots, T$ and frequencies $f = 1, \dots, F$. The spectrogram is first mapped to a higher dimension $d$, using a linear projection matrix $W_0 \in \mathbb{R}^{F \times d}$ in the frequency domain. In order to learn a global feature that represents the whole spectrogram, a learnable class embedding $X_{class} \in \mathbb{R}^{1 \times d}$ is concatenated with the input in the time-domain. Then a learnable positional embedding matrix $X_{pos} \in \mathbb{R}^{(T+1) \times d}$ is added, such that the input representation fed into the Transformer encoder is given by

$$X_0 = [X_{class}; X W_0] + X_{pos}$$
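The construction of the input tokens can be sketched in a few lines of NumPy. This is only an illustration: the shapes (98 time windows, 40 features, embedding dimension 64) are taken from Section 4.2 and Table 1, while the function name `make_input_tokens` and the random initialisation are assumptions made for the example and are not taken from the authors' code.

```python
import numpy as np

def make_input_tokens(X, W0, x_class, X_pos):
    """Build [x_class; X @ W0] + X_pos, giving one token per time window plus a class token."""
    projected = X @ W0                                      # (T, d): linear projection along frequency
    tokens = np.concatenate([x_class, projected], axis=0)   # (T + 1, d): class token goes first
    return tokens + X_pos                                   # add the learned position embedding

rng = np.random.default_rng(0)
T, F, d = 98, 40, 64                            # illustrative sizes
X = rng.standard_normal((T, F))                 # stand-in for an MFCC spectrogram
W0 = rng.standard_normal((F, d)) * 0.02         # hypothetical initialisation
x_class = np.zeros((1, d))
X_pos = rng.standard_normal((T + 1, d)) * 0.02
X0 = make_input_tokens(X, W0, x_class, X_pos)
print(X0.shape)
```

Each row of the result after the first corresponds to one time window, which is what makes the subsequent self-attention operate in the time domain.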


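The projected frequency-domain features are then fed into a sequential Transformer encoder consisting of $L$ multi-head attention (MSA) and multi-layer perceptron (MLP) blocks. In the $l$:th Transformer block, which takes $X_{l-1}$ as input, queries, keys and values are calculated as $Q = X_{l-1} W_Q$, $K = X_{l-1} W_K$ and $V = X_{l-1} W_V$ respectively, where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_h}$ and $d_h$ is the dimensionality of each attention-head. The self attention (SA) is calculated as

$$\mathrm{SA}(X_{l-1}) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d_h}}\right) V$$

The MSA operation is obtained by linearly projecting the concatenated output from the $k$ attention heads, using another matrix $W_P \in \mathbb{R}^{k d_h \times d}$:

$$\mathrm{MSA}(X_{l-1}) = [\mathrm{SA}_1(X_{l-1}); \mathrm{SA}_2(X_{l-1}); \dots; \mathrm{SA}_k(X_{l-1})] W_P$$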
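A minimal NumPy sketch of scaled dot-product self-attention and its multi-head extension follows. It assumes separate, bias-free projection matrices per head and small random weights; the helper names are hypothetical and chosen for readability rather than taken from any particular library.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head attention over the N tokens (rows) of X; returns the output and the weights."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_h = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_h))         # (N, N), each row sums to 1
    return A @ V, A

def multi_head_self_attention(X, heads, W_P):
    """heads: list of (W_Q, W_K, W_V) tuples, one per attention head."""
    outputs = [self_attention(X, *w)[0] for w in heads]
    return np.concatenate(outputs, axis=-1) @ W_P   # (N, k * d_h) @ (k * d_h, d)

rng = np.random.default_rng(0)
N, d, k, d_h = 99, 128, 2, 64                   # sizes matching the KWT-2 row of Table 1
X = rng.standard_normal((N, d))
heads = [tuple(rng.standard_normal((d, d_h)) * 0.05 for _ in range(3)) for _ in range(k)]
W_P = rng.standard_normal((k * d_h, d)) * 0.05
out = multi_head_self_attention(X, heads, W_P)
_, A = self_attention(X, *heads[0])
print(out.shape, A.shape)
```

Because every token is a time window, row $i$ of the attention matrix describes how strongly window $i$ attends to every other window and to the class token.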


In our default setting, we use the PostNorm [1] Transformer architecture as shown in Figure 2, where the Layernorm (LN) [19] is applied after the MSA and MLP blocks, in contrast to the PreNorm [20] variant, where LN is applied first. This decision is discussed further in the ablation section. As is typical for Transformers, we use GELU [21] activations in all MLP blocks.

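In summary, the output of the $l$:th Transformer block is given by

$$\tilde{X}_l = \mathrm{LN}(\mathrm{MSA}(X_{l-1}) + X_{l-1}), \qquad X_l = \mathrm{LN}(\mathrm{MLP}(\tilde{X}_l) + \tilde{X}_l), \qquad l = 1, \dots, L$$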
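The difference between the PostNorm and PreNorm orderings is easiest to see side by side. The sketch below is illustrative only: the attention and MLP sub-layers are replaced by toy stand-ins, and the layer normalisation omits its learnable scale and shift, so only the placement of LN relative to the residual connections is faithful to the description above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each token to zero mean and unit variance (no learnable scale/shift here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, msa, mlp):
    """PostNorm: LN is applied after each residual sum (the KWT default)."""
    x = layer_norm(msa(x) + x)
    return layer_norm(mlp(x) + x)

def pre_norm_block(x, msa, mlp):
    """PreNorm: LN is applied to the input of each sub-layer instead."""
    x = x + msa(layer_norm(x))
    return x + mlp(layer_norm(x))

rng = np.random.default_rng(0)
d = 64
W_a = rng.standard_normal((d, d)) * 0.05
W_m = rng.standard_normal((d, d)) * 0.05
toy_msa = lambda x: x @ W_a              # stand-in for multi-head self-attention
toy_mlp = lambda x: np.tanh(x @ W_m)     # stand-in for the GELU MLP
x = rng.standard_normal((99, d))
y_post = post_norm_block(x, toy_msa, toy_mlp)
y_pre = pre_norm_block(x, toy_msa, toy_mlp)
```

One visible consequence is that the PostNorm output is always normalised per token, whereas the PreNorm output keeps an unnormalised residual path from input to output.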


At the output layer, the class embedding is fed into a linear classifier. Our approach treats time windows in a manner analogous to the handling of image patches in ViT. Whereas in ViT, the self-attention is computed over image patches, the attention mechanism here takes place in the time-domain, such that different time windows will attend to each other in order to form a global representation in the class embedding.

The model size can be adjusted by tuning the parameters of the Transformer. Following [3], we fix the number of sequential Transformer encoder blocks to 12, and let $d/k = 64$, where $d$ is the embedding dimension and $k$ is the number of attention heads. By varying the number of heads as $k = 1, 2, 3$, we end up with three different models as shown in Table 1.

Model dim mlp-dim heads layers # parameters
KWT-1 64 256 1 12 607K
KWT-2 128 512 2 12 2,394K
KWT-3 192 768 3 12 5,361K
Table 1: Model parameters for the KWT architecture.
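As a sanity check on Table 1, the parameter counts can be approximated with simple arithmetic. The sketch below rests on several assumptions that the text does not confirm: bias-free query/key/value projections, an output projection and MLP layers with biases, two layer normalisations per block with scale and shift, 98 time windows of 40 features, no distillation token, and a 12-class linear head. The function name `kwt_param_count` is hypothetical.

```python
def kwt_param_count(dim, mlp_dim, layers=12, n_freq=40, n_time=98, n_classes=12):
    """Rough parameter count for a ViT-style encoder under the assumptions stated above."""
    attn = 3 * dim * dim + (dim * dim + dim)                  # bias-free Q/K/V + output projection
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)   # two dense layers with biases
    norms = 2 * (2 * dim)                                     # two LayerNorms, scale and shift each
    block = attn + mlp + norms
    embed = (n_freq * dim + dim) + dim + (n_time + 1) * dim   # projection, class token, positions
    head = dim * n_classes + n_classes                        # linear classifier
    return layers * block + embed + head

counts = {name: kwt_param_count(dim, 4 * dim)
          for name, dim in [("KWT-1", 64), ("KWT-2", 128), ("KWT-3", 192)]}
for name, n in counts.items():
    print(f"{name}: {n:,} parameters")
```

Under these assumptions the totals come to 607,308, 2,394,252 and 5,360,844, which round to the figures in Table 1. The number of heads does not appear in the formula because the total query/key/value width $k \cdot d_h$ equals the embedding dimension for all three models.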

3.2 Knowledge Distillation

As introduced by [22], knowledge distillation uses a pre-trained teacher’s predictions to provide an auxiliary loss to the student model being trained. We follow [3] and implement this by appending an additional learned distillation token to the input. At the output layer the distillation token is fed into a linear classifier and trained using the hard labels predicted by the teacher.

In contrast to [22], this teacher is typically weaker than the student and in contrast to [23] the teacher receives the same augmentation of the input as the student. In contrast to both, the hard predictions of the teacher are used as targets. Let $Z_{sc}$ be the logits of the student class token, $Z_{sd}$ be the logits of the student distillation token and $Z_t$ be the logits of the teacher model. The overall loss becomes

$$\mathcal{L} = \frac{1}{2} \mathcal{L}_{CE}(\psi(Z_{sc}), y) + \frac{1}{2} \mathcal{L}_{CE}(\psi(Z_{sd}), y_t)$$

where $y_t = \mathrm{argmax}(Z_t)$ are the hard decisions of the teacher, $y$ are the ground-truth labels, $\psi$ is the softmax function and $\mathcal{L}_{CE}$ is the cross-entropy loss. At inference time the class and distillation token predictions are averaged to produce a single prediction. In all experiments, we use MHAtt-RNN as a teacher and denote hard-distillation models with KWT.
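The hard-distillation objective can be written compactly in NumPy. This is a sketch under stated assumptions: integer class labels, no label smoothing (Table 2 lists a value of 0.1, which is left out here for brevity), and inference that averages softmax probabilities, since the text does not say whether probabilities or logits are averaged. All function names are illustrative.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def hard_distillation_loss(z_sc, z_sd, z_t, y):
    """Class token learns from ground truth, distillation token from the teacher's argmax."""
    y_t = z_t.argmax(axis=-1)                    # hard decisions of the teacher
    return 0.5 * cross_entropy(z_sc, y) + 0.5 * cross_entropy(z_sd, y_t)

def predict(z_sc, z_sd):
    """Average class-token and distillation-token probabilities (assumed averaging rule)."""
    probs = 0.5 * (np.exp(log_softmax(z_sc)) + np.exp(log_softmax(z_sd)))
    return probs.argmax(axis=-1)

rng = np.random.default_rng(0)
n, n_classes = 4, 12
z_t = rng.standard_normal((n, n_classes))        # stand-in teacher logits
y = np.array([0, 3, 7, 11])                      # stand-in ground-truth labels
uniform_loss = hard_distillation_loss(np.zeros((n, n_classes)), np.zeros((n, n_classes)), z_t, y)
peaked = 10.0 * np.eye(n_classes)[y]             # logits that strongly favour the true class
print(uniform_loss, predict(peaked, peaked))
```

With uninformative (all-zero) student logits both terms equal $\log 12$, so the combined loss is also $\log 12$, which is a convenient check that the two halves are weighted equally.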

4 Experiments

Training steps 23,000
Batch size 512
Optimizer AdamW
Learning rate 0.001
Schedule Cosine
Warmup epochs
Weight decay 0.1
Label smoothing 0.1
Dropout 0
Time window length 30 ms
Time window stride 10 ms
#DCT Features 40
Data augmentation
Time shift [ms] [-100, 100]
Resampling [0.85, 1.15]
Background vol. 0.1
#Time masks 2
Time mask size [0,25]
#Frequency masks 2
Frequency mask size [0,7]
Table 2: Hyperparameters used in all experiments.

4.1 Keyword Spotting on Google Speech Commands

We provide experimental results on the Google Speech Commands dataset V1 and V2 [12]. Both datasets consist of 1-second-long audio snippets, sampled at 16 kHz, containing utterances of short keywords recorded in natural environments. V1 of the dataset contains 65,000 snippets of 30 different words, whereas V2 contains 105,000 snippets of 35 different words. The 12-label classification task uses 10 words: “up”, “down”, “left”, “right”, “yes”, “no”, “on”, “off”, “go”, and “stop”, in addition to “silence” and “unknown”, where instances of the latter are taken from the remaining words in the dataset, whereas the 35-label task uses all available words. We use the same 80:10:10 train/validation/test split as [12, 14, 10] for side-by-side comparisons. We adhere as closely as possible to the evaluation criteria of [10], and for each experiment, we train the model three times with different random initializations.

As our intention is to explore the extent to which results using Transformers from other domains transfer to keyword spotting, we follow the choices and hyperparameters from [3] as closely as possible, with the notable exception that we found increasing weight decay from 0.05 to 0.1 to be important. Furthermore, we use the same data pre-processing and augmentation policy as in [10], which consists of random time shifts, resampling, background noise, as well as augmenting the MFCC features using SpecAugment [24]. We train our models over the same number of total input examples as MHAtt-RNN (12M) to allow a fair comparison. For clarity, the hyperparameters used in all experiments are reported in Table 2.
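To make the feature-level augmentation concrete, here is a small sketch of SpecAugment-style masking using the mask counts and maximum sizes from Table 2 (two time masks of up to 25 windows, two frequency masks of up to 7 coefficients). Filling the masked regions with zeros and sampling widths uniformly are assumptions, and `mask_spectrogram` is a hypothetical helper rather than the pipeline used in [10].

```python
import numpy as np

def mask_spectrogram(X, rng, n_time_masks=2, max_time=25, n_freq_masks=2, max_freq=7):
    """Zero out random time and frequency bands of a (T, F) feature matrix; returns a copy."""
    X = X.copy()
    T, F = X.shape
    for _ in range(n_time_masks):
        width = rng.integers(0, max_time + 1)      # mask size drawn from [0, max_time]
        start = rng.integers(0, T - width + 1)     # keep the mask inside the clip
        X[start:start + width, :] = 0.0
    for _ in range(n_freq_masks):
        width = rng.integers(0, max_freq + 1)      # mask size drawn from [0, max_freq]
        start = rng.integers(0, F - width + 1)
        X[:, start:start + width] = 0.0
    return X

rng = np.random.default_rng(0)
features = np.ones((98, 40))                       # stand-in for MFCC features
augmented = mask_spectrogram(features, rng)
masked_rows = int((augmented == 0).all(axis=1).sum())
masked_cols = int((augmented == 0).all(axis=0).sum())
print(masked_rows, masked_cols)
```

Masks may overlap or have zero width, so the number of masked windows varies between 0 and 50 and the number of masked coefficients between 0 and 14.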

The results are shown in Table 3, where for our own results, we report a 95% confidence interval for the mean accuracy over all three model evaluations. Our best models match or surpass the previous state-of-the-art accuracies, with significant improvements on both the 12-label and 35-label V2-datasets. In general, Transformers tend to benefit more from large amounts of data, which could explain why KWT does not outperform MHAtt-RNN on the smaller V1-dataset. Nevertheless, we also note that knowledge distillation is effective in improving the accuracy of KWT in most scenarios.
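The text does not state how the 95% confidence interval is computed. One common choice for three runs is a Student-t interval with two degrees of freedom, sketched below. The critical value 4.303 is the standard two-sided 95% value for two degrees of freedom, and the accuracies in the example are made-up numbers for illustration, not results from the paper.

```python
import math
import statistics

T_CRIT_95_DF2 = 4.303  # two-sided 95% Student-t critical value for 2 degrees of freedom

def mean_ci95_three_runs(accuracies):
    """Return (mean, half-width) of a t-based 95% interval for exactly three runs."""
    if len(accuracies) != 3:
        raise ValueError("this sketch hard-codes the critical value for three runs")
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(3)   # standard error of the mean
    return mean, T_CRIT_95_DF2 * sem

# Hypothetical accuracies from three random initialisations (not taken from the paper).
mean, half_width = mean_ci95_three_runs([98.4, 98.5, 98.6])
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only three runs the critical value is large, so even small run-to-run variation yields a comparatively wide interval.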

Model V1-12 V2-12 V2-35
DS-CNN [14] 95.4
TC-ResNet [16] 96.6
Att-RNN [9] 95.6 96.9 93.9
MatchBoxNet [17] 97.48 97.6
Embed + Head [15] 97.7
MHAtt-RNN [10] 97.2 98.0
Res15 [27] 98.0 96.4
MHAtt-RNN (Ours) 97.50 98.36 97.27
KWT-3 (Ours) 97.24 98.54 97.51
KWT-2 (Ours) 97.36 98.21 97.53
KWT-1 (Ours) 97.05 97.72 96.85
KWT-3(Ours) 97.49 98.56 97.69
KWT-2(Ours) 97.27 98.43 97.74
KWT-1(Ours) 97.26 98.08 96.95
Table 3: Accuracy on Speech Commands V1 [25] and V2 [26].

4.2 Ablation Studies

Figure 3: Accuracy on Speech Commands V2-12 using KWT-3 with different patch sizes.

We investigate different approaches to self-attention by varying the shapes of the MFCC spectrogram patches that are fed into the Transformer. Using our default hyperparameters, the spectrogram consists of 98 time windows, containing 40 mel-scale frequencies. Our baseline uses time-domain attention, but we also investigate frequency-domain attention and intermediate steps where rectangular patches are used. We find time-domain attention to perform best, as shown in Figure 3. This is in agreement with previous findings that temporal convolutions work well for keyword spotting [16], since the first projection layer of our model can be interpreted as a form of convolution.
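The figure of 98 time windows follows from the framing parameters in Table 2, and the patch shape determines how many tokens the Transformer sees. The sketch below assumes framing without padding at the clip edges and non-overlapping patches that tile the spectrogram exactly; both helper names are illustrative.

```python
def num_frames(clip_ms=1000, window_ms=30, stride_ms=10):
    """Number of full analysis windows in a clip, assuming no padding."""
    return (clip_ms - window_ms) // stride_ms + 1

def num_patches(n_time, n_freq, patch_t, patch_f):
    """Number of non-overlapping (patch_t x patch_f) patches; remainders are dropped."""
    return (n_time // patch_t) * (n_freq // patch_f)

T = num_frames()                               # 1 s clip, 30 ms window, 10 ms stride
time_tokens = num_patches(T, 40, 1, 40)        # time-domain attention: one token per window
freq_tokens = num_patches(T, 40, T, 1)         # frequency-domain attention: one token per coefficient
print(T, time_tokens, freq_tokens)
```

Time-domain attention therefore works with 98 tokens of 40 features each, while frequency-domain attention works with 40 tokens of 98 values each; rectangular patches fall between these two extremes.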

We also investigated the use of PreNorm and PostNorm and found that the latter improves performance for keyword spotting in our experiments. This is contrary to previous findings on other tasks [28], where PreNorm has been shown to yield better results, and we encourage further work to explore the role of normalization in Transformers across different domains.

4.3 Attention Visualization

In order to examine which parts of the audio signal the model attends to, we propagate the attention weights of each Transformer layer from the input to the class token by averaging the attention weights over all heads. This produces a set of attention weights for each time window of the input signal. Figure 4 shows the attention mask overlaid on the waveform of four different utterances. It can be seen that the model is able to pay attention to the important parts of the input while effectively suppressing background noise.
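One common way to propagate attention through the layers is attention rollout, sketched below. Whether the authors used exactly this procedure is not stated; in particular, mixing in the identity matrix to account for residual connections is an assumption of this sketch, controlled by the hypothetical `add_residual` flag.

```python
import numpy as np

def attention_rollout(attn, add_residual=True):
    """attn: (layers, heads, N, N) row-stochastic attention maps, class token at index 0.

    Returns the propagated weight of each time window as seen from the class token.
    """
    n_tokens = attn.shape[-1]
    rollout = np.eye(n_tokens)
    for layer in attn:                              # iterate from the first to the last layer
        A = layer.mean(axis=0)                      # average over heads
        if add_residual:
            A = 0.5 * (A + np.eye(n_tokens))        # model the skip connection (assumption)
        A = A / A.sum(axis=-1, keepdims=True)       # keep rows normalised
        rollout = A @ rollout                       # compose with the earlier layers
    return rollout[0, 1:]                           # class-token row, time-window columns

rng = np.random.default_rng(0)
raw = rng.random((12, 3, 99, 99))                   # stand-in for attention maps of a KWT-3 model
attn = raw / raw.sum(axis=-1, keepdims=True)
weights = attention_rollout(attn)
print(weights.shape)
```

The returned vector has one entry per time window, so it can be stretched along the time axis and drawn over the waveform in the manner of Figure 4.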

We also study the position embeddings of the final model by analyzing their cosine similarity, as shown in Figure 5. Nearby position embeddings exhibit a high degree of similarity and distant embeddings are almost orthogonal. This pattern is less pronounced for time windows near the start and the end of the audio snippets. We hypothesize that this is either because words are typically in the middle of each snippet and therefore relative position is more important there, or because the audio content at the start and end is less distinguishable.
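The similarity analysis amounts to normalising each position embedding and taking pairwise dot products. The sketch below uses random vectors as a stand-in for the learned embeddings, so it reproduces the computation only, not the pattern seen in Figure 5.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    E = E / np.linalg.norm(E, axis=-1, keepdims=True)
    return E @ E.T

rng = np.random.default_rng(0)
E = rng.standard_normal((99, 192))        # random stand-in for learned position embeddings
S = cosine_similarity_matrix(E)
print(S.shape)
```

For learned embeddings, a bright band around the diagonal of this matrix would indicate that neighbouring time windows have similar embeddings.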

Figure 4: The learned attention mask, propagated from the input to the class token, overlaid on four different audio snippets, without (top) and with (bottom) background noise.
Figure 5: Cosine similarities of the learned position embeddings of KWT.

4.4 Latency Measurements

We converted our KWT models, DS-CNN (with stride) [14], TC-ResNet [16] and MHAtt-RNN [10] to TensorFlow Lite (TFLite) format to measure inference latency on a OnePlus 6 mobile device based on the Snapdragon 845 (4x Arm Cortex-A75, 4x Arm Cortex-A55). For latency comparisons we report accuracy figures for the Google Speech Commands V2 with 12 labels and 35 labels [26, 10]. The TFLite Benchmark tool [29] is used to measure latency, defined as the processing time of a single 1-second speech sequence. For each model, we do 10 warmup runs followed by 100 inference runs, capturing the average latency on a single thread.
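The measurement protocol (warmup runs followed by timed runs, reporting the mean) can be mimicked for any callable. The sketch below is a generic Python timing loop and not the TFLite Benchmark tool itself; `run_once` is a hypothetical zero-argument function that would perform one inference on a 1-second clip.

```python
import time

def average_latency_ms(run_once, warmup_runs=10, timed_runs=100):
    """Average wall-clock time per call in milliseconds, after discarding warmup runs."""
    for _ in range(warmup_runs):
        run_once()                                  # warm caches and lazy initialisation
    start = time.perf_counter()
    for _ in range(timed_runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / timed_runs

calls = []
latency = average_latency_ms(lambda: calls.append(1))   # trivial stand-in workload
print(f"{latency:.6f} ms per call over {len(calls) - 10} timed runs")
```

Discarding the warmup runs avoids counting one-off costs such as memory allocation, which would otherwise inflate the average.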

In Figure 6 we observe that Transformer-based models are competitive with the existing state-of-the-art despite being designed with no regard to latency. There is a broad body of research on optimizing Transformer models; of particular note is the replacement of layer normalization and activations in [30] that decreases latency by a factor of three. Our findings here suggest that many of these results could be leveraged in the keyword spotting domain to extend the practicality of these models.

Figure 6: Latency and accuracy when processing a whole 1-second audio clip on a single thread on a mobile phone.

5 Conclusion

In this paper we explore the direct application of Transformers to keyword spotting, using a standard architecture and a principled approach to converting the audio input into tokens.

In doing so we introduce KWT, a fully-attentional model that matches or exceeds the state-of-the-art over a range of keyword spotting tasks with real-world latency that remains competitive with previous work.

These surprising results suggest that Transformer research in other domains offers a rich avenue for future exploration in this space. In particular we note that Transformers benefit from large-scale pre-training [2], have seen 5.5x latency reduction through model compression [30] and can obtain up to 4059x energy reduction through sparsity and hardware codesign [31]. Such improvements would make a meaningful impact on many keyword spotting applications and we encourage future research in this area.

6 Acknowledgements

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation. We thank Matt Mattina for supporting this work, Magnus Oskarsson for his feedback and comments, and Oleg Rybakov and Hugo Touvron for sharing their code with the community.