Attention-Free Keyword Spotting

10/14/2021, by Mashrur M. Morshed, et al.

Until now, attention-based models have been used with great success in the keyword spotting problem domain. However, in light of recent advances in deep learning, the question arises whether self-attention is truly irreplaceable for recognizing speech keywords. We thus explore the usage of gated MLPs, previously shown to be alternatives to transformers in vision tasks, for the keyword spotting task. We verify our approach on the Google Speech Commands V2-35 dataset and show that it is possible to obtain performance comparable to the state of the art without any apparent usage of self-attention.




1 Introduction

Transformers [23] have been a highly disruptive innovation in deep learning. They were originally introduced in the domain of Natural Language Processing (NLP), and have since replaced classical recurrent neural networks as the default approach in many NLP tasks [5, 26]. Not only have Transformers transformed the field of NLP, but they have also proven competitive across several important problem domains, such as image classification [6, 27], video classification [17], object detection [2], automatic speech recognition [11, 9], and so on.

Partly due to their remarkable success in vision tasks, Transformers have lately been studied in the field of keyword spotting [1, 8], and have shown similarly exceptional results.

Recent research [22, 12, 15] shows that a core component of Transformers, self-attention, may not be necessary for achieving good performance in vision and language tasks. This finding motivates a study of whether self-attention, a core component of several recent state-of-the-art KWS methods, is truly necessary for the keyword spotting task.

Our contributions can be summarized as follows:

  1. We introduce the Keyword-MLP, an attention-free alternative to the Keyword-Transformer (KWT) [1]. It achieves 97.56% accuracy on the Google Speech Commands V2-35 [25] dataset—showing comparable performance to the KWT, while being more parameter efficient.

  2. We provide support for the hypothesis that self-attention is sufficient, but not strictly necessary, for highly performant keyword spotting solutions.

2 Related Work

2.1 Keyword Spotting

Keyword spotting (KWS) deals with identifying some pre-specified speech keywords from an audio stream. As it is commonly used in always-on edge applications, KWS research often focuses on both accuracy and efficiency.

While research in keyword spotting goes back to the 1960s [21], most of the recent and relevant works have been focused on the Google Speech Commands dataset [25]—which has inspired numerous works and has rapidly grown to be the standard benchmark in this field. Notably, the dataset has two versions, V1 and V2, consisting of 30 and 35 keywords respectively.

Initial approaches to keyword spotting on the Speech Commands dataset consisted of convolutional models [25]. Majumdar and Ginsburg [14], Mordido et al. [16], and Zhang et al. [28] proposed lightweight CNN models with depth-wise separable convolutions. Andrade et al. [3] proposed using a convolutional recurrent model with attention, introducing the usage of attention in the KWS task.

Rybakov et al. [19] proposed a multi-headed, self-attention-based RNN (MHAtt-RNN). Vygon and Mikhaylovskiy [24] proposed an efficient representation learning method with triplet loss for keyword spotting. While the state of the art in KWS at that time was the method of Rybakov et al. [19], it was empirically seen that triplet loss performed poorly with RNN-based models. The authors later obtained excellent results with ResNet [20] variants.

Very recently, Berg et al. [1] and Gong et al. [8] proposed the Keyword Transformer (KWT) and the Audio Spectrogram Transformer (AST) respectively. Both approaches are inspired by the success of the Vision Transformer (ViT) [6], and show that fully self-attention-based models can obtain state-of-the-art or comparable results on the keyword spotting task. A key difference between these two transformer-based approaches is that AST uses ImageNet [4] and Audioset [7] pre-training. Furthermore, the best-performing KWT models are trained with knowledge distillation [10] using a teacher MHAtt-RNN [19].

2.2 MLP-based Vision

The Vision Transformer [6] has thus far shown the remarkable capability of Transformers on image and vision tasks. However, several recent works have questioned the necessity of self-attention in ViT.

Melas-Kyriazi [15] directly raises the question of the necessity of attention, and shows that the effectiveness of the Vision Transformer may owe more to the idea of the patch embedding than to self-attention. Tolstikhin et al. [22] proposed the MLP-Mixer, which performs token mixing and channel mixing on image patches/tokens, and shows competitive performance on the ImageNet benchmark. Liu et al. [12] proposed the g-MLP, consisting of very simple channel projections and spatial projections with multiplicative gating, showing remarkable performance without any apparent use of self-attention.

3 Keyword-MLP

| Model | # Params | GFLOPS | Acc. (V2-35) |
| --- | --- | --- | --- |
| KWT-1 [1] | 0.607 M | 0.108 | 96.85 |
| KWT-2 [1] | 2.394 M | 0.469 | 97.53 |
| KWT-3 [1] | 5.361 M | 1.053 | 97.51 |
| KW-MLP (ours) | 0.424 M | 0.045 | 97.56 |

Table 1: Performance comparison with KWT [1] (without distillation)

Inputs to the Keyword-MLP consist of mel-frequency cepstrum coefficients (MFCC). Let an arbitrary input MFCC be denoted as $X \in \mathbb{R}^{F \times T}$, where $F$ and $T$ are the frequency bins and time-steps respectively. We divide $X$ into patches of shape $F \times 1$, getting a total of $P$ patches, where $P = T$.

The patches are flattened, giving us $X_0 \in \mathbb{R}^{T \times F}$. We then map $X_0$ to a higher dimension $d$, with a linear projection matrix $P_0 \in \mathbb{R}^{F \times d}$. The overall input to the model thus stands:

$$X_E = X_0 P_0 \in \mathbb{R}^{T \times d}$$

An advantage of KW-MLP over KWT is that we only require the patch embeddings $X_E$; we do not need any positional embedding.
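As a concrete illustration, the patch embedding above can be sketched in a few lines of NumPy under the paper's settings ($F = 40$, $T = 98$, $d = 64$); the projection matrix here is a random stand-in for the learned weights:

```python
import numpy as np

# Sketch of the KW-MLP input pipeline: F = 40 frequency bins, T = 98 time
# steps, patches of shape F x 1 (one per time step), embedding dim d = 64.
# P0 is a random stand-in for the learned projection.
F, T, d = 40, 98, 64
rng = np.random.default_rng(0)

mfcc = rng.standard_normal((F, T))        # input MFCC X, shape (F, T)

# Each F x 1 patch flattens to a length-F vector, so X0 is simply the
# transpose: one row per time step.
X0 = mfcc.T                               # (T, F)

P0 = rng.standard_normal((F, d)) * 0.02   # linear projection to dimension d
XE = X0 @ P0                              # patch embeddings, (T, d) = (98, 64)
print(XE.shape)                           # (98, 64)
```

Note that, as stated above, no positional embedding is added to $X_E$.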

Figure 1: The Keyword-MLP architecture. It consists of $L$ consecutive g-MLP blocks. Note that $\oplus$ represents the typical added skip-connection, while $\odot$ represents the element-wise product.

The obtained $X_E$ is passed through $L$ consecutive, identical gated-MLP (g-MLP) blocks [12]. On a high level, a g-MLP block can be summarized as a pair of channel projections separated by a spatial projection.

We can formulate the g-MLP block with the following set of equations (omitting normalization for the sake of conciseness):

$$Z = \sigma(X U), \qquad \tilde{Z} = s(Z), \qquad Y = \tilde{Z} V$$

$U$ and $V$ denote the linear channel projections, while $\sigma$ represents the GELU activation function. $s(\cdot)$ represents the Spatial Gating Unit (SGU) [12]: the channels of $Z$ are split into $(Z_1, Z_2)$, $W$ performs the linear projection across the spatial dimension, and this is followed by the linear gating, an element-wise multiplication with $Z_1$:

$$s(Z) = Z_1 \odot W Z_2$$

Unlike the original g-MLP paper, we find that applying Layer Normalization after the channel and spatial projections in the gated-MLP blocks results in notably faster and better convergence.
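A minimal NumPy sketch of one g-MLP block with this post-norm placement follows. The channel-split SGU follows Liu et al. [12]; the exact normalization placement and the random initialization are illustrative assumptions, not the released implementation:

```python
import numpy as np

# One g-MLP block, assuming T = 98 tokens, d = 64 channels, hidden dim
# d_ffn = 256, and a channel-split Spatial Gating Unit. Random weights
# stand in for trained parameters.
T, d, d_ffn = 98, 64, 256
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

U = rng.standard_normal((d, d_ffn)) * 0.02        # channel projection (up)
W = rng.standard_normal((T, T)) * 0.02            # spatial projection (98 x 98)
V = rng.standard_normal((d_ffn // 2, d)) * 0.02   # channel projection (down)

def sgu(z):
    # Spatial Gating Unit: split channels, gate one half with the
    # spatial projection of the other half.
    z1, z2 = np.split(z, 2, axis=-1)
    return z1 * (W @ z2)

def gmlp_block(x):
    z = gelu(x @ U)            # (T, d_ffn)
    z = sgu(z)                 # (T, d_ffn // 2)
    y = z @ V                  # (T, d)
    return layer_norm(x + y)   # post-norm: normalize after the projections

out = gmlp_block(rng.standard_normal((T, d)))
print(out.shape)               # (98, 64)
```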

The overall system is shown in Figure 1. In Keyword-MLP, we use $L = 12$, that is, 12 consecutive g-MLP layers. Input embeddings are of dimensions $98 \times 64$. The model accepts inputs of shape $40 \times 98$, with patch sizes of $40 \times 1$. KW-MLP has 424K parameters, notably fewer than the KWT models.

Since our work closely follows the method and settings of the Keyword Transformer, we show a parameter vs result comparison with KWT variants in Table 1.
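As a sanity check on the parameter count, a back-of-the-envelope tally under the assumptions sketched above (a channel-split SGU, one LayerNorm per block plus one inside each SGU) lands near the reported 424K; the exact bookkeeping of the released model may differ slightly:

```python
# Rough KW-MLP parameter count under the assumed layer layout.
F_bins, T, d, d_ffn, L, n_classes = 40, 98, 64, 256, 12, 35

patch_embed = F_bins * d + d                  # linear projection of 40-dim patches
per_block = (
    2 * d                                     # block LayerNorm (gain + bias)
    + d * d_ffn + d_ffn                       # channel projection U (up)
    + 2 * (d_ffn // 2)                        # LayerNorm inside the SGU
    + T * T + T                               # spatial projection W (98 x 98)
    + (d_ffn // 2) * d + d                    # channel projection V (gated half -> d)
)
head = 2 * d + d * n_classes + n_classes      # final norm + classifier
total = patch_embed + L * per_block + head
print(total)                                  # ~424K, consistent with Table 1
```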

| Parameter | Value |
| --- | --- |
| Epochs | 140 |
| Batch Size | 256 |
| Optimizer | AdamW |
| Learning Rate | 0.001 |
| Warmup Epochs | 10 |
| Scheduling | Cosine |
| Label Smoothing | 0.1 |
| Weight Decay | 0.1 |
| Block Survival Prob. | 0.9 |
| *Audio Processing* | |
| Sampling Rate | 16000 Hz |
| Window Length | 30 ms |
| Hop Length | 10 ms |
| n_mfcc | 40 |
| Num. Time Masks | 2 |
| Time Mask Width | [0, 25] |
| Num. Freq Masks | 2 |
| Freq Mask Width | [0, 7] |
| Num. Blocks, L | 12 |
| Input MFCC Shape | 40 × 98 |
| Patch Size | 40 × 1 |
| Dim, d | 64 |
| Dim Proj. | 256 |
| Num. Classes | 35 |

Table 2: Overview of Hyper-Parameters and Settings
| Method | Extra Knowledge | Accuracy |
| --- | --- | --- |
| Attention-RNN [3] | — | 93.9 |
| Res-15 [24] | — | 96.4 |
| MHAtt-RNN [19] | — | 97.27 |
| AST-S [8] | ImageNet pretrained | 98.11 |
| AST-P [8] | ImageNet & Audioset | 97.88 |
| KWT-3 [1] | KD with [19] | 97.69 |
| KWT-2 [1] | KD with [19] | 97.74 |
| KWT-1 [1] | KD with [19] | 96.95 |
| KWT-3 [1] | — | 97.51 |
| KWT-2 [1] | — | 97.53 |
| KWT-1 [1] | — | 96.85 |
| KW-MLP (ours) | — | 97.56 |

Table 3: Comparison of accuracy on the Google Speech Commands V2-35 dataset [25]

4 Experimental Details

We follow hyperparameters similar to [19, 1], with minor changes. For training, we use a batch size of 256, and train for 140 epochs. We use the AdamW optimizer [13] with an initial learning rate of 0.001 and a weight decay of 0.1. Label smoothing of 0.1 is used. Warmup scheduling is applied for the first 10 epochs, followed by cosine annealing.

No other augmentation apart from Spectral Augmentation [18] is applied. For Spectral Augmentation, we use 2 time masks and 2 frequency masks, each having widths in the closed range [0, 25] and [0, 7] respectively. A complete list of hyperparameters and settings can be seen in Table 2.
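The two-mask spectral augmentation can be sketched as follows; masked regions are zeroed here, though implementations sometimes fill with the per-feature mean instead:

```python
import numpy as np

# Masking-only spectral augmentation: 2 time masks with widths drawn from
# [0, 25] and 2 frequency masks with widths drawn from [0, 7], applied to
# a 40 x 98 MFCC.
def spec_augment(mfcc, n_time=2, t_max=25, n_freq=2, f_max=7, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    out = mfcc.copy()
    n_mels, n_steps = out.shape
    for _ in range(n_time):                         # zero out random time spans
        w = int(rng.integers(0, t_max + 1))
        t0 = int(rng.integers(0, n_steps - w + 1))
        out[:, t0:t0 + w] = 0.0
    for _ in range(n_freq):                         # zero out random frequency bands
        w = int(rng.integers(0, f_max + 1))
        f0 = int(rng.integers(0, n_mels - w + 1))
        out[f0:f0 + w, :] = 0.0
    return out

augmented = spec_augment(np.ones((40, 98)), rng=np.random.default_rng(0))
print(augmented.shape)    # (40, 98)
```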

As an additional regularization method, at training time, each g-MLP block is dropped with a probability of 0.1 (conversely, each block has a survival probability of 0.9).
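This block-dropping scheme (a form of stochastic depth) can be sketched as follows; the lambda "blocks" are hypothetical stand-ins for real g-MLP blocks:

```python
import random

# Each block survives with probability 0.9 at training time and always
# runs at inference; a dropped block contributes only its skip connection.
def forward(x, blocks, survival_prob=0.9, training=True, rng=None):
    if rng is None:
        rng = random.Random()
    for block in blocks:
        if training and rng.random() > survival_prob:
            continue          # block dropped: identity via the skip connection
        x = block(x)
    return x

blocks = [lambda v: v + 1] * 12   # each toy block adds 1, making drops visible
print(forward(0, blocks, training=False))   # all 12 blocks run -> 12
```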

As seen in Table 1, the KW-MLP model has 424K parameters and 0.045 GFLOPS, making it more storage- and compute-efficient than the KWT models while having comparable or better accuracy. Furthermore, since we do not apply expensive runtime augmentations such as resampling, time-shifting, adding background noise, or mixup, training is quite fast. On free cloud resources (an NVIDIA Tesla P100, at Kaggle), it is possible to train KW-MLP models in about two hours.

From Table 3, we see that despite not using self-attention, the performance of KW-MLP is comparable to that of self-attention and transformer-based models. Furthermore, apart from being compute-efficient, KW-MLP is also data-efficient: it achieves good standalone performance without pretraining on Audioset or ImageNet as in AST, and without distillation from another pre-trained model as in KWT.

One limitation of the KW-MLP experiments is that, as mentioned earlier, the effect of various augmentation techniques has not been sufficiently explored. While this keeps training fast and inexpensive, it also means it may be possible to push the accuracy of the KW-MLP model further. Regardless, we believe the results at hand support our original claim: self-attention is sufficient, but not necessary, for KWS.

5 Ablation

5.1 MLP-Mixer based KWS

The gated-MLP architecture is not the only attention-free, MLP-based method that works with patched inputs. As such, we also cursorily explored the approach of Tolstikhin et al. [22], the MLP-Mixer, with the intention of observing its performance on keyword spotting on the Google Speech Commands V2-35 dataset [25].

Mixer layers are composed of two types of MLP layers: channel-mixing layers, which operate independently on each token across its channels, and patch- or token-mixing layers, which operate on each channel across spatial locations (patches).
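The distinction between the two mixing directions can be sketched with plain matrix products; the dimensions here are illustrative, tanh stands in for the Mixer's GELU, and residual connections and normalization are omitted:

```python
import numpy as np

# Contrast of the Mixer's two MLP types on a (T, C) token matrix:
# T tokens, C channels, with hypothetical hidden widths D_S and D_C.
T, C, D_S, D_C = 98, 64, 128, 256
rng = np.random.default_rng(0)
x = rng.standard_normal((T, C))

# Token mixing: transpose so the MLP acts along the T spatial locations,
# independently for each channel.
W1 = rng.standard_normal((T, D_S)) * 0.02
W2 = rng.standard_normal((D_S, T)) * 0.02
token_mixed = (np.tanh(x.T @ W1) @ W2).T         # (T, C)

# Channel mixing: the MLP acts along the C channels, independently for
# each token.
V1 = rng.standard_normal((C, D_C)) * 0.02
V2 = rng.standard_normal((D_C, C)) * 0.02
channel_mixed = np.tanh(token_mixed @ V1) @ V2   # (T, C)
print(channel_mixed.shape)                       # (98, 64)
```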

The inputs to the system closely follow the approach detailed in Berg et al. [1] and in this paper: $40 \times 1$ patches are taken from the MFCCs.

We conducted several experiments, and the highest accuracy found was 94.11%. The corresponding Mixer model is characterized by its depth, embedding dimension, and the hidden dimensions $D_S$ and $D_C$ of its token-mixing and channel-mixing MLP blocks respectively. We used the Adam optimizer with an exponentially decayed learning rate, training for 150 epochs with a batch size of 64.

Liu et al. [12] specifically mention in their paper that gated MLPs are generally more accurate than MLP-Mixers; this observation seems to hold for the keyword spotting task as well.

5.2 Pre or Post Normalization Ambiguity

In Liu et al. [12], the authors apply normalization before the initial channel projection, at the beginning of the gated-MLP block. However, our experiments suggest that this may not be suitable for the keyword spotting task. Instead, we find that applying normalization after the final channel projection (as in Fig. 1) shows much better results; with otherwise similar settings, the pre-norm variant converges to noticeably lower accuracy. Interestingly, Berg et al. [1] mention a similar phenomenon in their work: the PostNorm Transformer performs better than the PreNorm Transformer in keyword spotting. Figure 2 shows a comparative validation curve of pre-norm and post-norm.

Figure 2: Comparison of validation curves for pre-norm (blue) and post-norm (gray).

While we have empirical evidence of this phenomenon, we are curious as to why it occurs. We believe the cause and effect of pre- and post-normalization need to be further analyzed in future work in this area.

6 Conclusion

The Keyword-MLP has shown itself to be an efficient solution to keyword spotting. Moreover, the implication of this work is that self-attention is not strictly necessary in the keyword spotting field. We hope that our work helps future researchers consider the much simpler gated MLP as an alternative approach when working with self-attention-based Transformers in speech and other related domains.


References

  • [1] A. Berg, M. O’Connor, and M. T. Cruz (2021) Keyword Transformer: a self-attention model for keyword spotting. In Proc. Interspeech 2021, pp. 4249–4253.
  • [2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229.
  • [3] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf (2018) A neural attention model for speech command recognition. arXiv preprint arXiv:1808.08929.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • [7] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio Set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.
  • [8] Y. Gong, Y. Chung, and J. Glass (2021) AST: audio spectrogram transformer. arXiv preprint arXiv:2104.01778.
  • [9] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [11] A. T. Liu, S. Li, and H. Lee (2021) TERA: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 2351–2366.
  • [12] H. Liu, Z. Dai, D. R. So, and Q. V. Le (2021) Pay attention to MLPs. arXiv preprint arXiv:2105.08050.
  • [13] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • [14] S. Majumdar and B. Ginsburg (2020) MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition. arXiv preprint arXiv:2004.08531.
  • [15] L. Melas-Kyriazi (2021) Do you even need attention? A stack of feed-forward layers does surprisingly well on ImageNet. arXiv preprint arXiv:2105.02723.
  • [16] G. Mordido, M. Van Keirsbilck, and A. Keller (2021) Compressing 1D time-channel separable convolutions using sparse random ternary matrices. arXiv preprint arXiv:2103.17142.
  • [17] D. Neimark, O. Bar, M. Zohar, and D. Asselmann (2021) Video transformer network. arXiv preprint arXiv:2102.00719.
  • [18] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
  • [19] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo (2020) Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720.
  • [20] R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
  • [21] C. Teacher, H. Kellett, and L. Focht (1967) Experimental, limited vocabulary, speech recognizer. IEEE Transactions on Audio and Electroacoustics 15 (3), pp. 127–130.
  • [22] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, D. Keysers, J. Uszkoreit, M. Lucic, et al. (2021) MLP-Mixer: an all-MLP architecture for vision. arXiv preprint arXiv:2105.01601.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [24] R. Vygon and N. Mikhaylovskiy (2021) Learning efficient representations for keyword spotting with triplet loss. arXiv preprint arXiv:2101.04792.
  • [25] P. Warden (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
  • [26] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32.
  • [27] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan (2021) Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986.
  • [28] C. Zhang and K. Koishida (2017) End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pp. 1487–1491.