End-to-end Models with auditory attention in Multi-channel Keyword Spotting

by   Haitong Zhang, et al.

In this paper, we propose an attention-based end-to-end model for multi-channel keyword spotting (KWS), which is trained to optimize the KWS result directly. As a result, our model outperforms the baseline model with signal pre-processing techniques on both the clean and noisy testing data. We also find that multi-task learning results in better performance when the training and testing data are similar. Transfer learning and multi-target spectral mapping can dramatically enhance the robustness to noisy environments. At 0.1 false alarm (FA) per hour, the model with transfer learning and multi-target mapping gains an absolute 30% improvement on the noisy data with an SNR of about -20 dB.





1 Introduction

Keyword spotting (KWS) is the task of detecting a pre-defined keyword in a continuous stream of speech. KWS has recently drawn increasing attention since it is used to detect wake-up words on mobile devices. In this case, the KWS model should satisfy the requirements of high accuracy, low latency, and a small footprint.

Methods based on large vocabulary continuous speech recognition (LVCSR) systems are used to process the audio offline [1]. They generate rich lattices and search them for the keyword. Due to their high latency, these methods are not suitable for mobile devices. Another competitive technique for KWS is the keyword/filler Hidden Markov Model (HMM) [2]. HMMs are trained separately for the keyword and non-keyword segments. At runtime, a Viterbi search is needed to find the keyword, which can be computationally expensive depending on the HMM topology.

With the success of neural networks (NN) in automatic speech recognition (ASR), both keyword and non-keyword audio segments are used to train the same NN-based acoustic model. For example, in the Deep KWS model, a single DNN is used to output the posterior probability of the sub-segments of the keyword, and the KWS decision is made based on a confidence score obtained by posterior smoothing. To improve the performance, more powerful neural networks such as convolutional neural networks (CNN) and recurrent neural networks (RNN) [5] are used to substitute for the DNN. Inspired by these models, some end-to-end models have been proposed to directly output the probability of the whole keyword instead of sub-words, without any searching method or posterior handling [6, 7, 8].

Although tremendous improvement has been made, the previous models mainly focus on single-channel KWS. In industry, microphone arrays are usually used to handle more complicated situations, so signal processing techniques must be applied to convert the multi-channel signal into a single channel. However, these pre-processing techniques are sub-optimal because they are not optimized towards the final goal of interest [9]. There is extensive literature on learning useful representations for multi-channel input in speech recognition. For example, [10] concatenates the multi-channel signals to form the network input. In [11], a CNN is used to implicitly explore the spatial relationship between multiple channels. Attention-based methods have also been proposed to model auditory attention in multi-channel speech recognition [12].

Inspired by the application of the attention mechanism in ASR [12], we propose an attention-based end-to-end model for multi-channel KWS. Compared with [12], our attention mechanism is computationally cheaper, which makes it more suitable for the KWS task. Transfer learning and multi-target spectral mapping are incorporated in the model to achieve a better result on the noisy evaluation data.

We describe the proposed model in Section 2. The experiment data, setup, and results follow in Section 3. Section 4 closes with the conclusion.

Figure 1: The proposed model architecture.

2 The Proposed Model

As illustrated in Fig. 1, the model consists of three main components: (i) the attention mechanism, (ii) sequence-to-sequence training, and (iii) decoding smoothing.

2.1 Attention Mechanism

The attention mechanism we use is soft attention, as proposed in [13]. For each time-step $t$, we compute a 6-dimensional attention weight vector $\alpha_t$ as follows:

$$e_t = \tanh(x_t W + b)\, v$$

$$\alpha_t = \mathrm{softmax}(e_t)$$

$$\hat{x}_t = \sum_{i=1}^{6} \alpha_t^{i}\, x_t^{i}$$

where $x_t$ is a $6 \times 40$ input feature matrix, $W$ is a $40 \times 128$ weight matrix, $b$ is a 128-dimension bias vector, and $v$ is a 128-dimension vector. A softmax function is applied for normalization. $\hat{x}_t$ is the weighted sum of the multi-channel inputs $x_t^{i}$.
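The per-frame attention step can be sketched in plain Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes the additive (tanh) scoring form commonly used for soft attention, takes the shapes from the text (6 channels, 40 features, 128 hidden units), and uses random values for the hypothetical parameters `W`, `b`, and `v`.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def channel_attention(x_t, W, b, v):
    """x_t: C x F (channels x features), W: F x H, b: H, v: H.

    Returns one attention weight per channel and the weighted sum
    of the channel features (the enhanced single-channel frame).
    """
    scores = []
    for channel in x_t:
        hidden = [
            math.tanh(sum(channel[f] * W[f][h] for f in range(len(channel))) + b[h])
            for h in range(len(b))
        ]
        scores.append(sum(v[h] * hidden[h] for h in range(len(v))))
    weights = softmax(scores)
    enhanced = [
        sum(weights[c] * x_t[c][f] for c in range(len(x_t)))
        for f in range(len(x_t[0]))
    ]
    return weights, enhanced

# Assumed shapes from the text: 6 channels, 40 filter-bank features, 128 units.
rng = random.Random(0)
C, F, H = 6, 40, 128
x_t = [[rng.gauss(0, 1) for _ in range(F)] for _ in range(C)]
W = [[rng.gauss(0, 0.1) for _ in range(H)] for _ in range(F)]
b = [0.0] * H
v = [rng.gauss(0, 0.1) for _ in range(H)]
weights, enhanced = channel_attention(x_t, W, b, v)
```

Because the score of each channel depends only on that channel's own 40 features, the cost per frame grows linearly with the number of channels, which is in line with the paper's aim of a computationally cheap attention mechanism.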

2.2 Sequence to sequence training

The training framework we use is sequence-to-sequence. The encoder learns higher-level representations of the enhanced speech features produced by the attention mechanism. With a linear transformation and a softmax function, the probability of the whole keyword can be predicted at each frame.

(a) The first situation for recording.
(b) The second situation for recording.
Figure 2: Two situations for recording the noisy testing data in a 4 m × 4 m × 3.5 m laboratory, where N denotes noise and R denotes the recording device.

Multi-task Learning. To improve the performance, we take spectral mapping as an auxiliary task for our KWS model. Inspired by [14], the model learns the nonlinear mapping between the multi-channel speech features and the single-channel speech features. The target speech features come from the traditional signal processing techniques. Although we argue against separating the front-end signal processing techniques from the acoustic model, we conjecture that the multi-task framework can still improve the performance. The loss function combines the KWS loss with the spectral mapping loss.


Transfer Learning. We also adopt transfer learning to improve the performance in the noisy environment. Transfer learning [15] refers to initializing the model parameters with the corresponding parameters of a trained model. Here we initialize the network using the proposed multi-channel KWS model trained with the relatively clean data, and fine-tune the model with only noisy data.

Multi-target Mapping. Since it is difficult to train the model with only noisy data, we propose multi-target spectral mapping. We conjecture that with more mapping targets, the spectral mapping can converge better than when learning only the nonlinear relationship between the noisiest input and the cleanest output. Compared with the spectral mapping mentioned above, two extra mapping targets are involved in training (detailed in sub-section 3.1). The loss function is a weighted sum of the KWS loss and the three spectral mapping losses:

$$L = \lambda_0 L_{\mathrm{KWS}} + \sum_{i=1}^{3} \lambda_i L_{\mathrm{map}}^{(i)}$$

with the constraint that $\sum_{i=0}^{3} \lambda_i = 1$.
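A weighted multi-target loss of this kind can be sketched as below. This is an illustration under stated assumptions, not the authors' formulation: it assumes a mean-squared-error mapping loss, one predicted feature vector per mapping target, and that the first weight belongs to the KWS loss. The function names and the weight ordering are hypothetical; only the four values 0.5, 0.2, 0.2, and 0.1 come from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multi_target_loss(kws_loss, predictions, targets, weights):
    """Weighted sum of the KWS loss and one mapping loss per target.

    weights[0] scales the KWS loss (assumed ordering); weights[1:]
    scale the mapping losses. The weights must sum to 1.
    """
    if len(weights) != len(targets) + 1 or len(predictions) != len(targets):
        raise ValueError("need one prediction and one weight per target, plus a KWS weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = weights[0] * kws_loss
    for w, pred, tgt in zip(weights[1:], predictions, targets):
        total += w * mse(pred, tgt)
    return total

# Toy values: three mapping targets, weights as reported in the experiment setup.
predictions = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
targets = [[1.0, 2.0], [1.0, 3.0], [2.0, 2.0]]
loss = multi_target_loss(0.4, predictions, targets, (0.5, 0.2, 0.2, 0.1))
```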

2.3 Decoding

When decoding, our model takes a feature matrix as input and outputs the keyword spotting probability at each frame. We adopt a posterior probability smoothing method, and the final decision is made based on the probability averaged over a window of frames.
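The smoothing step can be illustrated with a trailing moving average followed by a threshold test. This is a sketch under assumptions: the window length, the threshold, and the function names are hypothetical, and the paper does not specify the exact smoothing variant.

```python
def smooth_posteriors(probs, window):
    """Trailing moving average of per-frame keyword probabilities."""
    smoothed = []
    running = 0.0
    for i, p in enumerate(probs):
        running += p
        if i >= window:
            running -= probs[i - window]  # drop the frame leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

def detect_keyword(probs, window, threshold):
    """Fire when any smoothed probability reaches the threshold."""
    return any(s >= threshold for s in smooth_posteriors(probs, window))

# A sustained run of high probabilities fires; an isolated spike does not.
sustained = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9, 0.2]
spike = [0.1, 0.9, 0.1, 0.1]
fired_sustained = detect_keyword(sustained, window=3, threshold=0.8)
fired_spike = detect_keyword(spike, window=3, threshold=0.8)
```

Averaging over several frames trades a small amount of latency for fewer false alarms triggered by single-frame spikes.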

3 Experiments

3.1 Datasets

The training data consists of 240k utterances of the keyword (120k with echo and 120k without echo) and 200 hours of negative examples, with 10% of them used for validation. The evaluation data includes 50 hours of filler data and 48k keyword utterances (50% with echo and 50% without). We also record 1k noisy keywords, as Fig. 2 illustrates. They consist of two equal parts, recorded in two situations. The first half (referred to as hard-noisy data, with an average SNR of about -20 dB) is recorded as shown in Fig. 2(a), where the recording device is close to the music noise and the speaker is 3 meters away from the device. The other half (referred to as easy-noisy data, with an average SNR of about -18 dB) is recorded as Fig. 2(b) shows. The distance between the device and the speaker remains unchanged, but the distance between the device and the music source is 1 meter.

Besides that, we also recorded 50 hours of music for the multi-target mapping experiment. In the experiment, we randomly add music to the 120k non-echo keywords and 200 hours of negative examples to generate the noisy training data. Algorithm 1 shows the procedure of creating multiple mapping targets.

3.2 Experiment setups

In the baseline single-channel KWS model, the front-end component mainly includes beamforming and acoustic echo cancellation (AEC). These two blocks are constructed as proposed in [16] and [17], respectively.

The input feature in all the experiments is the trainable PCEN [18]. The 40-dimension filter-bank features are extracted using a window of 25 ms with a shift of 10 ms. The encoder in the experiments is two GRU (Gated Recurrent Unit) layers [19] and one fully-connected (FC) layer. Both the GRU and FC layers have 128 units, with a 0.9 dropout rate. In the multi-task models, the two tasks use two separate FC layers. The Adam optimizer [20] is used to update the training parameters, with a batch size of 64 and an initial learning rate of 0.001. The loss weights in the spectral mapping are 0.5, 0.2, 0.2, and 0.1, respectively, since they perform reasonably well on the development data.
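For readers unfamiliar with PCEN (per-channel energy normalization), the following sketch applies the commonly published form of the transform to filter-bank energies: a first-order smoother per frequency channel, followed by gain normalization and root compression. It is an illustration under assumptions: the parameter values are illustrative defaults rather than the values used in the experiments, the smoother is initialized with the first frame, and in the trainable variant the parameters would be learned rather than fixed.

```python
def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a T x F list of non-negative filter-bank energies.

    M[t] = (1 - s) * M[t-1] + s * E[t]
    PCEN[t] = (E[t] / (eps + M[t]) ** alpha + delta) ** r - delta ** r
    """
    smoother = list(energies[0])  # assumption: start the smoother at the first frame
    out = []
    for frame in energies:
        smoother = [(1 - s) * m + s * e for m, e in zip(smoother, frame)]
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoother)
        ])
    return out

# Five frames of constant energy over 40 filter-bank channels.
features = pcen([[1.0] * 40 for _ in range(5)])
```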

(a) ROC for the Non-echo data
(b) ROC for the Echo data
(c) ROC for the Easy-noisy data
(d) ROC for the Hard-noisy data
Figure 3: The results of the baseline model and the proposed models, with the smoothing frame .
Algorithm 1: Procedure for creating noisy training data in the multi-target spectral mapping
for each multi-channel wav a in all wavs do
    select one music clip b randomly
    add b into a at an SNR of about -10 dB (Input)
    convert a into single-channel c (Target 1)
    convert b into single-channel d
    add d into c at an SNR of about +5 dB (Target 2)
    add d into c at an SNR of about +10 dB (Target 3)
    return Input, Target 1, Target 2, Target 3
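The procedure in Algorithm 1 can be sketched in Python as follows. Several parts are assumptions made for illustration: signals are plain lists of samples, the music clip is already trimmed to the length of the utterance, and the multi-channel to single-channel conversion is replaced by simple channel averaging, whereas the paper uses its signal processing front end. All function names are hypothetical.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a single-channel signal."""
    return sum(s * s for s in signal) / len(signal)

def noise_gain(speech, noise, snr_db):
    """Gain g such that 10*log10(P_speech / P_(g*noise)) equals snr_db."""
    return math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))

def to_single_channel(multi):
    """Hypothetical stand-in for the front end: average across channels."""
    return [sum(frame) / len(multi) for frame in zip(*multi)]

def make_example(wav, music_clips, rng):
    """Build the noisy input and the three mapping targets for one utterance."""
    music = rng.choice(music_clips)
    flat_wav = [s for ch in wav for s in ch]
    flat_music = [s for ch in music for s in ch]
    g = noise_gain(flat_wav, flat_music, -10.0)
    noisy_input = [[s + g * n for s, n in zip(ch, m)] for ch, m in zip(wav, music)]
    c = to_single_channel(wav)    # Target 1: clean single-channel speech
    d = to_single_channel(music)  # single-channel music
    targets = [c]
    for snr_db in (5.0, 10.0):    # Targets 2 and 3
        g = noise_gain(c, d, snr_db)
        targets.append([s + g * n for s, n in zip(c, d)])
    return noisy_input, targets

# Toy data: one 6-channel utterance and three 6-channel music clips.
rng = random.Random(0)
wav = [[rng.gauss(0, 1) for _ in range(400)] for _ in range(6)]
clips = [[[rng.gauss(0, 1) for _ in range(400)] for _ in range(6)] for _ in range(3)]
noisy_input, targets = make_example(wav, clips, rng)
```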

3.3 Impact of Attention mechanism

We first evaluate the performance of the baseline model and the proposed multi-channel models. The baseline model uses the signal processing techniques in sub-section 3.2. The input of the signal processing techniques is seven channels, while the input of the proposed model (i.e. Attention) is only six channels, without the reference signal for AEC.

Attention outperforms the baseline model on all the evaluation data sets. At 0.5 false alarm (FA) per hour, Attention gains absolute improvements of 4% and 7% on the non-echo and echo data, respectively (Fig. 3(a) and Fig. 3(b)).

The difference in performance becomes larger on the noisy data (Fig. 3(c) and Fig. 3(d)). The performance improvements are 40% and 60% on the hard-noisy and easy-noisy data, respectively. This large difference may be attributed largely to the fact that the signal processing techniques are not robust to noisy environments, especially when the noise is close to the wake-up device.
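To make the operating point concrete, the sketch below computes the false alarms per hour and the false reject rate for one decision threshold. It is an illustration only: it assumes each score is the peak smoothed probability of one keyword utterance or one candidate segment of filler audio, and the function name and toy scores are hypothetical.

```python
def operating_point(keyword_scores, filler_scores, filler_hours, threshold):
    """Return (false alarms per hour, false reject rate) at a threshold."""
    false_rejects = sum(1 for s in keyword_scores if s < threshold)
    false_alarms = sum(1 for s in filler_scores if s >= threshold)
    return false_alarms / filler_hours, false_rejects / len(keyword_scores)

# Toy scores: four keyword utterances and three filler segments from 2 hours of audio.
fa_per_hour, frr = operating_point(
    keyword_scores=[0.9, 0.8, 0.4, 0.95],
    filler_scores=[0.1, 0.85, 0.3],
    filler_hours=2.0,
    threshold=0.5,
)
```

Sweeping the threshold and plotting the two values against each other yields one ROC curve; comparing models at a fixed false alarm rate, such as 0.5 per hour, then amounts to comparing their false reject rates at that point.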

3.4 Impact of multi-task learning

As indicated in Fig. 3(a) and Fig. 3(b), the proposed model with spectral mapping (i.e. Mapping) slightly outperforms Attention, which to some degree confirms our conjecture. However, the result gets worse on the noisy data (Fig. 3(c) and Fig. 3(d)). This gap is explained by the mismatch between the training data and the noisy testing data.

3.5 Impact of transfer learning and multi-target mapping

To increase the noise-robustness of the model, we initialize the model with the parameters of Attention and fine-tune it with the artificial noisy training data (detailed in sub-section 3.1). As shown in Fig. 3(a) and Fig. 3(b), all the models with transfer learning (i.e. Transfer, Transfer_map, and Tran_Multi_Map) perform worse than Attention on both the non-echo and echo testing data. The reason lies in the difference between the training data and the testing data. However, the main target of transfer learning and multi-target mapping is the noisy data. As illustrated in Fig. 3(c) and Fig. 3(d), transfer learning and single-target spectral mapping do not yield a better result than Attention, which confirms the difficulty of training the model with only noisy data. In contrast, the model with transfer learning and multi-target mapping (i.e. Tran_Multi_Map) outperforms all the other models by a large margin on the noisy data. At 0.5 false alarm per hour, Tran_Multi_Map gains absolute improvements of 30% and 10% over Attention on the hard-noisy and easy-noisy data, respectively.

4 Conclusions

Without the reference signal for AEC, the proposed attention-based model for multi-channel KWS outperforms the baseline model on all the testing data. With spectral mapping, the performance improves slightly when the training data and testing data are similar. In addition, transfer learning and multi-target spectral mapping can enhance the model’s robustness to noisy environments, which sheds light on NN-based speech enhancement in ASR.

5 Acknowledgements

The authors would like to thank the colleagues from the acoustic group for useful discussions.