Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks

Théodore Bluche et al. ∙ 02/25/2020

We explore a keyword-based spoken language understanding system, in which the intent of the user can directly be derived from the detection of a sequence of keywords in the query. In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model. We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords, without training data specific to those keywords. The model, based on a quantized long short-term memory (LSTM) neural network, trained with connectionist temporal classification (CTC), weighs less than 500KB. Our approach takes advantage of some properties of the predictions of CTC-trained networks to calibrate the confidence scores and implement a fast detection algorithm. The proposed system outperforms a standard keyword-filler model approach.


I Introduction

Automatic speech recognition (ASR) systems have recently come close to human recognition performance [38], allowing voice assistants (Alexa, Google Assistant, Siri), vocal interfaces and other spoken language understanding (SLU) systems to flourish. However, to achieve such performance, most “ask-me-anything” voice assistants run large-vocabulary continuous speech recognition (LVCSR) models, demanding a lot of resources and computing power. Therefore, most of the processing is performed in the cloud, inducing privacy concerns and latency issues. When SLU is limited to a specific number of tasks, in a closed-ontology setting (e.g. with a task-specific language model [24]), the inference can be performed on device. Recently, generic ASR models running on mobile devices have also been proposed [14]. In both cases, the full systems weigh more than 100MB, which remains too large for the small devices typical of IoT applications, where memory and computing power are scarce.

We target “mini-SLU” scenarios, in which the detection of simple keywords in the query is sufficient to convey its meaning. In such a system, the user should be able to speak in natural language to trigger an action based on the keywords, as illustrated in Fig. 1. For this system to be practical and easy to adapt to any use-case, it should handle situations where the set of keywords is not known in advance, allowing the user to define their own interactions based on custom keywords. This also implies that no keyword-specific training data is available.

We present in this paper a keyword spotting (KWS) system designed to be small enough to fit on micro-controllers, i.e. to weigh less than 500KB. The development of tiny KWS models is an active area of research, mainly focusing on the detection of wake words or of a pre-defined set of commands allowing single-word interactions. For these applications, it is reasonably feasible to collect a training dataset labeled at the keyword level (e.g. the Google Speech Commands [34] or “Hey Snips” [9] datasets). These works focus mainly on the neural network architecture (feed-forward [5], convolutional [26], residual [31], recurrent [10, 29] neural networks, WaveNet [9]) or on compression methods [32, 41, 1]. The networks are usually trained at the frame level using the cross-entropy loss. Other choices of losses, such as the connectionist temporal classification (CTC) [11, 10] or a max-pooling loss [29], have also been proposed. Although they have an attractive formulation, since the neural network directly predicts the confidence at the keyword level, these methods are not suited to the scenario we explore, because they require knowing the set of keywords in advance, as well as a specific training set made of these keywords.

Fig. 1: Mini-SLU system based on keyword spotting. The user says a query in natural language. The system performs an action based on the detection of keywords in the query.

Historically, the approach of modeling the keyword directly to score segments of audio [4, 37] evolved into acoustic KWS, mainly based on hidden Markov models (HMMs). These methods take advantage of the modeling of sub-word units (e.g. phones) by the HMMs to enable building acoustic models of any arbitrary keyword, requiring only generic ASR training data. To cope with the issue of scoring and comparing acoustic segments of different lengths, these approaches generally involve a “filler model” of speech segments outside the keyword [22, 23, 35, 30] (e.g. an ergodic phone HMM). A background model may be applied to compute the likelihood ratio between keyword and generic speech [36]. A survey on acoustic KWS can be found in [19].

Similarly to what has been done in ASR, HMMs have been replaced with neural networks that predict the phone or grapheme posteriors directly, for example with CTC training [15, 17, 7]. Since the network predicts phone posteriors, the filler model may either be ignored, because it will always give a probability of one, be replaced by the greedy prediction of the network [15], or be augmented with a phone language model [13]. When the neural network is very small, it tends to make phoneme or grapheme prediction errors. A system can take advantage of the network predicting phonemes or graphemes by augmenting the keyword set with alternative pronunciations, either estimated from the training set [39] or from examples spoken by the user [18]. Using the knowledge about the confusions of the network along with the peaky behaviour of CTC-trained networks, an efficient detection can be implemented based on a minimum edit distance search of the keywords in a compact phone lattice [42].

Recently, a few end-to-end neural networks for open-vocabulary KWS have been proposed. They rely on embedding the audio on the one hand and the keyword phone sequence on the other hand in a common vector space, followed by a detection decision based either on the distance between the two [25] or on a neural network [2]. However, these last two methods seem to be applicable only to single-keyword queries, which do not fit our mini-SLU scenario.

In this paper, we explore a method similar to [15], based on a CTC-trained neural network made of long short-term memory (LSTM) layers. We particularly focus on making the system as small and fast as possible, to detect sequences of keywords in natural language. In particular, we compare different model sizes, choices of confidence scores, and optimizations of the decoding procedure. We evaluate the models on two crowd-sourced datasets for the task of mini-SLU. We compare the proposed approach to several baselines: LVCSR approaches based on Viterbi decoding and lattices, keyword-filler models, and other methods proposed for CTC-trained neural networks. We show that we can achieve good results with models smaller than 500KB, outperforming the keyword-filler method. The final model can run in real time on micro-controllers.

The main contributions of this paper are the following:

  • We propose a quantization strategy for LSTM networks.

  • We devise a confidence score adapted to the particularities of the outputs of CTC-trained networks.

  • We propose a fast decoding strategy, which, besides pruning, is made faster by skipping frames, using ideas similar to [6, 42].

  • We carry out a comprehensive comparison of different design choices.

The remainder of the paper is organized as follows. In Section II we present the acoustic model neural network: its architecture and how it is quantized and trained. We explain in Section III our keyword spotting mechanism, including the confidence score settings and the various optimizations we explored. In Section IV we describe the experimental setup, including the datasets and metrics we used. The results of our exploration are reported in Section V.

II Acoustic Model

The method we present in this paper relies on a trained acoustic model. The main requirements for this model are: (i) it has to be small, to fit on tiny devices such as micro-controllers, (ii) it has to be compatible with streaming recognition, to enable real-time KWS, and (iii) it should be accurate enough for keywords to be detectable from its output.

In this section, we present the chosen architecture. We build a multi-layer LSTM neural network [27], described in Section II-A. In order to keep the whole model under 500KB, we quantize the parameters and intermediate activations of the neural network. The chosen quantization scheme is presented in Section II-B. The model is directly trained on a generic ASR dataset with connectionist temporal classification (CTC) [11], as presented in Section II-C.

II-A Stack&Skip LSTM architecture

The inputs of the networks are sequences of stacks of consecutive MFCC frames, computed with a fixed frame skip [27]. The networks consist of a first affine layer with a non-linear activation, followed by a stack of LSTM layers. The output $h_t$ of an LSTM layer at time $t$, for the input sequence $(x_1, \dots, x_T)$, is computed as follows:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (2)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (3)
$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$ (4)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)

where $W_*$, $U_*$ and $b_*$ are the free parameters of the LSTM, $i_t$, $f_t$, $o_t$ and $g_t$ are respectively the input, forget and output gates and the cell inputs, and $c_t$ is the internal state. An affine output layer is added on top of the last LSTM to compute the class logits: one for each phone, plus one representing a “blank” (or null) class.
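
To make the input pipeline concrete, the following minimal NumPy sketch stacks consecutive MFCC frames and subsamples the resulting sequence. It is an illustration only: the stack size and skip of 3 are hypothetical values, since the exact settings are not specified above.

  import numpy as np

  def stack_and_skip(mfcc, stack=3, skip=3):
      """Stack `stack` consecutive MFCC frames and keep one stacked frame
      every `skip` input frames (illustrative values, not the paper's)."""
      T, d = mfcc.shape
      n = (T - stack) // skip + 1
      return np.stack([mfcc[i * skip:i * skip + stack].reshape(-1) for i in range(n)])

  feats = np.random.randn(100, 13)    # 100 frames of 13-dim MFCCs
  print(stack_and_skip(feats).shape)  # -> (33, 39): 39-dim inputs at a third of the frame rate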

II-B Quantization Scheme

There has been some work on LSTM quantization, but either not explicit regarding the quantization scheme [20, 33] or only partially quantizing the LSTM for inference, with the internal state [12] or the whole LSTM [40] kept in floating-point. In this section, we explain the quantization scheme we propose for LSTM layers.

All weights and activations are quantized to 8 bits, following a scheme similar to the one proposed by Jacob et al. [16]. We use a special case of that scheme, with symmetric quantization ranges and power-of-two bounds. This choice simplifies the computation, thanks to the absence of offset and to the changes of scale being implemented as bit shifts.

The weights are quantized post-training. The range is set to the power of two immediately above the maximum absolute value of each weight matrix. To avoid the side effects of large outlier weights, the weights are first clipped to a fixed range, ensuring a minimum precision once quantized.

The activations are quantized during training. Instead of computing the quantization range from min/max statistics, we use fixed ranges. This choice is motivated by several reasons. First, the LSTM contains saturating activation functions, so we know a priori that their outputs will lie in $[-1, 1]$. Moreover, we may set a fixed range for their inputs, since large values will fall in the saturating part anyway; we therefore fix that range to a power-of-two bound as well. Using the same fixed range for all activation function inputs and outputs allows us to use a single lookup table at inference for the sigmoid and tanh functions, making the model faster to execute. The second motivation comes from the fact that the LSTM contains many additions, which are easier to implement if the operands have the same quantization parameters. Finally, the inner state is not bounded, since it can increase by one at each time step. If it were quantized using min/max statistics, we might lose a lot of precision.

The equations of the computation of the LSTM inner state and output are modified as follows:

$i_t = \sigma\big(q_a(W_i x_t + U_i h_{t-1} + b_i)\big)$ (7)
$f_t = \sigma\big(q_a(W_f x_t + U_f h_{t-1} + b_f)\big)$ (8)
$o_t = \sigma\big(q_a(W_o x_t + U_o h_{t-1} + b_o)\big)$ (9)
$g_t = \tanh\big(q_a(W_g x_t + U_g h_{t-1} + b_g)\big)$ (10)
$c_t = q_c\big(f_t \odot c_{t-1} + i_t \odot g_t\big)$ (11)
$h_t = q_{[-1,1]}\big(o_t \odot \tanh(q_a(c_t))\big)$ (12)

where $q_{[-r,r]}$ represents the quantization of the values in the range $[-r, r]$, $q_a$ the quantization to the fixed activation-input range, and $q_c$ the quantization of the internal state to its fixed range. During training, we use floating-point quantized values with the following fake quantization operator:

$\Delta_{[-r,r]} = \frac{2r}{2^8} = \frac{r}{128}$ (13)
$q_{[-r,r]}(x) = \Delta_{[-r,r]} \cdot \mathrm{round}\!\left(\frac{\mathrm{clip}(x, -r, r - \Delta_{[-r,r]})}{\Delta_{[-r,r]}}\right)$ (14)

where $\mathrm{round}(\cdot)$ is the rounding operation.
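
As an illustration of this scheme, here is a small NumPy sketch of a symmetric fake quantization operator with a power-of-two bound; the function name, the handling of the top quantization level and the example values are hypothetical, not the actual implementation.

  import numpy as np

  def fake_quantize(x, bound, num_bits=8):
      """Simulate symmetric quantization to [-bound, bound) with a power-of-two
      bound, returning floating-point values snapped to the quantization grid."""
      assert bound > 0 and np.log2(bound) % 1 == 0, "bound must be a power of two"
      step = 2.0 * bound / 2 ** num_bits          # quantization step
      x = np.clip(x, -bound, bound - step)        # symmetric range, top level reserved
      return np.round(x / step) * step            # snap to the grid

  # Example: activations quantized to the fixed range [-1, 1].
  acts = np.array([-1.2, -0.013, 0.4999, 0.97])
  print(fake_quantize(acts, bound=1.0))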

II-C Training

The acoustic model is trained with the CTC loss, an end-to-end training method that does not require aligning the data prior to training. The goal is to minimize the following loss:

$\mathcal{L}_{CTC} = - \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log p(\mathbf{y} \mid \mathbf{x})$ (15)

where $\mathcal{D}$ is a dataset of audio feature vector sequences $\mathbf{x}$ with the corresponding phone target sequences $\mathbf{y}$.

To deal with the fact that the phone sequence and the inputs have different lengths, CTC extends the phone alphabet $\mathcal{A}$ with a so-called “blank” class $\varnothing$: $\mathcal{A}' = \mathcal{A} \cup \{\varnothing\}$, and defines a simple mapping $\mathcal{B}$ that removes symbol repetitions and blanks. For example:

$\mathcal{B}(\mathrm{a\ a\ \varnothing\ a\ b\ b}) = \mathrm{a\ a\ b}$

Then we can build the set of all label sequences of a given length $T$ that yield a given phone sequence $\mathbf{y}$ through $\mathcal{B}$:

$\mathcal{B}_T^{-1}(\mathbf{y}) = \{ \pi \in \mathcal{A}'^T : \mathcal{B}(\pi) = \mathbf{y} \}$

With a conditional independence assumption on the labels, the posterior phone sequence probability can be rewritten as

$p(\mathbf{y} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}_T^{-1}(\mathbf{y})} \prod_{t=1}^{T} p_t(\pi_t \mid \mathbf{x})$ (16)

where $p_t(k \mid \mathbf{x})$ corresponds to the $k$-th output of the acoustic model at time $t$, allowing us to compute and minimize the CTC loss with gradient descent.
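
To make the mapping $\mathcal{B}$ and Eq. 16 concrete, the following self-contained Python sketch collapses label paths and computes $p(\mathbf{y} \mid \mathbf{x})$ by brute-force enumeration on a toy posterior matrix. The values are illustrative only; in practice the loss is computed with the usual CTC forward-backward recursion rather than by enumeration.

  import itertools
  import numpy as np

  BLANK = "_"

  def collapse(path):
      """CTC mapping B: remove symbol repetitions, then blanks."""
      out, prev = [], None
      for s in path:
          if s != prev and s != BLANK:
              out.append(s)
          prev = s
      return tuple(out)

  # Toy posteriors: T=4 frames over the classes (blank, 'a', 'b'), rows sum to 1.
  labels = [BLANK, "a", "b"]
  probs = np.array([[0.6, 0.3, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.7, 0.1, 0.2],
                    [0.2, 0.1, 0.7]])
  target = ("a", "b")

  # Eq. 16: sum over all length-T paths that collapse to the target sequence.
  p_target = sum(
      np.prod([probs[t, k] for t, k in enumerate(path)])
      for path in itertools.product(range(len(labels)), repeat=len(probs))
      if collapse([labels[k] for k in path]) == target
  )
  print(f"p(ab | x) = {p_target:.4f}")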

Layers Units Num. Params Model size Quantized size
3 64 115k 460kB 115kB
5 64 181k 724kB 181kB
3 96 246k 984kB 246kB
5 96 395k 1.5MB 395kB
3 128 427k 1.7MB 427kB
TDNN-LSTM 2.6M 10.5MB -
TABLE I: Size of the base acoustic models evaluated in this paper, with different number of layers and of units on each layer. The quantized networks use one byte per parameter.

The weights are quantized after training. To make sure that the quantization range is not too big, and to keep enough precision on the weights, we add an $\ell_2$-regularization loss with a small weight decay coefficient. The quantized LSTM implementation is quite slow compared to the cuDNN LSTM implementation. Therefore, we start by training a model without quantization for 40 epochs using the cuDNN implementation. We then activate the fake quantization of the activations and train for an additional four epochs.

To measure the impact of the size of the acoustic model, we train several neural networks of different sizes (Table I). All tested networks have fewer than 500k parameters and weigh less than 500kB in their quantized form. Very small networks with 115k to 250k parameters were also evaluated, as they are comparable in size to modern KWS neural networks trained with keyword-specific datasets [5, 26]. Finally, we also compare these models to a time-delay and LSTM neural network (TDNN-LSTM) hybrid NN/HMM model with tied biphone states, trained with Kaldi using the lattice-free MMI objective [24]. This model is not quantized and has about 2.6M parameters.

III Keyword Spotting Method

We build an ASR-based keyword spotting method, similar to other existing CTC-based approaches [15, 17, 7]. The goal is to search for the keyword phone sequence in the predictions of the acoustic model. We compute a confidence score for every keyword in all segments of the prediction sequence, and then search for the best keyword sequence. We present the keyword detection method in Section III-A, the search for the keyword sequence in Section III-B, the confidence scores explored in Section III-C and some optimizations in Section III-D.

III-A Keyword detection

In LVCSR, given an acoustic model and a language model, a search for the most likely sequence of words is carried out. In a keyword spotting approach, most of the words occurring in utterances are unknown, and the goal is to detect specific words or short phrases. Some methods based on LSTM and CTC are close to the LVCSR approach, where a filler model is inserted to replace all the unknown words [15, 13, 7].

Here we adopt a different strategy. We consider all segments $[t_s, t_e]$, where $1 \le t_s \le t_e \le T$, in the prediction sequence of length $T$. For every such segment and each keyword $w$, we compute a confidence score $c_w(t_s, t_e)$ (cf. Section III-C). From these, we build a set of detection candidates for a threshold $\theta$:

$\mathcal{D}_\theta = \{ (w, t_s, t_e) : c_w(t_s, t_e) \geq \theta \}$ (17)

We implemented a trie-based decoding, slightly better in complexity than scoring all segments and keywords separately. The set of keywords is converted to a prefix trie of the pronunciations. The decoding is implemented as a token passing algorithm. At each time step $t$, a new token is inserted at the root of the trie. All existing tokens are propagated based on the predictions of the network at time $t$. A new candidate is created for each token in a terminal node if the confidence score exceeds the threshold $\theta$. We discuss how we improve the complexity in Section III-D.
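
As an illustration of the first step of this decoder, here is a minimal Python sketch that builds the pronunciation prefix trie; the keyword lexicon and phone symbols are hypothetical, and the token-passing propagation itself is omitted.

  class TrieNode:
      """Node of the pronunciation prefix trie; terminal nodes carry keyword names."""
      def __init__(self):
          self.children = {}   # phone -> TrieNode
          self.keywords = []   # keywords whose pronunciation ends here

  def build_pronunciation_trie(lexicon):
      """lexicon: dict mapping a keyword to its list of pronunciations (phone sequences)."""
      root = TrieNode()
      for keyword, prons in lexicon.items():
          for pron in prons:
              node = root
              for phone in pron:
                  node = node.children.setdefault(phone, TrieNode())
              node.keywords.append(keyword)
      return root

  # Hypothetical lexicon: "turn on" and "turn off" share the prefix "t er n".
  lexicon = {
      "turn on":  [["t", "er", "n", "aa", "n"]],
      "turn off": [["t", "er", "n", "ao", "f"]],
      "bedroom":  [["b", "eh", "d", "r", "uw", "m"]],
  }
  root = build_pronunciation_trie(lexicon)
  print(sorted(root.children))   # ['b', 't']: shared prefixes are scored only once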

III-B Finding the best keyword sequence

The goal of the post-processing is to build the final detection list $\mathcal{L}$ from elements of $\mathcal{D}_\theta$ such that the segments $[t_s, t_e]$ of any two detections in $\mathcal{L}$ do not overlap, i.e. the detected keywords are not overlapping. This constraint helps in ambiguous situations. For example, if the set of keywords is “play, playlist, stop”, the query “play the playlist top fifty” could be ambiguous, and we might have overlapping detections of all three keywords in the “playlist top” segment.

We explored two strategies to obtain that sequence: a greedy approach where the keyword is output as soon as it is detected, and a full search for the best sequence that considers all possible non-overlapping detection sequences.

III-B1 Greedy

In the greedy approach, we look at the candidate detections in increasing order of end time $t_e$. When we find one for a given $t_e$, we add it to the list (if we find several, we keep the one with the highest confidence score) and remove from $\mathcal{D}_\theta$ all candidates that overlap with it (cf. Algorithm 1).

Input: $\mathcal{D}_\theta$: detection candidates
  $\mathcal{L} \leftarrow \emptyset$
  for $(w, t_s, t_e) \in \mathcal{D}_\theta$, in increasing order of $t_e$ (highest confidence first on ties) do
     if $[t_s, t_e]$ does not overlap any detection in $\mathcal{L}$ then
        $\mathcal{L} \leftarrow \mathcal{L} \cup \{(w, t_s, t_e)\}$
        remove from $\mathcal{D}_\theta$ all candidates overlapping $[t_s, t_e]$
     end if
  end for
  return $\mathcal{L}$
Algorithm 1 Greedy decoding
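
A compact Python version of Algorithm 1 could look as follows; the candidate tuple format and function name are illustrative conventions, not the actual implementation.

  def greedy_decode(candidates):
      """Candidates: list of (keyword, t_start, t_end, confidence) already above
      the threshold. Accept them in order of increasing end time (highest
      confidence first on ties), skipping any that overlaps an accepted one."""
      detections = []
      for kw, ts, te, conf in sorted(candidates, key=lambda c: (c[2], -c[3])):
          overlaps = any(ts <= d_te and d_ts <= te for _, d_ts, d_te, _ in detections)
          if not overlaps:
              detections.append((kw, ts, te, conf))
      return detections

  cands = [("play", 10, 14, 0.9), ("playlist", 10, 20, 0.8), ("stop", 30, 35, 0.7)]
  print(greedy_decode(cands))   # keeps "play" (it ends first), then "stop"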

III-B2 Sequence

The greedy approach does not guarantee that the best sequence of keywords will be output. For instance, it is likely that play will be detected with the greedy approach for “launch my playlist” even if playlist is in the keywords set. In the sequence approach, all the candidates are considered and the sequence of non-overlapping keyword with the maximum cumulative confidence is selected:

$\mathcal{L}^* = \arg\max_{S \in \mathcal{S}(\mathcal{D}_\theta)} \sum_{(w, t_s, t_e) \in S} c_w(t_s, t_e)$ (18)

where $\mathcal{S}(\mathcal{D}_\theta)$ is the set of sequences of non-overlapping elements of $\mathcal{D}_\theta$:

$\mathcal{S}(\mathcal{D}_\theta) = \{ S \subseteq \mathcal{D}_\theta : [t_s, t_e] \cap [t'_s, t'_e] = \emptyset \text{ for all distinct } (w, t_s, t_e), (w', t'_s, t'_e) \in S \}$
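
One way to solve Eq. 18 exactly is dynamic programming over the candidates sorted by end time (weighted interval scheduling); the sketch below is an illustration of that idea with the same hypothetical candidate format as above, not the actual implementation.

  import bisect

  def best_sequence(candidates):
      """Return the non-overlapping subset of candidates with maximum total
      confidence (weighted interval scheduling solved by dynamic programming).
      candidates: list of (keyword, t_start, t_end, confidence)."""
      cands = sorted(candidates, key=lambda c: c[2])   # sort by end time
      ends = [c[2] for c in cands]
      n = len(cands)
      best = [0.0] * (n + 1)       # best[i]: best total confidence using the first i candidates
      choice = [None] * (n + 1)    # choice[i]: (index to jump to, candidate) if candidate i is taken
      for i, (kw, ts, te, conf) in enumerate(cands, start=1):
          j = bisect.bisect_left(ends, ts)             # candidates ending strictly before ts
          if best[j] + conf > best[i - 1]:
              best[i] = best[j] + conf
              choice[i] = (j, (kw, ts, te, conf))
          else:
              best[i] = best[i - 1]
      selected, i = [], n
      while i > 0:                                     # backtrack the selected detections
          if choice[i] is None:
              i -= 1
          else:
              j, det = choice[i]
              selected.append(det)
              i = j
      return list(reversed(selected)), best[n]

  cands = [("play", 10, 14, 0.9), ("playlist", 10, 20, 1.0), ("stop", 30, 35, 0.7)]
  print(best_sequence(cands))    # picks "playlist" and "stop" (total confidence 1.7)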

III-C Confidence score computation

III-C1 CTC-based confidence score

The CTC framework readily provides a method to compute the probability of a label sequence. For a segment $[t_s, t_e]$, the probability of a keyword $w$ with pronunciation $\mathrm{pron}(w)$ is given by:

$p(w \mid \mathbf{x}, t_s, t_e) = \sum_{\pi \in \mathcal{B}^{-1}_{t_e - t_s + 1}(\mathrm{pron}(w))} \prod_{t=t_s}^{t_e} p_t(\pi_t \mid \mathbf{x})$ (19)

With the Viterbi approximation, we can define the raw confidence:

$c_w^{raw}(t_s, t_e) = \max_{\pi \in \mathcal{B}^{-1}_{t_e - t_s + 1}(\mathrm{pron}(w))} \prod_{t=t_s}^{t_e} p_t(\pi_t \mid \mathbf{x})$ (20)

The main problem with that computation arises from the fact that the network makes local predictions. The number of factors in the multiplication of local probabilities is equal to the length of the segment. So even if the network assigns the same high probability $p < 1$ to the phones of the correct keyword in every frame, the resulting confidence will be $p^n$ for a segment of $n$ frames, decaying exponentially with the segment length. Therefore, the confidence will tend to be smaller for longer keywords and will not reflect the confidence of the network predictions. In the following, we discuss how we set a more meaningful confidence score.
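
For concreteness, the Viterbi approximation of Eq. 20 for a single keyword pronunciation over a single segment can be computed with the standard CTC best-path alignment, as in the following self-contained sketch (log domain; the toy posteriors and function names are illustrative only).

  import numpy as np

  def ctc_viterbi_logscore(log_probs, pron, blank=0):
      """Best-path (Viterbi) log-probability of emitting the phone sequence `pron`
      over a segment with per-frame log-posteriors `log_probs` of shape (T, K),
      using the standard CTC topology (optional blanks between phones)."""
      ext = [blank]                           # extended sequence: blank, p1, blank, p2, blank, ...
      for p in pron:
          ext += [p, blank]
      S, T = len(ext), log_probs.shape[0]
      dp = np.full((T, S), -np.inf)
      dp[0, 0] = log_probs[0, ext[0]]
      dp[0, 1] = log_probs[0, ext[1]]
      for t in range(1, T):
          for s in range(S):
              best = dp[t - 1, s]                                     # stay in the same state
              if s >= 1:
                  best = max(best, dp[t - 1, s - 1])                  # advance by one state
              if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                  best = max(best, dp[t - 1, s - 2])                  # skip the optional blank
              dp[t, s] = best + log_probs[t, ext[s]]
      return max(dp[T - 1, S - 1], dp[T - 1, S - 2])                  # end on last phone or blank

  # Toy 5-frame segment over the classes (blank, 'a', 'b'); pronunciation "a b".
  rng = np.random.default_rng(0)
  probs = rng.dirichlet(np.ones(3), size=5)
  raw_logscore = ctc_viterbi_logscore(np.log(probs), pron=[1, 2])
  print("raw confidence:", float(np.exp(raw_logscore)))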

III-C2 Segment length normalization

One way to make the confidence score independent of the length of the segment is to normalize it by the segment length [3]. We perform that normalization in the log space to compute $c_w^{frames}$:

$c_w^{frames}(t_s, t_e) = \exp\left( \frac{\log c_w^{raw}(t_s, t_e)}{t_e - t_s + 1} \right)$ (21)

which amounts to taking the exponential of the average frame log-likelihood of the segment as confidence score.

III-C3 No-blank normalization

CTC-trained networks tend to mostly predict blanks with high probability and the labels quite locally. Denoting by $\pi^*$ the best path of Eq. 20, at first approximation:

$c_w^{raw}(t_s, t_e) = \prod_{t=t_s}^{t_e} p_t(\pi^*_t \mid \mathbf{x})$ (22)
$\approx 1^{N_\varnothing} \times \prod_{t : \pi^*_t \neq \varnothing} p_t(\pi^*_t \mid \mathbf{x})$ (23)
$\approx \prod_{t : \pi^*_t \neq \varnothing} p_t(\pi^*_t \mid \mathbf{x})$ (24)

where $N_\varnothing$ is the number of blanks in $\pi^*$. With this approximation, we see that the segment length normalization will tend to favor longer segments with more blanks, reducing the impact of label predictions. Since blanks are neither informative nor discriminating, we would like to normalize the score by the number of meaningful frames in the segment, i.e. the number of factors in the product in Eq. 24, that is $t_e - t_s + 1 - N_\varnothing$. We can approximate

$t_e - t_s + 1 - N_\varnothing \approx \sum_{t=t_s}^{t_e} \left(1 - p_t(\varnothing \mid \mathbf{x})\right)$ (25)

and define the “noblank” confidence score as:

$c_w^{noblank}(t_s, t_e) = \exp\left( \frac{\log c_w^{raw}(t_s, t_e)}{\sum_{t=t_s}^{t_e} \left(1 - p_t(\varnothing \mid \mathbf{x})\right)} \right)$ (26)
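
Given the raw Viterbi log-score of a segment and the per-frame blank posteriors, the two normalized scores of Eqs. 21 and 26 can be derived as in this small sketch (function and variable names, as well as the example values, are illustrative only):

  import numpy as np

  def normalized_confidences(raw_logscore, blank_probs):
      """Turn the raw Viterbi log-score of a segment into calibrated confidences.
      blank_probs: per-frame blank posteriors p_t(blank) over the segment."""
      blank_probs = np.asarray(blank_probs)
      c_frames = np.exp(raw_logscore / len(blank_probs))    # Eq. 21: segment-length normalization
      n_nonblank = max(np.sum(1.0 - blank_probs), 1e-6)     # Eq. 25: estimated non-blank frames
      c_noblank = np.exp(raw_logscore / n_nonblank)         # Eq. 26: "no-blank" normalization
      return c_frames, c_noblank

  # A 10-frame segment dominated by blanks, with a raw log-score of -3.0.
  blank_probs = [0.95, 0.9, 0.1, 0.05, 0.9, 0.95, 0.2, 0.1, 0.9, 0.95]
  print(normalized_confidences(-3.0, blank_probs))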

III-C4 Likelihood ratio

It is common in confidence score estimation to compute a likelihood ratio [36]. For example, with generative models:

$\mathrm{LR}(\mathbf{x}, w) = \frac{p(\mathbf{x} \mid w)}{p(\mathbf{x} \mid \mathrm{background})}$ (27)

where $p(\mathbf{x} \mid \mathrm{background})$ is computed with a background model.

In the same vein, we may calibrate the confidence score by computing the ratio between the keyword probability and the probability of the best label sequence given the outputs of the network [15]. The probability of the best label sequence over the segment is given by $\prod_{t=t_s}^{t_e} \max_k p_t(k \mid \mathbf{x})$ and we compute the normalized confidence as:

$c_w^{ratio}(t_s, t_e) = \frac{c_w^{raw}(t_s, t_e)}{\prod_{t=t_s}^{t_e} \max_k p_t(k \mid \mathbf{x})}$ (28)

This confidence equals one when the best label sequence corresponds to the keyword, and is close to zero when it is very different. In general, it measures how close the keyword prediction is to the best label prediction.
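
A sketch of this ratio, given the raw Viterbi log-score of a segment and its (T, K) matrix of log-posteriors (function and example values are illustrative only):

  import numpy as np

  def ratio_confidence(raw_logscore, log_probs):
      """Ratio between the keyword score and the score of the best label
      sequence over the same segment (cf. Eq. 28)."""
      best_path_logscore = np.sum(np.max(log_probs, axis=1))
      return float(np.exp(raw_logscore - best_path_logscore))

  # If the keyword's best path is also the overall best path, the ratio is 1.
  rng = np.random.default_rng(0)
  log_probs = np.log(rng.dirichlet(np.ones(3), size=5))
  print(ratio_confidence(np.sum(np.max(log_probs, axis=1)), log_probs))   # -> 1.0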

III-C5 Normalization and ratio

The ratio is still a product of positive factors smaller than one. Although each of them will be higher than the initial probability, the obtained confidence will suffer from the same issues regarding segment length and the amount of blank frames. Thus we can also combine the normalization schemes with the ratio, for example with the segment length normalization:

$c_w^{ratio,frames}(t_s, t_e) = \exp\left( \frac{\log c_w^{raw}(t_s, t_e) - \sum_{t=t_s}^{t_e} \log \max_k p_t(k \mid \mathbf{x})}{t_e - t_s + 1} \right)$

Fig. 2: Comparison of confidence score calibration techniques.

In Fig. 2, we compare the different combinations of normalization and ratio for the query “please turn off lights for the bedroom”. At each time step we plot the maximum confidence score of the two keywords appearing in the utterance ending at . We observe that without any normalization, the confidence scores are quite low: 0.006 for turn off, bedroom is not detected. With the ratio, the confidence of turn off improves (around 0.2), but bedroom is still not detected. With the segment length normalization, the confidence scores are higher, and we see that it especially improved for bedroom. However, we also see many steps with relatively high confidence scores which do not correspond to keywords. With the ratio, the confidence scores are higher but follow the same trend. With the “no-blank” technique, the confidence score of bedroom is lower but it looks like the most discriminative technique of all.

III-D Improving decoding speed

Computing a confidence score for all keywords in all segments is quite expensive, even with the trie implementation. For a prediction sequence of length $T$ and $K$ keywords, the number of (segment, keyword) pairs to score is on the order of $K \, T^2 / 2$. This also impacts the speed of the post-processing. We propose a few optimizations to improve the speed.

III-D1 Boundaries subsampling

Since it is not crucial to find the exact boundaries of the segment, we can, for example, consider starting times only every three frames (i.e. every 90ms, given the 30ms frame rate). This divides the number of segments explored by three. We can apply the same idea to ending times and add detections to $\mathcal{D}_\theta$ only every three frames. This has no impact on the amount of computation in the trie, since the scores for segments $[t_s, t_e - 1]$ and $[t_s, t_e - 2]$ are anyway calculated during the computation of the score for $[t_s, t_e]$. However, it has an impact on the complexity of the post-processing, dividing by three the number of detection candidates to consider.

III-D2 Maximum segment length

It is fair to assume that a given keyword will be uttered in a limited amount of time, so it should not be necessary to consider very long segments. Usually, about one second is sufficient. We can greatly reduce the number of segments to score by only computing scores for segments shorter than some predefined maximum duration $T_{max}$.

III-D3 Pruning

The keyword detection scores are computed iteratively in a prefix trie. If the prefix of a keyword has a low probability, the whole keyword is likely to have a low probability. In the token passing algorithm, we therefore drop any path for which the average negative log-likelihood per frame is higher than a pruning threshold.

Fig. 3: Amount of dropped frames, and missed phones for different thresholds on the blank probability, measured on an aligned dataset.

III-D4 Ignoring blank frames

We have seen that the blank predictions can be problematic for computing the confidence score, because they are not informative and dominate the prediction sequence. Taking inspiration from phone-synchronous decoding [6, 42], we drop the prediction frames in which the probability of the blank label exceeds some threshold. This amounts to realizing the approximation of Eq. 24. To measure the impact of this strategy, we align a dataset and compute the ratio of dropped frames and missed phones at different dropping thresholds. The results are displayed in Fig. 3. We observe that we can drop almost 60% of the frames while missing less than 1% of actual phones. That would represent 60% less computation in each segment.
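
A minimal sketch of this frame-dropping step, assuming a (T, K) matrix of log-posteriors with the blank class at a known index (the names, the threshold and the toy data are illustrative only):

  import numpy as np

  def drop_blank_frames(log_probs, blank=0, threshold=0.95):
      """Remove frames dominated by the blank class before decoding.
      log_probs: (T, K) log-posteriors; returns kept frames and their time indices."""
      keep = np.exp(log_probs[:, blank]) < threshold
      return log_probs[keep], np.flatnonzero(keep)

  # Toy posteriors where class 0 (the blank) dominates most frames.
  rng = np.random.default_rng(0)
  probs = rng.dirichlet([20.0, 1.0, 1.0], size=8)
  kept, idx = drop_blank_frames(np.log(probs), threshold=0.9)
  print(f"kept {len(idx)}/{len(probs)} frames, at indices {idx.tolist()}")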

III-E Online keyword spotting

Since the acoustic model is only made of dense layers and unidirectional LSTMs, we can use it in a streaming, online mode, where we feed the MFCC frames as the audio comes in, and output a new prediction frame every 30ms. In the keyword detector, the tokens are updated at every prediction frame, and candidate detections ending at this time step are produced, so the detector is also compatible with a streaming mode. The greedy post-processor outputs detections as soon as they exceed the threshold, so it is easily applicable to the online scenario. The sequence post-processor must remember the best sequences ending anywhere within the last $T_{max}$ frames (where $T_{max}$ is the maximum segment length), and must wait for the end of the query to output the final detected sequence of keywords. However, the computation itself can be done in streaming.

IV Experimental setup

In this section, we present the experimental setup: the training data and procedure, the evaluation tasks and associated datasets, the metrics with which we evaluate our system, and the baselines we compare it to.

IV-A Training

We train the acoustic model on the Librispeech dataset [21]. The training set contains 960 hours of English read speech. To make the model robust to noisy far-field environments, we augment the training data four times. We use the pyroomacoustics library [28] to simulate random rooms and speaker and microphone positions, with random noise sources. We train the networks with CTC [11] to predict phone sequences.

The non-quantized LSTM networks are first trained for 40 epochs on the augmented Librispeech training set. After this first step, they are converted to use the quantized LSTM cell, and further trained for five epochs with the same training hyperparameters. A pronunciation model combining a flat lexicon with a grapheme-to-phoneme converter is used to convert the transcripts of the dataset to the phone sequences that serve as targets for the training.

IV-B Evaluation tasks and datasets

We evaluate our system in a mini-SLU scenario. In this scenario, we suppose that the system has been triggered by the user, for example by saying a wake word. The goal is to detect the keywords of interest in the subsequent query. We defined two tasks, corresponding to a smart lights scenario and a washing machine scenario, and selected eight keywords for each: turn on, turn off, increase, decrease, brightness, kitchen, living room and bedroom for smart lights; hot water, cold water, high spin, low spin, wash heavy duty, wash normal, wash colors and wash delicate for washing machine.

lights washing
Samples 564 545
Unique keywords 8 8
Speakers (M/F) 32 (22/10) 33 (22/11)
Samples/speaker - avg (min/max) 18 (8/60) 17 (5/50)
Duration (s) - avg (min/max) 2.6 (1.6/6.1) 3.4 (1.8/6.7)
TABLE II: Mini-SLU datasets statistics.

We crowd-sourced over five hundred queries for each use-case, from more than 30 speakers per task (cf. Table II). Each query contains between one and four keywords and is expressed in natural language (e.g. “could you [turn on] the lights in the [bedroom]”). Each dataset was re-recorded in clean and in noisy, reverberated far-field conditions with an SNR of 5dB. We present some statistics of these datasets in Table II. The datasets will be made publicly available.

IV-C Evaluation metrics

We evaluate our models with two metrics. At the keyword level, we measure the F1 score, which illustrates the ability of the system to pick up all and only the keywords. For a simple keyword-based SLU system, it is important that the whole query is correctly parsed, i.e. that the correct sequence of keywords is detected. We measure this by computing the ratio of exactly parsed queries, i.e. those for which the sequence of detected keywords exactly matches the expected one.

IV-D LVCSR-based KWS baselines

The first baselines are based on the decoding of the queries with a large vocabulary and a language model, similar to the ones used for ASR tasks. Two such baselines are evaluated with both the large TDNN-LSTM network and the best quantized LSTM network. The first one consists in looking for the keywords in the Viterbi decoding of the query. For the second one, we first extract recognition lattices and compute word posteriors in the lattice. The keywords are then searched for in the lattice.

IV-E Filler model baselines

For the filler model baseline, a decoding graph is built with two parallel paths: one for the keywords and one for a phone-loop filler model. The false alarm rate is controlled by adjusting the transition cost from the filler model to the keywords. The output of this baseline is derived from a Viterbi decoding in this graph. We evaluated this baseline for the large TDNN-LSTM network and the best quantized LSTM network.

IV-F CTC-KWS baselines

Finally, two methods for CTC-based KWS were implemented: a phone-synchronous decoding minimum edit distance (PSD-MED) approach derived from [42, 7] and a CTC-decoding one similar to [15].

The PSD-MED baseline implements the decoding method of Zhuang et al. [42, 7] on top of the best quantized LSTM neural network trained for this paper. The main differences with the reference papers are that there is no word boundary class for that neural network, that the confidence scores are normalized by the number of frames, and that the “sequence” post-processing is applied to the output of the method. The results we obtained, in particular how they compare to the keyword-filler baseline, are consistent with the results reported in [8].

For the CTC-decoding baseline, we set up our system to match the approach proposed by Hwang et al. [15] as closely as possible. The confidence scores are normalized by the length of the detected segments and the decision is made based on the score ratio, with the greedy approach. The main difference with the reference paper is the absence of a word boundary class.

V Results

In this part we present the detailed results of the experiments for the proposed approach. The first step is to train a generic ASR acoustic model. We give in Section V-A the error rates of the different trained models in an LVCSR setup. The purpose is to give an idea of the performance of the obtained models, to put the subsequent results in perspective. In Section V-B, we compare the results obtained with different confidence score strategies. The post-processing methods are compared in Section V-C. The impact of the presented techniques to improve the decoding speed is measured in Section V-D. We show the impact of model size in Section V-E and of quantization in Section V-F. Finally, we compare the proposed approach to the baselines in Section V-G.

V-A Acoustic model training

We train the base acoustic models according to the method presented in Section II-C. The models are stacks of 3 or 5 unidirectional LSTM layers, with 64, 96 or 128 units in each layer. They are trained to minimize the CTC loss [11] with the Adam optimizer, with minibatches of 32 samples and a learning rate annealed by a fixed factor every time no improvement has been seen for a given number of updates. We applied a curriculum strategy to focus on the shorter samples at the beginning of training and add longer ones at each epoch.

We plot the convergence curves for the five models we trained in Fig. 4. As expected, bigger models yield lower error rates, and models of similar sizes yield similar error rates. When quantization is activated at epoch 40, all networks suffer a large loss of performance, which is almost entirely recovered after the quantized training. This loss of performance is mainly due to the quantization of the logits, which has a direct impact on the predictions of the network.

Fig. 4: Convergence curves of CTC training of networks of different sizes. The top plot shows the CTC loss per frame. The bottom plot shows the normalized edit distance between the raw CTC predictions and the ground truth.

Then, we plug the small neural network acoustic models into a large-vocabulary ASR setup. We use the standard vocabulary of 200k words provided with the Librispeech corpus (http://www.openslr.org/11/). We applied a pruned trigram language model, also provided with Librispeech and sometimes referred to as tgmed in the literature. We carried out a single-pass Viterbi beam search with a log-likelihood beam of 8. We measure the word error rate of the obtained ASR system on the development and test sets of Librispeech, and report the results in Table III. We evaluate all the models, before quantization (i.e. at the end of the initial 40 epochs of training) and after quantization (at the end of training).

Model dev-clean dev-other test-clean test-other
3x64 23.0 44.5 22.8 47.1
(quantized) 22.8 41.7 21.8 43.7
5x64 17.5 36.8 17.2 39.1
(quantized) 18.1 36.3 18.0 38.5
3x96 16.4 35.7 16.7 37.7
(quantized) 16.8 34.7 16.7 36.6
5x96 13.7 30.7 13.7 31.9
(quantized) 13.5 29.5 13.8 31.0
3x128 13.8 31.3 14.2 33.0
(quantized) 13.8 29.9 13.9 30.9
TDNN-LSTM 7.3 16.6 7.4 17.4
TABLE III: Word Error Rates (%) obtained with different acoustic models on LibriSpeech, with the tgmed language model.

Again, we logically get better results with bigger models. The error rates achieved with the quantized models tend to be slightly lower than the corresponding results of the non-quantized ones. Although it may seem counter-intuitive, it is possible that this effect is merely due to the additional five epochs of training. Finally, we note that these results are quite far from the current state-of-the-art for this dataset. However, the models we present are very small, and the error rates result from a single-pass streaming decoding. They are given as an indication of the performance of the trained network as an acoustic model, and look acceptable from this perspective.

We applied the same LVCSR decoding to the proposed dataset and measured the word error rates for the quantized 5x96 LSTM network and for the large TDNN-LSTM model. The results are reported in Table IV. We see that these datasets are quite challenging. Even the TDNN-LSTM yields high error rates, more than five times those obtained on Librispeech.

Model lights-clean lights-noisy washing-clean washing-noisy
5x96 (quantized) 54.6 79.3 72.8 87.9
TDNN-LSTM 32.8 59.6 46.5 68.6
TABLE IV: Word Error Rates (%) obtained with different acoustic models on the mini-SLU datasets, with the tgmed language model.

V-B Confidence score calibration

We have seen in Section III that the definition of a good confidence score is quite important to aggregate the framewise phone-level scores of the acoustic model into a meaningful keyword-level confidence. We first compare the different normalization strategies (raw, segment-length and “no-blank” normalization), with or without the score ratio. We plot the F1 score and exact rate results in Fig. 5, obtained with the non-quantized 3x96 network on the washing-clean dataset.

Fig. 5: Comparison of F1-scores and exact rates for different confidence score calibration methods on the washing-clean dataset using the non-quantized 3x96 network.

We see that the different normalization strategies are associated with different optimal thresholds. For reasons we already explained, without any normalization, most of the confidence scores are quite low. As expected, normalizing by the number of frames gives more understandable results; indeed, the confidence in that case can be interpreted as a per-frame probability. The “no-blank” normalization yields the best results. It tends to give lower scores in general, but scores much closer to zero for incorrect detection candidates, reducing the risk of false alarms.

Using score ratios as confidence (dashed lines) tends to yield higher scores and optimal thresholds. Moreover, the performance seems to decrease more smoothly when the threshold deviates from the optimal one, giving a bit more robustness to the system. It also improves the performance for segment length normalization or when there is no normalization. For the “no blank” normalization, the performance is slightly lower.

V-C Post-processor

The previous results were obtained with the greedy post-processor. We now compare the performance of the greedy post-processor with that of the sequence post-processor, which provides a more elaborate enforcement of the non-overlap constraints.

Fig. 6: Comparison of F1-scores and exact rates for different post-processors on the washing-clean dataset using the non-quantized 3x96 network.

The results are displayed in Fig. 6 for the confidence scores using the score ratio and the normalizations by the number of frames or of non-blank frames. With higher thresholds, the false rejection rate increases; this is not recovered by using a different post-processor. For lower thresholds, however, the greedy post-processor triggers more quickly and the false alarm rate increases. With the sequence post-processor, some of these false alarms are discarded by the non-overlap constraints, so the performance decreases more slowly as the threshold decreases. For the “no-blank” normalization, the sequence post-processor reaches better results than the greedy approach at a smaller threshold.

V-D Decoding speed

We presented in Section III a few tricks to improve the decoding speed. They usually also result in fewer detection candidates, which should increase the speed of the post-processor as well, but might degrade the detection accuracy. In the following, we measure the processing time reduction as well as the performance degradation to find the best trade-off.

Fig. 7: Processing time reduction on the washing-clean dataset using the non-quantized 3x96 network.

In Fig. 7 we plot the relative processing time reduction of both the decoder and the post-processor for the different tricks. We see that each of them can easily bring an improvement of processing time by a factor of two for the decoder and the post-processor.

Fig. 8: Comparison of F1-scores and exact rates for different boundary subsampling factors on the washing-clean dataset using the non-quantized 3x96 network.

For the boundary subsampling, we already achieve more than a factor of two by considering only every other frame as a possible boundary (top-left plot in Fig. 7). However, we notice in Fig. 8 that the accuracy of the system decreases quickly with the subsampling factor, for the two confidence score configurations shown.

Fig. 9: Comparison of F1-scores and exact rates for different blank dropping probability thresholds on the washing-clean dataset using the non-quantized 3x96 network.

Another option is to completely drop from the search the frames where the blank probability exceeds some threshold. We saw in Fig. 3 that, at the expense of potentially dropping a few frames containing useful information, we could skip almost half the frames during decoding. With a quite high threshold on the blank probability, the same acceleration factor is achieved as with a boundary subsampling factor of 2 (top-right plot of Fig. 7). We also see in Fig. 9 that the system is more robust to this acceleration method than to boundary subsampling. With high enough thresholds, we observe almost no degradation of performance.

Fig. 10: Comparison of F1-scores and exact rates for different pruning thresholds with the “num. frames” (top) and “no-blank” (bottom) ratio confidences on the washing-clean dataset using the non-quantized 3x96 network.

Most of the evaluated segments will have content very different from any of the keywords, which can be detected early in the decoding process. As a result, pruning strategies allow dropping such segments early and, as we notice in the bottom-left plot of Fig. 7, provide huge speed-ups. In Fig. 10, we see that the performance remains almost the same for any pruning threshold above a moderate value. With the “no-blank” confidence with ratio (bottom), too strict a pruning threshold induces a significant drop of exact rate. For the “num. frames” confidence with ratio (top), a stricter pruning threshold actually improves the performance at small detection thresholds: it might be because it reduces the number of false alarms, which we have seen are numerous for this confidence.

Fig. 11: Comparison of F1-scores and exact rates for different values of the maximum segment length $T_{max}$ with the “num. frames” (top) and “no-blank” (bottom) ratio confidences on the washing-clean dataset using the non-quantized 3x96 network.

Finally, the decoding speed decreases linearly with the maximum segment length (bottom-right plot of Fig. 7). This parameter is closely related to the actual duration of a keyword. If it is set to too small a value, there will be many false rejections of long keywords; if it is too big, the chances of unrealistic detections increase. This is verified in Fig. 11, where we see that 600ms seems too small, while 900ms appears to be the best choice. With bigger values, the performance decreases. This is especially true for the “num. frames” confidence with ratio: it is normalized by the segment length, so the impact of a single frame on the confidence is smaller for longer segments.

V-E Impact of acoustic model size

We now compare the results for acoustic models of different sizes to measure the impact of the number of parameters on the performance of the system. We plot the F1 score and exact rate results for all the trained models, before quantization, in Fig. 12. As expected, bigger models yield better results. Yet the difference between the smallest model (3x64) and the biggest (3x128) is not as big for the keyword detection task as it is for the LVCSR task of Section V-A.

Fig. 12: Comparison of F1-scores and exact rates for different neural network sizes on the lights-clean dataset (non-quantized networks).

V-F Impact of quantization

After a fixed number of epochs, the acoustic models are quantized and further trained. We have seen in Section V-A that the quantized models, which have been trained longer, yield similar or better word error rates for LVCSR than the ones before quantization. We compare in Fig. 13 the effect of quantization on the performance of the systems for two model sizes and two datasets. We notice that quantized models tend to perform a bit worse than their floating-point counterparts. The difference in performance is not so big for noisy datasets but appears to be significant on clean ones.

Fig. 13: Comparison of F1-scores and exact rates for quantized and non-quantized networks on the lights dataset (clean and noisy).

V-G Final results

To compare the systems and design choices (post-processor, confidence score, etc.), we select, for each case, the threshold that corresponds to the best cumulative exact rate across the four datasets. This threshold might not be the optimal one for an individual dataset but corresponds more closely to a real-world scenario where a single threshold is set for the system.

Fig. 14: Comparison of F1-scores and exact rates for different sizes of quantized networks.

We show in Fig. 14 the performance of quantized networks of different sizes, using the sequence post-processor and the “no-blank” confidence. With some exceptions, bigger models are better. The best neural network across the different datasets seems to be the 5x96 architecture, with 395k parameters.

Fig. 15: Comparison of F1-scores and exact rates for the quantized 5x96 architecture for different confidence scores.

In Fig. 15, we compare different choices of confidence scores for that network. We see that the raw confidence score gives the poorest performance. Normalizing by the segment length gives much better results. The normalization by the estimated number of non-blank frames yields the best results for both metrics and all four datasets. Using the likelihood (or confidence) ratio helps a lot for the raw confidence, but always gives worse results for “no-blank”. For the segment-length normalization, no clear conclusion can be drawn.

Fig. 16: Comparison of F1-scores and exact rates for the 5x96 architecture and the large TDNN-LSTM for different KWS methods.
Model Method lights-clean lights-noisy washing-clean washing-noisy
TDNN-LSTM Transcript 0.783 0.513 0.567 0.352
Lattice 0.889 0.674 0.692 0.446
Filler 0.891 0.707 0.811 0.666
5x96 Transcript 0.435 0.154 0.299 0.113
(quantized) Lattice 0.623 0.311 0.448 0.198
Filler 0.739 0.560 0.826 0.698
PSD-MED 0.636 0.478 0.800 0.647
Hwang 0.657 0.435 0.778 0.620
Proposed (greedy) 0.784 0.545 0.866 0.708
Proposed (sequence) 0.808 0.588 0.871 0.725

TABLE V: Comparison of F1-scores for different KWS methods, with the large TDNN-LSTM model and the quantized 5x96 architecture.

Finally, we compare the proposed method to the baselines. The exact rates and F1 scores of all methods are shown in Fig. 16, and the F1 scores are also reported in Table V. The proposed method is configured with the quantized 5x96 architecture and the no-blank confidence without ratio. We used that same network for several of the baseline methods, and also compare a few methods with the large TDNN-LSTM network.

We notice that the “Transcript” LVCSR baseline performs poorly, as may be expected from the word error rates presented in Table IV. A big improvement over this baseline is achieved by using lattices instead of the Viterbi decoding. Using a filler model instead of an LVCSR approach provides a further improvement, especially for the small network. The PSD-MED and Hwang methods are a bit worse than the filler model, which is consistent with previous publications [8], and may also be explained by the lack of a word boundary class in our model compared to the reference ones presented in the corresponding papers [42, 7, 15].

The method we propose in this paper, in green in Fig. 16, performs better than the filler method using the same neural network. The sequence post-processor gives a small improvement over the greedy approach. It even reaches better results than the filler model using the TDNN-LSTM model on the “washing machine” dataset. Although the results are a bit behind on the “smart lights” dataset, they are nonetheless better than the LVCSR “transcript” approach, with a much smaller network and fast decoding.

VI Conclusion

We have presented a small-footprint keyword spotting method that does not require training data labeled at the keyword level, applied to a spoken language understanding scenario in which multiple keywords should be detected in a single query. The acoustic neural network is trained with CTC on generic ASR data, and can be used to detect any arbitrary keyword. We proposed a quantization scheme for LSTM layers, allowing us to build a small neural network weighing less than 500KB, which can run in real-time on micro-controllers.

We exploited the characteristics of the typical outputs of a CTC-trained network to optimize the keyword spotting algorithm. We proposed a confidence score calibration based on a normalization of the CTC score by an estimate of the number of non-blank frames, and gained a factor of two in the speed of the detection algorithm by dropping the blank frames. Moreover, we carried out a comprehensive exploration of the impact of different aspects of the proposed system.

We compared the detection performance of our method to standard baselines, either based on a LVCSR system or on a filler model. We have shown that our approach outperforms the filler model on the studied tasks and datasets, as well as several other approaches proposed in the literature.

Future work will focus on developing the SLU system on top of the outputs of this model, potentially providing cues to improve the decoding of keyword sequences. We will also use this system as a baseline to evaluate our future work on open-vocabulary KWS models.

References

  • [1] J. Amoh and K. M. Odame (2019) An optimized recurrent unit for ultra-low-power keyword spotting. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3 (2), pp. 1–17. Cited by: §I.
  • [2] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury (2017) End-to-end asr-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1351–1359. Cited by: §I.
  • [3] G. Bernardis and H. Bourlard (1998) Improving posterior based confidence measures in hybrid hmm/ann speech recognition systems. In Fifth International Conference on Spoken Language Processing, Cited by: §III-C2.
  • [4] J. S. Bridle (1973) An efficient elastic-template method for detecting given words in running speech. In Brit. Acoust. Soc. Meeting, pp. 1–4. Cited by: §I.
  • [5] G. Chen, C. Parada, and G. Heigold (2014) Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091. Cited by: §I, §II-C.
  • [6] Z. Chen, W. Deng, T. Xu, and K. Yu (2016) Phone synchronous decoding with ctc lattice.. In Interspeech, pp. 1923–1927. Cited by: 3rd item, §III-D4.
  • [7] Z. Chen, Y. Qian, and K. Yu (2018) Sequence discriminative training for deep learning based acoustic keyword spotting. Speech Communication 102, pp. 100–111. Cited by: §I, §III-A, §III, §IV-F, §IV-F, §V-G.
  • [8] Z. Chen (2019-05) Sequence modeling and decoding in speech recognition. Ph.D. Thesis, Shanghai Jiao Tong University. Note: Chap. 7 Cited by: §IV-F, §V-G.
  • [9] A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol, and T. Lavril (2019) Efficient keyword spotting using dilated convolutions and gating. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6351–6355. Cited by: §I.
  • [10] S. Fernández, A. Graves, and J. Schmidhuber (2007) An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks, pp. 220–229. Cited by: §I.
  • [11] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §I, §II, §IV-A, §V-A.
  • [12] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou (2016) Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176. Cited by: §II-B.
  • [13] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw (2017) Streaming small-footprint keyword spotting using sequence-to-sequence models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481. Cited by: §I, §III-A.
  • [14] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §I.
  • [15] K. Hwang, M. Lee, and W. Sung (2015) Online keyword spotting with a character-level recurrent neural network. arXiv preprint arXiv:1512.08903. Cited by: §I, §I, §III-A, §III-C4, §III, §IV-F, §IV-F, §V-G.
  • [16] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §II-B.
  • [17] C. Lengerich and A. Hannun (2016) An end-to-end architecture for keyword spotting and voice activity detection. arXiv preprint arXiv:1611.09405. Cited by: §I, §III.
  • [18] L. Lugosch, S. Myer, and V. S. Tomar (2018) DONUT: ctc-based query-by-example keyword spotting. arXiv preprint arXiv:1811.10736. Cited by: §I.
  • [19] A. Mandal, K. P. Kumar, and P. Mitra (2014) Recent developments in spoken term detection: a survey. International Journal of Speech Technology 17 (2), pp. 183–198. Cited by: §I.
  • [20] I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, et al. (2016) Personalized speech recognition on mobile devices. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5955–5959. Cited by: §II-B.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §IV-A.
  • [22] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish (1989) Continuous hidden markov modeling for speaker-independent word spotting. In International Conference on Acoustics, Speech, and Signal Processing,, pp. 627–630. Cited by: §I.
  • [23] R. C. Rose and D. B. Paul (1990) A hidden markov model based keyword recognition system. In International conference on acoustics, speech, and signal processing, pp. 129–132. Cited by: §I.
  • [24] A. Saade, A. Coucke, A. Caulier, J. Dureau, A. Ball, T. Bluche, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, et al. (2018) Spoken language understanding on the edge. arXiv preprint arXiv:1810.12735. Cited by: §I, §II-C.
  • [25] N. Sacchi, A. Nanchen, M. Jaggi, and M. Cernak (2019) Open-vocabulary keyword spotting with audio and text embeddings. In INTERSPEECH 2019-IEEE International Conference on Acoustics, Speech, and Signal Processing, Cited by: §I.
  • [26] T. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. Cited by: §I, §II-C.
  • [27] H. Sak, A. Senior, K. Rao, and F. Beaufays (2015) Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947. Cited by: §II-A, §II.
  • [28] R. Scheibler, E. Bezzam, and I. Dokmanic (2018) Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355. Cited by: §IV-A.
  • [29] M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni (2016) Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 474–480. Cited by: §I.
  • [30] I. Szoke, P. Schwarz, P. Matejka, L. Burget, M. Karafiát, M. Fapso, and J. Cernocky (2005) Comparison of keyword spotting approaches for informal continuous speech. In Ninth European conference on speech communication and technology, Cited by: §I.
  • [31] R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488. Cited by: §I.
  • [32] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni (2016) Model compression applied to small-footprint keyword spotting.. In INTERSPEECH, pp. 1878–1882. Cited by: §I.
  • [33] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang (2018) C-lstm: enabling efficient lstm using structured compression techniques on fpgas. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 11–20. Cited by: §II-B.
  • [34] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §I.
  • [35] M. Weintraub (1993) Keyword-spotting using sri’s decipher large-vocabulary speech-recognition system. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 463–466. Cited by: §I.
  • [36] M. Weintraub (1995) LVCSR log-likelihood ratio scoring for keyword spotting. In 1995 International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 297–300. Cited by: §I, §III-C4.
  • [37] J. G. Wilpon, L. R. Rabiner, C. Lee, and E. Goldman (1990) Automatic recognition of keywords in unconstrained speech using hidden markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 38 (11), pp. 1870–1878. Cited by: §I.
  • [38] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig (2016) Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256. Cited by: §I.
  • [39] Y. Yang, A. Lalitha, J. Lee, and C. Lott (2018) Automatic grammar augmentation for robust voice command recognition. arXiv preprint arXiv:1811.06096. Cited by: §I.
  • [40] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak (2016) Fast, compact, and high quality lstm-rnn based statistical parametric speech synthesizers for mobile devices. arXiv preprint arXiv:1606.06061. Cited by: §II-B.
  • [41] Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128. Cited by: §I.
  • [42] Y. Zhuang, X. Chang, Y. Qian, and K. Yu (2016) Unrestricted vocabulary keyword spotting using lstm-ctc.. In Interspeech, pp. 938–942. Cited by: 3rd item, §I, §III-D4, §IV-F, §IV-F, §V-G.