Stochastic Adaptive Neural Architecture Search for Keyword Spotting

11/16/2018 ∙ by Tom Véniat, et al. ∙ 8

The problem of keyword spotting, i.e. identifying keywords in a real-time audio stream, is mainly solved by applying a neural network over successive sliding windows. Due to the difficulty of the task, baseline models are usually large, resulting in a high computational cost and energy consumption level. We propose a new method called SANAS (Stochastic Adaptive Neural Architecture Search) which is able to adapt the architecture of the neural network on-the-fly at inference time, such that small architectures are used when the stream is easy to process (silence, low noise, ...) and bigger networks are used when the task becomes more difficult. We show that this adaptive model can be learned end-to-end by optimizing a trade-off between the prediction performance and the average computational cost per unit of time. Experiments on the Speech Commands dataset show that this approach leads to a high recognition level while being much faster (and/or energy saving) than classical approaches where the network architecture is static.




1 Introduction and Related Work

Neural Networks (NN) are known to obtain very high recognition rates on a large variety of tasks, and especially over signal-based problems like speech recognition Amodei et al. (2015), image classification He et al. (2015); Real et al. (2018), etc. However, these models are usually composed of millions of parameters involved in millions of operations and have high computational and energy costs at prediction time. There is thus a need to increase their processing speed and reduce their energy footprint.

From the NN point of view, this problem is often viewed as a problem of network architecture discovery and solved with Neural Architecture Search (NAS) methods in which the search is guided by a trade-off between prediction quality and prediction cost Veniat and Denoyer (2018); Huang and Wang (2017); Gordon et al. (2017). Recent approaches involve for instance Genetic Algorithms Real et al. (2017, 2018) or Reinforcement Learning Zoph and Le (2016); Zoph et al. (2017). While these models often rely on expensive training procedures where multiple architectures are trained, some recent works have proposed to simultaneously discover the architecture of the network while learning its parameters Veniat and Denoyer (2018); Huang and Wang (2017), resulting in models that are fast both at training and at inference time. But in all these works, the discovered architecture is static, i.e. the same NN is reused for all the predictions.

When dealing with streams of information, reducing the computational and energy costs is of crucial importance. For instance, let us consider the keyword spotting problem (see Section 3 for a formal description), which is the focus of this paper. It consists in detecting keywords in an audio stream and is particularly relevant for virtual assistants, which must continuously listen to their environments to spot user interaction requests. This requires detecting when a word is pronounced and which word has been pronounced, while running quickly on resource-limited devices. Some recent works Sainath and Parada (2015); Arik et al. (2017); Tang and Lin (2018) proposed to use convolutional neural networks (CNN) in this streaming context, applying a particular model to successive sliding windows Sainath and Parada (2015); Tang and Lin (2018) or combining CNNs with recurrent neural networks (RNN) to keep track of the context Arik et al. (2017). In such cases, the resulting system spends the same amount of time to process each audio frame, irrespective of the content of the frame or its context.

Our conjecture is that, when dealing with streams of information, a model able to adapt its architecture to the difficulty of the prediction problem at each timestep – i.e. a small architecture being used when the prediction is easy, and a larger architecture being used when the prediction is more difficult – would be more efficient than a static model, particularly in terms of computation or energy consumption. To achieve this goal, we propose the SANAS algorithm (Section 2.3): it is, as far as we know, the first architecture search method producing a system which dynamically adapts the architecture of a neural network during prediction at each timestep and which is learned end-to-end by minimizing a trade-off between computation cost and prediction loss. After learning, our method can process audio streams at a higher speed than classical static methods while keeping a high recognition rate, spending more prediction time on complex signal windows and less time on easier ones (see Section 3).

Figure 1: SANAS Architecture. At timestep $t$, the distribution $\Gamma_t$ over architectures is generated from the previous hidden state, $\Gamma_t = h(z_t, \theta)$. A discrete architecture $H_t$ is then sampled from $\Gamma_t$ and evaluated over the input $x_t$. This evaluation gives both a feature vector $\Phi(x_t, \theta, E \circ H_t)$ to compute the next hidden state, and the prediction of the model $\hat{y}_t$ using $f(z_t, x_t, \theta, E \circ H_t)$. Dashed edges represent sampling operations. At inference, the architecture which has the highest probability is chosen at each timestep.

2 Adaptive Neural Architecture Search

2.1 Problem Definition

We consider the generic problem of stream labeling where, at each timestep $t$, the system receives a datapoint denoted $x_t$ and produces an output label $y_t$. In the case of audio streams, $x_t$ is usually a time-frequency feature map, and $y_t$ is the presence or absence of a given keyword. In classical approaches, the output label is predicted using a neural network whose architecture is denoted $\mathcal{A}$ (a precise definition of the notion of architecture is given later) and whose parameters are $\theta$. We consider in this paper the recurrent modeling scheme where the context $x_1, y_1, \dots, x_{t-1}, y_{t-1}$ is encoded using a latent representation $z_t$, such that the prediction at time $t$ is made by computing $f(z_t, x_t, \theta, \mathcal{A})$, $z_t$ being updated at each timestep such that $z_{t+1} = g(z_t, x_t, \theta, \mathcal{A})$. Note that $f$ and $g$ can share some common computations.

For a particular architecture $\mathcal{A}$, the parameters are learned over a training set of labeled sequences $\{(x^i, y^i)\}_{i \in [1, N]}$, $N$ being the size of the training set, by solving:

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{\#x^i} \Delta\left(f(z_t, x^i_t, \theta, \mathcal{A}),\, y^i_t\right)$$

where $\#x^i$ is the length of sequence $x^i$, and $\Delta$ a differentiable loss function. At inference, given a new stream $x$, each label $\hat{y}_t$ is predicted by computing $f(z_t, x_t, \theta, \mathcal{A})$, where $\hat{y}_1, \dots, \hat{y}_{t-1}$ are the predictions of the model at previous timesteps. In that case, the computation cost of each prediction step solely depends on the architecture and is denoted $C(\mathcal{A})$.

Figure 2: SANAS architecture based on cnn-trad-fpool3 Sainath and Parada (2015). Edges between layers are sampled by the model. The highlighted architecture is the base model on which we have added shortcut connections. Conv1 and Conv2 have filter sizes of (20,8) and (10,4). Both have 64 channels and Conv1 has a stride of 3 in the frequency domain. Linear 1,2 and the Classifier have 32, 128 and 12 neurons respectively. Shortcut linears all have 128 neurons to match the dimension of the classifier.

2.2 Stochastic Adaptive Architecture Search: Principles

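We now propose a different setting where the architecture of the model can change at each timestep depending on the context of the prediction $z_t$. At time $t$, in addition to producing a distribution over possible labels, our model also maintains a distribution over possible architectures denoted $P(\mathcal{A}_t \mid z_t, \theta)$. The prediction $\hat{y}_t$ is now made following $f(z_t, x_t, \theta, \mathcal{A}_t)$ ($\hat{y}_t$ is usually a distribution over possible labels) and the context update is $z_{t+1} = g(z_t, x_t, \theta, \mathcal{A}_t)$. In that case, the cost of a prediction at time $t$ is now $C(\mathcal{A}_t)$, which also includes the computation of the architecture distribution $P(\mathcal{A}_t \mid z_t, \theta)$. It is important to note that, since the architecture is chosen by the model, it has the possibility to learn to control this cost itself. A budgeted learning problem can thus be defined as minimizing a trade-off between prediction loss and average cost. Considering a labeled sequence $(x, y)$, this trade-off is defined as:

$$\mathcal{L}(x, y, \theta) = \mathbb{E}_{\{\mathcal{A}_t\}} \left[ \sum_{t=1}^{\#x} \Big( \Delta\left(f(z_t, x_t, \theta, \mathcal{A}_t),\, y_t\right) + \lambda\, C(\mathcal{A}_t) \Big) \right]$$

where $\mathcal{A}_1, \dots, \mathcal{A}_{\#x}$ are sampled following $P(\mathcal{A}_t \mid z_t, \theta)$ and $\lambda$ controls the trade-off between cost and prediction efficiency. Considering that $P(\mathcal{A}_t \mid z_t, \theta)$ is differentiable, and following the derivation schema proposed in Denoyer and Gallinari (2014) or Veniat and Denoyer (2018), this cost can be minimized using the Monte-Carlo estimation of the gradient. Given one sample of architectures $\mathcal{A}_1, \dots, \mathcal{A}_{\#x}$, the gradient can be approximated by:

$$\nabla_\theta \mathcal{L}(x, y, \theta) \approx \left( \sum_{t=1}^{\#x} \nabla_\theta \log P(\mathcal{A}_t \mid z_t, \theta) \right) \mathcal{L}(x, y, \mathcal{A}_1, \dots, \mathcal{A}_{\#x}, \theta) + \sum_{t=1}^{\#x} \nabla_\theta \Delta\left(f(z_t, x_t, \theta, \mathcal{A}_t),\, y_t\right)$$

where $\mathcal{L}(x, y, \mathcal{A}_1, \dots, \mathcal{A}_{\#x}, \theta) = \sum_{t=1}^{\#x} \big( \Delta(f(z_t, x_t, \theta, \mathcal{A}_t), y_t) + \lambda\, C(\mathcal{A}_t) \big)$ is the loss-plus-cost sum evaluated for the sampled architectures.

In practice, a variance correcting value is used in this gradient formulation to accelerate the learning, as explained in Williams (1992); Wierstra et al. (2007).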
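As a rough illustration of the Monte-Carlo gradient with a variance correcting value, the sketch below trains the probability of a single Bernoulli-sampled edge on a toy problem. Everything in it is an assumption made for illustration: the function name `train_edge_probability`, the per-architecture loss and cost values, and the exponential moving-average baseline (one common form of variance correction, not necessarily the one used by the authors). Because the toy loss has no parameters of its own, only the log-probability term of the gradient appears.

```python
import math
import random


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def train_edge_probability(lam, steps=2000, lr=0.5, seed=0):
    """Learn P(edge active) by a score-function (REINFORCE-style) update.

    The architecture a is 1 (larger network) or 0 (smaller network).
    The loss and cost values below are hypothetical toy numbers.
    """
    rng = random.Random(seed)
    loss = {0: 0.5, 1: 0.1}   # prediction loss of each architecture (toy)
    cost = {0: 0.0, 1: 1.0}   # computation cost of each architecture (toy)
    w = 0.0                   # logit of the Bernoulli probability
    baseline = 0.0            # moving average used as variance correction
    for _ in range(steps):
        p = sigmoid(w)
        a = 1 if rng.random() < p else 0          # sample an architecture
        total = loss[a] + lam * cost[a]           # loss + lambda * cost
        grad_log_p = a - p                        # d/dw log P(a) for a sigmoid-Bernoulli
        w -= lr * grad_log_p * (total - baseline) # gradient descent step
        baseline += 0.1 * (total - baseline)      # update the moving average
    return sigmoid(w)


if __name__ == "__main__":
    # A small lambda favors the larger architecture, a large lambda the smaller one.
    print(train_edge_probability(lam=0.2))
    print(train_edge_probability(lam=1.0))
```

With these toy numbers, the larger architecture has the lower total (0.3 against 0.5) when `lam=0.2`, and the higher total (1.1 against 0.5) when `lam=1.0`, so the learned probability moves toward 1 in the first case and toward 0 in the second.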

2.3 The SANAS Model

We now instantiate the previous generic principles in a concrete model where the architecture search is cast into a sub-graph discovery in a large graph representing the search space called Super-Network as in Veniat and Denoyer (2018).

NAS with Super-Networks (static case): A Super-Network is a directed acyclic graph of layers $L = \{l_1, \dots, l_n\}$, of edges $E \in \{0, 1\}^{n \times n}$, and where each existing edge connecting layers $i$ and $j$ ($e_{i,j} = 1$) is associated with a (small) neural network $f_{i,j}$. The layer $l_1$ is the input layer while $l_n$ is the output layer. The inference of the output is made by propagating the input $x$ over the edges, and by summing, at each layer level, the values coming from incoming edges. Given a Super-Network, the architecture search can be made by defining a distribution matrix $\Gamma \in [0, 1]^{n \times n}$ that can be used to sample edges in the network using a Bernoulli distribution. Indeed, let us consider a binary matrix $H$ sampled following $\Gamma$: the matrix $E \circ H$ defines a sub-graph of $E$ and corresponds to a particular neural-network architecture whose size is smaller than that of $E$ ($\circ$ being the Hadamard product). Learning $\Gamma$ thus results in doing architecture search in the space of all the possible neural networks contained in the Super-Network. At inference, the architecture with the highest probability is chosen.
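The static Super-Network mechanism can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: layers are indexed in topological order (so the edge matrix is strictly upper triangular), each edge network is replaced by a hypothetical scalar weight, and the helper names `sample_architecture`, `most_probable_architecture` and `forward` are invented for the example. Taking each edge as active when its probability exceeds 0.5 gives the most probable architecture only if edges are sampled independently, which is assumed here.

```python
import random


def sample_architecture(E, Gamma, rng):
    """Sample H ~ Bernoulli(Gamma) and return the masked edge matrix E∘H."""
    n = len(E)
    H = [[1 if rng.random() < Gamma[i][j] else 0 for j in range(n)]
         for i in range(n)]
    return [[E[i][j] * H[i][j] for j in range(n)] for i in range(n)]


def most_probable_architecture(E, Gamma):
    """Inference-time choice: keep an edge when its probability exceeds 0.5."""
    n = len(E)
    return [[E[i][j] * (1 if Gamma[i][j] > 0.5 else 0) for j in range(n)]
            for i in range(n)]


def forward(x, mask, weights):
    """Propagate x through the active edges, summing incoming values per layer.

    Each edge network is stood in for by multiplication with weights[i][j].
    """
    n = len(mask)
    values = [0.0] * n
    values[0] = x
    for j in range(1, n):
        values[j] = sum(weights[i][j] * values[i]
                        for i in range(j) if mask[i][j])
    return values[-1]


# Three layers: input -> hidden -> output, plus an input -> output shortcut.
E = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
weights = [[0.0, 2.0, 5.0],
           [0.0, 0.0, 3.0],
           [0.0, 0.0, 0.0]]
Gamma = [[0.0, 0.2, 0.9],
         [0.0, 0.0, 0.2],
         [0.0, 0.0, 0.0]]

if __name__ == "__main__":
    mask = most_probable_architecture(E, Gamma)
    print(mask, forward(1.0, mask, weights))
    print(sample_architecture(E, Gamma, random.Random(0)))
```

In this toy, the most probable architecture keeps only the shortcut edge, so the output for `x = 1.0` is 5.0, whereas the full Super-Network would give 3 × 2 + 5 = 11.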

SANAS with Super-Networks: Based on the previously described principle, our method uses an RNN to generate the architecture distribution at each timestep (see Figure 1). Concretely, at time $t$, a distribution over possible sub-graphs $\Gamma_t = h(z_t, \theta)$ is computed from the context $z_t$. This distribution is then used to sample a particular sub-graph represented by $H_t \sim B(\Gamma_t)$, $B$ being a Bernoulli distribution. This particular sub-graph $E \circ H_t$ (the edge matrix $E$ of the Super-Network masked elementwise by $H_t$) corresponds to the architecture $\mathcal{A}_t$ used at time $t$. Then the prediction $\hat{y}_t$ and the next state $z_{t+1}$ are computed using the functions $f(z_t, x_t, \theta, E \circ H_t)$ and $g(z_t, \Phi(x_t, \theta, E \circ H_t), \theta)$ respectively, where $g$ is a classical RNN operator, for instance a Gated Recurrent Unit cell Cho et al. (2014), and $\Phi(x_t, \theta, E \circ H_t)$ is a feature vector used to update the latent state and computed using the sampled architecture $\mathcal{A}_t$. The learning of the parameters of the proposed model relies on a gradient-descent method based on the approximation of the gradient provided previously, which simultaneously updates the parameters $\theta$ and the conditional distribution over possible architectures.
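The per-timestep loop can be sketched as follows. This is a deliberately simplified toy: the GRU is replaced by a `tanh` update, the feature vector by a scalar, the prediction step is omitted, and the names `architecture_distribution`, `choose_edges`, `run_stream`, `W`, `b` and `edge_costs` are all hypothetical. It only aims to show how the architecture distribution is produced from the hidden state (a linear map followed by a sigmoid), how edges are sampled during training or thresholded at inference, and how the per-frame cost then varies along the stream.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def architecture_distribution(z, W, b):
    """Gamma_t = sigmoid(W z_t + b): one Bernoulli probability per edge."""
    return [sigmoid(sum(w * zi for w, zi in zip(row, z)) + bk)
            for row, bk in zip(W, b)]


def choose_edges(gamma, rng=None):
    """Sample H_t ~ Bernoulli(Gamma_t) if rng is given (training);
    otherwise keep the most probable value of each edge (inference)."""
    if rng is None:
        return [1 if g > 0.5 else 0 for g in gamma]
    return [1 if rng.random() < g else 0 for g in gamma]


def run_stream(frames, W, b, edge_costs, rng=None):
    """Process a stream of scalar frames and return the cost paid per frame."""
    z = [0.0] * len(W[0])                 # initial hidden state
    costs = []
    for x in frames:
        gamma = architecture_distribution(z, W, b)
        H = choose_edges(gamma, rng)
        phi = x * sum(H)                  # toy stand-in for the feature vector
        z = [math.tanh(0.5 * zi + phi) for zi in z]  # toy stand-in for the GRU
        costs.append(sum(c for c, h in zip(edge_costs, H) if h))
    return costs


if __name__ == "__main__":
    # Edge 0 is almost always active; edge 1 only activates once the
    # hidden state has registered a non-silent frame.
    W = [[0.0], [10.0]]
    b = [5.0, -2.0]
    print(run_stream([0.0, 0.0, 1.0, 0.0], W, b, edge_costs=[1.0, 4.0]))
```

Since the distribution at time $t$ is computed from the hidden state, which summarizes the frames seen so far, the more expensive edge in this toy is only switched on at the step following the non-silent frame.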

Figure 3: Example of labeling using the method presented in Section 3. To build the dataset, a background noise (red) is mixed with randomly located words (green). The signal is then split into 1s frames every 200ms. When a frame contains at least 50% of a word signal, it is labeled with the corresponding word (frames B and C; frame A is labeled as bg-noise). Note that this labeling can be imperfect (see frames A and C).

3 Experiments

We train and evaluate our model using the Speech Commands dataset Warden (2018). It is composed of 65000 short audio clips of 30 common words. As done in Tang and Lin (2018); Tang et al. (2018); Zhang et al. (2017), we treat this problem as a classification task with 12 categories: ’yes’, ’no’, ’up’, ’down’, ’left’, ’right’, ’on’, ’off’, ’stop’, ’go’, ’bg-noise’ for background noise and ’unknown’ for the remaining words.

Instead of directly classifying 1 second samples, we use this dataset to generate between 1 and 3 second long audio files by combining a background noise coming from the dataset with a randomly located word (see Figure 3), the signal-to-noise ratio being randomly sampled with a minimum of 5dB. We thus obtain a dataset of about 30,000 small files (tools for building this dataset are available with the open-source implementation), and then split this dataset into train, validation and test sets using an 80:10:10 ratio. The sequence of frames is created by taking overlapping windows of 1 second every 200ms. The input features for each window are computed by extracting 40 mel-frequency cepstral coefficients (MFCC) on 30ms frames every 10ms and stacking them to create 2D time/frequency maps. For the evaluation, we use both the prediction accuracy and the number of operations per frame (FLOPs). The model selection is made by training multiple models, selecting the best models on the validation set, and computing their performance on the test set. Note that the ’best models’ in terms of both accuracy and FLOPs are the models located on the Pareto front of the accuracy/cost validation curve, as done for instance in Contardo et al. (2016). These models are also evaluated using the matched, correct, wrong and false alarm (FA) metrics as proposed in Warden (2018) and computed over the one hour stream provided with the original dataset. Note that these last metrics are computed after using a post-processing method that ensures a labeling consistency, as described in the reference paper.
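The windowing and labeling scheme described above can be sketched as follows. This is a minimal sketch under assumptions, not the released tooling: the function names `frame_labels` and `feature_map_shape` are hypothetical, the 50% rule is read as "at least half of the word's duration falls inside the window", only one word per file is handled, and the feature-map size assumes no padding at the window edges.

```python
def frame_labels(duration_ms, word_start_ms, word_end_ms, word,
                 window_ms=1000, hop_ms=200):
    """Label each 1 s window (taken every 200 ms) of an audio file.

    A window gets the word label when it covers at least 50% of the word
    (assumed reading of the labeling rule); otherwise it is 'bg-noise'.
    Assumes word_end_ms > word_start_ms.
    """
    labels = []
    word_len = word_end_ms - word_start_ms
    start = 0
    while start + window_ms <= duration_ms:
        end = start + window_ms
        overlap = max(0, min(end, word_end_ms) - max(start, word_start_ms))
        # Integer comparison avoids floating-point issues at exactly 50%.
        labels.append(word if 2 * overlap >= word_len else "bg-noise")
        start += hop_ms
    return labels


def feature_map_shape(window_ms=1000, frame_ms=30, step_ms=10, n_coeffs=40):
    """Size of the time/frequency map for one window, without padding."""
    n_frames = (window_ms - frame_ms) // step_ms + 1
    return (n_frames, n_coeffs)


if __name__ == "__main__":
    # A 2 s file with the word 'yes' spoken between 0.9 s and 1.5 s.
    print(frame_labels(2000, 900, 1500, "yes"))
    print(feature_map_shape())
```

Under these assumptions, a 2 second file yields six windows, and each window is described by a 98 × 40 map (a library that pads the signal would give a slightly different number of time steps).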

As baseline static model, we use a standard neural network architecture for keyword spotting, the cnn-trad-fpool3 architecture proposed in Sainath and Parada (2015), which consists of two convolutional layers followed by 3 linear layers. We then propose a SANAS extension of this model (see Figure 2) with additional connections that are adaptively activated (or not) during the audio stream processing. In the SANAS models, the recurrent layer is a one-layer GRU Cho et al. (2014) and the function mapping the hidden state to the distribution over architectures is a one-layer linear module followed by a sigmoid activation. The models are learned using the ADAM optimizer Kingma and Ba (2014), with the gradient step size and the trade-off coefficient $\lambda$ each chosen within a range; the range for $\lambda$ is set from the order of magnitude of the cost of the full model. Training time is reasonable and corresponds to about 1 day on a single GPU computer.

Model                                    Match    Correct  Wrong   FA      FLOPs per frame
cnn-trad-fpool3                          81.7%    72.8%    8.9%    0.0%    124.6M
cnn-trad-fpool3 + shortcut connections   82.9%    77.9%    5.0%    0.3%    137.3M
SANAS                                    61.2%    53.8%    7.3%    0.7%    519.2K
SANAS                                    62.0%    57.3%    4.8%    0.1%    22.9M
SANAS                                    86.5%    80.7%    5.8%    0.3%    37.7M
SANAS                                    86.3%    80.6%    5.7%    0.2%    48.8M
SANAS                                    81.7%    76.4%    5.3%    0.1%    94.0M
SANAS                                    81.4%    76.3%    5.2%    0.2%    105.4M
Table 1: Evaluation of models on 1h of audio from Warden (2018) containing words roughly every 3 seconds with different background noises. We use the label post processing and the streaming metrics proposed in Warden (2018) to avoid repeated and noisy detections. Matched % corresponds to the portion of words detected, either correctly (Correct %) or incorrectly (Wrong %). FA is False Alarm.
Figure 4: Cost accuracy curves. Reported results are computed on the test set using models selected by computing the Pareto front over the validation set. Each point represents a model.
Figure 5: Training dynamics. Average cost per output label during training. The network is able to find an architecture that solves the task while sampling notably cheaper architectures when only background noise is present in the frames.

Results obtained by various models are reported in Table 1 for the one-hour test stream, and in Figure 4 for the test evaluation set. At a given level of accuracy, the SANAS approach greatly reduces the number of FLOPs, resulting in a model which is much more power efficient. For example, with an average cost of 37.7M FLOPs per frame, our model matches 86.5% of the words (80.7% correctly and 5.8% wrongly), while the baseline models match 81.7% and 82.9% of the words, with 72.8% and 77.9% correct predictions, at higher budgets of 124.6M and 137.3M FLOPs per frame respectively. Moreover, our model also outperforms both baselines in terms of accuracy, and in terms of the Match and Correct rates of Table 1. This is because the shortcut connections added to the base architecture give our model more expressive power. Note that in our case, over-fitting is avoided by the cost minimization term in the objective function, while it occurs when using the complete architecture with shortcuts (see Figure 4). Figure 5 illustrates the average cost per possible prediction during training. Unsurprisingly, our model automatically ’decides’ to spend less time on frames containing background noise and much more time on frames containing words. Moreover, at convergence, the model also divides its budget differently across the different words, for example spending less time on the yes words that are easy to detect.
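To make the cost/accuracy comparison concrete, the sketch below computes a Pareto front over (FLOPs per frame, Match %) pairs and the FLOPs ratio quoted in the example. It reuses the values of Table 1 purely as an illustration: the paper selects models on the Pareto front of the validation accuracy/cost curve, not on these streaming metrics, and the helper name `pareto_front` is hypothetical.

```python
def pareto_front(models):
    """Return the (cost, score) pairs that no other pair dominates.

    A pair is dominated when another pair has a cost that is no higher and
    a score that is no lower, with at least one of the two strictly better.
    """
    front = []
    for cost, score in models:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for c, s in models
        )
        if not dominated:
            front.append((cost, score))
    return sorted(front)


# (FLOPs per frame in millions, Match %) taken from Table 1.
table_1 = [
    (124.6, 81.7),   # cnn-trad-fpool3
    (137.3, 82.9),   # cnn-trad-fpool3 + shortcut connections
    (0.5192, 61.2),
    (22.9, 62.0),
    (37.7, 86.5),
    (48.8, 86.3),
    (94.0, 81.7),
    (105.4, 81.4),
]

if __name__ == "__main__":
    print(pareto_front(table_1))
    # Cost ratio between the static baseline and the 37.7M adaptive model.
    print(round(124.6 / 37.7, 1))
```

On these numbers, the front consists of the three cheapest adaptive configurations, and the 37.7M configuration uses roughly 3.3 times fewer FLOPs per frame than the 124.6M baseline.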

4 Conclusion

We have proposed a new model for keyword spotting where the recurrent network is able to automatically adapt its size during inference depending on the difficulty of the prediction problem at time $t$. This model is learned end-to-end based on a trade-off between prediction efficiency and computation cost and is able to find solutions that keep a high prediction accuracy while minimizing the average computation cost per timestep. Ongoing research includes using these methods on larger super-networks and investigating other types of budgets like memory footprint or electricity consumption on connected devices.


This work has been funded in part by grant ANR-16-CE23-0016 “PAMELA” and grant ANR-16-CE23-0006 “Deep in France”.