Neuromorphic Acceleration for Approximate Bayesian Inference on Neural Networks via Permanent Dropout

04/29/2019 · Nathan Wycoff et al. · Argonne National Laboratory and Virginia Polytechnic Institute and State University

As neural networks have begun performing increasingly critical tasks for society, ranging from driving cars to identifying candidates for drug development, the value of their ability to perform uncertainty quantification (UQ) in their predictions has risen commensurately. Permanent dropout, a popular method for neural network UQ, involves injecting stochasticity into the inference phase of the model and generating many predictions for each test input. This shifts the computational and energy burden of deep neural networks from the training phase to the inference phase. Recent work has demonstrated near-lossless conversion of classical deep neural networks to their spiking counterparts. We use these results to demonstrate the feasibility of conducting the inference phase with permanent dropout on spiking neural networks, mitigating the technique's computational and energy burden, which is essential for its use at scale or on edge platforms. We demonstrate the proposed approach via the Nengo spiking neural simulator on a combination drug therapy dataset for cancer treatment, where UQ is critical. Our results indicate that the spiking approximation gives a predictive distribution practically indistinguishable from that given by the classical network.

1. Introduction

Deep neural networks (DNNs) are arguably the flagship of the machine learning (ML) revolution, having captured the imagination of the academic research community, industry, and to some extent the public at large because of their widespread empirical successes and captivating connection to human information processing. Although ML culture has historically sported a black-box, predictive-error-driven approach, it is increasingly interested in quantifying the uncertainty of its predictions. Standard, off-the-shelf tools from classical and Bayesian statistics to this end are often too computationally expensive to be of use in problems of even modest scale, a challenge the ML community has risen to meet.

DNNs are increasingly being used for tasks that require quantification of prediction uncertainty. For instance, many autonomous vehicle frameworks are built on convolutional networks (Janai et al., 2017). Also, in the context of reinforcement learning with a DNN value function approximator, understanding model uncertainty is important in order to determine where the agent should next explore (Osband et al., 2016). For camera relocalization, Kendall and Cipolla (Kendall and Cipolla, 2016) avail themselves of the uncertainty obtained from permanent dropout to obtain improvements in challenging indoor and outdoor problems. Recently, Thulasidasan et al. (Thulasidasan et al., 2019) developed a neural net with abstention, where the DNN may decide not to classify an instance if sufficient uncertainty exists. Furthermore, uncertainty quantification (UQ) is critical to many scientific ML applications as well (Baker et al., 2018).

Dropout (Srivastava et al., 2014), an approach wherein individual neurons are randomly turned off (or otherwise perturbed), has been shown to be effective for regularizing DNNs. The same approach applied during inference can approximate a Bayesian treatment of model uncertainty (Gal and Ghahramani, 2016). In particular, it was shown that permanent dropout (called Monte Carlo dropout in the initial article and referred to as permadrop here) networks approximate a form of deep Gaussian processes (Rasmussen and Williams, 2005; Damianou and Lawrence, 2013). Traditionally, the cost of training dominates that of inference (Goodfellow et al., 2016); however, the permadrop strategy reverses this paradigm, since the inference phase must be executed many times, with increasing iterations giving increasing Monte Carlo accuracy. With the utility of UQ in DNNs having been demonstrated via permadrop, the challenge of reducing the concomitant computational and energy costs has become critical and nontrivial.

Spiking neural networks (SNNs) that run on neuromorphic hardware are a promising approach to address the computational and energy concerns of DNNs running on CPUs and GPUs for a class of applications. A recent study (Blouw et al., 2018) using Intel Loihi (Davies et al., 2018) found that Loihi used 23.1 times fewer joules than a CPU (Xeon E5-2630) and 109.1 times fewer joules than a GPU (Quadro K4000) on an audio-processing problem with a two-layer neural net during the inference phase. In this paper, we explore the prospect of offsetting the energy expense of the permadrop procedure in DNNs by converting them to SNNs during the inference phase. To do so, we expand upon the nengo and nengo_extras (Bekolay et al., 2014) packages, which allow conversion of simple DNNs to SNNs, implementing permadrop layers in the Nengo framework and demonstrating the feasibility of the process using the simulator therein.

Figure 1. Architecture of the Combo neural network. Given two drugs, the aim is to predict the percent growth in human-derived cancer cell lines were these two drugs to be applied in a combination therapy.

2. Background and Related Work

This article addresses a topic at the confluence of two threads of research: UQ on DNNs, and spiking conversion of DNNs.

2.1. Permanent Dropout

Dropout (Srivastava et al., 2014) is a method for regularization in DNNs. In its simplest form, it involves randomly turning off neurons, independently with some probability, during each minibatch of training. As originally proposed, the inference phase is unmodified aside from a scaling of the weights of each layer (as there are now more units present than during training). The intuition behind the method is that nodes cannot rely on a particular upstream or downstream neuron to modify their output and must instead pass on information that is more generally useful, as well as being forced to learn redundant representations. As outlined in (Srivastava et al., 2014), dropout may be viewed as approximate model averaging over all networks formed by subsets of the full network architecture. Gal and Ghahramani (Gal and Ghahramani, 2016) showed that keeping dropout active during prediction (permadrop) is an approximation to a fully Bayesian treatment using a connection between neural networks and Gaussian processes. Each forward evaluation gives a random output; many forward evaluations build up an approximate predictive distribution.
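As an illustration of the permadrop mechanism just described, the following sketch draws an approximate predictive distribution from a Keras model by keeping its Dropout layers active at prediction time (training=True). The architecture, layer widths, and drop rate here are placeholders, not the network used in this paper.

```python
import numpy as np
import tensorflow as tf

# Illustrative model; the Dropout layers remain stochastic at prediction
# time whenever the model is called with training=True.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])

def permadrop_predict(model, x, n_samples=100):
    """Draw n_samples stochastic forward passes for the inputs x."""
    draws = [model(x, training=True).numpy() for _ in range(n_samples)]
    return np.stack(draws)  # shape: (n_samples, len(x), 1)

# Summarize the approximate predictive distribution, e.g.:
# samples = permadrop_predict(model, x_test)
# mean, std = samples.mean(axis=0), samples.std(axis=0)
```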

2.2. Spiking Conversion of Classical Neural Networks

While SNNs are more powerful than DNNs in terms of theoretical computational ability (Maass, 1997), their often-discontinuous and computationally expensive nature means that training SNNs has been more challenging in practice than has been training DNNs, an already daunting task and the subject of major research. For this reason, the idea of conducting the training phase on a DNN and finding an SNN with similar behavior is an appealing one. Several approaches have been suggested for converting a DNN to an SNN while minimizing performance loss. Diehl et al. (Diehl et al., 2015) focused on converting DNNs with standard nonlinearities such as the softmax or ReLU functions, which was expanded upon by Rueckauer et al. (Rueckauer et al., 2017) to enable conversion of much more general neural architectures.

Other work (Cao et al., 2015) requires tailoring the DNN to optimize the SNN's performance, the approach we take in this paper. In particular, we follow the technique outlined in (Hunsberger and Eliasmith, 2015), which simply requires using a specific activation function, termed the SoftLIF function.

Given a sufficiently large constant input current $j$ to trigger an action potential, the firing rate of a linear Leaky Integrate and Fire (LIF) neuron is given by (Gerstner and Kistler, 2002)

$$r(j) = \left[\tau_{\mathrm{ref}} + \tau_{RC}\,\log\!\left(1 + \frac{1}{\rho(j-1)}\right)\right]^{-1}, \qquad (1)$$

where $\rho(x) = \max(x, 0)$ and $\tau_{\mathrm{ref}}$ and $\tau_{RC}$ denote the refractory and membrane time constants. Unfortunately, this function is not continuously differentiable, complicating gradient-based optimization methods. To resolve this issue, Hunsberger and Eliasmith (Hunsberger and Eliasmith, 2015) suggest replacing $\rho$ with a smooth approximation given by

$$\rho_{\gamma}(x) = \gamma\,\log\!\left(1 + e^{x/\gamma}\right), \qquad (2)$$

which matches $\max(x, 0)$ exactly as $\gamma \to 0$. Having trained a DNN with the SoftLIF activation, the weights need simply be transferred to an SNN of identical structure.
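For concreteness, here is a small NumPy sketch of the hard LIF rate in Eq. (1) and its SoftLIF smoothing in Eq. (2); the time constants and the smoothing parameter gamma are illustrative defaults rather than values taken from the paper.

```python
import numpy as np

def lif_rate(j, tau_ref=0.002, tau_rc=0.02):
    """Hard LIF firing rate, Eq. (1): zero for inputs below threshold (j <= 1)."""
    j = np.asarray(j, dtype=float)
    rho = np.maximum(j - 1.0, 0.0)
    rate = np.zeros_like(j)
    firing = rho > 0
    rate[firing] = 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / rho[firing]))
    return rate

def softlif_rate(j, tau_ref=0.002, tau_rc=0.02, gamma=0.02):
    """SoftLIF rate, Eq. (2): max(j - 1, 0) replaced by a softplus of width gamma."""
    j = np.asarray(j, dtype=float)
    rho = gamma * np.logaddexp(0.0, (j - 1.0) / gamma)  # smooth, differentiable
    rho = np.maximum(rho, 1e-15)                         # guard against underflow
    return 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / rho))
```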

Figure 2. P-values for KS tests comparing samples of size 100 output by the DNN and the SNN for the first 20 observations. Were the distributions equal, we would expect the p-values to be approximately uniformly distributed.

An SNN may be imbued with permadrop in a manner analogous to DNNs. We used the Nengo framework in simulations; since it did not previously have support for permadrop, we modified the nengo_extras package for our purposes. This task involved simply sampling a drop-mask for each layer during each simulation, that is, a binary vector of length equal to the number of neurons in a particular layer, in which 1's represent "on" neurons, which contribute to the simulation normally, and 0's represent "off" neurons, which do not contribute at all. These vectors were sampled independently from a Bernoulli distribution with some probability of success (i.e., of the neuron being active). A new drop-mask was sampled for each simulation, ultimately giving a distribution of outputs.
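The following is a minimal sketch of the drop-mask sampling described above, using plain NumPy rather than the actual nengo_extras internals; the layer sizes and keep probability are placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_drop_masks(layer_sizes, p_keep=0.8):
    """One Bernoulli(p_keep) mask per layer: 1 marks an active neuron, 0 a dropped one."""
    return [rng.binomial(1, p_keep, size=n) for n in layer_sizes]

# One mask set is drawn per SNN simulation; applying a mask amounts to
# zeroing the contributions of the dropped neurons for that entire run.
masks = sample_drop_masks([1000, 1000, 256])
```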

3. Experimentation

We executed our proposed method on the Combo benchmark of CANDLE, a U.S. Department of Energy Exascale Computing Project activity. The Combo deep neural network aims to predict the effectiveness of two drugs used in combination, given tumor cell features (942 dimensions) as well as a description of each drug (3,820 dimensions); the dataset contains 248,656 observations. The data were obtained from the National Cancer Institute's ALMANAC resource (Holbeck et al., 2017). Network weights were shared for processing each drug of the pair; see Figure 1 for details. In decision-making for cancer treatment, a complete accounting of uncertainty is critical, motivating the need for permadrop. On this benchmark, however, inference is expected to be 7 times more computationally expensive than training, because of UQ, underlining the potential gain from neuromorphic acceleration.
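The sketch below shows one way the Combo architecture could be expressed in Keras. Only the input dimensions and the weight sharing between the two drug branches follow the description above; the layer widths, drop rate, and use of ReLU (in place of the SoftLIF activation used in this paper) are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

cell = layers.Input(shape=(942,), name="cell_features")
drug1 = layers.Input(shape=(3820,), name="drug1_descriptors")
drug2 = layers.Input(shape=(3820,), name="drug2_descriptors")

# Shared drug encoder: the same weights process both drugs of the pair.
drug_encoder = tf.keras.Sequential(
    [layers.Dense(1000, activation="relu"), layers.Dropout(0.2),
     layers.Dense(1000, activation="relu"), layers.Dropout(0.2)],
    name="shared_drug_encoder")

merged = layers.concatenate([cell, drug_encoder(drug1), drug_encoder(drug2)])
hidden = layers.Dropout(0.2)(layers.Dense(1000, activation="relu")(merged))
growth = layers.Dense(1, name="percent_growth")(hidden)  # predicted percent growth

model = tf.keras.Model(inputs=[cell, drug1, drug2], outputs=growth)
```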

To implement our SNN, we used the Nengo framework (Bekolay et al., 2014), a Python-based spiking neuron simulator. Nengo allows conversion of feedforward neural networks implemented in, for instance, Keras (Chollet et al., 2015), into spiking Nengo objects, which may subsequently be simulated on a standard computer or in specialized hardware, such as Loihi. In our experiments, we trained a permadrop DNN using Keras with TensorFlow (Abadi et al., 2015) as a backend.

Figure 3. Output potential for the SNN on one observation. The transition time between states is removed by discarding a 0.2 ms (200-tick) burn-in. The horizontal line gives the DNN output.

While the DNN's output is a scalar quantity giving predicted cell growth in percent, the output of its SNN analogue will be a time-valued quantity. We summarize the output potential over the time period by simply averaging the results, treating the first 0.2 ms as a "burn-in" period and omitting the potential during this time from the average. Figure 3 illustrates that the output potential of the SNN hovers around the output value of the DNN for most of the period on the first record of the Combo dataset. This same behavior is exhibited for all other observations.
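As a sketch of this summary step, assume the simulator's recorded output potential is available as a one-dimensional array and that the burn-in period corresponds to some number of simulation ticks (the names and default below are illustrative).

```python
import numpy as np

def summarize_output(potential, burn_in_ticks=200):
    """Average the recorded output potential after discarding the burn-in period."""
    return float(np.mean(potential[burn_in_ticks:]))
```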

We demonstrate that the distributions of outputs from the permadrop DNN and SNN are indistinguishable when the SNN output is averaged as described above after each dropout sampling. To quantitatively verify this claim, we obtained predictive distributions of 100 forward passes each from both networks for 20 observations and ran a statistical hypothesis test of whether the two samples come from the same distribution. We used the Kolmogorov-Smirnov (KS) test, which measures the infinity-norm difference (that is, the maximum absolute discrepancy) between the empirical cumulative distribution functions of the two samples. Figure 2 gives a histogram of p-values from each pairwise comparison, one per observation, comparing the output distributions of the two neural nets. In general hypothesis testing, under the null distribution the p-value is uniformly distributed on the unit interval (Casella and Berger, 2002); however, since the KS test is asymptotic, we should expect this to hold only approximately in this case. We are satisfied that the KS test p-values generally seem to follow a uniform distribution, indicating that we could not detect a statistical difference between the two samples and implying functional equivalence of the SNN and DNN. (A common criticism of KS tests, and of point-null hypothesis testing in general, is that for large sample sizes even the smallest discrepancy will cause the test to reject the null hypothesis (Wasserstein and Lazar, 2016). We could likely obtain results indicating that the two predictive distributions differ if we were willing to use a much larger sample size, though this would not mean that the distributions are, practically speaking, meaningfully different.) The histogram of the two predictive distributions corresponding to a single observation is shown in Figure 4 for illustration purposes.
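A sketch of the per-observation comparison, assuming hypothetical arrays dnn_draws and snn_draws of shape (n_observations, n_samples) holding the permadrop predictions from the two networks:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_pvalues(dnn_draws, snn_draws):
    """Two-sample KS test per observation; returns one p-value per row."""
    return np.array([ks_2samp(d, s).pvalue
                     for d, s in zip(dnn_draws, snn_draws)])

# Under the null hypothesis of equal distributions, these p-values should
# look roughly uniform on [0, 1], as in Figure 2.
```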

Figure 4. Histograms representing 100 draws from the predictive distributions for each neural network for the training example with the largest KS statistic (i.e., the observation with the most different distributions). Since even this worst case is visually similar, we conclude that the SNN is a good approximation.

4. Conclusions and Perspectives

We showed that permanent dropout for the purpose of approximate Bayesian predictive distribution computation on classical neural networks can be carried out on an SNN without any noticeable loss in distribution quality, opening the door for low-energy UQ via permadrop. We used the open source Nengo framework for simulation, which allows easy transfer of these models to neuromorphic hardware.

In our experiments, we first sampled a dropout mask, then ran an SNN with that mask, repeating this process many times to achieve a distribution of outputs. However, each of these outputs represents an aggregation of SNN potentials over some period of time. It may be possible to conduct the dropout sampling during SNN simulation, such that the network connections are constantly changing in the SNN, and only one forward evaluation is required, even further reducing the computational burden. It is not a priori clear whether the naive approach of simply sampling a different dropout mask at each iteration would match permadrop exactly or what modifications may be necessary. We leave investigations of such an approach to future work.

All this work was conducted on a simulator. A complete proof of concept would involve actual neuromorphic hardware and energy comparisons with standard DNNs run on standard hardware such as CPUs, GPUs, or TPUs.

Acknowledgements.
N. Wycoff acknowledges funding from DOE LAB 17-1697 via a subaward from Argonne National Laboratory for SciDAC/DOE Office of Science ASCR and High Energy Physics. This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.

References