1 Introduction
Spiking Neural Networks (SNNs) are a significant shift from the standard way of operation of Artificial Neural Networks farabet2012comparison . Most of the success of deep learning models in complex pattern recognition tasks is based on neural units that receive, process and transmit analog information. Such Analog Neural Networks (ANNs) diehl2015fast , however, disregard the fact that the biological neurons in the brain (the computing framework by which they are inspired) process binary spike-based information. Driven by this observation, the past few years have witnessed significant progress in the modeling and formulation of training schemes for SNNs as a new computing paradigm that can potentially replace ANNs as the next generation of Neural Networks. In addition to the fact that SNNs are inherently more biologically plausible, they offer the prospect of event-driven hardware operation. Spiking Neurons process input information only on the receipt of incoming binary spike signals. Given a sparsely-distributed input spike train, the hardware overhead (power consumption) of such spike- or event-based hardware would be significantly reduced, since large sections of the network that are not driven by incoming spikes can be power-gated chen1998estimation . However, the vast majority of research on SNNs has been limited to very simple and shallow network architectures on relatively simple digit recognition datasets like MNIST lecun1998gradient , while only a few works report their performance on more complex standard vision datasets like CIFAR-10 krizhevsky2009learning and ImageNet russakovsky2015imagenet . The main reason behind their limited performance stems from the fact that SNNs are a significant shift from the operation of ANNs due to their temporal information processing capability. This has necessitated a rethinking of training mechanisms for SNNs.
2 Related Work
Broadly, there are two main categories for training SNNs: supervised and unsupervised. Although unsupervised learning mechanisms like Spike-Timing Dependent Plasticity (STDP) are attractive for the implementation of low-power on-chip local learning, they are still outperformed by supervised networks on even simple digit recognition platforms like the MNIST dataset diehl2015unsupervised . Driven by this fact, a particular category of supervised SNN learning algorithms attempts to train ANNs using standard training schemes like backpropagation (to leverage the superior performance of standard training techniques for ANNs) and subsequently convert them to event-driven SNNs for network operation diehl2015fast ; cao2015spiking ; zhao2015feedforward ; perez2013mapping . This can be particularly appealing for NN implementations in low-power neuromorphic hardware specialized for SNNs merolla2014million ; akopyan2015truenorth or for interfacing with silicon cochleas or event-driven sensors posch2014retinomorphic ; posch2011qvga . Our work falls in this category and is based on the ANN-SNN conversion scheme proposed by the authors in Ref. diehl2015fast . However, while prior work considers the ANN operation only during the conversion process, we show that considering the actual SNN operation during the conversion step is crucial for achieving minimal loss in classification accuracy. To that effect, we propose a novel weight-normalization technique that ensures that the actual SNN operation is in the loop during the conversion phase. Note that this work tries to exploit neural activation sparsity by converting networks to the spiking domain for power-efficient hardware implementation and is complementary to efforts aimed at exploring sparsity in synaptic connections han2015deep .
3 Main Contributions
The specific contributions of our work are as follows:
(i) As will be explained in later sections, there are various architectural constraints involved in training ANNs that can be converted to SNNs in a near-lossless manner. Hence, it is unclear whether the proposed techniques would scale to larger and deeper architectures for more complicated tasks. We provide proof-of-concept experiments showing that deep SNNs (extending from 16 to 34 layers) can provide competitive accuracies on complex datasets like CIFAR-10 and ImageNet.
(ii) We propose a new ANN-SNN conversion technique that statistically outperforms state-of-the-art techniques. We report a classification error of 8.45% on the CIFAR-10 dataset, which is the best-performing result reported for any SNN to date. For the first time, we report SNN performance on the entire ImageNet 2012 validation set. We achieve a 30.04% top-1 error rate and a 10.99% top-5 error rate for VGG16 architectures.
(iii) We explore Residual Network (ResNet) architectures as a potential pathway to enable deeper SNNs. We present insights and design constraints that are required to ensure ANN-SNN conversion for ResNets. We report a classification error of 12.54% on the CIFAR-10 dataset and a 34.53% top-1 error rate and 13.67% top-5 error rate on the ImageNet validation set. This is the first work that attempts to explore SNNs with residual network architectures.
(iv) We demonstrate that SNN network sparsity significantly increases as the network depth increases. This further motivates the exploration of converting ANNs to SNNs for event-driven operation to reduce compute overhead.
4 Preliminaries
4.1 Input and Output Representation
The main difference between ANN and SNN operation is the notion of time. While ANN inputs are static, SNNs operate based on dynamic binary spiking inputs as a function of time. The neural nodes also receive and transmit binary spike signals in SNNs, unlike in ANNs, where the inputs and outputs of the neural nodes are analog values. In this work, we consider a rate-encoded network operation where the average number of spikes transmitted as input to the network over a large enough time window is approximately proportional to the magnitude of the original ANN inputs (pixel intensity in this case). The duration of the time window is dictated by the desired network performance (for instance, classification accuracy) at the output layer of the network. A Poisson event-generation process is used to produce the input spike train to the network. Every timestep of SNN operation is associated with the generation of a random number whose value is compared against the magnitude of the corresponding input. A spike event is triggered if the generated random number is less than the value of the corresponding pixel intensity. This process ensures that the average number of input spikes in the SNN is proportional to the magnitude of the corresponding ANN inputs and is typically used to simulate an SNN for recognition tasks based on datasets of static images diehl2015fast . Fig. 1 depicts a particular timed snapshot of the input spikes transmitted to the SNN for a particular image from the CIFAR-10 dataset. SNN operation of such networks is “pseudo-simultaneous”, i.e. a particular layer operates immediately on the incoming spikes from the previous layer and does not have to wait for multiple timesteps for information from the previous layer neurons to get accumulated. Given a Poisson-generated spike train being fed to the network, spikes will be produced at the network outputs. Inference is based on the cumulative spike count of neurons at the output layer of the network over a given time-window.
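The rate-encoding and spike-count inference steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the example intensities, and the assumption that inputs are normalized to [0, 1] are ours and do not come from the original implementation.

```python
import random

def poisson_spike_train(intensities, num_steps, rng):
    """Return one binary spike vector per timestep.

    Assumes each intensity is already normalized to the range [0, 1].
    """
    train = []
    for _ in range(num_steps):
        # One uniform random number per input per timestep; a spike is
        # emitted when the number is less than the input magnitude.
        train.append([1 if rng.random() < x else 0 for x in intensities])
    return train

def spike_rates(train):
    """Average number of spikes per timestep for each input."""
    num_steps = len(train)
    return [sum(step[i] for step in train) / num_steps
            for i in range(len(train[0]))]

def classify(output_spike_counts):
    """Inference: index of the output neuron with the largest cumulative count."""
    return max(range(len(output_spike_counts)),
               key=lambda i: output_spike_counts[i])

rng = random.Random(0)
pixels = [0.0, 0.25, 0.5, 1.0]  # hypothetical normalized pixel intensities
rates = spike_rates(poisson_spike_train(pixels, 5000, rng))
```

Over a long enough window the measured rates approach the input intensities, which is the property the conversion scheme relies on.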
4.2 ANN and SNN Neural Operation
ANN to SNN conversion schemes usually consider the Rectified Linear Unit (ReLU) as the ANN neuron activation function. For a neuron receiving inputs $x_i$ through synaptic weights $w_i$, the ReLU neuron output $y$ is given by,
$$y = \max\left(0, \sum_i w_i x_i\right) \tag{1}$$
Although ReLU neurons are typically used in a large number of machine learning tasks at present, the main reason behind their usage for ANN-SNN conversion schemes is that they bear functional equivalence to an Integrate-Fire (IF) Spiking Neuron without any leak and refractory period cao2015spiking ; diehl2015fast . Note that this is a particular type of Spiking Neuron model izhikevich2003simple . Let us consider the ANN inputs $x_i$ encoded in time as a spike train $X_i(t)$, where $\mathbb{E}[X_i(t)] \propto x_i$ (for the rate-encoding network being considered in this work). The IF Spiking Neuron keeps track of its membrane potential, $v_{mem}$, which integrates incoming spikes and generates an output spike whenever the membrane potential crosses a particular threshold $v_{th}$. The membrane potential is reset to zero at the generation of an output spike. All neurons are reset whenever a spike train corresponding to a new image/pattern is presented. The IF Spiking Neuron dynamics as a function of timestep, $t$, can be described by the following equation,
$$v_{mem}(t+1) = v_{mem}(t) + \sum_i w_i X_i(t) \tag{2}$$
Let us first consider the simple case of a neuron being driven by a single input $X(t)$ and a positive synaptic weight $w$. Due to the absence of any leak term in the neural dynamics, it is intuitive to show that the corresponding output spiking rate of the neuron is given by $\mathbb{E}[Y(t)] \propto \mathbb{E}[X(t)]$, where $Y(t)$ is the output spike train, with the proportionality factor being dependent on the ratio of $w$ and $v_{th}$. In the case when the synaptic weight is negative, the output spiking activity of the IF neuron is zero since the neuron is never able to cross the firing potential $v_{th}$, mirroring the functionality of a ReLU. The higher the ratio of the threshold with respect to the weight, the more time is required for the neuron to spike, thereby reducing the neuron spiking rate, $\mathbb{E}[Y(t)]$, or equivalently increasing the time-delay for the neuron to generate a spike. A relatively high firing threshold can cause a huge delay for neurons to generate output spikes. For deep architectures, such a delay can quickly accumulate and cause the network to not produce any spiking outputs for relatively long periods of time. On the other hand, a relatively low threshold causes the SNN to lose any ability to distinguish between different magnitudes of the spike inputs being accumulated to the membrane potential (the term $\sum_i w_i X_i(t)$ in Eq. 2) of the Spiking Neuron, causing it to lose evidence during the membrane potential integration process. This, in turn, results in accuracy degradation of the converted network. Hence, an appropriate choice of the ratio of the neuron threshold to the synaptic weights is essential to ensure minimal loss in classification accuracy during the ANN-SNN conversion process diehl2015fast . Consequently, most of the research work in this field has been concentrated on outlining appropriate algorithms for threshold-balancing, or equivalently, weight-normalizing different layers of a network to achieve near-lossless ANN-SNN conversion.
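To make the single-input case concrete, the following minimal Python sketch simulates a non-leaky IF neuron with reset-to-zero. The function name, the choice of firing when the potential is greater than or equal to the threshold, and the numeric values are illustrative assumptions rather than details taken from the original implementation.

```python
import random

def if_neuron_rate(weight, input_rate, v_th, num_steps, rng):
    """Simulate one non-leaky IF neuron; return output spikes per timestep."""
    v_mem = 0.0
    out_spikes = 0
    for _ in range(num_steps):
        # Rate-encoded binary input spike for this timestep.
        x = 1 if rng.random() < input_rate else 0
        # Integrate the weighted spike input into the membrane potential.
        v_mem += weight * x
        # Assumption: the neuron fires when v_mem reaches or exceeds v_th.
        if v_mem >= v_th:
            out_spikes += 1
            v_mem = 0.0  # reset to zero after an output spike
    return out_spikes / num_steps

rng = random.Random(0)
rate_low_th = if_neuron_rate(0.5, 1.0, 1.0, 1000, rng)     # fires every 2nd step
rate_high_th = if_neuron_rate(0.5, 1.0, 2.0, 1000, rng)    # fires every 4th step
rate_negative = if_neuron_rate(-0.5, 1.0, 1.0, 1000, rng)  # never fires
rate_half_input = if_neuron_rate(0.5, 0.5, 1.0, 10000, rng)
```

Doubling the threshold halves the output rate, halving the input rate roughly halves the output rate, and a negative weight produces no output spikes at all, which is the ReLU-like behaviour described in the text.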
4.3 Architectural Constraints
4.3.1 Bias in Neural Units
Typically, neural units used for ANN-SNN conversion schemes are trained without any bias term diehl2015fast . This is because optimizing the bias term in addition to the spiking neuron threshold expands the parameter space to be explored, thereby making the ANN-SNN conversion process more difficult. The requirement of bias-less neural units also entails that the Batch Normalization technique ioffe2015batch cannot be used as a regularizer during the training process, since it biases the inputs to each layer of the network to ensure each layer is provided with inputs having zero mean. Instead, we use dropout srivastava2014dropout as the regularization technique. This technique simply masks portions of the input to each layer by utilizing samples from a Bernoulli distribution where each input to the layer has a specified probability of being dropped.
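As a small illustration of the Bernoulli masking just described, the sketch below zeroes each input independently with a given drop probability. The function name and values are hypothetical, and the rescaling of surviving activations that many dropout implementations apply during training is deliberately left out.

```python
import random

def dropout(inputs, drop_prob, rng):
    """Zero each input independently with probability drop_prob."""
    # One Bernoulli sample per input decides whether it is masked.
    return [0.0 if rng.random() < drop_prob else x for x in inputs]

rng = random.Random(0)
activations = [1.0] * 10000          # hypothetical layer inputs
masked = dropout(activations, 0.5, rng)
kept_fraction = sum(masked) / len(masked)
```

With a drop probability of 0.5, roughly half of the inputs survive; the masking introduces no additive bias term, which is why it remains compatible with bias-less units.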
4.3.2 Pooling Operation
Deep convolutional neural network architectures typically consist of intermediate pooling layers to reduce the size of the convolution output maps. While various choices exist for performing the pooling mechanism, the two popular choices are either max-pooling (maximum neuron output over the pooling window) or spatial-averaging (two-dimensional average pooling operation over the pooling window). Since the neuron activations are binary in SNNs instead of analog values, performing max-pooling would result in significant information loss for the next layer. Consequently, we consider spatial-averaging as the pooling mechanism in this work diehl2015fast .
5 Deep Convolutional SNN Architectures: VGG
As mentioned previously, our work is based on the proposal outlined by the authors in Ref. diehl2015fast . In order to ensure that a spiking neuron threshold is sufficiently high to distinguish different magnitudes of the spike inputs, a worst-case solution would be to set the threshold of a particular layer to the maximum of the summation of all the positive synaptic weights of neurons in that layer. However, such a “Model-Based Normalization” technique is highly pessimistic since not all the fan-in neurons are expected to fire at every timestep diehl2015fast . In order to circumvent this issue, the authors in Ref. diehl2015fast proposed a “Data-Based Normalization” technique wherein the neuron threshold of a particular layer is set equal to the maximum activation of all ReLUs in the corresponding layer (by passing the entire training set through the trained ANN once after training is completed). Such a “Data-Based” technique performed significantly better than the “Model-Based” algorithm in terms of the final classification accuracy and latency of the converted SNN (three-layered fully connected and convolutional architectures) for a digit recognition problem on the MNIST dataset diehl2015fast . Note that this process is referred to as “weight-normalization” and “threshold-balancing” interchangeably in this text. As mentioned before, the goal of this work is to optimize the ratio of the synaptic weights with respect to the neuron firing threshold, $v_{th}$. Hence, either all the synaptic weights preceding a neural layer are scaled by a normalization factor equal to the maximum neural activation and the threshold is set equal to $1$ (“weight-normalization”), or the threshold $v_{th}$ is set equal to the maximum neuron activation for the corresponding layer with the synaptic weights remaining unchanged (“threshold-balancing”). Both operations are exactly equivalent mathematically.
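The equivalence between the two formulations can be checked numerically. The sketch below is an illustration under our own assumptions (a single neuron, hypothetical weights, a tiny hypothetical "training set", and a hand-written spike train); it computes a data-based maximum ReLU activation and then compares the output spikes of threshold-balancing against those of weight-normalization.

```python
def relu_activation(weights, inputs):
    """ReLU output of a bias-less neuron."""
    return max(0.0, sum(w * x for w, x in zip(weights, inputs)))

def if_spike_train(weights, v_th, spike_inputs):
    """Output spikes of a non-leaky IF neuron with reset-to-zero."""
    v_mem, out = 0.0, []
    for spikes in spike_inputs:
        v_mem += sum(w * s for w, s in zip(weights, spikes))
        if v_mem >= v_th:
            out.append(1)
            v_mem = 0.0
        else:
            out.append(0)
    return out

weights = [0.5, 1.5]                              # hypothetical trained weights
training_inputs = [[1.0, 1.0], [0.5, 0.25]]       # hypothetical ANN inputs
max_act = max(relu_activation(weights, x) for x in training_inputs)

spike_inputs = [[1, 1], [1, 0], [0, 1], [1, 1],
                [0, 0], [1, 0], [1, 0], [0, 1]]

# Threshold-balancing: weights unchanged, threshold = maximum activation.
balanced = if_spike_train(weights, max_act, spike_inputs)
# Weight-normalization: weights divided by maximum activation, threshold = 1.
normalized = if_spike_train([w / max_act for w in weights], 1.0, spike_inputs)
```

Both variants produce the same output spike train because only the ratio of weights to threshold enters the neuron dynamics.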
5.1 Proposed Algorithm: SpikeNorm
However, the above algorithm leads us to the question: Are ANN activations representative of SNN activations? Let us consider a particular example for the case of maximum activation for a single ReLU. The neuron receives two inputs, namely and . Let us consider unity synaptic weights in this scenario. Since the maximum ReLU activation is , the neuron threshold would be set equal to . However, when this network is converted to the SNN mode, both the inputs would be propagating binary spike signals. The ANN input, equal to , would be converted to spikes transmitting at every timestep while the other input would transmit spikes approximately of the duration of a large enough timewindow. Hence, the actual summation of spike inputs received by the neuron per timestep would be for a large number of samples, which is higher than the spiking threshold (). Clearly, some information loss would take place due to the lack of this evidence integration.
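The mismatch between the ANN activation and the per-timestep spike input can be reproduced with a short simulation. The input values 0.5 and 1.0 and the unity weights used here are hypothetical numbers chosen for illustration, as are the function names.

```python
import random

def ann_activation(weights, inputs):
    """ReLU activation of a bias-less neuron on analog inputs."""
    return max(0.0, sum(w * x for w, x in zip(weights, inputs)))

def max_snn_activation(weights, inputs, num_steps, rng):
    """Largest weighted spike input seen in any single timestep."""
    largest = 0.0
    for _ in range(num_steps):
        # Rate-encode each analog input as a binary spike.
        spikes = [1 if rng.random() < x else 0 for x in inputs]
        largest = max(largest, sum(w * s for w, s in zip(weights, spikes)))
    return largest

rng = random.Random(0)
weights = [1.0, 1.0]   # unity synaptic weights
inputs = [0.5, 1.0]    # hypothetical illustrative ANN inputs
ann_max = ann_activation(weights, inputs)
snn_max = max_snn_activation(weights, inputs, 1000, rng)
```

Whenever both inputs spike in the same timestep, the neuron receives a weighted input of 2.0, which exceeds the ANN activation of 1.5 that an ANN-based scheme would have used as the threshold.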
Driven by this observation, we propose a weight-normalization technique that adaptively balances the threshold of each layer by considering the actual operation of the SNN in the loop during the ANN-SNN conversion process. The algorithm normalizes the weights of the network sequentially for each layer. Given a particular trained ANN, the first step is to generate the input Poisson spike train for the network over the training set for a large enough time-window. The Poisson spike train allows us to record the maximum summation of weighted spike-input (the term $\sum_i w_i X_i(t)$ in Eq. 2, hereafter referred to as the maximum SNN activation in this text) that would be received by the first neural layer of the network. In order to minimize the temporal delay of the neuron and simultaneously ensure that the neuron firing threshold is not too low, we weight-normalize the first layer depending on the maximum spike-based input received by the first layer. After the threshold of the first layer is set, we are provided with a representative spike train at the output of the first layer, which enables us to generate the input spike-stream for the next layer. The process is continued sequentially for all the layers of the network. The main difference between our proposal and prior work diehl2015fast is the fact that the proposed weight-normalization scheme accounts for the actual SNN operation during the conversion process. As we will show in the Results section, this scheme is crucial to ensure near-lossless ANN-SNN conversion for significantly deep architectures and for complex recognition problems. The pseudocode of the algorithm is given below.
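A minimal Python sketch of this layer-by-layer procedure follows. It is an illustration under simplifying assumptions and not the authors' listing: layers are fully connected, a single precomputed input spike train stands in for the Poisson trains of the whole training set, every layer is assumed to receive at least one positive weighted input, and the function name and data layout are hypothetical.

```python
def spike_norm(layer_weights, input_spike_train):
    """Return one threshold per layer, set from the maximum SNN activation.

    layer_weights: list of weight matrices, each as [n_out][n_in].
    input_spike_train: list of binary input vectors, one per timestep.
    """
    thresholds = []
    spike_train = input_spike_train
    for weights in layer_weights:
        # Weighted spike input to every neuron of this layer at every timestep.
        drive = [[sum(w * s for w, s in zip(row, spikes)) for row in weights]
                 for spikes in spike_train]
        # Threshold = maximum weighted spike input over neurons and timesteps.
        v_th = max(max(step) for step in drive)
        thresholds.append(v_th)
        # Run the IF layer with that threshold to obtain the spike train
        # that drives the next layer.
        v_mem = [0.0] * len(weights)
        next_train = []
        for step in drive:
            out = []
            for j, current in enumerate(step):
                v_mem[j] += current
                if v_mem[j] >= v_th:
                    out.append(1)
                    v_mem[j] = 0.0
                else:
                    out.append(0)
            next_train.append(out)
        spike_train = next_train
    return thresholds

layers = [[[1.0, 1.0], [0.5, -0.5]],   # hypothetical layer 1: 2 inputs, 2 neurons
          [[0.5, 1.0]]]                # hypothetical layer 2: 2 inputs, 1 neuron
inputs = [[1, 1], [1, 0], [0, 1], [1, 1]]
thresholds = spike_norm(layers, inputs)
```

Each threshold is fixed before the next layer is examined, so later layers are balanced against spike statistics that already reflect the thresholds chosen upstream.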
6 Extension to Residual Architectures
Residual network architectures were proposed as an attempt to scale convolutional neural networks to very deep layered stacks he2016deep . Although different variants of the basic functional unit have been explored, we will only consider identity shortcut connections in this text (shortcut type-A according to the paper he2016deep ). Each unit consists of two parallel paths. The non-identity path consists of two spatial convolution layers with an intermediate ReLU layer. While the original ResNet formulation considers ReLUs at the junction of the parallel non-identity and identity paths he2016deep , recent formulations do not consider junction ReLUs in the network architecture he2016identity . Absence of ReLUs at the junction point of the non-identity and identity paths was observed to produce a slight improvement in classification accuracy on the CIFAR-10 dataset (see http://torch.ch/blog/2016/02/04/resnets.html). Due to the presence of the shortcut connections, important design considerations need to be accounted for to ensure near-lossless ANN-SNN conversion. We start with the basic unit, as shown in Fig. 2(a), and point-wise impose various architectural constraints with justifications.
6.1 ReLUs at each junction point
As we will show in the Results section, application of our proposed SpikeNorm algorithm on such a residual architecture resulted in a converted SNN that exhibited accuracy degradation in comparison to the original trained ANN. We hypothesize that this degradation is attributed mainly to the absence of any ReLUs at the junction points. Each ReLU when converted to an IF Spiking Neuron imposes a particular amount of characteristic temporal delay (time interval between an incoming spike and the outgoing spike due to evidence integration). Due to the shortcut connections, spike information from the initial layers gets instantaneously propagated to later layers. The unbalanced temporal delay in the two parallel paths of the network can result in distortion of the spike information being propagated through the network. Consequently, as shown in Fig. 2(b), we include ReLUs at each junction point to provide a temporal balancing effect to the parallel paths (when converted to IF Spiking Neurons). An ideal solution would be to include a ReLU in the parallel path, but that would destroy the advantage of the identity mapping.
6.2 Same threshold of all fan-in layers
As shown in the next section, direct application of our proposed threshold-balancing scheme still resulted in some amount of accuracy loss in comparison to the baseline ANN accuracy. However, note that the junction neuron layer receives inputs from the previous junction neuron layer as well as the non-identity neuron path. Since the output spiking activity of a particular neuron is also dependent on the threshold-balancing factor, all the fan-in neuron layers should be threshold-balanced by the same amount to ensure that input spike information to the next layer is rate-encoded appropriately. However, the spiking threshold of the neuron layer in the non-identity path is dependent on the activity of the neuron layer at the previous junction. An observation of the typical threshold-balancing factors for the network without using this constraint (shown in Fig. 2(c)) reveals that the threshold-balancing factors mostly lie around unity after a few initial layers. This occurs mainly due to the identity mapping. The maximum summation of spike inputs received by the neurons in the junction layers is dominated by the identity mapping (close to unity). From this observation, we heuristically set the thresholds of both the non-identity ReLU layer and the identity ReLU layer equal to $1$. However, the accuracy is still unable to approach the baseline ANN accuracy, which leads us to the third design constraint.
6.3 Initial Non-Residual Pre-Processing Layers
An observation of Fig. 2(c) reveals that the threshold-balancing factors of the initial junction neuron layers are significantly higher than unity. This can be a primary reason for the degradation in classification accuracy of the converted SNN. We note that the residual architectures used by the authors in Ref. he2016deep use an initial convolution layer with a very wide receptive field ($7 \times 7$ with a stride of $2$) on the ImageNet dataset. The main motive behind such an architecture was to show the impact of increasing depth in their residual architectures on the classification accuracy. Inspired by the VGG architecture, we replace the first convolutional layer by a series of three convolutions where the first two layers do not exhibit any shortcut connections. Addition of such initial non-residual pre-processing layers allows us to apply our proposed threshold-balancing scheme in the initial layers while using a unity threshold-balancing factor for the later residual layers. As shown in the Results section, this scheme significantly assists in achieving classification accuracies close to the baseline ANN accuracy since, after the initial layers, the maximum neuron activations decay to values close to unity because of the identity mapping.
7 Experiments
7.1 Datasets and Implementation
We evaluate our proposals on standard visual object recognition benchmarks, namely the CIFAR-10 and ImageNet datasets. Networks for the CIFAR-10 dataset are trained on the training set images with the per-pixel mean subtracted and evaluated on the testing set. We also present results on the much more complex ImageNet 2012 dataset that contains 1.28 million training images and report evaluation (top-1 and top-5 error rates) on the validation set. crops from the input images are used for this experiment.
We use the VGG16 architecture simonyan2014very for both datasets. The ResNet20 configuration outlined in Ref. he2016deep is used for the CIFAR-10 dataset while ResNet34 is used for experiments on the ImageNet dataset. As mentioned previously, we do not utilize any batch-normalization layers. For VGG networks, a dropout layer is used after every ReLU layer except for those layers which are followed by a pooling layer. For Residual networks, we use dropout only for the ReLUs in the non-identity parallel paths but not at the junction layers. We found this crucial for achieving training convergence.
Our implementation is derived from the publicly available Facebook ResNet implementation code for the CIFAR and ImageNet datasets (https://github.com/facebook/fb.resnet.torch). We use similar image pre-processing steps and scale and aspect-ratio augmentation techniques as used in szegedy2015going . We report single-crop testing results, while the error rates can be further reduced with 10-crop testing krizhevsky2012imagenet . Networks used for the CIFAR-10 dataset are trained on GPUs with a batch-size of for epochs, while ImageNet training is performed on GPUs for epochs with a similar batch-size. The initial learning rate is . The learning rate is divided by twice, at and epochs for the CIFAR-10 dataset and at and epochs for the ImageNet dataset. A weight decay of and a momentum of are used for all the experiments. Proper weight initialization is crucial to achieve convergence in such deep networks without batch-normalization. For a non-residual convolutional layer (for both VGG and ResNet architectures) having kernel size with output channels, the weights are initialized from a normal distribution and standard deviation . However, for residual convolutional layers, the standard deviation used for the normal distribution was . We observed this to be important for achieving training convergence, and a similar observation was also outlined in Ref. hardt2016identity , although their networks were trained without both dropout and batch-normalization.
7.2 Experiments for VGG Architectures
Our VGG16 model architecture follows the implementation outlined in https://github.com/szagoruyko/cifar.torch except that we do not utilize the batch-normalization layers. We used a randomly chosen mini-batch of size 256 from the training set for the weight-normalization process on the CIFAR-10 dataset. While the entire training set can be used for the weight-normalization process, using a representative subset did not impact the results. We confirmed this by running multiple independent runs for both the CIFAR and ImageNet datasets. The standard deviation of the final classification error rate after timesteps was . All results reported in this section represent the average of 5 independent runs of the spiking network (since the input to the network is a random process). No notable difference in the classification error rate was observed at the end of timesteps and the network outputs converged to deterministic values despite being driven by stochastic inputs. For the SNN-model based weight-normalization scheme (SpikeNorm algorithm) we used timesteps for each layer sequentially to normalize the weights.
Table 1 summarizes our results for the CIFAR-10 dataset. The baseline ANN error rate on the testing set was 8.3%. Since the main contribution of this work is to minimize the loss in accuracy during conversion from ANN to SNN for deep-layered networks and not to push state-of-the-art results in ANN training, we did not perform any hyper-parameter optimization. However, note that despite several architectural constraints being present in our ANN architecture, we are able to train deep networks that provide competitive classification accuracies using the training mechanisms described in the previous subsection. Further reduction in the baseline ANN error rate is possible by appropriately tuning the learning parameters. For the VGG16 architecture, our implementation of the ANN-model based weight-normalization technique, proposed by Ref. diehl2015fast , yielded an average SNN error rate of leading to an error increment of . The error increment was minimized to 0.15% on applying our proposed SpikeNorm algorithm. Note that we consider a strict model-based weight-normalization scheme to isolate the impact of considering the effect of an ANN versus our SNN model for threshold-balancing. Further optimizations, such as considering the maximum synaptic weight during the weight-normalization process diehl2015fast , are still possible.
Previous works have mainly focused on much shallower convolutional neural network architectures. Although Ref. hunsberger2016training reports results with an accuracy loss of 0.18%, their baseline ANN suffers from some amount of accuracy degradation since their networks are trained with noise (in addition to the architectural constraints mentioned before) to account for neuronal response variability due to incoming spike trains hunsberger2016training . It is also unclear whether the training mechanism with noise would scale up to deeper layered networks. Our work reports the best performance of a Spiking Neural Network on the CIFAR-10 dataset to date.
The impact of our proposed algorithm is much more apparent on the more complex ImageNet dataset. The rates for the top-1 (top-5) error on the ImageNet validation set are summarized in Table 2. Note that these are single-crop results. The accuracy loss during the ANN-SNN conversion process is minimized by a margin of by considering the SNN-model based weight-normalization scheme. It is therefore expected that our proposed SpikeNorm algorithm would perform significantly better than an ANN-model based conversion scheme as the pattern recognition problem becomes more complex, since it accounts for the actual SNN operation during the conversion process. Note that Ref. hunsberger2016training reports a performance of on the first 3072-image test batch of the ImageNet dataset.
At the time we developed this work, we were unaware of a parallel effort to scale up the performance of SNNs to deeper networks and large-scale machine learning tasks. The work was recently published in Ref. rueckauer2017conversion . However, their work differs from our approach in the following aspects:
(i) Their work improves on the prior approach outlined in Ref. diehl2015fast by proposing conversion methods for removing the constraints involved in ANN training (discussed in Section 4.3). We are improving on prior art by scaling up the methodology outlined in Ref. diehl2015fast for ANN-SNN conversion while retaining the constraints.
(ii) We are demonstrating that considering SNN operation in the conversion process helps to minimize the conversion loss. Ref. rueckauer2017conversion uses the ANN-based normalization scheme of Ref. diehl2015fast .
While removing the constraints in ANN training allows the authors in Ref. rueckauer2017conversion to train ANNs with better accuracy, they suffer significant accuracy loss in the conversion process. This occurs due to a non-optimal ratio of biases/batch-normalization factors and weights rueckauer2017conversion . This is the primary reason for our exploration of ANN-SNN conversion without bias and batch-normalization. For instance, their best performing network on the CIFAR-10 dataset incurs a conversion loss of 1.06% in contrast to 0.15% reported by our proposal for a much deeper network. The accuracy loss is much larger for their VGG16 network on the ImageNet dataset  in contrast to for our proposal. Although Ref. rueckauer2017conversion reports a top-1 SNN error rate for an InceptionV3 network, their ANN is trained with an error rate of . The resulting conversion loss is and much higher than our proposals. The InceptionV3 network conversion was also optimized by a voltage clamping method that was found to be specific to the Inception network and did not apply to the VGG network rueckauer2017conversion . Note that the results reported on ImageNet in Ref. rueckauer2017conversion are on a subset of image samples. Hence, the performance on the entire dataset is unclear. Our contribution lies in the fact that we are demonstrating that ANNs can be trained with the above-mentioned constraints with competitive accuracies on large-scale tasks and converted to SNNs in a near-lossless manner.
This is the first work that reports competitive performance of a Spiking Neural Network on the entire ImageNet 2012 validation set.
Network Architecture | ANN Error | SNN Error | Error Increment
4-layered networks cao2015spiking (Input cropped to 24 x 24) |  |  | 
3-layered networks esser2016convolutional |  |  | 
8-layered networks hunsberger2016training (Input cropped to 24 x 24) |  |  | 0.18%
6-layered networks rueckauer2017conversion |  |  | 1.06%
VGG16 (ANN model based conversion) |  |  | 
VGG16 (SPIKENORM) | 8.3% | 8.45% | 0.15%
7.3 Experiments for Residual Architectures
Our residual networks for the CIFAR-10 and ImageNet datasets follow the implementation in Ref. he2016deep . We first attempt to explain our design choices for ResNets by sequentially imposing each constraint on the network and showing their corresponding impact on network performance in Fig. 3. The “Basic Architecture” involves a residual network without any junction ReLUs. “Constraint 1” involves junction ReLUs without having equal spiking thresholds for all fan-in neural layers. “Constraint 2” imposes an equal threshold of unity for all the layers while “Constraint 3” performs best with two pre-processing plain convolutional layers () at the beginning of the network. The baseline ANN ResNet20 was trained with an error of on the CIFAR-10 dataset. Note that although we are using terminology consistent with Ref. he2016deep for the network architectures, our ResNets contain two extra plain pre-processing layers. The converted SNN according to our proposal yielded a classification error rate of . Weight-normalizing the initial two layers using the ANN-model based weight-normalization scheme produced an average error of , further validating the effectiveness of our weight-normalization technique.
On the ImageNet dataset, we use the deeper ResNet34 model outlined in Ref. he2016deep . The initial convolutional layer is replaced by three convolutional layers where the initial two layers are non-residual plain units. The baseline ANN is trained with an error of while the converted SNN error is at the end of timesteps. The results are summarized in Table 3 and convergence plots for all our networks are provided in Fig. 4.
| Network Architecture | ANN Error | SNN Error | Error Increment |
|---|---|---|---|
| 8-layered networks hunsberger2016training (Tested on subset of 3072 images) | | | |
| VGG-16 rueckauer2017conversion (Tested on subset of 2570 images) | | | |
| VGG-16 (ANN model based conversion) | | | |
| VGG-16 (SPIKE-NORM) | 29.48% (10.61%) | 30.04% (10.99%) | 0.56% (0.38%) |
It is worth noting here that the main motivation for exploring Residual Networks is to go deeper in Spiking Neural Networks. We explore relatively simple ResNet architectures, such as the ones used in Ref. he2016deep , which have an order of magnitude fewer parameters than standard VGG architectures. Further hyperparameter optimizations or more complex architectures are still possible. While the accuracy loss in the ANN-SNN conversion process is larger for ResNets than for plain convolutional architectures, further optimizations like including more pre-processing initial layers or better threshold-balancing schemes for the residual units can still be explored. This work serves as the first work to explore ANN-SNN conversion schemes for Residual Networks and attempts to highlight important design constraints required for minimal loss in the conversion process.
7.4 Computation Reduction Due to Sparse Neural Events
ANN operation for prediction of the output class of a particular input requires a single feedforward pass per image. For SNN operation, the network has to be evaluated over a number of timesteps. However, specialized hardware that accounts for the event-driven neural operation and “computes only when required” can potentially exploit such alternative mechanisms of network operation. For instance, Fig. 5 shows the average total number of output spikes produced by neurons in VGG and ResNet architectures as a function of the layer for the ImageNet dataset. A randomly chosen minibatch was used for the averaging process. We used timesteps for accumulating the spike-counts for VGG networks while timesteps were used for ResNet architectures. This is in accordance with the convergence plots shown in Fig. 4. An important insight obtained from Fig. 5 is that neuron spiking activity becomes sparser as the network depth increases. Hence, the benefits from event-driven hardware are expected to increase as the network depth increases. While an estimate of the actual energy consumption reduction for the SNN mode of operation is outside the scope of this work, we provide an intuitive insight by comparing the number of computations per synaptic operation performed in the ANN versus the SNN.
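The layer-wise spike statistics described above can be gathered with a simple time-stepped simulation. The following is a minimal sketch under stated assumptions, not the code used for Fig. 5: the function `average_spike_counts`, the fully connected toy weights, the random binary input spike train, and the reset-to-zero rule after firing are all illustrative choices.

```python
import numpy as np

def average_spike_counts(input_spikes, weights, threshold=1.0):
    """Average cumulative spike count per neuron for each layer.

    input_spikes: (n_steps, n_in) binary array, one row per timestep.
    weights: list of (n_in, n_out) matrices, one per layer (no biases).
    """
    n_steps = input_spikes.shape[0]
    potentials = [np.zeros(w.shape[1]) for w in weights]
    counts = [np.zeros(w.shape[1]) for w in weights]
    for t in range(n_steps):
        spikes = input_spikes[t].astype(float)
        for layer, w in enumerate(weights):
            # Integrate: only weights of inputs that spiked contribute.
            potentials[layer] += spikes @ w
            fired = potentials[layer] >= threshold
            # Assumption: the membrane potential resets to zero after a spike.
            potentials[layer][fired] = 0.0
            counts[layer] += fired
            spikes = fired.astype(float)  # output spikes drive the next layer
    return [float(c.mean()) for c in counts]

rng = np.random.default_rng(0)
n_steps = 100
rates = rng.random(16)                            # hypothetical input firing rates
input_spikes = rng.random((n_steps, 16)) < rates  # random binary spike train
weights = [rng.normal(scale=0.5, size=(16, 12)),
           rng.normal(scale=0.5, size=(12, 8))]
per_layer = average_spike_counts(input_spikes, weights)
```

Averaging `per_layer` over a minibatch of inputs would give one point per layer, which is the kind of quantity plotted against layer index.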
The number of synaptic operations per layer of the network can be easily estimated for an ANN from the architecture of the convolutional and linear layers. For the ANN, a multiply-accumulate (MAC) computation takes place per synaptic operation. On the other hand, specialized SNN hardware would perform an accumulate (AC) computation per synaptic operation only upon the receipt of an incoming spike. Hence, the total number of AC operations occurring in the SNN is given by the sum, over layers, of the product of the average cumulative neural spike count for a particular layer and the corresponding number of synaptic operations. Calculation of this metric reveals that for the VGG network, the ratio of SNN AC operations to ANN MAC operations is while the ratio is for the ResNet (the metric includes only ReLU/IF spiking neuron activations in the network). However, note that a MAC operation involves an order of magnitude more energy consumption than an AC operation. For instance, Ref. han2015learning reports that the energy consumption in a bit floating point MAC operation is while the energy consumption is only for an AC operation in 45nm technology. Hence, the energy consumption reduction for our SNN implementation is expected to be for the VGG network and for the ResNet in comparison to the original ANN implementation.
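The operation-count metric described above reduces to a short calculation. The sketch below is illustrative only: the function names, the per-layer operation counts, the spike counts, and the energy values are hypothetical placeholders, not the figures measured in this work or reported in Ref. han2015learning.

```python
def snn_to_ann_op_ratio(synaptic_ops, avg_spike_counts):
    """Ratio of SNN accumulate (AC) operations to ANN MAC operations.

    synaptic_ops[l]: synaptic operations of layer l (one MAC each in the ANN).
    avg_spike_counts[l]: average cumulative spike count associated with
    layer l over the whole inference window.
    """
    ann_macs = sum(synaptic_ops)
    snn_acs = sum(ops * s for ops, s in zip(synaptic_ops, avg_spike_counts))
    return snn_acs / ann_macs

def snn_to_ann_energy_ratio(op_ratio, e_mac, e_ac):
    """Relative SNN energy, given per-operation energies in the same unit."""
    return op_ratio * e_ac / e_mac

# Hypothetical three-layer network; values chosen only for illustration.
synaptic_ops = [1000, 2000, 500]
avg_spike_counts = [2.0, 1.0, 0.5]
op_ratio = snn_to_ann_op_ratio(synaptic_ops, avg_spike_counts)

# Placeholder energies reflecting only the stated order-of-magnitude gap.
energy_ratio = snn_to_ann_energy_ratio(op_ratio, e_mac=10.0, e_ac=1.0)
reduction = 1.0 / energy_ratio  # factor by which the SNN uses less energy
```

Note that the operation ratio can exceed one when neurons spike more than once on average over the inference window; the energy benefit then comes entirely from the lower cost of an AC relative to a MAC.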
| Dataset | Network Architecture | ANN Error | SNN Error |
|---|---|---|---|
| CIFAR-10 | ResNet-20 | | |
| ImageNet | ResNet-34 | | |
8 Conclusions and Future Work
This work provides evidence that SNNs can exhibit computing power similar to that of their ANN counterparts. This can potentially pave the way for the usage of SNNs in large-scale visual recognition tasks, which can be enabled by low-power neuromorphic hardware. However, there are still open areas of exploration for improving SNN performance. A significant contribution to the present success of deep NNs is attributed to Batch-Normalization ioffe2015batch . While using bias-less neural units constrains us to train networks without Batch-Normalization, algorithmic techniques to implement Spiking Neurons with a bias term should be explored. Further, it is desirable to train ANNs and convert them to SNNs without any accuracy loss. Although the proposed conversion technique attempts to minimize the conversion loss to a large extent, other variants of neural functionalities apart from ReLU-IF Spiking Neurons could potentially be explored to further reduce this gap. Additionally, further optimizations to minimize the accuracy loss in ANN-SNN conversion for ResNet architectures should be explored to scale SNN performance to even deeper architectures.
References
 (1) C. Farabet, R. Paz, J. Pérez-Carrasco, C. Zamarreño-Ramos, A. Linares-Barranco, Y. LeCun, E. Culurciello, T. Serrano-Gotarredona, and B. Linares-Barranco, “Comparison between frame-constrained fix-pixel-value and frame-free spiking-dynamic-pixel ConvNets for visual processing,” Frontiers in neuroscience, vol. 6, 2012.
 (2) P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–8.
 (3) Z. Chen, M. Johnson, L. Wei, and W. Roy, “Estimation of standby leakage power in CMOS circuit considering accurate modeling of transistor stacks,” in Low Power Electronics and Design, 1998. Proceedings. 1998 International Symposium on. IEEE, 1998, pp. 239–244.
 (4) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 (5) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 (6) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 (7) P. U. Diehl and M. Cook, “Unsupervised learning of digit recognition using spike-timing-dependent plasticity,” Frontiers in computational neuroscience, vol. 9, 2015.
 (8) Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.
 (9) B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang, “Feedforward categorization on AER motion events using cortex-like features in a spiking neural network,” IEEE transactions on neural networks and learning systems, vol. 26, no. 9, pp. 1963–1978, 2015.
 (10) J. A. Pérez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen, and B. Linares-Barranco, “Mapping from Frame-Driven to Frame-Free Event-Driven Vision Systems by Low-Rate Rate Coding and Coincidence Processing–Application to Feedforward ConvNets,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 11, pp. 2706–2719, 2013.
 (11) P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
 (12) F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., “TrueNorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
 (13) C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck, “Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output,” Proceedings of the IEEE, vol. 102, no. 10, pp. 1470–1484, 2014.
 (14) C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259–275, 2011.
 (15) S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 (16) E. M. Izhikevich, “Simple model of spiking neurons,” IEEE Transactions on neural networks, vol. 14, no. 6, pp. 1569–1572, 2003.
 (17) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
 (18) N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting.” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
 (19) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 (20) ——, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
 (21) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 (22) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
 (23) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 (24) M. Hardt and T. Ma, “Identity matters in deep learning,” arXiv preprint arXiv:1611.04231, 2016.
 (25) E. Hunsberger and C. Eliasmith, “Training spiking deep networks for neuromorphic hardware,” arXiv preprint arXiv:1611.05141, 2016.
 (26) B. Rueckauer, Y. Hu, I.-A. Lungu, M. Pfeiffer, and S.-C. Liu, “Conversion of continuous-valued deep networks to efficient event-driven networks for image classification,” Frontiers in neuroscience, vol. 11, p. 682, 2017.
 (27) S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch et al., “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences, p. 201604850, 2016.
 (28) S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.