1 Introduction
Over the last few years, deep learning has made tremendous progress and has become a prevalent tool for coping with various cognitive tasks such as object detection, speech recognition and reasoning. Various deep learning techniques
[22, 44, 16] enable the effective optimization of deep ANNs by constructing multiple levels of feature hierarchies and show remarkable results, which occasionally outperform human-level performance [21, 13, 40]. To that effect, deploying deep learning is becoming necessary not only on large-scale computers but also on edge devices (e.g. phone, tablet, smartwatch, robot). However, the ever-growing complexity of state-of-the-art deep neural networks, together with the explosion in the amount of data to be processed, places significant energy demands on current computing platforms. For example, a deep ANN model requires an unprecedented amount of computing hardware resources, often demanding the computing power of cloud servers and a significant amount of time to train.

Spiking Neural Networks (SNNs) are the leading candidates for overcoming these constraints of neural computing and for efficiently harnessing machine learning algorithms in real-life (or mobile) applications
[29, 5]. The concepts of SNNs, often regarded as the third generation of neural networks [28], are inspired by the biologically plausible Leaky-Integrate-and-Fire (LIF) spiking neuron model [6], which can efficiently process spatio-temporal information. The LIF neuron model is characterized by an internal state, called the membrane potential, that integrates the inputs over time and generates an output spike (a Dirac delta pulse) whenever it reaches the neuronal firing threshold. This mechanism enables event-driven and asynchronous computation across the layers of a spiking system, making it naturally suitable for ultra-low-power and low-latency operation. Furthermore, recent works [39, 36] have shown that these properties make SNNs significantly more attractive for deeper networks in hardware implementations: the spike signals become significantly sparser as the layers go deeper, so the number of required computations reduces accordingly. In this context, several training strategies can be applied to take full advantage of SNNs.

The training strategies for SNNs can be broadly categorized in two ways: ANN-to-SNN conversion and direct spike-based training. First, there are studies that have successfully deployed the ANN-to-SNN conversion technique, which transforms an offline-trained ANN into an SNN for efficient event-driven inference [4, 8, 15, 39, 36]
. The main objective of the ANN-to-SNN conversion scheme is to leverage state-of-the-art ANN training techniques so that the transformed networks can mimic the competitive classification performance of the ANNs. For instance, specialized SNN hardware (such as SpiNNaker
[10] and IBM TrueNorth [29]) has exhibited greatly improved power efficiency as well as state-of-the-art inference performance. However, it takes a large number of timesteps (high latency) to resemble the input-output mapping of the pre-trained ANN counterpart. This is because an ANN (ReLU) neuron can only be replaced with an Integrate-and-Fire (IF) spiking neuron, which cannot effectively capture the temporal dynamics of spatio-temporal event-driven information. On the other hand, it remains difficult to directly train a deep spiking neural network using input spike events and a spike-based learning algorithm, mainly because of the non-differentiable activation and discontinuous nature of spike signals. To that effect, the unsupervised Spike-Timing-Dependent-Plasticity (STDP) learning algorithm has been explored for training two-layer SNNs (consisting of input and output layers) by considering the local correlations of pre- and post-neuronal spike timing. An STDP-trained two-layer network (consisting of 6400 output neurons) has been shown to achieve 95% classification accuracy on the MNIST dataset. However, a shallow network structure limits the expressive power of a neural network
[7, 50, 3, 42, 43] and suffers from scalability issues, as the classification performance easily saturates. Layer-wise STDP learning [18, 24] has shown the capability of efficient feature extraction on multi-layer convolutional SNNs. Nevertheless, the performance gaps compared to ANN models (trained with the standard BP algorithm) remain significantly large. The unsatisfactory classification performance of unsupervised local learning necessitates a spike-based supervised learning rule such as the gradient descent backpropagation (BP) algorithm
[37]. In the context of SNNs, the spike-based BP learning algorithms introduced in [2, 25] treated the membrane potential as a differentiable activation of the spiking neuron to train the synaptic weights. [34] applied BP-based supervised training for the classifier after training the feature extractor layer by layer using an autoencoder mechanism. By leveraging the best of both unsupervised and supervised learning,
[23] showed that layer-wise STDP learning combined with spike-based BP has a synergistic effect, improving robustness and generalization ability as well as accelerating training. In this paper, we take these prior works forward to effectively train very deep SNNs using end-to-end spike-based gradient descent backpropagation learning.

The main contributions of our work are as follows. First, we develop a spike-based supervised gradient descent BP algorithm that exploits a differentiable approximated activation function of the LIF neuron. In addition, we leverage the key ideas of successful deep ANN models such as LeNet5
[22], VGG [41] and ResNet [13] to efficiently construct state-of-the-art deep SNN architectures. We also adapt the dropout [44] technique to better regularize deep SNN training. Next, we demonstrate the effectiveness of our methodology for visual recognition tasks on standard character and object datasets (MNIST, SVHN, CIFAR-10) and a neuromorphic dataset (N-MNIST). To the best of our knowledge, this work achieves the best classification accuracy on the MNIST, SVHN and CIFAR-10 datasets among deep SNNs trained directly with spikes. Lastly, we quantify and analyze the advantages of the event-driven BP algorithm compared to ANN-to-SNN conversion techniques in terms of inference time and energy consumption.

The rest of the paper is organized as follows. In section 2.1, we provide the background on the fundamental components and architectures of deep convolutional SNNs. In section 2.2.1, we detail the spike-based gradient descent backpropagation learning algorithm. Subsequently, in section 2.2.2, we describe the spiking version of the dropout technique used in this work. In sections 3.1-3.2, we describe the experiments and report the simulation results, which validate the efficacy of spike-based BP training on the MNIST, SVHN, CIFAR-10 and N-MNIST datasets. In section 4.1, we discuss the proposed algorithm in comparison to relevant works. In sections 4.2-4.4, we analyze the spike activity, inference speedup and complexity reduction of direct-spike trained SNNs and ANN-SNN converted networks. Finally, we summarize and conclude the paper in section 5.
2 Materials and methods
2.1 Components and Architecture of Spiking Neural Networks
2.1.1 Spiking Neural Network Components
The Leaky-Integrate-and-Fire (LIF) neurons [6] and plastic synapses are the fundamental and biologically plausible computational elements for emulating the dynamics of SNNs. The neurons in adjacent layers are massively interconnected via plastic synapses, whereas no connections exist within a layer. The spike input signals always move in one direction: from the input layer, through the hidden layers, to the output layer. The dynamics of a LIF spiking neuron can be formulated as:

$$\tau_m \frac{dV_j(t)}{dt} = -V_j(t) + \sum_{i=1}^{n^l} w_{ij}\,\theta_i(t) \qquad (1)$$

where $V_j$ denotes the post-neuronal membrane potential, $\tau_m$ is the time constant for membrane potential decay, $n^l$ indicates the number of pre-neurons, $w_{ij}$ is the synaptic weight connecting the $i^{th}$ pre-neuron to the $j^{th}$ post-neuron and $\theta_i(t)$ denotes a spike event from the $i^{th}$ pre-neuron at time $t$. The operation of a LIF neuron is presented in figure 1. The impact of each pre-spike, $\theta_i(t)$, is modulated by the corresponding synaptic weight ($w_{ij}$) to generate the current influx flowing into the post-neuron in the next layer. The stimulus fed as current influx is integrated in the post-neuronal membrane potential ($V_j$), which leaks exponentially over time. The decay constant ($\tau_m$) decides the degree of membrane leakage over time; a smaller value of $\tau_m$ indicates stronger membrane potential decay. When the accumulated membrane potential reaches or exceeds the neuronal firing threshold ($V_{th}$), the corresponding neuron generates a post-spike to its fan-out synapses and resets its membrane potential to the initial value (zero). In table 1, we list the notations used in equations (1)-(14). A minimal discrete-time simulation of equation (1) is sketched after the table.
Notations  Meaning
$x$  Sum of spike events throughout the time
$w$  Synaptic weight
$V_{mem}$  Membrane potential
$V_{th}$  Neuronal firing threshold
$net$  Total (incoming) current influx throughout the time
$a$  Activation of spiking neuron
$E$  Loss function
$\delta$  Error gradient
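As a concrete illustration of equation (1), the following minimal sketch simulates one fully-connected layer of LIF neurons in discrete time. The function name, the unit-timestep discretization of the leak (a per-step factor of $e^{-1/\tau_m}$) and the array shapes are illustrative assumptions, not the simulation framework used in this work.

```python
import numpy as np

def lif_layer_forward(spikes_in, weights, v_th=1.0, tau_m=100.0):
    """Simulate one fully-connected layer of LIF neurons (equation (1)).

    spikes_in : (T, n_pre) binary array of pre-neuronal spike events
    weights   : (n_pre, n_post) synaptic weight matrix
    returns   : (T, n_post) binary array of post-neuronal spikes
    """
    T = spikes_in.shape[0]
    n_post = weights.shape[1]
    v_mem = np.zeros(n_post)                  # membrane potentials
    spikes_out = np.zeros((T, n_post))
    leak = np.exp(-1.0 / tau_m)               # per-timestep decay factor
    for t in range(T):
        v_mem = leak * v_mem + spikes_in[t] @ weights  # leak, then integrate
        fired = v_mem >= v_th                          # threshold comparison
        spikes_out[t] = fired.astype(float)
        v_mem[fired] = 0.0                             # reset to zero after firing
    return spikes_out
```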
2.1.2 Deep Convolutional Spiking Neural Network
Building Blocks
In this work, we develop a training methodology for convolutional SNN models that consist of an input layer followed by intermediate hidden layers and a final output layer. In the input layer, the pixel images are encoded as Poisson-distributed spike trains, where the probability of spike generation is proportional to the pixel intensity. The hidden layers consist of multiple convolutional (C) and spatial-pooling (P) layers, often arranged in an alternating manner, which represent the intermediate stages of the feature extractor. The spikes from the feature extractor are combined to generate a one-dimensional vector input for the fully-connected (FC) layers, which produce the final classification. The convolutional and fully-connected layers contain trainable parameters (i.e. synaptic weights), while the spatial-pooling layers are fixed
a priori. Through the training procedure, the weight kernels in the convolutional layers learn to encode feature representations of the input patterns at multiple hierarchical levels. Hence, through the convolution operation, the trained kernels can detect spatially correlated local features in the input patterns. This inherently allows the network to be invariant to translation (shift) in the object location. A convolutional layer is often followed by a spatial-pooling layer, which downscales the dimensions of the feature maps produced by the previous convolutional layer while retaining the spatial correlation between neighboring pixels in every feature map.

There are various choices for performing the spatial-pooling operation in the ANN domain, the two major ones being max-pooling (maximum neuron output over the pooling window) and average-pooling (two-dimensional average over the pooling window). In most state-of-the-art deep ANNs, max-pooling is the most popular option. However, since neuron activations in SNNs are binary rather than analog, max-pooling does not provide useful information to the following layer. Therefore, we use an averaging mechanism for spatial-pooling. In SNNs, the average-pooling scheme differs from its ANN counterpart in that an additional thresholding step follows the averaging to generate output spikes. For instance, a fixed 2×2 kernel (each element having a weight of 0.25) strides through a convolutional feature map without overlapping and fires an output spike at the corresponding location in the pooled feature map only if the sum of the weighted spikes of the 4 inputs within the kernel window exceeds a designated threshold. The threshold for average-pooling has to be set carefully so that spike propagation is not disrupted by the pooling. If the threshold is too low, too many spikes are generated, causing loss of the spatial location of the feature extracted by the previous layer. On the other hand, if the threshold is too high, not enough spikes propagate to the deeper layers. We use a threshold of 0.75 for the fixed 2×2 kernel (each element having a weight of 0.25) in the average-pooling layers, meaning that a spike is generated in the pooled map only if there are at least 3 spikes within the 2×2 window (a minimal sketch of this operation is given below). For a different kernel size, the threshold has to be adjusted to maintain a similar ratio (0.75). The pooling operation provides several key benefits. First, it reduces the size of the convolutional feature maps and provides additional network invariance to input transformations. Furthermore, the pooling operation enlarges the effective size of the convolutional kernels in the following layer as the feature maps are downscaled. This allows consecutive convolutional layers to efficiently learn hierarchical representations from low to high levels of abstraction. The number of pooled feature maps is the same as the number of output feature maps of the previous convolutional layer. The feature maps of the final pooling layer are unrolled into a 1D vector to be used as input to the fully-connected layers. One or more fully-connected layers eventually reach the output layer, which produces the inference decision. This final fully-connected part of the network acts as a classifier, effectively mapping the composition of features produced by the alternating convolutional and pooling layers into the final output classes.
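The following is a minimal sketch of the spiking average-pooling described above, applied to one timestep of a spike tensor. The function name and tensor layout are assumptions; `torch.nn.functional.avg_pool2d` with a 2×2 kernel realizes the fixed 0.25 weights.

```python
import torch
import torch.nn.functional as F

def spiking_avg_pool(spike_map, threshold=0.75):
    """Spiking 2x2 average-pooling for one timestep: average the binary
    spikes with fixed 0.25 weights, then fire where the weighted sum
    reaches the threshold (i.e. at least 3 of the 4 inputs spiked).

    spike_map : (batch, channels, H, W) binary spike tensor
    """
    pooled = F.avg_pool2d(spike_map, kernel_size=2, stride=2)  # fixed 0.25 weights
    return (pooled >= threshold).float()                        # thresholded spikes
```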
Deep Convolutional SNN architecture: VGG and Residual SNNs

Deep network topologies are essential for learning hierarchical representations and recognizing complex input patterns. To that effect, we investigate state-of-the-art deep neural network architectures such as VGG [41] and ResNet [13] in order to build deep SNN architectures. VGG [41] was one of the first networks to use small (3×3) convolutional kernels uniformly throughout the network. The use of small (3×3) kernels enables effective stacking of convolutional layers while minimizing the number of parameters in deep networks. In this work, we build deep convolutional SNNs (containing more than 5 trainable layers) using a 'Spiking VGG Block', a stack of convolutional layers with small (3×3) kernels. Figure (a) shows a 'Spiking VGG Block' containing two stacked convolutional layers with an intermediate LIF neuronal layer. Next, ResNet [13] introduced skip connections throughout the network, which enabled successful training of significantly deeper networks. In particular, ResNet addresses the degradation (of training accuracy) problem [13] that occurs when increasing the number of layers in a plain feedforward neural network. We employ the concept of the skip connection to construct deep residual SNNs with 7-11 trainable layers. Figure (b) shows a 'Spiking Residual Block' consisting of non-residual and residual paths. The non-residual path consists of two convolutional layers with an intermediate LIF neuronal layer. The residual path (skip connection) is composed of an identity mapping when the numbers of input and output feature maps are the same, and of 1×1 convolutional kernels when they differ. The outputs of both the non-residual and residual paths are integrated into the membrane potential of the last LIF neuronal layer (LIF Neuron 2 in figure (b)
) to generate the output spikes of the 'Spiking Residual Block'. Within the feature extractor, a 'Spiking VGG Block' or 'Spiking Residual Block' is often followed by an average-pooling layer to construct the alternating convolutional and spatial-pooling structure. Note that in some 'Spiking Residual Blocks', the last convolutional layer and the residual connection employ convolution with a stride of 2 to incorporate the functionality of a spatial-pooling layer. At the end of the feature extractor, the features extracted by the last average-pooling layer are fed to a fully-connected layer as a 1D vector input to initiate the classifier operation.
2.2 Supervised Training of Deep Spiking Neural Network
2.2.1 Spike-based Gradient Descent Backpropagation Algorithm
The spike-based BP algorithm in SNNs is adapted from the standard BP algorithm [37] in the ANN domain. In standard BP, the network parameters are iteratively updated in the direction that minimizes the difference between the final outputs of the network and the target labels, by backpropagating the output error through the hidden layers using the gradient descent method. However, a major difference between ANNs and SNNs is the dynamics of the neuronal output. An artificial neuron (such as sigmoid, tanh or ReLU) communicates via continuous values, whereas a spiking neuron generates binary spike outputs over time. In SNNs, spatio-temporal spike trains are fed to the network as inputs, and accordingly the outputs of spiking neurons are spike events that are discontinuous and discrete (non-differentiable) over time. Hence, the standard BP algorithm cannot be utilized to train SNNs, as it requires the gradients of the spiking neuronal activation function for backpropagating the output error. We derive a spike-based BP algorithm capable of learning spatio-temporal patterns in spike trains. We formulate a differentiable (but approximated) activation of the LIF neuron that enables modulation of the network parameters using the gradient descent method in a spiking system. Spike-based BP can be divided into three phases: forward propagation, backward propagation and weight update. We now describe each phase in turn.
Forward Propagation
In forward propagation, spike trains representing the input patterns and the corresponding output (target) labels are presented to the network to estimate the loss function. The loss function is a measure of the discrepancy between the target labels and the outputs predicted by the network. To generate the spike inputs, the input pixel values are converted to Poisson-distributed spike trains and delivered to the network. The input spikes are multiplied by the synaptic weights to produce an input current, which is accumulated in the membrane potentials of the post-neurons. A post-neuron generates an output spike whenever its membrane potential exceeds the neuronal firing threshold; otherwise, the membrane potential decays exponentially with time. After a post-neuronal firing, the membrane potential is reset and the output spike is broadcast as the input to the subsequent layer. The post-neurons of every layer carry out this process successively based on the weighted spikes received from the preceding layer. Over time, the total weighted summation of the spike trains is integrated at the post-neuron, as formulated in equation (2), in which the sum of spike trains (denoted by $x_i$ for the $i^{th}$ input neuron) is weighted by the interconnecting synaptic weights, $w_{ij}$:

$$net_j = \sum_{i=1}^{n^l} w_{ij}\, x_i, \qquad x_i = \sum_{k} \theta_i(t - t_k) \qquad (2)$$

where $net_j$ stands for the total (resultant) current influx received by the $j^{th}$ post-neuron throughout the time $t$, $n^l$ is the number of pre-neurons and $\theta_i(t - t_k)$ is a spike event from the $i^{th}$ pre-neuron at time instant $t_k$.

In an SNN, the 'activation function' indicates the relationship between the weighted summation of pre-neuronal spike inputs and the post-neuronal outputs over time. A spike output signal is non-differentiable, since it is discrete and creates a discontinuity (a step jump) at the time instant of firing. To that effect, applying standard backpropagation [37] in the spiking domain becomes difficult, since it requires a differentiable activation function. To get around this predicament, we generate a 'differentiable activation' of the spiking neuron by low-pass filtering the individual post-spike train and dividing it by the total number of propagation steps ($T$), as formulated in equation (3). To compute the activation, $a$, of a LIF neuron, the unit spikes (at time instants $t_k$) are temporally integrated and the resultant sum is decayed over the remaining time period:

$$a = \frac{1}{T}\sum_{k} \exp\left(-\frac{T - t_k}{\tau_m}\right) \qquad (3)$$

The time constant ($\tau_m$) determines the decay rate of the spiking neuronal activation. It influences the temporal dynamics of the spiking neuron by accounting for the exponential membrane potential decay and reset mechanisms. The neuronal firing threshold of the final layer is set to a very high value so that the output neurons do not generate any spike output. In the output layer, the weighted spikes from the previous layer are accumulated in the membrane potential while decaying over time. At the last time step, the accumulated membrane potential is divided by the total number of time steps in order to quantify the output distribution, as presented in equation (4):

$$output = \frac{V_{mem}^{L}(T)}{T} \qquad (4)$$

The prediction error of each output neuron is evaluated by comparing the output distribution ($output$) with the desired target label ($label$) of the presented input spike train, as shown in equation (5):

$$e = output - label \qquad (5)$$

The corresponding loss function ($E$ in equation (6)) is defined as the mean square of the final prediction error over all the output neurons:

$$E = \frac{1}{2}\sum_{j=1}^{n^L} e_j^{\,2} \qquad (6)$$

A short sketch of these forward-pass computations follows.
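The sketch below illustrates the forward-pass quantities of equations (3)-(6) for one neuron and one output layer. It is a simplified rendering under the stated definitions (unit timesteps, low-pass filtering with a per-step factor of $e^{-1/\tau_m}$); names and shapes are illustrative assumptions.

```python
import numpy as np

def lif_activation(spike_train, tau_m=100.0):
    """Equation (3): low-pass filter a post-spike train and divide by the
    total number of timesteps T to obtain the approximated activation a."""
    T = len(spike_train)
    filtered = 0.0
    for t in range(T):
        filtered = filtered * np.exp(-1.0 / tau_m) + spike_train[t]  # leaky sum
    return filtered / T

def output_loss(v_mem_final, label, T):
    """Equations (4)-(6): output distribution from the accumulated membrane
    potential (the output layer never spikes), per-neuron prediction error,
    and the mean-square loss."""
    output = v_mem_final / T          # eq. (4): accumulated potential / T
    error = output - label            # eq. (5)
    loss = 0.5 * np.sum(error ** 2)   # eq. (6)
    return loss, error
```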
Backward Propagation and Weight Update
Next, we formulate gradient descent backward propagation for SNNs. The first step is to estimate the gradient of the loss function at the output layer. Then, the gradients are propagated backward all the way to the input layer through the hidden layers using the recursive chain rule. The following equations (7)-(14) and figure 3 describe the detailed mathematical steps for obtaining the partial derivatives of the error with respect to the weights.

The partial derivative of the error with respect to each weight is calculated by applying the chain rule twice, as shown in equation (7):

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_{j}}\,\frac{\partial a_{j}}{\partial net_{j}}\,\frac{\partial net_{j}}{\partial w_{ij}} \qquad (7)$$

In equation (8), differentiating the loss function with respect to the post-neuronal activation provides the first term of equation (7):

$$\frac{\partial E}{\partial a_j} = e_j \qquad (8)$$

The gradient of the LIF neuronal activation can be derived by employing the 'backpropagation through time (BPTT)' technique used in recurrent ANN training [47]. The derivative of the post-neuronal activation with respect to the net input current is obtained by adding a unity value to the time derivative of the corresponding neuronal activation and dividing the sum by the corresponding neuronal firing threshold, as described in equation (9):

$$\frac{\partial a_j}{\partial net_j} = \frac{1}{V_{th}}\left(1 + \frac{\partial a_j}{\partial t}\right) \qquad (9)$$

The addition of unity allows us to get around the discontinuity (step jump) that arises at each spike time, while the time derivative incorporates the leaky effect of the respective LIF neuronal membrane potential. It is worth mentioning here that [14, 1] employed the BPTT technique in different ways for spike-based BP algorithms. At the output layer, the error gradient, $\delta^L$, represents the gradient of the output loss with respect to the net input current received by the post-neurons. It is calculated by multiplying the final output error ($e$) with the derivative of the corresponding post-neuronal activation, $\frac{\partial a^L}{\partial net^L}$, with respect to its input, as shown in equation (10):

$$\delta^L = e \,.\, \frac{\partial a^L}{\partial net^L} \qquad (10)$$

Note that element-wise multiplication is indicated by '.' while matrix multiplication is represented by '*' in the respective equations. At any hidden layer $l$, the local error gradient, $\delta^l$, is recursively estimated by multiplying the gradient backpropagated from the following layer, $(w^l)^T * \delta^{l+1}$, with the derivative of the neuronal activation, $\frac{\partial a^l}{\partial net^l}$, as presented in equation (11):

$$\delta^l = \left((w^l)^T * \delta^{l+1}\right) . \,\frac{\partial a^l}{\partial net^l} \qquad (11)$$

The derivative of the net input current with respect to a weight is simply the total number of incoming spikes throughout the time, as derived in equation (12):

$$\frac{\partial net_j}{\partial w_{ij}} = x_i \qquad (12)$$

The derivative of the output loss with respect to the weights interconnecting layers $l$ and $l+1$ is determined by multiplying the transposed error gradient at layer $l+1$, $(\delta^{l+1})^T$, with the input spikes from layer $l$, as in equation (13):

$$\frac{\partial E}{\partial w^l} = (\delta^{l+1})^T * x^l \qquad (13)$$

In the case of convolutional neural networks, we backpropagate the error to obtain the partial derivatives of the loss function with respect to each output feature map, and then average the partial derivatives over the output map connections sharing a particular weight to account for the effective updates of the convolutional weights. Finally, the calculated partial derivatives of the loss function are used to update the respective weights using a learning rate ($\eta$), as illustrated in equation (14):

$$w^l \leftarrow w^l - \eta\,\frac{\partial E}{\partial w^l} \qquad (14)$$

As a result, iteratively updating the weights over mini-batches of input patterns leads the network state to a local minimum, thereby enabling the network to capture multiple levels of internal representations of the data. A sketch of the corresponding surrogate-gradient computation is given below.
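A common way to realize the pseudo-derivative of equation (9) in an autograd framework is a custom function whose backward pass substitutes a surrogate gradient for the non-differentiable spike. The sketch below uses a boxcar window around the firing threshold in place of the full leak-aware BPTT term; this window, the class name and the 0.5 half-width are simplifying assumptions, not the exact derivation above.

```python
import torch

class SpikeFunction(torch.autograd.Function):
    """Spike generation with a pseudo-derivative in the spirit of
    equation (9): the step output receives a surrogate gradient
    scaled by 1/v_th in the backward pass."""

    v_th = 1.0

    @staticmethod
    def forward(ctx, v_mem):
        ctx.save_for_backward(v_mem)
        return (v_mem >= SpikeFunction.v_th).float()   # binary spike output

    @staticmethod
    def backward(ctx, grad_output):
        (v_mem,) = ctx.saved_tensors
        # pass gradients only near the firing threshold, scaled by 1/v_th
        window = (torch.abs(v_mem - SpikeFunction.v_th) < 0.5).float()
        return grad_output * window / SpikeFunction.v_th
```

For example, a layer would call `spikes = SpikeFunction.apply(v_mem)` so that the forward pass emits binary spikes while the backward pass receives the surrogate gradient.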
2.2.2 Dropout in Spiking Neural Network

Dropout [44] is one of the popular regularization techniques for training deep ANNs. The technique randomly disconnects units with a given probability ($p$) to prevent the units from being overfitted and co-adapted too much to the given training data. We employ the concept of the dropout technique in order to effectively regularize deep SNNs. Note that dropout is only applied during training and is not used when evaluating the performance of the network through inference. There is a subtle difference in the way dropout is applied in SNNs compared to ANNs. In ANNs, each epoch of training has several iterations of mini-batches, and in each iteration randomly selected units (with a dropout ratio of $p$) are disconnected from the network while weighting by the posterior probability ($p$). However, in SNNs, each iteration has more than one forward propagation, depending on the time length of the spike train. We backpropagate the output error and modify the network parameters only at the last time step. For dropout to be effective in our training method, it has to be ensured that the set of connected units within an iteration of mini-batch data does not change, so that the neural network is constituted by the same random subset of units during each forward propagation within a single iteration. On the other hand, if the units were randomly (re)connected at each timestep, the effect of dropout would be averaged out over the entire forward propagation time within an iteration; the dropout effect would then fade out once the output error is propagated backward and the parameters are updated at the last time step. Therefore, it is necessary to keep the same set of randomly connected units for the entire time window within an iteration. In the experiments, we use the SNN version of the dropout technique with a probability ($p$) of omitting units equal to 0.2-0.25. Note that the activations are much sparser during SNN forward propagation compared to ANNs; hence, the optimal $p$ for SNNs needs to be smaller than the typical ANN dropout ratio ($p$ = 0.5). The details of SNN forward propagation with dropout are specified in Algorithm 1, and a minimal sketch of the fixed-mask dropout is given below.
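The following is a minimal sketch of the fixed-mask dropout. The inverted-dropout scaling by $1/(1-p)$ and the helper names are implementation assumptions; the essential point is that one mask is sampled per mini-batch iteration and reused at every timestep.

```python
import torch

def snn_dropout_mask(shape, p=0.2):
    """Sample one dropout mask per mini-batch iteration. The same mask is
    reused at every timestep so that the identical random subset of units
    stays active during all T forward passes of the iteration."""
    keep = 1.0 - p
    return torch.bernoulli(torch.full(shape, keep)) / keep  # inverted dropout

# usage within one training iteration over a spike window of T timesteps:
# mask = snn_dropout_mask(hidden_shape, p=0.2)   # fixed for the iteration
# for t in range(T):
#     h = lif_layer(x[t])
#     h = h * mask            # same units dropped at every timestep
```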
3 Experimental Setup and Results

3.1 Experimental Setup
The primary goal of our experiments is to demonstrate the effectiveness of the proposed spike-based BP training methodology on a variety of deep network architectures. We first describe our experimental setup and baselines. For the experiments, we developed a custom simulation framework using the PyTorch
[35] deep learning package for evaluating our proposed SNN training algorithm. Our deep convolutional SNNs are populated with biologically plausible LIF neurons in which pairs of pre- and post-neurons are interconnected by plastic synapses. At the beginning, the neuronal firing thresholds are set to unity and the synaptic weights are initialized from a Gaussian distribution with zero mean and a standard deviation of $\sqrt{\kappa/n}$ ($n$: number of fan-in synapses), as introduced in [12]. Note that the initialization constant $\kappa$ differs with the type of network architecture: we use $\kappa = 2$ for non-residual networks and $\kappa = 1$ for residual networks (a sketch of this initialization is given after table 2). For training, the synaptic weights are trained with the mini-batch spike-based BP algorithm in an end-to-end manner, as explained in section 2.2.1. We train our network models for 150 epochs using mini-batch stochastic gradient descent BP, reducing the learning rate at the 70th, 100th and 125th training epochs. For the neuromorphic dataset, we use the Adam [19] learning method and reduce the learning rate at the 40th, 80th and 120th training epochs. Please refer to table 2 for more implementation details. The datasets and network topologies used for benchmarking, the spike generation scheme for event-driven operation, and the determination of the number of timesteps required for training and inference are described in the following subsections.

Parameter  Value
Training Time Duration  100 timesteps
Inference Time Duration  Same as training
Mini-batch Size  16-32
Spatial-pooling Non-overlapping Region / Stride  2×2, 2
Weight Initialization Constant ($\kappa$)  2 (non-residual network), 1 (residual network)
Learning Rate ($\eta$)  0.002-0.003
Dropout Ratio ($p$)  0.2-0.25
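The initialization described above can be sketched as follows; the exact standard deviation form $\sqrt{\kappa/n}$ is inferred from the stated constants and [12], and the function name is illustrative.

```python
import math
import torch.nn as nn

def initialize_snn(model, residual=False):
    """Initialize synaptic weights from a zero-mean Gaussian with standard
    deviation sqrt(kappa / n_fan_in), with kappa = 2 for non-residual and
    kappa = 1 for residual networks; firing thresholds are kept at unity."""
    kappa = 1.0 if residual else 2.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            fan_in = m.in_channels * m.kernel_size[0] * m.kernel_size[1]
        elif isinstance(m, nn.Linear):
            fan_in = m.in_features
        else:
            continue
        m.weight.data.normal_(0.0, math.sqrt(kappa / fan_in))  # zero-mean Gaussian
```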
3.1.1 Benchmarking Datasets
We demonstrate the efficacy of our proposed training methodology for deep convolutional SNNs on three standard vision datasets and one neuromorphic vision dataset, namely MNIST [22], SVHN [32], CIFAR-10 [20] and N-MNIST [33]. The MNIST dataset is composed of grayscale (one-dimensional) images of handwritten digits of size 28 by 28. The SVHN and CIFAR-10 datasets are composed of color (three-dimensional) images of size 32 by 32. The N-MNIST dataset is a neuromorphic (spiking) dataset converted from the static MNIST dataset using a Dynamic Vision Sensor (DVS) [26]. The N-MNIST dataset contains two-dimensional images that include ON and OFF event stream data of size 34 by 34. An ON (OFF) event represents an increase (decrease) in pixel brightness. Details of the benchmark datasets are listed in table 3. For evaluation, we report the top-1 classification accuracy obtained by classifying the test samples (the training and test samples are mutually exclusive).
Dataset  Image  #Training Samples  #Testing Samples  #Categories

MNIST  28×28, gray  60,000  10,000  10
SVHN  32×32, color  73,000  26,000  10
CIFAR-10  32×32, color  50,000  10,000  10
N-MNIST  34×34, ON and OFF spikes  60,000  10,000  10
3.1.2 Network Topologies
We use various SNN architectures depending on the complexity of the benchmark datasets. For the MNIST and N-MNIST datasets, we use a network consisting of two sets of alternating convolutional and spatial-pooling layers followed by two fully-connected layers. This architecture is derived from the LeNet5 model [22]. Table 4 summarizes the layer type, kernel size, number of output feature maps and stride of the SNN model for the MNIST dataset. The kernel sizes shown in the table are for 3D convolution, where the 1st dimension is the number of input feature maps and the 2nd-3rd dimensions are the convolutional kernel dimensions. For the SVHN and CIFAR-10 datasets, we use deeper network models consisting of 7 to 11 trainable layers, including convolutional, spatial-pooling and fully-connected layers. In particular, these networks with more than 5 trainable layers are constructed using small (3×3) convolutional kernels. We term the deep convolutional SNN architecture that uses 3×3 convolutional kernels [41] without residual connections 'VGG SNN', and the one with skip (residual) connections [13] 'Residual SNN'. In Residual SNNs, some convolutional layers convolve their kernels with a stride of 2 in both x and y directions to incorporate the functionality of spatial-pooling layers. Please refer to tables 4 and 5, which summarize the details of the deep convolutional SNN architectures. In the results section, we discuss the benefit of deep SNNs in terms of classification performance as well as inference speedup and energy efficiency.
4-layer network  VGG7  ResNet7
Layer type  Kernel size  #o/p feature maps  Stride  Layer type  Kernel size  #o/p feature maps  Stride  Layer type  Kernel size  #o/p feature maps  Stride
Convolution  1×5×5  20  1  Convolution  3×3×3  64  1  Convolution  3×3×3  64  1
Average-pooling  2×2  2  Convolution  64×3×3  64  2  Average-pooling  2×2  2
Average-pooling  2×2  2
Convolution  20×5×5  50  1  Convolution  64×3×3  128  1  Convolution  64×3×3  128  1
Average-pooling  2×2  2  Convolution  128×3×3  128  2  Convolution  128×3×3  128  2
Convolution  128×3×3  128  2  Skip convolution  64×1×1  128  2
Average-pooling  2×2  2
Convolution  128×3×3  256  1
Convolution  256×3×3  256  2
Skip convolution  128×1×1  256  2
Fully-connected  200  Fully-connected  1024  Fully-connected  1024
Output  10  Output  10  Output  10
VGG9  ResNet9  ResNet11
Layer type  Kernel size  #o/p feature maps  Stride  Layer type  Kernel size  #o/p feature maps  Stride  Layer type  Kernel size  #o/p feature maps  Stride
Convolution  3×3×3  64  1  Convolution  3×3×3  64  1  Convolution  3×3×3  64  1
Convolution  64×3×3  64  1  Average-pooling  2×2  2  Average-pooling  2×2  2
Average-pooling  2×2  2
Convolution  64×3×3  128  1  Convolution  64×3×3  128  1  Convolution  64×3×3  128  1
Convolution  128×3×3  128  1  Convolution  128×3×3  128  1  Convolution  128×3×3  128  1
Average-pooling  2×2  2  Skip convolution  64×1×1  128  1  Skip convolution  64×1×1  128  1
Convolution  128×3×3  256  1  Convolution  128×3×3  256  1  Convolution  128×3×3  256  1
Convolution  256×3×3  256  1  Convolution  256×3×3  256  2  Convolution  256×3×3  256  2
Convolution  256×3×3  256  1  Skip convolution  128×1×1  256  2  Skip convolution  128×1×1  256  2
Average-pooling  2×2  2
Convolution  256×3×3  512  1  Convolution  256×3×3  512  1
Convolution  512×3×3  512  2  Convolution  512×3×3  512  1
Skip convolution  256×1×1  512  2  Skip convolution  512×1×1  512  1
Convolution  512×3×3  512  1
Convolution  512×3×3  512  2
Skip convolution  512×1×1  512  2
Fully-connected  1024  Fully-connected  1024  Fully-connected  1024
Output  10  Output  10  Output  10
3.1.3 ANN-SNN Conversion Scheme
As mentioned previously, offline-trained ANNs can be successfully converted to SNNs by replacing the ANN (ReLU) neurons with Integrate-and-Fire (IF) spiking neurons and adjusting the neuronal thresholds with respect to the synaptic weights. It is important to set the neuronal firing thresholds sufficiently high so that each spiking neuron can closely resemble the ANN activation without loss of information. In the literature, several methods have been proposed [4, 8, 15, 39, 36] for balancing the ratio between the neuronal thresholds and synaptic weights of spiking neurons in ANN-to-SNN conversion. In this paper, we compare various aspects of our direct-spike trained models with one recent work [39], which proposed a near-lossless ANN-to-SNN conversion scheme for deep network architectures. In brief, [39] balances the neuronal firing thresholds with respect to the corresponding synaptic weights layer by layer, depending on the actual spiking activity of each layer measured on a subset of training samples (a sketch of this procedure is given below). Basically, we compare our direct-spike trained model with the converted SNN on the same network architecture in terms of accuracy, inference speed and energy efficiency. Please note that there are a couple of differences in network architecture and conversion technique between [39] and our scheme. First, [39] always uses average-pooling to reduce the size of the previous convolutional output feature map, whereas our models interchangeably use average-pooling or convolutions with a stride of 2. Next, [39] only considers identity skip connections for residual SNNs, whereas we implement skip connections using either identity mappings or 1×1 convolutional kernels. Lastly, we use a lower threshold (0.75) for the average-pooling layers instead of 0.8 to ensure sufficient spike propagation in both the direct-trained and converted network models. Even in the case of the ANN-to-SNN conversion scheme, the lower average-pooling threshold provides slightly better classification performance than [39].
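A sketch of the layer-wise threshold balancing is given below. It assumes a helper `run_up_to` that propagates spike trains through the already-balanced layers [0, i), and layer objects exposing `.weight` and `.v_th`; these names and the fully-connected form are illustrative assumptions, not the exact procedure of [39].

```python
import torch

@torch.no_grad()
def balance_thresholds(layers, spike_batches, run_up_to):
    """Sequential layer-wise threshold balancing in the spirit of [39]:
    for each layer, play a subset of training spike trains through the
    already-balanced preceding layers and set the firing threshold to the
    maximum weighted input current the layer ever receives.

    layers        : list of layer objects, each with .weight and .v_th
    spike_batches : iterable of (T, batch, n_in) input spike tensors
    run_up_to     : assumed helper propagating spikes through layers [0, i)
    """
    for i, layer in enumerate(layers):
        max_current = 0.0
        for spikes in spike_batches:
            pre_spikes = run_up_to(layers, i, spikes)   # spikes reaching layer i
            current = pre_spikes @ layer.weight.t()     # weighted input current
            max_current = max(max_current, current.max().item())
        layer.v_th = max_current                        # threshold <- max input
```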
3.1.4 Spike Generation Scheme
For the static vision datasets (MNIST, SVHN and CIFAR-10), each input pixel intensity is converted to a stream of Poisson-distributed spike events with an equivalent firing rate. The Poisson input spikes are fed to the network throughout the simulation time. This rate-based spike encoding is used for a given period of time during both training and inference. For the color image datasets, we apply the image pre-processing techniques of random cropping and horizontal flipping before generating the input spikes. The input pixels are normalized to zero mean and unit standard deviation. Thereafter, we scale the pixel intensities to the range [-1, 1] to represent the whole spectrum of input pixel values. The normalized pixel intensities are converted to Poisson spike events such that the generated input signals are bipolar spikes (a minimal sketch is given below). For the neuromorphic dataset (N-MNIST), we use the original (unfiltered and uncentered) version of the spike streams to directly train and test the network in the time domain.
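A minimal sketch of the bipolar Poisson encoding follows; the function name is illustrative, and the per-timestep Bernoulli draw is the usual rate-coding approximation of a Poisson process.

```python
import torch

def bipolar_poisson_spikes(images, T=100):
    """Convert normalized pixel intensities in [-1, 1] into bipolar,
    Poisson-distributed spike trains: at each timestep a pixel fires a
    spike of sign(intensity) with probability |intensity|."""
    prob = images.abs().clamp(max=1.0)        # per-pixel firing probability
    sign = images.sign()
    spikes = [torch.bernoulli(prob) * sign for _ in range(T)]
    return torch.stack(spikes)                # shape: (T, *images.shape)
```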
3.1.5 Timesteps
As mentioned in section 3.1.4, we generate a stochastic Poisson spike train for each input pixel intensity for event-driven operation. The duration of this spike train is very important for SNNs. We measure the length of the spike train (spike time window) in timesteps. For example, a 100-timestep spike train will contain approximately 50 random spikes if the corresponding pixel intensity is one half in the range [0, 1]. If the number of timesteps (spike time window) is too small, the SNN will not receive enough information for training or inference. On the other hand, if the number of timesteps is too large, the latency will be high and the spike stream will behave more like a deterministic input. The stochastic property of SNNs will then be lost, inference will become too slow, and the network will lose much of its energy-efficiency advantage over an ANN implementation. For these reasons, we experimented with different numbers of timesteps to empirically obtain the optimal number required for both training and inference. The experimental process and results are explained in the following subsections.
Optimal #timesteps for Training
A spike event can only represent 0 or 1 at each time step, so its bit precision is usually considered to be 1. However, the spike train provides temporal data, which is an additional source of information. Therefore, the spike train length (number of timesteps) in an SNN can be considered as the actual precision of its neuronal activation. To obtain the optimal number of timesteps required for our proposed training method, we trained a VGG9 network on the CIFAR-10 dataset using different numbers of timesteps ranging from 10 to 120 (shown in figure (a)). We found that with only 10 timesteps, the network is unable to learn anything, as there is not enough information (the input precision is too low) for the network to learn. This phenomenon is explained by the lack of spikes in the final output: with the initial weights, the accumulated membrane potentials of the LIF neurons are not sufficient to generate output spikes in the later layers. Hence, none of the input spikes propagates to the final output neurons, the output distributions remain 0, the computed gradients are always 0 and the network is never updated. For 20-30 timesteps, some input spikes are able to reach the final layer, so the network starts to learn but does not converge. For 35-50 timesteps, the network learns well and converges to a reasonable point. From 70 timesteps, the network accuracy starts to saturate, and at about 100 timesteps the training improvement saturates completely. This is consistent with the bit precision of the inputs. It has been shown in [38] that 8-bit inputs and activations are sufficient to achieve optimal network performance for standard image recognition tasks. Ideally, we would need 128 timesteps to represent 8-bit inputs using bipolar spikes; however, 100 timesteps proved to be sufficient, as more timesteps provide only marginal improvement. We observe a similar trend in the VGG7, ResNet7, ResNet9 and ResNet11 SNNs while training on the SVHN and CIFAR-10 datasets. Therefore, we consider 100 timesteps as the optimal number of timesteps for training in our proposed methodology. Moreover, for the MNIST dataset, we use 50 timesteps, since the required bit precision is only 4 bits [38].
Optimal #timesteps for Inference
To obtain the optimal number of timesteps required for inference with a network trained using our proposed method, we conducted experiments similar to those described in section 3.1. We first trained a VGG9 network on the CIFAR-10 dataset using 100 timesteps (optimal according to the experiments in section 3.1). Then, we tested the network performance with different numbers of timesteps ranging from 10 to 4000 (shown in figure (b)). We observed that the network performs very well even with only 10 timesteps, while the peak performance occurs around 100 timesteps. Beyond 100 timesteps, the accuracy degrades slightly from the peak. This behavior is very different from that of ANN-SNN converted networks, where the accuracy keeps improving as the number of timesteps is increased (shown in figure (b)). This can be attributed to the fact that our proposed spike-based training method incorporates the temporal information into the network training procedure, so that the trained network is tailored to perform best at a specific spike time window during inference. On the other hand, the ANN-SNN conversion schemes are unable to incorporate the temporal information of the input into the trained network and therefore depend heavily on the deterministic behavior of the input. Hence, the ANN-SNN conversion schemes require a much higher number of timesteps for inference in order to resemble ANN-like input-output mappings.
3.2 Results
In this section, we analyze the classification performance and efficiency achieved by the proposed spike-based training methodology for deep convolutional SNNs, compared to the performance of SNNs transformed using the ANN-to-SNN conversion scheme.
3.2.1 The Classification Performance
Most of the classification performances available in the literature for SNNs are for the MNIST and CIFAR-10 datasets. The popular methods for SNN training are 'Spike-Timing-Dependent Plasticity (STDP)' based unsupervised learning [7, 50, 3, 42, 43] and 'spike-based backpropagation' based supervised learning [25, 17, 49, 31, 30]. There are a few works [45, 18, 46, 23] which have tried to combine the two approaches to get the best of both worlds. However, these training methods were able neither to train deep SNNs nor to achieve inference performance comparable to ANN implementations. Hence, ANN-SNN conversion schemes have been explored by researchers [4, 8, 15, 39, 36]. To date, ANN-SNN conversion schemes have achieved the best inference performance on the CIFAR-10 dataset using deep networks [39, 36]. The classification performances of all these works are listed in table 6 along with ours. To the best of our knowledge, we achieve the best inference accuracy on MNIST using a LeNet-structured network. We also achieve accuracy comparable to the ANN-SNN converted network [39] on the CIFAR-10 dataset using much smaller network models, while beating all other SNN training methods.

Model  Learning Method  MNIST  N-MNIST  CIFAR-10
Hunsberger et al. [15]  Offline learning, conversion  98.37%  –  82.95%
Esser et al. [9]  Offline learning, conversion  –  –  89.32%
Diehl et al. [8]  Offline learning, conversion  99.10%  –  –
Rueckauer et al. [36]  Offline learning, conversion  99.44%  –  88.82%
Sengupta et al. [39]  Offline learning, conversion  –  –  91.55%
Kheradpisheh et al. [18]  Layer-wise STDP + offline SVM classifier  98.40%  –  –
Panda et al. [34]  Spike-based autoencoder  99.08%  –  70.16%
Lee et al. [25]  Spike-based BP  99.31%  98.74%  –
Wu et al. [49]  Spike-based BP  99.42%  98.78%  50.70%
Lee et al. [23]  STDP-based pre-training + spike-based BP  99.28%  –  –
Jin et al. [17]  Spike-based BP  99.49%  98.88%  –
Wu et al. [48]  Spike-based BP  –  99.53%  90.53%
This work  Spike-based BP  99.59%  99.09%  90.95%
For a more extensive comparison, we compare the inference performance of networks trained using our proposed methodology with state-of-the-art ANNs and the ANN-SNN conversion scheme, for the same network configuration (depth and structure), side by side in table 7. We also compare with the previous best SNN training results found in the literature, which may or may not have the same network depth and structure as ours. The ANN-SNN conversion scheme is a modified and improved version of [39]. We use this modified scheme since it achieves better conversion performance than [39], as explained in section 3.1.3. Note that all reported classification accuracies are the average of the maximum inference accuracies over 3 independent runs with different seeds.
After initializing the weights, we train the SNNs using the spike-based BP algorithm in an end-to-end manner with Poisson spike train inputs. Our evaluation on the MNIST dataset yields a classification accuracy of 99.59%, which is the best compared to any other SNN training scheme and also to our ANN-SNN conversion scheme. We achieve ~96% inference accuracy on the SVHN dataset for both the trained non-residual and residual SNNs, which is very close to the state-of-the-art ANN implementation. Inference performance for SNNs trained on the SVHN dataset has not been reported previously in the literature.
We implemented three different networks, as shown in table 5, for classifying the CIFAR-10 dataset using the proposed spike-based BP algorithm. For the VGG9 network, the ANN-SNN conversion scheme provides a nearly lossless converted network compared to the baseline ANN implementation, while our proposed training method yields a classification accuracy of 90.45%. For the ResNet9 network, the ANN-SNN conversion scheme provides inference accuracy within 3% of the baseline ANN implementation; our proposed spike-based training method achieves better inference accuracy, within ~1.5% of the baseline ANN implementation. In the case of ResNet11, we observe that the inference accuracy improvement over ResNet9 is marginal for the baseline ANN implementation; however, both the ANN-SNN conversion scheme and the proposed SNN training show an improvement of ~0.5% for ResNet11 compared to ResNet9. Overall, for the ResNet networks, our proposed training method achieves better inference accuracy than the ANN-SNN conversion scheme.
Inference Accuracy (%)  

Dataset  Model  ANN  ANNSNN  SNN [Previous Best]  SNN [This Work] 
MNIST  LeNet  99.57  99.59  99.49 [17]  99.59 
N-MNIST  LeNet  –  –  99.53 [48]  99.09
SVHN  VGG7  96.36  96.30  –  96.06 
ResNet7  96.43  95.93  –  96.21  
CIFAR-10  VGG9  91.98  92.01  90.53 [48]  90.45
ResNet9  91.85  89.00  –  90.35
ResNet11  91.87  90.15  –  90.95
3.2.2 Accuracy Improvement with Network Depth
One of the major drawbacks of STDP-based unsupervised learning for SNNs is that it is very difficult to train beyond 2 convolutional layers [18, 24]. Therefore, researchers are leaning more towards backpropagation-based supervised learning for deep SNNs.
In order to analyze the effect of network depth on direct-spike trained SNNs, we experimented with networks of different depths while training on the SVHN and CIFAR-10 datasets. For the SVHN dataset, we started with a small network derived from the LeNet5 model [22], with 2 convolutional and 2 fully-connected layers. This network was able to achieve an inference accuracy of only 92.38%. Then, we increased the network depth by adding 1 convolutional layer before the 2 fully-connected layers; we term this network VGG5. The VGG5 network achieved a significant improvement over its predecessor. Similarly, we tried VGG6 followed by VGG7, and the improvement became very small. We also trained ResNet7 to understand how residual networks perform compared to non-residual networks of similar depth. The results of these experiments are shown in figure (a). We carried out similar experiments for the CIFAR-10 dataset, and the results show a similar trend (figure (b)). These results confirm that network depth improves the learning capacity of direct-spike trained SNNs, as in ANNs. The non-residual networks saturate at a certain depth and start to degrade if the depth is further increased (VGG11 in figure (b)) due to the degradation problem mentioned in [13]. In such a scenario, the residual connections in deep residual ANNs allow the network to maintain peak classification accuracy by utilizing the skip connections [13], as seen in figure (b) (ResNet9 and ResNet11).
4 Discussion
4.1 Comparison with Relevant Works
In this section, we compare our proposed supervised learning algorithm with other recent spike-based BP algorithms. Spike-based learning rules primarily focus on directly training and testing SNNs with spike trains, so that no conversion is necessary for application in real-world spiking scenarios. In recent years, there has been an increasing number of supervised gradient descent methods for spike-based learning. [34] developed a spike-based autoencoder mechanism to train deep convolutional SNNs. They treated the membrane potential as a differentiable signal and demonstrated recognition capabilities on standard vision tasks (MNIST and CIFAR-10 datasets). [25] followed a similar approach to explore a spike-based BP algorithm in an end-to-end manner. In addition, [25] presented an error normalization scheme to prevent the exploding gradient phenomenon when training deep SNNs. [17] proposed a hybrid macro/micro-level backpropagation (HM2-BP) algorithm, developed to capture the temporal effect of individual spikes (at the micro level) and the rate-encoded error (at the macro level). In the temporal encoding domain, [30] proposed an interesting temporal spike-based BP algorithm that treats the spike time as the differentiable activation of a neuron. Temporal-encoding based SNNs have the potential to process spatio-temporal spike patterns with a small number of spikes. All of these works demonstrated spike-based learning on simple network architectures, with a large gap in classification accuracy compared to deep ANNs. More recently, [48] presented a neuron normalization technique (called NeuNorm) that calculates the average input firing rates to adjust neuron selectivity. NeuNorm enables spike-based training within a relatively short time window while achieving competitive performance. In addition, they presented an input encoding scheme that receives both spike and non-spike signals to preserve the precision of the input data.
Several points distinguish our work from the others. First, we derive a differentiable (but approximated) activation of a LIF neuron given the measured neuronal outputs (as defined in equation (3)). The activation of a LIF neuron is formulated as a 'low-pass filtered output signal', the accumulation of leaky output spikes over time. In the backpropagation phase, this defined activation enables us to calculate the neuronal pseudo-derivative while accounting for the leaky behavior (as explained in equation (9)). Note that the leaky component has a high impact on the dynamics of the LIF spiking neuron. It is worth mentioning that this better approximation of the discrete LIF neuronal activation function enables our network to achieve better performance than the other methods in the literature. Next, we construct our networks by leveraging state-of-the-art deep architectures such as VGG [41] and ResNet [13]. To the best of our knowledge, this is the first work that demonstrates spike-based supervised BP learning for SNNs containing more than 10 trainable layers. Our deep SNNs obtain superior classification accuracy on the MNIST, SVHN and CIFAR-10 datasets in comparison to other networks trained with spike-based algorithms. Moreover, we present a network parameter (i.e. weight and threshold) initialization scheme for a variety of deep SNN architectures. In the experiments, we show that the proposed initialization scheme appropriately initializes deep SNNs, facilitating training convergence for a given network architecture and training strategy. In addition, as opposed to the complex error and neuron normalization methods adopted by [25] and [48], respectively, we demonstrate that deep SNNs can be trained naturally by considering only the spiking activity of the system. As a result, our work paves an effective way for training deep SNNs with a spike-based BP algorithm.
4.2 Spike Activity Analysis
The most important advantage of the event-driven operation of neural networks is that the events are very sparse in nature. To verify this claim, we analyze the spiking activities of the direct-spike trained SNNs and the ANN-SNN converted networks in the following subsections.
4.2.1 Spike Activity per Layer
The layer-wise spike activities of the SNN trained using our proposed methodology and of the ANN-SNN converted network are shown for VGG9 and ResNet9 in figures (a) and (b), respectively. In the case of ResNet9, only the output spike activity of the first average-pooling layer is shown in the figure, since in the direct-spike trained SNN the other spatial-pooling operations are performed by stride-2 convolutions. In figure 6, it can be seen that the input layer has the highest spike activity, significantly higher than that of any other layer, and that the spike activity reduces significantly as the network depth increases.
We can observe from figures (a) and (b) that the average spike activity in the direct-spike trained SNN is much higher than in the ANN-SNN converted network. The ANN-SNN converted network uses thresholds higher than 1 (the value used in the direct-spike trained SNN), since the conversion scheme applies layer-wise neuronal threshold modulation, and this higher threshold reduces the spike activity. However, in both cases, the spike activity decreases with increasing network depth.
Dataset  Model  #Spikes/image (ANN-SNN)  #Spikes/image (SNN)  Ratio (ANN-SNN/SNN)
MNIST  LeNet  29094  55212  0.53x  
73085  1.32x  
SVHN  VGG7  10251782  5564306  1.84x  
16615596  2.99x  
ResNet7  –  4656760  –  
20607244  4.43x  
CIFAR-10  VGG9  2226732  1240492  1.80x  
9647563  7.78x  
ResNet9  –  4319988  –  
8745271  2.02x  
ResNet11  –  1531985  –  
8116343  5.30x 
4.2.2 #Spikes/Inference
From figure 6, it is evident that the average spike activity in ANN-SNN converted networks is much lower than in the SNN trained with our proposed methodology. However, for inference, the network has to be evaluated over a number of timesteps. Therefore, to quantify the actual spike activity of an inference operation, we measure the average number of spikes required to infer one image. For this purpose, we count the number of spikes generated (including input spikes) while classifying the test set of a particular dataset for a specific number of timesteps, and average the count to obtain '#spikes per image inference'. We use two different numbers of timesteps for the ANN-SNN converted VGG networks: one for iso-accuracy comparison and the other for maximum-accuracy comparison with the direct-spike trained SNNs. Iso-accuracy inference requires fewer timesteps than maximum-accuracy inference and hence a lower number of spikes per image inference. For the ResNet networks, the ANN-SNN conversion scheme always provides accuracy lower than the SNN trained with the proposed algorithm; hence, we only compare spikes per image inference in the maximum-accuracy condition for the ANN-SNN converted ResNet networks. From the #spikes/image inference we can quantify the spike-efficiency (the reduction in #spikes). The results are listed in table 8, where, for each network, the 1st row corresponds to the iso-accuracy condition and the 2nd row to the maximum-accuracy condition.
Figure 7 shows the relationship between inference accuracy, latency and #spikes/inference for the ResNet11 network trained on the CIFAR-10 dataset. We can observe that #spikes/inference is higher for the direct-spike trained SNN compared to the ANN-SNN converted network at any particular latency. However, the SNN trained with spike-based BP requires only 100 timesteps for maximum inference accuracy, whereas the ANN-SNN converted network requires about 3000 timesteps to reach its maximum inference accuracy (which is slightly lower than the direct-spike trained SNN accuracy). Hence, in the maximum-accuracy condition, the direct-spike trained SNN requires far fewer #spikes/inference than the ANN-SNN converted network while achieving similar accuracy. A sketch of how such spike counts can be gathered is given below.
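The sketch below shows one way per-inference spike counts can be collected with forward hooks; the `is_spiking` marker attribute and the per-timestep calling convention are assumptions of this sketch, not the instrumented simulator used for table 8.

```python
import torch

def spikes_per_inference(model, spike_input):
    """Count all spikes (input spikes included) generated while classifying
    one batch, via forward hooks on the spiking layers. Assumes each spiking
    layer is tagged with an `is_spiking` attribute and emits a binary tensor
    per timestep."""
    total = spike_input.sum().item()                    # input spike count
    counts = {"hidden": 0.0}
    def hook(module, inputs, output):
        counts["hidden"] += output.sum().item()
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if getattr(m, "is_spiking", False)]
    with torch.no_grad():
        for t in range(spike_input.shape[0]):           # one pass per timestep
            model(spike_input[t])
    for h in handles:
        h.remove()
    batch = spike_input.shape[1]
    return (total + counts["hidden"]) / batch           # average #spikes per image
```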
4.3 Inference Speedup
The time required for inference is almost linearly proportional to the number of timesteps (figure 7). Hence, we can also quantify the inference speedup of the direct-spike trained SNN over the ANN-SNN converted network from the number of timesteps required for inference, as shown in table 9. For the VGG9 network, we achieve an 8x speedup in inference for the iso-accuracy comparison and up to a 36x speedup for the maximum-accuracy comparison. Similarly, for the ResNet networks, we achieve up to 25x-30x inference speedup.
Dataset  Model  Timesteps (ANN-SNN)  Timesteps (SNN)  Inference Speedup
MNIST  LeNet  200  50  4x  
500  10x  
SVHN  VGG7  1600  100  16x  
2600  26x  
ResNet7  –  100  –  
2500  25x  
CIFAR-10  VGG9  800  100  8x  
3600  36x  
ResNet9  –  100  –  
3000  30x  
ResNet11  –  100  –  
3000  30x 
4.4 Complexity Reduction
Deep ANNs place extraordinary computational requirements on hardware. SNNs can mitigate this by enabling efficient event-driven computation. To compare the computational complexity of the two cases, we first need to understand their operating principles. An ANN inference of a particular input requires a single feedforward pass per image. For the same task, a spiking network must be evaluated over a number of timesteps. If conventional hardware is used for both the ANN and the SNN, the SNN clearly has computational complexity hundreds or thousands of times higher than the ANN. However, there is specialized hardware that exploits the event-driven operation of neural networks and 'computes only when required'. SNNs can potentially exploit such alternative mechanisms of network operation and carry out inference in the spiking domain much more efficiently than an ANN. Also, for deep SNNs, we observe an increase in sparsity as the network depth increases; hence, the benefit of event-driven hardware is expected to increase with network depth.
An estimate of the actual energy consumption of SNNs, and a full comparison with ANNs, is outside the scope of this work. However, we can gain some insight by quantifying the energy consumption of a synaptic operation and comparing the number of synaptic operations performed by the ANN, the SNN trained with our proposed algorithm, and the ANN-SNN converted network. The number of synaptic operations per layer of a neural network can be estimated from the structure of its convolutional and linear layers. In an ANN, a multiply-accumulate (MAC) computation is performed per synaptic operation, whereas specialized SNN hardware performs only an accumulate computation (AC) per synaptic operation, and only when an incoming spike is received. Hence, the total number of AC operations in an SNN can be estimated by the layer-wise product of the average spike count of a particular layer and its corresponding number of synaptic connections, summed over layers; this sum is then multiplied by the #timesteps to obtain the total #AC operations for one image inference. Based on this, we estimated the total number of MAC operations for the ANN, and the total number of AC operations for the direct-spike trained SNN and the ANN-SNN converted network, for VGG9, ResNet9 and ResNet11. The ratio of ANN-SNN converted network AC operations to direct-spike trained SNN AC operations to ANN MAC operations is 28.18:3.61:1 for VGG9, 11.94:5.06:1 for ResNet9 and 7.26:2.09:1 for ResNet11 (in the maximum-accuracy condition).
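The operation counts described above can be estimated in a few lines of code. The following sketch uses hypothetical layer descriptions: `fanout` stands for the number of synaptic connections per neuron and `avg_spikes` for the average spike count per neuron per timestep, neither of which is a quantity reported directly in this section.

```python
def ann_mac_ops(layers):
    """Total MACs for one ANN forward pass.
    layers: list of (num_neurons, fanout) tuples for conv/linear layers."""
    return sum(n * fanout for n, fanout in layers)

def snn_ac_ops(layers, avg_spikes, timesteps):
    """Total ACs for one SNN image inference: layer-wise product of the
    average spike count and the synaptic connections, summed over layers,
    then scaled by the number of timesteps."""
    per_timestep = sum(n * fanout * s
                       for (n, fanout), s in zip(layers, avg_spikes))
    return per_timestep * timesteps

# Hypothetical 3-layer network and placeholder spike rates.
layers = [(4096, 576), (2048, 1152), (10, 2048)]
macs = ann_mac_ops(layers)
acs = snn_ac_ops(layers, avg_spikes=[0.08, 0.03, 0.01], timesteps=100)
print(f"AC:MAC ratio = {acs / macs:.2f}:1")
```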
However, a MAC operation usually consumes significantly more energy than an AC operation. For instance, according to [11], a 32-bit floating point MAC operation consumes 4.6 pJ and a 32-bit floating point AC operation consumes 0.9 pJ in a 45 nm technology node. Hence, energy-wise, one synaptic operation in an ANN is equivalent to ~5 synaptic operations in an SNN. Moreover, 32-bit floating point computation can be replaced by fixed point computation using integer MAC and AC units without losing accuracy, since the conversion is reported to be almost lossless [27]. A 32-bit integer MAC consumes roughly 3.2 pJ, while a 32-bit integer AC operation consumes only 0.1 pJ in a 45 nm process. Considering this, our calculations demonstrate that SNNs trained using the proposed method are 7.81x and 8.87x more computationally energy-efficient than an ANN-SNN converted network and an ANN, respectively, for the VGG9 network architecture. We also gain 3.47x (2.36x) and 15.32x (6.32x) energy-efficiency for the ResNet11 (ResNet9) network over an ANN-SNN converted network and an ANN, respectively. Figure 8 shows the reduction in computational complexity of ANN-SNN conversion and of SNNs trained with the proposed methodology relative to ANNs. It is worth noting that, since the sparsity of spike signals increases with network depth in SNNs, the energy-efficiency is expected to increase almost exponentially with depth both for ANN-to-SNN converted networks [39] and for SNNs trained with the proposed methodology, compared to an ANN implementation. Hence, network depth is the key factor for achieving a significant increase in energy efficiency for event-driven SNNs in contrast to ANNs.
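These efficiency figures follow directly from the integer-arithmetic energy numbers and the operation ratios above; for VGG9 (ratio 28.18 : 3.61 : 1):

\[
\frac{E_{\text{ANN}}}{E_{\text{SNN}}} = \frac{1 \times 3.2\,\text{pJ}}{3.61 \times 0.1\,\text{pJ}} \approx 8.87\times, \qquad
\frac{E_{\text{ANN-SNN}}}{E_{\text{SNN}}} = \frac{28.18 \times 0.1\,\text{pJ}}{3.61 \times 0.1\,\text{pJ}} \approx 7.81\times.
\]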
Training complexity can also be estimated easily for the proposed methodology, since it follows the steps of the standard ANN backpropagation algorithm. In backpropagation, the training effort consists of two costs: i) forward propagation and ii) backward propagation and weight update. Using a simple cost-estimation model based on the number of synaptic operations, we observed that, for an ANN, the backward propagation and weight update cost is ~2x the forward propagation cost for a minibatch size of one (chosen for simplicity). To obtain an ANN-SNN converted network, an ANN is first trained using standard backpropagation and then neuronal threshold modulation is applied to convert it to an SNN. The threshold-modulation cost can be neglected since it is a one-time cost whereas training spans many epochs; hence, the overall training complexity of the ANN-SNN conversion approach is similar to that of the ANN. On the other hand, the spike-based backpropagation scheme backpropagates only once per minibatch iteration, which is computationally the same as backpropagation in an ANN; the backward propagation and weight update cost of the proposed methodology is therefore also the same as for an ANN. Combining the forward propagation cost (described earlier) with the backward propagation and weight update cost yields the total computational complexity of training. Our estimate shows that, for a minibatch size of 1, the proposed training methodology is ~1.4x more computationally energy-efficient than both an ANN and an ANN-SNN converted network.
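One way to reconstruct the ~1.4x figure (a sketch, assuming the estimate combines the VGG9 forward-pass energies above with a backward-pass-plus-update cost of 2x the ANN forward pass in both cases):

\[
\frac{C_{\text{ANN}}}{C_{\text{SNN}}} \approx \frac{F + 2F}{0.11F + 2F} = \frac{3}{2.11} \approx 1.4\times,
\]

where \(F\) is the ANN forward-pass energy and \(0.11F \approx \frac{3.61 \times 0.1}{3.2}\,F\) is the corresponding spike-based forward-pass energy for VGG9.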
5 Conclusion
In this work, we propose a spike-based backpropagation training methodology for state-of-the-art deep SNN architectures. This methodology enables real-time training of deep SNNs while achieving comparable inference accuracies on standard image recognition tasks. Our experiments demonstrate the effectiveness of the proposed learning strategy on deeper SNNs (7-11 layer VGG and ResNet network architectures), achieving the best classification accuracies to date on the MNIST, SVHN and CIFAR10 datasets among networks trained with spike-based learning. The quality gap between ANNs and SNNs is substantially reduced by the application of our proposed methodology. By exploiting our training methodology and deploying the trained SNN on neuromorphic hardware, we achieve 6.32x-15.32x energy-efficiency over ANN counterparts and 2.36x-7.81x over ANN-SNN converted networks for inference. Moreover, the trained deep SNNs infer 8x-36x faster than ANN-SNN converted networks.
Acknowledgement
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the National Science Foundation, Intel Corporation, the DoD Vannevar Bush Fellowship, and the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
References
 [1] G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv preprint arXiv:1803.09574, 2018.
 [2] S. M. Bohte, J. N. Kok, and H. La Poutre. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1–4):17–37, 2002.
 [3] J. M. Brader, W. Senn, and S. Fusi. Learning real-world stimuli in a neural network with spike-driven synaptic dynamics. Neural computation, 19(11):2881–2912, 2007.
 [4] Y. Cao, Y. Chen, and D. Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015.
 [5] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
 [6] P. Dayan and L. F. Abbott. Theoretical neuroscience, volume 806. Cambridge, MA: MIT Press, 2001.
 [7] P. U. Diehl and M. Cook. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9:99, 2015.
 [8] P. U. Diehl, G. Zarrella, A. Cassidy, B. U. Pedroni, and E. Neftci. Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In Rebooting Computing (ICRC), IEEE International Conference on, pages 1–8. IEEE, 2016.
 [9] S. Esser, P. Merolla, J. Arthur, A. Cassidy, R. Appuswamy, A. Andreopoulos, D. Berg, J. McKinstry, T. Melano, D. Barch, et al. Convolutional networks for fast, energy-efficient neuromorphic computing. arXiv preprint arXiv:1603.08270, 2016.
 [10] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown. Overview of the SpiNNaker system architecture. IEEE Transactions on Computers, 62(12):2454–2467, 2013.
 [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [14] D. Huh and T. J. Sejnowski. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pages 1440–1450, 2018.
 [15] E. Hunsberger and C. Eliasmith. Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829, 2015.
 [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [17] Y. Jin, P. Li, and W. Zhang. Hybrid macro/micro level backpropagation for training deep spiking neural networks. arXiv preprint arXiv:1805.07866, 2018.
 [18] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe, and T. Masquelier. STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421, 2016.
 [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [23] C. Lee, P. Panda, G. Srinivasan, and K. Roy. Training deep spiking convolutional neural networks with STDP-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in Neuroscience, 12:435, 2018.
 [24] C. Lee, G. Srinivasan, P. Panda, and K. Roy. Deep spiking convolutional neural network trained with unsupervised spike-timing-dependent plasticity. IEEE Transactions on Cognitive and Developmental Systems, 2018.
 [25] J. H. Lee, T. Delbruck, and M. Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience, 10:508, 2016.
 [26] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE journal of solid-state circuits, 43(2):566–576, 2008.
 [27] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
 [28] W. Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997.
 [29] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
 [30] H. Mostafa. Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems, 2017.
 [31] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis. Event-driven random backpropagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324, 2017.
 [32] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [33] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.
 [34] P. Panda and K. Roy. Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 299–306. IEEE, 2016.
 [35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [36] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.
 [37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
 [38] S. S. Sarwar, G. Srinivasan, B. Han, P. Wijesinghe, A. Jaiswal, P. Panda, A. Raghunathan, and K. Roy. Energy efficient neural computing: A study of cross-layer approximations. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018.
 [39] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy. Going deeper in spiking neural networks: Vgg and residual architectures. arXiv preprint arXiv:1802.02627, 2018.
 [40] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [42] G. Srinivasan, P. Panda, and K. Roy. SpiLinC: Spiking liquid-ensemble computing for unsupervised speech and image recognition. Frontiers in Neuroscience, 12:524, 2018.
 [43] G. Srinivasan, P. Panda, and K. Roy. STDP-based unsupervised feature learning using convolution-over-time in spiking neural networks for energy-efficient neuromorphic computing. ACM Journal on Emerging Technologies in Computing Systems (JETC), 14(4):44, 2018.
 [44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [45] A. Tavanaei and A. S. Maida. Bio-inspired spiking convolutional neural network using layer-wise sparse coding and STDP learning. arXiv preprint arXiv:1611.03000, 2016.
 [46] A. Tavanaei and A. S. Maida. Multilayer unsupervised learning in a spiking convolutional neural network. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 2023–2030. IEEE, 2017.
 [47] P. J. Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 [48] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi. Direct training for spiking neural networks: Faster, larger, better. arXiv preprint arXiv:1809.05793, 2018.
 [49] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12, 2018.
 [50] B. Zhao, R. Ding, S. Chen, B. LinaresBarranco, and H. Tang. Feedforward categorization on aer motion events using cortexlike features in a spiking neural network. IEEE transactions on neural networks and learning systems, 26(9):1963–1978, 2015.