1 Introduction
In recent years, Spiking Neural Networks (SNNs) have shown promise for enabling low-power machine intelligence on event-driven neuromorphic hardware. Founded on bio-plausibility, the neurons in an SNN compute and communicate information through discrete binary events (or ‘spikes’), a significant shift from standard artificial neural networks (ANNs), which process data in a real-valued (or analog) manner. The binary all-or-nothing spike-based communication, combined with sparse temporal processing, is precisely what makes SNNs a low-power alternative to conventional ANNs. For all their appeal in power efficiency, training SNNs remains a challenge. The discontinuous and non-differentiable nature of a spiking neuron (generally modeled as leaky-integrate-and-fire (LIF) or integrate-and-fire (IF)) makes gradient-descent-based backpropagation difficult. In practice, SNNs still lag behind ANNs in accuracy on traditional learning tasks. Consequently, several works over the past few years have proposed different learning algorithms or learning rules for implementing deep convolutional SNNs for complex visual recognition tasks (Wu et al., 2019; Hunsberger and Eliasmith, 2015; Cao et al., 2015). Of all the techniques, ANN-to-SNN conversion (Diehl et al., 2016, 2015; Sengupta et al., 2019; Hunsberger and Eliasmith, 2015) has yielded state-of-the-art accuracies matching deep ANN performance on the ImageNet dataset with complex architectures (such as VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016)). In conversion, we train an ANN with ReLU neurons using gradient descent and then convert the ANN to an SNN with IF neurons by using suitable threshold balancing (Sengupta et al., 2019). However, SNNs obtained through conversion incur a large latency, measured as the total number of time steps required to process a given input image. (SNNs process Poisson rate-coded input spike trains, wherein each pixel in an image is converted to a Poisson-distributed spike train with spiking frequency proportional to the pixel value.) The term ‘time step’ defines a unit of time required to process a single input spike across all layers and represents the network latency. The large latency translates to higher energy consumption during inference, thereby diminishing the efficiency improvements of SNNs over ANNs.

To reduce the latency, spike-based backpropagation rules have been proposed that perform end-to-end gradient descent training on spike data. In spike-based backpropagation methods, the non-differentiability of the spiking neuron is handled either by approximating the spiking neuron model as continuous and differentiable (Huh and Sejnowski, 2018) or by defining a surrogate gradient as a continuous approximation of the real gradient (Wu et al., 2018; Bellec et al., 2018; Neftci et al., 2019). Spike-based SNN training reduces the overall latency, measured in time steps required to process an input (Lee et al., 2019), but requires more training effort (in terms of total training iterations) than conversion approaches. A single feed-forward pass in an ANN corresponds to multiple forward passes in an SNN, in proportion to the number of time steps. In spike-based backpropagation, the backward pass requires the gradients to be integrated over the total number of time steps, which increases the computation and memory complexity. The multiple-iteration training effort with exploding memory requirement (for backward pass computations) has limited the applicability of spike-based backpropagation methods to small datasets (like CIFAR10) on simple few-layered convolutional architectures.

In this work, we propose a hybrid training technique that combines ANN-SNN conversion and spike-based backpropagation, reducing the overall latency as well as the training effort needed for convergence. We use ANN-SNN conversion as an initialization step followed by incremental spike-based backpropagation training (which converges to optimal accuracy within a few epochs owing to the precursory initialization). Essentially, our hybrid approach of taking a converted SNN and incrementally training it using backpropagation yields improved energy-efficiency as well as higher accuracy than a model trained from scratch with only conversion or only spike-based backpropagation.
In summary, this paper makes the following contributions:

We introduce a hybrid, computationally-efficient training methodology for deep SNNs. We use the weights and firing thresholds of an SNN converted from an ANN as the initialization step for spike-based backpropagation. We then train this initialized network with spike-based backpropagation for a few epochs to perform inference at a reduced latency (fewer time steps).

We propose a novel spike-time-dependent backpropagation (STDB, a variant of standard spike-based backpropagation) that computes the surrogate gradient using the neuron’s spike time. The parameter update is triggered by the occurrence of a spike, and the gradient is computed based on the time difference between the current time step and the most recent time step at which the neuron generated an output spike. This is motivated by Hebb’s principle, which states that the plasticity of a synapse depends on the spiking activity of the neurons connected to the synapse.

Our hybrid approach with the novel surrogate gradient descent allows training of large-scale SNNs without the exploding memory required during spike-based backpropagation. We evaluate our hybrid approach on large SNNs (VGG and ResNet-like architectures) on the ImageNet and CIFAR datasets and show near iso-accuracy compared to similar ANNs and converted SNNs at lower compute cost and energy.
2 Spike Timing Dependent Backpropagation (STDB)
In this section, we describe the spiking neuron model, derive the equations for the proposed surrogate-gradient-based learning, present the weight initialization method for the SNN, discuss the constraints applied for ANN-SNN conversion, and summarize the overall training methodology.
2.1 Leaky Integrate and Fire (LIF) Neuron Model
The neuron model defines the dynamics of the neuron’s internal state and the trigger for it to generate a spike. The differential equation

$$\tau \frac{dU}{dt} = -(U - U_{rest}) + RI \qquad (1)$$

is widely used to characterize the leaky-integrate-and-fire (LIF) neuron model, where $U$ is the internal state of the neuron, referred to as the membrane potential, $U_{rest}$ is the resting potential, $R$ and $I$ are the input resistance and the current, respectively, and $\tau$ is the membrane time constant. The above equation is valid when the membrane potential is below the threshold value ($V_{th}$). The neuron generates an output spike when $U > V_{th}$, and $U$ is reduced to the reset potential. This representation is described in the continuous domain and is more suitable for biological simulations. We modify the equation to be evaluated in a discrete manner in the PyTorch framework (Wu et al., 2018). The iterative model for a single post-neuron is described by

$$u_i^t = \lambda u_i^{t-1} + \sum_j w_{ij} o_j^t - v\, o_i^{t-1} \qquad (2)$$

$$o_i^{t-1} = \begin{cases} 1, & \text{if } u_i^{t-1} > v \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $u$ is the membrane potential, subscripts $i$ and $j$ represent the post- and pre-neuron, respectively, superscript $t$ is the time step, $\lambda$ is a constant ($<1$) responsible for the leak in membrane potential, $w$ is the weight connecting the pre- and post-neuron, $o$ is the binary output spike, and $v$ is the firing threshold potential. The right-hand side of Equation 2 has three terms: the first term calculates the leak in the membrane potential from the previous time step, the second term integrates the input from the previous layer and adds it to the membrane potential, and the third term, which is outside the summation, reduces the membrane potential by the threshold value if a spike is generated. This is known as soft reset, as the membrane potential is lowered by $v$, in contrast to hard reset, where the membrane potential is reduced to the reset value. Soft reset enables the spiking neuron to carry forward the excess potential above the firing threshold to the following time step, thereby minimizing information loss.
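The iterative update with soft reset can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' PyTorch implementation: the function names `lif_step` and `run_lif`, and the default values for the leak and threshold, are assumptions made only for the example.

```python
def lif_step(u_prev, o_prev, inputs, weights, leak, threshold):
    """One discrete LIF update with soft reset for a single post-neuron.

    u_prev  : membrane potential at the previous time step
    o_prev  : 1 if the neuron spiked at the previous time step, else 0
    inputs  : binary spikes from the pre-neurons at the current time step
    weights : synaptic weights, one per pre-neuron
    """
    weighted_input = sum(w * s for w, s in zip(weights, inputs))
    # leak term + integrated input - soft reset (threshold subtracted after a spike)
    u = leak * u_prev + weighted_input - threshold * o_prev
    o = 1 if u > threshold else 0
    return u, o


def run_lif(spike_trains, weights, leak=0.99, threshold=1.0):
    """Run one neuron over a list of per-time-step input spike vectors."""
    u, o = 0.0, 0
    potentials, spikes = [], []
    for inputs in spike_trains:
        u, o = lif_step(u, o, inputs, weights, leak, threshold)
        potentials.append(u)
        spikes.append(o)
    return potentials, spikes


if __name__ == "__main__":
    # A single pre-neuron that spikes at every step, no leak (illustrative values).
    potentials, spikes = run_lif([[1], [1], [1], [1]], [0.6], leak=1.0, threshold=1.0)
    print(potentials, spikes)
```

In this toy run the neuron crosses the threshold at the second step with a potential of about 1.2, and the soft reset leaves the excess of about 0.2 in place for the next step instead of discarding it, which is the behaviour the text attributes to soft reset.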
2.2 Spike Timing Dependent Backpropagation (STDB) Learning Rule
The neuron dynamics (Equation 2) show that the neuron’s state at a particular time step recurrently depends on its state in previous time steps. This introduces implicit recurrent connections in the network (Neftci et al., 2019). Therefore, the learning rule has to perform temporal credit assignment along with spatial credit assignment. Credit assignment refers to the process of assigning credit or blame to the network parameters according to their contribution to the loss function. Spatial credit assignment identifies structural network parameters (like weights), whereas temporal credit assignment determines which past network activities contributed to the loss function. Gradient-descent learning solves both credit assignment problems: spatial credit assignment is performed by distributing error spatially across all layers using the chain rule of derivatives, and temporal credit assignment is done by unrolling the network in time and performing backpropagation through time (BPTT) using the same chain rule of derivatives (Werbos and others, 1990). In BPTT, the network is unrolled for all time steps and the final output is computed as the sum of outputs from each time step. The loss function is defined on the summed output.

The dynamics of the neuron in the output layer are described by Equation (4), where the leak part is removed ($\lambda = 1$) and the neuron only integrates the input without firing. This eliminates the difficulty of defining the loss function on spike count (Lee et al., 2019).

$$u_i^t = u_i^{t-1} + \sum_j w_{ij} o_j^t \qquad (4)$$
The number of neurons in the output layer is the same as the number of categories in the classification task. The output of the network is passed through a softmax layer that outputs a probability distribution. The loss function is defined as the cross-entropy between the true output and the network’s predicted distribution.

$$L = -\sum_i y_i \log(p_i) \qquad (5)$$

$$p_i = \frac{e^{u_i^T}}{\sum_{k=1}^{N} e^{u_k^T}} \qquad (6)$$

$L$ is the loss function, $y$ the true output, $p$ the prediction, $T$ the total number of time steps, $u^T$ the accumulated membrane potential of the neuron in the output layer from all time steps, and $N$ the number of categories in the task. For deeper networks and a large number of time steps, the truncated version of the BPTT algorithm is used to avoid memory issues. In the truncated version, the loss is computed at some time step $t'$ before $T$ based on the potential accumulated till $t'$. The loss is backpropagated to all layers and the loss gradients are computed and stored. At this point, the history of the computational graph is cleaned to save memory. The loss gradients subsequently computed at later time steps are summed together with the gradient at $t'$ to get the final gradient. The optimizer updates the parameters at $T$ based on the sum of the gradients.

Gradient descent learning has the objective of minimizing the loss function. This is achieved by backpropagating the error and updating the parameters opposite to the direction of the derivative. The derivative of the loss function w.r.t. the membrane potential of the neuron in the final layer is described by

$$\frac{\partial L}{\partial u_i^T} = p_i - y_i \qquad (7)$$
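The output-layer computation above (softmax over the accumulated potentials, cross-entropy loss, and the resulting derivative with respect to each potential) can be checked numerically. The sketch below is an illustration under stated assumptions: the function names and the example potentials are hypothetical, and it uses the standard result that the derivative of softmax cross-entropy with respect to the pre-softmax value is the prediction minus the target.

```python
import math


def softmax(potentials):
    """Softmax over the accumulated output-layer membrane potentials."""
    m = max(potentials)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in potentials]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy between a one-hot (or soft) target and the prediction."""
    return -sum(y * math.log(p) for y, p in zip(target, probs) if y > 0)


def output_gradient(probs, target):
    """Derivative of the loss w.r.t. each accumulated potential: p_i - y_i."""
    return [p - y for p, y in zip(probs, target)]


if __name__ == "__main__":
    u = [2.0, 1.0, 0.1]   # illustrative accumulated potentials
    y = [1.0, 0.0, 0.0]   # one-hot true label
    p = softmax(u)
    print("loss:", cross_entropy(p, y))
    print("gradient:", output_gradient(p, y))
```

Comparing `output_gradient` against a central finite difference of `cross_entropy(softmax(u), y)` is a quick way to confirm that the closed form matches the loss definition.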
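To compute the gradient at the current time step, the membrane potential at the last time step ($u_i^{t-1}$ in Equation 4) is considered as an input quantity. Therefore, gradient descent updates the network parameters of the output layer as

$$w_{ij} = w_{ij} - \eta \Delta w_{ij} \qquad (8)$$

$$\Delta w_{ij} = \sum_t \frac{\partial L}{\partial w_{ij}^t} = \sum_t \frac{\partial L}{\partial u_i^T} \frac{\partial u_i^T}{\partial w_{ij}^t} = \frac{\partial L}{\partial u_i^T} \sum_t \frac{\partial u_i^T}{\partial w_{ij}^t} \qquad (9)$$

where $\eta$ is the learning rate, and $w_{ij}^t$ represents the copy of the weight used for computation at time step $t$. In the output layer the neurons do not generate a spike, and hence the issue of non-differentiability is not encountered. The update of the hidden layer parameters is described by

$$\Delta w_{ij} = \sum_t \frac{\partial L}{\partial w_{ij}^t} = \sum_t \frac{\partial L}{\partial o_i^t} \frac{\partial o_i^t}{\partial u_i^t} \frac{\partial u_i^t}{\partial w_{ij}^t} \qquad (10)$$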
where $o_i^t$ is the thresholding function (Equation 3) whose derivative w.r.t. $u_i^t$ is zero everywhere and not defined at the time of spike. The challenge of the discontinuous spiking nonlinearity is resolved by introducing a surrogate gradient, which is a continuous approximation of the real gradient.

$$\frac{\partial o_i^t}{\partial u_i^t} = \alpha e^{-\beta \Delta t} \qquad (11)$$

where $\alpha$ and $\beta$ are constants, and $\Delta t$ is the time difference between the current time step ($t$) and the last time step at which the post-neuron generated a spike ($t_s$). It is an integer value whose range is from zero to the total number of time steps ($T$).

$$0 \le \Delta t \le T, \quad \Delta t \in \mathbb{Z} \qquad (12)$$

The values of $\alpha$ and $\beta$ are selected depending on the value of $T$. If $T$ is large, $\beta$ is lowered to reduce the exponential decay so that a spike can contribute towards gradients for later time steps. The value of $\alpha$ is also reduced for large $T$ because the gradient can propagate through many time steps. The gradient is summed at each time step, and thus a large $\alpha$ may lead to exploding gradients. The surrogate gradient can be pre-computed for all values of $\Delta t$ and stored in a look-up table for faster computation. The parameter updates are triggered by the spiking activity, but the error gradients are still non-zero for time steps following the spike time. This enables the algorithm to avoid the ‘dead neuron’ problem, where no learning happens when there is no spike. Fig. 1 shows the activation gradient for different values of $\Delta t$; the gradient decreases exponentially for neurons that have not been active for a long time. In Hebbian models of biological learning, the parameter update is activity dependent. This is experimentally observed in the spike-timing-dependent plasticity (STDP) learning rule, which modulates the weights for pairs of neurons that spike within a time window (Song et al., 2000).
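Because the time since the last spike is an integer bounded by the number of time steps, the look-up table mentioned above is small and cheap to build. The following sketch assumes the exponentially decaying form of the surrogate gradient (a scale constant times the exponential of a negative decay constant times the elapsed time); the function names, the constants 0.3 and 0.01, and the choice to return zero for neurons that have never spiked are assumptions for illustration, not values taken from the text.

```python
import math


def build_surrogate_table(num_steps, alpha, beta):
    """Pre-compute alpha * exp(-beta * dt) for dt = 0 .. num_steps."""
    return [alpha * math.exp(-beta * dt) for dt in range(num_steps + 1)]


def surrogate_gradient(table, t, last_spike_time):
    """Look up the surrogate gradient at time step t.

    A neuron whose last spike time is still the large negative initial
    value falls outside the table; this sketch treats its gradient as
    zero (an assumption: the exponential is negligible there anyway).
    """
    dt = t - last_spike_time
    if dt < 0 or dt >= len(table):
        return 0.0
    return table[dt]


if __name__ == "__main__":
    table = build_surrogate_table(num_steps=100, alpha=0.3, beta=0.01)
    print(surrogate_gradient(table, t=5, last_spike_time=3))      # spiked 2 steps ago
    print(surrogate_gradient(table, t=5, last_spike_time=-1000))  # never spiked
```

A table indexed by an integer avoids evaluating the exponential inside the training loop; the same trick is not available for a real-valued membrane potential.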
3 SNN Weight Initialization
A prevalent method of constructing SNNs for inference is ANN-SNN conversion (Diehl et al., 2015; Sengupta et al., 2019). Since the network is trained with analog activations, it does not suffer from the non-differentiability issue and can leverage the training techniques of ANNs. The conversion process has a major drawback: it suffers from long inference latency, as mentioned in Section 1. As there is no provision to optimize the parameters after conversion based on spiking activity, the network cannot leverage the temporal information of the spikes. In this work, we propose to use the conversion process as an initialization technique for STDB. The converted weights and thresholds serve as a good initialization for the optimizer, and the STDB learning rule is applied for temporal and spatial credit assignment.
Algorithm 1 explains the ANN-SNN conversion process. The threshold voltages in the SNN need to be adjusted based on the ANN weights. Sengupta et al. (2019) showed two ways to achieve this: weight-normalization and threshold-balancing. In weight-normalization the weights are scaled by a normalization factor and the threshold is set to 1, whereas in threshold-balancing the weights are unchanged and the threshold is set to the normalization factor. Both have a similar effect and either can be used to set the threshold. We employ the threshold-balancing method, and the normalization factor is calculated as the maximum output of the corresponding convolution/linear layer in the SNN. The maximum is calculated over a mini-batch of inputs for all time steps.
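The sequential nature of threshold balancing can be illustrated with a small sketch. It is not Algorithm 1 itself: it uses fully connected layers of integrate-and-fire neurons, a single input sample instead of a mini-batch, and hypothetical helper names (`layer_input`, `if_layer_step`, `balance_thresholds`). Each layer's threshold is set to the largest weighted input it receives over all time steps, and the layers before it are simulated with their already assigned thresholds.

```python
def layer_input(weights, spikes):
    """Weighted sum of pre-neuron spikes for each post-neuron (one row per post-neuron)."""
    return [sum(w * s for w, s in zip(row, spikes)) for row in weights]


def if_layer_step(potential, weights, spikes, threshold):
    """Advance one integrate-and-fire layer by a time step (soft reset); returns output spikes."""
    out = []
    for i, current in enumerate(layer_input(weights, spikes)):
        potential[i] += current
        if potential[i] > threshold:
            potential[i] -= threshold
            out.append(1)
        else:
            out.append(0)
    return out


def balance_thresholds(layers, spike_trains):
    """Assign one threshold per layer, sequentially from the first layer.

    layers       : list of weight matrices
    spike_trains : list of input spike vectors, one per time step
    """
    thresholds = []
    for depth, weights in enumerate(layers):
        max_input = 0.0
        # Fresh membrane potentials for the layers that are already balanced.
        potentials = [[0.0] * len(w) for w in layers[:depth]]
        for spikes in spike_trains:
            x = spikes
            for k in range(depth):
                x = if_layer_step(potentials[k], layers[k], x, thresholds[k])
            max_input = max(max_input, max(layer_input(weights, x)))
        thresholds.append(max_input)
    return thresholds


if __name__ == "__main__":
    layers = [[[0.5, 0.5], [1.0, 0.0]], [[2.0, 1.0]]]
    spike_trains = [[1, 1], [1, 0], [0, 1]]
    print(balance_thresholds(layers, spike_trains))
```

The inner loop makes the dependency explicit: the input seen by a layer depends on which neurons in the earlier layers spike, and that in turn depends on their thresholds, so the thresholds cannot all be computed in a single forward pass.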
There are several constraints imposed on training the ANN for the conversion process (Sengupta et al., 2019; Diehl et al., 2015). The neurons are trained without the bias term because the bias term in an SNN has an indirect effect on the threshold voltage, which increases the difficulty of threshold balancing and makes the process more prone to conversion loss. The absence of the bias term eliminates the use of Batch Normalization (Ioffe and Szegedy, 2015) as a regularizer in the ANN, since it biases the input of each layer to have zero mean. As an alternative, Dropout (Srivastava et al., 2014) is used as a regularizer for both ANN and SNN training. The implementation of Dropout in SNNs is further discussed in Section 5. The pooling operation is widely used in ANNs to reduce the convolution map size. There are two popular variants: max pooling and average pooling (Boureau et al., 2010). Max (average) pooling outputs the maximum (average) value in the kernel space of the neuron’s activations. In an SNN, the activations are binary and performing max pooling would result in significant information loss for the next layer, so we adopt average pooling for both the ANN and the SNN (Diehl et al., 2015).
4 Network Architectures
In this section, we describe the changes made to the VGG (Simonyan and Zisserman, 2014) and residual architecture (He et al., 2016) for hybrid learning and discuss the process of threshold computation for both the architectures.
4.1 VGG Architecture
The threshold balancing is performed for all layers except the input and output layers in a VGG architecture. For every hidden convolution/linear layer, the maximum input (the input to a neuron is the weighted sum of spikes from the pre-neurons connected to it) is computed over all time steps and set as the threshold for that layer. The threshold assignment is done sequentially as described in Algorithm 1. The threshold computation for all layers cannot be performed in parallel (in one forward pass) because in the forward method (Algorithm 3) we need the threshold at each time step to decide whether the neuron should spike or not.
4.2 Residual Architecture
Residual architectures introduce shortcut connections between layers that are not next to each other. In order to minimize the ANN-SNN conversion loss, various considerations were made by Sengupta et al. (2019). The original residual architecture proposed by He et al. (2016) uses an initial convolution layer with a wide kernel. For conversion, this is replaced by a pre-processing block consisting of a series of three convolution layers with dropout layers in between (Fig. 2). The threshold balancing mechanism is applied to only these three layers, and the layers in the basic block have unity threshold.

Architecture  ANN  ANN-SNN Conversion ()  ANN-SNN Conversion (reduced time steps)  Hybrid Training (ANN-SNN Conversion + STDB)

CIFAR10  
VGG5  ()  ()  
VGG9  ()  ()  
VGG16  ()  ()  
ResNet8  ()  ()  
ResNet20  ()  ()  
CIFAR100  
VGG11  ()  ()  
ImageNet  
ResNet34  ()  ()  
VGG16  ()  () 
5 Overall Training Algorithm
Algorithm 1 defines the process to initialize the parameters (weights, thresholds) of the SNN based on ANN-SNN conversion. Algorithms 2 and 3 show the mechanism of training the SNN with STDB. Algorithm 2 initializes the neuron parameters for every mini-batch, whereas Algorithm 3 performs the forward and backward propagation and computes the credit assignment. The threshold voltage is the same for all neurons in a layer and is not altered in the training process.

For each dropout layer we initialize a mask for every mini-batch of inputs. The function of dropout is to randomly drop a certain number of inputs in order to avoid overfitting. In an SNN, inputs are represented as a spike train and we want to keep the dropped units the same for the entire duration of the input. Thus, a random mask is initialized (Algorithm 2) for every mini-batch and the input is element-wise multiplied with the mask to generate the output of the dropout layer (Lee et al., 2019).

The Poisson generator function outputs a Poisson spike train with rate proportional to the pixel value in the input. A random number is generated at every time step for each pixel in the input image. The random number is compared with the normalized pixel value, and if the random number is less than the pixel value an output spike is generated. This results in a Poisson spike train with rate equivalent to the pixel value if averaged over a long time.

The weighted sum of the input is accumulated in the membrane potential of the first convolution layer. The STDB function compares the membrane potential and the threshold of that layer to generate an output spike. For the neurons that output a spike, the corresponding entry in the last-spike-time record is updated with the current time step. The last spike time is initialized with a large negative number (Algorithm 2) to denote that at the beginning the last spike happened at negative infinity time. This is repeated for all layers until the last layer. For the last layer, the inputs are accumulated over all time steps and passed through a softmax layer to compute the multi-class probability. The cross-entropy loss function is defined on the output of the softmax, and the weights are updated by performing the temporal and spatial credit assignment according to the STDB rule.
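The Poisson input encoding and the per-mini-batch dropout mask described in this section can be sketched as follows. The helper names are hypothetical, the mask here is a plain 0/1 mask with no rescaling of the surviving inputs (an assumption; the text only states that the input is multiplied element-wise by the mask), and a seeded `random.Random` is used so that the example is repeatable.

```python
import random


def poisson_spike_train(pixels, num_steps, rng):
    """Encode normalized pixel values in [0, 1] as binary spike trains.

    At every time step a random number is drawn per pixel, and a spike is
    emitted when that number is less than the pixel value.
    """
    train = []
    for _ in range(num_steps):
        train.append([1 if rng.random() < p else 0 for p in pixels])
    return train


def make_dropout_mask(size, drop_prob, rng):
    """Draw one 0/1 mask; it is reused for every time step of the input."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(size)]


def apply_dropout(spike_train, mask):
    """Multiply each time step of the spike train element-wise by the same mask."""
    return [[s * m for s, m in zip(step, mask)] for step in spike_train]


if __name__ == "__main__":
    rng = random.Random(0)
    train = poisson_spike_train([0.0, 1.0, 0.5], num_steps=2000, rng=rng)
    rates = [sum(step[i] for step in train) / len(train) for i in range(3)]
    print("empirical rates:", rates)
    mask = make_dropout_mask(3, drop_prob=0.5, rng=rng)
    print("mask:", mask, "first masked step:", apply_dropout(train, mask)[0])
```

Averaged over many time steps, the empirical firing rate of each pixel approaches its normalized value, and because the mask is drawn once, a dropped unit stays silent for the whole duration of the input.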
6 Experiments
We tested the proposed training mechanism on image classification tasks from the CIFAR (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) datasets. The results are summarized in Table 1.
CIFAR10: The dataset consists of 60,000 labeled images of 10 categories divided into training (50,000) and testing (10,000) sets. The images are of size 32 × 32 with RGB channels.
CIFAR100: The dataset is similar to CIFAR10 except that it has 100 categories.
ImageNet: The dataset comprises labeled high-resolution images, 1.2 million for training and 50,000 for validation, with 1,000 categories.
7 Energy-Delay Product Analysis of SNNs
A single spike in an SNN consumes a constant amount of energy (Cao et al., 2015). To first order, the energy-delay product of an SNN depends on the number of spikes and the total number of time steps. Fig. 3 shows the average number of spikes in each layer when evaluated on samples from the CIFAR10 test set for the VGG16 architecture. The average is computed by summing all the spikes in a layer over the simulated time steps and dividing by the number of neurons in that layer. The resulting value is the number of times each neuron in that layer spikes over the simulated period, on average over all input samples. Higher spiking activity corresponds to lower energy-efficiency. The average number of spikes is compared for a converted SNN and an SNN trained with conversion-and-STDB. The SNN trained with conversion-and-STDB has fewer average spikes over all layers under iso conditions (time steps, threshold voltages, inputs, etc.) and achieves higher accuracy compared to the converted SNN. Simulating the converted SNNs for a larger number of time steps further degrades the energy-delay product with minimal increase in accuracy (Sengupta et al., 2019).
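The per-layer spiking-activity metric described above is a simple ratio, sketched below. The function names and the toy spike record are assumptions for illustration; the record is a list over time steps of binary spike vectors for one layer and one input sample.

```python
def average_spikes_per_neuron(layer_spikes):
    """Total spikes in a layer over all time steps, divided by the number of neurons."""
    num_neurons = len(layer_spikes[0])
    total = sum(sum(step) for step in layer_spikes)
    return total / num_neurons


def average_over_samples(per_sample_records):
    """Mean of the per-sample averages for one layer."""
    values = [average_spikes_per_neuron(record) for record in per_sample_records]
    return sum(values) / len(values)


if __name__ == "__main__":
    # 3 time steps, 4 neurons: 6 spikes in total, so 1.5 spikes per neuron.
    record = [[1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 1, 0]]
    print(average_spikes_per_neuron(record))
```

Since each spike is assumed to cost a constant amount of energy, this average multiplied by the number of neurons gives a first-order proxy for the energy spent in that layer per inference.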
8 Related Work
Bohte et al. (2000) proposed a method to directly train an SNN by keeping track of the membrane potential of spiking neurons only at spike times and backpropagating the error at spike times based only on the membrane potential. This method is not suitable for networks with sparse activity due to the ‘dead neuron’ problem: no learning happens when the neurons do not spike. In our work, we need one spike for the learning to start, but the gradient contribution continues in later time steps as shown in Fig. 1. Zenke and Ganguli (2018) derived a surrogate-gradient-based method on the membrane potential of a spiking neuron at a single time step only. The error was backpropagated at only one time step, and only the input at that time step contributed to the gradient. This method neglects the effect of earlier spike inputs. In our approach, the error is backpropagated for every time step and the weight update is performed on the gradients summed over all time steps. Shrestha and Orchard (2018) proposed a gradient function similar to the one proposed in this work. They used the difference between the membrane potential and the threshold to compute the gradient, in contrast to the difference in spike timing used in this work. The membrane potential is a continuous value, whereas the spike time is an integer value bounded by the number of time steps. Therefore, gradients that depend on spike time can be pre-computed and stored in a look-up table for faster computation. They evaluated their approach on shallow architectures with two convolution layers on the MNIST dataset. In this work, we trained deep SNNs with multiple stacked layers for complex classification tasks. Wu et al. (2018) performed backpropagation through time on an SNN with a surrogate gradient defined on the membrane potential. The surrogate gradient was defined as a piece-wise linear or exponential function of the membrane potential.
The other surrogate gradients proposed in the literature are all computed on the membrane potential (Neftci et al., 2019). Lee et al. (2019) approximated the neuron output as a continuous low-pass filtered spike train. They used this approximated continuous value to perform backpropagation. Most of the works in the literature on direct training of SNNs or on conversion-based methods have been evaluated on shallow architectures for simple classification problems. In Table 2 we compare our model with the models that reported accuracy on the CIFAR10 and ImageNet datasets. Wu et al. (2019) achieved convergence by using a dedicated encoding layer to capture the input precision. It is beyond the scope of this work to compute the hardware and energy implications of such an encoding layer. Our model performs better than all other models with far fewer time steps.
Model  Dataset  Training Method  Architecture  Accuracy  Time steps 

Hunsberger and Eliasmith (2015)  CIFAR10  ANN-SNN Conversion  2 Conv, 2 Linear  
Cao et al. (2015)  CIFAR10  ANN-SNN Conversion  3 Conv, 2 Linear  
Sengupta et al. (2019)  CIFAR10  ANN-SNN Conversion  VGG16  
Lee et al. (2019)  CIFAR10  Spiking BP  VGG9  
Wu et al. (2019)  CIFAR10  Surrogate Gradient  5 Conv, 2 Linear  
This work  CIFAR10  Hybrid Training  VGG16  
Sengupta et al. (2019)  ImageNet  ANN-SNN Conversion  VGG16  
This work  ImageNet  Hybrid Training  VGG16 
9 Conclusions
The direct training of SNNs with backpropagation is computationally expensive and slow, whereas ANN-SNN conversion suffers from high latency. To address this issue, we proposed a hybrid training technique for deep SNNs. We took an SNN converted from an ANN and used its weights and thresholds as the initialization for spike-based backpropagation. We then performed spike-based backpropagation on this initialized network to obtain an SNN that can perform inference with fewer time steps. The number of epochs required to train the SNN was also reduced by having a good initial starting point. The resultant trained SNN had higher accuracy and fewer spikes per inference compared to purely converted SNNs at a reduced number of time steps. The backpropagation through time was performed with a surrogate gradient defined using the neuron’s spike time, which captured the temporal information and helped in reducing the number of time steps. We tested our algorithm on the CIFAR and ImageNet datasets and achieved state-of-the-art performance with fewer time steps.
Acknowledgments
This work was supported in part by the National Science Foundation, in part by the Vannevar Bush Faculty Fellowship, and in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
References
Bellec et al. (2018). Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pp. 787–797.
Bohte et al. (2000). SpikeProp: backpropagation for networks of spiking neurons. In ESANN, pp. 419–424.
Boureau et al. (2010). A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111–118.
Cao et al. (2015). Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113 (1), pp. 54–66.
Deng et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Diehl et al. (2015). Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
Diehl et al. (2016). Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In 2016 IEEE International Conference on Rebooting Computing (ICRC), pp. 1–8.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Huh and Sejnowski (2018). Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pp. 1433–1443.
Hunsberger and Eliasmith (2015). Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
Lee et al. (2019). Enabling spike-based backpropagation in state-of-the-art deep neural network architectures. arXiv preprint arXiv:1903.06379.
Neftci et al. (2019). Surrogate gradient learning in spiking neural networks. arXiv preprint arXiv:1901.09948.
Sengupta et al. (2019). Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience 13.
Shrestha and Orchard (2018). SLAYER: spike layer error reassignment in time. In Advances in Neural Information Processing Systems, pp. 1412–1421.
Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song et al. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3 (9), pp. 919.
Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
Werbos and others (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), pp. 1550–1560.
Wu et al. (2018). Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience 12.
Wu et al. (2019). Direct training for spiking neural networks: faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1311–1318.
Zenke and Ganguli (2018). SuperSpike: supervised learning in multilayer spiking neural networks. Neural Computation 30 (6), pp. 1514–1541.
Appendix A Comparisons with other Surrogate Gradients
The transfer function of the spiking neuron is a step function, and its derivative is zero everywhere except at the time of spike, where it is not defined. In order to perform backpropagation with spiking neurons, several approximations have been proposed for the gradient function (Bellec et al., 2018; Zenke and Ganguli, 2018; Shrestha and Orchard, 2018; Wu et al., 2018). These approximations are either a linear or an exponential function of $(u - V_t)$, where $u$ is the membrane potential and $V_t$ the threshold voltage (Fig. 4). These approximations are referred to as surrogate gradients or pseudo-derivatives. In this work, we proposed an approximation that is computed using the spike timing of the neuron (Equation 11). We compare our proposed approximation with the following surrogate gradients:

$$\frac{\partial o^t}{\partial u^t} = \alpha \max\{0,\ 1 - |u^t - V_t|\} \qquad (13)$$

$$\frac{\partial o^t}{\partial u^t} = \alpha e^{-\beta |u^t - V_t|} \qquad (14)$$

where $o^t$ is the binary output of the neuron, $u^t$ is the membrane potential, $V_t$ is the threshold potential, and $\alpha$ and $\beta$ are constants. Equation 13 and Equation 14 represent the linear and exponential approximations of the gradient, respectively. We employed these approximations in the hybrid training of a VGG9 network on the CIFAR10 dataset. All the approximations (Equations 11, 13, and 14) produced similar results in terms of accuracy and number of epochs for convergence. This shows that the term $\Delta t$ (Equation 11) is a good replacement for $|u - V_t|$ (Equation 14). The behaviour of $\Delta t$ and $|u - V_t|$ is similar, i.e., it is small closer to the time of spike and increases as we move away from the spiking event. The advantage of using $\Delta t$ is that its domain is bounded by the total number of time steps (Equation 12). Hence, all possible values of the gradient can be pre-computed and stored in a table for faster access during training. This is not possible for the membrane potential because it is a real value computed from the stochastic inputs and the previous state of the neuron, which is not known beforehand. The exact benefit in energy from the pre-computation depends on the overall system architecture, and evaluating it is beyond the scope of this paper.
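The three kinds of surrogate gradient compared in this appendix can be placed side by side in code. The sketch below assumes commonly used forms: a piece-wise linear function and an exponential function of the absolute distance between the membrane potential and the threshold, and an exponential function of the time since the last spike. The function names and the constants are hypothetical and chosen only for illustration.

```python
import math


def linear_surrogate(u, threshold, alpha):
    """Piece-wise linear surrogate: peaks at the threshold, zero beyond distance 1."""
    return alpha * max(0.0, 1.0 - abs(u - threshold))


def exponential_surrogate(u, threshold, alpha, beta):
    """Exponential surrogate of the distance between potential and threshold."""
    return alpha * math.exp(-beta * abs(u - threshold))


def timing_surrogate(dt, alpha, beta):
    """Spike-timing surrogate: decays with the integer time since the last spike."""
    return alpha * math.exp(-beta * dt)


if __name__ == "__main__":
    alpha, beta, threshold = 0.3, 1.0, 1.0  # illustrative constants
    for u in (0.0, 0.5, 1.0, 1.5, 3.0):
        print(u, linear_surrogate(u, threshold, alpha),
              exponential_surrogate(u, threshold, alpha, beta))
    # Only a finite set of values is possible for the timing-based form.
    print([timing_surrogate(dt, alpha, beta) for dt in range(4)])
```

The first two functions take a real-valued argument, so their outputs cannot be enumerated ahead of time, whereas the timing-based function takes one of a bounded set of integers and can be tabulated once before training.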
Appendix B Comparisons of Simulation Time and Memory Requirements
The simulation time and memory requirements for ANNs and SNNs are very different. An SNN requires many more resources to iterate over multiple time steps and store the membrane potential for each neuron. Fig. 5 shows the training and inference time and memory requirements for an ANN, an SNN trained with backpropagation from scratch, and an SNN trained with the proposed hybrid technique. The performance was evaluated for the VGG16 architecture trained on the CIFAR10 dataset. The SNN trained from scratch and the SNN trained with hybrid conversion-and-STDB are evaluated for 100 time steps. One epoch of ANN training (inference) takes () minutes and () GB of GPU memory. On the other hand, one epoch of SNN training (inference) takes () minutes and () GB of GPU memory for the same hardware and mini-batch size. The ANN and the SNN trained from scratch reached convergence after 250 epochs. The hybrid technique requires epochs of ANN training and epochs of spike-based backpropagation. The hybrid training technique is one order of magnitude faster than training an SNN from scratch. The memory requirement for the hybrid technique is the same as for the SNN, as we need to perform fine-tuning with spike-based backpropagation.