I Introduction
Spiking Neural Networks (SNNs) attempt to emulate the remarkable energy efficiency of the brain in vision, perception, and cognition-related tasks using event-driven neuromorphic hardware [13]. Neurons in an SNN exchange information via discrete binary spikes, representing a significant paradigm shift from high-precision, continuous-valued deep neural networks (DNNs) [26, 1]. Due to their high activation sparsity and use of accumulates (ACs) instead of expensive multiply-and-accumulates (MACs), SNNs have emerged as a promising low-power alternative to DNNs, whose hardware implementations are typically associated with high compute and memory costs.

Because SNNs receive and transmit information via spikes, analog inputs have to be encoded with a sequence of spikes. Multiple encoding methods have been proposed, such as rate coding [6], temporal coding [2], rank-order coding [14], and others. However, recent works [27, 32, 18] showed that, instead of converting the image pixel values into spike trains, directly feeding the analog pixel values into the first convolutional layer, and thereby emitting spikes only in the subsequent layers, can reduce the number of time steps needed to achieve SOTA accuracy by an order of magnitude. Although the first layer then requires MACs, as opposed to the cheaper ACs in the remaining layers, the overhead is negligible for deep convolutional architectures. Hence, we adopt this technique, termed direct encoding, in this work.
In addition to accommodating various forms of input encoding, supervised learning algorithms for SNNs have overcome various roadblocks associated with the discontinuous derivative of the spike activation function [21, 15]. Moreover, SNNs can be converted from DNNs with low error by approximating the activation value of ReLU neurons with the firing rate of spiking neurons [30]. SNNs trained using DNN-to-SNN conversion, coupled with supervised training, have been able to perform similarly to SOTA DNNs in terms of test accuracy on traditional image recognition tasks [27, 28]. However, the training effort remains high because SNNs need multiple time steps (at least with direct encoding [27]) to process an input, and hence the backpropagation step requires the gradients of the unrolled SNN to be integrated over all these time steps, which significantly increases the memory cost [24]. Moreover, the multiple forward passes result in an increased number of spikes, which degrades the SNN's energy efficiency, both during training and inference, and possibly offsets the compute advantage of the ACs. This motivates our exploration of novel training algorithms that reduce both the test error of a DNN and the conversion error to an SNN, while keeping the number of time steps extremely small during both training and inference.

In summary, the current challenges in SNNs are multiple time steps, large spiking activity, and high training effort, both in terms of compute and memory. To address these challenges, this paper makes the following contributions.

We propose a novel DNN-to-SNN conversion and fine-tuning algorithm that reduces the conversion error at ultra-low latencies by accurately capturing the DNN and SNN pre-activation distributions and thus minimizing the difference between the SNN and DNN activation functions.

We demonstrate the latency-accuracy trade-off benefits of our proposed framework through extensive experiments with both VGG [31] and ResNet [10] variants of deep SNN models on CIFAR-10 and CIFAR-100 [16]. We benchmark and compare the models' training time, memory requirements, and inference energy efficiency on both GPU and neuromorphic hardware against two SOTA low-latency SNNs.¹

¹We use VGG16 on CIFAR-10 and CIFAR-100 to show compute efficiency.
The remainder of this paper is organized as follows. Section II provides background on DNNs and SNNs and the SOTA DNN-to-SNN conversion techniques. Section III explains why these techniques fail for ultra-low SNN latencies and presents our proposed methodology. Our accuracy and latency results are presented in Section IV, and our analyses of training resources and inference energy efficiency are presented in Sections V and VI, respectively. The paper concludes in Section VII.
II Background
II-A Difference between DNNs and SNNs
Neurons in a non-spiking DNN integrate weight-modulated analog inputs and apply a nonlinear activation function. Although ReLU is widely used as the activation function, previous work [11] has proposed a trainable threshold term, μ, for similarity with SNNs. In particular, the neuron outputs with threshold ReLU can be expressed as

y = min(max(z, 0), μ),     (1)

where z = Σ_j w_j x_j, and x_j and w_j denote the outputs of the neurons in the preceding layer and the weights connecting the two layers, respectively. The gradients of μ are estimated using gradient descent during the backward computations of the DNN.
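As a concrete illustration, the threshold-ReLU forward pass and its gradient with respect to the threshold can be sketched in a few lines of NumPy (the function names below are our own, not from [11]):

```python
import numpy as np

def threshold_relu(z, mu):
    """Threshold ReLU (Eq. 1): ReLU output clipped at the trainable threshold mu."""
    return np.clip(z, 0.0, mu)

def threshold_relu_grad_mu(z, mu):
    """Gradient of the output w.r.t. mu: nonzero only where the
    pre-activation exceeds the threshold (the clipped region)."""
    return (z > mu).astype(float)

z = np.array([-1.0, 0.5, 2.0])
out = threshold_relu(z, 1.0)           # negative input zeroed, large input clipped
grad = threshold_relu_grad_mu(z, 1.0)  # only the clipped entry contributes to mu
```

During training, the gradient above is summed over the batch to update μ alongside the weights.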
On the other hand, the computation dynamics of an SNN are typically represented by the popular Leaky-Integrate-and-Fire (LIF) model [20], where a neuron transmits binary spike trains (except in the input layer for direct encoding) over multiple time steps (S^t = 1 denotes the presence of a spike at time step t). To account for the temporal dimension of the inputs, each neuron has an internal state called a membrane potential, U^t, which captures the integration of the incoming (pre-neuron) spikes S_j^t modulated by weights w_j and leaks with a fixed time constant. Each neuron emits an output spike whenever U^t crosses a spiking threshold V^th, after which U^t is reduced by V^th. This behavior of the membrane potential and output O^t can be expressed as

U_temp^t = λ U^{t−1} + Σ_j w_j S_j^t,     (2)
O^t = 1 if U_temp^t > V^th, otherwise 0,     (3)
U^t = U_temp^t − O^t V^th,     (4)

where λ denotes the leak term. When λ = 1, the SNN model is termed Integrate-and-Fire (IF).
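A minimal NumPy sketch of these dynamics (helper names are ours; the reset in Eq. 4 is a "soft" reset that subtracts the threshold rather than zeroing the potential):

```python
import numpy as np

def lif_step(u, w, s_in, v_th=1.0, leak=1.0):
    """One LIF time step (Eqs. 2-4). leak=1.0 gives the IF model."""
    u = leak * u + np.dot(w, s_in)   # Eq. 2: leak, then integrate weighted spikes
    o = 1.0 if u > v_th else 0.0     # Eq. 3: fire when the threshold is crossed
    u -= o * v_th                    # Eq. 4: soft reset by the threshold
    return u, o

# Drive an IF neuron with the same binary input spikes for 4 time steps.
u, spikes = 0.0, []
for _ in range(4):
    u, o = lif_step(u, w=np.array([0.4, 0.3]), s_in=np.array([1.0, 1.0]))
    spikes.append(o)
# The neuron accumulates 0.7 per step and fires whenever the potential
# exceeds the threshold of 1.0, yielding the train [0, 1, 1, 0].
```

The soft reset preserves the residual potential above the threshold, which is what makes the rate-based conversion analysis in Section II-B possible.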
II-B DNN-to-SNN Conversion
Previous research has demonstrated that SNNs can be converted from DNNs with negligible accuracy drop by approximating the activation value of ReLU neurons with the firing rate of IF neurons using a threshold balancing technique that copies the weights from the source DNN to the target SNN [1, 29, 5, 30]. Since this technique uses the standard backpropagation algorithm for DNN training, and thus involves only a single forward pass to process a single input, the training procedure is simpler than the approximate gradient techniques used to train SNNs from scratch. However, the key disadvantage of DNN-to-SNN conversion is that it yields SNNs with much higher latency than other techniques. Some previous research [9, 22] proposed to downscale the threshold term to train low-latency SNNs, but the scaling factor was either a hyperparameter or obtained via linear grid search, and the latency needed for convergence still remained large.

To further reduce the conversion error, [4] minimized the difference between the DNN and SNN post-activation values for each layer. To do this, the activation function of the IF SNN must first be derived [4, 22]. We assume that the initial membrane potential of a layer is 0. Moreover, we let a^l = (V^th_l / T) Σ_{t=1}^{T} O^{l,t} be the average SNN output of layer l, where O^{l,t} is the discrete output at the t-th time step and T is the total number of time steps. Then,

a^l = (V^th_l / T) · clip( ⌊ W^l a^{l−1} T / V^th_l ⌋, 0, T ),     (5)

where V^th_l and W^l denote the layer threshold and weight matrix, respectively. Eq. 5 is illustrated in Fig. 1(a) by the piecewise staircase function of the SNN activation.
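Under the stated assumptions (zero initial potential, IF dynamics), the staircase activation of Eq. 5 can be written directly; a small sketch for intuition (the function name is ours):

```python
import numpy as np

def snn_activation(z, v_th, T):
    """Average IF output over T steps (Eq. 5): a staircase with step
    height v_th/T that approximates a ReLU clipped at v_th."""
    return (v_th / T) * np.clip(np.floor(z * T / v_th), 0, T)

# With more time steps the staircase approaches the clipped ReLU, so the
# per-layer conversion error shrinks as T grows -- and grows as T shrinks.
z = np.linspace(-0.5, 1.5, 201)
coarse = snn_activation(z, v_th=1.0, T=2)   # only 2 nonzero output levels
fine = snn_activation(z, v_th=1.0, T=64)    # near-continuous staircase
```

Plotting `coarse` against `np.clip(z, 0, 1)` reproduces the gap between the two activation functions that the rest of this section analyzes.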
Reference [4] also proved that the average difference in the post-activation values can be reduced by adding a bias term to shift the SNN activation curve to the left by V^th/2T, as shown in Fig. 1(a), assuming both the DNN and SNN pre-activation values are uniformly and identically distributed. To further reduce the difference, [4] added a non-trainable threshold equal to the maximum DNN pre-activation value to the ReLU activation function in each layer and equated it with the SNN spiking threshold, which ensures zero difference between the DNN and SNN post-activation values when the DNN pre-activation values exceed this threshold. However, the maximum pre-activation value is an outlier, and the vast majority of the pre-activation values lie well below it. Hence, we propose to use the ReLU activation with a trainable threshold μ^l for each layer, as discussed in Section II-A and shown in Fig. 1(a). This trainable threshold, as described below, also helps reduce the average difference for non-uniform DNN pre-activation distributions.

III Proposed Training Framework
In this section, we analytically and empirically show that the SOTA conversion strategies, along with our proposed modification described above, fail to attain SOTA SNN test accuracy at small numbers of time steps. We then propose a novel conversion algorithm that scales the SNN threshold and post-activation values to reduce the conversion error for small T.
III-A Why Does Conversion Fail for Ultra-Low Latencies?
Even though we can minimize the difference between the DNN and SNN post-activation values with bias addition and thresholding, in practice the SNNs obtained are still not as accurate as their iso-architecture DNN counterparts when T decreases substantially. We empirically show this trend for VGG and ResNet architectures on the CIFAR dataset in Fig. 2. This is due to the flawed baseline assumption that the DNN and SNN pre-activations are uniformly distributed. Both distributions are rather skewed (i.e., most of the values are close to zero), as illustrated in Fig. 1(a).

To see this analytically, let us assume the DNN and SNN pre-activation probability density functions are p(z) and q(z), and denote the corresponding post-activation values as a_DNN and a_SNN, respectively. Assuming the SNN threshold equals the trainable threshold μ derived from DNN training, the expected difference in the post-activation values for a particular layer can be written as

E[Δ] ≈ ∫_0^μ a_DNN(z) p(z) dz − ∫_0^μ a_SNN(z) q(z) dz,     (6)

where the approximation follows because the overwhelming majority of the mass of both p and q lies below μ. The expression simplifies further because a_DNN(z) = z when z ≤ μ. The final form introduces a term that captures the bias shift of V^th/2T, together with the observation that the staircase integrand lies between its upper and lower integral limits and can therefore be rewritten with a factor κ in the range [0, 1], whose exact value depends on the distribution q.
Assuming the DNN and SNN pre-activation distributions are identical, Eq. 6 can then be simplified into a two-term expression (Eq. 7). When p and q are uniformly distributed in the range [0, μ], the two terms of Eq. 7 cancel and the expected difference evaluates to zero, which implies the error can be completely eliminated, as also concluded in [4].
However, when the distributions are skewed, we observe that while the first term of Eq. 7 is independent of T, the second term decreases significantly as we reduce T below around 5, as shown in the inset of Fig. 1(a). Intuitively, for small T, most of the probability density of q lies to the left of the first staircase step, due to its sharply decreasing nature. Consequently, the remaining area under the curve captured by the SNN activation becomes negligible, reducing the number of output spikes significantly. Hence, for ultra-low SNN latencies, the error per layer remains significant and accumulates over the network.

This analysis explains the accuracy gap observed between original DNNs and their converted SOTA SNNs for small T, as exemplified in Fig. 2. Moreover, training with a non-trainable threshold [4] can be modeled by replacing μ with the maximum pre-activation value in Eq. 7, which further increases the error, as observed from the increased accuracy degradation shown in Fig. 2.
III-B Conversion & Fine-tuning for Ultra-Low-Latency SNNs
While Eq. 7 suggests that we can tune the threshold to compensate for low T, this introduces other errors. In particular, if we replace V^th with a downscaled version² αV^th, with α < 1, the SNN activation curve will shift left, as shown in Fig. 1(b), and there will be an additional difference between the DNN and SNN post-activation values that stems from pre-activation values in the range between αV^th and V^th.

²Upscaling further reduces the output spike count and increases the error.

To mitigate this additional error term, we propose to also optimize the step size of the SNN activation function in the y-direction by modifying the IF model of Eq. 3 to

O^t = β if U_temp^t > V^th, otherwise 0,     (8)

which introduces another scaling factor, β, illustrated in Fig. 1(b). Moreover, we remove the bias term, since it complicates the parameter space exploration and poses difficulty in training the SNNs after conversion. This results in a new difference function between the DNN and SNN post-activation values. Thus, our task reduces to finding the α and β that minimize this difference for a given low T.
Since it is difficult to analytically compute this difference to guide SNN conversion, we empirically estimate it using the activations of a particular layer of the trained DNN, discretizing the candidate values of α into percentiles of those activations. In particular, for each candidate α, we sweep β over a linear grid, as shown in Algorithm 1. This percentile-based approach for α is better than a linear search because it enables a finer-grained analysis in the range where the pre-activation values lie with higher likelihood. We find the (α, β) pair that yields the lowest empirical difference for each DNN layer.
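A sketch of this per-layer search (our own simplification: the exact percentile set and β grid of Algorithm 1 may differ, and the error model below reuses the staircase of Eq. 5 with the output scale β):

```python
import numpy as np

def conversion_error(z, mu, v_th, beta, T):
    """Empirical mean |DNN - SNN| post-activation difference for one layer."""
    dnn = np.clip(z, 0.0, mu)
    snn = beta * (v_th / T) * np.clip(np.floor(z * T / v_th), 0, T)
    return np.abs(dnn - snn).mean()

def search_alpha_beta(z, mu, T):
    """Search alpha over percentiles of the positive pre-activations
    (finer where the skewed density is concentrated) and beta over a
    linear grid; return the (alpha, beta) pair with the lowest error."""
    alphas = np.percentile(z[z > 0], np.arange(5, 100, 5)) / mu
    betas = np.linspace(0.5, 1.5, 11)
    return min(((conversion_error(z, mu, a * mu, b, T), a, b)
                for a in alphas for b in betas))[1:]

# Skewed (exponential-like) pre-activations, as observed empirically.
rng = np.random.default_rng(0)
z = rng.exponential(scale=0.2, size=10_000) - 0.05
alpha, beta = search_alpha_beta(z, mu=1.0, T=2)
```

For such a skewed distribution at T = 2, the search settles on a threshold scale well below 1, which is exactly the regime where the unscaled staircase emits almost no spikes.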
For DNN-to-SNN conversion, we copy the SNN weights from a pretrained DNN with trainable threshold μ^l, set each layer threshold to the scaled value αμ^l, and produce an output whenever the membrane potential crosses this threshold. Although we incur an overhead of two additional parameters per SNN layer, the parameter increase is negligible compared to the total number of weights. Moreover, as the outputs for each time step are either 0 or β, we can absorb the scaling factor into the weight values, avoiding the need for explicit multiplication. After conversion, we apply surrogate gradient learning (SGL) in the SNN domain, where we jointly fine-tune the threshold, leak, and weights [27]. To approximate the gradient of the discontinuous spike activation, we compute a surrogate gradient, which is used to estimate the gradients of the trainable parameters [27].
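The weight-absorption trick can be checked in a few lines: because each output takes one of two values (zero or the learned scale, which we call β here), scaling the next layer's weights once offline is mathematically identical to multiplying every spike by β at run time (a NumPy sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
spikes = rng.integers(0, 2, size=8).astype(float)  # binary layer outputs
w_next = rng.normal(size=(4, 8))                   # next layer's weights
beta = 0.7                                         # learned output scale

z_runtime = w_next @ (beta * spikes)  # explicit per-spike multiplication
z_folded = (beta * w_next) @ spikes   # beta absorbed into weights offline
assert np.allclose(z_runtime, z_folded)
```

Because the folding happens once at conversion time, the hidden layers still need only accumulate operations during inference.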
IV Experimental Results
IV-A Experimental Setup
Since we omit the bias term during DNN-to-SNN conversion, as described in Section III-B, we avoid batch normalization and instead use dropout as the regularizer for both ANN and SNN training. Although prior works [27, 30, 28] claim that max pooling incurs information loss for binary-spike-based activation layers, we use max pooling because it improves the accuracy of both the baseline DNN and the converted SNN. Moreover, max pooling layers produce binary spikes at the output, which ensures that the SNN requires only AC operations in all the hidden layers [7], thereby improving energy efficiency.

We performed the baseline DNN training with an initial learning rate (LR) that decays by a fixed factor at set fractions of the total number of epochs. Initialized with the layer thresholds and post-activation values obtained from conversion, we then performed SNN training with direct input encoding on CIFAR-10 and CIFAR-100, using a smaller starting LR with a similar decay schedule. All experiments are performed on an Nvidia GPU.
TABLE I

Architecture | Number of time steps | a. DNN accuracy (%) | b. Accuracy (%) with DNN-to-SNN conversion | c. Accuracy (%) after SNN training

Dataset: CIFAR-10
VGG11    | 2 | 90.76 | 65.82 | 89.39
VGG11    | 3 | 91.10 | 78.76 | 89.79
VGG16    | 2 | 93.26 | 69.58 | 91.79
VGG16    | 3 | 93.26 | 85.06 | 91.93
ResNet20 | 2 | 93.07 | 61.96 | 90.00
ResNet20 | 3 | 93.07 | 73.57 | 90.06

Dataset: CIFAR-100
VGG16    | 2 | 68.45 | 19.57 | 64.19
VGG16    | 3 | 68.45 | 36.84 | 63.92
ResNet20 | 2 | 63.88 | 19.85 | 57.81
ResNet20 | 3 | 63.88 | 31.43 | 59.29
IV-B Classification Accuracy & Latency
We evaluated the performance of these networks on multiple VGG and ResNet architectures, namely VGG11, VGG16, and ResNet20 for CIFAR-10, and VGG16 and ResNet20 for CIFAR-100. We report the (a) baseline DNN accuracy, (b) SNN accuracy with our proposed DNN-to-SNN conversion, and (c) SNN accuracy with conversion followed by SGL, for 2 and 3 time steps. Note that the models reported in (b) are far from SOTA, but act as a good initialization for SGL.
Table II compares the performance of models generated through our training framework with SOTA deep SNNs. On CIFAR-10, our approach requires fewer time steps than the SOTA VGG-based SNN [27] with a negligible drop in test accuracy. To the best of our knowledge, our results represent the first successful training and inference of CIFAR-100 on an SNN with only 2 time steps, reducing latency compared to prior approaches.
Ablation Study: The threshold scaling heuristics proposed in [22, 9], coupled with SGL, lead to chance-level test accuracy on CIFAR-10 and CIFAR-100 with both 2 and 3 time steps. Also, our scaling technique alone (without SGL) requires fewer time steps than the SOTA conversion approach [4] to obtain similar test accuracy.

V Simulation Time & Memory Requirements
Because SNNs require iteration over multiple time steps and storage of the membrane potentials for each neuron, their simulation time and memory requirements can be substantially higher than those of their DNN counterparts. However, reducing their latency can bridge this gap significantly, as shown in Fig. 3. On average, our low-latency, 2-time-step SNNs substantially reduce training and inference time per epoch compared to the hybrid training approach [27], which represents the SOTA in latency, under iso-batch conditions. Also, our proposal uses less GPU memory than [27] during training, while the inference memory usage remains almost identical.
TABLE II

Authors | Training type | Architecture | Accuracy (%) | Time steps

Dataset: CIFAR-10
Wu et al. (2019) [32]    | Surrogate gradient     | 5 CONV, 2 linear | 90.53 | 12
Rathi et al. (2020) [27] | Hybrid training        | VGG              | 92.70 | 5
Kundu et al. (2021) [19] | Hybrid training        | VGG              | 92.74 | 10
Deng et al. (2021) [4]   | DNN-to-SNN conversion  | VGG16            | 92.29 | 16
This work                | Hybrid training        | VGG16            | 91.79 | 2

Dataset: CIFAR-100
Kundu et al. (2021) [19] | Hybrid training        | VGG CNN          | 65.34 | 10
Deng et al. (2021) [4]   | DNN-to-SNN conversion  | VGG              | 65.94 | 16
This work                | Hybrid training        | VGG16            | 64.19 | 2
VI Energy Consumption During Inference
VI-A Spiking Activity
As suggested in [17, 3], the average spiking activity of an SNN layer can be used as a measure of the compute energy of the model during inference. It is computed as the ratio of the total number of spikes over T steps across all the neurons of the layer to the total number of neurons in that layer. Fig. 4(a) shows the per-image average number of spikes for each layer with our proposed algorithm (using both 2 and 3 time steps), the hybrid training algorithm of [27] (with 5 time steps), and the SOTA conversion algorithm [4] (with 16 time steps), while classifying CIFAR-10 and CIFAR-100 using VGG16. On average, our approach yields a substantial reduction in spike count compared to both [27] and [4].

VI-B Floating Point Operations (FLOPs) & Compute Energy
We use FLOP count to capture the energy efficiency of our SNNs, since each emitted spike indicates which weights need to be accumulated at the post-synaptic neurons and results in a fixed number of AC operations. This, coupled with the MAC operations required for direct encoding in the first layer (also used in [27, 4]), dominates the total number of FLOPs. For DNNs, FLOPs are dominated by the MAC operations in all the convolutional and linear layers. Assuming E_MAC and E_AC denote the MAC and AC energy, respectively, the inference compute energy of the baseline DNN model can be computed as E_DNN = Σ_l FL^l_DNN · E_MAC, whereas that of the SNN model is E_SNN = FL^1_SNN · E_MAC + Σ_{l>1} FL^l_SNN · E_AC, where FL^l_DNN and FL^l_SNN are the FLOP counts in the l-th layer of the DNN and SNN, respectively.
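As an illustration, the two energy models can be sketched with assumed per-operation energies (the pJ values below are commonly quoted 45 nm figures for 32-bit integer operations, used here only as placeholders; the layer FLOP counts are hypothetical):

```python
E_MAC, E_AC = 3.2, 0.1  # pJ; assumed 45 nm, 32-bit integer values

def dnn_energy(flops):
    """All DNN layers perform MACs."""
    return E_MAC * sum(flops)

def snn_energy(flops):
    """Direct encoding: MACs in the first layer, ACs elsewhere (the AC
    count is proportional to the emitted spikes)."""
    return E_MAC * flops[0] + E_AC * sum(flops[1:])

layer_flops = [1e6, 4e6, 4e6, 2e6]  # hypothetical per-layer FLOP counts
ratio = snn_energy(layer_flops) / dnn_energy(layer_flops)  # SNN/DNN energy
```

Because only the first layer pays the MAC cost, the ratio drops quickly as the network deepens and the spike counts shrink.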
Fig. 4(b) and (c) illustrate the FLOP counts and compute energy consumption for our baseline DNN and SNN models of VGG16 while classifying the CIFAR datasets, along with the SOTA comparisons [27, 4]. As we can see, the number of FLOPs for our low-latency SNN is smaller than that of an iso-architecture DNN and of the SNNs obtained in prior works. Moreover, ACs consume significantly less energy than MACs on GPUs as well as on neuromorphic hardware. To estimate the compute energy, we assume a 45 nm CMOS process with 32-bit integer representation, in which a multiplication consumes substantially more energy than an addition [12]. Under these assumptions, for CIFAR-10, our proposed SNN consumes lower compute energy than its DNN counterpart and lower energy than [27] and [4]. For CIFAR-100, we observe similar improvements over the baseline DNN, the 5-step hybrid SNN, and the 16-step optimally converted SNN.
On custom neuromorphic architectures, such as TrueNorth [23] and SpiNNaker [8], the total energy is estimated as a combination of compute and communication energy, with parameters normalized for TrueNorth and SpiNNaker, respectively [25]. Since the compute term dominates for a network as large as VGG16, the total energy of a deep SNN on neuromorphic hardware is compute-bound, and thus we would see similar energy improvements there.
VII Conclusions
This paper shows that current DNN-to-SNN conversion algorithms cannot achieve ultra-low latencies because they rely on simplistic assumptions about the DNN and SNN pre-activation distributions. The paper then proposes a novel training algorithm, inspired by the empirically observed distributions, that can more effectively optimize the SNN thresholds and post-activation values. This approach enables training of SNNs with as few as 2 time steps and without any significant degradation in accuracy for complex image recognition tasks. The resulting SNNs are estimated to consume substantially lower energy than iso-architecture DNNs.
References

[1] (2015) Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113, pp. 54–66.
[2] (2020) Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8529–8533.
[3] (2021) Training energy-efficient deep spiking neural networks with single-spike hybrid input encoding. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[4] (2021) Optimal conversion of conventional artificial neural networks to spiking neural networks. In International Conference on Learning Representations.
[5] (2015) Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[6] (2016) Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In 2016 IEEE International Conference on Rebooting Computing (ICRC), pp. 1–8.
[7] (2020) Incorporating learnable membrane time constant to enhance learning of spiking neural networks. arXiv preprint arXiv:2007.05785.
[8] (2014) The SpiNNaker project. Proceedings of the IEEE 102(5), pp. 652–665.
[9] (2020) Deep spiking neural network: energy efficiency through time based coding. In European Conference on Computer Vision (ECCV), pp. 388–404.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[11] (2021) TCL: an ANN-to-SNN conversion with trainable clipping layers. arXiv preprint arXiv:2008.04509.
[12] (2014) Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
[13] (2011) Frontiers in neuromorphic engineering. Frontiers in Neuroscience 5.
[14] (2020) Temporal backpropagation for spiking neural networks with one spike per neuron. International Journal of Neural Systems 30(06).
[15] (2020) Revisiting batch normalization for training low-latency deep spiking neural networks from scratch. arXiv preprint arXiv:2010.01729.
[16] (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario.
[17] (2021) Spike-thrift: towards energy-efficient deep spiking neural networks by limiting spiking activity via attention-guided compression. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3953–3962.
[18] (2021) HIRE-SNN: harnessing the inherent robustness of energy-efficient deep spiking neural networks by training with crafted input noise. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5209–5218.
[19] (2021) Towards low-latency energy-efficient deep SNNs via attention-guided compression. arXiv preprint arXiv:2107.12445.
[20] (2020) Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in Neuroscience 14.
[21] (2016) Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience 10.
[22] (2021) A free lunch from ANN: towards efficient, accurate spiking neural networks calibration. arXiv preprint arXiv:2106.06984.
[23] (2014) A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, pp. 668–673.
[24] (2020) Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization. Frontiers in Neuroscience 14.
[25] (2020) T2FSNN: deep spiking neural networks with time-to-first-spike coding. arXiv preprint arXiv:2003.11741.
[26] (2018) Deep learning with spiking neurons: opportunities and challenges. Frontiers in Neuroscience 12, pp. 774.
[27] (2020) DIET-SNN: direct input encoding with leakage and threshold optimization in deep spiking neural networks. arXiv preprint arXiv:2008.03658.
[28] (2020) Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. arXiv preprint arXiv:2005.01807.
[29] (2017) Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience 11, pp. 682.
[30] (2019) Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience 13, pp. 95.
[31] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[32] (2019) Direct training for spiking neural networks: faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1311–1318.