1 Introduction
Artificial Neural Networks (ANNs) have shown state-of-the-art performance across various computer vision tasks. Nonetheless, the huge energy consumption incurred when implementing ANNs on conventional von-Neumann hardware limits their usage in low-power and resource-constrained Internet of Things (IoT) environments, such as mobile phones and drones. In the context of low-power machine intelligence, Spiking Neural Networks (SNNs) have received considerable attention in the recent past
[31, 26, 4, 9, 5]. Inspired by biological neuronal mechanisms, SNNs process visual information with discrete spikes or events over multiple timesteps. Recent works have shown that the event-driven behavior of SNNs can be implemented on emerging neuromorphic hardware to yield 1-2 orders of magnitude energy efficiency over ANNs [1, 6]. Despite the energy efficiency benefits, SNNs have still not been widely adopted due to inherent training challenges. The training issue arises from the non-differentiable characteristic of a spiking neuron, generally of the Integrate-and-Fire (IF) type [3], which makes SNNs incompatible with standard gradient descent training.

Figure 1: Visualization of the average number of spikes in each layer with respect to timesteps for (a) ANN-SNN conversion (1000 total timesteps / 91.2% accuracy), (b) surrogate gradient backpropagation (100 / 88.7%), and (c) BNTT (25 / 90.5%). Compared to (a) conversion and (b) surrogate gradient, our (c) BNTT captures the temporal dynamics of spike activation with learnable parameters, enabling low-latency (i.e., small number of timesteps) and low-energy (i.e., fewer spikes) training. All experiments are conducted on CIFAR10 with VGG9.

To address the training issue of SNNs, several methods, such as Conversion and Surrogate Gradient Descent, have been proposed.
In ANN-SNN conversion [34, 13, 10, 32], off-the-shelf trained ANNs are converted to SNNs using normalization methods to transfer the ReLU activation to IF spiking activity. The advantage here is that training happens in the ANN domain, leveraging widely used machine learning frameworks such as PyTorch, which yields short training times and can be applied to complex datasets. However, the ANN-SNN conversion method requires a large number of timesteps for inference to yield competitive accuracy, which significantly increases the latency and energy consumption of the SNN. On the other hand, directly training SNNs with a surrogate gradient function [24, 19, 39] exploits the temporal dynamics of spikes, resulting in far fewer timesteps. However, the discrepancy between the forward spike activation function and the backward surrogate gradient function during backpropagation restricts the training capability. Only shallow SNNs (e.g., VGG5) can be trained using surrogate gradient descent, and therefore they achieve high performance only on simple datasets (e.g., MNIST and CIFAR10). Recently, a hybrid method [30] that combines the conversion method and the surrogate gradient-based method has shown state-of-the-art performance at reasonable latency. However, the hybrid method incurs sequential processes, i.e., training an ANN from scratch, converting the ANN to an SNN, and training the SNN using surrogate gradient descent, which increases the total computation cost of obtaining the final SNN model. Overall, training high-accuracy and low-latency SNNs from scratch still remains an open problem.

In this paper, we revisit Batch Normalization (BN) for more advanced SNN training. The BN layer [15] has been used extensively in deep learning to accelerate the training process of ANNs. It is well known that BN reduces internal covariate shift (or smooths the optimization landscape [33]), mitigating the problem of exploding/vanishing gradients. However, numerous studies on surrogate gradients for SNNs [20] have observed that standard BN does not help with SNN optimization. Moreover, most ANN-SNN conversion methods [34] get rid of BN, since time-sequential spikes with BN set the firing threshold of all neurons to non-discriminative/similar values across all inputs, resulting in an accuracy decline.
Motivation & Contribution: A natural question then arises: can standard BN capture the proper structure of the temporal dynamics of spikes in SNNs? Through this paper, we assert that standard BN hardly captures temporal characteristics, as it represents the statistics of all timesteps with one common set of parameters. Thus, a temporally adaptive BN approach is required. To this end, we propose a new SNN-crafted batch normalization layer called Batch Normalization Through Time (BNTT) that decouples the parameters in the BN layer across different timesteps. BNTT is implemented as an additional layer in SNNs and is trained with surrogate gradient backpropagation. To investigate the effect of our BNTT, we compare the statistics of the spike activity of BNTT with previous approaches, Conversion [34] and standard Surrogate Gradient Descent [24], as shown in Fig. 1. Interestingly, different from the conversion method and the surrogate gradient method (without BNTT), which maintain considerable spike activity during the entire time period across different layers, the spike activity of layers trained with BNTT follows a Gaussian-like trend. BNTT imposes a variation in spiking across different layers, wherein each layer's activity peaks in a particular timestep range and then decreases. Moreover, the peaks for early layers occur at initial timesteps and later layers peak at later timesteps. This phenomenon implies that the learnable parameters in BNTT enable the network to pass visual information temporally from shallow to deeper layers in an effective manner.
The newly observed characteristics of BNTT bring several advantages. First, similar to BN, the BNTT layer enables SNNs to be trained stably from scratch even for large-scale datasets. Second, the learnable parameters in BNTT enable SNNs to be trained with low latency (i.e., few timesteps) and impose optimum spike activity across different layers for low-energy inference. Finally, the distribution of the BNTT learnable parameter (i.e., $\gamma$) is a good representation of the temporal dynamics of spikes. Hence, relying on the observation that a low $\gamma$ value induces low spike activity and vice versa, we further propose a temporal early exit algorithm, in which an SNN can predict at an earlier timestep and does not need to wait until the end of the time period to make a prediction.
In summary, our key contributions are as follows: (i) For the first time, we introduce a batch normalization technique for SNNs, called BNTT. (ii) BNTT allows SNNs to be implemented in low-latency and low-energy environments. (iii) We further propose a temporal early exit algorithm at inference time that monitors the learnable parameters in BNTT. (iv) To ascertain that BNTT captures the temporal characteristics of SNNs, we mathematically show that the proposed BNTT has a similar effect to controlling the firing threshold of the spiking neuron at every timestep during inference.
2 Batch Normalization
Batch Normalization (BN) reduces the internal covariate shift (or variation of the loss landscape [33]) caused by distribution changes of the input signal, which is a known problem of deep neural networks [15]. Instead of calculating the statistics over the total dataset, the intermediate representations are standardized with a mini-batch to reduce the computation complexity. Given a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the BN layer computes the mean and variance of the mini-batch as:
$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$   (1)
Then, the input features in the minibatch are normalized with calculated statistics as:
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$   (2)
where $\epsilon$ is a small constant for numerical stability. To further improve the representation capability of the layer, learnable parameters $\gamma$ and $\beta$ are used to transform the normalized features, which can be formulated as $y_i = \gamma \hat{x}_i + \beta$. At inference time, BN uses the running averages of the mean and variance obtained during training. Previous works show that the BN layer not only improves performance but also reduces the number of iterations required for training convergence. As a result, BN has become a standard component of most ANN models, such as convolutional neural networks [36, 12]. On the other hand, the effectiveness of BN in bio-plausible SNNs has not been explored yet.

3 Methodology
3.1 Spiking Neural Networks
Different from conventional ANNs, SNNs transmit information using binary spike trains. To leverage the temporal spike information, the Leaky-Integrate-and-Fire (LIF) model [7] is widely used to emulate neuronal functionality in SNNs, which can be formulated as a differential equation:
$\tau_m \frac{dU_m}{dt} = -U_m + R\,I(t)$   (3)
where $U_m$ represents the membrane potential of the neuron, which characterizes its internal state, and $\tau_m$ is the time constant of the membrane potential decay. Also, $R$ and $I(t)$ denote the input resistance and the input current at time $t$, respectively. Following previous work [40], we convert this continuous dynamic equation into a discrete equation for digital simulation. For a single post-synaptic neuron $i$, we can represent the membrane potential $u_i^t$ at timestep $t$ as:
$u_i^t = \lambda\, u_i^{t-1} + \sum_j w_{ij}\, o_j^t$   (4)
Here, $j$ is the index of a pre-synaptic neuron, $\lambda$ is a leak factor with a value less than 1, $o_j^t$ is the binary spike activation, and $w_{ij}$ is the weight of the connection between the pre- and post-neurons. From Eq. 4, the membrane potential of a neuron decreases due to the leak and increases due to the weighted sum of incoming input spikes.
If the membrane potential exceeds a predefined firing threshold $\theta$, the LIF neuron generates a binary spike output $o_i^t$. After that, we perform a soft reset, where the membrane potential is reduced by the threshold value $\theta$. Compared to a hard reset (resetting the membrane potential to zero after the neuron spikes), the soft reset minimizes information loss by maintaining the residual voltage and carrying it forward to the next timestep, thereby achieving better performance [13]. Fig. 2(a) illustrates the membrane potential dynamics of a LIF neuron.
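For concreteness, the discrete LIF update of Eq. 4 together with the thresholding and soft reset can be written in a few lines. The following PyTorch sketch is our own illustration (the names lif_step, leak, and threshold are ours, not from the paper):

```python
import torch

def lif_step(u_prev, spikes_in, weight, leak=0.99, threshold=1.0):
    """One discrete LIF update (Eq. 4) followed by thresholding and a soft reset.

    u_prev:    membrane potential from the previous timestep, shape (batch, n_out)
    spikes_in: binary input spikes o_j^t at this timestep, shape (batch, n_in)
    weight:    connection weights w_ij, shape (n_out, n_in)
    """
    # Leaky integration of the weighted input spikes (Eq. 4).
    u = leak * u_prev + spikes_in @ weight.t()
    # Fire a binary spike wherever the membrane potential crosses the threshold.
    out_spikes = (u > threshold).float()
    # Soft reset: subtract the threshold instead of zeroing the potential,
    # so the residual voltage is carried over to the next timestep.
    u = u - out_spikes * threshold
    return u, out_spikes
```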
For the output layer, we discard the thresholding functionality so that neurons do not generate any spikes. We allow the output neurons to accumulate the spikes over all timesteps by fixing the leak parameter ($\lambda$ in Eq. 4) to one. This enables the output layer to compute a probability distribution with a softmax function without information loss. As with ANNs, the number of output neurons in SNNs is identical to the number of classes $C$ in the dataset. From the accumulated membrane potential, we can define the cross-entropy loss for SNNs as:

$L = -\sum_{i} y_i \log\!\left(\frac{e^{u_i^T}}{\sum_{k=1}^{C} e^{u_k^T}}\right)$   (5)
where $y_i$ is the ground-truth label and $T$ represents the total number of timesteps. Then, the weights of all layers are updated by backpropagating the loss value with gradient descent.
To compute the gradients of each layer $l$, we use backpropagation through time (BPTT), which accumulates the gradients over all timesteps [24]. This approach can be implemented with auto-differentiation tools, such as PyTorch [29], that enable backpropagation on the unrolled network. To this end, we compute the loss function at the final timestep $T$ and use gradient descent optimization. Mathematically, we can define the accumulated gradients at layer $l$ by the chain rule as:
$\frac{\partial L}{\partial W_l} = \sum_{t} \frac{\partial L}{\partial O_l^t}\,\frac{\partial O_l^t}{\partial U_l^t}\,\frac{\partial U_l^t}{\partial W_l}$   (6)
Here, $O_l^t$ and $U_l^t$ are the output spikes and membrane potentials at layer $l$, respectively. For the output layer, we get the derivative of the loss with respect to the membrane potential at the final timestep $T$:
$\frac{\partial L}{\partial u_i^T} = \frac{e^{u_i^T}}{\sum_{k=1}^{C} e^{u_k^T}} - y_i$   (7)
This derivative function is continuous and differentiable for all possible membrane potential values. On the other hand, LIF neurons in hidden layers generate spike output only if the membrane potential exceeds the firing threshold, leading to non-differentiability. To deal with this problem, we introduce an approximate gradient:
$\frac{\partial o_i^t}{\partial u_i^t} = \alpha\,\max\!\left\{0,\; 1 - \left|\frac{u_i^t - \theta}{\theta}\right|\right\}$   (8)
where $\alpha$ is a damping factor for the backpropagated gradients. Note that a large $\alpha$ value causes unstable training, as gradients are summed over all timesteps; hence we set $\alpha$ to a small constant. Overall, we update the network parameters at layer $l$ based on the gradient value (Eq. 6) as $W_l \leftarrow W_l - \eta\,\frac{\partial L}{\partial W_l}$, where $\eta$ is the learning rate.
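As an illustration, the forward Heaviside spike and the backward approximate gradient of Eq. 8 can be packaged into a custom autograd function. This is a sketch under our assumptions (a triangular surrogate of the form above, with illustrative constants), not the authors' released code:

```python
import torch

class SpikeFunction(torch.autograd.Function):
    """Heaviside spike forward, piecewise-linear surrogate gradient (Eq. 8) backward."""

    @staticmethod
    def forward(ctx, u, threshold, alpha):
        ctx.save_for_backward(u)
        ctx.threshold = threshold
        ctx.alpha = alpha
        return (u > threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        # alpha * max(0, 1 - |u - theta| / theta); the damping factor alpha keeps
        # the gradients, which are summed over all timesteps, from exploding.
        surrogate = ctx.alpha * torch.clamp(
            1.0 - torch.abs(u - ctx.threshold) / ctx.threshold, min=0.0)
        # No gradients are needed for the threshold and alpha arguments.
        return grad_output * surrogate, None, None

# Usage inside the unrolled simulation, e.g.:
# out_spikes = SpikeFunction.apply(u, 1.0, 0.3)  # threshold and alpha are illustrative values
```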
3.2 Batch Normalization Through Time (BNTT)
The main contribution of this paper is a new SNN-crafted Batch Normalization (BN) technique. Naively applying standard BN does not have any effect on training SNNs. This is because using the same BN parameters (e.g., global mean $\mu$, global variance $\sigma^2$, and learnable parameter $\gamma$) for the statistics of all timesteps does not capture the temporal dynamics of the input spike trains. For example, an LIF neuron requires at least one timestep to propagate spikes to the next layer; therefore, the input signals to the third layer of an SNN are zero for the first few timesteps. Following the initial spike activity in a layer, the spike signals vary depending upon the weight connections and the membrane potentials of the previous layers. Therefore, a fixed global mean from a standard BN layer may not store any time-specific information, resulting in performance degradation at inference.
To resolve this issue, we vary the internal parameters of a BN layer through time, which we define as BNTT. Similar to the digital simulation of the LIF neuron across different timesteps, one BNTT layer is expanded temporally with a local learnable parameter $\gamma^t$ associated with each timestep $t$. This allows the BNTT layer to capture temporal statistics (see Section 3.3 for a mathematical analysis). The proposed BNTT layer is easily applied to SNNs by inserting it after convolutional/linear operations as:
$u_i^t = \lambda\, u_i^{t-1} + \mathrm{BNTT}_{\gamma^t}\!\left(\sum_j w_{ij}\, o_j^t\right) = \lambda\, u_i^{t-1} + \gamma_i^t \left(\frac{\sum_j w_{ij}\, o_j^t - \mu_i^t}{\sqrt{(\sigma_i^t)^2 + \epsilon}}\right)$   (9)
During the training process, we compute the mean $\mu^t$ and variance $\sigma^t$ from the samples in a mini-batch for each timestep $t$, as shown in Algorithm 1. Note that, for each timestep $t$, we apply an exponential moving average to approximate the global mean and variance over training iterations. These global statistics are used to normalize the test data at inference. Also, we do not utilize $\beta$ as in conventional BN, since it adds redundant voltage to the membrane potential of SNNs.
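A simple way to realize this is to keep one set of batch-norm statistics and one $\gamma$ per timestep. The PyTorch sketch below is our own simplification (assuming 2D convolutional feature maps and our BNTT module name); it instantiates a separate BatchNorm2d per timestep and freezes the $\beta$ term at zero, as described above:

```python
import torch
import torch.nn as nn

class BNTT(nn.Module):
    """Batch Normalization Through Time: per-timestep statistics and gamma, no beta."""

    def __init__(self, num_features, timesteps, eps=1e-4, momentum=0.1):
        super().__init__()
        # One BatchNorm2d per timestep; each keeps its own gamma and running statistics.
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features, eps=eps, momentum=momentum)
             for _ in range(timesteps)])
        for bn in self.bns:
            nn.init.zeros_(bn.bias)        # remove the beta (shift) term
            bn.bias.requires_grad = False  # and keep it fixed at zero

    def forward(self, x, t):
        # Normalize the weighted input at timestep t with that timestep's statistics/gamma.
        return self.bns[t](x)

# Inside the simulation loop over timesteps (Eq. 9), e.g.:
# u = leak * u + bntt(conv(spikes_in), t)
```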
Adding the BNTT layer to LIF neurons changes the gradient calculation for backpropagation. Given that $x_i^t$ is an input signal to the BNTT layer, we can calculate the gradient passed through the lower layers by the BNTT layer as:
$\frac{\partial L}{\partial x_i^t} = \frac{\gamma^t}{\sqrt{(\sigma^t)^2 + \epsilon}}\left(\frac{\partial L}{\partial u_i^t} - \frac{1}{B}\sum_{b=1}^{B}\frac{\partial L}{\partial u_b^t} - \frac{\hat{x}_i^t}{B}\sum_{b=1}^{B}\frac{\partial L}{\partial u_b^t}\,\hat{x}_b^t\right)$   (10)
Here, we omit the neuron index for simplicity. Also, $B$ and $b$ denote the batch size and the batch index, and $\hat{x}_b^t$ is the normalized input of sample $b$ at timestep $t$ (see Appendix A for more detail). Thus, for every timestep $t$, gradients are calculated based on the time-specific statistics of the input signals. This allows the network to take temporal dynamics into account when training the weight connections. Moreover, the learnable parameter $\gamma^t$ is updated to restore the representation power of the batch-normalized signal. Since we use different $\gamma^t$ values across all timesteps, $\gamma^t$ finds an optimum for each timestep for efficient inference. We update $\gamma^t$ by gradient descent with:
$\frac{\partial L}{\partial \gamma^t} = \sum_{b=1}^{B}\frac{\partial L}{\partial u_b^t}\,\hat{x}_b^t$   (11)
3.3 Mathematical Analysis
In this section, we discuss the connection between BNTT and the firing threshold of a LIF neuron. Specifically, we show that using BNTT has a similar effect to varying the firing threshold over different timesteps, thereby ascertaining that BNTT captures temporal characteristics in SNNs. Recall that, at inference, BNTT normalizes the input signal using the stored approximated global mean $\mu^t$ and standard deviation $\sigma^t$. From Eq. 9, we can calculate the membrane potential at timestep $t = 1$, given that the initial membrane potential has a zero value:

$u_i^1 = \gamma_i^1 \left(\frac{\sum_j w_{ij}\, o_j^1 - \mu_i^1}{\sqrt{(\sigma_i^1)^2 + \epsilon}}\right) \approx \frac{\gamma_i^1}{\sigma_i^1}\sum_j w_{ij}\, o_j^1 = \frac{\gamma_i^1}{\sigma_i^1}\,\tilde{u}_i^1$   (12)
Here, we assume that $\mu_i^1$ can be neglected with a small-signal approximation due to the spike sparsity in SNNs, and $\tilde{u}_i^1$ is the membrane potential at timestep $t = 1$ without BNTT (obtained from Eq. 4). We can observe that the membrane potential with BNTT is proportional to the membrane potential without BNTT at $t = 1$. For timestep $t = 2$, we should take into account the membrane potential from the previous timestep, which is multiplied by the leak $\lambda$. To this end, by substituting Eq. 12 into the BNTT equation (Eq. 9), we can formulate the membrane potential at $t = 2$ as:
$u_i^2 = \lambda\, u_i^1 + \gamma_i^2 \left(\frac{\sum_j w_{ij}\, o_j^2 - \mu_i^2}{\sqrt{(\sigma_i^2)^2 + \epsilon}}\right) \approx \lambda\,\frac{\gamma_i^1}{\sigma_i^1}\,\tilde{u}_i^1 + \frac{\gamma_i^2}{\sigma_i^2}\sum_j w_{ij}\, o_j^2 \approx \frac{\gamma_i^2}{\sigma_i^2}\left(\lambda\,\tilde{u}_i^1 + \sum_j w_{ij}\, o_j^2\right) = \frac{\gamma_i^2}{\sigma_i^2}\,\tilde{u}_i^2$   (13)
In the second approximation step, the learnable parameter $\gamma$ and standard deviation $\sigma$ have similar values in adjacent time intervals ($t = 1$ and $t = 2$) because of the continuous-time property. Hence, we can approximate $\gamma_i^1$ and $\sigma_i^1$ as $\gamma_i^2$ and $\sigma_i^2$, respectively. Finally, we can extend the BNTT equation to timestep $t$:
$u_i^t \approx \frac{\gamma_i^t}{\sigma_i^t}\left(\lambda\,\tilde{u}_i^{t-1} + \sum_j w_{ij}\, o_j^t\right) = \frac{\gamma_i^t}{\sigma_i^t}\,\tilde{u}_i^t$   (14)
Considering that a neuron produces an output spike whenever the membrane potential exceeds the predefined firing threshold $\theta$, the spike firing condition with BNTT can be represented as $\frac{\gamma_i^t}{\sigma_i^t}\tilde{u}_i^t > \theta$. Comparing this with the threshold of a neuron without BNTT, we can reformulate the firing condition as:
$\tilde{u}_i^t > \frac{\sigma_i^t}{\gamma_i^t}\,\theta$   (15)
Thus, we can infer that using a BNTT layer effectively scales the firing threshold by $\frac{\sigma_i^t}{\gamma_i^t}$ at every timestep. In practice, BNTT finds an optimum $\gamma$ during training that improves the representation power, producing better performance and lower-latency SNNs. This observation allows us to consider the advantages of time-varying learnable parameters in SNNs. This implication is in line with previous work [13], which insists that manipulating the firing threshold improves the performance and latency of the ANN-SNN conversion method. However, Han et al. change the threshold value in a heuristic way without any optimization process and fix the threshold value across all timesteps. On the other hand, our BNTT yields a time-specific scaling $\frac{\sigma_i^t}{\gamma_i^t}$, which can be optimized via backpropagation.

3.4 Early Exit Algorithm
The main objective of early exit is to reduce the latency during inference [38, 27]. Most previous methods [39, 19, 34, 30, 13] accumulate output spikes until the end of the time-sequence at inference, since all layers generate spikes across all timesteps, as shown in Fig. 1(a) and Fig. 1(b). On the other hand, the learnable parameters in BNTT shape the spike activity of each layer so that it rises to a peak value and then falls again (a Gaussian-like trend), as shown in Fig. 1(c). This phenomenon shows that SNNs using BNTT convey little information at the end of the spike train.
Inspired by this observation, we propose a temporal early exit algorithm based on the value of $\gamma^t$. From Eq. 15, we know that a low $\gamma$ value increases the firing threshold, resulting in low spike activity; a high $\gamma$ value, in contrast, induces more spike activity. It is worth mentioning that $\sigma^t$ shows similar values across all timesteps, and therefore we only focus on $\gamma^t$. Given that the intensity of the spike activity is proportional to $\gamma$, we can infer that spikes will hardly contribute to the classification result once the $\gamma$ values across every layer drop to a minimum value. Therefore, we measure the average of the $\gamma$ values in each layer at every timestep and terminate the inference once the average $\gamma$ in every layer falls below a predetermined threshold. For example, as shown in Fig. 3, all averaged $\gamma$ values fall below the threshold after a certain timestep, which we then define as the early exit time. Note that we can determine the optimum timestep for early exit before forward propagation without any additional computation. In summary, the temporal early exit method enables us to find the earliest timestep during inference that ensures the integration of crucial information, in turn reducing the inference latency without a significant loss of accuracy.
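A minimal sketch of this decision rule, assuming BNTT layers that expose per-timestep $\gamma$ (BN weight) tensors as in the earlier sketch, could look as follows; the threshold value is a hypothetical hyperparameter:

```python
def early_exit_timestep(bntt_layers, gamma_threshold=0.1):
    """Return the timestep after which the average gamma of every BNTT layer
    stays below the threshold, i.e., the timestep at which inference can stop."""
    timesteps = len(bntt_layers[0].bns)
    for t in reversed(range(timesteps)):
        # Average gamma of each BNTT layer at timestep t (no forward pass needed).
        avg_gammas = [layer.bns[t].weight.mean().item() for layer in bntt_layers]
        if any(g >= gamma_threshold for g in avg_gammas):
            # Last timestep at which some layer is still "active"; exit right after it.
            return t + 1
    return timesteps  # fall back to the full time window
```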
3.5 Overall Optimization
Algorithm 2 summarizes the whole training process of SNNs with BNTT. Our proposed BNTT acts as a regularizer, unlike previous methods [19, 34, 20, 30] that use dropout to perform regularization. Our training scheme is based on widely used rate coding, where the spike generator produces a Poisson spike train (see Appendix B) for each pixel in the image with a frequency proportional to the pixel intensity [31]. For all layers, the weighted sum of the input signal is passed through a BNTT layer and then accumulated in the membrane potential. If the membrane potential exceeds the firing threshold, the neuron generates an output spike. For the last layer, we accumulate the input voltage over all timesteps without leak and feed it to a softmax layer to output a probability distribution. Then, we calculate the cross-entropy loss and the gradients for the weights of each layer with the approximate gradient function. During the training phase, each BNTT layer computes the time-dependent statistics (i.e., $\mu^t$ and $\sigma^t$) and stores the moving-average global mean and variance. At inference, we first define the early exit timestep based on the value of $\gamma$ in BNTT. Then, the network classifies the test input (note: the test data is normalized with the precomputed global BNTT statistics) based on the accumulated output voltage at the precomputed early exit timestep.

4 Experiments
In this section, we carry out comprehensive experiments on public classification datasets. Until now, training SNNs from scratch with surrogate gradients has been limited to simple datasets, e.g., CIFAR10, due to the difficulty of direct optimization. In this paper, for the first time, we train SNNs with surrogate gradients from scratch and report performance on large-scale datasets, including CIFAR100 and TinyImageNet, with multi-layered network architectures. We first compare our BNTT with previous SNN training methods. Then, we quantitatively and qualitatively demonstrate the effectiveness of our proposed BNTT.
4.1 Experimental Setup
We evaluate our method on three static datasets (i.e., CIFAR10, CIFAR100, TinyImageNet) and one neuromorphic dataset (i.e., DVS-CIFAR10). CIFAR10 [17] consists of 60,000 images (50,000 for training / 10,000 for testing) from 10 categories. All images are RGB color images of size 32×32. CIFAR100 has the same configuration as CIFAR10, except that it contains images from 100 categories. TinyImageNet is a modified subset of the original ImageNet dataset [8] with 200 classes, 100,000 training images, and 10,000 validation images; the image resolution is 64×64 pixels. DVS-CIFAR10 [21] has the same configuration as CIFAR10, but is a discrete event-stream dataset collected with a moving event-driven camera. We follow the data preprocessing protocol and network architecture used in previous work [40] (details in Appendix C). Our implementation is based on PyTorch [29]. We train the networks with standard SGD with momentum 0.9 and weight decay 0.0005, and we apply random crop and horizontal flip augmentation to the input images. The base learning rate is set to 0.3, and we use stepwise learning rate scheduling with a decay factor of 10 at 50%, 70%, and 90% of the total number of epochs. We set the total number of epochs to 120, 240, 90, and 60 for CIFAR10, CIFAR100, TinyImageNet, and DVS-CIFAR10, respectively.
Method  Dataset  Training Method  Architecture  Timesteps  Accuracy (%)
Cao et al. [4]  CIFAR10  ANNSNN Conversion  3Conv, 2Linear  400  77.4 
Sengupta et al. [34]  CIFAR10  ANNSNN Conversion  VGG16  2500  91.5 
Lee et al. [19]  CIFAR10  Surrogate Gradient  VGG9  100  90.4 
Rathi et al. [30]  CIFAR10  Hybrid  VGG16  200  92.0 
Han et al. [13]  CIFAR10  ANNSNN Conversion  VGG16  2048  93.6 
w.o. BNTT  CIFAR10  Surrogate Gradient  VGG9  100  88.7 
BNTT (ours)  CIFAR10  Surrogate Gradient  VGG9  25  90.5 
BNTT + Early Exit (ours)  CIFAR10  Surrogate Gradient  VGG9  20  90.3 
Sengupta et al. [34]  CIFAR100  ANNSNN Conversion  VGG16  2500  70.9 
Rathi et al. [30]  CIFAR100  Hybrid  VGG16  125  67.8 
Han et al. [13]  CIFAR100  ANNSNN Conversion  VGG16  2048  70.9 
w.o. BNTT  CIFAR100  Surrogate Gradient  VGG11  n/a  n/a 
BNTT (ours)  CIFAR100  Surrogate Gradient  VGG11  50  66.6 
BNTT + Early Exit (ours)  CIFAR100  Surrogate Gradient  VGG11  30  65.8 
Sengupta et al. [34]  TinyImageNet  ANNSNN Conversion  VGG11  2500  54.2 
w.o. BNTT  TinyImageNet  Surrogate Gradient  VGG11  n/a  n/a 
BNTT (ours)  TinyImageNet  Surrogate Gradient  VGG11  30  57.8 
BNTT + Early Exit (ours)  TinyImageNet  Surrogate Gradient  VGG11  25  56.8 
Method  Type  Accuracy (%) 
Orchard et al. [25]  Random Forest  31.0 
Lagorce et al. [18]  HOTS  27.1 
Sironi et al. [37]  HAT  52.4 
Sironi et al. [37]  GaborSNN  24.5 
Wu et al. [40]  Surrogate Gradient  60.5 
w.o. BNTT  Surrogate Gradient  n/a 
BNTT (ours)  Surrogate Gradient  63.2 
4.2 Comparison with Previous Methods
On public datasets, we compare our proposed BNTT method with previous rate-coding based SNN training methods, including ANN-SNN conversion [13, 34, 4], surrogate gradient backpropagation [19], and hybrid [30] methods. From Table 1, we can observe the advantages and disadvantages of each training method. The ANN-SNN conversion methods perform better than the surrogate gradient method across all datasets. However, they require a large number of timesteps for training and testing, which is energy-inefficient and impractical for real-time applications. The hybrid method aims to resolve this high-latency problem, but it still requires over a hundred timesteps. The surrogate gradient method suffers from poor optimization and hence cannot be scaled to larger datasets such as CIFAR100 and TinyImageNet. Our BNTT is based on the surrogate gradient method, yet it enables SNNs to achieve high performance even on these more complicated datasets. At the same time, we dramatically reduce the latency due to the learnable parameters and temporal statistics in the BNTT layer. As a result, BNTT can be trained with 25 timesteps on the simple CIFAR10 dataset while preserving state-of-the-art accuracy. For CIFAR100, we achieve considerably faster inference compared to both the conversion methods and the hybrid method. Interestingly, for TinyImageNet, BNTT achieves better performance and shorter latency than the previous conversion method. Note that the ANN with the VGG11 architecture used for ANN-SNN conversion achieves 56.3% accuracy. Moreover, using the early exit algorithm further reduces the latency, which enables the networks to be implemented with lower latency and better energy efficiency. It is worth mentioning that the surrogate gradient method without BNTT (w.o. BNTT in Table 1) only converges on CIFAR10. For the neuromorphic DVS-CIFAR10 dataset (Table 2), ANN-SNN conversion methods are not applicable, since ANNs hardly capture the temporal dynamics of a spike train. Using BNTT improves the stability of training compared to a surrogate gradient baseline (i.e., w.o. BNTT) and achieves state-of-the-art performance. These results show that our BNTT technique is very effective on event-driven data and hence well-suited for neuromorphic applications.
Method  Latency (timesteps)  Accuracy (%)  Energy efficiency (normalized to ANN)
VGG9 (ANN)  1  91.5  1 
Conversion  1000  91.2  0.32 
Conversion  500  90.9  0.55 
Conversion  100  89.3  2.71 
Surrogate Gradient  100  88.7  1.05 
BNTT  25  90.5  9.14 
4.3 Energy Comparison
We compare the layer-wise spiking activity of our BNTT with two widely used methods, i.e., the ANN-SNN conversion method [34] and the surrogate gradient method (w.o. BNTT) [24]. Note that we refer to our approach as BNTT and to the standard surrogate approach (w.o. BNTT) as the surrogate gradient method in the remainder of the text. Specifically, we calculate the spike rate of each layer $l$, defined as the total number of spikes at layer $l$ over all timesteps divided by the number of neurons in layer $l$ (see Appendix D for the equation of the spike rate). In Fig. 4(a), converted SNNs show a high spike rate for every layer, as they forward spike trains over a larger number of timesteps than the other methods. Even though the surrogate gradient method uses fewer timesteps, it still requires a considerable number of spikes in each layer. Compared to these methods, BNTT significantly improves the spike sparsity across all layers.
More precisely, as done in previous works [28, 20], we compute the energy consumption of SNNs in a standard CMOS technology [14], as shown in Appendix D, by counting the multiply-and-accumulate (MAC) operations. As the computation of SNNs is event-driven with binary {1, 0} spike processing, each MAC operation reduces to a floating point (FP) addition. On the other hand, conventional ANNs still require one FP addition and one FP multiplication for the same MAC operation (see Appendix D for more detail). Table 3 shows the energy efficiency of ANNs and SNNs with a VGG9 architecture [36] on CIFAR10. As expected, ANN-SNN conversion exhibits a trade-off between accuracy and energy efficiency. For the same latency, the surrogate gradient method expends more energy than the conversion method. It is interesting to note that, even though our BNTT is trained with the surrogate gradient method, we obtain a significant improvement in energy efficiency compared to ANNs. In addition, we provide a further energy comparison on a neuromorphic architecture in Appendix E.
4.4 Analysis on Learnable Parameters in BNTT
The key observation of our work is the change of $\gamma$ across timesteps. To analyze the distribution of the learnable parameters in our BNTT, we visualize the histograms of $\gamma$ in the conv1, conv4, and conv7 layers of VGG9, as shown in Fig. 5. Interestingly, all layers show a different temporal evolution of the gamma distribution. For example, conv1 has high $\gamma$ values at the initial timesteps, which decrease as time goes on. On the other hand, starting from small values, the $\gamma$ values in the conv4 and conv7 layers peak at intermediate timesteps, with conv7 peaking later than conv4, and then shrink to zero at later timesteps. Notably, the peak time is delayed as the layer goes deeper, implying that visual information is passed through the network sequentially over a period of time, similar to Fig. 1(c). This Gaussian-like trend, with a rise and fall of $\gamma$ across different timesteps, supports the explanation of the overall low spike activity compared to other methods (Fig. 4(a)).

4.5 Analysis on Early Exit
Recall that we measure the average of the $\gamma$ values in each layer at every timestep, and stop the inference when the average $\gamma$ value in every layer is lower than a predetermined threshold. To further investigate this, we vary the predetermined threshold and show the resulting accuracy and exit time. As shown in Fig. 6, a higher threshold causes the network to exit at earlier timesteps. Although fewer timesteps are used during inference, the accuracy drops only marginally. This implies that BNTT conveys little crucial information at the end of the spike train (see Fig. 1(c)). Note that the temporal evolution of the learnable parameter $\gamma$ with our BNTT allows us to exploit the early exit algorithm, which yields a substantial reduction in latency at inference. Such a strategy has not been proposed or explored in prior works, which have mainly focused on reducing the number of timesteps during training without effectively using temporal statistics.
4.6 Analysis on Robustness
Finally, we highlight the advantage of BNTT in terms of robustness to noisy inputs. To investigate this, we evaluate the performance change of the SNNs as we feed in inputs with varying levels of noise. We generate the noisy inputs by adding Gaussian noise to the clean input images. From Fig. 4(b), we observe the following: i) the accuracy of the conversion method degrades considerably at higher noise intensities; ii) compared to ANNs, SNNs trained with surrogate gradient backpropagation show better performance at higher noise intensities, but still suffer from large accuracy drops in the presence of noisy inputs; iii) BNTT achieves significantly higher performance than the other methods across all noise intensities. This is because using BNTT decreases the overall number of timesteps, which is a crucial contributing factor towards robustness [35]. These results imply that, in addition to low latency and energy efficiency, our BNTT method also offers improved robustness, making it suitable for implementing SNNs in real-world scenarios. We further analyze the robustness with respect to adversarial attacks [11] in Appendix F.
5 Conclusion
In this paper, we revisit the batch normalization technique and propose a novel mechanism for training low-latency, energy-efficient, robust, and accurate SNNs from scratch. Our key idea is to extend the effect of batch normalization to the temporal dimension with time-specific learnable parameters and statistics. We discover that optimizing the learnable parameters during the training phase enables visual information to be passed through the layers sequentially. For the first time, we directly train SNNs on large datasets such as TinyImageNet, which opens up the potential of surrogate gradient-based backpropagation for future practical research on SNNs.
Appendix A Appendix: Backward Gradient of BNTT
Here, we calculate the backward gradient of a BNTT layer. Note that we omit the neuron index for simplicity. For one sample $i$ in a mini-batch, we compute the backward gradient of BNTT at timestep $t$ by the chain rule through the normalized input, the batch variance, and the batch mean:
$\frac{\partial L}{\partial x_i^t} = \frac{\partial L}{\partial \hat{x}_i^t}\,\frac{\partial \hat{x}_i^t}{\partial x_i^t} + \frac{\partial L}{\partial (\sigma^t)^2}\,\frac{\partial (\sigma^t)^2}{\partial x_i^t} + \frac{\partial L}{\partial \mu^t}\,\frac{\partial \mu^t}{\partial x_i^t}$   (16)
where,
$\hat{x}_i^t = \frac{x_i^t - \mu^t}{\sqrt{(\sigma^t)^2 + \epsilon}}, \qquad \frac{\partial L}{\partial \hat{x}_i^t} = \gamma^t\,\frac{\partial L}{\partial u_i^t}$   (17)
It is worth mentioning that we accumulate the input signals at the last layer in order to remove information loss, and then convert the accumulated voltage into probabilities with a softmax function. Therefore, we calculate the backward gradient with respect to the loss $L$, following the previous work [24]. The first term on the R.H.S. of Eq. (16) can be calculated as:
$\frac{\partial L}{\partial \hat{x}_i^t}\,\frac{\partial \hat{x}_i^t}{\partial x_i^t} = \frac{\partial L}{\partial \hat{x}_i^t}\,\frac{1}{\sqrt{(\sigma^t)^2 + \epsilon}}$   (18)
For the second term on the R.H.S. of Eq. (16),
$\frac{\partial L}{\partial (\sigma^t)^2}\,\frac{\partial (\sigma^t)^2}{\partial x_i^t} = \left(-\frac{1}{2}\sum_{b=1}^{B}\frac{\partial L}{\partial \hat{x}_b^t}\,\frac{x_b^t - \mu^t}{\left((\sigma^t)^2 + \epsilon\right)^{3/2}}\right)\frac{2\left(x_i^t - \mu^t\right)}{B}$   (19)
For the third term on the R.H.S. of Eq. (16),
$\frac{\partial L}{\partial \mu^t}\,\frac{\partial \mu^t}{\partial x_i^t} = \left(-\sum_{b=1}^{B}\frac{\partial L}{\partial \hat{x}_b^t}\,\frac{1}{\sqrt{(\sigma^t)^2 + \epsilon}}\right)\frac{1}{B}$   (20)
Combining the three terms yields the expression used in Eq. 10:

$\frac{\partial L}{\partial x_i^t} = \frac{1}{B\sqrt{(\sigma^t)^2 + \epsilon}}\left(B\,\frac{\partial L}{\partial \hat{x}_i^t} - \sum_{b=1}^{B}\frac{\partial L}{\partial \hat{x}_b^t} - \hat{x}_i^t\sum_{b=1}^{B}\frac{\partial L}{\partial \hat{x}_b^t}\,\hat{x}_b^t\right)$   (21)
To summarize, for every timestep $t$, gradients are calculated based on the time-specific statistics of the input signals. This allows the network to take temporal dynamics into account when training the weight connections.
Appendix B Appendix: Rate Coding
Spiking neural networks process multiple binary spikes. Therefore, for training and inference, a static image needs to be converted into a spike train. There are various spike coding schemes, such as rate, temporal, and phase coding [23, 16]. Among them, we use rate coding due to its reliable performance across various tasks. Rate coding provides spikes in proportion to the pixel intensity of the given image. To implement this, following previous work [31], we compare each pixel value with a random number drawn uniformly between the minimum and maximum possible pixel intensities at every timestep. If the pixel intensity is greater than the random number, the Poisson spike generator outputs a spike with amplitude 1; otherwise, it does not yield any spike. We visualize rate coding in Fig. 7. The spikes generated at a given timestep are random; however, as time goes on, the accumulated spikes resemble the original image.
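A minimal sketch of this Poisson-style generator, assuming the image has been scaled to [0, 1]:

```python
import torch

def poisson_encode(image, timesteps):
    """Rate-code a static image (values in [0, 1]) into a binary spike train.

    Returns a tensor of shape (timesteps, *image.shape); the expected firing
    rate of each pixel is proportional to its intensity.
    """
    # Independent uniform draw per pixel and per timestep; a spike is emitted
    # whenever the pixel intensity exceeds the random number.
    rand = torch.rand((timesteps,) + tuple(image.shape))
    return (image.unsqueeze(0) > rand).float()
```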
Appendix C Appendix: DVSCIFAR10 dataset
On DVS-CIFAR10, following [40], we downsample the 128×128 images to 42×42. Also, we divide the total number of timesteps available in the original time-frame data into 20 intervals and accumulate the spikes within each interval. We use a similar architecture to previous work [40], which consists of a 5-layered feature extractor and a classifier. The detailed architecture is shown in Fig. 8 in this appendix.
Appendix D Appendix: Energy Calculation
In this appendix, we provide the details of the energy calculation discussed in Section 4.3 of the main paper. The total computational cost is proportional to the total number of floating point operations (FLOPS), which is approximately the same as the number of Matrix-Vector Multiplication (MVM) operations. For a convolutional layer $l$ in an ANN, we can calculate the FLOPS as:

$FLOPS_{ANN}(l) = k_l^2 \times O_l^2 \times C_{in}^l \times C_{out}^l$   (22)
Here, $k_l$ is the kernel size, $O_l$ is the output feature map size, and $C_{in}^l$ and $C_{out}^l$ are the numbers of input and output channels, respectively. For SNNs, we first define the spiking rate $R_s(l)$ at layer $l$, which is the average firing rate per neuron:
$R_s(l) = \dfrac{\#\,\text{spikes at layer } l \text{ over all timesteps}}{\#\,\text{neurons at layer } l}$   (23)
Since neurons in SNNs only consume energy whenever they spike, we multiply the spiking rate with the ANN FLOPS to obtain the SNN FLOP count:
$FLOPS_{SNN}(l) = FLOPS_{ANN}(l) \times R_s(l)$   (24)
Finally, the total inference energy of ANNs ($E_{ANN}$) and SNNs ($E_{SNN}$) across all layers can be obtained:
$E_{ANN} = \sum_{l} FLOPS_{ANN}(l) \times E_{MAC}$   (25)
$E_{SNN} = \sum_{l} FLOPS_{SNN}(l) \times E_{AC}$   (26)
The $E_{MAC}$ and $E_{AC}$ values are calculated using a standard 45 nm CMOS process [14], as shown in the table below.
Operation  Energy (pJ)
32-bit FP MULT ($E_{MULT}$)  3.7
32-bit FP ADD ($E_{ADD}$)  0.9
32-bit FP MAC ($E_{MAC}$)  4.6 (= $E_{MULT}$ + $E_{ADD}$)
32-bit FP AC ($E_{AC}$)  0.9
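For illustration, Eqs. 22-26 can be put together as follows; the per-operation energies come from the table above, while the layer shapes and spike counts would be measured from the actual network (the helper names are ours):

```python
E_MAC = 4.6e-12  # J per 32-bit FP multiply-accumulate (table above)
E_AC = 0.9e-12   # J per 32-bit FP accumulate (table above)

def conv_flops(k, out_size, c_in, c_out):
    """FLOPS of one convolutional layer (Eq. 22): k^2 * O^2 * C_in * C_out."""
    return (k ** 2) * (out_size ** 2) * c_in * c_out

def spike_rate(total_spikes, num_neurons):
    """Average number of spikes per neuron over all timesteps (Eq. 23)."""
    return total_spikes / num_neurons

def ann_energy(layer_flops):
    """E_ANN = sum_l FLOPS_ANN(l) * E_MAC (Eq. 25)."""
    return sum(layer_flops) * E_MAC

def snn_energy(layer_flops, layer_rates):
    """E_SNN = sum_l FLOPS_ANN(l) * R_s(l) * E_AC (Eqs. 24 and 26)."""
    return sum(f * r for f, r in zip(layer_flops, layer_rates)) * E_AC
```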
Appendix E Appendix: Energy Comparison in Neuromorphic Architecture
Method  Timesteps  #Spikes  Normalized Energy [1]
Conversion  1000  419.30  1 
Surrogate  100  141.96  0.3384 
BNTT  25  13.106  0.0312 
We further show the energy efficiency of BNTT on a neuromorphic architecture, TrueNorth [1]. Following previous work [28, 22], we compute the normalized energy, which can be decomposed into dynamic energy ($E_{dyn}$) and static energy ($E_{st}$). The $E_{dyn}$ value corresponds to the computing cores and routers, and $E_{st}$ accounts for maintaining the state of the CMOS circuit. The total energy consumption can be calculated as $\#\text{Spikes} \times E_{dyn} + \#\text{Timesteps} \times E_{st}$, where ($E_{dyn}$, $E_{st}$) are (0.4, 0.6). In Table 5, we show that our BNTT has a substantial advantage in terms of energy efficiency on neuromorphic hardware.
Appendix F Appendix: Adversarial Robustness
In order to further validate the robustness of BNTT, we conduct experiments on adversarial inputs. We use FGSM [11] to generate adversarial samples for the ANN. For a given image $x$, we compute the loss function $L(x, y)$ with the ground-truth label $y$. The objective of the FGSM attack is to change the pixel intensities of the input image in the direction that maximizes the cost function:
$\hat{x} = x + \epsilon\,\mathrm{sign}\!\left(\nabla_{x} L(x, y)\right)$   (27)
We call $\hat{x}$ the adversarial sample. Here, $\epsilon$ denotes the strength of the attack. To conduct the FGSM attack on the SNN, we use the SNN-crafted FGSM method proposed in [35]. In Fig. 9, we show the classification performance for varying intensities of the FGSM attack. The SNN approaches (i.e., BNTT and surrogate BP) show more robustness than the ANN due to their temporal dynamics and stochastic neuronal functionality. We highlight that our proposed BNTT shows much higher robustness than the others. Thus, we assert that BNTT improves the robustness of SNNs in addition to their energy efficiency and latency.
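For reference, the ANN-side FGSM attack of Eq. 27 can be sketched as below (the SNN-crafted variant of [35] additionally handles the spike-based input and is not reproduced here); model, image, and label are assumed to be a classifier returning logits, an input scaled to [0, 1], and its class index:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon):
    """Generate an adversarial sample x_hat = x + eps * sign(dL/dx) (Eq. 27)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Move every pixel one epsilon-step in the direction that increases the loss.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```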
Appendix G Appendix: Comparison with Layer Norm
Layer Normalization (LN) [2] was proposed as an optimization method for recurrent neural networks (RNNs). The authors asserted that directly applying BN is difficult because the activations of RNNs vary with the length of the input sequence; a LN layer therefore calculates the mean and variance over each single layer instead of the mini-batch. As SNNs also take time-sequence data as input, we compare our BNTT with Layer Normalization in Table 6. For all experiments, we use a VGG9 architecture, set the base learning rate to 0.3, and use stepwise learning rate scheduling as described in Section 4.1 of the main manuscript. The results show that BNTT is a more suitable structure for capturing the temporal dynamics of Poisson-encoded spikes.
Method  Acc (%) 
Layer Normalization [2]  75.4 
BNTT  90.5 
References
 [1] (2015) TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34 (10), pp. 1537–1557. Cited by: Table 5, Appendix E, §1.
 [2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Table 6, Appendix G.
 [3] (2006) A review of the integrate-and-fire neuron model: i. homogeneous synaptic input. Biological Cybernetics 95 (1), pp. 1–19. Cited by: §1.
 [4] (2015) Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113 (1), pp. 54–66. Cited by: §1, §4.2, Table 1.
 [5] (2020) Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 20202020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8529–8533. Cited by: §1.
 [6] (2018) Loihi: a neuromorphic manycore processor with onchip learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §1.
 [7] (2001) Theoretical neuroscience, vol. 806. Cambridge, MA: MIT Press. Cited by: §3.1.

 [8] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.
 [9] (2015) Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience 9, pp. 99. Cited by: §1.
 [10] (2015) Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.
 [11] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: Appendix F, §4.6.
 [12] (2016) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §2.
 [13] (2020) RMP-SNN: residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13558–13567. Cited by: §1, §3.1, §3.3, §3.4, §4.2, Table 1.
 [14] (2014) 1.1 computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: Appendix D, §4.3.
 [15] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1, §2.
 [16] (2018) Deep neural networks with weighted spikes. Neurocomputing 311, pp. 373–386. Cited by: Appendix B.
 [17] (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
 [18] (2016) HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1346–1359. Cited by: Table 2.
 [19] (2020) Enabling spikebased backpropagation for training deep neural network architectures. Frontiers in Neuroscience 14. Cited by: §1, §3.4, §3.5, §4.2, Table 1.
 [20] (2016) Training deep spiking neural networks using backpropagation. Frontiers in neuroscience 10, pp. 508. Cited by: §1, §3.5, §4.3.
 [21] (2017) CIFAR10-DVS: an event-stream dataset for object classification. Frontiers in Neuroscience 11, pp. 309. Cited by: §4.1.
 [22] (2018) The impact of onchip communication on memory technologies for neuromorphic systems. Journal of Physics D: Applied Physics 52 (1), pp. 014003. Cited by: Appendix E.
 [23] (2017) Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems 29 (7), pp. 3227–3235. Cited by: Appendix B.
 [24] (2019) Surrogate gradient learning in spiking neural networks. IEEE Signal Processing Magazine 36, pp. 61–63. Cited by: Appendix A, §1, §1, §3.1, §4.3.
 [25] (2015) HFirst: a temporal approach to object recognition. IEEE transactions on pattern analysis and machine intelligence 37 (10), pp. 2028–2040. Cited by: Table 2.

 [26] (2020) Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization. Frontiers in Neuroscience 14. Cited by: §1.
 [27] (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 475–480. Cited by: §3.4.
 [28] (2020) T2FSNN: deep spiking neural networks with time-to-first-spike coding. arXiv preprint arXiv:2003.11741. Cited by: Appendix E, §4.3.
 [29] (2017) Automatic differentiation in pytorch. In NIPSW, Cited by: §3.1, §4.1.
 [30] (2020) Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. arXiv preprint arXiv:2005.01807. Cited by: §1, §3.4, §3.5, §4.2, Table 1.
 [31] (2019) Towards spikebased machine intelligence with neuromorphic computing. Nature 575 (7784), pp. 607–617. Cited by: Appendix B, §1, §3.5.
 [32] (2017) Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience 11, pp. 682. Cited by: §1.
 [33] (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §1, §2.
 [34] (2019) Going deeper in spiking neural networks: vgg and residual architectures. Frontiers in neuroscience 13, pp. 95. Cited by: §1, §1, §1, §3.4, §3.5, §4.2, §4.3, Table 1.
 [35] (2020) Inherent adversarial robustness of deep spiking neural networks: effects of discrete input encoding and nonlinear activations. arXiv preprint arXiv:2003.10399. Cited by: Appendix F, §4.6.
 [36] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §4.3.
 [37] (2018) HATS: histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1731–1740. Cited by: Table 2.
 [38] (2016) Branchynet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. Cited by: §3.4.
 [39] (2018) Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience 12, pp. 331. Cited by: §1, §3.4.

 [40] (2019) Direct training for spiking neural networks: faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1311–1318. Cited by: Appendix C, §3.1, §4.1, Table 2.