Revisiting Batch Normalization for Training Low-latency Deep Spiking Neural Networks from Scratch

by   Youngeun Kim, et al.
Yale University

Spiking Neural Networks (SNNs) have recently emerged as an alternative to deep learning owing to sparse, asynchronous, and binary event- (or spike-) driven processing, which can yield huge energy-efficiency benefits on neuromorphic hardware. However, training high-accuracy and low-latency SNNs from scratch suffers from the non-differentiable nature of a spiking neuron. To address this training issue in SNNs, we revisit batch normalization and propose a temporal Batch Normalization Through Time (BNTT) technique. Most prior SNN works have disregarded batch normalization, deeming it ineffective for training temporal SNNs. Different from previous works, our proposed BNTT decouples the parameters in a BNTT layer along the time axis to capture the temporal dynamics of spikes. The temporally evolving learnable parameters in BNTT allow a neuron to control its spike rate through different time-steps, enabling low-latency and low-energy training from scratch. We conduct experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and the event-driven DVS-CIFAR10 dataset. BNTT allows us to train deep SNN architectures from scratch, for the first time, on complex datasets with just 25-30 time-steps. We also propose an early exit algorithm using the distribution of parameters in BNTT to reduce the latency at inference, which further improves energy efficiency.




1 Introduction

Artificial Neural Networks (ANNs) have shown state-of-the-art performance across various computer vision tasks. Nonetheless, the huge energy consumption incurred in implementing ANNs on conventional von-Neumann hardware limits their usage in low-power and resource-constrained Internet of Things (IoT) environments, such as mobile phones and drones. In the context of low-power machine intelligence, Spiking Neural Networks (SNNs) have received considerable attention in the recent past [31, 26, 4, 9, 5]. Inspired by biological neuronal mechanisms, SNNs process visual information with discrete spikes or events over multiple time-steps. Recent works have shown that the event-driven behavior of SNNs can be implemented on emerging neuromorphic hardware to yield 1-2 orders of magnitude energy-efficiency gains over ANNs [1, 6]. Despite the energy-efficiency benefits, SNNs have still not been widely adopted due to inherent training challenges. The training issue arises from the non-differentiable characteristic of a spiking neuron, generally of the Integrate-and-Fire (IF) type [3], which makes SNNs incompatible with gradient-descent training.

Figure 1: Visualization of the average number of spikes in each layer with respect to time-steps, for (a) ANN-SNN conversion (total time-steps / accuracy: 1000 / 91.2%), (b) surrogate-gradient backpropagation (100 / 88.7%), and (c) BNTT (25 / 90.5%). Compared to (a) and (b), our BNTT captures the temporal dynamics of spike activation with learnable parameters, enabling low-latency (i.e., few time-steps) and low-energy (i.e., fewer spikes) training. All experiments are conducted on CIFAR-10 with VGG9.

To address the training issue of SNNs, several methods, such as Conversion and Surrogate Gradient Descent, have been proposed. In ANN-SNN conversion [34, 13, 10, 32], off-the-shelf trained ANNs are converted to SNNs using normalization methods to transfer ReLU activation to IF spiking activity. The advantage here is that training happens in the ANN domain, leveraging widely used machine learning frameworks such as PyTorch, which yields short training time and can be applied to complex datasets. However, the ANN-SNN conversion method requires a large number of time-steps for inference (on the order of thousands; see Table 1) to yield competitive accuracy, which significantly increases the latency and energy consumption of the SNN. On the other hand, directly training SNNs with a surrogate gradient function [24, 19, 39] exploits the temporal dynamics of spikes, resulting in fewer time-steps (on the order of hundreds). However, the discrepancy between the forward spike activation function and the backward surrogate gradient function during backpropagation restricts the training capability: only shallow SNNs (e.g., VGG5) can be trained using surrogate gradient descent, and therefore they achieve high performance only on simple datasets (e.g., MNIST and CIFAR-10). Recently, a hybrid method [30] that combines the conversion method and the surrogate-gradient method shows state-of-the-art performance at reasonable latency (a few hundred time-steps; see Table 1). However, the hybrid method incurs sequential processes, i.e., training an ANN from scratch, converting the ANN to an SNN, and training the SNN using surrogate gradient descent, which increases the total computation cost of obtaining the final SNN model. Overall, training high-accuracy and low-latency SNNs from scratch still remains an open problem.

In this paper, we revisit Batch Normalization (BN) for more advanced SNN training. The BN layer [15] has been used extensively in deep learning to accelerate the training process of ANNs. It is well known that BN reduces internal covariate shift (or smooths the optimization landscape [33]), mitigating the problem of exploding/vanishing gradients. However, till now, numerous studies on surrogate gradients for SNNs [20] have observed that BN does not help with SNN optimization. Moreover, most ANN-SNN conversion methods [34] get rid of BN, since applying BN to time-sequential spikes sets the firing thresholds of all neurons to non-discriminative/similar values across all inputs, resulting in an accuracy decline.

Motivation & Contribution: A natural question then arises: Can standard BN capture the proper structure of the temporal dynamics of spikes in SNNs? Through this paper, we assert that standard BN hardly captures temporal characteristics, as it represents the statistics of all time-steps with one common set of parameters. Thus, a temporally adaptive BN approach is required. To this end, we propose a new SNN-crafted batch normalization layer called Batch Normalization Through Time (BNTT) that decouples the parameters in the BN layer across different time-steps. BNTT is implemented as an additional layer in SNNs and is trained with surrogate-gradient backpropagation. To investigate the effect of our BNTT, we compare the statistics of the spike activity of BNTT with previous approaches: Conversion [34] and standard Surrogate Gradient Descent [24], as shown in Fig. 1. Interestingly, different from the conversion method and the surrogate-gradient method (without BNTT), which maintain reasonable spike activity during the entire time period across different layers, the spike activity of layers trained with BNTT follows a Gaussian-like trend. BNTT imposes a variation in spiking across different layers, wherein each layer's activity peaks in a particular time-step range and then decreases. Moreover, the peaks for early layers occur at initial time-steps and later layers peak at later time-steps. This phenomenon implies that the learnable parameters in BNTT enable the network to pass visual information temporally from shallow to deeper layers in an effective manner.

The newly observed characteristics of BNTT bring several advantages. First, similar to BN, the BNTT layer enables SNNs to be trained stably from scratch even on large-scale datasets. Second, the learnable parameters in BNTT enable SNNs to be trained with low latency (25-30 time-steps) and impose optimal spike activity across different layers for low-energy inference. Finally, the distribution of the BNTT learnable parameter (i.e., γ) is a good representation of the temporal dynamics of spikes. Hence, relying on the observation that a low γ value induces low spike activity and vice versa, we further propose a temporal early exit algorithm, whereby an SNN can predict at an earlier time-step and does not need to wait till the end of the time period to make a prediction.

In summary, our key contributions are as follows: (i) For the first time, we introduce a batch normalization technique for SNNs, called BNTT. (ii) BNTT allows SNNs to be implemented in low-latency and low-energy environments. (iii) We further propose a temporal early exit algorithm at inference time by monitoring the learnable parameters in BNTT. (iv) To ascertain that BNTT captures the temporal characteristics of SNNs, we mathematically show that the proposed BNTT has a similar effect as controlling the firing threshold of the spiking neuron at every time-step during inference.

2 Batch Normalization

Batch Normalization (BN) reduces the internal covariate shift (or variation of the loss landscape [33]) caused by the distribution change of input signals, which is a known problem of deep neural networks [15]. Instead of calculating the statistics of the total dataset, the intermediate representations are standardized over a mini-batch to reduce the computation complexity. Given a mini-batch B = {x_1, ..., x_m}, the BN layer computes the mean and variance of the mini-batch as:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2.

Then, the input features in the mini-batch are normalized with the calculated statistics as:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
where ε is a small constant for numerical stability. To further improve the representation capability of the layer, learnable parameters γ and β are used to transform the input features, which can be formulated as y_i = γ x̂_i + β. At inference time, BN uses the running averages of the mean and variance obtained during training. Previous works show that the BN layer not only improves performance but also reduces the number of iterations required for training convergence. Therefore, BN has become an indispensable training component of most ANN models, such as convolutional neural networks and recurrent neural networks [12]. On the other hand, the effectiveness of BN in bio-plausible SNNs has not been established yet.
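As a concrete reference, the standardize-then-scale computation above can be sketched in a few lines of NumPy (a minimal illustration of the formulas above, not a library implementation; the example mini-batch is made up):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Standard BN over a mini-batch (axis 0).
    x: (batch, features); gamma, beta: (features,) learnable parameters."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize
    y = gamma * x_hat + beta               # scale and shift
    return y, mu, var

# Made-up mini-batch of two samples with two features each.
x = np.array([[1.0, 2.0], [3.0, 6.0]])
y, mu, var = bn_forward(x, gamma=np.ones(2), beta=np.zeros(2))
```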

3 Methodology

3.1 Spiking Neural Networks

Different from conventional ANNs, SNNs transmit information using binary spike trains. To leverage temporal spike information, the Leaky-Integrate-and-Fire (LIF) model [7] is widely used to emulate neuronal functionality in SNNs, which can be formulated as a differential equation:

\tau_m \frac{dU(t)}{dt} = -U(t) + R I(t),

where U(t) represents the membrane potential of the neuron that characterizes its internal state, τ_m is the time constant of membrane-potential decay, and R and I(t) denote the input resistance and the input current at time t, respectively. Following previous work [40], we convert this continuous dynamic equation into a discrete equation for digital simulation. For a single post-synaptic neuron i, we can represent the membrane potential u_i^t at time-step t as:

u_i^t = \lambda u_i^{t-1} + \sum_j w_{ij} o_j^t. \qquad (4)

Here, j is the index of a pre-synaptic neuron, λ is a leak factor with value less than 1, o_j^t is the binary spike activation, and w_{ij} is the weight of the connection between pre- and post-neurons. From Eq. 4, the membrane potential of a neuron decreases due to the leak and increases due to the weighted sum of incoming input spikes.

Figure 2: (a) Illustration of spike activities in Leaky-Integrate-and-Fire neurons. (b) The approximated gradient value with respect to the membrane potential.

If the membrane potential u_i^t exceeds a pre-defined firing threshold θ, the LIF neuron generates a binary spike output o_i^t. After that, we perform a soft reset, where the membrane potential is reduced by the threshold value θ. Compared to a hard reset (resetting the membrane potential to zero after the neuron spikes), the soft reset minimizes information loss by maintaining the residual voltage and carrying it forward to the next time-step, thereby achieving better performance [13]. Fig. 2(a) illustrates the membrane potential dynamics of a LIF neuron.
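The discrete LIF update of Eq. 4 together with the soft reset can be sketched as follows (a minimal NumPy illustration; the leak and threshold values are arbitrary choices for the example, not values prescribed by the paper):

```python
import numpy as np

def lif_step(u, weighted_input, leak=0.99, threshold=1.0):
    """One discrete LIF update (Eq. 4) followed by a soft reset.
    u: membrane potentials; weighted_input: sum_j w_ij * o_j^t per neuron."""
    u = leak * u + weighted_input              # leaky integration
    spikes = (u >= threshold).astype(float)    # fire where threshold crossed
    u = u - spikes * threshold                 # soft reset keeps the residue
    return u, spikes

u = np.zeros(3)
u, s = lif_step(u, np.array([0.4, 1.2, 0.0]))  # only the second neuron fires
```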

For the output layer, we discard the thresholding functionality so that neurons do not generate any spikes. We allow the output neurons to accumulate the spikes over all time-steps by fixing the leak parameter (λ in Eq. 4) to one. This enables the output layer to compute a probability distribution after the softmax function without information loss. As with ANNs, the number of output neurons in SNNs is identical to the number of classes in the dataset. From the accumulated membrane potential, we can define the cross-entropy loss for SNNs as:

L = -\sum_i y_i \log \left( \frac{e^{u_i^T}}{\sum_k e^{u_k^T}} \right),

where y_i is the ground-truth label and T represents the total number of time-steps. Then, the weights of all layers are updated by backpropagating the loss value with gradient descent.

To compute the gradients at each layer l, we use back-propagation through time (BPTT), which accumulates the gradients over all time-steps [24]. This approach can be implemented with auto-differentiation tools, such as PyTorch [29], that enable backpropagation on the unrolled network. To this end, we compute the loss function at time-step T and use gradient-descent optimization. Mathematically, we can define the accumulated gradients at layer l by the chain rule as:

\Delta W_l = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial U_l^t} \frac{\partial U_l^t}{\partial W_l} \ \text{(hidden layer)}, \qquad \Delta W_L = \frac{\partial L}{\partial U_L^T} \frac{\partial U_L^T}{\partial W_L} \ \text{(output layer)}. \qquad (6)

Here, O_l^t and U_l^t are the output spikes and membrane potentials at layer l, respectively. For the output layer, we get the derivative of the loss with respect to the membrane potential u_i^T at the final time-step T:

\frac{\partial L}{\partial u_i^T} = \text{softmax}(u_i^T) - y_i. \qquad (7)

This derivative function is continuous and differentiable for all possible membrane-potential values. On the other hand, LIF neurons in hidden layers generate a spike output only if the membrane potential exceeds the firing threshold, leading to non-differentiability. To deal with this problem, we introduce an approximate gradient:

\frac{\partial o_i^t}{\partial u_i^t} = \alpha \, \max\!\left(0,\ 1 - \left| \frac{u_i^t - \theta}{\theta} \right| \right), \qquad (8)

where α is a damping factor for back-propagated gradients. Note, a large α value causes unstable training as gradients are summed over all time-steps; hence, we set α to a small value. Overall, we update the network parameters at layer l based on the gradient value (Eq. 6) as W_l ← W_l − η ΔW_l.
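As an illustration, the piecewise-linear approximate gradient can be written as a standalone function (a sketch; the damping factor alpha = 0.3 here is an arbitrary small value for the example, not a value prescribed by the paper):

```python
def surrogate_grad(u, threshold=1.0, alpha=0.3):
    """Piecewise-linear approximate gradient do/du around the firing
    threshold: zero far from the threshold, peaking at u == threshold."""
    return alpha * max(0.0, 1.0 - abs((u - threshold) / threshold))
```

In practice this function replaces the true (zero-almost-everywhere) derivative of the spike step function only during the backward pass; the forward pass still emits binary spikes.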

3.2 Batch Normalization Through Time (BNTT)

The main contribution of this paper is a new SNN-crafted Batch Normalization (BN) technique. Naively applying BN does not have any effect on training SNNs. This is because using the same BN parameters (e.g., global mean μ, global variance σ², and learnable parameter γ) for the statistics of all time-steps does not capture the temporal dynamics of input spike trains. For example, an LIF neuron requires at least one time-step to propagate spikes to the next layer; therefore, the input signals to the third layer of an SNN are zero up to and including t = 2. Following the initial spike activity in that layer at t = 3, the spike signals vary depending upon the weight connections and the membrane potentials of previous layers. Therefore, a fixed global mean from a standard BN layer may not store any time-specific information, resulting in performance degradation at inference.

To resolve this issue, we vary the internal parameters of a BN layer through time, which we define as BNTT. Similar to the digital simulation of the LIF neuron across different time-steps, one BNTT layer is expanded temporally with a local learnable parameter associated with each time-step. This allows the BNTT layer to capture temporal statistics (see Section 3.3 for a mathematical analysis). The proposed BNTT layer is easily applied to SNNs by inserting it after convolutional/linear operations as:

u_i^t = \lambda u_i^{t-1} + \text{BNTT}_{\gamma^t}\!\left(\sum_j w_{ij} o_j^t\right) = \lambda u_i^{t-1} + \gamma_i^t \left( \frac{\sum_j w_{ij} o_j^t - \mu_i^t}{\sqrt{(\sigma_i^t)^2 + \epsilon}} \right). \qquad (9)
During the training process, we compute the mean μ^t and variance (σ^t)² from the samples in a mini-batch for each time-step t, as shown in Algorithm 1. Note, for each time-step t, we apply an exponential moving average to approximate the global mean and variance over training iterations. These global statistics are used to normalize the test data at inference. Also, we do not utilize β as in conventional BN, since it adds a redundant voltage to the membrane potential of SNNs.
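A minimal sketch of such a layer, with one γ and one set of running statistics per time-step and no β term, might look as follows (an illustration of the described behavior, not the authors' implementation; eps and momentum values are assumptions):

```python
import numpy as np

class BNTT:
    """Sketch of a BNTT layer: one learnable gamma and one set of running
    statistics per time-step; no beta term, following the text."""

    def __init__(self, num_features, timesteps, eps=1e-4, momentum=0.1):
        self.gamma = np.ones((timesteps, num_features))      # gamma^t
        self.run_mean = np.zeros((timesteps, num_features))  # global mu^t
        self.run_var = np.ones((timesteps, num_features))    # global (sigma^t)^2
        self.eps, self.momentum = eps, momentum

    def forward(self, x, t, training=True):
        """x: (batch, features) weighted input at time-step t."""
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average of the per-time-step statistics
            self.run_mean[t] = (1 - self.momentum) * self.run_mean[t] + self.momentum * mu
            self.run_var[t] = (1 - self.momentum) * self.run_var[t] + self.momentum * var
        else:  # inference: use stored global statistics
            mu, var = self.run_mean[t], self.run_var[t]
        return self.gamma[t] * (x - mu) / np.sqrt(var + self.eps)

bntt = BNTT(num_features=2, timesteps=4)
x = np.array([[0.0, 2.0], [2.0, 4.0]])
y = bntt.forward(x, t=0)
```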

Adding the BNTT layer to LIF neurons changes the gradient calculation for backpropagation. Given that x_b^t is an input signal to the BNTT layer, we can calculate the gradient passed by the BNTT layer to the lower layers as:

\frac{\partial L}{\partial x_b^t} = \frac{\gamma^t}{\sqrt{(\sigma^t)^2 + \epsilon}} \left( \frac{\partial L}{\partial \hat{x}_b^t} - \frac{1}{B} \sum_{b'=1}^{B} \frac{\partial L}{\partial \hat{x}_{b'}^t} - \frac{\hat{x}_b^t}{B} \sum_{b'=1}^{B} \frac{\partial L}{\partial \hat{x}_{b'}^t} \hat{x}_{b'}^t \right). \qquad (10)

Here, we omit the neuron index for simplicity. Also, B and b denote the batch size and batch index (see Appendix A for more detail). Thus, for every time-step t, gradients are calculated based on the time-specific statistics of the input signals. This allows the network to take temporal dynamics into account when training the weight connections. Moreover, the learnable parameter γ^t is updated to restore the representation power of the batch-normalized signal. Since we use different γ values across time-steps, γ^t finds an optimum for each time-step, enabling efficient inference. We update gamma as:

\frac{\partial L}{\partial \gamma^t} = \sum_{b=1}^{B} \frac{\partial L}{\partial \tilde{x}_b^t} \, \hat{x}_b^t, \qquad (11)

where x̃_b^t = γ^t x̂_b^t is the output of the BNTT layer.
3.3 Mathematical Analysis

In this section, we discuss the connection between BNTT and the firing threshold of a LIF neuron. Specifically, we formally show that using BNTT has a similar effect as varying the firing threshold over different time-steps, thereby ascertaining that BNTT captures temporal characteristics in SNNs. Recall that, at inference, BNTT normalizes the input signal using the stored approximated global average μ_i^t and standard deviation σ_i^t. From Eq. 9, we can calculate the membrane potential at time-step t = 1, given that the initial membrane potential u_i^0 is zero:

u_i^1 = \gamma_i^1 \left( \frac{\sum_j w_{ij} o_j^1 - \mu_i^1}{\sqrt{(\sigma_i^1)^2 + \epsilon}} \right) \approx \frac{\gamma_i^1}{\sigma_i^1} \sum_j w_{ij} o_j^1 = \frac{\gamma_i^1}{\sigma_i^1} \tilde{u}_i^1. \qquad (12)

Here, we assume that μ_i^t can be neglected under a small-signal approximation due to the spike sparsity in SNNs, and ũ_i^1 is the membrane potential at time-step 1 without BNTT (obtained from Eq. 4). We can observe that the membrane potential with BNTT is proportional to the membrane potential without BNTT at t = 1. For time-step t = 2, we should take into account the membrane potential from the previous time-step, which is multiplied by the leak λ. To this end, by substituting Eq. 12 into the BNTT equation (Eq. 9), we can formulate the membrane potential at t = 2 as:

u_i^2 = \lambda \frac{\gamma_i^1}{\sigma_i^1} \tilde{u}_i^1 + \frac{\gamma_i^2}{\sigma_i^2} \sum_j w_{ij} o_j^2 \approx \frac{\gamma_i^2}{\sigma_i^2} \left( \lambda \tilde{u}_i^1 + \sum_j w_{ij} o_j^2 \right) = \frac{\gamma_i^2}{\sigma_i^2} \tilde{u}_i^2. \qquad (13)
Algorithm 1 BNTT layer
Input: mini-batch at time-step t (X^t); learnable parameter (γ^t); moving-average update factor (α̃)
1: μ^t ← mean(X^t)  % mini-batch mean at time-step t
2: (σ^t)² ← var(X^t)  % mini-batch variance at time-step t
3: X̂^t ← (X^t − μ^t) / √((σ^t)² + ε)  % normalize
4: Y^t ← γ^t X̂^t  % scale (no β term)
5: % Exponential moving average
6: μ̄^t ← (1 − α̃) μ̄^t + α̃ μ^t;  σ̄^t ← (1 − α̃) σ̄^t + α̃ σ^t

In Eq. 13, the learnable parameter γ and the standard deviation σ have similar values in adjacent time intervals (t − 1, t) because of the continuous-time property. Hence, we can approximate γ_i^1 and σ_i^1 by γ_i^2 and σ_i^2, respectively. Finally, we can extend the equation of BNTT to time-step t:

u_i^t \approx \frac{\gamma_i^t}{\sigma_i^t} \tilde{u}_i^t. \qquad (14)

Considering that a neuron produces an output spike whenever the membrane potential exceeds the pre-defined firing threshold θ, the spike firing condition with BNTT can be represented as (γ_i^t / σ_i^t) ũ_i^t > θ. Comparing with the threshold of a neuron without BNTT, we can reformulate the firing condition as:

\tilde{u}_i^t > \frac{\sigma_i^t}{\gamma_i^t} \theta. \qquad (15)
Thus, we can infer that using a BNTT layer changes the effective firing threshold to (σ_i^t / γ_i^t) θ at every time-step. In practice, BNTT finds an optimal γ during training that improves the representation power, producing better-performing and lower-latency SNNs. This observation allows us to consider the advantages of time-varying learnable parameters in SNNs. This implication is in line with previous work [13], which insists that manipulating the firing threshold improves the performance and latency of the ANN-SNN conversion method. However, Han et al. change the threshold value in a heuristic way, without any optimization process, and fix the threshold value across all time-steps. On the other hand, our BNTT yields time-specific γ^t values, which can be optimized via back-propagation.

Algorithm 2 Training process with BNTT
Input: mini-batch (X); label set (Y); max time-step (T)
Output: updated network weights
1: for each training iteration do
2:   fetch a mini-batch X
3:   for t = 1 to T do
4:     O^t ← PoissonGenerator(X)
5:     for l = 1 to L − 1 do
6:       update membrane potentials and generate spikes with BNTT (Eqs. 4, 9)
7:     end for
8:     % For the final layer L, stack the voltage (no firing)
9:     u_L^t ← u_L^{t−1} + BNTT(W_L O_{L−1}^t)
10:   end for
11:   % Calculate the loss and back-propagation
12:   update the weights of all layers with the accumulated gradients
13: end for
Figure 3: The average value of γ at each layer over all time-steps. The early exit time can be set to t = 20, since the γ values at every layer fall below the threshold after time-step 20 (blue shaded area). Here, we use a VGG9 architecture on CIFAR-10.

3.4 Early Exit Algorithm

The main objective of early exit is to reduce the latency during inference [38, 27]. Most previous methods [39, 19, 34, 30, 13] accumulate output spikes till the end of the time sequence at inference, since all layers generate spikes across all time-steps, as shown in Fig. 1(a) and Fig. 1(b). On the other hand, the learnable parameters in BNTT modulate the spike activity of each layer to produce a peak value, after which it falls again (a Gaussian-like trend), as shown in Fig. 1(c). This phenomenon shows that SNNs using BNTT convey little information at the end of the spike train.

Inspired by this observation, we propose a temporal early exit algorithm based on the value of γ. From Eq. 15, we know that a low γ value increases the firing threshold, resulting in low spike activity; a high γ value, in contrast, induces more spike activity. It is worth mentioning that σ shows similar values across all time-steps, and therefore we only focus on γ. Given that the intensity of spike activity is proportional to γ, we can infer that spikes will hardly contribute to the classification result once the γ values across every layer drop to a minimum value. Therefore, we measure the average of the γ values in each layer at every time-step and terminate the inference when the average γ value of every layer is below a pre-determined threshold. For example, as shown in Fig. 3, we observe that all averaged γ values are lower than the threshold after time-step 20; therefore, we define the early exit time at t = 20. Note that we can determine the optimal time-step for early exit before forward propagation, without any additional computation. In summary, the temporal early exit method enables us to find the earliest time-step during inference that ensures the integration of crucial information, in turn reducing the inference latency without significant loss of accuracy.
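The exit-time selection described above reduces to a simple scan over the per-layer average γ values, which can be sketched as follows (illustrative; the γ trajectories and the threshold below are made-up values):

```python
import numpy as np

def early_exit_time(gamma_avg, threshold):
    """gamma_avg: (num_layers, timesteps) layer-averaged gamma values.
    Returns the first time-step at which every layer's average gamma is
    below `threshold`; if that never happens, returns the full length."""
    below = (gamma_avg < threshold).all(axis=0)  # all layers quiet at t?
    hits = np.nonzero(below)[0]
    return int(hits[0]) if hits.size else gamma_avg.shape[1]

# Made-up gamma trajectories for two layers over four time-steps.
gamma_avg = np.array([[0.9, 0.5, 0.05, 0.01],
                      [0.2, 0.8, 0.04, 0.02]])
t_exit = early_exit_time(gamma_avg, threshold=0.1)
```

Because the γ values are fixed after training, this scan runs once, before any test input is processed.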

3.5 Overall Optimization

Algorithm 2 summarizes the whole training process of SNNs with BNTT. Our proposed BNTT acts as a regularizer, unlike previous methods [19, 34, 20, 30] that use dropout to perform regularization. Our training scheme is based on widely used rate coding, where the spike generator produces a Poisson spike train (see Appendix B) for each pixel in the image, with frequency proportional to the pixel intensity [31]. For all layers, the weighted sum of the input signal is passed through a BNTT layer and then accumulated in the membrane potential. If the membrane potential exceeds the firing threshold, the neuron generates an output spike. For the last layer, we accumulate the input voltage over all time-steps without leak and feed it to a softmax layer to output a probability distribution. Then, we calculate the cross-entropy loss and the gradients for the weights of each layer with the approximate gradient function. During the training phase, a BNTT layer computes the time-dependent statistics (i.e., μ^t and σ^t) and stores the moving-average global mean and variance. At inference, we first define the early exit time-step based on the value of γ^t in BNTT. Then, the network classifies the test input (note, test data is normalized with the pre-computed global BNTT statistics) based on the accumulated output voltage at the pre-computed early exit time-step.
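The rate-coded input generation mentioned above can be sketched as a per-time-step Bernoulli draw (a common simplification of the Poisson spike generator; the image values and time-step count are made up for the example):

```python
import numpy as np

def poisson_generator(image, rng):
    """Rate coding: each pixel fires with probability equal to its
    intensity at every time-step (intensities assumed scaled to [0, 1])."""
    return (rng.random(image.shape) < image).astype(float)

rng = np.random.default_rng(0)
img = np.array([0.0, 1.0, 0.5])          # made-up pixel intensities
spikes = np.stack([poisson_generator(img, rng) for _ in range(1000)])
rates = spikes.mean(axis=0)              # empirical firing rates
```

Averaged over many time-steps, the firing rate of each input neuron approaches its pixel intensity, which is the property rate coding relies on.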

4 Experiments

In this section, we carry out comprehensive experiments on public classification datasets. Till now, training SNNs from scratch with surrogate gradients has been limited to simple datasets, e.g., CIFAR-10, due to the difficulty of direct optimization. In this paper, for the first time, we train SNNs with surrogate gradients from scratch and report performance on large-scale datasets, including CIFAR-100 and Tiny-ImageNet, with multi-layered network architectures. We first compare our BNTT with previous SNN training methods. Then, we quantitatively and qualitatively demonstrate the effectiveness of our proposed BNTT.

4.1 Experimental Setup

We evaluate our method on three static datasets (i.e., CIFAR-10, CIFAR-100, Tiny-ImageNet) and one neuromorphic dataset (i.e., DVS-CIFAR10). CIFAR-10 [17] consists of 60,000 images (50,000 for training / 10,000 for testing) in 10 categories. All images are RGB color images of size 32×32. CIFAR-100 has the same configuration as CIFAR-10, except that it contains images from 100 categories. Tiny-ImageNet is a modified subset of the original ImageNet dataset [8], with 200 classes, 100,000 training images, and 10,000 validation images at a resolution of 64×64 pixels. DVS-CIFAR10 [21] has the same configuration as CIFAR-10; this discrete event-stream dataset is collected by moving an event-driven camera. We follow a data pre-processing protocol and network architecture similar to previous work [40] (details in Appendix C). Our implementation is based on PyTorch [29]. We train the networks with standard SGD with momentum 0.9 and weight decay 0.0005, and we also apply random crop and horizontal flip to the input images. The base learning rate is set to 0.3, and we use step-wise learning-rate scheduling with a decay factor of 10 at 50%, 70%, and 90% of the total number of epochs. We set the total number of epochs to 120, 240, 90, and 60 for CIFAR-10, CIFAR-100, Tiny-ImageNet, and DVS-CIFAR10, respectively.

Method Dataset Training Method Architecture Time-steps Accuracy (%)
Cao et al. [4] CIFAR-10 ANN-SNN Conversion 3Conv, 2Linear 400 77.4
Sengupta et al. [34] CIFAR-10 ANN-SNN Conversion VGG16 2500 91.5
Lee et al. [19] CIFAR-10 Surrogate Gradient VGG9 100 90.4
Rathi et al. [30] CIFAR-10 Hybrid VGG16 200 92.0
Han et al. [13] CIFAR-10 ANN-SNN Conversion VGG16 2048 93.6
w.o. BNTT CIFAR-10 Surrogate Gradient VGG9 100 88.7
BNTT (ours) CIFAR-10 Surrogate Gradient VGG9 25 90.5
BNTT + Early Exit (ours) CIFAR-10 Surrogate Gradient VGG9 20 90.3
Sengupta et al. [34] CIFAR-100 ANN-SNN Conversion VGG16 2500 70.9
Rathi et al. [30] CIFAR-100 Hybrid VGG16 125 67.8
Han et al. [13] CIFAR-100 ANN-SNN Conversion VGG16 2048 70.9
w.o. BNTT CIFAR-100 Surrogate Gradient VGG11 n/a n/a
BNTT (ours) CIFAR-100 Surrogate Gradient VGG11 50 66.6
BNTT + Early Exit (ours) CIFAR-100 Surrogate Gradient VGG11 30 65.8
Sengupta et al. [34] Tiny-ImageNet ANN-SNN Conversion VGG11 2500 54.2
w.o. BNTT Tiny-ImageNet Surrogate Gradient VGG11 n/a n/a
BNTT (ours) Tiny-ImageNet Surrogate Gradient VGG11 30 57.8
BNTT + Early Exit (ours) Tiny-ImageNet Surrogate Gradient VGG11 25 56.8
Table 1: Classification Accuracy (%) on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
Method Type Accuracy (%)
Orchard et al. [25] Random Forest 31.0
Lagorce et al. [18] HOTS 27.1
Sironi et al. [37] HAT 52.4
Sironi et al. [37] Gabor-SNN 24.5
Wu et al. [40] Surrogate Gradient 60.5
w.o. BNTT Surrogate Gradient n/a
BNTT (ours) Surrogate Gradient 63.2
Table 2: Classification Accuracy (%) on DVS-CIFAR10.
Figure 4: (a) Visualization of layer-wise spike activity (log scale) in VGG9 on the CIFAR-10 dataset. (b) Performance change with respect to the standard deviation of the Gaussian noise.

4.2 Comparison with Previous Methods

On public datasets, we compare our proposed BNTT method with previous rate-coding-based SNN training methods, including ANN-SNN conversion [13, 34, 4], surrogate-gradient back-propagation [19], and hybrid [30] methods. From Table 1, we can observe some advantages and disadvantages of each training method. The ANN-SNN conversion methods perform better than the surrogate-gradient method across all datasets. However, they require a large number of time-steps for training and testing, which is energy-inefficient and impractical for real-time applications. The hybrid method aims to resolve this high-latency problem, but it still requires over a hundred time-steps. The surrogate-gradient method suffers from poor optimization and hence cannot be scaled to larger datasets such as CIFAR-100 and Tiny-ImageNet. Our BNTT is based on the surrogate-gradient method; however, it enables SNNs to achieve high performance even on more complicated datasets. At the same time, we dramatically reduce the latency due to the learnable parameters and temporal statistics in the BNTT layer. As a result, BNTT can be trained with 25 time-steps on the simple CIFAR-10 dataset while preserving state-of-the-art accuracy. For CIFAR-100, with 50 time-steps BNTT achieves roughly 40-50× and 2.5× faster inference compared to the conversion methods (2048-2500 time-steps) and the hybrid method (125 time-steps), respectively (see Table 1). Interestingly, for Tiny-ImageNet, BNTT achieves better performance and shorter latency than the previous conversion method. Note that the ANN with VGG11 architecture used for ANN-SNN conversion achieves 56.3% accuracy. Moreover, using the early exit algorithm further reduces the latency (e.g., from 50 to 30 time-steps on CIFAR-100), which enables the networks to be implemented with lower latency and better energy-efficiency. It is worth mentioning that the surrogate-gradient method without BNTT (w.o. BNTT in Table 1) only converges on CIFAR-10.
For the neuromorphic DVS-CIFAR10 dataset (Table 2), ANN-SNN conversion methods are not applicable, since ANNs hardly capture the temporal dynamics of a spike train. Using BNTT improves the stability of training compared to a surrogate-gradient baseline (i.e., w.o. BNTT) and achieves state-of-the-art performance. These results show that our BNTT technique is very effective on event-driven data and hence well-suited for neuromorphic applications.

Figure 5: Histogram visualization (x-axis: γ value, y-axis: frequency) at the conv1 (row 1), conv4 (row 2), and conv7 (row 3) layers of VGG9 across all time-steps. The experiments are conducted on CIFAR-10 with 25 time-steps.
Method Latency Accuracy (%) Energy efficiency (normalized to ANN)
VGG9 (ANN) 1 91.5 1
Conversion 1000 91.2 0.32
Conversion 500 90.9 0.55
Conversion 100 89.3 2.71
Surrogate Gradient 100 88.7 1.05
BNTT 25 90.5 9.14
Table 3: Energy efficiency comparison.

4.3 Energy Comparison

We compare the layer-wise spiking activity of our BNTT with two widely used methods, i.e., the ANN-SNN conversion method [34] and the surrogate-gradient method (w.o. BNTT) [24]. Note, we refer to our approach as BNTT and the standard surrogate approach (w.o. BNTT) as the surrogate-gradient method in the remainder of the text. Specifically, we calculate the spike rate of each layer l, defined as the total number of spikes at layer l over all time-steps divided by the number of neurons in layer l (see Appendix D for the equation of the spike rate). In Fig. 4(a), converted SNNs show a high spike rate for every layer, as they forward spike trains through a larger number of time-steps compared to the other methods. Even though the surrogate-gradient method uses fewer time-steps, it still requires nearly a hundred spikes per neuron in each layer. Compared to these methods, BNTT significantly improves the spike sparsity across all layers.

More precisely, as done in previous works [28, 20], we compute the energy consumption of SNNs in standard CMOS technology [14], as shown in Appendix D, by counting the net multiply-and-accumulate (MAC) operations. As the computation of SNNs is event-driven with binary {1, 0} spike processing, each MAC operation reduces to just a floating-point (FP) addition. On the other hand, conventional ANNs still require one FP addition and one FP multiplication to conduct the same MAC operation (see Appendix D for more detail). Table 3 shows the energy efficiency of ANNs and SNNs with a VGG9 architecture [36] on CIFAR-10. As expected, ANN-SNN conversion yields a trade-off between accuracy and energy efficiency. For the same latency, the surrogate-gradient method expends more energy than the conversion method. It is interesting to note that even though our BNTT is trained with the surrogate-gradient method, it achieves a 9.14× improvement in energy efficiency compared to the ANN (Table 3). In addition, we conduct a further energy comparison on a neuromorphic architecture in Appendix E.
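The MAC-versus-addition argument can be made concrete with a back-of-the-envelope sketch; the per-operation energy values below are commonly cited 45 nm CMOS figures (Horowitz, ISSCC 2014) used here as assumptions, not numbers taken from this paper's Appendix D:

```python
# Assumed 45 nm CMOS per-operation energies (Horowitz, ISSCC 2014).
E_ADD = 0.9e-12   # J per 32-bit FP addition
E_MULT = 3.7e-12  # J per 32-bit FP multiplication

def ann_energy(num_macs):
    # ANN: every MAC costs one FP multiply plus one FP add.
    return num_macs * (E_ADD + E_MULT)

def snn_energy(num_macs, spike_rate):
    # SNN: binary spikes turn each MAC into a single FP addition,
    # and only the spiking fraction of inputs triggers any work.
    return num_macs * spike_rate * E_ADD

ratio = ann_energy(1e6) / snn_energy(1e6, spike_rate=0.1)
```

With a hypothetical 10% spike rate, the sparsity factor multiplies the add-versus-MAC saving, which is why lowering the spike rate (as BNTT does) directly lowers energy.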

4.4 Analysis on Learnable Parameters in BNTT

The key observation of our work is the change of γ across time-steps. To analyze the distribution of the learnable parameters in our BNTT, we visualize the histogram of γ in the conv1, conv4, and conv7 layers of VGG9, as shown in Fig. 5. Interestingly, all layers show a different temporal evolution of the γ distribution. For example, conv1 has high γ values at the initial time-steps, which decrease as time goes on. On the other hand, starting from small values, the γ values in the conv4 and conv7 layers peak at intermediate time-steps and then shrink toward zero at later time-steps. Notably, the peak time is delayed as the layer goes deeper, implying that visual information is passed through the network sequentially over a period of time, similar to Fig. 1(c). This Gaussian-like rise and fall of γ across time-steps supports the explanation for the overall low spike activity compared to other methods (Fig. 4(a)).

4.5 Analysis on Early Exit

Recall that we measure the average of the γ values in each layer at every time-step, and stop the inference when the averages in every layer are lower than a predetermined threshold. To further investigate this, we vary the predetermined threshold and show the accuracy and exit-time trends. As shown in Fig. 6, we observe that a high threshold enables the network to exit at earlier time-steps. Although fewer time-steps are used during inference, the accuracy drops only marginally. This implies that BNTT rarely sends crucial information at the end of the spike train (see Fig. 1(c)). Note that the temporal evolution of the learnable parameter γ with our BNTT allows us to exploit the early exit algorithm, which yields a huge advantage in terms of reduced latency at inference. Such a strategy has not been proposed or explored in prior works, which have mainly focused on reducing the number of time-steps during training without effectively using temporal statistics.
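The exit rule above can be sketched directly. The per-layer γ averages here are a toy schedule, not values from the paper:

```python
# Early exit: stop inference at the first time-step where the average
# BNTT gamma of *every* layer falls below a fixed threshold.

def early_exit_time(avg_gammas, threshold):
    """avg_gammas[t][l]: average gamma of layer l at time-step t+1.
    Returns the first (1-indexed) time-step at which all layer averages
    are below `threshold`, or the full window length otherwise."""
    for t, per_layer in enumerate(avg_gammas, start=1):
        if all(g < threshold for g in per_layer):
            return t
    return len(avg_gammas)

# Toy gamma schedule for a 3-layer network over 5 time-steps.
gammas = [
    [0.90, 0.20, 0.10],
    [0.60, 0.50, 0.20],
    [0.30, 0.40, 0.30],
    [0.10, 0.20, 0.20],
    [0.05, 0.10, 0.10],
]
print(early_exit_time(gammas, threshold=0.35))  # exits at t = 4
```

Raising the threshold makes the condition easier to satisfy, so inference exits earlier, matching the trend in Fig. 6.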

Figure 6: Visualization of accuracy and early exit time with respect to the threshold value for γ. (a) CIFAR-10. (b) CIFAR-100. (c) Tiny-ImageNet.

4.6 Analysis on Robustness

Finally, we highlight the advantage of BNTT in terms of robustness to noisy inputs. To investigate the effect of our BNTT on robustness, we evaluate the performance change of SNNs as we feed in inputs with varying levels of noise, generated by adding Gaussian noise to the clean input image. From Fig. 4(b), we observe the following: i) The accuracy of the conversion method degrades considerably as the noise intensity grows. ii) Compared to ANNs, SNNs trained with surrogate gradient back-propagation show better performance at higher noise intensity; still, they suffer large accuracy drops in the presence of noisy inputs. iii) BNTT achieves significantly higher performance than the other methods across all noise intensities. This is because using BNTT decreases the overall number of time-steps, which is a crucial contributing factor to robustness [35]. These results imply that, in addition to low latency and energy efficiency, our BNTT method also offers improved robustness for suitably deploying SNNs in real-world scenarios. We further analyze robustness with regard to adversarial attacks [11] in Appendix F.
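The corruption step used in this evaluation is a standard one; a minimal sketch (the model and dataset plumbing are omitted, and the pixel range and seed are illustrative assumptions) looks like:

```python
import random

# Add zero-mean Gaussian noise of standard deviation sigma to a
# normalized image, then clip back to the valid [0, 1] range.
def add_gaussian_noise(image, sigma, rng=random.Random(0)):
    """image: flat list of pixel intensities in [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]

clean = [0.0, 0.5, 1.0]
noisy = add_gaussian_noise(clean, sigma=0.1)
assert all(0.0 <= p <= 1.0 for p in noisy)
```

Sweeping `sigma` over increasing values and re-running inference yields the accuracy-vs-noise curves of Fig. 4(b).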

5 Conclusion

In this paper, we revisit the batch normalization technique and propose a novel mechanism for training low-latency, energy-efficient, robust, and accurate SNNs from scratch. Our key idea is to extend the effect of batch normalization to the temporal dimension with time-specific learnable parameters and statistics. We discover that optimizing learnable parameters during the training phase enables visual information to be passed through the layers sequentially. For the first time, we directly train SNNs on large datasets such as Tiny-ImageNet, which opens up the potential advantage of surrogate gradient-based backpropagation for future practical research in SNNs.

Appendix A Appendix: Backward Gradient of BNTT

Here, we calculate the backward gradient of a BNTT layer. Note that we omit the neuron index for simplicity. For one sample in a mini-batch, we compute the backward gradient of BNTT at time-step t via the chain rule:

∂L/∂w = Σ_t (∂L/∂o^t) (∂o^t/∂u^t) (∂u^t/∂w),    (16)

where w is a synaptic weight, o^t is the spike output, and u^t is the membrane potential at time-step t.

It is worth mentioning that we accumulate input signals at the last layer in order to remove information loss. Then we convert the accumulated voltage into probabilities using a softmax function. Therefore, we calculate the backward gradient with respect to the loss L, following the previous work [24]. The first term of the R.H.S. in Eq. (16) is the error signal propagated from the layer above:

∂L/∂o^t = Σ_k (∂L/∂u_k^t) (∂u_k^t/∂o^t),    (18)

where k runs over the post-synaptic neurons of the next layer. For the second term of the R.H.S. in Eq. (16), the non-differentiable spike function is replaced by a piecewise-linear surrogate gradient:

∂o^t/∂u^t = α max{0, 1 − |u^t − θ| / θ},    (19)

where θ is the firing threshold and α is a damping factor. For the third term of the R.H.S. in Eq. (16), differentiating through the BNTT transform (treating the mini-batch statistics μ^t and σ^t as constants) gives:

∂u^t/∂w = γ^t x^t / √((σ^t)² + ε),    (20)

where x^t is the pre-synaptic input at time-step t. Based on Eq. (18), Eq. (19), and Eq. (20), we can reformulate Eq. (16) as:

∂L/∂w = Σ_t (∂L/∂o^t) · α max{0, 1 − |u^t − θ| / θ} · γ^t x^t / √((σ^t)² + ε).    (21)

To summarize, for every time-step t, gradients are calculated based on the time-specific statistics γ^t, μ^t, and σ^t of the input signals. This allows the network to take temporal dynamics into account when training the weight connections.
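A common piecewise-linear surrogate for the spike derivative, in the spirit of [24], can be sketched as follows. The threshold `theta` and damping factor `alpha` are illustrative hyper-parameter values, not ones taken from the paper:

```python
# Surrogate gradient for the spike function: the true derivative of the
# Heaviside step is zero almost everywhere, so the backward pass replaces
# it with a triangular window centered on the firing threshold.

def surrogate_grad(u, theta=1.0, alpha=0.3):
    """Approximate d(spike)/d(membrane potential) at potential u."""
    return alpha * max(0.0, 1.0 - abs(u - theta) / theta)

print(surrogate_grad(1.0))  # at the threshold: maximal gradient, 0.3
print(surrogate_grad(3.0))  # far from the threshold: gradient is 0.0
```

During training, this function supplies the second chain-rule term, while the forward pass still emits hard binary spikes.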

Appendix B Appendix: Rate Coding

Figure 7: Example of rate coding. As time goes on, the accumulated spikes represent an image increasingly similar to the original. We use an image from the CIFAR-10 dataset.

Spiking neural networks process multiple binary spikes. Therefore, for training and inference, a static image needs to be converted into spike trains. There are various spike coding schemes such as rate, temporal, and phase coding [23, 16]. Among them, we use rate coding due to its reliable performance across various tasks. Rate coding provides spikes in proportion to the pixel intensity of the given image. To implement this, following previous work [31], we compare each pixel value with a random number drawn between the minimum and maximum possible pixel intensities at every time-step. If the pixel intensity is greater than the random number, the Poisson spike generator outputs a spike of unit amplitude; otherwise, the Poisson spike generator does not yield any spike. We visualize rate coding in Fig. 7. The spikes generated at a given time-step are random; however, as time goes on, the accumulated spikes represent a result similar to the original image.
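The Poisson rate-coding scheme above can be sketched in a few lines, assuming pixel intensities normalized to [0, 1]:

```python
import random

# Rate coding: at each time-step, a pixel fires a unit spike with
# probability equal to its normalized intensity.
def rate_encode(image, num_steps, rng=random.Random(0)):
    """image: flat list of intensities in [0, 1].
    Returns a num_steps x len(image) binary spike train."""
    return [[1 if p > rng.random() else 0 for p in image]
            for _ in range(num_steps)]

train = rate_encode([0.0, 0.5, 1.0], num_steps=100)
rates = [sum(step[i] for step in train) / 100 for i in range(3)]
print(rates)  # approximately [0.0, 0.5, 1.0]
```

Averaging the spike train over time recovers the pixel intensities, which is exactly the accumulation effect shown in Fig. 7.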

Appendix C Appendix: DVS-CIFAR10 dataset

On DVS-CIFAR10, following [40], we downsample the 128×128 images to 42×42. Also, we divide the total number of time-steps available from the original time-frame data into 20 intervals and accumulate the spikes within each interval. We use an architecture similar to previous work [40], which consists of a 5-layered feature extractor and a classifier. The detailed architecture is shown in Fig. 8 in this appendix.
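The temporal accumulation step can be sketched as below. The event format, a list of `(t, pixel_index)` tuples, is a hypothetical simplification of the raw DVS stream (which also carries x/y coordinates and polarity):

```python
# Accumulate raw event timestamps into a fixed number of equal time
# intervals, producing one per-pixel spike-count frame per interval.

def bin_events(events, total_time, num_bins, num_pixels):
    frames = [[0] * num_pixels for _ in range(num_bins)]
    for t, pix in events:
        # Map timestamp t in [0, total_time] to a bin index.
        b = min(int(t / total_time * num_bins), num_bins - 1)
        frames[b][pix] += 1
    return frames

events = [(0.0, 0), (0.4, 1), (0.9, 1)]
frames = bin_events(events, total_time=1.0, num_bins=2, num_pixels=2)
print(frames)  # [[1, 1], [0, 1]]
```

For DVS-CIFAR10 the paper uses 20 such intervals, so each sample becomes a 20-step input sequence.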

Figure 8: Illustration of the network structure for the DVS dataset. Here, AP denotes average pooling and FC denotes a fully connected layer.

Appendix D Appendix: Energy Calculation

In this appendix section, we provide the details of the energy calculation discussed in Section 4.3 of the main paper. The total computational cost is proportional to the total number of floating point operations (FLOPS), which is approximately the same as the number of matrix-vector multiplication (MVM) operations. For a convolutional layer l in ANNs, we can calculate FLOPS as:

FLOPS_ANN(l) = k² × O² × C_in × C_out.

Here, k is the kernel size, O is the output feature map size, and C_in and C_out are the numbers of input and output channels, respectively. For SNNs, we first define the spike rate R_s(l) at layer l, which is the average firing rate per neuron:

R_s(l) = (total #spikes at layer l over all time-steps) / (#neurons at layer l).

Since neurons in SNNs only consume energy whenever they spike, we multiply the spike rate with FLOPS to obtain the SNN FLOP count:

FLOPS_SNN(l) = FLOPS_ANN(l) × R_s(l).

Finally, the total inference energies of ANNs (E_ANN) and SNNs (E_SNN) across all layers can be obtained:

E_ANN = Σ_l FLOPS_ANN(l) × E_MAC,    E_SNN = Σ_l FLOPS_SNN(l) × E_AC.

The E_MAC and E_AC values are calculated using a standard 45 nm CMOS process [14], as shown in Table 4.

Operation Energy (pJ)
32-bit FP MULT (E_MULT) 3.7
32-bit FP ADD (E_ADD) 0.9
32-bit FP MAC (E_MAC) 4.6 (= E_MULT + E_ADD)
32-bit FP AC (E_AC) 0.9
Table 4: Energy table for the 45 nm CMOS process.
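Putting the formulas and the table together, the ANN-vs-SNN energy estimate can be sketched as follows. The per-layer FLOP counts and spike rates below are illustrative placeholders, not measurements from the paper:

```python
# Energy model: every ANN operation is a full MAC (4.6 pJ at 45 nm),
# while an SNN operation fires only at the layer's spike rate and
# reduces to a single accumulate (0.9 pJ) for binary spikes.

E_MAC, E_AC = 4.6, 0.9  # pJ, from the 45 nm CMOS energy table

def ann_energy(flops_per_layer):
    return sum(flops_per_layer) * E_MAC

def snn_energy(flops_per_layer, spike_rates):
    return sum(f * r for f, r in zip(flops_per_layer, spike_rates)) * E_AC

flops = [1e6, 2e6]   # hypothetical per-layer FLOP counts
rates = [0.2, 0.1]   # hypothetical average spikes per neuron
print(ann_energy(flops))         # 13.8e6 pJ
print(snn_energy(flops, rates))  # 0.36e6 pJ
```

The gap widens as spike rates drop, which is why BNTT's sparser activity translates directly into energy savings.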

Appendix E Appendix: Energy Comparison in Neuromorphic Architecture

Method Time-steps #Spikes Normalized Energy [1]
Conversion 1000 419.30 1
Surrogate 100 141.96 0.3384
BNTT 25 13.106 0.0312
Table 5: Normalized energy comparison on neuromorphic architecture: TrueNorth[1]. We set conversion as a reference for normalized energy comparison. We conduct experiments on CIFAR-10 with a VGG9 architecture.

We further show the energy efficiency of BNTT on a neuromorphic architecture, TrueNorth [1]. Following previous work [28, 22], we compute the normalized energy, which can be decomposed into dynamic energy (E_dyn) and static energy (E_st). The E_dyn term corresponds to the computing cores and routers, and E_st is for maintaining the state of the CMOS circuit. The total energy consumption can be calculated as E_dyn × #Spikes + E_st × #Time-steps, where (E_dyn, E_st) are (0.4, 0.6). In Table 5, we show that our BNTT has a huge advantage in terms of energy efficiency on neuromorphic hardware.

Appendix F Appendix: Adversarial Robustness

Figure 9: Classification accuracy with respect to the intensity of FGSM attack (eps).

In order to further validate the robustness of BNTT, we conduct experiments on adversarial inputs. We use FGSM [11] to generate adversarial samples for the ANN. For a given image x, we compute the loss function L(x, y) with the ground-truth label y. The objective of the FGSM attack is to change the pixel intensities of the input image in the direction that maximizes the cost function:

x_adv = x + ε · sign(∇_x L(x, y)).

We call x_adv an "adversarial sample". Here, ε denotes the strength of the attack. To conduct the FGSM attack on the SNN, we use the SNN-crafted FGSM method proposed in [35]. In Fig. 9, we show the classification performance for varying intensities of the FGSM attack. The SNN approaches (e.g., BNTT and surrogate BP) show more robustness than the ANN due to their temporal dynamics and stochastic neuronal functionality. We highlight that our proposed BNTT shows much higher robustness compared to the others. Thus, we assert that BNTT improves the robustness of SNNs in addition to their energy efficiency and latency.

Appendix G Appendix: Comparison with Layer Norm

Layer Normalization (LN) [2] was proposed as an optimization method for recurrent neural networks (RNNs). The authors asserted that directly applying BN layers is difficult since RNNs vary with the length of the input sequence; to this end, an LN layer calculates the mean and variance over each single layer. As SNNs also take time-sequence data as input, we compare our BNTT with Layer Normalization in Table 6. For all experiments, we use a VGG9 architecture. Also, we set the base learning rate to 0.3 and use step-wise learning rate scheduling as described in Section 4.1 of our main manuscript. The results show that BNTT is a more suitable structure for capturing the temporal dynamics of Poisson-encoded spikes.

Method Acc (%)
Layer Normalization [2] 75.4
BNTT 90.5
Table 6: Comparison with Layer Normalization on CIFAR-10 dataset.


  • [1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, et al. (2015) Truenorth: design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems 34 (10), pp. 1537–1557. Cited by: Table 5, Appendix E, §1.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Table 6, Appendix G.
  • [3] A. N. Burkitt (2006) A review of the integrate-and-fire neuron model: i. homogeneous synaptic input. Biological cybernetics 95 (1), pp. 1–19. Cited by: §1.
  • [4] Y. Cao, Y. Chen, and D. Khosla (2015) Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113 (1), pp. 54–66. Cited by: §1, §4.2, Table 1.
  • [5] I. M. Comsa, T. Fischbacher, K. Potempa, A. Gesmundo, L. Versari, and J. Alakuijala (2020) Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8529–8533. Cited by: §1.
  • [6] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §1.
  • [7] P. Dayan, L. F. Abbott, et al. (2001) Theoretical neuroscience, vol. 806. Cambridge, MA: MIT Press. Cited by: §3.1.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.
  • [9] P. U. Diehl and M. Cook (2015) Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience 9, pp. 99. Cited by: §1.
  • [10] P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer (2015) Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: Appendix F, §4.6.
  • [12] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2016) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §2.
  • [13] B. Han, G. Srinivasan, and K. Roy (2020) RMP-snn: residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13558–13567. Cited by: §1, §3.1, §3.3, §3.4, §4.2, Table 1.
  • [14] M. Horowitz (2014) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: Appendix D, §4.3.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1, §2.
  • [16] J. Kim, H. Kim, S. Huh, J. Lee, and K. Choi (2018) Deep neural networks with weighted spikes. Neurocomputing 311, pp. 373–386. Cited by: Appendix B.
  • [17] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [18] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman (2016) Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE transactions on pattern analysis and machine intelligence 39 (7), pp. 1346–1359. Cited by: Table 2.
  • [19] C. Lee, S. S. Sarwar, P. Panda, G. Srinivasan, and K. Roy (2020) Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in Neuroscience 14. Cited by: §1, §3.4, §3.5, §4.2, Table 1.
  • [20] J. H. Lee, T. Delbruck, and M. Pfeiffer (2016) Training deep spiking neural networks using backpropagation. Frontiers in neuroscience 10, pp. 508. Cited by: §1, §3.5, §4.3.
  • [21] H. Li, H. Liu, X. Ji, G. Li, and L. Shi (2017) Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience 11, pp. 309. Cited by: §4.1.
  • [22] S. Moradi and R. Manohar (2018) The impact of on-chip communication on memory technologies for neuromorphic systems. Journal of Physics D: Applied Physics 52 (1), pp. 014003. Cited by: Appendix E.
  • [23] H. Mostafa (2017) Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems 29 (7), pp. 3227–3235. Cited by: Appendix B.
  • [24] E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate gradient learning in spiking neural networks. IEEE Signal Processing Magazine 36, pp. 61–63. Cited by: Appendix A, §1, §1, §3.1, §4.3.
  • [25] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman (2015) HFirst: a temporal approach to object recognition. IEEE transactions on pattern analysis and machine intelligence 37 (10), pp. 2028–2040. Cited by: Table 2.
  • [26] P. Panda, S. A. Aketi, and K. Roy (2020) Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization. Frontiers in Neuroscience 14. Cited by: §1.
  • [27] P. Panda, A. Sengupta, and K. Roy (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 475–480. Cited by: §3.4.
  • [28] S. Park, S. Kim, B. Na, and S. Yoon (2020) T2FSNN: deep spiking neural networks with time-to-first-spike coding. arXiv preprint arXiv:2003.11741. Cited by: Appendix E, §4.3.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §3.1, §4.1.
  • [30] N. Rathi, G. Srinivasan, P. Panda, and K. Roy (2020) Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. arXiv preprint arXiv:2005.01807. Cited by: §1, §3.4, §3.5, §4.2, Table 1.
  • [31] K. Roy, A. Jaiswal, and P. Panda (2019) Towards spike-based machine intelligence with neuromorphic computing. Nature 575 (7784), pp. 607–617. Cited by: Appendix B, §1, §3.5.
  • [32] B. Rueckauer, I. Lungu, Y. Hu, M. Pfeiffer, and S. Liu (2017) Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience 11, pp. 682. Cited by: §1.
  • [33] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §1, §2.
  • [34] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy (2019) Going deeper in spiking neural networks: vgg and residual architectures. Frontiers in neuroscience 13, pp. 95. Cited by: §1, §1, §1, §3.4, §3.5, §4.2, §4.3, Table 1.
  • [35] S. Sharmin, N. Rathi, P. Panda, and K. Roy (2020) Inherent adversarial robustness of deep spiking neural networks: effects of discrete input encoding and non-linear activations. arXiv preprint arXiv:2003.10399. Cited by: Appendix F, §4.6.
  • [36] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §4.3.
  • [37] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman (2018) HATS: histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1731–1740. Cited by: Table 2.
  • [38] S. Teerapittayanon, B. McDanel, and H. Kung (2016) Branchynet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. Cited by: §3.4.
  • [39] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi (2018) Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience 12, pp. 331. Cited by: §1, §3.4.
  • [40] Y. Wu, L. Deng, G. Li, J. Zhu, Y. Xie, and L. Shi (2019) Direct training for spiking neural networks: faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1311–1318. Cited by: Appendix C, §3.1, §4.1, Table 2.