Enabling Spike-based Backpropagation in State-of-the-art Deep Neural Network Architectures

03/15/2019, by Chankyu Lee et al., Purdue University

Spiking Neural Networks (SNNs) have recently emerged as a prominent neural computing paradigm. However, typical shallow spiking network architectures have limited capacity for expressing complex representations, while training very deep spiking networks has not been successful so far. Diverse methods have been proposed to get around this issue, such as converting off-line trained deep Artificial Neural Networks (ANNs) to SNNs. However, the ANN-to-SNN conversion scheme fails to capture the temporal dynamics of a spiking system. On the other hand, it is still a difficult problem to directly train deep SNNs using input spike events due to the discontinuous and non-differentiable nature of the spike signals. To overcome this problem, we propose using a differentiable (but approximate) activation for Leaky Integrate-and-Fire (LIF) spiking neurons to train deep convolutional SNNs with input spike events using a spike-based backpropagation algorithm. Our experiments show the effectiveness of the proposed spike-based learning strategy on state-of-the-art deep networks (VGG and Residual architectures) by achieving the best classification accuracies on the MNIST, SVHN and CIFAR-10 datasets compared to other SNNs trained with spike-based learning. Moreover, we analyze sparse event-driven computations to demonstrate the efficacy of the proposed SNN training method for inference operation in the spiking domain.


1 Introduction

Over the last few years, deep learning has made tremendous progress and has become a prevalent tool for coping with various cognitive tasks such as object detection, speech recognition and reasoning. Various deep learning techniques [22, 44, 16] enable the effective optimization of deep ANNs by constructing multiple levels of feature hierarchies and show remarkable results, which occasionally outperform human-level performance [21, 13, 40]. To that effect, deploying deep learning is becoming necessary not only on large-scale computers, but also on edge devices (e.g. phone, tablet, smart watch, robot). However, the ever-growing complexity of state-of-the-art deep neural networks, together with the explosion in the amount of data to be processed, places significant energy demands on current computing platforms. For example, a deep ANN model requires an unprecedented amount of computing hardware resources, often demanding the computing power of cloud servers and a significant amount of time to train.

Spiking Neural Networks (SNNs) are a leading candidate for overcoming the constraints of neural computing and for efficiently harnessing machine learning algorithms in real-life (or mobile) applications [29, 5]. The concepts of SNNs, often regarded as the third generation of neural networks [28], are inspired by the biologically plausible Leaky Integrate-and-Fire (LIF) spiking neuron model [6], which can efficiently process spatio-temporal information. The LIF neuron model is characterized by an internal state, called the membrane potential, that integrates the inputs over time and generates an output spike (or Dirac delta pulse) whenever it reaches the neuronal firing threshold. This mechanism enables event-driven and asynchronous computations across the layers of a spiking system, which makes it naturally suitable for ultra-low power and low latency operation. Furthermore, recent works [39, 36] have shown that these properties make SNNs significantly more attractive for deeper networks in the case of hardware implementation. This is because the spike signals become significantly sparser as the layers go deeper, so that the number of required computations reduces significantly. In this context, several training strategies can be applied to take full advantage of SNNs.

The general training strategies of SNNs fall into two categories - ANN-to-SNN conversion and direct spike-based training. First, there are studies which have successfully deployed the ANN-to-SNN conversion technique that transforms an off-line trained ANN into an SNN for efficient event-driven inference [4, 8, 15, 39, 36]. The main objective of the ANN-to-SNN conversion scheme is to leverage state-of-the-art ANN training techniques, so that the transformed networks can mimic the competitive classification performance of the ANNs. For instance, specialized SNN hardware (such as SpiNNaker [10] and IBM TrueNorth [29]) has exhibited greatly improved power efficiency as well as state-of-the-art inference performance. However, the converted network takes a large number of time-steps (latency) to resemble the input-output mapping of its pre-trained ANN counterpart. This is because only an Integrate-and-Fire (IF) spiking neuron can replace an ANN (ReLU) neuron, and hence the converted network cannot effectively capture the temporal dynamics of spatio-temporal event-driven information. On the other hand, it is still a difficult problem to directly train a deep spiking neural network using input spike events and a spike-based learning algorithm, mainly because of the non-differentiable activation and discontinuous nature of spike signals. To that effect, the unsupervised Spike-Timing-Dependent-Plasticity (STDP) learning algorithm has been explored for training two-layer SNNs (consisting of input and output layers) by considering the local correlations of pre- and post-neuronal spike timing. An STDP-trained two-layer network (consisting of 6400 output neurons) has been shown to achieve 95% classification accuracy on the MNIST dataset. However, the shallow network structure limits the expressive power of the neural network [7, 50, 3, 42, 43] and suffers from scalability issues as the classification performance easily saturates. Layer-wise STDP learning [18, 24] has shown the capability of efficient feature extraction on multi-layer convolutional SNNs. Nevertheless, the performance gaps compared to ANN models (trained with the standard BP algorithm) are still significantly large. The unsatisfactory classification performance of unsupervised local learning necessitates a spike-based supervised learning rule such as the gradient descent backpropagation (BP) algorithm [37]. In the context of SNNs, the spike-based BP learning algorithms introduced in [2, 25] dealt with the membrane potential as a differentiable activation of the spiking neuron to train the synaptic weights. [34] applied BP-based supervised training for the classifier after training the feature extractor layer by layer using an auto-encoder mechanism. By leveraging the best of both unsupervised and supervised learning, [23] showed that layer-wise STDP learning along with spike-based BP has a synergistic effect that improves robustness, generalization ability and training speed. In this paper, we take these prior works forward to effectively train very deep SNNs using end-to-end spike-based gradient descent backpropagation learning.

The main contributions of our work are as follows. First, we develop a spike-based supervised gradient descent BP algorithm that exploits a differentiable approximate activation function of the LIF neuron. In addition, we leverage the key ideas of successful deep ANN models such as LeNet5 [22], VGG [41] and ResNet [13] to efficiently construct state-of-the-art deep SNN architectures. We also adapt the dropout [44] technique in order to better regularize deep SNN training. Next, we demonstrate the effectiveness of our methodology for visual recognition tasks on standard character and object datasets (MNIST, SVHN, CIFAR-10) and a neuromorphic dataset (N-MNIST). To the best of our knowledge, this work achieves the best classification accuracy on the MNIST, SVHN and CIFAR-10 datasets through direct training of deep SNNs. Lastly, we quantify and analyze the advantages of the event-driven BP algorithm compared to ANN-to-SNN conversion techniques in terms of inference time and energy consumption.

The rest of the paper is organized as follows. In section 2.1, we provide background on the fundamental components and architectures of deep convolutional SNNs. In section 2.2.1, we detail the spike-based gradient descent backpropagation learning algorithm. Subsequently, in section 2.2.2, we describe the spiking version of the dropout technique used in this work. In sections 3.1-3.2, we describe the experiments and report the simulation results, which validate the efficacy of spike-based BP training on the MNIST, SVHN, CIFAR-10 and N-MNIST datasets. In section 4.1, we discuss the proposed algorithm in comparison to relevant works. In sections 4.2-4.4, we analyze the spike activity, inference speedup and complexity reduction of direct-spike trained SNNs and ANN-SNN converted networks. Finally, we summarize and conclude the paper in section 5.

Figure 1: The operation of a Leaky Integrate and Fire (LIF) neuron.

2 Materials and methods

2.1 The Component and Architecture of Spiking Neural Network

2.1.1 Spiking Neural Network Components

The Leaky-Integrate-and-Fire (LIF) neurons [6] and plastic synapses are fundamental and biologically plausible computational elements for emulating the dynamics of SNNs. The neurons in adjacent layers are massively inter-connected via associated plastic synapses, whereas no connections exist within a layer. The spike input signals always move in one direction, from the input layer through the hidden layers to the output layer. The dynamics of the LIF spiking neuron can be formulated as:

\tau_m \frac{dV_{mem}^{j}}{dt} = -V_{mem}^{j} + \sum_{i=1}^{n_l} w_{ij}\,\theta_i(t) \qquad (1)

where V_mem^j is the post-neuronal membrane potential, τ_m is the time constant for membrane potential decay, n_l indicates the number of pre-neurons, w_ij is the synaptic weight connecting the i-th pre-neuron to the j-th post-neuron and θ_i(t) denotes a spike event from the i-th pre-neuron at time t. The operation of a LIF neuron is presented in figure 1. The impact of each pre-spike, θ_i, is modulated by the corresponding synaptic weight (w_ij) to generate the current influx flowing into the post-neuron in the next layer. The stimulus fed as current influx is integrated in the post-neuronal membrane potential (V_mem), which leaks exponentially over time. The decay constant (τ_m) decides the degree of membrane leakage over time, and a smaller value of τ_m indicates stronger membrane potential decay. When the accumulated membrane potential reaches or exceeds the neuronal firing threshold (V_th), the corresponding neuron generates a post-spike to its fan-out synapses and resets its membrane potential to the initial value (zero). In table 1, we list the notations used in equations (1-14).

Notation Meaning
θ Spike
x Sum of spike events throughout the time
w Synaptic weight
V_mem Membrane potential
V_th Neuronal firing threshold
net Total (incoming) current influx throughout the time
a Activation of spiking neuron
E Loss function
δ Error gradient
Table 1: List of Notations
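To make the discrete-time behavior above concrete, the following minimal sketch (not the authors' implementation; layer shapes, the decay factor and the threshold value are illustrative assumptions) simulates one layer of LIF neurons in PyTorch.

import torch

def lif_step(spikes_in, weights, v_mem, v_th=1.0, decay=0.99):
    """One discrete time-step of a layer of LIF neurons.

    spikes_in : binary spike tensor from the pre-layer, shape (batch, n_pre)
    weights   : synaptic weight matrix, shape (n_pre, n_post)
    v_mem     : membrane potential state, shape (batch, n_post)
    """
    # Weighted spikes form the current influx integrated into the membrane potential,
    # which also leaks (decays) at every time-step.
    v_mem = decay * v_mem + spikes_in @ weights
    # A post-neuron fires whenever its membrane potential reaches the threshold.
    spikes_out = (v_mem >= v_th).float()
    # Firing neurons reset their membrane potential to zero.
    v_mem = v_mem * (1.0 - spikes_out)
    return spikes_out, v_mem

# Toy usage: 4 pre-neurons, 3 post-neurons, batch of 1, simulated for 100 time-steps.
torch.manual_seed(0)
w = torch.randn(4, 3) * 0.5
v = torch.zeros(1, 3)
for t in range(100):
    pre_spikes = (torch.rand(1, 4) < 0.3).float()  # Poisson-like toy input
    out, v = lif_step(pre_spikes, w, v)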

2.1.2 Deep Convolutional Spiking Neural Network

Building Blocks

In this work, we develop a training methodology for convolutional SNN models that consist of an input layer followed by intermediate hidden layers and a final output layer. In the input layer, the pixel images are encoded as Poisson-distributed spike trains, where the probability of spike generation is proportional to the pixel intensity. The hidden layers consist of multiple convolutional (C) and spatial-pooling (P) layers which are often arranged in an alternating manner. These convolutional (C) and spatial-pooling (P) layers represent the intermediate stages of the feature extractor. The spikes from the feature extractor are combined to generate a one-dimensional vector input for the fully-connected (FC) layers, which produce the final classification. The convolutional and fully-connected layers contain trainable parameters (i.e. synaptic weights), while the spatial-pooling layers are fixed a priori. Through the training procedure, the weight kernels in the convolutional layers encode the feature representations of the input patterns at multiple hierarchical levels. Therefore, through the convolution operation, the trained convolutional kernels can detect spatially correlated local features in the input patterns. This inherently allows the network to be invariant to translation (shift) in the object location. A convolutional layer is often followed by a spatial-pooling layer. The spatial-pooling layer is used to downscale the dimensions of the feature maps produced by the previous convolutional layer, while retaining the spatial correlation between neighboring pixels in every feature map.

There are various choices for performing the spatial-pooling operation in the ANN domain. The two major choices are max-pooling (maximum neuron output over the pooling window) and average-pooling (two-dimensional average pooling over the pooling window). In most state-of-the-art deep ANNs, max-pooling is the most popular option. However, since neuron activations are binary in SNNs instead of analog values, max-pooling does not provide useful information to the following layer. Therefore, we have used an averaging mechanism for spatial-pooling. In SNNs, the average-pooling scheme differs from the ANN version in that an additional thresholding is applied after averaging to generate output spikes. For instance, a fixed 2×2 kernel (each element having a weight of 0.25) strides through a convolutional feature map without overlapping and fires an output spike at the corresponding location in the pooled feature map only if the sum of the weighted spikes of the 4 inputs within the kernel window exceeds a designated threshold. The threshold for average-pooling has to be carefully set so that spike propagation is not disrupted by the pooling. If the threshold is too low, there will be too many spikes, which can cause loss of the spatial location of the feature extracted by the previous layer. On the other hand, if the threshold is too high, there will not be enough spike propagation to the deeper layers. We have used a threshold of 0.75 for the fixed 2×2 kernel (each element having a weight of 0.25) in the average-pooling layers. This means that if there are at least 3 spikes in the 2×2 window, then 1 spike is generated in the pooled map. For a different kernel size, the threshold has to be adjusted to maintain a similar ratio (0.75). The pooling operation provides several key benefits. First, it reduces the size of the convolutional feature maps and provides additional network invariance to input transformations. Furthermore, the pooling operation enlarges the effective size of the convolutional kernels in the following layer as the feature maps are downscaled. This allows consecutive convolutional layers to efficiently learn hierarchical representations from low to high levels of abstraction. The number of pooled feature maps is the same as the number of output feature maps of the previous convolutional layer. The feature maps of the final pooling layer before the fully-connected layers are unrolled into a 1-D vector to be used as input for a fully-connected layer. One or more fully-connected layers eventually reach the output layer, which produces the inference decisions. This final fully-connected part of the network acts as a classifier that effectively maps the composition of features resulting from the alternating convolutional and pooling layers onto the final output classes.
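As an illustration of the thresholded average-pooling described above, the sketch below (an assumed implementation, not the authors' code) applies a fixed 2×2 averaging kernel, so each input effectively carries a weight of 0.25, and fires an output spike only where the pooled value reaches the 0.75 threshold, i.e. where at least 3 of the 4 inputs spiked.

import torch
import torch.nn.functional as F

def spiking_avg_pool2x2(spike_map, threshold=0.75):
    """Thresholded 2x2 average pooling for binary spike maps.

    spike_map : binary tensor of shape (batch, channels, H, W)
    Returns a binary pooled map of shape (batch, channels, H/2, W/2).
    """
    # Non-overlapping 2x2 averaging: each output is the mean of 4 input spikes
    # (equivalently, each input is weighted by 0.25).
    pooled = F.avg_pool2d(spike_map, kernel_size=2, stride=2)
    # Fire an output spike only if the weighted sum reaches the threshold,
    # i.e. at least 3 of the 4 inputs spiked when threshold = 0.75.
    return (pooled >= threshold).float()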

Figure 2: Basic building blocks of (a) VGG and (b) ResNet architectures in deep convolutional SNNs.

Deep Convolutional SNN architecture: VGG and Residual SNNs

Deep network topologies are essential for recognizing complex input patterns, as they can effectively learn hierarchical representations. To that effect, we investigate state-of-the-art deep neural network architectures such as VGG [41] and ResNet [13] in order to build deep SNN architectures. VGG [41] was one of the first neural networks to use small (3×3) convolutional kernels uniformly throughout the network. The utilization of small (3×3) kernels enables effective stacking of convolutional layers while minimizing the number of parameters in deep networks. In this work, we build deep convolutional SNNs (containing more than 5 trainable layers) by using the 'Spiking VGG Block', which contains a stack of convolutional layers with small (3×3) kernels. Figure 2a shows a 'Spiking VGG Block' containing two stacked convolutional layers with an intermediate LIF neuronal layer. Next, ResNet [13] introduced skip connections throughout the network, which had great success in enabling the training of significantly deeper networks. In particular, ResNet addresses the degradation (of training accuracy) problem [13] that occurs when increasing the number of layers in a plain feedforward neural network. We employ the concept of the skip connection to construct deep residual SNNs with 7-11 trainable layers. Figure 2b shows a 'Spiking Residual Block' consisting of a non-residual and a residual path. The non-residual path consists of two convolutional layers with an intermediate LIF neuronal layer. The residual path (skip connection) is composed of an identity mapping when the number of input and output feature maps is the same, and of 1×1 convolutional kernels when the number of input and output feature maps differs. The outputs of both the non-residual and residual paths are integrated into the membrane potential of the last LIF neuronal layer (LIF Neuron 2 in figure 2b) to generate output spikes from the 'Spiking Residual Block'. Within the feature extractor, a 'Spiking VGG Block' or 'Spiking Residual Block' is often followed by an average-pooling layer to construct the alternating convolutional and spatial-pooling structure. Note that in some 'Spiking Residual Blocks', the last convolutional and residual connections employ convolution with a stride of 2 to incorporate the functionality of the spatial-pooling layers. At the end of the feature extractor, the extracted features from the last average-pooling layer are fed to a fully-connected layer as a 1-D vector input to initiate the classifier operation.
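The two building blocks can be expressed, for instance, with the following skeleton; it assumes a reusable stateful LIF module (in the spirit of the LIF sketch above) passed in as `lif`, and the module interface and layer hyperparameters are illustrative assumptions rather than the authors' implementation.

import torch.nn as nn

class SpikingVGGBlock(nn.Module):
    """Two stacked 3x3 convolutions with an intermediate LIF neuronal layer.
    `lif` is assumed to be a factory returning a stateful spiking module."""
    def __init__(self, in_ch, out_ch, lif):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.lif1 = lif(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.lif2 = lif(out_ch)

    def forward(self, spikes):
        return self.lif2(self.conv2(self.lif1(self.conv1(spikes))))

class SpikingResidualBlock(nn.Module):
    """Non-residual path (two 3x3 convolutions with an intermediate LIF layer) plus a
    skip path (identity, or 1x1 convolution when the feature-map counts differ); both
    path outputs are integrated into the membrane potential of the last LIF layer."""
    def __init__(self, in_ch, out_ch, lif, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.lif1 = lif(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.lif2 = lif(out_ch)

    def forward(self, spikes):
        residual = self.skip(spikes)
        out = self.conv2(self.lif1(self.conv1(spikes)))
        # Both paths are accumulated before the last LIF layer generates output spikes.
        return self.lif2(out + residual)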

2.2 Supervised Training of Deep Spiking Neural Network

2.2.1 Spike-based Gradient Descent Backpropagation Algorithm

The spike-based BP algorithm in SNNs is adapted from standard BP [37] in the ANN domain. In standard BP, the network parameters are iteratively updated in a direction that minimizes the difference between the final outputs of the network and the target labels. The standard BP algorithm achieves this goal by backpropagating the output error through the hidden layers using the gradient descent method. However, the major difference between ANNs and SNNs is the dynamics of the neuronal output. An artificial neuron (such as sigmoid, tanh, or ReLU) communicates via continuous values, whereas a spiking neuron generates binary spike outputs over time. In SNNs, spatio-temporal spike trains are fed to the network as inputs. Accordingly, the outputs of spiking neurons are spike events which are discontinuous and discrete (non-differentiable) over time. Hence, the standard BP algorithm cannot be utilized to train SNNs, as it requires the gradients of the spiking neuronal activation function for backpropagating the output error. We derive a spike-based BP algorithm which is capable of learning spatio-temporal patterns in spike trains. We formulate a differentiable (but approximate) activation of the LIF neuron that enables modulation of the network parameters using the gradient descent method in a spiking system. The spike-based BP can be divided into three phases - forward propagation, backward propagation and weight update. We now describe the spike-based BP algorithm by going through each phase.

Forward Propagation

In forward propagation, spike trains representing input patterns and the corresponding output (target) labels are presented to the network for estimating the loss function. The loss function is a measure of the discrepancy between the target labels and the outputs predicted by the network. To generate the spike inputs, the input pixel values are converted to Poisson-distributed spike trains and delivered to the network. The input spikes are multiplied with synaptic weights to produce an input current. The resultant current is accumulated in the membrane potential of the post-neurons. A post-neuron generates an output spike whenever its membrane potential exceeds the neuronal firing threshold; otherwise, the membrane potential decays exponentially with time. After a post-neuronal firing, the membrane potential is reset, and the output spike is broadcast as input to the subsequent layer. The post-neurons of every layer carry out this process successively based on the weighted spikes received from the preceding layer. Over time, the total weighted summation of the spike trains is integrated at the post-neuron as formulated in equation (2). The sum of spike trains (denoted by x_i for the i-th input neuron) is weighted by the inter-connecting synaptic weights, w_ij:

net_j = \sum_{i=1}^{n_l} w_{ij} \Big( \sum_{t=1}^{T} \theta_i(t) \Big) = \sum_{i=1}^{n_l} w_{ij}\, x_i \qquad (2)

where net_j stands for the total (resultant) current influx received by post-neuron j throughout the time T, n_l is the number of pre-neurons and θ_i(t) is a spike event from the i-th pre-neuron at time instant t.

Figure 3: Illustration of the three phases (forward propagation, backward propagation and weight update) of the spike-based backpropagation algorithm in a LIF neuron.

In an SNN, the 'activation function' indicates the relationship between the weighted summation of pre-neuronal spike inputs and the post-neuronal outputs over time. A spike output signal is non-differentiable since it is discrete and creates a discontinuity (because of the step jump) at the time instant of firing. To that effect, applying standard backpropagation [37] in the spiking domain becomes difficult since it requires a differentiable activation function. To get around this predicament, we generate a 'differentiable activation' of the spiking neuron by low-pass filtering the individual post-spike train and dividing it by the total number of propagation steps (T), as formulated in equation (3). To compute the activation, a, of a LIF neuron, the unit spikes (at time instants t_k) are temporally integrated and the resultant sum is decayed within the time period, as shown in equation (3). The time constant (τ_m) determines the decay rate of the spiking neuronal activation. It influences the temporal dynamics of the spiking neuron by accounting for the exponential membrane potential decay and reset mechanisms. The neuronal firing threshold of the final layer is set to a very high value so that the output neurons do not generate any spike output. In the output layer, the weighted spikes from the previous layer are accumulated in the membrane potential while decaying over time. At the last time step, the accumulated membrane potential is divided by the number of total time steps in order to quantify the output distribution (output), as presented in equation (4). The prediction error of each output neuron is evaluated by comparing the output distribution (output) with the desired target label (label) of the presented input spike trains, as shown in equation (5). The corresponding loss function (E in equation (6)) is defined as the mean square of the final prediction error over all the output neurons.
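The quantities described above can be summarized as follows; this is a hedged restatement in the notation of Table 1 (the paper's exact equations (3)-(6), including their normalization constants, may differ slightly), with t_k denoting the spike times of a neuron:

a_j \;\approx\; \frac{1}{T}\sum_{k} \exp\!\Big(-\frac{T - t_k}{\tau_m}\Big) \qquad \text{(low-pass filtered, approximate activation)}

\text{output}_j \;=\; \frac{V_{mem,j}(T)}{T}, \qquad e_j \;=\; \text{output}_j - \text{label}_j, \qquad E \;=\; \frac{1}{2}\sum_{j} e_j^{\,2}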

Backward Propagation and Weight Update

Next, we formulate gradient descent backward propagation for SNNs. The first step is to estimate the gradients of the loss function at the output layer. Then, the gradients are propagated backward all the way down to the inputs through the hidden layers using the recursive chain rule. The following equations (7-14) and figure 3 describe the detailed mathematical steps for obtaining the partial derivatives of the error with respect to the weights.

The partial derivative of the error with respect to each weight can be calculated by applying the chain rule twice, as shown in equation (7). In equation (8), differentiating the loss function with respect to the post-neuronal activation provides the first term in equation (7). The gradient of the LIF neuronal activation can be derived by employing the 'backpropagation through time (BPTT)' technique used in recurrent ANN training [47]. The derivative of the post-neuronal activation with respect to the net input current is obtained by adding a unity value to the time derivative of the corresponding neuronal activation and dividing the result by the corresponding neuronal firing threshold, as described in equation (9). The addition of unity allows us to get around the discontinuity (step jump) that arises at each spike time, and the time derivative incorporates the leaky effect of the respective LIF neuronal membrane potential. It is worth mentioning here that [14, 1] employed the BPTT technique in different ways for spike-based BP algorithms. At the output layer, the error gradient, δ, represents the gradient of the output loss with respect to the net input current received by the post-neurons. It can be calculated by multiplying the final output error (e) with the derivative of the corresponding post-neuronal activation, a, with respect to its inputs, as shown in equation (10). Note that element-wise multiplication is indicated by '.' while matrix multiplication is represented by '*' in the respective equations. At any hidden layer, the local error gradient, δ, is recursively estimated by multiplying the back-propagated gradient from the successive layer with the derivative of the neuronal activation, as presented in equation (11).
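The gradient expressions described in this paragraph can be restated compactly as follows; this is a sketch consistent with the description above and the notation of Table 1, not a verbatim copy of equations (7)-(11) ('.' denotes element-wise and '*' matrix multiplication):

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j}\cdot\frac{\partial a_j}{\partial net_j}\cdot\frac{\partial net_j}{\partial w_{ij}}, \qquad \frac{\partial E}{\partial a_j} = e_j

\frac{\partial a_j}{\partial net_j} \approx \frac{1}{V_{th}}\Big(1 + \frac{\partial a_j}{\partial t}\Big) \qquad \text{(pseudo-derivative of the LIF activation)}

\delta^{\text{final}} = e \,.\, \frac{\partial a}{\partial net}, \qquad \delta^{l} = \big((w^{l+1})^{T} * \delta^{l+1}\big) \,.\, \frac{\partial a^{l}}{\partial net^{l}}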

The derivative of the net current with respect to the weight is simply the total number of incoming spikes throughout the time, as derived in equation (12). The derivative of the output loss with respect to the weights interconnecting two adjacent layers (equation (13)) is determined by multiplying the transposed error gradient at the successive layer with the input spikes from the preceding layer. In the case of convolutional neural networks, we backpropagate the error in order to obtain the partial derivatives of the loss function with respect to the given output feature map. Then, we average the partial derivatives over the output map connections sharing a particular weight to account for the effective updates of the convolutional weights. Finally, the calculated partial derivatives of the loss function are used to update the respective weights using a learning rate (η), as illustrated in equation (14). As a result, iteratively updating the weights over mini-batches of input patterns leads the network state to a local minimum, thereby enabling the network to capture multiple levels of internal representations of the data.
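In practice, such a hand-crafted derivative can be attached to the non-differentiable spike generation step, for example with a custom PyTorch autograd function as sketched below; the boxcar-shaped pseudo-derivative used here is a generic stand-in scaled by 1/V_th, not the paper's exact expressions from equations (9)-(14).

import torch

V_TH = 1.0  # neuronal firing threshold (set to unity, as in the experiments)

class SpikeFunction(torch.autograd.Function):
    """Threshold spike generation with a surrogate (pseudo-) derivative."""

    @staticmethod
    def forward(ctx, v_mem):
        ctx.save_for_backward(v_mem)
        # Emit a spike where the membrane potential reaches the threshold.
        return (v_mem >= V_TH).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_mem,) = ctx.saved_tensors
        # Boxcar pseudo-derivative: non-zero only near the threshold, scaled by 1/V_TH,
        # replacing the true zero-almost-everywhere derivative of the spike output.
        surrogate = ((v_mem - V_TH).abs() < 0.5).float() / V_TH
        return grad_output * surrogate

spike = SpikeFunction.apply
# Usage: spikes = spike(v_mem); gradients then flow through the surrogate in backward().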

2.2.2 Dropout in Spiking Neural Networks

Dropout [44] is one of the most popular regularization techniques for training deep ANNs. This technique randomly disconnects certain units with a given probability (p) to prevent the units from overfitting and co-adapting too much to the given training data. We employ the concept of the dropout technique in order to effectively regularize deep SNNs. Note that the dropout technique is only applied during training and is not used when evaluating the performance of the network during inference. There is a subtle difference in the way dropout is applied in SNNs compared to ANNs. In ANNs, each epoch of training has several iterations of mini-batches, and in each iteration randomly selected units (with a dropout ratio of p) are disconnected from the network, while the remaining units are weighted by the corresponding posterior probability. However, in SNNs, each iteration has more than one forward propagation depending on the time length of the spike train. We back-propagate the output error and modify the network parameters only at the last time step. For dropout to be effective in our training method, it has to be ensured that the set of connected units within an iteration of mini-batch data is not changed, such that the neural network is constituted by the same random subset of units during each forward propagation within a single iteration. On the other hand, if the units were randomly connected at each time-step, the effect of dropout would be averaged out over the entire forward propagation time within an iteration, and the dropout effect would fade out once the output error is propagated backward and the parameters are updated at the last time step. Therefore, it is necessary to keep the same set of randomly connected units for the entire time window within an iteration. In the experiments, we use the SNN version of the dropout technique with the probability (p) of omitting units equal to 0.2-0.25. Note that the activations are much sparser in SNN forward propagation compared to ANNs; hence the optimal p for SNNs needs to be less than the typical ANN dropout ratio (p=0.5). The details of SNN forward propagation with dropout are specified in Algorithm 1.

1: Input: Poisson input spike train (spikes), dropout ratio (p), total number of time steps (T)
2: // Define the random subset of units (with a probability determined by p) at each iteration
3: for each trainable layer l do
4:      mask^l ← randomly selected subset of the units of layer l, kept fixed for this iteration
5: for t = 1 to T do
6:      // Set input of first layer equal to spike train of a mini-batch data
7:      o^1[t] ← spikes[t]
8:      for each subsequent layer l do
9:          // Integrate weighted sum of input spikes to membrane potential with decay over time
10:         V_mem^l[t] ← decay(V_mem^l[t-1]) + (w^l * o^(l-1)[t]) . mask^l
11:         // Post-neuron fires if membrane potential is greater than neuronal threshold
12:         o^l[t] ← 1 if V_mem^l[t] > V_th, otherwise 0
13:         // Reset the membrane potential if post-neuron fires
14:         V_mem^l[t] ← V_mem^l[t] . (1 - o^l[t])
Algorithm 1: Forward propagation with Dropout at each iteration in SNN
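A minimal PyTorch-style rendering of Algorithm 1 for fully-connected layers is sketched below; the mask scaling, decay factor and tensor shapes are illustrative assumptions, and the key point is that a single dropout mask is sampled per iteration and reused for all T forward time-steps.

import torch

def snn_forward_with_dropout(spike_train, weights, v_th=1.0, decay=0.99, p=0.2):
    """Forward propagation of an SNN over T time-steps with a fixed dropout mask.

    spike_train : Poisson input spikes, shape (T, batch, n_in)
    weights     : list of weight matrices, one per trainable layer
    """
    T, batch, _ = spike_train.shape
    # Sample ONE random subset of units per iteration (kept fixed for all T steps).
    masks = [(torch.rand(batch, w.shape[1]) > p).float() / (1.0 - p) for w in weights[:-1]]
    v_mem = [torch.zeros(batch, w.shape[1]) for w in weights]

    for t in range(T):
        x = spike_train[t]
        for l, w in enumerate(weights[:-1]):
            # Integrate the weighted input spikes into the membrane potential with decay.
            v_mem[l] = decay * v_mem[l] + (x @ w) * masks[l]
            # Fire where the membrane potential reaches the threshold, then reset.
            x = (v_mem[l] >= v_th).float()
            v_mem[l] = v_mem[l] * (1.0 - x)
        # Output layer: accumulate membrane potential only (very high threshold, no spikes).
        v_mem[-1] = decay * v_mem[-1] + x @ weights[-1]
    return v_mem[-1] / T  # output distribution, as in the forward-propagation description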

3 Experimental Setup and Results

3.1 Experimental Setup

The primary goal of our experiments is to demonstrate the effectiveness of the proposed spike-based BP training methodology on a variety of deep network architectures. We first describe our experimental setup and baselines. For the experiments, we developed a custom simulation framework using the PyTorch [35] deep learning package for evaluating our proposed SNN training algorithm. Our deep convolutional SNNs are populated with biologically plausible LIF neurons in which pairs of pre- and post-neurons are interconnected by plastic synapses. At the beginning, the neuronal firing thresholds are set to a unity value and the synaptic weights are initialized from a Gaussian distribution with zero mean and a standard deviation of \sqrt{\kappa/n} (n: number of fan-in synapses), as introduced in [12]. Note that the initialization constant κ differs by the type of network architecture: we have used κ=2 for the non-residual networks and κ=1 for the residual networks. For training, the synaptic weights are trained with the mini-batch spike-based BP algorithm in an end-to-end manner as explained in section 2.2.1. We train our network models for 150 epochs using mini-batch stochastic gradient descent BP and reduce the learning rate at the 70th, 100th and 125th training epochs. For the neuromorphic dataset, we use the Adam [19] learning method and reduce the learning rate at the 40th, 80th and 120th training epochs. Please refer to table 2 for more implementation details. The datasets and network topologies used for benchmarking, the spike generation scheme for event-driven operation, and the determination of the number of time-steps required for training and inference are described in the following subsections.
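A sketch of the weight initialization described above is given below, assuming a PyTorch model whose trainable layers are Conv2d and Linear modules; the traversal and bias handling are illustrative assumptions.

import math
import torch.nn as nn

def initialize_snn_weights(model, kappa=2.0):
    """Zero-mean Gaussian initialization with std = sqrt(kappa / fan_in).
    Use kappa = 2 for non-residual SNNs and kappa = 1 for residual SNNs (see Table 2).
    Neuronal firing thresholds are handled separately (set to unity)."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            fan_in = m.in_channels * m.kernel_size[0] * m.kernel_size[1]
        elif isinstance(m, nn.Linear):
            fan_in = m.in_features
        else:
            continue
        nn.init.normal_(m.weight, mean=0.0, std=math.sqrt(kappa / fan_in))
        if m.bias is not None:
            nn.init.zeros_(m.bias)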

Parameter Value
Decay Constant of Membrane Potential and Neuronal Activation (τ_m) 100 time-steps
BP Training Time Duration 50-100 time-steps
Inference Time Duration Same as training
Mini-batch Size 16-32
Spatial-pooling Non-overlapping Region/Stride 2×2, 2
Weight Initialization Constant (κ) 2 (non-residual network), 1 (residual network)
Learning Rate (η) 0.002 - 0.003
Dropout Ratio (p) 0.2 - 0.25
Table 2: Parameters used in the Experiments

3.1.1 Benchmarking Datasets

We demonstrate the efficacy of our proposed training methodology for deep convolutional SNNs on three standard vision datasets and one neuromorphic vision dataset, namely MNIST [22], SVHN [32], CIFAR-10 [20] and N-MNIST [33]. The MNIST dataset is composed of gray-scale (single-channel) images of handwritten digits of size 28 by 28. The SVHN and CIFAR-10 datasets are composed of color (three-channel) images of size 32 by 32. The N-MNIST dataset is a neuromorphic (spiking) dataset converted from the static MNIST dataset using a Dynamic Vision Sensor (DVS) [26]. The N-MNIST dataset contains two-channel images consisting of ON and OFF event stream data of size 34 by 34. An ON (OFF) event represents an increase (decrease) in pixel brightness. Details of the benchmark datasets are listed in table 3. For evaluation, we report the top-1 classification accuracy on the test samples (the training and test samples are mutually exclusive).

Dataset Image #Training Samples #Testing Samples #Category
MNIST 28×28, gray 60,000 10,000 10
SVHN 32×32, color 73,000 26,000 10
CIFAR-10 32×32, color 50,000 10,000 10
N-MNIST 34×34, ON and OFF spikes 60,000 10,000 10
Table 3: Benchmark Datasets

3.1.2 Network Topologies

We use various SNN architectures depending on the complexity of the benchmark datasets. For the MNIST and N-MNIST datasets, we use a network consisting of two sets of alternating convolutional and spatial-pooling layers followed by two fully-connected layers. This network architecture is derived from the LeNet5 model [22]. Table 4 summarizes the layer type, kernel size, number of output feature maps and stride of the SNN model for the MNIST dataset. The kernel sizes shown in the table are for 3-D convolutions, where the 1st dimension is the number of input feature maps and the 2nd-3rd dimensions are the convolutional kernel dimensions. For the SVHN and CIFAR-10 datasets, we use deeper network models consisting of 7 to 11 trainable layers including convolutional, spatial-pooling and fully-connected layers. In particular, these networks with more than 5 trainable layers are constructed using small (3×3) convolutional kernels. We term the deep convolutional SNN architecture that uses 3×3 convolutional kernels [41] without residual connections 'VGG SNN', and the one with skip (residual) connections [13] 'Residual SNN'. In the Residual SNNs, some convolutional layers convolve 3×3 kernels with a stride of 2 in both the x and y directions to incorporate the functionality of spatial-pooling layers. Please refer to tables 4 and 5, which summarize the details of the deep convolutional SNN architectures. In the results section, we will discuss the benefit of deep SNNs in terms of classification performance as well as inference speedup and energy efficiency.

4 layer network VGG7 ResNet7
Layer type Kernel size #o/p feature-maps Stride Layer type Kernel size #o/p feature-maps Stride Layer type Kernel size #o/p feature-maps Stride
Convolution 1×5×5 20 1 Convolution 3×3×3 64 1 Convolution 3×3×3 64 1
Average-pooling 2×2 2 Convolution 64×3×3 64 2 Average-pooling 2×2 2
Average-pooling 2×2 2
Convolution 20×5×5 50 1 Convolution 64×3×3 128 1 Convolution 64×3×3 128 1
Average-pooling 2×2 2 Convolution 128×3×3 128 2 Convolution 128×3×3 128 2
Convolution 128×3×3 128 2 Skip convolution 64×1×1 128 2
Average-pooling 2×2 2
Convolution 128×3×3 256 1
Convolution 256×3×3 256 2
Skip convolution 128×1×1 256 2
Fully-connected 200 Fully-connected 1024 Fully-connected 1024
Output 10 Output 10 Output 10
Table 4: The deep convolutional spiking neural network architectures for the MNIST, N-MNIST and SVHN datasets
VGG9 ResNet9 ResNet11
Layer type Kernel size #o/p feature-maps Stride Layer type Kernel size #o/p feature-maps Stride Layer type Kernel size #o/p feature-maps Stride
Convolution 3×3×3 64 1 Convolution 3×3×3 64 1 Convolution 3×3×3 64 1
Convolution 64×3×3 64 1 Average-pooling 2×2 2 Average-pooling 2×2 2
Average-pooling 2×2 2
Convolution 64×3×3 128 1 Convolution 64×3×3 128 1 Convolution 64×3×3 128 1
Convolution 128×3×3 128 1 Convolution 128×3×3 128 1 Convolution 128×3×3 128 1
Average-pooling 2×2 2 Skip convolution 64×1×1 128 1 Skip convolution 64×1×1 128 1
Convolution 128×3×3 256 1 Convolution 128×3×3 256 1 Convolution 128×3×3 256 1
Convolution 256×3×3 256 1 Convolution 256×3×3 256 2 Convolution 256×3×3 256 2
Convolution 256×3×3 256 1 Skip connection 128×1×1 256 2 Skip convolution 128×1×1 256 2
Average-pooling 2×2 2
Convolution 256×3×3 512 1 Convolution 256×3×3 512 1
Convolution 512×3×3 512 2 Convolution 512×3×3 512 1
Skip convolution 256×1×1 512 2 Skip convolution 512×1×1 512 1
Convolution 512×3×3 512 1
Convolution 512×3×3 512 2
Skip convolution 512×1×1 512 2
Fully-connected 1024 Fully-connected 1024 Fully-connected 1024
Output 10 Output 10 Output 10
Table 5: The deep convolutional spiking neural network architectures for the CIFAR-10 dataset

3.1.3 ANN-SNN Conversion Scheme

As mentioned previously, off-line trained ANNs can be successfully converted to SNNs by replacing the ANN (ReLU) neurons with Integrate-and-Fire (IF) spiking neurons and adjusting the neuronal thresholds with respect to the synaptic weights. It is important to set the neuronal firing thresholds sufficiently high so that each spiking neuron can closely resemble the ANN activation without loss of information. In the literature, several methods have been proposed [4, 8, 15, 39, 36] for balancing the ratio between the neuronal thresholds and synaptic weights in the case of ANN-to-SNN conversion. In this paper, we compare various aspects of our direct-spike trained models with one recent work [39], which proposed a near-lossless ANN-to-SNN conversion scheme for deep network architectures. In brief, [39] balanced the neuronal firing thresholds with respect to the corresponding synaptic weights layer by layer, depending on the actual spiking activity of each layer measured using a subset of training samples. We compare our direct-spike trained models with converted SNNs of the same network architecture in terms of accuracy, inference speed and energy efficiency. Please note that there are a couple of differences in the network architecture and conversion technique between [39] and our scheme. First, [39] always uses average-pooling to reduce the size of the previous convolutional output feature map, whereas our models interchangeably use average-pooling or convolutional kernels with a stride of 2. Next, [39] considers only identity skip connections for residual SNNs, whereas we implement skip connections using either identity mappings or 1×1 convolutional kernels. Lastly, we use a lower (0.75) threshold for the average-pooling layers instead of 0.8 to ensure sufficient spike propagation in both the direct-trained and converted network models. Even in the case of the ANN-to-SNN conversion scheme, the lower average-pooling threshold provides slightly better classification performance than [39].

3.1.4 Spike Generation Scheme

For the static vision datasets (MNIST, SVHN and CIFAR-10), each input pixel intensity is converted to a stream of Poisson-distributed spike events with an equivalent firing rate. The Poisson input spikes are fed to the network over time. This rate-based spike encoding is used for a given period of time during both training and inference. For the color image datasets, we use the image pre-processing techniques of random cropping and horizontal flipping before generating the input spikes. The input pixels are normalized to have zero mean and unit standard deviation. Thereafter, we scale the pixel intensities to bound them in the range [-1,1] to represent the whole spectrum of input pixel representations. The normalized pixel intensities are converted to Poisson spike events such that the generated input signals are bipolar spikes. For the neuromorphic version of the dataset (N-MNIST), we use the original (unfiltered and uncentered) version of the spike streams to directly train and test the network in the time domain.
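The rate-based encoding described above can be sketched as follows for a normalized image, assuming pixel values already scaled to [-1, 1] and bipolar spikes whose polarity follows the pixel sign; the helper name and shapes are illustrative.

import torch

def poisson_spike_train(image, num_steps):
    """Convert a normalized image (values in [-1, 1]) into a bipolar Poisson spike train.

    image     : tensor of shape (C, H, W)
    num_steps : number of time-steps T
    Returns a tensor of shape (T, C, H, W) with values in {-1, 0, +1}.
    """
    rate = image.abs().clamp(0.0, 1.0)          # firing probability per time-step
    sign = torch.sign(image)                    # spike polarity follows the pixel sign
    rand = torch.rand(num_steps, *image.shape)  # independent draws at every time-step
    return (rand < rate).float() * sign

# Usage: spikes = poisson_spike_train(normalized_img, num_steps=100)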

3.1.5 Time-steps

As mentioned in section 3.1.4, we generate a stochastic Poisson spike train for each input pixel intensity for event-driven operation. The duration of the spike train is very important for SNNs. We measure the length of the spike train (spike time window) in time-steps. For example, a 100 time-step spike train will have approximately 50 random spikes if the corresponding pixel intensity is 0.5 in a range of [0,1]. If the number of time-steps (spike time window) is too small, then the SNN will not receive enough information for training or inference. On the other hand, if the number of time-steps is too large, then the latency will be high and the spike stream will behave more like a deterministic input. Hence, the stochastic property of SNNs will be lost, the inference will become too slow, and the network will not have much energy-efficiency advantage over an ANN implementation. For these reasons, we experimented with different numbers of time-steps to empirically obtain the optimal number of time-steps required for both training and inference. The experimental process and results are explained in the following subsections.

Optimal #time-steps for Training

A spike event can only represent 0 or 1 in each time-step, so its bit precision is usually considered to be 1. However, the spike train provides temporal data, which is an additional source of information. Therefore, the spike train length (number of time-steps) in an SNN can be considered as the actual precision of the neuronal activation. To obtain the optimal number of time-steps required for our proposed training method, we trained a VGG9 network on the CIFAR-10 dataset using different numbers of time-steps ranging from 10 to 120 (shown in figure 4a). We found that with only 10 time-steps, the network is unable to learn anything, as there is not enough information (the input precision is too low) for the network to learn. This phenomenon is explained by the lack of spikes in the final output. With the initial weights, the accumulated sum of the LIF neurons is not enough to generate output spikes in the latter layers. Hence, none of the input spikes propagate to the final output neurons and the output distributions remain 0. Therefore, the computed gradients are always 0 and the network is not updated. For 20-30 time-steps, some input spikes are able to reach the final layer, hence the network starts to learn but does not converge. For 35-50 time-steps, the network learns well and converges to a reasonable point. From 70 time-steps, the network accuracy starts to saturate, and at about 100 time-steps the training improvement saturates completely. This is consistent with the bit precision of the inputs. It has been shown in [38] that 8-bit inputs and activations are sufficient to achieve optimal network performance for standard image recognition tasks. Ideally, we need 128 time-steps to represent 8-bit inputs using bipolar spikes. However, 100 time-steps proved to be sufficient, as more time-steps provide only marginal improvement. We observe a similar trend in the VGG7, ResNet7, ResNet9 and ResNet11 SNNs as well, while training for the SVHN and CIFAR-10 datasets. Therefore, we considered 100 time-steps as the optimal number of time-steps for training in our proposed methodology. Moreover, for the MNIST dataset, we used 50 time-steps since the required bit precision is only 4 bits [38].

Figure 4: Inference performance variation due to (a) #Training-Timesteps and (b) #Inference-Timesteps. T# in (a) indicates number of time-steps used for training.

Optimal #time-steps for Inference

To obtain the optimal number of time-steps required for inferring an image with a network trained using our proposed method, we conducted experiments similar to those described in section 3.1. We first trained a VGG9 network on the CIFAR-10 dataset using 100 time-steps (optimal according to the experiments in section 3.1). Then, we tested the network performance with different numbers of time-steps ranging from 10 to 4000 (shown in figure 4b). We observed that the network performs very well even with only 10 time-steps, while the peak performance occurs around 100 time-steps. For more than 100 time-steps, the accuracy degrades slightly from the peak. This behavior is very different from that of ANN-SNN converted networks, where the accuracy keeps improving as the number of time-steps is increased (shown in figure 4b). This can be attributed to the fact that our proposed spike-based training method incorporates the temporal information into the network training procedure, so that the trained network is tailored to perform best at a specific spike time window during inference. On the other hand, the ANN-SNN conversion schemes are unable to incorporate the temporal information of the input into the trained network and are therefore heavily dependent on the deterministic behavior of the input. Hence, the ANN-SNN conversion schemes require a much higher number of time-steps for inference in order to resemble ANN-like input-output mappings.

3.2 Results

In this section, we analyze the classification performance and efficiency achieved by the proposed spike-based training methodology for deep convolutional SNNs, compared to the performance of SNNs obtained using the ANN-to-SNN conversion scheme.

3.2.1 The Classification Performance

Most of the classification performances available in the literature for SNNs are for the MNIST and CIFAR-10 datasets. The popular methods for SNN training are 'Spike-Timing-Dependent Plasticity (STDP)' based unsupervised learning [7, 50, 3, 42, 43] and 'spike-based backpropagation' based supervised learning [25, 17, 49, 31, 30]. There are a few works [45, 18, 46, 23] which tried to combine the two approaches to get the best of both worlds. However, these training methods were able neither to train deep SNNs nor to achieve inference performance comparable to ANN implementations. Hence, ANN-SNN conversion schemes have been explored by researchers [4, 8, 15, 39, 36]. To date, ANN-SNN conversion schemes have achieved the best inference performance on the CIFAR-10 dataset using deep networks [39, 36]. The classification performances of all these works are listed in table 6 along with ours. To the best of our knowledge, we achieve the best inference accuracy on MNIST using a LeNet-structured network. We also achieve accuracy comparable with the ANN-SNN converted network [39] on the CIFAR-10 dataset using much smaller network models, while surpassing all other SNN training methods.

Model Learning Method Accuracy (MNIST) Accuracy (N-MNIST) Accuracy (CIFAR-10)
Hunsberger et al. [15] Offline learning, conversion 98.37% - 82.95%
Esser et al. [9] Offline learning, conversion - - 89.32%
Diehl et al. [8] Offline learning, conversion 99.10% - -
Rueckauer et al. [36] Offline learning, conversion 99.44% - 88.82%
Sengupta et al. [39] Offline learning, conversion - - 91.55%
Kheradpisheh et al. [18] Layerwise STDP + offline SVM classifier 98.40% - -
Panda et al. [34] Spike-based autoencoder 99.08% - 70.16%
Lee et al. [25] Spike-based BP 99.31% 98.74% -
Wu et al. [49] Spike-based BP 99.42% 98.78% 50.70%
Lee et al. [23] STDP-based pretraining + spike-based BP 99.28% - -
Jin et al. [17] Spike-based BP 99.49% 98.88% -
Wu et al. [48] Spike-based BP - 99.53% 90.53%
This work Spike-based BP 99.59% 99.09% 90.95%
Table 6: Comparison of SNN classification accuracies on the MNIST, N-MNIST and CIFAR-10 datasets.

For a more extensive comparison, we compare the inference performance of networks trained using our proposed methodology with state-of-the-art ANNs and the ANN-SNN conversion scheme, for the same network configurations (depth and structure), side by side in table 7. We also compare with the previous best SNN training results found in the literature, which may or may not have the same network depth and structure as ours. The ANN-SNN conversion scheme is a modified and improved version of [39]. We use this modified scheme since it achieves better conversion performance than [39], as explained in section 3.1.3. Note that all reported classification accuracies are the average of the maximum inference accuracies over 3 independent runs with different seeds.

After initializing the weights, we train the SNNs using the spike-based BP algorithm in an end-to-end manner with Poisson spike train inputs. Our evaluation on the MNIST dataset yields a classification accuracy of 99.59%, which is the best among SNN training schemes and matches our ANN-SNN conversion scheme. We achieve ~96% inference accuracy on the SVHN dataset for both the trained non-residual and residual SNNs, which is very close to the state-of-the-art ANN implementation. Inference performance for SNNs trained on the SVHN dataset has not been reported previously in the literature.

We implemented three different networks, as shown in table 5, for classifying the CIFAR-10 dataset using the proposed spike-based BP algorithm. For the VGG9 network, the ANN-SNN conversion scheme provides a near-lossless converted network compared to the baseline ANN implementation, while our proposed training method yields a classification accuracy of 90.45%. For the ResNet9 network, the ANN-SNN conversion scheme provides inference accuracy within 3% of the baseline ANN implementation, whereas our proposed spike-based training method achieves better inference accuracy, within ~1.5% of the baseline ANN implementation. In the case of ResNet11, we observe that the inference accuracy improvement over ResNet9 is marginal for the baseline ANN implementation, whereas the ANN-SNN conversion scheme and the proposed SNN training show an improvement of ~0.5% for ResNet11 compared to ResNet9. Overall, for the ResNet networks, our proposed training method achieves better inference accuracy than the ANN-SNN conversion scheme.

Inference Accuracy (%)
Dataset Model ANN ANN-SNN SNN [Previous Best] SNN [This Work]
MNIST LeNet 99.57 99.59 99.49 [17] 99.59
N-MNIST LeNet - - 99.53 [48] 99.09
SVHN VGG7 96.36 96.30 - 96.06
SVHN ResNet7 96.43 95.93 - 96.21
CIFAR-10 VGG9 91.98 92.01 90.53 [48] 90.45
CIFAR-10 ResNet9 91.85 89.00 - 90.35
CIFAR-10 ResNet11 91.87 90.15 - 90.95
Table 7: Comparison of Classification Performance

3.2.2 Accuracy Improvement with Network Depth

One of the major drawbacks of STDP based unsupervised learning for SNNs is that it is very difficult to train beyond 2 convolutional layers [18, 24]. Therefore, researchers are leaning more towards backpropagation based supervised learning for deep SNNs.

In order to analyze the effect of network depth on direct-spike trained SNNs, we experimented with networks of different depths while training for the SVHN and CIFAR-10 datasets. For the SVHN dataset, we started with a small network derived from the LeNet5 model [22] with 2 convolutional and 2 fully-connected layers. This network was able to achieve an inference accuracy of only 92.38%. Then, we increased the network depth by adding 1 convolutional layer before the 2 fully-connected layers, and we termed this network VGG5. The VGG5 network achieved a significant improvement over its predecessor. Similarly, we tried VGG6 followed by VGG7, and the improvement started to become very small. We also trained ResNet7 to understand how residual networks perform compared to non-residual networks of similar depth. The results of these experiments are shown in figure 5a. We carried out similar experiments for the CIFAR-10 dataset, and the results show a similar trend (figure 5b). These results confirm that network depth improves the learning capacity of direct-spike trained SNNs, as it does for ANNs. The non-residual networks saturate at a certain depth and start to degrade if the network depth is further increased (VGG11 in figure 5b) due to the degradation problem mentioned in [13]. In such a scenario, the residual connections in deep residual ANNs allow the network to maintain peak classification accuracy by utilizing the skip connections [13], as seen in figure 5b (ResNet9 and ResNet11).

Figure 5: Accuracy Improvement with Network Depth for (a) SVHN dataset and (b) CIFAR-10 dataset.

4 Discussion

4.1 Comparison with Relevant works

In this section, we compare our proposed supervised learning algorithm with other recent spike-based BP algorithms. Spike-based learning rules focus on directly training and testing SNNs with spike trains, so that no conversion is necessary for application in real-world spiking scenarios. In recent years, there has been an increasing number of supervised gradient descent methods for spike-based learning. [34] developed a spike-based auto-encoder mechanism to train deep convolutional SNNs. They dealt with the membrane potential as a differentiable signal and showed recognition capabilities on standard vision tasks (MNIST and CIFAR-10 datasets). [25] followed a similar approach to explore a spike-based BP algorithm in an end-to-end manner. In addition, [25] presented an error normalization scheme to prevent the exploding gradient phenomenon when training deep SNNs. [17] proposed a hybrid macro/micro level backpropagation (HM2-BP) algorithm, developed to capture the temporal effect of individual spikes (at the micro level) and the rate-encoded error (at the macro level). In the temporal encoding domain, [30] proposed an interesting temporal spike-based BP algorithm by treating the spike time as the differentiable activation of a neuron. Temporal-encoding based SNNs have the potential to process spatio-temporal spike patterns with a small number of spikes. All of these works demonstrated spike-based learning in simple network architectures and have a large gap in classification accuracy compared to deep ANNs. More recently, [48] presented a neuron normalization technique (called NeuNorm) that calculates the average input firing rates to adjust neuron selectivity. NeuNorm enables spike-based training within a relatively short time window while achieving competitive performance. In addition, they presented an input encoding scheme that receives both spike and non-spike signals to preserve the precision of the input data.

There are several points that distinguish our work from the others. First, we derive a differentiable (but approximate) activation of a LIF neuron given the measured neuronal outputs (as defined in equation (3)). The activation of a LIF neuron is formulated as a 'low-pass filtered output signal', which is the accumulation of leaky output spikes over time. In the backpropagation phase, the defined activation of a LIF neuron enables us to calculate the neuronal pseudo-derivative while accounting for the leaky behavior (as explained in equation (9)). Note that the leaky component has a high impact on the dynamics of the LIF spiking neuron. It is worth mentioning that the better approximation of the 'discrete LIF' neuronal activation function enables our network to achieve better performance than the other methods in the literature. Next, we construct our networks by leveraging state-of-the-art deep architectures such as VGG [41] and ResNet [13]. To the best of our knowledge, this is the first work that demonstrates spike-based supervised BP learning for SNNs containing more than 10 trainable layers. Our deep SNNs obtain superior classification accuracies on the MNIST, SVHN and CIFAR-10 datasets in comparison to other networks trained with spike-based algorithms. Moreover, we present a network parameter (i.e. weight and threshold) initialization scheme for a variety of deep SNN architectures. In the experiments, we show that the proposed initialization scheme appropriately initializes the deep SNNs, facilitating training convergence for a given network architecture and training strategy. In addition, as opposed to the complex error and neuron normalization methods adopted by [25] and [48], respectively, we demonstrate that deep SNNs can be naturally trained by only considering the spiking activities of the system. As a result, our work paves an effective way for training deep SNNs with the spike-based BP algorithm.

4.2 Spike Activity Analysis

The most important advantage of event-driven operation of neural networks is that the events are very sparse in nature. To verify this claim, we analyzed the spiking activities of the direct-spike trained SNNs and ANN-SNN converted networks in the following subsections.

4.2.1 Spike Activity per Layer

The layer-wise spike activities of both the SNN trained using our proposed methodology and the ANN-SNN converted network are shown in figure 6a for VGG9 and figure 6b for ResNet9. In the case of ResNet9, only the output spike activity of the first average-pooling layer is shown, since in the direct-spike trained SNN the remaining spatial-pooling is performed by convolutions with stride 2. In figure 6, it can be seen that the input layer has the highest spike activity, significantly higher than any other layer, and that the spike activity reduces significantly as the network depth increases.

Figure 6: Layer-wise spike activity in direct-spike trained SNN and ANN-SNN converted network for CIFAR-10 dataset: (a) VGG9 (b) ResNet9 network. The spike activity is normalized with respect to the input layer spike activity which is same for both networks.

We can observe from figures 6a and 6b that the average spike activity in the direct-spike trained SNN is much higher than in the ANN-SNN converted network. The ANN-SNN converted network uses thresholds higher than 1 (the value used in the direct-spike trained SNN), since the conversion scheme applies layer-wise neuronal threshold modulation. This higher threshold reduces the spike activity in ANN-SNN converted networks. However, in both cases, the spike activity decreases with increasing network depth.

Dataset    Model     #Spikes/image (ANN-SNN)   #Spikes/image (SNN)   Spike Efficiency (ANN-SNN / SNN)
MNIST      LeNet     29094                     55212                 0.53x
                     73085                     55212                 1.32x
SVHN       VGG7      10251782                  5564306               1.84x
                     16615596                  5564306               2.99x
           ResNet7   20607244                  4656760               4.43x
CIFAR-10   VGG9      2226732                   1240492               1.80x
                     9647563                   1240492               7.78x
           ResNet9   8745271                   4319988               2.02x
           ResNet11  8116343                   1531985               5.30x

Table 8: #Spikes per image inference. For LeNet and the VGG networks, the first row corresponds to the iso-accuracy condition and the second row to the maximum-accuracy condition; for the ResNet networks, only the maximum-accuracy comparison is reported.

4.2.2 #Spikes/Inference

From figure 6, it is evident that the average spike activity of the ANN-SNN converted networks is much lower than that of the SNN trained with our proposed methodology. For inference, however, the network has to be evaluated over a number of time-steps. Therefore, to quantify the actual spike activity of an inference operation, we measured the average number of spikes required to infer one image: we counted the number of spikes generated (including input spikes) while classifying the test set of a dataset for a specific number of time-steps, and averaged the count to obtain ‘#spikes per image inference’. For the ANN-SNN converted VGG networks we used two different numbers of time-steps: one for an iso-accuracy comparison and one for a maximum-accuracy comparison with the direct-spike trained SNNs. Iso-accuracy inference requires fewer time-steps than maximum-accuracy inference and hence fewer spikes per image inference. For the ResNet networks, the ANN-SNN conversion scheme always yields lower accuracy than the SNN trained with the proposed algorithm, so we compare spikes per image inference only under the maximum-accuracy condition. The spike efficiency (the reduction in #spikes) follows directly from the #spikes/image inference. The results are listed in table 8, where for each network the first row corresponds to the iso-accuracy condition and the second row to the maximum-accuracy condition.
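The following sketch illustrates how ‘#spikes per image inference’ can be measured. It assumes a hypothetical interface in which model(x_t) returns the binary spike tensors emitted by each layer at one time-step and encoder converts an image batch into a spike train; it is not the instrumentation used in our experiments.

```python
import torch

def spikes_per_image(model, data_loader, num_steps, encoder):
    """Average total spike count (input spikes included) per inferred image."""
    total_spikes, total_images = 0.0, 0
    with torch.no_grad():
        for images, _ in data_loader:
            spike_train = encoder(images, num_steps)   # [num_steps, B, C, H, W]
            for t in range(num_steps):
                x_t = spike_train[t]
                total_spikes += x_t.sum().item()        # input spikes
                layer_outputs = model(x_t)              # list of binary spike tensors
                total_spikes += sum(s.sum().item() for s in layer_outputs)
            total_images += images.size(0)
    return total_spikes / total_images

# Spike efficiency as reported in table 8 is the ratio of two such estimates,
# e.g. spikes_per_image(converted_net, ...) / spikes_per_image(direct_snn, ...).
```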

Figure 7: Comparison of accuracy, latency and #spikes/inference for the ResNet11 architecture.

Figure 7 shows the relationship between inference accuracy, latency and #spikes/inference for the ResNet11 network trained on the CIFAR-10 dataset. At any given latency, #spikes/inference is higher for the direct-spike trained SNN than for the ANN-SNN converted network. However, the SNN trained with spike-based BP requires only 100 time-steps to reach maximum inference accuracy, whereas the ANN-SNN converted network requires about 3000 time-steps to reach its maximum accuracy (which is slightly lower than that of the direct-spike trained SNN). Hence, under the maximum-accuracy condition, the direct-spike trained SNN requires far fewer #spikes/inference than the ANN-SNN converted network while achieving similar accuracy.

4.3 Inference Speedup

The time required for inference is almost linearly proportional to the number of time-steps (figure 7). Hence, we can also quantify the inference speedup of the direct-spike trained SNN over the ANN-SNN converted network from the number of time-steps required for inference, as shown in table 9. For the VGG9 network, we achieve an 8x speedup at iso-accuracy and up to a 36x speedup at maximum accuracy. Similarly, for the ResNet networks we achieve up to 25x-30x speedup in inference.

Dataset    Model     Time-steps (ANN-SNN)   Time-steps (SNN)   Inference Speedup
MNIST      LeNet     200                    50                 4x
                     500                    50                 10x
SVHN       VGG7      1600                   100                16x
                     2600                   100                26x
           ResNet7   2500                   100                25x
CIFAR-10   VGG9      800                    100                8x
                     3600                   100                36x
           ResNet9   3000                   100                30x
           ResNet11  3000                   100                30x

Table 9: Inference speedup of the direct-spike trained SNN over the ANN-SNN converted network. For LeNet and the VGG networks, the first row corresponds to the iso-accuracy condition and the second row to the maximum-accuracy condition; for the ResNet networks, only the maximum-accuracy comparison is reported.

4.4 Complexity Reduction

Deep ANNs place extraordinary computational demands on the underlying hardware. SNNs can mitigate this burden by enabling efficient event-driven computation. To compare the computational complexity of the two, we first need to understand their operating principles. An ANN infers the category of a particular input with a single feed-forward pass per image. For the same task, a spiking network must be evaluated over a number of time-steps. If conventional hardware is used for both, the SNN would therefore incur a computational cost hundreds to thousands of times higher than the ANN. However, specialized hardware that exploits event-driven operation ‘computes only when required’; SNNs can leverage such hardware to carry out inference in the spiking domain far more efficiently than an ANN. Moreover, for deep SNNs we have observed that sparsity increases with network depth, so the benefit from event-driven hardware is expected to grow as the network gets deeper.

An accurate estimate of the energy consumption of SNNs relative to ANNs is outside the scope of this work. However, we can gain insight by assigning an energy cost to each synaptic operation and comparing the number of synaptic operations performed by the ANN, the SNN trained with our proposed algorithm, and the ANN-SNN converted network. The number of synaptic operations per layer can be estimated from the structure of the convolutional and linear layers. In an ANN, each synaptic operation is a multiply-accumulate (MAC) computation, whereas specialized SNN hardware performs only an accumulate (AC) computation per synaptic operation, and only when an incoming spike is received. Hence, the total number of AC operations in an SNN can be estimated as the layer-wise product of the average spike count of a layer and its number of synaptic connections, summed over layers and multiplied by the number of time-steps needed for one image inference. Based on this, we estimated the total number of MAC operations for the ANN, and the total number of AC operations for the direct-spike trained SNN and the ANN-SNN converted network, for VGG9, ResNet9 and ResNet11. The ratio of ANN-SNN converted AC operations to direct-spike trained SNN AC operations to ANN MAC operations is 28.18:3.61:1 for VGG9, 11.94:5.06:1 for ResNet9 and 7.26:2.09:1 for ResNet11 (under the maximum-accuracy condition).
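A minimal sketch of this operation-count estimate is shown below. It assumes per-layer statistics (average spike counts and synaptic connection counts) are already available; the exact bookkeeping of connections per spike is an assumption, not the accounting code used for the reported ratios.

```python
def ann_mac_ops(neurons_per_layer, fanin_per_layer):
    """#MACs for one ANN forward pass: one multiply-accumulate per neuron
    per incoming synaptic connection."""
    return sum(n * f for n, f in zip(neurons_per_layer, fanin_per_layer))


def snn_ac_ops(avg_spikes_per_layer, connections_per_spike, num_steps):
    """Rough #ACs for one SNN inference: layer-wise product of the average
    spike count and the synaptic connections driven by each spike, summed
    over layers and scaled by the number of time-steps."""
    per_step = sum(s * c for s, c in zip(avg_spikes_per_layer, connections_per_spike))
    return per_step * num_steps
```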

Figure 8: Inference computation complexity comparison between the ANN, the ANN-SNN converted network, and the SNN trained with spike-based backpropagation. The ANN computational complexity is used as the baseline for normalization.

However, a MAC operation typically consumes an order of magnitude more energy than an AC operation. For instance, according to [11], a 32-bit floating-point MAC consumes 4.6 pJ and a 32-bit floating-point AC consumes 0.9 pJ in a 45 nm technology node, so one synaptic operation in an ANN costs roughly as much energy as five synaptic operations in an SNN. Moreover, 32-bit floating-point computation can be replaced by fixed-point computation using integer MAC and AC units with almost no loss of accuracy [27]. A 32-bit integer MAC consumes roughly 3.2 pJ, while a 32-bit integer AC consumes only 0.1 pJ in the 45 nm node. Under these assumptions, our calculations indicate that for the VGG9 architecture the SNN trained with the proposed method is 7.81x and 8.87x more computationally energy-efficient than the ANN-SNN converted network and the ANN, respectively. Similarly, we gain 3.47x (2.36x) and 15.32x (6.32x) energy efficiency for the ResNet11 (ResNet9) network relative to the ANN-SNN converted network and the ANN, respectively. Figure 8 shows the reduction in computational complexity of ANN-SNN conversion and of the SNN trained with the proposed methodology compared to the ANN. It is worth noting that, since the sparsity of the spike signals increases with network depth, the energy efficiency of both the ANN-SNN converted network [39] and the SNN trained with the proposed methodology is expected to increase almost exponentially with depth relative to an ANN implementation. Hence, network depth is the key factor for achieving significant energy-efficiency gains with event-driven SNNs over ANNs.
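As a worked check of the VGG9 numbers, the per-operation energies quoted above combined with the 28.18:3.61:1 operation ratio reproduce the reported efficiency factors (up to rounding of the operation counts):

```python
# Energy per operation (pJ) in a 45 nm node, as quoted in the text.
E_MAC_INT32 = 3.2    # 32-bit integer multiply-accumulate
E_AC_INT32 = 0.1     # 32-bit integer accumulate

# Operation counts for VGG9, normalized to the ANN MAC count (28.18 : 3.61 : 1).
energy_ann = 1.0 * E_MAC_INT32         # 3.20
energy_converted = 28.18 * E_AC_INT32  # 2.818
energy_snn = 3.61 * E_AC_INT32         # 0.361

print(round(energy_ann / energy_snn, 2))        # 8.86, ~8.87x vs. the ANN
print(round(energy_converted / energy_snn, 2))  # 7.81x vs. the converted network
```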

The training complexity of the proposed methodology can also be estimated easily, since it follows the steps of standard ANN backpropagation. The training effort consists of two costs: (i) forward propagation and (ii) backward propagation plus weight update. Using a simple cost-estimation model based on the number of synaptic operations, we observe that for an ANN the backward propagation and weight update cost is roughly 2x the forward propagation cost for a mini-batch size of one (chosen for simplicity). To obtain an ANN-SNN converted network, an ANN is first trained with standard backpropagation and then neuronal threshold modulation is applied to convert it into an SNN. The threshold modulation cost can be neglected since it is a one-time cost whereas training spans many epochs, so the overall training complexity of ANN-SNN conversion is similar to that of the ANN. In our spike-based backpropagation scheme, on the other hand, we back-propagate only once per mini-batch iteration, which is computationally equivalent to backpropagation in an ANN, so the backward propagation and weight update cost of the proposed methodology is also the same as for an ANN. Combining this with the forward propagation cost described earlier, we can estimate the total computational cost of training. Our estimate shows that, for a mini-batch size of one, the proposed training methodology is ~1.4x more computationally energy-efficient than both the ANN and the ANN-SNN conversion approach.
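A back-of-the-envelope version of this training-cost estimate is sketched below, using the 45 nm per-operation energies and the VGG9 forward-pass operation ratio of 3.61 from the previous subsection. Treating the backward-plus-update cost as 2x the forward MAC cost and reusing the inference spike ratio for the training forward pass are simplifying assumptions, not the paper's exact accounting.

```python
E_MAC, E_AC = 3.2, 0.1       # pJ per 32-bit integer MAC / AC (45 nm)
snn_fwd_ratio = 3.61         # SNN forward ACs relative to ANN forward MACs (VGG9)

ann_forward = 1.0 * E_MAC                   # normalized to one MAC per synapse
ann_backward = 2.0 * ann_forward            # backward + weight update ~ 2x forward
ann_training = ann_forward + ann_backward   # also ~ the ANN-SNN conversion training cost

snn_forward = snn_fwd_ratio * E_AC          # sparse, event-driven forward pass
snn_backward = ann_backward                 # one backward pass per mini-batch, as in an ANN
snn_training = snn_forward + snn_backward

print(round(ann_training / snn_training, 2))  # ~1.42, consistent with the ~1.4x estimate
```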

5 Conclusion

In this work, we proposed a spike-based backpropagation training methodology for state-of-the-art deep SNN architectures. This methodology enables training deep SNNs directly in the spiking domain while achieving competitive inference accuracies on standard image recognition tasks. Our experiments demonstrate the effectiveness of the proposed learning strategy on deeper SNNs (7-11 layer VGG and ResNet architectures), achieving the best classification accuracies reported to date on the MNIST, SVHN and CIFAR-10 datasets among networks trained with spike-based learning. The proposed methodology substantially reduces the accuracy gap between ANNs and SNNs. By exploiting our training methodology and deploying the trained SNNs on neuromorphic hardware, inference is estimated to be 6.32x-15.32x more energy-efficient than the corresponding ANNs and 2.36x-7.81x more energy-efficient than ANN-SNN converted networks. Moreover, the trained deep SNNs infer 8x-36x faster than ANN-SNN converted networks.

Acknowledgement

This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the National Science Foundation, Intel Corporation, the DoD Vannevar Bush Fellowship and the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

  • [1] G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv preprint arXiv:1803.09574, 2018.
  • [2] S. M. Bohte, J. N. Kok, and H. La Poutre. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1-4):17–37, 2002.
  • [3] J. M. Brader, W. Senn, and S. Fusi. Learning real-world stimuli in a neural network with spike-driven synaptic dynamics. Neural computation, 19(11):2881–2912, 2007.
  • [4] Y. Cao, Y. Chen, and D. Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015.
  • [5] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
  • [6] P. Dayan and L. F. Abbott. Theoretical neuroscience, volume 806. Cambridge, MA: MIT Press, 2001.
  • [7] P. U. Diehl and M. Cook. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9:99, 2015.
  • [8] P. U. Diehl, G. Zarrella, A. Cassidy, B. U. Pedroni, and E. Neftci. Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In Rebooting Computing (ICRC), IEEE International Conference on, pages 1–8. IEEE, 2016.
  • [9] S. Esser, P. Merolla, J. Arthur, A. Cassidy, R. Appuswamy, A. Andreopoulos, D. Berg, J. McKinstry, T. Melano, D. Barch, et al. Convolutional networks for fast, energy-efficient neuromorphic computing. arXiv preprint, http://arxiv.org/abs/1603.08270, 2016.
  • [10] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown. Overview of the spinnaker system architecture. IEEE Transactions on Computers, 62(12):2454–2467, 2013.
  • [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] D. Huh and T. J. Sejnowski. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pages 1440–1450, 2018.
  • [15] E. Hunsberger and C. Eliasmith. Spiking deep networks with lif neurons. arXiv preprint arXiv:1510.08829, 2015.
  • [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [17] Y. Jin, P. Li, and W. Zhang. Hybrid macro/micro level backpropagation for training deep spiking neural networks. arXiv preprint arXiv:1805.07866, 2018.
  • [18] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe, and T. Masquelier. Stdp-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421, 2016.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [23] C. Lee, P. Panda, G. Srinivasan, and K. Roy. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in Neuroscience, 12:435, 2018.
  • [24] C. Lee, G. Srinivasan, P. Panda, and K. Roy. Deep spiking convolutional neural network trained with unsupervised spike timing dependent plasticity. IEEE Transactions on Cognitive and Developmental Systems, 2018.
  • [25] J. H. Lee, T. Delbruck, and M. Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience, 10:508, 2016.
  • [26] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE journal of solid-state circuits, 43(2):566–576, 2008.
  • [27] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
  • [28] W. Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997.
  • [29] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
  • [30] H. Mostafa. Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems, 2017.
  • [31] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324, 2017.
  • [32] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [33] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.
  • [34] P. Panda and K. Roy. Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 299–306. IEEE, 2016.
  • [35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [36] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.
  • [37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
  • [38] S. S. Sarwar, G. Srinivasan, B. Han, P. Wijesinghe, A. Jaiswal, P. Panda, A. Raghunathan, and K. Roy. Energy efficient neural computing: A study of cross-layer approximations. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018.
  • [39] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy. Going deeper in spiking neural networks: Vgg and residual architectures. arXiv preprint arXiv:1802.02627, 2018.
  • [40] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [42] G. Srinivasan, P. Panda, and K. Roy. Spilinc: Spiking liquid-ensemble computing for unsupervised speech and image recognition. Frontiers in Neuroscience, 12:524, 2018.
  • [43] G. Srinivasan, P. Panda, and K. Roy. Stdp-based unsupervised feature learning using convolution-over-time in spiking neural networks for energy-efficient neuromorphic computing. ACM Journal on Emerging Technologies in Computing Systems (JETC), 14(4):44, 2018.
  • [44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [45] A. Tavanaei and A. S. Maida. Bio-inspired spiking convolutional neural network using layer-wise sparse coding and stdp learning. arXiv preprint arXiv:1611.03000, 2016.
  • [46] A. Tavanaei and A. S. Maida. Multi-layer unsupervised learning in a spiking convolutional neural network. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 2023–2030. IEEE, 2017.
  • [47] P. J. Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  • [48] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi. Direct training for spiking neural networks: Faster, larger, better. arXiv preprint arXiv:1809.05793, 2018.
  • [49] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12, 2018.
  • [50] B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang. Feedforward categorization on aer motion events using cortex-like features in a spiking neural network. IEEE transactions on neural networks and learning systems, 26(9):1963–1978, 2015.