The source code for the paper entitled "A Tandem Learning Rule for Effective Training and Rapid Inference of Deep Spiking Neural Networks"
Emerging neuromorphic computing (NC) architectures have shown compelling energy efficiency in performing machine learning tasks with spiking neural networks (SNNs). However, due to the non-differentiable nature of spike generation, the standard error backpropagation algorithm is not directly applicable to SNNs. In this work, we propose a novel learning rule based on a hybrid neural network with shared weights, wherein a rate-based SNN is used during forward propagation to determine precise spike counts and spike trains, and an equivalent ANN is used during error backpropagation to approximate the gradients for the coupled SNN. SNNs trained with the proposed learning rule have demonstrated competitive classification accuracies on the CIFAR-10 and IMAGENET-2012 datasets, with significant savings in inference time and total synaptic operations compared to other state-of-the-art SNN implementations. The proposed learning rule offers an intriguing solution for enabling on-chip computing on pervasive mobile and embedded devices with limited computational budgets.
Driven by the availability of large-scale labeled training data, high-performance computing resources and effective deep neural network architectures, deep learning has made spectacular achievements in computer vision [1, 2], speech processing [3, 4], language understanding and robotics. Notwithstanding their remarkable computational capabilities, these deep neural network models are computationally intensive and memory inefficient, making it challenging to deploy them onto pervasive mobile and Internet-of-Things (IoT) devices with limited computational budgets. Moreover, ever-growing model complexity, computational demands and concerns about information security motivate novel energy-efficient solutions.
Human brains, shaped by millions of years of evolution, are incredibly efficient at performing complex perceptual and cognitive tasks. Although hierarchically organized deep neural network models are brain-inspired, they differ significantly from biological brains in many ways. Fundamentally, information in the brain is represented and communicated through asynchronous action potentials, or spikes. To process the information carried by these spike trains efficiently and rapidly, biological neural systems adopt an event-driven computation strategy, whereby energy is mostly consumed during spike generation and communication. Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic such asynchronous event-driven information processing with spiking neural networks (SNNs) in silicon. Novel neuromorphic computing architectures, for instance TrueNorth and Loihi, leverage low-power, densely connected parallel computing units to support spike-based computation. Furthermore, colocated memory and computation can effectively mitigate the problem of low bandwidth between the CPU and memory (i.e., the von Neumann bottleneck). When implemented on these neuromorphic architectures, competitive classification accuracies can be achieved with high throughput and compelling energy efficiency. Therefore, integrating the algorithmic power of deep learning with the unprecedented efficiency of neuromorphic computing architectures offers an intriguing solution for intelligent embedded devices, and represents an important milestone towards future brain-inspired computing machines.
While neuromorphic computing architectures offer attractive energy savings, how to train large-scale deep SNNs remains a challenging research problem. Due to the asynchronous and discontinuous nature of synaptic operations within the SNN, the error backpropagation algorithm that is widely used for ANN training is not directly applicable to the SNN.
To overcome this, differentiable proxies have been employed to enable the powerful error backpropagation algorithm with discrete spikes; examples include the membrane potential [13, 14, 15, 16], the timing of the first spike and spike statistics. Additional research efforts have been devoted to training constrained ANNs that approximate the properties of spiking neurons and then mapping the trained weights to the SNN [19, 12, 20, 21, 22]. Although competitive accuracies were demonstrated with both approaches on the MNIST and CIFAR-10 datasets and their neuromorphic versions [24, 25], how to scale these learning rules up to the size of state-of-the-art deep ANNs remains elusive. In addition, the temporal credit assignment performed by these spike-based learning rules is memory and computationally inefficient when sensory inputs are rate encoded, in which case spike timing carries negligible additional information.
Another vein of research in deep SNN learning rules involves the conversion of pre-trained ANNs to SNNs with the same network architecture [27, 28, 29, 30, 31]. This indirect training approach assumes that the graded activation of analog neurons is equivalent to the average firing rate of spiking neurons, and simply requires parsing and normalizing the weights after training the ANNs. Notably, Rueckauer et al. provide a theoretical analysis of the performance deviation of such an approach as well as a systematic study of frequently used layers in the CNN [29, 30, 31]. This conversion approach achieves the best-reported results for SNNs on many benchmark datasets, including the challenging ImageNet dataset. Nevertheless, the trade-off between latency and accuracy has been identified as the main shortcoming of such an approach, requiring additional techniques to improve latency and power efficiency.
The biologically plausible Hebbian learning rules and spike-timing-dependent plasticity (STDP) [35, 36] represent another class of local learning rules that are particularly interesting for computational neuroscience studies and for hardware implementations with emerging non-volatile memory devices. It remains challenging, however, to apply them to large-scale machine learning tasks due to ineffective task-specific credit assignment. Interested readers may refer to the review articles [38, 39] for a systematic review of recent progress in deep SNN learning rules and applications.
In this paper, to effectively process rate-coded sensory inputs, we propose a novel learning rule based on a hybrid neural network with shared weights, wherein a rate-based SNN is used for forward propagation to determine precise spike counts and spike trains, and an equivalent ANN is used for error backpropagation to approximate the gradients at each coupled SNN layer. The deep SNNs trained with the proposed learning rule achieve classification accuracies competitive with the baseline ANNs and other SNN implementations for image classification on CIFAR-10 and IMAGENET-2012. Furthermore, compared to other available SNN learning rules, the proposed learning rule supports rapid inference, with order-of-magnitude time savings and significantly reduced synaptic operations on machine learning tasks.
The rest of this paper is organized as follows: in Section II, we present the proposed learning rule within the hybrid networks. In Section III, we investigate why the proposed learning rule can learn effectively by comparing the high dimensional geometry of activation values and weight-activation dot products between the coupled ANN and SNN network layers. Furthermore, we evaluate the proposed learning rule on the CIFAR-10 and IMAGENET-2012 datasets by comparing classification accuracies, inference speed and energy efficiency to other SNN implementations. Finally, we conclude the paper in Section IV.
In this section, we first introduce the coding schemes and neuron models that are employed in this work. We then review and explain the spike count mismatch problem when mapping the constrained ANN weights to the rate-based SNN. Finally, we present a hybrid learning rule to circumvent such a spike count mismatch problem.
The SNN deals with spiking events; additional effort is therefore required to transform conventional frame-based images or feature vectors into spike trains, as well as to decode output spike trains into the associated output classes. Two coding schemes are commonly used: rate code and temporal code. Rate code [28, 29] converts real-valued inputs into spike trains at each sampling time step following a Poisson or Bernoulli distribution. However, it suffers from sampling error and therefore requires a long encoding time window to compensate. Temporal code offers superior coding efficiency and computational advantages over rate code, but it is very complex to decode and sensitive to noise.
In this work, as shown in Fig. 1, we feed the real-valued input directly into the ANN layer, while we zero-pad the input along the temporal dimension (to match a specific encoding time window) before passing it into the SNN layer. Hence, precise input information is preserved and the first SNN layer can perform the encoding in a learnable fashion. In addition, such an encoding scheme is beneficial for rapid inference since the information is typically encoded at early time steps. For decoding, it is feasible to decode from the SNN output layer using either the discrete spike count or the continuous aggregate membrane potential. In our experiments, however, we notice that decoding with the aggregate membrane potential provides a much smoother learning curve, due to the high-precision error gradients derived at the output layer.
In this work, we use the integrate-and-fire (IF) neuron model with a reset-by-subtraction scheme for SNN layers. This simplified spiking neuron model drops the leaky term and refractory period; as a result, it faithfully retains the number of input spikes it receives (until reset). While the IF neuron does not emulate the rich temporal dynamics of biological neurons, it is ideal for working with rate-coded sensory input, where spike timing does not play a significant role.
At each time step , the input spikes to neuron at layer are integrated as follows
where is the neuron firing threshold and indicates the occurrence of an input spike from afferent neuron at time step . The denotes the synaptic weight that connects afferent neuron from layer . Neuron then integrates the input current into its membrane potential as per Eq. 2. is initialized with a learnable parameter (Eq. 3) that is equivalent to the bias term of the coupled ANN, and an output spike is generated whenever crosses the firing threshold (Eq. 4).
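The neuron model described above can be illustrated with a minimal sketch. It assumes a single IF neuron, a membrane potential initialized to the bias, at most one output spike per time step, and a firing condition of "potential at or above threshold"; the function name `simulate_if_neuron` and its parameters are hypothetical and do not come from the paper's source code.

```python
def simulate_if_neuron(weights, input_spikes, threshold=1.0, bias=0.0):
    """Simulate one integrate-and-fire neuron with reset by subtraction.

    weights:      one synaptic weight per afferent neuron
    input_spikes: list over time steps; each entry is a list of 0/1 flags,
                  one per afferent neuron
    Returns the output spike train as a list of 0/1 flags.
    """
    potential = bias  # assumption: the potential starts at the bias value
    output = []
    for spikes_t in input_spikes:
        # Integrate the weighted input spikes arriving at this time step.
        potential += sum(w * s for w, s in zip(weights, spikes_t))
        if potential >= threshold:
            output.append(1)
            potential -= threshold  # reset by subtraction keeps the surplus
        else:
            output.append(0)
    return output


# Two afferent neurons with weight 0.5 each, three time steps.
print(simulate_if_neuron([0.5, 0.5], [[1, 1], [1, 0], [1, 0]]))  # [1, 0, 1]
```

Because the surplus potential is carried over rather than discarded, no input charge is lost between spikes, which is the property the text relies on for rate-coded inputs.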
According to Eq. 1, the aggregated membrane potential of neuron in layer can be expressed as
where is the input spike count from pre-synaptic neuron at layer as per Eq. 6.
In our earlier work, we neglected the temporal dynamics of IF neurons, treated them as simple non-leaky integrators, and established the following correspondence between the aggregated membrane potential and the output spike count.
where the output ‘spike count’ will be clipped at a value of zero for negative aggregated membrane potential .
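The displayed equations (Eqs. 1 to 7) did not survive in this copy of the text. The block below is one reading that is consistent with the surrounding prose, offered as a hedged reconstruction rather than the authors' exact formulation. All symbols are assumptions: $\theta_j^{l-1}[t]$ for the input spike indicator, $w_{ij}^{l-1}$ for the synaptic weight, $z_i^l[t]$ for the input current, $V_i^l[t]$ for the membrane potential, $b_i^l$ for the bias, $\vartheta$ for the firing threshold, $\Theta$ for the Heaviside step function, $c$ for spike counts, $T$ for the number of time steps and $\rho$ for the rectifier that clips negative values to zero.

```latex
\begin{align}
z_i^l[t] &= \sum_j w_{ij}^{l-1}\,\theta_j^{l-1}[t] \tag{1}\\
V_i^l[t] &= V_i^l[t-1] + z_i^l[t] - \vartheta\,\theta_i^l[t-1] \tag{2}\\
V_i^l[0] &= b_i^l \tag{3}\\
\theta_i^l[t] &= \Theta\!\left(V_i^l[t] - \vartheta\right) \tag{4}\\
V_i^{l,T} &= \sum_j w_{ij}^{l-1}\,c_j^{l-1} + b_i^l \tag{5}\\
c_j^{l-1} &= \sum_{t=1}^{T} \theta_j^{l-1}[t] \tag{6}\\
c_i^l &= \rho\!\left(\left\lfloor V_i^{l,T} / \vartheta \right\rfloor\right) \tag{7}
\end{align}
```

Under this reading, Eq. 5 follows from summing Eq. 1 over all time steps and adding the initial value from Eq. 3, and Eq. 7 treats the neuron as a non-leaky integrator that emits one spike per threshold's worth of accumulated potential.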
Unlike the continuous neuron activation functions used in ANNs, the output spike counts can only be non-negative integers (as enforced by Eq. 7). The surplus membrane potential that is insufficient to induce an additional spike is discarded rather than carried over to the next sample, resulting in the quantization error shown in Fig. 2 and Eq. 7. Moreover, the spike count is upper bounded by the maximum number of time steps; this constraint can be alleviated by using a higher time resolution. In the backward pass, the discontinuity of the activation function is addressed with the straight-through estimator. By taking Eq. 7 as the neuron activation function for the ANN and mapping the trained ANN weights to the SNN, we were able to achieve competitive classification accuracy on the MNIST dataset.
However, when applying this spike-count-based learning rule to the more complex CIFAR-10 dataset, we noticed a large accuracy drop (up to 10%) when mapping ANN weights to the SNN. After carefully comparing the ANN output 'spike count' with the actual SNN spike count, we observed a growing spike count mismatch between ANN and SNN layers, as shown in Fig. 4. This mismatch arises because the temporal dynamics of IF neurons are ignored in the constrained ANN. To allow a better understanding of this problem, as shown in Fig. 3, we have prepared a hand-crafted example wherein a post-synaptic IF neuron is connected to three pre-synaptic neurons that emit one spike each. Although the aggregate membrane potential of the post-synaptic neuron is below the firing threshold, one spike will nevertheless be generated by this neuron due to its inherent temporal dynamics. While this spike count mismatch problem may be negligible for shallow networks of the size used for the MNIST dataset, or with very high input spike rates, it has a huge impact on deep SNNs with sparse input spike trains and short encoding time windows, as demonstrated on the CIFAR-10 dataset (Fig. 4).
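The hand-crafted example can be reproduced numerically with a short sketch. The weights and spike times below are hypothetical values chosen for illustration (they are not taken from Fig. 3), the threshold is assumed to be 1, and the rate-based estimate is assumed to be the floor of the aggregate potential divided by the threshold, clipped at zero.

```python
import math


def ann_spike_count(weights, input_counts, threshold=1.0, bias=0.0):
    """Rate-based estimate: floor(aggregate potential / threshold), clipped at zero."""
    aggregate = bias + sum(w * c for w, c in zip(weights, input_counts))
    return max(0, math.floor(aggregate / threshold))


def snn_spike_count(weights, input_spikes, threshold=1.0, bias=0.0):
    """Step-by-step IF neuron with reset by subtraction; returns total output spikes."""
    potential, count = bias, 0
    for spikes_t in input_spikes:
        potential += sum(w * s for w, s in zip(weights, spikes_t))
        if potential >= threshold:
            count += 1
            potential -= threshold
    return count


# Hypothetical weights: two excitatory inputs and one inhibitory input.
weights = [0.5, 0.75, -0.75]
# Each pre-synaptic neuron fires exactly once, at a different time step.
input_spikes = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
input_counts = [sum(column) for column in zip(*input_spikes)]

ann_count = ann_spike_count(weights, input_counts)  # aggregate 0.5 is below threshold
snn_count = snn_spike_count(weights, input_spikes)  # potential reaches 1.25 at step 2
print(ann_count, snn_count)  # 0 1
```

The inhibitory spike arrives only after the neuron has already fired, so the step-by-step simulation emits a spike that the aggregate view cannot account for. Reordering the same three spikes so that the inhibitory one arrives first would remove the mismatch, which shows that the discrepancy depends on spike timing and not on spike counts.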
To overcome this spike count mismatch problem, which originates from the inherent dynamics of the IF neuron, we propose a hybrid learning rule. As shown in Fig. 6, an ANN with the activation function defined in Eq. 7 is employed to enable error backpropagation in a rate-based network, while an SNN sharing weights with the coupled ANN is employed to determine the exact output spike count and spike train. The spike count and spike train derived from the SNN layer are transmitted to the subsequent ANN and SNN layers, respectively. Although a coupled ANN is harnessed in the training phase, inference is executed entirely on the SNN. The idea of decoupled network layers has also been exploited in binary neural networks, in which full-precision activation values are calculated at each layer, whereas binarized activation values are forward propagated to the subsequent layer.
By injecting the dynamics of IF neurons into the training phase, this learning rule effectively prevents the spike count mismatch from accumulating across layers. Although a mismatch still exists between the output of the ANN layer and that of the coupled SNN layer (spike count), our experimental results suggest that the angle between these two outputs is exceedingly small in a high-dimensional space and that this relationship is maintained throughout learning. In addition, the weight-activation dot products, a critical intermediate quantity, are approximately preserved despite the mismatch error. Therefore, the modified learning dynamics in such a decoupled network can approximate the learning dynamics of an intact ANN. The pseudocode of the proposed learning rule is provided in Algorithm 1.
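As a rough illustration of the forward pass through one coupled layer, the sketch below computes both paths for a small fully-connected layer with shared weights. It is a simplified reading of the description above, not the authors' Algorithm 1: the function name `tandem_dense_forward`, the clipping of the ANN estimate to the number of time steps, and the firing condition are assumptions, and gradient computation is omitted.

```python
import math


def tandem_dense_forward(weights, biases, spike_train_in, threshold=1.0):
    """Forward pass of one hypothetical tandem fully-connected layer.

    weights[i][j]:  weight from input neuron j to output neuron i (shared by ANN and SNN)
    spike_train_in: list over time steps of 0/1 lists, one flag per input neuron
    Returns (ann_counts, snn_counts, spike_train_out).
    """
    num_steps = len(spike_train_in)
    counts_in = [sum(column) for column in zip(*spike_train_in)]

    # ANN path: rate-based estimate from the input spike counts. During
    # training it would only supply the surrogate used for backpropagation.
    ann_counts = []
    for w_row, b in zip(weights, biases):
        aggregate = b + sum(w * c for w, c in zip(w_row, counts_in))
        estimate = math.floor(aggregate / threshold)
        ann_counts.append(min(num_steps, max(0, estimate)))

    # SNN path: IF neurons with reset by subtraction give the exact spike train.
    potentials = list(biases)
    spike_train_out = []
    for spikes_t in spike_train_in:
        out_t = []
        for i, w_row in enumerate(weights):
            potentials[i] += sum(w * s for w, s in zip(w_row, spikes_t))
            if potentials[i] >= threshold:
                out_t.append(1)
                potentials[i] -= threshold
            else:
                out_t.append(0)
        spike_train_out.append(out_t)
    snn_counts = [sum(column) for column in zip(*spike_train_out)]

    # The SNN spike counts, not the ANN estimates, feed the next ANN layer,
    # and the SNN spike train feeds the next SNN layer.
    return ann_counts, snn_counts, spike_train_out


layer_weights = [[0.5, 0.75, -0.75], [1.0, 1.0, 1.0]]
layer_biases = [0.0, 0.0]
spikes_in = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ann_out, snn_out, train_out = tandem_dense_forward(layer_weights, layer_biases, spikes_in)
print(ann_out, snn_out)  # [0, 3] [1, 3]
```

The first output neuron shows a mismatch between the two paths while the second does not. Because the SNN count is what gets forwarded, the next layer sees the value the deployed SNN would actually produce, which is how the rule keeps the mismatch from compounding across layers.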
In this section, we first evaluate the learning capability of the proposed learning rule on two standard image classification benchmarks. We further discuss why effective learning can be performed within a decoupled network configuration. Finally, we present and discuss the attractive properties of rapid inference and reduced total synaptic operations that are achieved with the proposed learning rule.
To evaluate the learning capability, convergence properties and energy efficiency of the proposed learning rule, we use two image classification benchmark datasets: CIFAR-10 and IMAGENET-2012. CIFAR-10 consists of 60,000 color images of size 32×32 from 10 classes, with a standard split of 50,000 and 10,000 images for training and testing, respectively. The large-scale IMAGENET-2012 dataset consists of over 1.2 million images from 1,000 object categories. Notably, the success of AlexNet on this dataset represents a key milestone in deep learning research.
As shown in Fig. 5, we use a customized convolutional neural network (CNN), CIFARNet, with 6 learnable layers for CIFAR-10, and AlexNet for IMAGENET-2012. To reduce the dependency on weight initialization and to accelerate the training process, we add a batch normalization layer after each convolution and fully-connected layer. Given that a batch normalization layer only performs an affine transformation, we integrate its parameters into the preceding layer's weight vector before copying the result into the coupled SNN layer. We replace the average pooling operations that are commonly used in the ANN-to-SNN conversion approach with stride-2 convolution operations, which perform dimensionality reduction in a learnable fashion. This design choice eliminates the quantization errors that would occur in IF neurons if average pooling layers were used.
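Integrating batch normalization into the preceding layer can be sketched as follows. This is the standard folding of an affine normalization into weights and bias, written for a fully-connected layer with one set of statistics per output neuron; the function name `fold_batch_norm`, the epsilon value and the toy numbers are assumptions, and the authors' implementation may differ in detail.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output batch-norm parameters into the layer's weights and bias.

    Batch norm computes gamma * (z - mean) / sqrt(var + eps) + beta with
    z = w . x + b, which equals (scale * w) . x + ((b - mean) * scale + beta)
    for scale = gamma / sqrt(var + eps).
    """
    folded_w, folded_b = [], []
    for w_row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


# Toy check with one output neuron and two inputs (hypothetical values).
x = [1.0, 2.0]
w, b = [[0.5, -1.0]], [0.1]
gamma, beta, mean, var = [2.0], [0.3], [0.2], [4.0]

z = sum(wi * xi for wi, xi in zip(w[0], x)) + b[0]
reference = gamma[0] * (z - mean[0]) / math.sqrt(var[0] + 1e-5) + beta[0]

fw, fb = fold_batch_norm(w, b, gamma, beta, mean, var)
folded = sum(wi * xi for wi, xi in zip(fw[0], x)) + fb[0]
print(abs(folded - reference) < 1e-9)  # True
```

After folding, the SNN layer needs only a weight and a bias (the initial membrane potential in this work), so no normalization has to be performed on spike trains at inference time.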
We perform all experiments with the Tensorpack toolbox, a high-level neural network training interface based on TensorFlow. Tensorpack optimizes the whole training pipeline, providing accelerated and memory-efficient training on multi-GPU machines. We follow the same data pre-processing procedures (crop, flip, mean normalization, etc.), optimizer and learning rate decay schedule that are adopted in the Tensorpack CIFAR-10 and IMAGENET-2012 examples, and use those configurations consistently for all experiments. As shown in Fig. 6, we implement customized convolution and fully-connected layers in Tensorpack and integrate the operations of the ANN layer and the coupled SNN layer under a unified interface.
The computational cost of neuromorphic architectures is typically benchmarked using the total synaptic operations [9, 29, 30, 16]. For the SNN, as defined below, the total synaptic operations (SynOps) depend on the neurons' firing rates, their fan-out (the number of outgoing connections to the subsequent layer) and the simulation time window.
where is the total number of layers and denotes the total number of neurons in layer . indicates whether a spike is generated by neuron of layer at time instant .
In contrast, the total synaptic operations required to classify one image in the ANN are given as follows
with denotes the number of incoming connections to each neuron in layer . In our experiment, we calculate the average synaptic operations on a randomly chosen mini-batch (256 images) from the test set.
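The two operation counts can be illustrated with a small sketch that follows the verbal definitions above: for the SNN, every emitted spike costs one operation per outgoing connection; for the ANN, every neuron costs one operation per incoming connection. The function names, the data layout and the toy network are hypothetical.

```python
def snn_synops(spike_trains, fan_out):
    """Total SNN synaptic operations.

    spike_trains[l][t][j] is 1 if neuron j of layer l fires at time step t;
    fan_out[l] is the number of outgoing connections per neuron of layer l.
    """
    total = 0
    for layer_spikes, f_out in zip(spike_trains, fan_out):
        for spikes_t in layer_spikes:
            total += f_out * sum(spikes_t)
    return total


def ann_synops(fan_in, num_neurons):
    """Total ANN synaptic operations: fan-in times neuron count, summed over layers."""
    return sum(f * n for f, n in zip(fan_in, num_neurons))


# Toy network: 3 input neurons -> 4 hidden neurons -> 2 output neurons, 2 time steps.
spike_trains = [
    [[1, 0, 1], [0, 1, 0]],        # input layer, fan-out 4
    [[0, 0, 1, 0], [1, 0, 0, 0]],  # hidden layer, fan-out 2
]
snn_ops = snn_synops(spike_trains, fan_out=[4, 2])          # 3*4 + 2*2 = 16
ann_ops = ann_synops(fan_in=[3, 4], num_neurons=[4, 2])     # 3*4 + 4*2 = 20
print(snn_ops, ann_ops, snn_ops / ann_ops)  # 16 20 0.8
```

The ratio printed last corresponds to the SNN-to-ANN operation ratio reported later in the text; it falls below 1 only when spiking activity is sparse enough and the time window short enough.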
Table I
|Model||Network Architecture||Method||Test Accuracy (%)||Inference Time (time steps)|
|CIFAR-10|
|Panda and Roy (2016)||Spiking Convolutional Autoencoder||Layer-wise Spike-based Learning||75.42||-|
|Esser et al. (2016)||Spiking CNN (15 layers)||Binary Neural Network||89.32||16|
|Rueckauer et al. (2017)||Spiking CNN (8 layers)||Conversion of ANN||90.85||-|
|Wu et al. (2018)||Spiking CNN (CIFARNet)||Error Backpropagation Through Time||90.53||-|
|Wu et al. (2018)||Spiking CNN (AlexNet)||Error Backpropagation Through Time||85.24||-|
|Sengupta et al. (2019)||Spiking CNN (VGG-16)||Conversion of ANN||91.46||2,500|
|Lee et al. (2019)||Spiking CNN (ResNet-11)||Conversion of ANN||90.15||3,000|
|Lee et al. (2019)||Spiking CNN (ResNet-11)||Spike-based Learning||90.95||100|
|This work (Spike Count)||Spiking CNN (6 layers)||Error Backpropagation within Hybrid Network||91.54||16|
|This work (Agg. Mem. Potential)||Spiking CNN (6 layers)||Error Backpropagation within Hybrid Network||90.07||16|
|IMAGENET-2012 (top-1 accuracy, top-5 in parentheses)|
|Hunsberger and Eliasmith (2016)||Spiking CNN (AlexNet)||Conversion of Constrained ANN||51.80 (76.20)||200|
|Rueckauer et al. (2017)||Spiking CNN (VGG-16)||Conversion of ANN||49.61 (81.63)||400|
|Sengupta et al. (2019)||Spiking CNN (VGG-16)||Conversion of ANN||69.96 (89.01)||2,500|
|This work||CNN (AlexNet)||Error Backpropagation||57.55 (80.44)||-|
|This work||Spiking CNN (AlexNet)||Error Backpropagation within Hybrid Network||46.63 (70.80)||13|
|This work||Spiking CNN (AlexNet)||Error Backpropagation within Hybrid Network||50.22 (73.60)||18|
For CIFAR-10, as shown in Table I, the spiking-CIFARNet trained with the proposed learning rule achieves competitive test accuracies of 91.54% (spike count decoding) and 90.07% (aggregate membrane potential decoding). With spike count decoding, the spiking-CIFARNet achieves the best result reported so far on CIFAR-10 with an SNN. As shown in Fig. 7, however, its learning dynamics are unstable, which may be attributed to the noisy error gradients derived at the output layer. Therefore, we use aggregate membrane potential decoding for the remaining experiments on IMAGENET-2012 and for studying the effect of the encoding time window on CIFAR-10. Although learning converges more slowly than for the plain CNN and the bounded CNN (proposed in our earlier work, using the constrained activation function of Eq. 7), the error rate of the SNN eventually matches that of the bounded CNN. This suggests that by adding the dynamics of IF neurons into the training phase, the spike count mismatch problem described in Sec. II-C can be effectively alleviated.
Training a model on IMAGENET-2012 with a spike-based learning rule requires a large amount of memory to store the intermediate states of spiking neurons, as well as huge computational costs. Hence, only a few SNN implementations, none of which take the dynamics of spiking neurons into consideration during training, have made successful attempts at this challenging task, including the ANN-to-SNN conversion [29, 30] and conversion of constrained ANN approaches. Our approach, however, combines the advantages of both: the dynamics of IF neurons are considered during forward propagation, while only a rate-based ANN is used for error backpropagation. As a result, the proposed approach improves on both the memory requirement and the computational cost compared to spike-based learning rules.
As shown in Table I, with an inference time of 18 time steps (the input image is encoded within a time window of 10 time steps), the spiking-AlexNet trained with the proposed learning rule achieves top-1 and top-5 accuracies of 50.22% and 73.60%, respectively. This result is comparable to that of the constrained ANN conversion approach with the same AlexNet architecture. Notably, the proposed learning rule takes only 18 inference time steps, which is at least an order of magnitude faster than the other reported approaches. While the ANN-to-SNN conversion approaches achieve better classification accuracies on IMAGENET-2012, their success is largely credited to the more advanced models used. Furthermore, we note an accuracy drop of around 7% from the baseline AlexNet implementation (revised from the original AlexNet model), which may be attributed to the mismatch error between ANN layer activation values and SNN layer spike counts. As future work, we would like to explore strategies to minimize such mismatch errors and to evaluate more advanced network architectures.
Although the learning capability of the proposed learning rule has been demonstrated on CIFAR-10 and IMAGENET-2012, it is puzzling why learning can be performed effectively across decoupled network layers. To address this question, we borrow ideas from recent theoretical work on binary neural networks, wherein learning is also performed across decoupled network layers (binarized activations are forward propagated to subsequent layers). In the proposed hybrid network, as shown in Fig. 8, the activation value of each ANN layer is replaced with the aggregate spike count of the coupled SNN layer. Due to the dynamic nature of spike generation, there is no explicit transformation function between these two quantities. To circumvent this problem, we analyze the degree of mismatch between them and its effect on activation forward propagation and error backpropagation.
In our numerical experiments on CIFAR-10 with a randomly drawn mini-batch of 256 test samples, we calculate the cosine angle between the vectorized ANN activation values and the coupled SNN spike counts for all the convolution layers. As shown in Fig. 9, these angles are below 30 degrees on average, and this relationship is maintained consistently throughout learning. While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. According to hyperdimensional computing theory and the study of binary neural networks, any two high-dimensional random vectors are approximately orthogonal. It is also worth noting that the distortion caused by replacing the ANN activation values with the SNN spike counts is less severe than that of binarizing a high-dimensional random vector, which changes the cosine angle by 37 degrees in theory. Given that the activation function and the error gradient backpropagated from the subsequent ANN layer remain unchanged, the distortions to error backpropagation are bounded locally by the mismatch error.
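The two geometric facts invoked above can be checked numerically with a short sketch. It assumes standard Gaussian random vectors and sign binarization; the dimension, the seed and the helper name `angle_degrees` are arbitrary choices for illustration, and the closed form arccos(sqrt(2/pi)) is the large-dimension limit for Gaussian vectors.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


rng = random.Random(0)
dim = 10000
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x_binary = [1.0 if a >= 0 else -1.0 for a in x]

angle_random = angle_degrees(x, y)            # close to 90 degrees
angle_binarized = angle_degrees(x, x_binary)  # close to the theoretical value
theory = math.degrees(math.acos(math.sqrt(2.0 / math.pi)))  # about 37 degrees

print(round(angle_random, 1), round(angle_binarized, 1), round(theory, 1))
```

Against a baseline of roughly 90 degrees between unrelated vectors, an average angle below 30 degrees between ANN activations and SNN spike counts indicates that the two vectors point in nearly the same direction.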
Furthermore, we calculate the Pearson correlation coefficient between the weight-activation dot products computed from the ANN activation values and those computed from the SNN spike counts; these dot products are an important intermediate quantity (the input to the batch normalization layer) in our current network configurations. We note that the Pearson correlation coefficients remain consistently above 0.9 throughout learning for most of the samples, suggesting that the linear relationship between the weight-activation dot products is approximately preserved.
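The correlation measurement can be sketched on a toy example. The weights, activation values and spike counts below are hypothetical numbers chosen so that the two vectors differ in a single entry; they are not measurements from the paper, and `pearson` is a plain textbook implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical layer input: ANN activation values versus SNN spike counts.
ann_activation = [2, 0, 3, 1]
snn_count = [2, 1, 3, 1]

# Hypothetical weight rows, one per neuron of the next layer.
weight_rows = [
    [0.5, -0.25, 1.0, 0.0],
    [-1.0, 0.5, 0.25, 0.5],
    [0.25, 1.0, -0.5, 1.0],
    [1.0, 0.0, 0.5, -1.0],
]

dot_ann = [sum(w * a for w, a in zip(row, ann_activation)) for row in weight_rows]
dot_snn = [sum(w * c for w, c in zip(row, snn_count)) for row in weight_rows]
correlation = pearson(dot_ann, dot_snn)
print(round(correlation, 3))  # 0.987
```

A coefficient near 1 means the dot products seen by the batch normalization layer differ from the ANN values by little more than a shift and scale, which normalization largely absorbs.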
As shown in Fig. 7, the proposed learning rule is able to deal with different encoding window sizes on CIFAR-10. Even in the most challenging case, with the shortest encoding window, we achieve a satisfactory error rate below 12%. This may be credited to the encoding strategy that we have employed, whereby input information is encoded at the first time step before being passed into the SNN layer. In addition, the batch normalization layer added after each convolution and fully-connected layer ensures that information is transmitted effectively to the top layers. The error rate decreases further as the time window expands, although the improvement eventually vanishes. Hence, the SNN trained with the proposed learning rule can perform inference rapidly, with at least an order of magnitude time saving compared with other learning rules, as shown in Table I. While binary neural networks also support rapid inference, they propagate information in a synchronized fashion and differ fundamentally from the asynchronous information processing studied in other works.
To study the energy efficiency of the proposed learning rule, we calculate the ratio of SNN AC operations to ANN MAC operations on CIFAR-10 and IMAGENET-2012 and compare it with other state-of-the-art learning rules. Thanks to the short inference time required and the sparse synaptic activity, as shown in Table II, the spiking-AlexNet trained with the proposed learning rule achieves a ratio of only 0.40 and 0.68 on the CIFAR-10 and IMAGENET-2012 datasets, respectively. Notably, a ratio below 1 indicates that the spiking-AlexNet is more energy efficient than its ANN counterpart. This holds even before accounting for the fact that each synaptic operation in an SNN is only an accumulate (AC) operation, whereas in an ANN it is a more costly multiply-and-accumulate (MAC) operation; the AC operation yields an order of magnitude saving in chip area and energy per synaptic operation [29, 30]. Furthermore, the proposed learning rule achieves at least 9 and 3 times savings in synaptic operations compared to other learning rules [16, 30] on the CIFAR-10 and IMAGENET-2012 datasets, respectively.
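Table II
|Model||Inference Time (time steps)||SNN AC / ANN MAC (CIFAR-10)||SNN AC / ANN MAC (IMAGENET-2012)|
|AlexNet (this work)||13||0.27||0.50|
|AlexNet (this work)||18||0.40||0.68|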
In this work, we introduce a novel learning rule based on a hybrid neural network to effectively train rate-based SNNs for efficient and rapid inference on machine learning tasks. Within the hybrid neural network, a rate-based SNN using IF neurons is employed to determine precise spike counts and spike trains for activation forward propagation, while an ANN sharing weights with the coupled SNN is used to approximate the gradients of the coupled SNN. Given that error backpropagation is performed on the rate-based ANN, the proposed learning rule is more memory and computationally efficient than the error backpropagation through time algorithm used in many spike-based learning rules [13, 14, 15].
To understand why learning can be performed effectively with decoupled network layers, we study the learning dynamics of the decoupled network and compare them to those of an intact ANN. The empirical study on CIFAR-10 reveals that the cosine angles between the vectorized ANN output and the coupled SNN output spike count are exceedingly small in a high-dimensional space, and that this relationship is maintained throughout training. Furthermore, strong positive Pearson correlation coefficients are exhibited between the weight-activation dot products of the ANN and those of the coupled SNN, an important intermediate quantity in activation forward propagation, suggesting that the linear relationship between the weight-activation dot products is approximately preserved.
The SNNs trained with the proposed learning rule have demonstrated competitive classification accuracies on the CIFAR-10 and IMAGENET-2012 datasets. By encoding sensory stimuli into early time steps in a learnable fashion and adding batch normalization layers to ensure effective information flow, rapid inference, with at least an order of magnitude time saving compared to the state-of-the-art ANN-to-SNN conversion approach, is demonstrated on the large-scale IMAGENET-2012 image classification task. Furthermore, the total synaptic operations are significantly reduced compared to the baseline ANNs and other SNN implementations. By integrating the algorithmic power of the proposed learning rule with the unprecedented energy efficiency of emerging neuromorphic computing architectures, we expect to enable low-power on-chip computing on pervasive mobile and embedded devices. As future work, we will explore strategies to close the accuracy gap between the baseline ANN and SNN implementations, as well as evaluate more advanced network architectures.
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
“Advances in natural language processing,”Science, vol. 349, no. 6245, pp. 261–266, 2015.
“Supervised learning based on temporal coding in spiking neural networks,”IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227–3235, 2018.
“Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,”Cognitive computation, vol. 1, no. 2, pp. 139–159, 2009.