A Tandem Learning Rule for Efficient and Rapid Inference on Deep Spiking Neural Networks

07/02/2019, by Jibin Wu, et al.

Emerging neuromorphic computing (NC) architectures have shown compelling energy efficiency in machine learning tasks using spiking neural networks (SNNs). However, due to the non-differentiable nature of spiking neuronal functions, the standard error back-propagation algorithm is not directly applicable to SNNs. In this work, we propose a tandem learning framework that consists of a SNN and an Artificial Neural Network (ANN) sharing weights. The ANN is an auxiliary structure that facilitates the error back-propagation for the training of the SNN. To this end, we consider the spike count as the discrete neural representation and design an ANN neuronal activation function that can effectively approximate the spike count of the coupled SNN. The SNNs trained with the proposed tandem learning rule show competitive classification accuracies on the CIFAR-10 and ImageNet-2012 datasets, with significantly reduced inference time and total synaptic operations compared with other state-of-the-art SNN implementations. The proposed tandem learning rule offers a novel solution to training efficient, low-latency and high-accuracy deep SNNs with low computing resources.


I Introduction

Deep learning has improved pattern classification performance by leaps and bounds in computer vision [1, 2], speech processing [3, 4], language understanding [5] and robotics [6]. However, deep neural networks are computationally intensive and memory inefficient, thereby limiting their deployment on mobile and wearable devices with limited computational budgets. This prompts us to look into energy-efficient solutions.

The human brain, shaped by millions of years of evolution, is incredibly efficient at performing complex perceptual and cognitive tasks [7]. Although hierarchically organized deep neural network models are brain-inspired, they differ significantly from the biological brain in many ways. Fundamentally, information in the brain is represented and communicated through asynchronous action potentials, or spikes. To efficiently and rapidly process the information carried by these spike trains, biological neural systems adopt an event-driven computation strategy, whereby energy is mostly consumed only when spike generation and communication take place.

Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic such asynchronous event-driven information processing with spiking neural networks (SNNs) in silicon [8]. Novel neuromorphic computing architectures, for instance TrueNorth [9] and Loihi [10], leverage low-power, densely connected parallel computing units to support spike-based computation. Furthermore, the co-located memory and computation can effectively mitigate the low bandwidth between the CPU and memory (i.e., the von Neumann bottleneck) [11]. When implemented on these neuromorphic architectures, deep SNNs benefit from the best of both worlds: superior classification accuracies and compelling energy efficiency [12]. Such promising prospects motivate the study in this paper.

While neuromorphic computing architectures offer attractive energy savings, how to train large-scale SNNs that can operate efficiently and effectively on these NC architectures remains a challenging research problem. The biologically plausible Hebbian learning rules [13] and spike-timing-dependent plasticity (STDP) [14, 15] are intriguing local learning rules for computational neuroscience studies and are also attractive for hardware implementation with emerging non-volatile memory devices [16, 17, 18]. However, they are not straightforward to use for large-scale machine learning tasks due to ineffective task-specific credit assignment.

Due to the asynchronous and discontinuous nature of synaptic operations within the SNN, the error back-propagation algorithm that is commonly used for ANN training is not directly applicable to the SNN. Recent research works [19, 20, 21, 22, 23] have suggested that it is viable to convert pre-trained ANNs to SNNs with little adverse impact on classification accuracy. This indirect training approach assumes that the graded activation of analog neurons is equivalent to the average firing rate of spiking neurons, and simply requires parsing and normalizing the weights after training the ANNs.

Fig. 1: A hand-crafted example illustrating the approximation error (spike count discrepancy) between the SNN and the approximating ANN when the encoding time window is short and the neuronal firing rate is low. Although the aggregate membrane potential of the post-synaptic IF neuron stays below the firing threshold at the end (a useful intermediate quantity that is used to approximate the output spike count), an output spike is generated due to the early arrival of spikes from positive synapses.

Rueckauer et al. [21] provide a theoretical analysis of the performance deviation of such an approach, as well as a systematic study of Convolutional Neural Network (CNN) models for a large-scale image classification task. This conversion approach achieves the best-reported results for SNNs on many benchmark datasets, including the challenging ImageNet-2012 dataset [24]. However, it trades inference speed against classification accuracy and requires at least several hundred inference time steps to reach the optimal classification accuracy [21, 22].

Additional research efforts have also been devoted to training constrained ANNs that approximate the properties of a specific spiking neuron [25, 12, 26, 27, 28], which can be seamlessly transferred to the target hardware platform and perform better than the aforementioned generic conversion approach. Grounded in the rate-based spiking neuron model, this constrain-then-train approach transforms the steady-state firing rate of the spiking neuron into a continuous and hence differentiable form that can be optimized with the conventional error back-propagation algorithm. While competitive classification accuracies are achieved with both the generic ANN-to-SNN conversion and the constrain-then-train approaches, the underlying assumption of a rate-based spiking neuron model requires a long inference time window or a high firing rate to reach a steady neuronal firing state [20, 26]. This steady-state requirement limits the computational benefits that can be acquired from the NC architectures.

To improve the overall energy efficiency as well as the inference speed, an ideal SNN learning rule should support a short encoding time window with sparse synaptic activities. To exploit this desirable property, temporal coding has been investigated, whereby the timing of the first spike is employed as a differentiable proxy to enable the error back-propagation algorithm [29, 30]. Although competitive classification accuracies were reported on the MNIST dataset with such a temporal learning rule, maintaining the stability of neuronal firing and scaling it up to the size of state-of-the-art deep ANNs remain elusive. In view of the steady-state requirement of rate-based SNNs and the scalability issues of temporally coded SNNs, we are interested in developing a new learning rule that can effectively and efficiently train deep SNNs to operate within a short encoding time window and with sparse synaptic activities.

The spiking neuronal function describes temporal dynamics, such as the leak and reset mechanisms of the membrane potential and the refractory period, that are very different from a continuous and differentiable ANN neuronal function. Furthermore, the size of the encoding time window also affects how well sparse synaptic activities are captured. It is therefore not straightforward to approximate the exact behavior of a SNN with an ANN, especially when the network has multiple hidden layers.

To demonstrate that such an approximation error, which we call the neural representation error, arises during information forward-propagation, we prepare a hand-crafted example as shown in Fig. 1. Although the free aggregate membrane potential of an integrate-and-fire (IF) neuron stays below the firing threshold at the end of the simulation time window (a useful intermediate quantity that can be applied to approximate the output spike count, as will be explained in Section II-C), an output spike is generated due to the early arrival of spikes from the positive synapses. Worse still, such a neural representation error (spike count discrepancy) accumulates across layers and significantly degrades the classification accuracy of the SNN when the trained weights are transferred from the ANN. Therefore, to effectively train a deep SNN under a short encoding time window with sparse synaptic activities, it is necessary to derive the exact neural representation with the SNN in the training loop.
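As a minimal numerical illustration of this effect (hand-crafted weights and spike times of our own choosing, with a unit firing threshold), the following Python sketch shows an IF neuron with reset by subtraction emitting a spike even though its free aggregate membrane potential ends below the threshold:

import numpy as np

# Hypothetical example: two excitatory afferents that fire early, one inhibitory afferent that fires late.
weights = np.array([0.6, 0.6, -0.9])          # synaptic weights (chosen for illustration)
spikes = np.array([[1, 0, 0],                 # afferent 0 fires at t=0
                   [1, 0, 0],                 # afferent 1 fires at t=0
                   [0, 0, 1]], dtype=float)   # afferent 2 (inhibitory) fires at t=2
threshold = 1.0

v, out_count = 0.0, 0
for t in range(spikes.shape[1]):
    v += weights @ spikes[:, t]               # integrate input current
    if v >= threshold:                        # fire and reset by subtraction
        out_count += 1
        v -= threshold

free_potential = weights @ spikes.sum(axis=1)  # aggregate potential if no spiking occurred
print(out_count)         # 1: the SNN emits a spike due to the early positive inputs
print(free_potential)    # 0.3 < threshold: the ANN approximation would predict 0 spikes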

One way to overcome the approximation error is to formulate SNNs as recurrent neural networks (RNNs) [31] and apply the error Back-propagation Through Time (BPTT) algorithm to train deep SNNs with pseudo derivatives [32, 33, 34, 35]. While competitive accuracies were reported on the MNIST and CIFAR-10 [36] datasets, training deep SNNs with BPTT is both memory and computationally inefficient. Furthermore, the vanishing gradient problem [37] that is well known for RNNs may impair learning when the firing rate is low. Readers may refer to recent overviews of deep learning with spiking neural networks [38, 39] for more details.

In this paper, to effectively and efficiently train deep SNNs to classify inputs that are encoded in spikes within a short time window, we propose a novel learning rule based on a tandem neural network. As illustrated in Fig. 2, the tandem network consists of a SNN and an ANN that are coupled layer-wise with shared weights. The ANN is an auxiliary structure that facilitates error back-propagation for the training of the SNN, while the SNN is used to derive the exact spiking neural representation.

The rest of this paper is organized as follows: in Section II, we present the details of the tandem learning framework. In Section III, we evaluate the proposed tandem network and learning rule on the CIFAR-10 and ImageNet-2012 datasets by comparing classification accuracy, inference speed and energy efficiency with other SNN implementations. Furthermore, we investigate why the proposed tandem learning rule learns effectively by comparing the high-dimensional geometry of activation values and weight-activation dot products between the coupled ANN and SNN network layers. Finally, we conclude the paper in Section IV.

Fig. 2: Illustration of the proposed tandem learning framework, which consists of a SNN and an ANN with shared weights. Spike counts are considered the main information carrier in this framework, and the ANN neuronal function is designed to approximate the spike counts of the coupled SNN. During training, in the forward pass, the spike counts and spike trains derived from a SNN layer are taken as the inputs to the subsequent ANN and SNN layers, respectively; the error gradients are passed backward through the ANN layers during error back-propagation to update the weights so as to minimize the objective function.

II Learning Through a Tandem Network

In this section, we first introduce the neuron model and the neural coding scheme used in this work. We then present a discrete neural representation scheme using the spike count as the information carrier across network layers, and design an ANN neuronal activation function that effectively approximates the spike count of the coupled SNN for error back-propagation. Finally, we introduce the tandem network and its learning rule, called the tandem learning rule, for deep SNN training.

II-A Neuron Model

In this work, we use the integrate-and-fire (IF) neuron model with the reset-by-subtraction scheme [21] in the SNN layers. This simplified spiking neuron model drops the membrane potential leak and refractory period terms present in more realistic spiking neuron models, for instance, the spike response model [40] and the leaky integrate-and-fire model [41]. In this way, it retains the efficacy of input spikes received across time (until reset). While the IF neuron does not emulate the rich temporal dynamics of biological neurons, it is ideal for working with sensory input where spike timing does not play a significant role, and for hardware implementation.

Under a discrete-time setting with encoding window size T, the input spikes to neuron i at layer l are transduced into an input current z_i^l[t] at time step t as follows

z_i^l[t] = Σ_j w_{ij}^{l-1} s_j^{l-1}[t] + b_i^l / T    (1)

where s_j^{l-1}[t] indicates the occurrence of an input spike from afferent neuron j at time step t, and w_{ij}^{l-1} denotes the strength of the synaptic connection from afferent neuron j of layer l-1. Here, b_i^l / T can be interpreted as a constant input current to the IF neuron. Mathematically, this term is related to the bias term b_i^l of the corresponding ReLU neuron in the coupled ANN layer l. It is important to distribute the effect of b_i^l evenly throughout the encoding time window, thereby effectively preventing IF neurons from over-firing at early time steps.

The neuron then integrates the input current z_i^l[t] into its membrane potential V_i^l[t] as per Eq. 2 (without loss of generality, a unitary membrane resistance is assumed here). V_i^l[0] is reset and initialized to zero for each input sample. An output spike is generated whenever V_i^l[t] crosses the firing threshold ϑ (Eq. 3).

V_i^l[t] = V_i^l[t-1] + z_i^l[t] - ϑ s_i^l[t-1]    (2)

s_i^l[t] = Θ(V_i^l[t] - ϑ), where Θ(x) = 1 if x ≥ 0 and 0 otherwise    (3)

According to Eq. 1, the free aggregate membrane potential (no spiking) of neuron i in layer l at the end of the encoding time window T can be expressed as

V_i^{l,f} = Σ_j w_{ij}^{l-1} c_j^{l-1} + b_i^l    (4)

where c_j^{l-1} is the input spike count from pre-synaptic neuron j at layer l-1 as per Eq. 5.

c_j^{l-1} = Σ_{t=1}^{T} s_j^{l-1}[t]    (5)

For the ANN layers, we use bounded ReLU neurons that linearly integrate their inputs and deliver only non-negative, integer-valued 'spike counts' to the subsequent layer. As explained in the ANN-to-SNN conversion work [21], the firing rate of an IF neuron correlates linearly with the activation value of a ReLU neuron. In this work, we extend this property further and approximate the spike count of IF neurons with bounded ReLU neurons, as will be presented in Section II-C.
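For illustration, Eqs. 1-5 can be simulated with a few lines of NumPy. The following minimal sketch (with variable names of our own choosing, not the Tensorpack implementation used in our experiments) runs one layer of IF neurons with reset by subtraction and also returns the free aggregate membrane potential of Eq. 4:

import numpy as np

def if_layer_forward(spike_in, W, b, T, theta=1.0):
    """Simulate one layer of IF neurons with reset by subtraction (Eqs. 1-3).

    spike_in: (num_in, T) binary input spike trains
    W:        (num_out, num_in) shared synaptic weights
    b:        (num_out,) bias, injected as a constant current b/T per step (Eq. 1)
    Returns the output spike trains, output spike counts and the free
    aggregate membrane potential of Eqs. 4-5.
    """
    num_out = W.shape[0]
    v = np.zeros(num_out)                      # membrane potentials
    spike_out = np.zeros((num_out, T))
    for t in range(T):
        z = W @ spike_in[:, t] + b / T         # input current (Eq. 1)
        v += z                                 # integrate (Eq. 2)
        fired = v >= theta                     # threshold crossing (Eq. 3)
        spike_out[fired, t] = 1.0
        v[fired] -= theta                      # reset by subtraction
    spike_count = spike_out.sum(axis=1)
    v_free = W @ spike_in.sum(axis=1) + b      # free aggregate potential (Eq. 4)
    return spike_out, spike_count, v_free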

II-B Encoding and Decoding Schemes

Just like how the cochlea converts received sound waves into nerve impulses and the auditory cortex then perceives the sound encoded in the incoming nerve impulses, a SNN front-end is required to encode the sensory inputs into spike trains, and a neural network back-end then decodes the output spike trains into the desired pattern classes. Two encoding schemes are commonly used: rate coding and temporal coding. Rate coding [20, 21] converts real-valued inputs into spike trains by sampling at each time step from a Poisson or Bernoulli distribution. However, it suffers from sampling errors and therefore requires a long encoding time window to compensate for them, which makes it unsuitable for encoding information into a short time window. Temporal coding, on the other hand, uses the timing of a single spike to encode information and therefore enjoys superior coding efficiency and computational advantages. However, it is complex to decode and sensitive to noise [42].

Alternatively, we adopt the encoding scheme introduced in [21, 34] and directly feed the input images or feature vectors into a neural encoding layer. The neural encoding layer performs a weighted transformation with bounded ReLU neurons as shown in Eq. 6.

a_i^0 = f(Σ_j w_{ij}^0 x_j + b_i^0)    (6)

where w_{ij}^0 denotes the synaptic weight that connects input value x_j to encoding neuron i, and b_i^0 is the bias term of the encoding neuron. We use f(·) to denote the activation function of the bounded ReLU neuron defined in Eq. 11. Here, the argument of the activation function is analogous to the free aggregate membrane potential at the end of the encoding time window T. The subsequent spike train is generated by distributing this free aggregate membrane potential into consecutive time steps, beginning from the start of the encoding window, as follows

V_i^0[t] = Σ_j w_{ij}^0 x_j + b_i^0 - ϑ Σ_{τ=1}^{t-1} s_i^0[τ]    (7)

s_i^0[t] = Θ(V_i^0[t] - ϑ)    (8)

Altogether, the spike train and spike count output from the neural encoding layer can be represented as follows

s_i^0 = {s_i^0[1], s_i^0[2], ..., s_i^0[T]}    (9)

c_i^0 = Σ_{t=1}^{T} s_i^0[t]    (10)

This neural encoding layer converts the input into spike trains, whereby the output spike count can be adjusted in a learnable fashion to match the size of the encoding window T. Such an encoding scheme is beneficial for rapid inference, since the input information can be effectively encoded within a short time window. From this neural encoding layer onward, spike trains and spike counts are used as input to the SNN and ANN layers, respectively.
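A minimal sketch of this encoding layer follows (assuming a unit firing threshold, the bounded activation of Eq. 11, and spikes placed at the earliest consecutive time steps; the function and variable names are ours):

import numpy as np

def neural_encoding(x, W0, b0, T, theta=1.0):
    """Convert a real-valued input vector into spike trains (Eqs. 6-10).

    x:  (num_in,) real-valued input (e.g., pixel intensities)
    W0: (num_enc, num_in) encoding weights; b0: (num_enc,) biases
    Returns the spike trains s0 of shape (num_enc, T) and the spike counts c0.
    """
    v_free = W0 @ x + b0                               # weighted transformation (Eq. 6)
    c0 = np.clip(np.floor(v_free / theta), 0, T)       # bounded 'spike count' (Eq. 11)
    # Distribute the spike count over the earliest consecutive time steps (Eqs. 7-10).
    s0 = (np.arange(T)[None, :] < c0[:, None]).astype(float)
    return s0, c0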

For decoding, it is feasible to decode from the SNN output layer using either the discrete spike counts or the continuous free aggregate membrane potentials. In our preliminary study, as shown in Fig. 8, we observe that the free aggregate membrane potential provides a much smoother learning curve due to the continuous error gradients derived at the output layer.

Fig. 3: Illustration of spike counts as the discrete neural representation for the tandem network CIFARNet (Fig. 6(A)). The intermediate activations of a randomly selected sample from the CIFAR-10 dataset are provided. The top and bottom rows of each convolution layer refer to the exact spike count activations from the SNN and the pseudo spike count activations from the coupled ANN, respectively. Note that only the first 8 feature maps are shown, plotted in separate blocks. It is apparent that the ANN activation function can effectively approximate the exact spike counts from the corresponding SNN layer.

II-C Spike Count as a Discrete Neural Representation

Deep neural networks learn to describe the input data with compact feature representations. A typical feature representation is in the form of a continuous or discrete-valued vector. While most studies have focused on continuous feature representations, discrete representations have their unique advantages in solving some real-world problems [43, 44, 45, 46, 47]. For example, they are potentially a more natural fit for representing natural language, which is inherently discrete, and they are also native to logical reasoning and predictive learning. Moreover, the idea of a discretized neural representation has also been exploited in binary neural networks [48] for network quantization, wherein binarized activations (-1, +1) are used for feature representation.

In this work, we consider the spike count as a discrete feature representation in deep SNNs, as shown in Fig. 3. To formulate a discrete neural representation in the coupled ANN layer, if we ignore the temporal dynamics (membrane potential reset after spiking) of the IF neurons, we may establish a one-to-one correspondence between the free aggregate membrane potential V_i^{l,f} of the spiking neuron and the discrete pseudo output spike count a_i^l of the ANN neuron:

Fig. 4: Illustration of the activation function used for the ANN layer, which approximates the spike count of the coupled SNN layer. The pseudo spike count a_i^l is determined by rounding the scaled free aggregate membrane potential down to the nearest integer and is bounded between 0 and the encoding time window size T.

a_i^l = f(V_i^{l,f}) = min(T, max(0, ⌊V_i^{l,f} / ϑ⌋))    (11)

where a_i^l is lower bounded at zero. Without loss of generality, we set the firing threshold ϑ to 1 in this work. As shown in Fig. 4, different from the commonly used continuous neuron activation functions in ANNs, the outputs a_i^l are only non-negative integers. The surplus free membrane potential that is insufficient to induce an additional spike is rounded off, resulting in a quantization error as expressed in Eq. 12.

ε_i^l = V_i^{l,f} - ϑ a_i^l    (12)

In practice, however, we did not observe any obvious interference with learning or inference due to this quantization error. Moreover, a_i^l is upper bounded by the encoding time window size T. As shown in Fig. 3, the proposed ANN activation function can effectively approximate the exact spike count information of the coupled SNN layer. As the ANN and SNN are coupled layer-by-layer, the ANN approximates the SNN layer-by-layer. This makes it possible to train the deep SNN in a similar way as a deep ANN.
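A minimal sketch of the activation function in Eq. 11 and the corresponding quantization error of Eq. 12 (assuming ϑ = 1; the names are ours and the example values are arbitrary):

import numpy as np

def ann_activation(v_free, T, theta=1.0):
    """Bounded, floor-quantized ANN activation approximating the SNN spike count (Eq. 11)."""
    return np.clip(np.floor(v_free / theta), 0.0, float(T))

v_free = np.array([-0.4, 0.3, 2.7, 9.6])   # free aggregate membrane potentials
T = 8
a = ann_activation(v_free, T)              # -> [0., 0., 2., 8.]
residual = v_free - a                      # difference between the free potential and theta*a (Eq. 12)
print(a, residual)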

Notably, as described in Fig. 1, the pseudo spike count derived in the ANN layer (Eq. 11) may deviate from the actual spike count of the SNN layer, especially within a short time window, which may adversely affect the quality of the gradient derived during error back-propagation. We refer to this error as the gradient approximation error in the following sections. Our experimental results in Section III-E, however, suggest that the cosine angle between these two outputs is exceedingly small in a high-dimensional space, and this relationship is maintained throughout learning. In addition, the weight-activation dot products, a critical intermediate quantity, are approximately preserved despite the spike count discrepancy. Therefore, the learning dynamics in the ANN layer can effectively approximate those of the coupled SNN layer with spike count as the discrete neural representation.

Input: input sample x, target label y, network parameters {W^l, b^l}, neural encoding window size T
Output: updated network parameters
Forward Pass:
s^0, c^0 = Encoding(x)
for layer l = 1 to N-1 do
        // State Update of the ANN Layer
        a^l = ANN.layer[l].forward(c^{l-1}, W^l, b^l)
        for t = 1 to T do
               // State Update of the SNN Layer
               s^l[t] = SNN.layer[l].forward(s^{l-1}[t], W^l, b^l)
        // Update the Spike Count
        c^l = Σ_{t=1}^{T} s^l[t]
/* Output Layer with Different Decoding Schemes */
if Decode with 'Aggregate Membrane Potential' then
        o = ANN.layer[N].forward(c^{N-1}, W^N, b^N)
else if Decode with 'Spike Count' then
        a^N = ANN.layer[N].forward(c^{N-1}, W^N, b^N)
        for t = 1 to T do
               s^N[t] = SNN.layer[N].forward(s^{N-1}[t], W^N, b^N)
        o = c^N = Σ_{t=1}^{T} s^N[t]
Loss: E = LossFunction(o, y)
Backward Pass:
∂E/∂o = LossGradient(o, y)
for layer l = N-1 to 1 do
        // Gradient Update through the ANN Layer
        ∂E/∂c^{l-1}, ∂E/∂W^l = ANN.layer[l].backward(∂E/∂c^l, c^{l-1}, W^l)
Update the parameters of the ANN layers based on the calculated gradients.
Copy the updated parameters to the corresponding SNN layers.
Note:
For inference, state updates are performed entirely on the SNN layers.
Algorithm 1: Pseudo Code for the Tandem Learning Rule

II-D Credit Assignment in the Tandem Network

Although the neural representation error at each layer alone is not significant, as demonstrated in Fig. 3, it may severely impair the classification accuracy if the inaccurate neural representation is propagated to the subsequent layers. To solve this problem, we propose the tandem learning framework. As shown in Figs. 2 and 5, an ANN with the activation function defined in Eq. 11 is employed to enable error back-propagation in a rate-based network, while the SNN, sharing weights with the coupled ANN, is employed to determine the exact neural representation (i.e., spike counts and spike trains). The spike counts and spike trains are transmitted to the subsequent ANN and SNN layers, respectively. By incorporating the dynamics of the IF neuron into the training phase and propagating its output to the subsequent layers, this tandem learning framework effectively prevents the neural representation error from accumulating across layers. While the coupled ANN is used for error back-propagation, the forward inference is executed entirely on the SNN. The pseudo code of the proposed tandem learning rule is given in Algorithm 1.
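To make the interlaced forward and backward passes concrete, here is a minimal NumPy sketch of one tandem fully-connected layer (a simplification for illustration, not the Tensorpack implementation used in our experiments): the exact SNN spike count is propagated forward, while the backward pass goes through the ANN activation of Eq. 11, treating the floor quantization with a straight-through-style derivative, which is an assumption on our part.

import numpy as np

def tandem_fc_forward(spike_in, count_in, W, b, T, theta=1.0):
    """Forward pass of one tandem fully-connected layer.

    spike_in:  (num_in, T) exact input spike trains from the previous SNN layer
    count_in:  (num_in,)   exact input spike counts (input to the ANN layer)
    Returns the exact output spike train/count (forwarded to the next layers),
    the pseudo spike count of the ANN layer, and a cache for the backward pass.
    """
    # ANN branch: pseudo spike count (Eq. 11); it only defines the backward pass.
    v_free = W @ count_in + b
    pseudo_count = np.clip(np.floor(v_free / theta), 0, T)

    # SNN branch: exact IF dynamics with reset by subtraction (Eqs. 1-3).
    num_out = W.shape[0]
    v = np.zeros(num_out)
    spike_out = np.zeros((num_out, T))
    for t in range(T):
        v += W @ spike_in[:, t] + b / T
        fired = v >= theta
        spike_out[fired, t] = 1.0
        v[fired] -= theta
    count_out = spike_out.sum(axis=1)          # exact spike count, forwarded onward

    cache = (count_in, v_free, W)
    return spike_out, count_out, pseudo_count, cache

def tandem_fc_backward(grad_count_out, cache, T, theta=1.0):
    """Backward pass through the ANN layer (straight-through w.r.t. the quantization)."""
    count_in, v_free, W = cache
    # Pass gradients where the bounded activation is not clipped (0 <= v_free <= theta*T).
    gate = ((v_free >= 0) & (v_free <= theta * T)).astype(float)
    grad_v = grad_count_out * gate / theta
    grad_W = np.outer(grad_v, count_in)        # gradient w.r.t. the shared weights
    grad_b = grad_v
    grad_count_in = W.T @ grad_v               # gradient passed to the previous ANN layer
    return grad_count_in, grad_W, grad_b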

Fig. 5: Illustration of the information flow across the convolution and fully-connected layers during the forward and backward passes. As the weights are shared between the ANN and SNN layers, the ANN is designed to approximate the discrete neural representation (spike count) of the SNN, so as to facilitate error back-propagation across the ANN layers.

It is worth mentioning that, in the forward pass, the ANN layer takes the output of the previous SNN layer as its input. This aims at synchronizing the training of the SNN with the ANN via the interlaced layers, rather than at optimizing the classification performance of the ANN. A similar idea of interlaced network layers has also been explored in binary neural networks [48], in which full-precision activation values are calculated at each layer, whereas binarized activation values are propagated forward to the subsequent layer.

III Experimental Evaluation and Discussion

In this section, we first present the neural representation errors that may arise and accumulate across layers when taking the constrain-then-train approach (introduced in Section I) in the scenario of a short encoding window. We then evaluate the learning capability of the proposed tandem learning rule on two standard image classification benchmarks and discuss why effective learning can be performed within the tandem network. Finally, we discuss the rapid inference and the reduction in synaptic operations achieved with the proposed tandem learning rule.

III-A Datasets, Network Configurations and Implementation

To evaluate the learning capability, convergence property and energy efficiency of the proposed learning rule, we use two image classification benchmark datasets: CIFAR-10 [36] and ImageNet-2012 [24]. The CIFAR-10 dataset consists of 60,000 color images of size 32×32 from 10 classes, with a standard split of 50,000 and 10,000 images for training and testing, respectively. The large-scale ImageNet-2012 dataset consists of over 1.2 million images from 1,000 object categories. Notably, the success of AlexNet [1] on this dataset represents a key milestone of deep learning research.

As shown in Fig. 6, we use a convolutional neural network (CNN) with 6 learnable layers for CIFAR-10 (namely CIFARNet) and AlexNet for ImageNet-2012. To reduce the dependency on weight initialization and to accelerate the training process, we add a batch normalization [49] layer after each convolution and fully-connected layer. Given that the batch normalization layer only performs an affine transformation, we follow the approach introduced in [21] and integrate its parameters into the preceding layer's weights before copying them into the coupled SNN layer. We replace the average pooling operations that are commonly used in the ANN-to-SNN conversion approach with convolution operations of stride 2, which perform dimensionality reduction in a learnable fashion [50].
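This batch-normalization folding is the standard affine absorption; a minimal sketch for a fully-connected layer follows (the convolutional case folds along the output-channel dimension analogously; the names are ours):

import numpy as np

def fold_batch_norm(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding layer's weights and bias.

    BN(W x + b) = gamma * (W x + b - mean) / sqrt(var + eps) + beta
                = W_folded x + b_folded
    """
    scale = gamma / np.sqrt(running_var + eps)     # per-output-channel scale
    W_folded = W * scale[:, None]                  # scale each output row
    b_folded = (b - running_mean) * scale + beta
    return W_folded, b_folded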

Fig. 6: Network architectures used for the (A) CIFAR-10 and (B) ImageNet-2012 experiments. For the Conv2D layers, the values in brackets correspond to the number of output features, the filter size and the stride, respectively. For the FC layers, the value in brackets represents the number of output features.

Fig. 7: (A) Illustration of the constrain-then-train approach, which transfers the weights of the constrained ANN to SNN. (B) The network architecture in (A) that is used to evaluate the accumulated neural representation error when taking the constrain-then-train approach. (C) The neural representation error, i.e., the spike count difference between the SNN outputs and ANN outputs for all the layers in (B).

We perform all experiments with the Tensorpack toolbox [51], a high-level neural network training interface based on TensorFlow. Tensorpack optimizes the whole training pipeline, providing accelerated and memory-efficient training on multi-GPU machines. We follow the same data pre-processing procedures (crop, flip, mean normalization, etc.), optimizer and learning rate decay schedule adopted in the Tensorpack CIFAR-10 and ImageNet-2012 examples, and use these configurations consistently for all experiments. As shown in Fig. 5, we implement customized convolution and fully-connected layers in Tensorpack, which integrate the operations of the ANN layer and the coupled SNN layer under a unified interface.

III-B Counting Synaptic Operations

The computational cost of neuromorphic architectures is typically benchmarked using the total number of synaptic operations [9, 21, 22, 35]. For a SNN, as defined below, the total synaptic operations (SynOps) correlate with the neurons' firing rates, fan-outs (number of outgoing connections to the subsequent layer) and the encoding time window size T.

SynOps_SNN = Σ_{t=1}^{T} Σ_{l=1}^{L-1} Σ_{j=1}^{N^l} f_{out,j}^{l} s_j^l[t]    (13)

where L is the total number of layers, N^l denotes the total number of neurons in layer l, f_{out,j}^{l} is the fan-out of neuron j in layer l, and s_j^l[t] indicates whether a spike is generated by neuron j of layer l at time step t.

In contrast, the total synaptic operations required to classify one image with the ANN are given as follows

SynOps_ANN = Σ_{l=1}^{L} f_{in}^{l} N^l    (14)

where f_{in}^{l} denotes the number of incoming connections to each neuron in layer l. In our experiments, we calculate the average synaptic operations over a randomly chosen mini-batch (256 images) from the test set.
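Both quantities are straightforward to estimate from recorded spike trains and the layer connectivity. The following sketch is a simplified fully-connected illustration with hypothetical layer sizes and firing rates (conv-layer fan-in/fan-out bookkeeping is analogous):

import numpy as np

def snn_synops(spike_trains, fan_out):
    """Eq. 13: sum of (emitted spikes x fan-out) over layers, neurons and time steps."""
    return sum(float(np.sum(s.sum(axis=1) * f)) for s, f in zip(spike_trains, fan_out))

def ann_synops(fan_in, num_neurons):
    """Eq. 14: sum of (fan-in x number of neurons) over layers."""
    return sum(f * n for f, n in zip(fan_in, num_neurons))

# Hypothetical 784-400-10 fully-connected network with T = 8 and ~10% firing probability.
layer_sizes = [784, 400, 10]
T = 8
rng = np.random.default_rng(0)
spike_trains = [(rng.random((n, T)) < 0.1).astype(float) for n in layer_sizes[:-1]]
fan_out = [np.full(layer_sizes[i], layer_sizes[i + 1]) for i in range(len(layer_sizes) - 1)]
fan_in = layer_sizes[:-1]                 # incoming connections per neuron of the next layer
num_neurons = layer_sizes[1:]
ratio = snn_synops(spike_trains, fan_out) / ann_synops(fan_in, num_neurons)
print(ratio)                              # ratio of SNN AC operations to ANN MAC operations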

Fig. 8: (A) Error rates (ER) on the CIFAR-10 test set with different training schemes. (B) ER on the CIFAR-10 test set as a function of the encoding window size T.
Model | Network Architecture | Method | Error Rate (%) | Inference Time Steps

CIFAR-10
Panda and Roy (2016) [52] | Convolutional Autoencoder | Layer-wise Spike-based Learning | 24.58 | -
Esser et al. (2016) [12] | 15-layer CNN | Binary Neural Network | 10.68 | 16
Rueckauer et al. (2017) [21] | 8-layer CNN | ANN-to-SNN conversion | 9.15 | -
Wu et al. (2018) [34] | 8-layer CNN | Error Backpropagation Through Time | 9.47 | -
Wu et al. (2018) [34] | AlexNet | Error Backpropagation Through Time | 14.76 | -
Sengupta et al. (2019) [22] | VGG-16 | ANN-to-SNN conversion | 8.54 | 2,500
Lee et al. (2019) [35] | ResNet-11 | ANN-to-SNN conversion | 9.85 | 3,000
Lee et al. (2019) [35] | ResNet-11 | Spike-based Learning | 9.05 | 100
This work (SNN with Spike Count) | CIFARNet | Error Backpropagation through Tandem Network | 8.46 | 16
This work (SNN with Agg. Mem. Potential) | CIFARNet | Error Backpropagation through Tandem Network | 9.93 | 16

ImageNet-2012
Hunsberger and Eliasmith (2016) [26] | AlexNet | Constrain-then-Train | 48.20 (23.80) | 200
Rueckauer et al. (2017) [21] | VGG-16 | ANN-to-SNN conversion | 50.39 (18.37) | 400
Sengupta et al. (2019) [22] | VGG-16 | ANN-to-SNN conversion | 30.04 (10.99) | 2,500
This work (ANN with full-precision activation) | AlexNet | Error Backpropagation | 42.45 (19.56) | -
This work (ANN with quantized activation) | AlexNet | Error Backpropagation | 50.73 (26.08) | -
This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 53.37 (29.20) | 13
This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 49.78 (26.40) | 18

TABLE I: Comparison of classification error rates and inference speed of different SNN implementations on the CIFAR-10 and ImageNet-2012 test sets. ImageNet-2012 error rates are reported as top-1 (top-5).

III-C Accumulated Neural Representation Error

As discussed in Section I, one can train a constrained ANN that approximates the properties of spiking neurons (e.g., firing rate or spike count) using the conventional error back-propagation algorithm and subsequently transfer the trained weights to the SNN, as described in Fig. 7A; we refer to this as the constrain-then-train approach. Taking Eq. 11 as the neuron activation function, we reported competitive classification accuracy on the MNIST dataset [28]. However, when applying this approach to the more complex CIFAR-10 dataset with a short encoding time window, we noticed a large accuracy drop (approximately 11%) when transferring the trained ANN weights to the SNN. After carefully comparing the ANN output 'spike count' with the actual SNN spike count, we observe a growing spike count discrepancy between the ANN and SNN layers, as shown in Fig. 7C.

This is due to the fact that the neuronal activation function of the ANN ignores the temporal dynamics of the IF neuron. While such spike count discrepancies may be negligible for a shallow network used to classify the MNIST dataset [28], or when input firing rates are very high, they have a huge impact in the face of sparse synaptic activities and a short encoding time window. By incorporating the dynamics of IF neurons during the training of the tandem network, the exact output spike counts, instead of the ANN-predicted spike counts, are propagated forward to the subsequent ANN layer. The proposed tandem learning framework can thus effectively prevent this representation error from accumulating across layers.

III-D Image Classification Results

For CIFAR-10, as shown in Table I, the CIFARNet trained with the proposed learning rule achieves competitive test error rates of 8.46% (spike count decoding) and 9.93% (aggregate membrane potential decoding). With spike count decoding, the CIFARNet achieves by far the best-reported result on CIFAR-10 with a SNN. As shown in Fig. 8, we however note that its learning dynamics are unstable, which may be attributed to the discrete error gradients derived at the final output layer. Therefore, we use aggregate membrane potential decoding for the rest of the experiments on ImageNet-2012, as well as for a further study of the effect of the encoding time window size on CIFAR-10. Although learning converges more slowly than for the plain CNN (with the ReLU activation function) and the bounded CNN (with the bounded ReLU activation function defined in Eq. 11), the error rate of the SNN eventually matches that of the bounded CNN. This also suggests that the representation error described in Section III-C can be effectively mitigated with the proposed tandem learning framework.

Training a model on ImageNet-2012 with a spike-based learning rule that uses BPTT for synaptic weight updates requires a huge amount of memory to store the intermediate states of the spiking neurons, as well as a huge computational cost. Hence, only a few SNN implementations, which do not take the dynamics of spiking neurons into consideration during training, have made successful attempts on this challenging task, including the ANN-to-SNN conversion [21, 22] and constrain-then-train [26] approaches. The tandem learning rule benefits from the best of both worlds: the dynamics of IF neurons are considered during the forward propagation, while only the rate-based ANN is used for error back-propagation. As a result, it reduces both the memory requirement and the computational cost compared with other spike-based learning rules. Meanwhile, it also reduces the inference time and the total synaptic operations compared with the ANN-to-SNN conversion and constrain-then-train approaches.

As shown in Table I, with an inference time of 18 time steps (the input image is encoded within a time window of 10 time steps), the AlexNet trained with the proposed learning rule achieves top-1 and top-5 error rates of 49.78% and 26.40%, respectively. This result is comparable to that of the constrain-then-train approach with the same AlexNet architecture. Notably, the proposed learning rule only takes 18 inference time steps, which is at least an order of magnitude faster than the other reported approaches.

Fig. 9: Analysis of mismatch errors between the output spike counts of the ANN and SNN layers. (A, B) The cosine angle between the vectorized ANN outputs a^l and the SNN spike counts c^l for all convolution layers at Epoch 30 (A) and Epoch 200 (B). While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. (C, D) The Pearson Correlation Coefficient between the weight-activation dot products of the ANN and SNN branches at Epoch 30 (C) and Epoch 200 (D). The Pearson Correlation Coefficients remain consistently above 0.9 throughout learning, suggesting that the linear relationship of the weight-activation dot products is approximately preserved.

While the ANN-to-SNN conversion approaches achieve better classification accuracies on ImageNet-2012, their success can largely be credited to the more advanced network models used. Furthermore, we note an error rate increase of around 7% from the baseline ANN implementation with full-precision activation (revised from the original AlexNet model [1] by replacing the pooling layers with convolution operations of stride 2 to match the AlexNet used in this work, and by adding batch normalization layers). To investigate the effect of the discrete neural representation, that is, how much of the accuracy drop is due to quantization and how much is due to the dynamics of the IF neuron, we modify the full-precision ANN by quantizing its activation function using the bounded ReLU neuron defined in Eq. 11. In a single trial, the resulting quantized ANN achieves top-1 and top-5 error rates of 50.73% and 26.08%, respectively. This result is very close to that of our SNN implementation, which suggests that the quantization of the activation function alone may account for most of the accuracy drop.

III-E Activation Direction Preservation and Weight-Activation Dot Product Proportionality within the Interlaced Layers

After showing how effectively the proposed tandem learning rule performs on CIFAR-10 and ImageNet-2012, we further investigate why learning can be performed effectively via the interlaced network layers. To answer this question, we borrow ideas from recent theoretical work on binary neural networks [53], wherein learning is also performed across interlaced network layers (binarized activations are forward-propagated to subsequent layers). In the proposed tandem network, as shown in Fig. 10, the ANN activation value a^l at layer l is replaced with the spike count c^l derived from the coupled SNN layer. Due to the dynamic nature of spike generation, it is not easy to find an analytical transformation between a^l and c^l. To circumvent this problem, we analyze the degree of mismatch between these two quantities and its effect on the activation forward-propagation and the error back-propagation.

Fig. 10: Illustration of the interlaced network layers of Fig. 2. The pseudo spike count output from an ANN layer is replaced with the exact spike count from the coupled SNN layer before being passed through the activation function of the subsequent ANN layer.

In our numerical experiments on CIFAR-10 with a randomly drawn mini-batch of 256 test samples, we calculate the cosine angle between the vectorized a^l and c^l for all the convolution layers. As shown in Fig. 9, their cosine angles are below 24 degrees on average, and this relationship is maintained consistently throughout learning. While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. According to hyperdimensional computing theory [54] and the theoretical study of binary neural networks [53], any two high-dimensional random vectors are approximately orthogonal. It is also worth noting that the distortion from replacing a^l with c^l is less severe than that from binarizing a random high-dimensional vector, which changes the cosine angle by 37 degrees in theory. Given that the activation function and the error gradients back-propagated from the subsequent ANN layer remain the same, the distortions to the error back-propagation are bounded locally by the discrepancy between a^l and c^l.

Furthermore, we calculate the Pearson Correlation Coefficient (PCC) between the weight-activation dot products of the ANN and SNN branches, an important intermediate quantity (the input to the batch normalization layer) in our current network configurations. The PCC, ranging from -1 to 1, measures the linear correlation between two variables; a value of 1 implies a perfect positive linear relationship. As shown in Fig. 9, the PCC remains consistently above 0.9 throughout learning for most of the samples, suggesting that the linear relationship of the weight-activation dot products is approximately preserved.
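Both diagnostics are straightforward to compute from cached layer outputs. A minimal sketch for a single layer follows (with randomly generated surrogate activations standing in for the real ANN and SNN outputs):

import numpy as np

def cosine_angle_deg(a, c):
    """Angle (in degrees) between the vectorized ANN and SNN spike counts."""
    a, c = a.ravel(), c.ravel()
    cos = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def dot_product_pcc(W, a_prev, c_prev):
    """Pearson correlation between the weight-activation dot products W a and W c."""
    ann_dot = W @ a_prev.ravel()
    snn_dot = W @ c_prev.ravel()
    return np.corrcoef(ann_dot, snn_dot)[0, 1]

# Surrogate example for one layer: SNN counts deviate from ANN counts by at most one spike.
rng = np.random.default_rng(0)
a_prev = rng.integers(0, 9, size=512).astype(float)                              # pseudo spike counts (ANN)
c_prev = np.clip(a_prev + rng.integers(-1, 2, size=512), 0, 8).astype(float)     # exact spike counts (SNN)
W = rng.normal(0, 0.1, (256, 512))
print(cosine_angle_deg(a_prev, c_prev), dot_product_pcc(W, a_prev, c_prev))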

III-F Rapid Inference with Reduced Synaptic Operations

As shown in Fig. 8, the proposed learning rule can handle and exploit different encoding window sizes on CIFAR-10. In the most challenging case, with the smallest encoding window, we are able to achieve a satisfactory error rate below 12%. This may be partially credited to the encoding strategy we employ, whereby important input information can be encoded at the first time step before being passed into the SNN layers. In addition, the batch normalization layer added after each convolution and fully-connected layer ensures effective information transmission to the top layers. The error rate is reduced further by increasing T, while the improvement vanishes beyond a certain window size. Hence, the SNN trained with the proposed learning rule can perform inference rapidly, with at least an order of magnitude time saving compared with other learning rules, as shown in Table I. While binary neural networks also support rapid inference, they propagate information in a synchronized fashion and differ fundamentally from the asynchronous information processing studied in other SNN works.

Model | Inference Time Steps | CIFAR-10 | ImageNet-2012
VGGNet-9 [35] | 100 | 3.61 | -
ResNet-11 [35] | 100 | 5.06 | -
VGGNet-16 [22] | 500 | - | 1.975
ResNet-34 [22] | 2,000 | - | 2.40
AlexNet (this work) | 13 | 0.27 | 0.50
AlexNet (this work) | 18 | 0.40 | 0.68

TABLE II: Comparison of the ratio of SNN AC operations to ANN MAC operations on the CIFAR-10 and ImageNet-2012 datasets.

Fig. 11: Average spike count per neuron of the AlexNet on ImageNet-2012. Sparse neuronal activities can be observed in the early network layers, which have larger activation maps, leading to low power consumption when implemented on neuromorphic hardware. Here, T refers to the total number of inference time steps.

To study the energy efficiency of the proposed learning rule, we follow the evaluation metrics used in [22, 35]. As defined in Section III-B, we calculate the ratio of SNN SynOps to ANN SynOps on the CIFAR-10 and ImageNet-2012 datasets and compare it with other state-of-the-art learning rules. Given the short inference time required and the sparse synaptic activities summarized in Fig. 11, the AlexNet trained with the proposed learning rule (shown in Table II, with 18 inference time steps) achieves a ratio of only 0.40 on CIFAR-10 and 0.68 on ImageNet-2012. It is worth noting that the SNN is more energy-efficient than its ANN counterpart whenever this ratio is below 1. The saving is even more significant if we consider that each SNN synaptic operation only requires an accumulate (AC) operation, whereas each ANN synaptic operation requires a more costly multiply-and-accumulate (MAC) operation; this translates into roughly an order of magnitude saving in chip area and energy per synaptic operation [21, 22]. In contrast, the existing SNN implementations [35, 22] report ratios of at least 3.61 on CIFAR-10 and 1.975 on ImageNet-2012, which are at least 9 and 3 times more costly than the proposed tandem learning rule, respectively.

IV Conclusion

In this work, we introduce a novel tandem neural network and its learning rule to effectively train SNNs for efficient and rapid inference on pattern classification tasks. Within the tandem network, a SNN is employed to determine the spike counts, as a discrete neural representation, and the spike trains for the activation forward-propagation, while an ANN, sharing weights with the coupled SNN, is used to approximate the gradients of the coupled SNN. Given that error back-propagation is performed on the rate-based ANN, the proposed learning rule is both more memory-efficient and more computationally efficient than the error back-propagation through time algorithm used in many spike-based learning rules [32, 33, 34].

To understand why learning can be effectively performed within the tandem learning framework, we study the learning dynamics of the tandem network and compare it with an intact ANN. The empirical study on CIFAR-10 reveals that the cosine angles between the vectorized ANN outputs and the coupled SNN output spike counts are exceedingly small in a high-dimensional space, and this relationship is maintained throughout training. Furthermore, strongly positive Pearson Correlation Coefficients are observed between the weight-activation dot products of the two branches, an important intermediate quantity in the activation forward-propagation, suggesting that the linear relationship of the weight-activation dot products is well preserved.

The SNNs trained with the proposed tandem learning rule demonstrate competitive classification accuracies on the CIFAR-10 and ImageNet-2012 datasets. By encoding the sensory stimuli within the available encoding time window through a learnable transformation layer, and by adding batch normalization layers to ensure effective information flow, rapid inference, with at least an order of magnitude time saving compared with state-of-the-art ANN-to-SNN conversion and constrain-then-train approaches [22], is demonstrated on the large-scale ImageNet-2012 image classification task. Furthermore, the total synaptic operations are also significantly reduced compared with the baseline ANNs and other SNN implementations.

By integrating the algorithmic power of the proposed tandem learning rule with the unprecedented energy efficiency of emerging neuromorphic computing architectures, we expect to enable low-power on-chip computing on pervasive mobile and embedded devices. For future work, we will explore strategies to close the accuracy gap between the baseline ANN and the SNN implementation, as well as evaluate more advanced network architectures.

References