Deep learning has improved pattern classification performance by leaps and bounds in computer vision [1, 2], speech processing [3, 4], language understanding, and robotics. However, deep neural networks are computationally intensive and memory inefficient, thereby limiting their deployment in mobile and wearable devices that have limited computational budgets. This prompts us to look into energy-efficient solutions.
The human brain, shaped by millions of years of evolution, is incredibly efficient at performing complex perceptual and cognitive tasks. Although hierarchically organized deep neural network models are brain-inspired, they differ significantly from the biological brain in many ways. Fundamentally, information in the brain is represented and communicated through asynchronous action potentials, or spikes. To process the information carried by these spike trains efficiently and rapidly, biological neural systems adopt an event-driven computation strategy, whereby energy is mostly consumed only when spike generation and communication take place.
Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic such asynchronous event-driven information processing with spiking neural networks (SNNs) in silicon. Novel neuromorphic computing architectures, for instance TrueNorth and Loihi, leverage low-power, densely-connected parallel computing units to support spike-based computation. Furthermore, the co-located memory and computation can effectively mitigate the problem of low bandwidth between the CPU and memory (i.e., the von Neumann bottleneck). When implemented on these neuromorphic architectures, deep SNNs benefit from the best of both worlds: superior classification accuracy and compelling energy efficiency. Such promising prospects motivate the study in this paper.
While neuromorphic computing architectures offer attractive energy savings, how to train large-scale SNNs that can operate efficiently and effectively on these NC architectures remains a challenging research problem. Biologically plausible Hebbian learning rules and spike-timing-dependent plasticity (STDP) [14, 15] are intriguing local learning rules for computational neuroscience studies and are also attractive for hardware implementation with emerging non-volatile memory devices [16, 17, 18]. However, they are not straightforward to use for large-scale machine learning tasks due to ineffective task-specific credit assignment.
Due to the asynchronous and discontinuous nature of synaptic operations within an SNN, the error back-propagation algorithm that is commonly used for ANN training is not directly applicable to SNNs. Recent research [19, 20, 21, 22, 23] has suggested that it is viable to convert pre-trained ANNs to SNNs with little adverse impact on classification accuracy. This indirect training approach assumes that the graded activation of analog neurons is equivalent to the average firing rate of spiking neurons, and simply requires parsing and normalizing the weights after training the ANN.
Rueckauer et al. provide a theoretical analysis of the performance deviation of such an approach as well as a systematic study of Convolutional Neural Network (CNN) models for a large-scale image classification task. This conversion approach achieves the best-reported results for SNNs on many benchmark datasets, including the challenging ImageNet-2012 dataset. However, it trades inference speed against classification accuracy, requiring at least several hundred inference time steps to reach optimal classification accuracy [21, 22].
Additional research efforts have been devoted to training constrained ANNs that approximate the properties of specific spiking neurons [25, 12, 26, 27, 28], which can seamlessly transfer to the target hardware platform and perform better than the aforementioned generic conversion approach. Grounded in a rate-based spiking neuron model, this constrain-then-train approach transforms the steady-state firing rate of a spiking neuron into a continuous and hence differentiable form that can be optimized with the conventional error back-propagation algorithm. While competitive classification accuracies are achieved with both the generic ANN-to-SNN conversion and constrain-then-train approaches, the underlying assumption of a rate-based spiking neuron model requires a long inference time window or a high firing rate to reach a steady neuronal firing state [20, 26]. This steady-state requirement limits the computational benefits that can be acquired from NC architectures.
To improve the overall energy efficiency as well as inference speed, an ideal SNN learning rule should support a short encoding time window with sparse synaptic activities. To exploit this desirable property, temporal coding has been investigated whereby the spike timing of the first spike was employed as a differentiable proxy to enable the error back-propagation algorithm [29, 30]. Although competitive classification accuracies were reported on the MNIST dataset with such a temporal learning rule, maintaining the stability of neuronal firing and scaling it up to the size of state-of-the-art deep ANNs remain elusive. In view of the steady-state requirement of rate-based SNNs and scalability issue of temporal-based SNNs, we are interested in developing a new learning rule that can effectively and efficiently train deep SNNs to operate under short encoding time window with sparse synaptic activities.
The spiking neuronal function is designed to describe temporal dynamics, such as the leak and reset mechanisms of the membrane potential and the refractory period, which are very different from a continuous and differentiable ANN neuronal function. Furthermore, the size of the encoding time window also plays a role in capturing the sparse synaptic activities. It is not straightforward to approximate the exact behavior of an SNN with an ANN, especially when there are multiple hidden layers in the network.
To demonstrate that such an approximation error occurs during information forward-propagation, namely the neural representation error, we prepared a hand-crafted example as shown in Fig. 1. Although the free aggregate membrane potential, at the end of the simulation time window, of an integrate-and-fire (IF) neuron stays below the firing threshold (a useful intermediate quantity that can be applied to approximate the output spike count, as will be explained in Section II-C), an output spike is generated due to the early arrival of spikes from the positive synapses. Even worse, such a neural representation error (spike count discrepancy) accumulates across layers and significantly affects the classification accuracy of the SNN when transferring the trained weights from the ANN. Therefore, to effectively train a deep SNN under a short encoding time window with sparse synaptic activities, it is necessary to derive an exact neural representation with the SNN in the training loop.
One way to overcome the approximation error is to formulate SNNs as recurrent neural networks and apply the error Back-propagation Through Time (BPTT) algorithm to train deep SNNs with pseudo derivatives [32, 33, 34, 35]. While competitive accuracies were reported on the MNIST and CIFAR-10 datasets, it is both memory- and computationally-inefficient to train deep SNNs using BPTT. Furthermore, the vanishing gradient problem that is well-known for RNNs may affect learning when the firing rate is low. Readers may refer to the recent overviews on deep learning with spiking neural networks [38, 39] for more details.
In this paper, to effectively and efficiently train deep SNNs to classify inputs that are encoded in spikes within a short time window, we propose a novel learning rule with a tandem neural network. As illustrated in Fig. 2, the tandem network architecture consists of an SNN and an ANN that are coupled layer-wise through weight sharing. The ANN is an auxiliary structure that facilitates error back-propagation for the training of the SNN, while the SNN is used to derive the exact spiking neural representation.
The rest of this paper is organized as follows: in Section II, we present the details of the tandem learning framework. In Section III, we evaluate the proposed tandem network and learning rule on the CIFAR-10 and ImageNet-2012 datasets, comparing classification accuracy, inference speed, and energy efficiency with those of other SNN implementations. Furthermore, we investigate why the proposed tandem learning rule learns effectively by comparing the high-dimensional geometry of activation values and weight-activation dot products between the coupled ANN and SNN network layers. Finally, we conclude the paper in Section IV.
II Learning Through a Tandem Network
In this section, we first introduce the neuron model and the neural coding scheme that are used in this work. We then present a discrete neural representation scheme using spike count as the information carrier across network layers, and design an ANN neuronal activation function to effectively approximate the spike count of the coupled SNN for error back-propagation. Finally, we introduce the tandem network and its learning rule, called the tandem learning rule, for deep SNN training.
II-A Neuron Model
In this work, we use the integrate-and-fire (IF) neuron model with a reset-by-subtraction scheme in the SNN layers. This simplified spiking neuron model drops the membrane potential leak and refractory period terms present in other, more realistic spiking neuron models, for instance, the spike response model and the leaky integrate-and-fire model. In this way, it retains the efficacy of the input spikes that it receives across time (until reset). While the IF neuron does not emulate the rich temporal dynamics of biological neurons, it is ideal for working with sensory inputs where spike timing does not play a significant role, and for hardware implementation.
At time step $t$, under a discrete-time setting with an encoding window of size $T$, the input spikes to neuron $i$ at layer $l$ are transduced into an input current as follows:

$$z_i^l[t] = \sum_j w_{ij}^{l-1} s_j^{l-1}[t] + b_i^l \qquad (1)$$

where $s_j^{l-1}[t]$ indicates the occurrence of an input spike from afferent neuron $j$ at time step $t$, and $w_{ij}^{l-1}$ denotes the strength of the synaptic connection from afferent neuron $j$ of layer $l-1$. Here, $b_i^l$ can be interpreted as a constant input current to the IF neuron. Mathematically, this term is related to the bias term of the corresponding ReLU neuron in the coupled ANN layer. It is important to distribute the effect of $b_i^l$ evenly throughout the encoding time window, thereby effectively preventing IF neurons from over-firing at early time steps.
The neuron then integrates the input current into its membrane potential $V_i^l[t]$ as per Eq. 2 (without loss of generality, a unitary membrane resistance is assumed here); $V_i^l$ is reset and initialized to zero for each input sample. An output spike is generated whenever $V_i^l[t]$ crosses the firing threshold $\vartheta$ (Eq. 3):

$$V_i^l[t] = V_i^l[t-1] + z_i^l[t] - \vartheta\, s_i^l[t-1] \qquad (2)$$

$$s_i^l[t] = \Theta\big(V_i^l[t] - \vartheta\big) \quad \text{with} \quad \Theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
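As a concrete illustration, the per-time-step dynamics of Eqs. 1-3 can be sketched in a few lines of NumPy; the function name and the toy spike pattern below are ours, for illustration only:

```python
import numpy as np

def simulate_if_neuron(input_spikes, weights, bias, theta=1.0):
    """IF neuron with reset-by-subtraction (cf. Eqs. 1-3).

    input_spikes: (T, n_in) binary array of afferent spikes s_j[t]
    weights:      (n_in,) synaptic weights w_ij
    bias:         constant input current b_i
    """
    T = input_spikes.shape[0]
    V = 0.0
    out_spikes = np.zeros(T, dtype=int)
    for t in range(T):
        z = weights @ input_spikes[t] + bias   # Eq. 1: input current
        V += z                                 # Eq. 2: integrate
        if V >= theta:                         # Eq. 3: threshold crossing
            out_spikes[t] = 1
            V -= theta                         # reset by subtraction
    return out_spikes, V

# Toy example: two afferents over a window of T = 4 time steps
s_in = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
spikes, V_rem = simulate_if_neuron(s_in, np.array([0.6, 0.5]), bias=0.0)
# spikes -> [0, 1, 1, 0]; the residual potential V_rem stays below theta
```

Note that, because the reset only subtracts the threshold, any sub-threshold surplus is carried forward rather than discarded.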
According to Eq. 1, the free aggregate membrane potential (no spiking) of neuron $i$ in layer $l$ at the end of the encoding time window $T$ can be expressed as

$$V_i^{l,f} = \sum_j w_{ij}^{l-1} c_j^{l-1} + b_i^l\, T \qquad (4)$$

where $c_j^{l-1}$ is the input spike count from pre-synaptic neuron $j$ at layer $l-1$ as per Eq. 5:

$$c_j^{l-1} = \sum_{t=1}^{T} s_j^{l-1}[t] \qquad (5)$$
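A quick numerical check (with made-up random spike trains) confirms the point of the free aggregate membrane potential: with the reset disabled, the accumulated potential depends only on the input spike counts, not on the spike timing:

```python
import numpy as np

# Check (made-up random spike trains): the free aggregate membrane
# potential accumulated step by step equals w . c + b*T, where c holds
# the input spike counts (Eq. 5) -- timing is irrelevant once reset is off.
T = 8
rng = np.random.default_rng(0)
s_in = (rng.random((T, 5)) < 0.3).astype(float)  # (T, n_in) input spike trains
w = rng.normal(size=5)
b = 0.05

V_free = sum(w @ s_in[t] + b for t in range(T))  # step-by-step accumulation
c_in = s_in.sum(axis=0)                          # input spike counts (Eq. 5)
assert np.isclose(V_free, w @ c_in + b * T)      # free aggregate potential
```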
For the ANN layers, we use bounded ReLU neurons that linearly integrate inputs and deliver only positive, integer-valued ‘spike counts’ to the subsequent layer. As explained in the ANN-to-SNN conversion work, the firing rate of IF neurons correlates linearly with the activation value of ReLU neurons. In this work, we extend this property further and approximate the spike count of IF neurons with bounded ReLU neurons, as will be presented in Section II-C.
II-B Encoding and Decoding Schemes
Just as the cochlea converts received sound waves into nerve impulses, and the auditory cortex then perceives the sound encoded in the incoming nerve impulses, an SNN front-end is required to encode the sensory inputs into spike trains, and a neural network back-end will then decode the output spike trains into the desired pattern classes. Two encoding schemes are commonly used: rate code and temporal code. Rate code [20, 21] converts real-valued inputs into spike trains at each sampling time step following a Poisson or Bernoulli distribution. However, it suffers from sampling errors, thereby requiring a long encoding time window to compensate for them; hence, it is not well suited for encoding information into a short time window. On the other hand, temporal coding uses the timing of a single spike to encode information, and therefore enjoys superior coding efficiency and computational advantages. However, it is complex to decode and sensitive to noise.
In this work, we take a simpler approach and directly input the images or feature vectors into a neural encoding layer. The neural encoding layer performs a weighted transformation with bounded ReLU neurons, as shown in Eq. 6:

$$a_i^0 = f\Big(\sum_j w_{ij}^0 x_j + b_i^0\Big) \qquad (6)$$

where $w_{ij}^0$ denotes the synaptic weight that connects input value $x_j$ to encoding neuron $i$, and $b_i^0$ is the bias term of the encoding neuron. We use $f(\cdot)$ to denote the activation function of the bounded ReLU neuron that is defined in Eq. 11. Here, the activation value $a_i^0$ is analogous to the free aggregate membrane potential at the end of the encoding time window $T$. The subsequent spike train is generated by distributing this free aggregate membrane potential over the $T$ consecutive time steps of the encoding window, by driving an IF neuron with the constant input current $a_i^0/T$:

$$V_i^0[t] = V_i^0[t-1] + a_i^0/T - \vartheta\, s_i^0[t-1], \qquad s_i^0[t] = \Theta\big(V_i^0[t] - \vartheta\big) \qquad (7)$$

Altogether, the spike train and spike count output from the neural encoding layer can be represented as

$$\mathbf{s}_i^0 = \{s_i^0[1], \ldots, s_i^0[T]\}, \qquad c_i^0 = \sum_{t=1}^{T} s_i^0[t] \qquad (8)$$
This neural encoding layer converts the input into spike trains, whereby the output spike count can be adjusted in a learnable fashion to match the size of the encoding window $T$. Such an encoding scheme is beneficial for rapid inference since the input information can be effectively encoded within a short time window. From this neural encoding layer onward, spike trains and spike counts are used as the inputs to the SNN and ANN layers, respectively.
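A hypothetical NumPy sketch of this encoding layer, under the assumption that the aggregate potential is delivered as a constant per-step current $a/T$ to an IF neuron (the function name and toy inputs are ours):

```python
import numpy as np

def encode(x, w, b, T, theta=1.0):
    """Sketch of the neural encoding layer: a bounded ReLU yields the free
    aggregate membrane potential (Eq. 6), which is then distributed over T
    time steps by an IF neuron driven by the constant current a/T."""
    a = np.clip(w @ x + b, 0.0, theta * T)  # bounded ReLU activation
    spikes = np.zeros(T, dtype=int)
    V = 0.0
    for t in range(T):
        V += a / T                          # constant per-step current
        if V >= theta:
            spikes[t] = 1
            V -= theta                      # reset by subtraction
    return spikes

spikes = encode(np.array([0.9, 0.4]), np.array([2.0, 1.5]), b=0.0, T=8)
# here a = 2.4, so 2 spikes are emitted within the window of 8 steps
```

The clipping bound $\vartheta T$ guarantees the spike count never exceeds the window size.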
For decoding, it is feasible to decode from the SNN output layer using either the discrete spike counts or the continuous free aggregate membrane potentials. In our preliminary study, as shown in Fig. 8, we observed that the free aggregate membrane potential provides a much smoother learning curve due to the continuous error gradients derived at the output layer.
II-C Spike Count as a Discrete Neural Representation
Deep neural networks learn to describe the input data with compact feature representations. A typical feature representation is in the form of a continuous or discrete-valued vector. While most studies have focused on continuous feature representations, discrete representations have their unique advantages in solving some real-world problems [43, 44, 45, 46, 47]. For example, they are potentially a more natural fit for representing natural language, which is inherently discrete, and also native for logical reasoning and predictive learning. Moreover, the idea of discretized neural representation has also been exploited in binary neural networks for network quantization, wherein binarized activations (-1, +1) are used for feature representation.
In this work, we consider the spike count as a discrete feature representation in deep SNNs, as shown in Fig. 3. To formulate a discrete neural representation in the coupled ANN layer, if we ignore the temporal dynamics (membrane potential reset after spiking) of the IF neurons, we may establish a one-to-one correspondence between the free aggregate membrane potential $V_i^{l,f}$ of the spiking neuron and the discrete pseudo output spike count $a_i^l$ of the ANN neuron:

$$a_i^l = f\big(V_i^{l,f}\big) = \min\Big(\big\lfloor \max\big(V_i^{l,f},\, 0\big)/\vartheta \big\rfloor,\; T\Big) \qquad (11)$$

where $a_i^l$ is lower bounded at zero. Without loss of generality, we set $\vartheta$ to 1 in this work. As shown in Fig. 4, different from the commonly used continuous neuron activation functions in ANNs, $a_i^l$ takes only non-negative integer values. The surplus free membrane potential that is insufficient to induce an additional spike is rounded off, resulting in a quantization error as expressed in Eq. 12:

$$\varepsilon_i^l = \max\big(V_i^{l,f},\, 0\big) - \vartheta\, a_i^l \qquad (12)$$
In practice, however, we did not observe any obvious interference with learning or inference due to this quantization error. Moreover, $a_i^l$ is upper bounded by the encoding time window size $T$. As shown in Fig. 3, the proposed ANN activation function can effectively approximate the exact spike count of the coupled SNN layer. As the ANN and SNN are coupled layer-by-layer, the ANN approximates the SNN layer-by-layer, which makes it possible to train the deep SNN in a similar way to a deep ANN.
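The discrete activation just described (Eq. 11, with $\vartheta = 1$) amounts to a floor operation clipped to $[0, T]$; a minimal sketch with made-up input values:

```python
import numpy as np

def pseudo_spike_count(V_free, T, theta=1.0):
    """Discrete ANN activation (cf. Eq. 11): floor the free aggregate
    membrane potential to an integer count, lower bounded at 0 and upper
    bounded by the encoding window size T."""
    return np.minimum(np.floor(np.maximum(V_free, 0.0) / theta), T)

V = np.array([-0.7, 0.4, 2.6, 11.0])
a = pseudo_spike_count(V, T=8)
# a -> [0., 0., 2., 8.]; the surplus 0.6 of the entry 2.6 is rounded off
```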
Notably, as described in Fig. 1, the pseudo spike count derived in the ANN layer (Eq. 11) may deviate from the actual spike count of the SNN layer, especially within a short time window, which may adversely affect the quality of the gradients derived during error back-propagation. We refer to this error as the gradient approximation error in the following sections. Our experimental results in Section III-E, however, suggest that the cosine angle between these two outputs is exceedingly small in a high-dimensional space, and that this relationship is maintained throughout learning. In addition, the weight-activation dot products, a critical intermediate quantity, are approximately preserved despite the spike count discrepancy. Therefore, the learning dynamics in the ANN layer can effectively approximate those of the coupled SNN layer with spike count as the discrete neural representation.
II-D Credit Assignment in the Tandem Network
Although the neural representation error is not significant at each layer alone, as demonstrated in Fig. 3, it may cause severe impairments to the classification accuracy if the inaccurate neural representation is propagated to subsequent layers. To solve this problem, we propose a tandem learning framework. As shown in Figs. 1 and 5, an ANN with the activation function defined in Eq. 11 is employed to enable error back-propagation in a rate-based network, while the SNN, sharing weights with the coupled ANN, is employed to determine the exact neural representation (i.e., spike counts and spike trains). The spike counts and spike trains are transmitted to the subsequent ANN and SNN layers, respectively. By incorporating the dynamics of the IF neuron into the training phase and propagating its output to the subsequent layers, this tandem learning framework effectively prevents the neural representation error from accumulating across layers. While the coupled ANN is used for error back-propagation, the forward inference is executed entirely on the SNN. The pseudo code of the proposed tandem learning rule is given in Algorithm 1.
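A minimal NumPy sketch of this interlaced forward pass for simplified fully-connected layers (no biases; all names are ours, and the backward pass is omitted) may clarify the data flow: the ANN-side pseudo counts are kept for back-propagation, but only the SNN-side spike trains and counts move forward:

```python
import numpy as np

def tandem_forward(spike_trains, counts, weights_list, T, theta=1.0):
    """Forward pass through a tandem network (sketch).

    At every layer, the ANN side computes the pseudo spike count from the
    free aggregate membrane potential (used later for back-propagation),
    while the exact spike trains/counts produced by the SNN side are what
    gets propagated to the next layer.
    """
    ann_acts = []
    for w in weights_list:                       # w: (n_out, n_in)
        # ANN side: pseudo spike count from the exact input counts
        V_free = counts @ w.T
        ann_acts.append(np.minimum(np.floor(np.maximum(V_free, 0.0) / theta), T))
        # SNN side: exact IF dynamics with reset by subtraction
        V = np.zeros(w.shape[0])
        out_trains = np.zeros((T, w.shape[0]))
        for t in range(T):
            V += spike_trains[t] @ w.T           # integrate input current
            fired = V >= theta
            out_trains[t] = fired
            V -= theta * fired                   # reset by subtraction
        spike_trains, counts = out_trains, out_trains.sum(axis=0)
    return counts, ann_acts

# Toy two-layer run with random spike trains and weights
rng = np.random.default_rng(1)
T = 8
in_trains = (rng.random((T, 4)) < 0.4).astype(float)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
out_counts, ann_acts = tandem_forward(in_trains, in_trains.sum(axis=0), weights, T)
```

In a full implementation, gradients would flow through the stored `ann_acts` only, with the non-differentiable SNN outputs treated as constants at each layer.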
It is worth mentioning that, in the forward pass, the ANN layer takes the output of the previous SNN layer as its input. This aims at synchronizing the training of the SNN with the ANN via the interlaced layers, rather than optimizing the classification performance of the ANN. A similar idea of interlaced network layers has also been explored in binary neural networks, in which full-precision activation values are calculated at each layer, while binarized activation values are forward-propagated to the subsequent layer.
III Experimental Evaluation and Discussion
In this section, we first present the neural representation errors that may arise and accumulate across layers when taking the constrain-then-train approach (introduced in Section I) in the scenario of a short encoding window. We then evaluate the learning capability of the proposed tandem learning rule on two standard image classification benchmarks, and further discuss why effective learning can be performed within the tandem network. Finally, we discuss the rapid inference and synaptic operation reduction achieved with the proposed tandem learning rule.
III-A Datasets, Network Configurations and Implementation
To evaluate the learning capability, convergence property and energy efficiency of the proposed learning rule, we use two image classification benchmark datasets: CIFAR-10 and ImageNet-2012. CIFAR-10 consists of 60,000 color images of size 32×32 from 10 classes, with a standard split of 50,000 and 10,000 for training and testing, respectively. The large-scale ImageNet-2012 dataset consists of over 1.2 million images from 1,000 object categories. Notably, the success of AlexNet on this dataset represents a key milestone in deep learning research.
As shown in Fig. 6, we use a convolutional neural network (CNN) with 6 learnable layers for CIFAR-10 (namely CIFARNet) and AlexNet for ImageNet-2012. To reduce the dependency on weight initialization and to accelerate the training process, we add a batch normalization layer after each convolution and fully-connected layer. Given that a batch normalization layer only performs an affine transformation, we follow the approach introduced in earlier ANN-to-SNN conversion work and integrate its parameters into the preceding layer's weights before copying them into the coupled SNN layer. We replace the average pooling operations that are commonly used in the ANN-to-SNN conversion approach with convolution operations of stride 2, which perform dimensionality reduction in a learnable fashion.
We perform all experiments with the Tensorpack toolbox, a high-level neural network training interface based on TensorFlow. Tensorpack optimizes the whole training pipeline, providing accelerated and memory-efficient training on multi-GPU machines. We follow the same data pre-processing procedures (crop, flip, mean normalization, etc.), optimizer, and learning rate decay schedule adopted in the Tensorpack CIFAR-10 and ImageNet-2012 examples, and use those configurations consistently for all experiments. As shown in Fig. 5, we implement customized convolution and fully-connected layers in Tensorpack, which integrate the operations of the ANN layer and the coupled SNN layer under a unified interface.
III-B Counting Synaptic Operations
The computational cost of neuromorphic architectures is typically benchmarked using the total number of synaptic operations [9, 21, 22, 35]. For an SNN, as defined below, the total synaptic operations (SynOps) correlate with the neurons' firing rates, fan-outs (number of outgoing connections to the subsequent layer), and the encoding time window size $T$:

$$\mathrm{SynOps}_{SNN} = \sum_{t=1}^{T} \sum_{l=1}^{L} \sum_{i=1}^{N^l} f_{out,i}^{l}\, s_i^l[t]$$

where $L$ is the total number of layers and $N^l$ denotes the total number of neurons in layer $l$; $f_{out,i}^{l}$ is the fan-out of neuron $i$ in layer $l$, and $s_i^l[t]$ indicates whether a spike is generated by neuron $i$ of layer $l$ at time step $t$.

In contrast, the total synaptic operations required to classify one image with the ANN is given by

$$\mathrm{SynOps}_{ANN} = \sum_{l=1}^{L} f_{in}^{l}\, N^{l}$$

where $f_{in}^{l}$ denotes the number of incoming connections to each neuron in layer $l$. In our experiment, we calculate the average synaptic operations over a randomly chosen mini-batch (256 images) from the test set.
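The two SynOps counts above can be sketched numerically with toy numbers (layer sizes, fan-in/out, and firing rates all made up for illustration):

```python
import numpy as np

T = 10
fan_out = [20, 10]         # outgoing connections per neuron, per layer
fan_in = [30, 20]          # incoming connections per neuron, per layer
n_neurons = [100, 50]

rng = np.random.default_rng(0)
snn_synops = 0
for fo, n in zip(fan_out, n_neurons):
    spikes = rng.random((T, n)) < 0.1          # s_i^l[t] at ~10% firing rate
    snn_synops += fo * spikes.sum()            # every spike costs fan-out ACs

ann_synops = sum(fi * n for fi, n in zip(fan_in, n_neurons))  # fan-in MACs
ratio = snn_synops / ann_synops                # < 1: SNN cheaper than ANN
```

With sparse firing, the SNN cost scales with the number of spikes actually emitted, whereas the ANN cost is fixed by the architecture.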
Table I: Comparison of classification error rates and inference speed on CIFAR-10 (top) and ImageNet-2012, top-1 (top-5) (bottom).

| Model | Network Architecture | Method | Error Rate (%) | Inference Time Steps |
|---|---|---|---|---|
| Panda and Roy (2016) | – | Layer-wise Spike-based Learning | 24.58 | – |
| Esser et al. (2016) | 15-layer CNN | Binary Neural Network | 10.68 | 16 |
| Rueckauer et al. (2017) | 8-layer CNN | ANN-to-SNN conversion | 9.15 | – |
| Wu et al. (2018) | 8-layer CNN | Error Backpropagation Through Time | 9.47 | – |
| Wu et al. (2018) | AlexNet | Error Backpropagation Through Time | 14.76 | – |
| Sengupta et al. (2019) | VGG-16 | ANN-to-SNN conversion | 8.54 | 2,500 |
| Lee et al. (2019) | ResNet-11 | ANN-to-SNN conversion | 9.85 | 3,000 |
| Lee et al. (2019) | ResNet-11 | Spike-based Learning | 9.05 | 100 |
| This work (SNN with Spike Count) | CIFARNet | Error Backpropagation through Tandem Network | 8.46 | 16 |
| This work (SNN with Agg. Mem. Potential) | CIFARNet | Error Backpropagation through Tandem Network | 9.93 | 16 |
| Hunsberger and Eliasmith (2016) | AlexNet | Constrain-then-Train | 48.20 (23.80) | 200 |
| Rueckauer et al. (2017) | VGG-16 | ANN-to-SNN conversion | 50.39 (18.37) | 400 |
| Sengupta et al. (2019) | VGG-16 | ANN-to-SNN conversion | 30.04 (10.99) | 2,500 |
| This work (ANN with full-precision activation) | AlexNet | Error Backpropagation | 42.45 (19.56) | – |
| This work (ANN with quantized activation) | AlexNet | Error Backpropagation | 50.73 (26.08) | – |
| This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 53.37 (29.20) | 13 |
| This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 49.78 (26.40) | 18 |
III-C Accumulated Neural Representation Error
As discussed in Section I, one can train a constrained ANN that approximates the properties of spiking neurons (e.g., firing rate or spike count) using the conventional error back-propagation algorithm, and subsequently transfer the trained weights to the SNN as described in Fig. 7A, namely the constrain-then-train approach. Taking Eq. 11 as the neuron activation function, we obtained competitive classification accuracy on the MNIST dataset with this approach. However, when applying it to the more complex CIFAR-10 dataset with a short encoding time window, we noticed a large accuracy drop (approximately 11%) when transferring the trained ANN weights to the SNN. After carefully comparing the ANN output ‘spike counts’ with the actual SNN spike counts, we observed a growing spike count discrepancy between the ANN and SNN layers, as shown in Fig. 7C.
This is due to the fact that the neuronal activation function of the ANN ignores the temporal dynamics of the IF neuron. While such spike count discrepancies may be negligible for a shallow network used to classify the MNIST dataset, or with very high input firing rates, they have huge impacts in the face of sparse synaptic activities and a short encoding time window. By incorporating the dynamics of IF neurons during the training of the tandem network, the exact output spike counts, instead of the ANN-predicted spike counts, are propagated forward to the subsequent ANN layer. The proposed tandem learning framework can thus effectively prevent this representation error from accumulating across layers.
III-D Image Classification Results
For CIFAR-10, as shown in Table I, the CIFARNet trained with the proposed learning rule achieves competitive test error rates of 8.46% (spike count decoding) and 9.93% (aggregate membrane potential decoding), respectively. CIFARNet with spike count decoding achieves, by far, the best-reported result on CIFAR-10 with an SNN. As shown in Fig. 8, we however note that its learning dynamics are unstable, which may be attributed to the discrete error gradients derived at the final output layer. Therefore, we use aggregate membrane potential decoding for the rest of the experiments on ImageNet-2012, as well as for a further study of the effect of the encoding time window size on CIFAR-10. Although the learning converges more slowly than for the plain CNN (with ReLU activation function) and the bounded CNN (with the bounded ReLU activation function defined in Eq. 11), the error rate of the SNN eventually matches that of the bounded CNN. This also suggests that the representation error described in Section III-C can be effectively mitigated with the proposed tandem learning framework.
Training a model on ImageNet-2012 with a spike-based learning rule that uses BPTT for synaptic weight updates requires a huge amount of memory to store the intermediate states of the spiking neurons, as well as a huge computational cost. Hence, only a few SNN implementations, which do not take the dynamics of spiking neurons into consideration during training, have made successful attempts on this challenging task, including the ANN-to-SNN conversion [21, 22] and constrain-then-train approaches. The tandem learning rule benefits from the best of both worlds: the dynamics of IF neurons are considered during forward propagation, while only the rate-based ANN is used for error back-propagation. As a result, it reduces both the memory requirement and the computational cost relative to other spike-based learning rules. Meanwhile, it also reduces the inference time and the total synaptic operations compared to the ANN-to-SNN conversion and constrain-then-train approaches.
As shown in Table I, with an inference time of 18 time steps (the input image is encoded within a time window of 10 time steps), the AlexNet trained with the proposed learning rule achieves top-1 and top-5 error rates of 49.78% and 26.40%, respectively. This result is comparable to that of the constrain-then-train approach with the same AlexNet architecture. Notably, the proposed learning rule takes only 18 inference time steps, which is at least an order of magnitude faster than the other reported approaches.
While the ANN-to-SNN conversion approaches achieve better classification accuracies on ImageNet-2012, their success can largely be credited to the more advanced network models used. Furthermore, we note an error rate increase of around 7% from the baseline ANN implementation with full-precision activation (revised from the original AlexNet model by replacing pooling layers with convolution operations of stride 2 to match the AlexNet used in this work, and adding batch normalization layers). To investigate the effect of the discrete neural representation, i.e., how much of the accuracy drop is due to quantization and how much is due to the dynamics of the IF neuron, we modify the full-precision ANN by quantizing the activation function using the bounded ReLU neuron defined in Eq. 11. In a single trial, the resulting quantized ANN achieves top-1 and top-5 error rates of 50.73% and 26.08%, respectively. This result is very close to that of our SNN implementation, which suggests that the quantization of the activation function alone may account for most of the accuracy drop.
III-E Activation Direction Preservation and Weight-Activation Dot Product Proportionality within the Interlaced Layers
Having shown how effectively the proposed tandem learning rule performs on CIFAR-10 and ImageNet-2012, we further investigate why learning can be performed effectively via the interlaced network layers. To answer this question, we borrow ideas from recent theoretical work on binary neural networks, wherein learning is also performed across interlaced network layers (binarized activations are forward-propagated to subsequent layers). In the proposed tandem network, as shown in Fig. 10, the ANN activation value $a^l$ at layer $l$ is replaced with the spike count $c^l$ derived from the coupled SNN layer. Due to the dynamic nature of spike generation, it is not easy to find an analytical transformation function between $a^l$ and $c^l$. To circumvent this problem, we analyze the degree of mismatch between these two quantities and its effect on activation forward propagation and error back-propagation.
In our numerical experiments on CIFAR-10 with a randomly drawn mini-batch of 256 test samples, we calculate the cosine angle between the vectorized $a^l$ and $c^l$ for all the convolution layers. As shown in Fig. 9, their cosine angles are below 24 degrees on average, and this relationship is maintained consistently throughout learning. While these angles may seem large in low dimensions, they are exceedingly small in a high-dimensional space. According to the hyperdimensional computing theory and the theoretical study of binary neural networks, any two high-dimensional random vectors are approximately orthogonal. It is also worth noting that the distortion from replacing $a^l$ with $c^l$ is less severe than that from binarizing a random high-dimensional vector, which changes the cosine angle by 37 degrees in theory. Given that the error gradients back-propagated from the subsequent ANN layer remain the same, the distortions to the error back-propagation are bounded locally by the discrepancy between $a^l$ and $c^l$.
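The high-dimensional intuition can be reproduced with a toy NumPy check (stand-in vectors, not the paper's measurements): a floor-style distortion of a 4096-dimensional activation vector barely rotates it:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4096) * 8                      # stand-in for vectorized ANN output
c = np.floor(a) + (rng.random(4096) < 0.1)    # stand-in for SNN spike counts

cos = a @ c / (np.linalg.norm(a) * np.linalg.norm(c))
angle = np.degrees(np.arccos(cos))            # well below 24 degrees here
```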
Furthermore, we calculate the Pearson Correlation Coefficient (PCC) between the weight-activation dot products $w^l \cdot a^l$ and $w^l \cdot c^l$, an important intermediate quantity (the input to the batch normalization layer) in our current network configurations. The PCC, ranging from -1 to 1, measures the linear correlation between two variables, with a value of 1 implying a perfect positive linear relationship. As shown in Fig. 9, the PCC stays consistently above 0.9 throughout learning for most of the samples, suggesting that the linear relationship between the weight-activation dot products is approximately preserved.
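Similarly, the near-linear relationship between the two weight-activation dot products can be illustrated with stand-in data (again, not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((256, 1024)) * 8   # batch of stand-in ANN activation values
c = np.floor(a)                   # quantized stand-in for SNN spike counts
w = rng.normal(size=(1024, 64))   # shared weights

# Pearson correlation between the flattened dot products w.a and w.c
pcc = np.corrcoef((a @ w).ravel(), (c @ w).ravel())[0, 1]
```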
III-F Rapid Inference with Reduced Synaptic Operations
As shown in Fig. 8, the proposed learning rule can handle and exploit different encoding window sizes on CIFAR-10. Even in the most challenging case, with the smallest encoding window, we achieve a satisfactory error rate below 12%. This may be partially credited to the encoding strategy we employ, whereby important input information can be encoded at the first time step before passing into the SNN layers. In addition, the batch normalization layer added after each convolution and fully-connected layer ensures effective information transmission to the top layers. The error rate is reduced further by increasing the encoding window size, although the improvement diminishes beyond a certain point. Hence, the SNN trained with the proposed learning rule can perform inference rapidly, with at least an order-of-magnitude time saving compared with other learning rules, as shown in Table I. While binary neural networks also support rapid inference, they propagate information in a synchronized fashion and thus differ fundamentally from the asynchronous information processing studied in other SNN works.
| Model | Inference Time Steps | CIFAR-10 | ImageNet-2012 |
| --- | --- | --- | --- |
| AlexNet (this work) | 13 | 0.27 | 0.50 |
| AlexNet (this work) | 18 | 0.40 | 0.68 |
To study the energy efficiency of the proposed learning rule, we follow the evaluation metrics used in [22, 35]. As defined in Section III-B, we calculate the ratio of SNN SynOps to ANN SynOps on the CIFAR-10 and ImageNet-2012 datasets and compare it with other state-of-the-art learning rules. Given the short inference time required and the sparse synaptic activities summarized in Fig. 11, the AlexNet trained with the proposed learning rule (shown in Table II) achieves ratios of only 0.40 and 0.68 on the CIFAR-10 and ImageNet-2012 datasets, respectively. It is worth noting that, with a ratio below 1, the SNN is more energy-efficient than its ANN counterpart. The saving is even more significant if we consider that for SNNs only an accumulate (AC) operation is performed per synaptic operation, whereas for ANNs a more costly multiply-and-accumulate (MAC) operation is performed; this yields an order of magnitude saving in chip area as well as energy per synaptic operation [21, 22]. In contrast, the existing SNN implementations [35, 22] achieve ratios of at least 3.61 on CIFAR-10 and 1.975 on ImageNet-2012, which are at least 9 and 3 times more costly, respectively, than the proposed tandem learning rule.
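The SynOps ratio described above can be sketched as follows. The layer sizes, fan-outs, and spike counts are made-up numbers for illustration only, and the helper names are hypothetical, not the paper's exact accounting:

```python
def ann_synops(layer_neurons, fan_outs):
    """ANN synaptic operations: each neuron contributes one MAC per
    outgoing connection, exactly once per inference."""
    return sum(n * f for n, f in zip(layer_neurons, fan_outs))

def snn_synops(total_spikes, fan_outs):
    """SNN synaptic operations: each emitted spike triggers one AC per
    outgoing connection, accumulated over the whole encoding window."""
    return sum(s * f for s, f in zip(total_spikes, fan_outs))

# Hypothetical three-layer network (illustrative numbers):
neurons  = [4096, 2048, 1024]   # neurons per layer
fan_outs = [2048, 1024, 10]     # outgoing connections per neuron
spikes   = [1500, 900, 400]     # total spikes per layer over the window

ratio = snn_synops(spikes, fan_outs) / ann_synops(neurons, fan_outs)
print(ratio)  # below 1: the sparse SNN performs fewer ops than the ANN
```

The key property the sketch captures is that ANN SynOps are fixed by the architecture, while SNN SynOps scale with spiking activity, so sparse firing and a short encoding window directly translate into a lower ratio.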
In this work, we introduce a novel tandem neural network and its learning rule to effectively train SNNs for efficient and rapid inference on pattern classification tasks. Within the tandem neural network, an SNN is employed to determine the spike counts, as a discrete neural representation, and the spike trains for activation forward propagation; an ANN, sharing weights with the coupled SNN, is used to approximate the gradients of the coupled SNN. Given that error back-propagation is performed on the rate-based ANN, the proposed learning rule is both more memory- and computationally efficient than the back-propagation-through-time algorithm used in many spike-based learning rules [32, 33, 34].
To understand why learning can be effectively performed within the tandem learning framework, we study the learning dynamics of the tandem network and compare it with an intact ANN. The empirical study on CIFAR-10 reveals that the cosine angles between the vectorized ANN outputs and the coupled SNN output spike counts are exceedingly small in a high-dimensional space, and this relationship is maintained throughout training. Furthermore, strongly positive Pearson Correlation Coefficients are exhibited between the ANN and SNN weight-activation dot products, an important intermediate quantity in the activation forward propagation, suggesting that the linear relationship between the weight-activation dot products is well preserved.
The SNNs trained with the proposed tandem learning rule demonstrate competitive classification accuracies on the CIFAR-10 and ImageNet-2012 datasets. By encoding sensory stimuli within the available encoding time window through a learnable transformation layer, and by adding batch normalization layers to ensure effective information flow, rapid inference is demonstrated on the large-scale ImageNet-2012 image classification task, with at least an order-of-magnitude time saving compared to state-of-the-art ANN-to-SNN conversion and constrain-then-train approaches. Furthermore, the total synaptic operations are also significantly reduced compared with the baseline ANNs and other SNN implementations.
By integrating the algorithmic power of the proposed tandem learning rule with the unprecedented energy efficiency of emerging neuromorphic computing architectures, we expect to enable low-power on-chip computing on pervasive mobile and embedded devices. For future work, we will explore strategies to close the accuracy gap between the baseline ANN and the SNN implementation, as well as evaluate more advanced network architectures.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec 2017.
-  A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.,” SSW, vol. 125, 2016.
-  J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349, no. 6245, pp. 261–266, 2015.
-  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354, 2017.
-  S. B. Laughlin and T. J. Sejnowski, “Communication in neuronal networks,” Science, vol. 301, no. 5641, pp. 1870–1874, 2003.
-  C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, “A survey of neuromorphic computing and neural networks in hardware,” arXiv preprint arXiv:1705.06963, 2017.
-  P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
-  M. Davies, N. Srinivasa, T. H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
-  D. Monroe, “Neuromorphic computing gets ready for the (really) big time,” Communications of the ACM, vol. 57, no. 6, pp. 13–15, 2014.
-  S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences, vol. 113, no. 41, pp. 11441–11446, 2016.
-  D. O. Hebb, The organization of behavior: A neuropsychological theory, Psychology Press, 2005.
-  H. Markram, J. Lübke, M. Frotscher, and B. Sakmann, “Regulation of synaptic efficacy by coincidence of postsynaptic aps and epsps,” Science, vol. 275, no. 5297, pp. 213–215, 1997.
-  G. Q. Bi and M. M. Poo, “Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type,” Journal of neuroscience, vol. 18, no. 24, pp. 10464–10472, 1998.
-  G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, et al., “Neuromorphic computing using non-volatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
-  N. Zheng and P. Mazumder, “Online supervised learning for hardware-based multilayer spiking neural networks through the modulation of weight-dependent spike-timing-dependent plasticity,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4287–4302, 2017.
-  M. Mozafari, S. R. Kheradpisheh, T. Masquelier, A. Nowzari-Dalini, and M. Ganjtabesh, “First-spike-based visual categorization using reward-modulated stdp,” IEEE transactions on neural networks and learning systems, vol. 29, no. 12, pp. 6178–6190, 2018.
-  Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.
-  P. U. Diehl, D. Neil, J. Binas, M. Cook, S. C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–8.
-  B. Rueckauer, I. A. Lungu, Y. Hu, M. Pfeiffer, and S. C. Liu, “Conversion of continuous-valued deep networks to efficient event-driven networks for image classification,” Frontiers in Neuroscience, vol. 11, pp. 682, 2017.
-  A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, “Going deeper in spiking neural networks: Vgg and residual architectures,” Frontiers in neuroscience, vol. 13, 2019.
-  Y. Hu, H. Tang, Y. Wang, and G. Pan, “Spiking deep residual network,” arXiv preprint arXiv:1805.01352, 2018.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
-  S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, “Backpropagation for energy-efficient neuromorphic computing,” in Advances in Neural Information Processing Systems, 2015, pp. 1117–1125.
-  E. Hunsberger and C. Eliasmith, “Training spiking deep networks for neuromorphic hardware,” arXiv preprint arXiv:1611.05141, 2016.
-  D. Zambrano, R. Nusselder, H. S. Scholte, and S. Bohte, “Efficient computation in adaptive artificial spiking neural networks,” arXiv preprint arXiv:1710.04838, 2017.
-  J. Wu, Y. Chua, M. Zhang, Q. Yang, G. Li, and H. Li, “Deep spiking neural network with spike count based learning rule,” arXiv preprint arXiv:1902.05705, 2019.
-  H. Mostafa, “Supervised learning based on temporal coding in spiking neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227–3235, 2018.
-  C. Hong, X. Wei, J. Wang, B. Deng, H. Yu, and Y. Che, “Training spiking neural networks for cognitive tasks: A versatile framework compatible with various temporal codes,” IEEE transactions on neural networks and learning systems, 2019.
-  E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks,” arXiv preprint arXiv:1901.09948, 2019.
-  J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Frontiers in Neuroscience, vol. 10, pp. 508, 2016.
-  S. B. Shrestha and G. Orchard, “Slayer: Spike layer error reassignment in time,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 1412–1421. Curran Associates, Inc., 2018.
-  Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, “Direct training for spiking neural networks: Faster, larger, better,” arXiv preprint arXiv:1809.05793, 2018.
-  C. Lee, S. S. Sarwar, and K. Roy, “Enabling spike-based backpropagation in state-of-the-art deep neural network architectures,” arXiv preprint arXiv:1903.06379, 2019.
-  A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
-  S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
-  M. Pfeiffer and T. Pfeil, “Deep learning with spiking neurons: Opportunities & challenges,” Frontiers in Neuroscience, vol. 12, pp. 774, 2018.
-  A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, “Deep learning in spiking neural networks,” Neural Networks, 2018.
-  W. Gerstner and W. M. Kistler, Spiking neuron models: Single neurons, populations, plasticity, Cambridge University Press, 2002.
-  C. Koch and I. Segev, Methods in neuronal modeling: from ions to networks, MIT press, 1998.
-  J. Wu, Y. Chua, M. Zhang, H. Li, and K. C. Tan, “A spiking neural network framework for robust sound classification,” Frontiers in neuroscience, vol. 12, 2018.
-  A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 6306–6315. Curran Associates, Inc., 2017.
-  A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” arXiv preprint arXiv:1402.0030, 2014.
-  R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in Artificial Intelligence and Statistics, 2009, pp. 448–455.
-  A. Mnih and D. J. Rezende, “Variational inference for monte carlo objectives,” arXiv preprint arXiv:1602.06725, 2016.
-  A. Courville, J. Bergstra, and Y. Bengio, “A spike and slab restricted Boltzmann machine,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 233–241.
-  M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
-  Y. Wu et al., “Tensorpack,” https://github.com/tensorpack/, 2016.
-  P. Panda and K. Roy, “Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition,” in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 299–306.
-  A. G. Anderson and C. P. Berg, “The high-dimensional geometry of binary neural networks,” arXiv preprint arXiv:1705.07199, 2017.
-  P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.