I Introduction
Deep learning has improved pattern classification performance by leaps and bounds in computer vision [1, 2], speech processing [3, 4], language understanding [5] and robotics [6]. However, deep neural networks are computationally intensive and memory inefficient, thereby limiting their deployment in mobile and wearable devices that have limited computational budgets. This prompts us to look into energy-efficient solutions.
The human brain, with millions of years of evolution, is incredibly efficient at performing complex perceptual and cognitive tasks [7]. Although hierarchically organized deep neural network models are brain-inspired, they differ significantly from the biological brain in many ways. Fundamentally, information in the brain is represented and communicated through asynchronous action potentials, or spikes. To efficiently and rapidly process the information carried by these spike trains, biological neural systems adopt an event-driven computation strategy, whereby energy is mostly consumed only when spike generation and communication take place.
Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic such asynchronous event-driven information processing with spiking neural networks (SNNs) in silicon [8]. Novel neuromorphic computing architectures, for instance TrueNorth [9] and Loihi [10], leverage low-power, densely-connected parallel computing units to support spike-based computation. Furthermore, the co-located memory and computation can effectively mitigate the problem of low bandwidth between the CPU and memory (i.e., the von Neumann bottleneck) [11]. When implemented on these neuromorphic architectures, deep SNNs benefit from the best of two worlds: superior classification accuracies and compelling energy efficiency [12]. Such promising prospects motivate the study in this paper.
While neuromorphic computing architectures offer attractive energy savings, how to train large-scale SNNs that can operate efficiently and effectively on these NC architectures remains a challenging research problem. The biologically plausible Hebbian learning rules [13] and spike-timing-dependent plasticity (STDP) [14, 15] are intriguing local learning rules for computational neuroscience studies and are also attractive for hardware implementation with emerging non-volatile memory devices [16, 17, 18]. However, they are not straightforward to use for large-scale machine learning tasks due to ineffective task-specific credit assignment.
Due to the asynchronous and discontinuous nature of synaptic operations within the SNN, the error backpropagation algorithm that is commonly used for ANN training is not directly applicable to the SNN. Recent research works [19, 20, 21, 22, 23] have suggested that it is viable to convert pre-trained ANNs to SNNs with little adverse impact on classification accuracy. This indirect training approach assumes that the graded activation of analog neurons is equivalent to the average firing rate of spiking neurons, and simply requires parsing and normalizing the weights after training the ANNs.
Rueckauer et al. [21] provide a theoretical analysis of the performance deviation of such an approach, as well as a systematic study of Convolutional Neural Network (CNN) models for a large-scale image classification task. This conversion approach achieves the best-reported results for SNNs on many benchmark datasets, including the challenging ImageNet-2012 dataset [24]. However, this approach comes with a trade-off between inference speed and classification accuracy: it requires at least several hundred inference time steps to reach the optimal classification accuracy [21, 22]. Additional research efforts are also devoted to training constrained ANNs that approximate the properties of specific spiking neurons [25, 12, 26, 27, 28], which can seamlessly transfer to the target hardware platform and perform better than the aforementioned generic conversion approach. Grounded in a rate-based spiking neuron model, this constrain-then-train approach transforms the steady-state firing rate of the spiking neuron into a continuous and hence differentiable form that can be optimized with the conventional error backpropagation algorithm. While competitive classification accuracies are shown with both the generic ANN-to-SNN conversion and constrain-then-train approaches, the underlying assumption of a rate-based spiking neuron model requires a long inference time window or a high firing rate to reach the steady neuronal firing state [20, 26]. This steady-state requirement limits the computational benefits that can be acquired from the NC architectures.
To improve the overall energy efficiency as well as the inference speed, an ideal SNN learning rule should support a short encoding time window with sparse synaptic activities. To exploit this desirable property, temporal coding has been investigated, whereby the timing of the first spike is employed as a differentiable proxy to enable the error backpropagation algorithm [29, 30]. Although competitive classification accuracies were reported on the MNIST dataset with such a temporal learning rule, maintaining the stability of neuronal firing and scaling it up to the size of state-of-the-art deep ANNs remain elusive. In view of the steady-state requirement of rate-based SNNs and the scalability issue of temporally-coded SNNs, we are interested in developing a new learning rule that can effectively and efficiently train deep SNNs to operate under a short encoding time window with sparse synaptic activities.
The spiking neuronal function is designed to describe temporal dynamics, such as the leak and reset mechanisms of the membrane potential and the refractory period, which are very different from a continuous and differentiable ANN neuronal function. Furthermore, the size of the encoding time window also plays a role in capturing the sparse synaptic activities. It is therefore not straightforward to approximate the exact behavior of a SNN with an ANN, especially when there are multiple hidden layers in the network.
To demonstrate that such an approximation error happens during information forward propagation, namely the neural representation error, we prepared a handcrafted example as shown in Fig. 1. Although the free aggregate membrane potential of an integrate-and-fire (IF) neuron (a useful intermediate quantity that can be used to approximate the output spike count, as will be explained in Section II-C) stays below the firing threshold at the end of the simulation time window, an output spike is generated due to the early arrival of spikes from the positive synapses. Even worse, such a neural representation error (spike count discrepancy) accumulates across layers and significantly affects the classification accuracy of the SNN when transferring the trained weights from the ANN. Therefore, to effectively train a deep SNN under a short encoding time window with sparse synaptic activities, it is necessary to derive an exact neural representation with the SNN in the training loop.
One way to overcome the approximation error is to formulate SNNs as recurrent neural networks [31] and apply the error Backpropagation Through Time (BPTT) algorithm to train deep SNNs with pseudo derivatives [32, 33, 34, 35]. While competitive accuracies were reported on the MNIST and CIFAR-10 [36] datasets, it is both memory and computationally inefficient to train deep SNNs using BPTT. Furthermore, the vanishing gradient problem [37] that is well-known for RNNs may affect learning when the firing rate is low. Readers may refer to the recent overviews on deep learning with spiking neural networks [38, 39] for more details.
In this paper, to effectively and efficiently train deep SNNs to classify inputs that are encoded in spikes within a short time window, we propose a novel learning rule with the tandem neural network. As illustrated in Fig. 2, the tandem network architecture consists of a SNN and an ANN that are coupled layer-wise with weight sharing. The ANN is an auxiliary structure that facilitates the error backpropagation for the training of the SNN, while the SNN is used to derive the exact spiking neural representation.
The rest of this paper is organized as follows: in Section II, we present the details of the tandem learning framework. In Section III, we evaluate the proposed tandem network and learning rule on the CIFAR-10 and ImageNet-2012 datasets by comparing classification accuracies, inference speed and energy efficiency with other SNN implementations. Furthermore, we investigate why the proposed tandem learning rule can learn effectively by comparing the high-dimensional geometry of activation values and weight-activation dot products between the coupled ANN and SNN network layers. Finally, we conclude the paper in Section IV.
II Learning Through a Tandem Network
In this section, we first introduce the neuron model and the neural coding scheme that are used in this work. We then present a discrete neural representation scheme using spike count as the information carrier across network layers, and we design an ANN neuronal activation function to effectively approximate the spike count of the coupled SNN for error backpropagation. Finally, we introduce the tandem network and its learning rule, called the tandem learning rule, for deep SNN training.
II-A Neuron Model
In this work, we use the integrate-and-fire (IF) neuron model with the reset-by-subtraction scheme [21] in the SNN layers. This simplified spiking neuron model drops the membrane potential leak and refractory period terms present in other, more realistic spiking neuron models, for instance, the spike response model [40] and the leaky integrate-and-fire model [41]. In this way, it retains the efficacy of the input spikes that it receives across time (until reset). While the IF neuron does not emulate the rich temporal dynamics of biological neurons, it is nonetheless well suited for working with sensory input where spike timing does not play a significant role, and for hardware implementation.
Under a discrete-time setting with an encoding window of size T, the input spikes to neuron j at layer l are transduced at time step t as follows

z_j^l[t] = Σ_i w_{ji}^{l-1} s_i^{l-1}[t] + b_j^l    (1)

where s_i^{l-1}[t] indicates the occurrence of an input spike from afferent neuron i at time step t, and w_{ji}^{l-1} denotes the strength of the synaptic connection from afferent neuron i of layer l-1. Here, b_j^l can be interpreted as a constant input current to the IF neuron. Mathematically, this term is related to the bias term of the corresponding ReLU neuron in the coupled ANN layer l. It is important to distribute the effect of the bias evenly throughout the encoding time window, thereby effectively preventing IF neurons from over-firing at early time steps. The neuron then integrates the input current z_j^l[t] into its membrane potential V_j^l[t] as per Eq. 2 (without loss of generality, a unitary membrane resistance is assumed here). V_j^l[0] is reset and initialized to zero for each input sample. An output spike is generated whenever V_j^l[t] crosses the firing threshold ϑ (Eq. 3).

V_j^l[t] = V_j^l[t-1] + z_j^l[t] - ϑ s_j^l[t-1]    (2)

s_j^l[t] = Θ(V_j^l[t] - ϑ)    (3)

where Θ(·) denotes the Heaviside step function. According to Eq. 1, the free aggregate membrane potential (no spiking) of neuron j in layer l at the end of the encoding time window T can be expressed as

V̂_j^l = Σ_i w_{ji}^{l-1} c_i^{l-1} + T·b_j^l    (4)

where c_i^{l-1} is the input spike count from presynaptic neuron i at layer l-1 as per Eq. 5.

c_i^{l-1} = Σ_{t=1}^{T} s_i^{l-1}[t]    (5)
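As an illustration of these dynamics, the following minimal NumPy sketch simulates a single IF neuron with reset by subtraction over an encoding window of T steps (the function and variable names are ours, not from any released implementation):

```python
import numpy as np

def simulate_if_neuron(weights, spike_trains, bias, T, threshold=1.0):
    """Simulate one IF neuron (reset by subtraction) for T time steps.

    weights:      (N,) synaptic weights from N afferent neurons
    spike_trains: (N, T) binary input spikes s_i[t]
    bias:         constant input current b injected at every time step
    Returns the output spike train (T,) and the free aggregate
    membrane potential (the no-spiking sum of input currents).
    """
    v = 0.0                       # membrane potential, zeroed per sample
    out_spikes = np.zeros(T)
    for t in range(T):
        z = weights @ spike_trains[:, t] + bias   # input current
        v += z                                    # integrate
        if v >= threshold:                        # fire
            out_spikes[t] = 1.0
            v -= threshold                        # reset by subtraction
    free_v = weights @ spike_trains.sum(axis=1) + T * bias
    return out_spikes, free_v
```

With suitable inputs, this sketch also reproduces the effect of the handcrafted example in Fig. 1: a neuron whose free aggregate membrane potential stays below the firing threshold can still emit a spike when its positive inputs arrive early.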
For the ANN layers, we use bounded ReLU neurons that linearly integrate inputs and deliver only positive, integer-valued 'spike counts' to the subsequent layer. As explained in the ANN-to-SNN conversion work [21], the firing rate of IF neurons linearly correlates with the activation value of ReLU neurons. In this work, we extend this property further and approximate the spike count of IF neurons with bounded ReLU neurons, as will be presented in Section II-C.
II-B Encoding and Decoding Schemes
Just like how the cochlea converts received sound waves into nerve impulses, and the auditory cortex then perceives the sound encoded in the incoming nerve impulses, a SNN front-end is required to encode the sensory inputs into spike trains, and a neural network back-end will then decode the output spike trains into the desired pattern classes. Two encoding schemes are commonly used: rate code and temporal code. Rate code [20, 21] converts real-valued inputs into spike trains at each sampling time step following a Poisson or Bernoulli distribution. However, it suffers from sampling errors, thereby requiring a long encoding time window to compensate for such errors. Hence, it is not suitable for encoding information into a short time window. On the other hand, temporal coding uses the timing of a single spike to encode information. Therefore, it enjoys superior coding efficiency and computational advantages. However, it is complex to decode and sensitive to noise [42]. Alternatively, we adopt the encoding scheme introduced in [21, 34]
and directly feed the input images or feature vectors into the neural encoding layer. The neural encoding layer performs a weighted transformation with bounded ReLU neurons as shown in Eq. 6.

â_j = f(Σ_i w_{ji}^0 x_i + b_j^0)    (6)

where w_{ji}^0 denotes the synaptic weight that connects input value x_i to encoding neuron j, and b_j^0 is the bias term of the encoding neuron. We use f(·) to denote the activation function of the bounded ReLU neuron that is defined in Eq. 11. Here, the activation value â_j is analogous to the free aggregate membrane potential at the end of the encoding time window T. The subsequent spike train is generated by distributing this free aggregate membrane potential evenly over the consecutive time steps, from the beginning of the encoding window, as follows

V_j^0[t] = V_j^0[t-1] + â_j / T - ϑ s_j^0[t-1]    (7)

s_j^0[t] = Θ(V_j^0[t] - ϑ)    (8)

Altogether, the spike train and spike count that are output from the neural encoding layer can be represented as follows

s_j^0 = { s_j^0[1], ..., s_j^0[T] }    (9)

c_j^0 = Σ_{t=1}^{T} s_j^0[t]    (10)
This neural encoding layer converts the input into spike trains, whereby the output spike count can be adjusted in a learnable fashion to match the size of the encoding window T. Such an encoding scheme is beneficial for rapid inference since the input information can be effectively encoded within a short time window. Beginning from this neural encoding layer, spike trains and spike counts are used as the inputs to the SNN and ANN layers, respectively.
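A minimal sketch of this encoding layer in NumPy (the naming is ours; the clamping follows the bounded 'spike count' activation of Eq. 11, and the free aggregate potential is distributed evenly over the T time steps):

```python
import numpy as np

def encode(x, W, b, T, threshold=1.0):
    """Neural encoding layer: a weighted transform of the input is
    interpreted as a free aggregate membrane potential, which is then
    distributed evenly over T time steps and emitted as spikes by
    IF dynamics with reset by subtraction (illustrative sketch)."""
    # Bounded 'spike count' activation: clamp to [0, T] and floor.
    v_free = W @ x + b
    a = np.clip(np.floor(np.maximum(v_free, 0.0) / threshold), 0, T)
    # Inject a/T per step; integrate-and-fire with reset by subtraction.
    spikes = np.zeros((len(a), T))
    v = np.zeros(len(a))
    for t in range(T):
        v += a / T
        fired = v >= threshold
        spikes[fired, t] = 1.0
        v[fired] -= threshold
    return spikes, spikes.sum(axis=1)   # spike trains and spike counts
```

Because the injected current per step is an integer multiple of threshold/T, the emitted spike count equals the bounded activation value exactly, which is the property the encoding layer relies on.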
For decoding, it is feasible to decode from the SNN output layer using either the discrete spike counts or the continuous free aggregate membrane potentials. In our preliminary study, as shown in Fig. 8, we observe that the free aggregate membrane potential provides a much smoother learning curve due to the continuous error gradients derived at the output layer.
II-C Spike Count as a Discrete Neural Representation
Deep neural networks learn to describe the input data with compact feature representations. A typical feature representation is in the form of a continuous- or discrete-valued vector. While most studies have focused on continuous feature representations, discrete representations have their unique advantages in solving some real-world problems [43, 44, 45, 46, 47]. For example, they are potentially a more natural fit for representing natural language, which is inherently discrete, and are also native for logical reasoning and predictive learning. Moreover, the idea of discretized neural representation has also been exploited in binary neural networks [48] for network quantization, wherein binarized activations (-1, +1) are used for feature representation.
In this work, we consider the spike count as a discrete feature representation in deep SNNs as shown in Fig. 3. To formulate a discrete neural representation in the coupled ANN layer, if we are to ignore the temporal dynamics (membrane potential reset after spiking) of the IF neurons, we may then establish a one-to-one correspondence between the free aggregate membrane potential V̂_j^l of the spiking neuron and the discrete pseudo output spike count a_j^l of the ANN neuron:

a_j^l = f(V̂_j^l) = min( ⌊ max(V̂_j^l, 0) / ϑ ⌋, T )    (11)

where a_j^l is lower bounded at value zero. Without loss of generality, we set ϑ to 1 in this work. As shown in Fig. 4, different from the commonly used continuous neuron activation functions in ANNs, the values of a_j^l are only non-negative integers. The surplus free membrane potential that is insufficient to induce an additional spike is rounded off, resulting in a quantization error as expressed in Eq. 12.

ε_j^l = max(V̂_j^l, 0) - ϑ·a_j^l    (12)
In practice, however, we did not observe any obvious interference with learning or inference due to this quantization error. Moreover, a_j^l is upper bounded by the encoding time window size T. As shown in Fig. 3, the proposed ANN activation function can effectively approximate the exact spike count information of the coupled SNN layer. As the ANN and SNN are coupled layer-by-layer, the ANN approximates the SNN layer-by-layer. This makes it possible to train the deep SNN in a similar way as a deep ANN.
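The spike-count activation of Eq. 11 and the rounded-off surplus of Eq. 12 can be written compactly; a short sketch (function names are ours, and the surplus computed here follows our reading of Eq. 12, including any clipping beyond T):

```python
import numpy as np

def spike_count_activation(v_free, T, threshold=1.0):
    """Pseudo spike count: non-negative, integer-valued, and upper
    bounded by the encoding window size T (illustrative sketch)."""
    return np.minimum(np.floor(np.maximum(v_free, 0.0) / threshold), T)

def quantization_error(v_free, T, threshold=1.0):
    """Surplus free membrane potential that is insufficient to
    induce an additional spike and is rounded off."""
    return np.maximum(v_free, 0.0) - threshold * spike_count_activation(v_free, T, threshold)
```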
Notably, as described in Fig. 1, the pseudo spike count derived in the ANN layer (Eq. 11) may deviate from the actual spike count of the SNN layer, especially within a short time window, which may adversely affect the quality of the gradients derived in the error backpropagation. We refer to this error as the gradient approximation error in the following sections. Our experimental results in Section III-E, however, suggest that the cosine angle between these two outputs is exceedingly small in a high-dimensional space and that this relationship is maintained throughout learning. In addition, the weight-activation dot products, a critical intermediate quantity, are approximately preserved despite the spike count discrepancy. Therefore, the learning dynamics in the ANN layers can effectively approximate those of the coupled SNN layers with spike count as the discrete neural representation.
II-D Credit Assignment in the Tandem Network
Although the neural representation error is not significant at each layer alone, as demonstrated in Fig. 3, it may cause severe impairments to the classification accuracy if the inaccurate neural representation is propagated to subsequent layers. To solve this problem, we propose a tandem learning framework. As shown in Figs. 1 and 5, an ANN with the activation function defined in Eq. 11 is employed to enable error backpropagation in a rate-based network, while the SNN, sharing weights with the coupled ANN, is employed to determine the exact neural representation (i.e., spike counts and spike trains). The spike counts and spike trains are transmitted to the subsequent ANN and SNN layers, respectively. By incorporating the dynamics of the IF neuron into the training phase and propagating its output to the subsequent layers, this tandem learning framework effectively prevents the neural representation error from accumulating across layers. While a coupled ANN is used for error backpropagation, the forward inference is executed entirely on the SNN. The pseudo code of the proposed tandem learning rule is given in Algorithm 1.
It is worth mentioning that, in the forward pass, the ANN layer takes the output of the previous SNN layer as its input. This aims at synchronizing the training of the SNN with the ANN via the interlaced layers, rather than trying to optimize the classification performance of the ANN. A similar idea of interlaced network layers has also been explored in binary neural networks [48], in which full-precision activation values are calculated at each layer, whereas binarized activation values are propagated forward to the subsequent layer.
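The interlaced forward pass described above can be sketched for one layer pair as follows; this is a minimal NumPy illustration with our own names (`tandem_layer_forward`, etc.), not the authors' released code, and it omits the backward pass that a deep-learning framework would handle:

```python
import numpy as np

def tandem_layer_forward(W, b, in_spikes, in_counts, T, threshold=1.0):
    """One interlaced tandem layer (illustrative sketch).

    The SNN path consumes the incoming spike trains and produces the
    exact output spike trains/counts that are propagated forward; the
    ANN path consumes the incoming spike counts and produces the
    pseudo spike count used only to derive the error gradients."""
    # --- SNN path: IF dynamics with reset by subtraction ---
    n_out = W.shape[0]
    v = np.zeros(n_out)
    out_spikes = np.zeros((n_out, T))
    for t in range(T):
        v += W @ in_spikes[:, t] + b
        fired = v >= threshold
        out_spikes[fired, t] = 1.0
        v[fired] -= threshold
    # --- ANN path: free aggregate potential -> pseudo spike count ---
    v_free = W @ in_counts + T * b
    pseudo_count = np.minimum(np.floor(np.maximum(v_free, 0.0) / threshold), T)
    # The exact SNN outputs (not pseudo_count) feed the next layer pair.
    return out_spikes, out_spikes.sum(axis=1), pseudo_count
```

In an actual implementation the two paths share one weight tensor, so gradients computed through the ANN path directly update the SNN weights.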
III Experimental Evaluation and Discussion
In this section, we first present the neural representation errors that may arise and accumulate across layers when taking the constrain-then-train approach (introduced in Section I) in the scenario of a short encoding window. Then, we evaluate the learning capability of the proposed tandem learning rule on two standard image classification benchmarks. We further discuss why effective learning can be performed within the tandem network. Finally, we discuss the rapid inference and synaptic operation reduction that are achieved with the proposed tandem learning rule.
III-A Datasets, Network Configurations and Implementation
To evaluate the learning capability, convergence properties and energy efficiency of the proposed learning rule, we use two image classification benchmark datasets: CIFAR-10 [36] and ImageNet-2012 [24]. CIFAR-10 consists of 60,000 color images of size 32×32 from 10 classes, with a standard split of 50,000 and 10,000 for training and testing, respectively. The large-scale ImageNet-2012 dataset consists of over 1.2 million images from 1,000 object categories. Notably, the success of AlexNet [1] on this dataset represents a key milestone of deep learning research.
As shown in Fig. 6, we use a convolutional neural network (CNN) with 6 learnable layers for CIFAR-10 (namely CIFARNet) and AlexNet for ImageNet-2012. To reduce the dependency on weight initialization and to accelerate the training process, we add a batch normalization [49] layer after each convolution and fully-connected layer. Given that the batch normalization layer only performs an affine transformation, we follow the approach introduced in [21] and integrate their parameters into the preceding layer's weight vectors before copying them into the coupled SNN layer. We replace the average pooling operations that are commonly used in the ANN-to-SNN conversion approach with convolution operations of stride 2, which perform dimensionality reduction in a learnable fashion [50]. We perform all experiments with the Tensorpack toolbox [51], which is a high-level neural network training interface based on TensorFlow. Tensorpack optimizes the whole training pipeline, providing accelerated and memory-efficient training on multi-GPU machines. We follow the same data pre-processing procedures (crop, flip, mean normalization, etc.), optimizer and learning rate decay schedule that are adopted in the Tensorpack CIFAR-10 and ImageNet-2012 examples, and use those configurations consistently for all experiments. As shown in Fig. 5, we implement customized convolution and fully-connected layers in Tensorpack, which integrate the operations of the ANN layer and the coupled SNN layer under a unified interface.

III-B Counting Synaptic Operations
The computational cost of neuromorphic architectures is typically benchmarked using the total number of synaptic operations [9, 21, 22, 35]. For the SNN, as defined below, the total synaptic operations (SynOps) correlate with the neurons' firing rate, fan-out f_out (number of outgoing connections to the subsequent layer) and encoding time window size T.

SynOps_SNN = Σ_{l=1}^{L-1} Σ_{t=1}^{T} Σ_{j=1}^{N^l} f_{out,j}^l · s_j^l[t]    (13)

where L is the total number of layers and N^l denotes the total number of neurons in layer l. s_j^l[t] indicates whether a spike is generated by neuron j of layer l at time step t.
In contrast, the total synaptic operations required to classify one image in the ANN is given as follows

SynOps_ANN = Σ_{l=1}^{L} f_{in}^l · N^l    (14)

where f_{in}^l denotes the number of incoming connections to each neuron in layer l. In our experiments, we calculate the average synaptic operations on a randomly chosen mini-batch (256 images) from the test set.
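These two totals amount to simple bookkeeping over spike counts and layer fan-in/fan-out; a sketch, with our own function names:

```python
import numpy as np

def snn_synops(spike_trains, fan_out):
    """Total SNN synaptic operations: each emitted spike is weighted
    by its neuron's fan-out (illustrative sketch).
    spike_trains: list over layers of (N_l, T) binary arrays
    fan_out:      list over layers of (N_l,) outgoing-connection counts"""
    return sum((s.sum(axis=1) * f).sum() for s, f in zip(spike_trains, fan_out))

def ann_synops(fan_in, n_neurons):
    """Total ANN synaptic operations per image: incoming connections
    per neuron times the number of neurons, summed over layers."""
    return sum(f * n for f, n in zip(fan_in, n_neurons))
```

The SNN-to-ANN SynOps ratio reported later is simply `snn_synops(...) / ann_synops(...)` averaged over a mini-batch of test images.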
Model | Network Architecture | Method | Error Rate (%) | Inference Time Steps

CIFAR-10
Panda and Roy (2016) [52] | Convolutional Autoencoder | Layer-wise Spike-based Learning | 24.58 | -
Esser et al. (2016) [12] | 15-layer CNN | Binary Neural Network | 10.68 | 16
Rueckauer et al. (2017) [21] | 8-layer CNN | ANN-to-SNN conversion | 9.15 | -
Wu et al. (2018) [34] | 8-layer CNN | Error Backpropagation Through Time | 9.47 | -
Wu et al. (2018) [34] | AlexNet | Error Backpropagation Through Time | 14.76 | -
Sengupta et al. (2019) [22] | VGG-16 | ANN-to-SNN conversion | 8.54 | 2,500
Lee et al. (2019) [35] | ResNet-11 | ANN-to-SNN conversion | 9.85 | 3,000
Lee et al. (2019) [35] | ResNet-11 | Spike-based Learning | 9.05 | 100
This work (SNN with Spike Count) | CIFARNet | Error Backpropagation through Tandem Network | 8.46 | 16
This work (SNN with Agg. Mem. Potential) | CIFARNet | Error Backpropagation through Tandem Network | 9.93 | 16

ImageNet-2012
Hunsberger and Eliasmith (2016) [26] | AlexNet | Constrain-then-Train | 48.20 (23.80) | 200
Rueckauer et al. (2017) [21] | VGG-16 | ANN-to-SNN conversion | 50.39 (18.37) | 400
Sengupta et al. (2019) [22] | VGG-16 | ANN-to-SNN conversion | 30.04 (10.99) | 2,500
This work (ANN with full-precision activation) | AlexNet | Error Backpropagation | 42.45 (19.56) | -
This work (ANN with quantized activation) | AlexNet | Error Backpropagation | 50.73 (26.08) | -
This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 53.37 (29.20) | 13
This work (SNN with Agg. Mem. Potential) | AlexNet | Error Backpropagation through Tandem Network | 49.78 (26.40) | 18
III-C Accumulated Neural Representation Error

As discussed in Section I, one can train a constrained ANN that approximates the properties of spiking neurons (e.g., firing rate or spike count) using the conventional error backpropagation algorithm and subsequently transfer the trained weights to the SNN, as described in Fig. 7A, namely the constrain-then-train approach. Taking Eq. 11 as the neuron activation function, we reported competitive classification accuracy on the MNIST dataset [28]. However, when applying this approach to the more complex CIFAR-10 dataset with a short encoding time window, we noticed a large accuracy drop (approximately 11%) when transferring the trained ANN weights to the SNN. After carefully comparing the ANN output 'spike count' with the actual SNN spike count, we observe a growing spike count discrepancy between the ANN and SNN layers, as shown in Fig. 7C.
This is due to the fact that the neuronal activation function of the ANN ignores the temporal dynamics of the IF neuron. While such spike count discrepancies may be negligible for a shallow network used for classifying the MNIST dataset [28] or with very high input firing rates, they have a huge impact in the face of sparse synaptic activities and a short encoding time window. By incorporating the dynamics of IF neurons during the training of the tandem network, the exact output spike counts, instead of the ANN-predicted spike counts, are propagated forward to the subsequent ANN layer. The proposed tandem learning framework can thus effectively prevent this representation error from accumulating across layers.
III-D Image Classification Results
For CIFAR-10, as shown in Table I, the CIFARNet trained with the proposed learning rule achieves competitive test error rates of 8.46% (spike count decoding) and 9.93% (aggregate membrane potential decoding), respectively. The CIFARNet, with spike count decoding, achieves by far the best-reported result on CIFAR-10 with a SNN. As shown in Fig. 8, we however note that its learning dynamics are unstable, which may be attributed to the discrete error gradients derived at the final output layer. Therefore, we use the aggregate membrane potential decoding for the rest of the experiments on ImageNet-2012, as well as for a further study on the effect of the encoding time window size on CIFAR-10. Although the learning converges more slowly than for the plain CNN (with ReLU activation function) and the bounded CNN (with the bounded ReLU activation function defined in Eq. 11), the error rate of the SNN eventually matches that of the bounded CNN. This also suggests that the representation error described in Sec. III-C can be effectively mitigated with the proposed tandem learning framework.
Training a model on ImageNet-2012 with a spike-based learning rule that uses BPTT for the synaptic weight updates requires a huge amount of computer memory to store the intermediate states of the spiking neurons, as well as huge computational costs. Hence, only a few SNN implementations, which do not take the dynamics of spiking neurons into consideration during training, have made successful attempts on this challenging task, including the ANN-to-SNN conversion [21, 22] and constrain-then-train [26] approaches. The tandem learning rule benefits from the best of two worlds: the dynamics of IF neurons are considered during the forward propagation, while only the rate-based ANN is used for error backpropagation. As a result, it reduces both the memory requirement and the computational cost over other spike-based learning rules. Meanwhile, it also reduces the inference time and the total synaptic operations when compared to the ANN-to-SNN conversion and constrain-then-train approaches.
As shown in Table I, with an inference time of 18 time steps (the input image is encoded within a time window of 10 time steps), the AlexNet trained with the proposed learning rule achieves top-1 and top-5 error rates of 49.78% and 26.40%, respectively. This result is comparable to that of the constrain-then-train approach with the same AlexNet architecture. Notably, the proposed learning rule only takes 18 inference time steps, which is at least an order of magnitude faster than the other reported approaches.
While the ANN-to-SNN conversion approaches achieve better classification accuracies on ImageNet-2012, their successes can largely be credited to the more advanced network models used. Furthermore, we note an error rate increase of around 7% from the baseline ANN implementation with full-precision activation (revised from the original AlexNet model [1] by replacing the pooling layers with convolution operations of stride 2 to match the AlexNet used in this work, and by adding batch normalization layers). To investigate the effect of the discrete neural representation, i.e., how much of the drop in accuracy is due to quantization and how much is due to the dynamics of the IF neuron, we modify the full-precision ANN by quantizing the activation function using the bounded ReLU neuron defined in Eq. 11. In a single trial, the resulting quantized ANN achieves top-1 and top-5 error rates of 50.73% and 26.08%, respectively. This result is very close to that of our SNN implementation, which suggests that the quantization of the activation function alone may account for most of the accuracy drop.
III-E Activation Direction Preservation and Weight-Activation Dot Product Proportionality within the Interlaced Layers
After showing how effectively the proposed tandem learning rule performs on CIFAR-10 and ImageNet-2012, we further investigate why learning can be performed effectively via the interlaced network layers. To answer this question, we borrow ideas from the recent theoretical work on binary neural networks [53], wherein learning is also performed across interlaced network layers (binarized activations are forward propagated to subsequent layers). In the proposed tandem network, as shown in Fig. 10, the ANN layer activation value a^{l-1} at layer l-1 is replaced with the spike count c^{l-1} derived from the coupled SNN layer. Due to the dynamic nature of spike generation, it is not easy to find an analytical transformation function between a^{l-1} and c^{l-1}. To circumvent this problem, we analyze the degree of mismatch between these two quantities and its effect on the activation forward propagation and error backpropagation.
In our numerical experiments on CIFAR-10 with a randomly drawn mini-batch of 256 test samples, we calculate the cosine angle between the vectorized a^{l-1} and c^{l-1} for all the convolution layers. As shown in Fig. 9, their cosine angles are below 24 degrees on average, and such a relationship is maintained consistently throughout learning. While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. According to the hyperdimensional computing theory [54] and the theoretical study of binary neural networks [53], any two high-dimensional random vectors are approximately orthogonal. It is also worth noting that the distortion of replacing a^{l-1} with c^{l-1} is less severe than binarizing a random high-dimensional vector, which changes the cosine angle by 37 degrees in theory. Given that the activation function and the error gradients backpropagated from the subsequent ANN layer remain the same, the distortion to the error backpropagation is bounded locally by the discrepancy between a^{l-1} and c^{l-1}.
Furthermore, we calculate the Pearson Correlation Coefficient (PCC) between the weight-activation dot products w^l · a^{l-1} and w^l · c^{l-1}, an important intermediate quantity (the input to the batch normalization layer) in our network configurations. The PCC, ranging from -1 to 1, measures the linear correlation between two variables; a value of 1 implies a perfect positive linear relationship. As shown in Fig. 9, the PCC stays consistently above 0.9 throughout learning for most of the samples, suggesting that the linear relationship of the weight-activation dot products is approximately preserved.
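Both similarity measures used in this analysis are standard; for concreteness, a small sketch of how the cosine angle and the PCC can be computed (function names are ours):

```python
import numpy as np

def cosine_angle_deg(a, c):
    """Angle in degrees between two vectors, e.g. the vectorized ANN
    activations and SNN spike counts of a layer (illustrative sketch)."""
    cos = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pearson_cc(x, y):
    """Pearson correlation coefficient between two sets of
    weight-activation dot products."""
    return np.corrcoef(x, y)[0, 1]
```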
III-F Rapid Inference with Reduced Synaptic Operations
As shown in Fig. 8, the proposed learning rule can effectively exploit different encoding window sizes T on CIFAR-10. In the most challenging case, when T = 1, we are able to achieve a satisfying error rate that is below 12%. This may be partially credited to the encoding strategy that we have employed, whereby important input information can be encoded at the first time step before passing into the SNN layer. In addition, the batch normalization layer that is added after each convolution and fully-connected layer ensures effective information transmission to the top layers. The error rate is reduced further by increasing T, although the improvement vanishes beyond a certain window size. Hence, the SNN trained with the proposed learning rule can perform inference rapidly, with at least an order of magnitude time-saving compared with other learning rules, as shown in Table I. While binary neural networks also support rapid inference, they propagate information in a synchronized fashion and differ fundamentally from the asynchronous information processing studied in other SNN works.
Model | Inference Time Steps | CIFAR-10 | ImageNet-2012
VGGNet-9 [35] | 100 | 3.61 | –
ResNet-11 [35] | 100 | 5.06 | –
VGGNet-16 [22] | 500 | – | 1.975
ResNet-34 [22] | 2,000 | – | 2.40
AlexNet (this work) | 13 | 0.27 | 0.50
AlexNet (this work) | 18 | 0.40 | 0.68
To study the energy efficiency of the proposed learning rule, we follow the evaluation metrics used in [22, 35]. As defined in Section III-B, we calculate the ratio of SNN SynOps to ANN SynOps on the CIFAR-10 and ImageNet-2012 datasets and compare it with other state-of-the-art learning rules. Given the short inference time required and the sparse synaptic activities summarized in Fig. 11, the AlexNet (shown in Table II) trained with the proposed learning rule achieves a ratio of only 0.40 and 0.68 on the CIFAR-10 and ImageNet-2012 datasets, respectively. It is worth noting that, with a ratio below 1, the SNN is more energy-efficient than its ANN counterpart. The saving is even more significant if we consider the fact that each synaptic operation in an SNN requires only an accumulate (AC) operation, whereas an ANN performs a more costly multiply-and-accumulate (MAC) operation; this translates into an order of magnitude saving in chip area as well as in energy per synaptic operation [21, 22]. In contrast, the existing SNN implementations [35, 22] achieve ratios of at least 3.61 and 1.975 on the CIFAR-10 and ImageNet-2012 datasets, respectively, which are at least 9 and 3 times more costly than the proposed tandem learning rule.

IV Conclusion
In this work, we introduce a novel tandem neural network and its learning rule to effectively train SNNs for efficient and rapid inference on pattern classification tasks. Within the tandem neural network, an SNN is employed to determine the spike counts as a discrete neural representation and the spike trains for the activation forward propagation, while an ANN, sharing its weights with the coupled SNN, is used to approximate the gradients of the coupled SNN. Given that error backpropagation is performed on the rate-based ANN, the proposed learning rule is more memory- and computationally efficient than the error backpropagation through time algorithm used in many spike-based learning rules [32, 33, 34].
To understand why learning can be effectively performed within the tandem learning framework, we study the learning dynamics of the tandem network and compare it with an intact ANN. The empirical study on CIFAR-10 reveals that the cosine angles between the vectorized ANN output a^l and the coupled SNN output spike count c^l are exceedingly small in a high-dimensional space, and that this relationship is maintained throughout training. Furthermore, strongly positive Pearson Correlation Coefficients are exhibited between the weight-activation dot products computed from a^l and from c^l, an important intermediate quantity in the activation forward propagation, suggesting that the linear relationship between the weight-activation dot products is well preserved.
The SNNs trained with the proposed tandem learning rule have demonstrated competitive classification accuracies on the CIFAR-10 and ImageNet-2012 datasets. By encoding sensory stimuli within the available encoding time window through a learnable transformation layer, and by adding batch normalization layers to ensure effective information flow, rapid inference is demonstrated on the large-scale ImageNet-2012 image classification task, with at least an order of magnitude time-saving compared to state-of-the-art ANN-to-SNN conversion and constrain-then-train approaches [22]. Furthermore, the total synaptic operations are also significantly reduced compared to the baseline ANNs and other SNN implementations.
By integrating the algorithmic power of the proposed tandem learning rule with the unprecedented energy efficiency of emerging neuromorphic computing architectures, we expect to enable low-power on-chip computing on pervasive mobile and embedded devices. For future work, we will explore strategies to close the accuracy gap between the baseline ANN and SNN implementations, as well as to evaluate more advanced network architectures.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [3] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec 2017.
 [4] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” SSW, vol. 125, 2016.
 [5] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349, no. 6245, pp. 261–266, 2015.
 [6] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [7] S. B. Laughlin and T. J. Sejnowski, “Communication in neuronal networks,” Science, vol. 301, no. 5641, pp. 1870–1874, 2003.
 [8] C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, “A survey of neuromorphic computing and neural networks in hardware,” arXiv preprint arXiv:1705.06963, 2017.
 [9] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
 [10] M. Davies, N. Srinivasa, T. H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
 [11] D. Monroe, “Neuromorphic computing gets ready for the (really) big time,” Communications of the ACM, vol. 57, no. 6, pp. 13–15, 2014.
 [12] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energyefficient neuromorphic computing,” Proceedings of the National Academy of Sciences, vol. 113, no. 41, pp. 11441–11446, 2016.
 [13] D. O. Hebb, The organization of behavior: A neuropsychological theory, Psychology Press, 2005.
 [14] H. Markram, J. Lübke, M. Frotscher, and B. Sakmann, “Regulation of synaptic efficacy by coincidence of postsynaptic aps and epsps,” Science, vol. 275, no. 5297, pp. 213–215, 1997.
 [15] G. Q. Bi and M. M. Poo, “Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type,” Journal of neuroscience, vol. 18, no. 24, pp. 10464–10472, 1998.
 [16] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, et al., “Neuromorphic computing using non-volatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
 [17] N. Zheng and P. Mazumder, “Online supervised learning for hardware-based multilayer spiking neural networks through the modulation of weight-dependent spike-timing-dependent plasticity,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4287–4302, 2017.
 [18] M. Mozafari, S. R. Kheradpisheh, T. Masquelier, A. Nowzari-Dalini, and M. Ganjtabesh, “First-spike-based visual categorization using reward-modulated STDP,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6178–6190, 2018.
 [19] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.
 [20] P. U. Diehl, D. Neil, J. Binas, M. Cook, S. C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–8.
 [21] B. Rueckauer, I. A. Lungu, Y. Hu, M. Pfeiffer, and S. C. Liu, “Conversion of continuous-valued deep networks to efficient event-driven networks for image classification,” Frontiers in Neuroscience, vol. 11, p. 682, 2017.
 [22] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, “Going deeper in spiking neural networks: VGG and residual architectures,” Frontiers in Neuroscience, vol. 13, 2019.
 [23] Y. Hu, H. Tang, Y. Wang, and G. Pan, “Spiking deep residual network,” arXiv preprint arXiv:1805.01352, 2018.
 [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
 [25] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, “Backpropagation for energy-efficient neuromorphic computing,” in Advances in Neural Information Processing Systems, 2015, pp. 1117–1125.
 [26] E. Hunsberger and C. Eliasmith, “Training spiking deep networks for neuromorphic hardware,” arXiv preprint arXiv:1611.05141, 2016.
 [27] D. Zambrano, R. Nusselder, H. S. Scholte, and S. Bohte, “Efficient computation in adaptive artificial spiking neural networks,” arXiv preprint arXiv:1710.04838, 2017.
 [28] J. Wu, Y. Chua, M. Zhang, Q. Yang, G. Li, and H. Li, “Deep spiking neural network with spike count based learning rule,” arXiv preprint arXiv:1902.05705, 2019.
 [29] H. Mostafa, “Supervised learning based on temporal coding in spiking neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227–3235, 2018.
 [30] C. Hong, X. Wei, J. Wang, B. Deng, H. Yu, and Y. Che, “Training spiking neural networks for cognitive tasks: A versatile framework compatible with various temporal codes,” IEEE transactions on neural networks and learning systems, 2019.
 [31] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks,” arXiv preprint arXiv:1901.09948, 2019.
 [32] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Frontiers in Neuroscience, vol. 10, pp. 508, 2016.
 [33] S. B. Shrestha and G. Orchard, “Slayer: Spike layer error reassignment in time,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 1412–1421. Curran Associates, Inc., 2018.
 [34] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, “Direct training for spiking neural networks: Faster, larger, better,” arXiv preprint arXiv:1809.05793, 2018.
 [35] C. Lee, S. S. Sarwar, and K. Roy, “Enabling spike-based backpropagation in state-of-the-art deep neural network architectures,” arXiv preprint arXiv:1903.06379, 2019.
 [36] A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
 [37] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
 [38] M. Pfeiffer and T. Pfeil, “Deep learning with spiking neurons: Opportunities & challenges,” Frontiers in Neuroscience, vol. 12, pp. 774, 2018.
 [39] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, “Deep learning in spiking neural networks,” Neural Networks, 2018.
 [40] W. Gerstner and W. M. Kistler, Spiking neuron models: Single neurons, populations, plasticity, Cambridge University Press, 2002.
 [41] C. Koch and I. Segev, Methods in neuronal modeling: from ions to networks, MIT press, 1998.
 [42] J. Wu, Y. Chua, M. Zhang, H. Li, and K. C. Tan, “A spiking neural network framework for robust sound classification,” Frontiers in neuroscience, vol. 12, 2018.
 [43] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 6306–6315. Curran Associates, Inc., 2017.
 [44] A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” arXiv preprint arXiv:1402.0030, 2014.
 [45] R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Artificial Intelligence and Statistics, 2009, pp. 448–455.
 [46] A. Mnih and D. J. Rezende, “Variational inference for monte carlo objectives,” arXiv preprint arXiv:1602.06725, 2016.
 [47] A. Courville, J. Bergstra, and Y. Bengio, “A spike and slab restricted boltzmann machine,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 233–241.
 [48] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” arXiv preprint arXiv:1602.02830, 2016.
 [49] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [50] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
 [51] Y. Wu et al., “Tensorpack,” https://github.com/tensorpack/, 2016.
 [52] P. Panda and K. Roy, “Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition,” in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 299–306.
 [53] A. G. Anderson and C. P. Berg, “The high-dimensional geometry of binary neural networks,” arXiv preprint arXiv:1705.07199, 2017.
 [54] P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.