1 Introduction
Spiking neural networks (SNNs) have been studied not only for their biological plausibility but also for the computational efficiency that stems from information processing with binary spikes (Maass, 1997). One of the unique characteristics of SNNs is that the states of the neurons at different time steps are closely related to each other. This may resemble the temporal dependency in recurrent neural networks (RNNs), but in SNNs the direct influences between neurons are conveyed only through binary spikes. Since the true derivative of the binary activation function, or thresholding function, is zero almost everywhere, SNNs pose an additional challenge for precise gradient computation unless the binary activation function is replaced by an alternative as in (Huh and Sejnowski, 2018).
Due to the difficulty of training SNNs, some recent studies employed parameters trained in non-spiking NNs in SNNs. However, this approach is only feasible by using the similarity between rate-coded SNNs and non-spiking NNs (Diehl et al., 2015; Hunsberger and Eliasmith, 2015) or by abandoning several features of spiking neurons to maximize the similarity between SNNs and non-spiking NNs (Park et al., 2020; Rueckauer and Liu, 2018; Zhang et al., 2019). The unique characteristics of SNNs that enable efficient information processing can only be utilized with dedicated learning methods for SNNs. In this context, several studies have reported promising results with gradient-based supervised learning methods that take those characteristics into account (Comsa et al., 2019; Mostafa, 2017; Shrestha and Orchard, 2018; Wu et al., 2018; Zenke and Ganguli, 2018).
Previous works on gradient-based supervised learning for SNNs can be classified into two categories. The methods in the first category work around the non-differentiability of the spiking function with a surrogate derivative (Neftci et al., 2019) and compute the gradients with respect to the spike activation (Shrestha and Orchard, 2018; Wu et al., 2018; Zenke and Ganguli, 2018). The methods in the second category focus on the timings of existing spikes and compute the gradients with respect to the spike timing (Comsa et al., 2019; Mostafa, 2017; Bohte et al., 2002). We refer to these as the activation-based methods and the timing-based methods, respectively. Until now, the two approaches have been considered unrelated to each other and studied independently.
The problem with previous works is that both approaches have limitations in computing accurate gradients, which become more problematic when the spike density is low. The computational cost of an SNN is known to be proportional to the number of spikes, or the firing rates (Rueckauer and Liu, 2018; Akopyan et al., 2015; Davies et al., 2018). To make the best use of the computational power of SNNs and use them more efficiently than non-spiking counterparts, it is important to reduce the number of spikes required for inference. If there are only a few spikes in the network, the network becomes more sensitive to the change in the state of each individual spike, such as the generation of a new spike, the removal of an existing spike, or the shift of an existing spike. Training SNNs with fewer spikes therefore requires the learning method to be aware of those changes through gradient computation.
In this work, we investigated the relationship between the activation-based methods and the timing-based methods for supervised learning in SNNs. We observed that the two approaches are complementary when considering the change in the state of individual spikes. We then devised a new learning method called the activation- and timing-based learning rule (ANTLR) that enables more precise gradient computation by combining the two methods. In experiments with a random spike-train matching task and widely used benchmarks (MNIST and N-MNIST), our method achieved higher accuracy than previous methods when the networks are forced to use fewer spikes in training.
2 Background
2.1 Neuron model
We used a discrete-time version of the leaky integrate-and-fire (LIF) neuron with a current-based synapse model. The neuronal states of postsynaptic neuron i are formulated as

u_i[t] = α_v u_i[t−1] (1 − s_i[t−1]) + β_v i_i[t] + β_b b_i   (1)
i_i[t] = α_i i_i[t−1] (1 − s_i[t−1]) + β_i Σ_j w_ij s_j[t−1]   (2)
s_i[t] = Θ(u_i[t] − θ)   (3)

where u_i[t] is the membrane potential, i_i[t] is the synaptic current, and s_i[t] is the binary spike activation. w_ij is the synaptic weight from presynaptic neuron j, and b_i is a trainable bias parameter. Θ and θ are the spiking function and the threshold, respectively. α_v and α_i are the decay coefficients for the potential and the current, and β_v, β_i, and β_b are the scale coefficients. We call this type of description the RNN-like description since the temporal dependency between variables resembles that in RNNs (Neftci et al., 2019) (Figure 1(a)). The term (1 − s_i[t−1]) was introduced in Equations 1 and 2 to reset both the potential and the synaptic current. Note that this model can express various types of commonly used neuron models by changing the decay coefficients (Figure A1 in Appendix A).
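For concreteness, the forward pass of Equations 1 to 3 can be sketched as follows (a minimal NumPy illustration in our notation; the coefficient values are arbitrary, and this is not the CUDA implementation described in Appendix E):

```python
import numpy as np

def lif_forward(spikes_in, w, b, alpha_v=0.95, alpha_i=0.95,
                beta_v=1.0, beta_i=1.0, beta_b=1.0, theta=1.0):
    """Simulate Equations 1-3 for one LIF neuron with current-based synapses.

    spikes_in: (T, n_pre) binary array, w: (n_pre,) weights, b: bias.
    The (1 - s[t-1]) factor resets both u and i after an output spike.
    """
    T = spikes_in.shape[0]
    u = np.zeros(T)      # membrane potential
    i = np.zeros(T)      # synaptic current
    s = np.zeros(T)      # binary output spikes
    u_prev = i_prev = s_prev = 0.0
    for t in range(T):
        syn_in = w @ spikes_in[t - 1] if t > 0 else 0.0
        i[t] = alpha_i * i_prev * (1.0 - s_prev) + beta_i * syn_in   # Eq. 2
        u[t] = alpha_v * u_prev * (1.0 - s_prev) + beta_v * i[t] + beta_b * b  # Eq. 1
        s[t] = float(u[t] >= theta)                                  # Eq. 3
        u_prev, i_prev, s_prev = u[t], i[t], s[t]
    return u, i, s
```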
The same neuron model can also be formulated using the spike response kernel ε as

u_i[t] = Σ_j w_ij Σ_{t̂_i < t_j^(f) < t} ε[t − t_j^(f)] + β_b b_i   (4)
s_i[t] = Θ(u_i[t] − θ)   (5)

where t_j^(f) is a spike timing of neuron j, ε is the spike response kernel induced by Equations 1 and 2, and t̂_i is the last spike timing of neuron i before t. We call this type of description the SRM-based description as it is in the form of the Spike Response Model (SRM) (Gerstner, 1995) (Figure 1(b)). Detailed explanations on the equivalence of the two descriptions are given in Appendix B.
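As a quick numerical illustration of this equivalence, the recursive and kernel-based computations of the potential can be compared in the sub-threshold regime, where the reset terms vanish. The closed-form kernel below is a sketch assuming the double-exponential dynamics induced by Equations 1 and 2 in our notation:

```python
import numpy as np

# Sub-threshold check that the RNN-like recursion (Eqs. 1-2) matches the
# SRM-style kernel sum (Eq. 4). A presynaptic spike at step t' contributes
# w * eps[t - t'] to u[t], where
# eps[d] = beta_v * beta_i * sum_{m=1}^{d} alpha_i**(m-1) * alpha_v**(d-m).
alpha_v, alpha_i, beta_v, beta_i, w = 0.9, 0.8, 1.0, 1.0, 0.5
T = 20
spikes = np.zeros(T)
spikes[3] = 1.0   # one presynaptic spike at t' = 3

# RNN-like recursion (reset terms vanish because the neuron stays silent)
u = np.zeros(T)
i = np.zeros(T)
for t in range(1, T):
    i[t] = alpha_i * i[t - 1] + beta_i * w * spikes[t - 1]
    u[t] = alpha_v * u[t - 1] + beta_v * i[t]

# Kernel evaluation
def eps(d):
    return beta_v * beta_i * sum(alpha_i**(m - 1) * alpha_v**(d - m)
                                 for m in range(1, d + 1))

u_kernel = np.array([w * eps(t - 3) if t > 3 else 0.0 for t in range(T)])
assert np.allclose(u, u_kernel)
```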
2.2 Existing gradient computation methods
2.2.1 Activation-based methods
To backpropagate the gradients to the lower layers, the activation-based methods (Huh and Sejnowski, 2018; Shrestha and Orchard, 2018; Wu et al., 2018; Zenke and Ganguli, 2018) approximate the derivative of the spiking function, which is zero almost everywhere. This is similar to what non-spiking NNs do for quantized activation functions such as the thresholding function in Binary Neural Networks (Hubara et al., 2016). The approximated derivative is called the surrogate derivative (Neftci et al., 2019), and we will denote it as Θ′.
RNN-like method
Since the forward pass of the RNN-like description of the neuron model resembles that of non-spiking RNNs (Figure 1(a)), backpropagation can also be treated like BackPropagation Through Time (BPTT) (Werbos, 1990) (Figure 2(a); the equations are in Appendix C) (Huh and Sejnowski, 2018; Wu et al., 2018).
SRM-based method
In contrast, from the SRM-based description of the same model (Figure 1(b)), backpropagation is derived in a slightly different way using the kernel function between each layer (Figure 2(b)) (Shrestha and Orchard, 2018). From Equation 4, we can obtain the gradient of the membrane potential of the postsynaptic neuron i at an arbitrary time step t with respect to the spike activation of the presynaptic neuron j at time step t′ as

∂u_i[t] / ∂s_j[t′] = w_ij ε[t − t′].   (6)
Interestingly, we found that the SRM-based method (Figure 2(b)) is functionally equivalent to the RNN-like method except that the diagonal reset paths are removed (Figure 2(c); see Appendix D for a detailed explanation). In fact, neglecting the reset paths in backpropagation can improve the learning result, as it avoids the accumulation of approximation errors. Via the reset paths (red dashed arrows in Figure 2(a)), the same gradient value recursively passes through the surrogate derivative (red solid arrows in Figure 2(a)) as many times as the number of time steps. Even though the amount of approximation error from a single surrogate derivative is tolerable, the accumulated error can be orders of magnitude larger because the number of time steps is usually larger than hundreds. We experimentally observed that propagating gradients via the reset paths significantly degrades training results regardless of the task and network settings. In this regard, we used the SRM-based method instead of the RNN-like method to represent the activation-based methods throughout this paper.
2.2.2 Timing-based methods
The timing-based methods (Comsa et al., 2019; Mostafa, 2017; Bohte et al., 2002) exploit the differentiable relationship between the spike timing t^(f) and the membrane potential u[t^(f)] at the spike timing. The local linearity assumption of the membrane potential around t^(f) leads to ∂t^(f)/∂u[t^(f)] = −1/u̇[t^(f)], where u̇[t^(f)] is the time derivative of the membrane potential at time t^(f). In this work, we used the approximated time derivative for the discrete time domain, u̇[t] ≈ u[t] − u[t−1]. Note that computing the gradient of a spike timing does not require the derivative of the spiking function Θ.
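A minimal numerical illustration of this approximation (with a hypothetical potential trace, not data from our experiments):

```python
import numpy as np

# Discrete-time approximation used for the timing-based gradient:
# dt_f/du[t_f] = -1 / u_dot[t_f], with u_dot[t] ~ u[t] - u[t-1].
u = np.array([0.0, 0.2, 0.5, 0.9, 1.4])     # example membrane potential trace
theta = 1.0
t_f = int(np.argmax(u >= theta))             # first threshold crossing
u_dot = u[t_f] - u[t_f - 1]                  # approximated time derivative
dtf_du = -1.0 / u_dot                        # raising u at t_f makes the spike earlier
```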
From Equation 4 of the SRM-based description, we can obtain the gradient of the membrane potential of the postsynaptic neuron i at an arbitrary time step t with respect to the spike timing t_j^(f) of the presynaptic neuron j as

∂u_i[t] / ∂t_j^(f) = −w_ij ε̇[t − t_j^(f)]   (7)

where ε̇ is the approximated time derivative of the SRM kernel in the discrete time domain. Figure 2(d) depicts how the timing-based method propagates the gradients. Only at the time steps with spikes, ∂L/∂u is propagated to ∂L/∂t^(f), and then ∂L/∂t^(f) is propagated to the lower layer with Equation 7.
3 Activation- and Timing-based Learning Rule (ANTLR)
3.1 Complementary nature of the activation-based and timing-based methods
Calculating the gradients amounts to estimating how much the network output varies when the parameters or the variables are changed. One of the main findings of our study is that the activation-based and timing-based methods are complementary in the way they consider changes in the network.
The change in SNNs can be represented by the generation, the removal, and the shift of spikes. The generation or the removal of a spike is expressed as a change of the spike activation (0→1 or 1→0). The activation-based methods, which calculate the gradient with respect to the spike activations s[t], can therefore naturally consider generations and removals. On the other hand, the shift of a spike is expressed as a change of the spike timing t^(f). The timing-based methods, which calculate the gradient with respect to the spike timings t^(f), easily take account of spike shifts.
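The three elementary changes can be made concrete on binary spike trains (a small illustrative example, not from our experiments):

```python
import numpy as np

# The three elementary changes of a spike train, expressed on binary activations:
before = np.array([0, 1, 0, 0, 1, 0])
shift  = np.array([0, 0, 1, 0, 1, 0])   # spike at t=1 shifted to t=2
gen    = np.array([0, 1, 0, 1, 1, 0])   # new spike generated at t=3
rem    = np.array([0, 1, 0, 0, 0, 0])   # spike at t=4 removed

# A shift appears to activation-based methods as a pair of opposite
# activation changes; the net spike count stays the same.
diff = shift - before
assert diff.sum() == 0 and np.abs(diff).sum() == 2
# Generation and removal change the spike count, which timing-based
# methods (working only on existing spike times) cannot express.
assert (gen - before).sum() == +1
assert (rem - before).sum() == -1
```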
The problem with the activation-based methods is that they cannot deal with spike shifts accurately. In terms of the spike activations, a spike shift is interpreted as a pair of opposite spike-activation changes with a causal relationship through the reset path (Figure 3). Because of the major role of the reset path in the spike shift, gradient computation methods based on spike activations cannot consider the shift without precisely computing the gradients related to the reset paths. Unfortunately, as explained in Section 2.2.1, the SRM-based activation-based method does not have a reset path, so it cannot consider the spike shift at all. The RNN-like activation-based method has the reset paths, but it suffers from accuracy loss due to the accumulated errors in the reset path. Although the shift of an individual spike does not make a huge difference to the whole network when many spikes are generated and removed, it becomes important when there are not many spikes in the network.
The problem with the timing-based methods is that the generation and the removal of spikes cannot be described with the spike timings. The timing-based methods therefore cannot anticipate the change in the number of spikes in the network, which happens through the generation or the removal of spikes. Even though generations and removals happen less often than spike shifts when the parameters are updated by small amounts, their influences on the network are usually more significant.
3.2 Combining activation-based and timing-based gradients
To overcome the limitations of previous works, we propose a new backpropagation method for SNNs, called the activation- and timing-based learning rule (ANTLR), that combines the activation-based gradients and the timing-based gradients. The activation-based methods and the timing-based methods backpropagate the gradient through different intermediate gradients, ∂L/∂s and ∂L/∂t^(f), respectively. For this reason, the two approaches have been treated as completely different. However, there is another intermediate gradient, ∂L/∂u, calculated in both approaches. ∂L/∂u in the activation-based methods is propagated from ∂L/∂s and carries information about the generation and the removal of spikes, whereas ∂L/∂u in the timing-based methods is propagated from ∂L/∂t^(f) and carries information about the spike shift.
The main idea of ANTLR is to (1) combine the activation-based gradients and the timing-based gradients by taking a weighted sum and (2) propagate the combined gradients (Figure 4). In ANTLR, the gradients are backpropagated to the lower layers as

∂L/∂u_i[t] = λ_a (∂L/∂u_i[t])_A + λ_t (∂L/∂u_i[t])_T   (8)
(∂L/∂u_i[t])_A = Θ′(u_i[t] − θ) ((∂L/∂s_i[t])_loss + Σ_k Σ_{t′>t} (∂L/∂u_k[t′]) (∂u_k[t′]/∂s_i[t]))   (9)
(∂L/∂u_i[t])_T = (−1/u̇_i[t]) ((∂L/∂t_i^(f))_loss + Σ_k Σ_{t′>t} (∂L/∂u_k[t′]) (∂u_k[t′]/∂t_i^(f)))  for t = t_i^(f), and 0 otherwise   (10)

where the last two terms in Equation 9 are calculated using the activation-based method as in Section 2.2.1 and the last two terms in Equation 10 are calculated using the timing-based method as in Section 2.2.2.
To train SNNs using ANTLR and the other methods, we implemented CUDA-compatible gradient computation functions in PyTorch (Paszke et al., 2019); implementation details are described in Appendix E, and the source code will be released later. Note that ANTLR with the setting λ_a = 1, λ_t = 0 is equivalent to the activation-based method, whereas ANTLR with λ_a = 0, λ_t = 1 is equivalent to the timing-based method. Therefore, ANTLR can also be regarded as a unified framework that covers the two distinct approaches. In this work, we focused on showing the fundamental benefits of combining them and used the simplest setting λ_a = 1, λ_t = 1. Proper values of λ_a and λ_t may depend on the situation, but further studies are needed to precisely understand their influences.
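The combination step itself reduces to a weighted sum over precomputed per-time-step gradients (a simplified sketch; the coefficient names λ_a, λ_t are our notation, and the full procedure with layer-wise propagation is given in Algorithm 3):

```python
import numpy as np

def antlr_combine(grad_u_act, grad_u_tim, lambda_a=1.0, lambda_t=1.0):
    """Weighted sum of the two intermediate gradients dL/du[t].

    grad_u_act: activation-based dL/du (defined at every time step).
    grad_u_tim: timing-based dL/du (nonzero only at spike times).
    lambda_a = 1, lambda_t = 0 recovers the activation-based method;
    lambda_a = 0, lambda_t = 1 recovers the timing-based method.
    """
    return (lambda_a * np.asarray(grad_u_act, dtype=float)
            + lambda_t * np.asarray(grad_u_tim, dtype=float))
```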
3.3 Loss functions
Table 1: Three types of loss functions and the corresponding activation-based and timing-based gradients.

Type  Count  Spike-train  Latency
Loss (L)  (1/2) Σ_o (N_o − N̂_o)²  (1/2) Σ_o Σ_t ((κ∗s_o)[t] − (κ∗ŝ_o)[t])²  −Σ_o p̂_o log p_o
∂L/∂s_o[t]  N_o − N̂_o  Σ_{t′} ((κ∗s_o)[t′] − (κ∗ŝ_o)[t′]) κ[t′ − t]  0
∂L/∂t_o^(f)  0  −Σ_{t′} ((κ∗s_o)[t′] − (κ∗ŝ_o)[t′]) κ̇[t′ − t_o^(f)]  β(p̂_o − p_o) (first spike only)
Compatible with  Activation, ANTLR  Activation, Timing, ANTLR  Timing, ANTLR

Here o represents an index of the output neurons, N_o = Σ_t s_o[t] is the output spike count, p_o = exp(−β t_o^(1)) / Σ_{o′} exp(−β t_{o′}^(1)) is the softmax of the negatively weighted first spike timings, κ represents an exponential kernel, β is a scaling factor, N̂_o represents a target spike number, and p̂_o represents a target probability.

We used three types of widely used loss functions: count loss, spike-train loss, and latency loss (Table 1). Count loss is defined as the sum of squared errors between the output and target numbers of spikes of each output neuron. Spike-train loss is the sum of squared errors between the filtered output spike-train and the filtered target spike-train. Latency loss is defined as the cross-entropy of the softmax of the negatively weighted first spike timings of the output neurons. Note that the count loss cannot provide a gradient with respect to the spike timing, whereas the latency loss cannot provide a gradient with respect to the spike activation. This makes those loss types inapplicable to certain learning methods. We want to emphasize that ANTLR can use all three loss types.
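The three loss types can be illustrated with the following simplified NumPy definitions (a sketch; the exact kernels, scaling factors, and decision schemes used in the experiments are given in Table 1 and Appendix F):

```python
import numpy as np

def count_loss(s_out, n_target):
    """Count loss: squared error between output and target spike counts."""
    n = s_out.sum(axis=0)                    # spikes per output neuron
    return 0.5 * np.sum((n - n_target) ** 2)

def spike_train_loss(s_out, s_target, kappa=0.95):
    """Spike-train loss: squared error between exponentially filtered trains."""
    def filt(s):
        f = np.zeros(s.shape)
        acc = np.zeros(s.shape[1])
        for t in range(s.shape[0]):
            acc = kappa * acc + s[t]         # causal exponential filter
            f[t] = acc
        return f
    return 0.5 * np.sum((filt(s_out) - filt(s_target)) ** 2)

def latency_loss(t_first, target_idx, beta=1.0):
    """Latency loss: cross-entropy of softmax over negated first-spike times."""
    z = -beta * np.asarray(t_first, dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    return -np.log(p[target_idx])
```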
3.4 Estimated loss landscape
We conducted a simple experiment to visualize the gradients computed by each method. A fully-connected network with two hidden layers (10-50-50-1 neurons) was trained to minimize the spike-train loss with three random input spikes for each input neuron and a single target spike for the target neuron. After reaching the global optimum of zero loss, we perturbed all trainable parameters (weights and biases) along the first two principal components of the gradient vectors used in training and measured the true loss (Figure 5(a)). The lowest point at the center (dark blue region) represents the global minimum, and the subtle loss increase around the center shows the effect of the spike timing shift. The dramatic increase of the loss depicted in the right corner shows the loss increase from the spike number change. To emphasize the subtle height difference due to the spike timing shift, we highlighted the area adjacent to the global optimum where the number of spikes does not change using the color scheme in Figure 5(e).

Different learning methods provide different gradient values based on their distinct approaches. Using each method's gradient vector at each parameter point, we visualized the estimated loss landscape using a surface reconstruction method (Harker and O'Leary, 2008; Jordan, 2017) (Figure 5(b) to 5(d)). The activation-based method (Figure 5(b)) demonstrated the steep loss change due to the spike number change well, whereas the timing-based method (Figure 5(c)) could not take account of it. On the other hand, the timing-based method captured the subtle loss change due to the spike timing shift, while the activation-based method showed an almost flat loss landscape in the region without the spike number change. By combining both methods, ANTLR was able to capture both features at the same time (Figure 5(d)).
4 Experimental results
We evaluated the practical advantages of ANTLR compared to other methods using three different tasks: (1) random spike-train matching, (2) latency-coded MNIST, and (3) N-MNIST. Hyperparameters for training were grid-searched for each task (detailed experimental settings are in Appendix F). For the timing-based method, we added a no-spike penalty that increases the incoming synaptic weights of neurons without any spike, as in (Comsa et al., 2019).
4.1 Random spike-train matching
Using the same experimental setup as in Section 3.4, except for the varying number of target spikes and the different network size of 10-50-50-5, we measured the training loss of the networks trained by different learning methods (Figure 6). This task was used to assess the basic performance of the learning methods in a situation where each spike significantly affects the training results. During 50000 training iterations, both the activation-based method and ANTLR showed a noticeable decrease in loss, whereas the timing-based method failed to train the network as it cannot handle the spike number change. ANTLR outperformed the other methods with much faster convergence and lower loss.
4.2 Latency-coded MNIST
In this experiment, we applied latency coding to the input data of the MNIST dataset (LeCun et al., 1998) as in (Comsa et al., 2019; Mostafa, 2017). A larger intensity value of each pixel was represented by an earlier spike timing of the corresponding input neuron. We used this conversion to reduce the total number of spikes and create a situation where each learning method has to take account of precise spike timing for a better result.
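A simplified sketch of this conversion (the mapping convention and time resolution here are illustrative; the exact settings used in our experiments are in Appendix F):

```python
import numpy as np

def latency_encode(image, n_steps=100):
    """Map pixel intensity to a single input spike time: brighter pixels spike earlier.

    image: float array with values in [0, 1]; returns (n_steps, n_pixels)
    binary spike trains. Zero-intensity pixels never spike (one common
    convention; other works clip low intensities instead).
    """
    x = image.reshape(-1)
    spikes = np.zeros((n_steps, x.size))
    for j, v in enumerate(x):
        if v > 0:
            t = int(round((1.0 - v) * (n_steps - 1)))   # intensity 1 -> t = 0
            spikes[t, j] = 1.0
    return spikes
```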
The timing-based method and ANTLR used the latency loss, and the activation-based method used the count loss with a target spike number of 1/0 for correct/wrong labels. We also added a variant of the count loss to the total loss of ANTLR to prevent the target output neuron from being silent. Note that the target spike number for the activation-based method is much smaller than in previous works, since we applied latency coding to the input to reduce the number of input spikes. The output class can be determined either by the output neuron emitting the most spikes (most-spike decision scheme) or by the neuron emitting the earliest spike (earliest-spike decision scheme). The timing-based method and ANTLR used the earliest-spike decision scheme, whereas the activation-based method used the most-spike decision scheme, considering the loss types they used.
Figure 7: Test accuracy and the required number of hidden and output spikes to classify a single sample on (a) the latency-coded MNIST task and (b) the latency-coded MNIST task with the single-spike restriction. The values in the legend represent the mean and standard deviation of 16 trials.
We trained a network with a size of 784-800-10 and 100 time steps using a mini-batch size of 16 and a split of 50000/10000 images for the training/validation datasets. The test accuracy and the number of spikes used for each sample are shown in Figure 7(a). The number of spikes used to finish a task was usually not presented in previous works, but we include it to demonstrate the efficiency of the networks trained by the different methods. The results show that ANTLR achieved the highest accuracy. The number of spikes for the timing-based method was exceptionally higher than the others because of the no-spike penalty and its inability to remove existing spikes during training. Figure 7(b) shows a different scenario we tested, where each neuron is restricted to emit at most one spike as in (Comsa et al., 2019; Mostafa, 2017; Bohte et al., 2002). We tested this situation to further reduce the number of spikes. However, this modification did not change the trend of the results, as the number of spikes was already small in the first place.
Note that previous works reported higher accuracy, but those results were achieved with a large number of spikes. In this study, we focus on the cases in which the networks are forced to use fewer spikes for high energy efficiency. We believe that such cases represent more desirable environments for the application of SNNs.
4.3 N-MNIST
In contrast to the MNIST dataset, which is static, the spiking version of MNIST, called N-MNIST, is a dynamic dataset that contains samples of input spikes in a 34×34 spatial domain with two channels along 300 time steps (Orchard et al., 2015). The same loss and classification settings as in Section 4.2 were used here, except for the target spike number for the activation-based method, which was increased to 10/0 considering the increased number of input spikes in the N-MNIST dataset. Note that the latency loss and the earliest-spike decision scheme have not previously been used for the N-MNIST dataset, but we intentionally used them to reduce the number of spikes. We trained a network with a size of 2×34×34-800-10 using a mini-batch size of 16, and the results are shown in Figure 8(a).
Due to the large target spike number, the activation-based method required many more spikes than ANTLR. The timing-based method again used a large number of spikes because of its limitation in removing spikes. We also tested the scenario where the single-spike restriction is applied (Figure 8(b)). Since the activation-based method had to use a target spike number of 1/0 due to the restriction, its accuracy was degraded, whereas the timing-based method showed improvement in both accuracy and efficiency. This supports the observation that the activation-based method favors the multi-spike situation and the timing-based method favors the single-spike situation.
5 Discussion and conclusion
In this work, we presented and compared the characteristics of two existing approaches to gradient-based supervised learning for SNNs and proposed a new learning method called ANTLR that combines them. The experimental results on various tasks showed that the proposed method can improve the accuracy of the network in situations where the number of spikes is constrained, by precisely considering the influence of individual spikes.
It is known that both temporal coding and rate coding play important roles in information processing in biological neurons (Gerstner et al., 2014). Interestingly, the timing-based methods are closely related to temporal coding since they explicitly consider the spike timings in gradient computation. On the other hand, the activation-based methods are more favorable to rate coding, in which the spike timing change does not carry information. Even though we did not explicitly address the concepts of temporal coding and rate coding in this work, to the best of our knowledge, this is the first work that tries to unify the different learning methods suitable for different coding schemes.
Some other works not mentioned in this paper have also shown notable results as supervised learning methods for SNNs (Jin et al., 2018; Lee et al., 2016; Zhang and Li, 2019), but these methods are classified as neither activation-based nor timing-based. In these methods, a scalar variable mediates the backpropagation from the whole spike-train of a postsynaptic neuron to the whole spike-train of a presynaptic neuron. This variable may be able to capture the current state of the spike-train and its influence on another neuron, but it cannot cope with changes in the spike-train, such as the generation, the removal, or the timing shift, during training. This limitation may not be problematic with rate coding, in which the change in the state of individual spikes does not make a huge difference, but it is a critical problem when training SNNs with fewer spikes for higher efficiency.
Broader Impact
We believe that a broader impact discussion is not applicable to our work, because our work improves the general supervised learning performance of spiking neural networks and is not tied to a specific application.
This research was supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFCTC160351, the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP20192011100783) supervised by the IITP (Institute for Information & communications Technology Promotion), and NRF (National Research Foundation of Korea) Grant funded by the Korean Government (NRF2016Global Ph.D. Fellowship Program).
References
Akopyan et al. (2015) TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34 (10), pp. 1537–1557.
Bohte et al. (2002) Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48 (1–4), pp. 17–37.
Comsa et al. (2019) Temporal coding in spiking neural networks with alpha synaptic function. arXiv preprint arXiv:1907.13223.
Davies et al. (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99.
Diehl et al. (2015) Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
Gerstner et al. (2014) Neuronal dynamics: from single neurons to networks and models of cognition. Cambridge University Press.
Gerstner (1995) Time structure of the activity in neural network models. Physical Review E 51 (1), pp. 738.
Harker and O'Leary (2008) Least squares surface reconstruction from measured gradient fields. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7.
Hubara et al. (2016) Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115.
Huh and Sejnowski (2018) Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pp. 1433–1443.
Hunsberger and Eliasmith (2015) Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829.
Jin et al. (2018) Hybrid macro/micro level backpropagation for training deep spiking neural networks. In Advances in Neural Information Processing Systems, pp. 7005–7015.
Jordan (2017) pyGrad2Surf. GitLab. https://gitlab.com/chjordan/pyGrad2Surf/
LeCun et al. (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist
Lee et al. (2016) Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience 10, pp. 508.
Maass (1997) Networks of spiking neurons: the third generation of neural network models. Neural Networks 10 (9), pp. 1659–1671.
Mostafa (2017) Supervised learning based on temporal coding in spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems 29 (7), pp. 3227–3235.
Neftci et al. (2019) Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine 36 (6), pp. 51–63.
Orchard et al. (2015) Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9, pp. 437.
Park et al. (2020) T2FSNN: deep spiking neural networks with time-to-first-spike coding. arXiv preprint arXiv:2003.11741.
Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
Rueckauer and Liu (2018) Conversion of analog to spiking neural networks using sparse temporal coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5.
Shrestha and Orchard (2018) SLAYER: spike layer error reassignment in time. In Advances in Neural Information Processing Systems, pp. 1412–1421.
Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), pp. 1550–1560.
Wu et al. (2018) Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience 12.
Zenke and Ganguli (2018) SuperSpike: supervised learning in multilayer spiking neural networks. Neural Computation 30 (6), pp. 1514–1541.
Zhang et al. (2019) TDSNN: from deep neural networks to deep spike neural networks with temporal-coding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1319–1326.
Zhang and Li (2019) Spike-train level backpropagation for training deep recurrent spiking neural networks. In Advances in Neural Information Processing Systems, pp. 7800–7811.
Appendix
Appendix A Versatility of the neuron model
In our neuron model, depending on the decay coefficients α_v and α_i, the shape of the postsynaptic potential induced by a single spike can vary. Figure A1 shows some example cases of commonly used neuron models that can be implemented using our neuron model.
Appendix B Functional equivalence of the RNN-like description and the SRM-based description of the model
From the RNN-like description of the model (Equations 1 to 3), we can infer that the influence of s_j[t′], the spike activation of presynaptic neuron j at time step t′, on u_i[t], the potential of a postsynaptic neuron i at a later time step t, can be transmitted only via i_i[t′+1]. Then i_i[t′+1] forwards the influence to u_i[t′+1] and i_i[t′+2], and it continues with the u_i's and i_i's along the way.
If there is no spike activation of neuron i between t′ and t, this influence can reach u_i[t], and by the time it reaches, the amount of the influence from s_j[t′] becomes w_ij ε[t − t′]. If there is a spike activation of neuron i between t′ and t, this influence cannot be transmitted to u_i[t], since the reset term (1 − s_i) cuts off the signals that u_i and i_i receive.
Appendix C RNN-like activation-based method

From the RNN-like description of the model (Equations 1 to 3), the following BPTT-like backpropagation can be derived:

∂L/∂s_i[t] = (∂L/∂s_i[t])_loss + β_i Σ_k w_ki ∂L/∂i_k[t+1] − α_v u_i[t] ∂L/∂u_i[t+1] − α_i i_i[t] ∂L/∂i_i[t+1]   (11)
∂L/∂u_i[t] = Θ′(u_i[t] − θ) ∂L/∂s_i[t] + α_v (1 − s_i[t]) ∂L/∂u_i[t+1]   (12)
∂L/∂i_i[t] = β_v ∂L/∂u_i[t] + α_i (1 − s_i[t]) ∂L/∂i_i[t+1]   (13)

for t < T, with the boundary conditions at the last time step T

∂L/∂s_i[T] = (∂L/∂s_i[T])_loss   (14)
∂L/∂u_i[T] = Θ′(u_i[T] − θ) ∂L/∂s_i[T]   (15)
∂L/∂i_i[T] = β_v ∂L/∂u_i[T]   (16)

that results in the gradients for the parameter update as

∂L/∂w_ij = β_i Σ_t ∂L/∂i_i[t] s_j[t−1],   ∂L/∂b_i = β_b Σ_t ∂L/∂u_i[t].   (17)
Appendix D Interpreting SRM-based activation-based backpropagation with the RNN-like description

The forward passes of the RNN-like description and the SRM-based description are functionally equivalent, but the corresponding backpropagation methods derived from them are slightly different.
The SRM-based backpropagation can be summarized using the relationship between the potentials as follows:

∂L/∂u_j[t′] = (∂L/∂u_j[t′])_loss + Θ′(u_j[t′] − θ) Σ_i Σ_{t > t′} ∂L/∂u_i[t] w_ij ε[t − t′]   (18)

where the kernel function is given as ε[Δt] = β_v β_i Σ_{m=1}^{Δt} α_i^{m−1} α_v^{Δt−m}.
Similar to the derivation in Appendix B, the following backpropagation formulas can provide the same functionality as the SRM-based backpropagation:

∂L/∂s_j[t] = (∂L/∂s_j[t])_loss + β_i Σ_k w_kj c_k[t+1]   (19)
∂L/∂u_j[t] = Θ′(u_j[t] − θ) ∂L/∂s_j[t]   (20)
e_j[t] = ∂L/∂u_j[t] + α_v e_j[t+1]   (21)
c_j[t] = β_v e_j[t] + α_i c_j[t+1]   (22)

for t < T, with the boundary conditions

e_j[T] = ∂L/∂u_j[T]   (23)
c_j[T] = β_v e_j[T]   (24)

and the parameter gradients as

∂L/∂w_ij = β_i Σ_t c_i[t] s_j[t−1],   ∂L/∂b_i = β_b Σ_t e_i[t]   (25)

where e_j[t] is introduced to consider the temporal dependency between the potentials of the same neuron at different time steps.
Appendix E Implementation details of the learning methods
For the activation-based method and ANTLR, we used the surrogate derivative based on an exponential function as in (Shrestha and Orchard, 2018). For the timing-based method and ANTLR, the approximated time derivatives u̇ and ε̇ were calculated as u̇[t] = u[t] − u[t−1] and ε̇[t] = ε[t] − ε[t−1], respectively.
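A sketch of such an exponential surrogate derivative (the coefficient names ste_alpha and ste_beta match the hyperparameter tables in Appendix F; this snippet is illustrative, not our CUDA implementation):

```python
import numpy as np

def surrogate_derivative(u, theta=1.0, ste_alpha=0.3, ste_beta=1.0):
    """Exponential surrogate for the derivative of the spiking function:
    a scaled exponential that decays with the distance between the
    membrane potential and the threshold, peaking at u = theta.
    """
    return ste_alpha * np.exp(-ste_beta * np.abs(u - theta))
```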
Algorithms 1, 2, and 3 show the detailed backpropagation procedures of the activation-based method, the timing-based method, and ANTLR, respectively; ∂L/∂x is written as δx for better readability, and W^(l) represents the weight matrix between layer l and layer l+1. Note that δs and δt^(f) are calculated considering the loss function used (Table 1). The auxiliary variable e from Appendix D was used in all methods to reduce the total number of computations by not evaluating the kernel ε explicitly. For the same reason, we did not implement the for-loop related to ε̇ (Algorithms 2 and 3) in the actual implementation and used auxiliary variables similar to e.
Appendix F Experimental settings
Hyperparameters used for the loss landscape estimation (Section 3.4) and the random spike-train matching task (Section 4.1) are listed in Table A1. For the latency-coded MNIST task and the N-MNIST task, we grid-searched several hyperparameter options and report the results of the ones that provided the highest validation accuracy (averaged over 16 trials). Tables A2 and A3 show the searched hyperparameter options and the ones used for the final results.
Some of the hyperparameters were not mentioned in the main text. grad_clip is the threshold for clipping the parameter gradients before the update. init_bias_center is a binary option that initializes the bias with a large value to ease the generation of spikes in earlier training iterations. kappa_exp is the decay of the exponential filter used for the spike-train loss. ste_alpha and ste_beta are the coefficients of the surrogate derivative described in Appendix E.
Name  Value 

alpha_v, alpha_i  0.95, 0.95 
grad_clip  1e5 
init_bias_center  0 
kappa_exp  0.95 
learning_rate  1e-3
optimizer  ‘sgd’ 
ste_alpha  0.3 
ste_beta  1 
Hyperparameter  Searched options  Chosen for  
Activation  Timing  ANTLR  
alpha_v, alpha_i  (0.95, 0.95), (0.99, 0.99)  (0.99, 0.99)  (0.99, 0.99)  (0.99, 0.99) 
beta_softmax  0.5, 1, 2    1  1 
epoch  10  10  10  10 
grad_clip  1e6, 10, 1  1e6  1e6  1e6 
init_bias_center  0, 1  0  1  1 
learning_rate  1e-2, 1e-3, 1e-4  1e-3  1e-4  1e-3
max_target_spikes  1  1     
optimizer  ‘adam’  ‘adam’  ‘adam’  ‘adam’ 
ste_alpha  0.3, 1  1    1 
ste_beta  1, 3  3    3 
weight_decay  0, 1e-3, 1e-4  0  0  0
Hyperparameter  Searched options  Chosen for  
Activation  Timing  ANTLR  
alpha_v, alpha_i  (0.95, 0.95), (0.99, 0.99)  (0.99, 0.99)  (0.99, 0.99)  (0.99, 0.99) 
beta_softmax  1/6, 1/3, 2/3    1/3 (1/6)  1/6 
epoch  5  5  5  5 
grad_clip  1e6, 10, 1  10 (1)  1  1 
init_bias_center  0  0  0  0 
learning_rate  1e-2, 1e-3, 1e-4  1e-3  1e-4  1e-3
max_target_spikes  1, 3, 10 (1)  10 (1)     
optimizer  ‘adam’  ‘adam’  ‘adam’  ‘adam’ 
ste_alpha  0.3, 1  1    1 
ste_beta  1, 3  3    3 
weight_decay  0, 1e-3, 1e-4  0  0  0