TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks

by Rui-Jie Zhu, et al.

Spiking Neural Networks (SNNs) are a practical approach toward more data-efficient deep learning by simulating neurons that leverage temporal information. In this paper, we propose the Temporal-Channel Joint Attention (TCJA) architectural unit, an efficient SNN technique built on attention mechanisms that effectively enforces the relevance of spike sequences along both spatial and temporal dimensions. Our essential technical contributions are: 1) compressing the spike stream into an average matrix with a squeeze operation, then using two local attention mechanisms based on efficient 1-D convolution to establish temporal-wise and channel-wise relations for flexible feature extraction; 2) utilizing the Cross Convolutional Fusion (CCF) layer to model inter-dependencies between the temporal and channel scopes, which breaks the independence of the two dimensions and realizes the interaction between features. By jointly exploring and recalibrating the data stream, our method outperforms the state-of-the-art (SOTA) by up to 15.7% in top-1 classification accuracy on all tested mainstream static and neuromorphic datasets, including Fashion-MNIST, CIFAR10-DVS, N-Caltech 101, and DVS128 Gesture.



1 Introduction

Spiking Neural Networks (SNNs) are a valuable but challenging research area, offering lower energy consumption [28] and superior robustness [31] compared with conventional Artificial Neural Networks (ANNs), and holding substantial potential for temporal data processing. Recently, backpropagation has been introduced to SNNs to improve performance, allowing various ANN modules to be included in SNNs, such as the batch normalization block [38] and the residual block [12]. Nonetheless, integrating an ANN-based module in an efficient and bionic manner remains a challenge.

Although SNNs have improved considerably, they do not yet fully benefit from the superior representational capability of deep learning, because their unique training mode cannot model spatio-temporal relationships adequately. Zheng et al. [38] propose a batch normalization method along the temporal dimension, addressing gradient vanishing and threshold-input balance. Channel-wise, Wu et al. [33] propose NeuNorm, which includes an auxiliary neuron to adjust the strength of the stimulus generated by the preceding layer. By taking advantage of channel-wise information, NeuNorm improves performance while adding bio-plausibility by mimicking the activity of the retina and nearby cells. However, these methods deal with temporal and spatial information separately, limiting joint information extraction.

The attention mechanism mimics the human ability to selectively focus on information of interest while ignoring other information, and is worth exploring in SNNs [36]. In [3], an attention-based spike-timing-dependent plasticity SNN is proposed to address the spike-sorting problem in neuroscience. In addition, as shown in Fig. 1-a, Yao et al. [36] attach a channel attention block to the temporal-wise input of the SNN, assessing the significance of different frames during training and discarding irrelevant frames during inference.

Figure 1: How our Temporal-Channel Joint Attention differs from existing temporal-wise attention [36], which estimates the saliency of each time step with a squeeze-and-excitation module. $T$ denotes the time step, $C$ denotes the channel, and $H$, $W$ represent the spatial resolution. By utilizing two separate 1-D convolutional layers and the Cross Convolutional Fusion (CCF) operation, our Temporal-Channel Joint Attention establishes the association between the time step and the channel.

In this paper, we incorporate both temporal and channel attention mechanisms into SNNs, implemented with efficient 1-D convolution; Fig. 1-b shows the overall structure. We argue that this cooperative mechanism can enhance the discrimination of the learned features. The main contributions of this work are summarized as follows:

  1. We introduce a plug-and-play block into SNNs that considers temporal and channel attention cooperatively, modeling temporal and channel information in the same phase and achieving better adaptability and bio-interpretability. To the best of our knowledge, this is the first attempt to incorporate a temporal-channel attention mechanism into the most extensively used model, LIF-based SNNs.

  2. A Cross Convolutional Fusion (CCF) operation with a cross receptive field is proposed to make use of the associated information. It not only uses the benefit of convolution to minimize parameters but also integrates features from both temporal and channel dimensions in an efficient fashion.

  3. TCJA-SNN is simple and easy to implement, and consistently outperforms prior methods on the static Fashion-MNIST [35] dataset and the neuromorphic N-Caltech 101 [25], CIFAR10-DVS [19], and DVS128 Gesture [1] datasets. Specifically, we achieve 82.5% test accuracy on the N-Caltech 101 dataset, substantially improving over the previous best accuracy of 66.8%.

2 Related Works and Motivation

2.1 Spike-based backpropagation

In recent years, various ANN algorithms have been directly used for training deep SNNs, including gradient-descent-based algorithms. However, the non-differentiability of spikes is the main obstacle. Specifically, the Heaviside function used to trigger the spike has a derivative that is zero everywhere except at the origin, which prevents gradient-based learning. The surrogate gradient descent method [9; 13; 24; 23; 32; 38] is a common solution to this problem: the Heaviside function is preserved during the forward pass, but a surrogate function replaces it during the backward pass. A simple choice of surrogate function is the Spike-Operator [7], whose gradient resembles a shifted ReLU function. In this work, we take the surrogate gradient method further by adopting the ATan surrogate function and the triangle-like surrogate function designed by [10] and [2], respectively, which activate only a specific sample range and are thus friendly to the training of deep SNNs.
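To make the forward/backward substitution concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the ATan surrogate has the form g(x) = (1/π) arctan((π/2) α x) + 1/2 with a slope parameter α, and that the triangle-like surrogate has a pseudo-derivative of the form γ max(0, 1 - |x|). The function names and the defaults α = 2 and γ = 1 are illustrative choices; x stands for the membrane potential minus the firing threshold.

```python
import numpy as np

def heaviside(x):
    """Forward pass: emit a spike (1.0) where x >= 0, else 0.0."""
    return (np.asarray(x) >= 0).astype(np.float64)

def atan_surrogate(x, alpha=2.0):
    """Smooth stand-in for the step: (1/pi) * arctan(pi/2 * alpha * x) + 1/2."""
    return np.arctan(np.pi / 2 * alpha * x) / np.pi + 0.5

def atan_surrogate_grad(x, alpha=2.0):
    """Derivative of atan_surrogate, used in the backward pass instead of
    the (almost everywhere zero) derivative of the Heaviside function."""
    return (alpha / 2) / (1 + (np.pi / 2 * alpha * x) ** 2)

def triangle_surrogate_grad(x, gamma=1.0):
    """Triangle-like pseudo-derivative: peaks at x = 0, zero for |x| >= 1."""
    return gamma * np.maximum(0.0, 1.0 - np.abs(x))

x = np.array([-0.5, 0.0, 0.25])
spikes = heaviside(x)                  # what the next layer sees
grad_atan = atan_surrogate_grad(x)     # what backpropagation would use
grad_tri = triangle_surrogate_grad(x)
```

Both pseudo-derivatives are largest near the threshold and decay away from it, which is the "specific sample range" behavior described above.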

2.2 Squeeze and Excitation

In ANNs, the Squeeze and Excitation (SE) block [11] has been shown to be an effective module for enhancing representation. SE can be inserted into a network to recalibrate channel information while adding only a few parameters. By squeezing and then applying fully connected layers, a trainable scale for each channel is obtained. Recently, Yao et al. [36] applied SE to formulate a so-called temporal-wise attention for SNNs, with which the network can identify critical temporal frames of interest without being disturbed by irrelevant ones. Equipped with temporal-wise attention, this technique achieves SOTA performance on various datasets, demonstrating the vast potential of the attention mechanism in SNNs.
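The squeeze-then-fully-connect recipe can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code of [11] or [36]: the reduction ratio `r`, the random weights `w1` and `w2`, the omission of biases, and the function name `squeeze_excite` are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(x, w1, w2):
    """x: (C, H, W); w1: (C // r, C); w2: (C, C // r).

    Returns the recalibrated feature map and the per-channel scales."""
    z = x.mean(axis=(1, 2))             # squeeze: one scalar per channel, shape (C,)
    hidden = np.maximum(0.0, w1 @ z)    # bottleneck fully connected layer + ReLU
    scale = sigmoid(w2 @ hidden)        # excitation: per-channel scale in (0, 1)
    return x * scale[:, None, None], scale

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
x = rng.random((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y, scale = squeeze_excite(x, w1, w2)
```

With these sizes the block adds only 2 · C · (C / r) = 32 weights, which is the sense in which SE increases the parameter count only slightly.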

2.3 Motivation

Based on the above analysis, the temporal-wise attention mechanism has been introduced into SNNs to learn frame-based representations for processing time-related data streams. Beyond temporal information, channel feature recalibration within convolutional layers has high potential for performance improvement on the spatial side, as observed in retina cells [21] and validated in practice for ANNs [11] and SNNs [36]. However, these works process the data along either the temporal or the channel dimension only, limiting joint feature extraction. To present the correlation between time steps and channels, we visualize the input frame and several neighboring channel outputs from the first 2-D convolutional layer in Fig. 2. As the circles indicate, a similar firing pattern can be observed across neighboring time steps and channels. To fully use this associated information, we propose the TCJA module, a novel approach for modeling temporal and channel-wise frame correlations. Furthermore, considering the inevitable increase in model parameters caused by the attention mechanism, we adopt the 1-D convolution operation to achieve a reasonable tradeoff between model performance and parameter count.

Figure 2: Correlation between proximity time steps and channels. The top row is the input frame. Each figure in the nine-pattern grid of the bottom row denotes a channel output from the first 2-D convolutional layer. It is clear that a significant correlation exists in channels with varying time steps, motivating us to merge the temporal and channel information.

3 Methodology

3.1 Leaky Integrate and Fire Model

Dating back to 1907 [18], the Leaky-Integrate-and-Fire (LIF) model accumulates the membrane voltage by integrating the external input. Compared with other biological neuron models, the LIF model has the lowest computational cost while retaining some biological properties, making it suitable for simulating large-scale SNNs. It can be described by a differential equation [33]:

$$\tau \frac{dV(t)}{dt} = -\left(V(t) - V_{reset}\right) + I(t), \quad (1)$$

where $\tau$ denotes a time constant, $V(t)$ represents the membrane potential of the neuron at time $t$, $V_{reset}$ is the reset potential, and $I(t)$ represents the input from the presynaptic neurons, i.e., the product of a weight and a spiking input, which is computed by a convolutional or fully-connected layer. For better computational tractability, the LIF model can be described in an explicitly iterative version [28]:


$$V_t^n = H_{t-1}^n + \frac{1}{\tau}\left[I_{t-1}^n - \left(H_{t-1}^n - V_{reset}\right)\right],$$
$$S_t^n = \Theta\left(V_t^n - V_{threshold}\right),$$
$$H_t^n = V_t^n \cdot \left(1 - S_t^n\right), \quad (2)$$

where $n$ and $t$ respectively represent the indices of layer and time step, $\tau$ is a time constant, $V$ is the membrane potential, $S$ is the spiking tensor with binary values, $I$ denotes the input from the previous layer, $\Theta(\cdot)$ denotes the Heaviside step function, and $H$ represents the membrane potential after the reset process that follows spiking. As the mainstream neuron model in SNNs, the LIF model can be trained using STBP [14; 32] and surrogate gradient descent, delivering SOTA accuracy [6; 10; 36]. The LIF model is well-suited to common machine-learning frameworks because it allows forward and backward propagation along spatial and temporal dimensions. In our method, the LIF model serves as the spiking neuron and is trained with the surrogate gradient descent method described in Sec. 2.1.
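As an illustration of the iterative LIF dynamics, here is a minimal NumPy sketch of the common charge, fire, and hard-reset update. It is an assumption-laden example rather than the authors' code: the values `tau=2.0`, `v_threshold=1.0`, and `v_reset=0.0` are illustrative defaults and not taken from the text, the input at loop index `t` is treated as the current driving that step, and the function name `lif_forward` is hypothetical.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one layer of LIF neurons over time.

    inputs: array of shape (T, N), the input current per time step and neuron.
    Returns the binary spike train of shape (T, N).
    """
    inputs = np.asarray(inputs, dtype=np.float64)
    h = np.full(inputs.shape[1:], v_reset)          # potential after the last reset
    spikes = np.zeros_like(inputs)
    for t in range(inputs.shape[0]):
        v = h + (inputs[t] - (h - v_reset)) / tau   # leaky integration (charge)
        s = (v >= v_threshold).astype(np.float64)   # Heaviside firing
        h = v * (1.0 - s) + v_reset * s             # hard reset where a spike occurred
        spikes[t] = s
    return spikes

# A constant drive of 1.5 charges the neuron to 0.75, then to 1.125, which fires.
strong = lif_forward(np.full((4, 1), 1.5))[:, 0]
# A constant drive of 0.5 only approaches 0.5 and never reaches the threshold.
weak = lif_forward(np.full((10, 1), 0.5))
```

Under these assumed settings the strongly driven neuron fires on every second step, while the weakly driven one stays silent, showing how the leak bounds the potential below the input level.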

3.2 Temporal-Channel Joint Attention (TCJA)

Figure 3: The growth curve of parameters for the Fully-Connected (FC) layer versus the Temporal-Channel Joint Attention (TCJA) layer.

As mentioned above, we suggest that the frame at each time step is substantially associated with both its temporal and channel neighbors. We originally used a fully-connected layer to establish the correlation between temporal and channel information. However, as the number of channels and time steps increases, the number of parameters snowballs at a rate of $T^2 \times C^2$, as illustrated in Fig. 3. A 2-D convolutional layer can reduce this growth, but its receptive field is constrained by the fixed kernel size. For this reason, it is necessary to decrease the number of parameters while increasing the receptive field.

To address these difficulties, we propose a novel attention mechanism characterized by a global cross receptive field and relatively few parameters, named Temporal-Channel Joint Attention (TCJA), with only $T^2 \times K_C + C^2 \times K_T$ parameters, where $K_T$ and $K_C$ are the kernel sizes of the two 1-D convolutions introduced below. The purpose of TCJA is to quantify the saliency score between frames and their surroundings; the overall structure of TCJA is shown in Fig. 4. We now go over the specifics of TCJA. First, we apply the squeezing operation to the input frame in Sec. 3.2.1. Next, we introduce the temporal-wise local attention (TLA) mechanism and the channel-wise local attention (CLA) mechanism in Sec. 3.2.2 and Sec. 3.2.3, respectively, and then propose a cross convolutional fusion (CCF) mechanism to jointly learn temporal and channel information in Sec. 3.2.4.
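The difference in growth between the two designs can be checked with simple arithmetic. The sketch below rests on assumptions not spelled out in the text: the fully-connected layer is taken to map the flattened T × C matrix back to a T × C matrix without biases, the temporal convolution is taken to have C input and C output channels with kernel size `k_t`, and the channel convolution T input and T output channels with kernel size `k_c`. The function names and the sizes used are illustrative only.

```python
def fc_params(T, C):
    """Weights of a dense layer from a flattened T x C matrix to itself (no bias)."""
    return (T * C) ** 2

def tcja_params(T, C, k_t, k_c):
    """Weights of the two 1-D convolutions (no bias).

    Temporal branch: C x C x k_t weights; channel branch: T x T x k_c weights."""
    return C * C * k_t + T * T * k_c

for T, C in [(10, 64), (20, 128), (20, 256)]:
    fc = fc_params(T, C)
    conv = tcja_params(T, C, k_t=4, k_c=4)
    print(f"T={T:3d} C={C:4d}  FC={fc:>12,d}  1-D conv pair={conv:>9,d}")
```

For example, with T = 20 and C = 128 the dense layer needs 6,553,600 weights, while the pair of 1-D convolutions needs 67,136 under these assumptions.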

Figure 4: The framework of an SNN with the TCJA module. Information in SNNs flows as spike sequences along both the temporal and spatial dimensions. Temporal-wise, the spiking neuron feeds forward the membrane potential ($V$) and spike ($S$) as in Eq. 2, and backpropagates with the surrogate function. Spatial-wise, data flows between layers as in an ANN. The TCJA module first compresses the information both temporal-wise and spatial-wise, then applies TLA and CLA to establish relationships in both the temporal and channel dimensions, and blends them with the CCF layer.

3.2.1 Average Matrix by Squeezing

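In order to efficiently capture the temporal and channel correlations between frames, we first perform the squeeze step on the spatial feature map of the input frame stream $X \in \mathbb{R}^{T \times H \times W \times C}$, where $C$ denotes the channel size and $T$ denotes the time step. The squeeze step calculates an average matrix $Z \in \mathbb{R}^{C \times T}$, and each element $Z_{(c,t)}$ of the average matrix is:

$$Z_{(c,t)} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X^{(c,t)}_{i,j}, \quad (3)$$

where $X^{(c,t)}$ is the input frame of the $c$-th channel at time step $t$.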
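The squeeze step is a spatial mean per channel and time step. A minimal NumPy sketch follows, assuming the frame stream is stored with axes ordered as (T, H, W, C) and the average matrix is laid out as (C, T); both layouts and the function name `squeeze` are assumptions made for the example.

```python
import numpy as np

def squeeze(x):
    """x: (T, H, W, C) frame stream -> average matrix Z of shape (C, T).

    Z[c, t] is the mean over all spatial positions of channel c at time step t."""
    return x.mean(axis=(1, 2)).T

T, H, W, C = 4, 2, 2, 3
x = np.arange(T * H * W * C, dtype=np.float64).reshape(T, H, W, C)
Z = squeeze(x)
```

When the input is a binary spike tensor, each entry of `Z` is simply the fraction of spatial positions that fired in that channel at that time step.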

3.2.2 Temporal-wise Local Attention (TLA)

Following the squeeze operation, we propose the TLA mechanism for establishing temporal-wise relationships among frames. We argue that the frame at a specific time step interacts substantially with the frames at adjacent positions. Therefore, we adopt a 1-D convolution operation to model the local correspondence in the temporal dimension, as shown in Fig. 4. In detail, to capture the correlation of input frames at the temporal level, we perform $C$-channel 1-D convolution on each row of the average matrix $Z$, and then accumulate the feature maps obtained by convolving different rows of $Z$. The whole TLA process can be described as:

$$\mathcal{T}_{i,j} = \sum_{n=1}^{C} \sum_{m=0}^{K_T - 1} W^{m}_{(n,i)} Z_{(n,\, j+m)}, \quad (4)$$

where $K_T$ ($K_T < T$) represents the size of the convolution kernel, and $W^{m}_{(n,i)}$ is a learnable parameter, representing the $m$-th parameter of the $i$-th channel when performing $C$-channel 1-D convolution on the $n$-th row of $Z$. $\mathcal{T} \in \mathbb{R}^{C \times T}$ is the attention score matrix after the TLA mechanism.
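The temporal-wise convolution can be sketched directly in NumPy. This is an illustrative reading of the description above, not the authors' code: the average matrix is assumed to have shape (C, T), the weights are stored as an array of shape (C_out, C_in, K_T), and zero "same" padding along the time axis is assumed so that the output keeps the shape (C, T). The function name `tla` is hypothetical.

```python
import numpy as np

def tla(Z, W):
    """Temporal-wise local attention scores.

    Z: (C, T) average matrix; W: (C_out, C_in, K_T) with C_out == C_in == C.
    Returns a (C, T) score matrix."""
    C, T = Z.shape
    k = W.shape[2]
    left = (k - 1) // 2
    Zp = np.pad(Z, ((0, 0), (left, k - 1 - left)))     # zero-pad the time axis only
    out = np.zeros((C, T))
    for j in range(T):
        window = Zp[:, j:j + k]                         # (C_in, K_T) slice around step j
        out[:, j] = np.einsum("inm,nm->i", W, window)   # sum over rows n and kernel taps m
    return out

rng = np.random.default_rng(0)
Z = rng.random((5, 8))                 # 5 channels, 8 time steps
W = rng.standard_normal((5, 5, 3))     # kernel size 3
scores = tla(Z, W)                     # shape (5, 8)
```

Each output entry mixes all channels but only `K_T` neighboring time steps, which is what makes the attention local in time.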

3.2.3 Channel-wise Local Attention (CLA)

As mentioned above, the frame-to-frame saliency score should take into account not only the temporal dimension but also the information from adjacent frames along the channel dimension. To construct the correlation of different frames with their channel-wise neighbors, we propose the CLA mechanism. Similarly, as shown in Fig. 4, we perform $T$-channel 1-D convolution on each column of the matrix $Z$, and then add the convolution results of each column, which can be described as:

$$\mathcal{C}_{i,j} = \sum_{n=1}^{T} \sum_{m=0}^{K_C - 1} E^{m}_{(n,j)} Z_{(i+m,\, n)}, \quad (5)$$

where $K_C$ ($K_C < C$) represents the size of the convolution kernel, and $E^{m}_{(n,j)}$ is a learnable parameter, representing the $m$-th parameter of the $j$-th channel when performing $T$-channel 1-D convolution on the $n$-th column of $Z$. $\mathcal{C} \in \mathbb{R}^{C \times T}$ is the attention score matrix after the CLA mechanism.
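The channel-wise branch mirrors the temporal one with the roles of the two axes swapped. The NumPy sketch below is an illustration under assumptions rather than the authors' code: the average matrix has shape (C, T), the weights are stored as (T_out, T_in, K_C), and zero "same" padding along the channel axis keeps the output at (C, T). The function name `cla` is hypothetical.

```python
import numpy as np

def cla(Z, E):
    """Channel-wise local attention scores.

    Z: (C, T) average matrix; E: (T_out, T_in, K_C) with T_out == T_in == T.
    Returns a (C, T) score matrix."""
    C, T = Z.shape
    k = E.shape[2]
    top = (k - 1) // 2
    Zp = np.pad(Z, ((top, k - 1 - top), (0, 0)))        # zero-pad the channel axis only
    out = np.zeros((C, T))
    for i in range(C):
        window = Zp[i:i + k, :]                          # (K_C, T_in) slice around channel i
        out[i, :] = np.einsum("jnm,mn->j", E, window)    # sum over columns n and kernel taps m
    return out

rng = np.random.default_rng(1)
Z = rng.random((5, 8))                 # 5 channels, 8 time steps
E = rng.standard_normal((8, 8, 3))     # kernel size 3
scores = cla(Z, E)                     # shape (5, 8)
```

Here each output entry mixes all time steps but only `K_C` neighboring channels, the complement of the temporal branch.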

3.2.4 Cross Convolutional Fusion (CCF)

After the TLA and CLA operations, we obtain the temporal (TLA matrix $\mathcal{T}$) and channel (CLA matrix $\mathcal{C}$) saliency scores of the input frame and its adjacent frames, respectively. Next, to learn the correlation between temporal and channel frames in tandem, we propose a cross-domain information fusion mechanism, i.e., the CCF layer. The goal of CCF is to calculate a fusion information matrix $\mathcal{F}$, in which any position $\mathcal{F}_{i,j}$ measures the potential correlation between the $i$-th channel of the $j$-th input temporal frame and other frames.

Specifically, we can model the joint relationship between frames by element-wise multiplication of $\mathcal{T}$ and $\mathcal{C}$. It can be described as:

$$\mathcal{F}_{i,j} = \sigma\left(\mathcal{T}_{i,j} \cdot \mathcal{C}_{i,j}\right), \quad (6)$$

where $\sigma$ represents the Sigmoid function. Here, we provide a demonstration in Fig. 5 to aid comprehension of the whole computational process.

Figure 5: A demo of TCJA. Given an average matrix $Z$, the goal of TCJA is to calculate a fusion matrix $\mathcal{F}$ integrating temporal and channel information. For instance, for a specific element $\mathcal{F}_{i,j}$ in $\mathcal{F}$, the calculation pipeline is as follows: 1) calculate $\mathcal{T}_{i,j}$ through the TLA mechanism (Eq. 4); 2) use the CLA mechanism (Eq. 5) to calculate $\mathcal{C}_{i,j}$, with the calculation results shown in the black dotted box in the figure; 3) apply the CCF mechanism (Eq. 6) to jointly learn temporal and channel information and obtain $\mathcal{F}_{i,j}$. In addition, after the CCF mechanism, $\mathcal{F}_{i,j}$ integrates the information of the elements in the cross receptive field (colored areas in $Z$) with itself as the anchor point, which is what the name Cross Convolutional Fusion indicates.
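The fusion step itself is a Sigmoid applied to an element-wise product. A tiny worked example in NumPy follows; the two input matrices are made-up numbers standing in for the temporal and channel score matrices, and the function name `ccf` is a hypothetical label for this sketch.

```python
import numpy as np

def ccf(tla_scores, cla_scores):
    """Fuse two (C, T) score matrices: Sigmoid of their element-wise product."""
    return 1.0 / (1.0 + np.exp(-(tla_scores * cla_scores)))

tla_scores = np.array([[0.0, 1.0],
                       [2.0, -1.0]])
cla_scores = np.array([[3.0, 0.0],
                       [1.0, 2.0]])
F = ccf(tla_scores, cla_scores)
# Products are [[0, 0], [2, -2]], so F is about [[0.5, 0.5], [0.881, 0.119]].
```

Note that a zero in either branch yields the neutral value 0.5, while agreement in sign between the two branches pushes the fused score toward 1 and disagreement pushes it toward 0.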


3.3 Theoretical Analysis on Receptive Field

To better understand the highlights of the proposed method, we give the following theoretical analysis about the area where the network perceives and processes information during the training phase, i.e., the receptive field.

Lemma 1. (Cross-Correlation Scope (CCS) of 1-D convolution) For an input feature map $I \in \mathbb{R}^{C \times T}$, if the size of the 1-D convolution kernel is defined as $K$, then its CCS can be described as a $C \times K$ region, which involves the information along the second dimension of $I$.

Lemma 2. (CCS of two orthogonal 1-D convolutions) For an input feature map $I \in \mathbb{R}^{C \times T}$, the dot multiplication of two orthogonal 1-D convolutions performed on $I$ is equivalent to expanding the CCS into a cross shape, i.e., its CCS can be described by two cross-overlaid matrices of sizes $C \times K_T$ and $K_C \times T$ (see, e.g., the colored area of $Z$ in Fig. 5), where $K_T$ and $K_C$ are the sizes of the two convolution kernels, respectively.

Referring to Eq. 6, Lemma 1, and Lemma 2, we can obtain the following corollary:

Corollary. Based on the broad CCS obtained by TCJA, there exists information flow between the TLA matrix $\mathcal{T}$ and the CLA matrix $\mathcal{C}$, cooperatively considering the temporal and channel correlation, which is also indicated by Eq. 6.

Recalling Eq. 4 and Eq. 5, through two 1-D convolutions along different dimensions, we construct two perpendicular CCSs, which are stored in the TLA matrix $\mathcal{T}$ and the CLA matrix $\mathcal{C}$. In particular, TCJA constructs a cross-shaped CCS, which can perceive a larger area while realizing feature interaction in different directions. This cross receptive field removes the limitations of a single dimension, thus bringing performance improvements to the network.
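The cross-shaped scope can be checked numerically by perturbing one entry of the input at a time and recording whether a chosen output position changes. The sketch below is an illustration under assumptions: both branches use multi-channel 1-D convolutions with zero "same" padding, all weights are set to one so the check is deterministic, and the pre-Sigmoid product is inspected (the Sigmoid is strictly monotonic, so it does not change which inputs matter). All function names are hypothetical.

```python
import numpy as np

def conv_rows(Z, W):
    """Multi-channel 1-D convolution along axis 1 with zero 'same' padding.

    Z: (R, L); W: (R, R, K).
    out[i, j] = sum_n sum_m W[i, n, m] * Zpad[n, j + m]
    """
    k = W.shape[2]
    left = (k - 1) // 2
    Zp = np.pad(Z, ((0, 0), (left, k - 1 - left)))
    cols = [np.einsum("inm,nm->i", W, Zp[:, j:j + k]) for j in range(Z.shape[1])]
    return np.stack(cols, axis=1)

def pre_sigmoid_score(Z, W, E):
    """Element-wise product of the temporal and channel convolutions of Z (C, T)."""
    temporal = conv_rows(Z, W)        # slides along time, mixes all channels
    channel = conv_rows(Z.T, E).T     # slides along channels, mixes all time steps
    return temporal * channel

def dependency_mask(i, j, C, T, k_t, k_c):
    """True where changing Z[c, t] changes the score at position (i, j)."""
    Z = np.ones((C, T))
    W = np.ones((C, C, k_t))          # all-ones weights keep the check deterministic
    E = np.ones((T, T, k_c))
    base = pre_sigmoid_score(Z, W, E)[i, j]
    mask = np.zeros((C, T), dtype=bool)
    for c in range(C):
        for t in range(T):
            Zq = Z.copy()
            Zq[c, t] += 1.0
            mask[c, t] = pre_sigmoid_score(Zq, W, E)[i, j] != base
    return mask

mask = dependency_mask(i=2, j=3, C=6, T=7, k_t=3, k_c=3)
print(mask.astype(int))
```

With kernel size 3 in both directions, the mask for position (2, 3) covers a full-height band of three time steps and a full-width band of three channels, overlapping at the anchor: the cross shape described in Lemma 2.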

3.4 Training Framework

We integrate the TCJA module into existing benchmark SNNs and propose TCJA-SNN. Since the process of neuron firing is non-differentiable, we utilize the derived ATan surrogate function and the derived triangle-like surrogate function for backpropagation, which are proposed by [10] and [2], respectively. Following [9; 10], we adopt the spike Mean-Square-Error (SMSE) as the loss function of our TCJA-SNN, which can be calculated by:

$$L = \frac{1}{T} \sum_{t=0}^{T-1} L_t = \frac{1}{T} \sum_{t=0}^{T-1} \frac{1}{N} \sum_{i=0}^{N-1} \left(s_{t,i} - g_{t,i}\right)^2, \quad (7)$$

where $T$ denotes the simulation time step, $N$ is the number of labels, $s$ represents the network output, and $g$ represents the one-hot encoded target label. To estimate the classification accuracy, we define the predicted label $l_p$ as the index of the neuron with the highest firing rate, $l_p = \arg\max_i \frac{1}{T} \sum_{t=0}^{T-1} s_{t,i}$. Since the TCJA module simply utilizes 1-D convolutional layers and the Sigmoid function, it can be effortlessly introduced into current network architectures as a plug-and-play module without any adjustment to backpropagation.
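The loss and the prediction rule can be sketched as follows. This is a minimal NumPy illustration that assumes the loss averages the squared difference between the output spikes and the one-hot target over both time steps and classes, and that the same target is used at every time step; the function names and the toy spike train are made up for the example.

```python
import numpy as np

def smse_loss(spikes, label):
    """spikes: (T, N) binary network outputs; label: integer class index.

    Mean squared error against the one-hot target, averaged over T and N."""
    target = np.zeros(spikes.shape[1])
    target[label] = 1.0
    return float(np.mean((spikes - target) ** 2))   # target broadcasts over time

def predict(spikes):
    """Index of the output neuron with the highest firing rate over T steps."""
    return int(np.argmax(spikes.mean(axis=0)))

spikes = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]], dtype=np.float64)     # T = 4 steps, N = 3 classes
loss = smse_loss(spikes, label=1)                    # 2 wrong entries out of 12
pred = predict(spikes)                               # firing rates: 0.25, 0.75, 0.0
```

In this toy case two of the twelve entries deviate from the target, giving a loss of 2/12, and the neuron for class 1 has the highest firing rate.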

4 Experiments

We evaluate the classification performance of TCJA-SNN on both neuromorphic datasets (CIFAR10-DVS, N-Caltech 101, and DVS128 Gesture) and a static dataset (Fashion-MNIST). To verify the effectiveness of the proposed method, we integrate the TCJA module into several architectures [6; 10] with competitive performance to see whether the integrated architecture yields a significant improvement. More details of the datasets, network architecture, data augmentation, and pre-processing procedure can be found in Sec. 1 and Sec. 2 of the supplementary.

4.1 Comparison with Existing SOTA Works

The performance of two TCJA-SNN variants is compared with some SOTA models in Tab. 1 and Tab. 2. We train and test the two variants with the SpikingJelly [8] package, based on the PyTorch [26] framework, obtaining enhanced performance across all tasks. Some works [9; 34; 36] substitute binary spikes with floating-point spikes in whole or in part and retain the same temporal forward pipeline as an SNN to obtain improved classification accuracy. Thus, we devise two variants to validate the efficiency of TCJA-SNN, one of which utilizes Temporal Efficient Training (TET) [6]. On CIFAR10-DVS, we obtain a 3.4% advantage over the prior method with binary spikes. On N-Caltech 101, with only 14 time steps, we get a 15.7% increase over the prior best work. On DVS128, we get an accuracy of 99.0%, which is higher than TA-SNN [36] while using three times fewer simulation time steps. Furthermore, by using a basic 7-layer CNN on the static dataset Fashion-MNIST, our method achieves the highest classification accuracy with the fewest simulation time steps. Overall, with binary spikes, TCJA-SNN uses no more time steps than prior methods while achieving higher performance. Furthermore, our method can achieve even higher classification accuracy by adopting the non-binary spike technique.

| Method | Binary Spikes | CIFAR10-DVS Step | CIFAR10-DVS Acc. (%) | N-Caltech 101 Step | N-Caltech 101 Acc. (%) | DVS128 Step | DVS128 Acc. (%) |
|---|---|---|---|---|---|---|---|
| SLAYER [29] (NeurIPS-2018) | | - | - | - | - | 1600 | 93.4 |
| HATS [30] (CVPR-2018) | N/A | N/A | 52.4 | N/A | 64.2 | - | - |
| DART [27] (TPAMI-2019) | N/A | N/A | 65.8 | N/A | 66.8 | - | - |
| NeuNorm [33] (AAAI-2019) | | 230-292 | 60.5 | - | - | - | - |
| Rollout [17] (Front. Neurosci-2020) | | 48 | 66.8 | - | - | 240 | 97.2 |
| DECOLLE [15] (Front. Neurosci-2020) | | - | - | - | - | 500 | 95.5 |
| LIAF-Net [34] (TNNLS-2021) | | 10 | 70.4 | - | - | 60 | 97.6 |
| tdBN [38] (AAAI-2021) | | 10 | 67.8 | - | - | 40 | 96.9 |
| PLIF [10] (ICCV-2021) | | 20 | 74.8 | - | - | 20 | 97.6 |
| TA-SNN [36] (ICCV-2021) | | 10 | 72.0 | - | - | 60 | 98.6 |
| SEW-ResNet [9] (NeurIPS-2021) | | 16 | 74.4 | - | - | 16 | 97.9 |
| Dspike [20] (NeurIPS-2021) | | 10 | 75.4 | - | - | - | - |
| SALT [16] (Neural Netw-2021) | | 20 | 67.1 | 20 | 55.0 | - | - |
| TET [6] (ICLR-2022) | | 10 | 83.2 | - | - | - | - |
| DSR [22] (CVPR-2022) | | 10 | 77.3 | - | - | - | - |
| TCJA-SNN | | 10 | 80.7 | 14 | 78.5 | 20 | 99.0 |
| TCJA-TET-SNN | | 10 | 83.3 | 14 | 82.5 | 20 | 98.2 |

With Data Augmentation.

Table 1: The comparison between the proposed methods and existing SOTA techniques on three mainstream neuromorphic datasets. (Bold: the best)

| Method | Binary Spike | Time Step | Accuracy (%) |
|---|---|---|---|
| ST-RSBP [37] (NIPS-2019) | | 400 | 90.1 |
| LISNN [5] (IJCAI-2020) | | 20 | 92.1 |
| PLIF [10] (ICCV-2021) | | 8 | 94.4 |
| TCJA-SNN | | 8 | 94.8 |
| TCJA-TET-SNN | | 8 | 94.6 |

Table 2: Static Fashion-MNIST accuracy.

| Block | CIFAR10-DVS (%) | N-Caltech 101 (%) | DVS128 (%) |
|---|---|---|---|
| TLA | 79.7 | 78.3 | 97.9 |
| CLA | 80.5 | 78.4 | 98.6 |
| TCJA | 80.7 | 78.5 | 99.0 |

Table 3: Accuracy of different blocks.

Figure 6: Convergence of compared SNN methods on DVS128 Gesture.

Figure 7: Variation in test accuracy on DVS128 Gesture dataset as kernel size increases.

4.2 Ablation Study and Discussions

Ablation Study. To investigate the influence of the TLA and CLA modules, we conduct several ablation studies. As shown in Tab. 3, the CLA module contributes significantly to performance enhancement. This is because there are far fewer simulation time steps than channels in most SNN designs, allowing the CLA module to extract more relevant features than TLA. Notably, the full TCJA module exhibits improved performance across all datasets examined, demonstrating the effectiveness of the CCF layer.

Discussion of Kernel Size. We first investigate the kernel size in the TCJA module. Intuitively, as the kernel size increases, the receptive field of the local attention mechanism also expands, which may help enhance the performance of TCJA-SNN. However, the experimental results in Fig. 7 overturn this conjecture: as the kernel size increases, the performance of the model fluctuates. When the kernel size is less than 4, the model achieves its best results. One reasonable explanation is that a frame mainly correlates with its nearby frames, and an excessively large receptive field may introduce undesired noise.

Discussion of Convergence. We also empirically demonstrate the convergence of our proposed method, as shown in Fig. 6. Specifically, Fig. 6 illustrates the performance trend of vanilla LIF-SNN, PLIF [10], and our proposed TCJA-SNN over 1000 epochs. As training proceeds, the performance of our proposed method becomes more stable and converges to a higher level. Moreover, TCJA-SNN achieves SOTA performance after only about 260 training epochs, which demonstrates the efficacy of the proposed TCJA.

Figure 8: Attention distribution between time step and channel. The top row is the weight from the first TCJA module in TCJA-SNN working with the DVS128 Gesture dataset. In the bottom row, we select sparse and dense attention frames both temporal-wise ($T$) and channel-wise ($C$).

Discussion of Attention Visualization. To make the attention mechanism easier to understand, we finally visualize the output of the first TCJA module in TCJA-SNN working with the DVS128 Gesture dataset, as shown in Fig. 8. Changes in attention weights are primarily accumulated among channels, further verifying the substantial role played by the CLA in the TCJA module. To illustrate the attention weights, we extract some temporal-wise and channel-wise frames. The difference in firing pattern along the channel dimension is more significant than that along the temporal dimension. Further discussion can be found in Sec. 3 of the supplementary.

5 Conclusion

In this paper, we propose the TCJA mechanism, which innovatively recalibrates temporal and channel information in SNNs. Specifically, instead of utilizing a generic fully connected network, we use 1-D convolution to build the correlation between frames, reducing computation and improving model performance. Moreover, we propose a CCF mechanism to realize joint feature interaction between temporal and channel information. Extensive experiments verify the effectiveness of our method with SOTA results on four datasets, i.e., CIFAR10-DVS (83.3%), N-Caltech 101 (82.5%), DVS128 (99.0%), and Fashion-MNIST (94.8%). However, inserting TCJA still results in a relatively sizable increase in the number of parameters, as detailed in Sec. 3 of the supplementary. In future work, we believe this method can be easily integrated into neuromorphic chips, thanks to the hardware-friendly 1-D convolution operation and the binary spiking network structure.


  • [1] A. Amir, B. Taba, D. Berg, and et al. (2017) A Low Power, Fully Event-Based Gesture Recognition System. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Cited by: item 3, item 4a.
  • [2] G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass (2018) Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31, pp. . Cited by: §2.1, §3.4.
  • [3] M. Bernert and B. Yvert (2019) An Attention-Based Spiking Neural Network for Unsupervised Spike-Sorting. International Journal of Neural Systems 29 (8), pp. 1850059:1–1850059:19. Cited by: §1.
  • [4] S. M. Bohté, J. N. Kok, and J. A. L. Poutré (2002) Error-backpropagation in Temporally Encoded Networks of Spiking Neurons. Neurocomputing 48 (1-4), pp. 17–37. Cited by: §1.
  • [5] X. Cheng, Y. Hao, J. Xu, and B. Xu (2020) LISNN: Improving Spiking Neural Networks with Lateral Interactions for Robust Object Recognition. In

    International Joint Conference on Artificial Intelligence (IJCAI)

    pp. 1519–1525. Cited by: Table 3.
  • [6] S. Deng, Y. Li, S. Zhang, and S. Gu (2021) Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting. In International Conference on Learning Representations (ICLR), Cited by: §3.1, §4.1, Table 1, §4.
  • [7] J. K. Eshraghian, M. Ward, E. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, and W. D. Lu (2021) Training Spiking Neural Networks Using Lessons From Deep Learning. ArXiv. External Links: 2109.12894, Document Cited by: §2.1.
  • [8] W. Fang, Y. Chen, J. Ding, D. Chen, Z. Yu, H. Zhou, Y. Tian, and other contributors (2020) SpikingJelly. Note: 2022-05-04 Cited by: §4.1.
  • [9] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian (2021) Deep Residual Learning in Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 21056–21069. Cited by: §2.1, §3.4, §4.1, Table 1.
  • [10] W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian (2021) Incorporating Learnable Membrane Time Constant To Enhance Learning of Spiking Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2661–2671. Cited by: §2.1, §3.1, §3.4, §4.2, Table 1, Table 3, §4.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §2.3.
  • [12] Y. Hu, H. Tang, and G. Pan (2018) Spiking Deep Residual Networks. IEEE Transactions on Neural Networks and Learning Systems (), pp. 1–6. Cited by: §1.
  • [13] C. Jin, R. Zhu, X. Wu, and L. Deng (2022) SIT: A Bionic and Non-Linear Neuron for Spiking Neural Network. ArXiv preprint arXiv:2203.16117. Cited by: §2.1.
  • [14] Y. Jin, W. Zhang, and P. Li (2018) Hybrid Macro/Micro Level Backpropagation for Training Deep Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31, pp. . Cited by: §3.1.
  • [15] J. Kaiser, H. Mostafa, and E. Neftci (2020) Synaptic Plasticity Dynamics for Deep Continuous Local Learning (DECOLLE). Frontiers in Neuroscience 14, pp. 424. Cited by: Table 1.
  • [16] Y. Kim and P. Panda (2021) Optimizing Deeper Spiking Neural Networks for Dynamic Vision Sensing. Neural Networks 144, pp. 686–698. Cited by: Table 1.
  • [17] A. Kugele, T. Pfeil, M. Pfeiffer, and E. Chicca (2020) Efficient Processing of Spatio-temporal Data Streams with Spiking Neural Networks. Frontiers in Neuroscience 14, pp. 439. Cited by: Table 1.
  • [18] L. Lapique (1907) Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization. Journal of Physiology and Pathology 9, pp. 620–635. Cited by: §3.1.
  • [19] H. Li, H. Liu, X. Ji, G. Li, and L. Shi (2017) CIFAR10-DVS: An Event-Stream Dataset for Object Classification. Frontiers in Neuroscience 11, pp. 309. Cited by: item 3, item 4a.
  • [20] Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, and S. Gu (2021) Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 23426–23439. Cited by: Table 1.
  • [21] V. Mante, V. Bonin, and M. Carandini (2008) Functional mechanisms shaping lateral geniculate responses to artificial and natural stimuli. Neuron 58 (4), pp. 625–638. Cited by: §2.3.
  • [22] Q. Meng, M. Xiao, S. Yan, Y. Wang, Z. Lin, and Z. Luo (2022) Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation. ArXiv preprint arXiv:2205.00459. Cited by: Table 1.
  • [23] E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine 36 (6), pp. 51–63. Cited by: §2.1.
  • [24] E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine 36 (6), pp. 51–63. Cited by: §2.1.
  • [25] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor (2015) Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Frontiers in Neuroscience 9, pp. 437. Cited by: item 3, item 4a.
  • [26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32. Cited by: §4.1.
  • [27] B. Ramesh, H. Yang, G. Orchard, N. A. Le Thi, S. Zhang, and C. Xiang (2019) DART: Distribution Aware Retinal Transform for Event-Based Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (11), pp. 2767–2780. Cited by: Table 1.
  • [28] K. Roy, A. Jaiswal, and P. Panda (2019) Towards spike-based machine intelligence with neuromorphic computing. Nature 575 (7784), pp. 607–617. Cited by: §1, §3.1.
  • [29] S. B. Shrestha and G. Orchard (2018) SLAYER: Spike Layer Error Reassignment in Time. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31. Cited by: Table 1.
  • [30] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman (2018) HATS: Histograms of Averaged Time Surfaces for Robust Event-Based Object Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1731–1740. Cited by: Table 1.
  • [31] E. Stromatias, D. Neil, M. Pfeiffer, F. Galluppi, S. B. Furber, and S. Liu (2015) Robustness of Spiking Deep Belief Networks to Noise and Reduced Bit Precision of Neuro-inspired Hardware Platforms. Frontiers in Neuroscience 9, pp. 222. Cited by: §1.
  • [32] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi (2018) Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks. Frontiers in Neuroscience 12, pp. 331. Cited by: §2.1, §3.1.
  • [33] Y. Wu, L. Deng, G. Li, J. Zhu, Y. Xie, and L. Shi (2019) Direct Training for Spiking Neural Networks: Faster, Larger, Better. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1311–1318. Cited by: §1, §3.1, Table 1.
  • [34] Z. Wu, H. Zhang, Y. Lin, G. Li, M. Wang, and Y. Tang (2021) LIAF-Net: Leaky Integrate and Analog Fire Network for Lightweight and Efficient Spatiotemporal Information Processing. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14. Cited by: §4.1, Table 1.
  • [35] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv preprint arXiv:1708.07747. Cited by: item 3, item 4a.
  • [36] M. Yao, H. Gao, G. Zhao, D. Wang, Y. Lin, Z. Yang, and G. Li (2021) Temporal-wise Attention Spiking Neural Networks for Event Streams Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10201–10210. Cited by: Figure 1, §1, §2.2, §2.3, §3.1, §4.1, Table 1.
  • [37] W. Zhang and P. Li (2019) Spike-Train Level Backpropagation for Training Deep Recurrent Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32. Cited by: Table 3.
  • [38] H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li (2021) Going Deeper With Directly-Trained Larger Spiking Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11062–11070. Cited by: §1, §1, §2.1, Table 1.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] Our abstract and introduction clearly describe our contribution, the algorithm, and the experimental results.

    2. Did you describe the limitations of your work? [Yes] We mention our limitations in Sec. 5 and analyze them in Sec. 3 of the supplementary.

    3. Did you discuss any potential negative societal impacts of your work? [No] This work is theoretical research on spiking neural networks. For the time being, it does not present any foreseeable negative societal impact.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? [Yes] All assumptions are stated in Sec. 3.3.

    2. Did you include complete proofs of all theoretical results? [Yes] All proofs are provided in Sec. 3.3.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Our code will be available in the supplementary material.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sec. 2 of the supplementary.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Testing our model is time-consuming on our servers, and we lack the resources and time to repeat our experiments multiple times.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The computational resources are shown in Sec. 2 of the supplementary.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? [Yes] We use DVS128 Gesture [1], CIFAR10-DVS [19], N-Caltech 101 [25], and Fashion-MNIST [35] datasets.

    2. Did you mention the license of the assets? [No] We adopt public datasets.

    3. Did you include any new assets either in the supplemental material or as a URL? [No]

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [No]

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]