Convolutional Spiking Neural Networks for Spatio-Temporal Feature Extraction

03/27/2020 ∙ by Ali Samadzadeh, et al. ∙ 0

Spiking neural networks (SNNs) can be used in low-power and embedded systems (such as emerging neuromorphic chips) due to their event-based nature. Also, they have the advantage of low computation cost in contrast to conventional artificial neural networks (ANNs), while preserving ANN's properties. However, temporal coding in layers of convolutional spiking neural networks and other types of SNNs has yet to be studied. In this paper, we provide insight into spatio-temporal feature extraction of convolutional SNNs in experiments designed to exploit this property. Our proposed shallow convolutional SNN outperforms state-of-the-art spatio-temporal feature extractor methods such as C3D, ConvLstm, and similar networks. Furthermore, we present a new deep spiking architecture to tackle real-world problems (in particular classification tasks), and the model achieved superior performance compared to other SNN methods on CIFAR10-DVS. It is also worth noting that the training process is implemented based on spatio-temporal backpropagation, and ANN to SNN conversion methods will serve no use.




1 Introduction

Spiking neural networks (SNNs) encode data in sequences of spike signals. They may execute more complex cognitive tasks in a way that more closely resembles the processing pattern of the brain cortex (Allen et al., 2009; Zhang et al., 2013; Kasabov and Capecci, 2015). When a neuron's membrane potential reaches a threshold, the neuron fires, transmits a spike, and then resets. To be precise, spikes are binary codes whose effect decays in time (like a capacitor's charge). This binary nature makes spiking neural networks efficient in terms of memory consumption and computation cost, which leads to lower power consumption (as demonstrated in (Wang et al., 2020)).

Another interesting property of SNNs is instantaneous output per stream of temporal inputs. As demonstrated in (Rekabdar et al., 2017) spikes are generated as soon as they detect a specific pattern in the data, as opposed to other architectures that require a whole chunk of data. Any task accomplished by a standard artificial neural network can also be carried out by a similar spiking network (Maass, 1997). Temporal or spatial coding in SNNs may be required for resolving the task. The temporal coding concept is proven in (Mostafa, 2017) and according to (Neftci et al., 2019) SNNs also have spatial coding.

Based on biological structure, there are several ways of modeling SNNs. Spike-timing-dependent plasticity (STDP) most closely resembles the natural functionality of brain neurons (Lee et al., 2018; Markram et al., 2011; Caporale and Dan, 2008). Leaky integrate and fire (LIF) is another model, which imitates natural spiking neurons in a simple way. This model is suitable for gradient descent training and backpropagation (Wu et al., 2018). The key property of the LIF model is the threshold function, which plays the role of the activation function after each convolution layer in convolutional spiking neural networks. This layer encodes the feature representations of inputs.

GPUs are built around vector processing units; therefore, they are not ideal for implementing SNNs. The native hardware that supports the decay and spiking properties of SNNs is the neuromorphic chip (Mead, 1990; Seo et al., 2011; Carrillo et al., 2012b, a; Merolla et al., 2014; Akopyan et al., 2015; Schuman et al., 2017).

Since the commercial release of event cameras in 2008, applications for low-power spiking neural networks have emerged. Specifically, SNNs can be utilized in real-time applications and harsh environments (e.g., extreme lighting conditions). Real-time applications include visual simultaneous localization and mapping or visual odometry (also known as VSLAM or VO) (Kueng et al., 2016; Kim et al., 2016; Rebecq et al., 2016), pose tracking (Mueggler et al., 2014; Gallego et al., 2017), and others. They are also useful in high-speed applications such as object recognition in self-driving cars (Wang et al., 2020). Moreover, owing to the very low power consumption, low latency, and robustness to lighting conditions of event cameras, applications of SNNs can be extended to other vision domains if the price of event cameras drops.

We follow (Wu et al., 2018) to train our architectures based on spatio-temporal backpropagation method. Furthermore, we demonstrate spatio-temporal feature extraction property of a shallow conv-SNN and we also propose a novel deep architecture of convolutional spiking neural network to tackle complex tasks. To summarize, the contributions of our work are as follows:

  • We analyse convolutional SNNs as spatio-temporal feature extractors.

  • We clarify specific properties of a spatio-temporal dataset to compare SNNs and ANNs.

  • We propose a novel spatio-temporal test case to challenge other extractors.

  • Finally, we introduce a novel deep SNN model to tackle real-world problems.

2 Related work

Numerous methods have been proposed for training SNNs. Most studied methods focus on converting the weights of an ANN model to an equivalent SNN (Diehl et al., 2015; Esser et al., 2015; Rueckauer et al., 2017; Stromatias et al., 2017). These models suppress the temporal coding properties of SNNs; therefore, they can only be used to convert a high-performance spatial ANN to an SNN. Another approach is to train SNNs directly. The main problem in this approach is the non-differentiability of the spiking function. Many methods have been proposed to solve this problem, as discussed in (Neftci et al., 2019). (Wu et al., 2018) overcame it by approximating the derivative of the threshold function. This approach is straightforward and can be implemented in most deep learning frameworks.

There are many spatio-temporal extractors. Convolutional neural networks are mostly used in computer vision tasks (Tavanaei et al., 2019); CNN+LSTM, proposed in (Sainath et al., 2015), and the C3D network (Tran et al., 2015) are appropriate for modeling spatio-temporal information. ConvLSTM, demonstrated in (Xingjian et al., 2015), is also suitable for spatio-temporal feature learning. As mentioned in (Srivastava et al., 2015), there are some spatio-temporal datasets such as MovingMnist and CIFAR10-DVS to evaluate these methods.

Deep SNNs are a method for processing event-based data (Tavanaei et al., 2019). Even so, going deeper in spiking neural networks is a great challenge. (Wu et al., 2019) try to recreate batch normalization for SNNs in order to use its properties and build deeper networks. However, the proposed method is only tested on wider networks, not deeper ones. The NeuNorm solution does not have the same properties as batch normalization, so it is of little help in training deeper networks (more than ten layers).

(Hu et al., 2018) proposed a deep SNN based on residual networks, but it is a conversion from an ANN to an SNN. Common ways of training deep SNNs are described in (Sengupta et al., 2019). However, none of the conventional methods trains a purely spiking neural network directly.

In the following sections, the mathematical equations for training SNNs and some appropriate datasets are introduced. Some test cases are designed to show that SNNs are good feature extractors. The architecture of the proposed deep SNN is then explained in detail, and the final section investigates the results and describes the implementation details.

3 Background

In this section, the details of the LIF model, back-propagation through time in SNNs, issues of adapting batch normalization, and the properties of SNN datasets are discussed.

3.1 Leaky integrate and fire (LIF)

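From the perspective of implementing LIF neurons, the model is defined as follows:

$$U^{t,l} = \alpha \, U^{t-1,l} \odot \left(\mathbf{1} - O^{t-1,l}\right) + F\left(O^{t,l-1}\right) \qquad (1)$$

In (1), $U^{t,l}$ is the membrane potential, $O^{t,l}$ is the binary spike output, $\mathbf{1}$ is a matrix of ones and $F$ is the layer operation. For linear layers, $F$ is defined as:

$$F\left(O^{t,l-1}\right) = W^{l} O^{t,l-1} + b^{l} \qquad (2)$$

The leftmost term on the right-hand side of (1) is for neuron rest and enforcing sparsity in the LIF neurons. $A$ is the activation function, which can be interpreted as a threshold function. For each node, this function is defined as:

$$O^{t,l} = A\left(U^{t,l}\right), \qquad A(u) = \begin{cases} 1, & u \ge V_{th} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

In the equations above, $l$ is the layer number, $t$ is the time-stamp, $V_{th}$ is the threshold, and $\alpha$ is the decay factor in (1). The decay factor needs careful tuning. All of the equations are in matrix form, except the activation function.

The difference between an ANN neuron and an SNN neuron is the leftmost term of (1) and the activation function. The following operation needs to be performed on the output of the last layer $L$ to obtain the output of the SNN (assuming rate encoding over an arbitrary time window of length $T$):

$$\text{output} = \frac{1}{T} \sum_{t=1}^{T} O^{t,L} \qquad (4)$$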
Figure 1 summarizes the description of LIF model. The architecture presented in this paper employs the mentioned LIF neuron model.
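The LIF behavior described in this subsection (decay of the previous membrane potential, reset after a spike, threshold activation, and a rate readout over a time window) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lif_step` and `run_lif`, the list-based representation, and the idea of passing the layer operation's result in as a precomputed input are all hypothetical choices. The default decay factor and threshold of 0.5 are the values reported later in the implementation details.

```python
def lif_step(u_prev, o_prev, layer_input, alpha=0.5, threshold=0.5):
    """One time step of a LIF layer; all arguments are lists of equal length."""
    u, o = [], []
    for up, op, x in zip(u_prev, o_prev, layer_input):
        # Decay the previous potential, zero it if the neuron just fired,
        # then add the new input coming from the layer operation.
        ui = alpha * up * (1 - op) + x
        u.append(ui)
        # Threshold activation: emit a binary spike.
        o.append(1 if ui >= threshold else 0)
    return u, o


def run_lif(inputs, alpha=0.5, threshold=0.5):
    """Run the layer over a time window; return all spikes and the firing rates."""
    n = len(inputs[0])
    u, o = [0.0] * n, [0] * n
    spikes = []
    for x_t in inputs:
        u, o = lif_step(u, o, x_t, alpha, threshold)
        spikes.append(o)
    # Rate encoding: average the binary outputs over the time window.
    rates = [sum(s[i] for s in spikes) / len(spikes) for i in range(n)]
    return spikes, rates


# Two neurons driven by constant inputs 0.3 and 0.0 for four time steps.
spikes, rates = run_lif([[0.3, 0.0]] * 4)
print(spikes)  # [[0, 0], [0, 0], [1, 0], [0, 0]]
print(rates)   # [0.25, 0.0]
```

The first neuron needs three steps of accumulated input to cross the threshold and is reset afterwards, which is the per-neuron memory the later sections rely on.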

Figure 1: LIF neurons in SNNs, expanded in time and space

3.2 Back-propagation through time in SNNs

With some differences mentioned in (Wu et al., 2018; Mostafa, 2017), spatio-temporal backpropagation in SNNs is almost indistinguishable from backpropagation through time. The only problem of this method is the derivative of the activation function, which is a Dirac delta function (nonzero only at the threshold). To solve this, (Wu et al., 2018) proposed multiple approximation functions. Some of these approximations are used in this paper. The details are available in the results section.
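To make the idea of approximating the derivative of the threshold function concrete, the sketch below defines two surrogate derivatives of the kind referred to here: a rectangular window and a Gaussian, both centered at the threshold and normalized to unit area. The function names, the `width` and `sigma` parameters, and their default values are assumptions chosen for illustration; the exact forms and constants used by (Wu et al., 2018) may differ.

```python
import math


def rect_surrogate(u, threshold=0.5, width=1.0):
    """Rectangular window of unit area centered at the threshold."""
    return 1.0 / width if abs(u - threshold) < width / 2 else 0.0


def gaussian_surrogate(u, threshold=0.5, sigma=0.5):
    """Gaussian bump of unit area centered at the threshold."""
    z = (u - threshold) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# During the backward pass, the gradient flowing through a spike is scaled
# by the surrogate evaluated at the neuron's membrane potential.
for u in (0.0, 0.5, 2.0):
    print(u, rect_surrogate(u), round(gaussian_surrogate(u), 4))
```

The rectangular window passes no gradient at all once the potential is far from the threshold, whereas the Gaussian always passes a small one; this difference may be related to the training behavior of the two approximations reported in the implementation details.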

3.3 Batch normalization

As (Santurkar et al., 2018) described, the batch normalization layer is the cause of covariance shift. This covariance shift is the source of less weight adaptation with respect to other layers (which means faster training), more general weight training (which means no dropout is needed), and a Lipschitz loss function (which means fewer exploding and vanishing gradients). The formula is defined as follows:

$$y = \gamma \, \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance over the batch, $\gamma$ and $\beta$ are learned scale and shift parameters, and $\epsilon$ is a small constant.

Applying this formula in spiking layers creates non-binary outputs and is therefore not acceptable. A possible solution might be to shift the mean value of the spikes close to 0.5 by scaling the membrane potentials. This solution is also unacceptable: incorrect scaling forces the network toward a fixed spike rate, which is inadmissible given the sparse nature of SNNs. NeuNorm, introduced in (Wu et al., 2019), focuses on this problem. In this method, the scaling is divided by the number of feature maps in each convolutional layer. This approach is still highly dependent on a constant scaler with respect to the data; therefore, it is only applicable to convolutional networks. Until now, the only applicable way to obtain the properties of batch normalization seems to be slight dropout and skip connections in the network architecture.
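The claim that normalizing spikes produces non-binary outputs is easy to check numerically. The sketch below applies the standard batch normalization transform to a small batch of binary spikes; the helper name `batch_norm` and the sample values are assumptions for illustration only.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization over a 1-D batch of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


spikes = [0, 1, 0, 0]             # binary outputs of a spiking layer
normalized = batch_norm(spikes)   # zero mean, roughly unit variance
is_binary = all(v in (0, 1) for v in normalized)

print([round(v, 3) for v in normalized])
print(is_binary)
```

The normalized values are negative for silent neurons and larger than one for the firing neuron, so the next layer would no longer receive spikes.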

3.4 Dataset aspects

As (Iyer et al., 2018) mentioned, datasets for performance comparison of SNNs vs. ANNs should be a special type. According to this paper, data in event camera-driven datasets can easily be concatenated together as frames. An ANN architecture will take a stacked version of these frames. This technique allows ANNs to reach very high accuracies, which makes SNNs struggle to keep up. To compare ANNs and SNNs, it is imperative to design a dataset in which spatial features (a frame alone) can’t be used to detect a class and temporal properties are not in some repeatable pattern. For example, an NMNIST dataset might be great to compare SNNs against each other, but comparing an ANN with SNN on this dataset will not show the true ability of SNNs. With stronger ANNs (such as deep C3D or deep CNN+LSTM) this task will be a complete win to the ANN.

4 Spatio-temporal property of SNNs

This section introduces the spiking neural network properties and depicts the absence of these characteristics in ANN feature extractors. Also, it presents the test cases that are designed to examine spatio-temporal properties of NNs in detail.

4.1 Claim

Figure 2: Test cases designed to challenge spatio-temporal extraction properties. The base images are derived from the MNIST dataset. The top two rows belong to test1 (zoom-in from 0% to 100% and zoom-out from 100% to 0%), the 3rd and 4th rows belong to test2 (clockwise and counter-clockwise rotations from 0 to 360 degrees and vice versa), the 5th and 6th rows belong to test3 (zoom-in from 50% to 100% and zoom-out from 100% to 50%), the 7th row belongs to test4 (occlusion), and the last two rows belong to test5 (random clockwise and counter-clockwise rotation).

The structure of spiking neural networks is very similar to that of the human brain, and an advantage of these networks is the memory that exists in each neuron. This memory is the source of the temporal coding feature. It leads to strong performance in extracting particular spatio-temporal features, including learning models with random patterns. Common feature extractors such as C3D and ConvLSTM are not able to extract these features. The mathematical formulation of this claim is as follows. Assume that is a function of , and . This function models a spatio-temporal motion. Frames in time can be modeled as:


Training data are binomial samples:


A single layer of C3D or Conv2D cannot learn the stochastic function as defined above. Those layers are designed to learn deterministic patterns. ConvLSTM is comparable to an SNN in that it has memory in each layer, and this memory makes it as capable as an SNN. If $\sigma$ is large enough, the LSTM in the convolution layer cannot forget the significant variance, which causes ConvLSTM accuracy to drop; SNN thresholding, however, makes the SNN highly robust to significant noise variance. This problem can be solved if ConvLSTM has significantly more convolution kernels than the SNN. A neural network of sequential shallow convolution layers and LSTMs also has some issues. The network has difficulty predicting the time domain of the kernels. Additionally, for large time windows, typical LSTM layers suffer from information loss.

Designed test cases are as follows:

  • Test1: Zoom-in (0% to 100%) and zoom-out (100% to 0%) as 20 classes of MNIST

  • Test2: Rotate clockwise (0 to 360 degrees) and rotate counter-clockwise (360 to 0 degrees) as 20 classes of MNIST

  • Test3: Zoom-in (50% to 100%) and zoom-out (100% to 50%) as 20 classes of MNIST

  • Test4: Occlusion with random box of zero values

  • Test5: Random incremental rotations CW/CCW (no rotation on first and last frames of CCW, blank picture on first and last frames of CW)
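As a rough illustration of how such test sequences might be generated, the sketch below builds an occlusion sequence in the spirit of Test4 and a 20-class label in the spirit of Tests 1 to 3 (ten digits times two motion directions). Everything here is an assumption for illustration: the function names, the box size, redrawing the box position independently on every frame, and the convention that the second direction adds 10 to the digit label. The window of 10 frames matches the frame window size reported in the implementation details.

```python
import random


def occlude(frame, box, rng):
    """Return a copy of a 2-D frame with a randomly placed box-by-box patch zeroed."""
    height, width = len(frame), len(frame[0])
    top = rng.randrange(height - box + 1)
    left = rng.randrange(width - box + 1)
    out = [row[:] for row in frame]
    for r in range(top, top + box):
        for c in range(left, left + box):
            out[r][c] = 0
    return out


def occluded_sequence(frame, n_frames=10, box=3, seed=0):
    """Build a window of frames, each occluded by an independently placed box."""
    rng = random.Random(seed)
    return [occlude(frame, box, rng) for _ in range(n_frames)]


def motion_label(digit, second_direction):
    """Map (digit, direction flag) to one of 20 classes: digit or digit + 10."""
    return digit + 10 * int(second_direction)


frame = [[1] * 8 for _ in range(8)]  # stand-in for a binarized MNIST digit
sequence = occluded_sequence(frame)
print(len(sequence), motion_label(6, True))  # 10 16
```

A real generator would start from MNIST images and would also need zoom and rotation transforms, which are omitted here to keep the sketch free of image-processing dependencies.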

4.2 Experimental backed proof

To demonstrate the inherent memory of spiking neural networks, some special test cases were designed. In tests 1 and 3, zoom-in and zoom-out image sequences are the inputs and the network classifies them. SNNs can also identify clockwise or counter-clockwise rotation; test2 is designed to challenge that property. In addition, owing to the memory in each neuron, SNNs are capable of learning random patterns, and they classify occluded images with great accuracy. Tests 4 and 5 were designed to examine these last two properties.

Figure 3: Proposed deep SNN architecture

5 Deep SNN model

This section provides a new architecture for deep spiking neural networks. This architecture, shown in Figure 3, is inspired by the ResNet architecture.

ResNet architectures overcome the obstacle of vanishing gradients by utilizing skip connections. As mentioned in the previous sections, the principal difficulty of training deep SNNs is the vanishing gradient, so the idea of skip connections is practical here: skip connections increase performance drastically. In the proposed architecture, skip connections are added from blocks 3 and 4 to the input of the average pooling layer. To increase performance, a concatenation operation is used afterward instead of the conventional summation operation. This concatenation does not occur in ResNet skip connections.
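The difference between the summation used in ResNet skip connections and the concatenation used here can be illustrated on toy spike maps. The sketch below is only an illustration under assumptions: the helper names, the tiny flattened feature maps, and the premise that the skipped features share the same spatial size are all hypothetical.

```python
def skip_sum(a, b):
    """ResNet-style skip: element-wise sum, requires equal channel counts."""
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def skip_concat(*features):
    """Concatenation skip: stack the channels of all inputs."""
    return [channel for feature in features for channel in feature]


# Binary outputs of two blocks: 2 channels, 3 flattened spatial positions each.
block3 = [[1, 0, 1], [0, 0, 1]]
block4 = [[0, 1, 1], [1, 0, 0]]

summed = skip_sum(block3, block4)
concatenated = skip_concat(block3, block4)

print(summed)             # [[1, 1, 2], [1, 0, 1]]
print(len(concatenated))  # 4
```

One observable consequence is that the sum of two spike maps can contain the value 2, while the concatenated tensor stays binary at the cost of more channels for the following layer.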

Details of each block are shown in Figure 3. There are two sub-blocks in each block, similar to the ResNet18 architecture. To force binary outputs after each layer, a thresholding activation function, namely the synapse, is applied to the output of each layer. Inside the sub-blocks, there is a dropout to ensure generality and force sparsity. Dropout plays, to some extent, the role of batch normalization in generalization. Average pooling layers are part of the next layer's operation and do not interfere with the binary nature of the architecture (a fact not mentioned in previous works).
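The remark that average pooling is part of the next layer's operation can be checked on a small example: pooling followed by a linear layer gives the same membrane potential as a single linear layer whose weights are spread over the pooling window. The sketch below assumes a 1-D signal, a pooling window of 2, and arbitrary example weights; the helper names are hypothetical.

```python
def avg_pool_1d(spikes, k=2):
    """Non-overlapping average pooling over a 1-D list."""
    return [sum(spikes[i:i + k]) / k for i in range(0, len(spikes) - k + 1, k)]


def synapse(potential, threshold=0.5):
    """Threshold activation: binary spike output."""
    return 1 if potential >= threshold else 0


spikes = [1, 0, 1, 1, 0, 0]
pooled = avg_pool_1d(spikes)  # [0.5, 1.0, 0.0], not binary on its own

# Linear layer applied to the pooled signal.
weights = [1.0, 0.25, 0.5]
potential = sum(w * p for w, p in zip(weights, pooled))

# Same result with pooling folded into the weights: each weight is divided
# by the window size and repeated once per pooled input.
folded = [w / 2 for w in weights for _ in range(2)]
potential_folded = sum(w * s for w, s in zip(folded, spikes))

output = synapse(potential)
print(potential, potential_folded, output)  # 0.75 0.75 1
```

The pooled values are fractional, but they only ever appear inside the linear operation, and what leaves the layer after the synapse is again a binary spike.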

The proposed architecture consists of 18 layers, 16 layers in the blocks, 2 convolutional layers and a fully connected layer at the top and bottom of the architecture, respectively. It is the first time an 18-layer SNN is trained in the space and time domains to classify spatio-temporal actions.

The parameters of SNNs play a vital role in performance. The cause is the non-Lipschitzness of the error function due to the spiking nature of the network. Wrong parameters make the network untrainable on the given data. These parameters are the window length, the decay factor, and the threshold of the LIF neurons. The width of the network affects the backward gradient reaching the first neurons. The exact parameters of the proposed SNN architecture used to train and test on the CIFAR10-DVS dataset are available in the implementation details section.

Method      MNIST     Test1     Test2     Test3     Test4     Test5
ConvSNN     99.4%     98.6%     98.4%     99.36%    98.8%     89.5%
CNN         99%       98.89%    98.2%     98.8%     98.27%    Failed
CNN+LSTM    92.84%    67.74%    98.93%    98.96%    94.88%    Failed
ConvLSTM    -         99.11%    98.9%     30%       97.43%    20%
C3D         99.03%    98.49%    98.32%    99.17%    97.73%    64%
Table 1: Classification accuracy over the test cases defined in Section 4.1 (part 1)

6 Results

6.1 Spatio-temporal feature extraction experiments

In this section, the goal is to evaluate the performance of top-quality ANN spatio-temporal feature extractors (namely C3D, ConvLSTM, conv+LSTM) against SNNs. For this matter, a shallow network (max of 4 layers) of each architecture is used. The dataset is designed to demonstrate the critical factors of these networks.

The general property of spatio-temporal feature extraction is examined in test2 and test3. All of the architectures show promising results (Table 1). Long-term preservation of data in memory is investigated in test1. Typical LSTM layers (lacking cut connections) do not have this property, as is evident in Table 1. The robustness of the architectures to noise and their ability to extract spatial features are examined in test4. The results show the superiority of the SNN over all other networks in this aspect. In test5, the random spatio-temporal features generated in time challenge the ability of the networks to extract non-repeatable actions. This test highlights the ability of SNNs to classify stochastic, non-repeating patterns. The MNIST test is designed to challenge the primary spatial feature extraction of the corresponding networks. All architectures pass this test gracefully. Due to the lack of performance of a simple CNN (no frame concatenation) in the temporal domain, results of this network are not provided in Table 1. With more layers added to the networks, results may change, but the purpose of these tests was to compare architectures equally in primary spatio-temporal feature extraction.

The confusion matrix for test5 (Figure 6) shows the exceptional performance of SNNs over the stacked convolution and LSTM model. Figure 6 also illustrates the imperfection of convolutional SNNs in extracting random temporal properties.

Figure 7 demonstrates the inability of stacked convolution and LSTM layers to preserve temporal features over the long term. This test is performed on the data created for test1. The spike patterns in Figure 4 and Figure 5 show interesting results. The test1 example shows a zoom-out of the digit 6. In this example, the network loses recognition of the character after a certain point. Interestingly, it does not recognize the small character as another class. The test2 example shows the character 1 turning counter-clockwise. The SNN does a very good job of recognizing both the rotation direction and the digit. The test3 example shows a hard instance of the character 1, for which zooming out made the network recognize the wrong character. The test4 example in Figure 5 shows the accumulation of membrane potential needed to recognize the occluded character 2. This example clearly shows the advantage of SNNs in occluded scenes. The test5 example shows how hard this test is, as the SNN barely recognizes the counter-clockwise rotation of the character 2. This test is also hard for humans; the last two rows of Figure 2 can be used to try it.

These results support the claim of the paper (SNNs are very good at spatio-temporal feature extraction, especially when the features are noisy or do not follow a regular pattern in time or space). To make SNNs more suitable for complex conditions, we proposed a new deep architecture, as described in the deep SNN model section; the results of this architecture on the CIFAR10-DVS dataset are explained in the next subsection.

6.2 Experiments with the new architecture

The proposed architecture enables us to test more complex scenarios. We chose CIFAR10-DVS to depict the performance of this architecture in complex scenarios. Previous successful SNN architectures have also been tested on this dataset, which makes it ideal for comparison. The designed architecture is capable of processing both events and color image inputs as binary frames.

Table 2 shows the significant improvement of this architecture over the previous outstanding methods. The comparison results in the table are gathered from (Wu et al., 2019). The accuracy is achieved without using any feature encoding of the event data. Such an encoding could follow (Manderscheid et al., 2019), which uses patches of a speed-invariant time surface. This encoding would give better results, but it would not be a fair comparison with the other methods and would not highlight the performance of the new architecture.

Furthermore, our architecture needs far fewer kernels than (Wu et al., 2019) and is much more memory efficient. To be precise, our model concentrates on making the network deep, whereas (Wu et al., 2019) try to increase the width of the network. For the exact parameters, refer to the implementation details.

The fact that far fewer kernels are needed for the same data complexity leads us to consider the possibility of kernel adaptation over time. Kernel adaptation does not mean that the kernel changes; it means that, in a given time window, the thresholded output looks like the convolution with another kernel.

Model                     Method     Accuracy
(Sironi et al., 2018)     HAT        52.4%
(Orchard et al., 2015)    RF         31.0%
(Wu et al., 2019)         NeuNorm    60.5%
Our model                 Our net    68.3%
Table 2: Classification accuracy of the proposed deep network on the CIFAR10-DVS dataset

6.3 Implementation details

In this subsection, we provide experiment conditions, details about networks architectures and parameters.

Figure 4: Example of spike patterns at the output layer for the specific test case. Time axis starts from bottom. In test1 classes 0-9 represent zoom-out of 0-9 and classes 10-19 represent zoom-in of 0-9. In test2 classes 0-9 represent counter clock-wise rotation of 0-9 and classes 10-19 represent clock-wise rotation of 0-9. Finally in test3 Classes 0-9 represent the zoom-out and classes 10-19 represent zoom-in.

Figure 5: Example of spike patterns at the output layer for the specific test case. Time axis starts from bottom. In test5 classes 0-9 represent random counter clock-wise rotation of numbers 0-9 and classes 10-19 represent random clock-wise rotation.

Basic implementation details for the test cases are as follows. The frame window size for all networks was 10. The learning rate for all network architectures was 1e-3, except for ConvSNN (the SNN network), for which it was 5e-4. All architectures were trained long enough to reach maximum accuracy (more than 10 epochs). These tests were performed at least 5 times and the mean value is reported in the tables. For training, the Adam optimizer and a least mean square loss were used (except for the ConvLSTM network, for which binary cross-entropy was used). Batch sizes were 100, except for ConvSNN, for which the batch size was 20 (the memory consumption of the SNN is very high when values are stored as 32-bit floats rather than as an optimized binary type).

As for the ConvSNN-specific parameters, the threshold value was set to 0.5. This value is extremely important; even a slight change noticeably affects the results. Alpha, the decay factor, was set to 0.5. Increasing this value results in better preservation of memory but greater vulnerability to noise. The resting mechanism was disabled. The resting mechanism maintains more sparsity in the spike patterns but results in an accuracy drop. The Dirac derivative of the threshold function is approximated with a Gaussian function (the Rect function is a better approximation in terms of performance, but it is harder to train, and the mean accuracy over several runs drops dramatically).

Network architectures to tackle designed test cases are as follows:

  • C3D: Conv3D(64-3) – Maxpool3D(2) – Conv3D(128-3) – Maxpool3D(2) – Conv3D(256-3) – Maxpool3D(2) – Conv3D(256-3) – Maxpool3D(2) – FC(128) – Dropout(0.5) – FC(128) – Dropout(0.5) – FC(#Classes)

  • CNN+LSTM: Conv2D(128-3) – Maxpool2D(2) – Conv2D(128-3) – Maxpool2D(2) – LSTM(128) – FC(#Classes)

  • ConvLSTM: ConvLSTM2D(64-3) – Maxpool2D(2) – ConvLSTM2D(64-3) – Maxpool2D(2) – FC(128) – FC(#Classes)

  • CNN: Per stack of frames: {Conv2D(128-3) – Maxpool(2) – Conv2D(128-3) – Maxpool(2) – FC(#Classes)}

  • ConvSNN: Per frame: {Conv2D(48-3) + Synapse – Avgpool2D(2) + Conv2D(48-3) + Synapse – Avgpool2D(2) + FC(128) + Synapse – FC(#Classes) + Synapse } – spike rate average for the specified frame window

In the architectures above, #Classes is 20, except for the MNIST test (where it is 10).

The parameters of the proposed deep SNN architecture for CIFAR10-DVS are as follows. The frame window length is set to 10 and the learning rate is set to 5e-4; training is performed for more than 50 epochs and repeated more than 5 times, as before. The optimizer is SGD with a momentum of 0.9. The chosen loss function is binary cross-entropy. The batch size is 10, and 1000 events are concatenated per frame. The other SNN-specific parameters (threshold, resting mechanism, derivative approximation function) are as before, except for the decay factor, which is 0.8 (in the CIFAR10-DVS dataset, memory is more important than noise robustness).

All experiments were run on a system with an Intel Core i5-6500 CPU, an NVIDIA GTX 1080 GPU, 24 GB of RAM, and SSD storage.
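One plausible reading of "1000 events concatenated per frame" is that consecutive chunks of the event stream are accumulated into binary frames. The sketch below illustrates that reading only; it is not the authors' preprocessing. The function name, the representation of an event as an `(x, y)` pair, ignoring polarity and timestamps, and dropping a trailing incomplete chunk are all assumptions.

```python
def events_to_frames(events, height, width, events_per_frame=1000):
    """Accumulate consecutive chunks of (x, y) events into binary frames."""
    frames = []
    last_start = len(events) - events_per_frame
    for start in range(0, last_start + 1, events_per_frame):
        frame = [[0] * width for _ in range(height)]
        for x, y in events[start:start + events_per_frame]:
            frame[y][x] = 1  # a pixel is set if at least one event hit it
        frames.append(frame)
    return frames


# Toy stream: 10 events sweeping a 4x4 sensor row by row, 4 events per frame.
toy_events = [(i % 4, (i // 4) % 4) for i in range(10)]
toy_frames = events_to_frames(toy_events, height=4, width=4, events_per_frame=4)
print(len(toy_frames))   # 2
print(toy_frames[0][0])  # [1, 1, 1, 1]
```

With a frame window length of 10, ten such frames would form one input sample for the network.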

7 Discussion

This paper demonstrated the potential of SNNs for spatio-temporal feature extraction, particularly their capacity to extract features that are randomly distributed in the time and space domains. This claim was backed by experiments with a special type of dataset devised for the purpose. To showcase an application, a new deep SNN architecture was proposed. The introduced architecture was tested on the challenging CIFAR10-DVS dataset to depict its advantage over previous architectures.

Regarding the results, this work outperformed shallow ANNs under extreme conditions (the designed test cases) and surpassed other SNNs on a typical event-based dataset (CIFAR10-DVS). Moreover, SNNs have much lower memory consumption (assuming binary connections) and computation cost, which translates into lower overall hardware power consumption. Also, in some situations, SNNs with a small number of neurons can achieve what oversized ANNs can barely achieve.

Figure 6: Confusion matrix over test 5, comparing result of CNN+LSTM and ConvSNN. The left image shows performance of Conv+LSTM model.

Figure 7: Confusion matrix over test 1, comparing result of CNN+LSTM and ConvSNN. The left image shows performance of ConvSNN model.

The remaining problem to be solved is the adaptation of batch normalization properties to SNNs. These properties are required for very deep SNNs (e.g., 101 layers). Also, there should be a better solution for backpropagation than approximating the derivative of the activation function; the approximation functions are the primary cause of vanishing gradients. A further step in analyzing these networks might be the study of other types of SNNs, such as GANs. Another aspect to tackle is the kernel adaptation phenomenon. This phenomenon is also observed in ConvLSTM layers, but a precise application and analysis have yet to come.

To sum up, this work makes the advantages of SNNs transparent and proposes some solutions for building deeper SNNs.


  • F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, et al. (2015) Truenorth: design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems 34 (10), pp. 1537–1557. Cited by: §1.
  • J. N. Allen, H. S. Abdel-Aty-Zohdy, and R. L. Ewing (2009) Cognitive processing using spiking neural networks. In Proceedings of the IEEE 2009 National Aerospace & Electronics Conference (NAECON), pp. 56–64. Cited by: §1.
  • N. Caporale and Y. Dan (2008) Spike timing–dependent plasticity: a hebbian learning rule. Annu. Rev. Neurosci. 31, pp. 25–46. Cited by: §1.
  • S. Carrillo, J. Harkin, L. J. McDaid, F. Morgan, S. Pande, S. Cawley, and B. McGinley (2012a) Scalable hierarchical network-on-chip architecture for spiking neural network hardware implementations. IEEE Transactions on Parallel and Distributed Systems 24 (12), pp. 2451–2461. Cited by: §1.
  • S. Carrillo, J. Harkin, L. McDaid, S. Pande, S. Cawley, B. McGinley, and F. Morgan (2012b) Advancing interconnect density for spiking neural network hardware implementations using traffic-aware adaptive network-on-chip routers. Neural networks 33, pp. 42–57. Cited by: §1.
  • P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer (2015) Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
  • S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha (2015) Backpropagation for energy-efficient neuromorphic computing. In Advances in neural information processing systems, pp. 1117–1125. Cited by: §2.
  • G. Gallego, J. E. Lund, E. Mueggler, H. Rebecq, T. Delbruck, and D. Scaramuzza (2017) Event-based, 6-dof camera tracking from photometric depth maps. IEEE transactions on pattern analysis and machine intelligence 40 (10), pp. 2402–2412. Cited by: §1.
  • Y. Hu, H. Tang, Y. Wang, and G. Pan (2018) Spiking deep residual network. arXiv preprint arXiv:1805.01352. Cited by: §2.
  • L. R. Iyer, Y. Chua, and H. Li (2018) Is neuromorphic mnist neuromorphic? analyzing the discriminative power of neuromorphic datasets in the time domain. arXiv preprint arXiv:1807.01013. Cited by: §3.4.
  • N. Kasabov and E. Capecci (2015) Spiking neural network methodology for modelling, classification and understanding of eeg spatio-temporal data measuring cognitive processes. Information Sciences 294, pp. 565–575. Cited by: §1.
  • H. Kim, S. Leutenegger, and A. J. Davison (2016) Real-time 3d reconstruction and 6-dof tracking with an event camera. In European Conference on Computer Vision, pp. 349–364. Cited by: §1.
  • B. Kueng, E. Mueggler, G. Gallego, and D. Scaramuzza (2016) Low-latency visual odometry using event-based feature tracks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 16–23. Cited by: §1.
  • C. Lee, P. Panda, G. Srinivasan, and K. Roy (2018) Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in neuroscience 12, pp. 435. Cited by: §1.
  • W. Maass (1997) Networks of spiking neurons: the third generation of neural network models. Neural networks 10 (9), pp. 1659–1671. Cited by: §1.
  • J. Manderscheid, A. Sironi, N. Bourdis, D. Migliore, and V. Lepetit (2019) Speed invariant time surface for learning to detect corner points with event-based cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10245–10254. Cited by: §6.2.
  • H. Markram, W. Gerstner, and P. J. Sjöström (2011) A history of spike-timing-dependent plasticity. Frontiers in synaptic neuroscience 3, pp. 4. Cited by: §1.
  • C. Mead (1990) Neuromorphic electronic systems. Proceedings of the IEEE 78 (10), pp. 1629–1636. Cited by: §1.
  • P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. (2014) A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345 (6197), pp. 668–673. Cited by: §1.
  • H. Mostafa (2017) Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems 29 (7), pp. 3227–3235. Cited by: §1, §3.2.
  • E. Mueggler, B. Huber, and D. Scaramuzza (2014) Event-based, 6-dof pose tracking for high-speed maneuvers. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2761–2768. Cited by: §1.
  • E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine 36 (6), pp. 51–63. Cited by: §1, §2.
  • G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman (2015) HFirst: a temporal approach to object recognition. IEEE transactions on pattern analysis and machine intelligence 37 (10), pp. 2028–2040. Cited by: Table 2.
  • H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza (2016) EVO: a geometric approach to event-based 6-dof parallel tracking and mapping in real time. IEEE Robotics and Automation Letters 2 (2), pp. 593–600. Cited by: §1.
  • B. Rekabdar, M. Nicolescu, M. Nicolescu, and S. Louis (2017) Using patterns of firing neurons in spiking neural networks for learning and early recognition of spatio-temporal patterns. Neural Computing and Applications 28 (5), pp. 881–897. Cited by: §1.
  • B. Rueckauer, I. Lungu, Y. Hu, M. Pfeiffer, and S. Liu (2017) Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience 11, pp. 682. Cited by: §2.
  • T. N. Sainath, O. Vinyals, A. Senior, and H. Sak (2015) Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584. Cited by: §2.
  • S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §3.3.
  • C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank (2017) A survey of neuromorphic computing and neural networks in hardware. arXiv preprint arXiv:1705.06963. Cited by: §1.
  • A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy (2019) Going deeper in spiking neural networks: vgg and residual architectures. Frontiers in neuroscience 13. Cited by: §2.
  • J. Seo, B. Brezzo, Y. Liu, B. D. Parker, S. K. Esser, R. K. Montoye, B. Rajendran, J. A. Tierno, L. Chang, D. S. Modha, et al. (2011) A 45nm cmos neuromorphic chip with a scalable architecture for learning in networks of spiking neurons. In 2011 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4. Cited by: §1.
  • A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman (2018) HATS: histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1731–1740. Cited by: Table 2.
  • N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.
  • E. Stromatias, M. Soto, T. Serrano-Gotarredona, and B. Linares-Barranco (2017) An event-driven classifier for spiking neural networks fed with synthetic or dynamic vision sensor data. Frontiers in neuroscience 11, pp. 350. Cited by: §2.
  • A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida (2019) Deep learning in spiking neural networks. Neural Networks 111, pp. 47–63. Cited by: §2, §2.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2.
  • W. Wang, S. Zhou, J. Li, X. Li, J. Yuan, and Z. Jin (2020) Temporal pulses driven spiking neural network for fast object recognition in autonomous driving. arXiv preprint arXiv:2001.09220. Cited by: §1, §1.
  • Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi (2018) Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience 12, pp. 331. Cited by: §1, §1, §2, §3.2.
  • Y. Wu, L. Deng, G. Li, J. Zhu, Y. Xie, and L. Shi (2019) Direct training for spiking neural networks: faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1311–1318. Cited by: §2, §3.3, §6.2, §6.2, Table 2.
  • S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810. Cited by: §2.
  • X. Zhang, Z. Xu, C. Henriquez, and S. Ferrari (2013) Spike-based indirect training of a spiking neural network-controlled virtual insect. In 52nd IEEE Conference on Decision and Control, pp. 6798–6805. Cited by: §1.