
Closing the Accuracy Gap in an Event-Based Visual Recognition Task

Mobile and embedded applications require neural-network-based pattern recognition systems to perform well under a tight computational budget. In contrast to commonly used synchronous, frame-based vision systems and CNNs, asynchronous, spiking neural networks driven by event-based visual input respond with low latency to sparse, salient features in the input, leading to high efficiency at run-time. The discrete nature of the event-based data streams makes direct training of asynchronous neural networks challenging. This paper studies asynchronous spiking neural networks, obtained by conversion from a conventional CNN trained on frame-based data. As an example, we consider a CNN trained to steer a robot to follow a moving target. We identify possible pitfalls of the conversion and demonstrate how the proposed solutions bring the classification accuracy of the asynchronous network to only 3% below the performance of the original synchronous CNN, while requiring 12x fewer computations. Although applied to a simple task, this work is an important step towards low-power, fast, and embedded neural-network-based vision solutions for robotic applications.


I Introduction

Deep convolutional neural networks (CNNs) offer practical solutions for pattern recognition and have radically changed the field of image recognition. In the field of robotics, however, where real-time processing and low power budget are crucial, CNN-based image-processing algorithms face a fundamental latency-power trade-off, where low latency can only be achieved by dramatically increasing power consumption.

The evolution of embedded systems led to the development of event-based vision, which has improved the performance of vision systems for fast and agile robots [1, 2]. Event-driven, biologically inspired vision sensors such as DVS [3], ATIS [4], and DAVIS [5] enable fast and low-power processing of visual information. Instead of capturing static images of the scene, these sensors record pixel brightness change events with high temporal precision. Events are only triggered if a significant change occurs in the observed scene, allowing lower latency and lower required bandwidth compared to frame-based sensors. However, since the data produced by an event sensor is a sequence of events, conventional frame-based computer-vision algorithms [6] or DNN-based pattern recognition cannot be applied directly.

One obvious way to process event-based data with conventional DNNs is to create frames by accumulating events over fixed time intervals, or by accumulating a constant number of events per frame. Several robotic applications have shown that, with this approach, conventional CNNs can be applied for feature extraction and object classification [7, 8, 9]. Although using constant event count frames addresses the latency-power trade-off through data-driven computation, it ignores key advantages of event-based sensors, in particular their sparse data and high temporal precision.

This paper explores the use of asynchronous neural network architectures for processing event-based vision data. In contrast to the synchronous, frame-based mode of operation of conventional CNNs, asynchronous spiking neural network (SNN) architectures represent hidden layer activations in the form of discrete events – spikes – that are propagated through the network asynchronously, so that neurons are only activated when they receive events. Theory has shown that SNNs are at least as computationally powerful as the conventional neuron models used in deep learning [11]. It has also been shown that, with dedicated event-based hardware, power consumption and latency can be reduced by several orders of magnitude [12, 13, 14]. IBM’s TrueNorth processor [14] consumes about 1000 times less energy than conventional synchronous architectures. Thus, just as hardware acceleration through GPUs has played a fundamental role in the advancement of deep learning, there is increasing availability of neuromorphic SNN accelerators that enable efficient event-based SNN training and inference [15], potentially running on a fraction of the energy budget of conventional CNNs running on GPUs [12, 16, 17, 14].

There are two ways to obtain an SNN for solving a pattern recognition task. First, recent work has explored direct training in the spiking domain using backpropagation-inspired techniques for training multi-layer SNN architectures [15, 18, 19]. Training SNNs is difficult because, owing to their non-differentiable nature, gradient-descent-based methods cannot be applied directly. Furthermore, backpropagation rules typically used in deep learning rely on the availability of network-wide information stored in high-precision memory, and on precise operations that are difficult to realize in event-based hardware [19].

Second, SNNs can be constructed by converting conventionally trained analog neural networks (ANNs) [20, 21]. In terms of accuracy, [22] reports that these networks seem to work well with synthetic input spike trains generated artificially from frame images (e.g., from the MNIST database), where the gray level of an image pixel is transformed into a stream of spikes. However, running inference with such an SNN on data from an event-based vision sensor may lead to a significant loss in accuracy. A better understanding of SNN processing is needed to close the accuracy gap between frame-based and event-based pattern recognition.

In this paper, we apply the second method and analyze object recognition using analog and spiking convolutional neural networks in the context of a robotics predator/prey navigation scenario. The dataset from [7] is used to train and evaluate several neural network architectures, where the purpose of the trained networks is to steer a predator robot in the direction of a prey robot. In particular, we compare the conventional CNN architecture proposed in [7] with its event-based SNN counterpart, where accuracy is evaluated using both synthetic and sensor-driven input spike trains. We perform a thorough analysis of accuracy losses that occur in the ANN to SNN conversion and offer solutions to reduce these losses. We show that a CNN trained on constant event count frames can be run efficiently on the asynchronous sensor events at inference, using up to 12x fewer computations than when using frames. Finally, we identify the causes for classification accuracy loss that occurs when switching from a synchronous training mode to an asynchronous inference mode, and evaluate solutions that minimize this loss.

These are crucial steps on the way to low-power, fast, embedded solutions, which will enable the application of deep neural networks on robotic platforms in real time and with a limited power budget.

II Methods

This section describes the pipeline of the present work: Starting with an event-based data set (Sec. II-A), we first synthesized frames (Sec. II-B) to train a conventional ANN (Sec. II-C). The resulting frame-based model was then converted to an SNN (Sec. II-D) and tested on the original event-based data (Sec. II-E).

II-A The data set

The data set from [7] consists of twenty recordings with a total duration of 1.25 hours from a Dynamic and Active Pixel Sensor (DAVIS) [5]. The DAVIS camera was mounted on the predator robot and recorded different scenes in which the predator robot, driven by a human or by the CNN, followed the prey robot. The recordings contain both conventional image frames (APS) as well as event-based data (DVS). The APS frames were not used in this work. The DVS sensor data was output in AEDAT 2.0 file format [23]: each sensor event contained a timestamp, the pixel address, and a polarity value (ON/OFF), indicating an increase or decrease in pixel brightness.

The ground truth labels of the recordings encoded the position and bounding box of the prey robot. From these labels, we produced a target ground truth of four classes, marking in which third of the visual field the prey robot is located (classes 1-3) or if it is not visible (class 4), leading to a four-class classification problem (left, center, right, invisible). The DAVIS camera records with a resolution of 240x180 pixels. We subsampled the event addresses to 36x36 arrays; [7] found this to be the minimum size at which the robot can still be recognized by the human eye. The data set consisted of roughly 200k images generated by binning DVS events to 5000-event frames, as described below.

Fig. 1: Synthesized DVS frames (see Sec. II-B for details). 1(a): Original method from [7]. 1(b): Our method. Left column: Full resolution (240x180); right column: Subsampled to 36x36.

II-B Generating frames from event-based data for ANN training

Training of the ANN was conducted on frames synthesized from DVS events, but testing of the converted SNN was performed on the original DVS stream. Thus, any transformation of the data during frame synthesis should either be applicable to the underlying DVS event streams as well, or else distort them as little as possible. This section describes each step of a frame generation method that best preserves the classification performance when using asynchronous DVS data at test time.

II-B1 Choosing the binning window

Frames can be synthesized from DVS data either by accumulating a variable number of events during a fixed time window, or by accumulating a fixed number of events during a variable time window. We follow [7] and use the latter approach, with a constant number of 5000 events per frame. This way, the frame rate is proportional to the rate of change of the scene, and each frame is more likely to be informative. Frame synthesis with a fixed time window can lead to overly sparse and noisy frames during time intervals with few changes in the recorded scene, and blurred frames when the robots are moving quickly.

II-B2 Handling polarity

Another design choice concerns the polarity of the DVS events. One can integrate ON and OFF events by representing them as +1 or -1, respectively. In this case, events of opposite polarity cancel each other. The original work [7] uses this method by initializing the frames with 0.5 pixel intensity and incrementing or decrementing this value per ON/OFF event. This approach is not feasible in our setup, because it assigns a nonzero intensity to pixels where the DVS records no events. Instead, we start with an all-zero frame and apply a rectified event count that discards polarity while binning the events. In preliminary experiments we found the polarity information not to be relevant for learning this task.
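As a minimal sketch of the two choices above (a constant event count per frame, and rectified counts accumulated from an all-zero frame), the following Python function bins an event stream into frames. The event tuple layout `(timestamp, x, y, polarity)` and the name `bin_by_event_count` are assumptions made for illustration and are not taken from the code used in this work; a trailing incomplete window is simply dropped here.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event layout: (timestamp_us, x, y, polarity)
Event = Tuple[int, int, int, int]


def bin_by_event_count(events: Iterable[Event], events_per_frame: int,
                       width: int, height: int) -> Iterator[List[List[int]]]:
    """Yield one frame of per-pixel event counts per `events_per_frame` events."""
    frame = [[0] * width for _ in range(height)]  # all-zero start, not 0.5
    n = 0
    for _, x, y, _polarity in events:
        frame[y][x] += 1  # rectified count: ON and OFF events both add 1
        n += 1
        if n == events_per_frame:
            yield frame
            frame = [[0] * width for _ in range(height)]
            n = 0
    # A trailing window with fewer events is dropped (assumption).


# Toy stream of 10 events on a 3x1 sensor, binned into 4-event frames.
toy_events = [(t, t % 3, 0, t % 2) for t in range(10)]
toy_frames = list(bin_by_event_count(toy_events, 4, width=3, height=1))
```

Because a frame is emitted only after a fixed number of events, the frame rate follows the activity in the scene, and pixels without events stay at exactly zero.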

II-B3 Input normalization

Outliers in the distribution of event counts were removed by clipping values greater than three times the standard deviation (3-sigma normalization).

Though the network was trained on frames, we aimed to use the original DVS events during inference. To maintain high classification accuracy when switching from frame to event input, any transformation of the frame data, performed during training, should be applied to the DVS data as well. 3-sigma normalization on DVS event streams is possible by temporally binning 5000 events into a frame as during training, and applying 3-sigma normalization on this frame of integer event-counts to identify outlier events, which are then removed from the DVS event stream fed into the SNN.
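The procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the standard deviation is taken over all pixels of the frame, the cap is rounded down to an integer so that it corresponds to a whole number of events, and for a pixel that exceeds the cap the latest events in the window are the ones removed. The function names are hypothetical.

```python
import math
from statistics import pstdev


def three_sigma_cap(frame):
    """Integer per-pixel cap: three standard deviations of the event counts."""
    counts = [c for row in frame for c in row]
    return math.floor(3 * pstdev(counts))  # floor is an assumption


def clip_frame(frame, cap):
    """Training side: clip every pixel count to the cap."""
    return [[min(c, cap) for c in row] for row in frame]


def filter_events(events, cap):
    """Test side: keep at most `cap` events per pixel within one window."""
    seen = {}
    kept = []
    for ev in events:  # ev = (timestamp, x, y); layout is hypothetical
        key = (ev[1], ev[2])
        if seen.get(key, 0) < cap:
            seen[key] = seen.get(key, 0) + 1
            kept.append(ev)
    return kept


# One window on a 3x3 sensor: nine events at pixel (1, 1), one at (2, 2).
window = [(t, 1, 1) for t in range(9)] + [(9, 2, 2)]
frame = [[0, 0, 0], [0, 9, 0], [0, 0, 1]]
cap = three_sigma_cap(frame)
clipped = clip_frame(frame, cap)
kept = filter_events(window, cap)
```

Binning the filtered events again reproduces the clipped frame, which is the property that keeps the training frames and the test-time event stream consistent.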

II-B4 Input scaling

After accumulation and outlier removal, the frames are scaled to real values, which is the last stage of synthesizing frames for training the ANN. During testing of the SNN, discrete events are streamed into the network, making scaling inapplicable.

A subtle difference in the training frames arises from the order in which scaling and 3-sigma normalization are applied. In [7], the frame of integer event counts is scaled before 3-sigma normalization. The resulting frames consist of real values. If instead 3-sigma normalization is applied first (by removing discrete events from the frame of integer event counts), the subsequent scaling will result in frames that consist of rational numbers only. To avoid the discrepancy of training on real values and testing on integers, we applied 3-sigma normalization before scaling. This seemingly small difference turns out to be crucial: With the opposite ordering, the classification accuracy of the converted SNN drops by 30%.

Fig. 1 shows an example of the frames synthesized as described above. In Table I we summarize how our approach differs from the original work [7].

Fig. 2: Architecture for predator-prey task. (Graphic adapted from [24])

II-C Training the frame-based ANN

Fig. 2 shows the model architecture that [7] developed to solve the predator-prey task. It consists of a small CNN with two convolution layers with 4 feature maps each and a kernel size of 5x5 pixels. Each convolution layer is followed by a max pooling layer. A fully-connected layer of 40 neurons connects the last pooling layer with the 4 classifier output units. The network contains a total of 5884 neurons and 6472 parameters.
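The stated totals can be checked with a short calculation. The sketch below assumes details that the text does not spell out: a single-channel 36x36 input, unpadded ("valid") convolutions with stride 1, 2x2 pooling, biases in every layer, and a neuron count that includes pooling units but not the input pixels. Under these assumptions the numbers reproduce the totals given above.

```python
def valid_conv_size(size: int, kernel: int) -> int:
    """Output width of an unpadded, stride-1 convolution (assumed)."""
    return size - kernel + 1


INPUT, KERNEL, MAPS, POOL, HIDDEN, CLASSES = 36, 5, 4, 2, 40, 4

conv1 = valid_conv_size(INPUT, KERNEL)   # 32
pool1 = conv1 // POOL                    # 16
conv2 = valid_conv_size(pool1, KERNEL)   # 12
pool2 = conv2 // POOL                    # 6
flat = MAPS * pool2 * pool2              # 144 inputs to the dense layer

# Neurons: all convolution and pooling units plus the two dense layers.
neurons = MAPS * (conv1**2 + pool1**2 + conv2**2 + pool2**2) + HIDDEN + CLASSES

# Parameters: weights plus one bias per feature map or dense unit.
params = (
    (KERNEL * KERNEL * 1 * MAPS + MAPS)       # conv1: 104
    + (KERNEL * KERNEL * MAPS * MAPS + MAPS)  # conv2: 404
    + (flat * HIDDEN + HIDDEN)                # dense: 5800
    + (HIDDEN * CLASSES + CLASSES)            # output: 164
)

print(neurons, params)  # 5884 6472
```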

[7] showed that this tiny CNN achieves higher classification accuracy than humans observing the same images.

We implemented the network in the Keras framework, and trained for 30 epochs using mini-batches of size 32 and the ADAM optimizer.

II-D Converting the ANN to an event-driven SNN

The central idea of ANN-SNN conversion is that the time-averaged firing rates of the resulting spiking architecture correspond to the analog activations in the original ANN. This mapping can be achieved by replacing the neurons in the ANN with non-leaky integrate-and-fire neurons [26]. The trained parameters remain the same, up to a layer-wise rescaling that reduces the problem of limited dynamic range of spiking neurons [27]. [20] proposed implementations in the spike domain of modules commonly used in ANNs, like max-pooling and softmax layers. We apply their open-source conversion framework [28] to transfer the predator-prey ANN to the event-based domain.
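The correspondence between firing rates and analog activations can be illustrated with a single non-leaky integrate-and-fire neuron driven by a constant input. The sketch below is not the conversion framework [28]; the reset-by-subtraction rule, the unit threshold, and the function name are assumptions chosen for illustration.

```python
def if_neuron_rate(input_current: float, threshold: float = 1.0,
                   steps: int = 1000) -> float:
    """Firing rate of a non-leaky integrate-and-fire neuron with constant input."""
    v = 0.0
    spikes = 0
    for _ in range(steps):
        v += input_current          # no leak: the potential only integrates
        if v >= threshold:
            spikes += 1
            v -= threshold          # reset by subtraction (assumption)
    return spikes / steps


rate_negative = if_neuron_rate(-0.2)   # never reaches threshold
rate_moderate = if_neuron_rate(0.3)    # close to 0.3 spikes per step
rate_saturated = if_neuron_rate(1.5)   # capped at one spike per step
```

For inputs between zero and the threshold, the rate tracks the input much like a rectified linear activation. Inputs above the threshold saturate at one spike per time step, which is the limited dynamic range that the layer-wise rescaling is meant to address.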

| | [7] | This work |
|---|---|---|
| APS frames used | Yes | No |
| Biases | Yes | Yes, with L2-regularizer |
| Event polarity used | Yes | Yes, rectified |
| Frames initialized at | 0.5 | 0, for comput. sparsity |
| Subsampling (cf. III-B2) | | to remove clusters |
| Scaling frames | Yes | Yes (N/A in SNN) |
| 3-sigma normalization | Yes | Yes, before scaling |

TABLE I: Comparison of methodology.

II-E Data used for testing the converted SNN

Due to the lack of truly event-based datasets acquired with neuromorphic vision sensors, recent work has often generated spike trains synthetically from frame-based image datasets. The most common method is to use Poisson spike generators whose firing rates are driven by the intensity of the corresponding input pixels. However, the stochastic nature of the generated Poisson spike trains introduces noise into the network, without any notable benefit. A simple alternative is to use analog input values in the very first hidden layer, and to compute with spikes from there on [29, 20]. The image pixel values are interpreted as currents flowing into the neurons of the first hidden layer, where they are integrated into membrane potentials, thus deterministically producing regular spikes at a rate proportional to the pixel value. Recent work [22] has reported that while converted SNNs seem to work well on synthetic input data, using real event-based data as input can lead to a significant drop in classification accuracy.

In this work, we perform simulations with Poisson and analog inputs from synthetically generated frames (Sec. II-B), and we compare these to directly applying the original DVS events from the predator/prey dataset as input spikes.
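The difference between the two synthetic input types can be sketched for a single pixel. This is a simplified illustration, not the simulator used in this work: the analog case is reduced to one integrate-and-fire neuron with unit weight and unit threshold, and the Poisson case is approximated by one Bernoulli draw per time step with spike probability equal to the pixel value. All names are hypothetical.

```python
import random


def analog_input_spikes(pixel: float, steps: int, threshold: float = 1.0):
    """Pixel value acts as a constant current into a non-leaky IF neuron."""
    v, train = 0.0, []
    for _ in range(steps):
        v += pixel
        fired = v >= threshold
        if fired:
            v -= threshold
        train.append(int(fired))
    return train


def poisson_input_spikes(pixel: float, steps: int, rng: random.Random):
    """One Bernoulli draw per step; spike probability equals the pixel value."""
    return [int(rng.random() < pixel) for _ in range(steps)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
analog = analog_input_spikes(0.25, steps=400)
poisson = poisson_input_spikes(0.25, steps=400, rng=rng)
```

The analog train fires exactly every fourth step, so its spike count is fixed by the pixel value. The Poisson train has the same expected count, but the count and the spike timing vary from run to run, which is the noise referred to above.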

II-F Simulation of the converted SNN

We make use of the SNN toolbox [28] to run the converted SNN on the three input types described above. The SNN toolbox provides a simulator for spiking networks that is built on the Keras framework. The spiking network consisting of non-leaky integrate-and-fire neurons is processed in a time-stepped manner with a step size equivalent to the time resolution of the DVS event stream (1 microsecond).

The DVS data set used for testing the converted SNN is stored as a collection of .aedat files, where each file contains a DVS clip of several seconds. Previously, the toolbox accepted frame-like input. To be able to use asynchronous data, we extended the toolbox by a DataGenerator module that iteratively reads in an .aedat file and processes the event sequence with subsampling and outlier removal as described in Sec. II-B. The network outputs a classification guess at each time step, but we define the period of time needed to process 5000 events as “one sample”, and take as final classification output of the network for one particular sample the class corresponding to the neuron that fired the most spikes while processing the sample. When all events in an .aedat file are processed, the DataGenerator loads the next sequence of events from the aedat-directory; this procedure is repeated until all events are processed.
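The per-sample decision rule described above can be written down in a few lines. The sketch assumes that the output spikes of one sample are available as `(time_step, neuron_index)` pairs and that ties are resolved in favor of the lowest class index; neither detail is specified in the text.

```python
from typing import Iterable, List, Tuple


def classify_sample(output_spikes: Iterable[Tuple[int, int]],
                    num_classes: int) -> int:
    """Return the class whose output neuron fired most during one sample."""
    counts: List[int] = [0] * num_classes
    for _time_step, neuron in output_spikes:
        counts[neuron] += 1
    # max() returns the first maximum, so ties go to the lowest index.
    return max(range(num_classes), key=counts.__getitem__)


# Hypothetical output spikes recorded while processing one 5000-event sample.
sample_spikes = [(3, 1), (10, 1), (12, 0), (40, 1), (41, 3)]
predicted = classify_sample(sample_spikes, num_classes=4)
```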

III Results

III-A ANN accuracy

First, we reproduced the results of [7] by training the frame-based CNN architecture in Fig. 2 on frames generated from the DVS events as outlined in Sec. II-B. The original work used a combined dataset of APS frames and synthesized DVS frames. We found that the same classification performance can be achieved using DVS frames alone, which is preferable in the present setup, because only DVS events will be used during inference.

Fig. 3: Spike trains generated by simulating the SNN on a single test sample corresponding to 5000 DVS events. X-axis: time (450 steps of simulation); y-axis: neuron index. See Sec. II-E for a description of the input types, and Sec. II-B for details on the frame generation. The figure is discussed at the end of Sec. III-B3.

III-B SNN accuracy

After transferring the ANN to an SNN as described in Sec. II-D, the SNN was tested on the three input types listed in Sec. II-E, namely analog (frame-based), Poisson, and DVS input. Both analog and Poisson input resulted in SNN accuracies close to the original ANN accuracy (see Table II). However, in our initial experiments, the SNN accuracy dropped to chance level when using the original DVS input spike trains. We discuss the reasons for this reduction of accuracy, and propose and evaluate solutions to restore accuracy in the remainder of this section.

III-B1 Imbalance between network biases and DVS rates

In [28], the bias values of neurons in the ANN are converted into constant input currents that flow into SNN neurons over the course of the simulation. If a bias value is large, this bias current can outweigh the spike-driven input to a neuron and dominate that neuron’s output firing dynamics. This effect is likely to occur in neurons receiving DVS input spike trains, whose rates may vary considerably over the duration of a single 5000-event sequence (see Sec. III-B3).

To prevent dominating biases, we trained the ANN with L2-regularization on the network weights and biases. L2-regularization adds to the training cost function a term which is proportional to the squared parameter values, thereby inducing the network to keep parameter values small. Training the ANN without regularization led to several bias values that were close to or above the threshold of the SNN neurons, thereby dominating their firing dynamics. The classification accuracy of the converted L2-regularized SNN increased by 43% as compared to the SNN without regularization.
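The effect of a dominating bias can be illustrated with a single integrate-and-fire neuron that receives a constant bias current plus a sparse input spike train. The bias values, the synaptic weight, and the reset-by-subtraction rule below are hypothetical choices for illustration, not values from the trained network.

```python
def count_output_spikes(bias, input_spikes, weight=0.5, threshold=1.0):
    """Output spike count of an IF neuron with a constant bias current."""
    v, n = 0.0, 0
    for s in input_spikes:
        v += bias + weight * s      # bias flows in at every time step
        if v >= threshold:
            n += 1
            v -= threshold          # reset by subtraction (assumption)
    return n


steps = 64
silent = [0] * steps                                   # no input spikes
active = [1 if t % 8 == 0 else 0 for t in range(steps)]  # 8 input spikes

large_bias = (count_output_spikes(0.875, silent),
              count_output_spikes(0.875, active))
small_bias = (count_output_spikes(0.015625, silent),
              count_output_spikes(0.015625, active))
```

With these numbers, the large bias yields 56 output spikes without any input and 60 with input, so the output barely depends on the input. The small bias yields 1 and 5 spikes, respectively, so the input determines the output. Keeping the biases small through L2-regularization moves the neuron towards the second regime.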

Training the ANN without biases altogether (as done in [21]) may be another straightforward solution. We trained an ANN without biases, which achieved 0.6% lower accuracy than the L2-regularized ANN containing biases. The converted SNN without biases scored better than the non-regularized SNN with biases, but 0.81% worse than the L2-regularized SNN with biases. Thus, we favor the regularized model with biases.

III-B2 Subsampling induces temporal clusters

Aside from dominating biases, another reason for the drop in classification performance when using DVS input was the subsampling mechanism. Pixel addresses in the original 240x180 image space are subsampled to 36x36 by integer division. If a subsampled patch contains several simultaneous events, they will all be mapped onto a single pixel address, thereby transforming a spatial-temporal cluster into a temporal cluster. A neuron in the SNN then receives the spikes contained in such a “burst” at immediately-subsequent time steps during the simulation. These spike bursts are in strong contrast to the way the network was trained, namely on analog frames, where such temporal structures are not present.

To see why temporally structured spike trains may produce a different outcome than homogeneously distributed spike trains, consider a neuron receiving a fixed number of input spikes from two sources, one inhibitory and the other one excitatory. If both sources fire at a regular rate, their contributions cancel each other and the neuron will not be active. If instead the spikes of the excitatory source are clustered into an early spike burst, the neuron will be strongly activated, even though the total number of spikes from each source over a given time period has not changed.
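This thought experiment can be reproduced with a small simulation. The sketch assumes a non-leaky integrate-and-fire neuron with reset by subtraction, equal synaptic weights of opposite sign, and a membrane potential that is allowed to become negative; these choices are illustrative and not taken from the simulator.

```python
def count_spikes(excitatory, inhibitory, weight=0.5, threshold=1.0):
    """Output spikes of an IF neuron with one excitatory and one inhibitory input."""
    v, n = 0.0, 0
    for e, i in zip(excitatory, inhibitory):
        v += weight * e - weight * i
        if v >= threshold:
            n += 1                  # an emitted spike cannot be taken back
            v -= threshold
    return n


steps = 20
regular = [1 if t % 2 == 0 else 0 for t in range(steps)]  # 10 spikes, evenly spread
burst = [1] * 10 + [0] * 10                               # 10 spikes, all early

balanced_output = count_spikes(regular, regular)  # inputs cancel at every step
burst_output = count_spikes(burst, regular)       # early excitation wins
```

Both excitatory trains contain ten spikes, yet the regular one produces no output while the burst produces two output spikes before the inhibitory input catches up. The later inhibition only drives the membrane potential negative and cannot undo the spikes that were already emitted.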

To prevent formation of detrimental temporal clusters during subsampling, we keep only one of the subsampled events in a patch. Here we term this method max subsampling, and the method that accumulates all events in a patch sum subsampling. The classification accuracy of the ANN trained on the max- and sum-subsampled data is 88.25% and 88.04%, respectively. The accuracy of the converted SNN is 85.19% and 78.24%, respectively, which shows the importance of removing spike bursts due to subsampling.
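A minimal sketch of the two subsampling variants is given below. It assumes that "simultaneous" means an identical timestamp, that the address mapping is `x * 36 // 240` and `y * 36 // 180`, and that events are `(timestamp, x, y)` tuples; the exact mapping and data layout used in this work may differ.

```python
SENSOR_W, SENSOR_H, TARGET = 240, 180, 36


def subsample(events, mode):
    """Map sensor addresses to a 36x36 grid using 'sum' or 'max' subsampling."""
    out = []
    seen = set()
    for t, x, y in events:
        key = (t, x * TARGET // SENSOR_W, y * TARGET // SENSOR_H)
        if mode == "max":
            if key in seen:
                continue            # keep only one simultaneous event per patch
            seen.add(key)
        out.append(key)             # 'sum' keeps every event
    return out


# Three simultaneous events inside one patch, plus one later event.
raw = [(5, 0, 0), (5, 1, 1), (5, 6, 0), (6, 0, 0)]
sum_events = subsample(raw, "sum")
max_events = subsample(raw, "max")
```

With sum subsampling, the three simultaneous events arrive at the same target pixel as a burst. With max subsampling, only one of them is passed on, so the burst is removed before it reaches the network.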

III-B3 Non-uniform DVS spike trains

With regularized biases and max-subsampling, the classification accuracy of the converted SNN using DVS input improves from chance level to within 3% of the original ANN. We were not able to close this gap completely, and believe the underlying cause to be inhomogeneities in the DVS spike trains.

By viewing the DVS recordings as well as by studying the raster plots in Fig. 3, one can observe phases of increased global activity within the time window that corresponds to 5000 events. These bursts of global activity are likely the result of electrical coupling between the frame electronic shutter and the DVS circuits within the pixels [5]. These abrupt changes in firing frequency propagate throughout the network. The variability of DVS event rates differs strongly from the rather uniform spike distribution observed when using Poisson or analog input. In the previous subsection, we argued that temporal spike patterns in the test phase can have a detrimental effect because of the asymmetric spike generation mechanism: neuron activity due to a burst of excitatory input cannot be reversed by a later burst of inhibitory input. We cannot expect the network to cope with temporal structure in the input that it has never experienced during training.

This intuitive explanation of the remaining accuracy loss can be validated by using Poisson or analog input, which features constant firing rates as during training. With 88.12% accuracy, analog input nearly closes the accuracy gap. With 86.77% accuracy, Poisson input falls between analog and DVS input, which is reasonable given the variance inherent in a Poisson process.

Figure 3 compares the spike trains generated by the different input types. Analog input (first row) results in the most regular firing dynamics and the highest support for the correct class label (second neuron in output layer). Surprisingly, the slight variations induced by Poisson variability (second row) reduce the network’s confidence in the correct class label significantly. Asynchronous DVS input (third row) exhibits temporal structure that is not present in analog or Poisson input. The spike rates are generally lower, which accounts for the reduced operation cost, but contributes also to the increased classification error. DVS input with sum-subsampling (bottom row, cf. Sec. III-B2) contains spike bursts that are fed into the network in close succession, causing the spike train to spread out over a longer simulation time. These temporal patterns – unseen during training – cause the network to confuse the correct output label (right column).

Fig. 4: Average classification test accuracy of ANN (single cross) and SNNs (curves) for DVS, analog and Poisson input, plotted against the number of operations.

III-C Operation cost

Besides classification accuracy, a second important metric for SNN performance is its operation cost. An “operation” for the SNN is defined as a synaptic update, i.e., the update of a neuron’s state due to a spike in the preceding layer. This operation corresponds to a simple “addition”, in contrast to more costly multiply-accumulate operations needed in conventional ANNs. We compute this quantity from the network architecture and the number of spikes that each neuron fires during the simulation [20].
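Under this definition, the operation count is the sum over all neurons of the number of spikes a neuron emitted times the number of downstream synapses it drives. The sketch below uses made-up spike counts and fan-out values purely for illustration; the function name is hypothetical.

```python
from typing import Sequence


def synaptic_operations(spike_counts: Sequence[int],
                        fan_out: Sequence[int]) -> int:
    """Total synaptic updates: each spike triggers one addition per target."""
    return sum(n * f for n, f in zip(spike_counts, fan_out))


# Three hypothetical neurons: spikes fired and number of outgoing synapses.
spikes_per_neuron = [3, 0, 2]
synapses_per_neuron = [10, 10, 4]
total_ops = synaptic_operations(spikes_per_neuron, synapses_per_neuron)
```

A silent neuron contributes nothing to the total, which is why sparse input leads to a low operation count.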

Figure 4 compares the accuracies and operation costs of the ANN trained with L2-regularized biases on max-subsampled data, and of the converted SNN tested on the original DVS events, synthesized analog frames, and Poisson spike trains. The operation cost for the ANN is a single value, because inference consists of a single forward pass. In the SNN, a continuum of classification accuracies is obtained as simulation progresses and more operations are invested. Table II lists the final accuracy and operation cost at the end of each of the SNN curves.

Summarizing the results, the SNN with analog input provides the highest accuracy while reducing the number of operations by 7x compared to the original ANN. The SNN with DVS input suffers from an accuracy loss of 3%, but compensates for it by a 12x reduction in computational cost.
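The reduction factors quoted above follow directly from the operation counts and accuracies listed in Table II, as the following short calculation shows. The dictionary layout is only a convenience for this sketch.

```python
# Values copied from Table II: (accuracy in %, operations in MOps).
results = {
    "ANN (analog)": (88.25, 7.85),
    "SNN (analog)": (88.12, 1.15),
    "SNN (Poisson)": (86.77, 3.06),
    "SNN (DVS)": (85.19, 0.66),
}

ann_acc, ann_ops = results["ANN (analog)"]

# Factor by which each SNN reduces the operation count relative to the ANN.
reduction = {name: ann_ops / ops for name, (_, ops) in results.items()}
# Accuracy lost relative to the ANN, in percentage points.
accuracy_gap = {name: ann_acc - acc for name, (acc, _) in results.items()}

for name in results:
    print(f"{name}: {reduction[name]:.1f}x fewer ops, "
          f"{accuracy_gap[name]:.2f} points below the ANN")
```

The ratios come out at roughly 6.8 for analog input and 11.9 for DVS input, which the text rounds to 7x and 12x.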

IV Conclusion

In this work, we explored spiking neural networks (SNNs) as efficient replacements of conventional frame-based, analog neural networks (ANNs), on the task of a robot pursuing a moving target. The underlying data set stems from a dynamic vision sensor (DVS), which provides a continuous stream of asynchronous events. These event streams are seldom used directly as input to deep neural networks; instead, the events are commonly binned into frames, on which the network is trained and tested. While this frame-based approach grants easy access to a wealth of powerful deep learning frameworks, it sacrifices the very low latency and potentially sparse computation inherent in asynchronous event streams from a DVS. Converting a pre-trained frame-based ANN into an event-driven SNN aims to combine the best of both worlds: Frame-based training provides us with a high-accuracy model, while inference is done on sparse asynchronous events.

This study confirms earlier findings [20, 29] showing that the converted SNN achieves classification accuracy equivalent to that of the original ANN when using static frames or Poisson spike trains as input. However, we take this analysis a step further by applying the original DVS events as input to the SNN. Initial classification results were close to chance level, indicating significant distortions when transitioning from a synchronously trained model to an asynchronously tested model. As causes for this accuracy loss we identify (1) the way that the training frames are generated from DVS data, (2) extreme weights or biases, and (3) temporal structure present in the asynchronous test data but not in the training frames.

To solve these issues, we propose (1) training frame generation steps that are applicable to the DVS test data as well, thereby minimizing the discrepancy between training and test set, and (2) L2-regularization during training to effectively prevent dominating model parameters. The resulting SNN achieves classification accuracy close to the original ANN. The third issue, temporal structure in the DVS event stream, can only partly be removed, and likely accounts for the remaining 3% accuracy gap between synchronous ANN and event-driven SNN.

By evaluating the computational cost of the SNN when run on DVS events, we confirm the expected improvement in terms of low latency and sparse, change-driven operation. Specifically, inference in the SNN can be done using 12x fewer computations on this data set. Further, the computations consist of simple additions, which are energetically cheaper than the multiply-accumulate operations used in conventional ANNs. Future work concerns the measurement of the energy consumption (including the cost of memory transfer due to keeping neuron states).

To close the remaining accuracy gap, extensions of this work might consider training on the DVS events directly. This would create a purely asynchronous setting and potentially enable the model to accurately process streams with highly variable event rates. Regardless of the training method, detrimental inhomogeneities in the DVS input could potentially be removed by low-pass filtering or other preprocessing with smoothing effect.

The present work is a step towards efficient inference in mobile and embedded systems requiring low latency and computation cost, which will in particular profit from development of asynchronous event-based computing hardware.

| Model (input type) | Accuracy [%] | Operations [MOps] |
|---|---|---|
| ANN (analog) | 88.25 | 7.85 |
| SNN (analog) | 88.12 | 1.15 |
| SNN (Poisson) | 86.77 | 3.06 |
| SNN (DVS) | 85.19 | 0.66 |

TABLE II: Classification accuracy of the original ANN and the converted SNN.


This work was funded by an SNSF project Ambizione under grant agreement PZOOP2_168183.


  • [1] D. Falanga, E. Mueggler, M. Faessler, and D. Scaramuzza, “Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 5774–5781, 2017.
  • [2] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM,” International Journal of Robotics Research, vol. 36, no. 2, pp. 142–149, 2017.
  • [3] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
  • [4] C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259–275, 2011.
  • [5] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor,” IEEE Journal of Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, 2014.
  • [6] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.
  • [7] D. P. Moeys, F. Corradi, E. Kerr, P. Vance, G. Das, D. Neil, D. Kerr, and T. Delbrück, “Steering a predator robot using a mixed frame/event-driven convolutional neural network,” in Event-based Control, Communication, and Signal Processing (EBCCSP), 2016 Second International Conference on. IEEE, 2016, pp. 1–8.
  • [8] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza et al., “A low power, fully event-based gesture recognition system.”
  • [9] I.-A. Lungu, F. Corradi, and T. Delbruck, “Live demonstration: Convolutional neural network driven by dynamic vision sensor playing roshambo,” in 2017 IEEE Symposium on Circuits and Systems (ISCAS 2017), Baltimore, MD, USA, 2017.
  • [10] H. Martin and J. Conradt, “Spiking neural networks for vision tasks,” 2015.
  • [11] W. Maass and H. Markram, “On the computational power of circuits of spiking neurons,” Journal of computer and system sciences, vol. 69, no. 4, pp. 593–616, 2004.
  • [12] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, “Overview of the SpiNNaker system architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
  • [13] G. Indiveri, F. Corradi, and N. Qiao, “Neuromorphic Architectures for Spiking Deep Neural Networks,” pp. 68–71, 2015.
  • [14] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
  • [15] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Frontiers in neuroscience, vol. 10, 2016.
  • [16] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, D. Sumislawska, and G. Indiveri, “A Re-configurable On-line Learning Spiking Neuromorphic Processor comprising 256 neurons and 128K synapses,” Frontiers in Neuroscience, vol. 9, no. February, 2015.
  • [17] G. Indiveri, E. Chicca, and R. J. Douglas, “Artificial Cognitive Systems: From VLSI Networks of Spiking Neurons to Neuromorphic Cognition,” Cognitive Computation, vol. 1, no. 2, pp. 119–127, 2009.
  • [18] E. Neftci, C. Augustine, S. Paul, and G. Detorakis, “Neuromorphic deep learning machines,” arXiv preprint arXiv:1612.05596, 2016.
  • [19] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis, “Event-driven random back-propagation: Enabling neuromorphic deep learning machines,” Frontiers in neuroscience, vol. 11, p. 324, 2017.
  • [20] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Conversion of Continuous-Valued Deep Networks to Efficient Event-Driven Networks for Image Classification,” Frontiers in Neuroscience, vol. 11, no. December, pp. 1–12, 2017.
  • [21] P. U. Diehl, G. Zarrella, A. Cassidy, B. U. Pedroni, and E. Neftci, “Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware,” in Rebooting Computing (ICRC), IEEE International Conference on. IEEE, 2016, pp. 1–8.
  • [22] E. Stromatias, M. Soto, T. Serrano-Gotarredona, and B. Linares-Barranco, “An event-driven classifier for spiking neural networks fed with synthetic or dynamic vision sensor data,” Frontiers in Neuroscience, vol. 11, no. JUN, pp. 1–17, 2017.
  • [23] INIlabs, “Aedat fileformat,”
  • [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [25] F. Chollet et al., “Keras,” 2015.
  • [26] A. N. Burkitt, “A review of the integrate-and-fire neuron model: I. homogeneous synaptic input,” Biological cybernetics, vol. 95, no. 1, pp. 1–19, 2006.
  • [27] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–8.
  • [28] B. Rueckauer, “Spiking neural network conversion toolbox,”
  • [29] D. Zambrano and S. M. Bohte, “Fast and efficient asynchronous neural computation with adapting spiking neural networks,” arXiv preprint arXiv:1609.02053, 2016.