
End-to-End Learning of Representations for Asynchronous Event-Based Data

by   Daniel Gehrig, et al.

Event cameras are vision sensors that record asynchronous streams of per-pixel brightness changes, referred to as "events". They have appealing advantages over frame-based cameras for computer vision, including high temporal resolution, high dynamic range, and no motion blur. Due to the sparse, non-uniform spatiotemporal layout of the event signal, pattern recognition algorithms typically aggregate events into a grid-based representation and subsequently process it by a standard vision pipeline, e.g., a Convolutional Neural Network (CNN). In this work, we introduce a general framework to convert event streams into grid-based representations through a sequence of differentiable operations. Our framework comes with two main advantages: (i) it allows learning the input event representation together with the task-dedicated network in an end-to-end manner, and (ii) it lays out a taxonomy that unifies the majority of extant event representations in the literature and identifies novel ones. Empirically, we show that our approach to learning the event representation end-to-end yields an improvement of approximately 12% on optical flow estimation and object recognition over state-of-the-art methods.





Supplementary material

Qualitative results can be viewed in the following video:
The project’s code will be released soon.

1 Introduction

Figure 1: General framework to convert asynchronous event data into grid-based representations using convolutions, quantization and projections. All of these operations are differentiable. Best viewed in color.

Event cameras are bio-inspired vision sensors that operate radically differently from traditional cameras. Instead of capturing brightness images at a fixed rate, event cameras measure brightness changes (called events) for each pixel independently. Event cameras, such as the Dynamic Vision Sensor (DVS) [1], possess appealing properties compared to traditional frame-based cameras, including a very high dynamic range, high temporal resolution (on the order of microseconds), and low power consumption. In addition, event cameras greatly reduce bandwidth. While frame-based cameras with comparable temporal resolution and/or dynamic range exist, they are typically bulky, power-hungry, and require cooling.

The output of an event camera consists of a stream of events that encode the time, location, and polarity (sign) of the brightness changes. Consequently, each event alone carries very little information about the scene. Event-based vision algorithms aggregate information to enable further processing in two ways: (i) use a continuous-time model (e.g., Kalman filter) that can be updated asynchronously with each incoming event [2, 3, 4, 5, 6] or (ii) process events simultaneously in packets [7, 8, 9, 10, 11], i.e., spatiotemporal localized aggregates of events. The former methods can achieve minimal latency, but are sensitive to parameter tuning (e.g., filter weights) and are computationally intensive, since they perform an update step for each event. In contrast, methods operating on event packets trade off latency for computational efficiency and performance. Despite their differences, both paradigms have been successfully applied to various vision tasks, including tracking [12, 5, 2, 13], depth estimation [6, 9, 10], visual odometry [7, 14, 8, 11], recognition [3, 4], and optical flow estimation [15, 16].

Motivated by the broad success of deep learning in computer vision on frame-based imagery, a growing number of recent event-based works have adopted a data-driven approach [17, 18, 19, 20, 16]. Spiking Neural Networks (SNNs) are a natural fit to process event streams, since they enable asynchronous inference at low power on specialized hardware [17, 18, 19]. However, SNNs are notoriously difficult to train, as no efficient backpropagation algorithm exists [21]. In addition, the special-purpose hardware required to run SNNs is expensive and still in the development stage, which hinders its widespread adoption in the vision community.

Most closely related to the current paper are methods that pair an event stream with standard frame-based deep convolutional neural network (CNN) or recursive architectures, e.g., [22, 4, 23, 20, 16]. To do so, a pre-processing step typically converts asynchronous event data to a grid-like representation, which can be updated either synchronously [20, 16] or asynchronously [22, 4]. These methods benefit from their ease of implementation using standard frame-based deep learning libraries (e.g., [24, 25]) and fast inference on commodity graphics hardware. However, these efforts have mainly focused on the downstream task beyond the initial representational stage and simply consider a fixed, possibly suboptimal, conversion between the raw event stream and the input grid-based tensor. To date, there has not been an extensive study on the impact of the choice of input representation, leaving the following fundamental open question:

What is the best way to convert an asynchronous event stream into a grid-based representation to maximize the performance on a given task? In this paper, we aim to address this knowledge gap.

Contributions We propose a general framework that converts asynchronous event-based data into grid-based representations. To achieve this, we express the conversion process through kernel convolutions, quantizations, and projections, where each operation is differentiable (see Fig. 1). Our framework comes with two main advantages. First, it makes the conversion process fully differentiable, allowing a representation to be learned end-to-end from raw event data to the task loss. In contrast, prior work assumes the input event representation as fixed. Second, it lays out a taxonomy that unifies the majority of extant event representations in the literature and identifies novel ones. Through extensive empirical evaluations we show that our approach to learning the event representation end-to-end yields an improvement of 12% on optical flow and 12.6% on object recognition over state-of-the-art approaches that rely on handcrafted input event representations. In addition, we compare our methodology to asynchronous approaches in terms of accuracy and computational load, shedding light on the relative merits of each category. Upon acceptance we will release the code of our approach.

2 Related Work

Traditionally, handcrafted features were used in frame-based computer vision, e.g., [26, 27, 28, 29, 30]. More recently, research has shifted towards data-driven models, where features are automatically learned from data, e.g., [31, 32, 33, 34, meister17unflow]. The main catalyst behind this paradigm shift has been the availability of large training datasets [35, 36, 37], efficient learning algorithms [38, 39] and suitable hardware. Only recently has event-based vision made strides to address each of these areas.

Analogous to early frame-based computer vision approaches, significant effort has been made in designing efficient spatiotemporal feature descriptors of the event stream. From this line of research, typical high-level applications are gesture recognition [40], object recognition [22, 4, 41] or face detection [42]. Low-level applications include optical flow prediction [43, 44] and image reconstruction [45].

Another line of research has focused on applying data-driven models to event-based data. Given its asynchronous nature, spiking neural networks (SNNs) represent a good fit to process event streams [18]. Indeed, SNNs have been applied to several tasks, e.g., object recognition [3, 46, 18, 17], gesture classification [19], and optical flow prediction [15, 44]. However, the lack of specialized hardware and computationally efficient backpropagation algorithms still limits the usability of SNNs in complex real-world scenarios. A typical solution to this problem is learning parameters with frame-based data and transferring the learned parameters to event data [17, 47]. However, it is not clear how much this solution can generalize to real, noisy, event data that has not been observed during training.

| Representation | Dimensions | Description | Characteristics |
|---|---|---|---|
| Event frame [48] | $H \times W$ | Image of event polarities | Discards temporal and polarity information |
| Event count image [20, 16] | $2 \times H \times W$ | Image of event counts | Discards time stamps |
| Surface of Active Events (SAE) [15, 16] | $2 \times H \times W$ | Image of most recent time stamp | Discards earlier time stamps |
| Voxel grid [49] | $B \times H \times W$ | Voxel grid summing event polarities | Discards event polarity |
| Histogram of Time Surfaces (HATS) [22] | $2 \times H \times W$ | Histogram of average time surfaces | Discards temporal information |
| EST (Our work) | $2 \times B \times H \times W$ | Sample event point-set into a grid | Discards the least amount of information |

Table 1: Comparison of grid-based event representations used in prior work on event-based deep learning. $H$ and $W$ denote the image height and width dimensions, respectively, and $B$ the number of temporal bins.

Recently, several works have proposed to use standard learning architectures as an alternative to SNNs [20, 23, 16, 50, 22]. To process asynchronous event streams, Neil et al. [23] adapted a recursive architecture to include the time dimension for prediction. Despite operating asynchronously, their approach introduces high latency, since events have to pass sequentially through the entire recursive structure. To reduce latency, other methods convert event streams into a grid-based representation, compatible with learning algorithms designed for standard frames, e.g., CNNs [20, 16, 50, 22]. Sironi et al. [22] obtained state-of-the-art results in object recognition tasks by transforming events into histograms of averaged time surfaces (HATS), which are then fed to a support vector machine for inference. The main advantage of their representation is that it can not only be used in conjunction with standard learning pipelines, but can also be updated asynchronously, if sufficient compute is available. A simpler representation was proposed by Maqueda et al. [20] to address steering-angle prediction, where events of different polarities are accumulated over a constant temporal window. To perform a low-level task, i.e., optical flow estimation, Zhu et al. [16] proposed to convert events into a four-dimensional grid that includes both the polarity and spike time. Finally, Zhu et al. [49] converted events into a spatiotemporal voxel grid. Compared to the representation proposed in [20], the two latter representations have the advantage of preserving temporal information. A common aspect among these works is the use of a handcrafted event stream representation. In contrast, in this paper we propose a novel event-based representation that is learned end-to-end together with the task. A comparison of event-based representations and their design choices is summarized in Table 1.

Coupling event-based data with standard frame-based learning architectures has the potential to combine the flexibility of learning algorithms with the advantages of event cameras. However, the impact of the event representation on task performance is not yet clear. In this work, we present an extensive empirical study on the choice of representation for the tasks of object recognition and optical flow estimation, two central tasks in computer vision.

3 Method

In this section, we present a general framework to convert asynchronous event streams into grid-based representations. By performing the conversion strictly through differentiable operators, our framework allows us to learn a representation end-to-end for a given task. Equipped with this tool, we derive a taxonomy that unifies common representations in the literature and identifies new ones. An overview of the proposed framework is given in Fig. 2.

3.1 Event Data

Event cameras have pixels which trigger events independently whenever there is a change in the log brightness $L(x, y, t)$:

$$p\,\bigl(L(x, y, t) - L(x, y, t - \Delta t)\bigr) \geq C, \tag{1}$$

where $C$ is the contrast threshold, $p \in \{-1, 1\}$ is the polarity of the change in brightness, and $\Delta t$ is the time since the last event at $(x, y)$. In a given time interval $\Delta\tau$, the event camera will trigger a number of events:

$$\mathcal{E} = \{e_k\}_{k=1}^{N} = \{(x_k, y_k, t_k, p_k)\}_{k=1}^{N}. \tag{2}$$

Due to their asynchronous nature, events are represented as a set. To use events in combination with a convolutional neural network it is necessary to convert the event set into a grid-like representation. This means we must find a mapping $\mathcal{M}: \mathcal{E} \to \mathcal{T}$ between the set $\mathcal{E}$ and a tensor $\mathcal{T}$. Ideally, this mapping should preserve the structure (i.e., spatiotemporal locality) and information of the events.
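To make the triggering condition concrete, the following is a minimal sketch of an idealized, noise-free event generator for a single pixel. The function name `generate_events`, the sampled input signal, and the default threshold value are illustrative assumptions, not part of the original work; real sensors also exhibit noise and refractory effects that are ignored here.

```python
def generate_events(log_brightness, timestamps, C=0.2):
    """Idealized single-pixel event generator (hypothetical helper).

    Emits a (t, p) tuple each time the log brightness has moved by at
    least the contrast threshold C away from its value at the last event.
    """
    events = []
    ref = log_brightness[0]  # log brightness at the last event
    for L, t in zip(log_brightness[1:], timestamps[1:]):
        # A large change between two samples may cross the threshold
        # several times, producing several events with the same time stamp.
        while abs(L - ref) >= C:
            p = 1 if L > ref else -1
            ref += p * C
            events.append((t, p))
    return events


# Brightness rises twice, then falls: two positive events, one negative.
print(generate_events([0.0, 0.25, 0.5, 0.1], [0, 1, 2, 3], C=0.2))
```

The output is an unordered-in-space, ordered-in-time list of tuples, which is why a separate mapping step is needed before a CNN can consume it.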

3.2 Event Field

Intuitively, events represent point-sets in a four-dimensional manifold spanned by the $x$ and $y$ spatial coordinates, time, and polarity. This point-set can be summarized by the event field, inspired by [51, 18]:

$$S_\pm(x, y, t) = \sum_{e_k \in \mathcal{E}_\pm} \delta(x - x_k, y - y_k)\,\delta(t - t_k), \tag{3}$$

defined in continuous space and time, for events of positive ($\mathcal{E}_+$) and negative ($\mathcal{E}_-$) polarity. This representation replaces each event by a Dirac pulse in the space-time manifold. The resulting function $S_\pm(x, y, t)$ gives a continuous-time representation of the event set $\mathcal{E}$ which preserves the events' high temporal resolution and enforces spatiotemporal locality.

3.3 Generating Representations


In this section, we generalize the notion of the event field and demonstrate how it can be used to generate a grid-like representation from the events. We observe that (3) can be interpreted as successive measurements of a function $f_\pm$ defined on the domain of the events, i.e.,

$$S_\pm(x, y, t) = \sum_{e_k \in \mathcal{E}_\pm} f_\pm(x, y, t)\,\delta(x - x_k, y - y_k)\,\delta(t - t_k). \tag{4}$$

We call (4) the Event Measurement Field. It assigns a measurement $f_\pm(x_k, y_k, t_k)$ to each event. Examples of such functions are the event polarity $f_\pm(x, y, t) = \pm 1$, the event count $f_\pm(x, y, t) = 1$, and the normalized time stamp $f_\pm(x, y, t) = (t - t_0)/\Delta t$. Other examples might include the instantaneous event rate or image intensity provided by such sensors as the Asynchronous Time-based Image Sensor (ATIS) [52]. Various representations in the literature make use of the event measurement field. In several works [22, 4, 20, 16], pure event counts are measured, and summed for each pixel and polarity to generate event count images. Other works [16, 15] use the time stamps of the events to construct the surface of active events (SAE), which retains the time stamp of the most recent event for each pixel and polarity. Other representations use the event polarities and aggregate them into a three-dimensional voxel grid [49] or a two-dimensional event frame [48].
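As an illustration of how measurements turn into the representations mentioned above, here is a small NumPy sketch that evaluates three measurement functions on a toy event list and builds an event count image and a surface of active events. The toy events, the array layout (polarity, row, column), and the choice to normalize time stamps by the length of the window are all assumptions made for this example.

```python
import numpy as np

# Hypothetical toy events; columns are x, y, t (seconds), polarity (+1 / -1).
events = np.array([
    [0, 0, 0.00,  1],
    [1, 0, 0.01, -1],
    [0, 0, 0.02,  1],
    [1, 1, 0.04,  1],
])
H, W = 2, 2
x = events[:, 0].astype(int)
y = events[:, 1].astype(int)
t = events[:, 2]
p = events[:, 3]
c = (p > 0).astype(int)  # channel index: 0 = negative, 1 = positive

# Three measurement functions, evaluated at each event.
f_polarity = p
f_count = np.ones_like(t)
f_time = (t - t[0]) / (t[-1] - t[0])  # assumption: normalized to [0, 1] over the window

# Event count image: sum the count measurement per pixel and polarity.
count_image = np.zeros((2, H, W))
np.add.at(count_image, (c, y, x), f_count)

# Surface of active events: keep the latest normalized time stamp
# per pixel and polarity (the maximum, since time only increases).
sae = np.zeros((2, H, W))
np.maximum.at(sae, (c, y, x), f_time)

print(count_image[1])  # positive-polarity counts
print(sae[1])          # most recent positive time stamps
```

`np.add.at` and `np.maximum.at` are used instead of fancy-index assignment so that several events landing on the same pixel are all accumulated.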

Kernel Convolutions

Although the event measurement field retains the high temporal resolution of the events, it is still ill-defined due to the use of Dirac pulses. Therefore, to derive a meaningful signal from the event measurement field, we must convolve it with a suitable aggregation kernel $k$. The convolved signal thus becomes:

$$(k * S_\pm)(x, y, t) = \sum_{e_k \in \mathcal{E}_\pm} f_\pm(x_k, y_k, t_k)\, k(x - x_k, y - y_k, t - t_k). \tag{5}$$

In the literature, (5) is also known as the membrane potential [18, 53, 54]. Several variations of this kernel have been used in prior works. The two most commonly used ones are the $\alpha$-kernel [18, 54] and the exponential kernel [53]. In fact, the exponential kernel is also used to construct the hierarchy of time-surfaces (HOTS) [4] and histogram of average time-surfaces (HATS) [22], where events are aggregated into exponential time surfaces. In the case of HATS [22], the exponential time surfaces can be interpreted as a local convolution of the spike train with an exponential kernel. Another kernel which is typically used is the trilinear voting kernel [55]. Generally, the design of kernel functions is based on task-dependent heuristics with no general agreement on the optimal kernel to maximize performance.
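The three kernel families named above can be sketched in a few lines of NumPy. The exact functional forms and default constants below are common textbook choices and are assumptions here, since the text does not spell them out; only the temporal part of each kernel is shown, with the spatial part taken to be a per-pixel delta.

```python
import numpy as np

def exponential_kernel(dt, tau=0.03):
    """Causal exponential decay with time constant tau; zero for dt < 0."""
    dt = np.asarray(dt, dtype=float)
    return np.where(dt >= 0, np.exp(-np.maximum(dt, 0) / tau) / tau, 0.0)

def alpha_kernel(dt, tau=0.03):
    """Causal alpha function; rises, peaks at dt == tau with value 1, then decays."""
    dt = np.asarray(dt, dtype=float)
    s = np.maximum(dt, 0) / tau
    return np.where(dt >= 0, np.e * s * np.exp(-s), 0.0)

def trilinear_kernel(dt, bin_size=0.01):
    """Triangular (linear interpolation) weight with support |dt| < bin_size."""
    dt = np.asarray(dt, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(dt) / bin_size)

def membrane_potential(t_query, event_times, measurements, kernel):
    """Sum of kernel-weighted measurements at one pixel, evaluated at t_query."""
    return float(np.sum(measurements * kernel(t_query - event_times)))

# Two unit measurements, one at the query time and one half a bin earlier.
print(membrane_potential(0.0, np.array([0.0, -0.005]), np.array([1.0, 1.0]), trilinear_kernel))
```

The wider the kernel support, the more neighbouring events overlap in the aggregated signal, which is one way to read the later observation that some kernels reduce effective temporal localization.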

Discretized Event Spike Tensor

After kernel convolutions, a grid representation of events can be realized by sampling the signal (5) at regular intervals:

$$S_\pm[x_l, y_m, t_n] = (k * S_\pm)(x_l, y_m, t_n) = \sum_{e_k \in \mathcal{E}_\pm} f_\pm(x_k, y_k, t_k)\, k(x_l - x_k, y_m - y_k, t_n - t_k). \tag{6}$$

Typically, the spatiotemporal coordinates, $x_l, y_m, t_n$, lie on a voxel grid, i.e., $x_l \in \{0, 1, \ldots, W-1\}$, $y_m \in \{0, 1, \ldots, H-1\}$, and $t_n \in \{t_0, t_0 + \Delta t, \ldots, t_0 + (B-1)\Delta t\}$, where $t_0$ is the first time stamp, $\Delta t$ is the bin size, and $B$ is the number of temporal bins. We term this generalized representation the Event Spike Tensor (EST). Summing over both the polarity and time dimensions, one can derive the event-frame representation introduced in [11]. Previous works considered quantizing various dimensions, including spatiotemporal binning [49], and quantizing both the polarity and spatial dimensions [20, 16]. However, the generalized form that retains all four dimensions has not been previously considered, and thus is a new representation.
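The sampling step can be sketched as follows: each event's measurement is spread over the temporal bins with a triangular kernel, while the spatial kernel is taken to be a delta on the pixel grid. The function name `event_spike_tensor`, the axis order (polarity, time bin, row, column), and the rescaling of time stamps to bin coordinates are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def event_spike_tensor(events, H, W, B, f=None):
    """Sample events onto a (2, B, H, W) grid with a triangular temporal kernel.

    events: (N, 4) array with columns x, y, t, p (p in {-1, +1}).
    f:      optional (N,) measurements; defaults to event counts (all ones).
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    c = (events[:, 3] > 0).astype(int)  # 0 = negative, 1 = positive
    if f is None:
        f = np.ones(len(events))
    # Continuous bin coordinate in [0, B - 1] for each event.
    span = max(t.max() - t.min(), 1e-9)
    tb = (t - t.min()) / span * (B - 1)
    est = np.zeros((2, B, H, W))
    for n in range(B):
        w = np.maximum(0.0, 1.0 - np.abs(tb - n))  # kernel weight for bin n
        np.add.at(est, (c, np.full_like(c, n), y, x), f * w)
    return est


events = np.array([
    [0, 0, 0.00,  1],
    [1, 0, 0.25, -1],
    [1, 1, 0.50,  1],
    [0, 1, 1.00,  1],
])
est = event_spike_tensor(events, H=2, W=2, B=3)
print(est.shape)
```

With count measurements, the triangular weights of each event sum to one across bins, so the total mass of the tensor equals the number of events.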

End-to-end Learned Representations

The measurement and kernel in (6) are generally hand-crafted functions. Previous works manually tuned those functions to maximize performance. In contrast, we propose to leverage the data directly to find the best function candidate, thus learning the representation end-to-end. We achieve this by replacing the kernel function in (6) with a multilayer perceptron (MLP) with two hidden layers of 30 units each. This MLP takes the coordinates and time stamp of an event as input, and produces an activation map around it. For each grid location in the representation we evaluate the activation maps produced by each event and sum them together according to (6). This operation is repeated for every point in the final grid, resulting in a grid-like representation. To enforce symmetry across events, we limit the input to the MLP to the differences in coordinates $x_l - x_k$, $y_m - y_k$, $t_n - t_k$. For the sake of simplicity, we do not learn the measurement function as well, choosing it instead from a set of fixed functions. To speed up inference, at test time the learnt kernel can be substituted with an efficient look-up table, thus having comparable computation cost to handcrafted kernels. These design choices make the representation both efficient and fully differentiable. In contrast to previous works, which used sub-optimal heuristics to convert events into grids, our framework can now tune the representation to the final task, thus maximizing performance.
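A rough sketch of the idea, under several simplifying assumptions: the MLP below takes only the temporal offset as input, uses a leaky ReLU activation, and has randomly initialised (untrained) weights standing in for parameters that would be learned from the task loss. None of these specifics come from the text beyond the two hidden layers of 30 units. The second half shows how such a kernel could be tabulated once and replaced by interpolation at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Untrained, randomly initialised weights (placeholders for learned parameters).
W1, b1 = rng.normal(size=(1, 30)), np.zeros(30)
W2, b2 = rng.normal(size=(30, 30)) / np.sqrt(30), np.zeros(30)
W3, b3 = rng.normal(size=(30, 1)) / np.sqrt(30), np.zeros(1)

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

def mlp_kernel(dt):
    """Kernel value for temporal offsets dt (scalar or array of any shape)."""
    h = np.asarray(dt, dtype=float).reshape(-1, 1)
    h = leaky_relu(h @ W1 + b1)  # hidden layer 1, 30 units
    h = leaky_relu(h @ W2 + b2)  # hidden layer 2, 30 units
    return (h @ W3 + b3).reshape(np.shape(dt))

# At test time, tabulate the kernel once and interpolate instead of running the MLP.
grid = np.linspace(-1.0, 1.0, 2001)
table = mlp_kernel(grid)

def lut_kernel(dt):
    """Look-up-table approximation of mlp_kernel on [-1, 1]."""
    return np.interp(dt, grid, table)

dts = np.array([-0.73, -0.1, 0.0, 0.42])
print(np.max(np.abs(lut_kernel(dts) - mlp_kernel(dts))))
```

Because the kernel depends only on offsets, the same table serves every event and every grid location, which is what keeps the cost comparable to a handcrafted kernel.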


From the generalized event spike tensor we can further instantiate novel and existing representations. Many works, for example, deal with three-dimensional tensors, such as [22, 4, 20, 16, 49]. The event spike tensor, being a four-dimensional data structure, thus acts as a precursor to these three-dimensional constructs, which can be obtained by summing over one of the four dimensions. For example, the two-channel image in [20, 22, 16] can be derived by contracting the temporal dimension, either through summation [20, 22, 16] or maximization [16]. The voxel grid representation described in [49] can be derived by summing across event polarities. All of these operations can be generalized via the projection operator $H_v$, where $H$ can be summation $\sum$ or maximization $\max$, and $v$, denoting the dimension, can be $x_l$, $y_m$, $t_n$, or the polarity $\pm$, yielding 16 possible projections. Here, we list only the representations that retain the spatial dimension, of which there are four, including the EST without projection:

$$S_\pm[x_l, y_m, t_n] \tag{7}$$

$$H_\pm\bigl(S_\pm[x_l, y_m, t_n]\bigr) \tag{8}$$

$$H_{t_n}\bigl(S_\pm[x_l, y_m, t_n]\bigr) \tag{9}$$

$$H_{t_n, \pm}\bigl(S_\pm[x_l, y_m, t_n]\bigr) \tag{10}$$

We refer to these representations as the EST (7), Voxel Grid (8), two-channel image (9), and event frame (10). The direction of projection has an impact on the information content of the resulting representation. For example, projecting along the temporal axis greatly compresses the event representation, but at the cost of temporal localization information. By contrast, projecting the event polarities leads to the cancellation of positive and negative events, potentially removing information in the process. Of these representations, the EST stands out, as it retains the maximum amount of event information by forgoing the projection operation.
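In array terms, each of the four representations is a reduction of a four-dimensional tensor over the polarity axis, the time axis, both, or neither. The sketch below uses a random stand-in tensor and assumes the axis order (polarity, time bin, row, column); both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 3, 4, 5
est = rng.random((2, B, H, W))  # stand-in EST; axes: polarity, time, y, x

voxel_grid = est.sum(axis=0)        # project out polarity -> (B, H, W)
two_channel = est.sum(axis=1)       # project out time     -> (2, H, W)
event_frame = est.sum(axis=(0, 1))  # project out both     -> (H, W)

# Maximization is an alternative to summation along the same axis.
two_channel_max = est.max(axis=1)

for name, arr in [("EST", est), ("voxel grid", voxel_grid),
                  ("two-channel image", two_channel), ("event frame", event_frame)]:
    print(name, arr.shape)
```

The shapes make the information trade-off visible: every projection removes one axis, and the event frame removes two.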

Figure 2: An overview of our proposed framework. Each event is associated with a measurement (green) which is convolved with a (possibly learnt) kernel. This convolved signal is then sampled on a regular grid. Finally, various representations can be instantiated by performing projections over the temporal axis or over polarities.

4 Empirical Evaluation

In this section, we present an extensive comparative evaluation of the representations identified by our taxonomy for object recognition (Sec. 4.1) and optical flow estimation (Sec. 4.2) on standard event camera benchmarks.

Candidate Representations

We start out by identifying 12 distinct representations based on the event spike tensor (6). In particular, we select the measurement function (4) from three candidates: event polarity, event count, and normalized time stamp. We use the summation operator to project out various axes defined in (7) - (10), resulting in four variations: event spike tensor, voxel grid, two-channel image, and event frame. We split the event spike tensor (a four-dimensional tensor) along the polarity dimension and concatenate the two tensors along the temporal dimension, effectively doubling the number of channels. This is done to make the representation compatible with two-dimensional convolutions. As a first step we apply a generic trilinear kernel to convolve the event spike signal, and later study the effect of different kernels on performance when applied to the EST. Finally, we report results for our end-to-end trained variant that directly utilizes raw events.
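The count of 12 candidates and the channel-stacking step can be checked with a few lines of NumPy. The tensor sizes and the ordering of the two polarities in the stacked tensor are assumptions made for illustration.

```python
import itertools
import numpy as np

# 3 measurement functions x 4 projections = 12 candidate representations.
measurements = ["polarity", "count", "time stamp"]
representations = ["EST", "voxel grid", "two-channel image", "event frame"]
candidates = list(itertools.product(measurements, representations))
print(len(candidates))

# Stack the two polarity slices of a (2, B, H, W) EST along the temporal axis
# so that a 2D CNN sees a (2B, H, W) input.
B, H, W = 9, 4, 5  # hypothetical sizes
est = np.zeros((2, B, H, W))
stacked = np.concatenate([est[0], est[1]], axis=0)
print(stacked.shape)
```

The first convolution layer of the downstream network then needs `2 * B` input channels for the EST, `B` for the voxel grid, 2 for the two-channel image, and 1 for the event frame.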

4.1 Object Recognition

Object recognition with conventional cameras remains challenging due to their low dynamic range, high latency and tendency to motion blur. In recent years, event-based classification has grown in popularity because it can address all these challenges.

In this section, we investigate the performance of the event representations proposed in Sec. 4 on the task of event-based object recognition. In particular, we aim to determine the relation between the information content of the representation and classification accuracy. We show that our end-to-end learned representation significantly outperforms the state-of-the-art in [22]. We use two publicly available datasets in our evaluation: N-Cars [22] (Neuromorphic-Cars) and N-Caltech101 [56]. N-Cars provides a benchmark for the binary task of car recognition in a scene. It contains event samples of ms length recorded by the ATIS event camera [57]. N-Caltech101 (Neuromorphic-Caltech101) is the event-based version of the popular Caltech101 dataset [58], and poses the task of multiclass recognition for event cameras. It contains samples and classes, which were recorded by placing an event camera on a motor and moving it in front of a screen projecting various samples from Caltech101.


We use a ResNet-34 architecture [31] for each dataset. The network is pretrained on color RGB images from ImageNet [59]. To account for the different number of input channels and output classes between the pre-trained model and ours, we adopt the technique proposed in [20]: we replace the first and last layer of the pre-trained model with random weights and then finetune all weights on the task. We train by optimizing the cross-entropy loss and use the ADAM optimizer [60] with an initial learning rate of , which we reduce by a factor of two every iterations. We use a batch-size of and for N-Caltech101 and N-Cars, respectively.

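| Representation | Measurement | Kernel | N-Cars | N-Caltech101 |
|---|---|---|---|---|
| Event Frame | polarity | trilinear | 0.866 | 0.587 |
| Two-Channel Image | polarity | trilinear | 0.830 | 0.711 |
| Voxel Grid | polarity | trilinear | 0.865 | 0.785 |
| EST (Ours) | polarity | trilinear | 0.868 | 0.789 |
| Event Frame | count | trilinear | 0.799 | 0.689 |
| Two-Channel Image | count | trilinear | 0.861 | 0.713 |
| Voxel Grid | count | trilinear | 0.827 | 0.756 |
| EST (Ours) | count | trilinear | 0.863 | 0.784 |
| Event Frame | time stamps | trilinear | 0.890 | 0.690 |
| Two-Channel Image | time stamps | trilinear | 0.917 | 0.731 |
| Voxel Grid | time stamps | trilinear | 0.847 | 0.754 |
| EST (Ours) | time stamps | trilinear | 0.917 | 0.787 |
| EST (Ours) | time stamps | alpha | 0.911 | 0.739 |
| EST (Ours) | time stamps | exponential | 0.909 | 0.782 |
| EST (Ours) | time stamps | learnt | 0.925 | 0.817 |

Table 2: Classification accuracy for all event representations using different measurement functions, as described in Sec. 4. For the best performing representation (EST and time stamp measurements) we additionally report results for different kernel choices, such as the trilinear [55], exponential [53], alpha kernels [18] as well as a learnable kernel.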

| Representation | Measurement | Kernel | N-Cars | N-Caltech101 |
|---|---|---|---|---|
| H-First [3] | - | - | 0.561 | 0.054 |
| HOTS [4] | - | - | 0.624 | 0.210 |
| Gabor-SNN [22] | - | - | 0.789 | 0.196 |
| HATS [22] | - | - | 0.902 | 0.642 |
| HATS + ResNet-34 | - | - | 0.909 | 0.691 |
| Two-Channel Image [20] | count | trilinear | 0.861 | 0.713 |
| Voxel Grid [49] | polarity | trilinear | 0.865 | 0.785 |
| EST (Ours) | time stamps | trilinear | 0.917 | 0.787 |
| EST (Ours) | time stamps | learnt | 0.925 | 0.817 |

Table 3: Comparison of the classification accuracy for different baseline representations [20, 49] and state-of-the-art classification methods [22, 4, 3]. As an additional baseline we pair the best performing representation from previous work (HATS [22]) with a more powerful classification model (ResNet-34, used in this work) as original numbers were reported using a linear SVM.

The classification results are shown in Table 2. From the representations that we evaluated, the event spike tensor with time stamp measurements has the highest accuracy on the test set for both N-Cars and N-Caltech101. From these results we can make two conclusions. First, we observe that representations that separate polarity consistently outperform those that sum over polarities. Indeed, this trend is observed for all measurement functions: discarding the polarity information leads to a decrease in accuracy of up to %. Second, we see that representations that retain the temporal localization of events, i.e., the Voxel Grid and EST, consistently outperform their counterparts, which sum over the temporal dimension. These observations indicate that both polarity and temporal information are important for object classification. This trend explains why the EST leads to the most accurate predictions: it retains the maximum amount of information with respect to the raw event data.

Interestingly, using event time stamps as measurements is more beneficial than other measurements, since the information about polarity and event count is already encoded in the event spike tensor. Indeed, using the time stamps explicitly in the tensor partially recovers the high temporal resolution, which was lost during convolution and discretization steps of the event field. We thus established that the EST with time stamp measurements performs best for object classification. However, the effect of the temporal kernel remains to be explored. For this purpose we experimented with the kernels described in Sec. 3.3, namely the exponential [53], alpha [18], and trilinear [55] kernels. In addition, we evaluate our end-to-end trainable representation and report the results in Table 2. We see that using different handcrafted kernels has a negative impact on the test scores. In fact, applying these kernels to the event spikes decreases the effective temporal localization compared to the trilinear kernel by overlapping the event signals in the representation. This makes it difficult for a network to learn an efficient way to identify individual events. Finally, we see that if we learn a kernel end-to-end we gain a significant boost in performance. This is justified by the fact that the learnable layer finds an optimal way to draw the events on a grid, maximizing the discriminativeness of the representation.

Comparison with State-of-the-Art

We next compare our results with state-of-the-art object classification methods that utilize handcrafted event representations, such as HATS [22], HOTS [4], as well as a baseline implementation of an SNN [22]. For the best performing representation (HATS) we additionally report the classification scores obtained with the same ResNet-34 used to evaluate the EST; the original work used a linear SVM. Two additional baselines are used for comparison: (i) the histogram of events [20] (here two-channel image), with event count measurements, and (ii) the Voxel Grid described in [49] with polarity measurements.

The results for these methods are summarized in Table 3. We see that our method outperforms the state-of-the-art method (HATS) and variant (HATS + ResNet-34), as well as the Voxel Grid and Two-Channel Image baselines by 2.3%, 1.6%, 6% and 6.5% on N-Cars and 17.5%, 12.6%, 3.2% and 10.4% on N-Caltech101, respectively. In particular, we see that our representation is more suited for object classification than existing handcrafted features, such as HATS and HOTS, even if we combine these features with more complex classification models. This is likely due to HATS discarding temporal information, which, as we established, plays an important role in object classification. It is important to note that, compared to the state-of-the-art, our method does not operate asynchronously or at low power on current hardware (as, for example, SNNs do); however, we show in Sec. 4.3 that our method can still operate at a very high framerate that is sufficient for many high-speed applications.

4.2 Optical Flow Estimation

As in object recognition, optical flow estimation using frame-based methods remains challenging in high-dynamic-range scenarios, e.g., at night, and during high-speed motions. In particular, motion blur and over/under-saturation of the sensor often violate brightness constancy in the image, a fundamental assumption underlying many approaches, which leads to estimation errors. Because they have a high dynamic range and do not suffer from motion blur, event cameras have the potential to provide higher accuracy estimates in these conditions. Early works on event-based optical flow estimation fit planes to the spatiotemporal manifold generated by the events [15]. Other works have tackled this task by finding the optimal event alignments when projected onto a frame [61, 62]. Most recently, the relatively large-scale Multi Vehicle Stereo Event Camera Dataset (MVSEC) [50] made deep learning-based optical flow possible [49, 16]. It provides data from a stereo DAVIS rig combined with a LIDAR for ground-truth optical flow estimation [16]. The dataset features several driving sequences during the day and night, and indoor sequences recorded onboard a quadcopter. The methods in [16, 49] learn flow in a self-supervised manner and use standard U-Net architectures [63], outperforming existing frame-based methods in challenging night-time scenarios. In [16], a four-channel image representation is used as input to the network. This image comprises the two-channel event count image used in [20] and the two-channel surface of active events (SAE) [15], divided according to event polarities. While the event counts and time surfaces combine the temporal and spatial information of the event stream, this representation still compresses the event signal by discarding all event time stamps except the most recent ones.

To date, it is unclear which event representation is optimal to learn optical flow. We investigate this question by comparing the representations listed in Sec. 4 against the state-of-the-art [16] for the task of optical flow prediction, evaluated on the MVSEC dataset.


We train an optical flow predictor on the outdoor sequences outdoor_day1 and outdoor_day2. These sequences are split into about samples at fixed time intervals. Each sample consists of events aggregated between two DAVIS frames, which are captured at Hz. We use EV-FlowNet [16] as the base network, with the channel dimension of the initial convolution layer set to the same number of channels of each input representation. The network is trained from scratch using a supervised loss derived from ground truth motion field estimates:


where denotes the robust Charbonnier loss [64], . For our experiments, we chose and . This loss is minimized using the ADAM optimizer [60] with an initial learning rate of and reducing it by a factor of two after iterations and then again every iterations with a batch size of eight.
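As an illustration of the loss described above, the following is a minimal sketch of a Charbonnier-penalized supervised flow loss. It assumes the generalized Charbonnier form rho(x) = (x^2 + eps^2)^alpha; the default values of `eps` and `alpha`, the averaging over pixels, and the function names are placeholders chosen for this example, not the settings used in the experiments.

```python
import numpy as np


def charbonnier(x, eps=1e-3, alpha=0.5):
    """Generalized Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha.

    eps and alpha are placeholder values for illustration only.
    """
    x = np.asarray(x, dtype=np.float64)
    return (x * x + eps * eps) ** alpha


def supervised_flow_loss(flow_pred, flow_gt, eps=1e-3, alpha=0.5):
    """Mean Charbonnier penalty over both flow components and all pixels.

    flow_pred, flow_gt: arrays of shape (2, H, W) holding the horizontal
    and vertical flow. Averaging (rather than summing) is an assumption.
    """
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    return float(charbonnier(diff, eps, alpha).mean())
```

With alpha = 0.5 the penalty behaves like a smooth absolute value: close to |x| for large residuals and differentiable at zero, which is why it is a common robust choice for flow regression.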


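| Representation | Measurement | Kernel | indoor_flying1 AEE | indoor_flying1 % Outlier | indoor_flying2 AEE | indoor_flying2 % Outlier | indoor_flying3 AEE | indoor_flying3 % Outlier |
|---|---|---|---|---|---|---|---|---|
| Two-Channel Image [20] | count | trilinear | 1.21 | 4.49 | 2.03 | 22.8 | 1.84 | 17.7 |
| EV-FlowNet [16] | - | - | 1.03 | 2.20 | 1.72 | 15.1 | 1.53 | 11.9 |
| Voxel Grid [49] | polarity | trilinear | 0.96 | 1.47 | 1.65 | 14.6 | 1.45 | 11.4 |
| Event Frame | time stamps | trilinear | 1.17 | 2.44 | 1.93 | 18.9 | 1.74 | 15.5 |
| Two-Channel Image | time stamps | trilinear | 1.17 | 1.50 | 1.97 | 14.9 | 1.78 | 11.7 |
| Voxel Grid | time stamps | trilinear | 0.98 | 1.20 | 1.70 | 14.3 | 1.50 | 12.0 |
| EST (Ours) | time stamps | trilinear | 1.00 | 1.35 | 1.71 | 11.4 | 1.51 | 8.29 |
| EST (Ours) | time stamps | alpha | 1.03 | 1.34 | 1.52 | 11.7 | 1.41 | 8.32 |
| EST (Ours) | time stamps | exponential | 0.96 | 1.27 | 1.58 | 10.5 | 1.40 | 9.44 |
| EST (Ours) | time stamps | learnt | 0.97 | 0.91 | 1.38 | 8.20 | 1.43 | 6.47 |

Table 4: Average end-point error (AEE) and % of outliers evaluation on the MVSEC dataset for different variations of the EST with time stamp measurements. Various baselines [20, 49] and state-of-the-art methods [16] are compared.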

As in [16], we measure the performance of our networks by comparing the average end-point error (AEE) on the indoor_flying datasets, which are visually quite distinct from the training set. The test error on these datasets thus reflects the generalizability of our network, as well as its overall performance. In addition, as events only provide sparse information in the frame, we only report the error computed at pixels where at least one event was triggered, as done in [16]. Following the KITTI 2015 benchmark [65], we report the percentage of pixels which have an end-point error larger than three pixels and 5% of the ground-truth flow, as also done in [16]. In the previous classification experiments we observed that time stamp measurements are essential for a discriminative representation. We thus focus on results obtained from representations using the time stamp as the measurement function, as well as different kernels. Table 4 summarizes the results obtained from this experiment. An exhaustive evaluation of the various measurement functions, i.e., polarities and counts, as well as qualitative results, is available in the supplemental material.
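The two metrics described above can be sketched as follows. This is a minimal illustration, assuming dense flow arrays of shape (2, H, W) and a boolean mask marking pixels with at least one event; the function and argument names are hypothetical and do not come from any released evaluation code.

```python
import numpy as np


def flow_metrics(flow_pred, flow_gt, event_mask, abs_thresh=3.0, rel_thresh=0.05):
    """Return (AEE, outlier percentage) over pixels with at least one event.

    flow_pred, flow_gt: (2, H, W) arrays; event_mask: (H, W) boolean array.
    A pixel is an outlier if its end-point error exceeds both abs_thresh
    pixels and rel_thresh times the ground-truth flow magnitude.
    """
    flow_pred = np.asarray(flow_pred, dtype=np.float64)
    flow_gt = np.asarray(flow_gt, dtype=np.float64)
    mask = np.asarray(event_mask, dtype=bool)
    if not mask.any():
        return float("nan"), float("nan")

    epe = np.linalg.norm(flow_pred - flow_gt, axis=0)  # per-pixel end-point error
    gt_mag = np.linalg.norm(flow_gt, axis=0)           # ground-truth flow magnitude
    epe, gt_mag = epe[mask], gt_mag[mask]

    outlier = (epe > abs_thresh) & (epe > rel_thresh * gt_mag)
    return float(epe.mean()), float(100.0 * outlier.mean())
```

Restricting both metrics to the event mask matters because pixels without events carry no measurement, so including them would reward or penalize the network for regions it cannot observe.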

From Table 4 we see that Voxel Grid and EST have similar AEE and outlier ratios. This indicates that optical flow estimation is not as sensitive to event polarity as observed for classification. This is further supported by the small performance gap between the two-channel image and event frame. A more striking difference appears when we compare representations which retain the temporal dimension (middle rows) with those that sum over it. Indeed, the accuracies of the Two-Channel Image and the Event Frame drop approximately % when compared to the EST and Voxel Grid. As with the classification evaluation, we establish that EST is among the most competitive representations and further explore the influence of different kernels on the performance. These are summarized in the bottom set of rows of Table 4. We see that the exponential and alpha kernels outperform the trilinear kernel. This indicates that there is a strong dependency on the kernel shape, and we thus proceed with the fully end-to-end learnable version. As with classification, we observe that the learnable kernel significantly improves the accuracy on almost all scenes. The most significant improvements are achieved for outlier ratios, indicating that using learnable kernels improves the robustness of the system.

Comparison with State-of-the-Art

We compare our method with the state-of-the-art [16], as well as other baselines based on the representations used in [20] and [49]. Table 4 presents a detailed comparison. It is clear that the EST outperforms the state-of-the-art by a large margin ()%. There are also significant improvements in terms of outlier ratio, reducing the outliers by an average of %, which again indicates the robustness of our method. This performance difference is likely due to the sub-optimality of the existing representations used in the state-of-the-art for optical flow prediction. By operating on the raw event data, the learnable EST can instead maximize performance.

4.3 Computational Time and Latency

One of the key advantages of event cameras is their low latency and high update rate. To achieve high-frequency predictions, previous works developed lightweight and fast algorithms to process each incoming event asynchronously. In contrast, other approaches aggregate events into packets and then process them simultaneously. While this sacrifices latency, it also leads to overall better accuracy, due to an increase in the signal-to-noise ratio. Indeed, in several pattern recognition applications, e.g., object recognition and optical flow prediction, asynchronous processing is not essential: we may actually sacrifice it for improved accuracy. We compare these two modes of operation in Table 5, where we show the number of events that can be processed per second, as well as the total time used to process a single sample of ms from the N-Cars dataset. It can be seen that if we allow for batch computation, our method using a learnt kernel and lookup table can run at a very high speed that is comparable to other methods. For applications where asynchronous updates or low power consumption have higher priority than accuracy, other methods, e.g., SNNs, hold an advantage with respect to our approach.

We further report the computation time per inference for different architectures in Table 6. To evaluate the timings we split the computation into two stages: representation computation and inference. While representation computation is performed on a CPU (Intel i7 CPU, 64-bit, 2.7 GHz and 16 GB of RAM), inference is performed on a GPU (GeForce RTX 2080 Ti). Table 6 shows that the computation of the representation contributes only a very small component of the overall computation time, while most of the time is spent during inference. Nonetheless, we see that a full forward pass only takes on the order of ms, which translates to a maximum inference rate of Hz. Although not on the order of the event rate, this value is high enough for most high-speed applications, such as mobile robotics or autonomous vehicle navigation. Moreover, we see that we can reduce the inference time significantly if we use smaller models, achieving 235 Hz for a ResNet-18 (Table 6). Shallower models could potentially be run at minimal loss in accuracy by leveraging distillation techniques [66].

| Method | Asynchronous | Time [ms] | Speed [kEv/s] |
|---|---|---|---|
| Gabor SNN [22] | Yes | 285.95 | 14.15 |
| HOTS [4] | Yes | 157.57 | 25.68 |
| HATS [22] | Yes | 7.28 | 555.74 |
| EST (Ours) | No | 6.26 | 632.9 |

Table 5: Computation time for ms of event data and number of events processed per second.

| Model | Inference [ms] | Representation [ms] | Total [ms] | Rate [Hz] |
|---|---|---|---|---|
| ResNet-18 | 3.87 | 0.38 | 4.25 | 235 |
| ResNet-34 | 6.47 | 0.38 | 6.85 | 146 |
| ResNet-50 | 9.14 | 0.38 | 9.52 | 105 |
| EV-FlowNet | 5.70 | 0.38 | 6.08 | 164 |

Table 6: Inference time split into EST generation (0.38 ms) and forward pass for several standard network architectures. ResNet-34 [31] and EV-FlowNet [16] allow processing at about 146 Hz and 164 Hz, respectively, which is sufficient for high-speed applications.
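The rates in Table 6 follow directly from the per-sample timings: the maximum rate is the reciprocal of the total time (representation plus inference). The short sketch below reproduces that arithmetic from the table values; the function name and dictionary are introduced here purely for illustration.

```python
def inference_rate_hz(inference_ms, representation_ms):
    """Maximum number of forward passes per second for the given timings."""
    total_ms = inference_ms + representation_ms
    return 1000.0 / total_ms


# (inference [ms], representation [ms]) as listed in Table 6.
timings_ms = {
    "ResNet-18": (3.87, 0.38),
    "ResNet-34": (6.47, 0.38),
    "ResNet-50": (9.14, 0.38),
    "EV-FlowNet": (5.70, 0.38),
}

for name, (inference_ms, representation_ms) in timings_ms.items():
    rate = inference_rate_hz(inference_ms, representation_ms)
    print(f"{name}: {rate:.0f} Hz")
```

Because the representation step costs the same 0.38 ms for every model, the achievable rate is governed almost entirely by the choice of network.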

5 Conclusions

This paper presented a general framework for converting asynchronous event data into grid-based representations. By representing the conversion process through differentiable operations, our framework allows learning input representations in a data-driven fashion. In addition, our framework lays out a taxonomy which unifies a large number of extant event representations and identifies new ones. Through an extensive evaluation we show that learning a representation end-to-end together with the task yields an increase of about 12% in performance over state-of-the-art methods, for the tasks of object recognition and optical flow estimation. With this contribution, we combined the benefits of deep learning with event cameras, thus unlocking their outstanding properties to a wider community. As an interesting direction for future work, we plan to allow asynchronous updates in our framework by deploying recurrent architectures, similar to [23]: this will bridge the gap between synchronous and asynchronous approaches for event-based processing.


This project was funded by the Swiss National Center of Competence in Research (NCCR) Robotics, through the Swiss National Science Foundation, and the SNSF-ERC starting grant. K.G.D. is supported by a Canadian NSERC Discovery grant. K.G.D. contributed to this work in his personal capacity as an Associate Professor at Ryerson University.

6 Appendix

We encourage the reader to watch the supplementary video for an introduction to the event camera and qualitative results of our approach. In this section, we provide additional details about the network architecture used for our experiments, as well as supplementary results for object recognition and optical flow prediction.

6.1 Network Architecture

For all our classification experiments, we used an off-the-shelf ResNet-34 [31] architecture for inference with weights pretrained on RGB image-based ImageNet [59]. We then substitute the first and last layer of the pre-trained network with new weights (randomly initialized) to accommodate the difference in input channels (from the difference in representation) and output channels (for the difference in task).

For the optical flow experiments, we use the off-the-shelf U-Net architecture [63] for inference, adapting its input layer to the number of channels of each representation. All networks are then trained from scratch.

Learned Kernel Functions

As discussed in Sec. 3.3, we used a two-layer multi-layer perceptron (MLP) to learn the kernel function to convolve the event measurement field, defined in (4). The two hidden layers both have nodes, with Leaky ReLU as activation function () to encourage better gradient flow. To give all image locations the same importance, we designed the kernel to be translation invariant. Thus, for an event occurring at time the MLP has a one-dimensional input and a single output with normalized time and denoting the time window of the events. The contribution of a single event to the sum in (3.3) is computed for every grid position for where is the number of temporal discretization bins. The weights of the MLP were initialized with the trilinear voting kernel [55], since this proved to facilitate convergence in our experiments. Figure 3 shows the learned kernels as a function of time. The learned kernels show some interesting behavior when compared against the trilinear voting kernel on which they were initialized. For classification (Fig. 3, left), the kernel seems to increase the influence of events toward the past, in a causal fashion: indeed, enough evidence has to be accumulated to produce a classification label. In contrast, for optical flow prediction (Fig. 3, right), the learned kernel increases in magnitude, but not significantly in the time range, with respect to the trilinear kernel. This is probably because optical flow is a more ‘local’ task than classification, and therefore less temporal information is required.

Figure 3: Kernel function learned for classification in the N-Cars dataset (left) and for optical flow prediction (right).
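To make the structure of such a learned kernel concrete, the sketch below evaluates a small MLP with a one-dimensional input (a normalized time difference) and a single output, and applies it to every temporal bin for one event. It is only an illustration under stated assumptions: the hidden width, the Leaky ReLU slope, the random initialization (the text above initializes from the trilinear kernel instead), the bin placement at n / (B - 1), and all function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the slope is a placeholder value."""
    return np.where(x > 0, x, slope * x)


def init_mlp(hidden=16, seed=0):
    """Randomly initialized weights: two hidden layers, scalar input and output."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.5, (1, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.5, (hidden, hidden)), "b2": np.zeros(hidden),
        "W3": rng.normal(0.0, 0.5, (hidden, 1)), "b3": np.zeros(1),
    }


def mlp_kernel(dt, params, slope=0.1):
    """Evaluate the kernel at normalized time differences dt (any shape)."""
    dt = np.asarray(dt, dtype=np.float64)
    h = dt.reshape(-1, 1)
    h = leaky_relu(h @ params["W1"] + params["b1"], slope)
    h = leaky_relu(h @ params["W2"] + params["b2"], slope)
    out = h @ params["W3"] + params["b3"]
    return out.reshape(dt.shape)


def kernel_responses(t_event, num_bins, params):
    """Contribution weight of one event (normalized time in [0, 1]) to each bin.

    Requires num_bins >= 2; bins are assumed to sit at n / (num_bins - 1).
    """
    bin_times = np.arange(num_bins) / (num_bins - 1)
    return mlp_kernel(t_event - bin_times, params)
```

Since the kernel only sees the time difference between an event and a bin, and never the pixel location, the same function is shared across the whole image, which is what makes it translation invariant.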

6.2 Ablation Studies and Qualitative Results

6.2.1 Classification

For the classification task, we investigated the relation between the number of temporal discretization bins, , i.e., channels, of the event spike tensor (EST) and the network performance. We quantitatively evaluated this effect on the N-Cars [22] and N-Caltech101 [56] datasets. More specifically, we trained four networks, each using the learned EST with and timestamp measurements, since this representation achieved the highest classification scores. The final input representations have , , , and channels since we stack the polarity dimension along the temporal dimension.

| Channels | N-Cars | N-Caltech101 |
|---|---|---|
| 2 | 0.908 | 0.792 |
| 4 | 0.912 | 0.816 |
| 9 | 0.925 | 0.817 |
| 16 | 0.923 | 0.837 |

Table 7: Classification accuracy on N-Cars [22] and N-Caltech101 [56] for input representations based on the event spike tensor (EST). Four variations of the EST were tested, varying the number of temporal bins between and . The best representations are highlighted in bold.

The results for this experiment are summarized in Table 7, and example classifications for the N-Cars and N-Caltech101 datasets are provided in Figs. 4 and 5.

For both datasets, we observe a very similar trend in the dependency of classification accuracy on temporal discretization: performance appears to increase with finer discretization, i.e., with a larger number of channels. However, for the N-Cars dataset performance plateaus after channels, while for the N-Caltech101 dataset performance continues to increase with a larger number of channels. This difference can be explained by the different qualities of the datasets. While the N-Cars dataset features samples taken from an outdoor environment (Fig. 4), the N-Caltech101 samples were taken in controlled, constant lighting conditions and with consistent camera motion. This leads to higher quality samples in the N-Caltech101 dataset (Fig. 5), while the samples in N-Cars are frequently corrupted by noise (Fig. 4 (a-d)). In low noise conditions (Fig. 4 (a)) classification accuracy is very high (). However, as the signal decreases due to the lack of motion (Fig. 4 (b-d)), the classification accuracy decreases rapidly. Increasing the number of temporal bins further dilutes the signal present in the event stream, resulting in noisy channels (Fig. 4 (c)), which impacts performance negatively. In addition, more input channels result in higher memory and computational costs for the network. Therefore, to trade off performance against computational cost, we use in all our classification experiments.

6.2.2 Optical Flow

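| Representation | Measurement | Kernel | indoor_flying1 AEE | indoor_flying1 % Outlier | indoor_flying2 AEE | indoor_flying2 % Outlier | indoor_flying3 AEE | indoor_flying3 % Outlier |
|---|---|---|---|---|---|---|---|---|
| Event Frame | polarity | trilinear | 1.21 | 4.19 | 2.04 | 20.6 | 1.83 | 16.6 |
| Two-Channel Image | polarity | trilinear | 1.31 | 4.75 | 2.05 | 23.2 | 1.83 | 11.4 |
| Voxel Grid | polarity | trilinear | 0.96 | 1.47 | 1.65 | 14.6 | 1.45 | 11.4 |
| EST (Ours) | polarity | trilinear | 1.01 | 1.59 | 1.79 | 16.7 | 1.57 | 13.8 |
| Event Frame | count | trilinear | 1.25 | 3.91 | 2.11 | 22.9 | 1.85 | 17.1 |
| Two-Channel Image | count | trilinear | 1.21 | 4.49 | 2.03 | 22.8 | 1.84 | 17.7 |
| Voxel Grid | count | trilinear | 0.97 | 1.33 | 1.66 | 14.7 | 1.46 | 12.1 |
| EST (Ours) | count | trilinear | 1.03 | 2.00 | 1.78 | 16.5 | 1.56 | 13.2 |
| Event Frame | time stamps | trilinear | 1.17 | 2.44 | 1.93 | 18.9 | 1.74 | 15.6 |
| Two-Channel Image | time stamps | trilinear | 1.17 | 1.50 | 1.97 | 14.9 | 1.78 | 11.7 |
| Voxel Grid | time stamps | trilinear | 0.98 | 1.20 | 1.70 | 14.3 | 1.50 | 12.0 |
| EST (Ours) | time stamps | trilinear | 1.00 | 1.35 | 1.71 | 11.4 | 1.51 | 8.29 |
| EST (Ours) | time stamps | alpha | 1.03 | 1.34 | 1.52 | 11.7 | 1.41 | 8.32 |
| EST (Ours) | time stamps | exponential | 0.96 | 1.27 | 1.58 | 10.5 | 1.40 | 9.44 |
| EST (Ours) | time stamps | learnt | 0.97 | 0.91 | 1.38 | 8.20 | 1.43 | 6.47 |

Table 8: Average end-point error (AEE) and % of outliers evaluation on the MVSEC datasets. Ablation of different measurement functions for the event spike tensor. The best candidates are highlighted in bold.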

| Channels | indoor_flying1 AEE | indoor_flying1 % Outlier | indoor_flying2 AEE | indoor_flying2 % Outlier | indoor_flying3 AEE | indoor_flying3 % Outlier |
|---|---|---|---|---|---|---|
| 2 | 0.97 | 0.98 | 1.45 | 8.86 | 1.37 | 6.66 |
| 4 | 0.96 | 1.13 | 1.42 | 8.86 | 1.35 | 5.98 |
| 9 | 0.97 | 0.91 | 1.38 | 8.20 | 1.43 | 6.47 |
| 16 | 0.95 | 1.56 | 1.39 | 8.58 | 1.34 | 6.82 |

Table 9: Average end-point error (AEE) and % of outliers for optical flow predictions on the MVSEC dataset [16]. Four event representations based on the voxel grid were tested with 4, 9, 16 and 32 temporal bins. The best representation is highlighted in bold.

In this section, we ablate two features of the representations used for optical flow prediction: (i) the measurement function (defined in (5)), and (ii) the number of temporal discretization bins, . We use the Multi Vehicle Stereo Event Camera (MVSEC) dataset [16] for quantitative evaluation.

Table 8 shows the performance of our candidate measurement functions, i.e., polarity, event count, and event timestamp, for the generation of the representations (see (5)). While it would be possible to learn the measurement function together with the kernel, in our experiments we have considered this function to be fixed. This heuristic proved to speed up the convergence of our models, while decreasing the computational costs at training and inference time.

In Table 8 it can be observed that the event timestamp yields the highest accuracy among the measurement functions. This is intuitive: while polarity and event count information is contained in the EST, the timestamp information is partially lost due to discretization. Adding it back through the measurements minimizes the information the EST loses with respect to the original event point set, thereby maximizing the performance of end-to-end learning.
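The role of the measurement function can be illustrated with a small sketch that builds a polarity-separated temporal grid from a list of events, using a fixed trilinear kernel along time. This is a simplified stand-in, not the full EST pipeline: it assumes events stored as rows (x, y, t, p) with p in {-1, +1}, integer pixel coordinates, timestamps normalized to [0, 1] over the sample, and no spatial interpolation. The function name and the `measurement` argument are hypothetical.

```python
import numpy as np


def event_grid(events, height, width, num_bins, measurement="timestamp"):
    """Build a (2 * num_bins, height, width) grid from events.

    events: (N, 4) array with columns x, y, t, p (p in {-1, +1}).
    The first num_bins channels hold positive events, the rest negative ones.
    """
    events = np.asarray(events, dtype=np.float64)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]

    # Normalize timestamps to [0, 1] over the sample.
    span = t.max() - t.min()
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)

    # Measurement function: the value each event deposits into the grid.
    if measurement == "timestamp":
        f = t_norm
    elif measurement == "count":
        f = np.ones_like(t_norm)
    elif measurement == "polarity":
        f = p.copy()
    else:
        raise ValueError(f"unknown measurement: {measurement}")

    grid = np.zeros((2, num_bins, height, width))
    pol = (p < 0).astype(int)          # 0: positive events, 1: negative events
    scale = max(num_bins - 1, 1)
    for b in range(num_bins):
        # Trilinear (triangular) kernel centred on bin b along the time axis.
        w = np.maximum(0.0, 1.0 - np.abs(t_norm - b / scale) * scale)
        np.add.at(grid, (pol, b, y, x), f * w)

    # Stack the polarity dimension along the temporal dimension.
    return grid.reshape(2 * num_bins, height, width)
```

With the count or polarity measurement, two events that fall into the same bin at the same pixel are indistinguishable regardless of when they occurred; with the timestamp measurement, the deposited value itself encodes when each event happened, which is the information the discretization would otherwise discard.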

To understand the role that the number of temporal bins plays, we choose the best event representation for this task, the EST with timestamp measurements, and vary the number of temporal bins from . The average endpoint errors and outlier ratios are reported in Table 9.

As with the classification task (Sec. 6.2.1), we observe a trade-off between using too few channels and too many. Since MVSEC records natural outdoor scenes, event measurements are corrupted by significant noise. As we increase the number of channels, the signal-to-noise ratio in the individual channels drops, leading to less accurate optical flow estimates. In contrast, decreasing the number of channels also has adverse effects, as this removes valuable information from the event stream due to temporal aliasing effects. Therefore, a compromise must be made between high and low channel numbers. In the experiments reported in the paper we chose a channel number of nine, as this presents a good compromise.

In conclusion, we encourage the reader to watch the supplementary video to see the qualitative results of our method on optical flow prediction. We have observed that, regardless of the application environment and illumination conditions, our method generates predictions which are not only accurate, but also temporally consistent without any postprocessing.

Figure 4: Visualization of input representations derived from samples from the N-Cars dataset [22]. All four panels have the correct label Car: (a) good example, 99% Car score; (b) borderline example, 46% Car score; (c) bad example, 5% Car score; (d) improvement, 23% Car score. (a and b) show the event spike tensor (EST) representation with time measurements, which achieved the highest classification score on N-Cars, while (d) shows the two-channel representation of sample (c) for comparison. The EST consists of 18 channels, where the first nine are filled with events of positive polarity and the last nine are filled with events of negative polarity. The images show the nine temporal bins of the tensor with positive events in red and negative events in green. In good conditions (a) the classifier has high confidence in the car prediction. However, when there are fewer events due to the lack of motion (b and c), the uncertainty rises, leading to predictions close to random (50%). In (b) the classifier sees the headlights of the car (red dots) but may still be unsure. In (c) the classifier sees only noise due to the high temporal resolution, likely attributing the presence of noise to no motion. When we aggregate the noise (d) into the two-channel image we see a more distinct pattern emerge, leading to higher classification confidence.

Figure 5: Visualization of the event spike tensor (EST) representations derived from samples from the N-Caltech101 dataset [56]. Correct labels: (a) Butterfly, (b) Umbrella, (c) Strawberry. The EST consists of 18 channels, where the first nine are filled with events of positive polarity and the last nine are filled with events of negative polarity. The figures show the nine temporal bins of the tensor with positive events in red and negative events in green. We see that, compared to N-Cars [22], the event stream of this dataset is much cleaner and contains much less noise. This is because the dataset was recorded in a controlled environment, by positioning an event camera toward an image projected on a screen. (a) and (b) correspond to correct predictions and (c) to an incorrect one.


  • [1] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
  • [2] E. Mueggler, B. Huber, and D. Scaramuzza, “Event-based, 6-DOF pose tracking for high-speed maneuvers,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2014, pp. 2761–2768.
  • [3] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman, “HFirst: A temporal approach to object recognition,” IEEE Trans. Pattern Anal. Machine Intell., vol. 37, no. 10, pp. 2028–2040, 2015.
  • [4] X. Lagorce, G. Orchard, F. Gallupi, B. E. Shi, and R. Benosman, “HOTS: A hierarchy of event-based time-surfaces for pattern recognition,” IEEE Trans. Pattern Anal. Machine Intell., vol. 39, no. 7, pp. 1346–1359, Jul. 2017.
  • [5] G. Gallego, J. E. A. Lund, E. Mueggler, H. Rebecq, T. Delbruck, and D. Scaramuzza, “Event-based, 6-DOF camera tracking from photometric depth maps,” IEEE Trans. Pattern Anal. Machine Intell., vol. 40, no. 10, pp. 2402–2412, Oct. 2018.
  • [6] A. Andreopoulos, H. J. Kashyap, T. K. Nayak, A. Amir, and M. D. Flickner, “A low power, high throughput, fully event-based stereo system,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 7532–7542.
  • [7] H. Kim, S. Leutenegger, and A. J. Davison, “Real-time 3D reconstruction and 6-DoF tracking with an event camera,” in Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 349–364.
  • [8] A. Z. Zhu, N. Atanasov, and K. Daniilidis, “Event-based visual inertial odometry,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2017, pp. 5816–5824.
  • [9] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza, “EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time,” Int. J. Comput. Vis., pp. 1–21, Nov. 2017.
  • [10] A. Z. Zhu, Y. Chen, and K. Daniilidis, “Realtime time synchronized event-based stereo,” in Eur. Conf. Comput. Vis. (ECCV), 2018.
  • [11] A. Rosinol Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate SLAM? combining events, images, and IMU for robust visual SLAM in HDR and high speed scenarios,” IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 994–1001, Apr. 2018.
  • [12] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Asynchronous, photometric feature tracking using events and frames,” in Eur. Conf. Comput. Vis. (ECCV), 2018.
  • [13] Z. Ni, C. Pacoret, R. Benosman, S.-H. Ieng, and S. Régnier, “Asynchronous event-based high speed vision for microparticle tracking,” J. Microscopy, vol. 245, no. 3, pp. 236–244, 2012.
  • [14] H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza, “EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real-time,” IEEE Robot. Autom. Lett., vol. 2, pp. 593–600, 2017.
  • [15] R. Benosman, C. Clercq, X. Lagorce, S.-H. Ieng, and C. Bartolozzi, “Event-based visual flow,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 407–417, 2014.
  • [16] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “EV-FlowNet: Self-supervised optical flow estimation for event-based cameras,” in Robotics: Science and Systems (RSS), 2018.
  • [17] J. A. Perez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen, and B. Linares-Barranco, “Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing–application to feedforward ConvNets,” IEEE Trans. Pattern Anal. Machine Intell., vol. 35, no. 11, pp. 2706–2719, Nov. 2013.
  • [18] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Front. Neurosci., vol. 10, p. 508, 2016.
  • [19] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. D. Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, J. Kusnitz, M. Debole, S. Esser, T. Delbruck, M. Flickner, and D. Modha, “A low power, fully event-based gesture recognition system,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jul. 2017, pp. 7388–7397.
  • [20] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 5419–5427.
  • [21] D. Huh and T. J. Sejnowski, “Gradient descent for spiking neural networks,” in Conf. Neural Inf. Process. Syst. (NIPS), 2018, pp. 1440–1450.
  • [22] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, “HATS: Histograms of averaged time surfaces for robust event-based object classification,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 1731–1740.
  • [23] D. Neil, M. Pfeiffer, and S.-C. Liu, “Phased lstm: Accelerating recurrent network training for long or event-based sequences,” in Conf. Neural Inf. Process. Syst. (NIPS), 2016, pp. 3882–3890.
  • [24] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 265–283.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.
  • [26] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
  • [27] J. Sivic and A. Zisserman, “Efficient visual search of videos cast as text retrieval,” IEEE Trans. Pattern Anal. Machine Intell., vol. 31, no. 4, pp. 591–606, Apr. 2009.
  • [28] S. Leutenegger, M. Chli, and R. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2548–2555.
  • [29] P. Viola and M. Jones, “Robust real-time face detection,” in Int. Conf. Comput. Vis. (ICCV), IEEE Comput. Soc.
  • [30] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Anal. Machine Intell., vol. 34, no. 4, pp. 743–761, Apr. 2012.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2016, pp. 770–778.
  • [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2016.
  • [33] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2016, pp. 5297–5307.
  • [34] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jul. 2017.
  • [35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2009.
  • [36] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
  • [37] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in Int. Conf. Comput. Vis. (ICCV), 2015, pp. 2758–2766.
  • [38] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in International Conference on Machine Learning (ICML).    Omnipress, 2011, pp. 265–272.
  • [39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [40] J. Lee, T. Delbruck, P. K. J. Park, M. Pfeiffer, C.-W. Shin, H. Ryu, and B. C. Kang, “Live demonstration: Gesture-based remote control using stereo pair of dynamic vision sensors,” in IEEE Int. Symp. Circuits Syst. (ISCAS), 2012.
  • [41] P. K. J. Park, K. Lee, J. H. Lee, B. Kang, C.-W. Shin, J. Woo, J.-S. Kim, Y. Suh, S. Kim, S. Moradi, O. Gurel, and H. Ryu, “Computationally efficient, real-time motion recognition based on bio-inspired visual and cognitive processing,” in IEEE Int. Conf. Image Process. (ICIP), Sep. 2015.
  • [42] S. Barua, Y. Miyatani, and A. Veeraraghavan, “Direct face detection and video reconstruction from event cameras,” in IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2016, pp. 1–9.
  • [43] R. Benosman, S.-H. Ieng, P. Rogister, and C. Posch, “Asynchronous event-based Hebbian epipolar geometry,” IEEE Trans. Neural Netw., vol. 22, no. 11, pp. 1723–1734, 2011.
  • [44] R. Benosman, S.-H. Ieng, C. Clercq, C. Bartolozzi, and M. Srinivasan, “Asynchronous frameless event-based optical flow,” Neural Netw., vol. 27, pp. 32–37, 2012.
  • [45] P. Bardow, A. J. Davison, and S. Leutenegger, “Simultaneous optical flow and intensity estimation from an event camera,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 884–892.
  • [46] B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang, “Feedforward categorization on AER motion events using cortex-like features in a spiking neural network,” IEEE Trans. Pattern Anal. Machine Intell., vol. 26, no. 9, pp. 1963–1978, Sep. 2015.
  • [47] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in Int. Joint Conf. Neural Netw. (IJCNN), vol. 4, Jul. 2015, pp. 2933–2940.
  • [48] H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization,” in British Machine Vis. Conf. (BMVC), Sep. 2017.
  • [49] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Unsupervised event-based optical flow using motion compensation,” in Eur. Conf. Comput. Vis. Workshops (ECCVW), 2018.
  • [50] A. Z. Zhu, D. Thakur, T. Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, “The multivehicle stereo event camera dataset: An event camera dataset for 3D perception,” IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2032–2039, Jul. 2018.
  • [51] A. Censi, “Efficient neuromorphic optomotor heading regulation,” in 2015 American Control Conference (ACC), July 2015, pp. 3854–3861.
  • [52] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240x180 130dB 3us latency global shutter spatiotemporal vision sensor,” IEEE J. Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, 2014.
  • [53] F. Ponulak, “Resume-new supervised learning method for spiking neural networks,” 2005.
  • [54] A. Mohemmed, S. Schliebs, and N. Kasabov, “Span: A neuron for precise-time spike pattern association,” in Neural Information Processing, B.-L. Lu, L. Zhang, and J. Kwok, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 718–725.
  • [55] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Conf. Neural Inf. Process. Syst. (NIPS), 2015, pp. 2017–2025.
  • [56] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Front. Neurosci., vol. 9, p. 437, 2015.
  • [57] C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS,” IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 259–275, Jan. 2011.
  • [58] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE Trans. Pattern Anal. Machine Intell., vol. 28, no. 4, pp. 594–611, April 2006.
  • [59] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Apr. 2015.
  • [60] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. on Learning Representations (ICLR), 2015.
  • [61] A. Z. Zhu, N. Atanasov, and K. Daniilidis, “Event-based feature tracking with probabilistic data association,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 4465–4470.
  • [62] G. Gallego, H. Rebecq, and D. Scaramuzza, “A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 3867–3876.
  • [63] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
  • [64] D. Sun, S. Roth, and M. J. Black, “A quantitative analysis of current practices in optical flow estimation and the principles behind them,” International Journal of Computer Vision, vol. 106, no. 2, pp. 115–137, 2014.
  • [65] M. Menze, C. Heipke, and A. Geiger, “Joint 3d estimation of vehicles and scene flow,” in ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
  • [66] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” NIPS Deep Learning Workshop., 2014.