Exploiting Event Cameras by Using a Network Grafting Algorithm

by Yuhuang Hu, et al.
Universität Zürich

Novel vision sensors such as event cameras provide information that is not available from conventional intensity cameras. An obstacle to using these sensors with current powerful deep neural networks is the lack of large labeled training datasets. This paper proposes a Network Grafting Algorithm (NGA), where a new front end network driven by unconventional visual inputs replaces the front end network of a pretrained deep network that processes intensity frames. The self-supervised training uses only synchronously recorded intensity frames and novel sensor data to maximize feature similarity between the pretrained network and the grafted network. We show that the enhanced grafted network reaches average precision (AP_50) scores comparable to those of the pretrained network on an object detection task using an event camera dataset, with no increase in inference costs. The grafted front end has only 5–8% of the total parameters and can be trained in a few hours on a single GPU, equivalent to 5% of the time needed to train the entire object detector from labeled data. NGA allows these new vision sensors to capitalize on previously pretrained powerful deep models, saving on training cost.




1 Introduction

Figure 1: Types of computer vision datasets. Data from [5].

Figure 2: A network (blue) trained on intensity frames outputs bounding boxes of detected objects. NGA trains a new GN front end (red) using a small unlabeled dataset of recordings from a DAVIS [3] event camera that concurrently outputs intensity frames and asynchronous brightness change events. The grafted network is obtained by replacing the original front end with the GN front end, and is used for inference with the novel camera input data.

Novel vision sensors like polarization and event cameras provide new ways of sensing the visual world and enable new or improved vision system applications. So-called event cameras, for example, sense normal visible light, but dramatically sparsify it to pure brightness change events, which provide sub-ms timing and HDR to offer fast vision under challenging illumination conditions [15, 7]. These novel sensors are becoming practical alternatives that complement standard cameras to improve vision systems.

Deep Learning (DL) with labeled data has revolutionized vision systems using conventional intensity frame-based cameras. But exploiting DL for vision systems based on novel cameras has been held back by the lack of large labeled datasets for these sensors. Prior work to solve high-level vision problems using inputs other than intensity frames has followed the principles of supervised Deep Neural Network (DNN) training algorithms, where the task-specific datasets must be labeled with a tremendous amount of manual effort [17, 2]. Although the community has collected many useful small datasets for novel sensors, the size, variety, and labeling quality of these datasets are far from rivaling intensity frame datasets [18, 11, 2, 6]. As shown in Fig. 1, among 1,212 surveyed computer vision datasets in [5], 93% are intensity frame datasets. There are only 9 event-based datasets.

One line of DL research employs unsupervised methods to train networks that predict pixel-level quantities such as optical flow [30] and depth [29], and that reconstruct intensity frames [20]. The information generated by these networks can be further processed by a downstream DNN trained to solve tasks such as object classification. This information is exceptionally useful in challenging scenarios such as high-speed motion under difficult lighting conditions. However, the additional latency introduced by running these networks might be undesirable for fast online applications. For instance, the DNNs used for intensity reconstruction at low QVGA resolution take 30 ms on a dedicated GPU [20, 23].

This work introduces a simple yet effective algorithm called the Network Grafting Algorithm (NGA) to obtain a Grafted Network (GN) that addresses both issues: 1. the lack of large labeled datasets for training a DNN from scratch, and 2. additional inference cost and latency that comes from running networks that compute pixel-level quantities. With this algorithm, we train a GN front end for processing unconventional visual inputs (red block in Fig. 2) to drive a network originally trained on intensity frames. We demonstrate GNs for event cameras in this paper.

The NGA training encourages the GN front end to produce features that are similar to the features at several early layers of the pretrained network. Since the algorithm only requires pretrained hidden features as the target, the training is self-supervised, that is, no labels are needed from the novel camera data. The training method is described in Section 3.1. Furthermore, the newly trained GN has a similar inference cost to the pretrained network and does not introduce additional preprocessing latency. Because the training of a GN front end relies on the pretrained network, the NGA has similarities to Knowledge Distillation (KD) [10], Transfer Learning [19], and Domain Adaptation (DA) [8, 24, 26]. In addition, our proposed algorithm utilizes loss terms proposed for super-resolution image reconstruction and image style transfer [12, 9]. Section 2 elaborates on the similarities and differences between NGA and these related domains.

To evaluate NGA (Section 4), we start with a pretrained object detection network and train a GN for an event camera driving dataset (Section 4.1) to solve the same task. We show that the GN achieves detection precision similar to that of the original pretrained network. We also evaluate the accuracy gap between supervised training and self-supervised NGA training using an event camera version of MNIST (Section 4.2). Finally, we present representation analysis and ablation studies in Section 5. Our contributions are as follows:

  1. We propose a novel algorithm called NGA that allows the use of networks already trained to solve a high-level vision problem but adapted to work with a new GN front end that processes inputs from event cameras.

  2. The NGA does not need a labeled event dataset because the training is self-supervised.

  3. The newly trained GN has an inference cost similar to the pretrained network because it directly processes the event data. Hence, the computation latency brought by e.g., intensity reconstruction from events is eliminated.

  4. The algorithm allows the output of these novel cameras to be exploited in situations that are difficult for standard cameras.

2 Related Work

The NGA trains a GN front end such that the hidden features at different layers of the GN are similar to respective pretrained network features on intensity frames. From this aspect, the NGA is similar to Knowledge Distillation [10, 22, 25] where the knowledge of a teacher network is gradually distilled into a student network (usually smaller than the teacher network) via the soft labels provided by the teacher network. In KD, the teacher and student networks use the same dataset. In contrast, the NGA assumes that the inputs for the pretrained front end and the GN front end come from two different modalities that see the same scene concurrently, but this dataset can simply be raw unlabeled recordings. The NGA also has a flavor of Transfer Learning [19] and Domain Adaptation [8, 24, 26] that study how to fine-tune the knowledge of a pretrained network on a new dataset. Transfer learning and DA usually involve re-training of the network for a similar or new task, while here, we train a GN front end from scratch since the network has to process the data from a different sensory modality.

Another interpretation of maximizing hidden feature similarity can be understood from the algorithms used for super-resolution (SR) image reconstruction and image style transfer. SR image reconstruction requires a network that up-samples a low-resolution image into a high-resolution image. The perceptual loss [12, 27] was used to increase the sharpness and maintain the natural image statistics of the reconstruction. Image style transfer networks often aim to transfer an image into a target artistic style where Gram loss [9] is often employed. While these networks learn to match either a high-resolution image ground truth or an artistic style, we train the GN front end to output features that match the hidden features of the pretrained network. For training the front end, we draw inspiration from these studies and propose the use of combinations of training loss metrics including perceptual loss and Gram loss.

3 Methods

We first describe the details of NGA in Section 3.1, then the event camera and its data representation in Section 3.2. Finally, in Section 3.3, we discuss the details of the event dataset.

3.1 Network Grafting Algorithm

The NGA uses a pretrained network $\mathrm{N}$ that takes an intensity frame $I_t$ at time $t$, and produces a grafted network $\mathrm{GN}$ whose input is an event volume $V_t$. The frame $I_t$ and the volume $V_t$ are synchronized during the training. The $\mathrm{GN}$ should perform with similar accuracy on the same network task, such as object detection. During inference with the event camera, $I_t$ is not needed. The rest of this section sets up the constructions of $\mathrm{N}$ and $\mathrm{GN}$, then the NGA is described.

Figure 3: NGA. (top) Pretrained Network. (bottom) Grafted Network. Arrows point from variables to relevant loss terms. $I_t$ and $V_t$ here are an intensity frame and an event volume, respectively. The intermediate features $H_t$, $\hat{H}_t$, $R_t$, $\hat{R}_t$ are shown as heat maps averaged across channels. The object bounding boxes predicted by the original and the grafted network are outlined in red and blue, respectively.

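Pretrained network setup. The pretrained network $\mathrm{N}$ consists of three blocks: $\mathrm{N}_{\mathrm{f}}$ (Front end), $\mathrm{N}_{\mathrm{mid}}$ (Middle net), $\mathrm{N}_{\mathrm{last}}$ (Remaining layers). Each block is made up of several layers and the outputs of each of the three blocks are defined as

$$H_t = \mathrm{N}_{\mathrm{f}}(I_t), \quad R_t = \mathrm{N}_{\mathrm{mid}}(H_t), \quad Y_t = \mathrm{N}_{\mathrm{last}}(R_t)$$

where $H_t$ is the front end features, $R_t$ is the middle net features, and $Y_t$ is the network prediction. The separation of the network blocks is studied in Section 5.2. The top row in Fig. 3 illustrates the three blocks of the pretrained network.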

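Grafted network setup. We define a GN front end $\mathrm{GN}_{\mathrm{f}}$ that takes $V_t$ as the input and outputs grafted front end features, $\hat{H}_t$, of the same dimension as $H_t$. $\mathrm{GN}_{\mathrm{f}}$ combined with $\mathrm{N}_{\mathrm{mid}}$ and $\mathrm{N}_{\mathrm{last}}$ produces the predictions $\hat{Y}_t$:

$$\hat{H}_t = \mathrm{GN}_{\mathrm{f}}(V_t), \quad \hat{R}_t = \mathrm{N}_{\mathrm{mid}}(\hat{H}_t), \quad \hat{Y}_t = \mathrm{N}_{\mathrm{last}}(\hat{R}_t)$$

We define $\mathrm{GN} = \{\mathrm{GN}_{\mathrm{f}}, \mathrm{N}_{\mathrm{mid}}, \mathrm{N}_{\mathrm{last}}\}$ as the Grafted Network (bottom row of Fig. 3).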
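The block structure above can be sketched as plain function composition. This is a minimal illustration, not the authors' implementation: the four toy blocks below are hypothetical stand-ins for the real front end, middle net, and remaining layers, and only the wiring (the grafted network reuses the pretrained middle net and remaining layers unchanged) reflects the text.

```python
from typing import Callable, List

Block = Callable[[List[float]], List[float]]


def compose(front: Block, mid: Block, last: Block) -> Block:
    """Chain three blocks: input -> front end -> middle net -> remaining layers."""
    def network(x: List[float]) -> List[float]:
        return last(mid(front(x)))
    return network


# Toy stand-ins for the real blocks (hypothetical, for illustration only).
def n_f(frame: List[float]) -> List[float]:
    """Pretrained front end; consumes an intensity frame."""
    return [2.0 * v for v in frame]


def gn_f(volume: List[float]) -> List[float]:
    """Grafted front end; consumes an event volume."""
    return [v + 1.0 for v in volume]


def n_mid(h: List[float]) -> List[float]:
    """Middle net, shared by both networks."""
    return [v * v for v in h]


def n_last(r: List[float]) -> List[float]:
    """Remaining layers, shared by both networks."""
    return [sum(r)]


pretrained = compose(n_f, n_mid, n_last)  # N  = {N_f,  N_mid, N_last}
grafted = compose(gn_f, n_mid, n_last)    # GN = {GN_f, N_mid, N_last}
```

Only `gn_f` differs between the two networks, which is why grafting leaves the inference cost of the shared blocks untouched.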

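Network Grafting Algorithm. The NGA trains the grafted network $\mathrm{GN}$ to reach a similar performance to that of the pretrained network $\mathrm{N}$ by increasing the representation similarity between the pretrained front end features $H_t$ and the grafted front end features $\hat{H}_t$.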

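The loss function for the training of the grafted front end $\mathrm{GN}_{\mathrm{f}}$ consists of a combination of three losses. The first loss is the Mean-Squared-Error (MSE) between the pretrained front end features $H_t$ and the grafted front end features $\hat{H}_t$:

$$\mathcal{L}_{\mathrm{recon}} = \mathrm{MSE}(H_t, \hat{H}_t)$$

Because this loss term captures the amount of representation similarity between the two different front ends, we call $\mathcal{L}_{\mathrm{recon}}$ a Feature Reconstruction Loss (FRL).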

The second loss takes into account the output of the middle net layers in the network and draws inspiration from the Perceptual Loss [12]. This loss is set by the MSE between the middle net frame features $R_t$ and the grafted middle net features $\hat{R}_t$:

$$\mathcal{L}_{\mathrm{eval}} = \mathrm{MSE}(R_t, \hat{R}_t)$$

Since this loss term additionally evaluates the feature similarities between the front end features $H_t$ and $\hat{H}_t$, we refer to $\mathcal{L}_{\mathrm{eval}}$ as the Feature Evaluation Loss (FEL).

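Both FRL and FEL terms minimize the magnitude differences between hidden features. To further encourage the GN front end to generate intensity frame-like textures, we introduce the Feature Style Loss (FSL) based on the mean-subtracted Gram loss [9] that computes a Gram matrix using feature columns across channels (indexed using $i$, $j$). The Gram matrix represents image texture rather than spatial structure. This loss is defined as:

$$\mathrm{Gram}(F)^{(i,j)} = \tilde{F}^{(i)\top} \tilde{F}^{(j)}, \quad \tilde{F} = F - \mathrm{mean}(F), \qquad \mathcal{L}_{\mathrm{style}} = \mathrm{MSE}\left(\mathrm{Gram}(H_t), \mathrm{Gram}(\hat{H}_t)\right)$$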

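The final loss function is a weighted sum of the three loss terms:

$$\mathcal{L}_{\mathrm{tot}} = \alpha \mathcal{L}_{\mathrm{recon}} + \beta \mathcal{L}_{\mathrm{eval}} + \gamma \mathcal{L}_{\mathrm{style}}$$

The weights $\alpha$, $\beta$, and $\gamma$ are kept the same for all experiments in the paper. The loss terms and their associated variables are shown in Fig. 3. The importance of each loss term is studied in Section 5.3.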
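The three loss terms can be sketched on small feature maps as follows. This is an illustrative sketch under stated assumptions, not the authors' code: feature maps are represented as lists of channels (each channel a flat list of values), the Gram matrix subtracts a single global mean, and the default weights `alpha`, `beta`, `gamma` are placeholders rather than the values used in the paper.

```python
def mse(a, b):
    """Mean squared error between two equally shaped C x P feature maps."""
    diffs = [(x - y) ** 2 for ca, cb in zip(a, b) for x, y in zip(ca, cb)]
    return sum(diffs) / len(diffs)


def gram(features):
    """Mean-subtracted Gram matrix (C x C) of a C x P feature map.

    Assumption: one global mean is subtracted from every element.
    """
    count = sum(len(channel) for channel in features)
    mean = sum(sum(channel) for channel in features) / count
    centered = [[v - mean for v in channel] for channel in features]
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in centered]
            for ci in centered]


def nga_loss(h, h_hat, r, r_hat, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of FRL, FEL and FSL (weights are placeholder values)."""
    frl = mse(h, h_hat)              # Feature Reconstruction Loss
    fel = mse(r, r_hat)              # Feature Evaluation Loss
    fsl = mse(gram(h), gram(h_hat))  # Feature Style Loss
    return alpha * frl + beta * fel + gamma * fsl
```

When the grafted features equal the pretrained features, all three terms vanish and `nga_loss` returns zero, which is the target the front end is trained toward.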

3.2 Event Camera and Feature Volume Representation

Event cameras such as the DAVIS camera [15, 3] produce a stream of asynchronous “events” triggered by local brightness (log intensity) changes at individual pixels. Each output event of the event camera is a four-element tuple $\{t, x, y, p\}$ where $t$ is the timestamp, $(x, y)$ is the location of the event, and $p$ is the event polarity. The polarity is either positive (brightness increasing) or negative (brightness decreasing). To preserve both spatial and temporal information captured by the polarity events, we use the event voxel grid [30, 20]. Assuming a volume of $N$ events $\{(t_i, x_i, y_i, p_i)\}_{i=1}^{N}$ where $i$ is the event index, we divide this volume into $D$ event slices of equal temporal intervals such that the $d$-th slice is defined as follows:

$$S_d(x, y) = \sum_{x_i = x,\, y_i = y} p_i \max\left(0,\, 1 - |d - \tilde{t}_i|\right)$$

and $\tilde{t}_i = (D - 1)(t_i - t_1)/(t_N - t_1)$ is the normalized event timestamp. The event volume is then defined as $V_t = \{S_d\}_{d=0}^{D-1}$. The values of $D$ and $N$ are fixed for each experiment in Section 4. Prior work has shown that this representation, which covers a constant number of brightness change events, is a simple but effective input for optical flow computation [30] and video reconstruction [20].
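The voxel grid construction can be sketched in a few lines. The sketch below follows the common formulation from [30] (polarity-weighted bilinear interpolation in time) and assumes a non-empty, time-sorted event list with polarities encoded as +1 and -1; the function name and argument layout are illustrative choices, not taken from the paper.

```python
def event_voxel_grid(events, num_slices, height, width):
    """Build a D x H x W event volume from (t, x, y, p) tuples.

    Assumes `events` is non-empty, sorted by timestamp, with p in {+1, -1}.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_slices)]
    t_first, t_last = events[0][0], events[-1][0]
    span = (t_last - t_first) or 1.0  # guard against a zero-length window
    for t, x, y, p in events:
        # Normalize the timestamp to the range [0, D - 1].
        t_norm = (num_slices - 1) * (t - t_first) / span
        for d in range(num_slices):
            # Each event contributes to its two nearest slices in time.
            weight = max(0.0, 1.0 - abs(d - t_norm))
            if weight > 0.0:
                grid[d][y][x] += p * weight
    return grid
```

An event that falls exactly halfway between two slices contributes half of its polarity to each, so temporal ordering is preserved without discarding events.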

3.3 Event Camera Dataset

The Multi Vehicle Stereo Event Camera Dataset (MVSEC) [28] is a collection of event camera recordings for studying 3D perception and optical flow estimation. The outdoor_day2 recording is carried out in an urban area of West Philadelphia. This recording was selected for the car detection experiment because of its better quality compared to other recordings, and because it has a large number of cars in the scenes distributed throughout the entire recording. We generated in total 7,000 intensity frame and event volume pairs from this recording. Each event volume contains a fixed number of events. The first 5,000 pairs are used as the training dataset, and the last 2,000 pairs are used as the testing dataset. There are no temporally overlapping pairs between the training and testing datasets.

Because MVSEC does not provide ground truth bounding boxes for cars, we pseudo-labeled data pairs of the testing dataset for intensity frames that contain at least one car detected by the Hybrid Task Cascade (HTC) Network [4], which provides state-of-the-art results in object detection. We only use the bounding boxes with 80% or higher confidence to obtain high-quality bounding boxes. To compare the effect of using different numbers of event slices in an event volume on the detection results, we additionally created two versions of this dataset: DVS-3 where $D = 3$ and DVS-10 where $D = 10$.
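The pseudo-labeling rule described above (keep only car boxes with at least 80% confidence, and keep a frame only if at least one such box remains) can be sketched as a simple filter. The detection format used here, a dict with `label`, `score`, and `box` keys, is a hypothetical stand-in for whatever the detector actually returns.

```python
def pseudo_label(detections, min_score=0.8, target="car"):
    """Return the confident target boxes of one frame, or None if there are none.

    `detections` is a list of dicts with hypothetical keys
    'label', 'score' and 'box' (x1, y1, x2, y2).
    """
    boxes = [d["box"] for d in detections
             if d["label"] == target and d["score"] >= min_score]
    # A frame without any confident car is dropped from the labeled test set.
    return boxes or None
```

Frames for which the function returns `None` would simply be excluded from evaluation.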

4 Experiments

We use the NGA to train a GN front end for a pretrained object detection network. In this case, we use the YOLOv3 network [21] that was trained using the COCO dataset [16] with 80 object classes. We choose the YOLOv3 network because it still provides good detection accuracy and could be deployed on a low-cost embedded real-time platform. We refer to the pretrained network as YOLOv3-N and the grafted event-driven networks as YOLOv3-GN in the rest of the paper. The training inputs consist of image patches randomly cropped from the training pairs. No other data augmentation is performed. All networks are trained for 100 epochs with the Adam optimizer [13] and a mini-batch size of 8. Each model training takes 1.5 hours using an NVIDIA RTX 2080 Ti, which is only 5% of the 2 days it typically requires to train one of the object detectors used in this paper on standard labeled datasets. More results from the experiments on the different vision datasets are presented in the supplementary material.
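Random cropping of training pairs can be sketched as below. Two points are assumptions rather than statements from the paper: the same crop window is applied to the intensity frame and to every slice of the event volume (so that the two front ends see spatially aligned inputs), and the patch size is left as a free parameter.

```python
import random


def paired_random_crop(frame, volume, patch, rng=random):
    """Crop one patch x patch window from a frame (H x W) and a volume (D x H x W).

    Assumption: frame and volume share the same window so they stay aligned.
    """
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - patch)
    left = rng.randint(0, width - patch)

    def crop(image):
        return [row[left:left + patch] for row in image[top:top + patch]]

    return crop(frame), [crop(event_slice) for event_slice in volume]
```

Passing a seeded `random.Random` instance as `rng` makes the crops reproducible across runs.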

4.1 Car Detection on Event Camera Driving Dataset

To study if the NGA is effective for exploiting a novel visual sensor, e.g., an event camera, we evaluated car detection results using the pretrained network YOLOv3-N and a grafted network YOLOv3-GN using the MVSEC dataset.

Figure 4: Examples of testing pairs from the MVSEC dataset. The event volume is visualized after averaging across event slices. The predicted bounding boxes (in red) from the intensity-driven network can be compared with the predicted bounding boxes (in blue) from the event-driven network. The magenta box shows cars detected by the event-driven network that are missed by the intensity-driven network. Best viewed in color.

The event camera operates over a larger dynamic range of lighting than an intensity frame camera and therefore detects moving objects even in poorly lit scenes with less motion blur. From the six different data pairs in the MVSEC testing dataset (Fig. 4), we see that the event-driven YOLOv3-GN network detects most of the cars found in the intensity frames and additional cars not detected in the intensity frame (see the magenta box in the figure). These examples help illustrate how event cameras and the event-driven network can complement the pretrained network in challenging situations.

Table 1 compares the accuracy of the intensity and event camera detection networks on the testing set. As might be expected for these well-exposed and sharp daytime intensity frames, the YOLOv3-N produces the highest average precision (AP). But the YOLOv3-GN with DVS-3 input achieves close to the same accuracy, although it was never explicitly trained to detect objects on this type of data. This means that the NGA trained features for DVS input that were effective in driving the remaining parts of the original network to achieve the detection task.

We also tested if the pretrained network would perform poorly on the DVS-3 event dataset. The AP of almost 0% (not reported in the table) confirms that the intensity-driven front end completely fails at processing the event volume and that using a GN front end is essential for acceptable accuracy.

We also compare the performance of the event-driven networks that receive event volumes with the two different numbers of event slices as input. The AP for the best event-driven network (DVS-10) is only 3.18 lower than the original YOLOv3 network accuracy (see the DVS-3 and DVS-10 rows of Table 1). In the next two rows of the table, we also looked at the effect on accuracy of varying the number of training samples. With only 40% of training data (2k samples), the event-driven network still shows strong detection precision at 66.75. But when the NGA has access to only 10% of the data (500 samples) during training, the detection performance drops by 22.47 AP compared to the best event-driven network. Although the NGA requires far less data compared to standard supervised training, training with only a few hundred samples remains challenging and could benefit from data augmentation to improve performance.

| Network | Modality | AP | # Trained Params |
| --- | --- | --- | --- |
| YOLOv3-N | Intensity | 73.53 | 62M |
| YOLOv3-GN | DVS-3 | 70.14 ± 0.36 | 3.2M |
| YOLOv3-GN | DVS-10 | 70.35 ± 0.51 | 3.2M |
| YOLOv3-GN | DVS-10 (40% samples) | 66.75 ± 0.30 | 3.2M |
| YOLOv3-GN | DVS-10 (10% samples) | 47.88 ± 1.86 | 3.2M |
| Combined | Intensity+DVS-10 | 75.45 | N/A |

Table 1: AP scores for car detection on the MVSEC driving dataset (five runs).
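As a quick arithmetic check, the AP gaps quoted in this section can be recomputed from the mean values in Table 1. The dictionary keys below are illustrative labels chosen for this sketch.

```python
# Mean AP values copied from Table 1 (dictionary keys are illustrative labels).
ap = {
    "intensity": 73.53,
    "dvs3": 70.14,
    "dvs10": 70.35,
    "dvs10_40pct": 66.75,
    "dvs10_10pct": 47.88,
    "combined": 75.45,
}


def gap(a, b):
    """Difference in AP between two table rows, rounded to two decimals."""
    return round(ap[a] - ap[b], 2)


gap_best_event = gap("intensity", "dvs10")    # pretrained vs. best grafted
gap_low_data = gap("dvs10", "dvs10_10pct")    # full data vs. 10% of the data
gain_combined = gap("combined", "intensity")  # fusion vs. intensity only
```

These three differences correspond to the 3.18, 22.47, and 1.92 AP figures discussed in the text.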

To study the benefit of using the event camera brightness change events to complement its intensity frame output, we combined the detection results from both the pretrained network and event-driven network (Row Combined in Table 1). After removing duplicated bounding boxes through non-maximum suppression, the AP score of the combined prediction is higher by 1.92 than the prediction of the pretrained network using intensity frames. It means that NGA enables the simplest type of output sensor fusion to improve accuracy above either input alone.
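Output-level fusion of this kind can be sketched with greedy non-maximum suppression over the pooled detections of both networks. The box format `(x1, y1, x2, y2)`, the `(box, score)` pairing, and the IoU threshold of 0.5 are assumptions made for this sketch; the paper does not state the threshold it used.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse_detections(frame_dets, event_dets, iou_threshold=0.5):
    """Pool (box, score) detections from both networks, then apply greedy NMS."""
    pooled = sorted(frame_dets + event_dets, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in pooled:
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

A car found only by the event-driven network survives the suppression step, which is how the combined prediction can exceed the intensity-only result.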

4.2 Comparing NGA and standard supervised learning

Intuitively, a network trained in a supervised manner should achieve higher accuracy than a network trained through self-supervision. To study this, we evaluated the accuracy gap between classification networks trained with supervised learning and the NGA using event recordings of the MNIST handwritten digit recognition dataset, called N-MNIST [18]. It contains event recordings for each MNIST digit. Each event volume is prepared with the voxel grid representation of Section 3.2. The training uses the Adam optimizer and a batch size of 256.

First, we trained the LeNet-N network with the standard LeNet-5 architecture [14] using the intensity samples in the standard MNIST dataset. Next, we trained LeNet-GN with the NGA by using parallel MNIST and N-MNIST sample pairs. We also trained an event-driven LeNet-supervised network from scratch on N-MNIST using standard supervised learning with the labeled digits. Table 2 compares the results. The LeNet-GN network accuracy is only 0.36% lower than the event-driven LeNet-supervised network. This result illustrates that even with the NGA training of a front end which has only 8% of the total network parameters, and without the availability of labeled training data, the performance of a GN can closely approach the accuracy of a supervised network.

| Network | Dataset | Error Rate (%) | # Trained Params |
| --- | --- | --- | --- |
| LeNet-N | MNIST | 0.92 | 64k |
| LeNet-GN | N-MNIST | 1.47 ± 0.05 | 5k |
| LeNet-supervised | N-MNIST | 1.11 ± 0.06 | 64k |

Table 2: Classification results on MNIST and N-MNIST dataset.

5 Network Analysis

To understand the representational power of the GN features, Section 5.1 presents a qualitative study that illustrates how the grafted front end features represent helpful visual input under difficult lighting conditions. To design an effective GN, it is important to select what parts of the network to graft. The following two subsections study the network designs (Section 5.2) and NGA training (Section 5.3).

5.1 Grafted front end features decoding

Previous experiments show that the grafted front end features supply useful information for the GN in the object detection tasks. In this section, we provide qualitative evidence that the grafted features often faithfully represent the input scene. Specifically, we decode the grafted features by optimizing a decoded intensity frame $I^{d}_t$ that produces features through the intensity-driven network best matching the grafted features $\hat{H}_t$, by minimizing:

$$\arg\min_{I^{d}_t}\; \mathrm{MSE}\left(\mathrm{N}_{\mathrm{f}}(I^{d}_t), \hat{H}_t\right) + \mathrm{TV}(I^{d}_t)$$

where $\mathrm{TV}$ is a total variation regularizer for encouraging spatial smoothness [1]. The decoded intensity frame is initialized randomly and has the same spatial dimension as the intensity frame, then the pixel values of $I^{d}_t$ are optimized for 1,000 iterations using an Adam optimizer.
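The decoding objective can be sketched as a feature-matching term plus a smoothness penalty. The sketch below uses the anisotropic form of total variation (sum of absolute differences between neighboring pixels), which is one common variant and may differ from the one used in the paper; `front_end`, `flatten`, and `tv_weight` are hypothetical names introduced for illustration, and the optimization loop itself is omitted.

```python
def total_variation(image):
    """Anisotropic total variation of a 2-D image given as a list of rows."""
    tv = 0.0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if c + 1 < len(row):
                tv += abs(row[c + 1] - value)       # horizontal neighbor
            if r + 1 < len(image):
                tv += abs(image[r + 1][c] - value)  # vertical neighbor
    return tv


def decode_objective(front_end, image, target, tv_weight=1.0):
    """MSE between front_end(image) and target features, plus weighted TV.

    `front_end` is any callable mapping an image to a flat feature list;
    `tv_weight` is a placeholder value, not taken from the paper.
    """
    features = front_end(image)
    mse = sum((f - g) ** 2 for f, g in zip(features, target)) / len(target)
    return mse + tv_weight * total_variation(image)


def flatten(image):
    """Toy front end used only to exercise the objective."""
    return [value for row in image for value in row]
```

In practice the pixel values of the image would be updated by a gradient-based optimizer to lower this objective; a constant image has zero total variation, so the regularizer pulls the decoded frame toward smooth regions.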

Figure 5: Two grafted features decoding examples for the event datasets. Each column is one example. The top image is the raw event data. The middle image is the raw intensity frame. The bottom image is the decoded intensity frame (see main text). Labeled regions show where the event features show details not visible in the original intensity frames.

Figure 5 shows two examples from the event dataset. Under extreme lighting conditions, the intensity frames are often under/over-exposed while the decoded intensity frames show that the event front end features can represent the same scene better (see the labeled regions).

5.2 Design of the grafted network

Figure 6: YOLOv3 backbone: Darknet-53 [21]. The front end variants are S1, S2 and S3. The middle net variants are S4 and S5. Conv represents a convolution layer, ResBlock represents a residual block.

The YOLOv3 network has a backbone network, Darknet-53, that consists of five residual blocks (Fig. 6). Selecting the correct set of residual blocks used for the NGA front end is important. We tested six combinations of the front end and middle net by using different numbers of residual blocks: {S1, S4}, {S1, S5}, {S2, S4}, {S2, S5}, {S3, S4} and {S3, S5}. S1, S2, and S3 indicate front end variants with different numbers of residual blocks that use 0.06% (40k), 0.45% (279k), and 5.17% (3.2M) of the total parameters (62M), respectively. The number of blocks for S4 and S5 varies depending on the chosen variant. Figure 7 shows the AP scores for different combinations of front end and middle net variants. The best separation of the network blocks is {S3, S4}. In the YOLOv3 network, the detection results improve sharply when the front end includes more layers. On the other hand, the difference in AP between using S4 or S5 for the middle net is not significant. These results suggest that using a deeper front end is better than a shallow front end, especially when training resources are not a constraint.
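The parameter shares of the three front end variants can be recomputed from the rounded counts quoted above. Because the inputs are rounded (40k, 279k, 3.2M, 62M), the result for S3 comes out slightly below the 5.17% reported in the text, presumably because the reported figure was computed from exact parameter counts.

```python
TOTAL_PARAMS = 62_000_000  # rounded total for YOLOv3, as quoted in the text

# Rounded front end parameter counts, as quoted in the text.
front_end_params = {"S1": 40_000, "S2": 279_000, "S3": 3_200_000}

# Share of the total parameters in percent, rounded to two decimals.
shares = {name: round(100 * count / TOTAL_PARAMS, 2)
          for name, count in front_end_params.items()}
```

Even the deepest variant trains only about one twentieth of the network, which is consistent with the short training time reported in Section 4.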

Figure 7: Results of different front end and middle net variants in Fig. 6 for the event dataset in AP. Each separation configuration is repeated five times.

5.3 Ablation study on loss terms

The NGA training includes three loss terms: FRL, FEL, and FSL. We studied the importance of these loss terms by performing an ablation study. These experiments are done on the network configuration {S3, S4} that yielded the event-driven networks with the best accuracy (see Fig. 7). The detection precision scores are shown in Fig. 8 for different loss configurations. The FRL and the FEL are the most critical loss terms, while the role of the FSL is less significant. The effectiveness of different loss combinations sometimes fluctuates, e.g., FEL+FSL. The trend lines indicate that using a combination of loss terms is most likely to produce better detection scores.

Figure 8: GN performance (AP) trained with different loss configurations. Results are from five repeats of each loss configuration.

6 Conclusion

This paper proposes the Network Grafting Algorithm, which replaces the front end of a network pretrained on a large labeled dataset with a newly trained front end so that the grafted network also works well with a different sensor modality. Training the GN front end for a different modality, in this case, an event camera, requires only a reasonably small unlabeled dataset (~5k samples) that has temporally synchronized data from both modalities. By comparison, the COCO dataset on which many object detection networks are trained has 330k images. Ordinarily, training a network with a new sensor type and limited labeled data requires a lot of careful data augmentation. NGA avoids this by exploiting the new sensor data even if unlabeled because the pretrained network already has informative features.

We applied the NGA on an object detection network that was pretrained on a big image dataset. The NGA training was conducted using the MVSEC driving dataset [28]. After training, the GN reaches a similar average precision (AP) score compared to the precision achieved by the original network. Furthermore, the inference cost of the GN is similar to that of the pretrained network, which eliminates the latency cost for computing low-level quantities, particularly for event cameras. This newly proposed NGA widens the use of these unconventional cameras to a broader range of computer vision applications.

Acknowledgements

This work was funded by the Swiss National Competence Center in Robotics (NCCR Robotics).


  • [1] H. A. Aly and E. Dubois (2005-10) Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing 14 (10), pp. 1647–1659. Cited by: §5.1.
  • [2] J. Anumula, D. Neil, T. Delbruck, and S. Liu (2018) Feature representations for neuromorphic audio spike streams. Frontiers in Neuroscience. Cited by: §1.
  • [3] C. Brandli, R. Berner, M. Yang, S-C. Liu, and T. Delbruck (2014) A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits 49 (10), pp. 2333–2341. External Links: ISSN 0018-9200 Cited by: Figure 2, §3.2.
  • [4] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019-06) Hybrid task cascade for instance segmentation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • [5] R. Fisher (2020) CVonline: Image Databases. Cited by: Figure 1, §1.
  • [6] FLIR (2018) FREE FLIR thermal dataset for algorithm training. Note: https://www.flir.com/oem/adas/adas-dataset-form/ Cited by: §1.
  • [7] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza (2019) Event-based vision: A survey. CoRR abs/1904.08405. Cited by: §1.
  • [8] Y. Ganin and V. Lempitsky (2015-07) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1180–1189. Cited by: §1, §2.
  • [9] L. A. Gatys, A. S. Ecker, and M. Bethge (2016-06) Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423. Cited by: §1, §2, §3.1.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, Cited by: §1, §2.
  • [11] Y. Hu, H. Liu, M. Pfeiffer, and T. Delbruck (2016) DVS benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience 10, pp. 405. External Links: ISSN 1662-453X Cited by: §1.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In 2016 European Conference on Computer Vision, Cited by: §1, §2, §3.1.
  • [13] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), Cited by: §4.
  • [14] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: ISSN 1558-2256 Cited by: §4.2.
  • [15] P. Lichtsteiner, C. Posch, and T. Delbruck (2008-02) A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits 43 (2), pp. 566–576. External Links: ISSN 0018-9200 Cited by: §1, §3.2.
  • [16] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. External Links: ISBN 978-3-319-10602-1 Cited by: §4.
  • [17] D. P. Moeys, F. Corradi, E. Kerr, P. Vance, G. Das, D. Neil, D. Kerr, and T. Delbrück (2016-06) Steering a predator robot using a mixed frame/event-driven convolutional neural network. In 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP), pp. 1–8. Cited by: §1.
  • [18] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor (2015) Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9, pp. 437. External Links: ISSN 1662-453X Cited by: §1, §4.2.
  • [19] S. J. Pan and Q. Yang (2010-10) A survey on transfer learning. IEEE Trans. on Knowl. and Data Eng. 22 (10), pp. 1345–1359. External Links: ISSN 1041-4347 Cited by: §1, §2.
  • [20] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza (2019-06) Events-To-Video: bringing modern computer vision to event cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §3.2.
  • [21] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv. Cited by: §4, Figure 6.
  • [22] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: hints for thin deep nets. In Proceedings of ICLR, Cited by: §2.
  • [23] C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, and D. Scaramuzza (2020) Fast image reconstruction with an event camera. In The IEEE Winter Conference on Applications of Computer Vision, pp. 156–163. Cited by: §1.
  • [24] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros (2019) Unsupervised domain adaptation through self-supervision. External Links: 1909.11825 Cited by: §1, §2.
  • [25] J. Yim, D. Joo, J. Bae, and J. Kim (2017-07) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [26] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan (2019-06) Universal domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [27] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018-06) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [28] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis (2018-07) The multivehicle stereo event camera dataset: an event camera dataset for 3d perception. IEEE Robotics and Automation Letters 3 (3), pp. 2032–2039. Cited by: §3.3, §6.
  • [29] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2019-06) Unsupervised event-based learning of optical flow, depth, and egomotion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [30] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2019) Unsupervised event-based optical flow using motion compensation. In Computer Vision – ECCV 2018 Workshops, L. Leal-Taixé and S. Roth (Eds.), Cham, pp. 711–714. External Links: ISBN 978-3-030-11024-6 Cited by: §1, §3.2.