By registering changes in log intensity in the image with microsecond accuracy, event-based cameras offer promising advantages over frame-based cameras in situations involving high-speed motion and difficult lighting. One interesting application of these cameras is the estimation of optical flow. By directly measuring the precise time at which each pixel changes, the event stream directly encodes fine-grained motion information, which researchers have exploited to perform optical flow estimation. For example, Benosman et al. show that optical flow can be estimated from a local window around each event in a linear fashion, by estimating a plane in the spatio-temporal domain. This is significantly simpler than image-based methods, where optical flow is estimated using iterative methods. However, analysis in Rueckauer and Delbruck has shown that these algorithms require significant, hand-crafted outlier rejection schemes, as they do not properly model the output of the sensor.
For traditional image-based methods, deep learning has helped the computer vision community achieve new levels of performance while avoiding having to explicitly model the entire problem. However, these techniques have yet to see the same level of adoption and success for event-based cameras. One reason for this is the asynchronous output of the event-based camera, which does not easily fit into the synchronous, frame-based inputs expected by image-based paradigms. Another reason is the lack of labeled training data necessary for supervised training methods. In this work, we propose two main contributions to resolve these issues.
First, we propose a novel image-based representation of an event stream, which fits into any standard image-based neural network architecture. The event stream is summarized by an image with channels representing the number of events and the latest timestamp at each polarity at each pixel. This compact representation preserves the spatial relationships between events, while maintaining the most recent temporal information at each pixel and providing a fixed number of channels for any event stream.
Second, we present a self-supervised learning method for optical flow estimation given only a set of events and the corresponding grayscale images generated from the same camera. The self-supervised loss is modeled after frame-based self-supervised flow networks such as Yu et al. and Meister et al., where a photometric loss is used as a supervisory signal in place of direct supervision. As a result, the network can be trained using only data captured directly from an event camera that also generates frame-based images, such as the Dynamic and Active-pixel Vision (DAVIS) Sensor developed by Brandli et al., circumventing the need for expensive labeling of data.
These event images combined with the self-supervised loss are sufficient for the network to learn to predict accurate optical flow from events alone. For evaluation, we generate a new event camera optical flow dataset, using the ground truth depths and poses in the Multi Vehicle Stereo Event Camera Dataset by Zhu et al. We show that our method is competitive on this dataset with UnFlow by Meister et al., an image-based self-supervised network trained on KITTI and fine-tuned on event camera frames, as well as standard non-learning-based optical flow methods.
In summary, our main contributions in this work are:
- We introduce a novel method for learning optical flow using events as inputs only, without any supervision from ground-truth flow.
- Our CNN architecture uses a self-supervised photoconsistency loss from low resolution intensity images used in training only.
- We present a novel event-based optical flow dataset with ground truth optical flow, on which we evaluate our method against a state-of-the-art frame-based method.
II Related Work
II-A Event-based Optical Flow
There have been several works that attempt to take advantage of the high temporal resolution of the event camera to estimate accurate optical flow. Benosman et al. model a given patch moving in the spatio-temporal domain as a plane, and estimate optical flow as the slope of this plane. This work is extended in Benosman et al. by adding an iterative outlier rejection scheme to remove events significantly far from the plane, and in Barranco et al. by combining the estimated flow with flow from traditional images. In addition, Brosch et al. present an analogue of Lucas et al. using the events to approximate the spatial image gradient, while Zhu et al. present an expectation-maximization based approach to estimate flow in a local patch.
II-B Event-based Deep Learning
One of the main challenges for supervised learning for events is the lack of labeled data. As a result, many of the early works on learning with event-based data, such as Ghosh et al.  and Moeys et al. , rely on small, hand collected datasets.
To address this, recent works have attempted to collect new datasets of event camera data. Mueggler et al. provide handheld sequences with ground truth camera pose, which Nguyen et al. use to train an LSTM network to predict camera pose. In addition, Zhu et al. provide flying, driving and handheld sequences with ground truth camera pose and depth maps, and Binas et al. provide long driving sequences with ground truth measurements from the vehicle such as steering angle and GPS position.
Recently, there have also been implementations of neural networks on spiking neuromorphic processors, such as in Amir et al. , where a network is adapted to the TrueNorth chip to perform gesture recognition.
II-C Self-supervised Optical Flow
Self-supervised, or unsupervised, methods have shown great promise in training networks to solve many challenging 3D perception problems. Yu et al. and Ren et al. train an optical flow prediction network using the traditional brightness constancy and smoothness constraints developed in optimization-based methods such as the Lucas-Kanade method (Lucas et al.). Zhu et al. combine this self-supervised loss with supervision from an optimization-based flow estimate as a proxy for ground truth supervision, while Meister et al. extend the loss with occlusion masks and a second-order smoothness term, and Lai et al. introduce an adversarial loss on top of the photometric error.
In this section, we describe our approach in detail. In Sec. III-A, we describe our event representation, which is an analogy to an event image. In Sec. III-B, we describe the self-supervised loss used to provide a supervisory signal using only the gray scale images captured before and after each time window, and in Sec. III-C, we describe the architecture of our network, which takes as input the event image and outputs a pixel-wise optical flow. Note that, throughout this paper, we refer to optical flow as the displacement of each pixel within a given time window.
III-A Event Representation
An event-based camera tracks changes in the log intensity of an image, and returns an event whenever the log intensity changes by more than a set threshold $\theta$:
$$\log(I_{t+1}) - \log(I_t) \geq \theta \tag{1}$$
Each event contains the pixel location of the change, the timestamp of the event and the polarity:
$$e = \{\mathbf{x}, t, p\} \tag{2}$$
Because of the asynchronous nature of the events, it is not immediately clear what representation of the events should be used in the standard convolutional neural network architecture. Most modern network architectures expect image-like inputs, with a fixed, relatively low, number of channels (recurrent networks excluded) and spatial correlations between neighboring pixels. Therefore, a good representation is key to fully take advantage of existing networks while summarizing the necessary information from the event stream.
Perhaps the most complete representation that preserves all of the information in each event would be to represent the events as a matrix, where each column contains the information of a single event. However, this does not directly encode the spatial relationships between events that is typically exploited by convolutions over images.
In this work, we chose to instead use a representation of the events in image form. The input to the network is a 4 channel image with the same resolution as the camera.
The first two channels encode the number of positive and negative events that have occurred at each pixel, respectively. This counting of events is a common method for visualizing the event stream, and has been shown in Nguyen et al. to be informative in a learning-based framework to regress 6-DoF pose.
However, the number of events alone discards valuable information in the timestamps that encode information about the motion in the image. Incorporating timestamps in image form is a challenging task. One possible solution would be to have $k$ channels, where $k$ is the most events in any pixel in the image, and stack all incoming timestamps. However, this would result in a large increase in the dimensionality of the input. Instead, we encode the pixels in the last two channels as the timestamp of the most recent positive and negative event at that pixel, respectively. This is similar to the “Event-based Time Surfaces” used in Lagorce et al. and the “timestamp images” used in Park et al. An example of this kind of image can be found in Fig. 2, where we can see that the flow is evident by following the gradient in the image, particularly for closer (faster moving) objects. While this representation inherently discards all of the timestamps but the most recent at each pixel, we have observed that this representation is sufficient for the network to estimate the correct flow in most regions. One deficiency of this representation is that areas with very dense events and large motion will have all pixels overridden by very recent events with very similar timestamps. However, this problem can be avoided by choosing smaller time windows, thereby reducing the magnitude of the motion.
In addition, we normalize the timestamp images by the size of the time window for the image, so that the maximum value in the last two channels is 1. This has the effect of both scaling the timestamps to be on the same order of magnitude as the event counts, and ensuring that fast motions with a small time window and slow motions with a large time window that generate similar displacements have similar inputs to the network.
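As an illustration of the representation described above, the following NumPy sketch accumulates a set of events into the 4 channel image. The function name `events_to_image`, the channels-first array layout and the choice to measure the time window from the first to the last event in the set are assumptions made for this example; they are not specified in the text. Note that, under this assumption, an event at the very start of the window receives a normalized timestamp of 0, the same value as a pixel with no events.

```python
import numpy as np

def events_to_image(x, y, t, p, height, width):
    """Build a 4-channel event image from the events in one time window.

    x, y: integer pixel coordinates; t: timestamps; p: polarity (>0 or <=0).
    Channels: [positive count, negative count,
               latest positive timestamp, latest negative timestamp].
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p)
    image = np.zeros((4, height, width), dtype=np.float64)
    if t.size == 0:
        return image

    # Normalize timestamps by the window size so the latest event maps to 1.
    # Assumption: the window spans the first to the last event in the set.
    t_start, t_end = t.min(), t.max()
    window = t_end - t_start
    t_norm = (t - t_start) / window if window > 0 else np.ones_like(t)

    for channel, mask in ((0, p > 0), (1, p <= 0)):
        # Channels 0-1: per-pixel event counts for each polarity.
        np.add.at(image[channel], (y[mask], x[mask]), 1.0)
        # Channels 2-3: most recent normalized timestamp for each polarity.
        np.maximum.at(image[channel + 2], (y[mask], x[mask]), t_norm[mask])
    return image
```

The unbuffered `np.add.at` and `np.maximum.at` calls are used so that several events landing on the same pixel are all accounted for, which plain fancy-index assignment would not guarantee.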
III-B Self-Supervised Loss
Due to the fact that there is a relatively small amount of labeled data for event based cameras as compared to traditional cameras, it is difficult to generate a sufficient dataset for a supervised learning method. Instead, we utilize the fact that the DAVIS camera generates synchronized events and grayscale images to perform self-supervised learning using the grayscale images in the loss. At training time, the network is provided with the event timestamp images, as well as a pair of grayscale images, occurring immediately before and after the event time window. Only the event timestamp images are passed into the network, which predicts a per pixel flow. The grayscale images are then used to apply a loss over the predicted flow in a self-supervised manner.
The overall loss function used follows traditional variational methods for estimating optical flow, and consists of a photometric and a smoothness loss.
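To compute the photometric loss, the flow is used to warp the second image to the first image using bilinear sampling, as described in Yu et al. The photometric loss, then, aims to minimize the difference in intensity between the warped second image and the first image:
$$\ell_{\text{photometric}}(u, v; I_t, I_{t+1}) = \sum_{x,y} \rho\big(I_t(x, y) - I_{t+1}(x + u(x, y),\, y + v(x, y))\big) \tag{3}$$
where $\rho$ is the Charbonnier loss function, a common loss in the optical flow literature used for outlier rejection (Sun et al.):
$$\rho(x) = (x^2 + \epsilon^2)^{\alpha} \tag{4}$$
As we are using frame-based images for supervision, this method is susceptible to image-based issues such as the aperture problem. Thus, we follow the other works in the frame-based domain, and apply a regularizer in the form of a smoothness loss. The smoothness loss aims to regularize the output flow by minimizing the difference in flow between neighboring pixels horizontally, vertically and diagonally:
$$\ell_{\text{smoothness}}(u, v) = \sum_{x,y} \sum_{(i,j) \in \mathcal{N}(x,y)} \rho\big(u(x, y) - u(i, j)\big) + \rho\big(v(x, y) - v(i, j)\big) \tag{5}$$
where $\mathcal{N}(x, y)$ is the set of neighbors around $(x, y)$.
The total loss is the weighted sum of the photometric and smoothness losses:
$$L_{\text{total}} = \ell_{\text{photometric}} + \lambda\, \ell_{\text{smoothness}} \tag{6}$$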
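The loss described above can be sketched in NumPy as follows. This is a minimal, non-differentiable illustration rather than the training code: the function names, the default values of `eps`, `alpha` and `lam`, and the clamping of out-of-bounds samples to the image border are all assumptions made for this example. The smoothness term here counts each neighboring pair once, which differs from a sum over all neighbors only by a constant factor.

```python
import numpy as np

def charbonnier(x, eps=1e-3, alpha=0.45):
    """Charbonnier penalty (x^2 + eps^2)^alpha; eps and alpha are assumed values."""
    return (x ** 2 + eps ** 2) ** alpha

def warp_bilinear(image, u, v):
    """Sample image at (x + u, y + v) with bilinear interpolation (border-clamped)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xq = np.clip(xs + u, 0, w - 1)
    yq = np.clip(ys + v, 0, h - 1)
    x0 = np.floor(xq).astype(int)
    y0 = np.floor(yq).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = xq - x0, yq - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(first, second, u, v):
    """Penalize intensity differences between the first image and the warped second image."""
    return charbonnier(first - warp_bilinear(second, u, v)).sum()

def smoothness_loss(u, v):
    """Penalize flow differences between horizontal, vertical and diagonal neighbors."""
    total = 0.0
    for f in (u, v):
        total += charbonnier(f[:, 1:] - f[:, :-1]).sum()     # horizontal
        total += charbonnier(f[1:, :] - f[:-1, :]).sum()     # vertical
        total += charbonnier(f[1:, 1:] - f[:-1, :-1]).sum()  # diagonal
        total += charbonnier(f[1:, :-1] - f[:-1, 1:]).sum()  # anti-diagonal
    return total

def total_loss(first, second, u, v, lam=0.5):
    """Weighted sum of both terms; lam is a hypothetical smoothness weight."""
    return photometric_loss(first, second, u, v) + lam * smoothness_loss(u, v)
```

In an actual training setup the same operations would be written in an automatic differentiation framework so that gradients can flow back through the bilinear sampling to the predicted flow.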
III-C Network Architecture
The EV-FlowNet architecture very closely resembles the encoder-decoder networks such as the stacked hourglass (Newell et al.) and the U-Net (Ronneberger et al.), and is illustrated in Fig. 3. The input event image is passed through 4 strided convolution layers, with output channels doubling each time. The resulting activations are passed through 2 residual blocks, and then four upsample convolution layers, where the activations are upsampled using nearest neighbor resampling and then convolved, to obtain a final flow estimate. At each upsample convolution layer, there is also a skip connection from the corresponding strided convolution layer, as well as another convolution layer to produce an intermediate, lower resolution, flow estimate, which is concatenated with the activations from the upsample convolution. The loss in (6) is then applied to each intermediate flow by downsampling the grayscale images. The tanh function is used as the activation function for all of the flow predictions.
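To make the data flow through the architecture concrete, the short sketch below tabulates the channel count and spatial resolution at each stage. It assumes that each strided convolution halves the resolution (stride 2) and that each upsample layer doubles it; neither factor is stated explicitly in the text. The decoder feature widths are not given, so only the 2 channel (horizontal and vertical) flow estimates are listed for the decoder stages. The function name and defaults are hypothetical.

```python
def ev_flownet_shapes(size=256, in_channels=4, base_channels=64,
                      num_stages=4, num_residual=2):
    """Return (stage name, channels, spatial size) tuples for each stage."""
    stages = [("input", in_channels, size)]
    channels, res = base_channels, size
    for i in range(num_stages):
        res //= 2  # assumed stride of 2 per strided convolution
        stages.append((f"encoder{i + 1}", channels, res))
        if i < num_stages - 1:
            channels *= 2  # output channels double at each encoder layer
    for i in range(num_residual):
        stages.append((f"residual{i + 1}", channels, res))
    for i in range(num_stages):
        res *= 2  # assumed nearest neighbor upsampling by a factor of 2
        stages.append((f"flow{i + 1}", 2, res))  # flow estimate at this scale
    return stages

for name, channels, res in ev_flownet_shapes():
    print(f"{name:10s} channels={channels:4d} size={res}x{res}")
```

With the default arguments, the encoder ends at 512 channels, which agrees with the residual block width quoted in the training details.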
IV Optical Flow Dataset
For ground truth evaluation only, we generated a novel dataset for ground truth optical flow using the data provided in the Multi-Vehicle Stereo Event Camera dataset (MVSEC) by Zhu et al. . The dataset contains stereo event camera data in a number of flying, driving and handheld scenes. In addition, the dataset provides ground truth poses and depths maps for each event camera, which we have used to generate reference ground truth optical flow.
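From the pose (consisting of rotation $R$ and translation $p$) of the camera at times $t_0$ and $t_1$, we make a linear velocity assumption, and estimate the velocity $v$ and angular velocity $\omega$ using numerical differentiation:
$$v = \frac{p(t_1) - p(t_0)}{dt} \tag{7}$$
$$\omega^{\wedge} = \frac{\operatorname{logm}\big(R_{t_0}^{T} R_{t_1}\big)}{dt} \tag{8}$$
where logm is the matrix logarithm, $dt = t_1 - t_0$, and $\omega^{\wedge}$ converts the vector $\omega$ into the corresponding skew symmetric matrix:
$$\omega^{\wedge} = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix} \tag{9}$$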
A central moving average filter is applied to the estimated velocities to reduce noise. We then use these velocities to estimate the motion field, given the ground truth depths, $Z$, at each undistorted pixel position:
$$\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{bmatrix} -\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1 + x^2) & y \\ 0 & -\frac{1}{Z} & \frac{y}{Z} & 1 + y^2 & -xy & -x \end{bmatrix} \begin{pmatrix} v \\ \omega \end{pmatrix} \tag{10}$$
Finally, we scale the motion field by the time window between each pair of images, $dt$, and use the resulting displacement as an approximation to the true optical flow for each pixel. To apply the ground truth to the distorted images, we shift the undistorted pixels by the flow, and apply distortion to the shifted pixels. The distorted flow is, then, the displacement from the original distorted position to the shifted distorted position.
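The ground truth generation above can be sketched as follows. This is an illustrative NumPy version under several assumptions that the text does not spell out: `x` and `y` are normalized (calibrated) image coordinates, the velocities are expressed in the camera frame, the rotation between consecutive poses is small (well below 180 degrees), and the standard rigid motion field equations are used. The function names are hypothetical, and the moving average filter and the distortion step are omitted.

```python
import numpy as np

def rotation_log(R):
    """Matrix logarithm of a rotation matrix, returned as a rotation vector."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    # Entries of R - R^T, equal to 2 * sin(angle) * axis.
    vee = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if angle < 1e-8:
        return vee / 2.0  # first-order approximation near the identity
    return angle / (2.0 * np.sin(angle)) * vee

def estimate_velocities(R0, p0, R1, p1, dt):
    """Numerically differentiate two poses into linear and angular velocity."""
    v = (np.asarray(p1) - np.asarray(p0)) / dt
    omega = rotation_log(R0.T @ R1) / dt
    return v, omega

def motion_field(x, y, Z, v, omega):
    """Rigid motion field at normalized coordinates (x, y) with depth Z."""
    xdot = (-v[0] + x * v[2]) / Z + x * y * omega[0] - (1 + x ** 2) * omega[1] + y * omega[2]
    ydot = (-v[1] + y * v[2]) / Z + (1 + y ** 2) * omega[0] - x * y * omega[1] - x * omega[2]
    return xdot, ydot

def ground_truth_flow(x, y, Z, v, omega, dt):
    """Scale the motion field by the time window to obtain a displacement."""
    xdot, ydot = motion_field(x, y, Z, v, omega)
    return xdot * dt, ydot * dt
```

The displacement returned here is in normalized coordinates; converting it to pixels would additionally require the camera intrinsics.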
In total, we have generated ground truth optical flow for the indoorflying, outdoorday and outdoornight sequences. In addition to using the indoorflying and outdoorday ground truth sets for evaluation, we will also release all sequences as a dataset.
V Empirical Evaluation
Figure (qualitative results), columns from left to right: Grayscale Image, Event Timestamps, Ground Truth Flow, UnFlow Flow, EV-FlowNet Flow.
V-A Training Details
Two networks were trained on the two outdoorday sequences from MVSEC. outdoorday1 contains roughly 12000 images, and outdoorday2 contains roughly 26000 images. The images are captured from driving in an industrial complex and public roads, respectively, where the two scenes are visually very different. The motions include mostly straights and turns, with occasional independently moving objects such as other cars and pedestrians. The input images are cropped to 256x256, the number of output channels at the first encoder layer is 64 and the number of output channels in each residual block is 512.
To increase the variation in the magnitude of the optical flow seen at training, we randomly select images up to images apart in time, and all of the events that occurred between those images. In our experiments, . In addition, we randomly flip the inputs horizontally, and randomly crop them to achieve the desired resolution.
was set to be 1e-3. The Adam optimizer is used, with learning rate initialized at 1e-5, and exponentially decayed every 4 epochs by 0.8. The model is trained for 300,000 iterations, and takes around 12 hours to train on a 16GB NVIDIA Tesla V100.
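As a small worked example of the learning rate schedule, the sketch below evaluates the rate at a given epoch. It assumes a staircase schedule, in which the rate is multiplied by 0.8 once every 4 epochs; the text does not say whether the decay is applied in steps or continuously, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=1e-5, decay=0.8, every=4):
    """Staircase exponential decay: multiply by `decay` once per `every` epochs."""
    return initial * decay ** (epoch // every)

for epoch in (0, 4, 8, 12):
    print(epoch, learning_rate(epoch))
```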
V-B Ablation Studies
In addition to the described architecture (denoted EV-FlowNet), we also train three other networks to test the effects of varying the input to the network, as well as increasing the capacity of the network.
To test the contribution of each of the channels in the input, we train two additional networks, one with only the event counts (first two channels) as input, denoted "EV-FlowNet (counts only)", and one with only the event timestamps (last two channels) as input, denoted "EV-FlowNet (timestamps only)".
In addition, we tested different network capacities by training a larger model with 4 residual blocks, denoted "EV-FlowNet (4 residual blocks)". A single forward pass takes, on average, 40ms for the smaller network, and 48ms for the larger network, when run on an NVIDIA GeForce GTX 1050, a laptop-grade GPU.
TABLE I: Quantitative evaluation. Results are grouped by dt=1 frame and dt=4 frames; for each of the outdoor driving, indoor flying1, indoor flying2 and indoor flying3 sequences, the columns report AEE and % Outlier.
As there is no open source code by the authors of Event-based Visual Flow, we designed an implementation around the method described in Rueckauer and Delbruck . In particular, we implemented the robust Local Plane Fit algorithm, with a spatial window of 5x5 pixels, vanishing gradient threshold th3 of 1e-3, and outlier distance threshold of 1e-2. However, we were unable to achieve any reasonable results on the datasets, with only very few points returning valid flow values (), and none of the valid flow values being visually correct. For validation, we also tested the open source MATLAB code provided by the authors of Mueggler et al. , where we received similar results. As a result, we believe that the method was unable to generalize to the natural scenes in the test set, and so did not include the results in this paper.
For UnFlow, we used the unsupervised model trained on KITTI raw, and fine tuned on outdoorday2. This model was able to produce reasonable results on the testing sets, and we include the results in the quantitative evaluation in Tab. I.
V-D Test Sequences
For comparison against UnFlow, we evaluated 800 frames from the outdoorday1 sequence as well as sequences 1 to 3 from indoorflying. For the event input, we used all of the events that occurred in between the two input frames.
The outdoorday1 sequence spans between 222.4s and 240.4s. This section was chosen as the grayscale images were consistently bright, and there is minimal shaking of the camera (the provided poses are smoothed and do not capture shaking of the camera if the vehicle hits a bump in the road). In order to avoid conflicts between training and testing data, a model trained only using data from outdoorday2 was used, which is visually significantly different from outdoorday1.
The three indoorflying sequences total roughly 240s, and feature a significantly different indoor scene, containing vertical and backward motions, which were previously unseen in the driving scenes. A model trained on both outdoorday1 and outdoorday2 data was used for evaluation on these sequences. We avoided fine tuning on the flying sequences, as the sequences are in one room, and all relatively similar in visual appearance. As a result, it would be very easy for a network to overfit the environment. Sequence 4 was omitted as the majority of the view was just the floor, and so had a relatively small amount of useful data for evaluation.
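For each method and sequence, we compute the average endpoint error (AEE), defined as the distance between the endpoints of the predicted and ground truth flow vectors, averaged over the $N$ evaluated pixels:
$$\text{AEE} = \frac{1}{N} \sum_{x,y} \left\| \begin{pmatrix} u(x,y) \\ v(x,y) \end{pmatrix}_{\text{pred}} - \begin{pmatrix} u(x,y) \\ v(x,y) \end{pmatrix}_{\text{gt}} \right\|_2 \tag{11}$$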
In addition, we follow the KITTI flow 2015 benchmark and report the percentage of points with EE greater than 3 pixels and 5% of the magnitude of the flow vector. Similarly to KITTI, 3 pixels is roughly the maximum error observed when warping the grayscale images according to the ground truth flow, and comparing against the next image.
However, as the input event image is relatively sparse, the network only returns accurate flow on points with events. As a result, we limit the computation of AEE to pixels in which at least one event was observed. For consistency, this is done with a mask applied to the EE for both event-based and frame-based methods. We also mask out any points for which we have no ground truth flow (i.e. regions with no ground truth depth). In practice, this results in the error being computed over 20-30% of the pixels in each image.
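The masked evaluation described above can be sketched as follows. The function name and the `(2, H, W)` array layout are assumptions for this example, and the 5% criterion is taken relative to the magnitude of the ground truth flow vector, following the KITTI convention.

```python
import numpy as np

def flow_metrics(flow_pred, flow_gt, event_mask, gt_valid_mask):
    """Return (AEE, percent outliers) over pixels with events and valid ground truth.

    flow_pred, flow_gt: arrays of shape (2, H, W) holding (u, v) per pixel.
    event_mask, gt_valid_mask: boolean arrays of shape (H, W).
    """
    mask = event_mask & gt_valid_mask
    if not mask.any():
        return float("nan"), float("nan")
    # Endpoint error and ground truth magnitude at the evaluated pixels.
    ee = np.linalg.norm(flow_pred - flow_gt, axis=0)[mask]
    gt_magnitude = np.linalg.norm(flow_gt, axis=0)[mask]
    # Outlier: error above 3 pixels and above 5% of the flow magnitude.
    outliers = (ee > 3.0) & (ee > 0.05 * gt_magnitude)
    return float(ee.mean()), 100.0 * float(outliers.mean())
```

In this sketch, `event_mask` would typically be obtained by checking where the two count channels of the event image sum to at least one.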
In order to vary the magnitude of flow observed for each test, we run two evaluations per sequence: one with input frames and corresponding events that are one frame apart, and one with frames and events four frames apart. We outline the results in Tab. I.
V-F1 Qualitative Results
In addition to the quantitative analysis provided, we provide qualitative results in Fig. 4. In these results, and throughout the test set, the predicted flow always closely follows the ground truth. As the event input is quite sparse, our network tends to predict zero flow in areas without events. This is consistent with the photometric loss, as areas without events are typically low texture areas, where there is little change in intensity within each pixel neighborhood. In practice, the useful flow can be extracted by only using flow predictions at points with events. On the other hand, while UnFlow typically performs reasonably on the high texture regions, the results on low texture regions are very noisy.
V-F2 Ablation Study Results
From the results of the ablation studies in Tab. I, EV-FlowNet (counts only) performed the worst. This aligns with our intuition, as the only information attainable from the counts is from motion blur effects, which is a weak signal on its own. EV-FlowNet (timestamps only) performs better for most tests, as the timestamps carry information about the ordering between neighboring events, as well as the magnitude of the velocity. However, the timestamp-only network fails when there is significant noise in the image, or when fast motion results in more recent timestamps covering all of the older ones. This is illustrated in Fig. 5, where even the full network struggles to predict the flow in a region dominated by recent timestamps. Overall, the combined models clearly perform better, likely as the event counts carry information about the importance of each pixel. Pixels with few events are likely to be just noise, while pixels with many events are more likely to carry useful information. Somewhat surprisingly, the larger network with 4 residual blocks actually performs worse than the smaller one with 2 residual blocks. A possible explanation is that the larger capacity network learned to overfit the training sets, and so did not generalize as well to the test sets, which were significantly different. For extra validation, both the 2 and the 4 residual block networks were trained for an additional 200,000 iterations, with no appreciable improvements. It is likely, however, that, given more data, the larger model would perform better.
V-F3 Comparison Results
From our experiments, we found that the UnFlow network tends to predict roughly correct flows for most inputs, but tends to be very noisy in low texture areas of the image. The sparse nature of the events is a benefit in these regions, as the lack of events there would cause the network to predict no flow, instead of an incorrect output.
In general, EV-FlowNet performed better on the dt=4 tests, while worse on the dt=1 tests (with the exception of outdoordriving1 and indoorflying3). We observed that UnFlow typically performed better in situations with very small or very large motion. In these situations, there are either few events as input, or so many events that the image is overridden by recent timestamps. However, this is a problem intrinsic to the testing process, as the time window is defined by the image frame rate. In practice, these problems can be avoided by choosing time windows large enough so that sufficient information is available while avoiding saturating the event image. One possible solution to this would be to have a fixed number of events in the window each time.
In this work, we have presented a novel design for a neural network architecture that is able to accurately predict optical flow from events alone. Due to the method’s self-supervised nature, the network can be trained without any manual labeling, simply by recording data from the camera. We show that the predictions generalize beyond hand designed laboratory scenes to natural ones, and that the method is competitive with state of the art self supervised methods. We hope that this work will provide not only a novel method for flow estimation, but also a paradigm for applying other self-supervised learning methods to event cameras in the future.
- Amir et al.  Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A Low Power, Fully Event-Based Gesture Recognition System. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
- Barranco et al.  Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos. Contour motion estimation for asynchronous event-driven cameras. Proceedings of the IEEE, 102(10):1537–1556, 2014.
- Benosman et al.  Ryad Benosman, Sio-Hoi Ieng, Charles Clercq, Chiara Bartolozzi, and Mandyam Srinivasan. Asynchronous frameless event-based optical flow. Neural Networks, 27:32–37, 2012.
- Benosman et al.  Ryad Benosman, Charles Clercq, Xavier Lagorce, Sio-Hoi Ieng, and Chiara Bartolozzi. Event-based visual flow. IEEE transactions on neural networks and learning systems, 25(2):407–417, 2014.
- Binas et al.  Jonathan Binas, Daniel Neil, Shih-Chii Liu, and Tobi Delbrück. DDD17: End-To-End DAVIS Driving Dataset. CoRR, abs/1711.01458, 2017.
- Brandli et al.  Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014.
- Brosch et al.  Tobias Brosch, Stephan Tschechne, and Heiko Neumann. On event-based optical flow detection. Frontiers in neuroscience, 9:137, 2015.
- Ghosh et al.  Rohan Ghosh, Abhishek Mishra, Garrick Orchard, and Nitish V Thakor. Real-time object recognition and orientation estimation using an event-based camera and CNN. In Biomedical Circuits and Systems Conference (BioCAS), 2014 IEEE, pages 544–547. IEEE, 2014.
- Hu et al.  Yuhuang Hu, Hongjie Liu, Michael Pfeiffer, and Tobi Delbruck. DVS benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience, 10, 2016.
- Lagorce et al.  Xavier Lagorce, Garrick Orchard, Francesco Galluppi, Bertram E Shi, and Ryad B Benosman. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1346–1359, 2017.
- Lai et al.  Wei-Sheng Lai, Jia-Bin Huang, and Ming-Hsuan Yang. Semi-supervised learning for optical flow with generative adversarial networks. In Advances in Neural Information Processing Systems, pages 353–363, 2017.
- Lucas et al.  Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
- Meister et al.  Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss. arXiv preprint arXiv:1711.07837, 2017.
- Moeys et al.  Diederik Paul Moeys, Federico Corradi, Emmett Kerr, Philip Vance, Gautham Das, Daniel Neil, Dermot Kerr, and Tobi Delbrück. Steering a predator robot using a mixed frame/event-driven convolutional neural network. In Event-based Control, Communication, and Signal Processing (EBCCSP), 2016 Second International Conference on, pages 1–8. IEEE, 2016.
- Mueggler et al.  Elias Mueggler, Henri Rebecq, Guillermo Gallego, Tobi Delbruck, and Davide Scaramuzza. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research, 36(2):142–149, 2017.
- Newell et al.  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
- Nguyen et al.  Anh Nguyen, Thanh-Toan Do, Darwin G Caldwell, and Nikos G Tsagarakis. Real-Time Pose Estimation for Event Cameras with Stacked Spatial LSTM Networks. arXiv preprint arXiv:1708.09011, 2017.
- Orchard et al.  Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9, 2015.
- Park et al.  Paul KJ Park, Baek Hwan Cho, Jin Man Park, Kyoobin Lee, Ha Young Kim, Hyo Ah Kang, Hyun Goo Lee, Jooyeon Woo, Yohan Roh, Won Jo Lee, et al. Performance improvement of deep learning based gesture recognition using spatiotemporal demosaicing technique. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 1624–1628. IEEE, 2016.
- Ren et al.  Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised Deep Learning for Optical Flow Estimation. In AAAI, pages 1495–1501, 2017.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Rueckauer and Delbruck  Bodo Rueckauer and Tobi Delbruck. Evaluation of event-based algorithms for optical flow with ground-truth from inertial measurement sensor. Frontiers in neuroscience, 10, 2016.
- Sun et al.  Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
- Yu et al.  Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Computer Vision–ECCV 2016 Workshops, pages 3–10. Springer, 2016.
- Zhu et al. [2017a] Alex Zihao Zhu, Nikolay Atanasov, and Kostas Daniilidis. Event-based feature tracking with probabilistic data association. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 4465–4470. IEEE, 2017a.
- Zhu et al.  Alex Zihao Zhu, Dinesh Thakur, Tolga Ozaslan, Bernd Pfrommer, Vijay Kumar, and Kostas Daniilidis. The Multi Vehicle Stereo Event Camera Dataset: An Event Camera Dataset for 3D Perception. arXiv preprint arXiv:1801.10202, 2018.
- Zhu et al. [2017b] Yi Zhu, Zhenzhong Lan, Shawn Newsam, and Alexander G Hauptmann. Guided optical flow learning. arXiv preprint arXiv:1702.02295, 2017b.