I Introduction
Thanks to recent progress in aerial autonomy, UAVs can now fly agilely in complex environments, for example during mine exploration. Drones are able to perceive unknown environments and plan an exploration path autonomously. However, perception in dynamic environments, especially with high-speed objects, remains a challenging problem. For example, a drone has difficulty dodging a rock falling head-on during fast mine exploration.
To avoid fast-moving objects, it is pivotal to track them and predict their future trajectories with low latency. For most perception methods, this latency is normally hundreds of milliseconds: cameras need tens of milliseconds to expose and suffer from motion blur, and algorithms need a sequence of frames to predict a trajectory. For objects moving faster than 10 meters per second, such a long latency leaves drones no time to escape. Reducing this latency requires sensors with a higher temporal resolution, together with a real-time detection and tracking algorithm.
To fill this research gap, we adopt the event camera, an asynchronous motion-activated sensor providing microsecond-level temporal resolution. In this work, we also design a complete perception system around this sensor. First, we propose an ego-motion compensation algorithm to alleviate the noise. Then, for dynamic object detection, we develop a regression-based approach to find the region of interest (ROI). Notably, this approach is more robust and less computationally demanding than the clustering-based methods in [3].
Furthermore, a satisfactory solution should be capable of tracking the object in the 2D camera space and estimating its corresponding 3D trajectory [14]. To address the scale ambiguity issue, we further incorporate a depth camera to recover the scale of monocular sensing by joint optimization. Then, combining event and depth observations, we present a trajectory estimator that significantly increases robustness and accuracy. Our algorithm balances the tight onboard computational budget against trajectory accuracy. We perform extensive quantitative and qualitative experiments in highly dynamic scenarios to validate our object detection and trajectory estimation framework, which provides a solid foundation for fast-moving object avoidance.
This paper highlights several features:

An advanced motion compensation method for event detection that balances efficiency and accuracy.

A 3D trajectory estimation approach that fuses event and depth information asynchronously.

A complete system integration with open-source code. Our code and video can be found at https://github.com/ZJUFASTLab/FASTDynamicVision.
II Related Work
Due to its unique advantages of high temporal resolution and freedom from motion blur, the event camera has attracted the interest of many researchers [4] [2]. The first problem to solve is how to represent events. Existing event representation algorithms can be divided into two main categories. The first classifies the events of different objects into several clusters [20] [12] [8] [9] [1]. This kind of method builds intuitively on the event mechanism, but it is sensitive to noisy events. It also neglects time information, which is vital for event-based detection. The second tracks features on time surface frames [3] [11] [5] [18]; a time surface is a 2D map that stores only the latest event’s timestamp in each pixel and ignores other recently triggered events. Specifically, some researchers [3] [11] introduce a mean-time image representation that contains the average timestamp of the events at each pixel. This mean-time image is less computationally demanding than other types, e.g., exponential time surfaces [19]. Furthermore, this representation is more suitable for object detection tasks: regions containing moving objects can be obtained by merely thresholding the mean-time image. Therefore, our detection method is based on it.
To remove background events generated by rotation and translation, ego-motion compensation is necessary for moving object detection and tracking. Mitrokhin et al. [11] minimize error functions derived from the spatial gradient of the mean-time image to fit a parametric motion model; Gallego et al. [5] maximize a variance that represents the local contrast, in other words the sharpness, of the compensated image. Zhou et al. [17] minimize an energy function. These optimization-based methods are accurate. However, their high computational cost introduces extra latency into the perception system [3], which could lead to failure in our object avoidance scenario. Falanga et al. [3] use the average IMU angular velocity to perform rotational ego-motion compensation. This method is less computationally demanding, so it can run onboard, but its accuracy is not guaranteed in forward flight. Building on this method [3], we improve motion compensation by fusing depth and IMU data to implement both rotational and translational ego-motion compensation. Our method improves accuracy and reliability without sacrificing computational efficiency.
For object tracking and trajectory estimation, our framework is inspired by the following studies. Su et al. [14] fit a parabolic model to estimate the 3D trajectory of a flying object from noisy 2D observations. This method requires plenty of observations due to the lack of depth information, and therefore cannot meet the low-latency requirement. Falanga et al. [3] apply stereo event cameras for 3D position estimation. However, this configuration does not guarantee accuracy and robustness, because the high noise level causes uncertainty in depth estimation. To obtain more accurate 3D trajectories, we design a different configuration that fuses an event camera and a depth sensor onboard.
III Overview
III-A System Architecture
The pipeline of our framework is illustrated in Fig. 2. The framework has three stages: ego-motion compensation, object detection, and object trajectory estimation. First, we implement an advanced motion compensation algorithm that fuses IMU and depth data to filter out background events generated by ego-motion, including rotation and translation during flight. The mean-time image is then generated from the motion-compensated events; each pixel value of this image is the average timestamp of the corresponding events. Following the motion compensation step, we detect and locate the region with the largest average timestamp in the mean-time image. This region represents the area with the fastest speed on the image plane. To obtain the region’s bounding box, we introduce an iterative Gaussian fitting algorithm for the object detection step. We also present a moving region retrieval step to make the resulting bounding box as accurate as possible. Next, the moving object’s location is tracked with a Kalman filter on the 2D plane, and the object is segmented out of the depth map according to the detection result. Then, we optimize the trajectory of the object by minimizing reprojection residuals. Finally, to validate our estimation, we design a scenario in which a UAV autonomously detects and avoids objects flying towards it.
The rest of this paper is organized as follows: Section IV-A presents our advanced ego-motion compensation algorithm. Section IV-B discusses the object detection and tracking methods used in this framework. Section IV-C presents our 3D trajectory estimator, which fuses the event stream and depth information. Section V describes our real flight experiments and compares our performance with other methods.
III-B Notation
Let denote a set of events. We use symbols to denote an event triggered by an event camera. The symbol , represents the event’s coordinate on the image plane, denotes the timestamp of the event.
We represent as a set of motion-compensated events (see Section IV-A) which are projected onto the same pixel
(1) 
Therefore, the event-count image pixel [11] can be denoted as where
(2) 
We also define the time-image as . Hence, pixel in the time-image, represented as , can be expressed as the average timestamp of the events triggered at this position, as follows:
(3) 
We call the normalized time-image the normalized mean-time image , which can be computed by the following equation [3]
(4) 
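The image definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the system's implementation: the function name `mean_time_image`, the array layout, and in particular the normalization (subtracting the mean over pixels that received events and dividing by the buffer duration) are assumptions made here, since Equation 4 is not reproduced in the text.

```python
import numpy as np

def mean_time_image(xs, ys, ts, height, width):
    """Return (count, mean_time, normalized) images for one event buffer.

    xs, ys: integer pixel coordinates of the motion-compensated events.
    ts:     event timestamps in seconds, relative to the start of the buffer.
    """
    xs = np.asarray(xs, dtype=np.intp)
    ys = np.asarray(ys, dtype=np.intp)
    ts = np.asarray(ts, dtype=np.float64)

    count = np.zeros((height, width), dtype=np.float64)
    time_sum = np.zeros((height, width), dtype=np.float64)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(count, (ys, xs), 1.0)
    np.add.at(time_sum, (ys, xs), ts)

    # Average timestamp per pixel; pixels without events stay at zero.
    hit = count > 0
    mean_time = np.zeros_like(time_sum)
    mean_time[hit] = time_sum[hit] / count[hit]

    # Assumed normalization: center on the mean over hit pixels and
    # scale by the duration of the buffer.
    normalized = np.zeros_like(mean_time)
    dt = ts.max() - ts.min() if ts.size else 0.0
    if dt > 0:
        normalized[hit] = (mean_time[hit] - mean_time[hit].mean()) / dt
    return count, mean_time, normalized
```

With this normalization, pixels whose events are newer than the buffer average receive positive values, which is what a later thresholding step would pick up.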
We use () as the world frame and () as the drone body frame. Notably, we use () to represent the event camera frame and () to represent the depth camera frame. Hence, we can represent the transformation of the event camera in the world frame as
(5) 
.
IV Fast-moving Object Detection and Tracking
IV-A Ego-motion Compensation
Events can be triggered either by moving objects or by the ego-motion of the camera. To segment objects, events generated by ego-motion (the background) should be filtered out first. In previous works [3] [11] [10], algorithms for ego-motion compensation are either computationally demanding or not accurate enough. To address this problem, we present a method that balances computational efficiency, accuracy, and robustness by fusing depth and IMU data to compensate for rotational and translational ego-motion. This section is illustrated in Fig. 3.
This section introduces our advanced ego-motion compensation method in two steps: rotational and translational. Before projection, we store events into an event buffer in a small time window from timestamp . Then we utilize IMU data to compute the average angular velocity and orientation matrix in the world frame during .
IV-A1 Rotational Compensation
We apply this compensation step to eliminate events generated by the camera’s rotation [3]. After getting the average angular velocity , we apply Rodrigues’ rotation formula [13] to build the rotation matrix from relative angle at timestamp . Instead of building this matrix at each timestamp , we update it every millisecond to decrease the computational cost. Then, we use this rotation matrix to apply a warp field to every event on the image plane [11] [3]. This event warping process can be denoted as follows
(6)  
with being the motion-compensated events in buffer . After compensation, we project the compensated events onto a 2D image plane using the event camera’s intrinsic matrix.
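The rotational warp can be sketched as follows. This is an illustrative reading under stated assumptions, not the exact form of Equation 6: the names `rodrigues` and `rotate_events` are hypothetical, the average angular velocity is assumed to be expressed in the event camera frame, and events are warped to the camera orientation at the start of the buffer. One rotation matrix is shared by all events falling in the same 1 ms bin, mirroring the per-millisecond update described above.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix for a rotation vector (axis * angle), via Rodrigues' formula."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    skew = np.array([[0.0, -k[2], k[1]],
                     [k[2], 0.0, -k[0]],
                     [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * skew + (1.0 - np.cos(theta)) * (skew @ skew)

def rotate_events(pixels, ts, t0, omega_mean, K_cam, step=1e-3):
    """Warp event pixels (n, 2) to the camera orientation at time t0.

    omega_mean: average angular velocity in the camera frame (assumption).
    K_cam:      3x3 intrinsic matrix of the event camera.
    step:       time quantization; one rotation matrix per bin.
    """
    pixels = np.asarray(pixels, dtype=float)
    ts = np.asarray(ts, dtype=float)
    omega_mean = np.asarray(omega_mean, dtype=float)
    K_inv = np.linalg.inv(K_cam)
    bins = np.floor((ts - t0) / step).astype(int)
    warped = np.empty_like(pixels)
    for b in np.unique(bins):
        # Rotation accumulated between t0 and the start of this bin.
        R = rodrigues(omega_mean * (b * step))
        sel = bins == b
        rays = K_inv @ np.vstack([pixels[sel].T, np.ones(sel.sum())])
        proj = K_cam @ (R @ rays)
        warped[sel] = (proj[:2] / proj[2]).T
    return warped
```

Sharing one matrix per bin trades a small angular error (at most the rotation accumulated within one millisecond) for far fewer matrix constructions.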
IV-A2 Translational Compensation
We apply this step to eliminate noise generated by the camera’s translation.
Previous compensation methods [3] are limited by the lack of depth estimation. Without a reliable depth for each pixel, they cannot compensate for ego-translation, which leads to misdetection when drones fly fast. We solve this problem by leveraging an onboard depth camera.
We now back-project events from the 2D image plane into the 3D camera frame using the perspective projection model with homogeneous coordinates [6]
(7) 
where is the event’s position in the camera frame, represents the event’s coordinate on the image plane, and is the intrinsic matrix. We then transform this point into the world frame with the transform matrix , apply the translational compensation by multiplying by the matrix , and reproject it back to the camera frame,
(8) 
where is the event’s compensated position in the camera frame.
This translational compensation matrix is built from the derivative of position over time
(9) 
with being the velocity estimated by our odometry. To limit the computational cost, we update the velocity and timestamp used to build this matrix every millisecond.
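A minimal sketch of the translational step is given below. To keep it short, it assumes that the camera velocity is already expressed in the event camera frame and that rotation has been compensated beforehand, so the detour through the world frame in Equation 8 collapses to a single shift of each 3D point. The function name `translate_events` and its arguments are hypothetical.

```python
import numpy as np

def translate_events(pixels, depths, ts, t0, v_cam, K_cam):
    """Compensate camera translation for events with known depth.

    pixels: (n, 2) event coordinates on the image plane.
    depths: (n,) depth of each event along the optical axis, in meters.
    ts:     (n,) event timestamps; t0 is the reference time of the buffer.
    v_cam:  camera linear velocity in the camera frame (assumption).
    """
    pixels = np.asarray(pixels, dtype=float)
    depths = np.asarray(depths, dtype=float)
    dt = np.asarray(ts, dtype=float) - t0

    # Back-project each pixel to a 3D point using its depth.
    homog = np.column_stack([pixels, np.ones(len(pixels))])
    points = (np.linalg.inv(K_cam) @ homog.T).T * depths[:, None]

    # A static point seen at time t has these coordinates in the frame at t0.
    points = points + np.outer(dt, np.asarray(v_cam, dtype=float))

    # Reproject onto the image plane.
    proj = (K_cam @ points.T).T
    return proj[:, :2] / proj[:, 2:3]
```

After this shift, events from static structure pile up on the same pixels regardless of when they fired, while events from independently moving objects remain spread out.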
IV-B Object Detection
IV-B1 Dynamic Obstacle Segmentation
After ego-motion compensation, we propose a thresholding method to filter out the background in the normalized mean-time image . Instead of using a fixed threshold, we design an adaptive one considering angular and linear velocity as , where and are the magnitudes of the angular and linear velocity, , and are parameters. We use this threshold to separate objects from the background. Let us define to represent the image after motion segmentation, which is formulated as:
(10) 
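The idea of a velocity-dependent threshold can be illustrated with the sketch below. The exact expression of the threshold and of Equation 10 is not reproduced in this text, so the affine form used here, the parameter names `a`, `b`, `c` and their default values are purely hypothetical; the sketch only shows how a threshold that grows with ego-motion can be applied while keeping the timestamp values of the surviving pixels.

```python
import numpy as np

def segment_moving(normalized, omega, v, a=0.1, b=0.05, c=0.2):
    """Zero out pixels below an adaptive threshold; keep the values above it.

    normalized: normalized mean-time image.
    omega, v:   angular and linear velocity vectors of the camera.
    a, b, c:    hypothetical tuning parameters of an assumed affine threshold.
    """
    tau = a * np.linalg.norm(omega) + b * np.linalg.norm(v) + c
    segmented = np.where(np.asarray(normalized, dtype=float) > tau, normalized, 0.0)
    return segmented, tau
```

Keeping the original values instead of binarizing leaves the later fitting step with a graded response over the object.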
Compared to the previous approach [3], our method preserves more information about the moving object and thus lowers the chance of missed detection.
IV-B2 Iterative Gaussian Fitting
After dynamic obstacle segmentation, the image is composed of moving objects and background noise. Commonly, the patterns belonging to the moving object have a relatively high mean timestamp. Some preprocessing is still needed to improve the fit. First, a mean filter and morphological operations are used to eliminate salt-and-pepper noise. Next, an element-wise square is applied to further enhance the image contrast. After this preprocessing, Algorithm 1 extracts the moving object. Initially, is the original ROI, where denotes the center point of the ROI and denotes the initial side lengths. In this work, is set to the pattern with the highest mean timestamp, and are set to of the image width and height. After that, the optimal and are computed through an iterative Gaussian fitting process. Finally, the original ROI converges to the optimal one, denoted as .
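Since Algorithm 1 is not reproduced in this text, the following sketch shows one plausible reading of the iteration rather than the algorithm itself: the Gaussian is fitted by moment matching (intensity-weighted mean and standard deviation inside the current ROI), and the ROI is re-centered on the mean and resized to `k` standard deviations until it stops changing. The function name `fit_roi`, the initial fraction `frac`, and the factor `k` are assumptions.

```python
import numpy as np

def fit_roi(image, frac=0.25, k=4.0, max_iter=20, tol=0.5):
    """Iteratively refine an ROI (cx, cy, side_x, side_y) around the brightest blob."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    # Start at the pixel with the highest value, with sides a fraction of the image.
    cy, cx = np.unravel_index(np.argmax(image), image.shape)
    cx, cy = float(cx), float(cy)
    sx, sy = frac * w, frac * h
    for _ in range(max_iter):
        x0, x1 = int(max(cx - sx / 2, 0)), int(min(cx + sx / 2 + 1, w))
        y0, y1 = int(max(cy - sy / 2, 0)), int(min(cy + sy / 2 + 1, h))
        patch = image[y0:y1, x0:x1]
        total = patch.sum()
        if total <= 0:
            break  # empty ROI: nothing to fit
        ys, xs = np.mgrid[y0:y1, x0:x1]
        # Moment matching: weighted mean and standard deviation per axis.
        mx, my = (xs * patch).sum() / total, (ys * patch).sum() / total
        sdx = np.sqrt((((xs - mx) ** 2) * patch).sum() / total)
        sdy = np.sqrt((((ys - my) ** 2) * patch).sum() / total)
        new = (mx, my, max(k * sdx, 1.0), max(k * sdy, 1.0))
        shift = max(abs(a - b) for a, b in zip(new, (cx, cy, sx, sy)))
        cx, cy, sx, sy = new
        if shift < tol:
            break  # ROI has converged
    return cx, cy, sx, sy
```

The factor `k` matters in this reading: if the ROI is too narrow relative to the fitted spread, each iteration truncates the blob further and the ROI can shrink instead of settling.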
IV-B3 Moving Region Retrieval
In most cases, the contour of the moving object can be extracted accurately and completely. In some cases, however, especially when the moving object is too small on the image plane, the ROI can fail to converge. Hence, we search the connected components of a non-converged ROI for the region most likely to be the moving object. After this operation, the moving object is extracted accurately in the majority of cases.
IV-C Object Trajectory Estimation
Once the moving objects are detected, a 3D trajectory can be predicted. However, the two sensors sample at different frequencies, so synchronization can be challenging. Inspired by [14], we propose an optimization-based trajectory estimation and prediction system that fuses event and depth observations without synchronous sampling.
An illustration of this section can be seen in Fig. 4.
This section introduces our entire system in three parts: (1) 2D object correspondence and tracking, (2) object segmentation in the depth plane, and (3) a 3D trajectory estimator fusing event and depth observations.
IV-C1 2D Object Correspondence and Tracking
To estimate the object trajectory from detection results, we need to know which object each result corresponds to. When a new detection arrives, the algorithm checks its time discrepancy and position deviation from the previous detection. After the correspondence is determined, the object is tracked by an Extended Kalman Filter (EKF) [7] with a linear constant-acceleration model on the 2D camera plane. The EKF is updated with the object’s center position on the 2D image plane and estimates its position, velocity, and acceleration on that plane.
We use the EKF for two reasons. First, due to the event camera’s limitations, the appearance of objects on the event plane is neither stable nor accurate, which makes the objects’ observed 2D positions noisy; the EKF filters this noise. Second, missed detections may occur, and the EKF can predict the object position in this case.
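A constant-acceleration filter of this kind can be sketched as below. Because the motion and measurement models are both linear here, the EKF equations reduce to those of a standard Kalman filter, which is what the sketch implements. The class name, the state ordering, and all noise values are assumptions chosen for illustration, and the process noise is simplified to act on the acceleration terms only.

```python
import numpy as np

class ConstantAccelerationKF:
    """2D tracker with state [x, y, vx, vy, ax, ay] in pixel units."""

    def __init__(self, xy, pos_var=4.0, q=1e3, r=4.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0, 0.0, 0.0])
        # Large initial uncertainty on the unobserved velocity and acceleration.
        self.P = np.diag([pos_var, pos_var, 1e4, 1e4, 1e6, 1e6])
        self.q = q                      # process noise intensity (assumed)
        self.R = r * np.eye(2)          # measurement noise covariance (assumed)
        self.H = np.hstack([np.eye(2), np.zeros((2, 4))])  # observe position only

    def predict(self, dt):
        """Propagate the state by dt seconds and return the predicted position."""
        F = np.eye(6)
        F[0, 2] = F[1, 3] = F[2, 4] = F[3, 5] = dt
        F[0, 4] = F[1, 5] = 0.5 * dt * dt
        Q = self.q * dt * np.diag([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with a detected object center z = (u, v)."""
        innovation = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        gain = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + gain @ innovation
        self.P = (np.eye(6) - gain @ self.H) @ self.P
```

When a detection is missed, calling `predict` without `update` carries the track forward, which corresponds to the second reason given above.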
IV-C2 Object Segmentation in the Depth Plane
A 3D trajectory is required for motion planning, but the tracking method only provides a 2D position. Therefore, perceiving the object’s depth is essential. We solve this with semi-tightly coupled depth segmentation: we use the detection results of the event camera to assist depth camera segmentation, which reduces computation and processing latency. The whole process is described below.
First of all, the depth map from the depth camera is registered to the event camera according to the intrinsic and extrinsic matrices of the event and depth cameras.
In practice, because the depth and event cameras deliver data at different frequencies, the object position after projection may be slightly biased. We therefore scale the bounding box by a factor of two and use it as the ROI.
After the previous steps, the approximate location of the object has been determined. We assume that the most dangerous obstacle is the one closest to us, so the object can be separated by the nearest peak in the histogram of the depth map. The process is shown in Fig. 5.
To improve the system robustness, we compute the mean and variance of the segmented depth pixels. If the variance is too high, these pixels probably belong to the background and should not be considered. Otherwise, these pixels belong to the object, and we average their values to obtain the distance between the camera and the object.
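The nearest-peak selection and the variance check can be sketched as follows. The function name `nearest_object_depth`, the bin width, the minimum peak size, and the variance limit are all hypothetical values picked for illustration; the sketch takes the depth values inside the enlarged ROI as input.

```python
import numpy as np

def nearest_object_depth(depth_roi, bin_width=0.2, min_count=20, max_var=0.01):
    """Return the mean depth of the nearest histogram peak, or None if rejected.

    depth_roi: depth values (meters) inside the enlarged bounding box.
    """
    d = np.asarray(depth_roi, dtype=float).ravel()
    d = d[np.isfinite(d) & (d > 0)]          # drop invalid depth pixels
    if d.size == 0:
        return None
    edges = np.arange(d.min(), d.max() + 2 * bin_width, bin_width)
    hist, edges = np.histogram(d, bins=edges)
    # Scan from near to far and stop at the first sufficiently large local peak.
    for i, n in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if n >= min_count and n >= left and n >= right:
            lo = edges[max(i - 1, 0)]
            hi = edges[min(i + 2, len(edges) - 1)]
            pixels = d[(d >= lo) & (d <= hi)]
            if pixels.var() > max_var:
                return None                   # too spread out: likely background
            return float(pixels.mean())
    return None
```

Scanning from the near end implements the assumption that the closest obstacle is the most dangerous one.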
IV-C3 3D Trajectory Estimator Fusing Event and Depth
Although the 2D position of the object and its depth have been estimated separately, the event camera is faster than the depth camera, so we cannot directly associate the object depth with the 2D position on the camera plane. Inspired by [14], we propose an optimization-based 3D trajectory estimator that fuses 2D position and depth (see Fig. 4). The most significant difference from [14] is that we fuse the depth residual into the optimization framework. Two assumptions should be stated first. First, the gravity vector is known to the drone. Second, the object is in free fall, and air resistance is ignored. We describe the trajectory as Equation 11. Given the initial 3D position , velocity , gravity vector and the start time , we can predict the object’s 3D position at any time , expressed as . From object correspondence, the time when the object first appeared can be measured. We only need to obtain the initial 3D position and velocity to represent the whole trajectory.
(11) 
The and are estimated through nonlinear optimization by minimizing the depth residual and the reprojection error of the event observations. Thanks to the depth residual, the number of observations within the same period increases, so convergence is faster and the system is significantly more robust than the monocular method.
At time , we detect and track the object in the event camera, and the predicted object position in the camera frame can be written as Equation 12. denotes the position of the object in the camera frame at time . and represent the rotation matrix and translation vector from the world frame to the event camera frame, respectively. and are the object position on the event camera plane from 2D tracking. The residual can be written as Equations 12 and 13. In Equation 13, we assume a pinhole camera model, but this model can be changed according to the actual lens.
(12) 
(13) 
Similarly, the depth residual is expressed as Equations 14 and 15, with the rotation and translation from the world frame to the depth camera frame at time . is the depth observed by the depth camera at time . Note that and are independent of each other and generally differ.
(14) 
(15) 
(16) 
This problem can then be formulated as the nonlinear optimization problem in Equation 16 to obtain the trajectory parameters . For better robustness, we use the Huber loss.
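For reference, the free-fall model and the two residuals described in this section can be written out as follows. This is a sketch under stated assumptions: the symbol names ($\mathbf{p}_0$, $\mathbf{v}_0$, $\mathbf{g}$, $t_0$, the superscripts $e$ and $d$ for the event and depth cameras, and the pinhole intrinsics $f_x, f_y, c_x, c_y$) are chosen here for illustration and may differ from the notation of Equations 11 to 16, and the observed depth is assumed to be the coordinate along the optical axis of the depth camera.

```latex
\begin{aligned}
% Free-fall trajectory from the initial state (p_0, v_0) at time t_0
\mathbf{p}(t) &= \mathbf{p}_0 + \mathbf{v}_0\,(t - t_0) + \tfrac{1}{2}\,\mathbf{g}\,(t - t_0)^2 \\[4pt]
% Event observation i at time t_i: predicted point in the event camera frame
\mathbf{p}^{e}_i &= \mathbf{R}^{e}_i\,\mathbf{p}(t_i) + \mathbf{T}^{e}_i
  = \begin{bmatrix} x^{e}_i & y^{e}_i & z^{e}_i \end{bmatrix}^{\top} \\
% Reprojection residual against the tracked 2D position (u_i, v_i)
\mathbf{r}^{e}_i &= \begin{bmatrix}
  u_i - \left( f_x\, x^{e}_i / z^{e}_i + c_x \right) \\
  v_i - \left( f_y\, y^{e}_i / z^{e}_i + c_y \right)
\end{bmatrix} \\[4pt]
% Depth observation j at time t_j: predicted point in the depth camera frame
\mathbf{p}^{d}_j &= \mathbf{R}^{d}_j\,\mathbf{p}(t_j) + \mathbf{T}^{d}_j
  = \begin{bmatrix} x^{d}_j & y^{d}_j & z^{d}_j \end{bmatrix}^{\top} \\
% Depth residual against the measured depth d_j
r^{d}_j &= d_j - z^{d}_j \\[4pt]
% Joint robust cost over both observation sets, with Huber loss \rho
\min_{\mathbf{p}_0,\,\mathbf{v}_0}\;
  &\sum_i \rho\!\left( \lVert \mathbf{r}^{e}_i \rVert^2 \right)
  + \sum_j \rho\!\left( \left( r^{d}_j \right)^2 \right)
\end{aligned}
```

Because the event times $t_i$ and depth times $t_j$ enter the cost separately, the two sensors never need to be sampled synchronously; each observation simply constrains the trajectory at its own timestamp.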
V Experiment and Evaluation
V-A Implementation Details
We present our real-world experiment (see Section V-D) on a modified flight platform carrying an iniVation DVXplorer dynamic vision sensor and an Intel RealSense D435i depth camera. A DJI Manifold 2-C computer running Ubuntu 16.04 is mounted on the UAV for onboard computation. We use a CUAV Nora autopilot flight controller running the PX4 flight stack. To reduce the disturbance of the motion capture system’s infrared light on the dynamic vision sensor, we add an infrared filter on the lens surface of the DVXplorer camera. The overall weight (including LiPo battery and propellers) is 1.99 kg, with dimensions being mm. An overview of our flight platform can be seen in Fig. 6.
V-B Evaluation of Ego-motion Compensation
To demonstrate the robustness of our method, we test our system in highly dynamic scenarios, where the UAV flies at over 5 m/s. Three algorithms are applied in two scenes, one with no moving object and the other with an object moving at over 10 m/s (see TABLE I). For efficiency, we want the process to consume as little time as possible. For accuracy, the contrast between the moving object and the background should be as high as possible, which is critical for detection algorithms. We call this the relative contrast. To derive it, we define the manually marked bounding box of the moving object in a motion-compensated image as (marked in green in Fig. 7(a) on the RGB frame). We denote the rest of the image as . Then, the relative contrast is defined as:
(17) 
Note that noise is introduced because the sensors are imperfect. It is meaningless to compute the relative contrast on a very noisy image: noise timestamps span the whole range from oldest to newest, so noise pixels are more likely than the moving object or the background to be selected when computing the relative contrast. Therefore, we apply the same denoising to the images produced by all three ego-motion compensation methods before computing the relative contrast. In addition, the relative contrast can only be computed in Scene 2, because Scene 1 contains no moving object.
Usually, these indicators cannot be met at the same time, so a trade-off between performance and efficiency is unavoidable. In this work, we sacrifice a little efficiency while remaining real-time. Table I lists the results of comparing our method against [3] and [11]. The table indicates that our method is far faster than [11], while it is slower than [3] by several milliseconds. At the same time, the mean value and variance of our output image are closer to those of the optimization-based method [11] than those of [3]. Moreover, our method has the highest contrast between the moving object and the background, which facilitates object detection.
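Experiment | Algorithm | Time min (ms) | Time avg (ms) | Time max (ms) | Relative contrast min | Relative contrast avg | Relative contrast max | Real-time | Dynamic flight
Scene 1 | BetterFlow [11] | 1173.2 | 10620.4 | 32461.9 | - | - | - | No | Yes
Scene 1 | Falanga [3] | 2.7 | 7.1 | 20.6 | - | - | - | Yes | No
Scene 1 | ours | 4.1 | 12.9 | 22.4 | - | - | - | Yes | Yes
Scene 2 | BetterFlow [11] | 554.2 | 16587.6 | 110559.0 | 17.7 | 27.8 | 36.1 | No | Yes
Scene 2 | Falanga [3] | 1.4 | 4.6 | 16.4 | 1.2 | 1.7 | 3.9 | Yes | No
Scene 2 | ours | 4.3 | 8.9 | 20.7 | 22.4 | 32.9 | 51.4 | Yes | Yes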
V-C Evaluation of Trajectory Estimation
To validate the accuracy of the estimated trajectories, we compare our fusion method with the monocular method [14] in the same scenarios. Ground truth is provided by a Vicon motion capture system. We run both estimation algorithms in two different scenarios, one with the drone flying fast forward and the other with the drone swinging forward. The drone flies at 2 m/s, and a ball is thrown at about 12 m/s from one side to the other in both scenes. The detection and data association modules are identical for both methods. Due to the fast motion of the ball, there are seven detections on the event camera and three segmentations of the depth map within 0.18 seconds in the forward scenario. The swinging scenario lasts 0.16 seconds, with six detections on the event camera and three on the depth map.
Fig. 8 compares fusing two cameras against a monocular event camera configured as in [14]. It shows that fusing the two sensors gives visibly better results than the monocular event camera alone. In both scenarios, the trajectories estimated by the monocular method point in the opposite direction along the x-axis, mainly because of the small number of detections and the lack of depth measurements. We compute the APE (Absolute Pose Error) of the estimated trajectory with respect to the reference. The detailed results are shown in TABLE II. This comparison demonstrates that the accuracy of our method is much higher in these fast scenarios.
V-D Real-world Experiment
We present several ball-throwing experiments with onboard sensors, both indoors and outdoors, with larger and smaller balls, in bright (240 to 1100 lux) and dim (8 to 10 lux) lighting, and in swinging and moving flight. The main goal of these experiments is to validate our onboard detection, tracking, and 3D trajectory estimation system in dynamic object avoidance scenes.
One experiment is to throw a ball of unknown size at a hovering UAV, which moves upward after detecting the ball to avoid a collision. A ball with a diameter of 21 cm was thrown from a distance of 8 to 10 meters at speeds from 7.0 to 12.0 m/s, as in Fig. 1. In this experiment, background events are triggered by in-situ vibration and rotation. We repeated this experiment in several scenes to evaluate the performance under different ambient illumination levels.
Another experiment is to dodge a thrown ball while the UAV is flying forward. The ball was thrown from the same position and at the same speed as in the previous experiment. Unlike in that experiment, plenty of background events were triggered by the UAV’s translational motion, making the moving ball harder to detect. Our system handles this case successfully (please refer to our attached video).
VI Conclusion
In this paper, we present a novel perception system for solving dynamic object avoidance problems. It achieves computationally efficient yet accurate motion compensation for event-based object detection. It also presents a robust 3D trajectory estimator leveraging both event and depth data. The system has been tested in real-world experiments to demonstrate its advantages.
Nevertheless, there is still room for improvement. Integrating an avoidance algorithm based on motion planning with our perception system is one of the most promising directions. In this way, a carefully generated trajectory such as those of [16] [15] could account for static and dynamic scenes, avoidance performance, and flight smoothness.
References
 [1] (2016) ELiSeD — An event-based line segment detector. In 2016 Second International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), pp. 1–7. ISBN 9781509041961.
 [2] (2015) On event-based optical flow detection. Frontiers in Neuroscience 9. ISSN 1662453X.
 [3] (2020-03-18) Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics 5 (40), pp. eaaz9712. ISSN 24709476.
 [4] (2020) Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 1–1. ISSN 19393539.
 [5] (2018-06) A Unifying Contrast Maximization Framework for Event Cameras, with Applications to Motion, Depth, and Optical Flow Estimation. In Proc. of the IEEE Intl. Conf. on Pattern Recognition (CVPR), pp. 3867–3876. ISBN 9781538664209.
 [6] (2004) Multiple view geometry in computer vision. Cambridge University Press, Cambridge (eng). ISBN 9780521540513.
 [7] (2004) Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92 (3), pp. 401–422.
 [8] (2015-08) Asynchronous Event-Based Multikernel Algorithm for High-Speed Visual Features Tracking. IEEE Transactions on Neural Networks and Learning Systems 26 (8), pp. 1710–1720. ISSN 21622388.
 [9] (2019) Robust event-based object tracking combining correlation filter and cnn representation. Frontiers in Neurorobotics 13, pp. 82. ISSN 16625218.
 [10] (2019-11) EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras. In Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst. (IROS), pp. 6105–6112. ISSN 21530866.
 [11] (2018) Event-Based Moving Object Detection and Tracking. In Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst., pp. 1–9. ISBN 9781538680940.
 [12] (2012) Spatiotemporal multiple persons tracking using dynamic vision sensor. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 35–40.
 [13] (1978) Quaternion from rotation matrix. Journal of Guidance and Control 1 (3), pp. 223–224.
 [14] (2017) Catching a Flying Ball with a Vision-Based Quadrotor. In 2016 International Symposium on Experimental Robotics, D. Kulić, Y. Nakamura, O. Khatib, and G. Venture (Eds.), Springer Proceedings in Advanced Robotics, Vol. 1, pp. 550–562. ISBN 9783319501147 9783319501154.
 [15] (2021-02-27) (Website). 2103.00190.
 [16] (2021) EGO-Planner: an ESDF-free gradient-based local planner for quadrotors. IEEE Robotics and Automation Letters 6 (2), pp. 478–485.
 [17] (2020) (Website).
 [18] (2018) Semi-dense 3D reconstruction with a stereo event camera. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 235–251.
 [19] (2021-02-22) (Website). 2007.15548.
 [20] (2017-05) Event-based feature tracking with probabilistic data association. In Proc. of the IEEE Intl. Conf. on Robot. and Autom. (ICRA), pp. 4465–4470.