Thanks to recent progress in aerial autonomy, UAVs can fly agilely in complex environments, for example during mine exploration. Drones are able to perceive unknown environments and plan an exploration path autonomously. However, perception in dynamic environments, especially with high-speed objects, is still a challenging problem. For example, a drone has difficulty dodging a rock falling head-on during fast mine exploration.
To avoid fast-moving objects, it is pivotal to track them and predict their future trajectories with low latency. For most perception methods, this latency is normally hundreds of milliseconds: cameras need tens of milliseconds to expose and suffer from motion blur, and algorithms need a sequence of frames to predict a trajectory. However, for objects moving faster than 10 meters per second, such long latency leaves drones no time to escape. To reduce this latency, sensors with a higher temporal resolution are required, and a real-time detection and tracking algorithm is indispensable.
To fill this research gap, we adopt the event camera, an asynchronous motion-activated sensor providing microsecond-level temporal resolution. In this work, we also design a complete perception system around this sensor. First, we propose an ego-motion compensation algorithm to alleviate the noise. Then, for dynamic object detection, we develop a regression-based approach to find the region of interest (ROI). Notably, this approach is more robust and less computationally demanding than other clustering-based methods.
Furthermore, a satisfactory solution should be capable of tracking the object in the 2D camera space and estimating its corresponding 3D trajectory. To address the scale ambiguity issue, we further incorporate a depth camera to recover the scale of monocular sensing by joint optimization. Afterward, combining event and depth observations, we present an accurate trajectory estimator which significantly increases the robustness and accuracy. Our algorithm successfully balances the tight onboard computational budget and trajectory accuracy.
We perform extensive quantitative and qualitative experiments in highly dynamic scenarios to validate our object detection and trajectory estimation framework, which provides a solid foundation for fast-moving object avoidance.
This paper highlights several features:
An advanced motion compensation method for event-detection balancing efficiency and accuracy.
A 3D trajectory estimation approach that fuses event and depth information asynchronously.
A complete system integration with open-source code. Our code and video can be found at https://github.com/ZJU-FAST-Lab/FAST-Dynamic-Vision.
II Related Work
The first problem to solve is how to represent events. Existing event representation algorithms can be divided into two main categories. The first category classifies events of different objects into several clusters. This sort of method is intuitively built on event mechanisms, but it is sensitive to noisy events. It also neglects time information, which is vital for event-based detection. The second category tracks features on time surface frames, which are 2D maps that contain only the latest event's timestamp at each pixel while ignoring other recently triggered events. Specifically, some researchers introduce a mean-time image representation that contains the average timestamp of the events at each pixel. This mean-time image is less computationally demanding than other types, e.g., exponential time surfaces. Furthermore, this representation is more suitable for object detection tasks: regions containing moving objects can be obtained by merely thresholding the mean-time image. Therefore, our detection method is based on it.
To remove background events generated by rotation and translation, ego-motion compensation is necessary for moving object detection and tracking. Mitrokhin et al. minimize error functions derived from the spatial gradient of the mean-time image to fit a parametric motion model; Gallego et al. maximize a variance that represents the local contrast, in other words the sharpness, of the compensated image. Zhou et al. minimize an energy function. These optimization-based methods are accurate. However, their high computational cost introduces extra latency in the perception system, which could lead to failure in our object avoidance scenario. Falanga et al. use the average of the IMU's angular velocity to perform rotational ego-motion compensation. This method is less computationally demanding, so it can be applied onboard, but its accuracy is not guaranteed in forward flight. Building on this method, we improve the motion compensation approach by fusing depth and IMU data to implement both rotational and translational ego-motion compensation. Our method enhances accuracy and reliability without sacrificing computational efficiency.
For object tracking and trajectory estimation, our framework is inspired by the following studies. Su et al. fit a parabolic model to estimate the 3D trajectory of a flying object from noisy 2D observations. This method requires many observations due to the lack of depth information, so it cannot meet the low-latency requirement. Falanga et al. apply stereo event cameras for 3D position estimation. However, this configuration does not guarantee accuracy and robustness because the high level of noise causes uncertainty in depth estimation. To obtain more accurate 3D trajectories, we design a different configuration that fuses onboard event and depth sensors.
III-A System Architecture
The pipeline of our framework is illustrated in Fig. 2. There are three procedures in this framework: ego-motion compensation, object detection, and object trajectory estimation. First, we implement an advanced motion compensation algorithm fusing IMU and depth data to filter out background events generated by ego-motion, including rotation and translation during flight. The mean-time image is then generated from the motion-compensated events. Each pixel value of this mean-time image is the average timestamp of the corresponding events. Following the motion compensation step, we detect and locate the region with the largest average timestamp in the mean-time image. This region represents the area with the fastest speed on the image plane. To obtain the region's bounding box, we introduce an iterative Gaussian fitting algorithm for the object detection step. We also present a moving region retrieval step to make the resulting bounding box as accurate as possible. Next, the moving object's location is tracked with a Kalman filter on the 2D plane, and the object is segmented out of the depth map according to the detection result. Then, we optimize the trajectory of the object by minimizing reprojection residuals. Finally, to validate our estimation, we design a scenario in which a UAV autonomously detects and avoids objects flying towards it.
The rest of this paper is organized as follows: Section IV-A presents our advanced ego-motion compensation algorithm. We then discuss the object detection and tracking methods used in this framework (Section IV-B). In Section IV-C, we present our 3D trajectory estimator fusing the event stream and depth information. Section V describes our real flight experiments and compares our performance with others.
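Let $\mathcal{E} = \{e_i\}$ denote a set of events. We use $e_i = \{\mathbf{x}_i, t_i\}$ to denote an event triggered by an event camera. The symbol $\mathbf{x}_i = (x_i, y_i)$ represents the event's coordinate on the image plane, and $t_i$ denotes the timestamp of the event.
We represent $\mathcal{E}'_{xy}$ as the set of motion-compensated events $e'_i$ (see IV-A) which are projected onto the same pixel $(x, y)$:
$$\mathcal{E}'_{xy} = \{ e'_i \mid \mathbf{x}'_i = (x, y) \}$$
Therefore, the event-count image pixel $I_{xy}$ can be denoted as $I_{xy} = |\mathcal{E}'_{xy}|$, where $|\cdot|$ is the number of elements in the set.
We also define the time-image as $T$. Hence, the pixel in the time-image, represented as $T_{xy}$, can be expressed as the average timestamp of the events triggered at this position, as follows:
$$T_{xy} = \frac{1}{I_{xy}} \sum_{e'_i \in \mathcal{E}'_{xy}} t_i$$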
We name the normalized time-image as normalized mean-time image , which can be computed by the following equation 
We use $(\cdot)^w$ as the world frame and $(\cdot)^b$ as the drone body frame. Notably, we use $(\cdot)^e$ to represent the event camera frame and $(\cdot)^d$ to represent the depth camera frame. Hence, we can represent the transformation of the event camera in the world frame as $\mathbf{T}^w_e$.
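The event-count image and the mean-time image described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mean_time_image` and its argument layout are assumptions, and the events are taken to be already motion-compensated and rounded to integer pixel coordinates.

```python
import numpy as np

def mean_time_image(xs, ys, ts, height, width):
    """Build the event-count image and the mean-time image.

    xs, ys: integer pixel coordinates of (already compensated) events.
    ts: event timestamps, same length as xs and ys.
    """
    xs = np.asarray(xs, dtype=np.intp)
    ys = np.asarray(ys, dtype=np.intp)
    ts = np.asarray(ts, dtype=np.float64)

    count = np.zeros((height, width), dtype=np.float64)
    time_sum = np.zeros((height, width), dtype=np.float64)

    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(count, (ys, xs), 1.0)
    np.add.at(time_sum, (ys, xs), ts)

    # Average timestamp per pixel; pixels without events stay at zero.
    mean_time = np.divide(time_sum, count,
                          out=np.zeros_like(time_sum), where=count > 0)
    return count, mean_time
```

Pixels that received no event are left at zero here; how empty pixels are treated before normalization is a design choice that this sketch does not take from the paper.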
IV Fast-moving Object Detection and Tracking
IV-A Ego-motion Compensation
Events can be triggered either by moving objects or by the ego-motion of the camera. In order to segment objects, events generated by ego-motion (the background) should be filtered out first. In previous works, algorithms for ego-motion compensation are either computationally demanding or not accurate enough. To address this problem, we present a method that balances computational efficiency, accuracy, and robustness by fusing depth and IMU data to compensate for rotational and translational ego-motion. An illustration of this section is shown in Fig. 3.
This section introduces our advanced ego-motion compensation method in two steps: rotational and translational. Before projection, we store the events that arrive within a small time window in an event buffer. We then use IMU data to compute the average angular velocity and the orientation matrix in the world frame over this window.
IV-A1 Rotational Compensation
We apply this compensation step to eliminate events generated by the camera's rotation. After obtaining the average angular velocity, we apply Rodrigues' rotation formula to build the rotation matrix from the relative angle accumulated up to each event's timestamp. Instead of building this matrix for every timestamp, we update it every millisecond to decrease computational costs. Then, we use this rotation matrix to apply a warp field to every event on the image plane, which yields the motion-compensated events in the buffer.
After compensation, we project the compensated events onto a 2D image plane using the event camera's intrinsic matrix.
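The rotational step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the names `rodrigues` and `rotational_compensation` are hypothetical, the warp is taken to be the homography K R K^-1 that rotates each event's bearing back to the start of the window, and the sign convention of the angular velocity is an assumption that depends on the IMU and camera frames actually used.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * kx + (1.0 - np.cos(theta)) * (kx @ kx)

def rotational_compensation(pixels, times, omega_mean, t0, K_cam, step=1e-3):
    """Warp event pixels back to time t0 using the mean angular velocity.

    The rotation matrix is rebuilt once per `step` seconds (1 ms by default)
    instead of once per event, as described in the text.
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    times = np.asarray(times, dtype=np.float64)
    K_inv = np.linalg.inv(K_cam)
    bins = np.floor((times - t0) / step).astype(int)
    warped = np.empty_like(pixels)
    for b in np.unique(bins):
        mask = bins == b
        R = rodrigues(omega_mean * (b * step))
        H = K_cam @ R @ K_inv                      # assumed warp direction
        homog = np.column_stack([pixels[mask], np.ones(mask.sum())])
        out = homog @ H.T
        warped[mask] = out[:, :2] / out[:, 2:3]
    return warped
```

Grouping events into millisecond bins means one matrix product per bin rather than per event, which is the cost saving the text refers to.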
IV-A2 Translational Compensation
We apply this step to eliminate noise generated by the camera's translation.
Previous compensation methods are limited by the lack of depth estimation. Without reliable depth for each pixel, they cannot compensate for ego-translation, which leads to misdetection when drones are flying fast. We solve this problem by leveraging an onboard depth camera.
We first back-project events from the 2D camera plane into 3D using the perspective projection model with homogeneous coordinates. This model relates the event's 3D position in the camera frame to its coordinate on the image plane through the intrinsic matrix. We then transform this point into the world frame, apply the translational compensation matrix, and re-project the result back to the camera frame, which gives the event's compensated position in the camera frame.
The translational compensation matrix is built from the derivative of position over time, with the velocity taken from our odometry estimate. To limit computational cost, we update the velocity and timestamp used to build this matrix every millisecond.
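A minimal sketch of the translational step is given here. It is an illustration under assumptions, not the paper's exact formulation: the function name, the argument layout, and the convention that `R_wc`, `t_wc` give the camera pose in the world frame at the start of the window are all hypothetical, rotation is assumed to be already compensated, and the per-millisecond update of the matrix is omitted for brevity.

```python
import numpy as np

def translational_compensation(pixels, depths, times, t0,
                               velocity_w, R_wc, t_wc, K_cam):
    """Remove the image motion caused by camera translation.

    pixels: (N, 2) event pixels, depths: (N,) depth per event in meters,
    velocity_w: camera velocity in the world frame (from odometry),
    R_wc, t_wc: assumed camera-to-world rotation and position at time t0.
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    dts = np.asarray(times, dtype=np.float64) - t0

    # Back-project each pixel to a 3D point in the camera frame.
    homog = np.column_stack([pixels, np.ones(len(pixels))])
    P_c = depths[:, None] * (homog @ np.linalg.inv(K_cam).T)

    # Camera position when each event fired (constant-velocity assumption).
    cam_pos = t_wc + dts[:, None] * velocity_w

    # Camera frame -> world frame, then back into the camera frame at t0.
    P_w = P_c @ R_wc.T + cam_pos
    P_ref = (P_w - t_wc) @ R_wc

    proj = P_ref @ K_cam.T
    return proj[:, :2] / proj[:, 2:3]
```

With this convention, events from a static scene point land on the same pixel regardless of when they fired inside the window, so they contribute a low mean timestamp spread and can be thresholded away.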
IV-B Object Detection
IV-B1 Dynamic Obstacle Segmentation
After ego-motion compensation, we propose a thresholding method to filter out the background in the normalized mean-time image. Instead of using a fixed threshold, we design an adaptive one that depends on the magnitudes of the angular and linear velocities, weighted by tunable parameters. We use this threshold to separate objects from the background, which gives the image after motion segmentation.
Compared to the previous approach, our method preserves more information about the moving object and so decreases the possibility of missed detection.
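The adaptive thresholding idea can be illustrated with a short sketch. The exact form of the threshold is not recoverable from the text, so the affine form below, the parameter names `a`, `b`, `c`, and their default values are all assumptions chosen only for illustration; likewise, keeping the original pixel values above the threshold (rather than binarizing) is one plausible reading of "preserving more information".

```python
import numpy as np

def adaptive_threshold(omega, velocity, a=0.1, b=0.05, c=0.2):
    """Hypothetical threshold that grows with the ego-motion magnitude."""
    return a * np.linalg.norm(omega) + b * np.linalg.norm(velocity) + c

def segment(norm_mean_time, threshold):
    """Keep pixel values above the threshold and zero out the background."""
    img = np.asarray(norm_mean_time, dtype=np.float64)
    return np.where(img > threshold, img, 0.0)
```

For example, `segment(image, adaptive_threshold(omega, velocity))` raises the cut-off when the drone rotates or translates quickly, which is when residual background events are most likely.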
IV-B2 Iterative Gaussian Fitting
After dynamic obstacle segmentation, the image is composed of moving objects and background noise. Commonly, the patterns that make up the moving object have a relatively high mean timestamp. However, some pre-processing is still needed to improve the fitting. First, a mean filter and morphological operations are used to eliminate salt-and-pepper noise. Next, an element-wise square is applied to further enhance the image contrast. After these pre-processing steps, Algorithm 1 is used to extract the moving object. The initial ROI is described by its center point and its initial side lengths. In this work, the center is initialized at the pattern with the highest mean timestamp, and the initial side lengths are set to a fixed fraction of the image width and height. After that, the optimal center and side lengths are computed through an iterative Gaussian fitting process. Finally, the initial ROI converges to the optimal ROI.
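Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of one way an iterative Gaussian fit of an ROI could work. The update rule (intensity-weighted mean and standard deviation inside the current ROI, half side lengths set to `k` standard deviations), the initial fraction `init_frac`, and the stopping tolerance are assumptions, not the authors' parameters.

```python
import numpy as np

def iterative_gaussian_roi(img, init_frac=0.25, k=3.0, max_iter=20, tol=0.5):
    """Return (cx, cy, width, height) of an ROI fitted around the brightest blob."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    # Start at the pixel with the highest mean timestamp.
    cy, cx = (float(v) for v in np.unravel_index(np.argmax(img), img.shape))
    half_w, half_h = 0.5 * init_frac * w, 0.5 * init_frac * h

    for _ in range(max_iter):
        x0, x1 = int(max(cx - half_w, 0)), int(min(cx + half_w, w - 1))
        y0, y1 = int(max(cy - half_h, 0)), int(min(cy + half_h, h - 1))
        patch = img[y0:y1 + 1, x0:x1 + 1]
        total = patch.sum()
        if total <= 0:
            break
        ys, xs = np.mgrid[y0:y1 + 1, x0:x1 + 1]
        # Intensity-weighted mean and standard deviation inside the ROI.
        mx = (patch * xs).sum() / total
        my = (patch * ys).sum() / total
        sx = np.sqrt((patch * (xs - mx) ** 2).sum() / total)
        sy = np.sqrt((patch * (ys - my) ** 2).sum() / total)
        new = np.array([mx, my, max(k * sx, 1.0), max(k * sy, 1.0)])
        shift = np.abs(new - np.array([cx, cy, half_w, half_h])).max()
        cx, cy, half_w, half_h = new
        if shift < tol:          # ROI has stopped moving: converged
            break
    return float(cx), float(cy), 2.0 * float(half_w), 2.0 * float(half_h)
```

The loop stops once the center and side lengths change by less than `tol` pixels; a run that hits `max_iter` without meeting this criterion would correspond to the fail-converged case handled by a separate retrieval step.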
IV-B3 Moving Region Retrieval
In most cases, the contour of the moving object can be extracted accurately and completely. However, in some cases, especially when the moving object is very small on the image plane, the ROI can fail to converge. We therefore search the connected components inside a fail-converged ROI to find the region that is most likely to be the moving object. After this operation, the moving object is extracted accurately in the majority of cases.
IV-C Object Trajectory Estimation
Once the moving objects are detected, a 3D trajectory can be predicted. However, the two sensors sample at different frequencies, so synchronization can be challenging. Inspired by Su et al., we propose an optimization-based trajectory estimation and prediction system that fuses event and depth observations without synchronous sampling.
An illustration of this section can be seen in Fig. 4.
This section introduces our entire system in three parts: (1) 2D object correspondence and tracking, (2) object segmentation in the depth plane, and (3) a 3D trajectory estimator fusing event and depth observations.
IV-C1 2D Object Correspondence and Tracking
To estimate the object trajectory from detection results, we need to know which object each result corresponds to. When a new detection arrives, the algorithm checks its time discrepancy and position deviation from the previous detection. After the correspondence is determined, the object is tracked by an Extended Kalman Filter (EKF) with a linear constant acceleration model on the 2D camera plane. The EKF is updated with the object's central position on the 2D image plane to estimate its position, velocity, and acceleration on that plane.
There are two reasons why we use this Extended Kalman Filter. First, due to the event camera's limitations, the appearance of objects on the event plane is neither stable nor accurate, which makes the objects' 2D positions noisy under event observation. The EKF can filter this noise. Second, misdetection may occur, and the EKF can predict the object position in this case.
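The correspondence check and the constant-acceleration filter can be sketched as below. Because the motion model is linear, the sketch uses the standard linear Kalman filter equations. The class and function names, the gating thresholds, and all noise values are hypothetical and would need tuning on real data.

```python
import numpy as np

def is_same_object(prev_xy, prev_t, xy, t, max_dt=0.05, max_dist=30.0):
    """Gate a new detection by time gap and pixel distance (assumed thresholds)."""
    dt = t - prev_t
    dist = np.hypot(xy[0] - prev_xy[0], xy[1] - prev_xy[1])
    return bool(0.0 <= dt <= max_dt and dist <= max_dist)

class ConstAccelKF2D:
    """Kalman filter with state [px, vx, ax, py, vy, ay] on the image plane."""

    def __init__(self, xy, meas_var=4.0, process_var=1e3):
        self.x = np.array([xy[0], 0.0, 0.0, xy[1], 0.0, 0.0])
        self.P = np.diag([meas_var, 1e6, 1e8] * 2)   # weak prior on v and a
        self.R = np.eye(2) * meas_var
        self.q = process_var
        self.H = np.zeros((2, 6))
        self.H[0, 0] = 1.0
        self.H[1, 3] = 1.0

    def predict(self, dt):
        f = np.array([[1.0, dt, 0.5 * dt * dt],
                      [0.0, 1.0, dt],
                      [0.0, 0.0, 1.0]])
        F = np.kron(np.eye(2), f)                    # same model for x and y
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.q * dt * np.eye(6)
        return self.x[[0, 3]]                        # predicted pixel position

    def update(self, xy):
        y = np.asarray(xy, dtype=np.float64) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        gain = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + gain @ y
        self.P = (np.eye(6) - gain @ self.H) @ self.P
```

When a detection is missed, calling `predict` without a following `update` gives the extrapolated position, which matches the second reason for using the filter stated above.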
IV-C2 Object Segmentation in the Depth Plane
A 3D trajectory is required for motion planning, but the tracking method only provides a 2D position. Therefore, perceiving the object's depth is essential. We solve this with a semi-tightly coupled depth segmentation: we use the detection results of the event camera to assist the depth camera segmentation, which decreases computation and processing latency. The whole process is described below.
First of all, the depth map from the depth camera is registered to the event camera according to the intrinsic and extrinsic matrices of the event and depth cameras.
In practice, because the depth and event cameras produce data at different frequencies, the object position after projection may be slightly biased. We therefore enlarge the bounding box by a factor of two and use it as the ROI.
After the previous steps, the approximate location of the object has been determined. We assume that the most dangerous obstacle is the one closest to us, so the object can be separated by taking the nearest peak in the histogram of the depth map. The process is shown in Fig. 5.
To improve the system robustness, we compute the mean and variance of segmented depth pixels. If the variance is too high, it might mean that these pixels belong to the background, which should not be considered. Otherwise, these pixels belong to the object, and we average the value to represent the camera and the object’s distance.
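The nearest-peak selection and the variance check can be sketched as follows. This is an illustration only: the function name, bin width, minimum peak size, variance limit, and maximum range are hypothetical values, and the ROI is assumed to be a depth patch in meters that has already been registered to the event camera.

```python
import numpy as np

def segment_nearest_depth(depth_roi, bin_width=0.2, min_count=20,
                          max_var=0.05, max_range=20.0):
    """Return the mean depth of the nearest histogram peak, or None."""
    d = np.asarray(depth_roi, dtype=np.float64).ravel()
    d = d[np.isfinite(d) & (d > 0.0) & (d < max_range)]   # drop invalid pixels
    if d.size == 0:
        return None

    edges = np.arange(0.0, max_range + bin_width, bin_width)
    hist, edges = np.histogram(d, bins=edges)

    # Scan from near to far and stop at the first sufficiently large local maximum.
    for i in range(len(hist)):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if hist[i] >= min_count and hist[i] >= left and hist[i] >= right:
            lo = edges[max(i - 1, 0)]
            hi = edges[min(i + 2, len(edges) - 1)]
            pix = d[(d >= lo) & (d < hi)]
            # A large spread suggests background rather than a compact object.
            if pix.var() > max_var:
                return None
            return float(pix.mean())
    return None
```

Returning `None` lets the caller skip the depth observation for that frame instead of feeding a background depth into the trajectory estimator.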
IV-C3 3D Trajectory Estimator Fusing Event and Depth
Although the 2D position of the object and its depth have been estimated separately, the event camera is faster than the depth camera, so we cannot directly associate the object depth with the 2D position on the camera plane. Inspired by Su et al., we propose a 3D optimization-based trajectory estimator that fuses 2D position and depth (see Fig. 4). The most significant difference is that we fuse the depth residual into the optimization framework. We make two assumptions. First, the drone knows the earth's gravity vector. Second, the object is in free fall, and air resistance is ignored. We describe the trajectory as Equation 11. Given the initial 3D position $\mathbf{p}_0$, the initial velocity $\mathbf{v}_0$, the gravity vector $\mathbf{g}$, and the start time $t_0$, we can predict the object's 3D position at any time $t$ as
$$\mathbf{p}(t) = \mathbf{p}_0 + \mathbf{v}_0 (t - t_0) + \frac{1}{2} \mathbf{g} (t - t_0)^2$$
From object correspondence, the time when the object first appeared can be measured. We therefore only need to obtain the initial 3D position and velocity to represent the whole trajectory.
The initial position and velocity are estimated through nonlinear optimization by minimizing the depth residual and the reprojection error of the event observations. Adding the depth residual increases the number of observations in the same period, so convergence is faster and the system is significantly more robust than the monocular method.
At each event observation time, we detect and track the object in the event camera, and the predicted object position in the camera frame can be written as Equation 12, using the rotation matrix and translation vector from the world frame to the event camera frame at that time. Together with the object position on the event camera plane obtained from 2D tracking, the reprojection residual can be written as Equations 12 and 13. In Equation 13, we assume a pinhole camera model, but this model can be changed according to the actual lens.
Similarly, the depth residual is expressed as Equations 14 and 15, using the rotation and translation from the world frame to the depth camera frame at each depth observation time and the depth observed by the depth camera at that time. Note that the event and depth observation times are independent of each other and generally differ.
The problem can then be formulated as the nonlinear optimization problem in Equation 16 to obtain the trajectory parameters. For better robustness, we use the Huber loss.
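The estimator described above can be sketched with an off-the-shelf solver. This is a minimal illustration under assumptions, not the authors' implementation: it uses `scipy.optimize.least_squares` with its built-in Huber loss, a pinhole model, and hypothetical names and data layout in which each observation carries its own world-to-camera rotation and translation. Event and depth observations keep their own timestamps, so no synchronization is needed.

```python
import numpy as np
from scipy.optimize import least_squares

def predict_position(p0, v0, g, t0, t):
    """Free-fall model: position at time t from the initial state at t0."""
    dt = t - t0
    return p0 + v0 * dt + 0.5 * g * dt * dt

def residuals(theta, t0, g, K, event_obs, depth_obs):
    """Stack reprojection residuals (pixels) and depth residuals (meters).

    event_obs: list of (t, R_cw, t_cw, uv) with the tracked pixel uv.
    depth_obs: list of (t, R_dw, t_dw, depth) with the measured depth.
    """
    p0, v0 = theta[:3], theta[3:]
    res = []
    for t, R_cw, t_cw, uv in event_obs:
        pc = R_cw @ predict_position(p0, v0, g, t0, t) + t_cw
        res.append(K[0, 0] * pc[0] / pc[2] + K[0, 2] - uv[0])
        res.append(K[1, 1] * pc[1] / pc[2] + K[1, 2] - uv[1])
    for t, R_dw, t_dw, depth in depth_obs:
        pd = R_dw @ predict_position(p0, v0, g, t0, t) + t_dw
        res.append(pd[2] - depth)
    return np.array(res)

def estimate_trajectory(theta_init, t0, g, K, event_obs, depth_obs):
    """Estimate the initial position and velocity with a Huber-robust fit."""
    sol = least_squares(residuals, theta_init, loss="huber", f_scale=1.0,
                        args=(t0, g, K, event_obs, depth_obs))
    return sol.x[:3], sol.x[3:]
```

Pixel and meter residuals are mixed without weighting in this sketch; in practice the two residual types would likely need relative weights, which are not specified here.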
V Experiment and Evaluation
V-A Implementation Details
We present our real-world experiment (see Section V-D) on a modified flight platform carrying an iniVation DVXplorer dynamic vision sensor and an Intel RealSense D435i depth camera. A DJI Manifold2-C computer running Ubuntu 16.04 is mounted on our UAV for onboard computation. We use a CUAV Nora Autopilot Flight Controller running the PX4 flight stack. To alleviate disturbance from the motion capture system's infrared light on the dynamic vision sensor, we add an infrared filter on the lens surface of the DVXplorer camera. The overall weight (including LiPo battery and propellers) is 1.99 kg, with dimensions being mm. An overview of our flight platform can be seen in Fig. 6.
V-B Evaluation of Ego-motion Compensation
To demonstrate the robustness of our method, we put our system into highly dynamic scenarios, where the UAV flies at a speed of over 5 m/s. Three algorithms are applied in two scenes, one with no moving object and another with an object that moves at over 10 m/s (see TABLE I). For efficiency, the process should have low time consumption. For accuracy, the contrast between the moving object and the background should be as high as possible, which is critical for detection algorithms. We call this the relative contrast. To derive it, we manually mark the bounding box of the moving object in a motion-compensated image (marked in green in Fig. 7(a) on the RGB frame) and treat the rest of the image as background. The relative contrast is then defined over these two regions.
Because the sensors are imperfect, noise is introduced. It is meaningless to compute the relative contrast on a very noisy image: noisy pixels can carry any timestamp from the oldest to the newest, so they are more likely than the moving object or the background to be selected when computing the relative contrast. Therefore, we apply the same denoising to the images produced by all three ego-motion compensation methods before computing the relative contrast. In addition, the relative contrast can only be computed in Scene 2 because Scene 1 does not contain moving objects.
Usually, these indicators cannot all be met at the same time, so a trade-off between performance and efficiency is indispensable. In this work, we sacrifice a little efficiency while preserving real-time performance. Table I lists the results of the comparison of our method against the optimization-based method of Mitrokhin et al. and the IMU-based method of Falanga et al. The table indicates that our method is much faster than the former while being slower than the latter by several milliseconds. At the same time, the mean value and variance of our output image are closer to those of the optimization-based method than those of the IMU-based method are. Moreover, our method has the highest contrast between the moving object and the background, which facilitates object detection.
| Scene 1 | BetterFlow | 1173.2 | 10620.4 | 32461.9 | - | - | - | No | Yes |
| Scene 2 | BetterFlow | 554.2 | 16587.6 | 110559.0 | 17.7 | 27.8 | 36.1 | No | Yes |
V-C Evaluation of Trajectory Estimation
To validate the accuracy of the estimated trajectories, we compare our fusion method with the monocular method in the same scenarios. Our ground truth is provided by a Vicon motion capture system. We run the two estimation algorithms in two different scenarios, one with the drone flying fast forward and the other with the drone swinging forward. The drone flies at 2 m/s, and a ball is thrown at about 12 m/s from one side to the other in both scenes. The detection and data association modules are identical for both methods. Due to the fast motion of the ball, there are seven detections on the event camera and three segmentations of the depth map in 0.18 seconds in the forward scenario. The swinging scenario lasts for 0.16 seconds, with six detections on the event camera and three on the depth camera.
Fig. 8 compares the fusion of the two cameras with a monocular event camera configured as in Su et al. It shows that the result of fusing the two sensors is clearly superior to that of the monocular event camera alone. In both scenarios, the trajectories estimated by the monocular method point in the opposite direction along the x-axis. This is mainly due to the small number of detections and the lack of ground-truth depth. We compute the APE (Absolute Pose Error) of the estimated trajectory with respect to the reference. The detailed result is shown in TABLE II. This comparison demonstrates that the accuracy of our method is much higher in these fast scenarios.
V-D Real-world Experiment
We present several ball-throwing experiments with onboard sensors, both indoors and outdoors, with larger and smaller balls, in bright (240 to 1100 lux) and dim (8 to 10 lux) conditions, and with a swinging or moving UAV. The main goal of these experiments is to validate our onboard detection, tracking, and 3D trajectory estimation system in the dynamic object avoidance scenario.
In the first experiment, a ball of unknown size is thrown at a hovering UAV, which moves upward after detecting the ball to avoid the collision. A ball with a diameter of 21 cm was thrown from a distance of 8 to 10 meters at speeds from 7.0 to 12.0 m/s, as shown in Fig. 1. In this experiment, background events are triggered by in-place vibration and rotation. We repeated this experiment in several scenes to evaluate the performance under different ambient illumination levels.
In the second experiment, the UAV dodges a thrown ball while flying forward. The ball was thrown from the same position and at the same speed as in the first experiment. Unlike in that experiment, plenty of background events were triggered by the UAV's translational motion, making the moving ball harder to detect. Our system successfully handles this challenge (please refer to our attached video).
In this paper, we present a novel perception system for solving dynamic object avoidance problems. It achieves computationally efficient yet accurate motion compensation for event-based object detection. It also presents a robust 3D trajectory estimator leveraging both event and depth data. The system has been tested in real-world experiments to demonstrate its advantages.
Nevertheless, there is still room for improvement in some aspects. Integrating an avoidance algorithm based on motion planning with our perception system is one of the most promising improvements. In this way, a carefully generated trajectory could account for both static and dynamic scenes as well as avoidance performance and flight smoothness.
- (2016) ELiSeD - An event-based line segment detector. In 2016 Second International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), pp. 1–7. Cited by: §II.
- (2015) On event-based optical flow detection. Frontiers in Neuroscience 9. Cited by: §II.
- (2020-03-18) Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics 5 (40), pp. eaaz9712. Cited by: §I, §II, §II, §II, §III-B, §IV-A1, §IV-A2, §IV-A, §IV-B1, Fig. 7, 7(c), 7(g), §V-B, TABLE I.
- (2020) Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), pp. 1–1. Cited by: §II.
- A Unifying Contrast Maximization Framework for Event Cameras, with Applications to Motion, Depth, and Optical Flow Estimation. Proc. of the IEEE Intl. Conf. on Pattern Recognition (CVPR), pp. 3867–3876. Cited by: §II, §II.
- Multiple view geometry in computer vision. Cambridge University Press, Cambridge (eng). Cited by: §IV-A2.
- (2004) Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92 (3), pp. 401–422. Cited by: §IV-C1.
- Asynchronous Event-Based Multikernel Algorithm for High-Speed Visual Features Tracking. IEEE Transactions on Neural Networks and Learning Systems 26 (8), pp. 1710–1720. Cited by: §II.
- (2019) Robust event-based object tracking combining correlation filter and cnn representation. Frontiers in Neurorobotics 13, pp. 82. Cited by: §II.
- (2019-11) EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras. In Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst. (IROS), pp. 6105–6112. Cited by: §IV-A.
- (2018) Event-Based Moving Object Detection and Tracking. In Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst., pp. 1–9. Cited by: §II, §II, §III-B, §IV-A1, §IV-A, Fig. 7, 7(b), 7(f), §V-B, TABLE I.
- (2012) Spatiotemporal multiple persons tracking using dynamic vision sensor. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 35–40. Cited by: §II.
- (1978) Quaternion from rotation matrix. Journal of Guidance and Control 1 (3), pp. 223–224. Cited by: §IV-A1.
- (2017) Catching a Flying Ball with a Vision-Based Quadrotor. In 2016 International Symposium on Experimental Robotics, D. Kulić, Y. Nakamura, O. Khatib, and G. Venture (Eds.), Springer Proceedings in Advanced Robotics, Vol. 1, pp. 550–562. Cited by: §I, §II, §IV-C3, §IV-C, Fig. 8, §V-C, §V-C, TABLE II.
- (2021-02-27) (Website). Cited by: §VI.
- (2021) EGO-planner: an esdf-free gradient-based local planner for quadrotors. IEEE Robotics and Automation Letters 6 (2), pp. 478–485. Cited by: §VI.
- (2020) (Website). Cited by: §II.
- (2018) Semi-dense 3d reconstruction with a stereo event camera. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 235–251. Cited by: §II.
- (2021-02-22) (Website). Cited by: §II.
- (2017-05) Event-based feature tracking with probabilistic data association. In Proc. of the IEEE Intl. Conf. on Robot. and Autom. (ICRA), pp. 4465–4470. Cited by: §II.