
PL-EVIO: Robust Monocular Event-based Visual Inertial Odometry with Point and Line Features

Event cameras are motion-activated sensors that capture pixel-level illumination changes instead of intensity frames at a fixed rate. Compared with standard cameras, they provide reliable visual perception during high-speed motion and in high-dynamic-range scenarios. However, event cameras output little information, or even only noise, when the relative motion between camera and scene is small, for example when the camera is still. Standard cameras, by contrast, provide rich perception in most scenarios, especially under good lighting. The two sensors are therefore complementary. In this paper, we propose a robust, highly accurate, real-time optimization-based monocular event-based visual-inertial odometry (VIO) method that fuses event-corner features, line-based event features, and point-based image features. The proposed method leverages point-based features in natural scenes and line-based features in human-made scenes to provide additional structural constraints through well-designed feature management. Experiments on public benchmark datasets show that our method outperforms state-of-the-art image-based and event-based VIO. Finally, we demonstrate our method in an onboard closed-loop autonomous quadrotor flight and in large-scale outdoor experiments. Videos of the evaluations are presented on our project website.




I Introduction

I-A Motivations

State estimation is a fundamental topic in robotics, underpinning Simultaneous Localization and Mapping (SLAM), Visual Odometry (VO), Visual-Inertial Odometry (VIO), and related tasks. Recently, the event camera, also called the Dynamic Vision Sensor (DVS), has offered exciting opportunities for state estimation and perception. Event cameras have many advantages over standard cameras, including high temporal resolution (μs-level), high dynamic range (HDR, 140 dB), and no motion blur [GWPHKU:EVENT-SURVEY]. An event camera only captures information when the brightness of an individual pixel changes; it generates no events when the camera is still or the scene brightness is constant. Standard cameras, in contrast, conveniently provide the intensity values of the scene. Although they may struggle during high-speed motion or in HDR scenarios, they provide rich information about the environment most of the time (low-speed motion and good lighting).

Observing this complementarity, we propose in this paper a monocular VIO framework that fuses events, images, and inertial measurement unit (IMU) data with well-designed feature management. Our framework includes a purely event-based VIO (EIO) and an event-and-image-based VIO (EVIO). More specifically, we first apply a motion compensation algorithm that uses IMU data to correct the motion (both rotation and translation) of each event according to its individual timestamp, warping all events to a common timestamp (e.g., that of the first event in the stream). The motion-corrected event streams are then used for event-based feature detection and tracking.

Fig. 1: Our PL-EVIO Pipeline Combines Events, Images, and IMU to Provide Robust State Estimation, and can Run Onboard in an Autonomous Quadrotor with Limited Computational Resource. Bottom Left: Event-corner Features in the Event. Bottom Middle: Line-based Features in the Event. Bottom Right: Point-based Features in the Image.

Moreover, most state-of-the-art image- and event-based SLAM/VO/VIO systems rely on point-based features. However, human-made structures are built with regular geometric shapes such as lines and planes, so point-based features are not necessarily the best representation for visual tracking in all scenarios; they are better suited to natural scenes. Therefore, to exploit heterogeneous event-based visual information, we design and extract line-based features from the event stream to improve the performance of purely point-based EIO and EVIO, since line-based features capture more geometric structure than point-based ones [GWPHKU:PL-SLAM] [lim2021avoiding] [GWPHKU:PL-VINS]. As can be seen in Fig. 1 and Fig. 3, integrating line-based and point-based features ensures a more uniform distribution of features and provides additional constraints on scene structure. Building on the line segment detector (LSD) algorithm [GWPHKU:LSD], we efficiently extract line-based features and descriptors from the event-based representation, which highlights the strong edges of the scene.

Finally, by integrating the aforementioned line-based features, we propose PL-EVIO, whose visual landmarks include event-corner features, line-based event features, and point-based image features. Point-based and line-based visual residuals from the event camera, point-based visual residuals from the standard camera, and the residual from IMU pre-integration are tightly fused in a keyframe-based graph optimization framework. The point-based event-corner features provide reliable perception during high-speed motion and in HDR scenarios; the point-based image features offer rich visual information under good lighting; and the line-based event features better represent geometric structure constraints where point-based features are poor. These three kinds of features are carefully designed and integrated to leverage additional structural constraints for more accurate and robust state estimation.

I-B Contributions

Our contributions are summarized as follows:

  1. PL-EIO (Event+IMU): We design a line-based feature and descriptor for event-based representations in front-end incremental estimation. Based on our previous EIO pipeline [GWPHKU:MyEVIO], we extend it to include line-based event features alongside event-corner point features, termed PL-EIO, to perform robust state estimation. Furthermore, we improve PL-EIO with a motion compensation algorithm that corrects the motion of the event streams.

  2. PL-EVIO (Event+Image+IMU): Finally, we integrate image-based measurements into our PL-EIO framework, termed PL-EVIO, which tightly integrates the event-corner features, line-based event features, and point-based image features. We show that PL-EVIO achieves state-of-the-art performance on different challenging datasets and provides reliable performance for onboard aggressive quadrotor flight.

The remainder of the paper is organized as follows: Section II introduces related works. Section III describes the principle of our proposed method. Section IV presents the experiments and results. Finally, conclusions are given in Section V.

II Related Works

II-A Event-based Representation and Feature Extraction

An event is triggered only when the log-intensity change of an individual pixel exceeds a specific threshold $C$, and it can be represented by the spatio-temporal coordinates of the intensity change and its sign:

$$e_k = (\mathbf{x}_k, t_k, p_k), \qquad p_k \in \{+1, -1\}$$

where $t_k$ is the timestamp at which the intensity of pixel $\mathbf{x}_k = (x_k, y_k)$ changes, and $p_k$ is the polarity indicating the direction of the intensity change. This generation model endows the event stream with useful properties that confer robustness to vision-based localization in challenging scenarios. However, adopting event cameras in SLAM/VO/VIO is very challenging, since the asynchronous event stream is fundamentally different from synchronous image data, so most methods and concepts developed for conventional image-based cameras cannot be applied directly. To cast asynchronous event data into synchronous representations, several kinds of event representation have been proposed:

The first is to work directly on the raw event stream without any frame-like accumulation. Ref. [GWPHKU:ACE] proposes a feature tracker that employs descriptors for event data. Ref. [GWPHKU:zhu2017event] presents a feature tracker based on Expectation Maximization (EM). Refs. [GWPHKU:PowerLine] [GWPHKU:IDOL] extract line features from the raw asynchronous events.

The second is to combine events with an image sensor, or to generate intensity images from events through learning-based methods. Refs. [GWPHKU:kueng2016low] [GWPHKU:EKLT] first detect features on the grayscale image frames and then track them asynchronously using the event stream.

The third is the motion-compensated event image, or edge image, generated by aggregating a group of neighboring events within a spatio-temporal window. Refs. [GWPHKU:ETH-EVIO] [GWPHKU:Ultimate-SLAM] adopt conventional corner detectors, such as FAST [fast_corner_detection] or Shi-Tomasi [Shi-Tomasi], for feature detection, and Lucas-Kanade (LK) optical flow [LK_optical_flow] for feature tracking on the event image.

The last is the time surface (TS), or Surface of Active Events (SAE), a 2D map in which each pixel stores a time value. It can summarize and update the event stream at any given instant and encode the spatio-temporal constraints of historical events. Using an exponential decay kernel, the TS emphasizes recent events over past events [GWPHKU:TS]. If $t_{last}(\mathbf{x})$ is the timestamp of the last event at pixel coordinate $\mathbf{x}$, the TS at time $t$ is defined by:

$$\mathcal{T}(\mathbf{x}, t) = \exp\left(-\frac{t - t_{last}(\mathbf{x})}{\tau}\right)$$

where $\tau$ is the decay rate parameter. Refs. [GWPHKU:Harris-event-corner] [GWPHKU:ARC*] use the SAE or TS to inspect previously triggered events in the stream and the adjacent pixels to classify a new event as an event corner.
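The exponential-decay time surface above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `sae` is assumed to hold the last-event timestamp per pixel (with `-inf` for pixels that have never fired), and `tau` is the decay rate.

```python
import numpy as np

def time_surface(sae: np.ndarray, t_now: float, tau: float = 0.03) -> np.ndarray:
    """Exponential-decay time surface: each pixel of `sae` holds the timestamp
    of the most recent event there (-inf if none). Recent events map to values
    near 1; stale pixels decay toward 0."""
    ts = np.exp(-(t_now - sae) / tau)
    ts[~np.isfinite(sae)] = 0.0   # pixels that never fired stay at zero
    return ts
```

Pixels that fired just before `t_now` thus dominate the map, which is what makes the TS suitable as a tracking and matching surface for recent scene edges.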

Fig. 2: The Framework of Our PL-EIO

II-B Event-based Motion Estimation

Event-based state estimation has gained significant research interest for SLAM/VO/VIO in challenging scenarios where traditional cameras are compromised, such as high-speed motion estimation and HDR perception. Ref. [GWPHKU:first-event-SLAM] proposed the first event-based SLAM system, which is limited to tracking planar motions while reconstructing a 2D ceiling map with an upward-looking event camera. Refs. [GWPHKU:censi2014low] and [GWPHKU:kueng2016low] proposed event-based VO to track camera motion; however, these methods still relied on a standard camera and thus remained susceptible to motion blur and low dynamic range. The first purely event-based 6-DoF (degree-of-freedom) VO was presented in [GWPHKU:kim2016real], which performed real-time event-based SLAM through three decoupled probabilistic filters that jointly estimate the 6-DoF camera pose, a 3D map of the scene, and image intensity. However, it is computationally expensive and requires a GPU for real-time performance. EVO [GWPHKU:EVO] solves the SLAM problem without recovering image intensity, reducing computational complexity so that it runs in real time on a standard CPU; it tracks via image-to-model alignment and adopts the 3D reconstruction method of EMVS [GWPHKU:EMVS] for mapping. However, EVO needs to bootstrap in a scene that is planar with respect to the sensor, for up to several seconds. ESVO [GWPHKU:ESVO] is the first stereo event-based VO method, following a parallel tracking-and-mapping scheme to estimate the ego-motion and a semi-dense 3D map of the scene; however, it barely operates in real time on a DAVIS346 (346*240) and is limited by a rigorous yet unreliable initialization. Ref. [GWPHKU:Feature-based-ESVO] proposed feature-based stereo VO for event cameras: pose estimation is done by re-projection error minimization, while features are matched across stereo and time through the consecutive left and right event TSs. It solves the above problems of ESVO, but still cannot operate in real time on high-resolution event cameras (640*480).

The robustness of event-based SLAM/VO systems can be improved by fusing IMU measurements. The first EIO method was proposed in Ref. [GWPHKU:Event-based-visual-inertial-odometry], which fuses a purely event-based tracking algorithm with pre-integrated IMU measurements through an Extended Kalman Filter. Another EIO method was proposed in [GWPHKU:ETH-EVIO]: it detects and tracks features on the edge image generated from motion-compensated event streams, using traditional image-based feature detection and tracking, and combines the tracked features with IMU measurements in keyframe-based nonlinear optimization. The authors extended their method to leverage the complementary advantages of standard and event cameras in Ultimate-SLAM [GWPHKU:Ultimate-SLAM], fusing event frames, standard frames, and IMU. To some extent, these methods rely on the edge image to realize VIO, which can introduce bottlenecks when the number of events surges. EKLT-VIO [EKLT-VIO] combined the event-based tracker [GWPHKU:EKLT] as the front end with a filter-based back end to perform EVIO on Mars-like sequences; however, it can hardly run in real time even on the lowest-resolution event cameras. Ref. [GWPHKU:Continuous-time-visual-inertial-odometry-for-event-cameras] proposed fusing events and IMU measurements in a continuous-time framework, but cannot achieve real time because of the expensive optimization required to update the spline parameters upon receiving every event [GWPHKU:ETH-EVIO]. IDOL [GWPHKU:IDOL] investigates line features in an EIO framework by directly using asynchronous raw events without any frame-like accumulation; however, because of its computational complexity, it is not real-time capable even on low-resolution event cameras (240*180). In our previous work [GWPHKU:MyEVIO], we proposed a monocular EIO that tightly fuses event-corner features with IMU measurements to provide real-time 6-DoF state estimation even on high-resolution event cameras. Furthermore, this EIO framework can bootstrap from unknown initial states and ensures global consistency thanks to its loop closure function.

III Methodology

Fig. 3: Three Different Kinds of Features in our PL-EVIO framework: Event-corner Feature, Event-based Line Feature, and Image-based Point Features

III-A Framework Overview

The structure of our proposed PL-EIO is illustrated in Fig. 2 and is composed of two sections. The EIO front-end takes the motion-compensated event stream as input and extracts the event-corner features and the line-based event features. Two kinds of event representations, the TS with polarity and the normalized TS without polarity, are generated from the SAE for point and line feature tracking and for loop closure detection, respectively; their definitions can be found in our previous work [GWPHKU:MyEVIO]. The EIO back-end tightly fuses the point landmarks, line landmarks, and IMU pre-integration to estimate the 6-DoF state, while loop closure eliminates accumulated drift. Finally, to achieve low latency, we directly forward-propagate (loosely coupled) the latest estimate with the IMU measurements to produce IMU-rate state outputs of up to 1000 Hz, which meets the requirements of closed-loop autonomous quadrotor flight.

A keyframe in the sliding window is selected by two criteria, based only on the event-corner features:

  • the average parallax of the tracked event-corner features between two consecutive timestamps exceeds a threshold (10 pixels in our experiments);

  • the number of event-corner features successfully tracked from the last timestamp falls below a certain threshold (20 in our experiments).
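The two keyframe criteria can be sketched as a small decision function. This is an illustrative sketch, not the paper's code; `prev_pts`/`curr_pts` are assumed to be the pixel coordinates of the same tracked features in consecutive timestamps.

```python
import numpy as np

def is_keyframe(prev_pts: np.ndarray, curr_pts: np.ndarray,
                parallax_thresh: float = 10.0, min_tracked: int = 20) -> bool:
    """Keyframe test on tracked event-corner features (shape (N, 2), pixels):
    trigger on tracking loss or on large average parallax."""
    if len(curr_pts) < min_tracked:          # criterion 2: too few survivors
        return True
    parallax = np.linalg.norm(curr_pts - prev_pts, axis=1).mean()
    return parallax > parallax_thresh        # criterion 1: large motion
```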

For the initialization of our framework, adapted from [GWPHKU:VINS-MONO] and [GWPHKU:VINS-MONO-initialization], our PL-EIO/EVIO starts with vision-only structure from motion (SfM) to build an up-to-scale structure of camera poses and event-corner feature positions. By loosely aligning the SfM with the pre-integrated IMU measurements, the system can bootstrap from unknown initial states. It is worth mentioning that, when images are available, we use only the point-based image measurements for SfM initialization, to obtain reliable visual-inertial alignment and up-to-scale camera poses.

As for loop closure, extra event corners are detected in the EIO front-end, described by BRIEF descriptors, and fed to the back-end. These additional event-corner features improve the recall rate of loop detection. Our designed normalized TS without polarity, which responds to scenes with strong edges, helps eliminate accumulated drift and ensures global consistency. Correspondences are found by BRIEF descriptor matching using the Hamming distance. When the number of event-descriptor correspondences exceeds a certain threshold (16-25 in our experiments), a loop closure is detected. After a loop is detected, the connection residual between the previous keyframe and the current keyframe is integrated into the nonlinear optimization as a re-localization residual.

We further extend our PL-EIO framework to include point-based image features for more robust state estimation (PL-EVIO). Fig. 3 shows the complementarity of the image and event visual information: in badly lit areas, the event stream provides reliable event-corner features, while the image provides rich point-based features elsewhere. This enables a uniform distribution of the point-based event and image features across the scene. Since the line-based event features provide additional constraints (shown in Fig. 3) even when few point-based event and image features are successfully tracked, our framework delivers more robust and accurate state estimation. More details of the detection and tracking of the three kinds of features can be seen in our demonstration.

III-B Motion Compensation for the Event Stream using IMU Measurement

Events can be triggered either by moving objects or by the ego-motion of the camera. Similar to Ref. [ETH-Science-Robotics], we rely only on the IMU for motion compensation, which is sufficient for real-time onboard operation. For each incoming event stream, we average the angular velocity $\boldsymbol{\omega}$ and linear acceleration $\mathbf{a}$ from the IMU over the time window in which the events of that stream were accumulated, to estimate the ego-rotation and ego-translation of each event, and use this ego-motion to warp every event to the timestamp of the first event in the stream. The motion (considering both rotation and translation) of each event can be calculated through:

$$\Delta \mathbf{T}_k = \begin{bmatrix} \mathbf{R}(\boldsymbol{\omega}, \Delta t_k) & \frac{1}{2}\mathbf{a}\,\Delta t_k^2 \\ \mathbf{0} & 1 \end{bmatrix}, \qquad \Delta t_k = t_k - t_0$$

where $\boldsymbol{\omega}$ and $\mathbf{a}$ are the angular velocity and linear acceleration measured by the IMU over the current event-stream window, and $\mathbf{R}(\boldsymbol{\omega}, \Delta t_k)$ is the rotation matrix generated from the angular velocity and the time difference $\Delta t_k$. Each event $e_k$ of the stream is then warped by $\Delta \mathbf{T}_k$, where $t_0$ is the timestamp of the first event of the current event stream and $t_k$ is the timestamp of event $e_k$.
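The warp above can be sketched as follows. This is a minimal illustration under simplifying assumptions (constant $\boldsymbol{\omega}$ and $\mathbf{a}$ over the window, Rodrigues rotation, back-projected 3D points in the camera frame); the exact sign conventions depend on the frame definitions, so treat it as a sketch rather than the paper's implementation.

```python
import numpy as np

def rodrigues(w: np.ndarray, dt: float) -> np.ndarray:
    """Rotation matrix for a constant angular velocity w applied over dt."""
    theta = np.linalg.norm(w) * dt
    if theta < 1e-12:
        return np.eye(3)
    k = w / np.linalg.norm(w)
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def warp_event(pt3: np.ndarray, t_e: float, t0: float,
               w: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Warp a back-projected event point (3D, camera frame) from its own
    timestamp t_e back to the first-event timestamp t0, undoing the ego-motion
    accumulated over dt: rotation R(w*dt) and translation 0.5*a*dt^2."""
    dt = t_e - t0
    R = rodrigues(w, dt)
    trans = 0.5 * a * dt * dt
    return R.T @ (pt3 - trans)
```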

III-C Event-corner Feature Detection and Tracking

The SAE is updated by the motion-compensated event stream, while existing event-corner features are tracked by LK optical flow on the TS with polarity generated from the updated SAE (shown in Fig. 4(c)). Specifically, we use a two-way tracking strategy between two consecutive timestamps: any event-corner feature $p_{t-1}$ at the last timestamp that is tracked to $p_t$ at the current timestamp is then tracked back from $p_t$ to $p'_{t-1}$ at the last timestamp. If the distance between $p_{t-1}$ and $p'_{t-1}$ is smaller than a threshold (1.0 pixel in our experiments), the feature is considered successfully tracked. Event-corner features that are not successfully tracked at the current timestamp are discarded immediately.
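The forward-backward consistency check can be sketched independently of the underlying tracker. In this illustrative sketch, `track_fn` and `back_fn` are hypothetical callables standing in for LK optical flow on the time surface (e.g. OpenCV's `calcOpticalFlowPyrLK`), so the logic can be shown without the image data.

```python
import numpy as np

def two_way_track(prev_pts, track_fn, back_fn, max_err: float = 1.0):
    """Forward-backward tracking check: a feature survives only if tracking
    it forward (t-1 -> t) and then backward (t -> t-1) returns within
    `max_err` pixels of where it started."""
    fwd = track_fn(prev_pts)                    # p_{t-1} -> p_t
    back = back_fn(fwd)                         # p_t -> p'_{t-1}
    err = np.linalg.norm(back - prev_pts, axis=1)
    keep = err < max_err
    return fwd[keep], keep
```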

Fig. 4: The Event-corner Feature Detection and Tracking: (a) Detecting Features from Raw Event Stream; (b) Using the TS with Polarity as the Mask for Uniformly Distribution of the Event-corner Features; (c)Tracking Feature in the TS with Polarity.

Whenever the number of tracked features falls below a certain threshold (150-250 in our experiments), new event-corner features are detected from the latest motion-compensated event stream (shown in Fig. 4(a)) for future tracking. Modifying the publicly available implementation of the Arc* algorithm [GWPHKU:ARC*] for event-based corner detection, we extract event corners on individual events by leveraging the SAE, rather than applying conventional corner detectors to frame-like accumulations (as in Refs. [GWPHKU:ETH-EVIO] [GWPHKU:Ultimate-SLAM]). Newly detected event-corner features are further selected using the TS with polarity as a mask (shown in Fig. 4(b)). To enforce a uniform distribution, a minimum distance (10-20 pixels, depending on the event-camera resolution) is set between two neighboring event-corner features. Meanwhile, we keep only event corners where the pixel value of the TS with polarity is not equal to 128.0, to emphasize features located on strong edges rather than admitting many noisy features in low-texture areas.
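The minimum-distance selection can be sketched as a greedy spatial suppression. This is an illustrative sketch (the function name and the score-ordering are our own, not from the paper): candidates are accepted in score order, skipping any that fall too close to an already accepted corner.

```python
import numpy as np

def enforce_min_distance(candidates, scores, min_dist: float = 10.0):
    """Greedy spatial suppression: accept corner candidates in descending
    score order, skipping any within `min_dist` pixels of an accepted one."""
    order = np.argsort(-np.asarray(scores))
    kept = []
    for i in order:
        p = candidates[i]
        if all(np.linalg.norm(p - q) >= min_dist for q in kept):
            kept.append(p)
    return np.array(kept)
```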

Furthermore, all event-corner features in the front-end are first undistorted with the camera distortion model and then projected to a normalized camera coordinate system. We use Random Sample Consensus (RANSAC) for outlier filtering. Finally, we recover the inverse depth of the event-corner features that are successfully tracked between two consecutive timestamps through triangulation. Point-based landmarks whose 3D positions have been successfully computed are fed to the sliding window for nonlinear optimization.
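The triangulation step can be illustrated with a standard linear (DLT) two-view triangulation on normalized coordinates; the inverse depth is then simply one over the recovered depth in the first frame. This is a textbook sketch, not the paper's specific solver.

```python
import numpy as np

def triangulate(x1, x2, T1, T2):
    """Linear (DLT) triangulation of one feature seen in two frames.
    x1, x2: normalized camera coordinates (2,); T1, T2: 3x4 world-to-camera
    projection matrices [R|t]. Returns the 3D point in world coordinates."""
    A = np.vstack([
        x1[0] * T1[2] - T1[0],
        x1[1] * T1[2] - T1[1],
        x2[0] * T2[2] - T2[0],
        x2[1] * T2[2] - T2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]
```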

III-D Line-based Feature Detection and Matching on the Event Stream

Utilizing line-based features to improve point-based VIO is effective because line features provide additional constraints and structural information, especially in human-made environments. Therefore, after motion compensation of each new event stream, we map the events into an OpenCV Mat (event mat). Since events are triggered where the scene has strong edges, using the event mat for line-feature generation avoids invalid line features and improves efficiency. The LSD algorithm in OpenCV is used to extract line features from the raw event streams. Following a study of the hidden parameter tuning and length rejection strategy of LSD (inspired by [GWPHKU:PL-VINS]), we filter out short line features by a length rejection strategy:

$$l \geq \eta \cdot \min(W_E, H_E)$$

where $\min(W_E, H_E)$ denotes the smaller of the width and height of the event camera, and $\eta$ is the ratio factor (0.125 in our experiments). After that, we adopt the Line Band Descriptor (LBD) [GWPHKU:LBD] to describe and match line features. In particular, to ensure good tracking performance and consistency with the point-based event-corner features, we also use the TS with polarity for LBD generation and line-feature matching. We further apply refinement schemes to identify good line matches for successful line tracking:

  • the Hamming distance between matched line descriptors is less than 30;

  • the squared endpoint error between the matched line features is less than 200*200 pixels²;

  • the angle between matched line features is less than 0.1 rad.

Successfully tracked line-based event features are refined by undistorting the endpoints of the lines and are then projected to a unit sphere after outlier rejection, which is performed with RANSAC using a fundamental-matrix model. After that, we obtain line-based landmarks by triangulating the correspondences of two line features. Line-based landmarks whose 3D positions have been successfully computed are fed to the sliding window for nonlinear optimization.
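The length-rejection rule and the three match-refinement checks above can be sketched directly. This is an illustrative sketch using plain numpy (binary descriptors as `uint8` arrays, segments as endpoint pairs); the actual system uses OpenCV's LSD and LBD implementations.

```python
import numpy as np

def line_length_ok(p0, p1, width, height, eta: float = 0.125) -> bool:
    """Length rejection: keep a segment only if it is at least
    eta * min(width, height) pixels long."""
    return np.linalg.norm(np.asarray(p1) - np.asarray(p0)) >= eta * min(width, height)

def match_ok(desc_a, desc_b, ends_a, ends_b, max_hamming: int = 30,
             max_sq_err: float = 200 * 200, max_angle: float = 0.1) -> bool:
    """The three refinement checks for a line match: descriptor Hamming
    distance, endpoint squared error, and angle between the two segments."""
    hamming = np.count_nonzero(np.unpackbits(np.bitwise_xor(desc_a, desc_b)))
    if hamming >= max_hamming:
        return False
    sq_err = float(np.sum((np.asarray(ends_a) - np.asarray(ends_b)) ** 2))
    if sq_err >= max_sq_err:
        return False
    da = np.asarray(ends_a[1]) - np.asarray(ends_a[0])
    db = np.asarray(ends_b[1]) - np.asarray(ends_b[0])
    cos = abs(np.dot(da, db)) / (np.linalg.norm(da) * np.linalg.norm(db))
    return float(np.arccos(np.clip(cos, -1.0, 1.0))) < max_angle
```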

III-E Sliding-Window Graph-based Optimization with Point and Line Features

III-E1 Formulation of the Nonlinear Optimization

The full state vector in the sliding window is defined as:

$$\mathcal{X} = \left[\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{x}_c^b, \lambda_0^e, \ldots, \lambda_p^e, \mathbf{o}_0, \ldots, \mathbf{o}_q, \lambda_0^i, \ldots, \lambda_r^i\right]$$

where $\lambda_k^e$ and $\lambda_k^i$ are the inverse depths of the event-corner features and point-based image features, respectively, and $\mathbf{o}_k$ is the four-parameter orthonormal representation [zhang2015building] of the line-based event features in the sliding window. $\mathbf{x}_c^b$ is the extrinsic transformation from the camera frame (image or event) to the body (IMU) frame (a single extrinsic when using the DAVIS, which outputs image and event data simultaneously). $\mathbf{x}_k$ is the optimization variable comprising the IMU state, with $k \in [0, n]$, where $n$ is the total number of keyframes in the sliding window. The system state at keyframe $k$ is given by the position $\mathbf{p}_{b_k}^w$, orientation quaternion $\mathbf{q}_{b_k}^w$, and velocity $\mathbf{v}_{b_k}^w$ of the IMU in the world frame, together with the accelerometer bias $\mathbf{b}_a$ and gyroscope bias $\mathbf{b}_g$, as follows:

$$\mathbf{x}_k = \left[\mathbf{p}_{b_k}^w, \mathbf{q}_{b_k}^w, \mathbf{v}_{b_k}^w, \mathbf{b}_a, \mathbf{b}_g\right]$$
A joint nonlinear optimization is solved for the maximum a posteriori estimate of $\mathcal{X}$, with the cost function:

$$\min_{\mathcal{X}} \Bigg\{ \left\|\mathbf{r}_{m}\right\|^2 + \left\|\mathbf{r}_{re}\right\|^2 + \sum_{k \in \mathcal{B}} \left\|\mathbf{r}_{\mathcal{B}}\left(\hat{\mathbf{z}}_{b_{k+1}}^{b_k}, \mathcal{X}\right)\right\|^2 + \sum_{(l,j) \in \mathcal{E}} \rho\left(\left\|\mathbf{r}_{\mathcal{E}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right)\right\|^2\right) + \sum_{(l,j) \in \mathcal{L}} \rho\left(\left\|\mathbf{r}_{\mathcal{L}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right)\right\|^2\right) + \sum_{(l,j) \in \mathcal{I}} \rho\left(\left\|\mathbf{r}_{\mathcal{I}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right)\right\|^2\right) \Bigg\} \tag{7}$$

Eq. (7) contains the marginalization residual $\mathbf{r}_m$; the re-localization residual $\mathbf{r}_{re}$; the IMU pre-integration residual $\mathbf{r}_{\mathcal{B}}$; the point-based event residual $\mathbf{r}_{\mathcal{E}}$; the line-based event residual $\mathbf{r}_{\mathcal{L}}$; and the point-based image residual $\mathbf{r}_{\mathcal{I}}$, each with its corresponding weight, where $\rho(\cdot)$ is a robust kernel. $\mathcal{E}$, $\mathcal{L}$, and $\mathcal{I}$ are the sets of event-corner features, line-based event features, and point-based image features, respectively, that have been successfully tracked or matched at least twice in the current sliding window.

III-E2 Point-based Event Visual Measurement Residual

The $\mathbf{r}_{\mathcal{E}}$ in Eq. (7) is the event-corner measurement residual from the re-projection function. Considering the event-corner feature $l$ first observed in the $i^{th}$ keyframe, the residual for its observation in the $j^{th}$ keyframe is defined as:

$$\mathbf{r}_{\mathcal{E}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right) = \hat{\mathbf{u}}_{l}^{c_j} - \pi\left( \mathbf{T}_{b}^{c}\, \mathbf{T}_{w}^{b_j}\, \mathbf{T}_{b_i}^{w}\, \mathbf{T}_{c}^{b}\, \frac{1}{\lambda_l} \pi^{-1}\left(\hat{\mathbf{u}}_{l}^{c_i}\right)\right)$$

where $\hat{\mathbf{u}}_{l}^{c_i}$ is the first observation of the event-corner feature in the $i^{th}$ keyframe, $\hat{\mathbf{u}}_{l}^{c_j}$ is the observation of the same event-corner feature in the $j^{th}$ keyframe, and $\pi(\cdot)$ and $\pi^{-1}(\cdot)$ are the projection and back-projection functions of the event camera, respectively, which include the intrinsic parameters for transforming between 2D pixel coordinates and normalized event-camera coordinates. $\mathbf{T}_{w}^{b_j}$ indicates the movement of the world frame relative to the body frame at timestamp $j$, i.e., the inverse of the pose $\mathbf{T}_{b_j}^{w}$ of the body in the world frame at the $j^{th}$ keyframe.
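The re-projection chain (back-project with inverse depth, transform through body and world frames, project into the other keyframe) can be sketched numerically. This is an illustrative sketch with rotation/translation pairs in place of homogeneous transforms; names are ours, not the paper's.

```python
import numpy as np

def reproj_residual(u_i, u_j, inv_depth, R_wbi, p_wbi, R_wbj, p_wbj, R_bc, p_bc):
    """Re-projection residual of a feature first seen at normalized pixel u_i
    in keyframe i with inverse depth `inv_depth`, observed at u_j in keyframe
    j. (R_wb*, p_wb*): body poses in the world frame; (R_bc, p_bc):
    camera-to-body extrinsics. Returns the 2D residual on the normalized
    image plane."""
    P_ci = np.array([u_i[0], u_i[1], 1.0]) / inv_depth   # back-project in cam i
    P_w = R_wbi @ (R_bc @ P_ci + p_bc) + p_wbi           # camera i -> world
    P_bj = R_wbj.T @ (P_w - p_wbj)                       # world -> body j
    P_cj = R_bc.T @ (P_bj - p_bc)                        # body j -> camera j
    return np.array([P_cj[0] / P_cj[2], P_cj[1] / P_cj[2]]) - np.asarray(u_j)
```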

III-E3 Line-based Event Visual Measurement Residual

The $\mathbf{r}_{\mathcal{L}}$ in Eq. (7) is the line-based event measurement residual generated from the line re-projection model. The line re-projection residual is modeled as the distance from the endpoints of the observed line to the projected line in the normalized image plane. We define $\mathcal{L}^w = (\mathbf{n}_w^\top, \mathbf{d}_w^\top)^\top$ as a line-based landmark in Plücker coordinates, where $\mathbf{n}_w$ denotes the normal vector of the plane determined by $\mathcal{L}^w$ and the origin of the world frame, and $\mathbf{d}_w$ denotes the direction vector determined by the two endpoints of $\mathcal{L}^w$. Assume that $\mathcal{L}^w$ is first observed in the $i^{th}$ keyframe and the current keyframe is the $j^{th}$. The dual Plücker matrix can be described as:

$$\mathcal{L}^* = \begin{bmatrix} [\mathbf{d}_w]_\times & \mathbf{n}_w \\ -\mathbf{n}_w^\top & 0 \end{bmatrix}$$

where $[\cdot]_\times$ denotes the skew-symmetric matrix of a 3-vector. Given the transformation matrix $\mathbf{T}_{c}^{w} = [\mathbf{R}_{c}^{w}, \mathbf{t}_{c}^{w}]$ from the event frame to the world frame, we can transform the line-based event feature observed in the $j^{th}$ keyframe from the world frame to the event frame by:

$$\mathcal{L}^{c} = \begin{bmatrix} \mathbf{n}_c \\ \mathbf{d}_c \end{bmatrix} = \mathcal{T}_{w}^{c}\, \mathcal{L}^{w} = \begin{bmatrix} \mathbf{R}_{w}^{c} & [\mathbf{t}_{w}^{c}]_\times \mathbf{R}_{w}^{c} \\ \mathbf{0} & \mathbf{R}_{w}^{c} \end{bmatrix} \mathcal{L}^{w} \tag{10}$$

Next, the 3D line $\mathcal{L}^{c}$ in the event-camera frame can be projected to the line $\mathbf{l}$ on the event plane by:

$$\mathbf{l} = \begin{bmatrix} l_1 \\ l_2 \\ l_3 \end{bmatrix} = \mathcal{K}\, \mathbf{n}_c$$

where $\mathcal{K}$ is the line projection matrix of the event camera, and $\mathbf{n}_c$ can be obtained from Eq. (10). The line re-projection error can be defined as:

$$\mathbf{r}_{\mathcal{L}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right) = \begin{bmatrix} d(\mathbf{s}, \mathbf{l}) \\ d(\mathbf{e}, \mathbf{l}) \end{bmatrix}$$

where $\mathbf{s}$ and $\mathbf{e}$ are the homogeneous coordinates of the endpoints of the observed line feature in the image plane, and $d(\cdot, \mathbf{l})$ denotes the point-to-line distance from an endpoint to the projected line $\mathbf{l}$:

$$d(\mathbf{s}, \mathbf{l}) = \frac{\mathbf{s}^\top \mathbf{l}}{\sqrt{l_1^2 + l_2^2}}$$
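The point-to-line distance that forms the line residual is simple to sketch. This illustrative helper takes the projected line in implicit form $(l_1, l_2, l_3)$ and the two observed endpoints in pixel coordinates.

```python
import numpy as np

def line_residual(l, s, e):
    """Line re-projection residual: signed distances of the two observed
    endpoints s, e to the projected line l = (l1, l2, l3), computed as
    d(p, l) = (l1*px + l2*py + l3) / sqrt(l1^2 + l2^2)."""
    norm = np.hypot(l[0], l[1])
    def d(p):
        return (l[0] * p[0] + l[1] * p[1] + l[2]) / norm
    return np.array([d(s), d(e)])
```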
III-E4 Point-based Image Visual Measurement Residual

The $\mathbf{r}_{\mathcal{I}}$ in Eq. (7) is the point-based image measurement residual from the re-projection function. Analogous to the event-corner measurement, for a point-based image feature $l$ first observed in the $i^{th}$ keyframe, the residual for its observation in the $j^{th}$ keyframe is defined as:

$$\mathbf{r}_{\mathcal{I}}\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right) = \hat{\mathbf{u}}_{l}^{c_j} - \pi\left( \mathbf{T}_{b}^{c}\, \mathbf{T}_{w}^{b_j}\, \mathbf{T}_{b_i}^{w}\, \mathbf{T}_{c}^{b}\, \frac{1}{\lambda_l} \pi^{-1}\left(\hat{\mathbf{u}}_{l}^{c_i}\right)\right)$$

where $\hat{\mathbf{u}}_{l}^{c_i}$ is the first observation of the point-based image feature in the $i^{th}$ keyframe, $\hat{\mathbf{u}}_{l}^{c_j}$ is the observation of the same feature in the $j^{th}$ keyframe, and $\pi(\cdot)$ and $\pi^{-1}(\cdot)$ are the projection and back-projection functions of the standard camera, respectively, which include the intrinsic parameters for transforming between 2D pixel coordinates and normalized camera coordinates.

III-E5 IMU Measurement Residual

The $\mathbf{r}_{\mathcal{B}}$ in Eq. (7) is the IMU residual from IMU pre-integration. The raw angular velocity and acceleration measurements from the IMU at time $t$ are:

$$\hat{\boldsymbol{\omega}}_t = \boldsymbol{\omega}_t + \mathbf{b}_{g_t} + \mathbf{n}_g, \qquad \hat{\mathbf{a}}_t = \mathbf{a}_t + \mathbf{b}_{a_t} + \mathbf{R}_w^{b_t} \mathbf{g}^w + \mathbf{n}_a \tag{15}$$

where $\mathbf{n}_g$ and $\mathbf{n}_a$ are modeled as additive Gaussian noise, $\mathbf{b}_{g_t}$ and $\mathbf{b}_{a_t}$ are modeled as random walks, and the notation $\hat{(\cdot)}$ represents noisy measurements. Given the time interval $[t_k, t_{k+1}]$ corresponding to keyframes $k$ and $k+1$, the position, velocity, and orientation $\mathbf{p}_{b_{k+1}}^w$, $\mathbf{v}_{b_{k+1}}^w$, $\mathbf{q}_{b_{k+1}}^w$ can be propagated over this interval using the gyroscope and accelerometer measurements in the world frame as follows:

$$\begin{aligned} \mathbf{p}_{b_{k+1}}^w &= \mathbf{p}_{b_k}^w + \mathbf{v}_{b_k}^w \Delta t_k + \iint_{t \in [t_k, t_{k+1}]} \left( \mathbf{R}_t^w \mathbf{a}_t - \mathbf{g}^w \right) dt^2 \\ \mathbf{v}_{b_{k+1}}^w &= \mathbf{v}_{b_k}^w + \int_{t \in [t_k, t_{k+1}]} \left( \mathbf{R}_t^w \mathbf{a}_t - \mathbf{g}^w \right) dt \\ \mathbf{q}_{b_{k+1}}^w &= \mathbf{q}_{b_k}^w \otimes \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \boldsymbol{\Omega}(\boldsymbol{\omega}_t)\, \mathbf{q}_t^{b_k}\, dt \end{aligned} \tag{16}$$

where $\boldsymbol{\Omega}(\cdot)$ is the quaternion right-multiplication matrix of an angular rate. Based on Eq. (15), Eq. (16) can be rewritten as follows:

$$\begin{aligned} \mathbf{p}_{b_{k+1}}^w &= \mathbf{p}_{b_k}^w + \mathbf{v}_{b_k}^w \Delta t_k + \iint_{t \in [t_k, t_{k+1}]} \left( \mathbf{R}_t^w (\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a) - \mathbf{g}^w \right) dt^2 \\ \mathbf{v}_{b_{k+1}}^w &= \mathbf{v}_{b_k}^w + \int_{t \in [t_k, t_{k+1}]} \left( \mathbf{R}_t^w (\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a) - \mathbf{g}^w \right) dt \\ \mathbf{q}_{b_{k+1}}^w &= \mathbf{q}_{b_k}^w \otimes \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \boldsymbol{\Omega}(\hat{\boldsymbol{\omega}}_t - \mathbf{b}_{g_t} - \mathbf{n}_g)\, \mathbf{q}_t^{b_k}\, dt \end{aligned} \tag{17}$$

In order to make the pre-integration terms depend only on the inertial measurements and biases within $[t_k, t_{k+1}]$, $\mathbf{R}_w^{b_k}$ is multiplied on both sides of Eq. (17), and we define the pre-integration terms $\boldsymbol{\alpha}_{b_{k+1}}^{b_k}$, $\boldsymbol{\beta}_{b_{k+1}}^{b_k}$, $\boldsymbol{\gamma}_{b_{k+1}}^{b_k}$ as follows:

$$\begin{aligned} \boldsymbol{\alpha}_{b_{k+1}}^{b_k} &= \iint_{t \in [t_k, t_{k+1}]} \mathbf{R}_t^{b_k} (\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a)\, dt^2 \\ \boldsymbol{\beta}_{b_{k+1}}^{b_k} &= \int_{t \in [t_k, t_{k+1}]} \mathbf{R}_t^{b_k} (\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a)\, dt \\ \boldsymbol{\gamma}_{b_{k+1}}^{b_k} &= \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \boldsymbol{\Omega}(\hat{\boldsymbol{\omega}}_t - \mathbf{b}_{g_t} - \mathbf{n}_g)\, \boldsymbol{\gamma}_t^{b_k}\, dt \end{aligned} \tag{18}$$

Discretizing Eq. (18) with the zero-order-hold method, for IMU sample $i$ with step $\delta t$:

$$\begin{aligned} \hat{\boldsymbol{\alpha}}_{i+1}^{b_k} &= \hat{\boldsymbol{\alpha}}_{i}^{b_k} + \hat{\boldsymbol{\beta}}_{i}^{b_k} \delta t + \frac{1}{2} \mathbf{R}(\hat{\boldsymbol{\gamma}}_{i}^{b_k}) (\hat{\mathbf{a}}_i - \mathbf{b}_{a_i})\, \delta t^2 \\ \hat{\boldsymbol{\beta}}_{i+1}^{b_k} &= \hat{\boldsymbol{\beta}}_{i}^{b_k} + \mathbf{R}(\hat{\boldsymbol{\gamma}}_{i}^{b_k}) (\hat{\mathbf{a}}_i - \mathbf{b}_{a_i})\, \delta t \\ \hat{\boldsymbol{\gamma}}_{i+1}^{b_k} &= \hat{\boldsymbol{\gamma}}_{i}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} (\hat{\boldsymbol{\omega}}_i - \mathbf{b}_{g_i})\, \delta t \end{bmatrix} \end{aligned} \tag{19}$$

Eventually, the IMU residual can be derived as follows:

$$\mathbf{r}_{\mathcal{B}}\left(\hat{\mathbf{z}}_{b_{k+1}}^{b_k}, \mathcal{X}\right) = \begin{bmatrix} \mathbf{R}_w^{b_k}\left(\mathbf{p}_{b_{k+1}}^w - \mathbf{p}_{b_k}^w - \mathbf{v}_{b_k}^w \Delta t_k + \frac{1}{2}\mathbf{g}^w \Delta t_k^2\right) - \hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k} \\ \mathbf{R}_w^{b_k}\left(\mathbf{v}_{b_{k+1}}^w - \mathbf{v}_{b_k}^w + \mathbf{g}^w \Delta t_k\right) - \hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k} \\ 2\left[\left(\mathbf{q}_{b_k}^w\right)^{-1} \otimes \mathbf{q}_{b_{k+1}}^w \otimes \left(\hat{\boldsymbol{\gamma}}_{b_{k+1}}^{b_k}\right)^{-1}\right]_{xyz} \\ \mathbf{b}_{a_{k+1}} - \mathbf{b}_{a_k} \\ \mathbf{b}_{g_{k+1}} - \mathbf{b}_{g_k} \end{bmatrix} \tag{20}$$
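The zero-order-hold update of the pre-integration terms can be sketched numerically. This illustrative sketch keeps the rotation increment as a matrix (first-order, in place of the quaternion update) for brevity; it is not the paper's implementation.

```python
import numpy as np

def preintegrate_step(alpha, beta, gamma_R, a_hat, w_hat, ba, bg, dt):
    """One zero-order-hold pre-integration update. alpha/beta accumulate
    position/velocity increments in the b_k frame; gamma_R is the rotation
    from the current IMU frame to b_k (matrix form for simplicity)."""
    a_unbiased = a_hat - ba
    w_unbiased = w_hat - bg
    alpha = alpha + beta * dt + 0.5 * (gamma_R @ a_unbiased) * dt * dt
    beta = beta + (gamma_R @ a_unbiased) * dt
    # first-order incremental rotation over dt (small-angle approximation)
    theta = w_unbiased * dt
    dR = np.eye(3) + np.array([[0, -theta[2], theta[1]],
                               [theta[2], 0, -theta[0]],
                               [-theta[1], theta[0], 0]])
    return alpha, beta, gamma_R @ dR
```

Integrating a constant 1 m/s² acceleration for one second this way accumulates a velocity increment of 1 m/s and a position increment of 0.5 m, matching the continuous-time definitions.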
IV Evaluation

Sequence DAVIS346 (346*240) DVXplorer (640*480)
Ultimate SLAM [GWPHKU:Ultimate-SLAM]
vicon_hdr1 0.96 0.32 0.67 1.49 2.44 0.59 0.67 0.57 0.17 1.94 0.30 0.47 0.41
vicon_hdr2 1.60 0.75 0.90 1.28 1.11 0.74 0.45 0.54 0.12 2.38 0.37 0.22 0.21
vicon_hdr3 2.28 0.60 0.69 0.66 0.83 0.72 0.74 0.69 0.19 0.83 0.69 0.47 0.36
vicon_hdr4 1.40 0.70 0.66 1.84 1.49 0.37 0.37 0.32 0.11 2.09 0.26 0.27 0.25
vicon_darktolight1 0.51 0.75 0.84 1.33 1.00 0.81 0.78 0.66 0.14 1.96 0.80 0.71 0.71
vicon_darktolight2 0.98 0.76 1.50 1.48 0.79 0.42 0.44 0.51 0.12 1.57 0.57 0.56 0.47
vicon_lighttodark1 0.55 0.41 0.64 1.79 0.84 0.29 0.42 0.33 0.13 2.48 0.81 0.43 0.54
vicon_lighttodark2 0.55 0.58 0.93 1.32 1.49 0.79 0.73 0.53 0.16 1.37 0.75 0.67 0.60
vicon_dark1 0.88 failed 0.53 1.75 3.45 1.02 0.64 0.35 0.43 3.79 0.35 0.51 0.41
vicon_dark2 0.52 0.60 failed 1.10 0.63 0.49 0.30 0.38 0.47 2.81 0.41 0.38 0.41
vicon_aggressive_hdr failed failed 1.94 failed 2.30 0.66 0.62 0.50 1.97 failed 0.65 0.62 0.50
Average 1.02 0.61 0.93 1.40 1.49 0.63 0.56 0.49 0.36 2.12 0.54 0.48 0.45
  • Unit: %/m; 0.45 means an average error of 0.45 m per 100 m of motion.

TABLE I: Accuracy Comparison of Our PL-EIO with Other Event-based Works

In this section, we assess the performance of our framework both quantitatively and qualitatively on different challenging sequences in subsections IV.A and IV.B. We implemented our method in C++ on Ubuntu 20.04 with ROS Noetic. All the sequences are evaluated in real time on a laptop with an Intel Core i7-11800H and are recorded in videos (shown on our project website). In subsections IV.C and IV.D, we demonstrate quadrotor flight using our PL-EVIO as the closed-loop state estimator, as well as aggressive flipping. Large-scale experiments are carried out in subsection IV.E to illustrate long-term practicability.

IV-A Evaluation for Our PL-EIO

To demonstrate robustness, accuracy, and real-time capability, we first evaluate our PL-EIO using event cameras of different resolutions (DAVIS346 (346×240) and DVXplorer (640×480)) with ground truth from VICON. All the sequences are recorded in HDR scenarios with very low illumination or strong illumination changes, or with aggressive motion. Without loss of generality, we used the raw images from the DAVIS346 to run VINS-MONO [GWPHKU:VINS-MONO], PL-VINS [GWPHKU:PL-VINS], and ORB-SLAM3 [ORB-SLAM3] as image-based comparisons. In addition, based on the source code of Ultimate SLAM [GWPHKU:Ultimate-SLAM], we also test the EVIO and EIO versions of Ultimate SLAM as event-based comparisons. The estimated and ground-truth trajectories were aligned with a 6-DOF transformation (in SE3), using the first 5 seconds [0-5s] of the resulting trajectory. We computed the mean position error (Euclidean distance in meters) as a percentage of the total traveled distance of the ground truth, calculated with the publicly available tool [GWPHKU:evo_package]. As can be seen from the results in Table I, our proposed PL-EIO outperforms the other methods on both event-camera resolutions. In particular, on vicon_aggressive_hdr, our PL-EIO produces reliable and accurate pose estimates even when the image-based VIO and VO fail. Besides, compared with our previous EIO [GWPHKU:MyEVIO], the introduction of line features brings an effective performance improvement on both event-camera resolutions.
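The evaluation protocol just described (SE3 alignment on the first 5 s, then mean position error as a percentage of traveled distance) can be sketched as follows; this is an illustrative re-implementation, not the evo tool itself, and the function names are ours:

```python
import numpy as np

def align_se3(est, gt):
    """Closed-form least-squares SE(3) alignment (Umeyama/Kabsch, no scale):
    returns R, t such that R @ est_i + t best matches gt_i."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflection
    R = Vt.T @ S @ U.T
    return R, mu_g - R @ mu_e

def position_error_percent(est, gt, t, align_window=5.0):
    """Align on the first `align_window` seconds only, then report the mean
    Euclidean position error as a percentage of traveled distance (%/m)."""
    mask = t <= t[0] + align_window
    R, tr = align_se3(est[mask], gt[mask])
    est_aligned = est @ R.T + tr
    mean_err = np.linalg.norm(est_aligned - gt, axis=1).mean()
    dist = np.linalg.norm(np.diff(gt, axis=0), axis=1).sum()
    return 100.0 * mean_err / dist
```

Aligning on a short initial window rather than the whole trajectory keeps later drift visible in the reported error.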

However, as can be seen from the results, the motion-compensation version (PL-EIO+) does not bring consistent improvement across the sequences, especially under aggressive motion, which might be caused by the bias of the IMU measurements during aggressive motion. On the other hand, unlike most EIO works [GWPHKU:Event-based-visual-inertial-odometry] [GWPHKU:ETH-EVIO] [GWPHKU:Ultimate-SLAM], which operate on a 30 Hz event stream, the rate of the event stream in this evaluation is higher (60 Hz for the DAVIS346 and 50 Hz for the DVXplorer). This leads to smaller time differences within each event batch, so the effect of our motion-compensation algorithm on the event stream is limited. We also found that motion compensation of the event stream might not be a good option for high-resolution event cameras, given the trade-off between computational cost and performance improvement. Furthermore, we also evaluate our PL-EVIO: although the image-aided version brings significant performance improvement on most sequences, it performs worse than the PL-EIO in dark environments (such as vicon_dark1 and vicon_dark2). This might be caused by the deterioration of point-based image-feature tracking in the dark.

It is worth mentioning that the Ultimate-SLAM results are for reference only, since we do not deeply fine-tune its parameters for each sequence (avoiding failure at all is not easy). Because the illumination changes greatly in our dataset, it is very difficult for Ultimate-SLAM to choose a single stationary threshold for integrating the event stream into edge images. We fine-tuned the parameters of Ultimate-SLAM as best we could on sequence vicon_hdr3 to achieve good performance, and used the same parameters to evaluate the other sequences. This also shows that the generalization ability of integrating event streams into edge images for VIO is poor, since the number of triggered events depends on many factors, including the resolution of the camera, the texture of the scene, the illumination, etc.
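The stationary-threshold issue can be illustrated with a minimal sketch of fixed-batch event accumulation (our own illustrative code, not Ultimate-SLAM's implementation; events are assumed to be (x, y, t, polarity) tuples):

```python
import numpy as np

def events_to_edge_image(events, shape, n_events):
    """Accumulate a fixed batch of events into an edge image, as frame-based
    event front-ends do. The fixed batch size n_events is the stationary
    threshold discussed above: a value tuned for one sensor resolution or
    scene texture over- or under-fills the image on another."""
    img = np.zeros(shape, dtype=np.float32)
    for x, y, _t, _p in events[:n_events]:
        img[int(y), int(x)] += 1.0          # count events per pixel
    if img.max() > 0:
        img /= img.max()                    # normalize for feature detection
    return img
```

A batch size tuned for the DAVIS346 (346×240) would cover a far smaller fraction of the DVXplorer's 640×480 array, which is one reason a single threshold does not transfer across sensors or scenes.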

IV-B Evaluation for Our PL-EVIO

(a) boxes_translation
(b) dynamic_translation
(c) poster_6dof
Fig. 5: Comparison of the Translation and Rotation Estimates of Our PL-EVIO Against the Ground Truth

In this section, we evaluate our PL-EVIO on the UZH-FPV dataset [GWPHKU:FPV], a high-speed, aggressive visual-inertial odometry dataset. It includes fast laps around a racetrack with drone-racing gates, as well as free-form trajectories around obstacles. We compare our PL-EVIO with ORB-SLAM3 (stereo VIO) [ORB-SLAM3], VINS-Fusion (stereo VIO) [GWPHKU:VINS-Fusion], VINS-MONO (monocular VIO) [GWPHKU:VINS-MONO], and Ultimate SLAM (EVIO) [GWPHKU:Ultimate-SLAM]. We again computed the mean position error as a percentage of the total traveled distance, with the estimated and ground-truth trajectories aligned in SE3 over the full trajectories. As can be seen from the results in Table II, our proposed PL-EVIO achieves better performance even compared with stereo VIO using higher-resolution cameras. This dataset is so challenging that Ultimate-SLAM and VINS-Fusion fail on most of the sequences, while our PL-EVIO still provides reliable and satisfying results. VINS-MONO also requires deep parameter fine-tuning to avoid poor performance. The video record of our PL-EVIO can be seen in the demonstrations.

Sequence Snapdragon (640×480) DAVIS346 (346×240)
VINS-Fusion [GWPHKU:VINS-Fusion]
Stereo VIO
Ultimate SLAM [GWPHKU:Ultimate-SLAM]
Indoor_forward_3 0.84 0.55 0.65 failed 0.38
Indoor_forward_5 failed 1.19 1.07 failed 0.90
Indoor_forward_6 1.45 failed 0.25 failed 0.30
Indoor_forward_7 0.61 0.36 0.37 failed 0.55
Indoor_forward_9 2.87 0.77 0.51 failed 0.44
Indoor_forward_10 4.48 1.02 0.92 failed 1.06
Indoor_45_degree_2 failed 2.18 0.53 failed 0.55
Indoor_45_degree_4 failed 1.53 1.72 9.79 1.30
Indoor_45_degree_9 failed 0.49 1.25 4.74 0.76
Average 5.26 2.10 0.81 7.26 0.70

  • Unit: %/m; 0.70 means an average error of 0.70 m per 100 m of motion.

TABLE II: Accuracy Comparison of Our PL-EVIO with Other Event-based Works in UZH-FPV Dataset [GWPHKU:FPV]

Furthermore, we also evaluate our PL-EVIO against other EIO works on the publicly available Event Camera Dataset [GWPHKU:event-camera-dataset_davis240c], which was acquired with a DAVIS240C (240×180; event sensor, image sensor, and IMU) and contains extremely fast 6-DOF motions and HDR scenes. We directly quote the raw results from Refs. [GWPHKU:Event-based-visual-inertial-odometry] [GWPHKU:ETH-EVIO] [GWPHKU:Ultimate-SLAM] [HASTE-VIO] [EKLT-VIO], and [GWPHKU:MyEVIO]. Please note that we use the same alignment procedure as the aforementioned works, which aligns the estimated trajectory and the ground-truth path in SE3 using the first 5 seconds. As can be seen from Table III, our PL-EVIO achieves state-of-the-art performance. Fig.5 presents the estimated trajectories (translation and rotation) of our PL-EVIO against the ground truth for the sequences boxes_translation, dynamic_translation, and poster_6dof. Additionally, in Fig.6, we use relative error metrics, which evaluate the drift averaged over sub-trajectories of different lengths. It is worth mentioning that although the EVIO version of Ultimate-SLAM achieves performance similar to ours, it uses different parameters for each sequence, whereas we consider per-sequence parameter tuning impractical and therefore evaluate our method with fixed parameters across all sequences.
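The relative error metric of Fig.6 (drift over sub-trajectories of a given traveled length) can be sketched as follows; this is an illustrative translation-only version, while the cited evaluation tools implement the full SE3 variant:

```python
import numpy as np

def relative_translation_errors(est, gt, seg_len):
    """For every start index, find the end index where the ground truth has
    traveled seg_len meters, and compare the relative displacement of the
    estimate against that of the ground truth."""
    # cumulative ground-truth path length at each pose
    d = np.concatenate(([0.0], np.cumsum(np.linalg.norm(np.diff(gt, axis=0), axis=1))))
    errs = []
    for i in range(len(gt)):
        j = np.searchsorted(d, d[i] + seg_len)   # first pose seg_len meters later
        if j >= len(gt):
            break
        errs.append(np.linalg.norm((est[j] - est[i]) - (gt[j] - gt[i])))
    return np.array(errs)
```

Averaging these errors for several segment lengths produces the drift-versus-distance curves of Fig.6; unlike a single absolute error, the metric is insensitive to where along the trajectory the drift accumulates.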

Ref. [GWPHKU:Ultimate-SLAM]
boxes_translation 2.69 0.57 0.76 0.27 2.55 0.48 0.34 0.06
hdr_boxes 1.23 0.92 0.67 0.37 1.75 0.46 0.40 0.10
boxes_6dof 3.61 0.69 0.44 0.30 2.03 0.84 0.61 0.21
dynamic_translation 1.90 0.47 0.59 0.18 1.32 0.40 0.26 0.24
dynamic_6dof 4.07 0.54 0.38 0.19 0.52 0.79 0.43 0.48
poster_translation 0.94 0.89 0.15 0.12 1.34 0.35 0.40 0.54
hdr_poster 2.63 0.59 0.49 0.31 0.57 0.65 0.40 0.12
poster_6dof 3.56 0.82 0.30 0.28 1.50 0.35 0.26 0.14
Average 2.58 0.69 0.47 0.25 1.45 0.54 0.39 0.24
  • Unit: %/m; 0.24 means an average error of 0.24 m per 100 m of motion.

TABLE III: Accuracy Comparison of Our PL-EVIO with Other EIO/EVIO Works in DAVIS240c Dataset [GWPHKU:event-camera-dataset_davis240c]
(a) boxes_translation
(b) dynamic_translation
(c) poster_6dof
Fig. 6: The Relative Pose Error Comparison of Our PL-EVIO with EIO [GWPHKU:ETH-EVIO], Ultimate-SLAM [GWPHKU:Ultimate-SLAM], and Our EIO [GWPHKU:MyEVIO]

IV-C Onboard Quadrotor-flight Evaluation

In this section, we present our real-world experiments on a self-designed quadrotor platform (shown in Fig.7) carrying a forward-looking IniVation DAVIS346 sensor. An Intel NUC10i7FNH computer running Ubuntu 20.04 is mounted on the quadrotor for onboard computation. We use a Pixracer (FMUv4) autopilot running the PX4 flight stack. To alleviate the disturbance from the motion capture system's infrared light on the event camera, we add an infrared filter on the lens surface of the DAVIS346. Note that the introduction of the infrared filter might degrade perception for both the event and the image sensor during the evaluations in subsections IV.A, IV.C, IV.D, and IV.E. The overall weight of our quadrotor is 1.364 kg (GS330 frame with T-Motor F60 PRO).

In the experiments, the reference trajectories are generated offline. The polynomial trajectory generation method [mellinger2011minimum] is used to ensure the motion feasibility of the quadrotor. To follow the generated trajectory, a cascaded feed-forward PID controller serves as the high-level position controller running on the NUC. Given the position, velocity, and acceleration as inputs, the high-level feed-forward controller computes the desired attitude and throttle, which are sent to the low-level controller running on PX4.
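The high-level loop can be sketched as follows; the gains, the omitted integral term, and the zero-yaw assumption are ours for illustration, not the values used onboard:

```python
import numpy as np

def position_controller(p, v, p_ref, v_ref, a_ref, kp, kd, mass, g=9.81):
    """Feed-forward PD position loop (integral term omitted for brevity):
    desired acceleration = feed-forward + feedback, then converted to a
    collective thrust magnitude and a desired attitude for zero yaw."""
    a_des = a_ref + kp * (p_ref - p) + kd * (v_ref - v)
    f_des = mass * (a_des + np.array([0.0, 0.0, g]))  # thrust vector, world frame
    thrust = np.linalg.norm(f_des)
    z_b = f_des / thrust                              # desired body z-axis
    # desired roll/pitch for a zero-yaw ZYX attitude: R[:,2] = [s_th*c_ph, -s_ph, c_th*c_ph]
    roll = np.arcsin(-z_b[1])
    pitch = np.arctan2(z_b[0], z_b[2])
    return thrust, roll, pitch
```

At hover (zero errors and zero reference acceleration) the controller commands level attitude and a thrust equal to the vehicle weight, which for the 1.364 kg platform above is about 13.4 N.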

Fig. 7: (a) Our Quadrotor Platform; (b) The Environment of Our VICON Room
Fig. 8: The Estimated Trajectory of Our PL-EVIO During the Quadrotor Flight and its Comparison Against the Ground Truth (Taking Onboard_test_1 as an Example).

We conduct four flight experiments (Fig.7) to test the performance of autonomous trajectory tracking and feedback control under the state estimates from our PL-EVIO. The quadrotor is commanded to track the following patterns (Offboard and Onboard mean using VICON and our PL-EVIO, respectively, for pose feedback control; in all cases PL-EVIO runs its real-time online calculations on the onboard computer):

IV-C1 Offboard_test_1

The state estimates from VICON are used for feedback control of the quadrotor, which is commanded to track a figure-eight pattern with each circle being 0.625 m in radius, at 1.2 m height, shown in Fig.8. The yaw angle of the commanded figure-eight pattern is fixed. The quadrotor follows this trajectory ten times continuously during the experiment. The 1000 Hz online output of our PL-EVIO is also recorded for accuracy comparison.

IV-C2 Offboard_test_2

The state estimates from VICON are used for feedback control of the quadrotor, which is commanded to track a figure-eight pattern with each circle being 0.625 m in radius, at 1.2 m height. The yaw angle of the commanded figure-eight pattern changes drastically to create aggressive motion. The quadrotor follows this trajectory ten times continuously during the experiment. The 1000 Hz online output of our PL-EVIO is also recorded for accuracy comparison.

IV-C3 Onboard_test_1

The state estimates from our PL-EVIO are used for feedback control of the quadrotor, which is commanded to track a figure-eight pattern with each circle being 0.625 m in radius, at 1.2 m height. The yaw angle of the commanded figure-eight pattern is fixed. The quadrotor follows this trajectory ten times continuously during the experiment. The 1000 Hz onboard state estimates of our PL-EVIO enable real-time feedback control of the quadrotor.

IV-C4 Onboard_test_2

The state estimates from our PL-EVIO are used for feedback control of the quadrotor, which is commanded to track a screw pattern shown in Fig.1. The quadrotor follows this trajectory ten times continuously during the experiment. The 1000 Hz onboard state estimates of our PL-EVIO enable real-time feedback control of the quadrotor.
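The figure-eight reference used in the first three tests can be generated, for example, with a Gerono-lemniscate parametrization; the exact parametrization and loop period used in the experiments are not specified here, so the form and the 10 s period below are assumptions, with only the 0.625 m radius and 1.2 m height taken from the test descriptions:

```python
import numpy as np

def figure_eight(t, radius=0.625, height=1.2, period=10.0):
    """One figure-eight loop per period: a Gerono-style lemniscate whose two
    lobes approximate circles of the given radius, flown at a constant height.
    radius/height match the flight tests; period is an assumed value."""
    w = 2.0 * np.pi / period
    x = 2.0 * radius * np.sin(w * t)        # spans both lobes along x
    y = radius * np.sin(2.0 * w * t)        # crosses zero at the center
    return np.array([x, y, height])
```

Sampling this function at the controller rate yields position references; velocity and acceleration feed-forward terms follow by differentiating the same expressions, which is what makes polynomial/analytic references convenient for the feed-forward controller above.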

Sequence Translation Error Rotation Error
Mean RMSE Std Mean RMSE Std
Offboard_test_1 0.054 0.061 0.028 0.094 0.095 0.015
Offboard_test_2 0.12 0.13 0.057 0.061 0.065 0.023
Onboard_test_1 0.078 0.084 0.030 0.078 0.087 0.039
Onboard_test_2 0.081 0.093 0.046 0.056 0.059 0.019
  • Unit: m for translation and deg for rotation

TABLE IV: Accuracy of Our PL-EVIO Against the Ground Truth in the Quadrotor-flight Experiments

The ground truth is obtained from VICON. The translation and rotation errors are detailed in Table IV. Taking Onboard_test_1 as an example, Fig.8 and Fig.9 further illustrate the estimated trajectories (translation and rotation) of our PL-EVIO against the ground truth, as well as the corresponding errors. The total trajectory length is 101.15 m. The translation errors in the X, Y, and Z dimensions are all within 0.1 m, while the rotation errors in roll and pitch are within 2° and the yaw error is within 6°.
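The statistics in Table IV can be reproduced from aligned trajectories with a small helper (illustrative code; note that RMSE² = mean² + std² always holds, which is a useful sanity check on such tables):

```python
import numpy as np

def error_stats(est, gt):
    """Mean, RMSE, and standard deviation of the per-pose Euclidean
    translation error between an aligned estimate and the ground truth."""
    e = np.linalg.norm(est - gt, axis=1)      # per-pose error magnitude
    mean = e.mean()
    rmse = np.sqrt((e ** 2).mean())
    std = e.std()                             # population std, so rmse^2 = mean^2 + std^2
    return mean, rmse, std
```

The rotation columns of Table IV follow the same pattern with per-pose angular distances in degrees substituted for the Euclidean norms.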

(a) X-axis
(b) Y-axis
(c) Z-axis
(d) Roll-axis
(e) Pitch-axis
(f) Yaw-axis
Fig. 9: The Position, Orientation, and Corresponding Errors of Our PL-EVIO in Onboard Flight Compared with the Ground Truth from VICON (Taking Onboard_test_1 as an Example).

IV-D Aggressive Quadrotor-flipping Evaluation

Fig. 10: The Estimated Trajectory of Our PL-EVIO During the Quadrotor Flipping.
Fig. 11: The Estimated Trajectory of Our PL-EVIO in an Outdoor Environment. We Also Visualize the Detection and Tracking of the Event-corner Features, Line-based Event Features, and Point-based Image Features During the Experiment. The Combination of These Features Provides More Structure and Constraints in the Scene, Ensuring Robustness.

In this section, we further demonstrate quadrotor flipping experiments to evaluate the performance of our PL-EVIO under aggressive motion. The estimated trajectory of our PL-EVIO compared with VICON during the flipping evaluations can be seen in Fig.10. The total length of the trajectory is 12 m. The translational mean error and the angular mean error are 0.079 m and 2.7° respectively. More video demos of the evaluations in IV.C and IV.D are shown in our demonstrations.

IV-E Outdoor Large-scale Evaluation

The workspaces of the sequences in the previous subsections are relatively small, so it is difficult to distinguish drift from failure by the error value alone. Therefore, in this section, we evaluate our PL-EVIO in a large-scale environment around the HKU campus, featuring moving pedestrians, low-texture areas, long-term movement, strong sunlight, and indoor-outdoor transitions. We also returned to the same location after a large loop to evaluate loop closure. The evaluation covers an area around 160 m in length and 100 m in width, with 10 m of height change. The estimated trajectory is overlaid on the Google map in Fig.11. We can see that our PL-EVIO is almost drift-free over the long-term motion, while the complementarity of the three kinds of features (line-based event features for human-made environments, point-based event features for HDR scenes, and point-based image features for well-lit scenes) ensures robust and reliable state estimation.

V Conclusions

In this paper, we proposed a robust and accurate monocular VIO that tightly fuses events, images, and IMU measurements, with point and line features. The combination of point-based event-corner features, line-based event features, and point-based image features provides more geometric constraints on the structure of the environment. Besides, a motion-compensation algorithm is proposed to alleviate blur and correct the motion of the event stream. However, the improvement from this motion compensation is relatively limited, which might be caused by the inaccuracy of the IMU in the event cameras. Finally, extensive experiments show that our PL-EVIO pipeline is able to leverage the complementary properties of the standard camera and the event camera to provide robust state estimation. We hope this work can inspire other researchers and industry to push event cameras toward wide application in robotics and perception. In future work, we plan to explore stereo EVIO. Besides, multi-sensor fusion, including a wider range of local perception (such as LiDAR) and global perception (such as visible light positioning [GWPHKU:My2021TIM] indoors, or GPS outdoors), will be studied in depth to exploit the complementary advantages of different sensors together with event cameras.


This research is supported by the General Research Fund under Grant No. 17204222, the Seed Funding for Strategic Interdisciplinary Research Scheme, and the Platform Technology Fund. We also thank Cai Yixi for the advice on our flight experiments.