A Versatile Keyframe-Based Structureless Filter for Visual Inertial Odometry

12/30/2020 ∙ by Jianzhu Huai, et al. ∙ Wuhan University

Motion estimation by fusing data from at least a camera and an Inertial Measurement Unit (IMU) enables many applications in robotics. However, among the multitude of Visual Inertial Odometry (VIO) methods, few efficiently estimate device motion with consistent covariance and calibrate sensor parameters online to handle data from consumer sensors. This paper addresses the gap with a Keyframe-based Structureless Filter (KSF). For efficiency, landmarks are not included in the filter's state vector. For robustness, KSF associates feature observations and manages state variables using the concept of keyframes. For flexibility, KSF supports anytime calibration of IMU systematic errors, as well as extrinsic, intrinsic, and temporal parameters of each camera. Estimator consistency and observability of sensor parameters were analyzed by simulation. Sensitivity to design options, e.g., feature matching method and camera count, was studied with the EuRoC benchmark. Sensor parameter estimation was evaluated on raw TUM VI sequences and smartphone data. Moreover, pose estimation accuracy was evaluated on EuRoC and TUM VI sequences versus recent VIO methods. These tests confirm that KSF reliably calibrates sensor parameters when the data contain adequate motion, and consistently estimates motion with accuracy rivaling recent VIO methods. Our implementation runs at 42 Hz with stereo camera images on a consumer laptop.

Supplementary Material

A video of running the KSF method with real-time calibration and consistent pose covariance on TUM VI raw room1 sequence is available at https://youtu.be/ZJFTdtrP2HQ.

I Introduction

Visual Inertial Odometry (VIO) methods estimate the motion of a platform outfitted with at least a camera and an Inertial Measurement Unit (IMU). They are an enabling technique for augmented reality and robotics. For instance, these methods have been widely used for state estimation of smartphones [1], Unmanned Aerial Vehicles (UAVs) [2], and service robots [3], thanks to the low weight and power consumption and the wide availability of the sensor assembly. As the VIO technology matures, a plethora of VIO algorithms [4] have been proposed over the years. They come in a variety of frontend feature association schemes and backend estimation engines. The variants of feature association schemes mainly revolve around whether feature descriptors are extracted (e.g., [5] vs. [6]) and whether feature matching is done between frames and selected frames, i.e., keyframes (e.g., [5] vs. [7]). The backend estimation engines are mainly based on filtering, optimization, neural networks, or hybrids. To reduce the computation time of jointly solving for ego-motion and scene structure in an estimator, a structureless approach can be taken where the estimator only solves for ego-motion and the scene structure is solved separately.

Despite the rich literature on VIO, there remain some desired properties of a VIO method that are not well addressed. Ideally, a VIO method should produce a covariance estimate for the estimated pose (position and orientation). But for some optimization-based methods, it is computationally intensive to obtain the covariance [8], while some filtering-based methods report over-confident covariance inconsistent to the actual estimation error [9]. Lack of consistent covariance further prevents proper weighting of pose estimates in downstream algorithms, e.g., pose graph optimization [10].

Fig. 1: Flowchart of the keyframe-based structureless filter whose unique components are shaded. ①-② are described in III-E. ③-⑥ are presented in III-F. ⑦ is discussed in III-D.

As consumer sensor assemblies suitable for VIO proliferate, e.g., on smartphones, it is desirable to calibrate the camera and the IMU in real time and without turning to expensive equipment. Even in cases when such special devices are available, a VIO method with self-calibration capability is helpful, as it can indicate the need for a new calibration. Several methods have been proposed to tackle pose estimation and sensor calibration at the same time. Built upon the Multi-State Constraint Kalman Filter (MSCKF) [11], Li and Mourikis [12] estimated a variety of parameters of a monocular camera-IMU system, including IMU systematic errors, and camera intrinsic, extrinsic, and temporal parameters. Also with MSCKF, the online sensor calibration method in [13] estimates intrinsic and extrinsic parameters and time offsets of multiple cameras. While the original MSCKF [11] has difficulty in handling a stationary period, its improved variant [7] uses zero-velocity updates which, however, require delicate thresholds.

Motivated by the need for consistent covariance estimation and real-time sensor calibration, we propose a Keyframe-based Structureless Filter (KSF) for VIO with on-the-fly calibration capability. A structureless estimator does not include the scene structure variables in the state vector and is thus typically more efficient than estimators that jointly estimate motion and scene structure. In addition to being structureless, KSF integrates the concept of keyframes and sensor self-calibration. Thanks to the use of keyframes, it can cope with standstills without resorting to zero-velocity updates. The keyframe concept has been used in the filtering backend of [14], but we use keyframes in both the feature association frontend and the filtering backend. KSF also differs from [12] in that its state variables are tied to known timestamps. Another difference from the closely related work [13] is that KSF includes old velocity variables in the state vector so that continuous-time camera poses can be predicted with IMU measurements instead of linear fitting.

In summary, our contributions are:

  • For efficient and consistent motion estimation with data captured from a platform of an IMU and at least one camera, the KSF is developed, employing keyframes for image feature association in the frontend and for state variable management in the backend. Its efficiency and accuracy were validated on public benchmarks, its consistency by simulation.

  • To deal with data captured by consumer sensors, the proposed method supports calibrating IMU random biases and systematic errors, and intrinsic, extrinsic, and temporal parameters (time delay and rolling shutter effect) of every camera in the platform. KSF’s ability to calibrate these parameters was confirmed by simulation and on one public benchmark captured by low-cost sensors.

Subsequently, Section II provides a non-exhaustive overview of recent VIO methods, focused on feature matching and sensor calibration. Section III presents the proposed KSF covering the state vector and error state, IMU and camera measurement models, keyframe-based feature association frontend, and the keyframe-based filtering backend. Next, the simulation study on estimator consistency and sensor parameter observability is reported in Section IV. Furthermore, Section V gives the sensitivity analysis of various design options, estimation accuracy of sensor parameters and poses, and computation time on real-world data. In the end, Section VI discusses the results and lists caveats, and Section VII summarizes the work and points out future directions.

II Related Work

This section briefly reviews traditional geometry-based visual inertial odometry and calibration methods without claiming to be exhaustive. VIO methods can be loosely grouped into geometry-based, learning-based, and hybrid methods based on the extent to which geometric models of image information are used in the inference engine [4]. We refer to [15] for a recent survey on learning-based and hybrid VIO methods, and refer to [16] for a recent survey on geometry-based VIO methods. At the time of writing this paper, the trending learning-based methods are reported to be robust under challenging conditions, e.g., [17], while geometry-based methods tend to be fast and accurate. The following review focuses on the category of geometrical methods to which our proposed method belongs.

A plethora of VIO methods based on traditional geometry models has been reported, varying in the supported number of cameras and in the calibration capability.

II-A Monocular Visual Inertial Odometry

Filtering In [18], measurements from a rigid camera-IMU platform were fused with a global bundle adjustment or with an Extended Kalman Filter (EKF) to estimate platform pose and velocity expressed in the camera frame at startup, landmark positions, IMU biases, and the gravity vector. These methods were initialized with the bundle adjustment algorithm from a special data sequence with benign motion. With an EKF, Veth and Raquet [19] estimated pose and velocity of the sensor assembly in the navigation frame, landmark positions, IMU biases, and camera extrinsic parameters. Feature matches were found by projecting landmarks to new images and searching in uncertainty ellipses. Landmarks were initialized by using a terrain model or triangulation. The MSCKF [11] keeps a sliding window of cloned poses for camera frames while not using landmarks in the state vector. From every disappearing feature track (all observations corresponding to the same landmark), a landmark is triangulated, and then its linearized observations, with the landmark parameters canceled out, are used to constrain multiple cloned poses in the update step of the EKF. Also using an EKF, Jones and Soatto [20] estimated platform pose and velocity in an inertial frame, landmark positions, IMU biases, the gravity vector, and camera extrinsic parameters. A landmark point is represented by its projection direction and the logarithm of its depth. Kelly and Sukhatme [21] estimated pose and velocity of the sensor unit in a world frame (e.g., aligned to a calibration target), IMU biases, the gravity vector, camera extrinsic parameters, and landmark positions with a UKF. Authors of [22, 23] used an EKF to fuse IMU readings and up-to-scale motion estimates from a visual odometry method. Li and Mourikis [9] improved MSCKF by using first estimates of position and velocity in computing the system transition matrix and measurement Jacobians. Hesch et al. [24] improved MSCKF by including distinguishable landmarks in the state vector and modifying the transition matrix and measurement Jacobians to satisfy observability constraints. Robocentric VIO [25] improved MSCKF by expressing the cloned poses relative to the first one and including the gravity vector in the state vector.

Optimization-based Methods With iSAM2 [26], Forster et al. [27] estimated the entire history of system states associated with sequential keyframes. Each system state includes pose and velocity expressed in an inertial frame and IMU biases. The measurements encoded as factors provided to iSAM2 include a prior, preintegrated IMU factors, and structureless projection factors constructed by eliminating the landmark parameters from linearized reprojection errors.

With a sliding window smoother, VINS-Mono [28] estimates a sliding window of poses and velocities (expressed in a gravity-aligned world frame) and IMU biases, inverse depth of landmarks, and extrinsic parameters and time delay of the camera. For each selected frame, a least squares problem is constructed from image features matched between frames by KLT [29], preintegrated IMU factors, and a marginalization factor. To bound computation, either the second latest frame or the oldest frame is marginalized from the problem. In [30], VINS-Mono was extended for optimizing both points and lines with their observations.

II-B Stereo Visual Inertial Odometry

For stereo cameras, Sun et al. [14] implemented MSCKF with observability constraints to ensure consistency. They used KLT [29] to track features between successive left camera images and between left and right camera images. In every other update step, two cloned state variables were chosen based on relative motion from those associated with recent frames and the oldest frames, and marginalized from the filter.

With an optimization backend, Huai et al. [31] estimated system state variables associated with earlier keyframes in a spatial window and recent frames in a temporal window. Each system state variable includes pose and velocity in a gravity-aligned world frame and IMU biases. The system state variables fix their linearization points for computing Jacobians before exiting the temporal window to ensure consistency. Kimera-VIO [10] finds feature matches between consecutive stereo frames by KLT [29] and between left and right rectified images by correlating along epipolar lines. By using structureless vision factors, preintegrated IMU factors, and a prior of the system state (pose, velocity, and IMU biases), a fixed-lag smoother optimizes the system states associated with keyframes. The stereo VIO method in [32] tracks features between consecutive stereo frames and between left and right images by an inverse-compositional sparse optical flow method. By using reprojection errors, preintegrated IMU factors, and a marginalization factor, the method solves for a sliding window of system state variables associated with keyframes and recent frames with the Gauss-Newton method. To ensure consistency, the first-estimate Jacobian technique [33] is used, i.e., Jacobians of system state variables involved in a marginalization factor are evaluated at their estimates upon marginalization.

II-C N-Camera Visual Inertial Odometry

With the iterated EKF, ROVIO [34] uses the intensity errors of image patches for landmarks to estimate robocentric pose, velocity, landmark positions, as well as IMU biases and camera extrinsic parameters. The landmarks are represented by a bearing vector and an inverse depth, allowing undelayed initialization. With KLT [29] or descriptor-based matching, OpenVINS [7] identifies feature matches between consecutive monocular or stereo frames. These matches are used by a sliding window filter to update the state vector including pose, velocity, cloned poses of earlier frames, and landmark parameters. It is also possible to calibrate extrinsic and intrinsic parameters of cameras and a time delay. For consistency, propagated values (rather than updated values) of position, orientation, and velocity are used in computing the transition matrix and observation Jacobians.

Optimization-based OKVIS [8] tracks BRISK descriptors [35] in a new frame of images to previous keyframes and the last frame. With reprojection errors constructed from these feature matches, preintegrated IMU measurements, a marginalization factor, and a prior of position and yaw about gravity, OKVIS solves for landmarks and system state variables (each system state includes pose, velocity, and IMU biases) in a sliding window of keyframes and recent frames. To limit the problem size, redundant state variables of the oldest keyframe or the oldest recent frame are selected and marginalized. VINS-Fusion [28] extends VINS-Mono to support stereo cameras and synchronous processing.

II-D Camera-IMU Online Calibration

As VIO algorithms flourish, methods to calibrate a camera-IMU system have made great advances along the lines of both offline and online calibration. Here we skim through the offline methods and briefly review online methods. The interested reader is referred to [36] for recent developments in calibration of visual inertial sensors.

Offline visual inertial calibration usually performs joint optimization of structure, motion, and sensor parameters. Available offline approaches are achieved with varying infrastructure support, e.g., using calibration targets [37], revisiting a mapped area [38], or observing natural features [36].

Online calibration of a camera-IMU system usually relies on filtering or fixed-lag smoothing approaches while taking observations of existing features. It has been common practice to estimate camera extrinsic parameters in real time, e.g., [8, 20, 21, 34]. Additional parameters have also been incorporated in real-time estimators, e.g., [7]. The extended MSCKF [12] estimates scale errors, misalignment, and g-sensitivity of the IMU and the intrinsic, extrinsic, and temporal parameters of the camera including the frame readout time. An improved VINS-Mono [39] estimates scale errors, misalignment, and g-sensitivity of the IMU. To date, no formal observability analysis for these additional parameters exists, but empirical evidence indicates that a rich variety of motion is necessary for estimating these parameters reliably [21, 36]. Because of the strict conditions required to render all parameters observable, estimators capable of real-time calibration often come with the option to lock sensor parameters when they are deemed unobservable or well calibrated. For a filter, a parameter can be locked by setting its covariance entries to zero, e.g., [20].

III Keyframe-based Structureless Filter

The problem we consider is to track the pose (position and orientation) and optionally calibrate the sensor parameters of a platform consisting of an IMU and at least one camera using inertial readings and image observations of opportunistic landmarks with unknown positions. Our proposed keyframe-based VIO system has two major components: (i) the feature association frontend which processes images and gives feature matches, and (ii) the estimation backend which fuses feature matches and inertial measurements and estimates motion and optional calibration parameters. Both components integrate the concept of keyframes. In the following, we present first the notation conventions, IMU and camera measurement models, then the state variables, and finally the frontend, the backend filter and its consistency.

III-A Notation and Coordinate Frames

We use a transformation matrix composed of a rotation matrix and a 3D translation vector to relate the homogeneous coordinates of a point in one reference frame to those in another reference frame.

This work uses several right-handed coordinate frames, including the camera frames, the IMU sensor frame, the body frame, and the world frame. The conventional right-down-forward camera frame is affixed to the imaging sensor of every camera. Without loss of generality, we designate the first camera as the main camera, e.g., the left camera in a stereo camera rig. The IMU sensor frame has its origin at the intersection of two accelerometer axes; one axis of the frame is aligned to one accelerometer axis, and a second axis is perpendicular to the first and close to a second accelerometer axis while lying in the plane spanned by the two accelerometer axes. The estimated poses refer to a body frame affixed to the platform. The body frame can be defined to coincide with the IMU sensor frame in an IMU-centric manner. Alternatively, in a camera-centric manner, the body frame is defined to have the same origin as the IMU sensor frame and a constant orientation relative to the main camera, e.g., [12]. For easy result comparison, this constant orientation is set to the initial value of the relative orientation between the IMU and the main camera. This work supports both definitions of the body frame (see Table I). For coordinate frames that evolve over time, we append a time parameter to such a frame to obtain a quasi-inertial frame, e.g., the quasi-inertial frame coinciding with the body frame at a given time.

A quasi-inertial world frame is introduced for expressing the navigation state variables of the platform. It is instantiated at the start of a VIO estimator with the accelerometers' measurements such that its z-axis is aligned with the negative gravity direction.

III-B IMU Measurements

An IMU measurement model considers systematic and random errors in the 3-degree-of-freedom (3-DOF) gyroscope and the 3-DOF accelerometer. Many VIO methods use a simplified IMU measurement model that only considers random biases and noises. A generic IMU model also accounts for systematic errors including scale factor errors, misalignment, relative orientations, and the g-sensitivity effect, e.g., [36, 37]. Next, we present the simplified model first and then a generic IMU model; both are accommodated in our method.

Recall that the accelerometer triad measures the acceleration in the body frame caused by specific forces,

\mathbf{a} = \mathbf{R}_{BW}\left(\ddot{\mathbf{p}}_{WB} - \mathbf{g}_W\right)  (1)

where \mathbf{g}_W is the local gravity expressed in the world frame. The gyroscope measures the angular velocity \boldsymbol{\omega} of the platform in the body frame.

Simple IMU model In a simplified IMU model, the IMU measurements \mathbf{a}_m and \boldsymbol{\omega}_m are assumed to be affected by the accelerometer and gyroscope biases, \mathbf{b}_a and \mathbf{b}_g, and Gaussian white noise processes, \mathbf{n}_a and \mathbf{n}_g, i.e.,

\mathbf{a}_m = \mathbf{a} + \mathbf{b}_a + \mathbf{n}_a  (2)
\boldsymbol{\omega}_m = \boldsymbol{\omega} + \mathbf{b}_g + \mathbf{n}_g  (3)
\dot{\mathbf{b}}_a = \mathbf{n}_{b_a}  (4)
\dot{\mathbf{b}}_g = \mathbf{n}_{b_g}  (5)

where the biases are assumed to be driven by Gaussian white noise processes, \mathbf{n}_{b_a} and \mathbf{n}_{b_g}. The power spectral densities of \mathbf{n}_a, \mathbf{n}_g, \mathbf{n}_{b_a}, and \mathbf{n}_{b_g} are usually assumed to be isotropic, but they may have disparate diagonal values and nonzero off-diagonal entries, e.g., for a two-DOF gyro.

In summary, the IMU parameters for the simple model are

\mathbf{x}_{\mathrm{IMU}} = \begin{bmatrix} \mathbf{b}_g^T & \mathbf{b}_a^T \end{bmatrix}^T.  (6)

Generic IMU model Our method incorporates the generic IMU model presented in [12]. In this model, the accelerometer measurement is corrupted by systematic errors encoded in a matrix \mathbf{T}_a, the accelerometer bias \mathbf{b}_a, and the noise process \mathbf{n}_a,

\mathbf{a}_m = \mathbf{T}_a \mathbf{a} + \mathbf{b}_a + \mathbf{n}_a.  (7)

The 9 entries of \mathbf{T}_a encompass the 3-DOF scale factor error, 3-DOF misalignment, and 3-DOF relative orientation between the accelerometer input axes and the body frame.

The gyroscope measurement is corrupted by systematic errors encoded in a matrix \mathbf{T}_g and the g-sensitivity effect encoded in a matrix \mathbf{T}_s, the gyroscope bias \mathbf{b}_g, and the noise process \mathbf{n}_g,

\boldsymbol{\omega}_m = \mathbf{T}_g \boldsymbol{\omega} + \mathbf{T}_s \mathbf{a} + \mathbf{b}_g + \mathbf{n}_g.  (8)

The 9 entries of \mathbf{T}_g encompass the 3-DOF scale factor error, 3-DOF misalignment, and 3-DOF relative orientation between the gyroscope input axes and the body frame.

Alternatively, the IMU measurements can be modeled with separate coefficients for scaling, misalignment, and bearing effects as done in [37], which makes the resulting estimates easier to interpret. But to avoid dealing with SO(3) relative orientations in implementation, we adopt the above formulation, (7) and (8). In this case, the IMU parameters are, in summary,

\mathbf{x}_{\mathrm{IMU}} = \begin{bmatrix} \mathbf{b}_g^T & \mathbf{b}_a^T & \mathrm{vec}(\mathbf{T}_g)^T & \mathrm{vec}(\mathbf{T}_s)^T & \mathrm{vec}(\mathbf{T}_a)^T \end{bmatrix}^T  (9)

where \mathrm{vec}(\cdot) stacks the entries of a matrix into a vector.
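As a concrete illustration of how the generic model would be applied, the following minimal Python sketch inverts (7)-(8) to correct raw readings once the parameters have been estimated. The function and variable names (correct_imu, T_a, T_g, T_s) follow the reconstruction above and are our own assumptions; this is not the KSF implementation.

    import numpy as np

    def correct_imu(acc_raw, gyro_raw, T_a, T_g, T_s, b_a, b_g):
        """Invert the generic IMU model (7)-(8), ignoring the zero-mean noise terms:
           acc_raw  = T_a @ a + b_a + n_a
           gyro_raw = T_g @ w + T_s @ a + b_g + n_g
        Returns estimates of the specific force a and angular rate w."""
        a = np.linalg.solve(T_a, acc_raw - b_a)             # undo shape-matrix effects and bias
        w = np.linalg.solve(T_g, gyro_raw - b_g - T_s @ a)  # also remove g-sensitivity
        return a, w

    # Example with nominal parameters (identity shape matrices, zero g-sensitivity).
    T_a, T_g, T_s = np.eye(3), np.eye(3), np.zeros((3, 3))
    b_a = np.array([0.05, -0.02, 0.01])
    b_g = np.array([0.001, 0.002, -0.001])
    acc_raw = np.array([0.10, -9.75, 0.30])
    gyro_raw = np.array([0.02, 0.01, 0.00])
    print(correct_imu(acc_raw, gyro_raw, T_a, T_g, T_s, b_a, b_g))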

For a well-constrained system, the presented IMU models and compatible camera extrinsic models are listed in Table I. For the generic IMU model (9), it is necessary to use the body frame defined in a camera-centric manner (Section III-A). Otherwise, if the IMU-centric body frame is used, the relative orientation between the main camera and the body frame would be unknown and thus variable, making the 3-DOF relative orientations in \mathbf{T}_a and \mathbf{T}_g redundant. On the other hand, for the simple IMU model (6), it is desirable to use the body frame defined in an IMU-centric manner for the purpose of estimating the relative orientation between the main camera and the body frame.

TABLE I: Compatible IMU and camera extrinsic parameters. For each body frame definition (camera-centric or IMU-centric), the table lists the compatible IMU parameters and the parameterizations of the main camera and other camera extrinsics.

Propagation with IMU Data With a sequence of IMU readings between two epochs, the navigation state variable and the joint covariance of the navigation state and the IMU parameters can be propagated from the given state variables and covariance at the earlier epoch,

(10)

where the noises of the IMU readings are stacked in a single noise vector, i.e.,

(11)

The transition matrix is the derivative of the propagated error state relative to the initial error state, i.e.,

(12)

Detailed calculation of the propagated state and covariance in (10) can be found in [40, Appendix A], [41]. The propagation (10) should support backward integration in time to deal with observations earlier than a pose variable.
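To make the propagation step concrete, the following sketch shows generic EKF covariance propagation with a transition matrix and accumulated process noise, using a toy 15-D navigation error state. The dimensions and names are assumptions for illustration, not the KSF code.

    import numpy as np

    def propagate_covariance(P, Phi, Q):
        """Standard EKF covariance propagation P <- Phi P Phi^T + Q, cf. (10) and (12).
        P   : covariance of the error state (navigation state and IMU parameters)
        Phi : transition matrix of the error state over one IMU interval
        Q   : accumulated process noise covariance over the same interval."""
        return Phi @ P @ Phi.T + Q

    # Toy example: a 15-D navigation error state (position, orientation, velocity,
    # gyroscope bias, accelerometer bias) propagated over a 10 ms IMU interval.
    dim, dt = 15, 0.01
    P = np.eye(dim) * 1e-4
    Phi = np.eye(dim)
    Phi[0:3, 6:9] = np.eye(3) * dt   # position error grows with velocity error
    Q = np.eye(dim) * 1e-8
    print(propagate_covariance(P, Phi, Q)[0, 0])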

III-C Camera Measurements

The visual measurements are expressed by conventional reprojection errors. Our adopted pinhole camera observation model relates a landmark's observation in an image of a frame to the landmark parameters and the system state. For a landmark expressed in the world frame by homogeneous coordinates, the observation model is given by

(13)

where the perspective projection function factors in the camera intrinsic parameters, and we assume that an image observation is affected by Gaussian noise.

Alternatively, a landmark may be anchored in a camera of an anchor frame. Then, the landmark can be represented by a homogeneous point in terms of its inverse depth and its observation direction in the anchor camera, and the observation model is given by

(14)

Notice that the enclosed term in (13) or (14) is obtained by transforming a point or inversely transforming a point; hence their Jacobians can be computed in a unified manner by recursively applying the chain rule, as done in reverse-mode automatic differentiation [42]. Because the camera measurements in one image are in general at different epochs due to the rolling shutter effect, the Jacobians of the pose at a feature observation epoch relative to the pose and velocity at the state epoch are computed with the IMU propagation matrix (12). Jacobians relative to time parameters, e.g., camera time delay, are computed from the velocity of the propagated pose. The measurement Jacobian relative to the IMU parameters is ignored as in [43] because its entries are usually small.
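A minimal sketch of a reprojection residual in the spirit of (13), assuming an undistorted pinhole camera and a landmark in homogeneous world coordinates; the rolling-shutter pose propagation and distortion are omitted, and all names (project, T_CW) are ours.

    import numpy as np

    def project(p_C, fx, fy, cx, cy):
        """Pinhole projection of a 3D point in the camera frame to pixel coordinates."""
        return np.array([fx * p_C[0] / p_C[2] + cx, fy * p_C[1] / p_C[2] + cy])

    def reprojection_residual(z, hp_W, T_CW, fx, fy, cx, cy):
        """Residual between an observed pixel z and the projection of a landmark
        given in homogeneous world coordinates hp_W, for a camera pose T_CW."""
        ph_C = T_CW @ hp_W               # homogeneous point in the camera frame
        p_C = ph_C[:3] / ph_C[3]
        return z - project(p_C, fx, fy, cx, cy)

    # Toy example: identity camera pose, landmark 2 m in front of the camera.
    T_CW = np.eye(4)
    hp_W = np.array([0.1, -0.2, 2.0, 1.0])
    z = np.array([395.0, 203.0])
    print(reprojection_residual(z, hp_W, T_CW, fx=350.0, fy=360.0, cx=378.0, cy=238.0))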

III-D State and Error State

This section presents the variables used in formulating the nonlinear VIO problem. We consider the true state and the associated error state to be unknown random variables, and denote the state estimate (mean) by a hatted symbol, e.g., \hat{\mathbf{x}}.

State Vector The state of a VIO system consists of the current navigation state variable, the IMU parameters, the parameters of the cameras, and a sliding window of past navigation state variables at known epochs, covering keyframes and recent frames in the sliding window. That is,

(15)
(16)
(17)

The navigation state, expressed in the world frame, includes the position, orientation, and velocity of the platform at the state epoch, i.e.,

(18)

The IMU parameters are given in (6) or (9).

Camera Parameters The parameters for one camera include the camera extrinsic, intrinsic, and temporal parameters. The extrinsic parameters of the main camera in the multi-camera system are expressed relative to the body frame. For the other cameras, the extrinsic variable can be parameterized relative to either the body frame or the main camera. The intrinsic parameters of one camera include the focal length, principal point, and distortion parameters when assuming a pinhole projection model. The temporal parameters of a camera include the relative time delay t_d between the camera and the IMU, and the frame readout time t_r. Without loss of generality, t_d is defined such that the raw timestamp of an image by the camera clock, plus t_d, is the timestamp by the IMU clock for the central row or mid-exposure of the image.

Each navigation state in the sliding window, associated with a frame j, is at a known epoch t_j which is set to the first estimate of the IMU timestamp for the central row of the main image of frame j, i.e.,

t_j = t_j^{\mathrm{raw}} + \hat{t}_{d,j}  (19)

where t_j^{\mathrm{raw}} is the raw timestamp of the main image of frame j and \hat{t}_{d,j} is the estimate of the main camera's time delay when processing frame j, and hence known. Tying navigation state variables to known epochs is conceptually simpler than tying them to true epochs which themselves have to be estimated ([7, 43]).

With the time delay t_{d,i} and readout time t_{r,i} for camera i, the epoch at which a landmark is observed at image row v in an image of frame j with raw timestamp t_j^{\mathrm{raw}} and N_r rows is computed by

t(v) = t_j^{\mathrm{raw}} + t_{d,i} + \left(\frac{v}{N_r} - \frac{1}{2}\right) t_{r,i}.  (20)

Since landmarks are observed at varying epochs due to the rolling shutter effect, velocity variables are included in the sliding window (17) so that the camera pose for a landmark observation can be propagated from the relevant navigation state variable with IMU measurements.
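A small sketch of the observation epoch computation in (20), under the reconstructed convention that the image timestamp refers to the central row; variable names are ours.

    def observation_epoch(t_raw, t_d, t_r, row, num_rows):
        """Epoch at which the pixel on `row` was exposed, cf. (20).
        t_raw    : raw image timestamp by the camera clock (refers to the central row)
        t_d      : camera-IMU time delay,  t_r: frame readout time
        row      : image row of the feature,  num_rows: image height in rows."""
        return t_raw + t_d + (row / num_rows - 0.5) * t_r

    # A feature near the bottom of a 480-row image is exposed almost half a readout
    # time after the frame's central-row timestamp.
    print(observation_epoch(t_raw=100.0, t_d=0.005, t_r=0.030, row=470, num_rows=480))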

Random Constant State Variables Among all state variables, only the current navigation state variable and the IMU biases are time-variant. The historic navigation state variables are unknown random constants [41] because they are at known epochs. The IMU parameters (6) or (9) excluding biases and the camera parameters are assumed to be random constants [41] because these parameters change little within the timespan of the sliding window. This practice was adopted in [12, 44] and OpenVINS [7].

If a sensor parameter is in fact not constant over time, noise can be added to account for its additional uncertainty as is done for IMU biases. This technique has been used for estimating time delay between a camera and an IMU in [45, (32)].

The motivation for including these sensor parameters in an estimator is threefold. First, the hardware is usually imperfect, especially for sensors on mobile devices [46]. Secondly, estimating such random constant parameters incurs marginal computation. Thirdly, these parameters can be locked online and thus excluded from estimation on demand. For a filter, this is done by zeroing out their covariance blocks.
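The following tiny sketch illustrates this locking mechanism: zeroing the covariance rows and columns of a parameter block makes the corresponding Kalman gain zero, so the parameter stays at its current value in subsequent updates. Indices and names are illustrative only.

    import numpy as np

    def lock_parameter(P, idx):
        """Zero the covariance rows and columns for the state indices in idx, which
        freezes those parameters in subsequent Kalman updates (their gain is zero)."""
        P = P.copy()
        P[idx, :] = 0.0
        P[:, idx] = 0.0
        return P

    # Toy covariance: the last three entries stand for a camera parameter block.
    P = np.diag([1e-2] * 6 + [1e-4] * 3)
    print(np.diag(lock_parameter(P, idx=[6, 7, 8])))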

Error State Vector All state variables fit in the \boxplus-manifolds axiomatized in [47]. Therefore, the error \delta\mathbf{x} for a state variable \mathbf{x} on a \boxplus-manifold is defined with the \boxplus-operator associated with the manifold,

\mathbf{x} = \hat{\mathbf{x}} \boxplus \delta\mathbf{x}.  (21)

In this sense, \boxplus generalizes the conventional addition. The error state variables are mostly defined as plain differences, e.g., for the translation, velocity, and gyroscope bias,

\delta\mathbf{p} = \mathbf{p} - \hat{\mathbf{p}}  (22a)
\delta\mathbf{v} = \mathbf{v} - \hat{\mathbf{v}}  (22b)
\delta\mathbf{b}_g = \mathbf{b}_g - \hat{\mathbf{b}}_g  (22c)

The exceptions are rotations. For an element \mathbf{R} in the space of orientation-preserving rotations SO(3), its error state \delta\boldsymbol{\theta} is defined such that right multiplying the matrix exponential of [\delta\boldsymbol{\theta}]_\times by the estimate \hat{\mathbf{R}} gives \mathbf{R}, i.e.,

\mathbf{R} = \exp([\delta\boldsymbol{\theta}]_\times)\,\hat{\mathbf{R}} \approx (\mathbf{I}_3 + [\delta\boldsymbol{\theta}]_\times)\,\hat{\mathbf{R}}  (23)

where [\cdot]_\times obtains the skew-symmetric matrix for a 3D vector. The small-angle approximation in the last expression is useful in deriving analytical Jacobians. Equivalently, in terms of Hamilton quaternions,

\mathbf{q} = \delta\mathbf{q} \otimes \hat{\mathbf{q}}, \quad \delta\mathbf{q} \approx \begin{bmatrix} 1 & \tfrac{1}{2}\delta\boldsymbol{\theta}^T \end{bmatrix}^T  (24)

where \mathbf{q} and \hat{\mathbf{q}} are the quaternions for \mathbf{R} and \hat{\mathbf{R}}, respectively, and \otimes is the quaternion multiplication operator [48, (12)].
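To illustrate the rotation error convention of (23), the sketch below composes a rotation estimate with a small error vector via the exponential map (SciPy's Rotation class) and checks the small-angle approximation. The left-multiplicative convention follows our reconstruction of (23) and is an assumption.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def boxplus_rotation(R_hat, dtheta):
        """Apply the error vector dtheta to the rotation estimate R_hat:
        R = Exp([dtheta]x) @ R_hat, cf. (23)."""
        return Rotation.from_rotvec(dtheta).as_matrix() @ R_hat

    R_hat = Rotation.from_euler('z', 30, degrees=True).as_matrix()
    dtheta = np.array([0.0, 0.0, 0.01])     # about 0.57 degrees of error about z
    R = boxplus_rotation(R_hat, dtheta)
    # Small-angle check: (I + [dtheta]x) @ R_hat approximates R.
    skew = np.array([[0.0, -dtheta[2], dtheta[1]],
                     [dtheta[2], 0.0, -dtheta[0]],
                     [-dtheta[1], dtheta[0], 0.0]])
    print(np.allclose(R, (np.eye(3) + skew) @ R_hat, atol=1e-4))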

Camera Extrinsic Error Variables The error variable of the main camera’s extrinsic parameters is defined considering the definition of the body frame (Section III-A). And the error variables of other cameras’ extrinsic parameters are defined considering specific parameterizations.

For the main camera, if the camera-centric body frame is used, then the orientation of the main camera relative to the body frame is fixed, and the extrinsic error is only defined for the translation component, which can be expressed in either the body frame or the camera frame. Following the notation of [12], the latter is used to define the extrinsic error, i.e.,

(25)

Otherwise, when the IMU-centric definition of the body frame is adopted, the main camera's extrinsic error consists of a rotation part and a translation part, i.e.,

(26)
(27)

For other cameras, their extrinsic parameters, , can be parameterized relative to the body frame, i.e., by themselves,

(28)

or to the main camera, i.e.,

(29)

The error variables of both parameterizations are defined for the translation and orientation components as in (22a) and (23), i.e.,

(30a)
(30b)
(30c)
(30d)

III-E Feature Association Frontend

The frontend extracts and matches image features by using BRISK descriptors [35] and removes outliers with RANSAC methods provided in OpenGV [49], similarly to [8].

Fig. 2: The keyframe-based feature matching consists of matching the current frame to the last frame, ① and ④, matching the current frame to two previous keyframes, ②, ③, ⑤, and ⑥, and matching between images of the current frame, ⑦. Showing keyframes as shaded frames, this diagram assumes a platform with stereo cameras. ①-⑥ each include a 3D-2D matching phase and a 2D-2D matching phase (Section III-E). For a single camera setup, only ①-③ are needed.

Consider a frame captured by synchronized cameras. Each image in the frame is taken by one camera, and these images are grouped into a frame because their timestamps are close (but not necessarily identical). The matching procedure associates features of the current frame to several previous frames which include the last frame and two most recent keyframes, and associates features between images of the current frame if applicable (Fig. 2).

For every previous frame, the matching operation consists of a 3D-2D matching phase for features associated with 3D landmarks and a 2D-2D matching phase for features not yet triangulated, both done by brute-force matching. Though a bit counterintuitive, brute-force matching is quite efficient with up to 400 features per image (see Section V-D).

Given two frames, for every pair of images captured by the same camera, the 3D-2D phase matches descriptors of triangulated 3D landmarks observed in the earlier image to descriptors in the latter image. Each match is then checked by projecting the 3D landmark onto the latter image. Lastly, all valid matches go through a three-point absolute pose RANSAC step [49] to further remove outliers. Given two frames, for every pair of images captured by the same camera, the 2D-2D phase matches descriptors not associated with landmarks in the earlier image to those in the latter image. Then from these feature matches, new landmarks will be triangulated. Notably, this triangulation step is not required by the backend estimator. Its purpose is to remove feature matches of large projection errors while keeping those of low rotation-compensated disparity, and to provide depth estimates for 3D-2D matching. After successful triangulation, these 2D-2D matches will go through a two-point relative pose RANSAC step [50] to remove outliers. Each valid 2D-2D match spawns a feature track which is all observations of the triangulated landmark. These feature tracks are shared with the backend estimator. On the other hand, they are to be extended in future 3D-2D and inter-camera matching steps.

In the case of multiple cameras, features are associated between images captured by cameras with overlapping fields of view. Unlike [8], matches that deviate much from the epipolar lines will be removed to deal with outliers. Also, a match between two features that already belong to two different landmarks will cause the two landmarks and their feature tracks to be fused.

The above matching procedure distinguishes ordinary frames and keyframes. Keyframes are selected by using the criteria in [8]. For every image in the current frame, we compute the overlap between the convex hull of 2D image features associated with landmarks and that of all 2D image features, and the ratio between the number of 2D image features associated with landmarks and the number of those that fall into the convex hull. A keyframe is selected if the maximum of the area overlaps is less than a threshold (typically 60%) or if the maximum of the match ratios is less than a threshold (typically 20%).
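A sketch of the keyframe test described above. Since the landmark-associated features are a subset of all features, the hull overlap reduces to an area ratio here, and the match ratio is approximated over all features instead of those inside the hull; thresholds and helper names are ours, so this only illustrates the criteria rather than reproducing the frontend.

    import numpy as np
    from scipy.spatial import ConvexHull

    def is_keyframe(matched_pts, all_pts, overlap_thresh=0.6, ratio_thresh=0.2):
        """Decide whether the current image warrants a keyframe.
        matched_pts : Nx2 pixel coordinates of features associated with landmarks
        all_pts     : Mx2 pixel coordinates of all detected features."""
        if len(matched_pts) < 3:
            return True
        # Matched features lie inside the hull of all features, so the hull overlap
        # reduces to a ratio of hull areas (ConvexHull.volume is the area in 2D).
        overlap = ConvexHull(matched_pts).volume / ConvexHull(all_pts).volume
        # Proxy for the match ratio: landmark-associated features over all features.
        ratio = len(matched_pts) / max(len(all_pts), 1)
        return overlap < overlap_thresh or ratio < ratio_thresh

    rng = np.random.default_rng(0)
    all_pts = rng.uniform(0, 480, size=(200, 2))
    matched_pts = all_pts[:30] * 0.4 + 100.0    # matched features cluster in a small area
    print(is_keyframe(matched_pts, all_pts))    # True: a new keyframe is warranted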

III-F Filter Workflow

The backend, a keyframe-based structureless filter, estimates state variables tied to frames and keyframes using completed feature tracks and marginalizes old or redundant state variables (Fig. 1). The filter is structureless since it does not include landmarks in the state vector, similarly to MSCKF [9]. The novelty of the filter is that it manages motion-related state variables depending on whether they are associated with ordinary frames or keyframes. The keyframe-based state management scheme is facilitated by the selection of keyframes in the frontend and is motivated by the need to make a structureless filter work when the sensor setup is stationary.

At the beginning (Algorithm 1), the VIO system is initialized to zero position, zero velocity, zero accelerometer bias, a gyroscope bias averaged over the first few gyroscope readings, and an orientation such that the z-axis of the world frame is aligned to the negative gravity direction by using a few accelerometer readings. The other IMU parameters (9), the camera extrinsic parameters, the time delay, and the readout time are initialized to their nominal values from datasheets or experience.

The standard deviations of the state variables (16)-(18) are usually initialized to sensible values; see Table III for our simulation setup and Table IX for real data tests.

Fig. 3: Schematic drawing of a landmark's observation Jacobians relative to the state vector, whose nonzero entries are shaded. In accordance with Fig. 2, as a new frame arrives, this landmark completes its feature track at the frame which serves as its anchor frame (14). After the navigation state variable for that frame has been cloned, the computed Jacobians are stacked as shown and are ready for canceling out the Jacobian for the landmark's parameters (Section III-F).

As a frame of coarsely synchronized images arrives, the navigation state (18) and the covariance matrix are propagated with inertial readings to the state epoch ((19), Box ③ of Fig. 1) for the frame. Then, the navigation state variable is cloned and augmented to the state vector, and the covariance matrix is also expanded for the cloned navigation state (Line 3, Box ④ of Fig. 1). Because the cloned navigation state is assigned a known constant epoch (19), unlike [7] or [43, (30)], we need not account for uncertainty of the time delay estimate when expanding the covariance for the cloned state. But we need to compute the measurement Jacobian relative to the time delays of the cameras in the update step (Line 12). In parallel, feature descriptors are extracted from images in the frame and matched to those in the last frame and two recent keyframes (Line 4, Boxes ① and ② of Fig. 1, and Fig. 2) as in the frontend of [8]. Feature tracks that disappear in the current frame will be used as measurements for the update.

A completed feature track with at least three observations is used to triangulate a point landmark, and all observations in the track provide reprojection errors to the filter (Lines 6-8). A completed feature track of two observations is discarded because it is not informative to the filter [14]. For an example landmark, the Jacobians of its observations relative to the state vector are visualized in Fig. 3. Since the landmark is not represented in the state vector, its Jacobian is canceled out by projecting all measurement Jacobians onto the left nullspace of the landmark's Jacobian [11]. The left nullspace is typically spanned by 2m-3 column vectors for a landmark well observed m times [40, 4.51], but has 2m-2 columns for a landmark at infinity because of the rank-deficient landmark Jacobian. This special case is considered in canceling out the landmark Jacobian and in performing the subsequent Mahalanobis gating test. To remove outliers of large measurement residuals, every reprojection measurement goes through a Mahalanobis gating test (Line 9) [9]. All valid residuals and Jacobians for these disappeared landmarks are stacked and used to update the state vector and the covariance matrix (Line 12, Box ⑤ of Fig. 1) [11].
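The landmark elimination and gating steps can be sketched as follows: the stacked residual and state Jacobian are projected onto the left nullspace of the landmark Jacobian (via SVD, which also handles the rank-deficient landmark-at-infinity case), and a chi-square gating test is applied. This is a generic illustration of the MSCKF-style update [11] with our own names, not the KSF code.

    import numpy as np
    from scipy.stats import chi2

    def eliminate_landmark(r, H_x, H_f):
        """Project the residual r and state Jacobian H_x onto the left nullspace of the
        landmark Jacobian H_f (SVD also handles the rank-2 landmark-at-infinity case)."""
        U, s, _ = np.linalg.svd(H_f)
        rank = int(np.sum(s > 1e-10))   # 3 for a well-observed landmark, 2 at infinity
        N = U[:, rank:]                 # columns spanning the left nullspace of H_f
        return N.T @ r, N.T @ H_x

    def mahalanobis_gate(r0, H0, P, sigma_px=1.0, prob=0.95):
        """Chi-square gating test on the projected residual r0 with Jacobian H0."""
        S = H0 @ P @ H0.T + (sigma_px ** 2) * np.eye(len(r0))
        return r0 @ np.linalg.solve(S, r0) < chi2.ppf(prob, df=len(r0))

    # Toy dimensions: 4 observations (8 residual rows), 21-D error state, 3-D landmark.
    rng = np.random.default_rng(1)
    r, H_x, H_f = rng.normal(size=8), rng.normal(size=(8, 21)), rng.normal(size=(8, 3))
    P = np.eye(21) * 1e-3
    r0, H0 = eliminate_landmark(r, H_x, H_f)
    print(r0.shape, mahalanobis_gate(r0, H0, P))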

To bound computation, redundant frames are selected and marginalized from the filter (Lines 13-15, Box ⑥ of Fig. 1) once the number of navigation state variables in the sliding window exceeds the total number of allowed old keyframes and temporally successive recent frames. In each marginalization operation, a minimum number of redundant frames (3 for the monocular case, 2 for multiple-camera setups) are chosen (Line 13) because two reprojection measurements of an unknown landmark are uninformative [14]. To meet this requirement, the redundant frames are chosen first among the recent non-keyframes, excluding the most recent frames, and secondly among the oldest keyframes. For the case of Fig. 2, the redundant frames for the latest frame would be one of the old keyframes and one of the recent non-keyframes.

With these redundant frames, we update the filter with observations of landmarks each observed more than twice in such frames (Line 14). For such a landmark, if it can be triangulated with its entire observation history, its observations in the redundant frames are used for Kalman update. All other observations in the redundant frames are discarded. After the update, corresponding state variables and entries in the covariance matrix for these redundant frames are removed (Line 15).
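A sketch of the redundant-frame selection logic described above; the window bookkeeping (frame ids and keyframe flags) and the protected-frame count are our own simplifications.

    def select_redundant_frames(window, min_redundant, num_protected_recent=3):
        """Pick at least `min_redundant` frames to marginalize from the sliding window.
        window: list of (frame_id, is_keyframe) ordered from oldest to newest.
        Non-keyframes outside the protected most recent frames are chosen first,
        then the oldest keyframes make up the remaining count."""
        protected = {fid for fid, _ in window[-num_protected_recent:]}
        redundant = [fid for fid, kf in window if not kf and fid not in protected]
        redundant = redundant[:min_redundant]
        for fid, kf in window:                  # oldest keyframes next, if needed
            if len(redundant) >= min_redundant:
                break
            if kf and fid not in redundant:
                redundant.append(fid)
        return redundant

    # Example: keyframes are marked True; the three newest frames are protected.
    window = [(1, True), (2, True), (3, False), (4, True), (5, False), (6, False), (7, False)]
    print(select_redundant_frames(window, min_redundant=3))   # [3, 1, 2]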

1:Initialize state and covariance
2:while a frame of images arrives do
3:     Propagate and augment state vector and covariance
4:     Extract descriptors and match to last frame and keyframes
5:     for feature tracks disappearing in the frame do
6:         Triangulate the point landmark
7:         Compute residuals and Jacobians of reprojection errors
8:         Cancel out Jacobians for the landmark
9:         Remove outliers with Mahalanobis test
10:     end for
11:     Form the whole residual vector and Jacobian
12:     Update state and covariance
13:     Detect redundant frames
14:     Update with reprojection errors in these frames
15:     Remove associated state variables and covariance entries
16:end while
Algorithm 1 Keyframe-based structureless filter workflow. Changes relative to MSCKF [11] are shown in italics.

III-G Consistency

Despite easy interpretation and intuitive covariance tuning, the chosen navigation error state is trajectory-dependent. According to [51], because the evolution of the chosen position error (22a) and velocity error (22b) depends on the trajectory of the navigation state [40, (A.27), (A.28)], naively using the last estimate of the state vector to evaluate Jacobians for (10), (13), and (14) will falsely eliminate the unobservable rotation about gravity ([9, 24]) and lead to optimistic covariance inconsistent with the actual estimation error.

To ensure consistency, Jacobians relative to navigation state variables appearing in the propagation (12) and the measurements (13) and (14) are evaluated at propagated values of position and velocity, i.e., "first estimates" [9], and at the last estimates of other variables. Unlike [7] and [6], we do not lock the evaluation points of rotation in computing Jacobians because our chosen rotation error (23) does not affect the nullspace corresponding to the unobservable directions.

Jacobians of all other state variables are evaluated at the last estimate of the state vector because they, e.g., IMU biases, time offsets, rolling shutter effect, do not affect the nullspace [51].

IV Simulation Study

We evaluate the KSF with simulation to show that the designed method reports covariance consistent to the actual estimation error and that the introduced sensor parameters are observable. The simulation evaluates the proposed KSF in terms of consistency and state estimation accuracy in comparison to a recent VIO method, OKVIS [8].

Error Metrics The estimator consistency is measured by the Normalized Estimation Error Squared (NEES) of components of the navigation state variable. The expected value of the NEES for a variable is its number of degrees of freedom, hence 3 for the position error, 3 for the orientation error, and 6 for the pose error. An inconsistent estimator will optimistically estimate the covariance, thus the computed NEES is much greater than its expected value [9]. Following [52, (3.7.6-1)], with N successful runs of an estimator, the NEES for position, orientation, and pose at epoch t is

\mathrm{NEES}_t = \frac{1}{N}\sum_{i=1}^{N} \delta\mathbf{x}_{i,t}^T \mathbf{P}_{i,t}^{-1} \delta\mathbf{x}_{i,t}  (31)

where \delta\mathbf{x}_{i,t} is the error of the relevant components in run i at epoch t and \mathbf{P}_{i,t} is its covariance.

The accuracy of the estimated state is measured by the Root Mean Square Error (RMSE) for components of the state vector. For a component, the RMSE at epoch t is computed by

\mathrm{RMSE}_t = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \delta\mathbf{x}_{i,t} \rVert^2}.  (32)
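For completeness, the two metrics can be computed per epoch over the Monte Carlo runs as in the sketch below, which follows the reconstructed (31)-(32); the array layouts are assumptions.

    import numpy as np

    def nees(errors, covariances):
        """Average Normalized Estimation Error Squared over N runs at one epoch.
        errors: (N, d) estimation errors, covariances: (N, d, d) reported covariances."""
        return np.mean([e @ np.linalg.solve(P, e) for e, P in zip(errors, covariances)])

    def rmse(errors):
        """Root Mean Square Error over N runs at one epoch for an (N, d) error array."""
        errors = np.asarray(errors)
        return np.sqrt(np.mean(np.sum(errors ** 2, axis=1)))

    # Sanity check: errors drawn from the reported covariance give NEES close to d = 3.
    rng = np.random.default_rng(2)
    P = np.diag([0.04, 0.04, 0.09])
    errors = rng.multivariate_normal(np.zeros(3), P, size=1000)
    covariances = np.repeat(P[None], 1000, axis=0)
    print(round(nees(errors, covariances), 2), round(rmse(errors), 3))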

Simulation Setup A scene with point landmarks distributed on four walls was simulated. A camera-IMU platform traversed the scene for five minutes with two motion types: a wavy circle (wave for short), and a yarn torus (Fig. 4). For the two motion types, the platform moved at average velocities of 1.26 m/s and 2.30 m/s, respectively.

TABLE II: The covariance of the zero-mean Gaussian distributions from which discrete noise samples are drawn, for the gyroscope and accelerometer bias white noise [53, (9) and (51)] and measurement white noise [53, (50)]. The covariance values depend on the IMU sampling frequency.
The simulated platform had a camera and an IMU. The camera captured images at 10 Hz. The rolling shutter effect was simulated by solving for (20) as a root of the projection model (13) with the Newton-Raphson method. For OKVIS, the rolling shutter effect was not added in simulation. The image observations were corrupted by radial-tangential distortion and white Gaussian noise of 1 pixel standard deviation in each dimension. The simulated inertial measurements were sampled at 100 Hz, corrupted by biases, white noise on the biases, and additive white noise. Discrete noise samples were drawn from the Gaussian distributions tabulated in Table II. These noise parameters were chosen to be realistic for the IMU of a consumer smartphone. The other parameters for generating the image and inertial data were set to their mean values provided in Table III.

TABLE III: The IMU and camera parameters in KSF were initialized with samples drawn from Gaussian distributions with the listed mean values and standard deviations at each dimension, covering the IMU parameters, the camera projection parameters, the radial-tangential distortion parameters, and the time offset and frame readout time of the camera.

Estimator Setup A simulation frontend was created to provide feature tracks to an estimator, i.e., the backend of KSF or OKVIS. The frontend associated observations of a landmark between consecutive frames and between current frame and the last keyframe. For torus / wave motion, the average feature track length was 5.8 / 6.8, and the average number of observed landmarks in an image was 40.5 / 60.5.

Each estimator was initialized with the true pose but a noisy velocity estimate corrupted with noise sampled from a Gaussian distribution. For KSF, the initial values for the camera and IMU parameters were drawn from the Gaussian distributions given in Table III. For OKVIS, the initial biases were drawn from the distributions given in Table III and the other parameters were locked to the mean values in Table III, except for the time delay and readout time of the camera, which were set to zero.

Fig. 4: The simulated scene with general wave motion (a) and torus motion (b) for five minutes.

Each estimator ran 1000 times for each motion type. All runs finished successfully, i.e., with the position error less than 100 m at the end.

Estimator Consistency We compared the NEES for KSF through wave motion with Jacobians naively evaluated at last estimates, KSF through wave motion with first estimate Jacobians (FEJ) by default, and KSF through torus motion with FEJ by default. The NEES for OKVIS was not computed because of rank deficient Jacobians in the estimator. The evolution of NEES is visualized in Fig. 5. The NEES values averaged over the last 10 seconds to smooth out jitters are tabulated in Table IV.

We had observed that the NEES could be reduced closer to its expected value by decreasing the variance of noise in sampling IMU readings while keeping the noise variance unchanged in the VIO estimator. Doing so is somewhat reasonable in the sense that the simulation setup has unmodeled quantization errors. However, for simplicity, the following results were obtained without weakening noise added to IMU readings.

Fig. 5: The history of NEES for poses estimated by KSF with naive Jacobians under wave motion (KSF-Wave-Naive), KSF with First Estimate Jacobians (FEJ) under wave motion (KSF-Wave), KSF with FEJ under torus motion (KSF-Torus), and OKVIS under torus motion (OKVIS-Torus). Expected values of NEES for position, orientation, and pose are 3, 3, and 6.

From curves in Fig. 5 and final values of NEES in Table IV, we see that the proposed estimator, KSF with FEJ, is consistent in terms of NEES. The results also indicate that NEES is affected to some extent by motion type as the motion that excited the IMU in diverse ways led NEESs closer to expected values.

                 NEES averaged over last 10 s              RMSE at start / end
                 Position (1)  Orientation (1)  Pose (1)   Position (m)  Orientation (deg)
Theoretic value  3             3                6          N/A           N/A
KSF-Naive Wave   11.28         146.94           150.80     0 / 0.72      0 / 8.48
KSF Wave         4.53          5.13             8.50       0 / 0.55      0 / 5.52
KSF Torus        3.32          3.11             6.46       0 / 0.43      0 / 4.46
OKVIS Torus      13630         46229            80863      0 / 1.16      0 / 15.07
TABLE IV: NEES and RMSE computed over 1000 runs for OKVIS with wave motion and no systematic error, KSF with naive Jacobians and wave motion, KSF with first estimate Jacobians (FEJ) and wave motion, and KSF with FEJ and torus motion. RMSE values are 0 at the start because poses were initialized with ground truth.

Estimation of Pose and Sensor Parameters From the final RMSE in Table IV, we see that KSF outperformed OKVIS in pose accuracy, and the inconsistent KSF with naively computed Jacobians led to worse pose estimates.

For KSF with wave and torus motion, the RMSEs of sensor parameters at different epochs are tabulated in Table V. For both motion types, all sensor parameter components decreased in uncertainty over time as observed in [12, 40], though at varying rates. The frame readout time converged in general but became stagnant after its estimate was within a radius of the true value.

The uncertainty of the IMU parameters decreased faster with torus motion than with the less diverse wave motion. Obviously, for wave motion (2nd row block of Table V), the accelerometer bias and gyroscope bias converged very slowly. Besides the simple motion, we suspected that the inclusion of the IMU parameters \mathbf{T}_g, \mathbf{T}_s, and \mathbf{T}_a in the estimator also slowed the convergence of the IMU biases. To verify this, we ran KSF through wave motion while locking the additional IMU parameters to their nominal values. The resultant RMSEs are given in the first row block of Table V. From the results, we see that when the IMU systematic errors \mathbf{T}_g, \mathbf{T}_s, and \mathbf{T}_a were removed, the estimates for the biases converged much faster, and many camera parameters also converged faster.

Test setup / Time (s)
Wave with IMU systematic errors fixed:
0.0 0.47 0.034 0.00 0.00 0.00 3.45 7.36 7.07 55.89 1.44 4.34 4.93
3.0 0.21 0.029 0.00 0.00 0.00 1.79 1.33 1.28 3.84 0.96 0.54 0.81
10.0 0.08 0.025 0.00 0.00 0.00 0.94 0.58 0.65 2.17 0.56 0.27 0.54
30.0 0.05 0.016 0.00 0.00 0.00 0.61 0.32 0.43 1.14 0.32 0.16 0.42
100.0 0.03 0.008 0.00 0.00 0.00 0.32 0.17 0.21 0.60 0.17 0.09 0.39
300.0 0.02 0.005 0.00 0.00 0.00 0.20 0.09 0.13 0.36 0.10 0.06 0.39
Wave 0.0 0.50 0.035 14.92 3.04 14.83 3.45 7.26 7.02 49.82 1.40 4.96 4.89
3.0 0.43 0.033 10.59 1.89 12.77 2.16 1.90 1.70 4.24 1.12 1.08 0.93
10.0 0.42 0.032 6.81 1.13 9.67 1.48 1.04 1.47 2.17 1.00 0.45 0.56
30.0 0.40 0.032 5.11 0.92 8.09 1.20 0.63 1.35 1.23 0.89 0.27 0.43
100.0 0.35 0.031 3.44 0.74 6.28 0.83 0.40 0.96 0.67 0.64 0.15 0.40
300.0 0.29 0.031 2.21 0.60 5.10 0.56 0.26 0.58 0.40 0.39 0.09 0.39
Torus 0.0 0.50 0.034 14.90 3.01 14.91 3.48 7.16 6.94 51.38 1.48 5.13 5.02
3.0 0.44 0.032 13.06 2.45 13.03 2.97 2.82 2.20 3.69 1.22 3.21 2.28
10.0 0.39 0.032 6.92 1.37 9.46 2.07 1.44 1.44 1.75 0.91 1.11 1.11
30.0 0.22 0.028 4.18 0.70 7.02 1.52 0.72 1.17 0.95 0.74 0.49 0.69
100.0 0.12 0.023 2.45 0.38 4.80 0.93 0.39 0.79 0.54 0.51 0.27 0.52
300.0 0.08 0.017 1.56 0.23 3.25 0.60 0.23 0.53 0.31 0.37 0.16 0.51
TABLE V: RMSE of sensor parameters at several epochs computed from 1000 simulation runs of KSF with first estimate Jacobians by default, in three settings: KSF through wave motion with the IMU parameters other than biases locked at nominal values, KSF through wave motion, and KSF through torus motion.

V Real-World Experiments

With real-world data, the proposed KSF was evaluated in four groups: sensitivity of pose accuracy to several design options, camera-IMU platform calibration in comparison to reference values, accuracy of motion estimation relative to recent VIO methods, and computation cost.

The pose accuracy was measured with the Root Mean Square (RMS) of the Absolute Translation Error (ATE) [54, (24)], the Relative Rotation Error (RRE) [54, (27)], and the Relative Translation Error (RTE) [54, (27)], as implemented by [54]. ATE in units of meters measures the overall deviation of the estimated pose sequence from the ground truth in translation. RRE and RTE (the latter in percent) quantify deviations from the ground truth over trajectory segments of fixed lengths, in rotation and translation, respectively. To mitigate the random behavior of a VIO estimator, results from three runs were used to compute the error metrics, since empirically metrics computed from three runs changed little, e.g., less than 0.05% in ATE. To compute the relative errors, RRE and RTE, distance intervals of [40, 60, 80, 100, 120] m were used as in the UZH-FPV dataset [55].

RRE and RTE are invariant to a rigid transformation of the pose sequence. However, ATE depends on the spatial alignment. As recommended in [54, Table I], we aligned the trajectory estimated by a VIO method to the ground truth with a transformation of a yaw-only rotation and a 3D translation. This 4-DOF alignment requires that poses in the estimated trajectory and the ground truth are expressed in world frames with the z-axis aligned to gravity. The first part is fulfilled since all VIO methods used for comparison estimate poses of the IMU sensor frame in a gravity-aligned world frame. The second part is fulfilled by the benchmarks, EuRoC [56] and TUM VI [57].
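The 4-DOF alignment can be sketched as below: a single yaw angle and a 3D translation are obtained in closed form from the horizontal cross-covariance of the two (gravity-aligned) position sequences. This is a simplified stand-in for the posyaw alignment of [54]; names are ours.

    import numpy as np

    def align_posyaw(est_pos, gt_pos):
        """Align an estimated position sequence to ground truth with a yaw rotation and
        a 3D translation; both trajectories are assumed gravity-aligned (z up)."""
        est_c = est_pos - est_pos.mean(axis=0)
        gt_c = gt_pos - gt_pos.mean(axis=0)
        C = gt_c[:, :2].T @ est_c[:, :2]                    # 2x2 horizontal cross-covariance
        yaw = np.arctan2(C[1, 0] - C[0, 1], C[0, 0] + C[1, 1])
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        t = gt_pos.mean(axis=0) - R @ est_pos.mean(axis=0)
        return (R @ est_pos.T).T + t, R, t

    # Toy check: ground truth is the estimate rotated by 30 degrees about z and shifted.
    rng = np.random.default_rng(3)
    est = rng.normal(size=(50, 3))
    ang = np.radians(30.0)
    Rz = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang), np.cos(ang), 0.0],
                   [0.0, 0.0, 1.0]])
    gt = (Rz @ est.T).T + np.array([1.0, -2.0, 0.5])
    aligned, _, _ = align_posyaw(est, gt)
    print(np.allclose(aligned, gt))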

For the following tests, we initialized a VIO estimator with fixed standard deviations for the position, orientation error, velocity, gyroscope bias, and accelerometer bias.

V-A Sensitivity Study

By tuning KSF on the EuRoC dataset, we studied its dependence on the feature tracking method, its sensitivity to key parameters including the number of keyframes and the number of temporally successive frames, and the effect of landmark parameterization and sensor calibration on pose estimation. The entire EuRoC dataset [56] without modification was used because it is less demanding than other benchmarks, e.g., TUM VI [57], and we feel that a more challenging dataset may result in parameters that are not well suited to average data.

We tested the parameter sensitivity of KSF in the monocular setting because most findings are expected to carry over to multi-camera settings. By default, KSF processed data synchronously in a non-real-time fashion.

For each subsequent sensitivity test, we began with a manually selected baseline combination of parameters which was expected to achieve good pose estimation accuracy based on our experiences. Then we ran KSF with components or parameters of interest differing from the baseline. In the end, the optimal component or parameter combination was identified by looking at the pose estimation accuracy of different combinations.

Feature Tracking Method To show the merit of keyframe-based feature matching, we assessed KSF with alternative feature association methods including frame-wise KLT feature tracking [29] (hence KSF-KLT), and sequential frame matching by BRISK descriptors (hence KSF-seq). The relative errors are summarized in Table VI, confirming the advantage of keyframe-based feature association. A visual check of the superimposed trajectory plots indicated that both KSF-KLT and KSF-seq often had jumps in trajectory when the Micro Aerial Vehicle (MAV) was nearly stationary. This was expected because these two alternatives got mainly uninformative feature observations of small disparities during standstill.

RTE (%) RRE ()
KSF 0.73 0.030
KSF-KLT 3.76 0.053
KSF-seq 0.87 0.034
KSF-3-3 3.42 0.047
KSF-5-5 0.81 0.032
KSF-7-5 0.72 0.030
KSF-7-7 0.73 0.031
KSF-9-5 0.71 0.030
KSF-9-9 0.69 0.031
TABLE VI: Relative errors of KSF on EuRoC dataset with keyframe-based, KLT, and sequential feature association and of KSF with varying number of keyframes (first number after dash) and temporal frames (last number after dash).

Number of Keyframes and Temporal Frames The backend sliding window consists of keyframes and recent temporal frames. A larger sliding window is expected to give better pose estimates but incurs higher computation, as shown in Table VI. To balance accuracy and resource consumption, we selected 7 keyframes and 5 temporal frames for KSF in subsequent tests.

Lock Sensor Parameters As the proposed method involves quite a few sensor parameters, we evaluated the effect of calibrating sensor parameters on pose accuracy. Starting from an extreme baseline, unlocking (or locking) all sensor parameters, we ran KSF with only one component of the sensor parameters locked (or unlocked). Remarkably, the IMU biases were always estimated. Then the relative errors for results of different calibration settings were computed (Table VII). From the first half of Table VII, we see that locking IMU parameters or camera intrinsic parameters can improve pose accuracy, while locking camera extrinsic parameters has a slight impact, and locking camera temporal parameters worsens the accuracy. From the second half of Table VII, we see that estimating IMU or camera intrinsic parameters adversely affects the results, but estimating camera extrinsic or temporal parameters improves the accuracy. Interestingly, Yang et al. [58, Table V] reported that online IMU calibration marginally worsens RMS ATE values for many EuRoC sequences, and they reasoned that the MAV motion is not diverse enough. In summary, for the EuRoC dataset, locking IMU and camera intrinsic parameters while calibrating camera extrinsic and temporal parameters led to better pose accuracy. This finding is likely due to the good individual calibration of the IMU and cameras used for capturing the EuRoC dataset.

Setting              | RTE (%) | RRE
KSF unlock all       | 0.79    | 0.037
KSF lock IMU         | 0.73    | 0.031
KSF lock extrinsic   | 0.78    | 0.036
KSF lock intrinsic   | 0.72    | 0.031
KSF lock temporal    | 0.87    | 0.036
KSF lock all         | 0.71    | 0.029
KSF unlock IMU       | 0.75    | 0.032
KSF unlock extrinsic | 0.69    | 0.030
KSF unlock intrinsic | 0.73    | 0.029
KSF unlock temporal  | 0.70    | 0.029
TABLE VII: Relative errors of KSF on the EuRoC dataset when estimating subsets of sensor parameters, including IMU systematic error parameters, camera extrinsic, intrinsic (projection and distortion), and temporal (time delay and frame readout time) parameters.
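In practice, these lock/unlock settings amount to choosing which parameter blocks the filter updates; a locked block can, for example, be given zero prior uncertainty and zero process noise so that it stays at its initial value. A hypothetical configuration sketch (the names below are illustrative, not KSF's actual interface) for the best EuRoC setting is:

```python
from dataclasses import dataclass

@dataclass
class CalibrationSwitches:
    """Which sensor parameter groups are estimated online (unlocked).

    A locked group keeps its initial value, e.g., by assigning it zero prior
    uncertainty and zero process noise. IMU biases are always estimated.
    """
    estimate_imu_systematic_errors: bool = False  # scale/misalignment terms
    estimate_camera_extrinsics: bool = True
    estimate_camera_intrinsics: bool = False      # projection and distortion
    estimate_camera_temporal: bool = True         # time delay and readout time

# Setting corresponding to the best EuRoC result in Table VII.
euroc_setting = CalibrationSwitches()
```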

Landmark Parameterization It was observed in [9, 59] that the anchored inverse depth (AID) parameterization (14) for landmarks outperforms the homogeneous point (HP) parameterization (13). This was confirmed in our tests by the relative errors for KSF with AID (the default) versus with HP in Table VIII.
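As a rough illustration of the two options (not a restatement of (13) and (14)), an AID landmark stores normalized image coordinates and an inverse depth in an anchor camera frame, whereas an HP landmark is a homogeneous 4-vector in the world frame. A minimal sketch, assuming standard conventions and our own function names, of recovering a Euclidean world point from each is:

```python
import numpy as np

def world_point_from_aid(alpha, beta, rho, R_WA, p_WA):
    """Anchored inverse depth: (alpha, beta) are normalized image coordinates
    and rho is the inverse depth, all expressed in the anchor camera frame A;
    (R_WA, p_WA) is the anchor camera pose in the world frame W."""
    p_A = np.array([alpha, beta, 1.0]) / rho  # back-project in the anchor frame
    return R_WA @ p_A + p_WA                  # transform to the world frame

def world_point_from_hp(h_W):
    """Homogeneous point h_W = (x, y, z, w) expressed in the world frame."""
    return h_W[:3] / h_W[3]
```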

N-Camera Setup and Real-Time Processing We also evaluated KSF with input from a stereo camera (KSF-n), KSF running in asynchronous mode (KSF-async), i.e., real-time processing, and KSF-n in asynchronous mode (KSF-n-async). The relative errors are given in Table VIII, where we see the benefit of stereo cameras in reducing the relative translation error. Surprisingly, asynchronous processing led to slightly better results than synchronous processing, but this trend was not observed on other benchmarks, e.g., TUM VI. Aside from the nondeterministic behavior of the algorithm, another possible cause is that some of the abundant feature observations in the EuRoC dataset are redundant, so discarding a few of them may even benefit pose accuracy (see also [6]).

Setting     | RTE (%) | RRE
KSF         | 0.68    | 0.031
KSF-HP      | 0.71    | 0.029
KSF-async   | 0.67    | 0.030
KSF-n-async | 0.39    | 0.028
KSF-n       | 0.40    | 0.029
TABLE VIII: Pose accuracy of KSF on the EuRoC dataset with an alternative landmark parameterization, the homogeneous point (HP) in the world frame, with asynchronous real-time processing (async), with asynchronous processing of stereo images (n-async), and with non-real-time processing of stereo images (n). By default, KSF processes images from a single camera synchronously with the anchored inverse depth parameterization.

V-B Sensor Calibration

To show the capability of KSF in estimating the extrinsic, intrinsic, and time delay parameters of an N-camera-IMU platform, KSF was used to process several raw sequences of TUM VI [57], and the estimated sensor parameters were compared to the provided reference values. Moreover, to show that the frame readout time can also be calibrated, sequences captured by a few Android phones were processed with KSF and the estimated readout times were compared to values obtained with an LED panel.

Stereo-Camera-IMU Calibration The TUM VI benchmark provides raw sequences collected by a stereo-camera-IMU platform subject to a variety of systematic sensor errors. On these raw sequences with diverse motion, KSF was expected to give good estimates of these systematic errors. The systematic errors of the platform had also been estimated by the dataset authors using dedicated calibration sequences, and reference values of the relevant parameters were provided. Thus, by comparing the reference values with the estimates from KSF, we can assess KSF's capability for sensor calibration.

To this end, we chose six room sessions of the raw TUM VI sequences (each about two minutes long) because they contain rich motion and ground truth covering the full trajectory. Images in these raw data were downsampled with pyrDown [29] to 512×512. Both images and IMU data were timestamped with the header.stamp field of the raw messages. Otherwise, the raw data were left intact. This contrasts with [39], where the initial IMU biases in the raw data were corrected before online calibration.
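For instance, halving the raw 1024×1024 TUM VI images to 512×512 can be done with OpenCV's pyrDown, which blurs with a Gaussian kernel and then drops every other row and column (the file path below is illustrative):

```python
import cv2

# Downsample one raw TUM VI frame from 1024x1024 to 512x512 before feeding it
# to the estimator; pyrDown blurs and then decimates by a factor of two.
raw = cv2.imread("cam0/data/example_frame.png", cv2.IMREAD_GRAYSCALE)
half = cv2.pyrDown(raw)  # shape (512, 512) for a 1024x1024 input
```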

KSF was initialized with nominal values and standard deviations for sensor parameters as listed in Column 2 of Table IX. Because the sensor pixels are usually square, the same focal length was used for both the x and y directions of each camera. The camera intrinsic parameters were initialized assuming only rough knowledge of their values. For instance, the focal length was initialized with the rounded value 190, and the principal points were simply initialized at the image centers, i.e., (256, 256). The IMU noise parameters were set to the same values as for the stereo KSF used to process the calibrated TUM VI sequences in Section V-C. Each sequence was processed three times to mitigate the nondeterministic behavior of the algorithm. With six raw sequences and three runs each, we computed the means and standard deviations of the estimated sensor parameters at the end of these 18 trials. The estimated parameters, including IMU biases and systematic errors and camera extrinsic, intrinsic, and temporal parameters, are given in Table IX.

The calibration parameters provided with the TUM VI dataset were obtained by a global bundle adjustment on special data captured while observing a calibration target. These parameters follow the IMU and camera extrinsic models defined by TUM VI. For comparison, they need to be converted to the parameters of the IMU and camera extrinsic models used by KSF. The conversion, derived by comparing the TUM VI models with ours, maps the TUM VI IMU model coefficient matrices, IMU biases, and extrinsic parameters of the two cameras in the TUM VI sensor frame to the corresponding KSF parameters via (33)-(40),

where the rotation in (41) is the nominal orientation between the camera frame and the IMU frame. It appears in the conversion because, when processing the raw TUM VI sequences, KSF fixes the orientation of the body frame relative to the camera frame to this nominal value, i.e., it starts from a coarse calibration.
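While (33)-(40) depend on the specific IMU and extrinsic model definitions, the camera-IMU extrinsic part of such a conversion reduces to composing and inverting rigid-body transforms. A generic sketch, assuming 4x4 homogeneous transforms of the two cameras in the TUM VI sensor (IMU) frame and our own helper names, is:

```python
import numpy as np

def invert_se3(T):
    """Invert a 4x4 homogeneous rigid-body transform."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def imu_in_camera_frames(T_imu_cam0, T_imu_cam1):
    """Re-express camera-in-IMU extrinsics as IMU-in-camera extrinsics, the
    kind of convention change needed when two tools define the camera-IMU
    extrinsics in opposite directions."""
    return invert_se3(T_imu_cam0), invert_se3(T_imu_cam1)
```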

Parameter                        | Initial Value | Final Value (calibrate all) | Final Value (fix camera intrinsics) | TUM VI Ref.
IMU bias (gyro.), x              |               | 2.08 ± 0.31                 | 1.97 ± 0.21                         | 1.72
IMU bias (gyro.), y              |               | 0.47 ± 0.13                 | 0.47 ± 0.14                         | 0.38
IMU bias (gyro.), z              |               | 1.19 ± 0.36                 | 1.08 ± 0.30                         | 0.93
IMU bias (accel.), x             |               | -1.26 ± 0.04                | -1.28 ± 0.04                        | -1.30
IMU bias (accel.), y             |               | -0.30 ± 0.05                | -0.31 ± 0.04                        | -0.39
IMU bias (accel.), z             |               | 0.44 ± 0.05                 | 0.41 ± 0.03                         | 0.38
Gyro. error matrix, entry 1      |               | 58.27 ± 1.91                | 58.47 ± 1.26                        | 59.30
Gyro. error matrix, entry 2      |               | 27.90 ± 3.67                | 30.53 ± 1.47                        | 30.09
Gyro. error matrix, entry 3      |               | -7.44 ± 1.10                | -8.33 ± 0.58                        | -8.80
Gyro. error matrix, entry 4      |               | -24.78 ± 3.25               | -26.95 ± 1.15                       | -27.24
Gyro. error matrix, entry 5      |               | -87.05 ± 0.86               | -87.69 ± 0.69                       | -87.09
Gyro. error matrix, entry 6      |               | 30.46 ± 2.33                | 35.48 ± 2.67                        | 34.08
Gyro. error matrix, entry 7      |               | 7.87 ± 0.99                 | 9.51 ± 1.05                         | 10.38
Gyro. error matrix, entry 8      |               | -36.82 ± 3.31               | -43.92 ± 3.29                       | -41.27
Gyro. error matrix, entry 9      |               | -18.05 ± 1.34               | -18.45 ± 0.83                       | -16.54
g-sensitivity matrix, entry 1    |               | 0.49 ± 0.38                 | 0.29 ± 0.22                         | 0.00
g-sensitivity matrix, entry 2    |               | 0.05 ± 0.10                 | 0.07 ± 0.14                         | 0.00
g-sensitivity matrix, entry 3    |               | -0.67 ± 0.56                | -0.47 ± 0.39                        | 0.00
g-sensitivity matrix, entry 4    |               | -0.05 ± 0.14                | -0.11 ± 0.13                        | 0.00
g-sensitivity matrix, entry 5    |               | 0.19 ± 0.16                 | 0.05 ± 0.12                         | 0.00
g-sensitivity matrix, entry 6    |               | -0.11 ± 0.25                | -0.12 ± 0.26                        | 0.00
g-sensitivity matrix, entry 7    |               | 0.43 ± 0.33                 | 0.12 ± 0.26                         | 0.00
g-sensitivity matrix, entry 8    |               | 0.10 ± 0.22                 | 0.03 ± 0.20                         | 0.00
g-sensitivity matrix, entry 9    |               | -0.44 ± 0.71                | -0.26 ± 0.57                        | 0.00
Accel. error matrix, entry 1     |               | -6.74 ± 1.69                | -7.76 ± 1.91                        | -4.67
Accel. error matrix, entry 2     |               | 21.72 ± 5.08                | 25.82 ± 3.08                        | 29.59
Accel. error matrix, entry 3     |               | -10.89 ± 3.09               | -9.58 ± 3.33                        | -7.41
Accel. error matrix, entry 4     |               | -25.13 ± 4.36               | -27.79 ± 3.46                       | -29.32
Accel. error matrix, entry 5     |               | -4.35 ± 2.26                | -3.92 ± 2.20                        | -2.39
Accel. error matrix, entry 6     |               | 22.43 ± 6.61                | 29.00 ± 5.22                        | 34.55
Accel. error matrix, entry 7     |               | 20.23 ± 3.77                | 18.90 ± 2.94                        | 18.71
Accel. error matrix, entry 8     |               | -30.40 ± 4.16               | -35.93 ± 3.76                       | -34.10
Accel. error matrix, entry 9     |               | 20.70 ± 5.04                | 23.66 ± 3.00                        | 29.75
Cam. 0 extrinsic translation, x  |               | 3.82 ± 0.91                 | 3.82 ± 0.32                         | 4.71
Cam. 0 extrinsic translation, y  |               | -5.40 ± 0.67                | -4.60 ± 0.55                        | -4.76
Cam. 0 extrinsic translation, z  |               | -4.57 ± 1.31                | -8.00 ± 0.46                        | -6.90
Cam. 0 intrinsic offset, f       |               | 2.72 ± 0.87                 | 0.00 ± 0.00                         | 0.98
Cam. 0 intrinsic offset, c_x     |               | -0.58 ± 1.08                | 0.00 ± 0.00                         | -1.07
Cam. 0 intrinsic offset, c_y     |               | 1.53 ± 0.59                 | 0.00 ± 0.00                         | 0.90