I Introduction
In GPS-denied environments, for instance indoors and in urban canyons, it is essential for mobile robot platforms such as micro aerial vehicles (MAVs) to know their own pose. In recent years, this pose estimation problem has commonly been solved by combining visual information from cameras with measurements from an Inertial Measurement Unit (IMU), an approach usually referred to as Visual-Inertial Odometry (VIO) [1, 2]. Compared to lidar-based approaches, VIO requires only a lightweight and small-size sensor package, making it the preferred technique for mobile robot platforms.
Real-time VIO solutions are critical for robots to respond quickly to their environments. At the same time, the computational efficiency of VIO solutions is also important for robots with limited payload and computational power, such as MAVs. It is well known that filter-based VIO solutions are generally more computationally efficient than optimization-based methods [1], while stereo cameras are more robust to hostile environments than a monocular camera. To solve the pose estimation problem in a real-time and robust manner, we propose in this paper a new VIO solution based on the extended Kalman filter (EKF) and information from stereo cameras.
To achieve precise and robust pose estimation, the algorithm we present here is a dual-stage EKF-based VIO solution, which is referred to as DS-VIO. It consists of two EKF filters, one per stage. We demonstrate that our algorithm achieves comparable or even higher precision than other state-of-the-art VIO algorithms. The primary contribution of our work is a new filter-based VIO framework built on a dual-stage EKF. The first stage fuses the measurements of the accelerometer and gyroscope. The second stage fuses the measurements of the IMU and the stereo camera. The dual-stage EKF allows the VIO solution to achieve higher precision and robustness. To show these effects, we provide detailed experimental comparisons between our DS-VIO and other state-of-the-art open-source VIO approaches, including OKVIS, ROVIO, VINS-MONO and S-MSCKF, on the EuRoC dataset. The results demonstrate that our DS-VIO achieves similar or even better performance than these state-of-the-art VIO approaches in terms of precision and robustness.
The rest of the paper is organized as follows: Section II introduces the related work. Sections III and IV provide the mathematical details of the first and second stages of the EKF filter. Section V describes the mechanism of the dual-stage EKF filter. Section VI presents experiments comparing our DS-VIO with state-of-the-art open-source VIO approaches. Finally, Section VII draws conclusions.
II Related Work
A large number of works fuse measurements from cameras and IMUs to perform pose estimation, including VIO and VI-SLAM (Visual-Inertial Simultaneous Localization and Mapping). We discuss the related work from three different aspects: (i) loosely versus tightly coupled solutions, (ii) filter-based versus optimization-based solutions, and (iii) direct versus indirect solutions.
Loosely coupled solutions process the measurements from IMUs and cameras separately. Some methods first process the images to compute relative-motion estimates between consecutive poses [3, 4, 5] and then fuse the result with IMU measurements. In contrast, in [6, 7], IMU measurements are used to compute rotation estimates, which are then fused with a visual estimation algorithm. Loosely coupled solutions reduce computational cost, but at the price of information loss [8]. Tightly coupled solutions, on the other hand, can achieve higher accuracy [9, 10, 11, 12, 13]. In this work, we are interested in tightly coupled solutions, and the proposed DS-VIO is also a tightly coupled solution.
Existing tightly coupled solutions can be divided into filter-based and optimization-based solutions. The latter generally attain higher accuracy because relinearization at each iteration can better handle their nonlinear measurement models [14, 15, 12, 16]. However, the multiple iterations lead to a heavy computational burden. As for filter-based solutions, the majority of the proposed algorithms to date are EKF-based methods [17, 18, 19, 20, 11, 21, 13]; an Unscented Kalman Filter is used in [22] and a Particle Filter in [23]. To improve the accuracy of filter-based solutions, some researchers focus on the consistency of the estimator. The works in [24, 25] present the First Estimate Jacobian EKF (FEJ-EKF), while the Observability Constrained EKF (OC-EKF) is presented in [26, 27, 13]. The key idea of the FEJ-EKF is to choose the same, first-available estimates of all state variables as the linearization points. The OC-EKF, in contrast, guarantees observability of the linearized system by ensuring that the rank of the nullspace of the nonlinear system is unchanged after linearization. The OC-EKF is applied in our DS-VIO algorithm.
The vast majority of pose estimation methods rely on point features, which are detected and tracked in the images [11, 12]. The methods in [28, 29] employ edge information, while the method in [2] employs both point and line features. All the above methods are classified as indirect solutions. The methods in [30, 31] present direct methods, which employ the image-intensity measurements directly. Direct methods exploit more information from the images than indirect methods do.

III First Stage EKF Filter Description
The mathematical description in this paper follows the formulations in [11]. The gyroscope state is defined as:

$\mathbf{x}_g = \begin{bmatrix} {}^I_G\bar{q}^T & \mathbf{b}_g^T \end{bmatrix}^T$

where ${}^I_G\bar{q}$ is the unit quaternion describing the rotation from the inertial frame $\{G\}$ to the body frame $\{I\}$, and $\mathbf{b}_g$ describes the bias of the gyroscope measurement. Here, the body frame $\{I\}$ is assumed to be fixed to the IMU frame.
III-A Process Model
The state equation in continuous time is given as follows:

(1a) ${}^I_G\dot{\bar{q}}(t) = \frac{1}{2}\,\Omega(\boldsymbol{\omega}(t))\,{}^I_G\bar{q}(t)$

(1b) $\dot{\mathbf{b}}_g(t) = \mathbf{n}_{wg}(t)$

where $\boldsymbol{\omega}$ is the rotational velocity in the IMU frame, $\Omega(\boldsymbol{\omega}) = \begin{bmatrix} -\lfloor\boldsymbol{\omega}\times\rfloor & \boldsymbol{\omega} \\ -\boldsymbol{\omega}^T & 0 \end{bmatrix}$, and the gyroscope bias $\mathbf{b}_g$ is modeled as a random walk process with random walk rate $\mathbf{n}_{wg}$. The gyroscope measurement is:

(2) $\boldsymbol{\omega}_m = \boldsymbol{\omega} + \mathbf{b}_g + \mathbf{n}_g$
By applying the expectation operator to (1), we obtain the equations for propagating the estimate of the gyroscope state:

(3a) ${}^I_G\dot{\hat{\bar{q}}} = \frac{1}{2}\,\Omega(\hat{\boldsymbol{\omega}})\,{}^I_G\hat{\bar{q}}$

(3b) $\dot{\hat{\mathbf{b}}}_g = \mathbf{0}_{3\times 1}$

where $\hat{\boldsymbol{\omega}} = \boldsymbol{\omega}_m - \hat{\mathbf{b}}_g$.
To propagate the uncertainty of the state, the discrete-time state transition matrix must be computed. For (3a),

${}^I_G\hat{\bar{q}}_{k+1} = \exp\!\left(\tfrac{1}{2}\,\Omega(\hat{\boldsymbol{\omega}})\,\Delta t\right)\,{}^I_G\hat{\bar{q}}_k$

where $\Delta t$ is the time interval between two IMU measurements. For (3b), in a similar manner, one has $\hat{\mathbf{b}}_{g,k+1} = \hat{\mathbf{b}}_{g,k}$. Therefore, one can obtain the following expression for the error state:

(4) $\tilde{\mathbf{x}}_{g,k+1} = \boldsymbol{\Phi}_k\,\tilde{\mathbf{x}}_{g,k} + \mathbf{n}_k$

where $\boldsymbol{\Phi}_k$ is the Jacobian of the process model with respect to the gyroscope state, which is shown in the Appendix, and $\mathbf{n}_k$ is a vector describing the process noise, with covariance matrix $\mathbf{Q}_k = \mathbb{E}[\mathbf{n}_k\mathbf{n}_k^T]$. Finally, the propagated covariance of the gyroscope state is:

$\mathbf{P}_{k+1|k} = \boldsymbol{\Phi}_k\,\mathbf{P}_{k|k}\,\boldsymbol{\Phi}_k^T + \mathbf{Q}_k$
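The propagation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the zeroth-order quaternion integrator, the scalar-last quaternion convention, and the error-state ordering $[\delta\boldsymbol{\theta},\,\tilde{\mathbf{b}}_g]$ are assumptions following the standard formulation of [11].

```python
import numpy as np

def skew(v):
    """Skew-symmetric cross-product matrix (v x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def omega(w):
    """4x4 Omega(w) matrix for quaternion kinematics, scalar-last convention."""
    O = np.zeros((4, 4))
    O[:3, :3] = -skew(w)
    O[:3, 3] = w
    O[3, :3] = -w
    return O

def propagate_gyro_state(q, b_g, w_m, P, Q, dt):
    """Propagate the gyroscope state (q, b_g) and its 6x6 covariance P over dt.
    w_m: gyro measurement; Q: continuous-time process noise covariance (6x6)."""
    w_hat = w_m - b_g                                  # bias-corrected rate, eq. (3)
    # First-order approximation of q_{k+1} = exp(0.5*Omega(w)*dt) q_k
    q_new = (np.eye(4) + 0.5 * omega(w_hat) * dt) @ q
    q_new /= np.linalg.norm(q_new)                     # re-normalize the quaternion
    # Error-state dynamics: d(dtheta) = -(w x) dtheta - db_g, d(db_g) = n_wg
    F = np.zeros((6, 6))
    F[:3, :3] = -skew(w_hat)
    F[:3, 3:] = -np.eye(3)
    Phi = np.eye(6) + F * dt                           # discrete-time transition, eq. (4)
    P_new = Phi @ P @ Phi.T + Q * dt
    return q_new, b_g.copy(), P_new
```

The bias estimate is unchanged during propagation, consistent with (3b); only its covariance grows through the process noise.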
III-B Measurement Model
The first correction stage uses data from the accelerometer to correct the gyroscope state. The measurement is

(5) $\mathbf{z}_1 = \mathbf{a}_m$

According to [32], $\mathbf{a}_m$ can be represented by

$\mathbf{a}_m = C({}^I_G\bar{q})\,\mathbf{g} + \mathbf{n}_a$

where $\mathbf{g}$ is the constant g-force acceleration, and the predicted measurement is

(6) $\hat{\mathbf{z}}_1 = C({}^I_G\hat{\bar{q}})\,\mathbf{g}$

The residual of the measurement can be approximated as

(7) $\mathbf{r}_1 = \mathbf{z}_1 - \hat{\mathbf{z}}_1 \approx \mathbf{H}_1\,\tilde{\mathbf{x}}_g + \mathbf{n}_a$

where $\mathbf{H}_1$ is the Jacobian of the measurement with respect to the gyroscope state, which is shown in the Appendix, and $\mathbf{n}_a$ describes the measurement noise, with covariance matrix $\mathbf{R}_a$.
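A minimal sketch of this gravity-referenced update follows. It is illustrative rather than the authors' code: the rotation convention, the attitude Jacobian $\lfloor\hat{\mathbf{z}}_1\times\rfloor$, the noise covariance `R_a`, and the small-angle quaternion correction are all assumptions in the spirit of [32].

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def quat_to_rot(q):
    """Rotation matrix C(q) mapping global to body frame, scalar-last q."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y + z*w),     2*(x*z - y*w)],
        [2*(x*y - z*w),     1 - 2*(x*x + z*z), 2*(y*z + x*w)],
        [2*(x*z + y*w),     2*(y*z - x*w),     1 - 2*(x*x + y*y)],
    ])

def accel_update(q, b_g, P, a_m, R_a, g=9.81):
    """First-stage EKF update: treat the accelerometer as a gravity reference."""
    z_hat = quat_to_rot(q) @ np.array([0.0, 0.0, g])   # predicted gravity in body frame
    r = a_m - z_hat                                    # residual, cf. eq. (7)
    H = np.hstack([skew(z_hat), np.zeros((3, 3))])     # Jacobian wrt [dtheta, db_g]
    S = H @ P @ H.T + R_a                              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                     # Kalman gain
    dx = K @ r
    dq = np.concatenate([0.5 * dx[:3], [1.0]])         # small-angle error quaternion
    x1, y1, z1, w1 = dq                                # quaternion product dq (x) q
    x2, y2, z2, w2 = q
    q_new = np.array([w1*x2 + x1*w2 + y1*z2 - z1*y2,
                      w1*y2 - x1*z2 + y1*w2 + z1*x2,
                      w1*z2 + x1*y2 - y1*x2 + z1*w2,
                      w1*w2 - x1*x2 - y1*y2 - z1*z2])
    q_new /= np.linalg.norm(q_new)
    return q_new, b_g + dx[3:], (np.eye(6) - K @ H) @ P
```

When the body is unaccelerated and the attitude estimate is exact, the residual vanishes and the state is left unchanged while the covariance still contracts.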
IV Second Stage EKF Filter Description
The evolving IMU state is defined as follows:

$\mathbf{x}_I = \begin{bmatrix} {}^I_G\bar{q}^T & \mathbf{b}_g^T & {}^G\mathbf{v}_I^T & \mathbf{b}_a^T & {}^G\mathbf{p}_I^T \end{bmatrix}^T$

where ${}^G\mathbf{p}_I$ and ${}^G\mathbf{v}_I$ are vectors describing the IMU position and velocity in frame $\{G\}$, and $\mathbf{b}_a$ describes the bias of the accelerometer measurement. The bias is modeled as a random walk process with random walk rate $\mathbf{n}_{wa}$. The IMU error state is defined as:

$\tilde{\mathbf{x}}_I = \begin{bmatrix} \delta\boldsymbol{\theta}^T & \tilde{\mathbf{b}}_g^T & {}^G\tilde{\mathbf{v}}_I^T & \tilde{\mathbf{b}}_a^T & {}^G\tilde{\mathbf{p}}_I^T \end{bmatrix}^T$

where the standard additive error, defined as $\tilde{x} = x - \hat{x}$, is used for position, velocity and biases, and $\delta\boldsymbol{\theta}$ represents the quaternion error [11, 33]. Finally, we add the camera poses to the IMU state to obtain the EKF state vector, which at time step $k$ is defined as:

$\mathbf{x}_k = \begin{bmatrix} \mathbf{x}_I^T & {}^{C_1}_G\bar{q}^T & {}^G\mathbf{p}_{C_1}^T & \cdots & {}^{C_N}_G\bar{q}^T & {}^G\mathbf{p}_{C_N}^T \end{bmatrix}^T$

where ${}^{C_i}_G\bar{q}$ and ${}^G\mathbf{p}_{C_i}$, $i = 1,\dots,N$, are the camera attitudes and positions, respectively. The EKF error state vector is defined analogously as:

$\tilde{\mathbf{x}}_k = \begin{bmatrix} \tilde{\mathbf{x}}_I^T & \delta\boldsymbol{\theta}_{C_1}^T & {}^G\tilde{\mathbf{p}}_{C_1}^T & \cdots & \delta\boldsymbol{\theta}_{C_N}^T & {}^G\tilde{\mathbf{p}}_{C_N}^T \end{bmatrix}^T$
IV-A Process Model
The continuous-time system model is described as:

(8a) ${}^I_G\dot{\bar{q}} = \frac{1}{2}\,\Omega(\boldsymbol{\omega})\,{}^I_G\bar{q}$

(8b) $\dot{\mathbf{b}}_g = \mathbf{n}_{wg}$

(8c) ${}^G\dot{\mathbf{v}}_I = {}^G\mathbf{a}$

(8d) $\dot{\mathbf{b}}_a = \mathbf{n}_{wa}$

(8e) ${}^G\dot{\mathbf{p}}_I = {}^G\mathbf{v}_I$

where ${}^G\mathbf{a}$ is the body acceleration in the global frame $\{G\}$. The accelerometer measurement is:

$\mathbf{a}_m = C({}^I_G\bar{q})\,({}^G\mathbf{a} - {}^G\mathbf{g}) + \mathbf{b}_a + \mathbf{n}_a$

where $C(\cdot)$ denotes a rotation matrix, $\mathbf{n}_a$ and $\mathbf{n}_{wa}$ are measurement noises, and $C({}^I_G\bar{q})\,{}^G\mathbf{g}$ is the gravitational acceleration expressed in the local frame. As in (1) to (4), we can obtain the equations for propagating the estimate of the evolving IMU state by applying the expectation operator:
(9a) ${}^I_G\dot{\hat{\bar{q}}} = \frac{1}{2}\,\Omega(\hat{\boldsymbol{\omega}})\,{}^I_G\hat{\bar{q}}$

(9b) $\dot{\hat{\mathbf{b}}}_g = \mathbf{0}_{3\times 1}$

(9c) ${}^G\dot{\hat{\mathbf{v}}}_I = C({}^I_G\hat{\bar{q}})^T\,\hat{\mathbf{a}} + {}^G\mathbf{g}$

(9d) $\dot{\hat{\mathbf{b}}}_a = \mathbf{0}_{3\times 1}$

(9e) ${}^G\dot{\hat{\mathbf{p}}}_I = {}^G\hat{\mathbf{v}}_I$

where $\hat{\mathbf{a}} = \mathbf{a}_m - \hat{\mathbf{b}}_a$. Therefore, the linearized continuous-time model for the IMU error state is:

(10) $\dot{\tilde{\mathbf{x}}}_I = \mathbf{F}\,\tilde{\mathbf{x}}_I + \mathbf{G}\,\mathbf{n}_I$
where $\mathbf{n}_I = \begin{bmatrix} \mathbf{n}_g^T & \mathbf{n}_{wg}^T & \mathbf{n}_a^T & \mathbf{n}_{wa}^T \end{bmatrix}^T$, and $\mathbf{F}$ and $\mathbf{G}$ are shown in the Appendix. The discrete-time state equation can be computed as follows:

(11) $\tilde{\mathbf{x}}_{I,k+1} = \boldsymbol{\Phi}_k\,\tilde{\mathbf{x}}_{I,k} + \mathbf{n}_{I,k}$

where the computation of the transition matrix $\boldsymbol{\Phi}_k$ and the discrete-time noise covariance $\mathbf{Q}_k$ can be found in [13]. The propagated covariance of the whole state is then:

$\mathbf{P}_{k+1|k} = \begin{bmatrix} \mathbf{P}_{II,k+1|k} & \boldsymbol{\Phi}_k\,\mathbf{P}_{IC,k|k} \\ \mathbf{P}_{IC,k|k}^T\,\boldsymbol{\Phi}_k^T & \mathbf{P}_{CC,k|k} \end{bmatrix}$

where $\mathbf{P}_{II,k+1|k} = \boldsymbol{\Phi}_k\,\mathbf{P}_{II,k|k}\,\boldsymbol{\Phi}_k^T + \mathbf{Q}_k$ is the propagated covariance of the IMU state, $\mathbf{P}_{CC,k|k}$ is the covariance of the camera poses, and $\mathbf{P}_{IC,k|k}$ is their cross-correlation.
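The block structure means that only the IMU sub-block and the cross terms change during propagation, which a short sketch makes explicit (a minimal illustration; the 15-dimensional IMU error state and the partition layout are assumptions following the text):

```python
import numpy as np

def propagate_full_covariance(P, Phi, Q, imu_dim=15):
    """Propagate the partitioned covariance: the camera-pose block P_CC is
    untouched; the IMU block and the cross-correlations evolve with Phi."""
    P_new = P.copy()
    P_II = P[:imu_dim, :imu_dim]
    P_IC = P[:imu_dim, imu_dim:]
    P_new[:imu_dim, :imu_dim] = Phi @ P_II @ Phi.T + Q   # IMU block
    P_new[:imu_dim, imu_dim:] = Phi @ P_IC               # cross terms
    P_new[imu_dim:, :imu_dim] = (Phi @ P_IC).T
    return P_new
```

Leaving $\mathbf{P}_{CC}$ untouched is what keeps the cost of propagation independent of the number of stored camera poses.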
When a new camera frame is recorded, we add the camera pose to the state using the following expressions:

${}^{C}_G\hat{\bar{q}} = {}^{C}_I\bar{q} \otimes {}^I_G\hat{\bar{q}}, \qquad {}^G\hat{\mathbf{p}}_C = {}^G\hat{\mathbf{p}}_I + C({}^I_G\hat{\bar{q}})^T\,{}^I\mathbf{p}_C$

where ${}^{C}_I\bar{q}$ and ${}^I\mathbf{p}_C$ are the known camera-to-IMU extrinsics. The augmented covariance is given by:

(12) $\mathbf{P}_{k|k} \leftarrow \begin{bmatrix} \mathbf{I} \\ \mathbf{J} \end{bmatrix} \mathbf{P}_{k|k} \begin{bmatrix} \mathbf{I} \\ \mathbf{J} \end{bmatrix}^T$

where $\mathbf{J}$ is shown in the Appendix.
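Equation (12) translates directly into code (an illustrative sketch; the contents of $\mathbf{J}$ are given in the paper's Appendix, so here it is simply an arbitrary 6×n Jacobian):

```python
import numpy as np

def augment_covariance(P, J):
    """State augmentation, eq. (12): P <- [I; J] P [I; J]^T, growing the
    covariance by the 6 dof of the newly added camera pose."""
    n = P.shape[0]
    M = np.vstack([np.eye(n), J])     # (n+6) x n stacking of I and J
    return M @ P @ M.T
```

The original block of $\mathbf{P}$ is preserved exactly; the new rows and columns carry the correlations of the new camera pose with the existing state.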
IV-B Measurement Model
The second correction stage employs data from the images to correct the IMU state; the measurement model we adopt here follows [11, 13]. It is based on the fact that a static feature in the world can be observed from multiple camera poses, yielding constraints among all these camera poses. Consider a single feature $f_j$ observed by the stereo cameras at time step $i$. Assume this feature is observed by both the left and right cameras, whose poses are represented as $({}^{C_{i,l}}_G\bar{q},\,{}^G\mathbf{p}_{C_{i,l}})$ and $({}^{C_{i,r}}_G\bar{q},\,{}^G\mathbf{p}_{C_{i,r}})$. The stereo measurement of the feature at time step $i$ is

$\mathbf{z}_i^j = \begin{bmatrix} u_{i,l}^j & v_{i,l}^j & u_{i,r}^j & v_{i,r}^j \end{bmatrix}^T$

where $(u_{i,l}^j, v_{i,l}^j)$ and $(u_{i,r}^j, v_{i,r}^j)$ represent the projections of the feature in the left and right cameras. The positions of the feature in the left and right camera frames are given by

${}^{C_{i,l}}\mathbf{p}_j = C({}^{C_{i,l}}_G\bar{q})\,({}^G\mathbf{p}_j - {}^G\mathbf{p}_{C_{i,l}}), \qquad {}^{C_{i,r}}\mathbf{p}_j = C({}^{C_{i,r}}_G\bar{q})\,({}^G\mathbf{p}_j - {}^G\mathbf{p}_{C_{i,r}})$

where ${}^G\mathbf{p}_j$ is the position of the feature in the global frame, which can be computed by the least-squares method shown in [11].
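The paper computes ${}^G\mathbf{p}_j$ with the multi-view least-squares method of [11]. For intuition, a single rectified stereo pair already yields a closed-form estimate from the disparity (a simplified sketch under that rectification assumption, not the least-squares solver itself):

```python
import numpy as np

def triangulate_rectified(u_l, v_l, u_r, baseline):
    """Back-project one feature from a rectified stereo pair using normalized
    image coordinates: depth Z = b / (u_l - u_r), then p = Z * [u_l, v_l, 1]."""
    disparity = u_l - u_r
    Z = baseline / disparity
    return np.array([u_l * Z, v_l * Z, Z])   # feature in the left camera frame
```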
Once ${}^G\hat{\mathbf{p}}_j$ is obtained, the measurement residual can be computed as $\mathbf{r}_i^j = \mathbf{z}_i^j - \hat{\mathbf{z}}_i^j$. By linearizing about the estimates of the camera poses and the feature position, the residual can be approximated as

$\mathbf{r}_i^j \approx \mathbf{H}_{X,i}^j\,\tilde{\mathbf{x}} + \mathbf{H}_{f,i}^j\,{}^G\tilde{\mathbf{p}}_j + \mathbf{n}_i^j$

where $\mathbf{n}_i^j$ is the measurement noise, $\mathbf{H}_{X,i}^j$ and $\mathbf{H}_{f,i}^j$ are the Jacobians of the measurement with respect to the state and the feature position, respectively (their values can be found in [13]), and ${}^G\tilde{\mathbf{p}}_j$ is the error in the position estimate of the feature. By stacking all the observations of feature $f_j$, one obtains:

$\mathbf{r}^j \approx \mathbf{H}_X^j\,\tilde{\mathbf{x}} + \mathbf{H}_f^j\,{}^G\tilde{\mathbf{p}}_j + \mathbf{n}^j$
Because the position of the feature is computed using the state estimate $\mathbf{x}$, the error ${}^G\tilde{\mathbf{p}}_j$ is correlated with the error $\tilde{\mathbf{x}}$, so this form of the residual cannot be used directly for an EKF update. By projecting $\mathbf{r}^j$ onto the left nullspace of $\mathbf{H}_f^j$, one has

$\mathbf{r}_o^j = \mathbf{A}^T\mathbf{r}^j \approx \mathbf{A}^T\mathbf{H}_X^j\,\tilde{\mathbf{x}} + \mathbf{A}^T\mathbf{n}^j$

where $\mathbf{A}$ denotes the unitary matrix whose columns form the basis of the left nullspace of $\mathbf{H}_f^j$ [11].
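This projection can be sketched via the SVD (illustrative; it assumes $\mathbf{H}_f^j$ has full column rank 3 and more than three rows):

```python
import numpy as np

def nullspace_project(r, H_X, H_f):
    """Marginalize the feature-position error: find A whose columns span the
    left nullspace of H_f, then return r_o = A^T r and H_o = A^T H_X."""
    U, _, _ = np.linalg.svd(H_f, full_matrices=True)
    A = U[:, 3:]                   # H_f is m x 3 with rank 3, so A is m x (m-3)
    return A.T @ r, A.T @ H_X
```

After the projection, the residual depends only on the state error, so a standard EKF update can be applied with $\mathbf{r}_o^j$ and $\mathbf{A}^T\mathbf{H}_X^j$.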
V Dual Stage EKF Mechanism
The structure of the proposed dual-stage EKF filter is shown in Fig. 1. The first stage is designed to combine the measurements from the accelerometer and gyroscope: the accelerometer data, which reflects the gravitational force, is used as a corrective measurement to curb the error of the orientation estimate. The role of the second stage is to fuse the measurements from the IMU and the stereo cameras: the images from the stereo cameras are a natural complement to the IMU for curbing errors and drift.
As shown in Fig.1, the first stage of EKF is executed immediately as the IMU measurements are acquired. For each IMU measurement received, one has:

Propagation: the rotational velocity $\boldsymbol{\omega}_m$ is used to propagate the gyroscope state $\mathbf{x}_g$ and its covariance matrix.

Update: the body acceleration $\mathbf{a}_m$ is used to perform an EKF update.
The second stage of the EKF contains the first stage, which serves as the propagation part of the second-stage EKF:

Propagation: whenever a new IMU measurement is received, it is used to propagate the IMU state $\mathbf{x}_I$ and its covariance matrix.

Image registration: whenever new stereo images are acquired, the filter (i) augments the state with the new camera pose and the corresponding covariance matrix, and (ii) runs the image front-end, including feature extraction and feature matching.

Update: an EKF update is performed when (i) a feature that has been tracked in a number of images (three in our algorithm) is no longer detected, or (ii) the maximum number of camera poses in the state has been reached, in which case old camera poses are discarded.
We present a new strategy for discarding old camera poses in the above update. Whenever new stereo images are received, the oldest non-keyframe is discarded according to two criteria. The two criteria from [12] are employed here for keyframe selection: the first is the average parallax from the previous keyframe, and the second is tracking quality.
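The two keyframe criteria can be sketched as a simple predicate (the threshold values are illustrative assumptions, not values from the paper):

```python
def is_keyframe(avg_parallax, num_tracked, parallax_thresh=10.0, min_tracked=20):
    """Keyframe test in the spirit of [12]: declare a keyframe when the average
    parallax from the previous keyframe is large, or when tracking quality
    degrades (too few features survive)."""
    return avg_parallax > parallax_thresh or num_tracked < min_tracked
```

Frames failing this test are candidates for removal, so the sliding window keeps poses with both good geometry (parallax) and reliable tracks.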
VI Experiments
In this section, we perform two experiments to evaluate the proposed DS-VIO algorithm. In the first experiment, we compare the proposed algorithm with other state-of-the-art algorithms on a public dataset using the Root Mean Square Error (RMSE) metric. In the second experiment, we compare the proposed DS-VIO with S-MSCKF in an indoor environment. S-MSCKF is selected because it is also a stereo, EKF-based approach.
VI-A Dataset Comparison
The proposed DS-VIO algorithm (stereo, filter-based) is compared with the state-of-the-art approaches OKVIS [14] (stereo, optimization-based), ROVIO [33] (monocular, filter-based), VINS-MONO [12] (monocular, optimization-based) and S-MSCKF [13] (stereo, filter-based) on the EuRoC dataset [34]. These methods cover different combinations of monocular, stereo, filter-based and optimization-based designs. Among them, S-MSCKF, a tightly coupled filter-based stereo VIO algorithm, is the most closely related to our work. In the experiment, only the images from the left camera are used for the monocular algorithms ROVIO and VINS-MONO. For a fair comparison, only the VIO performance is evaluated, and the loop-closure functionality of VINS-MONO is disabled. For each algorithm, the experiment is repeated ten times and the mean value is reported as the result.
The EuRoC MAV dataset is a collection of visual-inertial datasets collected on board a Micro Aerial Vehicle (MAV). It contains synchronized 20 Hz stereo images, 200 Hz IMU messages, and accurate motion and structure ground truth. Some parts of the dataset exhibit very dynamic motions and varying image brightness, which makes stereo matching and feature tracking more challenging.
Figure 2 shows the RMSE results of our proposed DS-VIO algorithm and the other state-of-the-art algorithms, OKVIS, ROVIO, VINS-MONO and S-MSCKF, on the EuRoC datasets. Because both our proposed method and S-MSCKF employ the KLT optical flow algorithm for feature tracking and stereo matching, they do not work properly on the “V2_03” dataset: as pointed out in [13], the rapid change of brightness in “V2_03” causes failures in stereo feature matching. On the remaining datasets, our method achieves performance comparable with the other methods, and it outperforms the most similar method, S-MSCKF, on all of them.
VI-B Indoor Experiment
In the indoor experiment, for the sake of fairness, we compare only S-MSCKF with our DS-VIO, because both algorithms are stereo, filter-based approaches. We choose a rectangular corridor in our laboratory building as the experiment area, where we encounter low-light and textureless conditions, as shown in Fig. 3. The sensor suite we use is a ZED Mini, which contains a stereo camera (30 Hz) and an IMU (800 Hz). Holding the ZED Mini in our hands, we walk one loop along the rectangular corridor at around 1 m/s.
Fig. 4 shows the comparison between our DS-VIO and S-MSCKF. Fig. 4(a) shows the trajectories in the plane and Fig. 4(b) shows the trajectories in three dimensions. The blue line represents the trajectory of DS-VIO, while the dotted line represents the trajectory of S-MSCKF. One can see that the blue trajectory is a slightly distorted rectangle, whereas the dotted trajectory is far from rectangular. This comparison shows that the proposed DS-VIO achieves a smaller cumulative error than S-MSCKF. We attribute this superior performance to the first-stage EKF filter of the proposed DS-VIO, which exploits the complementary characteristics of the accelerometer and gyroscope.
VII Conclusions
In this paper, we present a robust and efficient filter-based stereo VIO that employs a dual-stage EKF to perform state estimation. The dual-stage EKF filter exploits the complementary characteristics of the IMU and stereo cameras, as well as of the accelerometer and gyroscope. The accuracy and robustness of the proposed VIO are demonstrated by experiments on the EuRoC MAV dataset and in an indoor environment, in comparison with state-of-the-art VIO algorithms. Future work will explore how to achieve better accuracy and efficiency by selecting feature points with certain desired characteristics.
Appendix
References
 [1] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, April 2018.
 [2] F. Zheng, G. Tsai, Z. Zhang, S. Liu, C.C. Chu, and H. Hu, “Pivio: Robust and efficient stereo visual inertial odometry using points and lines,” arXiv preprint arXiv:1803.02403, 2018.
 [3] J. Ma, S. Susca, M. Bajracharya, L. Matthies, M. Malchano, and D. Wooden, “Robust multisensor, day/night 6dof pose estimation for a dynamic legged vehicle in gpsdenied environments,” in IEEE International Conference on Robotics and Automation, 2012, pp. 619–626.
 [4] J. P. Tardif, M. George, M. Laverne, and A. Kelly, “A new approach to visionaided inertial navigation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010, pp. 4161–4168.
 [5] S. Weiss and R. Siegwart, “Realtime metric state estimation for modular visioninertial systems,” in IEEE International Conference on Robotics and Automation, 2011, pp. 4531–4537.
 [6] L. Matthies, “Fully selfcontained visionaided navigation and landing of a micro air vehicle independent from external sensor inputs,” Proc Spie, vol. 8387, p. 25, 2012.
 [7] T. Oskiper, Z. Zhu, S. Samarasekera, and R. Kumar, “Visual odometry system using multiple stereo cameras and inertial measurement unit,” in Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, 2007, pp. 1–8.
 [8] M. Li, “Visualinertial odometry on resourceconstrained systems,” Dissertations Theses  Gradworks, 2014.
 [9] K. Konolige and M. Agrawal, “Frameslam: From bundle adjustment to realtime visual mapping,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1066–1077, 2008.
 [10] K. Konolige, M. Agrawal, and J. Solà, “Largescale visual odometry for rough terrain,” in Robotics Research  the International Symposium, Isrr 2007, November 2629, 2007 in Hiroshima, Japan, 2010, pp. 201–212.
 [11] A. I. Mourikis and S. I. Roumeliotis, “A multistate constraint kalman filter for visionaided inertial navigation,” in Robotics and automation, 2007 IEEE international conference on. IEEE, 2007, pp. 3565–3572.
 [12] T. Qin, P. Li, and S. Shen, “Vinsmono: A robust and versatile monocular visualinertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
 [13] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, 2018.
 [14] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframebased visualinertial odometry using nonlinear optimization,” International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2014.
 [15] T. Lupton and S. Sukkarieh, VisualInertialAided Navigation for HighDynamic Motion in Built Environments Without Initial Conditions. IEEE Press, 2012.
 [16] V. Usenko, J. Engel, J. Stückler, and D. Cremers, “Direct visualinertial odometry with stereo cameras,” in IEEE International Conference on Robotics and Automation, 2016, pp. 1885–1892.
 [17] E. S. Jones and S. Soatto, “Visualinertial navigation, mapping and localization: A scalable realtime causal approach,” International Journal of Robotics Research, vol. 30, no. 4, pp. 407–430, 2011.
 [18] J. Kelly, S. Saripalli, and G. S. Sukhatme, Combined Visual and Inertial Navigation for an Unmanned Aerial Vehicle. Springer Berlin Heidelberg, 2008.
 [19] J. Kelly and G. S. Sukhatme, “Visualinertial simultaneous localization, mapping and sensortosensor selfcalibration,” in IEEE International Conference on Computational Intelligence in Robotics and Automation, 2009, pp. 360–368.
 [20] M. Kleinert and S. Schleith, “Inertial aided monocular slam for gpsdenied navigation,” in Multisensor Fusion and Integration for Intelligent Systems, 2010, pp. 20–25.
 [21] P. Pinies, T. Lupton, S. Sukkarieh, and J. D. Tardos, “Inertial aiding of inverse depth slam using a monocular camera,” in IEEE International Conference on Robotics and Automation, 2007, pp. 2797–2802.
 [22] J. Kelly and G. S. Sukhatme, “Visualinertial sensor fusion: Localization, mapping and sensortosensor selfcalibration,” International Journal of Robotics Research, vol. 30, no. 1, pp. 56–79, 2011.
 [23] T. Yap, M. Li, A. I. Mourikis, and C. R. Shelton, “A particle filter for monocular visionaided odometry,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5663–5669.
 [24] G. P. Huang, A. I. Mourikis, and S. I. Roumeliotis, A FirstEstimates Jacobian EKF for Improving SLAM Consistency. DBLP, 2009.
 [25] ——, “Observabilitybased rules for designing consistent ekf slam estimators,” International Journal of Robotics Research, vol. 29, no. 5, pp. 502–528, 2010.
 [26] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, “Consistency analysis and improvement of visionaided inertial navigation,” IEEE Transactions on Robotics, vol. 30, no. 1, pp. 158–176, 2017.
 [27] S. I. Roumeliotis, D. G. Kottas, C. Guo, and J. Hesch, “Observabilityconstrained visionaided inertial navigation,” 2016.
 [28] J. J. Tarrio and S. Pedre, “Realtime edgebased visual odometry for a monocular camera,” in IEEE International Conference on Computer Vision, 2016, pp. 702–710.
 [29] H. Yu and A. I. Mourikis, “Edgebased visualinertial odometry,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 6670–6677.
 [30] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “Imu preintegration on manifold for efficient visualinertial maximumaposteriori estimation (supplementary material),” Georgia Institute of Technology, 2015.
 [31] X. Zheng, Z. Moratto, M. Li, and A. I. Mourikis, “Photometric patchbased visualinertial odometry,” in IEEE International Conference on Robotics and Automation, 2017, pp. 3264–3271.
 [32] S. Sabatelli, M. Galgani, L. Fanucci, and A. Rocchi, “A doublestage kalman filter for orientation tracking with an integrated processor in 9d imu,” IEEE Transactions on Instrumentation Measurement, vol. 62, no. 3, pp. 590–598, 2013.
 [33] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct ekfbased approach,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, pp. 298–304.
 [34] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.