Globally optimal consensus maximization for robust visual inertial localization in point and line map

02/27/2020 ∙ by Yanmei Jiao, et al. ∙ Zhejiang University 0

Map-based visual inertial localization is a crucial step to reduce the drift in state estimation of mobile robots. The underlying problem is to estimate the pose from a set of 3D-2D feature correspondences, where the main challenge is the presence of outliers, especially in changing environments. In this paper, we propose a robust solution based on efficient global optimization of the consensus maximization problem, which is insensitive to a high percentage of outliers. We first introduce translation invariant measurements (TIMs) for both points and lines to decouple the consensus maximization problem into rotation and translation subproblems, allowing for a two-stage solver with reduced solution dimensions. Then we show that (i) the rotation can be calculated by minimizing TIMs using only 1-dimensional branch-and-bound (BnB), and (ii) the translation can be found by running a 1-dimensional search three times with prioritized progressive voting. Compared with the popular randomized solver, our solver achieves deterministic global convergence without depending on an initial value. Compared with existing BnB-based methods, ours is exponentially faster. Finally, evaluation on both simulation and real-world datasets shows that our approach gives an accurate pose even when there are 90% outliers (only 2 inliers).




I Introduction

Visual inertial navigation is currently a popular option for state estimation in mobile robots, autonomous vehicles and augmented reality applications. Much effort has been devoted to building accurate, consistent and efficient visual inertial odometry [1][2]. However, its inherent drift is unacceptable in long-term operation, calling for absolute pose estimation for correction. Map-based visual inertial localization is therefore an important component of a complete navigation system, of which the underlying problem is to estimate the absolute pose from a set of feature correspondences between 2D image key points and global 3D map points. In this problem, one main challenge is the robustness of the solver against outliers (incorrect feature correspondences). When a high percentage of the correspondences are outliers, the performance of a general pose estimator may degrade seriously.

Pose estimation with outliers is in general stated as consensus maximization problem. One popular solution is random sample consensus (RANSAC), which has lots of variants [3][4] and has been employed in many visual localization methods [5][6]. The advantage of RANSAC is the simplicity for implementation, and the usefulness in many scenarios with moderate percentage of outliers. But there are also disadvantages that (i) it cannot tolerate extreme percentage of outliers, say 90%, (ii) it is a probabilistic method, thus not guaranteeing the deterministic global optimality.
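To make disadvantage (i) concrete, the standard RANSAC stopping criterion can be evaluated numerically. The sketch below is not part of the proposed method: the function name, the 0.99 confidence level and the minimal sample sizes (2 and 3 correspondences) are illustrative assumptions.

```python
import math

def ransac_iterations(outlier_ratio, sample_size, confidence=0.99):
    """Number of random draws needed so that, with probability `confidence`,
    at least one minimal sample contains no outlier (standard RANSAC bound)."""
    p_clean = (1.0 - outlier_ratio) ** sample_size  # chance that one draw is outlier-free
    if p_clean <= 0.0:
        return math.inf
    if p_clean >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))

# Illustrative comparison between a moderate and an extreme outlier ratio.
for sample_size in (2, 3):
    for outlier_ratio in (0.5, 0.9):
        n = ransac_iterations(outlier_ratio, sample_size)
        print(f"sample size {sample_size}, outliers {outlier_ratio:.0%}: {n} draws")
```

The required number of draws grows quickly with the outlier ratio, and even when it is met the guarantee remains probabilistic rather than deterministic.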

Fig. 1: The projected map points on the map image (left column) and the detected image key points on the query image (right column), with inlier correspondences in red and outliers in blue. The initial correspondences found by feature descriptor matching (top), and the consensus set correspondences searched by RANSAC (middle) and proposed consensus maximization algorithm (bottom).

In contrast to RANSAC, another solution to consensus maximization is global optimization, which can give a globally optimal solution without relying on an initial value [7][8]. However, one obstacle preventing its application is the considerable computation time. Most global optimization methods aim at general geometry estimation problems. They employ branch-and-bound (BnB) as the basic framework to reduce the search space [9], or mixed integer programming for further acceleration [10][11]. But the computational cost is still unsatisfactory because the multi-dimensional search space is coupled.

In this paper, we propose a deterministic visual inertial localization solution that achieves global convergence with much higher efficiency by dividing the search space into multiple 1-D search spaces. Specifically, inspired by the minimal solution in RANSAC, we build an intermediate cost function for both point and line features, translation invariant measurements (TIMs), to decouple consensus maximization into two cascaded subproblems related only to rotation and translation respectively. Based on TIMs, the globally optimal rotation is then searched by 1-dimensional BnB with the aid of inertial measurements. For the translation part, the coupled 3-dimensional search is replaced with three 1-dimensional searches using prioritized progressive voting. To the best of our knowledge, this is the first solver for visual inertial localization with deterministic global optimality. In summary, the contributions include

  • TIMs based formulation of visual inertial localization decouples the problem and enables 1D BnB based global optimization of the rotation.

  • Prioritized progressive voting method replaces the coupled 3D space search with three 1D searches for global optimization of the translation.

  • Experiments on simulation and real-world cross-session datasets that validate the effectiveness and efficiency of the proposed method against comparative methods.

The remainder of the paper is organized as follows: Section II reviews the related literature. Section III presents the decoupling of the consensus maximization problem. Section IV introduces the solutions of the subproblems. Section V demonstrates the experimental settings and results, followed by Section VI concluding the paper.

II Related Works

II-A Visual localization

Visual localization and navigation for mobile robots has been studied extensively in the robotics and computer vision communities in the recent decade. A general visual navigation system has two components: visual odometry, which estimates the relative pose and drifts in the long term [12][13], and visual localization, which eliminates the drift by registering the image on a global map [14][15]. More recently, inertial sensors have been employed in the system to improve the accuracy and robustness [16][17][18]. Specifically, the inertial sensor has globally observable pitch and roll measurements, reducing the degrees of freedom (DoF) in the visual inertial localization problem to 4. In [1][19], the reduction is utilized when formulating the pose estimation given a set of inlier feature correspondences. However, little work has been done on outlier elimination when inertial measurements are provided.

II-B Random sample consensus

For robust localization given feature correspondences containing outliers, RANSAC is the most popular solution and is employed in many visual navigation systems. For visual localization without inertial measurements, i.e. 6DoF, there have been many variants. In [20][21][22], RANSAC based on point feature correspondences is studied. In [23][24][25], RANSAC is extended to line features. When inertial measurements are provided, the DoF of the problem is reduced, which is utilized by RANSAC to improve the robustness in [26][27], and extended to both point and line correspondences in [28]. As RANSAC is developed on randomized sampling theory, it is simple to implement and performs well in scenarios with moderate outliers. But its disadvantages are also obvious, including low tolerance against extreme outliers, local convergence and no guarantee of optimality [29].

II-C Global optimization method

Global optimization methods are proposed to achieve global optimality and deterministic convergence, addressing the shortcomings of RANSAC. In this branch of the literature, Branch-and-Bound (BnB) is mostly used, which gradually prunes the solution space by coarse-to-fine division. In [30], BnB is used to solve 2D-2D registration problems. In [9], a general framework for point, line and plane features is proposed to solve 3D-3D registration via BnB. Integrated with mixed integer programming, the BnB optimization can converge faster [10][11]. In [29], linear matrix inequality constraints are introduced to mixed integer programming, resulting in a faster BnB for all 2D-2D, 2D-3D and 3D-3D geometric vision problems. In the works mentioned above, the rotation is modeled as a rotation matrix with matrix-level constraints, so it is unclear how inertial measurements can be incorporated. In addition, there are also works that propose globally optimal algorithms specialized for one class of problem. In [31][32], pairs of features are used to decouple the 3D-3D registration. In [33], TEASER is proposed for decoupled scaled 3D-3D registration. These works show that specialized algorithms can outperform the general BnB framework, even when the latter is accelerated.

In this paper, we follow the idea of specialized solver to bridge the gap of globally optimal deterministic solution for visual inertial localization, which is a robust 3D-2D pose estimation problem with inertial measurements. To the best of our knowledge, this is the first work to study this problem in the context of global optimality. We expect this solution to be accurate and efficient.

III Decoupling Translation and Rotation

The underlying problem of visual inertial localization is the pose estimation from 3D-2D correspondences with outliers. Formally, given a set consisting of correspondences between 3D global points and 2D visual points , they satisfy


where and is the camera pose to be estimated, is the camera projection function with known intrinsic parameters , is assumed to be bounded random measurement noise, is zero for inlier while an arbitrary number for outlier. To deal with outliers, the robust pose estimation generally begins with consensus maximization problem as


where is binary, indicating whether is zero. To solve the problem globally, general BnB algorithms search in SE(3), which is a coupled space of SO(3) and R^3. But this probably leads to exponential computational complexity in bad cases. For local techniques like RANSAC, inliers may be estimated conservatively, i.e. inliers regarded as outliers, especially when the noise is unavoidable.

III-A Translation invariant measurements

III-A1 Point-TIM

Inspired by the minimal solution in RANSAC, we develop an intermediate measurement which is invariant to the translation of the pose. Mathematically, given an image key point , we have an un-normalized direction vector from the camera center as


Then the corresponding world point is transformed to the camera coordinates and satisfies


where and . Based on (30), we have two constraints from a correspondence. Naturally, given another correspondence and , we can have two more constraints as


According to (30) and (33), we have linear constraints of the translation . With proper variable substitutions among the constraints, and the globally observable pitch and roll angles from inertial measurements, we can eliminate , reduce to , and derive TIM as


where is the unknown yaw angle, , , and the derivation details are presented in the Appendix. Now we substitute the constraints which are related to both and in (2) with the TIM, leading to


where , indicates the -th and -th correspondence derived the constraint are inliers.
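The idea of a translation invariant constraint built from a pair of point correspondences can be checked numerically. The sketch below is a simplified illustration and not necessarily the exact TIM derived in the Appendix: it assumes the camera-frame point is R p + t, that the rotation factors as a known pitch/roll part times an unknown yaw rotation about the z axis (the Euler convention is an assumption), and it uses the fact that the vector between two camera-frame points lies in the plane spanned by their two viewing rays. All function and variable names are hypothetical.

```python
import numpy as np

def rot_z(yaw):
    """Rotation about the z axis by `yaw` radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_pitch_roll(pitch, roll):
    """Known part of the rotation (assumed convention: R_x(roll) @ R_y(pitch))."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    r_y = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    r_x = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return r_x @ r_y

def pair_tim_residual(yaw, r_known, d_i, d_j, p_i, p_j):
    """(d_i x d_j) . R(yaw) (p_j - p_i), which contains no translation term.
    For a fixed pair it has the form a*sin(yaw) + b*cos(yaw) + c."""
    return float(np.cross(d_i, d_j) @ (r_known @ rot_z(yaw) @ (p_j - p_i)))

# Synthetic check: the residual vanishes at the true yaw for any translation.
r_known = rot_pitch_roll(0.1, -0.2)
true_yaw = 0.4
p_i, p_j = np.array([1.0, 2.0, 0.5]), np.array([-1.5, 0.3, 2.0])
for t in (np.array([0.2, -0.1, 4.0]), np.array([-3.0, 1.0, 6.0])):
    r_true = r_known @ rot_z(true_yaw)
    d_i = 2.5 * (r_true @ p_i + t)   # un-normalized viewing rays (arbitrary scale)
    d_j = 0.7 * (r_true @ p_j + t)
    print(pair_tim_residual(true_yaw, r_known, d_i, d_j, p_i, p_j),
          pair_tim_residual(true_yaw + 1.0, r_known, d_i, d_j, p_i, p_j))
```

Because the residual depends on the yaw only through its sine and cosine, the rotation subproblem becomes a 1-dimensional search once pitch and roll are supplied by the inertial measurements.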

III-A2 Line-TIM

Similar to a pair of point correspondences, given a set of line correspondences , it is also possible to develop TIM. Given the end points of the image line segment and , we have two un-normalized directions as (29), denoted as and .

Then following the fact that the point on the world line lies on the plane spanned by the rays from camera center along direction and , we have


which is a constraint for both rotation and translation. Since arbitrary number of points can be sampled from a line, we sample another point on the same world line to formulate the constraint as (10). Then only one line correspondence can lead to line-TIM after proper substitution as


where the line-TIM has the same form as point-TIM in (36), but the coefficients are different. The derivation details are also presented in the Appendix.

TIMs based rotation only problem. Note that either (36) or (42) is only related to the yaw angle. By combining them together, we have a general consensus maximization problem with TIM constraints only relating to rotation compatible to the map having both point and line features as


III-B Two-stage consensus maximization solver

With TIMs for both point and line correspondences, we decouple the original consensus maximization problem into rotation only problem, and translation only problem when the rotation is fixed. Accordingly, the proposed solver has two stages in cascade:

  • We estimate the rotation by based on the TIMs in (12). This estimator solves a 1D optimization problem and is described in Section IV-A.

  • We estimate the translation based on the original consensus maximization in (2) where the rotation is assigned with . This estimator solves a optimization problem and is described in Section IV-B.

IV Estimators of Rotation and Translation

IV-A BnB based optimization for rotation

We employ BnB strategy to solve problem (12). The cost function in (12) relates to and . But it is obvious that when is determined, is simply derived by evaluating the constraints. So we denote the cost function as that is explained as the number of inliers given a yaw angle .

Upper bound of cost function. We then derive the upper bound of on the subset , denoted as , where . Recall (36) and (42), as the forms of point-TIM and line-TIM are the same, we denote them as . The lower bound of on , denoted as , is derived as


where the derivation of the coefficients is given in the Appendix. Note that can be solved analytically without any iterations. Now we formulate a consensus maximization problem as


where the problem is defined on , and the TIMs constraints are replaced with tight lower bounds, relaxing the constraints and yielding an optimistic estimation of . We then have


as a tight upper bound. The equality holds when all constraints give the same with and , which is only possible when the measurements are noise-free.

Accelerate BnB optimization. With (12-19), we have the BnB search for the globally optimal rotation, of which the pseudo code is listed in Algorithm 1. Note that the main idea of BnB is to prune a solution space when its upper bound is smaller than the current best estimate . Therefore, if we have a fast solution to initialize a good , most solution spaces can be pruned at an early stage, significantly improving the search efficiency. To implement this idea, we use RANSAC [28] to generate a rough initial . In addition, we introduce a heuristic to balance the global optimality and the efficiency. The best estimated during RANSAC is utilized to initialize subsets among . Each subset centers at each estimated with a width . When is large, global optimality is emphasized and vice versa. Another implementation trick is to store the respective inliers when evaluating (16) on each subset . When is further divided into smaller subsets, only the stored inliers within are evaluated, instead of all constraints, saving much computational cost. These techniques are all shown to accelerate the search in the experimental ablation study without a drop in accuracy.

Input: 3D-2D feature correspondences ,
Output: Optimal
Initialize partition of into subsets .
Initialize best estimation , .
Insert into queue .
while  is not empty do
    Pop the first subset of as .
    Compute as (16).
    if  then
        Assign center of as .
        Compute as (12).
        if  then
            Update , .
        Subdivide into subsets and insert into .
Algorithm 1 Globally Optimal Rotation Search
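The search in Algorithm 1 can be sketched in a few lines under a simplifying assumption: each TIM constraint is taken to have the form |a sin(θ) + b cos(θ) + c| ≤ ε in the yaw θ, with a common noise bound ε. This form, the coefficient tuples, the bisection rule and the `min_width` stopping parameter are assumptions made for illustration; the acceleration heuristics (RANSAC initialization, stored inliers) are omitted.

```python
import math
from collections import deque

def residual(theta, con):
    a, b, c = con
    return a * math.sin(theta) + b * math.cos(theta) + c

def count_inliers(theta, cons, eps):
    """Consensus size at a fixed yaw: constraints with |residual| <= eps."""
    return sum(abs(residual(theta, con)) <= eps for con in cons)

def min_abs_residual(con, lo, hi):
    """Exact minimum of |a sin(t) + b cos(t) + c| over [lo, hi], in closed form."""
    a, b, _ = con
    values = [residual(lo, con), residual(hi, con)]
    # Stationary points satisfy a cos(t) - b sin(t) = 0, i.e. t = atan2(a, b) + k*pi.
    t = math.atan2(a, b)
    t += math.ceil((lo - t) / math.pi) * math.pi
    while t <= hi:
        values.append(residual(t, con))
        t += math.pi
    f_min, f_max = min(values), max(values)
    if f_min <= 0.0 <= f_max:  # the residual crosses zero inside the interval
        return 0.0
    return min(abs(f_min), abs(f_max))

def upper_bound(cons, eps, lo, hi):
    """Optimistic inlier count: constraints that can be met somewhere in [lo, hi]."""
    return sum(min_abs_residual(con, lo, hi) <= eps for con in cons)

def bnb_yaw(cons, eps, lo=-math.pi, hi=math.pi, min_width=1e-4):
    """1D branch-and-bound over the yaw; returns (best yaw, best inlier count)."""
    best_theta = 0.5 * (lo + hi)
    best = count_inliers(best_theta, cons, eps)
    queue = deque([(lo, hi)])
    while queue:
        lo_k, hi_k = queue.popleft()
        if upper_bound(cons, eps, lo_k, hi_k) <= best:
            continue  # prune: this interval cannot beat the current best
        mid = 0.5 * (lo_k + hi_k)
        value = count_inliers(mid, cons, eps)
        if value > best:
            best, best_theta = value, mid
        if hi_k - lo_k > min_width:  # subdivide until the resolution limit
            queue.append((lo_k, mid))
            queue.append((mid, hi_k))
    return best_theta, best
```

In this sketch optimality holds only up to the `min_width` resolution, and no interval is pruned unless its optimistic count cannot exceed the best consensus already found.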

IV-B Prioritized progressive voting for translation

When is estimated, the co-linear and co-planar constraints (30) and (10) are all linear constraints for . Thus we can transform the consensus maximization problem with point and line constraints as


where and are the coefficients for linear constraints derived from (30) or (10) with estimated . However, this problem still has coupled constraints for so that search is indispensable.

Fig. 2: The voting illustration of . Each derived by -th and -th correspondence votes for the interval if , which means the corresponding consensus set contains i and j.

Decoupled linear constraints. Note that for a point correspondence constraint (30), we have two linear equations, while for a line correspondence constraint (10), we have one. Therefore, given a pair of correspondences including at least one point correspondence, say the -th point correspondence and the -th point or line correspondence, it is sufficient to solve for this small linear system (see Appendix for details), then we have


Now we find that the constraints are decoupled for each dimension of . Set the -dimension as example, we have


arriving at the resultant three dimension-wise linear constrained consensus maximization problems.
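How a pair of correspondences yields a translation estimate once the rotation is fixed can be sketched as follows. This is an illustrative simplification, not the exact system in the Appendix: it assumes two point correspondences, writes the co-linear constraint as d × (R p + t) = 0, and solves the stacked equations by least squares instead of selecting a minimal subset. All names are hypothetical.

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ x equals np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def translation_from_pair(rot, d_i, p_i, d_j, p_j):
    """Solve d x (R p + t) = 0 for t using two point correspondences.
    Each correspondence contributes skew(d) @ t = -skew(d) @ R @ p."""
    lhs = np.vstack([skew(d_i), skew(d_j)])
    rhs = -np.concatenate([skew(d_i) @ rot @ p_i, skew(d_j) @ rot @ p_j])
    t, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return t

# Synthetic check with a known pose: the rays are scaled camera-frame points.
c, s = np.cos(0.3), np.sin(0.3)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 3.0])
p_i, p_j = np.array([1.0, 2.0, 0.5]), np.array([-1.5, 0.3, 2.0])
d_i, d_j = 2.0 * (rot @ p_i + t_true), 0.5 * (rot @ p_j + t_true)
print(translation_from_pair(rot, d_i, p_i, d_j, p_j))
```

Every pair produces one such estimate, and each of its three components can then be voted on separately.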

Dimension-wise voting algorithm. We use a voting algorithm to solve the problem. We first specify the noise bound in (24). Given the noise bound in (20), we have the noise bound for following the techniques in [34] [35] as


The details can be found in Appendix.

Still taking the -dimension as an example, each estimated defines an interval . If the real lies in this interval, then the real inlier set contains the two correspondences from which was derived. Following [33], the insight is that the inlier set only changes its membership when the real enters a new interval. Besides, given estimations, the maximum number of possible consensus sets, i.e. the cardinality of the solution space, is , where is quadratic in the number of correspondences. This complexity enables a voting algorithm over all sets. By counting the unique correspondences of the votes in each set, we get the corresponding consensus set. The maximal consensus set then leads to an estimation of . An illustrative case is shown in Fig. 2 and the pseudo code is listed in Algorithm 2 with the -dimension as an example. For simplicity, we replace with in the pseudo code. Following the similar idea in [33], by repeating the voting algorithm three times, is estimated as , , .

Fig. 3: The rotation accuracy and computation time over the increasing outlier rate. BnB2 denotes the BnB with RANSAC initialization. BnB3 denotes the BnB with both RANSAC initialization and the implementation trick.
Input: , ,
Output: Consensus sets
Initialize key-value map .
.
for  do
    .
    for  do
        if  then
            .
Algorithm 2 Voting
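The voting step along one dimension can be illustrated with a small interval-stabbing routine. It is a simplified sketch rather than a transcription of Algorithm 2: the tuple layout `(t, n, i, j)` (pairwise estimate, its noise bound, and the two correspondence indices), the quadratic-time scan over left endpoints and the averaging of the covering estimates are assumptions chosen for brevity.

```python
def vote_1d(estimates):
    """Find the largest consensus set along one translation dimension.

    estimates: list of (t, n, i, j); each votes for the interval [t - n, t + n]
    on behalf of correspondences i and j.
    Returns (estimated coordinate, set of voting correspondence indices).
    """
    best_members, best_value = set(), None
    # Coverage only grows when moving to the largest left endpoint below a point,
    # so checking every left endpoint is enough to find the maximum.
    for t0, n0, _, _ in estimates:
        x = t0 - n0
        members, covering = set(), []
        for t, n, i, j in estimates:
            if t - n <= x <= t + n:
                members.update((i, j))   # count unique correspondences, not votes
                covering.append(t)
        if len(members) > len(best_members):
            best_members = members
            best_value = sum(covering) / len(covering)
    return best_value, best_members

# Three consistent pairwise estimates near 1.0 and three scattered outliers.
votes = [(1.00, 0.05, 0, 1), (1.02, 0.05, 0, 2), (0.98, 0.05, 1, 2),
         (3.00, 0.05, 3, 4), (-2.00, 0.05, 0, 5), (5.00, 0.05, 4, 5)]
print(vote_1d(votes))
```

A sorted sweep over the interval endpoints would reduce the cost of this scan; the quadratic version is kept here for readability.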

Prioritized progressive voting algorithm. When the number of inliers is high, independent voting along three dimensions is possible. But when the number of inliers is low and outlier rate is high, independent dimension-wise voting may lead to failure. The reason is that, though it is almost impossible that there are more outliers than inliers having the similar , it is possible that there are more outliers than inliers having the similar . In such scenario, search along -dimension leads to incorrect , which cannot be corrected in the successive voting along or -dimension.

To deal with such a scenario while keeping a low computational complexity, we propose prioritized progressive voting for translation in Algorithm 3. The main idea is that we progressively vote on the three dimensions, but with a priority, i.e. the number of votes, that allows early termination. The experimental results show that the computational cost of prioritized progressive voting is close to that of dimension-wise voting. Alternatively, it is also possible to use a 3D BnB translation search for better accuracy, but it is slower because of the coupled multi-dimensional solution space. Finally, we apply nonlinear refinement to achieve the best accuracy once the maximum consensus set is found.

Fig. 4: Computation time comparison over increasing (a) outlier rate (b) number of points. Ours denotes the proposed method with prioritized progressive voting, while Ours-DV denotes the dimension-wise voting.
Input: , , ,
Output: Maximum consensus set
Initialize best estimation .
.
Sort in decreasing cardinality.
for each key in  do
    if  then
        break;
    .
    for each key in  do
        if  then
            break;
        .
        if  then
            Update .
            Update .
Algorithm 3 Prioritized Progressive Voting

V Experimental Results

In the experiments, we evaluate the proposed consensus maximization solver on (i) the feasibility and effectiveness of the subproblem solvers, (ii) the accuracy and robustness compared with existing methods, and (iii) the performance in real world visual inertial localization applications. We implement the proposed solver in MATLAB on a desktop with an Intel i7-7700 3.60 GHz CPU and 8 GB RAM.

V-A Ablation study

We build the synthetic world consisting of 3D points and lines in the cube . The 2D image projections are generated with randomly sampled camera poses in , as well as their inlier correspondences. Bounded random noise with the bound is added to all the projected 2D image points. Each outlier correspondence is generated from another randomly sampled camera pose different from the ground truth pose. The total number of correspondences is fixed as 50. Specifically, there are 50 point correspondences when evaluating point only methods, while there are 25 point and 25 line correspondences for the point and line methods. We vary the outlier percentage from 10% to 90% with a step of 10%. Statistical performance indicators are averaged over 100 Monte Carlo runs. Denoting the ground truth pose as , we compute the translation error as in meter and the rotation error as the angle of in degree.
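A common way to compute these two error measures is sketched below. The specific formulas (Euclidean distance between translations, and the angle of the relative rotation recovered from its trace) are the usual choices and are assumed here rather than taken from the text; the function names are hypothetical.

```python
import numpy as np

def translation_error(t_gt, t_est):
    """Euclidean distance between ground truth and estimated translation (meters)."""
    return float(np.linalg.norm(np.asarray(t_gt, dtype=float) - np.asarray(t_est, dtype=float)))

def rotation_error_deg(r_gt, r_est):
    """Angle of the relative rotation R_gt^T R_est, in degrees."""
    r_rel = np.asarray(r_gt, dtype=float).T @ np.asarray(r_est, dtype=float)
    # trace(R) = 1 + 2 cos(angle); clip guards against round-off outside [-1, 1].
    cos_angle = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Example: a 10 degree yaw offset and a 5 cm translation offset.
a = np.radians(10.0)
r_est = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a), np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]])
print(rotation_error_deg(np.eye(3), r_est), translation_error([0, 0, 0], [0.03, 0.04, 0.0]))
```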

BnB heuristics. We first evaluate the heuristics introduced in Section IV-A from the aspect of accuracy and efficiency. As shown in Fig. 3, with the heuristics, the efficiency is improved while the accuracy stays similar. Since the final pose is refined by nonlinear optimization, slight rotation error after BnB can be ignored. As a baseline, we also show the error of estimated rotation giving the most inliers in RANSAC, of which the performance is much worse, indicating inconsistency between the identified inliers and the real inliers. In following experiments, heuristics are applied with BnB as default setting.

Translation voting. We then compare the voting strategies introduced in Section IV-B. Now we can evaluate the final accuracy after nonlinear refinement. In addition to efficiency and accuracy, we also evaluate the consistency between the estimated consensus set and the real inlier set (CCI) using precision and recall. As shown in Fig. 4, the computation time of the prioritized progressive voting is slightly higher than that of the dimension-wise voting. More importantly, the extra time stays almost constant with respect to the outlier rate and the number of correspondences, which suggests no growth in complexity for prioritized progressive voting. The CCI and accuracy are shown in the right columns in Tab. I. We see that all variants achieve perfect CCI, naturally leading to high accuracy.
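The CCI measure can be made concrete with the usual set-based definitions of precision and recall. The sketch below assumes correspondences are identified by integer indices and that an empty estimated set scores zero; both conventions, and the function name, are illustrative assumptions.

```python
def precision_recall(estimated, true_inliers):
    """Precision and recall of an estimated consensus set against the real inlier set."""
    est, true = set(estimated), set(true_inliers)
    hits = len(est & true)                      # true inliers that were identified
    precision = hits / len(est) if est else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall

# Conservative estimate: everything identified is correct, but inliers are missed.
print(precision_recall({0, 1, 2}, {0, 1, 2, 3, 4}))
# Optimistic estimate: all inliers found, but two outliers are accepted as well.
print(precision_recall(range(23), range(21)))
```

A conservative solver therefore shows high precision with lower recall, while an optimistic one shows the opposite pattern.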

Sensitivity to noisy inertial measurements. As inertial measurements are noisy, it is necessary to evaluate the sensitivity of the proposed method. We add Gaussian noise with zero mean and increasing standard deviation up to 5 degrees on both the pitch and roll angles. The threshold to judge a successful localization is 0.1m for translation error and 0.5 degree for rotation error as in [36]. The result is shown in Fig. 5, indicating that the proposed algorithm can achieve over 90% success rate when the noise increases to 5 degrees. This level of noise far exceeds the error of pitch and roll estimates in practice [37]. In addition, we can find that the performance is better when employing prioritized progressive voting.

Fig. 5: The sensitivity experiment result using proposed algorithm with dimension-wise voting (solid) and prioritized progressive voting (dash).

V-B Comparison on synthetic datasets

The comparative methods include the RANSAC-based methods EPnP [21], P3P [20] and 2-Entity [28], and the globally optimal method LMI [29]. We use the OpenCV [38] implementation of EPnP and P3P. For LMI, we modify their open source MATLAB code following the paper, since only the code for 3D-3D registration is released. In addition, we control the evaluation data having rotation angle less than and add it as the constraint of LMI, as suggested in [29]. The 2-Entity RANSAC is implemented in MATLAB and we select the mixed sampling strategy, which utilizes both points and lines for pose estimation. All methods are followed by nonlinear refinement on the identified consensus set. We still use the synthetic dataset as in the ablation study.

Fig. 6: (a) The number of inliers in the estimated maximal consensus set w.r.t increasing outliers of successful estimation. (b) The number of inliers in the estimated maximal consensus set for 100 runs when the outlier rate is 80%.
Fig. 7: Success rate with respect to threshold on the whole three sessions 0827 (left), 0828 (center) and 0129 (right).

Efficiency of globally optimal methods. We first compare the efficiency between the proposed method and LMI. We evaluate the computational cost with respect to the number of feature correspondences and the percentage of outliers. The result is shown in Fig. 4. The computational cost of LMI is significantly higher than that of the proposed methods, both for an increasing number of correspondences and for an increasing percentage of outliers. The growing gap may also indicate that the complexity of LMI is higher than ours.

Deterministic convergence. The vital difference between RANSAC and globally optimal methods is the convergence. We compare the number of inliers in the estimated maximal consensus set with respect to increasing outliers when the final pose estimation is successful. The result is shown in Fig. 6, which indicates that the proposed solution achieves deterministic perfect CCI, while RANSAC gives conservative estimations with fewer inliers and LMI finds optimistic estimations by incorrectly regarding outliers as inliers. In addition, both RANSAC and LMI fail when the outlier rate is 90%. The results for all 100 runs when the outlier rate is 80% are also shown in Fig. 6. We can see that the proposed algorithm deterministically finds the globally optimal consensus, while RANSAC achieves global optimality only probabilistically.

Robustness and accuracy. We finally show the performance of all methods on the synthetic data, including accuracy, and precision and recall to measure the CCI, with respect to the percentage of outliers ranging from 60% to 90%. Note that we only evaluate the accuracy for successful trials, since a result on an incorrectly identified consensus set can lead to very large error, distorting the accuracy statistics. The result in Tab. I first confirms that CCI is highly related to the accuracy, validating the feasibility of maximizing the consensus set. RANSAC gives consistently conservative estimations, as the precision remains at a higher level compared with the recall. LMI is prone to regarding outliers as inliers, thus its recall is higher compared with its precision. Considering that LMI, P3P and EPnP are designed for general visual localization, the better performance achieved by 2-Entity and the proposed method, designed for visual inertial localization, is reasonable. But we can still summarize that a superior result can be found by a specialized globally optimal method.

V-C Comparison on visual inertial localization

Finally, we evaluate all the methods on a real world cross-session visual inertial localization task. The dataset employed is YQ-dataset[39]. In the dataset, there are three sessions collected in summer 2017, denoted as 2017-0823, 2017-0827 and 2017-0828, and one session in winter 2018 after snow denoted as 2018-0129. The 3D map is built with 2017-0823 session and the other three sessions are used to evaluate the localization performance, indicating the changing environment. The details to obtain the 3D-2D point and line correspondences can be found in Appendix. For evaluation, we compute the ground truth relative pose between the query camera and the map by aligning the synchronized LiDAR scans. For the pitch and roll angle, we use the estimation of visual inertial odometry [40].


Outlier  Metric     P3P     EPnP    2-Entity  LMI     Ours-DV  Ours
60%      T (m)      0.0010  0.0009  0.0008    0.0128  0.0005   0.0006
         R (°)      0.0196  0.0170  0.0059    0.0083  0.0019   0.0020
         Precision  1.00    1.00    1.00      0.96    1.00     1.00
         Recall     0.99    0.99    1.00      0.98    1.00     1.00
         Success %  100     100     100       65      100      100
70%      T (m)      0.0013  -       0.0011    0.0209  0.0005   0.0006
         R (°)      0.0213  -       0.0211    0.1059  0.0017   0.0028
         Precision  1.00    0       1.00      0.93    1.00     1.00
         Recall     0.98    0       0.99      0.93    1.00     1.00
         Success %  100     0       100       54      100      100
80%      T (m)      0.0017  -       0.0017    0.0246  0.0007   0.0006
         R (°)      0.0267  -       0.0257    0.4778  0.0050   0.0032
         Precision  1.00    0       1.00      0.46    1.00     1.00
         Recall     0.49    0       0.93      0.58    1.00     1.00
         Success %  52      0       96        37      100      100
90%      T (m)      -       -       0.0027    -       0.0007   0.0007
         R (°)      -       -       0.0411    -       0.0073   0.0043
         Precision  0       0       1.00      0.27    1.00     1.00
         Recall     0       0       0.70      0.35    1.00     1.00
         Success %  0       0       86        0       100      100


  • The accuracy is evaluated for successful trials; the precision and recall of CCI are for all test trials.

  • Ours-DV denotes the proposed method with dimension-wise voting.

TABLE I: Accuracy and CCI comparison.

Selected cases performance. We first select several typical examples for evaluation as in [29] and the results are shown in Tab. II. Exp01, Exp02 and Exp03 are cases with pure point features, where Exp03 has lines as disturbance, and the outlier rate in these three cases is 50% or more. The RANSAC-based methods perform poorly compared with the global optimization methods. One thing to note is that in the real world dataset, dimension-wise voting brings a slight performance drop, but still achieves superior performance against the comparative methods. Also note that in Exp03, the proposed method gives optimistic results by regarding 2 outliers as inliers, which may be caused by the unknown noise bound and thus an inappropriate threshold in real world data. In Exp04, Exp05 and Exp06, the utilization of good line features clearly improves the performance of the point and line methods (2-Entity and ours). Overall, the results still confirm the conclusions in simulation.

Full dataset performance. Finally, we arrive at the success rate on the whole three sessions as shown in Fig. 7. As LMI is too slow to process the whole dataset, here we only show the results of ours and the RANSAC methods. We first see that the proposed globally optimal methods consistently outperform the RANSAC methods on all three sessions. The other observation is that prioritized progressive voting gives better accuracy than dimension-wise voting, because it accounts for cases with an extremely low number of inliers.


Case 01   points 9/18    lines 0/0         Case 02   points 15/39   lines 0/0
EPnP      0.9938  0.8025  7/12             0.9026  1.3255  11/21
P3P       0.8187  0.6302  7/11             1.9751  0.5977  10/20
2-Entity  0.6683  0.4351  8/10             0.5703  0.3378  12/21
LMI       0.1630  0.1951  9/13             0.2832  0.2155  14/19
Ours-DV   0.1207  0.1321  9/09             0.1803  0.1550  14/14
Ours      0.1207  0.1321  9/09             0.1753  0.1334  15/15

Case 03   points 21/65   lines 0/2         Case 04   points 23/48   lines 7/15
EPnP      0.4506  0.9741  10/29            0.5504  0.7823  19/28
P3P       0.3213  0.8807  13/27            0.3678  0.4066  19/27
2-Entity  0.3138  0.4603  15/27            0.1405  0.2055  27/33
LMI       0.2998  0.3786  19/44            0.2834  0.1769  22/28
Ours-DV   0.1407  0.1743  21/23            0.0309  0.1607  28/29
Ours      0.1382  0.1707  21/23            0.0253  0.1509  30/30

Case 05   points 21/38   lines 8/13        Case 06   points 96/134  lines 3/4
EPnP      1.0876  0.8111  13/25            0.2705  0.5202  93/112
P3P       1.0876  0.8111  13/25            0.1682  0.5243  90/98
2-Entity  0.1732  0.2687  27/29            0.1163  0.4623  95/108
LMI       0.7641  0.6394  16/28            0.0891  0.2812  96/102
Ours-DV   0.1671  0.1072  29/29            0.0861  0.2791  99/99
Ours      0.1671  0.1072  29/29            0.0861  0.2791  99/99


  • In each inlier entry of a method row, the second number is the number of identified inliers and the first is the number of true inliers among them.

TABLE II: Performance on selected cases in real world.

VI Conclusions

In this paper, we propose a robust solver for the visual inertial localization problem. It achieves global optimization of the consensus maximization problem with deterministic convergence, even when the percentage of outliers is as high as 90%. The key step in our solver is the derivation of translation invariant measurements (TIMs) for both points and lines, which decouples the problem into two smaller subproblems. We then propose 1D BnB and prioritized progressive voting to find the globally optimal rotation and translation, respectively, which accelerates the search. The effectiveness of the proposed method is validated on both synthetic and real-world datasets.

Appendix A Derivation of TIMs

With the aid of inertial measurements, the pitch and roll angles between the current query camera frame and the gravity-aligned world reference frame are globally observable, so the rotation of the query camera with respect to the world can be formulated as


where and denote the observed pitch and roll angle provided by inertial measurements, denotes the yaw angle to be estimated, , . Therefore, the rotation matrix is only determined by the estimation of yaw, which is the same in , as . Thus the degrees of freedom (DoF) of the rotation matrix estimation can be reduced to 1 with the aid of inertial measurements, that is


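To make the 1-DoF parameterization concrete, the following Python sketch builds a rotation from a known pitch and roll and an unknown yaw. It assumes a Z-Y-X Euler convention, R = Rz(yaw) Ry(pitch) Rx(roll), which is a common choice but an assumption here rather than the convention confirmed by the text; all function names are hypothetical. Under this assumption, every entry of R is affine in sin(yaw) and cos(yaw), which is what allows the coefficients of the yaw terms to be collected later.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_from_yaw(yaw, pitch, roll):
    """Assumed Z-Y-X convention: only `yaw` is unknown, pitch/roll come from the IMU."""
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

def affine_in_yaw(pitch, roll):
    """Return (S, C, K) such that R(yaw) = sin(yaw)*S + cos(yaw)*C + K."""
    known = rot_y(pitch) @ rot_x(roll)
    S = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) @ known
    C = np.diag([1.0, 1.0, 0.0]) @ known
    K = np.diag([0.0, 0.0, 1.0]) @ known
    return S, C, K
```

Because S, C and K depend only on the measured pitch and roll, any constraint that is linear in R becomes a sinusoid in the single unknown yaw.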
A-A Derivation of point-TIM

The collinearity of each 2D-3D point correspondence is used to derive the point-TIM, as shown in Fig. 8. Mathematically, given an image key point , we have an un-normalized direction vector from the camera center as


According to the projection geometry, the optical center of camera frame , the 2D point and the corresponding 3D point lie on the same line, which is denoted as . By solving the line equation from the first two points and substituting the third point into the equation, we have


where and . Based on (30), we have two constraints from a correspondence as

Fig. 8: The illustration of 2D-3D point and line features.

Naturally, given another correspondence and , according to


Then we can have two more constraints as


Combining (31)-(32), and can be eliminated; substituting the result into (34)-(35), can also be eliminated, resulting in a constraint only relating to . Recalling (28) and reorganizing the coefficients, we obtain the point-TIM as


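The idea of a translation-free constraint from two point correspondences can be checked numerically. The sketch below uses one compact, equivalent way of writing such a constraint: if each 3D point maps to the camera as p = R P + t (an assumed convention), then R (P1 - P2) lies in the plane spanned by the two viewing directions, so its dot product with their cross product vanishes for any t. This is an illustration under these assumptions, not necessarily the exact coefficient form used above; the function names and test values are hypothetical.

```python
import numpy as np

def rot_z(yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def bearing(R, t, P):
    """Un-normalized viewing direction of world point P (assumes p_cam = R @ P + t)."""
    return R @ P + t

def point_tim(u1, u2, P1, P2, R):
    """Residual (u1 x u2) . R (P1 - P2); it contains no translation term."""
    return float(np.cross(u1, u2) @ (R @ (P1 - P2)))

# Synthetic check with made-up values.
R_true = rot_z(0.3)
t_true = np.array([0.1, -0.2, 0.3])
P1 = np.array([1.0, 2.0, 5.0])
P2 = np.array([-2.0, 0.5, 6.0])
u1 = 2.0 * bearing(R_true, t_true, P1)   # the scale of a bearing is arbitrary
u2 = 0.5 * bearing(R_true, t_true, P2)

res_true = point_tim(u1, u2, P1, P2, R_true)        # ~0 for the correct rotation
res_wrong = point_tim(u1, u2, P1, P2, rot_z(1.0))   # clearly non-zero otherwise
```

The residual is evaluated without ever supplying the translation, which is the property that lets the rotation be searched on its own.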
A-B Derivation of line-TIM

Each line feature correspondence can be represented by a pair of start point and end point of the line segment as shown in Fig. 8. According to the projection geometry, the optical center of the camera, the 2D line segment and the 3D line lie on the same plane. Then the four points , , and are coplanar, denoted as . Similarly, also holds. By solving the plane equation from the first three points and substituting the fourth point into it, we have:


That is


Similarly, for , we have:


That is


Combining the two coplanarity constraints above, the translation can be eliminated, resulting in


Recalling (28), the constraint above can be reorganized into the line-TIM as


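The coplanarity argument for lines can be illustrated in the same way. In the sketch below, the normal of the plane through the camera center and the 2D segment is the cross product of the two endpoint bearings; both 3D line points must lie in this plane, and subtracting the two plane equations removes the translation. The convention p = R P + t, the function names and the numeric values are assumptions made for this illustration.

```python
import numpy as np

def rot_z(yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def line_tim(s_dir, e_dir, Ps, Pe, R):
    """Residual n . R (Ps - Pe), with n the normal of the back-projected plane."""
    n = np.cross(s_dir, e_dir)
    return float(n @ (R @ (Ps - Pe)))

# Synthetic check with made-up values.
R_true = rot_z(0.3)
t_true = np.array([0.1, -0.2, 0.3])
Ps = np.array([1.0, 2.0, 5.0])     # start point of the 3D line segment
Pe = np.array([-2.0, 0.5, 6.0])    # end point of the 3D line segment

# The detected 2D endpoints need not match the 3D endpoints; they only have
# to lie on the same line, so two other points on the 3D line are projected.
Q1 = Ps + 0.2 * (Pe - Ps)
Q2 = Ps + 0.9 * (Pe - Ps)
s_dir = R_true @ Q1 + t_true
e_dir = R_true @ Q2 + t_true

res_true = line_tim(s_dir, e_dir, Ps, Pe, R_true)        # ~0
res_wrong = line_tim(s_dir, e_dir, Ps, Pe, rot_z(1.0))   # non-zero
```

As with points, the residual has the same algebraic shape (a fixed vector dotted with a rotated 3D difference), which is consistent with treating both kinds of TIM uniformly.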
A-C Derivation of TIMs’ lower bound

Recall (36) and (42), as the forms of point-TIM and line-TIM are the same, we denote them as . That is


where , , .

Then the lower bound of on , denoted as , is derived as


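A lower bound of this kind is what drives a 1D branch-and-bound over yaw. The sketch below assumes each TIM residual has the sinusoidal form a sin(x) + b cos(x) + c in the yaw x, which follows when the rotation is affine in sin and cos of yaw; the exact form and threshold used by the authors may differ. It computes the exact minimum of the absolute residual over an interval and uses it in a best-first BnB that maximizes the number of TIMs within a threshold `eps`. All names and parameters are hypothetical.

```python
import heapq
import numpy as np

def tim_lower_bound(a, b, c, lo, hi):
    """Exact minimum of |a*sin(x) + b*cos(x) + c| for x in [lo, hi] (vectorized)."""
    amp = np.hypot(a, b)              # a*sin(x) + b*cos(x) = amp*sin(x + phase)
    phase = np.arctan2(b, a)
    t0, t1 = lo + phase, hi + phase
    s_min = np.minimum(np.sin(t0), np.sin(t1))
    s_max = np.maximum(np.sin(t0), np.sin(t1))
    two_pi = 2.0 * np.pi
    # First peak (pi/2 + 2k*pi) and first trough (-pi/2 + 2k*pi) at or after t0.
    peak = np.pi / 2 + two_pi * np.ceil((t0 - np.pi / 2) / two_pi)
    trough = -np.pi / 2 + two_pi * np.ceil((t0 + np.pi / 2) / two_pi)
    s_max = np.where(peak <= t1, 1.0, s_max)
    s_min = np.where(trough <= t1, -1.0, s_min)
    f_min, f_max = amp * s_min + c, amp * s_max + c
    crosses_zero = (f_min <= 0.0) & (f_max >= 0.0)
    return np.where(crosses_zero, 0.0, np.minimum(np.abs(f_min), np.abs(f_max)))

def bnb_yaw(a, b, c, eps, tol=1e-5):
    """Best-first 1D BnB; intervals narrower than `tol` are not split further."""
    def count_at(x):
        return int(np.sum(np.abs(a * np.sin(x) + b * np.cos(x) + c) <= eps))

    def upper_bound(lo, hi):
        # A TIM can be an inlier somewhere in [lo, hi] only if its lower bound <= eps.
        return int(np.sum(tim_lower_bound(a, b, c, lo, hi) <= eps))

    best_yaw, best_count = 0.0, count_at(0.0)
    heap = [(-upper_bound(-np.pi, np.pi), -np.pi, np.pi)]
    while heap:
        neg_ub, lo, hi = heapq.heappop(heap)
        if -neg_ub <= best_count:
            break                     # no remaining interval can beat the incumbent
        mid = 0.5 * (lo + hi)
        cnt = count_at(mid)
        if cnt > best_count:
            best_yaw, best_count = mid, cnt
        if hi - lo < tol:
            continue
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            ub = upper_bound(sub_lo, sub_hi)
            if ub > best_count:
                heapq.heappush(heap, (-ub, sub_lo, sub_hi))
    return best_yaw, best_count
```

The search needs no initial value: an interval is discarded only when the number of TIMs that could possibly be inliers inside it does not exceed the best count already found.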
Appendix B Derivation of Translation Bound

After the rotation estimation, we get the optimal yaw angle . As shown in Fig. 8, according to , we have


which is equal to


where denotes the skew-symmetric matrix of vector . Then (46) can be written as


where . Then two equations of translation can be derived as


Similarly, with another point correspondence , we have


where . Then we have another two equations as


Combining (48)-(49) and (51)-(52), the translation can be solved as


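Once the rotation is fixed, the collinearity constraints are linear in the translation. The following sketch solves for the translation from two point correspondences by stacking the cross-product constraints [u]x (R P + t) = 0 and solving them in the least-squares sense. It keeps all three rows of each skew-symmetric matrix rather than selecting two equations per point, which is a simplification chosen for this illustration; the convention p = R P + t, the function names and the values are assumptions.

```python
import numpy as np

def rot_z(yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def translation_from_points(bearings, points, R):
    """Solve skew(u_i) @ t = -skew(u_i) @ R @ P_i for t over all correspondences."""
    A = np.vstack([skew(u) for u in bearings])
    rhs = np.concatenate([-skew(u) @ (R @ P) for u, P in zip(bearings, points)])
    t, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return t

# Synthetic check with made-up values.
R_opt = rot_z(0.3)
t_true = np.array([0.1, -0.2, 0.3])
P1 = np.array([1.0, 2.0, 5.0])
P2 = np.array([-2.0, 0.5, 6.0])
u1 = 2.0 * (R_opt @ P1 + t_true)   # un-normalized bearings, arbitrary scale
u2 = 0.5 * (R_opt @ P2 + t_true)

t_est = translation_from_points([u1, u2], [P1, P2], R_opt)
```

Each point contributes two independent equations, so two non-parallel bearings are enough to determine all three translation components.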
In addition, the translation can also be solved with one point and one line correspondence. According to (37)


we have


Then (37) can be written as


where . Similarly, with (39), we have


where . Thus, combining (48)-(49) and (58)-(59), the translation can be solved as