Visual inertial navigation is currently a popular option for state estimation in mobile robots, autonomous vehicles and augmented reality applications. Much effort has been devoted to building accurate, consistent and efficient visual inertial odometry. However, its inherent drift is unacceptable in long-term operation, which calls for absolute pose estimation as a correction. Map-based visual inertial localization is therefore an important component of a complete navigation system. Its underlying problem is to estimate the absolute pose from a set of feature correspondences between 2D image key points and global 3D map points. One main challenge in this problem is the robustness of the solver against outliers (incorrect feature correspondences). When a high percentage of the correspondences are outliers, the performance of a general pose estimator may degrade seriously.
Pose estimation with outliers is generally stated as a consensus maximization problem. One popular solution is random sample consensus (RANSAC), which has many variants and has been employed in many visual localization methods. The advantages of RANSAC are its simple implementation and its usefulness in many scenarios with a moderate percentage of outliers. Its disadvantages are that (i) it cannot tolerate an extreme percentage of outliers, say 90%, and (ii) it is a probabilistic method and thus does not guarantee deterministic global optimality.
In contrast to RANSAC, global optimization based methods offer another solution to consensus maximization and can give a globally optimal solution without relying on an initial value. However, one obstacle to their application is the considerable computation time. Most global optimization methods target general geometric estimation problems. They employ branch-and-bound (BnB) as the basic framework to reduce the search space, or mixed integer programming for further acceleration. The computational cost nevertheless remains unsatisfactory because the dimensions of the search space are coupled.
In this paper, we propose a deterministic visual inertial localization solution that achieves global convergence with much higher efficiency by dividing the search space into multiple 1-D search spaces. Specifically, inspired by the minimal solution in RANSAC, we build an intermediate cost function for both point and line features, the translation invariant measurements (TIMs), to decouple consensus maximization into two cascaded subproblems related only to rotation and translation respectively. Based on TIMs, the globally optimal rotation is then searched by 1-dimensional BnB with the aid of inertial measurements. For the translation part, the 3-dimensional search is replaced with three 1-dimensional searches using prioritized progressive voting. To the best of our knowledge, this is the first solver for visual inertial localization with deterministic global optimality. In summary, the contributions include:
A TIMs based formulation of visual inertial localization that decouples the problem and enables 1D BnB based global optimization of the rotation.
A prioritized progressive voting method that replaces the 3D translation space search with three 1D searches for global optimization of the translation.
Experiments on simulation and real-world cross-session datasets that validate the effectiveness and efficiency of the proposed method against comparative methods.
The remainder of the paper is organized as follows: Section II reviews the related literature. Section III presents the decoupling of the consensus maximization problem. Section IV introduces the solutions of the subproblems. Section V presents the experimental settings and results, followed by Section VI, which concludes the paper.
II Related Works
II-A Visual localization
Visual localization and navigation for mobile robots have been studied extensively in the robotics and computer vision communities over the past decade. A general visual navigation system has two components: visual odometry, which estimates the relative pose and drifts in the long term, and visual localization, which eliminates the drift by registering the image to a global map. More recently, inertial sensors have been added to the system to improve accuracy and robustness. Specifically, the inertial sensor provides globally observable pitch and roll measurements, reducing the degrees of freedom (DoF) of the visual inertial localization problem to 4. This reduction has been utilized when formulating pose estimation from a set of inlier feature correspondences. However, little work has been done on outlier elimination when inertial measurements are provided.
II-B Random sample consensus
For robust localization from feature correspondences that contain outliers, RANSAC is the most popular solution and is employed in many visual navigation systems. For visual localization without inertial measurements, i.e. the 6-DoF case, many variants exist: RANSAC based on point feature correspondences has been studied, and RANSAC has also been extended to line features. When inertial measurements are provided, the DoF of the problem is reduced, which has been exploited by RANSAC to improve robustness and has been extended to both point and line correspondences. As RANSAC is built on randomized sampling theory, it is simple to implement and performs well in scenarios with moderate outliers. Its disadvantages are also clear, including low tolerance to extreme outlier rates, local convergence and no guarantee of optimality.
II-C Global optimization method
Global optimization methods have been proposed to achieve global optimality and deterministic convergence, addressing the shortcomings of RANSAC. In this branch of the literature, Branch-and-Bound (BnB) is the most widely used framework; it gradually prunes the solution space by coarse-to-fine division. BnB has been used to solve 2D-2D registration problems, and a general BnB framework for point, line and plane features has been proposed to solve 3D-3D registration. Integrated with mixed integer programming, BnB optimization can converge faster. Introducing linear matrix inequality constraints into mixed integer programming results in a faster BnB for 2D-2D, 2D-3D and 3D-3D geometric vision problems. In the works mentioned above, the rotation is modeled as a rotation matrix with matrix-level constraints, so it is unclear how inertial measurements can be incorporated. In addition, some works propose globally optimal algorithms that specialize in one class of problem. Pairs of features have been used to decouple 3D-3D registration, and TEASER has been proposed for decoupled scaled 3D-3D registration. These works show that specialized algorithms can outperform the general BnB framework, even an accelerated one.
In this paper, we follow the idea of a specialized solver to fill the gap of a globally optimal deterministic solution for visual inertial localization, which is a robust 3D-2D pose estimation problem with inertial measurements. To the best of our knowledge, this is the first work to study this problem in the context of global optimality. We expect this solution to be accurate and efficient.
III Decoupling Translation and Rotation
The underlying problem of visual inertial localization is pose estimation from 3D-2D correspondences with outliers. Formally, given a set of correspondences between 3D global points and 2D visual points, each correspondence satisfies
where the rotation and translation form the camera pose to be estimated, the camera projection function has known intrinsic parameters, the measurement noise is assumed to be bounded and random, and the outlier term is zero for an inlier and an arbitrary number for an outlier. To deal with outliers, robust pose estimation generally begins with the consensus maximization problem
where each indicator is binary and marks whether the outlier term of the corresponding measurement is zero. To solve the problem globally, general BnB algorithms search the full pose space, which couples the rotation space and the translation space. This can lead to exponential computational complexity in bad cases. Local techniques like RANSAC may estimate the inliers conservatively, i.e. regard some inliers as outliers, especially when noise is unavoidable.
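The objective of consensus maximization can be illustrated by counting, for one candidate pose, the correspondences whose reprojection residual stays within the noise bound. The sketch below is only an illustration under assumptions: the pinhole model, the convention that a world point is mapped to the camera frame by rotation followed by translation, and the names `project`, `consensus_size` and `noise_bound` are all introduced here and are not taken from the paper.

```python
import numpy as np

def project(K, R, t, p_world):
    """Pinhole projection of Nx3 world points to pixel coordinates.

    Assumed convention: p_cam = R @ p_world + t.
    """
    p_cam = p_world @ R.T + t          # world frame -> camera frame
    uv = p_cam @ K.T                   # apply the intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]      # perspective division

def consensus_size(K, R, t, p_world, uv_obs, noise_bound):
    """Count correspondences whose reprojection error is within the bound."""
    residual = np.linalg.norm(project(K, R, t, p_world) - uv_obs, axis=1)
    inlier_mask = residual <= noise_bound
    return int(inlier_mask.sum()), inlier_mask
```

A consensus maximization solver searches for the pose that maximizes this count; the difficulty discussed in this section lies in searching the pose space, not in evaluating a single candidate.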
III-A Translation invariant measurements
Inspired by the minimal solution in RANSAC, we develop an intermediate measurement which is invariant to the translation of the pose. Mathematically, given an image key point, we have an un-normalized direction vector from the camera center as
Then the corresponding world point is transformed to the camera coordinates and satisfies
where and . Based on (30), we have two constraints from a correspondence. Naturally, given another correspondence and , we can have two more constraints as
According to (30) and (33), we have linear constraints of the translation . With proper variable substitutions among the constraints, and the globally observable pitch and roll angles from inertial measurements, we can eliminate , reduce to , and derive TIM as
where is the unknown yaw angle, , , and the derivation details are presented in the Appendix. Now we substitute the constraints which are related to both and in (2) with the TIM, leading to
where , indicates the -th and -th correspondence derived the constraint are inliers.
Similar to a pair of point correspondences, given a set of line correspondences , it is also possible to develop TIM. Given the end points of the image line segment and , we have two un-normalized directions as (29), denoted as and .
Then following the fact that the point on the world line lies on the plane spanned by the rays from camera center along direction and , we have
which is a constraint on both rotation and translation. Since an arbitrary number of points can be sampled from a line, we sample another point on the same world line to formulate a second constraint of the form (10). A single line correspondence is therefore sufficient to derive a line-TIM after proper substitution as
where the line-TIM has the same form as point-TIM in (36), but the coefficients are different. The derivation details are also presented in the Appendix.
III-B Two-stage consensus maximization solver
With TIMs for both point and line correspondences, we decouple the original consensus maximization problem into a rotation-only problem and a translation-only problem in which the rotation is fixed. Accordingly, the proposed solver has two stages in cascade: rotation estimation followed by translation estimation, both detailed in Section IV.
IV Estimators of Rotation and Translation
IV-A BnB based optimization for rotation
We employ a BnB strategy to solve problem (12). The cost function in (12) depends on both the yaw angle and the inlier indicators. However, once the yaw angle is fixed, the indicators follow directly from evaluating the constraints. We therefore write the cost function as a function of the yaw angle alone, interpreted as the number of inliers given that yaw angle.
Upper bound of cost function. We then derive the upper bound of the cost function on a subset of the yaw domain. Recalling (36) and (42), the point-TIM and line-TIM share the same form, so we use a common notation for both. The lower bound of the TIM residual on the subset is derived as
where the derivation of the coefficients is given in the Appendix. Note that this lower bound can be computed analytically without any iteration. Now we formulate a consensus maximization problem as
where the problem is defined on the subset, and the TIM constraints are replaced with their tight lower bounds, which relaxes the constraints and yields an optimistic estimate of the inlier count. We then have
as a tight upper bound. Equality holds when all constraints attain their lower bounds at the same yaw angle, which is only possible in the noise-free case.
Accelerate BnB optimization. With (12-19), we obtain the BnB search for the globally optimal rotation, whose pseudo code is listed in Algorithm 1. The main idea of BnB is to prune a region of the solution space when its upper bound is smaller than the current best estimate. Therefore, if a fast solution provides a good initial estimate, most of the solution space can be pruned at an early stage, significantly improving the search efficiency. To implement this idea, we use RANSAC to generate a rough initial estimate. In addition, we introduce a heuristic to balance global optimality and efficiency. The best yaw estimates obtained during RANSAC are used to initialize subsets of the search domain. Each subset is centered at one of these estimates and has a fixed width. When the width is large, global optimality is emphasized, and vice versa. Another implementation trick is to store the respective inliers when evaluating (16) on each subset. When a subset is further divided into smaller subsets, only the stored inliers within it are evaluated instead of all constraints, which saves considerable computation. The experimental ablation study shows that these techniques accelerate the search without a drop in accuracy.
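The pruning logic of a 1-D BnB over the yaw angle can be sketched as follows. This is a minimal illustration under assumptions, not Algorithm 1: each TIM residual is assumed to have the sinusoidal form a*sin(theta) + b*cos(theta) + c, the names `bnb_yaw`, `upper_bound`, `residual_lower_bound` and `min_width` are hypothetical, and the RANSAC initialization and inlier caching described above are omitted. Optimality holds only up to the resolution `min_width`.

```python
import heapq
import math

def sin_range(lo, hi, phi):
    """Min and max of sin(theta + phi) for theta in [lo, hi]."""
    a, b = lo + phi, hi + phi
    vals = [math.sin(a), math.sin(b)]
    k = math.ceil((a - math.pi / 2) / math.pi)   # first stationary point >= a
    while math.pi / 2 + k * math.pi <= b:
        vals.append(math.sin(math.pi / 2 + k * math.pi))
        k += 1
    return min(vals), max(vals)

def residual_lower_bound(coef, lo, hi):
    """Closed-form min of |a sin(theta) + b cos(theta) + c| on [lo, hi]."""
    a, b, c = coef
    amp, phi = math.hypot(a, b), math.atan2(b, a)
    s_min, s_max = sin_range(lo, hi, phi)
    d_min, d_max = amp * s_min + c, amp * s_max + c
    if d_min <= 0.0 <= d_max:
        return 0.0
    return min(abs(d_min), abs(d_max))

def count_inliers(coefs, theta, bound):
    """Number of constraints satisfied at a single yaw angle."""
    return sum(abs(a * math.sin(theta) + b * math.cos(theta) + c) <= bound
               for a, b, c in coefs)

def upper_bound(coefs, lo, hi, bound):
    """Optimistic inlier count: constraints that could hold somewhere in [lo, hi]."""
    return sum(residual_lower_bound(cf, lo, hi) <= bound for cf in coefs)

def bnb_yaw(coefs, bound, min_width=1e-4):
    """Best-first branch-and-bound over yaw in [-pi, pi]."""
    best_theta, best_count = 0.0, count_inliers(coefs, 0.0, bound)
    heap = [(-len(coefs), -math.pi, math.pi)]    # (negated upper bound, lo, hi)
    while heap:
        neg_ub, lo, hi = heapq.heappop(heap)
        if -neg_ub <= best_count:
            break                                # no remaining interval can improve
        mid = 0.5 * (lo + hi)
        cnt = count_inliers(coefs, mid, bound)
        if cnt > best_count:
            best_theta, best_count = mid, cnt
        if hi - lo < min_width:
            continue                             # resolution limit reached
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            ub = upper_bound(coefs, sub_lo, sub_hi, bound)
            if ub > best_count:                  # otherwise the branch is pruned
                heapq.heappush(heap, (-ub, sub_lo, sub_hi))
    return best_theta, best_count
```

Starting from a good initial `best_count`, as the RANSAC initialization aims to provide, would let the `ub > best_count` test discard more intervals at the first levels of the search.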
IV-B Prioritized progressive voting for translation
Decoupled linear constraints. Note that a point correspondence constraint (30) provides two linear equations, while a line correspondence constraint (10) provides one. Therefore, a pair of correspondences that includes at least one point correspondence, say one point correspondence and another point or line correspondence, is sufficient to solve the small linear system for the translation (see Appendix for details), which gives
The constraints are now decoupled for each dimension of the translation. Taking one dimension as an example, we have
which yields three dimension-wise, linearly constrained consensus maximization problems.
Dimension-wise voting algorithm. We use a voting algorithm to solve the problem. We first specify the noise bound in (24). Given the noise bound in (20), the noise bound for the translation estimate follows from existing bound propagation techniques as
The details can be found in the Appendix.
Still taking one dimension as an example, each estimate defines an interval around it whose half-width is the noise bound. If the true translation component lies in this interval, then the true inlier set contains the two correspondences from which the estimate was derived. The insight is that the inlier set only changes its membership when the true value enters a new interval. Moreover, the maximum number of possible consensus sets, i.e. the cardinality of the solution space, grows linearly with the number of estimates, which is itself quadratic in the number of correspondences. This complexity makes a voting algorithm over all sets feasible. By counting the unique correspondences among the votes in each set, we get the corresponding consensus set, and the maximal consensus set leads to an estimate of the translation component. An illustrative case is shown in Fig. 2 and the pseudo code is listed in Algorithm 2, with one dimension as the example. For simplicity, the pseudo code uses abbreviated notation. Following a similar idea, the voting algorithm is repeated three times to estimate the three translation components.
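The interval voting along one dimension can be sketched as a sweep over interval endpoints. The code below is an illustrative reading of the description above, not Algorithm 2: the function name `vote_1d`, the tie-break on the number of votes, and the choice of averaging the winning estimates are assumptions made here. For clarity it recomputes the consensus set at every opening endpoint, which is simple but not the most efficient option.

```python
def vote_1d(estimates, pairs, bound):
    """Dimension-wise voting on one translation component.

    estimates[k] is the 1-D estimate obtained from the correspondence pair
    pairs[k] = (i, j); each estimate votes for [estimate - bound, estimate + bound].
    Returns the estimate and the set of unique correspondences that support it.
    """
    events = []
    for k, t in enumerate(estimates):
        events.append((t - bound, 0, k))   # interval opens (sorted before closes)
        events.append((t + bound, 1, k))   # interval closes
    events.sort()

    active = set()                         # intervals covering the sweep position
    best_key, best_set, best_votes = (0, 0), set(), []
    for _, kind, k in events:
        if kind == 1:
            active.discard(k)
            continue
        active.add(k)
        # The consensus set can only grow when a new interval opens.
        members = {c for a in active for c in pairs[a]}
        key = (len(members), len(active))  # unique correspondences, then votes
        if key > best_key:
            best_key, best_set, best_votes = key, members, sorted(active)

    if not best_votes:
        return None, set()
    t_hat = sum(estimates[a] for a in best_votes) / len(best_votes)
    return t_hat, best_set
```

Running the same routine on each of the three components, with the pairwise estimates of that component, corresponds to the independent dimension-wise voting described above.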
Prioritized progressive voting algorithm. When the number of inliers is high, independent voting along the three dimensions is feasible. But when the number of inliers is low and the outlier rate is high, independent dimension-wise voting may fail. The reason is that, although it is almost impossible for more outliers than inliers to share a similar full translation, it is possible for more outliers than inliers to share a similar value in a single dimension. In such a scenario, the search along the first dimension yields an incorrect estimate, which cannot be corrected by the successive voting along the other two dimensions.
To deal with such scenarios while keeping the computational complexity low, we propose prioritized progressive voting for translation in Algorithm 3. The main idea is to vote progressively on the three dimensions, with a priority, i.e. the number of votes, that allows early termination. The experimental results show that the computational cost of prioritized progressive voting is close to that of dimension-wise voting. Alternatively, a 3D BnB translation search can be used for better accuracy, but it is slower because of the coupled multi-dimensional solution space. Finally, once the maximum consensus set is found, we apply nonlinear refinement to achieve the best accuracy.
V Experimental Results
In the experiments, we evaluate the proposed consensus maximization solver on (i) the feasibility and effectiveness of the subproblem solvers, (ii) the accuracy and robustness compared with existing methods, and (iii) the performance in real-world visual inertial localization applications. We implement the proposed solver in MATLAB on a desktop with an Intel i7-7700 CPU at 3.60 GHz and 8 GB of RAM.
V-A Ablation study
We build a synthetic world consisting of 3D points and lines inside a cube. The 2D image projections and their inlier correspondences are generated from randomly sampled camera poses. Bounded random noise is added to all projected 2D image points. Each outlier correspondence is generated from another randomly sampled camera pose that differs from the ground truth pose. The total number of correspondences is fixed at 50: 50 point correspondences when evaluating point-only methods, and 25 point plus 25 line correspondences for the point-and-line methods. We vary the outlier percentage from 10% to 90% in steps of 10%. Statistical performance indicators are averaged over 100 Monte Carlo runs. Given the ground truth pose, we compute the translation error as the distance between the estimated and ground truth translations in meters, and the rotation error as the angle of the relative rotation between the estimated and ground truth rotations in degrees.
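The two error metrics can be computed as in the following sketch. The function names are hypothetical, and the relative rotation is assumed here to be the transposed ground truth rotation multiplied by the estimate; the rotation angle is recovered from the trace of that relative rotation.

```python
import numpy as np

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth translation (meters)."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation R_gt^T @ R_est, in degrees."""
    R_rel = np.asarray(R_gt).T @ np.asarray(R_est)
    # trace(R) = 1 + 2 cos(angle); clip guards against round-off outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))
```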
BnB heuristics. We first evaluate the heuristics introduced in Section IV-A in terms of accuracy and efficiency. As shown in Fig. 3, the heuristics improve the efficiency while the accuracy stays similar. Since the final pose is refined by nonlinear optimization, the slight rotation error after BnB can be ignored. As a baseline, we also show the error of the rotation estimate that yields the most inliers in RANSAC. Its performance is much worse, indicating an inconsistency between the identified inliers and the real inliers. In the following experiments, the heuristics are applied with BnB as the default setting.
Translation voting. We then compare the voting strategies introduced in Section IV-B. Here we can evaluate the final accuracy after nonlinear refinement. In addition to efficiency and accuracy, we also evaluate the consistency between the estimated consensus set and the real inlier set (CCI) using precision and recall. As shown in Fig. 4, the computation time of prioritized progressive voting is slightly higher than that of dimension-wise voting. More importantly, the additional time stays almost constant with respect to the outlier rate and the number of correspondences, which suggests no growth in complexity for prioritized progressive voting. The CCI and accuracy are shown in the right columns of Tab. I. All variants achieve perfect CCI, which naturally leads to high accuracy.
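Precision and recall of the CCI can be computed from index sets as in this small sketch; the function name and the handling of empty sets are assumptions made for illustration.

```python
def cci_precision_recall(estimated, true_inliers):
    """Precision and recall of an estimated consensus set against the true inliers."""
    estimated, true_inliers = set(estimated), set(true_inliers)
    hits = len(estimated & true_inliers)
    precision = hits / len(estimated) if estimated else 0.0
    recall = hits / len(true_inliers) if true_inliers else 0.0
    return precision, recall
```

A conservative estimator, which rejects some true inliers, shows high precision and lower recall; an optimistic one, which accepts some outliers, shows the opposite pattern.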
Sensitivity to noisy inertial measurements. As inertial measurements are noisy, it is necessary to evaluate the sensitivity of the proposed method to this noise. We add zero-mean Gaussian noise with a standard deviation increasing up to 5 degrees to both the pitch and roll angles. A localization is judged successful when the translation error is below 0.1 m and the rotation error is below 0.5 degrees. The result in Fig. 5 indicates that the proposed algorithm achieves a success rate of over 90% even when the noise increases to 5 degrees. This noise level is far higher than that of pitch and roll estimates in practice. In addition, the performance is better when prioritized progressive voting is employed.
V-B Comparison on synthetic datasets
implementation of EPnP and P3P. For LMI, we modify the authors' open source MATLAB code following the paper, since only the code for 3D-3D registration is released. In addition, we restrict the rotation angle of the evaluation data to a bounded range and add this bound as a constraint of LMI, as suggested in prior work. The 2-Entity RANSAC is implemented in MATLAB, and we select the mixed sampling strategy, which utilizes both points and lines for pose estimation. All methods are followed by nonlinear refinement on the identified consensus set. We use the same synthetic dataset as in the ablation study.
Efficiency of globally optimal methods. We first compare the efficiency of the proposed method and LMI. We evaluate the computational cost with respect to the number of feature correspondences and the percentage of outliers. As shown in Fig. 4, the computational cost of LMI is significantly higher than that of the proposed methods, both for an increasing number of correspondences and for an increasing percentage of outliers. The growing gap may also indicate that the complexity of LMI is higher than ours.
Deterministic convergence. The vital difference between RANSAC and globally optimal methods is the convergence. We compare the number of inliers in the estimated maximal consensus set with respect to an increasing outlier rate when the final pose estimation is successful. The result in Fig. 6 indicates that the proposed solution achieves deterministic perfect CCI, while RANSAC gives conservative estimates with fewer inliers and LMI gives optimistic estimates by incorrectly regarding outliers as inliers. In addition, both RANSAC and LMI fail when the outlier rate is 90%. The results for all 100 runs at an outlier rate of 80% are also shown in Fig. 6. The proposed algorithm deterministically finds the globally optimal consensus, while RANSAC achieves global optimality only probabilistically.
Robustness and accuracy. We finally show the performance of all methods on the synthetic data, including accuracy as well as precision and recall to measure the CCI, with respect to the percentage of outliers ranging from 60% to 90%. Note that we only evaluate the accuracy for successful trials, since a result based on an incorrectly identified consensus set can have a very large error and distort the accuracy statistics. The result in Tab. I first confirms that CCI is highly related to the accuracy, validating the feasibility of maximizing the consensus set. RANSAC gives consistently conservative estimates, as its precision remains higher than its recall. LMI is prone to regarding outliers as inliers, so its recall is higher than its precision. Considering that LMI, P3P and EPnP are designed for general visual localization, the better performance achieved by 2-Entity and the proposed method, both designed for visual inertial localization, is reasonable. We can still conclude that a specialized globally optimal method yields superior results.
V-C Comparison on visual inertial localization
Finally, we evaluate all the methods on a real-world cross-session visual inertial localization task. The dataset employed is the YQ-dataset. It contains three sessions collected in summer 2017, denoted as 2017-0823, 2017-0827 and 2017-0828, and one session collected in winter 2018 after snow, denoted as 2018-0129. The 3D map is built from the 2017-0823 session and the other three sessions are used to evaluate the localization performance under a changing environment. The details of obtaining the 3D-2D point and line correspondences can be found in the Appendix. For evaluation, we compute the ground truth relative pose between the query camera and the map by aligning the synchronized LiDAR scans. For the pitch and roll angles, we use the estimates of visual inertial odometry.
The accuracy is evaluated for successful trials; the precision and recall of CCI are computed over all test trials.
Ours-DV denotes the proposed method with dimension-wise voting.
Selected cases performance. We first select several typical examples for evaluation, and the results are shown in Tab. II. Exp01, Exp02 and Exp03 are cases with pure point features, where Exp03 has lines as a disturbance, and the outlier rate in all three cases is above 50%. The RANSAC-based methods perform poorly compared with the global optimization methods. Note that on the real-world dataset, dimension-wise voting brings a slight performance drop but still outperforms the comparative methods. Also note that in Exp03, the proposed method gives an optimistic result by regarding 2 outliers as inliers, which may be caused by the unknown noise bound and thus an inappropriate threshold on real-world data. In Exp04, Exp05 and Exp06, the use of good line features clearly improves the performance of the point-and-line methods (2-Entity and ours). Overall, the results confirm the conclusions drawn in simulation.
Full dataset performance. Finally, we report the success rate on all three sessions in Fig. 7. As LMI is too slow to process the whole dataset, we only show the results of our methods and the RANSAC methods. The proposed globally optimal methods consistently outperform the RANSAC methods on all three sessions. Moreover, prioritized progressive voting is more accurate than dimension-wise voting, because it accounts for cases with an extremely low number of inliers.
denotes the number of identified inliers, while the true inliers.
VI Conclusion
In this paper, we propose a robust solver designed for the visual inertial localization problem, achieving global optimization of the consensus maximization problem with deterministic convergence, even when the percentage of outliers is very high, say 90%. The key step in our solver is the derivation of translation invariant measurements for both points and lines, which decouples the problem into two smaller subproblems. We then propose 1D BnB and prioritized progressive voting to find the globally optimal rotation and translation respectively, improving the search efficiency. The effectiveness of the proposed method is validated on both synthetic and real-world datasets.
Appendix A Derivation of TIMs
With the aid of inertial measurements, the pitch and roll angle between the current query camera frame and the gravity-aligned world reference frame are globally observable, such that the rotation estimation of the query camera with respect to the world can be formulated as
where and denote the observed pitch and roll angle provided by inertial measurements, denotes the yaw angle to be estimated, , . Therefore, the rotation matrix is only determined by the estimation of yaw, which is the same in , as . Thus the degrees of freedom (DoF) of the rotation matrix estimation can be reduced to 1 with the aid of inertial measurements, that is
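The reduction to a single rotational DoF can be illustrated numerically. The sketch below assumes a yaw-pitch-roll (Z-Y-X) composition, which is a common convention but is an assumption here, as are all function names. Under that assumption the rotation matrix is linear in the sine and cosine of the yaw angle, which is why a constraint that is linear in the rotation becomes sinusoidal in yaw.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_from_yaw(yaw, pitch, roll):
    """Assumed composition: unknown yaw applied on top of known pitch and roll."""
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

def yaw_decomposition(pitch, roll):
    """Constant matrices A, B, C with R(yaw) = cos(yaw) A + sin(yaw) B + C."""
    R0 = rot_y(pitch) @ rot_x(roll)                  # known from inertial data
    J = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    A = np.diag([1.0, 1.0, 0.0]) @ R0
    B = J @ R0
    C = np.diag([0.0, 0.0, 1.0]) @ R0
    return A, B, C
```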
A-A Derivation of point-TIM
The collinearity of each 2D-3D point correspondence is used to derive the point-TIM, as shown in Fig. 8. Mathematically, given an image key point, we have an un-normalized direction vector from the camera center as
According to the projection geometry, the optical center of camera frame , the 2D point and the corresponding 3D point lie on the same line, which is denoted as . By solving the line equation from the first two points and substituting the third point into the equation, we have
where and . Based on (30), we have two constraints from a correspondence as
Naturally, given another correspondence and , according to
Then we can have two more constraints as
Combining (31) - (32), two of the translation unknowns can be eliminated; substituting the result into (34) - (35) eliminates the remaining one, resulting in a constraint that relates only to the yaw angle. Recalling (28) and reorganizing the coefficients, we have the point-TIM as
A-B Derivation of line-TIM
Each line feature correspondence can be represented by a pair of start point and end point of the line segment as shown in Fig. 8. According to the projection geometry, the optical center of the camera, the 2D line segment and the 3D line lie on the same plane. Then the four points , , and are coplanar, denoted as . Similarly, also holds. By solving the plane equation from the first three points and substituting the fourth point into it, we have:
Similarly, for , we have:
A-C Derivation of TIMs’ lower bound
where , , .
Then the lower bound of on , denoted as , is derived as
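As a hedged sketch of how such a bound can be obtained in closed form, assume that each TIM residual is sinusoidal in the yaw angle. The symbols $a$, $b$, $c$, $\theta$, $\theta_l$ and $\theta_u$ below are illustrative names and not necessarily the notation of the paper.

```latex
\begin{align}
d(\theta) &= a\sin\theta + b\cos\theta + c = A\sin(\theta+\varphi) + c,
\qquad A = \sqrt{a^2+b^2},\quad \varphi = \operatorname{atan2}(b, a), \\
s_{\min} &= \min_{\theta\in[\theta_l,\theta_u]} \sin(\theta+\varphi),
\qquad s_{\max} = \max_{\theta\in[\theta_l,\theta_u]} \sin(\theta+\varphi), \\
\min_{\theta\in[\theta_l,\theta_u]} |d(\theta)| &=
\begin{cases}
0, & A s_{\min} + c \le 0 \le A s_{\max} + c, \\
\min\bigl(|A s_{\min} + c|,\; |A s_{\max} + c|\bigr), & \text{otherwise.}
\end{cases}
\end{align}
```

The extrema $s_{\min}$ and $s_{\max}$ are attained either at the interval endpoints or at interior points where $\theta + \varphi = \pi/2 + k\pi$, so the bound requires only a fixed number of evaluations and no iteration.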
Appendix B Derivation of Translation Bound
After the rotation estimation, we get the optimal yaw angle . As shown in Fig. 8, according to , we have
which is equal to
where the bracket operator denotes the skew-symmetric matrix of a vector. Then (46) can be written as
where . Then two equations of translation can be derived as
Similarly, with another point correspondence , we have
where . Then we have another two equations as
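The small linear system for the translation can be sketched numerically as follows. This is an illustration under assumptions: a pinhole camera with intrinsic matrix `K`, the convention that a world point maps to the camera frame by rotation followed by translation, and hypothetical function names. Each point correspondence contributes a skew-symmetric (cross product) constraint of rank two, so two point correspondences with distinct viewing directions determine the three translation components.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w equals the cross product v x w."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def direction(K, uv):
    """Un-normalized viewing direction of a pixel, K^{-1} [u, v, 1]^T."""
    return np.linalg.solve(K, np.array([uv[0], uv[1], 1.0]))

def translation_from_pair(K, R, p1, uv1, p2, uv2):
    """Solve for t from two point correspondences with the rotation fixed.

    Collinearity gives skew(d) @ (R @ p + t) = 0, i.e. skew(d) @ t = -skew(d) @ R @ p.
    """
    blocks, rhs = [], []
    for p, uv in ((p1, uv1), (p2, uv2)):
        S = skew(direction(K, uv))
        blocks.append(S)
        rhs.append(-S @ R @ np.asarray(p, dtype=float))
    A = np.vstack(blocks)                      # 6 x 3 system of rank 3
    b = np.concatenate(rhs)
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

Applying this routine to every admissible pair of correspondences produces the per-pair translation estimates on which the voting stage operates.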