Log In Sign Up

Robust Photogeometric Localization over Time for Map-Centric Loop Closure

by   Chanoh Park, et al.

Map-centric SLAM is emerging as an alternative of conventional graph-based SLAM for its accuracy and efficiency in long-term mapping problems. However, in map-centric SLAM, the process of loop closure differs from that of conventional SLAM and the result of incorrect loop closure is more destructive and is not reversible. In this paper, we present a tightly coupled photogeometric metric localization for the loop closure problem in map-centric SLAM. In particular, our method combines complementary constraints from LiDAR and camera sensors, and validates loop closure candidates with sequential observations. The proposed method provides a visual evidence-based outlier rejection where failures caused by either place recognition or localization outliers can be effectively removed. We demonstrate the proposed method is not only more accurate than the conventional global ICP methods but is also robust to incorrect initial pose guesses.


page 2

page 4

page 7

page 8


Elasticity Meets Continuous-Time: Map-Centric Dense 3D LiDAR SLAM

Map-centric SLAM utilizes elasticity as a means of loop closure. This ap...

TagSLAM: Robust SLAM with Fiducial Markers

TagSLAM provides a convenient, flexible, and robust way of performing Si...

Loop-box: Multi-Agent Direct SLAM Triggered by Single Loop Closure for Large-Scale Mapping

In this paper, we present a multi-agent framework for real-time large-sc...

AEROS: Adaptive RObust least-Squares for Graph-Based SLAM

In robot localisation and mapping, outliers are unavoidable when loop-cl...

Statistical Outlier Identification in Multi-robot Visual SLAM using Expectation Maximization

This paper introduces a novel and distributed method for detecting inter...

TBV Radar SLAM – trust but verify loop candidates

Robust SLAM in large-scale environments requires fault resilience and aw...

MRS-VPR: a multi-resolution sampling based global visual place recognition method

Place recognition and loop closure detection are challenging for long-te...

I Introduction

Simultaneous Localization And Mapping (SLAM) is a key enabling component of autonomous robots and driverless technologies. Although SLAM techniques have reached a mature level, with the recent advent of low cost multi-modal sensors, 3D dense LiDAR-based SLAM is still in its infancy. Map-centric approaches [1, 2], which have demonstrated their accuracy and effectiveness by fusion-based mapping and deformation-based loop closure, provide an alternative solution to the dominant trajectory-centric SLAM [3, 4]. However, in the map-centric approach, the effect of incorrect loop closure is more destructive and irreversible compared to the conventional graph-based methods. The map-centric approach fuses all the measurements on-the-fly to the map, where the fused map is difficult to be detached and corrected once the loop is closed. Therefore, the accuracy and robustness of loop closure in map-centric approaches [1, 2] become crucial.

Previous map-centric approaches either did not have a proper method to detect false positives [2] in a loop closure or they are not robust enough as well as being limited to only the RGB-D sensor model [1]. The current state-of-the-art solution in false positive loop closure detection or localization failure detection [5] requires a global pose graph optimization for each validity test which is a huge penalty when it comes to scalability issues. Furthermore, their method is also dedicated for trajectory-centric systems such as graph-based SLAM, and as a result, is not applicable to map-centric methods.

To overcome this problem, we propose a metric localization, namely PhotogeoSeq, which combines geometric and photometric constraints over time for the robust metric localization of the map-centric loop closure. The specific contributions of this work can be outlined as follows. First, with the tightly coupled geometric and photometric

constraints, our method robustly estimates 6 Degree of Freedom (DoF)

alignment even when the initial guess for registration is completely incorrect or unknown.

Second, instead of making a loop closure decision based on only a single 6 DoF alignment estimation, the proposed method observes the stability of the alignment over time and provides a reliability metric to reject or accept a loop closure hypothesis without a global trajectory optimization. For this purpose, we demonstrate that utilizing the visual features are beneficial for detecting a localization failure.

This paper is organized as follows. Section II reviews related works. Section III describes the overall procedure of the proposed method. In Section IV, we describe the proposed photogeometric metric localization method (PhotogeoSeq), and detail our visual evidence-based outlier rejection in Section V. Results demonstrating the advantages of our method over current state-of-the-art global ICP algorithms are given in Section VI. Finally we summarize and present future directions in Section VII.

Ii Related Work

SeqSLAM [6] proposed a place recognition architecture that utilizes the coherent sequential 2D visual information, where it searches whether a similar sequence of the features between the two paths are observed over time. Lynen et al. [7]

extended SeqSLAM as a 2D probability density estimation problem in votes versus travel distance space. Their work finds chunks of revisited places more efficiently than SeqSLAM by the distance normalization. However, as their method purely depends on 2D visual input, it is not suitable for a 6 DoF pose estimation problem. Furthermore, if the trajectory is only overlapped a short distance, such as at intersections, it is not likely to have much coherency which increases the risk of failure


Latif et al. [9] proposed a graph-based loop closure detection and correction system, where they utilize a residual of the graph optimization as a prior of a loop closure failure. Their method clusters topologically related loop closure hypotheses over time and rejects any hypothesis that increases the residual of graph optimization. They have proved that the localization could be reversely utilized for testing whether the input place recognition is false positive, whereas the conventional algorithms showed single directional hierarchical flow from place recognition to localization. Although, they were able to increase robustness to false positive loop closures, their method requires a graph-optimization to test each hypothesis, which is a huge drawback for a long-term mapping system.

On the other hand, the modality of the loop closure system is one of the key components that increases the performance and accuracy of a loop closure. In early works, a single sensor was utilized for the loop closure detection problem, where 3D or visual descriptors were solely used [8, 9]. However, with the single sensor approach the robustness is rather limited because of the innate degeneracies in each configuration. Thus, even the state-of-the-art works have a higher false positive rate than desired and work well only within limited scenarios.

There are few works that introduce multi-modality for the place recognition or localization problem. Early works in this category [10, 11, 12], combined 3D and 2D descriptors for a place recognition problem. While the combined complementary descriptors are capable of handling degenerate configurations such as where there are no visual patterns or where the scene is geometrically flat, rejection of incorrect place recognition or alignment error is difficult. Furthermore, the constraints from 2D recognition are not tightly integrated in the pose optimization of the metric localization step [1, 10]. Also, when it comes to a separate sensing system such as an independent LiDAR and camera system on a dynamically moving platform [3], the quality of the localization is affected by the spatio-temporal differences in observations, which was not properly considered in the previous works [10]. Even in the cases where the integrated intensity from LiDAR is utilized instead of 2D visual features, difficulties still exist because of the LiDAR intensity calibration and intensity differences according to the incident angle [13].

Iii Overview of the system

Fig. 1: Block diagram of the proposed PhotogeoSeq method.

The proposed metric localization method (PhotogeoSeq) consists of place recognition, metric localization, and pose fusion modules as depicted in Fig. 1. Given a motion-undistorted local point cloud from LiDAR and corresponding camera images by the sliding window-based local trajectory optimization [2] as shown Fig. 2, the place recognition module finds a possible revisit of a place and triggers the localization step to estimate an alignment between the current location and the reference location. Once the localization module is triggered, our method continuously estimates the alignment between the new current location and its corresponding reference location at different places until the uncertainty of the estimated alignment reaches a certain threshold.

Iv Alignment Estimation

(a) (b)
Fig. 2: (a) A hand-held 3D scanning device with a spinning LiDAR, camera and IMU. (b) Pair of matched source and destination images from a place recognition module. The point clouds are extracted at the timestamp of each image.

In this section, we elaborate on our alignment estimation method by photogeometric constraints.

Iv-a Continuous-Time Trajectory Representation

In the proposed method, the synchronization problem of the multi-modal sensors such as different frame-rates or unsynchronized clocks are required to be properly addressed. To cope with this problem, our method proposes to use the continuous-time trajectory representation [2]. In the continuous-time trajectory representation, the trajectory is modeled as a function of time. An exact system pose at an arbitrary query time

can be interpolated from a set of temporally nearby poses. Let

be composed of translational component and rotational component as,


Then, utilizing the linear continuous-time trajectory representation, its value can be evaluated by an interpolation from two poses where their timestamps satisfy . Given poses and timestamps, their interpolation at is given by,


where the relative pose is defined by and the exponential mapping linearly interpolates the relative pose on the manifold with the interpolation ratio .

represents skew-symmetric matrix conversion of a vector.

Iv-B Alignment Between Two Places

Given a continuous-time trajectory and two matching places found by a loop closure detection module as depicted in Fig. 2 (b), we define an initial source to reference transformation from the drifted raw trajectory as,


where is a continuous-time trajectory [3] that represents the LiDAR frame in the world frame at source and reference time , as shown in Fig. 3. Due to drifts in trajectory, the initial alignment is highly likely to be misaligned. Thus, we define and find its correction by,


where is the compensation to the initial alignment which can be found by visual and geometrical sensor observations. Additionally, for the joint optimization by the LiDAR and visual observations, we define a corresponding relative camera transformation as,


where is a pre-calibrated camera frame [C] to LiDAR frame extrinsics.

Fig. 3: Illustration of the proposed 6 DoF alignment estimation. Each alignment between source [] and reference [] at different places are transformed to the very first place [], [] where the loop closure was first reported. In such way, what we keep estimating is . Selection of source [] and reference [] locations can be decided from multiple sources such as 2D or 3D place recognition and relative motion.

Iv-C Single Alignment Estimation from Place Pairs

As we already mentioned, an alignment estimation based on a single location has a high probability of failure. We propose to aggregate the estimations of multiple alignments at different locations to the very first place where the loop closure was first detected. This approach will develop a more robust and reliable failure detection and outlier rejection.

Equation (4) represents the displacement between two matched local frames. As our objective is to estimate a single alignment, we transform the local displacements into the first place as,


where , are the relative transformation from the first frame and frame which are extracted from the drifted trajectory as described in Fig. 3. It is important to note that we assume that the trajectory utilized to extract , is locally rigid and accurate enough, which implies that , do not embed any uncertainty because of the local trajectory drift within small regions.

Iv-D Spatio-Temporal Uncertainty in Visual Measurements

To account for uncertainties from various sources during the optimization, we define the uncertainty model of the system. Two types of uncertainties are defined. The first uncertainty is from the temporal difference between the two sensing modalities. As we estimate the location of the camera based on the LiDAR trajectory, the temporal uncertainty affects the camera location estimation as,


where is the estimated temporal difference with mean

and variance

[14]. Even with a small temporal uncertainty, a large local motion amplifies the uncertainty of the camera pose.

Fig. 4: LiDAR trajectory and LiDAR pose at camera image time stamp

. Actual camera position can be any where around the given time stamp. The Gaussian distribution on the trajectory represents the time lag variance region. Spatial uncertainty caused by calibration error is represented as the blue circle.

The second uncertainty originates from the extrinsic calibration between the two sensors,


where is an extrinsic calibration error and is the true extrinsics. The visualization of these uncertainties are shown in Fig. 4. The uncertainties modelled here will be propagated through the cost function of visual constraints and used as weights in non-linear least squares in the next section as well as a statistic test for localization outlier rejection.

Iv-E Geometric Constraints

We apply surfel-based point-to-plane ICP with multi-resolutional voxelization[15] for the geometrical alignment between the source and the reference point cloud as,


where the pair of the matched surfel centroids , are acquired by a nearest neighbour method in the surface normal and centroid space. The original point cloud to calculate these surfels are extracted around the pose at , and represented in the local LiDAR coordinates

. The covariance is defined by a M-estimator weight and the Eigenvalue of the voxel


Iv-F Photometric Constraints

The sparse surfel constraint is computationally efficient because of the lower number of surfels after a spatial compression but this often fails when the initial guess on the pose is incorrect or the scene is geometrically degenerate such as in a corridor or wide open area. To cope with both problems, we tightly combine Photometric constraints with ICP. If there are distinctive visual patterns in the scene, the visual patterns generate complementary constraints which are not effected by the structure of the scene.

Our proposed method combines two different types of visual features: semi-direct features [16] and indirect features [17], as a complementary constraint to the geometric constraint. Both visual features are fed into the epipolar constraints between two camera poses. Thus, given a set of 2D features , the epipolar constraint is given as,


The visual constraint by (11) can estimate the pose up to scale only. However, combined with the ICP constraint, full 6 DoF can be estimated as well as preventing the optimization from being stuck at local minima because of incorrect surfel matching caused by an incorrect initial guess on alignment.

The indirect 2D features between the source image and the reference image are found by a global feature descriptor method [17] to cope with the large view difference problem. As the feature detection and matching stage are independent of the time lag and initial guess on the true alignment, even a few matches impose strong orientational constraints. This will guide the pose estimation to the roughly correct direction so that ICP can find the proper matchings of surfels. Therefore, Equation (11) largely reduces the chance of getting stuck in local minima because of surfel false matchings.

Also, to estimate more features widely spread over the image, the semi-direct features are matched by projecting the reference image feature points onto the source and refining as depicted in Fig. 5. It starts by defining the warping function as,


where , are the feature location on the reference image and its projection on the source image, K is the intrinsic matrix and is the depth map of the scene, and is the camera projection. is the reconstructed 3D positions of the 2D features. The depth map is generated from a dense surfel map by the LiDAR point cloud [2] by rendering at where follows the reference image time stamp.

Rendered depthfrom LiDAR points
Fig. 5: Illustration of the projected feature. The original feature location in the reference frame is reconstructed and projected onto the source frame . Then, the correction is found toward the direction that reduces the intensity difference of the two image patches. The depth map of the reference image is achieved by rendering the surfel map.
Fig. 6: Illustration of a matched patch refinement step. (a) Warped reference patch, (b) Source patch, (c) Shifted patch, (d) Initial error, (e) Error after refinement. Because of the time lag, calibration error and pose estimation error, a projected and warped reference patch does not exactly matches with the source patch location. The refinement step corrects the difference.

As the depth is loosely coupled to the image, the 2D geometry of the projected reference patch is not accurately aligned to the source patch in general as shown in the Fig. 6. Thus, for each image patch a refinement is found so that it reduces the intensity difference of two patches as described in Fig. 5 and Fig. 6,


where , are the source and reference images defined at respectively, is the image patch around . The warped reference image is defined as,


where is the inverse projection from the source to the reference. Thus, the additional matched points are given from the reference feature location and the corrected source location as

. Note that the semi-direct method does not directly relate the intensity to the pose estimation problem. The semi-direct method is utilized to find more matched points that are uniformly distributed over the image. This semi-direct method is inspired by SVO

[16] but we added frequency domain refinement to cope with considerably large [18].

The semi-direct and indirect features are also complementary to each other. The semi-direct method finds features only from the reference side and warps them to the source side. Although more features are found evenly over the entire image area, it is not functional until the given pose is close enough to the true pose. However, the matching of indirect features are given regardless of the current pose, therefore, they are valid over all pose estimation stages. Furthermore, the indirect features are usually detected at a far distance where the shape of features are less distorted, thus it is easy to become a rank deficient constraint. Finally, the cost function of the two visual constraints are defined as,


where is the error of feature pair and the covariances are utilized to consider spatio-temporal uncertainty caused by heterogeneous sensory system which is defined as,


where , , , are the Jacobian of the cost function with respect to the uncertainty parameters on matching corruption, and . They propagate the uncertainty to the optimization parameter space and are defined as follows,


The amount of the pixel level uncertainty is calculated from a predefined ground truth dataset by back projecting source inlier features to the destination. The spatiotemporal uncertainties , are approximated by the Hessian during the on-the-fly extrinsic, temporal calibration [14] by using a continuous-time bundle adjustment.

Iv-G Joint Optimization

Given a photogeometric cost function , we utilize the Gauss-Newton method to find the compensation and update iteratively. Since the cost function is a multi objective problem, a proper handling of scale difference is required [19]. The scaling factors are found by normalization. Also, for soft outlier rejection, we applied an M-estimator with t-distribution based weights.

V Fusing Alignment Estimations

Generally, single alignment estimation is prone to error due to various inevitable reasons. In our proposed system, to ensure robust localization, we propose to use sequential alignment estimation through different parts of the map instead of taking a single place recognition result for a loop closure. In this section we describe our probabilistic and temporal approach on fusing estimated poses. Also, an effective way to detect a localization failure or false positive will be discussed as well.

V-a Pose Fusion Model

Suppose that we have multiple alignment estimations and their covariances. As the estimated poses are the alignments between the first source and reference frame, the estimated alignment should not change over time under the assumption that they are locally rigid. Thus, we model this problem as estimating the same pose over time by (6).

The Bayesian fusion offers a closed form solution on the vector fusion problem [2]. However, as the pose vector sits on a manifold, directly applying the Bayesian fusion on the poses causes the convergence to be sub-optimal. Thus, we introduce the pose fusion approach proposed in [20] with a modification to convert the original batch fusion problem to a sequential fusion problem suitable for real-time application.

Let , respectively be pose measurement and the current pose estimation up to the point with their uncertainties. The uncertainty of the current estimation is approximated by the Hessian as . Applying the Gauss-Newton method with the two poses, we have,


where is the Baker-Cambell-Hausdorff approximation [20] which is calculated from . Refer to [20] for a detailed derivation of the above. The next current pose estimation is found by iteratively updating the best guess from the current poses as,


where the pose becomes the new at the end of the iteration for the next fusion. The covariance of the next current pose estimation is decided as follows at the end of the iteration,


Note that the accuracy of the sequential fusion is identical with the batch fusion method but sequential fusion reduces unnecessary redundant fusions. To decide whether or not to collect more evidence for the loop closure, we inspect the eigenvalues of the current pose estimation covariance and stop estimating the alignment when eigenvalues are sufficiently small such that for , where is a sufficiently conservative threshold and eigenvalue of .

V-B Outlier Rejection

The basic assumption for the pose fusion is that the uncertainty of the estimated pose is consistent and well represents the true error. A pose with a large error is fine to be fused as long as its covariance value is high. In such a case, the incorrect pose merely changes the fused pose when the system has the strong evidence that the previous pose is correct. But in general it is not true because of blindness of ICP oriented covariance with respect to translation and possibility of degeneracy in 2D scenes. Thus, pose outliers where covariance does not properly represent its estimation quality should be filtered prior to the fusion.

Under the local rigidity assumption, any newly estimated alignment should not be increasing the residual of previously constructed constraints. The outlier rejection stage utilizes this rigidity assumption to detect and properly handle any outliers before fusing them to our final alignment estimation.

For this stage, only the epipolar constraint is utilized as its residuals proportionally increases when the input pose is incorrect. Note that the surfel-based geometric feature matching does not follow this property. The matched surfels are found in normal and centroid spaces, therefore, what it implies is an abstracted spatial alignment rather than feature level correspondence. Thus, given a new pose to be tested, a statistic test using the epipolar constraint is carried out as follows,


where is residual of epipolar constraint of pose estimation with respect to the pose and is degree of freedom. The pose should be reversely calculated from the given pose estimation using the relationship between the first pose and pose as in (6). This test checks if the residuals from each previous pose estimations are within 95%  confidence area considering uncertainties caused by the time lag and extrinsic calibration error.

Vi Experiment

In this section, we demonstrate the accuracy and robustness of the proposed method by comparing it to ground truth and other 3D point cloud based global localization methods.

Vi-a Implementation Details

Initial (a) (b) (c) (d) (e) (f)
Guess Sparse ICP Visual+ICP SHOT+ICP FPFH+ICP PhotogeoSeq PhotogeoSeq
Easy 0.04(0.01) 0.05(0.01) 0.15(0.008) 0.30(0.03) 0.09(0.004) 0.05(0.003)
Medium 0.40(0.19) 0.12(0.11) 0.13(0.005) 1.64(0.30) 0.09(0.004) 0.04(0.004)
Hard 2.52(2.42) 1.49(0.71) 0.17(0.005) 13.8(0.10) 0.10(0.004) 0.04(0.004)

Time (sec) 0.17 0.63 1.05 3.25 0.75 2.04
  • The estimations are compared to the ground truth to calculate error norm of translation and rotation (). Units are meter and rotation vector norm. For each noise level, initial poses are randomly generated according to the following parameters: Easy , Medium , Hard . is the number of the matched places in the sequence. The proposed method in (e) PhotogeoSeq combines photogeometric constraints in sequential manner with the additional visual features from the semi-direct method in (f) PhotogeoSeq.

TABLE I: Localization Accuracy Comparison.
Fig. 7: An example of a photogeometric localization sequence. The loop closure was first reported at (a) and finalized at (e).

For the experiments, a hand-held 3D spinning LiDAR is utilized to collect LiDAR and color images as shown in Fig. 2 (a). The device consists of a spinning Hokuyo UTM-30LX laser, an encoder, a Microstrain 3DM-GX3 IMU, and a Grasshopper3 2.8 MP color camera[2].

For an initial seeding of the loop closure, a bag-of-word based place voting method is utilized. Once an initial pair of images are given, point clouds around the images are extracted for the ICP as shown Fig. 2 (b). To prevent the point cloud in the reference place and the current place to be mixed, a temporal selection is utilized with 5 seconds range. Also, for the efficiency in processing the point cloud, a 10 spatial selection is made. Additional matched place pairs are added by either the place recognition module or selecting the nearest image frame with a similar angle when an inlier pose exists. For the multi-resolutional voxels, voxels size with 0.3, 0.8, 1.5 are utilized.

To reduce computational complexity, sub-sampled images with size 960540 are utilized with 2121 patch sizes for the semi-direct features. Considering intensity changes caused by largely different views and time differences, the intensities of the patches are compared after a normalization. For both the global and projected features, outliers are initially filtered using a geometrical constraint with the Essential matrix. Although, Essential matrix based filtering works reasonably well, it fails to detect outliers when 2D features are degenerate such as in co-planer cases. Thus, upon detection of an inlier 6 DoF localization that minimizes both the geometrical and visual difference within the expected uncertainty, all of the past 2D matches, that are used as evidence in (21), are double checked and filtered by a test. The localization estimation is terminated early when the uncertainty is sufficiently small.

Vi-B Experimental Setup

As we mentioned earlier the initial guess on the true alignment is the dominant factor that often leads the pose optimization to failure. It is fair to assume the initial guess is unknown because of possible drift to a large extent. This is rather similar to the global point cloud registration problem. Thus, we compare our method with the state-of-the-art global ICP registration libraries with different initial guesses.

For the evaluation of the localization performance, we have collected an indoor and outdoor mixed point cloud dataset that includes multiple loops, a continuous-time trajectory and color images. The collected dataset spans 6015 with 331 in continuous-time trajectory length and 377 seconds recording time. To demonstrate the algorithm under different challenging situations, we generated 10 datasets from the map shown in Fig. 9 by making multiple traverses and picking a different place recognition starting position. The datasets include challenging scenes where the loop closure sequence starts at geometrically degenerate scenes. The ground truth of the alignment is acquired from the globally optimized trajectory. To fairly estimate the robustness against the initial guess, initial poses are randomly generated within a certain amount of uncertainty (easy, medium and hard in the Table I) and then a 6 DoF alignment is estimated 50 times each at different loop closure location.

Fig. 8: An example of pose error, uncertainty trends (upper) and visual evidence test over fusions (lower).
Fig. 9: Deformation graph where the estimated misalignment is utilized to close the loop (left). Unit is meter. Before and after comparison of the zoomed loop closure area in red box (middle: after, right: before). Upper figures represent the color map. Lower side color represents time.

Vi-C Localization

The Root Mean Square Error (RMSE) between the ground truth and the estimated alignments is listed in Table I. Sparse surfel-based point-to-plain ICP is utilized for the sparse ICP (a) which was most vulnerable to incorrect initial guess because of incorrect surfel matchings. The method that combines ICP with visual features (b) showed improved rotation estimation ability but initial guesses with a large error (medium and hard in Table I) caused occasional failures.

For the comparison with the state-of-the-art global ICP registration libraries, PCL (c) and Open3D (d) are utilized. For both methods, the initial alignments are achieved by 3D feature matchings such as using Signature of Histograms of OrienTations (SHOT) [21] (c) or Fast Point Feature Histograms (FPFH) [22] (d) and refined with dense point cloud in multi-resolution levels, which makes the alignment estimation less affected by the initial guess. These approaches are similar to method (b) except that multi-modal complementary constraints are jointly optimized in our case.

For both methods (c), (d) the results give reasonably accurate estimations for the places with many geometrical features and enough overlaps. However, in case of (d) the accuracy drops quickly and the estimation diverges especially when the localization occurs in geometrically degenerate scenes or the overlap is not sufficient which causes the failure in estimating a proper initial guess. On the other hand, the result in (c) shows a constant level of error over different initial guesses which was due to the robustness in the SHOT-based initial pose guess.

Our proposed method PhotogeoSeq (e) combines the method Visual+ICP (b) with the proposed sequential fusion and pose outlier rejection. Its robustness against the initial guess outperforms the state-of-the-art algorithms. However, further improvement was achieved in the translation estimation as given in the proposed PhotogeoSeq (f) by adding semi-direct features to (e). The improvement was due to the increased number of the widely spread visual features.

Our proposed sequential methods (e), (f) utilizes wider area for the localization than the methods (a)-(d). However, rather than utilizing the wide area at a single alignment estimation, the proposed method (e), (f) segments the map into frame level and runs independent registrations. This provides a way of conducting graph-SLAM like loop closure to the map-centric SLAM where the map is continuous and non-rigid rather than discrete like graph-based SLAM. Fig. 7 displays how the continuous map is spatio-temporally segmented into the frame level for the sequential localization.

The processing time is given at the last row of Table I in seconds. Note that as the localization occurs over different places, sufficient time will be given for processing each localization. Note that the alignment estimation process in the proposed methods (e), (f) will be executed at each locations, which is why the processing time is multiplied by the number of matched places .

Considering the actual local trajectories are drifting over time, the length of the sequential fusion should be limited. Otherwise, the transformed alignments which are found far from the initial loop closure location will be different from the alignment found on the initial location. The maximum length of depends on the accuracy of the trajectory estimation and can be found by propagating the unit uncertainty according to the travel time.

Vi-D Outlier Handling

The pose estimation uncertainty and the error of the fused pose in the sequential fusion is depicted in Fig. 9

along with the visual evidence test statistics where the color in the

axis label represents the status of the current fusion (black: need more evidence, blue: inlier pose estimation, red: outlier pose estimation).

The uncertainty and the actual error monotonically decrease whenever a new valid pose estimation is added. Occasional localization failures or false positive place recognitions (numbers in red) are filtered by the localization outlier rejection method. Once the uncertainty reaches to a threshold, the result of the localization is utilized for a map deformation as a loop closure constraint [2]. An example of the deformation graph and the comparison of the registration is depicted in Fig. 9. When the uncertainty of each translation or orientation is high, the pose estimation is not reflected to the fused pose. However, as discussed in Section V.B the measured uncertainty could be incorrect for various reasons. The adverse effect on the rotation estimation such as in the fusion sequence 2, 3 can be caused by incorrect uncertainty. Detected inlier features are added to the visual evidence pool and increased number of evidences are used for each sequence in the alignment validity test. The inlier rate is calculated by (inlier evidence)(total evidence) where any pose estimations with a large error such as the fusion sequence 6 drastically drop the inlier rate.

Vii Conclusion

In this paper, we proposed a robust metric localization method (PhotogeoSeq) that tightly couples spatio-temporal and visual information from a multi-modal sensory set to improve the robustness and accuracy of a loop closure for continuous-time map-centric SLAM. Based on our experiments, we demonstrated the proposed method is superior to the state-of-the-art global ICP methods in terms of accuracy and failure detection. This is especially the case when the loop closure is detected for scenes that lack geometric features, as the proposed metric localization method has four times more accurate translation estimation compared to state-of-the-art. Furthermore, with the proposed sequential approach where it observes if the estimated alignment is constant over different places, we were able to reject both localization or place recognition failures and achieve robust localization regardless of the initial guess. Our approach is especially beneficial with the on-the-fly loop closure scheme of the map-centric approach where faultless metric localization is required. Finally, comparison with the state-of-the-art loop closure failure detection method reveals the benefits of our proposed method as it does not require a global trajectory optimization. In our future work, handling local non-rigidity and advanced map deformation that counts multiple alignments will be covered.


The authors gratefully acknowledge funding of the project by the CSIRO and QUT. Special thanks goes to Renaud Dube, Pavel Vechersky for the implementation.


  • [1] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison, “ElasticFusion: Dense SLAM Without A Pose Graph,” in Robotics: Science and Systems (RSS), vol. 11, 2015, pp. 1–9.
  • [2] C. Park, P. Moghadam, S. Kim, A. Elfes, C. Fookes, and S. Sridharan, “Elastic LiDAR Fusion: Dense Map-Centric Continuous-Time SLAM,” in International Conference on Robotics and Automation (ICRA), 2018, pp. 1206–1213.
  • [3] M. Bosse, R. Zlot, and P. Flick, “Zebedee: Design of a Spring-Mounted 3-D Range Sensor with Application to Mobile Mapping,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1104–1119, 2012.
  • [4] S. Vidas, P. Moghadam, and S. Sridharan, “Real-time mobile 3D temperature mapping,” IEEE Sensors Journal, vol. 15, no. 2, pp. 1145–1152, 2015.
  • [5] K. L. Ho and P. Newman, “Detecting Loop Closure with Scene Sequences,”

    International Journal of Computer Vision

    , vol. 74, no. 3, pp. 261–286, Sept. 2007.
  • [6] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in International Conference on Robotics and Automation (ICRA), 2012, pp. 1643–1649.
  • [7] S. Lynen, M. Bosse, and R. Siegwart, “Trajectory-based place-recognition for efficient large scale localization,” International Journal of Computer Vision, vol. 124, no. 1, pp. 49–64, Aug 2017.
  • [8] M. Cummins and P. Newman, “FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, June 2008.
  • [9] Y. Latif, C. Cadena, and J. Neira, “Robust loop closing over time for pose graph SLAM,” The International Journal of Robotics Research, vol. 32, no. 14, pp. 1611–1626, Dec. 2013.
  • [10] K. L. Ho and P. Newman, “Loop closure detection in SLAM by combining visual and spatial appearance,” Robotics and Autonomous Systems, vol. 54, no. 9, pp. 740–749, Sept. 2006.
  • [11] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi, “Real-Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 5, pp. 571–583, May 2015.
  • [12] A. Gawel, T. Cieslewski, R. Dubé, M. Bosse, R. Siegwart, and J. Nieto, “Structure-based vision-laser matching,” in International Conference on Intelligent Robots and Systems (IROS), Oct. 2016, pp. 182–188.
  • [13] J. Guo, P. V. K. Borges, C. Park, and A. Gawel, “Local Descriptor for Robust Place Recognition using LiDAR Intensity,” IEEE Robotics and Automation Letters, To appear, 2019.
  • [14] C. Park, S. Kim, P. Moghadam, S. Sridharan, and C. Fookes, “Spatiotemporal Camera-LiDAR Calibration: A Markerless and Structureless Approach,” in Submitted to International Conference on Robotics and Automation (ICRA), 2019.
  • [15] C. Park, S. Kim, P. Moghadam, C. Fookes, and S. Sridharan, “Probabilistic Surfel Fusion for Dense LiDAR Mapping,” International Conference on Computer Vision Workshops (ICCVW), pp. 2418–2426, 2017.
  • [16] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in International Conference on Robotics and Automation (ICRA), May 2014, pp. 15–22.
  • [17] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
  • [18] M. Guizar-Sicairos, S. T. Thurman, and J. R. Fienup, “Efficient subpixel image registration algorithms,” Optics Letters, vol. 33, no. 2, pp. 156–158, Jan. 2008.
  • [19]

    F. I. I. Muñoz and A. I. Comport, “Point-to-hyperplane RGB-D pose estimation: Fusing photometric and geometric measurements,” in

    IEEE International Conference on Intelligent Robots and Systems (IROS), Oct. 2016.
  • [20] T. D. Barfoot and P. T. Furgale, “Associating Uncertainty With Three-Dimensional Poses for Use in Estimation Problems,” IEEE Transactions on Robotics, vol. 30, no. 3, pp. 679–693, June 2014.
  • [21] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in European Conference on Computer Vision (ECCV), 2010, pp. 356–369.
  • [22] R. B. Rusu, N. Blodow, and M. Beetz, “Fast Point Feature Histograms (FPFH) for 3D registration,” in International Conference on Robotics and Automation (ICRA), May 2009, pp. 3212–3217.