I Introduction
Simultaneous Localization And Mapping (SLAM) is a key enabling component of autonomous robots and driverless technologies. Although SLAM techniques have reached a mature level, 3D dense LiDAR-based SLAM, enabled by the recent advent of low-cost multimodal sensors, is still in its infancy. Map-centric approaches [1, 2], which have demonstrated their accuracy and effectiveness through fusion-based mapping and deformation-based loop closure, provide an alternative to the dominant trajectory-centric SLAM [3, 4]. However, in the map-centric approach, the effect of an incorrect loop closure is more destructive and irreversible than in conventional graph-based methods: all measurements are fused on-the-fly into the map, and the fused map is difficult to detach and correct once a loop is closed. Therefore, the accuracy and robustness of loop closure in map-centric approaches [1, 2] are crucial.
Previous map-centric approaches either lacked a proper method to detect false positives in a loop closure [2], or were not robust enough and limited to the RGB-D sensor model [1]. The current state-of-the-art solution for false-positive loop closure detection or localization failure detection [5] requires a global pose graph optimization for each validity test, which is a significant penalty in terms of scalability. Furthermore, their method is dedicated to trajectory-centric systems such as graph-based SLAM and, as a result, is not applicable to map-centric methods.
To overcome this problem, we propose a metric localization method, namely PhotogeoSeq, which combines geometric and photometric constraints over time for the robust metric localization of the map-centric loop closure. The specific contributions of this work can be outlined as follows. First, with the tightly coupled geometric and photometric constraints, our method robustly estimates the 6 Degree of Freedom (DoF) alignment even when the initial guess for registration is completely incorrect or unknown. Second, instead of making a loop closure decision based on only a single 6 DoF alignment estimation, the proposed method observes the stability of the alignment over time and provides a reliability metric to reject or accept a loop closure hypothesis without a global trajectory optimization. For this purpose, we demonstrate that utilizing visual features is beneficial for detecting a localization failure.
This paper is organized as follows. Section II reviews related work. Section III describes the overall procedure of the proposed method. In Section IV, we describe the proposed photo-geometric metric localization method (PhotogeoSeq), and we detail our visual evidence-based outlier rejection in Section V. Results demonstrating the advantages of our method over current state-of-the-art global ICP algorithms are given in Section VI. Finally, we summarize and present future directions in Section VII.
II Related Work
SeqSLAM [6] proposed a place recognition architecture that utilizes coherent sequential 2D visual information, searching for whether a similar sequence of features is observed over time between two paths. Lynen et al. [7] extended SeqSLAM as a 2D probability density estimation problem in the votes versus travel distance space. Their work finds chunks of revisited places more efficiently than SeqSLAM thanks to the distance normalization. However, as their method depends purely on 2D visual input, it is not suitable for a 6 DoF pose estimation problem. Furthermore, if the trajectories overlap for only a short distance, such as at intersections, there is not likely to be much coherency, which increases the risk of failure [8].

Latif et al. [9] proposed a graph-based loop closure detection and correction system, which utilizes the residual of the graph optimization as a prior for a loop closure failure. Their method clusters topologically related loop closure hypotheses over time and rejects any hypothesis that increases the residual of the graph optimization. They showed that localization can be used in reverse to test whether an input place recognition is a false positive, whereas conventional algorithms follow a one-directional hierarchical flow from place recognition to localization. Although they were able to increase robustness to false-positive loop closures, their method requires a graph optimization to test each hypothesis, which is a significant drawback for a long-term mapping system.
On the other hand, the modality of the loop closure system is one of the key factors that determines the performance and accuracy of a loop closure. In early works, a single sensor was utilized for the loop closure detection problem, with 3D or visual descriptors used on their own [8, 9]. However, with a single-sensor approach the robustness is rather limited because of the innate degeneracies of each configuration. Thus, even state-of-the-art works have a higher false-positive rate than desired and work well only in limited scenarios.
A few works introduce multimodality to the place recognition or localization problem. Early works in this category [10, 11, 12] combined 3D and 2D descriptors for place recognition. While the combined complementary descriptors are capable of handling degenerate configurations, such as scenes with no visual patterns or geometrically flat scenes, rejecting incorrect place recognitions or alignment errors remains difficult. Furthermore, the constraints from 2D recognition are not tightly integrated into the pose optimization of the metric localization step [1, 10]. Also, for a separate sensing system, such as an independent LiDAR and camera on a dynamically moving platform [3], the quality of the localization is affected by the spatio-temporal differences in observations, which was not properly considered in previous works [10]. Even when the integrated intensity from LiDAR is utilized instead of 2D visual features, difficulties remain because of LiDAR intensity calibration and intensity differences depending on the incident angle [13].
III Overview of the System
The proposed metric localization method (PhotogeoSeq) consists of place recognition, metric localization, and pose fusion modules, as depicted in Fig. 1. Given a motion-undistorted local point cloud from LiDAR and corresponding camera images produced by the sliding window-based local trajectory optimization [2], as shown in Fig. 2, the place recognition module finds a possible revisit of a place and triggers the localization step to estimate an alignment between the current location and the reference location. Once the localization module is triggered, our method continuously estimates the alignment between the new current location and its corresponding reference location at different places until the uncertainty of the estimated alignment reaches a certain threshold.
IV Alignment Estimation
In this section, we elaborate on our alignment estimation method based on photo-geometric constraints.
IV-A Continuous-Time Trajectory Representation
In the proposed method, the synchronization problems of the multimodal sensors, such as different frame rates or unsynchronized clocks, must be properly addressed. To cope with this, our method uses the continuous-time trajectory representation [2], in which the trajectory is modeled as a function of time. The exact system pose T(t) at an arbitrary query time t can be interpolated from a set of temporally nearby poses. Let T(t) be composed of a translational component p(t) and a rotational component R(t) as,

T(t) = (R(t), p(t)).   (1)

Then, utilizing the linear continuous-time trajectory representation, its value can be evaluated by an interpolation from two poses whose timestamps satisfy t_1 <= t <= t_2. Given poses T_1, T_2 and timestamps t_1, t_2, their interpolation at t is given by,

T(t) = exp(alpha * log(T_21)) T_1,   alpha = (t - t_1) / (t_2 - t_1),   (2)

where the relative pose is defined by T_21 = T_2 T_1^-1 and the exponential mapping linearly interpolates the relative pose on the manifold with the interpolation ratio alpha.
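As an illustration, the on-manifold interpolation of (2) can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions (rotations as matrices, translation interpolated separately from rotation); the function names are ours, not from the paper's implementation:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0., -k[2], k[1]], [k[2], 0., -k[0]], [-k[1], k[0], 0.]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> rotation vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-9:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2 * np.sin(theta)) * w

def interpolate_pose(R1, t1, s1, R2, t2, s2, s):
    """Interpolate a pose at query time s1 <= s <= s2, as in Eq. (2):
    the relative rotation is scaled on the manifold by the ratio alpha."""
    alpha = (s - s1) / (s2 - s1)
    R_rel = R2 @ R1.T                       # relative rotation T_2 T_1^-1
    R = so3_exp(alpha * so3_log(R_rel)) @ R1
    t = (1 - alpha) * t1 + alpha * t2       # translation interpolated linearly
    return R, t
```

For example, interpolating halfway between the identity and a 90-degree yaw yields a 45-degree yaw, as expected from the linear scaling on the manifold.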
IV-B Alignment Between Two Places
Given a continuous-time trajectory and two matching places found by the loop closure detection module, as depicted in Fig. 2 (b), we define an initial source-to-reference transformation from the drifted raw trajectory as,

T_init = T_WL(t_r)^-1 T_WL(t_s),   (3)

where T_WL(t) is a continuous-time trajectory [3] that represents the LiDAR frame in the world frame, evaluated at the source and reference times t_s and t_r, as shown in Fig. 3. Due to drift in the trajectory, the initial alignment is highly likely to be misaligned. Thus, we define the true alignment T and find its correction by,

T = dT * T_init,   (4)

where dT is the compensation to the initial alignment, which can be found from the visual and geometric sensor observations. Additionally, for the joint optimization over the LiDAR and visual observations, we define a corresponding relative camera transformation as,

T_C = T_CL T T_CL^-1,   (5)

where T_CL is the pre-calibrated camera frame [C] to LiDAR frame extrinsics.
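To make the compositions in (3)-(5) concrete, here is a small sketch using 4x4 homogeneous transforms. The composition order and the function names (`initial_alignment`, etc.) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def inv_T(T):
    """Invert a 4x4 rigid-body transform in closed form."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def initial_alignment(T_WL_ref, T_WL_src):
    """Eq. (3): initial source-to-reference alignment read off the drifted trajectory."""
    return inv_T(T_WL_ref) @ T_WL_src

def corrected_alignment(dT, T_init):
    """Eq. (4): apply the estimated compensation dT to the initial alignment."""
    return dT @ T_init

def camera_alignment(T_CL, T_lidar):
    """Eq. (5): express a LiDAR-frame alignment in the camera frame via the extrinsics."""
    return T_CL @ T_lidar @ inv_T(T_CL)
```

A quick sanity check: if source and reference poses coincide, the initial alignment is the identity, and identity extrinsics leave the alignment unchanged.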
IV-C Single Alignment Estimation from Place Pairs
As mentioned above, an alignment estimation based on a single location has a high probability of failure. We therefore propose to aggregate alignment estimations from multiple locations, transformed to the very first place where the loop closure was detected. This approach enables more robust and reliable failure detection and outlier rejection.
Equation (4) represents the displacement between two matched local frames. As our objective is to estimate a single alignment, we transform the local displacements into the first place as,

T_1 = (T_1k^r)^-1 T_k T_1k^s,   (6)

where T_1k^s, T_1k^r are the relative transformations from the first frame to the k-th frame, extracted from the drifted trajectory as described in Fig. 3. It is important to note that we assume the trajectory used to extract T_1k^s, T_1k^r is locally rigid and accurate enough, which implies that T_1k^s, T_1k^r do not embed any uncertainty from local trajectory drift within small regions.
IV-D Spatio-Temporal Uncertainty in Visual Measurements
To account for uncertainties from various sources during the optimization, we define the uncertainty model of the system. Two types of uncertainties are defined. The first arises from the temporal difference between the two sensing modalities. As we estimate the location of the camera based on the LiDAR trajectory, the temporal uncertainty affects the camera location estimation as,

T_C(t + dt),   dt ~ N(mu_t, sigma_t^2),   (7)

where dt is the estimated temporal difference with mean mu_t and variance sigma_t^2 [14]. Even with a small temporal uncertainty, a large local motion amplifies the uncertainty of the camera pose. The second uncertainty originates from the extrinsic calibration between the two sensors,

T_CL = dT_CL * T_CL_true,   (8)

where dT_CL is the extrinsic calibration error and T_CL_true is the true extrinsics. The visualization of these uncertainties is shown in Fig. 4. The uncertainties modeled here are propagated through the cost function of the visual constraints and used as weights in the nonlinear least squares of the next section, as well as in a statistical test for localization outlier rejection.
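A first-order (Jacobian-based) propagation of these uncertainty sources into a pose covariance, of the kind later used as weights, could be sketched as follows. The constant-velocity model for the temporal term is our simplifying assumption:

```python
import numpy as np

def propagate_covariance(jacobians, covariances):
    """First-order uncertainty propagation: Sigma_out = sum_i J_i Sigma_i J_i^T."""
    dim = jacobians[0].shape[0]
    out = np.zeros((dim, dim))
    for J, S in zip(jacobians, covariances):
        out += J @ S @ J.T
    return out

# Temporal term: with a locally constant velocity v, a time offset dt shifts the
# camera position by v * dt, so the Jacobian of position w.r.t. dt is v itself.
v = np.array([[2.0], [0.0], [0.0]])        # local velocity (m/s), column Jacobian
sigma_t2 = np.array([[1e-4]])              # variance of the temporal offset (s^2)
cov_position = propagate_covariance([v], [sigma_t2])
```

This also illustrates the remark above: doubling the local motion quadruples the position covariance contributed by the same temporal uncertainty.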
IV-E Geometric Constraints
We apply surfel-based point-to-plane ICP with multi-resolution voxelization [15] for the geometric alignment between the source and the reference point clouds as,

E_geo = sum_i r_i^2 / sigma_i^2,   (9)

r_i = n_i . (T p_i - q_i),   (10)

where the pairs of matched surfel centroids (p_i, q_i) with reference surface normals n_i are acquired by a nearest neighbour search in the surface normal and centroid space. The original point clouds used to calculate these surfels are extracted around the poses at t_s and t_r, and represented in the local LiDAR coordinates. The covariance sigma_i^2 is defined by an M-estimator weight and the eigenvalues of the voxel [2].

IV-F Photometric Constraints
The sparse surfel constraint is computationally efficient because of the lower number of surfels after spatial compression, but it often fails when the initial guess on the pose is incorrect or the scene is geometrically degenerate, such as in a corridor or a wide open area. To cope with both problems, we tightly combine photometric constraints with ICP. If there are distinctive visual patterns in the scene, they generate complementary constraints that are not affected by the structure of the scene.
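For context, the surfel-based point-to-plane residual of (9)-(10) that these photometric terms complement can be sketched as follows (a minimal sketch assuming the matched centroids and normals are already given):

```python
import numpy as np

def point_to_plane_residuals(src_pts, ref_pts, ref_normals, R, t):
    """Point-to-plane residuals r_i = n_i . (R p_i + t - q_i) over matched
    surfel centroids p_i (source), q_i (reference) with reference normals n_i."""
    transformed = src_pts @ R.T + t
    return np.einsum('ij,ij->i', ref_normals, transformed - ref_pts)
```

Note that translations parallel to all surface normals go unpenalized, which is exactly the degeneracy (e.g. a corridor) that the photometric terms are meant to cover.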
Our proposed method combines two different types of visual features, semi-direct features [16] and indirect features [17], as a complementary constraint to the geometric constraint. Both visual features are fed into the epipolar constraint between the two camera poses. Thus, given a set of matched 2D features (x_s, x_r), the epipolar constraint is given as,

x_r^T [t_C]_x R_C x_s = 0,   (11)

where R_C, t_C are the rotational and translational components of the relative camera transformation T_C and [.]_x denotes the skew-symmetric matrix.
The visual constraint in (11) can estimate the pose only up to scale. However, combined with the ICP constraint, the full 6 DoF can be estimated, while also preventing the optimization from getting stuck in local minima caused by incorrect surfel matching due to an incorrect initial guess on the alignment.

The indirect 2D features between the source image and the reference image are found by a global feature descriptor method [17] to cope with the large view difference problem. As the feature detection and matching stages are independent of the time lag and of the initial guess on the true alignment, even a few matches impose strong orientational constraints. This guides the pose estimation in roughly the correct direction so that ICP can find proper surfel matchings. Therefore, Equation (11) largely reduces the chance of getting stuck in local minima caused by false surfel matchings.
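A minimal sketch of evaluating the epipolar residual of (11) for normalized image coordinates; the essential-matrix convention E = [t]_x R is assumed:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x u = v x u."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

def epipolar_residuals(x_src, x_ref, R, t):
    """Residuals x_ref^T E x_src with E = [t]_x R, for Nx3 arrays of homogeneous
    normalized image coordinates. Zero for a geometrically consistent relative pose."""
    E = skew(t) @ R
    return np.einsum('ij,jk,ik->i', x_ref, E, x_src)
```

Because the residual is homogeneous in t, the constraint fixes the translation only up to scale, which is why it is paired with ICP for the full 6 DoF.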
Also, to obtain more features widely spread over the image, the semi-direct features are matched by projecting the reference image feature points onto the source image and refining them, as depicted in Fig. 5. It starts by defining the warping function as,

u_s = w(u_r) = pi(K (R P + t)),   P = D(u_r) K^-1 u_r,   (12)

where u_r, u_s are the feature location on the reference image and its projection on the source image, K is the intrinsic matrix, D is the depth map of the scene, and pi is the camera projection. P is the reconstructed 3D position of the 2D feature. The depth map is generated from a dense surfel map built from the LiDAR point cloud [2] by rendering at the reference image timestamp.
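The warping of (12) — back-project a reference pixel with its rendered depth, transform, and re-project into the source image — might look like the following (pinhole model assumed; the names are illustrative):

```python
import numpy as np

def warp_feature(u_ref, depth, K, R, t):
    """Warp a reference-image pixel into the source image using its depth:
    back-projection with K^-1, rigid transform (R, t) from reference to source,
    then perspective projection with K."""
    ray = np.linalg.inv(K) @ np.array([u_ref[0], u_ref[1], 1.0])
    P = depth * ray                 # reconstructed 3D point in the reference camera
    p_src = R @ P + t               # expressed in the source camera
    uv = K @ p_src
    return uv[:2] / uv[2]           # perspective division
```

With an identity relative pose the pixel maps to itself; errors in the rendered depth or in the pose shift the projection, which is what the patch refinement below corrects.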
As the depth is loosely coupled to the image, the 2D geometry of the projected reference patch is in general not accurately aligned with the source patch, as shown in Fig. 6. Thus, for each image patch a refinement du is found that reduces the intensity difference between the two patches, as described in Fig. 5 and Fig. 6,

du* = argmin_du sum_{u in P(u_s)} || I_s(u + du) - I_r_warped(u) ||^2,   (13)

where I_s, I_r are the source and reference images defined at t_s, t_r respectively, and P(u_s) is the image patch around u_s. The warped reference image is defined as,

I_r_warped(u) = I_r(w^-1(u)),   (14)

where w^-1 is the inverse projection from the source to the reference. Thus, the additional matched points are given by the reference feature location u_r and the corrected source location u_s + du*. Note that the semi-direct method does not directly relate intensity to the pose estimation problem; it is utilized to find more matched points that are uniformly distributed over the image. This semi-direct method is inspired by SVO [16], but we added a frequency-domain refinement to cope with considerably large patch displacements [18].

The semi-direct and indirect features are also complementary to each other. The semi-direct method finds features only from the reference side and warps them to the source side. Although more features are found evenly over the entire image area, it is not functional until the given pose is close enough to the true pose. The matching of indirect features, however, is given regardless of the current pose, and is therefore valid over all pose estimation stages. Furthermore, the indirect features are usually detected at a far distance, where the shapes of features are less distorted, and thus they can easily form a rank-deficient constraint. Finally, the cost function of the two visual constraints is defined as,
E_pho = sum_i e_i^T Sigma_i^-1 e_i,   (15)

where e_i is the error of feature pair i and the covariances Sigma_i are utilized to account for the spatio-temporal uncertainty caused by the heterogeneous sensory system, defined as,

Sigma_i = J_p Sigma_p J_p^T + J_t sigma_t^2 J_t^T + J_e Sigma_e J_e^T,   (16)

where J_p, J_t, J_e are the Jacobians of the cost function with respect to the uncertainty parameters of matching corruption, the temporal offset dt, and the extrinsic error dT_CL. They propagate the uncertainty into the optimization parameter space and are obtained by differentiating the residual with respect to each uncertainty parameter (17).

The amount of pixel-level uncertainty Sigma_p is calculated from a predefined ground-truth dataset by back-projecting source inlier features to the destination. The spatio-temporal uncertainties sigma_t^2, Sigma_e are approximated by the Hessian during the on-the-fly extrinsic and temporal calibration [14], using a continuous-time bundle adjustment.
IV-G Joint Optimization
Given the photo-geometric cost function E = E_geo + E_pho, we utilize the Gauss-Newton method to find the compensation dT and update it iteratively. Since the cost function defines a multi-objective problem, proper handling of the scale difference between the objectives is required [19]. The scaling factors are found by normalization. Also, for soft outlier rejection, we apply an M-estimator with t-distribution-based weights.
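One Gauss-Newton step for such a weighted multi-objective cost can be sketched as follows. The per-objective scalar weights stand in for the normalization-based scaling factors; the M-estimator weights are omitted for brevity:

```python
import numpy as np

def gauss_newton_step(blocks):
    """One Gauss-Newton step for a stacked multi-objective least-squares cost.
    blocks: list of (J, r, w) with Jacobian J, residual vector r, scalar weight w.
    Solves (sum_k w_k J_k^T J_k) dx = -(sum_k w_k J_k^T r_k)."""
    H = sum(w * J.T @ J for J, r, w in blocks)
    b = sum(w * J.T @ r for J, r, w in blocks)
    return -np.linalg.solve(H, b)
```

For a purely linear residual the step recovers the least-squares solution in one iteration; the nonlinear photo-geometric cost instead iterates this update on dT.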
V Fusing Alignment Estimations
Generally, a single alignment estimation is prone to error for various unavoidable reasons. To ensure robust localization, our proposed system uses sequential alignment estimations through different parts of the map instead of taking a single place recognition result for a loop closure. In this section we describe our probabilistic, temporal approach to fusing the estimated poses, and discuss an effective way to detect a localization failure or a false positive.
V-A Pose Fusion Model
Suppose that we have multiple alignment estimations and their covariances. As the estimated poses are alignments between the first source and reference frames by (6), the estimated alignment should not change over time under the assumption that they are locally rigid. Thus, we model this problem as estimating the same pose over time.

Bayesian fusion offers a closed-form solution to the vector fusion problem [2]. However, as poses sit on a manifold, directly applying Bayesian fusion to the poses makes the convergence suboptimal. Thus, we adopt the pose fusion approach proposed in [20], with a modification that converts the original batch fusion problem into a sequential fusion problem suitable for real-time application.

Let T_z, T_k respectively be the pose measurement and the current pose estimation up to the k-th place, with their uncertainties Sigma_z, Sigma_k. The uncertainty of the current estimation is approximated by the Hessian as Sigma_k = H^-1. Applying the Gauss-Newton method with the two poses, we have,

(18)

where the involved Jacobian is the Baker-Campbell-Hausdorff approximation [20], calculated from the pose residuals. Refer to [20] for a detailed derivation of the above. The next current pose estimation is found by iteratively updating the best guess from the current poses as,

(19)

where the updated pose becomes the new T_k at the end of the iteration for the next fusion. The covariance of the next current pose estimation is determined at the end of the iteration as,

(20)

Note that the accuracy of the sequential fusion is identical to that of the batch fusion method, but sequential fusion avoids unnecessary redundant fusions. To decide whether or not to collect more evidence for the loop closure, we inspect the eigenvalues of the current pose estimation covariance and stop estimating the alignment when the eigenvalues are sufficiently small, i.e., lambda_i < tau for all i, where tau is a sufficiently conservative threshold and lambda_i is an eigenvalue of Sigma_k.
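Ignoring the manifold structure for brevity, the sequential update and the eigenvalue stopping rule can be sketched in vector space as follows; the paper performs the equivalent iteration on SE(3) with the BCH approximation:

```python
import numpy as np

def fuse_sequential(x_est, P_est, z, P_z):
    """Information-form fusion of the running estimate with one new measurement:
    the information matrices add, and the means are information-weighted."""
    info = np.linalg.inv(P_est) + np.linalg.inv(P_z)
    P_new = np.linalg.inv(info)
    x_new = P_new @ (np.linalg.inv(P_est) @ x_est + np.linalg.inv(P_z) @ z)
    return x_new, P_new

def enough_evidence(P, tau):
    """Stop collecting alignments once every eigenvalue of the covariance is below tau."""
    return bool(np.all(np.linalg.eigvalsh(P) < tau))
```

Each valid fusion shrinks the covariance, so the eigenvalue test eventually fires and the loop closure decision can be made.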
V-B Outlier Rejection
The basic assumption behind the pose fusion is that the uncertainty of an estimated pose is consistent and well represents the true error. A pose with a large error can safely be fused as long as its covariance is correspondingly large: in that case the incorrect pose barely changes the fused pose when the system has strong evidence that the previous estimate is correct. In general, however, this assumption does not hold, because ICP-derived covariances can be blind with respect to translation and 2D scenes can be degenerate. Thus, pose outliers whose covariance does not properly represent their estimation quality should be filtered prior to the fusion.

Under the local rigidity assumption, any newly estimated alignment should not increase the residuals of previously constructed constraints. The outlier rejection stage utilizes this rigidity assumption to detect and properly handle any outliers before fusing them into our final alignment estimation.
For this stage, only the epipolar constraint is utilized, as its residuals increase proportionally when the input pose is incorrect. Note that the surfel-based geometric feature matching does not have this property: the matched surfels are found in the normal and centroid spaces, so they imply an abstracted spatial alignment rather than a feature-level correspondence. Thus, given a new pose to be tested, a statistical test using the epipolar constraint is carried out as follows,

e_k^T Sigma_k^-1 e_k < chi2_0.95(n),   (21)

where e_k is the residual of the epipolar constraints of the k-th pose estimation with respect to the tested pose and n is the degrees of freedom. The pose should be calculated back from the given pose estimation using the relationship between the first pose and the k-th pose as in (6). This test checks whether the residuals from each previous pose estimation are within the 95% confidence region, considering the uncertainties caused by the time lag and the extrinsic calibration error.
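The acceptance test of (21) then reduces to comparing a normalized squared residual against a chi-square quantile. In this sketch the 95% threshold for the given degrees of freedom is passed in precomputed (e.g. 11.07 for 5 degrees of freedom), and a shared scalar variance stands in for the full covariance:

```python
import numpy as np

def passes_epipolar_test(residuals, sigma2, chi2_threshold):
    """Accept the pose hypothesis if the normalized squared epipolar residuals
    fall inside the 95% chi-square region for their degrees of freedom."""
    stat = float(np.sum(np.asarray(residuals) ** 2 / sigma2))
    return stat < chi2_threshold
```

Poses whose residuals exceed the threshold are treated as outliers and excluded from the fusion of Section V-A.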
VI Experiment
In this section, we demonstrate the accuracy and robustness of the proposed method by comparing it to ground truth and other 3D point cloud based global localization methods.
VI-A Implementation Details
TABLE I

Initial Guess | (a) Sparse ICP | (b) Visual+ICP | (c) SHOT+ICP | (d) FPFH+ICP | (e) PhotogeoSeq | (f) PhotogeoSeq
Easy          | 0.04 (0.01)    | 0.05 (0.01)    | 0.15 (0.008) | 0.30 (0.03)  | 0.09 (0.004)    | 0.05 (0.003)
Medium        | 0.40 (0.19)    | 0.12 (0.11)    | 0.13 (0.005) | 1.64 (0.30)  | 0.09 (0.004)    | 0.04 (0.004)
Hard          | 2.52 (2.42)    | 1.49 (0.71)    | 0.17 (0.005) | 13.8 (0.10)  | 0.10 (0.004)    | 0.04 (0.004)
Time (sec)    | 0.17           | 0.63           | 1.05         | 3.25         | 0.75            | 2.04

The estimations are compared to the ground truth to calculate the error norms of translation and rotation (in parentheses). Units are meters and rotation-vector norm. For each noise level, initial poses are randomly generated according to the corresponding noise parameters (Easy, Medium, Hard). N is the number of matched places in the sequence. The proposed method (e) PhotogeoSeq combines photo-geometric constraints in a sequential manner; (f) PhotogeoSeq adds the additional visual features from the semi-direct method.
For the experiments, a handheld 3D spinning LiDAR was utilized to collect LiDAR data and color images, as shown in Fig. 2 (a). The device consists of a spinning Hokuyo UTM-30LX laser, an encoder, a Microstrain 3DM-GX3 IMU, and a Grasshopper3 2.8 MP color camera [2].

For the initial seeding of the loop closure, a bag-of-words based place voting method is utilized. Once an initial pair of images is given, the point clouds around the images are extracted for the ICP, as shown in Fig. 2 (b). To prevent the point clouds of the reference place and the current place from being mixed, a temporal selection with a 5-second range is utilized. Also, for efficiency in processing the point cloud, a 10 m spatial selection is made. Additional matched place pairs are added either by the place recognition module or by selecting the nearest image frame with a similar angle when an inlier pose exists. For the multi-resolution voxels, voxel sizes of 0.3, 0.8, and 1.5 m are utilized.
To reduce computational complexity, subsampled images of size 960x540 are utilized, with 21x21 patch sizes for the semi-direct features. Considering intensity changes caused by largely different views and time differences, the intensities of the patches are compared after a normalization. For both the global and projected features, outliers are initially filtered using a geometric constraint with the essential matrix. Although essential-matrix-based filtering works reasonably well, it fails to detect outliers when the 2D features are degenerate, such as in coplanar cases. Thus, upon detection of an inlier 6 DoF localization that minimizes both the geometric and visual differences within the expected uncertainty, all of the past 2D matches that are used as evidence in (21) are double-checked and filtered by a chi-square test. The localization estimation is terminated early when the uncertainty is sufficiently small.
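The normalization mentioned above can be as simple as comparing zero-mean, unit-norm patches, which cancels gain and offset changes in intensity between views (a sketch; the paper does not specify the exact scheme):

```python
import numpy as np

def normalized_patch_distance(p, q):
    """Distance between two image patches after zero-mean, unit-norm normalization.
    Invariant to affine (gain/offset) intensity changes."""
    pn = p - p.mean()
    pn = pn / (np.linalg.norm(pn) + 1e-12)
    qn = q - q.mean()
    qn = qn / (np.linalg.norm(qn) + 1e-12)
    return float(np.linalg.norm(pn - qn))
```

Two patches of the same surface seen under different exposure then compare as nearly identical, while unrelated patches remain far apart.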
VI-B Experimental Setup
As mentioned earlier, the initial guess on the true alignment is the dominant factor that often leads the pose optimization to failure. It is fair to assume the initial guess is unknown, because drift can be large; this makes the problem similar to global point cloud registration. Thus, we compare our method with state-of-the-art global ICP registration libraries under different initial guesses.
For the evaluation of the localization performance, we collected a mixed indoor and outdoor point cloud dataset that includes multiple loops, a continuous-time trajectory, and color images. The collected dataset spans an area of 6015 with a continuous-time trajectory length of 331 and a recording time of 377 seconds. To test the algorithm under different challenging situations, we generated 10 datasets from the map shown in Fig. 9 by making multiple traverses and picking a different place recognition starting position. The datasets include challenging cases where the loop closure sequence starts at geometrically degenerate scenes. The ground truth of the alignment is acquired from the globally optimized trajectory. To fairly evaluate the robustness against the initial guess, initial poses are randomly generated within a certain amount of uncertainty (Easy, Medium, and Hard in Table I) and a 6 DoF alignment is then estimated 50 times each at different loop closure locations.
VI-C Localization
The Root Mean Square Error (RMSE) between the ground truth and the estimated alignments is listed in Table I. Sparse surfel-based point-to-plane ICP is utilized for the sparse ICP (a), which was the most vulnerable to an incorrect initial guess because of incorrect surfel matchings. The method that combines ICP with visual features (b) showed an improved rotation estimation ability, but initial guesses with a large error (Medium and Hard in Table I) caused occasional failures.
For the comparison with state-of-the-art global ICP registration libraries, PCL (c) and Open3D (d) are utilized. For both methods, the initial alignments are obtained by 3D feature matching, using the Signature of Histograms of OrienTations (SHOT) [21] in (c) or Fast Point Feature Histograms (FPFH) [22] in (d), and refined with the dense point cloud at multiple resolution levels, which makes the alignment estimation less affected by the initial guess. These approaches are similar to method (b), except that in our case the multimodal complementary constraints are jointly optimized.
Both methods (c), (d) give reasonably accurate estimations for places with many geometric features and enough overlap. However, in the case of (d), the accuracy drops quickly and the estimation diverges, especially when the localization occurs in geometrically degenerate scenes or the overlap is insufficient, which causes a failure in estimating a proper initial guess. On the other hand, the result in (c) shows a constant level of error over different initial guesses, owing to the robustness of the SHOT-based initial pose guess.
Our proposed method PhotogeoSeq (e) combines the Visual+ICP method (b) with the proposed sequential fusion and pose outlier rejection. Its robustness against the initial guess outperforms the state-of-the-art algorithms. A further improvement in the translation estimation was achieved in PhotogeoSeq (f) by adding the semi-direct features to (e); this improvement is due to the increased number of widely spread visual features.
Our proposed sequential methods (e), (f) utilize a wider area for the localization than methods (a)-(d). However, rather than using the wide area for a single alignment estimation, the proposed methods (e), (f) segment the map at the frame level and run independent registrations. This provides a way of conducting graph-SLAM-like loop closure in map-centric SLAM, where the map is continuous and non-rigid rather than discrete as in graph-based SLAM. Fig. 7 shows how the continuous map is spatio-temporally segmented at the frame level for the sequential localization.
The processing time is given in seconds in the last row of Table I. Note that as the localization occurs over different places, sufficient time is available for processing each localization. Also note that the alignment estimation in the proposed methods (e), (f) is executed at each location, which is why the processing time is multiplied by the number of matched places N.
Considering that the actual local trajectories drift over time, the length of the sequential fusion should be limited; otherwise, the transformed alignments found far from the initial loop closure location will differ from the alignment found at the initial location. The maximum sequence length depends on the accuracy of the trajectory estimation and can be found by propagating the unit uncertainty according to the travel time.
VI-D Outlier Handling
The pose estimation uncertainty and the error of the fused pose in the sequential fusion are depicted in Fig. 9, along with the visual evidence test statistics, where the color of the axis labels represents the status of the current fusion (black: needs more evidence; blue: inlier pose estimation; red: outlier pose estimation). The uncertainty and the actual error decrease monotonically whenever a new valid pose estimation is added. Occasional localization failures or false-positive place recognitions (numbers in red) are filtered out by the localization outlier rejection method. Once the uncertainty reaches a threshold, the result of the localization is utilized for a map deformation as a loop closure constraint [2]. An example of the deformation graph and a comparison of the registration are depicted in Fig. 9. When the uncertainty of the translation or orientation is high, the pose estimation is not reflected in the fused pose. However, as discussed in Section V-B, the measured uncertainty can be incorrect for various reasons; the adverse effect on the rotation estimation, such as in fusion sequences 2 and 3, can be caused by such incorrect uncertainty. Detected inlier features are added to the visual evidence pool, and the increased number of evidences is used for each sequence in the alignment validity test. The inlier rate is calculated as (inlier evidence)/(total evidence), where any pose estimation with a large error, such as fusion sequence 6, drastically drops the inlier rate.
VII Conclusion
In this paper, we proposed a robust metric localization method (PhotogeoSeq) that tightly couples spatio-temporal and visual information from a multimodal sensor set to improve the robustness and accuracy of loop closure for continuous-time map-centric SLAM. Our experiments demonstrated that the proposed method is superior to the state-of-the-art global ICP methods in terms of accuracy and failure detection. This is especially the case when the loop closure is detected in scenes that lack geometric features, where the proposed metric localization achieves four times more accurate translation estimation than the state-of-the-art. Furthermore, with the proposed sequential approach, which observes whether the estimated alignment is constant over different places, we were able to reject both localization and place recognition failures and achieve robust localization regardless of the initial guess. Our approach is especially beneficial for the on-the-fly loop closure scheme of the map-centric approach, where faultless metric localization is required. Finally, a comparison with the state-of-the-art loop closure failure detection method reveals the benefit of our proposed method in that it does not require a global trajectory optimization. In future work, we will address local non-rigidity and advanced map deformation that accounts for multiple alignments.
Acknowledgments
The authors gratefully acknowledge funding of the project by the CSIRO and QUT. Special thanks go to Renaud Dube and Pavel Vechersky for the implementation.
References
 [1] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison, “ElasticFusion: Dense SLAM Without A Pose Graph,” in Robotics: Science and Systems (RSS), vol. 11, 2015, pp. 1–9.
 [2] C. Park, P. Moghadam, S. Kim, A. Elfes, C. Fookes, and S. Sridharan, “Elastic LiDAR Fusion: Dense Map-Centric Continuous-Time SLAM,” in International Conference on Robotics and Automation (ICRA), 2018, pp. 1206–1213.
 [3] M. Bosse, R. Zlot, and P. Flick, “Zebedee: Design of a Spring-Mounted 3-D Range Sensor with Application to Mobile Mapping,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1104–1119, 2012.
 [4] S. Vidas, P. Moghadam, and S. Sridharan, “Real-time mobile 3D temperature mapping,” IEEE Sensors Journal, vol. 15, no. 2, pp. 1145–1152, 2015.

 [5] K. L. Ho and P. Newman, “Detecting Loop Closure with Scene Sequences,” International Journal of Computer Vision, vol. 74, no. 3, pp. 261–286, Sept. 2007.
 [6] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in International Conference on Robotics and Automation (ICRA), 2012, pp. 1643–1649.
 [7] S. Lynen, M. Bosse, and R. Siegwart, “Trajectory-based place-recognition for efficient large scale localization,” International Journal of Computer Vision, vol. 124, no. 1, pp. 49–64, Aug. 2017.
 [8] M. Cummins and P. Newman, “FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, June 2008.
 [9] Y. Latif, C. Cadena, and J. Neira, “Robust loop closing over time for pose graph SLAM,” The International Journal of Robotics Research, vol. 32, no. 14, pp. 1611–1626, Dec. 2013.
 [10] K. L. Ho and P. Newman, “Loop closure detection in SLAM by combining visual and spatial appearance,” Robotics and Autonomous Systems, vol. 54, no. 9, pp. 740–749, Sept. 2006.
 [11] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi, “Real-Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 5, pp. 571–583, May 2015.
 [12] A. Gawel, T. Cieslewski, R. Dubé, M. Bosse, R. Siegwart, and J. Nieto, “Structure-based vision-laser matching,” in International Conference on Intelligent Robots and Systems (IROS), Oct. 2016, pp. 182–188.
 [13] J. Guo, P. V. K. Borges, C. Park, and A. Gawel, “Local Descriptor for Robust Place Recognition using LiDAR Intensity,” IEEE Robotics and Automation Letters, To appear, 2019.
 [14] C. Park, S. Kim, P. Moghadam, S. Sridharan, and C. Fookes, “Spatiotemporal Camera-LiDAR Calibration: A Markerless and Structureless Approach,” submitted to International Conference on Robotics and Automation (ICRA), 2019.
 [15] C. Park, S. Kim, P. Moghadam, C. Fookes, and S. Sridharan, “Probabilistic Surfel Fusion for Dense LiDAR Mapping,” International Conference on Computer Vision Workshops (ICCVW), pp. 2418–2426, 2017.
 [16] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in International Conference on Robotics and Automation (ICRA), May 2014, pp. 15–22.
 [17] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
 [18] M. GuizarSicairos, S. T. Thurman, and J. R. Fienup, “Efficient subpixel image registration algorithms,” Optics Letters, vol. 33, no. 2, pp. 156–158, Jan. 2008.

 [19] F. I. I. Muñoz and A. I. Comport, “Point-to-hyperplane RGB-D pose estimation: Fusing photometric and geometric measurements,” in International Conference on Intelligent Robots and Systems (IROS), Oct. 2016.
 [20] T. D. Barfoot and P. T. Furgale, “Associating Uncertainty With Three-Dimensional Poses for Use in Estimation Problems,” IEEE Transactions on Robotics, vol. 30, no. 3, pp. 679–693, June 2014.
 [21] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in European Conference on Computer Vision (ECCV), 2010, pp. 356–369.
 [22] R. B. Rusu, N. Blodow, and M. Beetz, “Fast Point Feature Histograms (FPFH) for 3D registration,” in International Conference on Robotics and Automation (ICRA), May 2009, pp. 3212–3217.