The availability of benchmark datasets [Shotton2013CVPR, Kendall2015ICCV, Sattler2018CVPR, Li2010ECCV, Wald2020ECCV, Zhang2020IJCV, Taira2018CVPR, Li2012ECCV, Chen2011CVPR, Maddern2017IJRR, Valentin20163DV, Torii2019TPAMI] has been a driving factor for research on visual re-localisation, a core technology to make autonomous robots [Lim12CVPR], self-driving cars [Heng2019ICRA], and augmented / mixed reality (AR / MR) systems [Castle08ISWC, Arth2011ISMAR, Lynen2015RSS] a reality. These benchmarks provide camera poses for a set of training and test images. The training images can be used to create a scene representation, and the test images serve as queries to determine the 3D position and 3D orientation (6DoF pose) of the camera with respect to the scene. Due to the challenge of jointly estimating the poses of thousands or more images, benchmark datasets are typically generated by a reference algorithm such as SfM or (RGB-)D SLAM [Kendall2015ICCV, Shotton2013CVPR, Valentin20163DV, Li2010ECCV, Li2012ECCV]. As such, benchmarks measure how well visual re-localisation methods are able to replicate the results of the reference algorithm.
Ideally, the choice of reference algorithm should not matter as long as it faithfully estimates the camera poses of the training and test images. In particular, the choice of reference algorithm should not affect the ranking of methods on a benchmark. In practice, however, different reference algorithms optimise different cost functions, e.g., reprojection errors of sparse point clouds for SfM [Schoenberger2016CVPR, Wu133DV] or alignment errors in 3D space for depth-based SLAM methods [Newcombe2011ISMAR, Izadi2011UIST, Schoeps2019CVPR, Dai2017TOG], leading to different local minima. We ask to what degree the choice of reference algorithm impacts the ranking of methods on a benchmark. This is an important question as it pertains to whether we can draw absolute conclusions, e.g., that algorithm A is better than algorithm B or that using component C improves accuracy. Interestingly, to the best of our knowledge, this question has not received much attention in the re-localisation literature.
The main focus of this paper is to investigate how the choice of reference algorithm impacts the measured performance of visual re-localisation algorithms. To this end, we compare two types of reference algorithms (depth-based SLAM and SfM) on two popular benchmark datasets [Shotton2013CVPR, Valentin20163DV]. Detailed experiments with state-of-the-art re-localisation algorithms show that the choice of reference algorithm can have a profound impact on the ranking of methods. In particular, as illustrated in Fig. 1, we show that depending on the reference algorithm, a modern end-to-end-trainable approach [brachmann2020ARXIV] either outperforms or is outperformed by a classical, nearly 10-year-old baseline [Sattler2012ECCV, Sattler2017PAMI]. Similarly, the choice of whether to use depth maps or SfM point clouds to represent a scene can improve or degrade performance depending on the reference algorithm. Our results show that we as a community should be careful when drawing conclusions from existing benchmarks. Instead, it is necessary to take into account that certain approaches resemble the reference algorithm more closely than others. The former are better able to replicate imperfections in a reference algorithm’s pseudo ground truth (pGT). This natural advantage should be discussed when evaluating localisation results and designing new benchmarks.
In detail, this paper makes the following contributions:
1) we show that the choice of a reference algorithm for obtaining pGT poses can have a significant impact on the relative ranking of methods, to the extent that the rankings of methods can be (nearly) completely reversed. This implies that published results for visual re-localisation should always be considered under the aspect of which algorithm was used to create the pGT.
2) we provide a comparison of pGT generated by RGB-only SfM and (RGB-)D SLAM on the 7Scenes [Shotton2013CVPR] and 12Scenes [Valentin20163DV] datasets, which are widely used [Shotton2013CVPR, Guzman2014CVPR, Brachmann2016CVPR, Brachmann2017CVPR, Brachmann2018CVPR, Brachmann2019ICCVa, Brachmann2019ICCVb, Kendall2015ICCV, Kendall2017CVPR, Walch2017ICCV, Brahmbhatt2018CVPR, Valentin2015CVPR, Valentin20163DV, Cavallari2017CVPR, Cavallari20193DV]. We show that neither is clearly superior to the other. We further show that commonly accepted results from the literature (RGB-D variants of re-localisation methods outperform their RGB-only counterparts; scene coordinate regression is more accurate than feature-based methods) are not absolute but depend on the pGT.
3) we are not aware of prior work aimed at evaluating the extent to which conclusions about localisation performance can be drawn from existing benchmarks. As such, this paper is the first to raise awareness that the limitations of the pGT for re-localisation need to be discussed in order to make valid comparisons across methods.
Our new pGT and our evaluation pipeline are available at github.com/tsattler/visloc_pseudo_gt_limitations/.
2 Related Work
The difficulty of obtaining ground truth varies for different tasks in computer vision. For tasks with a low-dimensional output structure, e.g., image classification and object detection, human annotations are effective [pascal-voc-2012] and can be scaled via crowdsourcing [Deng2009CVPR]. For tasks with a more complex output, e.g., image segmentation or optical flow, annotation time quickly rises to a level which severely affects the scale and cost of the associated datasets [Donath2013ICVS, Cordts2016CVPR, geiger2013vision, Menze2015CVPR].
The 6DoF camera pose estimation task comes with the added difficulty that humans cannot directly annotate camera poses. Instead, they annotate image correspondences that serve as input to an optimisation problem which recovers the pose indirectly [Sattler2018CVPR, Taira2018CVPR, Torii2019TPAMI]. Since many correspondences are required for a stable pose estimate, such an annotation approach does not scale beyond a few hundred images. Further, the annotations are usually only precise up to a few pixels, which, depending on the distance to the scene, can result in significant pose uncertainty [Sattler2018CVPR, Zhang2020IJCV, Torii2019TPAMI].
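To illustrate why a few pixels of annotation error matter, the following self-contained Python sketch (synthetic data, OpenCV; not the tooling of any of the cited benchmarks) recovers a camera pose from a dozen noisy 2D-3D correspondences via PnP+RANSAC; perturbing the 2D annotations by a few pixels visibly perturbs the recovered pose:

```python
import numpy as np
import cv2

K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])                          # synthetic pinhole intrinsics
pts_3d = np.random.uniform(-1, 1, (12, 3)) + [0, 0, 4]  # scene points in front of the camera
rvec_gt, tvec_gt = np.array([0.1, -0.2, 0.05]), np.array([0.3, 0.1, 0.5])
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = pts_2d.reshape(-1, 2) + np.random.normal(0, 2.0, (12, 2))  # annotation noise

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
    reprojectionError=8.0)
print("position error:", np.linalg.norm(tvec.ravel() - tvec_gt))  # grows with annotation noise
```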
As an alternative, the recording camera can be tracked by an external tracking system [sturm12iros, Burri2016IJRR, Schoeps2019CVPR]. While providing highly precise poses, capturing a diverse set of scenes is challenging due to the complicated setup, e.g., installation and calibration, or ensuring good visibility of the sensors. Similarly, industrial-grade LiDAR scanners have been used to produce high-quality scans of landmarks, but the corresponding datasets provide only a few scenes with limited spatial extent [Strecha2008CVPR, Knapitsch2017TOG, Schoeps2017CVPR]. Systems that combine GPS with inertial navigation systems (GPS-INS) have also been used to track camera poses at large scale [geiger2013vision, Maddern2017IJRR]. Yet, post-processing is still required to obtain higher accuracy [Sattler2018CVPR].
Synthetic datasets come with true ground truth, but most current datasets, e.g., Habitat [habitat19iccv], are limited in the diversity of low-level noise, illumination conditions, or specular reflections. Therefore, data association, which is at the core of re-localisation, can become too easy. An example is the very low errors reported in [Sattler2019CVPR] for Active Search [Sattler2012ECCV, Sattler2017PAMI] on a synthetic version of the Cambridge Landmarks dataset.
The vast majority of re-localisation benchmarks follow an automatic approach to ground truth recovery using a reference algorithm [Li2010ECCV, Chen2011CVPR, Li2012ECCV, Shotton2013CVPR, Kendall2015ICCV, Valentin20163DV, Sattler2018CVPR, Taira2018CVPR, Torii2019TPAMI, Zhang2020IJCV, Wald2020ECCV, Jin2020]. Popular choices are SfM [Schoenberger2016CVPR, Wu133DV], often for large-scale outdoor environments [Li2010ECCV, Chen2011CVPR, Li2012ECCV, Torii2019TPAMI, Kendall2015ICCV, Sattler2018CVPR], and depth-based SLAM [Newcombe2011ISMAR, Izadi2011UIST, Schoeps2019CVPR, Dai2017TOG], often for small-scale indoor environments [Shotton2013CVPR, Valentin20163DV, Wald2020ECCV]. Hybrid solutions also exist, such as ICP-based registration of LiDAR scans followed by an SfM-based registration of RGB images [Sattler2018CVPR, Taira2018CVPR]. Some benchmarks use human visual inspection as a final quality control and verification stage [Sattler2018CVPR, Taira2018CVPR, Jin2020, Zhang2020IJCV], and state-of-the-art reference algorithms are found to provide high-quality reconstructions and pose tracks. However, as shown in Fig. 1, subtle differences in the output of reference algorithms, unlikely to be recognised by visual checks, can have a significant influence on the evaluation outcome of a benchmark.
Such evaluation artifacts have the potential to challenge some conclusions that have previously been drawn in the literature. Fig. 2 shows published results in re-localisation research on the popular indoor datasets 7Scenes [Shotton2013CVPR] and 12Scenes [Valentin20163DV], and the popular outdoor datasets Cambridge Landmarks [Kendall2015ICCV] and Aachen Day [Sattler2012BMVC, Sattler2018CVPR]. We compare the dominant families of re-localisation methods: scene coordinate regression and sparse feature-based matching. Scene coordinate regression methods use a learned model, e.g., a neural network or a random forest, to predict dense image-to-scene correspondences [Shotton2013CVPR, Valentin2015CVPR, Brachmann2016CVPR, Brachmann2017CVPR, Brachmann2018CVPR, Brachmann2019ICCVa, Brachmann2019ICCVb, brachmann2020ARXIV, Cavallari2017CVPR, Cavallari2019TPAMI, Yang2019ICCV, li2020hierarchical]. RGB-D variants of scene coordinate regression methods dominate rankings for indoor re-localisation, which has been attributed to the inherent difficulty of the indoor scenario regarding texture-less surfaces and ambiguous structures that make it difficult to find and match sparse features [Shotton2013CVPR, Kendall2015ICCV, Walch2017ICCV, Brachmann2017CVPR]. For outdoor re-localisation, classical approaches, which match hand-crafted [Sattler2017PAMI, Shotton2013CVPR, Valentin20163DV] or learned descriptors [Sarlin2019CVPR, DeTone2018CVPRWorkshops, Sarlin2020CVPR, HumenbergerX20Kapture] at sparse feature locations to a 3D SfM reconstruction, achieve vastly superior results compared to scene coordinate regression. This has been attributed to an inability of scene coordinate regression to scale to spatially large scenes [Sattler2018CVPR, Taira2019TPAMI]. We offer a different explanation for the performance of re-localisers in different settings by taking into account the reference algorithms that were used to create the associated benchmarks.
3 Datasets and Reference Algorithms
In order to measure the impact different reference algorithms have on localisation performance, we consider pGT generated using (RGB-)D and sparse RGB-only data. We use the popular 7Scenes [Shotton2013CVPR] and 12Scenes [Valentin20163DV] datasets as they provide depth maps and pGT poses for both test and training images. This is in contrast to other common benchmarks [Sattler2018CVPR, Taira2018CVPR, Kendall2015ICCV, Li2010ECCV, Wald2020ECCV, Torii2019TPAMI], which do not provide depth information for test and training images [Sattler2018CVPR, Taira2018CVPR, Kendall2015ICCV, Li2010ECCV, Torii2019TPAMI] (depth estimated via motion stereo requires and is influenced by the pGT, while single-view depth prediction offers limited quality and stability) or do not make the poses of the test images publicly available [Sattler2018CVPR, Taira2018CVPR, Wald2020ECCV]. In the following, we describe the datasets, their original pGT, and how we create an additional pGT for each dataset via RGB-only SfM. The purpose of this section is to familiarise the reader with the datasets and reference algorithms before evaluating the resulting pGT variants (Sec. 4) and measuring their impact on re-localisation performance (Sec. 5).
3.1 Incremental Depth SLAM
Camera poses can be tracked by incrementally registering dense depth measurements to a 3D scene representation. KinectFusion [Newcombe11ICCV, Izadi2011UIST], an early incarnation of such a system, uses a truncated signed distance function (TSDF) to represent the scene. The TSDF is updated by merging the depth maps $D_i$ of frames $i$ into a weighted average

$F(\mathbf{v}) = \frac{\sum_i w_i(\mathbf{v})\, F_i(\mathbf{v})}{\sum_i w_i(\mathbf{v})},$

where $F$ and $F_i$ denote the TSDF representations of the scene and of depth map $D_i$, respectively. The weights $w_i$ capture the measurement uncertainty of the depth recording. For tracking the 6DoF camera pose of a new frame $i$ with rotation $\mathbf{R}_i$ and translation $\mathbf{t}_i$, KinectFusion minimises the point-plane distance between the measured depth $D_i$ and a depth rendering $\hat{D}_i$ of the scene's TSDF volume:

$E_{\text{depth}} = \sum_{\mathbf{x}} \left[ \left( \mathbf{R}_i V_i(\mathbf{x}) + \mathbf{t}_i - \hat{V}_i(\mathbf{x}) \right)^\top \hat{N}_i(\mathbf{x}) \right]^2 . \quad (1)$

The objective is minimised over 2D pixel positions $\mathbf{x}$. The measured depth $D_i$ and the rendered depth $\hat{D}_i$ are back-projected to 3D vertex maps $V_i$ and $\hat{V}_i$, respectively. In particular, $\hat{V}_i$ denotes the rendered vertex map of the scene in world (or global) coordinates and $\hat{N}_i$ denotes the rendered normals.
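For concreteness, the energy of Eq. 1 can be written down in a few lines. The following is a minimal numpy sketch, assuming the projective data association has already paired each measured pixel with the corresponding rendered pixel; array names follow the notation above and are otherwise illustrative:

```python
import numpy as np

def point_to_plane_energy(R, t, V_meas, V_ren, N_ren):
    """E_depth of Eq. 1: sum of squared point-to-plane distances.

    V_meas: (N, 3) vertices back-projected from the measured depth map D_i
            (camera coordinates).
    V_ren:  (N, 3) vertices rendered from the scene TSDF (world coordinates).
    N_ren:  (N, 3) rendered normals; row k of V_meas is assumed to be
            associated with row k of V_ren / N_ren.
    """
    diff = V_meas @ R.T + t - V_ren            # transform measured points to world
    residuals = np.sum(diff * N_ren, axis=1)   # signed distance along the normal
    return np.sum(residuals ** 2)
```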
KinectFusion pGT for 7Scenes. Shotton et al. [Shotton2013CVPR] created the 7Scenes dataset for re-localisation by scanning seven small-scale indoor environments with a Kinect v1 camera and KinectFusion. Every scene was scanned multiple times by different users, and the resulting 3D scene models were registered using ICP [Rusinkiewicz2001ICP]. No global optimisation within a single scan or across multiple scans was performed, so any camera drift remains unaccounted for in the pGT of 7Scenes. In terms of RGB-D images, the 7Scenes dataset only provides the uncalibrated output of the Kinect, i.e., RGB images and depth maps are not registered, and the camera poses refer to the depth sensor, not the RGB camera.
3.2 Globally Optimised RGB-D SLAM
To reduce camera drift during incremental scanning, more recent RGB-D SLAM systems like BundleFusion [Dai2017TOG] jointly optimise all 6DoF camera poses. The parameter vector $\mathbf{h}$ stacks the rotations and translations of all frames recorded so far, and BundleFusion optimises

$E(\mathbf{h}) = w_{\text{sparse}} E_{\text{sparse}}(\mathbf{h}) + w_{\text{photo}} E_{\text{photo}}(\mathbf{h}) + w_{\text{depth}} E_{\text{depth}}(\mathbf{h}). \quad (2)$

The term $E_{\text{sparse}}$ minimises the Euclidean distance for sparse SIFT [Lowe04IJCV] feature matches across all images. Note that this term minimises a 3D distance, not a reprojection error, since the depth of image pixels is known. The term $E_{\text{photo}}$ is a photometric loss that ensures a consistent gradient of image luminance across registered images. Finally, $E_{\text{depth}}$ optimises a point-to-plane distance of depth maps with projective data association similar to KinectFusion, see Eq. 1.
BundleFusion pGT for 12Scenes. Valentin et al. [Valentin20163DV] scanned twelve small-scale indoor environments for their 12Scenes dataset. They utilized a structure.io depth sensor mounted on an iPad that provided the associated color images. In contrast to 7Scenes, 12Scenes comes with fully calibrated and synchronized color and depth images, and depth is registered to the color images. Each room was scanned twice, once for training and once for testing, and both scans of each scene were registered manually.
3.3 Pseudo Ground Truth via SfM
A common approach to generating pGT [Li2010ECCV, Li2012ECCV, Sattler2018CVPR, Torii2019TPAMI, Kendall2015ICCV, Sun2017CVPR] is to use (incremental) SfM algorithms [Schoenberger2016CVPR, Wu133DV, Snavely08IJCV]. SfM methods rely on sparse local features such as SIFT [Lowe04IJCV] to establish feature matches between images, which are then used to recover camera poses and 3D scene structure. SfM is usually applied to the test and training images together to jointly recover the camera poses of all images [Sattler2018CVPR, Li2012ECCV, Kendall2015ICCV].
SfM algorithms minimise the reprojection error between the estimated 3D points and their corresponding feature measurements in the images, optimising the problem

$\min \sum_i \sum_j \chi_{ij}\, \rho\!\left( \left\| \pi(\mathbf{K}_i, \mathbf{R}_i, \mathbf{t}_i, \mathbf{P}_j) - \mathbf{x}_{ij} \right\|^2 \right) \quad (3)$

during Bundle Adjustment (BA) [Triggs2000VATP]. Here, $\mathbf{K}_i$ are the intrinsic camera parameters of image $i$, $\mathbf{P}_j$ is the $j$-th 3D point, $\chi_{ij}$ indicates whether 3D point $j$ is visible in image $i$, $\mathbf{x}_{ij}$ is the corresponding 2D feature position of 3D point $j$ in image $i$, $\pi$ is the projection function, and $\rho$ is a robust cost function [Triggs2000VATP]. SfM only reconstructs the scene up to an arbitrary scaling factor. Known 3D distances are used to recover the absolute scale of the model.
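For illustration, the per-observation term of this objective can be evaluated as follows; this is a minimal Python sketch, not COLMAP's implementation (which uses analytic Jacobians inside Ceres):

```python
import numpy as np

def reprojection_cost(K, R, t, P, x, rho=lambda s: s):
    """One term of the BA objective: rho(||pi(K, R, t, P) - x||^2).
    rho defaults to the trivial (non-robust) kernel; a robust choice
    would be, e.g., the Cauchy kernel rho = np.log1p."""
    p = K @ (R @ P + t)        # 3D point in homogeneous image coordinates
    proj = p[:2] / p[2]        # perspective division: pi(K, R, t, P)
    return rho(np.sum((proj - x) ** 2))
```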
SfM pGT for 7Scenes and 12Scenes. As the basis for our analysis, we generate an alternative pGT for 7Scenes and 12Scenes. First, we reconstruct the scene with SfM using only the training images. Next, we continue the reconstruction process with the test images while keeping the training camera poses fixed. This strategy ensures that the training poses are not affected by the test images, as would be the case in practice. Finally, we recover the scale by robustly aligning the positions of all cameras to those of the original pGT. We implement this process with COLMAP [Schoenberger2016CVPR], using the same camera intrinsics for all images in a scene.
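The final scale-recovery step amounts to estimating a similarity transform between camera centres. Below is a minimal sketch of the closed-form (Umeyama-style) least-squares solution; note that our actual alignment is robust, e.g., such a solver can be wrapped in a RANSAC loop over subsets of camera centres:

```python
import numpy as np

def similarity_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) with s * R @ src + t ~ dst.
    src, dst: (N, 3) arrays of corresponding camera centres."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```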
This approach failed for the office2/5a and 5b scenes of 12Scenes. Both depict scenes with highly repetitive structures. As a result, the SfM reconstruction collapses, i.e., visually similar but physically different parts of the scene are merged. Thus, for both scenes, we first triangulate 3D points using the original pGT. Next, we apply 10 iterations consisting of BA followed by merging and completing 3D points: nearby 3D points with matching features are merged and new features are added to 3D points where possible.
Some images of 12Scenes that were not registered by BundleFusion were reconstructed using COLMAP. Also, for the office2/5a and 5b scenes, we removed 61 images (out of 3,354 images contained in both scenes together) that we identified as obvious outliers via visual inspection.
4 Comparison of Pseudo Ground Truths
Given the two versions of pGT for each scene, estimated using (RGB-)D SLAM and SfM, respectively, a natural question is whether one version is more precise than the other. In this section, we quantitatively and qualitatively show that neither version of the pGT is clearly preferable over the other: we first show that the SfM pGT outperforms the (RGB-)D SLAM version according to metrics that are optimised during the SfM process. We then show that the (RGB-)D SLAM pGT in turn outperforms the SfM version in terms of dense 3D point alignment, i.e., the metrics optimised by depth-based methods. Thus, both versions can be considered valid pGT for re-localisation experiments. Note that our analysis focuses on these two particular datasets. For a more general analysis of various reference algorithms, e.g., regarding the influence of calibration accuracy, we refer to [Schoeps2019CVPR].
Evaluation based on SfM metrics. The first experiment focuses on standard metrics used to evaluate SfM reconstructions [Schoenberger2017CVPR]. We measure the number of 3D points (#3D), the number of feature observations (#feat.) used to triangulate the 3D points, the average track length (track), i.e., the average number of features used to triangulate a 3D point, and the average reprojection error (err.). For the same number of images in a 3D model, more observations and longer tracks, especially in combination with a lower reprojection error, indicate higher camera pose accuracy. Shorter tracks, i.e., more 3D points, indicate that a single physical 3D point is represented by multiple SfM points: due to pose inaccuracies, no single SfM point projects within the error threshold used for robust triangulation [Schoenberger2016CVPR] for all its measurements.
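These statistics can be computed directly from a reconstruction. A sketch follows, assuming a simplified in-memory representation of the 3D points and their tracks; the exact data structure is illustrative (in practice, e.g., pycolmap exposes equivalent information):

```python
import numpy as np

def sfm_statistics(points3d):
    """Compute the SfM metrics of Tab. 1 from a reconstruction.

    points3d: dict mapping point id -> (track, errs), where track is the list
    of (image id, feature id) observations of that point and errs are their
    reprojection errors in pixels (an assumed, simplified representation).
    """
    n_points = len(points3d)
    track_lengths = [len(track) for track, _ in points3d.values()]
    n_feat = sum(track_lengths)
    mean_err = np.mean([e for _, errs in points3d.values() for e in errs])
    return {
        "#3D": n_points,
        "#feat.": n_feat,
        "track": n_feat / n_points,  # average track length
        "err.": mean_err,            # average reprojection error [px]
    }
```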
We compare the SfM pGT with point clouds obtained by triangulating the scenes from the original (RGB-)D pGT. For 7Scenes, we adjust the original pGT to account for the offset between RGB camera and depth sensor using the calibration from [wolf2014CVIU]. We use the same set of matches and the same COLMAP parameters for both pGT versions and use training and test images to calculate the statistics.
Tab. 1 shows the SfM metrics for both datasets. The SfM pGT clearly outperforms the original (RGB-)D SLAM pGT in the number of observations, track length, and reprojection error, especially on 7Scenes. We attribute this to the fact that KinectFusion, in contrast to BundleFusion, does not perform global optimisation and is thus susceptible to drift [Valentin20163DV]. Fig. 3 qualitatively compares the SfM point clouds obtained with both versions of the pGT, showing that the SfM pGT leads to significantly less noisy SfM points.
As a way to measure the similarity of the local optima found by the different pGT algorithms, we generate an “intermediate” pGT, denoted as +BA in Tab. 1: starting from the original pGT, we alternate between BA of the triangulated 3D model and merging and completing 3D points. As for office2/5a and office2/5b, we repeat this process for 10 iterations. If the local minima found by the (RGB-)D SLAM and SfM algorithms are close, we expect this process to result in a similar local optimum for the SfM metrics. (We compare the “intermediate” pGT to the other pGT versions based on SfM metrics rather than pose errors, since the alignment between SfM and SLAM pGT introduces a potential error that we cannot easily remove.) As can be seen in Tab. 1, the “intermediate” pGT results in similar or slightly worse statistics compared to the SfM pGT for both datasets. This indicates that the difference between poses is not large enough for bundle adjustment to result in significantly different local minima.
Evaluation based on 3D alignment metrics. We next evaluate how accurately the two pGT versions align the depth maps available for each image. For a pair of images $(a, b)$ in a scene, we use the pGT poses to transform their depth maps to 3D point clouds in scene coordinates. For each 3D point in $a$'s depth map, we find the nearest point in $b$'s depth map. We report the root mean square error (RMSE) of all point correspondences below a 5cm outlier threshold (we did not observe image pairs without any correspondences within 5cm). This cost function, implemented in Open3D [Zhou2018ARXIV], measures the 3D alignment of the two point clouds and replicates the metric minimised by algorithms such as KinectFusion [Newcombe2011ISMAR, Izadi2011UIST] and BundleFusion [Dai2017TOG].
We select image pairs for evaluation based on visual overlap in the SfM pGT [Radenovic2019PAMI]: let $P_{ab}$ be the number of 3D points jointly observed by images $a$ and $b$, and let $P_a$ and $P_b$ be the number of 3D points seen in $a$ and $b$, respectively. We consider a pair if the overlap ratio $P_{ab} / \min(P_a, P_b)$ exceeds a fixed threshold.
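Putting pair selection and the alignment metric together, a sketch of this evaluation is given below, using scipy instead of Open3D; the camera-to-world pose convention and the input layout are assumptions of the sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def visual_overlap(obs_a, obs_b):
    """Overlap ratio P_ab / min(P_a, P_b) from the sets of 3D point ids
    observed by images a and b in the SfM pGT."""
    return len(obs_a & obs_b) / min(len(obs_a), len(obs_b))

def depth_to_world(depth, K, R, t):
    """Back-project a depth map and transform it to scene coordinates.
    R, t are assumed to be camera-to-world."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    rays = np.linalg.inv(K) @ np.stack([u[valid], v[valid], np.ones(valid.sum())])
    return (R @ (rays * depth[valid])).T + t

def alignment_rmse(pts_a, pts_b, thresh=0.05):
    """RMSE over nearest-neighbour correspondences within the 5cm outlier
    threshold, mirroring the metric described above."""
    dists, _ = cKDTree(pts_b).query(pts_a)
    inliers = dists < thresh
    return np.sqrt(np.mean(dists[inliers] ** 2))
```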
Fig. 4 shows cumulative histograms over the alignment errors for both pGT versions (see Appendix C for plots of all individual scenes). We separately show curves for pairs of training images and pairs containing one test and one training image. The former measures the consistency between the training images and the latter measures how well the test images align with the training images. Since images are taken in continuous sequences, there are smaller changes in viewpoint between pairs of training images than for pairs containing training and test images. As such, there is a larger error for test/train pairs than for train/train pairs.
Fig. 4 shows smaller alignment errors for the original pGT. We also show dense point clouds obtained by fusing individual depth maps using the different pGT poses in Fig. 5 and Fig. 6. While the SfM pGT leads to globally more consistent geometry with less drift (global consistency might not always be necessary, e.g., in AR applications where a user observes only a small part of the scene), fine details of foreground objects are better recovered with the original pGT. This confirms the results from Fig. 4, which show a more precise relative alignment of depth maps for the original pGT.
5 Re-localisation Evaluation
Sec. 4 showed that neither the original (RGB-)D SLAM pGT nor the SfM pGT is clearly better than the other. Thus, both pGT versions are valid choices for evaluating re-localisation algorithms. Their differences are on the order of centimeters; however, this is typically the range used to measure localisation accuracy. This section thus investigates how the different pGT versions affect the measured performance of re-localisers. We show that RGB-D baselines fare better on pGT generated with (RGB-)D SLAM, while baselines that minimise a reprojection error perform better on the SfM pGT.
Evaluation measures. We report the percentage of images localised within 5cm and 5° of the respective pGT [Shotton2013CVPR, Valentin20163DV]. We also report the Dense Correspondence Reprojection Error (DCRE) [Wald2020ECCV]: for each test image, we back-project the depth map into a 3D point cloud using its pGT pose. We project each 3D point into the image using the estimated and the pGT pose and measure the 2D distance between both projections. We report the maximum DCRE per test image below and the mean DCRE per test image in Appendix D.
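A sketch of the DCRE computation for one test image follows; the 4x4 world-to-camera pose convention and the array names are assumptions of the sketch:

```python
import numpy as np

def dcre(depth, K, pose_gt, pose_est):
    """Max. and mean DCRE [Wald2020ECCV] for one test image, in pixels.
    depth: (H, W) depth map; pose_gt / pose_est: 4x4 world-to-camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    # back-project valid pixels with the pGT pose into scene space
    rays = np.linalg.inv(K) @ np.stack([u[valid], v[valid], np.ones(valid.sum())])
    pts_cam = rays * depth[valid]
    pts_hom = np.vstack([pts_cam, np.ones(pts_cam.shape[1])])
    pts_world = (np.linalg.inv(pose_gt) @ pts_hom)[:3]

    def project(pose):  # scene point -> pixel under a given pose
        p = K @ (pose @ np.vstack([pts_world, np.ones(pts_world.shape[1])]))[:3]
        return p[:2] / p[2]

    err = np.linalg.norm(project(pose_est) - project(pose_gt), axis=0)
    return err.max(), err.mean()
```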
Baselines. We evaluate classical, feature-based as well as learning-based re-localisers on the two versions of the pGT. Learning-based methods are re-trained on each pGT, and feature-based methods use each version of the pGT to create their map. Please see Appendix A for details.
DSAC* [Brachmann2017CVPR, Brachmann2018CVPR, brachmann2020ARXIV] is a learning-based scene coordinate regression approach, where a neural network predicts for each pixel the corresponding 3D point in scene space. DSAC* uses a PnP solver [gao2003complete] and RANSAC [Fischler81CACM] on top of the 2D-3D matches. Its RGB-D variant, DSAC* (+D), uses image depth to establish 3D-3D matches and a Kabsch solver [kabsch1976solution]. hLoc [Sarlin2019CVPR] combines image retrieval with SuperPoint features [DeTone2018CVPRWorkshops] and SuperGlue [Sarlin2020CVPR] for matching, followed by P3P+RANSAC-based pose estimation. DenseVLAD+R2D2 [Torii2015CVPR, revaud2019r2d2, HumenbergerX20Kapture] uses DenseVLAD [Torii2015CVPR] for retrieving image pairs and R2D2 features for matching. The training images and poses are used to construct a 3D SfM map, and test images are localised using 2D-3D matches and P3P+RANSAC. Instead of triangulating point matches, DenseVLAD+R2D2 (+D) constructs the 3D map by projecting R2D2 keypoints to 3D space using depth maps. Active Search (AS) [Sattler2012ECCV, Sattler2017PAMI] is a classical feature-based approach that establishes 2D-3D correspondences based on prioritized SIFT [Lowe04IJCV] matching. AS estimates the camera pose with a P3P solver [Kneip11CVPR, Haralick94IJCV] inside a RANSAC loop [Fischler81CACM].
Results. Tab. 2 reports the percentage of test images localised within 5cm and 5° of the pGT for the 7Scenes dataset. For the original pGT, depth-based DSAC* (+D) clearly outperforms all other methods. Depth-based DenseVLAD+R2D2 (+D) achieves the best results among all sparse feature-based methods. AS, using classical SIFT features, achieves the lowest accuracy under the original pGT.
The ranking changes drastically under the SfM pGT. AS jumps from last to first place with an absolute difference of +29.8 percentage points in pose accuracy, outperforming all learning-based and depth-based competitors. Particularly notable are the results on Pumpkin and Red Kitchen, where AS improves from localising less than 50% of the images within the 5cm, 5° threshold to localising more than 99%. For both scenes, Tab. 1 shows a significant difference in the SfM statistics between the two pGT versions. See Appendix B for a visual analysis on Pumpkin. The previously-leading depth-based DSAC* (+D) and DenseVLAD+R2D2 (+D) drop to the last places of the ranking. Both methods are outperformed by their RGB-only counterparts when using the SfM pGT.
We can correlate these observations with each method's similarity to the respective reference algorithm (see column 1 of Tab. 2 for a coarse classification). We regard methods that optimise a reprojection error over sparse features as similar to SfM and methods that optimise a dense 3D-3D error as similar to (RGB-)D SLAM. The RGB variant of DSAC* optimises a dense reprojection error. DVLAD+R2D2 (+D) optimises a sparse reprojection error but incorporates depth when building the 3D map. Thus, we classify those two methods as intermediary. Among methods similar to SfM, AS shows the largest improvement under the SfM pGT as it re-uses the SIFT features from SfM. Fig. 7(a) shows cumulative distributions over the fraction of images localised within Xcm, X° of the pGT for tighter thresholds than used before. This is particularly interesting for 12Scenes, where the accuracy of all methods saturates under the 5cm, 5° threshold. Poses predicted by DSAC* (+D) align better with the original (RGB-)D SLAM pGT than with the SfM pGT. At the same time, poses predicted by RGB-based methods align better with the SfM pGT. The differences between the methods become larger at finer thresholds. For 12Scenes, hLoc and DenseVLAD+R2D2 achieve the highest accuracy under a 1cm, 1° threshold.
Fig. 7(b) shows cumulative distributions for the max. DCRE. Since the DCRE depends on the pose accuracy, we observe the same behavior as before, i.e., methods more similar to SfM outperform depth-based methods on the SfM pGT while performing worse on the original (RGB-)D pGT. Yet, this does not necessarily imply that such methods are superior. They closely resemble the SfM reference algorithm, and they use the 3D points triangulated by the SfM pipeline from the training images for pose estimation. Thus, it seems likely that feature-based methods “overfit” to the SfM pGT by being able to closely replicate SfM behavior. To further illustrate the issue, we created an “intermediate” pGT: starting with the original pGT poses, we triangulate the scene and use bundle adjustment followed by point merging to optimise the test poses while keeping the training poses fixed. Intuitively, the resulting poses, denoted as “+BA”, approximate the “optimal” test poses for the original training image pGT under the reprojection error metric. Fig. 7(c) and Tab. 2 show results obtained using the +BA test poses. The +BA poses significantly improve the evaluation scores of RGB-based methods such as AS: the closer the pGT is to the cost function optimised by these methods, the better they perform. In contrast, depth-based methods such as DSAC* (+D) or DenseVLAD+R2D2 (+D) typically perform similarly or worse under these pGT poses. Our results indicate that learning-based methods might have some capacity to adjust to the pGT since DSAC* ranks well across all pGT versions. Still, DSAC* is always outperformed by methods more similar to the reference algorithm.
6 Conclusion
Re-localisation benchmarks usually rely on a reference algorithm to create pseudo ground truth for evaluation. As such, they do not measure absolute pose accuracy but rather how well a given method is able to reproduce the reference output. Our paper points out an important implication: different cost functions optimised by reference algorithms lead to different local minima. This affects re-localisation evaluation, as methods that optimise a cost function similar to that of the reference algorithm better replicate the local minima and imperfections of the pGT, to a degree that relative rankings can be (nearly) completely inverted.
This issue is fundamental, and we do not see a solution to this problem. However, there are ways to address the issue, as shown in Sec. 5: new benchmark datasets could provide multiple pGT versions to enable a more informative evaluation that takes the impact of the pGT into account. E.g., although DSAC* does not perform best under any pGT, it performs well under all pGT versions. If multiple pGT versions are not available, localisation algorithms can be grouped based on their similarity to the reference algorithm (color-coding in Tab. 2) and only be compared within but not between groups. Another approach is to choose evaluation thresholds that are large enough that the differences in pGT do not affect the measured performance, e.g., 5cm, 5° for 12Scenes. Such an approach will likely have to explicitly account for the uncertainties in the estimated poses, which itself is a complex problem [Foerstner2016Book]. Still, knowledge about pose uncertainties would allow us to determine when a dataset is solved. Another direction is a task-specific evaluation of re-localisation methods, e.g., measuring their performance in the context of AR, robotic navigation, etc. Again, understanding the impact of the pGT on such evaluations is an interesting and open problem.
Acknowledgements. This work has received funding from the EU Horizon 2020 project RICAIP (grant agreement No 857306) and the European Regional Development Fund under project IMPACT (No. CZ.02.1.01/0.0/0.0/15003/0000468).
Appendix A Implementation Details
In the following, we detail how we adjusted the source code of Active Search [Sattler2017PAMI], hLoc [Sarlin2019CVPR, Sarlin2020CVPR], R2D2 [HumenbergerX20Kapture], and DSAC* [brachmann2020ARXIV] and provide training details for the latter.
A.1 Active Search (AS)
We use the source code of [Sattler2017PAMI], but replace the original RANSAC method with the LO-RANSAC [Lebeda2012BMVC] implementation from [Sattler2019Github]. Local optimisation is implemented by minimising the sum of squared reprojection errors over a subset of the inliers of the best pose found so far. In addition, we perform non-linear optimisation of the pose by minimising the sum of squared reprojection errors over all inliers after LO-RANSAC. In both cases, Ceres [ceres-solver] is used to implement the optimisation. Based on preliminary experiments, both modifications significantly improve performance.
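A sketch of this refinement step is given below; our implementation uses Ceres, while this illustration substitutes SciPy's Levenberg-Marquardt and ignores radial distortion:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(rvec, tvec, K, pts_3d, pts_2d):
    """Non-linear pose refinement over RANSAC inliers by minimising the
    sum of squared reprojection errors.

    rvec, tvec: initial pose (axis-angle rotation, translation).
    pts_3d: (N, 3) inlier scene points; pts_2d: (N, 2) inlier pixels.
    """
    def residuals(params):
        R = Rotation.from_rotvec(params[:3]).as_matrix()
        p = K @ (pts_3d @ R.T + params[3:6]).T   # project all inliers
        proj = (p[:2] / p[2]).T
        return (proj - pts_2d).ravel()           # per-pixel reprojection residuals
    x0 = np.concatenate([rvec, tvec])
    res = least_squares(residuals, x0, method="lm")
    return res.x[:3], res.x[3:6]
```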
We set the inlier threshold for LO-RANSAC to 1% of the image diagonal (while we observed better results when tuning the threshold per scene, we want to avoid overfitting to the test set and thus use the same setting for all scenes) and use 10k visual words trained on an unrelated outdoor dataset for prioritization. For the SfM pGT, which provides an estimate of the radial distortion of the test images, we undistort the SIFT [Lowe04IJCV] feature positions in the test images before RANSAC-based pose estimation.
AS requires an SfM model of the scene for 2D-3D matching. We use COLMAP to build these models by triangulating the 3D structure of the scene from the known pGT poses of the training images. To establish the matches required for triangulation, we use COLMAP's image retrieval pipeline [Schoenberger2016ACCV] to match each training image against the top-100 retrieved other training images. In addition, we match each training image against each other training image with a pGT pose difference below 2m and 45°. For the original pGT of 7Scenes, we obtained better results by relaxing the thresholds COLMAP uses for triangulation. We account for the transformation between the depth and RGB cameras when building the SfM models for the original 7Scenes pGT (the SfM pGT directly provides poses for the RGB images, so it is not necessary to account for this transformation there).
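The pose-based pairing criterion reduces to a translation distance and a rotation angle between two pGT poses; a minimal sketch (assuming camera-to-world rotations R and camera centres t as inputs):

```python
import numpy as np

def pose_difference(R_a, t_a, R_b, t_b):
    """Translation distance [m] and rotation angle [deg] between two poses."""
    dt = np.linalg.norm(t_a - t_b)
    cos_angle = np.clip((np.trace(R_a.T @ R_b) - 1) / 2, -1, 1)
    return dt, np.degrees(np.arccos(cos_angle))

# pair two training images for matching if dt < 2.0 and angle < 45.0
```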
A.2 Hierarchical Localization (hLoc)
Similar to Active Search, hLoc [Sarlin2019CVPR, Sarlin2020CVPR] is based on local features. Whereas Active Search relies on SIFT [Lowe04IJCV], hLoc employs SuperPoint features [DeTone2018CVPRWorkshops], a modern learned alternative. Active Search directly matches features extracted from the test image against descriptors associated with the 3D points. In contrast, hLoc first employs an image retrieval stage to identify a set of training images that potentially show the same part of the scene as the test image. The features found in the test image are then only matched against the 3D points visible in the top-retrieved training images. For matching, the SuperGlue [Sarlin2020CVPR] approach is used to improve matching quality. The resulting 2D-3D matches are then used to estimate the camera pose by applying a P3P solver inside a RANSAC loop.
We use the source code made publicly available by the authors with their default settings. While the original publication describing the hierarchical localization pipeline [Sarlin2019CVPR] uses NetVLAD [Arandjelovic2016CVPR] descriptors, we use DenseVLAD [Torii2015CVPR] descriptors instead. DenseVLAD is a non-learned alternative to NetVLAD, where densely extracted RootSIFT [Arandjelovic2012CVPR] features are pooled into a VLAD [Jegou-CVPR10] descriptor. We chose DenseVLAD as it, in our experience, performs better than NetVLAD on the 7Scenes and 12Scenes datasets, and we use the top-20 retrieved images.
A.3 DenseVLAD+R2D2
DenseVLAD+R2D2 [HumenbergerX20Kapture] follows the workflow of image retrieval and structure-based methods: first, the most similar training images are retrieved using global image representations; second, these image pairs are used for local feature matching. As in our hLoc experiments (we use exactly the same retrieval results), we use DenseVLAD [Torii2015CVPR] features for image retrieval during localisation. For mapping, we use a list of matching training image pairs obtained by finding co-observations of reconstructed 3D points (using the AS map as basis). In contrast to Active Search and hLoc (which use SIFT and SuperPoint, respectively), we use R2D2 [revaud2019r2d2] features for local feature matching. DenseVLAD+R2D2 uses COLMAP both for 3D point triangulation of the map and for image registration using 2D-3D correspondences. The matches are obtained using nearest neighbours in descriptor space (L2-norm), cross-validation, and geometric verification.
Instead of triangulating keypoint matches using the camera poses, for DenseVLAD+R2D2 (+D), we construct the 3D map by projecting the keypoints to 3D space using the provided and registered [wolf2014CVIU] depth maps. For localization, we follow the same method as described above.
A.4 DSAC*
We use the public code of DSAC* [brachmann2020ARXIV] with default parameters. DSAC* supports different training modes utilising varying degrees of supervision. To achieve best results, we follow Brachmann and Rother [brachmann2020ARXIV] and initialize the DSAC* network using scene coordinate ground truth. Brachmann and Rother render ground truth scene coordinates using the 3D models of each scene provided with the 7Scenes and 12Scenes datasets, respectively. Next to the pseudo ground truth camera poses of these datasets, these 3D models are an additional output of (RGB-)D SLAM. Hence, using them would add an additional, non-trivial dependency of DSAC* training on the underlying reference algorithm of each dataset. To restrict the influence of the reference algorithm to the pGT poses alone, we train DSAC* using ground truth scene coordinates that we obtain from the measured depth map of each image: we back-project each depth map to 3D using the camera calibration parameters and transform the resulting points to scene space using the pGT pose. For 7Scenes, we manually register depth maps to RGB images using the calibration parameters provided by [wolf2014CVIU].
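A sketch of this ground-truth scene coordinate construction follows; the camera-to-world pose convention and the array shapes are assumptions of the sketch:

```python
import numpy as np

def scene_coordinates(depth, K, pose_cam_to_world):
    """(H, W, 3) scene coordinates from a measured depth map: back-project
    each pixel with the intrinsics K, then map to scene space with the
    pGT pose (assumed camera-to-world here)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    pts_cam = (np.linalg.inv(K) @ pix) * depth.reshape(-1)
    pts_hom = np.vstack([pts_cam, np.ones(pts_cam.shape[1])])
    coords = (pose_cam_to_world @ pts_hom)[:3].T.reshape(h, w, 3)
    coords[depth <= 0] = 0  # invalid depth -> no supervision for that pixel
    return coords
```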
Since the DSAC* code does not support a camera model with radial distortion, we instead undistort the RGB images using COLMAP before passing them to the DSAC* pipeline. We only do this for experiments with the SfM pGT, since the (RGB-)D SLAM pGT assumes zero radial distortion.
We follow Brachmann and Rother [brachmann2020ARXIV] and train DSAC* for 1.1M iterations (initialization + end-to-end). This took approximately 16 hours per scene on a GeForce RTX 2080 Ti. Compared to the results published in [brachmann2020ARXIV], we observe slightly reduced accuracy, e.g., 82.9% versus 85.2% for DSAC* (RGB) on 7Scenes (averaged over all scenes). We attribute this slight difference to our use of measured depth maps in the initialization training stage, which are noisier and contain holes as well as large areas of invalid depth compared to the rendered ground truth scene coordinates used in [brachmann2020ARXIV].
Appendix B Visual Comparisons of pGT
We plot the depth-based SLAM pGT versus the RGB-based SfM pGT for the Pumpkin scene of 7Scenes in Fig. 8. For this scene, we observe the largest visual drift between both versions of the pGT. We also show, for both versions of the pGT, the estimated camera trajectories of Active Search, DSAC* and DSAC* (+D), the top-performing methods depending on the pGT version. While the depth-based SLAM pGT of this scene seems to have defects that make it hard for all re-localisation methods to follow the ground truth trajectory, results look smoother for the SfM pGT. Still, both DSAC* re-localisers fail to follow the SfM pGT exactly, exhibiting small, consistent offsets to the pseudo ground truth trajectory. We observe similar, yet less pronounced, patterns for the other scenes of 7Scenes and the scenes of 12Scenes.
Appendix C Quantitative Comparisons of pGT
Fig. 4 shows cumulative distributions over 3D alignment statistics for the 7Scenes and 12Scenes datasets (see Sec. 4 for details). While Fig. 4 shows average statistics over all scenes in each dataset and one selected scene per dataset, here we show the distributions for all scenes of the two datasets.
Fig. 9 shows the cumulative distributions for all scenes of the 7Scenes dataset. As can be seen, the original (RGB-)D SLAM pGT results in a more accurate alignment for most scenes compared to the SfM pGT. For the Red Kitchen and Stairs scenes, there is little difference between the two versions of the pGT and the SfM pGT produces a (slightly) more accurate alignment for the test/train pairs.
Similarly, Fig. 10 shows the cumulative distributions for all scenes of the 12Scenes dataset. Again, we observe that the original (RGB-)D SLAM pGT results in more accurate 3D alignments compared to the SfM pGT. However, for most scenes, the difference between both versions of the pGT is smaller than for the 7Scenes dataset.
Appendix D Visual Re-Localization Evaluation
As an extension to Fig. 7, we show cumulative pose error plots and cumulative DCRE [Wald2020ECCV] error plots for all scenes of 7Scenes and 12Scenes, separately. See Fig. 11, Fig. 12 and Fig. 13 for pose error plots, max. DCRE error plots and mean DCRE error plots, respectively, for 7Scenes. We show the corresponding plots for 12Scenes in Fig. 14, Fig. 15 and Fig. 16.