Robust Image Retrieval-based Visual Localization using Kapture

07/27/2020 ∙ by Martin Humenberger, et al. ∙ NAVER LABS Corp.

In this paper, we present a versatile method for visual localization. It is based on robust image retrieval for coarse camera pose estimation and robust local features for accurate pose refinement. Our method is top ranked on various public datasets, demonstrating its ability to generalize and its wide range of applications. To facilitate experiments, we introduce kapture, a flexible data format and processing pipeline for structure from motion and visual localization, which is released open source. We furthermore provide all datasets used in this paper in the kapture format to facilitate research and data processing. The code can be found at, the datasets as well as more information, updates, and news can be found at




1 Introduction

Visual localization

The goal of visual localization is to accurately estimate the position and orientation of a camera using its images. In detail, correspondences between a representation of the environment (map) and query images are utilized to estimate the camera pose in 6 degrees of freedom (DOF). The representation of the environment can be a structure from motion (SFM) reconstruction [34, 43, 13, 40, 26], a database of images [45, 44, 33], or even a CNN [16, 20, 5, 39]. Structure-based methods [34, 24, 36, 22, 44, 40] use local features to establish correspondences between 2D query images and 3D reconstructions. These correspondences are then used to compute the camera pose using perspective-n-point (PNP) solvers [17] within a RANSAC loop [12, 7, 21]. To reduce the search range in large 3D reconstructions, image retrieval methods can be used to first retrieve the most relevant images from the SFM model; local correspondences are then established in the area defined by those images. Scene point regression methods [42, 6] establish the 2D-3D correspondences using a deep neural network (DNN), and absolute pose regression methods [16, 20, 5, 39] directly estimate the camera pose with a DNN. Furthermore, objects can also be used for visual localization, as proposed in [49, 32, 8, 3].


Since visual localization requires establishing correspondences between the map and the query image, environmental changes present critical challenges. Such changes can be caused by the time of day or the season of the year, but structural changes to building facades or store fronts are also possible. Furthermore, the query images can be taken from significantly different viewpoints than the images used to create the map.

Long term visual localization

To overcome these challenges, researchers have proposed various ways to increase the robustness of visual localization methods. Most relevant to our work are data-driven local [25, 10, 11, 30, 9] and global [1, 28, 29] features. Instead of manually describing what keypoints or image descriptions should look like, a large amount of data is used to train an algorithm to make this decision by itself. Recent advances in the field have shown great results on tasks like image matching [27] and visual localization [33, 11, 31]. Sattler et al. [37] provide an online benchmark which consists of several datasets covering a variety of the mentioned challenges.

In this paper, we present a robust image retrieval-based visual localization method. Extensive evaluations show that it achieves top results on various public datasets, which highlights its versatility. We implemented our algorithm using our newly proposed data format and toolbox named kapture. The code is open source, and all datasets from the website mentioned above are provided in this format.

Figure 1: Overview of the structure from motion (SFM) reconstruction of the map from a set of training (mapping) images. Photos: Sceaux Castle image dataset.

2 Visual Localization Method

As a reminder, visual localization is the problem of estimating the 6DOF pose of a camera within a known 3D space representation using query images. There are several ways to tackle this problem, including structure-based methods [34, 24, 36, 22, 44, 40], pose regression [16, 20, 5, 39] and scene point regression [42, 6] methods, or image retrieval-based methods [47, 51, 45]. Our approach follows the workflow of image retrieval as well as structure-based methods and combines functionality provided by the COLMAP SFM library [40] with our local features R2D2 [30] and our global image representation APGeM [29]. The method consists of two main components: the SFM-based mapping pipeline (shown in Figure 1) and the localization (image registration) pipeline (shown in Figure 2).



SFM is one of the most popular strategies for reconstructing a 3D scene from un-ordered photo collections [43, 13, 40, 26]. The main idea is to establish 2D-2D correspondences between local image features (keypoints) of mapping (also referred to as training) image pairs, followed by geometric verification to remove outliers. By exploiting transitivity, observations of a keypoint can be found in several images, allowing relative pose estimation to initialize the reconstruction, followed by 3D point triangulation [15] and image registration for accurate 6DOF camera pose estimation. RANSAC [12, 7, 21] can be used to increase the robustness of several steps in this pipeline, and bundle adjustment [48] can be used for global (and local) optimization of the model (3D points and camera poses). Since the camera poses of the training images are known for all datasets used in this paper, our mapping pipeline can skip this step. For geometric verification of the matches and triangulation of the 3D points, we used COLMAP. Figure 1 illustrates our mapping workflow.

Figure 2: Overview of the localization pipeline which registers query images in the SFM map. Photos: Sceaux Castle image dataset.


Similarly to the reconstruction step, 2D-2D local feature correspondences are established between a query image and the database images used to generate the map. In order to only match relevant images, we use image retrieval to obtain the 20 most similar images from the database. Since many keypoints from the database images correspond to 3D points of the map, 2D-3D correspondences between query image and map can be established. These 2D-3D matches are then used to compute the 6DOF camera pose by solving a PNP problem [17, 18, 19] robustly inside a RANSAC loop [12, 7, 21]. We again used COLMAP for geometric verification and image registration.
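The lifting of 2D-2D keypoint matches to 2D-3D correspondences can be sketched as follows. This is a simplified, hypothetical helper, not the actual kapture/COLMAP implementation; `matches_per_db_image` and `observations` are assumed data structures.

```python
# Illustrative sketch (not the kapture API): lift 2D-2D keypoint matches
# between a query image and its retrieved database images to 2D-3D matches,
# using the map's keypoint-to-3D-point observations.

def lift_matches_to_3d(matches_per_db_image, observations):
    """matches_per_db_image: {db_image: [(query_kp_idx, db_kp_idx), ...]}
    observations: {(db_image, db_kp_idx): point3d_id}
    Returns a list of (query_kp_idx, point3d_id) correspondences."""
    corr = []
    seen = set()
    for db_image, matches in matches_per_db_image.items():
        for q_idx, db_idx in matches:
            p3d = observations.get((db_image, db_idx))
            # keep only keypoints that observe a triangulated 3D point,
            # and avoid duplicating the same (query kp, 3D point) pair
            if p3d is not None and (q_idx, p3d) not in seen:
                seen.add((q_idx, p3d))
                corr.append((q_idx, p3d))
    return corr
```

The resulting correspondences are what a PNP solver inside a RANSAC loop consumes to estimate the 6DOF pose.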

Local descriptors

Both pipelines (mapping and localization) heavily rely on local image descriptors and matches. Early methods used handcrafted local feature extractors, notably the popular SIFT descriptor [23] (as used in COLMAP). However, those keypoint extractors and descriptors have several limitations, including the fact that they are not necessarily tailored to the target task. Therefore, several data-driven learned representations have been proposed recently, including local features learned with end-to-end deep architectures (see the evolution of local features in [9, 41]).

Our method uses R2D2 [30], a sparse keypoint extractor that jointly performs detection and description but separately estimates keypoint reliability and keypoint repeatability. Keypoints with high likelihoods for both aspects are chosen, which improves the overall feature matching pipeline. To learn reliability, R2D2 uses a list-wise loss that directly maximizes the average precision. Since a very large number of image patches (of which only one is correct) is used per batch, the resulting reliability is well suited to the task of matching. Since reliability and the patch descriptor are related, the R2D2 descriptor is extracted from the reliability network. The R2D2 model was trained with synthetic image pairs generated by known transformations (homographies), providing exact pixel matches, as well as with optical flow data from real image pairs. See Section 4 for details about the model.

Image retrieval

In principle, mapping and localization can be done by considering all possible image pairs. However, this approach does not scale to visual localization in real-world applications, where localization might need to be done in large-scale environments such as big buildings or even entire cities. To make visual localization scalable, image retrieval plays an important role: on the one hand, it makes mapping more efficient; on the other hand, it increases the robustness and efficiency of the localization step [14, 35, 44]. This is achieved in two steps. First, the global descriptors are matched in order to find the most similar images, which form image pairs (e.g. reference-reference for mapping and query-reference for localization). Second, these image pairs are used to establish the local keypoint matches.
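The first step can be sketched as follows, a minimal example assuming global descriptors stored as NumPy arrays; `top_k_pairs` is a hypothetical helper, not part of the kapture API.

```python
import numpy as np

def top_k_pairs(query_descs, db_descs, k=20):
    """Return, for each query image, the indices of the k most similar
    database images. Descriptors are L2-normalized so that the dot
    product equals cosine similarity."""
    q = query_descs / np.linalg.norm(query_descs, axis=1, keepdims=True)
    d = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sim = q @ d.T                       # (num_queries, num_db) similarities
    # argsort ascending, keep the last k columns, reverse for descending order
    return np.argsort(sim, axis=1)[:, -k:][:, ::-1]
```

For mapping, the same routine is applied with the database descriptors on both sides (excluding self-matches) to form reference-reference pairs.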

Localization approaches based on image retrieval typically use retrieval representations designed for geo-localization [1, 46, 2]. However, our initial experiments did not show these features to be superior to our off-the-shelf deep visual representation Resnet101-AP-GeM [29]. Note that our model was trained for the landmark retrieval task on the Google Landmarks (GLD) dataset [25]. The model uses a generalized mean-pooling (GeM) layer [28] to aggregate the feature maps into a compact, fixed-length representation, which is learned by directly optimizing the mean average precision (mAP).
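For illustration, GeM pooling reduces a C×H×W feature map to a C-dimensional descriptor via a power mean. This is a minimal NumPy sketch with a fixed exponent p; in the actual model, p is a learned parameter and the pooled vector is further whitened and L2-normalized.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean (GeM) pooling of a CNN feature map.
    feature_map: (C, H, W) array; returns a (C,) descriptor.
    p = 1 gives average pooling; p -> infinity approaches max pooling."""
    x = np.clip(feature_map, eps, None)  # avoid zeros before the root
    return (x.reshape(x.shape[0], -1) ** p).mean(axis=1) ** (1.0 / p)
```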

3 Kapture

3.1 Kapture format and toolbox

When running a visual localization pipeline on several datasets, one of the operational difficulties is converting those datasets into a format that the algorithm and all the tools used can handle. Many formats already exist, notably those of Bundler, VisualSFM, OpenMVG, OpenSfM, and COLMAP, but none met all our requirements. In particular, we needed a format that could handle timestamps, shared camera parameters, and multi-camera rigs, but also reconstruction data (keypoints, descriptors, global features, 3D points, matches, etc.), and that would be flexible and easy to use for localization experiments. Furthermore, it should be easy to convert the data into other formats supported by major open source projects such as OpenMVG and COLMAP.

Inspired by the mentioned open source libraries, kapture started as a pure data format that provided a good representation of all the information we needed. It then grew into a Python toolbox and library for data manipulation (conversion between various popular formats, dataset merging/splitting, trajectory visualization, etc.), and finally it became the basis for our mapping and localization pipeline. More precisely, the kapture format can be used to store sensor data: images, camera parameters, camera rigs, and trajectories, but also other sensor data like lidar or wifi records. It can also be used to store reconstruction data, in particular local descriptors, keypoints, global features, 3D points, observations, and matches.

We believe that the kapture format and tools could be useful to the community, so we release them as open source. We also provide major public datasets of the domain in this format to facilitate future experiments for everybody.

3.2 Kapture pipeline

We implemented our visual localization method, described in Section 2, on top of the kapture tools and libraries. In particular, the mapping pipeline consists of the following steps:

  1. Extraction of local descriptors and keypoints (e.g. R2D2) of training images

  2. Extraction of global features (e.g. APGeM) of training images

  3. Computation of training image pairs using image retrieval based on global features

  4. Computation of local descriptor matches between these image pairs

  5. Geometric verification of the matches and point triangulation with COLMAP

The localization steps are similar:

  1. Extraction of local and global features of query images

  2. Retrieval of similar images from the training images

  3. Local descriptor matching

  4. Geometric verification of the matches and camera pose estimation with COLMAP
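The two step lists above can be summarized by the following sketch, where the step functions (feature extraction, retrieval, matching, and the COLMAP calls) are injected as placeholders; this is a hypothetical orchestration for illustration, not the actual kapture implementation.

```python
# Hypothetical orchestration of the mapping and localization pipelines
# described above. The step functions are injected so the sketch stays
# independent of the actual R2D2 / APGeM / COLMAP wrappers.

def run_mapping(images, extract_local, extract_global, retrieve_pairs,
                match_local, triangulate):
    local = {im: extract_local(im) for im in images}       # step 1: keypoints
    global_ = {im: extract_global(im) for im in images}    # step 2: global feats
    pairs = retrieve_pairs(global_, global_)               # step 3: image pairs
    matches = {p: match_local(local[p[0]], local[p[1]])    # step 4: matching
               for p in pairs}
    return triangulate(matches)                            # step 5: COLMAP

def run_localization(query, map_, extract_local, extract_global,
                     retrieve_pairs, match_local, register):
    local = extract_local(query)                           # step 1
    pairs = retrieve_pairs({query: extract_global(query)},
                           map_["global"])                 # step 2: retrieval
    matches = {p: match_local(local, map_["local"][p[1]])  # step 3
               for p in pairs}
    return register(matches, map_)                         # step 4: COLMAP
```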

4 Evaluation

For the evaluation of our method, we chose the datasets provided by the online visual localization benchmark introduced in [37]. Each of these datasets is split into a training (mapping) and a test set. The training data, which consists of images, corresponding poses in the world frame, as well as intrinsic camera parameters, is used to construct the map, while the test data is used to evaluate the precision of the localization method. Intrinsic parameters of the test images are not always provided.

We converted all datasets to kapture, which provided an easy way to evaluate our method on a variety of datasets. We used the publicly available models for R2D2 (r2d2_WASF_N8_big) and APGeM (Resnet101-AP-GeM-LM18) for all datasets and evaluations. If not indicated differently, we used the top 20k keypoints extracted with R2D2.


We experimented with three COLMAP parameter settings which are presented in Table 1. For map generation, we always used config1.

COLMAP image_registrator config1 config2 config3
Mapper.ba_refine_focal_length 0 0 1
Mapper.ba_refine_principal_point 0 0 0
Mapper.ba_refine_extra_params 0 0 1
Mapper.min_num_matches 15 4 4
Mapper.init_min_num_inliers 100 4 4
Mapper.abs_pose_min_num_inliers 30 4 4
Mapper.abs_pose_min_inlier_ratio 0.25 0.05 0.05
Mapper.ba_local_max_num_iterations 25 50 50
Mapper.abs_pose_max_error 12 20 20
Mapper.filter_max_reproj_error 4 12 12
Table 1: Parameter configurations.


All datasets used are divided into different conditions. These conditions can be different times of day, differences in weather such as snow, or even different buildings or locations within the dataset. In order to report localization results, we used the online benchmark, which computes the percentage of query images that were localized within three pairs of translation and rotation thresholds.
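The benchmark's error measures and threshold percentages can be computed as follows. This is a sketch assuming camera-from-world poses given as a rotation matrix and translation vector; the exact conventions of the benchmark may differ.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Translation error in meters and rotation error in degrees.
    Poses are camera-from-world: x_cam = R x_world + t."""
    # camera centers c = -R^T t; position error is measured between centers
    c_est = -R_est.T @ t_est
    c_gt = -R_gt.T @ t_gt
    t_err = np.linalg.norm(c_est - c_gt)
    cos = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return t_err, r_err

def localized_percentage(errors, thresholds=((0.25, 2), (0.5, 5), (5, 10))):
    """errors: list of (t_err_m, r_err_deg); returns one percentage per
    (translation, rotation) threshold pair."""
    n = len(errors)
    return [100.0 * sum(t <= tm and r <= rd for t, r in errors) / n
            for tm, rd in thresholds]
```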

4.1 Aachen Day-Night

The Aachen Day-Night dataset [37, 38] represents an outdoor handheld camera localization scenario where all query images are taken individually, with large changes in viewpoint and scale, but also between daytime and nighttime. In detail, the query images are divided into the classes day and night, and the two classes are evaluated separately. We evaluated our method in two settings: (i) we used the full dataset to construct a single map using the provided reference poses and localized all query images within this map, and (ii) we used the pairs provided for the local features evaluation task on the online benchmark, which cover nighttime images only. Table 2 presents the results.

setting day night
full (config2) 88.7 / 95.8 / 98.8 44.9 / 62.2 / 85.7
pairs (config1) - 48.0 / 67.3 / 88.8
Table 2: Results on Aachen Day-Night. In pairs we used the top 40k R2D2 keypoints. Day: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°), Night: (0.5m, 2°) / (1m, 5°) / (5m, 10°)

4.2 InLoc

InLoc [44, 50] is a large indoor dataset for visual localization. It also represents a handheld camera scenario, with large viewpoint changes, occlusions, moving people, and even changes in furniture. Contrary to the other datasets, InLoc also provides 3D scan data, i.e. 3D point clouds for each training image. However, since the overlap between the training images is quite small, the resulting structure from motion models are sparse and, in our experience, not suitable for visual localization. Furthermore, the InLoc environment is very challenging for global and local features because it contains large textureless and many repetitive areas. To overcome these problems, the original InLoc localization method [44] introduced various dense matching and pose verification methods which make use of the provided 3D data.


Even though impressive results were achieved, we did not follow the InLoc method, since we did not want to change the core of our method for a specific dataset. Instead, we constructed our SFM map using the provided 3D data and camera poses, which differs from the mapping described in Section 2. First, we assign a 3D point to each local feature in the training images. Second, we generate matches based on the 3D points. In detail, we look for local features which are projections of the same 3D point in different images. To decide whether or not a 3D point is the same for different keypoints, we use a Euclidean distance threshold (5mm, 1mm, and 0.5mm). This results in a very dense 3D map (Figure 3) where each 3D point is associated with a local descriptor and can thus be used in our method.
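The 3D-distance-based matching step can be sketched as follows. This is a brute-force illustration, assuming each keypoint has already been assigned a 3D position; a real implementation would likely use spatial indexing for efficiency.

```python
import numpy as np

def matches_from_3d(points_a, points_b, threshold=0.005):
    """Match keypoints of two training images by the Euclidean distance of
    their assigned 3D points (threshold in meters, e.g. 0.005 = 5 mm).
    points_a: (N, 3) array of 3D positions of image A's keypoints.
    points_b: (M, 3) array for image B.
    Returns (i, j) index pairs treated as observations of the same point."""
    # pairwise distances between all 3D point assignments (brute force)
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    i, j = np.where(dist < threshold)
    return list(zip(i.tolist(), j.tolist()))
```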


We ran the localization pipeline (Figure 2) for all provided query images. Table 3 presents the results.

Figure 3: InLoc map generated by assigning a 3D point to each R2D2 feature in the training images (viewed in COLMAP).
setting DUC1 DUC2
config2, 5mm 24.7 / 38.4 / 52.5 22.1 / 41.2 / 51.1
config2, 1mm 21.7 / 37.4 / 54.5 23.7 / 41.2 / 54.2
config2, 0.5mm 28.8 / 40.4 / 60.6 25.2 / 44.3 / 54.2
Table 3: Results on InLoc using different 3D point distance thresholds for mapping. (0.25m, 10°) / (0.5m, 10°) / (5m, 10°)

4.3 RobotCar Seasons

RobotCar Seasons [37] is an outdoor dataset captured in the city of Oxford at various periods of a year and in different conditions (rain, night, dusk, etc.). The images are taken from a car with a synchronized three-camera rig pointing in three directions (rear, left, and right). The data was captured at 49 different non-overlapping locations, and several 3D models are provided. Training images were captured under a reference condition (overcast-reference), while test images were captured under different conditions. For each test image, the dataset provides its condition, the location where it was captured (one of the 49 locations used in the training data), its timestamp, and the camera name.


Since the different locations do not overlap, there is no benefit in building a single map. For our experiments, we used the individual models provided in the COLMAP format for each of the 49 locations. We converted the COLMAP files into the kapture format to recover the trajectories (poses and timestamps) and created 49 individual maps using our mapping pipeline (Figure 1). For this step, we used the provided camera parameters (pinhole model) and considered each camera independently, without using the rig information.


Since the location within the dataset is given for each query image, we can use it directly during localization. Otherwise, we would have first selected the correct map, e.g. by using image retrieval. We tested both COLMAP config1 and config2.

For the images that could not be localized, we ran two additional post-processing steps. First, we leveraged the fact that images are captured synchronously with a rig of three cameras for which the calibration parameters are provided. Hence, if one image taken at a specific timestamp is localized, we can use the provided extrinsic camera parameters to compute the poses of all images of the rig (even if they were not successfully localized). We used this technique to find the missing poses for all images to which it could be applied.
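The rig-based pose propagation can be sketched with 4x4 homogeneous transforms. This is a minimal example; `T_x_from_y` denotes a transform mapping points from frame y to frame x, which is an assumed convention, not necessarily the one used by the dataset.

```python
import numpy as np

def propagate_rig_pose(T_world_from_a, T_rig_from_a, T_rig_from_b):
    """Given the localized pose of camera A and the rig calibration
    (camera-to-rig transforms as 4x4 homogeneous matrices), compute the
    pose of camera B taken at the same timestamp.
    Chain: world <- camera A <- rig <- camera B."""
    return T_world_from_a @ np.linalg.inv(T_rig_from_a) @ T_rig_from_b
```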

However, there are still timestamps for which no pose was found for any of the three cameras. In this case, we leverage the fact that query images are given in sequences (of 6 to 12 images in most cases). Sequences can be found using the image timestamps: when the gap between two successive timestamps is too large (i.e. above a certain threshold), we start a new sequence. Once the sequences are defined, we look for non-localized image triplets in these sequences and estimate their poses by linear interpolation between the two closest successfully localized images. If this is not possible, we use the closest available pose. Note that for real-world applications, we could either only consider images from the past or introduce a small latency if images from both directions (before and after) are used. These steps increase the percentage of localized images to 97.2%. Table 4 presents the results of the configurations tested. Interestingly, even though config2 could localize all images and config1 only 90%, applying the rig and sequence information to config1 led to overall better results.

setting day night
config2 55.2 / 82.0 / 97.1 28.1 / 59.0 / 82.7
config1 55.1 / 82.1 / 96.9 26.9 / 55.6 / 78.4
config1 + rig 55.1 / 82.1 / 97.2 28.7 / 58.3 / 83.4
config1 + rig + seq 55.1 / 82.1 / 97.3 28.8 / 58.8 / 89.4
Table 4: Results on RobotCar Seasons. Thresholds: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°)
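The sequence-based interpolation described in this section can be sketched as follows, assuming poses are given as a quaternion plus a position; the actual implementation details may differ.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    dot = np.dot(q0, q1)
    if dot < 0.0:              # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:           # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_pose(t0, pose0, t1, pose1, t):
    """Interpolate a pose at timestamp t between two localized poses.
    pose = (quaternion (w, x, y, z), position (3,)): linear interpolation
    for the position, slerp for the orientation."""
    alpha = (t - t0) / (t1 - t0)
    q = slerp(np.asarray(pose0[0], float), np.asarray(pose1[0], float), alpha)
    p = ((1 - alpha) * np.asarray(pose0[1], float)
         + alpha * np.asarray(pose1[1], float))
    return q, p
```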

4.4 Extended CMU-Seasons

The Extended CMU-Seasons dataset [37, 4] is an autonomous driving dataset that contains sequences from urban, suburban, and park environments. The images were recorded in the area of Pittsburgh, USA over a period of one year and thus contain different conditions (foliage/mixed foliage/no foliage, overcast, sunny, low sun, cloudy, snow). The training and query images were captured by two front-facing cameras mounted on a car, pointing to the left and right of the vehicle at approximately 45 degrees with respect to the longitudinal axis. The cameras are not synchronized. This dataset is also split into multiple locations. Unlike in RobotCar Seasons, there is some overlap between them, which we did not leverage.


For our experiments, we used the individual models for each location. We converted the ground-truth-database-images-sliceX.txt files into the kapture format to recover the trajectories (poses and timestamps). We then created 14 individual maps (for the slices that were provided with queries: 2-6/13-21) using the pipeline described above. For this step, we used the provided camera parameters (OpenCV pinhole camera) and considered each camera independently, without using the rig information.


We ran the localization pipeline described above on all images listed in the test-images-sliceX.txt files with config1. We then ran two post-processing steps: rig and sequence. For rig, we first estimated a rig configuration from the slice2 training poses. For all images that failed to localize, we computed the position using this rig if the image from the other camera with the closest timestamp was successfully localized. Finally, we applied the same sequence post-processing that is described in the RobotCar Seasons section. Table 5 presents the results on this dataset and the improvements we get from each of the post-processing steps.

setting urban suburban park
config2 95.9 / 98.1 / 98.9 89.5 / 92.1 / 95.2 78.3 / 82.0 / 86.4
config1 95.8 / 98.1 / 98.8 88.9 / 91.1 / 93.4 75.5 / 78.4 / 82.0
config1 + rig 96.5 / 98.8 / 99.5 94.3 / 96.7 / 99.1 83.1 / 87.9 / 92.8
config1 + rig + seq 96.7 / 98.9 / 99.7 94.4 / 96.8 / 99.2 83.6 / 89.0 / 95.5
Table 5: Results on Extended CMU-Seasons. All conditions: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°)

4.5 SILDa Weather and Time of Day

SILDa Weather and Time of Day is an outdoor dataset captured over a period of 12 months under various conditions (clear, snow, rain, noon, dusk, night), covering 1.2km of streets around Imperial College in London. It was captured using a camera rig composed of two back-to-back wide-angle fisheye lenses. The geometry of the rig as well as the hardware synchronization of the acquisition could be leveraged, e.g. to reconstruct spherical images.


The dataset provides camera parameters corresponding to a fisheye model that is not available in COLMAP. For the sake of simplicity, we chose to estimate the parameters of both cameras using a camera model supported by COLMAP, namely the FOV model (we still used the provided estimate of the principal point).


Similarly to the RobotCar Seasons dataset, we leveraged the image sequences and the camera rig configuration to estimate camera poses for images which could not be localized. As the rig geometry is not given for SILDa, we estimated an approximation. Table 6 presents the results of the configurations used. As can be seen, leveraging the sequences did not improve the results.

setting evening snow night
config1 31.8 / 66.3 / 89.4 0.3 / 3.9 / 64.9 30.0 / 53.4 / 77.5
config1 + rig 31.9 / 66.6 / 92.5 0.5 / 5.8 / 89.2 30.5 / 54.2 / 78.5
config1 + rig + seq 31.9 / 66.6 / 92.5 0.5 / 5.8 / 89.2 30.5 / 54.2 / 78.5
Table 6: Results on SILDa. Thresholds: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°)

5 Conclusion and Future Work

We presented a versatile method for visual localization based on robust global features for coarse localization using image retrieval and robust local features for accurate pose computation. We evaluated our method on multiple datasets covering a large variety of application scenarios and challenging situations. Our method ranks among the best methods on the online visual localization benchmark. We implemented our method in Python and ran the experiments using kapture, a unified SFM and localization data format which we released open source. Since all datasets will be made available in this format, we hope to facilitate future large-scale visual localization and structure from motion experiments using a multitude of datasets.


  • [1] R. Arandjelović, P. Gronát, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In CVPR, Cited by: §1, §2.
  • [2] R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla (2014) Dislocation: Scalable Descriptor Distinctiveness for Location Recognition. In ACCV, Cited by: §2.
  • [3] S. Ardeshir, A. R. Zamir, A. Torroella, and M. Shah (2014) GIS-assisted object detection and geospatial localization. In European Conference on Computer Vision, pp. 602–617. Cited by: §1.
  • [4] H. Badino, D. Huber, and T. Kanade (2011) The CMU Visual Localization Data Set. Cited by: §4.4.
  • [5] V. Balntas, S. Li, and V. Prisacariu (2018) RelocNet: Continuous Metric Learning Relocalisation Using Neural Nets. In ECCV, Cited by: §1, §2.
  • [6] E. Brachmann and C. Rother (2018) Learning Less Is More - 6D Camera Localization via 3D Surface Regression. In CVPR, Cited by: §1, §2.
  • [7] O. Chum and J. Matas (2008) Optimal randomized ransac. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8), pp. 1472–1482. Cited by: §1, §2, §2.
  • [8] A. Cohen, J. L. Schönberger, P. Speciale, T. Sattler, J. Frahm, and M. Pollefeys (2016) Indoor-outdoor 3d reconstruction alignment. In European Conference on Computer Vision, pp. 285–300. Cited by: §1.
  • [9] G. Csurka, C. R. Dance, and M. Humenberger (2018) From Handcrafted to Deep Local Invariant Features. arXiv 1807.10254. Cited by: §1, §2.
  • [10] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: Self-supervised Interest Point Detection and Description. In CVPR, Cited by: §1.
  • [11] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-Net: a Trainable CNN for Joint Description and Detection of Local Features. In CVPR, Cited by: §1.
  • [12] M. Fischler and R. Bolles (1981) Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24, pp. 381–395. Cited by: §1, §2, §2.
  • [13] J. Heinly, J. L. Schönberger, E. Dunn, and J. M. Frahm (2015) Reconstructing the world* in six days. In CVPR, Cited by: §1, §2.
  • [14] A. Irschara, C. Zach, J. Frahm, and H. Bischof (2009) From Structure-from-Motion Point Clouds to Fast Location Recognition. In CVPR, Cited by: §2.
  • [15] L. Kang, L. Wu, and Y. Yang (2014) Robust multi-view L2 triangulation via optimal inlier selection and 3D structure refinement. PR 47 (9), pp. 2974–2992. Cited by: §2.
  • [16] A. Kendall, M. Grimes, and R. Cipolla PoseNet: a Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, Cited by: §1, §2.
  • [17] L. Kneip, D. Scaramuzza, and R. Siegwart (2011) A Novel Parametrization of the Perspective-three-point Problem for a Direct Computation of Absolute Camera Position and Orientation. In CVPR, Cited by: §1, §2.
  • [18] Z. Kukelova, M. Bujnak, and T. Pajdla (2013) Real-Time Solution to the Absolute Pose Problem with Unknown Radial Distortion and Focal Length. In ICCV, Cited by: §2.
  • [19] V. Larsson, Z. Kukelova, and Y. Zheng (2017) Making Minimal Solvers for Absolute Pose Estimation Compact and Robust. In ICCV, Cited by: §2.
  • [20] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala (2017) Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. In ICCV Workshops, Cited by: §1, §2.
  • [21] K. Lebeda, J. E. S. Matas, and O. Chum (2012) Fixing the Locally Optimized RANSAC. In BMVC, Cited by: §1, §2, §2.
  • [22] L. Liu, H. Li, and Y. Dai (2017) Efficient Global 2D-3D Matching for Camera Localization in a Large-Scale 3D Map. In ICCV, Cited by: §1, §2.
  • [23] D. G. Lowe (2004) Distinctive Image Features from Scale-invariant Keypoints. IJCV 60 (2), pp. 91–110. Cited by: §2.
  • [24] P. Moulon, P. Monasse, and R. Marlet (2013) Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion. In ICCV, Cited by: §1, §2.
  • [25] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han (2017) Large-Scale Image Retrieval with Attentive Deep Local Features. In ICCV, Cited by: §1, §2.
  • [26] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer (2017) A survey of structure from motion.. Acta Numerica 26, pp. 305–364. External Links: Document Cited by: §1, §2.
  • [27] F. Radenović, A. Iscen, G. Tolias, and O. Avrithis (2018) Revisiting Oxford and Paris: Large-scale Image Retrieval Benchmarking. In CVPR, Cited by: §1.
  • [28] F. Radenović, G. Tolias, and O. Chum (2019) Fine-Tuning CNN Image Retrieval with no Human Annotation. PAMI 41 (7), pp. 1655–1668. Cited by: §1, §2.
  • [29] J. Revaud, J. Almazan, R. S. de Rezende, and C. R. de Souza (2019) Learning with Average Precision: Training Image Retrieval with a Listwise Loss. In ICCV, Cited by: §1, §2, §2.
  • [30] J. Revaud, P. Weinzaepfel, C. De Souza, and M. Humenberger (2019) R2D2: Reliable and Repeatable Detectors and Descriptors. In NeurIPS, Cited by: §1, §2, §2.
  • [31] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger (2019) R2D2: Reliable and Repeatable Detectors and Descriptors for Joint Sparse Keypoint Detection and Local Feature Extraction. CoRR (arXiv:1906.06195). Cited by: §1.
  • [32] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison (2013) SLAM++: Simultaneous Localisation and Mapping at the Level of Objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359. Cited by: §1.
  • [33] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, Cited by: §1, §1.
  • [34] T. Sattler, B. Leibe, and L. Kobbelt (2011) Fast image-based localization using direct 2d-to-3d matching. In ICCV, Cited by: §1, §2.
  • [35] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys (2015) Hyperpoints and fine vocabularies for large-scale location recognition. In ICCV, Cited by: §2.
  • [36] T. Sattler, B. Leibe, and L. Kobbelt (2017) Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization. PAMI 39 (9), pp. 1744–1756. Cited by: §1, §2.
  • [37] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla (2018) Benchmarking 6DoF Outdoor Visual Localization in Changing Conditions. In CVPR, Cited by: §1, §4.1, §4.3, §4.4, §4.
  • [38] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt (2012) Image Retrieval for Image-Based Localization Revisited. In British Machine Vision Conference (BMVC), Cited by: §4.1.
  • [39] T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taixé (2019) Understanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, Cited by: §1, §2.
  • [40] J. L. Schönberger and J. Frahm (2016) Structure-from-motion Revisited. In CVPR, Cited by: §1, §2, §2.
  • [41] J. L. Schönberger, H. Hardmeier, T. Sattler, and M. Pollefeys (2017) Comparative Evaluation of Hand-Crafted and Learned Local Features. In CVPR, Cited by: §2.
  • [42] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon (2013) Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, Cited by: §1, §2.
  • [43] N. Snavely, S.M. Seitz, and R. Szeliski (2008) Modeling the World from Internet Photo Collections. IJCV 80 (2), pp. 189–210. Cited by: §1, §2.
  • [44] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii (2019) InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. PAMI (), pp. 1–1. Cited by: §1, §2, §2, §4.2.
  • [45] A. Torii, H. Taira, J. Sivic, M. Pollefeys, M. Okutomi, T. Pajdla, and T. Sattler (2019) Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?. PAMI (), pp. 1–1. Cited by: §1, §2.
  • [46] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla (2015) 24/7 Place Recognition by View Synthesis. In CVPR, Cited by: §2.
  • [47] A. Torii, J. Sivic, and T. Pajdla (2011) Visual Localization by Linear Combination of Image Descriptors. In ICCV-W, Cited by: §2.
  • [48] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon (1999) Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pp. 298–372. Cited by: §2.
  • [49] P. Weinzaepfel, G. Csurka, Y. Cabon, and M. Humenberger (2019-06) Visual localization by learning objects-of-interest dense match regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [50] E. Wijmans and Y. Furukawa (2017) Exploiting 2d floorplan for building-scale panorama rgbd alignment. In Computer Vision and Pattern Recognition, CVPR, External Links: Link Cited by: §4.2.
  • [51] A. R. Zamir and M. Shah (2010) Accurate Image Localization Based on Google Maps Street View. In ECCV, Cited by: §2.