In this paper we consider the problem of estimating the full 6DOF camera pose of a query image with respect to a large-scale 3D model, such as those obtained from a Structure-from-Motion (SfM) pipeline [27, 34, 16, 25]. A typical approach is to detect distinctive 2D feature points in the query image and perform correspondence search against feature descriptors associated with 3D points from the SfM reconstruction. This initial matching is performed in descriptor space (e.g., SIFT or SURF) using an approximate k-nearest-neighbor search implementation [17, 18]. Candidate 2D-3D correspondences are then filtered using robust fitting techniques (e.g., RANSAC variants [10, 32, 15]) to identify inliers, and the final camera pose is estimated using an algebraic PnP solver followed by non-linear refinement. Camera pose estimation is a fundamental building block in many computer vision algorithms (e.g., incremental bundle adjustment), can provide strong constraints for object recognition (see, e.g., [33, 8]), and is useful in robotics applications such as autonomous driving and navigation.
Unfortunately, the performance of standard camera localization pipelines degrades as the size of the 3D model grows. Finding good correspondences becomes difficult in the large-scale setting for two reasons. First, standard 2D-to-3D forward matching is likely to accept bad correspondences between a query feature and the model, since the feature space becomes cluttered with similar descriptors from completely different locations. Standard heuristics for identifying distinctive matches, such as Lowe's distance ratio test, which compares the distance to the nearest-neighbor point descriptor with that of the second-nearest neighbor, fail due to the proximity of other model feature descriptors. Second, the increasingly noisy correspondences produced by the matching stage drive up the runtime of the robust pose estimation step, whose complexity typically grows exponentially with the number of outliers. These difficulties are particularly evident in large urban environments, where repeated structure is common and local features become less distinctive [31, 1].
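As a concrete illustration, the distance ratio test described above can be sketched in a few lines of Python. This is a brute-force toy in which short tuples stand in for descriptor vectors; a real pipeline would use 128-D SIFT descriptors and an approximate nearest-neighbor index such as FLANN, and the 0.8 threshold is only illustrative:

```python
# Minimal sketch of Lowe's distance ratio test using brute-force search.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_match(query_desc, model_descs, threshold=0.8):
    """Return the index of the nearest model descriptor, or None if the
    match is not distinctive (1st/2nd NN distance ratio above threshold)."""
    order = sorted(range(len(model_descs)),
                   key=lambda i: euclidean(query_desc, model_descs[i]))
    d1 = euclidean(query_desc, model_descs[order[0]])
    d2 = euclidean(query_desc, model_descs[order[1]])
    if d2 == 0 or d1 / d2 > threshold:
        return None  # ambiguous: the two nearest neighbors are too similar
    return order[0]

model = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.1)]
print(ratio_test_match((0.1, 0.0), model))    # distinctive -> 0
print(ratio_test_match((5.05, 5.05), model))  # ambiguous  -> None
```

The second query illustrates exactly the failure mode discussed above: two near-duplicate model descriptors produce a ratio near 1, so the match is discarded even though one of them may be correct.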
These problems are well known and have been approached in several ways in the literature. Works such as [13, 12] focus on generating a simplified 3D model that contains only a representative subset of distinctive model points. With a smaller model and prioritized search, it becomes possible to replace the traditional approach of 2D-to-3D forward matching, with 3D-to-2D back matching, allowing the ratio test to be performed in the sparser feature space of the query image.
An alternative to removing points from the model is to cluster and quantize model point feature descriptors. Some approaches use vocabulary trees to speed up forward matching by assigning each model point and each query feature to a vocabulary word, yielding faster runtimes since the vocabulary size is generally smaller than the model point cloud. A linear search for the first and second nearest neighbors is performed within each word bin, and a ratio test filters out non-distinct correspondences. Others use active search in the vocabulary tree to prioritize back matching of 3D points close to those already found, terminating early as soon as a sufficient number of matches has been identified.
A very different approach is taken in the works of [36, 29]. Camera localization is framed as a Hough voting procedure, where the geometric properties of SIFT (scale and orientation) provide approximate information about likely camera pose from individual point correspondences. By using focal length and camera orientation priors, each 2D-to-3D match casts a vote into the intersection of a visibility cone and a hypothesized ground-plane. Orientation and model co-visibility are further used to filter out unlikely matches, rapidly identifying the potential camera locations.
Inspired by this prior work, we propose a fast, simple method for camera localization that scales well to large models with globally repeated structure. Our approach avoids complicated data structures and makes no hard a priori assumptions on camera pose (e.g., gravity direction of the camera). Our basic insight is to utilize a coarse-to-fine approach that rapidly narrows down the region of camera pose space associated with the query image. Specifically, we formulate a linear time voting process over camera pose space by assigning each single model view to an individual camera pose bin. This voting allows us to identify model views likely to overlap the query image and to prioritize back matching of those views against it while exploiting co-visibility constraints and local ratio testing.
Our first contribution (Section 2) is to introduce and analyze two ratio tests that can be used to find distinctive matches in a pool of candidates produced by global k-nearest-neighbor (kNN) search. Our second contribution (Section 3) uses these forward matches as votes to prioritize back matching of model images against the query image. Extensive experimental evaluation (Section 4) suggests this approach scales well and outperforms existing methods on several pose-estimation benchmarks.
2 Ratio Tests for Global Matching
Forward-matching of query image points against a model is effective when the model is small. In such models, approximate nearest-neighbors are often true correspondences and ratio-testing is effective at discarding bad matches. In this section we first establish that clustering the model into smaller sub-models and performing forward-matching within each cluster is sufficient to achieve good performance for large models (Section 2.1). We then describe how to efficiently approximate exhaustive cluster-wise matching by global forward-matching using approximations to the local ratio test (Section 2.2) followed by back-matching.
2.1 Clustering and Exhaustive Local Matching
A naive approach to solving camera localization at large scale is to simply divide the 3D model into small pieces (clusters) and perform matching and robust PnP pose estimation against each cluster. This avoids the problems of global feature repetition and high density in the feature space. However, it is infeasible from a computational point of view, as it requires building a nearest-neighbor data structure for each cluster and matching against each cluster separately at test time. Consider a kd-tree, where searching for a match among $n$ descriptors is logarithmic in the set size, $O(\log n)$. If we divide a model of $n$ descriptors into clusters of constant size $m$, execution time is dominated by the number of clusters $n/m$, which grows linearly in the model size: $O((n/m)\log m)$. While not practical at scale, we take this exhaustive local matching approach as a gold-standard baseline for evaluating our coarse-to-fine approach.
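The scaling argument can be made concrete with a back-of-envelope cost model. The numbers and the idealized logarithmic query cost below are illustrative assumptions, not measurements:

```python
# Back-of-envelope cost model for exhaustive cluster-wise matching.
# Assumes a kd-tree query over n descriptors costs ~log2(n) comparisons
# (a common idealization; real approximate search deviates from this).
import math

def global_cost(n_descriptors, n_queries):
    """One kd-tree over the whole model: n_queries * log2(n)."""
    return n_queries * math.log2(n_descriptors)

def clusterwise_cost(n_descriptors, n_queries, cluster_size):
    """One kd-tree per cluster; every query is matched to every cluster."""
    n_clusters = n_descriptors / cluster_size
    return n_queries * n_clusters * math.log2(cluster_size)

# Doubling the model size roughly doubles the cluster-wise cost (linear
# growth), while the single global search grows only logarithmically.
c1 = clusterwise_cost(1_000_000, 5000, 10_000)
c2 = clusterwise_cost(2_000_000, 5000, 10_000)
print(c2 / c1)  # -> 2.0
```

With constant cluster size, the per-cluster log factor is fixed, so the cluster count (and total cost) scales linearly with the model, which is exactly why exhaustive local matching serves only as a gold-standard baseline.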
Exhaustive Local Matching is effective but slow:
To evaluate clustering and local matching, we use the Eng-Quad dataset and build two SfM models using COLMAP. The first model contains only the training image set, while the second bundles both the training and test images and is used to evaluate localization accuracy. We geo-register the resulting reconstructions against a GIS model so that the scale of the SfM model is approximately metric. Of the 6402 training images, 5129 were bundled, and 520 out of 570 test images were additionally bundled in the test model. The resulting point cloud has 579,859 3D points and 2,901,885 feature descriptors. We refer to these descriptors as views of the points.
To generate clusterings of the model, we construct a scene matrix $S$ whose entry $S_{ij}$ contains the number of points that image pair $(i, j)$ share in the SfM model. We performed spectral clustering on the scene matrix using the 50 largest eigenvectors and produced three different granularities: no clustering at all (purely global), 50 clusters, and 500 clusters. To evaluate exhaustive local matching, we matched a query image against every cluster and selected the one producing the smallest localization error. For matching to a cluster, we use FLANN to find the first and second NN of each query point and apply a standard ratio test. We ran RANSAC on each set of candidate cluster correspondences using a P3P solver and a focal length prior based on the image EXIF metadata. As in prior work, an image is considered successfully matched if it has at least a minimum number of inlier correspondences with reprojection error below a fixed threshold.
Table 1 shows that exhaustive local matching within each cluster performs much better than global matching, with lower median error and fewer failures. However, the execution time grows roughly linearly with respect to the number of clusters, motivating our coarse-to-fine strategy.
|#clusters|#images|#inliers|ratio|error [m]|fwd [s]|RANSAC [s]|total [s]|
2.2 Local Ratio Tests for Global Matches
How can we get the benefits of local cluster-wise matching while paying only the computational cost of a single global nearest-neighbor search? Cluster-wise matching considers a nearest neighbor per cluster for each query point. To recover this larger pool of candidate correspondences using global search, we propose retrieving the global top-k nearest neighbors of each query point. Fortunately, approximate kNN search is not substantially more costly, since those points typically live in adjacent leaves of the kd-tree (which must be explored even for 1-NN retrieval). A larger set of candidate matches addresses the problem of repeated structure by retrieving the multiple scene points that might correspond to a query point. However, it also results in a k-fold increase in outliers, which we now address.
We define a view $v$ as the 2D observation of a 3D point $p$ in a particular model image $I$. Given a camera pose clustering of the SfM model images, we assign the view descriptors of each image to the corresponding cluster $C$. Note that these clusters divide images into disjoint groups, but the clusters may still share common points, since a 3D point can have views belonging to images assigned to different clusters. For a query image with query features $q$, we search for approximate nearest neighbors using a global kd-tree built over all views.
Global k-ratio tests:
We start with a conservative global ratio test (Algorithm 1) that prunes candidate matches by comparing the distances of the first and k-th nearest neighbors retrieved. If the ratio is greater than a threshold $\tau$, we drop the query point. Otherwise, all k nearest-neighbor pairs are included in the set of putative correspondences. This global ratio test is much more conservative than the standard first-versus-second NN test. In the remainder of this paper we refer to this global test as the k-ratio, defined formally as $\text{k-ratio} = \|q - g_1\| / \|q - g_k\|$, where $g_1, \ldots, g_k$ are the global nearest neighbors of $q$ in ascending order of distance. The standard first-versus-second NN test is referred to as the 1-ratio.
If a candidate match fails the global k-ratio test, it also fails the local 1-ratio test.
Let $l_1$ and $l_2$ be the first and second local nearest neighbors of a query feature $q$ within some cluster, drawn from the globally retrieved candidate set. Since the global set $g_1, \ldots, g_k$ is sorted by ascending distance, this implies that $\|q - l_1\| \geq \|q - g_1\|$ and $\|q - l_2\| \leq \|q - g_k\|$. Formally,

$$\text{1-ratio} = \frac{\|q - l_1\|}{\|q - l_2\|} \geq \frac{\|q - g_1\|}{\|q - g_k\|} = \text{k-ratio}.$$

Hence, the local 1-ratio is always equal to or greater than the global k-ratio, which guarantees that any correspondence rejected by the k-ratio test would also have failed the local 1-ratio test. A correspondence passing the k-ratio test might still fail the local 1-ratio test, so the local 1-ratio test is the more stringent criterion. ∎
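A minimal sketch of the k-ratio test, together with a numeric check of the guarantee above, can be written directly over sorted distance lists. The notation mirrors the proof; the threshold value and the toy distances are illustrative:

```python
# Sketch of the conservative global k-ratio test: keep a query point only
# if the ratio of its 1st to k-th global NN distance is below a threshold.

def k_ratio_passes(sorted_dists, tau=0.8):
    """sorted_dists: ascending distances to the k global nearest neighbors."""
    return sorted_dists[0] / sorted_dists[-1] <= tau

# The guarantee proved above: the local 1-ratio d(l1)/d(l2) is >= the
# global k-ratio d(g1)/d(gk), because d(l1) >= d(g1) and d(l2) <= d(gk)
# when the local neighbors come from the retrieved set. So failing the
# k-ratio implies failing the local 1-ratio.
g = [1.0, 1.5, 2.0, 4.0]         # global distances g1..g4
local = [1.5, 2.0]               # local 1st/2nd NN drawn from that pool
k_ratio = g[0] / g[-1]           # 0.25
one_ratio = local[0] / local[1]  # 0.75
assert one_ratio >= k_ratio
print(k_ratio_passes(g, tau=0.8))  # -> True
```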
Cluster-wise ratio tests:
After the initial global filtering, we would like to perform local ratio testing within each cluster. When two or more candidate matches for a query point belong to the same cluster, we can simply re-rank them and apply a standard 1-ratio test. For example, suppose two global matches $g_2$ and $g_4$, the second and fourth global NNs of query feature $q$, fall into the same cluster. If $g_2$ and $g_4$ are views of distinct 3D points, then they are necessarily the first and second local nearest neighbors of $q$ in that cluster (see Figure 2). Any lower-ranked matches within the cluster can be ignored and the 1-ratio test applied to this pair.
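The re-ranking step can be sketched as follows. This is a simplified stand-in for the corresponding part of Algorithm 2; the match-tuple layout, the fallback behavior for non-testable cases, and the threshold are illustrative assumptions:

```python
# Cluster-wise re-ranking sketch: from the globally retrieved matches of
# one query feature (sorted by ascending distance), keep those in a given
# cluster, take the closest pair observing *distinct* 3D points, and
# apply the standard 1-ratio test to that pair.
# Each match is a (distance, point_id, cluster_id) tuple.

def clusterwise_1ratio(matches, cluster_id, tau=0.8):
    in_cluster = [m for m in matches if m[2] == cluster_id]
    if len(in_cluster) < 2:
        return None  # singleton: handled by the t-ratio test instead
    first = in_cluster[0]
    second = next((m for m in in_cluster[1:] if m[1] != first[1]), None)
    if second is None:
        return None  # all candidates view the same point: treat as singleton
    return first if first[0] / second[0] <= tau else None

matches = [(0.9, 'p1', 'A'), (1.0, 'p2', 'B'),
           (1.4, 'p3', 'B'), (2.5, 'p4', 'B')]
print(clusterwise_1ratio(matches, 'B'))  # -> (1.0, 'p2', 'B')
print(clusterwise_1ratio(matches, 'A'))  # -> None (singleton)
```

Note that in cluster B the third-ranked match (distance 2.5) is ignored, exactly as the text prescribes: only the two closest distinct-point matches enter the ratio test.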
When only a single global match falls within a cluster, we can no longer perform an exact local 1-ratio test, since we do not have access to the second nearest neighbor within that cluster. Instead, we develop a bound based on the triangle inequality that defines an alternate test for such cases, which we refer to as the t-ratio test.
Given a local correspondence $(q, v)$, we define $v'$ as the nearest neighbor of view $v$ in the feature space of its cluster $C$. Since $v'$ is obtained purely from training data, we can pre-compute it offline and look it up at test time. We define the t-ratio test as

$$\text{t-ratio} = \frac{\|q - v\|}{\|q - v\| + \|v - v'\|} \leq \tau.$$

Although the global search missed the local second nearest neighbor, the distance $\|v - v'\|$ provides useful information about how far away that second nearest neighbor can be.
If a candidate match fails the t-ratio test, it also fails the local 1-ratio test.
Let $q$ be a query feature, $l_1$ and $l_2$ the first and second local nearest neighbors in a cluster $C$, and $v = l_1$ the singleton match with pre-computed intra-cluster neighbor $v'$. We can bound the distance to the second nearest neighbor by the inequalities

$$\|q - l_2\| \leq \|q - v'\| \leq \|q - v\| + \|v - v'\|,$$

where the first inequality holds since $v' \in C$ and $v' \neq l_1$, and the second holds by the triangle inequality. Thus,

$$\text{1-ratio} = \frac{\|q - l_1\|}{\|q - l_2\|} \geq \frac{\|q - v\|}{\|q - v\| + \|v - v'\|} = \text{t-ratio}.$$

Consequently, a singleton match that fails the t-ratio test will always fail the local 1-ratio test. The t-ratio test thus only filters correspondences that would have failed the local ratio test if $l_2$ were available. ∎
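The bound can be checked numerically on random planar points standing in for descriptors. The code follows the notation of the proof above; it is a sanity-check sketch, not the published implementation:

```python
# Numeric check of the t-ratio bound. For a singleton match v = l1 with
# precomputed intra-cluster NN v', the triangle inequality gives
# d(q, l2) <= d(q, v) + d(v, v'), so
# t_ratio = d(q, v) / (d(q, v) + d(v, v')) lower-bounds the true local
# 1-ratio d(q, l1) / d(q, l2).
import math
import random

def d(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

random.seed(0)
for _ in range(1000):
    q = (random.random(), random.random())
    cluster = [(random.random(), random.random()) for _ in range(5)]
    by_dist = sorted(cluster, key=lambda p: d(q, p))
    l1, l2 = by_dist[0], by_dist[1]
    v = l1
    v_prime = min((p for p in cluster if p != v), key=lambda p: d(v, p))
    t_ratio = d(q, v) / (d(q, v) + d(v, v_prime))
    one_ratio = d(q, l1) / d(q, l2)
    assert t_ratio <= one_ratio + 1e-12
print("bound holds on 1000 random trials")
```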
Back-matching and fitting:
To provide additional robustness to outliers, we back match the views (model feature descriptors) that were flagged as candidate correspondences during forward matching. For each such candidate view, we search for its first and second nearest neighbors in a kd-tree built over the query image features and apply the 1-ratio test. The final set of correspondences is the intersection of pairs that passed both the forward and the back matching process. These pairs are cluster-wise best buddies, since each element of a pair is a discriminative feature in the query and model feature spaces, respectively.
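The best-buddies filter amounts to requiring that the ratio-test match is mutual. A brute-force sketch follows (2D tuples stand in for descriptors and a linear scan replaces the kd-trees; the threshold is illustrative):

```python
# "Best buddies" filter: keep a pair (q, v) only if v is the ratio-test
# match of q among model views AND q is the ratio-test match of v among
# query features.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def best_match(x, pool, tau=0.8):
    """Nearest neighbor of x in pool passing the 1-ratio test, else None."""
    ranked = sorted(pool, key=lambda p: dist(x, p))
    if len(ranked) > 1 and dist(x, ranked[1]) > 0 and \
       dist(x, ranked[0]) / dist(x, ranked[1]) > tau:
        return None
    return ranked[0]

def best_buddies(queries, views, tau=0.8):
    pairs = []
    for q in queries:
        v = best_match(q, views, tau)
        if v is not None and best_match(v, queries, tau) == q:
            pairs.append((q, v))
    return pairs

queries = [(0, 0), (10, 10)]
views = [(0.1, 0.0), (0.0, 0.1), (9.9, 10.0), (5.0, 5.0)]
print(best_buddies(queries, views))  # -> [((10, 10), (9.9, 10.0))]
```

The first query is dropped because two near-duplicate views make its forward ratio test ambiguous, while the second survives both directions.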
|#clusters|#imgs|#inl|ratio|err.|fwd [s]|RT [s]|back [s]|RANSAC [s]|total [s]|
2.3 Cluster-wise ratio-tests are effective and fast
The cluster-wise ratio test, defined in Algorithm 2, prunes a large number of non-discriminative correspondences while maintaining the locally unique matches. The complexity of this algorithm is linear in the number of forward correspondences. For every local NN, we simply look for its second-NN pair within the list of retrieved nearest neighbors. The list of intra-cluster nearest neighbors is a simple view-indexed vector that can be pre-computed offline and accessed in constant time, similar to vocabulary-based methods that store view-to-word assignments. Hence, the number of ratio tests performed is at most linear in the number of forward correspondences.
We evaluated this cluster-wise approach using the same settings as our gold-standard baseline experiment. We added a finer division of the model, consisting of atomic clusters with a single image each. Table 2 shows the localization performance at these different granularities. A single global cluster gives surprisingly good results in the number of localized cameras, although it yields worse camera position estimates. This is due to the restrictiveness of the ratio test in denser search spaces, which yields fewer inliers and misses some discriminative correspondences that would improve results. As we increase the number of clusters, the localization errors are reduced (8 cm on average) thanks to the cluster-wise ratio test, which provides more high-confidence matches (at the expense of longer RANSAC runtimes). We obtain the best results using the finest clustering (a single model camera per cluster), successfully localizing 482 images. Compared to the gold standard of Table 1, our strategy is competitive, dropping only 5% in localization performance while being three orders of magnitude faster. Moreover, since the finest single-image clusters provide the best results, we can avoid running any complex clustering method (e.g., spectral clustering). We use single-image clusters in the remainder of the paper.
3 Accelerating Matching by Pose Voting
As Table 2 suggests, with appropriate cluster-wise testing, forward matching now constitutes the primary computational bottleneck. Short of simplifying the model (e.g., as pursued by [12, 13]), how might we further accelerate the matching process? A natural strategy is to carry out forward matching incrementally and stop as soon as we have a sufficient number of matches to guarantee a good result. From this perspective, we can view forward matching as "voting" for the location of the query camera. Unlike [36, 29], where votes were cast into a uniformly binned camera translation space, we use each model camera pose as a putative bin for our votes, an approach also used in prior work. We avoid additional data structures like vocabulary trees in favor of storing a simple but effective per-view nearest-neighbor vector that enforces local uniqueness. Once we have accumulated enough votes to narrow down the camera pose to a few candidate clusters, we can terminate forward matching and carry out back matching with little loss in accuracy.
3.1 Coarse localization using cluster matching
To analyze how many votes are needed to determine a good localization, we frame the problem as location recognition [20, 30, 2, 24], namely producing a short ranked list of model images that depict the same general location as the query image. We follow the standard evaluation procedure, reporting success if at least one image among the top-k retrieved images shares 12 or more fundamental-matrix inliers with the query. We benchmark performance on two datasets: Eng-Quad and Dubrovnik.
|Dubrovnik - 800 test images|
|Eng-Quad - 520 test images|
The results in Table 3 are encouraging. Algorithm 2 recognizes the location of all 800 test images in the Dubrovnik dataset using 200 random features passing the k-ratio test. Results on the more challenging Eng-Quad dataset recognize, with high accuracy, the landmarks of the 520 query images for which we have a ground-truth pose. Importantly, a random subset of a few hundred query features achieves nearly as good recognition results as using all image features (a query image usually has 5,000 to 10,000 features). This suggests that forward matching can be terminated early while still maintaining good localization performance.
3.2 Prioritized Back Matching
Determining the correct model image provides only a rough camera location; additional work is needed to estimate the precise camera pose. To reap the computational benefits of subsampling, we thus modify our framework slightly. We use forward matching with a subset of query features to identify likely model images. We then perform back matching within candidate images to expand the set of matches used for fine camera pose estimation. This back matching is carried out as a greedy prioritized search over images ranked by votes, and further exploits co-visibility information encoded in the SfM model to find additional distinctive matches that were not identified during the (sub-sampled) forward matching.
Algorithm 3 describes our back-matching approach. Given the forward matches found by Algorithm 2, we select the top-voted model image and back-match all of its views against the query image using the standard 1-ratio test. The correspondences found are added to the pool of back-matched pairs used for fine pose estimation, and are also treated as votes. Using the SfM model's camera-point visibility graph, we cast votes for other images that observe the same views, increasing the likelihood that neighboring images are selected in subsequent rounds of back matching. To avoid introducing noise into the voting process, we only allow a back-matched image to cast votes if it depicts the same location (i.e., returns 12 or more matches). The algorithm terminates when the pool is large enough to guarantee a good camera localization, or when a maximum number of images has been back-matched.
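The structure of this loop can be sketched as follows. This is a simplified stand-in for Algorithm 3: the dictionaries, thresholds, and the `back_match` callable are illustrative assumptions, and a real implementation would run the 1-ratio test against a per-image kd-tree inside `back_match`:

```python
# Greedy prioritized back-matching sketch: forward-match votes seed a
# priority over model images; back-matching the top image adds
# correspondences, and its co-visible neighbors receive extra votes.

def prioritized_back_match(votes, covisible, back_match, enough=100,
                           max_images=20, min_inliers=12):
    """votes: {image: vote count} from forward matching (mutated in place).
    covisible: {image: [images sharing 3D points with it]}.
    back_match: callable image -> list of correspondences."""
    pool, visited = [], set()
    while len(pool) < enough and len(visited) < max_images:
        remaining = {i: v for i, v in votes.items() if i not in visited}
        if not remaining:
            break
        top = max(remaining, key=remaining.get)  # most-voted unvisited image
        visited.add(top)
        found = back_match(top)
        pool.extend(found)
        if len(found) >= min_inliers:            # confident images cast votes
            for nbr in covisible.get(top, []):   # co-visibility propagation
                votes[nbr] = votes.get(nbr, 0) + len(found)
    return pool

# Toy run: image A's back-matches promote its co-visible neighbor C above B.
fake = {'A': ['m'] * 15, 'C': ['m'] * 15, 'B': []}
pool = prioritized_back_match({'A': 5, 'B': 2}, {'A': ['C']},
                              lambda img: fake[img],
                              enough=40, max_images=3)
print(len(pool))  # -> 30
```

In the toy run, image C starts with zero votes but is visited second because A's successful back-matches vote for it through the visibility graph, mirroring the prioritization described above.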
4 Benchmark Evaluation
We evaluated our approach (Algorithm 4) on three different datasets: Eng-Quad, Dubrovnik, and Rome. Rome is a large dataset of 15,179 training and 1,000 test images. Dubrovnik is a popular dataset of 6,044 training and 800 test images whose SfM model is roughly aligned to geographic coordinates, allowing for quantitative metric evaluation. While Eng-Quad has fewer images, it is perhaps the most challenging due to the strongly repeated structures in the modern architectural designs it depicts. When using P3P, we used EXIF metadata for Eng-Quad test images and ground-truth focal lengths from the SfM models for Dubrovnik and Rome. We also briefly analyze results on the city-wide SF-0 dataset.
After carefully analyzing the original models provided for Dubrovnik, we found that the test-set ground truth was often wrong, with extremely large focal lengths and misaligned 2D-3D correspondences. This in turn resulted in large errors in camera location and poor alignment between projected 3D points and their corresponding 2D features. These problems are evident in results published elsewhere; for example, prior evaluations report better results using P4Pf than using P3P with the given "true" focal lengths. This is contrary to expectation: knowing the ground-truth focal length (P3P) should outperform joint estimation of pose and focal length (P4Pf). Examples are shown in the supplementary material.
For this reason, we rebuilt a new version of the Dubrovnik "ground-truth" model using the same set of keypoints provided with the original dataset and the excellent COLMAP SfM package. We aligned the new model to the original one using a RANSAC-based Procrustes analysis so that the scale is approximately metric. After alignment, only 3853 of the recovered 6844 images were located within 3 meters of their original positions in the model, further validating our concerns. Our reconstruction provides ground truth for 777 of the 800 query images.
The runtime of our camera localization algorithm depends on two parameters that trade off localization accuracy against execution time. Figure 3 shows their influence on the Eng-Quad dataset. We benchmarked forward-matching times by randomly sampling query features until a desired number passed the global ratio test while holding the other parameter fixed, and vice versa. In both cases, the values tested range from 50 up to 500 matched features. Figure 3 shows the number of registered images under these different configurations and the time spent to achieve that level of performance.
We tested our localization pipeline with the following settings: for each dataset, we built a global kd-tree index over all model view descriptors, requesting k nearest neighbors and checking 128 leaves. We used the same threshold across all of our ratio tests and set the two runtime parameters to provide a good balance between camera localization and execution time. Algorithm 3 stops after 20 back-matched images, a generous setting on these datasets (in most cases the required pool is gathered in fewer than 5 iterations). Experiments were performed using a single thread on an Intel i7-5930 CPU at 3.50GHz. We used the implementations of [21, 22] on Eng-Quad and the re-bundled Dubrovnik comparisons, running a single thread on an Intel i7-3770 CPU at 3.40GHz, with a generic vocabulary tree and default parameters. Unfortunately, implementations of [36, 28] were not available.
We successfully localized all images in Dubrovnik, except one image in the corrected version using P4Pf. We achieved the smallest localization errors for all quartiles, reporting more images within the error threshold and fewer beyond the failure mark. Despite finding a substantially higher number of inliers, our method yielded larger average errors on the original Dubrovnik model due to the underlying defects in its ground truth. Methods that use a voting approximation to the rough image location rather than the traditional match-and-RANSAC pipeline report smaller localization errors after RANSAC, but at the cost of longer runtimes. Finally, we successfully localized all query images in the Rome dataset using P4Pf. Rome also suffers from inaccuracies similar to Dubrovnik's, which resulted in the loss of one test image using P3P.
The benefits of our approach are more pronounced on Eng-Quad, due to its difficult characteristics. We localized more than 100 and 50 additional cameras with respect to [21] and [22], respectively, improving all localization errors except the first quartile using P4Pf. We obtain faster runtimes than [13, 36] while remaining competitive with [21, 22]. Notably, our approach adapts to the more difficult Eng-Quad dataset, spending more time retrieving images with sufficient correspondences. On Dubrovnik, in contrast, we quickly recognize landmarks with the first or second top-ranked images, quickly retrieving sufficient putative correspondences and yielding faster localization times.
We obtained an asymptotic recall of 66.63% on the SF-0 dataset. At 95% precision, the recall drops to 52.30% using the effective inlier count, falling below the performance of other location-recognition methods [6, 19, 1, 36]. For this test we used less stringent parameters and back-matched up to 50 images. We expect that tuning these parameters and using the re-ranking heuristics exploited by other methods would yield a better approach for such location-retrieval problems.
Alternative approaches to large-scale image localization have focused on reducing the density of the search space to quickly find discriminative correspondences. Here we have shown that retrieving multiple global nearest neighbors and filtering them using approximations to the ratio test can quickly identify candidate regions of pose space. Such regions can be further refined by back matching to yield state-of-the-art results in camera localization, even for datasets with challenging globally repeated structure.
-  R. Arandjelović and A. Zisserman. Dislocation: Scalable descriptor distinctiveness for location recognition. In Asian Conference on Computer Vision, pages 188–204. Springer, 2014.
-  G. Baatz, O. Saurer, K. Köser, and M. Pollefeys. Large scale visual geo-localization of images in mountainous terrain. In Computer Vision–ECCV 2012, pages 517–530. Springer, 2012.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Computer vision and image understanding, 110(3):346–359, 2008.
-  M. Bujnak, Z. Kukelova, and T. Pajdla. A general solution to the P4P problem for camera with unknown focal length. In Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8. IEEE, 2008.
-  S. Cao and N. Snavely. Graph-based discriminative learning for location recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 700–707, 2013.
-  D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, et al. City-scale landmark identification on mobile devices. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 737–744. IEEE, 2011.
-  T. Dekel, S. Oron, M. Rubinstein, S. Avidan, and W. T. Freeman. Best-buddies similarity for robust template matching. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2021–2029. IEEE, 2015.
-  R. Díaz, S. Hallman, and C. C. Fowlkes. Detecting dynamic objects with multi-view background subtraction. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 273–280. IEEE, 2013.
-  R. Díaz, M. Lee, J. Schubert, and C. C. Fowlkes. Lifting GIS maps into strong geometric context for scene understanding. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
-  M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
-  L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In CVPR, 2011.
-  Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide Pose Estimation using 3D Point Clouds. In ECCV, 2012.
-  Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In European Conference on Computer Vision, pages 791–804. Springer, 2010.
-  D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov. 2004.
-  L. Moisan, P. Moulon, and P. Monasse. Automatic homographic registration of a pair of images, with a contrario elimination of outliers. Image Processing On Line, 2:56–73, 2012.
-  P. Moulon, P. Monasse, and R. Marlet. Adaptive structure from motion with a contrario model estimation. In ACCV. 2012.
-  D. M. Mount and S. Arya. Ann: library for approximate nearest neighbour searching. 1998.
-  M. Muja and D. G. Lowe. Flann, fast library for approximate nearest neighbors. In International Conference on Computer Vision Theory and Applications (VISAPP’09). INSTICC Press, 2009.
-  T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In ICCV, 2015.
-  T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-scale location recognition and the geometric burstiness problem. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2D-to-3D matching. ICCV, 2011.
-  T. Sattler, B. Leibe, and L. Kobbelt. Improving image-based localization by active correspondence search. In European Conference on Computer Vision, pages 752–765. Springer, 2012.
-  T. Sattler, C. Sweeney, and M. Pollefeys. On sampling focal length values to solve the absolute pose problem. In European Conference on Computer Vision, pages 828–843. Springer, 2014.
-  G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2007.
-  J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
-  N. Snavely, I. Simon, M. Goesele, R. Szeliski, and S. M. Seitz. Scene Reconstruction and Visualization From Community Photo Collections. Proceedings of the IEEE, 98(8):1370–1390, Aug. 2010.
-  L. Svarm, O. Enqvist, F. Kahl, and M. Oskarsson. City-scale localization for cameras with known vertical direction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
-  L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate localization and pose estimation for large 3d models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2014.
-  A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recognition with repetitive structures. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
-  P. H. Torr and A. Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000.
-  S. Wang, S. Fidler, and R. Urtasun. Holistic 3D scene understanding from a single geo-tagged image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  C. Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013, pages 127–134. IEEE, 2013.
-  A. R. Zamir and M. Shah. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE transactions on pattern analysis and machine intelligence, 36(8):1546–1558, 2014.
-  B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2704–2712, 2015.