1 Introduction
In this paper we consider the problem of estimating the full 6DOF camera pose of a query image with respect to a large-scale 3D model such as those obtained from a Structure-from-Motion (SfM) pipeline [27, 34, 16, 25]. A typical approach is to detect distinctive 2D feature points in a query image and perform correspondence search against feature descriptors associated with 3D points obtained from the SfM reconstruction. This initial matching is performed in descriptor space (e.g., SIFT [14] or SURF [3]) using an approximate k-nearest neighbor search implementation [17, 18]. Candidate 2D-3D correspondences are then further filtered using robust fitting techniques (e.g., RANSAC variants [10, 32, 15]) to identify inliers, and the final camera pose is estimated using an algebraic PnP solver and nonlinear refinement. Camera pose estimation is a fundamental building block in many computer vision algorithms (e.g., incremental bundle adjustment), can provide strong constraints on object recognition (see, e.g., [33, 8]), and is useful in robotics applications such as autonomous driving and navigation.
Unfortunately, the performance of standard camera localization pipelines degrades as the size of the 3D model grows. Finding good correspondences becomes difficult in the large-scale setting due to two factors. First, standard 2D-to-3D forward matching is likely to accept bad correspondences of a query feature with the model, since the feature space becomes cluttered with similar descriptors from completely different locations. Standard heuristics for identifying distinctive matches, such as the distance ratio test of Lowe [14], which compares the distance to the nearest-neighbor point descriptor with that of the second-nearest neighbor, fail due to the proximity of other model feature descriptors. Second, the increasingly noisy correspondences obtained from the matching stage drive up the runtime of the robust pose estimation step, whose complexity typically grows exponentially with the number of outliers. These difficulties are particularly evident in large urban environments, where repeated structure is common and local features become less distinctive [31, 1].
Related Work:
These problems are well known and have been approached in several ways in the literature. Works such as [13, 12] focus on generating a simplified 3D model that contains only a representative subset of distinctive model points. With a smaller model and prioritized search, it becomes possible to replace the traditional approach of 2D-to-3D forward matching with 3D-to-2D back matching, allowing the ratio test to be performed in the sparser feature space of the query image.
An alternative to removing points from the model is to cluster and quantize model point feature descriptors. [21] use vocabulary trees to speed up forward matching by assigning each model point and each query feature to a vocabulary word, yielding faster runtimes since the vocabulary size is generally smaller than the model point cloud. A linear search for the first and second nearest neighbors is performed within each word bin, and a ratio test filters out non-distinctive correspondences. [22] use active search in the vocabulary tree to prioritize back matching of 3D points close to those that have already been found and terminate early as soon as a sufficient number of matches have been identified.
A very different approach is taken in the works of [36, 29]. Camera localization is framed as a Hough voting procedure, where the geometric properties of SIFT (scale and orientation) provide approximate information about the likely camera pose from individual point correspondences. By using focal length and camera orientation priors, each 2D-to-3D match casts a vote into the intersection of a visibility cone and a hypothesized ground plane. Orientation and model covisibility are further used to filter out unlikely matches, rapidly identifying the potential camera locations.
Our Contribution:
Inspired by this prior work, we propose a fast, simple method for camera localization that scales well to large models with globally repeated structure. Our approach avoids complicated data structures and makes no hard a priori assumptions on camera pose (e.g., the gravity direction of the camera). Our basic insight is to use a coarse-to-fine approach that rapidly narrows down the region of camera pose space associated with the query image. Specifically, we formulate a linear-time voting process over camera pose space in which each model view serves as an individual camera pose bin. This voting allows us to identify model views likely to overlap the query image and to prioritize back matching of those views against it, while exploiting covisibility constraints and local ratio testing.
Figure 1 gives an overview of our pipeline. Our first contribution (Section 2) is to introduce and analyze two ratio tests that can be used to find distinctive matches in a pool of candidates produced by global k-nearest neighbor (kNN) search. Our second contribution (Section 3) uses these forward matches as votes to prioritize back matching of model images against the query image. Extensive experimental evaluation (Section 4) suggests this approach scales well and outperforms existing methods on several pose-estimation benchmarks.
2 Ratio Tests for Global Matching
Forward matching of query image points against a model is effective when the model is small. In such models, approximate nearest neighbors are often true correspondences and ratio testing is effective at discarding bad matches. In this section we first establish that clustering the model into smaller sub-models and performing forward matching within each cluster is sufficient to achieve good performance for large models (Section 2.1). We then describe how to efficiently approximate exhaustive cluster-wise matching by global forward matching, using approximations to the local ratio test (Section 2.2) followed by back matching.
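As a concrete reference point, the standard 1-ratio (Lowe) test can be sketched in a few lines. The brute-force search, Euclidean distance, and the threshold value of 0.8 are illustrative stand-ins for the kd-tree search and the paper's own (unstated) threshold:

```python
def match_with_ratio_test(query, descriptors, tau=0.8):
    """Brute-force nearest-neighbor match followed by Lowe's 1-ratio
    test: accept the best match only if it is clearly closer than the
    second-best. Returns the matched index, or None if not distinctive."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(query, d)) ** 0.5, i)
                   for i, d in enumerate(descriptors))
    (d1, i1), (d2, _) = dists[0], dists[1]
    return i1 if d1 / d2 < tau else None
```

In a cluttered descriptor space the second-nearest distance shrinks toward the nearest one, so the ratio approaches 1 and distinctive matches become rare — exactly the large-model failure mode described above.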
2.1 Clustering and Exhaustive Local Matching
A naive approach to solving camera localization at large scale is to simply divide the 3D model into small pieces (clusters) and perform matching and robust PnP pose estimation for each cluster. This avoids the problems of global feature repetition and high density in the feature space. However, it is infeasible from a computational point of view, as it requires building a nearest-neighbor data structure for each cluster and matching against each cluster separately at test time. Consider a kd-tree, where searching for a match among n descriptors is logarithmic in the set size: O(log n). If we divide a model of N descriptors into clusters of constant size m, execution time is dominated by the number of clusters N/m, which grows linearly in the model size: O((N/m) log m). While not practical at scale, we take this exhaustive local matching approach as a gold-standard baseline for evaluating our coarse-to-fine approach.
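A quick back-of-the-envelope model of this cost (the symbols N and m here are ours, introduced for illustration):

```python
import math

def per_query_cost(model_size, cluster_size):
    """Rough cost of matching one query feature against every cluster:
    N/m clusters, each answered by an O(log m) kd-tree lookup."""
    n_clusters = model_size // cluster_size
    return n_clusters * math.log2(cluster_size)

# Doubling the model size doubles the per-query cost: the log factor is
# constant per cluster, so exhaustive local matching is linear in N.
```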
Exhaustive Local Matching is effective but slow:
To evaluate clustering and local matching, we use the EngQuad dataset from [9] and build two SfM models using COLMAP [25]. The first model contains only the training image set, while a second model bundles both the training and test images and is used for evaluating localization accuracy. We geo-register the resulting reconstructions with a GIS model so that the scale of the SfM model is approximately metric. Of the 6,402 training images, 5,129 were bundled, and 520 of the 570 test images were additionally bundled in the test model. The resulting point cloud has 579,859 3D points and 2,901,885 feature descriptors. We refer to these descriptors as views of the corresponding points.
To generate clusterings of the model, we construct a scene matrix whose entries contain the number of 3D points that each pair of images shares in the SfM model. We performed spectral clustering [26] on the scene matrix using the 50 largest eigenvectors and produced three different granularities: no clustering at all (purely global), 50 clusters, and 500 clusters. To evaluate exhaustive local matching, we matched a query image against every cluster and selected the one that produces the smallest localization error. For matching to a cluster, we use FLANN [18] to find the first and second NN of each query point and apply a standard ratio test with a fixed threshold. We ran RANSAC on each set of candidate cluster correspondences using a P3P solver [11] and a focal length prior based on the image EXIF metadata. Similar to [13], an image is considered to be successfully matched if it has a minimum number of inlier correspondences with reprojection error below a fixed threshold. Table 1 shows that exhaustive local matching within each cluster performs much better than global matching, with lower median error and fewer failures. However, the execution time grows roughly linearly with respect to the number of clusters, motivating our coarse-to-fine strategy.
#clusters  #images  #inliers  ratio  error [m]  fwd [s]  RANSAC [s]  total [s]
1 (global)  463  94  0.57  0.64  0.833  0.129  0.962
50  512  66  0.54  0.45  13.10  43.62  56.822
500  517  51  0.49  0.29  80.23  523.69  603.52
2.2 Local Ratio Tests for Global Matches
How can we get the benefits of local cluster-wise matching while keeping the computational cost of a single global nearest-neighbor search? Cluster-wise matching considers one nearest neighbor per cluster for each query point. To recover this larger pool of candidate correspondences using global search, we propose to retrieve the global top-k nearest neighbors for each query point. Fortunately, approximate kNN searches are not substantially more costly, since those points typically live in adjacent leaves of the kd-tree (which must be explored even for a 1-NN retrieval). A larger set of candidate matches can address the problem of repeated structure by retrieving the multiple scene points that might correspond to a query point. However, it also results in a k-fold increase in outliers, which we now address.
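A brute-force stand-in for the global top-k retrieval (a real pipeline would use an approximate kd-tree such as FLANN; the Euclidean distance and tuple layout are our illustrative choices):

```python
def global_knn(query, views, k=5):
    """Return the k nearest view descriptors as (view_index, distance)
    pairs sorted by ascending distance. Brute force for clarity; an
    approximate kd-tree plays this role in the actual pipeline."""
    scored = sorted((sum((a - b) ** 2 for a, b in zip(query, v)) ** 0.5, i)
                    for i, v in enumerate(views))
    return [(i, d) for d, i in scored[:k]]
```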
We define a view as the 2D point observation of a 3D point in a particular model image. Given a camera pose clustering of the SfM model images, we assign the view descriptors of each image to their corresponding cluster. Note that these clusters divide images into disjoint groups, but the groups still share common points, since a 3D point can have multiple views belonging to images assigned to different clusters. For a query image with a set of query features, we search for approximate nearest neighbors using a global kd-tree structure built from all views.
Global k-ratio test:
We start with a conservative global ratio test (Algorithm 1) to prune candidate matches by comparing the distance ratio of the first and k-th nearest neighbors retrieved, as proposed by [35]. If the ratio is greater than a threshold, we drop the query point. Otherwise, all k nearest-neighbor pairs are included in the set of putative correspondences. This global ratio test is much more conservative than the standard first vs. second NN test. In the remainder of this paper, we will refer to this global test as the k-ratio, defined formally as k-ratio = d(q, nn_1)/d(q, nn_k) for a query feature q with retrieved neighbors nn_1, …, nn_k sorted by ascending distance. The standard first versus second NN test will be referred to as the 1-ratio.
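A minimal sketch of the k-ratio filter (the threshold value 0.8 is illustrative; the paper's setting is not reproduced here):

```python
def k_ratio_filter(neighbors, tau=0.8):
    """Global k-ratio test: keep all k retrieved neighbors only when
    d(q, nn_1) / d(q, nn_k) <= tau; otherwise drop the query point
    entirely. `neighbors` is a list of (view_id, distance) pairs
    sorted by ascending distance."""
    return list(neighbors) if neighbors[0][1] / neighbors[-1][1] <= tau else []
```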
Proposition 1.
If a candidate match fails the global k-ratio test, it also fails the local 1-ratio test.
Proof.
Let p_1 and p_2 be the first and second local nearest neighbors of a particular query feature q within a cluster, both retrieved among the global nearest neighbors nn_1, …, nn_k. Since the global set is sorted by ascending distance, this implies that d(q, nn_1) ≤ d(q, p_1) and d(q, p_2) ≤ d(q, nn_k). Formally,
1-ratio = d(q, p_1)/d(q, p_2) ≥ d(q, nn_1)/d(q, nn_k) = k-ratio.   (1)
Hence, the local 1-ratio will always be equal to or greater than the global k-ratio. This guarantees that any correspondence rejected by the k-ratio test would also have failed the local 1-ratio test. A correspondence passing the k-ratio test might not pass the local 1-ratio test, so the local 1-ratio test is the more stringent criterion. ∎
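A numeric illustration of the proposition, using made-up distances:

```python
# Global top-k distances (sorted) and the two local NNs of one cluster,
# both of which were retrieved among the global top-k.
global_dists = [0.30, 0.42, 0.55, 0.70]
local_d1, local_d2 = 0.42, 0.55

k_ratio = global_dists[0] / global_dists[-1]   # first vs. k-th global NN
one_ratio = local_d1 / local_d2                # first vs. second local NN
assert one_ratio >= k_ratio  # the local test is always at least as strict
```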
Cluster-wise ratio tests:
After the initial global filtering, we would like to perform local ratio testing within each cluster. When two or more candidate matches for a query point belong to the same cluster, we can simply re-rank them and apply a standard 1-ratio test. For example, suppose two global matches v_1 and v_2, the second and fourth global NNs of the query feature, fall into the same cluster. If v_1 and v_2 are views of distinct 3D points, then they are necessarily the first and second local nearest neighbors of the query in that cluster (see Figure 2). Any lower-ranked matches within the cluster can be ignored and the 1-ratio test applied to this pair.
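This re-ranking step might be sketched as follows; the data layout and threshold are illustrative, and singleton clusters are deferred to the t-ratio test described next:

```python
from collections import defaultdict

def clusterwise_ratio_test(matches, tau=0.8):
    """Group global k-NN matches by cluster, re-rank within each
    cluster, and apply the 1-ratio test to the two closest views of
    distinct 3D points. `matches` holds (cluster_id, point_id,
    distance) triples; returns the surviving best match per cluster."""
    by_cluster = defaultdict(list)
    for c, p, d in matches:
        by_cluster[c].append((d, p))
    kept = []
    for c, cands in by_cluster.items():
        cands.sort()
        d1, p1 = cands[0]
        # the second NN must be a view of a *different* 3D point
        second = next(((d, p) for d, p in cands[1:] if p != p1), None)
        if second is not None and d1 / second[0] < tau:
            kept.append((c, p1, d1))
    return kept
```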
When only a single global match falls within a cluster, we can no longer perform an exact local 1-ratio test, since we do not have immediate access to the second nearest neighbor within that cluster. Instead, we develop a bound based on the triangle inequality to define an alternate test for such cases, which we refer to as the t-ratio test.
Given a local correspondence (q, v), we define v′ as the nearest neighbor of view v in the feature space defined by its cluster. Since v′ is obtained purely from training data, we can precompute it offline and access it at test time. We define the t-ratio test as:
t-ratio = d(q, v) / (d(q, v) + d(v, v′)).   (2)
Although we missed the local second nearest neighbor in the global search, the distance d(v, v′) provides useful information on how far away the second nearest neighbor might be.
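A sketch of the resulting test, written as an accept condition with the same illustrative threshold used for our 1-ratio sketches:

```python
def t_ratio_passes(d_q_v, d_v_vprime, tau=0.8):
    """t-ratio test for a cluster with a single retrieved match: the
    precomputed intra-cluster neighbor distance d(v, v') bounds how
    close the unseen local second nearest neighbor can be."""
    return d_q_v / (d_q_v + d_v_vprime) < tau
```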
Proposition 2.
If a candidate match fails the t-ratio test, it also fails the local 1-ratio test.
Proof.
Let q be a query feature, l_1 and l_2 the first and second local nearest neighbors in a cluster, and (q, v) the singleton match, so that v = l_1 (any view in the cluster closer to q would also have been retrieved by the global search). We can bound the distance to the second nearest neighbor by the inequalities:
d(q, l_2) ≤ d(q, v′) ≤ d(q, v) + d(v, v′),   (3)
where the first inequality holds since v′ is a view in the cluster distinct from l_1 = v, and the second holds by the triangle inequality. Thus,
1-ratio = d(q, l_1)/d(q, l_2) ≥ d(q, v)/(d(q, v) + d(v, v′)) = t-ratio.   (4)
Consequently, a singleton match that fails the t-ratio test will always fail the local 1-ratio test. The t-ratio test thus only filters correspondences that would have failed the local ratio test if l_2 were available. ∎
Back matching and fitting:
To provide additional robustness to outliers, we can back match views (model feature point descriptors) that were flagged as candidate correspondences by the forward matching. For any such candidate matching view, we search for the first and second nearest-neighbor matches using a kd-tree built over the query image features and apply the 1-ratio test. We then select as the final set of correspondences the intersection of pairs that passed the forward and back matching processes. These pairs are cluster-wise best buddies [7], since both elements of a pair are discriminative features in the query and model feature spaces, respectively.
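The intersection step amounts to keeping mutual ("best buddy") matches; a minimal sketch with our own pair layout:

```python
def best_buddy_pairs(forward, backward):
    """Keep only correspondences confirmed in both directions: query
    feature q matched view v in forward matching AND v matched back to
    q in back matching (each direction after its own ratio test)."""
    back = {(q, v) for (v, q) in backward}
    return sorted(set(forward) & back)
```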
#clusters  #imgs  #inl  ratio  err [m]  fwd [s]  RT [s]  bck [s]  RANSAC [s]  total [s]
1  481  115  0.74  0.69  0.821  0.008  0.021  0.046  0.895
50  477  127  0.59  0.66  0.818  0.008  0.028  0.061  0.915
500  480  133  0.56  0.61  0.821  0.009  0.038  0.066  0.934
5129  482  136  0.55  0.62  0.833  0.009  0.048  0.070  0.961
2.3 Cluster-wise ratio tests are effective and fast
The cluster-wise ratio test, defined in Algorithm 2, prunes a large number of non-discriminative correspondences while still maintaining the locally unique matches. The complexity of this algorithm is linear in the number of forward correspondences. For every local NN, we simply look for its second NN within the list of retrieved nearest neighbors. The list of intra-cluster nearest neighbors is simply a view-to-neighbor vector that can be precomputed offline and accessed in constant time, similar to vocabulary-based methods that store view-to-word assignments. Hence, at most one ratio test per forward correspondence will be performed.
We evaluated this cluster-wise approach using the same settings as our gold-standard baseline experiment. We added a finer division of the model, consisting of atomic clusters with a single image each. Table 2 shows the localization performance at these different granularities. A single global cluster gives surprisingly good results in the number of localized cameras, although it yields worse camera position estimates. This is due to the restrictiveness of the ratio test in denser search spaces, yielding fewer inliers and missing some discriminative correspondences that would improve results. As we increase the number of clusters, the localization errors are reduced (8 cm on average) thanks to the cluster-wise ratio test, which provides more high-confidence matches (at the expense of longer RANSAC runtimes). We obtain the best results using the finest clustering (a single model camera per cluster), successfully localizing 482 images. Compared to the gold standard of Table 1, our strategy is competitive, dropping only 5% in localization performance while being three orders of magnitude faster. Moreover, since the finest single-image clusters provide the best result, we can avoid running any complex clustering method (e.g., spectral clustering). We use single-image clusters in the remainder of the paper.
3 Accelerating Matching by Pose Voting
As Table 2 suggests, with appropriate cluster-wise testing, forward matching now constitutes the primary computational bottleneck. Short of simplifying the model (e.g., as pursued by [12, 13]), how might we further accelerate the matching process? A natural strategy is to carry out forward matching incrementally and stop as soon as we have a sufficient number of matches to guarantee a good result. From this perspective, we can view forward matching as “voting” for the location of the query camera. Unlike [36, 29], where votes were cast into a uniformly binned camera translation space, we use each model camera pose as a putative bin for casting votes (as also done in [19]). We avoid additional data structures like vocabulary trees in favor of storing a simple but effective view-to-neighbor vector that enforces local uniqueness. Once we have accumulated enough votes to narrow down the camera pose to a few candidate clusters, we can terminate forward matching and carry out back matching with little loss in accuracy.
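The voting itself reduces to a counter over model images; a minimal sketch (the names and data layout are ours):

```python
from collections import Counter

def vote_for_images(matches, view_to_image):
    """Cast one vote per surviving forward match into the bin of the
    model image that observes the matched view; images are then
    back-matched in order of decreasing votes."""
    votes = Counter(view_to_image[v] for _, v in matches)
    return votes.most_common()
```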
3.1 Coarse localization using cluster matching
To analyze how many votes are needed to determine a good localization, we frame the problem as one of location recognition [20, 30, 2, 24], namely producing a short ranked list of model images that depict the same general location as the query image. We follow the evaluation procedure of [5], reporting whether at least one image among the top-k images shares 12 or more fundamental-matrix inliers with the query. We benchmark performance on two datasets: EngQuad and Dubrovnik [13].
Dubrovnik (800 test images)
Method  top1  top2  top5  top10  Time [s]
—  99.00%  99.38%  99.62%  99.88%  0.048
—  99.62%  99.75%  99.88%  99.88%  0.085
—  100%  100%  100%  100%  0.157

EngQuad (520 test images)
Method  top1  top2  top5  top10  Time [s]
—  83.85%  85.96%  88.46%  88.65%  0.064
—  84.62%  86.54%  89.62%  90.96%  0.125
—  85.77%  87.69%  90.38%  91.35%  0.242
—  86.15%  88.27%  90.96%  91.35%  0.502
All features  86.92%  89.23%  91.15%  91.92%  0.833
The results in Table 3 are inspiring. Algorithm 2 is able to recognize the location of all 800 test images in the Dubrovnik dataset using 200 random features passing the k-ratio test. Results on the more challenging EngQuad dataset achieve high accuracy in recognizing the landmarks of the 520 query images for which we have a ground-truth pose. Importantly, a random subset of a few hundred query features achieves nearly as good recognition results as using all image features (a query image usually has 5,000 to 10,000 features). This suggests that forward matching can be terminated early while still maintaining good localization performance.
3.2 Prioritized Back Matching
Determining the correct model image only provides a rough camera location, and additional work is needed to estimate the precise camera pose. To reap the computational benefits of subsampling, we thus modify our framework slightly. We use forward matching with a subset of query features in order to identify likely model images. We then perform back matching within candidate images in order to expand the set of matches used for fine camera pose estimation. This back matching is carried out using a greedy prioritized search over images ranked by votes and further exploits covisibility information encoded in the SfM model to find additional distinctive matches that were not identified during the (subsampled) forward matching.
Algorithm 3 describes our back-matching approach. Given the forward matches found using Algorithm 2, we select the most-voted model image and back match all of its views against the query image using the standard 1-ratio test. The correspondences found are added to the pool of back-matched pairs used for the fine pose estimation. These back matches are also treated as votes. We use the SfM model’s camera-point visibility graph to cast votes for other images that observe the same views. These new votes increase the likelihood that neighboring images are selected for subsequent rounds of back matching. To avoid introducing noise into the voting process, we only allow a back-matched image to cast votes if it depicts the same location (i.e., returns 12 or more matches). The algorithm terminates when the pool of correspondences is large enough to guarantee a good camera localization, or once a certain total number of images have been back matched.
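A simplified sketch of this loop. All parameter values, the vote-propagation weight, and the callback interface are our illustrative choices, not the paper's exact algorithm:

```python
def prioritized_backmatch(votes, covisible, backmatch,
                          enough=100, max_images=20, min_verified=12):
    """Greedy prioritized back matching (sketch of Algorithm 3).
    `votes`: image -> initial vote count from forward matching.
    `covisible`: image -> images sharing 3D points in the SfM graph.
    `backmatch`: callable returning ratio-test matches for an image."""
    votes = dict(votes)
    pool, visited = [], set()
    while votes and len(visited) < max_images and len(pool) < enough:
        img = max(votes, key=votes.get)       # most-voted image next
        del votes[img]
        visited.add(img)
        matches = backmatch(img)
        pool.extend(matches)
        if len(matches) >= min_verified:      # only verified images vote
            for nb in covisible.get(img, []):
                if nb not in visited:
                    votes[nb] = votes.get(nb, 0) + len(matches)
    return pool
```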
4 Benchmark Evaluation
We evaluated our approach (Algorithm 4) on three different datasets: EngQuad, Dubrovnik, and Rome. Rome is a large dataset of 15,179 training and 1,000 test images. Dubrovnik is a popular dataset of 6,044 training and 800 test images whose SfM model is roughly aligned to geographic coordinates, allowing for quantitative metric evaluation. While EngQuad has fewer images, it is perhaps the most challenging due to the presence of strongly repeated structures in the modern architectural designs it depicts. When using P3P, we used EXIF metadata for EngQuad test images and ground-truth focal lengths from the SfM models for Dubrovnik and Rome. We also briefly analyze results on the city-wide SF0 dataset [12].


Dubrovnik correctness:
After carefully analyzing the original models provided for Dubrovnik, we found that the test set ground truth was often wrong, with extremely large focal lengths and misaligned 2D-3D correspondences. This in turn resulted in large errors in camera location and poor alignment between the projections of 3D points and the corresponding 2D features. These problems are evident in results published elsewhere. For example, [23] report better results using P4Pf [4] than using P3P with the given “true” focal lengths. This is contrary to what should be expected: knowing the ground-truth focal length (P3P) should outperform joint estimation of pose and focal length (P4Pf). Examples are shown in the supplementary material.
For this reason, we rebuilt a new version of the Dubrovnik “ground-truth” model using the same set of keypoints provided with the original dataset and the excellent SfM package COLMAP [25]. We aligned the new model with the original one using a RANSAC-based Procrustes analysis so that the scale is approximately metric. After alignment, only 3,853 of the recovered 6,844 images were located within 3 meters of their original position in the model, further validating our concerns. Our reconstruction provided ground truth for 777 of the 800 query images.
[Figure: results on the EngQuad and Dubrovnik datasets.]
Anytime performance:
The runtime of our algorithm for camera localization depends on two parameters: the number of query features that must pass the global ratio test during forward matching, and the number of back-matched correspondences required before termination. Setting these parameters trades off localization accuracy against faster execution times. Figure 3 shows the influence of these variables on the EngQuad dataset. We benchmarked forward matching times by randomly sampling query features until a desired number passed the global ratio test under fixed values of the back-matching parameter; similarly, we fixed the forward parameter and evaluated different values for the back-matching budget. In both cases, the range of values tested varies from 50 up to 500 matched features. Figure 3 shows the number of registered images under these different configurations, and the time spent to achieve each level of performance.
Experimental details:
We tested our localization pipeline using the following settings: for each dataset, we built a global kd-tree index using all model view descriptors. We request a fixed number of nearest neighbors and check 128 leaves. We use the same threshold across all of our ratio tests and set the two termination parameters to provide a good balance between camera localization and execution time. Algorithm 3 stops after 20 back-matched images, which is a generous setting on these datasets (in most cases termination is reached in fewer than 5 iterations). Experiments were performed using a single thread on an Intel i7-5930 CPU at 3.50 GHz. We used the implementations of [21, 22] for the EngQuad and rebundled Dubrovnik comparisons, running a single thread on an Intel i7-3770 CPU at 3.40 GHz. We used a generic vocabulary tree and the default parameters of [21] and [22]. Unfortunately, implementations of [36, 28] were not available.
Camera Localization:
We successfully localized all images in Dubrovnik, except one image in the corrected version using P4Pf. We achieved the smallest localization errors for all quartiles, and registered more images within the lower error threshold and fewer beyond the upper one. Despite finding a substantially higher number of inliers, our method yielded larger average errors with respect to the original Dubrovnik model due to the underlying defects in its ground truth. [28] and [36] (after RANSAC), which use a voting-based approximation to the rough camera location rather than the traditional match-and-RANSAC pipeline, report smaller localization errors but at the cost of longer runtimes. Finally, we successfully localized all query images in the Rome dataset using P4Pf. Rome suffers from inaccuracies similar to Dubrovnik’s, which resulted in the loss of one test image using P3P.
The benefits of our approach are more pronounced on EngQuad, due to its difficult characteristics. We localized more than 100 and 50 additional cameras w.r.t. [21] and [22] respectively, improving all localization errors except the first quartile using P4Pf. We obtain faster runtimes than [13, 36] while being competitive with those of [21, 22]. Notably, our approach adapts to the more difficult EngQuad dataset by spending more time retrieving images with sufficient correspondences. In Dubrovnik, on the other hand, we quickly recognize landmarks with the first or second top-ranked images, quickly retrieving sufficient putative correspondences and yielding faster localization times.
Location Retrieval:
We obtained an asymptotic recall of 66.63% on the SF0 dataset using the protocol of [12]. At 95% precision, the recall drops to 52.30% using the effective inlier count of [19], falling below the performance of other methods [6, 19, 1, 36] for location recognition. For this test we used less stringent parameter settings and back matched up to 50 images. We expect that tuning these parameters and using the re-ranking heuristics exploited by other methods would yield a better approach for such location retrieval problems.
5 Conclusion
Prior approaches to large-scale image localization have focused on reducing the density of the search space to quickly find discriminative correspondences. Here we have shown that retrieving multiple global nearest neighbors and filtering them using approximations to the local ratio test can quickly identify candidate regions of pose space. Such regions can be further refined by back matching to yield state-of-the-art results in camera localization, even for datasets with challenging globally repeated structure.
References
 [1] R. Arandjelović and A. Zisserman. DisLocation: Scalable descriptor distinctiveness for location recognition. In Asian Conference on Computer Vision, pages 188–204. Springer, 2014.
 [2] G. Baatz, O. Saurer, K. Köser, and M. Pollefeys. Large scale visual geolocalization of images in mountainous terrain. In Computer Vision–ECCV 2012, pages 517–530. Springer, 2012.
 [3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.

 [4] M. Bujnak, Z. Kukelova, and T. Pajdla. A general solution to the P4P problem for camera with unknown focal length. In Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
 [5] S. Cao and N. Snavely. Graph-based discriminative learning for location recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 700–707, 2013.
 [6] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, et al. Cityscale landmark identification on mobile devices. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 737–744. IEEE, 2011.
 [7] T. Dekel, S. Oron, M. Rubinstein, S. Avidan, and W. T. Freeman. Best-buddies similarity for robust template matching. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2021–2029. IEEE, 2015.
 [8] R. Díaz, S. Hallman, and C. C. Fowlkes. Detecting dynamic objects with multiview background subtraction. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 273–280. IEEE, 2013.

 [9] R. Díaz, M. Lee, J. Schubert, and C. C. Fowlkes. Lifting GIS maps into strong geometric context for scene understanding. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
 [10] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
 [11] L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspectivethreepoint problem for a direct computation of absolute camera position and orientation. In CVPR, 2011.
 [12] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide Pose Estimation using 3D Point Clouds. In ECCV, 2012.
 [13] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In European Conference on Computer Vision, pages 791–804. Springer, 2010.
 [14] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov. 2004.
 [15] L. Moisan, P. Moulon, and P. Monasse. Automatic homographic registration of a pair of images, with a contrario elimination of outliers. Image Processing On Line, 2:56–73, 2012.
 [16] P. Moulon, P. Monasse, and R. Marlet. Adaptive structure from motion with a contrario model estimation. In ACCV. 2012.
 [17] D. M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching. 1998.
 [18] M. Muja and D. G. Lowe. FLANN: Fast library for approximate nearest neighbors. In International Conference on Computer Vision Theory and Applications (VISAPP’09). INSTICC Press, 2009.
 [19] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys. Hyperpoints and fine vocabularies for largescale location recognition. In ICCV, 2015.
 [20] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Largescale location recognition and the geometric burstiness problem. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [21] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2D-to-3D matching. ICCV, 2011.
 [22] T. Sattler, B. Leibe, and L. Kobbelt. Improving imagebased localization by active correspondence search. In European Conference on Computer Vision, pages 752–765. Springer, 2012.
 [23] T. Sattler, C. Sweeney, and M. Pollefeys. On sampling focal length values to solve the absolute pose problem. In European Conference on Computer Vision, pages 828–843. Springer, 2014.
 [24] G. Schindler, M. Brown, and R. Szeliski. Cityscale location recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2007.
 [25] J. L. Schönberger and J.M. Frahm. Structurefrommotion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [26] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 [27] N. Snavely, I. Simon, M. Goesele, R. Szeliski, and S. M. Seitz. Scene Reconstruction and Visualization From Community Photo Collections. Proceedings of the IEEE, 98(8):1370–1390, Aug. 2010.
 [28] L. Svarm, O. Enqvist, F. Kahl, and M. Oskarsson. Cityscale localization for cameras with known vertical direction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
 [29] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate localization and pose estimation for large 3d models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2014.
 [30] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [31] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recognition with repetitive structures. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
 [32] P. H. Torr and A. Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000.
 [33] S. Wang, S. Fidler, and R. Urtasun. Holistic 3d scene understanding from a single geotagged image. 2015.
 [34] C. Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision (3DV), pages 127–134. IEEE, 2013.
 [35] A. R. Zamir and M. Shah. Image geolocalization based on multiple nearest neighbor feature matching using generalized graphs. IEEE transactions on pattern analysis and machine intelligence, 36(8):1546–1558, 2014.
 [36] B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for largescale imagebased localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2704–2712, 2015.