1 Introduction
Image-based camera relocalization is the problem of estimating the 6 DoF camera pose in an environment from a single image. It plays a crucial role in computer vision and robotics, and is a key component of a wide range of applications, such as pedestrian localization, navigation of autonomous robots [8], simultaneous localization and mapping (SLAM) [9], and augmented reality [5, 25].
Many conventional localization methods proposed in the literature [31, 32, 33] are based on handcrafted local image features, such as SIFT, ORB, or SURF [24, 30, 1]. These methods usually require a 3D point cloud model in which each 3D point is associated with the image features from which it was triangulated. Given a query image, its 6 DoF camera pose is recovered by first finding a large set of matches between 2D image features and 3D points in the model via descriptor matching, and then using a RANSAC-based [10] strategy to reject outlier matches and estimate the camera pose from the inliers. However, these local image features are limited by their handcrafted feature detectors and descriptors. They are not robust enough for localization in challenging scenarios, which limits the applicability of these conventional methods.
Neural network based localization approaches have recently been explored. For example, PoseNet [19] utilizes a convolutional neural network that takes a single RGB image as input and directly regresses the 6 DoF camera pose relative to a scene. Since these methods formulate camera pose estimation as a regression problem solved by a neural network, no conventional handcrafted features are required. These methods can successfully predict camera poses in challenging environments where the conventional methods fail, but their overall accuracy still lags behind the conventional ones. An alternative neural network based solution is to keep the two-stage pipeline of conventional methods and formulate the first stage of the pipeline, generating 2D-3D correspondences, as a learning problem. For example, the recently presented DSAC pipeline [2] predicts the 6 DoF pose by first regressing, for image patches, their 3D positions in the scene coordinate frame, and then determining the camera pose via a RANSAC-based scheme using the produced correspondences. The regression step in the first stage is the so-called scene coordinate regression, and the 3D positions are the scene coordinates. Results have shown that these methods achieve state-of-the-art localization accuracy.
As the improved version of DSAC, DSAC++ [4] has demonstrated that scene coordinate regression can be learned even without ground truth scene coordinates. A new three-step training strategy is proposed for training the DSAC++ pipeline. When no ground truth scene coordinates are available, the first training step initializes the network using approximate scene coordinates. In the second training step, the network is further trained by optimizing the reprojection loss, measured as the sum of residuals between the projected 3D scene coordinate predictions and the ground truth 2D points in the image plane. Finally, an end-to-end optimization step is performed. The second step is crucial for the network to discover scene geometry, and the first step is necessary for the second step to work. However, initializing the network using approximate scene coordinates that are far from the ground truth ones might also degrade localization accuracy.
In order to train the network without an initialization step, we propose in this work a novel loss function for learning scene coordinate regression, which we call the angle-based reprojection loss. This new loss function has better properties than the original reprojection loss, and thus careful initialization is not required. In addition, it allows us to exploit multi-view constraints.
The contributions of this paper can be summarized as follows:

- We present a novel angle-based reprojection loss, which enables training of the coordinate network without careful initialization and improves localization accuracy.

- We show that, based on this new loss function, we can incorporate multi-view constraints to further improve the accuracy.
2 Related Work
Image-based camera localization in large-scale environments is often addressed as an image retrieval problem [46]. These methods rely on efficient and scalable retrieval approaches to determine the location of a query image. Typically, the query image is matched to a database of geo-tagged images and the most similar images are retrieved. The location of the query image is then estimated according to the known locations of the retrieved images. These approaches can scale to extremely large urban environments and even entire cities. However, they can only produce an approximate location of the camera and are unable to output an exact 6 DoF camera pose estimate.
To directly obtain the 6 DoF camera pose with respect to a scene, approaches based on sparse local features such as SIFT [24] have been proposed. Instead of matching a query image to geo-tagged database images, these methods usually use a reconstructed 3D point cloud model obtained from Structure-from-Motion [38] to represent the scene. Each 3D point in the model is associated with one or more local image feature descriptors. Thus, a set of 2D-3D correspondences can be established by matching local image features to 3D points in the model. Based on the 2D-3D correspondences, the camera pose of the query image can be determined by a Perspective-n-Point [21] solver inside a RANSAC [10] loop. To scale these methods up to large environments, an efficient and effective descriptor matching step is needed. Therefore, techniques such as prioritized matching [31], intermediate image retrieval [15, 34], geometric outlier filtering [45], and co-visibility filtering [32] have been proposed. However, due to limitations of the handcrafted features, these approaches can fail in challenging scenarios.
PoseNet [19] was the first approach to tackle the problem of 6 DoF camera relocalization with deep learning. PoseNet demonstrates the feasibility of directly regressing the 6 DoF camera pose from a query RGB image via a deep CNN. Later works have extended this method for better accuracy. For example, in [17], the authors explore Bayesian neural networks to produce relocalization uncertainty for the predicted pose. LSTM-Pose [42] applies LSTM units [14] to the output of the CNN to extract better features for localization. Another variant, Hourglass-Pose [27], makes use of an encoder-decoder hourglass architecture, which preserves the fine-grained information of the input image. Moreover, [18] demonstrates that PoseNet performance can be improved by leveraging geometric loss functions. However, these methods are still outperformed by conventional sparse feature based methods. More recently, two multi-task models, VLocNet [40] and VLocNet++ [29], have been introduced. These models operate on consecutive monocular images and utilize auxiliary learning during training. Remarkably, they offer promising localization performance that surpasses the sparse feature based methods.
Unlike PoseNet, Laskar et al. introduce a deep learning based localization framework which relies on image retrieval and relative camera pose estimation [22]. This method requires no scene-specific training and generalizes well to previously unseen scenes. Similarly, Taira et al. put forth an image retrieval based localization system for large-scale indoor environments [39]. After retrieval of candidate poses, the pose is estimated based on dense matching. A final pose verification step via virtual view synthesis further improves the robustness of the system. To achieve robust visual localization under a wide range of viewing conditions, semantic information has also been exploited in [36].
Our work is most closely related to methods based on the scene coordinate regression framework. The original scene coordinate regression pipeline was proposed for RGB-D camera relocalization [37]. This method formulates descriptor matching as a regression problem and applies a regression forest [7] to produce 2D-3D correspondences from an RGB-D input image. Similarly to a sparse feature based pipeline, the final camera pose is recovered from the correspondences via a RANSAC-based solver. Since generating correspondences is achieved directly by the regression forest, no traditional feature extraction, feature description, or feature matching processes are required. This method has been further extended in later works [13, 41, 6]. Currently, practical low-cost devices are usually equipped with RGB cameras only; however, these methods still require a depth signal at both training and test time.
To localize RGB-only images, an auto-context random forest is adopted in the pipeline of [3]. In [26], Massiceti et al. explored a random-forest-to-neural-network mapping strategy. Recently, Brachmann et al. proposed a differentiable version of RANSAC (DSAC) [2] and presented a localization system based on it. In the DSAC pipeline, two CNNs are adopted, one for predicting scene coordinates and one for scoring hypotheses. Since the entire pipeline is differentiable, an end-to-end optimization step can be performed using ground truth camera poses. In contrast to DSAC, which adopts a patch-based network for scene coordinate regression, a full-frame network considering global image appearance is presented in [23], together with a data augmentation strategy to ensure accuracy. However, all of these methods require ground truth scene coordinates for training. DSAC++ [4], the successor of DSAC, demonstrates the feasibility of learning scene coordinate regression without scene coordinate ground truth. This is achieved by first initializing the predictions with approximate scene coordinates and then optimizing reprojection errors. This method is the current state-of-the-art on the 7-Scenes dataset [37] and the Cambridge Landmarks dataset [19].
3 Method
In this work, we follow the two-stage pipeline of DSAC and DSAC++ for RGB-only camera relocalization. In the first stage, given an RGB image, a coordinate CNN generates dense 3D scene coordinate predictions to form 2D-3D correspondences. In the second stage, a RANSAC-based scheme generates scored pose hypotheses and determines the final pose estimate, which can be further refined. The overall pipeline is illustrated in Figure 1.
DSAC first demonstrated how to incorporate CNNs into the two-stage pipeline and achieved state-of-the-art results. DSAC++ further addresses the main shortcomings of the DSAC pipeline and achieves substantially more accurate results. More importantly, unlike previous works, which require scene coordinate ground truth generated from RGB-D training images or an available 3D model during training, DSAC++ is able to discover the 3D scene geometry automatically. That is, the coordinate CNN can be trained from RGB images and their ground truth camera poses only; no ground truth scene coordinates or depth information is needed. This is achieved by first using approximate scene coordinates to initialize the network and then optimizing the reprojection loss, which is calculated using the ground truth poses. In the initialization step, the approximate scene coordinates are generated to have a constant distance from the camera plane and serve as dummy ground truth for training. The reprojection loss can then help the network to recover the scene geometry, but it may not work without the initialization step, even though the initialization step by itself provides poor localization performance. However, the DSAC++ pipeline still has a main drawback: a proper value for the constant distance must be selected carefully, since a value that is far off the actual range of distances can result in poor accuracy [4].
In the following, we present a new angle-based reprojection loss that allows us to train the coordinate CNN without careful initialization. We explain the deficiencies of the original reprojection loss and present our new loss function in Section 3.1. This new loss function also enables us to utilize available multi-view reprojection and photo-consistency constraints, which we discuss in Sections 3.2 and 3.3, respectively.
3.1 Angle-Based Reprojection Loss
For a training image I, the reprojection loss is calculated using the scene coordinate predictions and the ground truth pose. It can be written as (we omit the conversions between homogeneous and Cartesian coordinates for simplicity, here and in the equations in the rest of the paper):
\ell(I) = \sum_{i \in I} \left\| K h_I^{-1} y_i(I; w) - p_i \right\|    (1)
where K is the intrinsic matrix, h_I is the camera pose for image I, i ranges over the set of all points in image I, y_i(I; w) is the scene coordinate prediction for point i given input image I and learnable parameters w, and p_i is the 2D position of point i in image I.
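As a concrete reference, the reprojection loss above can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the authors' code; we take the pose to be a 4x4 camera-to-world matrix, so its inverse maps scene coordinates into the camera frame.

```python
import numpy as np

def reprojection_loss(K, h, scene_coords, pixels):
    """Sum of reprojection residuals for one image, as in Eq. (1).

    K            -- 3x3 intrinsic matrix
    h            -- 4x4 camera-to-world pose; its inverse maps scene to camera coords
    scene_coords -- (N, 3) predicted scene coordinates
    pixels       -- (N, 2) ground truth 2D positions
    """
    h_inv = np.linalg.inv(h)
    # Transform predictions into the camera coordinate frame.
    cam = (h_inv[:3, :3] @ scene_coords.T + h_inv[:3, 3:4]).T
    # Pinhole projection; the division is unstable when z is near zero.
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - pixels, axis=1).sum()
```

Note the division by the depth in the projection, which is the source of the instabilities discussed below.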
As mentioned above, the DSAC++ pipeline requires a carefully selected constant distance for generating approximate scene coordinates when training without ground truth scene coordinates, and improper selection of this distance can lead to poor performance. Therefore, one may wonder if it is possible to train the network without an initialization step. Unfortunately, if we train the network directly using the reprojection loss without the initialization step, training may not converge due to unstable gradients. The unstable gradients arise when minimizing the reprojection loss at the beginning of training, when the predictions of the network can be behind the camera or close to the camera center.
The shortcomings of the reprojection loss are illustrated in Figure 2. As we can see, the reprojection loss does not constrain the predictions to be in front of the camera, and the loss can be 0 even when the predictions are behind the camera. Therefore, in such cases, the reprojection loss cannot help the network to discover the true geometry of the scene. In addition, when the z-coordinate of a prediction in the camera coordinate frame is close to zero, the corresponding projected point in the image plane can be extremely far away from the ground truth one, resulting in extremely large loss values and exploding gradients. More importantly, we believe the ability of the network to discover the scene geometry automatically comes from the patch-based nature of the predictions. The output neurons of the fully convolutional network used in the DSAC++ pipeline have a limited receptive field. Since local patch appearance is relatively invariant w.r.t. viewpoint change, the explicitly applied single-view reprojection loss can be considered an implicit multi-view constraint. However, when the predictions are behind the camera, minimizing the reprojection loss to fulfill multi-view constraints can lead to inconsistent gradients. This again prevents the network from discovering the true geometry of a scene.
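These failure modes are easy to verify numerically. The sketch below (with arbitrary illustrative intrinsics of our choosing) shows a prediction behind the camera that attains zero reprojection error, and a near-zero depth that makes the error explode:

```python
import numpy as np

def project(K, point_cam):
    """Pinhole projection of a 3D point given in camera coordinates."""
    p = K @ point_cam
    return p[:2] / p[2]

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
gt_pixel = np.array([320., 240.])

# A point *behind* the camera still lands exactly on the ground truth pixel,
# so the reprojection loss is zero for a geometrically wrong prediction.
err_behind = np.linalg.norm(project(K, np.array([0., 0., -2.])) - gt_pixel)

# As the depth approaches zero, the projected point, the loss, and its
# gradients all blow up.
err_near_zero = np.linalg.norm(project(K, np.array([0.1, 0., 1e-6])) - gt_pixel)
```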
The aforementioned deficiencies restrict the use of the reprojection loss. In particular, before using the reprojection loss, we need to initialize the predictions to be in front of the camera using a constant distance. However, this might lead to overfitting to the dummy ground truth scene coordinates, and thus affect the localization accuracy.
To address these issues, we present a new angle-based reprojection loss. Instead of minimizing the distance between the projected ground truth and the projected prediction in the image plane, we seek to minimize the angle between two rays that share the camera center as a common endpoint and pass through the scene coordinate prediction and the ground truth, respectively. To be specific, we minimize the distance between the two points at which the two rays intersect a sphere centered at the camera center, with radius equal to the distance between the projected ground truth and the camera center, as shown in Figure 3. More formally, for a training image I, the angle-based reprojection loss is defined as (we assume f_x = f_y = f in this paper, but the loss function is also applicable when f_x ≠ f_y):
\ell_a(I) = \sum_{i \in I} \left\| r_i \left( \frac{e_i}{\|e_i\|} - \frac{g_i}{\|g_i\|} \right) \right\|    (2)
where f is the focal length, e_i = h_I^{-1} y_i(I; w) is the scene coordinate prediction transformed into the camera coordinate frame, g_i = (u_i - c_x, v_i - c_y, f)^T is the ground truth pixel p_i = (u_i, v_i)^T back-projected onto the image plane, and r_i = \|g_i\| is the sphere radius.
As we can see, the new angle-based reprojection loss forces the predictions to be in front of the camera, since the loss function takes large values when predictions are behind the camera. Moreover, it no longer causes unstable gradients when predictions are behind the camera or the z-coordinates of the predictions in the camera frame are close to zero. In addition, it is easy to see that the new loss function is approximately equal to the original one when predictions are close to the ground truth scene coordinates. Assume e_i = (x_i, y_i, z_i)^T, g_i = (u_i - c_x, v_i - c_y, f)^T, and that the ground truth scene coordinate lies on the ray through g_i. When predictions are close to the ground truth, we have r_i / \|e_i\| ≈ f / z_i. Therefore:
\left\| r_i \left( \frac{e_i}{\|e_i\|} - \frac{g_i}{\|g_i\|} \right) \right\| = \left\| \frac{r_i}{\|e_i\|} e_i - g_i \right\| \approx \left\| \frac{f}{z_i} e_i - g_i \right\| = \left\| \left( \frac{f x_i}{z_i} - (u_i - c_x),\ \frac{f y_i}{z_i} - (v_i - c_y),\ 0 \right)^T \right\|    (3)
where c_x and c_y are the parameters for the principal point offset. This is the reason that we call the new loss function the angle-based reprojection loss.
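A minimal NumPy sketch of this construction (our own code, not the authors' implementation, and assuming f_x = f_y = f as above): the prediction is transformed into the camera frame, the ground truth pixel is back-projected onto the image plane to obtain the second ray, and both unit rays are scaled by the sphere radius before taking the residual.

```python
import numpy as np

def angle_based_loss(K, h, scene_coords, pixels):
    """Angle-based reprojection loss for one image, as in Eq. (2)."""
    f, cx, cy = K[0, 0], K[0, 2], K[1, 2]
    h_inv = np.linalg.inv(h)
    # Predictions in the camera coordinate frame (one ray per point).
    e = (h_inv[:3, :3] @ scene_coords.T + h_inv[:3, 3:4]).T
    # Ground truth pixels back-projected onto the image plane (second ray).
    g = np.stack([pixels[:, 0] - cx, pixels[:, 1] - cy,
                  np.full(len(pixels), f)], axis=1)
    r = np.linalg.norm(g, axis=1, keepdims=True)  # sphere radius per point
    e_unit = e / np.linalg.norm(e, axis=1, keepdims=True)
    # Distance between the two ray-sphere intersection points.
    return np.linalg.norm(r * (e_unit - g / r), axis=1).sum()
```

A prediction behind the camera now yields a large, well-behaved penalty instead of a spurious zero, and no division by the depth occurs.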
Note that, in order to minimize the angle between the two rays, one may also maximize the dot product of the two unit vectors of the rays, but the resulting loss function no longer has the property of being approximately equal to the reprojection loss when predictions are close to the ground truth scene coordinates.
3.2 Multi-View Reprojection Loss
Although with only the single-view reprojection loss the network can already approximately recover the scene geometry, it can still be difficult for the network to learn accurate scene coordinate predictions, e.g., when ground truth poses are erroneous or there are ambiguities in the scene. Therefore, if we exploit multi-view constraints explicitly during training, we may achieve more accurate localization performance.
One way to incorporate multi-view constraints is to utilize available multi-view correspondences to form a multi-view reprojection loss. The correspondences might come from a reconstructed 3D model or can be generated directly by correspondence matching.
Since the original reprojection loss is problematic when predictions are behind the camera, it is even more likely to cause problems in the multi-view case. Therefore, we formulate the multi-view reprojection loss based on our new angle-based reprojection loss. For one training image, the loss function can be written as:
\ell_m(I) = \sum_{i \in S_I} a_i(I) + \lambda \sum_{i \in M_I} \sum_{I' \in V_i \cup \{I\}} a_i(I')    (4)
Here a_i(I') denotes the summand of Eq. (2) for point i, evaluated with the camera pose of image I' and the 2D position of the point corresponding to i in I'.
In Eq. (4), λ is the balance weight, S_I is the set of points in image I that have no corresponding points, M_I is the set of points in image I that have corresponding points in other training images, and V_i is the set of images in which point i has a corresponding point. As we can see, the first term encodes single-view constraints, and the second term encodes multi-view constraints.
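The per-point structure of this loss can be sketched as follows. Here `corr`, the mapping from a point index to its observations in other images, and the default value of the balance weight `lam` are our own illustrative choices, not values from the paper:

```python
import numpy as np

def angle_term(K, h, y, p):
    """Angle-based residual for one 3D prediction y and one pixel p (cf. Eq. (2))."""
    f, cx, cy = K[0, 0], K[0, 2], K[1, 2]
    h_inv = np.linalg.inv(h)
    e = h_inv[:3, :3] @ y + h_inv[:3, 3]
    g = np.array([p[0] - cx, p[1] - cy, f])
    r = np.linalg.norm(g)
    return np.linalg.norm(r * (e / np.linalg.norm(e) - g / r))

def multiview_loss(K, pose, pixels, scene_coords, corr, lam=0.5):
    """Multi-view reprojection loss for one training image (cf. Eq. (4)).

    corr maps a point index to a list of (pose', pixel') observations of the
    same scene point in other training images.
    """
    loss = 0.0
    for i, (y, p) in enumerate(zip(scene_coords, pixels)):
        term = angle_term(K, pose, y, p)
        if i in corr:  # matched points: single-view term plus multi-view terms
            loss += lam * (term + sum(angle_term(K, h2, y, p2)
                                      for h2, p2 in corr[i]))
        else:          # unmatched points: single-view constraint only
            loss += term
    return loss
```

A prediction that is consistent with all observing views contributes zero loss, while a prediction that satisfies only one view is penalized through the pose of the other views.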
3.3 Photometric Reconstruction Loss
If the training data consists of sequences of images, it is also possible to constrain the scene coordinate predictions using a photometric reconstruction loss. Photometric reconstruction loss has recently become a dominant strategy for solving many geometry-related learning problems. It enables neural networks to learn many tasks, such as monocular depth [11, 12], ego-motion [47], and optical flow [44], in a self-supervised manner.
For learning scene coordinate regression, given a training image I and a neighboring image I', we can reconstruct the training image by sampling pixels from the neighboring image based on the scene coordinate predictions and the pose of the neighboring image. Specifically, for a point i in the training image I, we can project it to the neighboring image and obtain its 2D coordinates p'_i in the neighboring image:
p'_i = K h_{I'}^{-1} y_i(I; w)    (5)
Then, the pixel value for point i in the reconstructed image \tilde{I} can be sampled with a bilinear sampler at the coordinates p'_i. That is, the sampled pixel value is the weighted sum of the values of the four pixels neighboring p'_i.
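The bilinear sampling step can be sketched as below (our own minimal version; image-boundary handling is omitted):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img at continuous coordinates (x, y) as the weighted sum of
    the four neighbouring pixel values (no boundary handling)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])
```

Because the four weights vary smoothly with (x, y), gradients can flow through the sampled coordinates back to the scene coordinate predictions.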
Following [12], our photometric reconstruction loss is a combination of an L1 loss term and a simplified SSIM loss term with a block filter:
\ell_p(I, I') = \sum_{i \in I} \left( \alpha \, \frac{1 - \mathrm{SSIM}(I_i, \tilde{I}_i)}{2} + (1 - \alpha) \left\| I_i - \tilde{I}_i \right\|_1 \right)    (6)
where α is the weight balancing the two terms, and I_i and \tilde{I}_i denote the pixel values at point i in the original and reconstructed images, respectively.
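A simplified sketch of this combination, with the SSIM statistics computed over a whole patch (i.e., a single block-filter window) rather than per pixel; the constants `c1`, `c2` and the default `alpha` follow common practice in the self-supervised depth literature and are our assumptions, not values from the paper:

```python
import numpy as np

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity of two patches, computed over one window."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_loss(orig, recon, alpha=0.85):
    """Weighted combination of an SSIM term and an L1 term, cf. Eq. (6)."""
    l1 = np.abs(orig - recon).mean()
    return alpha * (1 - ssim(orig, recon)) / 2 + (1 - alpha) * l1
```

The loss is zero for a perfect reconstruction and grows with both structural and per-pixel differences.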
Since at the beginning of training the scene coordinate predictions can be quite far from the ground truth ones, the projected 2D coordinates of points in the neighboring image might also be far from valid pixels. In other words, we are unable to obtain valid gradients by minimizing the photometric reconstruction loss alone, so using only the photometric reconstruction loss for learning scene coordinate regression may not work. Thus, we pair it with our angle-based reprojection loss, and the final loss function can be written as:
\ell(I) = \ell_a(I) + \lambda_p \, \ell_p(I, I')    (7)
4 Experiments
We evaluate the performance of our approach on the 7-Scenes dataset [37] and the Cambridge Landmarks dataset [19], which are described in Section 4.1. The implementation details are explained in Section 4.2 and the results are given in Section 4.3.
4.1 Evaluation Dataset
The 7-Scenes dataset [37] is a widely used RGB-D dataset provided by Microsoft Research. It consists of seven different indoor environments, and exhibits several challenges, such as motion blur, illumination changes, textureless surfaces, repeated structures, reflections, and sensor noise. Each scene contains multiple sequences of tracked RGB-D camera frames, and these sequences are split into training and testing data. The RGB-D images are captured using a handheld Kinect camera at 640x480 resolution and are associated with 6 DoF ground truth camera poses obtained via the KinectFusion system [16, 28].
The Cambridge Landmarks dataset [19] is an outdoor RGB relocalization dataset consisting of six scenes around Cambridge University. Each scene comprises hundreds or thousands of frames, which are divided into training and test sequences. The 6 DoF ground truth camera poses of the images are generated with structure-from-motion techniques [43].
In contrast to some previous works, ground truth scene coordinates are not necessary for learning scene coordinate regression in this work. Therefore, we use only RGB images at both training and test time.
4.2 Implementation Details
Note that the following implementation details are for the experiments on the 7-Scenes dataset. For the experiments on the Cambridge Landmarks dataset, we simply follow DSAC++ [4], except that we train the network with our angle-based reprojection loss only, halve the learning rate on a fixed schedule during the later part of training, and perform no gradient clipping.
We adopt the FCN network architecture used in the DSAC++ pipeline, which takes a 640x480 image as input and produces 80x60 scene coordinate predictions. For implementation simplicity, unlike DSAC++, in the second stage of the pipeline we use a non-differentiable version of the RANSAC-based scheme for pose estimation, which scores hypotheses using the inlier count instead of a soft inlier count. The details of the pose estimation algorithm are explained in [23]. In this paper, we mainly focus on the loss functions for learning scene coordinate regression in the first stage of the pipeline, and thus the entire pipeline is not required to be end-to-end trainable. However, it is also straightforward to adopt the differentiable optimization strategy and perform an extra end-to-end optimization step at the end of training, as described in [4].
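The hard (non-differentiable) hypothesis scoring can be sketched as follows. This is only the scoring step, whereas the full algorithm of [23] also covers hypothesis sampling and refinement; the threshold value is our own placeholder:

```python
import numpy as np

def inlier_count(K, h, scene_coords, pixels, tau=10.0):
    """Score a pose hypothesis h by its hard inlier count: the number of
    correspondences whose reprojection error is below tau pixels."""
    h_inv = np.linalg.inv(h)
    cam = (h_inv[:3, :3] @ scene_coords.T + h_inv[:3, 3:4]).T
    in_front = cam[:, 2] > 0          # ignore predictions behind the camera
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    err = np.linalg.norm(proj - pixels, axis=1)
    return int(np.sum(in_front & (err < tau)))
```

The hypothesis with the highest count wins; replacing the hard count with a sigmoid of the error would recover a soft inlier count in the spirit of DSAC++.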
We train the network from scratch, without the scene coordinate initialization step, with a batch size of 1 using the Adam [20] optimizer. The initial learning rate is halved several times over the course of training. Following [4], training images are randomly shifted by a maximum of 8 pixels, horizontally and vertically, to make full use of the training data, since the network produces one prediction for each image block.
When training the network using the multi-view reprojection loss, for efficiency, we do not use all the correspondences to calculate the latter term of the loss. In particular, for each point i we evaluate the multi-view term using only a single image I' randomly selected from V_i, instead of summing over all images in V_i. The co-visibility information used to form the multi-view constraints is extracted from a sparse model constructed with Colmap [35]. Moreover, the balance weight λ is set empirically.
When computing the photometric reconstruction loss, the neighboring image is randomly selected from the same sequence with an index difference of at most 10. Since the network produces 80x60 scene coordinate predictions, we resize both the training images (after shifting) and the neighboring images to this resolution, and the camera intrinsic parameters are adjusted accordingly. The balance weight λ_p is set to 20.
4.3 Results
The experimental results of our method on the 7-Scenes dataset and the Cambridge Landmarks dataset are given in Table 1. For the Cambridge Landmarks dataset, we report median localization errors of our method trained with single-view constraints only. For the 7-Scenes dataset, we report both median localization errors and accuracy measured as the percentage of query images for which the camera pose error is below 5° and 5 cm, as the latter better represents localization performance. Our method is compared to the state-of-the-art DSAC++ method trained without an accurate 3D scene model.
5°, 5cm (Median Error)  
Scene  DSAC++ [4]  Ours  Ours+multiview  Ours+photometric 
Chess  93.8% (2cm, )  95.1% (2cm, )  96.1% (2cm, )  96.0% (2cm, ) 
Fire  75.6% (3cm, )  84.0% (3cm, )  88.6% (2cm, )  86.4% (2cm, ) 
Heads  18.4% (12cm, )  80.5% (2cm, )  86.9% (2cm, )  83.2% (2cm, ) 
Office  75.4% (3cm, )  80.4% (3cm, )  80.6% (3cm, )  81.6% (3cm, ) 
Pumpkin  55.9% (5cm, )  56.8% (4cm, )  60.3% (4cm, )  59.2% (4cm, ) 
Red Kitchen  50.7% (5cm, )  59.9% (4cm, )  61.9% (4cm, )  60.0% (4cm, ) 
Stairs  2.0% (29cm, )  2.9% (25cm, )  11.3% (13cm, )  4.7% (22cm, ) 
Complete  60.4%  69.2%  71.8%  70.4% 
Great Court  (66cm, )  (51cm, )     
K. College  (23cm, )  (18cm, )     
Old Hospital  (24cm, )  (19cm, )     
Shop Facade  (9cm, )  (7cm, )     
St M. Church  (20cm, )  (25cm, )     
According to the results, for the 7-Scenes dataset, when training the network using our new angle-based reprojection loss without initializing the network predictions with a constant distance value, the fraction of accurately localized test images is improved by 8.8 percentage points compared to DSAC++. Similarly, for the Cambridge Landmarks dataset, our method achieves better median pose accuracy on four out of five scenes. Moreover, for the 7-Scenes dataset, the accuracy can be further improved by utilizing either multi-view correspondences or the photometric reconstruction metric. Compared to using single-view constraints only, we observed that the additional loss terms make the network produce more accurate scene coordinate predictions during training. This indicates that incorporating multi-view constraints helps the network better discover scene geometry, and thus leads to improved localization performance.
We also attempted to train the network using the original reprojection loss without the scene coordinate initialization step. However, we found that for most of the scenes, training could not converge in either the single-view or the multi-view case. In addition, when using the photometric reconstruction loss, for all scenes, we observed that training the network without the angle-based reprojection loss term would always get stuck in a local minimum with large scene coordinate prediction error, resulting in a completely failed localization system.
When training with either the multi-view reprojection loss or the photometric reconstruction loss, the balance weight is important for achieving good localization results. If the weight is set too small, the multi-view constraints have no effect on training. However, if it is set too large, the localization accuracy can also drop. For example, when training with the photometric reconstruction loss with a larger balance weight, test accuracy for Office decreases to 79.5%. It would be interesting to explore the learnable weighting strategy presented in [18] in future work.
Note that our method is still less accurate than the DSAC++ method trained with a 3D model (76.1%), as our method is trained without access to adequate scene geometry information. When training with the multi-view reprojection loss, only a sparse set of points have corresponding points in other images (typically about 1000 points for a 640x480 image). That is, the multi-view constraints are active for only a small portion of the points in a training image. We believe that with denser correspondence information, our method could achieve better results. For the photometric reconstruction loss, adding an additional loss term similar to the left-right consistency proposed in [12], which enforces predictions to be consistent between different views, might also help to further improve the accuracy, but we did not explore this.
5 Conclusion
In this work, we have presented a new angle-based reprojection loss for learning scene coordinate regression for image-based camera relocalization. Our novel loss function makes it possible to train the coordinate CNN without a scene coordinate initialization step, resulting in improved localization accuracy. Moreover, this novel loss function allows us to exploit available multi-view constraints, which can further improve performance.
Acknowledgements
The authors acknowledge funding from the Academy of Finland (grant numbers 277685, 309902). This work has also been partially supported by the grant "Deep in France" (ANR-16-CE23-0006) and LabEx PERSYVAL (ANR-11-LABX-0025-01).
References
 [1] Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded up robust features. In: ECCV (2006)
 [2] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC - differentiable RANSAC for camera localization. In: CVPR (2017)
 [3] Brachmann, E., Michel, F., Krull, A., Yang, M.Y., Gumhold, S., Rother, C.: Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In: CVPR (2016)
 [4] Brachmann, E., Rother, C.: Learning less is more - 6D camera localization via 3D surface regression. In: CVPR (2018)
 [5] Castle, R.O., Klein, G., Murray, D.W.: Video-rate localization in multiple maps for wearable augmented reality. In: ISWC (2008)
 [6] Cavallari, T., Golodetz, S., Lord, N.A., Valentin, J.P.C., di Stefano, L., Torr, P.H.S.: On-the-fly adaptation of regression forests for online camera relocalisation. In: CVPR (2017)
 [7] Criminisi, A., Shotton, J.: Decision Forests for Computer Vision and Medical Image Analysis. Springer Publishing Company, Incorporated (2013)
 [8] Cummins, M., Newman, P.: FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27(6), 647–665 (2008)
 [9] Eade, E., Drummond, T.: Scalable monocular SLAM. In: CVPR (2006)
 [10] Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
 [11] Garg, R., G, V.K.B., Reid, I.D.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: ECCV (2016)
 [12] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)
 [13] Guzmán-Rivera, A., Kohli, P., Glocker, B., Shotton, J., Sharp, T., Fitzgibbon, A.W., Izadi, S.: Multi-output learning for camera relocalization. In: CVPR (2014)

 [14] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
 [15] Irschara, A., Zach, C., Frahm, J.M., Bischof, H.: From structure-from-motion point clouds to fast location recognition. In: CVPR (2009)
 [16] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R.A., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A.J., Fitzgibbon, A.W.: Kinectfusion: realtime 3d reconstruction and interaction using a moving depth camera. In: UIST (2011)
 [17] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: ICRA (2016)
 [18] Kendall, A., Cipolla, R.: Geometric loss functions for camera pose regression with deep learning. In: CVPR (2017)
 [19] Kendall, A., Grimes, M., Cipolla, R.: Posenet: A convolutional network for realtime 6dof camera relocalization. In: ICCV (2015)
 [20] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
 [21] Kneip, L., Scaramuzza, D., Siegwart, R.: A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In: CVPR (2011)
 [22] Laskar, Z., Melekhov, I., Kalia, S., Kannala, J.: Camera relocalization by computing pairwise relative poses using convolutional neural network. In: ICCVW (2017)
 [23] Li, X., Ylioinas, J., Kannala, J.: Full-frame scene coordinate regression for image-based localization. In: RSS (2018)
 [24] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
 [25] Lynen, S., Sattler, T., Bosse, M., Hesch, J.A., Pollefeys, M., Siegwart, R.: Get out of my lab: Large-scale, real-time visual-inertial localization. In: RSS (2015)
 [26] Massiceti, D., Krull, A., Brachmann, E., Rother, C., Torr, P.H.S.: Random forests versus neural networks – what's best for camera relocalization? In: ICRA (2017)
 [27] Melekhov, I., Ylioinas, J., Kannala, J., Rahtu, E.: Image-based localization using hourglass networks. In: ICCVW (2017)
 [28] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.W.: KinectFusion: Real-time dense surface mapping and tracking. In: ISMAR (2011)
 [29] Radwan, N., Valada, A., Burgard, W.: VLocNet++: Deep multi-task learning for semantic visual localization and odometry. CoRR abs/1804.08366 (2018)
 [30] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: ORB: an efficient alternative to SIFT or SURF. In: ICCV (2011)
 [31] Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2D-to-3D matching. In: ICCV (2011)
 [32] Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: ECCV (2012)
 [33] Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(9), 1744–1756 (2017)
 [34] Sattler, T., Torii, A., Sivic, J., Pollefeys, M., Taira, H., Okutomi, M., Pajdla, T.: Are large-scale 3D models really necessary for accurate visual localization? In: CVPR (2017)
 [35] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
 [36] Schönberger, J.L., Pollefeys, M., Geiger, A., Sattler, T.: Semantic visual localization. In: CVPR (2018)
 [37] Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.W.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)
 [38] Snavely, N., Seitz, S.M., Szeliski, R.: Modeling the world from internet photo collections. International Journal of Computer Vision 80(2), 189–210 (2008)
 [39] Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A.: InLoc: Indoor visual localization with dense matching and view synthesis. In: CVPR (2018)
 [40] Valada, A., Radwan, N., Burgard, W.: Deep auxiliary learning for visual localization and odometry. In: ICRA (2018)
 [41] Valentin, J.P.C., Nießner, M., Shotton, J., Fitzgibbon, A.W., Izadi, S., Torr, P.H.S.: Exploiting uncertainty in regression forests for accurate camera relocalization. In: CVPR (2015)
 [42] Walch, F., Hazirbas, C., Leal-Taixé, L., Sattler, T., Hilsenbeck, S., Cremers, D.: Image-based localization using LSTMs for structured feature correlation. In: ICCV (2017)
 [43] Wu, C.: Towards linear-time incremental structure from motion. In: 3DV (2013)
 [44] Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In: ECCVW (2016)
 [45] Zeisl, B., Sattler, T., Pollefeys, M.: Camera pose voting for large-scale image-based localization. In: ICCV (2015)
 [46] Zhang, W., Kosecka, J.: Image-based localization in urban environments. In: 3DPVT (2006)
 [47] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)