I Introduction
Camera relocalization, or 6 degrees of freedom (6DOF) pose estimation, refers to the problem of estimating the pose (position and orientation) of a camera from an image. It is an active research topic in structure from motion (SfM), simultaneous localization and mapping (SLAM) and robotics, and it is also an essential component of autonomous driving and navigation.
The Global Positioning System (GPS) has been widely used for vehicle localization, but its accuracy decreases significantly in urban areas where tall buildings block or weaken its signals. Many image-based methods have been proposed to complement GPS. They provide position and orientation information based either on image retrieval [1, 2, 3, 4, 5] or 3D model reconstruction [6]. However, these methods face many challenges, including high storage overheads, low computational efficiency and image variations, especially for large scenes. Recently, rapid progress in machine learning, particularly deep learning, has produced a number of deep learning-based methods [7, 8, 9, 10, 11, 12, 13, 14, 15]. They perform well in addressing the aforementioned challenges, but their accuracies are not as good as those of traditional methods. Another severe problem of deep learning-based methods is that they fail to distinguish two different locations that contain similar objects or scenes.

In this paper, we present a novel relative geometry-aware Siamese neural network, which explicitly exploits the relative geometry constraints between images to regularize the network. We improve the localization accuracy and enhance the ability of the network to distinguish locations with similar images. This is achieved with three key new ideas:

We design a novel Siamese neural network that explicitly learns the global poses of a pair of images. We constrain the estimated global poses with the actual relative pose between the pair of images.

We perform multi-task learning to estimate the absolute and relative poses simultaneously to ensure that the predicted poses are correct both globally and locally.

We employ metric learning and design an adaptive metric distance loss to learn feature representations that are capable of distinguishing the poses of visually similar images of different locations, thus improving the overall pose estimation accuracy.
The rest of the paper is organized as follows: Section II reviews the related work in camera relocalization. Section III elaborates the basic idea of deep learning-based camera relocalization methods. Section IV describes the architecture of the proposed network and its loss function terms. We present the details of our experiments and evaluation in Section V. Finally, we conclude our work in Section VI.
II Related Work
Camera relocalization methods can be mainly classified into three categories: image retrieval-based methods, 3D model-based methods, and deep learning-based methods.
Many approaches and systems have been proposed based on the image retrieval technique [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. They determine the pose of the query image by matching it with images rendered from 3D scene models. The key component of the technique is image representation. Global descriptors are often used, such as the colour histogram [3] and the gradient orientation histogram [27]. The GIST descriptor [28] and GIST-based descriptors [29] are applied to represent panoramic images in [30, 31, 1]. SeqSLAM [32] generates the global descriptor from a sequence of consecutive images instead of a single image. Global descriptors are fast to compute, but they are not robust to occlusion and illumination changes. Local features like SIFT [33] and SURF [34] have been used in [17] for image representation. Compared with global descriptors, they are less sensitive to occlusion and view variations. However, the storage requirement of such methods is high for large scenes. Pooling features like BoW [35] and VLAD [36] are able to relieve this challenge. They aggregate local features and represent locations with a compact feature vector instead of a large number of local features [16].

Another type of method solves the problem by utilizing camera projection geometry between 2D pixels and 3D models. These methods estimate the pose by constructing correspondences between 2D pixels and 3D points of the scene [37, 38, 39, 40, 41]. Local point features, like SIFT [33], SURF [34] and ORB [42], are frequently used to describe the detected 2D points. 3D points, generated using the SfM technique, are also described with local features to perform 2D-3D matching. This can achieve accurate results when enough correct pairs are provided. The main challenge is to establish enough correct 2D-3D correspondences, which is difficult for two reasons: firstly, local feature descriptors fail when a scene has repetitive texture or textureless surfaces; and secondly, the process is inefficient for large scenes.
To increase the efficiency of 2D-3D matching, prioritized search approaches [39, 40] have been proposed to construct enough matching pairs instead of matching all detected 2D points. Scene coordinate random forests (SCRF) [43, 44] utilize machine learning techniques to directly predict the 3D coordinates of image pixels by training a random forest. Similar to SCRF, the deep learning technique is employed to predict the 3D coordinate of the center point of an image patch in [45]. However, these methods require a 3D model for network training, which limits their application. To filter out wrong matches, co-visibility information is exploited in [38, 39].

Deep learning has achieved extraordinary performance in image classification, object detection, and image retrieval tasks. Many researchers have employed it to solve the camera relocalization problem [7, 8, 9, 10, 11, 12, 13, 14, 15]. PlaNet [7]
regards the problem as a classification task. It divides the map into grids and predicts, through deep learning, the grid into which the query image falls. Many other researchers consider it as a regression problem instead: they directly estimate the pose through a convolutional neural network. PoseNet [8], built on the GoogLeNet model [9], is the first to adopt this paradigm in an end-to-end manner. It is further extended to Bayesian PoseNet [10] to estimate the confidence of the result as well. HourglassNet [11] utilizes an encoder-decoder network structure with skip connections to aggregate features from both lower and higher layers for pose regression. It achieves better performance than PoseNet. LSTMNet [12] argues that the high-dimensional output of the fully connected layer in PoseNet is not optimal. It adds an LSTM network after the last fully connected layer in PoseNet to reduce information redundancy. VidLoc [13] exploits the smoothness constraints of a video to address the perceptual aliasing problem. It takes a video clip as input instead of a single image and proposes a bidirectional recurrent neural network structure that fuses information from the previous and next images to increase the predicted pose accuracy. Laskar et al. [46] propose a triangulation strategy that predicts the pose by estimating the relative pose between the query image and the images in the database. Its main drawback is low efficiency since the relative poses with respect to all the images in the database have to be computed. PoseNet2 [14] introduces the reprojection error alongside the global pose error and improves the performance. However, 3D points are required in their method. MapNet [15] fuses inertial information with image information through deep learning to enhance the network performance.

The method proposed in this paper is also based on convolutional neural networks. However, it has a number of distinctive features. For example, we use an innovative Siamese network architecture to exploit the relative geometry of images in addition to predicting the absolute poses. Unlike [14] and [15], we rely only on the 2D images for training. Compared to [10, 11, 12], we take a pair of images as input and utilize their relative pose error for training. In contrast to [46], we directly regress the image pose instead of performing triangulation.
A very recent work that also uses multi-task learning and explicitly models the relative poses of two frames appears in [57, 58]. However, our system architecture differs from that of [57, 58] in a number of significant ways. Whilst we use a Siamese network and a metric learning loss to model the relative geometry of two frames, [57, 58] use two separate networks to model the relative geometry of two consecutive frames (although [57, 58] refer to their two networks as a Siamese network, strictly speaking it is not a Siamese network architecture because the two networks do not share weights). Furthermore, while our method can model the relative geometry of two arbitrary frames, [57, 58] can only model two consecutive frames.
III Deep Learning-based Camera Relocalization
Deep learning-based camera relocalization methods use an end-to-end learning strategy to predict positions and orientations directly. They do not perform image matching or solve 2D-3D correspondences as traditional methods do. Instead, they regard the task as a regression problem and utilize convolutional neural networks to model the hidden mapping function between images and their corresponding poses. The networks are supervised by the distance between the predicted poses and the ground truth. This section focuses on discussing the pose representation and describing the loss function formulation.
III-A Pose Representation
The image (camera) pose is comprised of a positional component and an orientational component. The position is denoted by a 3-dimensional vector $\mathbf{x}$ in an arbitrary coordinate frame. The orientation can be represented in three forms: Euler angles, transformation matrix, and quaternion. Euler angles are not a good choice because they suffer from the gimbal lock problem. The transformation matrix is over-parameterized for orientation because it contains 9 parameters to represent a 3D orientation, while the orientation only has 3 degrees of freedom. Previous works [8, 13, 11, 12] choose the quaternion to represent orientation because it is a smooth and continuous representation. The quaternion is a 4-dimensional unit vector $\mathbf{q}$ through which backpropagation is easy to perform. The main concern with the quaternion is that each orientation has two different quaternion representations ($\mathbf{q}$ and $-\mathbf{q}$ encode the same rotation). This can be addressed by constraining the quaternion to one hemisphere.
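As a concrete illustration, the two-representations issue can be handled with a few lines of code. The sketch below is not from the paper (NumPy is used for brevity, and the `[w, x, y, z]` component order is an assumption); it normalizes a predicted quaternion to unit length and flips it onto the hemisphere with a non-negative scalar component:

```python
import numpy as np

def canonicalize_quaternion(q):
    """Return a unit quaternion constrained to one hemisphere.

    q and -q encode the same rotation, so after normalizing to unit
    length we flip the sign whenever the scalar (w) component is
    negative, giving every orientation a single representation."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    if q[0] < 0:
        q = -q
    return q
```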
One simple and obvious way to represent the pose is to form a 7-dimensional vector combining position and orientation together. However, previous works demonstrate that the 7-dimensional vector representation does not achieve good performance due to the difference in scale between position and orientation. Therefore, the two pose components are usually regressed separately. In this paper, instead of training two separate convolutional neural networks to estimate position and orientation, we train one model to predict the two components simultaneously. This is reasonable because both position and orientation derive from the same image content.
III-B Loss Function
The loss function (GlobalLoss) is normally designed based on the distance between the predicted pose and the ground truth, and serves as the optimization objective for training the network. It consists of two components, i.e. the position loss and the orientation loss, as shown in equation (1).

$L_{global} = L_x + L_q$ (1)

where $L_x$ is the position loss and $L_q$ denotes the orientation loss. Euclidean distance is chosen to calculate both losses as it is continuous and smooth. The two components are computed by equations (2) and (3) respectively.
$L_x = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2$ (2)

where $\mathbf{x}$ represents the real position and $\hat{\mathbf{x}}$ denotes the predicted one.
$L_q = \left\lVert \mathbf{q} - \frac{\hat{\mathbf{q}}}{\lVert \hat{\mathbf{q}} \rVert} \right\rVert_2$ (3)

where $\mathbf{q}$ is the ground truth orientation, $\hat{\mathbf{q}}$ denotes the predicted orientation and $\lVert \hat{\mathbf{q}} \rVert$ represents the length of the predicted orientation quaternion. The division by $\lVert \hat{\mathbf{q}} \rVert$ normalizes the predicted quaternion to unit length, since the network prediction does not guarantee it.
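For clarity, equations (1)-(3) can be written directly in code. The NumPy snippet below is an illustrative sketch rather than the paper's training code; `x`/`q` denote ground truth and `x_hat`/`q_hat` the network predictions:

```python
import numpy as np

def position_loss(x, x_hat):
    # Equation (2): Euclidean distance between true and predicted positions.
    return np.linalg.norm(x - x_hat)

def orientation_loss(q, q_hat):
    # Equation (3): normalize the predicted quaternion to unit length
    # before comparing, since the network output does not guarantee it.
    return np.linalg.norm(q - q_hat / np.linalg.norm(q_hat))

def global_loss(x, x_hat, q, q_hat):
    # Equation (1): sum of the position and orientation losses.
    return position_loss(x, x_hat) + orientation_loss(q, q_hat)
```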
Due to the difference in magnitude and scale between the position loss and the orientation loss, a balancing hyperparameter $\beta$ is introduced to weigh the two loss components. The loss function is represented as equation (4).

$L_{global} = L_x + \beta L_q$ (4)
Previous works set $\beta$ manually and achieve good performance in their experiments. However, fine-tuning $\beta$ for different scenes is labour-intensive. PoseNet2 addresses this issue by introducing two learnable variables, $\hat{s}_x$ and $\hat{s}_q$, which correspond to the losses of position and orientation respectively. Equation (4) is then transformed into equation (5):

$L_{global} = L_x \exp(-\hat{s}_x) + \hat{s}_x + L_q \exp(-\hat{s}_q) + \hat{s}_q$ (5)
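The effect of the learnable weighting of equation (5) can be seen in a small sketch (illustrative only; during training the two scalars would be framework parameters updated by backpropagation together with the network weights):

```python
import numpy as np

def weighted_global_loss(loss_x, loss_q, s_x, s_q):
    """Equation (5): each loss term is scaled by a learnable factor
    exp(-s), and the additive s terms keep the factors from collapsing.
    With s_x = s_q = 0 this reduces to loss_x + loss_q."""
    return loss_x * np.exp(-s_x) + s_x + loss_q * np.exp(-s_q) + s_q
```

Raising $\hat{s}_q$ discounts the orientation term, so the network can learn the balance instead of requiring a hand-tuned $\beta$ per scene.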
IV Relative Geometry-Aware Siamese Network for Camera Relocalization
Our network is built on the Siamese network originally introduced by Bromley and LeCun in [47]. A traditional Siamese neural network architecture consists of twin networks which accept distinct inputs. The loss function computes a metric between the highest-level feature representations on each side given a certain threshold. We utilize this structure to learn a robust feature representation for estimating positions and orientations by introducing relative geometry constraints of the training images. The process is supervised by both global pose and relative pose constraints. The proposed network architecture is illustrated in Figure 1. Compared to the conventional Siamese network structure, it has an additional component for relative pose prediction and performs multi-task learning. In the following subsections, we present the network architecture and the relative geometry losses for network training in detail.
IV-A Network Architecture
Each of the twin networks consists of a modified ResNet50 [48] and a global pose regression unit (GPRU). The modified ResNet50 consists of 5 residual blocks and an average pooling layer. Each residual block has multiple residual bottleneck units that are comprised of three convolutional layers with kernel sizes of 1×1, 3×3, and 1×1 in sequence. Each convolutional layer is followed by a batch normalization operation and a rectified linear unit (ReLU). The average pooling layer is used to aggregate the feature information from the previous layers. The GPRU contains 3 fully connected layers. The first fully connected layer has 1024 neurons and the following two have 3 and 4 neurons respectively for regressing the position and orientation. For the relative pose of the two inputs, we design a relative pose regression unit (RPRU). It has a similar structure to the GPRU; the difference lies in their inputs. While the GPRU takes the output vector of the modified ResNet50 as input, the RPRU takes the concatenation of the two modified ResNet50 output vectors as input. The dropout technique is applied after each fully connected layer to reduce feature redundancy. The dropout probability is set to 0.2 empirically.
IV-B Relative Geometry Losses
We design three relative geometry losses based on the relative geometry constraints of the training images: the relative pose loss (RelLoss), the relative pose regression loss (RelRLoss) and the adaptive metric distance loss (MDLoss). They operate in the pose and feature spaces to regularize the network and are discussed in detail in the following sections.
IV-B1 Relative Pose Loss
Previous deep learning-based pose estimation methods train the network on the global poses of the images, i.e. given an input image, they estimate its global (absolute) position and orientation, while the relative pose between two training images is ignored. However, the relative pose information of two images is important. In this paper, the network not only explicitly estimates the global pose of the input image but also explicitly requires that the difference between the estimated global poses of two images is consistent with their actual (ground truth) difference. The relative pose loss (RelLoss) is designed to preserve the relative geometry in the pose space by comparing the distance between the two predicted global poses with the actual distance between the global poses of the two images. RelLoss keeps the relative pose of paired images consistent with the ground truth. It works in the pose space and constrains the pose error of the two images.
Suppose that the positions and orientations of the current image and a reference image are $(\mathbf{x}_1, \mathbf{q}_1)$ and $(\mathbf{x}_2, \mathbf{q}_2)$, respectively. The relative position and orientation can be computed with equations (6) and (7).

$\mathbf{x}_{rel} = \mathbf{x}_1 - \mathbf{x}_2$ (6)

$\mathbf{q}_{rel} = \mathbf{q}_2^{*} \otimes \mathbf{q}_1$ (7)

where $\mathbf{q}_2^{*}$ represents the conjugate quaternion of $\mathbf{q}_2$ and $\otimes$ denotes quaternion multiplication. Note that when calculating the relative orientation from the predicted orientation quaternions with equation (7), the quaternions have to be normalized. The RelLoss also contains a positional loss component and an orientational loss component, as shown in equation (8).

$L_{rel} = L_{rx} + L_{rq}$ (8)

where $L_{rx}$ denotes the RelLoss positional component, and $L_{rq}$ is the orientational component.
The two loss components are formulated with Euclidean distance as shown in equations (9) and (10).

$L_{rx} = \lVert \mathbf{x}_{rel} - \hat{\mathbf{x}}_{rel} \rVert_2$ (9)

$L_{rq} = \lVert \mathbf{q}_{rel} - \hat{\mathbf{q}}_{rel} \rVert_2$ (10)

where $\hat{\mathbf{x}}_{rel}, \hat{\mathbf{q}}_{rel}$ are the relative position and orientation computed from the predicted global poses, and $\mathbf{x}_{rel}, \mathbf{q}_{rel}$ denote the ground truth.
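The relative pose computation of equations (6) and (7) can be sketched in NumPy (quaternions in `[w, x, y, z]` order, an assumption made for illustration; this is not the paper's code):

```python
import numpy as np

def quat_conjugate(q):
    # Conjugate of a quaternion [w, x, y, z]: negate the vector part.
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_multiply(a, b):
    # Hamilton product of two quaternions [w, x, y, z].
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def relative_pose(x1, q1, x2, q2):
    """Equations (6) and (7): relative position and orientation of the
    current image (x1, q1) with respect to the reference (x2, q2).
    Both quaternions are assumed unit-normalized."""
    x_rel = x1 - x2
    q_rel = quat_multiply(quat_conjugate(q2), q1)
    return x_rel, q_rel
```

Applying the same computation once to the ground truth poses and once to the predicted global poses yields the two quantities compared by equations (9) and (10).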
IV-B2 Relative Pose Regression Loss
Whilst RelLoss captures the relative geometry of two images through estimating their global poses, we here introduce another loss to estimate the relative pose of a pair of images directly from the input images. The relative pose regression loss (RelRLoss) is defined as shown in equation (11).

$L_{relR} = L_{Rx} + L_{Rq}$ (11)

where $L_{Rx}$ denotes the positional component, and $L_{Rq}$ denotes the orientational component. The two component loss functions are computed by equations (12) and (13).

$L_{Rx} = \lVert \mathbf{x}_{rel} - \tilde{\mathbf{x}}_{rel} \rVert_2$ (12)

$L_{Rq} = \left\lVert \mathbf{q}_{rel} - \frac{\tilde{\mathbf{q}}_{rel}}{\lVert \tilde{\mathbf{q}}_{rel} \rVert} \right\rVert_2$ (13)

where $\mathbf{x}_{rel}, \mathbf{q}_{rel}$ represent the ground truth relative position and orientation, and $\tilde{\mathbf{x}}_{rel}, \tilde{\mathbf{q}}_{rel}$ represent the directly predicted relative position and orientation. The ground truth relative position and orientation can be obtained using equations (6) and (7). Note that $\tilde{\mathbf{q}}_{rel}$ needs to be normalized as it is directly regressed by the network.
It should be noted that the relative pose in equation (11) and that in equation (8) are different: one is predicted directly by regression, while the other is computed from the difference of two predicted global poses. Furthermore, it is the RPRU that joins the twin networks together (please refer to Figure 1). The purpose of introducing RelRLoss is to ensure that the features extracted by the ResNet50 network enable not only an accurate estimate of the global pose but also an accurate relative pose estimation.
IV-B3 Adaptive Metric Distance Loss
Deep learning-based methods often fail to accurately predict the poses of similar images of different locations. Distinguishing similar inputs belonging to different classes is one of the major difficulties in computer vision. Here, we take advantage of the Siamese network architecture of Figure 1 and propose the adaptive metric distance loss (MDLoss) to address the problem. It is inspired by metric learning [49, 50, 51]. The basic idea of metric learning is to learn a metric distance adapted to the problem of interest. For many problems, including camera relocalization, handcrafted representations fail badly in capturing the notion of similarity. Deep learning regression-based camera relocalization approaches estimate the pose from the visual content of the input image, so simple metrics measuring visual content similarity fail to capture the pose dissimilarity in the above cases. In the case of our Siamese architecture in Figure 1, the 6DOF camera pose is estimated by the GPRU. The input to the GPRU (the output of the ResNet50) should reflect the pose difference rather than the visual similarity of the images. We therefore introduce the adaptive metric distance loss (MDLoss) to address this issue.

The MDLoss is built on the contrastive loss, which employs semantic information (data labels) to force the convolutional neural network to learn an embedding representation that complies with a notion of similarity of the problem domain. In our scenario, we define the metric distance loss by embedding the relative pose of two images: the relative pose information is used to define the margin of the feature representation. The loss function is shown in equation (14).
$L_{MD} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; (d_{x,i} + \gamma\, d_{q,i}) - \lVert f_i - f'_i \rVert_2 \right)$ (14)

where $N$ denotes the number of training samples, $f_i$ and $f'_i$ are the outputs of the modified ResNet50 network for the current image and the reference image respectively, $d_{x,i}$ is the Euclidean distance of the actual relative position while $d_{q,i}$ is the Euclidean distance of the actual relative orientation of the current image and the reference image, and $\gamma$ is a positive constant to balance the influence of the relative position and orientation. It is set to 10 empirically.
An explanation of equation (14) is that, if the feature distance $\lVert f_i - f'_i \rVert_2$ is smaller than the adaptive margin $d_{x,i} + \gamma\, d_{q,i}$, we want to make it as large as the margin. On the other hand, if it is larger than the margin, this cost function is not active and the other cost functions ensure that $f_i$ and $f'_i$ take appropriate values. This is a reasonable strategy because the reference image is always taken at a different location from that of the current image.
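A batch-level sketch of equation (14) (illustrative NumPy, not the training implementation; feature rows and relative pose distances are assumed precomputed):

```python
import numpy as np

def md_loss(feat_a, feat_b, d_x, d_q, gamma=10.0):
    """Adaptive metric distance loss of equation (14).

    feat_a, feat_b: (N, D) arrays of ResNet50 features for the current
    and reference images. d_x, d_q: (N,) Euclidean distances of the
    actual relative positions and orientations; together they form the
    adaptive margin. The hinge is active only while the feature
    distance is still smaller than the margin."""
    margin = d_x + gamma * d_q
    feat_dist = np.linalg.norm(feat_a - feat_b, axis=1)
    return np.mean(np.maximum(0.0, margin - feat_dist))
```

Unlike a contrastive loss with a fixed margin, pairs taken far apart in pose are pushed further apart in feature space than pairs taken close together.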
IV-C Comprehensive Loss
We train the proposed neural network jointly with GlobalLoss, RelLoss, RelRLoss and MDLoss. The comprehensive loss can be represented by equation (15).
$L = L_X + L_Q + L_{MD}$ (15)
It consists of three components: the position loss $L_X$, the orientation loss $L_Q$ and the metric distance loss $L_{MD}$. The positional loss and the orientational loss each have three components and can be written as equations (17) and (18) respectively.
$L_X = L_x + L_{rx} + L_{Rx}$ (17)
$L_Q = L_q + L_{rq} + L_{Rq}$ (18)
We adopt a learning strategy similar to PoseNet2 to balance the position loss and the orientation loss. The comprehensive loss can then be reformulated as equation (19):
$L = L_X \exp(-\hat{s}_x) + \hat{s}_x + L_Q \exp(-\hat{s}_q) + \hat{s}_q + L_{MD}$ (19)
where $\hat{s}_x$ and $\hat{s}_q$ are learnable coefficients.
V Experiments
In this section, we test our method on two publicly available camera relocalization benchmark datasets, one indoor and one outdoor, to demonstrate its effectiveness. Experimental results are presented and compared with state-of-the-art methods in the literature. We also investigate the role of the various components of the loss function and analyze how the choice of the reference image affects the performance of the proposed method.
V-A Datasets
The two public datasets we used are: 7Scene [43] and Cambridge Landmarks [8]. To make our results exactly comparable to previous methods, we use the same split of training set and testing set as in the original datasets.
7Scene is an indoor image dataset for camera relocalization and trajectory tracking. It is collected with a handheld RGB-D camera, and the ground truth pose is generated using the KinectFusion approach [52]. The dataset is captured in 7 indoor scenes. For each scene, it contains several image sequences, which have already been divided into training and testing sets. The images are taken at a resolution of 640×480 with a known focal length of 585. The dataset is quite challenging as camera motion blurs the images. Besides, the indoor scenes are largely textureless, which makes the localization problem even more difficult.
Cambridge Landmarks is an outdoor dataset collected at 4 sites around Cambridge University. It is collected using a Google mobile phone while pedestrians walk. The images are captured at a resolution of 1920×1080 and the ground truth pose is obtained through the VisualSFM software [53]. The dataset is also very challenging as it is taken under different weather and lighting conditions. Besides, occlusion by moving pedestrians and vehicles further increases the difficulty.
V-B Setup
Training phase: in this phase, all parts of the proposed network are involved. The network takes in a pair of images and outputs their corresponding global poses. It is important to note that the twin networks are identical: one takes the current image as input and produces its global 6DOF pose, while the other takes the reference image as input and outputs its corresponding pose.
Testing phase: in the testing phase, only one of the twins is necessary. Since they are identical, either one can be used. The middle part that links the twins is no longer necessary at this stage. Once training is completed, an image is fed to one of the twin networks and the global 6DOF pose of the camera can be estimated.
We use the same image preprocessing approach as previous methods [8]. We first resize the image to 256 pixels along the shorter side and normalize it with the mean and standard deviation computed from the ImageNet dataset. For the training phase, we randomly crop the image to 224×224 pixels. For the testing phase, images are cropped to 224×224 pixels at the center of the image. Training images are shuffled before they are fed to the network.

The modified ResNet50 is initialized with weights pretrained on the ImageNet dataset. The GPRU and the RPRU are initialized with the Xavier initialization [54]. We choose the Adam optimizer with its momentum parameters $\beta_1$ and $\beta_2$, together with weight decay, to train the network with a batch size of 32. We initialize $\hat{s}_x$ and $\hat{s}_q$ with 0 and 3.0 respectively in our experiments. We implement the network with PyTorch and train it on an Ubuntu 16.04 LTS system with an NVIDIA GTX 1080Ti GPU. Training is stopped once the network has converged.
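The crop-and-normalize step can be sketched as follows (an illustration, not the paper's code; the 224-pixel crop size and the (H, W, 3) channel layout are assumptions):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image, train=True, crop=224, rng=None):
    """Normalize an (H, W, 3) float image in [0, 1] (already resized so
    the shorter side is 256 pixels) with ImageNet statistics, then take
    a random crop for training or a center crop for testing."""
    image = (image - IMAGENET_MEAN) / IMAGENET_STD
    h, w, _ = image.shape
    if train:
        rng = rng if rng is not None else np.random.default_rng()
        top = int(rng.integers(0, h - crop + 1))
        left = int(rng.integers(0, w - crop + 1))
    else:
        top, left = (h - crop) // 2, (w - crop) // 2
    return image[top:top + crop, left:left + crop]
```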
V-C Results
We compare the results of the proposed method with those of state-of-the-art deep learning-based methods, namely PoseNet, Bayesian PoseNet, PoseNet2, HourglassNet, LSTMNet and RelNet on the 7Scene dataset, and PoseNet, Bayesian PoseNet, PoseNet2 and LSTMNet on the Cambridge Landmarks dataset. Similar to others, we report each scene's median error. We also compare the average median accuracy over all scenes in each dataset. The comparative results are shown in Table I and Table II.
Scene  PoseNet  Bayesian PoseNet  LSTMNet  Vidloc  HourglassNet  PoseNet2  Relnet  Ours 

Chess  0.32m,  0.37m,  0.24m,  0.18m, N/A  0.15m,  0.13m,  0.13m,  0.099m, 
Fire  0.47m,  0.43m,  0.34m,  0.26m, N/A  0.27m,  0.27m,  0.26m,  0.253m, 
Heads  0.29m,  0.31m,  0.21m,  0.14m, N/A  0.19m,  0.17m,  0.14m,  0.126m, 
Office  0.48m,  0.48m,  0.30m,  0.26m, N/A  0.21m,  0.19m,  0.21m,  0.161m, 
Pumpkin  0.47m,  0.61m,  0.33m,  0.36m, N/A  0.25m,  0.26m,  0.24m,  0.163m, 
Redkitchen  0.59m,  0.58m,  0.37m,  0.31m, N/A  0.27m,  0.23m,  0.24m,  0.174m, 
Stairs  0.47m,  0.48m,  0.40m,  0.26m, N/A  0.29m,  0.35m,  0.27m,  0.26m, 
Average  0.44m,  0.47m,  0.31m,  0.25m, N/A  0.23m,  0.23m,  0.21m,  0.177m, 
Table I shows the results for the 7Scene dataset. It can be seen that, compared with 7 state-of-the-art deep learning-based camera relocalization methods, the proposed method achieves the best positional accuracy in all 7 scenes. Our method improves the average median positional accuracy from 0.21m to 0.177m over the best previously reported result. It is interesting to note that our method obtains an even better result than PoseNet2, which utilizes a 3D reference as an additional constraint.
For orientational accuracy, we achieve the best result among methods based on direct regression. It is not surprising that the results are not as good as those of PoseNet2 and RelNet, since PoseNet2 requires additional 3D models and RelNet triangulates the pose with all reference images by estimating the relative poses instead of directly regressing the result.
Scene  PoseNet  Bayesian PoseNet  LSTMNet  PoseNet2  Ours 

KingsCollege  1.92m,  1.74m,  0.99m,  0.88m,  0.865m, 
OldHospital  2.31m,  2.57m,  1.51m,  3.20m,  1.617m, 
ShopFacade  1.46m,  1.25m,  1.18m,  0.88m,  0.834m, 
StMarysChurch  2.65m,  2.11m,  1.52m,  1.57m,  1.650m, 
Average  2.08m,  1.92m,  1.30m,  1.62m,  1.24m, 
Table II shows the results for the Cambridge Landmarks dataset. It can be seen that our method obtains the best positional accuracy on the KingsCollege and the ShopFacade scenes, reaching accuracies of 0.865m and 0.834m respectively. We improve the state-of-the-art orientational accuracy on the OldHospital and the StMarysChurch scenes by 26% and 10% respectively. The average positional accuracy over all scenes is improved from 1.30m to 1.24m. The average orientational accuracy over all scenes is only a little worse than that of PoseNet2, which is trained with 3D model constraints.
It is interesting to note that of all the methods presented in the two tables, some did better in positional accuracy and some did better in orientational accuracy; none of them comprehensively beats the others in both measures. Our method achieves the best average positional accuracy amongst all methods on both datasets. For orientational accuracy, our method achieves competitive results, only slightly worse than the best method (PoseNet2) but better than or at least as good as the other methods.
V-D Discussion
In this section, we analyze the influence of the various loss function components and the reference image selection strategy. The experiments are again carried out on the 7Scene and Cambridge Landmarks datasets.
V-D1 Loss Analysis
We perform an ablation analysis on the loss function. Recall from equation (15) that the overall loss function consists of the global loss (GlobalLoss), the relative pose loss (RelLoss), the relative pose regression loss (RelRLoss), and the adaptive metric distance loss (MDLoss). In order to assess the role these loss components play, we formulate 4 loss functions based on the following combinations:

G: GlobalLoss;

G+C: GlobalLoss + RelLoss;

G+C+R: GlobalLoss + RelLoss + RelRLoss;

Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
Scene  G  G+C  G+C+R  Ours 

Chess  0.135m,  0.118m,  0.116m,  0.099m, 
Fire  0.285m,  0.258m,  0.258m,  0.253m, 
Heads  0.185m,  0.140m,  0.144m,  0.126m, 
Office  0.180m,  0.173m,  0.175m,  0.161m, 
Pumpkin  0.215m,  0.226m,  0.214m,  0.163m, 
Redkitchen  0.266m,  0.253m,  0.201m,  0.174m, 
Stairs  0.345m,  0.324m,  0.279m,  0.260m, 
Average  0.230m,  0.213m,  0.198m,  0.177m, 
We train the proposed network with the 4 aforementioned loss functions separately. The results are shown in Table III for the 7Scene dataset and in Table IV for Cambridge Landmarks. It can be seen that as more loss terms are added to the loss function, both the positional error and the orientational error decrease for all scenes of the 7Scene dataset and the Cambridge Landmarks dataset. The average positional and orientational errors for the 7Scene dataset and the Cambridge Landmarks dataset are shown in Figure 2 and Figure 3 respectively. We can see that the average position and orientation errors show a decreasing trend as more constraints are added. This demonstrates the usefulness of each loss component.
Scene  G  G+C  G+C+R  Ours 

KingsCollege  1.07m,  0.932m,  0.97m,  0.865m, 
OldHospital  1.76m,  1.650m,  1.67m,  1.617m, 
ShopFacade  1.00m,  0.930m,  0.858m,  0.834m, 
StMarysChurch  1.76m,  1.720m,  1.684m,  1.615m, 
Average  1.396m,  1.308m,  1.296m,  1.242m, 
V-D2 Comparison of Relative Geometry Losses
We have designed three relative geometry-based losses. In order to evaluate their individual contributions to pose prediction, we formulate new losses by combining each of them with the global pose loss. We also use the global pose loss alone and our comprehensive loss as baselines. The loss combinations are listed as follows:

G: GlobalLoss;

G+M: GlobalLoss + MDLoss;

G+C: GlobalLoss + RelLoss;

G+R: GlobalLoss + RelRLoss;

Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
For each loss function, we repeat the experiments using the same training setup as in the previous experiments. The results on 7Scene and on Cambridge Landmarks are shown in Table V and Table VI respectively. The average localization errors on the two datasets are shown in Figure 4 and Figure 5.
Scene  G  G+M  G+C  G+R  Ours 

Chess  0.135m,  0.116m,  0.118m,  0.117m,  0.099m, 
Fire  0.285m,  0.271m,  0.258m,  0.262m,  0.253m, 
Heads  0.185m,  0.128m,  0.140m,  0.147m,  0.126m, 
Office  0.180m,  0.177m,  0.173m,  0.189m,  0.161m, 
Pumpkin  0.215m,  0.198m,  0.226m,  0.196m,  0.163m, 
Redkitchen  0.266m,  0.217m,  0.253m,  0.218m,  0.174m, 
Stairs  0.345m,  0.265m,  0.324m,  0.281m,  0.260m, 
Average  0.230m,  0.196m,  0.213m,  0.201m,  0.177m, 
Table VI: Median positional errors on the Cambridge Landmarks dataset

Scene          G       G+M     G+C     G+R     Ours
KingsCollege   1.07m   0.960m  0.932m  0.980m  0.865m
OldHospital    1.76m   1.650m  1.650m  1.615m  1.617m
ShopFacade     1.00m   0.876m  0.930m  0.868m  0.834m
StMarysChurch  1.76m   1.617m  1.720m  1.664m  1.615m
Average        1.396m  1.275m  1.308m  1.282m  1.242m
As shown in the two figures, the relative geometry-related losses (G+M, G+C, G+R) achieve better accuracy than the global pose loss alone in every scene of both datasets, which further demonstrates their effectiveness for global pose prediction. G+M obtains the largest average accuracy increase of the three, while G+C yields the smallest improvement on both datasets, lower than G+R. Since RelRLoss and MDLoss operate in the feature space while RelLoss operates in the pose space, this suggests that relative geometry constraints work better in the feature space than in the pose space. It should also be noted that our proposed loss (G+C+R+M) outperforms all the single relative geometry-related losses, which further demonstrates the effectiveness of the comprehensive loss function.
V-D3 Comparison of Metric Losses
To further evaluate the proposed adaptive metric distance loss, we compare it with the conventional siamese loss [55] and triplet loss [56], since these two losses can also help make visually similar images distinguishable, as the proposed metric distance loss does. The siamese loss is given in equation (20) and the triplet loss in equation (21). The major difference is that the conventional metric losses use a fixed margin, whereas our margin is a function of the relative pose between the two images.
\[
L_{\mathrm{Siamese}} = \frac{1}{N}\sum_{i=1}^{N} y_i \left[ m - d_i \right]_+ \qquad (20)
\]

\[
L_{\mathrm{Triplet}} = \frac{1}{N}\sum_{i=1}^{N} \left[ \left\| f_i - f_i^{\,r} \right\|_2 - \left\| f_i - f_i^{\,r+1} \right\|_2 + m \right]_+ \qquad (21)
\]
where [·]_+ denotes the hinge loss, N is the number of training samples, m is the margin, d_i is the feature distance of the i-th image pair, and y_i always equals 1 since the two images are never from the same location. f_i, f_i^r and f_i^{r+1} are the feature vectors of the i-th training image, its reference image, and the image after the reference image, respectively. The siamese loss of equation (20) explicitly forces the features of the two images to differ because they come from two different locations. The triplet loss of equation (21) explicitly enforces that the feature distance between the i-th image and its reference is smaller than that between the i-th image and the image after the reference. In the experiments, we simply replace the MDLoss with the siamese loss and the triplet loss respectively and repeat the experiments. The margin m of the siamese loss and the triplet loss is empirically set to 0.001, which gives the best accuracy. The three compared losses are listed below.
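The fixed-margin losses of equations (20) and (21), plus an adaptive-margin variant in the spirit of MDLoss, can be sketched as follows. The exact form of our adaptive margin is defined earlier in the paper; `adaptive_margin_loss` below is only an illustrative stand-in in which the margin scales with the relative pose magnitude.

```python
def hinge(x):
    """[x]_+ = max(0, x)."""
    return max(0.0, x)

def siamese_loss(dists, margin=0.001):
    """Eq. (20): every pair is from different locations (y_i = 1), so the
    loss pushes each pair's feature distance d_i above a fixed margin m."""
    return sum(hinge(margin - d) for d in dists) / len(dists)

def triplet_loss(d_ref, d_next, margin=0.001):
    """Eq. (21): the distance to the reference image should be smaller than
    the distance to the image after the reference, by at least the margin."""
    return sum(hinge(dr - dn + margin)
               for dr, dn in zip(d_ref, d_next)) / len(d_ref)

def adaptive_margin_loss(dists, rel_pose_magnitudes, scale=1.0):
    """MDLoss-style variant (our assumed form, not the paper's exact one):
    the margin grows with the relative pose between the two images, so images
    taken far apart must also be far apart in feature space."""
    return sum(hinge(scale * m - d)
               for d, m in zip(dists, rel_pose_magnitudes)) / len(dists)
```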

LossSiamese: GlobalLoss + RelLoss + RelRLoss + SiameseLoss;

LossTriplet: GlobalLoss + RelLoss + RelRLoss + TripletLoss;

Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
We repeat the experiments on the two datasets using the above losses and the results are shown in Table VII for 7Scene and Table VIII for Cambridge Landmarks. The average localization errors of the two datasets are shown in Figure 6 and Figure 7.
Table VII: Median positional errors on the 7Scene dataset

Scene       LossSiamese  LossTriplet  Ours
Chess       0.127m       0.139m       0.099m
Fire        0.273m       0.276m       0.253m
Heads       0.128m       0.125m       0.126m
Office      0.188m       0.192m       0.161m
Pumpkin     0.198m       0.216m       0.163m
Redkitchen  0.219m       0.224m       0.174m
Stairs      0.277m       0.279m       0.260m
Average     0.201m       0.207m       0.177m
Table VIII: Median positional errors on the Cambridge Landmarks dataset

Scene          LossSiamese  LossTriplet  Ours
KingsCollege   0.867m       0.839m       0.865m
OldHospital    1.675m       1.683m       1.617m
ShopFacade     0.861m       0.847m       0.834m
StMarysChurch  1.728m       1.650m       1.615m
Average        1.282m       1.258m       1.242m
Our method achieves the best average positional accuracy on the 7Scene dataset, and the best average positional and orientational accuracy on the Cambridge Landmarks dataset. LossSiamese obtains the best orientational accuracy on the 7Scene dataset. LossTriplet performs poorly on the 7Scene dataset but outperforms LossSiamese on the Cambridge Landmarks dataset. Although LossSiamese achieves the best orientational accuracy on 7Scene, our method achieves the best performance on more individual scenes of the two datasets, as shown in Table VII and Table VIII. Overall, the results show that our adaptive metric distance loss outperforms the conventional siamese loss and triplet loss.
V-D4 Reference Image Analysis
In this section, we evaluate two strategies for choosing the reference image. One obvious strategy is to pair every two different images, but this results in a quadratic increase of the training time and high information redundancy. To keep the training phase efficient, we generate only one reference image for each image. Specifically, reference images are selected in two ways: 1) select the next image in the same image sequence as the reference image; 2) randomly select a different image of the dataset that is not already the reference image of any other image. Note that the next image is visually similar to the current image, whereas a randomly chosen reference image has no such property. To evaluate the effect of the two reference image selection strategies on the adaptive metric distance loss (MDLoss), we train the proposed network with the comprehensive loss function under each strategy. In addition, we use the network trained without MDLoss (G+C+R) as a baseline.
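The two selection strategies can be sketched as follows. The wrap-around for the last image of a sequence and the derangement-style random assignment (no image paired with itself, each image used as a reference at most once) are our assumptions for illustration; the paper does not specify these edge cases.

```python
import random

def pair_with_next(images):
    """Strategy 1: the reference of image i is the next image in the same
    sequence (visually similar); the last image wraps to the first
    (our assumption for the sequence boundary)."""
    n = len(images)
    return [(images[i], images[(i + 1) % n]) for i in range(n)]

def pair_randomly(images, seed=0):
    """Strategy 2: randomly assign each image a distinct reference, so that
    no image references itself and no image is the reference of two others."""
    rng = random.Random(seed)
    refs = list(images)
    rng.shuffle(refs)
    # Re-shuffle until no image is paired with itself (fine for small lists).
    while any(a == b for a, b in zip(images, refs)):
        rng.shuffle(refs)
    return list(zip(images, refs))
```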
Table IX: Median positional errors on the 7Scene dataset

Scene       G+C+R   Random  Next
Chess       0.116m  0.109m  0.099m
Fire        0.258m  0.265m  0.253m
Heads       0.144m  0.138m  0.126m
Office      0.175m  0.172m  0.161m
Pumpkin     0.214m  0.207m  0.163m
Redkitchen  0.201m  0.202m  0.174m
Stairs      0.279m  0.287m  0.260m
Average     0.198m  0.197m  0.177m
Comparative median error results for the different reference selection strategies on 7Scene and Cambridge Landmarks are shown in Table IX and Table X. The average positional and orientational errors are shown in Figure 8 and Figure 9. The strategy of choosing the next image as the reference yields higher image similarity than random selection on both datasets, as reflected by its lower feature distance.
From Table IX, it is seen that compared to the random reference selection strategy, taking the next image as the reference reduces the average positional error from 0.197m to 0.177m, with a corresponding reduction in the average orientational error. It is also seen that for both reference image selection strategies, the inclusion of MDLoss improves performance. One probable explanation is that MDLoss makes the network learn to keep similar images of different poses apart in the feature space.
Table X: Median positional errors on the Cambridge Landmarks dataset

Scene          G+C+R   Random  Next
KingsCollege   0.970m  1.120m  0.865m
OldHospital    1.670m  1.618m  1.617m
ShopFacade     0.858m  1.000m  0.834m
StMarysChurch  1.684m  1.714m  1.650m
Average        1.296m  1.363m  1.242m
As shown in Table X, taking the next image as the reference outperforms the random reference selection strategy on both average positional and orientational accuracy. It is also seen that randomly choosing the reference image yields worse positional accuracy than the baseline. This may be explained by the fact that images of the Cambridge Landmarks dataset differ substantially from one another, so the metric distance loss fails to work. To verify this explanation, we measure the image similarity of the two pairing strategies. The average Euclidean distance between the GIST features of paired images is employed to quantify paired image similarity. The average feature distances of the scenes are shown in Figure 10.
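The similarity measurement can be sketched as below. Since a full GIST implementation is lengthy, `grid_descriptor` here is a crude grid-average stand-in (an assumption, not the descriptor used in the paper); only the averaging of Euclidean distances over image pairs follows the text.

```python
import math

def grid_descriptor(img, grid=4):
    """Stand-in for a GIST descriptor: average intensity over a grid x grid
    partition of a 2-D grayscale image (a list of equal-length rows)."""
    h, w = len(img), len(img[0])
    desc = []
    for gy in range(grid):
        for gx in range(grid):
            ys = range(gy * h // grid, (gy + 1) * h // grid)
            xs = range(gx * w // grid, (gx + 1) * w // grid)
            cells = [img[y][x] for y in ys for x in xs]
            desc.append(sum(cells) / len(cells))
    return desc

def avg_pair_distance(pairs):
    """Average Euclidean distance between descriptors of paired images."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ds = [dist(grid_descriptor(a), grid_descriptor(b)) for a, b in pairs]
    return sum(ds) / len(ds)
```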
Table XI: Number of training and testing images, spatial scope, and average GIST feature distances of paired images

Scene          Training  Testing  Spatial scope (m)  Next    Random
Chess          4000      2000     3 × 2              0.0700  0.5044
Fire           2000      2000     2.5 × 1            0.0968  0.5187
Heads          1000      1000     2 × 0.5            0.0613  0.4866
Office         6000      4000     2.5 × 2            0.0600  0.4366
Pumpkin        4000      2000     2.5 × 2            0.0540  0.4390
Red Kitchen    7000      5000     4 × 3              0.0749  0.4993
Stairs         2000      1000     2.5 × 2            0.0540  0.5220
KingsCollege   1220      343      140 × 40           0.2816  0.5531
OldHospital    895       182      50 × 40            0.3338  0.6127
ShopFacade     231       103      35 × 25            0.3133  0.5730
StMarysChurch  1487      530      80 × 60            0.3411  0.6471
Table XI shows that for each scene, taking the next image as the reference achieves higher similarity between paired images than random selection. This confirms our explanation that MDLoss works better in scenarios where paired images are similar.
VI Concluding Remarks
In this paper, we enhance the camera relocalization performance of deep learning-based methods by introducing relative geometry constraints. This is achieved by designing a relative geometry-aware Siamese neural network and three relative geometry-related loss functions. The proposed network predicts the poses of two images as well as the relative pose between them. Another advantage of the network is that it can predict the global pose of a single image by feeding it into one stream. The new pose-space relative loss and feature-space relative regression loss can be combined with the traditional global pose loss to improve position and orientation accuracy. The metric distance loss enables the network to learn deep feature representations that distinguish similar images of different locations, further improving localization accuracy. We also find that pairing similar images outperforms random pairing. In future work, we plan to investigate the combination of deep learning-based methods and 3D modeling-based methods to further enhance performance.
Acknowledgment
References
 [1] A. C. Murillo and J. Kosecka, “Experiments in place recognition using gist panoramas,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 2196–2203.
 [2] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt, “Image retrieval for image-based localization revisited,” in BMVC, vol. 1, no. 2, 2012, p. 4.
 [3] I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for topological localization,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2. IEEE, 2000, pp. 1023–1029.
 [4] J. Wolf, W. Burgard, and H. Burkhardt, “Robust vision-based localization by combining an image-retrieval system with Monte Carlo localization,” IEEE Transactions on Robotics, vol. 21, no. 2, pp. 208–216, 2005.
 [5] ——, “Robust vision-based localization for mobile robots using an image retrieval system based on invariant features,” in Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, vol. 1. IEEE, 2002, pp. 359–365.
 [6] Z. Kukelova, M. Bujnak, and T. Pajdla, “Real-time solution to the absolute pose problem with unknown radial distortion and focal length,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2816–2823.
 [7] T. Weyand, I. Kostrikov, and J. Philbin, “PlaNet - photo geolocation with convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 37–55.
 [8] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2938–2946.

 [9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [10] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” arXiv preprint arXiv:1509.05909, 2015.
 [11] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, “Image-based localization using hourglass networks,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 870–877.
 [12] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using LSTMs for structured feature correlation,” in Int. Conf. Comput. Vis. (ICCV), 2017, pp. 627–637.
 [13] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, “VidLoc: A deep spatio-temporal model for 6-DOF video-clip relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
 [14] A. Kendall, R. Cipolla et al., “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.
 [15] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2616–2625.
 [16] Y. Kalantidis, G. Tolias, Y. Avrithis, M. Phinikettos, E. Spyrou, P. Mylonas, and S. Kollias, “VIRaL: Visual image retrieval and localization,” Multimedia Tools and Applications, vol. 51, no. 2, pp. 555–592, 2011.
 [17] Y.-H. Lee and Y. Kim, “Efficient image retrieval using advanced SURF and DCD on mobile platform,” Multimedia Tools and Applications, vol. 74, no. 7, pp. 2289–2299, 2015.
 [18] X. Li, M. Larson, and A. Hanjalic, “Geo-distinctive visual element matching for location estimation of images,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1179–1194, 2018.
 [19] B. J. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearancebased robot localization,” Image and Vision Computing, vol. 19, no. 6, pp. 381–391, 2001.
 [20] E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro, “Image-based Monte Carlo localisation with omnidirectional images,” Robotics and Autonomous Systems, vol. 48, no. 1, pp. 17–30, 2004.
 [21] J. Wang, H. Zha, and R. Cipolla, “Coarse-to-fine vision-based localization by indexing scale-invariant features,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 2, pp. 413–422, 2006.
 [22] J. Wang, R. Cipolla, and H. Zha, “Vision-based global localization using a visual vocabulary,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on. IEEE, 2005, pp. 4230–4235.
 [23] M. Cummins and P. Newman, “FAB-MAP: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
 [24] A. Torii, Y. Dong, M. Okutomi, J. Sivic, and T. Pajdla, “Efficient localization of panoramic images using tiled image descriptors,” Information and Media Technologies, vol. 9, no. 3, pp. 351–355, 2014.
 [25] M. Umeda and H. Date, “Spherical panoramic image-based localization by deep learning,” Transactions of the Society of Instrument and Control Engineers, vol. 54, pp. 483–493, 2018.
 [26] A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi, “Multi-output learning for camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1114–1121.
 [27] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, “Qualitative image based localization in indoors environments,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2. IEEE, 2003, pp. II–II.
 [28] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
 [29] N. Sünderhauf and P. Protzel, “BRIEF-Gist - closing the loop by simple means,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1234–1241.
 [30] G. Singh and J. Kosecka, “Visual loop closing using gist descriptors in Manhattan world,” in ICRA Omnidirectional Vision Workshop, 2010.
 [31] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gámez, “Bidirectional loop closure detection on panoramas for visual navigation,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 1378–1383.
 [32] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1643–1649.
 [33] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
 [34] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
 [35] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2003, p. 1470.
 [36] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3304–3311.
 [37] M. Donoser and D. Schmalstieg, “Discriminative feature-to-point matching in image-based localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 516–523.
 [38] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys, “Hyperpoints and fine vocabularies for large-scale location recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2102–2110.
 [39] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 1744–1756, 2017.
 [40] ——, “Improving image-based localization by active correspondence search,” in European Conference on Computer Vision. Springer, 2012, pp. 752–765.
 [41] M. Uyttendaele, M. Cohen, S. Sinha, and H. Lim, “Real-time image-based 6-DOF localization in large-scale environments,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1043–1050.
 [42] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2564–2571.
 [43] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
 [44] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. Torr, “Exploiting uncertainty in regression forests for accurate camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4400–4408.
 [45] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC - Differentiable RANSAC for camera localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
 [46] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera relocalization by computing pairwise relative poses using convolutional neural network,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 920–929.
 [47] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a “siamese” time delay neural network,” in Advances in Neural Information Processing Systems, 1994, pp. 737–744.
 [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [49] A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709, 2013.
 [50] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov, “Hamming distance metric learning,” in Advances in neural information processing systems, 2012, pp. 1061–1069.
 [51] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.
 [52] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on. IEEE, 2011, pp. 127–136.
 [53] C. Wu et al., “VisualSFM: A visual structure from motion system,” 2011.

 [54] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
 [55] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2006, pp. 1735–1742.

 [56] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [57] A. Valada, N. Radwan, and W. Burgard, “Deep auxiliary learning for visual localization and odometry,” arXiv preprint arXiv:1803.03642, 2018.
 [58] N. Radwan, A. Valada, and W. Burgard, “VLocNet++: Deep multi-task learning for semantic visual localization and odometry,” arXiv preprint arXiv:1804.08366, 2018.