Camera relocalization, or 6 degrees of freedom (6DOF) estimation, refers to the problem of estimating the pose (position and orientation) of an image (camera). It is a hot research topic in structure from motion (SfM), simultaneous localization and mapping (SLAM) and robotics, and it is also an essential component of autonomous driving and navigation.
The Global Positioning System (GPS) has been widely used for vehicle localization, but its accuracy decreases significantly in urban areas where tall buildings block or weaken its signals. Many image-based methods have been proposed to complement GPS. They provide position and orientation information based on either image retrieval [1, 2, 3, 4, 5] or 3D model reconstruction. However, these methods face many challenges, including high storage overhead, low computational efficiency and sensitivity to image variations, especially for large scenes.
Recently, rapid progress in machine learning, particularly deep learning, has produced a number of deep learning-based methods [7, 8, 9, 10, 11, 12, 13, 14, 15]. They perform well in addressing the aforementioned challenges, but their accuracy is not as good as that of traditional methods. Another severe problem of deep learning-based methods is that they fail to distinguish two different locations that contain similar objects or scenes.
In this paper, we present a novel relative geometry-aware Siamese neural network, which explicitly exploits the relative geometry constraints between images to regularize the network. We improve the localization accuracy and enhance the ability of the network to distinguish locations with similar images. It is achieved with three key new ideas:
We design a novel Siamese neural network that explicitly learns the global poses of a pair of images. We constrain the estimated global poses with the actual relative pose between the pair of images.
We perform multi-task learning to estimate the absolute and relative poses simultaneously to ensure that the predicted poses are correct both globally and locally.
We employ metric learning and design an adaptive metric distance loss to learn feature representations that can distinguish the poses of visually similar images taken at different locations, thus improving the overall pose estimation accuracy.
The rest of the paper is organized as follows: Section II reviews the related work in camera relocalization. Section III elaborates the basic idea of deep learning-based camera relocalization methods. Section IV describes the architecture of the proposed network and the terms of its loss function. We present the details of our experiments and evaluation in Section V. Finally, we conclude our work in Section VI.
II Related Work
Camera relocalization methods can be mainly classified into three categories: image retrieval-based methods, 3D model-based methods, and deep learning-based methods.
Many approaches and systems have been proposed based on the image retrieval technique [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. They determine the pose of the query image by matching it with images rendered from 3D scene models. The key component of the technique is image representation. Global descriptors are often used, such as the colour histogram and the gradient orientation histogram. The GIST descriptor and GIST-based descriptors are applied to represent panoramic images in [30, 31, 1]. SeqSLAM generates the global descriptor from a sequence of consecutive images instead of a single image. Global descriptors are fast to compute, but they are not robust to occlusion and illumination changes. Local features, like SIFT and SURF, have also been used for image representation. Compared with global descriptors, they are less sensitive to occlusion and viewpoint variations. However, their storage requirement is high for large scenes. Pooled features like BoW and VLAD relieve this problem. They aggregate local features and represent each location with a compact feature vector instead of a large number of local features.
Another class of methods solves the problem by utilizing the camera projection geometry between 2D pixels and 3D models. They estimate the pose by establishing correspondences between 2D pixels and 3D points of the scene [37, 38, 39, 40, 41]. Local point features, like SIFT, SURF and ORB, are frequently used to describe the detected 2D points. 3D points, generated using the SfM technique, are also described with local features to perform 2D-3D matching. This approach can achieve accurate results when enough correct pairs are provided. The main challenge is to establish enough correct 2D-3D correspondences, which is difficult for two reasons. Firstly, local feature descriptors fail when a scene has repetitive textures or texture-less surfaces; secondly, the matching process is inefficient for large scenes.
Several strategies have been proposed to construct enough matching pairs without matching all detected 2D points. Scene coordinate random forest (SCRF) [43, 44] utilizes machine learning techniques to directly predict the 3D coordinates of image pixels by training a random forest. Similar to SCRF, deep learning has been employed to predict the 3D coordinate of the center point of an image patch. However, these methods require a 3D model for training, which limits their application. To filter out wrong matches, co-visibility information is exploited in [38, 39].
Deep learning has achieved extraordinary performance in image classification, object detection, and image retrieval tasks. Many researchers have employed it to solve the camera relocalization problem [7, 8, 9, 10, 11, 12, 13, 14, 15]. PlaNet regards the problem as a classification task. It divides the map into grid cells and uses a deep network to predict the cell to which the query image belongs. Many other researchers consider it a regression problem instead and directly estimate the pose through a convolutional neural network. PoseNet, built on the GoogLeNet model, is the first to adopt this paradigm in an end-to-end manner. It is further extended to Bayesian PoseNet, which also estimates the confidence of the result. HourglassNet utilizes an encoder-decoder network structure with skip connections to aggregate features from both lower and higher layers for pose regression, and achieves better performance than PoseNet. LSTM-Net argues that the high-dimensional output of the fully connected layer in PoseNet is not optimal, and adds an LSTM network after the last fully connected layer in PoseNet to reduce information redundancy. VidLoc exploits the smoothness constraints of a video to address the perceptual aliasing problem. It takes a video clip as input instead of a single image and proposes a bidirectional recurrent neural network structure that fuses information from the previous and next images to increase the accuracy of the predicted pose. Laskar et al. propose a triangulation strategy that predicts the pose by estimating the relative pose between the query image and the images in the database. Its main drawback is low efficiency, since the relative poses with respect to all the images in the database have to be computed. PoseNet2 combines the re-projection error with the global pose error and improves the performance. However, 3D points are required by this method. MapNet fuses inertial information with image information through deep learning to enhance the network performance.
The proposed method in this paper is also based on convolutional neural networks. However, it has a number of distinctive features. For example, we use a Siamese network architecture to exploit the relative geometry of images in addition to predicting the absolute poses. Unlike PoseNet2 and MapNet, we rely only on 2D images for training. Compared to [10, 11, 12], we take a pair of images as input and utilize their relative pose error for training. In contrast to the triangulation approach of Laskar et al., we directly regress the image pose.
A very recent work that also uses multi-task learning and explicitly models the relative poses of two frames appears in [57, 58]. However, our system architecture differs from that of [57, 58] in a number of significant ways. Whilst we use a Siamese network and a metric learning loss to model the relative geometry of two frames, [57, 58] use two separate networks to model the relative geometry of two consecutive frames (although [57, 58] refer to their two networks as a Siamese network, strictly speaking it is not a Siamese architecture because the two networks do not share weights). Furthermore, while our method can model the relative geometry of two arbitrary frames, [57, 58] can only model two consecutive frames.
III Deep Learning-based Camera Relocalization
Deep learning-based camera relocalization methods use an end-to-end learning strategy to predict the positions and orientations directly. They do not perform image matching or solve 2D-3D correspondence as traditional methods do. Instead, they regard the task as a regression problem and utilize convolutional neural networks to model the hidden mapping function between the images and their corresponding poses. The networks are supervised by the distance between the predicted poses and the ground truth. This section focuses on discussing the pose representation and describing the loss function formulation.
III-A Pose Representation
The image (camera) pose comprises a positional component and an orientational component. The position is denoted by a 3-dimensional vector $\mathbf{x}$ in an arbitrary coordinate space. Orientation can be represented in 3 forms: Euler angles, rotation matrix, and quaternion. Euler angles are not a good choice because they suffer from the gimbal lock problem. A rotation matrix is over-parameterized because it uses 9 parameters to represent an orientation in 3D space, while the orientation only has 3 degrees of freedom. Previous works [8, 13, 11, 12] choose the quaternion to represent orientation because it is a smooth and continuous representation. The quaternion is a 4-dimensional unit vector $\mathbf{q}$ and is easy to back-propagate through. The main concern is that each orientation has two different quaternion representations, $\mathbf{q}$ and $-\mathbf{q}$. This can be addressed by constraining the quaternion to one hemisphere.
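The hemisphere constraint mentioned above can be illustrated with a short sketch. It assumes quaternions are stored as (w, x, y, z) and that the chosen hemisphere is the one with a non-negative scalar part; both conventions, and the function names, are assumptions made for illustration and are not specified in the paper.

```python
import math


def normalize_quaternion(q):
    """Scale a quaternion stored as (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("a zero quaternion does not encode an orientation")
    return tuple(c / norm for c in q)


def to_canonical_hemisphere(q):
    """Return the unit quaternion whose scalar part w is non-negative.

    q and -q describe the same rotation, so flipping the sign when w < 0
    leaves the orientation unchanged. The (w, x, y, z) ordering and the
    w >= 0 convention are assumptions of this sketch.
    """
    q = normalize_quaternion(q)
    return tuple(-c for c in q) if q[0] < 0 else q
```

With this convention, a quaternion and its negation map to the same 4-vector, so the regression target for a given orientation is unambiguous (up to the boundary case w = 0).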
One simple and obvious way to represent pose is to form a 7-dimensional vector, combining position and orientation together. However, previous works demonstrate that the 7-dimensional vector representation does not achieve good performance due to the difference of scale between position and orientation. Therefore, two pose components are usually regressed separately. In this paper, instead of training two separate convolutional neural networks to estimate position and orientation, we train one model to predict the two components simultaneously. This is reasonable because both position and orientation come from the same image content.
III-B Loss Function
The loss function (GlobalLoss) is normally designed based on the distance between the predicted pose and the ground truth, serving as the optimization objective for training the networks. It consists of two components, i.e. position loss and orientation loss, as shown in equation (1).
$$L_G = L_G^{x} + L_G^{q} \tag{1}$$
where $L_G^{x}$ is the position loss and $L_G^{q}$ denotes the orientation loss. Here, the Euclidean distance is chosen to calculate the position loss and the orientation loss as it is continuous and smooth. The two components are computed by equations (2) and (3) respectively.
$$L_G^{x} = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2 \tag{2}$$
where $\mathbf{x}$ represents the real position and $\hat{\mathbf{x}}$ denotes the predicted one.
$$L_G^{q} = \left\lVert \mathbf{q} - \frac{\hat{\mathbf{q}}}{\lVert \hat{\mathbf{q}} \rVert} \right\rVert_2 \tag{3}$$
where $\mathbf{q}$ is the ground truth orientation, $\hat{\mathbf{q}}$ denotes the predicted orientation and $\lVert \hat{\mathbf{q}} \rVert$ represents the length of the predicted orientation quaternion. The division by $\lVert \hat{\mathbf{q}} \rVert$ normalizes the predicted quaternion to unit length, since the network prediction does not guarantee it.
Due to the magnitude and scale difference between the position loss and the orientation loss, a hyperparameter $\beta$ is introduced to balance the influence of the two loss components. The loss function is represented as equation (4).
$$L_G = L_G^{x} + \beta L_G^{q} \tag{4}$$
Previous works choose to set $\beta$ manually and achieve good performance in their experiments. However, fine-tuning $\beta$ for different scenes is labour-intensive. PoseNet2 addresses this issue by introducing two learnable variables, i.e. $s_x$ and $s_q$, which correspond to the loss of position and orientation respectively. Then equation (4) is transformed into equation (5):
$$L_G = L_G^{x} \exp(-s_x) + s_x + L_G^{q} \exp(-s_q) + s_q \tag{5}$$
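As a rough illustration of the two weighting schemes, the sketch below evaluates the global loss for a single sample, once with a fixed balance factor and once with the PoseNet2-style learned weighting. It is a plain-Python sketch under stated assumptions: the function and argument names are hypothetical, and `s_x` and `s_q` are ordinary floats here, whereas in training they would be parameters updated by the optimizer.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def unit(q):
    """Normalize a predicted quaternion, since the network output is not unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]


def position_loss(x_true, x_pred):
    return euclidean(x_true, x_pred)


def orientation_loss(q_true, q_pred):
    return euclidean(q_true, unit(q_pred))


def global_loss_fixed(x_true, x_pred, q_true, q_pred, beta):
    """Position loss plus orientation loss scaled by a hand-tuned factor beta."""
    return position_loss(x_true, x_pred) + beta * orientation_loss(q_true, q_pred)


def global_loss_learned(x_true, x_pred, q_true, q_pred, s_x, s_q):
    """Learned weighting: each term is scaled by exp(-s) and regularized by +s."""
    l_x = position_loss(x_true, x_pred)
    l_q = orientation_loss(q_true, q_pred)
    return l_x * math.exp(-s_x) + s_x + l_q * math.exp(-s_q) + s_q
```

The `+ s_x` and `+ s_q` terms keep the learned weights from growing without bound, which would otherwise drive both scaled losses to zero.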
IV Relative Geometry-Aware Siamese Network for Camera Relocalization
Our network is built on the Siamese network originally introduced by Bromley and LeCun. A traditional Siamese neural network architecture consists of twin networks which accept distinct inputs. The loss function computes a metric between the highest-level feature representations on each side, given a certain threshold. We utilize this structure to learn a robust feature representation for mapping positions and orientations by introducing relative geometry constraints of the training images. The process is supervised by both global pose and relative pose constraints. The proposed network architecture is illustrated in Figure 1. Compared to the conventional Siamese network structure, it has an additional component for relative pose prediction and performs multi-task learning. In the following subsections, we present the network architecture and the relative geometry losses used for network training in detail.
IV-A Network Architecture
Each of the twin networks consists of a modified ResNet50 and a global pose regression unit (GPRU). The modified ResNet50 consists of 5 residual blocks and an average pooling layer. Each residual block has multiple residual bottleneck units, each comprising three convolutional layers with kernel sizes of $1 \times 1$, $3 \times 3$, and $1 \times 1$ in sequence. Each convolutional layer is followed by a rectified linear unit (ReLU) and a batch normalization operation. The average pooling layer is used to aggregate the feature information from the previous layers. The GPRU contains 3 fully connected layers. The first fully connected layer has 1024 neurons, and the following two have 3 and 4 neurons respectively for regressing the position and orientation. For the relative pose of the two inputs, we design a relative pose regression unit (RPRU). It has a similar structure to the GPRU. The difference lies in their inputs. While the GPRU takes the output vector of the modified ResNet50 as input, the RPRU takes the concatenation of the two modified ResNet50 output vectors as input. Dropout is applied after each fully connected layer to reduce feature redundancy. The dropout rate is set to 0.2 empirically.
IV-B Relative Geometry Losses
We design three relative geometry losses based on the relative geometry constraints of the training images including the relative pose loss (RelLoss), the relative pose regression loss (RelRLoss) and the adaptive metric distance loss (MDLoss). They function in the feature and the pose spaces to regularize the network. They will be discussed in detail in the following sections.
IV-B1 Relative Pose Loss
Previous deep learning-based pose estimation methods train the network on the global poses of the images, i.e. given an input image, they estimate its global (absolute) position and orientation while the relative pose between two training images is ignored. However, the relative pose information of two images is important. In this paper, the network not only explicitly estimates the global pose of the input image but also explicitly requires that the difference between the estimated global poses of two images is consistent with their actual (ground truth) difference. The relative pose loss (RelLoss) is designed to preserve the relative geometry in the pose space by comparing the distance between two predicted global poses, and the actual distance of the global poses of the two images. RelLoss is able to keep the relative pose of paired images consistent with their ground truth. It works in the pose space and constrains the pose error of two images.
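Given the global poses of the current image $(\mathbf{x}_{cur}, \mathbf{q}_{cur})$ and the reference image $(\mathbf{x}_{ref}, \mathbf{q}_{ref})$, their relative position and orientation are computed by equations (6) and (7).
$$\mathbf{x}_{rel} = \mathbf{x}_{cur} - \mathbf{x}_{ref} \tag{6}$$
$$\mathbf{q}_{rel} = \mathbf{q}_{ref}^{*} \otimes \mathbf{q}_{cur} \tag{7}$$
where $\mathbf{q}_{ref}^{*}$ represents the conjugate quaternion of $\mathbf{q}_{ref}$. Note that when calculating the relative orientation from the predicted orientation quaternions with equation (7), the quaternions have to be normalized. The RelLoss also contains a positional loss component and an orientational loss component, as shown in equation (8).
$$L_C = L_C^{x} + L_C^{q} \tag{8}$$
where $L_C^{x}$ denotes the RelLoss positional component, and $L_C^{q}$ is the orientational component.
$$L_C^{x} = \lVert \mathbf{x}_{rel} - \hat{\mathbf{x}}_{rel} \rVert_2 \tag{9}$$
$$L_C^{q} = \lVert \mathbf{q}_{rel} - \hat{\mathbf{q}}_{rel} \rVert_2 \tag{10}$$
where $\hat{\mathbf{x}}_{rel}$ and $\hat{\mathbf{q}}_{rel}$ are the relative position and orientation computed from the two predicted global poses, and $\mathbf{x}_{rel}$ and $\mathbf{q}_{rel}$ denote the ground truth.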
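The computation of a relative pose from two global poses, and of RelLoss from it, can be sketched as follows. This is a minimal illustration under explicit assumptions: quaternions are stored as (w, x, y, z), the relative orientation is taken as conj(q_ref) ⊗ q_cur and the relative position as x_cur − x_ref (the ordering conventions are assumed, not confirmed by the text), and all function names are hypothetical.

```python
import math


def unit(q):
    """Normalize a quaternion before composing it (required for predictions)."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def quat_conjugate(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_multiply(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def relative_pose(x_cur, q_cur, x_ref, q_ref):
    """Relative position and orientation of the current pose w.r.t. the reference."""
    x_rel = tuple(c - r for c, r in zip(x_cur, x_ref))
    q_rel = quat_multiply(quat_conjugate(unit(q_ref)), unit(q_cur))
    return x_rel, q_rel


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def rel_loss(pred_cur, pred_ref, true_cur, true_ref):
    """RelLoss for one image pair; each pose is a tuple (position, quaternion)."""
    x_hat, q_hat = relative_pose(*pred_cur, *pred_ref)
    x_gt, q_gt = relative_pose(*true_cur, *true_ref)
    return euclidean(x_gt, x_hat) + euclidean(q_gt, q_hat)
```

Note that a constant offset added to both predicted positions leaves this loss unchanged, which is why it is used together with the global loss rather than instead of it.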
IV-B2 Relative Pose Regression Loss
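Whilst RelLoss captures the relative geometry of two images through estimating their global poses, we here introduce another loss to estimate the relative pose of a pair of images directly from the input images. The relative pose regression loss (RelRLoss) is defined as shown in equation (11).
$$L_R = L_R^{x} + L_R^{q} \tag{11}$$
$$L_R^{x} = \lVert \mathbf{x}_{rel} - \tilde{\mathbf{x}}_{rel} \rVert_2 \tag{12}$$
$$L_R^{q} = \left\lVert \mathbf{q}_{rel} - \frac{\tilde{\mathbf{q}}_{rel}}{\lVert \tilde{\mathbf{q}}_{rel} \rVert} \right\rVert_2 \tag{13}$$
where $\mathbf{x}_{rel}$ and $\mathbf{q}_{rel}$ represent the ground truth relative position and orientation, and $\tilde{\mathbf{x}}_{rel}$ and $\tilde{\mathbf{q}}_{rel}$ represent the directly predicted relative position and orientation. The ground truth relative position and orientation can be obtained using equations (6) and (7). Note that $\tilde{\mathbf{q}}_{rel}$ needs to be normalized as it is directly regressed by the network.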
It should be noted that the predicted relative pose used by RelRLoss in equation (11) and the one used by RelLoss in equation (8) are different. The latter is computed from the difference of two predicted global poses, while the former is predicted directly by regression. Furthermore, it is the RelRLoss, through the RPRU, that joins the twin networks together (please refer to Figure 1). The purpose of introducing RelRLoss is to ensure that the features extracted by the ResNet50 network enable not only an accurate estimate of the global pose but also an accurate estimate of the relative pose.
IV-B3 Adaptive Metric Distance Loss
Deep learning-based methods often fail to accurately predict the poses of similar images taken at different locations. Distinguishing similar inputs belonging to different classes is one of the major difficulties in computer vision. Here, we take advantage of the Siamese network architecture of Figure 1 and propose the adaptive metric distance loss (MDLoss) to address the problem. It is inspired by metric learning [49, 50, 51], whose basic idea is to learn a distance metric adapted to the problem of interest. For many problems, including camera relocalization, hand-crafted representations fail to capture the relevant notion of similarity. Deep learning regression-based camera relocalization approaches estimate the pose from the visual content of the input image; therefore, simple metrics that measure visual content similarity fail to capture the pose dissimilarity in the above cases. In our Siamese architecture in Figure 1, the 6DOF camera pose is estimated by the GPRU. The input to the GPRU (the output of the ResNet50) should therefore reflect the pose difference rather than the visual similarity of the images, and MDLoss is designed to enforce this.
The MDLoss is built on the contrastive loss, which employs semantic information (data label) to force the convolutional neural network to learn an embedding representation that complies with a notion of similarity of the problem domain. In our scenario, we define the metric distance loss by embedding the relative pose of two images. The relative information is used to define the margin of feature representation. The loss function is shown in equation (14).
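$$L_M = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; d_x^{i} + \gamma\, d_q^{i} - \lVert \mathbf{f}_{cur}^{i} - \mathbf{f}_{ref}^{i} \rVert_2 \right) \tag{14}$$
where $N$ denotes the number of training samples, $\mathbf{f}_{cur}^{i}$ and $\mathbf{f}_{ref}^{i}$ are the outputs of the modified ResNet50 network for the current image and the reference image respectively, $d_x^{i}$ is the Euclidean distance of the actual relative position and $d_q^{i}$ is the Euclidean distance of the actual relative orientation of the current image and the reference image, and $\gamma$ is a positive constant that balances the influence of the relative position and orientation. It is set to 10 empirically.
Equation (14) can be explained as follows: if the feature distance $\lVert \mathbf{f}_{cur}^{i} - \mathbf{f}_{ref}^{i} \rVert_2$ is smaller than the margin $d_x^{i} + \gamma d_q^{i}$, we want to make it as large as the margin. On the other hand, if the feature distance is larger than the margin, this cost function is not active and the other cost functions ensure that $\mathbf{f}_{cur}^{i}$ and $\mathbf{f}_{ref}^{i}$ take appropriate values. This is a reasonable strategy because the reference image is always taken at a different location from that of the current image.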
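The adaptive-margin idea can be sketched in a few lines. The sketch assumes a plain hinge on (margin − feature distance), averaged over the batch, with the margin formed as the relative position distance plus a balance constant times the relative orientation distance. The exact functional form (for example, whether the hinge is squared), the symbol name `gamma` and the function names are assumptions for illustration only; the value 10 for the balance constant comes from the text.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def adaptive_metric_distance_loss(feats_cur, feats_ref, d_x, d_q, gamma=10.0):
    """Hinge loss whose margin depends on the ground-truth relative pose.

    feats_cur, feats_ref: feature vectors of the current and reference images.
    d_x, d_q: per-pair ground-truth relative position and orientation distances.
    gamma: balance constant between position and orientation (10 in the text).
    """
    total = 0.0
    for f_c, f_r, dx, dq in zip(feats_cur, feats_ref, d_x, d_q):
        margin = dx + gamma * dq           # larger pose difference, larger margin
        feat_dist = euclidean(f_c, f_r)
        total += max(0.0, margin - feat_dist)  # inactive once features are far enough apart
    return total / len(feats_cur)
```

Unlike a fixed-margin contrastive loss, pairs that are far apart in pose are pushed further apart in feature space than pairs that are close in pose.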
IV-C Comprehensive Loss
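We train the proposed neural network jointly with GlobalLoss, RelLoss, RelRLoss and MDLoss. The comprehensive loss can be represented by equation (15).
$$L = L_G + L_C + L_R + L_M \tag{15}$$
where $L_G$, $L_C$, $L_R$ and $L_M$ denote GlobalLoss, RelLoss, RelRLoss and MDLoss respectively. Grouping the terms by type, the comprehensive loss can be rewritten as equation (16).
$$L = L^{x} + L^{q} + L_M \tag{16}$$
It consists of three components: position loss $L^{x}$, orientation loss $L^{q}$ and metric distance loss $L_M$. The position loss and the orientation loss each have three components, namely the corresponding terms of GlobalLoss, RelLoss and RelRLoss, and can be written as equations (17) and (18) respectively.
$$L^{x} = L_G^{x} + L_C^{x} + L_R^{x} \tag{17}$$
$$L^{q} = L_G^{q} + L_C^{q} + L_R^{q} \tag{18}$$
We choose a learning strategy similar to PoseNet2 to balance the position loss and the orientation loss. Therefore, the comprehensive loss can be further reformulated as equation (19):
$$L = L^{x} \exp(-s_x) + s_x + L^{q} \exp(-s_q) + s_q + L_M \tag{19}$$
where $s_x$ and $s_q$ are learnable coefficients.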
V Experiments
In this section, we test our method on two publicly available camera relocalization benchmark datasets, one indoor and one outdoor, to demonstrate its effectiveness. Experimental results are presented and compared with state-of-the-art methods in the literature. We also investigate the role of the various components of the loss function and analyze how the choice of reference image affects the performance of the proposed method.
The two public datasets we use are 7Scene and Cambridge Landmarks. To make our results directly comparable to previous methods, we use the same split of training set and testing set as in the original datasets.
7Scene is an indoor image dataset for camera relocalization and trajectory tracking. It is collected with a handheld RGB-D camera. The ground truth pose is generated using the Kinect Fusion approach. The dataset is captured in 7 indoor scenes. Each scene contains several image sequences, which have already been divided into training and testing sets. The images are taken at a fixed resolution with a known focal length of 585. The dataset is quite challenging as camera motion blurs the images. Besides, the indoor scenes are often texture-less, which makes the localization problem even more difficult.
Cambridge Landmarks is an outdoor dataset collected at 4 sites around Cambridge University. It is collected using a Google mobile phone carried by a walking pedestrian. The images are captured at a fixed resolution and the ground truth pose is obtained through the VisualSFM software. The dataset is also very challenging as it is taken in different weather and lighting conditions. Besides, occlusion by moving pedestrians and vehicles further increases the difficulty.
Training phase: in this phase, all parts of the proposed network are involved. The network takes in a pair of images and outputs their corresponding global poses. It is important to note that the twin networks are identical. One takes the current image as input and produces its global 6DOF pose, while the other takes the reference image as input and outputs its corresponding pose.
Testing phase: in the testing phase, only one of the twins is necessary. Since they are identical, either one can be used. The middle part that links the twins is no longer necessary at this stage. Once training is completed, an image is fed to one of the twin networks and the 6DOF global pose of the camera is estimated.
We use the same image pre-processing approaches as previous methods. We first resize the image to 256 pixels along the shorter side and normalize it with the mean and standard deviation computed from the ImageNet dataset. For the training phase, we randomly crop the image. For the testing phase, images are cropped at the center of the image. Training images are shuffled before they are fed to the network.
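The geometry of this pre-processing can be sketched without any imaging library. The helpers below only compute the resized dimensions and the crop boxes; the crop size is left as a parameter because its value does not survive in the text above, and the function names and the (left, top, right, bottom) box convention are assumptions of this sketch.

```python
import random


def resize_shorter_side(width, height, target=256):
    """Dimensions after scaling so that the shorter side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def _check_crop(width, height, crop):
    if crop > min(width, height):
        raise ValueError("crop size exceeds the image size")


def center_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered square crop, as used at test time."""
    _check_crop(width, height, crop)
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def random_crop_box(width, height, crop, rng):
    """(left, top, right, bottom) of a uniformly placed square crop, as used in training.

    `rng` is a random.Random instance so that crops are reproducible.
    """
    _check_crop(width, height, crop)
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop
```

Random crops act as a mild data augmentation during training, while the deterministic center crop keeps test-time predictions repeatable.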
The modified ResNet50 is initialized with weights pre-trained on the ImageNet dataset. The GPRU and the RPRU are initialized with the Xavier initialization. We train the network with the Adam optimizer, using weight decay and a batch size of 32. We initialize $s_x$ and $s_q$ with 0 and -3.0 respectively in our experiments. We implement the network with PyTorch and train it on an Ubuntu 16.04 LTS system with an NVIDIA GTX 1080Ti GPU. Training is stopped once the network has converged.
We compare the results of the proposed method with those of state-of-the-art deep learning-based methods such as PoseNet, Bayesian PoseNet, PoseNet2, HourglassNet, LSTM-Net and RelNet on the 7Scene dataset, and with PoseNet, Bayesian PoseNet, PoseNet2 and LSTM-Net on the Cambridge Landmarks dataset. Following previous work, we report each scene's median error. We also compare the average median accuracy over all scenes in each dataset. The comparative results are shown in Table I and Table II.
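A per-scene median error of this kind can be computed as in the sketch below. The text does not spell out the error definitions, so the sketch assumes the convention common in this line of work: position error as the Euclidean distance in metres, and orientation error as the rotation angle in degrees between the predicted and ground truth quaternions. The function names are hypothetical.

```python
import math
import statistics


def position_error(x_true, x_pred):
    """Euclidean distance between ground truth and predicted positions."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x_true, x_pred)))


def orientation_error_deg(q_true, q_pred):
    """Rotation angle in degrees between two orientations given as quaternions.

    Both inputs are normalized first; the absolute value of the dot product
    makes the result identical for q and -q, which encode the same rotation.
    """
    n_t = math.sqrt(sum(c * c for c in q_true))
    n_p = math.sqrt(sum(c * c for c in q_pred))
    dot = abs(sum(a * b for a, b in zip(q_true, q_pred))) / (n_t * n_p)
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


def median_errors(poses_true, poses_pred):
    """Median position and orientation errors over a scene.

    Each pose is a tuple (position, quaternion).
    """
    pos = [position_error(t[0], p[0]) for t, p in zip(poses_true, poses_pred)]
    ori = [orientation_error_deg(t[1], p[1]) for t, p in zip(poses_true, poses_pred)]
    return statistics.median(pos), statistics.median(ori)
```

The median is less sensitive than the mean to a small number of grossly wrong predictions, which is one reason it is the usual summary statistic for this task.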
Table I shows the results for the 7Scene dataset. Compared with 7 state-of-the-art deep learning-based camera relocalization methods, the proposed method achieves the best positional accuracy in all 7 scenes and improves the average median positional accuracy over the best reported result. It is interesting to note that our method obtains an even better result than PoseNet2, which utilizes 3D reference information as an additional constraint.
For orientational accuracy, we achieve the best result among methods based on direct regression. It is not surprising that the results are not as good as those of PoseNet2 and RelNet, since PoseNet2 requires additional 3D models and RelNet triangulates the pose from all reference images by estimating relative poses instead of directly regressing the result.
Table II shows the results for the Cambridge Landmarks dataset. Our method obtains the best positional accuracy on the KingsCollege and the ShopFacade scenes, with median errors of 0.865m and 0.834m respectively. We improve the state-of-the-art orientational accuracy of the OldHospital and the StMarysChurch scenes by 26% and 10% respectively. The average positional accuracy over all scenes is also improved. The average orientational accuracy over all scenes is only slightly worse than that of PoseNet2, which is trained with 3D model constraints.
It is interesting to note that, of all the methods presented in the two tables, some do better in positional accuracy and some do better in orientational accuracy; none of them comprehensively beats the others in both measures. Our method achieves the best average positional accuracy amongst all methods on both datasets. For orientational accuracy, our method achieves competitive results, which are only slightly worse than those of the best method (PoseNet2) and better than or at least as good as those of the other methods.
In this section, we perform analysis on the influence of various loss function components and the reference image selection strategy. The experiments are also done on the 7Scene and Cambridge Landmarks.
V-D1 Loss Analysis
We perform an ablation analysis on the loss function. Recall from equation (15) that the overall loss function is $L = L_G + L_C + L_R + L_M$, consisting of the global loss $L_G$, the relative pose loss $L_C$, the relative pose regression loss $L_R$, and the adaptive metric distance loss $L_M$. In order to assess the role these loss components play, we formulate 4 loss functions based on the following combinations:
G: GlobalLoss;
G+C: GlobalLoss + RelLoss;
G+C+R: GlobalLoss + RelLoss + RelRLoss;
Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
We train the proposed network with each of the 4 aforementioned loss functions separately. The results are shown in Table III for the 7Scene dataset and in Table IV for Cambridge Landmarks. As more loss terms are added to the loss function, both the positional error and the orientational error decrease for all scenes of the 7Scene dataset and the Cambridge Landmarks dataset. The average positional and orientational errors for the 7Scene dataset and the Cambridge Landmarks dataset are shown in Figure 2 and Figure 3 respectively. The average position and orientation errors show a decreasing trend as more constraints are added. This demonstrates the usefulness of each loss component.
V-D2 Comparison of Relative Geometry Losses
We have designed three relative geometry-based losses. In order to evaluate their performance separately for pose prediction, we formulate new losses by combining each of them with the global pose loss. We also use global pose loss and our comprehensive loss as baselines. The details of the loss combinations are listed as follows:
G: GlobalLoss;
G+M: GlobalLoss + MDLoss;
G+C: GlobalLoss + RelLoss;
G+R: GlobalLoss + RelRLoss;
Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
For each loss function, we repeat experiments using the same training setup in previous experiments. The results on 7Scene and on Cambridge Landmarks are shown in Table V and in Table VI respectively. The average localization errors of the two datasets are shown in Figure 4 and Figure 5.
As shown in the two figures, the relative geometry-related losses (G+M, G+C, G+R) achieve better accuracy than the global pose loss alone in every scene of the two datasets. This further demonstrates their effectiveness for global pose prediction. It can also be seen that G+M obtains a larger average accuracy increase than the other two. In addition, G+C yields the smallest accuracy improvement on both datasets, lower than G+R. This implies that relative geometry constraints work better in the feature space than in the pose space, since RelRLoss and MDLoss act in the feature space while RelLoss acts in the pose space. It should also be noted that the results of our proposed loss (G+C+R+M) outperform those of every single relative geometry-related loss, which further demonstrates the effectiveness of our comprehensive loss function.
V-D3 Comparison of Metric Losses
To further evaluate the proposed adaptive metric distance loss, we conduct experiments to compare it with the conventional siamese loss and the triplet loss, since these two losses can also help make visually similar images distinguishable, as the proposed metric distance loss does. The siamese loss is shown in equation (20) and the triplet loss is shown in equation (21). The major difference is that the conventional metric losses set the margin to a fixed value, while the margin of our loss is a function of the relative pose of the two images.
$$L_{siamese} = \frac{1}{N} \sum_{i=1}^{N} y_i \left[ m - d_i \right]_{+} \tag{20}$$
$$L_{triplet} = \frac{1}{N} \sum_{i=1}^{N} \left[ \lVert \mathbf{f}_i - \mathbf{f}_i^{ref} \rVert_2 - \lVert \mathbf{f}_i - \mathbf{f}_i^{ref+} \rVert_2 + m \right]_{+} \tag{21}$$
where $[\cdot]_{+}$ denotes the hinge function $\max(0, \cdot)$, $N$ is the number of training samples, $m$ is the margin, $d_i = \lVert \mathbf{f}_i - \mathbf{f}_i^{ref} \rVert_2$ is the feature distance of the paired images, and $y_i$ always equals 1, since the two images are not from the same location. $\mathbf{f}_i$, $\mathbf{f}_i^{ref}$ and $\mathbf{f}_i^{ref+}$ are the feature vectors of the $i$th training image, its reference image, and the image after the reference image, respectively. The siamese loss of equation (20) explicitly forces the features of the two images to be different because they are from two different locations. The triplet loss of equation (21) explicitly enforces that the difference between the $i$th image and its reference image is smaller than the difference between it and the image after the reference image. In the experiments, we simply replace the MDLoss with the siamese loss and the triplet loss respectively and repeat the experiment. The margin parameter of the siamese loss and the triplet loss is empirically set to 0.001, which gives the best accuracy. The three losses compared are listed below.
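For comparison with the adaptive margin, the two fixed-margin baselines described above can be sketched as follows. The sketch assumes un-squared hinge forms averaged over the batch, which is one common variant; the exact forms used in the experiments may differ, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def siamese_loss(feats, feats_ref, margin):
    """Fixed-margin contrastive hinge for pairs that are all 'dissimilar' (label 1).

    Penalizes a pair only while its feature distance is below the margin.
    """
    terms = [max(0.0, margin - euclidean(f, r)) for f, r in zip(feats, feats_ref)]
    return sum(terms) / len(terms)


def triplet_loss(feats, feats_ref, feats_after, margin):
    """Fixed-margin triplet hinge.

    Asks the image to be closer (in feature space) to its reference image
    than to the image that follows the reference, by at least `margin`.
    """
    terms = [
        max(0.0, euclidean(f, r) - euclidean(f, a) + margin)
        for f, r, a in zip(feats, feats_ref, feats_after)
    ]
    return sum(terms) / len(terms)
```

In both baselines the margin is a single constant for every pair, whereas MDLoss scales the margin with the ground truth relative pose of each pair.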
LossSiamese: GlobalLoss + RelLoss + RelRLoss + SiameseLoss;
LossTriplet: GlobalLoss + RelLoss + RelRLoss + TripletLoss;
Ours: GlobalLoss + RelLoss + RelRLoss + MDLoss.
We repeat the experiments on the two datasets using the above losses and the results are shown in Table VII for 7Scene and Table VIII for Cambridge Landmarks. The average localization errors of the two datasets are shown in Figure 6 and Figure 7.
It can be seen that our method achieves the best average position accuracy on the 7Scene dataset, and the best average position and orientation accuracy on the Cambridge Landmarks dataset. LossSiamese achieves the best average orientational accuracy on the 7Scene dataset. LossTriplet performs badly on the 7Scene dataset but obtains better performance than LossSiamese on the Cambridge Landmarks dataset. Although LossSiamese achieves the best average orientational accuracy on 7Scene, our method achieves the best result in more individual scenes of the two datasets, as shown in Table VII and Table VIII. The results show that our adaptive metric distance loss outperforms the conventional siamese loss and the triplet loss.
V-D4 Reference Image Analysis
In this section, we evaluate two strategies for choosing the reference image. One obvious strategy is to pair every two different images, but the number of pairs then grows quadratically with the number of images, which greatly increases the training time and introduces high information redundancy. To make the training phase efficient, we generate only one reference image for each image. Specifically, reference images are selected in two ways: 1) select the next image in the same image sequence as the reference image; 2) randomly select a different image of the dataset that is not already the reference image of any other image. It should be noted that the next image is visually similar to the current image, whereas a randomly chosen reference image has no such property. To evaluate the effect of the two reference image selection strategies on the adaptive metric distance loss (MDLoss), we train the proposed network with the comprehensive loss function under each strategy. In addition, we use the results of the network trained without MDLoss (G+R+C) as a baseline for comparison.
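The two selection strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: images are identified by global integer indices in sequence order, the last image of a sequence is paired with its predecessor (the text does not say how this case is handled), and the function names are hypothetical. For comparison, `count_all_pairs` gives the number of pairs that exhaustive pairing would produce.

```python
import random


def count_all_pairs(num_images):
    """Number of unordered pairs if every two different images were paired."""
    return num_images * (num_images - 1) // 2


def pair_with_next(sequence_lengths):
    """Strategy 1: the reference is the next image of the same sequence.

    Assumption: the last image of a sequence has no successor, so it is
    paired with the image before it.
    """
    pairs = {}
    start = 0
    for length in sequence_lengths:
        if length < 2:
            raise ValueError("each sequence needs at least two images")
        for k in range(length):
            i = start + k
            pairs[i] = i + 1 if k < length - 1 else i - 1
        start += length
    return pairs


def pair_randomly(num_images, seed=0):
    """Strategy 2: a random different image, used as a reference only once.

    The indices are shuffled and each one is linked to its successor in the
    shuffled cycle, so no image is its own reference and no reference repeats.
    """
    if num_images < 2:
        raise ValueError("need at least two images")
    order = list(range(num_images))
    random.Random(seed).shuffle(order)
    return {order[k]: order[(k + 1) % num_images] for k in range(num_images)}


print(count_all_pairs(1000))    # exhaustive pairing
print(pair_with_next([3, 2]))   # two short sequences
print(pair_randomly(5, seed=1))
```

Both strategies produce exactly one reference per image, so the number of training pairs stays linear in the dataset size.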
Comparative median errors of the different reference selection strategies on the 7Scene and Cambridge Landmarks datasets are shown in Table IX and Table X. The average positional and orientational errors are shown in Figure 8 and Figure 9. Choosing the next image as the reference image yields higher similarity between paired images than random selection on both datasets, as indicated by its lower feature distance.
Table IX shows that, compared with the random reference selection strategy, taking the next image as the reference image reduces the average positional error from 0.197 m to 0.177 m and also reduces the average orientational error. For both reference image selection strategies, the inclusion of MDLoss improves performance. One probable explanation is that MDLoss makes the network learn to keep similar images of different poses apart in the feature space.
As shown in Table X, taking the next image as the reference gives better average positional and orientational accuracy than the random reference selection strategy. Randomly choosing the reference image also gives worse positional accuracy than the baseline. This may be explained by the fact that images in the Cambridge Landmarks dataset differ greatly from one another, so the metric distance loss fails to work. To verify this explanation, we measure the image similarity produced by the two pairing strategies. The average Euclidean distance between the GIST features of paired images is used to quantify paired image similarity. The average feature distances of the scenes are shown in Figure 10.
|Red Kitchen||7000||5000||4 3||0.0749||0.4993|
Table XI shows that, for each scene, taking the next image as the reference yields higher similarity between paired images than random selection. This supports our explanation that MDLoss works better when paired images are similar.
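The similarity measure used above, the average Euclidean distance between the descriptors of paired images, can be sketched as follows. The sketch assumes the GIST descriptors have already been computed by an external tool and stored as rows of an array. The function name, the pair dictionaries, and the one-dimensional toy descriptors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def mean_paired_distance(features, pairs):
    """Average Euclidean distance between each image and its reference.

    features: (N, D) array, one precomputed descriptor per image.
    pairs:    dict mapping an image index to its reference image index.
    """
    idx = np.array(list(pairs.keys()))
    ref = np.array(list(pairs.values()))
    diffs = features[idx] - features[ref]
    return float(np.mean(np.linalg.norm(diffs, axis=1)))


# Toy descriptors (hypothetical): six images along a smooth trajectory.
features = 0.1 * np.arange(6, dtype=float).reshape(-1, 1)

next_pairs = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 4}   # neighbouring frames
far_pairs = {0: 3, 1: 4, 2: 5, 3: 0, 4: 1, 5: 2}    # distant frames

print(mean_paired_distance(features, next_pairs))
print(mean_paired_distance(features, far_pairs))
```

A lower value means the paired images are more similar, which is the sense in which neighbouring frames score better than distant ones in this toy setup.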
VI Concluding Remarks
In this paper, we enhance the camera relocalization performance of deep learning-based methods by introducing relative geometry constraints. This is achieved by designing a relative geometry-aware Siamese neural network and three relative geometry-related loss functions. The proposed network predicts the poses of two images as well as the relative pose between them. Another advantage is that it can predict the global pose of a single image by feeding that image into one stream of the network. The new pose space relative loss and feature space relative regression loss can be combined with the traditional global pose loss to improve position and orientation accuracy. The metric distance loss enables the network to learn a deep feature representation that distinguishes similar images of different locations, which helps improve localization accuracy. We also find that pairing similar images outperforms random pairing. In future work, we plan to investigate the combination of deep learning-based methods and 3D modeling-based methods to further enhance the performance.
-  A. C. Murillo and J. Kosecka, “Experiments in place recognition using gist panoramas,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 2196–2203.
-  T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt, “Image retrieval for image-based localization revisited,” in BMVC, vol. 1, no. 2, 2012, p. 4.
-  I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for topological localization,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2. IEEE, 2000, pp. 1023–1029.
-  J. Wolf, W. Burgard, and H. Burkhardt, “Robust vision-based localization by combining an image-retrieval system with monte carlo localization,” IEEE transactions on robotics, vol. 21, no. 2, pp. 208–216, 2005.
-  ——, “Robust vision-based localization for mobile robots using an image retrieval system based on invariant features,” in Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, vol. 1. IEEE, 2002, pp. 359–365.
-  Z. Kukelova, M. Bujnak, and T. Pajdla, “Real-time solution to the absolute pose problem with unknown radial distortion and focal length,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2816–2823.
-  T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo geolocation with convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 37–55.
-  A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2938–2946.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” arXiv preprint arXiv:1509.05909, 2015.
-  I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, “Image-based localization using hourglass networks,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 870–877.
-  F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using lstms for structured feature correlation,” in Int. Conf. Comput. Vis.(ICCV), 2017, pp. 627–637.
-  R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, “Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
-  A. Kendall, R. Cipolla et al., “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.
-  S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2616–2625.
-  Y. Kalantidis, G. Tolias, Y. Avrithis, M. Phinikettos, E. Spyrou, P. Mylonas, and S. Kollias, “Viral: Visual image retrieval and localization,” Multimedia Tools and Applications, vol. 51, no. 2, pp. 555–592, 2011.
-  Y.-H. Lee and Y. Kim, “Efficient image retrieval using advanced surf and dcd on mobile platform,” Multimedia Tools and Applications, vol. 74, no. 7, pp. 2289–2299, 2015.
-  X. Li, M. Larson, and A. Hanjalic, “Geo-distinctive visual element matching for location estimation of images,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1179–1194, 2018.
-  B. J. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” Image and Vision Computing, vol. 19, no. 6, pp. 381–391, 2001.
-  E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro, “Image-based monte carlo localisation with omnidirectional images,” Robotics and Autonomous Systems, vol. 48, no. 1, pp. 17–30, 2004.
-  J. Wang, H. Zha, and R. Cipolla, “Coarse-to-fine vision-based localization by indexing scale-invariant features,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 2, pp. 413–422, 2006.
-  J. Wang, R. Cipolla, and H. Zha, “Vision-based global localization using a visual vocabulary,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on. IEEE, 2005, pp. 4230–4235.
-  M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
-  A. Torii, Y. Dong, M. Okutomi, J. Sivic, and T. Pajdla, “Efficient localization of panoramic images using tiled image descriptors,” Information and Media Technologies, vol. 9, no. 3, pp. 351–355, 2014.
-  M. Umeda and H. Date, “Spherical panoramic image-based localization by deep learning,” Transactions of the Society of Instrument and Control Engineers, vol. 54, pp. 483–493, 2018.
-  A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi, “Multi-output learning for camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1114–1121.
-  J. Kosecka, L. Zhou, P. Barber, and Z. Duric, “Qualitative image based localization in indoors environments,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2. IEEE, 2003, pp. II–II.
-  A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
-  N. Sünderhauf and P. Protzel, “Brief-gist-closing the loop by simple means,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1234–1241.
-  G. Singh and J. Kosecka, “Visual loop closing using gist descriptors in manhattan world,” in ICRA Omnidirectional Vision Workshop, 2010.
-  R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gámez, “Bidirectional loop closure detection on panoramas for visual navigation,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 1378–1383.
-  M. J. Milford and G. F. Wyeth, “Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1643–1649.
-  D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer vision, 1999. The proceedings of the seventh IEEE international conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
-  J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Proceedings of the Ninth IEEE International Conference on Computer Vision. IEEE, 2003, p. 1470.
-  H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3304–3311.
-  M. Donoser and D. Schmalstieg, “Discriminative feature-to-point matching in image-based localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 516–523.
-  T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys, “Hyperpoints and fine vocabularies for large-scale location recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2102–2110.
-  T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 1744–1756, 2017.
-  ——, “Improving image-based localization by active correspondence search,” in European conference on computer vision. Springer, 2012, pp. 752–765.
-  M. Uyttendaele, M. Cohen, S. Sinha, and H. Lim, “Real-time image-based 6-dof localization in large-scale environments,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1043–1050.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE international conference on. IEEE, 2011, pp. 2564–2571.
-  J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
-  J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. Torr, “Exploiting uncertainty in regression forests for accurate camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4400–4408.
-  E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac-differentiable ransac for camera localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
-  Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera relocalization by computing pairwise relative poses using convolutional neural network,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 920–929.
-  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a “siamese” time delay neural network,” in Advances in neural information processing systems, 1994, pp. 737–744.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709, 2013.
-  M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov, “Hamming distance metric learning,” in Advances in neural information processing systems, 2012, pp. 1061–1069.
-  S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on. IEEE, 2011, pp. 127–136.
-  C. Wu et al., “Visualsfm: A visual structure from motion system,” 2011.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
-  R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 1735–1742.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  A. Valada, N. Radwan, and W. Burgard, “Deep auxiliary learning for visual localization and odometry,” arXiv preprint arXiv:1803.03642, 2018.
-  N. Radwan, A. Valada, and W. Burgard, “Vlocnet++: Deep multitask learning for semantic visual localization and odometry,” arXiv preprint arXiv:1804.08366, 2018.