I. Introduction
Image-based localization refers to retrieving the 6 Degree-of-Freedom (DoF) pose of a camera (monocular in most cases) from a single or a few query images (testing), given previously pose-labeled images (training) [patel2018contextualnet]. Typical applications include localizing a pedestrian with one image taken by a mobile phone or localizing an indoor service robot equipped with a low-cost camera.
Contrary to prior-free monocular Simultaneous Localization And Mapping (SLAM), image-based localization can align the observed single/few images to the coordinate frame of the training images and recover the pose globally. In monocular SLAM, the system can only recover the camera pose locally (relative to the starting pose) using a sequence of images, and without the correct scale; both the scale and the pose in monocular SLAM suffer from drift [mur2017orb]. Table I summarizes the differences. Since image-based localization works with consumer-level monocular cameras, it provides a fast and low-cost solution for a wide variety of applications.
TABLE I: Monocular SLAM vs. image-based localization.

                     Monocular SLAM   Image-based localization
Prior data           N/A              Images and pose labels
Scale                No               Yes
Pose coordinate      Local            Global
Number of images     Sequential       Single or few
Previous studies aligned the obtained image with the viewed scenery in the training set based on visual similarity (observed from a similar viewing angle). The similarity is encoded by feature extractors such as Speeded Up Robust Features (SURF) or Oriented FAST and Rotated BRIEF (ORB). Later, PoseNet [kendall2015posenet] and SCoRe Forest [shotton2013scene] showed that machine learning methods, especially deep Convolutional Neural Networks (CNNs), can map images to their 6 DoF poses. The trained deep CNN model is used to infer the pose of the testing image directly. Follow-up work developed different variants of PoseNet to enhance the performance of image-to-pose deep CNN models [wang2019atloc, weinzaepfel2019visual, valada2018deep, patel2018contextualnet, radwan2018vlocnet++, tian20203d]. To estimate the uncertainties of the image-to-pose deep CNN model, kendall2016modelling adopted a Bayesian deep CNN framework to quantify the estimated pose's uncertainty. It addressed uncertainty based on the coverage of the training data set, maintaining the assumption that the "model is more uncertain about images which are dissimilar to the training examples" [kendall2016modelling]. This epistemic uncertainty measures the output in a data-fitting manner. Similar studies can be found in [zhao2019generative, costante2020uncertainty, yang2020d3vo]. In a Bayesian deep CNN network, the pose uncertainty refers to the ambiguity caused by the mismatched distributions of the training and testing samples, distinguishing it from the traditional geometric uncertainty. Unlike the Bayesian deep CNN model, liu2018deep inferred the covariance of the measurement directly from raw sensory data. In contrast to these data-driven methods, the model-driven aleatoric uncertainty [hartley2003multiple, vega2013cello, ozog2013importance, polic2017camera, briskin2017estimating] addresses the geometric uncertainty. These geometric works followed the classical pinhole camera model and obtained the 6 DoF pose uncertainty propagated from the uncertainty in 2D tracking, measured as the sum of the 2D reprojection residuals.
In general, deep CNN-based image-to-pose approaches are efficient only if the images in the training and testing sets are visually similar; they cannot retrieve good results if the images are collected from very different perspectives. Moreover, the uncertainties in the deep CNN framework measure the similarity between the training and testing images, so the pose uncertainties obtained from the Bayesian network have a valid statistical meaning but no geometric interpretation. Our work exploits the pinhole camera projection model and is less dependent on the size of the training data. Moreover, benefiting from the geometric uncertainty quantification, we use a constant-velocity camera motion model to incorporate the information of consecutive images and gain robustness. By introducing the motion model, we overcome the issues caused by blurry images, which randomly yield inaccurate, far-off predictions. The pose predictions provided by the camera projection model and the motion model are further fused based on their uncertainties.
The contributions of this work are as follows.
• A novel image-based localization framework that explicitly localizes the robot by fusing a deep CNN with the geometric constraints of sequential images.
• The predictions from the geometric locator and the motion model are fused by incorporating their respective uncertainties during pose fusion.
• Our proposed framework requires a relatively small training data set and can be easily deployed for indoor localization tasks.
II. Related Work
Image-based localization can be categorized into single monocular and sequential monocular image-based localization.
Early studies used monocular images captured by consumer-level cameras for image-based localization. The prior-free Bag of Words (BoW) algorithm [galvez2012bags] obtains image descriptors that characterize the image appearance. Due to its efficiency, BoW has been widely applied in SLAM systems for loop closure detection [galvez2012bags]. Following BoW, other methods used SIFT probability [cummins2008fab], deep CNN frameworks [sunderhauf2015performance], and a contextual reweighting network [kim2017learned] to generate image descriptors. All these methods only provide rough camera localization rather than a precise 6 DoF pose.
PoseNet [kendall2015posenet] (monocular) and SCoRe Forest [shotton2013scene] (RGB-D) are learning-based approaches that map from image to 6 DoF pose, bridging the gap between the learning and geometric domains. wang2019atloc introduced an attention model for better feature extraction. weinzaepfel2019visual claimed the benefits of an intermediate objects-of-interest extraction and matching step. tian20203d trained a deep CNN model with better efficiency by addressing 3D scene geometry-aware constraints in consecutive images. The idea proposed by tian20203d is similar to our approach in that geometry-aware constraints are used via an implicit encoding in the training process. However, we explicitly convert the image-to-pose CNN into a hybrid of image-to-image matching and geometric localization, and we exploit the inter-image relationship by explicitly using a motion model constraint. Considering that short-term consecutive images may benefit the robustness and accuracy of the image-to-pose fitting, further studies improved the deep CNN-based framework with sequential images. patel2018contextualnet proposed a Long Short-Term Memory (LSTM) network termed ContextualNet, which exploits the information within sequential images to enhance the performance of the deep CNN network. VLocNet [valada2018deep] and VLocNet++ [radwan2018vlocnet++] integrated multiple deep CNNs to predict the 6 DoF pose, visual odometry, and semantic map segmentation simultaneously; this multi-task deep CNN strategy significantly enhanced the performance of the pose prediction. brahmbhatt2018geometry and yang2020d3vo integrated conventional SLAM techniques by feeding the redundant pose predictions of multiple images to pose graph or Bundle Adjustment (BA) optimizations.
III. Notation
Denote (·)^∨ as an isomorphism that converts an element of the Lie algebra se(3) to a 6-vector (i.e., a twist); conversely, (·)^∧ maps a twist back to se(3). For the testing image I_t, the goal of this work is to estimate the camera's pose T_t ∈ SE(3). Additionally, the associated covariance matrix is denoted Σ_t. We define a distance between two poses T_1 and T_2 as d(T_1, T_2) = ‖log(T_1⁻¹ T_2)^∨‖, where log(·) is the matrix (Lie) logarithm map. The misalignment angle between two orientations R_1, R_2 ∈ SO(3) is

θ(R_1, R_2) = arccos( (tr(R_1ᵀ R_2) − 1) / 2 ),   (1)

where tr(·) yields the trace of the input matrix. Denote R and p as the orientation and position extracted from the pose T.
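Numerically, the misalignment angle in (1) is a one-liner; a minimal sketch (the clipping of the cosine argument guards against floating-point round-off):

```python
import numpy as np

def misalignment_angle(R1, R2):
    """Misalignment angle between two orientations (Eq. (1)):
    theta = arccos((tr(R1^T R2) - 1) / 2)."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))  # clip guards round-off

# A rotation of 30 degrees about z, compared against the identity.
th = np.deg2rad(30.0)
Rz = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th),  np.cos(th), 0.0],
               [0.0,         0.0,        1.0]])
angle = misalignment_angle(np.eye(3), Rz)
```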
IV. Proposed Methodology
Fig. 1 illustrates the proposed framework, which consists of five modules: (a) aligning the single testing image to the most visually similar image in the training set; (b) selecting keyframes covisible with the visually similar image in the training set; (c) localizing the testing image geometrically using features from the selected covisible images; (d) retrieving the camera pose of the testing image using a motion model based on the constant-velocity assumption to account for the sequential predictions; (e) fusing the poses predicted by the geometric locator and the motion model, addressing both uncertainties. If the image has no sequential prior predictions (i.e., in the first few starting frames), only the single image locator is used.
IV-A Image descriptor network
The image descriptor network aims at estimating the camera pose by aligning the most visually similar image in the training set with the testing image, using the trained deep CNN model. We modify the end-to-end image-to-pose regression network PoseNet [kendall2015posenet] into an image-to-image similarity search for the following reasons. First, kendall2016modelling pointed out that the uncertainty of PoseNet is closely related to the similarity between the training and testing images; a Siamese network structure can explicitly compare image similarities to better exploit this information. Second, zagoruyko2015learning demonstrated that a Siamese network with a deep CNN model is efficient in measuring image similarity with limited training data. Lastly, BA addresses the inherent geometric constraint and provides an error propagation formulation mathematically derived from the pinhole camera model.
Fig. 2 shows the structure of the Siamese network. Among off-the-shelf methods like [GalvezTRO12, schmitt2017openxbow] and deep CNN-based methods [wang2014learning, zagoruyko2015learning], GoogLeNet [szegedy2015going] is adopted as the base network since it works well for extracting image descriptors in PoseNet [kendall2015posenet]. The twin GoogLeNet models share the same weights. All input images are converted to grayscale at a fixed resolution, and each twin network takes one of the two gray images. The branches' outputs are concatenated and fed into a fully connected layer to generate a one-dimensional image descriptor. In the training process, thresholds on the pose are chosen to label the similarity of an image pair. We use the contrastive loss function [hadsell2006dimensionality] to measure the similarity between the encoded descriptors,

L(f_1, f_2, y) = y ‖f_1 − f_2‖² + (1 − y) max(0, m − ‖f_1 − f_2‖)²,   (2)

where f_1 and f_2 are the descriptor vectors yielded by the image descriptor networks, y is the similarity label, and m is the margin. Once the network is trained, we build a KD tree on the descriptors of the training images, which is then used to find the most similar training image given a testing image's descriptor.
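A minimal sketch of the contrastive loss in (2) and the KD-tree lookup over training-image descriptors. The toy two-dimensional descriptors are assumptions for illustration; the real network outputs high-dimensional descriptors:

```python
import numpy as np
from scipy.spatial import cKDTree

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss (Eq. (2)) on a pair of descriptors: similar pairs
    (y=1) are pulled together, dissimilar pairs (y=0) are pushed apart
    until their distance reaches the margin m."""
    d = np.linalg.norm(f1 - f2)
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

# After training, descriptors of all training images go into a KD tree;
# at test time, the nearest descriptor gives the most similar image.
train_desc = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy descriptors
tree = cKDTree(train_desc)
dist, idx = tree.query(np.array([0.9, 0.1]))  # query-image descriptor
```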
IV-B Covisible keyframe selection
After the testing image is aligned with the optimal training image, the optimal training image's covisible keyframes are selected following two principles: (I) every two keyframes should have a suitable baseline to allow enough parallax for triangulation, and (II) the orientational difference between the two keyframes should be small so that there is enough overlap for triangulation. Principle (I) ensures a large enough parallax, while (II) avoids too small an overlap within the selected images. We follow [ondruvska2015mobilefusion] and modify the keyframe selection strategy by calculating a score for every pair of keyframes: for two arbitrary keyframes with camera poses composed of orientations R_1, R_2 and positions p_1, p_2, the score rewards a baseline ‖p_1 − p_2‖ close to the optimal baseline and penalizes large orientational differences.
The neighboring keyframe selection is done by searching the optimal training image's neighbors. Once the frame is aligned by the deep CNN, the backward and forward keyframe seeds are initialized at the aligned frame. Each direction searches for the optimal (highest-score) keyframe within its search range and assigns it as the next seed. This process is carried out iteratively in both directions until the joint keyframe group reaches the maximum number.
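The pairwise score can be sketched as follows. The exact score of [ondruvska2015mobilefusion] is not reproduced here; the Gaussian-shaped baseline term, the hard orientation gate, and the parameter values `b_opt` and `theta_max` are all assumptions for illustration:

```python
import numpy as np

def keyframe_score(p1, p2, theta12, b_opt=0.3, theta_max=np.deg2rad(30)):
    """Hypothetical keyframe-pair score: rewards a baseline near the
    optimal value b_opt (principle I) and rejects pairs whose orientational
    difference theta12 is too large (principle II). Assumed form only."""
    if theta12 > theta_max:          # too little overlap for triangulation
        return 0.0
    b = np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float))
    return float(np.exp(-((b - b_opt) / b_opt) ** 2))

# A pair at the optimal baseline scores highest; a rotated-away pair scores 0.
s_good = keyframe_score([0, 0, 0], [0.3, 0, 0], np.deg2rad(5))
s_bad  = keyframe_score([0, 0, 0], [0.3, 0, 0], np.deg2rad(60))
```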
IV-C The single image locator
The geometric locator consists of sparse mapping of the training images (forward-intersection) and testing image localization (backward-intersection). It localizes a single image geometrically, and both modules are simplified from BA [hartley2003multiple].
Forward-intersection triangulates sparse 3D map points from the 2D tracking in the training set. The predicted most similar image and its neighbors from Section IV-B are adopted. As Fig. 3 shows, the predicted image, as well as its covisible images (blue cameras in Fig. 3), go through SURF to be tracked with 2D key corners. All selected images are paired and matched for reliable tracking. Denote the poses of the selected images as T_i = (R_i, p_i) and the 2D point tracked on image i as x_{ij}; note that some points cannot be tracked on all images. Projecting the 3D point X_j (in the global coordinate frame) onto the image with pose T_i gives

x̂_{ij} = π( K (R_i X_j + p_i) ),

where π(·) is the 2D projection, obtained by dehomogenizing its input, and K is the camera intrinsic matrix. Forward-intersection retrieves the 3D map points by minimizing the sum of reprojection residuals,

min_{ {X_j} } Σ_{i∈S} Σ_{j∈P_i} ρ( ‖ x_{ij} − π(K(R_i X_j + p_i)) ‖² ),   (3)
where S is the set of all selected training images and P_i is the group of key points tracked on image i. The Huber loss ρ(·) [huber1992robust, kummerle2011g] is applied as a robustifier to handle outliers, and all trackings are treated with equal weights. Moreover, triangulated points with an average residual greater than a threshold are deleted.
Backward-intersection localizes the camera of the testing image with the triangulated 3D points. The 3D feature points are matched with the testing image (red camera in Fig. 3). Define P_t as the group of tracked 2D points on the testing image t. The pose T_t = (R_t, p_t) is estimated by minimizing the sum of reprojection residuals,

min_{T_t} Σ_{k∈P_t} ρ( ‖ x_k − π(K(R_t X_k + p_t)) ‖² ),   (4)
where is the th tracked points on the testing image . Denote the sum of the residuals in (4) as . Similar to forwarditersection, Huberloss [huber1992robust, kummerle2011g]
is applied and all tracking are treated with equal weights (variance is set to
pixel). Combining the identity variance matrix of the pixellevel and the Huber loss factors, the information matrix of each map point can be obtained. Denote as the right Jacobian of the perpixel residuals in (4) following the definition in [barfoot2017state], the information matrix of the pose is given byWe follow [polic2017camera] to approximate the average residual variance as
where is the number of tracking in . With the information matrix , the covariance matrix can be obtained as
(5) 
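As a concrete illustration of the backward-intersection step, the following minimal sketch (not the authors' implementation) estimates a pose from 2D-3D matches with SciPy and derives a pose covariance from the residual Jacobian in the spirit of (5). The axis-angle parameterization, noise level, and Gauss-Newton covariance approximation are assumptions:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, rvec, t, X):
    """Pinhole projection of Nx3 points X with pose (rvec, t): x = pi(K(R X + t))."""
    Xc = Rotation.from_rotvec(rvec).apply(X) + t
    uv = (K @ Xc.T).T
    return uv[:, :2] / uv[:, 2:3]

def localize(K, X, x_obs, pose0):
    """Backward-intersection: estimate the 6-DoF pose from 2D-3D matches
    by minimizing the reprojection residual (cf. Eq. (4))."""
    def residual(p):
        return (project(K, p[:3], p[3:], X) - x_obs).ravel()
    res = least_squares(residual, pose0)
    # Pose covariance from the residual Jacobian (cf. Eq. (5)):
    # average residual variance, then Sigma = sigma^2 (J^T J)^{-1}.
    n_obs, n_par = res.jac.shape
    sigma2 = 2 * res.cost / max(n_obs - n_par, 1)   # res.cost = 0.5 * sum(r^2)
    Sigma = sigma2 * np.linalg.inv(res.jac.T @ res.jac)
    return res.x, Sigma

# Synthetic check: known pose, 20 random 3D points in front of the camera.
rng = np.random.default_rng(0)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
X = rng.uniform([-1, -1, 4], [1, 1, 8], (20, 3))
pose_true = np.array([0.05, -0.02, 0.01, 0.1, -0.2, 0.3])
x_obs = project(K, pose_true[:3], pose_true[3:], X) + rng.normal(0, 0.5, (20, 2))
pose_est, Sigma = localize(K, X, x_obs, np.zeros(6))
```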
IV-D Probabilistic motion model
As patel2018contextualnet pointed out, single image-based localization suffers from predictions that deviate significantly from consecutive predictions. Sequential prediction can alleviate this issue as it uses information from multiple images. Researchers have developed different techniques that exploit sequential information, using both deep CNN-based methods [patel2018contextualnet, valada2018deep, radwan2018vlocnet++, tian20203d] and a geometric perspective [brahmbhatt2018geometry, yang2020d3vo]. We apply a short-term robot motion model [thrun2002probabilistic, barfoot2017state] to constrain the consecutive predictions. Short-term robot motion models are often categorized as Brownian, constant velocity, constant acceleration, and constant turn [thrun2002probabilistic, shi2002speed]. We employ a constant-velocity probabilistic motion model to fit the short-term robot motion. The motion model is first fitted to the short-term historical predictions and then applied to predict the testing image's pose and its uncertainty. Following the constant motion interpolation (in SE(3)) in [barfoot2017state], we define the short-term constant motion and optimize the objective function

min_{T_0, ξ} Σ_{i=1}^{n} ‖ log( (exp((i ξ)^∧) T_0)⁻¹ T̂_i )^∨ ‖²,   (6)

where n is the number of historical images in the window of the motion model and T̂_i is the i-th estimated pose. The basic idea is to search for the optimal initial pose T_0 and constant motion ξ that fit the consecutive predictions. After the estimation of the probabilistic motion model (6), the one-step prediction of the robot pose can be obtained by extrapolating the constant motion one step further.
The covariance matrix (diagonal) of the motion model's prediction is calculated statistically from the fitting residuals; it describes the fitting accuracy of the probabilistic motion model. This uncertainty only quantifies the short-term epistemic uncertainty and suffers from uncertainty drift in the long run; precisely, if the uncertainties of the motion model were propagated, they would tend to be overconfident.
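A minimal, position-only sketch of the constant-velocity fitting and one-step prediction. The full method operates on SE(3) poses; this linear simplification, the window size, and the noise level are assumptions:

```python
import numpy as np

def fit_constant_velocity(positions, window):
    """Fit a constant-velocity model to the last `window` positional
    predictions (a simplified, position-only analogue of Eq. (6)) and
    return the one-step-ahead prediction with a diagonal covariance
    derived from the fitting residuals."""
    P = np.asarray(positions[-window:], dtype=float)   # (window, 3)
    k = np.arange(len(P))
    A = np.stack([np.ones_like(k), k], axis=1).astype(float)
    # Least-squares line fit per axis: p_k ~ p0 + k * v
    coef, *_ = np.linalg.lstsq(A, P, rcond=None)
    pred = coef[0] + len(P) * coef[1]                  # one step ahead
    resid = P - A @ coef
    var = resid.var(axis=0, ddof=2) if len(P) > 2 else np.zeros(3)
    return pred, np.diag(var)

# Noisy straight-line motion: the prediction should continue the trend.
rng = np.random.default_rng(1)
traj = [np.array([0.1 * k, 0.05 * k, 0.0]) + rng.normal(0, 1e-3, 3)
        for k in range(6)]
pred, Sigma_m = fit_constant_velocity(traj, window=5)
```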
IV-E Fusion of the geometric locator and the motion model
The predictions of the geometric locator and the probabilistic motion model are fused with the quantified uncertainties following [barfoot2014associating] as

T̂_t = argmin_T ‖ log(T T_g⁻¹)^∨ ‖²_{Σ_g⁻¹} + ‖ log(T T_m⁻¹)^∨ ‖²_{Σ_m⁻¹},   (7)

where T_g and T_m are the poses predicted by the geometric locator and the motion model, and the weighting uses the right Jacobians of the objective function (7) with respect to T_g and T_m, following the definition in [barfoot2017state]. To filter outliers from the single image locator, isometric standard deviations σ_p and σ_R, computed from the position and orientation blocks of the covariance, are enforced for gating T_g: if the positional or orientational prediction of T_g exceeds the bounds, T_g is identified as a failure and the motion model pose prediction is assigned to T̂_t. The scalar σ_p represents the position's ellipsoid uncertainty; similarly, the scalar σ_R is the orientational uncertainty.
The motion model provides an independent and cheap way to rectify outliers. Unlike the global positioning of the geometric locator, the probabilistic motion model retrieves the local pose from the estimated short-term historical poses, so its uncertainty is only valid in the short term. Assuming the error of the pose from the geometric locator follows a Gaussian distribution, if the motion model prediction deviates from the geometric locator by more than the 3σ bound, the prediction of the geometric locator is discarded; such a large difference indicates an unreasonable, abrupt pose prediction. Other filtering methods, such as the extended Kalman filter, are unsuitable because the geometric locator can yield consecutive outliers.
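The uncertainty-weighted fusion and the 3σ outlier gate can be sketched as follows. This is a simplified linear (vector-space) analogue of the SE(3) fusion in (7); the Mahalanobis form of the gate is an assumed implementation detail:

```python
import numpy as np

def fuse(x_g, S_g, x_m, S_m, n_sigma=3.0):
    """Covariance-weighted fusion of the geometric-locator estimate
    (x_g, S_g) and the motion-model prediction (x_m, S_m). Outliers from
    the single-image locator are gated at n_sigma before fusing."""
    # Gate: if the geometric estimate deviates too far from the motion-model
    # prediction, treat it as a failure and keep the prediction instead.
    d = x_g - x_m
    maha2 = d @ np.linalg.inv(S_g + S_m) @ d
    if maha2 > (n_sigma ** 2) * len(d):
        return x_m.copy(), S_m.copy()
    # Information-form fusion: Lambda = S_g^-1 + S_m^-1.
    L_g, L_m = np.linalg.inv(S_g), np.linalg.inv(S_m)
    S = np.linalg.inv(L_g + L_m)
    x = S @ (L_g @ x_g + L_m @ x_m)
    return x, S

x_g = np.array([1.0, 2.0, 3.0]); S_g = 0.04 * np.eye(3)
x_m = np.array([1.1, 2.0, 3.0]); S_m = 0.04 * np.eye(3)
x, S = fuse(x_g, S_g, x_m, S_m)
```

With equal covariances, the fused estimate is the mean of the two inputs and the fused covariance is halved, as expected from information-form fusion.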
V. Experiments
We evaluated our method on three data sets (one synthetic and two real-world). The synthetic data set was generated in a virtual warehouse using the Isaac Sim [issacsdk] engine. The real-world public data sets used for our experiments were the 7Scenes [shotton2013scene] and TUM RGB-D [sturm12iros] data sets. We also used a Monte Carlo experiment to validate the uncertainty quantification proposed in (5).
V-A Monte Carlo synthetic experiment
To validate the uncertainty quantification in (5), a Monte Carlo experiment was conducted to test the geometric locator. It follows the scenario in Fig. 3: 3D points were partially projected onto the images of the 4 camera poses and the target camera, and i.i.d. Gaussian noise was exerted on all 2D trackings. The coverage rate is defined as the fraction of Monte Carlo trials in which the difference between the estimated pose and the ground truth pose falls within the 1σ and 2σ bounds predicted by (5). To further evaluate the efficiency in the presence of noise in the pose labeling, we corrupted the pose labels using simulated noise to replicate a more realistic scenario. Table II shows the coverage rate under different tracking noise and pose label noise levels, against the ideal Gaussian coverage rates.

[Table II: angular and positional coverage rates at the 1σ and 2σ bounds for three pose-label noise levels; the numeric entries are not recoverable from the extracted text.]
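The coverage-rate computation used in the Monte Carlo experiment can be sketched as follows (a toy 1D example with an assumed Gaussian error model; with a correct σ, coverage should approach the ideal Gaussian values of about 68.3% at 1σ and 95.4% at 2σ):

```python
import numpy as np

def coverage_rate(errors, sigma, n_sigma):
    """Fraction of Monte Carlo trials whose estimation error falls
    within n_sigma times the predicted standard deviation."""
    return float(np.mean(np.abs(errors) <= n_sigma * sigma))

# Gaussian errors with a correctly estimated sigma.
rng = np.random.default_rng(42)
errors = rng.normal(0.0, 0.1, 100_000)
c1 = coverage_rate(errors, 0.1, 1)
c2 = coverage_rate(errors, 0.1, 2)
```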
V-B Data sets
Our method was tested on the synthetic data set generated using the Isaac Sim engine [issacsdk], which is photo-realistic and physically accurate. The monocular images captured by the camera mounted on the robot were logged along with the corresponding poses while the robot maneuvered in the warehouse. Two data sets were generated in two different virtual environments with different levels of complexity, and the images were synthesized at a fixed resolution. Fig. 4 shows sample images.
The proposed approach was also tested on the small-range 7Scenes [shotton2013scene] indoor data set, which was collected with a handheld Kinect RGB-D camera. The pose labels were generated by running the KinectFusion [newcombe2011kinectfusion] SLAM system on the RGB-D images. The monocular images contain motion blur, flat surfaces, and various lighting conditions. The TUM RGB-D data set [sturm12iros], which was designed to test the performance of RGB-D SLAM algorithms, was also adopted. In the TUM RGB-D data set, a motion capture system was used to accurately track the reflective markers on the Kinect camera, and the intrinsic parameters were calibrated; therefore, the accuracy of the pose labels and intrinsic parameters is substantially better than in 7Scenes. The images also contain apparent challenges, including motion blur, flat surfaces, and various lighting conditions. We chose the last 100 images as the testing data set and the remaining images as the training data set. The sequences "freiburg1 desk2", "freiburg1 floor", "freiburg1 room", "freiburg2 360 hemisphere", "freiburg2 360 kidnap", and "freiburg2 large with loop" were chosen since most of the last 100 frames revisit places seen in the training set. We only used the RGB images and the pose labels.

[Tables III and IV: median positional and orientational errors of PoseNet, ContextualNet, and the proposed method on the synthetic and 7Scenes data sets (Chess, Fire, Heads, Office, Pumpkin, RedKitchen, Stairs); the numeric entries are not recoverable from the extracted text.]
[Table V: results of PoseNet, ContextualNet, and the proposed method on the TUM RGB-D sequences (desk2, floor, room, 360_hemisphere, 360_kidnap, large_with_loop); the numeric entries are not recoverable from the extracted text.]

TABLE VI: Positional error with down-sampled training data.

                      100%    75%     50%     33%     25%     12%
Ours (Simple)         0.21m   0.21m   0.22m   0.24m   0.23m   0.25m
Ours (Complex)        0.79m   0.79m   0.83m   0.88m   0.97m   0.94m
PoseNet (Simple)      0.24m   0.25m   0.30m   0.31m   0.34m   0.39m
PoseNet (Complex)     3.58m   3.54m   4.66m   4.32m   4.73m   4.91m
V-C Experiment settings
The following settings were applied in the training process. In each epoch, 128 positive image pairs and the same number of negative pairs were randomly selected. The training minimized the contrastive loss (2) with a fixed margin m, and the weights were optimized using Adaptive moment estimation (Adam) [kingma:2014]. TensorFlow libraries [abadi2016tensorflow] were employed for training, and the model was trained on NVIDIA Titan X GPUs. The model was initialized using weights pre-trained on the Places [zhou:2014] data set. In the keyframe selection, the search range is set to 100 and the orientational threshold is set to 3 degrees. Seven keyframes are selected (3 backward and 3 forward, plus the aligned frame).
In the geometric locator, a fixed threshold on the reprojection error was used, and a short window of historical images was selected for the motion model. The proposed method was compared with PoseNet [kendall2015posenet], ContextualNet [patel2018contextualnet], Bayesian PoseNet [kendall2016modelling], and the geometry-aware DNN [tian20203d] on the 7Scenes data set. Note that the results for PoseNet, ContextualNet, Bayesian PoseNet, and the geometry-aware DNN [tian20203d] on 7Scenes were cited directly from the published papers. For the TUM and Isaac Sim data sets, we compared the results with PoseNet [kendall2015posenet] and ContextualNet [patel2018contextualnet]. The accuracy comparisons are presented in Tables III, IV, and V. Each image is processed individually (with several neighboring predictions), and sequential processing is not required.
We note that monocular SLAM cannot be implemented on the 3 data sets as the initialization fails due to the predominantly rotational motion. To compare with the prior-free BoW (the backbone of the relocalization module in the ORB-SLAM series), we implemented BoW for image alignment and adopted our proposed geometric locator for 6 DoF pose estimation; the results are presented in Table III. Moreover, in the first few starting frames, only the single image locator is used.
VI. Discussion and Limitations
VI-A Performance of the proposed method
The results of the Monte Carlo experiment listed in Table II show that (5) is accurate when the pose labels are noise-free; the accuracy of the estimated uncertainty deteriorates as the pose label noise increases. We could not evaluate the probabilistic motion model using the Monte Carlo tests, as its accuracy depends on the validity of the constant displacement and turning assumption.
Tables III, IV, and V demonstrate that our proposed method significantly outperforms the baseline methods on the synthetic, 7Scenes, and TUM data sets. This can partially be attributed to the abundant texture in all three data sets: adequate, accurate, and robust corner points contribute to the good performance of the method. More importantly, the inferior performance of the image-to-pose methods is due to the different viewing angles between the training and testing image sets. Fig. 5 shows several sample testing images and their visually similar correspondences in the training set, indicating that the same scenes were observed from different viewing angles. Our proposed method substantially outperforms image-to-pose fitting methods in these cases, thanks to the efficiency of BA even under the dissimilarity between the training and testing images; learning-based methods cannot achieve similar performance in this situation. The BA localization obtains an average residual of less than 1 pixel after the optimization.
The motion model further addresses the misclassification issues of the Siamese network, which are caused by blurry images and textureless regions. The uncertainty estimated by the motion model, together with that of the geometric locator, assists in filtering incorrect predictions and improves the overall accuracy.
Furthermore, our proposed method is more suitable for scenarios with little training data and abundant texture, whereas conventional learning-based methods rely heavily on the size of the training data. We tested PoseNet and our method with down-sampled training data, as listed in Table VI, which shows the feasibility of our approach with smaller data sets. Moreover, as Table V indicates, the deep CNN-based model is not well trained with limited data (540 and 1261 training samples for the "desk2" and "room" data sets, respectively). Taking the 540 training samples as an example, the Siamese training procedure can provide up to 540 × 539 / 2 ≈ 1.5 × 10⁵ image pairs (positive and negative). Furthermore, our method's image similarity task is easier to learn than image-to-pose fitting.
VI-B Accuracy of the image descriptor network
"Single image" results in Tables III, IV, and V validate the efficiency of the image descriptor network. The accuracy of the geometric locator heavily relies on the similarity alignment between the testing and training images: the geometric locator fails if the overlap between the two images is so small that there are not enough valid features to be tracked. We also measured the median positional and orientational distances between each testing image and its aligned training image for the six TUM sequences.
VI-C The benefit of introducing the geometric uncertainty
One critical drawback of deep image-to-pose fitting is that it relies on the domain coverage between the training and testing data. The Bayesian network [kendall2016modelling] estimates the uncertainty with its dropout-approximated framework and passively "provides a measure of similarity between a testing image and the data set's training image". Our method actively searches for key features to align the images even when the testing image and training image are observed from different viewing directions. Moreover, the aleatoric uncertainty from the reprojection errors geometrically depicts the confidence of the estimated pose well: it addresses the quality of the testing image and expresses geometric confidence. In the experiments, blurry images are assigned high uncertainty, while images with different viewing directions maintain high confidence. Wrongly classified images can also be identified when the prediction deviates significantly from the motion model prediction.
With the geometric uncertainty, the fusion module enhances the robustness of the localization. The motion model provides complementary short-term inertial information when the geometric method does not perform well (as indicated by the uncertainty). As demonstrated in the three experiments, the motion model cannot improve the performance when the geometric localization already works well; however, when the geometric locator performs poorly (e.g., "Office" and "Pumpkin" in 7Scenes and "Warehouse2" in the simulation), the motion model substantially improves the accuracy.
VI-D Limitations
The proposed method has two drawbacks. First, the geometric locator suffers from illumination changes: the matched corner points are unreliable if the training image and the aligned testing image have very different illumination. We observed satisfactory performance in indoor scenarios with relatively uniform illumination and poor performance in outdoor environments heavily influenced by sunlight. Second, the approach requires known camera intrinsic parameters. These two drawbacks indicate that our proposed method and the image-to-pose learning-based methods [kendall2015posenet, patel2018contextualnet] are complementary. Although both our proposed method and the work of tian20203d address the geometric constraint, experiments show that our method is more efficient given the correct camera parameters. It remains unclear which method is more suitable when the camera intrinsic parameters are unknown; this is an interesting direction for future work.
VII. Conclusions
We developed a geometry-aware image-based localization system for indoor environments using an image descriptor network, a geometric locator, a motion model, and uncertainty-based pose fusion. The proposed framework achieves higher accuracy and is well suited even for scenarios where the data sets are small. We extensively tested our method on synthetic and publicly available data sets and provided comparisons with other learning-based and feature-based methods.
In the future, we plan to extend the algorithm to localize any image with unknown intrinsic parameters. We will also explore developing a fully automatic localization failure detection mechanism.
Acknowledgment
Toyota Research Institute provided funds to support this work. Funding for M. Ghaffari was in part provided by NSF Award No. 2118818. We would also like to acknowledge FX Palo Alto Laboratory Inc. where this work initially started.