Fusing Convolutional Neural Network and Geometric Constraint for Image-based Indoor Localization

by   Jingwei Song, et al.
University of Michigan

This paper proposes a new image-based localization framework that explicitly localizes the camera/robot by fusing a Convolutional Neural Network (CNN) with the geometric constraints of sequential images. The camera is localized using a single or a few observed images and training images with 6-degree-of-freedom pose labels. A Siamese network structure is adopted to train an image descriptor network, and the visually most similar candidate image in the training set is retrieved to localize the testing image geometrically. Meanwhile, a probabilistic motion model predicts the pose based on a constant velocity assumption. The two estimated poses are finally fused using their uncertainties to yield an accurate pose prediction. This method leverages the geometric uncertainty and is applicable in indoor scenarios predominated by diffuse illumination. Experiments on simulated and real data sets demonstrate the efficiency of our proposed method. The results further show that combining the CNN-based framework with geometric constraints achieves better accuracy than CNN-only methods, especially when the training data size is small.






I Introduction

Image-based localization refers to retrieving the 6-Degree-of-Freedom (DoF) pose of a camera (monocular in most cases) from a single or a few observed images (testing) and previously pose-labeled images (training) [patel2018contextualnet]. Typical applications include pedestrian localization from one image taken with a mobile phone, or indoor localization of a service robot with a low-cost camera.

Contrary to prior-free monocular Simultaneous Localization And Mapping (SLAM), image-based localization can align the observed single/few images to the training images' coordinate frame and recover the pose globally. In monocular SLAM, the system can only recover the camera pose locally (relative to the starting pose) from a sequence of images, without the correct scale; the scale and pose in monocular SLAM suffer from drift [mur2017orb]. Table I summarizes the differences. Since image-based localization works with consumer-level monocular cameras, it provides a fast and low-cost solution for a wide variety of applications.

Fig. 1: The proposed method contains five major parts, colored in blue: the similarity-based image locator, the keyframe selection module, the geometric locator, the motion model, and the fusion module. The geometric locator and the motion model are two parallel processes, and the fusion module combines their two predictions. The images are processed in a non-sequential manner.
                    Monocular SLAM    Image-based localization
Prior data          N/A               Images and pose labels
Scale               No                Yes
Pose coordinate     Local             Global
Number of images    Sequential        Single or few
TABLE I: The table shows the differences between monocular image-based localization and monocular SLAM.

Previous research aligned the obtained image with the viewed scenery in the training set based on visual similarity (i.e., observed from a similar viewing angle). The similarity is encoded with feature extractors such as Speeded Up Robust Features (SURF) or Oriented FAST and Rotated BRIEF (ORB). Later, PoseNet [kendall2015posenet] and SCORE forest [shotton2013scene] showed that machine learning methods, especially deep Convolutional Neural Networks (CNNs), can map images to their 6-DoF poses. The trained deep CNN model was used to infer the pose of the testing image directly. Follow-up work developed different variants of PoseNet to enhance the performance of image-to-pose deep CNN models [wang2019atloc, weinzaepfel2019visual, valada2018deep, patel2018contextualnet, radwan2018vlocnet++, tian20203d].

To estimate the uncertainties of the image-to-pose deep CNN model, kendall2016modelling adopted a Bayesian deep CNN framework to quantify the estimated pose's uncertainty. It modeled uncertainty based on the coverage of the training data set, maintaining the assumption that the "model is more uncertain about images which are dissimilar to the training examples" [kendall2016modelling]. This epistemic uncertainty measures the output in a data-fitting manner. Similar studies can be found in [zhao2019generative, costante2020uncertainty, yang2020d3vo]. In the Bayesian deep CNN network, the pose uncertainty refers to the ambiguity caused by mismatched distributions between the training and testing samples, distinguishing itself from the traditional geometric uncertainty. Unlike the Bayesian deep CNN model, liu2018deep inferred the covariance of the measurement directly from raw sensory data. In contrast to these data-driven methods, the model-driven aleatoric uncertainty [hartley2003multiple, vega2013cello, ozog2013importance, polic2017camera, briskin2017estimating] addresses the geometric uncertainty. These geometric works followed the classical pinhole camera model and obtained the 6-DoF pose uncertainty propagated from the uncertainty in 2D tracking, measured as the sum of 2D re-projection residuals.

In general, deep CNN-based image-to-pose approaches are only effective if the images in the training and testing sets are visually similar; they cannot retrieve good results if the images are collected from very different perspectives. Moreover, the uncertainties in the deep CNN framework measure the similarity between the training and testing images, so the pose uncertainties obtained from the Bayesian network have a valid statistical meaning but no geometric interpretation. Our work exploits the pinhole camera projection model and is less dependent on the size of the training data. Moreover, benefiting from the geometric uncertainty quantification, we use a constant-velocity camera motion model to incorporate the information of consecutive images and gain robustness. By introducing the motion model, we overcome the issues caused by blurry images, which randomly yield inaccurate, far-out predictions. The pose predictions provided by the camera projection model and the motion model are further fused based on their uncertainties.

The contributions of this work are as follows.

  1. A novel image-based localization framework that explicitly localizes the robot by fusing deep CNN and sequential images’ geometric constraints.

  2. The predictions from the geometric locator and the motion model are fused by incorporating their respective uncertainties.

  3. Our proposed framework requires a relatively small training data set and can be easily deployed for indoor localization tasks.

II Related Work

Image-based localization can be categorized into single-image and sequential-image monocular localization.

Early research used monocular images captured by consumer-level cameras for image-based localization. The prior-free Bag of Words (BoW) algorithm [galvez2012bags] computes image descriptors that characterize the image appearance; due to its efficiency, BoW has been widely applied in SLAM systems for loop-closure detection [galvez2012bags]. Following BoW, other methods used SIFT probabilities [cummins2008fab], deep CNN frameworks [sunderhauf2015performance], and a contextual re-weighting network [kim2017learned] to generate image descriptors. All these methods only provide rough camera localization, not a precise 6-DoF pose.

PoseNet [kendall2015posenet] (monocular) and SCORE forest [shotton2013scene] (RGB-D) are learning-based approaches that map from image to 6-DoF pose, bridging the gap between the learning and geometric domains. wang2019atloc introduced an attention model for better feature extraction. weinzaepfel2019visual claimed the benefits of an intermediate objects-of-interest extraction and matching step. tian20203d trained a deep CNN model more efficiently by addressing 3D scene geometry-aware constraints in consecutive images. The idea proposed by tian20203d is similar to our approach in that geometry-aware constraints are used, via an implicit encoding in the training process. However, we explicitly convert the image-to-pose CNN into a hybrid of image-to-image matching and geometric localization. We also exploit the inter-image relationship by explicitly using a motion-model constraint.

Considering that short-term consecutive images may benefit the robustness and accuracy of image-to-pose fitting, further studies improved the deep CNN-based framework with sequential images. patel2018contextualnet proposed a Long Short-Term Memory (LSTM) network termed ContextualNet, which exploits the information within sequential images to enhance the performance of the deep CNN. VLocNet [valada2018deep] and VLocNet++ [radwan2018vlocnet++] integrated multiple deep CNNs to predict the 6-DoF pose, visual odometry, and semantic map segmentation simultaneously; this multi-task deep CNN strategy significantly enhanced the pose prediction. brahmbhatt2018geometry and yang2020d3vo integrated conventional SLAM techniques by feeding the redundant pose predictions of multiple images into pose-graph or Bundle Adjustment (BA) optimizations.

III Notation

Denote by (.)^∨ the isomorphism that converts an element of the Lie algebra se(3) to a 6-vector (i.e., a twist); conversely, (.)^∧ maps a twist back to se(3). For the testing image I_t, the goal of this work is to estimate the camera's pose T_t in SE(3); the associated covariance matrix is denoted Σ_t. We define the distance between two poses T_1 and T_2 as

d(T_1, T_2) = || log(T_1^{-1} T_2)^∨ ||,

where log(.) is the matrix (Lie) logarithm map. The misalignment angle between two orientations R_1 and R_2 is

θ(R_1, R_2) = arccos( (tr(R_1^T R_2) - 1) / 2 ),

where tr(.) yields the trace of the input matrix. Denote R(T) and p(T) as the orientation and position of the pose T.
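The two metrics above can be sketched in a few lines of NumPy/SciPy; the 4x4 homogeneous representation of poses and the function names are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.linalg import logm

def pose_distance(T1, T2):
    """d(T1, T2) = || log(T1^{-1} T2)^vee || for 4x4 homogeneous poses."""
    xi_hat = np.real(logm(np.linalg.inv(T1) @ T2))   # element of se(3), 4x4
    rho = xi_hat[:3, 3]                              # translational twist part
    phi = np.array([xi_hat[2, 1], xi_hat[0, 2], xi_hat[1, 0]])  # rotational part
    return float(np.linalg.norm(np.concatenate([rho, phi])))

def misalignment_angle(R1, R2):
    """arccos((tr(R1^T R2) - 1) / 2) between two rotation matrices."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(c, -1.0, 1.0)))
```

Clipping the cosine guards against values slightly outside [-1, 1] caused by floating-point round-off.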

IV Proposed Methodology

Fig. 1 illustrates the proposed framework, which consists of five modules: (a) aligning the single testing image to the most visually similar image in the training set; (b) selecting keyframes co-visible with that training image; (c) localizing the testing image geometrically using the selected co-visible images; (d) predicting the camera pose of the testing image with a motion model based on the constant velocity assumption, which accounts for the sequential predictions; (e) fusing the poses predicted by the geometric locator and the motion model while accounting for both uncertainties. If the image has no sequential prior predictions (i.e., in the first few starting frames), only the single-image locator is used.

IV-A Image descriptor network

The image descriptor network aims at estimating the camera pose by aligning the testing image with the most visually similar image in the training set, using the trained deep CNN model. We modify the end-to-end image-to-pose regression network PoseNet [kendall2015posenet] into an image-to-image similarity search for the following reasons. First, kendall2016modelling pointed out that the uncertainty of PoseNet is closely related to the similarity between the training and testing images; a Siamese network structure can compare image similarities explicitly and thus exploit this information better. Second, zagoruyko2015learning demonstrated that a Siamese network with a deep CNN backbone measures image similarity efficiently with limited training data. Lastly, BA addresses the inherent geometric constraint and provides an error-propagation formulation mathematically derived from the pinhole camera model.

Fig. 2 shows the structure of the Siamese network. Among off-the-shelf methods like [GalvezTRO12, schmitt2017openxbow] and deep CNN-based methods [wang2014learning, zagoruyko2015learning], GoogLeNet [szegedy2015going] is adopted as the base network since it works well for extracting image descriptors in PoseNet [kendall2015posenet]. The twin GoogLeNet models share the same weights. All input images are converted to grayscale, and each twin network takes one of the two gray images. The branches' outputs are concatenated and fed into a fully connected layer to generate a one-dimensional image descriptor. In the training process, thresholds on the pose difference are chosen to label the similarity of each image pair. We use the contrastive loss function [hadsell2006dimensionality] to measure the similarity between the encoded descriptors (2), where the two descriptor vectors are produced by the image descriptor networks and m in (2) is the margin. We build a K-D tree on the descriptors of the training images, which is then used to find the most similar training image for a given testing descriptor.
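The contrastive loss and the K-D-tree retrieval described above can be sketched as follows (function names and the default margin are illustrative assumptions; the loss form follows [hadsell2006dimensionality]):

```python
import numpy as np
from scipy.spatial import cKDTree

def contrastive_loss(f1, f2, similar, margin=1.0):
    """Contrastive loss: similar pairs (label 1) are pulled together,
    dissimilar pairs (label 0) are pushed apart until the margin is reached."""
    d = np.linalg.norm(np.asarray(f1) - np.asarray(f2))
    return float(similar * d**2 + (1 - similar) * max(0.0, margin - d)**2)

def build_retrieval_index(train_descriptors):
    """K-D tree over the training descriptors for nearest-neighbor retrieval."""
    return cKDTree(train_descriptors)

def most_similar(tree, query_descriptor):
    """Index of the visually most similar training image for a query descriptor."""
    _, idx = tree.query(query_descriptor)
    return int(idx)
```

Dissimilar pairs already farther apart than the margin contribute zero loss, so the network only spends capacity separating confusable pairs.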


IV-B Co-visible keyframes selection

After the testing image is aligned with the optimal training image, the optimal training image's co-visible keyframes are selected following two principles: (I) every two keyframes should have a suitable baseline to allow enough parallax for triangulation; (II) the orientational difference between the two images should be small so that there is enough overlap for triangulation. We follow [ondruvska2015mobilefusion] and modify the keyframe selection strategy by computing a score for each candidate pair. For two arbitrary keyframes with camera poses T_i and T_j (composed of orientations R_i, R_j and positions p_i, p_j), the score rewards baselines ||p_i - p_j|| close to the optimal baseline B and penalizes large orientation differences.
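The exact score from [ondruvska2015mobilefusion] is not reproduced above; a plausible form that peaks at the optimal baseline and gates on orientation difference can be sketched as follows (all parameter names and values here are assumptions, not the paper's):

```python
import numpy as np

def keyframe_score(p_i, R_i, p_j, R_j, b_opt=0.3, sigma_b=0.1, max_angle_deg=3.0):
    """Hypothetical pair score: maximal when the baseline ||p_i - p_j|| equals
    the optimal baseline b_opt, and zeroed when the orientation difference is
    too large for sufficient image overlap."""
    baseline = np.linalg.norm(np.asarray(p_i) - np.asarray(p_j))
    s_baseline = np.exp(-(baseline - b_opt) ** 2 / (2.0 * sigma_b ** 2))
    # Misalignment angle between the two orientations, in degrees.
    c = np.clip((np.trace(np.asarray(R_i).T @ np.asarray(R_j)) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(c))
    return float(s_baseline) if angle_deg <= max_angle_deg else 0.0
```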

Fig. 2: Illustrated is the structure of the Siamese network used for training the image descriptor network, which is a modified version of GoogLeNet.

The neighboring keyframes are selected by searching among the optimal training image's neighbors. Once the frame is aligned by the deep CNN, the backward and forward keyframe seeds are initialized at the aligned frame. The search in the backward (or forward) direction selects the keyframe with the highest score within the search range, and the selected keyframe becomes the new seed. This process is carried out iteratively in both directions until the joint keyframe group reaches the maximum number.

IV-C The single image locator

The geometric locator consists of training image sparse mapping (forward-intersection) and testing image localization (backward-intersection). It localizes a single image geometrically, and both modules are simplified from BA [hartley2003multiple].

Forward-intersection triangulates sparse 3D map points from the 2D tracking in the training set. The predicted most similar image and its neighbors from Section IV-B are adopted. As Fig. 3 shows, the predicted image, as well as its co-visible images (blue cameras in Fig. 3), are processed with SURF to extract and track 2D key corners. All selected images are paired and matched for reliable tracking, and the poses of the selected images are known from the training labels. Define the 2D point k tracked on image i as z_{i,k}; note that some points cannot be tracked on all images. Projecting the 3D point X_k in global coordinates onto the image with pose T_i gives the predicted pixel

Fig. 3: The figure shows the forward-intersection and the backward-intersection. The red camera (I) is the testing pose, while the three blue cameras (II, III and IV) are from the training set. Three 2D feature points are tracked on the training and testing images. The forward-intersection triangulates the global coordinates of the 3D feature points tracked in the training images using the labeled poses. After the forward-intersection, the backward-intersection estimates the pose of the testing image by minimizing the re-projection errors of the 3D triangulated feature points.

ẑ_{i,k} = π(K T_i X_k),

where π(.) is the 2D projection that maps the homogeneous point to pixel coordinates and K is the camera intrinsic matrix. Forward-intersection retrieves the 3D map points by minimizing the sum of re-projection residuals,

min over {X_k} of Σ_{i∈S} Σ_{k∈G_i} ρ( || z_{i,k} - π(K T_i X_k) ||² ),

where S is the set of all selected training images, G_i is the group of key points tracked on image i, and ρ is the Huber loss [huber1992robust, kummerle2011g], applied as a robustifier to handle outliers. All tracked points are treated with equal weights. Moreover, triangulated points with an average residual greater than a threshold are deleted.
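The forward-intersection step can be illustrated with a linear (DLT) triangulation, a simpler stand-in for the full Huber-robustified least squares described above; the `project` helper and the world-to-camera pose convention are illustrative assumptions:

```python
import numpy as np

def project(T, K, X):
    """Project world point X through world-to-camera pose T (x_cam = T @ x_world)."""
    x_cam = (T @ np.append(X, 1.0))[:3]
    uvw = K @ x_cam
    return uvw[:2] / uvw[2]

def triangulate_point(poses, K, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    A = []
    for T, (u, v) in zip(poses, pixels):
        P = K @ T[:3, :]              # 3x4 projection matrix of this view
        A.append(u * P[2] - P[0])     # each observation contributes two linear rows
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Xh = Vt[-1]                       # null-space solution in homogeneous form
    return Xh[:3] / Xh[3]             # dehomogenize
```

In practice the DLT result would seed a nonlinear refinement with the Huber robustifier, as in the paper.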

Backward-intersection localizes the camera of the testing image with the triangulated 3D points. The 3D feature points are matched with the testing image I_t (red camera in Fig. 3). Define G_t as the group of tracked 2D points on the testing image. The pose T_t is estimated by minimizing the sum of re-projection residuals,

min over T_t of Σ_{k∈G_t} ρ( || z_{t,k} - π(K T_t X_k) ||² ),     (4)

where z_{t,k} is the k-th tracked point on the testing image I_t. Denote the sum of the residuals in (4) as r_t. Similar to forward-intersection, the Huber loss [huber1992robust, kummerle2011g] is applied, and all tracked points are treated with equal weights (the pixel variance is set to 1, i.e., an identity variance matrix). Combining the identity pixel-level variance matrix with the Huber loss factors, the information-matrix contribution of each map point can be obtained. Denote J as the right Jacobian of the per-pixel residuals in (4), following the definition in [barfoot2017state]; the information matrix of the pose is given by

H_t = Jᵀ J.

We follow [polic2017camera] to approximate the average residual variance as

ŝ² = r_t / (2 n_t - 6),

where n_t is the number of tracked points in G_t. With the information matrix H_t, the covariance matrix can be obtained as

Σ_t = ŝ² H_t^{-1}.
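A hedged sketch of this covariance computation: the information matrix is formed from the stacked residual Jacobian, and the residual variance is estimated with a 2n - 6 degrees-of-freedom correction (observations minus pose parameters; an assumption here, not necessarily the exact scaling of [polic2017camera]):

```python
import numpy as np

def pose_covariance(J, residuals):
    """Approximate 6x6 pose covariance from the stacked reprojection-residual
    Jacobian J (shape 2n x 6) and the residual vector at the optimum."""
    info = J.T @ J                                   # information matrix, unit pixel variance
    dof = max(residuals.size - 6, 1)                 # 2n observations minus 6 pose parameters
    sigma2 = float(residuals @ residuals) / dof      # estimated residual variance
    return sigma2 * np.linalg.inv(info)
```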


IV-D Probabilistic motion model

As patel2018contextualnet pointed out, single-image-based localization suffers from predictions that deviate significantly from consecutive predictions. Sequential prediction can alleviate this issue as it uses information from multiple images. Researchers have developed different techniques that exploit sequential information, using both deep CNN-based methods [patel2018contextualnet, valada2018deep, radwan2018vlocnet++, tian20203d] and a geometric perspective [brahmbhatt2018geometry, yang2020d3vo]. We applied a short-term robot motion model [thrun2002probabilistic, barfoot2017state] to constrain the consecutive predictions. Short-term robot motion models are often categorized as Brownian, constant velocity, constant acceleration, and constant turn [thrun2002probabilistic, shi2002speed]. We employ a constant-velocity probabilistic motion model to fit the short-term robot motion. The motion model is first fitted to the short-term historical predictions and then applied to predict the testing image's pose and its uncertainty. Following the constant motion interpolation (in SE(3)) in [barfoot2017state], we define the short-term constant motion and optimize an objective function over the window of the N most recent estimated poses, searching for the optimal constant motion (twist) that fits the consecutive predictions. After the estimation of the probabilistic motion model (6), the one-step prediction of the robot pose can be achieved.

The diagonal covariance matrix of the pose predicted by the probabilistic motion model is calculated statistically and describes the fitting accuracy of the model. This uncertainty only quantifies the short-term epistemic uncertainty and suffers from drift in the long run: if the motion-model uncertainties are propagated over time, they tend to become over-confident.
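A simplified constant-velocity predictor can be sketched by averaging the relative twist between consecutive historical poses and applying it once more; this is a sketch, not the paper's exact windowed estimator (6):

```python
import numpy as np
from scipy.linalg import logm, expm

def predict_next_pose(poses):
    """One-step constant-velocity prediction from a list of 4x4 historical poses:
    average the se(3) increments between consecutive poses, then apply the
    average increment to the most recent pose."""
    twists = [np.real(logm(np.linalg.inv(Ta) @ Tb))
              for Ta, Tb in zip(poses[:-1], poses[1:])]
    xi_hat = np.mean(twists, axis=0)     # averaged se(3) "velocity" (4x4 matrix)
    return poses[-1] @ expm(xi_hat)
```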

IV-E Fusion of the geometric locator and the motion model

The predictions of the geometric locator and the probabilistic motion model are fused with their quantified uncertainties following [barfoot2014associating], where the fusion weights are derived from the right Jacobians of the objective function (7) with respect to the two pose estimates, following the definition in [barfoot2017state]. To filter outliers from the single-image locator, isometric standard deviations for position and orientation are enforced: if the positional or orientational prediction of the geometric locator exceeds the corresponding bound, the prediction is identified as a failure and the motion-model pose prediction is used instead. The isometric standard deviations are two scalars, one representing the position's ellipsoid uncertainty and the other the orientational uncertainty.

The motion model provides an independent and cheap way to rectify outliers. Unlike the global positioning of the geometric locator, the probabilistic motion model retrieves the local pose from the estimated short-term historical poses, so its uncertainty is only valid in the short term. Assuming the error of the pose from the geometric locator follows a Gaussian distribution, if the motion-model prediction deviates from the geometric locator by more than a 3-σ bound, the prediction of the geometric locator is discarded; such a large difference indicates an unreasonably abrupt pose prediction. Other filtering methods, such as the extended Kalman filter, are unsuitable because the geometric locator can yield consecutive outliers.
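The uncertainty-weighted fusion and the 3-σ gate can be sketched in a common tangent space; this linearized form omits the right-Jacobian terms of the on-manifold fusion in [barfoot2014associating] and is an illustrative simplification:

```python
import numpy as np

def fuse_estimates(x1, S1, x2, S2):
    """Information-weighted fusion of two 6-DoF estimates expressed as twist
    coordinates in a common tangent space."""
    I1, I2 = np.linalg.inv(S1), np.linalg.inv(S2)
    S = np.linalg.inv(I1 + I2)              # fused covariance
    x = S @ (I1 @ x1 + I2 @ x2)             # information-weighted mean
    return x, S

def is_outlier(x_geo, x_motion, S_motion, k=3.0):
    """3-sigma gate: flag the geometric-locator estimate when it deviates from
    the motion-model prediction by more than k standard deviations per axis."""
    sd = np.sqrt(np.diag(S_motion))
    return bool(np.any(np.abs(x_geo - x_motion) > k * sd))
```

With equal covariances the fusion reduces to the arithmetic mean, and the fused covariance is halved, matching intuition for two independent measurements.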

V Experiments

We evaluated our method on three data sets (one synthetic and two real-world). The synthetic data set was generated in a virtual warehouse using the Isaac Sim [issacsdk] engine. The real-world public data sets used for our experiments were 7Scenes [shotton2013scene] and the TUM RGB-D data set [sturm12iros]. We also used a Monte-Carlo experiment to validate the uncertainty quantification proposed in (5).

V-A Monte-Carlo synthetic experiment

To validate the uncertainty quantification in (5), a Monte-Carlo experiment was conducted to test the geometric locator, following the scenario in Fig. 3. 3D points were partially projected onto the images of the 4 camera poses and the target camera, and i.i.d. Gaussian noise was added to all 2D tracked points. The coverage rate is defined as the fraction of trials in which the difference between the estimated pose and the ground-truth pose falls within the 1-σ and 2-σ bounds predicted by (5). To further evaluate performance in the presence of noise in the pose labels, we corrupted the pose labels with simulated noise to replicate a more realistic scenario. Table II shows the coverage rates under different tracking noise levels and different pose-label noises. For a Gaussian error, the ideal coverage rates for the 1-σ and 2-σ bounds are 68.3% and 95.4%, respectively.
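The coverage-rate statistic can be sketched as follows (scalar per-trial errors are an illustrative simplification of the full 6-DoF check):

```python
import numpy as np

def coverage_rate(errors, sigmas, k=2.0):
    """Fraction of Monte-Carlo trials whose pose error lies within k predicted
    standard deviations of zero."""
    errors, sigmas = np.asarray(errors), np.asarray(sigmas)
    return float(np.mean(np.abs(errors) <= k * sigmas))
```

If the estimator's covariance is well calibrated, the empirical rate should approach the Gaussian ideal for the chosen k.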

TABLE II: The table shows the Monte-Carlo experiments with 1000 trials: coverage rates (within the 1-σ and 2-σ bounds) for different levels of tracking noise and for isometric noises added to the pose labels (column headers give the standard deviations of the noises).

V-B Data set

Our method was tested on the synthetic data set generated using the Isaac Sim engine [issacsdk], which is photorealistic and physically accurate. The monocular images captured by the camera mounted on the robot were logged along with the corresponding poses while the robot maneuvered in the warehouse. Two data sets were generated in two virtual environments with different levels of complexity. Fig. 4 shows sample images.

Fig. 4: Shown are examples from the synthetic data set generated with the Isaac Sim simulator. The upper two and lower two images were taken under diffuse lighting and indoor lighting, respectively.

The proposed approach was also tested on the small-range 7Scenes [shotton2013scene] indoor data set, which was collected with a handheld Kinect RGB-D camera at a resolution of 640x480. The pose labels were generated by running the KinectFusion [newcombe2011kinectfusion] SLAM system on the RGB-D images. The monocular images contain motion blur, flat surfaces, and various lighting conditions. The TUM RGB-D data set [sturm12iros], which was designed to test the performance of RGB-D SLAM algorithms, was also adopted. In the TUM RGB-D data set, a motion capture system was used to accurately track reflective markers on the Kinect camera, and the intrinsic parameters were calibrated. Therefore, the accuracy of the pose labels and intrinsic parameters is substantially better than for 7Scenes. The images are 640x480 and contain apparent challenges, including motion blur, flat surfaces, and various lighting conditions. We chose the last 100 images as the testing data set and the remaining images as the training data set. The sequences "freiburg1 desk2", "freiburg1 floor", "freiburg1 room", "freiburg2 360 hemisphere", "freiburg2 360 kidnap", and "freiburg2 large with loop" were chosen since most of the last 100 frames revisit places seen in the training set. We only used the RGB images and the pose labels.

Fig. 5: (a,c,e,g) are sample testing images from the TUM data sets "desk2", "floor", "room" and "large_with_loop". (b,d,f,h) are the corresponding most visually similar images in the training data set predicted by the image descriptor network.
TABLE III: The table shows the median localization errors on the synthetic data set generated using Isaac Sim, comparing BoW+Locator, PoseNet, ContextualNet, and our method (single-image and multiple-image). "Warehouse1" has normal illumination, and "Warehouse2" is in dark illumination. "BoW+Locator" is the integration of BoW with our geometric locator.
TABLE IV: The table shows the median localization errors of SCORE forest [shotton2013scene], PoseNet [kendall2015posenet], Bayesian PoseNet [kendall2016modelling], ContextualNet [patel2018contextualnet], the 3D geometry-aware network [tian20203d], and our method (single-image and multiple-image) on the seven 7Scenes sequences (Chess, Fire, Heads, Office, Pumpkin, Red Kitchen, Stairs).
TABLE V: The table shows the median localization errors on the TUM data set, comparing PoseNet, ContextualNet, and our method (single-image and multiple-image).
                    100%    75%     50%     33%     25%     12%
Ours (Simple)       0.21m   0.21m   0.22m   0.24m   0.23m   0.25m
Ours (Complex)      0.79m   0.79m   0.83m   0.88m   0.97m   0.94m
PoseNet (Simple)    0.24m   0.25m   0.30m   0.31m   0.34m   0.39m
PoseNet (Complex)   3.58m   3.54m   4.66m   4.32m   4.73m   4.91m
TABLE VI: The table compares our method (single image) and PoseNet on the downsampled synthetic warehouse data set (columns give the downsampling ratio of the training data). Values are median errors.

V-C Experiment setting

The following settings were applied in the training process. In each epoch, 128 positive image pairs and the same number of negative pairs were randomly selected. The training minimized the contrastive loss, and the weights were optimized using Adaptive moment estimation (Adam) [kingma:2014]. TensorFlow [abadi2016tensorflow] was employed for training, and the model was trained on NVIDIA Titan X GPUs. The model was initialized with weights pretrained on the Places [zhou:2014] data set.

In the keyframe selection, the search range is set to 100 frames and the orientation threshold to 3 degrees. Seven keyframes are selected (the aligned frame plus 3 backward and 3 forward).

In the geometric locator, a threshold on the re-projection error was applied, and a window of historical images was used for the motion model. The proposed method was compared with PoseNet [kendall2015posenet], ContextualNet [patel2018contextualnet], Bayesian PoseNet [kendall2016modelling], and the geometry-aware DNN [tian20203d] on the 7Scenes data. It should be noted that the results for PoseNet, ContextualNet, Bayesian PoseNet, and the geometry-aware DNN [tian20203d] on the 7Scenes data set were cited directly from the published papers. For the TUM and Isaac Sim data sets, we compared against PoseNet [kendall2015posenet] and ContextualNet [patel2018contextualnet]. The accuracy comparisons are presented in Tables III, IV and V. Each image is processed individually (with several neighboring predictions), and sequential processing is not required.

We note that monocular SLAM cannot be implemented on the three data sets, as the initialization fails due to the predominantly rotational motion. To compare with the prior-free BoW (the backbone of the relocalization module in the ORB-SLAM series), we implemented BoW for image alignment and adopted our proposed geometric locator for 6-DoF pose estimation; the results are presented in Table III. Moreover, in the first few starting frames, only the single-image locator is used.

VI Discussion and Limitations

VI-A Performance of the proposed method

The results of the Monte-Carlo experiment listed in Table II show that (5) is accurate when the pose labels are noise-free. The accuracy of the estimated uncertainty deteriorates as the pose-label noise increases. We could not evaluate the probabilistic motion model with the Monte-Carlo tests, as its accuracy depends on the validity of the constant displacement and turning assumption.

Tables III, IV and V demonstrate that our proposed method significantly outperforms the baseline methods on the synthetic, 7Scenes, and TUM data sets. This can partially be attributed to the abundant texture in all three data sets: an adequate number of accurate and robust corner points contributes to the good performance of the method. More importantly, the inferior performance of the image-to-pose methods is due to the different viewing angles between the training and testing image sets. Fig. 5 shows several sample testing images and their visually similar correspondences in the training set; the figures indicate that the scenes were observed from different viewing angles. Our proposed method substantially outperforms image-to-pose fitting methods in these cases thanks to the efficiency of BA, which handles the dissimilarity between the training and testing images; learning-based methods cannot achieve similar performance in this situation. The BA localization obtains an average residual of less than 1 pixel after optimization.

The motion model further addresses the misclassification issues of the Siamese network, which are caused by blurry images and textureless regions. The uncertainty estimated by the motion model, along with the geometric locator, assists in filtering incorrect results and improving the overall results.

Further, our proposed method is best suited to scenarios with little training data and abundant texture. Conventional learning-based methods rely heavily on the size of the training data. We tested PoseNet and our method with downsampled training data sets, as listed in Table VI, which shows the feasibility of our approach with smaller data sets. Moreover, as Table V indicates, the deep CNN-based model is not well trained with limited data (540 and 1261 training samples for the "desk2" and "room" data sets, respectively). Taking the 540 training samples as an example, the Siamese training procedure can provide at most 540 x 539 / 2 = 145,530 samples (positive and negative pairs). Furthermore, the image-similarity task of our method's network is easier to learn than image-to-pose fitting.

VI-B Accuracy of the image descriptor network

"Single image" results in Tables III, IV and V validate the efficiency of the image descriptor network. The accuracy of the geometric locator relies heavily on the alignment of similar testing and training images; the geometric locator fails if the overlap between the two images is so small that there are not enough valid features to track. We measured the median position and orientation distances between each testing image and its aligned training image on the six TUM sequences.

VI-C The benefit of introducing the geometric uncertainty

One critical drawback of deep image-to-pose fitting is that it relies on the domain coverage between the training and testing data. The Bayesian network [kendall2016modelling] estimates the uncertainty with its dropout-approximated framework and passively "provides a measure of similarity between a testing image and the data set's training image". Our method actively searches for key features to align the images when the testing and training images are observed from different viewing directions. Moreover, the aleatoric uncertainty from the re-projection errors geometrically depicts the confidence of the estimated pose. The uncertainty reflects the quality of the testing image and expresses geometric confidence: in the experiments, blurry images are assigned high uncertainty, while images taken from different viewing directions maintain high confidence. Wrongly classified images can also be identified when the prediction deviates significantly from the motion-model prediction.

With the geometric uncertainty, the fusion module enhances the robustness of the localization. The motion model provides complementary short-term inertial information when the geometric method does not perform well (as indicated by the uncertainty). As demonstrated in the three experiments, the motion model does not improve the performance when the geometric localization already works well. However, when the geometric locator performs poorly (e.g., "Office" and "Pumpkin" in 7Scenes, and "Warehouse2" in simulation), the motion model substantially improves the accuracy.

VI-D Limitations

The proposed method has two drawbacks. First, the geometric locator suffers from illumination changes: the matched corner points are unreliable if the training image and the aligned testing image have very different illumination. We observed satisfactory performance in indoor scenarios with relatively uniform illumination and poor performance in outdoor environments heavily influenced by sunlight. Second, the approach requires known camera intrinsic parameters. These two drawbacks indicate that our proposed method and the image-to-pose learning-based methods [kendall2015posenet, patel2018contextualnet] are complementary. Although both our proposed method and the work of tian20203d address the geometric constraint, experiments show that our method is more effective given the correct camera parameters. It remains unclear which method is more suitable when the camera intrinsic parameters are unknown; this is an interesting direction for future work.

VII Conclusions

We developed a geometry-aware image-based localization system for indoor environments using an image descriptor network, a geometric locator, a motion model, and uncertainty-based pose fusion. The proposed framework achieves higher accuracy and is well suited even for scenarios where the training data sets are small. We extensively tested our method on synthetic and publicly available data sets and provided comparisons with other learning- and feature-based methods.

In the future, we plan to extend the algorithm to localize any image with unknown intrinsic parameters. We will also explore developing a fully automatic localization failure detection mechanism.


Toyota Research Institute provided funds to support this work. Funding for M. Ghaffari was in part provided by NSF Award No. 2118818. We would also like to acknowledge FX Palo Alto Laboratory Inc. where this work initially started.