Improving the generalization of network based relative pose regression: dimension reduction as a regularizer

10/24/2020 ∙ by Xiaqing Ding, et al. ∙ Zhejiang University

Visual localization plays an important role in many areas such as Augmented Reality, robotics and 3D reconstruction. State-of-the-art visual localization methods perform pose estimation using geometry based solvers within the RANSAC framework. However, these methods require accurate pixel-level matching at high image resolution, which is hard to satisfy under significant changes in appearance, dynamics or viewpoint. End-to-end learning based regression networks circumvent the requirement for precise pixel-level correspondences, but generalize poorly across scenes. In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from the absolute image feature values, and apply dimension regularization on both the correlation feature channel and the image scale to further improve generalization and robustness to large viewpoint change. We implement this dimension regularization strategy within a two-layer pyramid based framework to regress the localization results from coarse to fine. In addition, depth information is fused for absolute translational scale recovery. Through experiments on real-world RGBD datasets we validate the effectiveness of our design in improving both generalization performance and robustness to viewpoint change, and also show the potential of regression based visual localization networks in challenging situations that are difficult for geometry based visual localization methods.




I Introduction

Visual localization provides accurate orientation and position information for many applications such as Augmented Reality [18, 23, 27] and mobile robots [39, 33]. State-of-the-art visual localization systems usually contain four sequential modules: image retrieval [1, 41], feature extraction and description [30, 10], feature matching [32] and pose estimation [20, 14]. Each module in these modular pipeline based localization (MPL) methods has been investigated for many years, with solutions ranging from traditional to learning based. Current research suggests that in many cases learning based methods perform better in the first three modules, while for pose estimation traditional geometry based methods under the RANSAC framework [20, 14, 31] still hold the advantage in terms of generalization and precision.

In environments with little appearance change and sufficient texture, geometry based MPL can deliver superior localization performance. However, if the appearance changes significantly, or a majority of the view is textureless, those methods are prone to failure due to an insufficient number of matching inliers. Much research is devoted to improving the precision and robustness of detecting pixel-level correspondences between images [42, 32]. But in these situations usually only coarse correspondences can be determined, and it is hard to find precise matches at high resolution even with human annotation.

However, is accurate pixel-to-pixel image correspondence really necessary for pose estimation? End-to-end learning based visual localization methods offer a promising way to bypass it. Depending on the coordinate frame of the estimated pose, end-to-end localization can be categorized into absolute pose regression (APR) and relative pose regression (RPR) methods. APR methods directly regress the global pose of the query image [16, 15] or the global 3D points for pixels on the query image [3, 4, 5, 21]. Though some of these methods can achieve higher localization accuracy than geometry based MPL solutions [3, 4, 5, 21], APR methods cannot generalize to unseen scenes, as the scene-specific information is encoded within the models.

Instead of regressing variables in the global coordinate frame directly, relative pose regression (RPR) based methods regress the relative pose between two images [19, 2, 11, 46], which can be considered a combination of the last three modules in MPL. Combined with image retrieval, RPR methods can therefore achieve global localization without encoding scene-specific geographic information within the network, and thus have the potential to generalize.

Unfortunately, many current RPR based methods generalize poorly to unseen scenes, as shown by the experimental results in [46]. Compared with the MPL pipeline, the matching and pose estimation processes are coupled in many RPR networks that regress the relative pose directly from the concatenation of the input image feature pair [19, 2, 11], which ties the regression results to an enormous and scene-specific feature space. The method in [46] can be considered as explicitly including the matching process within the RPR pipeline by adding a learnable Neighborhood Consensus (NC) matching layer [29] before regressing the pose. The matching layer outputs a score map that contains all pairwise feature correlation scores between the two input images. In this way the regression result depends on the correlation between image features, which isolates the regression layer from the absolute feature values that vary across scenes. However, the generalization performance is still not acceptable, and the authors therefore infer that implicit feature matching cannot be correctly learned within the RPR network [46].

In this paper, we argue that implicit feature matching can be handled within the RPR network to boost generalization, and propose a novel framework to improve the performance of RPR methods. As pose information is the only supervision during training, it is difficult to sufficiently constrain the network to learn both matching and pose regression given the huge input dimension. We therefore apply regularization that reduces the dimensions of both the image scale and the feature correlations, which places additional constraints on the network. To do this, besides adding a matching layer [29] to explicitly calculate correlation information, we add a convolutional neural network (CNN) with a bottleneck structure to regularize the feature correlations, and implement this new structure within a two-layer pyramid based framework that regresses the relative pose from coarse to fine at low resolution with a large receptive field, which further reduces the input dimension for regression. Moreover, the depth image is concatenated with the regularized feature correlation to recover the absolute scale of the regressed pose, as shown in Fig. 1. We implement RPR networks with different regression structures within this two-layer framework and compare their performance on public indoor RGBD datasets. Through experiments we validate the effectiveness of both the implicit matching layer and dimension regularization for improving generalization, as well as that of depth fusion for scale recovery. In addition, the structure with correlation feature regularization shows superior performance under large viewpoint changes. The experimental results also demonstrate that in challenging changing environments learning based methods have more potential than state-of-the-art MPL methods with geometry solvers.

II Related Work

In this section we review visual localization works related to geometry based and learning based solutions. For a more complete review of this area, we recommend the survey [8].

Fig. 1: This figure demonstrates the whole pipeline of our visual localization framework. The left part of the figure shows our proposed two-layer relative pose regression network. We draw the detail about the regularization based pose regression layer (MotionNet) within the blue dotted box on the right.

II-A Geometry based visual localization

Geometry based visual localization usually solves for the image pose given matches between 2D keypoints and 3D map points using a RANSAC [12] based Perspective-n-Point (PnP) solver [13, 20, 14]. The matches can be computed by nearest neighbor search according to the distance between feature descriptors on the query image and on the image retrieved from the map. Recently, some learning based methods also compute the matches using CNNs [29, 28] or Graph Neural Networks (GNN)


II-B Learning for absolute pose estimation

Learning based methods for absolute pose estimation encode the environmental information within the network parameters during training. Given a query image, some of these works directly output the global pose w.r.t. the map. PoseNet [16] is the first end-to-end network, modified from GoogLeNet [38], to regress the translation and the quaternion-represented rotation of the input image, and many following works [15, 24, 43] are designed to improve its performance.

Regressing the pose directly end-to-end is efficient but less precise. Scene coordinate regression based visual localization [3, 21] takes another route to pose estimation. Instead of directly regressing the image pose, these methods regress the global 3D location of each pixel on the query image. The method in [3] uses a CNN for scene coordinate regression, and the resulting 3D-2D matches are forwarded into a novel differentiable RANSAC to achieve end-to-end training. This method can exceed traditional geometry based methods in localization accuracy, but is only shown to be effective in small environments. Many following methods improve its performance by adding reprojection [4] and multi-view geometry constraints [7]. To extend these methods to large and ambiguous environments, some works leverage a hierarchical network structure to regress the 3D scene coordinates from coarse to fine [21], or integrate DSAC within a Mixture of Experts (MoE) framework [5]. However, as the map information is encoded within the parameters, these methods cannot be generalized to unseen scenes.

II-C Learning for relative pose estimation

Learning the relative pose between two images is a more general solution for global localization. The reference map images are retrieved by pre-trained networks [46] or by networks that are jointly trained with the following RPR parts [2, 11]. The RPR problem is also studied in some visual odometry works [45, 44], but in the context of localization the photometric consistency is usually broken.

In many RPR networks the depth information is not utilized for localization, thus the estimated translation is up-to-scale, and absolute localization results have to be recovered by RANSAC based triangulation [46, 19]. In this paper we fuse the depth information within regression and show its ability to recover pose with scale.

III Method

Our goal is to achieve robust visual localization given the current query image and a map constructed from RGB images with corresponding depth. To achieve this, a two-stage visual localization pipeline first retrieves the top-ranked images from the map, then estimates the relative pose between the query image and each retrieved RGBD image for global localization.

III-A Image Retrieval from the Map

We take advantage of the success of visual place recognition technologies [41, 1, 40] and use NetVLAD [1] to extract the global image descriptor of each map image offline. During localization, we extract the global image descriptor of the query image and find the nearest map images according to the Euclidean distance between image descriptors. As the global pose of each retrieved map image is known, the global pose of the query image can be calculated by combining it with the estimated relative transformation between the query image and that map image.

In the following sections we introduce our regularization based network designed to calculate the relative transformation, and a validation method used to select the best regressed result among the estimated poses.
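The pose composition described above can be illustrated with a minimal sketch. It assumes poses are stored as 4×4 homogeneous matrices and that the relative transformation maps query-frame coordinates into the map image frame; both the convention and the names `T_world_map` and `T_map_query` are illustrative assumptions, not notation from the paper.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def compose_global_pose(T_world_map, T_map_query):
    """Global query pose = known map image pose composed with the relative pose.

    Assumed convention: T_a_b maps coordinates from frame b into frame a.
    """
    return matmul4(T_world_map, T_map_query)


# Toy example: the map camera sits 2 m along x in the world frame,
# and the query camera is 0.5 m further along x relative to the map camera.
T_world_map = [[1, 0, 0, 2.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1]]
T_map_query = [[1, 0, 0, 0.5], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1]]
T_world_query = compose_global_pose(T_world_map, T_map_query)
```

With the opposite convention for the relative pose, the second factor would be replaced by its inverse.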

III-B Regularization based Relative Pose Regression

III-B1 Motivation

Many RPR methods use the concatenation of two CNN features as input to the following fully connected (FC) layer for regression [19, 2, 11]. In this way, the output of the FC layer depends on the absolute values of the feature pair. Consider two images of the same content: if some patches of both images change simultaneously, their feature pair changes accordingly while the regressed pose should not. This makes learning difficult, as an enormous number of input features correspond to the same output. Furthermore, the values of the image features are scene-specific, which makes it difficult for the network to localize in unseen environments. On the other hand, the correlation score between two features only depends on their difference, which should stay stable as long as the feature descriptors are consistent. Thus explicitly adding the matching layer within RPR networks can largely reduce the complexity of the pose regression problem, and endow the network with better generalization ability.

Traditionally the correlation volume contains the entire matching information between pixel pairs from two images. When dynamics are dominant or the viewpoints differ greatly, the valid overlap between the two images is limited and the correlation volume is mostly filled with confusing information. Unlike traditional methods that extract pixel-to-pixel correspondences to satisfy geometry based pose solvers [29, 46, 28], we extract the matching information implicitly by regularizing the correlation volume with a bottleneck structured CNN for dimension reduction, as shown in Fig. 1. Circumventing pixel-level correspondence for pose estimation brings two benefits: i) No pixel-level supervision is required, so the training data is easier to obtain. ii) There is no need to find correspondences at high resolution as required by geometry based pose solvers, so an effective global correlation can be computed at restricted image resolution with a large receptive field, leading to more robust patch-wise matching.
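As a rough illustration of what a correlation volume contains, the sketch below computes all pairwise scalar products between the descriptors of two small feature maps. The L2 normalization of descriptors and the function names are assumptions made for this example; the paper only states that scalar products between feature pixels are used.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit length (assumed preprocessing step)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def correlation_volume(feat_a, feat_b):
    """Pairwise scalar products between two H x W grids of C-dim descriptors.

    Returns a (H*W) x (H*W) matrix: row i holds the scores of pixel i of
    feat_a against every pixel of feat_b (row-major pixel order).
    """
    flat_a = [l2_normalize(v) for row in feat_a for v in row]
    flat_b = [l2_normalize(v) for row in feat_b for v in row]
    return [[sum(x * y for x, y in zip(a, b)) for b in flat_b] for a in flat_a]


# Toy 1x2 feature maps with 2 channels: the content of the two pixels is swapped.
feat_a = [[[1.0, 0.0], [0.0, 2.0]]]
feat_b = [[[0.0, 3.0], [4.0, 0.0]]]
corr = correlation_volume(feat_a, feat_b)
```

The size of this matrix grows with the square of the number of feature pixels, which is one reason why working at low feature resolution and compressing the correlation channels keeps the regression input small.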

III-B2 Network Architecture

The details of the regularization based RPR network are shown in Fig. 1. We use the pre-trained VGG16 [36] network for feature extraction and truncate it at the last pooling layer [22, 42]. The input image is resized to a fixed resolution before being fed into the network. We only use the last two layers of the features computed by VGG16, with resolutions 16×16 and 32×32, for the following two-stage pose estimation.

In the first layer, the global correlation between the coarse features of the query image and the map image is first computed as the scalar product between pixel pairs of the two feature maps, and is then forwarded into the NC matching layer [29] to enforce geometric consistency. The following MotionNet module takes the output score map and regresses the initial relative pose. Unlike other works that represent the pose with a 3D translation and a 3D/4D rotation vector [19, 11], we represent the pose with a 9D vector composed of a 3D translation and a 6D rotation [47]. A mapping is adopted to transform the rotation vector into a conventional rotation matrix.

Combined with the translation, this induces a mapping from our 9D vector to the standard Euclidean transformation.
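A minimal sketch of such a mapping is given below. It assumes that the six rotation entries are two 3D vectors orthonormalized by Gram-Schmidt and completed with a cross product, which is a common construction for continuous 6D rotation representations; the ordering of the 9D vector (translation first) and the function names are assumptions for illustration only.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_from_6d(r6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    d = _dot(b1, a2)
    b2 = _normalize([y - d * x for x, y in zip(b1, a2)])  # remove b1 component
    b3 = _cross(b1, b2)                                   # right-handed third axis
    return [[b1[i], b2[i], b3[i]] for i in range(3)]      # b1, b2, b3 as columns


def pose_from_9d(p9):
    """Assumed layout: p9 = [tx, ty, tz, r1..r6]; returns a 4x4 transform."""
    t, r6 = p9[:3], p9[3:]
    R = rotation_from_6d(r6)
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# The two raw 3D vectors need not be unit length or orthogonal.
T = pose_from_9d([0.1, 0.2, 0.3, 2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
```

Because any pair of non-parallel vectors maps to a valid rotation, the network output needs no explicit orthogonality constraint.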


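We use the depth image to calculate the rigid flow between the two images induced by the regressed pose. Each pixel is back-projected to a 3D point using its depth value and the camera intrinsic matrix, the 3D point is transformed by the estimated relative pose, and the result is projected back onto the image plane with the projection function. The rigid flow of a pixel is the displacement between its projected and original positions.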
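The rigid flow of a single pixel can be sketched as follows, assuming a pinhole camera model with a 3×3 intrinsic matrix and a 4×4 relative transform. The function name, the argument layout, and the direction of the transform are illustrative assumptions rather than the paper's notation.

```python
def rigid_flow(u, v, depth, K, T):
    """Displacement of pixel (u, v) with known depth under rigid transform T.

    K is a 3x3 pinhole intrinsic matrix, T a 4x4 homogeneous transform that
    maps 3D points from the source camera frame to the target camera frame.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    # Back-project the pixel to a 3D point in the source camera frame.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the target camera frame.
    Y = [sum(T[i][k] * X[k] for k in range(3)) + T[i][3] for i in range(3)]
    # Project back to pixel coordinates (assumes Y[2] > 0).
    u2 = fx * Y[0] / Y[2] + cx
    v2 = fy * Y[1] / Y[2] + cy
    return u2 - u, v2 - v


K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
du, dv = rigid_flow(50.0, 50.0, 2.0, K, T)  # principal-point pixel, 2 m away
```

In this toy case a 0.5 m sideways shift of a point 2 m away moves its projection by 100 × 0.5 / 2 = 25 pixels horizontally, which shows how the depth value fixes the metric scale of the flow.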

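In the second layer, we use the rigid flow to warp the finer feature map of one image and compute its correlation with the finer feature map of the other image. In this layer only the local correlation within neighboring pixels is searched, to refine the matching information calculated in the first layer. These correlation results are forwarded to regress a relative pose as refinement, and the final estimated pose in the second layer combines the initial pose with this refinement.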


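The only supervision during training is the ground-truth relative pose between the query image and the map image. The pose estimation error in each layer is calculated from the difference between the estimated and ground-truth poses.

We convert the rotational error into an angular value, and the total loss is defined as a weighted combination of the error terms, in which the weights balance the corresponding terms.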
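One plausible form of such a loss is sketched below. It assumes the translational error is a Euclidean distance, the rotational error is the geodesic angle between rotation matrices (in radians), and the per-layer terms are summed with two hypothetical weights `w_t` and `w_r`; the exact formulation and weight values of the paper are not reproduced here.

```python
import math


def rotation_angle(R_est, R_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the element-wise product summed over all entries.
    tr = sum(R_est[k][i] * R_gt[k][i] for k in range(3) for i in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against rounding errors
    return math.acos(c)


def translation_error(t_est, t_gt):
    """Euclidean distance between two translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def pose_loss(layer_outputs, R_gt, t_gt, w_t=1.0, w_r=1.0):
    """Sum of weighted errors over the outputs (R, t) of each pyramid layer.

    w_t and w_r are hypothetical weights, not values from the paper.
    """
    return sum(w_t * translation_error(t, t_gt) + w_r * rotation_angle(R, R_gt)
               for R, t in layer_outputs)


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = pose_loss([(Rz90, [0.3, 0.4, 0.0]), (I3, [0.0, 0.0, 0.0])], I3, [0.0, 0.0, 0.0])
```

Here the first layer output is off by 0.5 m and 90 degrees while the second is exact, so the loss evaluates to roughly 0.5 + π/2.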

III-B3 Detail of MotionNet

In this module the correlation information is regularized for pose regression. It takes the correlation volume from the global or local correlation module as input and uses a CNN with a bottleneck structure along the feature dimension, followed by an FC layer, to regress the corresponding relative pose, as shown in Fig. 1. The feature dimension of the score map is regularized into a compact form, to which the depth image at the same resolution is concatenated for scale recovery.

III-C Correlation based Pose Selection

After calculating the relative poses between the query image and the retrieved map images, we evaluate each result according to the correlation between the features of one image and the features of the other image warped by the rigid flow computed from the estimated pose. We apply softmax along the channels of the correlation results and count the number of vectors in which the highest correlation score is larger than a threshold. Only valid correlations, whose warped positions fall within the image, are counted. The pose of the image pair with the largest count is selected as the best regressed result.
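The selection rule above can be sketched as follows. Each candidate is assumed to carry one correlation vector per feature pixel plus a validity flag saying whether the warped position stays inside the image; the data layout, the names, and the threshold value of 0.5 are hypothetical choices for illustration.

```python
import math


def softmax(v):
    """Numerically stable softmax over one correlation vector."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def confident_match_count(corr_vectors, valid, threshold):
    """Count valid pixels whose peak softmax score exceeds the threshold."""
    return sum(1 for vec, ok in zip(corr_vectors, valid)
               if ok and max(softmax(vec)) > threshold)


def select_best_pose(candidates, threshold=0.5):
    """candidates: list of (pose, corr_vectors, valid); returns the best pose."""
    best = max(candidates,
               key=lambda c: confident_match_count(c[1], c[2], threshold))
    return best[0]


candidates = [
    ("A", [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [True, True]),    # one peaked vector
    ("B", [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]], [True, True]),    # two peaked vectors
    ("C", [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]], [False, False]),  # warped out of view
]
best = select_best_pose(candidates)
```

A peaked softmax response indicates that the warped feature agrees with one clear location in its neighborhood, so a higher count suggests a more consistent pose estimate.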

IV Experiments

In this section, we assess the performance of our proposed regularization based RPR framework on the standard 7Scenes [35] dataset for comparison with other networks, and also use a challenging public indoor dataset, OpenLoris-Scene [34], to investigate the potential of regression based methods in complex real-world environments compared with the geometry based MPL method [14].

IV-A Datasets and implementation details

OpenLoris-Scene [34] is a public indoor dataset designed to evaluate the performance of lifelong visual SLAM methods. It collects RGBD image sequences using a mobile robot in five scenes, each along various trajectories and under various conditions. The data includes significant changes in appearance, illumination and viewpoint, as well as textureless areas and blur, which makes it valuable for assessing the performance of vision based methods in real-world situations. We use the first three sequences in both the “Home” and “Office” scenes for training. Any two images from the same or different sequences are selected as a training image pair if their translation and orientation distances are within preset thresholds. We test the localization performance on the other sequences in the “Home” and “Office” scenes. To evaluate generalization, we also use the sequences in the “Cafe” scene for testing, in which the environmental appearance is entirely unseen in the training data.

7Scenes [35] contains RGBD images collected from 7 indoor rooms. We use the training data of all 7 scenes together for training and compare our testing results with other learning based pose estimation methods. We also evaluate cross-dataset generalization performance on the 7Scenes dataset for comparison.

The TUM-RGBD [37] dataset contains sequential RGBD images collected in different scenes. We want to evaluate cross-dataset generalization with models trained on the OpenLoris-Scene and 7Scenes datasets, but the training data in OpenLoris-Scene only contains planar motion. We therefore finetune the models on the TUM-RGBD dataset to supplement the degrees of freedom of motion missing from the OpenLoris-Scene training data.

Implementation details: The network is implemented in PyTorch and trained for at most 50 epochs with the weights of the VGG16 feature extraction layers fixed. We stop the training process early if the training loss does not decrease. All models are trained using the Adam optimizer [17], with the learning rate decayed every 10 epochs. The batch size is set to 6.

Network structure details: To validate the effectiveness of the matching layer and correlation regularization process, we implement three types of RPR networks within our proposed two-layer framework with difference in the input feature for pose regression:

  1. image feature concatenation (“feature-cat” in tables)

  2. correlation volume output from the NC matching layer (“score-map” in tables)

  3. correlation volume with dimension regularization (“score-map-dr2”, “score-map-dr3” and “score-map-dr4” in tables, where the trailing number denotes the channel count of the compact regularized feature)

For fair comparison all the three networks have CNN modules between the input features and the pooling layer within each MotionNet.

IV-B Localization performance

method             median error (deg/m)
sift+5p [46]       1.99/0.08
PoseNet [16]       10.44/0.44
MapNet [6]         7.78/0.21
DSAC++ [4]         1.10/0.04
RelativePN [19]    9.30/0.21
RelocNet [2]       6.74/0.21
NC-EssNet [46]     7.50/0.21
score-map-dr2      L1: 3.85/0.14    L2: 3.22/0.11
score-map-dr3      L1: 3.79/0.15    L2: 3.37/0.12
score-map-dr4      L1: 3.99/0.15    L2: 3.26/0.11
score-map          L1: 4.34/0.16    L2: 3.35/0.12
feature-cat        L1: 9.55/0.37    L2: 8.85/0.36
  • “L1” denotes the outputs from the first layer and “L2” denotes the outputs from the second layer in our framework

TABLE I: Visual localization results evaluated on the 7Scenes dataset.
method                       Home1-4         Home1-5         Office1-4       Office1-5       Office1-6       Office1-7       Cafe2-1
score-map-dr2 (gt select)    53.8/61.0/63.7  90.2/92.4/92.8  12.0/20.1/24.7  29.3/31.1/33.0  91.4/96.5/96.8  0/0.1/5.7       72.1/78.7/83.0
score-map-dr3 (gt select)    53.2/59.6/62.8  89.7/91.3/91.5  13.8/23.4/34.5  29.5/31.7/35.8  89.6/98.5/98.5  9.1/19.1/28.7   70.7/76.2/82.2
score-map-dr4 (gt select)    57.3/60.9/62.7  90.0/91.5/91.5  18.3/29.7/32.4  30.5/34.9/37.4  91.4/96.8/97.2  3.2/20.7/30.6   74.5/81.8/82.2
score-map (gt select)        58.0/63.6/64.4  91.8/92.3/92.3  2.1/7.4/13.9    27.6/28.9/30.5  90.5/99.4/99.6  0/0/0.1         72.2/80.4/82.4
feature-cat (gt select)      48.7/55.7/62.1  78.7/84.2/85.5  26.4/32.6/36.3  26.4/32.4/37.7  84.4/97.5/98.6  0/0.1/11.3      57.6/64.5/67.4
score-map-dr2 (corr select)  43.1/49.9/51.6  82.7/85.8/85.9  7.5/15.2/18.0   26.2/28.4/30.0  88.2/91.6/91.6  0/0/4.7         65.6/76.1/82.3
score-map-dr3 (corr select)  43.4/49.6/52.2  84.1/85.6/85.6  10.1/18.7/22.1  28.0/28.8/30.7  87.3/93.9/94.0  5.2/11.7/20.2   66.9/73.4/81.3
score-map-dr4 (corr select)  47.3/52.9/53.5  85.8/86.8/86.8  10.5/21.1/24.0  27.9/29.3/30.7  89.4/95.0/95.0  1.6/7.7/23.0    71.3/80.4/80.6
score-map (corr select)      50.1/57.2/58.1  86.6/86.9/86.9  1.1/4.4/8.3     27.3/27.7/28.4  87.0/91.3/91.3  0/0/0           69.4/78.0/81.9
feature-cat (corr select)    40.5/49.7/55.5  69.7/74.1/74.1  13.6/20.0/22.4  23.7/28.4/32.9  73.4/90.5/91.4  0/0/6.7         52.9/61.4/62.1
SuperPoint+2p [14]           38.3/48.5/53.6  84.2/86.0/86.0  15.3/20.9/20.9  39.2/39.2/39.3  100/100/100     0.1/0.5/0.5     55.3/79.6/85.9
  • The best results are marked as red, and the best results selected by groundtruth are marked as blue.

TABLE II: Visual localization results evaluated on the OpenLoris-Scene dataset. Each column lists the percentage of localization results satisfying three successive pairs of translational and rotational error thresholds.
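The three-value entries of this table can be computed as in the sketch below: for each threshold pair, the percentage of query images whose translational and rotational errors are both within the pair. The threshold values used in the example are placeholders chosen for illustration, not the ones used in the paper.

```python
def recall_at_thresholds(errors, thresholds):
    """Percentage of results within each (translation, rotation) threshold pair.

    errors:     list of (translation_error_m, rotation_error_deg) per query
    thresholds: list of (max_translation_m, max_rotation_deg) pairs
    """
    n = len(errors)
    return [100.0 * sum(1 for t, r in errors if t <= mt and r <= mr) / n
            for mt, mr in thresholds]


# Hypothetical errors of four queries and placeholder threshold pairs.
errors = [(0.05, 1.0), (0.3, 4.0), (0.8, 9.0), (3.0, 40.0)]
thresholds = [(0.1, 2.0), (0.5, 5.0), (1.0, 10.0)]
recalls = recall_at_thresholds(errors, thresholds)
```

Since each pair is looser than the previous one, the three percentages in a table entry can only stay equal or increase from left to right.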
method score-map-dr2 score-map-dr3 score-map-dr4 score-map feature-cat RelativePN [19] RelocNet[2]
training dataset TUM[37] TUM[37] TUM[37] TUM[37] TUM[37] University[19] ScanNet[9]
median error(deg/m) 6.37/0.24 5.84/0.23 5.89/0.23 5.71/0.21 10.5/0.30 18.37/0.36 11.29/0.29
TABLE III: Generalization results of different RPR methods evaluated on 7Scenes.

We first compare our methods with state-of-the-art methods in terms of localization accuracy by training and testing the models on the OpenLoris-Scene and 7Scenes datasets separately. As there is no visual localization benchmark on OpenLoris-Scene, we implement a 2p-RANSAC based MPL solution [14] with SuperPoint features [10] as the baseline. Note that as there is only planar motion in OpenLoris-Scene, the 2p-RANSAC based solution should outperform traditional EPnP [20] or 5p [25] based RANSAC solvers. We set the first sequences in “Home” and “Office” as the maps for each scene, and for images in the query sequences we use NetVLAD [1] to retrieve the top-5 images as reference images. For the 2p-RANSAC based solution we merge the features of all 5 retrieved images for pose estimation to improve robustness. For our RPR methods we regress the relative pose with each retrieved image separately and select one result according to the evaluation method in III-C (“corr select” in TABLE II). We also list the results selected by the groundtruth (“gt select” in TABLE II), which chooses the result with the smallest localization error and can be considered the best performance the RPR networks can achieve. For the 7Scenes dataset, we retrieve the top-1 image for RPR evaluation and compare the results with the other MPL [46] and learning based localization methods [46, 16, 19, 6, 2, 4]. The localization results are shown in TABLE I and TABLE II.

From the evaluation results in TABLE I, we can see that our proposed models implemented with matching layers (“score-map-dr2”, “score-map-dr3”, “score-map-dr4”, “score-map”) outperform the other listed RPR based methods and some of the APR based methods. Note that we also outperform the result in [46], which also leverages a matching layer for pose regression; we attribute this to the pyramid structure based image scale regularization and the combination with depth information. As there is little appearance change or dynamics within these environments, MPL and scene coordinate based methods [4] achieve superior localization precision because adequate pixel-level correspondences can be found, but our methods also show comparable performance. In TABLE I we list the outputs from both layers of our framework to demonstrate the effectiveness of the second layer in improving precision. In other experiments we only list the output of the second layer as the regression result.

Fig. 2: Two cases in which the SuperPoint+2p-RANSAC method fails but our proposed RPR methods succeed in pose estimation. In the second row we draw the matches computed from SuperPoint descriptors with yellow lines.
Fig. 3: Rotational errors of different methods evaluated in “Office1-7”. For better visualization we set the results of image pairs with an overlap ratio of less than 0.1 to 0. The two images correspond to the data indicated by the green arrow.
Fig. 4: The generalization results tested in “Cafe2-1” with different methods. The image pairs related to the red arrows are shown in the second row as a reference to indicate the environment.

TABLE II includes the localization results on the OpenLoris-Scene dataset. As we only use one sequence per scene as the map, some query images have no matching reference image, so for some sequences even the best performance cannot reach 100%. We can see that in most of the localization sequences our RPR based methods outperform the SuperPoint+2p-RANSAC method. We analyze the failure cases in scenes with large performance differences to find the strengths and weaknesses of each method. In the “Home1-4” scene, where RPR based methods outperform the RANSAC based method by a large margin, we find that many failure cases occur in places where the appearance changes significantly or many dynamic objects exist. Fig. 2 shows two cases in which the SuperPoint+2p-RANSAC method fails but RPR methods give good localization results. We can see that many matches are inaccurate, and some matched points belong to dynamic objects such as the curtain, which further degrades the pose estimation results. In the proposed RPR methods, by contrast, the image features have a large receptive field and global correlation information is leveraged, so accurate pose estimation can be achieved even though there are many dynamics in the views.

In “Office1-4” and “Office1-7”, we find that the RPR method without correlation regularization performs much worse than those with correlation regularization. In these two cases, the trajectories of the query sequences are almost opposite to the mapping trajectories, and some objects are removed across scenes. Thus many retrieved images have only little overlap with the query images. We draw the rotational errors of “score-map” and “score-map-dr4” tested in “Office1-7” along with the overlap ratio in Fig. 3. At the beginning of the trajectory, whose environment can be inferred from the images in the first row of Fig. 3, “score-map-dr4” performs markedly better under large viewpoint differences, which validates the effectiveness of our proposed correlation reduction process.

IV-C Generalization Study

method                           Office1-6       Cafe2-1
score-map-dr2 (7S, gt select)    87.7/90.5/90.8  64.6/74.6/77.2
score-map-dr3 (7S, gt select)    78.8/84.3/85.1  71.1/82.4/83.5
score-map-dr4 (7S, gt select)    78.8/84.3/85.1  65.5/70.3/71.0
score-map (7S, gt select)        74.2/81.5/81.5  67.7/72.5/73.2
feature-cat (7S, gt select)      55.8/75.3/77.3  22.2/41.0/47.2
score-map-dr2 (7S, corr select)  78.9/79.9/79.9  55.9/64.5/65.3
score-map-dr3 (7S, corr select)  64.0/67.9/67.9  58.0/76.2/77.3
score-map-dr4 (7S, corr select)  64.0/67.9/67.9  58.1/62.4/62.7
score-map (7S, corr select)      52.9/60.5/60.5  59.0/66.0/65.0
feature-cat (7S, corr select)    34.8/50.6/50.9  11.1/24.9/27.9
SuperPoint+2p [14]               100/100/100     55.3/79.6/85.9
  • The best results of RPR methods are marked as red and the best results selected by groundtruth are marked as blue.

TABLE IV: Generalization results evaluated on the OpenLoris-Scene dataset based on the models trained on 7Scenes. Each column lists the percentage of localization results satisfying three successive pairs of translational and rotational error thresholds.

In this subsection we study the generalization of our networks and evaluate the performance of different input feature structures for the regression layer. We reuse the models trained on the “Home” and “Office” scenes to test cross-scene generalization performance on the “Cafe” scene, and the results are listed in the last column of TABLE II. We then finetune these models on the TUM-RGBD dataset for 5 epochs to supplement the degrees of freedom of motion before applying them to the 7Scenes dataset to test cross-dataset generalization performance. Our results as well as the generalization results of other state-of-the-art methods are shown in TABLE III.

From the generalization results in TABLE II, we find that our RPR networks with a matching layer still outperform the SuperPoint+2p-RANSAC method at the first two error intervals, which validates the generalization ability of our RPR networks. In this case the result of the “feature-cat” network degrades considerably, which reflects the importance of the matching layer for improving generalization performance. We draw the localization errors of different methods in Fig. 4, and the results show that most of the failure cases of the SuperPoint+2p-RANSAC method are due to large translational errors. We select two places from these failure cases and show the image pairs in Fig. 4. In these cases, though the appearance change is not as significant as in “Home”, there are many dynamic objects and people in the scene, making it hard to find accurate pixel-level correspondences with reliable depth for pose estimation. This also reveals the advantage of regression based methods, which relax the requirement for pixel-to-pixel correspondence.

TABLE III shows the generalization performance of different methods tested on 7Scenes. The precision of our proposed methods with a matching layer even exceeds the results of some RPR methods listed in TABLE I. We also test the generalization performance of the models trained on 7Scenes when applied to OpenLoris-Scene. As there is no appearance change in the 7Scenes dataset, we only show the generalization results on “Office1-6” and “Cafe2-1”, in which the query and mapping trajectories are almost the same and only a small part of the scene is changed. The results are listed in TABLE IV. The experimental results show that even though both the sensors and the environments are changed, the performance degradation is not severe.

IV-D Ablation Study

Here we evaluate the effectiveness of depth fusion for scale recovery. To this end, we train the network without depth concatenation on the regularized feature and test its generalization performance on the “Cafe” dataset. Besides listing the results according to the thresholds in TABLE II, we also list in TABLE V the percentage of results whose angular error is smaller than a given threshold. The results show that rotation estimation performs similarly with and without depth concatenation, while the performance that takes translational error into account differs considerably, which demonstrates that depth concatenation within the model contributes to scale recovery.

              score-map-dr2    score-map        feature-cat      dr2-noDepth
median error  72.1/78.7/83.0   72.2/80.4/82.4   57.6/64.5/67.4   56.4/63.9/66.6
              84.1             83.8             68.3             83.8
TABLE V: Ablation results to validate the effectiveness of scale recovery tested on Cafe2-1 dataset.
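The two kinds of entries in TABLE V can be read as recall under a joint translation and rotation threshold versus recall under a rotation threshold alone. The sketch below illustrates that computation. The threshold values and the error list are placeholders chosen for illustration; they are assumptions, not the paper's thresholds or results.

```python
def recall(errors, max_trans=None, max_rot=None):
    """Percentage of (trans_err_m, rot_err_deg) pairs within the given limits.

    A limit of None disables the corresponding check.
    """
    if not errors:
        return 0.0
    hits = sum(
        1
        for t, r in errors
        if (max_trans is None or t <= max_trans)
        and (max_rot is None or r <= max_rot)
    )
    return 100.0 * hits / len(errors)

# Made-up errors: the last two poses have good rotation but poor translation,
# which is what a failure of scale recovery would look like.
toy_errors = [(0.1, 1.0), (0.4, 3.0), (2.0, 1.5), (6.0, 4.0)]

# Hypothetical joint thresholds (metres, degrees), from strict to loose.
joint_thresholds = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]
joint = [recall(toy_errors, t, r) for t, r in joint_thresholds]

# Rotation-only recall ignores translation entirely.
rot_only = recall(toy_errors, max_rot=5.0)
```

In this toy case the rotation-only recall stays high while the joint recall drops, mirroring the argument that a gap between the two rows points to translation (scale) error rather than rotation error.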

V Conclusions

In this paper we propose a novel relative pose regression framework for visual localization. To improve network generalization to unseen scenes, we explicitly add a matching layer and use the correlation volume for pose regression. In addition, we design a pyramid based structure that regresses the pose from coarse to fine at restricted resolution, and apply dimension reduction on the correlation channel to improve robustness to large viewpoint changes. The experiments validate that our network achieves state-of-the-art localization performance and gives comparable results even when tested on unseen scenes. The experiments also show that regression based visual localization methods have considerable potential in complicated real-world environments compared with methods that require pixel-level correspondences for pose estimation. In future work we plan to study fusing multiple map images to improve the robustness of visual localization.
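As a rough illustration of why a matching layer decouples the regressor from absolute feature values, the sketch below builds a correlation volume from two small sets of descriptors and then shrinks its channel dimension. The top-k reduction in `reduce_channel` is a hypothetical stand-in chosen for simplicity; it is an assumption and should not be read as the reduction operator used in the paper.

```python
import math

def l2_normalize(v):
    """Scale a descriptor to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def correlation_volume(feat_q, feat_m):
    """Cosine similarity between every query and every map descriptor.

    feat_q, feat_m: lists of descriptors, one per spatial position.
    Returns one row per query position with one score per map position.
    """
    q = [l2_normalize(d) for d in feat_q]
    m = [l2_normalize(d) for d in feat_m]
    return [[sum(a * b for a, b in zip(dq, dm)) for dm in m] for dq in q]

def reduce_channel(corr, k):
    """Keep the k largest scores per query position (hypothetical reduction)."""
    return [sorted(row, reverse=True)[:k] for row in corr]

# Toy descriptors: two query positions, three map positions.
feat_q = [[1.0, 0.0], [0.0, 2.0]]
feat_m = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

corr = correlation_volume(feat_q, feat_m)
reduced = reduce_channel(corr, k=2)

# Rescaling the raw features leaves the correlation scores unchanged.
corr_scaled = correlation_volume([[10.0 * x for x in d] for d in feat_q], feat_m)
```

The regressor in such a design only sees similarity scores, so a change in feature magnitude between scenes does not alter its input, and the reduced channel has a fixed size regardless of the map image resolution.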


  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §I, §III-A, §IV-B.
  • [2] V. Balntas, S. Li, and V. Prisacariu (2018) Relocnet: continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 751–767. Cited by: §I, §I, §II-C, §III-B1, §IV-B, TABLE I, TABLE III.
  • [3] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother (2017) Dsac-differentiable ransac for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692. Cited by: §I, §II-B.
  • [4] E. Brachmann and C. Rother (2018) Learning less is more-6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662. Cited by: §I, §II-B, §IV-B, §IV-B, TABLE I.
  • [5] E. Brachmann and C. Rother (2019) Expert sample consensus applied to camera re-localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7525–7534. Cited by: §I, §II-B.
  • [6] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz (2018) Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §IV-B, TABLE I.
  • [7] M. Cai, H. Zhan, C. Saroj Weerasekera, K. Li, and I. Reid (2019) Camera relocalization by exploiting multi-view constraints for scene coordinates regression. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §II-B.
  • [8] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham (2020) A survey on deep learning for localization and mapping: towards the age of spatial machine intelligence. arXiv preprint arXiv:2006.12567. Cited by: §II.
  • [9] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: TABLE III.
  • [10] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236. Cited by: §I, §IV-B.
  • [11] M. Ding, Z. Wang, J. Sun, J. Shi, and P. Luo (2019) CamNet: coarse-to-fine retrieval for camera re-localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2871–2880. Cited by: §I, §I, §II-C, §III-B1, §III-B2.
  • [12] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II-A.
  • [13] X. Gao, X. Hou, J. Tang, and H. Cheng (2003) Complete solution classification for the perspective-three-point problem. IEEE transactions on pattern analysis and machine intelligence 25 (8), pp. 930–943. Cited by: §II-A.
  • [14] Y. Jiao, Y. Wang, X. Ding, B. Fu, S. Huang, and R. Xiong (2020) 2-entity ransac for robust visual localization: framework, methods and verifications. IEEE Transactions on Industrial Electronics. Cited by: §I, §II-A, §IV-B, TABLE II, TABLE IV, §IV.
  • [15] A. Kendall and R. Cipolla (2017) Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5974–5983. Cited by: §I, §II-B.
  • [16] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §I, §II-B, §IV-B, TABLE I.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
  • [18] G. Klein and D. Murray (2007) Parallel tracking and mapping for small ar workspaces. In IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 1–10. Cited by: §I.
  • [19] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala (2017) Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 929–938. Cited by: §I, §I, §II-C, §III-B1, §III-B2, §IV-B, TABLE I, TABLE III.
  • [20] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) Epnp: an accurate o (n) solution to the pnp problem. International journal of computer vision 81 (2), pp. 155. Cited by: §I, §II-A, §IV-B.
  • [21] X. Li, S. Wang, Y. Zhao, J. Verbeek, and J. Kannala (2020) Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11983–11992. Cited by: §I, §II-B.
  • [22] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, and J. Kannala (2019) Dgc-net: dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1034–1042. Cited by: §III-B2.
  • [23] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt (2014) Scalable 6-dof localization on mobile devices. In European conference on computer vision, pp. 268–283. Cited by: §I.
  • [24] T. Naseer and W. Burgard (2017) Deep regression for monocular camera-based 6-dof global localization in outdoor environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1525–1530. Cited by: §II-B.
  • [25] D. Nistér (2004) An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence 26 (6), pp. 756–770. Cited by: §IV-B.
  • [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §IV-A.
  • [27] T. Qin, P. Li, and S. Shen (2018) Vins-mono: a robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics 34 (4), pp. 1004–1020. Cited by: §I.
  • [28] I. Rocco, R. Arandjelović, and J. Sivic (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. arXiv preprint arXiv:2004.10566. Cited by: §II-A, §III-B1.
  • [29] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. In Advances in Neural Information Processing Systems, pp. 1651–1662. Cited by: §I, §I, §II-A, §III-B1, §III-B2.
  • [30] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In 2011 International conference on computer vision, pp. 2564–2571. Cited by: §I.
  • [31] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: §I.
  • [32] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) Superglue: learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4938–4947. Cited by: §I, §I, §II-A.
  • [33] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al. (2018) Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8601–8610. Cited by: §I.
  • [34] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al. (2020) Are we ready for service robots? the openloris-scene datasets for lifelong slam. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3139–3145. Cited by: §IV-A, §IV.
  • [35] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon (2013) Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937. Cited by: §IV-A, §IV.
  • [36] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §III-B2.
  • [37] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580. Cited by: §IV-A, TABLE III.
  • [38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §II-B.
  • [39] L. Tang, Y. Wang, X. Ding, H. Yin, R. Xiong, and S. Huang (2019) Topological local-metric framework for mobile robots navigation: a long term perspective. Autonomous Robots 43 (1), pp. 197–211. Cited by: §I.
  • [40] L. Tang, Y. Wang, Q. Luo, X. Ding, and R. Xiong (2020) Adversarial feature disentanglement for place recognition across changing appearance. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 1301–1307. Cited by: §III-A.
  • [41] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla (2015) 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817. Cited by: §I, §III-A.
  • [42] P. Truong, M. Danelljan, and R. Timofte (2020) GLU-net: global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6258–6268. Cited by: §I, §III-B2.
  • [43] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers (2017) Image-based localization using lstms for structured feature correlation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 627–637. Cited by: §II-B.
  • [44] K. Wang and S. Shen (2020) Flow-motion and depth network for monocular stereo and beyond. IEEE Robotics and Automation Letters 5 (2), pp. 3307–3314. Cited by: §II-C.
  • [45] Z. Yin and J. Shi (2018) Geonet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. Cited by: §II-C.
  • [46] Q. Zhou, T. Sattler, M. Pollefeys, and L. Leal-Taixe (2020) To learn or not to learn: visual localization from essential matrices. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3319–3326. Cited by: §I, §I, §II-C, §II-C, §III-B1, §IV-B, §IV-B, TABLE I.
  • [47] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019) On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §III-B2.