
Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion

Estimating the relative rigid pose between two RGB-D scans of the same underlying environment is a fundamental problem in computer vision, robotics, and computer graphics. Most existing approaches allow only limited maximum relative pose changes since they require considerable overlap between the input scans. We introduce a novel deep neural network that extends the scope to extreme relative poses, with little or even no overlap between the input scans. The key idea is to infer more complete scene information about the underlying environment and match on the completed scans. In particular, instead of only performing scene completion from each individual scan, our approach alternates between relative pose estimation and scene completion. This allows us to perform scene completion by utilizing information from both input scans at late iterations, resulting in better results for both scene completion and relative pose estimation. Experimental results on benchmark datasets show that our approach leads to considerable improvements over state-of-the-art approaches for relative pose estimation. In particular, our approach provides encouraging relative pose estimates even between non-overlapping scans.





1 Introduction

Estimating the relative rigid pose between a pair of RGB-D scans is a fundamental problem in computer vision, robotics, and computer graphics, with applications to systems such as 3D reconstruction [46], structure-from-motion [35], and simultaneous localization and mapping (SLAM) [38]. Most existing approaches [12, 17, 1, 28, 43] follow a three-step paradigm (c.f. [46]): feature extraction, feature matching, and rigid transform fitting with the most consistent feature correspondences. However, this paradigm requires the input RGB-D scans to have considerable overlap in order to establish sufficient feature correspondences for matching. For input scans with extreme relative poses and little or even no overlap, this paradigm falls short, since there are very few or no features to be found in the overlapping regions. Nevertheless, such problem settings with minimal overlap are common in many applications, such as solving jigsaw puzzles [5], early detection of loop closure for SLAM [13], and reconstruction from minimal observations, e.g., a few snapshots of an indoor environment [25].

Figure 1: Illustration of the workflow of our approach. We align two RGB-D scans by alternating between scene completion (completion module) and pose estimation (relative pose module).

While the conventional paradigm breaks down in this setting, we hypothesize that solutions are possible using prior knowledge of typical scene structure and object shapes. Intuitively, when humans are asked to perform pose estimation for non-overlapping inputs, they utilize prior knowledge of the underlying geometry. For example, we can complete a human model from two non-overlapping scans of the front and the back of a person; we can also tell the relative pose of two non-overlapping indoor scans by knowing that the layout of the room satisfies the Manhattan-world assumption [7]. This suggests that when direct matching of non-overlapping scans is impossible, we can instead first perform scene completion and then match the completed scans to obtain their relative pose.

Inspired by the iterative procedure for simultaneous reconstruction and registration [15], we propose to alternate between scene completion and relative pose estimation so that we can leverage signals from both input scans to achieve better completion results. Specifically, we introduce a neural network that takes a pair of RGB-D scans with little overlap as input and outputs the relative pose between them. Key to our approach are internal modules that infer the completion of each input scan, allowing even widely separated scans to be iteratively registered with the proper relative pose via a recurrent module. As highlighted in Figure 1, our network first performs single-scan completion under a rich representation that combines depth, normals, and semantic descriptors. This is followed by a pairwise matching module, which takes the current completions as input and outputs the current relative pose.

In particular, to address the issue of imperfect predictions, we introduce a novel pairwise matching approach that seamlessly integrates two popular pairwise matching approaches: spectral matching [23, 16] and robust fitting [2]. Given the current relative pose, our network performs bi-scan completion, which takes as input the representation of one scan and the transformed representation of the other scan (using the current relative pose estimate), and outputs an updated scan completion in the view of the first scan. The pairwise matching module and the bi-scan completion module are alternated, as reflected by the recurrent nature of our network design. Note that compared to existing deep-learning-based pairwise matching approaches [26, 11], which combine feature extraction towers and a matching module, the novelty of our approach is three-fold:

1) explicitly supervising the relative pose network via completions of the underlying scene under a novel representation that combines geometry and semantics;

2) a novel pairwise matching method under this representation;

3) an iterative procedure that alternates between scene completion and pairwise matching.

We evaluate our approach on three benchmark datasets, namely, SUNCG [36], Matterport [3], and ScanNet [8]. Experimental results show that our approach is significantly better than state-of-the-art relative pose estimation techniques. For example, our approach reduces the mean rotation errors of state-of-the-art approaches from , , and on SUNCG, Matterport, and ScanNet, respectively, to , , and , respectively, on scans with overlap ratios greater than 10%. Moreover, our approach generates encouraging results for non-overlapping scans. The mean rotation errors of our approach for these scans are , , and , respectively. In contrast, the expected error of a random rotation is around .

Code is publicly available at

2 Related Work

Non-deep learning techniques. Pairwise object matching has been studied extensively in the literature, and it is beyond the scope of this paper to present a comprehensive overview. We refer to [20, 39, 24] for surveys on this topic and to [28] for recent advances. Regarding the specific task of relative pose estimation from RGB-D scans, popular methods [12, 17, 1, 28] follow a three-step procedure. The first step extracts features from each scan. The second step establishes correspondences for the extracted features, and the third step fits a rigid transform to a subset of consistent feature correspondences. Besides the fact that the performance of these techniques heavily relies on setting suitable parameters for each component, they also require that the two input scans possess sufficient overlapping features to match.

Deep learning techniques. Thanks to the popularity of deep neural networks, recent work explores deep neural networks for the task of relative pose estimation (or pairwise matching in general) [11, 18, 40, 45, 27]. These approaches follow the standard pipeline of object matching, but they utilize a neural network module for each component. Specifically, feature extraction is generally done using a feed-forward module, while estimating correspondences and computing a rigid transform are achieved using a correlation module. With proper pre-training, these methods exhibit better performance than their non-deep-learning counterparts. However, they still require that the inputs possess sufficient overlap so that the correlation module can identify common features for relative pose estimation.

A couple of recent works propose recurrent procedures for object matching. In [33], the authors present a recurrent procedure to compute weighted correspondences for estimating the fundamental matrix between two images. In [21], the authors use recurrent networks to progressively compute dense correspondences between two images; the network design is motivated by the procedure of non-rigid image registration between a pair of images. Our approach is conceptually relevant to these approaches. However, the underlying principle of our recurrent approach is different: in particular, our approach performs scan completions, from which we compute the relative pose.

Optimization techniques for pairwise matching. Existing feature-based pairwise matching techniques fall into two categories. The first category of methods is based on MAP inference [23, 16, 4], where feature matches and pairwise feature consistency are integrated as unary and pairwise potential functions. A popular relaxation of MAP inference is spectral relaxation [23, 16]. The second category of methods is based on fitting a rigid transformation to a set of feature correspondences [14]. In particular, state-of-the-art approaches [10, 20, 43] usually utilize robust norms to handle incorrect feature correspondences. In this paper, we introduce the first approach that optimizes a single objective function to simultaneously perform spectral matching and robust fitting for relative pose estimation.

Figure 2: Network design of the completion module. Given the partially observed color, depth, and normals, our network completes a cube-map representation of color, depth, normals, and semantics, as well as a feature map. Please refer to Sec. 3.3 for details.

Scene completion. Our approach is also motivated by recent advances on inferring complete environments from partial observations [31, 37, 19, 47]. However, our approach differs from these approaches in two ways. First, in contrast to returning the completion as the final output [37, 47] or utilizing it for learning feature representations [31, 19], our approach treats completions as an intermediate representation for relative pose estimation. Second, from the representation perspective, our approach predicts color, depth, normals, semantics, and a feature vector using a single network. Experimentally, this leads to better results than performing completion under the RGB-D representation first and then extracting the necessary features from the completions.

3 Approach

We begin by presenting an overview of our approach in Section 3.1. Sections 3.2 to 3.4 elaborate on the network design. Section 3.5 discusses the training procedure.

3.1 Approach Overview

The relative pose estimation problem studied in this paper considers two RGB-D scans of the same environment as input (,=160 in this paper). We assume that the intrinsic camera parameters are given so that we can extract the 3D position of each pixel in the local coordinate system of each . The output of relative pose estimation is a rigid transformation that recovers the relative pose between and . Note that we do not assume and overlap.

Our approach is inspired from simultaneous registration and reconstruction (or SRAR) [15], which takes multiple depth scans of the same environment as input and outputs both a 3D reconstruction of the underlying environment (expressed in a world coordinate system) and optimized scan poses (from which we can compute relative poses). The optimization procedure of SRAR alternates between fixing the scan poses to reconstruct the underlying environment and optimizing scan poses using the current 3D reconstruction. The key advantage of SRAR is that pose optimization can leverage a complete reconstruction of the underlying environment and thus it mitigates the issue of non-overlap.

However, directly applying SRAR to relative pose estimation for 3D scenes is challenging: unlike for 3D objects [41, 32, 6, 42], it is difficult to specify a world coordinate system for 3D scenes. To address this issue, we modify SRAR by maintaining two copies and of the complete underlying environment, where is expressed in the local coordinate system of (we will discuss the precise representation of later). Conceptually, our approach reconstructs each by combining the signals in both and . When performing relative pose estimation, our approach employs and , which addresses the issue of non-overlap.

As illustrated in Figure 1, the proposed network for our approach combines a scan completion module and a pairwise matching module. To provide sufficient signals for pairwise matching, we define the feature representation by concatenating color, depth, normal, semantic label, and descriptors. Here utilizes a reduced cube-map representation [37], where each face of shares the same representation as .

Experimentally, we found this approach gives far better results than performing scan completion under the RGB-D representation first and then computing the feature representation. The pairwise matching module takes the current and as input and outputs the current relative pose . The completion module updates each scan completion using the transformed scans, e.g., utilizes and transformed into the local coordinate system of . We alternate between applying the pairwise matching module and the bi-scan completion module. In our implementation, we use recurrent steps. Next, we elaborate on the details of our approach.
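The alternation between the two modules can be sketched as follows. This is a deliberately toy illustration (all names and simplifications are ours, not the paper's): scans are 1-D arrays with NaN marking unobserved entries, the "pose" is a scalar offset, and both modules are simple stand-ins for the learned networks.

```python
import numpy as np

def complete_scan(scan, other, pose):
    """Toy stand-in for the bi-scan completion module: fill missing
    entries (NaN) of `scan` using the other scan transformed by `pose`."""
    transformed = other + pose          # toy "transform": translation only
    return np.where(np.isnan(scan), transformed, scan)

def estimate_pose(c1, c2):
    """Toy stand-in for the relative pose module: least-squares offset
    aligning completion c2 to completion c1 (ignoring missing entries)."""
    return np.nanmean(c1 - c2)

def alternate(s1, s2, steps=3):
    """Alternate pose estimation and bi-scan completion, as in Figure 1."""
    c1, c2 = s1.copy(), s2.copy()
    pose = 0.0
    for _ in range(steps):
        pose = estimate_pose(c1, c2)        # relative pose module
        c1 = complete_scan(c1, s2, pose)    # completion in frame of scan 1
        c2 = complete_scan(c2, s1, -pose)   # completion in frame of scan 2
    return pose
```

Even in this toy setting, the key property of the paper's design is visible: after one round of completion, the pose estimate is computed from the mutually completed scans rather than from the partial inputs alone.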

3.2 Feature Representation

Motivated by the particular design of our pairwise matching module, we define the feature representation of an RGB-D scan as . Here , , , (we use =15 for SUNCG, =21 for Matterport/ScanNet), (=32 in this paper) specify color, depth, normal, semantic class, and a learned descriptor, respectively. The color, depth, normals, and semantic classes are obtained using the densely labeled reconstructed model for all datasets.

3.3 Scan Completion Modules

The scan completion module takes as input a source scan and a target scan transformed by the current estimate , and outputs the complete feature representation . We encode using a reduced cube-map representation [37], which consists of four faces (excluding the floor and the ceiling). Each face of shares the same feature representation as . For convenience, we always write in the tensor form . Following the convention [37, 31], we formulate the input to both scan completion modules using a similar tensor form , where the last channel is a mask that indicates the presence of data. As illustrated in Figure 2 (left), we always place in . This means , , and are left blank.
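For concreteness, the input tensor could be assembled as follows. This is a minimal sketch: the four faces follow the reduced cube-map representation described above, while the spatial size and channel count are placeholders rather than the paper's exact values.

```python
import numpy as np

def make_completion_input(scan, H=160, W=160, C=8):
    """Sketch of the completion-module input tensor: four cube-map faces,
    each (H, W, C+1), with an extra mask channel indicating observed data.
    The observed scan is placed in the first face; the remaining three
    faces are left blank (zeros, mask = 0)."""
    faces = np.zeros((4, H, W, C + 1), dtype=np.float32)
    faces[0, :, :, :C] = scan        # observed face: feature channels
    faces[0, :, :, C] = 1.0          # mask channel: data present
    return faces
```

The mask channel lets the network distinguish "value is zero" from "value is missing", which is the standard convention followed by [37, 31].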

We adopt a convolution-deconvolution structure for our scan completion network, denoted . As shown in Figure 2, we use separate layers to extract information from the color, depth, and normal inputs, and concatenate the resulting feature maps. Note that we stack the source and transformed target scan in each of the color, normal, and depth components to provide the network with more information; only the source scan is shown for simplicity. Since designing the completion network is not the major focus of this paper, we leave the technical details to the supplementary material.

3.4 Relative Pose Module

We proceed to describe the proposed relative pose module, denoted . We first detect SIFT keypoints on the observed region, and then extract the top matches of these keypoints on the other completed scan to form the final correspondence set . With and we denote the resulting points. Our goal is to simultaneously extract a subset of correspondences from and fit to these selected correspondences. For efficiency, we remove a correspondence from whenever .
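The construction of the candidate correspondence set can be sketched as follows. The number of matches per keypoint, the distance threshold, and the pruning rule are illustrative stand-ins for the specifics elided in the text above.

```python
import numpy as np

def top_matches(desc1, pts1, desc2, pts2, k=2, d_max=5.0):
    """Sketch: for each keypoint descriptor in scan 1, take its k nearest
    descriptors in the (completed) scan 2, then drop pairs that violate a
    simple distance-based pruning rule (k and d_max are assumed values)."""
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)   # descriptor distances
        for j in np.argsort(dist)[:k]:             # top-k candidates
            if np.linalg.norm(pts1[i] - pts2[j]) <= d_max:
                matches.append((i, j))
    return matches
```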

The technical challenge of extracting correct correspondences is that, due to imperfect scan completions, many correspondences with similar descriptors are still outliers. We address this challenge by combining spectral matching [23] and robust fitting [2], which are two complementary pairwise matching methods. Specifically, let be latent indicators. We compute by solving

subject to (1)

As we will define next, is a consistency score associated with the correspondence pair , and is a robust regression loss between and . is set to 50 in our experiments. Intuitively, (1) seeks to extract a subset of correspondences that have large pairwise consistency scores and can be fit well by the rigid transformation.

We define , where and , by combining five consistency measures:

where measures the descriptor consistency, and , as motivated by [34, 17], measure the consistency in edge length and angles (see Figure 3). We now define the weight of as


where are hyper-parameters associated with the consistency measures.


Figure 3: The geometric consistency constraints are based on the fact that rigid transforms preserve lengths and angles.
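The edge-length part of these consistency measures can be sketched as follows, assuming a Gaussian kernel; the kernel choice and its width are our illustration, not the paper's exact formula.

```python
import numpy as np

def length_consistency(P, Q, sigma=0.1):
    """Pairwise length-consistency scores (cf. Figure 3): a rigid
    transform preserves distances, so for correspondences (p_i, q_i)
    and (p_j, q_j) we score how well |p_i - p_j| matches |q_i - q_j|.
    P, Q are (n, 3) arrays of corresponding points."""
    n = len(P)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dp = np.linalg.norm(P[i] - P[j])
            dq = np.linalg.norm(Q[i] - Q[j])
            C[i, j] = np.exp(-(dp - dq) ** 2 / (2 * sigma ** 2))
    return C
```

For a perfectly rigid correspondence set, every entry of this matrix is 1; outlier correspondences produce rows and columns of low scores, which is what the spectral step exploits.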

We define the robust rigid regression loss as

We perform alternating maximization to optimize (1). When and are fixed, (1) reduces to

where . It is clear that the optimal solution is given by the maximum eigenvector of . Likewise, when is fixed, (1) reduces to

We solve (4) using iteratively reweighted least squares. This step exactly follows [2] and is left to Appendix A.2. In our implementation, we use alternating iterations between spectral matching and robust fitting.

Our approach essentially combines the strengths of iteratively reweighted least squares (IRLS) and spectral matching. IRLS is known to be sensitive to large outlier ratios (c.f. [9]). In our formulation, this limitation is addressed by spectral matching, which detects the strongest consistent subset of correspondences. On the other hand, spectral matching, which is a relaxation of a binary integer program, does not offer a clean separation between inliers and outliers. This issue is addressed by using IRLS.
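A minimal sketch of the two alternating sub-steps: power iteration for the spectral step, and a weighted Procrustes (Kabsch) solve as the inner step of IRLS. The surrounding reweighting and the exact robust kernel follow [2] and are omitted here.

```python
import numpy as np

def top_eigenvector(C, iters=100):
    """Power iteration for the leading eigenvector of the (nonnegative)
    consistency matrix -- the spectral-matching relaxation."""
    x = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
    for _ in range(iters):
        x = C @ x
        x /= np.linalg.norm(x)
    return x

def weighted_rigid_fit(P, Q, w):
    """One IRLS inner step: weighted Procrustes (Kabsch) fit of a
    rotation R and translation t such that R @ p + t ~= q."""
    w = w / w.sum()
    mp, mq = w @ P, w @ Q                       # weighted centroids
    H = (P - mp).T @ ((Q - mq) * w[:, None])    # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    return R, mq - R @ mp
```

In the full algorithm, the eigenvector entries select the consistent correspondence subset, and the IRLS weights down-weight correspondences with large residuals under the current rigid fit.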

3.5 Network Training

We train the proposed network by utilizing training data of the form , where each instance collects two input scans, their corresponding completions, and their relative pose. Network training proceeds in two phases.

3.5.1 Learning Each Individual Module

Learning semantic descriptors. We begin by learning the proposed feature representation. Since the color, depth, normal, and semantic label channels are all pre-specified, we only learn the semantic descriptor channels . To this end, we first define a contrastive loss on the representation of scan completions for training globally discriminative descriptors:


where and collect randomly sampled corresponding point pairs and non-corresponding point pairs between and , respectively. is set to 0.5 in our experiments. We then solve the following optimization problem to learn semantic descriptors:


where is the feed-forward network introduced in Section 3.2. In our experiments, we train for around 100k iterations with batch size 2 using the ADAM optimizer [22].
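For illustration, a standard margin-based contrastive loss on descriptor distances has the following form; this is a sketch of the general recipe, not the paper's exact (elided) formula, and we read the value 0.5 quoted above as the margin.

```python
import numpy as np

def contrastive_loss(d_pos, d_neg, margin=0.5):
    """Contrastive descriptor loss sketch: pull descriptor distances of
    corresponding pairs (d_pos) toward 0, and push distances of
    non-corresponding pairs (d_neg) beyond the margin."""
    pos = np.mean(d_pos ** 2)
    neg = np.mean(np.maximum(margin - d_neg, 0.0) ** 2)
    return pos + neg
```

Corresponding pairs contribute their squared distance; non-corresponding pairs contribute nothing once they are farther apart than the margin, so the descriptors only need to separate negatives up to that threshold.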

Learning completion modules. We train the completion network by combining a regression loss and a contrastive descriptor loss:

where , and denotes the concatenated input of and transformed using . We again train for around 100k iterations with batch size 2 using the ADAM optimizer [22].

The motivation for the contrastive descriptor loss is that the completion network does not fit the training data perfectly, and adding this term improves the performance of descriptor matching. Note also that the input relative pose is not perfect during the execution of the entire network; we therefore randomly perturb the relative pose in the neighborhood of each ground truth during training.

Pre-training relative pose module. We pre-train the relative pose module using the results of the bi-scan completion module:


For optimization, we employ finite-difference gradient descent with backtracking line search [29]. In our experiments, the training converges in 30 iterations.
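Since the hyper-parameters enter through a pipeline without analytic gradients, this optimizer can be sketched generically as follows; the step-size constants are our illustrative choices, not values from the paper.

```python
import numpy as np

def fd_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def minimize(f, x, iters=30, alpha0=1.0, beta=0.5, c=1e-4):
    """Finite-difference gradient descent with backtracking (Armijo)
    line search, in the spirit of the optimizer used above."""
    for _ in range(iters):
        g = fd_grad(f, x)
        alpha = alpha0
        # shrink the step until the Armijo sufficient-decrease test holds
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x
```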

3.5.2 Fine-tuning Relative Pose Module

Given the pre-trained individual modules, we could fine-tune the entire network jointly. However, we find that such training is hard to converge and the test accuracy even drops. Instead, we find that a more effective fine-tuning strategy is to optimize only the relative pose modules. In particular, we allow them to have different hyper-parameters to accommodate the specific distributions of the completion results at different layers of the recurrent network. Specifically, let and be the hyper-parameters of the first pairwise matching module and the pairwise matching module at iteration , respectively. With we denote the output of the entire network. We solve the following optimization problem for fine-tuning:


Similar to (7), we again employ finite-difference gradient descent with backtracking line search [29] for optimization. To stabilize the training, we further employ a layer-wise optimization scheme to solve (8) sequentially. In our experiments, the training converges in 20 iterations.

4 Experimental Results

In this section, we present an experimental evaluation of the proposed approach. We begin by describing the experimental setup in Section 4.1. We then present an analysis of our results in Section 4.2. Finally, we present an ablation study in Section 4.3. Please refer to Appendix B for more qualitative results and an enriched ablation study.

4.1 Experimental Setup

Columns (repeated for SUNCG, Matterport, and ScanNet): rotation accuracy at three angular thresholds, mean rotation error; translation accuracy at three thresholds, mean translation error.
4PCS([0.5,1]) 64.3 83.7 87.6 21.0 68.2 74.4 79.0 0.30 42.7 65.7 80.3 33.4 52.6 64.3 69.0 0.46 25.3 48.7 80.1 31.2 36.9 43.2 59.8 0.52
GReg([0.5,1]) 85.9 91.9 94.1 10.3 86.9 89.3 90.7 0.16 80.8 89.2 92.1 12.0 84.8 88.5 90.6 0.17 58.9 84.4 88.8 16.3 81.7 85.8 88.6 0.19
CGReg([0.5,1]) 90.8 92.9 93.9 9.8 87.3 90.7 92.8 0.13 90.3 90.8 93.1 10.1 89.4 89.6 91.6 0.14 59.0 75.7 88.1 18.0 62.1 77.7 86.9 0.23
DL([0.5, 1]) 0.0 0.0 15.9 81.4 0.0 1.9 8.5 1.60 0.0 0.0 9.9 83.8 0.0 3.3 6.6 1.77 0.0 0.0 30.0 61.3 0.0 0.1 0.7 3.31
Ours-nc.([0.5,1]) 88.6 94.7 97.6 4.3 83.4 92.6 95.9 0.10 90.5 97.6 98.9 2.3 93.7 96.9 98.9 0.04 57.2 80.6 90.5 13.9 66.3 79.6 85.9 0.24
Ours-nr.([0.5,1]) 90.0 96.0 97.8 4.3 83.8 94.4 96.5 0.10 85.9 97.7 99.0 2.7 88.9 94.6 97.2 0.07 51.0 78.3 91.2 12.7 63.7 79.2 86.8 0.22
Ours([0.5, 1]) 90.9 95.9 97.8 4.0 83.6 94.3 96.6 0.10 89.5 98.5 99.3 1.9 93.1 96.7 98.5 0.05 52.9 79.1 91.3 12.7 64.7 78.6 86.0 0.23
4PCS([0.1,0.5)) 4.9 10.6 13.7 113.0 4.0 5.3 7.1 1.99 4.2 16.2 25.9 87.0 5.0 8.1 10.0 2.19 1.5 7.1 30.0 82.2 2.5 3.1 3.1 1.63
GReg([0.1,0.5)) 35.1 45.4 50.3 64.1 35.8 40.3 43.6 1.29 19.2 26.8 34.9 73.8 24.2 27.2 28.4 1.68 11.4 25.0 33.3 86.5 18.1 21.7 23.4 1.31
CGReg([0.1,0.5]) 46.4 48.5 51.0 63.4 40.2 42.7 46.0 1.34 28.5 29.3 35.9 73.9 28.1 28.3 29.5 1.99 11.8 20.0 32.9 88.2 11.6 16.0 21.0 1.36
DL([0.1, 0.5)) 0.0 0.0 8.0 94.0 0.0 1.8 4.0 2.06 0.0 0.0 8.5 94.3 0.0 0.4 2.7 2.25 0.0 0.0 7.5 92.1 0.0 0.0 0.0 4.03
Ours-nc.([0.1,0.5]) 47.5 62.6 71.4 32.8 36.3 54.6 63.4 0.89 54.4 75.7 83.7 22.8 53.3 65.3 73.7 0.55 14.1 37.1 56.0 55.3 18.8 31.2 41.3 0.98
Ours-nr.([0.1,0.5]) 60.3 80.3 83.7 20.8 41.2 70.0 80.6 0.56 47.3 72.9 82.4 24.6 44.4 65.1 73.9 0.57 12.2 36.0 65.3 45.2 18.1 33.6 47.0 0.90
Ours([0.1,0.5)) 67.2 84.1 86.4 18.1 44.8 73.8 83.9 0.49 53.7 80.7 87.9 17.2 52.0 71.2 81.4 0.45 14.4 39.1 66.8 43.9 19.6 35.5 48.4 0.87
DL([0.0, 0.1)) 0.0 0.0 2.1 115.4 0.0 1.4 4.3 2.23 0.0 0.0 2.1 125.9 0.0 0.2 2.1 2.83 0.0 0.0 0.0 130.4 0.0 0.0 0.0 5.37
Ours-nc.([0.0,0.1]) 2.2 5.8 13.8 102.1 0.1 0.7 5.6 2.21 1.3 4.9 11.7 117.3 0.2 0.2 0.9 3.10 0.5 4.8 16.3 99.4 0.0 0.5 2.2 1.92
Ours-nr.([0.0,0.1]) 12.6 27.1 33.8 83.4 3.2 15.7 28.8 1.78 1.6 11.4 27.3 92.6 0.2 2.2 7.3 2.33 0.7 7.7 29.1 83.4 0.2 1.7 7.6 1.70
Ours([0.0,0.1)) 15.7 32.4 37.7 79.5 4.5 21.3 34.3 1.66 2.5 16.3 31.3 87.3 0.3 3.0 11.7 2.19 0.9 8.8 32.8 78.9 0.4 2.3 8.7 1.62
Identity([0.0,0.1)) 103.8 2.37 131.1 3.20 82.5 1.96
Table 1: Benchmark evaluation of our approach and baseline approaches. Ours-nc and Ours-nr stand for our method with the completion module and the recurrent module removed, respectively. For the rotation component, we show the percentage of pairs whose angular deviations fall within , , and , respectively. For the translation component, we show the percentage of pairs whose translation deviations fall within , , and . We also show the mean errors. In addition, we show statistics for pairs of scans whose overlap ratios fall into three intervals, namely [50%, 100%], [10%, 50%], and [0%, 10%]. Average numbers are reported over 10 repeated runs on the test sets.

4.1.1 Datasets

We perform our experimental evaluation on three datasets. SUNCG [36] is a synthetic dataset of 45k different 3D scenes, from which we take 9892 bedrooms for our experiments. For each room, we sample 25 camera locations around the room center; the field of view is set as horizontally and vertically. From each camera pose we collect an input scan and the underlying ground-truth completion, stored in the local coordinate system of that camera pose. We allocate 80% of the rooms for training and the rest for testing. Matterport [3] is a real dataset of 925 different 3D scenes, each reconstructed from a real indoor room. We use the default train/test split. For each room, we pick 50 camera poses; the sampling strategy and camera configuration are the same as for SUNCG. ScanNet [8] is a real dataset of 1513 rooms, each reconstructed using thousands of depth scans from a Kinect. For each room, we select every 25th frame in the recording sequence, and for each camera location we render the cube-map representation using the reconstructed 3D model. Note that unlike SUNCG and Matterport, where the reconstruction is complete, the reconstruction associated with ScanNet is partial, i.e., many more areas in our cube-map representation have missing values due to the incompleteness of the ground truth. For testing, we sample 1000 pairs of scans (source and target scans are from the same room) for all datasets.

4.1.2 Baseline Comparison

We consider four baseline approaches:

Super4PCS [28] is a state-of-the-art non-deep learning technique for relative pose estimation between two 3D point clouds. It relies on using geometric constraints to vote for consistent feature correspondences. We used the author’s code for comparison.

Global registration (or GReg) [44] is another state-of-the-art non-deep-learning technique for relative pose estimation. It combines cutting-edge feature extraction with reweighted least squares for rigid pose registration. GReg is a more robust variant of fast global registration (or FGReg) [43], which focuses on efficiency. We used the Open3D implementation of GReg for comparison.

Colored point-cloud registration (or CGReg) [30] combines GReg with colored point-cloud registration, where color information is used to boost the accuracy of feature matching. We used the Open3D implementation.

Deep learning baseline (or DL) [27] is the most relevant deep learning approach for estimating the relative pose between a pair of scans. It uses a Siamese network to extract features from both scans and regresses the quaternion and translation vectors. We use the authors' code and modify their network to take color, depth, and normals as input. Note that we did not directly compare to [33], as extending it to compute relative poses between RGB-D scans is non-trivial, and our best attempt was not as competitive as the pairwise matching module introduced in this paper.

Figure 4: Qualitative results of our approach and baseline approaches. We show examples for the cases of no, small, and significant overlap. From top to bottom: ground-truth color and scene geometry, our pose estimation results (two input scans in red and green), baseline results (4PCS, DL, GReg, and CGReg), and ground-truth and completed scene RGBDN for the two input scans. The unobserved regions are dimmed. See Section 4.2 for details.

4.1.3 Evaluation Protocol

We evaluate the rotation component and the translation component of a relative pose separately. Let be the ground truth; we follow the convention of reporting the relative rotation angle . Let be the ground-truth translation. We evaluate the accuracy of by measuring , where is the barycenter of .
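The rotation metric amounts to the angle of the residual rotation between estimate and ground truth; a sketch follows. Evaluating the translation metric at the scan barycenter is our reading of the elided formula, so treat that part as an assumption.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Relative rotation angle between estimate and ground truth:
    the angle of R_est^T R_gt, in degrees."""
    c = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error(R_est, t_est, R_gt, t_gt, barycenter):
    """Translation deviation evaluated at the scan barycenter, so that
    rotation and translation errors are partially decoupled (assumed
    reading of the protocol)."""
    return np.linalg.norm((R_est @ barycenter + t_est)
                          - (R_gt @ barycenter + t_gt))
```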

To understand the behavior of each approach on different types of scan pairs, we divide the scan pairs into three categories. For this purpose, we first define the overlap ratio between a pair of scans and as . We say a testing pair falls into the category of significant overlap, small overlap, or non-overlap if , , or , respectively.
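Since the formula itself is elided above, one common definition consistent with the text is the fraction of points of one scan that have a close neighbor in the other scan after alignment by the ground-truth pose; both this definition and the threshold are our assumptions.

```python
import numpy as np

def overlap_ratio(P1, P2, T, tau=0.05):
    """Assumed overlap definition: fraction of points of scan 1 with a
    point of scan 2 (transformed into scan 1's frame by T = (R, t))
    within distance tau."""
    R, t = T
    Q = P2 @ R.T + t                     # scan 2 in scan 1's frame
    hits = 0
    for p in P1:
        if np.min(np.linalg.norm(Q - p, axis=1)) < tau:
            hits += 1
    return hits / len(P1)
```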

4.2 Analysis of Results

Table 1 and Figure 4 provide quantitative and qualitative results of our approach and baseline approaches. Overall, our approach outputs accurate relative pose estimates. The predicted normals are more accurate than the predicted color and depth.

In the following, we provide a detailed analysis under each category of scan pairs as well as the scan completion results:

Significant overlap. Our approach outputs accurate relative poses in the presence of significant overlap. The mean errors in rotation/translation of our approach are , , and on SUNCG, Matterport, and ScanNet, respectively. In contrast, the mean errors in rotation/translation of the top-performing baselines are only , , and , respectively. Meanwhile, the performance of our method drops when the completion component is removed. This means that although there are rich features to match between significantly overlapping scans, performing scan completion still matters. Moreover, our approach achieves better relative performance on SUNCG and Matterport, as their field of view is wider than ScanNet's.

Small overlap. Our approach outputs good relative poses in the presence of small overlap. The mean errors in rotation/translation of our approach are , , and on SUNCG, Matterport, and ScanNet, respectively. In contrast, the top-performing baseline only achieves mean errors of , , and , leaving a big margin from our approach. Moreover, the relative improvements are more salient than for scan pairs that possess significant overlap. This is expected, as there are fewer features to match in the original scans, and scan completion provides more features to match.

Figure 5: Distribution of rotation errors of our approach on non-overlapping scans. See Section 4.2 for discussion.
Figure 6: Mean errors in predicted normals and depth w.r.t. the horizontal image coordinate. See Section 4.2 for discussion.

No overlap. Our approach delivers encouraging relative pose estimates on extreme non-overlapping scans. For example, in the first column of Figure 4, a television is separated into two parts across the source and target scans. Our method correctly assembles the two scans to form a complete scene. In the second example, our method correctly predicts the relative position of a sofa and a bookshelf.

We also show the result (Identity) of predicting the identity matrix for each scan pair, which is usually the best we can do for non-overlapping scans using traditional methods. To further understand our approach, Figure 5 plots the error distribution of rotations on the three datasets. We can see that a significant portion of the errors concentrate on and , which can be understood from the perspective that our approach mixes up different walls when performing pairwise matching. This is an expected behavior, as many indoor rooms are symmetric. Note that we omit the quantitative results for Super4PCS, GReg, and CGReg, since they all require overlap.

Scan-completion results. Figure 6 plots the error distributions of predicted depth and normal with respect to the horizontal image coordinate. Note that in our experiments the region is observed for SUNCG/Matterport, and for ScanNet. We can see that the errors are highly correlated with the distance to the observed region, i.e., they are small in adjacent regions and grow as the distance increases. This explains why our approach leads to a significant boost on scan pairs with small overlap, whose corresponding points lie within adjacent regions.

4.3 Ablation Study

We consider two experiments to evaluate the effectiveness of the proposed network design; each removes one functional unit. No completion. The first ablation experiment applies our relative pose estimation module directly on the input scans, i.e., without scan completion. The performance of our approach drops even on largely overlapping scans, which means it is important to perform scan completion even for partially overlapping scans. Moreover, even without completion, our relative pose estimation module still shows noticeable gains over the top-performing baseline GReg [44] on overlapping scans. Such improvements mainly come from combining spectral matching and robust fitting. Please refer to Appendix B for an in-depth comparison.
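To make the spectral matching step concrete, the following is a minimal sketch in the spirit of Leordeanu and Hebert [23]: candidate correspondences are scored by the principal eigenvector of a pairwise geometric-consistency matrix. The function name and the Gaussian consistency kernel are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def spectral_match(src, tgt, cand, sigma=0.1):
    """Score candidate correspondences by spectral matching.

    cand is a list of (i, j) pairs proposing src[i] <-> tgt[j].
    Builds a consistency matrix M where M[a, b] is high when
    correspondences a and b preserve inter-point distances (a rigid
    invariant), then returns the principal eigenvector of M as soft
    confidence scores for each candidate."""
    n = len(cand)
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            ia, ja = cand[a]
            ib, jb = cand[b]
            d_src = np.linalg.norm(src[ia] - src[ib])
            d_tgt = np.linalg.norm(tgt[ja] - tgt[jb])
            # Gaussian kernel on the distance discrepancy.
            M[a, b] = np.exp(-((d_src - d_tgt) ** 2) / (2 * sigma ** 2))
    # Principal eigenvector via power iteration (M is nonnegative,
    # so the iterate stays nonnegative).
    v = np.ones(n) / np.sqrt(n)
    for _ in range(100):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```

Geometrically consistent correspondences form a strongly connected cluster in M and receive high eigenvector scores, while outlier candidates are suppressed.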

No recurrent module. The second ablation experiment removes the recurrent module from our network design. The reduced network performs scene completion from each input scan independently and then estimates the relative pose between the completions. The performance drops in almost all configurations. This shows the importance of the recurrent module, which leverages bi-scan completions to gradually improve the relative pose estimates.
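The alternation between completion and pose estimation can be sketched as follows. Here `complete_scan` and `estimate_pose` are hypothetical stand-ins (trivial stubs) for the completion network and the pairwise matching module; only the control flow mirrors the recurrent design.

```python
import numpy as np

def complete_scan(scan, context=None):
    # Stub "completion": just appends whatever context is available.
    # The real system runs the learned completion network.
    return scan if context is None else np.vstack([scan, context])

def estimate_pose(src_pts, tgt_pts):
    # Stub "matching": returns the identity pose. The real system uses
    # spectral matching followed by robust fitting here.
    return np.eye(4)

def alternate(src, tgt, n_iters=3):
    """Alternate between scene completion and relative pose estimation."""
    T = np.eye(4)  # current estimate of the target-to-source pose
    for it in range(n_iters):
        if it == 0:
            # First pass: complete each scan on its own.
            src_c, tgt_c = complete_scan(src), complete_scan(tgt)
        else:
            # Later passes: re-complete each scan using the other scan
            # transformed by the current pose estimate, so completion
            # and pose estimation improve each other.
            tgt_h = np.hstack([tgt, np.ones((len(tgt), 1))])
            src_c = complete_scan(src, context=(tgt_h @ T.T)[:, :3])
            src_h = np.hstack([src, np.ones((len(src), 1))])
            T_inv = np.linalg.inv(T)
            tgt_c = complete_scan(tgt, context=(src_h @ T_inv.T)[:, :3])
        T = estimate_pose(src_c, tgt_c)
    return T
```

With the real modules plugged in, iteration 0 matches single-scan completions, and later iterations match completions conditioned on both scans, which is what the recurrent module contributes.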

5 Conclusions

We introduced an approach for relative pose estimation between a pair of RGB-D scans of the same indoor environment. The key idea of our approach is to perform scan completion to obtain the underlying geometry, from which we then compute the relative pose. Experimental results demonstrated the usefulness of our approach, both in terms of its absolute performance compared to existing approaches and the effectiveness of each of its modules. In particular, our approach delivers encouraging relative pose estimates even between extreme non-overlapping scans.


  • [1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust pairwise surface registration. ACM Trans. Graph., 27(3):85:1–85:10, Aug. 2008.
  • [2] S. Bouaziz, A. Tagliasacchi, and M. Pauly. Sparse iterative closest point. In Proceedings of the Eleventh Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, SGP ’13, pages 113–123, Aire-la-Ville, Switzerland, 2013. Eurographics Association.
  • [3] A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from RGB-D data in indoor environments. CoRR, abs/1709.06158, 2017.
  • [4] Q. Chen and V. Koltun. Robust nonrigid registration by convex optimization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 2039–2047, Washington, DC, USA, 2015. IEEE Computer Society.
  • [5] T. S. Cho, S. Avidan, and W. T. Freeman. The patch transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
  • [6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 628–644, 2016.
  • [7] J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In Proceedings of the International Conference on Computer Vision - Volume 2, ICCV ’99, pages 941–, Washington, DC, USA, 1999. IEEE Computer Society.
  • [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. CoRR, abs/1702.04405, 2017.
  • [9] I. Daubechies, R. A. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively re-weighted least squares minimization: Proof of faster than linear rate for sparse recovery. In 42nd Annual Conference on Information Sciences and Systems, CISS 2008, Princeton, NJ, USA, 19-21 March 2008, pages 26–29, 2008.
  • [10] D. W. Eggert, A. Lorusso, and R. B. Fisher. Estimating 3-d rigid body transformations: A comparison of four major algorithms. Mach. Vision Appl., 9(5-6):272–290, Mar. 1997.
  • [11] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. CoRR, abs/1504.06852, 2015.
  • [12] N. Gelfand, N. J. Mitra, L. J. Guibas, and H. Pottmann. Robust global registration. In Proceedings of the Third Eurographics Symposium on Geometry Processing, SGP ’05, Aire-la-Ville, Switzerland, Switzerland, 2005. Eurographics Association.
  • [13] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments. Int. J. Rob. Res., 31(5):647–663, Apr. 2012.
  • [14] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629–642, 1987.
  • [15] Q. Huang and D. Anguelov. High quality pose estimation by aligning multiple scans to a latent map. In IEEE International Conference on Robotics and Automation, ICRA 2010, Anchorage, Alaska, USA, 3-7 May 2010, pages 1353–1360. IEEE, 2010.
  • [16] Q.-X. Huang, B. Adams, M. Wicke, and L. J. Guibas. Non-rigid registration under isometric deformations. In Proceedings of the Symposium on Geometry Processing, SGP ’08, pages 1449–1457, Aire-la-Ville, Switzerland, Switzerland, 2008. Eurographics Association.
  • [17] Q.-X. Huang, S. Flöry, N. Gelfand, M. Hofer, and H. Pottmann. Reassembling fractured objects by geometric matching. ACM Trans. Graph., 25(3):569–578, July 2006.
  • [18] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. CoRR, abs/1612.01925, 2016.
  • [19] D. Jayaraman, R. Gao, and K. Grauman. Unsupervised learning through one-shot image-based shape reconstruction. CoRR, abs/1709.00505, 2017.
  • [20] O. v. Kaick, H. Zhang, G. Hamarneh, and D. Cohen-Or. A survey on shape correspondence. Computer Graphics Forum, 2011.
  • [21] S. Kim, S. Lin, S. R. Jeon, D. Min, and K. Sohn. Recurrent transformer networks for semantic correspondence. In NIPS, 2018.
  • [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [23] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In Proceedings of the Tenth IEEE International Conference on Computer Vision - Volume 2, ICCV ’05, pages 1482–1489, Washington, DC, USA, 2005. IEEE Computer Society.
  • [24] X. Li and S. S. Iyengar. On computing mapping of 3d objects: A survey. ACM Comput. Surv., 47(2):34:1–34:45, Dec. 2014.
  • [25] C. Liu, A. G. Schwing, K. Kundu, R. Urtasun, and S. Fidler. Rent3d: Floor-plan priors for monocular layout estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3413–3421, 2015.
  • [26] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? CoRR, abs/1411.1091, 2014.
  • [27] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 675–687. Springer, 2017.
  • [28] N. Mellado, D. Aiger, and N. J. Mitra. Super 4PCS: Fast global pointcloud registration via smart indexing. Computer Graphics Forum, 33(5):205–215, 2014.
  • [29] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
  • [30] J. Park, Q.-Y. Zhou, and V. Koltun. Colored point cloud registration revisited. 2017 IEEE International Conference on Computer Vision (ICCV), pages 143–152, 2017.
  • [31] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016.
  • [32] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5648–5656, 2016.
  • [33] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In European Conference on Computer Vision, pages 292–309. Springer, 2018.
  • [34] Y. Shan, B. Matei, H. S. Sawhney, R. Kumar, D. Huber, and M. Hebert. Linear model hashing and batch ransac for rapid and accurate object recognition. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’04, pages 121–128, Washington, DC, USA, 2004. IEEE Computer Society.
  • [35] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3d. ACM Trans. Graph., 25(3):835–846, July 2006.
  • [36] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [37] S. Song, A. Zeng, A. X. Chang, M. Savva, S. Savarese, and T. Funkhouser. Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view. Proceedings of 31th IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [38] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, pages 573–580. IEEE, 2012.
  • [39] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval methods. Multimedia Tools Appl., 39(3):441–471, Sept. 2008.
  • [40] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR), volume 5, page 6, 2017.
  • [41] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [42] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1696–1704. Curran Associates, Inc., 2016.
  • [43] Q. Zhou, J. Park, and V. Koltun. Fast global registration. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pages 766–782, 2016.
  • [44] Q. Zhou, J. Park, and V. Koltun. Open3d: A modern library for 3d data processing. CoRR, abs/1801.09847, 2018.
  • [45] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
  • [46] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb. State of the art on 3d reconstruction with RGB-D cameras. Comput. Graph. Forum, 37(2):625–652, 2018.
  • [47] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Appendix A More Technical Details about Our Approach

a.1 Completion Network Architecture

The completion network takes two sets of RGB-D-N (RGB, depth, and normal) images as input. Three separate convolution layers (each followed by ReLU and BatchNorm) extract domain-specific signals before merging; these three preprocessing branches are applied to both sets of RGB-D-N input. We also use skip connections to facilitate training. The overall architecture is listed as follows, where C(m,n) specifies a convolution layer with m input channels and n output channels.

Figure 7: Completion network architecture. C(m,n) stands for m input channels and n output channels. Skip connections are added at mirroring locations of the encoder and decoder. The two sets of input (corresponding to the source and transformed target scans, respectively) first go through the first three layers separately, and are then concatenated and passed through the remaining layers.
              SUNCG                           Matterport                      ScanNet
        Rotation       Trans.           Rotation       Trans.           Rotation       Trans.
        Median  Mean   Median  Mean     Median  Mean   Median  Mean     Median  Mean   Median  Mean
nr      4.51    26.25  0.21    0.62     4.85    22.33  0.22    0.60     12.90   33.89  0.36    0.61
r       1.54    23.36  0.10    0.54     2.51    18.69  0.10    0.49     7.11    30.40  0.23    0.57
sm      2.65    25.60  0.18    0.64     3.15    20.23  0.20    0.60     7.10    35.32  0.17    0.57
r+sm    1.32    19.36  0.06    0.48     1.45    13.90  0.04    0.34     5.47    32.38  0.12    0.57
Table 2: Ablation study for pairwise matching. nr: directly apply the closed-form solution [14] without the reweighting procedure; r: reweighted least squares; sm: spectral matching; r+sm: alternate between reweighted least squares and spectral matching.

a.2 Iteratively Reweighted Least Squares for Solving the Robust Regression Problem

In this section, we provide technical details on solving the following robust regression problem:


\[
\min_{R,\,t}\ \sum_{i=1}^{n} \big\| R p_i + t - q_i \big\|^{\gamma}, \tag{9}
\]

where $(p_i, q_i)$, $1 \le i \le n$, are the candidate correspondences and $\gamma < 2$ is the robust norm exponent; we use the same $\gamma$ in all of our experiments. We solve (9) using reweighted non-linear least squares. Introduce initial weights $w_i^{(0)} = 1$. At each iteration $k$, we first solve the following weighted least-squares problem:

\[
\big(R^{(k)}, t^{(k)}\big) = \operatorname*{argmin}_{R,\,t}\ \sum_{i=1}^{n} w_i^{(k-1)} \big\| R p_i + t - q_i \big\|^{2}. \tag{10}
\]

According to [14], (10) admits a closed-form solution. Specifically, define the weighted centroids

\[
\bar{p} = \frac{\sum_{i} w_i^{(k-1)} p_i}{\sum_{i} w_i^{(k-1)}}, \qquad
\bar{q} = \frac{\sum_{i} w_i^{(k-1)} q_i}{\sum_{i} w_i^{(k-1)}}.
\]

The optimal translation and rotation to (10) are given by

\[
t^{(k)} = \bar{q} - R^{(k)} \bar{p}, \qquad
R^{(k)} = V \operatorname{diag}\big(1, 1, \det(V U^{\top})\big)\, U^{\top},
\]

where $U$ and $V$ are given by the singular value decomposition $S = U \Sigma V^{\top}$ of the weighted covariance matrix

\[
S = \sum_{i=1}^{n} w_i^{(k-1)} \,(p_i - \bar{p})(q_i - \bar{q})^{\top}.
\]

After obtaining the new optimal transformation $(R^{(k)}, t^{(k)})$, we update the weight associated with correspondence $i$ at iteration $k$ as

\[
w_i^{(k)} = \Big( \big\| R^{(k)} p_i + t^{(k)} - q_i \big\| + \epsilon \Big)^{\gamma - 2},
\]

where $\epsilon$ is a small constant to address the issue of division by zero.

In our experiments, we used 5 reweighting operations for solving (9).
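The reweighting procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the choice $\gamma = 1$ (an $\ell_1$-style robust norm), the iteration count, and $\epsilon$ are illustrative assumptions.

```python
import numpy as np

def fit_rigid_weighted(p, q, w):
    """Weighted closed-form rigid fit (SVD form of the classic
    closed-form alignment, cf. [14])."""
    w = w / w.sum()
    p_bar = w @ p                       # weighted centroids
    q_bar = w @ q
    S = (p - p_bar).T @ np.diag(w) @ (q - q_bar)   # weighted covariance
    U, _, Vt = np.linalg.svd(S)
    # Determinant correction guarantees a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    R = Vt.T @ D @ U.T
    t = q_bar - R @ p_bar
    return R, t

def irls_pose(p, q, gamma=1.0, n_iters=10, eps=1e-6):
    """Iteratively reweighted least squares for the robust objective.

    gamma < 2 is the robust exponent; residuals larger than ~1 are
    down-weighted, so gross outlier correspondences lose influence."""
    w = np.ones(len(p))
    for _ in range(n_iters):
        R, t = fit_rigid_weighted(p, q, w)
        r = np.linalg.norm(p @ R.T + t - q, axis=1)   # per-point residuals
        w = (r + eps) ** (gamma - 2.0)                # reweighting step
    return R, t
```

Even a single gross outlier correspondence is suppressed within a few reweighting iterations, which is the practical point of the robust norm.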

a.3 Implementation Details

Implementation details of the completion network. We use a combination of five sources of information (color, normal, depth, semantic label, and feature) to supervise the completion network. Specifically, we supervise color, normal, and depth with one loss, the feature with another, and the semantic label with a cross-entropy loss; the corresponding loss weights are fixed across experiments. We trained for 100k iterations on a single GTX 1080Ti, using the Adam optimizer with an initial learning rate of 0.0002.
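A minimal NumPy sketch of how the five supervision terms could be combined. The per-term norms ($\ell_1$ for the pixel-wise terms, $\ell_2$ for the feature term) and the unit weights are placeholders, not the paper's exact choices.

```python
import numpy as np

def l1(pred, gt):
    # Mean absolute error, used here for the pixel-wise terms.
    return np.abs(pred - gt).mean()

def l2(pred, gt):
    # Mean squared error, used here as a placeholder feature loss.
    return ((pred - gt) ** 2).mean()

def cross_entropy(logits, labels):
    # Softmax cross-entropy over per-pixel semantic logits.
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def completion_loss(pred, gt, w_feat=1.0, w_sem=1.0):
    """Combine the five supervision signals.

    w_feat and w_sem are hypothetical weights; the paper's actual
    values are not reproduced here."""
    return (l1(pred["color"], gt["color"])
            + l1(pred["normal"], gt["normal"])
            + l1(pred["depth"], gt["depth"])
            + w_feat * l2(pred["feature"], gt["feature"])
            + w_sem * cross_entropy(pred["sem_logits"], gt["sem_labels"]))
```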

Appendix B Additional Experimental Results

Figures 8, 9, and 10 show more qualitative results on SUNCG, Matterport, and ScanNet, respectively. Table 2 gives a detailed ablation study of our proposed pairwise matching algorithm. We compare against three variants, namely, direct regression (nr) using [14], reweighted least squares (r) using the robust norm, and spectral matching alone (sm). The combination of reweighted least squares and spectral matching gives the best results.

We also applied the idea of learning correspondence weights from data [33]. Since [33] addresses a different problem, estimating the fundamental matrix between a pair of RGB images, we applied the idea on top of the reweighted least squares (r) variant of our approach, namely, by replacing the reweighting scheme described in Section A.2 with a small network that predicts correspondence weights. However, we found that this approach generalized poorly to the test data. In contrast, the spectral matching approach, which leverages geometric constraints specifically designed for matching 3D data, leads to an additional boost in performance.


Figure 8: SUNCG qualitative results. From top to bottom: ground-truth color and scene geometry, our pose estimation results (two input scans in red and green), baseline results (4PCS, DL, GReg and CGReg), ground-truth scene RGBDNS and completed scene RGBDNS for two input scans. The unobserved regions are dimmed.


Figure 9: Matterport qualitative results. From top to bottom: ground-truth color and scene geometry, our pose estimation results (two input scans in red and green), baseline results (4PCS, DL, GReg and CGReg), ground-truth scene RGBDNS and completed scene RGBDNS for two input scans. The unobserved regions are dimmed.


Figure 10: ScanNet qualitative results. From top to bottom: ground-truth color and scene geometry, our pose estimation results (two input scans in red and green), baseline results (4PCS, DL, GReg and CGReg), ground-truth scene RGBDNS and completed scene RGBDNS for two input scans. The unobserved regions are dimmed.