1 Introduction
Estimating the relative rigid pose between a pair of RGBD scans is a fundamental problem in computer vision, robotics, and computer graphics, with applications to systems such as 3D reconstruction [46], structure-from-motion [35], and simultaneous localization and mapping (SLAM) [38]. Most existing approaches [12, 17, 1, 28, 43] follow a three-step paradigm (c.f. [46]): feature extraction, feature matching, and rigid transform fitting with the most consistent feature correspondences. However, this paradigm requires the input RGBD scans to have considerable overlap in order to establish sufficient feature correspondences for matching. For input scans with extreme relative poses and little or even
no overlap, this paradigm falls short, since there are very few or no features to be found in the overlapping regions. Nevertheless, such settings with minimal overlap are common in many applications, such as solving jigsaw puzzles [5], early detection of loop closure for SLAM [13], and reconstruction from minimal observations, e.g., a few snapshots of an indoor environment [25]. While the conventional paradigm breaks down in this setting, we hypothesize that solutions are possible using prior knowledge of typical scene structure and object shapes. Intuitively, when humans are asked to perform pose estimation for non-overlapping inputs, they utilize prior knowledge of the underlying geometry. For example, we can complete a human model from two non-overlapping scans of the front and the back of a person; we can also tell the relative pose of two non-overlapping indoor scans by knowing that the layout of the room satisfies the Manhattan world assumption [7]. This suggests that when direct matching of non-overlapping scans is impossible, we can seek to match them by first performing scene completions and then matching the completed scans to obtain their relative pose.
Inspired by the iterative procedure for simultaneous reconstruction and registration [15], we propose to alternate between scene completion and relative pose estimation, so that we can leverage signals from both input scans to achieve better completion results. Specifically, we introduce a neural network that takes a pair of RGBD scans with little overlap as input and outputs the relative pose between them. Key to our approach are internal modules that infer the completion of each input scan, allowing even widely separated scans to be iteratively registered with the proper relative pose via a recurrent module. As highlighted in Figure 1, our network first performs single-scan completion under a rich representation that combines depth, normals, and semantic descriptors. This is followed by a pairwise matching module, which takes the current completions as input and outputs the current relative pose.
In particular, to address the issue of imperfect predictions, we introduce a novel pairwise matching approach that seamlessly integrates two popular pairwise matching techniques: spectral matching [23, 16] and robust fitting [2]. Given the current relative pose, our network performs bi-scan completion, which takes as input the representation of one scan and the transformed representation of the other scan (using the current relative pose estimate), and outputs an updated scan completion in the view of the first scan. The pairwise matching module and the bi-scan completion module are alternated, as reflected by the recurrent nature of our network design. Note that compared to existing deep-learning-based pairwise matching approaches
[26, 11], which combine feature extraction towers and a matching module, the novelty of our approach is threefold: (1) explicitly supervising the relative pose network via completions of the underlying scene under a novel representation that combines geometry and semantics; (2) a novel pairwise matching method under this representation; and (3) an iterative procedure that alternates between scene completion and pairwise matching.
We evaluate our approach on three benchmark datasets, namely SUNCG [36], Matterport [3], and ScanNet [8]. Experimental results show that our approach significantly outperforms state-of-the-art relative pose estimation techniques, substantially reducing their mean rotation errors on SUNCG, Matterport, and ScanNet for scans with overlap ratios greater than 10% (see Table 1). Moreover, our approach generates encouraging results for non-overlapping scans, for which the mean rotation errors remain far below the expected error of a random rotation (around 126.5 degrees, the mean geodesic angle of a uniformly random rotation).
Code is publicly available at https://github.com/zhenpeiyang/RelativePose.
2 Related Work
Non-deep-learning techniques. Pairwise object matching has been studied extensively in the literature, and it is beyond the scope of this paper to present a comprehensive overview. We refer to [20, 39, 24] for surveys on this topic and to [28] for recent advances. Regarding the specific task of relative pose estimation from RGBD scans, popular methods [12, 17, 1, 28] follow a three-step procedure: the first step extracts features from each scan; the second step establishes correspondences between the extracted features; and the third step fits a rigid transform to a subset of consistent feature correspondences. Besides the fact that the performance of these techniques relies heavily on setting suitable parameters for each component, they also require that the two input scans possess sufficient overlapping features to match.
Deep learning techniques. Thanks to the popularity of deep neural networks, recent work explores deep neural networks for the task of relative pose estimation (or pairwise matching in general) [11, 18, 40, 45, 27]. These approaches follow the standard pipeline of object matching, but they utilize a neural network module for each component. Specifically, feature extraction is generally done using a feed-forward module, while estimating correspondences and computing a rigid transform are achieved using a correlation module. With proper pre-training, these methods exhibit better performance than their non-deep-learning counterparts. However, they still require that the inputs possess sufficient overlap so that the correlation module can identify common features for relative pose estimation.
A couple of recent works propose recurrent procedures for object matching. In [33], the authors present a recurrent procedure to compute weighted correspondences for estimating the fundamental matrix between two images. In [21], the authors use recurrent networks to progressively compute dense correspondences between two images; the network design is motivated by the procedure of non-rigid image registration between a pair of images. Our approach is conceptually relevant to these approaches. However, the underlying principle of our recurrent approach is different: our approach performs scan completions, from which we compute the relative pose.
Optimization techniques for pairwise matching. Existing feature-based pairwise matching techniques fall into two categories. The first category is based on MAP inference [23, 16, 4], where feature matches and pairwise feature consistency are integrated as unary and pairwise potential functions; a popular relaxation of MAP inference is spectral relaxation [23, 16]. The second category is based on fitting a rigid transformation to a set of feature correspondences [14]; in particular, state-of-the-art approaches [10, 20, 43] usually utilize robust norms to handle incorrect feature correspondences. In this paper, we introduce the first approach that optimizes a single objective function to simultaneously perform spectral matching and robust fitting for relative pose estimation.
Scene completion. Our approach is also motivated by recent advances in inferring complete environments from partial observations [31, 37, 19, 47]. However, our approach differs from these approaches in two ways. First, in contrast to returning the completion as the final output [37, 47] or utilizing it for learning feature representations [31, 19], our approach treats completions as an intermediate representation for relative pose estimation. Second, from the representation perspective, our approach predicts color, depth, normals, semantics, and a feature vector using a single network. Experimentally, this leads to better results than first performing completion under the RGBD representation and then extracting the necessary features from the completions.
3 Approach
We begin by presenting an overview of our approach in Section 3.1. Sections 3.2 to 3.4 elaborate on the network design. Section 3.5 discusses the training procedure.
3.1 Approach Overview
The relative pose estimation problem studied in this paper takes as input two RGBD scans $I_1$ and $I_2$ of the same environment (each with resolution $160 \times 160$ in this paper). We assume that the intrinsic camera parameters are given, so that we can extract the 3D position of each pixel in the local coordinate system of each scan $I_i$. The output of relative pose estimation is a rigid transformation $T$ that recovers the relative pose between $I_1$ and $I_2$. Note that we do not assume that $I_1$ and $I_2$ overlap.
Our approach is inspired by simultaneous registration and reconstruction (or SRAR) [15], which takes multiple depth scans of the same environment as input and outputs both a 3D reconstruction of the underlying environment (expressed in a world coordinate system) and optimized scan poses (from which we can compute relative poses). The optimization procedure of SRAR alternates between fixing the scan poses to reconstruct the underlying environment and optimizing the scan poses using the current 3D reconstruction. The key advantage of SRAR is that pose optimization can leverage a complete reconstruction of the underlying environment, which mitigates the issue of non-overlap.
However, directly applying SRAR to relative pose estimation for 3D scenes is challenging, since, unlike for 3D objects [41, 32, 6, 42], it is difficult to specify a world coordinate system for 3D scenes. To address this issue, we modify SRAR by maintaining two copies $S_1$ and $S_2$ of the complete underlying environment, where $S_i$ is expressed in the local coordinate system of the input scan $I_i$ (we discuss the precise representation of $S_i$ later). Conceptually, our approach reconstructs each $S_i$ by combining the signals in both $I_1$ and $I_2$. When performing relative pose estimation, our approach employs $S_1$ and $S_2$, which addresses the issue of non-overlap.
As illustrated in Figure 1, the proposed network combines a scan completion module and a pairwise matching module. To provide sufficient signals for pairwise matching, we define the feature representation by concatenating color, depth, normal, semantic label, and descriptor channels. Each completion $S_i$ utilizes a reduced cubemap representation [37], where each face of $S_i$ shares the same representation as the input scan $I_i$. Experimentally, we found that this approach gives far better results than first performing scan completion under the RGBD representation and then computing the feature representation. The pairwise matching module takes the current $S_1$ and $S_2$ as input and outputs the current relative pose $T$. The completion module then updates each scan completion using the transformed scans, e.g., $S_1$ is updated from $I_1$ together with $I_2$ transformed into the local coordinate system of $I_1$ (using the current estimate $T$). We alternate between applying the pairwise matching module and the bi-scan completion module; our implementation uses a fixed number of recurrent steps. Next we elaborate on the details of our approach.
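The alternation between pairwise matching and bi-scan completion described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `complete_fn` and `match_fn` are hypothetical stand-ins for the learned completion and matching modules, and scans are assumed to be given as (N, 3) point arrays.

```python
import numpy as np

def transform(points, T):
    """Apply a 4x4 rigid transform T to an (N, 3) point array."""
    if points is None:
        return None
    R, t = T[:3, :3], T[:3, 3]
    return points @ R.T + t

def estimate_relative_pose(scan1, scan2, complete_fn, match_fn, num_steps=3):
    """Recurrent alternation sketch.

    complete_fn(src, other) -> completed representation of src, optionally
    using the other scan transformed into src's local coordinate system;
    match_fn(S1, S2) -> 4x4 rigid transform mapping scan 2 into scan 1's frame.
    """
    # Initial single-scan completions (no signal from the other scan yet).
    S1 = complete_fn(scan1, None)
    S2 = complete_fn(scan2, None)
    T = match_fn(S1, S2)                      # initial relative pose estimate
    for _ in range(num_steps):
        # Bi-scan completion: each completion also sees the other scan,
        # transformed into its own local coordinate system.
        S1 = complete_fn(scan1, transform(scan2, T))
        S2 = complete_fn(scan2, transform(scan1, np.linalg.inv(T)))
        T = match_fn(S1, S2)                  # refine the relative pose
    return T
```

With learned modules in place of the lambdas, each iteration feeds a better-aligned pair into completion, which in turn yields richer features for matching.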
3.2 Feature Representation
Motivated by the particular design of our pairwise matching module, we define the per-pixel feature representation of an RGBD scan as $f = (c, d, n, s, u)$. Here $c \in \mathbb{R}^3$, $d \in \mathbb{R}$, $n \in \mathbb{R}^3$, $s \in \{1, \dots, K\}$ (we use $K = 15$ for SUNCG and $K = 21$ for Matterport/ScanNet), and $u \in \mathbb{R}^k$ ($k = 32$ in this paper) specify color, depth, normal, semantic class, and a learned descriptor, respectively. The color, depth, normal, and semantic class are obtained from the densely labeled reconstructed model for all datasets.
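As an illustration, the per-pixel channels listed above (color, depth, normal, one-hot semantic class, and learned descriptor) can be packed into a single tensor. The function name and layout are illustrative, not the paper's:

```python
import numpy as np

K_CLASSES = 15      # number of semantic classes: 15 for SUNCG, 21 for Matterport/ScanNet
DESC_DIM = 32       # learned descriptor dimension

def pack_features(color, depth, normal, semantic, descriptor):
    """Stack the five feature components into one H x W x C tensor.

    color: (H, W, 3), depth: (H, W), normal: (H, W, 3),
    semantic: (H, W) integer class labels, descriptor: (H, W, DESC_DIM).
    """
    onehot = np.eye(K_CLASSES)[semantic]              # (H, W, K_CLASSES) one-hot
    return np.concatenate(
        [color, depth[..., None], normal, onehot, descriptor], axis=-1)
```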
3.3 Scan Completion Modules
The scan completion module takes as input a source scan and a target scan transformed by the current estimate $T$, and outputs the complete feature representation $S_i$. We encode $S_i$ using a reduced cubemap representation [37], which consists of four faces (excluding the floor and the ceiling); each face of $S_i$ shares the same feature representation as $I_i$. For convenience, we always write $S_i$ in tensor form. Following the convention of [37, 31], we formulate the input to both scan completion modules using a similar tensor form, where the last channel is a mask that indicates the presence of data. As illustrated in Figure 2 (left), we always place the observed scan in one fixed face of the cubemap; the remaining faces are left blank. We adopt a convolution-deconvolution structure for our scan completion network. As shown in Figure 2, we use separate layers to extract information from the color, depth, and normal inputs, and concatenate the resulting feature maps. Note that we stack the source and the transformed target scan in each of the color, normal, and depth components to provide the network with more information (only the source scan is shown for simplicity). Since designing the completion network is not the major focus of this paper, we leave the technical details to the supplementary material.
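A minimal sketch of assembling the masked completion input, assuming a four-face reduced cubemap laid out side by side with a single observed face; `make_completion_input` and its arguments are hypothetical names for illustration:

```python
import numpy as np

def make_completion_input(face, feat_dim, face_res, face_index=0, num_faces=4):
    """Build the masked input tensor for the completion network.

    The reduced cubemap is laid out as num_faces faces side by side; the
    observed scan `face` (face_res x face_res x feat_dim) fills one face,
    the remaining faces are blank, and a final channel masks observed pixels.
    """
    H, W = face_res, face_res * num_faces
    x = np.zeros((H, W, feat_dim + 1))                 # +1 for the mask channel
    c0 = face_index * face_res
    x[:, c0:c0 + face_res, :feat_dim] = face
    x[:, c0:c0 + face_res, feat_dim] = 1.0             # mark observed pixels
    return x
```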
3.4 Relative Pose Module
We proceed to describe the proposed relative pose module. We first detect SIFT keypoints in the observed region of each scan and then extract the top matches of these keypoints in the other completed scan to form the candidate correspondence set $C = \{(p_i, q_i)\}$, where $p_i$ and $q_i$ denote the resulting 3D points. Our goal is to simultaneously extract a subset of correct correspondences from $C$ and fit the rigid transformation $T$ to the selected correspondences. For efficiency, we prune from $C$ in advance correspondences that are unlikely to be correct.
The technical challenge of extracting correct correspondences is that, due to imperfect scan completions, many correspondences with similar descriptors are still outliers. We address this challenge by combining spectral matching [23] and robust fitting [2], which are two complementary pairwise matching methods. Specifically, let $z_i$ be latent indicators of the correspondences in $C$, collected in a vector $z$. We compute $z$ and $T$ by solving

$$\max_{z,\, T}\ \sum_{i, j} w_{ij}\, z_i z_j \;-\; \lambda \sum_i z_i^2\, \rho\big(T(p_i), q_i\big) \quad \text{subject to}\quad \|z\| = 1. \qquad (1)$$

As we will define next, $w_{ij}$ is a consistency score associated with the correspondence pair $(p_i, q_i)$ and $(p_j, q_j)$, and $\rho(T(p_i), q_i)$ is a robust regression loss between $T(p_i)$ and $q_i$; the trade-off parameter $\lambda$ is set to 50 in our experiments. Intuitively, (1) seeks to extract a subset of correspondences that have large pairwise consistency scores and that can be fit well by the rigid transformation.
We define the consistency score $w_{ij}$ between correspondences $(p_i, q_i)$ and $(p_j, q_j)$ by combining five consistency measures $d^{(1)}_{ij}, \dots, d^{(5)}_{ij}$, where $d^{(1)}_{ij}$ measures the descriptor consistency, and the remaining measures, as motivated by [34, 17], measure the consistency in edge lengths and angles (see Figure 3). We then define the weight as

$$w_{ij} = \exp\Big(-\sum_{k=1}^{5} \beta_k\, d^{(k)}_{ij}\Big), \qquad (2)$$

where $\beta_1, \dots, \beta_5$ are hyperparameters associated with the consistency measures.
We define the robust rigid regression loss following [2] as a sparsity-inducing penalty on the residual, $\rho\big(T(p_i), q_i\big) = \|T(p_i) - q_i\|_2^{\,p}$ with $p \le 1$, which down-weights large residuals caused by outlier correspondences.
We perform alternating maximization to optimize (1). When the rigid transformation $T$ is fixed, (1) reduces to

$$\max_{z}\ z^\top \bar{W} z \quad \text{subject to}\quad \|z\| = 1, \qquad (3)$$

where $\bar{w}_{ij} = w_{ij} - \lambda\,\delta_{ij}\,\rho\big(T(p_i), q_i\big)$ and $\delta_{ij}$ is the Kronecker delta. It is clear that the optimal solution $z^\star$ is given by the maximum eigenvector of $\bar{W}$. Likewise, when $z$ is fixed, (1) reduces to

$$\min_{T}\ \sum_i z_i^2\, \rho\big(T(p_i), q_i\big). \qquad (4)$$
We solve (4) using iteratively reweighted least squares; the update step exactly follows [2] and is deferred to Appendix A.2. In our implementation, we use a fixed number of alternating iterations between spectral matching and robust fitting.
Our approach essentially combines the strengths of iteratively reweighted least squares (or IRLS) and spectral matching. IRLS is known to be sensitive to large outlier ratios (c.f. [9]); in our formulation, this limitation is addressed by spectral matching, which detects the strongest consistent subset of correspondences. On the other hand, spectral matching, being a relaxation of a binary integer program, does not offer a clean separation between inliers and outliers; this issue is addressed by using IRLS.
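The alternation between the spectral step (3) and the robust-fitting step (4) can be sketched as follows. This is an illustrative sketch, not the paper's exact formulation: `fit_rigid` is a weighted Kabsch solve, the robust norm is approximated by simple IRLS reweighting, and the power iteration stands in for an eigensolver.

```python
import numpy as np

def fit_rigid(p, q, w):
    """Weighted least-squares rigid fit (Kabsch):
    argmin_{R, t} sum_i w_i ||R p_i + t - q_i||^2."""
    w = w / w.sum()
    pc, qc = (w[:, None] * p).sum(0), (w[:, None] * q).sum(0)
    H = (w[:, None] * (p - pc)).T @ (q - qc)          # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, qc - R @ pc

def spectral_robust_fit(p, q, W, lam=1.0, eps=1e-2, outer_iters=3, irls_iters=5):
    """Alternate spectral matching and IRLS robust fitting (sketch of (1)).

    p, q: (n, 3) candidate corresponding points; W: (n, n) pairwise
    consistency scores w_ij."""
    n = len(p)
    z = np.full(n, 1.0 / np.sqrt(n))
    R, t = np.eye(3), np.zeros(3)
    for _ in range(outer_iters):
        # Spectral step: residuals enter the diagonal with a negative sign,
        # so poorly fit correspondences are down-weighted.
        res = np.linalg.norm(p @ R.T + t - q, axis=1)
        Wbar = W - lam * np.diag(res)
        for _ in range(20):                           # power iteration
            z = np.abs(Wbar @ z)
            z /= np.linalg.norm(z)
        # Robust-fitting step: IRLS with weights z_i^2 / residual.
        for _ in range(irls_iters):
            res = np.linalg.norm(p @ R.T + t - q, axis=1)
            w = z ** 2 / np.maximum(res, eps)         # reweighting for a robust norm
            R, t = fit_rigid(p, q, w)
    return R, t, z
```

On consistent correspondences the weighted Kabsch step is exact in one iteration; the interplay of `z` and the IRLS weights is what suppresses outliers in the general case.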
3.5 Network Training
We train the proposed network using training data of the form $\{(I_1, I_2, S_1^\star, S_2^\star, T^\star)\}$, where each instance collects two input scans, their corresponding ground-truth completions, and their ground-truth relative pose. Network training proceeds in two phases.
3.5.1 Learning Each Individual Module
Learning semantic descriptors. We begin by learning the proposed feature representation. Since color, depth, normals, and semantic labels are all pre-specified, we only learn the semantic descriptor channels $u$. To this end, we first define a contrastive loss on the representation of scan completions for training globally discriminative descriptors:

$$l(S_1, S_2) = \sum_{(p, q) \in \mathcal{M}^+} \|u_p - u_q\|^2 \;+\; \sum_{(p, q) \in \mathcal{M}^-} \max\big(0,\, m - \|u_p - u_q\|\big)^2, \qquad (5)$$

where $\mathcal{M}^+$ and $\mathcal{M}^-$ collect randomly sampled corresponding point pairs and non-corresponding point pairs between $S_1$ and $S_2$, respectively, and the margin $m$ is set to 0.5 in our experiments. We then solve the following optimization problem to learn the semantic descriptors:

$$\min_{\theta}\ \sum_{\text{training instances}} l\big(S_1^\star(\theta),\, S_2^\star(\theta)\big), \qquad (6)$$

where $\theta$ denotes the parameters of the feed-forward network introduced in Section 3.2. In our experiments, we train for around 100k iterations with batch size 2 using the ADAM optimizer [22].
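A minimal sketch of the contrastive descriptor loss (5), assuming descriptors are given as (N, k) arrays and pairs as integer index arrays; the mean-reduction and names are illustrative:

```python
import numpy as np

def contrastive_loss(desc1, desc2, pos_pairs, neg_pairs, margin=0.5):
    """Contrastive loss sketch: corresponding pairs are pulled together,
    non-corresponding pairs pushed beyond the margin (0.5 in the experiments).

    desc1, desc2: (N, k) descriptor arrays;
    pos_pairs, neg_pairs: (M, 2) index arrays into desc1/desc2."""
    d_pos = np.linalg.norm(desc1[pos_pairs[:, 0]] - desc2[pos_pairs[:, 1]], axis=1)
    d_neg = np.linalg.norm(desc1[neg_pairs[:, 0]] - desc2[neg_pairs[:, 1]], axis=1)
    return (d_pos ** 2).mean() + (np.maximum(0.0, margin - d_neg) ** 2).mean()
```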
Learning completion modules. We train the completion network by combining a regression loss on the completed channels with the contrastive descriptor loss (5); the input to the network concatenates the source scan and the target scan transformed using the relative pose $T$. We again train for around 100k iterations with batch size 2 using the ADAM optimizer [22].
The motivation for the contrastive descriptor loss is that the completion network does not fit the training data perfectly, and adding this term improves the performance of descriptor matching. Also note that the input relative pose is not perfect during the execution of the entire network; we therefore randomly perturb the relative pose in a neighborhood of the ground truth during training.
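The ground-truth pose perturbation used when training the completion module can be sketched as follows; the perturbation magnitudes here are illustrative defaults, not the paper's values:

```python
import numpy as np

def perturb_pose(T, max_angle=0.1, max_trans=0.05, rng=None):
    """Randomly perturb a 4x4 ground-truth pose in its neighborhood.

    Draws a random rotation axis, a rotation angle in [-max_angle, max_angle]
    radians (via the Rodrigues formula), and a small random translation."""
    rng = np.random.default_rng() if rng is None else rng
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    dT = np.eye(4)
    dT[:3, :3] = dR
    dT[:3, 3] = rng.uniform(-max_trans, max_trans, size=3)
    return dT @ T
```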
Pre-training the relative pose module. We pre-train the relative pose module using the results of the bi-scan completion module:

$$\min_{\beta}\ \sum_{(I_1, I_2, T^\star)} d\big(T_\beta(I_1, I_2),\, T^\star\big), \qquad (7)$$

where $T_\beta(I_1, I_2)$ denotes the relative pose output by the module with hyperparameters $\beta$, and $d$ measures the deviation from the ground-truth pose $T^\star$. For optimization, we employ finite-difference gradient descent with backtracking line search [29]. In our experiments, the training converges in 30 iterations.
3.5.2 Finetuning Relative Pose Module
Given the pre-trained individual modules, we could fine-tune the entire network end-to-end. However, we find that such training is hard to converge and the test accuracy even drops. Instead, a more effective fine-tuning strategy is to optimize only the relative pose modules. In particular, we allow them to have different hyperparameters, to accommodate the specific distributions of the completion results at different layers of the recurrent network. Specifically, let $\beta^{(1)}$ and $\beta^{(t)}$ be the hyperparameters of the first pairwise matching module and of the pairwise matching module at recurrent iteration $t$, respectively, and let $T_{\beta^{(1)}, \dots, \beta^{(t)}}(I_1, I_2)$ denote the output of the entire network. We solve the following optimization problem for fine-tuning:

$$\min_{\beta^{(1)}, \dots, \beta^{(t)}}\ \sum_{(I_1, I_2, T^\star)} d\big(T_{\beta^{(1)}, \dots, \beta^{(t)}}(I_1, I_2),\, T^\star\big). \qquad (8)$$

Similar to (7), we again employ finite-difference gradient descent with backtracking line search [29]. To stabilize the training, we further employ a layer-wise optimization scheme that solves (8) sequentially. In our experiments, the training converges in 20 iterations.
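Finite-difference gradient descent with backtracking line search, as used to optimize (7) and (8), can be sketched as follows; this is a generic sketch of the technique, not the paper's implementation:

```python
import numpy as np

def fd_gradient(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def fd_descent(f, x0, iters=30, step0=1.0, shrink=0.5, c=1e-4):
    """Gradient descent on f using finite-difference gradients, with
    backtracking (Armijo) line search to pick each step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = fd_gradient(f, x)
        step, fx = step0, f(x)
        # Shrink the step until the sufficient-decrease condition holds.
        while f(x - step * g) > fx - c * step * (g @ g) and step > 1e-12:
            step *= shrink
        x = x - step * g
    return x
```

This only needs function evaluations, which matches the setting here: the objective is the pose error of the matching module as a black-box function of its hyperparameters.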
4 Experimental Results
In this section, we present an experimental evaluation of the proposed approach. We begin by describing the experimental setup in Section 4.1. We then present an analysis of our results in Section 4.2. Finally, we present an ablation study in Section 4.3. Please refer to Appendix B for more qualitative results and an extended ablation study.
4.1 Experimental Setup
Table 1. Relative pose estimation results on SUNCG, Matterport, and ScanNet, grouped by overlap ratio ([0.5,1]: significant overlap; [0.1,0.5): small overlap; [0.0,0.1): non-overlap). For each dataset, the Rotation and Trans. blocks report three accuracy percentages at increasing error thresholds, followed by the mean error (rotation in degrees, translation in meters).

Method (overlap)  |  SUNCG Rotation  |  SUNCG Trans.  |  Matterport Rotation  |  Matterport Trans.  |  ScanNet Rotation  |  ScanNet Trans.
4PCS ([0.5,1])  |  64.3  83.7  87.6  21.0  |  68.2  74.4  79.0  0.30  |  42.7  65.7  80.3  33.4  |  52.6  64.3  69.0  0.46  |  25.3  48.7  80.1  31.2  |  36.9  43.2  59.8  0.52
GReg ([0.5,1])  |  85.9  91.9  94.1  10.3  |  86.9  89.3  90.7  0.16  |  80.8  89.2  92.1  12.0  |  84.8  88.5  90.6  0.17  |  58.9  84.4  88.8  16.3  |  81.7  85.8  88.6  0.19
CGReg ([0.5,1])  |  90.8  92.9  93.9  9.8  |  87.3  90.7  92.8  0.13  |  90.3  90.8  93.1  10.1  |  89.4  89.6  91.6  0.14  |  59.0  75.7  88.1  18.0  |  62.1  77.7  86.9  0.23
DL ([0.5,1])  |  0.0  0.0  15.9  81.4  |  0.0  1.9  8.5  1.60  |  0.0  0.0  9.9  83.8  |  0.0  3.3  6.6  1.77  |  0.0  0.0  30.0  61.3  |  0.0  0.1  0.7  3.31
Ours-nc. ([0.5,1])  |  88.6  94.7  97.6  4.3  |  83.4  92.6  95.9  0.10  |  90.5  97.6  98.9  2.3  |  93.7  96.9  98.9  0.04  |  57.2  80.6  90.5  13.9  |  66.3  79.6  85.9  0.24
Ours-nr. ([0.5,1])  |  90.0  96.0  97.8  4.3  |  83.8  94.4  96.5  0.10  |  85.9  97.7  99.0  2.7  |  88.9  94.6  97.2  0.07  |  51.0  78.3  91.2  12.7  |  63.7  79.2  86.8  0.22
Ours ([0.5,1])  |  90.9  95.9  97.8  4.0  |  83.6  94.3  96.6  0.10  |  89.5  98.5  99.3  1.9  |  93.1  96.7  98.5  0.05  |  52.9  79.1  91.3  12.7  |  64.7  78.6  86.0  0.23
4PCS ([0.1,0.5))  |  4.9  10.6  13.7  113.0  |  4.0  5.3  7.1  1.99  |  4.2  16.2  25.9  87.0  |  5.0  8.1  10.0  2.19  |  1.5  7.1  30.0  82.2  |  2.5  3.1  3.1  1.63
GReg ([0.1,0.5))  |  35.1  45.4  50.3  64.1  |  35.8  40.3  43.6  1.29  |  19.2  26.8  34.9  73.8  |  24.2  27.2  28.4  1.68  |  11.4  25.0  33.3  86.5  |  18.1  21.7  23.4  1.31
CGReg ([0.1,0.5))  |  46.4  48.5  51.0  63.4  |  40.2  42.7  46.0  1.34  |  28.5  29.3  35.9  73.9  |  28.1  28.3  29.5  1.99  |  11.8  20.0  32.9  88.2  |  11.6  16.0  21.0  1.36
DL ([0.1,0.5))  |  0.0  0.0  8.0  94.0  |  0.0  1.8  4.0  2.06  |  0.0  0.0  8.5  94.3  |  0.0  0.4  2.7  2.25  |  0.0  0.0  7.5  92.1  |  0.0  0.0  0.0  4.03
Ours-nc. ([0.1,0.5))  |  47.5  62.6  71.4  32.8  |  36.3  54.6  63.4  0.89  |  54.4  75.7  83.7  22.8  |  53.3  65.3  73.7  0.55  |  14.1  37.1  56.0  55.3  |  18.8  31.2  41.3  0.98
Ours-nr. ([0.1,0.5))  |  60.3  80.3  83.7  20.8  |  41.2  70.0  80.6  0.56  |  47.3  72.9  82.4  24.6  |  44.4  65.1  73.9  0.57  |  12.2  36.0  65.3  45.2  |  18.1  33.6  47.0  0.90
Ours ([0.1,0.5))  |  67.2  84.1  86.4  18.1  |  44.8  73.8  83.9  0.49  |  53.7  80.7  87.9  17.2  |  52.0  71.2  81.4  0.45  |  14.4  39.1  66.8  43.9  |  19.6  35.5  48.4  0.87
DL ([0.0,0.1))  |  0.0  0.0  2.1  115.4  |  0.0  1.4  4.3  2.23  |  0.0  0.0  2.1  125.9  |  0.0  0.2  2.1  2.83  |  0.0  0.0  0.0  130.4  |  0.0  0.0  0.0  5.37
Ours-nc. ([0.0,0.1))  |  2.2  5.8  13.8  102.1  |  0.1  0.7  5.6  2.21  |  1.3  4.9  11.7  117.3  |  0.2  0.2  0.9  3.10  |  0.5  4.8  16.3  99.4  |  0.0  0.5  2.2  1.92
Ours-nr. ([0.0,0.1))  |  12.6  27.1  33.8  83.4  |  3.2  15.7  28.8  1.78  |  1.6  11.4  27.3  92.6  |  0.2  2.2  7.3  2.33  |  0.7  7.7  29.1  83.4  |  0.2  1.7  7.6  1.70
Ours ([0.0,0.1))  |  15.7  32.4  37.7  79.5  |  4.5  21.3  34.3  1.66  |  2.5  16.3  31.3  87.3  |  0.3  3.0  11.7  2.19  |  0.9  8.8  32.8  78.9  |  0.4  2.3  8.7  1.62
Identity ([0.0,0.1))  |  -  -  -  103.8  |  -  -  -  2.37  |  -  -  -  131.1  |  -  -  -  3.20  |  -  -  -  82.5  |  -  -  -  1.96
4.1.1 Datasets
We perform experimental evaluation on three datasets. SUNCG [36] is a synthetic dataset that contains 45k different 3D scenes, of which we take 9892 bedrooms for our experiments. For each room, we sample 25 camera locations around the room center with fixed horizontal and vertical fields of view. From each camera pose we collect an input scan and the underlying ground-truth completion, stored in the local coordinate system of that camera pose. We allocate 80% of the rooms for training and the rest for testing. Matterport [3] is a real dataset that contains 925 different 3D scenes, each reconstructed from a real indoor room. We use its default train/test split. For each room, we pick 50 camera poses; the sampling strategy and camera configuration are the same as for SUNCG. ScanNet [8] is a real dataset that contains 1513 rooms, each reconstructed using thousands of Kinect depth scans. For each room, we select every 25th frame in the recording sequence. For each camera location, we render the cubemap representation using the reconstructed 3D model. Note that unlike SUNCG and Matterport, where the reconstructions are complete, the reconstructions in ScanNet are partial, i.e., far more areas in our cubemap representation have missing values due to the incompleteness of the ground truth. For testing, we sample 1000 pairs of scans (source and target from the same room) for each dataset.
4.1.2 Baseline Comparison
We consider four baseline approaches:
Super4PCS [28] is a state-of-the-art non-deep-learning technique for relative pose estimation between two 3D point clouds. It relies on geometric constraints to vote for consistent feature correspondences. We used the authors' code for comparison.
Global registration (or GReg) [44] is another state-of-the-art non-deep-learning technique for relative pose estimation. It combines cutting-edge feature extraction and reweighted least squares for rigid pose registration. GReg is a more robust version of fast global registration (or FGReg) [43], which focuses on efficiency. We used the Open3D implementation of GReg for comparison.
Colored point cloud registration (or CGReg) [30] is a combination of GReg and colored point cloud registration, where color information is used to boost the accuracy of feature matching. We used the Open3D implementation.
Deep learning baseline (or DL) [27] is the most relevant deep learning approach for estimating the relative pose between a pair of scans. It uses a Siamese network to extract features from both scans and regresses the quaternion and translation vectors. We use the authors' code and modify their network to take color, depth, and normals as input. Note that we did not directly compare to [33], as extending it to compute relative poses between RGBD scans is non-trivial, and our best attempt was not as competitive as the pairwise matching module introduced in this paper.
Figure 4. Qualitative comparison. Rows, top to bottom: G.T. Color, G.T. Scene, Ours, 4PCS, DL, GReg, CGReg, G.T. 1, Completed 1, G.T. 2, Completed 2.
4.1.3 Evaluation Protocol
We evaluate the rotation component and the translation component of a relative pose $(R, t)$ separately. Let $(R^\star, t^\star)$ be the ground truth. Following convention, we report the relative rotation angle between $R$ and $R^\star$. For translation, we evaluate the accuracy of $t$ by measuring $\|(R c + t) - (R^\star c + t^\star)\|$, where $c$ is the barycenter of the source scan.
To understand the behavior of each approach on different types of scan pairs, we divide the scan pairs into three categories. For this purpose, we first define the overlap ratio between a pair of scans as the fraction of points in one scan that have a corresponding point in the other. We say a test pair falls into the category of significant overlap, small overlap, and non-overlap if its overlap ratio lies in $[0.5, 1]$, $[0.1, 0.5)$, and $[0.0, 0.1)$, respectively.
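The two error metrics can be sketched as follows, assuming 3x3 rotation matrices and 4x4 homogeneous transforms; the function names are illustrative:

```python
import numpy as np

def rotation_angle_deg(R_est, R_gt):
    """Angular deviation (degrees) between estimated and ground-truth rotations,
    via the trace identity cos(theta) = (trace(R_gt^T R_est) - 1) / 2."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(T_est, T_gt, points):
    """Translation error measured at the barycenter of the source scan."""
    c = np.concatenate([points.mean(axis=0), [1.0]])   # homogeneous barycenter
    return np.linalg.norm((T_est @ c)[:3] - (T_gt @ c)[:3])
```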
4.2 Analysis of Results
Table 1 and Figure 4 provide quantitative and qualitative results for our approach and the baseline approaches. Overall, our approach outputs accurate relative pose estimations, and the predicted normals are more accurate than the predicted color and depth.
In the following, we provide a detailed analysis under each category of scan pairs as well as the scan completion results:
Significant overlap. Our approach outputs accurate relative poses in the presence of significant overlap, achieving lower mean rotation errors than the top-performing baselines on all three datasets (Table 1, top block). Meanwhile, the performance of our method drops when the completion component is removed. This means that although there are rich features to match between significantly overlapping scans, performing scan completion still matters. Moreover, our approach achieves better relative performance on SUNCG and Matterport, whose field of view is wider than that of ScanNet.
Small overlap. Our approach outputs good relative poses in the presence of small overlap, with mean rotation errors of 18.1, 17.2, and 43.9 degrees on SUNCG, Matterport, and ScanNet, respectively. In contrast, the best-performing baselines only achieve mean rotation errors of 63.4, 73.8, and 82.2 degrees, leaving a big margin to our approach. Moreover, the relative improvements are more salient than for scan pairs with significant overlap. This is expected, as there are fewer features to match in the original scans, and scan completion provides more features to match.
No overlap. Our approach delivers encouraging relative pose estimations on extreme non-overlapping scans. For example, in the first column of Figure 4, a television is split between the source and target scans; our method correctly assembles the two scans to form a complete scene. In the second example, our method correctly predicts the relative position of a sofa and a bookshelf.
We also report the result (Identity) of predicting the identity transformation for each scan pair, which is usually the best one can do for non-overlapping scans using traditional methods. To further understand our approach, Figure 5 plots the error distribution of rotations on the three datasets. A significant portion of the errors concentrate at specific angles that correspond to mixing up different walls during pairwise matching; this is expected behavior, as many indoor rooms are symmetric. Note that we omit the quantitative results for Super4PCS, GReg, and CGReg on this category, since they all require overlap.
Scan-completion results. Figure 6 plots the error distributions of the predicted depth and normals with respect to the horizontal image coordinate. Note that in our experiments only part of the cubemap is observed, with a wider observed region for SUNCG/Matterport than for ScanNet. The errors are highly correlated with the distance to the observed region: they are small in adjacent regions and grow as the distance increases. This explains why our approach yields a significant boost on scan pairs with small overlap, i.e., corresponding points lie within adjacent regions.
4.3 Ablation Study
We consider two experiments to evaluate the effectiveness of the proposed network design; each removes one functional unit from the network. No completion. The first ablation experiment applies our relative pose estimation module directly to the input scans, i.e., without scan completions. The performance of our approach drops even on largely overlapping scans, which means that it is important to perform scan completions even for partially overlapping scans. Moreover, even without completion, our relative pose estimation module still shows noticeable performance gains over the top-performing baseline GReg [44] on overlapping scans. Such improvements mainly come from combining spectral matching and robust fitting. Please refer to Appendix B for an in-depth comparison.
No recurrent module. The second ablation experiment removes the recurrent module from our network design. The reduced network essentially performs scan completion from each input scan independently and then estimates the relative pose between the scan completions. The performance drops in almost all configurations, which shows the importance of the recurrent module: it leverages bi-scan completions to gradually improve the relative pose estimates.
5 Conclusions
We introduced an approach for relative pose estimation between a pair of RGBD scans of the same indoor environment. The key idea of our approach is to perform scan completion to obtain the underlying geometry, from which we then compute the relative pose. Experimental results demonstrated the usefulness of our approach, both in terms of its absolute performance compared to existing approaches and the effectiveness of each module. In particular, our approach delivers encouraging relative pose estimations between extreme non-overlapping scans.
References
 [1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust pairwise surface registration. ACM Trans. Graph., 27(3):85:1–85:10, Aug. 2008.
 [2] S. Bouaziz, A. Tagliasacchi, and M. Pauly. Sparse iterative closest point. In Proceedings of the Eleventh Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, SGP '13, pages 113–123, Aire-la-Ville, Switzerland, 2013. Eurographics Association.
 [3] A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from RGBD data in indoor environments. CoRR, abs/1709.06158, 2017.
 [4] Q. Chen and V. Koltun. Robust non-rigid registration by convex optimization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 2039–2047, Washington, DC, USA, 2015. IEEE Computer Society.
 [5] T. S. Cho, S. Avidan, and W. T. Freeman. The patch transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
 [6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 628–644, 2016.

 [7] J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In Proceedings of the International Conference on Computer Vision – Volume 2, ICCV ’99, pages 941–, Washington, DC, USA, 1999. IEEE Computer Society.
 [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. CoRR, abs/1702.04405, 2017.
 [9] I. Daubechies, R. A. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization: Proof of faster than linear rate for sparse recovery. In 42nd Annual Conference on Information Sciences and Systems, CISS 2008, Princeton, NJ, USA, 19-21 March 2008, pages 26–29, 2008.
 [10] D. W. Eggert, A. Lorusso, and R. B. Fisher. Estimating 3d rigid body transformations: A comparison of four major algorithms. Mach. Vision Appl., 9(5-6):272–290, Mar. 1997.
 [11] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. CoRR, abs/1504.06852, 2015.
 [12] N. Gelfand, N. J. Mitra, L. J. Guibas, and H. Pottmann. Robust global registration. In Proceedings of the Third Eurographics Symposium on Geometry Processing, SGP ’05, Aire-la-Ville, Switzerland, 2005. Eurographics Association.
 [13] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3d modeling of indoor environments. Int. J. Rob. Res., 31(5):647–663, Apr. 2012.
 [14] B. K. P. Horn. Closedform solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629–642, 1987.
 [15] Q. Huang and D. Anguelov. High quality pose estimation by aligning multiple scans to a latent map. In IEEE International Conference on Robotics and Automation, ICRA 2010, Anchorage, Alaska, USA, 3-7 May 2010, pages 1353–1360. IEEE, 2010.
 [16] Q.-X. Huang, B. Adams, M. Wicke, and L. J. Guibas. Non-rigid registration under isometric deformations. In Proceedings of the Symposium on Geometry Processing, SGP ’08, pages 1449–1457, Aire-la-Ville, Switzerland, 2008. Eurographics Association.
 [17] Q.X. Huang, S. Flöry, N. Gelfand, M. Hofer, and H. Pottmann. Reassembling fractured objects by geometric matching. ACM Trans. Graph., 25(3):569–578, July 2006.
 [18] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. CoRR, abs/1612.01925, 2016.
 [19] D. Jayaraman, R. Gao, and K. Grauman. Unsupervised learning through oneshot imagebased shape reconstruction. CoRR, abs/1709.00505, 2017.
 [20] O. v. Kaick, H. Zhang, G. Hamarneh, and D. Cohen‐Or. A Survey on Shape Correspondence. Computer Graphics Forum, 2011.

 [21] S. Kim, S. Lin, S. R. Jeon, D. Min, and K. Sohn. Recurrent transformer networks for semantic correspondence. In NIPS, page to appear, 2018.
 [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [23] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In Proceedings of the Tenth IEEE International Conference on Computer Vision  Volume 2, ICCV ’05, pages 1482–1489, Washington, DC, USA, 2005. IEEE Computer Society.
 [24] X. Li and S. S. Iyengar. On computing mapping of 3d objects: A survey. ACM Comput. Surv., 47(2):34:1–34:45, Dec. 2014.

 [25] C. Liu, A. G. Schwing, K. Kundu, R. Urtasun, and S. Fidler. Rent3D: Floor-plan priors for monocular layout estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3413–3421, 2015.
 [26] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? CoRR, abs/1411.1091, 2014.

 [27] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 675–687. Springer, 2017.
 [28] N. Mellado, D. Aiger, and N. J. Mitra. Super 4PCS: Fast global point-cloud registration via smart indexing. Computer Graphics Forum, 33(5):205–215, 2014.
 [29] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
 [30] J. Park, Q.Y. Zhou, and V. Koltun. Colored point cloud registration revisited. 2017 IEEE International Conference on Computer Vision (ICCV), pages 143–152, 2017.
 [31] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016.
 [32] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5648–5656, 2016.
 [33] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In European Conference on Computer Vision, pages 292–309. Springer, 2018.
 [34] Y. Shan, B. Matei, H. S. Sawhney, R. Kumar, D. Huber, and M. Hebert. Linear model hashing and batch ransac for rapid and accurate object recognition. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’04, pages 121–128, Washington, DC, USA, 2004. IEEE Computer Society.
 [35] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3d. ACM Trans. Graph., 25(3):835–846, July 2006.
 [36] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [37] S. Song, A. Zeng, A. X. Chang, M. Savva, S. Savarese, and T. Funkhouser. Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view. In Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [38] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgbd slam systems. In IROS, pages 573–580. IEEE, 2012.
 [39] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval methods. Multimedia Tools Appl., 39(3):441–471, Sept. 2008.
 [40] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR), volume 5, page 6, 2017.
 [41] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generativeadversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 [42] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning singleview 3d object reconstruction without 3d supervision. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1696–1704. Curran Associates, Inc., 2016.
 [43] Q. Zhou, J. Park, and V. Koltun. Fast global registration. In Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pages 766–782, 2016.
 [44] Q. Zhou, J. Park, and V. Koltun. Open3d: A modern library for 3d data processing. CoRR, abs/1801.09847, 2018.
 [45] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and egomotion from video. In CVPR, volume 2, page 7, 2017.
 [46] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb. State of the art on 3d reconstruction with RGBD cameras. Comput. Graph. Forum, 37(2):625–652, 2018.
 [47] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Appendix A More Technical Details about Our Approach
A.1 Completion Network Architecture
The completion network takes two sets of RGBDN (RGB, depth, and normal) inputs. Three separate convolution layers (each followed by ReLU and batch normalization) are applied to extract domain-specific signals before merging; these three preprocessing branches are applied to both RGBDN inputs. We also use skip connections to facilitate training. The overall architecture is listed as follows, where C(m,n) specifies a convolution layer with m input channels and n output channels.
Table 2: Ablation study of the pairwise matching module on SUNCG, Matterport, and ScanNet (nr: direct regression; r: reweighted least squares; sm: spectral matching; r+sm: both).

          SUNCG                     Matterport                ScanNet
          Rotation     Trans.       Rotation     Trans.       Rotation     Trans.
          Med.  Mean   Med.  Mean   Med.  Mean   Med.  Mean   Med.  Mean   Med.  Mean
nr        4.51  26.25  0.21  0.62   4.85  22.33  0.22  0.60   12.90 33.89  0.36  0.61
r         1.54  23.36  0.10  0.54   2.51  18.69  0.10  0.49   7.11  30.40  0.23  0.57
sm        2.65  25.6   0.18  0.64   3.15  20.23  0.20  0.60   7.10  35.32  0.17  0.57
r+sm      1.32  19.36  0.06  0.48   1.45  13.9   0.04  0.34   5.47  32.38  0.12  0.57
A.2 Iteratively Reweighted Least Squares for Solving the Robust Regression Problem
In this section, we provide technical details on solving the following robust regression problem:
(9)   min_{(R,t)} Σ_i ρ(‖R p_i + t − q_i‖),

where {(p_i, q_i)} denotes the set of candidate correspondences, (R, t) is the rigid transformation, and ρ(·) is the robust loss function; the same ρ is used in all of our experiments. We solve (9) using iteratively reweighted non-linear least squares. Introduce an initial weight w_i^(0) = 1 for each correspondence. At each iteration k, we first solve the following weighted least-squares problem:

(10)   (R^(k), t^(k)) = argmin_{(R,t)} Σ_i w_i^(k−1) ‖R p_i + t − q_i‖².

According to [14], (10) admits a closed-form solution. Specifically, define the weighted centroids and the weighted cross-covariance matrix

p̄ = Σ_i w_i^(k−1) p_i / Σ_i w_i^(k−1),   q̄ = Σ_i w_i^(k−1) q_i / Σ_i w_i^(k−1),   S = Σ_i w_i^(k−1) (p_i − p̄)(q_i − q̄)^T.

The optimal translation and rotation to (10) are given by

R^(k) = V U^T,   t^(k) = q̄ − R^(k) p̄,

where U and V are given by the singular value decomposition S = U Σ V^T, and where the sign of the last column of V is flipped if necessary so that det(V U^T) = 1.

After obtaining the new optimal transformation (R^(k), t^(k)), we update the weight associated with correspondence (p_i, q_i) at iteration k as

w_i^(k) = ρ′(r_i^(k)) / (2 (r_i^(k) + ε)),   r_i^(k) := ‖R^(k) p_i + t^(k) − q_i‖,

where ε is a small constant that addresses the issue of division by zero.

In our experiments, we used 5 reweighting operations for solving (9).
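To make the above procedure concrete, the following NumPy sketch implements the reweighting loop with Horn-style closed-form fits [14]. The specific weight update shown (an ℓ1-type reweighting) and the function names are illustrative assumptions, not necessarily the exact robust loss used in our experiments.

```python
import numpy as np

def horn_fit(P, Q, w):
    """Weighted closed-form rigid fit [14]: argmin_{R,t} sum_i w_i ||R p_i + t - q_i||^2."""
    w = w / w.sum()
    p_bar, q_bar = w @ P, w @ Q                      # weighted centroids
    S = (P - p_bar).T @ ((Q - q_bar) * w[:, None])   # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # D guards against reflections
    t = q_bar - R @ p_bar
    return R, t

def irls_register(P, Q, n_iter=5, eps=1e-6):
    """Iteratively reweighted least squares for a robust registration objective."""
    w = np.ones(len(P))
    for _ in range(n_iter):
        R, t = horn_fit(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)  # per-correspondence residuals
        w = 1.0 / (r + eps)                          # weight update for an l1-type loss
    return R, t
```

With this weight update, a gross outlier correspondence receives a vanishing weight after one or two reweighting steps, so the final fit is governed almost entirely by the inliers.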
A.3 Implementation Details
Implementation details of the completion network. We use a combination of five sources of information (color, normal, depth, semantic label, and feature) to supervise the completion network. Specifically, the total loss is a weighted sum of per-source terms: a regression loss each for color, normal, and depth, a regression loss for the feature map, and a cross-entropy loss for the semantic labels, with fixed trade-off weights between the terms. We trained for 100k iterations on a single GTX 1080Ti, using the Adam optimizer with an initial learning rate of 0.0002.
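For illustration, such a weighted combination of supervision terms might be computed as below; the per-term norms (ℓ1 vs. ℓ2) and the lambda weights are assumptions for this sketch, not the values used in training.

```python
import numpy as np

def completion_loss(pred, target, lam_feat=1.0, lam_sem=1.0):
    """Weighted sum of supervision terms for the completion network.
    The l1/l2 assignment per term and the lambda weights are illustrative."""
    loss = 0.0
    # Per-pixel regression losses for the appearance/geometry channels (l1 here).
    for key in ("color", "normal", "depth"):
        loss += np.abs(pred[key] - target[key]).mean()
    # Regression loss on the descriptor feature map (l2 here).
    loss += lam_feat * ((pred["feat"] - target["feat"]) ** 2).mean()
    # Cross-entropy on per-pixel semantic logits; target["sem"] holds class indices.
    logits = pred["sem"]                                 # shape (N, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss += lam_sem * (-log_prob[np.arange(len(logits)), target["sem"]]).mean()
    return loss
```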
Appendix B Additional Experimental Results
Figures 8, 9, and 10 show more qualitative results on SUNCG, Matterport, and ScanNet, respectively. Table 2 gives a detailed ablation study of our proposed pairwise matching algorithm. We compare against three variants, namely, direct regression (nr) using [14], reweighted least squares (r) using the robust norm, and merely using spectral matching (sm). We can see that the combination of reweighted least squares and spectral matching gives the best result.
We also applied the idea of learning weights for correspondences from data [33]. Since [33] addresses a different problem of estimating the fundamental matrix between a pair of RGB images, we tried applying the idea on top of the reweighted least squares variant (r) of our approach, namely, by replacing the reweighting scheme described in Section A.2 with a small network that predicts the correspondence weights. However, we found that this approach generalized poorly on the test data. In contrast, we found that the spectral matching approach, which leverages geometric constraints that are specifically designed for matching 3D data, leads to an additional boost in performance.
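The spectral matching component (sm) can be sketched as follows, in the spirit of [23]: candidate correspondences are scored by the leading eigenvector of a pairwise consistency matrix built from the fact that rigid motions preserve intra-scan distances. The Gaussian affinity and the bandwidth σ below are illustrative assumptions, not the exact design of our pairwise matching module.

```python
import numpy as np

def spectral_match(cands, P, Q, sigma=0.1):
    """Score candidate correspondences with the leading eigenvector of a
    pairwise consistency matrix, following Leordeanu-Hebert spectral matching [23].
    cands: list of (i, j) pairs proposing that P[i] matches Q[j]."""
    n = len(cands)
    M = np.zeros((n, n))
    for a, (i, j) in enumerate(cands):
        for b, (k, l) in enumerate(cands):
            if a == b:
                continue
            # Rigid motions preserve pairwise distances, so two consistent
            # correspondences have matching intra-scan distances.
            d = abs(np.linalg.norm(P[i] - P[k]) - np.linalg.norm(Q[j] - Q[l]))
            M[a, b] = np.exp(-d ** 2 / sigma ** 2)
    # Leading eigenvector via power iteration; its entries act as confidences.
    v = np.ones(n) / np.sqrt(n)
    for _ in range(100):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```

Geometrically inconsistent candidates receive near-zero affinity with every other candidate, so their eigenvector entries collapse toward zero while mutually consistent candidates reinforce each other.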
Figure 8: Additional qualitative results on SUNCG. Rows: G.T. Color; G.T. Scene; Ours; 4PCS; DL; GReg; CGReg; G.T. 1; Completed 1; G.T. 2; Completed 2.
Figure 9: Additional qualitative results on Matterport (same row layout as Figure 8).
Figure 10: Additional qualitative results on ScanNet (same row layout as Figure 8).