Learning multiview 3D point cloud registration

01/15/2020, by Zan Gojcic, et al.

We present a novel, end-to-end learnable, multiview 3D point cloud registration algorithm. Registration of multiple scans typically follows a two-stage pipeline: initial pairwise alignment and globally consistent refinement. The former is often ambiguous due to the low overlap of neighboring point clouds, symmetries, and repetitive scene parts. The latter global refinement therefore aims at establishing cyclic consistency across multiple scans and helps resolve the ambiguous cases. In this paper we propose, to the best of our knowledge, the first end-to-end algorithm for joint learning of both parts of this two-stage problem. Experimental evaluation on well-accepted benchmark datasets shows that our approach outperforms the state-of-the-art by a significant margin, while being end-to-end trainable and computationally less costly. Moreover, we present a detailed analysis and an ablation study that validate the novel components of our approach. The source code and pretrained models will be made publicly available at https://github.com/zgojcic/3D_multiview_reg.


1 Introduction

First two authors contributed equally to this work.

The capability of aligning and fusing multiple scans is essential for the tasks of structure from motion and 3D reconstruction, with several use cases in augmented reality and robotics. Hence, markerless registration of 3D point clouds, either based on geometric constraints [47, 65, 54] or on hand-engineered 3D feature descriptors [34, 24, 51, 57], is a prerequisite for generating a holistic view of the scene for purposes like 3D reconstruction or semantic segmentation. Because all scene parts are usually captured from different viewpoints, often with small overlap, registration of adjacent scans is hard. While hand-engineered feature descriptors have shown successful results to some extent, a significant performance boost can be observed since the comeback of deep learning. Current research on local descriptors for pairwise registration of 3D point clouds is centered on deep learning approaches [66, 35, 19, 63, 17, 25] that succeed in capturing and encoding evidence hidden to hand-engineered descriptors. Learning pairwise point cloud matching end-to-end was recently proposed by [61, 39]. Despite demonstrating good performance on many tasks, pairwise registration of individual views of a scene has some conceptual drawbacks: (i) low overlap of adjacent point clouds can lead to inaccurate or wrong matches, (ii) point cloud registration has to rely on very local evidence, which can be harmful if 3D scene structure is scarce or repetitive, and (iii) separate post-processing is required to combine all pairwise matches into a global representation.

While well-accepted methods do exist for pairwise registration [6, 68], registering multiple scans globally remains a challenge, as it is unclear how to best make use of the quadratically many pairwise relations.

Figure 1: Result of our end-to-end reconstruction on the 60 scans of the Kitchen scene from the 3DMatch benchmark [66].

In this paper, we present, to the best of our knowledge, the first end-to-end data-driven multiview point cloud registration algorithm. Our method takes a set of potentially overlapping point clouds as input and outputs a global/absolute transformation matrix for each of the input scans (Fig. 1). We depart from the traditional two-stage approach, in which the individual stages are detached from each other, and directly learn to register all views of a scene in a globally consistent manner.

The main contributions of our work are:

  • Building on [15] and [67], we propose a novel end-to-end pairwise 3D point cloud registration algorithm.

  • We propose a confidence estimation block that uses a novel overlap pooling layer to predict the confidence in the estimated pairwise transformation parameters.

  • We formulate the traditional two-stage approach in an end-to-end differentiable network that iteratively refines both the pairwise and absolute transformation estimates, the latter involving a differentiable synchronization layer.

Resulting from the aforementioned contributions, the proposed multiview registration algorithm (i) is very efficient to compute, (ii) achieves more accurate scan alignments because the residuals are fed back to the pairwise network in an iterative manner, and (iii) outperforms the current state-of-the-art on pairwise point cloud registration as well as on transformation synchronization.

2 Related Work

Pairwise registration

The traditional pairwise registration pipeline consists of two stages: the coarse alignment stage, which provides an initial estimate of the relative transformation parameters, and the refinement stage, which iteratively refines the transformation parameters by minimizing the 3D registration error under the assumption of a rigid transformation.

The former is traditionally performed using either handcrafted [51, 57, 56] or learned [66, 35, 19, 18, 63, 25, 15] 3D local feature descriptors to establish pointwise candidate correspondences, in combination with a RANSAC-like robust model estimation algorithm [23, 48, 37] or geometric hashing [22, 8, 29]. A parallel stream of works [1, 55, 41] relies on establishing correspondences using 4-point congruent sets. In the refinement stage, the coarse transformation parameters are often fine-tuned with a variant of the iterative closest point (ICP) algorithm [6]. ICP-like algorithms [38, 62] perform optimization by alternating between hypothesizing the correspondence set and estimating the new set of transformation parameters. They are known not to be robust against outliers and to converge to a global optimum only when starting from a good prealignment [9]. ICP algorithms are often extended to use additional radiometric, temporal, or odometry constraints [68]. Contemporary to our work, [61, 39] propose to integrate the coarse and fine registration stages for pairwise point cloud registration into an end-to-end learnable algorithm.

Multiview registration  Multiview, global point cloud registration methods aim at resolving hard or ambiguous cases that arise with pairwise methods by incorporating global cyclic consistency. A large body of literature [33, 27, 59, 2, 55, 3, 5, 4, 68, 7, 32] deals with multiview, global alignment of unorganized 3D point clouds. Most methods either rely on a good initialization of the pairwise transformation parameters [27, 59, 2, 3, 5, 4, 40, 10], which they try to synchronize, or incorporate pairwise keypoint correspondences in a joint optimization [68, 54, 7]. In general, these methods work well on data with low synthetic noise, but struggle on real scenes with high levels of clutter and occlusion [9]. A general drawback of this hierarchical procedure is that the global noise distribution over all scans ends up being far from random, i.e. significant biases persist due to the highly correlated initial pairwise registrations. [21] proposes a global point cloud registration approach using two networks: one for pose estimation of the input point clouds, and another modelling the scene structure by estimating the occupancy status of global coordinates. Probably the most similar work to ours is [32], where the authors adapt the edge weights for the transformation synchronization layer by learning a data-driven weighting function. A major conceptual difference to our approach is that the relative transformation parameters are estimated using FPFH [51] in combination with FGR [68] and thus, unlike ours, are not learned. Furthermore, in each iteration [32] has to convert the point clouds to depth images, as the weighting function is approximated by a 2D CNN. Our whole approach, on the other hand, operates directly on point clouds and is fully differentiable, and therefore facilitates learning global, multiview point cloud registration end-to-end.

3 End-to-End Multiview 3D Registration

In this section we derive the proposed multiview 3D registration algorithm as a composition of functions depending upon the data. The network architectures used to approximate these functions are then explained in detail in Sec. 4. We begin with a new algorithm for learned pairwise point cloud registration, which takes two point clouds as input and outputs the estimated transformation parameters (Sec. 3.1). This method is extended to multiple point clouds in Sec. 3.2 using a transformation synchronization layer amenable to backpropagation. The input graph to this synchronization layer is learned by a neural network, which encodes, along with the relative transformation parameters, the confidence of pairwise point cloud matches as edge information in the global scene graph. Finally, we use an iterative scheme (Sec. 3.3) to refine the global registration of all point clouds in the scene by updating the edge weights as well as the pairwise poses.

Consider a set $\mathcal{S} = \{\mathbf{S}_1, \dots, \mathbf{S}_{N_S}\}$ of potentially overlapping point clouds capturing a 3D scene from different viewpoints (i.e. poses). The task of multiview registration is to recover the rigid, absolute poses $\{\mathbf{M}_i\}_{i=1}^{N_S}$ given the scan collection, where

$$\mathbf{M}_i = \begin{bmatrix} \mathbf{R}_i & \mathbf{t}_i \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in SE(3) \qquad (1)$$

and $\mathbf{R}_i \in SO(3)$, $\mathbf{t}_i \in \mathbb{R}^{3}$. $\mathcal{S}$ can be augmented by connectivity information resulting in a finite graph $\mathcal{G}(\mathcal{S}, \mathcal{E})$, where each vertex represents a single point set and the edges $(i, j) \in \mathcal{E}$ encode the information about the relative rotation $\mathbf{R}_{ij}$ and translation $\mathbf{t}_{ij}$ between the vertices. These relative transformation parameters satisfy $\mathbf{R}_{ij} = \mathbf{R}_{ji}^{T}$ and $\mathbf{t}_{ij} = -\mathbf{R}_{ij}\mathbf{t}_{ji}$ as well as the compatibility constraint [4]

$$\mathbf{R}_{ij} \approx \mathbf{R}_i\mathbf{R}_j^{T}, \qquad \mathbf{t}_{ij} \approx \mathbf{t}_i - \mathbf{R}_{ij}\mathbf{t}_j \qquad (2)$$

In the current state-of-the-art [68, 32, 7], the edges of $\mathcal{G}$ are initialized with (noisy) relative transformation parameters $\{\hat{\mathbf{R}}_{ij}, \hat{\mathbf{t}}_{ij}\}$, obtained by an independent, auxiliary pairwise registration algorithm. Global scene consistency is enforced via a subsequent synchronization algorithm. In contrast, we propose a joint approach where pairwise registration and transformation synchronization are tightly coupled as one fully differentiable component, which leads to an end-to-end learnable, global registration pipeline.

3.1 Pairwise registration of point clouds

Figure 2: Proposed pipeline for end-to-end multiview 3D point cloud registration. For each of the input point clouds we extract FCGF [15] features that are fed to the softNN layer to compute the stochastic correspondences for all point cloud pairs. These correspondences are used as input to the initial registration block (i.e. Reg. init.) that outputs the per-correspondence weights, initial transformation parameters, and per-point residuals. Along with the correspondences, the initial weights and residuals are then input to the registration refinement block (i.e. Reg. iter.), whose outputs are used to build the graph. After each iteration of the Transf-Sync layer, the estimated transformation parameters are used to pre-align the correspondences, which are concatenated with the weights and residuals from the previous iteration and fed anew to the Reg. iter. block. We iterate over the Reg. iter. and Transf-Sync layers four times.

In the following, we introduce a differentiable, pairwise registration algorithm that can easily be incorporated into an end-to-end multiview 3D registration algorithm. Let $\mathbf{P} \in \mathbb{R}^{N\times3}$ and $\mathbf{Q} \in \mathbb{R}^{M\times3}$ denote a pair of point clouds, where $\mathbf{p}_i \in \mathbb{R}^{3}$ and $\mathbf{q}_j \in \mathbb{R}^{3}$ represent the coordinate vectors of individual points in point clouds $\mathbf{P}$ and $\mathbf{Q}$, respectively. The goal of pairwise registration is to retrieve the optimal $\hat{\mathbf{R}} \in SO(3)$ and $\hat{\mathbf{t}} \in \mathbb{R}^{3}$:

$$\hat{\mathbf{R}}, \hat{\mathbf{t}} = \underset{\mathbf{R}, \mathbf{t}}{\arg\min}\; \sum_{i=1}^{N} \big\|\mathbf{R}\mathbf{p}_i + \mathbf{t} - \phi(\mathbf{p}_i, \mathbf{Q})\big\|_2^{2} \qquad (3)$$

where $\phi(\mathbf{p}_i, \mathbf{Q})$ is a correspondence function that maps the points $\mathbf{p}_i$ to their corresponding points in point cloud $\mathbf{Q}$. The formulation of Eq. 3 facilitates a differentiable closed-form solution, which is, subject to the noise distribution, close to the ground truth solution [53]. However, least squares solutions are not robust, and thus Eq. 3 will yield wrong transformation parameters in case of a high outlier ratio. In practice, the mapping $\phi(\cdot)$ is far from ideal and erroneous correspondences typically dominate. To circumvent that, Eq. 3 can be robustified against outliers by introducing a heteroscedastic weighting matrix [58, 53]:

$$\hat{\mathbf{R}}, \hat{\mathbf{t}} = \underset{\mathbf{R}, \mathbf{t}}{\arg\min}\; \sum_{i=1}^{N} w_i \big\|\mathbf{R}\mathbf{p}_i + \mathbf{t} - \phi(\mathbf{p}_i, \mathbf{Q})\big\|_2^{2} \qquad (4)$$

where $w_i$ is the weight of the putative correspondence $(\mathbf{p}_i, \phi(\mathbf{p}_i, \mathbf{Q}))$ computed by some weighting function $\psi(\cdot)$ with $w_i \in [0, 1]$. Assuming that $w_i$ is close to one when the putative correspondence is an inlier and close to zero otherwise, Eq. 4 will yield the correct transformation parameters while retaining a differentiable closed-form solution [53]. Hereinafter we denote this closed-form solution WLS trans. and, for the sake of completeness, its derivation is provided in the supplementary material.
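To make the above concrete, the following is a minimal PyTorch sketch of such a weighted closed-form solution (the Kabsch-style derivation is summarized in Eqs. 22-27 of the supplementary material); the function name and tensor layout are our own choices, not the paper's implementation:

```python
import torch

def wls_transform(p, q, w, eps=1e-8):
    """Differentiable weighted least-squares estimate of R, t (Eq. 4).

    p, q: (N, 3) putative correspondences, w: (N,) non-negative weights.
    Minimizes sum_i w_i || R p_i + t - q_i ||^2 in closed form.
    """
    w = w / (w.sum() + eps)                    # normalize the weights
    p_bar = (w[:, None] * p).sum(0)            # weighted centroids (Eq. 23)
    q_bar = (w[:, None] * q).sum(0)
    p_c, q_c = p - p_bar, q - q_bar            # centered coordinates (Eq. 24)
    S = (w[:, None] * p_c).T @ q_c             # 3x3 weighted covariance (Eq. 25)
    U, _, Vh = torch.linalg.svd(S)             # S = U diag(sigma) V^T
    V = Vh.transpose(-2, -1)
    d = torch.det(V @ U.T)                     # +1 or -1; guards against reflections
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = V @ D @ U.T                            # Eq. 26
    t = q_bar - R @ p_bar                      # Eq. 27
    return R, t
```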

3.2 Differentiable transformation synchronization

Returning to the original task of multiview registration, we again consider the initial set of point clouds $\mathcal{S}$. If no prior connectivity information is given, the graph $\mathcal{G}$ can be initialized by forming all point cloud pairs and estimating their relative transformation parameters as described in Sec. 3.1. The global transformation parameters can then be estimated either jointly (transformation synchronization) [27, 5, 4, 10] or by dividing the problem into rotation synchronization [2, 3] and translation synchronization [31]. In the following paragraphs we summarize a differentiable, closed-form solution of the latter approach based on spectral relaxation [2, 3, 31].

Rotation synchronization  The goal of rotation synchronization is to retrieve the global rotation matrices $\{\mathbf{R}_i\}$ by solving the following minimization problem based on their observed ratios $\mathbf{R}_{ij}$:

$$\hat{\mathbf{R}}_i = \underset{\mathbf{R}_i \in SO(3)}{\arg\min}\; \sum_{(i,j)\in\mathcal{E}} c_{ij}\,\big\|\mathbf{R}_{ij} - \mathbf{R}_i\mathbf{R}_j^{T}\big\|_F^{2} \qquad (5)$$

Under spectral relaxation, Eq. 5 admits a closed-form solution as follows [2, 3]. Consider a symmetric matrix of relative transformations $\mathbf{L} = \mathbf{D} - \mathbf{A}$ [52], where $\mathbf{D}$ is a block-diagonal degree matrix with $\mathbf{D}_{ii} = \sum_{j\in\mathcal{N}(i)} c_{ij}\,\mathbf{I}_3$ and $\mathbf{A}$ is constructed from $3\times3$ blocks as

$$\mathbf{A}_{ij} = \begin{cases} c_{ij}\,\mathbf{R}_{ij}, & (i,j)\in\mathcal{E} \\ \mathbf{0}_{3\times3}, & \text{otherwise} \end{cases} \qquad (6)$$

where the weights $c_{ij}$ represent the confidence in the relative transformation parameters $\mathbf{M}_{ij}$. The least squares estimates of the global rotation matrices $\{\hat{\mathbf{R}}_i\}$ are then given, under relaxed orthonormality and determinant constraints, by the three eigenvectors $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3] \in \mathbb{R}^{3N_S\times3}$ corresponding to the smallest eigenvalues of $\mathbf{L}$. Consequently, the nearest rotation matrices under the Frobenius norm can be obtained by projecting the $3\times3$ submatrices of $\mathbf{V}$ onto the orthonormal matrices and enforcing $\det(\hat{\mathbf{R}}_i) = 1$ (to avoid reflections).
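As a sanity check of the construction above, a small dense sketch of this spectral solver can be written as follows. It is an illustrative implementation under the conventions of Eqs. 2 and 6 (the edge containers and function name are ours), not the network layer itself:

```python
import torch

def rotation_sync(R_rel, conf, n):
    """Spectral rotation synchronization (Eqs. 5-6).

    R_rel: dict (i, j) -> R_ij (3, 3); must contain both directions of
    every edge, with R_rel[(j, i)] = R_rel[(i, j)].T. conf: dict of c_ij.
    """
    A = torch.zeros(3 * n, 3 * n)
    deg = torch.zeros(n)
    for (i, j), R_ij in R_rel.items():
        A[3*i:3*i+3, 3*j:3*j+3] = conf[(i, j)] * R_ij
        deg[i] += conf[(i, j)]
    L = torch.kron(torch.diag(deg), torch.eye(3)) - A   # L = D - A
    _, vecs = torch.linalg.eigh(L)                      # eigenvalues ascending
    V = vecs[:, :3]                                     # (3n, 3) eigenvectors
    R_abs = []
    for i in range(n):
        # nearest rotation (Frobenius) to each 3x3 sub-block, det = +1
        U, _, Wh = torch.linalg.svd(V[3*i:3*i+3])
        d = torch.det(U @ Wh)
        D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
        R_abs.append(U @ D @ Wh)
    return torch.stack(R_abs)
```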

Translation synchronization

Similarly, the goal of translation synchronization is to retrieve the global translation vectors $\{\mathbf{t}_i\}$ that minimize the following least squares problem:

$$\hat{\mathbf{t}}_i = \underset{\mathbf{t}_i}{\arg\min}\; \sum_{(i,j)\in\mathcal{E}} c_{ij}\,\big\|\mathbf{R}_{ij}\mathbf{t}_j + \mathbf{t}_{ij} - \mathbf{t}_i\big\|_2^{2} \qquad (7)$$

Considering the normalized version of the above defined matrix of relative transformations $\mathbf{L}$, the closed-form solution of Eq. 7 can be written as [32]

$$\hat{\mathbf{t}} = \mathbf{L}^{+}\mathbf{b} \qquad (8)$$

where $\hat{\mathbf{t}} = [\hat{\mathbf{t}}_1^{T}, \dots, \hat{\mathbf{t}}_{N_S}^{T}]^{T}$ and $\mathbf{b} = [\mathbf{b}_1^{T}, \dots, \mathbf{b}_{N_S}^{T}]^{T}$ with

$$\mathbf{b}_i = \sum_{j\in\mathcal{N}(i)} c_{ij}\,\mathbf{t}_{ij} \qquad (9)$$

where $\mathcal{N}(i)$ denotes all the neighboring vertices of $\mathbf{S}_i$ in graph $\mathcal{G}$.
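Under the same assumptions and containers as the rotation sketch above, the translation step reduces to assembling $\mathbf{L}$ and $\mathbf{b}$ and applying a pseudo-inverse (again an illustrative sketch, not the paper's layer):

```python
import torch

def translation_sync(R_rel, t_rel, conf, n):
    """Least-squares translation synchronization (Eqs. 7-9).

    Builds the same L = D - A as in rotation synchronization and solves
    t = L^+ b with b_i = sum_j c_ij t_ij (conventions of Eq. 2).
    """
    L = torch.zeros(3 * n, 3 * n)
    b = torch.zeros(3 * n)
    for (i, j), R_ij in R_rel.items():
        c = conf[(i, j)]
        L[3*i:3*i+3, 3*i:3*i+3] += c * torch.eye(3)   # degree blocks D_ii
        L[3*i:3*i+3, 3*j:3*j+3] -= c * R_ij           # off-diagonal -A_ij
        b[3*i:3*i+3] += c * t_rel[(i, j)]             # Eq. 9
    return (torch.linalg.pinv(L) @ b).reshape(n, 3)   # Eq. 8
```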

3.3 Iterative refinement of the registration

The above formulation (Secs. 3.1 and 3.2) facilitates an implementation in an iterative scheme, which in turn can be viewed as an iteratively reweighted least-squares (IRLS) algorithm. We can start each subsequent iteration $k$ by pre-aligning the point cloud pairs using the synchronized estimates of the relative transformation parameters from iteration $k-1$, such that $\mathbf{P}^{(k)} = \mathbf{M}_{ij}^{(k-1)} \circ \mathbf{P}$, where $\circ$ denotes applying the transformation to the point cloud. Additionally, the weights $\mathbf{w}^{(k-1)}$ and residuals $\mathbf{r}^{(k-1)}$ of the previous iteration can be used as side information in the correspondence weighting function. Therefore, $\psi(\cdot)$ is extended to

$$\mathbf{w}^{(k)} = \psi\big(\mathbf{x}^{(k)}, \mathbf{w}^{(k-1)}, \mathbf{r}^{(k-1)}\big) \qquad (10)$$

where $\mathbf{x}^{(k)}$ denotes the pre-aligned putative correspondences. Analogously, the difference between the input and the synchronized transformation parameters of the $k$-th iteration can be used as an additional cue for estimating the confidence $c_{ij}$. Thus, the confidence function $\zeta(\cdot)$ (Sec. 4) can be extended to

$$c_{ij}^{(k)} = \zeta\big(\,\cdot\,,\; \hat{\mathbf{M}}_{ij}^{(k)} - \mathbf{M}_{ij}^{(k)}\big) \qquad (11)$$
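A minimal sketch of this per-iteration update, under the pose conventions of Eq. 2 (the helper name is ours; the 8-dimensional feature layout follows the footnote in Sec. A.2.2, but is otherwise an assumption):

```python
import torch

def prealign_pair(p, q, w, R_i, t_i, R_j, t_j):
    """Pre-align correspondences of the pair (S_i, S_j) with the synchronized
    absolute poses and build the extended input of Eq. 10.
    """
    R_ji = R_j @ R_i.T                  # synchronized relative rotation (Eq. 2)
    t_ji = t_j - R_ji @ t_i             # maps points of S_i into the frame of S_j
    p_aligned = p @ R_ji.T + t_ji
    r = (p_aligned - q).norm(dim=1)     # per-correspondence residuals
    # (N, 8): 6-dim correspondences + previous weights + residuals
    return torch.cat([p_aligned, q, w[:, None], r[:, None]], dim=1)
```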

4 Network Architecture

We implement our proposed multiview registration algorithm as a deep neural network (Fig. 2). In this section, we first describe the architectures used to approximate $\phi(\cdot)$ and $\psi(\cdot)$, before integrating them into one fully differentiable, end-to-end trainable algorithm.

Learned correspondence function

Our approximation to the correspondence function $\phi(\cdot)$ extends the recently proposed fully convolutional 3D feature descriptor FCGF [15] with a soft assignment layer. FCGF operates on sparse tensors [14] and computes 32-dimensional descriptors for each point of the sparse point cloud in a single pass. Note that the function $\phi(\cdot)$ could be approximated with any of the recently proposed learned feature descriptors [35, 18, 19, 25], but we choose FCGF due to its high accuracy and low computational complexity.

Let $\mathbf{F}_{\mathbf{P}}$ and $\mathbf{F}_{\mathbf{Q}}$ denote the FCGF embeddings of point clouds $\mathbf{P}$ and $\mathbf{Q}$, respectively. Pointwise correspondences can then be established by a nearest neighbor (NN) search in this high dimensional feature space. However, the selection rule of such hard assignments is not differentiable. We therefore form the NN-selection rule in a probabilistic manner by computing a probability vector $\mathbf{s}_i$ of the categorical distribution [45]. The stochastic correspondence of the point $\mathbf{p}_i$ in the point cloud $\mathbf{Q}$ is then defined as

$$\phi(\mathbf{p}_i, \mathbf{Q}) = \mathbf{Q}^{T}\mathbf{s}_i \quad \text{with} \quad s_{ij} = \frac{\exp(-d_{ij}/t)}{\sum_{l=1}^{M}\exp(-d_{il}/t)} \qquad (12)$$

where $d_{ij} = \|\mathbf{f}_{\mathbf{p}_i} - \mathbf{f}_{\mathbf{q}_j}\|_2$, $\mathbf{f}_{\mathbf{p}_i}$ is the FCGF embedding of the point $\mathbf{p}_i$, and $t$ denotes the temperature parameter. In the limit $t \to 0^{+}$ the soft assignment converges to the deterministic NN search [45].
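In PyTorch, this softNN layer amounts to a temperature-scaled softmax over feature distances; a minimal sketch (the temperature value below is a placeholder, not the paper's setting):

```python
import torch

def soft_nn(f_p, f_q, q, temperature=0.02):
    """Stochastic correspondences of Eq. 12.

    f_p: (N, d) FCGF embeddings of P, f_q: (M, d) embeddings of Q,
    q: (M, 3) coordinates of Q. Returns (N, 3) soft correspondences.
    """
    d = torch.cdist(f_p, f_q)                    # (N, M) feature-space distances
    s = torch.softmax(-d / temperature, dim=1)   # categorical probability vectors
    return s @ q                                 # expectation over Q, differentiable
```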

We follow [15] and supervise the learning of $\phi(\cdot)$ with a correspondence loss $\mathcal{L}_c$, which is defined as the hardest contrastive loss and operates on the FCGF embeddings:

$$\mathcal{L}_c = \sum_{(i,j)\in\mathcal{P}} \Big\{ \big[\|\mathbf{f}_i - \mathbf{f}_j\|_2 - m_p\big]_+^{2}/|\mathcal{P}| + \tfrac{1}{2}\big[m_n - \min_{k\in\mathcal{N}}\|\mathbf{f}_i - \mathbf{f}_k\|_2\big]_+^{2}/|\mathcal{P}_i| + \tfrac{1}{2}\big[m_n - \min_{k\in\mathcal{N}}\|\mathbf{f}_j - \mathbf{f}_k\|_2\big]_+^{2}/|\mathcal{P}_j| \Big\}$$

where $\mathcal{P}$ is the set of all positive pairs in an FCGF mini-batch and $\mathcal{N}$ is a random subset of all features that is used for the hardest negative mining. $m_p$ and $m_n$ are the margins for positive and negative pairs, respectively. The detailed network architecture of $\phi(\cdot)$ as well as the training configuration and parameters are available in the supplementary material.

Deep pairwise registration

Despite the good performance of the FCGF descriptor, several putative correspondences will be false. Furthermore, the distribution of inliers and outliers does not resemble random noise but rather shows regularity [49]. We thus aim to learn this regularity from the data using a deep neural network. Recently, several networks representing a complex weighting function for filtering 2D [42, 49, 67] or 3D [26] feature correspondences have been proposed.

Herein, we propose extending the 3D outlier filtering network [26], which is based on [42], with the order-aware blocks proposed in [67]. Specifically, we create a pairwise registration block $\psi^{(1)}(\cdot)$ that takes the coordinates of the putative correspondences $\mathbf{x}$ as input and outputs weights $\mathbf{w}^{(1)}$ that are fed, along with $\mathbf{x}$, into the closed-form solution of Eq. 4 to obtain $\hat{\mathbf{R}}^{(1)}$ and $\hat{\mathbf{t}}^{(1)}$. Motivated by the results in [49, 67], we add another registration block $\psi^{(2)}(\cdot)$ to our network and append the weights $\mathbf{w}^{(1)}$ and the pointwise residuals $\mathbf{r}^{(1)}$ to the original input, s.t. $\mathbf{x}^{(2)} = [\mathbf{x}, \mathbf{w}^{(1)}, \mathbf{r}^{(1)}]$ (see Sec. 3.3). The weights $\mathbf{w}^{(2)}$ are then again fed, together with the initial correspondences, to the closed-form solution of Eq. 4 to obtain the refined pairwise transformation parameters. In order to ensure permutation invariance, a PointNet-like [46] architecture that operates on individual correspondences is used in both registration blocks. As each branch only operates on individual correspondences, the local 3D context information is gathered in the intermediate layers using symmetric context normalization [64] and order-aware filtering layers [67]. The detailed architecture of the registration block is available in the supplementary material. Training of the registration network is supervised using the registration loss defined for a batch with $N_b$ examples as

$$\mathcal{L}_{\text{reg}} = \frac{1}{N_b}\sum_{b=1}^{N_b}\big(\lambda_1\,\mathcal{L}_{\text{BCE}}^{(b)} + \lambda_2\,\mathcal{L}_{\text{trans}}^{(b)}\big) \qquad (13)$$

where $\mathcal{L}_{\text{BCE}}$ denotes the binary cross entropy loss on the predicted correspondence weights and

$$\mathcal{L}_{\text{trans}} = \frac{1}{N}\sum_{i=1}^{N}\big\|(\hat{\mathbf{R}}\mathbf{p}_i + \hat{\mathbf{t}}) - (\mathbf{R}^{*}\mathbf{p}_i + \mathbf{t}^{*})\big\|_2^{2} \qquad (14)$$

is used to penalize the deviation from the ground truth transformation parameters $\mathbf{R}^{*}$ and $\mathbf{t}^{*}$. $\lambda_1$ and $\lambda_2$ are used to control the contribution of the individual loss functions.
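As a sketch, the per-pair loss can be assembled as follows (the weighting factors and the squared point-distance form of Eq. 14 are assumptions on our part):

```python
import torch
import torch.nn.functional as F

def registration_loss(logits, labels, R_est, t_est, R_gt, t_gt, p,
                      lambda1=1.0, lambda2=1.0):
    """Per-pair registration loss (Eqs. 13-14).

    logits: (N,) correspondence scores, labels: (N,) 0/1 inlier labels,
    p: (N, 3) source points on which the transformation error is measured.
    """
    l_bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    p_est = p @ R_est.T + t_est       # points under the estimated transformation
    p_gt = p @ R_gt.T + t_gt          # points under the ground truth
    l_trans = (p_est - p_gt).pow(2).sum(dim=1).mean()
    return lambda1 * l_bce + lambda2 * l_trans
```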

Confidence estimation block

Along with the estimated relative transformation parameters $\hat{\mathbf{M}}_{ij}$, the edges of the graph $\mathcal{G}$ encode the confidence $c_{ij}$ in those estimates. The confidence encoded in each edge of the graph consists of (i) the local confidence $c_{ij}^{\text{local}}$ of the pairwise transformation estimation and (ii) the global confidence $c_{ij}^{\text{global}}$ derived from the transformation synchronization. We formulate the estimation of $c_{ij}^{\text{local}}$ as a classification task and argue that some of the required information is encompassed in the features of the second-to-last layer of the registration block. Let $\mathbf{X} \in \mathbb{R}^{N\times128}$ denote the output of the second-to-last layer of the registration block. We propose an overlap pooling layer that extracts a global feature $\mathbf{x}_{\mathcal{O}}$ by performing weighted average pooling:

$$\mathbf{x}_{\mathcal{O}} = \frac{\sum_{i=1}^{N} w_i\,\mathbf{x}_i}{\sum_{i=1}^{N} w_i} \qquad (15)$$

The obtained global feature $\mathbf{x}_{\mathcal{O}}$ is concatenated with the ratio of inliers $r_{\text{in}}$ (i.e., the fraction of putative correspondences whose weights are higher than a given threshold) and fed to the confidence estimation network $\zeta(\cdot)$ with three fully connected layers, each followed by a ReLU activation function. The local confidence can thus be expressed as

$$c_{ij}^{\text{local}} = \zeta\big([\mathbf{x}_{\mathcal{O}},\; r_{\text{in}}]\big) \qquad (16)$$

The training of the confidence estimation block is supervised with the confidence loss function $\mathcal{L}_{\text{conf}} = \frac{1}{N_p}\sum \text{BCE}(c_{ij}^{\text{local}}, c_{ij}^{*})$ ($N_p$ denotes the number of point cloud pairs), where BCE refers to the binary cross entropy and the ground truth confidence labels $c_{ij}^{*}$ are computed on the fly by thresholding the angular error of the estimated relative transformation.
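A compact sketch of this block is given below; the hidden-layer sizes and the inlier threshold are placeholders we chose for illustration:

```python
import torch
import torch.nn as nn

class ConfidenceBlock(nn.Module):
    """Overlap pooling (Eq. 15) followed by a small MLP head (Eq. 16)."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))

    def forward(self, feats, w, thresh=0.5):
        # feats: (N, 128) second-to-last layer features, w: (N,) weights
        x_o = (w[:, None] * feats).sum(0) / (w.sum() + 1e-8)    # overlap pooling
        r_in = (w > thresh).float().mean().unsqueeze(0)         # inlier ratio
        return torch.sigmoid(self.mlp(torch.cat([x_o, r_in])))  # local confidence
```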

The $c_{ij}^{\text{local}}$ incorporates the local confidence in the relative transformation parameters. The output of the transformation synchronization layer, on the other hand, provides the information how well the input relative transformations agree globally with the other edges. In fact, traditional synchronization algorithms [12, 4, 31] only use this global information to perform the reweighting of the edges in their iterative solutions, because they do not have access to the local confidence information. The global confidence in the relative transformation parameters can be expressed with the Cauchy weighting function [30, 4]

$$c_{ij}^{\text{global}} = \frac{1}{1 + (r_{ij}/\nu)^{2}} \qquad (17)$$

where $r_{ij} = \|\mathbf{R}_{ij} - \hat{\mathbf{R}}_i\hat{\mathbf{R}}_j^{T}\|_F$ and, following [30, 4], $\nu = \text{med}(\mathbf{r})$ with $\text{med}(\cdot)$ denoting the median operator and $\mathbf{r}$ the vectorization of the residuals $r_{ij}$. Since local and global confidence provide complementary information about the relative transformation parameters, we combine them into a joint confidence $c_{ij}$ using their harmonic mean:

$$c_{ij} = \frac{(1+\beta^{2})\,c_{ij}^{\text{local}}\,c_{ij}^{\text{global}}}{\beta^{2}\,c_{ij}^{\text{local}} + c_{ij}^{\text{global}}} \qquad (18)$$

where $\beta$ balances the contribution of the local and global confidence estimates and is learned during training.
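Both weighting steps are a couple of lines in PyTorch; the sketch below mirrors Eqs. 17-18 (the exact residual definition and the weighted harmonic-mean form are reconstructions, as noted above):

```python
import torch

def cauchy_confidence(residuals):
    """Global edge confidence (Eq. 17); the scale nu is the median residual."""
    nu = residuals.median()
    return 1.0 / (1.0 + (residuals / nu) ** 2)

def joint_confidence(c_local, c_global, beta=1.0):
    """Weighted harmonic mean of local and global confidence (Eq. 18)."""
    return ((1 + beta ** 2) * c_local * c_global
            / (beta ** 2 * c_local + c_global + 1e-8))
```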

           3DMatch [66]  CGF [35]  PPFNet [19]  3DSN [25]  FCGF [15]  Ours (1-iter)  Ours (4-iter)
Kitchen
Home 1
Home 2
Hotel 1
Hotel 2
Hotel 3
Study
MIT Lab
Average
Table 1: Registration recall on the 3DMatch dataset. 1-iter and 4-iter denote the result of the pairwise registration network and the input to the 4th Transf-Sync layer, respectively. Best results, except for 4-iter, which is informed by the global information, are shown in bold.
End-to-end multiview 3D registration

The individual parts of the network are connected into an end-to-end multiview 3D registration algorithm as shown in Fig. 2. The network is implemented in PyTorch [43]; a pseudo-code of the proposed approach is provided in the supplementary material. We pre-train the individual sub-networks (training details available in the supplementary material) before fine-tuning the whole model in an end-to-end manner on the 3DMatch dataset [66] using the official train/test data split. In fine-tuning, we use $\phi(\cdot)$ to extract the FCGF features and randomly sample the feature vectors of 2048 points per fragment. These features are used in the soft assignment (softNN) to form the putative correspondences of the point cloud pairs (we assume a fully connected graph during training, but are able to consider connectivity information, if provided), which are fed to the pairwise registration network. The output of the pairwise registration is used to build the graph, which is the input to the transformation synchronization layer. The iterative refinement of the transformation parameters is performed four times. We supervise the fine-tuning using the joint multiview registration loss

$$\mathcal{L} = \mathcal{L}_{\text{reg}} + \mathcal{L}_{\text{conf}} + \mathcal{L}_{\text{sync}} \qquad (19)$$

where the transformation synchronization loss reads

$$\mathcal{L}_{\text{sync}} = \frac{1}{|\mathcal{E}|}\sum_{(i,j)\in\mathcal{E}}\Big(\big\|\hat{\mathbf{R}}_i\hat{\mathbf{R}}_j^{T} - \mathbf{R}_{ij}^{*}\big\|_F + \big\|\hat{\mathbf{t}}_{ij} - \mathbf{t}_{ij}^{*}\big\|_2\Big) \qquad (20)$$

with $\hat{\mathbf{t}}_{ij} = \hat{\mathbf{t}}_i - \hat{\mathbf{R}}_i\hat{\mathbf{R}}_j^{T}\hat{\mathbf{t}}_j$ and $\mathbf{R}_{ij}^{*}$, $\mathbf{t}_{ij}^{*}$ denoting the ground truth relative transformation parameters. We fine-tune the whole network using the Adam optimizer [36].

                 Per fragment pair                      Whole scene
                 NN search [s]  Model estimation [s]    Total time [s]
RANSAC
Ours (softNN)
Table 2: Average run-time for estimating the pairwise transformation parameters of one fragment pair on the 3DMatch dataset.

5 Experiments

We conduct the evaluation of our approach on the publicly available benchmark datasets 3DMatch [66], Redwood [13], and ScanNet [16]. First, we evaluate the performance, efficiency, and generalization capacity of the proposed pairwise registration algorithm on the 3DMatch and Redwood datasets, respectively (Sec. 5.1). We then evaluate the whole pipeline on the global registration of point cloud fragments generated from RGB-D images of the ScanNet dataset [16].

                            Rotation error (deg)                      Translation error (m)
                            3°    5°    10°   30°   45°   Mean/Med.   0.05  0.1   0.25  0.5   0.75  Mean/Med.
Pairwise (All)
  FGR [68]                  9.9   16.8  23.5  31.9  38.4  -/-         5.5   13.3  22.0  29.0  36.3  1.67/-
  Ours (1 iter.)            32.6  37.1  41.0  46.4  49.3  -/-         25.5  34.2  40.1  43.4  46.8  1.37/1.02
Edge Pruning (All)
  Ours (4 iter.)            34.3  38.7  42.2  48.2  51.9  -/-         26.7  35.7  41.8  45.5  49.4  1.26/0.88
  Ours (After Sync.)        48.3  53.6  58.9  63.2  64.0  -/-         34.5  49.1  58.5  61.6  63.9  0.83/0.55
FGR (Good)
  FastGR [68]               12.4  21.4  29.5  38.6  45.1  -/-         7.7   17.6  28.2  36.2  43.4  1.43/-
  GeoReg (FGR) [13]         0.2   0.6   2.8   16.4  27.1  -/-         0.1   0.7   4.8   16.4  28.4  1.80/-
  EIGSE3 (FGR) [4]          1.5   4.3   12.1  34.5  47.7  -/-         1.2   4.1   14.7  32.6  46.0  1.29/-
  RotAvg (FGR) [11]         6.0   10.4  17.3  36.1  46.1  -/-         3.7   9.2   19.5  34.0  45.6  1.26/-
  L2Sync (FGR) [32]         34.4  41.1  49.0  58.9  62.3  -/-         2.0   7.3   22.3  36.9  48.1  1.16/-
Ours (Good)
  Ours (1 iter.)            57.7  65.5  71.3  76.5  78.1  -/-         44.8  60.3  69.6  73.1  75.5  0.62/0.20
  Ours (4 iter.)            60.6  68.3  73.7  78.9  81.0  -/-         47.1  63.3  72.2  76.2  78.7  0.54/0.17
  Ours (After Sync.)        65.8  72.8  77.6  81.9  83.2  -/-         48.4  67.2  76.5  79.7  82.0  0.46/0.22
Table 3: Multiview registration evaluation on the ScanNet [16] dataset. We report the ECDF values for the rotation and translation errors. Best results are shown in bold.

5.1 Pairwise registration on 3DMatch dataset

We begin by evaluating the pairwise registration part of our algorithm on a traditional geometric registration task. We compare the results of our method to the state-of-the-art data-driven feature descriptors 3DMatch [66], CGF [35], PPFNet [19], 3DSmoothNet (3DSN) [25], and FCGF [15], which is also used as part of our algorithm. Following the evaluation procedure of 3DMatch [66], we complement all the descriptors with RANSAC-based transformation parameter estimation. For our approach we report the results of the pairwise registration network (1-iter in Tab. 1) as well as the output of the registration refinement block in the 4th iteration (4-iter in Tab. 1). The latter is already informed by the global information and serves only as verification that, across the iterations, our input to the Transf-Sync layer improves. Consistent with the 3DMatch evaluation procedure, we report the average recall per scene as well as for the whole dataset in Tab. 1.

The registration results show that our approach reaches the highest recall among all the evaluated methods. More importantly, they indicate that, using the same features (FCGF), our method can outperform RANSAC-based estimation of the transformation parameters while having a much lower time complexity (Tab. 2). The comparison of the 1-iter and 4-iter results also confirms the intuition that feeding the residuals and weights of the previous estimation back to the pairwise registration block helps refine the estimated pairwise transformation parameters.

Generalization to other domains  In order to test whether our pairwise registration model can generalize to new datasets and unseen domains, we perform a generalization evaluation on the synthetic indoor dataset Redwood indoor [13]. We follow the evaluation protocol of [13] and report the average registration recall and precision across all four scenes. We compare our approach to the recent data-driven approaches 3DMatch [66], CGF [35]+FGR [68], and RelativeNet (RN) [20], and to the traditional methods CZK [13] and Latent RANSAC (LR) [37]. Fig. 3 shows that our approach achieves a higher recall than the state-of-the-art without being trained on synthetic data, thus confirming the good generalization capacity of our approach. Note that while the average precision across the scenes is low for all the methods, several works [13, 35, 20] have shown that the precision can easily be increased using pruning, with almost no loss in recall.

5.2 Multiview registration performance

We evaluate the performance of our approach on the task of multiview registration using the ScanNet [16] dataset. ScanNet is a large RGB-D dataset of indoor scenes that provides the reconstructions, ground truth camera poses, and semantic segmentations of the scenes. To ensure a fair comparison, we follow [32] and use the same randomly sampled scenes for evaluation. For each scene we randomly sample RGB-D images that are 20 frames apart and convert them to point cloud fragments. The temporal sequence of the frames is discarded. In combination with the large gap between the frames, this makes the test setting extremely challenging. Different to [32], we do not train our network on ScanNet, but rather perform direct generalization.

Evaluation protocol  We use the standard evaluation protocol [12, 32] and report the empirical cumulative distribution function (ECDF) for the angular and translation deviations, defined as

$$\text{ECDF}(x) = \frac{\big|\{e \in \mathcal{E}_{\text{err}} : e < x\}\big|}{|\mathcal{E}_{\text{err}}|} \qquad (21)$$

where $\mathcal{E}_{\text{err}}$ denotes the set of pairwise angular or translation errors and $x$ the error threshold.
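For concreteness, the ECDF values of Tab. 3 can be computed as follows (the rotation thresholds follow the protocol of [32]; the translation thresholds are those of the table header):

```python
import torch

def ecdf(errors, thresholds):
    """Fraction of pairwise errors below each threshold (Eq. 21)."""
    errors = errors.flatten().float()
    return [(errors < t).float().mean().item() for t in thresholds]

# rotation_ecdf = ecdf(angular_errors_deg, [3, 5, 10, 30, 45])
# translation_ecdf = ecdf(translation_errors_m, [0.05, 0.1, 0.25, 0.5, 0.75])
```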

The ground truth rotations and translations are provided by the authors of ScanNet [16]. In Tab. 3 we report the results for three different scenarios. "FGR (Good)" and "Ours (Good)" denote the scenarios in which we follow [32] and use the computed pairwise registrations to prune an edge before the transformation synchronization if the median point distance in the overlapping region after the transformation is larger than a threshold, set separately for FGR and for our method. (The overlapping regions are defined as the parts where, after the transformation, the points are less than a fixed distance away from the other point cloud [32].) On the other hand, "All" denotes the scenario in which all pairs are used to build the graph. In all scenarios we prune the edges of the graph if the estimated confidence in the relative transformation parameters of that edge drops below a threshold. This threshold was determined on the 3DMatch dataset, and its effect on the performance of our approach is analyzed in detail in the supplementary material.

Analysis of the results  As shown in Tab. 3, our approach achieves a large improvement on the multiview registration task when compared to the baselines. Not only are the initial pairwise relative transformation parameters estimated with our approach more accurate than those of FGR [68], but they are also further improved in the subsequent iterations. This clearly confirms the benefit of the feedback loop of our algorithm. Furthermore, even when directly considering all input edges, our approach still proves dominant, also against the results of the "Good" scenario of our competitors. More qualitative results of the multiview registration evaluation are available in the supplementary material.

Figure 3: Registration results on the Redwood indoor data set.

Computation time  A low computational cost of pairwise and multiview registration is important for various fields like augmented reality and robotics. We first compare the computation time of our pairwise registration component to RANSAC. In Tab. 2 we report the average time needed to register one fragment pair of the 3DMatch dataset as well as one whole scene. All timings were performed on a standalone computer with an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz, a GeForce GTX 1080, and 32 GB RAM. Performing softNN for a fragment pair is approximately four times faster than a traditional nearest neighbor search (implemented using scikit-learn [44]). An even larger speedup is gained in the model estimation stage, where our approach requires a single forward pass (constant time) compared to up to 50000 iterations of RANSAC at the given inlier ratio and desired confidence. This results in a much lower overall run-time of our entire global, multiview approach (including the feature extraction and transformation synchronization) on the Kitchen scene with 1770 fragment pairs than feature extraction and pairwise estimation of the transformation parameters with RANSAC, which clearly shows the efficiency of our method.

5.3 Ablation study

To get a better intuition of how much the individual novelties of our approach contribute to the joint performance, we carry out an ablation study on the ScanNet [16] dataset. Specifically, we analyze the proposed edge pruning scheme, based on the confidence estimation block and the Cauchy function, as well as the impact of the iterative refinement of the relative transformation parameters. (Additional results of the ablation study are included in the supplementary material.) The results of the ablation study are presented in the form of translation error ECDFs in Fig. 4.

Benefit of the iterative refinement

We motivate the iterative refinement of the transformation parameters that are input to the Transf-Sync layer with the notion that the weights and residuals provide additional cues for their estimation. The results in Fig. 4 confirm this assumption: the input relative parameters in the 4th iteration are clearly better than the initial estimates. On the other hand, Fig. 4 also shows that with a high presence of outliers or ineffective edge pruning (see e.g. the results w/o edge pruning), the weights and the residuals actually provide a negative bias and worsen the results.

Edge pruning scheme  There are several possible ways to implement the pruning of the presumable outlier edges. In our experiments we prune the edges based on the output of the confidence estimation block (w-conf.). Other options are to realize this step using the global confidence, i.e. the Cauchy weights defined in Eq. 17 (w-Cau.), or to not perform it at all (w/o). Fig. 4 clearly shows the advantage of using our confidence estimation block. Moreover, due to preserving a large number of outliers, the alternative approaches perform even worse than the pairwise registration.

Figure 4: Ablation study on the ScanNet dataset.

6 Conclusions

We have introduced an end-to-end learnable, global, multiview point cloud registration algorithm. Our method departs from the common two-stage approach and directly learns to register all views in a globally consistent manner. We augment the 3D descriptor FCGF [15] with a soft correspondence layer that pairs all the scans to compute initial matches, which are fed to a differentiable pairwise registration block that outputs transformation parameters as well as weights. A pose graph is constructed, and a novel, differentiable, iterative transformation synchronization layer globally refines the weights and transformations. Experimental evaluation on common benchmark datasets shows that our method outperforms the state-of-the-art by more than 25 percentage points on average regarding the rotation error statistics. Moreover, our approach is significantly faster than RANSAC-based methods in a multiview setting of 60 scans, and generalizes better to new scenes (higher recall on Redwood indoor compared to the state-of-the-art).

References

  • [1] Dror Aiger, Niloy J Mitra, and Daniel Cohen-Or. 4-points congruent sets for robust pairwise surface registration. In ACM transactions on graphics (TOG), number 3, 2008.
  • [2] Mica Arie-Nachimson, Shahar Z Kovalsky, Ira Kemelmacher-Shlizerman, Amit Singer, and Ronen Basri. Global motion estimation from point matches. In International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 81–88, 2012.
  • [3] Federica Arrigoni, Luca Magri, Beatrice Rossi, Pasqualina Fragneto, and Andrea Fusiello. Robust absolute rotation estimation via low-rank and sparse matrix decomposition. In IEEE International Conference on 3D Vision (3DV), pages 491–498, 2014.
  • [4] Federica Arrigoni, Beatrice Rossi, and Andrea Fusiello. Spectral synchronization of multiple views in se(3). SIAM Journal on Imaging Sciences, 9(4):1963–1990, 2016.
  • [5] Florian Bernard, Johan Thunberg, Peter Gemmar, Frank Hertel, Andreas Husch, and Jorge Goncalves. A solution for multi-alignment by transformation synchronisation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2161–2169, 2015.
  • [6] PJ Besl and Neil D McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 14(2):239–256, 1992.
  • [7] Uttaran Bhattacharya and Venu Madhav Govindu. Efficient and robust registration on the 3d special euclidean group. In The IEEE International Conference on Computer Vision (ICCV), 2019.
  • [8] Tolga Birdal and Slobodan Ilic. Point pair features based object detection and pose estimation revisited. In IEEE International Conference on 3D Vision (3DV), 2015.
  • [9] Tolga Birdal and Slobodan Ilic. Cad priors for accurate and flexible instance reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [10] Tolga Birdal, Umut Simsekli, Mustafa Onur Eken, and Slobodan Ilic. Bayesian pose graph optimization via bingham distributions and tempered geodesic mcmc. In Advances in Neural Information Processing Systems (NIPS), pages 308–319, 2018.
  • [11] A Chatterjee and VM Govindu. Robust relative rotation averaging. IEEE transactions on pattern analysis and machine intelligence, 40(4):958–972, 2018.
  • [12] Avishek Chatterjee and Venu Madhav Govindu. Efficient and robust large-scale rotation averaging. In Proceedings of the IEEE International Conference on Computer Vision, pages 521–528, 2013.
  • [13] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. Robust reconstruction of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal convnets: Minkowski convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3075–3084, 2019.
  • [15] Christopher Choy, Jaesik Park, and Vladlen Koltun. Fully convolutional geometric features. In The IEEE International Conference on Computer Vision (ICCV), pages 8958–8966, 2019.
  • [16] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [17] Haowen Deng, Tolga Birdal, and Slobodan Ilic. PPF-FoldNet: Unsupervised learning of rotation invariant 3D local descriptors. In European Conference on Computer Vision (ECCV), 2018.
  • [18] Haowen Deng, Tolga Birdal, and Slobodan Ilic. Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. In European Conference on Computer Vision (ECCV), pages 602–618, 2018.
  • [19] Haowen Deng, Tolga Birdal, and Slobodan Ilic. Ppfnet: Global context aware local features for robust 3d point matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 195–205, 2018.
  • [20] Haowen Deng, Tolga Birdal, and Slobodan Ilic. 3d local features for direct pairwise registration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [21] Li Ding and Chen Feng. DeepMapping: Unsupervised map estimation from multiple point clouds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8650–8659, 2019.
  • [22] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 998–1005, 2010.
  • [23] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • [24] A. Flint, A. Dick, and A. van den Hengel. Thrift: Local 3D structure recognition. In 9th Biennial Conference of the Australian Pattern Recognition Society on Digital Image Computing Techniques and Applications, 2007.
  • [25] Zan Gojcic, Caifa Zhou, Jan D Wegner, and Andreas Wieser. The perfect match: 3d point cloud matching with smoothed densities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [26] Zan Gojcic, Caifa Zhou, and Andreas Wieser. Robust pointwise correspondences for point cloud based deformation monitoring of natural scenes. In 4th Joint International Symposium on Deformation Monitoring (JISDM), 2019.
  • [27] Venu Madhav Govindu. Lie-algebraic averaging for globally consistent motion estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 684–691, 2004.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [29] Stefan Hinterstoisser, Vincent Lepetit, Naresh Rajkumar, and Kurt Konolige. Going further with point pair features. In European conference on computer vision (ECCV), pages 834–848, 2016.
  • [30] Paul W. Holland and Roy E. Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics - Theory and Methods, 6(9):813–827, 1977.
  • [31] Xiangru Huang, Zhenxiao Liang, Chandrajit Bajaj, and Qixing Huang. Translation synchronization via truncated least squares. In Advances in neural information processing systems (NIPS), pages 1459–1468, 2017.
  • [32] Xiangru Huang, Zhenxiao Liang, Xiaowei Zhou, Yao Xie, Leonidas J Guibas, and Qixing Huang. Learning transformation synchronization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8082–8091, 2019.
  • [33] Daniel F Huber and Martial Hebert. Fully automatic registration of multiple 3d data sets. Image and Vision Computing, 21(7):637–650, 2003.
  • [34] A.E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1999.
  • [35] Marc Khoury, Qian-Yi Zhou, and Vladlen Koltun. Learning compact geometric features. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [36] Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In International Conference on Learning Representations 2015, 2015.
  • [37] Simon Korman and Roee Litman. Latent ransac. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6693–6702, 2018.
  • [38] Hongdong Li and Richard Hartley. The 3D-3D registration problem revisited. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
  • [39] Weixin Lu, Guowei Wan, Yao Zhou, Xiangyu Fu, Pengfei Yuan, and Shiyu Song. Deepvcp: An end-to-end deep neural network for point cloud registration. In IEEE International Conference on Computer Vision (ICCV), 2019.
  • [40] Eleonora Maset, Federica Arrigoni, and Andrea Fusiello. Practical and efficient multi-view matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 4568–4576, 2017.
  • [41] Nicolas Mellado, Dror Aiger, and Niloy J Mitra. Super 4pcs fast global pointcloud registration via smart indexing. In Computer Graphics Forum, volume 33, 2014.
  • [42] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2666–2674, 2018.
  • [43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [45] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. In Advances in Neural Information Processing Systems (NIPS), pages 1087–1098, 2018.
  • [46] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [47] T. Rabbani, S. Dijkman, F. van den Heuvel, and G. Vosselman. An integrated approach for modelling and global registration of point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 61:355–370, 2007.
  • [48] Rahul Raguram, Ondrej Chum, Marc Pollefeys, Jiri Matas, and Jan-Michael Frahm. Usac: a universal framework for random sample consensus. IEEE transactions on pattern analysis and machine intelligence, 35(8), 2012.
  • [49] René Ranftl and Vladlen Koltun. Deep fundamental matrix estimation. In European Conference on Computer Vision (ECCV), pages 284–299, 2018.
  • [50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [51] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3D registration. In IEEE International Conference on Robotics and Automation (ICRA), 2009.
  • [52] Amit Singer and H-T Wu. Vector diffusion maps and the connection laplacian. Communications on pure and applied mathematics, 65(8):1067–1144, 2012.
  • [53] Olga Sorkine-Hornung and Michael Rabinovich. Least-squares rigid motion using svd. Computing, 1(1), 2017.
  • [54] Pascal Theiler, Jan D. Wegner, and Konrad Schindler. Globally consistent registration of terrestrial laser scans via graph optimization. ISPRS Journal of Photogrammetry and Remote Sensing, 109:126–136, 2015.
  • [55] Pascal Willy Theiler, Jan Dirk Wegner, and Konrad Schindler. Keypoint-based 4-points congruent sets–automated marker-less registration of laser scans. ISPRS journal of photogrammetry and remote sensing, 2014.
  • [56] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique shape context for 3D data description. In Proceedings of the ACM workshop on 3D object retrieval, 2010.
  • [57] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of histograms for local surface description. In European conference on computer vision (ECCV), 2010.
  • [58] Philip HS Torr and David W Murray. The development and comparison of robust methods for estimating the fundamental matrix. International journal of computer vision, 24(3):271–300, 1997.
  • [59] Andrea Torsello, Emanuele Rodola, and Andrea Albarelli. Multiview registration via graph diffusion of dual quaternions. In CVPR 2011, pages 2441–2448. IEEE, 2011.
  • [60] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [61] Yue Wang and Justin M. Solomon. Deep closest point: Learning representations for point cloud registration. In The IEEE International Conference on Computer Vision (ICCV), pages 3523–3532, October 2019.
  • [62] Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. Go-icp: A globally optimal solution to 3d icp point-set registration. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 38(11):2241–2254, 2015.
  • [63] Zi Jian Yew and Gim Hee Lee. 3dfeat-net: Weakly supervised local 3d features for point cloud registration. In European Conference on Computer Vision, 2018.
  • [64] Kwang Moo Yi, Yannick Verdie, Pascal Fua, and Vincent Lepetit. Learning to assign orientations to feature points. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [65] B. Zeisl, K. Köser, and M. Pollefeys. Automatic registration of rgb-d scans via salient directions. In IEEE International Conference on Computer Vision, pages 2808–2815, 2013.
  • [66] Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [67] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In International Conference on Computer Vision (ICCV), 2019.
  • [68] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. In European Conference on Computer Vision (ECCV), pages 766–782, 2016.

A Supplementary Material

A.1 Closed-form solution of Eq. 4

For the sake of completeness, we summarize the closed-form differentiable solution of the weighted least squares pairwise registration problem

$$\hat{\mathbf{R}}, \hat{\mathbf{t}} = \underset{\mathbf{R}, \mathbf{t}}{\arg\min}\; \sum_{i=1}^{N} w_i \big\|\mathbf{R}\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i\big\|_2^{2} \qquad (22)$$

Let $\mathbf{q}_i := \phi(\mathbf{p}_i, \mathbf{Q})$ and

$$\bar{\mathbf{p}} = \frac{\sum_{i=1}^{N} w_i\,\mathbf{p}_i}{\sum_{i=1}^{N} w_i}, \qquad \bar{\mathbf{q}} = \frac{\sum_{i=1}^{N} w_i\,\mathbf{q}_i}{\sum_{i=1}^{N} w_i} \qquad (23)$$

denote the weighted centroids of point clouds $\mathbf{P}$ and $\mathbf{Q}$, respectively. We consider the case where $\sum_{i} w_i > 0$. The centered point coordinates can then be computed as

$$\tilde{\mathbf{p}}_i = \mathbf{p}_i - \bar{\mathbf{p}}, \qquad \tilde{\mathbf{q}}_i = \mathbf{q}_i - \bar{\mathbf{q}} \qquad (24)$$

Arranging the centered points back into the matrix forms $\tilde{\mathbf{P}}$ and $\tilde{\mathbf{Q}}$, a weighted covariance matrix $\mathbf{S} \in \mathbb{R}^{3\times3}$ can be computed as

$$\mathbf{S} = \tilde{\mathbf{P}}^{T}\mathbf{W}\tilde{\mathbf{Q}} \qquad (25)$$

where $\mathbf{W} = \text{diag}(w_1, \dots, w_N)$. Considering the singular value decomposition $\mathbf{S} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, the solution to Eq. 22 is given by

$$\hat{\mathbf{R}} = \mathbf{V}\,\text{diag}\big(1, 1, \det(\mathbf{V}\mathbf{U}^{T})\big)\,\mathbf{U}^{T} \qquad (26)$$

where $\det(\cdot)$ denotes the determinant and is used here to avoid creating a reflection matrix. Finally, $\hat{\mathbf{t}}$ is computed as

$$\hat{\mathbf{t}} = \bar{\mathbf{q}} - \hat{\mathbf{R}}\bar{\mathbf{p}} \qquad (27)$$
Figure 5: Network architecture of the FCGF [15] feature descriptor. Each convolutional layer (except the last one) is followed by batch normalization and a ReLU activation function. The numbers in parentheses denote the kernel size, stride, and number of kernels, respectively.

Figure 6: The network architecture of the registration block consists of two main modules: (i) a PointNet-like ResNet block with instance normalization, and (ii) an order-aware block. For each point cloud pair, the putative correspondences are fed into three consecutive ResNet blocks followed by a differentiable pooling layer, which maps the putative correspondences to clusters at the next level. These serve as input to the three order-aware blocks. Their output is fed, along with the features of the previous level, into the differentiable unpooling layer. The recovered features are then used as input to the remaining three ResNet blocks. The output of the registration block are the scores indicating whether each putative correspondence is an outlier or an inlier. Additionally, the 128-dim features before the last perceptron layer are used as input to the confidence estimation block.

A.2 Network architecture and training details

This section describes the network architecture as well as the training details of the FCGF [15] feature descriptor (Sec. A.2.1) and the registration block (Sec. A.2.2). Both networks are implemented in PyTorch and pretrained using the 3DMatch dataset [66].

A.2.1 FCGF local feature descriptor

Network architecture

The FCGF [15] feature descriptor operates on sparse tensors that represent a point cloud in the form of a set of unique coordinates $\mathbf{C}$ and their associated features $\mathbf{F}$:

$$\mathbf{C} = \begin{bmatrix} x_1 & y_1 & z_1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & z_N \end{bmatrix}, \qquad \mathbf{F} = \begin{bmatrix} \mathbf{f}_1^{T} \\ \vdots \\ \mathbf{f}_N^{T} \end{bmatrix} \qquad (28)$$

where $[x_i, y_i, z_i]$ are the coordinates of the $i$-th point in the point cloud and $\mathbf{f}_i$ is the associated feature (in our case simply 1). FCGF is implemented using the Minkowski Engine, an auto-differentiation library that provides support for sparse convolutions and implements all essential deep learning layers [14]. We adopt the original, fully convolutional network design of FCGF depicted in Fig. 5. It has a UNet structure [50] and utilizes skip connections and ResNet blocks [28] to extract the per-point 32-dimensional feature descriptors. To obtain the unique coordinates $\mathbf{C}$, we use a GPU implementation of voxel grid downsampling [14] with a fixed voxel size.
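The quantization step itself is simple enough to sketch in plain PyTorch (FCGF uses the Minkowski Engine utilities for this [14]; the voxel size below is a placeholder, not the paper's setting):

```python
import torch

def voxel_downsample(xyz, voxel_size=0.025):
    """Keep one representative point per occupied voxel.

    xyz: (N, 3) point coordinates. Returns the integer voxel coordinates C,
    the selected points, and the all-ones features F of Eq. 28.
    """
    coords = torch.floor(xyz / voxel_size).long()   # integer voxel indices
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    # index of the first point falling into each voxel
    first = torch.full((uniq.shape[0],), xyz.shape[0], dtype=torch.long)
    first.scatter_reduce_(0, inverse, torch.arange(xyz.shape[0]), reduce="amin")
    feats = torch.ones(uniq.shape[0], 1)            # features are simply 1
    return uniq, xyz[first], feats
```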

Training details

We again follow [15] and pre-train FCGF for 100 epochs using the point cloud fragments of the 3DMatch dataset [66]. We optimize the parameters of the network using Stochastic Gradient Descent (SGD) with an initial learning rate that is exponentially decayed during training. To introduce rotation invariance of the descriptors, we perform data augmentation by rotating each of the fragments along an arbitrary direction by a different, randomly sampled rotation.

A.2.2 Registration block

Network architecture

The architecture of the registration block (the same for $\psi^{(1)}(\cdot)$ and $\psi^{(2)}(\cdot)$; for $\psi^{(2)}(\cdot)$ the input dimension is increased from 6 to 8, as the weights and residuals are added) is based on the PointNet-like architecture [46], where each of the fully connected layers (Fig. 6) operates on individual correspondences. The local context is then aggregated using the instance normalization layers [60] defined as

$$\mathbf{y}_i^{(l)} = \frac{\mathbf{x}_i^{(l)} - \boldsymbol{\mu}^{(l)}}{\boldsymbol{\sigma}^{(l)}} \qquad (29)$$

where $\mathbf{x}_i^{(l)}$ is the output of layer $l$ for correspondence $i$, and $\boldsymbol{\mu}^{(l)}$ and $\boldsymbol{\sigma}^{(l)}$ are the per-dimension mean value and standard deviation, respectively. As opposed to the more commonly used batch normalization, instance normalization operates on individual training examples and not on the whole batch. Additionally, to reinforce the local context, the order-aware blocks [67] are used to map the correspondences to clusters using the learned soft pooling and unpooling operators

$$\mathbf{X}_{l+1} = \mathbf{S}_{\text{pool}}^{T}\,\mathbf{X}_{l}, \qquad \mathbf{X}_{l}^{\text{up}} = \mathbf{S}_{\text{unpool}}\,\mathbf{X}_{l+1} \qquad (30)$$

where $\mathbf{S}_{\text{pool}}, \mathbf{S}_{\text{unpool}} \in \mathbb{R}^{N\times M}$, $N$ is the number of correspondences, and $M$ is the number of clusters. $\mathbf{X}_{l} \in \mathbb{R}^{N\times d}$ and $\mathbf{X}_{l+1} \in \mathbb{R}^{M\times d}$ are the features at level $l$ (before clustering) and level $l+1$ (after clustering), respectively (see Fig. 6).
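A minimal sketch of these two operators, with the soft assignments predicted from the features themselves (the linear parametrization of the assignment is our assumption; see [67] for the original design):

```python
import torch
import torch.nn as nn

class DiffPool(nn.Module):
    """Soft pooling of Eq. 30: map N correspondence features to M clusters."""
    def __init__(self, dim, m_clusters):
        super().__init__()
        self.assign = nn.Linear(dim, m_clusters)   # predicts S_pool from X_l

    def forward(self, x):                          # x: (N, dim)
        s = torch.softmax(self.assign(x), dim=0)   # (N, M), normalized over points
        return s.T @ x                             # (M, dim) cluster features

class DiffUnpool(nn.Module):
    """Soft unpooling of Eq. 30: recover per-correspondence features."""
    def __init__(self, dim, m_clusters):
        super().__init__()
        self.assign = nn.Linear(dim, m_clusters)   # predicts S_unpool from X_l

    def forward(self, x_l, x_clustered):           # (N, dim), (M, dim)
        s = torch.softmax(self.assign(x_l), dim=1) # (N, M), normalized over clusters
        return s @ x_clustered                     # (N, dim) recovered features
```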

Training details

We pre-train the registration blocks using the same fragments of the 3DMatch dataset. Specifically, we first infer the FCGF descriptors and randomly sample a fixed number of descriptors per fragment. We use these descriptors to compute the putative correspondences for all fragment pairs $(i, j)$ with $i < j$. Based on the ground truth transformation parameters, we label these correspondences as inliers if the Euclidean distance between the points after the transformation is smaller than a fixed threshold. At the start of the training (the first 15000 iterations), we supervise the learning using only the binary cross-entropy loss. Once a meaningful number of correspondences can already be classified correctly, we add the transformation loss. We train the network using the Adam optimizer [36] with an exponentially decayed learning rate. To learn rotation invariance, we perform data augmentation later in the training by randomly sampling rotation angles from an interval that is progressively enlarged during training.

A.3 Pseudo-code

Alg. 1 shows the pseudo-code of our proposed approach. We iterate over the network and the transformation synchronization (i.e. Transf-Sync) layers, executing the Transf-Sync layer four times per iteration. Our implementation is constructed in a modular way (each part can be run on its own) and can accept a varying number of input point clouds, with or without connectivity information.

Input:  a set $\mathcal{S}$ of potentially overlapping scans
Output: globally optimized poses $\{\mathbf{M}_i\}$

# Compute the pairwise transformations
1   for each pair of scans $(\mathbf{S}_i, \mathbf{S}_j)$ do
2       find the putative correspondences using $\phi(\cdot)$           # softNN, Eq. 12
3       compute the weights using $\psi(\cdot)$                         # registration block
4       calculate $\mathbf{M}_{ij}$ using SVD according to Eq. 4        # WLS trans.

# Iterative network for transformation synchronization
5   for each iteration $k$ do
6       for each pairwise output do
7           refine the weights and $\mathbf{M}_{ij}^{(k)}$              # Eq. 10
8           estimate the confidence $c_{ij}^{(k)}$ using Eq. 16
9           gather the pairwise estimates
        # Build the graph and perform the synchronization
10      if connectivity information is given then
11          build the graph $\mathcal{G}$ using the given edges
12      else
13          build the fully connected graph $\mathcal{G}$
14      perform transformation synchronization on $\mathcal{G}$         # Sec. 3.2
        # Update step
15      for each pair of scans $(\mathbf{S}_i, \mathbf{S}_j)$ do
16          pre-align the correspondences with the synchronized $\mathbf{M}_{ij}$
17          update the weights and residuals

Algorithm 1: Pseudo-code of the proposed approach

A.4 Extended ablation study

We extend the ablation study by analyzing the impact of edge pruning based on the local confidence (i.e. the output of the confidence estimation block) (Sec. A.4.1) and the weighting scheme (Sec. A.4.2) on the angular and translation errors. The ablation study is performed on the point cloud fragments of the ScanNet dataset [16].

A.4.1 Impact of the edge pruning threshold

Results depicted in Fig. 9 show that the threshold value used for edge pruning has little impact on the angular and translation errors as long as it is larger than 0.2.

A.4.2 Impact of the harmonic mean weighting scheme

In this work, we have introduced a scheme for combining the local and global confidence using the harmonic mean (HM). In the following, we analyze this proposal and compare its performance to established methods based only on global information [4]. To this end, we again consider the scenario "Ours (Good)" as the input graph connectivity information. We compare the results of the proposed scheme (HM) to SE3 EIG [4], which uses the Cauchy function for computing the global edge confidence. Note that we use the same pairwise transformation parameters, estimated using the method proposed herein, for all methods.

Figure 9: Impact of the edge pruning threshold on the angular (a) and translation (b) errors. Results are obtained using all pairs as the input graph on the ScanNet dataset [16].
Figure 12: Impact of the weighting scheme on the angular (a) and translation (b) errors without edge pruning.
Figure 15: Impact of the weighting scheme on the angular (a) and translation (b) errors with edge pruning.
Without edge pruning

It turns out that combining local and global evidence about the graph connectivity is essential to achieve good performance. In fact, when merely relying on the local confidence estimates without the HM weighting (denoted as ours (green) in Fig. 12), the Transf-Sync layer is unable to recover the global transformations from the very noisy graph connectivity evidence. Introducing the HM weighting scheme allows us to reduce the impact of the noisy graph connectivity built from the local confidence, significantly improves the performance after the Transf-Sync block, and enables us to outperform SE3 EIG.

With edge pruning

Fig. 15 shows that pruning the edges can help to cope with the noisy input graph connectivity built from the pairwise input. In principle, the suppression of edges results in discarding outliers and thus improves the performance of the Transf-Sync block.

A.5 Qualitative results

We provide some additional qualitative results in the form of success and failure cases on selected scenes of the 3DMatch (Figs. 16 and 17) and ScanNet (Figs. 18 and 19) datasets. Specifically, we compare the results of our whole pipeline after the Transf-Sync layer, Ours (After Sync.), to the results of SE3 EIG [4], the pairwise registration results of our method from the first iteration, Ours (1 iter.), and the pairwise registration results of our method from the fourth iteration, Ours (4 iter.), which are already informed by the global information. The failure cases of our method predominantly occur on point clouds with a low level of structure (planar areas in Fig. 17, bottom) or a high level of symmetry and repetitive structures (Fig. 19, top and bottom, respectively).

Figure 16: Selected success cases of our method on 3DMatch dataset. Top: Kitchen and bottom: Hotel 1. Red rectangles highlight interesting areas with subtle changes.
Figure 17: Selected failure cases of our method on 3DMatch dataset. Top: Home 1 and bottom: Home 2. Note that our method still provides qualitatively better results than state-of-the-art.
Figure 18: Selected success cases of our method on ScanNet dataset. Top: scene0057_01 and bottom: scene0309_00. Red rectangles highlight interesting areas with subtle changes.
Figure 19: Selected failure cases of our method on ScanNet dataset. Top: scene0334_02 and bottom: scene0493_01. Note that our method still provides qualitatively better results than state-of-the-art.