Fully Convolutional Geometric Features: Fast and accurate 3D features for registration and correspondence.
We present a novel, end-to-end learnable, multiview 3D point cloud registration algorithm. Registration of multiple scans typically follows a two-stage pipeline: initial pairwise alignment followed by globally consistent refinement. The former is often ambiguous due to the low overlap of neighboring point clouds, symmetries, and repetitive scene parts. The latter, global refinement therefore aims at establishing cyclic consistency across multiple scans and helps resolve the ambiguous cases. In this paper we propose, to the best of our knowledge, the first end-to-end algorithm for joint learning of both parts of this two-stage problem. Experimental evaluation on well-accepted benchmark datasets shows that our approach outperforms the state of the art by a significant margin, while being end-to-end trainable and computationally less costly. Moreover, we present a detailed analysis and an ablation study that validate the novel components of our approach. The source code and pretrained models will be made publicly available at https://github.com/zgojcic/3D_multiview_reg.
[CVPR2020] Learning multiview 3D point cloud registration
The capability of aligning and fusing multiple scans is essential for the tasks of structure from motion and 3D reconstruction. As such, it has several use cases in augmented reality and robotics. Hence, markerless registration of 3D point clouds either based on geometric constraints [47, 65, 54] or hand-engineered 3D feature descriptors [34, 24, 51, 57]
is a prerequisite for generating a holistic view of the scene for purposes like 3D reconstruction or semantic segmentation. Because all scene parts are usually captured from different viewpoints, often with small overlap, registration of adjacent scans is hard. While hand-engineered feature descriptors have been successful to some extent, a significant performance boost has been observed since the comeback of deep learning. Current research on local descriptors for pairwise registration of 3D point clouds is centered on deep learning approaches [66, 35, 19, 63, 17, 25] that succeed in capturing and encoding evidence hidden to hand-engineered descriptors. Learning pairwise point cloud matching end-to-end was recently proposed by [61, 39]. Despite demonstrating good performance for many tasks, pairwise registration of individual views of a scene has some conceptual drawbacks: (i) low overlap of adjacent point clouds can lead to inaccurate or wrong matches, (ii) point cloud registration has to rely on very local evidence, which can be harmful if 3D scene structure is scarce or repetitive, and (iii) separate post-processing is required to combine all pairwise matches into a global representation.
While for pairwise registration well accepted methods do exist [6, 68], registering multiple scans globally remains a challenge, as it is unclear how to best make use of the quadratic pairwise relationships.
In this paper, we present, to the best of our knowledge, the first end-to-end data driven multiview point cloud registration algorithm. Our method takes a set of potentially overlapping point clouds as input and outputs a global/absolute transformation matrix for each of the input scans (Fig. 1). We depart from a traditional two-stage approach where the individual stages are detached from each other and directly learn to register all views of a scene in a globally consistent manner.
The main contributions of our work are:
We propose a confidence estimation block that uses a novel overlap pooling layer to predict the confidence in the estimated pairwise transformation parameters.
We formulate the traditional two-stage approach in an end-to-end differentiable network that iteratively refines both the pairwise and absolute transformation estimates, the latter involving a differentiable synchronization layer.
Resulting from the aforementioned contributions, the proposed multiview registration algorithm (i) is very efficient to compute, (ii) achieves more accurate scan alignments because the residuals are fed back to the pairwise network in an iterative manner, and (iii) outperforms the current state of the art on pairwise point cloud registration as well as transformation synchronization.
The traditional pairwise registration pipeline consists of two stages: the coarse alignment stage, which provides the initial estimate of the relative transformation parameters and the refinement stage that iteratively refines the transformation parameters by minimizing the 3D registration error under the assumption of rigid transformation.
The former is traditionally performed by using either handcrafted [51, 57, 56] or learned [66, 35, 19, 18, 63, 25, 15] 3D local feature descriptors to establish pointwise candidate correspondences, in combination with a RANSAC-like robust model estimation algorithm [23, 48, 37] or geometric hashing [22, 8, 29]. A parallel stream of works [1, 55, 41] relies on establishing correspondences using 4-point congruent sets. In the refinement stage, the coarse transformation parameters are often fine-tuned with a variant of the iterative closest point (ICP) algorithm. ICP-like algorithms [38, 62]
perform optimization by alternately hypothesizing the correspondence set and estimating the new set of transformation parameters. They are known not to be robust against outliers and to converge to a global optimum only when starting from a good prealignment. ICP algorithms are often extended to use additional radiometric, temporal, or odometry constraints. Contemporary to our work, [61, 39] propose to integrate the coarse and fine registration stages for pairwise point cloud registration into an end-to-end learnable algorithm.
Multiview registration. Multiview, global point cloud registration methods aim at resolving hard or ambiguous cases that arise with pairwise methods by incorporating global cyclic consistency. A large body of literature [33, 27, 59, 2, 55, 3, 5, 4, 68, 7, 32] deals with multiview, global alignment of unorganized 3D point clouds. Most methods either rely on a good initialization of the pairwise transformation parameters [27, 59, 2, 3, 5, 4, 40, 10], which they try to synchronize, or incorporate pairwise keypoint correspondences in a joint optimization [68, 54, 7]. In general, these methods work well on data with low synthetic noise, but struggle on real scenes with high levels of clutter and occlusion. A general drawback of this hierarchical procedure is that the global noise distribution over all scans ends up being far from random, i.e. significant biases persist due to the highly correlated initial pairwise registrations.
A related line of work proposes a global point cloud registration approach using two networks, one for pose estimation of the input point clouds, and another modelling the scene structure by estimating the occupancy status of global coordinates. Probably the most similar to ours is the work in which the authors aim to adapt the edge weights for the transformation synchronization layer by learning a data-driven weighting function. A major conceptual difference to our approach is that the relative transformation parameters are estimated using FPFH in combination with FGR and thus, unlike ours, are not learned. Furthermore, in each iteration that method has to convert the point clouds to depth images, as the weighting function is approximated by a 2D CNN. Our whole approach, on the other hand, operates directly on point clouds, is fully differentiable, and therefore facilitates learning global, multiview point cloud registration end-to-end.
In this section we derive the proposed multiview 3D registration algorithm as a composition of functions depending upon the data. The network architectures used to approximate these functions are then explained in detail in Sec. 4. We begin with a new algorithm for learned pairwise point cloud registration, which takes two point clouds as input and outputs the estimated transformation parameters (Sec. 3.1). This method is extended to multiple point clouds in Sec. 3.2
using a transformation synchronization layer amenable to backpropagation. The input graph to this synchronization layer is learned by a neural network, which encodes, along with the relative transformation parameters, the confidence of pairwise point cloud matches as edge information in the global scene graph. Finally, we use an iterative scheme (Sec.3.3) to refine the global registration of all point clouds in the scene by updating the edge weights as well as the pairwise poses.
Consider a set of potentially overlapping point clouds $\mathcal{S} = \{S_1, \dots, S_{N_S}\}$ capturing a 3D scene from different viewpoints (i.e. poses). The task of multiview registration is to recover the rigid, absolute poses $\{M_i = (R_i, t_i)\}_{i=1}^{N_S}$ given the scan collection, where
$R_i \in SO(3)$ and $t_i \in \mathbb{R}^3$. $\mathcal{S}$ can be augmented by connectivity information resulting in a finite graph $\mathcal{G}(\mathcal{S}, \mathcal{E})$, where each vertex represents a single point set and the edges $(i, j) \in \mathcal{E}$ encode the information about the relative rotation $R_{ij}$ and translation $t_{ij}$ between the vertices. These relative transformation parameters satisfy $R_{ij} = R_{ji}^T$ and $t_{ij} = -R_{ji}^T t_{ji}$, as well as the compatibility constraint $R_{ij} \approx R_j R_i^T$, $t_{ij} \approx t_j - R_{ij} t_i$.
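To make the pose conventions concrete, the compatibility constraint can be checked numerically. The sketch below assumes the convention $R_{ij} = R_j R_i^T$, $t_{ij} = t_j - R_{ij} t_i$, which is one common choice; the paper's exact convention may differ up to inversion:

```python
import numpy as np

def relative_pose(R_i, t_i, R_j, t_j):
    # Relative transformation between two absolute poses, assuming the
    # convention R_ij = R_j R_i^T, t_ij = t_j - R_ij t_i (illustrative;
    # other papers use the inverted convention).
    R_ij = R_j @ R_i.T
    t_ij = t_j - R_ij @ t_i
    return R_ij, t_ij

def random_rotation(rng):
    # Proper rotation via QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # enforce det = +1
    return Q

rng = np.random.default_rng(0)
R1, t1 = random_rotation(rng), rng.normal(size=3)
R2, t2 = random_rotation(rng), rng.normal(size=3)
R12, t12 = relative_pose(R1, t1, R2, t2)

# A world point expressed in both scan frames must agree after applying
# the relative pose: p_2 = R_12 p_1 + t_12.
p_w = rng.normal(size=3)
p1 = R1 @ p_w + t1
p2 = R2 @ p_w + t2
assert np.allclose(R12 @ p1 + t12, p2)
```

Under this convention, chaining relative poses around any cycle in the graph composes to the identity, which is exactly the cyclic consistency the synchronization stage exploits.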
In the current state of the art [68, 32, 7], the edges of $\mathcal{G}$ are initialized with (noisy) relative transformation parameters $\{R_{ij}, t_{ij}\}$, obtained by an independent, auxiliary pairwise registration algorithm. Global scene consistency is enforced via a subsequent synchronization algorithm. In contrast, we propose a joint approach where pairwise registration and transformation synchronization are tightly coupled as one fully differentiable component, which leads to an end-to-end learnable, global registration pipeline.
In the following, we introduce a differentiable, pairwise registration algorithm that can easily be incorporated into an end-to-end multiview 3D registration algorithm. Let $(P, Q)$ denote a pair of point clouds, where $p_i \in \mathbb{R}^3$ and $q_j \in \mathbb{R}^3$
represent the coordinate vectors of individual points in point clouds $P$ and $Q$, respectively. The goal of pairwise registration is to retrieve the optimal $R \in SO(3)$ and $t \in \mathbb{R}^3$ that minimize

$$\sum_i \big\| R p_i + t - \phi(p_i, Q) \big\|_2^2, \tag{3}$$

where $\phi$ is a correspondence function that maps the points $p_i$ to their corresponding points in point cloud $Q$. The formulation of Eq. 3 facilitates a differentiable closed-form solution, which is, subject to the noise distribution, close to the ground-truth solution. However, least-squares solutions are not robust, and thus Eq. 3 will yield wrong transformation parameters in the case of a high outlier ratio. In practice, the mapping $\phi$ is far from ideal and erroneous correspondences typically dominate. To circumvent that, Eq. 3
can be robustified against outliers by introducing a heteroscedastic weighting [58, 53]:

$$\sum_i w_i \big\| R p_i + t - \phi(p_i, Q) \big\|_2^2, \tag{4}$$

where $w_i \in [0, 1]$ is the weight of the putative correspondence $(p_i, \phi(p_i, Q))$, computed by some weighting function $\psi(\cdot)$. Assuming that $w_i$ is close to one when the putative correspondence is an inlier and close to zero otherwise, Eq. 4 will yield the correct transformation parameters while retaining a differentiable closed-form solution. Hereinafter we denote this closed-form solution WLS trans. and, for the sake of completeness, its derivation is provided in the supplementary material.
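A minimal numerical sketch of this weighted closed-form solution (a weighted Kabsch/Procrustes step; the paper's own derivation is deferred to its supplementary material, and all names here are illustrative) could look as follows:

```python
import numpy as np

def weighted_rigid_fit(P, Q, w):
    # Closed-form minimizer of sum_i w_i ||R p_i + t - q_i||^2:
    # weighted centroids, weighted covariance, SVD projection onto SO(3).
    w = w / w.sum()
    mu_p = w @ P
    mu_q = w @ Q
    Pc, Qc = P - mu_p, Q - mu_q
    H = Pc.T @ (w[:, None] * Qc)           # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.linalg.det(Vt.T @ U.T)          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t
```

Because every step (weighted means, matrix products, SVD) is differentiable almost everywhere, gradients can flow from the estimated pose back into the weights, which is what makes the WLS transformation usable inside an end-to-end network.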
Returning to the original task of multiview registration, we again consider the initial set of point clouds $\mathcal{S}$. If no prior connectivity information is given, the graph $\mathcal{G}$ can be initialized by forming all point cloud pairs and estimating their relative transformation parameters as described in Sec. 3.1. The global transformation parameters can be estimated either jointly (transformation synchronization) [27, 5, 4, 10] or by dividing the problem into rotation synchronization [2, 3] and translation synchronization. In the following paragraphs we summarize a differentiable, closed-form solution of the latter approach based on spectral relaxation [2, 3, 31].
Rotation synchronization. The goal of rotation synchronization is to retrieve the global rotation matrices $\{R_i\}$ by solving the following minimization problem based on their observed ratios $R_{ij}$:

$$\underset{R_i \in SO(3)}{\arg\min} \sum_{(i,j) \in \mathcal{E}} c_{ij} \, \big\| R_{ij} - R_j R_i^T \big\|_F^2 \tag{5}$$

Under spectral relaxation, Eq. 5 admits a closed-form solution as follows [2, 3]. Consider the symmetric matrix of relative transformations $L = D - B$, where $D$ is a diagonal degree matrix and $B$ is constructed blockwise from the weighted relative rotations $c_{ij} R_{ij}$,
where the weights $c_{ij}$ represent the confidence in the relative transformation parameters $R_{ij}$. The least-squares estimates of the global rotation matrices $\{R_i\}$
are then given, under relaxed orthonormality and determinant constraints, by the three eigenvectors of $L$
corresponding to its three smallest eigenvalues. Consequently, the nearest rotation matrices under the Frobenius norm can be obtained by projecting the $3 \times 3$ submatrices of the stacked eigenvectors onto the orthonormal matrices while enforcing $\det(R_i) = 1$ (to avoid reflections).
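The spectral solution can be sketched in a few lines of dense linear algebra. This is an illustrative implementation, not the authors' synchronization layer; it assumes relative rotations measured as $R_{ij} = R_j R_i^T$ and builds the symmetric blocks from their transposes so that the stacked ground-truth rotations span the null space of $L$:

```python
import numpy as np

def rotation_sync(rel, conf, N):
    # Spectral rotation synchronization sketch: build L = D - B from the
    # confidence-weighted relative rotations, take the eigenvectors of the
    # three smallest eigenvalues, and project each 3x3 block onto SO(3).
    # Assumed convention: rel[(i, j)] = R_j @ R_i.T for i < j, so the
    # symmetric block is B_ij = c_ij * rel[(i, j)].T = c_ij * R_i @ R_j.T.
    B = np.zeros((3 * N, 3 * N))
    deg = np.zeros(N)
    for (i, j), Rij in rel.items():
        c = conf[(i, j)]
        B[3*i:3*i+3, 3*j:3*j+3] = c * Rij.T
        B[3*j:3*j+3, 3*i:3*i+3] = c * Rij
        deg[i] += c
        deg[j] += c
    D = np.kron(np.diag(deg), np.eye(3))
    _, vecs = np.linalg.eigh(D - B)
    V = vecs[:, :3]                         # three smallest eigenvalues
    if np.linalg.det(V[:3, :]) < 0:         # fix the global reflection
        V[:, 0] *= -1
    Rs = []
    for i in range(N):
        U, _, Vt = np.linalg.svd(V[3*i:3*i+3, :])
        R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
        Rs.append(R)
    return Rs
```

The recovered rotations are only defined up to one global rotation, so correctness is checked on the ratios $R_j R_i^T$ rather than on the absolute matrices.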
Translation synchronization. Similarly, the goal of translation synchronization is to retrieve the global translation vectors $\{t_i\}$ that minimize the following least-squares problem:

$$\underset{t_i}{\arg\min} \sum_{(i,j) \in \mathcal{E}} c_{ij} \, \big\| R_{ij} t_i + t_{ij} - t_j \big\|_2^2 \tag{6}$$

Its closed-form solution is given by the corresponding system of normal equations, in which each vertex $i$ aggregates the weighted contributions of all its neighboring vertices $\mathcal{N}(i)$ in the graph $\mathcal{G}$.
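A least-squares sketch of translation synchronization, under the same convention $t_{ij} = t_j - R_{ij} t_i$ and with the gauge freedom removed by pinning the first translation to zero (an illustrative formulation, not the authors' exact solver):

```python
import numpy as np

def translation_sync(rel_R, rel_t, conf, N):
    # Minimize sum_ij c_ij || R_ij t_i + t_ij - t_j ||^2 by stacking one
    # 3-row block per edge into a weighted linear least-squares system.
    rows, rhs = [], []
    for (i, j), t_ij in rel_t.items():
        s = np.sqrt(conf[(i, j)])
        row = np.zeros((3, 3 * N))
        row[:, 3*j:3*j+3] = np.eye(3)
        row[:, 3*i:3*i+3] = -rel_R[(i, j)]
        rows.append(s * row)
        rhs.append(s * t_ij)
    pin = np.zeros((3, 3 * N))
    pin[:, :3] = np.eye(3)                  # gauge constraint: t_0 = 0
    A = np.vstack(rows + [pin])
    b = np.concatenate(rhs + [np.zeros(3)])
    t = np.linalg.lstsq(A, b, rcond=None)[0]
    return t.reshape(N, 3)
```

As with the rotations, the solution is unique only after fixing the gauge, so validation checks the per-edge residuals $t_j - R_{ij} t_i - t_{ij}$ instead of the absolute translations.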
The above formulation (Sec. 3.1 and 3.2) facilitates an implementation in an iterative scheme, which in turn can be viewed as an iteratively reweighted least-squares (IRLS) algorithm. We can start each iteration $k+1$ by pre-aligning the point cloud pairs using the synchronized estimates of the relative transformation parameters $\{R_{ij}^{(k)}, t_{ij}^{(k)}\}$ from iteration $k$, i.e. by applying the estimated transformations to the point clouds. Additionally, the weights and residuals of the previous iteration can be used as side information in the correspondence weighting function. Therefore, the weighting function $\psi(\cdot)$ is extended to additionally take the previous weights $w_i^{(k)}$ and residuals $r_i^{(k)}$ as input. Analogously, the difference between the input and the synchronized transformation parameters of the $k$-th iteration can be used as an additional cue for estimating the confidence $c_{ij}$; the confidence estimation is extended accordingly.
We implement our proposed multiview registration algorithm as a deep neural network (Fig. 2). In this section, we first describe the architectures used to approximate the correspondence function $\phi(\cdot)$ and the weighting function $\psi(\cdot)$, before integrating them into one fully differentiable, end-to-end trainable algorithm.
Our approximation of the correspondence function $\phi(\cdot)$ extends the recently proposed fully convolutional 3D feature descriptor FCGF
with a soft assignment layer. FCGF operates on sparse tensors and computes a descriptor for each point of the sparse point cloud in a single pass. Note that $\phi(\cdot)$ could be approximated with any of the recently proposed learned feature descriptors [35, 18, 19, 25], but we choose FCGF due to its high accuracy and low computational complexity.
Let $F_P$ and $F_Q$ denote the FCGF embeddings of point clouds $P$ and $Q$, respectively. Pointwise correspondences could be established by a nearest neighbor (NN) search in this high-dimensional feature space. However, the selection rule of such hard assignments is not differentiable. We therefore form the NN-selection rule in a probabilistic manner by computing the probability vector $s$ of a categorical distribution. The stochastic correspondence of the point $p_i$ in the point cloud $Q$ is then defined as

$$\phi(p_i, Q) = \sum_j s_j \, q_j, \qquad s_j = \frac{\exp(-d_{ij} / \tau)}{\sum_l \exp(-d_{il} / \tau)},$$

where $d_{ij} = \| f_{p_i} - f_{q_j} \|_2$, $f_{q_j}$ is the FCGF embedding of the point $q_j$, and $\tau$ denotes the temperature parameter. In the limit $\tau \to 0$, the soft assignment converges to the deterministic NN-search.
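The soft assignment can be sketched as a temperature-scaled softmax over feature distances (an illustrative sketch; the actual model computes this over learned FCGF embeddings):

```python
import numpy as np

def soft_correspondence(f_p, F_q, Q, tau=0.1):
    # Differentiable soft nearest neighbour: a softmax over negative
    # feature distances defines a categorical distribution over Q's
    # points, and the soft correspondence is the expectation under it.
    # As tau -> 0 this recovers the hard NN match.
    d = np.linalg.norm(F_q - f_p, axis=1)   # distances to all of Q's features
    logits = -d / tau
    logits -= logits.max()                  # numerical stability
    s = np.exp(logits)
    s /= s.sum()
    return s @ Q                            # expected (soft) match point
```

With a small temperature the distribution collapses onto the closest feature, so the soft match approaches the point the hard NN search would have selected, while remaining differentiable with respect to the features.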
We follow the FCGF training scheme and supervise the learning of $\phi(\cdot)$ with a correspondence loss, which is defined as the hardest contrastive loss and operates on the FCGF embeddings,
where the positive term is computed over the set of all positive pairs in an FCGF mini-batch and the negative term over a random subset of all features used for hardest-negative mining; $m_p$ and $m_n$ are the margins for positive and negative pairs, respectively. The detailed network architecture of $\phi(\cdot)$ as well as the training configuration and parameters are available in the supplementary material.
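A simplified sketch of such a hardest-contrastive objective (the margin values, mining strategy, and normalization here are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

def hardest_contrastive_loss(F1, F2, pos_pairs, m_p=0.1, m_n=1.4):
    # Pull each positive pair within margin m_p; push each anchor at
    # least m_n away from its hardest (closest) non-matching descriptor.
    loss = 0.0
    for i, j in pos_pairs:
        d_pos = np.linalg.norm(F1[i] - F2[j])
        d_all = np.linalg.norm(F2 - F1[i], axis=1)
        d_all[j] = np.inf                   # exclude the true match
        d_neg = d_all.min()                 # hardest negative
        loss += max(0.0, d_pos - m_p) ** 2 + max(0.0, m_n - d_neg) ** 2
    return loss / len(pos_pairs)
```

The "hardest" mining is what distinguishes this from a plain contrastive loss: instead of averaging over random negatives, each anchor is penalized against its single most confusable negative, which tends to produce far more discriminative embeddings.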
Despite the good performance of the FCGF descriptor, several putative correspondences will be false. Furthermore, the distribution of inliers and outliers does not resemble random noise but rather shows regularity. We thus aim to learn this regularity from the data using a deep neural network. Recently, several networks representing a complex weighting function for filtering 2D [42, 49, 67] or 3D feature correspondences have been proposed.
Herein, we propose extending a 3D outlier filtering network with recently proposed order-aware blocks. Specifically, we create a pairwise registration block that takes the coordinates of the putative correspondences as input and outputs weights that are fed, together with the correspondences, into the closed-form solution of Eq. 4 to obtain the relative rotation and translation. Motivated by the results in [49, 67], we add another registration block to our network and append the weights and the pointwise residuals to the original input (see Sec. 3.3). The new weights are then again fed, together with the initial correspondences, into the closed-form solution of Eq. 4 to obtain the refined pairwise transformation parameters. In order to ensure permutation invariance, a PointNet-like architecture that operates on individual correspondences is used in both registration blocks. As each branch only operates on individual correspondences, local 3D context information is gathered in the intermediate layers using symmetric context normalization and order-aware filtering layers. The detailed architecture of the registration block is available in the supplementary material. Training of the registration network is supervised using a registration loss, defined per batch as the combination of a classification loss, i.e. the binary cross-entropy on the estimated correspondence weights,
and a transformation loss
that penalizes the deviation from the ground-truth transformation parameters; two hyperparameters
are used to control the contributions of the individual loss terms.
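Context normalization itself is a simple per-channel whitening across the correspondence axis; a sketch, assuming a feature matrix of shape (n_correspondences, channels):

```python
import numpy as np

def context_norm(x, eps=1e-8):
    # Normalize each feature channel across the correspondence axis
    # (axis 0), so every putative correspondence is informed about the
    # distribution of the whole set while the per-point architecture
    # stays permutation-invariant.
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    return (x - mu) / (sigma + eps)
```

Because the statistics are symmetric functions of the set, permuting the correspondences only permutes the output rows, which is exactly the equivariance a PointNet-style correspondence filter needs.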
Along with the estimated relative transformation parameters $\{R_{ij}, t_{ij}\}$, the edges of the graph $\mathcal{G}$ encode the confidence $c_{ij}$ in those estimates. The confidence encoded in each edge consists of (i) the local confidence of the pairwise transformation estimation and (ii) the global confidence derived from the transformation synchronization. We formulate the estimation of the local confidence as a classification task and argue that some of the required information is encompassed in the features of the second-to-last layer of the registration block. Let $F$ denote the output of the second-to-last layer of the registration block; we propose an overlap pooling layer that extracts a global feature by performing weighted average pooling of $F$ with the estimated correspondence weights.
The obtained global feature is concatenated with the ratio of inliers (i.e., the fraction of putative correspondences whose weights are higher than a given threshold) and fed to the confidence estimation network with three fully connected layers,
each followed by a ReLU activation function; the local confidence is the output of this network.
The training of the confidence estimation block is supervised with a confidence loss, the binary cross-entropy (BCE) between the predicted local confidences and ground-truth confidence labels, averaged over the point cloud pairs; the ground-truth labels are computed on the fly by thresholding the angular error of the estimated rotations.
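The overlap pooling operation reduces per-correspondence features to a single pair-level feature; a minimal sketch (names assumed, not the authors' code):

```python
import numpy as np

def overlap_pool(F, w, eps=1e-8):
    # Weighted average pooling: per-correspondence features F (n x d)
    # are averaged with the predicted correspondence weights w (n,),
    # yielding one global feature vector per point-cloud pair.
    w = w / (w.sum() + eps)
    return w @ F
```

Weighting the average by the predicted correspondence weights means the pooled feature is dominated by the likely inliers, which is what makes it informative about the quality of the pairwise estimate.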
The local confidence captures how reliable the relative transformation parameters of an individual pair are. On the other hand, the output of the transformation synchronization layer provides information on how well the input relative transformations agree globally with the other edges. In fact, traditional synchronization algorithms [12, 4, 31] only use this global information to perform the reweighting of the edges in the iterative solutions, because they do not have access to local confidence information. The global confidence in the relative transformation parameters can be expressed with the Cauchy weighting function [30, 4]

$$c^{\mathrm{global}}_{ij} = \frac{b^2}{b^2 + r_{ij}^2},$$

where $r_{ij}$ is the residual of edge $(i, j)$ after synchronization and, following [30, 4], the scale $b$ is tied to the median of the vectorized residuals. Since local and global confidence provide complementary information about the relative transformation parameters, we combine them into a joint confidence
using their harmonic mean,
where a parameter balancing the contribution of the local and global confidence estimates is learned during training.
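The two confidence terms can be sketched as follows; the Cauchy scale rule and the weighted harmonic-mean form shown here follow common robust-estimation practice and are assumptions, not the paper's exact formulas:

```python
import numpy as np

def cauchy_confidence(r, residuals):
    # Cauchy robust weight: c = b^2 / (b^2 + r^2). Tying the scale b to
    # the median absolute residual (MAD-style) is a common choice; the
    # paper's exact scale rule may differ.
    b = 1.4826 * np.median(np.abs(residuals))
    return b**2 / (b**2 + r**2)

def joint_confidence(c_local, c_global, beta=1.0):
    # Weighted harmonic mean of the two confidences; beta (learned in
    # the paper) balances their contributions. Illustrative form.
    return (1 + beta**2) * c_local * c_global / (beta**2 * c_local + c_global)
```

A harmonic mean is a natural combiner here: it stays high only when both confidences are high, so an edge that looks locally plausible but globally inconsistent (or vice versa) is still downweighted.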
The individual parts of the network are connected into an end-to-end multiview 3D registration algorithm as shown in Fig. 2. The network is implemented in PyTorch; a pseudo-code of the proposed approach is provided in the supplementary material. We pre-train the individual sub-networks (training details available in the supp. material) before fine-tuning the whole model in an end-to-end manner on the 3DMatch dataset using the official train/test data split. In fine-tuning we extract the FCGF features and randomly sample the feature vectors of 2048 points per fragment. These features are used in the soft assignment (softNN) to form the putative correspondences of point cloud pairs (we assume a fully connected graph during training but are able to consider connectivity information, if provided), which are fed to the pairwise registration network. The output of the pairwise registration is used to build the graph, which is input to the transformation synchronization layer. The iterative refinement of the transformation parameters is performed four times. We supervise the fine-tuning using a joint multiview registration loss that combines the registration loss with a transformation synchronization loss on the synchronized absolute poses. We fine-tune the whole network using the Adam optimizer.
(Tab. 2 reports the average run-times per fragment pair, broken down into NN search, model estimation, and total time, as well as for the whole scene.)
We conduct the evaluation of our approach on the publicly available benchmark datasets 3DMatch, Redwood, and ScanNet. First, we evaluate the performance, efficiency, and generalization capacity of the proposed pairwise registration algorithm on the 3DMatch and Redwood datasets, respectively (Sec. 5.1). We then evaluate the whole pipeline on the global registration of point cloud fragments generated from RGB-D images, which are part of the ScanNet dataset.
Tab. 3: ECDF values of the rotation and translation errors at five increasing error thresholds (higher is better), together with mean/median errors (lower is better). Entries not recoverable from the source are marked "–".

| Scenario | Method | Rotation error ECDF | Mean/Med | Translation error ECDF (m) | Mean/Med |
|---|---|---|---|---|---|
| Pairwise (All) | FGR | 9.9 / 16.8 / 23.5 / 31.9 / 38.4 | – / – | 5.5 / 13.3 / 22.0 / 29.0 / 36.3 | 1.67 / – |
| | Ours (– iter.) | 32.6 / 37.1 / 41.0 / 46.4 / 49.3 | – / – | 25.5 / 34.2 / 40.1 / 43.4 / 46.8 | 1.37 / 1.02 |
| Edge Pruning (All) | Ours (– iter.) | 34.3 / 38.7 / 42.2 / 48.2 / 51.9 | – / – | 26.7 / 35.7 / 41.8 / 45.5 / 49.4 | 1.26 / 0.88 |
| | Ours (After Sync.) | 48.3 / 53.6 / 58.9 / 63.2 / 64.0 | – / – | 34.5 / 49.1 / 58.5 / 61.6 / 63.9 | 0.83 / 0.55 |
| FGR (Good) | FastGR | 12.4 / 21.4 / 29.5 / 38.6 / 45.1 | – / – | 7.7 / 17.6 / 28.2 / 36.2 / 43.4 | 1.43 / – |
| | GeoReg (FGR) | 0.2 / 0.6 / 2.8 / 16.4 / 27.1 | – / – | 0.1 / 0.7 / 4.8 / 16.4 / 28.4 | 1.80 / – |
| | EIGSE3 (FGR) | 1.5 / 4.3 / 12.1 / 34.5 / 47.7 | – / – | 1.2 / 4.1 / 14.7 / 32.6 / 46.0 | 1.29 / – |
| | RotAvg (FGR) | 6.0 / 10.4 / 17.3 / 36.1 / 46.1 | – / – | 3.7 / 9.2 / 19.5 / 34.0 / 45.6 | 1.26 / – |
| | L2Sync (FGR) | 34.4 / 41.1 / 49.0 / 58.9 / 62.3 | – / – | 2.0 / 7.3 / 22.3 / 36.9 / 48.1 | 1.16 / – |
| Ours (Good) | Ours (– iter.) | 57.7 / 65.5 / 71.3 / 76.5 / 78.1 | – / – | 44.8 / 60.3 / 69.6 / 73.1 / 75.5 | 0.62 / 0.20 |
| | Ours (– iter.) | 60.6 / 68.3 / 73.7 / 78.9 / 81.0 | – / – | 47.1 / 63.3 / 72.2 / 76.2 / 78.7 | 0.54 / 0.17 |
| | Ours (After Sync.) | 65.8 / 72.8 / 77.6 / 81.9 / 83.2 | – / – | 48.4 / 67.2 / 76.5 / 79.7 / 82.0 | 0.46 / 0.22 |
We begin by evaluating the pairwise registration part of our algorithm on a traditional geometric registration task. We compare the results of our method to the state-of-the-art data-driven feature descriptors 3DMatch, CGF, PPFNet, 3DSmoothNet (3DS), and FCGF, the latter of which is also used as part of our algorithm. Following the evaluation procedure of 3DMatch, we complement all the descriptors with RANSAC-based transformation parameter estimation. For our approach we report the results after the pairwise registration network (1-iter in Tab. 1) as well as the output of the pairwise registration network in the fourth iteration (4-iter in Tab. 1). The latter is already informed by the global information and serves only as verification that, across iterations, our input to the Transf-Sync layer improves. Consistent with the 3DMatch evaluation procedure, we report the average recall per scene as well as for the whole dataset in Tab. 1.
The registration results show that our approach reaches the highest recall among all the evaluated methods. More importantly, they indicate that, using the same features (FCGF), our method can outperform RANSAC-based estimation of the transformation parameters while having a much lower time complexity (Tab. 2). The comparison of the 1-iter and 4-iter results also confirms the intuition that feeding the residuals and weights of the previous estimation back to the pairwise registration block helps refine the estimated pairwise transformation parameters.
Generalization to other domains. In order to test whether our pairwise registration model can generalize to new datasets and unseen domains, we perform a generalization evaluation on the synthetic indoor dataset Redwood indoor. We follow the standard evaluation protocol and report the average registration recall and precision across all four scenes. We compare our approach to the recent data-driven approaches 3DMatch, CGF+FGR, and RelativeNet (RN), and to the traditional methods CZK and Latent RANSAC (LR). Fig. 3 shows that our approach achieves higher recall than the state of the art without being trained on synthetic data, confirming the good generalization capacity of our approach. Note that while the average precision across the scenes is low for all the methods, several works [13, 35, 20] have shown that the precision can easily be increased by pruning, with almost no loss in recall.
We evaluate the performance of our approach on the task of multiview registration using the ScanNet dataset. ScanNet is a large RGB-D dataset of indoor scenes. It provides the reconstructions, ground-truth camera poses, and semantic segmentations of the scenes. To ensure a fair comparison, we follow the evaluation protocol of prior work and use the same randomly sampled scenes for evaluation. For each scene we randomly sample RGB-D images that are 20 frames apart and convert them to point cloud fragments. The temporal sequence of the frames is discarded. In combination with the large gap between the frames, this makes the test setting extremely challenging. Different to the baselines, we do not train our network on ScanNet, but rather perform direct generalization.
We report the empirical cumulative distribution function (ECDF) for the angular and translation deviations, defined as

$$e_{\mathrm{rot}} = \arccos\Big(\frac{\mathrm{tr}(R_{ij}^T \bar{R}_{ij}) - 1}{2}\Big), \qquad e_{\mathrm{trans}} = \big\| t_{ij} - \bar{t}_{ij} \big\|_2,$$

where $\bar{R}_{ij}$ and $\bar{t}_{ij}$ denote the ground-truth relative rotation and translation.
The ground-truth rotations and translations are provided by the authors of ScanNet. In Tab. 3 we report the results for three different scenarios. "FGR (Good)" and "Ours (Good)" denote the scenarios in which we follow prior work and use the computed pairwise registrations to prune the edges before the transformation synchronization whenever the median point distance in the overlapping region after the transformation (overlapping regions being defined as the parts where, after transformation, the points lie within a fixed distance of the other point cloud) exceeds a threshold, set separately for FGR and for our approach. On the other hand, "All" denotes the scenario in which all pairs are used to build the graph. In all scenarios we prune the edges of the graph if the estimated confidence in the relative transformation parameters of that edge drops below a threshold. This threshold was determined on the 3DMatch dataset, and its effect on the performance of our approach is analyzed in detail in the supplementary material.
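The reported statistics can be computed in a few lines; the thresholds below are illustrative, and the angular error follows the standard trace formula:

```python
import numpy as np

def angular_error_deg(R_est, R_gt):
    # Angular deviation between two rotation matrices, in degrees:
    # arccos((tr(R_est^T R_gt) - 1) / 2), clipped for numerical safety.
    c = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def ecdf(errors, thresholds):
    # Fraction of errors at or below each threshold, i.e. the per-column
    # statistic reported in Tab. 3 (thresholds here are illustrative).
    e = np.asarray(errors)
    return [float((e <= t).mean()) for t in thresholds]
```

Evaluating the ECDF at a few fixed thresholds turns a full error distribution into a handful of comparable numbers, which is why the tables report several columns per error type.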
Analysis of the results. As shown in Tab. 3, our approach achieves a large improvement on the multiview registration task compared to the baselines. Not only are the initial pairwise relative transformation parameters estimated with our approach more accurate than those of FGR, but they are also further improved in the subsequent iterations. This clearly confirms the benefit of the feedback loop of our algorithm. Furthermore, even when directly considering all input edges, our approach still proves dominant, even against the results of the "Good" scenario for our competitors. More qualitative results of the multiview registration evaluation are available in the supplementary material.
Computation time. Low computational cost of pairwise and multiview registration is important for various fields like augmented reality and robotics. We first compare the computation time of our pairwise registration component to RANSAC. In Tab. 2 we report the average time needed to register one fragment pair of the 3DMatch dataset as well as one whole scene. All timings were performed on a standalone computer with an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, a GeForce GTX 1080, and 32 GB RAM. Performing softNN for a fragment pair is approximately four times faster than a traditional nearest neighbor search (implemented using scikit-learn). An even larger speedup is gained in the model estimation stage, where our approach requires a single forward pass (constant time), compared to up to 50000 iterations of RANSAC at low inlier ratios and a high desired confidence. This results in a short overall run-time
for our entire global, multiview approach (including the feature extraction and transformation synchronization) on the Kitchen scene with 1770 fragment pairs. In contrast, feature extraction and pairwise estimation of the transformation parameters with RANSAC take substantially longer. This clearly shows the efficiency of our method.
To get a better intuition of how much the individual novelties of our approach contribute to the joint performance, we carry out an ablation study on the ScanNet dataset. Specifically, we analyze the proposed edge pruning scheme based on the confidence estimation block and the Cauchy function, as well as the impact of the iterative refinement of the relative transformation parameters (additional results of the ablation study are included in the supplementary material). The results of the ablation study are presented in the form of a translation error ECDF in Fig. 4.
We motivate the iterative refinement of the transformation parameters that are input to the Transf-Sync layer with the notion that the weights and residuals provide additional cues for their estimation. The results in Fig. 4 confirm this assumption: the input relative parameters in the 4th iteration are clearly better than the initial estimates. On the other hand, Fig. 4 also shows that under a high presence of outliers or inefficient edge pruning (see e.g. the results w/o edge pruning), the weights and the residuals actually provide a negative bias and worsen the results.
Edge pruning scheme. There are several possible ways to prune the presumed outlier edges. In our experiments we prune the edges based on the output of the confidence estimation block (w-conf.). Other options are to realize this step using the global confidence, i.e. the Cauchy weights defined in (17) (w-Cau.), or to not perform pruning at all (w/o). Fig. 4 clearly shows the advantage of using our confidence estimation block. Moreover, due to preserving a large number of outliers, the alternative approaches perform even worse than pairwise registration.
We have introduced an end-to-end learnable, global, multiview point cloud registration algorithm. Our method departs from the common two-stage approach and directly learns to register all views in a globally consistent manner. We augment the 3D descriptor FCGF with a soft correspondence layer that pairs all the scans to compute initial matches, which are fed to a differentiable pairwise registration block that outputs transformation parameters as well as weights. A pose graph is constructed, and a novel, differentiable, iterative transformation synchronization layer globally refines the weights and transformations. Experimental evaluation on common benchmark datasets shows that our method outperforms the state of the art by more than 25 percentage points on average regarding the rotation error statistics. Moreover, our approach is significantly faster than RANSAC-based methods in a multiview setting of 60 scans, and generalizes better to new scenes (higher recall on Redwood indoor compared to the state of the art).
4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3075–3084, 2019.
PPF-FoldNet: Unsupervised learning of rotation invariant 3D local descriptors. In European Conference on Computer Vision (ECCV), 2018.
Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
For the sake of completeness, we summarize the closed-form, differentiable solution of the weighted least-squares pairwise registration problem

$$(\mathbf{R}^{*}, \mathbf{t}^{*}) = \underset{\mathbf{R} \in SO(3),\, \mathbf{t} \in \mathbb{R}^{3}}{\arg\min} \; \sum_{i=1}^{N} w_i \left\| \mathbf{R}\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i \right\|_2^2. \qquad (22)$$

Let

$$\bar{\mathbf{p}} = \frac{\sum_{i=1}^{N} w_i \mathbf{p}_i}{\sum_{i=1}^{N} w_i} \quad \text{and} \quad \bar{\mathbf{q}} = \frac{\sum_{i=1}^{N} w_i \mathbf{q}_i}{\sum_{i=1}^{N} w_i}$$

denote the weighted centroids of point clouds $\mathbf{P}$ and $\mathbf{Q}$, respectively. We consider the case where the putative correspondences $\mathbf{p}_i \leftrightarrow \mathbf{q}_i$ are given. The centered point coordinates can then be computed as

$$\tilde{\mathbf{p}}_i = \mathbf{p}_i - \bar{\mathbf{p}}, \qquad \tilde{\mathbf{q}}_i = \mathbf{q}_i - \bar{\mathbf{q}}.$$

Arranging the centered points back into the matrix forms $\tilde{\mathbf{P}}$ and $\tilde{\mathbf{Q}}$, a weighted covariance matrix can be computed as

$$\mathbf{S} = \tilde{\mathbf{P}}^{T} \mathbf{W} \tilde{\mathbf{Q}}, \qquad \mathbf{W} = \operatorname{diag}(w_1, \dots, w_N).$$

Considering the singular value decomposition $\mathbf{S} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T}$, the solution to Eq. 22 is given by

$$\mathbf{R}^{*} = \mathbf{V} \operatorname{diag}\!\left(1, 1, \det(\mathbf{V}\mathbf{U}^{T})\right) \mathbf{U}^{T},$$

where $\det(\cdot)$ denotes the determinant and the last diagonal entry is used to avoid creating a reflection matrix. Finally, $\mathbf{t}^{*}$ is computed as

$$\mathbf{t}^{*} = \bar{\mathbf{q}} - \mathbf{R}^{*} \bar{\mathbf{p}}.$$
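The derivation above translates almost line-by-line into code. The following NumPy sketch (naming is ours) implements the closed-form weighted solution:

```python
import numpy as np

def weighted_kabsch(p, q, w):
    """Closed-form weighted least-squares estimate of (R, t) such that
    R @ p_i + t ~ q_i, following the derivation above.
    p, q: (N, 3) corresponding points; w: (N,) non-negative weights."""
    w = w / w.sum()                       # normalize the weights
    p_bar, q_bar = w @ p, w @ q           # weighted centroids
    p_c, q_c = p - p_bar, q - q_bar       # centered coordinates
    S = p_c.T @ (w[:, None] * q_c)        # weighted covariance P^T W Q
    U, _, Vt = np.linalg.svd(S)
    V = Vt.T
    D = np.diag([1.0, 1.0, np.linalg.det(V @ U.T)])  # avoid reflections
    R = V @ D @ U.T
    t = q_bar - R @ p_bar
    return R, t
```

Because every step (weighted means, matrix products, SVD) is differentiable almost everywhere, the same computation can be expressed with autograd tensors, which is what makes the pairwise registration block trainable end-to-end.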
This section describes the network architecture as well as the training details of the FCGF feature descriptor (Sec. A.2.1) and the registration block (Sec. A.2.2). Both networks are implemented in PyTorch and pretrained on the 3DMatch dataset.
The FCGF feature descriptor operates on sparse tensors that represent a point cloud as a set of unique coordinates and their associated features

$$\mathcal{F} = \{(\mathbf{x}_i, \mathbf{f}_i)\}_{i=1}^{N},$$

where $\mathbf{x}_i$ are the coordinates of the $i$-th point in the point cloud and $\mathbf{f}_i$ is the associated feature (in our case simply 1). FCGF is implemented using the Minkowski Engine, an auto-differentiation library that provides support for sparse convolutions and implements all essential deep learning layers. We adopt the original, fully convolutional network design of FCGF, depicted in Fig. 5. It has a UNet structure and utilizes skip connections and ResNet blocks to extract the per-point feature descriptors. To obtain the unique coordinates $\mathbf{x}_i$, we use a GPU implementation of voxel grid downsampling with a fixed voxel size.
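The voxel grid downsampling itself is conceptually simple. Here is a minimal CPU sketch (the paper uses a GPU implementation; the voxel size below is a placeholder) that keeps one representative point per occupied voxel:

```python
import numpy as np

def voxel_grid_downsample(points, voxel_size):
    """Return one representative point per occupied voxel, yielding the
    unique coordinates for the sparse tensor representation.
    points: (N, 3) array; voxel_size: edge length of the cubic voxels."""
    voxels = np.floor(points / voxel_size).astype(np.int64)
    # np.unique over rows yields one index per occupied voxel
    _, idx = np.unique(voxels, axis=0, return_index=True)
    return points[np.sort(idx)]
```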
We again follow the original work and pre-train FCGF for 100 epochs using the point cloud fragments from the 3DMatch dataset. We optimize the parameters of the network using stochastic gradient descent (SGD) with an exponentially decaying learning rate. To make the descriptors rotation invariant, we perform data augmentation by rotating each of the fragments around a random axis by a different rotation angle, sampled uniformly from a fixed interval.
The architecture of the registration block (the same in both registration stages; for the later, iterative stage the input dimension is increased from 6 to 8, as the weights and residuals are added) is based on a PointNet-like architecture in which each of the fully connected layers (Fig. 6) operates on individual correspondences. The local context is then aggregated using instance normalization layers, defined as

$$\hat{\mathbf{o}} = \frac{\mathbf{o} - \boldsymbol{\mu}}{\boldsymbol{\sigma}},$$

where $\mathbf{o}$ is the output of a fully connected layer, and $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the per-dimension mean value and standard deviation, respectively. As opposed to the more commonly used batch normalization, instance normalization operates on individual training examples and not on the whole batch. Additionally, to reinforce the local context, order-aware blocks are used to map the correspondences to clusters using learned soft pooling and unpooling operators,

$$\mathbf{X}_{l+1} = \operatorname{softmax}(\mathbf{S}_{\text{pool}})^{T} \mathbf{X}_{l}, \qquad \mathbf{X}_{l}' = \operatorname{softmax}(\mathbf{S}_{\text{unpool}})\, \mathbf{X}_{l+1},$$

where $\mathbf{S}_{\text{pool}}, \mathbf{S}_{\text{unpool}} \in \mathbb{R}^{N \times M}$, $N$ is the number of correspondences, and $M$ is the number of clusters. $\mathbf{X}_{l}$ and $\mathbf{X}_{l+1}$ are the features at level $l$ (before clustering) and level $l+1$ (after clustering), respectively (see Fig. 6). Finally, $\mathbf{S}_{\text{pool}}$ denotes the output of the last layer at level $l$.
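To make the two operators concrete, here is a minimal NumPy sketch of instance normalization over a set of correspondences and of the soft pooling step. Shapes and names are ours, and the score matrix `s_pool` is random here rather than predicted by a learned layer:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization over one training example: x is (N, C)
    with N correspondences and C channels; each channel is normalized
    over the N correspondences (not over a batch)."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_pool(x, s_pool):
    """Soft pooling: map N correspondence features (N, C) to M cluster
    features (M, C) using a score matrix s_pool of shape (N, M)."""
    return softmax(s_pool, axis=0).T @ x
```

Unpooling mirrors `soft_pool` with `softmax(s_unpool, axis=1) @ clusters`, mapping the M cluster features back to the N correspondences.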
We pre-train the registration blocks using the same fragments from the 3DMatch dataset. Specifically, we first infer the FCGF descriptors and randomly sample a fixed number of descriptors per fragment. We use these descriptors to compute the putative correspondences for all fragment pairs. Based on the ground truth transformation parameters, we label these correspondences as inliers if the Euclidean distance between the points after the transformation is below a fixed threshold (in cm). At the start of the training (first 15000 iterations) we supervise the learning using only the binary cross-entropy loss. Once a meaningful number of correspondences can already be classified correctly, we add the transformation loss. We train the network using the Adam optimizer with an exponentially decayed learning rate. To learn rotation invariance we perform data augmentation by randomly sampling a rotation angle from an interval whose half-width starts at zero and is gradually increased during training until the interval covers all rotations.
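The inlier labeling step described above can be sketched as follows; `tau` stands in for the paper's metric threshold, and the names are ours:

```python
import numpy as np

def label_inliers(p, q, R, t, tau):
    """Label putative correspondences (p_i, q_i) as inliers when the
    Euclidean distance after applying the ground-truth transform (R, t)
    to p is below the threshold tau. Returns a boolean mask used as the
    target for the binary cross-entropy loss."""
    residuals = np.linalg.norm((p @ R.T + t) - q, axis=1)
    return residuals < tau
```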
Alg. 1 shows the pseudo-code of our proposed approach. We iterate over the network and the transformation synchronization (i.e. Transf-Sync) layers, executing the Transf-Sync layer four times per iteration. Our implementation is constructed in a modular way (each part can be run on its own) and can accept a varying number of input point clouds, with or without the connectivity information.
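The Transf-Sync layer is a learned, weighted refinement. Purely as a hedged illustration of the underlying synchronization idea (not the paper's layer), here is a classical spectral rotation synchronization on a noiseless, fully connected toy graph: absolute rotations are recovered, up to a global rotation, from the relative ones $R_{ij} = R_i R_j^T$.

```python
import numpy as np

def project_so3(M):
    """Project a 3x3 matrix onto the nearest rotation via SVD."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt

def rotation_sync(rel, n):
    """Spectral synchronization: rel maps (i, j), i < j, to R_ij with
    R_ij = R_i R_j^T. Returns absolute rotations up to a global one."""
    A = np.zeros((3 * n, 3 * n))
    for (i, j), R_ij in rel.items():
        A[3*i:3*i+3, 3*j:3*j+3] = R_ij
        A[3*j:3*j+3, 3*i:3*i+3] = R_ij.T
    for i in range(n):
        A[3*i:3*i+3, 3*i:3*i+3] = np.eye(3)
    _, vecs = np.linalg.eigh(A)
    V = vecs[:, -3:]                      # top-3 eigenvectors span the R_i
    if np.linalg.det(V[:3, :]) < 0:       # fix the global reflection
        V = -V
    return [project_so3(V[3*i:3*i+3]) for i in range(n)]
```

In the noisy, partially connected multiview setting, iterating such a synchronization with per-edge weights (as the Transf-Sync layer does) is what resolves the ambiguous pairwise estimates.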
We extend the ablation study by analyzing the impact of edge pruning based on the local confidence (i.e. the output of the confidence estimation block) (Sec. A.4.1) and of the weighting scheme (Sec. A.4.2) on the angular and translation errors. The ablation study is performed on the point cloud fragments of the ScanNet dataset.
Results depicted in Fig. 9 show that the threshold value used for edge pruning has little impact on the angular and translation errors as long as it is larger than 0.2.
In this work, we have introduced a scheme for combining the local and global confidence using the harmonic mean (HM). In the following, we analyze this proposal and compare its performance to established methods based only on global information. To this end, we again consider the scenario "Ours (Good)" as the input graph connectivity information. We compare the results of the proposed scheme (HM) to SE3 EIG, which proposes using the Cauchy function for computing the global edge confidence. Note that we use the same pairwise transformation parameters, estimated using the method proposed herein, for all methods.
It turns out that combining local and global evidence about the graph connectivity is essential for good performance. In fact, when merely relying on local confidence estimates without HM weighting (denoted as ours (green) in Fig. 12), Transf-Sync is unable to recover the global transformations from the very noisy graph connectivity evidence. Introducing the HM weighting scheme reduces the impact of the noisy graph connectivity built from local confidence, significantly improves performance after the Transf-Sync block, and enables us to outperform SE3 EIG.
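The combination itself is a one-liner. Below is a minimal sketch, where the Cauchy weight follows the common robust form $w(r) = 1/(1 + (r/c)^2)$, which may differ in detail from the paper's Eq. (17):

```python
import numpy as np

def cauchy_weight(residual, c):
    """A common Cauchy robust weight as a stand-in for the global
    per-edge confidence; the exact form of Eq. (17) may differ."""
    return 1.0 / (1.0 + (residual / c) ** 2)

def combined_confidence(local_w, global_w, eps=1e-12):
    """Harmonic mean (HM) of the local confidence (output of the
    confidence estimation block) and the global confidence. The HM is
    dominated by the smaller value, so an edge must be trusted by BOTH
    cues to receive a high combined weight."""
    return 2.0 * local_w * global_w / (local_w + global_w + eps)
```

This dominance of the smaller value is exactly the behavior motivated above: an edge that looks plausible locally but is inconsistent globally (or vice versa) is down-weighted before synchronization.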
Fig. 15 shows that pruning the edges can help cope with the noisy input graph connectivity built from the pairwise input. In principle, the suppression of edges results in discarding outliers, thus improving the performance of the Transf-Sync block.
We provide some additional qualitative results in the form of success and failure cases on selected scenes of the 3DMatch (Fig. 16 and 17) and ScanNet (Fig. 18 and 19) datasets. Specifically, we compare the results of our whole pipeline after Transf-Sync, Ours (After Sync.), to the results of SE3 EIG, the pairwise registration results of our method from the first iteration, Ours (1st iter.), and the pairwise registration results of our method from the fourth iteration, Ours (4th iter.), which are already informed by the global information. The failure cases of our method predominantly occur on point clouds with a low level of structure (planar areas in Fig. 17, bottom) or a high level of symmetry and repetitive structures (Fig. 19, top and bottom, respectively).