3DRegNet: A Deep Neural Network for 3D Point Registration

04/02/2019, by G. Dias Pais et al.

We present 3DRegNet, a deep learning algorithm for the registration of 3D scans. With the recent emergence of inexpensive 3D commodity sensors, it would be beneficial to develop a learning-based 3D registration algorithm. Given a set of 3D point correspondences, we build a deep neural network using deep residual layers and convolutional layers to achieve two tasks: (1) classification of the point correspondences into correct/incorrect ones, and (2) regression of the motion parameters that align the scans into a common reference frame. 3DRegNet has several advantages over classical methods. First, since 3DRegNet works on point correspondences and not on the original scans, our approach is significantly faster than many conventional approaches. Second, we show that the algorithm can be extended to multi-view scenarios, i.e., simultaneous handling of the registration of more than two scans. In contrast to pose regression networks that employ four variables to represent rotation using quaternions, we use the Lie algebra to represent the rotation with only three variables. Extensive experiments on two challenging datasets (ICL-NUIM and SUN3D) demonstrate that we outperform other methods and achieve state-of-the-art results. The code will be made available.


1 Introduction

Figure 3: (a) Inlier/outlier classification using the proposed 3DRegNet; green correspondences indicate the inliers and red ones the outliers. (b) Result of estimating the transformation that aligns the two point clouds. Given a set of 3D point correspondences from two scans with outliers, 3DRegNet simultaneously classifies the point correspondences into inliers and outliers (a) and computes the transformation (rotation and translation) necessary for the alignment of the scans (b). Our learning-based registration algorithm is significantly faster than standard geometric methods and outperforms them.

In this paper, we address the problem of the registration of 3D scans, one of the classical problems in geometric computer vision. In 3D registration, we compute the 6 Degrees Of Freedom (DoF) motion parameters between two scans given noisy point correspondences containing outliers. The standard approach is to use minimal solvers that employ three point correspondences (see the Orthogonal Procrustes problem [39]) in a RANSAC [14] framework, followed by refinement techniques such as ICP [4]. In this paper, we investigate whether this registration problem can be solved using a deep neural network (see Fig. 3). One may wonder: what is the point of developing a new algorithm for a well-solved classical problem? Consider the impact of classical 3D registration algorithms in a wide variety of vision, robotics, and medical applications. We would like to investigate whether the deep learning machinery can provide any complementary advantages over classical 3-point solvers and ICP variants. In particular, we want to study whether we can obtain a significant speedup without compromising the accuracy of the registration in the presence of outliers. The hard part is not the computation of the pose given point correspondences; the real challenge is handling the outliers efficiently.

Consider Fig. 3(a). We show the classification of noisy point correspondences into inliers and outliers using 3DRegNet for the alignment of two 3D scans (see Fig. 3(b)). As shown in Fig. 4, our network architecture consists of two subnetworks: classification and registration. The former takes a set of noisy point correspondences between two scans and produces weight parameters that indicate whether a given point correspondence is an inlier or an outlier. The registration network, on the other hand, directly produces the 6 DoF motion parameters for the alignment of the two 3D scans. Our main contributions are summarized below:

  1. We show a novel deep neural network architecture for solving the problem of 3D scan registration;

  2. We achieve significant speedup compared to traditional methods that employ RANSAC and other global methods;

  3. We show that the minimal number of three parameters for the rotation parameterization achieves the best results;

  4. We show that it is possible to handle multi-view relations, i.e., simultaneous computation of pose parameters for multiple point-clouds; and

  5. We outperform other registration algorithms such as fast global registration [51] on synthetic Augmented ICL-NUIM [7] and the SUN3D [50] datasets.

2 Related Work

The Iterative Closest Point (ICP) algorithm is the gold standard for the registration of point-clouds [4, 35] given a good initialization, and several methods over the years have improved its efficiency and robustness [40, 31, 32, 46, 16, 25, 34]. There have been a few non-rigid 3D registration approaches [52, 3, 41, 26], and a survey on rigid and non-rigid registration of 3D point-clouds is presented in [43]. Optimal least-squares solutions can be obtained using methods such as [44, 40, 31, 29, 20, 45, 51]. Many of these methods either need a good initialization or first identify the correspondences using a RANSAC framework and then compute the optimal pose using the selected inliers. In contrast, we focus on jointly identifying the inlier correspondences and estimating the transformation parameters using a single network, without any requirement on the initialization.

In the last few years, we have witnessed the development of deep learning algorithms for solving geometrical problems. In particular, extending CNNs beyond regular grid-graph images is not straightforward. For example, PointNet is a deep neural network that produces classification and segmentation results for unordered point-clouds [37]. PointNet strives to achieve results that are invariant to the order of points, rotations, and translations. In order to achieve invariance to order, PointNet applies multi-layer perceptrons (MLPs) individually to different points and then uses a symmetric function on top of the outputs of the MLPs. The registration block in our network is inspired by their segmentation block. However, there are several differences in our network architecture, such as the use of collective information from all deep residual layers (see the pooling and context normalization blocks in Fig. 4) and the extraction of features at different levels of the network.

The closest to our work is the recent deep learning algorithm [48] for learning to identify good correspondences in 2D images. In particular, [48] can be seen as a deep learning algorithm that is strongly rooted in multi-view geometry constraints. The network produces a classification of 2D point correspondences into inliers/outliers; the regression of the essential matrix is computed separately using eigendecomposition on the inlier correspondences. The input to this network consists only of pixel coordinates and not the original images, which allows fast inference. That work introduces a context normalization module that processes the data points independently, yet still embeds global information to handle the notorious problem of achieving invariance to the order of point correspondences. While our approach follows [48] closely, we have the following differences:

  1. Our approach is developed for 3D point correspondences, and theirs focus on 2D point correspondences;

  2. While the classification block of our network (See Fig. 4) is similar to theirs, we have an additional 3D registration component that explicitly computes the rotation and translation parameters;

  3. In addition to the point correspondences, their approach also requires the essential matrix to be given as additional supervision for training the network. In contrast, we require only noisy point correspondences as the input; and

  4. While their approach is shown for only 2 views, we show that our registration technique can simultaneously handle more than 2 views (i.e., trifocal cases as well).

The idea of using a neural network to regress pose variables has existed in the literature. In fact, RatSLAM [30] is a classical algorithm for 3D reconstruction that uses a neural network with local view cells and pose cells to denote locations and heading directions, respectively. More recently, Kendall et al. [23] developed PoseNet, a 23-layer deep convolutional network based on GoogLeNet [42], for computing rotation and translation for the task of image-based localization. By training on a set of images and their global poses (locations and orientations), the network learns to directly compute the pose for a given image. PoseNet uses quaternions, and thus produces 4 output variables for denoting rotations.

During 3D reconstruction using a large collection of scans, rotation averaging can be used to improve the pairwise relative pose estimates using robust methods [6]. Recently, it was shown that deep neural networks can be used to compute the weights for different pairwise relative pose estimates [21]. When performing the 3D registration of overlapping point-clouds, one can use point feature histograms as descriptors for 3D points [38]. The matching of 3D points can also be achieved by extracting features using convolutional neural networks [49, 10, 47, 12]. Some works extract 3D features directly from the point-clouds using spherical CNNs that are equivariant to 3D rotations [8, 13]. A deep network has also been designed for computing the pose for direct image-to-image registration [17]. Using graph convolutional networks and cycle consistency losses, one can train an image matching algorithm in an unsupervised manner [36]. Other methods have been proposed for mapping/registration using deep neural networks [11, 12, 19]. In contrast to our work, both the registration and the feature extraction are encoded in their deep network architectures.

A few papers in the literature use the minimal three-parameter representation for the rotation. In [22], a deep learning approach solves the classification of an action based on the human's skeleton motion with a Lie group based network. The minimal number of rotation parameters was also used in [5], in which the authors use a network to predict the motion of a robotic arm.

Figure 4: The proposed 3DRegNet architecture. On the left, we show a pair of 3D scans from an office environment [7] and the 3D correspondences, in the form of an array of 6-tuples, between the two scans. On the right, we show the proposed network architecture. The network takes the set of point correspondences as input and produces outputs from two blocks: 1) the classification block, which produces weights indicating whether a correspondence is an inlier or an outlier; and 2) the registration block, which outputs the rotation and translation parameters that are necessary for the alignment of the two scans.

3 Problem Statement

We are given N 3D point correspondences {(p_i, q_i)}, i = 1, ..., N, where p_i is a 3D point in the first scan and q_i is the corresponding 3D point in the second scan. Our goal is to compute the transformation parameters (rotation matrix R and translation vector t) as shown below:

(R*, t*) = argmin_{R, t} Σ_{i=1}^{N} ρ(q_i, R p_i + t),    (1)

where ρ(·,·) is some distance metric for the pair of elements q_i and R p_i + t. The problem addressed in this work is shown in Fig. 4. We input the point correspondences and obtain output variables denoting the weights for classifying the correspondences and the transformation parameters. We define the output weights as a vector w = (w_1, ..., w_N), where w_i = 0 denotes that the i-th correspondence is an outlier and w_i = 1 denotes that it is an inlier. The additional six motion parameters are given by three parameters (v_1, v_2, v_3) for the rotation R and three parameters (t_1, t_2, t_3) for the translation t.

While previous approaches use an over-parameterization for the rotation (e.g., PoseNet [23] uses the four-parameter quaternion representation, and deep PnP [9] uses nine parameters for the rotation), we use the minimal number of parameters, i.e., three parameters via the Lie algebra: R = exp([v]_x), where v = (v_1, v_2, v_3) and [v]_x is the anti-symmetric matrix that expresses the cross product in a linear form, i.e., [v]_x u = v × u for any vector u.
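To make this parameterization concrete, the following is a minimal NumPy sketch of the exponential map that converts the three-vector v into a rotation matrix via the Rodrigues formula. The function name and the small-angle cutoff are our own choices for illustration, not part of the paper.

```python
import numpy as np

def rotation_from_lie(v):
    """Exponential map R = exp([v]_x) via the Rodrigues formula."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)                        # small-angle case: R is (close to) the identity
    k = np.asarray(v, dtype=float) / theta      # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])          # [k]_x, the cross-product matrix of the axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```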

4 3DRegNet Architecture

The proposed network architecture (3DRegNet) is shown in Fig. 4 with two blocks for classification and registration.

Classification Block:  Our classification block (see the respective block in Fig. 4) is inspired by the network architecture of [48]. The input to the network is the set of N 3D point correspondences, given as 6-tuples between the two scans. Each point correspondence (6-tuple) is processed by a fully connected layer with 128 ReLU activation units. The weights are shared across the individual point correspondences, and the output is of dimension N × 128, i.e., we generate a 128-dimensional feature for every point correspondence. The output is then passed through 12 deep ResNet blocks [18], with weight-shared fully connected layers instead of convolutional layers. At the end, we use another fully connected layer followed by a ReLU and a tanh unit to produce the weights w_i in the range [0, 1).
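For concreteness, here is a minimal TensorFlow 2 sketch of the classification block as described above. The layer widths and block count follow the text; the helper names and the exact arrangement of the residual units are assumptions of ours, not the authors' reference implementation.

```python
import tensorflow as tf

def resnet_block(x, units=128):
    # Weight-shared fully connected residual unit applied to every correspondence.
    y = tf.keras.layers.Dense(units, activation="relu")(x)
    y = tf.keras.layers.Dense(units)(y)
    return tf.nn.relu(x + y)

def classification_block(num_corr=3000, num_blocks=12):
    corr = tf.keras.Input(shape=(num_corr, 6))            # one 6-tuple per correspondence
    stages = [tf.keras.layers.Dense(128, activation="relu")(corr)]
    for _ in range(num_blocks):
        stages.append(resnet_block(stages[-1]))           # 13 feature stages in total
    logits = tf.keras.layers.Dense(1)(stages[-1])         # per-correspondence output o_i
    weights = tf.tanh(tf.nn.relu(logits))                 # weights w_i in [0, 1)
    feats = tf.stack(stages, axis=1)                      # (batch, 13, N, 128), kept for the registration block
    return tf.keras.Model(corr, [weights, feats])
```

The stacked per-stage features are exposed as a second output so that the registration block described next can pool over them.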

Registration Block:  The input to this block are the features extracted from the point correspondences. As shown at the bottom of Fig. 4, we use pooling to extract meaningful features of dimension 128 from each layer of the classification block. We extract features at 13 stages of the classification block, i.e., the first one is extracted before the first ResNet block and the last one after the last (12th) ResNet block. Based on our experiments, max-pooling performed best in comparison with other choices such as average pooling. After the pooling, we apply context normalization, as introduced in [48], and concatenate the 13 feature maps. This normalizes the features and helps to extract a necessary, fixed number of features for obtaining the transformation at the end of the registration block (it should be independent of N). The concatenated features after context normalization form a 128 × 13 map, which is passed to a convolutional layer with 8 channels; each filter covers a 3-by-3 patch with a stride of 2 along the columns and 1 along the rows. The output of the convolution is then fed to two fully connected layers with 256 units each, with a ReLU between them, which generate the output of six variables: v and t.
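Continuing the sketch, the registration block below consumes the 13 per-stage feature maps from the classification block. The max-pooling over the correspondences, the 8-channel convolution, and the two 256-unit fully connected layers follow the text; the use of LayerNormalization as a stand-in for context normalization and the exact stride layout are assumptions for this illustration.

```python
import tensorflow as tf

def registration_block(num_stages=13):
    # Per-stage features from the classification block: (batch, 13, N, 128).
    stage_feats = tf.keras.Input(shape=(num_stages, None, 128))
    pooled = tf.reduce_max(stage_feats, axis=2)                          # max-pool over the N correspondences
    normed = tf.keras.layers.LayerNormalization(axis=-1)(pooled)         # stand-in for context normalization
    x = tf.expand_dims(normed, axis=-1)                                  # (batch, 13, 128, 1) single-channel map
    x = tf.keras.layers.Conv2D(8, 3, strides=(1, 2), padding="same")(x)  # 3x3 filters, 8 channels
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    pose = tf.keras.layers.Dense(6)(x)                                   # (v1, v2, v3, t1, t2, t3)
    return tf.keras.Model(stage_feats, pose)
```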

4.1 Loss Functions

Our loss function consists of two terms from the two blocks of the network.

Classification Loss: The goal is to provide a loss function that identifies the input correspondences as inliers or outliers. For this, we use

L_c^k = (1/N) Σ_{i=1}^{N} γ_i H(y_i^k, σ(o_i^k)),    (2)

where o_i^k are the network outputs before passing them through the ReLU and tanh units that compute the weights w_i^k, and σ(·) denotes the sigmoid activation function. Note that the motion between different pairs of scans is different, and we use the index k to denote the associated training pair of scans. H is the cross-entropy function, and y_i^k (equal to one or zero) is the ground-truth label that indicates whether the i-th point correspondence is an inlier or an outlier. The loss term L_c^k is the classification loss for the 3D point correspondences of a particular scan pair with index k. The term γ_i balances the classification loss by the number of examples of each class in the associated scan pair k.

Registration Loss: For this loss function, we use the distance between the 3D points q_i^k in the second scan and the transformed points R_k p_i^k + t_k from the first scan, for i = 1, ..., N. The loss function is

L_r^k = (1/N) Σ_{i=1}^{N} ρ(q_i^k, R_k p_i^k + t_k),    (3)

where ρ(·,·) is the distance metric function. For a given scan pair k, the relative motion parameters obtained from the registration block are given by R_k and t_k. We considered and evaluated several distance metrics, such as weighted least squares and Geman-McClure [15], and selected the one that produced the best results.

Total Loss: The collective loss terms for all the pairs of scans in the training data are given below:

L_c = (1/K) Σ_{k=1}^{K} L_c^k,    L_r = (1/K) Σ_{k=1}^{K} L_r^k,    (4)

where K is the total number of scan pairs in the training set. The total training loss is the sum of the classification and registration loss terms:

L = α L_c + β L_r,    (5)

where the coefficients α and β are hyperparameters that are manually set for the classification and registration terms in the loss function.
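As an illustration, the sketch below combines the classification and registration losses of Eqs. (2)-(5) for a single scan pair. The class-balancing scheme, the ℓ1-style registration distance, and the default α, β values are assumptions made for the example, not the paper's exact choices.

```python
import tensorflow as tf

def total_loss(o, y, p, q, R, t, alpha=1.0, beta=1.0):
    # o: (N,) raw outputs, y: (N,) inlier labels, p, q: (N, 3) points, R: (3, 3), t: (3,)
    pos = tf.reduce_mean(y)                                        # fraction of inliers in this scan pair
    gamma = y / (pos + 1e-6) + (1.0 - y) / (1.0 - pos + 1e-6)      # per-class balancing weights
    cls = tf.reduce_mean(
        gamma * tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=o))
    pred = tf.linalg.matmul(p, R, transpose_b=True) + t            # R p_i + t for every correspondence
    reg = tf.reduce_mean(tf.norm(q - pred, ord=1, axis=-1))        # distance rho between q_i and R p_i + t
    return alpha * cls + beta * reg
```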

5 Trifocal 3DRegNet

Standard methods for 3D registration involve only pairwise alignment. To register more than two scans, we sequentially register the first and second, followed by the second and third, and so on. We show that it is possible to extend the 3D registration to simultaneously handle three scans using deep neural networks. Let us consider Scans 1, 2, and 3, and assume that we have correspondences between every pair of scans in the form of 6-tuples, as shown below:

  • Scans 1 & 2: (p_i, q_i) for i = 1, ..., N_12;

  • Scans 2 & 3: (q'_j, r_j) for j = 1, ..., N_23; and

  • Scans 1 & 3: (p'_l, r'_l) for l = 1, ..., N_13,

where p, q, and r denote 3D points in Scans 1, 2, and 3, respectively.

We show the network in Fig. 5. The basic idea of building a 3-view 3DRegNet is straightforward: we use two 3DRegNet modules, one for the correspondences between the first and second scans and one for those between the second and third scans. We design the loss function so that there is alignment between all three scans, and we share the same network parameters between the two 3DRegNet modules.

Figure 5: Trifocal 3DRegNet: The input consists of three sets of correspondences: between Scans 1 & 2, Scans 2 & 3, and Scans 1 & 3. The first two sets of 3D point correspondences have associated 3DRegNets, and the third set is utilized only in the loss function. The output is a set of rotations, translations, and weights for the correspondences between Scans 1 & 2 and Scans 2 & 3.

5.1 Loss Functions for Trifocal 3DRegNet

We use the three sets of correspondences defined above, (p_i, q_i), (q'_j, r_j), and (p'_l, r'_l), in our loss function. In particular, the loss function uses the cycle consistency between Scans 1, 2, and 3.

Classification Loss: It involves terms that depend on the correspondences between Scans 1 & 2 and between Scans 2 & 3. The total classification loss is given by the average of the individual losses:

L_c = (L_c^{12} + L_c^{23}) / 2,    (6)

where L_c^{12} and L_c^{23} are the classification losses defined in (4), given the correspondences between Scans 1 & 2 and Scans 2 & 3, respectively.

Registration Loss: We start from the same loss defined in (4) for the two sets of correspondences used as inputs to the 3DRegNet blocks, L_r^{12} and L_r^{23}.

Now, consider the set of correspondences (p'_l, r'_l) from the first to the third 3D scan. By cascading the two 3D rigid transformations, this loss function is defined as

L_r^{13} = (1/N_13) Σ_{l=1}^{N_13} ρ(r'_l, R_23 (R_12 p'_l + t_12) + t_23).    (7)

The total registration loss is given below:

L_r = (L_r^{12} + L_r^{23} + L_r^{13}) / 3.    (8)

Please note that these loss functions correspond to only one triplet in the training data. The total loss is the sum of the classification and registration terms over all the different scan triplets, as in Eqs. (4) and (5).
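A small NumPy sketch of the cycle term in Eq. (7): the two pairwise transformations are cascaded to map the scan-1 points of the 1 & 3 correspondences into Scan 3 before the distance is measured. The ℓ2 distance used here is an illustrative choice for ρ.

```python
import numpy as np

def cycle_registration_loss(p13, r13, R12, t12, R23, t23):
    # p13, r13: (N13, 3) correspondences between scans 1 and 3.
    # Cascade: first map into scan 2 with (R12, t12), then into scan 3 with (R23, t23).
    in_scan2 = p13 @ R12.T + t12
    in_scan3 = in_scan2 @ R23.T + t23
    return np.mean(np.linalg.norm(r13 - in_scan3, axis=1))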

6 Datasets and 3DRegNet Training

Datasets: We use two datasets: the synthetic Augmented ICL-NUIM Dataset [7] and SUN3D [50]. The former is divided into 4 scenes with about 8500 different pairs of connected point-clouds. The latter is composed of 13 randomly selected real scenes, with a total of 3700 different connected pairs. Using FPFH [38], we extract around 3000 3D point correspondences for each pair of scans in both datasets. Based on the ground-truth transformations and the 3D distance between the transformed points, correspondences are labeled as inliers (one) or outliers (zero) using a predefined distance threshold. The threshold is set such that the number of outliers is about 50% of the total matches. We select 70% of the pairs for training and 30% for testing in the ICL-NUIM Dataset. For the SUN3D Dataset, we select 10 scenes for training and 3 scenes for testing. We stress that, for the SUN3D Dataset, the 3 test scenes are unseen data, disjoint from the training set.
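The labeling step described above can be summarized by a small sketch; the function name and array layout are ours, and tau is the distance threshold tuned to give roughly 50% outliers.

```python
import numpy as np

def label_correspondences(p, q, R_gt, t_gt, tau):
    # p, q: (N, 3) matched points from FPFH; a match is an inlier (label 1)
    # when the ground-truth transform maps p to within tau of q.
    residuals = np.linalg.norm(p @ R_gt.T + t_gt - q, axis=1)
    return (residuals < tau).astype(np.float32)
```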

To train the trifocal network, we extract sequences of three sets of correspondences from the 3D scans in the ICL-NUIM Dataset.

Training: The proposed architecture is implemented in TensorFlow [1]. The network was trained for 1200 epochs of 625 steps each with the Adam optimizer [24]; the learning rate and batch size were kept fixed. A cross-validation strategy was used during training. The coefficients α and β of the classification and registration terms were set manually. The network was trained on an Intel i7-7600 CPU and an Nvidia GeForce GTX 1080Ti GPU.

Data Augmentation: To generalize to unseen rotations, we augment the training dataset by applying random rotations. We take inspiration from [2, 28, 33] and propose a Curriculum Learning (CL) data augmentation strategy. The idea is to start small [2] (i.e., with easier tasks containing small values of translation and rotation) and to order the tasks by increasing difficulty; the training only proceeds to harder tasks after the easier ones are completed.

We adopt a variation of the conventional CL schedule. Let the magnitude of the augmented rotation applied during training be denoted by θ, with a maximum value θ_max, and let s ∈ [0, 1] denote the normalized progress within an epoch. In conventional CL, we would start small at the beginning of every epoch; however, this breaks the smoothness of the regularization, since the maximum value θ_max has already been reached at the end of the previous epoch. This is easily tackled by progressively increasing θ up to θ_max partway through the epoch and decreasing it afterwards, so that consecutive epochs join smoothly (see the sketch below).
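The schedule can be sketched as a simple triangular ramp over the epoch; placing the turning point at mid-epoch is an assumption made for illustration.

```python
def augmentation_rotation_deg(step, steps_per_epoch, max_deg):
    # Ramp the augmentation rotation magnitude up to max_deg and back down
    # within each epoch, so consecutive epochs join smoothly at small values.
    s = (step % steps_per_epoch) / steps_per_epoch   # normalized progress in [0, 1)
    return max_deg * (2.0 * s if s < 0.5 else 2.0 * (1.0 - s))
```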

7 Experimental Results

Evaluation Metrics: We define the following metrics for the accuracy of the rotation, translation, and classification. For the rotation, we use

e_R = arccos( (tr(R_gt^T R_est) − 1) / 2 ),    (9)

where R_est and R_gt are the estimated and ground-truth rotation matrices, respectively. We refer to [27] for more details. For measuring the accuracy of the translation, we use

e_t = ||t_est − t_gt||_2,    (10)

where t_est and t_gt are the estimated and ground-truth translation vectors. For the classification accuracy, we use the standard classification error; the computed weights are rounded to 0 or 1 based on a threshold before measuring the classification error.
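A NumPy sketch of the two error metrics in Eqs. (9) and (10); the clipping guards against numerical round-off and is our addition.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    # Geodesic angle between the estimated and ground-truth rotations (Eq. (9)).
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    # Euclidean distance between the estimated and ground-truth translations (Eq. (10)).
    return np.linalg.norm(t_est - t_gt)
```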

7.1 Parameterization of R

We studied three different representations for R: 1) the minimal Lie algebra with three parameters; 2) the quaternion with four parameters; and 3) the linear matrix form with nine parameters. The 3DRegNet was trained on the ICL-NUIM Dataset. The results on the training set are shown in Tab. 1.

Representation   Rotation [deg]     Translation [m]    Time [s]   Classification
                 Mean     Median    Mean     Median               Accuracy
Lie Algebra      1.52     0.55      0.043    0.029      0.0090    0.96
Quaternions      1.64     1.11      0.065    0.05       0.0087    0.86
Linear           3.36     2.65      0.136    0.11       0.0083    0.66

Table 1: Evaluation of the use of different representations for the 3D rotations.

We observe that the minimal parameterization using the Lie algebra provides the best results, although the linear representation has a slight advantage in terms of computation time. In the experimental results that follow, we use the three-parameter Lie algebra representation.

7.2 Ablation Study

We study the role of the classification and registration blocks in the overall performance. For this purpose, we consider each loss separately, using the ICL-NUIM Dataset for this analysis. The results are shown in Tab. 2. For ease of comparison, we repeat the best result from Tab. 1.

Losses in (5)     Rotation [deg]     Translation [m]    Time [s]   Classification
                  Mean     Median    Mean     Median               Accuracy
Both L_c and L_r  1.52     0.55      0.043    0.029      0.0090    0.96
Only L_c          99.65    99.18     2.31     2.27       0.0091    0.96
Only L_r          1.47     0.81      0.058    0.034      0.0093    0.53

Table 2: Evaluation of the use of only the classification loss (second row) and only the registration loss (third row) in the total loss function (5).

From these results, we observe that using both losses gives significantly better results overall. The mean rotation error using only the registration loss is marginally better, but the median rotation error using both losses is significantly better than using only one of the loss terms.

7.3 Sensitivity to the number of correspondences

In this test, instead of considering all the correspondences in each of the pairwise scans of the testing examples, we select a percentage of the total number of matches ranging from 10% to 100% (recall that the total number of correspondences per pair is around 3000). The errors are estimated as before for all the available pairwise scans, and the results are shown in Tab. 3.

Matches   Rotation [deg]     Translation [m]    Time [s]   Classification
          Mean     Median    Mean     Median               Accuracy
10%       2.11     0.89      0.057    0.039      0.0088    0.95
25%       1.70     0.65      0.047    0.032      0.0089    0.96
50%       1.62     0.57      0.045    0.030      0.0089    0.96
75%       1.56     0.55      0.044    0.029      0.0090    0.96
90%       1.52     0.55      0.043    0.029      0.0086    0.96
100%      1.52     0.55      0.043    0.029      0.0090    0.96

Table 3: Evaluation of the use of a different number of correspondences, in terms of rotation (deg) and translation errors, computation time, and classification accuracy.

The accuracy of the regression degrades as the number of input correspondences decreases, while the classification is not affected, as expected: the inlier/outlier classification should not depend on the number of input correspondences, whereas a larger number of inliers should lead to a better transformation estimate.

7.4 Data Augmentation

Figure 6: Difference between training with and without data augmentation, for one test example. We observe an improvement in the test results when disturbances are applied: the data augmentation generalizes the network to rotations that were not included in the original dataset.
ICL-NUIM Dataset
Method         Rotation [deg]     Translation [m]    Time [s]
               Mean     Median    Mean     Median
3DRegNet       1.52     0.55      0.043    0.029      0.009
FGR            3.55     0.86      0.071    0.032      0.078
RANSAC         3.14     1.04      0.056    0.040      0.37
3DRegNet + U   1.34     0.25      0.045    0.013      0.082
RANSAC + U     1.2      0.73      0.045    0.031      0.360

SUN3D Dataset
Method         Rotation [deg]     Translation [m]    Time [s]
               Mean     Median    Mean     Median
3DRegNet       2.19     2.01      0.100    0.089      0.012
FGR            3.82     2.37      0.143    0.079      0.074
RANSAC         5.31     2.03      0.228    0.082      3.391
3DRegNet + U   1.49     1.36      0.071    0.059      0.015
RANSAC + U     5.09     1.75      0.217    0.065      3.392

Table 4: Comparison with the baselines. The first three rows compare the proposed 3DRegNet with previous state-of-the-art techniques, FGR [51] and a RANSAC-based approach [14, 39]. The last two rows evaluate the quality of the classification block by injecting the selected inliers into a refinement technique (the least-squares method of [44]).

Figure 7: Three examples of 3D point-cloud alignment using the 3DRegNet, FGR, and RANSAC methods (columns: Ground-Truth, 3DRegNet, RANSAC, FGR; rows: MIT, Brown, Harvard). A pair of 3D scans was chosen from each of three scenes of the SUN3D dataset: the MIT, Brown, and Harvard sequences. These sequences were not used in training.

Using the 3DRegNet trained in the previous sections, we select a pair of 3D scans from the training data and rotate the original point-clouds to increase the rotation angle between them. We vary the magnitude of this rotation from 0 to 50 degrees, and the resulting rotation error and accuracy on the test are shown in Fig. 6 (green curve). Afterwards, we train the network a second time using the data augmentation strategy proposed in Sec. 6, in which the pairs of examples are disturbed by rotations of increasing magnitude up to a set maximum. We run the same test as before, and the results are shown in Fig. 6 (blue curve).

From this experiment we conclude that, by training only with the original dataset, we are constrained to the rotations that exist within that dataset. By performing a smooth regularization (CL data augmentation), we can overcome this drawback.

7.5 Baselines

We use two baselines to compare against 3DRegNet, in terms of the estimated transformations and the computation time. The first is Fast Global Registration [51] (FGR), a geometric method that computes a global solution from a set of 3D correspondences. The second is a RANSAC-based method [14]. We use both the ICL-NUIM and the SUN3D datasets; these results are shown in the first three rows of Tab. 4. Afterwards, we evaluate the quality of the classification. For that purpose, we take the inliers given by 3DRegNet and the ones given by RANSAC, and inject these 3D point correspondences into a least-squares refinement technique [44] (denoted by 3DRegNet + U and RANSAC + U, respectively). These results are shown in the fourth and fifth rows of Tab. 4.
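For reference, the "+ U" refinement can be sketched as the classical weighted closed-form rigid alignment in the spirit of [44]; the weighting and variable names here are illustrative, not the exact implementation used in the paper.

```python
import numpy as np

def refine_rigid(p, q, w):
    # Weighted closed-form rigid alignment of the selected inliers (Umeyama/Kabsch style).
    # p, q: (N, 3) corresponding points; w: (N,) inlier weights from 3DRegNet or RANSAC.
    w = w / w.sum()
    mp, mq = w @ p, w @ q                                # weighted centroids
    H = (p - mp).T @ np.diag(w) @ (q - mq)               # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                   # rotation mapping p onto q
    t = mq - R @ mp
    return R, t
```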

3DRegNet outperforms all the baseline methods in almost all the evaluation criteria. We conclude that: 1) 3DRegNet gives the best estimate of the transformation parameters when compared to the FGR estimation (see the first and second rows); 2) from the results in the fourth and fifth rows, the inliers selected by the classification block are of better quality than the ones given by RANSAC (note that the quality of the RANSAC solutions depends highly on the number of trials given to the solver, which may increase the computation time exponentially; here we use 40000 trials, for which RANSAC alone is already about 40x slower than the 3DRegNet solver); and 3) 3DRegNet is significantly faster than any of the previous methods.

Examples of 3D alignments using 3DRegNet and the baseline methods are shown in Fig. 7. For this test, we consider the network trained on the SUN3D dataset and select pairs of 3D scans from the three held-out test sequences (MIT, Brown, and Harvard). As we can see from these examples, 3DRegNet gives better 3D registrations.

7.6 Trifocal 3DRegNet

We consider the ICL-NUIM Dataset, in which we identify triplets of 3D scans with correspondences between Scans 1 & 2, 2 & 3, and 1 & 3. We train the network proposed in Sec. 5 with 70% of the generated data and test on the remaining 30%. The results are shown in the second row of Tab. 5. In addition, we considered the previously trained pairwise 3DRegNet and computed the pose from 1 to 2, 2 to 3, and 1 to 3. These results are shown in the first row of Tab. 5.

# Scans   Rotation [deg]     Translation [m]
          Mean     Median    Mean     Median
2         1.00     0.77      0.047    0.039
3         0.97     0.73      0.045    0.037

Table 5: Pairwise vs. trifocal 3DRegNet: We consider sets of three views and compute the motion between the first and third views using the pairwise network (by successively computing the motion from first to second, followed by second to third) and the trifocal 3DRegNet.

We observe that 3DRegNet can be extended to a higher number of 3D scans, and that the accuracy improves with this kind of extension.

8 Discussion

We propose 3DRegNet, a deep neural network that solves the scan registration problem by jointly performing outlier rejection on 3D point correspondences and computing the pose for the alignment of the scans. We show that our approach is extremely efficient and outperforms traditional approaches, and we demonstrate that it can be extended to the registration of multiple point-clouds. For more examples of 3D registrations, please refer to the Supplementary Material.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum Learning. In Int'l Conf. Machine Learning (ICML), pages 41–48, 2009.
  • [3] F. Bernard, F. R. Schmidt, J. Thunberg, and D. Cremers. A Combinatorial Solution to Non-Rigid 3D Shape-to-Image Matching. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1436–1445, 2017.
  • [4] P. Besl and N. McKay. A method for Registration of 3-D Shapes. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 14(2):239–256, 1992.
  • [5] A. Byravan and D. Fox. SE3-nets: Learning Rigid Body Motion Using Deep Neural Networks. In IEEE Int’l Conf. Robotics and Automation (ICRA), pages 173–180, 2017.
  • [6] A. Chatterjee and V. M. Govindu. Robust relative rotation averaging. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 40(4):958–972, 2018.
  • [7] S. Choi, Q. Zhou, and V. Koltun. Robust Reconstruction of Indoor Scenes. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 5556–5565, 2015.
  • [8] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In Int’l Conf. Learning Representations (ICLR), 2018.
  • [9] Z. Dang, K. M. Yi, Y. Hu, F. Wang, and P. Fua. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In European Conf. Computer Vision (ECCV), pages 792–807, 2018.
  • [10] H. Deng, T. Birdal, and S. Ilic. PPFNet: Global Context Aware Local Features for Robust 3D Point Matching. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 195–205, 2018.
  • [11] L. Ding and C. Feng. DeepMapping: Unsupervised map estimation from multiple point clouds. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019.
  • [12] G. Elbaz, T. Avraham, and A. Fischer. 3D Point Cloud Registration for Localization using a Deep Neural Network Auto-Encoder. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2472 – 2481, 2017.
  • [13] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning SO(3) Equivariant Representations With Spherical CNNs. In European Conf. Computer Vision (ECCV), pages 52–68, 2018.
  • [14] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • [15] S. Geman and D. E. McClure. Bayesian image analysis: An application to single photon emission tomography. In Proc. American Statistical Association, pages 12–18, 1985.
  • [16] V. M. Govindu and A. Pooja. On Averaging Multiview Relations for 3D Scan Registration. IEEE Trans. Image Processing (T-IP), 23(3):1289–1302, 2014.
  • [17] L. Han, M. Ji, L. Fang, and M. Nießner. RegNet: Learning the Optimization of Direct Image-to-Image Pose Registration. arXiv:1812.10212, 2018.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [19] J. F. Henriques and A. Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 8476–8484, 2018.
  • [20] D. Holz, A. E. Ichim, F. Tombari, R. B. Rusu, and S. Behnke. Registration with the Point Cloud Library: A Modular Framework for Aligning in 3-D. IEEE Robotics Automation Magazine (RA-M), 22(4):110–124, 2015.
  • [21] X. Huang, Z. Liang, X. Zhou, Y. Xie, L. Guibas, and Q. Huang. Learning Transformation Synchronization. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019.
  • [22] Z. Huang, C. Wan, T. Probst, and L. V. Gool. Deep Learning on Lie Groups for Skeleton-based Action Recognition. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 6099–6108, 2017.
  • [23] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In IEEE Int’l Conf. Computer Vision (ICCV), pages 2938–2946, 2015.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int’l Conf. Learning Representations (ICLR), 2015.
  • [25] H. Li and R. Hartley. The 3D-3D Registration Problem Revisited. In IEEE Int'l Conf. Computer Vision (ICCV), pages 1–8, 2007.
  • [26] L. Ma, J. Stückler, C. Kerl, and D. Cremers. Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras. In IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), pages 598–605, 2017.
  • [27] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry. An Invitation to 3-D Vision. Springer-Verlag New York, 2004.
  • [28] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher-Student Curriculum Learning. arXiv:1707.00183, 2017.
  • [29] N. Mellado, N. Mitra, and D. Aiger. SUPER 4PCS: Fast Global Pointcloud Registration via Smart Indexing. Computer Graphics Forum (Proc. EUROGRAPHICS), 33(5):205–215, 2014.
  • [30] M. Milford, G. Wyeth, and D. Prasser. RatSLAM: A Hippocampal Model for Simultaneous Localization and Mapping. In IEEE Int’l Conf. Robotics and Automation (ICRA), volume 1, pages 403–408, 2004.
  • [31] A. Myronenko and X. Song. Point set registration: Coherent point drift. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 32(12):2262–2275, 2010.
  • [32] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In IEEE Int’l Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136, 2011.
  • [33] I. Öksüz, B. Ruijsink, E. Puyol-Antón, J. R. Clough, G. Cruz, A. Bustin, C. Prieto, R. M. Botnar, D. Rueckert, J. A. Schnabel, and A. P. King. Automatic CNN-based detection of cardiac MR motion artifacts using k-space data augmentation and curriculum learning. arXiv:1810.12185, 2018.
  • [34] J. Park, Q. Zhou, and V. Koltun. Colored Point Cloud Registration Revisited. In IEEE Int’l Conf. Computer Vision (ICCV), pages 143–152, 2017.
  • [35] G. P. Penney, P. J. Edwards, A. P. King, J. M. Blackall, P. G. Batchelor, and D. J. Hawkes. A Stochastic Iterative Closest Point Algorithm (stochastICP). In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 762–769, 2001.
  • [36] S. Phillips and K. Daniilidis. All Graphs Lead to Rome: Learning Geometric and Cycle-Consistent Representations with Graph Convolutional Networks. arXiv:1901.02078, 2019.
  • [37] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017.
  • [38] R. B. Rusu, N. Blodow, and M. Beetz. Fast Point Feature Histograms (FPFH) for 3D registration. In IEEE Int’l Conf. Robotics and Automation (ICRA), pages 3212–3217, 2009.
  • [39] P. H. Schönemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966.
  • [40] A. V. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Robotics: Science and Systems (RSS), 2009.
  • [41] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic. KillingFusion: Non-rigid 3D Reconstruction without Correspondences. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 5474–5483, 2017.
  • [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • [43] G. K. Tam, Z.-Q. Cheng, Y. Lai, F. C. Langbein, Y. Liu, D. Marshall, R. R. Martin, X. Sun, and P. L. Rosin. Registration of 3D Point Clouds and Meshes: A Survey from Rigid to Nonrigid. IEEE Trans. Visualization and Computer Graphics (T-VCG), 19(7):1199–1217, 2013.
  • [44] S. Umeyama. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 13(4):376–380, 1991.
  • [45] J. Yang, H. Li, D. Campbell, and Y. Jia. Go-ICP: Solving 3D Registration Efficiently and Globally Optimally. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 38(11):2241–2254, 2016.
  • [46] J. Yang, H. Li, and Y. Jia. Go-ICP: Solving 3D Registration Efficiently and Globally Optimally. In IEEE Int’l Conf. Computer Vision (ICCV), pages 1457–1464, 2013.
  • [47] Z. J. Yew and G. H. Lee. 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. In European Conf. Computer Vision (ECCV), pages 630–646, 2018.
  • [48] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to Find Good Correspondences. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2666–2674, 2018.
  • [49] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 199–208, 2017.
  • [50] J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels. In IEEE Int'l Conf. Computer Vision (ICCV), 2013.
  • [51] Q.-Y. Zhou, J. Park, and V. Koltun. Fast Global Registration. In European Conf. Computer Vision (ECCV), pages 766–782, 2016.
  • [52] M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger. Real-time Non-rigid Reconstruction Using an RGB-D Camera. ACM Trans. Graphics, 33(4), 2014.

Appendix A Additional Results

We show some additional figures to better illustrate the advantages of 3DRegNet compared to previous methods (see Tab. 4 of the main document), and the comparison between the pairwise and trifocal 3DRegNet (see Tab. 5 of the main document).

3D Registration from Several 3D Scans: We start by showing additional experimental results on 3D scan alignment to complement those shown in Fig. 7 of the paper. The same three sequences were used (MIT, Brown, and Harvard) from the SUN3D dataset. Please note that 3DRegNet was not trained on these sequences; they were only used for testing. These experiments are similar to the ones in Fig. 7, but instead of showing only a pair of 3D scans (as required by each of the methods), we show the registration of 30 RGB-D images. The 3D alignment was computed in a pairwise manner, i.e., we compute the transformation from Scan 1 to Scan 2, from Scan 2 to Scan 3, ..., and from Scan 29 to Scan 30. Then, we apply transformations to move all the Scans 2, 3, ..., 30 into a common reference frame: for each computed pairwise transformation, we pre-multiply the corresponding ground-truth transformation to the first scan, so that all the 3D scans are placed in the first (common) reference frame. We use the same procedure for the 3D registrations computed by the 3DRegNet, RANSAC, and FGR methods, and the results are shown in Fig. A.8 (an additional row with the ground-truth transformations is shown for comparison). We use the network that was trained to obtain the results in Tab. 4. No additional non-linear refinement (e.g., motion averaging) was used.

As we can see from these figures, the registration results of the 30 scans given by our method are much closer to the ground-truth than those of the baseline methods, FGR and RANSAC. Our proposed method is also significantly faster than the competing methods, as shown in Tab. 4 of the main paper.

Figure A.8: Examples of the registration of 30 scans using the 3DRegNet, FGR, and RANSAC methods (rows: Ground-Truth, 3DRegNet, RANSAC, FGR; columns: MIT, Brown, Harvard). These results are similar to the ones shown in Fig. 7 of the main paper, but instead of considering the alignment of a single pair of 3D scans, we align 30 3D scans. We use the same three scenes of the SUN3D dataset (MIT, Brown, and Harvard), with no additional datasets and no additional parameter tuning. These sequences were not used in training.

3D Registration using the Trifocal Network: Next, we show experimental results with the trifocal network proposed in Sec. 5 of the paper. In this case, we use the network trained to obtain the results of Tab. 5, i.e., trained on the synthetically generated ICL-NUIM dataset. Given three 3D scans with correspondences between Scan 1 & Scan 2, Scan 2 & Scan 3, and Scan 1 & Scan 3, we run both the original pairwise 3DRegNet and the trifocal network (which uses two 3DRegNet blocks) to compute the transformations from Scan 1 to Scan 2 and from Scan 2 to Scan 3 (we did not use the pairwise 3DRegNet to obtain the pose from Scan 1 to Scan 3 directly). We consider three 3D scans from two different sequences, Livingroom1 and Office1. The results are shown in Fig. A.9.

Figure A.9: Examples of the registration of three scans using the standard (pairwise) 3DRegNet and the trifocal 3DRegNet (rows: Livingroom1, Office1; columns: Ground-Truth, 3DRegNet, Trifocal 3DRegNet). These images are obtained by registering three 3D scans. Scan 1 is taken as the reference frame. Scan 2 is transformed to the reference frame by applying the computed transformation from Scan 2 to Scan 1, and Scan 3 is transformed into the reference frame by applying the cascade of the computed transformations from Scan 3 to Scan 2 and from Scan 2 to Scan 1.

In the Livingroom1 example, one can observe that the pairwise 3DRegNet exhibits some drift, while the trifocal network obtains better results for the cascade of transformations from Scan 1 to Scan 3. In the Office1 example, both methods produce registrations very similar to the ground-truth, with no significant differences between them.