DeepICP: An End-to-End Deep Neural Network for 3D Point Cloud Registration

05/10/2019 ∙ by Weixin Lu, et al. ∙ Baidu, Inc.

We present DeepICP, a novel end-to-end learning-based 3D point cloud registration framework that achieves registration accuracy comparable to prior state-of-the-art geometric methods. Unlike other keypoint-based methods, where a RANSAC procedure is usually needed, we use deep neural network structures to establish an end-to-end trainable network. Our keypoint detector is trained through this end-to-end structure, which enables the system to avoid the interference of dynamic objects, leverage sufficiently salient features on stationary objects, and, as a result, achieve high robustness. Rather than searching for corresponding points among existing points, our key contribution is to generate them based on learned matching probabilities among a group of candidates, which can boost the registration accuracy. Our loss function incorporates both the local similarity and the global geometric constraints to ensure that all of the above network designs converge in the right direction. We comprehensively validate the effectiveness of our approach using both the KITTI dataset and the Apollo-SouthBay dataset. Results demonstrate that our method achieves comparable or better performance than the state-of-the-art geometry-based methods. Detailed ablation and visualization analyses are included to further illustrate the behavior and insights of our network. The low registration error and high robustness of our method make it attractive for the many applications that rely on point cloud registration.




1 Introduction

Point cloud registration is a task that aligns two or more different point clouds collected by LiDAR (Light Detection and Ranging) scanners by estimating the relative transformation between them. It is a well-known problem and plays an essential role in many applications, such as LiDAR SLAM

[40, 6, 14, 20], 3D reconstruction and mapping [29, 8, 35, 7], positioning and localization [38, 16, 33, 18], object pose estimation [34] and so on.

This problem is challenging due to several aspects that are unique to LiDAR point clouds, including local sparsity, the large amount of data, and the noise caused by dynamic objects. Compared to the image matching problem, the sparsity of the point cloud usually makes it infeasible to find two exactly matching points in the source and target point clouds. It also increases the difficulty of feature extraction, because the same object can look very different to a laser scanner when viewed from different perspectives. The millions of points produced every second require highly efficient algorithms and powerful computational units. Appropriate handling of the interference caused by the noisy points of dynamic objects is typically crucial for delivering an ideal estimation.

Figure 1: The illustration of the major steps of our proposed end-to-end point cloud registration method: (a) The source (red) and target (green) point clouds and the keypoints (black) detected by the point weighting layer. (b) A search region is generated for each keypoint and represented by grid voxels. (c) The matched points (magenta) generated by the corresponding point generation layer. (d) The final registration result computed by performing SVD given the matched keypoint pairs.

Moreover, the unbounded variety of scenes is considered the most significant challenge in solving this problem. Traditionally, a classic registration pipeline includes several steps with certain variations, for example keypoint detection, feature descriptor extraction, feature matching, outlier rejection, and transformation estimation. Good performance, accuracy, and robustness have been achieved in some scenarios after decades of considerable engineering effort. Even so, finding a universal point cloud registration solution is still considered an open problem by the community.

Advances in deep learning have led to compelling improvements for most semantic computer vision tasks, such as classification, detection, and segmentation. The capability of DNNs to generalize in solving these empirically defined problems is remarkable. For another important category, geometric problems that are defined theoretically, there has been exciting progress on tasks such as stereo matching [37, 4], depth estimation, and SFM [31, 41]. Unfortunately, for 3D data, the experiential solutions of most of these attempts [39, 9, 5] have not been adequate in terms of local registration accuracy, partially due to the characteristics of 3D point clouds mentioned above.

In this work, we propose an end-to-end learning-based method to accurately align two different point clouds. An overview of our framework is shown in Figure 1. We name it “DeepICP” because Iterative Closest Point (ICP) [2] is a classic algorithm that sometimes can represent the point cloud registration problem itself, and our approach is quite similar to it during the network training stage, despite the fact that there is only one iteration in the inference stage.

We first extract semantic features of each point both from the source and target point clouds using the latest point cloud feature extraction network, PointNet++ [24]. They are expected to have certain semantic meanings to empower our network to avoid dynamic objects and focus on those stable and unique features that are good for registration. To further achieve this goal, we select the keypoints in the source point cloud that are most significant for the registration task by making use of a point weighting layer to assign matching weights to the extracted features through a learning procedure. To tackle the problem of local sparsity of the point cloud, we propose a novel corresponding point generation method based on a feature descriptor extraction procedure using a mini-PointNet [23]

structure. We believe that it is the key contribution to enhance registration accuracy. Finally, besides only using the L1 Euclidean distance between the source keypoint and the generated corresponding point as the loss, we propose to construct another corresponding point by incorporating the keypoint weights adaptively and executing a single optimization iteration using the newly introduced SVD operator in TensorFlow. The L1 Euclidean distance between the keypoint and this newly generated corresponding point is again used as another loss. Unlike the first loss using only local similarity, this newly introduced loss builds the unified geometric constraints among local keypoints. The end-to-end closed-loop training allows the DNNs to generalize well and select the best keypoints for registration.

To summarize, our main contributions are:

  • To the best of our knowledge, the first end-to-end learning-based point cloud registration framework yielding comparable results to prior state-of-the-art geometric ones.

  • Our learning-based keypoint detection, a novel corresponding point generation method, and a loss function that incorporates both the local similarity and the global geometric constraints, which together achieve high accuracy in the learning-based registration task.

  • Rigorous tests and detailed ablation analysis using the KITTI [10] and Apollo-SouthBay [18] datasets to fully demonstrate the effectiveness of the proposed method.

2 Related Work

The survey work from F. Pomerleau et al. [22] provides a good overview of the development of traditional point cloud registration algorithms. A discussion of the full literature of these methods is beyond the scope of this work.

Attempts at using learning-based methods started by replacing individual components in the classic point cloud registration pipeline. S. Salti et al. [27] formulate 3D keypoint detection as a binary classification problem using a pre-defined descriptor, and attempt to learn a Random Forest [3] classifier that can find keypoints that are good for matching. M. Khoury et al. [17] propose to first parameterize the input unstructured point clouds into spherical histograms, and then train a deep network to map these high-dimensional spherical histograms to low-dimensional descriptors in Euclidean space. In terms of keypoint detection and descriptor learning, the closest work to ours is [36]. Instead of constructing an end-to-end registration framework, it focuses on joint learning of keypoints and descriptors that maximize local distinctiveness and similarity between point cloud pairs. G. Georgakis et al. [11] solve a similar problem for RGB-D data: depth images are processed by a modified Faster R-CNN architecture for joint keypoint detection and descriptor estimation. Despite the different approaches, these works all focus on representing the local distinctiveness and similarity of the keypoints. During keypoint selection, content awareness of the real scene is ignored due to the absence of the global geometric constraints introduced in our end-to-end framework. As a result, keypoints on dynamic objects in the scene cannot be rejected by these approaches.

Some recent works [39, 9, 5, 1] propose to learn 3D descriptors leveraging DNNs and attempt to solve the 3D scene recognition and re-localization problem, in which obtaining accurate local matching results is not the goal. To obtain accurate local matching, methods such as ICP are still necessary for registration refinement.

M. Velas et al. [32] encode the 3D LiDAR data into a specific 2D representation designed for multi-beam mechanical LiDARs. CNNs are then used to infer the 6-DOF pose as a classification or regression problem, and an IMU-assisted LiDAR odometry system is built upon it. Our approach processes the original unordered point cloud directly and is designed as a general point cloud registration solution.

Figure 2: The architecture of the proposed end-to-end learning network for 3D point cloud registration, DeepICP. The source and target point clouds are fed into the deep feature extraction layer, and keypoints are extracted from the source point cloud by the weighting layer. Candidate corresponding points are selected from the target point cloud, followed by a deep feature embedding operation. The corresponding keypoints in the target point cloud are generated by the corresponding point generation layer. Finally, we propose to use the combination of two losses that encode both the global geometric constraints and local similarities.

3 Method

This section describes in detail the architecture of the proposed network, shown in Figure 2.

3.1 Deep Feature Extraction

The input of our network consists of the source and target point clouds, the predicted (prior) transformation, and the ground truth pose, which is required only during the training stage. The first step is extracting feature descriptors from the point cloud. In the proposed method, we extract feature descriptors by applying a deep neural network layer, denoted as the Feature Extraction (FE) layer. As shown in Figure 2, we feed the source point cloud, represented as a tensor, into the FE layer. The output is a tensor representing the extracted local features. The FE layer we use is PointNet++ [24] (see details in Sec. 4), a pioneering work addressing the issue of consuming unordered points in a network architecture.

These local features are expected to have certain semantic meanings. Working together with the weighting layer introduced next, we expect our end-to-end network to be capable of avoiding the interference from dynamic objects and delivering precise registration estimates. In Section 5.4, we visualize the selected keypoints and demonstrate that dynamic objects are successfully avoided.
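For intuition, PointNet++'s set abstraction stage begins by sub-sampling the cloud with farthest point sampling to pick well-spread centroids. A minimal numpy sketch of that sampling step (an illustration, not the authors' code) is:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy farthest-point sampling, as used by PointNet++'s set
    abstraction to pick well-spread centroids from an unordered cloud.

    points : (N, 3) array; k : number of samples to keep.
    Returns the indices of the k selected points.
    """
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]      # arbitrary first centroid
    dist = np.full(n, np.inf)              # distance to nearest selected point
    for _ in range(k - 1):
        diff = points - points[selected[-1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected.append(int(np.argmax(dist)))  # farthest from current set
    return np.asarray(selected)
```

Each iteration keeps, for every point, its squared distance to the nearest already-selected centroid and then picks the point maximizing it, which spreads the samples evenly over the cloud.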

3.2 Point Weighting

Inspired by the attention layer in 3DFeatNet [36], we design a point weighting layer to learn the saliency of each point in an end-to-end framework. Ideally, points with invariant and distinct features on static objects should be assigned higher weights.

As shown in Figure 2, local features from the source point cloud are fed into the point weighting layer. The weighting layer consists of a multi-layer perceptron (MLP) of 3 stacked fully connected layers and a top-k operation. The first two fully connected layers use batch normalization and the ReLU activation function, while the last layer omits the normalization and applies the softplus activation function. The most significant points are selected as keypoints through the top-k operator, and their learned weights are used in the subsequent processes.
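The weighting layer described above (a stacked MLP ending in a softplus, followed by top-k selection) can be sketched in numpy; the layer sizes and weights here are placeholders, not the paper's trained parameters, and batch normalization is omitted:

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def point_weighting(features, mlp_weights, k):
    """Assign a saliency weight to every point and keep the top-k.

    features    : (N, D) per-point FE descriptors.
    mlp_weights : list of (W, b) pairs for the 3 fully connected layers.
    Returns (indices of the k keypoints, their learned weights),
    sorted by descending weight. ReLU on the first two layers,
    softplus on the last, as described in Sec. 3.2.
    """
    x = features
    for w, b in mlp_weights[:-1]:
        x = np.maximum(x @ w + b, 0.0)          # FC + ReLU
    w, b = mlp_weights[-1]
    scores = softplus(x @ w + b).ravel()        # positive saliency weights
    top_k = np.argsort(-scores)[:k]             # top-k operator
    return top_k, scores[top_k]
```

The softplus keeps all weights strictly positive, which matters later when they are reused to weight the SVD step.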

Our approach is different from 3DFeatNet [36] in a few ways. First, the features used in the attention layer are extracted from local patches, while ours are semantic features extracted directly from the point cloud. We have greater receptive fields learned from an encoder-decoder style network (PointNet++ [24]). Moreover, our weighting layer does not output a 1D rotation angle to determine the feature direction, because our design of the feature embedding layer in the next section uses a symmetric and isotropic network architecture.

3.3 Deep Feature Embedding

After extracting keypoints from the source point cloud, we seek to find the corresponding points in the target point cloud for the final registration. In order to achieve this, we need a more detailed feature descriptor that can better represent their geometric characteristics. Therefore, we apply a deep feature embedding (DFE) layer on their neighborhood points to extract these local features. The DFE layer we used is a mini-PointNet [23, 5, 18] structure.

Specifically, we collect neighboring points within a certain radius of each keypoint. In case there are fewer neighboring points than required, we simply duplicate them. For all the neighboring points, we use their local coordinates and normalize them by the search radius. Then, we concatenate the FE feature extracted in Sec. 3.1 with the local coordinates and the LiDAR reflectance intensities of the neighboring points as the input to the DFE layer.

The mini-PointNet consists of a multi-layer perceptron (MLP) of 3 stacked fully connected layers and a max-pooling layer to aggregate and obtain the feature descriptor. As shown in Figure 2, the input of the DFE layer is a vector comprising the local coordinates, the intensity, and the FE feature descriptor of each point in the neighborhood. The output of the DFE layer is again a feature vector. In Section 5.3, we show the effectiveness of the DFE layer and how it helps improve the registration precision significantly.
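A minimal numpy sketch of this neighborhood gathering and mini-PointNet aggregation (shapes and layer sizes are illustrative, and batch normalization is omitted):

```python
import numpy as np

def mini_pointnet(neighbor_feats, mlp_weights):
    """Mini-PointNet embedding: a shared per-point MLP followed by
    channel-wise max pooling over the neighbors (Sec. 3.3).

    neighbor_feats : (K, C) inputs -- local xyz, intensity and FE feature
                     concatenated for each neighboring point.
    mlp_weights    : list of (W, b) pairs for the stacked FC layers.
    Returns a single descriptor vector for the keypoint.
    """
    x = neighbor_feats
    for w, b in mlp_weights:
        x = np.maximum(x @ w + b, 0.0)   # shared FC + ReLU per neighbor
    return x.max(axis=0)                 # symmetric max-pool -> order invariant

def gather_neighbors(cloud, center, radius, k, rng=np.random.default_rng(0)):
    """Collect up to k neighbors within `radius` of `center` (the center
    itself is assumed to be in the cloud), duplicating points when fewer
    than k exist, and normalize local coordinates by the search radius,
    following Sec. 3.3. Truncation simply keeps the first k found."""
    d = np.linalg.norm(cloud - center, axis=1)
    idx = np.flatnonzero(d <= radius)
    if len(idx) < k:
        idx = np.concatenate([idx, rng.choice(idx, k - len(idx))])
    else:
        idx = idx[:k]
    return (cloud[idx] - center) / radius
```

Because max pooling is symmetric, the descriptor is invariant to the ordering of the neighbors, which is the core trick that lets the network consume unordered points.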

3.4 Corresponding Point Generation

Similar to ICP, our approach seeks to find corresponding points in the target point cloud and estimate the transformation. The ICP algorithm chooses the closest point as the corresponding point, which prohibits backpropagation because it is not differentiable. Furthermore, there are actually no exact corresponding points in the target point cloud due to its sparse nature. To tackle these problems, we propose a novel network structure, the corresponding point generation (CPG) layer, to generate corresponding points from the extracted features and the similarity represented by them.

We first transform the keypoints from the source point cloud using the input predicted transformation. Let denote the 3D coordinate of the keypoint from the source point cloud and its transformation in the target point cloud, respectively. In the neighborhood of , we divide its neighboring space into 3D grid voxels, where is the searching radius and is the voxel size. Let us denote the centers of the 3D voxels as , which are considered as the candidate corresponding points. We also extract their DFE feature descriptors as we did in Section 3.3. The output is an tensor. Similar to [18], those tensors representing the extracted DFE features descriptors from the source and target are fed into a three-layer 3D CNNs, followed by a softmax operation, as shown in Figure 2. The 3D CNNs can learn a similarity distance metric between the source and target features, and more importantly, it can smooth (regularize) the matching volume and suppress the matching noise. The softmax operation is applied to convert the matching costs into probabilities.

Finally, the target corresponding point is calculated through a weighted-sum operation as:


where is the similarity probability of each candidate corresponding point . The computed target corresponding points are represented by a tensor.
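The weighted-sum generation step can be sketched as follows for a single keypoint; the softmax converts the 3D CNN's matching costs into probabilities, and the generated point is their expectation over the candidate voxel centers:

```python
import numpy as np

def generate_corresponding_point(voxel_centers, matching_costs):
    """Turn per-voxel matching costs into a softmax probability volume and
    take the probability-weighted sum of the candidate voxel centers
    (Sec. 3.4). The result is a 'virtual' corresponding point that need
    not coincide with any existing target point, and the whole operation
    is differentiable, unlike ICP's closest-point selection.

    voxel_centers  : (M, 3) candidate 3D coordinates.
    matching_costs : (M,) similarity scores from the 3D CNN.
    """
    c = matching_costs - matching_costs.max()   # stable softmax
    probs = np.exp(c) / np.exp(c).sum()
    return probs @ voxel_centers                # (3,) expected position
```

With uniform costs the generated point is the centroid of the candidates; as one candidate's score dominates, the generated point converges to that voxel center.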

Compared to the traditional ICP algorithm, which relies on iterative optimization, or methods [25, 5, 39] that search for corresponding points among existing points in the target point cloud and use RANSAC to reject outliers, our approach utilizes the powerful generalization capability of CNNs in similarity learning to directly “guess” where the corresponding points are in the target point cloud. This eliminates the use of RANSAC, reduces the number of iterations to one, significantly reduces the running time, and achieves fine registration with high precision.

3.5 Loss

For each keypoint from the source point cloud, we can calculate its corresponding ground truth with the given ground truth transformation . Using the estimated target corresponding point in Sec. 3.4, we can directly compute the distance in the Euclidean space as a loss:


If only the loss in Equation 2 is used, the keypoint matching procedure during registration is independent for each keypoint. Consequently, only the local neighboring context is considered during matching, while the registration task is clearly constrained by a global geometric transform. Therefore, it is essential to introduce another loss that includes global geometric constraints.

Inspired by the iterative optimization in the ICP algorithm, we perform a single optimization iteration. That is, we perform a singular value decomposition (SVD) step to estimate the relative transformation given the corresponding keypoint pairs. Then the second loss in our network is defined as:


Thanks to [13], the latest TensorFlow supports the SVD operator and its backpropagation. This ensures that the proposed network can be trained in an end-to-end manner. As a result, the combined loss is defined as:


where is the balancing factor. In Section 5.3, we demonstrate the effectiveness of our loss design.

It is worth noting that the estimated corresponding keypoints are constantly updated, together with the estimated transformation, during training. When the network converges, the estimated corresponding keypoints become arbitrarily close to the ground truth. Interestingly, this training procedure is quite similar to the classic ICP algorithm, yet the network needs only a single iteration to find the optimal corresponding keypoints and estimate the transformation during inference, which is very valuable.
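The single SVD optimization step is the classic weighted Kabsch/Umeyama solution; a numpy sketch follows (the paper uses TensorFlow's differentiable SVD operator, while this standalone version only illustrates the math):

```python
import numpy as np

def weighted_svd_alignment(src, tgt, weights):
    """One optimization step of Sec. 3.5: estimate the rigid transform
    (R, t) minimizing the weighted squared distance between matched
    keypoint pairs, via SVD of the weighted cross-covariance
    (the classic Kabsch/Umeyama solution).

    src, tgt : (K, 3) matched keypoints; weights : (K,) keypoint weights.
    """
    w = weights / weights.sum()
    src_c = src - w @ src                    # remove weighted centroids
    tgt_c = tgt - w @ tgt
    H = (src_c * w[:, None]).T @ tgt_c       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = (w @ tgt) - R @ (w @ src)            # translation from centroids
    return R, t
```

Because every step (centroiding, SVD, matrix products) is differentiable, the transformation, and hence the second loss, can backpropagate gradients into the keypoint weights and generated correspondences.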

4 Implementation Details

In the FE layer, a simplified PointNet++ is applied, in which only three set abstraction layers with a single scale grouping layer are used to sub-sample points into groups of sizes 4096, 1024, and 256; the MLPs of the three hierarchical PointNet layers are applied in the sub-sampling stage and again in the up-sampling stage. This is followed by a fully connected layer with 32 kernels and a dropout layer to avoid overfitting. The MLP in the point weighting layer selects only the top points in the source point cloud according to their learned weights in descending order. The search radius and the number of neighboring points collected in the DFE step are fixed. In the mini-PointNet structure of the DFE layer, a three-layer MLP is used. The 3D CNN settings in the CPG step are Conv3d(16, 3, 1) - Conv3d(4, 3, 1) - Conv3d(1, 3, 1), applied over the grid voxels.

The proposed network is trained with fixed batch size, learning rate, decay rate, and decay step. During the training stage, we conduct data augmentation and supervised training by adding uniformly distributed random noise to the translational and rotational components of the given ground truth. We randomly divide the dataset into a training and a validation set, with a training-to-validation ratio of 4 to 1. We stop at 200 epochs, when there is no further performance gain.

Another implementation detail worth mentioning is that we adopt a bidirectional matching strategy during inference to improve the registration accuracy. That is, the input point cloud pair is considered as the source and the target simultaneously. We do not do this during training, because it does not improve the overall performance of the model.
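The paper does not spell out how the two directional estimates are fused, so the sketch below is only one plausible scheme: invert the target-to-source estimate to get a second source-to-target hypothesis, average the two translations, and project the averaged rotation back onto SO(3):

```python
import numpy as np

def fuse_bidirectional(T_fwd, T_bwd):
    """Illustrative fusion of bidirectional estimates (NOT the paper's
    exact scheme, which is unspecified): invert the target->source
    estimate, average translations, and re-orthonormalize the averaged
    rotation with an SVD projection onto SO(3).

    T_fwd, T_bwd : 4x4 homogeneous transforms (source->target and
                   target->source, respectively).
    """
    T_bwd_inv = np.linalg.inv(T_bwd)              # second source->target estimate
    R_mean = (T_fwd[:3, :3] + T_bwd_inv[:3, :3]) / 2.0
    U, _, Vt = np.linalg.svd(R_mean)              # nearest rotation to R_mean
    R = U @ np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) @ Vt
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = (T_fwd[:3, 3] + T_bwd_inv[:3, 3]) / 2.0
    return T
```

When the two directions agree perfectly, the fused result reduces to the forward estimate.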

Moreover, all the settings above are designated for datasets (both KITTI and Apollo-SouthBay) collected with a Velodyne HDL64 LiDAR. Because the point clouds from the Velodyne HDL64 are distributed within a relatively narrow region in the vertical direction, the keypoints constraining the vertical direction are usually quite different from those constraining the other two, such as points on the ground plane. This causes the registration precision in the corresponding dimensions to decline. To tackle this problem, we duplicate the whole network structure shown in Figure 2 and use two copies of the network in a cascade pattern. The back network uses the estimated transformation from the front network as its input, but replaces the 3D CNNs in the CPG step with 1D ones sampling in the vertical direction only. Both networks share the same FE layer, because we do not want to extract FE features twice. This increases the estimation precision in those dimensions.

5 Experiments

5.1 Benchmark Datasets

We evaluate the performance of the proposed network using the 11 training sequences of the KITTI odometry dataset [10]. The KITTI dataset contains point clouds captured with a Velodyne HDL64 LiDAR in Karlsruhe, Germany, together with “ground truth” poses provided by a high-end GNSS/INS integrated navigation system. We split the dataset into two groups, training and testing. The training group includes sequences 00-07, and the testing group includes sequences 08-10.

Another dataset used for evaluation is the Apollo-SouthBay dataset [18]. It was collected using the same model of LiDAR as the KITTI dataset, but in the San Francisco Bay Area, United States. Similar to KITTI, it covers various scenarios including residential areas, urban downtown areas, and highways. We also find that the “ground truth” poses in Apollo-SouthBay are more accurate than those in the KITTI odometry dataset; some ground truth poses in KITTI involve larger errors, for example the first 500 frames in Sequence 08. Moreover, the mounting height of the LiDAR in Apollo-SouthBay is slightly higher than in KITTI. This allows the LiDAR to see larger areas in the vertical direction, and we find that the keypoints picked up in these high regions are sometimes very helpful for registration. The setup of the training and test sets is similar to [18], with the mapping portion discarded. There is no overlap between the training and testing data.

5.2 Performance

Baseline Algorithms We present an extensive performance evaluation by comparing against several geometry-based point cloud registration algorithms: (i) the ICP family, such as ICP [2], G-ICP [28], and AA-ICP [21]; (ii) NDT-P2D [30]; and (iii) the GMM family, such as GMM-REG [15] and CPD [19]. The implementations of ICP, G-ICP, AA-ICP, and NDT-P2D are from the Point Cloud Library (PCL) [26]. The maximum correspondence distance and the maximum number of iterations are set to ensure the best accuracy. We use the official implementations of GMM-REG and CPD.

Evaluation Criteria The evaluation is performed by calculating the angular and translational error of the estimated relative transformation against the ground truth . The chordal distance [12] between and is calculated via the Frobenius norm of the rotation matrix, denoted as . The angular error then can be calculated as . The translational error is calculated as the Euclidean distance between and .
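The chordal-distance convention implied here can be made concrete as follows; the identity d = 2√2·sin(θ/2), relating the Frobenius distance between two rotation matrices to their relative rotation angle, follows Hartley et al.'s rotation averaging formulation:

```python
import numpy as np

def angular_error_deg(R_est, R_gt):
    """Angular error from the chordal distance [12] described above.
    With d = ||R_est - R_gt||_F one has d = 2*sqrt(2)*sin(theta/2),
    so theta = 2*arcsin(d / (2*sqrt(2))), reported in degrees."""
    d = np.linalg.norm(R_est - R_gt, 'fro')
    theta = 2.0 * np.arcsin(np.clip(d / (2.0 * np.sqrt(2.0)), -1.0, 1.0))
    return np.degrees(theta)

def translational_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return np.linalg.norm(t_est - t_gt)
```

The clip guards against tiny numerical overshoots of the arcsin argument when the two rotations are nearly antipodal.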

KITTI Dataset We sample the input source LiDAR scans at fixed frame intervals and enumerate each scan's registration targets within a fixed distance of it. We down-sample the original point cloud for methods such as ICP, G-ICP, AA-ICP, and NDT. For methods such as GMM-REG and CPD, we down-sample further to achieve running efficiency comparable to the other methods. The average running time of all the methods is limited for fair comparison, as shown in Figure 3. For our proposed method, we evaluate two versions. One is the base version, denoted as “Ours-Base”, that infers all six degrees of freedom at once. The other is an improved version with network duplication, as discussed in Section 4, denoted as “Ours-Duplication”. The angular and translational errors of all the methods are listed in Table 1. As can be seen, for the KITTI dataset, DeepICP performs better than most geometry-based methods, such as AA-ICP, NDT-P2D, GMM-REG, and CPD, but slightly worse than ICP and G-ICP, especially in the angular error. One more thing worth mentioning is that we achieve a big improvement in angular alignment using the duplication structure proposed in Section 4.

Method Angular Error (°) Translation Error (m)
Mean Max Mean Max
ICP [2] 0.121 1.176 0.081 2.067
G-ICP [28] 0.067 0.401 0.076 2.065
AA-ICP [21] 0.343 179.800 0.096 5.502
NDT-P2D [30] 1.162 32.008 0.268 1.841
GMM-REG [15] 1.879 4.865 0.778 1.544
CPD [19] 0.583 6.442 1.188 23.184
Ours-Base 0.432 2.375 0.095 0.744
Ours-Duplication 0.260 1.782 0.083 0.538
Table 1: Comparison using the KITTI dataset. DeepICP performs better than AA-ICP, NDT-P2D, GMM-REG, and CPD, but slightly worse than ICP and G-ICP, especially in the angular error. Our duplication design significantly improves angular accuracy.

Apollo-SouthBay Dataset For the Apollo-SouthBay dataset, we sample the source scans at fixed frame intervals and again enumerate targets within a fixed distance. All other parameter settings for each individual method are the same as for the KITTI dataset. The angular and translational errors are listed in Table 2. On the Apollo-SouthBay dataset, most methods, including ours, show a performance improvement, which might be due to the better ground truth poses provided by the dataset. Our system with the duplication design achieves the second best mean translational error and the third best mean angular error. Additionally, the lowest maximum translational error demonstrates the good robustness and stability of our proposed learning-based method.

Method Angular Error (°) Translation Error (m)
Mean Max Mean Max
ICP [2] 0.039 0.446 0.040 4.328
G-ICP [28] 0.025 0.356 0.030 1.869
AA-ICP [21] 0.467 172.293 0.086 5.457
NDT-P2D [30] 0.325 23.618 0.145 2.003
GMM-REG [15] 1.909 3.535 0.765 1.531
CPD [19] 0.158 8.362 0.593 12.367
Ours-Base 0.214 1.568 0.044 0.808
Ours-Duplication 0.089 0.701 0.034 0.622
Table 2: Comparison using the Apollo-SouthBay dataset. Our system achieves the second best mean translational error and the third best mean angular error. The lowest maximum translational error demonstrates good robustness and stability.

Run-time Analysis We evaluate the runtime performance of our framework with a GTX 1080 Ti GPU, a Core i7-9700K CPU, and 16GB of memory, as shown in Figure 3, which reports the total end-to-end inference time of our network for registering a frame pair.

Figure 3: The running time performance of all the methods, including the total end-to-end inference time of our network for registering a frame pair.

5.3 Ablations

In this section, we use the same training and testing data from the Apollo-SouthBay dataset to further evaluate each component or proposed design in our work.

Deep Feature Embedding In Section 3.3, we propose to construct the network input by concatenating the FE feature with the local coordinates and the intensities of the neighboring points. We now take a deeper look at this design choice by conducting the following experiments: i) LLF-DFE: only the local coordinates and the intensities are used; ii) FEF-DFE: only the FE feature is used; iii) FEF: the DFE layer is discarded, the FE feature is directly used as the input to the CPG layer, and, in the target point cloud, the FE features of the grid voxel centers are interpolated. As shown in Table 3, the DFE layer is crucial to this task, as there is severe performance degradation without it. LLF-DFE and FEF-DFE give competitive results, while our full design gives the best performance.

Method Angular Error (°) Translation Error (m)
Mean Max Mean Max
LLF-DFE 0.094 0.830 0.038 0.837
FEF-DFE 0.094 0.810 0.042 0.858
FEF 0.900 2.999 0.984 9.616
Ours 0.089 0.701 0.034 0.622
Table 3: Comparison with and without the DFE layer. The DFE layer is crucial, as there is severe performance degradation without it (Method FEF). When only partial features are used in the DFE layer, the results remain competitive (Methods LLF-DFE and FEF-DFE), while ours yields the best performance.

Corresponding Points Generation To demonstrate the effectiveness of the CPG layer, we instead directly search for the best corresponding point among the existing points in the target point cloud, taking the predicted transformation into consideration. Specifically, for each source keypoint, the point with the highest similarity score in the feature space within the target neighboring field is chosen as the corresponding point. It turns out that this variant is unable to converge using our proposed loss function. The reason might be that the proportion of positive and negative samples is extremely unbalanced.

Loss In Section 3.5, we propose to use a combination of two losses to incorporate the global geometric information, with a balancing factor introduced between them. To demonstrate the necessity of using both losses, we sample values of the balancing factor over its range and observe the registration accuracy. In Figure 4, we find that the two extreme values of the balancing factor, each of which disables one of the two losses, give clearly larger angular and translational mean errors. This demonstrates the effectiveness of the combined loss design. It is also quite interesting that intermediate values yield similar accuracies. We conclude that this might be due to the powerful generalization capability of deep neural networks: the network parameters generalize well for any value away from the extremes.

Figure 4: Registration accuracy comparison with different values of the balancing factor in the loss function. Values away from the two extremes give similarly good accuracies, demonstrating the powerful generalization capability of deep neural networks.
Figure 5: Visualization of the detected keypoints by the point weighting layer. The pink and grey keypoints are detected by the front and back network, respectively. The pink ones appear on stationary objects, such as tree trunks and poles. The grey ones are mostly on the ground, as expected.
Figure 6: Illustration of the matching similarity probabilities of each keypoint to its matching candidates, visualized in two dimensions with the remaining values fixed. The black and pink points are the detected keypoints in the source point cloud and the generated ones in the target, respectively. The effectiveness of the registration process is shown on the left (before) and right (after).

5.4 Visualizations

In this section, to offer better insights on the behavior of the network, we visualize the keypoints chosen by the point weighting layer and the similarity probability distribution estimated in the CPG layer.

Visualization of Keypoints In Section 3.1, we propose to extract semantic features using PointNet++ [24] and weigh them using an MLP network structure. We expect our end-to-end framework to intelligently learn to select keypoints that are unique and stable on stationary objects, such as traffic poles and tree trunks, while avoiding keypoints on dynamic objects, such as pedestrians and cars. In addition, we duplicate our network as described in Section 4. The front network with the 3D CNN CPG layer is expected to find meaningful keypoints that provide good constraints in all six degrees of freedom, while the back network with the 1D CNNs is expected to find keypoints that constrain the remaining degrees of freedom well. In Figure 5, the detected keypoints are shown alongside the camera photo and the LiDAR scan of the real scene. The pink and grey keypoints are detected by the front and back network, respectively. We observe that the distribution of keypoints matches our expectations: the pink keypoints mostly appear on objects with salient features, such as tree trunks and poles, while the grey ones are mostly on the ground. Even in scenes with many cars or buses, no keypoints are detected on them. This demonstrates that our end-to-end framework is capable of detecting keypoints that are well suited to the point cloud registration task.
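The selection mechanism described above can be sketched compactly. The snippet below is an illustrative stand-in, not the paper's actual architecture: we assume per-point features are already available (e.g. from PointNet++), score them with a tiny two-layer MLP, and keep the top-K highest-weighted points; the layer sizes, K, and random weights are all our own toy choices.

```python
import numpy as np

def select_keypoints(points, feats, W1, W2, k=4):
    """Score each point with a small MLP and keep the k most salient ones."""
    h = np.maximum(feats @ W1, 0.0)     # hidden layer with ReLU
    w = (h @ W2).ravel()                # one saliency weight per point
    idx = np.argsort(w)[-k:]            # indices of the k largest weights
    return points[idx], w[idx]

# Toy data: 32 points with 16-dim features and random MLP weights (assumed).
rng = np.random.default_rng(2)
points = rng.normal(size=(32, 3))
feats = rng.normal(size=(32, 16))
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 1))
kps, weights = select_keypoints(points, feats, W1, W2)
```

In the trained system, the MLP weights are learned end-to-end, so the saliency scores come to reflect what actually helps registration, which is why keypoints land on trunks, poles, and the ground rather than on moving vehicles.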

Visualization of CPG Distribution The CPG layer in Section 3.4 estimates the matching similarity probability of each keypoint to its candidate corresponding points. Figure 6 depicts the estimated probabilities, visualized in two dimensions with the third held fixed. On the left and right, the black and pink points are the keypoints from the source point cloud and the generated ones in the target, respectively. The detected keypoints are sufficiently salient that the matching probabilities are tightly concentrated.
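Such a slice visualization can be reproduced with a small sketch. This is purely illustrative: the grid size, feature dimension, dot-product scores, and the choice of which cell's feature plays the keypoint are our assumptions, not the paper's settings. We build a normalized 3D matching distribution over a candidate grid, then take a 2D slice at a fixed third coordinate, mirroring Figure 6.

```python
import numpy as np

rng = np.random.default_rng(1)
G, F = 5, 16                                 # grid side and feature dim (assumed)
cand_feats = rng.normal(size=(G, G, G, F))   # candidate features on a 3D grid
kp_feat = cand_feats[2, 2, 2]                # keypoint feature (toy choice)

# Dot-product scores against every grid candidate, then softmax to probabilities.
scores = np.tensordot(cand_feats, kp_feat, axes=([3], [0]))
probs = np.exp(scores - scores.max())
probs /= probs.sum()                         # full 3D matching distribution

slice_2d = probs[:, :, 2]                    # fixed-third-coordinate slice to plot
peak = np.unravel_index(probs.argmax(), probs.shape)
```

A concentrated distribution shows up as a single sharp peak in the slice; a flat or multi-modal slice would indicate an ambiguous, less salient keypoint.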

6 Conclusion

We have presented an end-to-end learning framework for the point cloud registration task. Keypoints are detected through a point weighting deep neural network. The corresponding point of each keypoint is generated according to the matching similarity probabilities estimated by a 3D CNN structure among candidates, rather than being picked directly from the existing points. Our loss function incorporates both the local similarity and the global geometric constraints. These novel designs enable our learning-based system to achieve registration accuracy comparable to state-of-the-art geometric methods. LiDAR point cloud registration is an important task and the foundation of various applications, so we believe this work benefits many potential applications. In future extensions of this work, we will develop the potential of our system with more LiDAR models and application scenarios.

References

  • [1] M. Angelina Uy and G. Hee Lee. PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4470–4479, 2018.
  • [2] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, Feb 1992.
  • [3] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [4] X. Cheng, P. Wang, and R. Yang. Learning depth with convolutional spatial propagation network. arXiv preprint arXiv:1810.02695, 2018.
  • [5] H. Deng, T. Birdal, and S. Ilic. PPFNet: Global context aware local features for robust 3D point matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [6] J.-E. Deschaud. IMLS-SLAM: scan-to-model matching based on 3D data. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2480–2485. IEEE, 2018.
  • [7] L. Ding and C. Feng. DeepMapping: Unsupervised map estimation from multiple point clouds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [8] D. Droeschel and S. Behnke. Efficient continuous-time SLAM for 3D lidar-based online mapping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018.
  • [9] G. Elbaz, T. Avraham, and A. Fischer. 3D point cloud registration for localization using a deep neural network auto-encoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631–4640, 2017.
  • [10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
  • [11] G. Georgakis, S. Karanam, Z. Wu, J. Ernst, and J. Košecká. End-to-end learning of keypoint detector and descriptor for pose invariant 3d matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1965–1973, 2018.
  • [12] R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation averaging. International Journal of Computer Vision, 103(3):267–305, Jul 2013.
  • [13] C. Ionescu, O. Vantzos, and C. Sminchisescu. Training deep networks with structured layers by matrix backpropagation. CoRR, abs/1509.07838, 2015.
  • [14] K. Ji, H. Chen, H. Di, J. Gong, G. Xiong, J. Qi, and T. Yi. CPFG-SLAM: a robust simultaneous localization and mapping based on LIDAR in off-road environment. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 650–655. IEEE, 2018.
  • [15] B. Jian and B. C. Vemuri. Robust point set registration using gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1633–1645, Aug 2011.
  • [16] S. Kato, E. Takeuchi, Y. Ishiguro, Y. Ninomiya, K. Takeda, and T. Hamada. An open approach to autonomous vehicles. IEEE Micro, 35(6):60–68, Nov 2015.
  • [17] M. Khoury, Q.-Y. Zhou, and V. Koltun. Learning compact geometric features. In Proceedings of the IEEE International Conference on Computer Vision, pages 153–161, 2017.
  • [18] W. Lu, Y. Zhou, G. Wan, S. Hou, and S. Song. L3-Net: Towards learning based LiDAR localization for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [19] A. Myronenko and X. Song. Point Set Registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, Dec 2010.
  • [20] F. Neuhaus, T. Koß, R. Kohnen, and D. Paulus. MC2SLAM: Real-time inertial Lidar odometry using two-scan motion compensation. In German Conference on Pattern Recognition, pages 60–72. Springer, 2018.
  • [21] A. L. Pavlov, G. W. Ovchinnikov, D. Y. Derbyshev, D. Tsetserukou, and I. V. Oseledets. AA-ICP: Iterative closest point with anderson acceleration. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–6, May 2018.
  • [22] F. Pomerleau, F. Colas, R. Siegwart, et al. A review of point cloud registration algorithms for mobile robotics. Foundations and Trends® in Robotics, 4(1):1–104, 2015.
  • [23] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, July 2017.
  • [24] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • [25] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3-D registration. In 2009 IEEE International Conference on Robotics and Automation, pages 3212–3217, May 2009.
  • [26] R. B. Rusu and S. Cousins. 3D is here: Point cloud library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13 2011.
  • [27] S. Salti, F. Tombari, R. Spezialetti, and L. Di Stefano. Learning a descriptor-specific 3D keypoint detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 2318–2326, 2015.
  • [28] A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proc. of Robotics: Science and Systems, 06 2009.
  • [29] T. Shiratori, J. Berclaz, M. Harville, C. Shah, T. Li, Y. Matsushita, and S. Shiller. Efficient large-scale point cloud registration using loop closures. In 2015 International Conference on 3D Vision, pages 232–240. IEEE, 2015.
  • [30] T. Stoyanov, M. Magnusson, H. Andreasson, and A. J. Lilienthal. Fast and accurate scan registration through minimization of the distance between compact 3D NDT representations. The International Journal of Robotics Research, 31(12):1377–1393, 2012.
  • [31] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5038–5047, 2017.
  • [32] M. Velas, M. Spanel, M. Hradis, and A. Herout. CNN for IMU assisted odometry estimation using velodyne LiDAR. In 2018 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pages 71–77. IEEE, 2018.
  • [33] G. Wan, X. Yang, R. Cai, H. Li, Y. Zhou, H. Wang, and S. Song. Robust and precise vehicle localization based on multi-sensor fusion in diverse city scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4670–4677. IEEE, 2018.
  • [34] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. M. Johnson, et al. SegICP: Integrated deep semantic segmentation and pose estimation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5784–5789. IEEE, 2017.
  • [35] S. Yang, X. Zhu, X. Nian, L. Feng, X. Qu, and T. Mal. A robust pose graph approach for city scale LiDAR mapping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1175–1182. IEEE, 2018.
  • [36] Z. J. Yew and G. H. Lee. 3DFeat-Net: Weakly supervised local 3d features for point cloud registration. In European Conference on Computer Vision, pages 630–646. Springer, 2018.
  • [37] Z. Yin, T. Darrell, and F. Yu. Hierarchical discrete distribution decomposition for match density estimation. arXiv preprint arXiv:1812.06264, 2018.
  • [38] K. Yoneda, H. Tehrani, T. Ogawa, N. Hukuyama, and S. Mita. LiDAR scan feature for localization with highly precise 3-D map. In IEEE Intelligent Vehicles Symposium Proceedings, pages 1345–1350, June 2014.
  • [39] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning local geometric descriptors from rgb-d reconstructions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [40] J. Zhang and S. Singh. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems, volume 2, page 9, 2014.
  • [41] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), pages 822–838, 2018.