deepclr
DeepCLR: Correspondence-Less Architecture for Deep End-to-End Point Cloud Registration
view repo
This work addresses the problem of point cloud registration using deep neural networks. We propose an approach to predict the alignment between two point clouds with overlapping data content, but displaced origins. Such point clouds originate, for example, from consecutive measurements of a LiDAR mounted on a moving platform. The main difficulty in deep registration of raw point clouds is the fusion of template and source point cloud. Our proposed architecture applies flow embedding to tackle this problem, which generates features that describe the motion of each template point. These features are then used to predict the alignment in an end-to-end fashion without extracting explicit point correspondences between both input clouds. We rely on the KITTI odometry and ModelNet40 datasets for evaluating our method on various point distributions. Our approach achieves state-of-the-art accuracy and the lowest run-time of the compared methods.
READ FULL TEXT VIEW PDFDeepCLR: Correspondence-Less Architecture for Deep End-to-End Point Cloud Registration
The objective of point cloud registration (PCR) is the alignment of two point clouds, called template and source, by estimating their relative transformation. Point clouds are obtained from various sources such as LiDAR (Light Detection and Ranging) sensors or RGB-D cameras. In case a LiDAR is mounted on a robot or vehicle, we can apply PCR to obtain the sensor motion by estimating the relative transformation between measurements of two consecutive timesteps. This application of PCR called LiDAR odometry (LO) is one of the main tasks for our proposed architecture. An example from the KITTI odometry dataset
[geiger2012autonomous] is shown in Fig. 1.The most common algorithm for point cloud registration is Iterative Closest Point (ICP) [besl1992method]. Since its publication in 1992, many variants with modified distance metrics, filtering steps and association methods have been introduced [pomerleau2013comparing]. In the original work, Basler et al. prove that ICP achieves local convergence, which makes it suitable for fine registration. For larger translations and rotations, registration algorithms like Fast Point Feature Histograms (FPFH) [rusu2009fast] as coarse pre-registration are crucial for reaching the global transformation error minimum. While ICP is a point based algorithm that uses the coordinates of input points for registration, FPFH is a feature based approach that subsamples the point clouds and performs registration based on local features. Nevertheless, even for small transformations both methods still struggle on noisy data and large numbers of points. However, continuing the example of LiDAR odometry from above, this problem requires an algorithm that can handle large point clouds with high noise.
Especially for feature based approaches, it seems promising to apply deep learning for feature extraction, since neural networks can model higher order dependencies
[yew20183dfeatnet]. Recent publications for deep learning based PCR [wang2019deeppco, li2019lonet]apply projections to generate pixel representations. This makes it possible to apply methods based on convolutional neural networks (CNN) like PoseNet
[kendall2015posenet], originally introduced for camera images. While it is possible to project LiDAR point clouds to 2D panoramic view depth images [wang2019deeppco] from the sensor origin, such a projection is not possible for arbitrary point clouds without loosing information. ModelNet40 [wu20153d] is an example for such data where no projection is possible, since any projection would cause self-occlusions and therefore missing information. Thus, our goal is to develop an algorithm that processes raw point clouds without applying projections for ensuring complete data utilization.Recent advances in deep learning like PointNet [qi2017pointnet] enable direct processing of unsorted point clouds without prior projection. Many architectures for deep registration of raw point clouds are based on this concept. DeepLocalization [engel2019landmark] and PointNetLK [aoki2019pointnetlk]
use global PointNet feature vectors extracted directly for registration. However, these methods face problems with large point clouds like LiDAR measurements, since the global feature vectors do not hold sufficient information for a precise registration. An alternative approach is to use PointNet for extracting keypoints and correspondence weights, as in 3DFeat-Net
[yew20183dfeatnet] or DeepVCP [lu2019deepvcp]. Since these methods enforce explicit point correspondences and thereby constrain the decision space, they do not exploit the full potential of deep learning. Furthermore, they rely on a subsequent algorithm for estimating the registration which leads to higher run-times.Our proposed architecture ”DeepCLR” (Correspondence-Less Registration) addresses these problems by creating feature vectors describing the local point neighborhood as an alternative to global feature vectors for encoding the input. Instead of extracting explicit correspondences, we fuse the point clouds and create flow features for each template point cloud sample, which describe the estimated motion of each point between template and source. Finally, the resulting point cloud with added flow features is used to predict the alignment in an end-to-end fashion.
To summarize, the main contribution of this work is an architecture which 1) directly processes unsorted point clouds of arbitrary size without applying a projection, 2) estimates the transformation in an end-to-end fashion without extracting explicit correspondences and 3) is able to perform the registration sufficiently fast for online processing. We evaluate our architecture using the KITTI odometry [geiger2012autonomous] and ModelNet40 [wu20153d] datasets.
Iterative Closest Point (ICP) [besl1992method] and its variants [pomerleau2013comparing] all involve data association for extracting the closest source point to each template point, and error minimization for iterative transformation refinement. The error calculation for associated point pairs can be based on the point-to-point or point-to-plane distance. Generized ICP (G-ICP) [segal2010generalized] uses a probabilistic framework based on a plane-to-plane distance metric. Since ICP only guarantees local convergence, other registration methods are necessary as pre-registration for large transformations. The feature based FPFH [rusu2009fast] calculates feature histograms for each point from its local neighborhood, which are then used for extracting corresponding point pairs.
A wide range of deep learning based algorithms for point cloud registration focus on LiDAR point clouds. Due to their measuring principle, it is obvious to apply cylindrical or spherical projections for generating 2D panoramic-view depth images. This advantage is used by DeepPCO [wang2019deeppco] and LO-Net [li2019lonet] for LiDAR odometry estimation using CNNs. DeepPCO uses a dual-branch architecture in order to predict translation and rotation separately. In LO-Net, the depth image representation is used to estimate the surface normals for each pixel, which are then provided as additional input. Since only measurements of static objects can be used for odometry estimation, LO-Net predicts a mask to suppress measurements of dynamic objects. Nevertheless, these approaches for LiDAR odometry cannot be applied directly on registration of arbitrary point clouds, as explained in Section I. Since our architecture is based on PointNet++ [qi2017pointnetpp] we are able to process raw point clouds from any distribution without prior projection or rasterization. Thus, we are not limited on LiDAR measurements.
For direct processing of point clouds without projections, PointNet [qi2017pointnet] or PointNet++ [qi2017pointnetpp] provide powerful foundations. PointNet is used in DeepLocalization [engel2019landmark] and PointNetLK [aoki2019pointnetlk] for generating feature vectors from template and source point clouds. These features are concatenated in DeepLocalization and passed to fully connected layers for pose prediction. In PointNetLK, instead of concatenating feature vectors, these are used to calculate gradients for iterative transformation refinement. Both methods are developed for small point clouds. Unfortunately, PointNet fails to generate descriptive feature vectors for accurate registration of large point clouds like LiDAR measurements.
Another point based approach using PointNet++ is presented by DeepVCP [lu2019deepvcp]. Lu et al. subsample and group input points for calculating features, which describe the local surrounding structure of each sample. For merging both subsampled point clouds in the feature embedding, they collect all surrounding source points for each template point, concatenate the features and feed them into mini-PointNets similar to PointNet++. These fused features are used to predict correspondence weights for multiple point pairs, which are then processed afterwards to estimate the relative transformation between template and source point cloud. This feature embedding approach is similar to the flow embedding of FlowNet3D [liu2019flownet3d], which is used for point cloud scene flow estimation.
The concept of our proposed architecture is based on a similar method for fusing the point clouds. Our major innovation is the end-to-end approach, which means we do not predict any explicit correspondences or correspondence weights between template and source point clouds. Instead of using the embedding result for predicting correspondence weights, we use the PointNet architecture to infer the transformation directly from the embedded feature representation. Thus, the network can directly predict the registration without being forced to find an intermediate correspondence representation.
We first define our problem as a regression task [belagiannis2015robust], followed by the objective function. Afterwards, we describe the proposed network architecture for point cloud registration. The problem definition is formulated based on [engel2019landmark] and [liu2019flownet3d].
Input to our network are two unsorted point sets and of arbitrary size and called template and source point cloud, respectively. Further, and are individual points, where denote coordinates of points and denote optional input features like LiDAR intensities () or surface normals (
). The objective of registration is to find the linear transformation
, consisting of translation and rotation in three-dimensional space that aligns the coordinates of and . The result of our training process is a mapping parameterized by , which is learned from training data.For LiDAR odometry, point clouds captured from two consecutive time steps are denoted and . The transformation describes the motion between corresponding sensor poses with . The initial pose for odometry estimation is .
The learning objective is to minimize the difference between predicted transformation and ground truth transformation . Usually, translations are described using a vector in , while for rotations multiple representations, e.g. Euler angles, rotation matrices or quaternions, exist [kendall2017geometric]. We chose to use dual-quaternions described by Schneider et al. [schneider2017regnet] for representing translation and rotation, consisting of the real part and the dual part . The parts and are both quaternions defined as extended complex numbers , where , , and are real numbers and , and are the imaginary components [kenwright2012beginners]. Similar to the quaternion representation for rotation, the real part contains the rotational information with values within and is usually normalized to . The dual part contains rotational and translational information without a specific range. More details on dual-quaternions can be found in [kenwright2012beginners].
Based on the dual-quaternion representation, the dual loss is given by
(1) |
and the real loss by
(2) |
normalizing the predicted real part to a valid rotation quaternion [kendall2017geometric, schneider2017regnet]. Since the dual part has no specific range, we compensate this imbalance by scaling the loss value with a factor
, which results in the combined loss function
(3) |
We intentionally refrain from using homoscedastic uncertainty weighting as described in [kendall2017geometric], enabling us to manually set a higher weight on the rotation prediction for odometry estimation.
In the following section, we describe our network architecture. An overview is given in Fig. 2(a). For subsampling and local feature extraction, the input point clouds are initially processed by the set abstraction [qi2017pointnetpp], which is arranged in a Siamese architecture with shared weights. Afterwards, the point clouds are merged with a flow embedding module [liu2019flownet3d]. The flow embedding groups all surrounding source points for each template point in order to generate features describing the point flow between both point clouds. The resulting point cloud with added flow features is fed into a mini-PointNet with subsequent fully connected (FC) layers. In order to enforce valid predictions, the real part
of the dual-quaternion output is restricted using the sigmoid activation function for
and the tanh activation function for , and on the output layer. For further processing steps or evaluations, the predicted dual-quaternion is converted into the transformation matrix , consisting of translation and rotation .The mini-PointNet structure [qi2017pointnet] is used multiple times within our architecture. It is implemented as nonlinear function
, realized as multi-layer perceptron (MLP), with subsequent element-wise max pooling:
(4) |
The mini-PointNet is able to process an arbitrary number of points of dimension to a single feature vector of dimension , as shown in Fig. 3.
Set Abstraction. The set abstraction module is used to subsample the input points and add local features containing information about the surrounding point structure. We use a single set abstraction step from PointNet++ [qi2017pointnetpp] with multi-scale grouping (MSG). The input point cloud is subsampled from to points using iterative farthest point sampling (FPS). For each sample , all surrounding points within radius are grouped including the central point . We define the distance vectors , which model the element-wise distance between central points and surrounding points . The distance vectors and features are concatenated and fed into a mini-PointNet with nonlinear function , realized as , and element-wise max pooling. Following [liu2019flownet3d], this can be written as
(5) |
Performing this for multiple radii , the set abstraction result for each FPS sample is , where denotes the concatenation for radii in multi-scale grouping. Thus, the input point cloud with dimensions is subsampled and transformed into a point cloud with dimensions . See Fig. 2(b) for a visualization of the set abstraction.
Flow Embedding. For merging the point clouds, we use the flow embedding module of [liu2019flownet3d]. Flow embedding takes two point clouds and with and as input. In our case, the point clouds are the result of the set abstraction of template point cloud and source point cloud . Due to various causes like noise, missing measurements or occlusion, does not always contain direct correspondences for each point in . Instead of searching for corresponding points, we can alternatively compare each point of with all surrounding points in for estimating the flow. It is important to note that this is only possible for small relative transformations. Thus, in case of large transformations a coarse pre-registration is required.
The flow embedding is realized as follows: For each point of the first point cloud , all surrounding points of the second point cloud within a certain radius are grouped. The distance vector and the features and are concatenated and fed into a mini-PointNet with nonlinear function , realized as , and element-wise max pooling. Similar to the set abstraction, this can be written as
(6) |
resulting in as flow output for each point . The concept of flow embedding is shown in Fig. 2(c). Concatenating the features and instead of passing the feature distance vector was proven as a better approach in [liu2019flownet3d], since the network is able to learn an effective distance function for features.
Output. The merged point cloud with flow features is reduced to a global feature vector using a mini-PointNet with . This removes the necessity for extracting explicit correspondences between both point clouds, since the flow features can carry more information about the point flow than explicit correspondences or correspondence weights. The global feature vector is finally fed into a fully connected with 8 outputs for predicting a dual-quaternion.
In contrast to the original PointNet++, we only use a single set abstraction layer with FPS samples and two MSG radii in order to maintain a short inference time. For the radius search in set abstraction (SA) and flow embedding (FE), we have to provide a maximum number of samples and in addition to the radii. The parameters and were chosen based on the distribution of our training data. The radii for flow embedding were selected based on the maximum possible offset of the points for given ground truth transformations. An overview of the network parameters is given in Table I.
Parameter | KITTI | ModelNet40 |
---|---|---|
1024 | 512 | |
SA radii | 0.5, 1.0 | 0.05, 0.1 |
512, 1024 | 256, 512 | |
FE radius | 10.0 | 0.2 |
15 | 30 | |
Loss | 200 | 1 |
[16, 16, 32] | ||
[128, 128, 256] | ||
[256, 512, 512, 1024] | ||
[512, 256, 8] |
Architecture hyperparameters for KITTI and ModelNet40.
We use weight decay in combination with data augmentation as regularization. For augmentation, we can simply apply random translations and rotations to the source point clouds and ground truth transformations.
For LiDAR odometry of sequential measurements, we can reduce the inference time by reusing the set abstraction output of the previous registration pair. When predicting the transformation from time to , we already have the set abstraction of point cloud available from the previous inference of transformation to . On the first sample , we only perform the set abstraction, since we have no second point cloud available.
We conduct two experiments with point clouds from different sources for demonstrating the versatility of our architecture. The KITTI odometry dataset [geiger2012autonomous] is used to show the ability of our architecture to register noisy real world data with a long range and high numbers of points, including dynamic objects, which adversely affect the registration. Besides that, we use an adapted version of the ModelNet40 dataset [wu20153d] to evaluate the performance on ideal ground truth data with artificial transformations and noise.
For evaluation, the translation error and rotation error between ground truth transformation and prediction are calculated as Euclidean distance in and chordal distance [hartley2013rotation] in , respectively:
(7) | ||||
(8) |
Alternatively, the methods can be evaluated using the root mean square errors (RMSEs) and in and of translation vector and rotation in Euler angles.
We use ICP based on the Open3D library [zhou2018open3d] and G-ICP based on the original reference implementation [segal2010generalized]
for comparison. The maximum distance between corresponding points for ICP is set according to the radius used for our flow embedding. All methods are tested on a machine with a 3.70 GHz CPU, 8 GB RAM and a GPU with 11 GB memory. Our approach is implemented in PyTorch.
The KITTI odometry dataset [geiger2012autonomous] consists of 22 sequences with stereo gray-scale and color camera images and point clouds captured by a Velodyne laserscanner. The data is recorded in different road environments including static and dynamic objects. The LiDAR point clouds are scanned with . Ground truth trajectories for sequences 00-10 with 23 201 scans are publicly available for training and evaluation, trajectories for sequences 11-21 with 20 351 scans are hidden for benchmark purposes. Each LiDAR measurement consists of 64 scan layers with approximately 100 000 points. To reach an inference time of less than , we reduce the scan size by removing every other point.
It is important to note that the provided ground truth trajectories are partially inaccurate, which has also been observed by other authors [lu2019deepvcp]. For compensating incorrect labels, we implement an additional augmentation step on top of random transformations and noise. We create additional scan pairs by duplicating the template point clouds and applying random transformations and noise for creating the source point clouds. Although this creates similar scan pairs, the associated transformation labels reliably involve no error.
04 | 10 | |||
---|---|---|---|---|
Method | ||||
DeepPCO^{*} [wang2019deeppco] | 0.0263 | 0.0305 | 0.0247 | 0.0659 |
Ours | 0.0264 | 0.0283 | 0.0202 | 0.0475 |
Both methods were trained with sequences 00-03 and 05-09.
Units for and are and , respectively.
Values taken from [wang2019deeppco].
LiDAR Odometry.
For training LiDAR odometry, translation and rotation augmentation noise are Gaussian distributed with standard deviation [
, , ] and [, , ] in Euler angles. The point noise for training is Gaussian distributed with standard deviation . Various splits of the ground truth sequences 00-10 into train and test set and different metrics are used in literature.Li et al. train LO-Net [li2019lonet]
with 00-06 and test with 07-10 using the KITTI odometry evaluation metrics
[geiger2012autonomous], which consist of the average relative translation and rotation errors in and for segments of to . We compare our architecture to the performance metrics of LO-Net without mapping and to LOAM [zhang2014loam], as given in [li2019lonet]. Our architecture yields an average error of and . LO-Net with and , and LOAM with and achieve a better performance. This is not surprising, since our learning objective is only focused on registration of consecutive scans instead of odometry for longer sequences. Furthermore, both LO-Net and LOAM additionally exploit the sorted and grid-like structure of LiDAR measurements, which is intentionally not used by our method.In contrast, DeepPCO [wang2019deeppco] is trained with 00-03, 05-09 and tested with 04, 10. They use the average RMSEs and over all scan pairs as metrics. The results are shown in Table II. For all metrics except of sequence 04, where the results are almost equal, we achieve a better performance. Our average and maximum inference times are and , which is far below the upper limit of for online processing.
Rot. Error () | Trans. Error () | Time () | |||
---|---|---|---|---|---|
Method | Mean | Max | Mean | Max | Mean |
ICP Point2Point [besl1992method] | 0.088 | 0.691 | 0.177 | 4.703 | 0.61 |
ICP Point2Plane [besl1992method] | 0.079 | 0.538 | 0.140 | 4.784 | 0.98 |
G-ICP [segal2010generalized] | 0.029 | 0.215 | 0.109 | 4.746 | 3.21 |
3DFeat-Net^{*} [yew20183dfeatnet] | 0.199 | 2.428 | 0.116 | 4.972 | 15.02 |
DeepVCP^{*} [lu2019deepvcp] | 0.164 | 1.212 | 0.071 | 0.482 | 2.3 |
Ours | 0.053 | 0.389 | 0.080 | 1.239 | 0.08 |
Sequences 00-07 were used for training, sequences 08-10 for testing.
Values taken from [lu2019deepvcp].
Artificial Transformations. In contrast to the previous evaluation for odometry, DeepVCP [lu2019deepvcp] is focused on the general registration of LiDAR measurements, independent of the odometry task. Thus, they sample the sequences at 30 frame intervals for templates and use all subsequent scans within
as sources. The artificial samples are created by removing the ground truth transformation calculated from the trajectory and applying unbiased, uniformly distributed transformations with maximum translation
and rotation . As metrics, mean and maximum angular error as well as translation error are used. The dataset is split into sequences 00-07 for training and 08-10 for testing. The results are shown in Table III.Our proposed architecture has by far the shortest average run-time of all compared methods. The average inference time of is higher compared to the odometry task, since we have to perform the set abstraction on both clouds for non-consecutive measurements. For rotation, G-ICP achieves the best results, while for translation, DeepVCP has the lowest average and maximum errors. In both cases, our proposed architecture is the second best approach with only a small gap to the best method. Considering this and the comparatively high run-time of the other methods, our architecture achieves state-of-the-art results. The high maximum translation error in all methods is most likely a result of the inaccurate ground truth trajectories mentioned previously.
Rot. Error () | Trans. Error () | Time () | |||
---|---|---|---|---|---|
Method | Mean | Std. Dev. | Mean | Std. Dev. | Mean |
ICP Point2Point [besl1992method] | 0.472 | 0.270 | 0.0092 | 0.0051 | 86.3 |
ICP Point2Plane [besl1992method] | 0.559 | 0.508 | 0.0076 | 0.0038 | 83.8 |
G-ICP [segal2010generalized] | 0.695 | 0.374 | 0.0154 | 0.0071 | 101.9 |
Ours | 0.328 | 0.192 | 0.0055 | 0.0028 | 14.6 |
The ModelNet40 [wu20153d] dataset provides over 12 000 CAD models of 40 different shape categories like airplane, chair, person or table. Since the CAD vertices are non-uniformly distributed, Qi et al. [qi2017pointnetpp] provide a version with 10 000 uniformly sampled surface points and normals per model. By applying random transformations and point noise to the point clouds, this dataset can be used for the evaluation of point cloud registration. We use FPS to subsample the data to 2048 points in order to achieve a manageable run-time for the ICP variants. An example is shown in Fig. 5.
Since we only consider fine registration, we apply random transformations with translation and rotation angles about arbitrarily chosen axes during training and testing, which is significantly smaller than in [aoki2019pointnetlk]. Furthermore, Gaussian distributed random noise with standard deviation is added to template and source points during training. In contrast to [aoki2019pointnetlk], we add noise to template and source points, since our intended applications include the registration of real world measurements, where both point clouds are noisy. For testing, we perform multiple runs using Gaussian distributed random noise with standard deviations from to .
Our architecture is trained and tested with point clouds from 20 different shape categories. To examine our ability to generalize and for fair comparison with classic algorithms, we further perform tests on 20 previously unseen shape categories. All methods are tested on the same point clouds, noise data and transformations.
Since we did not notice a performance difference between seen and unseen shape categories, our network proves to generalize well instead of learning shape-specific features. Table IV shows results for unseen categories and Gaussian distributed noise with standard deviation . The rotation error over increasing noise standard deviations is given in Fig. 4. For low noise levels below , all ICP variants achieve better results, since template and source point clouds are more similar. Starting at standard deviation , our architecture clearly outperforms the compared methods considering error average and standard deviation, and run-time. Although we only used a noise level of during training, we still achieve good performance for higher noise levels. Again, this demonstrates the powerful generalization capability of our architecture.
In this work, we have presented DeepCLR, a novel architecture for point cloud registration based on PointNet and PointNet++. In contrast to previous approaches, we provide an end-to-end architecture based on radius search and mini-PointNets for merging the input point clouds without extracting explicit point correspondences. We have evaluated our method on LiDAR measurements from the KITTI odometry dataset and on model data created from ModelNet40. On both datasets, our architecture achieves state-of-the-art results while providing the lowest run-time of the compared methods. Furthermore, on noisy ModelNet40 data our architecture clearly outperforms classic algorithms like ICP. In future work, we would like to explore pretraining of the set abstraction for improved local features.
This research was accomplished within the project UNICARagil (FKZ EM2ADIS002). We acknowledge the financial support for the project by the Federal Ministry of Education and Research of Germany (BMBF).