Point clouds are widely used in many application fields, including robotics, autonomous driving, and augmented/mixed reality, because 3D sensors can capture rich geometric information. Registration estimates an optimal transformation based on correspondences between two objects. To deal with point clouds in 3D registration, extracting salient local geometric information is a key task. Early work on local geometric features was based on hand-crafted methods. While recent advances in deep learning have made end-to-end learning feasible for 3D point cloud analysis, generating robust local feature descriptors remains a challenge in computer vision research.
Descriptors for point cloud applications have been widely studied for point cloud registration, model segmentation, and classification. PointNet [PointNet] introduced a new paradigm for point cloud analysis with a permutation-invariant method, but it cannot encode local geometric information. To encode local geometric information, PPFNet [PPFNet] applies PointNet to local regions, and DGCNN [DGCNN] encodes the relative positions of the neighbors of each point. In practice, point cloud data captured by sensors are usually not aligned to the same frame and are irregularly distributed. However, these methods do not build rotation-invariant descriptors, and they are affected by point cloud density. KPConv [KPConv] uses kernel points around each point to efficiently handle irregularly distributed point clouds. KPConv shows satisfactory performance but does not consider rotational information. 3DSmoothNet [SmoothNet] extracts the points of a local region and aligns them to the local reference frame of the center point. However, the sign of the normal axis and the directions of the other two axes are not unique in planar regions. Lastly, one common problem is that local descriptors from monotonous and repeating areas may be non-salient, and such descriptors are not useful in 3D registration.
To overcome these limitations, we propose a descriptor generation method that is robust to rotation and to point distribution. Our method is inspired by KPConv [KPConv] and 3DSmoothNet [SmoothNet]. KPConv processes point clouds in a manner similar to 2D image convolution, using kernels located around points, and it is efficient and robust to irregularly structured point clouds. 3DSmoothNet generates a voxel-based descriptor from local points aligned using a local reference frame. Inspired by these ideas, we align the kernels to the normal vector and extract rotation-invariant features. Because the local reference frame is not unique in planar regions, we distribute the kernels in the shape of a cylinder. This shape is symmetric about the tangent plane, which handles the sign problem, and has a circular cross section, which handles the inaccuracy of the other two reference axes. With this kernel structure, we apply convolution over adjacent kernels together in a way that is not affected by rotation, to increase representative power. In addition, to increase the representative power of descriptors from monotonous and repeating areas, we aggregate all features based on the distances from each point to build discriminative global features.
The major contributions of this work can be summarized as follows:
The proposed descriptor encodes rotation-robust features.
The proposed convolution method with the symmetric kernel structure effectively gathers structural information that is invariant to the sign problem.
We evaluate our method on several benchmark datasets.
To demonstrate our feature descriptor, we evaluate our method on three kinds of tasks: classification, registration, and segmentation. We train and test on ModelNet40 [ShapeNet] for classification and registration, and on ShapeNet [ShapeNetPart] for segmentation. The results show the discriminative power of our descriptor.
II Related Works
II-A Hand-crafted 3D features
Before the advance of deep learning, 3D feature descriptors were hand-crafted. Local descriptors are generated based on the relationship between a point and its spatial neighborhood. In addition, some methods build rotation-invariant descriptors based on a local reference frame. Spin Images [SPIN] aligns neighbors using the surface normal of the interest point and represents the aligned neighbors in a cylindrical support region using radial and elevation coordinates. 3D Shape Context [3DShapeContext] represents neighbors in the support region with grid bins divided along the azimuth, elevation, and radial values. USC [USC] extends 3D Shape Context by applying a local reference frame based on the covariance matrix of the points. Similarly, SHOT [SHOT] also calculates a local reference frame and builds a histogram using angles between point normals. PFH [PFH] and FPFH [FPFH] estimate pairwise geometric differences. Without additional machine learning methods, the application of these methods is usually limited to registration, but their ideas have inspired many deep learning based descriptor methods.
II-B Deep learning based 3D features
II-B1 Volumetric based Methods
These methods approximate the point cloud data with a regular volumetric representation and process the approximated data in a manner similar to 2D image-based methods [Voxel01][Voxel02].
Since the size of the volumetric grid is limited by hardware performance, these methods approximate the data with a low-resolution volumetric representation. To overcome this problem, some methods represent point cloud data in a more efficient way. OctNet [Voxel04Octree] divides the space using a set of unbalanced octrees based on sparsity. Other methods [Sparse][Minkowski][FCGF] use sparse tensors, which store only the coordinates and features of non-empty space. These methods significantly reduce both memory usage and run time.
To build a rotation-invariant descriptor, 3DSmoothNet [SmoothNet] calculates the local reference frame based on the covariance of points and, before voxelizing, transforms the neighbor points within the spherical support area of the interest point using this frame. However, the sign of the normal axis and the directions of the other two axes are not unique in planar regions. To deal with this situation, we assume that the sign of the normal vector is not unique. In addition, to deal with the inaccurate remaining reference axes, kernels distributed in a circular shape on the tangent plane are more suitable than kernels distributed in a square shape, as in a volumetric representation. Therefore, we use a customized kernel layout similar to KPConv [KPConv], which distributes kernels around each point.
II-B2 Point based Methods
PointNet [PointNet] and PointNet++ [PointNet++] are the pioneering deep learning methods for point clouds. They encode unstructured point clouds using a shared multi-layer perceptron and build a permutation-invariant descriptor using global max-pooling. Several methods have been developed based on PointNet. PPFNet [PPFNet] extends PointNet [PointNet] to learn local geometric features. PPFNet builds local features using PointNet and then fuses global information, extracted from the local features by max-pooling, to produce discriminative local descriptors. PPF-FoldNet [PPF-FoldNet] learns descriptors using folding-based auto-encoding for unsupervised learning and uses rotation-invariant features. DGCNN [DGCNN] selects the k-nearest neighbors of each point and encodes descriptors using the locations of the neighbors relative to the point to capture geometric information. KPConv [KPConv] proposes Kernel Point Convolution, which places kernel points around each point to efficiently handle irregularly distributed point clouds and aggregates geometric information from the kernel points. We extend KPConv by using normal vectors. As described in KPConv, normal vectors are available for artificial data. However, because the sign of a normal vector can change, we use unsigned normal vectors for aligning kernels and estimating features.
Figure 1 shows an overview of our descriptor generation framework. Inspired by KPConv [KPConv], we represent each point using information gathered from kernels around the point. To build rotation-robust kernels, we use the normal vector of each point to align the kernels (Section III-A). We then extract local information from the kernels and aggregate it to build the descriptor (Section III-B), and estimate global information (Section III-C). Next, we apply convolution over the kernels in a way that is invariant to the sign problem (Section III-D). Lastly, to build a scale-robust descriptor, we add a scale adaptation module to the network (Section III-E).
III-A Aligned kernel
Kernels are used to gather geometric information around each point from each kernel’s point of view. To do this, kernels have to be distributed uniformly around each point, both to analyze geometric information from various points of view and, when no orientation information is available, to keep the points of view as similar as possible.
Orientation information is usually estimated from the covariance of points, and the other reference axes are estimated by projecting the vector from the interest point to the weighted average of its neighbors onto the plane orthogonal to the normal. These axes are not unique if the point is located on a planar surface, but among the three local reference vectors, the normal direction is reliable. Therefore, we focus on the tangent plane directions rather than the normal direction, so as to maximize the overlap of the kernels' receptive fields. We place the kernels in the shape of a cylinder around each point and align the cylinder to the reference axes.
Figure 2(a) shows our kernel distribution. Each cross section of the cylinder along the normal direction is a circle. We place 5 to 6 kernels on each circle and group them as one layer. The cylinder consists of 2 to 3 layers.
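The cylinder layout described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `cylinder_kernel_points`, the `radius` parameter, and the choice of layer heights (evenly spaced between minus and plus half the radius) are assumptions made here for the example. Only the ring structure, the 5 to 6 kernels per layer, the 2 to 3 layers, and the symmetry about the tangent plane come from the text.

```python
import numpy as np

def cylinder_kernel_points(normal, radius=1.0, kernels_per_layer=6, num_layers=2):
    """Place kernel offsets on rings of a cylinder whose axis is `normal`.

    Returns an array of shape (num_layers, kernels_per_layer, 3) holding
    offsets relative to the interest point.
    """
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Pick any helper axis not parallel to n, then build a tangent basis (u, v).
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    # Kernels are spaced uniformly on a circle in the tangent plane.
    angles = 2.0 * np.pi * np.arange(kernels_per_layer) / kernels_per_layer
    ring = radius * (np.outer(np.cos(angles), u) + np.outer(np.sin(angles), v))
    # Assumed layer heights: symmetric about the tangent plane (height 0).
    heights = np.linspace(-0.5 * radius, 0.5 * radius, num_layers)
    return np.stack([ring + h * n for h in heights])
```

Because the layer heights are symmetric about zero, flipping the sign of `normal` yields the same set of layer heights, which is the property the cylinder shape is meant to provide for the sign problem.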
III-B Rotation invariant geometric information
The next step is to estimate rotation-invariant features for each aligned kernel.
First, we extract the k-nearest neighbor points of each kernel and calculate their weighted average location based on their distances from the kernel (Eq. 1). As the weight term, we use a Gaussian function to reduce the influence of outliers.
Eq. 1 gives the weighted average location for each kernel point of a given center point, and it also involves the distance from the center point to the kernel point.
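As a rough illustration of the weighted averaging step, the sketch below selects the k nearest neighbors of one kernel point and averages them with Gaussian weights. The function name `weighted_neighbor_mean` and the bandwidth parameter `sigma` are hypothetical; the exact form of the weight in Eq. 1 is not recoverable from the text, so a standard Gaussian of the neighbor-to-kernel distance is assumed.

```python
import numpy as np

def weighted_neighbor_mean(points, kernel, k=8, sigma=1.0):
    """Gaussian-weighted average of the k nearest neighbors of `kernel`.

    points: (N, 3) array, kernel: (3,) array.
    `sigma` is an assumed bandwidth; distant neighbors get small weights.
    """
    points = np.asarray(points, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    dist = np.linalg.norm(points - kernel, axis=1)
    idx = np.argsort(dist)[:k]                      # indices of the k nearest points
    w = np.exp(-dist[idx] ** 2 / (2.0 * sigma ** 2))
    return (w[:, None] * points[idx]).sum(axis=0) / w.sum()
```

With this weighting, a neighbor far from the kernel contributes almost nothing to the average, which is how the Gaussian term limits the influence of outliers.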
To build a rotation-robust and discriminative descriptor, we estimate four kinds of features. We first estimate the angle between two vectors: the vector from the center point of the kernels to the weighted average point, and the normal vector. However, since the orientation of the normal vector is ambiguous, we multiply the normal vector by a negative sign if the kernel is located below the tangent plane (Eq. 2). Eq. 2 uses the normal vector of the point together with a sign term that is negative if the kernel is located below the tangent plane. This term makes the angle value independent of the normal sign. Next, we estimate the distances from the interest point to the averaged neighbors and from the kernel to the averaged neighbors. To indicate the direction toward the closest adjacent kernel points, we estimate the ratio of the distances from the two adjacent kernel points to the averaged point (Eq. 5). Using these features, a kernel point can specify the location of the averaged neighbor points relative to the kernel. In addition, these features are rotation invariant. Figure 2(d) shows the geometric information extraction process.
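The four features can be sketched for a single kernel as below. This is an illustrative reading of the description, not the paper's equations: the function name `kernel_features`, the argument names, and in particular the normalized form of the distance ratio are assumptions, since the exact expressions of Eq. 2 to Eq. 5 are not available in the text.

```python
import numpy as np

def kernel_features(center, normal, kernel, prev_kernel, next_kernel, mean_pt):
    """Return [angle, d_center, d_kernel, ratio] for one kernel point."""
    center, normal, kernel, prev_kernel, next_kernel, mean_pt = (
        np.asarray(a, dtype=float)
        for a in (center, normal, kernel, prev_kernel, next_kernel, mean_pt)
    )
    n = normal / np.linalg.norm(normal)
    # Flip the normal when the kernel lies below the tangent plane.
    # Kernels exactly on the tangent plane keep the given sign.
    side = -1.0 if np.dot(kernel - center, n) < 0 else 1.0
    v = mean_pt - center
    d_center = np.linalg.norm(v)                      # interest point to averaged neighbors
    cos_angle = np.dot(v, side * n) / max(d_center, 1e-12)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    d_kernel = np.linalg.norm(mean_pt - kernel)       # kernel to averaged neighbors
    d_prev = np.linalg.norm(mean_pt - prev_kernel)
    d_next = np.linalg.norm(mean_pt - next_kernel)
    ratio = d_prev / (d_prev + d_next + 1e-12)        # assumed form of the distance ratio
    return np.array([angle, d_center, d_kernel, ratio])
```

All four values are built from norms and dot products of difference vectors, so applying the same rotation to every input leaves them unchanged, and the sign correction makes the angle independent of the orientation of the normal for kernels off the tangent plane.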
III-C Global information
The features above describe a local region of the point cloud. To increase the representative power of descriptors from monotonous and repeating areas, we estimate global information from the local information. Rather than building a single global feature for all points, we estimate an adaptive global feature for each point. For each point, we calculate weights for the other points using a Gaussian function of the distances between points, and then calculate the weighted average of their features using these weights.
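A possible reading of this per-point aggregation is sketched below. The function name `adaptive_global_features` and the bandwidth `sigma` are hypothetical, and the sketch includes each point in its own average for numerical simplicity, which is an assumption rather than something the text states.

```python
import numpy as np

def adaptive_global_features(points, local_feats, sigma=1.0):
    """Per-point global feature as a Gaussian-weighted mean of local features.

    points: (N, 3), local_feats: (N, C). Returns an (N, C) array.
    """
    points = np.asarray(points, dtype=float)
    local_feats = np.asarray(local_feats, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)              # (N, N) squared distances
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)               # each row sums to 1
    return w @ local_feats
```

Each row of the weight matrix differs, so two points with similar local features but different surroundings receive different global features, which is the disambiguation effect the section aims for.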
III-D Circular convolution
With these features, we then perform the convolution step. Since we align the kernels using only the normal axis, the kernels are not aligned to a unique local reference frame, and the order of the kernels may change depending on the point distribution. However, within a cylinder layer, the adjacency between kernels does not change. Therefore, we extend the 1x1x1 channel-wise convolution (Eq. 6) to Eq. 7.
Eq. 7 uses, for each kernel point in a cylinder layer, its clockwise adjacent kernel point and its counterclockwise adjacent kernel point. However, this convolution is still affected by the order of adjacent kernels because of the sign problem of normal vectors. To avoid this problem, we divide the kernel points into two groups: the kernels above the tangent plane and the kernels below the tangent plane. We select adjacent kernels in clockwise order if the kernel belongs to the first group, i.e., it is located in the positive normal direction, and in counterclockwise order if it is located in the negative normal direction. In addition, we apply convolution across multiple layers if the layers belong to the same group, which extends Eq. 7 to Eq. 8. Eq. 8 uses the set of kernel points adjacent to each kernel point within the same group. To implement this, we use a convolution operation with circular padding.
Figure 3 shows the convolution process using kernels. After convolution, the kernel features around the target point are aggregated by summation or by maximum value selection to build the point feature.
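The idea of convolving over ring-adjacent kernels with circular padding can be sketched for a single cylinder layer as follows. This is a simplified illustration under stated assumptions: the function names `circular_conv` and `point_feature` and the three separate weight matrices are hypothetical, and the per-group ordering and multi-layer extension of Eq. 8 are not modeled.

```python
import numpy as np

def circular_conv(ring_feats, w_prev, w_self, w_next):
    """Mix each kernel's features with those of its two ring neighbors.

    ring_feats: (K, C_in) features of the K kernels of one layer, in ring order.
    w_prev, w_self, w_next: (C_in, C_out) weight matrices.
    """
    prev_f = np.roll(ring_feats, 1, axis=0)    # row k holds kernel k-1 (wraps around)
    next_f = np.roll(ring_feats, -1, axis=0)   # row k holds kernel k+1 (wraps around)
    return prev_f @ w_prev + ring_feats @ w_self + next_f @ w_next

def point_feature(ring_feats, w_prev, w_self, w_next, reduce="max"):
    """Aggregate the convolved kernel features into one point feature."""
    out = circular_conv(ring_feats, w_prev, w_self, w_next)
    return out.max(axis=0) if reduce == "max" else out.sum(axis=0)
```

Because the wrap-around makes the operation equivariant to cyclic shifts of the kernel order, and sum or max aggregation discards the order, the resulting point feature does not depend on which kernel of the ring happens to be listed first.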
III-E Scale adaptation module
The features described above are invariant to rotation, but they depend on the distance from an interest point to its kernels, and the kernel size is related to the scale of the point cloud. If a heuristically chosen kernel size is too small or too large, the descriptor cannot properly encode the geometric information. To build a descriptor that is robust to scale, we propose a method that adapts the kernel size based on the model geometry. First, we roughly select two kernel sizes and build features using each of them. The features are fed to the module, which estimates an interpolation weight as its output through convolution operations. Next, we interpolate between the two kernel sizes using the interpolation weight and build the descriptor with the interpolated kernel size.
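The final interpolation step can be illustrated with a few lines. This sketch covers only the blending of the two sizes; the module that produces the weight is not modeled, and the function name `interpolate_kernel_size` and the sigmoid squashing of the module output are assumptions made for the example.

```python
import numpy as np

def interpolate_kernel_size(size_a, size_b, logit):
    """Blend two candidate kernel sizes with a weight derived from `logit`.

    `logit` stands in for the raw output of the scale adaptation module.
    """
    # Assumed: squash to (0, 1) so the result stays between the two sizes.
    alpha = 1.0 / (1.0 + np.exp(-logit))
    return (1.0 - alpha) * size_a + alpha * size_b
```

Keeping the weight in the open interval (0, 1) guarantees that the adapted kernel size never leaves the range spanned by the two rough initial choices.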
Table I: ModelNet40 registration results. Evaluation metrics are mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) for rotation (R) and translation (T). C indicates the number of categories for training.
We evaluate our method on ModelNet40 registration [ShapeNet] to demonstrate its rotation robustness. ModelNet40 contains 12,311 meshed CAD models from 40 categories. ModelNet40 is split by category into training and test sets, and the first 20 of the 40 categories are used for training. For each model, we use 1,024 points for training and testing.
DCP [DCP] presents an end-to-end network for rigid registration. Inspired by this method, we use an SVD module to compute a rigid transformation after building features. Figure 4(a) shows the registration architecture. As evaluation metrics, mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) are used for rotation and translation.
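For reference, the closed-form SVD solution for a rigid transformation between matched point sets (often called the Kabsch algorithm) can be sketched as below, together with the three error metrics. This is a generic, unweighted version and not the network's SVD module: the function names are hypothetical, hard one-to-one correspondences are assumed, and the text does not specify which representation of rotation and translation the metrics are computed on.

```python
import numpy as np

def rigid_transform_svd(src, tgt):
    """Least-squares R, t such that R @ src[i] + t approximates tgt[i]."""
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    c_src, c_tgt = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - c_src).T @ (tgt - c_tgt)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_tgt - R @ c_src
    return R, t

def error_metrics(pred, gt):
    """Return (MSE, RMSE, MAE) between two arrays of the same shape."""
    err = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    mse = float(np.mean(err ** 2))
    return mse, float(np.sqrt(mse)), float(np.mean(np.abs(err)))
```

The quality of the recovered transformation depends directly on the correspondences fed to the solver, which is why registration error is a useful proxy for how discriminative the descriptors are.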
Table I shows the registration results on the remaining categories. Our method outperforms the other methods in both rotation and translation estimation. Even compared with the results of methods trained on all categories, our method shows significantly better performance. Figure 5 compares the registration results. The results of DCP show small differences between the two point clouds, whereas our results show almost completely overlapping point clouds. These results indicate that the mapping between the source and target points based on the encoded features is accurate, with almost no duplicated or wrong correspondences. This shows that our encoded features have discriminative power.
| Method | Mean Class Accuracy | Overall Accuracy |
| VRN (single view) [VRN] | 88.98 | - |
| VRN (multiple views) [VRN] | 91.33 | - |
Next, we evaluate our method on ModelNet40 for classification [ShapeNet]. Among the 12,311 models, 9,843 are used for training and the remaining 2,468 are used for testing. For each model, we use 1,024 points for training and testing.
We use the DGCNN [DGCNN] classification architecture as our baseline. First, four convolution layers extract geometric information, and their four outputs are concatenated to provide multi-scale features. In addition, to increase representation power, we estimate global context from the local features and apply an attention module. Finally, a global feature is extracted and fully-connected layers predict the model class. Table II shows the classification results. Although our method focuses on rotation-robust descriptors, our descriptor achieves performance comparable with the state of the art.
| PCNN by Ext [PCNN] | 85.1 |
We evaluate our method on ShapeNetPart for part segmentation [ShapeNetPart]. ShapeNetPart contains 16,681 3D models from 16 categories. Each point is annotated with a part label. ShapeNetPart has 50 kinds of parts, and each model consists of 2 to 6 parts. For each model, we use 2,048 points for training and testing.
In the segmentation experiment, we use a U-Net without additional down-sampling and up-sampling layers as our segmentation architecture. To evaluate the results, we use Intersection-over-Union (IoU). Similar to KPConv [KPConv], we add the point positions as additional features. Table III shows the segmentation results.
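As a reminder of how a per-shape IoU is typically computed for part segmentation, a minimal sketch follows. The function name `shape_miou` is hypothetical, and counting a part that is absent from both prediction and ground truth as IoU 1 is a common convention assumed here; the text does not state which convention was used.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Mean IoU over the parts of one shape.

    pred, gt: (N,) integer part labels per point.
    part_ids: labels of the parts that belong to this shape's category.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        # Assumed convention: a part missing from both counts as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```

For example, with predictions `[0, 0, 1, 1]` against labels `[0, 1, 1, 1]`, part 0 has IoU 1/2 and part 1 has IoU 2/3, so the shape score is their mean, 7/12.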
IV-D Parameter and ablation study
We evaluate our method with several parameter settings on ModelNet40 registration. We vary the convolution operation, the number of kernels, the number of nearest neighbors for each kernel, and the use of the additional module.
First, we change the convolution method from the 1x1x1 channel-wise convolution to the circular convolution. With the circular convolution, the results show significantly better performance in both rotation and translation. These results indicate that convolution operations that use structure information can analyze the kernel features properly.
Second, we examine the effect of the number of kernels. A network with a larger number of kernels shows better performance. As the number of kernels increases, the range covered by the descriptor also increases, resulting in better performance.
Third, to check the effect of the number of neighboring points on stability, we evaluate the network with a smaller number of neighbors. Although we reduce the number of neighbors for each kernel point, the performance of the network with fewer neighbors is comparable with that of the network with more neighbors. Since neighbors are selected from each spatially well-distributed kernel, the number of neighbors does not significantly affect the results.
Fourth, we estimate global features using the local features of the last layer and concatenate them to the local features. As a result, the rotational error decreases. This indicates that the global features help to disambiguate each local descriptor.
In this paper, we propose a descriptor generation method for 3D point clouds. Kernel points distributed in a cylinder shape are aligned to the normal vector of a point. Each kernel point takes its neighbor points and extracts rotation-robust geometric information. We then apply a convolution operation with circular padding that refers to nearby kernel information to increase representation power. Finally, we evaluate our method on registration, classification, and segmentation tasks. Our method shows satisfactory performance. In particular, it outperforms the compared methods in registration by using rotation-robust features and structural convolution. In practice, 3D point cloud data captured by sensors may not be aligned. We expect our method to help analyze 3D data in more general situations.