Point cloud analysis is attracting attention in Artificial Intelligence research since it can be extensively applied in robotics, Augmented Reality, self-driving, and other fields. However, it remains challenging due to problems such as irregularity, unorderedness, and sparsity. In this article, we propose a novel network named Dense-Resolution Network for point cloud analysis. This network is designed to learn local point features from the point cloud in different resolutions. In order to learn local point groups more intelligently, we present a novel grouping algorithm for local neighborhood searching and an effective error-minimizing module for capturing local features. In addition to validating the network on widely used point cloud segmentation and classification benchmarks, we also test and visualize the performance of its components. Compared with other state-of-the-art methods, our network shows superiority.
With the help of fast progress in 3D sensing technology, an increasing number of researchers are now focusing on point cloud data. Different from complex 3D data such as meshes and volumetric grids, point clouds are concise. In particular, point clouds are easy to collect with different types of scanners, e.g., LiDAR scanners, light scanners, and sound scanners. Traditional algorithms for point cloud learning [30, 23, 29, 35] used to estimate geometric information and capture indirect clues with complicated models. In contrast, deep learning provides intuitive and effective data-driven approaches that acquire information from 3D point cloud data by leveraging Convolutional Neural Networks (CNNs).
In general, CNN-related methods can be divided into two streams. The first one is projection-based, which involves intermediate data representations and 2D/3D CNNs for learning: e.g., MVCNN using multi-view 2D images, and VoxNet taking volumetric grids. The other one is point-based, which directly processes points. It has become popular since the multi-layer perceptron (MLP) operation was introduced by PointNet. Subsequently, other works [27, 37, 33, 28] proposed to learn local features in various ways.
For local areas of point clouds, Qi et al. and Liu et al. apply the Ball Query algorithm to group local points, while [37, 28] use k-nearest neighbors (kNN) to construct neighborhoods. In these methods, performance is affected by the area of the pre-defined neighborhood, i.e., the searching radius of Ball Query or the k of kNN. If the area is small, it cannot cover sufficient local patterns; if too large, the overlap may involve redundancies. The recent DPC proposes dilated point convolution to increase the size of the receptive field without extra computational cost. Different from the previous works, we attempt to adaptively define such a local area for each point w.r.t. the density distribution around it. With fewer manual and empirical settings, a more reasonable neighborhood can be set up for each point in the point cloud.
Previously, the idea of error feedback has been applied in 2D human pose estimation and image Super-Resolution (SR) [7, 19]. In contrast to 3D works [14, 28] utilizing complex error-correcting structures, here we propose an error-minimizing module that leverages the properties of both error feedback and CNN training mechanisms, by which the network learning can be guided while the complexity is reduced. In terms of the architecture, we present a new model called Dense-Resolution Network with two branches: a Full-Resolution (FR) branch and a Multi-Resolution (MR) branch. By collecting features from different resolutions of the point cloud and merging the feature maps of FR and MR with a novel fusion method, we can obtain more information for a comprehensive analysis. The main contributions are:
We propose a point grouping algorithm that adaptively finds neighbors for each point according to the local density distribution.
We design an error-minimizing module for local feature learning on point clouds.
We introduce a network to learn point clouds comprehensively in different resolutions.
We conduct thorough experiments to validate the properties and abilities of our proposals. Our results demonstrate that the approach outperforms state-of-the-art methods on some point cloud segmentation and classification benchmarks.
Local points grouping. Different from the pioneering PointNet that relied on the global feature, subsequent work captured more local features in detail. PointNet++ first introduced Ball Query, an algorithm that collects possible neighbors of a particular point through a ball-like searching space centered at that point, to group its local neighbors. Another, simpler algorithm, kNN, gathers the k nearest neighbors based on a distance metric, and it is applied for local feature learning in [37, 5, 28].
Although Ball Query and kNN grouping are intuitive, the size of the neighborhood (i.e., the receptive field of the point) is limited by the range of searching (i.e., the radius of the query ball, or the value of k). Meanwhile, merely increasing the searching range may involve substantial computational cost. To solve this problem, DPC extended regular kNN to dilated-kNN, which gathers local points over a dilated neighborhood obtained by computing the k·d nearest neighbors (d is the dilation factor) and preserving only every d-th point. Other works [27, 18, 41] also group neighbors through query balls at different scales (e.g., multi-scale grouping) to capture information from various sizes of the local area.
However, the existing methods have some issues in common. On the one hand, the performance of a grouping algorithm highly relies on pre-defined settings. For example, DGCNN provided results under different k conditions, DPN compared the effects of different neighborhood sizes, and PointNet++ discussed the influence of the query ball radius. On the other hand, the grouping algorithms act on all points of the point cloud without taking the distinct condition of each point or model into account. In our view, it is necessary to find an intelligent, point-level adaptive grouping algorithm.
Error feedback structure. Previously in 2D, Carreira et al. proposed a framework called Iterative Error Feedback (IEF): by minimizing the error loss between current and desired outputs in the back-propagation procedure, the network would be helped to approach the target. In contrast to minimizing error during back-propagation, the methods in [7, 19] complement the output with a back-projection unit in the forward procedure. For 3D point clouds, PU-GAN leveraged a similar idea for point cloud generation, while another work presented a structure with specially designed paths for prominent feature learning.
Network architecture for point cloud learning.
To tackle problems in 2D computer vision tasks, many classical architectures have been introduced: e.g., VGG, ResNet, etc. Besides, some works tried different image resolutions for more clues: for example, the fully convolutional network keeps the full size of an image, the deconvolution network steps into lower resolutions, and HRNet shares features among different resolutions.
As for 3D point clouds, there are two popular architectures. Many follow the form of PointNet++, which learns in lower resolutions using Farthest Point Sampling (FPS) in the Set Abstraction (SA) module for downsampling and the Feature Propagation (FP) module for upsampling the point features. Meanwhile, DGCNN works as a fully convolutional network because it dynamically updates the crafted point graph around each point of the model. Different from them, our approach exploits more clues learned from various resolutions for better representations of point-wise fine-grained features.
Since PointNet introduced multi-layer perceptrons (MLPs) that directly process point clouds, CNN-based learning on 3D data has become more intuitive. Basically, an MLP operation can be described as a 1-by-1 convolution with a possible batch normalization layer (BN) and an activation function $\sigma$ on the feature map:

$\mathrm{MLP}(x) = \sigma(\mathrm{BN}(\mathrm{Conv}_{1\times1}(x)))$
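The shared-MLP operation above can be sketched in NumPy; this is a minimal illustration (a per-point linear map standing in for the 1-by-1 convolution, and a plain standardization standing in for learned batch normalization), not the paper's implementation:

```python
import numpy as np

def shared_mlp(x, weight, bias, eps=1e-5):
    """Shared MLP on an (N, C_in) feature map: 1x1 conv + BN + ReLU."""
    h = x @ weight + bias                                    # 1x1 conv == same linear map per point
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)  # feature-wise batch norm (no affine)
    return np.maximum(h, 0.0)                                # ReLU activation

rng = np.random.default_rng(0)
points = rng.standard_normal((1024, 3))                      # N x 3 input coordinates
out = shared_mlp(points, rng.standard_normal((3, 64)), np.zeros(64))
```

Because the same weights are applied to every point, the operation is permutation-equivariant, which is why it suits unordered point sets.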
In addition, many works craft regional patterns to record more local details. Wang et al. dynamically draw a graph around each point in the C-dimensional feature space, encoding both the absolute position of the centroid and the relative positions of its neighbors. Specifically, the crafted graph $\mathcal{G}_i$ at the centroid $x_i$ is:

$\mathcal{G}_i = \{\,[x_i,\; x_j - x_i] \mid x_j \in \mathcal{N}(x_i)\,\}$

Therefore, the quality of the information that $\mathcal{G}_i$ can provide highly depends on the neighbors $\mathcal{N}(x_i)$ that the grouping algorithm can find. Starting from this point, we investigate a better grouping algorithm for crafting the graph.
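Given neighbor indices from any grouping algorithm, the crafted graph [x_i, x_j − x_i] can be gathered as below; a minimal sketch, independent of how the indices were produced:

```python
import numpy as np

def edge_features(x, idx):
    """Craft the local graph [x_i, x_j - x_i] around every centroid x_i.
    x: (N, C) point features; idx: (N, k) neighbor indices."""
    k = idx.shape[1]
    neighbors = x[idx]                               # (N, k, C) gathered x_j
    centroids = np.repeat(x[:, None, :], k, axis=1)  # (N, k, C) repeated x_i
    return np.concatenate([centroids, neighbors - centroids], axis=-1)

x = np.arange(12, dtype=float).reshape(4, 3)         # 4 points, 3 channels
idx = np.array([[1, 2], [0, 2], [3, 0], [2, 1]])     # 2 neighbors per point
graph = edge_features(x, idx)                        # shape (4, 2, 6)
```

The output doubles the channel count (absolute plus relative parts), matching the edge features used by DGCNN-style layers.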
As we mentioned in Section 1, there are two main grouping algorithms in use: Ball Query and k-nearest neighbors (kNN). Although they are popular, they share the issues analyzed in Section 2. To solve these problems, here we propose a new algorithm, Adaptive Dilated Point Grouping (ADPG). The pipeline is described in Algorithm 1.
We take pairwise Euclidean distances in feature space as our metric since they can indicate the point density distribution to a certain extent. With $F \in \mathbb{R}^{N \times C}$ as a feature map and $\mathbf{1}_N$ as a row vector of all ones with $N$ entries, we calculate the metric matrix $M$ as:

$M = s\,\mathbf{1}_N + (s\,\mathbf{1}_N)^{T} - 2FF^{T}, \quad s = \mathrm{diag}(FF^{T})$
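The metric expands the identity ||f_i − f_j||² = ||f_i||² + ||f_j||² − 2 f_i·f_j, which can be computed for all pairs at once without explicit loops; a small sketch:

```python
import numpy as np

def pairwise_sq_dist(f):
    """M_ij = ||f_i||^2 + ||f_j||^2 - 2 f_i . f_j for rows of f: (N, C) -> (N, N)."""
    sq = np.sum(f * f, axis=1)                       # squared row norms, shape (N,)
    return sq[:, None] + sq[None, :] - 2.0 * (f @ f.T)

f = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
m = pairwise_sq_dist(f)                              # e.g. m[0, 1] is 3^2 + 4^2 = 25
```

One matrix product plus broadcasting keeps the cost at O(N²C), which is why this form is standard for kNN grouping on GPUs.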
By sorting the metrics in ascending order, we can easily identify the nearest points (i.e., the elements with the smallest values in each row of $M$) as candidate neighbors for each point. Next, we select the qualified neighbors from all candidates according to their indices and metrics. To be specific, we apply an MLP and an activation function (e.g., the logistic function) to the metrics of the candidates to summarize the point distribution of the local area. Then, a projection function (e.g., a linear function) maps the activated values to a certain range. Finally, a scale function (e.g., the round function) assigns a certain dilation factor to each point according to the summarized information:
As each point has a corresponding dilation factor d, we pick every d-th index of the candidate indices to form the final neighbors for each point. Following the behavior of dilated-kNN, we obtain the indices of the final point groups:
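The steps above can be sketched end to end. This is a hedged illustration of ADPG, not the paper's implementation: the mean candidate distance stands in for the learned MLP summary, and the logistic/round mapping of dilation factors is an illustrative assumption:

```python
import numpy as np

def adpg(f, k=8, d_max=4):
    """Sketch of Adaptive Dilated Point Grouping over features f: (N, C)."""
    sq = np.sum(f * f, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (f @ f.T)   # pairwise metric M
    cand = np.argsort(dist, axis=1)[:, :k * d_max]       # k*d_max candidate neighbors
    cand_dist = np.take_along_axis(dist, cand, axis=1)

    summary = cand_dist.mean(axis=1)                     # local sparsity proxy (assumed)
    norm = (summary - summary.min()) / (np.ptp(summary) + 1e-9)
    act = 1.0 / (1.0 + np.exp(-8.0 * (norm - 0.5)))      # logistic activation
    d = np.clip(np.round(1 + act * (d_max - 1)), 1, d_max).astype(int)

    # dilated selection: keep every d_i-th candidate so exactly k neighbors remain
    idx = np.stack([cand[i, ::d[i]][:k] for i in range(f.shape[0])])
    return idx, d

rng = np.random.default_rng(1)
feats = rng.standard_normal((32, 16))
idx, d = adpg(feats)                                     # (32, 8) indices, per-point d
```

Sparse points get larger summaries, hence larger dilation factors and wider receptive fields, which matches the adaptive behavior described above.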
Once the neighbors are selected by ADPG, the local graph of each point can be crafted as described above.
Assuming that the crafted local graph embeds the full information about the neighborhood, it should be possible to restore the previous features by a back-projection. For the back-projection feature, we adopt a 1-by-k convolution over the local graph, since it aggregates the nodes based on learned weights of the edges in the graph, which implicitly simulates a reverse process of crafting the graph:
Therefore, the error feature is defined as the difference between the original input feature $f$ and the back-projection feature $\tilde{f}$:

$e = f - \tilde{f}$
As the network training continues, this loss can constrain the feature learning by forcing the back-projection feature to approach the original input inside of this module, especially in the early stages of training. Moreover, it is expected to provide further instructions for the grouping of our algorithm compared with the general cross-entropy loss.
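The back-projection and error loss can be sketched as follows; a minimal illustration in which a fixed weight vector stands in for the learned 1-by-k convolution:

```python
import numpy as np

def error_minimizing_loss(f, graph, weights):
    """Back-project the local graph and penalize the reconstruction error.
    f: (N, C) input features; graph: (N, k, C); weights: (k,) aggregation weights."""
    f_back = np.einsum('nkc,k->nc', graph, weights)  # 1-by-k aggregation over neighbors
    error = f - f_back                               # error feature e = f - f_back
    return float(np.mean(error ** 2))                # added (weighted) to the total loss

k = 4
f = np.random.default_rng(2).standard_normal((16, 8))
graph = np.repeat(f[:, None, :], k, axis=1)          # toy graph: every node equals f_i
loss = error_minimizing_loss(f, graph, np.full(k, 1.0 / k))
```

In this toy case the back-projection reconstructs the input exactly, so the loss is zero; during training the penalty pushes the learned aggregation toward such a reconstruction.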
With a max-pooling function applied on the crafted local graph along the neighbor dimension, we aggregate a prominent local feature as the output at the centroid:
Although the ADPG algorithm and the error-minimizing module seem promising for local feature extraction, we still need a robust network architecture to leverage the potential of both. The basic fully convolutional architecture in [26, 37, 28] keeps the same number of points (i.e., the full resolution of the point cloud) even in different scales of feature spaces. Even though it can retain point-wise features without any confusion caused by upsampling, the output may lack channel-wise clues about semantic/shape information, which could be collected from different resolutions of the point cloud.
To overcome the above limitation, another branch learns the necessary information from different resolutions of the point cloud. In contrast to the full-resolution (FR) branch, a multi-resolution (MR) branch is able to capture point-wise channel-related information from different scales, which contributes to a comprehensive channel-wise understanding. After an enhancement of the FR feature map by the MR feature map (please see Section 4.3 and Table 4 for more details), the final output of our Dense-Resolution (DR) network is formulated with element-wise multiplication.
In this section, the details of our implementation are provided, including network parameters, training settings, datasets, etc. By comparing the experimental results with other state-of-the-art methods, we analyze the performance quantitatively. Besides, some ablation studies and visualizations are presented to illustrate the properties of our approach.
Network details. Generally, our Dense-Resolution network consists of two branches: a full-resolution (FR) branch and a multi-resolution (MR) branch. Specifically, the FR branch is a series of error-minimizing modules extracting features in different scales of feature space, i.e., 64, 128, 256, etc. The FR output is a projected concatenation of the modules' outputs. As for the MR branch, we adopt farthest point sampling (FPS) and feature propagation (FP) for downsampling and upsampling, respectively. The MR branch starts from the first output of FR at size N; after that, lower resolutions, i.e., N/4 and N/16, are investigated. Different from others, more propagated features and skip links are densely connected to enhance the relations between different point resolutions and feature spaces. Empirically, we adopt the same neighborhood settings as in [37, 5]. For the error-minimizing modules in MR, we use regular kNN (equivalent to ADPG with d = 1) since the points are sparse.
The output is obtained following Equation 9. For the classification task, we apply a max-pooling function and Fully Connected (FC) layers to regress confidence scores for all possible categories. For the segmentation task, we attach the max-pooled feature to each point feature and further predict the semantic label of each point with FC layers. We implement the project with PyTorch and Python; all experiments are trained and tested on Linux with GeForce RTX 2080Ti GPUs. The code and models will be available at https://github.com/
Training settings. For classification, the network is trained for 300 epochs. For segmentation, we use Adam optimization for 200 epochs of training. The learning rate begins at 0.001 and decays by a factor of 0.5 after every 20 epochs. The batch size for both tasks is 32. Besides, training data is augmented with random scaling and translation; the total loss is the sum of the regular cross-entropy loss and the weighted error-minimizing loss (see Equation 7). Part segmentation is evaluated with the ten-votes strategy used by state-of-the-art approaches [26, 27, 18].
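The step schedule above (start at 0.001, halve every 20 epochs) can be written as a one-line function:

```python
def learning_rate(epoch, base_lr=0.001, decay=0.5, step=20):
    """Step schedule: base_lr multiplied by decay once per completed step."""
    return base_lr * decay ** (epoch // step)

# epochs 0-19 use 0.001, epochs 20-39 use 0.0005, and so on
```

The same effect is obtained in PyTorch with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`.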
Datasets. We test our approach on two main tasks: point cloud segmentation and classification. The ShapeNet Part dataset  is used to predict the semantic class (part label) for each point of the object. In addition, the synthetic ModelNet40  dataset and the real-world ScanObjectNN  dataset are used to identify the category of the object.
ShapeNet Part. In general, the dataset contains 16,881 object point clouds in 16 categories. Each point is labeled as one of 50 parts. As the primary dataset for our experiments, we follow the official data split. We input the 3D coordinates of 2048 points for each point cloud and feed a one-hot class feature before the FC layers during training. As the metric for evaluation, we adopt Intersection-over-Union (IoU). The IoU of a shape is calculated as the mean of the IoUs of all parts in that shape. Particularly, mIoU (mean IoU) is the average of the IoUs over all testing shapes.
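The per-shape IoU described above can be sketched as follows; a minimal illustration using the common ShapeNet Part convention that a part absent from both prediction and ground truth counts as IoU = 1:

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Mean IoU of one shape over the parts of its category.
    pred, gt: (N,) per-point part labels; part_ids: parts of the category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # empty part counts as 1
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
iou = shape_iou(pred, gt, part_ids=[0, 1])   # (1/2 + 2/3) / 2
```

Averaging this value over all test shapes yields the overall mIoU reported in the result tables.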
ModelNet40. It is a popular dataset because of its regular and clean point clouds. There are 12,311 meshes in 40 classes, with 9,843 for training and 2,468 for testing. The corresponding point clouds are generated by uniformly sampling from the surfaces, translating to the origin, and scaling within a unit sphere. In our case, only the 3D coordinates of 1024 points per point cloud have been used.
ScanObjectNN. This real-world object dataset is recently published. Although it has 15,000 objects in only 15 categories, it is practically more challenging due to the background, missing parts, and deformations.
Segmentation. Table 1 shows the results of related works in overall mIoU, the most critical evaluation metric on the ShapeNet Part dataset. In general, our network achieves 86.4% and outperforms other state-of-the-art algorithms under similar experimental settings. As for per-class evaluations, we surpass others in 5 out of 16 categories. Particularly in the categories with a relatively large number of samples, e.g., airplane, chair, and table, we perform better than others in two out of these three classes.
Classification. Table 2 presents the overall accuracy of classification on both the synthetic and the real-world object datasets. For ModelNet40, we achieve 93.1% and exceed other state-of-the-art results with similar input. Besides, an overall accuracy of 80.3% is obtained on the ScanObjectNN dataset, which is significantly higher than all results on its official leaderboard. The inference time of our model is about 19.2 ms on a single GeForce RTX 2080Ti GPU. In general, our network is effective and robust for point cloud classification.
Visualization of learned dilation factors. The color of each point corresponds to the dilation factor learned by our ADPG algorithm. From Figure 3, we can see that our algorithm tends to assign larger dilation factors to points on corners, boundaries, and edges. The reason is that the point distribution around them is relatively sparse, so larger neighborhoods are needed for local feature learning. Due to the series connection of modules, the points in deep layers already have larger receptive fields, so larger dilation factors become unnecessary: the points in relatively dense areas (e.g., on flat surfaces or in central areas) turn out to have smaller dilation factors as the network goes deeper. Different from regular kNN/Ball Query with a limited receptive field, or dilated-kNN with a fixed dilation factor for all points, our algorithm works adaptively, as expected.
Effects of components.
Here we conduct an ablation study on the effects of the network architecture, the grouping algorithm, and the error-minimizing module. We run the experiments on the ShapeNet Part dataset with the same input and classifier, and Table 3 presents the results in overall mIoU. Comparing models 1 and 2 to model 0, we observe that the error-minimizing module with ADPG applied can significantly improve the network performance for part segmentation. Although the multi-resolution branch (model 3) alone cannot learn features as comprehensively as the full-resolution branch (model 2) does, we can take advantage of both by combining them into the dense-resolution network (model 4).
Merging the feature maps. Both FR and MR have the properties mentioned above, so we need an effective way to unify the advantages of both. We test simple ways of merging the FR and MR features, i.e., concatenating them channel-wise, and adding or multiplying them element-wise. Comparing the results of models 3, 4, and 5 to model 0 in Table 4, we observe that simple merging may not improve performance. In contrast, channel-wise enhancement of FR by MR (model 5) improves performance slightly for the reasons explained in Section 3.3. With ten-votes testing, the overall mIoU is boosted to 86.4%.
In this work, we propose a Dense-Resolution Network for point cloud analysis, which leverages information from different resolutions of the point cloud. Specifically, the Adaptive Dilated Point Grouping algorithm is introduced to realize a flexible point grouping based on the density distribution. Moreover, an error-minimizing module and corresponding loss are presented to capture local information and guide the network in training. We conduct experiments and provide ablation studies on both point cloud segmentation and classification benchmarks. According to the experimental results, we outperform competing state-of-the-art methods on ShapeNet Part, ModelNet40, and ScanObjectNN datasets. The quantitative reports and qualitative visualization demonstrate the advantages of our approach.