DRNet
Point cloud analysis is attracting attention in Artificial Intelligence research since it can be extensively applied in robotics, Augmented Reality, self-driving, etc. However, it remains challenging due to problems such as irregularity, unorderedness, and sparsity. In this article, we propose a novel network named Dense-Resolution Network for point cloud analysis. This network is designed to learn local point features from the point cloud at different resolutions. In order to learn local point groups more intelligently, we present a novel grouping algorithm for local neighborhood searching and an effective error-minimizing model for capturing local features. In addition to validating the network on widely used point cloud segmentation and classification benchmarks, we also test and visualize the performance of its components. Compared with other state-of-the-art methods, our network shows superiority.
With the help of fast progress in 3D sensing technology, an increasing number of researchers are now focusing on point cloud data. Different from more complex 3D representations, e.g., mesh and volumetric data, point clouds are concise. In particular, point clouds are easy to collect using different types of scanners [2]: e.g., LiDAR scanners [11], light scanners, sound scanners, etc. Traditional algorithms for point cloud learning [30, 23, 29, 35] used to estimate geometric information and capture indirect clues with complicated models. In contrast, deep learning provides intuitive and effective data-driven approaches to acquire information from 3D point cloud data by leveraging Convolutional Neural Networks (CNNs).
In general, CNN-related methods can be divided into two streams [6]. The first one is projection-based, which involves intermediate data representations and 2D/3D CNNs for learning: e.g., MVCNN [32] using multi-view 2D images, and VoxNet [22] taking volumetric grids. The other one is point-based, which processes points directly. It has become popular since the multi-layer perceptron (MLP) operation was introduced by PointNet [26]. Subsequently, other works [27, 37, 33, 28] proposed to learn local features in various ways. For local areas of point clouds, Qi et al. [27] and Liu et al. [18] apply the Ball Query algorithm [25] to group local points, while [37, 28] use k-nearest neighbors (kNN) to construct neighborhoods. In these methods, performance is affected by the area of the predefined neighborhood, i.e., the searching radius of Ball Query or the k of kNN. If the area is small, it cannot cover sufficient local patterns; if too large, the overlap may involve redundancies. The recent DPC [5] proposes the idea of dilated point convolution to increase the size of the receptive field without extra computational cost. Different from these previous works, we attempt to adaptively define such a local area for each point w.r.t. the density distribution around it. With fewer manual and empirical settings, a more reasonable neighborhood can be set up for each point in the point cloud.
Previously, the idea of error feedback has been applied in 2D human pose estimation [3] and image Super-Resolution (SR) [7, 19]. In contrast to 3D works [14, 28] utilizing complex error-correcting structures, here we propose an error-minimizing module, leveraging the properties of both error-feedback and CNN training mechanisms, by which the network learning can be guided while the complexity is reduced. In terms of the architecture, we present a new model called Dense-Resolution Network with two branches: a Full-Resolution (FR) branch and a Multi-Resolution (MR) branch. By collecting features from different resolutions of the point cloud and merging the feature maps of FR and MR with a novel fusion method, we can obtain more information for a comprehensive analysis. The main contributions are:

We propose a point grouping algorithm that adaptively finds neighbors for each point considering the density distribution.
We design an error-minimizing module for local feature learning on point clouds.
We introduce a network to learn point clouds comprehensively at different resolutions.
We conduct thorough experiments to validate the properties and abilities of our proposals. Our results demonstrate that the approach outperforms state-of-the-art methods on several point cloud segmentation and classification benchmarks.
Local points grouping. Different from the pioneering PointNet [26] that relied on a global feature, subsequent works captured more local features in detail. PointNet++ [27] first introduced Ball Query, an algorithm for collecting possible neighbors of a particular point through a ball-like searching space centered at the point, to group its local neighbors. Another simpler algorithm, kNN, gathers the k nearest neighbors based on a distance metric, and this algorithm is applied for local feature learning in [37, 5, 28].
Although Ball Query and kNN grouping are intuitive, sometimes the size of the neighborhood (i.e., the receptive field of the point) is limited due to the range of searching (i.e., the radius of the query ball, or the value of k). Meanwhile, merely increasing the searching range may involve substantial computational cost. To solve this problem, DPC [5] extended regular kNN to dilated-kNN, which gathers local points over a dilated neighborhood obtained by computing the k·d nearest neighbors (d is the dilation factor) and preserving only every d-th point. Other works [27, 18, 41] also group neighbors through query balls at different scales (e.g., multi-scale grouping) to capture information from various sizes of the local area. However, the existing methods have some issues in common. On the one hand, the performance of a grouping algorithm highly relies on predefined settings. For example, DGCNN [37] provided results under different k conditions, DPC [5] compared the effects of d values, and PointNet++ [27] discussed the influence of the query ball radius. On the other hand, the grouping algorithms act on all points of the point cloud without taking the distinct condition of each point or model into account. Hence, it is necessary to find an intelligent, point-level adaptive grouping algorithm.
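The dilated-kNN idea can be sketched in a few lines. This is a toy pure-Python version under our own naming; a real implementation would batch the distance computation on the GPU:

```python
def pairwise_dist(points):
    """Squared Euclidean distance between every pair of points."""
    n = len(points)
    return [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
             for j in range(n)] for i in range(n)]

def dilated_knn(points, i, k, d):
    """Neighbors of point i over a dilated neighborhood: compute the
    k*d nearest neighbors, then keep only every d-th one. The point
    itself (distance 0) counts as its own first neighbor."""
    dists = pairwise_dist(points)[i]
    order = sorted(range(len(points)), key=lambda j: dists[j])
    candidates = order[:k * d]   # the k*d nearest candidates
    return candidates[::d]       # preserve every d-th candidate
```

With d = 1 this degenerates to regular kNN; larger d widens the receptive field at the same neighbor count k.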
Error feedback structure. Previously in 2D, Carreira et al. [3] proposed a framework called Iterative Error Feedback (IEF): by minimizing the error loss between the current and desired outputs in the back-propagation procedure, the network is helped to approach the target. In contrast to minimizing the error during back-propagation, the methods in [7, 19] complement the output with a back-projection unit in the forward procedure. For 3D point clouds, PU-GAN [14] leveraged a similar idea for point cloud generation, while [28] presented a structure with specially designed paths for prominent feature learning.
Network architecture for point cloud learning.
To tackle problems in 2D computer vision tasks, many classical architectures have been introduced:
e.g., VGG [31], ResNet [8], etc. Besides, some works tried different image resolutions for more clues: for example, the fully convolutional network [20] keeps the full size of an image, the deconvolution network [24] steps into lower resolutions, and HRNet [36] shares features among different resolutions. As for 3D point clouds, there are two popular architectures. Some works follow the form of PointNet++ [27], which learns at lower resolutions using Farthest Point Sampling (FPS) in the Set Abstraction (SA) module for downsampling and the Feature Propagation (FP) module for upsampling the point features. Meanwhile, DGCNN [37] works as a fully convolutional network because it dynamically updates the crafted point graph around each point of the model. Different from them, our approach exploits more clues learnt from various resolutions for better representations of point-wise fine-grained features.
Since PointNet [26] introduced multi-layer perceptrons (MLPs) that directly process point clouds, CNN-based learning on 3D data has become more intuitive. Basically, an MLP operation can be described as a 1-by-1 convolution with an optional batch normalization [10] layer (BN) and an activation function (act) on the feature map:

MLP(F) = act(BN(Conv_{1×1}(F)))

In addition, many works craft regional patterns to record more local details. Wang et al. [37] dynamically draw a graph around each point in feature space, encoding the information of both the absolute position of the centroid and the relative positions of its neighbors. Specifically, the crafted graph G_i at the centroid i is:

G_i = { [f_i ‖ f_j − f_i] : j ∈ N(i) }

Therefore, the quality of information that G_i can provide highly depends on the neighbors (i.e., N(i)) that the grouping algorithm can find. Starting from this point, we investigate a better grouping algorithm for it.
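The crafted-graph step above can be sketched as a toy pure-Python routine (real implementations operate on batched tensors; the function name is ours):

```python
def edge_features(features, neighbor_idx):
    """DGCNN-style edge features: for each centroid i, concatenate its
    absolute feature f_i with each neighbor's relative feature f_j - f_i,
    yielding a (k x 2C) local graph per point."""
    graphs = []
    for i, nbrs in enumerate(neighbor_idx):
        f_i = features[i]
        graphs.append([list(f_i) + [nf - cf for cf, nf in zip(f_i, features[j])]
                       for j in nbrs])
    return graphs
```

The relative part (f_j − f_i) captures local geometry, while the absolute part keeps the centroid's position in feature space.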
As mentioned in Section 1, there are two main grouping algorithms in use: Ball Query and k-nearest neighbors (kNN). Although they are popular, they share some issues, as analyzed in Section 2. To solve these problems, here we propose an algorithm, Adaptive Dilated Point Grouping (ADPG). The pipeline is described in Algorithm 1.
We take pairwise Euclidean distances in feature space as our metrics since they can indicate the point density distribution to a certain extent. With F as a feature map of size N × C and 1 as a row vector of all ones with N entries, we calculate the metrics as:

D = diag(FFᵀ)·1 − 2FFᵀ + 1ᵀ·diag(FFᵀ)ᵀ    (1)
By sorting the metrics in ascending order, we can easily identify the nearest points (i.e., the elements with the smallest values in each row of D) as candidate neighbors for each point. Next, we select the qualified neighbors from all candidates, whose indices are idx_c and metrics are D_c. To be specific, we apply an MLP and an activation function act (e.g., the logistic function) on the metrics of the candidates to summarize the information about the point distribution of the local areas. Then, a projection function proj (e.g., a linear function) maps the activated values to a certain range. Finally, we take a scale function (e.g., the round function) to assign a certain dilation factor to each point according to the summarized information:

d = scale(proj(act(MLP(D_c))))    (2)

As each point has a corresponding dilation factor, we pick every d-th index of the candidate indices to form the final neighbors of each point. Following the behavior of dilated-kNN in [5], we have the indices of the final point groups:

idx = { idx_c[:, m·d] : m = 1, …, k }    (3)
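As a concrete, non-learned stand-in for the activate → project → scale chain, the sketch below replaces the learned MLP summary with a simple mean over the candidate distances; d_max is an assumed cap on the dilation factor, not a value from the paper:

```python
import math

def assign_dilation(candidate_dists, d_max=4):
    """For one point: summarize the distances to its candidate neighbors
    (a crude density proxy standing in for the learned MLP), squash with
    a logistic function, project linearly onto [1, d_max], and round to
    an integer dilation factor. Sparser surroundings -> larger factor."""
    summary = sum(candidate_dists) / len(candidate_dists)
    activated = 1.0 / (1.0 + math.exp(-summary))   # logistic, in (0, 1)
    projected = 1.0 + activated * (d_max - 1)      # linear map to [1, d_max]
    return round(projected)                        # scale to an integer
```

A densely surrounded point (small distances) keeps a small factor, while an isolated one is pushed toward d_max, matching the intuition behind ADPG.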
Once the neighbors are selected by ADPG, the local graph of point i will be:

G_i = { [f_i ‖ f_j − f_i] : j ∈ idx_i }    (4)
Assuming that the crafted local graph embeds the full information about the neighborhood, it should be possible to restore the previous features by a back-projection. In terms of the back-projection feature f̃_i, we adopt a 1-by-k convolution over the local graph as in [28], since it aggregates the nodes based on the learned weights of the edges in the graph, which implicitly simulates a reverse process of crafting the graph:

f̃_i = Conv_{1×k}(G_i)    (5)
Therefore, the error feature e_i is defined as the difference between the original input feature f_i and the back-projection feature f̃_i:

e_i = f_i − f̃_i    (6)
Different from the methods in [14, 28, 7, 19] that correct the error with extra computations in the forward pass, we use an additional loss to minimize the error during back-propagation:

L_err = (1/N) · Σᵢ ‖e_i‖²    (7)
As the network training continues, this loss can constrain the feature learning by forcing the backprojection feature to approach the original input inside of this module, especially in the early stages of training. Moreover, it is expected to provide further instructions for the grouping of our algorithm compared with the general crossentropy loss.
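A minimal sketch of the error loss in plain Python, assuming the mean-squared form over all feature channels (the paper's exact weighting of this term may differ):

```python
def error_minimizing_loss(originals, backprojected):
    """Mean squared difference between the original input features and
    the back-projected features aggregated from the local graph. Added
    to the cross-entropy loss, it pushes the back-projection to
    reproduce the input, guiding the module's local feature learning."""
    total, count = 0.0, 0
    for f, f_bp in zip(originals, backprojected):
        for a, b in zip(f, f_bp):
            total += (a - b) ** 2
            count += 1
    return total / count
```

The loss is zero exactly when the back-projection reproduces the input, so minimizing it enforces the "restorable neighborhood" assumption.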
With a max-pooling function being applied on the crafted local graph along the neighbor dimension, we aggregate a prominent local feature as the output of the centroid i:

f′_i = max_{j ∈ idx_i} MLP([f_i ‖ f_j − f_i])    (8)
Although the ADPG algorithm and the error-minimizing module seem promising for local feature extraction, we still need a robust network architecture to leverage the potential offered by both. The basic fully convolutional network architecture in [26, 37, 28] retains the same number of points (i.e., the full resolution of the point cloud) even in different scales of feature spaces. Even though it can retain the features point-wise without any confusion caused by upsampling, the output may lack channel-wise clues about semantic/shape information, which could be collected from different resolutions of the point cloud.

method  overall mIoU  airplane  bag  cap  car  chair  earphone  guitar  knife  lamp  laptop  motorbike  mug  pistol  rocket  skateboard  table
# shapes  16881  2690  76  55  898  3758  69  787  392  1547  451  202  184  283  66  152  5271 
PointNet [26]  83.7  83.4  78.7  82.5  74.9  89.6  73.0  91.5  85.9  80.8  95.3  65.2  93.0  81.2  57.9  72.8  80.6 
ASCN [39]  84.6  83.8  80.8  83.5  79.3  90.5  69.8  91.7  86.5  82.9  96.0  69.2  93.8  82.5  62.9  74.4  80.8 
SONet [13]  84.6  81.9  83.5  84.8  78.1  90.8  72.2  90.1  83.6  82.3  95.2  69.3  94.2  80.0  51.6  72.1  82.6 
PointNet++ [27]  85.1  82.4  79.0  87.7  77.3  90.8  71.8  91.0  85.9  83.7  95.3  71.6  94.1  81.3  58.7  76.4  82.6 
PCNN [1]  85.1  82.4  80.1  85.5  79.5  90.8  73.2  91.3  86.0  85.0  95.7  73.2  94.8  83.3  51.0  75.0  81.8 
DGCNN [37]  85.2  84.0  83.4  86.7  77.8  90.6  74.7  91.2  87.5  82.8  95.7  66.3  94.9  81.1  63.5  74.5  82.6 
P2Sequence [16]  85.2  82.6  81.8  87.5  77.3  90.8  77.1  91.1  86.9  83.9  95.7  70.8  94.6  79.3  58.1  75.2  82.8 
SpiderCNN [40]  85.3  83.5  81.0  87.2  77.5  90.7  76.8  91.1  87.3  83.3  95.8  70.2  93.5  82.7  59.7  75.8  82.8 
PointASNL [41]  86.1  84.1  84.7  87.9  79.7  92.2  73.7  91.0  87.2  84.2  95.8  74.4  95.2  81.0  63.0  76.3  83.2 
RSCNN [18]  86.2  83.5  84.8  88.8  79.6  91.2  81.1  91.6  88.4  86.0  96.0  73.7  94.1  83.4  60.5  77.7  83.6 
Ours  86.4  84.3  85.0  88.3  79.5  91.2  79.3  91.8  89.0  85.2  95.7  72.2  94.2  82.0  60.6  76.8  84.2 
To overcome the above limitation, another branch learns the necessary information from different resolutions of the point cloud. In contrast to the full-resolution (FR) branch, a multi-resolution (MR) branch is able to capture point-wise channel-related information from different scales, which contributes to a comprehensive channel-wise understanding. After an enhancement of the FR feature map by the MR feature map (please see Section 4.3 and Table 4 for more details), the final output of our dense-resolution (DR) network can be formulated with the element-wise multiplication ⊙:

F_DR = F_FR ⊙ F_MR    (9)
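The element-wise fusion reduces to a per-channel product. A toy sketch with lists of lists standing in for N × C feature tensors, ignoring any activation the enhancement may apply:

```python
def fuse_branches(fr, mr):
    """Channel-wise enhancement of the full-resolution features by the
    multi-resolution features via element-wise multiplication: each
    point's FR feature vector is scaled channel by channel by the
    corresponding MR feature vector."""
    return [[a * b for a, b in zip(f_fr, f_mr)]
            for f_fr, f_mr in zip(fr, mr)]
```

Because the product is per channel, MR effectively re-weights which FR channels dominate at each point, rather than mixing information across channels as concatenation would.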
In this section, the details of our implementation are provided, including network parameters, training settings, datasets, etc. By comparing the experimental results with other state-of-the-art methods, we analyze the performance quantitatively. Besides, some ablation studies and visualizations are presented to illustrate the properties of our approach.

Network details. Generally, our dense-resolution network consists of two branches: a full-resolution (FR) branch and a multi-resolution (MR) branch. Specifically, the FR branch is a series of error-minimizing modules extracting features in different scales of feature spaces, i.e., 64, 128, 256, etc. The FR output is a projected concatenation of the modules' outputs. As for the MR branch, we adopt farthest point sampling (FPS) and feature propagation (FP) from [27] for downsampling and upsampling, respectively. The MR branch starts from the first output of FR at size N; after that, lower resolutions, i.e., N/4 and N/16, are investigated. Different from others, more propagated features and skip links are densely connected to enhance the relations between different point resolutions and feature spaces. Empirically, we adopt the same k and dilation settings as in [37, 5]. For the error-minimizing modules in MR, we use regular kNN (equivalent to ADPG with d = 1) since the points are sparse.
method  input type  #points  ModelNet40  ScanObjectNN 
PointNet [26]  coords  89.2  68.2  
ASCN [39]  coords  90.0    
PointNet++ [27]  coords  90.7  77.9  
SONet [13]  coords  90.9    
PointCNN [15]  coords  92.2  78.5  
PCNN [1]  coords  92.3    
SpiderCNN [40]  coords  92.4  73.7  
P2Sequence [16]  coords  92.6    
DensePoint [17]  coords  92.8    
RSCNN [18]  coords  92.9    
DGCNN [37]  coords  92.9  78.1  
KPConv [33]  coords  92.9    
PointASNL [41]  coords  92.9    
Ours  coords  93.1  80.3 
The output is obtained by following Equation 9. For the classification task, we apply a max-pooling function and Fully Connected (FC) layers to regress confidence scores for all possible categories. In terms of the segmentation task, we attach the max-pooled feature to each point feature of the DR output and further predict the semantic label of each point with FC layers. We implement the project with PyTorch and Python; all experiments are trained and tested on Linux with GeForce RTX 2080Ti GPUs.
¹ The code and models will be available at https://github.com/

Training strategy. Stochastic Gradient Descent (SGD) with a momentum of 0.9 is adopted as the optimizer for classification. The learning rate decreases from 0.1 to 0.001 by cosine annealing [21]
during the 300 epochs. For segmentation, we exploit Adam
[12] optimization for 200 epochs of training. The learning rate begins at 0.001 and gradually decays with a rate of 0.5 after every 20 epochs. The batch size for both tasks is 32. Besides, training data is augmented with random scaling and translation; the total loss is the sum of the regular cross-entropy loss and the weighted error-minimizing loss (see Equation 7). Part segmentation is evaluated with the ten-vote strategy used by state-of-the-art approaches [26, 27, 18].

Datasets. We test our approach on two main tasks: point cloud segmentation and classification. The ShapeNet Part dataset [42] is used to predict the semantic class (part label) for each point of an object. In addition, the synthetic ModelNet40 [38] dataset and the real-world ScanObjectNN [34] dataset are used to identify the category of an object.
ShapeNet Part. In general, the dataset has 16,881 object point clouds in 16 categories. Each point is labeled as one of 50 parts. As the primary dataset for our experiments, we follow the official data split [4]. We input the 3D coordinates of 2048 points for each point cloud and feed a one-hot class feature before the FC layers during training. In terms of the evaluation metric, we adopt Intersection-over-Union (IoU). The IoU of a shape is calculated as the mean of the IoUs of all parts in that shape, and mIoU (mean IoU) is the average of the IoUs over all testing shapes.
ModelNet40. It is a popular dataset because of its regular and clean point clouds. There are 12,311 meshes in 40 classes, with 9,843 for training and 2,468 for testing. The corresponding point clouds are generated by uniformly sampling from the surfaces, translating to the origin, and scaling within a unit sphere [26]. In our case, only the 3D coordinates of 1024 points per point cloud have been used.
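The translate-and-scale preprocessing just described can be sketched as below; this is a pure-Python illustration (real pipelines do it with NumPy on the sampled arrays):

```python
def normalize_to_unit_sphere(points):
    """Preprocessing used for ModelNet40 point clouds: translate the
    cloud so its centroid sits at the origin, then scale it so the
    farthest point lies on the unit sphere."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    shifted = [[p[d] - centroid[d] for d in range(3)] for p in points]
    scale = max(sum(c ** 2 for c in p) ** 0.5 for p in shifted) or 1.0
    return [[c / scale for c in p] for p in shifted]
```

After this step every cloud occupies the same bounded region, so the network sees inputs that are invariant to the original mesh's position and size.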
ScanObjectNN. This real-world object dataset was published recently. Although it has 15,000 objects in only 15 categories, it is practically more challenging due to background points, missing parts, and deformations.
Segmentation. Table 1
shows the results of related works reported in overall mIoU, which is the most critical evaluation metric on the ShapeNet Part dataset. In general, our network achieves 86.4% and outperforms other state-of-the-art algorithms under similar experimental settings. As for the evaluations within each class, we surpass the others in 5 out of 16 categories. Particularly in categories with a relatively large number of samples, e.g., airplane, chair, or table, we perform even better (two out of these three classes) than the others.

Classification. Table 2 presents the overall accuracy of classification on both the synthetic and the real-world object datasets. For ModelNet40, we achieve 93.1% and exceed other state-of-the-art results with similar input. Besides, an overall accuracy of 80.3% is obtained on the ScanObjectNN dataset, which is significantly higher than all results on its official leaderboard [9]. The inference time of our model is about 19.2 ms on a single GeForce RTX 2080Ti GPU. In general, our network is effective and robust for point cloud classification.


model  branch  ADPG  error-min.  overall mIoU
0      85.2
1    ✓  85.6
2  FR  ✓  ✓  85.7
3  MR  ✓  ✓  85.3
4  DR  ✓  ✓  86.0

Visualization of learned dilation factors. The color of a point corresponds to the dilation factor learned by our ADPG algorithm. From Figure 3, we find that our algorithm tends to assign larger dilation factors to points on corners, boundaries, and edges. The reason is that the point distribution around them is relatively sparse, so larger neighborhoods are needed for local feature learning. Due to the series connection of the modules, points in deep layers are supposed to have larger receptive fields already, so larger dilation factors become unnecessary: points in relatively dense areas (e.g., on flat surfaces or in central areas) turn out to have smaller dilation factors as the network goes deeper. Different from regular kNN/Ball Query with a limited receptive field, or dilated-kNN with a fixed dilation factor for all points, our algorithm works adaptively and reasonably, as expected.
Effects of components.
Here we conduct an ablation study on the effects of the network architecture, the grouping algorithm, and the error-minimizing module. We run the experiments on the ShapeNet Part dataset with the same input and classifier, and Table 3 presents the results in overall mIoU. Comparing models 1 and 2 to model 0, we observe that the error-minimizing module with ADPG applied can significantly improve the network's performance for part segmentation. Although the multi-resolution branch (model 3) alone is not able to learn the features as comprehensively as the full-resolution branch (model 2) does, we can take advantage of both by combining them in the form of a dense-resolution network (model 4).

Merging the feature maps. Both FR and MR have the properties mentioned above, so we need an effective way to unify the advantages of both. We test simple ways of merging the features of FR and MR, i.e., concatenating them channel-wise, and adding or multiplying them element-wise. Comparing the results of models 3, 4, and 5 to model 0 in Table 4, we observe that these simple merging schemes may not improve performance. In contrast, the channel-wise enhancement of FR by MR (model 5) improves it somewhat, for the reasons explained in Section 3.3. With ten-vote testing, the overall mIoU is boosted to 86.4%.
In this work, we propose a Dense-Resolution Network for point cloud analysis, which leverages information from different resolutions of the point cloud. Specifically, the Adaptive Dilated Point Grouping algorithm is introduced to realize flexible point grouping based on the density distribution. Moreover, an error-minimizing module and a corresponding loss are presented to capture local information and guide the network during training. We conduct experiments and provide ablation studies on point cloud segmentation and classification benchmarks. According to the experimental results, we outperform competing state-of-the-art methods on the ShapeNet Part, ModelNet40, and ScanObjectNN datasets. The quantitative results and qualitative visualizations demonstrate the advantages of our approach.
[3] Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4733–4742. Cited by: §1, §2.
[9] 3D scene understanding benchmark. Note: https://hkustvgd.github.io/benchmark/ Accessed: 2020-04-20. Cited by: §4.2.