As an important type of 3D data that can be conveniently acquired by various 3D sensors, point clouds have been increasingly used in diverse real-world applications, including autonomous driving qi2017frustum; yi2019hierarchical, 3D modeling golovinskiy2009shape; gao2017bimtag; Zhizhong2016; HanCyber17a; zhong2019surface; skrodzki2018directional; zheng2018rolling; gao2015query, indoor navigation zhu2017target and robotics rusu2008towards. Therefore, there is an emerging demand to learn discriminative features with deep neural networks for 3D shape understanding.
Unlike images, point clouds are not suitable for traditional convolutional neural networks (CNNs), which often require a fixed spatial distribution in the neighborhood of each pixel. To alleviate this issue, an alternative way is to rasterize the point cloud into regular voxel representations and then apply 3D CNNs zhou2017voxelnet. However, the performance of plain 3D CNNs is largely limited by the serious resolution loss and the fast-growing computational cost caused by the inherent sparsity of 3D shapes. To overcome this shortcoming of 3D CNNs, PointNet qi2017pointnet was proposed as a pioneering work which directly learns global features of 3D shapes from point sets. However, PointNet learns the feature of each point individually, omitting the important contextual information among points.
To solve the above-mentioned problems, recent studies have attempted to encode the local region contexts of point clouds in various designed manners. Specifically, there are two kinds of local region contexts: the intra-region geometric context and the inter-region spatial context. On the one hand, some methods concentrate on capturing the context of geometric correlations inside each local region. For example, PointNet++ qi2017pointnet++ uses a sampling and grouping strategy to hierarchically extract features for local regions. More recently, Point2Sequence liu2018point2sequence learns the contextual information inside a local region with an attention-based sequence-to-sequence network. On the other hand, several studies attempt to utilize the context of spatial distribution information among local regions. For example, KD-Net klokov2017escape builds a kd-tree to divide the point cloud into small leaf bins and then hierarchically extracts the point cloud feature from the leaves to the root according to a fixed spatial partition. KC-Net shen2018mining uses a graph pooling operation which can partially utilize the spatial distribution information among local regions. However, it is still hard for these methods to encode the fine-grained contexts inside and among local regions simultaneously, especially the geometric correlation between different scale areas inside a local region and the spatial relationships among local regions. This motivates us to employ variable-size filters inside each local region and spatial similarity measures among local regions, capturing intra-region and inter-region context information, respectively. Our method relieves the limitation of traditional CNNs in encoding the geometric context information of point clouds, which usually implement a convolution layer with fixed-size filters whose size is a hyper-parameter.
To address the above-mentioned problems, we propose LRC-Net to learn discriminative features from point clouds.
Our key contributions are summarized as follows.
LRC-Net is presented for learning discriminative features directly from point clouds by simultaneously encoding the geometric correlation inside each local region and the spatial relationships among local regions.
An intra-region context encoding module is designed for capturing the geometric correlation inside each local region by novel variable-size convolution filters, which learn the intrinsic structure and correlation of multi-scale areas from their feature maps, rather than simple feature concatenation or max pooling as commonly used in previous methods such as qi2017pointnet.
An inter-region context encoding module is proposed for integrating the spatial relationships among local regions based on spatial similarity measures, which encodes the spatial distribution of local regions in their metric space.
The above two modules are illustrated in Figure 1.
2 Related Work
Feature Learning from Regularized 3D Data. Traditional methods liu2009robust; liu2009computing; liu2011computing; gao2015query; fehr2016covariance; zou2018broph; srivastava2019deeppoint3d; beksi2019topology; zhao2020hoppf focus on capturing the geometric information of 3D shapes, and are usually limited by their hand-crafted manner in specific applications. Benefiting from the success of CNNs on large-scale image repositories such as ImageNet krizhevsky2012imagenet, deep neural networks are being applied to process 3D data. As an irregular format of 3D data, point clouds can be transformed into other regularized formats, such as 3D voxels Zhizhong2016b; han2017boscc; han2018deep or rendered views han2018seqviews2seqlabels; han20193d2seqviews; han2019y2seq2seq; han2019view; han20193dviewgraph; han2019parts4feature. The voxelization of a point cloud is a feasible choice, which first converts the point cloud into voxels and then applies 3D CNNs. 3D ShapeNets wu20153d and VoxNet maturana2015voxnet represent each voxel with a binary value which indicates whether the corresponding location in space is occupied. However, the performance is largely limited by the resolution loss and the rapid growth of computational complexity. The inherent sparsity of 3D shapes also makes it hard to make full use of the input storage, since the hollow interior of 3D shapes is often meaningless. Some improvements li2016fpnn have been proposed to alleviate the data sparsity of the volumetric representation. However, it is still nontrivial to deal with large point clouds at high resolution.
Feature Learning from Point Clouds. PointNet qi2017pointnet is a pioneering work which directly adopts point sets as input and obtains convincing performances. A concise strategy is adopted in PointNet, which computes the feature of each point individually and then aggregates these features into a global representation with max pooling. However, PointNet is largely limited in capturing the contextual information of local regions. To address this problem, many recent studies attempt to capture local region contexts, which can be divided into two categories: intra-region context and inter-region context. On the one hand, some studies capture the intra-region context inside multi-scale local regions. PointNet++ qi2017pointnet++ uses sampling and grouping operations to hierarchically extract features from several clusters, capturing the context of each cluster. Point2Sequence liu2018point2sequence extracts the feature of local regions with a sequence-to-sequence model with an attention mechanism. In addition, some studies li2018pointcnn; wang2018dynamic; xu2018spidercnn; wang2017cnn; komarichev2019cnn; hu2019render4completion; wen2020cvpr investigate CNN-like operations to aggregate the neighbors of a given point by building a kNN graph inside a single-scale local region. On the other hand, some studies encode the inter-region context with indexing structures. KC-Net shen2018mining employs a kernel correlation layer and a graph pooling layer for capturing the local structure of point clouds. ShapeContextNet xie2018attentional extends the 2D shape context belongie2001shape to 3D, dividing the local region of a given point into bins and updating the point feature with the aggregation of these bin features. KD-Net klokov2017escape and OctNet riegler2017octnet first divide the input point cloud into leaves, and then hierarchically extract features from the leaves to the root. Point2SpatialCapsule wen2019point2spatialcapsule integrates capsules to explore the local structures of point clouds, employing multi-scale shuffling to increase the diversity of local region features and a clustering operation to capture the spatial information of local regions in the feature space. This complicated procedure significantly differentiates Point2SpatialCapsule from ours. In general, it is hard for current methods to simultaneously capture the contextual information inside and among multi-scale local regions, which limits the expressiveness of the learned representations of point clouds.
3 The LRC-Net Model
Figure 2 shows the architecture of LRC-Net, which is composed of six parts: multi-scale area establishment, area feature extraction, intra-region context encoding, inter-region context encoding, shape classification and shape segmentation. LRC-Net adopts a point cloud as input, which is composed of 3D point coordinates x, y and z. Firstly, a subset of points is selected from the input point cloud to act as the centroids of local regions. Based on the selected centroids, multiple scale areas are established in each local region centered at its centroid, where each scale area contains a different number of points. Then, a feature vector is extracted from each scale area. By stacking these vectors, a feature matrix is formed for each local region and further aggregated into a local region feature by the intra-region encoding module (see Figure 2(c)). Meanwhile, another module calculating the spatial similarity in the 3D space is applied to capture the inter-region context among local regions (see Figure 2(d)). Finally, a 1024-dimensional feature of the whole input point cloud is aggregated from the features of local regions, which integrates the extracted intra-region and inter-region context features. The learned global feature can be applied to shape classification and shape segmentation applications.
3.1 Multi-scale Area Establishment
Three key layers are engaged in our structure to establish the multi-scale areas around each sampled point: a sampling layer, a searching layer and a grouping layer. The sampling layer uniformly selects a subset of points from the input point cloud as the centroids of local regions. Around each sampled centroid, the searching layer finds the nearest points to build the indexing relationship between points. According to the indexes produced by the searching layer, the grouping layer groups the multi-scale areas inside each local region.
In the sampling layer, farthest point sampling (FPS) is adopted to select the points which define the centroids of local regions. In the sampling process, the newly sampled point is always the farthest one from the previously selected points. Compared with other sampling methods, such as random sampling, FPS achieves a more uniform coverage of the entire point cloud with the same number of sampled points.
To build the multi-scale areas, the k-nearest neighbors (kNN) algorithm is applied to search the neighbors of a given point based on the Euclidean distance between points. An alternative method is the ball query qi2017pointnet++, which selects all points within a given radius around a point. Compared with the ball query, kNN guarantees a fixed amount of information inside local regions and is robust to input point clouds of different sparsity.
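The sampling and searching layers above can be sketched in a few lines of NumPy. This is a minimal illustration of FPS and kNN grouping (function names and shapes are our own choices, not the paper's implementation):

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Select m centroids with FPS: each new centroid is the point
    farthest from all previously selected ones."""
    n = points.shape[0]
    selected = [np.random.randint(n)]
    dist = np.full(n, np.inf)  # distance to the nearest selected centroid
    for _ in range(m - 1):
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)
        selected.append(int(np.argmax(dist)))
    return points[selected]

def knn_group(points, centroid, k):
    """Group one scale area: the k nearest neighbors of a centroid."""
    d = np.linalg.norm(points - centroid, axis=1)
    return points[np.argsort(d)[:k]]
```

Calling `knn_group` with increasing values of k for the same centroid yields the nested multi-scale areas of one local region.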
3.2 Area Feature Extraction
As shown in Figure 2, a concise and effective PointNet layer is employed in LRC-Net to extract the feature of each scale area. The PointNet layer is composed of two key parts: a multi-layer perceptron (MLP) and a max-pooling layer. The MLP individually abstracts the coordinates of the points in each area into the feature space, and these per-point features are then aggregated into a single area feature by the max-pooling layer. So far, a feature map of the different scale areas is acquired after the PointNet layer.
Following previous studies li2018so; qi2017pointnet++, relative coordinates are adopted in LRC-Net. Before feeding the points inside each local region into the PointNet layer, a relative coordinate system is built by simply subtracting the centroid coordinates from each point in the local region. Different from absolute coordinates, relative coordinates are determined by the relative positional relationship between points. Therefore, by using relative coordinates, the learned features of local regions can be invariant to transformations such as rotation and translation.
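A single-layer sketch of this PointNet-style area encoder on relative coordinates could look as follows (the one-layer MLP is a simplification for brevity; the actual network stacks several layers):

```python
import numpy as np

def area_feature(area_points, centroid, weights, bias):
    """Encode one scale area: shared per-point MLP + max pooling.
    area_points: (K, 3), weights: (3, D), bias: (D,) -> feature (D,)."""
    rel = area_points - centroid                   # relative coordinates
    feats = np.maximum(rel @ weights + bias, 0.0)  # per-point MLP with ReLU
    return feats.max(axis=0)                       # max pool over the K points
```

Because only relative coordinates enter the MLP, translating the whole area together with its centroid leaves the output unchanged.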
3.3 Intra-region Context Encoding
In order to capture the fine-grained contextual information between multi-scale areas inside local regions, variable-size convolution filters are employed in the architecture. Inspired by capturing the correlation of different words in natural language processing tasks kim2014convolutional, the intra-region correlation of multi-scale areas is also important in the feature learning of point clouds. Different from most existing methods that only encode the correlation at a fixed scale, we consider capturing the correlation among all $T$ scales. As depicted in Figure 2, given the features $f_1, \ldots, f_T$ of the multi-scale areas in a local region from the area feature extraction module, we first represent these features in a feature map by
$$\mathbf{F} = f_1 \oplus f_2 \oplus \cdots \oplus f_T,$$
where $\oplus$ is the concatenation operator. In general, let $f_{i:i+j}$ refer to the concatenation of features $f_i, f_{i+1}, \ldots, f_{i+j}$. A convolution operation involves a filter $\mathbf{w} \in \mathbb{R}^{h \times D}$, which is applied to a window of $h$ scale features to produce a new feature. For example, a feature $c_i$ is generated from a window of features $f_{i:i+h-1}$ by
$$c_i = \sigma(\mathbf{w} \cdot f_{i:i+h-1} + b).$$
Here $b \in \mathbb{R}$ is a bias term and $\sigma(\cdot)$ is a non-linear function such as ReLU nair2010rectified. As the intermediate step shown in Figure 2, this filter is applied to each possible window of scales in the features $\{f_{1:h}, f_{2:h+1}, \ldots, f_{T-h+1:T}\}$ to produce a feature vector
$$\mathbf{c} = [c_1, c_2, \ldots, c_{T-h+1}]$$
with length $T-h+1$.
with length of . Then, we apply max pooling operation over the feature vector, which extracts the maximum value from feature vector by
Here $\hat{c}$ is one element of the local region feature corresponding to this particular filter. So far, we have shown the process of obtaining one element of the local region feature by a convolution filter with window size $h$. In general, there are several kinds of convolution filters with different sizes and several filters of each kind. Therefore, the output for each input local region is a feature vector whose length equals the total number of filters.
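The variable-size filtering over the stacked multi-scale area features can be sketched as follows: each filter spans a window of consecutive scales and its responses are max-pooled over the window positions (a simplified NumPy sketch with names of our own choosing):

```python
import numpy as np

def intra_region_encode(feature_map, filters):
    """feature_map: (T, D) stacked features of the T scale areas.
    filters: list of (w, b) pairs, w of shape (h, D) with window size h.
    Returns one pooled activation per filter."""
    T = feature_map.shape[0]
    out = []
    for w, b in filters:
        h = w.shape[0]
        # slide the filter over all T - h + 1 windows of h scales
        acts = [np.maximum(np.sum(w * feature_map[i:i + h]) + b, 0.0)
                for i in range(T - h + 1)]
        out.append(max(acts))  # max pooling over window positions
    return np.array(out)
```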
3.4 Inter-region Context Encoding
To obtain the global feature of point clouds, most existing methods adopt simple pooling layers to aggregate local region features. However, the inter-region context is largely lost in the pooling process, especially the spatial distribution information among local regions. To capture the inter-region spatial context, a greedy strategy is proposed to aggregate the spatial distribution information among local regions in an explicit manner. Following the intra-region context encoding module, the feature matrix of all local regions is obtained. As shown in Figure 2, to encode the spatial information of local regions, we explicitly calculate the spatial similarity among local regions based on the coordinates of the local region centroids. Given the centroid coordinates $o_1, \ldots, o_M$ of the $M$ local regions, the distance matrix $\mathbf{D}$ is built by
$$D_{ij} = \| o_i - o_j \|_2.$$
To convert the distance matrix into the similarity space, the spatial similarity matrix $\mathbf{S}$ is calculated by
$$S_{ij} = e^{-\gamma D_{ij}}.$$
Here $\gamma$ is a parameter which regulates the effect of the spatial similarity. Thus, we obtain the spatial similarity among local regions. To enhance the feature of each local region, a greedy weighting strategy is adopted as
$$\tilde{g}_i = \sum_{j=1}^{M} S_{ij}\, g_j,$$
where $\tilde{g}_i$ is the enhanced feature vector of the local region feature $g_i$ and $j$ is the index of the column. In addition, a normalization operation is applied to the enhanced features by
$$\hat{g}_i = \frac{\tilde{g}_i}{\|\tilde{g}_i\|_2}.$$
Here $\hat{g}_i$ are the final features of local regions after the regularization, which contain the information of spatial distribution among local regions. In general, it is a greedy strategy to compute the spatial similarity between any two local regions. The greedy strategy aims to enhance the correlation of local regions, which can promote the learning of global features. In the subsequent network, a 1024-dimensional global feature of the input point cloud is extracted by another PointNet layer. The learned global feature can be applied to various applications, such as shape classification and shape segmentation.
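A compact sketch of this inter-region weighting, using the exponential similarity and (as our reading of the normalization step) an L2 normalization, is:

```python
import numpy as np

def inter_region_encode(centroids, feats, gamma):
    """centroids: (M, 3) region centroids, feats: (M, C) region features.
    Each region feature is replaced by a spatial-similarity-weighted sum
    of all region features, then L2-normalized (our assumption)."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # pairwise distance matrix
    sim = np.exp(-gamma * dist)            # gamma = 0 -> plain summation
    enhanced = sim @ feats                 # greedy weighted aggregation
    return enhanced / np.linalg.norm(enhanced, axis=1, keepdims=True)
```

Note that with gamma set to zero every weight becomes one, reducing the module to a simple summation of local region features, consistent with the parameter discussion in the experiments.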
3.5 Expansion for Shape Segmentation
The target of shape segmentation is to predict a semantic label for each point in the point cloud. With the obtained global feature, the key is how to acquire a feature for each point. There are two options: one is to duplicate the global feature for each point as in wang2018dynamic, and the other is to perform upsampling with an interpolation layer qi2017pointnet++. In the shape segmentation module, two interpolation layers are equipped in our network, which propagate the features from the shape level to the point level by upsampling. The feature propagation between different levels is guided by the inverse distance between the $k$ nearest points. In the interpolation layer, we search the $k$ nearest points for each point in the current level from the points in the previous level. Therefore, the feature of a point $x$ in the current level is interpolated from the positional relationship of points between the two levels, denoted by
$$f(x) = \frac{\sum_{i=1}^{k} w_i(x)\, f_i}{\sum_{i=1}^{k} w_i(x)}, \qquad w_i(x) = \frac{1}{d(x, x_i)^2},$$
where $w_i(x)$ is the inverse square Euclidean distance between the two points, $f_i$ is the point feature of $x_i$, and $x_1, \ldots, x_k$ are the $k$ nearest points of $x$ in the previous level. The points in each level are already obtained from the multi-scale area establishment module, and the interpolation step can be regarded as a reverse of the abstraction step.
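This interpolation follows the inverse-square-distance scheme of PointNet++ feature propagation; a minimal sketch (function name ours):

```python
import numpy as np

def interpolate(query_pts, ref_pts, ref_feats, k=3):
    """Propagate features from a sparse level (ref) to a dense level
    (query) by inverse-square-distance weighting of the k nearest refs."""
    out = np.empty((query_pts.shape[0], ref_feats.shape[1]))
    for i, q in enumerate(query_pts):
        d2 = np.sum((ref_pts - q) ** 2, axis=1)
        idx = np.argsort(d2)[:k]
        w = 1.0 / np.maximum(d2[idx], 1e-10)  # inverse square distance
        out[i] = (w[:, None] * ref_feats[idx]).sum(axis=0) / w.sum()
    return out
```

A query point that coincides with a reference point numerically recovers that reference feature, since its weight dominates the sum.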
Table 2: Shape segmentation results (method, scale, mean IoU and per-category Intersection over Union (IoU)).
4 Experiments
In this section, shape classification and shape segmentation applications are adopted to evaluate the performance of LRC-Net. In the ablation study, we first investigate how the two main modules affect the performance of LRC-Net in the shape classification task on ModelNet40 wu20153d. Then, we compare our model with several state-of-the-art methods in shape classification on ModelNet10/40 and shape part segmentation on the ShapeNet part dataset savva2016shrec. Finally, some visualizations of the shape segmentation results are also reported.
4.1 Network Configuration
In LRC-Net, some network configurations need to be initialized. According to the input point cloud, we first initialize the number of sampled points, the number of scales, the number of points in the multi-scale areas and the feature dimension of each local region. The rest of the settings of our model are the same as in Figure 2. The discussion of some parameters is listed in the Supplementary Material. In addition, ReLU is used after each fully connected layer with batch normalization, and dropout is applied with a drop ratio of 0.4 in the fully connected layers. In the experiments, we train our network on an NVIDIA GTX 1080 Ti GPU using the ADAM optimizer with an initial learning rate of 0.001, a batch size of 16 and a batch normalization rate of 0.5. The learning rate and batch normalization rate are decayed by 0.3 and 0.5, respectively, every 20 epochs.
5 Parameters Setting
All the experiments in this section are evaluated on ModelNet40, which contains 40 categories and 12,311 CAD shapes, with 9,843 shapes for training and 2,468 shapes for testing. The results listed in the tables are instance accuracies. For each 3D shape, we adopt a point cloud with 1,024 points, uniformly sampled from the corresponding mesh faces, as input.
In the module of spatial distribution information encoding, the similarity parameter is important, as it influences the performance of the whole model. The results of several settings of this parameter are shown in Table 3. The best instance accuracy is reached at the setting that maximizes the effect of the spatial information encoding module. In particular, setting the parameter to zero reduces the weighting to a simple summation of the local region features, which results in a loss of discriminative ability of the local region features. From the results, we can see that the spatial information encoding module can promote the global representation learning of point clouds.
To explore the effect of the number of sampled points, we keep the other settings fixed and vary it from 128 to 512, as shown in Table 4. The number of sampled points influences the local regions that are visible to the network during training. An intermediate setting obtains a better coverage of all training point clouds, where the input information is balanced between insufficiency and redundancy.
Moreover, we also discuss the impact of the number of scale areas in each local region. In the implementation, we keep the number of points in each local region at 128 and vary the number of scale areas from 1 to 5, where the number of points in each scale area is a power of 2. In terms of the results in Table 5, LRC-Net reaches the best performance when the number of scale areas is 4. In practice, the number of scale areas is largely determined by the properties of the input point cloud, especially the sparsity of points. Therefore, four scale areas in each local region is the most suitable setting for our model.
With the number of scale areas fixed at 4, we vary the number of filter sizes from 1 to 4 in the variable-size convolution module, where a setting of one corresponds to a single type of convolution filter, a setting of two corresponds to two types of filters, and so on. The experimental results in Table 6 show that the variable-size convolution module is effective in aggregating the multi-scale area features.
5.1 Ablation Study
In the following, we show the effects of the two main modules: the intra-region context encoding and the inter-region context encoding. In Table 7, we show the performance of LRC-Net with and without the intra-region context encoding module. Specifically, when we remove the intra-region context encoding, there are three widely used ways to aggregate the features of multi-scale areas: mean pooling (Mean), max pooling (Max) and concatenation (Con).
These results show that the intra-region context encoding module can promote the discriminative ability of the learned point cloud features. Similarly, we also evaluate the role of the inter-region context encoding module. In Table 8, we list the results with (Y) and without (N) the inter-region context encoding module. In addition, we also show the influence of the max pooling operation (max) or the mean pooling operation (mean) in the PointNet layer which extracts the global representation, as shown in Figure 2. Therefore, there are four alternative combinations: Y(max), N(max), Y(mean) and N(mean). The results suggest that the inter-region context encoding module is effective in improving the learning of global features by capturing the spatial context among local regions. According to the above comparisons, both modules in LRC-Net are effective in encoding local region contexts.
5.2 Shape Classification
The performance of LRC-Net is evaluated on both the ModelNet10 (MN10) and ModelNet40 (MN40) 3D shape classification benchmarks. In detail, MN40 contains 40 categories and 12,311 CAD shapes, with 9,843 shapes for training and 2,468 shapes for testing. MN10 is a subset of MN40 with 4,899 CAD shapes, including 3,991 shapes for training and 908 shapes for testing. Table 1 compares LRC-Net with several state-of-the-art methods in terms of instance accuracy on MN10 and MN40, respectively. As shown in Table 1, all methods can be divided into two categories: single-scale based methods li2018pointcnn; komarichev2019cnn and multi-scale based methods qi2017pointnet++; liu2018point2sequence. LRC-Net greatly improves over its baseline PointNet++ qi2017pointnet++ on both ModelNet10 and ModelNet40. LRC-Net achieves the same results as Point2SpatialCapsule wen2019point2spatialcapsule on ModelNet10 and reaches comparable results with Point2SpatialCapsule on ModelNet40. Point2SpatialCapsule benefits from network structures such as dynamic routing for clustering and point cloud reconstruction, which aim to increase the network capability. However, these two added modules (i.e., clustering and point cloud reconstruction) increase both the model size and the computational cost of Point2SpatialCapsule during network learning, which makes Point2SpatialCapsule more complicated than LRC-Net in terms of network architecture. The best accuracy is achieved with 10,000 points as input, where the higher-resolution point cloud provides more local details than the sparse input with 1,024 points. The experimental results show that LRC-Net can effectively enhance the representation learning of point clouds from multi-scale local regions by capturing the contextual information inside and among local regions.
| Method | Model size (MB) | Time (ms) | Accuracy (%) |
| --- | --- | --- | --- |
| PointNet (vanilla) qi2017pointnet | 9.4 | 6.8 | 87.1 |
| PointNet++ (SSG) qi2017pointnet++ | 8.7 | 82.4 | - |
| PointNet++ (MSG) qi2017pointnet++ | 12 | 163.2 | 90.7 |
In addition, to show the network complexity of LRC-Net intuitively, we report statistics on the model size and time cost of several point cloud based methods. We follow PointNet++ to evaluate the time and space cost, as shown in Table 9. We record the forward time under the same conditions, with a batch size of 8, using TensorFlow 1.0 on a single GTX 1080 Ti. Table 9 shows that LRC-Net achieves a trade-off between model complexity (number of parameters) and computational complexity (forward pass time). However, influenced by the setting of multi-scale grouping (MSG), LRC-Net and PointNet++ take longer than other single-scale grouping (SSG) based methods.
5.3 Shape Segmentation
To further verify the validity of our model, we also evaluate the performance of LRC-Net in the shape segmentation task. The shape segmentation branch is implemented as depicted in Figure 2. In this task, the ShapeNet part dataset is adopted as the benchmark, which contains 16,881 models from 16 categories and is split into train, validation and test sets as in PointNet++. There are 2,048 points in each point cloud, where each point belongs to one of 50 part classes, and the number of semantic parts in each shape varies from 2 to 5. There is no overlap of part classes between shapes in different shape categories.
We employ the mean Intersection over Union (IoU) proposed in qi2017pointnet as the evaluation metric for shape segmentation. For each shape, the IoU is computed between the ground truth and the prediction for each part class in the shape category, and the average IoUs are calculated for each shape category and over all shapes. In Table 2, we report the performance of LRC-Net in each category and the mean IoU over all testing shapes.
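For reference, the per-shape mean IoU can be computed as below. A common convention, used e.g. by PointNet, counts a part class absent from both the prediction and the ground truth as IoU 1 (variable names are ours):

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Mean IoU of one shape over the part classes of its category.
    pred, gt: integer part labels per point; part_ids: the category's parts."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # absent part -> 1
    return float(np.mean(ious))
```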
From Table 2, the performance of LRC-Net is not as good as that of three recently proposed single-scale based methods, including PointCNN li2018pointcnn, A-CNN komarichev2019cnn and RS-CNN liu2019relation, which adopt some special strategies in the training process. For example, A-CNN states "We concatenate the one-hot encoding of the object label to the last feature layer" and PointCNN states "we perturb point locations with the point shuffling for better generalization", which are different from mainstream approaches like PointNet qi2017pointnet. For a fair comparison with most other methods, we do not apply these strategies in our method.
Moreover, compared with other multi-scale based methods qi2017pointnet++; liu2018point2sequence, LRC-Net achieves the best mean instance IoU among them and comparable performances on many shape categories, which shows the effectiveness of enhancing the contextual information inside and among local regions. In addition, some visualizations of the shape segmentation results are shown in Figure 3, where our predictions are highly consistent with the ground truths. The shape segmentation results qualitatively show the effectiveness of LRC-Net in capturing the contextual information for each point.
| Method | Mean IoU | Overall accuracy |
| --- | --- | --- |
| PointNet (baseline) qi2017pointnet | 20.1 | 53.2 |
| MS + CU (2) engelmann2017exploring | 47.8 | 49.7 |
| G + RCU engelmann2017exploring | 49.7 | 81.1 |
5.4 Indoor Scene Segmentation
We evaluate our model on the Stanford Large-Scale 3D Indoor Spaces dataset (S3DIS) armeni20163d for the semantic scene segmentation task. The S3DIS dataset contains 3D scan point clouds of 6 indoor areas including 272 rooms. Each point belongs to one of 13 categories, e.g., chair, board, ceiling and beam. We follow the same setting as in PointNet qi2017pointnet, where each room is split into blocks and 4,096 points are sampled from each block in the training process. In the testing process, all the points are used. We also apply 6-fold cross validation over the 6 areas and report the average evaluation results.
Similar to the shape part segmentation task, a probability distribution over the semantic object classes is generated for each input point. The quantitative comparison results with some existing methods are reported in Table 10. LRC-Net outperforms PointNet qi2017pointnet and achieves comparable results with ShapeContextNet xie2018attentional.
6 Conclusion
In this paper, we propose a novel feature learning framework for the understanding of point clouds in shape classification and shape segmentation. With the intra-region context encoding module, LRC-Net effectively learns the correlation between multi-scale areas inside each local region. To enhance the aggregation of local region features, a greedy strategy is adopted to encode the inter-region context of point clouds. We justify that both of these two modules are vital for encoding local region contexts, which promotes learning discriminative features for point clouds.
This work was supported by National Key R&D Program of China (2018YFB0505400).