3DContextNet: K-d Tree Guided Hierarchical Learning of Point Clouds Using Local Contextual Cues

by Wei Zeng, et al.
University of Amsterdam

3D data such as point clouds and meshes are becoming increasingly available. The goal of this paper is 3D object and scene classification and semantic segmentation. Because point clouds have an irregular format, most existing methods convert the 3D data into multiple 2D projection images or 3D voxel grids. These representations are suitable as input for conventional CNNs, but they either ignore the underlying 3D geometrical structure or are constrained by data sparsity and computational complexity. Therefore, recent methods encode the coordinates of each point into high-dimensional features to cover the 3D space. However, by design, these models are not able to sufficiently capture local patterns. In this paper, we propose a method that directly uses point clouds as input and exploits the implicit space partition of the k-d tree structure to learn local contextual information and aggregate features at different scales hierarchically. Extensive experiments on challenging benchmarks show that our proposed model properly captures local patterns to provide discriminative point set features. For the task of 3D scene semantic segmentation, our method outperforms the state-of-the-art on the challenging Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) by a large margin.


1 Introduction

Over the past few years, ConvNets have achieved excellent performance in different computer vision tasks such as image classification [1, 2, 3], object detection [4, 5, 6] and semantic segmentation [4, 7, 8, 9].

3D imaging technology has also made major progress. Because large-scale datasets are crucial for supervised 3D deep learning, several have recently been made publicly available, such as ModelNet [10] and ShapeNet [11], as well as real 3D scene datasets such as the Stanford Large-Scale 3D Indoor Spaces Dataset [12] and ScanNet [13]. To perform weight sharing and hierarchical learning, ConvNets need highly regular input data. Therefore, most traditional methods convert the irregular 3D data to regular formats like 2D projection images [14, 15, 16] or 3D voxel grids [10, 16, 17] before they are used by ConvNets.

Figure 1: Example of the implicit 3D space partition of a k-d tree. Colors of different local parts indicate different corresponding nodes in the k-d tree structure

Methods that employ 2D image projections of 3D models [14, 15] are well suited as input for 2D ConvNet architectures. However, the intrinsic 3D geometrical information is distorted by the 3D-to-2D projection. Hence, this type of method is limited in exploiting the 3D spatial connections between regions. While it seems straightforward to extend 2D CNNs to 3D data processing by using 3D convolutional kernels, data sparsity and computational complexity restrict this type of approach [10, 17, 18, 19].

To fully exploit the 3D nature of point clouds, in this paper, the goal is to use the k-d tree structure [20] as the 3D data representation model, see Figure 1. Our method consists of two parts: feature learning and aggregation. It exploits both local and global contextual information and aggregates point features to obtain discriminative 3D signatures in a hierarchical manner. In the feature learning stage, local patterns are identified by the use of an adaptive feature recalibration procedure, and global patterns are calculated as non-local responses of different regions at the same level. In the feature aggregation stage, point features are merged hierarchically corresponding to the associated k-d tree structure in a bottom-up way.

Our main contributions are as follows: (1) A novel 3D context-aware neural network is proposed for 3D point cloud feature learning by exploiting the implicit space partition of the k-d tree structure. (2) A novel method is presented to incorporate the 3D space partition structure into a CNN architecture. (3) For semantic segmentation, our method significantly outperforms the state-of-the-art on the challenging Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS).


2 Related Work

Figure 2: Comparison to related work for the classification task. Our model is based on hierarchical feature learning and aggregation using the k-d tree structure

Previous work on ConvNets and volumetric models uses different rasterization strategies. Wu et al. propose 3DShapeNets [10], using 3D binary voxel grids as input of a Convolutional Deep Belief Network. This is the first work to use deep ConvNets for 3D data processing. VoxNet [17] proposes a 3D ConvNet architecture to integrate the 3D volumetric occupancy grid. ORION [18] exploits the 3D orientation to improve the results of voxel nets for 3D object recognition. Based on the ResNet [21] architecture, Voxception-ResNet (VRN) [19] builds a very deep architecture. However, volumetric models are limited by their resolution, data sparsity, and the computational cost of 3D convolutions.

Other methods rely on 2D projection images to represent the original 3D data and then apply 2D ConvNets to classify them. MVCNN [14] uses 2D rendered images of 3D shapes to learn representations of multiple views of a 3D model and then combines them to compute a compact descriptor. DeepPano [15] converts each 3D shape to a panoramic view and uses 2D ConvNets to build classifiers directly from these panoramas. With well-designed ConvNets, this type of method (2D projections from 3D) performs successfully in different shape classification and retrieval tasks. However, due to the 3D-to-2D projection, these methods are limited in exploring the full 3D nature of the data. Other works [22, 23] exploit ConvNets to process non-Euclidean geometries. Geodesic Convolutional Neural Networks (GCNN) apply linear and non-linear transformations to polar coordinates in a local geodesic system. However, these methods are limited to manifold meshes.

Only recently, a number of methods have been proposed that apply deep learning directly to the raw 3D data. PointNet [24] is pioneering work that processes 3D point sets directly. Because every point is treated equally, this approach fails to retain the full 3D information. Another recent work, Kd-Networks [25], uses a 3D indexing structure to guide the computation. The method employs parameter sharing and calculates representations from the leaf nodes to the root. However, this method needs to sample the point clouds and construct k-d trees at every iteration. Further, the method employs multiple k-d trees to represent a single object. It depends on the split direction and is negatively influenced by changes in rotation (3D object classification) and viewpoint (3D scene semantic segmentation). PointNet++ [26], a modified version of PointNet, abstracts local patterns by sampling representative points and recursively applies PointNet [24] as a learning component to obtain the final representation. However, it directly discards the unselected points after each layer and needs to sample points recursively at different scales, which may yield relatively slow inference.

In contrast to previous methods, our model is based on a hierarchical feature learning and aggregation pipeline. Our neural network exploits the local and global contextual cues which are inferred from the implicit space partition of the k-d tree. In this way, our model learns features and calculates the representation vectors progressively using the associated k-d tree. Figure 2 shows a comparison of related methods to our work for the classification task.

3 Method

In this section, we describe our architecture, 3DContextNet, see Figure 3. First, the type of tree structure is motivated to subdivide the 3D space. Then, the feature learning stage is discussed that uses both local and global contextual cues to encode the point features. Finally, we describe our feature aggregation stage computing representation vectors progressively from the k-d trees.

3.1 K-d Tree Structure: Implicit 3D Space Partition

Our method is designed to capture both the local and global context by learning and aggregating point features progressively and hierarchically. Therefore, a representation model is required that partitions 3D point clouds and encapsulates the latent relations between regions. To this end, the k-d tree structure [20] is chosen.

A k-d tree is a space partitioning structure which is constructed by recursively computing axis-aligned hyperplanes to divide point sets. In this paper, we choose the standard k-d tree to obtain balanced k-d trees from the 3D input point clouds/sets. The latent region subdivisions of the constructed k-d tree are used to capture the local and global contextual information of point sets. All nodes at a certain level represent local regions at the same scale, whereas nodes at different levels represent subdivisions at the corresponding scales. The splitting directions and positions from the k-d tree construction are not used. In this way, our method is more robust to jittering and rotation than the k-d network of [25], which trains different affine transformations depending on the splitting directions of the nodes.
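The implicit partition can be illustrated with a minimal NumPy sketch. It assumes, beyond what the text states, that each split uses the axis with the largest extent and cuts at the median; the function name `kdtree_order` and the toy cloud are hypothetical. Only the grouping of points is kept, so every tree node becomes a contiguous block of indices.

```python
import numpy as np

def kdtree_order(points, depth):
    """Order point indices so that every k-d tree node is a contiguous block.

    Assumptions of this sketch (not stated in the text): each split uses the
    axis with the largest extent and cuts at the median, giving a balanced tree.
    Only the grouping is returned; split axes and positions are dropped.
    """
    def split(idx, level):
        if level == depth:
            return [idx]
        pts = points[idx]
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))  # widest axis
        order = idx[np.argsort(pts[:, axis], kind="stable")]
        half = len(order) // 2
        return split(order[:half], level + 1) + split(order[half:], level + 1)

    return np.concatenate(split(np.arange(len(points)), 0))

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))             # toy point cloud
order = kdtree_order(cloud, depth=5)      # 2**5 = 32 leaf regions
leaf_regions = order.reshape(32, 32)      # row r: indices of the 32 points in region r
parent_regions = order.reshape(16, 64)    # one level up: 16 regions of 64 points
```

Because sibling nodes are adjacent in this ordering, regions at any coarser level are obtained by reshaping the same index array, which is what makes per-level pooling cheap.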

The k-d tree structure can be used to search for k-nearest neighbors for each point to determine the local point adjacency and neighbor connectivity. Our approach uses the implicit local partitioning obtained by the k-d tree structure to determine the point adjacency and neighbor connectivity.

In general, conventional ConvNets learn and merge nearby features at the same time enlarging the receptive fields of the network. Because of the non-overlapping partition of the k-d tree structure, in our method, learning and merging at the same time would decrease the size of remaining points too fast. This may lead to a lack of fine geometrical cues which are factored out during the early merging stages. To this end, our approach is to divide the network architecture into two parts: feature learning and aggregation.

Figure 3: 3DContextNet architecture. 3D object point clouds are used to illustrate that our method is suitable for both 3D classification and segmentation tasks. The corresponding nodes of the k-d tree determine the receptive fields at different levels. For feature learning, both local and global contextual information is encoded for each level. The associated k-d tree forms the computational graph to compute the representation vectors progressively for feature aggregation

3.2 Feature Learning Stage

The input is a 3D point set with the corresponding k-d tree. The tree leaves contain the individual (raw) 3D points with their representation vectors, denoted by $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$. For example, the initial vectors contain the 3D point coordinates ($d = 3$). Features are directly learned from the raw point cloud without any pre-processing step. According to [27], a function $S(X)$ is permutation invariant to the elements in $X$ if and only if it can be decomposed in the form $\rho\left(\sum_{x \in X} \phi(x)\right)$, for suitable transformations $\rho$ and $\phi$.

We follow PointNet [24], where a point set is mapped to a discriminative vector as follows:

$$f(x_1, \dots, x_n) = g\big(h(x_1), \dots, h(x_n)\big), \qquad (1)$$

where $h: \mathbb{R}^d \to \mathbb{R}^K$, and $g$ is a symmetric function.

In the feature learning stage, point features are computed at different levels hierarchically. For a certain level, we first process each point using shared multi-layer perceptron networks (MLP) as function $h$ in equation (1). Then, different local region representations are computed by a symmetric function (max pooling in our work) for the nodes at this same level, as function $g$ in equation (1).
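A minimal NumPy sketch of this step, under stated assumptions: the points are already ordered so that each k-d tree node at the current level is a contiguous block, the weights are random and untrained, and the helper names `shared_mlp` and `region_max_pool` are hypothetical.

```python
import numpy as np

def shared_mlp(x, weights):
    """Apply the same ReLU layers to every point (each row of x)."""
    for w, b in weights:
        x = np.maximum(x @ w + b, 0.0)
    return x

def region_max_pool(features, num_regions):
    """Max-pool point features per region (the symmetric function).

    Assumes points are ordered so that each region is a contiguous block.
    """
    n, k = features.shape
    return features.reshape(num_regions, n // num_regions, k).max(axis=1)

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))                  # stand-in for k-d tree ordered points
weights = [(rng.normal(size=(3, 64)) * 0.1, np.zeros(64)),
           (rng.normal(size=(64, 128)) * 0.1, np.zeros(128))]  # random, untrained
point_feats = shared_mlp(points, weights)            # (1024, 128), one row per point
region_feats = region_max_pool(point_feats, 32)      # (32, 128), one row per node
```

Max pooling makes each region descriptor independent of the order of the points inside that region.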

3.2.1 Local Contextual Cues: Adaptive Feature Recalibration

To model the inter-dependencies between point features of the same region, we use the local region representations obtained from the symmetric function to perform adaptive feature recalibration [28]. All operations are adaptive to each local region, represented by a certain node in the k-d tree. The local region representation obtained by the symmetric function can be interpreted as a feature descriptor for this local region. A gating function is used with a sigmoid activation to capture the feature-wise dependencies. Point features in this local region are then rescaled by the activations to obtain the adaptive recalibrated output:

$$Y = \sigma\big(\mathrm{MLP}(g(X))\big) \odot X, \qquad (2)$$

where $\sigma$ denotes the sigmoid activation, $\mathrm{MLP}$ denotes the gating layers, and $g$ is the symmetric function to obtain the local region representation. $X = \{x_1, \dots, x_m\}$ is the point feature set of the local region and $m$ is the number of points in the region. The product $\odot$ is applied feature-wise to every point of the region. In this way, feature dependencies are consolidated for each local region by enhancing informative features. This yields more discriminative local patterns. Note that the activations act as feature weights and adaptively recalibrate point features for different local regions. This avoids the necessity of a canonical partition and increases the robustness to point cloud rotation.
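The recalibration can be sketched in NumPy as follows. The two-layer bottleneck gate, the weight shapes, and the function name `recalibrate` are assumptions made for illustration; the text only specifies a sigmoid-gated function of the max-pooled region descriptor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recalibrate(features, num_regions, w1, w2):
    """Rescale point features with per-region, per-channel gates in (0, 1).

    Assumes points are ordered so that each region is a contiguous block.
    The two-layer gate (w1, w2) is a hypothetical choice for this sketch.
    """
    n, k = features.shape
    x = features.reshape(num_regions, n // num_regions, k)
    descriptor = x.max(axis=1)                               # region descriptor: (regions, k)
    gates = sigmoid(np.maximum(descriptor @ w1, 0.0) @ w2)   # (regions, k)
    return (x * gates[:, None, :]).reshape(n, k)             # same gate for all points of a region

rng = np.random.default_rng(0)
feats = rng.normal(size=(128, 16))          # 4 regions of 32 points, 16 channels
w1 = rng.normal(size=(16, 4)) * 0.5         # random, untrained
w2 = rng.normal(size=(4, 16)) * 0.5
out = recalibrate(feats, num_regions=4, w1=w1, w2=w2)
```

Every point in a region is scaled by the same channel weights, so the gate changes which features matter in that region without changing the number of points.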

3.2.2 Global Contextual Cues: Non-local Responses

The global contextual cue is based on non-local responses to capture a greater range of dependencies. Intuitively, a non-local operation computes the response for one position as a weighted sum over the features for all positions in the input feature maps. A generic non-local operation [29] in deep neural networks is calculated by:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (3)$$

where $i$ is the index of the output position and $j$ is the index that enumerates all possible positions. Function $f$ denotes the relationships between $x_i$ and $x_j$. Function $g$ computes a representation of the input signal at position $j$. The response is normalized by a factor $C(x)$. Here $f$ and $g$ follow the notation of [29] and are distinct from the functions in equation (1).

The k-d tree divides the input point set into different local regions. These are represented by different nodes of the tree. Larger range dependencies for different local regions at the same level are computed as non-local responses of the corresponding nodes of the k-d tree. We consider $g$ as an MLP, and the pairwise function $f$ as an embedded Gaussian function:

$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)}, \qquad (4)$$

where $\theta$ and $\phi$ are two MLPs representing two embeddings. In this paper, the relationships between different nodes at the same level should be undirected, and hence $f(x_i, x_j) = f(x_j, x_i)$. Therefore, the two embeddings are the same, i.e. $\theta = \phi$. The normalization factor is calculated by $C(x) = \sum_{\forall j} f(x_i, x_j)$. Note that this operation is different from a fully-connected layer. The non-local responses are based on the connections between different local regions, whereas fully-connected layers use learned weights.
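As a rough illustration, the non-local response between nodes of one level can be written as a softmax-weighted sum. In this sketch a single linear layer stands in for each MLP, the weights are random, and the names `non_local`, `w_embed` and `w_value` are hypothetical.

```python
import numpy as np

def non_local(region_feats, w_embed, w_value):
    """Embedded-Gaussian non-local response with one shared embedding.

    Single linear layers replace the MLPs of the text (an assumption of this sketch).
    """
    emb = region_feats @ w_embed                  # shared embedding, one row per region
    logits = emb @ emb.T                          # pairwise dot products, symmetric
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    pair = np.exp(logits)
    attn = pair / pair.sum(axis=1, keepdims=True) # normalize over all regions j
    return attn @ (region_feats @ w_value), attn  # weighted sum of region representations

rng = np.random.default_rng(0)
region_feats = rng.normal(size=(32, 128))     # one descriptor per node at this level
w_embed = rng.normal(size=(128, 32)) * 0.1    # random, untrained
w_value = rng.normal(size=(128, 128)) * 0.1
response, attn = non_local(region_feats, w_embed, w_value)
```

Using one embedding for both arguments makes the unnormalized pairwise term symmetric, which matches the undirected relationship described above; the per-row normalization is then a softmax over regions.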

Due to our input format and architecture, the receptive fields of the convolutional kernels are always 1 × 1 in the feature learning stage. Following DenseNet [30], to strengthen the information flow between layers, all layers at the same level are connected with each other (in the feature learning stage) by concatenating all corresponding point features. Such connections also lead to an implicit deep supervision which makes the network easier to train. The output of the feature learning stage has the same number of points as the input point set.

3.3 Feature Aggregation Stage

In the feature aggregation stage, the associated k-d tree structure is used to form the computational graph to progressively abstract over larger regions. For the classification task, the global signature is computed for the entire 3D model. For the semantic segmentation task, the outputs are the point labels. Instead of aggregating the information once over all points, the more discriminative features are computed in a bottom-up manner. The representation vector of a non-leaf node at a certain level is computed from its children nodes by MLPs and the symmetric function. Max pooling is used as the symmetric function.

For classification, by using this bottom-up, hierarchical approach, more discriminative global signatures are obtained. This procedure corresponds to a ConvNet in which the representation of a certain location is computed from the representations of nearby locations at the previous layers by a series of convolutions and pooling operations. Our architecture is able to progressively capture features at increasingly larger scales. Features at lower levels have smaller receptive fields whereas features at higher levels have larger receptive fields. This is due to the data-dependent partition of the k-d tree structure. Our model is invariant to the input order of the point sets because the aggregating direction is along the k-d tree structure which is invariant to input permutations.
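The bottom-up aggregation can be sketched as alternating shared layers and max pooling over consecutive groups of children. The group sizes (32, 4, 8), the 64-dimensional input features, and the function name `aggregate` are illustrative assumptions; only the layer widths 1024, 512 and 256 are taken from the classification setting of this paper.

```python
import numpy as np

def aggregate(point_feats, group_sizes, weights):
    """Bottom-up aggregation along the tree.

    At each level: one shared ReLU layer, then max pooling over `size`
    consecutive children (points are assumed to be in k-d tree order).
    """
    x = point_feats
    for size, (w, b) in zip(group_sizes, weights):
        x = np.maximum(x @ w + b, 0.0)                     # shared layer on every node
        x = x.reshape(-1, size, x.shape[1]).max(axis=1)    # merge children into the parent
    return x

rng = np.random.default_rng(0)
feats = rng.normal(size=(1024, 64))        # stand-in for learned point features
widths = [64, 1024, 512, 256]
weights = [(rng.normal(size=(i, o)) * 0.05, np.zeros(o))   # random, untrained
           for i, o in zip(widths[:-1], widths[1:])]
signature = aggregate(feats, group_sizes=(32, 4, 8), weights=weights)  # 1024 -> 32 -> 8 -> 1 nodes
```

Reordering the points inside a leaf region leaves the resulting signature unchanged, since each merge is a max over the children of a node.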

For the semantic segmentation task, the k-d tree structure is used to represent an encoder-decoder architecture with skip connections to link the related layers. The input of the feature aggregation stage is the point feature set in which the representation of each point encapsulates both local and global contextual information at different scales. The output is a semantic label for each point.

In conclusion, our architecture fully utilizes the local and global contextual cues in the feature learning stage. It calculates the representation vectors hierarchically in the feature aggregation stage. Hence, our method obtains discriminative features for points of different semantic labels for the semantic segmentation task.

3.4 Discussion

Our method is related to PointNet [24], which encodes the coordinates of each point to higher dimensional features. However, by design, this method is not able to sufficiently capture the local patterns in 3D space. More recently, PointNet++ [26] was proposed, which abstracts local patterns by selecting representative points in a metric space and recursively applies PointNet as a local feature learner to obtain features of the whole point set. The method handles the non-uniform point sampling problem. However, the set abstraction layers need to sample the point sets multiple times at different scales, which leads to relatively slow inference. Moreover, only the selected points are preserved; the others are discarded after each layer, which causes the loss of fine geometric details. Another recent work on K-d networks [25] performs linear and non-linear transformations and shares the transformation parameters corresponding to the splitting directions of each node in the k-d tree. This method needs to calculate the representation vectors for all the nodes of the associated k-d tree. For each node at a certain level, the input is the representation vectors of its two child nodes. The method heavily depends on the splitting direction of each node to train different multiplicative transformations at each level. Hence, the method is not invariant to rotation. Furthermore, point cloud sampling and k-d tree fitting during every iteration lead to slow training and inference.

3.5 Implementation Details

Our 3DContextNet model deals with point clouds of a fixed size $N = 2^D$, where $D$ is the depth of the corresponding balanced k-d tree. Point clouds of different sizes can be converted to the same size using sub- or oversampling. In our experiments, not all the levels of the k-d tree are used. For simplicity and efficiency reasons, the number of levels used is 3 for both the feature learning and aggregation stage. The receptive fields (number of points) for each level in the feature learning stage are 32 - 64 - 128 for the classification tasks and 32 - 128 - 512 for the segmentation tasks.
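A small sketch of the size handling, assuming the fixed size is a power of two that matches the depth of the balanced tree and that resampling is uniformly random (the text does not specify the sampling scheme). The function name `resample` is hypothetical.

```python
import numpy as np

def resample(points, depth, rng):
    """Sub- or oversample a cloud to exactly 2**depth points (uniform sampling assumed)."""
    target = 2 ** depth
    replace = len(points) < target        # draw with replacement only when oversampling
    idx = rng.choice(len(points), size=target, replace=replace)
    return points[idx]

rng = np.random.default_rng(0)
small = resample(rng.random((700, 3)), depth=10, rng=rng)    # oversampled to 1024
large = resample(rng.random((5000, 3)), depth=10, rng=rng)   # subsampled to 1024

# With 1024 points and receptive fields 32 - 64 - 128, the three levels
# contain 1024 // 32, 1024 // 64 and 1024 // 128 nodes respectively.
nodes_per_level = [1024 // r for r in (32, 64, 128)]
```

For the classification setting this gives 32, 16 and 8 regions at the three levels.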

In the feature learning stage, the sizes of the MLPs are (64, 64, 128, 128) - (64, 64, 256, 256) - (64, 64, 512, 512) for the three levels, respectively. Dense connections are applied within each level. In the feature aggregation stage, the MLPs and pooling operations are used recursively to progressively abstract the discriminative representations. For the classification task, the sizes of the MLPs are (1024) - (512) - (256), respectively. For the segmentation task, like the hourglass shape, the sizes of the MLPs are (1024) - (512) - (256) - (256) - (512) - (1024), respectively. The output is then processed by two fully-connected layers with size 256. Dropout is applied after each fully-connected layer with a dropout ratio of .

4 Experiments

In this section, we evaluate our 3DContextNet on different 3D point cloud datasets. First, it is shown that our model significantly outperforms state-of-the-art methods for the task of semantic segmentation on the Stanford Large-Scale 3D Indoor Spaces Dataset [12]. Then, it is shown that our model provides competitive results for the task of 3D object classification on the ModelNet40 dataset [10] and the task of 3D object part segmentation on the ShapeNet part dataset [11].

4.1 3D Semantic Segmentation of Scenes

| Method | mean IoU (%) | overall accuracy (%) | avg. class accuracy (%) |
|---|---|---|---|
| Baseline [24] | 20.1 | 53.2 | - |
| PointNet [24] | 47.6 | 78.5 | 66.2 |
| MS + CU(2) [31] | 47.8 | 79.2 | 59.7 |
| G + RCU [31] | 49.7 | 81.1 | 66.4 |
| PointNet++ [26] | 53.2 | 83.0 | 70.5 |
| Ours | 55.6 | 84.9 | 74.5 |

Table 1: 3D semantic segmentation results on the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS). Our method outperforms previous state-of-the-art methods by a large margin

Our network is evaluated on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [12, 32] for 3D semantic segmentation. This dataset contains 6 large-scale indoor areas, and each point is labeled with one of 13 semantic categories: five types of furniture (board, bookcase, chair, sofa and table), seven building elements (ceiling, beam, door, wall, window, column and floor), and clutter. We follow the same setting as in [24] and use 6-fold cross validation over all the areas.

Our method is compared with the baseline of PointNet [24] and the recently introduced MS+CU and G+RCU models [31]. We also produce the results of PointNet++ [26] for this dataset. During training, we use the same pre-processing as in [24]. We first split rooms into blocks of 1 m × 1 m and represent each point by a 9-dimensional vector containing the coordinates ($x$, $y$, $z$), the color information and the normalized position ($x'$, $y'$, $z'$). The baseline extracts the same 9-dimensional local features and three additional ones: local point density, local curvature and normal. A standard MLP is used as classifier. PointNet [24] computes the global point cloud signature and feeds it back to the per-point features. In this way, each point representation incorporates both local and global information. Recent work by [31] proposes two models that enlarge the receptive field over the 3D scene. The motivation is to incorporate both the input-level context and the output-level context. MS+CU represents the multi-scale input block with a consolidation unit model, while G+RCU stands for the grid-blocks in combination with a recurrent consolidation block model. PointNet++ [26] exploits metric space distances to build a hierarchical grouping of points and abstracts the features progressively. Results are shown in Table 1. A significance test is conducted between our results and the state-of-the-art results obtained by PointNet++ [26]. The p-value equals 0.0122 in favor of our method.
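For reference, the three metrics reported in Table 1 can be computed from a confusion matrix. The sketch below uses the common definitions of mean IoU, overall accuracy and average class accuracy; the text does not spell out the exact evaluation protocol, so the averaging choices (for example skipping classes absent from the ground truth) and the function name `segmentation_metrics` are assumptions.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """Return (mean IoU, overall accuracy, average class accuracy)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (target, pred), 1)            # rows: ground truth, columns: prediction
    tp = np.diag(conf).astype(float)
    gt_count = conf.sum(axis=1)
    union = conf.sum(axis=0) + gt_count - tp      # predicted + ground truth - intersection
    present = gt_count > 0                        # skip classes absent from the ground truth
    iou = tp[present] / union[present]
    class_acc = tp[present] / gt_count[present]
    return iou.mean(), tp.sum() / conf.sum(), class_acc.mean()

target = np.array([0, 0, 1, 1, 2, 2])             # toy labels, 3 classes
pred = np.array([0, 1, 1, 1, 2, 0])
miou, overall_acc, avg_class_acc = segmentation_metrics(pred, target, num_classes=3)
```

In this toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean IoU is 0.5, while 4 of 6 points are labeled correctly.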

We also compare the mean IoU for each semantic class with XYZ-RGB and with only XYZ as input, see Table 2 and Table 3 respectively. We obtain state-of-the-art results in mean IoU and for most of the individual classes for both XYZ-RGB and XYZ input. Note that MS+CU [31] obtains the best performance for the category Floor because of the extra pre-processing step that extends each block to the room height. In this way, their method explicitly includes floor information. The reason for obtaining only comparable results to PointNet++ [26] on furniture is that the k-d tree structure is computed along the axes. Therefore, it may be inefficient for precise prediction near the splitting boundaries, especially for relatively small objects. Our model using only geometry information (i.e. XYZ) achieves better results than the original PointNet method using both geometry and color/appearance information (48.6 versus 47.6 mean IoU).

| Method | mean IoU | Ceiling | Floor | Wall | Beam | Column | Window | Door | Table | Chair | Sofa | Bookcase | Board | Clutter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [24] | 47.6 | 88.0 | 88.7 | 69.3 | 42.4 | 23.1 | 47.5 | 51.6 | 54.1 | 42.0 | 9.6 | 38.2 | 29.4 | 35.2 |
| MS + CU(2) [31] | 47.8 | 88.6 | 95.8 | 67.3 | 36.9 | 24.9 | 48.6 | 52.3 | 51.9 | 45.1 | 10.6 | 36.8 | 24.7 | 37.5 |
| G + RCU [31] | 49.7 | 90.3 | 92.1 | 67.9 | 44.7 | 24.2 | 52.3 | 51.2 | 58.1 | 47.4 | 6.9 | 39.0 | 30.0 | 41.9 |
| PointNet++ [26] | 53.2 | 90.2 | 91.7 | 73.1 | 42.7 | 21.2 | 49.7 | 42.3 | 62.7 | 59.0 | 19.6 | 45.8 | 48.2 | 45.6 |
| Ours | 55.6 | 92.6 | 93.1 | 73.9 | 52.9 | 35.0 | 55.8 | 57.5 | 62.9 | 49.0 | 22.0 | 42.8 | 39.8 | 45.8 |

Table 2: IoU per semantic class for the S3DIS dataset with XYZ-RGB as input. It can be derived that our method obtains the state-of-the-art results in mean IoU and for most of the individual classes

| Method | mean IoU | Ceiling | Floor | Wall | Beam | Column | Window | Door | Table | Chair | Sofa | Bookcase | Board | Clutter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [24] | 40.0 | 84.0 | 87.2 | 57.9 | 37.0 | 19.6 | 29.3 | 35.3 | 51.6 | 42.4 | 11.6 | 26.4 | 12.5 | 25.5 |
| MS + CU(2) [31] | 43.0 | 86.5 | 94.9 | 58.8 | 37.7 | 25.6 | 28.8 | 36.7 | 47.2 | 46.1 | 18.7 | 30.0 | 16.8 | 31.2 |
| PointNet++ [26] | 47.0 | 88.0 | 92.4 | 64.7 | 37.7 | 16.8 | 31.0 | 41.1 | 59.6 | 52.0 | 29.4 | 42.2 | 19.2 | 36.9 |
| Ours | 48.6 | 90.5 | 92.8 | 63.6 | 49.4 | 31.2 | 44.2 | 37.8 | 59.6 | 50.6 | 17.7 | 38.7 | 17.3 | 37.9 |

Table 3: IoU per semantic class for the S3DIS dataset using only XYZ input features (no color/appearance). It is shown that our method provides comparable results in mean IoU and for all individual classes even without color/appearance information

Figure 4: Qualitative results for 3D indoor semantic segmentation. Results for the S3DIS dataset with XYZ-RGB as input. From left to right: the input point cloud, the results of PointNet, our results, and the ground truth semantic labels. Our model obtains more consistent and less noisy predictions

A number of qualitative results are presented in Figure 4 for the 3D indoor semantic segmentation task. It can be derived that our method provides more precise predictions for local structures. This shows that our model exploits both local and global contextual cues to learn discriminative features and achieve proper semantic segmentation. Our model size is less than 160 MB and the average inference time is less than 70 ms per block, which makes our method suitable for large-scale point cloud analysis.

4.2 3D Object Classification and Part Segmentation

| Method | Input | Accuracy (%) |
|---|---|---|
| DeepPano [15] | image | 77.6 |
| MVCNN [14] | image | 90.1 |
| MVCNN-MultiRes [16] | image | 91.4 |
| 3DShapeNets [10] | voxel | 77 |
| VoxNet [17] | voxel | 83 |
| Subvolume [16] | voxel | 89.2 |
| PointNet (vanilla) [24] | point cloud | 87.2 |
| PointNet [24] | point cloud | 89.2 |
| K-d network [25] | point cloud | 90.6 |
| PointNet++ [26] | point cloud | 90.7 |
| PointNet++ (with normal) [26] | point cloud | 91.9 |
| Ours | point cloud | 90.2 |
| Ours (with normal) | point cloud | 91.1 |

Table 4: 3D object classification results on ModelNet40. Our model outperforms PointNet and is comparable to PointNet++

| | PointNet [24] | PointNet++ (SSG) [26] | PointNet++ (MSG) [26] | PointNet++ (MRG) [26] | 3DContextNet |
|---|---|---|---|---|---|
| Model size (MB) | 40 | 8.7 | 12 | 24 | 56.8 |
| Forward time (ms) | 25.3 | 82.4 | 163.2 | 87.0 | 45.9 |

Table 5: Comparison of the model sizes and the inference time for the classification task. Our model is faster than PointNet++ while keeping comparable classification performance

We evaluate our method on the ModelNet40 shape classification benchmark [10]. This dataset contains a collection of 3D CAD models for 40 categories. We use the official split consisting of 9843 examples for training and 2468 for testing. Using the same experimental setting as in [24], we convert the CAD models to point sets by uniformly sampling points (1024 in our case) over the mesh faces. Then, these points are normalized to have zero mean and to fit in a unit sphere. For data augmentation during training, we also randomly rotate the point sets along the up-axis and jitter the coordinates of each point by Gaussian noise.
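The normalization and augmentation steps can be sketched as follows. The noise scale `sigma`, the index of the rotation axis `up_axis`, and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def normalize_unit_sphere(points):
    """Center the cloud at the origin and scale it to fit in the unit sphere."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def augment(points, rng, sigma=0.01, up_axis=1):
    """Random rotation about one axis plus Gaussian jitter.

    sigma and up_axis are hypothetical values chosen for this sketch.
    """
    angle = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(angle), np.sin(angle)
    a, b = [i for i in range(3) if i != up_axis]   # the two axes that get mixed
    rot = np.eye(3)
    rot[a, a], rot[a, b], rot[b, a], rot[b, b] = c, -s, s, c
    return points @ rot.T + rng.normal(scale=sigma, size=points.shape)

rng = np.random.default_rng(0)
cloud = normalize_unit_sphere(rng.random((1024, 3)))   # toy stand-in for sampled mesh points
augmented = augment(cloud, rng)
```

The rotation leaves the coordinate along `up_axis` and all point norms unchanged; only the jitter perturbs the shape.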

It can be derived from Table 4 that our model outperforms PointNet [24] and has competitive performance compared to PointNet++ [26]. However, our method is faster at inference. Table 5 summarizes the comparison of model size and forward time between PointNet, PointNet++ and our proposed method. We measure forward time with a batch size of 8 using TensorFlow 1.1. PointNet has the best time efficiency (25.3 ms), but our model (45.9 ms) is faster than all PointNet++ variants (82.4 to 163.2 ms) while keeping comparable classification performance.

We also evaluate our method on the ShapeNet part dataset [11]. This dataset contains 16881 CAD models from 16 categories. Each category is annotated with 2 to 6 parts. There are 50 different parts annotated in total. We use the official split for training and testing. In this dataset, both the number of shapes and the parts within the categories are highly imbalanced. Therefore, many previous methods train their network on every category separately. Our network is trained across categories.

We compare our model with two traditional learning-based techniques, Wu [33] and Yi [34], the volumetric deep learning baseline (3DCNN) in PointNet [24], as well as the state-of-the-art approaches SSCNN [35] and PointNet++ [26], see Table 6. The point intersection over union for each category as well as the mean IoU are reported. In comparison to PointNet, our approach performs better on most of the categories, which indicates the importance of local and global contextual information. See Figure 5 for a number of qualitative results for the 3D object part segmentation task.

| Method | mean | airplane | bag | cap | car | chair | earphone | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skateboard | table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #shapes | | 2690 | 76 | 55 | 898 | 3758 | 69 | 787 | 392 | 1547 | 451 | 202 | 184 | 283 | 66 | 152 | 5271 |
| Wu [33] | - | 63.2 | - | - | - | 73.5 | - | - | - | 74.4 | - | - | - | - | - | - | 74.8 |
| K-d Networks [25] | 77.2 | 79.9 | 71.2 | 80.9 | 68.8 | 88.0 | 72.4 | 88.9 | 86.4 | 79.8 | 94.9 | 55.8 | 86.5 | 79.3 | 50.4 | 71.1 | 80.2 |
| 3DCNN [24] | 79.4 | 75.1 | 72.8 | 73.3 | 70.0 | 87.2 | 63.5 | 88.4 | 79.6 | 74.4 | 93.9 | 58.7 | 91.8 | 76.4 | 51.2 | 65.3 | 77.1 |
| Yi [34] | 81.4 | 81.0 | 78.4 | 77.7 | 75.7 | 87.6 | 61.9 | 92.0 | 85.4 | 82.5 | 95.7 | 70.6 | 91.9 | 85.9 | 53.1 | 69.8 | 75.3 |
| PointNet [24] | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| SSCNN [35] | 84.7 | 81.6 | 81.7 | 81.9 | 75.2 | 90.2 | 74.9 | 93.0 | 86.1 | 84.7 | 95.6 | 66.7 | 92.7 | 81.6 | 60.6 | 82.9 | 82.1 |
| PointNet++ [26] | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6 |
| Ours | 84.3 | 83.3 | 78.0 | 84.2 | 77.2 | 90.1 | 73.1 | 91.6 | 85.9 | 81.4 | 95.4 | 69.1 | 92.3 | 81.7 | 60.8 | 71.8 | 81.4 |

Table 6: 3D object part segmentation results (IoU per category) on the ShapeNet part dataset

Figure 5: Qualitative results for the 3D object part segmentation task. For each group from left to right: the prediction and the ground truth

5 Conclusion

In this paper, we proposed a deep learning architecture that exploits the local and global contextual cues imposed by the implicit space partition of the k-d tree for feature learning, and calculates the representation vectors progressively along the associated k-d tree for feature aggregation. Large-scale experiments showed that our model outperformed existing state-of-the-art methods on the semantic segmentation task. Further, the model obtained comparable results for 3D object classification and 3D part segmentation.

In the future, other hierarchical 3D space partition structures can be studied as the underlying structure for the deep net computation and the non-uniform point sampling issue needs to be taken into consideration.