Over the past few years, ConvNets have achieved excellent performance in different computer vision tasks such as image classification [1, 2, 3], object detection [4, 5, 6] and semantic segmentation [4, 7, 8, 9].
3D imaging technology has also made major progress. As large-scale datasets are crucial for supervised 3D deep learning, large-scale 3D datasets such as ModelNet and ShapeNet, and real 3D scene datasets such as the Stanford Large-Scale 3D Indoor Spaces Dataset and ScanNet, have recently been made publicly available. To perform weight sharing and hierarchical learning, ConvNets need highly regular input data. Therefore, most traditional methods convert the irregular 3D data to regular formats like 2D projection images [14, 15, 16] or 3D voxel grids [10, 16, 17] before they are used by ConvNets.
Methods that employ 2D image projections of 3D models as their input are well suited to 2D ConvNet architectures. However, the intrinsic 3D geometrical information is distorted by the 3D-to-2D projection. Hence, this type of method is limited in its exploitation of 3D spatial connections between regions. While it seems straightforward to extend 2D CNNs to 3D data processing by using 3D convolutional kernels, data sparsity and computational complexity are restrictive factors for this type of approach [10, 17, 18, 19].
To fully exploit the 3D nature of point clouds, this paper uses the k-d tree structure as the 3D data representation model (see Figure 1). Our method consists of two parts: feature learning and aggregation. It exploits both local and global contextual information and aggregates point features to obtain discriminative 3D signatures in a hierarchical manner. In the feature learning stage, local patterns are identified by the use of an adaptive feature recalibration procedure, and global patterns are calculated as non-local responses of different regions at the same level. In the feature aggregation stage, point features are merged hierarchically, following the associated k-d tree structure in a bottom-up way.
Our main contributions are as follows: (1) A novel 3D context-aware neural network is proposed for 3D point cloud feature learning by exploiting the implicit partition space of the k-d tree structure. (2) A novel method is presented to incorporate the 3D space partition structure into a CNN architecture. (3) For semantic segmentation, our method significantly outperforms the state-of-the-art on the challenging Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS).
2 Related Work
Previous work on ConvNets and volumetric models uses different rasterization strategies. Wu et al. propose 3DShapeNets using 3D binary voxel grids as input of a Convolutional Deep Belief Network. This is the first work to use deep ConvNets for 3D data processing. VoxNet proposes a 3D ConvNet architecture to integrate the 3D volumetric occupancy grid. ORION exploits the 3D orientation to improve the results of voxel nets for 3D object recognition. Based on the ResNet architecture, Voxception-ResNet (VRN) builds a very deep architecture. However, volumetric models are limited by their resolution, data sparsity, and the computational cost of 3D convolutions.
Other methods rely on 2D projection images to represent the original 3D data and then apply 2D ConvNets to classify them. MVCNN uses 2D rendered images of 3D shapes to learn representations of multiple views of a 3D model and then combines them to compute a compact descriptor. DeepPano converts each 3D shape to a panoramic view and uses 2D ConvNets to build classifiers directly from these panoramas. With well-designed ConvNets, this type of method (2D projections from 3D) performs well in different shape classification and retrieval tasks. However, due to the 3D-to-2D projection, these methods are limited in exploring the full 3D nature of the data.
[22, 23] exploit ConvNets to process non-Euclidean geometries. Geodesic Convolutional Neural Networks (GCNN) apply linear and non-linear transformations to polar coordinates in a local geodesic system. However, these methods are limited to manifold meshes.
Only recently have a number of methods been proposed that apply deep learning directly to the raw 3D data. PointNet is pioneering work that processes 3D point sets. Because every point is treated equally, this approach fails to retain the full 3D information. Another recent work, Kd-Networks, uses a 3D indexing structure to guide the computation. The method employs parameter sharing and calculates representations from the leaf nodes to the root. However, this method needs to sample the point clouds and construct k-d trees in every iteration. Further, the method employs multiple k-d trees to represent a single object. It is split-direction-dependent and is negatively influenced by a change in rotation (3D object classification) and viewpoint (3D scene semantic segmentation). The modified version of PointNet, PointNet++, abstracts local patterns by sampling representative points and recursively applies PointNet as a learning component to obtain the final representation. However, it directly discards the unselected points after each layer, and needs to sample points recursively at different scales, which may yield relatively slow inference speed.
In contrast to previous methods, our model is based on a hierarchical feature learning and aggregation pipeline. Our neural network exploits the local and global contextual cues which are inferred from the implicit space partition of the k-d tree. In this way, our model learns features and calculates the representation vectors progressively using the associated k-d tree. Figure 2 compares related methods with our work for the classification task.
In this section, we describe our architecture, 3DContextNet, see Figure 3. First, we motivate the type of tree structure used to subdivide the 3D space. Then, we discuss the feature learning stage, which uses both local and global contextual cues to encode the point features. Finally, we describe our feature aggregation stage, which computes representation vectors progressively from the k-d trees.
3.1 K-d Tree Structure: Implicit 3D Space Partition
Our method is designed to capture both the local and global context by learning and aggregating point features progressively and hierarchically. Therefore, a representation model is required that partitions 3D point clouds and encapsulates the latent relations between regions. To this end, the k-d tree structure is chosen.
A k-d tree is a space partitioning structure which is constructed by recursively computing axis-aligned hyperplanes to divide point sets. In this paper, we use the standard k-d tree construction to obtain balanced k-d trees from the 3D input point clouds/sets. The latent region subdivisions of the constructed k-d tree are used to capture the local and global contextual information of point sets. Each node at a certain level represents a local region at the same scale, whereas nodes at different levels represent subdivisions at corresponding scales. The splitting direction and position from the k-d tree construction are not used. In this way, our method is more robust to jittering and rotation than the k-d network, which trains different affine transformations depending on the splitting directions of the nodes.
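As a rough illustration of the balanced construction described above, the following NumPy sketch returns a permutation that lays the points out in leaf order, so that every node at a given level corresponds to a contiguous, equally sized block. The function name `balanced_kd_order` and the rule for choosing the split axis (largest coordinate range) are assumptions made for this sketch; the text only states that a standard balanced k-d tree is used.

```python
import numpy as np

def balanced_kd_order(points):
    """Return an index permutation that puts points in k-d tree leaf order.

    Assumes len(points) is a power of two. The split-axis rule (largest
    coordinate range) is an illustrative choice, not taken from the text.
    """
    def split(idx):
        if len(idx) == 1:
            return idx
        sub = points[idx]
        axis = np.argmax(sub.max(axis=0) - sub.min(axis=0))
        # Sort along the split axis and cut at the median: two equal halves.
        order = idx[np.argsort(sub[:, axis], kind="stable")]
        half = len(order) // 2
        return np.concatenate([split(order[:half]), split(order[half:])])

    return split(np.arange(len(points)))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
perm = balanced_kd_order(cloud)
# In leaf order, the nodes of one level are contiguous blocks of equal size.
regions_32 = cloud[perm].reshape(-1, 32, 3)  # 32 local regions of 32 points
```

Only the grouping of points into blocks is kept here; the split axis and split position are discarded after ordering, in line with the statement that they are not used by the network.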
The k-d tree structure can be used to search for k-nearest neighbors for each point to determine the local point adjacency and neighbor connectivity. Our approach uses the implicit local partitioning obtained by the k-d tree structure to determine the point adjacency and neighbor connectivity.
In general, conventional ConvNets learn and merge nearby features at the same time, thereby enlarging the receptive fields of the network. Because the k-d tree structure partitions the points into non-overlapping regions, learning and merging at the same time would reduce the number of remaining points too quickly in our method. This may lead to a lack of fine geometrical cues, which are factored out during the early merging stages. Therefore, our approach divides the network architecture into two parts: feature learning and aggregation.
3.2 Feature Learning Stage
Given as input is a 3D point set with the corresponding k-d tree. The tree leaves contain the individual (raw) 3D points with their representation vectors, denoted by $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$. For example, $d = 3$ denotes the initial vectors containing the 3D point coordinates. Features are directly learned from the raw point cloud without any pre-processing step. According to Zaheer et al., a function $f(X)$ is permutation invariant to the elements in $X$ if and only if it can be decomposed in the form $\rho\left(\sum_{x \in X} \phi(x)\right)$, for suitable transformations $\rho$ and $\phi$.
We follow PointNet, where a point set is mapped to a discriminative vector as follows:
$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n)), \qquad (1)$$
where $h$ is a function applied to each point independently, and $g$ is a symmetric function.
In the feature learning stage, point features are computed at different levels hierarchically. For a certain level, we first process each point using shared multi-layer perceptron networks (MLP) as function $h$ in equation (1). Then, different local region representations are computed by a symmetric function (max pooling in our work) for the nodes at this same level, as function $g$ in equation (1).
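The decomposition in equation (1) can be sketched in a few lines of NumPy: a shared per-point MLP followed by channel-wise max pooling. The layer widths (64, 128) and the random weights are placeholders chosen for this sketch, not the trained network; the point is only that the result does not depend on the order of the input points.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 128)), np.zeros(128)

def shared_mlp(x):
    """Per-point function: the same two-layer ReLU MLP applied to every point."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return np.maximum(hidden @ W2 + b2, 0.0)

def set_signature(points):
    """Symmetric function: channel-wise max pooling over the point axis."""
    return shared_mlp(points).max(axis=0)

points = rng.normal(size=(32, 3))
shuffled = points[rng.permutation(32)]
sig_a, sig_b = set_signature(points), set_signature(shuffled)
# sig_a and sig_b agree up to floating-point rounding.
```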
3.2.1 Local Contextual Cues: Adaptive Feature Recalibration
To model the inter-dependencies between point features of the same region, we use the local region representations obtained from the symmetric function to perform adaptive feature recalibration. All operations are adaptive to each local region, represented by a certain node in the k-d tree. The local region representation obtained by the symmetric function can be interpreted as a feature descriptor for this local region. A gating function is used with a sigmoid activation to capture the feature-wise dependencies. Point features in this local region are then rescaled by the activations to obtain the adaptive recalibrated output:
$$Y = \sigma(g(X)) \cdot X, \qquad (2)$$
where $\sigma$ denotes the sigmoid activation and $g$ is the symmetric function to obtain the local region representation. $X = \{x_1, \dots, x_m\}$ is the point feature set of the local region and $m$ is the number of points in the region. In this way, feature dependencies are consolidated for each local region by enhancing informative features. This yields more discriminative local patterns. Note that the activations act as feature weights and adaptively recalibrate point features for different local regions. This avoids the necessity of a canonical partition and increases the robustness to point cloud rotation.
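A minimal NumPy sketch of this recalibration step is given below. It assumes that the point features are stored in k-d tree leaf order, so that each consecutive block of `region_size` rows is one local region, and it applies the sigmoid directly to the max-pooled descriptor. Whether an additional learned layer sits between the descriptor and the sigmoid is not stated in the text, so none is included; `recalibrate` is a hypothetical helper name.

```python
import numpy as np

def recalibrate(features, region_size):
    """Rescale point features by sigmoid gates from their region descriptor.

    features: (N, C) array in k-d tree leaf order, so each consecutive block
    of `region_size` rows is one local region (one node of the tree).
    """
    n, c = features.shape
    regions = features.reshape(n // region_size, region_size, c)
    descriptor = regions.max(axis=1, keepdims=True)  # symmetric function
    gate = 1.0 / (1.0 + np.exp(-descriptor))         # sigmoid, values in (0, 1)
    return (gate * regions).reshape(n, c)            # broadcast within a region

rng = np.random.default_rng(0)
feats = rng.normal(size=(128, 16))
out = recalibrate(feats, region_size=32)
```

Every point in a region is scaled by the same per-channel gate, so the output keeps the number of points and channels of the input.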
3.2.2 Global Contextual Cues: Non-local Responses
The global contextual cue is based on non-local responses, which capture a greater range of dependencies. Intuitively, a non-local operation computes the response for one position as a weighted sum over the features for all positions in the input feature maps. A generic non-local operation in deep neural networks is calculated by:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (3)$$
where $i$ is the index of the output position and $j$ is the index that enumerates all possible positions. Function $f$ denotes the relationships between $i$ and $j$. Function $g$ computes a representation of the input signal at position $j$. The response is normalized by a factor $C(x)$.
The k-d tree divides the input point set into different local regions. These are represented by different nodes of the tree. Larger range dependencies for different local regions at the same level are computed as non-local responses of the corresponding nodes of the k-d tree. We consider $g$ as an MLP, and the pairwise function $f$ as an embedded Gaussian function:
$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}, \qquad (4)$$
where $\theta$ and $\phi$ are two MLPs representing two embeddings. In this paper, the relationships between different nodes at the same level should be undirected, and hence $f(x_i, x_j) = f(x_j, x_i)$. Therefore, the two embeddings are the same, i.e. $\theta = \phi$. The normalization factor is calculated by $C(x) = \sum_{\forall j} f(x_i, x_j)$. Note that this operation is different from a fully-connected layer. The non-local responses are based on the connections between different local regions, whereas fully-connected layers use learned weights.
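The non-local response with a shared embedding can be sketched as follows. Single linear maps stand in for the MLPs, and the matrix shapes and the name `non_local` are assumptions of this sketch. Normalizing the exponentiated pairwise scores by their row sum is the same computation as a row-wise softmax.

```python
import numpy as np

def non_local(x, w_embed, w_value):
    """Non-local response over region descriptors x of shape (R, C).

    Single linear maps stand in for the MLPs (an illustrative simplification).
    """
    emb = x @ w_embed                             # one embedding for both roles
    val = x @ w_value                             # representation of each region
    logits = emb @ emb.T                          # symmetric pairwise scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    f = np.exp(logits)
    weights = f / f.sum(axis=1, keepdims=True)    # divide by the row sum of f
    return weights @ val, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                      # 8 regions at one tree level
y, w = non_local(x, 0.1 * rng.normal(size=(16, 4)), rng.normal(size=(16, 16)))
```

Each output row is a weighted sum over all regions at the same level, with weights that depend on the region descriptors themselves rather than on fixed positions.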
Due to our input format and architecture, the receptive fields of the convolutional kernels are always $1 \times 1$ in the feature learning stage. Following DenseNet, to strengthen the information flow between layers, all layers at the same level are connected with each other (in the feature learning stage) by concatenating all corresponding point features together. Such connections also lead to an implicit deep supervision, which makes the network easier to train. The output of the feature learning stage has the same number of points as the input point set.
3.3 Feature Aggregation Stage
In the feature aggregation stage, the associated k-d tree structure is used to form the computational graph to progressively abstract over larger regions. For the classification task, the global signature is computed for the entire 3D model. For the semantic segmentation task, the outputs are the point labels. Instead of aggregating the information once over all points, the more discriminative features are computed in a bottom-up manner. The representation vector of a non-leaf node at a certain level is computed from its children nodes by MLPs and the symmetric function. Max pooling is used as the symmetric function.
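The bottom-up aggregation can be sketched as repeated "shared layer, then max pooling over sibling nodes" steps on features stored in k-d tree leaf order. The group sizes, layer widths and the helper name `aggregate` below are hypothetical and chosen only so that 1024 points collapse to a single root vector; a single linear layer with ReLU stands in for each MLP.

```python
import numpy as np

def aggregate(features, group_sizes, weights):
    """Bottom-up aggregation along a k-d tree stored in leaf order.

    At each step a shared linear layer + ReLU (standing in for the MLP) is
    applied to every node, then max pooling merges `group` sibling nodes.
    """
    x = features
    for group, w in zip(group_sizes, weights):
        x = np.maximum(x @ w, 0.0)
        x = x.reshape(-1, group, x.shape[1]).max(axis=1)
    return x

rng = np.random.default_rng(0)
feats = rng.normal(size=(1024, 64))
sizes = [32, 4, 8]  # 1024 points -> 32 nodes -> 8 nodes -> 1 root
ws = [rng.normal(size=(64, 128)),
      rng.normal(size=(128, 128)),
      rng.normal(size=(128, 256))]
signature = aggregate(feats, sizes, ws)  # one global vector for the shape
```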
For classification, by using this bottom-up, hierarchical approach, more discriminative global signatures are obtained. This procedure corresponds to a ConvNet in which the representation of a certain location is computed from the representations of nearby locations at the previous layers by a series of convolutions and pooling operations. Our architecture is able to progressively capture features at increasingly larger scales. Features at lower levels have smaller receptive fields whereas features at higher levels have larger receptive fields. This is due to the data-dependent partition of the k-d tree structure. Our model is invariant to the input order of the point sets because the aggregating direction is along the k-d tree structure which is invariant to input permutations.
For the semantic segmentation task, the k-d tree structure is used to represent an encoder-decoder architecture with skip connections to link the related layers. The input of the feature aggregation stage is the point feature set in which the representation of each point encapsulates both local and global contextual information at different scales. The output is a semantic label for each point.
In conclusion, our architecture fully utilizes the local and global contextual cues in the feature learning stage. It calculates the representation vectors hierarchically in the feature aggregation stage. Hence, our method obtains discriminative features for points of different semantic labels for the semantic segmentation task.
Our method is related to PointNet, which encodes the coordinates of each point to higher dimensional features. However, by its design, this method is not able to sufficiently capture the local patterns in 3D space. More recently, PointNet++ has been proposed, which abstracts local patterns by selecting representative points in a metric space and recursively applies PointNet as a local feature learner to obtain features of the whole point set. The method also handles the non-uniform point sampling problem. However, the set abstraction layers need to sample the point sets multiple times at different scales, which leads to a relatively slow inference speed. Moreover, only the selected points are preserved; the others are directly discarded after each layer, which causes the loss of fine geometric details. Another recent work on K-d networks performs linear and non-linear transformations and shares the transformation parameters corresponding to the splitting directions of each node in the k-d tree. This method needs to calculate the representation vectors for all the nodes of the associated k-d tree. For each node at a certain level, the input is the representation vectors of the two previous nodes. The method heavily depends on the splitting direction of each node to train different multiplicative transformations at each level. Hence, the method is not invariant to rotation. Furthermore, point cloud sampling and k-d tree fitting during every iteration lead to slow training and inference speed.
3.5 Implementation Details
Our 3DContextNet model deals with point clouds of a fixed size $N = 2^D$, where $D$ is the depth of the corresponding balanced k-d tree. Point clouds of different sizes can be converted to the same size using sub- or oversampling. In our experiments, not all the levels of the k-d tree are used. For simplicity and efficiency reasons, the number of levels used is 3 for both the feature learning and aggregation stage. The receptive fields (number of points) for each level in the feature learning stage are 32 - 64 - 128 for the classification tasks and 32 - 128 - 512 for the segmentation tasks.
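The conversion to a fixed size can be sketched as below, assuming the fixed size is a power of two (which is what a balanced binary k-d tree of a given depth implies) and assuming uniform random sampling. The function name `resample_to_power_of_two` and the sampling strategy are illustrative choices, not details from the text.

```python
import numpy as np

def resample_to_power_of_two(points, depth, rng):
    """Return exactly 2**depth points by random sub- or oversampling."""
    target = 2 ** depth
    n = len(points)
    # Subsample without replacement; oversample by repeating points.
    idx = rng.choice(n, size=target, replace=n < target)
    return points[idx]

rng = np.random.default_rng(0)
small = resample_to_power_of_two(rng.normal(size=(700, 3)), depth=10, rng=rng)
large = resample_to_power_of_two(rng.normal(size=(5000, 3)), depth=10, rng=rng)
```

With `depth=10`, both a 700-point and a 5000-point cloud end up with 1024 points, which can then be grouped into the per-level receptive fields listed above.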
In the feature learning stage, the sizes of the MLPs are (64, 64, 128, 128) - (64, 64, 256, 256) - (64, 64, 512, 512) for the three levels, respectively. Dense connections are applied within each level. In the feature aggregation stage, the MLPs and pooling operations are used recursively to progressively abstract the discriminative representations. For the classification task, the sizes of the MLPs are (1024) - (512) - (256), respectively. For the segmentation task, like the hourglass shape, the sizes of the MLPs are (1024) - (512) - (256) - (256) - (512) - (1024), respectively. The output is then processed by two fully-connected layers with size 256. Dropout is applied after each fully-connected layer with a dropout ratio of .
In this section, we evaluate our 3DContextNet on different 3D point cloud datasets. First, it is shown that our model significantly outperforms state-of-the-art methods for the task of semantic segmentation on the Stanford Large-Scale 3D Indoor Spaces Dataset . Then, it is shown that our model provides competitive results for the task of 3D object classification on the ModelNet40 dataset  and the task of 3D object part segmentation on the ShapeNet part dataset .
4.1 3D Semantic Segmentation of Scenes
[Table 1. Columns: mean IoU, overall accuracy, avg. class accuracy. Rows include MS + CU(2) and G + RCU.]
Our network is evaluated on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [12, 32] for 3D semantic segmentation. This dataset contains 6 large scale indoor areas and each point is labeled with one of the 13 semantic categories, including 5 types of furniture (board, bookcase, chair, sofa and table) and seven building elements (ceiling, beam, door, wall, window, column and floor) plus clutter. We follow the same setting as in  and use a 6-fold cross validation over all the areas.
Our method is compared with the baseline by PointNet and the recently introduced MS+CU and G+RCU models. We also produce the results of PointNet++ for this dataset. During training, we use the same pre-processing as in PointNet. We first split rooms into blocks of 1 m $\times$ 1 m and represent each point by a 9-dimensional vector containing the coordinates ($x$, $y$, $z$), the color information RGB and the normalized position ($x'$, $y'$, $z'$). The baseline extracts the same 9-dimensional local features and three additional ones: local point density, local curvature and normal. A standard MLP is used as the classifier. PointNet computes the global point cloud signature and feeds it back to the per-point features. In this way, each point representation incorporates both local and global information. Recent work by Engelmann et al. proposes two models that enlarge the receptive field over the 3D scene. The motivation is to incorporate both the input-level context and the output-level context. MS+CU denotes the multi-scale input block with a consolidation unit model, while G+RCU stands for the grid-blocks in combination with a recurrent consolidation block model. PointNet++ exploits metric space distances to build a hierarchical grouping of points and abstracts the features progressively. Results are shown in Table 1. A significance test is conducted between our results and the state-of-the-art results obtained by PointNet++. The p-value equals 0.0122 in favor of our method.
We also compare the mean IoU for each semantic class with XYZ-RGB and with only XYZ as input, see Table 2 and Table 3 respectively. We obtain state-of-the-art results in mean IoU and for most of the individual classes for both XYZ-RGB and XYZ input. Note that MS+CU obtains the best performance for the category Floor because of the extra pre-processing step that extends each block to the room height. In this way, their method explicitly includes floor information. The reason for obtaining only comparable results with PointNet++ for furniture is that the k-d tree structure is computed along the axes. Therefore, it may be inefficient for precise prediction near the splitting boundaries, especially for relatively small objects. Our model using only geometry information (i.e. XYZ) achieves better results than the original PointNet method using both geometry and color/appearance information.
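For reference, the per-class and mean IoU metrics used in these comparisons can be computed as in the following sketch. This is the generic point-wise definition; the exact evaluation protocol (for example, how classes absent from a block are handled, here skipped via NaN) is an assumption and may differ from the one used for the reported numbers.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Point-wise intersection over union for each class label."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                 # classes absent from both stay NaN
            ious[c] = inter / union
    return ious

pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
ious = per_class_iou(pred, target, num_classes=3)   # 1/3, 2/3, 1/2
mean_iou = np.nanmean(ious)                         # 0.5
```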
[Tables 2 and 3. Rows include MS + CU(2) and G + RCU.]
A number of qualitative results are presented in Figure 4 for the 3D indoor semantic segmentation task. It can be observed that our method provides more precise predictions for local structures. This indicates that our model exploits both local and global contextual cues to learn discriminative features for semantic segmentation. Our model size is less than 160 MB and the average inference time is less than 70 ms per block, which makes our method suitable for large scale point cloud analysis.
4.2 3D Object Classification and Part Segmentation
|Method||Input||Accuracy (%)|
|PointNet (vanilla)||point cloud||87.2|
|PointNet||point cloud||89.2|
|K-d network||point cloud||90.6|
|PointNet++||point cloud||90.7|
|PointNet++ (with normal)||point cloud||91.9|
|Ours (with normal)||point cloud||91.1|
| ||PointNet||PointNet++ (SSG)||PointNet++ (MSG)||PointNet++ (MRG)||3DContextNet|
|Model size (MB)||40||8.7||12||24||56.8|
|Forward time (ms)|
We evaluate our method on the ModelNet40 shape classification benchmark . This dataset contains a collection of 3D CAD models for 40 categories. We use the official split consisting of 9843 examples for training and 2468 for testing. Using the same experimental setting as in , we convert the CAD models to point sets by uniformly sampling (1024 points in our case) over the mesh faces. Then, these points are normalized to be zero mean and unit sphere. We also randomly rotate the point sets along the -axis and jitter the coordinates of each point by Gaussian noise for data augmentation during training.
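The normalization and augmentation steps can be sketched as follows. The rotation axis (taken here to be the z-axis), the jitter standard deviation `sigma` and the clipping bound `clip` are placeholder values for illustration only; they are not specified in the text above.

```python
import numpy as np

def normalize_unit_sphere(points):
    """Shift to zero mean and scale so the farthest point has norm 1."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def augment(points, rng, sigma=0.01, clip=0.05):
    """Random rotation about one axis plus clipped Gaussian jitter.

    The z-axis, sigma and clip are placeholder choices, not from the text.
    """
    a = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    jitter = np.clip(sigma * rng.normal(size=points.shape), -clip, clip)
    return points @ rot.T + jitter

rng = np.random.default_rng(0)
cloud = normalize_unit_sphere(rng.normal(size=(1024, 3)))
augmented = augment(cloud, rng)
```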
Table 5 summarizes the comparison of time and space complexity between PointNet, PointNet++ and our proposed method. We measure forward time with a batch size of 8 using TensorFlow 1.1. PointNet has the best time efficiency, but our model is faster than PointNet++ while keeping comparable classification performance.
We also evaluate our method on the ShapeNet part dataset . This dataset contains 16881 CAD models from 16 categories. Each category is annotated with 2 to 6 parts. There are 50 different parts annotated in total. We use the official split for training and testing. In this dataset, both the number of shapes and the parts within the categories are highly imbalanced. Therefore, many previous methods train their network on every category separately. Our network is trained across categories.
We compare our model with two traditional learning-based techniques, Wu and Yi, the volumetric deep learning baseline (3DCNN) in PointNet, as well as the state-of-the-art approaches SSCNN and PointNet++, see Table 6. The point intersection over union for each category as well as the mean IoU are reported. In comparison to PointNet, our approach performs better on most of the categories, which demonstrates the importance of local and global contextual information. See Figure 5 for a number of qualitative results for the 3D object part segmentation task.
|K-d Networks ||77.2||79.9||71.2||80.9||68.8||88.0||72.4||88.9||86.4||79.8||94.9||55.8||86.5||79.3||50.4||71.1||80.2|
In this paper, we proposed a deep learning architecture that exploits the local and global contextual cues imposed by the implicit space partition of the k-d tree for feature learning, and calculates the representation vectors progressively along the associated k-d tree for feature aggregation. Large scale experiments showed that our model outperformed existing state-of-the-art methods for the semantic segmentation task. Further, the model obtained comparable results for 3D object classification and 3D part segmentation.
In the future, other hierarchical 3D space partition structures can be studied as the underlying structure for the deep net computation and the non-uniform point sampling issue needs to be taken into consideration.
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3) (2015) 211–252
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2014) 580–587
-  Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. (2015) 1440–1448
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431–3440
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1520–1528
-  Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1912–1920
-  Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
-  Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S.: 3d semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1534–1543
-  Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. arXiv preprint arXiv:1702.04405 (2017)
-  Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE international conference on computer vision. (2015) 945–953
-  Shi, B., Bai, S., Zhou, Z., Bai, X.: Deeppano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters 22(12) (2015) 2339–2343
-  Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view cnns for object classification on 3d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5648–5656
-  Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for real-time object recognition. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE (2015) 922–928
-  Sedaghat, N., Zolfaghari, M., Brox, T.: Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351 (2016)
-  Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236 (2016)
-  Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9) (1975) 509–517
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (2016) 770–778
-  Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
-  Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on riemannian manifolds. In: Proceedings of the IEEE international conference on computer vision workshops. (2015) 37–45
-  Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593 (2016)
-  Klokov, R., Lempitsky, V.: Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222 (2017)
-  Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. CoRR abs/1706.02413 (2017)
-  Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., Smola, A.: Deep sets. (2017)
-  Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
-  Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. (2017)
-  Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
-  Engelmann, F., Kontogianni, T., Hermans, A., Leibe, B.: Exploring spatial context for 3d semantic segmentation of point clouds
-  Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105 (2017)
-  Wu, Z., Shou, R., Wang, Y., Liu, X.: Interactive shape co-segmentation via label propagation. Computers & Graphics 38 (2014) 248–254
-  Yi, L., Kim, V.G., Ceylan, D., Shen, I., Yan, M., Su, H., Lu, A., Huang, Q., Sheffer, A., Guibas, L., et al.: A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG) 35(6) (2016) 210
-  Yi, L., Su, H., Guo, X., Guibas, L.: Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In: Computer Vision and Pattern Recognition (CVPR). (2017)