1 Introduction
Over the past few years, ConvNets have achieved excellent performance in different computer vision tasks such as image classification [1, 2, 3], object detection [4, 5, 6] and semantic segmentation [4, 7, 8, 9].
3D imaging technology has also made major progress. Since large-scale datasets are crucial for supervised 3D deep learning, large-scale 3D datasets have recently been made publicly available, such as ModelNet
[10], ShapeNet [11], and real 3D scene datasets such as the Stanford Large-Scale 3D Indoor Spaces Dataset [12] and ScanNet [13]. To perform weight sharing and hierarchical learning, ConvNets require highly regular input data. Therefore, most traditional methods convert irregular 3D data to regular formats such as 2D projection images [14, 15, 16] or 3D voxel grids [10, 16, 17] before feeding them to ConvNets. Methods that employ 2D image projections of 3D models as their input [14, 15] are well suited for 2D ConvNet architectures. However, the intrinsic 3D geometrical information is distorted by the 3D-to-2D projection. Hence, this type of method is limited in exploiting the 3D spatial connections between regions. While it seems straightforward to extend 2D ConvNets to 3D data processing by using 3D convolutional kernels, data sparsity and computational complexity are restrictive factors for this type of approach [10, 17, 18, 19].
To fully exploit the 3D nature of point clouds, the goal of this paper is to use the kd-tree structure [20] as the 3D data representation model, see Figure 1. Our method consists of two parts: feature learning and aggregation. It exploits both local and global contextual information and aggregates point features to obtain discriminative 3D signatures in a hierarchical manner. In the feature learning stage, local patterns are identified by an adaptive feature recalibration procedure, and global patterns are computed as non-local responses of different regions at the same level. In the feature aggregation stage, point features are merged hierarchically in a bottom-up way, following the associated kd-tree structure.
Our main contributions are as follows: (1) a novel 3D context-aware neural network is proposed for 3D point cloud feature learning by exploiting the implicit partition space of the kd-tree structure; (2) a novel method is presented to incorporate the 3D space partition structure into a ConvNet architecture; (3) for semantic segmentation, our method significantly outperforms the state-of-the-art on the challenging Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [12].

2 Related Work
Previous work on ConvNets and volumetric models uses different rasterization strategies. Wu et al. propose 3DShapeNets [10], which uses 3D binary voxel grids as input to a Convolutional Deep Belief Network; this is the first work to use deep ConvNets for 3D data processing. VoxNet [17] proposes a 3D ConvNet architecture that integrates a 3D volumetric occupancy grid. ORION [18] exploits 3D orientation to improve the results of voxel nets for 3D object recognition. Based on the ResNet [21] architecture, Voxception-ResNet (VRN) [19] builds a very deep architecture. However, volumetric models are limited by their resolution, data sparsity, and the computational cost of 3D convolutions.

Other methods rely on 2D projection images to represent the original 3D data and then apply 2D ConvNets to classify them. MVCNN [14] uses 2D rendered images of 3D shapes to learn representations of multiple views of a 3D model and then combines them into a compact descriptor. DeepPano [15] converts each 3D shape to a panoramic view and uses 2D ConvNets to build classifiers directly from these panoramas. With well-designed ConvNets, these methods (2D projections from 3D) perform well on different shape classification and retrieval tasks. However, due to the 3D-to-2D projection, they are limited in exploring the full 3D nature of the data. Other work [22, 23] exploits ConvNets to process non-Euclidean geometries. Geodesic Convolutional Neural Networks (GCNN) [23] apply linear and non-linear transformations to polar coordinates in a local geodesic system. However, these methods are limited to manifold meshes.
Only recently have methods been proposed that apply deep learning directly to the raw 3D data. PointNet [24] is pioneering work that processes 3D point sets. Because every point is treated equally, this approach fails to retain full 3D spatial information. Another recent approach, Kd-networks [25], uses a 3D indexing structure to guide the computation: the method employs parameter sharing and calculates representations from the leaf nodes to the root. However, it needs to sample the point clouds and construct kd-trees for every iteration, and it employs multiple kd-trees to represent a single object. It is split-direction dependent and is negatively influenced by changes in rotation (3D object classification) and viewpoint (3D scene semantic segmentation). The modified version of PointNet, PointNet++ [26], abstracts local patterns by sampling representative points and recursively applies PointNet [24] as a learning component to obtain the final representation. However, it directly discards the unselected points after each layer and needs to sample points recursively at different scales, which may yield relatively slow inference speed.
In contrast to previous methods, our model is based on a hierarchical feature learning and aggregation pipeline. Our neural network exploits the local and global contextual cues which are inferred from the implicit space partition of the kd-tree. In this way, our model learns features and calculates the representation vectors progressively along the associated kd-tree. Figure 2 shows a comparison of related methods and our work for the classification task.
3 Method
In this section, we describe our architecture, 3DContextNet, see Figure 3. First, we motivate the type of tree structure used to subdivide the 3D space. Then, the feature learning stage is discussed, which uses both local and global contextual cues to encode the point features. Finally, we describe our feature aggregation stage, which computes representation vectors progressively along the kd-tree.
3.1 Kd-Tree Structure: Implicit 3D Space Partition
Our method is designed to capture both local and global context by learning and aggregating point features progressively and hierarchically. Therefore, a representation model is required that partitions 3D point clouds and encapsulates the latent relations between regions. To this end, the kd-tree structure [20] is chosen.
A kd-tree is a space partitioning structure, constructed by recursively computing axis-aligned hyperplanes that divide a point set. In this paper, we choose the standard kd-tree construction to obtain balanced kd-trees from the 3D input point clouds/sets. The latent region subdivisions of the constructed kd-tree are used to capture the local and global contextual information of point sets. Each node at a certain level represents a local region at the same scale, whereas nodes at different levels represent subdivisions at the corresponding scales. The splitting directions and positions from the kd-tree construction are not used. In this way, our method is more robust to jittering and rotation than the Kd-network of [25], which trains different affine transformations depending on the splitting directions of the nodes.

A kd-tree can be used to search for the k-nearest neighbors of each point to determine local point adjacency and neighbor connectivity. Our approach instead uses the implicit local partitioning obtained by the kd-tree structure to determine point adjacency and neighbor connectivity.
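As an illustration, a balanced kd-tree can be built with recursive median splits; the NumPy sketch below chooses the axis of largest spread as the splitting axis (an assumed heuristic — the paper only requires a standard balanced kd-tree):

```python
import numpy as np

def build_kdtree(points):
    """Recursively split a point set with an axis-aligned hyperplane.

    The median split keeps the tree balanced; splitting along the axis
    of largest spread is an assumed heuristic.
    """
    if len(points) == 1:
        return {"point": points[0]}                      # leaf: a single 3D point
    axis = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    order = np.argsort(points[:, axis])
    mid = len(points) // 2                               # median split -> balanced tree
    return {"axis": axis,
            "left": build_kdtree(points[order[:mid]]),
            "right": build_kdtree(points[order[mid:]])}

def tree_depth(node):
    if "point" in node:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))
```

For an input of size 2^D the resulting tree has depth D, and every node at a given level implicitly represents a local region of the same number of points.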
In general, conventional ConvNets learn and merge nearby features at the same time, enlarging the receptive fields of the network. Because of the non-overlapping partition of the kd-tree structure, learning and merging at the same time would, in our method, reduce the number of remaining points too quickly. This may lead to a loss of fine geometrical cues, which would be factored out during the early merging stages. Therefore, our approach divides the network architecture into two parts: feature learning and aggregation.
3.2 Feature Learning Stage
Given as input is a 3D point set with its corresponding kd-tree. The tree leaves contain the individual (raw) 3D points with their representation vectors, denoted by $v$; at the input level, $v$ contains the raw 3D point coordinates. Features are learned directly from the raw point cloud without any preprocessing step. According to [27], a function $f$ is permutation invariant to the elements of its input $X$ if and only if it can be decomposed in the form $\rho\left(\sum_{x \in X} \phi(x)\right)$, for suitable transformations $\rho$ and $\phi$.
We follow PointNet [24], where a point set is mapped to a discriminative vector as follows:

$$f(\{x_1, \dots, x_n\}) = g(h(x_1), \dots, h(x_n)), \qquad (1)$$

where $h$ is a feature transformation applied to each point individually and $g$ is a symmetric function.
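A minimal NumPy sketch of this style of set function — a shared per-point transformation $h$ followed by a symmetric function $g$ (max pooling) — with toy weight sizes chosen purely for illustration:

```python
import numpy as np

def set_signature(points, W, b):
    """h: the same weights are applied to every point independently;
    g: channel-wise max pooling, a symmetric function."""
    h = np.maximum(points @ W + b, 0.0)   # shared single-layer "MLP" with ReLU
    return h.max(axis=0)                  # order-independent aggregation

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 8)), np.zeros(8)   # toy sizes, for illustration
pts = rng.random((16, 3))
sig = set_signature(pts, W, b)

# Permuting the input points leaves the signature unchanged.
assert np.allclose(sig, set_signature(pts[rng.permutation(16)], W, b))
```

Because the only interaction between points goes through the symmetric max, the resulting vector is invariant to the input order, as Equation (1) requires.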
In the feature learning stage, point features are computed hierarchically at different levels. For a given level, we first process each point using shared multi-layer perceptron networks (MLPs) as the function $h$ in Equation (1). Then, different local region representations are computed by a symmetric function (max pooling in our work) for the nodes at this level, as the function $g$ in Equation (1).

3.2.1 Local Contextual Cues: Adaptive Feature Recalibration
To model the interdependencies between point features of the same region, we use the local region representations obtained from the symmetric function to perform adaptive feature recalibration [28]. All operations are adaptive to each local region, represented by a certain node in the kd-tree. The local region representation obtained by the symmetric function can be interpreted as a feature descriptor for this region. A gating function with a sigmoid activation is used to capture the feature-wise dependencies. Point features in this local region are then rescaled by the resulting activations to obtain the adaptively recalibrated output:

$$\tilde{x}_i = \sigma(g(X)) \cdot x_i, \quad i = 1, \dots, m, \qquad (2)$$

where $\sigma$ denotes the sigmoid activation and $g$ is the symmetric function that yields the local region representation; $X = \{x_1, \dots, x_m\}$ is the point feature set of the local region and $m$ is the number of points in the region. In this way, feature dependencies are consolidated for each local region by enhancing informative features, which yields more discriminative local patterns. Note that the activations act as feature weights and adaptively recalibrate the point features of different local regions. This avoids the necessity of a canonical partition and increases the robustness to point cloud rotation.
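The recalibration step can be sketched as follows; feeding the pooled descriptor straight into the sigmoid gate, without the intermediate excitation MLP of [28], is a simplification:

```python
import numpy as np

def recalibrate(region_feats):
    """region_feats: (m, d) point features of one local region (one kd-tree node).
    Returns the same point features rescaled by feature-wise gates."""
    descriptor = region_feats.max(axis=0)        # g(X): symmetric region descriptor
    gate = 1.0 / (1.0 + np.exp(-descriptor))     # sigma: sigmoid activations in (0, 1)
    return region_feats * gate                   # broadcast the gate over all m points
```

Because the gate is recomputed from each region's own descriptor, the rescaling adapts to every node of the tree rather than relying on one canonical partition.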
3.2.2 Global Contextual Cues: Non-Local Responses
The global contextual cue is based on non-local responses to capture a longer range of dependencies. Intuitively, a non-local operation computes the response at one position as a weighted sum of the features at all positions in the input feature maps. A generic non-local operation [29] in deep neural networks is calculated by:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (3)$$

where $i$ is the index of the output position and $j$ is the index that enumerates all possible positions. The pairwise function $f$ computes the relationship between $i$ and $j$, and the function $g$ computes a representation of the input signal at position $j$. The response is normalized by the factor $C(x)$.
The kd-tree divides the input point set into different local regions, represented by different nodes of the tree. Longer-range dependencies between local regions at the same level are computed as non-local responses of the corresponding nodes of the kd-tree. We consider $g$ as an MLP, and the pairwise function $f$ as an embedded Gaussian:

$$f(x_i, x_j) = e^{\theta(x_i)^\top \phi(x_j)}, \qquad (4)$$

where $\theta$ and $\phi$ are two MLPs representing two embeddings. In this paper, the relationships between different nodes at the same level should be undirected, hence $f(x_i, x_j) = f(x_j, x_i)$. Therefore, the two embeddings are the same, i.e. $\theta = \phi$. The normalization factor is calculated as $C(x) = \sum_{\forall j} f(x_i, x_j)$. Note that this operation is different from a fully-connected layer: the non-local responses are based on the connections between different local regions, whereas fully-connected layers use learned weights.
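A sketch of Equations (3)–(4) over the node features of one tree level; using a single shared linear embedding for $\theta = \phi$ and the identity for $g$ keeps the example short (both are simplifications — the paper uses MLPs):

```python
import numpy as np

def nonlocal_responses(X, W):
    """X: (n, d) features of the n nodes at one kd-tree level.
    W: shared embedding weights (theta == phi, so f is symmetric)."""
    E = X @ W                                    # embedded node features
    logits = E @ E.T                             # theta(x_i)^T phi(x_j)
    logits -= logits.max(axis=1, keepdims=True)  # stabilizes exp; cancels after normalization
    f = np.exp(logits)                           # embedded Gaussian f(x_i, x_j)
    f /= f.sum(axis=1, keepdims=True)            # divide by C(x): row-wise softmax
    return f @ X                                 # weighted sum over g(x_j) = x_j
```

With this normalization, each response is a convex combination of all node features at the level, which is exactly the "weighted sum over all positions" reading of Equation (3).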
Due to our input format and architecture, the receptive fields of the convolutional kernels remain fixed in the feature learning stage. Following DenseNet [30], to strengthen the information flow between layers, all layers at the same level (in the feature learning stage) are connected with each other by concatenating the corresponding point features. Such connections also lead to an implicit deep supervision, which makes the network easier to train. The output of the feature learning stage has the same number of points as the input point set.
3.3 Feature Aggregation Stage
In the feature aggregation stage, the associated kd-tree structure is used to form the computational graph that progressively abstracts over larger regions. For the classification task, a global signature is computed for the entire 3D model; for the semantic segmentation task, the outputs are the per-point labels. Instead of aggregating information once over all points, more discriminative features are computed in a bottom-up manner: the representation vector of a non-leaf node at a certain level is computed from its children by MLPs and the symmetric function, with max pooling as the symmetric function.
For classification, this bottom-up, hierarchical approach yields more discriminative global signatures. The procedure corresponds to a ConvNet in which the representation of a certain location is computed from the representations of nearby locations at the previous layers by a series of convolutions and pooling operations. Our architecture progressively captures features at increasingly larger scales: features at lower levels have smaller receptive fields, whereas features at higher levels have larger receptive fields, due to the data-dependent partition of the kd-tree structure. Our model is invariant to the input order of the point sets because aggregation proceeds along the kd-tree structure, which is invariant to input permutations.
For the semantic segmentation task, the kd-tree structure is used to build an encoder-decoder architecture with skip connections that link the related layers. The input of the feature aggregation stage is the point feature set in which the representation of each point encapsulates both local and global contextual information at different scales; the output is a semantic label for each point.
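The bottom-up pass can be sketched over an assumed nested-dict tree layout, with a shared linear + ReLU map standing in for the MLPs:

```python
import numpy as np

def aggregate(node, W):
    """Return the representation vector of `node`, computed recursively
    from its two children with a shared linear+ReLU map and max pooling."""
    if "feat" in node:                        # leaf: a learned point feature
        return node["feat"]
    children = np.stack([aggregate(node["left"], W),
                         aggregate(node["right"], W)])
    h = np.maximum(children @ W, 0.0)         # shared "MLP" on both children
    return h.max(axis=0)                      # symmetric merge (max pooling)

# tiny example: a depth-2 tree over four 3-D point features
rng = np.random.default_rng(0)
leaf = lambda: {"feat": rng.random(3)}
tree = {"left":  {"left": leaf(), "right": leaf()},
        "right": {"left": leaf(), "right": leaf()}}
signature = aggregate(tree, rng.standard_normal((3, 3)))
```

Because the children are merged with a symmetric max, swapping the two subtrees of any node leaves the root signature unchanged, which is the source of the permutation invariance discussed above.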
In summary, our architecture fully exploits the local and global contextual cues in the feature learning stage and computes the representation vectors hierarchically in the feature aggregation stage. Hence, our method obtains discriminative features for points with different semantic labels in the semantic segmentation task.
3.4 Discussion
Our method is related to PointNet [24], which encodes the coordinates of each point into higher-dimensional features. However, by design, this method is not able to sufficiently capture local patterns in 3D space. More recently, PointNet++ [26] was proposed, which abstracts local patterns by selecting representative points in a metric space and recursively applies PointNet as a local feature learner to obtain features of the whole point set. The method handles the non-uniform point sampling problem, but its set abstraction layers need to sample the point sets multiple times at different scales, which leads to relatively slow inference; moreover, only the selected points are preserved, while the others are directly discarded after each layer, causing a loss of fine geometric details. Another recent work, Kd-networks [25], performs linear and non-linear transformations and shares the transformation parameters according to the splitting directions of each node in the kd-tree. This method needs to calculate the representation vectors for all nodes of the associated kd-tree; for each node at a certain level, the input is the representation vectors of its two children. The method heavily depends on the splitting direction of each node to train different multiplicative transformations at each level and is hence not invariant to rotation. Furthermore, point cloud sampling and kd-tree fitting during every iteration lead to slow training and inference.
3.5 Implementation Details
Our 3DContextNet model deals with point clouds of a fixed size $N = 2^D$, where $D$ is the depth of the corresponding balanced kd-tree. Point clouds of different sizes can be converted to this size using sub- or oversampling. In our experiments, not all levels of the kd-tree are used; for simplicity and efficiency, we use three levels for both the feature learning and aggregation stages. The receptive fields (numbers of points) for the three levels in the feature learning stage are 32, 64 and 128 for the classification tasks, and 32, 128 and 512 for the segmentation tasks.
In the feature learning stage, the sizes of the MLPs are (64, 64, 128, 128), (64, 64, 256, 256) and (64, 64, 512, 512) for the three levels, respectively. Dense connections are applied within each level. In the feature aggregation stage, the MLPs and pooling operations are applied recursively to progressively abstract the discriminative representations. For the classification task, the sizes of the MLPs are (1024), (512) and (256), respectively. For the segmentation task, following an hourglass shape, the sizes of the MLPs are (1024), (512), (256), (256), (512) and (1024), respectively. The output is then processed by two fully-connected layers of size 256. Dropout is applied after each fully-connected layer.
4 Experiments
In this section, we evaluate our 3DContextNet on different 3D point cloud datasets. First, we show that our model significantly outperforms state-of-the-art methods for the task of semantic segmentation on the Stanford Large-Scale 3D Indoor Spaces Dataset [12]. Then, we show that our model provides competitive results for 3D object classification on the ModelNet40 dataset [10] and for 3D object part segmentation on the ShapeNet part dataset [11].
4.1 3D Semantic Segmentation of Scenes
Table 1: 3D semantic segmentation results on the S3DIS dataset.

Method           | mean IoU | overall accuracy | avg. class accuracy
Baseline [24]    | 20.1     | 53.2             | -
PointNet [24]    | 47.6     | 78.5             | 66.2
MS + CU(2) [31]  | 47.8     | 79.2             | 59.7
G + RCU [31]     | 49.7     | 81.1             | 66.4
PointNet++ [26]  | 53.2     | 83.0             | 70.5
Ours             | 55.6     | 84.9             | 74.5
Our network is evaluated on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [12, 32] for 3D semantic segmentation. This dataset contains six large-scale indoor areas, and each point is labeled with one of 13 semantic categories: five types of furniture (board, bookcase, chair, sofa and table), seven building elements (ceiling, beam, door, wall, window, column and floor), plus clutter. We follow the same setting as in [24] and use a 6-fold cross validation over all areas.
Our method is compared with the baseline and PointNet [24], and with the recently introduced MS+CU and G+RCU models [31]. We also produce results for PointNet++ [26] on this dataset. During training, we use the same preprocessing as in [24]: we first split rooms into blocks of 1m by 1m and represent each point by a 9-dimensional vector containing the coordinates ($x$, $y$, $z$), the color information (RGB) and the position normalized with respect to the room (from 0 to 1). The baseline extracts the same 9-dim local features and three additional ones: local point density, local curvature and normal; a standard MLP is used as the classifier. PointNet [24] computes the global point cloud signature and feeds it back to the per-point features, so that each point representation incorporates both local and global information. Recent work by [31] proposes two models that enlarge the receptive field over the 3D scene; the motivation is to incorporate both input-level and output-level context. MS+CU denotes the multi-scale input block with a consolidation unit model, while G+RCU stands for grid blocks in combination with a recurrent consolidation block model. PointNet++ [26] exploits metric space distances to build a hierarchical grouping of points and abstracts features progressively. Results are shown in Table 1. A significance test between our results and the state-of-the-art results of PointNet++ [26] yields a p-value of 0.0122 in favor of our method.
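The 9-dimensional per-point input described above can be sketched as follows; the variable names are illustrative, and normalizing against the room extent follows the preprocessing of [24]:

```python
import numpy as np

def nine_dim_features(xyz, rgb, room_min, room_max):
    """xyz: (n, 3) coordinates; rgb: (n, 3) colors scaled to [0, 1].
    Returns the (n, 9) per-point input: XYZ, RGB, and the position
    normalized to [0, 1] with respect to the room extent."""
    normalized = (xyz - room_min) / (room_max - room_min)
    return np.concatenate([xyz, rgb, normalized], axis=1)
```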
We also compare the mean IoU for each semantic class using the full input and using only the geometry as input, see Table 2 and Table 3, respectively. We obtain state-of-the-art results in mean IoU and for most of the individual classes for both input types. Note that MS+CU [31] obtains the best performance for the category Floor because of an extra preprocessing step that extends each block to the room height; in this way, their method explicitly includes floor information. The reason we obtain results comparable to PointNet++ [26] for furniture is that the kd-tree structure is computed along the coordinate axes; therefore, it may be inefficient for precise prediction near the splitting boundaries, especially for relatively small objects. Our model using only geometry information achieves better results than the original PointNet using both geometry and color/appearance information.
Table 2: Per-class IoU results on S3DIS using both geometry and color information.

Method           | mean IoU | Ceiling | Floor | Wall | Beam | Column | Window | Door | Table | Chair | Sofa | Bookcase | Board | Clutter
PointNet [24]    | 47.6 | 88.0 | 88.7 | 69.3 | 42.4 | 23.1 | 47.5 | 51.6 | 54.1 | 42.0 | 9.6  | 38.2 | 29.4 | 35.2
MS + CU(2) [31]  | 47.8 | 88.6 | 95.8 | 67.3 | 36.9 | 24.9 | 48.6 | 52.3 | 51.9 | 45.1 | 10.6 | 36.8 | 24.7 | 37.5
G + RCU [31]     | 49.7 | 90.3 | 92.1 | 67.9 | 44.7 | 24.2 | 52.3 | 51.2 | 58.1 | 47.4 | 6.9  | 39.0 | 30.0 | 41.9
PointNet++ [26]  | 53.2 | 90.2 | 91.7 | 73.1 | 42.7 | 21.2 | 49.7 | 42.3 | 62.7 | 59.0 | 19.6 | 45.8 | 48.2 | 45.6
Ours             | 55.6 | 92.6 | 93.1 | 73.9 | 52.9 | 35.0 | 55.8 | 57.5 | 62.9 | 49.0 | 22.0 | 42.8 | 39.8 | 45.8
Table 3: Per-class IoU results on S3DIS using only geometry information.

Method           | mean IoU | Ceiling | Floor | Wall | Beam | Column | Window | Door | Table | Chair | Sofa | Bookcase | Board | Clutter
PointNet [24]    | 40.0 | 84.0 | 87.2 | 57.9 | 37.0 | 19.6 | 29.3 | 35.3 | 51.6 | 42.4 | 11.6 | 26.4 | 12.5 | 25.5
MS + CU(2) [31]  | 43.0 | 86.5 | 94.9 | 58.8 | 37.7 | 25.6 | 28.8 | 36.7 | 47.2 | 46.1 | 18.7 | 30.0 | 16.8 | 31.2
PointNet++ [26]  | 47.0 | 88.0 | 92.4 | 64.7 | 37.7 | 16.8 | 31.0 | 41.1 | 59.6 | 52.0 | 29.4 | 42.2 | 19.2 | 36.9
Ours             | 48.6 | 90.5 | 92.8 | 63.6 | 49.4 | 31.2 | 44.2 | 37.8 | 59.6 | 50.6 | 17.7 | 38.7 | 17.3 | 37.9
A number of qualitative results for the 3D indoor semantic segmentation task are presented in Figure 4. They show that our method provides more precise predictions for local structures, indicating that our model exploits both local and global contextual cues to learn discriminative features for proper semantic segmentation. Our model size is less than 160 MB and the average inference time is less than 70 ms per block, which makes our method suitable for large-scale point cloud analysis.
4.2 3D Object Classification and Part Segmentation
Table 4: Classification results on ModelNet40.

Method                         | Input       | Accuracy (%)
DeepPano [15]                  | image       | 77.6
MVCNN [14]                     | image       | 90.1
MVCNN-MultiRes [16]            | image       | 91.4
3DShapeNets [10]               | voxel       | 77
VoxNet [17]                    | voxel       | 83
Subvolume [16]                 | voxel       | 89.2
PointNet (vanilla) [24]        | point cloud | 87.2
PointNet [24]                  | point cloud | 89.2
Kd-network [25]                | point cloud | 90.6
PointNet++ [26]                | point cloud | 90.7
PointNet++ (with normal) [26]  | point cloud | 91.9
Ours                           | point cloud | 90.2
Ours (with normal)             | point cloud | 91.1
Table 5: Model size and inference time.

                  | PointNet [24] | PointNet++ (SSG) [26] | PointNet++ (MSG) [26] | PointNet++ (MRG) [26] | 3DContextNet
Model size (MB)   | 40            | 8.7                   | 12                    | 24                    | 56.8
Forward time (ms) | 25.3          | 82.4                  | 163.2                 | 87.0                  | 45.9
We evaluate our method on the ModelNet40 shape classification benchmark [10]. This dataset contains a collection of 3D CAD models from 40 categories. We use the official split consisting of 9843 examples for training and 2468 for testing. Using the same experimental setting as in [24], we convert the CAD models to point sets by uniformly sampling points (1024 in our case) over the mesh faces. These points are then normalized to zero mean and unit sphere. For data augmentation during training, we also randomly rotate the point sets along the up-axis and jitter the coordinates of each point with Gaussian noise.
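The augmentation described above can be sketched as follows; treating $y$ as the up-axis and the jitter scale are assumptions chosen for illustration:

```python
import numpy as np

def augment(points, rng, sigma=0.01):
    """Random rotation about the up axis plus per-coordinate Gaussian jitter."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],     # y is treated as the up axis here
                  [-s, 0.0, c]])
    return points @ R.T + rng.normal(0.0, sigma, points.shape)
```

Rotation about the up-axis leaves the model upright while varying its heading, and the jitter perturbs each coordinate independently.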
As can be derived from Table 4, our model outperforms PointNet [24] and has competitive performance compared to PointNet++, while being much faster at inference time. Table 5 summarizes the comparison of time and space requirements between PointNet, PointNet++ and our proposed method. We measure forward time with a batch size of 8 using TensorFlow 1.1. PointNet has the best time efficiency, but our model is faster than PointNet++ while keeping comparable classification performance.
We also evaluate our method on the ShapeNet part dataset [11]. This dataset contains 16881 CAD models from 16 categories. Each category is annotated with two to six parts; there are 50 different parts in total. We use the official split for training and testing. In this dataset, both the number of shapes and the number of parts within the categories are highly imbalanced; therefore, many previous methods train their networks on every category separately. Our network is trained across all categories.
We compare our model with two traditional learning-based techniques, Wu [33] and Yi [34], with the volumetric deep learning baseline (3DCNN) from PointNet [24], and with the state-of-the-art approaches SSCNN [35] and PointNet++ [26], see Table 6. The point intersection-over-union (IoU) for each category as well as the mean IoU are reported. In comparison to PointNet, our approach performs better on most categories, which demonstrates the importance of local and global contextual information. See Figure 5 for a number of qualitative results for the 3D object part segmentation task.
Table 6: Part segmentation results (IoU) on the ShapeNet part dataset.

Method           | mean | airplane | bag  | cap  | car  | chair | earphone | guitar | knife | lamp | laptop | motor | mug  | pistol | rocket | skateboard | table
#shapes          |      | 2690     | 76   | 55   | 898  | 3758  | 69       | 787    | 392   | 1547 | 451    | 202   | 184  | 283    | 66     | 152        | 5271
Wu [33]          | -    | 63.2     | -    | -    | -    | 73.5  | -        | -      | -     | 74.4 | -      | -     | -    | -      | -      | -          | 74.8
Kd-Networks [25] | 77.2 | 79.9     | 71.2 | 80.9 | 68.8 | 88.0  | 72.4     | 88.9   | 86.4  | 79.8 | 94.9   | 55.8  | 86.5 | 79.3   | 50.4   | 71.1       | 80.2
3DCNN [24]       | 79.4 | 75.1     | 72.8 | 73.3 | 70.0 | 87.2  | 63.5     | 88.4   | 79.6  | 74.4 | 93.9   | 58.7  | 91.8 | 76.4   | 51.2   | 65.3       | 77.1
Yi [34]          | 81.4 | 81.0     | 78.4 | 77.7 | 75.7 | 87.6  | 61.9     | 92.0   | 85.4  | 82.5 | 95.7   | 70.6  | 91.9 | 85.9   | 53.1   | 69.8       | 75.3
PointNet [24]    | 83.7 | 83.4     | 78.7 | 82.5 | 74.9 | 89.6  | 73.0     | 91.5   | 85.9  | 80.8 | 95.3   | 65.2  | 93.0 | 81.2   | 57.9   | 72.8       | 80.6
SSCNN [35]       | 84.7 | 81.6     | 81.7 | 81.9 | 75.2 | 90.2  | 74.9     | 93.0   | 86.1  | 84.7 | 95.6   | 66.7  | 92.7 | 81.6   | 60.6   | 82.9       | 82.1
PointNet++ [26]  | 85.1 | 82.4     | 79.0 | 87.7 | 77.3 | 90.8  | 71.8     | 91.0   | 85.9  | 83.7 | 95.3   | 71.6  | 94.1 | 81.3   | 58.7   | 76.4       | 82.6
Ours             | 84.3 | 83.3     | 78.0 | 84.2 | 77.2 | 90.1  | 73.1     | 91.6   | 85.9  | 81.4 | 95.4   | 69.1  | 92.3 | 81.7   | 60.8   | 71.8       | 81.4
5 Conclusion
In this paper, we proposed a deep learning architecture that exploits the local and global contextual cues imposed by the implicit space partition of the kd-tree for feature learning, and that calculates the representation vectors progressively along the associated kd-tree for feature aggregation. Large-scale experiments showed that our model outperforms existing state-of-the-art methods for the semantic segmentation task. Furthermore, the model obtains comparable results for 3D object classification and 3D part segmentation.
In future work, other hierarchical 3D space partition structures can be studied as the underlying structure for the deep network computation, and the non-uniform point sampling issue needs to be taken into consideration.
References
 [1] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324
 [2] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3) (2015) 211–252
 [3] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105

 [4] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2014) 580–587
 [5] Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. (2015) 1440–1448
 [6] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
 [7] Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
 [8] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431–3440
 [9] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1520–1528
 [10] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1912–1920
 [11] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
 [12] Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S.: 3d semantic parsing of largescale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1534–1543
 [13] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. arXiv preprint arXiv:1702.04405 (2017)
 [14] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE international conference on computer vision. (2015) 945–953
 [15] Shi, B., Bai, S., Zhou, Z., Bai, X.: Deeppano: Deep panoramic representation for 3d shape recognition. IEEE Signal Processing Letters 22(12) (2015) 2339–2343
 [16] Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view cnns for object classification on 3d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5648–5656
 [17] Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for real-time object recognition. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE (2015) 922–928
 [18] Sedaghat, N., Zolfaghari, M., Brox, T.: Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351 (2016)
 [19] Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236 (2016)
 [20] Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9) (1975) 509–517
 [21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (2016) 770–778
 [22] Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
 [23] Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on riemannian manifolds. In: Proceedings of the IEEE international conference on computer vision workshops. (2015) 37–45
 [24] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593 (2016)
 [25] Klokov, R., Lempitsky, V.: Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222 (2017)
 [26] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. CoRR abs/1706.02413 (2017)
 [27] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., Smola, A.: Deep sets. (2017)
 [28] Hu, J., Shen, L., Sun, G.: Squeezeandexcitation networks. arXiv preprint arXiv:1709.01507 (2017)
 [29] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. (2017)
 [30] Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
 [31] Engelmann, F., Kontogianni, T., Hermans, A., Leibe, B.: Exploring spatial context for 3d semantic segmentation of point clouds. (2017)
 [32] Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2d3dsemantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105 (2017)
 [33] Wu, Z., Shou, R., Wang, Y., Liu, X.: Interactive shape co-segmentation via label propagation. Computers & Graphics 38 (2014) 248–254
 [34] Yi, L., Kim, V.G., Ceylan, D., Shen, I., Yan, M., Su, H., Lu, A., Huang, Q., Sheffer, A., Guibas, L., et al.: A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG) 35(6) (2016) 210
 [35] Yi, L., Su, H., Guo, X., Guibas, L.: Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In: Computer Vision and Pattern Recognition (CVPR). (2017)