DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Zongyue Zhao, et al. (Purdue University) — 07/28/2019

Traditional grid/neighbor-based static pooling has become a constraint for point cloud geometry analysis. In this paper, we propose DAR-Net, a novel network architecture that focuses on dynamic feature aggregation. The central idea of DAR-Net is to generate a self-adaptive pooling skeleton that considers both scene complexity and local geometric features. Providing variable semi-local receptive fields and weights, the skeleton serves as a bridge that connects local convolutional feature extractors and a global recurrent feature integrator. Experimental results on indoor scene datasets show the advantages of the proposed approach over state-of-the-art architectures that adopt static pooling methods.


1 Introduction

For the task of 3D geometry understanding, neural networks that directly take point clouds as input have shown advantages over voxel- and multi-view-based networks that simulate 2D scenarios [20, 4, 7, 29, 26, 21]. The trailblazer, PointNet, addressed the lack of a correspondence graph by using multi-layer perceptrons and a single global pooling layer, neither of which relies on local dependencies [4]. The two-scale network performed well on object analysis. However, its deficiencies in a) local correspondence identification, b) intermediate feature aggregation, and c) valid global information integration led to poor performance on large-scale scene segmentation.

Analyzing the drawbacks of PointNet, several papers addressed the local deficiency by constructing mapping indices for convolutional neural networks (CNNs) [25, 3, 17, 24]. On the other hand, works that focused on the global integration problem drew inspiration from natural language processing and turned to deep recurrent neural networks (RNNs) [27, 11, 31].

While various works contributed to both the micro and macro ends of the scale spectrum, what lay in between received less attention. Feature aggregation between local neighborhoods and the global representation, where present at all, remained static and independent of the geometric context [3, 11, 18, 22, 27, 25, 31]. For example, Tangent Convolutions [25] used rectangular grids with uniform resolution for local mean pooling. RSNet [11] evenly divided the entire scene into slices and max-pooled within each slice. 3P-RNN [31], despite introducing variable receptive field sizes at the local scale, fell back on voxelization when feeding the global recurrent network. These rigid pooling methods make little use of the information density distribution within the point cloud, leading to computational inefficiency and poor segmentation results on less-dominant classes: shapes with rich geometric features that occur less frequently are not detected effectively.

Figure 1: Segmentation results on S3DIS dataset. a) Input point cloud; b) Validation: color grey indicates successful prediction; c) Ground truth; d) Prediction from DAR-Net.

We present a novel approach to intermediate feature aggregation that addresses the deficiencies of traditional static pooling layers. The key concept is a pooling skeleton whose a) size corresponds to the individual scene's scale and complexity; b) nodes each link a variable set of points representing a meaningful spatial discreteness; and c) nodes are weighted against their index sets to further exploit the information distribution and to provide robustness even when the node-discreteness correlation fails. Such a skeleton is learned without supervision prior to the training process.

We construct a network, DAR-Net, that incorporates the dynamic aggregation operation with convolutional feature extraction and global integration, while handling permutation invariance at multiple scales. The network is trained and tested on two large-scale indoor scene datasets [7, 2] and shows advantages over state-of-the-art architectures with static pooling methods. A sample of semantic segmentation results on the S3DIS dataset [2] is shown in Figure 1.

2 Related work

Recent contributions relevant to our work can be roughly divided into three categories: convolutional feature extraction, global integration, and unsupervised pre-processing. For completeness of context, traditional 3D analysis networks that do not operate on point clouds are introduced first.

Prior to point clouds. Although convolutional neural networks (CNNs) had achieved great success in analyzing 2D images, they cannot be directly applied to point clouds because of their unorganized nature. Without a pixel-style neighborhood, vanilla CNNs cannot extract local information and gradually expand receptive field sizes in a meaningful manner. Thus, segmentation tasks were first performed in ways that simulate 2D scenarios, by fusing partial views represented as RGB-D images [1, 21, 19, 8]. Other works transform point clouds into cost-inefficient voxel representations on which CNNs can be directly applied [20, 10, 7].

Although these methods benefited from mature 2D image-processing network structures, inefficient 3D data representations prevented them from performing well on scene segmentation, where large, dense 3D scenes must be handled as a whole. Recent research has therefore gradually turned to networks that operate directly on point clouds for semantic segmentation of complex indoor/outdoor scenes [31, 25, 15].

Local feature extraction. As introduced, PointNet used multi-layer perceptrons (which process each point independently) to accommodate the unordered nature of point clouds [4]. Similar approaches using convolutional kernels [16], radius querying [22], or nearest neighbor searching [13] were also adopted. Because local dependencies were not effectively modeled, overfitting constantly occurred when these networks were used for large-scale scene segmentation. In addition, work like R-Conv [27] tried to avoid time-consuming neighbor searching by applying a global recurrent transformation prior to convolutional analysis. However, scalability problems remained, as the global RNN cannot directly operate on the point cloud of an entire dense scene, which often contains several million points.

Tangent Convolution [25] proposed a way to efficiently model local dependencies and align convolutional filters at different scales, based on local covariance analysis and down-sampled neighborhood reconstruction with raw data points. Although tangent convolution itself functions well for extracting local features, their network architecture was limited by static, uniform intermediate feature aggregation and a complete lack of global integration.

Global integration. Several works turned to the global scale for permutation robustness. The simplest form, global max pooling, only suffices for lightweight tasks such as object classification or part segmentation [4]. RNNs constructed with advanced cells such as Long Short-Term Memory [9] or Gated Recurrent Units [5] offered promising results on scene segmentation [15], even in architectures without significant consideration for local feature extraction [11, 18]. However, in those cases the global RNNs were built deep, bidirectional, or densely packed with hidden units, placing a strict limitation on the direct input. As a result, the original point cloud was often down-sampled to an extreme extent, or the network was only capable of operating on sections of the original point cloud [11].

Unsupervised learning. Various works in this area aim to promote existing supervised networks into auto-encoders. For example, FoldingNet [30] learns global features of a 3D object by deforming a 2D grid surface; PointWise [24] considers the theoretical smoothness of object surfaces; and MortonNet [28] learns compact local features by generating fractal space-filling curves and predicting their endpoints. Although the features provided by these auto-encoders are reported to be beneficial, we do not adopt them in our network, to allow a fair evaluation of the aggregation method we propose.

Different from the common goal of finding a rich, concise feature embedding, SO-Net [16] learns a self-organizing map (SOM) without supervision for feature extraction and aggregation. Despite its novelty, few performance improvements were observed, even compared to PointNet or OctNet [23]. Possible reasons include the way the SOM is used and a lack of deep local and global analysis.

SO-Net used the SOM to expand the scale of data for local feature extraction and conducted most of its operations in the expanded feature space. As a result, its architecture could only handle object analysis tasks, where each point cloud contains merely a few thousand points, so the benefit of carefully arranging tens of pooling nodes is limited. In addition, SO-Net adopted a global max pooling layer for global integration and then directly concatenated the global representation vector, the aggregated node features, and the pointwise local features through expansion; the feature decoder therefore lacks hierarchical understanding. Finally, the fixed-size SOMs constructed by SO-Net only adapt to local spatial distributions, which fails to fit datasets in which individual scene scale and complexity vary significantly. In short, we argue that SOMs and similar self-adaptive maps are best used merely as pooling skeletons that dynamically reduce the scale of the feature carrier when analyzing large-scale point clouds; that map nodes should be assigned appropriate weights to provide finer differentiation and robustness; and that, once features are aggregated onto the skeleton nodes, a thorough, deep integration process should be conducted prior to decoding.

3 Dynamic aggregation

This section describes how pointwise local features are transformed into rich semi-local representations, and how information stored in the semi-local space is propagated back to the point cloud for semantic analysis. Two components need to be properly designed: first, a pooling skeleton (the intermediate information carrier) that adapts to the global and local geometric structure of the input point cloud; second, mappings that carry the skeleton feature space from and to the point-cloud feature space.

For clarity, in the following sections the point cloud is referred to as $P$ (containing $N$ points $p_1, \dots, p_N$) and its pooling skeleton as $S$ (containing $M$ nodes $s_1, \dots, s_M$).

3.1 Skeleton formation

For the task of indoor scene segmentation, the scale of each individual scene varies significantly. In the S3DIS dataset [2], for example, the most complex scene contains more than 100 times as many points as the least complex one. Therefore, the size of the skeleton, indicated by the number of nodes $M$, should not be static as in works applied to object analysis [16]. Ablation studies (Sec. 6.3) demonstrate that an empirical logarithmic relationship (Figure 5) between $M$ and the point count $N$ adapts to scene complexity better than a stationary $M$ or a stationary average receptive field size.

We use a Kohonen network [14, 16] to implement the dynamic skeleton. Contrary to the initialization methods suggested by [6, 16], we conduct random initialization prior to normalization, which provides a reasonable initial guess with respect to the substantial spacing along different axes. Such a method offers extra robustness for scene understanding, where individual inputs often have one dimension disproportionate to the others (long hallways, large conference rooms). An example of the skeleton is shown in Figure 2.
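To make the construction concrete, the sketch below fits a minimal one-dimensional Kohonen map to a scene, with random initialization drawn from the raw (un-normalized) bounding box as described above. The iteration count, learning-rate and neighborhood schedules, and the per-iteration subsample size are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def fit_skeleton(points, num_nodes, iters=30, lr0=0.5, seed=0):
    """Fit a minimal 1-D Kohonen map (SOM) of num_nodes nodes to a point
    cloud (N, 3). Hyper-parameters here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # Random initialization *before* normalization, so the initial guess
    # reflects the true (possibly disproportionate) extent of each axis.
    lo, hi = points.min(axis=0), points.max(axis=0)
    nodes = rng.uniform(lo, hi, size=(num_nodes, 3))
    grid = np.arange(num_nodes)              # node positions on the map
    sigma0 = num_nodes / 4.0                 # initial neighborhood width
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)         # decaying learning rate
        sigma = max(sigma0 * (1.0 - t / iters), 1.0)
        for p in points[rng.permutation(len(points))[:2048]]:
            best = np.argmin(((nodes - p) ** 2).sum(axis=1))    # winner node
            h = np.exp(-((grid - best) ** 2) / (2 * sigma**2))  # neighborhood
            nodes += lr * h[:, None] * (p - nodes)              # pull toward p
    return nodes
```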

Figure 2: Dynamic aggregation skeleton (red) and point cloud (yellow). (Floor, ceiling, and front walls are removed for clarity.) Note that each chair is assigned at least one skeleton node, while uniform structures like walls receive a lower node density. Two nodes (circled in blue) fail to attach to the point cloud; this problem is addressed in Sec. 3.2.

3.2 Feature aggregation

Consider an arbitrary point cloud $P$ and its corresponding skeleton $S$; dynamic aggregation maps the pointwise feature space $F_P$ into the node-wise feature space $F_S$. Introducing a correspondence indicator $I$ that regulates node receptive field sizes and a possible global intervention factor $g$, the general expression of dynamic aggregation is

$f_{s_j} = A\left(\{\, f_{p_i} \mid i \in I_j \,\};\; g\right), \quad j = 1, \dots, M$  (1)

This general expression contains two components that await instantiation: the dependency-searching function that constructs the indices $I_j$, and the pooling function $A$ that handles an arbitrary set of inputs.

Indexing. Each point $p_i$ is first linked to its $K$ nearest neighbor nodes. As all points are iterated through this process, a global index matrix $G \in \mathbb{Z}^{N \times K}$ is formed, in which each element $G_{ik}$ is the index of the $k$-th nearest node of point $p_i$. The dynamic aggregation indices $I_j$ are then generated by traversing the node space: every point indexing node $s_j$ is categorized into $I_j$, i.e., $I_j = \{\, i \mid \exists k : G_{ik} = j \,\}$.
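A minimal sketch of this indexing step, assuming SciPy's cKDTree for the nearest-node search (the paper does not specify its search implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_indices(points, nodes, K=3):
    """Link each point to its K nearest skeleton nodes (matrix G), then
    invert the relation to obtain each node's index set I_j. K = 3
    follows the best setting reported in Table 3."""
    _, G = cKDTree(nodes).query(points, k=K)   # G: (N, K) node indices
    G = G.reshape(len(points), -1)             # guard against K == 1
    I = [[] for _ in range(len(nodes))]
    for i, row in enumerate(G):                # traverse to invert G -> I_j
        for j in row:
            I[j].append(i)
    return G, I
```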

Aggregating function. A semi-average pooling method is used to further exploit information density and to address skeleton construction failures. Although skeleton nodes are already arranged more compactly where geometric features vary more (Sec. 3.1), each node in those areas still indexes relatively more neighbor points, as rich geometric features usually fold more points into a unit of space (at the scale of the skeleton receptive field). Assigning a larger weight to nodes correlated with more points therefore becomes advantageous. Moreover, when a skeleton node fails to represent any geometric structure (as shown in Figure 2), traditional average or max pooling cannot identify the situation and passes irrelevant features forward.

The semi-mean pooling function, shown in eq. 2, is implemented with a global descriptor $\bar{k} = NK/M$ that indicates the average receptive field size, i.e., the average number of neighbor points a node indexes. Compared with mean pooling, it scales each node's feature by $|I_j|/\bar{k}$, so well-populated nodes weigh more and detached nodes vanish:

$f_{s_j} = \frac{1}{\bar{k}} \sum_{i \in I_j} f_{p_i} = \frac{|I_j|}{\bar{k}} \cdot \frac{1}{|I_j|} \sum_{i \in I_j} f_{p_i}$  (2)
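Under these definitions, semi-mean pooling reduces to a scatter-add followed by a global division; the sketch below assumes the (N, K) index matrix G from the indexing step:

```python
import numpy as np

def semi_mean_pool(point_feats, G, num_nodes):
    """Semi-mean pooling (eq. 2): scatter-add each point's (N, C) features
    to its K neighbor nodes, then divide by the *average* receptive field
    size k_bar rather than each node's own count, so nodes indexing more
    points weigh proportionally more and detached nodes stay zero."""
    N, K = G.shape
    k_bar = N * K / num_nodes                    # global descriptor k_bar
    node_feats = np.zeros((num_nodes, point_feats.shape[1]))
    np.add.at(node_feats, G.reshape(-1),         # one add per point-node link
              np.repeat(point_feats, K, axis=0))
    return node_feats / k_bar
```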

3.3 Feature propagation

As the neighborhood search is conducted on the node space, the global index matrix $G$ can be directly used to unpool node-wise features $f'_{s}$ back to the point cloud, as if pooling from $S$ to $P$; the max unpooling variant is

$\tilde{f}_{p_i} = \max_{k = 1, \dots, K} f'_{s_{G_{ik}}}$  (3)

Note that the redundant expanded space (of size $N \times K$) is only used implicitly through indices throughout the pooling-unpooling process. Hence, the dynamic aggregation approach we propose is compatible with large-scale dense point clouds.
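The propagation step is then a pure gather over the same index matrix; in the sketch below, `mode` switches between the max unpooling of eq. 3 and the average variant compared later (Sec. 6.2):

```python
def unpool(node_feats, G, mode="max"):
    """Propagate node-wise (M, C) features back to the points (eq. 3)
    using the index matrix G; assumes numpy arrays throughout."""
    gathered = node_feats[G]                 # (N, K, C) neighbor features
    return gathered.max(axis=1) if mode == "max" else gathered.mean(axis=1)
```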

4 Global integration

Global integration aims to model the long-range dependencies in a point cloud, which can be described as a mapping from one node-wise feature space to another: $F_S \rightarrow F'_S$.

We use a GRU-based RNN [5] for permutation robustness over the unordered nodes. The features on the entire skeleton, $F_S$, are treated as a single-batch sequence of length $M$: $(f_{s_1}, \dots, f_{s_M})$. In addition, as $M$ varies from scene to scene, all input sequences are padded to the same length $M_{\max}$ before being fed into the recurrent network. As a result, the output features on each node are relevant to the input information from all nodes, creating a maximized overlapping receptive field:

$(f'_{s_1}, \dots, f'_{s_M}) = \mathrm{GRU}(f_{s_1}, \dots, f_{s_M})$  (4)
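A PyTorch sketch of this stage, using the configuration reported in Sec. 6.1 (single-directional, single-layer GRU with 256 hidden units and 128 output channels); packing the padded sequences is our assumption about how the padded tail is kept out of the recurrence:

```python
import torch
import torch.nn as nn

class GlobalIntegrator(nn.Module):
    """Global integration over skeleton nodes (eq. 4), as a sketch."""
    def __init__(self, in_ch=128, hidden=256, out_ch=128):
        super().__init__()
        self.gru = nn.GRU(in_ch, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, out_ch)

    def forward(self, node_feats, lengths):
        # node_feats: (B, M_max, in_ch); lengths: true node count per scene
        packed = nn.utils.rnn.pack_padded_sequence(
            node_feats, lengths.cpu(), batch_first=True, enforce_sorted=False)
        out, _ = self.gru(packed)
        out, _ = nn.utils.rnn.pad_packed_sequence(
            out, batch_first=True, total_length=node_feats.size(1))
        return self.proj(out)                # (B, M_max, out_ch)
```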

5 Architecture

We design a convolutional-recurrent network that coordinates short- and long-range spatial dependencies through the dynamic aggregation operation, as shown in Figure 3.

For pre-processing, we first estimate the scene complexity to determine an appropriate skeleton size for each scene. The skeleton is then clustered without supervision, giving a rough, preliminary understanding of the semi-local structure distribution.

We adopt tangent convolutions [25] for local feature extraction. The encoded pointwise local features are dynamically aggregated to the skeleton as an intermediate scale of information abstraction. Node-wise features, each corresponding to an independent semi-local receptive field, are then treated with a global RNN, which implicitly learns long-range knowledge. The globally integrated information is propagated back to the point cloud, concatenated with the local features, and hierarchically decoded. Finally, pointwise convolutions generate the semantic predictions.
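Putting the pieces together, the forward pass can be sketched as follows; the tangent-convolution encoder/decoder [25] and the global GRU are treated as black-box callables, and `semi_mean_pool` / `unpool` are the sketches from Sec. 3:

```python
import numpy as np

def dar_net_forward(local_feats, num_nodes, G, integrate, decode):
    """End-to-end flow of DAR-Net (Figure 3), as a sketch: local_feats
    holds the encoder's (N, 128) pointwise features, G the point-to-node
    index matrix; integrate/decode are assumed callables."""
    node_feats = semi_mean_pool(local_feats, G, num_nodes)     # eq. 2: P -> S
    node_feats = integrate(node_feats)                         # Sec. 4: global GRU
    back_feats = unpool(node_feats, G, mode="max")             # eq. 3: S -> P
    fused = np.concatenate([local_feats, back_feats], axis=1)  # skip concat
    return decode(fused)                                       # pointwise labels
```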

Figure 3: Schematic diagram of DAR-Net. Solid blocks indicate features: black/gray for pointwise features on $P$; green for node-wise features on $S$. Color information is also used as input, since geometric structure alone is not sufficient for identifying several classes (Sec. 6.1).

6 Experiments

Method mIoU mA ceiling floor wall beam column window door chair table sofa bookcase board clutter
PointNet [4] 41.1 49.0 88.8 97.3 69.8 0.1 3.9 49.3 10.8 58.9 52.6 5.8 40.3 26.3 33.2
SEGCloud [26] 48.9 57.4 90.1 96.1 69.9 0.0 18.4 38.4 23.1 75.9 70.4 58.4 40.9 13.0 41.6
T-Conv [25] 52.8 62.2 - - - - - - - - - - - - -
RSNet [11] 51.9 59.4 93.3 98.3 79.2 0.0 15.8 45.4 50.1 65.5 67.9 52.5 22.5 41.0 43.6
Ours 58.8 68.4 93.4 97.0 75.3 0.0 26.6 47.2 50.8 68.2 81.3 61.7 63.4 53.7 46.0
Table 1: Results on S3DIS Dataset (mIoU, mA, and class-wise IoU, %)

In this section, we first report implementation details and evaluation criteria not mentioned above, then present the best segmentation results DAR-Net generates. Further experiments are discussed in the ablation study (Sec. 6.3).

6.1 Datasets and details

The performance of dynamic aggregation and DAR-Net is evaluated on the task of indoor scene segmentation. Two commonly used large-scale datasets, Stanford Large-Scale 3D Indoor Spaces (S3DIS) [2] and ScanNet [7], are adopted for the experiments.

The S3DIS dataset includes more than 200 dense indoor scenes gathered from three different buildings; each scene contains up to more than nine million points, labeled in 13 classes. For this dataset, we use the widely adopted Area-5 train-test split [31, 11, 26, 4].

The ScanNet dataset contains over 1,500 indoor scans labeled in 20 classes. We use the standard train-test split provided by the authors [7].

Implementation details. We introduce multiple levels of feature carriers besides the most compact space. For clarity, a point cloud down-sampled to a resolution of $r$ cm is denoted $P_r$.

For computational purposes, we use coordinates and color information on a down-sampled cloud as raw inputs. The coordinates are used to generate the dynamic pooling skeleton and to conduct covariance analysis for estimating normals and reconstructing local neighborhoods. The input channels of the feature extractor thus comprise the distance to the tangential plane, the z-coordinate, the estimated normals, and RGB information, all batch-normalized following [25]. The feature extractor encodes the input into a rich 128-channel representation [25], which is then aggregated to the skeleton space $S$, concise enough for the global integration network to handle.

For best performance (Sec. 6.3), the RNN is single-directional and single-layered with 256 hidden units. Its 128 output channels are propagated back to the point cloud and fed into the convolutional decoders.

All reported results are based on the original point clouds. As the network only outputs segmentation on the down-sampled space, a nearest-neighbor search between the original and down-sampled clouds is conducted to extrapolate predictions.

The only data augmentation we adopt is rotation about the z-axis, which reduces invalid information from normal vector directions.

We use individual rooms as training batches. Following the suggestions of [25], we pad each room to a uniform batch size throughout the network for computational purposes; padded data stays out of indexing and has no effect.

All supervised sections are trained as a whole, using the cross-entropy loss function and the Adam optimizer [12].

Measures. For quantitative reasoning, we report the mean intersection-over-union over classes (mIoU), class-wise IoU, and mean accuracy over classes (mA). We do not use overall accuracy, as it fails to measure actual performance on scene segmentation, where several classes (floor, ceiling, etc.) dominate in size yet are easy to identify. In addition, all results are calculated over the entire dataset; i.e., if a certain class does not occur in a certain scene, we do NOT pad its accuracy to obtain misleadingly better results.
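For reference, a sketch of these criteria computed from a dataset-wide confusion matrix (classes absent from the entire dataset are excluded rather than padded):

```python
import numpy as np

def miou_and_ma(conf):
    """Mean IoU and mean class accuracy from a (C, C) confusion matrix,
    rows = ground truth, columns = prediction, accumulated dataset-wide."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1)                     # points per ground-truth class
    pred = conf.sum(axis=0)                   # points per predicted class
    iou = tp / np.maximum(gt + pred - tp, 1)  # per-class intersection/union
    acc = tp / np.maximum(gt, 1)              # per-class accuracy
    valid = gt > 0                            # ignore classes absent overall
    return iou[valid].mean(), acc[valid].mean()
```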

Method mIoU mA Wall Floor Chair Table Desk Bed Bookshelf Sofa Sink
PointNet [4] 14.7 19.9 69.4 88.6 35.9 32.8 2.6 18.0 3.2 32.8 0.0
PointNet++ [22] 34.3 43.8 77.5 92.5 64.6 46.6 12.7 51.3 52.9 52.3 30.2
T-Conv [25] 40.9 55.1 - - - - - - - - -
RSNet [11] 39.4 48.4 79.2 94.1 65.0 51.0 34.5 56.0 53.0 55.4 34.8
Ours 42.3 55.8 75.6 93.3 69.3 44.2 29.5 57.4 54.8 63.0 37.4
Method Bathtub Toilet Curtain Counter Door Window Shower-Curtain Refrigerator Picture Cabinet Other-Furniture
PointNet [4] 0.2 0.0 0.0 5.1 0.0 0.0 0.0 0.0 0.0 5.0 0.1
PointNet++ [22] 42.7 31.4 33.0 20.0 2.0 3.6 27.4 18.5 0.0 23.8 2.2
T-Conv [25] - - - - - - - - - - -
RSNet [11] 49.4 54.2 6.8 22.7 3.0 8.8 29.9 37.9 1.0 31.3 19.0
Ours 61.0 64.5 19.2 23.9 17.0 19.7 27.7 34.1 6.2 29.6 19.1
Table 2: Results on ScanNet Dataset (mIoU, mA, and class-wise IoU, %)
Figure 4: Semantic segmentation results on S3DIS dataset. (a) Color; (b) Tangent Conv [25]; (c) Ours; (d) Ground truth.

6.2 Main results

All results in this section use similar skeleton settings: receptive field indicator $K = 3$ (the best value in Table 3), a skeleton size following the logarithmic schedule (Sec. 6.3), and the aggregation method of eq. 2. The only difference lies in the feature propagation process: the S3DIS experiments use the max unpooling of eq. 3, while the ScanNet experiments rely on average unpooling.

Segmentation results on the two indoor scene datasets are shown in Tables 1 and 2, respectively. We compare our results against both commonly used benchmarks [4, 26, 22] and state-of-the-art networks focused on local feature extraction [25] or global recurrent analysis [11]. The results support the advantage of using dynamic aggregation to coordinate local and global analysis as a whole.

For the S3DIS dataset, the class-wise IoU results demonstrate that our network achieves better predictions on classes that are spatially discrete ([table], [door]) or carry rich, compact geometric features ([bookcase], [clutter]), matching the theoretical benefits of a self-adaptive pooling skeleton. These improvements come at a minor cost on classes with more uniform geometric structure (lower information density), such as [floor] and [wall].

The failure on class [beam] follows naturally from the train-test split, as beams in the training set (Areas 1-4 and 6) show a different pattern from those in the test set (Area 5). We do not use six-fold cross-validation [31] to address this, as the difference between training and test data appropriately models real-world applications.

For the ScanNet dataset, the class-wise IoU gives similar indications. Discrete objects ([sofa], [bathtub], [toilet], [door]) are detected better, whereas structures containing many points yet little information ([wall], [floor]) are partially omitted. We argue that this trade-off is desirable, especially for real-world applications like robotic vision or automatic object retrieval.

Sample segmentation results on the S3DIS dataset are shown in Figure 4. As T-Conv [25] did not report class-wise IoU, their results are visualized here as a benchmark for thorough comparison. As discussed above, our network performs particularly well at detecting complicated geometric features and spatial discreteness.

6.3 Ablation study

In this section, we report results obtained by adjusting the dynamic aggregation approach. Unless otherwise specified, all experiments are conducted on the S3DIS dataset.

Node receptive field. The local receptive field size, although varying among nodes, can be generally indicated by its average value $\bar{k} = NK/M$. As the neighbor count $K$ serves as a linear coefficient in $\bar{k}$, its effect is evaluated first, as shown in Table 3.

K mIoU mA
1 54.8 63.2
2 56.1 65.3
3 58.8 68.4
4 56.2 64.2
5 55.8 63.5
7 55.1 62.9
Table 3: Receptive Field Size: Effect of the Linear Coefficient K

The global integration network demands a limited skeleton size for computational reasons. Fixing the node count $M$ is clearly not desirable: as the point count $N$ varies across two orders of magnitude in indoor scene datasets [2], keeping $M$ unchanged leads to a harmful variance in the average receptive field size $\bar{k}$. However, rigidly setting a uniform $\bar{k}$ for all scenes is also undesirable, as it fails to take structural complexity into account. An office room and a long hallway may contain the same number of points, but the former naturally requires more detailed inspection (Figure 4).

Without introducing hand-crafted histogram descriptors, a reasonable solution is to adopt a greedy approach: always assign more nodes (more detailed inspection) than a linear relationship would suggest. Experimental studies show that an approximate logarithmic function serves this purpose best. Different selections of receptive field size are shown in Figure 5, the corresponding segmentation results are reported in Table 4, and a sketch of the candidate schedules follows the table.

Figure 5: Schematic representation of receptive field size under different skeleton-size schedules. The figure only shows average trends for clarity; the actual receptive field size varies.
Method mIoU mA
Logarithm 58.8 68.4
Power 57.2 66.9
Linear 56.3 63.8
Static (256) 53.0 62.3
Static (196) 57.4 65.6
Static (100) 45.0 53.1
Table 4: Receptive Field Size: Comparison of Skeleton-Size Schedules
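As referenced above, the candidate schedules for $M$ as a function of $N$ can be sketched as follows; the functional forms mirror Table 4, while the constants are illustrative assumptions, since the fitted coefficients are not reproduced here:

```python
import numpy as np

def skeleton_size(n_points, mode="log", static_m=196):
    """Candidate skeleton-size schedules M(N) compared in Table 4; the
    coefficients below are placeholders, not the paper's fitted values."""
    if mode == "log":                        # logarithmic (best in Table 4)
        m = 64.0 * np.log(1.0 + n_points / 1e4)
    elif mode == "power":
        m = 2.0 * n_points ** 0.4
    elif mode == "linear":                   # fixed average field size k_bar
        m = n_points / 2000.0
    else:                                    # static M (100 / 196 / 256)
        m = static_m
    return max(int(round(m)), 1)
```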

Aggregation method. We compare the proposed semi-mean aggregation function (eq. 2) with traditional average/max pooling functions. The results in Table 5 indicate an advantage from weighting each skeleton node against its receptive field size.

Function mIoU mA
Proposed 58.8 68.4
Mean 57.6 64.6
Max 57.8 66.8
Table 5: Pooling Function

No significant influence is observed when changing the unpooling method on the S3DIS dataset (the change in mIoU is marginal). However, experiments on the ScanNet dataset suggest otherwise, as shown in Table 6. This may be because scans in this dataset are half-open, with less uniform geometric structures.

Function mIoU mA
Mean 42.3 55.8
Max 41.1 58.1
Table 6: Unpooling Function (ScanNet)

7 Conclusion

We present a dynamic aggregation approach that introduces variance in the extent of semi-local inspection. By introducing self-adaptive receptive field sizes and node weights, dynamic aggregation provides a deeper understanding of structures that contain richer geometric information. We design a network architecture, DAR-Net, to coordinate this intermediate aggregation method with local and global analysis. Experimental results on large-scale scene segmentation indicate that DAR-Net outperforms previous state-of-the-art architectures that adopt static feature aggregation.


References

  • [1] H. Afzal, D. Aouada, D. Font, B. Mirbach, and B. Ottersten. RGB-D multi-view system calibration for full 3D scene reconstruction. In 2014 22nd International Conference on Pattern Recognition, pages 2459–2464. IEEE, 2014.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1534–1543. IEEE.
  • [3] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pages 3189–3197.
  • [4] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85. IEEE.
  • [5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [6] A. Ciampi and Y. Lechevallier. Clustering large, multi-level data sets: an approach based on kohonen self organizing maps. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 353–358. Springer.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839.
  • [8] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Asian conference on computer vision, pages 213–228. Springer.
  • [9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [10] J. Huang and S. You. Point cloud labeling using 3d convolutional neural network. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2670–2675. IEEE.
  • [11] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmentation of point clouds. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2626–2635. IEEE.
  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872.
  • [14] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982.
  • [15] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567.
  • [16] J. Li, B. M. Chen, and G. H. Lee. SO-net: Self-organizing network for point cloud analysis. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9397–9406. IEEE.
  • [17] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830.
  • [18] F. Liu, S. Li, L. Zhang, C. Zhou, R. Ye, Y. Wang, and J. Lu. 3DCNN-DQN-RNN: A deep reinforcement learning framework for semantic parsing of large-scale 3D point clouds. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5679–5688. IEEE.
  • [19] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang. 3d shape reconstruction from sketches via multi-view convolutional networks. In 2017 International Conference on 3D Vision (3DV), pages 67–77. IEEE, 2017.
  • [20] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE.
  • [21] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656.
  • [22] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, 2017.
  • [23] G. Riegler, A. Osman Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586.
  • [24] M. Shoef, S. Fogel, and D. Cohen-Or. PointWise: An unsupervised point-wise feature learning network.
  • [25] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction in 3d. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3887–3896. IEEE.
  • [26] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese. SEGCloud: Semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547. IEEE.
  • [27] D. Tchuinkou and C. Bobda. R-covnet: Recurrent neural convolution network for 3d object recognition. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 331–335. IEEE.
  • [28] A. Thabet, H. Alwassel, and B. Ghanem. MortonNet: Self-supervised learning of local features in 3d point clouds.
  • [29] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, 2015.
  • [30] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215.
  • [31] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang. 3d recurrent neural networks with context fusion for point cloud semantic segmentation. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, volume 11211, pages 415–430. Springer International Publishing.