1 Introduction
The quality of 3D sensors has rapidly improved in recent years, and their ability to accurately and quickly measure the depth of scenes has surpassed traditional vision-based methods Žbontar and LeCun (2016); Godard et al. (2017); Zhou et al. (2017); Kendall et al. (2017); Zhang et al. (2019). This improved access to point clouds demands algorithms to interpret and analyze them. Inspired by the success of deep neural networks (DNNs) in 2D image analysis tasks He et al. (2016); Ren et al. (2015); Krizhevsky et al. (2012); Long et al. (2015); Huang et al. (2017), DNN-based approaches have been successfully applied to point cloud analysis tasks such as shape classification and part segmentation Qi et al. (2017b); Li et al. (2018b); Qi et al. (2017a); Liu et al. (2019); Li et al. (2018a); Wu et al. (2019); Klokov and Lempitsky (2017). These methods achieve state-of-the-art performance by learning representations from large synthetic datasets constructed by sampling the surfaces of CAD objects Wu et al. (2015); Chang et al. (2015). Unfortunately, point clouds captured by 3D sensors in real-world scenarios are often incomplete due to limited viewpoints and occlusions, and these approaches struggle when asked to classify or segment such partial point clouds. This shortcoming of existing point cloud analysis methods is therefore particularly cumbersome in practice.
To address the limitations of existing approaches, this paper proposes a novel model for partial point cloud analysis. We model a point cloud as a partition of point sets, each of which independently contributes to inferring the latent feature encoding the complete point cloud. In contrast to prior work on learning point cloud representations, we use an encoder to embed each local point set, and all embeddings vote to infer a latent space characterized by a distribution. As we show in this paper, giving each local point set a vote equips the model to handle the incomplete nature of real-world point clouds. Inspired by recent progress in variational inference Kingma and Welling (2014, 2019), we output a distribution over the latent variable and then use a decoder to generate a prediction from the latent value with the highest probability. In particular, each local point set generates a Gaussian distribution in the latent space and independently votes to form the distribution of the latent variable. This voting strategy ensures that the model outputs more accurate predictions when more partial observations are given, and the probabilistic modeling enables the model to generate multiple plausible outputs.
The contributions of this paper are: (1) we propose that each local point set independently votes to infer the latent feature; this voting strategy is shown to be robust to partial observation; (2) we construct each vote as a distribution in the latent space, and this distributional modeling allows for diverse predictions; (3) the proposed model, trained only on complete point clouds, performs robustly on partial observations at test time, which reduces the cost of collecting large partial point cloud datasets; (4) the proposed model achieves state-of-the-art results on shape classification, part segmentation, and point cloud completion. In particular, it outperforms approaches trained with pairs of partial and complete point clouds on point cloud completion.
2 Related Work
To perform point cloud analysis, researchers have traditionally converted point clouds into 2D grids or 3D voxels so that existing CNNs can be leveraged. With the help of CNNs, those approaches achieve impressive results on 3D shape analysis Feng et al. (2018); Maturana and Scherer (2015); Qi et al. (2016); Wu et al. (2015). Unfortunately, these 2D grid and 3D voxel representations degrade the resolution of objects. Researchers have attempted to address this issue with sparse representations Klokov and Lempitsky (2017); Tatarchenko et al. (2017); Wang et al. (2017); however, these representations are still less efficient than point clouds and cannot avoid quantization effects. More recently, PointNet Qi et al. (2017a) pioneered approaches that take point clouds directly as input and process them with DNNs. To do so, it transforms each point into a high-dimensional space and aggregates information across points with a symmetric function. A variety of extensions to PointNet have been proposed Qi et al. (2017b); Li et al. (2018a); Liu et al. (2019); Wang et al. (2019); Zhang and Rabbat (2018); Shen et al. (2018); however, none of them is robust to the partial observations that are common in real-world scenarios.
To address the challenge posed by partial observations, researchers have relied on training DNNs on partial point clouds collected in real-world scenarios Qi et al. (2019); Yang et al. (2019); Shi et al. (2019); Qi et al. (2018); Chen et al. (2019); Huang et al. (2018); Armeni et al. (2016); Dai et al. (2017a); Song et al. (2015); Geiger et al. (2012). Each of these approaches relies on networks originally proposed for feature extraction on complete point clouds. Unfortunately, collecting and annotating such partial point cloud datasets is expensive. Another line of work first infers the missing data of the partially observed point cloud before later analysis. A variety of methods perform shape completion using distance fields and voxels Dai et al. (2017b); Han et al. (2017); Stutz and Geiger (2018); Le and Duan (2018); however, they degrade the resolution of objects represented by point clouds.

Recently, research has turned to performing shape completion directly on point clouds. A common pipeline first encodes the partial observation into a feature vector and then decodes it into a complete point cloud, and a variety of decoders have been proposed Fan et al. (2017); Sun et al. (2020); Yang et al. (2018); Yuan et al. (2018); Tchapmi et al. (2019). Multi-view methods Su et al. (2015); Hu et al. (2019a, b) and approaches modeling an implicit distance function Park et al. (2019); Chen and Zhang (2019) have also been applied to shape completion. However, each of these methods outputs a single prediction for a given input and cannot generate multiple plausible results. One notable exception generates diverse results by modeling the spatial distribution of all the points Sun et al. (2020); unfortunately, it can only handle partial point clouds missing from specific locations. In contrast, the method developed in this paper places no such requirement on the observed partial point cloud and leverages the distribution over the latent space to generate diverse predictions.

The variational autoencoder (VAE) is a popular method for modeling a generative distribution Kingma and Welling (2014). It assumes a prior distribution over the latent variables, which is often Gaussian. More recently, the conditional variational autoencoder (CVAE) extended the VAE by modeling a conditional distribution Sohn et al. (2015). Unfortunately, directly applying a CVAE to partial point clouds requires a collection of annotated partial point clouds for training. This paper addresses this limitation by proposing that each local point set serve as a unit voter contributing to the latent feature. Encoding features learned on local point sets of complete point clouds can be reused to embed local point sets of partial point clouds, which allows us to train on complete point clouds and still perform well on partial point clouds at test time.
3 Problem Statement
Consider an observed partial point cloud, denoted by $x$. Suppose the output of a model for a point cloud analysis task is $y$. We are interested in modeling the conditional distribution $p(y \mid x)$, which provides a way to predict $y$ given the partial input $x$. This paper aims to address the following three challenges: (1) classifying partial point clouds, (2) segmenting parts on partial point clouds, and (3) recovering complete point clouds from partial observations.
4 Method
This section describes the method we use to accomplish the aforementioned objective.
4.1 Preliminary: Conditional Variational Autoencoder
A CVAE Sohn et al. (2015) is a directed graphical model. Its conditional generative process is as follows: for a given observation $x$, a latent vector $z$ is drawn from the prior distribution $p_\theta(z \mid x)$, and an output $y$ is generated from the distribution $p_\theta(y \mid x, z)$. The training objective, the variational lower bound of the CVAE, is written as follows:

$\log p_\theta(y \mid x) \geq -D_{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]$   (1)

The CVAE is composed of a recognition network $q_\phi(z \mid x, y)$, a prior network $p_\theta(z \mid x)$, and a generation network $p_\theta(y \mid x, z)$. In this framework, the recognition network approximates the true posterior and is pulled toward the prior network by the KL-divergence term. All distributions are modeled with neural networks. During training, the reparameterization trick Kingma and Welling (2014) is applied to propagate the gradients of $\theta$ and $\phi$ through the latent variable $z$.

4.2 Proposed Point Clouds Model
We model a point cloud as an overlapping partition of point sets, denoted by $\{x_1, \ldots, x_n\}$ when there are $n$ point sets in the partition. In the simplest setting, each point is described only by its 3D coordinates in $\mathbb{R}^3$. Each point set is defined by a centroid and a scale, as shown in Figure 1. To cover the whole point cloud evenly, we use the farthest point sampling (FPS) algorithm Qi et al. (2017b) to sample centroids. The number of centroids and the scale are set manually.
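As an illustration, farthest point sampling and a simple radius-based grouping can be sketched in a few lines of NumPy. The function names and the ball-query grouping are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from the
    already-selected centroids, so centroids cover the cloud evenly."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centroids = np.empty(k, dtype=np.int64)
    centroids[0] = rng.integers(n)          # arbitrary starting point
    # distance from every point to its nearest selected centroid
    dist = np.full(n, np.inf)
    for i in range(1, k):
        d = np.linalg.norm(points - points[centroids[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        centroids[i] = np.argmax(dist)
    return centroids

def ball_query(points, center, radius):
    """Group a local point set around a centroid by a fixed radius (the scale)."""
    return np.where(np.linalg.norm(points - center, axis=1) <= radius)[0]
```

Each index returned by `ball_query` around an FPS centroid defines one local point set of the partition.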
4.3 Proposed Method
In contrast to CVAEs, in which the generation network takes both $x$ and $z$ as inputs, we model the generation process as a Markov chain, as shown in Figure 2. Specifically, given the latent variable $z$ sampled from $q_\phi(z \mid x)$, the output $y$ is independent of $x$. As a result, the generation of the output satisfies $p_\theta(y \mid x) = \int p_\theta(y \mid z)\, p_\theta(z \mid x)\, dz$. The variational lower bound of this model is written as follows:

$\log p_\theta(y \mid x) \geq -D_{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid z)\big]$   (2)
One problem with this learning framework is that the generation network takes values of $z$ sampled from the recognition network $q_\phi(z \mid x, y)$ at training time but values sampled from the prior network $p_\theta(z \mid x)$ at test time, which makes training inconsistent with testing. Similar to Sohn et al. (2015), we enforce consistency between the two settings by making the recognition network identical to the prior network, i.e. $q_\phi(z \mid x, y) = p_\theta(z \mid x) = q_\phi(z \mid x)$. As a result, $z$ can be drawn from the same distribution at both training and test time, and the KL-divergence term becomes zero. We approximate the resulting version of Equation (2) with a Monte Carlo estimator formed by sampling $z^{(l)}$ from the recognition network $q_\phi(z \mid x)$:

$\log p_\theta(y \mid x) \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(y \mid z^{(l)}\big), \qquad z^{(l)} \sim q_\phi(z \mid x)$   (3)
Recall that in our case $x$ is modeled as a set of local point sets $\{x_i\}_{i=1}^{n}$, each of which generates a vote to compute the latent variable $z$. This voting strategy is inspired by the Hough transform Duda and Hart (1972) and VoteNet Qi et al. (2019), and it is shown to accumulate small bits of partial information into confident predictions. Unlike VoteNet, which outputs each vote as a deterministic feature, we model each vote as a distribution in the latent space. By assuming the independence of the votes, the recognition network can be factored as follows:

$q_\phi(z \mid x) \propto \prod_{i=1}^{n} q_\phi(z \mid x_i)$   (4)
In the experiments, we assume each $q_\phi(z \mid x_i)$ is a Gaussian distribution characterized by a mean vector and a covariance matrix, which yields a closed-form solution when optimizing $q_\phi(z \mid x)$ with respect to $z$. We denote the maximizing argument of $q_\phi(z \mid x)$ by $z^*$. Equation (3) reduces to a single-sample estimate when setting $L = 1$. Combining this with the highest-probability sample of $z$, given by $z^*$, the objective function can be further written as follows:

$\tilde{\mathcal{L}}(x, y; \theta, \phi) = \log p_\theta(y \mid z^*), \qquad z^* = \arg\max_{z} \prod_{i=1}^{n} q_\phi(z \mid x_i)$   (5)
Previous variational models choose latent features by sampling. In our case, however, $z^*$ can be computed directly, so all operations are differentiable with respect to the parameters $\theta$ and $\phi$, and the reparameterization trick is no longer needed.
Note that the loss function depends on the task: a standard softmax cross-entropy loss is used for classification and segmentation, whereas the Chamfer distance Fan et al. (2017) is used for training point cloud completion. After training, the generative process is as follows: for a given observation $x$, $z^*$ is the result of voting from all local point sets $x_i$, and the output $y$ is generated from $p_\theta(y \mid z^*)$. The use of $z^*$ produces a single deterministic prediction; diverse predictions are generated by instead sampling $z \sim q_\phi(z \mid x)$ and applying the generation network as in the deterministic case.

5 Experiment
5.1 Implementation
Network architecture The architecture of the proposed model is illustrated in Figure 1. We use DNNs to model $q_\phi(z \mid x_i)$ and $p_\theta(y \mid z)$. A shared-weights network represents $q_\phi(z \mid x_i)$. Given a local point set, we express each point relative to the set's centroid and use a shared-weight PointNet as the basic feature extractor to encode the local region. Both the encoding feature and the centroid coordinates are processed by a multi-layer perceptron (MLP), which outputs a distribution in the latent space. We assume the simple case of a multivariate Gaussian with diagonal covariance matrix. Different analysis tasks correspond to different networks modeling $p_\theta(y \mid z)$, as shown in Figure 1. A folding-based decoder Yang et al. (2018) is used for point cloud completion.
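A minimal NumPy sketch of one such shared-weight local encoder is given below. The weights are random and untrained and the layer sizes are illustrative assumptions, but the sketch demonstrates the key property of the design: because the per-point MLP is shared and the pooling is symmetric, the resulting Gaussian vote is invariant to the ordering of points in the local set.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # latent dimensionality (illustrative choice)

# Shared point-wise MLP weights and Gaussian-parameter heads
# (untrained, for illustration only).
W1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
W_mu, W_logvar = rng.normal(size=(64, D)), rng.normal(size=(64, D))

def encode_local_set(pts, centroid):
    """Embed one local point set into a Gaussian vote (mu, log-variance)."""
    rel = pts - centroid                  # express points relative to the centroid
    h = np.maximum(rel @ W1 + b1, 0.0)    # shared per-point MLP (ReLU)
    g = h.max(axis=0)                     # symmetric max-pool over the set
    return g @ W_mu, g @ W_logvar         # parameters of the diagonal Gaussian vote
```

Permuting the rows of `pts` leaves the pooled feature, and hence the vote, unchanged.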
Training During training, we partition the point cloud into 64 local point sets, and each of them generates a vote in the latent space. To make the voting strategy tenable, a random number of votes is selected to contribute to the latent feature at training time; in the extreme case only one vote is selected. This ensures that a single vote has the potential to be decoded into the prediction, so the trained model is robust to any type of partialness. All votes contribute to the latent feature at test time.
Simulated partial point clouds For datasets that do not provide partial point clouds, we simulate them during testing. Partial point clouds are synthesized by keeping only the points falling on one side of a plane. The plane passes through the origin and is defined by its normal, a 3D vector obtained by sampling from a normal distribution. Note that these simulated partial point clouds are used only during testing. Sample partial point clouds are visualized in Figure 7.
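This plane-cut simulation amounts to a few lines of code (a sketch; the function name is our own):

```python
import numpy as np

def simulate_partial(points, seed=None):
    """Keep only the points on one side of a random plane through the origin."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(size=3)           # random plane normal
    normal /= np.linalg.norm(normal)
    # sign of the dot product decides which side of the plane a point is on
    return points[points @ normal >= 0.0]
```

For a cloud centered at the origin, this discards roughly half of the points on average.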
5.2 Point Clouds Classification
We first consider the task of point cloud classification, which requires the model to extract a global feature describing distinct geometric information and decode it into a predicted category.
Dataset We use the ModelNet40 Wu et al. (2015) dataset to evaluate the proposed method on shape classification. It contains 12,311 CAD models from 40 object categories. In the experiments, point clouds are generated by evenly sampling 1,024 points from the surface of each object. We follow the same train/test split as Qi et al. (2017a). Before being passed into the model, point clouds are centered and normalized to a unit sphere. No data augmentation is applied during training.
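The centering and unit-sphere normalization step can be expressed as follows (an illustrative sketch):

```python
import numpy as np

def normalize_unit_sphere(points):
    """Center the cloud at the origin and scale it to fit the unit sphere."""
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()  # radius of the cloud
    return centered / scale
```

After this step the centroid is at the origin and the farthest point lies on the unit sphere.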
Results Quantitative results on ModelNet40 are shown in Table 1. Overall classification accuracy is reported on both complete point clouds and simulated partial point clouds. All listed methods achieve state-of-the-art results on complete point clouds, and our proposed method performs slightly better than PointNet and PointNet++. When evaluated on simulated partial point clouds, however, the other approaches suffer a significant drop in accuracy. This is unsurprising: they are designed for complete point clouds and generalize poorly to novel partial ones. In contrast, our method, trained only on complete point clouds, is robust to partial observation and achieves 86.4% classification accuracy on partial point clouds.
Method | Input | Complete | Partial
PointNet Qi et al. (2017a) | | 88.8 | 20.9
PointNet++ Qi et al. (2017b) | | 91.0 | 61.5
RS-CNN Liu et al. (2019) | | 92.3 | 43.3
DGCNN Wang et al. (2019) | | 92.9 | 51.5
Ours | | 91.4 | 86.4
5.3 Part Segmentation
Given a point cloud and its object category, the part segmentation task requires predicting a part label for each point. As a result, this task requires the model to extract both global and local information.
Dataset We use the ShapeNet part dataset Wu et al. (2015) to evaluate the proposed method on part segmentation. It contains 16,881 shapes from 16 object categories annotated with 50 parts in total. Each point cloud contains 2,048 points evenly sampled from the surface of the object. We follow the same split conventions as Qi et al. (2017b). For evaluation, the mean intersection-over-union (mIoU) averaged across all classes is reported. No data augmentation is applied during training.
Results We report the results of our method and compare it to existing approaches in Table 2. Our method is markedly superior on partial point clouds, achieving 78.1 mIoU while the others achieve around 30 mIoU. This robustness to partial observation comes at the expense of slightly lower accuracy on complete point clouds compared with the other approaches. We also test the robustness of our approach by applying it to the Completion3D dataset Tchapmi et al. (2019), which contains partial point clouds but no part labels. Qualitative part segmentation results are shown in Figure 3.
Method | Input | Complete | Partial
PointNet Qi et al. (2017a) | | 80.5 | 29.9
PointNet++ Qi et al. (2017b) | | 82.0 | 30.9
DGCNN Wang et al. (2019) | | 82.3 | 29.8
RS-CNN Liu et al. (2019) | | 82.4 | 30.6
Ours | | 79.0 | 78.1
5.4 Point Clouds Completion
Given a partial point cloud, point cloud completion aims to generate points that fill in the missing parts of the input. This requires a model that can infer a global feature encoding the complete point cloud from the partial input.
Dataset We evaluate our model on Completion3D Tchapmi et al. (2019), a 3D object point cloud completion benchmark. It contains pairs of partial and complete point clouds from 8 categories derived from the ShapeNet dataset, with 2,048 points per point cloud. We use the split provided by the dataset. Partial point clouds are generated by back-projecting 2.5D depth images into 3D space, and complete point clouds are used as ground truth.
Results Quantitative results on Completion3D's withheld test set are shown in Table 4, where the scaled Chamfer distance is reported. Our model outperforms FoldingNet and PCN, which are trained with both partial and complete point clouds, whereas only complete point clouds are used to train the method developed in this paper.
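For reference, a minimal NumPy sketch of the symmetric Chamfer distance is shown below. Note that conventions vary across benchmarks (squared vs. root distances, summing vs. averaging, scaling factors); this sketch uses squared distances averaged in each direction.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a and b:
    for each point, the squared distance to its nearest neighbor in the
    other cloud, averaged over each cloud and summed over both directions."""
    # pairwise squared distances, shape (len(a), len(b))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The distance is zero only when every point in each cloud coincides with some point in the other.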
Since Completion3D's withheld test set contains only a limited number of partial cloud examples, we evaluated all approaches more exhaustively on simulated partial point clouds as in Figure 3, using models trained on Completion3D. As shown in Table 4, all approaches except ours experience a significant performance drop. We suspect this is because the partial point clouds in Completion3D differ from the simulated ones, whereas our method generalizes well across types of partial point clouds. Qualitative completion results in Figure 4 show that our output point clouds are sharper. More challenging tests on real-world point clouds are shown in Figure 5; the partial point clouds are extracted from the object bounding boxes provided in ScanNet Dai et al. (2017a), and each point cloud is transformed into the box's coordinate frame.
5.5 Voting Strategy Analysis
The process of computing the latent feature is inspired by the voting mechanism proposed in Qi et al. (2019). As in that work, we infer the latent feature from votes cast by local point sets; however, each vote in our case is a distribution in the latent space rather than a deterministic feature vector, and the optimal latent feature is the sample with the highest probability. In this section, we analyze this voting strategy and compare it with an aggregation strategy in which the features extracted from local point sets are aggregated with a symmetric function. We study two symmetric functions, max pooling and mean pooling; to use the aggregation strategy, we change the vote of each local point set from a distribution to a deterministic feature vector.
The comparison results are shown in Figure 7. The evaluation metric is the average classification accuracy on the simulated partial ModelNet40. All models are trained with the proposed training strategy, where a random number of votes (less than or equal to 10) is selected to contribute to the latent feature. The proposed voting strategy is validated by its improved accuracy compared with aggregation using either max or mean pooling. Classification accuracy grows as the number of votes increases at test time for both the voting strategy and mean pooling, because more votes accumulate more information for prediction. This is not the case for max pooling, whose accuracy peaks at 32 votes, indicating that max pooling is sensitive to the number of selected votes.
5.6 Visualization of Multiple Predictions
The proposed model is designed to generate multiple plausible outputs. This is achieved through the latent space model: the latent space is represented by a set of multivariate Gaussian distributions generated by the local point sets. The most probable prediction is decoded from the latent value with the highest probability, while diverse predictions can be generated by sampling from the latent space. We visualize the results in Figure 8.
6 Conclusion
This paper proposes a general model for partial point cloud analysis. Point clouds are modeled as a partition of point sets, which generate votes that define a distribution over a latent space. This voting strategy is shown to accumulate partial information and to be robust to partial observation. A latent feature sampled from this space is then decoded into the prediction. Extensive experiments on classification, part segmentation, and completion achieve state-of-the-art results and demonstrate the effectiveness of the proposed method.
7 Acknowledgment
This work was supported by a grant from Ford Motor Company via the Ford-UM Alliance under award N028603.
References

Armeni et al. (2016). 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543.
Chang et al. (2015). ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
Chen et al. (2019). Fast Point R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9775–9784.
Chen and Zhang (2019). Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5939–5948.
Dai et al. (2017a). ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Dai et al. (2017b). Shape completion using 3D-encoder-predictor CNNs and shape synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5868–5877.
Duda and Hart (1972). Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1), pp. 11–15.
Fan et al. (2017). A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613.
Feng et al. (2018). GVCNN: group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 264–272.
Fey and Lenssen (2019). Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
Geiger et al. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
Godard et al. (2017). Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
Groueix et al. (2018). A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 216–224.
Han et al. (2017). High-resolution shape completion using deep neural networks for global structure and local geometry inference. In Proceedings of the IEEE International Conference on Computer Vision, pp. 85–93.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hu et al. (2019a). Render4Completion: synthesizing multi-view depth maps for 3D shape completion. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
Hu et al. (2019b). 3D shape completion with multi-view consistent inference. arXiv preprint arXiv:1911.12465.
Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
Huang et al. (2018). Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
Kendall et al. (2017). End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma and Welling (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
Kingma and Welling (2019). An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
Klokov and Lempitsky (2017). Escape from cells: deep kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Le and Duan (2018). PointGrid: a deep network for 3D shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9204–9214.
Li et al. (2018a). SO-Net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406.
Li et al. (2018b). PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830.
Liu et al. (2019). Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8895–8904.
Long et al. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Maas et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proceedings of ICML, Vol. 30.
Maturana and Scherer (2015). VoxNet: a 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.
Park et al. (2019). DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174.
Qi et al. (2016). Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656.
Qi et al. (2017a). PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
Qi et al. (2017b). PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
Qi et al. (2018). Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
Qi et al. (2019). Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9277–9286.
Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
Shen et al. (2018). Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4548–4557.
Shi et al. (2019). PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779.
Sohn et al. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
Song et al. (2015). SUN RGB-D: a RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576.
Stutz and Geiger (2018). Learning 3D shape completion from laser scan data with weak supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1955–1964.
Su et al. (2015). Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953.
Sun et al. (2020). PointGrow: autoregressively learned point cloud generation with self-attention. In The IEEE Winter Conference on Applications of Computer Vision, pp. 61–70.
Tatarchenko et al. (2017). Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096.
Tchapmi et al. (2019). TopNet: structural point cloud decoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 383–392.
Wang et al. (2017). O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics 36(4), pp. 72.
Wang et al. (2019). Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38(5), pp. 146.
Wu et al. (2015). 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
Wu et al. (2019). PointConv: deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
Yang et al. (2018). FoldingNet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215.
Yang et al. (2019). Learning object bounding boxes for 3D instance segmentation on point clouds. In Advances in Neural Information Processing Systems, pp. 6737–6746.
Yuan et al. (2018). PCN: point completion network. In 2018 International Conference on 3D Vision (3DV), pp. 728–737.
Žbontar and LeCun (2016). Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17(1), pp. 2287–2318.
Zhang and Rabbat (2018). A graph-CNN for 3D point cloud classification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6279–6283.
Zhang et al. (2019). DispSegNet: leveraging semantics for end-to-end learning of disparity estimation from stereo imagery. IEEE Robotics and Automation Letters 4(2), pp. 1162–1169.
Zhou et al. (2017). Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858.
Supplementary
8 Computation of the Optimal Latent Feature
Here we provide a derivation of $\mathbf{z}^*$, the optimal latent feature with the highest probability in the latent space. The distribution of the latent space is represented by a set of multivariate Gaussian distributions (Equation (5) in the main paper). By assuming that the votes are independent and each vote is Gaussian distributed, the derivation of $\mathbf{z}^*$ is as follows:

$$\mathbf{z}^* = \operatorname*{arg\,max}_{\mathbf{z}} \prod_{i=1}^{N} \frac{1}{\sqrt{(2\pi)^{d}\,|\boldsymbol{\Sigma}_i|}} \exp\!\left(-\tfrac{1}{2}\,(\mathbf{z}-\boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{z}-\boldsymbol{\mu}_i)\right), \tag{6}$$

where the $i$-th multivariate Gaussian distribution is characterized by mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$; $N$ is the number of votes; $d$ is the dimension of the latent space. The solution can be computed by setting the derivative of the log-likelihood to zero:

$$\frac{\partial}{\partial \mathbf{z}} \sum_{i=1}^{N} \log p_i(\mathbf{z}) = -\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} (\mathbf{z}-\boldsymbol{\mu}_i) = \mathbf{0}. \tag{7}$$

Thanks to the concavity of the log-likelihood, the maximizing argument is given by:

$$\mathbf{z}^* = \left(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1}\right)^{-1} \sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i. \tag{8}$$

For simplicity, we assume diagonal covariance matrices in the experiments. Both $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are generated from each local point set and modeled by neural networks.
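With diagonal covariances, Equation (8) reduces to a precision-weighted average of the vote means. A minimal sketch of this fusion (the function name and the example values are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_votes(means, variances):
    """Fuse N Gaussian votes with diagonal covariances into the optimal
    latent feature: z* = (sum_i Sigma_i^{-1})^{-1} sum_i Sigma_i^{-1} mu_i."""
    means = np.asarray(means, dtype=float)          # shape (N, d)
    variances = np.asarray(variances, dtype=float)  # diagonal entries, shape (N, d)
    precisions = 1.0 / variances                    # Sigma_i^{-1} for diagonal Sigma_i
    return (precisions * means).sum(axis=0) / precisions.sum(axis=0)

# Two 2-D votes: a confident one at (0, 0) and a less confident one at (2, 2).
z_star = fuse_votes([[0.0, 0.0], [2.0, 2.0]],
                    [[0.5, 0.5], [1.5, 1.5]])
# z* is pulled toward the lower-variance (higher-precision) vote.
```

Note that a single vote reduces to its own mean, which is why decoding one vote at a time (as in the visualizations of Section 12) is well defined.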
9 Details of Implementation
9.1 Training
We implement our network in PyTorch and use the PyTorch Geometric library Fey and Lenssen (2019). During optimization, we use the Adam optimizer Kingma and Ba (2014) with default parameters except for the learning rate. We train models for three different tasks. (1) For the point cloud classification experiments, the learning rate starts at 0.001 and is scaled by 0.2 every 200 epochs, for a total of 500 epochs. The batch size is 64, split across 2 NVIDIA Tesla V100 GPUs during training. (2) For the part segmentation experiments, the learning rate starts at 0.001 and is scaled by 0.2 every 200 epochs, for a total of 500 epochs. The batch size is 128, split across 4 NVIDIA Tesla V100 GPUs during training. (3) For the point cloud completion experiments, the learning rate starts at 0.0002 and is scaled by 0.2 every 200 epochs, for a total of 500 epochs. The batch size is 64, split across 4 NVIDIA Tesla V100 GPUs during training.
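The step schedule above can be written as a small helper; this is a minimal sketch of the decay rule (the function name is illustrative, and in PyTorch it corresponds to a standard step learning-rate scheduler with `step_size=200` and `gamma=0.2`):

```python
def learning_rate(epoch, base_lr=0.001, step=200, gamma=0.2):
    """Step schedule described above: the learning rate starts at base_lr
    and is scaled by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Classification/segmentation: 0.001 for epochs 0-199, then 0.0002, then 4e-5.
# Completion uses base_lr=0.0002 with the same decay.
```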
Table 5: Quantitative results on partial point cloud completion on the Completion3D test set (lower is better).

Model  Plane  Cabinet  Car  Chair  Lamp  Sofa  Table  W.craft  Average
FoldingNet Yang et al. (2018)  12.83  23.01  14.88  25.69  21.79  21.31  20.71  11.51  19.07 
PCN Yuan et al. (2018)  9.79  22.70  12.43  25.14  22.72  20.26  20.27  11.73  18.22 
AtlasNet Groueix et al. (2018)  10.36  23.40  13.40  24.16  20.24  20.82  17.52  11.62  17.77 
TopNet Tchapmi et al. (2019)  7.32  18.77  12.88  19.82  14.60  16.29  14.89  8.82  14.25 
Ours  6.88  21.18  15.78  22.54  18.78  28.39  19.96  11.16  18.18 
Table 6: Quantitative results on simulated partial point clouds constructed from the Completion3D validation set (lower is better).

Model  Plane  Cabinet  Car  Chair  Lamp  Sofa  Table  W.craft  Average
FoldingNet Yang et al. (2018)  25.79  40.52  16.12  39.90  43.01  43.76  40.88  26.54  34.56 
PCN Yuan et al. (2018)  21.58  41.87  16.56  38.86  50.19  43.37  39.44  27.57  34.93 
AtlasNet Groueix et al. (2018)  23.13  49.60  16.80  43.34  60.83  48.21  41.94  33.96  39.73 
TopNet Tchapmi et al. (2019)  21.61  15.24  38.14  35.23  44.42  38.36  36.18  25.97  31.87 
Ours  7.54  17.96  9.22  19.49  29.97  15.82  24.58  13.15  17.22 
9.2 Network Architecture
We use similar notation to PointNet++ Qi et al. (2017b) to describe the network architecture of the proposed model. $SA(K, r, [l_1, \dots, l_d])$ is a set abstraction (SA) level with $K$ local regions of ball radius $r$ using a shared-weights PointNet structure Qi et al. (2017a), which contains $d$ fully connected layers with widths $l_i$ ($i = 1, \dots, d$). $FC(l_1, l_2, p)$ represents a fully connected layer with input width $l_1$, output width $l_2$, and dropout ratio $p$. All layers are followed by batch normalization Ioffe and Szegedy (2015) and Leaky ReLU Maas et al. (2013) layers, except for the last prediction layer, the last layer in the vote generation, and the layers within the folding-based decoder. Point coordinates are first transformed to a high-dimensional space by a fully connected layer. For all experiments, the architecture of the vote generation process is the same, and the outputs are the stack of the mean vector and the diagonal elements of the covariance matrix, since we model each vote as a multivariate Gaussian distribution:
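As a concrete reading of the vote parameterization above, the head's output vector can be split into a mean and a positive diagonal covariance. A minimal sketch; the function name and the softplus activation for positivity are assumptions, not taken from the paper:

```python
import numpy as np

def split_vote(raw, d):
    """Split the stacked vote-generation output into the mean vector and
    the diagonal of the covariance matrix. A softplus keeps the variances
    positive (this activation choice is an assumption)."""
    mu, s = raw[:d], raw[d:]
    var = np.log1p(np.exp(s))  # softplus: always > 0
    return mu, var

mu, var = split_vote(np.zeros(8), d=4)  # 4-D mean, 4 positive variances
```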
For the shape classification experiments, the architecture for decoding the latent feature into category scores is as follows:
For the part segmentation experiments, the encoded feature for each point is the stack of the latent feature, the transformed point coordinates, and a one-hot vector representing the object category. The architecture for point-wise prediction of part category scores is as follows:
For the point cloud completion experiments, the model with a ball radius of 0.1 achieves the best performance. The architecture of the decoder is inspired by the folding idea proposed in Yang et al. (2018), which folds 2D grids into 3D shapes:
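To make the folding idea concrete, the sketch below deforms a fixed 2-D grid into a 3-D point set by concatenating the grid with the (replicated) latent feature and passing the result through two MLPs. This is a minimal illustration of a FoldingNet-style decoder, not the paper's exact architecture: all widths and names are placeholders, and the weights here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def folding_decoder(latent, grid_size=16, hidden=64, out_dim=3):
    """Fold a fixed 2-D grid into a 3-D point cloud conditioned on `latent`.
    Placeholder random weights stand in for the learned parameters."""
    g = np.linspace(-1.0, 1.0, grid_size)
    grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)  # (M, 2) grid points
    z = np.tile(latent, (grid.shape[0], 1))                     # (M, d) replicated latent

    def mlp(x, out):
        W1 = rng.standard_normal((x.shape[1], hidden)) * 0.1
        W2 = rng.standard_normal((hidden, out)) * 0.1
        return np.maximum(x @ W1, 0.0) @ W2                     # two-layer ReLU MLP

    fold1 = mlp(np.concatenate([z, grid], axis=1), out_dim)     # first folding
    fold2 = mlp(np.concatenate([z, fold1], axis=1), out_dim)    # second folding
    return fold2                                                # (M, 3) point cloud

points = folding_decoder(np.zeros(8))  # 16x16 grid -> 256 output points
```

Conditioning both folds on the same latent feature is what lets a single latent value determine the whole output shape.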
10 Point Clouds Completion
The quantitative results on partial point cloud completion on the Completion3D test set are shown in Table 5. Despite being trained only on complete point clouds, our proposed model still outperforms FoldingNet and PCN, which are trained on both partial and complete point clouds.
We further evaluate all approaches on simulated partial point clouds using models trained on the Completion3D dataset. The simulated partial point clouds are constructed by processing the validation set and selecting the points falling on one side of a random plane. As shown in Table 6, all approaches except the one developed in this paper experience a significant performance drop. We suspect that this is due to the difference between the partial point clouds in Completion3D and the simulated partial point clouds. However, the proposed method achieves similar performance on both kinds of partial point clouds, which leads to the conclusion that the proposed method generalizes better to arbitrary partial point clouds.
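The plane-cut construction above can be sketched as follows. This is a minimal sketch; passing the plane through the centroid is an assumption, since the paper does not specify the plane placement:

```python
import numpy as np

def simulate_partial(points, rng):
    """Simulate a partial point cloud: keep only the points falling on
    one side of a random plane (here, through the centroid)."""
    normal = rng.standard_normal(3)
    normal /= np.linalg.norm(normal)          # random unit plane normal
    centered = points - points.mean(axis=0)   # plane passes through the centroid
    return points[centered @ normal >= 0.0]   # keep one half-space

rng = np.random.default_rng(0)
partial = simulate_partial(rng.standard_normal((2048, 3)), rng)
```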
11 Ablation Study
Table 7: Ablation study; overall classification accuracy on the simulated partial test set of ModelNet40. BN: batch normalization; DP: dropout; # v. train / # v. test: number of votes during training / testing.

Model  BN  DP  # v. train  # v. test  radius  Bottleneck  Acc.
10  16  0.25  1024  78.4  
✓  10  16  0.25  1024  80.9  
✓  ✓  10  16  0.25  1024  81.0  
✓  ✓  4  16  0.25  1024  79.1  
✓  ✓  10  16  0.25  1024  81.0  
✓  ✓  16  16  0.25  1024  79.4  
✓  ✓  32  16  0.25  1024  78.1  
✓  ✓  64  16  0.25  1024  76.2  
✓  ✓  10  8  0.25  1024  76.8  
✓  ✓  10  16  0.25  1024  81.0  
✓  ✓  10  32  0.25  1024  82.7  
✓  ✓  10  64  0.25  1024  83.8  
✓  ✓  10  128  0.25  1024  84.4  
✓  ✓  10  256  0.25  1024  84.6  
✓  ✓  10  256  0.15  1024  83.1  
✓  ✓  10  256  0.20  1024  86.4  
✓  ✓  10  256  0.25  1024  84.6  
✓  ✓  10  256  0.35  1024  82.5  
✓  ✓  10  256  0.20  512  85.3  
✓  ✓  10  256  0.20  1024  86.4  
✓  ✓  10  256  0.20  2048  86.8 
Results of the ablation study are shown in Table 7, where the overall classification accuracy is reported on the simulated partial test set of ModelNet40. Unsurprisingly, the accuracy improves after adding batch normalization and dropout, from 78.4% to 81.0%. We model the point cloud as an overlapping partition of point sets, each of which is defined by its centroid and scale. As Table 7 shows, the performance of this modeling is sensitive to the scale, and the accuracy peaks at a ball radius of 0.2 (86.4%). This can be partly explained by the fact that local regions at a small scale contain insufficient distinctive geometric features to infer the latent feature encoding the complete point cloud, while the features learned for local regions at a large scale tend to differ from those in partial point clouds, whose missing parts introduce unexpected edges. Table 7 also illustrates that the performance grows as we increase the dimension of the latent space (bottleneck), saturating at 2048 (86.8%).
We propose that the latent feature encoding the complete point cloud be inferred by voting from local point sets. To make this voting strategy tenable, we design a training strategy in which a random number of votes is selected to compute the latent feature; the maximum number of selected votes is set manually during training. As Table 7 shows, the performance peaks when this maximum is set to 10 (81.0%). The accuracy of the trained model also grows as the number of votes at test time increases, from 76.8% to 84.6%, which indicates that more votes accumulate more information for the prediction.
12 Visualization of Voting Strategy
The latent space proposed in this paper is represented by a set of independent multivariate Gaussian distributions generated from local point sets. Combined with the designed training strategy, each local point set is able to infer a distribution in the latent space. We perform experiments on partial point cloud completion on the Completion3D dataset and visualize each vote. Specifically, the optimal latent feature inferred by a single vote is decoded into a complete point cloud using the folding-based decoder, as in the voting case. As shown in Figure 9, local point sets located at different parts of the object generate votes encoding complete point clouds with different shapes. In the shown example, votes at the front of the vehicle tend to infer vehicles with sloping rears, while votes at the rear tend to infer truck-like vehicles. Moreover, compared to votes located at the front and the rear of the vehicle, votes in the middle contain less distinct geometric information, since their decoded point clouds are blurrier.
13 Visualization of Multiple Predictions
The method developed in this paper is designed to generate multiple possible outputs, which is achieved by modeling the latent space. The latent space is represented by a set of multivariate Gaussian distributions generated by the local point sets. As a result, the most likely prediction is decoded from the latent value with the highest probability. However, diverse predictions can be generated by sampling from the latent space, followed by the decoding module. Since it is not easy to sample in the latent space, as it is represented by a set of distributions, we instead sample latent values by interpolating between the optimal latent feature inferred from all votes and the optimal latent feature inferred from a single vote. We perform experiments on point cloud completion and visualize the results in Figure 11.
14 Visualization of Point Clouds Completion with Noisy Inputs
We visualize the results of point cloud completion with added noise in Figure 10. The input partial point clouds are perturbed with zero-mean Gaussian noise, whose standard deviation varies across experiments as indicated at the bottom of the figure.
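The perturbation above amounts to adding independent Gaussian noise to each coordinate; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def perturb(points, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each
    coordinate of the input partial point cloud."""
    return points + rng.normal(0.0, sigma, size=points.shape)

noisy = perturb(np.zeros((100, 3)), sigma=0.02, rng=np.random.default_rng(0))
```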
15 Failure Cases on Point Clouds Completion
We show failure cases of point cloud completion on Completion3D in Figure 12. Given a partial observation with no distinct geometric information, all models fail to generate the correct complete point cloud. However, the method developed in this paper still generates sharp and plausible completions, while the outputs of the other approaches are blurry. We suspect this is because the other approaches fall back to the mean shape of their training data when difficult partial point clouds are observed.