1 Introduction
From a point cloud sampled from an object, humans are able to perceive the underlying shape of the object. By properly capturing the topology behind the point set, human understanding is robust to variations in scale and viewpoint. Topology reflects how the points are put together to form an object. Moreover, topology is an intrinsic property of the Riemannian manifolds that are usually used to model 3D shapes in geometric learning boscaini2016learning ; masci2015geodesic . Hence, it is important to seek topology-aware representations for point clouds in machine learning.
As an unsupervised learning architecture, the autoencoder (AE) ng2011sparse is popularly investigated to learn latent representations from unlabeled point clouds. In essence, it learns an approximation to the identity function, nontrivially constrained to output a compact representation from its encoder network; the decoder network then attempts to reconstruct the point cloud from that compact representation. The compact representation is typically a fixed-length codeword characterizing geometric properties of the point cloud. It therefore not only preserves the ability for reconstruction chen20203d but is also valuable for downstream tasks such as classification yang2018foldingnet ; zhao20193d ; gao2020graphter . Unfortunately, it remains a major challenge to produce topology-friendly representations that account for object point clouds with varying genera, or scene point clouds with a varying number of objects. In fact, existing works, including LatentGAN achlioptas2018learning , FoldingNet yang2018foldingnet , AtlasNet groueix2018papier , GraphTER gao2020graphter , etc., all target reconstructing point clouds with simple topology, e.g., object-level point clouds.
Another challenge for point cloud autoencoders is the training strategy. When point cloud autoencoders are trained over a dataset that mixes many complex topologies, a problem we call the curse of point-collapse can be observed achlioptas2018learning . Within a few training epochs or even fewer, a high density of points may become trapped near certain collapse centers and fail to escape in the final reconstruction. This is caused by the intrinsic structure of the loss function, which induces undesired training behavior under diversified topologies.
In this paper, we propose a new autoencoder, entitled TearingNet. It tears a 2D lattice apart into patches so as to match the topology of the 2D lattice to that of 3D point clouds, as shown in Figure 6 and Table 1. The parameterization of 3D point cloud topology is realized via a proposed Tearing network that is coupled with a Folding network carried over from FoldingNet yang2018foldingnet . As a result, TearingNet generates topology-friendly representations. The superiority of these representations is verified in experiments, including shape reconstruction, object counting and object detection tasks. We also examine why the learned representations are topology-aware by analyzing the feature space. The contributions of our work are summarized below:

We propose the TearingNet, which faithfully reconstructs point clouds with diverse topological structures and generates topology-friendly representations for input point clouds. We analyze our design by interpreting it as a proposed Graph-Conditioned AutoEncoder (GCAE) which discovers and utilizes topology iteratively.

We propose a Tearing network (TNet) to explicitly learn point cloud topology by tearing a regular 2D grid into patches, and exploit a Folding network (FNet) that accepts the refined 2D topology to polish the point cloud reconstruction. A locally-connected nearest-neighbor graph is built from the torn 2D grid, which filters the point cloud toward a final faithful reconstruction.

We analyze the point-collapse phenomenon by inspecting the mechanism of the Chamfer Distance. Correspondingly, we propose a subtractive sculpture strategy that couples the training of the proposed TNet and FNet.
Our paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we elaborate the design of our topology-friendly TearingNet. We then detail the curse of point-collapse and our consequent training strategy in Section 4. Experiments are presented in Section 5 and conclusions are provided in Section 6.
2 Related Work
Geometric deep learning has recently shown great potential in various point cloud applications ahmed2018deep . Compared to deep learning on regularly structured data like images and video, point cloud learning is, however, more challenging as the points are irregularly sampled over the object/scene surface.
Conventionally, point clouds are preprocessed, e.g., either voxelized maturana2015voxnet ; ioannidou2017deep or projected into multi-view images su2015multi , so as to carry over deep learning frameworks justified in the image domain. After such a format conversion, for example, conventional convolutional neural networks (CNNs) can be applied on 3D voxels or 2D pixels choy20163d ; roveri2018network . Obviously, voxelization is a trade-off between accuracy and data volume, and multi-view projection balances accuracy/occlusion against data volume. Such compromises occur before the data is fed into deep neural networks. Octree-like approaches wang2018adaptive demonstrate limited adaptivity on such trade-offs. Fortunately, emerging techniques for native learning on point clouds relieve this frustration up front.

As a feature extractor, PointNet qi2017pointnet directly operates on input points and generates a latent codeword depicting the object shape. The latent code is invariant to point permutation thanks to a pooling operation. Once equipped with object-level or part-level labels, PointNet can serve supervised tasks like classification or segmentation. PointNet++ qi2017pointnet++ recursively applies PointNet in a hierarchical manner so as to capture local structures and enhance the ability to recognize fine-grained patterns. With similar motivations, PointCNN li2018pointcnn utilizes a hierarchical convolution and Dynamic Graph CNN (DGCNN) wang2019dynamic employs an edge-convolution over graphs. In brief, advanced feature extractors for point clouds often exploit local topology information.
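PointNet's permutation invariance through pooling can be illustrated with a minimal numpy sketch; the single random linear layer below is a hypothetical stand-in for PointNet's shared point-wise MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared per-point layer standing in for PointNet's point-wise MLP.
W = rng.normal(size=(3, 8))

def encode(points):
    """Map an (N, 3) point cloud to a fixed-length codeword.

    Every point passes through the same weights; a symmetric max-pool over
    points then removes any dependence on point ordering.
    """
    features = np.maximum(points @ W, 0.0)  # (N, 8) per-point features, ReLU
    return features.max(axis=0)             # (8,) permutation-invariant codeword

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]
assert np.allclose(encode(cloud), encode(shuffled))  # point order does not matter
```

Any symmetric aggregation (max, mean, sum) over the shared per-point features yields the same invariance; PointNet uses max-pooling.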
As opposed to the advanced feature extractors, designs of current point cloud generators, e.g., the generator in a Generative Adversarial Network (GAN) or the decoder in an autoencoder (AE), appear to be more preliminary, without taking advantage of topology. For example, topology is not considered in the fully-connected decoder of LatentGAN achlioptas2018learning . As pioneering works, the recent autoencoders FoldingNet yang2018foldingnet and AtlasNet groueix2018papier fold 2D lattice(s) into a 3D point cloud. They for the first time represent topology explicitly, by 2D square(s) with genus zero in their decoders.
FoldingNet adopts a PointNet-like qi2017pointnet encoder to produce latent representations. Like the PointNet encoder, the FoldingNet decoder is a network shared among points: to map each 2D point to a 3D point, it takes a 2D coordinate and the latent codeword as input and outputs a 3D coordinate. The Chamfer Distance is used to measure the error between input and output point clouds. Unfortunately, FoldingNet fails to embed geometric information for manifolds with genus higher than zero, even if the network is scaled up. This is because FoldingNet applies a continuous deformation, and topology is unchanged under continuous deformations; hence the topologies FoldingNet can represent remain the same as the 2D lattice with genus zero.
AtlasNet groueix2018papier and AtlasNet2 deprelle2019learning naively duplicate the decoder-lattice pair to comply with complex topology. In chen2019deep , a fully-connected graph is advanced as a companion to the FoldingNet decoder, aiming to approximate point cloud topology with a graph topology. Its main weakness is the misalignment between graph and point cloud topology, as it allows distant point pairs to be connected. In addition, learning a fully-connected graph is expensive due to the large number of graph edges.
Motivated by the limitations of the related work, we propose an autoencoder: TearingNet. In particular, TearingNet is the first autoencoder able to use a fixed-length latent representation ( dimensions in our case) to reconstruct a scene-level point cloud with multiple objects or object-level point clouds with high genera. We introduce a learnable Tearing network to make the latent representation aware of the topology in point clouds. Intuitively, the Tearing network cuts the 2D lattice into pieces so as to align its genus to the 3D point cloud; a parameterization of point cloud topology can then be easily inferred.
3 TearingNet for Topology Preservation
3.1 Overview
A block diagram of the proposed autoencoder, TearingNet, is shown in Figure 1. The PointNet architecture qi2017pointnet is adopted as our encoder (ENet) to output latent representations. On top of the FoldingNet yang2018foldingnet decoder, referred to as the Folding network (FNet, denoted by f_F) hereinafter, a novel Tearing network (TNet, denoted by f_T) is proposed and wedged in between two iterations of FNet. Finally, graph filtering is appended at the end to complete the TearingNet configuration.
Given an original 3D point cloud P composed of N points, the encoder generates a codeword c from P. A 2D point set U samples M points in a 2D plane, which are to be deformed during reconstruction. The 2D point set brings in a primitive shape and is initialized as U_0 by sampling on regular 2D-grid locations (implying a grid-graph topology). The TearingNet decoder takes the latent code c and the 2D point set U_0 as inputs, then runs f_F, f_T, and f_F sequentially as follows:

\hat{P}_0 = f_F(U_0, c), \quad U_1 = U_0 + f_T(U_0, \hat{P}_0, c), \quad \hat{P} = f_F(U_1, c).   (1)

The two iterations of the shared Folding network produce a preliminary and an improved 3D point cloud, \hat{P}_0 and \hat{P}, respectively. The Tearing network specifically accounts for the preliminary point cloud \hat{P}_0 from the first iteration of the Folding network, and modifies the point set in the 2D plane. The updated 2D point set U_1 is supplied to the second iteration of the Folding network. Reconstructions contain M points.
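The fold/tear/fold decoding sequence described above can be sketched with toy stand-in networks. The grid size, codeword length and MLP dimensions below are arbitrary assumptions, and the random linear maps merely stand in for the learned FNet/TNet:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(in_dim, out_dim):
    """Hypothetical stand-in for a shared point-wise MLP: a random linear map + tanh."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.tanh(x @ W)

CODE_DIM, GRID = 16, 10                      # assumed codeword size and grid side
u, v = np.meshgrid(np.linspace(-1, 1, GRID), np.linspace(-1, 1, GRID))
U0 = np.stack([u.ravel(), v.ravel()], axis=1)   # (M, 2) regular 2D grid, M = GRID**2

f_F = mlp(2 + CODE_DIM, 3)                   # Folding network: (2D point, code) -> 3D point
f_T = mlp(2 + 3 + CODE_DIM, 2)               # Tearing network: -> residual 2D offsets

def decode(code, U0):
    """Run the fold/tear/fold sequence: fold, tear (residual 2D update), fold again."""
    tile = np.repeat(code[None, :], len(U0), axis=0)
    P0 = f_F(np.concatenate([U0, tile], axis=1))            # preliminary reconstruction
    U1 = U0 + f_T(np.concatenate([U0, P0, tile], axis=1))   # torn 2D point set
    return f_F(np.concatenate([U1, tile], axis=1)), U1      # final reconstruction

code = rng.normal(size=(CODE_DIM,))
P, U1 = decode(code, U0)
```

Note that the Tearing step is a residual update of the 2D point set, so each of the M grid points still maps to exactly one reconstructed 3D point.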
In a nutshell, TearingNet is characterized by the interaction between the Folding network and the Tearing network. In general, the interaction can be iterated several times. In the first iteration, FNet attempts a trial folding, which is evaluated by TNet from a topology perspective. The evaluation turns into a correction of the 2D topology, and the next iteration is triggered once the 2D topology is updated. Through this closed-loop design, FNet and TNet teach each other in an alternating manner.
3.2 Tearing Network
As a core contribution, the Tearing network is introduced to learn the topology of 3D point clouds, to boost reconstruction accuracy, and ultimately to enhance the representability of the latent code. As an embodiment of topology, the 2D lattice in FoldingNet can be regarded as the roughest approximation. We are thus motivated to align its topology to the input 3D point cloud using the proposed Tearing network. In this way, we avoid duplicating decoders many times as in AtlasNet groueix2018papier .
In our design, the Tearing network explicitly learns point-wise modifications of the 2D point set with a residual connection he2016deep . The 2D points are expected to move around like flocks depending on the topology chart they belong to. Hence the Tearing network behaves like tearing the 2D grid into patches, increasing (or adjusting) the topology genus. To demonstrate the effectiveness of the Tearing network, we train the whole TearingNet to overfit the Torus dataset introduced in chen2019deep , which contains 300 torus-shaped point clouds with genera ranging from 1 to 3. Figure 6 shows a genus-3 torus before and after the TNet. In panel (d), we see that the 2D grid is torn apart with "holes" to accommodate the topology of the torus.
The Tearing network can adopt shared point-wise MLPs, as in the Folding network and PointNet. With the MLP design assumed in Eq. (1), taking as an extra input the gradient of the preliminary reconstruction with respect to the 2D points helps account for local context. Alternatively, 2D convolutional layers can be used to absorb information from neighboring points on the 2D grid, although the gradient input of the former MLP design is then not kept. More details on the Tearing network architecture can be found in the supplementary material.
3.3 Graph Filtering With Torn 2D Point Set
As a complementary step, lightweight graph filtering is appended to promote graph smoothness chen2019deep . This is a predetermined signal-processing module, instead of a neural network, providing enhancement with little overhead.
Different from directly learning a globally-connected graph as in chen2019deep , a locally-connected graph is constructed from the torn 2D grid. Provided that the torn 2D grid now follows the topology of the input point cloud, the locally-connected graph naturally leads to a mesh over the reconstructed point cloud as a side output (Figure (c)). Moreover, graph filtering acts as a second coupling point to enforce the learned topology in the point cloud reconstruction, in addition to the closed-loop design in Tearing. Hence, it is preferable to filter the point cloud with this locally-connected graph. Please refer to the supplementary material for more details.
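One way to realize this step is to keep only those original grid edges whose endpoints remain close after tearing, then apply one step of neighbor averaging to the 3D reconstruction. The sketch below follows that idea; the distance threshold and smoothing weight are illustrative assumptions, not the paper's values:

```python
import numpy as np

def grid_edges(side):
    """4-neighbor edges of a side x side grid (indices into the flattened grid)."""
    idx = np.arange(side * side).reshape(side, side)
    e = [(idx[i, j], idx[i, j + 1]) for i in range(side) for j in range(side - 1)]
    e += [(idx[i, j], idx[i + 1, j]) for i in range(side - 1) for j in range(side)]
    return e

def graph_filter(P, U1, side, thresh=1.0, alpha=0.5):
    """One smoothing step over a locally-connected graph built from the torn grid U1.

    An edge of the original 2D grid survives only if its endpoints stay close
    after tearing; the 3D reconstruction P is then averaged toward graph neighbors.
    """
    kept = [(a, b) for a, b in grid_edges(side)
            if np.linalg.norm(U1[a] - U1[b]) < thresh]   # drop torn-apart edges
    acc, deg = np.zeros_like(P), np.zeros(len(P))
    for a, b in kept:
        acc[a] += P[b]; acc[b] += P[a]
        deg[a] += 1; deg[b] += 1
    out = P.copy()
    has = deg > 0
    out[has] = (1 - alpha) * P[has] + alpha * acc[has] / deg[has][:, None]
    return out, kept

# Toy torn grid: the first two grid rows are moved far away, as if torn off.
u, v = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 4))
U1 = np.stack([u.ravel(), v.ravel()], axis=1)
U1[:8] += 5.0
P = np.random.default_rng(0).normal(size=(16, 3))
P_f, kept = graph_filter(P, U1, side=4)
```

The surviving edge set doubles as a mesh over the reconstruction, consistent with the side output mentioned above.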
3.4 Graph-Conditioned AutoEncoder
We distill the architecture of TearingNet (Figure 1) into a generally defined Graph-Conditioned AutoEncoder, GCAE (Figure 7). In this regard, TearingNet is an unrolled version of GCAE with two iterations. In particular, GCAE promotes an explicit way to discover and utilize topology within an autoencoder, which we believe is useful for processing data where topology matters, e.g., images, videos, or any graph signals. In the GCAE diagram, "E", "F" and "T" correspond to the ENet, FNet and TNet presented earlier. GCAE is equipped with a graph topology that evolves by iterating FNet and TNet from an initial graph (the regular 2D-grid graph in our case). FNet "embeds" the graph into a reconstruction, while TNet attempts to "decode" a graph (in a residual form) from a reconstruction, which may tear the graph into patches or glue them together. A graph filter can be appended at the end based on the learned topology. Therefore, TearingNet/GCAE learns a topology-friendly latent representation in an unsupervised manner.
4 Subtractive Sculpture Analysis
4.1 The Curse of Point Collapse
To train point cloud generation networks, point cloud distortion needs to be evaluated, where the popular Earth Mover's Distance (EMD) and Chamfer Distance (CD) demonstrate distinct effects on tuning a network fan2017point . EMD requires solving a linear assignment problem, e.g., with the O(n^3) Hungarian method kuhn1955hungarian , while CD only requires nearest-neighbor searches (O(n^2) with a naive pairwise scan, faster with spatial indexing). However, CD is observed to be inferior to EMD with respect to visual quality achlioptas2018learning ; williams2019deep due to a phenomenon termed point-collapse in this work. Points become overpopulated around collapse centers, e.g., Figures (b) and (c), where points are colored according to their density and the overpopulated regions appear deep red.
Next we provide deeper insight into point-collapse by writing out the Chamfer Distance between the original point cloud P and the reconstruction \hat{P}:

d_{CD}(P, \hat{P}) = \frac{1}{|P|} \sum_{p \in P} \min_{\hat{p} \in \hat{P}} \| p - \hat{p} \|_2 + \frac{1}{|\hat{P}|} \sum_{\hat{p} \in \hat{P}} \min_{p \in P} \| p - \hat{p} \|_2 .   (2)

Above, the first and second distance terms of the Chamfer Distance are hereinafter referred to as the superset-distance and the subset-distance, respectively: the superset-distance is small when the reconstruction covers every original point, while the subset-distance is small when every reconstructed point lies near the original surface.
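A minimal numpy version of the two terms, assuming the mean-based normalization (the paper's exact normalization is not reproduced here) and an O(N*M) pairwise search:

```python
import numpy as np

def chamfer_terms(P, Q):
    """Superset- and subset-distance terms of a Chamfer Distance like Eq. (2).

    P: (N, d) ground truth, Q: (M, d) reconstruction.
    """
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    superset = d.min(axis=1).mean()  # every ground-truth point should be covered by Q
    subset = d.min(axis=0).mean()    # every reconstructed point should lie near P
    return superset, subset

# The top-right corner of a triangle is missing from the reconstruction:
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Q = np.array([[0.0, 0.0], [1.0, 0.0]])
sup, sub = chamfer_terms(P, Q)  # superset term penalized, subset term zero
```

Here the missing corner inflates only the superset-distance, while extra stray points in Q would inflate only the subset-distance.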
At the beginning of training, reconstructed points spatter around the space, as the network parameters are randomly initialized. Given a sufficient number of points and a dataset with ample topological structures, the subset-distance is likely to be larger than the superset-distance and thus dominant. This can be interpreted by treating reconstruction as learning a conditional occurrence probability at each spatial location given the latent code. When the shapes (point clouds) used for training fluctuate drastically, the learned distribution spreads more uniformly across space. Hence reconstructed points have a higher chance of falling outside the ground truth. This ultimately penalizes the subset-distance more than the superset-distance and makes the subset-distance dominant during training.

The ill-balanced Chamfer Distance with a dominating subset-distance may lead to the curse of point-collapse, even at the beginning of training. Suppose there exists a single point shared among all objects in a dataset; a trivial solution that minimizes the subset-distance (driving it to 0) is to collapse all reconstructed points onto that shared point. Even if there is no intersection between the object shapes, points may still collapse onto a single point-estimator close to the surface as a trivial solution minimizing the subset-distance.
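This trivial solution can be checked numerically. In the toy "dataset" below, circles of different radii all pass through the shared point (1, 0); collapsing every reconstructed point onto that shared point drives the subset-distance to exactly zero while the superset-distance stays large:

```python
import numpy as np

def subset_superset(P, Q):
    """Return (subset-distance, superset-distance) between ground truth P and reconstruction Q."""
    d = np.linalg.norm(P[:, None] - Q[None, :], axis=-1)
    return d.min(axis=0).mean(), d.min(axis=1).mean()

# Circles of radius r centered at (1 - r, 0): all pass through the point (1, 0).
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
shapes = [np.stack([1 - r + r * np.cos(t), r * np.sin(t)], axis=1)
          for r in (0.5, 1.0, 2.0)]

collapsed = np.tile([[1.0, 0.0]], (64, 1))  # all reconstructed points at the shared point

for P in shapes:
    sub, sup = subset_superset(P, collapsed)
    assert sub == 0.0   # subset-distance minimized by collapsing
    assert sup > 0.1    # superset-distance remains large: reconstruction is useless
```

A dominant subset term therefore rewards exactly this degenerate reconstruction, which is the mechanism behind the observed collapse centers.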
With point-collapse, the reconstruction quality, as well as the representability of the latent code, is degraded (Section 5.2). These insights also hold for Chamfer Distance variants, e.g., using squared superset-/subset-distances, among others.
4.2 A Subtractive Sculpture Strategy—TearingNet Training
Motivated by the curse of point-collapse, we propose a subtractive sculpture strategy to train the TearingNet. Our strategy is a two-step design that resembles how a statue is sculpted.

Molding: pretrain the Folding network (FNet) and Encoding network (ENet). Specifically, they are trained under the FoldingNet architecture (without the Tearing network). By intention, the subset-distance is scaled down significantly in the Chamfer Distance, i.e., multiplied by a small weight. We aim at roughing out a preliminary reconstruction that fully "encloses" the ground-truth surface. Unwanted points may spread outside objects or inside holes of objects (Figure (d)).

Carving: train the TearingNet autoencoder by loading the pretrained FNet and ENet. The Chamfer Distance is now used untouched, i.e., both the superset- and subset-distances are equally counted. A smaller learning rate is adopted for this fine-tuning step. The Tearing network (TNet) specifically carves out ghost points in this second step (Figure (e)) by tearing the 2D grid apart into patches (Figure (f)).
In the end, the proposed subtractive sculpture strategy designed for TearingNet training effectively avoids point-collapse while still using the Chamfer Distance as the loss function.
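The two steps differ only in how the subset-distance is weighted, which can be sketched as follows; the molding weight 0.1 below is an illustrative value, not the one used in the paper:

```python
import numpy as np

def sculpture_loss(P, Q, subset_weight=1.0):
    """Chamfer-style loss with an adjustable subset-distance weight.

    Molding step: subset_weight << 1, so the reconstruction is encouraged to
    enclose the ground truth.  Carving step: subset_weight = 1.0 restores the
    intact Chamfer Distance of Eq. (2).
    """
    d = np.linalg.norm(P[:, None] - Q[None, :], axis=-1)
    superset = d.min(axis=1).mean()
    subset = d.min(axis=0).mean()
    return superset + subset_weight * subset

rng = np.random.default_rng(0)
P = rng.normal(size=(32, 3))
Q = rng.normal(size=(32, 3))
molding = sculpture_loss(P, Q, subset_weight=0.1)  # pretraining FNet/ENet
carving = sculpture_loss(P, Q, subset_weight=1.0)  # fine-tuning the full TearingNet
```

Down-weighting the subset term in the first step removes the reward for the collapse solution analyzed in Section 4.1, at the cost of tolerating ghost points that the carving step later removes.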
5 Experimentation
5.1 Experimental Setup
Datasets: We collect objects from off-the-shelf point cloud datasets to synthesize our multi-object point cloud datasets. A square-shaped "playground" divided into grid cells is defined to host objects. Randomly picked objects are normalized and then randomly placed on the grid cells.
A first dataset, which we call CAD-model multi-object (CAMO), is composed of point clouds sampled from CAD models in ModelNet40 wu20153d and ShapeNet chang2015shapenet . More challenging data are chosen from KITTI 3D Object Detection geiger2012we : LiDAR scans, which are sparse and incomplete (e.g., Figure (a)). In total, objects from KITTI with the labels Pedestrian, Cyclist, Car, Van and Truck are "cropped" using the annotated bounding boxes. Specifically, four datasets are created with increasing playground dimensions, and the resulting KITTI multi-object datasets are called KIMO3 to KIMO6 respectively. Each KIMO dataset is composed of separate point clouds for training and testing, where each point cloud contains multiple objects.
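The dataset construction described above can be sketched as follows, assuming one object per grid cell and unit-cube-normalized objects (the cell size, object scaling and count range are assumptions for illustration):

```python
import numpy as np

def synthesize_scene(objects, n, rng):
    """Place 1..n*n normalized objects on distinct cells of an n x n playground.

    objects: list of (Ni, 3) point clouds, each assumed normalized to [0, 1]^3.
    Returns one multi-object point cloud and the object count.
    """
    count = int(rng.integers(1, n * n + 1))              # how many objects to place
    cells = rng.choice(n * n, size=count, replace=False)  # distinct grid cells
    parts = []
    for c in cells:
        obj = objects[rng.integers(len(objects))]         # pick a random object
        offset = np.array([c % n, c // n, 0.0])           # cell origin on the ground plane
        parts.append(obj + offset)
    return np.concatenate(parts, axis=0), count

rng = np.random.default_rng(0)
toy_objects = [rng.random(size=(20, 3)) for _ in range(5)]
cloud, count = synthesize_scene(toy_objects, 3, rng)
```

For a 3x3 playground this yields counts from 1 to 9, matching the combination analysis in the object counting experiment below.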
Implementation details: The Adam optimizer kingma2014adam is applied for training. We pretrain FNet (and ENet) in the molding step by suppressing the subset-distance with a small weight. In the carving step, we train TearingNet end-to-end using the intact Chamfer Distance (Eq. (2)) for 400 epochs with a smaller learning rate.
Benchmarks: We compare TearingNet with several methods: i) LatentGAN achlioptas2018learning , ii) FoldingNet yang2018foldingnet , and iii) AtlasNet groueix2018papier , where FoldingNet and AtlasNet are representative autoencoders that reconstruct point clouds by deforming 2D primitive(s). They are all trained with the Chamfer Distance as the loss function. Five patches are set for AtlasNet so that it has the same network scale as TearingNet. To compensate for network scale, a naive extension of FoldingNet is also considered, i.e., iv) Cascaded FNet, which cascades two FNets; the subtractive sculpture strategy is applied by pretraining the first FNet and then training both jointly. A last configuration is v) TearingNet trained directly with the Chamfer Distance, i.e., without the subtractive sculpture strategy. For a fair comparison, the ENets of all methods are configured as PointNet.
Table 1: Visual reconstruction results on the Torus, ModelNet40, CAMO4 and KIMO4 datasets.
Metrics  CD ()  EMD  

Datasets  MN40  K.3  K.4  K.5  K.6  MN40  K.3  K.4  K.5  K.6 
LatentGAN  3.27  7.10  11.64  17.18  19.205  0.24  1.98  3.23  3.77  4.19 
AtlasNet  3.10  4.53  6.50  8.78  11.14  0.18  1.38  2.64  3.11  3.24 
FoldingNet  3.06  4.72  6.57  9.01  11.06  0.34  1.75  2.86  3.06  4.57 
Cascaded FNet  3.17  4.77  6.67  9.13  10.94  0.24  1.64  2.44  3.46  4.96 
TearingNet  3.49  4.73  7.16  8.96  11.96  0.35  1.50  2.51  2.58  4.62 
TearingNet  2.98  4.88  6.38  8.20  10.15  0.20  0.87  1.32  1.84  2.65 
5.2 Performance Comparison
We perform the evaluation on three tasks: reconstruction, object counting and object detection.
Reconstruction: We first evaluate the reconstruction quality of the proposed TearingNet. Table 1 visualizes reconstructions from several datasets. Compared to TearingNet, FoldingNet leaves more stray points outside object surfaces, while AtlasNet results in a more irregular and unbalanced point distribution. Not surprisingly, TearingNet produces point clouds that look clean and orderly, with appearances close to the input. For the results on KIMO4 (last two columns), our proposal even recovers rough silhouettes for objects from incomplete LiDAR scans, showing its potential for scene-level point cloud completion. The 2D grids are confirmed to be torn apart to approximate the 3D topology, which explains how object topology is discovered and utilized via the iterative architecture in TearingNet/GCAE. The point density distributions exhibited in the torn 2D grids may benefit subsequent tasks such as resampling and segmentation.
Chamfer Distance (CD) and Earth Mover's Distance (EMD) are reported in Table 2. As topology complexity increases (from ModelNet40 and KIMO3 to KIMO6), TearingNet outperforms the benchmarks more significantly. By comparing TearingNet against Cascaded FNet and the directly-trained TearingNet, we demonstrate how TearingNet/GCAE is boosted by the subtractive sculpture training. Moreover, TearingNet's capability to spread points more evenly is exemplified by its higher gain in EMD than in CD. Similar results are also observed on the CAMO dataset.
Object counting: In a multi-object scene, adding objects yields a more complex topology. For the multi-object examples in Table 1, the 2D-grid patches basically coincide with the object numbers, implying that the latent codeword from TearingNet is aware of the geometric topology. To further affirm the representativeness of topologies, we next try to "count" objects directly from the codewords. In fact, counting is a practical task in applications like traffic jam detection and crowd analysis onoro2016towards ; lempitsky2010learning . In addition, we "count" torus genera (1-3) from codewords as a toy setup for additional information.
In this task, TearingNet and the benchmark autoencoders trained in the reconstruction experiment are carried over. Specifically, the KIMO datasets are chosen to simulate challenging use cases. As preparation, we feed the test dataset to the PointNet encoder to collect codewords. Next, we employ a 4-fold cross-validation to train/test an SVM classifier: the codewords are equally divided into 4 folds; only one of the four is used to train the SVM together with its count labels, while the other three are used for the counting test. The SVM is selected for this test as it does not modify the feature space learned by the autoencoders. Furthermore, the setup requires only a small number of ground-truth labels, as feature learning is achieved in an unsupervised manner while the counting task is solved in a weakly supervised manner.

The counting performance is measured by the mean absolute error (MAE) between predicted and ground-truth counts zhang2015cross . As shown on the left of Table 3, TearingNet consistently produces the smallest MAEs. For KIMO4, TearingNet brings down the MAE by more than 40% compared to FoldingNet/AtlasNet, showing its strong capability in representing scene topologies.
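The evaluation protocol can be sketched as follows. A nearest-centroid classifier stands in for the SVM (so the sketch needs only numpy), keeping the train-on-one-fold, test-on-three split and the MAE metric:

```python
import numpy as np

def count_mae(codewords, labels, rng):
    """4-fold evaluation in the spirit described above: train on one fold, test on three.

    codewords: (N, D) array of latent codes; labels: (N,) integer counts.
    A nearest-centroid classifier stands in for the SVM; returns the mean MAE
    over the four folds.
    """
    idx = rng.permutation(len(codewords))
    folds = np.array_split(idx, 4)
    errs = []
    for k in range(4):
        train = folds[k]                                        # one fold for training
        test = np.concatenate([folds[j] for j in range(4) if j != k])
        classes = np.unique(labels[train])
        centroids = np.stack([codewords[train][labels[train] == c].mean(axis=0)
                              for c in classes])
        d = np.linalg.norm(codewords[test][:, None] - centroids[None], axis=-1)
        pred = classes[d.argmin(axis=1)]                        # nearest-centroid decision
        errs.append(np.abs(pred - labels[test]).mean())         # counting MAE
    return float(np.mean(errs))
```

On well-separated synthetic codewords the protocol returns an MAE near zero; on real codewords it measures how linearly the count is encoded in feature space.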
To further illustrate that the feature space learned by TearingNet is well linked to topology (i.e., counting), a t-SNE visualization colored by counting labels is shown in Figure 15 for KIMO3. For the playground in KIMO3, there are 9 and 36 combinations when placing 1 and 2 objects, respectively. Correspondingly, 9 and 36 clusters can be observed in the t-SNE figure. As there is only 1 possible combination for arranging 9 objects, all countings of 9 aggregate into a single cluster.
Finally, the overall appearance of the t-SNE exhibits a tree structure. When inspecting a cluster of a larger counting (e.g., 9, 8, etc.), it is always surrounded by several clusters of smaller countings (e.g., 8, 7, etc.). This observation is due to a recursive encapsulation from counting 1 to 9, where counting 9 stays at the center. If we compute the average Euclidean distance from all codewords of a given counting to the mean codeword of counting 9, we observe that it increases approximately linearly as the object counting decreases (top-right of Figure 15, where error bars are also shown). It implies that the feature descriptors are distributed in a layered manner with respect to counting (i.e., topology). This insight shows that TearingNet codewords are topology-aware.
Tasks  Counting (MAE)  Detection (Accuracy %)  

Datasets  Torus  K.3  K.4  K.5  K.6  K.3  K.4  K.5  K.6 
LatentGAN  0.345  0.067  0.845  1.410  1.449  93.59  63.79  65.71  78.93 
AtlasNet  0.249  0.021  0.303  0.675  0.919  89.50  73.91  74.37  83.36 
FoldingNet  0.254  0.020  0.303  0.634  0.849  92.75  80.18  77.25  83.01 
Cascaded FNet  0.267  0.037  0.361  0.701  1.001  89.72  74.02  74.64  82.16 
TearingNet  0.251  0.017  0.331  0.621  0.996  92.97  80.31  78.28  82.63 
TearingNet  0.220  0.012  0.173  0.506  0.800  93.47  83.52  79.80  84.60 
Object detection: Having revealed the superiority of TearingNet/GCAE in point reconstruction and topology understanding, we finally devise a last experiment to demonstrate that such superiority in low-level tasks transfers to high-level understanding tasks. Specifically, we take a pedestrian detection task in an autonomous driving scenario. Similar to object counting, we train binary SVM classifiers and evaluate their performance using a 4-fold cross-validation strategy. Detection accuracies are collected on the right of Table 3. Compared to the best among the benchmarks, TearingNet performs comparably on KIMO3 and significantly better on KIMO4, 5 and 6. Note that KIMO3 is the easiest dataset as it contains the fewest combination possibilities, and LatentGAN already performs very well on it. For KIMO4, TearingNet/GCAE surpasses AtlasNet and FoldingNet by 10% and 3%, respectively.
6 Conclusion
We consider the problem of representing and reconstructing point clouds of ample topologies with an autoencoder, given latent representations in the form of a fixed-length vector. We propose a TearingNet/Graph-Conditioned AutoEncoder (GCAE) architecture that discovers and utilizes topology during decoding to tackle this task. We further address the curse of point-collapse by training our TearingNet/GCAE with a technique from subtractive sculpture, a wisdom dating back to ancient Greece. The superior capability of our proposal is demonstrated in terms of shape reconstruction and producing topology-friendly representations for point clouds.
7 Broader Impact
This work (TearingNet/GCAE) is dedicated to a general unsupervised feature learning framework, especially for scene understanding via point clouds. For robotics, self-driving cars, etc., it is critically important for machines to acquire the ability to understand the topology of their surroundings. For example, awareness of the relationships with other moving vehicles, cyclists, pedestrians, etc. can help a car identify potential future risks and hence avoid accidents. Moreover, unsupervised learning requires no human labeling and avoids introducing human mistakes when teaching machines. A common social risk of unsupervised learning systems, and of systems automating content analysis/understanding including our work, is that less human intervention might reduce the number of available jobs or reshape the structure of the job market to a certain extent. However, we believe that our technique will bring more positive contributions and impacts: i) addressing real-world challenges in the deep learning and signal processing communities; and ii) promoting products and services in related industry domains.
References
 (1) P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3D point clouds. In International Conference on Machine Learning, pages 40–49, 2018.
 (2) E. Ahmed, A. Saint, A. E. R. Shabayek, K. Cherenkova, R. Das, G. Gusev, D. Aouada, and B. Ottersten. Deep learning advances on different 3D data representations: A survey. arXiv preprint arXiv:1808.01462, 2018.
 (3) D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pages 3189–3197, 2016.
 (4) A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
 (5) S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian. Deep unsupervised learning of 3D point clouds via graph topology inference and filtering. IEEE Transactions on Image Processing, 2019.
 (6) S. Chen, B. Liu, C. Feng, C. Vallespi-Gonzalez, and C. Wellington. 3D point cloud processing and learning for autonomous driving. arXiv preprint arXiv:2003.00601, 2020.

 (7) C. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proceedings of the European Conference on Computer Vision, 2016.
 (8) T. Deprelle, T. Groueix, M. Fisher, V. Kim, B. Russell, and M. Aubry. Learning elementary structures for 3D shape generation and matching. In Advances in Neural Information Processing Systems, pages 7433–7443, 2019.

 (9) H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
 (10) X. Gao, W. Hu, and G.-J. Qi. GraphTER: Unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
 (11) A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
 (12) T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–224, 2018.
 (13) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 (14) A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris. Deep learning advances in computer vision with 3D data: A survey. ACM Computing Surveys (CSUR), 50(2):1–38, 2017.
 (15) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 (16) H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(12):83–97, 1955.
 (17) V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.
 (18) Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
 (19) J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
 (20) D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928, 2015.
 (21) A. Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
 (22) D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629, 2016.
 (23) C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 (24) C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
 (25) R. Roveri, L. Rahmann, C. Oztireli, and M. Gross. A network architecture for point cloud classification via automatic depth images generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4176–4184, 2018.
 (26) H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
 (27) P.-S. Wang, C.-Y. Sun, Y. Liu, and X. Tong. Adaptive O-CNN: A patch-based deep representation of 3D shapes. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.
 (28) Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
 (29) F. Williams, T. Schneider, C. Silva, D. Zorin, J. Bruna, and D. Panozzo. Deep geometric prior for surface reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10130–10139, 2019.
 (30) Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 (31) Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Point cloud autoencoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.
 (32) C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
 (33) Y. Zhao, T. Birdal, H. Deng, and F. Tombari. 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1009–1018, 2019.