TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations

06/17/2020
by   Jiahao Pang, et al.

Topology matters. Despite the recent success of point cloud processing with geometric deep learning, it remains arduous to capture the complex topologies of point cloud data with a learning model. Given a point cloud dataset containing objects with various genera, or scenes with multiple objects, we propose an autoencoder, TearingNet, which tackles the challenging task of representing point clouds using a fixed-length descriptor. Unlike existing works that deform primitives of genus zero (e.g., a 2D square patch) into an object-level point cloud, we propose a function which tears the primitive during deformation, letting it emulate the topology of a target point cloud. From the torn primitive, we construct a locally-connected graph to further enforce the learned topology via filtering. Moreover, we analyze a prevalent problem in processing point clouds with diverse topologies, which we call point-collapse. Correspondingly, we propose a subtractive sculpture strategy to train our TearingNet model. Experimentation finally shows the superiority of our proposal in reconstructing more faithful point clouds as well as generating more topology-friendly representations than the benchmarks.

1 Introduction

Based on a point cloud sampled from an object, humans are able to perceive the underlying shape of the object. By properly capturing the topology behind the point set, this understanding is robust to variations in scale and viewpoint. Topology reflects how the points are put together to form an object. Moreover, topology is an intrinsic property of the Riemannian manifolds usually used to model 3D shapes in geometric learning boscaini2016learning ; masci2015geodesic . Hence, it is important to seek topology-aware representations for point clouds in machine learning.

As an unsupervised learning architecture, the autoencoder (AE) ng2011sparse has been widely investigated to learn latent representations from unlabeled point clouds. In essence, it learns an approximation to the identity function under the non-trivial constraint that its encoder network must output a compact representation. The decoder network then attempts to reconstruct the point cloud from this compact representation, which is typically a fixed-length codeword characterizing the geometric properties of the point cloud. Such a codeword not only preserves the ability for reconstruction chen20203d but is also valuable for downstream tasks such as classification yang2018foldingnet ; zhao20193d ; gao2020graphter .

Unfortunately, it remains a major challenge to produce topology-friendly representations that account for object point clouds with varying genera, or scene point clouds with varying numbers of objects. In fact, existing works, including LatentGAN achlioptas2018learning , FoldingNet yang2018foldingnet , AtlasNet groueix2018papier , GraphTER gao2020graphter , etc., all target the reconstruction of point clouds with simple topology, e.g., object-level point clouds.

Another challenge for point cloud autoencoders is the training strategy. When point cloud autoencoders are trained over a dataset in which many complex topologies are mixed, a problem we call the curse of point-collapse can be observed achlioptas2018learning . Even within the first few training epochs, a high density of points may become trapped near collapse centers, unable to escape in the final reconstruction. This is caused by the intrinsic structure of the loss function, which induces this undesired training behavior under diversified topologies.

In this paper, we propose a new autoencoder, entitled TearingNet. It tears a 2D lattice apart into patches so as to match the topology of the lattice to that of a 3D point cloud, as shown in Figure 6 and Table 1. The parameterization of 3D point cloud topology is realized via a proposed Tearing network coupled with a Folding network carried over from FoldingNet yang2018foldingnet . As a result, TearingNet generates topology-friendly representations. The superiority of the learned representations is verified in experiments on shape reconstruction, object counting and object detection tasks. We also examine why the learned representations are topology-aware by analyzing the feature space. The contributions of our work are summarized below:

  1. We propose the TearingNet, which faithfully reconstructs point clouds with diverse topological structures and generates topology-friendly representations for input point clouds. We analyze our design by interpreting it as a proposed Graph-Conditioned AutoEncoder (GCAE) which discovers and utilizes topology iteratively.

  2. We propose a Tearing network (T-Net) to explicitly learn point cloud topology by tearing a regular 2D grid into patches, and exploit a Folding network (F-Net) that accepts the refined 2D topology to polish the point cloud reconstruction. A locally-connected k-NN graph is built on the torn 2D grid, which filters the point cloud towards a final faithful reconstruction.

  3. We analyze the point-collapse phenomenon by inspecting the mechanisms of the Chamfer Distance. Correspondingly, we propose a subtractive sculpture strategy which couples the training of the proposed T-Net and F-Net.

Our paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we elaborate on the design of our topology-friendly TearingNet. We then detail the curse of point-collapse and the consequent training strategy in Section 4. Experimentation is presented in Section 5 and conclusions are drawn in Section 6.

2 Related Work

Geometric deep learning has recently shown great potential in various point cloud applications ahmed2018deep . Compared to deep learning on regularly structured data like images and videos, point cloud learning is, however, more challenging, as the points are irregularly sampled over object/scene surfaces.

Conventionally, point clouds are preprocessed, e.g., voxelized maturana2015voxnet ; ioannidou2017deep or projected into multiview images su2015multi , so as to reuse deep learning frameworks proven in the image domain. After such a format conversion, conventional convolutional neural networks (CNNs) can be applied to 3D voxels or 2D pixels choy20163d ; roveri2018network . Voxelization, however, trades accuracy for data volume, and multiview projection balances accuracy/occlusion against data volume. Such compromises occur before the data is even fed into the neural network, and octree-like approaches wang2018adaptive show only limited adaptivity on these tradeoffs. Fortunately, emerging techniques that learn natively on point clouds avoid such front-end compromises.

As a feature extractor, PointNet qi2017pointnet directly operates on the input points and generates a latent codeword depicting the object shape. The latent code is invariant to point permutations thanks to a pooling operation. Once equipped with object-level or part-level labels, PointNet can serve supervised tasks like classification or segmentation. PointNet++ qi2017pointnet++ recursively applies PointNet in a hierarchical manner so as to capture local structures and enhance the ability to recognize fine-grained patterns. With similar motivations, PointCNN li2018pointcnn utilizes a hierarchical convolution and Dynamic Graph CNN (DGCNN) wang2019dynamic employs an edge-convolution over graphs. In brief, advanced feature extractors for point clouds often exploit local topology information, as illustrated by the sketch below.
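For intuition, here is a minimal PyTorch sketch of the permutation-invariant pooling idea behind such encoders; the class name MiniPointNetEncoder and all layer sizes are our own illustrative choices, not the exact configuration of any cited paper.

```python
import torch
import torch.nn as nn

class MiniPointNetEncoder(nn.Module):
    """PointNet-style encoder sketch: shared per-point MLP + max-pooling.

    The max-pool makes the codeword invariant to point permutations.
    Layer widths are illustrative assumptions.
    """

    def __init__(self, code_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )

    def forward(self, points):          # points: (B, N, 3)
        feats = self.mlp(points)        # (B, N, code_dim), shared across points
        code, _ = feats.max(dim=1)      # (B, code_dim), permutation-invariant
        return code
```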

As opposed to these advanced feature extractors, designs of current point cloud generators (e.g., the generator in a Generative Adversarial Network (GAN) or the decoder in an autoencoder) appear more preliminary, taking little advantage of topology. For example, topology is not considered in the fully-connected decoder of LatentGAN achlioptas2018learning . As pioneering works, the recent autoencoders FoldingNet yang2018foldingnet and AtlasNet groueix2018papier fold 2D lattice(s) into a 3D point cloud. For the first time, they represent topology explicitly by 2D square(s) of genus zero in their decoders.

FoldingNet adopts a PointNet-like qi2017pointnet encoder to produce latent representations. Like the PointNet encoder, the FoldingNet decoder is a network shared among points: to map each 2D point to a 3D point, it takes a 2D coordinate and the latent codeword as input and outputs a 3D coordinate (a sketch of this operation is given below). The Chamfer Distance is used to measure the error between input and output point clouds. Unfortunately, FoldingNet fails to embed geometric information for manifolds with genus higher than zero, even if the network is scaled up. This is because FoldingNet applies a continuous deformation, and topology is unchanged under continuous deformations. Hence the topology FoldingNet can represent remains that of the 2D lattice, i.e., genus zero.
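The folding operation itself reduces to a shared point-wise MLP conditioned on the codeword. Below is a hedged sketch; FoldingLayer and its layer widths are our own illustrative choices, not FoldingNet's exact architecture.

```python
import torch
import torch.nn as nn

class FoldingLayer(nn.Module):
    """One folding step: map each 2D grid point, conditioned on the
    codeword, to a 3D coordinate with a shared point-wise MLP."""

    def __init__(self, code_dim=512, in_dim=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, grid, code):                 # grid: (B, M, 2); code: (B, D)
        code_rep = code.unsqueeze(1).expand(-1, grid.shape[1], -1)
        return self.mlp(torch.cat([grid, code_rep], dim=-1))  # (B, M, 3)
```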

AtlasNet groueix2018papier and AtlasNet2 deprelle2019learning naively duplicate the decoder-lattice pair to comply with complex topology. In chen2019deep , a fully-connected graph is introduced as a companion to the FoldingNet decoder, aiming to approximate point cloud topology with a graph topology. Its main weakness is the misalignment between graph topology and point cloud topology, as it allows connections between distant point pairs. In addition, learning a fully-connected graph is expensive due to the large number of graph edges.

Motivated by the limitations of the related work, we propose an autoencoder: TearingNet. In particular, TearingNet is the first autoencoder able to use a fixed-length latent representation to reconstruct a scene-level point cloud with multiple objects or object-level point clouds with high genera. We introduce a learnable Tearing network to make the latent representation aware of the topology of point clouds. Intuitively, the Tearing network cuts the 2D lattice into pieces so as to align its genus with the 3D point cloud. A parameterization of the point cloud topology can then be easily inferred.

3 TearingNet for Topology Preservation

Figure 1: Block diagram of the proposed TearingNet.

3.1 Overview

A block diagram of the proposed autoencoder, TearingNet, is shown in Figure 1. The PointNet architecture qi2017pointnet is adopted as our encoder (E-Net) to output latent representations. On top of the FoldingNet yang2018foldingnet decoder, referred to as the Folding network (F-Net, denoted by f_F) hereinafter, a novel Tearing network (T-Net, denoted by f_T) is proposed and wedged in between two iterations of the F-Net. Finally, a graph filtering module is appended at the end to complete the TearingNet configuration.

Given an original 3D point cloud X composed of N points, the encoder generates a codeword vector c from X. A 2D point set U samples M points in a 2D plane, which are to be deformed during reconstruction. The 2D point set brings in a primitive shape and is initialized as U_0 by sampling on regular 2D-grid locations (implying a grid-graph topology).

The TearingNet decoder takes the latent code c and the 2D point set U_0 as inputs, then runs f_F, f_T and f_F sequentially as follows:

\hat{X}' = f_F(U_0, c), \qquad U_T = U_0 + f_T(U_0, \hat{X}', c), \qquad \hat{X} = f_F(U_T, c). \tag{1}

Two iterations of the shared Folding network produce the preliminary and the improved 3D point clouds, \hat{X}' and \hat{X}, respectively. The Tearing network specifically accounts for the preliminary point cloud \hat{X}' from the first iteration of the Folding network, and modifies the point set in the 2D plane. The updated 2D point set U_T is supplied to the second iteration of the Folding network. Reconstructions contain M points.

In a nutshell, TearingNet is characterized by the interaction between the Folding network and the Tearing network. In general, this interaction can be iterated several times. In the first iteration, the F-Net attempts a trial folding, which the T-Net then evaluates from a topology perspective. The evaluation yields a correction of the 2D topology, and the next iteration is triggered once the 2D point set is updated. Through this closed-loop design, the F-Net and T-Net teach each other in an alternating manner. A minimal sketch of the unrolled decoder is given below.
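The following sketch unrolls Eq. (1) in PyTorch. The function names, the grid resolution m=45 and the callables f_net/t_net (e.g., the folding and tearing sketches in this section) are illustrative assumptions, not the authors' released code.

```python
import torch

def make_grid(m=45):
    """Initialize U_0 as a regular m-by-m lattice in [-1, 1]^2
    (the resolution m=45 is an assumption, not the paper's setting)."""
    side = torch.linspace(-1.0, 1.0, m)
    u, v = torch.meshgrid(side, side, indexing="ij")
    return torch.stack([u.reshape(-1), v.reshape(-1)], dim=-1)  # (m*m, 2)

def tearingnet_decode(f_net, t_net, code, grid0):
    """Unrolled decoding of Eq. (1): fold, tear (residual in 2D), fold again.

    f_net(grid, code)           -> (B, M, 3)  folding, weights shared across calls
    t_net(grid, points3d, code) -> (B, M, 2)  torn 2D point set U_T
    """
    x_prelim = f_net(grid0, code)             # first folding: trial reconstruction
    grid_torn = t_net(grid0, x_prelim, code)  # tearing: U_T = U_0 + residual
    x_final = f_net(grid_torn, code)          # second folding with updated 2D set
    return x_final, grid_torn
```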

(a) Before tearing.
(b) After tearing.
(c) Induced mesh.
(d) Torn 2D grid.
Figure 6: Applying TearingNet to a genus-3 torus.

3.2 Tearing Network

As a core contribution, the Tearing network is introduced to learn the topology of 3D point clouds, to boost reconstruction accuracy, and ultimately to enhance the representability of the latent code. As a way of embodying topology, the 2D lattice in FoldingNet can be regarded as its roughest approximation. We are hence motivated to align the topology of the lattice to the input 3D point cloud using the proposed Tearing network. In this way, we avoid duplicating the decoder many times as in AtlasNet groueix2018papier .

In our design, the Tearing network explicitly learns point-wise modifications of the 2D point set with a residual connection he2016deep . The 2D points are expected to move around in flocks according to the topology chart they belong to. Hence, the Tearing network behaves as if tearing the 2D grid into patches, increasing (or adjusting) the topological genus.

To demonstrate the effectiveness of the Tearing network, we train the whole TearingNet to over-fit the Torus dataset introduced in chen2019deep , which contains 300 torus-shaped point clouds with genera ranging from 1 to 3. Figure 6 shows a genus-3 torus before and after the T-Net. In Figure 6(d), we see that the 2D grid is torn apart with “holes” to accommodate the topology of the torus.

The Tearing network can adopt a shared point-wise MLP as in the Folding network and PointNet. With the MLP design assumed in Eq. (1), taking as an extra input the gradient of the preliminary reconstruction with respect to the 2D coordinates helps capture local context. Alternatively, 2D convolutional layers can be used to absorb information from neighboring points on the 2D grid, although the explicit gradient input of the MLP design is then dropped. More details on the Tearing network architecture can be found in the supplementary material. A sketch of the MLP variant follows.
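A hedged sketch of the point-wise MLP variant, assuming the residual form of Eq. (1); TearingLayer and its layer widths are illustrative, and the gradient input discussed above is omitted for brevity.

```python
import torch
import torch.nn as nn

class TearingLayer(nn.Module):
    """T-Net sketch: predict a residual 2D displacement for every grid
    point from (2D coordinate, preliminary 3D point, codeword)."""

    def __init__(self, code_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + 3 + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, grid, points3d, code):   # (B,M,2), (B,M,3), (B,D)
        code_rep = code.unsqueeze(1).expand(-1, grid.shape[1], -1)
        delta = self.mlp(torch.cat([grid, points3d, code_rep], dim=-1))
        return grid + delta                    # residual update: the torn 2D set
```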

Figure 7: Graph-Conditioned AutoEncoder.

3.3 Graph Filtering With Torn 2D Point Set

As a complementary step, a lightweight graph filtering is appended to promote graph smoothness chen2019deep . This is a pre-determined signal processing module rather than a neural network, providing enhancement with little overhead.

Different from directly learning a globally-connected graph as in chen2019deep , a locally-connected graph is constructed from the torn 2D grid U_T. Provided that the torn 2D grid now follows the topology of the input point cloud, the locally-connected graph naturally induces a mesh over the reconstructed point cloud as a side output (Figure 6(c)). Moreover, graph filtering acts as a second coupling point that enforces the learned topology in the point cloud reconstruction, in addition to the closed-loop design of the Tearing network. Hence, it is preferable to filter the point cloud with this locally-connected graph; please refer to the supplementary material for more details. A sketch of one filtering step is given below.
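As a rough illustration of the idea, the sketch below builds a k-NN graph in the torn 2D domain and applies one averaging step to the 3D reconstruction. The function name and the parameters k, radius and alpha are assumptions; the paper's actual filter (see its supplementary material) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def graph_filter(grid_torn, points3d, k=8, radius=0.05, alpha=0.5):
    """One smoothing step on the reconstruction using a locally-connected
    graph built from the torn 2D grid (k, radius, alpha are illustrative).

    Edges only connect 2D neighbors that stayed close after tearing, so the
    filter respects the learned topology (no edges across torn "holes").
    """
    tree = cKDTree(grid_torn)                      # neighbors found in 2D, not 3D
    dists, idx = tree.query(grid_torn, k=k + 1)    # column 0 is the point itself
    filtered = points3d.copy()
    for i in range(len(points3d)):
        nbrs = idx[i, 1:][dists[i, 1:] < radius]   # drop neighbors across tears
        if len(nbrs):
            filtered[i] = (1 - alpha) * points3d[i] + alpha * points3d[nbrs].mean(0)
    return filtered
```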

3.4 Graph-Conditioned AutoEncoder

We distill the architecture of TearingNet (Figure 1) into a generally defined Graph-Conditioned AutoEncoder, GCAE (Figure 7). In this regard, TearingNet is an unrolled version of GCAE with two iterations. In particular, GCAE promotes an explicit way to discover and utilize topology within an autoencoder, which we believe is useful for processing any data where topology matters, e.g., images, videos, or graph signals. In the GCAE diagram, “E”, “F” and “T” correspond to the E-Net, F-Net and T-Net presented earlier. GCAE is equipped with a graph topology which evolves with the iterating F-Net and T-Net from an initial graph (the regular 2D grid U_0 in our case). The F-Net “embeds” the graph into a reconstruction, while the T-Net attempts to “decode” a graph (in residual form) from a reconstruction, which may tear the graph into patches or glue patches together. A graph filter can be appended at the end based on the learned topology. Therefore, TearingNet/GCAE can learn a topology-friendly latent representation in an unsupervised manner.

4 Subtractive Sculpture Analysis

4.1 The Curse of Point Collapse

To train point cloud generation networks, point cloud distortion needs to be evaluated, and the popular Earth Mover’s Distance (EMD) and Chamfer Distance (CD) demonstrate distinct effects on tuning a network fan2017point . EMD requires solving a linear assignment problem, e.g., via the Hungarian method kuhn1955hungarian at cubic complexity, while CD is much cheaper to compute. However, CD is observed to be inferior to EMD with respect to visual quality achlioptas2018learning ; williams2019deep due to a phenomenon referred to as point-collapse in this work: points become over-populated around collapse centers, e.g., in Figures 14(b) & (c), where points are colored according to their density and the over-populated regions appear in deep red.

Next we provide deeper insight into point-collapse by rewriting the Chamfer Distance between the original and reconstructed point clouds X and \hat{X}:

d_{CD}(X, \hat{X}) = \frac{1}{|X|} \sum_{x \in X} \min_{\hat{x} \in \hat{X}} \lVert x - \hat{x} \rVert_2 + \frac{1}{|\hat{X}|} \sum_{\hat{x} \in \hat{X}} \min_{x \in X} \lVert \hat{x} - x \rVert_2. \tag{2}

Above, the first and second distance terms in the Chamfer Distance are hereinafter referred to as the superset-distance and the subset-distance, respectively: the superset-distance vanishes when the reconstruction covers the ground truth, while the subset-distance vanishes when every reconstructed point lies on the ground truth.
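To make the decomposition concrete, here is a minimal PyTorch sketch of the two terms, together with a toy check of the imbalance analyzed next; chamfer_terms is an illustrative name.

```python
import torch

def chamfer_terms(x, x_hat):
    """Superset- and subset-distances of Eq. (2).
    x: (N, 3) ground truth; x_hat: (M, 3) reconstruction."""
    d = torch.cdist(x, x_hat)              # (N, M) pairwise Euclidean distances
    d_sup = d.min(dim=1).values.mean()     # ground truth -> reconstruction
    d_sub = d.min(dim=0).values.mean()     # reconstruction -> ground truth
    return d_sup, d_sub

# Toy check: collapsing every reconstructed point onto one ground-truth point
# drives the subset-distance to zero while the superset-distance stays large.
x = torch.rand(1024, 3)
x_hat = x[0].expand(1024, 3).clone()
d_sup, d_sub = chamfer_terms(x, x_hat)     # d_sub == 0, d_sup > 0
```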

At the beginning of training, the reconstructed points scatter around the space, since the network parameters are randomly initialized. Given a sufficient number of points and a dataset with ample topological structures, the subset-distance is likely to be larger than the superset-distance and thus dominant. This can be interpreted by treating reconstruction as learning a conditional occurrence probability at each spatial location given the latent code c. When the shapes (point clouds) used for training fluctuate drastically, the learned distribution spreads more uniformly across space. Hence there is a higher chance for reconstructed points to fall outside the ground truth X, which penalizes the subset-distance more than the superset-distance and makes the subset-distance dominant during training.

The ill-balanced Chamfer Distance with a dominating subset-distance may lead to the curse of point collapse, even at the beginning of training. Suppose there exists a single point shared among all objects in a dataset; a trivial solution that minimizes the subset-distance (down to 0) is to collapse all reconstructed points onto that shared point. Even when there are no intersections between object shapes, points may still collapse to a single point-estimator close to the surfaces as a trivial way to reduce the subset-distance.

With point-collapse, the reconstruction quality, as well as the representability of the latent code, degrades (Section 5.2). These insights also hold for Chamfer Distance variants using squared superset/subset-distances, or using the maximum of the two terms instead of their sum.

(a) Ground-truth
(b) LatentGAN
(c) FoldingNet
(d) Molding step
(e) Carving step
(f) 2D grid
Figure 14: Subtractive sculpture strategy relieves the curse of point-collapse. Points are colored/sized according to density. Over-populated regions are highlighted in red.

4.2 A Subtractive Sculpture Strategy—TearingNet Training

Motivated by the curse of point collapse, we propose a subtractive sculpture strategy to train the TearingNet. Our strategy is a two-step design that resembles how a statue is constructed.

  1. Molding - to pre-train the Folding network (F-Net) and Encoder network (E-Net). Specifically, they are trained under the FoldingNet architecture (without the Tearing network). By intention, the subset-distance is scaled down significantly in the Chamfer Distance, i.e., multiplied by a small weight. We aim at roughing out a preliminary reconstruction that fully “encloses” the ground-truth surface; unwanted points may spread outside objects or inside holes of objects (Figure 14(d)).

  2. Carving - to train the TearingNet autoencoder by loading the pre-trained F-Net and E-Net. The Chamfer Distance is now used untouched, i.e., the superset- and subset-distances are equally weighted, and a smaller learning rate is adopted for this fine-tuning step. The Tearing network (T-Net) specifically carves out the ghost points in this second step (Figure 14(e)) by tearing the 2D grid apart into patches (Figure 14(f)).

In the end, the proposed subtractive sculpture strategy can effectively avoid point-collapse while still using the Chamfer Distance as the loss function, as sketched below.
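A minimal sketch of the two-step loss, reusing chamfer_terms from the Section 4.1 sketch; the down-weighting factor of 0.1 is our own assumption, since the paper's exact weight is not given here.

```python
def sculpture_loss(x, x_hat, step="molding", weight=0.1):
    """Two-step training loss; `weight` = 0.1 is an assumed value.

    molding: suppress the subset-distance so F-Net roughs out an enclosing
             shape instead of collapsing points onto the surface.
    carving: restore the intact Chamfer Distance of Eq. (2).
    """
    d_sup, d_sub = chamfer_terms(x, x_hat)
    if step == "molding":
        return d_sup + weight * d_sub
    return d_sup + d_sub
```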

5 Experimentation

5.1 Experimental Setup

Datasets: We collect objects from off-the-shelf point cloud datasets to synthesize our multi-object point cloud datasets. A square-shaped “playground” of K×K grid cells is defined to host objects; randomly picked objects are normalized and then randomly placed on the grid cells.

A first dataset, which we call CAD model multi-object (CAMO), is composed of point clouds sampled from CAD models in ModelNet40 wu20153d and ShapeNet chang2015shapenet . More challenging datasets are built from KITTI 3D Object Detection geiger2012we , whose LiDAR scans are sparse and incomplete (e.g., Figure 14(a)). Objects from KITTI with labels Pedestrian, Cyclist, Car, Van and Truck are “cropped” using the annotated bounding boxes. Four datasets are created with playground dimensions K = 3, 4, 5, 6, and the resulting KITTI multi-object datasets are called KIMO-K, respectively. Each KIMO-K is composed of training and testing point clouds, where each point cloud contains up to K² objects.

Implementation details: The 2D point set is sampled on a square grid, and the codeword has a fixed length. The Adam optimizer kingma2014adam is applied for training with a fixed batch size. We pre-train the F-Net (and E-Net) in the molding step by suppressing the subset-distance with a small weight. In the carving step, we train TearingNet end-to-end using the intact Chamfer Distance (Eq. (2)) for 400 epochs with a smaller learning rate.

Benchmarks: We compare TearingNet with several methods: i) LatentGAN achlioptas2018learning , ii) FoldingNet yang2018foldingnet , and iii) AtlasNet groueix2018papier , where FoldingNet and AtlasNet are representative autoencoders that reconstruct point clouds by deforming 2D primitive(s). They are all trained with the Chamfer Distance as the loss function. Five (5) patches are used for AtlasNet so that it has the same network scale as TearingNet. To compensate for network scale, a naive extension of FoldingNet is also considered: iv) Cascaded F-Net, which cascades two F-Nets; the subtractive sculpture strategy is applied so that the first F-Net is pre-trained and then both F-Nets are jointly trained. A last configuration is v) TearingNet (direct), where the proposed TearingNet is trained directly with the Chamfer Distance, without the two-step strategy. For a fair comparison, the E-Nets of all methods are configured as PointNet.

Table 1: Visual comparisons of point cloud reconstruction on Torus, ModelNet40, CAMO-4 and KIMO-4 (rows: G - Ground-truth; A - AtlasNet; F - FoldingNet; T - TearingNet; U - Torn 2D grid). Points are colored according to their indices.
Metrics                         CD                                  EMD
Datasets             MN40   K.-3   K.-4   K.-5   K.-6      MN40   K.-3   K.-4   K.-5   K.-6
LatentGAN            3.27   7.10   11.64  17.18  19.205    0.24   1.98   3.23   3.77   4.19
AtlasNet             3.10   4.53   6.50   8.78   11.14     0.18   1.38   2.64   3.11   3.24
FoldingNet           3.06   4.72   6.57   9.01   11.06     0.34   1.75   2.86   3.06   4.57
Cascaded F-Net       3.17   4.77   6.67   9.13   10.94     0.24   1.64   2.44   3.46   4.96
TearingNet (direct)  3.49   4.73   7.16   8.96   11.96     0.35   1.50   2.51   2.58   4.62
TearingNet           2.98   4.88   6.38   8.20   10.15     0.20   0.87   1.32   1.84   2.65
Table 2: Evaluation of point cloud reconstruction (lower is better).

5.2 Performance Comparison

We perform the evaluation on three tasks: reconstruction, object counting and object detection.

Reconstruction: We first evaluate the reconstruction quality of the proposed TearingNet. Table 1 visualizes reconstructions from several datasets. Compared to TearingNet, FoldingNet leaves more stray points outside object surfaces, while AtlasNet results in a more irregular and unbalanced point distribution. Not surprisingly, TearingNet produces point clouds that look clean and orderly, with appearances close to the input. On KIMO-4 (last two columns), our proposal even recovers rough silhouettes of objects from incomplete LiDAR scans, showing its potential for scene-level point cloud completion. The 2D grids are confirmed to be torn apart to approximate the 3D topology, which explains how object topology is discovered and utilized via the iterative TearingNet/GCAE architecture. The point density distributions exhibited in the torn 2D grids may benefit subsequent tasks such as re-sampling and segmentation.

Chamfer Distance (CD) and Earth Mover’s Distance (EMD) are reported in Table 2. As topology complexity increases (from ModelNet40 and KIMO-3 to KIMO-6), TearingNet outperforms the benchmarks more significantly. Comparing TearingNet against Cascaded F-Net and TearingNet (direct) demonstrates how TearingNet/GCAE is boosted by the subtractive sculpture training. Moreover, TearingNet’s capability to spread points more evenly is exemplified by its higher gain in EMD than in CD. Similar results are also observed on the CAMO dataset.

Object counting: In a multi-object scene, adding objects yields a more complex topology. In the multi-object examples of Table 1, the 2D-grid patches largely coincide with the object counts, implying that the latent codeword from TearingNet is aware of the geometric topology. To further affirm the representativeness of topology, we next “count” objects directly from the codewords. Counting is a practical task in applications like traffic jam detection and crowd analysis onoro2016towards ; lempitsky2010learning . In addition, we “count” the torus genus (1-3) from codewords as a toy setup for additional information.

Figure 15: t-SNE visualization of TearingNet codewords.

In this task, TearingNet and the benchmark autoencoders trained in the reconstruction experiment are carried over. Specifically, the KIMO datasets are chosen to simulate challenging use cases. As preparation, we feed the test dataset to the PointNet encoder to collect codewords. Next, we employ a 4-fold cross-validation to train/test an SVM classifier: codewords are equally divided into 4 folds, then only one of the four is used to train the SVM with its count labels while the other three are used for the count test. An SVM is selected because it does not modify the feature space learned by the autoencoders. Further, this setup requires only a small number of ground-truth labels: feature learning is achieved in an unsupervised manner, while the counting task is weakly supervised. A sketch of this protocol is given below.
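A sketch of this weakly supervised protocol with scikit-learn; count_from_codewords is an illustrative name, and the SVM kernel and random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import mean_absolute_error

def count_from_codewords(codes, counts, n_folds=4, seed=0):
    """4-fold protocol sketch: train the SVM on ONE fold, test on the
    other three (as described above); average the MAE over the folds."""
    maes = []
    for rest_idx, one_idx in KFold(n_folds, shuffle=True,
                                   random_state=seed).split(codes):
        svm = SVC().fit(codes[one_idx], counts[one_idx])   # train on the single fold
        preds = svm.predict(codes[rest_idx])               # test on the other three
        maes.append(mean_absolute_error(counts[rest_idx], preds))
    return float(np.mean(maes))
```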

The counting performance is measured by the mean absolute error (MAE) between predicted and ground-truth counts zhang2015cross . As shown in the left part of Table 3, TearingNet consistently produces the smallest MAEs. For KIMO-4, TearingNet brings the MAE down by more than 40% compared to FoldingNet/AtlasNet, showing its strong capability in representing scene topologies.

To further illustrate that the feature space learned by TearingNet is well linked to topology (i.e., counting), a t-SNE visualization for KIMO-3 is shown in Figure 15, with points colored by counting labels. For the KIMO-3 playground, there are 9 and 36 combinations when placing 1 and 2 objects, respectively, and correspondingly 9 and 36 clusters can be observed in the t-SNE figure. As there is only 1 possible arrangement of 9 objects, all codewords with count 9 aggregate into a single cluster.

Finally, the overall appearance of the t-SNE plot exhibits a tree structure. Inspecting a cluster of a larger count (e.g., 9, 8, etc.), it is always surrounded by clusters of smaller counts (e.g., 8, 7, etc.). This observation is due to a recursive encapsulation from count 1 to 9, with count 9 at the center. If we compute the average Euclidean distance d̄_k from all codewords of count k to the mean codeword of count 9, we observe that d̄_k increases approximately linearly as the object count k decreases (top-right of Figure 15, where error bars of d̄_k are also shown). This implies that the feature descriptors are distributed in a layered manner with respect to counting (i.e., topology), showing that TearingNet codewords are topology-aware. The sketch below makes this measurement concrete.
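A short sketch of this measurement; layer_distances is an illustrative name and assumes codewords and count labels are available as NumPy arrays.

```python
import numpy as np

def layer_distances(codes, counts, center_count=9):
    """Average Euclidean distance d_k from codewords of each count k to the
    mean codeword of the maximal count (9 for KIMO-3)."""
    center = codes[counts == center_count].mean(axis=0)
    return {int(k): float(np.linalg.norm(codes[counts == k] - center,
                                         axis=1).mean())
            for k in np.unique(counts)}
```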

Tasks                        Counting (MAE)                   Detection (Accuracy %)
Datasets             Torus  K.-3   K.-4   K.-5   K.-6     K.-3   K.-4   K.-5   K.-6
LatentGAN            0.345  0.067  0.845  1.410  1.449    93.59  63.79  65.71  78.93
AtlasNet             0.249  0.021  0.303  0.675  0.919    89.50  73.91  74.37  83.36
FoldingNet           0.254  0.020  0.303  0.634  0.849    92.75  80.18  77.25  83.01
Cascaded F-Net       0.267  0.037  0.361  0.701  1.001    89.72  74.02  74.64  82.16
TearingNet (direct)  0.251  0.017  0.331  0.621  0.996    92.97  80.31  78.28  82.63
TearingNet           0.220  0.012  0.173  0.506  0.800    93.47  83.52  79.80  84.60
Table 3: Evaluation of object counting and object detection based on codewords.

Object detection: Having revealed the superiority of TearingNet/GCAE in point cloud reconstruction and topology understanding, we finally devise a last experiment to demonstrate that this superiority in low-level tasks transfers to high-level understanding tasks. Specifically, we take a pedestrian detection task under an autonomous driving scenario. Similar to object counting, we train binary SVM classifiers and evaluate their performance with the 4-fold cross-validation strategy. Detection accuracy is collected in the right part of Table 3. Compared to the best benchmark, TearingNet performs comparably on KIMO-3 and significantly better on KIMO-4, -5 and -6. Note that KIMO-3 is the easiest dataset, as it contains the fewest combination possibilities, and LatentGAN already performs very well there. For KIMO-4, TearingNet/GCAE surpasses AtlasNet and FoldingNet by 10% and 3%, respectively.

6 Conclusion

We consider the problem of representing and reconstructing point clouds of ample topologies with an autoencoder, given latent representations in the form of a fixed-length vector. We propose the TearingNet/Graph-Conditioned AutoEncoder (GCAE) architecture, which discovers and utilizes topology during decoding to tackle this task. We further address the curse of point-collapse by training our TearingNet/GCAE with a technique inspired by subtractive sculpture, a craft dating back to ancient Greece. The superior capability of our proposal is demonstrated in terms of shape reconstruction and the production of topology-friendly representations for point clouds.

7 Broader Impact

This work (TearingNet/GCAE) is dedicated to a general unsupervised feature learning framework, especially for scene understanding via point clouds. For robotics, self-driving cars, etc., it is critically important for machines to acquire the ability to understand the topology of their surroundings. For example, awareness of the relationship with other moving vehicles, cyclists, pedestrians, etc. can help a car identify potential future risks and hence avoid accidents. Moreover, unsupervised learning requires no human labeling and avoids introducing human mistakes when teaching machines. A common social risk of unsupervised learning systems and systems automating content analysis/understanding, including our work, is that less human intervention might reduce the number of available job positions or reshape the structure of the job market to a certain extent. However, we believe that our technique will bring more positive contributions and impacts: i) addressing real-world challenges in the deep learning and signal processing communities; and ii) promoting products and services in related industry domains.

References

  • (1) P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3D point clouds. In International Conference on Machine Learning, pages 40–49, 2018.
  • (2) E. Ahmed, A. Saint, A. E. R. Shabayek, K. Cherenkova, R. Das, G. Gusev, D. Aouada, and B. Ottersten. Deep learning advances on different 3D data representations: A survey. arXiv preprint arXiv:1808.01462, 2018.
  • (3) D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pages 3189–3197, 2016.
  • (4) A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
  • (5) S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian. Deep unsupervised learning of 3D point clouds via graph topology inference and filtering. IEEE Transactions on Image Processing, 2019.
  • (6) S. Chen, B. Liu, C. Feng, C. Vallespi-Gonzalez, and C. Wellington. 3D point cloud processing and learning for autonomous driving. arXiv preprint arXiv:2003.00601, 2020.
  • (7) C. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proceedings of the European Conference on Computer Vision, 2016.
  • (8) T. Deprelle, T. Groueix, M. Fisher, V. Kim, B. Russell, and M. Aubry. Learning elementary structures for 3D shape generation and matching. In Advances in Neural Information Processing Systems, pages 7433–7443, 2019.
  • (9) H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
  • (10) X. Gao, W. Hu, and G.-J. Qi. GraphTER: Unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
  • (11) A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
  • (12) T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–224, 2018.
  • (13) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • (14) A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris. Deep learning advances in computer vision with 3D data: A survey. ACM Computing Surveys (CSUR), 50(2):1–38, 2017.
  • (15) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • (16) H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • (17) V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.
  • (18) Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
  • (19) J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
  • (20) D. Maturana and S. Scherer. Voxnet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928, 2015.
  • (21) A. Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
  • (22) D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629, 2016.
  • (23) C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • (24) C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • (25) R. Roveri, L. Rahmann, C. Oztireli, and M. Gross. A network architecture for point cloud classification via automatic depth images generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4176–4184, 2018.
  • (26) H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
  • (27) P.-S. Wang, C.-Y. Sun, Y. Liu, and X. Tong. Adaptive O-CNN: A patch-based deep representation of 3D shapes. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.
  • (28) Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
  • (29) F. Williams, T. Schneider, C. Silva, D. Zorin, J. Bruna, and D. Panozzo. Deep geometric prior for surface reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10130–10139, 2019.
  • (30) Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • (31) Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.
  • (32) C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
  • (33) Y. Zhao, T. Birdal, H. Deng, and F. Tombari. 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1009–1018, 2019.