The proliferation of 3D point cloud data from LiDAR sensors on self-driving cars and from commercial depth sensors such as Microsoft's Kinect has driven growing interest in machine learning techniques for interpreting 3D data. Just as generative models for text and images gained traction following the success of classification models in those domains, the success of neural networks on point clouds (Qi et al., 2017a, b; Zaheer et al., 2017) has spurred research into point cloud generative models (Yang et al., 2018; Li et al., 2018; Groueix et al., 2018b; Yang et al., 2019; Groueix et al., 2018a; Li et al., 2019).
The decoder itself is a generative model that samples from a 3-dimensional distribution corresponding to the input point cloud. If we naively train a neural network to generate point clouds, we end up with an extremely under-constrained problem: reconstruction loss may be low, but the points will be distributed non-uniformly over the surface of the object, and an additional meshing step will still be required.
In order to remedy this, we draw inspiration from computer graphics and topology and learn a generative model by way of deformation. We begin with a goal object we hope to reconstruct and an initial point cloud. At each step, we learn a small modification of the current point cloud so that the output moves closer to the goal point cloud. Because the underlying object that generated the point cloud sample is a solid object, we constrain our model to learn deformations that obey the rules governing valid topological and mesh deformations. As we will see, this limits our network to invertible deformations. Incidentally, by focusing on this type of deformation, we can frequently use a mesh from the initial point cloud as a template for our final object, simplifying the meshing pipeline.
2 Related Work

Traditional computer graphics approaches such as ball-pivoting, marching cubes, and Poisson reconstruction work by algorithmically building a mesh to fit a fixed set of points.
Ezuz et al. (2019) connect this generation of meshes to the problem of matching points on corresponding meshes. Their problem then becomes finding a plausible, smooth, and accurate map between triangular meshes. As investigated in their work, this requires a harmonic and reversible mapping between meshes, which can be found by optimizing a loss function for individual pairs of meshes with a true underlying correspondence. This approach has the benefit of allowing the transfer of texture and other mesh properties from the source mesh to the target mesh.
The closest work to ours in the deep learning community is FoldingNet (Yang et al., 2018). Their model learns a series of deformations (or folds) from an initial fixed 2D grid of points to a final object. It seems intuitive that starting from a 3D surface could lead to an easier learning problem, just as it is simpler to mold clay than to fold origami. This idea is generalized in AtlasNet, where multiple 2D grids are used with multiple generators. In addition, they explore using a sphere to sample points, but neglect the theoretical advantages connecting this approach to topology and the concept of deformations (Groueix et al., 2018b).
While our work focuses on the generation of point clouds, a similar vein of work has been explored for mesh reconstruction by Wang et al. (2018) and Kanazawa et al. (2018). Both methods proceed by deforming an initial mesh (given a priori or learned, respectively) into a final shape, but their graph-based formulation does not allow sampling an arbitrary number of points.
3 Theoretical Justification
To justify our claim that our model is a more natural generative model for point clouds, we must examine the complexity of transforming various initial distributions into the kind of surfaces encoded by point clouds. To simplify matters, we limit our discussion to maps from a 3-ball and from a 3-dimensional isotropic Gaussian to a 2-sphere.
Point clouds are typically sampled from the surfaces of real-world objects or realistic meshes. Intuitively, this means that the majority of points are located on the boundary of objects and not on the interior. If we hope to perfectly capture the surface of objects in point clouds with a continuous, invertible map as has become common practice in many generative models, we must consider the topology of our initial shape (Rezende and Mohamed, 2015; Grathwohl et al., 2018; Behrmann et al., 2018).
Proposition 1. There is no continuous invertible map between the 3-ball and the 2-sphere that respects the boundary.

Proof. This follows from Brouwer's fixed point theorem. ∎
Proposition 2. There is no continuous invertible map between ℝ³ and the 2-sphere that respects the boundary.

Proof. This follows from the relationship between Hausdorff spaces and compact subspaces. ∎
These results show that if we wish to learn a transformation that is continuous, invertible, and achieves no error on the boundary, we must choose an initial point cloud that is topologically close to our goal point cloud. Otherwise, our efforts will be thwarted by the underlying topology. For these reasons we choose to start from a hollow sphere of points with radius 1 as our initial shape. We believe that this is the most topologically similar structure that is simple to sample from. As we will see in section 4.2, this decision gives us additional advantages.
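One practical consequence of choosing the unit sphere is that it is trivial to sample from: normalizing isotropic Gaussian draws yields directions distributed uniformly over the sphere. A minimal sketch (the function name and point count are ours, not the paper's):

```python
import numpy as np

def sample_sphere(n, seed=0):
    """Sample n points uniformly from the unit 2-sphere.

    Rotation invariance of the isotropic Gaussian means that
    normalized draws are uniformly distributed in direction.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

pts = sample_sphere(2048)  # initial point cloud of radius 1
```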
4.1 Architecture

The architecture for our network is built on the idea of repeated deformations to an initial point cloud based on the encoding generated by a Deep Set model (Zaheer et al., 2017). This model takes inspiration from FoldingNet, with its series of "folds" replaced by deformations and its graph-based encoder exchanged for a set-based encoder (Yang et al., 2018).
On top of this basic framework, we introduce a forward deformation network (going from a random sphere to the goal point cloud) and a backward deformation network (from the goal point cloud back to a sphere). Training both networks simultaneously is meant to regularize the transformation, inspired by the computer graphics community's requirement of an invertible function, without limiting ourselves to models with an analytic inverse. The forward architecture is depicted in Figure 1; the backward architecture is identical.
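As a rough illustration of this structure (not the paper's implementation: layer widths, initialization, and the residual form of the deformation step are all our assumptions), a permutation-invariant Deep Set encoder followed by one conditioned deformation step might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Hypothetical sizes; the paper does not specify layer widths.
D_HID, D_CODE = 64, 128
W1 = rng.standard_normal((3, D_HID)) * 0.1
W2 = rng.standard_normal((D_HID, D_CODE)) * 0.1
W3 = rng.standard_normal((3 + D_CODE, D_HID)) * 0.1
W4 = rng.standard_normal((D_HID, 3)) * 0.1

def deep_set_encode(points):
    """Permutation-invariant encoder: shared per-point MLP, then sum pool."""
    h = relu(points @ W1)          # (n, D_HID), applied per point
    return relu(h @ W2).sum(0)     # (D_CODE,) -- independent of point order

def deform_step(points, code):
    """One residual deformation of the cloud, conditioned on the encoding."""
    cond = np.concatenate([points, np.tile(code, (len(points), 1))], axis=1)
    return points + relu(cond @ W3) @ W4   # small learned offset per point

target = rng.standard_normal((512, 3))     # goal point cloud
code = deep_set_encode(target)
sphere = rng.standard_normal((512, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
out = deform_step(sphere, code)            # one step toward the target
```

In the full model, several such deformation steps would be chained in each direction, with the weights trained end to end.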
4.2 Loss Function
Our loss function encodes our desire to minimize distortion. While the majority of point cloud generative models are trained with Chamfer distance (autoencoders) or maximum likelihood (normalizing flows), our loss function takes inspiration from the computer graphics community. Note that each p is a point, N(p) is its neighborhood, and f is our learned function:

L_deform = Σ_p ‖ (f(p) − mean_{q∈N(p)} f(q)) − (p − mean_{q∈N(p)} q) ‖²    (1)
Equation 1 can be seen as an approximation to the Laplacian loss frequently used in mesh generation tasks (Wang et al., 2018). While the true Laplacian would require knowledge of the neighborhood of each point, we can use properties of the sphere to approximate it and then enforce that the neighborhood persists in the output point cloud. For each point on the sphere, its neighborhood may be approximated simply as its k-nearest neighbors in Euclidean distance. This gives us the neighborhood function N(·) required in equation 1. Our final loss function is a weighted combination of the Chamfer loss in both directions and the deformation loss.
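A small numpy sketch of these two ingredients. The brute-force k-NN, the value of k, and the uniform Laplacian weighting are our simplifications; the paper does not fix these choices here:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (n,3) and b (m,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def deformation_loss(sphere, deformed, k=8):
    """Uniform-Laplacian surrogate: each point's offset from its neighborhood
    centroid should persist through the deformation. Neighborhoods are
    k-nearest neighbors computed once on the initial sphere."""
    d = np.linalg.norm(sphere[:, None, :] - sphere[None, :, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]          # k-NN, excluding self
    lap_in = sphere - sphere[nbrs].mean(axis=1)       # Laplacian coords before
    lap_out = deformed - deformed[nbrs].mean(axis=1)  # and after deformation
    return np.mean(np.sum((lap_out - lap_in) ** 2, axis=1))
```

Note that an identity deformation incurs zero deformation loss, while any map that rearranges or stretches local neighborhoods is penalized.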
5 Experimental Results
5.1 Dataset

In order to train and test our model, we sample points uniformly from the surface of meshes provided in the ShapeNet dataset (Chang et al., 2015). The 51,300 meshes cover 55 distinct categories, including airplanes, cars, lamps, and doors. All of the results in the following sections are trained on a portion of each category.
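Uniform surface sampling is conventionally done by picking triangles with probability proportional to their area and then drawing uniform barycentric coordinates within each; a sketch under those standard conventions (the function name is ours):

```python
import numpy as np

def sample_mesh_surface(verts, faces, n, seed=0):
    """Sample n points uniformly from the surface of a triangle mesh."""
    rng = np.random.default_rng(seed)
    tri = verts[faces]                                   # (F, 3, 3)
    # Triangle areas via the cross product of two edge vectors.
    area = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=n, p=area / area.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1                                     # reflect into the triangle
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tri[idx]
    return ((1 - u - v)[:, None] * t[:, 0]
            + u[:, None] * t[:, 1]
            + v[:, None] * t[:, 2])
```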
5.2 Metrics and Results
We evaluate our reconstructions with metrics analogous to those used to evaluate GANs under the names precision and recall (Lucic et al., 2018; Li et al., 2018).
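The exact metric definitions are not reproduced here, but a common coverage-style (recall-like) measure counts the fraction of reference points that have a generated point within a threshold. A hypothetical sketch (the function name and threshold are ours, not necessarily the paper's definition):

```python
import numpy as np

def coverage(generated, reference, tau=0.05):
    """Fraction of reference points within distance tau of some
    generated point (a recall-like measure of surface coverage)."""
    d = np.linalg.norm(reference[:, None, :] - generated[None, :, :], axis=-1)
    return (d.min(axis=1) < tau).mean()
```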
The results in Table 1 show that our topologically motivated approach is competitive with traditional methods for point cloud generation. In other words, we do not pay a substantial cost for incorporating topological similarity into our loss function.
While these metrics give us insight into the quality of our reconstruction, their failure to capture structure makes them poor measures of accuracy for the underlying surface the point cloud describes. As can be seen in Figure 2, our model is able to produce plausible meshes without a secondary meshing procedure. This is accomplished simply by feeding the vertices of a sphere mesh into our pretrained network. While omitted here for brevity, our ablation experiments show that omitting our deformation loss leads to under-constrained transformations that cause intersecting faces.
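The template-mesh trick can be illustrated independently of the trained network: deform the vertices of a sphere-like mesh and keep its faces unchanged. The octahedron template and the stand-in deformation below are our toy substitutes for a real sphere mesh and the pretrained network:

```python
import numpy as np

# Toy "sphere mesh": an octahedron with unit vertices.
verts = np.array([[1., 0, 0], [-1, 0, 0], [0, 1, 0],
                  [0, -1, 0], [0, 0, 1], [0, 0, -1]])
faces = np.array([[0, 2, 4], [2, 1, 4], [1, 3, 4], [3, 0, 4],
                  [2, 0, 5], [1, 2, 5], [3, 1, 5], [0, 3, 5]])

def deform(points):
    """Stand-in for the learned deformation network: stretch along z."""
    return points * np.array([1.0, 1.0, 2.0])

new_verts = deform(verts)   # push only the vertices through the network
mesh = (new_verts, faces)   # connectivity carries over -- no meshing step
```

Because the deformation is constrained to be invertible and low-distortion, the template's face connectivity remains a valid triangulation of the output.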
Table 1: Per-category reconstruction results.

| Category | D2F | Coverage |
Because our model is limited to deformations between topologically similar objects, the standard reconstruction metrics show a decrease in performance compared to models trained on reconstruction loss alone. These shortcomings may be due to the deformation loss incentivizing the deletion of smaller details.
6 Conclusion

Traditionally, generative models for point clouds have been based entirely on the properties of sets: permutation invariance and conditional independence of the points given the underlying shape. Although these properties are crucial for efficiently modeling point cloud distributions, they ignore the relationship between point clouds and meshes, making mesh generation less effective. Our preliminary results show that methods incorporating this knowledge can be trained solely on point sets and yet produce a generative process for meshes. Our hope is that this progress motivates further research into how set models can benefit from external structure, either as regularization or as a means of improving downstream tasks.
References

- J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J.-H. Jacobsen (2018). Invertible residual networks. arXiv preprint.
- A. X. Chang et al. (2015). ShapeNet: an information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University / Princeton University / Toyota Technological Institute at Chicago.
- D. Ezuz, J. Solomon, and M. Ben-Chen (2019). Reversible harmonic maps between discrete surfaces. ACM Transactions on Graphics 38(2), pp. 1-12.
- W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud (2018). FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint.
- T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018a). 3D-CODED: 3D correspondences by deep deformation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 230-246.
- T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018b). AtlasNet: a papier-mâché approach to learning 3D surface generation. arXiv preprint arXiv:1802.05384.
- A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018). Learning category-specific mesh reconstruction from image collections. In ECCV, Lecture Notes in Computer Science, pp. 386-402.
- C.-L. Li, T. Simon, J. Saragih, B. Póczos, and Y. Sheikh (2019). LBS autoencoder: self-supervised fitting of articulated meshes to point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11967-11976.
- C.-L. Li, M. Zaheer, Y. Zhang, B. Póczos, and R. Salakhutdinov (2018). Point cloud GAN. arXiv preprint.
- M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018). Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700-709.
- C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a). PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652-660.
- C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b). PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099-5108.
- D. J. Rezende and S. Mohamed (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
- N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang (2018). Pixel2Mesh: generating 3D mesh models from single RGB images. In ECCV, Lecture Notes in Computer Science, pp. 55-71.
- G. Yang, X. Huang, Z. Hao, M.-Y. Liu, S. Belongie, and B. Hariharan (2019). PointFlow: 3D point cloud generation with continuous normalizing flows. arXiv preprint arXiv:1906.12320.
- Y. Yang, C. Feng, Y. Shen, and D. Tian (2018). FoldingNet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206-215.
- M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. Smola (2017). Deep sets. In Advances in Neural Information Processing Systems, pp. 3391-3401.