A fundamental problem in machine learning is: given a data set, learn a generative model that can efficiently generate arbitrarily many new sample points from the domain of the underlying distribution (Bishop, 2006). Deep generative models use deep neural networks as a tool for learning complex data distributions (Kingma and Welling, 2013; Oord et al., 2016; Goodfellow et al., 2014). In particular, Generative Adversarial Networks (GANs; Goodfellow et al., 2014) have drawn attention because of their success in many applications. Compelling results have been demonstrated on different types of data, including text, images and videos (Lamb et al., 2016; Karras et al., 2017; Vondrick et al., 2016). Their wide applicability has also been shown in many important problems, including data augmentation (Salimans et al., 2016), image style transformation (Zhu et al., 2017), image captioning (Dai et al., 2017) and art creation (Kang, 2017).
Recently, capturing 3D information has been garnering attention. There are many different data types for 3D information, such as CAD models, 3D meshes and point clouds. 3D point clouds are becoming popular because they store more information than 2D images and because sensors capable of collecting point clouds have become more accessible. These include Lidar on self-driving cars, Kinect for Xbox and face identification sensors on phones. Compared to other formats, point clouds can be easily represented as a set of points, which has several advantages, such as permutation invariance. Algorithms that can effectively learn from this type of data are an emerging field (Qi et al., 2017a, b; Zaheer et al., 2017; Kalogerakis et al., 2017; Fan et al., 2017). However, compared to supervised learning, unsupervised generative models for 3D data are still underexplored (Achlioptas et al., 2017; Oliva et al., 2018).
Extending existing GAN frameworks to handle 3D point clouds, or more generally set data, is not straightforward. In this paper, we begin by formally defining the problem and discussing its difficulty (Section 2). To circumvent these challenges, we propose a deep generative adversarial network (PC-GAN) with a hierarchical sampling and inference network for point clouds. The proposed architecture learns a stochastic procedure which can generate new point clouds, as well as draw samples from point clouds, without explicitly modeling the underlying density function (Section 3). The proposed PC-GAN is a generic algorithm which can incorporate many existing GAN variants. By utilizing the properties of point clouds, we further propose a sandwiching objective that considers both upper and lower bounds of the Wasserstein distance estimate, which can lead to a tighter approximation (Section 4). Evaluation on ModelNet40 shows excellent generalization capability of PC-GAN. We first show that we can sample from the learned model to generate new point clouds and that the latent representations learned by the inference network provide meaningful interpolations between point clouds. We further show conditional generation results on unseen classes of objects to demonstrate the superior generalization ability of PC-GAN. Lastly, we also provide several interesting studies, such as classification and point cloud generation from images (Section 6).
2 Problem Definition and Difficulty
We begin by defining the problem and the ideal generative process for point clouds over objects. Formally, a point cloud for an object is a set of low dimensional vectors $X = \{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^d$, where $d$ is usually 3 and $n$ can be infinite. Point clouds for $M$ different objects form a collection of $M$ sets $X^{(1)}, \dots, X^{(M)}$, where each $X^{(m)}$ is as defined above. Thus, we need a generative model for sets which should be able to:
Sample entirely new sets $X$, as well as
Sample more points for a given set, i.e. $x \sim p(x \mid X)$.
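The two requirements above can be illustrated with a toy two-level sampler. This is only a sketch under assumed distributions (a Gaussian object-level latent and Gaussian points around it). The names `sample_object` and `sample_points` and all constants are hypothetical and are not part of the proposed model. In the toy the object latent is known, whereas in the actual problem it has to be inferred from the given points.

```python
import random

def sample_object(rng):
    """Draw an object-level latent (here a 2D center) from an assumed prior."""
    return (rng.gauss(0.0, 5.0), rng.gauss(0.0, 5.0))

def sample_points(theta, n, rng, spread=0.5):
    """Draw n points i.i.d. given the object latent theta (assumed Gaussian)."""
    cx, cy = theta
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread)) for _ in range(n)]

rng = random.Random(0)
theta = sample_object(rng)                # first requirement: a new object ...
cloud = sample_points(theta, 100, rng)    # ... and its point cloud
more = sample_points(theta, 50, rng)      # second requirement: more points, same object
```

Points within one cloud share `theta`, so they are only conditionally independent. Pooling points across many objects would give the marginal, which carries little information about any single object.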
Figure 1: The marginal distribution of points is not useful; we have to learn the joint distribution. An illustrative parallel can be drawn between the pixels of an image and a point cloud: the marginal distribution of pixels is quite uninformative, and one has to consider the joint distribution of all pixels in one image (not across different images).
De Finetti's theorem allows us to express the set probability in a factored format as $p(X) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta$ for some suitably defined $\theta$. In the case of point clouds, the latent variable $\theta$ can be interpreted as an object representation. In this view, the factoring can be understood as follows: given an object $\theta$, the points $x_i$ in the point cloud can be considered as i.i.d. samples from $p(x \mid \theta)$, an unknown latent distribution representing object $\theta$. The joint likelihood can be expressed as:
$$p(X, \theta) = p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta). \quad (1)$$
Attempts have been made to characterize this joint likelihood with explicit density estimation (Jian and Vemuri, 2005; Strom et al., 2010; Eckart et al., 2015). However, such approaches have had limited success because the point cloud conditional density is highly non-linear and complicated (examples of point clouds can be seen in Figures 1, 3 and 7).
With the advent of implicit generative models, like GANs (Goodfellow et al., 2014), it is possible to model complicated distributions, but it is not clear how to extend the existing GAN frameworks to handle the hierarchical model of point clouds developed above. The aim of most GAN frameworks (Goodfellow et al., 2014; Arjovsky et al., 2017; Li et al., 2017a) is to learn to generate new samples from a distribution $p(x)$, for fixed dimensional $x$, given a training dataset. We, in contrast, aim to develop a new method where the training data is a set of sets, and learning from this training data would allow us to generate points from this hierarchical distribution (e.g. point clouds of 3D objects). To re-emphasize the incompatibility, note that the training data of traditional GANs is a set of fixed dimensional instances; in our case, however, it is a set of sets. Using existing GANs to learn just the marginal distribution $p(x)$, in the case of point clouds, is not of much use because the marginal distribution is quite uninformative, as can be seen from Figure 1.
One approach is to model the distribution of the whole point cloud set jointly, i.e., $p(\{x_i\}_{i=1}^{n})$. In this setting, a naïve application of a traditional GAN is possible by treating the point cloud as a finite dimensional vector, fixing the number and order of the points (reducing the problem to instances in $\mathbb{R}^{n \times 3}$), with a DeepSets (Zaheer et al., 2017) classifier as the discriminator to distinguish real sets from fake sets. However, this would not work because the IPM guarantees behind the traditional GAN no longer hold (e.g. in the case of Arjovsky et al. (2017), 1-Lipschitz functions over sets are not well-defined). The probabilistic divergence approximated by a DeepSets classifier is not clear and might be ill-defined. Counterexamples that break the IPM guarantees can be easily found, as we show next.
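Consider a simple GAN (Goodfellow et al., 2014) with a DeepSets classifier as the discriminator. In order to generate coherent sets of variable size, we consider a generator $G$ having two noise sources: $u$ and $z_i$. To generate a set, $u$ is sampled once and $z_i$ is sampled for $i = 1, \dots, n$ to produce $n$ points in the generated set. Intuitively, fixing the first noise source $u$ selects a set and ensures that the points generated by repeated sampling of $z_i$ are coherent and belong to the same set. The setup is depicted in Figure 2. In this setup, the GAN minimax problem would be:
$$\min_{G} \max_{D} \; \mathbb{E}_{\theta \sim p(\theta),\, x_i \sim p(x_i \mid \theta)} \left[ \log D(\{x_i\}) \right] + \mathbb{E}_{u \sim p(u),\, z_i \sim p(z_i)} \left[ \log \left( 1 - D(\{G(u, z_i)\}) \right) \right].$$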
Now consider the case when there exists an ‘oracle’ mapping which maps each sample point deterministically to the object it originated from, i.e. . A valid example is when different $\theta$ lead to conditional distributions $p(x \mid \theta)$ with non-overlapping support. Let and ignores then optimization becomes
Thus, we can achieve the lower bound by only matching the $p(\theta)$ component, while the conditional $p(x \mid \theta)$ is allowed to remain arbitrary. We note that, besides this hand-crafted example, there still exist good solutions which can lead to successful training. Empirically, we found that a GAN with a simple DeepSets-like discriminator most of the time fails to learn to generate point clouds even after converging, although it sometimes does result in reasonable generations. So simply using a DeepSets classifier without any constraints in a simple GAN in order to handle sets does not always lead to a valid generative model. We need additional constraints for GANs with a simple DeepSets-like discriminator to exclude such bad solutions and lead to more stable training.
3 Proposed Method
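Although directly learning point cloud generation under a GAN formulation is difficult, as described in Section 2, given a $\theta$, learning $p(x \mid \theta)$ is reduced to learning a 2 or 3 dimensional distribution, which fits standard GAN settings. Formally, given a $\theta$, we train a generator $G_x(z, \theta)$ such that $x = G_x(z, \theta)$, where $z \sim p(z)$, follows $p(x \mid \theta)$ by optimizing a (pseudo) probabilistic divergence $D(\mathbb{P} \| \mathbb{G})$ between the distribution $\mathbb{G}$ of $G_x(z, \theta)$ and $p(x \mid \theta)$, which is denoted as $\mathbb{P}$. The full objective can be written as $\min_{G_x} \mathbb{E}_{\theta \sim p(\theta)} \left[ D(\mathbb{P} \| \mathbb{G}) \right]$.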
Although GANs have been extended to learn conditional distributions (Mirza and Osindero, 2014; Isola et al., 2017), they require the conditioning variables to be observed, such as a one-hot label or a given image. In the case of point clouds, however, we only have partial knowledge of the conditional: we only have groupings of points coming from the same object, but no representation $\theta$ of the conditional or the object other than the points themselves. Naïvely modeling $\theta$ as a one-hot vector, indicating which object the points belong to in the training data, cannot generalize to unseen testing data. Instead, we need a richer representation for $\theta$, which is an unobserved random variable. Thus, we need to infer $\theta$ during training. The proposed algorithm has to concurrently learn the inference network $Q$, which encodes $X$, while we learn $p(x \mid \theta)$.
Neural Network Realization
Our solution comprises a generator $G_x(z, \psi)$ which takes in a noise source $z \in \mathbb{R}^{d_1}$ and a descriptor $\psi \in \mathbb{R}^{d_2}$ encoding information about the distribution of $\theta$. Here $d_1$ is the dimensionality of the per-point noise source and $d_2$ is the size of the descriptor vector. Typically, we would need $d_1 \ll d_2$, as $\psi$ needs to encode much more information because it decides the whole shape of the conditional distribution, whereas $z$ just corresponds to different samples from it. Another interpretation of the descriptor $\psi$ is as an embedding of a distribution over $\theta$. For a given $\theta_0$, the descriptor $\psi$ would encode information about the distribution $\delta(\theta - \theta_0)$, and samples generated as $x = G_x(z, \psi)$ would follow the distribution $p(x \mid \theta_0)$. More generally, $\psi$ can be used to encode more complicated distributions regarding $\theta$ as well. In particular, it could be used to encode the posterior $p(\theta \mid X)$ for a given sample set $X$, such that $x = G_x(z, \psi)$ follows the posterior predictive distribution:
$$p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta.$$
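As $\theta$ is unobserved, we use an inference network $Q(X)$ to learn the most informative descriptor $\psi$ about the distribution $p(\theta \mid X)$ by minimizing the divergence between $p(x \mid X)$, which is denoted as $\mathbb{P}$, and the distribution of $G_x(z, Q(X))$ given $Q(X)$, which we denote as $\mathbb{G}$ by abuse of notation, that is, $\min_{G_x, Q} \mathbb{E}_{\theta \sim p(\theta)} \left[ D(\mathbb{P} \| \mathbb{G}) \right]$.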
A major hurdle in taking this path is that $X$ is a set of points, which can vary in size and in the permutation of its elements. This complicates the design of $Q$, as traditional neural networks cannot handle such inputs, and is possibly the reason for the absence of such a framework in the literature, despite it being a natural solution for the important problem of generative modeling of point clouds. We overcome this challenge by constructing the inference network with recent advances in deep learning for dealing with sets (Qi et al., 2017a; Zaheer et al., 2017). This allows it to handle a variable number of input points in arbitrary order, yet yield a consistent descriptor $\psi$.
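The property required of the inference network, namely a fixed-size descriptor that does not depend on the order or number of input points, can be sketched with a pooled per-point feature map. This is a minimal illustration of the pooling idea only, not the DeepSets architecture used in the paper. The feature map `phi` is a hand-written stand-in for a learned network, and all names are hypothetical.

```python
import math

def phi(point):
    """Per-point features; a fixed stand-in for a learned per-point network."""
    x, y = point
    return [x, y, x * x, y * y, x * y]

def descriptor(points):
    """Mean-pool per-point features: same output size for any set size or order."""
    feats = [phi(p) for p in points]
    n = len(feats)
    return [sum(f[k] for f in feats) / n for k in range(len(feats[0]))]

def same_descriptor(a, b, tol=1e-9):
    """Compare two descriptors up to floating-point rounding."""
    return all(math.isclose(u, v, rel_tol=tol, abs_tol=tol) for u, v in zip(a, b))
```

Because the pooling is a mean, reordering the points leaves the descriptor unchanged up to rounding, and sets of different sizes map to vectors of the same length.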
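After training $G_x$ and $Q$, we use the trained $Q$ to collect the inferred descriptors $Q(X)$ and train another generator $G_\theta(u) \sim Q(X)$ for higher hierarchical sampling. Here $u \sim p(u)$ is the other noise source, independent of $z$, and the dimensionality of this noise source is typically much smaller than $d_2$. In addition to layer-wise training, joint training could further boost performance. The full generative process for sampling one point cloud can be represented as $\{x_i\}_{i=1}^{n} = \{G_x(z_i, G_\theta(u))\}_{i=1}^{n}$, where $z_1, \dots, z_n \sim p(z)$ and $u \sim p(u)$.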
We call the proposed algorithm for point cloud generation PC-GAN, as shown in Figure 3. The conditional distribution matching with a learned inference network in PC-GAN can also be interpreted as an encoder-decoder formulation (Kingma and Welling, 2013). The difference between it and the point cloud autoencoder (Achlioptas et al., 2017; Yang et al., 2018) will be discussed in Section 5.
4 Different Divergences for Matching Point Clouds
Given two point clouds, one commonly used heuristic distance measure is the Chamfer distance (Achlioptas et al., 2017). On the other hand, if we treat each point cloud as a 3 dimensional distribution, we can adopt a broader class of probabilistic divergences for training $G_x$. Instead of using divergence estimates which require density estimation (Jian and Vemuri, 2005; Strom et al., 2010; Eckart et al., 2015), we are interested in implicit generative models with a GAN-like objective (Goodfellow et al., 2014), which has been demonstrated to learn complicated distributions.
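For reference, one common variant of the Chamfer distance can be written in a few lines. The sketch below assumes the symmetric form with squared Euclidean nearest-neighbour distances averaged in each direction. Other variants exist, and the text does not specify which one the cited works use.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point clouds (lists of tuples).

    Sums the mean squared nearest-neighbour distance from a to b and from b to a.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a
```

Each point is matched to its nearest neighbour independently, so several points may share one target. This differs from the one-to-one matching behind the Wasserstein estimates discussed in this section.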
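To train the generator $G_x$ using a GAN-like objective for point clouds, we need a discriminator $f(\cdot)$ which distinguishes between the generated samples and the true samples conditioned on $\theta$. Combining this with the inference network $Q(X)$ discussed in Section 3, if we use an IPM-based GAN (Arjovsky et al., 2017; Mroueh and Sercu, 2017; Mroueh et al., 2017), the objective can be written as
$$\min_{G_x, Q} \mathbb{E}_{\theta \sim p(\theta)} \left[ \max_{f \in \Omega_f} \mathbb{E}_{x \sim p(x \mid \theta)} \left[ f(x) \right] - \mathbb{E}_{z \sim p(z),\, X \sim p(X \mid \theta)} \left[ f(G_x(z, Q(X))) \right] \right]. \quad (4)$$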
4.1 Tighter Solutions via Sandwiching
In our setting, each point in the point cloud can be considered to correspond to a single image when we train GANs over images. An example is illustrated in Figure 4, where the samples from MMD-GAN (Li et al., 2017a) trained on CelebA consist of both good and bad faces. In the case of images, quality evaluation primarily focuses on the coherence of individual images, and the few bad ones are usually left out. In the case of point clouds, by contrast, we need many sampled points together to represent an object, and the presence of outlier points degrades the quality of the object. Thus, when training a generative model for point clouds, we need to ensure a much lower distance between the true distribution and the generator distribution than would be needed for images.
We begin by noting that the popular Wasserstein GAN (Arjovsky et al., 2017) aims to optimize $G$ by $\min w(\mathbb{P}, \mathbb{G})$, where $w(\mathbb{P}, \mathbb{G})$ is the Wasserstein distance between the true distribution $\mathbb{P}$ and the generated distribution $\mathbb{G}$ of $G$. Many GAN works (e.g. Arjovsky et al. (2017)) approximate $w(\mathbb{P}, \mathbb{G})$ in dual form (a maximization problem), such as (4), by neural networks. The resulting estimate $W_L(\mathbb{P}, \mathbb{G})$ is a lower bound of the true Wasserstein distance, as neural networks can only recover a subset of the 1-Lipschitz functions (Arora et al., 2017) required in the dual form. However, finding a lower bound $W_L(\mathbb{P}, \mathbb{G})$ for $w(\mathbb{P}, \mathbb{G})$ may not be an ideal surrogate for solving a minimization problem $\min w(\mathbb{P}, \mathbb{G})$. In the optimal transport literature, the Wasserstein distance is usually estimated by an approximate matching cost, $W_U(\mathbb{P}, \mathbb{G})$, which gives us an upper bound of the true Wasserstein distance.
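We propose to combine, in general, a lower bound estimate $W_L$ and an upper bound estimate $W_U$ of the Wasserstein distance $w(\mathbb{P}, \mathbb{G})$ by sandwiching the solution between the two, i.e. we solve the following minimization problem:
$$\min_{G} W_U(\mathbb{P}, \mathbb{G}) \quad \text{such that} \quad W_U(\mathbb{P}, \mathbb{G}) - W_L(\mathbb{P}, \mathbb{G}) < \lambda. \quad (5)$$
The problem can be simplified and solved using the method of Lagrange multipliers as follows:
$$\min_{G} W_s(\mathbb{P}, \mathbb{G}) := (1 - s)\, W_U(\mathbb{P}, \mathbb{G}) + s\, W_L(\mathbb{P}, \mathbb{G}). \quad (6)$$
By solving the new sandwiched problem (6), we show in the following lemma that under certain conditions we obtain a better estimate of the Wasserstein distance:
Lemma. Suppose we have two approximators to the Wasserstein distance: an upper bound $W_U$ and a lower bound $W_L$, such that for all $\mathbb{P}, \mathbb{G}$ we have $(1 + \epsilon_1)\, w(\mathbb{P}, \mathbb{G}) \le W_U(\mathbb{P}, \mathbb{G}) \le (1 + \epsilon_2)\, w(\mathbb{P}, \mathbb{G})$ and $(1 - \epsilon_2)\, w(\mathbb{P}, \mathbb{G}) \le W_L(\mathbb{P}, \mathbb{G}) \le (1 - \epsilon_1)\, w(\mathbb{P}, \mathbb{G})$ respectively, for some $\epsilon_2 > \epsilon_1 > 0$ and $\epsilon_1 > \epsilon_2 / 3$. Then, using the sandwiched estimator $W_s$ from (6), we can achieve a tighter estimate of the Wasserstein distance than using either estimator alone, i.e.
$$\exists s : \; |W_s(\mathbb{P}, \mathbb{G}) - w(\mathbb{P}, \mathbb{G})| < \min \left\{ |W_U(\mathbb{P}, \mathbb{G}) - w(\mathbb{P}, \mathbb{G})|,\; |W_L(\mathbb{P}, \mathbb{G}) - w(\mathbb{P}, \mathbb{G})| \right\}.$$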
4.1.1 Upper Bound Implementation
The primal form of the Wasserstein distance is defined as
$$w(\mathbb{P}, \mathbb{G}) = \inf_{\gamma \in \Gamma(\mathbb{P}, \mathbb{G})} \int \|x - y\| \, d\gamma(x, y),$$
where $\gamma$ is the coupling of $\mathbb{P}$ and $\mathbb{G}$. The Wasserstein distance is also known as optimal transport (OT) or earth mover's distance (EMD). As the name suggests, when $w(\mathbb{P}, \mathbb{G})$ is estimated with a finite number of samples $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$, we find the one-to-one matching between $X$ and $Y$ such that the total pairwise distance is minimal. The resulting minimal total (average) pairwise distance is $w(X, Y)$. In practice, finding the exact matching efficiently is non-trivial and still an open research problem (Peyré et al., 2017). Instead, we consider an approximation provided by Bertsekas (1985). It is an iterative algorithm where each iteration operates like an auction, whereby unassigned points $x \in X$ bid simultaneously for the closest points $y \in Y$, thereby raising their prices. Once all bids are in, points are awarded to the highest bidder. The crux of the algorithm lies in designing a non-greedy bidding strategy. By construction, the algorithm is embarrassingly parallelizable, which is favourable for GPU implementation. One can show that the algorithm terminates with a valid matching and that the resulting matching cost $W_U(X, Y)$ is an $\epsilon$-approximation of $w(X, Y)$. Thus, the estimate can serve as an upper bound, i.e.
$$w(X, Y) \le W_U(X, Y) \le (1 + \epsilon)\, w(X, Y).$$
We remark that estimating the Wasserstein distance with finite samples via the primal form is only favorable for low dimensional data, such as point clouds. The error of the empirical estimate in the primal form is $O(1/n^{1/d})$ (Weed and Bach, 2017). When the dimension $d$ is large (e.g. images), we cannot accurately estimate the Wasserstein distance in the primal form, nor its upper bound, with a small minibatch.
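The auction procedure described in this subsection can be sketched as follows. This is a minimal sequential (one bidder at a time) variant of the textbook auction algorithm, not the parallel GPU implementation referred to above. The function names and the default `eps` are assumptions made for illustration. For this textbook variant, the returned matching cost exceeds the optimal cost by at most `n * eps`.

```python
import itertools
import math

def pairwise_cost(xs, ys):
    """Euclidean distance matrix between two equally sized point sets."""
    return [[math.dist(x, y) for y in ys] for x in xs]

def auction_matching(cost, eps=1e-3):
    """Return assign, where assign[i] is the column matched to row i."""
    n = len(cost)
    prices = [0.0] * n
    owner = [None] * n           # owner[j]: row currently holding column j
    assign = [None] * n
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Net value of column j for bidder i: negative cost minus current price.
        best_j, best_v, second_v = None, -math.inf, -math.inf
        for j in range(n):
            v = -cost[i][j] - prices[j]
            if v > best_v:
                best_j, second_v, best_v = j, best_v, v
            elif v > second_v:
                second_v = v
        # Bid: raise the price by the margin over the runner-up, plus eps.
        margin = best_v - second_v if n > 1 else 0.0
        prices[best_j] += margin + eps
        if owner[best_j] is not None:        # previous holder is outbid
            assign[owner[best_j]] = None
            unassigned.append(owner[best_j])
        owner[best_j] = i
        assign[i] = best_j
    return assign

def matching_cost(cost, assign):
    return sum(cost[i][j] for i, j in enumerate(assign))

def brute_force_cost(cost):
    """Exact optimum by enumeration; only feasible for very small n."""
    n = len(cost)
    return min(matching_cost(cost, p) for p in itertools.permutations(range(n)))
```

Dividing `matching_cost` by the number of points gives an average pairwise distance. Since any valid matching costs at least as much as the optimal one, this average is an upper estimate of the empirical Wasserstein distance between the two samples.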
4.2 Lower Bound Implementation
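The dual form of the Wasserstein distance is defined as
$$w(\mathbb{P}, \mathbb{G}) = \sup_{f \in \mathcal{L}_1} \mathbb{E}_{x \sim \mathbb{P}} \left[ f(x) \right] - \mathbb{E}_{x \sim \mathbb{G}} \left[ f(x) \right],$$
where $\mathcal{L}_k$ is the set of $k$-Lipschitz functions, whose Lipschitz constant is no larger than $k$. In practice, deep neural networks $f_\phi$ parameterized by $\phi$ with constraints $f_\phi \in \Omega_\phi$ (Arjovsky et al., 2017) result in a distance approximation
$$W_L(\mathbb{P}, \mathbb{G}) = \max_{f_\phi \in \Omega_\phi} \mathbb{E}_{x \sim \mathbb{P}} \left[ f_\phi(x) \right] - \mathbb{E}_{x \sim \mathbb{G}} \left[ f_\phi(x) \right].$$
If there exists $k$ such that $\Omega_\phi \subseteq \mathcal{L}_k$, then $W_L(\mathbb{P}, \mathbb{G}) / k \le w(\mathbb{P}, \mathbb{G})$ is a lower bound. To enforce $\Omega_\phi \subseteq \mathcal{L}_k$, Arjovsky et al. (2017) propose a weight clipping constraint $\Omega_c$, which constrains every weight to be in $[-c, c]$ and guarantees that $\Omega_c \subseteq \mathcal{L}_k$ for some $k$. Based on Arora et al. (2017), the set of Lipschitz functions $\Omega_c$ realized by neural networks is actually a strict subset, $\Omega_c \subset \mathcal{L}_k$.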
In practice, choosing the clipping range is non-trivial. Small ranges limit the capacity of networks, while large ranges result in numerical issues during training. On the other hand, in addition to weight clipping, several constraints (regularizations) have been proposed with better empirical performance, such as the gradient penalty (Gulrajani et al., 2017) and the $L_2$ ball (Mroueh and Sercu, 2017). However, there is no guarantee that the resulting functions are still Lipschitz or that the resulting distances are lower bounds of the Wasserstein distance. To take advantage of those regularizations while keeping the Lipschitz guarantee, we propose a simple variation that combines them with weight clipping, which always ensures Lipschitz functions.
There exists such that
Note that, if , then . Therefore, from Proposition A, for any regularization of the discriminator (Gulrajani et al., 2017; Mroueh and Sercu, 2017; Mroueh et al., 2017), we can always combine it with a weight clipping constraint to ensure a valid lower bound estimate of the Wasserstein distance, and enjoy the advantage that it is numerically stable when we use a large clipping range compared with the original weight-clipping WGAN (Arjovsky et al., 2017).
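The idea of combining an arbitrary regularized update with weight clipping can be sketched on a toy linear critic. This is only an illustration. The linear critic, the function names and the explicit bound `c * sqrt(dim)` are assumptions chosen so that the Lipschitz property can be checked by hand, and they do not reproduce the networks used in the paper.

```python
import math

def clip_weights(weights, c):
    """Project every weight onto the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

def update_then_clip(weights, grads, lr, c):
    """One ascent step (grads may come from any regularized loss), then clip."""
    stepped = [w + lr * g for w, g in zip(weights, grads)]
    return clip_weights(stepped, c)

def critic(weights, x):
    """Toy linear critic f(x) = <w, x>."""
    return sum(w * xi for w, xi in zip(weights, x))

def lipschitz_bound(c, dim):
    """With every weight in [-c, c], the L2 norm of w is at most c * sqrt(dim)."""
    return c * math.sqrt(dim)
```

For the linear critic, the Cauchy-Schwarz inequality gives |f(x) - f(y)| <= ||w|| ||x - y||, so clipping fixes a Lipschitz constant regardless of which regularizer produced the gradients. A large `c` only loosens the bound and does not remove it.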
5 Related Works
Generative Adversarial Networks (Goodfellow et al., 2014) aim to learn a generator that can sample data following the data distribution. Compelling results on learning complex data distributions with GANs have been shown on images (Karras et al., 2017), speech (Lamb et al., 2016), text (Yu et al., 2016; Hjelm et al., 2017), video (Vondrick et al., 2016) and 3D voxels (Wu et al., 2016). However, GAN algorithms for 3D point clouds are still underexplored (Achlioptas et al., 2017). Many alternative objectives for training GANs have been studied. Most of them are the dual form of an $f$-divergence (Goodfellow et al., 2014; Mao et al., 2017; Nowozin et al., 2016), integral probability metrics (IPMs) (Zhao et al., 2016; Li et al., 2017a; Arjovsky et al., 2017; Gulrajani et al., 2017) or IPM extensions (Mroueh and Sercu, 2017; Mroueh et al., 2017). Genevay et al. (2018) learn the generative model by the approximated primal form of the Wasserstein distance (Cuturi, 2013).
Instead of training a generative model on the data space directly, one popular approach is to combine it with an autoencoder (AE), which is called an adversarial autoencoder (AAE) (Makhzani et al., 2015). AAE constrains the encoded data to follow a normal distribution via a GAN loss, which is similar to VAE (Kingma and Welling, 2013) with the KL-divergence on the latent space replaced by any GAN loss. Tolstikhin et al. (2017) provide a theoretical explanation for AAE by connecting it with the primal form of the Wasserstein distance. Another variant of AAE trains a second generative model to learn the distribution of the encoded data instead of enforcing it to be similar to a known distribution (Engel et al., 2017; Kim et al., 2017). Achlioptas et al. (2017) explore an AAE variant for point clouds. They use a specially designed encoder network (Qi et al., 2017a) to learn a compressed representation of point clouds before training a GAN on the latent space. However, their decoder is restricted to be an MLP which generates a fixed number $m$ of points, where $m$ has to be pre-defined. That is, the output of their decoder is fixed to be $3m$ dimensional for 3D point clouds, while the output of the proposed $G_x$ is only 3 dimensional, and $G_x$ can generate arbitrarily many points by sampling different random noise $z$ as input. Yang et al. (2018) and Groueix et al. (2018b) propose decoders similar to $G_x$, with fixed grids, to break the aforementioned limitation of Achlioptas et al. (2017), but they use the heuristic Chamfer distance without any theoretical guarantee and do not exploit generative models for point clouds. The proposed PC-GAN can also be interpreted as an encoder-decoder formulation. However, the underlying interpretation is different. We start from De Finetti's theorem to learn both $p(X \mid \theta)$ and $p(\theta)$ with an inference network interpretation of $Q$, while Achlioptas et al. (2017) focus on learning $p(\theta)$ without modeling $p(X \mid \theta)$.
GANs for learning conditional distributions (conditional GANs) have been studied on images with single conditioning (Mirza and Osindero, 2014; Pathak et al., 2016; Isola et al., 2017; Chang et al., 2017) or multiple conditioning (Wang and Gupta, 2016). The point cloud case is still underexplored. Also, most of these works assume the conditioning is given (e.g. labels and base images) without learning the inference during training. Training GANs with inference is studied by Dumoulin et al. (2016) and Li et al. (2017b); however, their goal is to infer the random noise of generators instead of the semantic latent variable of the data. Li et al. (2018) is a parallel work aiming to learn a GAN and unseen latent variables simultaneously, but they only study image and video datasets.
Lastly, we briefly review some recent developments in deep learning on point clouds. Instead of transforming point clouds into 3D voxels or projecting objects into different views in order to use convolution (Su et al., 2015; Maturana and Scherer, 2015; Wu et al., 2015; Qi et al., 2016; Tatarchenko et al., 2017), which raises concerns about memory usage, one direction is to design permutation invariant operations that deal with set data directly (Qi et al., 2017a; Zaheer et al., 2017; Qi et al., 2017b). Wang et al. (2018) use graph convolution (Bronstein et al., 2017) to utilize local neighborhood information. Most of the applications studied in those works focus on classification and segmentation tasks, but they can be used to implement the inference network of PC-GAN.
6 Experiments
In this section we demonstrate the point cloud generation capabilities of PC-GAN. As discussed in Section 5, we refer to Achlioptas et al. (2017) as AAE, since it can be treated as an AAE extension to point clouds, and we use the implementation provided by the authors for experiments. The sandwiching objective $W_s$ for PC-GAN combines $W_L$ and $W_U$ with the mixture 1:20, without tuning, for all experiments. $W_L$ is a GAN loss combining Arjovsky et al. (2017) and Mroueh and Sercu (2017), and we adopt Bertsekas (1985) for $W_U$, as discussed in Section 4.2 and Section 4.1.1. We parametrize $Q$ in PC-GAN by DeepSets (Zaheer et al., 2017). The review of DeepSets is in Appendix B. Other detailed configurations of each experiment can be found in Appendix C. Next, we study both synthetic 2D point clouds and the ModelNet40 benchmark dataset.
6.1 Synthetic Datasets
We created a simple 2D synthetic point cloud dataset from parametric distributions, on which we can carry out thorough evaluations of the proposed PC-GAN and draw comparisons with AAE (Achlioptas et al., 2017). We generate 2D point clouds for circles, where the centers of the circles follow a mixture of four Gaussians with means equal to . The covariance matrices were set to be and we used equal mixture weights. The radius of the circles was drawn from a uniform distribution. One sampled circle is shown in Figure 4(a). We sampled circles for the training and testing data, respectively.
For PC-GAN, the inference network is a stack of 3 Permutation Equivariance Layers with hidden layer size to be and output size to be (e.g. ). In this experiment the total number of parameters for PC-GAN is . For AAE encoder, we follow the same setting in Achlioptas et al. (2017); for decoder, we increase it from 3 to be 4 layers with larger capacities and set output size to be dimensions for points. We consider two model configurations AAE-10 and AAE-20, which use and units for the hidden layers of AAE decoder, respectively. The total number of parameters (encoder+decoder) are and for AAE-10 and AAE-20, respectively. Detailed model configurations are provided in the supplementary material.
We evaluated the conditional distributions on the testing circles. For the proposed PC-GAN, we pass the same points into the inference network , then sample points with the conditional generator to match the output number of AAE. We measured the empirical distributions of the centers and the radius of the generated circles conditioning on the testing data for PC-GAN. Similarly, we measured the reconstructed circles of the testing data for AAE. The results are shown in Figure 5.
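The evaluation described above needs a center and a radius for each generated circle. The text does not state how these were measured, so the sketch below assumes a simple estimator: the mean of the points as the center and the mean distance to that center as the radius. The circle sampler and all parameter values are likewise hypothetical.

```python
import math
import random

def sample_circle_cloud(center, radius, n, rng):
    """Draw n points uniformly (in angle) from a circle."""
    cx, cy = center
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        points.append((cx + radius * math.cos(t), cy + radius * math.sin(t)))
    return points

def estimate_center_radius(points):
    """Assumed estimator: mean point as center, mean distance to it as radius."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    r = sum(math.hypot(p[0] - cx, p[1] - cy) for p in points) / n
    return (cx, cy), r
```

Applying `estimate_center_radius` to every generated (or reconstructed) cloud yields the empirical center and radius distributions that can then be compared against those of the test circles.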
As Figure 5 shows, both AAE and PC-GAN can successfully recover the center distribution, but AAE does not learn the radius distribution well. Even if we increase the number of hidden units of the decoder (AAE-20), which almost doubles the number of parameters, the performance is still not satisfactory. Compared with AAE, the proposed PC-GAN recovers both the center and the radius distributions well with fewer parameters. The gap in memory usage could be larger if we configure AAE to generate more points, while the model size required for PC-GAN is independent of the number of points. The reason is that the MLP decoder adopted by Achlioptas et al. (2017) wastes parameters on nearby points. A much larger model (more parameters) can potentially boost the performance, yet it would still be restricted to generating a fixed number of points for each object, as discussed in Section 5.
6.2 Conditional Generation on ModelNet40
We consider the ModelNet40 (Wu et al., 2015) benchmark, which contains 40 classes of objects. There are training and testing instances. We follow Zaheer et al. (2017) for pre-processing. For each object, we sampled points from the mesh representation and normalized them to have zero mean (for each axis) and unit (global) variance. During training, we augment the data by uniformly rotating rad on the - plane. For PC-GAN, the random noise is fixed to be dimensional for all experiments. For other settings, we follow Achlioptas et al. (2017).
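The pre-processing described above can be sketched as follows. Two readings are assumed here and may differ from the original pipeline. First, "unit (global) variance" is taken to mean that the variance pooled over all coordinates of a cloud equals one. Second, the rotation is taken to act in the x-y plane (about the z axis), with the angle left as a parameter because the set of angles is not recoverable from the text.

```python
import math

def normalize(points):
    """Center each axis, then scale so the pooled coordinate variance is 1."""
    n = len(points)
    means = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - means[k] for k in range(3)] for p in points]
    pooled_var = sum(v * v for p in centered for v in p) / (3 * n)
    std = math.sqrt(pooled_var)
    return [tuple(v / std for v in p) for p in centered]

def rotate_xy(points, angle):
    """Rotate every point by `angle` radians in the x-y plane; z is unchanged."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]
```

The sketch does not guard against a degenerate cloud whose points all coincide, for which the pooled variance would be zero.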
Training on Single Class
We start from a smaller model which is only trained on single class of objects. For AAE, the latent code size for its encoder is and the decoder outputs points for each object. The number of parameters for encoder and decoder are in total. Similarly, we set the size of PC-GAN latent variable (the output of ) to be dimensional. The number of parameters for and is less than in total.
Training on All Classes
We also train the proposed model on all objects in the training set. The size of AAE latent code of is increased to be . The number of parameters of its encoder and decoder is . We set the size of PC-GAN latent variable to be dimensional as well. The number of parameters for and are around in total.
6.2.1 Quantitative Comparison
Table 1: Distance to Face (D2F) and Coverage for each data set, with one column each for AAE and for PC-GAN trained with the three different objectives.
We first evaluate the performance of trained conditional generator and the inference network . We are interested in whether the learned model can model the distribution of the unseen testing data. Therefore, for each testing point cloud, we use to infer the latent variable , then use to generate points. We then compare the distribution between the input point cloud and the generated point clouds.
Many criteria based on finite sample estimation can be used for evaluation, such as $f$-divergences and IPMs. However, estimators with finite samples are either biased or have high variance (Peyré et al., 2017; Wang et al., 2009; Póczos et al., 2012; Weed and Bach, 2017). Also, it is impossible to use these estimators with infinitely many samples, even when such samples are accessible.
For ModelNet40, the meshes of each object are available. In many statistically guaranteed distance estimates, the adopted statistics are commonly based on distances between nearest neighbors (Wang et al., 2009; Póczos et al., 2012). Therefore, we propose to measure the performance with the following criterion. Given a point cloud $X = \{x_i\}_{i=1}^{n}$ and a mesh, which is a collection of faces $F = \{F_j\}_{j=1}^{m}$, we measure the distance to face (D2F) as
$$D2F(X, F) = \frac{1}{n} \sum_{i=1}^{n} \min_{j} D(x_i, F_j),$$
where $D(x_i, F_j)$ is the Euclidean distance from $x_i$ to the face $F_j$. This distance is similar to the Chamfer distance, which is commonly used for measuring images and point clouds (Achlioptas et al., 2017; Fan et al., 2017), with infinitely many samples from the true distributions (meshes).
However, an algorithm can achieve low or zero D2F by focusing only on a small portion of the point cloud (mode collapse). Therefore, we are also interested in whether the generated points recover enough of the support of the distribution. We compute the Coverage ratio as follows. For each point, we find its nearest face and treat this face as covered¹. We then compute the ratio of the number of covered faces to the total number of faces of the mesh. A sampled mesh is shown in Figure 6, where the detailed parts have more faces (the faces are non-uniform). Thus, it is difficult to get high coverage for AAE or PC-GAN trained with a limited number of sampled points. However, the coverage ratio serves as an indicator of how much detail the model recovers.
¹ We should apply thresholding to ignore outlier points. In our experiments, we observe that not excluding outliers does not change the conclusions of the comparison.
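Both criteria can be sketched for triangle meshes as follows. The sketch rests on two assumptions. D2F is taken to be the mean, over points, of the distance to the nearest face, which is our reading of the description. The point-to-triangle distance uses the standard closest-point-on-triangle region test and does not handle degenerate triangles. All function names are hypothetical, and no outlier thresholding is applied.

```python
import math

def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def _axpy(a, s, u):
    """Return a + s * u."""
    return tuple(ai + s * ui for ai, ui in zip(a, u))

def closest_point_on_triangle(p, a, b, c):
    """Closest point to p on triangle (a, b, c), by Voronoi-region tests."""
    ab, ac, ap = _sub(b, a), _sub(c, a), _sub(p, a)
    d1, d2 = _dot(ab, ap), _dot(ac, ap)
    if d1 <= 0 and d2 <= 0:
        return a                                   # vertex region a
    bp = _sub(p, b)
    d3, d4 = _dot(ab, bp), _dot(ac, bp)
    if d3 >= 0 and d4 <= d3:
        return b                                   # vertex region b
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return _axpy(a, d1 / (d1 - d3), ab)        # edge ab
    cp = _sub(p, c)
    d5, d6 = _dot(ab, cp), _dot(ac, cp)
    if d6 >= 0 and d5 <= d6:
        return c                                   # vertex region c
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return _axpy(a, d2 / (d2 - d6), ac)        # edge ac
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return _axpy(b, w, _sub(c, b))             # edge bc
    denom = va + vb + vc                           # interior: barycentric coords
    return _axpy(_axpy(a, vb / denom, ab), vc / denom, ac)

def d2f_and_coverage(points, vertices, faces):
    """Return (mean nearest-face distance, fraction of faces that are nearest to some point)."""
    covered, total = set(), 0.0
    for p in points:
        best_d, best_f = math.inf, None
        for idx, (i, j, k) in enumerate(faces):
            q = closest_point_on_triangle(p, vertices[i], vertices[j], vertices[k])
            d = math.dist(p, q)
            if d < best_d:
                best_d, best_f = d, idx
        total += best_d
        covered.add(best_f)
    return total / len(points), len(covered) / len(faces)
```

The brute-force loop over all faces is quadratic in practice. For meshes with many faces a spatial index would be needed, which is omitted here for brevity.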
The results are reported in Table 1. We compare four different algorithms: AAE, and PC-GAN with three objectives, namely the upper bound $W_U$ ($\epsilon$-approximated Wasserstein distance), the lower bound $W_L$ (GAN with $L_2$ ball constraints and weight clipping), and the sandwiching loss $W_s$ as discussed in Section 4.1. The study with $W_U$ and $W_L$ also serves as an ablation test of the proposed sandwiching loss $W_s$.
Comparison between Upper bound, Lower bound and Sandwiching
Since $W_U$ directly optimizes the distance between training and generated point clouds, $W_U$ usually results in smaller D2F than $W_L$ in Table 1. On the other hand, although $W_L$ only recovers a lower bound estimate of the Wasserstein distance, its discriminator is known to focus on learning the support of the distribution (Bengio, 2018), which results in better coverage (support) than $W_U$.
Theoretically, the proposed sandwiching $W_s$ results in a tighter Wasserstein distance estimate than $W_U$ and $W_L$ (Lemma A). Based on the above discussion, it can also be understood as balancing D2F and coverage by combining $W_U$ and $W_L$ to reach a desirable middle ground. Empirically, we even observe that $W_s$ results in better coverage than $W_L$, and competitive D2F with $W_U$. The intuitive explanation is that some discriminative tasks are offloaded to the $W_U$ objective, so the discriminator can focus more on learning the support of the distribution. We argue that this difference is crucial for capturing object details. Some reconstructed point clouds of testing data are shown in Figure 7. For the aeroplane examples, $W_U$ fails to capture the aeroplane tires, and $W_s$ produces better tires than $W_L$. For the chair example, $W_s$ recovers better legs than $W_U$ and a better seat cushion than $W_L$. Lastly, we highlight that $W_s$ outperforms the others more significantly when the training data is larger (ModelNet10 and ModelNet40) in Table 1.
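The theoretical claim at the start of this comparison, that a mix of an upper and a lower estimate can be tighter than either one alone, can be checked with simple arithmetic. The sketch below assumes the sandwiched estimate is a convex combination of the two bounds with weight `s` on the lower bound. The numbers are hypothetical and only illustrate how errors of opposite sign partially cancel.

```python
def sandwiched(w_upper, w_lower, s):
    """Convex mix of an upper and a lower estimate; s weights the lower bound."""
    return (1.0 - s) * w_upper + s * w_lower

true_w = 1.0                 # hypothetical true distance
w_upper, w_lower = 1.3, 0.8  # hypothetical over- and under-estimates
mixed = sandwiched(w_upper, w_lower, 0.5)
errors = {
    "upper": abs(w_upper - true_w),
    "lower": abs(w_lower - true_w),
    "mixed": abs(mixed - true_w),
}
```

Here the equal-weight mix is off by about 0.05, against 0.3 and 0.2 for the individual bounds. The weight that cancels the error exactly depends on the unknown true distance, so in practice `s` is a tuning choice.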
Comparison between PC-GAN and AAE
In most cases, PC-GAN with the sandwiching objective has lower D2F in Table 1 while using fewer parameters, as mentioned above. Similar to the argument in Section 6.1, although AAE uses larger networks, the decoder wastes parameters on nearby points. AAE only outperforms PC-GAN (sandwiching) on Guitar and Sofa in terms of D2F, since the variety within these two classes is low, so it is easier for an MLP to learn the shared template (basis) of the point clouds. On the other hand, due to the limitation of the fixed number of output points and the Chamfer distance objective, AAE has worse coverage than PC-GAN. This is supported by Figure 7, where AAE also fails to recover the aeroplane tires.
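For reference, the Chamfer distance mentioned above can be sketched as below. This shows one common variant (the mean squared nearest-neighbour distance, summed over both directions); the exact variant used to train AAE is not specified in this section, so the normalization here is an assumption.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point sets (one common variant)."""

    def one_sided(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_sided(cloud_a, cloud_b) + one_sided(cloud_b, cloud_a)


two_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
single_point = [(0.0, 0.0, 0.0)]
```

Each term only matches a point to its nearest neighbour in the other set, so small, sparsely sampled details contribute little to the total.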
6.3 Hierarchical Sampling
In Section 3, we propose a hierarchical sampling process for sampling point clouds. In the first hierarchy, the generator samples an object, while the second generator samples points to form the point cloud. Randomly sampled results, generated without any data as input, are shown in Figure 8. The point clouds are all smooth, structured and almost symmetric. This shows that PC-GAN captures inherent symmetries and patterns in all the randomly sampled objects, even if the overall object is not perfectly formed. This highlights that learning a point-wise generation scheme encourages learning the basic building blocks of objects.
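The two-level sampling procedure can be illustrated with the toy sketch below. It only mirrors the control flow (one object-level code, then independent per-point noise); `toy_point_generator` is a hypothetical stand-in that places points on a sphere whose radius depends on the code, not the trained PC-GAN generator, and the code dimension is an arbitrary choice.

```python
import math
import random


def sample_object_code(dim, rng):
    """First level: draw one latent code that describes a whole object."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_point_generator(noise, code):
    """Second level (hypothetical stand-in): map per-point noise and the
    object code to a 3D point on a sphere whose radius depends on the code."""
    radius = 1.0 + abs(code[0])
    norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    return tuple(radius * n / norm for n in noise)


def sample_point_cloud(num_points, code_dim=4, seed=0):
    """Sample an object code once, then draw as many i.i.d. points as requested."""
    rng = random.Random(seed)
    code = sample_object_code(code_dim, rng)
    points = []
    for _ in range(num_points):
        noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
        points.append(toy_point_generator(noise, code))
    return code, points


code, points = sample_point_cloud(100)
```

Because the points are drawn independently given the code, the same object can be rendered with any number of points simply by changing `num_points`.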
6.4 Understand the Learned Manifold
A commonly used method to demonstrate the quality of the learned latent space is to show whether interpolation between two objects in the latent space results in a smooth change. We interpolate the representations of two objects inferred by the inference network, and use the generator to sample points. The inter-class result is shown in Figure 9.
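Linear interpolation between two inferred codes can be sketched as follows. The function name `interpolate_codes` is illustrative, and the codes are plain lists of floats; in practice each interpolated code would be passed to the trained generator to sample points.

```python
def interpolate_codes(code_a, code_b, num_steps):
    """Return `num_steps` codes evenly spaced on the segment from code_a to code_b."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    codes = []
    for step in range(num_steps):
        t = step / (num_steps - 1)  # t goes from 0 (code_a) to 1 (code_b)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return codes


path = interpolate_codes([0.0, 2.0], [1.0, 0.0], num_steps=3)
```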
It is also popular to show intra-class interpolation. In addition to showing simple intra-class interpolations, where the objects are almost aligned, we present an interesting study on interpolations between rotations. During training, we only rotate data with a fixed set of possible angles for augmentation; here we show that the model generalizes to other unseen rotations, as shown in Figure 10.
However, if we linearly interpolate the code, the resulting change is scattered and not smooth, as shown in Figure 10. Instead of using linear interpolation, we train a 2-layer MLP with a limited hidden layer size of 16, where the input is the angle and the output is the corresponding latent representation of the rotated object. We then generate the codes for rotated planes with this trained MLP. This suggests that although the transformation path of rotation in the latent space is not linear, it follows a smooth trajectory (within the capability of a 1-layer MLP). It may also suggest that the geodesic path of the learned manifold is not nearly linear between rotations. Finding the geodesic path with a principled method (Shao et al., 2017) and understanding the geometry of the manifold for point clouds deserve deeper study as future work.
We evaluate the quality of the representation acquired from the learned inference network. We train the inference network and the generator on the training split of ModelNet40, with the data augmentation mentioned above, to learn generative models without label information. We then extract the latent representation of each point cloud and train a linear SVM on these representations with their labels. We apply the same setting to a linear classifier on the latent code of Achlioptas et al. (2017).
We only sample  as input for our inference network. As the Deep Sets architecture of the inference network is invariant to the number of points, we can sample different numbers of points as input to the trained inference network for evaluation. Because of the randomness in sampling points for extracting the latent representation, we repeat the experiments  times and report the average accuracy and standard deviation on the testing split in Table 2. By using  points, we are already better than the unsupervised algorithms, Achlioptas et al. (2017) with  points and 3D Voxel GAN (Wu et al., 2016), and competitive with the supervised learning algorithm Deep Sets.
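The repeated-evaluation protocol can be sketched as below. The `evaluate` callable is a hypothetical placeholder for "encode the subsampled clouds, classify them, and return the accuracy"; the use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the text does not say which estimator was reported.

```python
import random
import statistics


def repeated_evaluation(evaluate, point_clouds, num_points, num_repeats, seed=0):
    """Subsample every cloud `num_repeats` times and summarize the accuracies.

    `evaluate` is a hypothetical callable: list of subsampled clouds -> accuracy.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(num_repeats):
        # Draw a fresh random subset of points from each test cloud.
        subsampled = [rng.sample(cloud, num_points) for cloud in point_clouds]
        accuracies.append(evaluate(subsampled))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Dummy data and a dummy evaluator that always reports perfect accuracy.
dummy_clouds = [[(float(i), 0.0, 0.0) for i in range(10)] for _ in range(4)]
mean_acc, std_acc = repeated_evaluation(lambda clouds: 1.0, dummy_clouds, 5, 3)
```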
|AAE (Achlioptas et al., 2017)||2048|
|3D GAN (Wu et al., 2016)||voxel|
|Deep Sets (Zaheer et al., 2017)||1000|
|Deep Sets (Zaheer et al., 2017)||5000|
|MVCNN (Su et al., 2015)||images|
|RotationNet (Kanezaki et al., 2018)||images|
Generalization on Unseen Categories
Above, we studied the reconstruction of unseen testing objects, while PC-GAN had still seen point clouds from the same classes during training. Here we study a more challenging task. We train PC-GAN on the first 30 classes (in alphabetical order) and test on the other 10, fully unseen, classes. Some reconstructed (conditionally generated) point clouds are shown in Figure 11. More (larger) results can be found in the appendix. For objects from the unseen classes, the conditionally generated point clouds still recover the main shape and a reasonable geometric structure, which confirms the advantage of the proposed PC-GAN: by enforcing the point-wise transformation, the model is forced to learn the underlying geometric structure and the shared building blocks, instead of naively copying the input from the conditioning. The resulting D2F and coverage are  and , which are only slightly worse than  and  obtained by training on all 40 classes in Table 1 (ModelNet40). This also supports the claim that PC-GAN generalizes well.
6.5 Images to Point Cloud
Here we demonstrate a potential extension of the proposed PC-GAN to image-to-point-cloud applications. After training as described in 6.3, instead of learning  for hierarchical sampling, we train a regressor , where the input is the different views of the point cloud , and the output is . In this proof-of-concept experiment, we use the view data and the Res18 architecture of Su et al. (2015), while changing the output size to be . Some example results on reconstructing testing data are shown in Figure 12. A straightforward extension is to use end-to-end training instead of the two-stage approach adopted here. Also, after aligning objects and taking a representative view, along with traditional ICP techniques, we can also perform single-view to point cloud transformation as in Choy et al. (2016); Fan et al. (2017); Häne et al. (2017); Groueix et al. (2018a). This is not the main focus of this paper, and we leave it for future work.
In this paper, we first showed that a straightforward extension of existing GAN algorithms is not applicable to point clouds. We then proposed a GAN modification (PC-GAN) that is capable of learning to generate point clouds by using ideas from both hierarchical Bayesian modeling and implicit generative models. We further proposed a sandwiching objective, which results in a tighter Wasserstein distance estimate theoretically and better performance empirically.
In contrast to some existing methods (Achlioptas et al., 2017), PC-GAN can generate arbitrarily many i.i.d. points to form a point cloud, without pre-specifying their number. Quantitatively, PC-GAN achieves competitive or better results using a smaller network than existing methods. We also demonstrated that PC-GAN can capture delicate details of point clouds and generalize well even to unseen data. Our method learns “point-wise” transformations, which encourage the model to learn the building components of the objects instead of naively copying the whole object. We also demonstrated other interesting results, including point cloud interpolation and image to point cloud generation.
Although we only focused on 3D applications in this paper, our framework can be naturally generalized to higher dimensions. In the future we would like to explore higher dimensional applications, where each 3D point can have other attributes, such as RGB colors and 3D velocity vectors.
- Achlioptas et al.  Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
- Arjovsky et al.  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. ICML, 2017.
- Arora et al.  Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
- Bengio  Yoshua Bengio. Gans and unsupervised representation learning, 2018.
- Bertsekas  Dimitri P Bertsekas. A distributed asynchronous relaxation algorithm for the assignment problem. In Decision and Control, 1985.
- Bishop  Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
- Bronstein et al.  Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 2017.
- Chang et al.  JH Rick Chang, Chun-Liang Li, Barnabas Poczos, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. One network to solve them all—solving linear inverse problems using deep projection models. arXiv preprint, 2017.
- Choy et al.  Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
- Cuturi  Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
- Dai et al.  Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional gan. arXiv preprint arXiv:1703.06029, 2017.
- Dumoulin et al.  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- Eckart et al.  Ben Eckart, Kihwan Kim, Alejandro Troccoli, Alonzo Kelly, and Jan Kautz. Mlmd: Maximum likelihood mixture decoupling for fast and accurate point cloud registration. In 3DV, 2015.
- Engel et al.  Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent constraints: Learning to generate conditionally from unconditional generative models. arXiv preprint arXiv:1711.05772, 2017.
- Fan et al.  Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017.
- Genevay et al.  Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In AISTATS, 2018.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Groueix et al. [2018a] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. Atlasnet: A papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384, 2018a.
- Groueix et al. [2018b] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In CVPR, 2018b.
- Gulrajani et al.  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In NIPS, 2017.
- Häne et al.  Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object reconstruction. In 3D Vision (3DV), 2017 International Conference on, 2017.
- Hjelm et al.  R. Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. arXiv:1702.08431, 2017.
- Isola et al.  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
- Jian and Vemuri  Bing Jian and Baba C Vemuri. A robust algorithm for point set registration using mixture of gaussians. In ICCV, 2005.
- Kalogerakis et al.  Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3d shape segmentation with projective convolutional networks. CVPR, 2, 2017.
- Kanezaki et al.  Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In CVPR, 2018.
- Kang  Eunsu Kang. FACE Exhibition, 2017. Judith Rae Solomon Gallery, Youngstown, OH. http://art.ysu.edu/2017/09/06/face-by-eunsu-kang-and-collaborators/.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Kim et al.  Yoon Kim, Kelly Zhang, Alexander M Rush, Yann LeCun, et al. Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223, 2017.
- Kingma and Welling  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Lamb et al.  Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
- Li et al.  Chongxuan Li, Max Welling, Jun Zhu, and Bo Zhang. Graphical generative adversarial networks. arXiv preprint arXiv:1804.03429, 2018.
- Li et al. [2017a] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In NIPS, 2017a.
- Li et al. [2017b] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017b.
- Makhzani et al.  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- Mao et al.  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, and Zhen Wang. Least squares generative adversarial networks. In ICCV, 2017.
- Maturana and Scherer  Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015.
- Mirza and Osindero  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Mroueh and Sercu  Youssef Mroueh and Tom Sercu. Fisher gan. In NIPS, 2017.
- Mroueh et al.  Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev gan. arXiv preprint arXiv:1711.04894, 2017.
- Nowozin et al.  Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
- Oliva et al.  Junier B Oliva, Avinava Dubey, Barnabás Póczos, Jeff Schneider, and Eric P Xing. Transformation autoregressive networks. arXiv preprint arXiv:1801.09819, 2018.
- Oord et al.  Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- Peyré et al.  Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Technical report, 2017.
- Póczos et al.  Barnabás Póczos, Liang Xiong, and Jeff Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758, 2012.
- Qi et al.  Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.
- Qi et al. [2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017a.
- Qi et al. [2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017b.
- Salimans et al.  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
- Shao et al.  Hang Shao, Abhishek Kumar, and P Thomas Fletcher. The riemannian geometry of deep generative models. arXiv preprint arXiv:1711.08014, 2017.
- Strom et al.  Johannes Strom, Andrew Richardson, and Edwin Olson. Graph-based segmentation for colored 3d laser point clouds. In IROS, 2010.
- Su et al.  Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
- Tatarchenko et al.  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.
- Tolstikhin et al.  Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
- Vondrick et al.  Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, pages 613–621, 2016.
- Wang et al.  Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 2009.
- Wang and Gupta  Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
- Wang et al.  Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
- Weed and Bach  Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.
- Wu et al.  Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, 2016.
- Wu et al.  Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
- Yang et al.  Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In CVPR, volume 3, 2018.
- Yu et al.  Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016.
- Zaheer et al.  Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In NIPS, 2017.
- Zhao et al.  Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
- Zhu et al.  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix A Technical Proof
Suppose we have two approximators to the Wasserstein distance: an upper bound  and a lower bound , such that  and  respectively, for some  and . Then, using the sandwiched estimator  from (6), we can achieve a tighter estimate of the Wasserstein distance than using either estimator alone, i.e.
We prove the claim by showing that the LHS is at most , which is the lower bound for the RHS.
Without loss of generality we can assume , which brings us to
Now if we choose , then  as desired. ∎
There exists such that
Since there exists such that , it is clear that
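The intuition behind the lemma can be checked numerically with the sketch below. It assumes that the sandwiched estimator is a convex combination of the two bounds, `(1 - lam) * upper + lam * lower`; this form, the mixing weight `lam`, and the error levels (30% over-estimate, 20% under-estimate) are illustrative assumptions rather than values taken from the paper.

```python
def sandwiched_estimate(upper, lower, lam):
    """Convex combination of an over-estimate and an under-estimate (assumed form)."""
    return (1.0 - lam) * upper + lam * lower


true_distance = 1.0
upper = 1.3 * true_distance   # over-estimates the true distance by 30%
lower = 0.8 * true_distance   # under-estimates the true distance by 20%

estimate = sandwiched_estimate(upper, lower, lam=0.5)
sandwich_error = abs(estimate - true_distance)
best_single_error = min(abs(upper - true_distance), abs(lower - true_distance))
```

Because the two errors have opposite signs, mixing the bounds partially cancels them, so the combined error here is smaller than that of either bound alone.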
Appendix B Permutation Equivariance Layers
Zaheer et al. [2017] propose a generic framework of deep learning for set data. The building block, which can be stacked to form deep neural networks, is called a Permutation Equivariance Layer. One example of a Permutation Equivariance Layer is defined as
where  can be any functions (e.g. parametrized by neural networks) and  is an input set. Also, the max pooling operation can be replaced with mean pooling. We note that PointNet [Qi et al., 2017a] is a special case of using Permutation Equivariance Layers by properly defining . In our experiments, we follow Zaheer et al. [2017] to set  to be a linear layer with output size  followed by any nonlinear activation function.
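A permutation-equivariant layer in the style of Deep Sets can be sketched as follows. This is a simplified illustration, assuming scalar weights `lam` and `gam` in place of learned weight matrices and `tanh` as the nonlinearity; it follows the general pattern "transform each element, subtract a pooled summary of the set" and is not necessarily the exact layer used in the experiments.

```python
import math


def mean_pool(values):
    """Mean pooling over an iterable of floats."""
    values = list(values)
    return sum(values) / len(values)


def perm_equivariant_layer(points, lam=1.0, gam=1.0, pool=max):
    """Apply y_i = tanh(lam * x_i - gam * pool_j(x_j)) feature by feature.

    `lam` and `gam` are scalar stand-ins (assumption) for learned weights.
    """
    dims = range(len(points[0]))
    # The pooled summary is identical for every ordering of the input set.
    pooled = [pool(p[k] for p in points) for k in dims]
    return [[math.tanh(lam * p[k] - gam * pooled[k]) for k in dims] for p in points]


inputs = [[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]]
outputs = perm_equivariant_layer(inputs)
outputs_reversed = perm_equivariant_layer(inputs[::-1])
```

Reordering the input set reorders the outputs in the same way, which is the equivariance property; passing `pool=mean_pool` swaps max pooling for mean pooling.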
Appendix C Experiment Settings
C.1 Synthetic Data
The batch size is fixed to be . We sampled 10,000 samples for training and testing.
For the inference network, we stack  mean Permutation Equivariance Layers [Zaheer et al., 2017], where the hidden layer size (the output of the first two layers) is  and the final output size is . The activation function used is SoftPlus. The generator is a  layer MLP, where the hidden layer size is set to be . The discriminator is a  layer MLP with hidden layer size . For Achlioptas et al. [2017], we change their implementation by replacing the number of filters for the encoder with , while the hidden layer width for the decoder is  or  except for the output layer. The decoder is increased from 3 to 4 layers to have more capacity.
We follow Zaheer et al. [2017] for pre-processing. For each object, we sample  points from the mesh representation and normalize them to have zero mean (for each axis) and unit (global) variance. During training, we augment the data by uniformly rotating  rad on the - plane. The random noise of PC-GAN is fixed to be  dimensional for all experiments.
For  of the single class model, we stack  max Permutation Equivariance Layers with output size  for every layer. On top of the stack, we have a  layer MLP with the same width and the output . The generator is a  layer MLP where the hidden layer size is  and the output size is . The discriminator is a  layer MLP with hidden layer size .
For training on the whole ModelNet40 training set, we increase the width to be . The generator is a  layer MLP where the hidden layer size is  and the output size is . The discriminator is a  layer MLP with hidden layer size . For hierarchical sampling, the top generator and discriminator are all -layer MLPs with hidden layer size .
For AAE, we follow every setting used in Achlioptas et al. [2017], where the latent code size is  and  for the single class model and the whole ModelNet40 models.
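The pre-processing and augmentation steps can be sketched as below. Two readings are assumed here and should be checked against the original implementation: "unit (global) variance" is taken to mean a single variance computed over all coordinates of all points, and the rotation is taken to be about the z axis (that is, within the x-y plane). The function names are illustrative, and a degenerate cloud with zero variance is not handled.

```python
import math


def normalize_point_cloud(points):
    """Center each axis at zero and rescale to unit global variance."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    # Global variance: average squared coordinate over all points and axes.
    variance = sum(c * c for p in centered for c in p) / (3 * n)
    std = math.sqrt(variance)
    return [[c / std for c in p] for p in centered]


def rotate_xy(points, angle):
    """Rotate points by `angle` radians in the x-y plane (assumed plane)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [[cos_a * x - sin_a * y, sin_a * x + cos_a * y, z] for x, y, z in points]


raw = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]
normalized = normalize_point_cloud(raw)
rotated = rotate_xy([[1.0, 0.0, 5.0]], math.pi / 2)
```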