Point Cloud GAN

10/13/2018 ∙ by Chun-Liang Li, et al. ∙ Carnegie Mellon University

Generative Adversarial Networks (GAN) can achieve promising performance on learning complex data distributions of different types of data. In this paper, we first show that a straightforward extension of an existing GAN algorithm is not applicable to point clouds, because the constraint required on discriminators is undefined for set data. We propose a twofold modification to the GAN algorithm for learning to generate point clouds (PC-GAN). First, we combine ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process. A key component of our method is that we train a posterior inference network for the hidden variables. Second, instead of using only the state-of-the-art Wasserstein GAN objective, we propose a sandwiching objective, which results in a tighter Wasserstein distance estimate than the commonly used dual form. Thereby, PC-GAN defines a generic framework that can incorporate many existing GAN algorithms. We validate our claims on the ModelNet40 benchmark dataset. Using the distance between generated point clouds and true meshes as a metric, we find that PC-GAN trained with the sandwiching objective achieves better results on test data than existing methods. Moreover, as a byproduct, PC-GAN learns versatile latent representations of point clouds, which achieve competitive performance with other unsupervised learning algorithms on the object recognition task. Lastly, we also provide studies on generating unseen classes of objects and transforming images to point clouds, which demonstrate the compelling generalization capability and potential of PC-GAN.


1 Introduction

A fundamental problem in machine learning is: given a dataset, learn a generative model that can efficiently generate arbitrarily many new sample points from the domain of the underlying distribution (Bishop, 2006). Deep generative models use deep neural networks as a tool for learning complex data distributions (Kingma and Welling, 2013; Oord et al., 2016; Goodfellow et al., 2014). In particular, Generative Adversarial Networks (GAN; Goodfellow et al. 2014) are drawing attention because of their success in many applications. Compelling results have been demonstrated on different types of data, including text, images and videos (Lamb et al., 2016; Karras et al., 2017; Vondrick et al., 2016). Their wide range of applicability has also been shown in many important problems, including data augmentation (Salimans et al., 2016), image style transformation (Zhu et al., 2017), image captioning (Dai et al., 2017) and art creation (Kang, 2017).

Recently, capturing 3D information is garnering attention. There are many different data types for 3D information, such as CAD models, 3D meshes and point clouds. 3D point clouds are becoming popular since they store more information than 2D images, and sensors capable of collecting point clouds have become more accessible, including Lidar on self-driving cars, Kinect for Xbox and the face identification sensor on phones. Compared to other formats, point clouds can be easily represented as a set of points, which has several advantages, such as permutation invariance. Algorithms that can effectively learn from this type of data are an emerging field (Qi et al., 2017a, b; Zaheer et al., 2017; Kalogerakis et al., 2017; Fan et al., 2017). However, compared to supervised learning, unsupervised generative models for 3D data are still underexplored (Achlioptas et al., 2017; Oliva et al., 2018).

Extending existing GAN frameworks to handle 3D point clouds, or more generally set data, is not straightforward. In this paper, we begin by formally defining the problem and discussing its difficulty (Section 2). Circumventing the challenges, we propose a deep generative adversarial network (PC-GAN) with a hierarchical sampling and inference network for point clouds. The proposed architecture learns a stochastic procedure which can generate new point clouds, as well as draw samples from point clouds, without explicitly modeling the underlying density function (Section 3). The proposed PC-GAN is a generic algorithm which can incorporate many existing GAN variants. By utilizing the properties of point clouds, we further propose a sandwiching objective which considers both upper and lower bounds of the Wasserstein distance estimate, leading to a tighter approximation (Section 4). Evaluation on ModelNet40 shows excellent generalization capability of PC-GAN. We first show that we can sample from the learned model to generate new point clouds, and that the latent representations learned by the inference network provide meaningful interpolations between point clouds. We then show conditional generation results on unseen classes of objects to demonstrate the superior generalization ability of PC-GAN. Lastly, we also provide several interesting studies, such as classification and point cloud generation from images (Section 6).

2 Problem Definition and Difficulty

We begin by defining the problem and the ideal generative process for point clouds over objects. Formally, a point cloud for one object is a set of $n$ low-dimensional vectors $X = \{x_i\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$, where $d$ is usually $2$ or $3$ and $n$ can be infinite. A point cloud dataset for different objects is a collection of sets $\{X^{(j)}\}_{j=1}^{M}$, where each $X^{(j)}$ is as defined above. Thus, we basically need a generative model for sets which should be able to:

  • Sample entirely new sets $X^{(\text{new})}$, as well as
  • Sample more points for a given set, i.e. $x \sim p(x \mid X^{(j)})$.

The De Finetti theorem allows us to express the set probability in a factored format as

$$p(X) = \int \Big[ \prod_{i=1}^{n} p(x_i \mid \theta) \Big] p(\theta)\, d\theta$$

for some suitably defined latent variable $\theta$ with prior $p(\theta)$. In the case of point clouds, the latent variable $\theta$

Figure 1: (a) Table; (b) Chair; (c) Plane; (d) Marginalized. Unlike in the case of images, the marginal distribution $p(x)$ is not useful; we have to learn the joint distribution $p(X)$. An illustrative parallel can be drawn between the pixels of an image and a point cloud: the marginal distribution of pixels is quite uninformative; one has to consider the joint distribution of all pixels in one image (not across different images).

can be interpreted as an object representation. In this view, the factoring can be understood as follows: given an object $\theta$, the points in its point cloud can be considered as i.i.d. samples from $p(x \mid \theta)$, an unknown latent distribution representing object $\theta$. The joint likelihood can be expressed as:

$$p\big(X^{(1)}, \ldots, X^{(M)}\big) = \prod_{j=1}^{M} \int \Big[ \prod_{i=1}^{n_j} p\big(x^{(j)}_i \mid \theta_j\big) \Big] p(\theta_j)\, d\theta_j. \tag{1}$$

Attempts have been made to characterize (1) with parametric models like Gaussian mixture models or parametric hierarchical models (Jian and Vemuri, 2005; Strom et al., 2010; Eckart et al., 2015). However, such approaches have had limited success, as the point cloud conditional density $p(x \mid \theta)$ is highly non-linear and complicated (examples of point clouds can be seen in Figures 1, 3 and 7).

With the advent of implicit generative models like GANs (Goodfellow et al., 2014), it is possible to model complicated distributions, but it is not clear how to extend existing GAN frameworks to handle the hierarchical model of point clouds developed above. The aim of most GAN frameworks (Goodfellow et al., 2014; Arjovsky et al., 2017; Li et al., 2017a) is to learn to generate new samples from a distribution $p(x)$ of fixed-dimensional $x$, given a training dataset. We, in contrast, aim to develop a new method where the training data is a set of sets, and learning from this training data would allow us to generate points from this hierarchical distribution (e.g. point clouds of 3D objects). To re-emphasize the incompatibility, note that the training data of traditional GANs is a set of fixed-dimensional instances; in our case, however, it is a set of sets. Using existing GANs to learn just the marginal distribution $p(x)$, in the case of point clouds $x \in \mathbb{R}^3$, is not of much use, because the marginal distribution is quite uninformative, as can be seen from Figure 1.

One approach is to model the distribution of the whole point cloud set together, i.e. $p(X)$. In this setting, a naïve application of a traditional GAN is possible by treating the point cloud as a finite-dimensional vector, fixing the number and order of the points (reducing the problem to instances in $\mathbb{R}^{n \times 3}$), with a DeepSets (Zaheer et al., 2017) classifier as the discriminator to distinguish real sets from fake sets. However, this does not work, because the IPM guarantees behind traditional GANs no longer hold (e.g. in the case of Arjovsky et al. (2017), 1-Lipschitz functions over sets are not well-defined). The probabilistic divergence approximated by a DeepSets classifier is not clear and might be ill-defined. Counterexamples breaking the IPM guarantees can easily be found, as we show next.

Figure 2: Natural extension of GAN to handle set data does not work.
Counter Example

Consider a simple GAN (Goodfellow et al., 2014) with a DeepSets classifier as the discriminator. In order to generate coherent sets of variable size, we consider a generator $G$ having two noise sources: $u \sim p(u)$ and $z \sim p(z)$. To generate a set, $u$ is sampled once and $z_i$ is sampled for $i = 1, \ldots, n$ to produce the $n$ points in the generated set. Intuitively, fixing the first noise source $u$ selects a set and ensures that the points generated by repeated sampling of $z$ are coherent and belong to the same set. The setup is depicted in Figure 2. In this setup, the GAN minimax problem would be:

$$\min_{G} \max_{D} \; \mathbb{E}_{X \sim p(X)}\big[\log D(X)\big] + \mathbb{E}_{u, z_{1:n}}\Big[\log\Big(1 - D\big(\{G(z_i, u)\}_{i=1}^{n}\big)\Big)\Big] \tag{2}$$

Now consider the case when there exists an 'oracle' mapping $T$ which maps each sample point deterministically to the object it originated from, i.e. $T(x) = \theta$ for $x \sim p(x \mid \theta)$. A valid example is when different $\theta$ lead to conditional distributions $p(x \mid \theta)$ with non-overlapping supports. Let $D = \tilde D \circ T$, so that the discriminator acts only through the inferred object and ignores the conditional structure of the points; then the optimization becomes

$$\min_{G} \max_{\tilde D} \; \mathbb{E}_{\theta \sim p(\theta)}\big[\log \tilde D(\theta)\big] + \mathbb{E}_{u \sim p(u)}\Big[\log\Big(1 - \tilde D\big(T(G(z, u))\big)\Big)\Big] \tag{3}$$

Thus, we can achieve the lower bound by only matching the $p(\theta)$ component, while the conditional $p(x \mid \theta)$ is allowed to remain arbitrary. We note that there still exist good solutions, other than this hand-crafted example, which can lead to successful training. We found empirically that a GAN with a simple DeepSets-like discriminator fails most of the time to learn to generate point clouds, even after converging; however, it sometimes does result in reasonable generations. So simply using a DeepSets classifier without any constraints in a simple GAN in order to handle sets does not always lead to a valid generative model. We need additional constraints on GANs with simple DeepSets-like discriminators to exclude such bad solutions and lead to more stable training.

3 Proposed Method

Figure 3: Overview of PC-GAN.

Although directly learning point cloud generation under the GAN formulation is difficult as described in Section 2, given a $\theta$, learning $p(x \mid \theta)$ reduces to learning a 2 or 3 dimensional distribution, which fits standard GAN settings. Formally, given a $\theta$, we train a generator $G_x$ such that $x = G_x(z, \theta)$, where $z \sim p(z)$, follows $p(x \mid \theta)$, by optimizing a (pseudo) probabilistic divergence between the distribution of $G_x(z, \theta)$ and $p(x \mid \theta)$, which is denoted as $D\big(p(x \mid \theta) \,\|\, G_x\big)$. The full objective can be written as $\min_{G_x} \mathbb{E}_{\theta \sim p(\theta)}\, D\big(p(x \mid \theta) \,\|\, G_x\big)$.
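For concreteness, the following is a minimal PyTorch sketch of such a conditional point generator; the layer sizes and the concatenation of the per-point noise with a per-object conditioning vector (the descriptor $\psi$ introduced below) are our illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PointGenerator(nn.Module):
    """G_x: maps per-point noise z and a per-object descriptor psi to one 3D
    point, so arbitrarily many points can be drawn by resampling z."""
    def __init__(self, d_z=10, d_psi=64, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_z + d_psi, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 3),              # one 3D point per noise draw
        )

    def forward(self, z, psi):
        # z: (batch, n_points, d_z); psi: (batch, d_psi), shared by all points
        psi = psi.unsqueeze(1).expand(-1, z.size(1), -1)
        return self.net(torch.cat([z, psi], dim=-1))   # (batch, n_points, 3)
```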

Inference

Although GANs have been extended to learn conditional distributions (Mirza and Osindero, 2014; Isola et al., 2017), they require the conditioning variables to be observed, such as a one-hot label or a given image. In the case of point clouds, we only have partial knowledge of the conditional: we have groupings of points coming from the same object, but no representation of the conditional or the object other than the points themselves. Naïvely modeling $\theta$ as a one-hot vector indicating which object the points belong to in the training data cannot generalize to unseen testing data. Instead, we need a richer representation for $\theta$, which is an unobserved random variable. Thus, we need to infer $\theta$ during the training. The proposed algorithm has to concurrently learn an inference network $Q$ which encodes $\theta$ while we learn $G_x$.

Neural Network Realization

Our solution comprises a generator $G_x(z, \psi)$ which takes in a noise source $z \in \mathbb{R}^{d_z}$ and a descriptor $\psi \in \mathbb{R}^{d_\psi}$ encoding information about the distribution of $\theta$. Here $d_z$ is the dimensionality of the per-point noise source and $d_\psi$ is the size of the descriptor vector. Typically, we would need $d_\psi \gg d_z$, as $\psi$ needs to encode much more information: it decides the whole shape of the conditional distribution, whereas $z$ just corresponds to different samples from it. Another interpretation of the descriptor $\psi$ is as an embedding of a distribution over $\theta$. For a given $\theta_0$, the descriptor would encode information about that object's distribution, and samples generated as $x = G_x(z, \psi)$ would follow the distribution $p(x \mid \theta_0)$. More generally, $\psi$ can be used to encode more complicated distributions over $\theta$ as well. In particular, it could be used to encode the posterior for a given sample set $X$, such that $G_x(z, \psi)$ follows the posterior predictive distribution:

$$x \sim \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta.$$

As $\theta$ is unobserved, we use an inference network $Q$ to learn the most informative descriptor $\psi = Q(X)$ about the distribution, i.e. to minimize the divergence between the distribution of $G_x(z, Q(X))$ and the distribution of $x$ given $\theta$, which, abusing notation for convenience, we write as $D\big(p(x \mid \theta) \,\|\, G_x(z, Q(X))\big)$.

A major hurdle in taking this path is that $X$ is a set of points which can vary in size and in the permutation of its elements. This makes the design of $Q$ complicated, as traditional neural networks cannot handle such input, and is possibly the reason for the absence of such a framework in the literature, despite it being a natural solution to the important problem of generative modeling of point clouds. However, we can overcome this challenge: we propose to construct the inference network by utilizing recent advances in deep learning for dealing with sets (Qi et al., 2017a; Zaheer et al., 2017). This allows $Q$ to handle a variable number of input points in arbitrary order, while yielding a consistent descriptor $\psi$.
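The following is a minimal sketch of such an inference network, assuming PyTorch and a permutation-equivariant layer in the style of Zaheer et al. (2017); the pooling choices and layer sizes are illustrative, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class PermEqui(nn.Module):
    """x_i -> relu(Lambda(x_i) - Gamma(mean(X))): equivariant to point order."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.Lambda = nn.Linear(d_in, d_out)
        self.Gamma = nn.Linear(d_in, d_out, bias=False)

    def forward(self, X):                        # X: (batch, n_points, d_in)
        pooled = X.mean(dim=1, keepdim=True)     # permutation-invariant summary
        return torch.relu(self.Lambda(X) - self.Gamma(pooled))

class InferenceNet(nn.Module):
    """Q: maps a point set of any size to a fixed-size descriptor psi."""
    def __init__(self, d_point=3, d_hidden=128, d_psi=64):
        super().__init__()
        self.layers = nn.Sequential(
            PermEqui(d_point, d_hidden),
            PermEqui(d_hidden, d_hidden),
            PermEqui(d_hidden, d_psi),
        )

    def forward(self, X):                        # X: (batch, n_points, 3)
        return self.layers(X).mean(dim=1)        # pool points -> (batch, d_psi)
```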

Hierarchical Sampling

After training $G_x$ and $Q$, we use the trained $Q$ to collect the inferred descriptors $\psi = Q(X)$ and train another generator $G_\theta$ for the higher hierarchical sampling. Here $u$ is another noise source, independent of $z$, and the dimensionality of this noise source is typically much smaller than $d_\psi$. In addition to layer-wise training, a joint training could further boost performance. The full generative process for sampling one point cloud can be represented as

$$X = \big\{ G_x\big(z_i, G_\theta(u)\big) \big\}_{i=1}^{n}, \qquad u \sim p(u), \; z_i \sim p(z).$$
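Continuing the sketches above, the full two-level sampling chain can be exercised as follows; the architecture of the top-level generator $G_\theta$ and all sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_u, d_psi, d_z = 32, 64, 10
g_theta = nn.Sequential(nn.Linear(d_u, 128), nn.ReLU(), nn.Linear(128, d_psi))
point_gen = PointGenerator(d_z=d_z, d_psi=d_psi)   # from the sketch above

u = torch.randn(1, d_u)           # first hierarchy: sample an object code
psi = g_theta(u)                  # descriptor psi = G_theta(u)
z = torch.randn(1, 2048, d_z)     # second hierarchy: per-point noise
cloud = point_gen(z, psi)         # (1, 2048, 3) sampled point cloud
```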

We call the proposed algorithm for point cloud generation PC-GAN; it is shown in Figure 3. The conditional distribution matching with a learned inference network in PC-GAN can also be interpreted as an encoder-decoder formulation (Kingma and Welling, 2013). The difference between it and point cloud autoencoders (Achlioptas et al., 2017; Yang et al., 2018) will be discussed in Section 5.

4 Different Divergences for Matching Point Clouds

Given two point clouds $X$ and $Y$, one commonly used heuristic distance measure is the Chamfer distance (Achlioptas et al., 2017). On the other hand, if we treat each point cloud as a 3-dimensional distribution, we can adopt a much broader class of probabilistic divergences for training $G_x$. Instead of using divergence estimates which require density estimation (Jian and Vemuri, 2005; Strom et al., 2010; Eckart et al., 2015), we are interested in implicit generative models with a GAN-like objective (Goodfellow et al., 2014), which have been demonstrated to learn complicated distributions.

To train the generator $G_x$ using a GAN-like objective for point clouds, we need a discriminator $f$ which distinguishes between generated samples and true samples conditioned on $\theta$. Combined with the inference network discussed in Section 3, if we use an IPM-based GAN (Arjovsky et al., 2017; Mroueh and Sercu, 2017; Mroueh et al., 2017), the objective can be written as

$$\min_{G_x, Q} \max_{f \in \mathcal{F}} \; \mathbb{E}_{\theta \sim p(\theta)} \Big[ \mathbb{E}_{x \sim p(x \mid \theta)}\big[f(x, Q(X))\big] - \mathbb{E}_{z \sim p(z)}\big[f\big(G_x(z, Q(X)), Q(X)\big)\big] \Big], \tag{4}$$

where $\mathcal{F}$ is the constraint set corresponding to different probabilistic distances, such as 1-Lipschitz functions (Arjovsky et al., 2017), the $L_2$ ball (Mroueh and Sercu, 2017) or the Sobolev ball (Mroueh et al., 2017).
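As an illustration of how (4) is optimized in practice, below is a hedged sketch of one alternating update; the critic signature f(points, psi), the optimizers and shapes are assumptions, and the constraint on the critic (Section 4.2) is omitted here.

```python
import torch

def pcgan_step(Q, G, critic, opt_qg, opt_c, X, d_z=10):
    """One alternating update for objective (4); X: (batch, n, 3) real clouds."""
    # Critic update: ascend E[f(x, psi)] - E[f(G(z, psi), psi)].
    psi = Q(X).detach()
    z = torch.randn(X.size(0), X.size(1), d_z, device=X.device)
    c_loss = critic(G(z, psi).detach(), psi).mean() - critic(X, psi).mean()
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()
    # Generator and inference update: descend the estimated divergence.
    psi = Q(X)
    g_loss = -critic(G(torch.randn_like(z), psi), psi).mean()
    opt_qg.zero_grad(); g_loss.backward(); opt_qg.step()
    return c_loss.item(), g_loss.item()
```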

4.1 Tighter Solutions via Sandwiching

In our setting, each point in a point cloud can be considered to correspond to a single image in the setting where GANs are trained over images. An example is illustrated in Figure 4, where the samples from an MMD-GAN (Li et al., 2017a) trained on CelebA consist of both good and bad faces. In the case of images, when quality is evaluated, the focus is primarily on the coherence of individual images, and the few bad ones are usually left out. In the case of point clouds, however, to get a representation of an object we need many sampled points together, and the presence of outlier points degrades the quality of the object. Thus, when training a generative model for point clouds, we need to ensure a much lower distance between the true distribution and the generator distribution than would be needed for images.

Figure 4: Connection between good/bad points and faces generated from a GAN.

We begin by noting that the popular Wasserstein GAN (Arjovsky et al., 2017) aims to optimize $\min_G W(p, G)$, where $W(p, G)$ is the Wasserstein distance between the true and the generated distributions. Many GAN works (e.g. Arjovsky et al. (2017)) approximate $W$ in its dual form (a maximization problem), such as (4), using neural networks. The resulting estimate is a lower bound of the true Wasserstein distance, as neural networks can only recover a subset of the 1-Lipschitz functions (Arora et al., 2017) required in the dual form. However, a lower bound of $W$ may not be an ideal surrogate for solving the minimization problem $\min_G W(p, G)$. In the optimal transport literature, the Wasserstein distance is usually estimated by an approximate matching cost, which gives an upper bound of the true Wasserstein distance.

We propose to combine, in general, a lower bound and an upper bound estimate by sandwiching the solution between the two, i.e. we solve the following minimization problem:

$$\min_{G} \; W_l(p, G) \tag{5}$$

such that $W_u(p, G) - W_l(p, G) \le \epsilon$.

The problem can be simplified and solved using the method of Lagrange multipliers as follows:

$$\min_{G} \; W_s(p, G), \qquad W_s = \lambda W_u + (1 - \lambda)\, W_l. \tag{6}$$
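As a concrete illustration, a minimal sketch of the sandwiched objective (6) follows; upper_bound and lower_bound are stand-ins for the estimators of Sections 4.1.1 and 4.2, and reading the 1:20 mixture reported in Section 6 as lam = 1/21 under this parameterization is our assumption.

```python
def sandwich_loss(real_pts, fake_pts, upper_bound, lower_bound, lam=1.0 / 21.0):
    """W_s = lambda * W_u + (1 - lambda) * W_l, per the Lagrangian form (6)."""
    w_u = upper_bound(real_pts, fake_pts)   # primal matching-cost estimate
    w_l = lower_bound(real_pts, fake_pts)   # dual (critic-based) estimate
    return lam * w_u + (1.0 - lam) * w_l
```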

By solving the new sandwiched problem (6), we show that under certain conditions we obtain a better estimate of the Wasserstein distance, as stated in the following lemma:

Lemma 1. Suppose we have two approximators to the Wasserstein distance $W$: an upper bound $W_u$ and a lower bound $W_l$, such that $W \le W_u \le (1+\epsilon_u)W$ and $(1-\epsilon_l)W \le W_l \le W$ respectively, for some $\epsilon_u, \epsilon_l > 0$. Then, using the sandwiched estimator $W_s = \lambda W_u + (1-\lambda) W_l$ from (6) with a suitable $\lambda$, we can achieve a tighter worst-case estimate of the Wasserstein distance than using either estimator alone, i.e.

$$|W_s - W| \le \frac{\epsilon_u \epsilon_l}{\epsilon_u + \epsilon_l}\, W < \min(\epsilon_u, \epsilon_l)\, W. \tag{7}$$

4.1.1 Upper Bound Implementation

The primal form of the Wasserstein distance is defined as

$$W(p, g) = \inf_{\gamma \in \Gamma(p, g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big],$$

where $\Gamma(p, g)$ is the set of couplings of $p$ and $g$. The Wasserstein distance is also known as the optimal transport (OT) or earth mover's distance (EMD). As the names suggest, when $W$ is estimated with a finite number of samples $\{x_i\}_{i=1}^{n} \sim p$ and $\{y_i\}_{i=1}^{n} \sim g$, we find the one-to-one matching between $\{x_i\}$ and $\{y_i\}$ such that the total pairwise distance is minimized. The resulting minimal total (average) pairwise distance is the empirical estimate of $W$. In practice, finding the exact matching efficiently is non-trivial and still an open research problem (Peyré et al., 2017). Instead, we consider the approximation provided by Bertsekas (1985). It is an iterative algorithm in which each iteration operates like an auction, whereby unassigned points $x_i$ bid simultaneously for their closest points $y_j$, thereby raising the prices of the latter. Once all bids are in, points are awarded to the highest bidders. The crux of the algorithm lies in designing a non-greedy bidding strategy. By construction, the algorithm is embarrassingly parallelizable, which is favourable for GPU implementation. One can show that the algorithm terminates with a valid matching whose cost is an $\epsilon$-approximation, from above, of the optimal one. Thus, the resulting estimate $\hat W_u$ can serve as an upper bound, i.e.

$$W(p, g) \le \hat W_u(p, g) \le W(p, g) + \epsilon. \tag{8}$$
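For illustration, a compact sequential numpy sketch of the auction idea follows; the bid increment and eps value are illustrative, equal-size sets with at least two points are assumed, and the parallelism noted above is dropped for clarity.

```python
import numpy as np

def auction_match_cost(X, Y, eps=1e-2):
    """Approximate matching cost (upper bound on EMD) for point sets X, Y: (n, d)."""
    n = X.shape[0]
    benefit = -np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)          # owner[j]: bidder holding item j
    assigned = np.zeros(n, dtype=bool)      # assigned[i]: bidder i holds an item
    while not assigned.all():
        for i in np.where(~assigned)[0]:    # unassigned bidders bid in turn
            values = benefit[i] - prices
            j = int(np.argmax(values))
            best, second = np.partition(values, -2)[-2:][::-1]
            prices[j] += (best - second) + eps   # raise price by the bid margin
            if owner[j] >= 0:
                assigned[owner[j]] = False       # previous holder is outbid
            owner[j], assigned[i] = i, True
    return float(np.mean(np.linalg.norm(X[owner] - Y[np.arange(n)], axis=-1)))
```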

We remark that estimating the Wasserstein distance with finite samples via the primal form is only favorable for low-dimensional data, such as point clouds. The error of the empirical estimate in the primal form scales as $O(n^{-1/d})$ (Weed and Bach, 2017). When the dimension $d$ is large (e.g. images), we cannot accurately estimate $W$ in the primal form, nor its upper bound, with a small minibatch. Finding a modified primal form with low sample complexity, especially for high-dimensional data, is still an open research problem (Cuturi, 2013; Genevay et al., 2018). Combining those into the proposed sandwiching objective for high-dimensional data is left for future work.

4.2 Lower Bound Implementation

The dual form of the Wasserstein distance is defined as

$$W(p, g) = \sup_{f \in \mathcal{L}_1} \; \mathbb{E}_{x \sim p}\big[f(x)\big] - \mathbb{E}_{x \sim g}\big[f(x)\big], \tag{9}$$

where $\mathcal{L}_K$ is the set of $K$-Lipschitz functions, i.e. functions whose Lipschitz constant is no larger than $K$. In practice, deep neural networks $f_w$ parameterized by weights $w$ subject to constraints $w \in \mathcal{W}$ (Arjovsky et al., 2017) result in the distance approximation

$$\hat W_l(p, g) = \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim p}\big[f_w(x)\big] - \mathbb{E}_{x \sim g}\big[f_w(x)\big]. \tag{10}$$

If there exists $K \le 1$ such that $\{f_w\}_{w \in \mathcal{W}} \subseteq \mathcal{L}_K$, then $\hat W_l$ is a lower bound of $W$. To enforce this, Arjovsky et al. (2017) propose a weight clipping constraint $\mathcal{W} = [-c, c]^d$, which constrains every weight to be in $[-c, c]$ and guarantees that $\{f_w\} \subseteq \mathcal{L}_K$ for some $K$. Based on Arora et al. (2017), the set of functions realized by such neural networks is in fact a strict subset of the $K$-Lipschitz functions.

In practice, choosing the clipping range $c$ is non-trivial. Small ranges limit the capacity of the network, while large ranges result in numerical issues during training. On the other hand, in addition to weight clipping, several constraints (regularizations) have been proposed with better empirical performance, such as the gradient penalty (Gulrajani et al., 2017) and the $L_2$ ball (Mroueh and Sercu, 2017). However, there is no guarantee that the resulting functions are still Lipschitz or that the resulting distances are lower bounds of the Wasserstein distance. To take advantage of those regularizations while keeping the Lipschitz guarantee, we propose a simple variation that combines them with weight clipping, which always ensures Lipschitz functions.

Proposition 1. For neural networks $f_w$ with weights clipped to $w \in [-c, c]^d$, there exists $K$ such that

$$\max_{w \in [-c, c]^d} \; \mathbb{E}_{x \sim p}\big[f_w(x)\big] - \mathbb{E}_{x \sim g}\big[f_w(x)\big] \;\le\; K \cdot W(p, g). \tag{11}$$

Note that if $\hat W \le K \cdot W$, then $\frac{1}{K}\hat W \le W$. Therefore, from Proposition 1, for any regularization of the discriminator (Gulrajani et al., 2017; Mroueh and Sercu, 2017; Mroueh et al., 2017), we can always combine it with a weight clipping constraint to ensure a valid lower bound estimate of the Wasserstein distance (up to a constant), while enjoying the advantage that training is numerically stable even with a large $c$, compared with the original weight-clipping WGAN (Arjovsky et al., 2017).
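To make this concrete, below is a minimal PyTorch sketch of one critic update combining a gradient penalty (Gulrajani et al., 2017) with weight clipping; the clip value, penalty weight and the per-point penalty norm are illustrative simplifications rather than the exact configuration used in the experiments.

```python
import torch

def critic_step(critic, opt, real, fake, gp_weight=10.0, c=0.1):
    """real, fake: (batch, n, 3); critic: a point/set critic f_w."""
    opt.zero_grad()
    loss = critic(fake).mean() - critic(real).mean()     # negated IPM objective
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(inter).sum(), inter, create_graph=True)[0]
    loss = loss + gp_weight * ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        for p in critic.parameters():   # clipping keeps the Lipschitz guarantee
            p.clamp_(-c, c)
    return float(loss)
```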

5 Related Works

Generative Adversarial Networks (Goodfellow et al., 2014) aim to learn a generator that can sample data following the data distribution. Compelling results on learning complex data distributions with GANs have been shown on images (Karras et al., 2017), speech (Lamb et al., 2016), text (Yu et al., 2016; Hjelm et al., 2017), video (Vondrick et al., 2016) and 3D voxels (Wu et al., 2016). However, GAN algorithms for 3D point clouds are still underexplored (Achlioptas et al., 2017). Many alternative objectives for training GANs have been studied. Most of them are dual forms of $f$-divergences (Goodfellow et al., 2014; Mao et al., 2017; Nowozin et al., 2016), integral probability metrics (IPMs) (Zhao et al., 2016; Li et al., 2017a; Arjovsky et al., 2017; Gulrajani et al., 2017) or IPM extensions (Mroueh and Sercu, 2017; Mroueh et al., 2017). Genevay et al. (2018) learn the generative model via an approximated primal form of the Wasserstein distance (Cuturi, 2013).

Instead of training a generative model on the data space directly, one popular approach combines GANs with an autoencoder (AE), called the adversarial autoencoder (AAE) (Makhzani et al., 2015). AAE constrains the encoded data to follow a normal distribution via a GAN loss; this is similar to a VAE (Kingma and Welling, 2013) with the KL-divergence on the latent space replaced by a GAN loss. Tolstikhin et al. (2017) provide a theoretical explanation for AAE by connecting it with the primal form of the Wasserstein distance. The other variant of AAE trains another generative model to learn the distribution of the encoded data, instead of enforcing it to be similar to a known distribution (Engel et al., 2017; Kim et al., 2017). Achlioptas et al. (2017) explore an AAE variant for point clouds. They use a specially-designed encoder network (Qi et al., 2017a) to learn a compressed representation for point clouds before training a GAN on the latent space. However, their decoder is restricted to be an MLP which generates a fixed number of points $n$, where $n$ has to be pre-defined. That is, the output of their decoder is fixed to be $n \times 3$ dimensional for 3D point clouds, while the output of the proposed $G_x$ is only 3-dimensional, and $G_x$ can generate arbitrarily many points by sampling different random noise as input. Yang et al. (2018) and Groueix et al. (2018b) propose decoders similar to $G_x$ with fixed grids to break the aforementioned limitation of Achlioptas et al. (2017), but they use the heuristic Chamfer distance without any theoretical guarantee and do not exploit generative models for point clouds. The proposed PC-GAN can also be interpreted as an encoder-decoder formulation; however, the underlying interpretation is different. We start from the De Finetti theorem to learn both $p(x \mid \theta)$ and $p(\theta)$ with the inference network interpretation of $Q$, while Achlioptas et al. (2017) focus on learning $p(X)$ without modeling $p(x \mid \theta)$.

GANs for learning conditional distributions (conditional GANs) have been studied on images with single conditioning (Mirza and Osindero, 2014; Pathak et al., 2016; Isola et al., 2017; Chang et al., 2017) or multiple conditionings (Wang and Gupta, 2016). The case of point clouds is still underexplored. Also, most of these works assume the conditioning is given (e.g. labels or base images) without learning the inference during training. Training GANs with inference is studied by Dumoulin et al. (2016) and Li et al. (2017b); however, their goal is to infer the random noise of the generator instead of a semantic latent variable of the data. Li et al. (2018) is a parallel work aiming to learn a GAN and an unobserved latent variable simultaneously, but they only study image and video datasets.

Lastly, we briefly review some recent developments in deep learning on point clouds. Instead of transforming objects into 3D voxels or projecting them into different views in order to use convolutions (Su et al., 2015; Maturana and Scherer, 2015; Wu et al., 2015; Qi et al., 2016; Tatarchenko et al., 2017), which raises memory usage concerns, one direction is designing permutation-invariant operations for dealing with set data directly (Qi et al., 2017a; Zaheer et al., 2017; Qi et al., 2017b). Wang et al. (2018) use graph convolutions (Bronstein et al., 2017) to utilize local neighborhood information. Most of the applications studied in those works focus on classification and segmentation tasks, but they can be used to implement the inference network of PC-GAN.

6 Experiments

In this section we demonstrate the point cloud generation capabilities of PC-GAN. As discussed in Section 5, we refer to Achlioptas et al. (2017) as AAE, as it can be treated as an AAE extension to point clouds, and we use the implementation provided by the authors for experiments. The sandwiching objective $\hat W_s$ for PC-GAN combines $\hat W_u$ and $\hat W_l$ with a 1:20 mixture, without tuning, for all experiments. $\hat W_l$ is a GAN loss combining Arjovsky et al. (2017) and Mroueh and Sercu (2017), and we adopt the auction algorithm (Bertsekas, 1985) for $\hat W_u$, as discussed in Section 4.2 and Section 4.1.1. We parametrize $Q$ in PC-GAN by DeepSets (Zaheer et al., 2017). A review of DeepSets is in Appendix B. Other detailed configurations for each experiment can be found in Appendix C. Next, we study both a synthetic 2D point cloud dataset and the ModelNet40 benchmark dataset.

6.1 Synthetic Datasets

We created a simple 2D synthetic point cloud dataset from parametric distributions, on which we can carry out thorough evaluations of the proposed PC-GAN and draw comparisons with AAE (Achlioptas et al., 2017). We generate 2D point clouds of circles, where the centers of the circles follow a mixture of four Gaussians with fixed covariance matrices and equal mixture weights, and the radius of each circle is drawn from a uniform distribution. One sampled circle is shown in Figure 5(a). We sampled 10,000 circles each for the training and testing data.
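A sketch of this synthetic generator is given below; the mixture means, covariance scale and radius range were lost in extraction, so the numeric values here are placeholders that only mirror the described structure (four Gaussian centers, uniform radii).

```python
import numpy as np

def sample_circle(n_points=100, rng=np.random.default_rng(0)):
    mean = rng.choice([-4.0, 4.0], size=2)          # placeholder mixture means
    center = mean + rng.normal(scale=0.5, size=2)   # placeholder covariance
    radius = rng.uniform(1.0, 2.0)                  # placeholder radius range
    t = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    return center + radius * np.stack([np.cos(t), np.sin(t)], axis=1)

train_set = [sample_circle() for _ in range(10000)]  # 10,000 training circles
```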

For PC-GAN, the inference network is a stack of 3 Permutation Equivariance Layers. For the AAE encoder, we follow the same settings as Achlioptas et al. (2017); for the decoder, we increase it from 3 to 4 layers with larger capacity. We consider two model configurations, AAE-10 and AAE-20, which use 10 and 20 units for the hidden layers of the AAE decoder, respectively. In this experiment, the total number of parameters (encoder + decoder) of both AAE-10 and AAE-20 is larger than that of PC-GAN. Detailed model configurations are provided in the supplementary material.

We evaluated the conditional distributions on the testing circles. For the proposed PC-GAN, we pass the points into the inference network $Q$, then sample points with the conditional generator to match the output number of AAE. We measured the empirical distributions of the centers and the radii of the generated circles conditioned on the testing data for PC-GAN. Similarly, we measured the reconstructed circles of the testing data for AAE. The results are shown in Figure 5.

(a) True Data
(b) AAE-10
(c) AAE-20
(d) PC-GAN
Figure 5: The reconstructed center and radius distributions. (a) (top) the true center distribution and (bottom) one example of the 2D circle point cloud. (b-d) are the reconstructed center and radius distributions of different algorithms.

From Figure 5, both AAE and PC-GAN can successfully recover the center distribution, but AAE does not learn the radius distribution well. Even if we increase the number of hidden layer units of the decoder to 20 (AAE-20), which almost doubles the number of parameters, the performance is still not satisfactory. Compared with AAE, the proposed PC-GAN recovers both the center and radius distributions well with fewer parameters. The gap in memory usage could be larger if we configured AAE to generate more points, while the model size required for PC-GAN is independent of the number of points. The reason is that the MLP decoder adopted by Achlioptas et al. (2017) wastes parameters on nearby points. A much larger model (more parameters) could potentially boost the performance, yet it would still be restricted to generating a fixed number of points for each object, as discussed in Section 5.

6.2 Conditional Generation on ModelNet40

We consider the ModelNet40 (Wu et al., 2015) benchmark, which contains 40 classes of objects, with 9,843 training and 2,468 testing instances. We follow Zaheer et al. (2017) for pre-processing: for each object, we sample points from the mesh representation and normalize them to have zero mean (for each axis) and unit (global) variance. During training, we augment the data by uniformly rotating objects in the $x$-$y$ plane. For PC-GAN, the dimensionality of the random noise $z$ is fixed for all experiments. For other settings, we follow Achlioptas et al. (2017).

Training on Single Class

We start with a smaller model which is trained on only a single class of objects. For AAE, the encoder uses a fixed latent code size and the decoder outputs a fixed number of points for each object. We set the size of the PC-GAN latent variable (the output of $Q$) to the same dimensionality. The total number of parameters of $Q$ and $G_x$ is less than that of the AAE encoder and decoder.

Training on All Classes.

We also train the proposed model on all objects in the training set. The size of the AAE latent code is increased for this setting. We set the size of the PC-GAN latent variable to the same dimensionality as well. The total number of parameters of $Q$ and $G_x$ remains smaller than that of the AAE encoder and decoder.

6.2.1 Quantitative Comparison

| Data | D2F: PC-GAN ($\hat W_s$) | D2F: AAE | D2F: PC-GAN ($\hat W_u$) | D2F: PC-GAN ($\hat W_l$) | Coverage: PC-GAN ($\hat W_s$) | Coverage: AAE | Coverage: PC-GAN ($\hat W_u$) | Coverage: PC-GAN ($\hat W_l$) |
|---|---|---|---|---|---|---|---|---|
| Aeroplanes | 1.89E+01 | 1.99E+01 | 1.53E+01 | 2.49E+01 | 1.95E-01 | 2.99E-02 | 1.73E-01 | 1.88E-01 |
| Benches | 1.09E+01 | 1.41E+01 | 1.05E+01 | 2.46E+01 | 4.44E-01 | 2.35E-01 | 2.58E-01 | 3.83E-01 |
| Cars | 4.39E+01 | 6.23E+01 | 4.25E+01 | 6.68E+01 | 2.35E-01 | 4.98E-02 | 1.78E-01 | 2.35E-01 |
| Chairs | 1.01E+01 | 1.08E+01 | 1.06E+01 | 1.08E+01 | 3.90E-01 | 1.82E-01 | 3.57E-01 | 3.95E-01 |
| Cups | 1.44E+03 | 1.79E+03 | 1.28E+03 | 3.01E+03 | 6.31E-01 | 3.31E-01 | 4.32E-01 | 5.68E-01 |
| Guitars | 2.16E+02 | 1.93E+02 | 1.97E+02 | 1.81E+02 | 2.25E-01 | 7.98E-02 | 2.11E-01 | 2.27E-01 |
| Lamps | 1.47E+03 | 1.60E+03 | 1.64E+03 | 2.77E+03 | 3.89E-01 | 2.33E-01 | 3.79E-01 | 3.66E-01 |
| Laptops | 2.43E+00 | 3.73E+00 | 2.65E+00 | 2.58E+00 | 4.31E-01 | 2.56E-01 | 3.93E-01 | 4.55E-01 |
| Sofa | 1.71E+01 | 1.64E+01 | 1.45E+01 | 2.76E+01 | 3.65E-01 | 1.62E-01 | 2.94E-01 | 3.47E-01 |
| Tables | 2.79E+00 | 2.96E+00 | 2.44E+00 | 3.69E+00 | 3.82E-01 | 2.59E-01 | 3.20E-01 | 3.53E-01 |
| ModelNet10 | 5.77E+00 | 6.89E+00 | 6.03E+00 | 9.19E+00 | 3.47E-01 | 1.90E-01 | 3.36E-01 | 3.67E-01 |
| ModelNet40 | 4.84E+01 | 5.86E+01 | 5.24E+01 | 7.96E+01 | 3.80E-01 | 1.85E-01 | 3.65E-01 | 3.71E-01 |

Table 1: The quantitative results of different models trained on different subsets of ModelNet40 and evaluated on the corresponding testing set (lower D2F is better; higher coverage is better). ModelNet10 is a subset containing 10 classes of objects, while ModelNet40 is the full training set. AAE is trained using the code provided by Achlioptas et al. (2017). The three PC-GAN variants are trained via the upper bound $\hat W_u$, the lower bound $\hat W_l$ and the sandwiching loss $\hat W_s$ discussed in Section 4.

We first evaluate the performance of the trained conditional generator $G_x$ and the inference network $Q$. We are interested in whether the learned model can capture the distribution of unseen testing data. Therefore, for each testing point cloud, we use $Q$ to infer the latent variable $\psi$, then use $G_x$ to generate points. We then compare the distribution of the input point cloud with that of the generated point cloud.

Many criteria based on finite-sample estimation can be used for evaluation, such as $f$-divergences and IPMs. However, such finite-sample estimators are either biased or have high variance (Peyré et al., 2017; Wang et al., 2009; Póczos et al., 2012; Weed and Bach, 2017). Moreover, these estimators cannot exploit infinitely many samples even when such samples are accessible.

Figure 6: Sample mesh from ModelNet40

For ModelNet40, the meshes of the objects are available. In many statistically guaranteed distance estimates, the adopted statistics are commonly based on distances between nearest neighbors (Wang et al., 2009; Póczos et al., 2012). Therefore, we propose to measure the performance with the following criteria. Given a point cloud $X = \{x_i\}_{i=1}^{n}$ and a mesh, which is a collection of faces $\{F_j\}_{j=1}^{m}$, we measure the distance to face (D2F) as

$$\mathrm{D2F}\big(X, \{F_j\}\big) = \frac{1}{n} \sum_{i=1}^{n} \min_{j} \, d(x_i, F_j),$$

where $d(x_i, F_j)$ is the Euclidean distance from $x_i$ to the face $F_j$. This distance is similar to the Chamfer distance commonly used for measuring images and point clouds (Achlioptas et al., 2017; Fan et al., 2017), but with infinitely many samples from the true distribution (the mesh).

However, an algorithm can achieve a low or even zero D2F while only focusing on a small portion of the point cloud (mode collapse). Therefore, we are also interested in whether the generated points recover enough of the support of the distribution. We compute the coverage ratio as follows: for each generated point, we find its nearest face and treat that face as covered (strictly, we should threshold to ignore outlier points; in our experiments, not excluding outliers does not change the conclusions of the comparison). We then compute the ratio of faces of the mesh that are covered. A sample mesh is shown in Figure 6, where the detailed parts have more faces (the faces are non-uniform). Thus, it is difficult to get high coverage for AAE or PC-GAN trained on a limited number of sampled points. The coverage ratio, however, serves as an indicator of how much detail the model recovers.
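A minimal sketch of the two criteria follows; it assumes a helper point_to_face_dist(p, f) computing the Euclidean point-to-triangle distance, and omits the outlier thresholding mentioned above.

```python
import numpy as np

def d2f_and_coverage(points, faces, point_to_face_dist):
    """points: iterable of 3D points; faces: iterable of mesh faces."""
    dists = np.array([[point_to_face_dist(p, f) for f in faces] for p in points])
    d2f = dists.min(axis=1).mean()          # average distance to nearest face
    covered = set(dists.argmin(axis=1))     # faces that are some point's nearest
    return d2f, len(covered) / len(faces)   # (D2F, coverage ratio)
```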

The results are reported in Table 1. We compare four different algorithms: AAE, and PC-GAN with three objectives, namely the upper bound $\hat W_u$ (the auction-approximated Wasserstein distance), the lower bound $\hat W_l$ (a GAN with the $L_2$ ball constraint and weight clipping), and the sandwiching loss $\hat W_s$, as discussed in Section 4.1. The study with $\hat W_u$ and $\hat W_l$ also serves as an ablation test of the proposed sandwiching loss $\hat W_s$.

Comparison between Upper bound, Lower bound and Sandwiching

Since $\hat W_u$ directly optimizes a matching distance between the training and generated point clouds, it usually results in a smaller D2F than $\hat W_l$ in Table 1. On the other hand, although $\hat W_l$ only recovers a lower bound estimate of the Wasserstein distance, its discriminator is known to focus on learning the support of the distribution (Bengio, 2018), which results in better coverage (support) than $\hat W_u$.

Theoretically, the proposed sandwiching loss $\hat W_s$ results in a tighter Wasserstein distance estimate than $\hat W_u$ and $\hat W_l$ (Lemma 1). Based on the above discussion, it can also be understood as balancing D2F and coverage by combining $\hat W_u$ and $\hat W_l$ to reach a desirable middle ground. Empirically, we even observe that $\hat W_s$ results in better coverage than $\hat W_l$, and competitive D2F with $\hat W_u$. An intuitive explanation is that some of the discriminative work is offloaded to the $\hat W_u$ objective, so the discriminator can focus more on learning the distribution support. We argue that this difference is crucial for capturing object details. Some reconstructed point clouds of testing data are shown in Figure 7. In the aeroplane examples, $\hat W_u$ and $\hat W_l$ fail to capture the aeroplane tires, while $\hat W_s$ recovers them better. In the chair example, $\hat W_s$ recovers better legs and a better seat cushion. Lastly, we highlight that $\hat W_s$ outperforms the others more significantly when the training data is larger (ModelNet10 and ModelNet40) in Table 1.

Figure 7: The reconstruction of test objects from seen categories. For each object, from left to right: training data, AAE, and PC-GAN. PC-GAN is better at capturing fine details like the wheels of an aeroplane or proper chair legs.
Comparison between PC-GAN and AAE

In most cases, PC-GAN with $\hat W_s$ has a lower D2F in Table 1 with fewer parameters, as mentioned above. Similar to the argument in Section 6.1, although AAE uses larger networks, its decoder wastes parameters on nearby points. AAE only outperforms PC-GAN ($\hat W_s$) on Guitars and Sofas in terms of D2F, since the variety within these two classes is low, and it is easier for an MLP to learn the shared template (basis) of the point clouds. On the other hand, due to the limitation of a fixed number of output points and the Chamfer distance objective, AAE has worse coverage than PC-GAN. This is supported by Figure 7, where AAE also fails to recover the aeroplane tires.

6.3 Hierarchical Sampling

In Section 3, we proposed a hierarchical sampling process for point clouds. In the first hierarchy, the generator $G_\theta$ samples an object; in the second, the generator $G_x$ samples points to form its point cloud. Randomly sampled results, without any data given as input, are shown in Figure 8. The point clouds are smooth, structured and almost symmetric. This shows that PC-GAN captures inherent symmetries and patterns of the randomly sampled objects, even if the overall object is not perfectly formed. It highlights that learning a point-wise generation scheme encourages learning the basic building blocks of objects.

Figure 8: Randomly sampled objects and the corresponding point clouds from the hierarchical sampling. Even though there are some defects, the objects are smooth, symmetric and structured. This suggests PC-GAN captures inherent patterns and learns basic building blocks of objects.

6.4 Understanding the Learned Manifold

Interpolation

A commonly used method to demonstrate the quality of a learned latent space is to show whether interpolation between two objects in the latent space results in a smooth change. We interpolate between the representations inferred from two objects by the inference network, and use the generator to sample points. The inter-class result is shown in Figure 9.

Figure 9: Interpolating between a table and a chair point clouds, using our latent space representation.

It is also popular to show intra-class interpolation. In addition to simple intra-class interpolations, where the objects are almost aligned, we present an interesting study on interpolation between rotations. During training, we only rotate the data by a limited set of angles for augmentation; here we show that the model generalizes to other unseen rotations, as shown in Figure 10.

However, if we linearly interpolate the latent codes, the resulting change is scattered and not smooth, as shown in Figure 10. Instead of using linear interpolation, we train a 2-layer MLP with a limited hidden layer size of 16, where the input is the angle and the output is the corresponding latent representation of the rotated object. We then generate the codes for rotated planes with this trained MLP. This suggests that although the transformation path of rotations in the latent space is not linear, it follows a smooth trajectory (one within the capability of a small MLP). It may also suggest that the geodesic path of the learned manifold is not necessarily linear between rotations. Finding the geodesic path with a principled method (Shao et al., 2017) and understanding the geometry of the learned manifold for point clouds are worth deeper study as future work.

Figure 10: Interpolating between rotations of an aeroplane, using our latent space representation.
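A sketch of the angle-to-code regressor described above follows: a 2-layer MLP with a 16-unit hidden layer mapping a rotation angle to a latent code. The latent dimensionality, the activation and the stand-in training pairs are assumptions; in practice the targets would be the codes inferred by $Q$ for rotated objects.

```python
import torch
import torch.nn as nn

d_psi = 256                                    # assumed latent code size
net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, d_psi))
angles = torch.linspace(0, 2 * torch.pi, 64).unsqueeze(1)
codes = torch.randn(64, d_psi)   # stand-in for Q's codes of rotated objects
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):            # fit angle -> latent code pairs
    loss = ((net(angles) - codes) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```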
Classification

We evaluate the quality of the representations acquired from the learned inference network $Q$. We train the inference network and the generator on the training split of ModelNet40, with data augmentation as mentioned above, to learn the generative model without label information. We then extract the latent representation for each point cloud and train a linear SVM on these representations with their labels. We apply the same protocol to a linear classifier on the latent codes of Achlioptas et al. (2017).
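As a concrete illustration of this protocol, a minimal sketch assuming PyTorch and scikit-learn; extract_codes is a hypothetical helper (not from the paper) that runs the trained inference network on sampled points of each cloud.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

def extract_codes(Q, clouds, n_points=1000):
    """Infer a latent code for each point cloud from n_points random samples."""
    codes = []
    with torch.no_grad():
        for X in clouds:                            # X: (n, 3) tensor
            idx = torch.randperm(X.size(0))[:n_points]
            codes.append(Q(X[idx].unsqueeze(0)).squeeze(0).numpy())
    return np.stack(codes)

# Hypothetical usage, given a trained Q and labeled splits:
# clf = LinearSVC().fit(extract_codes(Q, train_clouds), train_labels)
# acc = clf.score(extract_codes(Q, test_clouds), test_labels)
```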

We sample only 1,000 points per object as input to our inference network $Q$. As the DeepSets architecture of the inference network is invariant to the number of points, we can feed a different number of points into the trained inference network for evaluation. Because of the randomness of sampling points for extracting the latent representation, we repeat the experiments several times and report the average accuracy and standard deviation on the testing split in Table 2. Using only 1,000 points, we are already better than the unsupervised algorithms, namely Achlioptas et al. (2017) with 2,048 points and the 3D voxel GAN (Wu et al., 2016), and competitive with the supervised DeepSets.

| Method | Input | Accuracy |
|---|---|---|
| PC-GAN | 1000 points | |
| PC-GAN | 2048 points | |
| AAE (Achlioptas et al., 2017) | 2048 points | |
| 3D GAN (Wu et al., 2016) | voxels | |
| Deep Sets (Zaheer et al., 2017) | 1000 points | |
| Deep Sets (Zaheer et al., 2017) | 5000 points | |
| MVCNN (Su et al., 2015) | images | |
| RotationNet (Kanezaki et al., 2018) | images | |

Table 2: Classification accuracy results.
Generalization on Unseen Categories

Above, we studied the reconstruction of unseen testing objects, where PC-GAN had still seen point clouds from the same classes during training. Here we study a more challenging task: we train PC-GAN on the first 30 classes (in alphabetical order) and test on the remaining 10 fully unseen classes. Some reconstructed (conditionally generated) point clouds are shown in Figure 11. More (larger) results can be found in the appendix. For objects from the unseen classes, the conditionally generated point clouds still recover the main shape and reasonable geometric structure, which confirms the advantage of the proposed PC-GAN: by enforcing a point-wise transformation, the model is forced to learn the underlying geometric structure and the shared building blocks, instead of naively copying the input from the conditioning. The resulting D2F and coverage are only slightly worse than those of the model trained on all 40 classes in Table 1 (ModelNet40), which also supports the claim of the good generalization ability of PC-GAN.

(a) Sofa
(b) Stool
(c) Table
(d) Toilet
(e) TV Stand
(f) Vase
Figure 11: Reconstructed objects from unseen categories. In each plot, the LHS is the true data while the RHS is PC-GAN. PC-GAN generalizes well, as it can match patterns and symmetries from categories seen during training to new unseen categories.

6.5 Images to Point Clouds

Here we demonstrate a potential extension of the proposed PC-GAN to image-to-point-cloud applications. After training $G_x$ and $Q$ as before, instead of learning $G_\theta$ for hierarchical sampling (Section 6.3), we train a regressor whose input is a view of the object and whose output is the corresponding latent code $\psi$. In this proof-of-concept experiment, we use the view data and the ResNet18 architecture from Su et al. (2015), changing only the output size to match the dimensionality of $\psi$. Some example results on reconstructing testing data are shown in Figure 12. A straightforward extension is end-to-end training instead of the two-stage approach adopted here. Also, after aligning objects and taking representative views, along with traditional ICP techniques, we could do single-view to point cloud transformation as in Choy et al. (2016); Fan et al. (2017); Häne et al. (2017); Groueix et al. (2018a), which is not the main focus of this paper and is left for future work.

(a) Lamp
(b) Chair
(c) Plane
(d) Vase
Figure 12: Images to Point Clouds.

7 Conclusion

In this paper, we first showed that a straightforward extension of existing GAN algorithms is not applicable to point clouds. We then proposed a GAN modification (PC-GAN) that is capable of learning to generate point clouds by using ideas both from hierarchical Bayesian modeling and from implicit generative models. We further proposed a sandwiching objective which results in a tighter Wasserstein distance estimate theoretically and better performance empirically.

In contrast to some existing methods (Achlioptas et al., 2017), PC-GAN can generate arbitrarily many i.i.d. points to form a point cloud, without pre-specifying their number. Quantitatively, PC-GAN achieves competitive or better results using smaller networks than existing methods. We also demonstrated that PC-GAN can capture delicate details of point clouds and generalize well even on unseen data. Our method learns “point-wise” transformations which encourage the model to learn the building components of objects, instead of naively copying a whole object. We also demonstrated other interesting results, including point cloud interpolation and image to point cloud transformation.

Although we only focused on 3D applications in this paper, our framework can be naturally generalized to higher dimensions. In the future, we would like to explore higher-dimensional applications, where each 3D point can have other attributes, such as RGB colors and 3D velocity vectors.

References

Appendix A Technical Proof

Lemma 1. Suppose we have two approximators to the Wasserstein distance $W$: an upper bound $W_u$ and a lower bound $W_l$, such that $W \le W_u \le (1+\epsilon_u)W$ and $(1-\epsilon_l)W \le W_l \le W$ respectively, for some $\epsilon_u, \epsilon_l > 0$. Then, using the sandwiched estimator $W_s = \lambda W_u + (1-\lambda) W_l$ from (6) with a suitable $\lambda$, we can achieve a tighter worst-case estimate of the Wasserstein distance than using either estimator alone, i.e.

$$|W_s - W| \le \frac{\epsilon_u \epsilon_l}{\epsilon_u + \epsilon_l}\, W < \min(\epsilon_u, \epsilon_l)\, W. \tag{12}$$
Proof.

We prove the claim by showing that the LHS is at most $\min(\epsilon_u, \epsilon_l)\, W$, which bounds the worst case of the RHS.

$$|W_s - W| = \big|\lambda (W_u - W) - (1-\lambda)(W - W_l)\big| \le \max\big\{\lambda \epsilon_u,\; (1-\lambda)\epsilon_l\big\}\, W \tag{13}$$

Without loss of generality we can assume $\epsilon_u \ge \epsilon_l$, which brings us to

$$\min(\epsilon_u, \epsilon_l)\, W = \epsilon_l\, W. \tag{14}$$

Now if we choose $\lambda = \frac{\epsilon_l}{\epsilon_u + \epsilon_l}$, then $|W_s - W| \le \frac{\epsilon_u \epsilon_l}{\epsilon_u + \epsilon_l}\, W < \epsilon_l\, W$ as desired. ∎

Proposition 1. For neural networks $f_w$ with weights clipped to $w \in [-c, c]^d$, there exists $K$ such that

$$\max_{w \in [-c, c]^d} \; \mathbb{E}_{x \sim p}\big[f_w(x)\big] - \mathbb{E}_{x \sim g}\big[f_w(x)\big] \;\le\; K \cdot W(p, g). \tag{15}$$

Proof.

Since there exists $K$ such that $\{f_w : w \in [-c, c]^d\} \subseteq \mathcal{L}_K$, it is clear that

$$\max_{w \in [-c, c]^d} \mathbb{E}_{x \sim p}\big[f_w(x)\big] - \mathbb{E}_{x \sim g}\big[f_w(x)\big] \;\le\; \sup_{f \in \mathcal{L}_K} \mathbb{E}_{x \sim p}\big[f(x)\big] - \mathbb{E}_{x \sim g}\big[f(x)\big] = K \cdot W(p, g). \tag{16}$$ ∎

Appendix B Permutation Equivariance Layers

We briefly review the notion of Permutation Equivariance Layers proposed by Zaheer et al. [2017], as background for this paper. For more details, please refer to Zaheer et al. [2017].

Zaheer et al. [2017] propose a generic framework of deep learning for set data. The building block, which can be stacked to form deep neural networks, is called the Permutation Equivariance Layer. One example of a Permutation Equivariance Layer is defined as

$$f(x_i) = \sigma\Big( \Lambda x_i + \Gamma \max_{x \in X} x \Big),$$

where $\sigma$ is a nonlinear activation, $\Lambda, \Gamma$ can be any functions (e.g. parametrized by neural networks), $X$ is an input set with $x_i \in X$, and the max is taken coordinate-wise. The max pooling operation can also be replaced with mean pooling. We note that PointNet (Qi et al. [2017a]) is a special case of the Permutation Equivariance Layer obtained by properly defining $\Lambda$ and $\Gamma$. In our experiments, we follow Zaheer et al. [2017] and set $\Lambda, \Gamma$ to be linear layers followed by a nonlinear activation function.

Appendix C Experiment Settings

C.1 Synthetic Data

The batch size is fixed across all experiments. We sampled 10,000 samples for training and testing, respectively.

For the inference network, we stack mean Permutation Equivariance Layers [Zaheer et al., 2017], where the hidden layers (the outputs of the first two layers) and the final output have fixed sizes. The activation function used is SoftPlus. The generator is an MLP, and the discriminator is an MLP as well. For Achlioptas et al. [2017], we change their implementation by adjusting the number of filters in the encoder, while the hidden layer width of the decoder is 10 or 20 units, except for the output layer. The decoder is increased from 3 to 4 layers to have more capacity.

C.2 ModelNet40

We follow Zaheer et al. [2017] for pre-processing. For each object, we sample points from the mesh representation and normalize them to have zero mean (for each axis) and unit (global) variance. During training, we augment the data by uniformly rotating objects in the $x$-$y$ plane. The dimensionality of the random noise of PC-GAN is fixed for all experiments.

For $Q$ of the single-class model, we stack max Permutation Equivariance Layers with the same output size for every layer. On the top of the stack, we have an MLP with the same width whose output is the latent code $\psi$. The generator is an MLP whose output size is 3. The discriminator is an MLP.

For training on the whole ModelNet40 training set, we increase the widths of the layers. The generator is an MLP with a larger hidden layer size and output size 3. The discriminator is an MLP with a larger hidden layer size. For the hierarchical sampling, the top generator and discriminator are both MLPs.

For AAE, we follow every setting used in Achlioptas et al. [2017]; the latent code sizes differ between the single-class model and the whole-ModelNet40 model.