latent_3d_points
Auto-encoding & Generating 3D Point-Clouds.
Three-dimensional geometric data offer an excellent domain for studying representation learning and generative modeling. In this paper, we look at geometric data represented as point clouds. We introduce a deep autoencoder (AE) network with state-of-the-art reconstruction quality and generalization ability. The learned representations outperform existing methods for 3D recognition tasks and enable basic shape editing via simple algebraic manipulations, such as semantic part editing, shape analogies and shape interpolation. We perform a thorough study of different generative models including: GANs operating on the raw point clouds, significantly improved GANs trained in the fixed latent space of our AEs, and Gaussian mixture models (GMMs). For our quantitative evaluation we propose measures of sample fidelity and diversity based on matchings between sets of point clouds. Interestingly, our careful evaluation of generalization, fidelity and diversity reveals that GMMs trained in the latent space of our AEs produce the best results.
Three-dimensional (3D) representations of real-life objects are a core tool for vision, robotics, medicine, augmented reality and virtual reality applications. Recent attempts to encode 3D geometry for use in deep learning include view-based projections, volumetric grids and graphs. In this work, we focus on the representation of 3D point clouds. Point clouds are becoming increasingly popular as a homogeneous, expressive and compact representation of surface-based geometry, with the ability to represent geometric details while taking up little space. Point clouds are particularly amenable to simple geometric operations and are a standard 3D acquisition format used by range-scanning devices like LiDARs, the Kinect or iPhone’s face ID feature.
All the aforementioned encodings, while effective in their target tasks (e.g. rendering or acquisition), are hard to manipulate directly in their raw form. For example, naïvely interpolating between two cars in any of those representations does not yield a representation of an “intermediate” car. Furthermore, these representations are not well suited for the design of generative models via classical statistical methods. Using them to edit and design new objects involves the construction and manipulation of custom, object-specific parametric models, that link the semantics to the representation. This process requires significant expertise and effort.
Deep learning brings the promise of a data-driven approach. In domains where data is plentiful, deep learning tools have eliminated the need for hand-crafting features and models. Architectures like AutoEncoders (AEs) (Rumelhart et al., 1988; Kingma & Welling, 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Radford et al., ; Che et al., 2016) are successful at learning data representations and generating realistic samples from complex underlying distributions. However, an issue with GAN-based generative pipelines is that training them is notoriously hard and unstable (Salimans et al., 2016). In addition, and perhaps more importantly, there is no universally accepted method for the evaluation of generative models.
In this paper, we explore the use of deep architectures for learning representations and introduce the first deep generative models for point clouds. Only a handful of deep architectures tailored to 3D point clouds exist in the literature, and their focus is elsewhere: they either aim at classification and segmentation (Qi et al., 2016a, 2017), or use point clouds only as an intermediate or output representation (Kalogerakis et al., 2016; Fan et al., 2016). Our specific contributions are:
A new AE architecture for point clouds—inspired by recent architectures used for classification (Qi et al., 2016a)—that can learn compact representations with (i) good reconstruction quality on unseen samples; (ii) good classification quality via simple methods (SVM), outperforming the state of the art (Wu et al., 2016); (iii) the capacity for meaningful semantic operations, interpolations and shape-completion.
The first set of deep generative models for point clouds, able to synthesize point clouds with (i) measurably high fidelity to, and (ii) good coverage of both the training and the held-out data. One workflow that we propose is to first train an AE to learn a latent representation and then train a generative model in that fixed latent space. The GANs trained in the latent space, dubbed here l-GANs, are easier to train than raw GANs and achieve superior reconstruction and better coverage of the data distribution. Multi-class GANs perform almost on par with class-specific GANs when trained in the latent space.
A study of various old and new point cloud metrics, in terms of their applicability (i) as reconstruction objectives for learning good representations; (ii) for the evaluation of generated samples. We find that a commonly used metric, Chamfer distance, fails to identify certain pathological cases.
Fidelity and coverage metrics for generative models, based on an optimal matching between two different collections of point clouds. Our coverage metric can identify parts of the data distribution that are completely missed by the generative model, something that diversity metrics based on cardinality might fail to capture (Arora & Zhang, 2017).
The rest of this paper is organized as follows: Section 2 outlines some background for the basic building blocks of our work. Section 3 introduces our metrics for the evaluation of generative point cloud pipelines. Section 4 discusses our architectures for latent representation learning and generation. In Section 5, we perform comprehensive experiments evaluating all of our models both quantitatively and qualitatively. Further results can be found in the Appendix. Last, the code for all our models is publicly available at http://github.com/optas/latent_3d_points.
In this section we give the necessary background on point clouds, their metrics and the fundamental building blocks that we will use in the rest of the paper.
A point cloud represents a geometric shape, typically its surface, as a set of 3D locations in a Euclidean coordinate frame. In 3D, these locations are defined by their $x$, $y$, $z$ coordinates. Thus, the point cloud representation of an object or scene is an $N \times 3$ matrix, where $N$ is the number of points, referred to as the point cloud resolution.
Point clouds as an input modality present a unique set of challenges when building a network architecture. As an example, the convolution operator, now ubiquitous in image-processing pipelines, requires the input signal to be defined on top of an underlying grid-like structure. Such a structure is not available in raw point clouds, which renders them significantly more difficult to encode than images or voxel grids. Recent classification work on point clouds (PointNet (Qi et al., 2016a)) bypasses this issue by avoiding convolutions involving groups of points. Another related issue with point clouds as a representation is that they are permutation invariant: any reordering of the rows of the point cloud matrix yields a point cloud that represents the same shape. This property complicates the comparisons between two point sets that are needed to define a reconstruction loss. It also creates the need for making the encoded feature permutation invariant.
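To make the permutation-invariance requirement concrete, the following is a minimal NumPy sketch, not the authors' implementation. It applies the same small network to every point independently and then takes a feature-wise maximum over the points. The layer widths, the random weights and the function name `encode` are illustrative assumptions.

```python
import numpy as np

def encode(points, weights, biases):
    """Shared per-point layers followed by a feature-wise max over points.

    points: (N, 3) array. Every point goes through the same weights, which is
    what a 1-D convolution with kernel size 1 does.
    """
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)   # ReLU, applied point by point
    return h.max(axis=0)                 # symmetric function: order-independent

rng = np.random.default_rng(0)
sizes = [3, 16, 32]                      # hypothetical layer widths
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=b) for b in sizes[1:]]

cloud = rng.normal(size=(2048, 3))
shuffled = cloud[rng.permutation(len(cloud))]   # same shape, rows reordered

code_a = encode(cloud, weights, biases)
code_b = encode(shuffled, weights, biases)
```

Because the maximum is taken over the point axis, `code_a` and `code_b` agree up to floating-point rounding even though the rows were reordered.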
Two permutation-invariant metrics for comparing unordered point sets have been proposed in the literature (Fan et al., 2016). On the one hand, the Earth Mover's distance (EMD) (Rubner et al., 2000) is the solution of a transportation problem which attempts to transform one set to the other. For two equally sized subsets $S_1 \subseteq \mathbb{R}^3$, $S_2 \subseteq \mathbb{R}^3$, their EMD is defined by

$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \lVert x - \phi(x) \rVert_2$$

where $\phi$ is a bijection. As a loss, EMD is differentiable almost everywhere. On the other hand, the Chamfer (pseudo)-distance (CD) measures the squared distance between each point in one set to its nearest neighbor in the other set:

$$d_{CH}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2$$

CD is differentiable and, compared to EMD, more efficient to compute.
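Both distances can be written in a few lines of NumPy. The sketch below is illustrative only: it computes the EMD exactly by solving the assignment problem with SciPy's `linear_sum_assignment`, whereas the paper trains with an EMD approximation, and the function names are our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_dists(s1, s2):
    """Euclidean distances between every point of s1 (N, 3) and s2 (M, 3)."""
    diff = s1[:, None, :] - s2[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def chamfer_distance(s1, s2):
    """Sum of squared nearest-neighbor distances, in both directions."""
    d2 = pairwise_dists(s1, s2) ** 2
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def earth_movers_distance(s1, s2):
    """Exact EMD for equally sized sets: cost of the optimal bijection."""
    if len(s1) != len(s2):
        raise ValueError("EMD as defined here needs equally sized sets")
    cost = pairwise_dists(s1, s2)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
    return cost[rows, cols].sum()
```

The exact matching scales poorly with the number of points, which is one reason an approximation is preferable when the EMD is used as a training loss.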
One of the main deep-learning components we use in this paper is the AutoEncoder (AE), which is an architecture that learns to reproduce its input. AEs can be especially useful when they contain a narrow bottleneck layer between input and output. Upon successful training, this layer provides a low-dimensional representation, or code, for each data point. The Encoder (E) learns to compress a data point $x$ into its latent representation, $z$. The Decoder (D) can then produce a reconstruction $\hat{x}$ of $x$ from its encoded version $z$.
In this paper we also work with Generative Adversarial Networks (GANs), which are state-of-the-art generative models. The basic architecture is based on an adversarial game between a generator (G) and a discriminator (D). The generator aims to synthesize samples that look indistinguishable from real data (drawn from $p_{data}$) by passing a randomly drawn sample from a simple distribution $p_z$ through the generator function. The discriminator is tasked with distinguishing synthesized from real samples.
A GMM is a probabilistic model for representing a population whose distribution is assumed to be multimodal Gaussian, i.e. comprising multiple subpopulations, where each subpopulation follows a Gaussian distribution. Assuming the number of subpopulations is known, the GMM parameters (means and variances of the Gaussians) can be learned from random samples, using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Once fitted, the GMM can be used to sample novel synthetic samples.

An important component of this work is the introduction of measures that enable comparisons between two sets of point clouds $A$ and $B$. These metrics are useful for assessing the degree to which point clouds, synthesized or reconstructed, represent the same population as a held-out test set. Our three measures are described below.
The Jensen-Shannon Divergence (JSD) between marginal distributions defined in the Euclidean 3D space. Assuming point cloud data that are axis-aligned and a canonical voxel grid in the ambient space, one can measure the degree to which point clouds of set $A$ tend to occupy similar locations as those of set $B$. To that end, we count the number of points lying within each voxel across all point clouds of $A$, and correspondingly for $B$, and report the JSD between the obtained empirical distributions $(P_A, P_B)$:

$$JSD(P_A \,\|\, P_B) = \frac{1}{2} D(P_A \,\|\, M) + \frac{1}{2} D(P_B \,\|\, M)$$

where $M = \frac{1}{2}(P_A + P_B)$ and $D(\cdot \,\|\, \cdot)$ is the KL-divergence between the two distributions (Kullback & Leibler, 1951).
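As a rough illustration of this measure, the sketch below bins all points of a set of clouds into a regular voxel grid and evaluates the JSD between two such histograms. The grid resolution, the cube half-width `bound` and the base-2 logarithm are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def voxel_distribution(clouds, resolution=16, bound=1.0):
    """Empirical distribution of points over a regular voxel grid.

    clouds: (M, N, 3) array of axis-aligned point clouds assumed to lie in
    the cube [-bound, bound]^3 (hypothetical normalization).
    """
    pts = np.asarray(clouds, dtype=float).reshape(-1, 3)
    idx = np.floor((pts + bound) / (2.0 * bound) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)            # keep boundary points inside
    flat = np.ravel_multi_index(tuple(idx.T), (resolution,) * 3)
    counts = np.bincount(flat, minlength=resolution ** 3).astype(float)
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL divergence in bits; terms with p == 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p_a, p_b):
    m = 0.5 * (p_a + p_b)
    return 0.5 * kl_divergence(p_a, m) + 0.5 * kl_divergence(p_b, m)
```

With base-2 logarithms the value lies between 0 (identical occupancy statistics) and 1 (no voxel shared by the two sets).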
Coverage (COV). For each point cloud in $A$ we first find its closest neighbor in $B$. Coverage is measured as the fraction of the point clouds in $B$ that were matched to point clouds in $A$. Closeness can be computed using either the CD or EMD point-set distance of Section 2, thus yielding two different metrics, COV-CD and COV-EMD. A high coverage score indicates that most of $B$ is roughly represented within $A$.
Minimum Matching Distance (MMD). Coverage does not indicate exactly how well the covered examples (point clouds) are represented in set $A$; matched examples need not be close. We need a way to measure the fidelity of $A$ with respect to $B$. To this end, we match every point cloud of $B$ to the one in $A$ with the minimum distance (MMD) and report the average of distances in the matching. Either point-set distance can be used, yielding MMD-CD and MMD-EMD. Since MMD relies directly on the distances of the matching, it correlates well with how faithful (with respect to $B$) elements of $A$ are.
The complementary nature of MMD and Coverage directly follows from their definitions. The set of point clouds $A$ captures all modes of $B$ with good fidelity when MMD is small and Coverage is large. JSD is fundamentally different. First, it evaluates the similarity between $A$ and $B$ in a coarser way, via marginal statistics. Second, and contrary to the other two metrics, it requires pre-aligned data, but it is also computationally friendlier. We have found and show experimentally that it correlates well with the MMD, which makes it an efficient alternative for e.g. model selection, where one needs to perform multiple comparisons between sets of point clouds.
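Coverage and MMD can both be read off a single matrix of pairwise distances between the clouds of the two sets. The sketch below is a small NumPy illustration under that reading; the Chamfer distance is used for brevity, and the function names are our own rather than those of the released code.

```python
import numpy as np

def chamfer(s1, s2):
    """Chamfer (pseudo)-distance between two point clouds of shape (N, 3)."""
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def distance_matrix(set_a, set_b, dist=chamfer):
    """dists[i, j] = distance between the i-th cloud of A and j-th cloud of B."""
    return np.array([[dist(a, b) for b in set_b] for a in set_a])

def coverage(dists):
    """Fraction of clouds in B that are the nearest neighbor of some cloud in A."""
    matched = np.unique(dists.argmin(axis=1))
    return len(matched) / dists.shape[1]

def minimum_matching_distance(dists):
    """Average, over clouds in B, of the distance to the closest cloud in A."""
    return float(dists.min(axis=0).mean())
```

A generator that collapses onto a few shapes yields low coverage even when each sample is realistic, while a generator that spreads widely but imprecisely yields high coverage with a large MMD.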
In this section we describe the architectures of our neural networks starting from an autoencoder. Next, we introduce a GAN that works directly with 3D point cloud data, as well as a decoupled approach which first trains an AE and then trains a minimal GAN in the AE's latent space. We conclude with a similar but even simpler solution that relies on classical Gaussian mixture models.
The input to our AE network is a point cloud with 2048 points (a $2048 \times 3$ matrix), representing a 3D shape. The encoder architecture follows the design principle of (Qi et al., 2016a): 1-D convolutional layers with kernel size 1 and an increasing number of features; this approach encodes every point independently. A "symmetric", permutation-invariant function (e.g. a max pool) is placed after the convolutions to produce a joint representation. In our implementation we use 5 1-D convolutional layers, each followed by a ReLU (Nair & Hinton, 2010) and a batch-normalization layer (Ioffe & Szegedy, 2015). The output of the last convolutional layer is passed to a feature-wise maximum to produce a $k$-dimensional vector, which is the basis for our latent space. Our decoder transforms the latent vector using 3 fully connected layers, the first two having ReLUs, to produce a $2048 \times 3$ output. For a permutation-invariant objective, we explore both the EMD approximation and the CD (Section 2) as our structural losses; this yields two distinct AE models, referred to as AE-EMD and AE-CD. To regularize the AEs we considered various bottleneck sizes, the use of drop-out and on-the-fly augmentations by randomly rotating the point clouds. The effect of these choices is showcased in the Appendix (Section A) along with the detailed training/architecture parameters. In the remainder of the paper, unless otherwise stated, we use an AE with a single bottleneck size $k$, chosen as described in the Appendix (Section A).

Our first GAN, referred to as the r-GAN, operates on the raw $2048 \times 3$ point set input. The architecture of the discriminator is identical to the AE (modulo the filter sizes and number of neurons), without any batch-norm and with leaky ReLUs (Maas et al., 2013) instead of ReLUs. The output of the last fully connected layer is fed into a sigmoid neuron. The generator takes as input a Gaussian noise vector and maps it to a $2048 \times 3$ output via 5 FC-ReLU layers.

For our l-GAN, instead of operating on the raw point cloud input, we pass the data through a pre-trained autoencoder, which is trained separately for each object class with the EMD (or CD) loss function. Both the generator and the discriminator of the l-GAN then operate on the bottleneck variables of the AE. Once the training of the GAN is over, we convert a code learned by the generator into a point cloud by using the AE's decoder. Our chosen architecture for the l-GAN, which was used throughout our experiments, is significantly simpler than the one of the r-GAN. Specifically, an MLP generator of a single hidden layer coupled with an MLP discriminator of two hidden layers suffices to produce measurably good and realistic results.

In addition to the l-GANs, we also fit a family of Gaussian Mixture Models (GMMs) on the latent spaces learned by our AEs. We experimented with various numbers of Gaussian components and diagonal or full covariance matrices. The GMMs can be turned into point cloud generators by first sampling the fitted distribution and then using the AE's decoder, similarly to the l-GANs.
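The last step, turning a fitted latent GMM into a point cloud generator, can be sketched as follows. This is a minimal NumPy illustration that assumes the mixture parameters have already been fitted (e.g. with EM); the `decode` function is a hypothetical single linear layer standing in for the trained AE decoder, and all sizes are placeholders.

```python
import numpy as np

def sample_gmm(weights, means, covariances, n_samples, rng):
    """Draw latent codes from a mixture of full-covariance Gaussians."""
    components = rng.choice(len(weights), size=n_samples, p=weights)
    return np.stack([
        rng.multivariate_normal(means[c], covariances[c]) for c in components
    ])

def decode(codes, W, b, n_points=2048):
    """Stand-in for the AE decoder: maps (B, k) codes to (B, n_points, 3)."""
    return (codes @ W + b).reshape(len(codes), n_points, 3)

rng = np.random.default_rng(0)
k, n_points = 8, 16                              # toy sizes for illustration
weights = np.array([0.3, 0.7])                   # mixture weights, sum to 1
means = rng.normal(size=(2, k))
covariances = np.stack([np.eye(k), 2.0 * np.eye(k)])

codes = sample_gmm(weights, means, covariances, n_samples=5, rng=rng)
W = rng.normal(size=(k, n_points * 3))           # hypothetical decoder weights
clouds = decode(codes, W, np.zeros(n_points * 3), n_points=n_points)
```

The same decoding step applies to codes produced by an l-GAN generator; only the source of the latent codes changes.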
In this section we experimentally establish the validity of our proposed evaluation metrics and highlight the merits of the AE representation (Section 5.1) and the generative models (Section 5.2). In all experiments in the main paper, we use shapes from the ShapeNet repository (Chang et al., 2015) that are axis-aligned and centered into the unit sphere. To convert these shapes (meshes) to point clouds we uniformly sample their faces in proportion to their area. Unless otherwise stated, we train models with point clouds from a single object class and work with train/validation/test sets of an 85%-5%-10% split. When reporting JSD measurements we use a regular voxel grid to compute the statistics.

We begin with demonstrating the merits of the proposed AE. First we report its generalization ability as measured using the MMD-CD and MMD-EMD metrics. Next, we utilize its latent codes to do semantically meaningful operations. Finally, we use the latent representation to train SVM classifiers and report the attained classification scores.
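The mesh-to-point-cloud conversion mentioned in the experimental setup above (sampling faces in proportion to their area) can be sketched as below. This is a generic NumPy version of area-weighted triangle sampling, not the authors' preprocessing code; the function name and argument layout are assumptions.

```python
import numpy as np

def sample_mesh(vertices, faces, n_points, rng):
    """Sample points uniformly on a triangle mesh surface.

    vertices: (V, 3) float array; faces: (F, 3) integer array of vertex indices.
    """
    tri = vertices[faces]                                   # (F, 3, 3)
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick faces with probability proportional to their area.
    chosen = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle.
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    a, b, c = 1.0 - r1, r1 * (1.0 - r2), r1 * r2
    t = tri[chosen]
    return a[:, None] * t[:, 0] + b[:, None] * t[:, 1] + c[:, None] * t[:, 2]
```

The square root in `r1` is what makes the samples uniform within a triangle; drawing the barycentric weights without it would cluster points near one vertex.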
Our AEs are able to reconstruct unseen shapes with quality almost as good as that of the shapes that were used for training. In Fig. 1 we use our AEs to encode unseen samples from the test split (the left of each pair of images) and then decode them and compare them visually to the input (the right image). To support our visuals quantitatively, in Table 1 we report the MMD-CD and MMD-EMD between reconstructed point clouds and their corresponding ground truth in the train and test datasets of the chair object class. The generalization gap under our metrics is small; to give a sense of scale for our reported numbers, note that the MMD is already non-zero, under both the CD and the EMD, between two versions of the test set that differ only by the randomness introduced in the point cloud sampling. Similar conclusions regarding the generalization ability of the AE can be made based on the reconstruction loss attained for each dataset (train or test), which is shown in Fig. 9 of the Appendix.
AE loss | MMD-CD (Train) | MMD-CD (Test) | MMD-EMD (Train) | MMD-EMD (Test) |
---|---|---|---|---|
CD | 0.0004 | 0.0012 | 0.068 | 0.075 |
EMD | 0.0005 | 0.0013 | 0.042 | 0.052 |
Another argument against under/over-fitting can be made by showing that the learned representation is amenable to intuitive and semantically rich operations. As it is shown in several recent works, well trained neural-nets learn a latent representation where additive linear algebra works to that purpose (Mikolov et al., 2013; Tasse & Dodgson, 2016). First, in Fig. 2 we show linear interpolations, in the latent space, between the left and right-most geometries. Similarly, in Fig. 3 we alter the input geometry (left) by adding, in latent space, the mean vector of geometries with a certain characteristic (e.g., convertible cars or cups without handles). Additional operations (e.g. shape analogies) are also possible, but due to space limitations we illustrate and provide the details in the Appendix (Section B) instead. These results attest to the smoothness of the learned space but also highlight the intrinsic capacity of point clouds to be smoothly morphed.
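The latent-space operations described above amount to simple vector arithmetic on the codes. The sketch below shows linear interpolation between two codes and one possible way of building an editing direction, the difference between the mean code of shapes with a characteristic and the mean code of shapes without it. That difference-of-means construction is an assumption made for illustration; the exact procedure is given in the Appendix.

```python
import numpy as np

def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent codes; returns (steps, k)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

def attribute_direction(codes_with, codes_without):
    """Hypothetical editing direction: difference of the two mean codes."""
    return codes_with.mean(axis=0) - codes_without.mean(axis=0)

def edit(z, direction, strength=1.0):
    """Move a code along an editing direction before decoding it."""
    return z + strength * direction
```

Each interpolated or edited code would then be passed through the AE decoder to obtain the corresponding point cloud.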
Our proposed AE architecture can be used to tackle the problem of shape completion with minimal adaptation. Concretely, instead of feeding and reconstructing the same point cloud, we can feed the network with an incomplete version of its expected output. Given proper training data, our network learns to complete severely partial point clouds. Due to space limitations we give the exact details of our approach in the Appendix (Section D) and demonstrate some achieved completions in Fig. 4 of the main paper.
Our final evaluation for the AE's design and efficacy is done by using the learned latent codes as features for classification. For this experiment to be meaningful, we train an AE across all different shape categories, using 57,000 models from 55 categories of man-made objects. Exclusively for this experiment, we use a different bottleneck size and apply random rotations to the input point clouds along the gravity axis. To obtain features for an input 3D shape, we feed its point cloud into the AE and extract the bottleneck activation vector. This vector is then classified by a linear SVM trained on the de-facto 3D classification benchmark of ModelNet (Wu et al., 2015). Table 2 shows comparative results. Remarkably, in the ModelNet10 dataset, which includes classes (chairs, beds etc.) that are populous in ShapeNet, our simple AE significantly outperforms the state of the art (Wu et al., 2016), which instead uses several layers of a GAN to derive its feature vector. In Fig. 16 of the Appendix we include the confusion matrix of the classifier evaluated on our latent codes on ModelNet40; the confusion happens between particularly similar geometries: a dresser vs. a nightstand or a flower-pot vs. a plant. The nuanced details that distinguish these objects may be hard to learn without stronger supervision.
Dataset | A | B | C | D | E | ours EMD | ours CD |
---|---|---|---|---|---|---|---|
MN10 | 79.8 | 79.9 | - | 80.5 | 91.0 | 95.4 | 95.4 |
MN40 | 68.2 | 75.5 | 74.4 | 75.5 | 83.3 | 84.0 | 84.5 |
Having established the quality of our AE, we now demonstrate the merits and shortcomings of our generative pipelines and establish one more successful application for the AE's learned representation. First, we conduct a comparison between our generative models followed by a comparison between our latent GMM generator and the state-of-the-art 3D voxel generator. Next, we describe how Chamfer distance can yield misleading results in certain pathological cases that r-GANs tend to produce. Finally, we show the benefit of working with a pre-trained latent representation in multi-class generators.
Fig. 6. Left: the JSD between the ground-truth test set and synthetic datasets generated by the GANs at various epochs of training. Right: EMD-based MMD/Coverage; curve markers indicate epochs 1, 10, 100, 200, 400, 1000, 1500, 2000, with larger symbols denoting later epochs.

For this study, we train five generators with point clouds of the chair category. First, we establish two AEs trained with the CD or EMD loss respectively (referred to as AE-CD and AE-EMD) and train an l-GAN in each latent space with the non-saturating loss of Goodfellow et al. (2014). In the space learned by the AE-EMD we train two additional models: an identical (architecture-wise) l-GAN that utilizes the Wasserstein objective with gradient penalty (Gulrajani et al., 2017) and a family of GMMs with a different number of means and structures of covariances. We also train an r-GAN directly on the point cloud data.
Fig. 6 shows the JSD (left) and the MMD and Coverage (right) between the produced synthetic datasets and the held-out test data for the GAN-based models, as training proceeds. Note that the r-GAN struggles to provide good coverage and good fidelity of the test set, which alludes to the well-established fact that end-to-end GANs are generally difficult to train. The l-GAN (AE-CD) performs better in terms of fidelity with much less training, but its coverage remains low. Switching to an EMD-based AE for the representation and otherwise using the same latent GAN architecture (l-GAN, AE-EMD) yields a dramatic improvement in coverage and fidelity. Both l-GANs, though, suffer from the known issue of mode collapse: half-way through training, coverage first starts dropping with fidelity still at good levels, which implies that they are overfitting a small subset of the data. Later on, this is followed by a more catastrophic collapse, with coverage dropping as low as 0.5%. Switching to a latent WGAN largely eliminates this collapse, as expected.
In Table 3, we report measurements for all generators based on the epoch (or underlying GMM parameters) that has minimal JSD between the generated samples and the validation set. To reduce the sampling bias of these measurements, each generator produces a set of synthetic samples whose size is a fixed multiple of the population of the comparative set (test or validation); we repeat the process several times and report the averages. The GMM selected by this process uses full covariance matrices. As shown in Fig. 18 of the Appendix, GMMs with full covariances perform much better than those that have diagonal structure, and 20 Gaussians suffice for good results. Last, the first row of Table 3 shows a baseline model that memorizes a random subset of the training data of the same size as the other generated sets.
Discussion. The results of Table 3 agree with the trends shown in Fig. 6 and further verify the superiority of the latent-based approaches and the relative gains of using an AE-EMD vs. an AE-CD. Moreover they demonstrate that a simple GMM can achieve results of comparable quality to a latent WGAN. Lastly, it is worth noting how the GMM has achieved similar fidelity as that of the perfect/memorized chairs and with almost as good coverage. Table 8 of the supplementary shows the same performance-based conclusions when our metrics are evaluated on the train split.
Model | Type | JSD | MMD-CD | MMD-EMD | COV-EMD | COV-CD |
---|---|---|---|---|---|---|
A | MEM | 0.017 | 0.0018 | 0.063 | 78.6 | 79.4 |
B | RAW | 0.176 | 0.0020 | 0.123 | 19.0 | 52.3 |
C | CD | 0.048 | 0.0020 | 0.079 | 32.2 | 59.4 |
D | EMD | 0.030 | 0.0023 | 0.069 | 57.1 | 59.3 |
E | EMD | 0.022 | 0.0019 | 0.066 | 66.9 | 67.6 |
F | GMM | 0.020 | 0.0018 | 0.065 | 67.4 | 68.9 |
An interesting observation regarding the r-GAN can be made in Table 3. The JSD and the EMD-based metrics strongly favor the latent approaches, while the Chamfer-based ones are much less discriminative. To decipher this discrepancy we did an extensive qualitative inspection of the r-GAN samples and found many cases of point clouds that were over-populated in locations where, on average, most chairs have mass. This hedging of the r-GAN is particularly hard for Chamfer to penalize, since one of its two summands can become very small while the other remains only moderately large, thanks to a few sparsely placed points in the non-populated locations. Figure 7 highlights this point. For a ground-truth point cloud we retrieve its nearest neighbor, under the CD, in synthetically generated sets produced by the r-GAN and the l-GAN; the in-image numbers report their CD and EMD distances from it. Notice how the CD fails to distinguish the inferiority of the r-GAN samples while the EMD establishes it. This blindness of the CD metric to only partially good matches has the additional side effect that the CD-based coverage is consistently bigger than the EMD-based one.
Class | Fidelity (A) | Fidelity (Ours) | Coverage (A) | Coverage (Ours) |
---|---|---|---|---|
car | 0.059 | 0.041 | 28.6 | 65.3 |
rifle | 0.051 | 0.045 | 69.0 | 74.8 |
sofa | 0.077 | 0.055 | 52.5 | 66.6 |
table | 0.103 | 0.061 | 18.3 | 71.1 |
Generative models for other 3D modalities, like voxels, have been recently proposed (Wu et al., 2016). One interesting question is: if point clouds are our target modality, does it make sense to use voxel generators and then convert to point clouds? This experiment answers this question in the negative. First, we make a comparison using a latent GMM which is trained in conjunction with an AE-EMD. Secondly, we build an AE which operates with voxels and fit a GMM in the corresponding latent space. In both cases, we use 32 Gaussians and a full covariance matrix for these GMMs. To use our point-based metrics, we convert the output of (Wu et al., 2016) and our voxel-based GMM into meshes which we sample to generate point clouds. To do this conversion we use the marching-cubes algorithm (Lewiner et al., 2003), with the isovalue suggested by the authors for the former method and a separately chosen isovalue for our voxel-AE. We also constrain each mesh to be a single connected component, as the vast majority of ground-truth data are.
Table 4 reveals how our point-based GMM trained with a class-specific AE-EMD fares against (Wu et al., 2016) on four object classes for which the authors have made their (also class-specific) models publicly available (http://github.com/zck119/3dgan-release). Our approach is consistently better, with a coverage boost that can be almost $4\times$ (18.3 vs. 71.1) and a fidelity improvement of roughly $1.7\times$ (0.103 vs. 0.061), both in the case of the table class. This is despite the fact that (Wu et al., 2016) uses all models of each class for training, contrary to our generators, which never had access to the underlying test split.
Table 5 reveals the performance achieved by pre-training a voxel-based AE for the chair class. Observe how, by working with a voxel-based latent space, aside from making comparisons to (Wu et al., 2016) more direct (e.g. we both convert output voxels to meshes), we also establish significant gains in terms of coverage and fidelity.
Method | MMD-CD | MMD-EMD | COV-CD | COV-EMD |
---|---|---|---|---|
A | 0.0046 | 0.091 | 19.6 | 22.4 |
Ours | 0.0025 | 0.072 | 60.3 | 64.8 |
Finally, we compare class-specific and class-agnostic generators. In Table 6 we report the MMD-CD for l-WGANs trained in the space of either a dedicated (per-class) AE-EMD or an AE-EMD trained with all listed object classes. It turns out that the l-WGANs produce similar results in either space. Qualitative comparison (Fig. 8) also reveals that by using a multi-class AE-EMD we do not sacrifice much in terms of visual quality compared to the dedicated AEs.
Split | airplane | car | chair | sofa | table | average | multi-class |
---|---|---|---|---|---|---|---|
Train | 0.0004 | 0.0006 | 0.0015 | 0.0011 | 0.0013 | 0.0010 | 0.0011 |
Test | 0.0006 | 0.0007 | 0.0019 | 0.0014 | 0.0017 | 0.0013 | 0.0014 |
Recently, deep learning architectures for view-based projections (Su et al., 2015; Wei et al., 2016; Kalogerakis et al., 2016), volumetric grids (Qi et al., 2016b; Wu et al., 2015; Hegde & Zadeh, 2016) and graphs (Bruna et al., 2013; Henaff et al., 2015; Defferrard et al., 2016; Yi et al., 2016b) have appeared in the 3D machine learning literature.
A few recent works (Wu et al., 2016; Wang et al., 2016; Girdhar et al., 2016; Brock et al., 2016; Maimaitimin et al., 2017; Zhu et al., 2016) have explored generative and discriminative representations for geometry. They operate on different modalities, typically voxel grids or view-based image projections. To the best of our knowledge, our work is the first to study such representations for point clouds.
Training Gaussian mixture models (GMMs) in the latent space of an autoencoder is closely related to VAEs (Kingma & Welling, 2013). One documented issue with VAEs is over-regularization: the regularization term associated with the prior is often so strong that reconstruction quality suffers (Bowman et al., 2015; Sønderby et al., 2016; Kingma et al., 2016; Dilokthanakul et al., 2016). The literature contains methods that start only with a reconstruction penalty and slowly increase the weight of the regularizer. An alternative approach is based on adversarial autoencoders (Makhzani et al., 2015), which use a GAN to implicitly regularize the latent space of an AE.
We presented a novel set of architectures for 3D point cloud representation learning and generation. Our results show good generalization to unseen data and our representations encode meaningful semantics. In particular our generative models are able to produce faithful samples and cover most of the ground truth distribution. Interestingly, our extensive experiments show that the best generative model for point clouds is a GMM trained in the fixed latent space of an AE. While this might not be a universal result, it suggests that simple classic tools should not be dismissed. A thorough investigation on the conditions under which simple latent GMMs are as powerful as adversarially trained models would be of significant interest.
The authors wish to thank all the anonymous reviewers for their insightful comments and suggestions, Lin Shao and Matthias Nießner for their help with the shape completions, and Fei Xia for his suggestions on the evaluation metrics. Last but not least, they wish to acknowledge the support of NSF grants NSF IIS-1528025 and DMS-1546206, ONR MURI grant N00014-13-1-0341, a Google Focused Research award, and a gift from Amazon Web Services for Machine Learning Research.
Computer Graphics Forum, 2003.The encoding layers of our AEs were implemented as 1D-convolutions with ReLUs, with kernel size of
and stride of
, i.e. treating each 3D point independently. Their decoding layers, were MLPs built with FC-ReLUs. We used Adam (Kingma & Ba, 2014) with initial learning rate of , of and a batch size of to train all AEs.For the AE mentioned in the SVM-related experiments of Section 5.1 of the main paper, we used an encoder with and filters in each of its layers and a decoder with neurons, respectively. Batch normalization was used between every layer. We also used online data augmentation by applying random rotations along the gravity-(z)-axis to the input point clouds of each batch. We trained this AE for epochs with the CD loss and for with the EMD.
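The online augmentation described above, a random rotation about the gravity (z) axis, can be sketched in a few lines. This is a minimal illustration in plain Python and not the authors' implementation; the function names `rotate_about_z` and `augment_batch`, and the choice of drawing angles uniformly from [0, 2π), are assumptions.

```python
import math
import random


def rotate_about_z(points, angle):
    """Rotate a point cloud (list of (x, y, z) tuples) by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    # Only x and y change; the height (z) of every point is preserved.
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def augment_batch(batch, rng=random):
    """Apply an independent random z-rotation to every cloud in the batch.

    Assumption: angles are drawn uniformly from [0, 2*pi).
    """
    return [rotate_about_z(cloud, rng.uniform(0.0, 2.0 * math.pi)) for cloud in batch]
```

Because the rotation is drawn anew for every batch, the network sees each shape under many azimuthal orientations over the course of training.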
For all other AEs, the encoder had and filters at each layer, with being the bottleneck size. The decoder consisted of FC-ReLU layers with neurons each. We trained these AEs for a maximum of epochs when using single class data and epochs for the experiment involving shape classes (end of Section 5.2, main paper).
To determine an appropriate size for the latent-space, we constructed 8 (otherwise architecturally identical) AEs with bottleneck sizes and trained them with point clouds of the chair object class, under the two losses (Fig. 9). We repeated this procedure with pseudo-random weight initializations three times and found that had the best generalization error on the test data, while achieving minimal reconstruction error on the train split.
Remark. Different AE setups brought no noticeable advantage over our main architecture. Concretely, adding drop-out layers resulted in worse reconstructions, while using batch-norm on the encoder only sped up training and gave us slightly better generalization error when the AE was trained with single-class data. Exclusively for the SVM experiment of Section 5.1 of the main paper, we randomly rotate the input chairs to promote latent features that are rotation-invariant.
For shape editing applications, we use the embedding we learned with the AE-EMD trained across all 55 object classes, not separately per-category. This showcases its ability to encode features for different shapes, and enables interesting applications involving different kinds of shapes.
We use the shape annotations of Yi et al. (2016a) as guidance to modify shapes. As an example, assume that a given object category (e.g. chairs) can be further subdivided into two sub-categories and : every object possesses a certain structural property (e.g. has armrests, is four-legged, etc.) and objects do not. Using our latent representation we can model this structural difference between the two sub-categories by the difference between their average latent representations , where , . Then, given an object , we can change its property by transforming its latent representation: , and decode to obtain . This process is shown in Fig. 3 of the main paper.
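The editing step described above amounts to a vector shift in latent space. The sketch below is a minimal illustration under stated assumptions: latent codes are plain lists of floats, and the helper names `mean_code` and `transfer_property` are hypothetical. Decoding the shifted code with the AE decoder is not shown.

```python
def mean_code(codes):
    """Component-wise average of a list of equal-length latent codes."""
    n = len(codes)
    return [sum(c[i] for c in codes) / n for i in range(len(codes[0]))]


def transfer_property(code, source_codes, target_codes):
    """Shift `code` by the difference between the target and source sub-category means.

    `source_codes` are codes of the sub-category the shape belongs to;
    `target_codes` are codes of the sub-category whose property we want.
    """
    mu_src = mean_code(source_codes)
    mu_tgt = mean_code(target_codes)
    return [z + t - s for z, t, s in zip(code, mu_tgt, mu_src)]
```

The same function applies in either direction: swapping the two code lists removes the property instead of adding it.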
By linearly interpolating between the latent representations of two shapes and decoding the result we obtain intermediate variants between the two shapes. This produces a “morph-like” sequence with the two shapes at its end points (Fig. 2 of main paper and Fig. 11 here). Our latent representation is powerful enough to support removing and merging shape parts, which enables morphing between shapes of significantly different appearance. Our cross-category latent representation enables morphing between shapes of different classes, cf. the second row for an interpolation between a bench and a sofa.
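As a rough sketch of the interpolation just described, the function below returns evenly spaced codes on the straight segment between two latent codes; each would then be passed through the decoder (not shown). The name `interpolate_codes` and the list-of-floats representation are assumptions made for illustration.

```python
def interpolate_codes(z_start, z_end, num_steps):
    """Return `num_steps` codes evenly spaced from z_start to z_end, end points included."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both end points")
    path = []
    for k in range(num_steps):
        t = k / (num_steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```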
The Euclidean nature of the latent space is further demonstrated by finding “analogous” shapes through a combination of linear manipulations and Euclidean nearest-neighbor searching. Concretely, we find the difference vector between and , we add it to shape and search in the latent space for the nearest-neighbor of that result, which yields shape . We demonstrate the finding in Fig. 10 with images taken from the meshes used to derive the underlying point clouds to help the visualization. Finding shape analogies has been of interest recently in the geometry processing community (Rustamov et al., 2013; Huang et al., 2018).
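The analogy search can be sketched as a vector offset followed by a brute-force nearest-neighbor lookup. The shape names (`z_a`, `z_a_prime`, `z_b`) and the function `find_analogy` are hypothetical labels chosen for this illustration; a real pipeline would likely use an indexed search over the stored latent codes.

```python
import math


def find_analogy(z_a, z_a_prime, z_b, candidates):
    """Return the index of the candidate code closest to z_b + (z_a_prime - z_a)."""
    query = [b + (ap - a) for a, ap, b in zip(z_a, z_a_prime, z_b)]
    # Brute-force Euclidean nearest neighbor over all candidate codes.
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))
```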
Loss | ModelNet40: -plt | ModelNet40: icpt | ModelNet40: loss | ModelNet10: -plt | ModelNet10: icpt | ModelNet10: loss |
---|---|---|---|---|---|---|
EMD | 0.09 | 0.5 | hng | 0.02 | 3 | sq-hng |
CD | 0.25 | 0.4 | sq-hng | 0.05 | 0.2 | sq-hng |
In addition to ShapeNet core, which contains only man-made objects, we experimented with the D-FAUST dataset of Bogo et al. (2017), which contains meshes of human subjects. Specifically, D-FAUST contains K scanned meshes of 10 human subjects performing a variety of motions. Each human performs a set of (maximally) motions, each captured by a temporal sequence of meshes. For our purposes, we use a random subset of (out of the ) meshes for each human/motion and extract from each mesh a point cloud with 4096 points. Our resulting dataset contains a total of point clouds and we use a train-test-val split of - while enforcing that every split contains all human/motion combinations. We use this data to train and evaluate an AE-EMD that is identical to the single-class AE presented in the main paper, with the only difference being the number of neurons of the last layer ( instead of ).
We demonstrate reconstruction and interpolation results in Figs. 13 and 14. For a given human subject and a specific motion we pick at random two meshes corresponding to time points , (with ) and show their reconstructions along with the ground truth in Fig. 13 (left-most and right-most of each row). In the same figure we also plot the reconstructions of two random meshes captured in (middle-two of each row). In Fig. 14, instead of encoding/decoding the ground truth test data, we show decoded linear interpolations between the meshes of , .
An important application of our AE architecture is completing point clouds that contain limited information about the underlying geometry. Typical range scans acquired for an object in real-time can often miss entire regions of the object due to self-occlusions and the lack of adequate (or “dense”) view-point registrations. This makes any sensible solution to this problem of high practical importance. To address it here, we resort to a dataset significantly different from the ones used in the rest of this paper. Namely, we utilize the dataset of Dai et al. (2016), which contains pairs of complete (intact) 3D CAD models and partial versions of them. Specifically, for each object of ShapeNet (core) it contains six partial point clouds created by the aggregation of frames taken over a limited set of view-points in a virtual trajectory established around the object. Given this data, we first fix the dimensionality of the partial point clouds to be points for each one by randomly sub-sampling them. Second, we apply uniform-in-area sampling to each complete CAD model to extract from it points to represent a “complete” ground-truth datum. All the resulting point clouds are centered in the unit-sphere and (within a class) the partial and complete point clouds are co-aligned. Last, we train class-specific neural-nets with Chair, Table and Airplane data and a train/val/test split of [80%, 5%, 15%].
The high-level design of the architecture we use for shape completions is identical to the AE, i.e. independent convolutions followed by FCs, trained under a structural loss (CD or EMD). However, essential parts of this network are different: depth, bottleneck size (controlling compression ratio) and the crucial differentiation between the input and the output data. Technically, the resulting architecture is an Abstractor-Predictor (AP) and consists of three layers of independent per-point convolutions, with filter sizes of , followed by a max-pool, which is followed by an FC-ReLU ( neurons) and a final FC layer ( neurons). We do not use batch-normalization between any layers and train each class-specific AP for a maximum of epochs, with Adam, an initial learning rate of and a batch size of . For evaluation on the test data, we use the model (epoch) that attains the minimal loss on the validation split.
We use the specialized point cloud completion metrics introduced by Sung et al. (2015): a) the accuracy, which is the fraction of the predicted points that are within a given radius () from any point in the ground truth point cloud, and b) the coverage, which is the fraction of the ground-truth points that are within from any predicted point. In Table 8 we report these metrics (with a similarly to (Sung et al., 2015)) for class-specific networks that were trained with the EMD and CD losses respectively. We observe that the CD loss gives rise to more accurate but also less complete outputs, compared to the EMD. This highlights again the greedy nature of CD: since it does not take into account the matching between input/output, it can generate completions that are more concentrated around the (incomplete) input point cloud. Figure 15 shows the corresponding completions of those presented in the main paper, but with a network trained under the CD loss.
Class | Airplane | Chair | Table |
---|---|---|---|
Test-size | 4.5K | 6K | 6K |
Acc-CD | 96.9 | 86.5 | 87.6 |
Acc-EMD | 94.7 | 77.1 | 78.4 |
Cov-CD | 96.6 | 77.5 | 75.2 |
Cov-EMD | 96.8 | 82.6 | 83.0 |
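The accuracy and coverage metrics reported in Table 8 are two directions of the same test: whether each point of one cloud has a point of the other cloud within a given radius. The brute-force sketch below illustrates this; the function names are hypothetical, and treating "within" as less-than-or-equal is an assumption.

```python
import math


def fraction_within(source, target, radius):
    """Fraction of `source` points lying within `radius` of at least one `target` point."""
    hits = sum(1 for p in source if any(math.dist(p, q) <= radius for q in target))
    return hits / len(source)


def completion_accuracy(predicted, ground_truth, radius):
    """Fraction of predicted points that are close to the ground truth."""
    return fraction_within(predicted, ground_truth, radius)


def completion_coverage(predicted, ground_truth, radius):
    """Fraction of ground-truth points that are close to some predicted point."""
    return fraction_within(ground_truth, predicted, radius)
```

A prediction that clusters tightly around a small part of the true shape scores high on accuracy and low on coverage, which is the pattern the CD-trained networks show relative to the EMD-trained ones.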
For the classification experiments of Section 5.1 (main paper) we used a one-versus-rest linear SVM classifier with an norm penalty and balanced class weights. The exact optimization parameters can be found in Table 7. The confusion matrix of the classifier evaluated on our latent codes on ModelNet40 is shown in Fig. 16.
The discriminator’s first layers are 1D-convolutions with stride/kernel of size and filters each; interleaved with leaky-ReLU. They are followed by a feature-wise max-pool. The last FC-leaky-ReLU layers have , neurons each and they lead to a single sigmoid neuron. We used units of leak.
The generator consists of FC-ReLU layers with neurons each. We trained r-GAN with Adam with an initial learning rate of , and of in batches of size . The noise vector was drawn from a spherical Gaussian of dimensions with zero mean and units of standard deviation.
Some synthetic results produced by the r-GAN are shown in Fig. 12.
The discriminator consists of FC-ReLU layers with neurons each and a final FC layer with a single sigmoid neuron. The generator consists of FC-ReLUs with neurons each. When using the l-Wasserstein-GAN, we used a gradient penalty regularizer and trained the critic for iterations for each training iteration of the generator. The training parameters (learning rate, batch size) and the generator’s noise distribution were the same as those used for the r-GAN.
All GANs are trained for at most 2000 epochs; for each GAN, we select one of its training epochs to obtain the “final” model, based on how well the synthetic results match the ground-truth distribution. Specifically, at a given epoch, we use the GAN to generate a set of synthetic point clouds, and measure the distance between this set and the validation set. We avoid measuring this distance using MMD-EMD, given the high computational cost of EMD. Instead, we use either the JSD or MMD-CD metrics to compare the synthetic dataset to the validation dataset. To further reduce the computational cost of model selection, we only check every 100 epochs (50 for r-GAN). The generalization error of the various GAN models, at various training epochs, as measured by MMD and JSD is shown in Fig. 17 (left and middle).
Left/middle: Generalization error of the various GAN models, at various training epochs. Generalization is estimated using the JSD (left) and MMD-CD (middle) metrics, which measure closeness of the synthetic results to the training resp. test ground truth distributions. The plots show the measurements of various GANs. Right: Training trends in terms of the MMD-CD metric for the various GANs. Here, we sample a set of synthetic point clouds for each model, of size 3x the size of the ground truth test dataset, and measure how well this synthetic dataset matches the ground truth in terms of MMD-CD. This plot complements Fig. 6 (left) of the main paper, where a different evaluation measure was used - note the similar behavior.
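The checkpoint selection described above can be sketched as computing a divergence at each checked epoch and keeping the minimum. The block below gives a plain-Python Jensen-Shannon divergence between two discrete distributions and a helper that picks the best epoch. How the distributions `p` and `q` are built from sets of point clouds is defined in the main paper and is not reproduced here; the function names and the use of natural logarithms are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p[i] == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions of equal length."""
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def select_epoch(scores_by_epoch):
    """Pick the checkpoint epoch with the smallest validation score."""
    return min(scores_by_epoch, key=scores_by_epoch.get)
```

Here `scores_by_epoch` is a hypothetical dictionary mapping each checked epoch to its JSD (or MMD-CD) against the validation set.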
Using the same JSD criterion, we also select the number and covariance type of Gaussian components for the GMM (Fig. 18, left), and obtain the optimal value of 32 components. GMMs performed much better with full (as opposed to diagonal) covariance matrices, suggesting strong correlations between the latent dimensions (Fig. 18, right).
When using MMD-CD as the selection criterion, we obtain models of similar quality and at similar stopping epochs (see Table 9); the optimal number of Gaussians in this case was 40. The training behavior measured using MMD-CD can be seen in Fig. 17 (right).
Method | Epoch | JSD | MMD-CD | MMD-EMD | COV-EMD | COV-CD |
---|---|---|---|---|---|---|
A | 1350 | 0.1893 | 0.0020 | 0.1265 | 19.4 | 54.7 |
B | 300 | 0.0463 | 0.0020 | 0.0800 | 32.6 | 58.2 |
C | 200 | 0.0319 | 0.0022 | 0.0684 | 57.6 | 58.7 |
D | 1700 | 0.0240 | 0.0020 | 0.0664 | 64.2 | 64.7 |
E | - | 0.0182 | 0.0018 | 0.0646 | 68.6 | 69.3 |
Our voxel-based AEs are fully-convolutional with the encoders consisting of 3D-Conv-ReLU layers and the decoders of 3D-Conv-ReLU-transpose layers. Below, we list the parameters of consecutive layers, listed left-to-right. The layer parameters are denoted in the following manner: (number of filters, filter size). Each Conv/Conv-Transpose has a stride of except the last layer of the decoder which has . In the last layer of the decoders we do not use a non-linearity. The abbreviation “bn” stands for batch-normalization.
- model
Encoder: Input (32, 6) (32, 6) bn (64, 4) (64, 2) bn (64, 2)
Decoder: (64, 2) (32, 4) bn (32, 6) (1, 8) Output
- model
Encoder: Input (32, 6) (32, 6) bn (64, 4) (64, 4) bn (64, 2) (64, 2)
Decoder: (64, 2) (32, 4) bn (32, 6) (32, 6) bn (32, 8) (1, 8) Output
We train each AE for epochs with Adam under the binary cross-entropy loss. The learning rate was , the and the batch size . To validate our voxel AE architectures, we compared them in terms of reconstruction quality to the state-of-the-art method of (Tatarchenko et al., 2017) and obtained comparable results, as demonstrated in Table 10.
Voxel Resolution | 32 | 64 |
---|---|---|
Ours | 92.7 | 88.4 |
(Tatarchenko et al., 2017) | 93.9 | 90.4 |
Sample Set Size | COV-CD | MMD-CD | COV-EMD | MMD-EMD |
---|---|---|---|---|
Entire train set | 97.3 | 0.0013 | 98.2 | 0.0545 |
1 × test-set size | 54.0 | 0.0023 | 51.9 | 0.0699 |
3 × test-set size | 79.4 | 0.0018 | 78.6 | 0.0633 |
Full-GMM/32 (3 × test-set size) | 68.9 | 0.0018 | 67.4 | 0.0651 |
Here we compare our GMM-generator against a model that memorizes the training data of the chair class. To do this, we either consider the entire training set or randomly sub-sample it to create sets of different sizes. We then evaluate our metrics between these “memorized” sets and the point clouds of the test split (see Table 11). The coverage/fidelity obtained by our generative models (last row) is slightly lower than in the equivalent-size case (third row), as expected: memorizing the training set produces good coverage/fidelity with respect to the test set when they are both drawn from the same population. This speaks for the validity of our metrics. Naturally, the advantage of using a learned representation lies in learning the structure of the underlying space instead of individual samples, which enables compactly representing the data and generating novel shapes as demonstrated by our interpolations. In particular, note that while some mode collapse is present in our generative results, as indicated by the drop in coverage, the MMD of our generative models is almost identical to that of the memorization case, indicating excellent fidelity.
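To make the comparison above concrete, the following sketch computes coverage and MMD between a generated set and a reference set under the Chamfer distance. It assumes the matching-based definitions from the main paper: coverage is the fraction of reference clouds that are the nearest neighbor of at least one generated cloud, and MMD is the average, over reference clouds, of the distance to the closest generated cloud. Normalizing each Chamfer term by the number of points, and all function names, are assumptions of this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(s1, s2):
    """Symmetric Chamfer distance; each direction is averaged over its points (assumption)."""
    d12 = sum(min(sq_dist(p, q) for q in s2) for p in s1) / len(s1)
    d21 = sum(min(sq_dist(q, p) for p in s1) for q in s2) / len(s2)
    return d12 + d21


def cov_cd(generated, reference):
    """Fraction of reference clouds matched as nearest neighbor by some generated cloud."""
    matched = set()
    for g in generated:
        matched.add(min(range(len(reference)), key=lambda i: chamfer(g, reference[i])))
    return len(matched) / len(reference)


def mmd_cd(generated, reference):
    """Average distance from each reference cloud to its closest generated cloud."""
    return sum(min(chamfer(r, g) for g in generated) for r in reference) / len(reference)
```

Under these definitions a generator that collapses onto a few modes can still reach a low MMD while its coverage drops, which matches the behavior discussed above.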
In addition to the EMD-based comparisons in Table 4 of the main paper, in Tables 12 and 13 we provide comparisons with (Wu et al., 2015) for the ShapeNet classes for which the authors have made their models publicly available. In Table 12 we provide JSD-based comparisons for two of our models. In Table 13 we provide Chamfer-based Fidelity/Coverage comparisons on the test split, which complement the EMD-based ones of Table 4 in the main paper.
Class | A (Tr+Te) | B (Tr) | B (Te) | C (Tr) | C (Te) |
---|---|---|---|---|---|
airplane | - | 0.0149 | 0.0268 | 0.0065 | 0.0191 |
car | 0.1890 | 0.0081 | 0.0109 | 0.0063 | 0.0108 |
rifle | 0.2012 | 0.0212 | 0.0364 | 0.0092 | 0.0214 |
sofa | 0.1812 | 0.0102 | 0.0102 | 0.0102 | 0.0101 |
table | 0.2472 | 0.0058 | 0.0177 | 0.0035 | 0.0143 |
Class | MMD-CD (A) | MMD-CD (B) | COV-CD (A) | COV-CD (B) |
---|---|---|---|---|
airplane | - | 0.0005 | - | 71.1 |
car | 0.0015 | 0.0007 | 22.9 | 63.0 |
rifle | 0.0008 | 0.0005 | 56.7 | 71.7 |
sofa | 0.0027 | 0.0013 | 42.40 | 75.5 |
table | 0.0058 | 0.0016 | 16.7 | 71.7 |
In Table 14 we compare to (Wu et al., 2016) in terms of the JSD and MMD-CD on the training set of the chair category. Since Wu et al. (2016) do not use any train/test split, we perform 5 rounds of sampling 1K synthetic results from their models and report the best values of the respective evaluation metrics. We also report the average classification probability of the synthetic samples to be classified as chairs by the PointNet classifier. The r-GAN mildly outperforms (Wu et al., 2016) in terms of its diversity (as measured by JSD/MMD), while also creating realistic-looking results, as shown by the classification score. The l-GANs perform even better, both in terms of classification and diversity, with fewer training epochs. Finally, note that the PointNet classifier was trained on ModelNet, and (Wu et al., 2016) occasionally generates shapes that only rarely appear in ModelNet. In conjunction with their higher tendency for mode collapse, this partially accounts for their lower classification scores.
Metric | A | B | C | D | E | F |
---|---|---|---|---|---|---|
JSD | 0.1660 | 0.1705 | 0.0372 | 0.0188 | 0.0077 | 0.0048 |
MMD-CD | 0.0017 | 0.0042 | 0.0015 | 0.0018 | 0.0015 | 0.0014 |
CLF | 84.10 | 87.00 | 96.10 | 94.53 | 89.35 | 87.40 |
Figure 19 shows some failure cases of our models. Chairs with rare geometries (left two images) are sometimes not faithfully decoded. Additionally, the AEs may miss high-frequency geometric details, e.g. a hole in the back of a chair (middle), thus altering the style of the input shape. Finally, the r-GAN often struggles to create realistic-looking shapes (right); while the r-GAN chairs are easily visually recognizable, it has a harder time on cars. Designing more robust raw-GANs for point clouds remains an interesting avenue for future work. A limitation of our shape-completion pipeline regards the style of the partial shape, which might not be well preserved in the completed point cloud (see Fig. 21 for an example).