EditVAE: Unsupervised Part-Aware Controllable 3D Point Cloud Shape Generation

by Shidi Li, et al.
Australian National University

This paper tackles the problem of parts-aware point cloud generation. Unlike existing works which require the point cloud to be segmented into parts a priori, our parts-aware editing and generation is performed in an unsupervised manner. We achieve this with a simple modification of the Variational Auto-Encoder which yields a joint model of the point cloud itself along with a schematic representation of it as a combination of shape primitives. In particular, we introduce a latent representation of the point cloud which can be decomposed into a disentangled representation for each part of the shape. These parts are in turn disentangled into both a shape primitive and a point cloud representation, along with a standardising transformation to a canonical coordinate system. The dependencies between our standardising transformations preserve the spatial dependencies between the parts in a manner which allows meaningful parts-aware point cloud generation and shape editing. In addition to the flexibility afforded by our disentangled representation, the inductive bias introduced by our joint modelling approach yields state-of-the-art experimental results on the ShapeNet dataset.





The generation of 3D shapes has broad applications in computer graphics such as automatic model generation for artists and designers nash2017shape, computer-aided design mo2020pt2pc, and computer vision tasks such as recognition (Choy et al. bongsoo2015enriching). There has been a recent boost in efforts to learn generative shape models from data achlioptas2018learning; shu20193d, with the main trend being to learn the distribution of 3D point clouds using deep generative models such as Variational Auto-Encoders (VAEs) yang2019pointflow, Generative Adversarial Networks (GANs) shu20193d, and normalising flows yang2019pointflow.

Recently, mo2020pt2pc addressed structure-aware 3D shape generation, which conditions on the segmentation of point clouds into meaningful parts such as the legs of a chair. This yields high quality generation results, but requires time-consuming annotation of the point cloud as a part-tree representation. A natural alternative, therefore, involves extracting semantically meaningful parts representations in an unsupervised manner, using ideas from recent work on disentangled latent representations chen2018isolating; kim2018disentangling, that is, representations for which statistical dependencies between latents are discouraged. While disentanglement of the latents allows independent part sampling, reducing the dependence among parts themselves leads to samples with mismatched style across parts.

Figure 1: Our model learns a disentangled latent representation from point clouds in an unsupervised manner, allowing parts-aware generation, controllable parts mixing and parts sampling. Here we demonstrate: parts-aware generation as denoted by the different colours; controllable parts mixing to combine the legs of the upper chair with the fixed back and base of the chairs at left; and parts sampling of the plane stabilizers.

In this paper we propose EditVAE, a framework for unsupervised parts-aware generation. EditVAE is unsupervised yet learned end-to-end, and allows parts-aware editing while respecting inter-part dependencies. We leverage a simple insight into the VAE which admits a latent space that disentangles the style and pose of the parts of the generated point clouds. Our model builds upon recent advances in primitive-based point cloud representations to disentangle the latent space into parts, which are modeled by both latent point clouds and latent superquadric primitives, along with latent transformations to a canonical coordinate system. While we model point clouds (thereby capturing detailed geometry), our model inherits from the shape primitive based point cloud segmentation method of paschalidou2019superquadrics: a semantically consistent segmentation across datasets that does not require supervision in the form of part labeling. Given the disentangled parts representation, we can perform shape editing in the space of point clouds, e.g. by exchanging the corresponding parts across point clouds or by re-sampling only some parts.

Our main contributions are summarised as follows.

  1. We propose a framework for unsupervised parts-based point cloud generation.

  2. We achieve reliable disentanglement of the latents by modeling points, primitives, and pose for each part.

  3. We demonstrate controllable parts editing via disentangled point cloud latents for different parts.

We provide extensive experimental results on ShapeNet which quantitatively demonstrate the superior performance of our method as a generator of point clouds.

Related work

Disentangled Latent Representation in VAE.

To promote disentanglement beyond that of the vanilla VAE kingma2013auto, Higgins et al. higgins2016beta introduced an additional KL divergence penalty above that of the usual evidence lower bound (ELBO). Learning of disentangled latent representations is further investigated by Kim et al. kim2018disentangling and chen2018isolating. To handle minibatches while accounting for the correlation of latents, Kim et al. kim2018disentangling proposed a neural-discriminator based estimation while chen2018isolating introduced a minibatch-weighted approximation. Further, kim2019relevance split latent factors into relevant and nuisance factors, treating each in a different manner within a hierarchical Bayesian model kim2019bayes. Locatello et al. locatello2019fairness showed that disentanglement may encourage fairness with unobserved variables, and proved the impossibility of learning disentangled representations in an unsupervised manner without inductive biases locatello2019challenging, while showing that mild supervision may be sufficient locatello2019disentangling.

To learn a reliable disentangled latent representation, the present work introduces a useful inductive bias locatello2019challenging by jointly modeling points, primitives and pose for 3D shapes. Inspired by the relevance and nuisance factor separation kim2019bayes; kim2019relevance, this work observes and balances the conflict between disentanglement of representation and quality of generation, by separately modeling global correlations of the relative pose of the different parts of a shape, disentangled from their style. Finally, we fill the gap of learning disentangled latent representations of 3D point clouds in an unsupervised manner, thereby contrasting with much recent disentangled representation learning work focusing on 2D or supervised cases kalatzis2020variational; nielsen2020survae; sohn2015learning.

Neural 3D Point Cloud Generation.

While 2D image generation has been widely investigated using GANs isola2017image; zhu2017unpaired and VAEs kingma2013auto; higgins2016beta; kim2019bayes; sohn2015learning, neural 3D point cloud generation has only been explored in recent years. achlioptas2018learning first proposed the r-GAN to generate 3D point clouds, with fully connected layers as the generator. In order to learn localized features, Valsesia et al. valsesia2018learning and Shu et al. shu20193d introduced a generator based on Graph Convolutions. Specifically, Shu et al. shu20193d proposed a tree-based structure with ancestors yielding a neighbor term and direct parents yielding a loop term, named the TreeGAN. This design links the geometric relationships between generated points and shared ancestors. In addition, PointFlow yang2019pointflow learns a distribution of points based on a distribution of shapes by combining VAEs and Normalizing Flows rezende2015variational, from which a point set with variable number of points may be sampled. However, all of the above works generate the point cloud as a whole or by a tree structure without disentanglement, thereby limiting their application power in parts editing. Although the work by chen2019bae focusing on reconstruction could easily be adapted to unsupervised parts-based generation task, it does not infer precise pose information which is crucial in editing.

A few recent works nash2017shape; mo2019structurenet; mo2020pt2pc; schor2019componet; dubrovina2019composite propose (or could be adapted) to generate point clouds given ground-truth point cloud parts segmentation. However, the requirement of well-aligned parts semantic labels hinders their real world applications. MRGAN gal2020mrgan is the first attempt to address parts-aware point cloud generation by discovering parts of convex shape in an unsupervised fashion. While effective, the decomposed parts may lack semantic meaning. Following this line of work, our EditVAE approaches parts-aware generation without semantic label requirements. In addition, the proposed model learns a disentangled latent representation, so that the style and pose of parts can be edited independently.

Figure 2: An overview of the EditVAE architecture. During training, the posterior is inferred by the encoder given the input point cloud, from which a global latent is sampled. The global latent is linearly mapped to the disentangled latent. The disentangled latent maps to parts (denoted by colors), which are further split into point, pose, and primitive representations via deterministic mappings. Each point and primitive is transformed to the global coordinate system by the shared pose. The transformed part points and primitives are then assembled to the complete decoded point cloud and primitive models, respectively. Jointly training with a single loss (far right) parsimoniously models key dependencies between point, primitive, and pose models. For generation, the global latent is sampled from the standard Gaussian and fed forward to generate a point cloud.


To disentangle semantically relevant parts of a 3D point cloud, we decompose it into latent parts which are modeled both as 3D point clouds and 3D shape primitives.

A point cloud is a set of points sampled from the surface of a 3D shape in Euclidean coordinates.

Primitives are simple shapes used to assemble parts of more complex shape. We employ the superquadric parameterisation for the primitives, which is a flexible model that includes cubes, spheres and ellipsoids as special cases. In line with paschalidou2019superquadrics, we formally define our superquadric as the two-dimensional manifold parameterised by two angles, with each surface point determined by the size and shape parameters. We include additional deformation parameters based on barr1987global in the supplementary.
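The superquadric surface described above can be sketched in code. The following is a minimal illustration using the standard superquadric parameterisation; the names `alpha` (sizes), `eps` (shape exponents), `eta` and `omega` (angles) are assumed here, since the symbols are not preserved in this copy, and the deformation parameters are omitted.

```python
import numpy as np

def signed_pow(x, p):
    """Signed power sign(x) * |x|**p, so negative cos/sin values stay well defined."""
    return np.sign(x) * np.abs(x) ** p

def superquadric_point(eta, omega, alpha, eps):
    """Surface point(s) for eta in [-pi/2, pi/2] and omega in [-pi, pi].

    alpha = (a1, a2, a3) are the sizes, eps = (e1, e2) the shape exponents
    (assumed names for the size and shape parameters).
    """
    a1, a2, a3 = alpha
    e1, e2 = eps
    x = a1 * signed_pow(np.cos(eta), e1) * signed_pow(np.cos(omega), e2)
    y = a2 * signed_pow(np.cos(eta), e1) * signed_pow(np.sin(omega), e2)
    z = a3 * signed_pow(np.sin(eta), e1)
    return np.stack([x, y, z], axis=-1)

# With both exponents equal to 1 and equal sizes, the surface is a sphere.
eta = np.linspace(-np.pi / 2, np.pi / 2, 16)
omega = np.linspace(-np.pi, np.pi, 32)
E, O = np.meshgrid(eta, omega, indexing="ij")
sphere = superquadric_point(E, O, alpha=(1.0, 1.0, 1.0), eps=(1.0, 1.0))
```

Exponents close to zero flatten the faces towards a cube, which is how a single parameterisation covers the special cases listed above.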

Pose transformations are employed to map both the superquadric and point cloud representations of the parts from a canonical pose to the actual pose in which they appear in the complete point cloud. We parameterise this transformation by a translation and a rotation defined by a quaternion, and refer to it as the pose for a given part.
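As a rough illustration of such a pose, the sketch below applies a quaternion rotation followed by a translation to a set of points. The convention "rotate, then translate" and the quaternion ordering (w, x, y, z) are assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix of a quaternion (w, x, y, z); the input is normalised first."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def apply_pose(points, q, t):
    """Map canonical-pose points of shape (N, 3) to the global frame: R(q) x + t."""
    return points @ quat_to_rotmat(q).T + np.asarray(t, dtype=float)

def invert_pose(points, q, t):
    """Map global-frame points back to the canonical pose: R(q)^T (y - t)."""
    return (points - np.asarray(t, dtype=float)) @ quat_to_rotmat(q)

# A 90 degree rotation about the z axis followed by a unit shift along x.
q = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
moved = apply_pose(np.array([[1.0, 0.0, 0.0]]), q, [1.0, 0.0, 0.0])
```

Because the same pose is applied to both the part point cloud and its primitive, the two representations stay aligned after the transformation.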

Variational Auto-Encoders (VAEs) are an approximate Bayesian inference scheme that introduces an approximate posterior of the latent representation conditional on the point cloud. The variational parameters are obtained by optimising a bound on the (marginal) data likelihood known as the ELBO, which consists of two terms. The first term is known as the reconstruction error, and the second as the variational regulariser. We follow the usual approach of letting the posterior be multivariate normal, so that we can employ the usual Monte Carlo approximation with the reparameterisation trick kingma2013auto to approximate the reconstruction error. By additionally letting the prior be multivariate normal, we obtain a closed form expression for the regulariser.
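The two ingredients mentioned above, the reparameterisation trick and the closed-form regulariser, can be sketched as follows. This assumes a diagonal Gaussian posterior and a standard normal prior, which is the common choice; the function names and the `beta` weight are illustrative only.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def negative_elbo(recon_loss, mu, log_var, beta=1.0):
    """Reconstruction error plus (optionally weighted) variational regulariser."""
    return recon_loss + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu, log_var = np.array([1.0, 0.0]), np.zeros(2)
z = reparameterise(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Setting `beta` above 1 gives the additional KL penalty used by the beta-VAE family discussed in the related work.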


We motivate our design in the next subsection, and then introduce our variational inference scheme, explain how we obtain disentanglement of part latents, give details of the loss functions we use, and conclude with architecture details.

Overview of the Design

We divide the challenge of parts-based point cloud generation and editing into the following essential challenges:

  1. Decomposing multiple (unlabeled) point clouds into semantically meaningful parts.

  2. Disentangling each part into both style (such as the shape of the chair leg) and the relative pose (the orientation in relation to the other parts of the chair).

  3. Representing the components of the above decomposition in a latent space which allows style and pose to be manipulated independently of one another, while generating concrete and complete point clouds.

We address this problem in an end-to-end manner with a unified probabilistic model. To accomplish this we depart slightly from the well known VAE structure, which directly reconstructs the input by the decoder.

For any given input point cloud we generate a separate point cloud for each part of the input point cloud (such as the base of a chair), along with a super-quadric prototype of that part. This addresses point 1 above. To address point 2, we model the part point cloud and primitive in a standardised reference pose via the affine pose transformation, and also represent the point cloud and primitive parts in the original pose. This allows a part’s style to be edited while maintaining global coherence. Finally, while we model a single global latent, our decoder generates each part via separate network branches (see Figure 2), thereby facilitating various editing operations and satisfying point 3 above.

Variational Inference

Our approximate inference scheme is based on that of the VAE kingma2013auto; pmlr-v32-rezende14, but similarly to kim2019variational relaxes the assumption that the encoder and decoder map from and to the same (data) space. The following analysis is straight-forward, yet noteworthy in that it side-steps the inconvenience of applying variational regularization to .

Denote by the -th latent part representation, by the union of all such parts, and by a global latent which abstractly represents a shape. We let represent the approximate posterior with parameters , and for simplicity we neglect to notate the dependence of on . Our training objective is the usual marginal likelihood of the data given the parameters ,


Taking logs and applying Jensen’s inequality we have


We assume a chain-structured factorisation in our posterior,


Under this factorisation we obtain a tractable variational inference scheme by assuming that conditional on , the approximate posterior matches the true one, i.e.


Putting (23) into (21) and cancelling the matching conditional terms inside the logarithm in (21), we obtain a bound of the same form as the standard ELBO. In a nutshell, this shows that we need only learn an approximate posterior on the global latent via a similar ELBO as (2), to obtain an approximate posterior on the part representations. We achieve this via a simple deterministic mapping which, like nielsen2020survae, we may notate as a limiting Dirac distribution centred on the output of a neural network. Crucially, while the resulting posterior on the part representations is non-Gaussian, it doesn’t appear in the variational regulariser, which is therefore tractable.
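The argument of this subsection can be written out as a short derivation. The notation below is assumed for illustration, since the symbols are not preserved in this copy: X is the input point cloud, z the global latent, \zeta the union of part representations, and the generative model is taken to factorise as a chain from z to \zeta to X.

```latex
% Assumed notation: X input point cloud, z global latent, \zeta part representations.
% Assumed chain factorisations:
%   P_\theta(X, z, \zeta) = P_\theta(X \mid \zeta)\, P_\theta(\zeta \mid z)\, P_\theta(z)
%   Q_\phi(z, \zeta \mid X) = Q_\phi(\zeta \mid z)\, Q_\phi(z \mid X)
\begin{align}
\log P_\theta(X)
  &= \log \iint P_\theta(X \mid \zeta)\, P_\theta(\zeta \mid z)\, P_\theta(z)\, d\zeta\, dz \\
  &\ge \mathbb{E}_{Q_\phi(z, \zeta \mid X)}\!\left[
       \log \frac{P_\theta(X \mid \zeta)\, P_\theta(\zeta \mid z)\, P_\theta(z)}
                 {Q_\phi(\zeta \mid z)\, Q_\phi(z \mid X)} \right]
       && \text{(Jensen)} \\
  &= \mathbb{E}_{Q_\phi(z, \zeta \mid X)}\!\left[
       \log \frac{P_\theta(X \mid \zeta)\, P_\theta(z)}{Q_\phi(z \mid X)} \right]
       && \text{(using } Q_\phi(\zeta \mid z) = P_\theta(\zeta \mid z)\text{)} \\
  &= \mathbb{E}_{Q_\phi(z, \zeta \mid X)}\!\left[ \log P_\theta(X \mid \zeta) \right]
     - D_{\mathrm{KL}}\!\left( Q_\phi(z \mid X) \,\middle\|\, P_\theta(z) \right).
\end{align}
```

Under these assumptions the regulariser involves only the Gaussian posterior and prior on the global latent, which is why it remains available in closed form even though the distribution over part representations is not Gaussian.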

Disentangling the Latent Representation

EditVAE disentangles the global latent into a local latent for each part, and further into latents for the specific components of that part (namely its point cloud, pose, or primitive). We achieve this key feature by linearly transforming and partitioning the global latent, i.e. we define the local latent as the product of a matrix of weights (representing a linear neural network layer) with the global latent. We further partition each part latent into point, pose, and primitive latents, and let the corresponding part representations themselves be defined by applying a mapping to the relevant latent, where each such mapping non-linearly transforms from the latent space to the part representation.

This achieves several goals. First, we inherit from the VAE a meaningful latent structure on the global latent. Second, by linearly mapping from the global latent to the local part latents, we ensure that linear operations (e.g. convex combination) on the global latent precisely match linear operations on the local latent space, which therefore captures a meaningfully local latent structure. Finally, partitioning the local latent yields a representation that disentangles parts by construction, while dependencies between parts are captured by the linear mapping. Experiments show we obtain meaningful disentangled parts latents.
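The linear map and partitioning can be sketched in a few lines. All sizes and names below (`A`, `dims`, the dictionary keys, the latent widths) are hypothetical choices for illustration; the example only demonstrates the property that convex combinations of global latents carry over exactly to every part latent.

```python
import numpy as np

def split_latent(z_global, A, num_parts, dims):
    """Linearly map the global latent, then partition it per part and per component.

    dims = (d_point, d_pose, d_primitive) are assumed widths of the three
    component latents of each part.
    """
    z_local = A @ z_global
    per_part = sum(dims)
    assert z_local.shape[0] == num_parts * per_part
    d_point, d_pose, _ = dims
    parts = []
    for m in range(num_parts):
        chunk = z_local[m * per_part:(m + 1) * per_part]
        parts.append({
            "point": chunk[:d_point],
            "pose": chunk[d_point:d_point + d_pose],
            "primitive": chunk[d_point + d_pose:],
        })
    return parts

rng = np.random.default_rng(0)
num_parts, dims, d_global = 3, (8, 4, 4), 32          # hypothetical sizes
A = rng.standard_normal((num_parts * sum(dims), d_global))
z1, z2 = rng.standard_normal(d_global), rng.standard_normal(d_global)

parts_1 = split_latent(z1, A, num_parts, dims)
parts_2 = split_latent(z2, A, num_parts, dims)
parts_mix = split_latent(0.3 * z1 + 0.7 * z2, A, num_parts, dims)
```

Each slice of `parts_mix` equals the same 0.3 / 0.7 combination of the corresponding slices of `parts_1` and `parts_2`, while correlations between parts come only from the shared rows of `A`.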

Class     Model                JSD↓    MMD-CD↓  MMD-EMD↓  COV-CD↑  COV-EMD↑
Chair     r-GAN (dense)        0.238   0.0029   0.136     33       13
          r-GAN (conv)         0.517   0.0030   0.223     23       4
          Valsesia (no up.)    0.119   0.0033   0.104     26       20
          Valsesia (up.)       0.100   0.0029   0.097     30       26
          TreeGAN shu20193d    0.119   0.0016   0.101     58       30
          MRGAN gal2020mrgan   0.246   0.0021   0.166     67       23
          EditVAE (M=7)        0.063   0.0014   0.082     46       32
          EditVAE (M=3)        0.031   0.0017   0.101     45       39
Airplane  r-GAN (dense)        0.182   0.0009   0.094     31       9
          r-GAN (conv)         0.350   0.0008   0.101     26       7
          Valsesia (no up.)    0.164   0.0010   0.102     24       13
          Valsesia (up.)       0.083   0.0008   0.071     31       14
          TreeGAN shu20193d    0.097   0.0004   0.068     61       20
          MRGAN gal2020mrgan   0.243   0.0006   0.114     75       21
          EditVAE (M=6)        0.043   0.0004   0.024     39       30
          EditVAE (M=3)        0.044   0.0005   0.067     23       17
Table     TreeGAN shu20193d    0.077   0.0018   0.082     71       48
          MRGAN gal2020mrgan   0.287   0.0020   0.155     78       31
          EditVAE (M=5)        0.081   0.0016   0.071     42       27
          EditVAE (M=3)        0.042   0.0017   0.130     39       30
Table 1: Generative performance. ↑ means the higher the better, ↓ means the lower the better. The score is highlighted in bold if it is the best one compared with state-of-the-art. Here M is the number of minimum parts we expect to separate in training. For network with we use the result reported in valsesia2018learning; shu20193d

Loss Functions

Completing the model of the previous sub-section requires specifying the log likelihood, which we decompose in the usual way as the negative of a sum of loss functions involving either or both of the point and super-quadric representations, combined with the standardisation transformation which connects these representations to the global point cloud. Note that from a Bayesian modelling perspective, there is no need to separate the loss into terms which decouple the point and super-quadric representations; indeed, the flexibility to couple these representations within the loss is a source of useful inductive bias in our model.

While our loss does not correspond to a normalised conditional likelihood, working with un-normalised losses is both common sun2019learning; paschalidou2019superquadrics and highly convenient, since we may engineer a practically effective loss function by combining various carefully designed losses from previous works.

Point Cloud Parts Loss.

We include a loss term for each part point cloud based on the Chamfer distance


We sum over parts to obtain a total loss of


where is the subset of whose nearest superquadric is , and is in canonical pose.
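For illustration, a per-part Chamfer loss of this kind might look as follows. This sketch uses the common symmetric squared-distance form of the Chamfer distance; the exact normalisation used in the paper is not shown here, and the assignment of input points to parts (`target_subsets`) is assumed to be given and non-empty.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def parts_chamfer_loss(part_clouds, target_subsets):
    """Sum of Chamfer distances between each generated part and its assigned input subset."""
    return sum(chamfer_distance(y, x) for y, x in zip(part_clouds, target_subsets))

rng = np.random.default_rng(0)
targets = [rng.standard_normal((64, 3)) for _ in range(3)]
perfect = parts_chamfer_loss([t.copy() for t in targets], targets)
shifted = parts_chamfer_loss([t + 0.5 for t in targets], targets)
```

A perfect reconstruction of every part gives a loss of zero, and the loss grows as any single part drifts away from the points assigned to it.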

Superquadric Losses.

The remaining terms in our loss relate to the part and combined primitives, and would match paschalidou2019superquadrics but for the addition of a regulariser which discourages overlapping superquadrics (footnote 1: matches the implementation of paschalidou2019superquadrics provided by the authors), i.e.


where denotes cardinality, is a point cloud sampled from , , and is the smoothed indicator function for defined in solina1990recovery.

Figure 3: Parts-based generated point clouds from the airplane, table and chair categories, coloured by part. Bottom row: examples generated by TreeGAN shu20193d. The top three rows are EditVAE—the top row with , and the second and third rows with the number of parts reported in Table 1.

Architecture Details

The EditVAE framework is shown in Figure 2. The posterior is based on the PointNet architecture qi2017pointnet, with the same structure as achlioptas2018learning. For the mapping from the global latent to the parts, we apply the linear transform and partitioning of (11) for disentangled part representations, followed by further shape and pose disentanglement. We use the generator of TreeGAN shu20193d as the decoder to generate the point cloud for each part. The super-quadric decoder modules match paschalidou2019superquadrics for primitive generation, as do those for the pose. Weights are not shared among branches.


Evaluation metrics.

We evaluate our EditVAE on ShapeNet shapenet2015 with the same data split as shu20193d and report results on the three dominant categories of chair, airplane, and table. We adopt the evaluation metrics of achlioptas2018learning, including Jensen-Shannon Divergence (JSD), Minimum Matching Distance (MMD), and Coverage (COV). As MMD and COV may be computed with either Chamfer Distance (CD) or Earth-Mover Distance (EMD), we obtain five different evaluation metrics, i.e. JSD, MMD-CD, MMD-EMD, COV-CD, and COV-EMD.
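To make the MMD and COV metrics concrete, here is a small sketch of their Chamfer-based variants following the usual definitions: MMD averages, over reference shapes, the distance to the closest generated shape, and COV is the fraction of reference shapes that are the nearest neighbour of at least one generated shape. The helper names and the toy data are illustrative only, and the Chamfer form is the common symmetric squared-distance one.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def pairwise_distances(generated, reference, dist=chamfer_distance):
    """Matrix of distances with one row per generated shape and one column per reference shape."""
    return np.array([[dist(g, r) for r in reference] for g in generated])

def mmd(dists):
    """Minimum Matching Distance: mean over reference shapes of the closest generated shape."""
    return dists.min(axis=0).mean()

def coverage(dists):
    """Fraction of reference shapes matched as nearest neighbour by some generated shape."""
    matched = np.unique(dists.argmin(axis=1))
    return matched.size / dists.shape[1]

reference = [np.zeros((4, 3)), np.ones((4, 3))]
generated = [np.zeros((4, 3)), np.zeros((4, 3)) + 0.1]
D = pairwise_distances(generated, reference)
```

In this toy case both generated shapes sit near the first reference shape, so coverage is one half even though the best match is exact; this is the kind of mode collapse COV is designed to expose. Tables typically report COV as a percentage.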


We compare with four existing models of r-GAN achlioptas2018learning, Valsesia valsesia2018learning, TreeGAN shu20193d and MRGAN gal2020mrgan. r-GAN and Valsesia generate point clouds as a single whole without parts inference or generation based on a tree structure as in TreeGAN. Similar to our approach, MRGAN performs unsupervised parts-aware generation, but with “parts” that lack a familiar semantic meaning and without disentangling pose.

Implementation details.

Code will be provided on publication of the paper.

The input point cloud consists of a set of 2048 points, which matches the above baselines. Our prior on the global latent representation

is the usual standard Gaussian distribution. We chose

, and , for the local latents of (12). We trained EditVAE using the Adam optimizer kingma2014adam with a learning rate of

for 1000 epochs and a batch size of 30. To fine-tune our model we adopted the β-VAE framework higgins2016beta.



EditVAE generates point clouds by simply sampling the global latent from a standard Gaussian prior, mapping it through the linear layer and the subsequent part branches of Figure 2, before merging to form the complete point cloud. We show quantitative and qualitative results in Table 1 and Figure 3, respectively. As shown in Table 1, the proposed EditVAE achieves competitive results (see e.g. the results for the chair category) compared with the state of the art. The number of parts is manually selected to achieve a meaningful semantic segmentation, e.g. a chair may be roughly decomposed into back, base, and legs for M=3. Furthermore, while shu20193d generates point clouds according to a tree structure (and could thereby potentially generate points with consistent part semantics), it does not allow semantics-aware shape editing because it lacks disentangled parts representations. To the best of our knowledge, MRGAN gal2020mrgan is the only other method achieving parts-disentangled shape representation and generation in an unsupervised manner. The results in Table 1 show that our method outperforms MRGAN in both the JSD and MMD metrics. Moreover, EditVAE achieves highly semantically meaningful parts generation, as shown in Figure 3 and in the experiments discussed below, which in turn enables parts-aware point cloud editing.

Parts Editing.

EditVAE disentangles the point clouds into latents for each part, and then in turn into the point cloud, pose, and primitive for each part. This design choice allows editing some parts with other parts fixed, yielding controllable parts editing and generation. We demonstrate this via both parts mixing and parts (re-)sampling.

Figure 4: Parts mixing in the chair category with M=3. Far left: ground truth point clouds, top: reference point cloud. Remaining: from left to right, the back, base, and legs of the ground truth point clouds are mixed with the corresponding parts of the reference one via mixing their disentangled latents.

Category  Model          MCD↓
Chair     TreeGAN        0.0164
          EditVAE (M=3)  0.0028
          EditVAE (M=7)  0.0121
Airplane  TreeGAN        0.0043
          EditVAE (M=3)  0.0016
          EditVAE (M=6)  0.0018
Table     TreeGAN        0.0266
          EditVAE (M=3)  0.0121
          EditVAE (M=5)  0.0214
Table 2: Semantic meaningfulness measurements. The M= entries are the EditVAE models in Table 1. The lower the MCD the better.

Model     MMD-CD
          as whole  base    back    leg
EditVAE   0.0017    0.0016  0.0014  0.0024
Baseline  0.0025    0.0017  0.0015  0.0024
Table 3: Generative performance for the entire shape and its parts, for the chair category. Semantic labels are obtained by primitive segmentation in our framework.

Model          JSD    MMD-CD  COV-CD
Baseline-G     0.062  0.0019  42
Baseline-S     0.163  0.0030  10
EditVAE (M=3)  0.031  0.0017  45
EditVAE (M=7)  0.063  0.0014  46
Table 4: Generative performance comparison for EditVAE and two baselines in the chair category.

Parts Mixing.

Parts mixing is defined by exchanging some parts between generated reference and ground-truth point clouds while keeping others fixed. We achieve mixing by transferring the corresponding parts latents from the reference to the ground-truth, and further transforming them by the generator and pose of the parts in the ground-truth. The corresponding part in the ground-truth point cloud may therefore be changed to the style of the reference one. For example, the results in the first row of Figure 4 show that the ground-truth shape of a sofa with a solid armed base may be changed into a larger hollow armed one based on its reference shape with consistent style. Namely, the size and pose of the mixed parts follow those of the ground-truth, but keep the style from the reference.

Parts Sampling.

This involves resampling some parts in a generated point cloud. For resampled parts, we fix the pose but resample the point cloud parts latent. The fixed pose is essential to maintain generated part point clouds with a consistent location that matches the other fixed parts to achieve controllable generation.
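Both editing operations amount to replacing the point (style) latent of selected parts while leaving their pose latent untouched. The sketch below assumes each shape is stored as a list of per-part dictionaries with hypothetical keys `"point"` and `"pose"`, and that new point latents come from a user-supplied sampler; how that sampler is defined is an assumption, not something the text specifies.

```python
import numpy as np

def mix_parts(target, reference, part_ids):
    """Copy the point latent of the chosen parts from `reference`, keeping the target's pose."""
    mixed = [dict(part) for part in target]          # shallow copy; the target is not modified
    for m in part_ids:
        mixed[m]["point"] = reference[m]["point"].copy()
    return mixed

def resample_parts(parts, part_ids, sample_point_latent):
    """Replace the point latent of the chosen parts with fresh samples, keeping the pose fixed."""
    out = [dict(part) for part in parts]
    for m in part_ids:
        out[m]["point"] = sample_point_latent(m)
    return out

rng = np.random.default_rng(0)

def make_part():
    # Hypothetical latent widths: 4 for the point latent, 3 for the pose latent.
    return {"point": rng.standard_normal(4), "pose": rng.standard_normal(3)}

target = [make_part() for _ in range(3)]
reference = [make_part() for _ in range(3)]

mixed = mix_parts(target, reference, part_ids=[2])
resampled = resample_parts(target, part_ids=[0],
                           sample_point_latent=lambda m: rng.standard_normal(4))
```

Decoding `mixed` or `resampled` with the part branches would then place the edited part at the original location, since every pose latent is unchanged.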

Figure 5: Parts sampling. Far left: the reference point clouds. Colored parts in the three right columns are sampled from latent space—from top to bottom, we sampled the airplane stabilizer, chair base, and chair back.

Qualitative results for parts sampling are in Figure 5. Variations in the part styles demonstrate controllable point cloud generation.

Semantic Meaningfulness.

We first define a vanilla measurement by comparing the distance between the ground truth semantic label and the one generated without supervision. The distance is defined as the mean of the smallest Chamfer distance for each unsupervised part with respect to all ground truth parts (MCD in Table 2). As MRGAN gal2020mrgan lacks accompanying code, we mainly compare the semantic meaningfulness with respect to TreeGAN in Table 2. EditVAE outperforms TreeGAN when we define the ground truth segmentation as the most meaningful.
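The MCD measurement described above can be sketched directly from its definition. The Chamfer form used here is the common symmetric squared-distance one, which is an assumption, and the part lists are toy data.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def mean_min_chamfer(unsup_parts, gt_parts):
    """Mean, over unsupervised parts, of the smallest Chamfer distance to any ground-truth part."""
    return float(np.mean([
        min(chamfer_distance(u, g) for g in gt_parts)
        for u in unsup_parts
    ]))

rng = np.random.default_rng(0)
gt_parts = [rng.standard_normal((32, 3)) + offset for offset in (0.0, 5.0, 10.0)]
exact = mean_min_chamfer([p.copy() for p in gt_parts], gt_parts)
noisy = mean_min_chamfer([p + 0.2 for p in gt_parts], gt_parts)
```

Note that the measure does not require a one-to-one matching: several unsupervised parts may map to the same ground-truth part, which matters when the number of primitives exceeds the number of labelled parts.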

Ablation Studies

Generation / Editing Trade-Off.

We aim to evaluate the influence of the linear mapping for disentangled representation learning (see Figure 2). To this end, we introduce a Baseline framework by simply removing this mapping. Results are shown in Table 3. Specifically, we compare our generation with the Baseline results at the whole point cloud level and at the parts level, such as the base, leg, and back, for the chair category. While Baseline achieves disentangled parts-aware representation learning and comparable results for parts sampling to EditVAE (footnote 3: we evaluate each part generation result separately), the manner in which Baseline generates points as a whole via sampling from a standard Gaussian yields inferior performance due to the mismatched style across parts. Thus, the linear mapping manages to decouple the undesirable generation / editing trade-off caused by disentanglement. Detailed analysis and visualizations are in the supplementary materials.

Stage-wise Baselines.

We compared EditVAE with two stage-wise baselines defined as Baseline-S and Baseline-G. In particular, Baseline-S is built by first generating parts labels via the state-of-the-art unsupervised segmentation method paschalidou2019superquadrics followed by a supervised parts-aware generation approach schor2019componet. Baseline-G is created by training the point cloud branch in Figure 2 with the ground-truth parts segmentation. The comparison is performed on the chair category in ShapeNet shapenet2015, and reported in Table 4.

EditVAE is robust to semantic segmentation as its generation is close to Baseline-G. Further, the performance of is closer to Baseline-G compared with , in line with our observation (see Figure 3) that this case achieves a similar segmentation to the ground-truth. Further, EditVAE outperforms Baseline-S by overcoming the style-mismatch issue and is robust to noise introduced by mapping parts to a canonical system with learned poses.


We introduced EditVAE, which generates parts-based point clouds in an unsupervised manner. The proposed framework learns a disentangled latent representation with a natural inductive bias that we introduce by jointly modeling latent part- and pose-models, thereby making parts controllable. Through various experiments, we demonstrated that EditVAE balances parts-based generation and editing in a useful way, while performing strongly on standard point cloud generation metrics.

Appendix A Generation / Editing Trade-off Analysis & Results

Class     Primitive  Model     MMD-CD
          Number               as whole  part A  part B  part C   part D  part E  part F  part G
Chair     7          EditVAE   0.0014    0.0012  0.0011  0.0015   0.0013  0.0025  0.0015  0.0013
                     Baseline  0.0029    0.0014  0.0012  0.0019   0.0014  0.0027  0.0016  0.0015
          3          EditVAE   0.0017    0.0014  0.0016  0.0024   -       -       -       -
                     Baseline  0.0025    0.0016  0.0016  0.0024   -       -       -       -
Airplane  6          EditVAE   0.0004    0.0004  0.0005  0.00004  0.0006  0.0006  0.0005  -
                     Baseline  0.0007    0.0004  0.0005  0.0005   0.0006  0.0007  0.0005  -
          3          EditVAE   0.0005    0.0006  0.0005  0.0007   -       -       -       -
                     Baseline  0.0006    0.0006  0.0005  0.0008   -       -       -       -
Table     5          EditVAE   0.0016    0.0020  0.0011  0.0023   0.0015  0.0020  -       -
                     Baseline  0.0042    0.0024  0.0011  0.0030   0.0016  0.0022  -       -
          3          EditVAE   0.0017    0.0025  0.0012  0.0022   -       -       -       -
                     Baseline  0.0035    0.0034  0.0013  0.0025   -       -       -       -
Table 5: More results in generation/editing trade-off

We aim to evaluate the influence of the linear mapping for disentangled representation learning (see Figure 2 in the main paper). To this end, we introduce a Baseline framework by simply removing this mapping. Results are shown in the main paper Table 3. Specifically, we compare our generation with the Baseline results at the whole point cloud level and at the parts level, such as the base, leg, and back, for the chair category. While Baseline achieves disentangled parts-aware representation learning and comparable results for parts sampling to EditVAE, the manner in which Baseline generates points as a whole via sampling from a standard Gaussian yields inferior performance due to the mismatched style across parts.

We observe that well-disentangled latents benefit controllable editing, as we may unilaterally alter the style of one part, without affecting that of the other parts. This is mainly due to our particular disentangled representation which discourages certain dependencies among latents. By contrast, parts-based generation requires strong correlation within latent factors to generate style-matched point clouds. Hence, this disentanglement is fundamentally opposing to the parts-based point cloud generation as a whole due to the lack of global correlation across parts.

This observation can be further explained by the concept of relevant and nuisance latents separation in kim2019relevance which addresses the balance between reconstruction and generation. Specifically, relevant latents depend on the input and vice versa, which indicates that the global “style” information is stored in the relevant latent. Completely disentangled latents can achieve perfect reconstruction, as the known inputs can lead to fully observed relevant and nuisance latents. However, relevant latents are randomly sampled in generation due to the lack of input as observation. As a result, disentangled latents with different “style” information lead to a style mismatch across the generated part point clouds. We thus introduce a linear mapping to encode the “relevant” latents consistently across disentangled part latents, to achieve parts-aware generation with a consistent style.

We provide more quantitative results in Table 5. Similar to the results reported in Table 1 of the main paper, we compare the generative performance of EditVAE with a Baseline for which we removed the linear mapping from our model.

Figure 6: Visualization of point clouds generated by EditVAE (below lines) and Baseline (above lines). Colors denote the different parts.

As shown in Table 5, the proposed EditVAE consistently outperforms the Baseline for all three categories and for various numbers of primitives. The quantitative results demonstrate that sampling from disentangled latents without global context information leads to point clouds of low quality. More qualitative comparison results are provided in Figure 6, which shows that the style and pose are generally mismatched among parts for point clouds generated by Baseline. For example, back parts in the chair category either intersect the base (leftmost) or are detached from it (third column). In addition, the back sizes do not match the bases (all four examples). For airplanes generated by Baseline, we observe glider’s wings (middle left) and fighter’s wings (middle right) being assembled with civil airliners. Moreover, as sampled pose latents are mismatched with sampled point latents, the stabilizers are added at the wrong position (leftmost).

In summary, the ‘style’ of parts is mismatched in point clouds generated by Baseline, mainly because the disentangled latents do not keep the global correlations across parts. By contrast, our model can generate point clouds in a consistent style due to our global context-aware latent disentanglement, which is achieved by the linear mapping in our framework.

Appendix B Additional Mixing Examples

In the main paper we showed parts mixing results for the chair category in ShapeNet shapenet2015 with number of primitives . Here we will provide more parts mixing results on other categories.

Figure 7: Parts mixing in the airplane category with . Far left: ground truth point clouds, top: reference point cloud. Remaining: from left to right: stabilizer, right wing, and engine of the ground truth point clouds are replaced by corresponding ones in the reference via mixing of their disentangled latents.

In Figure 7, we mix parts in the airplane category with number of primitives . Each ground truth point cloud (blue) is mixed with a reference point cloud (red) with respect to the stabilizer, the right wing, and the engine. In the first column of Figure 7, the shapes of all stabilizers in the ground truth point clouds are changed to that of the reference while their poses are preserved, which leads to a mixed point cloud with consistent style. In addition, the ground truth airplanes without engines are also ‘assembled’ with the reference’s engine by the mixing operation. It is worth noting that the style of the remaining parts has not been changed, thanks to our learned disentangled representation. Similar observations can be made in Figure 8.
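The mixing operation described above can be illustrated with a minimal sketch. The dictionary layout, the key names `point`, `primitive`, and `pose`, and the choice to swap only the shape-related latents while keeping the pose are assumptions made for illustration; they are not taken from the authors' code.

```python
import copy

def mix_part(target, reference, part, keys=("point", "primitive")):
    """Return a copy of `target` in which the listed latents of `part`
    are replaced by those of `reference` (hypothetical latent layout)."""
    mixed = copy.deepcopy(target)
    for key in keys:
        mixed[part][key] = list(reference[part][key])
    return mixed

# Toy latents: values and sizes are placeholders, not the paper's.
target = {
    "stabilizer": {"point": [0.1, 0.2], "primitive": [0.3], "pose": [1.0, 0.0]},
    "wing": {"point": [0.5, 0.6], "primitive": [0.7], "pose": [0.0, 1.0]},
}
reference = {
    "stabilizer": {"point": [0.9, 0.8], "primitive": [0.4], "pose": [0.5, 0.5]},
    "wing": {"point": [0.2, 0.1], "primitive": [0.6], "pose": [0.3, 0.3]},
}

# Only the stabilizer's shape latents change; its pose and the wing stay as in the target.
mixed = mix_part(target, reference, "stabilizer")
```

In this sketch the untouched parts are copied verbatim, which mirrors the observation that the style of the remaining parts is unchanged after mixing.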

Figure 8: Parts mixing in the airplane category with . Far left: ground truth point clouds, top: the reference point cloud. Remaining: from left to right: the wings, stabilizer, and body for ground truth points are replaced by the corresponding parts in the reference one via mixing their disentangled latents.
Figure 9: Parts mixing in the table category with . Far left: ground truth point clouds, top: reference point cloud. Remaining: from left to right: right legs, left legs, and base for ground truth points are replaced by the corresponding parts in the reference one via mixing of the disentangled latents.

We additionally show our mixing results on the table category in Figure 9. As demonstrated in the figure, we can change the round base of the table to a rectangular one from the reference point cloud in a consistent style.

Appendix C Additional Sampling Examples

Figure 10: Parts sampling. Far left: the reference point clouds. Colored parts in the three right columns are sampled from the latent space. From top to bottom, we sampled the chair legs and table legs.

As the parts distribution is unknown, we achieve parts sampling by first sampling a global latent from a multivariate normal distribution and then passing it to the linear mapping. Another option is to pass the parts latent through a Real NVP layer dinh2016density before feeding it to the generators/decoders during training. By letting Real NVP learn to map the parts latent into a standard normal distribution, we may then generate novel parts by sampling the parts latent directly. Both options are equivalent if the Real NVP layer is linear, as it can then be absorbed into the generators/decoders. To keep the model simple and elegant, we removed the Real NVP layer in the main paper.
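The first sampling route can be sketched as follows. This is only an illustration under stated assumptions: the number of parts, the global latent size, and the random weight matrix are hypothetical placeholders (a trained model would supply the learned mapping), while the part latent size of 8 follows the value given for the pose and primitive decoders.

```python
import random

def sample_part_latents(weight, num_parts, part_dim, rng):
    """Sample a global latent from N(0, I), apply a linear map,
    and split the result into one latent per part."""
    global_dim = len(weight[0])
    z_global = [rng.gauss(0.0, 1.0) for _ in range(global_dim)]
    # Matrix-vector product: one output entry per row of `weight`.
    flat = [sum(w * z for w, z in zip(row, z_global)) for row in weight]
    return [flat[i * part_dim:(i + 1) * part_dim] for i in range(num_parts)]

rng = random.Random(0)
NUM_PARTS, PART_DIM, GLOBAL_DIM = 3, 8, 16  # NUM_PARTS and GLOBAL_DIM are hypothetical
# Stand-in for the learned linear mapping (random values, for illustration only).
weight = [[rng.gauss(0.0, 0.1) for _ in range(GLOBAL_DIM)]
          for _ in range(NUM_PARTS * PART_DIM)]
part_latents = sample_part_latents(weight, NUM_PARTS, PART_DIM, rng)
```

Because every part latent is computed from the same global sample, the parts share information, which is the property the linear mapping is meant to provide.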

Additional parts sampling results may be found in Figures 10 and 11. We sampled chair legs and table right legs in Figure 10. In particular, different styles (normal or sofa style), sizes (thick or slim), and poses (rotation) of legs are sampled from our disentangled latents. Moreover, we provide more results for parts sampling of table bases and airplane wings in Figure 11.

As shown in the figure, different shapes of table base (round, rectangular, and square) and styles of airplane wing (glider’s and fighter’s wing) are sampled while the remaining parts are held fixed. We see that parts sampling allows us to achieve controllable point cloud generation.

Figure 11: Parts sampling. Far left: the reference point clouds. Colored parts in the three right columns are sampled from the latent space. From top to bottom, we sampled the table base and airplane wings.

Appendix D Interpolation

Figure 12: Interpolation result. From left to right, the intermediate point clouds are obtained by mixing latents with weights 0.2, 0.5, and 0.8, respectively.

Two generated point clouds are interpolated by first mixing their corresponding latents with different weights and then passing the mixed latents to the corresponding generators. The results are visualized in Figure 12. As we can see, the middle three point clouds deform continuously from the leftmost to the rightmost, which suggests that the learned latent space is continuous.
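The latent mixing step can be sketched as a convex combination of two latent vectors. The function name and the convention that the weight applies to the second latent are assumptions; the weights 0.2, 0.5, and 0.8 are those listed in the caption of Figure 12.

```python
def interpolate_latents(z_a, z_b, weight):
    """Convex combination of two latent vectors; weight=0 returns z_a, weight=1 returns z_b."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(z_a, z_b)]

# Toy latents standing in for the latents of two generated point clouds.
z_left = [0.0, 2.0, -1.0]
z_right = [1.0, 0.0, 1.0]

# One mixed latent per weight; each would then be passed to the generators.
mixed_latents = [interpolate_latents(z_left, z_right, w) for w in (0.2, 0.5, 0.8)]
```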

Appendix E Semantic Meaningfulness

Because the arXiv paper MRGAN gal2020mrgan lacks accompanying code, we compared semantic meaningfulness quantitatively only with TreeGAN shu20193d in the main paper. Here we give a qualitative comparison with MRGAN gal2020mrgan based on their main Figure 3 and supplementary Figure 1. For example, MRGAN’s table bases are separated into three parts, some of which are even linked to a leg, while EditVAE separates base and legs more clearly, as shown in rows 2-3 of Figure 3 in the main paper.

Appendix F Superquadrics Visualization

Figure 13: Generated Superquadrics. Listed chair’s , table’s , airplane’s as in Table 1 in main paper.

See Figure 13 for examples of superquadrics generated by passing sampled latents into the pose and primitive branches shown in Figure 3 of the main paper.
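For readers unfamiliar with superquadrics, the following sketch shows how size and shape parameters define a surface, using the standard parametric form and inside-outside function. The function names are hypothetical, the parameter values are placeholders, and the pose (rotation and translation) as well as the tapering deformation are omitted.

```python
import math

def fexp(value, exponent):
    """Signed power: sign(value) * |value| ** exponent."""
    return math.copysign(abs(value) ** exponent, value)

def superquadric_point(size, shape, eta, omega):
    """Surface point for angles eta in [-pi/2, pi/2] and omega in [-pi, pi]."""
    a1, a2, a3 = size
    e1, e2 = shape
    x = a1 * fexp(math.cos(eta), e1) * fexp(math.cos(omega), e2)
    y = a2 * fexp(math.cos(eta), e1) * fexp(math.sin(omega), e2)
    z = a3 * fexp(math.sin(eta), e1)
    return x, y, z

def inside_outside(point, size, shape):
    """Equals 1 on the surface, is below 1 inside, and above 1 outside."""
    x, y, z = point
    a1, a2, a3 = size
    e1, e2 = shape
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

size, shape = (1.0, 2.0, 3.0), (0.5, 1.5)  # placeholder parameters
surface_point = superquadric_point(size, shape, eta=0.3, omega=1.1)
value_on_surface = inside_outside(surface_point, size, shape)
```

Evaluating the inside-outside function on a sampled surface point gives a value of 1 up to floating-point error, which is a convenient sanity check for the parametrisation.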

Appendix G Primitive Detail


As mentioned in Preliminaries of the main paper, we use a tapering deformation barr1987global to enhance the representation power of the superquadrics. Following the code provided by paschalidou2019superquadrics, the tapering deformation is defined by:


where is a point, and defines deformation parameters, and is the size parameter in the -axis. This linear taper deforms the primitive shape in the -axis by an amount which depends on the value of the -axis. As a result, tapering deformation will make primitives more conic, which helps to model unbalanced shapes such as the head of the airplane.
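A linear taper of this kind can be sketched as below. The exact form is an assumption based on the description above and on the common definition of a linear taper: the x and y coordinates are scaled by a factor that grows linearly with z, normalised by the size parameter along the z-axis. The names `taper`, `k`, and `alpha_z` are hypothetical.

```python
def taper(point, k, alpha_z):
    """Linear taper along z (assumed form): scale x and y by (k / alpha_z) * z + 1."""
    x, y, z = point
    kx, ky = k
    scale_x = kx / alpha_z * z + 1.0
    scale_y = ky / alpha_z * z + 1.0
    return (scale_x * x, scale_y * y, z)

# Points on the plane z = 0 are unchanged; away from it, x and y are rescaled.
unchanged = taper((1.0, 1.0, 0.0), k=(0.5, 0.5), alpha_z=1.0)
tapered = taper((1.0, 2.0, 1.0), k=(0.5, -0.5), alpha_z=1.0)
```

A negative deformation parameter shrinks the cross-section as z increases, which is what produces cone-like primitives.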

Details on the Superquadric Losses

While the definition of our superquadric loss functions follows paschalidou2019superquadrics, we include more details here for the sake of completeness.

The superquadric loss is defined as


where is the distance term which encourages superquadric to fit the input point cloud . is a regularisation term which encourages desired behaviour; for example, we prefer primitives that do not overlap one another.

The distance term measures the distance between points sampled from primitive surface and input point cloud . Following the idea of the Chamfer distance, the distance term is decomposed as:


where defines the distance from the primitive to the input point cloud , and defines the distance from the point cloud to primitive . Additional details may be found in paschalidou2019superquadrics.
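The two-directional structure of the distance term can be illustrated with a plain Chamfer-style sketch. This is not the loss of paschalidou2019superquadrics: the unweighted means, the use of Euclidean (rather than squared) distances, and the function names are assumptions made for illustration.

```python
import math

def directed_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)

def bidirectional_distance(primitive_points, cloud_points):
    """Primitive-to-cloud term plus cloud-to-primitive term."""
    return (directed_distance(primitive_points, cloud_points)
            + directed_distance(cloud_points, primitive_points))

# Toy point sets standing in for sampled primitive surface points and the input cloud.
primitive_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
distance = bidirectional_distance(primitive_points, cloud_points)
```

The first term is zero here because every primitive point coincides with a cloud point, while the second term is non-zero because one cloud point is not covered by the primitive; this asymmetry is why both directions are needed.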

The regularisation term is defined as


As we manually select the number of parts, we only use an overlapping regularizer to discourage the superquadrics from overlapping one another; this term is adapted from paschalidou2019superquadrics.

In order to achieve the best performance, different are used for different categories during training. In particular we set: for the chair category with number of primitives ; for the chair category with , and the airplane category with and ; for the table category with and .

Appendix H Model Details

We give two alternative derivations of our training objective, followed by some additional discussions, and details of our network architectures.

Detailed Derivations

To make the supplementary material self-contained, we first recall inequality (7) in the main paper,


as well as equations (8) and (9) in the main paper,


First Derivation

By putting (22) and (23) into the lower bound (21), we have


By cancelling and taking the integral of we get


By applying Bayes’ rule, we have


We see the key point, that the final term in (27) is tractable as it does not depend on , that is . Since our decoder has a simple deterministic relationship which we denote by the limit


we can rewrite the reconstruction error term to emphasise the dependence of on to get the ELBO


where .

Second Derivation

Using (22) and (23) in (21), we have:


This reveals the key point: the regulariser term is tractable because, by (23),


Finally, since our decoder has a simple deterministic relationship which we denote by the limit


we can rewrite the reconstruction error term to emphasise the dependence of on ,


where in the final line .
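As a reminder of what a tractable regulariser looks like in practice, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. That the regulariser here takes exactly this form is an assumption based on the usual VAE setting of kingma2013auto; the function name and inputs are hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

# The divergence vanishes when the posterior equals the prior.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```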


Because of the deterministic mapping between and , we have . This allows us to annotate as in Figure 2 of the main paper.

In that same figure, we annotate with the mapping (on the right hand side) from parts representations to the output point cloud , despite appearing on the left hand side. This is consistent with standard expositions: for example, we may connect this with Appendix C.1 of the VAE paper kingma2013auto by noting that our and are together analogous to their Bernoulli parameters .

Finally note that the posterior of, for example, the combined primitive is not included in our variational inference model, which is a byproduct obtained by assembling the part primitives from posterior samples of .

Architecture and Implementation Details

The model is implemented using PyTorch paszke2019pytorch on Ubuntu 16.04 and trained on one GeForce RTX 3090 and one GeForce RTX 2080 Ti. 10 GB of memory is allocated.

The number of parts (parameter ) for each category is manually selected with domain specific knowledge. The choice reflects what one believes a good semantic segmentation should be, which is application dependent. As mentioned in the main paper, a chair may be roughly decomposed into back, base and legs for . In addition, a chair could also be decomposed into back, base, armrest, and four legs for . For the airplane category, it could be separated into body, tail, and wings for ; and also into body, two wings, two engines, and tail for . Finally, a table may be decomposed into base and four legs for ; and also into base, left two legs, and right two legs for . The general message here is clear: various decompositions are valid and useful.


Encoder

We use PointNet qi2017pointnet as the encoder. Following the network structure from achlioptas2018learning, the encoder has 64, 128, 128, 256 filters at each layer. Batch normalization and LeakyReLU are used between each layer.
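A PointNet-style encoder with these filter sizes might look like the following PyTorch sketch. The class name, the kernel size of 1, the max-pooling over points, and the placement of normalisation and activation after every layer (including the last) are assumptions; the projection from the pooled feature to the latent parameters is omitted.

```python
import torch
from torch import nn

class PointEncoder(nn.Module):
    """Shared per-point layers followed by a symmetric max-pool (assumed layout)."""

    def __init__(self, widths=(64, 128, 128, 256)):
        super().__init__()
        layers, in_channels = [], 3
        for out_channels in widths:
            layers += [
                nn.Conv1d(in_channels, out_channels, kernel_size=1),
                nn.BatchNorm1d(out_channels),
                nn.LeakyReLU(),
            ]
            in_channels = out_channels
        self.features = nn.Sequential(*layers)

    def forward(self, points):
        # points: (batch, num_points, 3) -> features: (batch, channels, num_points)
        features = self.features(points.transpose(1, 2))
        # Max over the point dimension gives an order-invariant global feature.
        return features.max(dim=2).values

encoder = PointEncoder().eval()
with torch.no_grad():
    global_feature = encoder(torch.randn(2, 100, 3))
```

The max-pool makes the output independent of the ordering of the input points, which is the property that makes this kind of encoder suitable for point clouds.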

Point Decoder

We use the generator of TreeGAN shu20193d as the point decoder. The architecture we used has 5 layers, with the root being the local part latent vector, and the leaves being points in 3D space. The loop term has supports. The feature dimension and branching factor for each layer are and , respectively. Hence, each point decoder outputs points.

Pose and Primitive Decoders

All pose and primitive decoders are one layer fully connected networks, following paschalidou2019superquadrics. The dimension of the fully connected layers depends on the input latent size (namely 8) and output parameter dimension. See the repository of paschalidou2019superquadrics for the detailed implementation.