Composite Shape Modeling via Latent Space Factorization

by Anastasia Dubrovina, et al.

We present a novel neural network architecture, termed Decomposer-Composer, for semantic structure-aware 3D shape modeling. Our method utilizes an auto-encoder-based pipeline, and produces a novel factorized shape embedding space, where the semantic structure of the shape collection translates into a data-dependent sub-space factorization, and where shape composition and decomposition become simple linear operations on the embedding coordinates. We further propose to model shape assembly using an explicit learned part deformation module, which utilizes a 3D spatial transformer network to perform an in-network volumetric grid deformation, and which allows us to train the whole system end-to-end. The resulting network allows us to perform part-level shape manipulation, unattainable by existing approaches. Our extensive ablation study, comparison to baseline methods and qualitative analysis demonstrate the improved performance of the proposed method.





1 Introduction

Understanding, modeling and manipulating 3D objects are areas of great interest to the vision and graphics communities, and have been gaining increasing popularity in recent years. Examples of related applications include semantic segmentation [41], shape synthesis [37, 2], 3D reconstruction [7, 8] and view synthesis [39], to name a few. The advancement of deep learning techniques and the creation of large-scale 3D shape datasets [5] have enabled researchers to learn task-specific representations directly from existing data, and have led to significant progress in all the aforementioned areas.

Figure 1: Given unlabeled shapes, the Decomposer maps them into a factorized embedding space. The Composer can either reconstruct the shapes with semantic part labels, or manipulate their geometry, for instance, by exchanging chair legs.

There is a growing interest in learning shape modeling and synthesis in a structure-aware manner, for instance, at the level of semantic shape parts. This poses several challenges compared to approaches considering the shapes as a whole. Semantic shape structure and shape part geometry are usually interdependent, and relations between the two must be implicitly or explicitly modeled and learned by the system. Examples of such structure-aware shape representation-learning include [22, 18, 35, 38].

However, the existing approaches for shape modeling, while being part-aware at the intermediate stages of the system, still ultimately operate on low-dimensional representations of the whole shape. For example, [22] use a Variational Autoencoder (VAE) to learn a generative part-aware model of man-made shapes, but the encoding space of the VAE corresponds to complete shapes. As a result, factors corresponding to different parts are entangled in that space. Thus, existing approaches cannot be utilized to perform direct part-level shape manipulation, such as single part replacement, part interpolation, or part-level shape synthesis.

Inspired by the recent efforts in image modeling to separate different image formation factors, to gain better control over image generation process and simplify editing tasks [27, 32, 33], we propose a new semantic structure-aware shape modeling system. This system utilizes an auto-encoder-based pipeline, and produces a factorized embedding space which both reflects the semantic part structure of the shapes in the dataset, and compactly encodes different semantic parts’ geometry. In this embedding space, different semantic part embedding coordinates lie in separate linear subspaces, and shape composition can naturally be performed by summing up part embedding coordinates. The embedding space factorization is data-dependent, and is performed using learned linear projection operators. Furthermore, the proposed system operates on unlabeled input shapes, and at test time it simultaneously infers the shape’s semantic structure and compactly encodes its geometry.

Towards that end, we propose a Decomposer-Composer pipeline, which is schematically illustrated in Figure 1. The Decomposer maps an input shape, represented by an occupancy grid, into the factorized embedding space described above. The Composer reconstructs a shape with semantic part labels from a set of part embedding coordinates. It explicitly learns the set of transformations to be applied to the parts, so that together they form a semantically and geometrically plausible shape. In order to learn and apply those part transformations, we employ a 3D variant of the Spatial Transformer Network (STN) [11]. A 3D STN was previously utilized to scale and translate objects represented as 3D occupancy grids in [10], but to the best of our knowledge, ours is the first approach suggesting in-network affine deformation of occupancy grids.

Finally, to promote part-based shape manipulation, such as part replacement, part interpolation, or shape synthesis from arbitrary parts, we employ the cycle consistency constraint [42, 27, 23, 34]. We utilize the fact that the Decomposer maps input shapes into a factorized embedding space, making it possible to control which parts are passed to the Composer for reconstruction. Given a batch of input shapes, we apply our Decomposer-Composer network twice, while randomly mixing part embedding coordinates before the first Composer application, and then de-mixing them into their original positions before the second Composer application. The resulting shapes are required to be as similar as possible to the original shapes, using a cycle consistency loss.

Main contributions

Our main contributions are: (1) A novel latent space factorization approach which enables performing shape structure manipulation using linear operations directly in the learned latent space; (2) The application of a 3D STN to perform in-network affine shape deformation, for end-to-end training and improved reconstruction accuracy; (3) The incorporation of a cycle consistency loss for improved reconstruction quality.

Figure 2: The proposed Decomposer-Composer architecture.

2 Related work

Learning-based shape synthesis

Learning-based methods have been used for automatic synthesis of shapes from complex real-world domains. In a seminal work [12], Kalogerakis et al. used a probabilistic model, which learned both continuous geometric features and discrete component structure, for component-based shape synthesis and novel shape generation. Recently, the development of deep neural networks enabled learning high-dimensional features more easily: 3DGAN [37] uses 3D decoders and a GAN to generate voxelized shapes. A similar approach has been applied to 3D point clouds, achieving high fidelity and diversity in shape synthesis [2]. Apart from generating shapes from an unstructured latent representation, some methods impose structure on the generation process. SSGAN [36] generates the shape and texture for a 3D scene in a two-stage manner. GRASS [18] generates shapes in two stages: first generating oriented bounding boxes, and then a detailed geometry within those bounding boxes. Nash and Williams [22] use a VAE, and generate parts by dividing the latent code equally into predefined segments representing different parts. In a related work [35], Wang et al. introduced a 3D GAN-based generative model that produces shapes segmented into labeled parts. Unlike the two latter approaches, our network does not use predefined subspaces for part embedding, but learns to project the latent code of the entire shape onto the subspaces corresponding to codes of different parts.

Spatial transformer networks

Spatial transformer networks (STN) [11] make it easy to incorporate deformations into a learning pipeline. Kurenkov et al. [16] retrieve a 3D model from one RGB image and generate a deformation field to modify it. Kanazawa et al. [13] model articulated or soft objects with a template shape and deformations. Lin et al. [19] iteratively use STNs to warp a foreground onto a background, and use a GAN to constrain the composition results to the natural image manifold. Hu et al. [10] use a 3D STN to scale and translate objects given as volumetric grids, as part of a scene-generation network. Inspired by this line of work, we incorporate an affine transformation module into our network. This way, the generation module only needs to generate normalized parts, and the deformation module transforms and assembles the parts together.

Deep latent space factorization

Several approaches have been proposed to learn disentangled latent spaces for image representation and manipulation. β-VAE [9] introduces an adjustable hyperparameter that balances latent channel capacity and independence constraints against reconstruction accuracy. InfoGAN [6] achieves the disentangling of factors by maximizing the mutual information between certain channels of the latent code and image labels. Some approaches disentangle the image generation process using intrinsic decomposition, such as albedo and shading [33], or a normalized shape and a deformation grid [27, 32]. Note that the proposed approach differs from [27, 32, 33] in that both full and partial shape embedding coordinates reside in the same low-dimensional embedding space, whereas in the latter, different components have their own separate embedding spaces.

Projection in neural networks

Projection is widely used in representation learning. It can be used for transformation from one domain to another [3, 25, 26], which is useful for tasks like translation in natural language processing. For example, Senel et al. [30] use projections to map word vectors into semantic categories, in order to analyze the semantic structure of word embeddings. In this work, we use a projection layer to transform a whole-shape embedding into semantic part embeddings.

3 Our model

3.1 Decomposer network

The Decomposer network is trained to embed unlabeled shapes (naturally built of a set of semantic parts) into a factorized embedding space, reflecting the shared semantic structure of the shape collection. To allow for composite shape synthesis, the embedding space has to satisfy the following two properties: factorization consistency across input shapes, and existence of a simple shape composition operator to combine different semantic factors. We propose to model this embedding space as a direct sum of subspaces, $\mathcal{V} = \mathcal{V}_1 \oplus \dots \oplus \mathcal{V}_K$, where $K$ is the number of semantic parts, and each subspace $\mathcal{V}_i$ corresponds to the $i$-th semantic part, thus satisfying the factorization consistency property. The second property is ensured by the fact that every vector $v \in \mathcal{V}$ is given by a sum of unique components $v_i \in \mathcal{V}_i$, so that part composition may be performed by part embedding summation. This also implies that the decomposition and composition operations in the embedding space are fully reversible.
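The composition-by-summation property can be illustrated with a small numerical sketch. The axis-aligned coordinate-block subspaces below are an illustrative simplification (and all dimensions are toy values), not the learned, data-dependent factorization described later:

```python
import numpy as np

# Toy direct-sum factorization: a D-dimensional embedding space split into
# K subspaces, one per semantic part (here: contiguous coordinate blocks).
D, K = 8, 4  # hypothetical small dimensions

def part_projection(i):
    """Projection onto subspace i (a contiguous coordinate block)."""
    P = np.zeros((D, D))
    block = slice(i * D // K, (i + 1) * D // K)
    P[block, block] = np.eye(D // K)
    return P

projections = [part_projection(i) for i in range(K)]

# Decompose a whole-shape embedding into its part embeddings ...
x = np.arange(D, dtype=float)
parts = [P @ x for P in projections]

# ... and compose by summation: decomposition and composition are reversible.
x_rec = sum(parts)
print(np.allclose(x_rec, x))  # True
```

Swapping one part's component for the corresponding component of another shape's embedding, before summing, is exactly the part-exchange operation used later in the paper.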

A simple approach for such factorization is to split the dimensions of the $D$-dimensional embedding space into $K$ coordinate groups, each group representing a certain semantic part embedding. In this case, the full shape embedding is a concatenation of part embeddings, an approach explored in [35]. This, however, puts a hard constraint on the dimensionality of the part embeddings, and thus also on the representation capacity of each part embedding subspace. Given that different semantic parts may have different geometric complexities, this factorization may be sub-optimal.

Instead, we propose performing a data-driven, learned factorization of the embedding space into semantic subspaces. We achieve that by performing the factorization using learned part-specific projection matrices, denoted by $\{P_i\}_{i=1}^{K}$. To ensure that the aforementioned two factorization properties hold, and that $\mathcal{V}$ is factorized into $\{\mathcal{V}_i\}$ such that $\mathcal{V} = \mathcal{V}_1 \oplus \dots \oplus \mathcal{V}_K$ (direct sum property), these projection matrices must form a partition of the identity. Namely, $\{P_i\}$ must satisfy the following three properties:

$$P_i^2 = P_i \;\; \forall i, \qquad P_i P_j = 0 \;\; \forall i \neq j, \qquad \sum_{i=1}^{K} P_i = I, \tag{1}$$

where $0$ and $I$ are the all-zero and the identity matrices of size $D \times D$, respectively.
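A partition of the identity need not be axis-aligned. The following sketch builds oblique (non-orthogonal) projections that still satisfy all three properties, illustrating why a learned factorization is strictly more general than a fixed coordinate split. The construction by conjugation is our illustration only, not the paper's training procedure:

```python
import numpy as np

# Conjugating coordinate-block projections E_i by any invertible matrix A
# yields oblique projections P_i = A E_i A^{-1} that still form a
# partition of the identity.
rng = np.random.default_rng(0)
D, K = 8, 4  # toy dimensions
A = rng.normal(size=(D, D)) + 5 * np.eye(D)  # well-conditioned, invertible
A_inv = np.linalg.inv(A)

def E(i):
    """Axis-aligned block projection onto coordinate group i."""
    M = np.zeros((D, D))
    b = slice(i * D // K, (i + 1) * D // K)
    M[b, b] = np.eye(D // K)
    return M

P = [A @ E(i) @ A_inv for i in range(K)]

assert np.allclose(P[0] @ P[0], P[0])              # idempotence
assert np.allclose(P[0] @ P[1], np.zeros((D, D)))  # mutual annihilation
assert np.allclose(sum(P), np.eye(D))              # sum to identity
```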

In practice, we efficiently implement the projection operators using fully connected layers without added biases, with a total of $K \cdot D^2$ variables, constrained as in Equation 1. The projection layers receive as input a whole-shape encoding, which is produced by a 3D convolutional shape encoder. The parameters of the shape encoder and the projection layers are learned simultaneously. The resulting architecture of the Decomposer network is schematically described in Figure 2, and a detailed description of the shape encoder and the projection layer architecture is given in the supplementary material.

3.2 Composer network

The composer network is trained to reconstruct shapes from sets of semantic part embedding coordinates. We assume that these part embedding sets are valid, in the sense that each set includes at most one embedding coordinate per semantic part type. The composer produces an output shape labeled with semantic part labels.

The simplest composer implementation would consist of a single decoder mirroring the (whole) shape encoder described in the previous section, which would produce an output shape with or without semantic labels. Such an approach was used in [35], for instance. However, this straightforward method fails to reconstruct thin volumetric shape parts, e.g., thin chair legs, and other fine shape details. To address this issue, we use a different approach, where we first separately reconstruct scaled and centered shape parts, using a shared part decoder; we then produce per-part transformation parameters and use them to deform the parts in a coherent manner, to obtain a complete reconstructed shape.

In our model, we make the simplifying assumption that a given set of parts can be combined into a plausible shape by transforming them with per-part affine transformations and translations. While the true set of transformations which produce plausible shapes is significantly larger and more complex, we demonstrate that the proposed simplified model is successful at producing geometrically and visually plausible results. This in-network part transformation is implemented using a 3D spatial transformer network (STN) [11]. It consists of a localization net, which produces a set of 12-dimensional affine transformations (including translations), one per part, and a re-sampling unit, which transforms the reconstructed scaled and centered part volumes and places the parts in their correct locations in the full shape. The STN receives as input both the reconstructed parts from the part decoder and the sum of part encodings, for best reconstruction results.
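To make the re-sampling step concrete, the following sketch applies a 12-parameter affine transform to an occupancy grid by inverse warping. A real STN uses differentiable trilinear sampling; this illustration uses nearest-neighbour sampling for simplicity, and all grid sizes are toy values:

```python
import numpy as np

def affine_warp_grid(vol, theta):
    """Warp a volumetric occupancy grid with a 3x4 affine matrix `theta`
    (rotation/scale/shear in theta[:, :3], translation in theta[:, 3]),
    by inverse warping with nearest-neighbour sampling."""
    d, h, w = vol.shape
    A, t = theta[:, :3], theta[:, 3]
    # Normalized output coordinates in [-1, 1], as in the STN formulation.
    zs, ys, xs = np.meshgrid(np.linspace(-1, 1, d),
                             np.linspace(-1, 1, h),
                             np.linspace(-1, 1, w), indexing='ij')
    out_coords = np.stack([zs, ys, xs], axis=-1).reshape(-1, 3)
    src = out_coords @ A.T + t  # where each output voxel samples from
    # Map normalized source coordinates back to voxel indices and sample
    # (voxels mapping outside the grid stay empty).
    idx = np.round((src + 1) / 2 * (np.array([d, h, w]) - 1)).astype(int)
    valid = np.all((idx >= 0) & (idx < [d, h, w]), axis=1)
    out = np.zeros(d * h * w)
    out[valid] = vol[idx[valid, 0], idx[valid, 1], idx[valid, 2]]
    return out.reshape(d, h, w)

# Sanity check: the identity transform leaves the grid unchanged.
vol = np.zeros((8, 8, 8))
vol[2:6, 2:6, 2:6] = 1.0
theta_id = np.hstack([np.eye(3), np.zeros((3, 1))])
print(np.allclose(affine_warp_grid(vol, theta_id), vol))  # True
```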

The resulting Composer architecture consists of two components: a shared part decoder, which receives part embedding coordinates and produces centered and scaled (to unit cube) versions of parts, and a spatial transformer network (STN) that deforms and places the parts in the full assembled shape. The Composer architecture is schematically described in Figure 2; its detailed description is given in the supplementary material.

We note that the proposed approach is related to the two-stage shape synthesis approach of [18], in which a GAN is first used to synthesize oriented bounding boxes for different parts, and then the part geometry is created per bounding box using a separate part decoder. Our approach is similar, yet it works in a reversed order. Namely, we first reconstruct part geometry with our shared part decoder, and then compute per-part affine transformation parameters, which are a 12-dimensional equivalent of the oriented part bounding boxes in [18]. Similarly to [18], this two-stage approach improves the reconstruction of fine geometric details. However, unlike [18], where the GAN and the part decoder were trained separately, in our approach the two stages belong to the same reconstruction pipeline, coupled by the full model reconstruction loss, and trained simultaneously and end-to-end. On the other hand, compared to [18], our approach is limited in the sense that it reconstructs shapes from semantic parts, while [18] synthesizes shapes from an arbitrary number of smaller parts, and is trained with more complex symmetry-preserving requirements. We plan to address this limitation in future work.

3.3 Cycle consistency

Our training set is comprised of 3D shapes with ground-truth semantic part decomposition; it does not include any training examples of synthesized composite shapes. In fact, existing methods for such a shape assembly task operate on 3D meshes with very precise segmentations, and often with additional knowledge about part connectivity [40, 31]. These methods cannot be applied to a dataset, like ours, which comes without precise segmentation or such extra connectivity knowledge. As a result, existing methods cannot be used to produce a sufficiently large set of plausible new shapes (constructed from existing parts) for training a deep network for composite shape modeling. Instead, we add a cycle consistency requirement to train the network to produce non-trivial, geometrically and semantically plausible part transformations for arbitrary part arrangements.

Cycle consistency has been previously utilized in geometry processing [23], image segmentation [34], and more recently in neural image transformation [27, 42]. We use it as follows: given a batch of training examples, the Decomposer produces the corresponding sets of semantic part encodings, each with $K$ encodings. During training, we randomly mix part encodings of different shapes in the batch, while ensuring that after the mixing each of the new encoding sets is valid (includes exactly one embedding coordinate per semantic part). After that, we pass the new sets with mixed encodings to the Composer, which reconstructs the shapes with correspondingly mixed parts. We then pass those shapes (as binary occupancy grids) to the Decomposer for the second time, once again producing sets of part encodings. We de-mix the encodings, to restore the original encoding-to-shape association, and pass the de-mixed encoding sets to the Composer again. The cycle consistency requirement means that the results produced by the second Composer application must be as similar as possible to the original shapes, which is enforced by the cycle consistency loss described in the next section. The double application of the proposed network with part encoding mixing and de-mixing is schematically described in Figure 3.
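The mixing and de-mixing of part encodings can be sketched as follows (the tensor shapes and dimensions are hypothetical):

```python
import numpy as np

# enc[b, k] is the embedding of semantic part k of shape b, for a batch of
# B shapes with K part types and D-dimensional part embeddings.
rng = np.random.default_rng(1)
B, K, D = 4, 4, 8
enc = rng.normal(size=(B, K, D))

# For each semantic part, draw an independent permutation of the batch:
# every mixed set still contains exactly one embedding per part type.
perms = [rng.permutation(B) for _ in range(K)]
mixed = np.stack([enc[perms[k], k] for k in range(K)], axis=1)

# De-mixing applies the inverse permutations, restoring the original
# encoding-to-shape association (the cycle must reproduce the inputs).
inv = [np.argsort(p) for p in perms]
demixed = np.stack([mixed[inv[k], k] for k in range(K)], axis=1)
assert np.allclose(demixed, enc)
```

In the actual pipeline, `mixed` would be summed per shape and decoded by the Composer, re-encoded by the Decomposer, and only then de-mixed; the permutation bookkeeping is identical.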

Figure 3: Schematic description of the cycle consistency constraint. See Section 3.3 for details.

3.4 Loss function

Our loss function is defined as the following weighted sum of several loss terms:

$$L = w_1 L_{PI} + w_2 L_{part} + w_3 L_{trans} + w_4 L_{whole} + w_5 L_{cycle},$$

where $L_{PI}$, $L_{part}$, $L_{trans}$, $L_{whole}$ and $L_{cycle}$ denote the loss terms defined below. The weights $w_i$ compensate for the different scales of the loss terms, and reflect their relative importance.

Partition of the identity loss

$L_{PI}$ measures the deviation of the predicted projection matrices from the optimal projections, as given by Equation 1:

$$L_{PI} = \sum_{i} \big\|P_i^2 - P_i\big\|_F^2 + \sum_{i \neq j} \big\|P_i P_j\big\|_F^2 + \Big\|\sum_{i} P_i - I\Big\|_F^2.$$
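This penalty can be sketched as a sum of squared Frobenius deviations from the three partition-of-the-identity constraints. This is our reading of the text; the paper's exact formulation and weighting may differ:

```python
import numpy as np

def partition_of_identity_loss(P):
    """Penalize deviation from idempotence, mutual annihilation, and
    summing to the identity, for a stack P of K projection matrices
    of shape (K, D, D). np.linalg.norm on a matrix is Frobenius."""
    K, D, _ = P.shape
    loss = np.linalg.norm(P.sum(axis=0) - np.eye(D)) ** 2
    for i in range(K):
        loss += np.linalg.norm(P[i] @ P[i] - P[i]) ** 2
        for j in range(K):
            if i != j:
                loss += np.linalg.norm(P[i] @ P[j]) ** 2
    return loss

# An exact partition of the identity (coordinate blocks) has zero loss.
D, K = 8, 4
P = np.zeros((K, D, D))
for i in range(K):
    P[i, i * 2:(i + 1) * 2, i * 2:(i + 1) * 2] = np.eye(2)
print(partition_of_identity_loss(P))  # 0.0
```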


Part reconstruction loss

is the binary cross-entropy loss between the reconstructed centered and scaled part volumes and their respective ground truth part indicator volumes, summed over parts.

Transformation parameter loss

is a regression loss between the predicted and the ground truth 12-dimensional transformation parameter vectors, summed over parts. Unlike the original STN approach [11], where there was no direct supervision on the transformation parameters, we found that this supervision is critical for our network's convergence. We provide more details on the exact training procedure in Section 3.5.

Whole model reconstruction loss

measures the complete shape reconstruction quality, and is given by the cross entropy loss between the resulting volume with predicted part labels, and the ground truth labeled volume.

Cycle consistency loss

measures the deviation between ground truth input volumes and their (binarized) reconstructions, obtained using two applications of the proposed network, with part encoding mixing and de-mixing between the first and the second composing steps, as described in Section 3.3. We measure this deviation using a binary cross-entropy loss.

3.5 Training details

The network was implemented in TensorFlow, and trained for 700 epochs with batch size 48. We used the Adam optimizer [14], with the learning rate decayed every 300 epochs. We found it essential to first pre-train the binary shape encoder, projection layer and part decoder parameters separately, by minimizing the part reconstruction and the partition of the identity losses, for improved part reconstruction results. We then train the parameters of the spatial transformer network, using the transformation parameter loss, while keeping the rest of the parameters fixed. After that, we resume the training with all parameters, and later turn on the full model and cycle consistency losses, to fine-tune the reconstruction parameters. The total training time is one day on an NVIDIA Tesla V100 GPU. The loss combination weights were chosen empirically using the validation set.

4 Experiments


In our experiments, we used the chair models from the ShapeNet 3D data collection [5], with part annotations produced by Yi et al. [41]. The shapes were converted to occupancy grids using binvox [24]. Semantic part labels were first assigned to the occupied voxels according to their proximity to the labeled 3D points, and the final voxel labels were obtained using graph-cuts in the voxel domain [4]. We used the official ShapeNet train, validation and test data splits in all our experiments.

Data augmentation

We perform two types of data augmentation, which we found important for successful training. First, during training we randomly and independently remove parts from the training shapes, with a fixed probability per part. Thus, our model learns to reconstruct not only complete shapes, but also shapes consisting of a subset of parts. As illustrated by the ablation study, this type of data augmentation prevents the network from overfitting to the training data, and helps it learn to perform high-quality reconstruction.

The second type of data augmentation is intended to assist the spatial transformer network in learning to reconstruct non-trivial per-part affine transformations. Specifically, we augment input parts with random affine transformations and provide the transformed sets of parts separately to the Decomposer as an additional input, while training the spatial transformer network to predict the inverse transformations, i.e., to reconstruct the ground truth labeled shape from its transformed parts.
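The first augmentation can be sketched as follows; the drop probability is a hypothetical value, as the text does not specify it, and the keep-at-least-one-part rule is our assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_part_removal(parts, p_drop=0.25):
    """Independently drop each part (a boolean occupancy grid) with
    probability p_drop, keeping at least one part so the resulting
    training shape is never empty."""
    kept = [p for p in parts if rng.random() >= p_drop]
    if not kept:  # all parts dropped: fall back to one random part
        kept = [parts[rng.integers(len(parts))]]
    return kept

# Example: a chair with 4 semantic parts (toy 4x4x4 grids).
parts = [np.ones((4, 4, 4), dtype=bool) for _ in range(4)]
kept = random_part_removal(parts)
assert 1 <= len(kept) <= 4
```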

4.1 Shape reconstruction

In this experiment, we tested the reconstruction capabilities of the proposed network. Note that for this and the other experiments described below, we used unlabeled shapes from the test set. Labeled ground truth shapes, when shown, are for illustration and comparison purposes only. Figure 4 presents the input unlabeled shapes (in gray) and the reconstructed shapes composed of semantic parts (color-coded).

Figure 4: Reconstruction results of the proposed pipeline. Gray shapes are the input test shapes; the results are colored according to the part label: seat (blue), back (green), legs (light blue) and armrests (yellow).

4.2 Composite shape synthesis

Figure 5: Single part exchange experiment. GT denotes ground truth labeled shapes (shown for illustration purposes), REC - reconstruction results, and SWAP - results of exchanging a part between the shapes. The exchanged parts are (top to bottom): legs, back, seat. Unlabeled shapes were used as input.

Shape composition by part exchange

We randomly picked pairs of shapes, and exchanged parts between them by mapping the input unlabeled shapes into the embedding space with the Decomposer, exchanging the encodings of a selected semantic part, and composing shapes from the new part arrangements with the Composer. The results are shown in Figure 5, and demonstrate the ability of our system to perform accurate part exchange, while deforming the geometry of both the new and the existing parts to obtain a plausible result.

Shape composition by random part assembly

In this experiment, we tested the ability of the proposed network to assemble shapes from random parts using our factorized embedding space. Here, we worked with batches of size four. We mapped the input shapes into the embedding space with the Decomposer; we then randomly mixed the part embedding coordinates as described in Section 3.3, ensuring that no two encodings in the new set came from the same original shape; finally, we composed the new shapes with mixed parts using the Composer. The results are shown in Figure 6, and illustrate the ability of the proposed method to combine parts from different shapes and deform them so that the resulting shape looks realistic.

See the supplementary material for additional results of shape reconstruction, part exchange and assembly from parts, for the chairs and two additional classes of shapes from the ShapeNet (planes and tables).


Figure 6: Synthesis-from-parts example. For two batches of 4 chairs, the top row shows the ground truth (GT) shapes, and the bottom row shows shapes assembled with the proposed approach by randomly picking parts from the GT shapes, such that no two parts of a result come from the same GT shape. Unlabeled shapes were used as input.

Full and partial interpolation in the embedding space

In this experiment, we tested reconstruction from linearly interpolated embedding coordinates of complete shapes, as well as interpolated embedding coordinates of a single semantic part. For the latter, we interpolated the embedding coordinates of one of the original shape parts and the corresponding semantic part from another randomly picked shape, while keeping the rest of part embedding coordinates fixed. The results are shown in Figure 7. See the supplementary material for a detailed description of the interpolated shape reconstruction process, and more shape and part interpolation examples.

Figure 7: Example of a whole and partial shape interpolation. Left and right are test models with ground truth segmentation. The results in the first row were obtained by linearly interpolating whole model embedding vectors. The results in the second row were obtained by linearly interpolating the seat embedding vectors. The final shapes were reconstructed using the Composer network. Unlabeled shapes were used as an input.

4.3 Embedding space and projection matrix analysis

Embedding space

Figure 8 visualizes the structure of our learned shape and part embedding space using the t-SNE algorithm [21], and illustrates the clear separation into different semantic part subspaces.

Figure 8: t-SNE visualization of the produced embedding space, using both train and test shape embedding coordinates. The "empty" part coordinates correspond to the embeddings produced for non-existing parts.

Projection matrix analysis

Figure 9 shows the obtained projection matrices, their sum, and a plot of their singular values. The proposed method succeeds in obtaining a set of projection matrices which approximately sum to the identity, and attain a small partition-of-the-identity loss (Section 3.4), for a hundred-dimensional embedding space and four semantic subspaces. While the $P_i$ are full-rank and not strictly orthogonal projection matrices, the plot of their singular values shows that their effective ranks are significantly lower than the embedding space dimension. This is in line with the excellent separation into non-overlapping subspaces produced by these projection matrices.

Figure 9: Projection matrix analysis. The two upper rows present the obtained projection matrices for the four semantic parts (seat, back, legs, armrests) and their sum. The bottom row shows the singular values of the matrices.
Method                   mIoU   mIoU (parts)   Connectivity          Classifier acc.       Symmetry
                         Rec.   Rec.           Rec.   Swap   Col.    Rec.   Swap   Col.    Rec.   Swap   Col.
Our method               0.63   0.66           0.87   0.85   0.80    0.92   0.74   0.56    0.93   0.93   0.93
Fixed projection         0.61   0.66           0.82   0.79   0.78    0.90   0.65   0.47    0.92   0.93   0.93
Decoder w/o STN          0.77   0.77           0.81   0.65   0.58    0.96   0.72   0.53    0.94   0.92   0.91
W/o data augmentation    0.60   0.66           0.85   0.83   0.79    0.86   0.69   0.51    0.93   0.93   0.93
W/o part removal         0.61   0.55           0.75   0.70   0.68    0.91   0.78   0.69    0.88   0.89   0.89
W/o cycle loss           0.64   0.62           0.77   0.71   0.67    0.89   0.65   0.50    0.93   0.93   0.93
Naive placement          -      -              -      0.68   0.62    -      0.47   0.21    -      0.96   0.96

Table 1: Ablation study results. The evaluation metrics are mean Intersection over Union (mIoU), per-part mean IoU (mIoU (parts)), shape connectivity measure, binary shape classifier accuracy, and shape symmetry score. Rec., Swap and Col. stand for the shape reconstruction, part exchange and random part assembly experiments, respectively (see Section 4.2). See Section 4.4 for a detailed description of the compared methods and the evaluation metrics.

4.4 Ablation study

To highlight the importance of the different elements of the proposed approach, we conducted an ablation study, where we used several variants of the proposed method, as listed below, as well as a naive part placement baseline.

Fixed projection matrices

Here, instead of using learned projection matrices in the Decomposer, the $D$-dimensional shape encoding is split into $K$ consecutive equal-sized segments, which correspond to the different part embedding subspaces. This is equivalent to using constant projection matrices, where the diagonal elements corresponding to a particular segment's dimensions are 1, and the rest of the elements are 0.

Composer without STN

Here, we substituted the proposed composer, consisting of the part decoder and the STN, with a single decoder producing a labeled shape. The decoder receives the sum of part encodings as input, processes it with two FC layers to combine information from different parts, and then reconstructs a shape with part labels using a series of deconvolution steps, similar to the part decoder in the proposed architecture.

Without random part removal

Here, we removed the first type of data augmentation, and used only complete shapes for training.

Without affine transformation augmentation

Here, we removed the second type of data augmentation, namely, augmenting input parts with random affine transformations. Instead, we supplied the original, untransformed parts to the network.

Without cycle loss

Here, we removed the cycle loss component during the network training.

Baseline method

Here, given input parts, we simply placed them in the output volume at their original positions. All the shapes in our dataset are centered and uniformly scaled to fill the unit volume, and there are clusters of similar-looking chairs. Thus, we can expect that even this naive approach, without part transformations, will produce plausible results in some cases.

Evaluation metrics

For the evaluation, we used the following metrics.


Mean Intersection over Union (mIoU) [20] is commonly used to evaluate the performance of segmentation algorithms. Here, it is used as a metric for the quality of the reconstruction. We computed the mIoU both for scaled and centered parts (when applicable) and for actual-sized reconstructed parts.
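A minimal sketch of per-part mIoU over labeled voxel grids, assuming label 0 denotes empty space and that labels absent from both volumes are skipped (both are assumptions about the labeling convention):

```python
import numpy as np

def mean_iou(pred, gt, num_labels):
    """Mean IoU between two labeled voxel grids, averaged over part
    labels. Label 0 is treated as empty space and excluded."""
    ious = []
    for label in range(1, num_labels + 1):
        p, g = pred == label, gt == label
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # label absent from both volumes
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```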


Connectivity

In part-based shape synthesis, one pathological issue is that parts are often disconnected from each other. Here, we benchmark the quality of part placement in terms of part connectivity. For each volume, we first dilate the occupied voxels, in order to allow for small gaps between parts. Then, we compute the frequency with which the shape forms a single connected component.
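The connectivity measure can be sketched with `scipy.ndimage`; the dilation radius used in the paper is unspecified, so one voxel is assumed here:

```python
import numpy as np
from scipy import ndimage

def is_single_component(volume, dilation=1):
    """True if the (dilated) occupancy grid forms a single connected
    component. The dilation tolerates small gaps between parts; a
    one-voxel radius is an assumption."""
    dilated = ndimage.binary_dilation(volume > 0, iterations=dilation)
    _, num_components = ndimage.label(dilated)
    return num_components == 1

def connectivity_score(volumes):
    """Frequency of shapes forming a single connected component."""
    return float(np.mean([is_single_component(v) for v in volumes]))
```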

Classification accuracy

We trained a binary neural classifier to estimate the quality of the assembly in terms of the "realistic-ness" of a given shape. Specifically, this classifier was trained to distinguish between ground-truth whole chairs (acting as positive examples) and chairs produced by naively placing random chair parts together (acting as negative examples). To construct negative examples, we used ground-truth chair parts from arbitrary 'source' chairs, adding each part type (e.g., legs) at most once, and placing each part at the same location and with the same orientation it had in the source chair from which it was extracted. In addition, we removed negative examples assembled from parts of geometrically and semantically similar chairs, since such part arrangements could produce plausible shapes incorrectly placed in the negative example set. The attained classification accuracy on the test set was . For a given set of chairs, we report the average classification score. Details of the network can be found in the supplementary material.


Symmetry score

The chair shapes in ShapeNet are predominantly bilaterally symmetric, with a vertical symmetry plane. Hence, similarly to [35], we evaluate the symmetry of the reconstructed shapes, and define the symmetry score as the percentage of matched voxels (filled or empty) between the reconstructed volume and the same volume reflected with respect to the symmetry plane. We perform this evaluation using binarized reconstruction results, effectively measuring the global symmetry of the shapes.
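A minimal sketch of this score; the choice of reflection axis depends on the grid orientation and is an assumption here:

```python
import numpy as np

def symmetry_score(volume, axis=0):
    """Fraction of matched voxels (filled or empty) between the
    binarized volume and its reflection about the symmetry plane.
    The reflection axis (axis=0) is an assumption."""
    binary = volume > 0
    reflected = np.flip(binary, axis=axis)
    return float((binary == reflected).mean())
```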

For evaluation, we used the shapes in the test set (690 shapes), and conducted three types of experiments: full shape encoding and reconstruction, full shape encoding with a single random part exchanged between a pair of random shapes, and shape composition by random part assembly. The experiments are described in more detail in Sections 4.1 and 4.2.

Evaluation results

According to the mIoU and per-part mIoU metrics, the proposed method outperforms all other variants and the baseline, except for the variant with the simple shape decoder. This follows from the fact that the proposed system, while better reconstructing fine geometric features, decomposes the problem into two inference problems, for the geometry and for the transformation, and thus does not reconstruct the original model as faithfully as the simple decoder does. On the other hand, as illustrated in Figure 10, this decomposition allows our method to perform better when constructing a shape from random parts. See the supplementary material for an additional qualitative comparison of the results of the proposed and the baseline methods.

In the connectivity test, our method outperformed all the baselines. This shows that the proposed network handles the problem of misplaced, disconnected parts better than the baseline methods. This is also verified by the qualitative comparison in Figure 10 and additional examples in the supplementary material. Specifically, the improved connectivity results, as compared to the architecture with a decoder without STN ("W/o STN"), illustrate the importance of using the STN to improve assembly quality. In the symmetry test, all methods, except for training without random part removal ("W/o part removal"), show comparable results. As expected, the naive placement achieves the highest symmetry score, equal to the score of the original test shapes, since it preserves their symmetry during shape assembly.

Interestingly, methods showing better mIoU or classification accuracy, such as the architecture without the STN or training without random part removal ("W/o part removal"), perform quite poorly on the connectivity benchmark and in the qualitative comparison, and the latter also on the symmetry benchmark. Although the proposed method does not achieve the best performance on the mIoU and classifier benchmarks, it is usually second best on those, while at the same time producing shapes with better connectivity and symmetry. Overall, the proposed method achieves good performance on all four benchmarks and superior qualitative results, justifying our design choices.

Figure 10: Results of reconstruction (top row) and shape assembly from random parts (bottom row) of different methods. Left to right: labeled ground truth shape (GT), our method (Our), fixed projection (Fixed proj.), decoder w/o STN (Dec. w/o STN), no cycle loss, no random part removal (No part rem.), no data augmentation with random affine transformations (No aff. augm.). Same random parts were used for all methods.

Affine transformation analysis

We performed a statistical analysis of the learned affine transformations. Please refer to the supplementary material for more details.

5 Conclusions and future work

We presented a Decomposer-Composer network for structure-aware 3D shape modeling. It generates a factorized latent shape representation, where the embedding coordinates of different semantic parts lie in separate linear subspaces. The subspace factorization allows us to perform shape manipulation via part embedding coordinates, exchange parts between shapes, and synthesize novel shapes by assembling them from random parts. Qualitative results show that the proposed system can generate high-fidelity 3D shapes and meaningful part manipulations. Quantitative results show that it is competitive on the mIoU, connectivity, symmetry and classification benchmarks.

While the proposed approach makes a step toward automatic shape-from-part assembly, it has several limitations. First, while we can generate high-fidelity shapes at a relatively low resolution, memory limitations do not allow us to work with higher-resolution voxelized shapes. Memory-efficient architectures, such as OctNet [29] and PointGrid [17], may help alleviate this constraint. Alternatively, using point-based shape representations and a compatible deep network architecture, such as [28], may also reduce the memory requirements and increase the output resolution. Secondly, we made the simplifying assumption that a plausible shape can be assembled from parts using per-part affine transformations, which represent only a subset of possible transformations. While this assumption simplifies the training, it is quite restrictive in terms of the deformations we can perform. In future work, we will investigate more general transformations with higher degrees of freedom, such as a 3D thin-plate spline or a general deformation field. Finally, we have been using a cross-entropy loss to measure the shape reconstruction quality; it would be interesting to investigate the use of a GAN-type loss in this structure-aware shape generation context.


L. Guibas acknowledges NSF grant CHS-1528025, a Vannevar Bush Faculty Fellowship, and gifts from the Adobe and Autodesk Corporations. A. Dubrovina and M. Shalah acknowledge the support in part by The Eric and Wendy Schmidt Postdoctoral Grant for Women in Mathematical and Computing Sciences.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
  • [3] J. Barnes, R. Klinger, and S. S. i. Walde. Projecting embeddings for domain adaption: Joint modeling of sentiment analysis in diverse domains. arXiv preprint arXiv:1806.04381, 2018.
  • [4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, 2001.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.
  • [8] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
  • [9] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
  • [10] R. Hu, Z. Yan, J. Zhang, O. van Kaick, A. Shamir, H. Zhang, and H. Huang. Predictive and generative neural networks for object functionality. In Computer Graphics Forum (Eurographics State-of-the-art report), volume 37, pages 603–624, 2018.
  • [11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
  • [12] E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG), 31(4):55, 2012.
  • [13] A. Kanazawa, S. Kovalsky, R. Basri, and D. Jacobs. Learning 3d deformation of animals from 2d images. In Computer Graphics Forum, volume 35, pages 365–374. Wiley Online Library, 2016.
  • [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [16] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 858–866. IEEE, 2018.
  • [17] T. Le and Y. Duan. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9204–9214, 2018.
  • [18] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. Grass: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36(4):52, 2017.
  • [19] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.
  • [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [21] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [22] C. Nash and C. K. Williams. The shape variational autoencoder: A deep generative model of part-segmented 3d objects. In Computer Graphics Forum, volume 36, pages 1–12. Wiley Online Library, 2017.
  • [23] A. Nguyen, M. Ben-Chen, K. Welnicka, Y. Ye, and L. Guibas. An optimization approach to improving collections of shape maps. In Computer Graphics Forum, volume 30, pages 1481–1491. Wiley Online Library, 2011.
  • [24] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using volumetric techniques. IEEE Transactions on Visualization and Computer Graphics, 9(2):191–205, 2003.
  • [25] C. Poelitz. Projection based transfer learning. In Workshops at ECML, 2014.
  • [26] D. V. Poerio and S. D. Brown. Dual-domain calibration transfer using orthogonal projection. Applied spectroscopy, 72(3):378–391, 2018.
  • [27] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2018.
  • [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [29] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 2017.
  • [30] L. K. Senel, I. Utlu, V. Yucesoy, A. Koc, and T. Cukur. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
  • [31] C.-H. Shen, H. Fu, K. Chen, and S.-M. Hu. Structure recovery by part assembly. ACM Transactions on Graphics (TOG), 31(6):180, 2012.
  • [32] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance, 2018.
  • [33] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
  • [34] F. Wang, Q. Huang, and L. J. Guibas. Image co-segmentation via consistent functional maps. In Proceedings of the IEEE International Conference on Computer Vision, pages 849–856, 2013.
  • [35] H. Wang, N. Schor, R. Hu, H. Huang, D. Cohen-Or, and H. Huang. Global-to-local generative model for 3d shapes. ACM Transactions on Graphics (Proc. SIGGRAPH ASIA), 37(6):214:1—214:10, 2018.
  • [36] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
  • [37] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [38] Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and H. Huang. Structure-aware generative network for 3d-shape modeling. arXiv preprint arXiv:1808.03981, 2018.
  • [39] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
  • [40] K. Xu, H. Zheng, H. Zhang, D. Cohen-Or, L. Liu, and Y. Xiong. Photo-inspired model-driven 3d object modeling. In ACM Transactions on Graphics (TOG), volume 30, page 80. ACM, 2011.
  • [41] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
  • [42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.