Learning Part Generation and Assembly for Structure-aware Shape Synthesis

by   Jun Li, et al.

Learning deep generative models for 3D shape synthesis is largely limited by the difficulty of generating plausible shapes with correct topology and reasonable geometry. Indeed, learning the distribution of plausible 3D shapes seems a daunting task for most existing, structure-oblivious shape representations, given the significant topological variations of 3D objects even within the same shape category. Based on the consensus from 3D shape analysis that shape structure is defined as part composition and mutual relations between parts, we propose to model 3D shape variations with a deep generative network that is both Part-Aware and Relation-Aware, named PARANet. The network is composed of an array of per-part VAE-GANs, which generate the semantic parts composing a complete shape, followed by a part assembly module that estimates a transformation for each part to correlate and assemble them into a plausible structure. By splitting the generation of part composition and part relations into separate networks, the difficulty of modeling structural variations of 3D shapes is greatly reduced. We demonstrate through extensive experiments that PARANet generates 3D shapes with plausible, diverse and detailed structure, and show two prototype applications: semantic shape segmentation and shape set evolution.



1 Introduction

Learning deep generative models is one of the most exciting and active research areas in deep learning. Following that trend, learning deep generative models for 3D shape synthesis has been increasingly studied lately. Despite the notable success of several recent works [40], one major difficulty that is seldom addressed is how to ensure the structural correctness of the generated shapes.

Existing generative models are mostly structure-oblivious. They tend to generate 3D shapes in a holistic manner, without explicitly comprehending their compositional parts. Consequently, when generating 3D shapes with detailed structures, the details of complicated part structures are often blurred or even messed up (see Figure 1, the right column). To alleviate this issue, one common practice is to increase the shape resolution to better capture fine-grained details, at the cost of increased learning time and more training examples. The main reason is that the 3D shape representations employed by those models, e.g., volumetric grids or point clouds, are oblivious to shape structure. With such representations, part information is not encoded and thus cannot be decoded during the generation process.

Figure 1: PARANet generates 3D shapes in a part- and relation-aware manner: It first generates the semantic parts (left) and then correlates them with proper assembly transformations, forming a structurally valid 3D shape (middle) in contrast to the results by holistic generation (right).

In 3D shape analysis, a common view of “structure” is the combination of the part composition of a shape and the relations between the parts [26]. Following this insight, we approach the modeling of structural variations of 3D shapes through learning a generative network that is both part-aware and relation-aware, named PARANet. The model is aware of what parts it is generating, through a semantic-part-wise generation process. On the other hand, the model should be able to preserve the mutual relations between the generated parts, according to some learned part relation priors. In particular, the relation refers to the way two adjacent semantic parts connect with each other in order to form a structurally valid 3D shape.

PARANet is composed of an array of part generators, each of which is a combination of variational auto-encoder (VAE) and generative adversarial network (GAN) trained for generating a specified semantic part of the target shape category, followed by a part assembly module that estimates a transformation for each part used to assemble them into a valid shape structure (Figure 1). In our work, PARANet is realized in the volumetric setting although it can be easily extended to support other basic representations such as point clouds. Being both part- and relation-aware, PARANet generates quality 3D shapes with detailed, plausible and diverse structure.

Our model splits the generation of parts and relations into two separate networks, thus greatly reducing the difficulty in modeling structural variations of 3D shapes. In our model, 3D structure variation is modeled by the concatenation of the latent vectors of all part generators, forming a structured latent space [10] of 3D shapes with different dimensions controlling the generation of different parts. It facilitates part-level user control or editing over the generated shapes. Moreover, the concatenated latent codes can be used as “shape DNA”, over which “genetic operations” (mutation and crossover) can be performed, to achieve shape set evolution [43]. Through mapping a 3D shape into the structured latent space, the part generators altogether can lead to a semantic segmentation of the shape.

Our main contributions include:

  • A divide-and-conquer framework for learning a deep generative model of structure-aware shape generation.

  • A part assembly module to relate the generated semantic parts with their assembly transformations.

  • Two prototype applications including semantic segmentation and set evolution of 3D shapes.

2 Related work

Modeling 3D shape variations.

The study of modeling 3D shape variability dates back to statistical learning of parametric models of faces [6] and bodies [2]. The task of modeling structural variation of 3D shapes of man-made objects is much harder. Most existing works learn one or more parametric templates of part arrangement from collections of training shapes [29, 22, 13]. These methods often require part correspondence of the training shapes. Probabilistic graphical models can be used to model shape variability as the causal relations between shape parts [21]. Pre-segmented and part-labeled shapes are required for learning such models. Entering the era of deep learning, deep generative models have been utilized to learn the shape space of 3D objects in an unsupervised manner.

Deep generative models of 3D shapes.

Deep generative models for 3D shape generation have been developed based on various 3D representations, such as volumetric grids [40, 16, 39, 31], point clouds [12, 1], surface meshes [17, 38], implicit functions [11, 30], and multi-view images [33]. Common to these works is that shape variability is modeled in a holistic, structure-oblivious fashion, which is mainly due to the limited options of deep-learning-friendly 3D shape representations.

Figure 2: Left: Training the part-wise generative network (a VAE-GAN for each semantic part). Since parts are generated independently, there could be mismatching and disconnection (see the overlaid volumes). Right: The part assembly module is trained to regress a transformation (scaling + translation) for each part, which assembles the parts into a plausible shape.

Structure-aware 3D shape synthesis.

Since the seminal work of “Modeling by Example” [14], considerable research effort has been devoted to part-based, data-driven 3D shape synthesis (e.g. [9, 21, 43, 4]). A comprehensive survey is available [42]. Part-based methods are inherently structure-aware: Shapes are generated in parts and part relations are preserved to form a valid structure. In the traditional approaches, however, parts are retrieved from a shape database (e.g. part suggestion [9, 36, 35]) instead of being generated from scratch. Meanwhile, part assembly relies on part correspondence [43, 4] or part labeling [9, 21].

Apart from the traditional approaches, research on deep generative models for structure-aware shape synthesis has recently gained increasing attention. Huang et al. [19] propose a deep generative model based on part-based templates learned a priori, which is not end-to-end trainable. Li et al. [24] propose the first deep generative model of 3D shape structures. They employ a recursive neural network to achieve hierarchical encoding and decoding of parts and relations. However, this model does not explicitly ensure a quality part assembly as in our method; see Section 5 for a comparison. Zou et al. [50] propose to learn sequential part generation with recurrent neural networks, which, however, produces only cuboids but no detailed geometry.

Nash and Williams [28] propose ShapeVAE to generate part-segmented 3D objects. Their model is trained using shapes with dense point correspondence. In contrast, our model requires only part-level correspondence. Wu et al. [41] couple the synthesis of intra-part geometry and inter-part structure. A similar idea is proposed in [5], where landmark-based structure priors are used for structure-aware shape generation. Wang et al. [37] propose to generate 3D shapes with part labeling using a carefully designed GAN, and then pass the shape to a pre-trained part refiner to obtain a higher quality shape volume. Our method takes the reverse process: we first generate parts and then their assembling transformations. In a concurrent work, Dubrovina et al. [Dubrovina2019] utilize a decomposer-composer network to learn a factorized shape embedding space for 3D shape modeling. However, their network is not a generative model and is trained with a reconstruction loss, so it can only generate new shapes by exchanging existing parts between different models.

Structure-aware 3D shape deformation.

Deformation is another common way of generating shape variations. In handling man-made shapes, structure-aware deformation has been a central research goal [15, 48, 7]. Existing deep models for 3D shape deformation have so far mainly focused on free-form deformation [46, 23, 25, 20], which is not designed for global structure preservation. The part composition network in [49] performs structure-preserving deformation at the substructure level. Our part assembly module achieves multi-part joint deformation, through learning inter-part assembling transformations.

3 Method

3.1 Network architecture

Our network architecture (Figure 2) is straightforward. It is composed of two modules: a part-wise generative network and a part assembly module. The part-wise generative network contains part generators, each for one of the predefined semantic part labels (e.g., back, seat, leg and armrest for a chair). Each part generator is trained to generate a volume of a specific part from a random vector. Taking the generated volumes for all the parts as input, the part assembly module predicts a transformation (scaling + translation) for each part, to assemble the parts into a complete shape with proper part scaling and inter-part connection.

3.2 Part-wise generative network

The part-wise generative network is simply a collection of part generators. For each semantic part, we train a generative network of 3D volumes, which is a combination of variational auto-encoder (VAE) and generative adversarial network (GAN), or VAE-GAN. The VAE part comprises an encoder and a decoder of 3D volumes with a resolution of . The dimension of the latent vector is . Similar to [40], the encoder consists of five volumetric fully convolutional layers with a kernel size of and a stride of . Batch normalization and ReLU layers are inserted between convolutional layers. The decoder / generator simply reverses the encoder, except that a nonlinearity is used in the last layer. Following the decoder, the encoder architecture is reused in learning a discriminator that tells whether a given part volume is real (voxelization of a real shape part) or fake (generated by the generator).

Therefore, the loss function for a part generator consists of three terms: a part volume reconstruction loss, a Kullback-Leibler divergence loss and an adversarial loss. In addition, we introduce a reflective symmetry loss to penalize generating asymmetric parts. This loss helps regularize the part generation, since most parts are reflectively symmetric. For asymmetric parts, the weight of this loss is set to zero. In summary, the loss defined in Equation (1) combines these four terms: the reconstruction loss is the mean-square error (MSE) between the input volume and the output volume; the reflective symmetry loss is the MSE between a volume and its reflection about the reflection plane of the input shape; and a Kronecker delta function indicates whether the part shares reflective symmetry with the full shape. In training, we detect reflective symmetry using the method in [27] for a training shape and its semantic parts, to evaluate this indicator for each part.
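
A form of Equation (1) consistent with the description above can be sketched as follows. The symbol names, the choice of which volume enters the symmetry term, and in particular which terms carry the two weights ($\alpha_1$, $\alpha_2$) are assumptions made for illustration. The text only establishes that the loss combines reconstruction, KL, adversarial and symmetry terms, with a Kronecker delta gating the symmetry term.

```latex
% Hypothetical reconstruction: symbol names and weight placement are assumptions.
\mathcal{L}_{\mathrm{part}}
  = \mathcal{L}_{\mathrm{recon}}
  + \alpha_1 \, \mathcal{L}_{\mathrm{KL}}
  + \alpha_2 \, \mathcal{L}_{\mathrm{adv}}
  + \delta_{\mathrm{ref}} \, \mathcal{L}_{\mathrm{ref}},
\qquad
\mathcal{L}_{\mathrm{recon}} = \mathrm{MSE}(x, x'),
\quad
\mathcal{L}_{\mathrm{ref}} = \mathrm{MSE}\big(x', \mathrm{ref}(x')\big),
\quad
\delta_{\mathrm{ref}} \in \{0, 1\}
```

Here $x$ is the input part volume, $x'$ the output volume, and $\mathrm{ref}(\cdot)$ the reflection about the symmetry plane of the input shape.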

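For the adversarial training, we follow WGAN-GP [18], which improves Wasserstein GAN [3] with a gradient penalty, to train our generative model:

$$L_{adv} = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big] \qquad (2)$$

where $D$ is the discriminator, and $P_g$ and $P_r$ are the distributions of generated part volumes and real part volumes, respectively. The last term is the gradient penalty, and $\hat{x}$ is sampled uniformly along straight lines between pairs of points from the data distribution $P_r$ and the generator distribution $P_g$. The discriminator attempts to minimize $L_{adv}$ while the generator maximizes the first term in Equation (2).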

3.3 Part assembly module

Since the part volumes are generated independently, their scales may not match each other and their positions may leave adjacent parts disconnected. Taking the part volumes generated by the part-wise generative network, the part assembly module regresses a transformation, including a scaling and a translation, for each part. It relates different semantic parts, with proper resizing and repositioning, to assemble them into a valid and complete shape volume. Essentially, it learns the spatial relations between semantic parts (or part arrangements [47]) in terms of relative size and position, as well as mutual connection between different parts.

Figure 3: Reasonable assembly results can be produced with different parts serving as the anchor (highlighted in red).

The part assembly module takes part volumes as input, which amount to a input tensor. The input tensor is passed through five volumetric fully convolutional layers of kernel sizes with a stride of . Similar to the part encoders, batch normalization and ReLU layers are used between convolutional layers. In the last layer, a sigmoid layer is added to regress the scaling and translation parameters. To ease the training, we normalize all scaling and translation parameters into , based on the allowed range of scaling () and translation (, with the unit being voxel size). The actual values of the scaling and translation parameters are recovered when they are applied.
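
The normalization and recovery of transformation parameters can be sketched as below. This is a minimal illustration, assuming the normalized interval is [0, 1] (the output range of a sigmoid). The concrete ranges `SCALE_RANGE` and `SHIFT_RANGE` are hypothetical placeholders, not the values used in the paper.

```python
# Hypothetical allowed ranges; the actual values are not given here.
SCALE_RANGE = (0.5, 1.5)     # assumed scaling factors
SHIFT_RANGE = (-10.0, 10.0)  # assumed translations, in voxels


def normalize(value, lo, hi):
    """Map a raw parameter in [lo, hi] to the unit interval [0, 1]."""
    return (value - lo) / (hi - lo)


def denormalize(unit, lo, hi):
    """Recover the raw parameter from a sigmoid-style output in [0, 1]."""
    return lo + unit * (hi - lo)


# Regression targets are built with normalize(); network outputs are
# mapped back with denormalize() before being applied to a part.
target = normalize(1.2, *SCALE_RANGE)
recovered = denormalize(target, *SCALE_RANGE)
```

Keeping all targets in the same unit interval lets a single sigmoid output layer regress both scaling and translation.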

Figure 4: Part-aware 3D shape generation. Given a random vector, our network splits it into several sections and passes them to the respective part decoders, yielding a set of semantic parts. The parts are then assembled together based on the predicted per-part transformations.

Anchored transformation.

Given part volumes, the transformation assembling them together is not unique. Taking the chair model in Figure 3 as an example, the chair seat can be stretched to match the back, while the back can also be shrunk to conform to the seat; both result in a valid shape structure. This also adds some diversity to the generation. To make the transformation estimation well determined and the assembly network easier to train, we introduce an extra input to the part assembly module to indicate an anchor part. When estimating part transformations, the anchor part is kept fixed (with an identity transformation) while all the other parts are transformed to match the anchor. To do this, one option is to input an indicator vector (a one-hot vector with the corresponding part being ). However, the dimension of this indicator vector is too small, making its information easily overwhelmed by the large tensor of part volumes. Therefore, we opt to infuse anchor information by setting the occupied voxels in the anchor part volume to , to strongly contrast against the ’s in the volumes of the free parts. At test time, the anchor part can be randomly selected or user-specified; see Figure 4.
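
The anchor encoding described above can be illustrated with a short NumPy sketch. The marker values `ANCHOR_VALUE` and `FREE_VALUE` and the function name `encode_parts` are hypothetical. The only property carried over from the text is that occupied voxels of the anchor part receive a value that contrasts with the occupied voxels of the free parts.

```python
import numpy as np

ANCHOR_VALUE = -1.0  # hypothetical marker for occupied anchor voxels
FREE_VALUE = 1.0     # hypothetical marker for occupied voxels of free parts


def encode_parts(part_volumes, anchor_index):
    """Stack K binary part volumes into a (K, R, R, R) tensor and mark the anchor.

    Empty voxels stay 0, free-part voxels become FREE_VALUE, and the
    occupied voxels of the anchor part become ANCHOR_VALUE.
    """
    stacked = np.stack(
        [(np.asarray(v) > 0).astype(np.float32) * FREE_VALUE for v in part_volumes]
    )
    anchor = stacked[anchor_index]       # a view into the stacked tensor
    anchor[anchor > 0] = ANCHOR_VALUE    # in-place update of the anchor channel
    return stacked
```

Compared with a separate one-hot indicator, this places the anchor signal directly in the volume tensor that the convolutional layers consume.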

3.4 Training details

We train and test our part-wise VAE-GANs and part assembly module on a subset of ShapeNet [8]. This subset, proposed in [44], provides consistent alignment and semantic labeling for all shapes. We select four representative categories exhibiting rich part structure variation: chairs (3746), airplanes (2690), lamps (1546) and motorbikes (202). In the dataset, each object category has a fixed number of semantic parts: a chair contains a back, a seat, a leg and an armrest; an airplane consists of a body, a wing, a tail and an engine; a lamp has a base, a shade, and a tube; a motorbike is composed of a light, a gas tank, a seat, a body, a wheel and a handle. Note that a shape may not contain all semantic parts belonging to the corresponding category. The dataset is divided into two parts, according to the official training/test split, to train and test our part-wise generative network and part assembly module. To enhance the training set, we employ the structure-aware deformation technique in [48] to deform each shape, generating about variations of the shape. Finally, each shape and its semantic parts are voxelized to form our training set.

The part-wise VAE-GANs are trained with part volumes. We augment the dataset of part volumes via randomly scaling and translating the parts, with the ranges in for scaling and (in voxels) for translation. To train the part assembly module, we generate a large set of training pairs of messed-up part arrangement and ground-truth assembly, with a randomly selected anchor part. The messed-up arrangements are generated by randomly scaling and translating the semantic parts of a shape. The inverses of the messing-up transformations are used as ground-truth assembling transformations. Besides that, we also introduce some random noise to the training part volumes, to accommodate the imperfect volume generation during testing.
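
The construction of a ground-truth assembling transformation from a random perturbation can be sketched as follows. This is a minimal illustration, assuming a per-axis scaling `s` about the origin followed by a translation `t`, so that a point `p` maps to `s * p + t`. The function names and the sampling ranges are hypothetical.

```python
import random


def random_transform(scale_range, shift_range, rng):
    """Sample a per-axis scaling and translation (the 'messing-up' transform)."""
    s = [rng.uniform(*scale_range) for _ in range(3)]
    t = [rng.uniform(*shift_range) for _ in range(3)]
    return s, t


def invert(s, t):
    """Inverse of p -> s * p + t, i.e. p' -> (1 / s) * p' - t / s."""
    inv_s = [1.0 / si for si in s]
    inv_t = [-ti / si for si, ti in zip(s, t)]
    return inv_s, inv_t


def apply_transform(s, t, p):
    """Apply the per-axis scaling and translation to a 3D point."""
    return [si * pi + ti for si, ti, pi in zip(s, t, p)]


rng = random.Random(0)
mess_s, mess_t = random_transform((0.5, 1.5), (-5.0, 5.0), rng)  # assumed ranges
gt_s, gt_t = invert(mess_s, mess_t)  # regression target for this part
```

Under this convention, the anchor part is left unperturbed, so its target is the identity transformation.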

As Wasserstein GANs usually have large gradients, which might result in unstable training, we opt to first pre-train the VAEs and then fine-tune them via joint training with the discriminators. For both the part-wise VAE-GANs and the part assembly module, we set the initial learning rate to , and use ADAM () for network optimization. Batch size is set to . For the parameters in the loss computation in Equation (1), we use and for all experiments. The gradient penalty weight is set as in [18].

Note that the part generation and assembly networks are not trained jointly, in an end-to-end fashion, since there is no ground-truth assembly for the parts generated by VAE (random generation). It is, however, possible to make the whole pipeline end-to-end trainable if a discriminator network could be devised to judge whether the final assembled shape is reasonable or not. We leave this for future work.

4 Results and Evaluations

w/ sym. loss w/o sym. loss
Chair back
leg (sym.)
leg (asym.)
Airplane body
Table 1: Comparing average symmetry measure over generated shapes between our method (with and w/o symmetry loss) and 3D-GAN [40], G2L [37], on two shape categories. For each category, we report the measure for both full shape and semantic parts. Note how our method discriminates between reflectively symmetric and asymmetric legs of chairs.
Category    Anchor part   One-hot vector   Ours   Ours (training data)
Chair       seat          0.77             0.83   0.89
Chair       back          0.78             0.81   0.91
Chair       leg           0.79             0.82   0.90
Chair       armrest       0.72             0.79   0.88
Plane       body          0.76             0.80   0.89
Plane       wing          0.75             0.82   0.87
Plane       tail          0.73             0.76   0.90
Plane       engine        0.74             0.82   0.88
Motorbike   light         0.72             0.80   0.87
Motorbike   gas tank      0.76             0.79   0.92
Motorbike   seat          0.74             0.82   0.86
Motorbike   handle        0.73             0.79   0.86
Motorbike   wheel         0.71             0.78   0.85
Motorbike   body          0.76             0.82   0.87
Lamp        base          0.79             0.81   0.91
Lamp        shade         0.77             0.83   0.89
Lamp        tube          0.70             0.77   0.81

Template-based (one value per category): Chair 0.60, Plane 0.65, Motorbike 0.56, Lamp 0.52

Table 2: Evaluation of part assembly through comparing to two baselines (template-based and one-hot vector).

Part-wise generation.

Symmetry preservation is especially useful for generating man-made shapes. Through imposing reflective symmetry regularization on those parts which are reflectively symmetric, our model is able to produce structurally more plausible shapes. To evaluate symmetry preservation in shape generation, we define a symmetry measure for generated shapes. Given a generated shape volume, the reflective plane is the vertical bisector plane of the volume, since all training data were globally aligned and centered before voxelization. The symmetry measure can be obtained simply by reflecting the left half of the shape volume and computing the IoU against the right half.
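
The symmetry measure can be sketched in a few lines of NumPy. This is an illustrative version, assuming the left-right direction is array axis 0 and that an empty volume counts as perfectly symmetric. Both choices, and the function name, are assumptions.

```python
import numpy as np


def symmetry_measure(volume, axis=0):
    """IoU between the mirrored left half and the right half of a voxel volume."""
    occ = np.asarray(volume) > 0
    n = occ.shape[axis]
    half = n // 2  # for odd sizes the central slice is ignored
    left = np.take(occ, np.arange(half), axis=axis)
    right = np.take(occ, np.arange(n - half, n), axis=axis)
    mirrored = np.flip(left, axis=axis)  # reflect about the bisector plane
    union = np.logical_or(mirrored, right).sum()
    if union == 0:
        return 1.0  # assumed convention for an empty volume
    return float(np.logical_and(mirrored, right).sum() / union)
```

A value of 1 indicates that the two halves are exact mirror images at voxel resolution, and lower values indicate increasing asymmetry.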

Table 1 shows the average symmetry measures on randomly generated shapes by our method. The results are reported both for full shape and individual semantic parts. We also compare to a baseline model trained without symmetry loss, as well as the 3D-GAN model proposed in [40] and G2L [37]. In G2L [37], they trained an extra part refinement network via minimizing the average reconstruction loss against three nearest neighbors retrieved from the training set. While achieving a higher symmetry score, such refinement also limited the diversity of the generated shapes, which is indicated by the lower inception score than ours in Table 3.

An interesting feature of our part generators is that they learn when to impose the symmetry constraint on the generated parts, judging from the input random vectors. This is due to the discriminative treatment of reflectively symmetric and asymmetric parts during the generator training (Equation (1)). Taking the leg part generator for example, if a random vector of a four-leg chair is input, the leg generator will preserve the reflective symmetry in the leg part. If, on the other hand, the input random vector implies that a swivel chair would be generated, the symmetry preservation will be automatically disabled, since the leg part of a swivel chair is mostly not reflectively symmetric in reality. This is reflected in the average symmetry measures of symmetric and asymmetric legs in Table 1.

Figure 5: Plots of assembly quality measure (average IoU w.r.t. ground-truth) over varying amount of translation (left; with scaling being fixed to ) and scaling (right; with translation being fixed to ).

Part assembly.

To evaluate the ability of our part assembly module, we test it on the testing set. For each test shape, we perturb each of its semantic parts with random scaling and translation, and use our network to regress the transformation. The assembly quality is measured by the IoU between the assembled shape volume and the ground-truth. In testing, we choose each semantic part as anchor and report the average IoU as the assembly quality. In Table 2, we compare the assembly performance of three methods. The first is our method. The second is our method in which the anchor part is indicated by a one-hot vector. The third is a template-based part assembly where we retrieve a template shape from the training set based on part-wise CNN features. We then transform the shape parts according to the corresponding parts in the template, since part correspondence is available for all shapes in the training set and the generated shapes. For contrast, we also show the performance on the training shapes (the last row).

The results show that our part assembly generalizes well to unseen shapes. The numbers reported in Table 2 are under messy part arrangement with random scale from and random translation from . Figure 5 plots the assembly quality measure over varying amount of translation and scaling. Our method obtains reasonably good assembly results within the range of for translation and for scaling. Note, however, that the goal of our part assembly module is not to reconstruct an input shape. In fact, there is no unique solution to structurally plausible part assembly. Therefore, this experiment only approximately evaluates the assembly ability of our model.

Figure 6: Results of random shape generation on four categories. All generated shapes have semantic segmentation.
Chair Airplane Motorbike Lamp
3DGAN [40]
GRASS [24]
G2L [37]
Table 3: Comparing diversity (inception score) of random shape generation with three state-of-the-art methods.

Random shape generation.

Figure 6 shows a few examples of random generation for all four shape categories. For each shape, both the generated part volumes (overlaid) and the final assembling result are shown. A nice feature of our method is that the generated shapes all possess semantic segmentation by construction, which can be used in training data enhancement for shape segmentation. More generation results can be found in the supplemental material.

In Table 3, we compare the diversity of random generation by our method and three alternatives including 3DGAN [40], GRASS [24] and G2L [37]. Similar to [37], we use the inception score [32] to measure the diversity of shape sets. In particular, we first cluster the training shapes and then train a classifier targeting the clusters. The inception score for a given set of shapes is then measured based on the confidence and variance of the classification over the set. From the results, our method achieves consistently more diverse generation than alternatives, thanks to the part-wise shape variation modeling.
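
For reference, the standard inception score can be computed from classifier outputs as sketched below. This is a generic version, assuming `probs` holds one row of cluster probabilities per generated shape. The clustering step and the classifier itself are not shown, and whether the evaluation here uses exactly this form is an assumption.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: array of shape (num_shapes, num_clusters); each row sums to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y) over the whole set
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score is high when each shape is classified confidently (low-entropy rows) while the set as a whole spreads over many clusters (high-entropy marginal). Its value ranges from 1 up to the number of clusters.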

Figure 7: Examples of high resolution () generation (top; the reconstructed surface mesh is shown for each shape) and point cloud generation (bottom).

High-res. and point cloud generation.

The split of part synthesis and part assembly in our approach well supports high-resolution shape generation. We first synthesize each part in very high resolution, within a local volume around the part. The synthesized parts are then placed into a low-resolution global volume, in which the part assembly module estimates an assembling transformation for each part. The transformed parts are then unified in a high-resolution volume, resulting in a high-res 3D model. Figure 7(top) shows four examples of high-res shape generation. Through implementing the part generators with point cloud representation [1], PARANet can also support 3D point cloud generation; see Figure 7(bottom).

Comparison with GRASS [24].

Figure 8 shows a visual comparison to GRASS. Although GRASS can recover part relations (adjacency and symmetry) in the generated shapes, it does not explicitly learn how adjacent parts are connected. Therefore, parts generated by GRASS can sometimes mistakenly detach. In contrast, PARANet leads to better part connection thanks to the learned part assembly prior. To quantitatively evaluate part assembly quality, we propose two measures, one objective and one subjective. The objective measure simply examines the voxel connectivity of a generated shape volume. In the subjective evaluation, we recruited five human participants to visually inspect and vote for the correctness of part connections of a generated shape. For both measures, we compute the average success rate of part assembly for each shape category. Table 4 compares the success rate of the two methods over randomly generated shapes for each category.
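
The objective measure can be illustrated with a simple connectivity check on a voxel volume. This sketch assumes 6-connectivity (face-adjacent voxels) and treats a shape as successfully assembled when all occupied voxels form a single connected component. Both the neighborhood and the function names are assumptions.

```python
from collections import deque

import numpy as np

# Face neighbors only (6-connectivity); an assumed choice.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def count_components(volume):
    """Count connected components of occupied voxels with breadth-first search."""
    occ = np.asarray(volume) > 0
    seen = np.zeros(occ.shape, dtype=bool)
    count = 0
    for start in zip(*np.nonzero(occ)):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in OFFSETS:
                nb = (x + dx, y + dy, z + dz)
                inside = all(0 <= c < s for c, s in zip(nb, occ.shape))
                if inside and occ[nb] and not seen[nb]:
                    seen[nb] = True
                    queue.append(nb)
    return count


def is_connected(volume):
    """True when the occupied voxels form exactly one component."""
    return count_components(volume) == 1
```

The success rate for a category would then be the fraction of generated shapes for which `is_connected` returns True.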

Figure 8: Comparing part connection with GRASS [24].
Chair Airplane Motorbike Lamp
Obj. Subj. Obj. Subj. Obj. Subj. Obj. Subj.
Table 4: Comparing PARANet and GRASS [24] on objective and subjective success rate of part assembly (in ).

Shape interpolation.

Through breaking down 3D shape generation into part-wise generation and part assembly inference, our method is able to model significant variation of 3D shape structures. This can be demonstrated by shape interpolation between shapes with significantly different structures. Figure 9 shows two such examples. Our method can generate high quality in-between shapes with detailed shape structures, although the source and target shapes have considerably different structures. More interpolation results can be found in the supplemental material.

Arithmetic in latent space.

Due to the part-aware representation in our model, the latent space for full shapes is structured by construction. Therefore, our model naturally supports arithmetic modeling of 3D shapes at the granularity of semantic parts. Figure 10 shows two examples of arithmetic modeling. Again, the final shapes possess detailed shape structures due to part-aware generation. Meanwhile, the overall structures look plausible although the parts are originally from different shapes.

Figure 9: Shape interpolation in latent space.
Figure 10: Arithmetic operations in latent space.

5 Applications

Shape set evolution.

In [43], a bio-inspired approach to batch generation of 3D shapes is introduced: An initial population of 3D shapes is evolved to produce generations of novel shapes. The core technique supporting 3D shape evolution is the realization of two basic “genetic operators”, mutation and crossover, which are key to maintaining shape diversity from one generation to the next. In [43], shape mutation and crossover are realized as direct part alteration and recombination. There is no notion of “shape DNA”, where genetic operators are performed on some “shape chromosome”, and new shapes are recovered from the resultant chromosomes analogous to genetic expression.

With the part-aware shape representation in our model, genetic operators on shape chromosomes (latent codes) can be easily defined. Specifically, mutation can be realized by randomly altering the values of some random dimensions of the latent code. Crossover can be achieved by recombining the code pieces corresponding to semantic parts. See Figure 11 for two examples of crossover operation. After genetic operations, new shapes can be recovered by our learned decoder and part assembler. To some extent, our part-aware latent codes can be viewed as a part-level “shape DNA”, which can be used in shape set evolution. Figure 12 shows two generations of shape set evolution starting from an initial population of chair models. See more evolution results in the supplemental material.
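
The two genetic operators can be sketched on concatenated latent codes as follows. The per-part latent dimension `PART_DIM`, the part ordering in `PARTS`, and the Gaussian perturbation used for mutation are hypothetical choices made for illustration. Decoding and assembly of the resulting codes are not shown.

```python
import numpy as np

PART_DIM = 8  # hypothetical latent dimension per part
PARTS = ["back", "seat", "leg", "armrest"]  # chair parts, in an assumed order


def part_slice(name):
    """Slice of the concatenated code that belongs to one semantic part."""
    i = PARTS.index(name)
    return slice(i * PART_DIM, (i + 1) * PART_DIM)


def crossover(code_a, code_b, parts_from_b):
    """Child takes the listed part codes from b and the remaining ones from a."""
    child = code_a.copy()
    for name in parts_from_b:
        child[part_slice(name)] = code_b[part_slice(name)]
    return child


def mutate(code, rng, rate=0.1, sigma=0.5):
    """Perturb a random subset of dimensions with Gaussian noise."""
    mask = rng.random(code.shape) < rate
    return code + mask * rng.normal(0.0, sigma, size=code.shape)


rng = np.random.default_rng(0)
parent_a = rng.normal(size=PART_DIM * len(PARTS))
parent_b = rng.normal(size=PART_DIM * len(PARTS))
child = mutate(crossover(parent_a, parent_b, ["leg"]), rng)
```

Because each part occupies a fixed slice of the code, crossover amounts to a slice copy and leaves the codes of the other parts untouched.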

Figure 11: 3D shape crossover at the granularity of semantic part, enabled by our structured latent space.
Figure 12: Given an initial population of shapes from the testing set (top), we perform set evolution through random crossover and mutation of latent codes, leading to two generations of novel shapes (bottom).

Shape segmentation.

The part-wise generation of our model can also be used to segment a 3D shape. The network is shown in Figure 13 (top). Given a 3D shape in volumetric representation, we first project it into the latent space of PARANet using a trained shape projection network, which encodes the shape volume with five volumetric convolutional layers and projects it to a latent code. Then, our pre-trained PARANet is used to reconstruct a 3D shape (with semantic segmentation). The projection network is trained by minimizing the reconstruction loss against the input shape volume, while keeping the PARANet part fixed. During testing, passing a 3D shape volume into the network results in a reconstructed 3D volume with semantic segmentation. Since the recovered 3D volume is geometrically close to the input shape volume (due to the self-reconstruction training), we can accurately transfer its voxel labels onto the input volume, thus obtaining a semantic segmentation for the input.
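
The final label transfer step can be sketched as a nearest-neighbor lookup between the two volumes. This is one plausible realization, not necessarily the one used in the paper. It assumes `labeled_volume` stores 0 for empty voxels and a positive integer part label otherwise, and it uses a brute-force distance matrix that is only practical for small volumes.

```python
import numpy as np


def transfer_labels(input_volume, labeled_volume):
    """Give each occupied input voxel the label of its nearest labeled voxel."""
    labeled_volume = np.asarray(labeled_volume)
    src = np.argwhere(labeled_volume > 0)          # labeled voxel coordinates
    dst = np.argwhere(np.asarray(input_volume) > 0)  # voxels to be labeled
    out = np.zeros(np.shape(input_volume), dtype=labeled_volume.dtype)
    if len(src) == 0 or len(dst) == 0:
        return out
    # Squared Euclidean distances between every input and labeled voxel.
    d2 = ((dst[:, None, :] - src[None, :, :]) ** 2).sum(axis=2)
    nearest = src[d2.argmin(axis=1)]
    out[tuple(dst.T)] = labeled_volume[tuple(nearest.T)]
    return out
```

For larger resolutions, a spatial index such as a k-d tree would replace the dense distance matrix.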

Figure 13: Top: The network for using PARANet in 3D shape segmentation. Bottom: A few segmentation results. For each example, the recovered shape used for semantic label transfer is shown at the top-left.
Table 5: A quantitative comparison of shape segmentation on four categories (Chair, Airplane, Motorbike, Lamp) against PointNet [34], SyncSpecCNN [45], and O-CNN [39].

Figure 13 shows a few examples of such segmentation on the testing set. Essentially, we learn a deep model of segmentation transfer. The model integrates a pre-learned part-aware shape manifold with a shape-to-manifold projector: to segment a shape, the projector retrieves its nearest neighbor on the manifold and generates a semantic segmentation for the retrieved shape, which can be easily transferred to the input due to the resemblance between the two shapes. Table 5 reports quantitative segmentation results in comparison with a few state-of-the-art methods. It shows that our method achieves performance comparable to the state of the art. Extended results of shape segmentation can be found in the supplemental material.

6 Conclusion

We have proposed a simple and effective generative model for high-quality 3D shape generation. The model knows what it generates (semantic parts) and how the generated parts correlate with each other (through the assembling transformations). This makes the generation part-aware and structure-revealing. Our model adopts a divide-and-conquer scheme and thus greatly reduces the difficulty of modeling full shape variations. There are two main limitations. First, our model relies on a hard-coded split into semantic part generators, which does not adapt to a different label set. Learning a structure-aware generative model with a built-in shape decomposition module is an interesting future direction. Second, our method currently works with major semantic parts; although it can be extended to synthesize and assemble more fine-grained parts, too many parts would increase the difficulty of part assembly. This could be alleviated with the help of a hierarchical part organization as in [49].


  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2018.
  • [2] B. Allen, B. Curless, and Z. Popović. The space of human body shapes: Reconstruction and parameterization from range scans. ACM Trans. Graph., 22(3), 2003.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [4] M. Averkiou, V. G. Kim, Y. Zheng, and N. J. Mitra. Shapesynth: Parameterizing model collections for coupled shape exploration and synthesis. Computer Graphics Forum, 33(2):125–134, 2014.
  • [5] E. Balashova, V. Singh, J. Wang, B. Teixeira, T. Chen, and T. Funkhouser. Structure-aware shape synthesis. In 3D Vision (3DV), 2018.
  • [6] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. of SIGGRAPH, pages 187–194, 1999.
  • [7] M. Bokeloh, M. Wand, V. Koltun, and H.-P. Seidel. Pattern-aware shape deformation using sliding dockers. ACM Trans. on Graph. (SIGGRAPH Asia), 30(6):123, 2011.
  • [8] A. X. Chang, T. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. (arXiv:1512.03012 [cs.GR]).
  • [9] S. Chaudhuri, E. Kalogerakis, L. Guibas, and V. Koltun. Probabilistic reasoning for assembly-based 3d modeling. ACM Trans. on Graph. (SIGGRAPH), 30(4):35:1–35:10, 2011.
  • [10] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proc. NIPS, pages 2172–2180, 2016.
  • [11] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In CVPR, 2019.
  • [12] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. arXiv preprint arXiv:1612.00603, 2016.
  • [13] N. Fish, M. Averkiou, O. Van Kaick, O. Sorkine-Hornung, D. Cohen-Or, and N. J. Mitra. Meta-representation of shape families. ACM Transactions on Graphics (TOG), 33(4):34, 2014.
  • [14] T. Funkhouser, M. Kazhdan, P. Shilane, P. Min, W. Kiefer, A. Tal, S. Rusinkiewicz, and D. Dobkin. Modeling by example. ACM Transactions on Graphics (Proc. SIGGRAPH), Aug. 2004.
  • [15] R. Gal, O. Sorkine, N. J. Mitra, and D. Cohen-Or. iwires: an analyze-and-edit approach to shape manipulation. ACM Trans. on Graph. (SIGGRAPH), 28(3):33, 2009.
  • [16] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
  • [17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3d surface generation. In CVPR, pages 216–224, 2018.
  • [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NIPS, pages 5769–5779, 2017.
  • [19] H. Huang, E. Kalogerakis, and B. Marlin. Analysis and synthesis of 3d shape families via deep-learned generative models of surfaces. Computer Graphics Forum, 34(5), 2015.
  • [20] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson. Learning free-form deformations for 3d object reconstruction. arXiv preprint arXiv:1803.10932, 2018.
  • [21] E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A Probabilistic Model of Component-Based Shape Synthesis. ACM Transactions on Graphics, 31(4), 2012.
  • [22] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (Proc. SIGGRAPH), 32(4), July 2013.
  • [23] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. 2018.
  • [24] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. Grass: Generative recursive autoencoders for shape structures. arXiv preprint arXiv:1705.02090, 2017.
  • [25] K. Li, T. Pham, H. Zhan, and I. Reid. Efficient dense point cloud object reconstruction using deformation vector fields. In ECCV, 2018.
  • [26] N. Mitra, M. Wand, H. R. Zhang, D. Cohen-Or, V. Kim, and Q.-X. Huang. Structure-aware shape processing. In SIGGRAPH Asia 2013 Courses, page 1. ACM, 2013.
  • [27] N. J. Mitra, L. J. Guibas, and M. Pauly. Partial and approximate symmetry detection for 3d geometry. ACM Trans. on Graph., 25(3):560–568, 2006.
  • [28] C. Nash and C. K. Williams. The shape variational autoencoder: A deep generative model of part-segmented 3d objects. Computer Graphics Forum (SGP 2017), 36(5):1–12, 2017.
  • [29] M. Ovsjanikov, W. Li, L. Guibas, and N. J. Mitra. Exploration of continuous variability in collections of 3d shapes. ACM Trans. on Graph. (SIGGRAPH), 30(4):33:1–33:10, 2011.
  • [30] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
  • [31] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proc. CVPR, volume 3, 2017.
  • [32] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Proc. NIPS, pages 2234–2242, 2016.
  • [33] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1511–1519, 2017.
  • [34] H. Su, C. R. Qi, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page to appear, 2017.
  • [35] M. Sung, A. Dubrovina, V. G. Kim, and L. Guibas. Learning fuzzy set representations of partial shapes on dual embedding spaces. Computer Graphics Forum, 37(5):71–81, 2018.
  • [36] M. Sung, H. Su, V. G. Kim, S. Chaudhuri, and L. Guibas. ComplementMe: Weakly-supervised component suggestions for 3D modeling. ACM Trans. on Graph. (SIGGRAPH Asia), 2017.
  • [37] H. Wang, N. Schor, R. Hu, H. Huang, D. Cohen-Or, and H. Huang. Global-to-local generative model for 3d shapes. ACM Transactions on Graphics (Proc. SIGGRAPH ASIA), 37(6):214:1–214:10, 2018.
  • [38] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, pages 52–67, 2018.
  • [39] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics (SIGGRAPH), 36(4), 2017.
  • [40] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [41] Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and H. Huang. Structure-aware generative network for 3d-shape modeling. arXiv preprint arXiv:1808.03981, 2018.
  • [42] K. Xu, V. G. Kim, Q. Huang, N. Mitra, and E. Kalogerakis. Data-driven shape analysis and processing. In SIGGRAPH ASIA 2016 Courses, page 4. ACM, 2016.
  • [43] K. Xu, H. Zhang, D. Cohen-Or, and B. Chen. Fit and diverse: set evolution for inspiring 3d shape galleries. ACM Transactions on Graphics (TOG), 31(4):57, 2012.
  • [44] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia, 2016.
  • [45] L. Yi, H. Su, X. Guo, and L. J. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In CVPR, pages 6584–6592, 2017.
  • [46] M. E. Yumer and N. J. Mitra. Learning semantic deformation flows with 3d convolutional networks. In European Conference on Computer Vision (ECCV 2016). Springer, 2016.
  • [47] Y. Zheng, X. Chen, M.-M. Cheng, K. Zhou, S.-M. Hu, and N. J. Mitra. Interactive images: Cuboid proxies for smart image manipulation. ACM Transactions on Graphics, 31(4):99:1–99:11, 2012.
  • [48] Y. Zheng, H. Fu, D. Cohen-Or, O. K.-C. Au, and C.-L. Tai. Component-wise controllers for structure-preserving shape manipulation. In Computer Graphics Forum, volume 30, pages 563–572. Wiley Online Library, 2011.
  • [49] C. Zhu, K. Xu, S. Chaudhuri, R. Yi, and H. Zhang. Scores: Shape composition with recursive substructure priors. ACM Transactions on Graphics (SIGGRAPH Asia 2018), 37(6):to appear, 2018.
  • [50] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.