Learning Generative Models of Shape Handles

04/06/2020 ∙ by Matheus Gadelha, et al.

We present a generative model to synthesize 3D shapes as sets of handles – lightweight proxies that approximate the original 3D shape – for applications in interactive editing, shape parsing, and building compact 3D representations. Our model can generate handle sets with varying cardinality and different types of handles (Figure 1). Key to our approach is a deep architecture that predicts both the parameters and existence of shape handles, and a novel similarity measure that can easily accommodate different types of handles, such as cuboids or sphere-meshes. We leverage recent advances in semantic 3D annotation as well as automatic shape summarization techniques to supervise our approach. We show that the resulting shape representations are intuitive and achieve higher quality than the previous state of the art. Finally, we demonstrate how our method can be used in applications such as interactive shape editing, completion, and interpolation, leveraging the latent space learned by our model to guide these tasks. Project page: http://mgadelha.me/shapehandles.




1 Introduction

Dramatic improvements in the quality of image generation have become a key driving force behind many novel image editing applications. Yet, similar approaches are lacking for editing and generating 3D shapes. There are two related challenges. First, learning generative models for 3D data is challenging because, unlike images, high-quality 3D data is hard to obtain, high dimensional, and often unstructured. Second, regardless of whether good generative models are available, manipulating and editing 3D shapes in interactive applications is harder for users than editing images. For this reason, the geometry processing community has developed techniques for representing 3D data as a small collection of simpler proxy shapes [4, 22, 2, 24, 7, 43, 17]. In this paper, we refer to these light-weight proxies as shape handles due to their ability to be easily manipulated by users. These representations have been widely used in tasks that require interaction and high-level reasoning in 3D environments, such as shape editing [12, 38], animation [37], grasping [27], and tracking [39].

We propose a generative model of shape handles. Our method adopts a two-branch network architecture to generate shapes with a varying number of handles, where one branch focuses on generating handles while the other predicts the existence of each handle (Section 3.2). Furthermore, we propose a novel similarity measure based on distance fields to compare shape handle pairs. This measure can be easily adapted to accommodate various types of handles, such as cuboids and sphere-meshes [38] (Section 3.1). Finally, in contrast to previous work [40, 32], which focuses on unsupervised methods, we leverage recent work on collecting 3D annotations [29] as well as shape summarization techniques [38] to provide supervision to our approach. Experiments show that our method significantly outperforms previous methods on shape parsing and generation tasks. Using self-supervised training data generated by [38], our approach produces shapes that are up to twice as accurate as competing approaches in terms of the intersection-over-union (IoU) metric. By employing human annotated data, our model can be further improved, achieving even higher accuracy than with self-supervised training data. Moreover, as shape handles provide a compact representation, our generative networks are compact (less than 10MB). Despite the small memory footprint, our method generates a diverse set of high quality 3D shapes, as seen in Figure 1.

Finally, our method is built towards generating shapes using representations that are amenable to manipulation by users. In contrast to point clouds and other 3D representations such as occupancy grids, handles are intuitive to modify and naturally suitable for editing and animation tasks. The latent space of shape handles induced by the learned generative model can be leveraged to support shape editing, completion, and interpolation tasks, as depicted in Figure 2.

2 Related work

Deep generative models of 3D shapes.

Multiple 3D shape representations have been used in the context of deep generative models. 3D voxel grids [6, 10] are a natural extension to image-based architectures, but have a high memory footprint. Sparse occupancy grids [42, 36, 15, 33] alleviate this issue using a hierarchical grid, but they are still not able to generate detailed shapes and they require additional bookkeeping. Multi-view representations [34, 23], point clouds [8, 11, 9, 1], mesh deformations [41, 19] and implicit functions [31, 26, 5, 13] provide alternatives that are compact and capable of generating detailed shapes. However, these approaches are focused on reconstructing accurate 3D shapes and are not amenable to tasks like editing. Our goal is different: we explore generative models to produce sets of handles – summarized shape representations that can be easily manipulated by users.

Figure 2: Overview. We propose a method to train a generative model for sets of shape handles. Once trained, the latent representation can also be used in applications like shape editing and interpolation.

Two methods closely related to ours are Tulsiani et al. [40] and Paschalidou et al. [32], which propose models to generate shapes as a collection of primitives without supervision. In contrast, we are focused on creating models capable of utilizing shape decompositions provided by external agents: either a human annotator or a shape summarization technique. We demonstrate that, by using the extra information provided by annotations or well known geometric processing techniques, our method is capable of generating more accurate shapes while keeping the representation interpretable and intuitive for easy editing. Other approaches focused on learning shape structures through stronger supervision [21, 28, 30], requiring not only handle descriptions, but also relationships between them, e.g. support, symmetry, adjacency, hierarchy, etc. In contrast, our method models shapes as sets and we show that inter-handle relationships can be learned directly from data, so that the latent space induced by our model can be used to guide shape editing, completion, and interpolation tasks. Furthermore, we present a general framework that can be easily adapted to different types of handles, not only a single parametric family, like cuboids [40, 28, 21] or superquadrics [32].

Methods for shape decomposition.

Shape decomposition has been extensively studied in the geometry processing literature. The task is to approximate a complex shape as a set of simpler, lower-dimensional parts that are amenable to editing. We refer to these parts as shape handles. Early cognitive studies have shown that humans tend to reason about 3D shapes as a union of convex components [16]. Multiple approaches have explored decomposing shapes in this manner [22, 44, 18]. However, those approaches are likely to generate too many parts, making them difficult to manipulate. This problem was addressed by later shape approximation methods such as cages [43], 3D curves [12, 25, 14] and sphere-meshes [38], which have been shown to be very useful in shape editing and other high-level tasks. Our method is flexible enough to work with various types of shape handles, and in particular we show experiments using cuboids as well as sphere-meshes.

Several closely related methods to ours approximate complex shapes using primitives such as cylinders [45] or cuboids [43]. These approximations are easy to interpret and manipulate by humans. However, most existing methods rely solely on geometric cues for computing primitives, which can lead to counter-intuitive decompositions. In contrast, our method takes supervision from semantic information provided by human annotators or shape summarization techniques, and therefore our results more accurately match human intuition.

3 Method

Consider a dataset $\mathcal{D} = \{S_i\}$ containing sets of shape handles. Each set of handles $S_i$ represents a 3D shape and consists of multiple handle descriptors. Our goal is to train a model $f_\theta$ capable of generating sets similar to the ones in $\mathcal{D}$, i.e., using them as supervision. More precisely, given an input $x_i$ associated with a set of handles $S_i$, our goal is to estimate the parameters $\theta$ such that $f_\theta(x_i) \approx S_i$. The input $x_i$ can be an image, a point cloud, an occupancy grid, or even the set of handles itself. When $x_i = S_i$, $f_\theta$ corresponds to an autoencoder. If we add a regularization term to the bottleneck of $f_\theta$, we have a Variational Auto-Encoder (VAE), which we use for applications like shape editing, completion and interpolation (Section 4.4). However, we need a loss function capable of measuring the similarity between two sets of handles, i.e. the reconstruction component of a VAE. Ideally, this loss function would be versatile – we should be able to use it to generate different types of handles with minimal modifications. Moreover, our model needs to be capable of generating sets with different cardinalities, since the sets do not always have the same size – in practice, the size of the sets used as supervision can vary a lot and our network must accommodate this need.

In this section, we describe how to create a model satisfying these constraints. First, we describe how to compute similarities between handles. Our method is flexible and only relies on the ability to efficiently compute the distance from an arbitrary point in space to the handle’s surface. We then demonstrate how to use this framework with two types of handles: cuboids and sphere-meshes. Finally, we describe how to build models capable of generating sets with varying sizes, by employing a separate network branch to predict the existence of shape handles.

3.1 Similarity between shape handles

Consider two sets of shape handles of the same type: $A = \{a_i\}_{i=1}^{|A|}$ and $B = \{b_j\}_{j=1}^{|B|}$, where $a_i$ and $b_j$ are parameters that describe each handle. For example, if the handle type is cuboid, $a_i$ (or $b_j$) would include the cuboid dimensions, rotation and translation in space. One way to compute similarity between sets is through Chamfer distance. Let the asymmetric Chamfer distance between the two sets of handles $A$ and $B$ be defined as:

$$\mathrm{Ch}(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} D(a, b), \tag{1}$$

where $D(a, b)$ is a function that computes the similarity between two handles with parameters $a$ and $b$. There are multiple choices for $D$. One straightforward choice is to define $D$ as an $\ell_p$-norm of the vector $a - b$. However, this is a poor choice as the parameters are not homogeneous. For example, parameters that describe rotations should not contribute to the similarity metric in the same way as those describing translations. Furthermore, there may be multiple configurations that describe the same shape – e.g., vertices that are in different orders may describe the same triangle; a cube can be rotated and translated in multiple ways and end up occupying the same region in space.
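The asymmetric Chamfer distance described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, handles are plain tuples of parameters, averaging over the first set is an assumption, and the placeholder handle distance is the naive parameter-space choice that the text argues against.

```python
def asymmetric_chamfer(set_a, set_b, handle_distance):
    """Mean, over handles in set_a, of the distance to the closest handle in set_b."""
    if not set_a or not set_b:
        raise ValueError("both handle sets must be non-empty")
    total = 0.0
    for a in set_a:
        total += min(handle_distance(a, b) for b in set_b)
    return total / len(set_a)  # assumption: normalize by the size of set_a


def squared_l2(a, b):
    # Placeholder handle distance: squared Euclidean distance between parameters.
    return sum((x - y) ** 2 for x, y in zip(a, b))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0)]
print(asymmetric_chamfer(A, B, squared_l2))  # 0.5
print(asymmetric_chamfer(B, A, squared_l2))  # 0.0
```

The two calls differ because the measure is asymmetric: every handle in `B` has an exact match in `A`, but not the other way around.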

We address these problems by proposing a novel distance metric which measures the similarity of the distance field functions of the two handles. Specifically, let $P$ be a set of points in the 3D space and let $\mu(a)$ represent the surface of the handle described by $a$. Now, we define $D$ as follows:

$$D(a, b) = \sum_{p \in P} \Big( \min_{p_a \in \mu(a)} \|p - p_a\|_2 - \min_{p_b \in \mu(b)} \|p - p_b\|_2 \Big)^2. \tag{2}$$

Intuitively, $D$ calculates the sum of squared differences between two feature vectors representing the distance fields with respect to each of the handles. Each dimension of these feature vectors contains the distance between a point $p$ in the set of probe points $P$ and the surface of the handle defined by its parameters ($a$ and $b$ in Equation 2). The main advantage of this similarity computation is its versatility: it allows us to compare any type of shape handle; the only requirement is the ability to efficiently compute $\min_{p_h \in \mu(h)} \|p - p_h\|_2$ given handle parameters $h$ and a point $p$. In the following subsections, we show how to efficiently perform this computation for two types of shape handles: cuboids and sphere-meshes.
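To make the distance-field comparison concrete, here is a small sketch under stated assumptions. Each handle is represented only by a callable returning the distance from a point to its surface. The probe grid spans $[0, 1]^3$, which is one possible reading of "unit cube", and the toy sphere handle and all function names are hypothetical.

```python
import itertools
import math


def probe_grid(n=4):
    """n**3 probe points on a regular grid spanning the cube [0, 1]^3."""
    ticks = [i / (n - 1) for i in range(n)]
    return list(itertools.product(ticks, repeat=3))


def field_distance(dist_a, dist_b, probes):
    """Sum of squared differences between two distance fields sampled at probes."""
    return sum((dist_a(p) - dist_b(p)) ** 2 for p in probes)


def sphere_surface_distance(center, radius):
    # Toy handle: unsigned distance from a point to the surface of a sphere.
    def distance(p):
        to_center = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
        return abs(to_center - radius)
    return distance


probes = probe_grid(4)  # 4 x 4 x 4 = 64 probe points
s1 = sphere_surface_distance((0.5, 0.5, 0.5), 0.25)
s2 = sphere_surface_distance((0.5, 0.5, 0.5), 0.35)
print(len(probes))                     # 64
print(field_distance(s1, s1, probes))  # 0.0
print(field_distance(s1, s2, probes) > 0.0)
```

Because the comparison only touches the two distance callables, swapping in a different handle type requires no change to `field_distance`.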


Cuboids.

We choose to represent a cuboid by parameters $h = \langle c, l, r_1, r_2 \rangle$, where $c \in \mathbb{R}^3$ is the cuboid center, $l \in \mathbb{R}^3$ is the cuboid scale factor (i.e. dimensions), and $r_1, r_2 \in \mathbb{R}^3$ are vectors describing the rotation of the cuboid. This rotation representation has continuity properties that benefit its estimation through neural networks [46]. Notice that we can build a rotation matrix $R$ from $r_1$ and $r_2$ by following the procedure described in [46]. Now, consider the transformation $\tau_h$ that maps a point $p$ into the local coordinate frame of the cuboid, and let $\mu_C(h)$ represent the surface of the cuboid parametrized by $h$. The distance from $p$ to the cuboid, $\min_{p_h \in \mu_C(h)} \|p - p_h\|_2$, then has a closed form in $\tau_h(p)$ and $l$ that uses only element-wise $\max$, $\min$ and absolute value. Since this expression can be computed in $O(1)$, we are able to compute Equation 2 in $O(|P|)$, where the number of probing points $|P|$ is relatively small. In practice, we sample 64 points in a regular grid in the unit cube.
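The following sketch shows one way such a constant-time cuboid distance can be written. It is not taken from the paper: it uses the common box distance formula, treats the scale vector as half-extents, returns an unsigned distance, and builds the rotation by Gram-Schmidt orthogonalization of the two rotation vectors. All of these choices and all function names are assumptions made for illustration.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def rotation_axes(r1, r2):
    """Gram-Schmidt: turn two 3D vectors into three orthonormal axes (columns of R)."""
    b1 = _normalize(r1)
    proj = _dot(b1, r2)
    b2 = _normalize(tuple(x - proj * y for x, y in zip(r2, b1)))
    b3 = _cross(b1, b2)
    return b1, b2, b3


def cuboid_surface_distance(p, center, half_extents, r1, r2):
    """Unsigned distance from point p to the surface of an oriented cuboid."""
    axes = rotation_axes(r1, r2)
    offset = tuple(a - b for a, b in zip(p, center))
    # Local coordinates R^T (p - c): projections of the offset onto the cuboid axes.
    local = tuple(_dot(axis, offset) for axis in axes)
    q = tuple(abs(x) - h for x, h in zip(local, half_extents))
    outside = math.sqrt(sum(max(x, 0.0) ** 2 for x in q))
    inside = min(max(q), 0.0)  # negative only when p lies inside the cuboid
    return outside + abs(inside)


unit = dict(center=(0.0, 0.0, 0.0), half_extents=(1.0, 1.0, 1.0),
            r1=(1.0, 0.0, 0.0), r2=(0.0, 1.0, 0.0))
print(cuboid_surface_distance((2.0, 0.0, 0.0), **unit))  # 1.0
print(cuboid_surface_distance((0.0, 0.0, 0.0), **unit))  # 1.0
print(cuboid_surface_distance((1.0, 0.5, 0.0), **unit))  # 0.0
```

The cost per probe point is a fixed number of arithmetic operations, independent of the cuboid, which is what makes the overall similarity linear in the number of probes.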


Figure 3: Schematic representation of sphere-meshes. A sphere-mesh (middle) is computed from a regular triangle mesh (left) as input, and it consists of multiple sphere-triangles (right), each of which is a volumetric representation

Sphere-meshes.

A triangle mesh consists of a set of vertices and triangular faces representing the vertex connectivity. Every vertex is a point in space and the surface of a triangle contains all the points that can be generated by interpolating the triangle’s vertices using barycentric coordinates. A sphere-mesh is a generalization of a triangle mesh – every vertex is a sphere instead of a point in space. Thus, every sphere-mesh “triangle” is actually a volume delimited by the convex-hull of the spheres centered at the triangle vertices. Figure 3 presents a visual description of sphere-mesh components. Thiery et al. [38] introduced an algorithm to compute sphere-meshes from regular triangle meshes. They show that complex meshes can be closely approximated with a sphere-mesh containing a fraction of the original components.

We model sphere-meshes as a set of sphere-mesh triangles, called sphere-triangles. Similarly to a regular triangle, a sphere-triangle is fully defined by its vertices, the difference being that its vertices are now spheres instead of points. Thus, we choose to represent a sphere-triangle using parameters $h = \langle c_1, c_2, c_3, r_1, r_2, r_3 \rangle$, where $c_1, c_2, c_3 \in \mathbb{R}^3$ are the centers of the three spheres and $r_1, r_2, r_3$ are their radii. Let $\mu_S(h)$ represent the surface of the sphere-triangle parametrized by $h$. For the purpose of calculating the similarity between two sphere-triangles, each sphere-triangle is uniquely defined by its three spheres, so it suffices for $\mu_S(h)$ to contain only the surfaces of these three spheres rather than the entire sphere-triangle. Thus, the distance of a probing point $p$ to the handle surface is computed as follows:

$$\min_{p_h \in \mu_S(h)} \|p - p_h\|_2 = \min_{i \in \{1, 2, 3\}} \big|\, \|p - c_i\|_2 - r_i \,\big|.$$
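As a hedged illustration of the sphere-triangle case, the sketch below measures the distance from a probe point to the nearest of the three vertex-sphere surfaces, which is what the simplification in the text amounts to. The function name and the use of an unsigned distance are assumptions.

```python
import math


def sphere_triangle_distance(p, centers, radii):
    """Distance from p to the nearest of the three vertex-sphere surfaces."""
    return min(
        abs(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c))) - r)
        for c, r in zip(centers, radii)
    )


centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
radii = [0.5, 0.25, 0.25]
print(sphere_triangle_distance((1.0, 0.0, 0.0), centers, radii))  # 0.5
print(sphere_triangle_distance((2.0, 0.0, 0.0), centers, radii))  # 0.25
```

An edge of a sphere-mesh can be passed to the same function by repeating one of its two spheres, since the minimum is unaffected by duplicates.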

3.2 Generating sets with varying cardinality

The neural network $f_\theta$ generates shapes represented by sets of handles given an input $x$. Our design of $f_\theta$ includes two main components: an encoder $q$ that, given an input $x$, outputs a latent set representation $z$; and a decoder $g$ that, given the latent set representation $z$, generates a set of handles. Even though we can use a symmetric version of Equation 1 to compute the similarity between the generated set $g(z)$ and the ground-truth set of handles $A$, so far our model has not taken into account the varying size (i.e. number of elements) of the generated sets. We address this issue by separating the generator into two parts: a parameter prediction branch $g_p$ and an existence prediction branch $g_e$. The parameter prediction branch is trained to always output a fixed number of handle parameters, $g_p(z) = \{h_i\}_{i=1}^{N}$, where $h_i$ represents the parameters of the $i$-th handle. The existence prediction branch outputs $g_e(z) = \{e_i\}_{i=1}^{N}$, where $e_i$ represents the probability of existence of the $i$-th generated handle. Now, we need to adapt our loss function to consider the probability of existence of a handle.

If we assume that all handles exist, our model can be trained using the following loss:

$$\mathcal{L}(z, A) = \mathrm{Ch}(g_p(z), A) + \mathrm{Ch}(A, g_p(z)),$$

where $A$ is a set of shape handles drawn from the training data and $z$ is a latent representation computed from the associated input $x$. However, we want to modify this loss to take into account the probability of a handle existing or not. To do so, note that $\mathcal{L}$ has two terms. The first term measures accuracy, i.e. how close each of the handles in $g_p(z)$ is to the handles in $A$. For this term, we can use the existence probabilities $g_e(z)$ as weights for the summation in Equation 1, which leads to the following definition:

$$\mathcal{P}(z, A) = \frac{1}{N} \sum_{i=1}^{N} e_i \min_{a \in A} D(h_i, a), \tag{3}$$

where $z$ is a latent space representation, $A$ is a set of handles, $h_i \in g_p(z)$ and $e_i \in g_e(z)$. The intuition is quite simple: if the handle $h_i$ is likely to exist, its distance to the closest handle in $A$ should be taken into consideration; on the other hand, if the handle is unlikely to exist, it does not matter if it is approximating a handle in $A$ or not.
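A small sketch of the existence-weighted accuracy term may help. It assumes the pairwise handle distances have already been computed and stored in a nested list, and it averages over the generated handles; both the data layout and the normalization are illustrative assumptions, and the function name is hypothetical.

```python
def weighted_accuracy(dist, exist):
    """Existence-weighted accuracy term.

    dist[i][k]: distance between generated handle i and ground-truth handle k.
    exist[i]:   predicted probability that generated handle i exists.
    """
    n = len(dist)
    return sum(e * min(row) for row, e in zip(dist, exist)) / n


dist = [[0.0, 4.0],    # generated handle 0 matches ground-truth handle 0
        [9.0, 1.0],    # generated handle 1 is near ground-truth handle 1
        [25.0, 16.0]]  # generated handle 2 is far from everything
print(weighted_accuracy(dist, [1.0, 1.0, 1.0]))  # (0 + 1 + 16) / 3
print(weighted_accuracy(dist, [1.0, 1.0, 0.0]))  # (0 + 1) / 3
```

Setting the existence probability of the stray third handle to zero removes its contribution entirely, which is the behavior the intuition above describes.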

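The second term in $\mathcal{L}$ measures coverage: every handle in $A$ must have (at least) one handle in the generated set that is very similar to it. Here, we use an insight presented in [32] to efficiently compute the coverage of $A$ while considering the probability of elements in a set existing or not. For a handle $a \in A$, let $h_{(1)}, \dots, h_{(N)}$ be the list of generated handles $g_p(z)$ ordered in non-decreasing order according to $D(a, h_i)$ for $i = 1, \dots, N$, with $e_{(i)}$ the existence probability of $h_{(i)}$. We compute the coverage of a set $A$ from a set generated from $z$ as follows:

$$\mathcal{C}(z, A) = \frac{1}{|A|} \sum_{a \in A} \sum_{i=1}^{N} D(a, h_{(i)})\, e_{(i)} \prod_{j=1}^{i-1} \big(1 - e_{(j)}\big). \tag{4}$$

The idea behind this computation is the following: for every handle $a \in A$, we compute its distance to every handle in $g_p(z)$, weighted by the probability of that handle existing or not. However, the distance to a specific handle is important only if no other handle closer to $a$ exists. Thus, the whole term needs to be weighted by $\prod_{j=1}^{i-1} (1 - e_{(j)})$. Finally, we can combine Equations 3 and 4 to define the reconstruction loss used to train our model:

$$\mathcal{L}(z, A) = \mathcal{P}(z, A) + \mathcal{C}(z, A). \tag{5}$$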
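The coverage computation can be sketched as follows, again on a precomputed table of handle distances. For each ground-truth handle, generated handles are visited from closest to farthest, and each distance is weighted by the probability that the handle exists and that no closer one does. The data layout, the averaging over ground-truth handles, and the function name are assumptions for illustration.

```python
def coverage(dist, exist):
    """Existence-aware coverage term.

    dist[k][i]: distance between ground-truth handle k and generated handle i.
    exist[i]:   predicted existence probability of generated handle i.
    """
    total = 0.0
    for row in dist:
        order = sorted(range(len(row)), key=lambda i: row[i])  # closest first
        none_closer = 1.0  # probability that no closer generated handle exists
        for i in order:
            total += row[i] * exist[i] * none_closer
            none_closer *= 1.0 - exist[i]
    return total / len(dist)


dist = [[1.0, 4.0]]  # one ground-truth handle, two generated handles
print(coverage(dist, [1.0, 1.0]))  # 1.0: the closest handle surely exists
print(coverage(dist, [0.0, 1.0]))  # 4.0: only the farther handle exists
print(coverage(dist, [0.5, 1.0]))  # 2.5: 0.5 * 1 + 0.5 * 4
```

When every existence probability is one, only the closest generated handle contributes, so the term reduces to an ordinary nearest-neighbor coverage.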


Alternate training procedure.

Figure 4: Comparison of results after the first stage (top row) and second stage (bottom row) of alternate training. While the first stage ensures coverage, some extra, unnecessary handles are also generated. The second stage trains the existence branch, which assigns a low probability of existence to the inaccurate handles.

Although minimizing the loss in Equation 5 at once enables generating sets of different sizes, our experiments show that the results can be further improved if we train the parameter branch $g_p$ and the existence branch $g_e$ in an alternating fashion. Specifically, we first initialize the biases and weights of the last layer of $g_e$ to ensure that all of its outputs are $1$, i.e., the model is initialized to predict that every primitive exists. Then, in the first stage of the training, we fix the parameters of $g_e$ and train $g_p$ minimizing only the coverage $\mathcal{C}$. During the second stage of the training, we fix the parameters of $g_p$ and update the parameters of $g_e$, but this time minimizing the full reconstruction loss $\mathcal{L}$. As we show in Section 4, this alternating procedure improves the training, leading to the generation of more accurate shape handles. The intuition is that while training the model to predict the handle parameters ($g_p$), the network should be only concerned about coverage, i.e., generating at least one similar handle for each ground-truth handle. On the other hand, while training the existence prediction branch ($g_e$), we want to remove the handles that are in incorrect positions while keeping the coverage of the ground-truth set.

4 Experiments

This section describes our experiments and validates results. We experimented with two different types of handles: cuboids computed from PartNet [29] segmentations and sphere-meshes computed from ShapeNet [3] shapes using [38]. We compare our results to two other approaches focused on generating shapes as a set of simple primitives, namely cuboids [40] and superquadrics [32]. All the experiments in the paper were implemented using Python 3.6 and PyTorch. Computation was performed on TitanX GPUs.

4.1 Datasets

Cuboids from PartNet [29].

We experiment with human annotated handles by fitting cuboids to the parts segmented in PartNet [29]. The dataset contains 26,671 shapes from 24 categories and 573,585 part instances. In order to compare our model with other approaches trained on the ShapeNet [3] chairs dataset, we select the subset of PartNet chairs that is also present in ShapeNet. This results in 6,773 chair models segmented into multiple parts. Every model has on average 18 parts, but there are also examples with as many as 137 parts. For every part we fit a corresponding cuboid using PCA. Then, we compute the volume of every cuboid and keep at most the 30 largest cuboids by volume. Notice that 92% of the shapes have fewer than 30 cuboids, so those remain unchanged. The others will have missing components, but those usually correspond to very small details and can be ignored without degrading the overall structure.
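The volume-based truncation step is simple enough to sketch directly. Here each cuboid is reduced to a hypothetical `(width, height, depth)` tuple; in practice the full set of cuboid parameters would be carried along, and the function name is an assumption.

```python
def keep_largest_cuboids(cuboids, limit=30):
    """Keep at most `limit` cuboids, preferring the largest volumes."""
    def volume(dims):
        width, height, depth = dims
        return width * height * depth
    return sorted(cuboids, key=volume, reverse=True)[:limit]


parts = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (1.0, 2.0, 1.0)]
print(keep_largest_cuboids(parts, limit=2))  # [(2.0, 2.0, 2.0), (1.0, 2.0, 1.0)]
print(len(keep_largest_cuboids(parts)))      # 3: shapes under the limit are unchanged
```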

Sphere-meshes from ShapeNet [3].

In contrast to cuboids (which are computed from human annotated parts), we compute sphere-meshes fully automatically using the procedure described in [38]. We use ShapeNet categories that are also analyzed in [32, 40]: chairs, airplanes and animals. The sphere-mesh computation procedure requires pre-selecting how many sphere-vertices to use. The algorithm starts by considering the regular triangle mesh as a trivial sphere-mesh (vertices with null radius) and then decimates the original mesh progressively through edge collapsing, optimizing for a new sphere-vertex each time an edge is removed. This procedure is iterated until the required number of vertices is reached.

In our case, since our model is capable of generating sets with different cardinalities, we are not required to set a fixed number of primitives for every shape. Therefore we use the following method to compute a sphere-mesh with an adaptive number of vertices. Specifically, for every shape in the dataset, we start by computing a sphere-mesh with 10 vertices. Then, we sample 10K points both on the sphere-mesh surface and the original mesh. If the Hausdorff distance between the point clouds is below a fixed threshold (point clouds are normalized to fit the unit sphere), we keep the current computed sphere-mesh. Otherwise, we compute a new sphere-mesh by incrementing the number of vertices. This procedure continues until we reach a maximum of 40 vertices. This adaptive sphere-mesh computation allows our model to achieve a good balance between shape complexity and summarization – simpler shapes will be naturally represented with a smaller number of primitives. We note that the sphere-mesh computation allows the resulting mesh to contain not only triangles, but also edges (i.e. degenerate triangles). For simplicity, we make no distinction between sphere-triangles and edges: edges are simply triangles that have two identical vertices.
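The adaptive procedure can be sketched as a simple search loop. In this illustration `error_for` is a hypothetical callback that would build a sphere-mesh with the given number of vertices, sample both surfaces, and return their Hausdorff distance. The threshold value and the step of one vertex per iteration are placeholders, not values from the paper, and the brute-force Hausdorff helper is only practical for small point sets.

```python
import math


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two finite point sets (brute force)."""
    def directed(src, dst):
        return max(min(_dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


def adaptive_vertex_count(error_for, start=10, maximum=40, threshold=0.05):
    """Smallest vertex count in [start, maximum] whose error is below threshold.

    Falls back to `maximum` when no count is accurate enough.
    """
    for n in range(start, maximum + 1):
        if error_for(n) < threshold:
            return n
    return maximum


print(hausdorff([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]))  # 1.0
# Toy error model in which the approximation error shrinks as 1 / n.
print(adaptive_vertex_count(lambda n: 1.0 / n, threshold=0.05))  # 21
print(adaptive_vertex_count(lambda n: 1.0))                      # 40
```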

Figure 5: Shape parsing on the chairs dataset. From top to bottom, we show ground-truth shapes, results by Tulsiani et al. [40], results by our method using sphere-mesh handles, and our method using cuboid handles. Note how our results (last two rows) are able to generate handles with much better details such as the stripes on the back of the chair (first column), legs on wheel chairs (second column) and armrests in several other columns.

4.2 Shape Parsing

The shape parsing task is to compute a small set of primitives from non-parsimonious, raw 3D representations, like occupancy grids, meshes or point clouds. We analyze the ability of our model to perform shape parsing using a setup similar to [32, 40]. Specifically, following the notation defined in Section 3, we train a model using input-output pairs $\langle x_i, S_i \rangle$, where $x_i$ corresponds to a point cloud with 1024 points and $S_i$ is a set of handles summarizing the shape represented by $x_i$. We use a PointNet [35] encoder to process a point cloud with 1024 points and generate a 1024 dimensional encoding. This encoding is then used as an input for our two-branched set decoder. Both branches follow the same architecture: 3 fully connected layers with 256 hidden neurons followed by batch normalization and ReLU activations. The only difference between the two branches is in the last layer. Assume $N$ is the maximum set cardinality generated by our model and $d$ is the handle dimensionality (i.e. number of parameters of each handle descriptor, which happens to be $12$ for both sphere-mesh and cuboid). Then the parameter branch $g_p$ outputs $N \times d$ values passed through a final activation, while the existence branch $g_e$ outputs $N$ values followed by a sigmoid activation. The value of $N$ is set separately for cuboid handles and for sphere-meshes. The model is trained end-to-end by using the alternating training described in Section 3. Training is performed using the Adam optimizer for 5K iterations in each stage.

Figures 5 and 6 show visual comparisons of our method with previous work. Qualitatively, our method generates shape handles with accurate geometric details, including many thin structures that previous methods struggle with.

Figure 6: Shape parsing on the airplanes and animals datasets. From top to bottom, we show ground-truth shapes, results by Tulsiani et al. [40], results by Paschalidou et al. [32], and results by our model trained using sphere-mesh handles. Our results contain accurate geometric details, such as the engines on the airplanes and animal legs that are clearly separated.
Method  Handle type   Chairs  Airplanes  Animals
[40]    Cuboid        0.129   0.065      0.334
[32]    Superquadric  0.141   0.181      0.751
Ours    Cuboid        0.311   -          -
Ours    Sphere-mesh   0.298   0.307      0.761

Table 1: Quantitative results for shape parsing. Intersection over union computed on the reconstructed shapes. The best self-supervised results are shown in bold font.

Quantitative comparisons

We compare our method against [40, 32] using the intersection over union (IoU) metric; results are shown in Table 1. As expected, when using cuboids as handles, our method leverages the annotated data from PartNet [29] to achieve significantly more accurate shape approximations (more than twice the IoU in comparison). On the other hand, as [40, 32] are trained without leveraging annotated data, a fairer comparison is between theirs and our method using sphere-mesh handles, which are computed automatically. Our method still outperforms theirs in all categories – chairs, airplanes and animals. This shows that even though a neural network in theory should be able to learn the best parsimonious shape representations, using self-supervision generated by shape summarization techniques (e.g. sphere-meshes) can still help it achieve more accurate approximations.
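For readers unfamiliar with the metric, the sketch below computes IoU between two shapes represented as sets of occupied voxel indices. How the reconstructed handles are voxelized in the experiments is not described here, so the representation and the function name are assumptions.

```python
def voxel_iou(occupied_a, occupied_b):
    """Intersection over union of two shapes given as sets of occupied voxel indices."""
    a, b = set(occupied_a), set(occupied_b)
    union = a | b
    # Two empty shapes are treated as identical.
    return len(a & b) / len(union) if union else 1.0


shape_a = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
shape_b = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
print(voxel_iou(shape_a, shape_b))  # 0.5: two shared voxels out of four in the union
print(voxel_iou(shape_a, shape_a))  # 1.0
```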

Figure 7: Ablation studies. Shapes generated from a model trained without our proposed handle similarity metric (first row), model trained without the two-stage training procedure (second row), and our full model (last row). Note that comparing handles using just an $\ell_p$-norm (first row) yields poor results. Training $g_p$ and $g_e$ at the same time (instead of alternating) yields reasonable results, but some parts are missing and/or poorly oriented.

w/o similarity  w/o alternate  full model
0.192           0.320          0.352

Table 2: Quantitative results of ablation studies comparing our full model with two variations that lack our handle similarity metric and alternate training procedure respectively.

4.3 Ablation studies

We investigated the influence of the two main contributions of this work: the similarity metric for handles and the alternating training procedure for the parameter branch $g_p$ and the existence branch $g_e$. To do so, we adopt a shape-handle auto-encoder and compare different variations by computing the IoU of reconstructed shapes in a held-out test set. The auto-encoder architecture is very similar to the one used in shape parsing, except for the encoder: it still follows a PointNet architecture, but every “point” is actually a handle treated as a point in a $d$-dimensional space, where $d$ is the number of handle parameters. We analyzed three different variations. In the first one, we simply used an $\ell_p$-norm between the handle parameters (cuboids, in this case) in place of the proposed similarity. As shown in Figure 7 and Table 2, the proposed handle similarity metric has a significant impact on the quality of the generated shapes (IoU of 0.192 without it versus 0.352 for the full model). The second variation consists of training the same model, but without using the alternating procedure described in Section 3. Figure 7 shows that the alternating training procedure generates more accurate shapes, with fewer missing parts and better cuboid orientation (IoU of 0.320 without it). The third variation is our full model.

4.4 Applications

In this section, we demonstrate the use of our generative model in several applications. We employed a Variational Auto-Encoder (VAE) [20] for this purpose. It follows the same architecture as the auto-encoder described in Section 4.3, with the only difference being that the output of the encoder (latent representation $z$) has dimensionality 256 instead of 512. Additionally, following [11], we added a regularization term to the training objective (Equation 6). This term is defined in terms of the encoder $q$, a covariance matrix, the Frobenius norm $\|\cdot\|_F$, the input handle set $x$, and sampled random noise. The network is then trained by minimizing an objective that combines the reconstruction loss with this regularization term (Equation 7). The same hyperparameter values are used in all our experiments. The model is trained using the alternate procedure described before, i.e. the full reconstruction loss $\mathcal{L}$ is replaced by the coverage $\mathcal{C}$ while training the parameter branch $g_p$.


Figure 8: Latent space interpolation. Sets of handles can be interpolated by linearly interpolating the latent representation $z$. Transitions are smooth and generate plausible intermediate shapes. Notice that the interpolation not only changes handle parameters, but also adds new handles / removes existing handles as necessary.

Interpolation.

Once the VAE model is trained, we are able to morph between two shapes by linearly interpolating their latent representations $z$. In particular, we sample two values $z_1$ and $z_2$ from the latent distribution and generate new shapes by passing the interpolated encodings $\alpha z_1 + (1 - \alpha) z_2$ through the decoder $g$, where $\alpha \in [0, 1]$. Results using cuboid handles are presented in Figure 8. Note that the shapes are smoothly interpolated, with new handles added and old handles removed as necessary when the overall shape deforms. Additionally, relationships between handles, like symmetries, adjacency and support, are preserved, thanks to the latent space learned by our model, even though such characteristics are never explicitly specified as supervision.
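Linear interpolation of latent codes is straightforward to sketch. The snippet below only produces the intermediate codes; decoding each one into a handle set would require the trained decoder, which is not reproduced here. The function name and the use of plain Python lists are illustrative assumptions.

```python
def interpolate_latents(z1, z2, steps):
    """Linear interpolation between two latent codes, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for s in range(steps):
        alpha = s / (steps - 1)
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(z1, z2)])
    return path


path = interpolate_latents([0.0, 0.0], [1.0, 2.0], steps=3)
print(path)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]
```

Each code in `path` would then be decoded separately, so the number of handles is free to change from one intermediate shape to the next.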

Handle completion. Given an incomplete set of handles $A$ as input, the handle completion task is to generate a complete set of handles $A'$, such that $A'$ contains not only the handles in the input but also necessary additional handles that result in a plausible shape. For example, given a single cuboid handle as shown in Figure 9, we want to generate a complete chair that contains that input handle. We perform this task by finding a latent representation $z$ that generates a set of handles approximating the elements in $A$. Specifically, we solve the following optimization problem:

$$z^* = \arg\min_{z} \; \mathcal{C}(z, A), \tag{8}$$

where $\mathcal{C}$ is the coverage metric defined in Equation 4 and $g(z^*)$ is the completed shape (i.e. output of the decoder using $z^*$ as input). We can also use the existence prediction branch ($g_e$) in this framework to reason about how complex we want the completed shapes to be. Specifically, we add to the objective in Equation 8 a penalty on the predicted existence probabilities $g_e(z)$, scaled by a coefficient $\gamma$ (Equation 9), where $\gamma$ controls the complexity of the shape. If $\gamma = 0$, we are not penalizing a set with multiple handles – only coverage matters. As $\gamma$ increases, existence of multiple handles is penalized more, leading to a solution with a lower cardinality. As can be seen in Figure 9, our model is capable of recovering plausible chairs even when given a single handle. In addition, we can generate multiple proposals for $A'$ by initializing the optimization with different values of $z$. More results can be found in the supplemental material.

Figure 9: Results of handle completion. Recovering full shape from incomplete set of handles. Using the complexity weight $\gamma$ to control the complexity of the completed shape (left). Predicting a complete chair from a single handle (right).

Shape editing. For editing shapes, we use a similar optimization-based framework. Consider an original set of handles describing a particular shape, and assume that the user edits it by modifying the parameters of some handles, creating a new set. Our goal is to generate a plausible new shape from the edited set while minimizing the deviation from the original shape. To achieve this goal, we solve a minimization problem (Equation 10) over the latent representation via gradient descent, where the objective includes a penalty on the distance to the latent representation of the original shape. The intuition for Equation 10 is simple: we want to generate a plausible shape that approximates the user edits, but also keep the overall characteristics of the original shape by adding a penalty for deviating too much from its latent representation. Results are shown in Figure 10. As observed in the figure, when the user edits one of the handles, our model can automatically modify the shape of the entire chair while preserving its overall structure.
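One form of the editing objective that is consistent with this description is sketched below. The notation is assumed for illustration: $A'$ is the edited handle set, $f(z)$ the decoder output, $z_0$ the latent representation of the original shape, $\mathrm{Cov}$ the coverage metric of Equation 4, and $\lambda \ge 0$ a weight on the latent penalty. Whether the penalty is a squared Euclidean distance, and how it is weighted, are assumptions.

```latex
z^\star = \arg\min_{z} \; \mathrm{Cov}\bigl(A', f(z)\bigr) + \lambda \,\lVert z - z_0 \rVert_2^2
```

The first term pulls the decoded handles toward the user edits, and the second keeps the solution close to the original shape in latent space.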

Figure 10: Editing chairs. Given an initial set of handles, a user can modify any handle (yellow). Our model then updates the entire set of handles, resulting in a modified shape that respects the user edits while preserving the overall structure.
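The editing optimization can be sketched in the same toy setting. The assumptions are again explicit: the decoder is a fixed random linear map (hypothetical `W`), handle distance is squared Euclidean distance rather than the paper's similarity measure, the latent penalty is a squared distance with an assumed weight `lam`, and gradients come from finite differences rather than backpropagation. The optimization starts from the latent code of the original shape.

```python
import random

K, D, Z = 4, 3, 2  # toy sizes: handles, parameters per handle, latent dims

rng = random.Random(1)
# Hypothetical stand-in for a trained decoder: a fixed random linear map.
W = [[rng.uniform(-1, 1) for _ in range(Z)] for _ in range(K * D)]


def decode(z):
    """Toy decoder: latent code -> K handles of D parameters each."""
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in W]
    return [flat[k * D:(k + 1) * D] for k in range(K)]


def coverage(targets, handles):
    """Sum, over target handles, of the squared distance to the nearest output."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(t, h)) for h in handles)
        for t in targets
    )


def edit_objective(z, edited, z_orig, lam):
    """Coverage of the edited handles plus a penalty for leaving z_orig."""
    drift = sum((zi - oi) ** 2 for zi, oi in zip(z, z_orig))
    return coverage(edited, decode(z)) + lam * drift


def edit_shape(edited, z_orig, lam=1.0, steps=300, lr=0.02, eps=1e-5):
    """Gradient descent on z, started at the original shape's latent code."""
    z = list(z_orig)
    for _ in range(steps):
        grad = []
        for i in range(Z):
            z_plus, z_minus = z[:], z[:]
            z_plus[i] += eps
            z_minus[i] -= eps
            diff = (edit_objective(z_plus, edited, z_orig, lam)
                    - edit_objective(z_minus, edited, z_orig, lam))
            grad.append(diff / (2 * eps))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


z_orig = [0.6, -0.4]
edited = decode(z_orig)
edited[0] = [v + 0.5 for v in edited[0]]  # the user moves one handle
z_new = edit_shape(edited, z_orig, lam=1.0)
new_shape = decode(z_new)  # all K handles are updated, not only the edited one
```

A larger `lam` keeps the result closer to the original shape, while a smaller one lets the decoded handles follow the edit more closely.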


Our method has several limitations to be addressed in future work. First, during training we set a maximum number of handles to be generated. Increasing this number would allow more complex shapes but would also entail a larger network with higher capacity. Therefore, there is a trade-off between the compactness of the generative model and the desired output complexity. Furthermore, our method currently does not guarantee that the output handles satisfy certain geometric constraints, such as parts that need to be axis-aligned or orthogonal to each other. For man-made shapes, these are often desirable constraints, and even slight deviations are immediately noticeable. While our model can already learn geometric relationships among handles directly from the data, generated shapes might benefit from additional supervision enforcing geometric constraints.

5 Conclusion

We presented a method to generate shapes represented as sets of handles – lightweight proxies that approximate the original shape and are amenable to high-level tasks, like shape editing, parsing and animation. Our approach leverages pre-defined sets of handles as supervision, either through annotated data or self-supervised methods. We proposed a versatile similarity metric for shape handles that can easily accommodate different types of handles, and a two-branch network architecture to generate handles with varying cardinality. Experiments show that our model is capable of generating compact and accurate shape approximations, outperforming previous work. We demonstrate our method in a variety of applications, including interactive shape editing, completion, and interpolation, leveraging the latent space learned by our model to guide these tasks.


This work is supported in part by a gift from Adobe Research and by NSF grants 1908669 and 1749833. Our experiments were performed on the UMass GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative.


  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2018) Learning Representations and Generative Models For 3D Point Clouds. In International Conference on Machine Learning, Cited by: §2.
  • [2] M. Attene, B. Falcidieno, and M. Spagnuolo (2006) Hierarchical mesh segmentation based on fitting primitives. The Visual Computer 22 (3), pp. 181–193. Cited by: §1.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.1, §4.1, §4.
  • [4] B. Chazelle and D. P. Dobkin (1985) Optimal convex decompositions. In Machine Intelligence and Pattern Recognition, pp. 63–133. Cited by: §1.
  • [5] Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, Cited by: §2.
  • [7] D. Cohen-Steiner, P. Alliez, and M. Desbrun (2004) Variational shape approximation. In ACM SIGGRAPH 2004 Papers, SIGGRAPH ’04, New York, NY, USA. Cited by: §1.
  • [8] H. Fan, H. Su, and L. Guibas (2017) A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [9] M. Gadelha, S. Maji, and R. Wang (2017) 3D shape generation using spatially ordered point clouds. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [10] M. Gadelha, S. Maji, and R. Wang (2017) Unsupervised 3D Shape Induction from 2D Views of Multiple Objects. In International Conference on 3D Vision (3DV), Cited by: §2.
  • [11] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution Tree Networks for 3D Point Cloud Processing. In ECCV, Cited by: §2, §4.4.
  • [12] R. Gal, O. Sorkine, N. J. Mitra, and D. Cohen-Or (2009) iWIRES: an analyze-and-edit approach to shape manipulation. ACM Transactions on Graphics (Siggraph) 28 (3), pp. #33, 1–10. Cited by: §1, §2.
  • [13] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser (2019) Learning shape templates with structured implicit functions. In International Conference on Computer Vision, Cited by: §2.
  • [14] G. Gori, A. Sheffer, N. Vining, E. Rosales, N. Carr, and T. Ju (2017) FlowRep: descriptive curve networks for free-form design shapes. ACM Transaction on Graphics 36 (4). External Links: Document Cited by: §2.
  • [15] C. Häne, S. Tulsiani, and J. Malik (2017) Hierarchical surface prediction for 3d object reconstruction. In International Conference on 3D Vision (3DV), Cited by: §2.
  • [16] D. D. Hoffman and W. A. Richards (1984) Parts of recognition. Cognition 18 (1-3), pp. 65–96. External Links: Document Cited by: §2.
  • [17] Z. Ji, L. Liu, and Y. Wang (2010) B-mesh: a modeling system for base meshes of 3d articulated shapes. Computer Graphics Forum 29 (7), pp. 2169–2177. Cited by: §1.
  • [18] O. V. Kaick, N. Fish, Y. Kleiman, S. Asafi, and D. Cohen-Or (2014) Shape segmentation by approximate convexity analysis. ACM Trans. Graph. 34 (1). Cited by: §2.
  • [19] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In ECCV, Cited by: §2.
  • [20] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.4.
  • [21] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas (2017) GRASS: generative recursive autoencoders for shape structures. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017) 36 (4), pp. to appear. Cited by: §2.
  • [22] J. Lien and N. M. Amato (2007) Approximate convex decomposition of polyhedra. In Proceedings of the 2007 ACM Symposium on Solid and Physical Modeling, SPM ’07. Cited by: §1, §2.
  • [23] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang (2017) 3D shape reconstruction from sketches via multi-view convolutional networks. In International Conference on 3D Vision (3DV), Cited by: §2.
  • [24] R. Mehra, Q. Zhou, J. Long, A. Sheffer, A. Gooch, and N. J. Mitra (2009) Abstraction of man-made shapes. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia ’09. Cited by: §1.
  • [25] R. Mehra, Q. Zhou, J. Long, A. Sheffer, A. Gooch, and N. J. Mitra (2009) Abstraction of man-made shapes. ACM Transactions on Graphics 28 (5), pp. #137, 1–10. Cited by: §2.
  • [26] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: Learning 3D reconstruction in function space. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [27] A. T. Miller, S. Knoop, H. I. Christensen, and P. K. Allen (2003) Automatic grasp planning using shape primitives. In 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), Vol. 2, pp. 1824–1829 vol.2. Cited by: §1.
  • [28] K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka, N. Mitra, and L. Guibas (2019) StructureNet: hierarchical graph networks for 3d shape generation. ACM Transactions on Graphics (TOG), Siggraph Asia 2019 38 (6), pp. Article 242. Cited by: §2.
  • [29] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019-06) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.1, §4.1, §4.2, §4.
  • [30] C. Niu, J. Li, and K. Xu (2018) Im2Struct: recovering 3d shape structure from a single rgb image. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [31] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [32] D. Paschalidou, A. O. Ulusoy, and A. Geiger (2019-06) Superquadrics revisited: learning 3d shape parsing beyond cuboids. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.2, Figure 6, §4.1, §4.2, §4.2, Table 1, §4.
  • [33] S. R. Richter and S. Roth (2018) Matryoshka Networks: Predicting 3D Geometry via Nested Shape Layers. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [34] A. A. Soltani, H. Huang, J. Wu, T. Kulkarni, and J. Tenenbaum (2017) Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In CVPR, Cited by: §2.
  • [35] H. Su, C. Qi, K. Mo, and L. Guibas (2017) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, Cited by: §4.2.
  • [36] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [37] J. Thiery, É. Guy, T. Boubekeur, and E. Eisemann (2016) Animated mesh approximation with sphere-meshes. ACM Trans. Graph., pp. 30:1–30:13. External Links: ISSN 0730-0301, Link Cited by: §1.
  • [38] J. Thiery, E. Guy, and T. Boubekeur (2013) Sphere-meshes: shape approximation using spherical quadric error metrics. ACM Transaction on Graphics (Proc. SIGGRAPH Asia 2013) 32 (6), pp. Art. No. 178. Cited by: Figure 1, §1, §1, §2, §3.1, §4.1, §4.
  • [39] A. Tkach, M. Pauly, and A. Tagliasacchi (2016) Sphere-meshes for real-time hand modeling and tracking. ACM Trans. Graph. 35 (6). Cited by: §1.
  • [40] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik (2017-07) Learning shape abstractions by assembling volumetric primitives. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Figure 5, Figure 6, §4.1, §4.2, §4.2, Table 1, §4.
  • [41] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2Mesh: generating 3d mesh models from single rgb images. In ECCV, Cited by: §2.
  • [42] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics (SIGGRAPH) 36 (4). Cited by: §2.
  • [43] C. Xian, H. Lin, and S. Gao (2012) Automatic cage generation by improved obbs for mesh deformation. The Visual Computer 28 (1), pp. 21–33. Cited by: §1, §2, §2.
  • [44] Z. Ren, J. Yuan, C. Li, and W. Liu (2011-11) Minimum near-convex decomposition for robust shape representation. In 2011 International Conference on Computer Vision, Cited by: §2.
  • [45] Y. Zhou, K. Yin, H. Huang, H. Zhang, M. Gong, and D. Cohen-Or (2015) Generalized cylinder decomposition. ACM Trans. Graph. 34 (6). Cited by: §2.
  • [46] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019-06) On the continuity of rotation representations in neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.