StructEdit: Learning Structural Shape Variations
Learning to encode differences in the geometry and (topological) structure of the shapes of ordinary objects is key to generating semantically plausible variations of a given shape, transferring edits from one shape to another, and many other applications in 3D content creation. The common approach of encoding shapes as points in a high-dimensional latent feature space suggests treating shape differences as vectors in that space. Instead, we treat shape differences as primary objects in their own right and propose to encode them in their own latent space. In a setting where the shapes themselves are encoded in terms of fine-grained part hierarchies, we demonstrate that a separate encoding of shape deltas or differences provides a principled way to deal with inhomogeneities in the shape space due to different combinatorial part structures, while also allowing for compactness in the representation, as well as edit abstraction and transfer. Our approach is based on a conditional variational autoencoder for encoding and decoding shape deltas, conditioned on a source shape. We demonstrate the effectiveness and robustness of our approach in multiple shape modification and generation tasks, and provide comparison and ablation studies on the PartNet dataset, one of the largest publicly available 3D datasets.
The shapes of 3D objects exhibit remarkable diversity, both in their compositional structure in terms of parts, as well as in the geometries of the parts themselves. Yet humans are remarkably skilled at imagining meaningful shape variations even from isolated object instances. For example, having seen a new chair, we can easily imagine its natural variations with a different height back, a wider seat, with or without armrests, or with a different base. In this paper, we investigate how to learn such shape variations directly from 3D data. Specifically, given a shape collection, we are interested in two sub-problems: first, for any given shape, we want to discover the main modes of edits, which can be inferred directly from the shape collection; and second, given an example edit on one shape, we want to transfer the edit to another shape in the collection, as a form of analogy-based edit transfer. This ability is useful in multiple settings, including the design of individual 3D models, the consistent modification of 3D model families, and the fitting of CAD models to noisy and incomplete 3D scans.
There are several challenges in capturing the space of shape variations. First, individual shapes can have different representations as images, point clouds, or surface meshes; second, one needs a unified setting for representing both continuous deformations as well as structural changes (e.g., the addition or removal of parts); third, shape edits are not directly expressed but are only implicitly contained in shape collections; and finally, learning a space of structural variations that is applicable to more than a single shape amounts to learning mappings between different shape edit distributions, since different shapes have different types and numbers of parts (e.g., tables with/without leg bars).
In much of the extant literature on 3D machine learning, 3D shapes are mapped to points in a representation space whose coordinates encode latent features of each shape. In such a representation, shape edits are encoded as vectors in that same space – in other words as differences between points representing shapes. Equivalently, we can think of shapes as “anchored” vectors rooted at the origin, while shape differences are “floating” vectors that can be transported around in the shape space. This type of vector space arithmetic is commonly used
[wu2016learning, achlioptas2017learning, wang2018global, gao2018automatic, xia2015realtime, Villegas_2018_CVPR], for example, in performing analogies, where the difference vector between two latent points is added to a third point to produce the analogous point. The challenge with this view in our setting is that while Euclidean spaces are perfectly homogeneous and vectors can be easily transported and added to points anywhere, shape spaces are far less so. While for continuous variations the vector space model has some plausibility, this is clearly not so for structural variations: the “add arms” vector does not make sense for a point representing a chair that already has arms.
We take a different approach. We consider embedding shape differences or deltas directly in their own latent space, separate from the general shape embedding space. Encoding and decoding such shape differences is always done through a variational autoencoder (VAE), in the context of a given source shape, itself encoded through a part hierarchy. This has a number of key advantages: (i) it allows compact encodings of shape deltas, since in general we aim to describe local variations; (ii) it encourages the network to abstract commonalities in shape variations across the shape space; and (iii) it adapts the edits to the provided source shape, suppressing the modes that are semantically implausible.
We have extensively evaluated StructEdit on publicly available shape data sets. We introduce a new synthetic dataset with ground truth shape edits to quantitatively evaluate our method and compare against baseline alternatives. We then provide evaluation results on the PartNet dataset [mo2019partnet] and provide ablation studies. Finally, we demonstrate that extensions of our method allow handling of both images and point clouds as shape sources, can predict plausible edit modes from single shape examples, and can also transfer example shape edits on one shape to other shapes in the collection (see Figure 1).
Deep generative models of 3D shapes have attracted an increasing amount of research effort recently. Unlike 2D images, 3D shapes can be expressed as volumetric representations [wu2016learning, goodfellow2014generative, yan2016perspective, choy20163d, gwak2017weakly], octrees [tatarchenko2017octree, wang2018adaptive], point clouds [fan2017point, achlioptas2017learning, li2018point, valsesia2018learning, yang2019pointflow, shu20193d], multi-view depth maps [arsalan2017synthesizing], or surface meshes [sinha2017surfnet, groueix2018papier, hanocka2019meshcnn]. Beyond low-level shape representations, object structure can be modeled along with geometry [nash2017shape, wang2018global, tian2018learning] to focus on the part decomposition of objects or hierarchical structures across object families during the generation process [wu2018structure, grass:li:2017, mo2019structurenet]. Similar to StructureNet [mo2019structurenet], we utilize the hierarchical part structure of 3D shapes as defined in PartNet [mo2019partnet]. However, our generative model directly encodes structural deltas instead of the shapes, which, as we demonstrate, is more suitable for significant shape modifications and edit transfers.
Structure-aware shape editing is a long-standing research topic in shape analysis and manipulation. Early works [kraevoy2008non, xu2009joint] analyzed individual input shapes for structural constraints by leveraging local but adaptive deformation to adjust shape edits according to the content of the shape. Global constraints were subsequently used in the form of parallelism, orthogonality [gal2009iwires], or high-level symmetry and topological variations [wang2011symmetry, bokeloh2012algebraic]. However, analyzing shapes in isolation can lead to spuriously detected structural constraints and cannot easily be generalized to handle a large number of shapes. Hence, follow-up works [ovsjanikov2011exploration, fish2014meta, yumer2015semantic] analyze a family of 3D shapes to decipher the shared underlying geometric principles. Recently, utilizing deep neural networks, Yumer and Mitra [yumer2016learning] learn how to generate deformation flows guided by some high-level intentions through 3D volumetric convolutions, while free-form deformation is learned [kurenkov2018deformnet, jack2018learning] to capture how shapes tend to deform within a category, or 3D mesh deformation is predicted [wang20193dn] conditioned on an input target image, with high-level deformation priors encoded through networks. However, by preserving input shape topology, these works greatly limit the possible edit space. Instead, we develop a deep neural network to capture the common structural variations within shape collections, and enable versatile edits with both geometric and topological changes.
Shape deformation transfer aims at transferring a deformation imposed on a source shape onto a target shape. This requires addressing how to represent shape edits, and how to connect the source and target pairs so that the edits are transferable. Early works used deformation flow or piecewise affine transformation to represent shape edits with explicit correspondence information [sumner2004deformation], or via functional spaces to represent shape differences [corman2017functional, rustamov2013map]. Correspondence is established either pointwise [sumner2004deformation, yang2018biharmonic, zhou2010deformation, ma2014analogy], or shapes are associated using higher-level abstractions like cages [ben2009spatial, chen2010cage], patches [baran2009semantic], or parts [xu2010style]. Recent efforts adapt neural representations for shapes via latent vector spaces, and then generate codes for shape edits by directly taking differences between latent shape representations. They either represent the source and target shapes in the same latent space and directly transfer the edit code [wu2016learning, achlioptas2017learning, wang2018global], or learn to transform the edit code from source shape domain to target shape domain [gao2018automatic].
Shape edit transfer is also related to motion retargeting [xia2015realtime, Villegas_2018_CVPR] where the shape deformations are usually restricted to topology-preserving changes. In contrast, we directly encode shape deltas, leading to more consistent edit transfers, even with significant topological changes.
A shape difference, or delta, is defined as a description of the transformation of a source shape into a target shape. Our goal is to learn a generative model of the conditional distribution of deltas given a source shape that accurately captures all deltas in the dataset, and has a high degree of consistency between the conditional distributions of different source shapes (see Figure 5).
We represent a shape as a hierarchical assembly of parts that captures both the geometry and the structure of a shape. The part assembly is modeled as an n-ary tree, consisting of a set of parts that describe the geometry of the shape, and a set of edges that describes the structure of the shape. See Figure 2 (left) for an example. Each part is represented by an oriented bounding box, parameterized by its center, a quaternion defining its orientation, and the extent of the box along each axis, together with the semantics of the part, chosen from a pre-defined set of categories. These parts form a hierarchy that is modeled by the edges of the n-ary tree. Starting at the root node that consists of a bounding box for the entire shape, parts are recursively divided into their constituent parts, with edges connecting parents to child parts. A chair, for example, is divided into a backrest, a seat, and a base, which are then, in turn, divided into their constituent parts until reaching the smallest parts at the leaves.
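As a rough illustration of this representation, the following Python sketch stores each part as an oriented box with a semantic label and a list of children. The class name `Part`, its field names, and the toy chair are hypothetical choices made for this example; they are not identifiers or data from the paper or its code release.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Part:
    """One node of the part hierarchy: an oriented box plus a semantic label."""
    label: str                                 # semantic category, e.g. "seat"
    center: Tuple[float, float, float]         # box center
    quat: Tuple[float, float, float, float]    # orientation as (w, x, y, z)
    extent: Tuple[float, float, float]         # box size along each local axis
    children: List["Part"] = field(default_factory=list)

    def leaves(self) -> List["Part"]:
        """Collect the finest-level parts at or below this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def edges(self) -> List[Tuple[str, str]]:
        """Parent-to-child edges of the subtree as (parent label, child label)."""
        out = []
        for child in self.children:
            out.append((self.label, child.label))
            out.extend(child.edges())
        return out


IDENTITY = (1.0, 0.0, 0.0, 0.0)  # unit quaternion, no rotation

# Toy chair: root -> (back, seat, base), base -> two legs.
chair = Part("chair", (0.0, 0.5, 0.0), IDENTITY, (1.0, 1.0, 1.0), [
    Part("back", (0.0, 0.75, -0.45), IDENTITY, (1.0, 0.5, 0.1)),
    Part("seat", (0.0, 0.45, 0.0), IDENTITY, (1.0, 0.1, 1.0)),
    Part("base", (0.0, 0.2, 0.0), IDENTITY, (1.0, 0.4, 1.0), [
        Part("leg", (-0.45, 0.2, 0.0), IDENTITY, (0.1, 0.4, 0.1)),
        Part("leg", (0.45, 0.2, 0.0), IDENTITY, (0.1, 0.4, 0.1)),
    ]),
])
```

Calling `chair.leaves()` walks the hierarchy down to the smallest parts, and `chair.edges()` lists the parent-child structure that the method treats separately from the box geometry.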
Given a source shape and a target shape, we first find corresponding parts in both shapes, based on parameters, semantics, and hierarchy. We find the part-level assignment between the parts of the source shape and the parts of the target shape, starting at the children of the root parts and then recursively matching the children of matched parts until reaching the leaves. We only match pairs of parts with the same semantics, using a linear assignment based on the distance between the bounding boxes of the parts. As a measure of similarity between two parts, we cannot directly use the distance between their box parameters, as multiple parameter sets describe the same bounding box. Instead, we measure the distance between point clouds sampled on the bounding boxes:
$$d_{\mathrm{box}}(B_1, B_2) = d_{\mathrm{ch}}\big(\mathcal{X}(B_1), \mathcal{X}(B_2)\big), \tag{1}$$
where $B_1$ and $B_2$ are the two bounding boxes, $\mathcal{X}$ is a function that samples a bounding box with a set of points, and $d_{\mathrm{ch}}$ is the Chamfer distance [Barrow:1977:Chamfer, fan2017point] between two point clouds.
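The following Python sketch illustrates why a sampled-point distance is preferable to a parameter distance, and how it can drive a part assignment. It rests on several assumptions that are not taken from the paper: boxes are sampled on a small regular lattice rather than randomly, the Chamfer distance uses mean squared nearest-neighbor distances (one common variant), and the linear assignment is solved by exhaustive search, which is only viable for a handful of parts. All function names are hypothetical, and the restriction to parts with equal semantics is left to the caller.

```python
import math
from itertools import permutations, product


def quat_rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    tx, ty, tz = 2 * (y * vz - z * vy), 2 * (z * vx - x * vz), 2 * (x * vy - y * vx)
    return (vx + w * tx + (y * tz - z * ty),
            vy + w * ty + (z * tx - x * tz),
            vz + w * tz + (x * ty - y * tx))


def sample_box(center, quat, extent, n=3):
    """Sample an n*n*n lattice of points inside an oriented box."""
    ticks = [i / (n - 1) - 0.5 for i in range(n)]
    points = []
    for u, v, w in product(ticks, repeat=3):
        r = quat_rotate(quat, (u * extent[0], v * extent[1], w * extent[2]))
        points.append(tuple(c + d for c, d in zip(center, r)))
    return points


def chamfer(a, b):
    """Symmetric Chamfer distance with mean squared nearest-neighbor terms."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def box_distance(box_a, box_b):
    """Distance between two boxes, each given as (center, quat, extent)."""
    return chamfer(sample_box(*box_a), sample_box(*box_b))


def match_parts(src_boxes, tgt_boxes):
    """Minimum-cost assignment by brute force; returns (src index, tgt index) pairs."""
    swap = len(src_boxes) > len(tgt_boxes)
    small, large = (tgt_boxes, src_boxes) if swap else (src_boxes, tgt_boxes)
    cost = [[box_distance(s, l) for l in large] for s in small]
    best = min(permutations(range(len(large)), len(small)),
               key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)))
    return [(j, i) if swap else (i, j) for i, j in enumerate(best)]


# The same box written with two different parameter sets: box_b is rotated
# by 90 degrees about z and has its x and y extents swapped.
half = math.pi / 4
box_a = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
box_b = ((0.0, 0.0, 0.0), (math.cos(half), 0.0, 0.0, math.sin(half)), (1.0, 2.0, 1.0))
box_c = ((3.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (2.0, 1.0, 1.0))  # shifted copy of box_a
```

Here `box_a` and `box_b` differ in their parameters but occupy the same region, so `box_distance(box_a, box_b)` is zero up to rounding, while the shifted `box_c` is clearly separated from both. For realistic part counts, a polynomial-time solver such as the Hungarian algorithm would replace the exhaustive search.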
Given the assignment, the shape delta consists of three sets of components: a set of part deltas that model geometric transformations from source parts to corresponding target parts, a set of deleted parts, and a set of added parts. Additional edges connect the added parts to their parents. Note that the parents can also be other added parts. Each part in the source shape can either be associated with a part delta or be deleted (see Figure 2).
A part delta defines the parameter differences between a source part and the corresponding target part. Deleted parts are the source parts that are not assigned to any target part. Added parts along with their structure form zero, one, or multiple sub-trees that extend the n-ary tree of the source shape. Note that edges that are not adjacent to an added part are not stored in the shape delta but inferred from the source shape.
Applying a shape delta to a source shape gives us the target shape. The parts of the target shape are the source parts that are not deleted, each modified by its part delta, together with the added parts. The edges of the target shape are the source edges that are not adjacent to any removed part, together with the edges of the added parts. Note that our shape delta representation encodes both structural and geometric differences between the two shapes involved and represents a minimal program for transforming the source shape to the target shape.
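The application of a shape delta can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: parts are stored in a dictionary keyed by a hypothetical part id, box parameters are a flat tuple (center followed by extent, with orientation omitted for brevity), and a part delta is an element-wise additive difference. The dictionary layout and the name `apply_delta` are invented for this example.

```python
def apply_delta(shape, delta):
    """Return the target shape obtained by applying a shape delta to a source shape.

    shape: {"parts": {id: params}, "edges": {(parent_id, child_id), ...}}
    delta: {"part_deltas": {id: param differences},
            "deleted": {id, ...},
            "added": {id: params},
            "added_edges": {(parent_id, child_id), ...}}
    """
    parts = {}
    for pid, params in shape["parts"].items():
        if pid in delta["deleted"]:
            continue  # deleted parts do not appear in the target
        diff = delta["part_deltas"].get(pid, (0.0,) * len(params))
        parts[pid] = tuple(p + d for p, d in zip(params, diff))
    parts.update(delta["added"])

    # Keep source edges that do not touch a removed part, then add new edges.
    edges = {(a, b) for a, b in shape["edges"]
             if a not in delta["deleted"] and b not in delta["deleted"]}
    edges |= delta["added_edges"]
    return {"parts": parts, "edges": edges}


# Toy edit: widen the seat, remove the arm, add a backrest.
source = {
    "parts": {
        "chair": (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
        "seat": (0.0, 0.5, 0.0, 1.0, 0.25, 1.0),
        "arm": (0.5, 0.75, 0.0, 0.125, 0.25, 1.0),
    },
    "edges": {("chair", "seat"), ("chair", "arm")},
}
edit = {
    "part_deltas": {"seat": (0.0, 0.0, 0.0, 0.5, 0.0, 0.0)},
    "deleted": {"arm"},
    "added": {"back": (0.0, 1.0, -0.5, 1.0, 1.0, 0.125)},
    "added_edges": {("chair", "back")},
}
target = apply_delta(source, edit)
```

Note that only the edges of added parts are stored in the delta; all remaining edges are inferred from the source shape, which keeps the delta compact.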
We train a conditional variational autoencoder (cVAE) consisting of an encoder that encodes a shape delta into a latent code, and a decoder that decodes a latent code back into a shape delta. Both the encoder and decoder are conditioned on the source shape. Providing access to the source shape encourages the cVAE to learn a distribution of deltas conditional on the source shape. Both the encoder and decoder make use of a shape encoder to generate a feature vector and intermediate features of the source shape.
The encoders and decoders are specialized networks that operate on trees of parts or shape delta components, and are applied recursively on the respective trees. We set the dimensionality of the latent code and all intermediate feature vectors computed by the encoders and decoders to .
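For readers unfamiliar with variational autoencoders, the sketch below shows the two ingredients that make an autoencoder variational: sampling the latent code with the reparameterization trick, and a KL term that regularizes the code towards a standard normal prior. This is the standard Gaussian formulation and is an assumption here; the text does not spell out the exact regularizer used. The function names are hypothetical, and the conditioning on the source shape is omitted.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
```

In a conditional VAE, both the network that predicts `mu` and `log_var` and the network that decodes `z` additionally receive features of the source shape as input.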
The encoder computes two feature vectors for each part in a source shape, respectively encoding information about the part and its subtree.
The box features are computed for the geometry of each part of the source shape using the encoder :
(3) |
where denotes the concatenation of the box parameters of Part . The subtree features at each part are computed with a recursive encoder. For the leaf parts, we define . For non-leaf parts, we recursively pool their child features with encoder :
(4) |
where is the set of child parts, is the semantic label of a part, and the square brackets denote concatenation. The encoder uses a small PointNet [qi2017pointnet]
with max-pooling as symmetric operation. The PointNet architecture can encode an arbitrary number of children and ensures that the subtree feature is invariant to the ordering of the child parts. See the Supplementary for architecture details.
The shape delta encoder computes a feature for each component in the component sets of the shape delta. Each feature describes the component and its sub-tree. The feature of the root component is used as latent vector for the shape delta. We encode a shape delta recursively, following the tree structure of the source shape extended by the added parts. Components in the set of part additions and their edges are encoded analogously to the source shape encoder:
(5) |
where are child parts of that include newly added parts, and for added parts that are leaves.
Part deltas and deletions modify existing parts of the source shape. For components in both of these sets, we encode information about the part delta and the corresponding source part side-by-side, using the encoder :
(6) |
where the component type (delta or deletion) is given as a one-hot encoding and the part delta is described by a feature vector, which we set to zero for deleted parts. The box geometry of the source part is described by the feature vector defined in Eq. 3. Finally, features of the child components are pooled by a further encoder:
(7) |
The encoder is implemented as a small PointNet and returns zeros for leaf nodes that do not have children.
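The recursive, order-invariant pooling used by these encoders can be illustrated with a toy Python sketch. The learned layers are replaced by a fixed stand-in (a weighted sum followed by a ReLU), so this shows only the control flow: recurse over children, transform each child feature, max-pool across children, and return zeros for nodes without children. The dictionary layout and the name `encode_subtree` are hypothetical.

```python
def encode_subtree(node, dim):
    """Toy subtree feature: max-pool over transformed child features.

    node: {"feat": [float] * dim, "children": [node, ...]} (children optional)
    """
    children = node.get("children", [])
    if not children:
        return [0.0] * dim  # nodes without children contribute zeros
    per_child = []
    for child in children:
        sub = encode_subtree(child, dim)
        # Stand-in for a learned per-child layer: mix the child's own feature
        # with its subtree feature, then apply a ReLU.
        per_child.append([max(0.0, a + 0.5 * b)
                          for a, b in zip(child["feat"], sub)])
    # Max-pooling is symmetric, so the result ignores the child ordering.
    return [max(column) for column in zip(*per_child)]


part_a = {"feat": [1.0, -2.0]}
part_b = {"feat": [0.5, 3.0]}
tree = {"feat": [0.0, 0.0], "children": [part_a, part_b]}
tree_reordered = {"feat": [0.0, 0.0], "children": [part_b, part_a]}
```

Because the pooling takes a per-dimension maximum, `tree` and `tree_reordered` produce the same feature, and the function accepts any number of children.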
The shape delta decoder reconstructs a part delta or a deletion for each part of the source shape, and recovers any added nodes and their edges . The shape delta is decoded recursively, starting at the root part of the source shape. We use two decoders to compute the feature vectors; one decoder for part deltas and deleted parts, and one for added parts.
For part deltas and deleted parts , the decoder computes from parent feature and source part:
(8) |
where the inputs are the feature vector of the parent part and the features describing the source part and the source part subtree, defined in Equations 3 and 4. We include the latent code of the shape delta in each decoder to provide a more direct information pathway. We then classify the resulting feature into one of two types (delta or deletion) using a classifier. For part deltas, we reconstruct the box difference parameters with a separate decoder. For deleted parts, no further parameters need to be reconstructed.
Feature vectors for added parts are computed by a decoder that takes as input the parent feature and outputs a list of child features. This list has a fixed length, but we also predict an existence probability for each feature in the list. Features whose existence probability falls below a threshold are discarded. In our experiments, we decode a fixed number of features and probabilities per parent. The decoder is defined as
(9) |
where the input is the parent feature and the outputs are the child features with their corresponding existence probabilities. We realize the decoder as a single-layer perceptron (SLP) that outputs the concatenated child features, followed by another SLP that is applied to each of them with shared parameters to obtain the features and probabilities. Once the feature for an added part is computed, we reconstruct the added part with a further decoder. We stop the recursion when the existence probability for all child parts falls below the threshold. To improve robustness, we additionally classify a part as a leaf or non-leaf part, and do not apply the decoder to the leaf nodes. We use two instances of this decoder that do not share parameters, one for the added parts that are children of part deltas, and one for added parts that are children of other added parts.
We train our cVAE with a dataset of pairs of source shapes and shape deltas. Each shape delta is an edit of the source shape that yields a target shape. Both source shapes and target shapes are part of the same shape dataset. Shape deltas represent local edits, so we can create shape deltas by computing the difference between a shape and the shapes in its local neighborhood (see Section 4).
We train the cVAE to minimize the reconstruction loss between the input shape delta and the reconstructed shape delta :
(10) |
The reconstruction loss consists of four main terms, corresponding to the component types , a classification loss for the predicted components into one of the component types, and the variational regularization . Since we do not decode parameters for deleted parts , there is no loss term for these components beyond the classification loss. Empirically, we set .
The part delta loss measures the reconstruction error of the bounding box delta:
where is the bounding box distance defined in Eq. 1. Correspondences between the input part deltas and reconstructed part deltas are known, since each part delta corresponds to exactly one part of the source shape.
The classification loss is defined as the cross entropy between component type and reconstructed type :
(11) |
The added part loss measures the reconstruction error for the added parts. Unlike part deltas and deleted parts, added parts do not correspond to any part in the source shape. Using the inferred assignment (see Section 3.2) – matched parts share indices, and denotes the set of added parts in the reconstructed shape delta that have a match – the loss is defined as:
(12) |
the first term defines the reconstruction loss for all matched parts, while the second term defines the loss for the existence probabilities of both matched and unmatched parts (see Eq. 9). The indicator function returns for matched parts and for unmatched parts. The loss for matched parts measures box reconstruction error, the part semantics, and the leaf/non-leaf classification of a part:
(13) |
where is the semantic label of part , is the predicted probability for part to be a leaf part, and is the set of leaf parts. We set to .
We evaluate our main claims with three types of experiments. To show that encoding shape deltas more accurately captures the distribution of deltas compared to encoding shapes, we perform reconstruction and generation of modified shapes using our method, and compare to a state-of-the-art method for directly encoding shapes. To show that encoding shape deltas gives us a distribution that is more consistent between different source shapes, we perform edit transfer and measure the consistency of the transferred edits. Additionally, we show several applications. Ablation studies are provided in the supplementary.
We use two distance measures between two shapes. The geometric distance between two shapes is defined as the Chamfer distance between two point clouds of size sampled randomly on the bounding boxes of each shape. The structural distance is defined by first finding a matching between the parts of two shapes (Section 3.2), and then counting the total number of unmatched parts in both shapes, normalized by the number of parts in the first shape.
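The structural distance reduces to simple counting once a part matching is available. The sketch below assumes the matching is given as a list of index pairs, one per matched part; the function name and argument layout are hypothetical.

```python
def structural_distance(num_parts_first, num_parts_second, matches):
    """Unmatched parts in both shapes, normalized by the first shape's part count.

    matches: list of (index in first shape, index in second shape) pairs,
             with each part appearing in at most one pair.
    """
    unmatched_first = num_parts_first - len(matches)
    unmatched_second = num_parts_second - len(matches)
    return (unmatched_first + unmatched_second) / num_parts_first


# Five parts against four parts with three matched pairs:
# two parts of the first shape and one of the second remain unmatched.
example = structural_distance(5, 4, [(0, 0), (1, 2), (3, 1)])
```

Since the normalization uses only the first shape, the measure is not symmetric: swapping the two shapes generally changes the value.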
We train and evaluate on datasets that contain pairs of source shapes and shape deltas . To create these datasets, we start with a dataset of structured shapes that we use as source shapes, and then take the difference to neighboring shapes to create the deltas.
The first dataset we use for training and evaluation is the PartNet dataset [mo2019partnet] generated from a subset of ShapeNet [chang2015shapenet] with annotated hierarchical decomposition of each object into labelled parts (see Section 3.2). We train separately on the categories chair, table, and furniture. There are 4871 chairs, 5099 tables, and 862 cabinets in total. We use the official training and test splits as used in [mo2019structurenet].
We define neighborhoods as k-nearest neighbors, according to two different metrics:
Geometric neighborhoods are based on the geometric distance and highlight edits that focus on structural modifications.
Structural neighborhoods are based on the structural distance and highlight edits that focus on geometric modifications.
See Figure 3 for an illustration. We set in our training sets. We choose for our test sets to obtain approximately the same neighborhood radius.
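Building the training pairs from neighborhoods can be sketched as follows: for every source shape, take its k nearest neighbors under the chosen distance and emit one (source, target) pair per neighbor, from which a shape delta is then computed. The function name `knn_pairs` is hypothetical, and the toy example uses numbers with an absolute-difference distance as stand-ins for shapes and the geometric or structural distance.

```python
def knn_pairs(shapes, distance, k):
    """Return (source index, target index) pairs for each shape's k nearest neighbors."""
    pairs = []
    for i, source in enumerate(shapes):
        others = [j for j in range(len(shapes)) if j != i]
        others.sort(key=lambda j: distance(source, shapes[j]))
        pairs.extend((i, j) for j in others[:k])
    return pairs


# Toy "shapes" on a line; each pair would become one training delta.
toy_shapes = [0.0, 1.0, 1.5, 5.0]
toy_pairs = knn_pairs(toy_shapes, lambda a, b: abs(a - b), k=1)
```

Swapping the distance function switches between geometric and structural neighborhoods without changing the rest of the pipeline. Neighborhoods are not symmetric in general: in the toy example, shape 3 selects shape 2 as its neighbor, but not the other way around.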
To evaluate the consistency of our conditional distributions of shape deltas between different source shapes, we need a ground truth for the correspondence between edits of different source shapes. In the absence of any suitable benchmark, we introduce a new synthetic dataset where source shapes and edits are created procedurally, giving us the correspondence between edits by construction. The synthetic dataset consists of three groups of source shapes: stools, sofas, and chairs. These groups are closed with respect to the edits. Inside each group, all edits have correspondences. Between groups, only some edits have correspondences. For example, stools have no correspondence for edits that modify the backrest, but do have correspondences for edits that modify the legs. For details on the procedural generation, please see the supplementary.
We compare our method to StructureNet [mo2019structurenet], a method that learns a latent space of shapes with the same hierarchical structure as ours, but does not encode edits. StructureNet can additionally encode relationship edges between sibling parts, but for a fair comparison, we only encode the hierarchical structure. Additionally, we compare to a baseline that only models identity edits and always returns the source shape as an upper bound for our error metrics.
To measure the reconstruction performance of our method, we train our architecture without the variational regularization on the PartNet dataset, and the evaluation is based on geometric distances and structural distances :
(14) |
where is the input delta, and the reconstructed delta. We normalize the distances by the average distance of a neighbor from the source shape in the given dataset.
neighborhood | method | chair (geom.) | table (geom.) | furn. (geom.) | avg. (geom.) | chair (struct.) | table (struct.) | furn. (struct.) | avg. (struct.)
geometric | ID | 1.000 | 1.000 | 1.001 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000
geometric | SN | 0.886 | 0.972 | 0.875 | 0.911 | 0.656 | 0.492 | 0.509 | 0.553
geometric | SE | 0.755 | 0.805 | 0.798 | 0.786 | 0.531 | 0.414 | 0.434 | 0.459
structural | ID | 0.946 | 0.940 | 0.951 | 0.945 | 1.107 | 1.341 | 1.124 | 1.191
structural | SN | 0.264 | 0.370 | 0.388 | 0.340 | 0.734 | 1.469 | 0.915 | 1.039
structural | SE | 0.082 | 0.151 | 0.139 | 0.124 | 0.136 | 0.246 | 0.183 | 0.188
Results for both metrics and both neighborhoods are given in Table 1. Geometric neighborhoods have large structural variations, but low geometric variations. For geometric distances, the source shape is therefore already a good approximation of the neighborhood, and the identity baseline performs close to the other methods. In contrast, with structural distances we see a much larger spread. For structural neighborhoods, most of the neighbors share a similar structure. Here, StructureNet’s reconstruction errors of the source shape become apparent, showing a structural error comparable to the identity baseline. StructEdit, on the other hand, only needs to encode local shape deltas. We benefit from the large degree of consistency between the deltas of different source shapes, allowing us to encode local neighborhoods more accurately. The effects of this benefit can also be confirmed visually in the lower part of Table 1.
neighborhood | method | chair (geom.) | table (geom.) | furn. (geom.) | avg. (geom.) | chair (struct.) | table (struct.) | furn. (struct.) | avg. (struct.)
geometric | ID | 1.822 | 1.763 | 1.684 | 1.756 | 1.629 | 1.479 | 1.446 | 1.518
geometric | SN (sigma 1) | 1.760 | 2.076 | 1.626 | 1.821 | 1.308 | 1.208 | 1.243 | 1.253
geometric | SN (sigma 2) | 1.722 | 2.068 | 1.558 | 1.783 | 1.241 | 1.103 | 1.135 | 1.160
geometric | SN (sigma 3) | 1.768 | 2.189 | 1.554 | 1.837 | 1.232 | 1.057 | 1.017 | 1.102
geometric | SE | 1.593 | 1.655 | 1.561 | 1.603 | 1.218 | 1.000 | 1.015 | 1.078
structural | ID | 1.281 | 1.215 | 1.288 | 1.261 | 1.437 | 1.303 | 1.442 | 1.394
structural | SN (sigma 1) | 1.081 | 0.878 | 1.015 | 0.991 | 1.466 | 3.484 | 1.414 | 2.121
structural | SN (sigma 2) | 0.871 | 0.729 | 0.873 | 0.824 | 1.373 | 3.300 | 1.204 | 1.959
structural | SN (sigma 3) | 0.751 | 0.667 | 0.726 | 0.715 | 1.763 | 3.622 | 1.167 | 2.184
structural | SE | 0.559 | 0.524 | 0.741 | 0.608 | 0.609 | 0.451 | 0.676 | 0.579
Next, we report the difference of our learned distribution to the ground truth distribution using two measures. The coverage error measures the average distance from a ground truth sample to the closest generated sample, while the quality error measures the average distance from a generated sample to the closest ground truth sample.
where the generated shape delta is sampled according to the learned distribution . We use samples per source shape in our experiments, and average over all source shapes in the test set . is the neighbor count of each neighborhood in the test set. We evaluate the quality and coverage errors with both geometric distances and structural distances . The coverage and quality metrics can be combined by adding the two metrics for each source shape, giving the Chamfer distance between the generated samples and the ground truth samples of each source shape, denoted as , where can be or .
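The two error measures can be sketched generically in Python. The functions below take any distance callable, so the same code covers the geometric and the structural variant; the function names are hypothetical, and the toy example uses numbers with an absolute-difference distance in place of shapes.

```python
def coverage_error(ground_truth, generated, distance):
    """Average distance from each ground truth sample to its closest generated sample."""
    return sum(min(distance(g, s) for s in generated)
               for g in ground_truth) / len(ground_truth)


def quality_error(ground_truth, generated, distance):
    """Average distance from each generated sample to its closest ground truth sample."""
    return sum(min(distance(s, g) for g in ground_truth)
               for s in generated) / len(generated)


def abs_diff(a, b):
    return abs(a - b)


# Toy neighborhood of three ground truth targets, the first equal to the source.
truth = [0.0, 2.0, 4.0]
identity_samples = [0.0]        # a baseline that always returns the source
spread_samples = [0.0, 2.5, 4.0]
```

The toy numbers mirror the trade-off discussed in the text: a single sample at the source gives a quality error of zero but leaves most of the neighborhood uncovered, while samples spread over the neighborhood lower the coverage error at a small cost in quality. Adding the two errors gives the combined Chamfer-style measure.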
Table 6 shows the results of our method compared to the baselines on each dataset. The identity baseline has low quality error, but extremely high coverage error, since it approximates the distribution of deltas for each source shape with a single sample near the mode of the distribution. Both StructureNet and StructEdit approximate the distribution more accurately. Since we cannot learn neighborhoods explicitly in StructureNet, we sample from Gaussians in latent space that are centered at the source shape, using three different sigmas. Larger sigmas improve coverage at the cost of quality. StructEdit encodes deltas explicitly, allowing us to learn different types of neighborhoods and to make use of the similarity between the delta distributions at different source shapes. This is reflected in a significantly lower error in nearly all cases. The supplementary provides separate quality and coverage for each entry.
Figure 4 shows examples of multiple edits generated for several source shapes. Learning an accurate distribution of shape deltas allows us to expose a wide range of edits for each source shape. Our method can learn different types of neighborhoods, corresponding to different types of edits. We can see that properties of these neighborhoods are preserved in our learned distribution: geometric neighborhoods preserve the overall proportions of the source shape and have a large variety of structures, while the reverse is true for structural neighborhoods. We show interpolations between edits in the supplementary.
Edit transfer maps a shape delta from one source shape to a different source shape. First, we encode the shape delta conditioned on the first source shape, and then decode it conditioned on the other source shape. Since the two source shapes generally have a different geometry and structure, the edit needs to be adapted to the new source shape by the decoder. The two edits should perform an analogous operation on both shapes. Our synthetic dataset provides a ground truth for analogous shape deltas. Shapes in this dataset are divided into groups of shapes, and the shape delta between any pair of shapes in a group has a known analogy in all of the other groups. When transferring an edit, we measure the geometric distance and structural distance of the modified shape from the ground truth analogy.
(15) |
where is the ground truth analogy and the predicted analogy. In case an edit is not transferable, such as adding an armrest to a shape that already has an armrest, we define an identity edit that leaves the source shape unchanged.
distance | method | chair | sofa | stool | c. s. | c. st. | avg.
geometric | Identity | 1.002 | 0.938 | 0.892 | 0.892 | 0.938 | 0.932
geometric | StructureNet | 0.868 | 0.764 | 0.721 | 0.888 | 1.307 | 0.910
geometric | StructEdit | 0.586 | 0.566 | 0.599 | 0.572 | 0.698 | 0.604
structural | Identity | 0.941 | 1.328 | 0.333 | 0.333 | 1.328 | 0.853
structural | StructureNet | 0.208 | 0.161 | 0.025 | 0.671 | 0.871 | 0.387
structural | StructEdit | 0.005 | 0.001 | 0.003 | 0.002 | 0.123 | 0.027
Table 3 compares the transfer error of our method to each baseline on the synthetic dataset. We show both transfers within the same group, and transfers between different groups, where some edits are not applicable and need to be mapped to identity. Our method benefits from consistency between deltas and achieves lower error. In absence of ground truth edit correspondence for PartNet we qualitatively compare edit transfers in Figure 5. Our transferred edits better mirror the given edit, both in the properties of the source shape that it modifies, and in the properties that it preserves. The given edit in the first row is transferred to the source shapes in the other rows.
Our latent space of shape deltas can be used for several interesting applications, such as the edit transfers we showed earlier. Here we demonstrate two additional applications.
First, we explore variations for raw, unlabelled point clouds. We can transform the point cloud into our shape representation using an existing method [mo2019partnet], generate and apply edits, and then apply the changes back to the point cloud. For details, please see the supplementary. Results on several raw point clouds sampled from ShapeNet [chang2015shapenet] meshes are shown in Figure 6.
Second, we create cross-modal analogies, between images and point clouds. The images can be converted to our shape representation using StructureNet [mo2019structurenet]. This allows us to define an edit from a pair of images, and to transfer this edit to a point cloud, using the same approach as described previously. Details are given in the supplementary. Results for several point clouds and image pairs are shown in Figure 7 on data from the ShapeNet dataset.
We presented a method to encode shape edits, represented as shape deltas, using a specialized cVAE architecture. We have shown that encoding shape deltas instead of absolute shapes has several benefits, such as more accurate edit generation and edit transfer, which have important applications in 3D content creation. In the future, we would like to explore additional neighborhood types, add sparsity constraints to modify only a sparse set of parts, and encode chains of edits. While we demonstrated consistency of shape deltas in their latent space, our method remains restricted to class-specific transfers. It would be interesting to try to train collectively across different but closely related shape classes.
This research was supported by NSF grant CHS-1528025, a Vannevar Bush Faculty Fellowship, KAUST Award No. OSR-CRG2017-3426, an ERC Starting Grant (SmartGeometry StG-2013-335373), ERC PoC Grant (SemanticCity), Google Faculty Awards, Google PhD Fellowships, Royal Society Advanced Newton Fellowship, and gifts from the Adobe, Autodesk, Google Corporations, and the Dassault Foundation.
We provide statistics of the PartNet [mo2019partnet, mo2019structurenet] dataset, as well as the synthetic dataset, and show a few samples from each. Additionally, we discuss the procedural shape generation pipeline that we use to create the synthetic dataset.
In our experiments, we use four datasets. Three are taken from PartNet [mo2019partnet]: 4,871 chairs, 5,099 tables, and 862 cabinets. Additionally, we create a synthetic dataset of 57,600 chairs, sofas, and stools, where we have a ground-truth correspondence between the deltas in the neighborhoods of different source shapes. For each dataset, we use the same hierarchical bounding box representation. Figure 8 shows examples from each dataset, and more statistics are summarized in Table 4.
dataset | #shapes | tree depth | #leafs | |
PartNet chair | 4871 | 4.039 | 11.097 | 100/20 |
PartNet table | 5099 | 5.127 | 7.537 | 100/20 |
PartNet furniture | 862 | 4.522 | 14.377 | 100/20 |
Synthetic | 57600 | 3.667 | 10.111 | 96/96 |
In this section, we introduce the procedural generation pipeline of the synthetic dataset. In the procedural generation, we explicitly create shape deltas for each source shape. This gives us knowledge of the ground truth correspondences between the shape deltas of different source shapes. We use this ground truth to quantitatively evaluate our edit transfer performance.
A chair shape consists of four basic components: a back, a seat, an optional pair of arms, and a leg base with possibly different types of stretcher bars connecting the four legs. We randomly sample 8 global parameters for each shape: leg width and height; seat width, depth, and height; and back width, depth, and height. All parameters for the other parts are deterministically derived from the eight global parameters or assigned fixed values. For example, chair arm depth is half of the seat depth, and all stretcher bars have a fixed height of 0.03. Combinations of values for the 8 global parameters give us a large set of source shapes.
We create structural variations for each of these shapes by changing the structure of individual parts. For each shape, we create 4 variants for the back (e.g. with or without back bars, with vertical bars or horizontal bars), 2 variants for the legs (e.g. short or long), 3 variants for the arms (e.g. with or without arms, different layouts for armrest and arm support), and 4 variants for the leg stretchers (e.g. squared layout, H-like layout, or X-like layout). In total, we make 4 × 2 × 3 × 4 = 96 structural variants for the same shape. A few examples of these variants are shown in Figure 9. We normalize all generated shapes to fit within a unit sphere. In this procedural dataset, corresponding variants of different source shapes have the same index. Thus, for the edit between two variants of one shape, we can define the ground truth for the edit transfer as the edit between the variants with the same two indices of the other shape.
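The variant enumeration and the index-based ground truth can be sketched as follows. This is only an illustration: the ordering of variants, the `(group_id, variant_index)` identifiers, and the function names are assumptions and not part of our generation pipeline.

```python
from itertools import product

# Number of structural variants per component, as listed in the text.
VARIANT_COUNTS = {"back": 4, "legs": 2, "arms": 3, "stretchers": 4}


def enumerate_variants():
    """Return every combination of per-component variant choices."""
    ranges = [range(n) for n in VARIANT_COUNTS.values()]
    return list(product(*ranges))


# The position in this list serves as the variant index (assumed ordering).
VARIANTS = enumerate_variants()


def ground_truth_analogy(source_a, modified_a, source_b):
    """Return the ground-truth result of transferring an edit.

    Shapes are hypothetical (group_id, variant_index) pairs. The edit takes
    source_a to modified_a inside one group; the analogous edit takes
    source_b to the variant of its group with the same target index.
    """
    group_a, i = source_a
    group_a_mod, j = modified_a
    group_b, k = source_b
    if group_a != group_a_mod:
        raise ValueError("an edit must stay within one group of variants")
    if k != i:
        raise ValueError("the target source must share the source variant index")
    return (group_b, j)
```

For example, `ground_truth_analogy((0, 5), (0, 17), (3, 5))` yields `(3, 17)`: the edit from variant 5 to variant 17 in group 0 corresponds to the edit from variant 5 to variant 17 in group 3.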
In real chairs, we do not always have correspondences between all possible shape variations. For example, a delta that makes the legs shorter does not have a correspondence in a sofa that does not have legs. To model these differences in the delta neighborhoods of our synthetic chairs, we divide them into three sub-types: 19,200 chairs, 19,200 sofas, and 19,200 stools. The creation of sofa shapes and stool shapes follows the same procedural generative grammar as chair shapes, except that we remove the leg base for sofas and remove the chair back and arms for stools. For each of these sub-types, the dataset comprises 200 groups of shapes, each with 96 structural variations. Between two sub-types, a known subset of the deltas does not have a correspondence. For example, deltas that modify the legs of chairs do not have a correspondence in sofas. We manually set the correspondence of these deltas to the identity edit (i.e. the delta that does not change the shape).
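The dataset size and the identity fallback between sub-types can be checked with a small sketch. The component sets follow the description above; the rule that a delta has a correspondence only if every component it modifies exists in the target sub-type is an assumption made for illustration, as are all names.

```python
# Components present in each sub-type, following the text: sofas have no
# leg base, stools have no back and no arms.
COMPONENTS = {
    "chair": {"back", "seat", "arms", "leg_base"},
    "sofa": {"back", "seat", "arms"},
    "stool": {"seat", "leg_base"},
}

GROUPS_PER_SUBTYPE = 200
VARIANTS_PER_GROUP = 96

shapes_per_subtype = GROUPS_PER_SUBTYPE * VARIANTS_PER_GROUP
total_shapes = shapes_per_subtype * len(COMPONENTS)


def has_correspondence(modified_components, target_subtype):
    """Assumed rule: a delta transfers only if the target sub-type contains
    every component that the delta modifies."""
    return set(modified_components) <= COMPONENTS[target_subtype]


def corresponding_delta(modified_components, target_subtype):
    """Return the components modified by the transferred delta, or the
    identity edit (modelled here as an empty set) if none corresponds."""
    if has_correspondence(modified_components, target_subtype):
        return set(modified_components)
    return set()
```

Under these assumptions, a delta that modifies only the leg base of a chair maps to the identity edit on a sofa, while a delta that modifies the seat transfers to all three sub-types.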
We will release the code for the procedural shape generation pipeline and the generated synthetic dataset.
Individual encoders and decoders in our method share a similar architecture, unless noted otherwise in the main paper. In our experiments, we found that the total number of layers in the encoders and decoders has a significant adverse effect on the performance of the cVAE, especially since the recursive traversal depends on the depth of the shape tree. To keep the number of layers low, we use a relatively simple architecture for all individual encoders and decoders (unless noted otherwise): a multi-layer perceptron (MLP) that has two layers. We also add a skip connection [resnet] that starts at the input and is added to the output to shorten the information path. This simple architecture is illustrated in Figure 10.
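A minimal sketch of such a two-layer MLP with an input-to-output skip connection is given below in plain Python. The leaky ReLU activation, the absence of an output activation, and the requirement that input and output have the same dimension are assumptions made for this sketch and are not specified by the description above.

```python
def linear(x, weights, bias):
    """Affine layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def leaky_relu(values, slope=0.1):
    """Element-wise leaky ReLU (the activation choice is an assumption)."""
    return [v if v > 0 else slope * v for v in values]


def skip_mlp(x, params):
    """Two-layer MLP whose input is added to its output (skip connection).

    This sketch assumes the output dimension equals the input dimension,
    so that the input can be added directly without a projection.
    """
    hidden = leaky_relu(linear(x, params["w1"], params["b1"]))
    out = linear(hidden, params["w2"], params["b2"])
    if len(out) != len(x):
        raise ValueError("skip connection needs matching input/output sizes")
    return [o + xi for o, xi in zip(out, x)]


def zero_params(dim, hidden_dim):
    """All-zero parameters; with these the block reduces to the identity."""
    return {
        "w1": [[0.0] * dim for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [[0.0] * hidden_dim for _ in range(dim)],
        "b2": [0.0] * dim,
    }
```

With all-zero parameters, `skip_mlp` returns its input unchanged, which illustrates how the skip connection shortens the information path: the layers only need to model a residual on top of the input.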
We provide an ablation study for the network architecture design choices and more experimental results with comparisons to our baselines. Note that methods that only encode geometry and not structure, such as methods that represent objects as voxels, point clouds, or implicit functions, are not suitable baselines for our method. The domain we work on consists of both geometry and structure, where structure is an abstraction of geometry. Methods that work on geometry only have a fundamentally different goal than our method. Their outputs, being geometry only, cannot be compared fairly to our output that combines geometry and structure. For this reason we only compare to methods that work on the same domain as ours in our experiments.
No Skip Conn. | 0.900 | 0.201 | 0.713 | 0.364 |
No Group Norm. | 0.749 | 0.083 | 0.525 | 0.142 |
No Leaf Class. | 0.759 | 0.087 | 0.533 | 0.171 |
No Box Deltas. | 1.737 | 0.083 | 1.766 | 0.142 |
Full | 0.754 | 0.082 | 0.531 | 0.136 |
chair | table | furniture | avg. | ||
/ / | Identity | 0.846 / 0.976 / 1.822 | 0.789 / 0.974 / 1.763 | 0.704 / 0.980 / 1.684 | 0.780 / 0.977 / 1.756 |
StructureNet-0.2 | 0.826 / 0.934 / 1.760 | 1.008 / 1.068 / 2.076 | 0.735 / 0.891 / 1.626 | 0.856 / 0.964 / 1.821 | |
StructureNet-0.5 | 0.857 / 0.865 / 1.722 | 1.092 / 0.975 / 2.068 | 0.744 / 0.815 / 1.558 | 0.898 / 0.885 / 1.783 | |
StructureNet-1.0 | 0.940 / 0.828 / 1.768 | 1.270 / 0.918 / 2.189 | 0.789 / 0.765 / 1.554 | 1.000 / 0.837 / 1.837 | |
StructEdit (Ours) | 0.789 / 0.804 / 1.593 | 0.834 / 0.821 / 1.655 | 0.806 / 0.755 / 1.561 | 0.810 / 0.793 / 1.603 | |
/ / | Identity | 0.281 / 1.000 / 1.281 | 0.215 / 1.000 / 1.215 | 0.288 / 1.000 / 1.288 | 0.261 / 1.000 / 1.261 |
StructureNet-0.2 | 0.300 / 0.781 / 1.081 | 0.248 / 0.630 / 0.878 | 0.316 / 0.698 / 1.015 | 0.288 / 0.703 / 0.991 | |
StructureNet-0.5 | 0.324 / 0.547 / 0.871 | 0.284 / 0.445 / 0.729 | 0.314 / 0.559 / 0.873 | 0.307 / 0.517 / 0.824 | |
StructureNet-1.0 | 0.388 / 0.363 / 0.751 | 0.347 / 0.321 / 0.667 | 0.336 / 0.390 / 0.726 | 0.357 / 0.358 / 0.715 | |
StructEdit (Ours) | 0.299 / 0.259 / 0.559 | 0.299 / 0.225 / 0.524 | 0.518 / 0.223 / 0.741 | 0.372 / 0.236 / 0.608 |
chair | table | furniture | avg. | ||
/ / | Identity | 0.651 / 0.978 / 1.629 | 0.499 / 0.980 / 1.479 | 0.467 / 0.980 / 1.446 | 0.539 / 0.979 / 1.518 |
StructureNet-0.2 | 0.557 / 0.751 / 1.308 | 0.501 / 0.707 / 1.208 | 0.450 / 0.793 / 1.243 | 0.502 / 0.750 / 1.253 | |
StructureNet-0.5 | 0.571 / 0.670 / 1.241 | 0.516 / 0.587 / 1.103 | 0.451 / 0.684 / 1.135 | 0.513 / 0.647 / 1.160 | |
StructureNet-1.0 | 0.611 / 0.621 / 1.232 | 0.548 / 0.509 / 1.057 | 0.456 / 0.561 / 1.017 | 0.538 / 0.564 / 1.102 | |
StructEdit (Ours) | 0.581 / 0.637 / 1.218 | 0.501 / 0.499 / 1.000 | 0.521 / 0.494 / 1.015 | 0.534 / 0.543 / 1.078 | |
/ / | Identity | 0.437 / 1.000 / 1.437 | 0.303 / 1.000 / 1.303 | 0.442 / 1.000 / 1.442 | 0.394 / 1.000 / 1.394 |
StructureNet-0.2 | 0.693 / 0.773 / 1.466 | 2.218 / 1.267 / 3.484 | 0.598 / 0.816 / 1.414 | 1.169 / 0.952 / 2.121 | |
StructureNet-0.5 | 0.888 / 0.485 / 1.373 | 2.518 / 0.781 / 3.300 | 0.613 / 0.590 / 1.204 | 1.340 / 0.619 / 1.959 | |
StructureNet-1.0 | 1.413 / 0.350 / 1.763 | 3.099 / 0.523 / 3.622 | 0.750 / 0.417 / 1.167 | 1.754 / 0.430 / 2.184 | |
StructEdit (Ours) | 0.323 / 0.286 / 0.609 | 0.271 / 0.180 / 0.451 | 0.454 / 0.222 / 0.676 | 0.349 / 0.229 / 0.579 |
We perform an ablation of four design choices in our architecture: our extensive use of skip connections, the use of group normalization, the use of a separate classifier to determine whether added nodes are leaves, and the encoding of deltas of box parameters instead of modified boxes. For each ablation, we evaluate the reconstruction and generation performance on the chairs dataset. Of these four design choices, the skip connections have the largest positive impact on the structure of shape deltas, while encoding box deltas instead of absolute boxes has the largest positive effect on the geometry. Table 5 shows the performance for each ablated variant of our method.
Our latent space of edits has all the benefits that are enabled by a smooth latent space, such as the ability to interpolate between two edits. In Figure 12, we show two examples of interpolations between different edits. The examples show that both geometric and structural changes are interpolated smoothly.
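Interpolation between two edits can be sketched as a walk between their latent codes. Linear interpolation is an assumption here, and the latent codes are plain lists of floats; each interpolated code would then be decoded by the delta decoder, conditioned on the source shape, which is omitted from this sketch.

```python
def interpolate_codes(code_a, code_b, steps):
    """Return `steps` latent codes linearly spaced from code_a to code_b.

    Both end points are included, so `steps` must be at least 2.
    """
    if steps < 2:
        raise ValueError("need at least two steps to include both end points")
    if len(code_a) != len(code_b):
        raise ValueError("latent codes must have the same dimension")
    codes = []
    for k in range(steps):
        t = k / (steps - 1)  # interpolation weight in [0, 1]
        codes.append([(1 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return codes


path = interpolate_codes([0.0, 2.0], [1.0, 0.0], steps=3)
```

Here `path` contains the two given codes and their midpoint `[0.5, 1.0]`.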
Qualitative results comparing StructEdit to StructureNet for edit transfer on the synthetic dataset are shown in Figure 13. The edit that transforms source shape A into the modified shape (first two columns) is transferred to source shape B (third column). On the synthetic dataset, we have a ground truth for the result of the edit transfer, shown in the fourth column. StructEdit (SE, last column) explicitly encodes edits, and can thus benefit from the large degree of consistency between the neighborhoods of deltas around different source shapes, giving us a significantly more accurate edit transfer than StructureNet (SN). Note that we do not use the ground truth transferred edit during training, nor any other kind of supervision for the mapping between the shape deltas of different source shapes. Our intuition is that the increased accuracy of the edit transfer is a result of the tendency of networks to compress information in their latent space. Due to the consistency of the shape deltas around different source shapes in our datasets, a consistent layout of shape deltas in the latent spaces around different source shapes is the layout that uses the least amount of information. A similar effect is observed in several other unsupervised methods [CycleGAN2017, Sun:2019:LAH].
In the following, we give additional details for two applications shown in the main paper: editing raw point clouds and cross-modal analogies.
In this application, we transform the point cloud into a structured shape using an existing method, find variations for the structured shape, and then transfer the corresponding shape deltas back to modify the point cloud. For the transformation into a shape, we first perform panoptic segmentation using the method described in PartNet [mo2019partnet], giving us part semantic and instance labels for each point. The semantics allow us to create a hierarchy among the part instances, and the instance labels give us part bounding boxes. After an edit, point cloud segments can either be transformed with the bounding box modifications or deleted, depending on the modification of the corresponding part. Added parts are transformed back into a point cloud by sampling their surface with a fixed number of points.
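The step that carries an edit back to the point cloud can be sketched as follows. This is a simplified illustration under stated assumptions: boxes are axis-aligned and given as (center, size) tuples, a part missing from the edited boxes counts as deleted, and all names are hypothetical. Added parts, which would be sampled on their surface, are not handled here.

```python
def transform_segment(points, old_box, new_box):
    """Map the points of one part from its old box to its edited box.

    Boxes are (center, size) pairs of 3-tuples and are assumed to be
    axis-aligned; each point keeps its relative position inside the box.
    """
    (old_center, old_size), (new_center, new_size) = old_box, new_box
    return [
        tuple(new_center[k] + (p[k] - old_center[k]) * new_size[k] / old_size[k]
              for k in range(3))
        for p in points
    ]


def apply_edit_to_point_cloud(segments, old_boxes, new_boxes):
    """Transform or delete point cloud segments according to an edit.

    segments maps a part id to its list of points. Parts absent from
    new_boxes are treated as deleted and their points are dropped.
    """
    edited = {}
    for part_id, points in segments.items():
        if part_id not in new_boxes:
            continue  # the edit removed this part
        edited[part_id] = transform_segment(
            points, old_boxes[part_id], new_boxes[part_id])
    return edited


segments = {"seat": [(0.5, 0.0, 0.0)], "arm": [(0.0, 0.2, 0.0)]}
old_boxes = {"seat": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
             "arm": ((0.0, 0.2, 0.0), (0.1, 0.4, 0.1))}
# The edit widens and raises the seat and removes the arm.
new_boxes = {"seat": ((0.0, 1.0, 0.0), (2.0, 1.0, 1.0))}
edited = apply_edit_to_point_cloud(segments, old_boxes, new_boxes)
```

In this example, the seat point moves to `(1.0, 1.0, 0.0)` and the arm segment is dropped.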
To transfer an edit defined by a pair of images to a point cloud, both images are transformed into structured shapes using the method described in StructureNet [mo2019structurenet]: an encoder maps the images into the latent space of a pre-trained StructureNet. Once we have structured shapes for both images, we use their difference as shape delta. This delta is then transferred to the shape obtained from the point cloud using our learned latent space. The conversion between point clouds and shapes is handled as described in the previous application.