Learning a shared representation space for geometries is a central task in 3D Computer Vision and Geometric Modeling, as it enables a series of important downstream applications, such as retrieval, reconstruction, and editing. For instance, morphable models blanz1999morphable are a commonly used representation for entire classes of shapes with small intra-class variations (e.g., faces), allowing high-quality geometry generation. However, morphable models generally assume a shared topology, and even the same mesh connectivity, for all represented shapes, and are thus less extensible to general shape categories with large intra-class variations. Therefore, such approaches have limited applications beyond collections with a shared structure such as humans blanz1999morphable; SMPL:2015 or animals zuffi20173d.
In contrast, when trained on large shape collections (e.g., ShapeNet chang2015shapenet), 3D generative models are not only able to learn a shared latent space for entire classes of shapes (e.g., chairs, tables, airplanes), but also capture large geometric variations between classes. A main area of focus in this field has been developing novel geometry decoders for these latent representations. These generative spaces allow mapping from a latent code to some geometric representation of a shape, examples being voxels choy20163d; tatarchenko2017octree, meshes groueix2018papier; nash2020polygen, convexes deng2020cvxnet; chen2020bspnet, or implicit functions chen2019learning; genova2019deepsif. Such latent spaces are generally smooth and allow interpolation or deformation between arbitrary objects represented in this encoding. However, the shape generation quality is highly dependent on decoder performance and is generally imperfect. While some decoder architectures are able to produce higher-quality geometries, auto-encoded shapes never exactly match their inputs, leading to a loss of fine geometric details.
In this paper we introduce a different approach to shape generation, based on continuous flows between shapes, that we term ShapeFlow. The approach views the shape generation process from a new perspective: rather than learning a generative space where a learned decoder directly maps a latent code to a shape, ShapeFlow learns a deformation space facilitated by a learned deformer, where a novel shape is acquired by deforming one of many possible template shapes via this learned deformer, conditioned on the latent codes of the source and target shapes.
This deformation-centric view of shape generation has various unique properties. First, a deformation space, compared to a generative space, naturally disentangles geometry style from structure. Style comes from the choice of source shape, which also includes the shape topology and mesh connectivity. Structure includes the general placement of different parts, such as limb positioning in a human figure (i.e., pose), or the height and width of chair parts. Second, unlike template-based mesh generation frameworks such as wang2018pixel2mesh; litany2018deformable; groueix2018papier, whose generated shapes are inherently limited by the template topology, a deformation space allows a multi-template scenario where each of the source shapes can be viewed as a template. Also, unlike volumetric decoders that require a potentially computationally intensive step for extracting surfaces (e.g., Marching Cubes), ShapeFlow directly outputs a mesh (or a point cloud) by deforming the source shape. Finally, by routing the deformations through a common waypoint in this space, we can learn a shared template for all geometries of the same class, despite differences in meshing or topology, allowing unsupervised learning of dense correspondences between all shapes within the same class.
The learned deformation function deforms a template shape so that it is geometrically close to a given target shape. Our deformation function is based on neurally parameterized 3D vector fields, or flows, that locally advect a template shape towards its destination. This novel way of modeling deformations has various innate advantages compared to existing methods. We show that deformation induced by a flow naturally prevents self-intersections. Furthermore, we demonstrate that we can effectively parameterize a divergence-free flow field using a neural network, which ensures volume conservation during the deformation process. Finally, ShapeFlow ensures path invertibility, and therefore also identity preservation. Compared to traditional deformation parameterizations in computer graphics, such as control handles schaefer2006image; jacobson2011bounded and control cages joshi2007harmonic; lipman2008green; weber2009complex, ShapeFlow is a flow model realized by a neural network, allowing more fine-grained deformation without requiring user intervention.
In summary, our main contributions are:
We propose a flow-based deformation model via a neural network that allows exact preservation of identity, good preservation of local geometric features, and disentangles geometry style and structure.
We show that our deformations by design prevent self-intersections and can preserve volume.
We demonstrate that we can learn a common template for a class of shapes through which we can derive dense correspondences.
We apply our method to interpolate shapes in different poses, producing smooth interpolation between key frames that can be used for animation and content creation.
2 Related work
Traditionally, shape representation in 3D computer vision roughly falls into two categories: template-based and template-free representations. In contrast, ShapeFlow fills a gap in between: it can be viewed as a multi-template space, where the source topology can be based on any of the training shapes, and where a very general deformation model is adopted.
These methods generally assume a fixed topology for all modelled geometries. Morphable models blanz1999morphable are a commonly used representation for entire classes of shapes with very small intra-class variations, such as faces blanz1999morphable; booth20163d; huber2016multiresolution; zhu2015discriminative; zhu2016face; zhu2015high, heads dai20173d; ploumpis2020towards, human bodies hasler2009statistical; allen2003space; SMPL:2015, and even animals zuffi20173d; zuffi2018lions. Morphable models generally assume a shared topology, and even the same mesh connectivity, for all represented shapes, which restricts their use to few shape categories. Recently, neural networks have been employed to generate 3D shapes via morphable models genova2018unsupervised; sanyal2019learning; litany2018deformable; ranjan2018generating; kolotouros2019convolutional; zuffi20173d; zuffi2018lions. Some recent work has extended the template-based approach to shapes with larger variations wang2018pixel2mesh; groueix2018papier; deprelle2019learning; ganapathi2018parsing, but the generated results are polygonal meshes that often contain self-intersections and are not watertight.
These methods generally produce a volumetric implicit representation of the geometry rather than directly representing the surface under a particular surface parameterization, thus allowing the same model to represent geometries across different topologies, with potentially large geometric variations. Earlier works in this line utilize voxel representations wu20153d; wu2016learning. Recently, continuous implicit function decoders mescheder2019occupancy; park2019deepsdf; chen2019learning have been popularized due to their strong capacity to represent more detailed geometry. Similar ideas have been extended to represent color, light fields, and other scene-related properties sitzmann2019scene; mildenhall2020nerf, and coupled with spatial jiang2020lig; chabra2020deep or spatio-temporal jiang2020meshfreeflownet latent grid structures to scale to larger scenes and domains. Still, these approaches lack the fine structures of real geometric models.
Parametrizing the space of admissible deformations in a set of shapes with diverse topologies is a challenging problem. Directly predicting offsets for each mesh vertex with insufficient regularization will lead to non-physical deformations such as self-intersections. In computer graphics, geometry deformation is usually parameterized using a set of deformation handles schaefer2006image or deformation cages joshi2007harmonic; lipman2008green; weber2009complex. Surface-based energies are usually optimized in the deformation process sorkine2007rigid; chao2010simple; jacobson2011bounded; uy2020deformation to maintain rigidity, isometry, or other desired geometric properties. More recently, learned deformation models have been proposed, directly predicting vertex offsets wang20193dn or control cage deformations yifan2019neural. Different from our end-to-end deformation setting, the graphics approaches are typically aimed at interactive and incremental shape editing applications.
Flow models have traditionally been used in machine learning for learning generative models of a given data distribution. Some examples of flow models include RealNVP dinh2016density and Masked Autoregressive Flows papamakarios2017masked; these generally involve a discrete number of learned transformations. Continuous normalizing flow models have also been recently proposed chen2018neural; grathwohl2018ffjord, and our method is mainly inspired by these works. They create bijective mappings via a learned advection process, and are trained using a differentiable Ordinary Differential Equation (ODE) solver. PointFlow yang2019pointflow and OccFlow niemeyer2019occupancy are similar to our approach in using such learned flow dynamics for modeling geometry. However, PointFlow yang2019pointflow maps point clouds corresponding to geometries to a learned prior distribution, while ShapeFlow directly learns the deformation function between geometries, bypassing the prior distribution and better preserving geometric details. OccFlow niemeyer2019occupancy only models the temporal deformation sequence for one object, while ShapeFlow learns a deformation space for entire classes of geometries.
Consider a set of shapes, each represented by a polygonal mesh consisting of an ordered set of vertices, each a point in 3D space, together with a set of polygonal elements, each indexing into the vertex set. For one-way deformations, we seek a mapping that minimizes the geometric distance between the deformed source shape and the target shape:
where the distance is the symmetric Chamfer distance between the two shapes. Note that the mapping operates on the vertices, while retaining the mesh connectivity of the source. As in previous work fan2017point; mescheder2019occupancy, since mesh-to-mesh Chamfer distance computation is expensive, we proxy it using the point-set-to-point-set Chamfer distance between uniform point samples on the meshes. Furthermore, in order to learn a symmetric deformation space, we optimize for maps that minimize the symmetric deformation distance:
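As a minimal sketch of the point-set Chamfer proxy described above (the function name and the mean-based normalization are our own illustrative choices, not necessarily the paper's exact convention):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (n,3) and q (m,3).

    Uses the mean of nearest-neighbor squared distances in both directions,
    one common normalization convention.
    """
    # Pairwise squared distances, shape (n, m).
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Identical point sets have zero Chamfer distance.
pts = np.random.rand(128, 3)
print(chamfer_distance(pts, pts))  # 0.0
```

In practice, the brute-force pairwise matrix is replaced by a KD-tree or GPU nearest-neighbor search for large point samples.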
We define such maps as an advection process via a flow function, where we associate intermediate deformations with an interpolation parameter. For any pair of shapes:
Not introducing self-intersections is a key property in shape deformation, since self-intersecting deformations are not physically plausible. In Proposition 1 (supplementary material), we prove that this property is algebraically satisfied in our formulation. Note that this property holds under the assumption of exact integration; errors in numerical integration will lead to its violation. However, we empirically show in Sec. C.2 (supplementary material) that this can be controlled by bounding the numerical integration error.
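As a concrete sketch, the advection of mesh vertices through a generic flow field can be integrated with a fixed-step RK4 scheme; the names and the toy rotation field below are illustrative, not the paper's implementation:

```python
import numpy as np

def advect(vertices, flow, t0=0.0, t1=1.0, steps=20):
    """Integrate vertex positions through a velocity field v = flow(x, t)
    using fixed-step RK4. `flow` is any callable mapping (n,3) points and a
    scalar time to (n,3) velocities (a stand-in for the learned network)."""
    x, h = vertices.copy(), (t1 - t0) / steps
    for k in range(steps):
        t = t0 + k * h
        k1 = flow(x, t)
        k2 = flow(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = flow(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = flow(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# A rigid rotation about the z-axis: a smooth flow, hence intersection-free.
rotation = lambda x, t: np.stack([-x[:, 1], x[:, 0], np.zeros(len(x))], axis=1)
verts = np.random.rand(64, 3)
out = advect(verts, rotation)
```

With a smooth flow and small step size the integration error stays small (here, distances from the rotation axis are conserved to high accuracy); a coarse fixed-step solver on a less benign flow is exactly the failure mode discussed in Sec. C.2.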
For any pair of shapes, it would be ideal if deforming one into the other, and then back, would recover the original exactly. We want the deformation to be lossless for identity transformations. In Proposition 3 (supplementary material), we derive a condition on the flow that is sufficient to ensure bijectivity:
3.1 Deformation flow field
At the core of the learned deformations (3) is a learnable flow field. We start by assigning latent codes to the shapes, and then define the flow as:
where the network weights are trainable parameters. Note that the same deformation function is shared across all pairs of shapes, and that this flow satisfies the invertibility condition (4).
The function receives as input the spatial coordinates and a latent code. When deforming from one shape to another, the latent code linearly interpolates between the two endpoint codes. The function is realized by a fully-connected neural network.
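A toy sketch of such a latent-conditioned flow, assuming a two-layer fully-connected network and the linear latent interpolation described above (layer sizes and depth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class LatentFlow:
    """Toy fully-connected flow f(x, t; z_i, z_j): the latent code fed to
    the network is the linear interpolation (1-t)*z_i + t*z_j."""
    def __init__(self, latent_dim=8, hidden=32):
        self.W1 = rng.normal(0, 0.1, (3 + latent_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 3))

    def __call__(self, x, t, zi, zj):
        z = (1.0 - t) * zi + t * zj  # interpolated latent code
        feats = np.concatenate(
            [x, np.broadcast_to(z, (len(x), len(z)))], axis=1)
        return np.tanh(feats @ self.W1 + self.b1) @ self.W2

flow = LatentFlow()
x = rng.normal(size=(10, 3))
zi, zj = rng.normal(size=8), rng.normal(size=8)
v = flow(x, 0.5, zi, zj)  # (10, 3) velocity field
```

At t = 0 the interpolated code equals the source code, so the velocity field is independent of the target, which is one ingredient of the identity-preservation property.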
The sign function receives the normalized direction of the vector from the source latent code to the target latent code. The sign function has the additional requirement that it be symmetric, which can be satisfied either by a fully-connected neural network with learnable parameters, zero bias, and a symmetric activation function (e.g., tanh), or by construction via the hub-and-spokes model of Section 3.2.
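A minimal sketch of the bias-free construction, assuming the required symmetry is that the output flips sign when the input direction flips (tanh is an odd function, so a network without biases inherits this property; the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def sign_net(d):
    """Bias-free MLP with tanh activations. Since tanh is odd and there
    are no bias terms, sign_net(-d) == -sign_net(d) for any input d."""
    return np.tanh(np.tanh(d @ W1) @ W2)

d = rng.normal(size=8)
d = d / np.linalg.norm(d)  # normalized latent direction, as in the text
```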
With this regularization, we ensure that the distance within the latent space is directly proportional to the amount of required deformation between two shapes, and obtain several properties:
Consistency of the latent space, which ensures that deforming halfway from a source to a target is equivalent to deforming all the way from the source to the latent code halfway between the two:
Identity preservation:
Implicit regularization: volume conservation
By learning a divergence-free flow field for the deformation process, we show that the volume of any enclosed mesh is conserved through the deformation sequence; see Proposition 4 (supplementary material). While we could penalize divergence via a loss, resulting in approximate volume conservation, we show how this hard constraint can be implicitly and exactly satisfied without resorting to auxiliary loss functions. By Gauss's theorem, the volume integral of the flow divergence is equal to the surface integral of the flux, which amounts to zero for solenoidal flows. Additionally, any divergence-free vector field can be represented as the curl of a vector potential. This allows us to parameterize a strictly divergence-free flow field by first parameterizing the vector potential with a fully-connected network and taking its curl. Since the curl operator is a series of first-order spatial derivatives, it can be efficiently calculated via a sum of first-order derivatives with respect to the input layer, computed through a single backpropagation step; refer to the architecture in Sec. B.1 (supplementary material).
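The claim that the curl of any smooth vector potential is divergence-free can be checked numerically; the toy potential below stands in for the learned network, and central differences implement both derivative operators (discrete mixed partials commute, so the discrete divergence of the discrete curl vanishes up to floating-point roundoff):

```python
import numpy as np

def potential(x, y, z):
    """A smooth toy vector potential standing in for the learned network."""
    return np.array([x * y, y * z, np.sin(x)])

def curl_fd(f, p, h=1e-3):
    """Central-difference curl of vector field f at point p."""
    def d(i, j):  # numerical d f_i / d x_j
        e = np.zeros(3); e[j] = h
        return (f(*(p + e))[i] - f(*(p - e))[i]) / (2 * h)
    return np.array([d(2, 1) - d(1, 2), d(0, 2) - d(2, 0), d(1, 0) - d(0, 1)])

def div_fd(f, p, h=1e-3):
    """Central-difference divergence of vector field f at point p."""
    def d(i):
        e = np.zeros(3); e[i] = h
        return (f(*(p + e))[i] - f(*(p - e))[i]) / (2 * h)
    return sum(d(i) for i in range(3))

p = np.array([0.3, -0.7, 0.5])
v = lambda x, y, z: curl_fd(potential, np.array([x, y, z]))
print(div_fd(v, p))  # ~0: the curl field is divergence-free
```

The raw potential itself has nonzero divergence at this point, so the vanishing divergence is genuinely a property of the curl, not of the chosen field.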
Implicit regularization: symmetries
Given that many geometric objects have a natural plane, axis, or point of symmetry, being able to enforce symmetry implicitly is a desired quality for the deformation network. We can parameterize a symmetric flow function by first parameterizing an unconstrained field and symmetrizing it. Without loss of generality, assume the plane of symmetry is a coordinate plane:
where the superscript denotes the components of the vector output.
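One standard symmetrization (our illustration; the paper's exact parameterization is deferred to the supplementary) averages an unconstrained field with its mirror image, so that reflecting the input reflects the output:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3))
g = lambda x: np.tanh(x @ W)       # arbitrary unconstrained vector field

R = np.diag([1.0, -1.0, 1.0])      # reflection across the xz-plane

def symmetric_flow(x):
    """Symmetrized field satisfying f(Rx) = R f(x) by construction:
    averaging g with its reflected copy makes the equivariance exact."""
    return 0.5 * (g(x) + g(x @ R) @ R)
```

A short check confirms the equivariance: substituting a reflected input and expanding the average shows the two terms simply swap roles.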
Explicit regularization: surface metrics
Additionally, surface metrics such as rigidity and isometry can be explicitly enforced via an auxiliary term in the overall loss function. A simple isometry constraint can be enforced by penalizing the change in edge lengths of the original mesh through the transformations, similar to the stretch regularization in gadelha2020deep; bednarik2019shape.
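A sketch of such an edge-length penalty (names and the squared-difference form are our own choices; the paper's exact weighting is not specified here):

```python
import numpy as np

def edge_length_loss(v0, v1, edges):
    """Penalize changes in edge length between original vertices v0 and
    deformed vertices v1; `edges` is an (m, 2) array of vertex indices.
    A simple stand-in for the stretch regularization mentioned above."""
    l0 = np.linalg.norm(v0[edges[:, 0]] - v0[edges[:, 1]], axis=1)
    l1 = np.linalg.norm(v1[edges[:, 0]] - v1[edges[:, 1]], axis=1)
    return np.mean((l1 - l0) ** 2)

verts = np.random.rand(20, 3)
edges = np.array([[i, (i + 1) % 20] for i in range(20)])
```

Note that rigid motions (translations, rotations) incur no penalty, while stretching or compressing the mesh does, which is exactly the behavior an isometry regularizer should have.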
We use a modified version of IM-NET chen2019learning as the backbone flow model, adjusting the number of hidden and output nodes. We defer discussion of the model architecture and training details to Sec. B.1 (supplementary material).
3.2 Hub-and-spoke deformation
Given a set of training shapes, we train the deformer by picking random pairs of shapes from the set. There are two strategies for learning the deformation: either directly deforming between each pair of shapes, or deforming each pair via a canonical latent shape corresponding to a "hub" latent code. Additionally, we use an encoder-less approach (i.e., an auto-decoder park2019deepsdf), where we initialize a random latent code for each training shape. The latent codes are jointly optimized along with the network parameters. We define the "hub" latent vector as the zero code. Under the hub-and-spokes deformation model, the training process amounts to finding:
A visualization of the latent space learned via the hub-and-spokes model is shown in Fig. 1(b). With hub-and-spokes training, we can define the sign function (Sec. 3.1) simply to produce opposite signs for the path towards the zero hub and the path from the hub, without the need for learned parameters.
3.3 Encoder-less embedding
We adopt an encoder-less scheme both for learning the deformation space and for embedding new observations into it. After acquiring a learned deformation space by training with the hub-and-spokes approach, we can embed a new point cloud observation into the learned latent space by optimizing for the latent code that minimizes the deformation error from shapes in the original deformation space to the new observation. Again, this "embedding via optimization" approach is similar to the auto-decoder approach in park2019deepsdf. The embedding of a new point cloud amounts to seeking:
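A toy illustration of embedding-via-optimization, with a frozen linear decoder and an L2 loss standing in for the deformation-error objective (all names and hyperparameters are illustrative; the real objective is the Chamfer deformation error and the gradients come from autodiff):

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(32, 8))   # frozen "decoder" weights (toy linear map)
y = rng.normal(size=32)        # new observation to embed

def embed(y, steps=800, lr=5e-3):
    """Encoder-less embedding: start from a random latent code and run
    gradient descent on the reconstruction loss ||D z - y||^2, mirroring
    the auto-decoder scheme of DeepSDF."""
    z = rng.normal(size=8) * 0.01
    for _ in range(steps):
        grad = 2.0 * D.T @ (D @ z - y)   # analytic gradient of the L2 loss
        z -= lr * grad
    return z

z_star = embed(y)
```

For this convex surrogate, the optimized code converges to the least-squares solution, which makes the sketch easy to verify.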
4.1 ShapeNet deformation space
As a first experiment, we learn the deformation space for entire classes of shapes from ShapeNet chang2015shapenet, and illustrate two downstream applications of such a deformation space: shape generation by deformation, and shape canonicalization. Specifically, we experiment on three representative shape categories in ShapeNet: chair, airplane, and car. For each category, we follow the official train/test/validation split of the data. We preprocess the geometries into watertight manifolds using the preprocessing pipeline in mescheder2019occupancy, and further simplify the meshes to a fraction of the original number of vertices using fastquadric. The deformation space is learned by deforming random pairs of objects using the hub-and-spokes deformation approach (as described in Section 3.2). More training details for learning the deformation space can be found in Section B.2 (supplementary material).
4.1.1 Surface reconstruction by template deformation
The learned deformation space can be used for reconstructing objects from input observations. A schematic of this process is provided in Fig. 1: a new observation, in the form of a point cloud, is embedded as a latent code in the latent deformation space according to Eqn. 8. The top-k nearest training shapes in the latent space are retrieved and deformed to fit the observation. During this step we further fine-tune the network parameters to better fit the observed point cloud.
We seek to reconstruct a complete object given a (potentially incomplete) sparse input point cloud. Following mescheder2019occupancy, we subsample points from the mesh surfaces and add Gaussian noise to the point samples. As measures of reconstruction quality, we report the volumetric Intersection-over-Union (IoU), Chamfer distance, and normal consistency metrics.
We benchmark against various state-of-the-art shape generation models that output voxel grids (3D-R2N2 choy20163d), upsampled point sets (PSGN fan2017point), mesh surfaces (DMC liao2018deep), and implicit surfaces (OccNet mescheder2019occupancy); see quantitative results in Table 1. Qualitative comparisons between the generated geometries are illustrated in Figure 2. Note that our shape deformations are more constrained (i.e., less expressive) than traditional auto-encoding/decoding, resulting in slightly lower metrics (Table 1). However, ShapeFlow is able to produce visually appealing results (Figure 2), as the retrieved shapes are of CAD quality and fine geometric details are preserved by the deformation.
Table 1: Reconstruction quality per category: Chamfer distance (lower is better), IoU (higher is better), and normal consistency (higher is better).
4.1.2 Canonicalization of shapes
An additional property of the deformation space learned through the hub-and-spokes formulation is that it naturally produces an aligned canonical deformation of all shapes. The canonical deformation corresponds to the zero latent code at the hub; for any shape, it is simply the deformation from that shape's latent code to the hub latent code. Dense correspondences between shapes can be acquired by searching for the nearest point on the opposing shape in the canonical space. For a point on one shape, the corresponding point on the other is found as:
To quantitatively evaluate the quality of such surface correspondences learned in an unsupervised manner, we propose the Semantic Matching Score (SMS). While ground-truth semantic correspondences between shapes are not available, semantic part labels are provided in various shape datasets, including ShapeNet. Denote an evaluation of the semantic label of a point, and a label comparison operator that evaluates to one if the categorical labels are the same and zero otherwise. We define the SMS between two shapes as:
We choose 10,000 random pairs of shapes in the chair category to compute semantic matching scores.
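A sketch of the SMS computation, using plain Euclidean nearest neighbors to stand in for matching in the canonical space (the two-point "shapes" below are toy data; a chair seat carries label 0 and its back label 1):

```python
import numpy as np

def semantic_matching_score(pts_a, labels_a, pts_b, labels_b):
    """Fraction of points on shape A whose matched point on shape B
    (here: Euclidean nearest neighbor, standing in for nearest-neighbor
    search in the canonical space) carries the same part label."""
    d2 = ((pts_a[:, None, :] - pts_b[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)               # index of nearest point on B
    return (labels_a == labels_b[nn]).mean()

# Two toy "shapes": a seat point low in z, a back point high in z.
a = np.array([[0.0, 0.0, 0.1], [0.0, 0.0, 0.9]])
b = np.array([[0.1, 0.0, 0.0], [0.1, 0.0, 1.0]])
score = semantic_matching_score(a, np.array([0, 1]), b, np.array([0, 1]))
```

Averaging this score over random shape pairs (and symmetrizing over the two matching directions) yields the numbers reported in the inset table.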
We first visualize the canonicalization and surface correspondences of shapes in the deformation space in Fig. 4. We compare the semantic matching score of our learned dense correspondence function with the naive baseline of nearest-neighbor matching in the original (ShapeNet) shape space. The shapes align better in the canonical pose, and the matches found by canonical-space matching are more semantically correct, especially between shapes that were originally poorly aligned due to different aspect ratios (e.g., a couch and a bar stool). This is reflected in the improved SMS, as reported in the inset table.
4.2 Human deformation animation
ShapeFlow can be used to produce smooth animated deformations between pairs of 3D geometries. These animations are subject to the implicit and explicit constraints for volume and isometry conservation; see Section 3.1. To test the quality of such animated deformations, we choose two relatively distinct SMPL poses SMPL:2015, and produce continuous deformations for the in-between frames. Since dense correspondences between the shapes are given, we change the distance metric in Eqn. 2 to the pairwise norm between corresponding vertices. We supervise the deformation with 5 intermediate frames produced via linear interpolation. Denoting the geometries at the two endpoints, the deformation at an intermediate step is:
We present the results of this deformation in Figure 5. We compare several cases: direct linear interpolation, deformation using an unconstrained flow, a volume-constrained flow, and a volume- and edge-length-constrained flow model. The volume change curve in Figure 5 empirically validates our theoretical results from Section 3.1: (1) a divergence-free flow conserves the volume of a mesh through the deformation process, and (2) the flow prevents self-intersections of the mesh, as in the example in Figure 5. Furthermore, we find that explicit constraints, such as the edge-length constraint, reduce surface distortions.
4.3 Comparison with parametric deformations
As a final experiment, we compare the unsupervised deformations acquired using ShapeFlow with interpolations of parametric CAD models. We use an exemplar parametric CAD model from Schulz:2017; see Figure 4. ShapeFlow produces novel intermediate shapes of CAD-level geometric quality that are consistent with those produced by interpolating the parametric model.
5 Conclusions and future work
ShapeFlow is a flow-based model capable of building high-quality shape spaces by using deformation flows. We analytically show that ShapeFlow prevents self-intersections, and provide ways to regularize volume, isometry, and symmetry. ShapeFlow can be applied to reconstruct new shapes via the deformation of existing templates. A main limitation of the current framework is that it does not incorporate semantic supervision for matching shapes. Future directions include analyzing part structures of geometries by grouping similar vector fields xu2011detecting, and exploring semantics-aware deformations. Furthermore, ShapeFlow may be used for the inverse problem of inferring a solenoidal flow field given tracer observations willert1991digital, an important problem in engineering physics.
The work has broad potential impact within the computer vision and graphics community, as it describes a novel methodology that enables a range of new applications, from animation to novel content creation. We have discussed the potential future directions the work could take in Sec. 5.
On the broader societal level, this work remains largely academic in nature, and does not pose foreseeable risks regarding defense, security, and other sensitive fields.
Appendix A Mathematical proofs and derivations
Proposition 1 (Intersection-free).
Any deformation map induced via a spatio-temporally continuous flow function cannot induce self-intersection of a continuous manifold at any point during the deformation process.
Let two points on the manifold have distinct initial positions (Eq. 12). Assume, for contradiction, that the two points intersect at some time. By the uniqueness of solutions to the ODE defined by the continuous flow, integrating backwards from the shared intersection point to the initial time yields a single location for both points, which contradicts Eq. 12. ∎
This is a sufficient condition for bijectivity.
Proposition 3 (Bijectivity condition).
A sufficient condition on the flow function for deformation bijectivity is:
Proposition 4 (Volume conservation).
Suppose we are given a compact subset of 3D space, with its three-dimensional volume and its surface boundary. Given a deformation map induced via a divergence-free (i.e., solenoidal) spatio-temporally continuous flow function, the volume within the deformed boundary remains constant.
By the divergence theorem, the flux across the boundary integrates to zero:
Theorem 1 (Existence of vector potential).
If is a vector field on with , then there exists a vector field with .
This extends from the fundamental theorem of vector calculus, and is the result of the vector identity $\nabla \cdot (\nabla \times \mathbf{A}) = 0$. ∎
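The identity can be verified symbolically for an arbitrary example potential (the specific potential below is our own illustration):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
# An arbitrary smooth vector potential; its curl must be divergence-free.
A = sp.Matrix([sp.sin(y * z), x * z, x ** 2 * y])

def curl(F):
    return sp.Matrix([sp.diff(F[2], y) - sp.diff(F[1], z),
                      sp.diff(F[0], z) - sp.diff(F[2], x),
                      sp.diff(F[1], x) - sp.diff(F[0], y)])

def div(F):
    return sp.diff(F[0], x) + sp.diff(F[1], y) + sp.diff(F[2], z)

print(sp.simplify(div(curl(A))))  # 0
```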
Appendix B Implementation details
B.1 Neural architecture
We employ a variant of the IM-NET chen2019learning architecture as our backbone. The complexity of the flow model is parameterized by the number of feature layers, as well as the dimensionality of the latent space. In the case with no implicit regularization, we directly use the IM-NET backbone as the flow function (Eqn. 6). In the case with implicit volume or symmetry regularization, we use the backbone to parameterize the vector potential; see Figure 6 for a schematic of the backbone. We do not learn an encoder; instead we use an encoder-less scheme (Sec. 3.3) for training as well as for embedding new observations into the deformation space.
B.2 Training details
ShapeNet deformation space (Sec. 4.1)
For training ShapeFlow to learn the ShapeNet deformation space, we use the ReLU activation function and train with the Adam optimizer, with the batch spread across 8 GPUs. We compute the time integration using the dopri5 solver with bounded relative and absolute error tolerances. We sample 512 points on each mesh to use as a proxy for computing point-to-point distance. We enforce the symmetry condition on the deformations. We do not enforce the isometry and volume conservation conditions, since they do not apply to the shape categories in ShapeNet. For the reconstruction experiment (Sec. 4.1.1) we use a smaller latent dimension, as a compact latent space allows better clustering of similar geometries, improving retrieval quality. For the canonicalization experiment (Sec. 4.1.2), we use a larger latent dimension, since it mitigates distortions at the canonical pose.
Furthermore, after training the deformation space, we embed a new latent code by initializing it randomly and optimizing it with the Adam optimizer for 30 iterations. We then fine-tune the neural network for the top-5 retrieved nearest neighbors for an additional 30 iterations.
Human deformation animation (Sec. 4.2)
For human model deformation, since we are only learning the flow function for a pair (or a handful) of shapes, we can afford to use a more lightweight backbone flow model. We use the ELU activation, since it is continuously differentiable, allowing us to parameterize a volume-conserving, divergence-free flow function. We use the Adam optimizer. For improved speed, we use the fixed-step Runge-Kutta-4 (RK4) ODE solver with 5 intermediate time steps. For the best performing result, we use the divergence-free parameterization as well as an edge-loss weighting term, and optimize for 1000 steps.
Appendix C Additional analysis and visualization
C.1 Deformation examples
We show additional examples of deformations between random pairs of shapes in the deformation space; see Fig. 7. We draw random subsets of 5 shapes at a time, and plot the pairwise deformations of the shapes in a grid. One takeaway is that when the source and target are identical, the transformation amounts to an identity transformation: by transforming the shape to and back from the "hub", the geometric details are almost exactly preserved.
C.2 Effects of integration scheme
We further study the impact of the ODE solver scheme on the shape deformation. Note that the ShapeNet deformation space involves many more shapes than the human frame interpolation case, and therefore involves much more drastic deformations. A fixed-step solver, such as the RK4 solver, is not able to accurately compute the dynamics of the individual points. Numerical error accumulated during the integration step leads to violations of non-self-intersection and identity preservation, resulting in dramatically unsatisfactory deformations between shapes.