
Neural Shape Deformation Priors

10/11/2022
by Jiapeng Tang, et al.

We present Neural Shape Deformation Priors, a novel method for shape manipulation that predicts mesh deformations of non-rigid objects from user-provided handle movements. State-of-the-art methods cast this problem as an optimization task, where the input source mesh is iteratively deformed to minimize an objective function according to hand-crafted regularizers such as ARAP. In this work, we learn the deformation behavior based on the underlying geometric properties of a shape, while leveraging a large-scale dataset containing a diverse set of non-rigid deformations. Specifically, given a source mesh and desired target locations of handles that describe the partial surface deformation, we predict a continuous deformation field that is defined in 3D space to describe the space deformation. To this end, we introduce transformer-based deformation networks that represent a shape deformation as a composition of local surface deformations. It learns a set of local latent codes anchored in 3D space, from which we can learn a set of continuous deformation functions for local surfaces. Our method can be applied to challenging deformations and generalizes well to unseen deformations. We validate our approach in experiments using the DeformingThing4D dataset, and compare to both classic optimization-based and recent neural network-based methods.


1 Introduction

Editing and deforming 3D shapes is a key component in animation creation and computer-aided design pipelines. Given as little user input as possible, the goal is to create new deformed instances of the original 3D shape that look natural and behave like real objects or animals. The user input is assumed to be very sparse, such as vertex handles that can be dragged around. For example, users can animate a 3D model of an animal by dragging its feet forward. This problem is severely ill-posed and typically under-constrained, as there are many possible deformations that are consistent with the provided partial surface deformation of the handles, especially for large surface deformations. Thus, strong priors encoding deformation regularity are necessary to tackle this problem. Physics and differential geometry provide solutions that use various analytical priors to define natural-looking mesh deformations, such as elasticity Terzopoulos et al. (1987); Alexa et al. (2000), Laplacian smoothness Lipman et al. (2004); Sorkine et al. (2004); Zhou et al. (2005), and rigidity Sorkine and Alexa (2007); Sumner et al. (2007); Levi and Gotsman (2014) priors. They update mesh vertex coordinates by iteratively optimizing energy functions that satisfy constraints from both the pre-defined deformation priors and the given handle locations. Although these algorithms can preserve geometric details of the original source model, they have limited capacity to model realistic deformations, since the deformation priors are region-independent, e.g., the head region of an animal is assumed to deform in the same way as its tail, resulting in unrealistic deformation states.

Hence, motivated by the recent success of deep neural networks for 3D shape modeling Mescheder et al. (2019); Park et al. (2019); Chen and Zhang (2019); Xu et al. (2019); Tang et al. (2019); Chibane et al. (2020); Peng et al. (2020); Jiang et al. (2020b); Atzmon and Lipman (2020); Gropp et al. (2020); Chabra et al. (2020); Tretschk et al. (2020); Tang et al. (2021b); Chen et al. (2021b), we propose to learn shape deformation priors of a specific object class, e.g., quadruped animals, to complete surface deformations beyond the observed handles. We formulate the following properties of such a learned model: (1) it should be robust to varying mesh quality and vertex counts, (2) the source mesh should not be limited to a canonical pose (i.e., the input mesh can have an arbitrary pose), and (3) it should generalize well to new deformations. Towards these goals, we represent deformations as a continuous deformation field defined in the near-surface region, which describes the space deformation caused by the corresponding surface deformation. The continuity property enables us to manipulate meshes with an arbitrary number of vertices and disconnected components. To handle source meshes in arbitrary poses, we learn shape deformations via canonicalization. Specifically, the overall deformation process consists of two stages: an arbitrary-to-canonical transformation and a canonical-to-arbitrary transformation. To obtain more detailed surface deformations and better generalization to unseen deformations, we propose to learn local deformation fields conditioned on local latent codes encoding geometry-dependent deformation priors, instead of a global deformation field conditioned on a single latent code. To this end, we propose Transformer-based Deformation Networks (TD-Nets), which learn encoder-based local deformation fields on point cloud approximations of the input mesh. Concretely, TD-Nets encode an input point cloud carrying surface geometry information and the incomplete deformation flow into a sparse set of local latent codes and a global feature vector, using the vector attention blocks proposed in Zhao et al. (2021). The deformation vectors of spatial points are estimated by an attentive decoder, which aggregates the information of the neighboring local latent codes of a spatial point based on feature similarity relationships. The aggregated feature vectors are finally passed to a multi-layer perceptron (MLP) to predict displacement vectors that can be applied to the source mesh to compute the final output mesh.

To summarize, we introduce transformer-based local deformation field networks that are capable of learning shape deformation priors for the task of user-driven shape manipulation. The deformation networks learn a set of anchored features based on a vector attention mechanism, which enhances the global deformation context and selects the most informative local deformation descriptors for displacement vector estimation, leading to improved generalization to new deformations. In comparison to classical hand-crafted deformation priors as well as recent neural network-based deformation predictors, our method achieves more accurate and natural shape deformations.

2 Related Work

User-guided shape manipulation lies at the intersection of computer graphics and computer vision. Our proposed method is related to polygonal mesh geometry processing, neural field representations, and vision transformers.

Optimization-based Shape Manipulation.

Classical methods formulate shape manipulation as a mathematical optimization problem. They perform mesh deformations by either deforming the vertices Botsch and Sorkine (2007); Sorkine (2006) or the 3D space Jacobson et al. (2011); Bechmann (1994); Levi and Gotsman (2014); Milliron et al. (2002); Sederberg and Parry (1986). Performing mesh deformation without any other information about the target shape, using only limited user-provided correspondences, is an under-constrained problem. Hence, optimization-based methods require deformation priors that constrain the deformation regularity as well as the smoothness of the deformed surface. Various analytic priors have been proposed which encourage smooth surface deformations, such as elasticity Terzopoulos et al. (1987); Alexa et al. (2000), Laplacian smoothness Lipman et al. (2004); Sorkine et al. (2004); Zhou et al. (2005), and rigidity Sorkine and Alexa (2007); Sumner et al. (2007); Levi and Gotsman (2014). These methods use efficient linear solvers to iteratively optimize energy functions that satisfy constraints from both the pre-defined deformation prior and the provided handle movements. Recently, NFGP Yang et al. (2021) was proposed to optimize neural networks with non-linear deformation regularizations. Specifically, it performs shape deformations by warping the neural implicit fields of the source model through a deformation vector field, which is constrained by modeling the implicitly represented surfaces as elastic shells. NeuralMLS Shechter et al. (2022) learns a geometry-aware weight function of a shape and its given control points for moving least squares (MLS) deformations, which smoothly interpolates the control point displacements over space. Although these methods can preserve many geometric details of the source shape, they struggle to model complex deformations, as local surfaces are simply constrained to transform in a similar manner. In contrast, we aim to learn deformation priors based on local geometries to infer hidden surface deformations.

Learning-based Shape Reconstruction and Manipulation.

Learning-based shape manipulation has been studied by learning shape priors based on shape auto-encoding or auto-decoding. Zheng et al. (2021b); Deng et al. (2021); Hao et al. (2020); Jiang et al. (2020a) map a class of shapes into a latent space. During inference, given handle positions as input, they find an optimal latent code whose 3D interpretation is most similar to the observation. In contrast, we learn explicit deformation priors to directly predict 3D surface deformations. Jakab et al. (2021) proposed to control shapes via unsupervised 3D keypoint discovery. Instead, we use partial surface deformations represented by handle displacements as input observations, rather than keypoint displacements. There is a series of methods that use deep neural networks to complete non-rigid shapes from partial scans Jiang et al. (2020a); Palafox et al. (2021); Božič et al. (2021); Li et al. (2021); Tang et al. (2021c); Saito et al. (2021); Wang et al. (2021b); Burov et al. (2021). Our task is related to this line of work, but shape manipulation from user input requires completing the deformation field itself. Moreover, our setting is more under-constrained, as the user-provided handle correspondences are very sparse and far more incomplete than partial point clouds from scans. Recent methods for clothed-human body reconstruction canonicalize the captured scan into a pre-defined T-pose Wang et al. (2021a); Mihajlovic et al. (2021); Chen et al. (2021a) using the skeletal deformation model of SMPL Loper et al. (2015) or STAR Osman et al. (2020), which can also be used to later animate the human. Inspired by this, we also perform a canonicalization to enable editing of source meshes with arbitrary poses, before applying the actual deformation towards the target pose handles.

Continuous Neural Fields.

Continuous neural field representations have been widely used in 3D shape modeling Mescheder et al. (2019); Chen and Zhang (2019); Park et al. (2019) and 4D dynamics capture Niemeyer et al. (2019); Tang et al. (2021c); Božič et al. (2021); Palafox et al. (2021); Li et al. (2021). Recent works that represent 3D shapes as continuous signed distance fields Atzmon and Lipman (2020); Xu et al. (2019); Gropp et al. (2020); Chabra et al. (2020); Tretschk et al. (2020) or occupancy fields Mescheder et al. (2019); Chen and Zhang (2019); Chibane et al. (2020); Mi et al. (2020); Peng et al. (2020); Jiang et al. (2020b); Tang et al. (2021a, b); Giebenhain and Goldlücke (2021); Zhang and Wonka (2021) can in principle obtain volumetric reconstructions at arbitrary resolution, as they are not bound to the resolution of a discrete grid structure. Similarly, we learn continuous deformation fields defined in 3D space for shape deformations Tang et al. (2019); Jiang et al. (2020a); Yang et al. (2021); Hui et al. (2022). Due to the continuity of the deformation fields, our method is not limited by the number of mesh vertices or by disconnected components. Different from ShapeFlow Jiang et al. (2020a), OFlow Niemeyer et al. (2019), LPDC-Net Tang et al. (2021c), and NPMs Palafox et al. (2021), which learn a deformation field from a single latent code, and inspired by local implicit field learning Chibane et al. (2020); Peng et al. (2020); Tang et al. (2021b); Giebenhain and Goldlücke (2021); Zhang et al. (2022), we model the deformation field as a composition of local deformation functions, improving both the capability to represent complex deformations and the generalization to new deformations.

Visual Transformers.

Recently, transformer architectures Vaswani et al. (2017) from natural language processing have revolutionized many computer vision tasks, including image classification Dosovitskiy et al. (2020); Wang et al. (2018), object recognition Carion et al. (2020), semantic segmentation Zheng et al. (2021a), and 3D reconstruction Bozic et al. (2021); Yu et al. (2021); Giebenhain and Goldlücke (2021); Zhang et al. (2022); Rao et al. (2022). We refer the reader to Han et al. (2020) for a detailed survey of visual transformers. In this work, we propose the usage of a transformer architecture to learn deformation fields. Given the input point cloud sampled from the source mesh with partial deformation flow (defined by the user handles), we employ the vector attention blocks from Point Transformer Zhao et al. (2021) as the main point cloud processing module to extract a sparse set of local latent codes, enhancing the global understanding of deformation behaviours. Based on the obtained local deformation descriptors, our attentive deformation decoder learns to attend to the most informative features from nearby local codes to predict a deformation field.

3 Approach

Given a source mesh, defined by its set of vertices and set of faces, we aim to deform it into a target mesh by selecting a sparse set of mesh vertices as handles and dragging them to target locations. The key idea of this work is to use deformation priors to complete the hidden surface deformations. Specifically, the goal is to learn a continuous deformation field defined in 3D space, from which we obtain the deformed mesh through vertex deformations of the source mesh. The overall pipeline of the proposed approach is shown in Figure 2. Our method can be applied to input meshes in arbitrary poses by learning shape deformations via canonicalization (see Section 3.1). To represent the underlying deformation prior, we propose neural deformation fields as described in Section 3.2, which can be learned from large deformation datasets (see Section 3.3).

Figure 2: Overview. Given a source mesh with sparse handles (red circles) and their respective target locations (blue circles) as input, our method deforms the mesh to the target mesh via canonicalization. The backward and forward deformation networks store the deformation priors that allow our method to produce consistent and natural-looking outputs.

3.1 Learning Shape Deformations via Canonicalization

To ensure robustness w.r.t. varying input mesh quality (topology and resolution), we operate on point clouds instead of meshes. Specifically, we sample a point cloud of fixed size from the source mesh. We define the per-point target handle locations, where zeros represent unknown point flows. Further, to avoid the ambiguity of a zero point flow, we define a corresponding binary user handle mask whose entries are one for handle points and zero otherwise.
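For concreteness, the following sketch shows how these per-point channels could be assembled; the function name is illustrative, and we assume the handle channel stores displacements rather than absolute target locations.

```python
import numpy as np

def build_handle_channels(points, handle_indices, handle_targets):
    """Build per-point flow and mask channels for the deformation network input.

    points:         (N, 3) surface samples from the source mesh
    handle_indices: (K,)   indices into `points` selected as handles
    handle_targets: (K, 3) user-specified target locations of the handles
    """
    n = points.shape[0]
    flow = np.zeros((n, 3), dtype=np.float32)   # zeros represent unknown point flow
    mask = np.zeros((n, 1), dtype=np.float32)   # binary handle mask resolves the ambiguity
    flow[handle_indices] = handle_targets - points[handle_indices]
    mask[handle_indices] = 1.0                  # 1 if the point is a handle, 0 otherwise
    return flow, mask
```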

To learn the shape transformation between two arbitrary non-rigidly deformed poses, one could learn deformation fields that directly map the source deformed space to the target space. However, it would be difficult to learn such deformation priors well, as there are infinitely many pairs of deformation states. To decrease the learning complexity, we introduce a canonical space as an intermediate state. We divide the shape transformation process into two steps: a backward deformation that aligns the source deformed space to the canonical space, and a forward deformation that maps the canonical space to the target deformation space. Concretely, the sampled point cloud, together with the handle flow and mask, is passed into the backward transformation network to learn the backward deformation field, which transforms the input shape into a canonical pose. Similarly, a set of non-surface query points randomly sampled in the 3D space of the source is also mapped to the canonical space through the backward deformation. Lastly, given the canonicalized point cloud, the target handle locations, and the handle mask as input, a forward transformation network is learned to represent the forward deformation field that predicts the final locations in the target space.
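A minimal sketch of this two-stage deformation is given below; the network interfaces are hypothetical placeholders for the actual backward and forward transformation networks.

```python
def deform_via_canonicalization(backward_net, forward_net,
                                surface_points, flow, mask, query_points):
    """Two-stage deformation: source pose -> canonical pose -> target pose."""
    # Backward deformation: map surface samples and query points to canonical space.
    canonical_surface = backward_net(surface_points, flow, mask, surface_points)
    canonical_queries = backward_net(surface_points, flow, mask, query_points)

    # Forward deformation: predict target-space locations for the canonical queries,
    # conditioned on the canonicalized surface and the user handle constraints.
    target_queries = forward_net(canonical_surface, flow, mask, canonical_queries)
    return target_queries
```

Applying the returned locations to the vertices of the source mesh yields the deformed output mesh.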

3.2 Transformer-based Deformation Networks (TD-Nets)

The deformation via canonicalization is based on two deformation field predictors (backward and forward deformation). Both networks share the same architecture; thus, in the following, we only describe the forward deformation network, as visualized in Figure 3, while the backward deformation network is analogous. It consists of a transformer-based deformation encoder and a vector cross attention-based decoder network.

Point transformer encoder.

Given a point set with target handle locations and a binary handle mask as inputs, we use point transformer layers from Zhao et al. (2021) to build our encoder modules. The point transformer layer is based on the vector attention mechanism Zhao et al. (2020). Let the query and key-value sequences consist of point coordinates with corresponding feature vectors. The vector cross attention (VCA) operator is then defined as:

(1)

where the aggregated features are computed from the query and key-value features using three linear projections, each implemented by a fully-connected layer, a mapping function implemented by a two-layer MLP that predicts attention vectors, and an attention weight normalization function, in our case softmax. The positional embedding module Vaswani et al. (2017); Mildenhall et al. (2020) is implemented by two linear layers with a single ReLU Nair and Hinton (2010); it leverages the relative positional information of the query and key-value points to benefit network training. Then, based on the definition of VCA, the vector self-attention operator can be defined as:

(2)
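The following PyTorch sketch implements a vector cross-attention operator in the spirit of the definitions above, following the Point Transformer formulation; for brevity it attends to all key-value points instead of only the k nearest neighbors, and the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class VectorCrossAttention(nn.Module):
    """Sketch of a vector cross-attention (VCA) operator (illustrative sizes)."""

    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)    # query projection
        self.psi = nn.Linear(dim, dim)    # key projection
        self.alpha = nn.Linear(dim, dim)  # value projection
        self.gamma = nn.Sequential(       # two-layer MLP predicting attention vectors
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.delta = nn.Sequential(       # positional embedding of relative coordinates
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q_pos, q_feat, kv_pos, kv_feat):
        # q_pos: (Nq, 3), q_feat: (Nq, C); kv_pos: (Nk, 3), kv_feat: (Nk, C)
        pos = self.delta(q_pos[:, None] - kv_pos[None, :])              # (Nq, Nk, C)
        attn = self.gamma(self.phi(q_feat)[:, None]
                          - self.psi(kv_feat)[None, :] + pos)           # attention vectors
        attn = torch.softmax(attn, dim=1)                               # normalize over key-value points
        out = (attn * (self.alpha(kv_feat)[None, :] + pos)).sum(dim=1)  # (Nq, C) aggregated features
        return out
```

Vector self-attention (VSA) corresponds to calling the same operator with identical query and key-value sets.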

Based on VCA and VSA, we can define two basic modules to build our encoder network, i.e., the point transformer block (PTB) and the point abstraction block (PAB). The point transformer block combines a BatchNorm (BN) layer Ioffe and Szegedy (2015), VSA, and residual connections, formulated as:

(3)

For each point, it encapsulates the information from its nearest neighbors while keeping the point's position unchanged. The point abstraction block consists of farthest point sampling (FPS), BN, VCA, and VSA, and is defined as follows:

(4)

The point cloud, with the handle mask and flow as additional channels, is passed to a point transformer block (PTB) to obtain a feature point cloud. Two consecutive point abstraction blocks (PABs) then progressively downsample it to smaller intermediate set sizes. To enhance global deformation priors, we stack 4 point transformer blocks with full self-attention (neighborhood size set to 100) to exchange global information across the whole downsampled set. By doing so, we obtain a sparse set of local deformation descriptors anchored at the downsampled point locations in 3D space. Finally, we perform a global max-pooling operation followed by two linear layers to obtain the global latent vector.
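The encoder data flow can be summarized as the sketch below, where the block implementations (PTB, PABs, global self-attention blocks, and the pooling MLP) are assumed to be provided as callables, for instance built from the attention operator sketched above.

```python
def encode_point_cloud(points, feats, ptb, pabs, global_blocks, global_mlp):
    """Illustrative encoder data flow: PTB -> PABs (FPS downsampling) -> global attention."""
    anchors, codes = ptb(points, feats)          # per-point features, positions unchanged
    for pab in pabs:                             # consecutive point abstraction blocks
        anchors, codes = pab(anchors, codes)     # FPS downsampling to a sparser anchor set
    for block in global_blocks:                  # 4 PTBs with full self-attention
        codes = block(anchors, codes)            # exchange global deformation context
    global_code = global_mlp(codes.max(dim=0).values)  # global max-pool + two linear layers
    return anchors, codes, global_code           # anchored local latent codes + global latent
```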

Figure 3: Transformer-based Forward Deformation Networks. Given a canonical mesh with handle positions (red circles) and desired handle locations (blue circles), we perform surface sampling to obtain a point cloud with additional channels for the handle mask and point flow. A point transformer encoder extracts a sparse set of local latent codes anchored at downsampled point locations. For a specific point in 3D space (e.g., a vertex of the source mesh), a vector cross attention (VCA) block is used to effectively fuse the information of its nearest neighbouring latent codes into an aggregated feature. Using a multi-layer perceptron (MLP) conditioned on this feature, we predict the deformed location in the target space.

Attentive deformation decoder. Based on the learned local latent codes and the global latent vector, the deformation decoder defines the forward deformation function, which maps a point from the canonical space to the 3D space of the target. Similar to tri-linear interpolation in grid-based implicit field learning, a straightforward way to obtain the feature vector of a query point is a weighted combination of nearby local codes, where each weight is inversely proportional to the euclidean distance between the query point and the anchor location Peng et al. (2020). However, such distance-based feature queries ignore the relationships between deformation descriptors. Thus, we propose to adaptively aggregate the information of neighboring local codes based on the vector cross-attention operator:

(5)

This local information aggregation enables us to flexibly select local deformation priors, thus improving the generalizability to new deformations. Finally, the aggregated feature is fed into an MLP composed of five Res-FC blocks to estimate the associated location in the target space.
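A sketch of the attentive decoder is shown below; the number of neighboring codes, the use of the global latent vector as the attention query feature, and the plain MLP standing in for the five Res-FC blocks are our own assumptions.

```python
import torch
import torch.nn as nn

class AttentiveDeformationDecoder(nn.Module):
    """Attention-based feature querying followed by an MLP displacement predictor.

    Assumes local codes and the global code share the feature dimension `dim`.
    """

    def __init__(self, vca, dim, k=8):
        super().__init__()
        self.vca = vca        # a vector cross-attention module, e.g. as sketched above
        self.k = k            # number of neighboring local codes to attend to
        self.mlp = nn.Sequential(          # stands in for the five Res-FC blocks
            nn.Linear(2 * dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3))

    def forward(self, queries, anchors, codes, global_code):
        # queries: (Q, 3); anchors: (M, 3); codes: (M, C); global_code: (C,)
        dists = torch.cdist(queries, anchors)                 # (Q, M) euclidean distances
        idx = dists.topk(self.k, largest=False).indices       # k nearest anchors per query
        fused = torch.stack([                                 # attention-based feature query
            self.vca(queries[i:i + 1], global_code[None],
                     anchors[idx[i]], codes[idx[i]])[0]
            for i in range(queries.shape[0])])                # (Q, C)
        x = torch.cat([queries, fused,
                       global_code.expand(queries.shape[0], -1)], dim=1)
        return queries + self.mlp(x)                          # predicted target-space locations
```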

3.3 Training Objectives

For training, we need a set of source-canonical-target mesh triplets with dense correspondences, from which we randomly sample surface point clouds and non-surface query points in 3D space. To optimize the backward deformation network, we employ a mean distance error that measures the difference between points deformed from the source space and their ground truth in the canonical space:

(6)

Similarly, to optimize the forward deformation networks, we use the following loss function:

(7)

The total loss function for source-target shape deformations is defined as:

(8)
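In code, the objectives amount to mean distance errors for the backward and forward predictions and their sum; the euclidean distance and the equal weighting of the two terms are assumptions.

```python
def deformation_losses(pred_canonical, gt_canonical, pred_target, gt_target):
    """Mean distance errors for the backward (Eq. 6) and forward (Eq. 7) predictions."""
    loss_backward = (pred_canonical - gt_canonical).norm(dim=-1).mean()
    loss_forward = (pred_target - gt_target).norm(dim=-1).mean()
    loss_total = loss_backward + loss_forward   # combined objective in the spirit of Eq. (8)
    return loss_backward, loss_forward, loss_total
```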

4 Experiments

Dataset.

Our experiments are performed on the DeformingThing4D-Animals Li et al. (2021) dataset, which contains 1494 non-rigidly deforming animations with various motions, comprising 40 identities from 24 categories. For the train/test split, we divide all animations into training (1296) and test (198) sets. Similar to D-FAUST Bogo et al. (2017) as used in OFlow Niemeyer et al. (2019), the test set is composed of two subsets: (S1) contains 143 sequences of new motions of identities seen during training, and (S2) contains 55 sequences of unseen individuals (and thus also new motions). During training, we randomly sample two frames of an identity as a source-target deformation pair. During inference, we consider the first frame of an animation as the source mesh and the other frames as target meshes. To evaluate the generalization ability to unseen identities, we also evaluate the pre-trained models on the animal dataset used in Deformation Transfer Sumner and Popović (2004). For the quantitative comparison on each test subset, we compute evaluation metrics for 300 randomly sampled pairs. In addition, we include comparisons on another animal dataset used in TOSCA Rodolà et al. (2017). TOSCA does not provide correspondences between different poses of the same animal, and hence does not easily provide handle displacements as input. Thus, we provide a qualitative comparison under the setting of user-specified handles as inputs.

Implementation details.

Our approach is built on the PyTorch library Paszke et al. (2019). Please refer to the supplementary material for the details of our network architecture. Our model is trained in two stages using an Adam Kingma and Ba (2014) optimizer. In the first stage, we train the forward and backward deformation networks individually: the backward and forward deformation networks are optimized with the objectives described in Equations 6 and 7, respectively, using a batch size of 16 and a learning rate of 5e-4 for 100 epochs. In the second stage, the whole model is trained according to Equation 8 in an end-to-end manner, using a batch size of 6 and a learning rate of 5e-5 for 20 epochs.
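The two-stage schedule can be sketched as follows; the placeholder modules stand in for the actual backward and forward TD-Nets, and the Adam beta parameters are omitted.

```python
import torch
import torch.nn as nn

# Placeholder models standing in for the backward/forward deformation networks.
backward_net, forward_net = nn.Linear(3, 3), nn.Linear(3, 3)

# Stage 1: train the two networks individually (Eq. 6 and Eq. 7),
# batch size 16, learning rate 5e-4, 100 epochs.
opt_backward = torch.optim.Adam(backward_net.parameters(), lr=5e-4)
opt_forward = torch.optim.Adam(forward_net.parameters(), lr=5e-4)

# Stage 2: fine-tune the whole model end-to-end with the combined loss (Eq. 8),
# batch size 6, learning rate 5e-5, 20 epochs.
opt_joint = torch.optim.Adam(
    list(backward_net.parameters()) + list(forward_net.parameters()), lr=5e-5)
```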

Baselines.

We conduct comparisons against classical optimization-based and recent neural network-based methods. For the former, we select a representative work, ARAP Sorkine and Alexa (2007), that constrains each local surface to be rigidly transformed as much as possible. For the latter, we compare our method with the learning-based deformation predictor ShapeFlow Jiang et al. (2020a) that embeds each shape into a latent space and learns flow-based deformations among 3D shapes. We also compare to NFGP Yang et al. (2021), a deep optimization method, which constrains the implicitly represented surfaces as elastic shells during the deformation process.

Figure 4: Comparison against ARAP Sorkine and Alexa (2007), ShapeFlow Jiang et al. (2020a), and NFGP Yang et al. (2021) on new motions (columns: source mesh and handles, target mesh and handles, ARAP, ShapeFlow, NFGP, ours). We visualize the vertex euclidean distance errors as color maps.
Figure 5: Comparison against ARAP Sorkine and Alexa (2007), ShapeFlow Jiang et al. (2020a), and NFGP Yang et al. (2021) on the S2 test set of DeformingThing4D-Animals and unseen shapes of Deformation Transfer Sumner and Popović (2004) (columns: source mesh and handles, target mesh and handles, ARAP, ShapeFlow, NFGP, ours). We visualize the vertex euclidean distance errors as color maps. Our approach generalizes better than ShapeFlow and NFGP and produces natural-looking deformations (in comparison, ARAP generates rubber-like deformations).

Evaluation metrics.

We consider the distance error of mesh vertices (vertex error), the Chamfer Distance (CD) of sampled point clouds of 30k points, and the Face Normal Consistency (FNC) as primary evaluation metrics. Please refer to the supplementary material for a detailed explanation of these metrics. Note that for the vertex error and CD, lower is better, while for FNC, higher is better.

4.1 Comparisons

Method                         | New motions (S1)        | Unseen identities (S2)  | Deformation Transfer
                               | Vert.   CD     FNC      | Vert.   CD     FNC      | Vert.   CD     FNC
ARAP Sorkine and Alexa (2007)  | 5.568   2.312  95.35    | 9.794   2.308  94.89    | 5.145   3.475  91.21
ShapeFlow Jiang et al. (2020a) | 21.03   3.494  89.69    | 32.08   3.925  90.73    | 33.72   4.093  86.36
NFGP Yang et al. (2021)        | 11.77   3.130  93.34    | 15.96   3.364  91.80    | 18.90   4.150  82.54
Ours-VDF                       | 3.590   1.887  86.01    | 2.368   1.837  86.99    | 3.111   9.164  78.63
Ours-global                    | 2.970   1.546  93.30    | 2.973   1.579  94.75    | 2.636   8.453  84.59
Ours-3D UNet                   | 1.011   1.111  96.02    | 1.253   1.426  96.20    | 4.553   2.362  88.31
Ours-PointNet++                | 0.886   1.055  95.47    | 1.231   1.364  95.37    | 4.898   2.564  85.87
Ours-w/o atten. dec.           | 1.184   1.210  95.64    | 1.227   1.417  96.16    | 5.252   2.772  84.95
Ours-w/o cano.                 | 1.018   1.063  96.40    | 0.969   1.258  96.62    | 2.660   1.934  90.96
Ours-full                      | 0.752   0.948  96.59    | 0.795   1.241  96.68    | 2.495   1.877  91.40

Table 1: Quantitative comparisons on the S1 and S2 test sets of DeformingThing4D Li et al. (2021) and the unseen identities used in Deformation Transfer Sumner and Popović (2004). Vert. denotes the vertex distance error; for Vert. and CD, lower is better, while for FNC, higher is better.

For a qualitative comparison, we visualize the vertex distance error maps of deformed meshes in Figure 4 and Figure 5. As can be seen, our method has lower vertex errors in the hidden surface regions, since we use data-driven deformation priors instead of hand-crafted regularizers that enforce surface smoothness. The generalization ability to unseen deformations is improved by learning deformation fields for local surfaces instead of modeling global deformations. Compared to ARAP, ShapeFlow, and NFGP, we produce more realistic results for the complicated actions in the 3rd and 4th rows of Figure 4. The deformation results presented in Figure 5 demonstrate that our method can generalize to unseen identities, which is also verified quantitatively in Table 1, where our method consistently outperforms all baselines.

User-specified handles.

To evaluate the generalization performance of our approach on unseen identities with user-provided handle displacements, as used in interactive editing applications, we apply random translations of handles to animals from TOSCA Rodolà et al. (2017) as input. As depicted in Figure 6, our approach is able to produce natural-looking deformation results and shows clear advantages compared to ARAP, ShapeFlow, and NFGP. Note that for this demonstration of user-specified handles there exists no corresponding ground truth.

Figure 6: Comparison against ARAP Sorkine and Alexa (2007), ShapeFlow Jiang et al. (2020a), and NFGP Yang et al. (2021) under the setting of user-specified handles on the TOSCA dataset Rodolà et al. (2017) (columns: source mesh with handles and target handles, ARAP, ShapeFlow, NFGP, ours). Our method visibly produces the best results.

4.2 Ablation Studies

To verify our final model choice, we conducted a series of ablation studies, where we analysed several variants of our deformation fields (see Table 1 and Figure 7).

Figure 7: Qualitative ablation studies. Each component of our approach contributes to the final result that has the lowest reconstruction error.

Volumetric grids vs continuous fields.

As continuous fields are not bound to the resolution of a discrete grid structure, they can better represent complex deformations. The performance degrades when we instead learn grid-based volumetric deformation fields. This can be seen in the experiment “Ours-VDF", which uses a 3D U-Net Ronneberger et al. (2015) to generate volumetric deformation fields of a fixed resolution.

Global vs local deformation fields.

“Ours-global" learns a global continuous field only conditioned on the global latent code. This variant tends to lose detailed information about local surface deformations, and is more difficult to generalize to new motions or identities, leading to inferior results in comparison to our local deformation fields.

Network Architectures (3D U-Net vs PointNet++ vs Point Transformer).

Compared to grid-based and point-based local deformation descriptor learning, the point transformer-based encoder captures strong global context that enforces more global consistency constraints. This improves the surface accuracy of the deformed meshes. To verify this, we conducted an experiment with “Ours-3D UNet", which learns a volumetric feature map through a 3D U-Net and then predicts deformation fields from features queried via tri-linear interpolation. Additionally, we compare with “Ours-PointNet++", which replaces the point transformer encoder with PointNet++ Qi et al. (2017).

With vs without Attention-based feature querying.

The attention-based feature query mechanism can flexibly and effectively select the most relevant deformation descriptors for a query point, resulting in improved performance over feature interpolation purely based on euclidean distances. A deformation decoder that instead interpolates features with weights purely based on euclidean distance (“Ours-w/o atten. dec.") leads to significantly higher errors, particularly in terms of the vertex error.

With vs without canonical poses.

Learning shape deformations via canonicalization improves the generalization to source meshes in different poses. Learning without canonicalization ("Ours-w/o cano."), i.e., learning shape deformations directly between two arbitrary poses, results in considerably higher surface errors.

4.3 Intermediate results of canonicalization

Figure 8: The intermediate results of our canonicalization. (a) Source mesh. (b) Canonical mesh. (c) Our canonicalized mesh.

In Figure 8, we visualize the intermediate results of our canonicalization. As can be seen, our method can project source meshes with arbitrary poses into a canonical space with a consistent pose.

4.4 Limitations

While compelling results have been demonstrated for shape manipulation, a few limitations of our approach remain that can be addressed in future work. Our approach only needs sparse user input in the form of handles that can be moved to create a new deformation state. While this allows for quick editing, a possible extension is to add rotations to the handles. This could be done by leveraging a different deformation representation, such as an SE(3) field composed of a displacement and a rotation field. Note that our displacement representation is able to represent general deformations, but might require more user handles. Due to the limitations of the DeformingThing4D-Animals Li et al. (2021) dataset in terms of available models and poses, our approach may struggle to generalize to out-of-distribution models and extreme poses. Additionally, the output of our model, as with other learning-based methods, may be affected by biases in the training dataset that limit generalization. We believe this issue can be mitigated by a larger training dataset and a richer data augmentation strategy in future work. Lastly, our training scheme only considers handles that are selected from a set of candidate parts of the models, thus limiting the regions the user can interact with. Enriching the candidate handles during training is potentially helpful for allowing free handle placement.

5 Conclusion

In this work, we introduced Neural Shape Deformation Priors, a novel approach that learns mesh deformations of non-rigid objects from user-provided handles based on the underlying geometric properties of shapes. To enable shape manipulation for source meshes in different poses, we learn shape deformations via canonicalization, where the source mesh is first transformed to the canonical space through a backward deformation field and then deformed to the target space through a forward deformation field. For deformation field learning, we propose Transformer-based Deformation Networks (TD-Nets) that represent a shape deformation as a composition of local surface deformations. Our experiments and ablation studies demonstrate that our method can be applied to challenging new deformations, outperforming classical optimization-based methods such as ARAP Sorkine and Alexa (2007) and neural network-based methods such as ShapeFlow Jiang et al. (2020a) and NFGP Yang et al. (2021), while showing good generalization to previously unseen identities. We see our method as an important step in the development of 3D modeling algorithms and software, and hope to inspire more research in learning-based shape manipulation.

Societal impact.

Our work provides an algorithm for natural-looking shape editing, which can simplify tedious procedures in 3D content creation and empower artists in the movie and game industries. It further has the potential to enrich 3D data with additional deformed shapes, and could thus help improve the performance of other practical application techniques that rely on large quantities of 3D ground-truth for training. Yet, misuse of our shape manipulation algorithm could enable fraud or offensive content generation.

Acknowledgement.

This work is supported by a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), and Sony Semiconductor Solutions Corporation. We would also like to thank Angela Dai for the video voice over.

References

  • M. Alexa, D. Cohen-Or, and D. Levin (2000) As-rigid-as-possible shape interpolation. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 157–164. Cited by: §1, §2.
  • M. Atzmon and Y. Lipman (2020) Sal: sign agnostic learning of shapes from raw data. In CVPR, pp. 2565–2574. Cited by: §1, §2.
  • D. Bechmann (1994) Space deformation models survey. Computers & Graphics 18 (4), pp. 571–586. Cited by: §2.
  • F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black (2017) Dynamic faust: registering human bodies in motion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6233–6242. Cited by: Appendix D, §4.
  • M. Botsch and O. Sorkine (2007) On linear variational surface deformation methods. IEEE transactions on visualization and computer graphics 14 (1), pp. 213–230. Cited by: §2.
  • A. Bozic, P. Palafox, J. Thies, A. Dai, and M. Nießner (2021) Transformerfusion: monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • A. Božič, P. Palafox, M. Zollhofer, J. Thies, A. Dai, and M. Nießner (2021) Neural deformation graphs for globally-consistent non-rigid reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1450–1459. Cited by: §2, §2.
  • A. Burov, M. Nießner, and J. Thies (2021) Dynamic surface function networks for clothed human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10754–10764. Cited by: §2.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §2.
  • R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In ECCV, pp. 608–625. Cited by: §1, §2.
  • X. Chen, Y. Zheng, M. J. Black, O. Hilliges, and A. Geiger (2021a) SNARF: differentiable forward skinning for animating non-rigid neural implicit shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11594–11604. Cited by: §2.
  • Y. Chen, Z. Tu, D. Kang, L. Bao, Y. Zhang, X. Zhe, R. Chen, and J. Yuan (2021b) Model-based 3d hand reconstruction via self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10451–10460. Cited by: §1.
  • Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In CVPR, Cited by: §1, §2.
  • J. Chibane, T. Alldieck, and G. Pons-Moll (2020) Implicit functions in feature space for 3d shape reconstruction and completion. In CVPR, Cited by: §1, §2.
  • Y. Deng, J. Yang, and X. Tong (2021) Deformed implicit field: modeling 3d shapes with learned dense correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10286–10296. Cited by: §2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
  • S. Giebenhain and B. Goldlücke (2021) AIR-nets: an attention-based framework for locally conditioned implicit representations. In 2021 International Conference on 3D Vision (3DV), pp. 1054–1064. Cited by: §2, §2.
  • A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman (2020) Implicit geometric regularization for learning shapes. ICML. Cited by: §1, §2.
  • K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al. (2020) A survey on visual transformer. arXiv e-prints, pp. arXiv–2012. Cited by: §2.
  • Z. Hao, H. Averbuch-Elor, N. Snavely, and S. Belongie (2020) Dualsdf: semantic shape manipulation using a two-level representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7631–7641. Cited by: §2.
  • K. Hui, R. Li, J. Hu, and C. Fu (2022) Neural template: topology-aware reconstruction and disentangled generation of 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18572–18582. Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: Appendix A, §3.2.
  • A. Jacobson, I. Baran, J. Popovic, and O. Sorkine (2011) Bounded biharmonic weights for real-time deformation.. ACM Trans. Graph. 30 (4), pp. 78. Cited by: §2.
  • T. Jakab, R. Tucker, A. Makadia, J. Wu, N. Snavely, and A. Kanazawa (2021) KeypointDeformer: unsupervised 3d keypoint discovery for shape control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12783–12792. Cited by: §2.
  • C. Jiang, J. Huang, A. Tagliasacchi, and L. J. Guibas (2020a) Shapeflow: learnable deformation flows among 3d shapes. Advances in Neural Information Processing Systems 33, pp. 9745–9757. Cited by: §2, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
  • C. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, T. Funkhouser, et al. (2020b) Local implicit grid representations for 3d scenes. In CVPR, pp. 608–625. Cited by: §1, §2.
  • M. Kazhdan and H. Hoppe (2013) Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32 (3), pp. 1–13. Cited by: Figure 12.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • Z. Levi and C. Gotsman (2014) Smooth rotation enhanced as-rigid-as-possible mesh animation. IEEE transactions on visualization and computer graphics 21 (2), pp. 264–277. Cited by: §1, §2.
  • Y. Li, H. Takehara, T. Taketomi, B. Zheng, and M. Nießner (2021) 4dcomplete: non-rigid motion estimation beyond the observable surface. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12706–12716. Cited by: Appendix D, Table 3, Table 4, Table 5, §2, §2, §4, §4.4, Table 1.
  • Y. Lipman, O. Sorkine, D. Cohen-Or, D. Levin, C. Rossi, and H. Seidel (2004) Differential coordinates for interactive mesh editing. In Proceedings Shape Modeling Applications, 2004., pp. 181–190. Cited by: §1, §2.
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §2.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §1, §2.
  • Z. Mi, Y. Luo, and W. Tao (2020) Ssrnet: scalable 3d surface reconstruction network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 970–979. Cited by: §2.
  • M. Mihajlovic, Y. Zhang, M. J. Black, and S. Tang (2021) LEAP: learning articulated occupancy of people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10461–10471. Cited by: §2.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: Appendix A, §3.2.
  • T. Milliron, R. J. Jensen, R. Barzel, and A. Finkelstein (2002) A framework for geometric warps and deformations. ACM Transactions on Graphics (TOG) 21 (1), pp. 20–51. Cited by: §2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: Appendix A, §3.2.
  • M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2019) Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5379–5389. Cited by: Appendix D, §2, §4.
  • A. A. A. Osman, T. Bolkart, and M. J. Black (2020) STAR: a sparse trained articulated human body regressor. European Conference on Computer Vision (ECCV), pp. 598–613. External Links: Link Cited by: §2.
  • P. Palafox, A. Božič, J. Thies, M. Nießner, and A. Dai (2021) Npms: neural parametric models for 3d deformable shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12695–12705. Cited by: §2, §2.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §1, §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §4.
  • S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger (2020) Convolutional occupancy networks. In ECCV, Cited by: §1, §2, §3.2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30. Cited by: §4.2.
  • Y. Rao, Y. Nie, and A. Dai (2022) PatchComplete: learning multi-resolution patch priors for 3d shape completion on unseen categories. Advances in Neural Information Processing Systems. Cited by: §2.
  • E. Rodolà, L. Cosmo, M. M. Bronstein, A. Torsello, and D. Cremers (2017) Partial functional correspondence. In Computer graphics forum, Vol. 36, pp. 222–236. Cited by: Figure 6, §4, §4.1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.2.
  • N. Rüegg, S. Zuffi, K. Schindler, and M. J. Black (2022) BARC: learning to regress 3d dog shape from images by exploiting breed information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3884. Cited by: Figure 13, Appendix E.
  • S. Saito, J. Yang, Q. Ma, and M. J. Black (2021) SCANimate: weakly supervised learning of skinned clothed avatar networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2886–2897. Cited by: §2.
  • T. W. Sederberg and S. R. Parry (1986) Free-form deformation of solid geometric models. In Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pp. 151–160. Cited by: §2.
  • M. Shechter, R. Hanocka, G. Metzer, R. Giryes, and D. Cohen-Or (2022) NeuralMLS: geometry-aware control point deformation. Cited by: §2.
  • O. Sorkine and M. Alexa (2007) As-rigid-as-possible surface modeling. In Symposium on Geometry processing, Vol. 4, pp. 109–116. Cited by: Neural Shape Deformation Priors, §1, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
  • O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rössl, and H. Seidel (2004) Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 175–184. Cited by: §1, §2.
  • O. Sorkine (2006) Differential representations for mesh processing. In Computer Graphics Forum, Vol. 25, pp. 789–807. Cited by: §2.
  • R. W. Sumner and J. Popović (2004) Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23 (3), pp. 399–405. Cited by: Figure 5, §4, Table 1.
  • R. W. Sumner, J. Schmid, and M. Pauly (2007) Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 papers, pp. 80–es. Cited by: §1, §2.
  • J. Tang, X. Han, J. Pan, K. Jia, and X. Tong (2019) A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 4541–4550. Cited by: §1, §2.
  • J. Tang, X. Han, M. Tan, X. Tong, and K. Jia (2021a) Skeletonnet: a topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • J. Tang, J. Lei, D. Xu, F. Ma, K. Jia, and L. Zhang (2021b) SA-convonet: sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6504–6513. Cited by: §1, §2.
  • J. Tang, D. Xu, K. Jia, and L. Zhang (2021c) Learning parallel dense correspondence from spatio-temporal descriptors for efficient and robust 4d reconstruction. In CVPR, pp. 6022–6031. Cited by: §2, §2.
  • D. Terzopoulos, J. Platt, A. Barr, and K. Fleischer (1987) Elastically deformable models. In Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pp. 205–214. Cited by: §1, §2.
  • E. Tretschk, A. Tewari, V. Golyanik, M. Zollhöfer, C. Stoll, and C. Theobalt (2020) PatchNets: patch-based generalizable deep implicit 3d shape representations. In ECCV, pp. 108–124. Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: Appendix A, §2, §3.2.
  • S. Wang, A. Geiger, and S. Tang (2021a) Locally aware piecewise transformation fields for 3d human mesh registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7639–7648. Cited by: §2.
  • S. Wang, M. Mihajlovic, Q. Ma, A. Geiger, and S. Tang (2021b) Metaavatar: learning animatable clothed human models from few depth images. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §2.
  • Q. Xu, W. Wang, D. Ceylan, R. Mech, and U. Neumann (2019) DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, Cited by: §1, §2.
  • G. Yang, S. Belongie, B. Hariharan, and V. Koltun (2021) Geometry processing with neural fields. Advances in Neural Information Processing Systems 34. Cited by: §2, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
  • H. Yu, F. Li, M. Saleh, B. Busam, and S. Ilic (2021) CoFiNet: reliable coarse-to-fine correspondences for robust pointcloud registration. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • B. Zhang, M. Nießner, and P. Wonka (2022) 3DILG: irregular latent grids for 3d generative modeling. In Advances in Neural Information Processing Systems, Cited by: §2, §2.
  • B. Zhang and P. Wonka (2021) Training data generating networks: shape reconstruction via bi-level optimization. In International Conference on Learning Representations, Cited by: §2.
  • H. Zhao, J. Jia, and V. Koltun (2020) Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076–10085. Cited by: Appendix A, §3.2.
  • H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun (2021) Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268. Cited by: §1, §2, §3.2.
  • S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2021a) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6881–6890. Cited by: §2.
  • Z. Zheng, T. Yu, Q. Dai, and Y. Liu (2021b) Deep implicit templates for 3d shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1429–1439. Cited by: §2.
  • K. Zhou, J. Huang, J. Snyder, X. Liu, H. Bao, B. Guo, and H. Shum (2005) Large mesh deformation using the volumetric graph laplacian. In ACM SIGGRAPH 2005 Papers, pp. 496–503. Cited by: §1, §2.

Neural Shape Deformation Priors
– Supplementary Material –

Our Neural Shape Deformation Priors method is based on transformer-based deformation networks that represent the deformation as a composition of local surface deformations. The underlying architectures are discussed in Appendix A. The evaluation metrics used are detailed in Appendix B. Our notation is further explained in Appendix C, and more details about data pre-processing are given in Appendix D. In addition to the results shown in the main paper, we conducted further experiments (see Appendix E). While our method exhibits good generalization to unseen poses and shapes, we discuss and show failure cases in Appendix F.

Appendix A Network Architectures

Vector Cross Attention:

In Figure 9, we illustrate the architecture of vector cross attention (VCA) Zhao et al. (2020), which is a building block of our transformer-based deformation network (see Figure 3 in the main paper). The query and key-value feature vectors are transformed by three linear projectors, each of which is a fully-connected layer. To leverage the relative positional information of the query and key-value points, their coordinate difference is encoded by a positional embedding module Vaswani et al. (2017); Mildenhall et al. (2020) that consists of two linear layers with a single ReLU Nair and Hinton (2010). The sum of the projected feature difference and the positional encoding is further processed by an MLP. Next, a softmax function is used to generate normalized attention scores, which are used to compute a weighted combination of the projected key-value features and obtain the aggregated feature.

Figure 9: Vector Cross Attention (VCA), Point Transformer Block (PTB), and Point Abstraction Block (PAB).
Figure 10: Point Transformer Encoder.
Figure 11: Attentive Deformation Decoder.

Point Transformer Block (PTB):

As illustrated in Figure 9, the point transformer block encapsulates the information from the 16 nearest neighbors of each point while keeping the point's position unchanged. The input is fed into a vector self-attention (VSA) block followed by a BatchNorm (BN) layer Ioffe and Szegedy (2015), with a residual connection from the input.

Point Abstraction Block (PAB):

The point abstraction block consists of a farthest point sampling (FPS) module, a VCA module, and a VSA module, followed by a BN layer. FPS is used to downsample the input point set, which is then fed into the VCA module followed by the VSA module. We employ a skip connection from the original point set to the VCA module. The outputs of FPS and the VSA module are fed into a BatchNorm layer, which computes the output of the point abstraction block.

Point Transformer Encoder

As shown in Figure 10, a PTB is used to obtain an initial feature encoding. Two consecutive point abstraction blocks (PABs) are then used to obtain progressively downsampled feature point clouds. To enhance global deformation priors, we stack 4 point transformer blocks with full self-attention (neighborhood size set to 100) to exchange global information across the whole downsampled set. By doing so, we obtain a sparse set of local deformation descriptors anchored at the downsampled point locations. Finally, a global max-pooling operation followed by two linear layers is used to obtain the global latent vector.

Attentive Deformation Decoder

The detailed architecture of the attentive deformation decoder is shown in Figure 11. It fuses the nearby local latent codes of a query point, under the guidance of the global latent code, into an aggregated feature, and feeds this feature into an MLP consisting of five stacked Res-FC blocks to estimate the displacement vector of the query point.

Appendix B Evaluation Metrics

For defining the evaluation metrics, we assume two meshes, the ground-truth mesh and the deformed mesh, which share the same connectivity.

Vertex error:

The vertex distance error is the mean squared distance between the ground-truth vertices and the deformed vertices, averaged over the number of mesh vertices.

Chamfer distance:

To calculate the Chamfer distance between the ground-truth and deformed meshes, we first sample a point set from each mesh. The Chamfer distance of the two point sets is then defined as the sum of the mean nearest-neighbor distances from each set to the other.

Face Normal Consistency

The face normal consistency is the mean cosine similarity between the triangle normals of the two meshes: for each pair of corresponding faces, we compute the dot product of their unit normals and average over the number of triangle faces.
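The three metrics can be sketched as follows; the exact distance conventions inside the Chamfer distance are an assumption.

```python
import torch
import torch.nn.functional as F

def vertex_error(v_deformed, v_gt):
    """Mean squared distance between corresponding deformed and ground-truth vertices."""
    return ((v_deformed - v_gt) ** 2).sum(dim=-1).mean()

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two sampled point sets."""
    d = torch.cdist(p, q)                                   # (|P|, |Q|) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def face_normal_consistency(n_deformed, n_gt):
    """Mean cosine similarity between corresponding face normals."""
    return F.cosine_similarity(n_deformed, n_gt, dim=-1).mean()
```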

Appendix C Notation

We explain our notation in more detail, after having briefly defined it in Section 3. We denote the source, canonical, target, and deformed meshes of the considered shapes; the source mesh is defined by its set of vertices and its set of faces, and is deformed in a two-step approach. We select a sparse set of handles on the original shape; the handles can be dragged to new target locations, which define the target mesh. The continuous deformation field learnt in our work is applied to deform the vertices of the source mesh, yielding the deformed mesh. The overall deformation field is the composition of a backward deformation field and a forward deformation field. Since our method performs operations in the point cloud domain, we sample point clouds from the surface meshes: a surface point cloud of fixed size, together with the binary user handle mask, is passed through the backward transformation network and mapped into the canonical pose, and then through the forward transformation network and mapped into the target pose. Further, consult Table 2 for the definition of all items.

Notations Meaning

Source mesh, canonical mesh, target mesh, deformed mesh
Vertices, faces of source mesh
Set of handles, -th handle location
Set of target locations of handles, -th target location
Binary user handle mask, -th element of
Surface point clouds of size sampled from the surface of
Target handle point locations
Non-surface point clouds of size sampled from the 3D space of
-th non-surface querying point
Size of surface point clouds
Size of non-surface point clouds
-th point from
Mapping of in canonical pose, target pose
Backward deformation field, forward deformation field
Deformation field between two arbitrary poses (composition of the backward and forward fields)
Backward transformation network, forward transformation network
Query sequence, key-value sequence
Coordinate of -th query point, corresponding feature vector, aggregated feature
Coordinate of -th key-value point, corresponding feature vector
Vector cross attention
Fully-connected layers
Attention weight normalization function, e.g. softmax function
Positional embedding module
Vector self-attention operator
Point transformer block, point abstraction block
BatchNorm Layer
Set of local deformation descriptors
A point in , corresponding feature vector
Coordinates and feature vector of -th deformation descriptor
Global latent vector
Backward loss function, forward loss function, end-to-end loss function
Table 2: Notations in order of appearance in the main paper.

Appendix D Data

To train and evaluate our method, we use the DeformingThing4D Li et al. (2021) dataset, which is available under a non-commercial academic license. It does not contain personally identifiable information or offensive content. We have obtained consent to use the dataset.

Train/test split

The DeformingThing4D dataset consists of a large number of quadruped animal animations with various motions, such as "bear3EP Jump", "bear9AK Jump", or "bear3EP Lie", where "bear3EP" and "bear9AK" are identity names and "Jump" and "Lie" are motion names. Similar to the D-FAUST Bogo et al. (2017) splits used in OFlow Niemeyer et al. (2019), our train/test split is based on these identity and motion names of the deforming sequences. We first divide the animations of the dataset into two parts, seen identities and unseen identities. The animations of seen identities are further divided into seen motions of seen identities (used as the training set) and unseen motions of seen identities (used as test set S1). The animations of unseen identities are used as test set S2. In total, the train, test S1, and test S2 sets contain 1296, 143, and 55 deforming sequences, respectively.
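
A small helper illustrating how such a split can be derived from the sequence names is given below; the separator between identity and motion and the concrete membership lists are assumptions for illustration.

```python
def parse_sequence_name(name: str):
    """Split e.g. 'bear3EP Jump' into identity 'bear3EP' and motion 'Jump'."""
    identity, motion = name.split(maxsplit=1)
    return identity, motion

def split_sequences(names, seen_identities, seen_motions):
    train, test_s1, test_s2 = [], [], []
    for name in names:
        identity, motion = parse_sequence_name(name)
        if identity not in seen_identities:
            test_s2.append(name)        # unseen identities
        elif motion in seen_motions:
            train.append(name)          # seen identities, seen motions
        else:
            test_s1.append(name)        # seen identities, unseen motions
    return train, test_s1, test_s2
```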

Data preparation

In Section 3.3 of the main text, we mentioned that our method utilizes triplets consisting of a source, a canonical, and a target mesh with dense correspondences for training. Surface point clouds with one-to-one correspondence are sampled from the surfaces of the three meshes, and non-surface point sets are sampled from the surrounding 3D space. Here, we provide the details of the data preparation. First, we sample surface points from the canonical mesh and store the corresponding barycentric weights of the sampled points. Then, each point is randomly perturbed by a small displacement along the normal direction of the corresponding triangle, where the displacement distance is drawn from a Gaussian distribution. Next, for the source and target meshes, we use the same barycentric weights to obtain surface points with correspondences, and the same displacements to obtain corresponding non-surface points. Concretely, we pre-compute a large set of surface points from each canonical mesh and generate the non-surface points by perturbing half of the surface points with one Gaussian noise level and the other half with a different one. During training, we down-sample the surface and non-surface point sets, using the same sampling indices for all three meshes to maintain one-to-one correspondence.
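
The following NumPy sketch outlines this sampling procedure: area-weighted surface sampling with stored face indices and barycentric weights, and near-surface points generated by displacing samples along face normals with Gaussian offsets. The function names and the specific standard deviations are illustrative assumptions.

```python
import numpy as np

def sample_surface(verts, faces, n):
    """Sample n face indices and barycentric weights, proportional to face area."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    face_idx = np.random.choice(len(faces), size=n, p=areas / areas.sum())
    # uniform barycentric coordinates on a triangle
    u, v = np.random.rand(n), np.random.rand(n)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    bary = np.stack([1.0 - u - v, u, v], axis=1)             # (n, 3)
    return face_idx, bary

def points_from_barycentric(verts, faces, face_idx, bary):
    """Recover 3D points from stored face indices and barycentric weights."""
    tri = verts[faces[face_idx]]                              # (n, 3, 3)
    return (bary[:, :, None] * tri).sum(axis=1)               # (n, 3)

def displace_along_normals(verts, faces, face_idx, pts, sigma):
    """Perturb points along the normal of their triangle by Gaussian offsets."""
    a, b, c = verts[faces[face_idx, 0]], verts[faces[face_idx, 1]], verts[faces[face_idx, 2]]
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-12
    offsets = np.random.randn(len(pts), 1) * sigma            # Gaussian displacement distances
    return pts + offsets * n, offsets

# Reusing the same face_idx, bary, and offsets on the corresponding source and
# target meshes yields point sets with one-to-one correspondence.
```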

Appendix E Additional Results

Effects of point cloud sampling density

To study the effect of the sampling density of the input point cloud, we train our model separately with point clouds of size 2500, 5000, and 7500 as input. Quantitative results are shown in Table 3. We observe that the evaluation metrics only show small variations across the sampling densities. To balance accuracy and computational cost, we use 5000 points in our final model.

#sampling points    New motions (S1)          Unseen identities (S2)
                    VE      CD      FNC       VE      CD      FNC
Ours-2500           0.789   1.008   96.27     0.905   1.285   96.57
Ours-5000           0.752   0.948   96.59     0.795   1.241   96.68
Ours-7500           0.732   0.944   96.39     0.789   1.251   96.66

Table 3: Quantitative results for different input point cloud densities on the S1 and S2 test sets of the DeformingThing4D Li et al. (2021) dataset. VE, CD, and FNC denote the vertex error, Chamfer distance, and face normal consistency defined in Appendix B.

Robustness to noisy source mesh

To analyze the robustness to noise, we train our model separately while adding Gaussian noise perturbations to the source meshes. The standard deviation of the Gaussian noise is set to 0, 0.0025, or 0.005. The comparison in Table 4 shows that as the noise becomes larger, the performance of our method varies only slightly, which demonstrates the robustness of our method to noisy source meshes.

#standard deviation    New motions (S1)          Unseen identities (S2)
                       VE      CD      FNC       VE      CD      FNC
Ours-0                 0.752   0.948   96.59     0.795   1.241   96.68
Ours-0.0025            0.774   0.973   95.90     0.808   1.278   96.65
Ours-0.0050            0.851   1.017   96.50     0.911   1.392   96.16

Table 4: Quantitative results for source meshes with different noise intensities on the S1 and S2 test sets of the DeformingThing4D Li et al. (2021) dataset.

Robustness to partial source mesh

To investigate the robustness to incomplete source meshes, we randomly sample 5 seed points from the source mesh surface and then remove their nearest vertices together with the corresponding faces. The number of removed vertices is determined by the incompleteness ratio multiplied by the number of source mesh vertices. Again, our model is directly evaluated, without fine-tuning, under two settings of the incompleteness ratio, 0.05 and 0.10. The quantitative results are provided in Table 5. As can be seen, there are no significant numerical variations between the different incompleteness ratios, which demonstrates the robustness of our approach to incomplete source meshes.
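
A minimal NumPy sketch of this corruption procedure is shown below; distributing the removal budget evenly over the 5 seeds is an assumption, as the exact per-seed allocation is not essential to the experiment.

```python
import numpy as np

def make_partial_mesh(verts, faces, ratio=0.05, n_seeds=5, rng=None):
    """Remove roughly ratio * |V| vertices around n_seeds random surface seeds."""
    rng = rng or np.random.default_rng()
    n_remove = int(ratio * len(verts))
    seeds = verts[rng.choice(len(verts), size=n_seeds, replace=False)]
    removed = set()
    per_seed = max(n_remove // n_seeds, 1)
    for s in seeds:
        d = np.linalg.norm(verts - s, axis=1)
        removed.update(np.argsort(d)[:per_seed].tolist())
    keep = np.array([i for i in range(len(verts)) if i not in removed])
    # re-index the remaining vertices and drop faces touching removed vertices
    remap = -np.ones(len(verts), dtype=int)
    remap[keep] = np.arange(len(keep))
    kept_faces = faces[np.all(np.isin(faces, keep), axis=1)]
    return verts[keep], remap[kept_faces]
```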

#incompleteness ratio    New motions (S1)          Unseen identities (S2)
                         VE      CD      FNC       VE      CD      FNC
Ours-0.0                 0.752   0.948   96.59     0.795   1.241   96.68
Ours-0.05                0.770   0.957   95.80     0.804   1.244   96.66
Ours-0.10                0.823   1.002   96.44     0.858   1.261   96.55

Table 5: Quantitative results for source meshes with different incompleteness ratios on the S1 and S2 test sets of the DeformingThing4D Li et al. (2021) dataset. Note that our model is directly evaluated on partial meshes without fine-tuning.

Evaluations on real animal scans.

We evaluate our pre-trained model on real animal scans that we captured ourselves. As shown in Figure 12, our method still produces realistic shape deformations, which demonstrates the generalization ability of our approach to real captured models.

Figure 12: Evaluation on real animal scans. (a) Real animal scans. (b) Source meshes obtained via Screened PSR Kazhdan and Hoppe (2013), with handles. (c) Our results.

Evaluations on reconstructed animals from real images.

In addition, we evaluate our pre-trained model on animals reconstructed from real RGB images using the BARC Rüegg et al. (2022) method. As shown in Figure 13, our method estimates realistic deformations for animals reconstructed from natural images, which further demonstrates its generalization ability.

Figure 13: Evaluation on animals reconstructed from real RGB images using BARC Rüegg et al. (2022). (a) Real images. (b) Reconstructed source meshes and handles. (c) Our results. (d) Reconstructed source meshes and handles. (e) Our results.

Evaluations on non-realistic user-specified handles.

While the goal of our data-driven deformation priors is to obtain deformations that are as realistic as possible, we also evaluate our method on non-realistic or physically implausible handle movements. As shown in Figure 14, our method tries to find the deformation that best explains the provided handle displacements while staying close to the learned distribution of animal deformations. Note that our method could also be trained on non-realistic or physically implausible samples and would then learn the respective deformation behavior.

Figure 14: Evaluation on non-realistic user-specified handles. (a) Source meshes and handles. (b) Ours.

Without dense correspondence

While our current method uses an existing dataset where dense correspondences between temporal mesh frames are available, our framework can also be trained on datasets without dense correspondences through some adjustments to the inputs and loss functions. Concretely, we change our method to receive sparse handle correspondences as input, and use the Chamfer distance as a loss function that does not require ground-truth meshes with dense correspondences as supervision. In Figure 15, we visualize several test results of this modified framework. As can be seen, even without dense correspondences for training, our method still obtains accurate deformations.
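
A minimal PyTorch sketch of such a correspondence-free objective is given below: a symmetric Chamfer term between the deformed and target point sets, combined with an L2 term on the sparse handle correspondences. The relative weighting is an illustrative assumption.

```python
import torch

def chamfer_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred: (N, 3) deformed surface points, target: (M, 3) target surface points
    d = torch.cdist(pred, target)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def training_loss(pred_points, target_points, pred_handles, target_handles, w=1.0):
    # sparse handle correspondences provide a direct L2 supervision signal
    handle_term = ((pred_handles - target_handles) ** 2).sum(dim=1).mean()
    return chamfer_loss(pred_points, target_points) + w * handle_term
```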

Figure 15: The evaluation results of our modified framework that uses sparse handles as input and does not require dense correspondences as supervision. (a) Source meshes and handles. (b) Target meshes and handles. (c) Our results with vertex error map.

Video animations

To visualize the deformation behaviors of the different approaches, we use a sequence of handle movements as input and run our model frame by frame to obtain a deformation motion sequence. We refer to the supplemental video for an animated sequence.

Appendix F Limitations

Figure 16: Failure cases. (a) Source meshes and handles. (b) Target meshes and handles. (c) Our results with vertex error maps.

While compelling results have been demonstrated for shape manipulation, a few limitations of our approach remain that can be addressed in future work. Two representative failure cases are depicted in Figure 16. Our method cannot handle extreme shape deformations well (e.g., left of Figure 16), nor can it manipulate unseen identities that are far from the training data distribution (e.g., the elephant on the right of Figure 16). We believe these issues can be alleviated by a larger training dataset, a richer data augmentation strategy, and/or few-shot generalization techniques in the future.