1 Introduction
Editing and deforming 3D shapes is a key component in animation creation and computer-aided design pipelines. Given as little user input as possible, the goal is to create new deformed instances of the original 3D shape which look natural and behave like real objects or animals. The user input is assumed to be very sparse, such as vertex handles that can be dragged around. For example, users can animate a 3D model of an animal by dragging its feet forward. This problem is severely ill-posed and typically under-constrained, as there are many possible deformations that are consistent with the partial surface deformation provided by the handles, especially for large surface deformations. Thus, strong priors encoding deformation regularity are necessary to tackle this problem. Physics and differential geometry provide solutions that use various analytical priors which define natural-looking mesh deformations, such as elasticity Terzopoulos et al. (1987); Alexa et al. (2000), Laplacian smoothness Lipman et al. (2004); Sorkine et al. (2004); Zhou et al. (2005), and rigidity Sorkine and Alexa (2007); Sumner et al. (2007); Levi and Gotsman (2014) priors. They update mesh vertex coordinates by iteratively optimizing energy functions that satisfy constraints from both the pre-defined deformation priors and the given handle locations. Although these algorithms can preserve geometric details of the original source model, they have limited capacity to model realistic deformations, since the deformation priors are region independent, e.g., the head region of an animal is assumed to deform in the same way as its tail, resulting in unrealistic deformation states.
Hence, motivated by the recent success of deep neural networks for 3D shape modeling Mescheder et al. (2019); Park et al. (2019); Chen and Zhang (2019); Xu et al. (2019); Tang et al. (2019); Chibane et al. (2020); Peng et al. (2020); Jiang et al. (2020b); Atzmon and Lipman (2020); Gropp et al. (2020); Chabra et al. (2020); Tretschk et al. (2020); Tang et al. (2021b); Chen et al. (2021b)
, we propose to learn shape deformation priors of a specific object class, e.g., quadruped animals, to complete surface deformations beyond the observed handles. We formulate the following properties of such a learned model: (1) it should be robust to different mesh qualities and numbers of vertices, (2) the source mesh should not be limited to a canonical pose (i.e., the input mesh can have an arbitrary pose), and (3) it should generalize well to new deformations. Towards these goals, we represent deformations as a continuous deformation field which is defined in the near-surface region to describe the space deformation caused by the corresponding surface deformation. The continuity property enables us to manipulate meshes with an arbitrary number of vertices and disconnected components. To handle source meshes in arbitrary poses, we learn shape deformations via canonicalization. Specifically, the overall deformation process consists of two stages: an arbitrary-to-canonical transformation and a canonical-to-arbitrary transformation. To obtain more detailed surface deformations and better generalization to unseen deformations, we propose to learn local deformation fields conditioned on local latent codes encoding geometry-dependent deformation priors, instead of global deformation fields conditioned on a single latent code. To this end, we propose Transformer-based Deformation Networks (TD-Nets), which learn encoder-based local deformation fields on point cloud approximations of the input mesh. Concretely, TD-Nets encode an input point cloud carrying surface geometry information and incomplete deformation flow into a sparse set of local latent codes and a global feature vector by using the vector attention blocks proposed in
Zhao et al. (2021). The deformation vectors of spatial points are estimated by an attentive decoder, which aggregates the information of neighboring local latent codes of a spatial point based on the feature similarity relationships. The aggregated feature vectors are finally passed to a multi-layer-perceptron (MLP) to predict displacement vectors which can be applied to the source mesh to compute the final output mesh.
To summarize, we introduce transformer-based local deformation field networks which are capable of learning shape deformation priors for the task of user-driven shape manipulation. The deformation networks learn a set of anchor features based on a vector attention mechanism, enhancing the global deformation context and selecting the most informative local deformation descriptors for displacement vector estimation, leading to an improved generalization ability to new deformations. In comparison to classical hand-crafted deformation priors as well as recent neural network-based deformation predictors, our method achieves more accurate and natural shape deformations.
2 Related Work
User-guided shape manipulation lies at the intersection of computer graphics and computer vision. Our proposed method is related to polygonal mesh geometry processing, neural field representations, as well as vision transformers.
Optimization-based Shape Manipulation.
Classical methods formulate shape manipulation as a mathematical optimization problem. They perform mesh deformations by either deforming the vertices
Botsch and Sorkine (2007); Sorkine (2006) or the 3D space Jacobson et al. (2011); Bechmann (1994); Levi and Gotsman (2014); Milliron et al. (2002); Sederberg and Parry (1986). Performing mesh deformation without any further information about the target shape, using only limited user-provided correspondences, is an under-constrained problem. To this end, optimization methods require deformation priors to constrain the deformation regularity as well as the smoothness of the deformed surface. Various analytic priors have been proposed which encourage smooth surface deformations, such as elasticity Terzopoulos et al. (1987); Alexa et al. (2000), Laplacian smoothness Lipman et al. (2004); Sorkine et al. (2004); Zhou et al. (2005), and rigidity Sorkine and Alexa (2007); Sumner et al. (2007); Levi and Gotsman (2014). These methods use efficient linear solvers to iteratively optimize energy functions that satisfy constraints from both the pre-defined deformation prior and the provided handle movements. Recently, NFGP Yang et al. (2021) was proposed to optimize neural networks with non-linear deformation regularizations. Specifically, it performs shape deformations by warping the neural implicit fields of the source model through a deformation vector field, which is constrained by modeling implicitly represented surfaces as elastic shells. NeuralMLS Shechter et al. (2022) learns a geometry-aware weight function of a shape and given control points for moving least squares (MLS) deformations, which smoothly interpolates the control point displacements over space. Although these methods can preserve many geometric details of the source shape, they struggle to model complex deformations, as local surfaces are simply constrained to transform in a similar manner. In contrast, we aim to learn deformation priors based on local geometries to infer hidden surface deformations.
Learning-based Shape Reconstruction and Manipulation.
Learning-based shape manipulation has been studied by learning shape priors based on shape auto-encoding or auto-decoding. Zheng et al. (2021b); Deng et al. (2021); Hao et al. (2020); Jiang et al. (2020a) map a class of shapes into a latent space. During inference, given handle positions as input, they find an optimal latent code whose 3D interpretation is most similar to the observation. In contrast, we learn explicit deformation priors to directly predict 3D surface deformations. Jakab et al. Jakab et al. (2021) proposed to control shapes via unsupervised 3D keypoint discovery. Instead, we use partial surface deformations represented by handle displacements as input observations, rather than keypoint displacements. There exists a series of methods that use deep neural networks to complete non-rigid shapes Jiang et al. (2020a); Palafox et al. (2021); Božič et al. (2021); Li et al. (2021); Tang et al. (2021c); Saito et al. (2021); Wang et al. (2021b); Burov et al. (2021) from partial scans. Our task is related to this line of work, but shape manipulation from user input requires completing the deformation field itself. In contrast to shape completion, our setting is more under-constrained, as the user-provided handle correspondences are very sparse and far more incomplete than partial point clouds from scans. Recent methods for clothed human body reconstruction canonicalize the captured scan into a pre-defined T-pose Wang et al. (2021a); Mihajlovic et al. (2021); Chen et al. (2021a) using the skeletal deformation model of SMPL Loper et al. (2015) or STAR Osman et al. (2020), which can also be used to later animate the human. Inspired by this, we also perform a canonicalization to enable editing of source meshes with arbitrary poses, before applying the actual deformation towards the target pose handles.
Continuous Neural Fields.
Continuous neural field representations have been widely used in 3D shape modeling Mescheder et al. (2019); Chen and Zhang (2019); Park et al. (2019) and 4D dynamics capture Niemeyer et al. (2019); Tang et al. (2021c); Božič et al. (2021); Palafox et al. (2021); Li et al. (2021). Recent work that represents 3D shapes as continuous signed distance fields Atzmon and Lipman (2020); Xu et al. (2019); Gropp et al. (2020); Chabra et al. (2020); Tretschk et al. (2020) or occupancy fields Mescheder et al. (2019); Chen and Zhang (2019); Chibane et al. (2020); Mi et al. (2020); Peng et al. (2020); Jiang et al. (2020b); Tang et al. (2021a, b); Giebenhain and Goldlücke (2021); Zhang and Wonka (2021) can in principle obtain volumetric reconstructions at arbitrary resolution, as they are not bound to the resolution of a discrete grid structure. Similarly, we learn continuous deformation fields defined in 3D space for shape deformations Tang et al. (2019); Jiang et al. (2020a); Yang et al. (2021); Hui et al. (2022). Due to the continuity of the deformation fields, our method is not limited by the number of mesh vertices or by disconnected components. Different from ShapeFlow Jiang et al. (2020a), OFlow Niemeyer et al. (2019), LPDC-Net Tang et al. (2021c), and NPMs Palafox et al. (2021), which learn a deformation field from a single latent code, we model the deformation field as a composition of local deformation functions, inspired by local implicit field learning Chibane et al. (2020); Peng et al. (2020); Tang et al. (2021b); Giebenhain and Goldlücke (2021); Zhang et al. (2022). This improves the capability of representing complex deformations as well as the generalization to new deformations.
Visual Transformers.
Recently, transformer architectures Vaswani et al. (2017)
from natural language processing have revolutionized many computer vision tasks, including image classification
Dosovitskiy et al. (2020); Wang et al. (2018), object recognition Carion et al. (2020), semantic segmentation Zheng et al. (2021a), and 3D reconstruction Bozic et al. (2021); Yu et al. (2021); Giebenhain and Goldlücke (2021); Zhang et al. (2022); Rao et al. (2022). We refer the reader to Han et al. (2020) for a detailed survey of visual transformers. In this work, we propose the usage of a transformer architecture to learn deformation fields. Given the input point cloud sampled from the source mesh with partial deformation flow (defined by the user handles), we employ the vector attention blocks from Point Transformer Zhao et al. (2021) as the main point cloud processing module to extract a sparse set of local latent codes, enhancing the global understanding of deformation behaviours. Based on the obtained local deformation descriptors, our attentive deformation decoder learns to attend to the most informative features from nearby local codes to predict a deformation field.
3 Approach
Given a source mesh $\mathcal{M}_s = (\mathcal{V}_s, \mathcal{F}_s)$, where $\mathcal{V}_s$ and $\mathcal{F}_s$ denote the set of vertices and the set of faces, respectively, we aim to deform $\mathcal{M}_s$ to obtain a target mesh $\mathcal{M}_t$ by selecting a sparse set of mesh vertices $\mathcal{H} = \{h_1, \dots, h_K\} \subset \mathcal{V}_s$ as handles and dragging them to target locations $\mathcal{T} = \{t_1, \dots, t_K\}$. The key idea in this work is to use deformation priors to complete hidden surface deformations. Specifically, the goal is to learn a continuous deformation field $\mathcal{D}$ defined in 3D space, from which we can obtain the deformed mesh $\mathcal{M}_d$ through vertex deformations of the source mesh $\mathcal{M}_s$. The overall pipeline of the proposed approach is shown in Figure 2. Our method can be applied to input meshes in arbitrary poses by leveraging learned shape deformation via canonicalization (see Section 3.1). To represent the underlying deformation prior, we propose neural deformation fields as described in Section 3.2, which can be learned from large deformation datasets (see Section 3.3).

3.1 Learning Shape Deformations via Canonicalization
To ensure robustness w.r.t. varying input mesh quality (topology and resolution), we operate on point clouds instead of meshes. Specifically, we sample a point cloud $\mathcal{X} = \{x_i\}_{i=1}^{N}$ of size $N$ from $\mathcal{M}_s$. We define the target handle point locations $\mathcal{Y} = \{y_i\}_{i=1}^{N}$, where we use zeros to represent unknown point flows. Further, to avoid the ambiguity of zero point flow, we define the corresponding binary user handle mask $\mathcal{B} = \{b_i\}_{i=1}^{N}$, where $b_i = 1$ if $x_i$ is a handle and $b_i = 0$ otherwise.
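The following minimal PyTorch sketch illustrates how these inputs could be assembled; the function and variable names are illustrative, and whether the flow channel stores displacement vectors or absolute target locations is an implementation choice not fixed by the text.

```python
import torch

def build_inputs(points, handle_idx, handle_targets):
    """Assemble per-point encoder inputs from a sampled surface point cloud.

    points:         (N, 3) surface samples from the source mesh
    handle_idx:     (H,)   indices of sampled points selected as handles
    handle_targets: (H, 3) user-specified target locations of the handles
    """
    n = points.shape[0]
    flow = torch.zeros(n, 3)                              # unknown flow is encoded as zero
    flow[handle_idx] = handle_targets - points[handle_idx]
    mask = torch.zeros(n, 1)                              # disambiguates true zero flow
    mask[handle_idx] = 1.0
    return torch.cat([points, flow, mask], dim=-1)        # (N, 7) per-point channels

pts = torch.rand(5000, 3)
idx = torch.tensor([0, 10, 42])
inputs = build_inputs(pts, idx, pts[idx] + 0.05)          # drag three handles slightly
print(inputs.shape)                                       # torch.Size([5000, 7])
```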
To learn the shape transformation between two arbitrary non-rigidly deformed poses, one could learn deformation fields that directly map the source deformed space to the target space. However, it would be difficult to learn such deformation priors well, as there are infinitely many pairs of deformation states. To decrease the learning complexity, we introduce a canonical space as an intermediate state. We divide the shape transformation process into two steps: a backward deformation that aligns the source deformed space to the canonical space, and a forward deformation that maps the canonical space to the target deformation space. Concretely, the input $(\mathcal{X}, \mathcal{Y}, \mathcal{B})$ is passed into the backward transformation network $\Phi_b$ to learn the backward deformation field $\Psi_b$, which transforms the input shape into a canonical pose $\mathcal{X}^c$. Similarly, the querying non-surface point set $\mathcal{Q} = \{q_j\}_{j=1}^{M}$, randomly sampled in the 3D space of $\mathcal{M}_s$, is also mapped to the canonical space through $\Psi_b$. Lastly, given $\mathcal{X}^c$, $\mathcal{Y}$, and $\mathcal{B}$ as input, a forward transformation network $\Phi_f$ is learned to represent the forward deformation field $\Psi_f$ that predicts the final locations in the target space.
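As a schematic illustration of this two-stage pipeline, the sketch below composes a backward and a forward deformation network; `StubDeformationNet` is only a stand-in (a crude global-context MLP), and the exact conditioning of the forward network is simplified, so this is a structural sketch rather than the actual TD-Nets of Section 3.2.

```python
import torch
import torch.nn as nn

class StubDeformationNet(nn.Module):
    """Stand-in for a deformation-field network: displaces query points
    conditioned on the observed point cloud with flow and mask channels."""
    def __init__(self, obs_dim=7, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(obs_dim + 3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, obs, queries):
        # obs: (N, obs_dim) observed points + flow + mask; queries: (M, 3)
        ctx = obs.mean(dim=0, keepdim=True).expand(queries.shape[0], -1)  # crude global context
        return queries + self.mlp(torch.cat([ctx, queries], dim=-1))      # displaced queries

backward_net, forward_net = StubDeformationNet(), StubDeformationNet()

obs = torch.rand(5000, 7)       # sampled source points with flow and mask channels
verts = torch.rand(2000, 3)     # e.g. source mesh vertices used as query points

verts_canonical = backward_net(obs, verts)         # arbitrary pose -> canonical pose
verts_target = forward_net(obs, verts_canonical)   # canonical pose -> target pose
print(verts_target.shape)                          # torch.Size([2000, 3])
```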
3.2 Transformer-based Deformation Networks (TD-Nets)
The deformation via canonicalization is based on two deformation field predictors (one for the backward and one for the forward deformation). Both networks share the same architecture; thus, in the following, we only describe the forward deformation network, as visualized in Figure 3, while the backward deformation network is analogous. It consists of a transformer-based deformation encoder and a vector cross attention-based decoder network.
Point transformer encoder.
Given a point set $\mathcal{X}$ with handle locations $\mathcal{Y}$ and a binary mask $\mathcal{B}$ as inputs, we use point transformer layers from Zhao et al. (2021) to build our encoder modules. The point transformer layer is based on the vector attention mechanism Zhao et al. (2020). Let $\mathcal{X}_q = \{(p_i, \mathbf{f}_i)\}$ and $\mathcal{X}_{kv} = \{(p_j, \mathbf{f}_j)\}$ be the query and key-value sequences, where $p_i$ and $p_j$ denote the coordinates of query and key-value points with corresponding feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$. The vector cross attention (VCA) operator is defined as:

$$\mathbf{f}_i' = \sum_{(p_j, \mathbf{f}_j) \in \mathcal{X}_{kv}} \rho\Big(\gamma\big(\varphi(\mathbf{f}_i) - \psi(\mathbf{f}_j) + \theta(p_i - p_j)\big)\Big) \odot \big(\alpha(\mathbf{f}_j) + \theta(p_i - p_j)\big), \tag{1}$$

where $\mathbf{f}_i'$ are the aggregated features, $\odot$ denotes the element-wise product, and $\varphi$, $\psi$, and $\alpha$ are linear projections, each implemented by a fully-connected layer. $\gamma$ is a mapping function implemented by a two-layer MLP to predict attention vectors. $\rho$ is the attention weight normalization function, in our case a softmax. $\theta$ is the positional embedding module Vaswani et al. (2017); Mildenhall et al. (2020) implemented by two linear layers with a single ReLU Nair and Hinton (2010); it leverages the relative positional information of $p_i$ and $p_j$ to benefit the network training. Then, with the definition of VCA, the vector self-attention (VSA) operator can be defined as:

$$\mathrm{VSA}(\mathcal{X}) = \mathrm{VCA}(\mathcal{X}, \mathcal{X}). \tag{2}$$
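To make Eqs. 1 and 2 concrete, here is a self-contained PyTorch sketch of vector cross attention with k-nearest-neighbor attention; the layer sizes, the neighborhood size, and the use of `torch.cdist` for neighbor search are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn as nn

class VectorCrossAttention(nn.Module):
    """Sketch of vector cross attention (Eq. 1): each query point attends to its
    k nearest key-value points with channel-wise (vector) attention weights."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.phi = nn.Linear(dim, dim)    # query projection
        self.psi = nn.Linear(dim, dim)    # key projection
        self.alpha = nn.Linear(dim, dim)  # value projection
        self.theta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))    # positional embedding
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # attention-vector MLP

    def forward(self, p_q, f_q, p_kv, f_kv):
        # p_q: (Nq, 3), f_q: (Nq, C), p_kv: (Nkv, 3), f_kv: (Nkv, C)
        knn = torch.cdist(p_q, p_kv).topk(self.k, largest=False).indices   # (Nq, k) neighbor indices
        rel = self.theta(p_q.unsqueeze(1) - p_kv[knn])                     # (Nq, k, C) positional term
        attn = self.gamma(self.phi(f_q).unsqueeze(1) - self.psi(f_kv)[knn] + rel)
        attn = torch.softmax(attn, dim=1)                                  # normalize over neighbors
        return (attn * (self.alpha(f_kv)[knn] + rel)).sum(dim=1)           # (Nq, C) aggregated features

def vector_self_attention(vca, p, f):
    """VSA (Eq. 2) is simply VCA with the point set attending to itself."""
    return vca(p, f, p, f)

vca = VectorCrossAttention(dim=32)
p, f = torch.rand(100, 3), torch.rand(100, 32)
print(vector_self_attention(vca, p, f).shape)  # torch.Size([100, 32])
```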
Based on VCA and VSA, we define two basic modules to build our encoder network, i.e., the point transformer block (PTB) and the point abstraction block (PAB). The point transformer block is a combination of a BatchNorm (BN) layer Ioffe and Szegedy (2015), VSA, and a residual connection, formulated as:

$$\mathrm{PTB}(\mathcal{F}) = \mathrm{BN}\big(\mathcal{F} + \mathrm{VSA}(\mathcal{F})\big). \tag{3}$$

For each point $p_i$, it encapsulates the information from its $k$ nearest neighbors while keeping the point's position unchanged. The point abstraction block consists of farthest point sampling (FPS), BN, VCA, and VSA, and is defined as follows:

$$\mathrm{PAB}(\mathcal{F}) = \mathrm{BN}\Big(\mathrm{FPS}(\mathcal{F}) + \mathrm{VSA}\big(\mathrm{VCA}(\mathrm{FPS}(\mathcal{F}), \mathcal{F})\big)\Big). \tag{4}$$
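The point abstraction block relies on farthest point sampling to pick a well-spread subset of anchor locations. Below is a minimal greedy FPS sketch in PyTorch; production code would typically use an optimized CUDA kernel instead.

```python
import torch

def farthest_point_sampling(points, m):
    """Greedy FPS: repeatedly pick the point farthest from the selected set.

    points: (N, 3) coordinates; m: number of points to keep.
    Returns the indices of the m selected points.
    """
    n = points.shape[0]
    selected = torch.zeros(m, dtype=torch.long)
    selected[0] = torch.randint(0, n, (1,)).item()        # random seed point
    dist = torch.full((n,), float("inf"))
    for i in range(1, m):
        # squared distance of every point to its nearest already-selected point
        dist = torch.minimum(dist, (points - points[selected[i - 1]]).pow(2).sum(-1))
        selected[i] = dist.argmax()
    return selected

pts = torch.rand(5000, 3)
anchor_idx = farthest_point_sampling(pts, 100)            # e.g. keep 100 anchors
print(pts[anchor_idx].shape)                              # torch.Size([100, 3])
```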
The point cloud $\mathcal{X}$, with the handle mask $\mathcal{B}$ and flow $\mathcal{Y}$ as additional channels, is passed to a point transformer block (PTB) to obtain a feature point cloud $\mathcal{F}_0$. By using two consecutive point abstraction blocks (PABs) with intermediate set sizes $N_1$ and $N_2$, we obtain downsampled feature point clouds $\mathcal{F}_1$ and $\mathcal{F}_2$. To enhance global deformation priors, we stack 4 point transformer blocks with full self-attention, whose neighborhood size $k$ is set to 100, to exchange global information across the whole set $\mathcal{F}_2$. By doing so, we obtain a sparse set of local deformation descriptors $\mathcal{A} = \{(p_k, \mathbf{f}_k)\}$ anchored in 3D space. Finally, we perform a global max-pooling operation followed by two linear layers to obtain the global latent vector $\mathbf{g}$.
Attentive deformation decoder.
Based on the learned local latent codes $\mathcal{A}$ and the global latent vector $\mathbf{g}$, the deformation decoder defines the forward deformation function $\Psi_f$, which maps a point $q$ from the canonical space of $\mathcal{M}_c$ to the 3D space of $\mathcal{M}_t$. Similar to tri-linear interpolation in grid-based implicit field learning, a straightforward way to find the corresponding feature vector $\mathbf{f}_q$ is to use a weighted combination of nearby local codes, where the weight is inversely proportional to the Euclidean distance between $q$ and the anchor location $p_k$ Peng et al. (2020). However, such distance-based feature queries ignore the relationships between deformation descriptors. Thus, we propose to obtain $\mathbf{f}_q$ by adaptively aggregating the information of the nearby descriptors $\mathcal{N}(q) \subset \mathcal{A}$ based on the vector cross-attention operator:

$$\mathbf{f}_q = \mathrm{VCA}\big(\{(q, \mathbf{g})\}, \mathcal{N}(q)\big). \tag{5}$$
This local information aggregation enables us to flexibly query the local deformation priors, thus improving the generalizability to new deformations. Finally, $\mathbf{f}_q$ is fed into an MLP composed of five Res-FC blocks to estimate the associated location in the target space.
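A functional sketch of this decoder-side feature query and displacement head is given below: the global code acts as the query feature, attention runs over the k nearest anchors, and the paper's five Res-FC blocks are replaced by a plain MLP head for brevity, so all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveDecoder(nn.Module):
    """Sketch of the attentive deformation decoder: attention-based feature
    query over nearby local latent codes, followed by a displacement MLP."""
    def __init__(self, dim=64, k=8):
        super().__init__()
        self.k = k
        self.phi, self.psi, self.alpha = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.theta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, q, g, anchor_pos, anchor_feat):
        # q: (M, 3) query points, g: (dim,) global latent vector,
        # anchor_pos: (A, 3), anchor_feat: (A, dim) local deformation descriptors
        knn = torch.cdist(q, anchor_pos).topk(self.k, largest=False).indices   # (M, k)
        rel = self.theta(q.unsqueeze(1) - anchor_pos[knn])                     # (M, k, dim)
        f_q = self.phi(g).expand(q.shape[0], -1)                               # global code as query feature
        attn = torch.softmax(self.gamma(f_q.unsqueeze(1) - self.psi(anchor_feat)[knn] + rel), dim=1)
        feat = (attn * (self.alpha(anchor_feat)[knn] + rel)).sum(dim=1)        # aggregated feature f_q
        return q + self.head(feat)                                             # predicted target locations

dec = AttentiveDecoder()
out = dec(torch.rand(200, 3), torch.rand(64), torch.rand(100, 3), torch.rand(100, 64))
print(out.shape)  # torch.Size([200, 3])
```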
3.3 Training Objectives
For training, we need a set of triplets $(\mathcal{M}_s, \mathcal{M}_c, \mathcal{M}_t)$ with dense correspondences, from which we can randomly sample surface point clouds of size $N$ and querying non-surface points of size $M$ in the 3D space. To optimize the backward deformation network, we employ the mean distance error that measures the difference between the deformed points from the source space and their ground truths in the canonical space:

$$\mathcal{L}_{b} = \frac{1}{N+M} \sum_{i=1}^{N+M} \big\| \Psi_b(x_i) - x_i^{c} \big\|_2. \tag{6}$$

Similarly, to optimize the forward deformation network, we use the following loss function:

$$\mathcal{L}_{f} = \frac{1}{N+M} \sum_{i=1}^{N+M} \big\| \Psi_f(x_i^{c}) - x_i^{t} \big\|_2. \tag{7}$$

The total loss function for source-target shape deformations is defined as:

$$\mathcal{L}_{s \rightarrow t} = \frac{1}{N+M} \sum_{i=1}^{N+M} \big\| \Psi_f\big(\Psi_b(x_i)\big) - x_i^{t} \big\|_2, \tag{8}$$

where $x_i$, $x_i^{c}$, and $x_i^{t}$ denote corresponding sampled points in the source, canonical, and target space, respectively.
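A short sketch of how these objectives could be computed, with random tensors standing in for the network outputs and ground-truth correspondences; the per-point Euclidean (rather than squared) distance is an assumption of this sketch.

```python
import torch

def mean_l2(pred, gt):
    """Mean Euclidean distance between corresponding points (Eqs. 6-8)."""
    return (pred - gt).norm(dim=-1).mean()

# Ground-truth correspondences from one training triplet (source/canonical/target).
pts_can_gt, pts_tgt_gt = torch.rand(6000, 3), torch.rand(6000, 3)

# Stand-ins for network outputs.
backward_pred = torch.rand(6000, 3)          # Psi_b applied to source-space points
forward_pred_stage1 = torch.rand(6000, 3)    # Psi_f applied to GT canonical points (stage 1)
chained_pred = torch.rand(6000, 3)           # Psi_f(Psi_b(.)) for end-to-end training (stage 2)

loss_b = mean_l2(backward_pred, pts_can_gt)         # Eq. 6
loss_f = mean_l2(forward_pred_stage1, pts_tgt_gt)   # Eq. 7
loss_st = mean_l2(chained_pred, pts_tgt_gt)         # Eq. 8
```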
4 Experiments
Dataset.
Our experiments are performed on the DeformingThing4D-Animals Li et al. (2021) dataset which contains 1494 non-rigidly deforming animations with various motions comprising 40 identities of 24 categories. For the train/test split, we divide all animations into training (1296) and test (198). Similar to the D-FAUST Bogo et al. (2017) used in OFlow Niemeyer et al. (2019), the test set is composed of two subsets: (S1) contains 143 sequences of new motions for seen train identities, and (S2) contains 55 sequences of unseen individuals (and thus also new motions). During training, we randomly sample two frames from an identity as source-target deformation pairs. During inference, we consider the first frame of an animation as source mesh, and other frames as target meshes. To evaluate the generalization ability to unseen identities, we evaluate the pre-trained models on the animal dataset used in Deformation Transfer Sumner and Popović (2004)
. For the quantitative comparison on each test subset, we compute evaluation metrics for 300 randomly sampled pairs. In addition, we include comparisons on another animal dataset used in TOSCA Rodolà et al. (2017). TOSCA does not provide correspondences between different poses of the same animal, and hence does not easily provide handle displacements as input. Thus, we provide a qualitative comparison under the setting of user-specified handles as inputs.
Implementation details.
Our approach is built on the PyTorch library
Paszke et al. (2019). Please refer to the supplementary material for the details of our network architecture. Our model is trained in two stages using the Adam Kingma and Ba (2014) optimizer. In the first stage, we train the forward and backward deformation networks individually: the backward and forward deformation networks are optimized by the objectives described in Equations 6 and 7, respectively, using a batch size of 16 and a learning rate of 5e-4 for 100 epochs. In the second stage, the whole model is trained according to Equation 8 in an end-to-end manner using a batch size of 6 and a learning rate of 5e-5 for 20 epochs.
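The two-stage optimization schedule could look roughly as follows in PyTorch; the network modules are stand-ins, and only the hyperparameters stated above (learning rates, batch sizes, epoch counts) are taken from the text.

```python
import torch
import torch.nn as nn

# Stand-ins for the backward and forward deformation networks.
backward_net, forward_net = nn.Linear(7, 3), nn.Linear(7, 3)

# Stage 1: pre-train both networks individually (Eqs. 6 and 7), lr 5e-4, 100 epochs.
opt_b = torch.optim.Adam(backward_net.parameters(), lr=5e-4)
opt_f = torch.optim.Adam(forward_net.parameters(), lr=5e-4)

# Stage 2: fine-tune the whole model end to end (Eq. 8), lr 5e-5, 20 epochs.
opt_all = torch.optim.Adam(
    list(backward_net.parameters()) + list(forward_net.parameters()), lr=5e-5)
```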
Baselines.
We conduct comparisons against classical optimization-based and recent neural network-based methods. For the former, we select a representative work, ARAP Sorkine and Alexa (2007), that constrains each local surface to be transformed as rigidly as possible. For the latter, we compare our method with the learning-based deformation predictor ShapeFlow Jiang et al. (2020a) that embeds each shape into a latent space and learns flow-based deformations among 3D shapes. We also compare to NFGP Yang et al. (2021), a deep optimization method, which constrains the implicitly represented surfaces as elastic shells during the deformation process.

Figure 4: Qualitative comparison. Columns: source mesh and handles, target mesh and handles, ARAP Sorkine and Alexa (2007), ShapeFlow Jiang et al. (2020a), NFGP Yang et al. (2021), and Ours.

Figure 5: Qualitative comparison. Columns: source mesh and handles, target mesh and handles, ARAP Sorkine and Alexa (2007), ShapeFlow Jiang et al. (2020a), NFGP Yang et al. (2021), and Ours.
Evaluation metrics.
We consider the vertex distance error ($d_v$), the Chamfer Distance (CD) of sampled point clouds of 30k points, and the Face Normal Consistency (FNC) as primary evaluation metrics. Please refer to the supplementary material for a detailed explanation of these metrics. Note that for $d_v$ and CD, lower is better, while for FNC, higher is better.
4.1 Comparisons
Table 1: Quantitative comparison with baselines and ablation variants on new motions (S1), unseen identities (S2), and the Deformation Transfer (DT) animals.

| Method | $d_v$ (S1) | CD (S1) | FNC (S1) | $d_v$ (S2) | CD (S2) | FNC (S2) | $d_v$ (DT) | CD (DT) | FNC (DT) |
|---|---|---|---|---|---|---|---|---|---|
| ARAP Sorkine and Alexa (2007) | 5.568 | 2.312 | 95.35 | 9.794 | 2.308 | 94.89 | 5.145 | 3.475 | 91.21 |
| ShapeFlow Jiang et al. (2020a) | 21.03 | 3.494 | 89.69 | 32.08 | 3.925 | 90.73 | 33.72 | 4.093 | 86.36 |
| NFGP Yang et al. (2021) | 11.77 | 3.130 | 93.34 | 15.96 | 3.364 | 91.80 | 18.90 | 4.150 | 82.54 |
| Ours-VDF | 3.590 | 1.887 | 86.01 | 2.368 | 1.837 | 86.99 | 3.111 | 9.164 | 78.63 |
| Ours-global | 2.970 | 1.546 | 93.30 | 2.973 | 1.579 | 94.75 | 2.636 | 8.453 | 84.59 |
| Ours-3D UNet | 1.011 | 1.111 | 96.02 | 1.253 | 1.426 | 96.20 | 4.553 | 2.362 | 88.31 |
| Ours-PointNet++ | 0.886 | 1.055 | 95.47 | 1.231 | 1.364 | 95.37 | 4.898 | 2.564 | 85.87 |
| Ours-w/o atten. dec. | 1.184 | 1.210 | 95.64 | 1.227 | 1.417 | 96.16 | 5.252 | 2.772 | 84.95 |
| Ours-w/o cano. | 1.018 | 1.063 | 96.40 | 0.969 | 1.258 | 96.62 | 2.660 | 1.934 | 90.96 |
| Ours-full | 0.752 | 0.948 | 96.59 | 0.795 | 1.241 | 96.68 | 2.495 | 1.877 | 91.40 |
For a qualitative comparison, we visualize the vertex distance error maps of deformed meshes in Figure 4 and Figure 5. As can be seen, our method has lower vertex errors in the hidden surface regions, since we use data-driven deformation priors instead of hand-crafted regularizers that enforce surface smoothness. The generalization ability to unseen deformations is improved by learning deformation fields for local surfaces, instead of modeling global deformations. Compared to ARAP, ShapeFlow, and NFGP, we produce more realistic results for the complicated actions in the 3rd and 4th rows of Figure 4. The deformation results presented in Figure 5 demonstrate that our method can generalize to unseen identities, as is also verified quantitatively in Table 1, where our method consistently outperforms all baselines.
User-specified handles.
To evaluate the generalization performance of our approach on unseen identities using user-provided handle displacements, as used in interactive editing applications, we apply random translations of handles to animals from TOSCA Rodolà et al. (2017) as input. As depicted in Figure 6, our approach produces natural-looking deformation results and shows clear advantages compared to ARAP, ShapeFlow, and NFGP. Note that for this demonstration of user-specified handles there exists no corresponding ground truth.
Figure 6: Shape manipulation with user-specified handles on TOSCA animals. Rows: source mesh, handles and target handles; ARAP Sorkine and Alexa (2007); ShapeFlow Jiang et al. (2020a); NFGP Yang et al. (2021); Ours.
4.2 Ablation Studies
To verify our final model choice, we conducted a series of ablation studies, where we analysed several variants of our deformation fields (see Table 1 and Figure 7).

Volumetric grids vs continuous fields.
As continuous fields are not bound to the resolution of a discrete grid structure, they can better represent complex deformations. The performance degrades when we instead learn grid-based volumetric deformation fields. This can be seen in the experiment "Ours-VDF", which uses a 3D U-Net Ronneberger et al. (2015) to generate volumetric deformation fields of a fixed resolution.
Global vs local deformation fields.
"Ours-global" learns a global continuous field conditioned only on the global latent code. This variant tends to lose detailed information about local surface deformations and generalizes less well to new motions or identities, leading to inferior results in comparison to our local deformation fields.
Network Architectures (3D U-Net vs PointNet++ vs Point Transformer).
Compared to learning grid-based or point-based local deformation descriptors, the point transformer-based encoder captures strong global context that enforces more global consistency constraints. This improves the surface accuracy of the deformed meshes. To verify this, we conducted an experiment with "Ours-3D-UNet", which learns a volumetric feature map through a 3D U-Net and then predicts deformation fields based on features queried via tri-linear interpolation. Additionally, we compare with "Ours-PointNet++", which replaces the point transformer encoder with PointNet++ Qi et al. (2017).
With vs without Attention-based feature querying.
The attention-based feature query mechanism can flexibly and effectively select the most relevant deformation descriptors for a query point, resulting in improved performance over feature interpolation that is purely based on Euclidean distances. A deformation decoder that instead interpolates with weights based purely on Euclidean distance ("Ours-w/o atten. dec.") leads to significantly higher errors, particularly in terms of the vertex error.
With vs without canonical poses.
Learning shape deformations via canonicalization improves the generalization to source meshes in different poses. Learning without canonicalization ("Ours-w/o cano."), i.e., learning shape deformations directly between two arbitrary poses, results in considerably higher surface errors.
4.3 Intermediate results of canonicalization

In Figure 8, we visualize intermediate results of the canonicalization. As can be seen, our method can map source meshes in arbitrary poses into a canonical space with a consistent pose.
4.4 Limitations
While compelling results have been demonstrated for shape manipulation, a few limitations still exist in our approach that can be addressed in future work. Our approach only needs sparse user input in the form of handles which can be moved to create a new deformation state. While this allows for quick editing, a possible extension is to add rotations to the handles. This could be done by leveraging a different deformation representation, such as an SE(3) field composed of a displacement and a rotation field. Note that our displacement representation is able to represent general deformations, but might require more user handles. Due to the limitations of the DeformingThing4D-Animals Li et al. (2021) dataset in terms of available models and poses, our approach may struggle to generalize to out-of-distribution models and extreme poses. Additionally, the output of our model, as with other learning-based methods, may be affected by biases in the training dataset that can limit generalization. We believe this issue can be alleviated by a larger training dataset and a richer data augmentation strategy in future work. Lastly, our training scheme only considers handles that are selected from a set of candidate parts of the models, thus limiting the regions the user can interact with. Enriching the candidate handles during training is potentially helpful for allowing free handle placement.
5 Conclusion
In this work, we introduced Neural Shape Deformation Priors, a novel approach that learns mesh deformations of non-rigid objects from user-provided handles based on the underlying geometric properties of shapes. To enable shape manipulation for source meshes with different poses, we learn shape deformations via canonicalization, where the source mesh is first transformed to the canonical space through a backward deformation field and then deformed to the target space through a forward deformation field. For deformation field learning, we propose Transformer-based Deformation Networks (TD-Nets) that represent a shape deformation as a composition of local surface deformations. Our experiments and ablation studies demonstrate that our method can be applied to challenging new deformations, outperforming classical optimization-based methods such as ARAP Sorkine and Alexa (2007) and neural network-based methods such as ShapeFlow Jiang et al. (2020a) and NFGP Yang et al. (2021), while showing good generalization to previously unseen identities. We see our method as an important step in the development of 3D modeling algorithms and software, and hope to inspire more research in learning-based shape manipulation.
Societal impact.
Our work provides an algorithm for natural-looking shape editing, which can simplify tedious procedures in 3D content creation and empower artists in the movie and game industries. It further has the potential to enrich 3D data with additional deformed shapes, and could thus help improve the performance of other practical application techniques that rely on large quantities of 3D ground-truth for training. Yet, misuse of our shape manipulation algorithm could enable fraud or offensive content generation.
Acknowledgement.
This work is supported by a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), and Sony Semiconductor Solutions Corporation. We would also like to thank Angela Dai for the video voice over.
References
- As-rigid-as-possible shape interpolation. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 157–164. Cited by: §1, §2.
- Sal: sign agnostic learning of shapes from raw data. In CVPR, pp. 2565–2574. Cited by: §1, §2.
- Space deformation models survey. Computers & Graphics 18 (4), pp. 571–586. Cited by: §2.
-
Dynamic faust: registering human bodies in motion.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 6233–6242. Cited by: Appendix D, §4. - On linear variational surface deformation methods. IEEE transactions on visualization and computer graphics 14 (1), pp. 213–230. Cited by: §2.
- Transformerfusion: monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems 34. Cited by: §2.
- Neural deformation graphs for globally-consistent non-rigid reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1450–1459. Cited by: §2, §2.
- Dynamic surface function networks for clothed human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10754–10764. Cited by: §2.
- End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §2.
- Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In ECCV, pp. 608–625. Cited by: §1, §2.
- SNARF: differentiable forward skinning for animating non-rigid neural implicit shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11594–11604. Cited by: §2.
-
Model-based 3d hand reconstruction via self-supervised learning
. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10451–10460. Cited by: §1. - Learning implicit fields for generative shape modeling. In CVPR, Cited by: §1, §2.
- Implicit functions in feature space for 3d shape reconstruction and completion. In CVPR, Cited by: §1, §2.
- Deformed implicit field: modeling 3d shapes with learned dense correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10286–10296. Cited by: §2.
- An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
- AIR-nets: an attention-based framework for locally conditioned implicit representations. In 2021 International Conference on 3D Vision (3DV), pp. 1054–1064. Cited by: §2, §2.
- Implicit geometric regularization for learning shapes. ICML. Cited by: §1, §2.
- A survey on visual transformer. arXiv e-prints, pp. arXiv–2012. Cited by: §2.
- Dualsdf: semantic shape manipulation using a two-level representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7631–7641. Cited by: §2.
- Neural template: topology-aware reconstruction and disentangled generation of 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18572–18582. Cited by: §2.
-
Batch normalization: accelerating deep network training by reducing internal covariate shift.
In
International conference on machine learning
, pp. 448–456. Cited by: Appendix A, §3.2. - Bounded biharmonic weights for real-time deformation.. ACM Trans. Graph. 30 (4), pp. 78. Cited by: §2.
- KeypointDeformer: unsupervised 3d keypoint discovery for shape control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12783–12792. Cited by: §2.
- Shapeflow: learnable deformation flows among 3d shapes. Advances in Neural Information Processing Systems 33, pp. 9745–9757. Cited by: §2, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
- Local implicit grid representations for 3d scenes. In CVPR, pp. 608–625. Cited by: §1, §2.
- Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32 (3), pp. 1–13. Cited by: Figure 12.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
- Smooth rotation enhanced as-rigid-as-possible mesh animation. IEEE transactions on visualization and computer graphics 21 (2), pp. 264–277. Cited by: §1, §2.
- 4dcomplete: non-rigid motion estimation beyond the observable surface. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12706–12716. Cited by: Appendix D, Table 3, Table 4, Table 5, §2, §2, §4, §4.4, Table 1.
- Differential coordinates for interactive mesh editing. In Proceedings Shape Modeling Applications, 2004., pp. 181–190. Cited by: §1, §2.
- SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §2.
- Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §1, §2.
- Ssrnet: scalable 3d surface reconstruction network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 970–979. Cited by: §2.
- LEAP: learning articulated occupancy of people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10461–10471. Cited by: §2.
- Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: Appendix A, §3.2.
- A framework for geometric warps and deformations. ACM Transactions on Graphics (TOG) 21 (1), pp. 20–51. Cited by: §2.
- Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: Appendix A, §3.2.
- Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5379–5389. Cited by: Appendix D, §2, §4.
- STAR: a sparse trained articulated human body regressor. European Conference on Computer Vision (ECCV), pp. 598–613. External Links: Link Cited by: §2.
-
Npms: neural parametric models for 3d deformable shapes
. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12695–12705. Cited by: §2, §2. - DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §1, §2.
-
Pytorch: an imperative style, high-performance deep learning library
. Advances in neural information processing systems 32. Cited by: §4. - Convolutional occupancy networks. In ECCV, Cited by: §1, §2, §3.2.
- Pointnet++: deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30. Cited by: §4.2.
- PatchComplete: learning multi-resolution patch priors for 3d shape completion on unseen categories. Advances in Neural Information Processing Systems. Cited by: §2.
- Partial functional correspondence. In Computer graphics forum, Vol. 36, pp. 222–236. Cited by: Figure 6, §4, §4.1.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.2.
- BARC: learning to regress 3d dog shape from images by exploiting breed information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3884. Cited by: Figure 13, Appendix E.
- SCANimate: weakly supervised learning of skinned clothed avatar networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2886–2897. Cited by: §2.
- Free-form deformation of solid geometric models. In Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pp. 151–160. Cited by: §2.
- NeuralMLS: geometry-aware control point deformation. Cited by: §2.
- As-rigid-as-possible surface modeling. In Symposium on Geometry processing, Vol. 4, pp. 109–116. Cited by: Neural Shape Deformation Priors, §1, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
- Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 175–184. Cited by: §1, §2.
- Differential representations for mesh processing. In Computer Graphics Forum, Vol. 25, pp. 789–807. Cited by: §2.
- Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23 (3), pp. 399–405. Cited by: Figure 5, §4, Table 1.
- Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 papers, pp. 80–es. Cited by: §1, §2.
- A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 4541–4550. Cited by: §1, §2.
- Skeletonnet: a topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
- SA-convonet: sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6504–6513. Cited by: §1, §2.
- Learning parallel dense correspondence from spatio-temporal descriptors for efficient and robust 4d reconstruction. In CVPR, pp. 6022–6031. Cited by: §2, §2.
- Elastically deformable models. In Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pp. 205–214. Cited by: §1, §2.
- PatchNets: patch-based generalizable deep implicit 3d shape representations. In ECCV, pp. 108–124. Cited by: §1, §2.
- Attention is all you need. Advances in neural information processing systems 30. Cited by: Appendix A, §2, §3.2.
- Locally aware piecewise transformation fields for 3d human mesh registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7639–7648. Cited by: §2.
- Metaavatar: learning animatable clothed human models from few depth images. Advances in Neural Information Processing Systems 34. Cited by: §2.
- Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §2.
- DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, Cited by: §1, §2.
- Geometry processing with neural fields. Advances in Neural Information Processing Systems 34. Cited by: §2, §2, Figure 4, Figure 5, Figure 6, §4, Table 1, §5.
- CoFiNet: reliable coarse-to-fine correspondences for robust pointcloud registration. Advances in Neural Information Processing Systems 34. Cited by: §2.
- 3DILG: irregular latent grids for 3d generative modeling. In Advances in Neural Information Processing Systems, Cited by: §2, §2.
- Training data generating networks: shape reconstruction via bi-level optimization. In International Conference on Learning Representations, Cited by: §2.
- Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076–10085. Cited by: Appendix A, §3.2.
- Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268. Cited by: §1, §2, §3.2.
- Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6881–6890. Cited by: §2.
- Deep implicit templates for 3d shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1429–1439. Cited by: §2.
- Large mesh deformation using the volumetric graph laplacian. In ACM SIGGRAPH 2005 Papers, pp. 496–503. Cited by: §1, §2.
Neural Shape Deformation Priors
– Supplementary Material –
Our Neural Shape Deformation Priors method is based on transformer-based deformation networks that represent the deformation as a composition of local surface deformations. The underlying architectures are discussed in Appendix A. The evaluation metrics are detailed in Appendix B. Our notation is further explained in Appendix C, and more details about data pre-processing are given in Appendix D. In addition to the results shown in the main paper, we conducted further experiments (see Appendix E). While our method exhibits good generalization to unseen poses and shapes, we discuss and show failure cases in Appendix F.
Appendix A Network Architectures
Vector Cross Attention:
In Figure 9, we illustrate the architecture of the vector cross attention (VCA) Zhao et al. (2020), which is a building block of our transformer-based deformation network (see Figure 3 in the main paper). The feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ are transformed with three linear projections $\varphi$, $\psi$, and $\alpha$, each of which is a fully-connected layer. To leverage the relative positional information of $p_i$ and $p_j$, their difference $p_i - p_j$ is encoded by a positional embedding module $\theta$ Vaswani et al. (2017); Mildenhall et al. (2020) that consists of two linear layers with a single ReLU Nair and Hinton (2010). Then, the sum of the projected feature difference and the positional embedding is further processed by the MLP $\gamma$. Next, a softmax function is used to generate normalized attention scores, which are used to calculate a weighted combination of the projected values to obtain the aggregated feature $\mathbf{f}_i'$.



Point Transformer Block (PTB):
As illustrated in Figure 9, we also show the architecture of the point transformer block. The point transformer block is used to encapsulate the information from the $k = 16$ nearest neighbors of each point while keeping the point's position unchanged. The input is fed into a vector self-attention (VSA) block and then through a BatchNorm (BN) layer Ioffe and Szegedy (2015), including a residual connection from the input.
Point Abstraction Block (PAB):
The point abstraction block consists of a farthest point sampling (FPS) module, a VCA module, and a VSA module, followed by a BN layer. FPS is used to downsample the input point set, which is then fed into the VCA module followed by the VSA module. We employ a skip connection from the original point set to the VCA module. The outputs of the FPS and VSA modules are fed into a BatchNorm layer, which computes the output of the point abstraction block.
Point Transformer Encoder
As shown in Figure 10, a PTB is used to obtain an initial feature encoding $\mathcal{F}_0$. Two consecutive point abstraction blocks (PABs) with intermediate set sizes $N_1$ and $N_2$ are used to obtain the downsampled feature point clouds $\mathcal{F}_1$ and $\mathcal{F}_2$. To enhance global deformation priors, we stack 4 point transformer blocks with full self-attention, whose neighborhood size $k$ is set to 100, to exchange global information across the whole set $\mathcal{F}_2$. By doing so, we obtain a sparse set of local deformation descriptors $\mathcal{A}$ that are anchored in 3D space. Finally, a global max-pooling operation followed by two linear layers is used to obtain the global latent vector $\mathbf{g}$.
Attentive Deformation Decoder
The detailed architecture of the attentive deformation decoder is shown in Figure 11. It fuses the nearby local latent codes of a query point $q$, under the guidance of the global latent code $\mathbf{g}$, into an aggregated feature $\mathbf{f}_q$, and feeds $\mathbf{f}_q$ into an MLP consisting of five stacked Res-FC blocks to estimate the displacement vector of $q$.
Appendix B Evaluation Metrics
For defining the evaluation metrics, we assume two meshes $\mathcal{M}_{gt}$ and $\mathcal{M}_{d}$, being the ground-truth and the deformed mesh, respectively, sharing the same connectivity.
Vertex error:
The vertex distance error is the mean squared distance between ground-truth vertices $v_i^{gt}$ and deformed vertices $v_i^{d}$:

$$d_v = \frac{1}{|\mathcal{V}|} \sum_{i=1}^{|\mathcal{V}|} \big\| v_i^{gt} - v_i^{d} \big\|_2^2,$$

where $|\mathcal{V}|$ denotes the number of mesh vertices.
Chamfer distance:
To calculate the Chamfer distance between $\mathcal{M}_{gt}$ and $\mathcal{M}_{d}$, we first sample two point sets $P$ and $Q$ from $\mathcal{M}_{gt}$ and $\mathcal{M}_{d}$ individually. Then, the Chamfer distance of the two point sets is defined as:

$$\mathrm{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \| p - q \|_2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \| q - p \|_2.$$
Face Normal Consistency
The face normal consistency is the mean cosine similarity score of the triangle normals of two meshes. Let $\mathcal{N}_{gt}$ and $\mathcal{N}_{d}$ denote the sets of face normals of $\mathcal{M}_{gt}$ and $\mathcal{M}_{d}$, respectively. We define the Face Normal Consistency as:

$$\mathrm{FNC} = \frac{1}{|\mathcal{F}|} \sum_{i=1}^{|\mathcal{F}|} \big\langle n_i^{gt}, n_i^{d} \big\rangle,$$

where $|\mathcal{F}|$ denotes the number of triangle faces and $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors.
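For reference, the three metrics could be computed along the following lines in PyTorch; the exact normalization and whether the Chamfer distance uses squared distances are conventions assumed by this sketch.

```python
import torch

def vertex_error(v_gt, v_def):
    """Mean squared distance between corresponding vertices."""
    return (v_gt - v_def).pow(2).sum(dim=-1).mean()

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two sampled point sets."""
    d = torch.cdist(p, q)                     # (|P|, |Q|) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def face_normal_consistency(n_gt, n_def):
    """Mean cosine similarity between corresponding (unit) face normals."""
    return torch.nn.functional.cosine_similarity(n_gt, n_def, dim=-1).mean()

v_gt, v_def = torch.rand(1000, 3), torch.rand(1000, 3)
print(vertex_error(v_gt, v_def), chamfer_distance(v_gt, v_def))
```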
Appendix C Notation
We explain our notation in more detail, after having briefly defined it in Section 3. By $\mathcal{M}_s$, $\mathcal{M}_c$, $\mathcal{M}_t$, $\mathcal{M}_d$ we denote meshes of the considered shapes. $\mathcal{M}_s$ is the source mesh, and $\mathcal{V}_s$ is the set of vertices of $\mathcal{M}_s$ while $\mathcal{F}_s$ is the set of faces of $\mathcal{M}_s$. $\mathcal{M}_s$ is deformed in a 2-step approach. By $\mathcal{M}_c$ we denote the canonical shape and $\mathcal{M}_t$ is the target shape. We select a sparse set of handles $\mathcal{H}$ of the original shape. The handles can be dragged to new target locations $\mathcal{T}$, which define the target mesh $\mathcal{M}_t$. The continuous deformation field learned in our work is denoted by $\mathcal{D}$. We apply $\mathcal{D}$ to deform the vertices of $\mathcal{M}_s$ to obtain the deformed mesh $\mathcal{M}_d$, where $\mathcal{V}_d$ are the vertices of the deformed mesh. We denote the backward deformation field by $\Psi_b$ and the forward deformation field by $\Psi_f$. It holds that $\mathcal{D} = \Psi_f \circ \Psi_b$. Since our method performs operations in the point cloud domain, we sample point clouds from the surface meshes. $\mathcal{X}$ is a surface point cloud of the canonical mesh with size $N$. We define the binary user handle mask as $\mathcal{B}$. The point cloud $\mathcal{X}$ is passed through the backward transformation network $\Phi_b$ and mapped into the canonical pose $\mathcal{X}^c$, i.e., $\mathcal{X}^c = \Phi_b(\mathcal{X})$. Then the point cloud $\mathcal{X}^c$ is passed through the forward transformation network $\Phi_f$ and mapped into the target pose $\mathcal{X}^t$, i.e., $\mathcal{X}^t = \Phi_f(\mathcal{X}^c)$. Further, consult Table 2 for the definition of all symbols.
| Notation | Meaning |
|---|---|
| $\mathcal{M}_s$, $\mathcal{M}_c$, $\mathcal{M}_t$, $\mathcal{M}_d$ | Source mesh, canonical mesh, target mesh, deformed mesh |
| $\mathcal{V}_s$, $\mathcal{F}_s$ | Vertices, faces of source mesh |
| $\mathcal{H}$, $h_i$ | Set of handles, $i$-th handle location |
| $\mathcal{T}$, $t_i$ | Set of target locations of handles, $i$-th target location |
| $\mathcal{B}$, $b_i$ | Binary user handle mask, $i$-th element of $\mathcal{B}$ |
| $\mathcal{X}$ | Surface point cloud of size $N$ sampled from the mesh surface |
| $\mathcal{Y}$ | Target handle point locations |
| $\mathcal{Q}$ | Non-surface point cloud of size $M$ sampled from the 3D space of the mesh |
| $q_j$ | $j$-th non-surface querying point |
| $N$ | Size of surface point clouds |
| $M$ | Size of non-surface point clouds |
| $x_i$ | $i$-th point from $\mathcal{X}$ |
| $x_i^c$, $x_i^t$ | Mapping of $x_i$ in canonical pose, target pose |
| $\Psi_b$, $\Psi_f$ | Backward deformation field, forward deformation field |
| $\mathcal{D}$ | Deformation field between two arbitrary poses, i.e., $\mathcal{D} = \Psi_f \circ \Psi_b$ |
| $\Phi_b$, $\Phi_f$ | Backward transformation network, forward transformation network |
| $\mathcal{X}_q$, $\mathcal{X}_{kv}$ | Query sequence, key-value sequence |
| $p_i$, $\mathbf{f}_i$, $\mathbf{f}_i'$ | Coordinate of $i$-th query point, corresponding feature vector, aggregated feature |
| $p_j$, $\mathbf{f}_j$ | Coordinate of $j$-th key-value point, corresponding feature vector |
| VCA | Vector cross attention |
| $\varphi$, $\psi$, $\alpha$ | Fully-connected layers |
| $\rho$ | Attention weight normalization function, e.g., softmax function |
| $\theta$ | Positional embedding module |
| VSA | Vector self-attention operator |
| PTB, PAB | Point transformer block, point abstraction block |
| BN | BatchNorm layer |
| $\mathcal{A}$ | Set of local deformation descriptors |
| $q$, $\mathbf{f}_q$ | A point in 3D space, corresponding feature vector |
| $p_k$, $\mathbf{f}_k$ | Coordinates and feature vector of $k$-th deformation descriptor |
| $\mathbf{g}$ | Global latent vector |
| $\mathcal{L}_b$, $\mathcal{L}_f$, $\mathcal{L}_{s \rightarrow t}$ | Backward loss function, forward loss function, end-to-end loss function |
Appendix D Data
To train and evaluate our method, we use the DeformingThing4D Li et al. (2021) dataset, which is available under a non-commercial academic license. It does not contain personally identifiable information or offensive contents. We have obtained the consent to use the dataset.
Train/test split
The DeformingThing4D dataset consists of a large number of quadruped animal animations with various motions, such as "bear3EP Jump", "bear9AK Jump", or "bear3EP Lie", where "bear3EP" and "bear9AK" are identity names and "Jump" and "Lie" are motion names. Similar to the D-FAUST Bogo et al. (2017) split used in OFlow Niemeyer et al. (2019), the train/test split is based on these identity and motion names of the deforming sequences. We first divide the animations of the dataset into two parts, seen identities and unseen identities. The animations of seen identities are further divided into seen motions of seen identities (used as the training set) and unseen motions of seen identities (used as the test set S1). The animations of unseen identities are used as the test set S2. Finally, the train, test S1, and test S2 datasets contain 1296, 143, and 55 deforming sequences, respectively.
Data preparation
In Section 3.3 of the main text, we mentioned that our method uses a set of triplets, consisting of source mesh $\mathcal{M}_s$, canonical mesh $\mathcal{M}_c$, and target mesh $\mathcal{M}_t$ with dense correspondences, for training. Point clouds of size $N$ with one-to-one correspondence are sampled from the surfaces of $\mathcal{M}_s$, $\mathcal{M}_c$, and $\mathcal{M}_t$, and non-surface point sets of size $M$ are sampled from their 3D space. Here, we provide the details of the data preparation. First, we sample surface points from the canonical mesh $\mathcal{M}_c$ and store the corresponding barycentric weights of the sampled points. Then, each point is randomly perturbed by a small displacement vector along the normal direction of the corresponding triangle, where the displacement distance is drawn from a Gaussian distribution. Next, for the source and target meshes, we use the same barycentric weights to obtain surface points with correspondences, and the same displacements to obtain non-surface points with correspondences. Concretely, we pre-compute a fixed set of points from each canonical surface mesh and obtain the non-surface points by perturbing 50% of the surface points with one Gaussian standard deviation and the other 50% with a second, different standard deviation. During training, we down-sample $N$ surface points and $M$ non-surface points. To maintain one-to-one correspondence, we use the same sampling indices for $\mathcal{M}_s$, $\mathcal{M}_c$, and $\mathcal{M}_t$.
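The sampling of corresponding surface and near-surface points could be implemented roughly as follows; the standard deviation used for the normal offsets is an assumed value, and the random mesh in the usage example is only for shape checking.

```python
import numpy as np

def sample_surface(verts, faces, n, rng):
    """Sample n surface points with barycentric weights; also return face ids,
    barycentric coordinates, and per-sample face normals so that the same
    weights can be re-applied to corresponding source/target meshes."""
    tri = verts[faces]                                        # (F, 3, 3) triangle corners
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    area = 0.5 * np.linalg.norm(cross, axis=1)
    fid = rng.choice(len(faces), size=n, p=area / area.sum()) # area-weighted face choice
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]           # uniform barycentric sampling
    bary = np.stack([1.0 - u - v, u, v], axis=1)              # (n, 3) barycentric weights
    pts = (verts[faces[fid]] * bary[:, :, None]).sum(axis=1)
    normals = cross[fid] / np.linalg.norm(cross[fid], axis=1, keepdims=True)
    return pts, fid, bary, normals

rng = np.random.default_rng(0)
verts = rng.random((1000, 3)); faces = rng.integers(0, 1000, (2000, 3))
pts, fid, bary, nrm = sample_surface(verts, faces, 5000, rng)
offsets = rng.normal(scale=0.01, size=(5000, 1))              # sigma = 0.01 is an assumed value
non_surface = pts + offsets * nrm                             # perturb along the face normal
```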
Appendix E Additional Results
Effects of point cloud sampling density
To study the effect of the sampling density of the input point cloud, we individually train our model using point clouds of size 2500, 5000, and 7500 as input. Quantitative results are shown in Table 3. We can observe that the results across the different evaluation metrics show only small variations. To balance accuracy and computational cost, we use 5000 points in our final model.
Table 3: Effect of the point cloud sampling density on new motions (S1) and unseen identities (S2).

| #sampling points | $d_v$ (S1) | CD (S1) | FNC (S1) | $d_v$ (S2) | CD (S2) | FNC (S2) |
|---|---|---|---|---|---|---|
| Ours-2500 | 0.789 | 1.008 | 96.27 | 0.905 | 1.285 | 96.57 |
| Ours-5000 | 0.752 | 0.948 | 96.59 | 0.795 | 1.241 | 96.68 |
| Ours-7500 | 0.732 | 0.944 | 96.39 | 0.789 | 1.251 | 96.66 |
Robustness to noisy source mesh
To analyze the robustness to noise, we individually train our model by adding Gaussian noise perturbations to the source meshes. The standard deviation of the Gaussian noise is set to 0, 0.0025, or 0.005. The comparison in Table 4 shows that as the noise becomes larger, the performance of our method experiences only slight variations, which demonstrates the robustness of our method to noisy source meshes.

Table 4: Robustness to noisy source meshes on new motions (S1) and unseen identities (S2).

| #standard deviation | $d_v$ (S1) | CD (S1) | FNC (S1) | $d_v$ (S2) | CD (S2) | FNC (S2) |
|---|---|---|---|---|---|---|
| Ours-0 | 0.752 | 0.948 | 96.59 | 0.795 | 1.241 | 96.68 |
| Ours-0.0025 | 0.774 | 0.973 | 95.90 | 0.808 | 1.278 | 96.65 |
| Ours-0.0050 | 0.851 | 1.017 | 96.50 | 0.911 | 1.392 | 96.16 |
Robustness to partial source mesh
To investigate the robustness to incomplete source meshes, we randomly sample 5 seeds from the source mesh surface and then remove their nearest vertices and the corresponding faces. The number of removed vertices is determined by the incompleteness ratio $r$ and the number of source mesh vertices $|\mathcal{V}_s|$. Our model is then directly evaluated under the two settings $r = 0.05$ and $r = 0.10$. The quantitative results are provided in Table 5. As can be seen, there are no significant numerical variations between the different incompleteness ratios. This clearly demonstrates the robustness of our approach to incomplete source meshes.
Table 5: Robustness to partial source meshes on new motions (S1) and unseen identities (S2).

| #incompleteness ratio | $d_v$ (S1) | CD (S1) | FNC (S1) | $d_v$ (S2) | CD (S2) | FNC (S2) |
|---|---|---|---|---|---|---|
| Ours-0.0 | 0.752 | 0.948 | 96.59 | 0.795 | 1.241 | 96.68 |
| Ours-0.05 | 0.770 | 0.957 | 95.80 | 0.804 | 1.244 | 96.66 |
| Ours-0.10 | 0.823 | 1.002 | 96.44 | 0.858 | 1.261 | 96.55 |
Evaluations on real animal scans.
We evaluate our pre-trained model on real animal scans captured by ourselves. As shown in Figure 12, our method still produces realistic shape deformations, which demonstrates the generalization ability of our approach to real captured models.

Evaluations on reconstructed animals from real images.
In addition, we evaluate our pre-trained model on animals reconstructed from real RGB images using the BARC Rüegg et al. (2022) method. As shown in Figure 13, our method estimates realistic deformations for animals reconstructed from natural images. This also demonstrates the generalization ability of our method.

Evaluations on non-realistic user-specified handles.
While the goal of our data-driven deformation priors is to obtain deformations that are as realistic as possible, we also evaluate our method on non-realistic or non-physically-plausible handles. As shown in Figure 14, our method tries to find the closest deformation of the animal that best explains the provided handle displacements. Note that our method could also be trained on non-realistic or non-physically-plausible samples to learn the respective deformation behavior.

Without dense correspondence
While our current method uses an existing dataset where dense correspondences between temporal mesh frames are available, our framework can also be trained on datasets without dense correspondences through some adjustments to the inputs and loss functions. Concretely, we change our method to receive sparse handle correspondences as inputs and use the Chamfer distance as the loss function, which does not require ground-truth meshes with dense correspondences as supervision. In Figure 15, we visualize several test results of such a modified framework. As can be seen, even without dense correspondences for training, our method can still obtain accurate deformations.

Video animations
To visualize the deformation behaviours of the different approaches, we use a sequence of handle movements as inputs, and run our model frame by frame to obtain a deformation motion sequence. We refer to the supplemental video for an animated sequence.
Appendix F Limitations

While compelling results have been demonstrated for shape manipulation, a few limitations still exist in our approach that can be addressed in future work. Two representative failure cases are depicted in Figure 16. We can see that our method cannot handle extreme shape deformations well (e.g., left of Figure 16) or manipulate unseen identities that are far from the training data distribution (e.g., the elephant on the right of Figure 16). We believe these issues can be alleviated by a larger training dataset, a richer data augmentation strategy, and/or few-shot generalization techniques in the future.