1 Introduction
Fast and lightweight methods for animating 3D characters are desirable in various applications, including computer games and film visual effects. Traditional skinning-based mesh deformation provides a fast geometric approach but often lacks realistic dynamics. Physically-based simulation, on the other hand, can add plausible secondary motion to skinned animations, augmenting them with visually realistic and vivid effects, but at the cost of heavy computation.
Recent research has explored deep learning methods to approximate physically-based simulation in a much more time-efficient manner. While some approaches have focused on accelerating specific parts of the simulation [luo2018nnwarp, fulton2019latent, meister2020deep], others have proposed end-to-end solutions that predict dynamics directly from mesh-based features [bailey2018fast, holden2019subspace, santesteban2020softsmpl]. While demonstrating impressive results, these methods still have some limitations. Most of them assume a fixed mesh topology and thus need to train different networks for different character meshes. Moreover, in order to avoid the computational complexity of training networks on high-resolution meshes, some methods operate on reduced subspaces with limited degrees of freedom, leading to low accuracy.
In this paper, we propose a deep learning approach to predict secondary motion, i.e., the deformable dynamics of given skinned animations of 3D characters. Our method addresses the shortcomings of recent learning-based approaches by designing a network architecture that reflects the actual underlying physical process. Specifically, our network models the simulation using a volumetric mesh consisting of uniform tetrahedra surrounding the character mesh, where the mesh edges encode the internal forces that depend on the current state (i.e., displacements, velocities, accelerations), material properties (e.g., stiffness), and constraints on the vertices. Mesh vertices encode the inertia. Motivated by the observation that within a short time instance the secondary dynamics of a vertex is mostly affected by its current state, as well as the internal forces due to its neighbors, our network operates on local patches of the volumetric mesh. In addition to avoiding the computational complexity of encoding high-resolution character meshes as large graphs, this also enables our method to be applied to any character mesh, independent of its topology. Finally, our network encodes per-vertex material properties and constraints, giving the user the ability to easily prescribe varying properties to different parts of the mesh to control the dynamic behaviour.
As a unique benefit of the generalization capability of our model, we demonstrate that it is not necessary to construct a massive training dataset of complex meshes and motions. Instead, we construct our training data from primitive geometries, such as a volumetric mesh of a sphere. Our network trained on this dataset can generate detailed and visually plausible secondary motions on much more complex 3D characters during testing. By assigning randomized motions to the primitives during training, we are able to let the local patches cover a broad motion space, which improves the network’s online predictions in unseen scenarios.
We evaluate our method on various character meshes and complex motion sequences. We demonstrate visually plausible and stable secondary motion while being over 30 times faster than the implicit Euler method commonly used in physically-based simulation. We also provide comparisons to faster methods such as the explicit central differences method and other learning-based approaches that utilize graph convolutional networks. Our method outperforms those approaches both in terms of accuracy and robustness.
2 Related Work
2.1 Physically-based simulation methods
Complementing skinning-based animations with secondary motion is a well-studied problem. Traditional approaches resort to physically-based simulation [Zhang:CompDynamics:2020, Wang:2020:ACS]. However, it is well known that physically-based methods often suffer from computational complexity. Therefore, in the last decade, a series of methods were proposed to accelerate the computation, including example-based dynamic skinning [shi2008example], efficient elasticity calculation [mcadams2011efficient], formulation of motion equations in the rig subspace [hahn2012rig, hahn2013efficient], and the coupling of skeleton dynamics and soft body dynamics [liu2013simulation]. These approaches still have some limitations, such as robustness issues due to explicit integration, or unnatural deformation effects due to remeshing, whereas our method is much more robust in handling various characters and complex motions.
2.2 Learning-based methods
Grzeszczuk et al. [grzeszczuk1998neuroanimator] presented one of the earliest works demonstrating the possibility of replacing numerical computations with a neural network. Since then, research in this area has advanced, especially in the last few years. While some approaches have presented hybrid solutions where a neural network replaces a particular component of the physically-based simulation process, others have presented end-to-end solutions.
In the context of hybrid approaches, plug-in deep neural networks were applied in combination with the Finite Element Method (FEM) to help accelerate the simulation. For example, the node-wise NNWarp [luo2018nnwarp] was proposed to efficiently map linear nodal displacements to nonlinear ones. Fulton et al. [fulton2019latent] utilized an autoencoder to project the target mesh to a lower-dimensional space to increase the computation speed. Similarly, Tan et al. [tan2020realtime] designed a CNN-based network for dimension reduction to accelerate thin-shell deformable simulations. Romero et al. [ROCP20] built a data-driven statistical model to kinematically drive the FEM mechanical simulation. Meister et al. [meister2020deep] explored the use of neural networks to accelerate the time integration step of Total Lagrangian Explicit Dynamics (TLED) for complex soft tissue deformation simulation. Finally, Deng et al. [deng2020alternating] modeled the force propagation mechanism in their neural networks. These approaches improved efficiency, but at the cost of accuracy, and they are not friendly to end users who are unfamiliar with physical techniques. Ours, instead, allows the user to adjust the animation by simply painting the constraints and stiffness properties.

End-to-end approaches assume the target mesh is provided as input and directly predict the dynamic behaviour. For instance, Bailey et al. [bailey2018fast] enriched real-time skinning animation by adding the nonlinear deformations learned from film-quality character rigs. The work of Holden et al. [holden2019subspace] first trained an autoencoder to reduce the simulation space and then learned to efficiently approximate the dynamics projected to the subspace. Similarly, SoftSMPL [santesteban2020softsmpl] modeled realistic soft-tissue dynamics based on a novel motion descriptor and a neural-network-based recurrent regressor that runs in a nonlinear deformation subspace extracted from an autoencoder. While all these approaches presented impressive results, their main drawback is the assumption of a fixed mesh topology, requiring different networks to be trained for different meshes. Our approach, on the other hand, operates at a local patch level and can therefore generalize to different meshes at test time.
Lately, researchers have started to utilize Graph Convolutional Networks (GCNs) for simulation tasks due to their advantage in handling topology-free graphs. A GCN encodes the vertex positional information and aggregates the latent features to a given node by using a propagation rule. For particle-based systems, graphs are constructed based on the local adjacency of the particles at each frame and fed into GCNs [li2018learning, ummenhofer2019lagrangian, sanchez2020learning, de2020combining]. Concurrently, Pfaff et al. [pfaff2020learning] proposed a GCN for surface-mesh-based simulation. While these GCN models interpret mesh dynamics prediction as a general spatiotemporal problem, we incorporate physics into the design of our network architecture, e.g., inferring latent embeddings for inertia and internal forces, which enables us to achieve more stable and accurate results (Section 4.3).
3 Method
Given a 3D character and its primary motion sequence obtained using standard linear blend skinning techniques [skinningcourse:2014], we first construct a volumetric (tetrahedral) mesh and a set of barycentric weights to linearly embed the vertices of the character's surface mesh into the volumetric mesh [James:2004:Squashing], as shown in Figure 1. Our network operates on the volumetric mesh and, at each frame, predicts the updated vertex positions with deformable dynamics (the secondary motion), given the primary motion, the constraints and the material properties. The updated volumetric mesh vertex positions then drive the original surface mesh via the barycentric embedding, and the surface mesh is used for rendering; such a setup is standard in computer animation.
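The barycentric embedding step can be illustrated with a small numpy sketch. This is not the paper's implementation; the function names, the per-vertex tet `assignment` array, and the data layout are all hypothetical, but the math (solving for four barycentric weights per surface vertex, then driving the surface from the deformed tet vertices) is the standard construction.

```python
import numpy as np

def embed_surface_in_tets(surface_pts, tet_verts, tet_indices, assignment):
    """For each surface vertex, compute barycentric weights w.r.t. its
    containing tetrahedron. `assignment[i]` maps surface vertex i to a tet
    (hypothetical helper; a real pipeline would locate the tet first)."""
    weights = np.zeros((len(surface_pts), 4))
    for i, p in enumerate(surface_pts):
        a, b, c, d = tet_verts[tet_indices[assignment[i]]]
        # Solve [b-a, c-a, d-a] w = p - a for the last three weights.
        T = np.column_stack([b - a, c - a, d - a])
        w123 = np.linalg.solve(T, p - a)
        weights[i] = np.concatenate([[1.0 - w123.sum()], w123])
    return weights

def deform_surface(weights, tet_verts_deformed, tet_indices, assignment):
    """Drive the surface mesh from updated volumetric vertex positions."""
    out = np.empty((len(weights), 3))
    for i, w in enumerate(weights):
        out[i] = w @ tet_verts_deformed[tet_indices[assignment[i]]]
    return out
```

Because the embedding is linear, the weights are computed once from the reference meshes and reused for every predicted frame.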
We denote the reference tetrahedral mesh by $\bar{X}$ and its number of vertices by $n$. The skinned animation (primary motion) is represented as a set of time-varying positions $\bar{x}_t$. Similarly, we denote the predicted dynamic mesh by $X$ and its positions by $x_t$.
Our method additionally encodes mass and stiffness properties. The stiffness is represented as Young’s modulus. By painting different material properties per vertex over the mesh, users can control the dynamic effects, namely the deformation magnitude.
In contrast to previous works [santesteban2020softsmpl, pfaff2020learning], which trained neural networks directly on the surface mesh, we choose to operate on the volumetric mesh for several reasons. First, volumetric meshes provide a more efficient coarse representation and can handle character meshes that consist of multiple disconnected components. For example, in our experiments the "Michelle" character (see Figure 1) has a surface mesh with substantially more vertices than the corresponding volumetric mesh. In addition, the "Big Vegas" character mesh (see the teaser figure) has eight disconnected components, requiring the artist to first build a watertight mesh if using a method that learns directly on the surface mesh. Furthermore, volumetric meshes capture not only the surface of the character but also the interior, leading to more accurate learning of the internal forces. Finally, we use a uniformly voxelized mesh subdivided into tetrahedra as our volumetric mesh, which enables our method to generalize across character meshes of varying shapes and resolutions.
Next, we first explain the motion equations in physically-based simulation, and then discuss our method in detail, drawing inspiration from the physical process.
3.1 Physically-based Motion Equations
In constraint-based physically-based simulation [baraff2001physically], the equations of motion are

(1)   $M\ddot{x} + D\dot{x} + f_{\mathrm{int}}(x) = 0$, subject to $Cx = S\bar{x}$,

where $M$ is the diagonal (lumped) mass matrix (as commonly employed in interactive applications), $D$ is the Rayleigh damping matrix, and $x$, $\dot{x}$ and $\ddot{x}$ represent the positions, velocities and accelerations, respectively. The quantity $f_{\mathrm{int}}(x)$ represents the internal elastic forces. Secondary dynamics occurs because the constrained part of the mesh "drives" the free part of the mesh. Constraints are specified via the constraint matrix $C$ and the selection matrix $S$, which pin the constrained vertices of the dynamic mesh to the corresponding skinned positions $\bar{x}$. In order to leave room for secondary dynamics for 3D characters, we typically do not constrain all the vertices of the mesh, but only a subset. For example, in the Big Vegas example (see the teaser figure), we constrain the legs, the arms and the core inside the torso and head, but do not constrain the belly and hair, so that we can generate secondary dynamics in those unconstrained regions.
One approach to timestep Equation 1 is to use an explicit integrator, such as central differences:

(2)   $M\,\dfrac{x_{t+1} - 2x_t + x_{t-1}}{\Delta t^2} + D\,\dfrac{x_{t+1} - x_{t-1}}{2\Delta t} + f_{\mathrm{int}}(x_t) = 0$,
where $x_t$ and $x_{t+1}$ denote the state of the mesh in the current and next frames, respectively, and $\Delta t$ is the timestep. While explicit integration is fast, it suffers from stability issues. Hence, the slower but stable implicit backward Euler integrator is often preferred in physically-based simulation [Baraff:1998:LSI]:

(3)   $M\ddot{x}_{t+1} + D\dot{x}_{t+1} + f_{\mathrm{int}}(x_{t+1}) = 0$, with $\dot{x}_{t+1} = \dfrac{x_{t+1} - x_t}{\Delta t}$ and $\ddot{x}_{t+1} = \dfrac{\dot{x}_{t+1} - \dot{x}_t}{\Delta t}$.
We propose to approximate implicit integration as

(4)   $x_{t+1} \approx \Phi\!\left(x_t, x_{t-1}, x_{t-2}, \bar{x}_t, \bar{x}_{t-1}, \bar{x}_{t-2};\, \theta\right)$,

where $\Phi$ is a differentiable function constructed as a neural network with learned parameters $\theta$.
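The stability gap between the two integrators, which motivates replacing implicit integration with a learned step, can be seen even on a one-dimensional damped spring. The sketch below is purely illustrative (the spring constants and step counts are chosen for the demo, not taken from the paper): the same central-difference scheme as in Equation 2 blows up at the animation timestep of 1/24 s for a stiff spring, yet is stable with 100 substeps, mirroring the behaviour reported for the baseline in Section 4.

```python
def central_difference_step(x_prev, x_curr, dt, m, k, d):
    """One explicit central-difference step for m*x'' + d*x' + k*x = 0.
    Substitute x'' ~ (x_next - 2x + x_prev)/dt^2 and
    x' ~ (x_next - x_prev)/(2dt), then solve for x_next."""
    a = m / dt**2 + d / (2 * dt)
    b = 2 * m / dt**2 - k
    c = d / (2 * dt) - m / dt**2
    return (b * x_curr + c * x_prev) / a

def simulate(dt, steps, m=1.0, k=1e4, d=0.1, x0=1.0):
    """Roll the scheme forward from rest displacement x0."""
    x_prev, x_curr = x0, x0
    for _ in range(steps):
        x_prev, x_curr = x_curr, central_difference_step(x_prev, x_curr, dt, m, k, d)
    return x_curr
```

With stiffness k = 1e4 the natural frequency is 100 rad/s, beyond the explicit stability limit at dt = 1/24 s, so `simulate(1/24, 20)` diverges, while `simulate(1/2400, 2400)` (100 substeps per frame) stays bounded.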
3.2 Network design
As shown in Equation 1, predicting the secondary dynamics entails solving for $3n$ degrees of freedom for a mesh with $n$ vertices. Hence, directly approximating $\Phi$ in Equation 4 to predict all the degrees of freedom at once would lead to a huge and impractical network, which would furthermore not be applicable to input meshes with varying vertex counts and topologies. Inspired by the intuition that within a very short time moment, the motion of a vertex is mostly affected by its own inertia and the internal forces from its neighboring vertices, we design our network to operate on a local patch instead. As illustrated in Figure 2, the 1-ring local patch consists of one center vertex along with its immediate neighbors in the volumetric mesh. Even though two characters might have very different mesh topologies, as shown in Figure 1, their local patches will often be much more similar, boosting the generalization ability of our network.

The internal forces are caused by the local stress, and the aggregation of the internal forces acts to pull the vertices toward their positions in the reference motion, reducing the elastic energy. Thus, knowledge of the per-edge deformation and the per-vertex reference motion is needed for secondary motion prediction.
Hence, we propose to emulate this process as follows:
(5)   $z_i = \mathrm{MLP}_v(s_i)$,   $e_{ij} = \mathrm{MLP}_e(s_i, s_j)$ for $j \in \mathcal{N}(i)$,   $\Delta x_i = \mathrm{MLP}_u\!\left(z_i, \sum_{j \in \mathcal{N}(i)} e_{ij}\right)$,

where $\mathrm{MLP}_v$, $\mathrm{MLP}_e$ and $\mathrm{MLP}_u$ are three different multilayer perceptrons (MLPs) as shown in Figure 3, $s_i$ denotes the input features of vertex $i$, $\mathcal{N}(i)$ is the set of neighboring vertices of $i$ (excluding $i$), and the double index $ij$ denotes the central vertex $i$ and a neighbor $j$. The quantities $z_i$ and $e_{ij}$ are high-dimensional latent vectors that represent an embedding for the inertial dynamics and for the internal forces from each neighboring vertex, respectively. Perceptron $\mathrm{MLP}_u$ receives the concatenation of $z_i$ and the sum of the $e_{ij}$ to predict the final acceleration of vertex $i$. In practice, for simplicity, we train $\mathrm{MLP}_u$ to directly predict the displacement $\Delta x_i$, since we assume a fixed timestep of 1/24 s in our experiments.

We implement all three MLPs with four hidden fully connected layers activated by the Tanh function, and one output layer. During training, we provide the ground truth positions of the dynamic mesh as input. During testing, we provide the predictions of the network as input in a recurrent manner. Next, we discuss the details of these components.
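The three-MLP structure can be sketched at the shape level in numpy. This is a toy forward pass with random placeholder weights, not the trained model: the per-vertex feature size `F_IN` and the exact input concatenation are assumptions, while the hidden/output widths (64, 128, 192) and the four-hidden-layer Tanh design follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """A tiny Tanh MLP: four hidden layers plus a linear output layer,
    with random placeholder weights (standing in for trained ones)."""
    layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for i, (W, b) in enumerate(layers):
            x = x @ W + b
            if i < len(layers) - 1:  # Tanh on all but the output layer
                x = np.tanh(x)
        return x
    return forward

F_IN = 22  # hypothetical per-vertex feature size (positions, material, ...)
mlp_v = mlp([F_IN, 64, 64, 64, 64, 64])               # inertia embedding z_i
mlp_e = mlp([2 * F_IN + 1, 128, 128, 128, 128, 128])  # per-neighbor embedding e_ij
mlp_u = mlp([64 + 128, 192, 192, 192, 192, 3])        # aggregation -> displacement

def predict_displacement(center_feat, neighbor_feats, neighbor_flags):
    """Equation 5: embed the center, sum the per-neighbor embeddings,
    and decode the center vertex displacement."""
    z = mlp_v(center_feat)
    e_sum = np.zeros(128)
    for nf, flag in zip(neighbor_feats, neighbor_flags):
        e_sum += mlp_e(np.concatenate([center_feat, nf, [flag]]))
    return mlp_u(np.concatenate([z, e_sum]))
```

Because the per-neighbor embeddings are summed, the same network handles patches with any number of neighbors, which is what lets the model run on arbitrary volumetric meshes.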
MLP$_v$:
This perceptron focuses on the center vertex itself, encoding the "self-inertia" information. That is, the center vertex tends to continue its current motion, driven by both its velocity and acceleration. The input to $\mathrm{MLP}_v$ consists of the positions of the center vertex in the last three frames, both on the dynamic and the skinned mesh, as well as its material properties. The positions are represented in local coordinates with respect to the current position of the center vertex in the reference motion. The positions in the last three frames implicitly encode the velocity and the acceleration. Since the net force applied on the central vertex is divided by its mass, and it is relatively hard for the network to learn multiplication or division, we also include the inverse mass explicitly in the input. The hidden layer and output size is 64.
MLP$_e$:
For an unconstrained center vertex $i$, perceptron $\mathrm{MLP}_e$ encodes the "internal forces" contributed by its neighbors. The input to this MLP is similar to that of $\mathrm{MLP}_v$, except that we provide information both for the center vertex and for its neighbors. For each neighboring vertex $j$, we also provide a binary constraint flag indicating whether the vertex is free or constrained. Each neighbor $j$ yields a latent vector $e_{ij}$ for the central vertex $i$. The hidden layer and output size is 128.
MLP$_u$:
This module receives the concatenation of the output of $\mathrm{MLP}_v$ and the aggregation of the $e_{ij}$, and predicts the final displacement of the central vertex in the dynamic mesh. The input and hidden layer size is 192.
We train the final network with the mean squared error loss:

(6)   $L = \dfrac{1}{n} \sum_{i=1}^{n} \left\| \Delta x_i - \Delta \hat{x}_i \right\|_2^2$,

where $\Delta \hat{x}_i$ is the ground truth displacement. We adopted the Adam optimizer for training, with a learning rate starting from 0.0001 and a decay factor of 0.96 at each epoch.
3.3 Training Primitives
Because our method operates on local patches, it is not necessary to train it on complex character meshes. In fact, we found that a training dataset constructed by simulating basic primitives, such as a sphere (under various motions and material properties), is sufficient to generalize to various character meshes at test time. Specifically, we generate random motion sequences by prescribing random rigid body motion of a constrained beam-shaped core inside the spherical mesh. The motion of this rigid core excites dynamic deformations in the rest of the sphere volumetric mesh. Each motion sequence starts by applying, to the rigid core, a random acceleration and an angular velocity about a random rotation axis. Next, we reverse the acceleration so that the primitive returns back to its starting position, and let the primitive's secondary dynamics settle for a few frames. While the still phases ensure that we cover cases where local patches are stationary (but residual secondary dynamics from the primary motion remains), the random accelerations help to sample as diverse a set of local patch motions as possible. Doing so enhances the network's prediction stability.
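The accelerate/reverse/rest pattern described above can be sketched as follows. This is a simplified translational-only version (rotation about the random axis is omitted), and the phase lengths, acceleration range, and function name are all assumptions for illustration.

```python
import numpy as np

def random_core_trajectory(n_accel=24, n_rest=24, dt=1/24, seed=0):
    """Generate rigid-core translations: a random constant acceleration for
    n_accel frames, then the reversed acceleration (bringing the core back
    to rest), then a still phase during which residual dynamics settles."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.0, 1.0, size=3)  # random acceleration vector
    pos, vel = np.zeros(3), np.zeros(3)
    frames = []
    for phase_a, n in [(a, n_accel), (-a, n_accel), (np.zeros(3), n_rest)]:
        for _ in range(n):
            vel = vel + phase_a * dt   # simple forward-Euler kinematics
            pos = pos + vel * dt
            frames.append(pos.copy())
    return np.array(frames)
```

Varying the seed (and, in the full setup, the rotation axis and material parameters) yields the broad coverage of local patch motions that the training set relies on.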
4 Experiments
In this section, we show qualitative and quantitative results of our method, as well as comparisons to other methods. We also run an ablation study to verify that explicitly providing the position information of the reference mesh as input is necessary.
4.1 Dataset and evaluation metrics
For training, we use a uniform tetrahedral mesh of a sphere. We generate random motion sequences at 24 fps, using the Vega FEM simulator [Vega, sin2013vega]. For each motion sequence, we use seven different material settings. Each motion sequence consists of 456 frames resulting in a total of 255k frames in our training set.
We evaluate our method on 3D character animations obtained from Adobe’s Mixamo dataset [mixamo]. Neither the character meshes nor the primary motion sequences are seen in our training data. We create test cases for five different character meshes as listed in Table 1 and 15 motions in total. The volumetric meshes for the test characters use the same uniform tetrahedron size as our training data. For all the experiments, we report three types of metrics:

- Single-frame RMSE: We measure the average root-mean-square error (RMSE) between the prediction and the ground truth over all frames, while providing the ground truth positions of the previous frames as input.

- Rollout RMSE: We provide the previous predictions of the network as input to the current frame in a recurrent manner and measure the average RMSE between the prediction and the ground truth over all frames.

- Elastic energy statistics: We use the concept of elastic energy in physically-based simulation to detect abnormalities in the deformation sequence, or any possible mesh explosions. For each frame, we calculate the elastic energy based on the current mesh displacements with respect to the reference state. We list the mean as well as the standard deviation of the energy to show its distribution across the animation.
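The two RMSE protocols differ only in where the model's input history comes from. The sketch below makes that precise; it is a simplified version in which the model consumes a single previous frame (the actual network uses the last three frames), and `step_fn` stands in for one network prediction step.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def single_frame_rmse(step_fn, gt_positions):
    """Feed ground-truth previous frames to the model at every step."""
    errs = [rmse(step_fn(gt_positions[t - 1]), gt_positions[t])
            for t in range(1, len(gt_positions))]
    return float(np.mean(errs))

def rollout_rmse(step_fn, gt_positions):
    """Feed the model its own previous prediction, recurrently,
    so that errors accumulate over the sequence."""
    pred, errs = gt_positions[0], []
    for t in range(1, len(gt_positions)):
        pred = step_fn(pred)
        errs.append(rmse(pred, gt_positions[t]))
    return float(np.mean(errs))
```

A model with a small per-step bias scores well on the single-frame metric but drifts under rollout, which is why both are reported.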
4.2 Analysis of Our Method
Performance:
In Table 1, we show the speed of our method, as well as that of the ground truth method and a baseline method. For each method, we record the time to compute the dynamic mesh, but exclude other components such as initialization, rendering and mesh interpolation.
We adopted the implicit backward Euler approach (Equation 3) as the ground truth and the faster explicit central differences integration (Equation 2) as the baseline. Both our baseline and ground truth were optimized using the deformable object simulation library Vega FEM [Vega, sin2013vega], and accelerated across multiple cores via Intel Thread Building Blocks (TBB), with 8 cores for assembling the internal forces and 16 cores for solving the linear system. The experiment platform is a 2.90 GHz Intel Xeon(R) CPU E5-2690 (32 GB RAM), which provides a highly competitive baseline/ground truth implementation. We ran our trained model on a GeForce RTX 2080 graphics card (8 GB RAM). We also tested it on the CPU, without any multithreaded acceleration.
Moreover, we also provide performance results for the same character mesh (Big Vegas) at different voxel resolutions. To handle different resolutions of test meshes, we resize the volumetric mesh so that the local patches are similar to the training data (i.e., the shortest edge length is 0.2).
| character | #vertices | implicit (s) | explicit (s) | ours, GPU (s) | ours, CPU (s) |
| --- | --- | --- | --- | --- | --- |
| Big vegas | 1468 | 0.58 | 0.056 | 0.012 | 0.017 |
| Kaya | 1417 | 0.52 | 0.052 | 0.012 | 0.015 |
| Michelle | 1105 | 0.33 | 0.032 | 0.011 | 0.015 |
| Mousey | 2303 | 0.83 | 0.084 | 0.014 | 0.020 |
| Ortiz | 1258 | 0.51 | 0.049 | 0.012 | 0.015 |
| Big vegas | 6987 | 2.45 | 0.32 | 0.032 | 0.14 |
| Big vegas | 10735 | 4.03 | 0.53 | 0.046 | 0.24 |
| Big vegas | 18851 | 8.26 | 1.06 | 0.068 | 0.42 |
| Big vegas | 39684 | 24.24 | 2.96 | 0.14 | 0.89 |
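The resolution-matching step mentioned above (resizing the test volumetric mesh so that its local patches match the training scale) amounts to a uniform rescale; the sketch below is a minimal version under the assumption that edges are given as index pairs, with predictions scaled back by the returned factor afterwards.

```python
import numpy as np

TRAIN_EDGE = 0.2  # shortest tet edge length in the training data

def rescale_to_training_units(verts, edges):
    """Uniformly scale a volumetric mesh so its shortest edge matches the
    training patch scale. Returns the scaled vertices and the scale factor."""
    lengths = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    s = TRAIN_EDGE / lengths.min()
    return verts * s, s
```

Dividing the network's predicted displacements by the same factor `s` maps them back to the character's original units.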
Results indicate that when run on the GPU (CPU), our method is around 30 (20) times faster than the implicit integrator and 3 (2) times faster than the explicit integrator, per frame. As the number of vertices increases, our method becomes even more competitive. Although the explicit method has a speed comparable to ours, its simulation explodes after a few frames. In practice, explicit methods require much smaller time steps (an additional 100 substeps in our experiments) to achieve stable results. We provide a more detailed report on the speed-stability trade-off of explicit integration in the supplementary material.
Generalization:
We train the network on the sphere dataset and achieve a single-frame RMSE of 0.0026 on the testing split of this dataset. As listed in Table 2, when tested on characters, our method achieves a single-frame RMSE of 0.0067, showing remarkable generalization capability (we note that the shortest edge length of the volumetric character meshes is 0.2). The mean rollout error increases to 0.064 over the whole sequences, due to error accumulation, but the elastic energy statistics remain close to the ground truth. From the visualization of the ground truth and our results in Figure 6, we can see that although the predicted secondary dynamics deviate slightly from the ground truth, they are still visually plausible. We further plot the rollout prediction RMSE and elastic energy of the Big Vegas character in Figure 4. The prediction error remains low, and the mean elastic energy of our method stays close to the ground truth for the whole sequence, whereas the baseline method explodes quickly. We provide such rollout prediction plots for all characters, as well as video results, in the supplementary material.
| Methods | single frame | rollout-24 | rollout-48 | rollout-all |
| --- | --- | --- | --- | --- |
| Ground truth | — | — | — | — |
| Our method | 0.0067 | 0.059 | 0.062 | 0.064 |
| Ours w/o ref. motion | 0.050 | 0.20 | 0.38 | 10.09 |
| Baseline | — | 7.26 | 9.63 | 17.5 |
| CFD-GCN [de2020combining] | 0.040 | 41.17 | 70.55 | 110.07 |
| GNS [sanchez2020learning] | 0.049 | 0.22 | 0.34 | 0.54 |
| MeshGraphNets [pfaff2020learning] | 0.050 | 0.11 | 0.43 | 4.46 |
Non-homogeneous Dynamics:
Figure 5 shows how to control the dynamics by painting non-homogeneous material properties over the mesh. Varying stiffness values are painted on the hair and the breast region of the volumetric mesh. For better visualization, we render the material settings on the surface mesh in the figure. We display three different material settings, obtained by assigning different stiffness values. A larger Young's modulus means stiffer material, hence the corresponding region exhibits less dynamics. In contrast, the regions with smaller stiffness show significant dynamic effects. This result demonstrates that our method correctly models the effect of material properties, while providing an interface for the artist to efficiently adjust the desired dynamic effects.
Ablation study:
To demonstrate that it is necessary to incorporate the reference mesh motion into the input features of our network, we performed an ablation study. To ensure that the constrained vertices still drive the dynamic mesh in the absence of the reference information, we update the positions of the constrained vertices based on the reference motion at the beginning of each iteration. As input to our network architecture, we use the same set of features except the positions on the reference mesh. The results of "Ours w/o ref. motion" in Table 2 and Figures 4 and 6 demonstrate that this version is inferior to our original method, especially when running the network over a long time sequence. This establishes that the reference mesh is indispensable to the quality of the network's approximation.
4.3 Comparison to Previous Work
As discussed in Section 2, several recent particlebased physics and meshbased deformation systems utilized graph convolutional networks (GCNs). In this section, we train these network models on the same training set as our method and test on our character meshes.
CFD-GCN [de2020combining]:
We implemented our own version of the CFD-GCN architecture, adopting the convolution kernel of [kipf2016semi]. However, we ignored the remeshing part, because we assume that the mesh topology remains fixed when predicting secondary motion. As input, we provide the same information as our method, namely the constraint states of the vertices, the displacements and the material properties. We found that the network structure recommended in the paper resulted in a high training error. We then replaced the originally proposed ReLU activation function with the Tanh activation (as used in our method), which significantly improved the training performance. Even so, as shown in Table 2 and Figure 4, the rollout prediction explodes very quickly. We speculate that although the model aggregates the features from the neighbors to a central vertex via an adjacency matrix, it treats the center and the neighboring vertices equally, whereas in reality their roles in physically-based simulation are distinct.

GNS [sanchez2020learning]:
The recently proposed GNS [sanchez2020learning] architecture is also a graph network, designed for particle systems. The model first separately encodes node features and edge features in the graph, then generalizes the GraphNet blocks of [pmlrv80sanchezgonzalez18a] to pass messages across the graph, and finally uses a decoder to extract the prediction target from the GraphNet block output. The original paper embeds the particles in a graph by adding edges between vertices within a given radius threshold. In our implementation, we instead utilized the mesh topology to construct the graph. We used two blocks in the "processor" [sanchez2020learning] to achieve a network capacity similar to ours. In contrast to CFD-GCN [de2020combining], the GraphNet block can represent the interaction between nodes and edges more efficiently, resulting in a significant improvement in rollout prediction. However, we still observe mesh explosions after a few frames, as shown in Figure 6 and in the supplementary video.
MeshGraphNets [pfaff2020learning]:
In work concurrent to ours, MeshGraphNets [pfaff2020learning] were presented for physically-based simulation on a mesh, with an architecture similar to GNS [sanchez2020learning]. The Lagrangian cloth system presented in their paper is the approach most closely related to our work. Therefore, we followed the input formulation of their example, except that we used the reference mesh to represent the undeformed mesh space as the edge feature. In our implementation, we keep the originally proposed encoders that embed the edge and node features, but exclude the world-space (global) encoder, because it is not applicable to our problem setting. Similarly, we keep the corresponding edge and node MLPs inside the graph block, but remove the world-space MLP. We used 15 graph blocks in the model, as suggested by their paper. The network has roughly 10 times more parameters than ours: 2,333,187 compared to our 237,571. Training lasted 11 days, whereas our network was trained in less than a day.
We report how MeshGraphNets perform on our test character motion sequences in Table 2. The overall average rollout RMSE of MeshGraphNets is worse than GNS [sanchez2020learning]. Nevertheless, we note that out of 15 motions, this approach achieved 5 stable rollout predictions without explosions, while GNS [sanchez2020learning] failed on all of them. Our method outperforms each of the compared methods with respect to the investigated metrics.
5 Conclusion
We proposed a Deep Emulator for enhancing skinning-based animations of 3D characters with vivid secondary motion. Our method is inspired by the underlying physical simulation. Specifically, we train a neural network that operates on a local patch of a volumetric simulation mesh of the character, and predicts the updated vertex positions from the current acceleration, velocity, and positions. Being a local method, our network generalizes across 3D character meshes of arbitrary topology.
While our method demonstrates plausible secondary dynamics for various 3D characters under complex motions, there are still certain limitations we would like to address in future work. Specifically, we demonstrated that our network trained on a dataset of a volumetric mesh of a sphere can generalize to 3D characters with varying topologies. However, if the local geometric detail of a character is significantly different from that seen during training (e.g., the ears of the Mousey character contain many local neighborhoods not present in the sphere training data), the quality of our output decreases. One potential avenue for addressing this is to add additional primitive types to the training, beyond tetrahedralized spheres. A thorough study of the types of training primitives and motion sequences required to cover the underlying problem domain is an interesting future direction.
Acknowledgements:
This research was sponsored in part by NSF (IIS-1911224), a USC Annenberg Fellowship to Mianlun Zheng, Bosch Research, and Adobe Research.
References
Appendix:
Please refer to the link below for more visualization results: https://zhengmianlun.github.io/publications/deepEmulator.html.
A.1 Dataset Information
In this paper, we trained our network on a sphere dataset but tested it on five character meshes from Adobe's Mixamo dataset [mixamo]. Table A.1 provides detailed information about the five character meshes, including the vertex counts and edge lengths of the original surface meshes, as well as of the corresponding uniform volumetric meshes.
In Figure A.1, we show how we set constraints for each of the meshes, from a side view. The red vertices are constrained to move based on the skinned animation and drive the free vertices to deform with secondary motion.
A.2 Full Quantitative and Qualitative Results
In Tables A.2–A.16, we provide the quantitative results of our network tested on the five character meshes and 15 motions. The corresponding error plots are given in Figures A.2–A.16. We also provide the error plots for the compared methods. Across all the test cases, our method achieves the most stable rollout prediction with the lowest error.
In DeepEmulator.html, we provide animation sequences of our results as well as other comparison methods.
A.3 Further Analysis of Baseline Performance
As introduced in Section 4.2, we adopted the implicit backward Euler approach (Equation 3) as the ground truth and the faster explicit central differences integration (Equation 2) as the baseline. Although the baseline method is 10 times faster than the implicit integrator at the same time step (1/24 second), it explodes after a few frames. In order to achieve stable simulation results, we found that it requires at least 100 substeps (i.e., a substep of 1/2400 second). In Table A.17, we provide the per-frame running time of the explicit integration with 50 and 100 substeps.
A.4 Choice of the Training Dataset
In Section 5, we mentioned a future direction of expanding the training dataset beyond primitive-based datasets such as spheres. Here, we analyze an alternative training dataset, namely the "Ortiz Dataset", created by running our physically-based simulator on the volumetric mesh surrounding the Ortiz character (the same mesh as in Table A.1), with motions acquired from Adobe's Mixamo. Both datasets contain the same number of frames. We report our results in Tables A.18 to A.22.
Our experiments show that the network trained on the Sphere Dataset outperforms the one trained on the Ortiz Dataset in most cases (75%). We see two reasons for this. First, the local patches in the sphere are generic and not specific to any geometry, making the learned network more general and therefore more suitable for characters other than Ortiz. Second, the motions in the Ortiz Dataset were created by human artists, and as such they follow certain human-selected artistic patterns. The motions in the Sphere Dataset, in contrast, consist of random translations and rotations, which provide a denser sampling of the possible motion space and therefore improve the robustness of the network.
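Random rigid motions of the kind driving the Sphere Dataset can be sampled as accumulated incremental rotations and translations. The sketch below is hypothetical: the function name, step sizes, and angle ranges are illustrative choices, not the parameters actually used to generate the dataset.

```python
import numpy as np

def random_rigid_motion_sequence(num_frames, rng, max_step=0.05, max_angle=0.1):
    """Sample a sequence of rigid transforms (rotation matrix, translation)
    such as might drive the constrained vertices of a training sphere.
    Each frame applies a small random incremental rotation and translation."""
    transforms = []
    R = np.eye(3)
    t = np.zeros(3)
    for _ in range(num_frames):
        # Random incremental rotation about a random axis (Rodrigues' formula).
        axis = rng.normal(size=3)
        axis /= np.linalg.norm(axis)
        angle = rng.uniform(-max_angle, max_angle)
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])   # skew-symmetric cross-product matrix
        dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
        R = dR @ R
        t = t + rng.uniform(-max_step, max_step, size=3)
        transforms.append((R.copy(), t.copy()))
    return transforms

rng = np.random.default_rng(0)
seq = random_rigid_motion_sequence(24, rng)
R, t = seq[-1]
print(np.allclose(R @ R.T, np.eye(3), atol=1e-8))  # rotations stay orthonormal
```

Because the increments are drawn independently at every frame, such sequences wander through the motion space rather than following artist-designed trajectories, which is the denser-sampling property discussed above.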
A.5 Analysis of the Local Patch Size
In the main paper, we show our network architecture for 1-ring local patches (Figure 3): the MLP learns to predict the internal forces from the 1-ring neighbors around the center vertex. Here, we present an ablation study in which the network instead learns from 2-ring and 3-ring local patches, respectively. For 2-ring local patches, we add an additional MLP that receives the inputs from the 2-ring neighbors of the center vertex; its output latent vector is concatenated to the input of the 1-ring MLP. A similar construction is used for the 3-ring network, by adding another MLP for the 3-ring internal forces.
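The concatenation scheme for the 2-ring variant can be sketched as follows. This is a minimal numpy illustration of the wiring only; the feature dimensions, layer widths, and weight initialization are made-up values, not the paper's actual architecture.

```python
import numpy as np

def mlp(x, weights):
    """Tiny MLP: alternating linear layers with ReLU on all but the last layer."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def make_mlp(sizes, rng):
    """Random weights for an MLP with the given layer sizes (illustrative)."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
f_in, latent = 12, 32  # hypothetical per-ring feature size and latent width

# One MLP per ring; the outer-ring latent is concatenated into the 1-ring input.
mlp_ring2 = make_mlp([f_in, 64, latent], rng)
mlp_ring1 = make_mlp([f_in + latent, 64, latent], rng)

ring1_feats = rng.normal(size=f_in)  # aggregated features of the 1-ring neighbors
ring2_feats = rng.normal(size=f_in)  # aggregated features of the 2-ring neighbors

z2 = mlp(ring2_feats, mlp_ring2)                      # 2-ring latent vector
z1 = mlp(np.concatenate([ring1_feats, z2]), mlp_ring1)  # fused prediction latent
print(z1.shape)  # (32,)
```

The 3-ring variant repeats the same pattern once more: a third MLP encodes the 3-ring features, and its latent is concatenated alongside the others.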
On the training loss, the network achieves an RMSE of 0.00257, 0.00159, and 0.00146 for 1-ring, 2-ring, and 3-ring local patches, respectively. In Table A.23, we provide the corresponding test results on the five characters. Overall, we did not observe clear improvements from increasing the local patch size. This could be because 2-ring and 3-ring local patches exhibit larger structural variability than in the sphere mesh, particularly for center vertices close to the boundary. Therefore, we adopt 1-ring local patches in our paper.
Character  Surface mesh #vertices  Surface edge length  Volumetric mesh #vertices
Big vegas  3711  8  1468
Kaya  4260  4  1417
Michelle  14267  1  1105
Mousey  6109  1  2303
Ortiz  24798  1  1258
Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0098  0.053  0.063  0.059
Ours w/o ref. motion  0.058  0.19  0.55  7.70
Baseline  -  8.23E120  1.07E121  1.57E121
CFD-GCN [de2020combining]  0.031  50.33  72.06  79.16
GNS [sanchez2020learning]  0.057  0.20  0.31  0.55
MeshGraphNets [pfaff2020learning]  0.058  0.14  0.10  2.85

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0093  0.066  0.074  0.073
Ours w/o ref. motion  0.061  0.31  0.87  14.19
Baseline  -  7.83E120  1.06E121  1.79E121
CFD-GCN [de2020combining]  0.031  67.91  110.82  591.46
GNS [sanchez2020learning]  0.060  0.32  0.50  0.68
MeshGraphNets [pfaff2020learning]  0.062  0.16  0.38  6.58

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0062  0.050  0.057  0.065
Ours w/o ref. motion  0.047  0.15  0.36  21.37
Baseline  -  7.67E120  1.07E121  2.51E121
CFD-GCN [de2020combining]  0.027  47.90  76.76  110.26
GNS [sanchez2020learning]  0.046  0.17  0.32  0.56
MeshGraphNets [pfaff2020learning]  0.048  0.084  0.081  10.12

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0058  0.052  0.050  0.064
Ours w/o ref. motion  0.038  0.13  0.40  13.88
Baseline  -  7.67E120  9.97E120  2.15E121
CFD-GCN [de2020combining]  0.027  75.12  86.49  101.63
GNS [sanchez2020learning]  0.037  0.17  0.30  0.62
MeshGraphNets [pfaff2020learning]  0.040  0.090  0.091  7.89

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0065  0.062  0.054  0.070
Ours w/o ref. motion  0.040  0.35  0.90  12.82
Baseline  -  7.76E120  1.02E121  1.96E121
CFD-GCN [de2020combining]  0.083  50.17  71.88  79.87
GNS [sanchez2020learning]  0.040  0.18  0.31  0.50
MeshGraphNets [pfaff2020learning]  0.043  0.13  0.11  5.24

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0067  0.054  0.058  0.041
Ours w/o ref. motion  0.042  0.097  0.15  20.02
Baseline  -  6.21E120  8.00E120  2.00E121
CFD-GCN [de2020combining]  0.016  0.22  72.15  69.87
GNS [sanchez2020learning]  0.041  0.15  0.28  0.47
MeshGraphNets [pfaff2020learning]  0.042  0.063  0.084  0.068

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0075  0.083  0.067  0.054
Ours w/o ref. motion  0.030  0.23  0.41  3.16
Baseline  -  5.66E120  7.98E120  8.99E120
CFD-GCN [de2020combining]  0.023  35.34  62.31  59.14
GNS [sanchez2020learning]  0.029  0.21  0.29  0.45
MeshGraphNets [pfaff2020learning]  0.03  0.11  0.15  2.44

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0041  0.033  0.033  0.04
Ours w/o ref. motion  0.060  0.12  0.19  5.24
Baseline  -  5.81E120  7.86E120  1.52E121
CFD-GCN [de2020combining]  0.017  27.36  42.25  73.72
GNS [sanchez2020learning]  0.060  0.10  0.19  0.37
MeshGraphNets [pfaff2020learning]  0.06  0.082  0.11  0.079

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0056  0.025  0.024  0.047
Ours w/o ref. motion  0.082  0.13  0.15  15.20
Baseline  -  6.25E120  7.47E120  2.25E121
CFD-GCN [de2020combining]  0.019  36.06  36.80  64.16
GNS [sanchez2020learning]  0.082  0.12  0.14  0.48
MeshGraphNets [pfaff2020learning]  0.082  0.065  0.077  4.31

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0056  0.049  0.037  0.055
Ours w/o ref. motion  0.086  0.14  0.16  20.14
Baseline  -  5.71E120  7.69E120  2.35E121
CFD-GCN [de2020combining]  0.019  25.05  42.43  64.23
GNS [sanchez2020learning]  0.085  0.13  0.19  0.43
MeshGraphNets [pfaff2020learning]  0.086  0.11  0.094  0.11

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0080  0.077  0.10  0.086
Ours w/o ref. motion  0.057  0.24  0.42  1.43
Baseline  -  7.78E120  1.07E121  1.24E121
CFD-GCN [de2020combining]  0.041  56.37  82.57  72.71
GNS [sanchez2020learning]  0.057  0.50  0.69  0.92
MeshGraphNets [pfaff2020learning]  0.057  0.15  1.30  3.48

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0066  0.067  0.11  0.09
Ours w/o ref. motion  0.036  0.19  0.28  2.95
Baseline  -  8.29E120  1.12E121  1.51E121
CFD-GCN [de2020combining]  0.043  68.29  78.87  75.08
GNS [sanchez2020learning]  0.036  0.26  0.62  0.86
MeshGraphNets [pfaff2020learning]  0.037  0.15  1.61  8.63

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0090  0.087  0.10  0.10
Ours w/o ref. motion  0.039  0.36  0.35  7.86
Baseline  -  8.48E120  1.10E121  1.81E121
CFD-GCN [de2020combining]  0.17  66.19  73.93  78.95
GNS [sanchez2020learning]  0.039  0.28  0.50  0.63
MeshGraphNets [pfaff2020learning]  0.040  0.21  2.14  14.97

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0057  0.082  0.077  0.073
Ours w/o ref. motion  0.041  0.29  0.40  1.20
Baseline  -  8.15E120  1.09E121  1.09E121
CFD-GCN [de2020combining]  0.030  10.71  65.43  58.90
GNS [sanchez2020learning]  0.040  0.30  0.22  0.27
MeshGraphNets [pfaff2020learning]  0.042  0.090  0.088  0.096

Methods  Single Frame  Rollout-24  Rollout-48  Rollout-All
Ground Truth  -  -  -  -
Our Method  0.0039  0.039  0.032  0.036
Ours w/o ref. motion  0.034  0.10  0.17  4.18
Baseline  -  7.40E120  9.48E120  1.62E121
CFD-GCN [de2020combining]  0.017  0.57  83.52  71.93
GNS [sanchez2020learning]  0.034  0.17  0.21  0.31
MeshGraphNets [pfaff2020learning]  0.034  0.065  0.071  0.064
Character  #Vertices (volumetric)  Implicit (s)  Explicit, 1 substep (s)  Explicit, 50 substeps (s)  Explicit, 100 substeps (s)  Ours (s)
Big vegas  1468  0.58  0.056  2.57027  6.20967  0.017
Kaya  1417  0.52  0.052  2.42985  5.72762  0.015
Michelle  1105  0.33  0.032  1.52916  3.64744  0.013
Mousey  2303  0.83  0.084  3.90579  9.5897  0.018
Ortiz  1258  0.51  0.049  2.2496  5.16806  0.015
Motion  Dataset  Single Frame  Rollout-24  Rollout-48  Rollout-All
Motion 1  Sphere Dataset  0.0098  0.053  0.063  0.0591
Motion 1  Ortiz Dataset  0.0111  0.0512  0.067  0.0969
Motion 2  Sphere Dataset  0.0093  0.0664  0.0744  0.0727
Motion 2  Ortiz Dataset  0.0101  0.0564  0.0607  0.1241
Motion 3  Sphere Dataset  0.0062  0.0495  0.0571  0.0654
Motion 3  Ortiz Dataset  0.0062  0.0481  0.061  0.143
Motion 4  Sphere Dataset  0.0058  0.0521  0.0496  0.0635
Motion 4  Ortiz Dataset  0.0064  0.0367  0.0423  0.1331
Motion 5  Sphere Dataset  0.0065  0.0615  0.0537  0.0702
Motion 5  Ortiz Dataset  0.0065  0.0383  0.0616  0.1282

Motion  Dataset  Single Frame  Rollout-24  Rollout-48  Rollout-All
Motion 1  Sphere Dataset  0.0067  0.0544  0.0578  0.0411
Motion 1  Ortiz Dataset  0.0113  0.075  0.0839  0.1176
Motion 2  Sphere Dataset  0.0075  0.0834  0.0666  0.0537
Motion 2  Ortiz Dataset  0.0126  0.0749  0.0645  0.076

Motion  Dataset  Single Frame  Rollout-24  Rollout-48  Rollout-All
Motion 1  Sphere Dataset  0.0041  0.0329  0.0332  0.0401
Motion 1  Ortiz Dataset  0.0059  0.0329  0.0319  0.0969
Motion 2  Sphere Dataset  0.0056  0.025  0.0236  0.0471
Motion 2  Ortiz Dataset  0.0079  0.0324  0.035  0.1213
Motion 3  Sphere Dataset  0.0056  0.0491  0.0373  0.0548
Motion 3  Ortiz Dataset  0.0067  0.058  0.04  0.1264

Motion  Dataset  Single Frame  Rollout-24  Rollout-48  Rollout-All
Motion 1  Sphere Dataset  0.008  0.0771  0.1003  0.0858
Motion 1  Ortiz Dataset  0.0122  0.0871  0.1056  0.122
Motion 2  Sphere Dataset  0.0066  0.0666  0.1115  0.09
Motion 2  Ortiz Dataset  0.0115  0.0936  0.1072  0.1303
Motion 3  Sphere Dataset  0.009  0.0871  0.1001  0.1042
Motion 3  Ortiz Dataset  0.0144  0.1236  0.1113  0.1594

Motion  Dataset  Single Frame  Rollout-24  Rollout-48  Rollout-All
Motion 1  Sphere Dataset  0.0057  0.0819  0.0765  0.0726
Motion 1  Ortiz Dataset  0.0053  0.0719  0.0532  0.0702
Motion 2  Sphere Dataset  0.0039  0.0391  0.0316  0.0363
Motion 2  Ortiz Dataset  0.0051  0.0365  0.0338  0.0946
Test Dataset  Patch size  Single Frame  Rollout-24  Rollout-48  Rollout-All
Big vegas  1-ring  0.0075  0.057  0.060  0.066
Big vegas  2-ring  0.0068  0.060  0.066  0.077
Big vegas  3-ring  0.0075  0.043  0.0418  0.059
Kaya  1-ring  0.0071  0.069  0.062  0.047
Kaya  2-ring  0.0060  0.078  0.060  0.072
Kaya  3-ring  0.0064  0.099  0.095  0.092
Michelle  1-ring  0.0051  0.036  0.031  0.047
Michelle  2-ring  0.0045  0.033  0.033  0.047
Michelle  3-ring  0.0047  0.043  0.042  0.059
Mousey  1-ring  0.0079  0.077  0.10  0.093
Mousey  2-ring  0.0069  0.11  0.19  0.14
Mousey  3-ring  0.0075  0.11  0.21  0.16
Ortiz  1-ring  0.0048  0.061  0.054  0.054
Ortiz  2-ring  0.0042  0.069  0.089  0.070
Ortiz  3-ring  0.0051  0.094  0.097  0.087