A Deep Emulator for Secondary Motion of 3D Characters

03/01/2021 ∙ Mianlun Zheng et al. ∙ Adobe ∙ University of Southern California

Fast and light-weight methods for animating 3D characters are desirable in various applications such as computer games. We present a learning-based approach to enhance skinning-based animations of 3D characters with vivid secondary motion effects. We design a neural network that encodes each local patch of a character simulation mesh where the edges implicitly encode the internal forces between the neighboring vertices. The network emulates the ordinary differential equations of the character dynamics, predicting new vertex positions from the current accelerations, velocities and positions. Being a local method, our network is independent of the mesh topology and generalizes to arbitrarily shaped 3D character meshes at test time. We further represent per-vertex constraints and material properties such as stiffness, enabling us to easily adjust the dynamics in different parts of the mesh. We evaluate our method on various character meshes and complex motion sequences. Our method can be over 30 times more efficient than ground-truth physically based simulation, and outperforms alternative solutions that provide fast approximations.


1 Introduction

Fast and light-weight methods for animating 3D characters are desirable in various applications including computer games and film visual effects. Traditional skinning-based mesh deformation provides a fast geometric approach but often lacks realistic dynamics. On the other hand, physically-based simulation can add plausible secondary motion to skinned animations, augmenting them with visually realistic and vivid effects, but at the cost of heavy computation.

Recent research has explored deep learning methods to approximate physically-based simulation in a much more time-efficient manner. While some approaches have focused on accelerating specific parts of the simulation [luo2018nnwarp, fulton2019latent, meister2020deep], others have proposed end-to-end solutions that predict dynamics directly from mesh-based features [bailey2018fast, holden2019subspace, santesteban2020softsmpl]. While demonstrating impressive results, these methods still have some limitations. Most of them assume a fixed mesh topology and thus need to train different networks for different character meshes. Moreover, in order to avoid the computational complexity of training networks on high-resolution meshes, some methods operate on reduced subspaces with limited degrees of freedom, leading to low accuracy.

In this paper, we propose a deep learning approach to predict secondary motion, i.e., the deformable dynamics of given skinned animations of 3D characters. Our method addresses the shortcomings of the recent learning-based approaches by designing a network architecture that can reflect the actual underlying physical process. Specifically, our network models the simulation using a volumetric mesh consisting of uniform tetrahedra surrounding the character mesh, where the mesh edges implicitly encode the internal forces that depend on the current state (i.e., displacements, velocities, accelerations), material properties (e.g., stiffness), and constraints on the vertices. Mesh vertices encode the inertia. Motivated by the observation that over a short time interval the secondary dynamics of a vertex is mostly affected by its current state, as well as the internal forces due to its neighbors, our network operates on local patches of the volumetric mesh. In addition to avoiding the computational complexity of encoding high resolution character meshes as large graphs, this also enables our method to be applied to any character mesh, independent of its topology. Finally, our network encodes per-vertex material properties and constraints, giving the user the ability to easily prescribe varying properties to different parts of the mesh to control the dynamic behaviour.

As a unique benefit of the generalization capability of our model, we demonstrate that it is not necessary to construct a massive training dataset of complex meshes and motions. Instead, we construct our training data from primitive geometries, such as a volumetric mesh of a sphere. Our network trained on this dataset can generate detailed and visually plausible secondary motions on much more complex 3D characters during testing. By assigning randomized motions to the primitives during training, we are able to let the local patches cover a broad motion space, which improves the network’s online predictions in unseen scenarios.

We evaluate our method on various character meshes and complex motion sequences. We demonstrate visually plausible and stable secondary motion while being over 30 times faster than the implicit Euler method commonly used in physically-based simulation. We also provide comparisons to faster methods such as the explicit central differences method and other learning-based approaches that utilize graph convolutional networks. Our method outperforms those approaches both in terms of accuracy and robustness.

2 Related Work

2.1 Physically based simulation methods

Complementing skinning-based animations with secondary motion is a well-studied problem. Traditional approaches resort to physically-based simulation [Zhang:CompDynamics:2020, Wang:2020:ACS]. However, physically-based methods are well known to be computationally expensive. Therefore, in the last decade, a series of methods were proposed to accelerate the computation, including example-based dynamic skinning [shi2008example], efficient elasticity calculation [mcadams2011efficient], formulation of the motion equations in the rig subspace [hahn2012rig, hahn2013efficient], and the coupling of skeleton dynamics and soft-body dynamics [liu2013simulation]. These approaches still have limitations, such as robustness issues due to explicit integration or unnatural deformation artifacts due to remeshing, whereas our method is much more robust in handling various characters and complex motions.

2.2 Learning based methods

Grzeszczuk et al. [grzeszczuk1998neuroanimator] presented one of the earliest works that demonstrated the possibility of replacing numerical computations with a neural network. Since then research in this area has advanced, especially in the last few years. While some approaches have presented hybrid solutions where a neural network replaces a particular component of the physically based simulation process, others have presented end-to-end solutions.

In the context of hybrid approaches, plug-in deep neural networks have been applied in combination with the Finite Element Method (FEM) to help accelerate the simulation. For example, the node-wise NNWarp [luo2018nnwarp] was proposed to efficiently map linear nodal displacements to nonlinear ones. Fulton et al. [fulton2019latent] utilized an autoencoder to project the target mesh to a lower-dimensional space to increase the computation speed. Similarly, Tan et al. [tan2020realtime] designed a CNN-based network for dimension reduction to accelerate thin-shell deformable simulations. Romero et al. [ROCP20] built a data-driven statistical model to kinematically drive FEM mechanical simulations. Meister et al. [meister2020deep] explored the use of neural networks to accelerate the time integration step of the Total Lagrangian Explicit Dynamics (TLED) algorithm for complex soft-tissue deformation simulation. Finally, Deng et al. [deng2020alternating] modeled the force propagation mechanism in their neural networks. These approaches improve efficiency, but at the cost of accuracy, and they are not friendly to end users unfamiliar with the underlying physical techniques. Ours, instead, allows the user to adjust the animation by simply painting constraints and stiffness properties.

End-to-end approaches assume the target mesh is provided as input and directly predict the dynamic behaviour. For instance, Bailey et al. [bailey2018fast] enriched real-time skinning animation by adding the nonlinear deformations learned from film-quality character rigs. Holden et al. [holden2019subspace] first trained an autoencoder to reduce the simulation space and then learned to efficiently approximate the dynamics projected into the subspace. Similarly, SoftSMPL [santesteban2020softsmpl] modeled realistic soft-tissue dynamics based on a novel motion descriptor and a neural-network-based recurrent regressor that runs in a nonlinear deformation subspace extracted from an autoencoder. While all these approaches presented impressive results, their main drawback is the assumption of a fixed mesh topology, requiring different networks to be trained for different meshes. Our approach, on the other hand, operates at a local patch level and can therefore generalize to different meshes at test time.

Lately, researchers have started to utilize Graph Convolutional Networks (GCNs) for simulation tasks due to their advantage in handling topology-free graphs. A GCN encodes per-vertex positional information and aggregates latent features at each node via a propagation rule. For particle-based systems, graphs are constructed based on the local adjacency of the particles at each frame and fed into GCNs [li2018learning, ummenhofer2019lagrangian, sanchez2020learning, de2020combining]. Concurrently, Pfaff et al. [pfaff2020learning] proposed a GCN for surface-mesh-based simulation. While these GCN models treat mesh dynamics prediction as a general spatio-temporal problem, we incorporate physics into the design of our network architecture, e.g., inferring latent embeddings for inertia and internal forces, which enables us to achieve more stable and accurate results (Section 4.3).

3 Method

Given a 3D character and its primary motion sequence obtained using standard linear blend skinning techniques [skinningcourse:2014], we first construct a volumetric (tetrahedral) mesh and a set of barycentric weights to linearly embed the vertices of the character’s surface mesh into the volumetric mesh [James:2004:Squashing], as shown in Figure 1. Our network operates on the volumetric mesh and predicts the updated vertex positions with deformable dynamics (also called the secondary motion) at each frame given the primary motion, the constraints and the material properties. The updated volumetric mesh vertex positions then drive the original surface mesh via the barycentric embedding, and the surface mesh is used for rendering; such a setup is very common and standard in computer animation.
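The barycentric embedding that drives the surface mesh is standard; as a rough illustration (the function and argument names below are ours, not code from the paper), the render mesh can be updated from the deformed volumetric mesh as follows:

```python
import numpy as np

def embed_surface(tet_vertices, enclosing_tet_vertices, barycentric_weights):
    """Drive the surface (render) mesh with the deformed volumetric mesh.

    tet_vertices           : (N, 3) deformed positions of the tet-mesh vertices.
    enclosing_tet_vertices : (S, 4) indices of the 4 tet vertices enclosing each surface vertex.
    barycentric_weights    : (S, 4) fixed barycentric weights (each row sums to 1).
    Returns the (S, 3) surface vertex positions.
    """
    corners = tet_vertices[enclosing_tet_vertices]               # (S, 4, 3)
    return np.einsum('sk,skd->sd', barycentric_weights, corners)
```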

We denote the reference tetrahedral mesh by $\bar{\mathcal{M}}$ and its number of vertices by $n$. The skinned animation (primary motion) is represented as a set of time-varying positions $\bar{u}^t$. Similarly, we denote the predicted dynamic mesh by $\mathcal{M}$ and its positions by $u^t$.

Our method additionally encodes mass and stiffness properties. The stiffness is represented as Young’s modulus. By painting different material properties per vertex over the mesh, users can control the dynamic effects, namely the deformation magnitude.

In contrast to previous works [santesteban2020softsmpl, pfaff2020learning], which trained neural networks directly on the surface mesh, we choose to operate on the volumetric mesh for several reasons. First, volumetric meshes provide a more efficient coarse representation and can handle character meshes that consist of multiple disconnected components. For example, in our experiments the "Michelle" character (see Figure 1) consists of 14,267 surface vertices, whereas the corresponding volumetric mesh only has 1,105 vertices. In addition, the "Big Vegas" character mesh (see the teaser figure) has eight disconnected components, requiring the artist to build a watertight mesh first if using a method that learns directly on the surface mesh. Furthermore, volumetric meshes capture not only the surface of the character but also the interior, leading to more accurate learning of the internal forces. Finally, we use a uniformly voxelized mesh subdivided into tetrahedra as our volumetric mesh, which enables our method to generalize across character meshes with varying shapes and resolutions.

Figure 1: The tetrahedral simulation mesh and the embedded surface mesh. The local patch consists of a center vertex and its neighbors, defined as the vertices of the tetrahedra touching the center vertex.

Next, we will first explain the motion equations in physically-based simulation and then discuss our method in detail, drawing inspiration from the physical process.

3.1 Physically-based Motion Equations

In constraint-based physically-based simulation [baraff2001physically], the equations of motion are

$M \ddot{u} + D \dot{u} + f_{int}(u) = 0$, subject to $S u = C \bar{u}$,   (1)

where $M$ is the diagonal (lumped) mass matrix (as commonly employed in interactive applications), $D$ is the Rayleigh damping matrix, and $u$, $\dot{u}$ and $\ddot{u}$ represent the positions, velocities and accelerations, respectively. The quantity $f_{int}(u)$ represents the internal elastic forces. Secondary dynamics occurs because the constrained part of the mesh "drives" the free part of the mesh. Constraints are specified via the constraint matrix $C$ and the selection matrix $S$. In order to leave room for secondary dynamics for 3D characters, we typically do not constrain all the vertices of the mesh, but only a subset. For example, in the Big Vegas example (see the teaser figure), we constrain the legs, the arms and the core inside the torso and head, but do not constrain the belly and hair, so that we can generate secondary dynamics in those unconstrained regions.

One approach to timestep Equation 1 is to use an explicit integrator, such as central differences:

$u^{t+1} = 2 u^t - u^{t-1} + h^2 M^{-1} \big( -f_{int}(u^t) - D \dot{u}^t \big)$, with $\dot{u}^t = (u^t - u^{t-1}) / h$,   (2)

where the superscripts $t$ and $t+1$ denote the state of the mesh in the current and next frames, respectively, and $h$ is the timestep. While the explicit integration is fast, it suffers from stability issues. Hence, the slower but stable implicit backward Euler integrator is often preferred in physically-based simulation [Baraff:1998:LSI]:

$M \dfrac{\dot{u}^{t+1} - \dot{u}^t}{h} + D \dot{u}^{t+1} + f_{int}(u^{t+1}) = 0$, with $u^{t+1} = u^t + h \dot{u}^{t+1}$.   (3)
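For intuition, here is a deliberately simplified sketch of one explicit step in the spirit of Equation 2, using a lumped-mass vertex/edge system with linear springs as a stand-in for the FEM elastic forces; the function, the force model, and the way constrained vertices are pinned to the skinned motion are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def explicit_step(u, u_prev, u_bar_next, mass, edges, rest_len, k, damping, h, constrained):
    """One explicit central-differences-style step on a lumped-mass vertex/edge system.

    u, u_prev   : (n, 3) positions at the current and previous frame.
    u_bar_next  : (n, 3) skinned (reference) positions at the next frame.
    mass        : (n,) lumped vertex masses;  damping: mass-proportional damping coefficient.
    edges       : (m, 2) tet-mesh edge index pairs;  rest_len: (m,) rest lengths;  k: spring stiffness.
    constrained : (n,) boolean mask; constrained vertices simply follow the skinned motion.
    """
    # Linear springs along the mesh edges as a crude stand-in for the FEM internal forces.
    d = u[edges[:, 1]] - u[edges[:, 0]]
    length = np.linalg.norm(d, axis=1, keepdims=True)
    f = k * (length - rest_len[:, None]) * d / np.maximum(length, 1e-9)
    f_elastic = np.zeros_like(u)
    np.add.at(f_elastic, edges[:, 0], f)
    np.add.at(f_elastic, edges[:, 1], -f)

    v = (u - u_prev) / h                                   # finite-difference velocity
    acc = f_elastic / mass[:, None] - damping * v          # mass-proportional damping only
    u_next = 2.0 * u - u_prev + h * h * acc                # central-differences position update
    u_next[constrained] = u_bar_next[constrained]          # constraints follow the primary motion
    return u_next
```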

We propose to approximate implicit integration as

$u^{t+1} \approx \mathcal{F}_\theta\big(u^t, \dot{u}^t, \ddot{u}^t, \bar{u}^{t+1}\big)$,   (4)

where $\mathcal{F}_\theta$ is a differentiable function constructed as a neural network with learned parameters $\theta$.

3.2 Network design

As shown in Equation 1, predicting the secondary dynamics entails solving for $3n$ degrees of freedom for a mesh with $n$ vertices. Hence, directly approximating $\mathcal{F}_\theta$ in Equation 4 to predict all the degrees of freedom at once would lead to a huge and impractical network, which would furthermore not be applicable to input meshes with varying numbers of vertices and topologies. Inspired by the intuition that within a very short time moment, the motion of a vertex is mostly affected by its own inertia and the internal forces from its neighboring vertices, we design our network to operate on a local patch instead. As illustrated in Figure 2, the 1-ring local patch consists of one center vertex along with its immediate neighbors in the volumetric mesh. Even though two characters might have very different mesh topologies, as shown in Figure 1, their local patches will often be more similar, boosting the generalization ability of our network.
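As a small illustration of the patch construction (our own data layout, not the authors' code), the 1-ring neighborhoods can be gathered directly from the tetrahedral connectivity:

```python
import numpy as np

def build_one_ring(tets, num_vertices):
    """Collect the 1-ring neighbors of every vertex of a tetrahedral mesh.

    tets         : (T, 4) array of vertex indices, one row per tetrahedron.
    num_vertices : total number of mesh vertices.
    Returns a list where entry i is the sorted array of neighbors of vertex i.
    """
    neighbors = [set() for _ in range(num_vertices)]
    for tet in tets:
        for a in tet:
            for b in tet:
                if a != b:
                    neighbors[a].add(b)
    return [np.array(sorted(n)) for n in neighbors]

# A local patch is the center vertex i together with neighbors[i]; the network is
# evaluated once per patch, so it never needs to see the global mesh topology.
```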

Figure 2: The input reference mesh $\bar{\mathcal{M}}$ and the target dynamic mesh $\mathcal{M}$. We draw the meshes in 2D for convenience.

The internal forces are caused by the local stress, and the aggregation of the internal forces acts to pull the vertices towards their positions in the reference motion, reducing the elastic energy. Thus, knowledge of the per-edge deformation and of the per-vertex reference motion is needed for secondary motion prediction.

Hence, we propose to emulate this process as follows:

$z_i = f_{\Theta_1}(s_i)$,  $z_{ij} = f_{\Theta_2}(s_i, s_j)$,  $\ddot{u}_i^{t+1} \approx f_{\Theta_3}\big(z_i, \sum_{j \in N(i)} z_{ij}\big)$,   (5)

where $f_{\Theta_1}$, $f_{\Theta_2}$ and $f_{\Theta_3}$ are three different multi-layer perceptrons (MLPs) as shown in Figure 3, $s_i$ denotes the input features of vertex $i$ (its recent positions on the dynamic and reference meshes and its material properties, detailed below), $j \in N(i)$ are the neighboring vertices of the center vertex $i$ (excluding $i$), and the double index $ij$ denotes the pair formed by the central vertex $i$ and a neighbor $j$. Quantities $z_i$ and $z_{ij}$ are high-dimensional latent vectors that represent an embedding for the inertia dynamics and for the internal forces from each neighboring vertex, respectively. Perceptron $f_{\Theta_3}$ receives the concatenation of $z_i$ and the sum of the $z_{ij}$ to predict the final acceleration of vertex $i$. In practice, for simplicity, we train $f_{\Theta_3}$ to directly predict the position $u_i^{t+1}$, since we assume a fixed timestep of $h = 1/24$ s in our experiments.

We implement all the three MLPs with four hidden fully connected layers activated by the Tanh function, and one output layer. During training, we provide the ground truth positions in the dynamic mesh as input. During testing, we provide the predictions of the network as input in a recurrent manner. Next, we discuss the details of these components.

Figure 3: Our network architecture.

MLP $f_{\Theta_1}$:

This perceptron focuses on the center vertex itself, encoding the "self-inertia" information. That is, the center vertex tends to continue its current motion, driven by both the velocity and acceleration. The input to $f_{\Theta_1}$ is the position of the center vertex $i$ in the last three frames, both on the dynamic mesh, $u_i^t$, $u_i^{t-1}$, $u_i^{t-2}$, and on the skinned mesh, $\bar{u}_i^t$, $\bar{u}_i^{t-1}$, $\bar{u}_i^{t-2}$, as well as its material properties. The positions are represented in local coordinates with respect to $\bar{u}_i^t$, the current position of the center vertex in the reference motion. The positions in the last three frames implicitly encode the velocity and the acceleration. Since the net force applied on the central vertex is divided by its mass in the dynamics that Equation 4 approximates, and it is relatively hard for the network to learn multiplication or division, we also include the vertex mass explicitly in the input. The hidden layer and output sizes are 64.

MLP $f_{\Theta_2}$:

For an unconstrained center vertex $i$, perceptron $f_{\Theta_2}$ encodes the "internal forces" contributed by its neighbors. The input to this MLP is similar to that of $f_{\Theta_1}$, except that we provide information both for the center vertex and for its neighbors. For each neighboring vertex $j$, we also provide a binary flag indicating whether it is free or constrained. Each neighbor $j$ thus provides a latent vector $z_{ij}$ for the central vertex. The hidden layer and output sizes are 128.

MLP $f_{\Theta_3}$:

This module receives the concatenation of the output $z_i$ of $f_{\Theta_1}$ and the aggregation $\sum_j z_{ij}$ of the outputs of $f_{\Theta_2}$, and predicts the final displacement of the central vertex in the dynamic mesh. The input and hidden layer sizes are 192.
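To make the three-MLP design concrete, the following PyTorch sketch mirrors the layer sizes described above; the class name, the feature dimensions, and the assumption of a fixed number of neighbors per patch are ours, not the authors' released code:

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, num_hidden=4):
    """Four hidden fully connected layers with Tanh activations, plus a linear output layer."""
    layers, d = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class SecondaryMotionEmulator(nn.Module):
    """f1 embeds the center vertex (inertia), f2 embeds each (center, neighbor) pair
    (internal forces), and f3 maps the concatenated embeddings to a 3D displacement."""
    def __init__(self, center_dim, pair_dim):
        super().__init__()
        self.f1 = mlp(center_dim, 64, 64)       # hidden and output size 64
        self.f2 = mlp(pair_dim, 128, 128)       # hidden and output size 128
        self.f3 = mlp(64 + 128, 192, 3)         # input and hidden size 192

    def forward(self, center_feat, neighbor_feats):
        # center_feat    : (B, center_dim)   features of the center vertex
        # neighbor_feats : (B, K, pair_dim)  features of each (center, neighbor) pair
        z_i = self.f1(center_feat)              # (B, 64)
        z_ij = self.f2(neighbor_feats)          # (B, K, 128)
        summed = z_ij.sum(dim=1)                # aggregate the neighbor embeddings
        return self.f3(torch.cat([z_i, summed], dim=-1))
```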

We train the final network with the mean squared error loss

$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \| u_i^{t+1} - u_i^{t+1,GT} \|_2^2$,   (6)

where $u_i^{t+1,GT}$ is the ground truth. We adopted the Adam optimizer for training, with a learning rate starting from 0.0001 along with a decay factor of 0.96 at each epoch.
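A minimal training-loop sketch matching this setup (Adam at 1e-4 with a 0.96 per-epoch decay, MSE loss) is shown below; it reuses the SecondaryMotionEmulator sketch from above and substitutes random tensors for the actual sphere-simulation dataset, so all dimensions are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 1024 patches with 16 neighbors each (the real data comes from the sphere simulations).
center_feat = torch.randn(1024, 19)
neighbor_feats = torch.randn(1024, 16, 40)
target = torch.randn(1024, 3)
loader = DataLoader(TensorDataset(center_feat, neighbor_feats, target), batch_size=128, shuffle=True)

model = SecondaryMotionEmulator(center_dim=19, pair_dim=40)   # class from the previous sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)  # decay 0.96 per epoch
loss_fn = torch.nn.MSELoss()

for epoch in range(10):
    for c, nb, t in loader:
        loss = loss_fn(model(c, nb), t)        # mean squared error, as in Equation 6
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```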

3.3 Training Primitives

Because our method operates on local patches, it is not necessary to train it on complex character meshes. In fact, we found that a training dataset constructed by simulating basic primitives, such as a sphere (under various motions and material properties), is sufficient to generalize to various character meshes at test time. Specifically, we generate random motion sequences by prescribing random rigid body motion of a constrained beam-shaped core inside the spherical mesh. The motion of this rigid core excites dynamic deformations in the rest of the sphere volumetric mesh. Each motion sequence starts by applying, to the rigid core, a random acceleration and a random angular velocity about a random rotation axis. Next, we reverse the acceleration so that the primitive returns to its starting position, and let the primitive's secondary dynamics die out over a few frames. While the still motions ensure that we cover the cases where local patches are stationary (but there is still residual secondary dynamics from the primary motion), the random accelerations help to sample as diverse a set of local patch motions as possible. Doing so enhances the network's prediction stability.
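A rough sketch of this kind of randomized core trajectory is given below; the acceleration range, phase lengths and function name are illustrative assumptions (rotation about a random axis would be generated analogously and is omitted for brevity):

```python
import numpy as np

def random_core_trajectory(num_frames=456, fps=24, rng=None):
    """Random rigid-core translation: accelerate, reverse so the core returns near its start,
    then hold still so the residual secondary dynamics can settle."""
    rng = rng or np.random.default_rng()
    h = 1.0 / fps
    a = rng.uniform(-5.0, 5.0, size=3)        # random linear acceleration (illustrative range)

    T = num_frames // 5                        # four acceleration phases plus one "still" phase
    # The sign pattern (+a, -a, -a, +a) returns position and velocity to zero in the continuous limit.
    signs = [1.0, -1.0, -1.0, 1.0]

    positions, pos, vel = [], np.zeros(3), np.zeros(3)
    for f in range(num_frames):
        phase = min(f // T, 4)
        accel = signs[phase] * a if phase < 4 else np.zeros(3)
        vel = vel + h * accel
        pos = pos + h * vel
        positions.append(pos.copy())
    return np.array(positions)                 # (num_frames, 3) core positions at the animation framerate
```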

4 Experiments

In this section, we show qualitative and quantitative results of our method, as well as comparisons to other methods. We also run an ablation study to verify that explicitly providing the position information on the reference mesh as input is necessary.

4.1 Dataset and evaluation metrics

For training, we use a uniform tetrahedral mesh of a sphere. We generate random motion sequences at 24 fps, using the Vega FEM simulator [Vega, sin2013vega]. For each motion sequence, we use seven different material settings. Each motion sequence consists of 456 frames resulting in a total of 255k frames in our training set.

We evaluate our method on 3D character animations obtained from Adobe’s Mixamo dataset [mixamo]. Neither the character meshes nor the primary motion sequences are seen in our training data. We create test cases for five different character meshes as listed in Table 1 and 15 motions in total. The volumetric meshes for the test characters use the same uniform tetrahedron size as our training data. For all the experiments, we report three types of metrics:

  • Single-frame RMSE: We measure the average root-mean-square error (RMSE) between the prediction and the ground truth over all frames, while providing the ground truth positions of the previous frames as input.

  • Rollout RMSE: We provide the previous predictions of the network as input to the current frame in a recurrent manner and measure the average RMSE between the prediction and the ground truth over all frames (see the sketch after this list).

  • Elastic energy statistics: We use the concept of elastic energy in physically-based simulation to detect abnormalities in the deformation sequence, or any possible mesh explosions. For each frame, we calculate the elastic energy based on the current mesh displacements with respect to the reference state. We list the mean elastic energy, as well as its standard deviation, to show the energy distribution across the animation.
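The rollout metric referenced above can be computed as in the following sketch (the model_step signature and the use of only two history frames are simplifications of ours, not the paper's exact interface):

```python
import numpy as np

def rollout_rmse(model_step, u0, u_prev0, reference, ground_truth, horizon=None):
    """Rollout RMSE: feed the network its own predictions recurrently and compare to ground truth.

    model_step(u, u_prev, ref_next) -> next predicted positions, shape (n, 3).
    reference, ground_truth : (F, n, 3) skinned and simulated positions for F frames.
    horizon : number of frames to roll out (e.g. 24, 48, or None for the whole sequence).
    """
    F = len(ground_truth) if horizon is None else horizon
    u, u_prev, errors = u0, u_prev0, []
    for t in range(F):
        u_next = model_step(u, u_prev, reference[t])
        errors.append(np.sqrt(np.mean((u_next - ground_truth[t]) ** 2)))
        u, u_prev = u_next, u                  # recurrent: predictions become the next inputs
    return float(np.mean(errors))
```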

4.2 Analysis of Our Method

Performance:

In Table 1, we show the speed of our method, as well as that of the ground truth method and a baseline method. For each method, we record the time to calculate the dynamic mesh, but exclude other components such as initialization, rendering and mesh interpolation.

We adopted the implicit backward Euler approach (Equation 3) as the ground truth and the faster explicit central differences integration (Equation 2) as the baseline. Both the baseline and the ground truth were implemented using the deformable object simulation library Vega FEM [Vega, sin2013vega], and accelerated using multiple cores via Intel Thread Building Blocks (TBB), with 8 cores for assembling the internal forces and 16 cores for solving the linear system. The experiment platform is a 2.90 GHz Intel Xeon(R) CPU E5-2690 (32 GB RAM), which provides a highly competitive baseline/ground-truth implementation. We ran our trained model on a GeForce RTX 2080 graphics card (8 GB RAM). We also tested it on the CPU, without any multi-thread acceleration.

Moreover, we also provide performance results for the same character mesh (Big Vegas) at different voxel resolutions. To handle testing meshes of different resolutions, we rescale the volumetric mesh so that its local patches are similar in scale to the training data (i.e., the shortest edge length is 0.2).
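A minimal sketch of this rescaling, assuming the tet-mesh edge list is available:

```python
import numpy as np

def normalize_patch_scale(tet_vertices, edges, target_shortest_edge=0.2):
    """Uniformly rescale a volumetric mesh so its shortest edge matches the training scale.

    tet_vertices : (n, 3) vertex positions; edges : (m, 2) vertex index pairs.
    Returns the rescaled vertices and the scale factor (used to undo the scaling on the output).
    """
    edge_lengths = np.linalg.norm(tet_vertices[edges[:, 0]] - tet_vertices[edges[:, 1]], axis=1)
    scale = target_shortest_edge / edge_lengths.min()
    return tet_vertices * scale, scale
```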

character | # vertices (tet mesh) | ground truth (s/frame) | baseline (s/frame) | ours, GPU (s/frame) | ours, CPU (s/frame)
Big vegas | 1468 | 0.58 | 0.056 | 0.012 | 0.017
Kaya | 1417 | 0.52 | 0.052 | 0.012 | 0.015
Michelle | 1105 | 0.33 | 0.032 | 0.011 | 0.015
Mousey | 2303 | 0.83 | 0.084 | 0.014 | 0.020
Ortiz | 1258 | 0.51 | 0.049 | 0.012 | 0.015
Big vegas | 6987 | 2.45 | 0.32 | 0.032 | 0.14
Big vegas | 10735 | 4.03 | 0.53 | 0.046 | 0.24
Big vegas | 18851 | 8.26 | 1.06 | 0.068 | 0.42
Big vegas | 39684 | 24.24 | 2.96 | 0.14 | 0.89
Table 1: The running time (s/frame) of a single step (1/24 second) for the ground truth (implicit), the baseline (explicit), and our method on GPU and on CPU.

Results indicate that when run on a GPU (CPU), our method is around 30 (20) times faster than the implicit integrator and 3 (2) times faster than the explicit integrator, per frame. Under an increasing number of vertices, our method has an even more competitive performance. Although the explicit method has comparable speed to our method, its simulation explodes after a few frames. In practice, explicit methods require much smaller time steps (an additional 100 sub-steps per frame in our experiments) to achieve stable quality. We provide a more detailed report on the speed-stability relationship of explicit integration in the supplementary material.

Generalization:

We train the network on the sphere dataset and achieve a single-frame RMSE of 0.0026 on the testing split of this dataset. As listed in Table 2, when tested on characters, our method achieves a single-frame RMSE of 0.0067, showing remarkable generalization capability (we note that the shortest edge length on the volumetric character meshes is 0.2). The mean rollout error increases to 0.064 after running the whole sequences due to error accumulation, but the elastic energy statistics are still close to the ground truth. From the visualization of the ground truth and our results in Figure 6, we can see that although the predicted secondary dynamics deviate slightly from the ground truth, they are still visually plausible. We further plot the rollout prediction RMSE and elastic energy of the Big Vegas character in Figure 4. It can be seen that the prediction error remains low, and the mean elastic energy of our method stays close to the ground truth for the whole sequence, whereas the baseline method explodes quickly. We provide rollout prediction plots for all characters, as well as video results, in the supplementary material.

Methods single frame rollout-24 rollout-48 rollout-all
Ground truth
Our method 0.0067 0.059 0.062 0.064
Ours w/o ref. motion 0.050 0.20 0.38 10.09
Baseline 7.26 9.63 17.5
CFD-GCN [de2020combining] 0.040 41.17 70.55 110.07
GNS [sanchez2020learning] 0.049 0.22 0.34 0.54
MeshGraphNets [pfaff2020learning] 0.050 0.11 0.43 4.46
Table 2: The single-frame RMSE, rollout-24, rollout-48 and rollout-all of our method and others tested on all five characters with 15 different motions. The shortest edge length in the test meshes is 0.2.
Figure 4: The rollout prediction results of our method and others, tested on the Big Vegas character with 283-frame hip hop dancing motion.
Figure 5: Non-homogeneous dynamics, tested on the Michelle character with 122-frame cross-jumps motion. We only show the upper region (see Figure 1 for the full mesh).
Figure 6: The rollout prediction results of our method and others tested on the Big Vegas character with 283-frame hip hop dancing motion. The baseline cannot be rendered because it explodes.

Non-homogeneous Dynamics:

Figure 5 shows how to control the dynamics by painting non-homogeneous material properties over the mesh. Varying stiffness values are painted on the hair and the breast region of the volumetric mesh. For better visualization, we render the material settings on the surface mesh in the figure. We display three different material settings, obtained by assigning different stiffness values. A larger Young's modulus means stiffer material, hence the corresponding region exhibits less dynamics. In contrast, the regions with a smaller Young's modulus show significant dynamic effects. This result demonstrates that our method correctly models the effect of material properties while providing an interface for the artist to efficiently adjust the desired dynamic effects.

Ablation study:

To demonstrate that it is necessary to incorporate the reference mesh motion into the input features of our network, we performed an ablation study. To ensure that the constrained vertices still drive the dynamic mesh in the absence of the reference information, we update the positions of the constrained vertices based on the reference motion at the beginning of each iteration. As input to our network architecture, we use the same set of features except the positions on the reference mesh. The results of "Ours w/o ref. motion" in Table 2 and Figures 4 and 6 demonstrate that this version is inferior to our original method, especially when running the network over a long time sequence. This establishes that the reference mesh is indispensable to the quality of the network's approximation.

4.3 Comparison to Previous Work

As discussed in Section 2, several recent particle-based physics and mesh-based deformation systems utilized graph convolutional networks (GCNs). In this section, we train these network models on the same training set as our method and test on our character meshes.

CFD-GCN [de2020combining]:

We implemented our version of the CFD-GCN architecture, adopting the convolution kernel of [kipf2016semi]. However, we ignored the remeshing part because we assume that the mesh topology remains fixed when predicting secondary motion. As input, we provide the same information as our method, namely the constraint states of the vertices, the displacements and the material properties. We found that the network structure recommended in the paper resulted in a high training error. We then replaced the originally proposed ReLU activation function with the Tanh activation (as used in our method), which significantly improved the training performance. Even so, as shown in Table 2 and Figure 4, the rollout prediction explodes very quickly. We speculate that although the model aggregates the features from the neighbors to a central vertex via an adjacency matrix, it treats the center and the neighboring vertices equally, whereas in reality, their roles in physically-based simulation are distinct.

GNS [sanchez2020learning]:

The recently proposed GNS [sanchez2020learning] architecture is also a graph network designed for particle systems. The model first separately encodes node features and edge features in the graph and then generalizes the GraphNet blocks in [pmlr-v80-sanchez-gonzalez18a] to pass messages across the graph. Finally, a decoder is used to extract the prediction target from the GraphNet block output. The original paper embeds the particles in a graph by adding edges between vertices under a given radius threshold. In our implementation, we instead utilized the mesh topology to construct the graph. We used two blocks in the “processor” [sanchez2020learning] to achieve a network capacity similar to ours. In contrast to CFD-GCN [de2020combining], the GraphNet block can represent the interaction between the nodes and edges more efficiently, resulting in a significant performance improvement in rollout prediction settings. However, we still observe mesh explosions after a few frames, as shown in Figure 6 and in the supplementary video.

MeshGraphNets [pfaff2020learning]:

In work concurrent to ours, MeshGraphNets [pfaff2020learning] were presented for physically-based simulation on a mesh, with an architecture similar to GNS [sanchez2020learning]. The Lagrangian cloth system presented in their paper is the most closely related approach to our work. Therefore, we followed the input formulation of their example, except that we used the reference mesh to represent the undeformed mesh space as the edge feature. In our implementation, we kept the originally proposed encoders that embed the edge and node features, but excluded the global (world-space) feature encoder because it is not applicable to our problem setting. Similarly, we kept the edge and node update MLPs inside the graph block but removed the world-edge update. We used 15 graph blocks in the model, as suggested by their paper. The network has 10 times more parameters than ours: 2,333,187 parameters compared to our 237,571. Training lasted for 11 days, whereas our network was trained in less than a day.

We report how MeshGraphNets perform on our test character motion sequences in Table 2. The overall average rollout RMSE of MeshGraphNets is worse than GNS [sanchez2020learning]. Nevertheless, we note that out of 15 motions, this approach achieved 5 stable rollout predictions without explosions, while GNS [sanchez2020learning] failed on all of them. Our method outperforms each of the compared methods with respect to the investigated metrics.

5 Conclusion

We proposed a Deep Emulator for enhancing skinning-based animations of 3D characters with vivid secondary motion. Our method is inspired by the underlying physical simulation. Specifically, we train a neural network that operates on a local patch of a volumetric simulation mesh of the character, and predicts the updated vertex positions from the current acceleration, velocity, and positions. Being a local method, our network generalizes across 3D character meshes of arbitrary topology.

While our method demonstrates plausible secondary dynamics for various 3D characters under complex motions, there are still certain limitations we would like to address in future work. Specifically, we demonstrated that our network trained on a dataset of a volumetric mesh of a sphere can generalize to 3D characters with varying topologies. However, if the local geometric detail of a character is significantly different from that seen during training, e.g., the ears of the Mousey character contain many local neighborhoods not present in the sphere training data, the quality of our output decreases. One potential avenue for addressing this is to add additional primitive types to the training set, beyond tetrahedralized spheres. A thorough study of the types of training primitives and motion sequences required to cover the underlying problem domain is an interesting future direction.

Acknowledgements:

This research was sponsored in part by NSF (IIS-1911224), USC Annenberg Fellowship to Mianlun Zheng, Bosch Research and Adobe Research.

References

Appendix:

We sincerely request readers to refer to the link below for more visualization results: https://zhengmianlun.github.io/publications/deepEmulator.html.

A.1 Dataset Information

In this paper, we trained our network on a sphere dataset but tested it on five character meshes from the Adobe’s Mixamo dataset [mixamo]. Table A.1 provides detailed information about the five character meshes, including the vertex number and the edge length on the original surface mesh as well as the corresponding uniform volumetric mesh.

In Figure A.1, we show how we set constraints for each of the meshes, from a side view. The red vertices are constrained to move based on the skinned animation and drive the free vertices to deform with secondary motion.

A.2 Full Quantitative and Qualitative Results

In Tables A.2–A.16, we provide the quantitative results of our network tested on the five character meshes and 15 motions. The corresponding error plots are given in Figures A.2–A.16. We also provide the error plots for the compared methods. Across all the test cases, our method achieves the most stable rollout prediction with the lowest error.

In DeepEmulator.html, we provide animation sequences of our results as well as other comparison methods.

A.3 Further Analysis of Baseline Performance

As introduced in Section 4.2, we adopted the implicit backward Euler approach (Equation 3) as the ground truth and the faster explicit central differences integration (Equation 2) as the baseline. Although the baseline method is 10 times faster than the implicit integrator at the same time step (1/24 second), it explodes after a few frames. In order to achieve stable simulation results, we found that it requires at least 100 sub-steps per frame (i.e., a sub-step of 1/2400 second). In Table A.17, we provide the per-frame running time of the explicit integration with 50 and 100 sub-steps.

A.4 Choice of the Training Dataset

In Section 5, we mentioned a future direction of expanding the training dataset beyond primitive-based datasets such as spheres. Here, we analyze an alternative training dataset, namely the “Ortiz Dataset”, created by running our physically-based simulator on the volumetric mesh surrounding the Ortiz character (same mesh as in Table A.1), with motions acquired from Adobe’s Mixamo. In both datasets, we use the same number of frames. We report our results in Table A.18 to A.22.

Our experiments show that the network trained on the Sphere Dataset in most cases (75%) outperforms the Ortiz Dataset. We think there are two reasons for this. First, the local patches in the sphere are general and not specific to any geometry, making the learned neural network more general and therefore more suitable for characters other than Ortiz. Second, the motions in the Ortiz Dataset were created by human artists, and as such these motions follow certain human-selected artistic patterns. The motions in the Sphere Dataset, however, consist of random translations and rotations, which provides a denser sampling of motions in the possible motion space, and therefore improves the robustness of the network.

A.5 Analysis of the Local Patch Size

In the main paper, we show our network architecture for 1-ring local patches (Figure 3). Namely, in the main paper the MLP $f_{\Theta_2}$ learns to predict the internal forces from the 1-ring neighbors around the center vertex. Here, we present an ablation study whereby the network learns based on 2-ring and 3-ring local patches, respectively. For 2-ring local patches, we add an additional MLP that receives the inputs from the 2-ring neighbors of the center vertex. Its output latent vector is concatenated to the input of $f_{\Theta_3}$. A similar construction is adopted for the 3-ring local patch network by adding another MLP for the 3-ring internal forces.

For the training loss, the network achieves an RMSE of 0.00257, 0.00159 and 0.00146 for 1-ring, 2-ring and 3-ring local patches, respectively. In Table A.23, we provide the corresponding test results on the five characters. Overall, we did not observe obvious improvements from increasing the local patch size. This could be because 2-ring and 3-ring local patches exhibit larger structural variability than those of the sphere mesh, particularly for center vertices close to the boundary. Therefore, we adopt 1-ring local patches in our paper.

Character | Vertex Number (surface mesh) | Edge Length (surface mesh) | Disconnected Components | Vertex Number (tet mesh) | Edge Length (tet mesh)
Big vegas | 3711 | | 8 | 1468 |
Kaya | 4260 | | 4 | 1417 |
Michelle | 14267 | | 1 | 1105 |
Mousey | 6109 | | 1 | 2303 |
Ortiz | 24798 | | 1 | 1258 |
Table A.1: Detailed information on the five test characters. Each character's surface mesh was re-scaled uniformly to lie exactly within a bounding box of dimensions 5×5×5.
Figure A.1: The constraints (red vertices) set on the volumetric mesh surrounding the surface mesh.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0098 0.053 0.063 0.059
Ours w/o ref. motion 0.058 0.19 0.55 7.70
Baseline 8.23E120 1.07E121 1.57E121
CFD-GCN [de2020combining] 0.031 50.33 72.06 79.16
GNS [sanchez2020learning] 0.057 0.20 0.31 0.55
MeshGraphNets [pfaff2020learning] 0.058 0.14 0.10 2.85
Table A.2: Quantitative results: Big vegas, 283-frame hip hop dancing 1.
Figure A.2: Plot of the quantitative results: Big vegas, 283-frame hip hop dancing 1.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0093 0.066 0.074 0.073
Ours w/o ref. motion 0.061 0.31 0.87 14.19
Baseline 7.83E120 1.06E121 1.79E121
CFD-GCN [de2020combining] 0.031 67.91 110.82 591.46
GNS [sanchez2020learning] 0.060 0.32 0.50 0.68
MeshGraphNets [pfaff2020learning] 0.062 0.16 0.38 6.58
Table A.3: Quantitative results: Big vegas, 366-frame hip hop dancing 2.
Figure A.3: Plot of the quantitative results: Big vegas, 366-frame hip hop dancing 2.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0062 0.050 0.057 0.065
Ours w/o ref. motion 0.047 0.15 0.36 21.37
Baseline 7.67E120 1.07E121 2.51E121
CFD-GCN [de2020combining] 0.027 47.90 76.76 110.26
GNS [sanchez2020learning] 0.046 0.17 0.32 0.56
MeshGraphNets [pfaff2020learning] 0.048 0.084 0.081 10.12
Table A.4: Quantitative results: Big vegas, 594-frame samba dancing 1.
Figure A.4: Plot of the quantitative results: Big vegas, 594-frame samba dancing 1.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0058 0.052 0.050 0.064
Ours w/o ref. motion 0.038 0.13 0.40 13.88
Baseline 7.67E120 9.97E120 2.15E121
CFD-GCN [de2020combining] 0.027 75.12 86.49 101.63
GNS [sanchez2020learning] 0.037 0.17 0.30 0.62
MeshGraphNets [pfaff2020learning] 0.040 0.090 0.091 7.89
Table A.5: Quantitative results: Big vegas, 493-frame samba dancing 2.
Figure A.5: Plot of the quantitative results: Big vegas, 493-frame samba dancing 2.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0065 0.062 0.054 0.070
Ours w/o ref. motion 0.040 0.35 0.90 12.82
Baseline 7.76E120 1.02E121 1.96E121
CFD-GCN [de2020combining] 0.083 50.17 71.88 79.87
GNS [sanchez2020learning] 0.040 0.18 0.31 0.50
MeshGraphNets [pfaff2020learning] 0.043 0.13 0.11 5.24
Table A.6: Quantitative results: Big vegas, 399-frame samba dancing 3.
Figure A.6: Plot of the quantitative results: Big vegas, 399-frame samba dancing 3.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground Truth
Our Method 0.0067 0.054 0.058 0.041
Ours w/o ref. motion 0.042 0.097 0.15 20.02
Baseline 6.21E120 8.00E120 2.00E121
CFD-GCN [de2020combining] 0.016 0.22 72.15 69.87
GNS [sanchez2020learning] 0.041 0.15 0.28 0.47
MeshGraphNets [pfaff2020learning] 0.042 0.063 0.084 0.068
Table A.7: Quantitative results: Kaya, 650-frame dancing running man.
Figure A.7: Plot of the quantitative results: Kaya, 650-frame dancing running man.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0075 0.083 0.067 0.054
Ours w/o ref. motion 0.030 0.23 0.41 3.16
Baseline 5.66E120 7.98E120 8.99E120
CFD-GCN [de2020combining] 0.023 35.34 62.31 59.14
GNS [sanchez2020learning] 0.029 0.21 0.29 0.45
MeshGraphNets [pfaff2020learning] 0.03 0.11 0.15 2.44
Table A.8: Quantitative results: Kaya, 167-frame zombie scream.
Figure A.8: Plot of the quantitative results: Kaya, 167-frame zombie scream.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0041 0.033 0.033 0.04
Ours w/o ref. motion 0.060 0.12 0.19 5.24
Baseline 5.81E120 7.86E120 1.52E121
CFD-GCN [de2020combining] 0.017 27.36 42.25 73.72
GNS [sanchez2020learning] 0.060 0.10 0.19 0.37
MeshGraphNets [pfaff2020learning] 0.06 0.082 0.11 0.079
Table A.9: Quantitative results: Michelle, 371-frame gangnam style.
Figure A.9: Plot of the quantitative results: Michelle, 371-frame gangnam style.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0056 0.025 0.024 0.047
Ours w/o ref. motion 0.082 0.13 0.15 15.20
Baseline 6.25E120 7.47E120 2.25E121
CFD-GCN [de2020combining] 0.019 36.06 36.80 64.16
GNS [sanchez2020learning] 0.082 0.12 0.14 0.48
MeshGraphNets [pfaff2020learning] 0.082 0.065 0.077 4.31
Table A.10: Quantitative results: Michelle, 627-frame swing dancing 1.
Figure A.10: Plot of the quantitative results: Michelle, 627-frame swing dancing 1.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0056 0.049 0.037 0.055
Ours w/o ref. motion 0.086 0.14 0.16 20.14
Baseline 5.71E120 7.69E120 2.35E121
CFD-GCN [de2020combining] 0.019 25.05 42.43 64.23
GNS [sanchez2020learning] 0.085 0.13 0.19 0.43
MeshGraphNets [pfaff2020learning] 0.086 0.11 0.094 0.11
Table A.11: Quantitative results: Michelle, 699-frame swing dancing 2.
Figure A.11: Plot of the quantitative results: Michelle, 699-frame swing dancing 2.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0080 0.077 0.10 0.086
Ours w/o ref. motion 0.057 0.24 0.42 1.43
Baseline 7.78E120 1.07E121 1.24E121
CFD-GCN [de2020combining] 0.041 56.37 82.57 72.71
GNS [sanchez2020learning] 0.057 0.50 0.69 0.92
MeshGraphNets [pfaff2020learning] 0.057 0.15 1.30 3.48
Table A.12: Quantitative results: Mousey, 158-frame dancing.
Figure A.12: Plot of the quantitative results: Mousey, 158-frame dancing.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0066 0.067 0.11 0.09
Ours w/o ref. motion 0.036 0.19 0.28 2.95
Baseline 8.29E120 1.12E121 1.51E121
CFD-GCN [de2020combining] 0.043 68.29 78.87 75.08
GNS [sanchez2020learning] 0.036 0.26 0.62 0.86
MeshGraphNets [pfaff2020learning] 0.037 0.15 1.61 8.63
Table A.13: Quantitative results: Mousey, 255-frame shuffling.
Figure A.13: Plot of the quantitative results: Mousey, 255-frame shuffling.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0090 0.087 0.10 0.10
Ours w/o ref. motion 0.039 0.36 0.35 7.86
Baseline 8.48E120 1.10E121 1.81E121
CFD-GCN [de2020combining] 0.17 66.19 73.93 78.95
GNS [sanchez2020learning] 0.039 0.28 0.50 0.63
MeshGraphNets [pfaff2020learning] 0.040 0.21 2.14 14.97
Table A.14: Quantitative results: Mousey, 627-frame swing dancing.
Figure A.14: Plot of the quantitative results: Mousey, 627-frame swing dancing.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0057 0.082 0.077 0.073
Ours w/o ref. motion 0.041 0.29 0.40 1.20
Baseline 8.15E120 1.09E121 1.09E121
CFD-GCN [de2020combining] 0.030 10.71 65.43 58.90
GNS [sanchez2020learning] 0.040 0.30 0.22 0.27
MeshGraphNets [pfaff2020learning] 0.042 0.090 0.088 0.096
Table A.15: Quantitative results: Ortiz, 122-frame cross jumps rotation.
Figure A.15: Plot of the quantitative results: Ortiz, 122-frame cross jumps rotation.
Methods Single Frame Rollout-24 Rollout-48 Rollout-All
Ground truth
Our Method 0.0039 0.039 0.032 0.036
Ours w/o ref. motion 0.034 0.10 0.17 4.18
Baseline 7.40E120 9.48E120 1.62E121
CFD-GCN [de2020combining] 0.017 0.57 83.52 71.93
GNS [sanchez2020learning] 0.034 0.17 0.21 0.31
MeshGraphNets [pfaff2020learning] 0.034 0.065 0.071 0.064
Table A.16: Quantitative results: Ortiz, 326-frame jazz dancing.
Figure A.16: Plot of the quantitative results: Ortiz, 326-frame jazz dancing.
character | # vertices (tet mesh) | ground truth, 1 step/frame (s/frame) | baseline, 1 step/frame (s/frame) | baseline, 50 steps/frame (s/frame) | baseline, 100 steps/frame (s/frame) | ours, 1 step/frame (s/frame)
Big vegas | 1468 | 0.58 | 0.056 | 2.57027 | 6.20967 | 0.017
Kaya | 1417 | 0.52 | 0.052 | 2.42985 | 5.72762 | 0.015
Michelle | 1105 | 0.33 | 0.032 | 1.52916 | 3.64744 | 0.013
Mousey | 2303 | 0.83 | 0.084 | 3.90579 | 9.5897 | 0.018
Ortiz | 1258 | 0.51 | 0.049 | 2.2496 | 5.16806 | 0.015
Table A.17: The running time (s/frame) of the ground truth, the baseline (with 1, 50 and 100 sub-steps per frame), and our method.
Motion Dataset Single Frame Rollout-24 Rollout-48 Rollout-All
Hip hop dancing 1
283 frames
Sphere Dataset 0.0098 0.053 0.063 0.0591
Ortiz Dataset 0.0111 0.0512 0.067 0.0969
Hip hop dancing 2
366 frames
Sphere Dataset 0.0093 0.0664 0.0744 0.0727
Ortiz Dataset 0.0101 0.0564 0.0607 0.1241
Samba dancing 1
594 frames
Sphere Dataset 0.0062 0.0495 0.0571 0.0654
Ortiz Dataset 0.0062 0.0481 0.061 0.143
Samba dancing 2
493 frames
Sphere Dataset 0.0058 0.0521 0.0496 0.0635
Ortiz Dataset 0.0064 0.0367 0.0423 0.1331
Samba dancing 3
399 frames
Sphere Dataset 0.0065 0.0615 0.0537 0.0702
Ortiz Dataset 0.0065 0.0383 0.0616 0.1282
Table A.18: Quantitative results: the network trained on different datasets and tested on the character Big vegas.
Motion Dataset Single Frame Rollout-24 Rollout-48 Rollout-All
Dancing running man
650 frames
Sphere Dataset 0.0067 0.0544 0.0578 0.0411
Ortiz Dataset 0.0113 0.075 0.0839 0.1176
Zombie scream
167 frames
Sphere Dataset 0.0075 0.0834 0.0666 0.0537
Ortiz Dataset 0.0126 0.0749 0.0645 0.076
Table A.19: Quantitative results: the network trained on different datasets and tested on the character Kaya.
Motion Dataset Single Frame Rollout-24 Rollout-48 Rollout-All
Gangnam style
371 frames
Sphere Dataset 0.0041 0.0329 0.0332 0.0401
Ortiz Dataset 0.0059 0.0329 0.0319 0.0969
Swing dancing 1
627 frames
Sphere Dataset 0.0056 0.025 0.0236 0.0471
Ortiz Dataset 0.0079 0.0324 0.035 0.1213
Swing dancing 2
699 frames
Sphere Dataset 0.0056 0.0491 0.0373 0.0548
Ortiz Dataset 0.0067 0.058 0.04 0.1264
Table A.20: Quantitative results: the network trained on different datasets and tested on the character Michelle.
Motion Dataset Single Frame Rollout-24 Rollout-48 Rollout-All
Dancing
158 frames
Sphere Dataset 0.008 0.0771 0.1003 0.0858
Ortiz Dataset 0.0122 0.0871 0.1056 0.122
Shuffling
225 frames
Sphere Dataset 0.0066 0.0666 0.1115 0.09
Ortiz Dataset 0.0115 0.0936 0.1072 0.1303
Swing dancing
627 frames
Sphere Dataset 0.009 0.0871 0.1001 0.1042
Ortiz Dataset 0.0144 0.1236 0.1113 0.1594
Table A.21: Quantitative results: the network trained on different datasets and tested on the character Mousey.
Motion Dataset Single Frame Rollout-24 Rollout-48 Rollout-All
cross jumps rotation
122 frames
Sphere Dataset 0.0057 0.0819 0.0765 0.0726
Ortiz Dataset 0.0053 0.0719 0.0532 0.0702
jazz dancing
326 frames
Sphere Dataset 0.0039 0.0391 0.0316 0.0363
Ortiz Dataset 0.0051 0.0365 0.0338 0.0946
Table A.22: Quantitative results: the network trained on different datasets and tested on the character Ortiz.
Test Dataset Patch size Single Frame Rollout-24 Rollout-48 Rollout-All
Big vegas 1-ring 0.0075 0.057 0.060 0.066
2-ring 0.0068 0.060 0.066 0.077
3-ring 0.0075 0.043 0.0418 0.059
Kaya 1-ring 0.0071 0.069 0.062 0.047
2-ring 0.0060 0.078 0.060 0.072
3-ring 0.0064 0.099 0.095 0.092
Michelle 1-ring 0.0051 0.036 0.031 0.047
2-ring 0.0045 0.033 0.033 0.047
3-ring 0.0047 0.043 0.042 0.059
Mousey 1-ring 0.0079 0.077 0.10 0.093
2-ring 0.0069 0.11 0.19 0.14
3-ring 0.0075 0.11 0.21 0.16
Ortiz 1-ring 0.0048 0.061 0.054 0.054
2-ring 0.0042 0.069 0.089 0.070
3-ring 0.0051 0.094 0.097 0.087
Table A.23: Quantitative results: the network trained on local patches of different patch sizes.