SoftSMPL: Data-driven Modeling of Nonlinear Soft-tissue Dynamics for Parametric Humans

by Igor Santesteban, et al.

We present SoftSMPL, a learning-based method to model realistic soft-tissue dynamics as a function of body shape and motion. Datasets to learn such a task are scarce and expensive to generate, which makes training models prone to overfitting. At the core of our method are three key contributions that enable us to model highly realistic dynamics and to achieve better generalization capabilities than state-of-the-art methods, while training on the same data. First, a novel motion descriptor that disentangles the standard pose representation by removing subject-specific features; second, a neural-network-based recurrent regressor that generalizes to unseen shapes and motions; and third, a highly efficient nonlinear deformation subspace capable of representing soft-tissue deformations of arbitrary shapes. We demonstrate qualitative and quantitative improvements over existing methods and, additionally, we show the robustness of our method on a variety of motion capture databases.



1 Introduction

Soft-tissue dynamics are fundamental to produce compelling human animations. Most existing methods capable of producing highly dynamic soft-tissue deformations are physics-based. However, these methods are challenging to implement due to the inner complexity of the human body and the expensive simulation process needed to animate the model. Alternatively, data-driven models can potentially learn human soft-tissue deformations as a function of body pose directly from real-world data (e.g., 3D reconstructed sequences). However, in practice, this is a very challenging task due to the highly nonlinear nature of the dynamic deformations and the scarcity of datasets with sufficient reconstruction fidelity.

In this work we propose a novel learning-based method to animate parametric human models with highly expressive soft-tissue dynamics. SoftSMPL takes as input the shape descriptor of a body and a motion descriptor, and produces dynamic soft-tissue deformations that generalize to unseen shapes and motions. Key to our method is the realization that humans move in a highly personalized manner, i.e., motions are shape and subject dependent, and subject-dependent features are usually entangled in the pose representation.

Previous methods fail to disentangle body pose from shape and subject features; therefore, they overfit the relationship between tissue deformation and pose, and generalize poorly to unseen shapes and motions. Our method overcomes this limitation by proposing a new representation to disentangle the traditional pose space in two steps. First, we propose a solution to encode a compact and deshaped representation of body pose which eliminates the correlation between individual static poses and subject. Second, we propose a motion transfer approach, which uses person-specific models to synthesize animations for pose (and style) sequences of other persons. As a result, our model is trained with data where pose and subject-specific dynamic features are no longer entangled. We complement this contribution with a highly efficient nonlinear subspace to encode tissue deformations of arbitrary bodies, and a neural-network-based recurrent regressor as our learning-based animation model. We demonstrate qualitative and quantitative improvements over previous methods, as well as robust performance on a variety of motion capture databases.

2 Related work

The 3D modeling of human bodies has been investigated following two main trends: data-driven models, which learn deformations directly from data; and physically-based models, which compute body deformations by solving a simulation problem, usually consisting of a kinematic model coupled with a deformable layer. In this section we discuss both trends, with special emphasis on the former, to which our method belongs.

Data-driven models.

Pioneering data-driven models interpolate manually sculpted static 3D meshes to generate new samples [sloan2001shapeexample]. With the appearance of laser scanning technologies, capable of reconstructing 3D static bodies with a great level of detail, the data-driven field became popular. Hilton et al. [hilton2002animatedmodel] automatically fit a skeleton to a static scan to generate animated characters. Allen et al. proposed one of the first methods to model upper body [allen2002articulated] and full body [allen2003humanbody] deformations using a shape space learned from static scans and an articulated template. Anguelov et al. [anguelov2005scape] went one step further and modeled both shape and pose dependent deformations directly from data. Many follow-up data-driven methods have appeared [hasler2009statistical, jain2010moviereshape, hirshberg2012coregistration, chen2013tensor, yang2014semantic, feng2015reshaping, zuffi2015stitched, loper_SIGAsia2015, pishchulin2017building], but all of these are limited to modeling static deformations.

Data-driven models have also been explored to model soft-tissue deformations, which is our main goal too. Initial works used sparse marker-based systems to acquire the data. The pioneering work of Park and Hodgins [park2006capturing] reconstructs soft-tissue motion of an actor by fitting a 3D mesh to 350 tracked points. In subsequent work [park2008datadriven], they proposed a second-order dynamics model to synthesize skin deformation as a function of body motion. Similar to us, they represented both body pose and dynamic displacements in a low-dimensional space. However, their method does not generalize to different body shapes. Neumann et al. [neumann2013capture] also used sparse markers to capture shoulder and arm deformations of multiple subjects in a multi-camera studio. They were able to model muscle deformations as a function of shape, pose, and external forces, but their method is limited to the shoulder-arm area, and cannot learn temporal dynamics. Similarly, Loper et al. [loper2014mosh] did not learn dynamics either, but they were able to estimate full body pose and shape from a small set of motion capture markers. Remarkably, despite their lack of explicit dynamics, their model can reproduce soft-tissue motions by allowing body shape parameters to change over time.

More recently, 3D/4D scanning technologies and mesh registration methods [bradley2008garmentcapture, cagniart2010probabilistic, dou20153d, bogo2017dfaust, robertini2017surfacedetails, ponsmoll2017clothcap] have made it possible to reconstruct high-quality dynamic sequences of human performances. These techniques have paved the way to data-driven methods that leverage dense 3D data, usually in the form of temporally coherent 3D mesh sequences, to extract deformation models of 3D humans. Neumann et al. [neumann2013splocs] used 3D mesh sequences to learn sparse localized deformation modes, but did not model temporal dynamics. Tsoli et al. reconstructed 3D meshes of people breathing [tsoli2014breathing] in different modes, and built a statistical model of body surface deformations as a function of lung volume. In contrast, we are able to model far more complex deformations, with higher frequency dynamics, as a function of body shape and pose. Casas and Otaduy [casas_PACMCGIT2018] modeled full-body soft-tissue deformations as a function of body motion using a neural-network-based nonlinear regressor. Their model computes per-vertex 3D offsets encoded in an efficient subspace; however, it is subject-specific and does not generalize to different body shapes. Closest to our work is Dyna [ponsmoll2015dyna], a state-of-the-art method that relates soft-tissue deformations to motion and body shape from 4D scans. Dyna uses a second-order auto-regressive model to output mesh deformations encoded in a subspace. Despite its success in modeling surface dynamics, we found that its generalization capabilities to unseen shapes and poses are limited due to the inability to effectively disentangle pose from shape and subject style. Furthermore, Dyna relies on a linear PCA subspace to represent soft-tissue deformations, which struggles to reproduce highly nonlinear deformations. DMPL [loper_SIGAsia2015] proposes a soft-tissue deformation model heavily inspired by Dyna, with the main difference that it uses a vertex-based representation instead of a triangle-based one. However, DMPL suffers from the same limitations as Dyna mentioned above. We also propose a vertex-based representation, which eases the implementation in standard character rigging pipelines, while achieving superior generalization capabilities and more realistic dynamics.

Garment and clothing animation have also been addressed with data-driven models that learn surface deformations as a function of human body parameters. Guan et al. [guan2012drape], Similarly, other data-driven methods are limited to representing garment shape variations as linear scaling factors [ponsmoll2017clothcap, yang2018analyzing, laehner2018deepwrinkles] and therefore do not reproduce realistic deformations. Closer to ours is the work of Santesteban et al. [santesteban_EG2019], which effectively disentangles garment deformations due to shape and pose, making it possible to train a deformation regressor that generalizes to new subjects and motions. Alternatively, Gundogdu et al. [gundogdu2019garnet] extract geometric features of human bodies and use them to parameterize garment deformations.

Physically-based models.

The inherent limitation of data-driven models is their struggle to generate deformations far from the training examples. Physically-based models overcome this limitation by formulating the deformation process within a simulation framework. However, these approaches are not free of difficulties: defining an accurate and efficient mechanical model to represent human motions, and solving the associated simulations is hard.

Initial works used layered representations consisting of a deformable volume for the tissue layer, rigidly attached to a kinematic skeleton [capell2002dyanmicdeform, larboulette2005dynamic]. Liu et al. [liu2013softbody] coupled rigid skeletons for motion control with a pose-based plasticity model to enable two-way interaction between skeleton, skin, and environment. McAdams et al. [mcadams2011efficientelasticity] showed skin deformations with a discretization of corotational elasticity on a hexahedral lattice around the surface mesh, but did not run at real-time rates. To speed up simulations, Position-Based Dynamics (PBD) [bender2017survey] solvers have been widely used for different physics systems, also for human soft tissue [bender2013physicsskinning, komaritzan2018projective] and muscle deformation [romeo2019muscle]. Projective Dynamics, another common approach to accelerate simulations, has also been used for simulating deformable characters [li2019fastsimulation].

Subspaces for Simulation.

Subspace simulation methods attempt to find a low-dimensional representation of an initial dense set of equations in order to facilitate the computations. PCA has been widely used to this end [krysl2001dimensional, barbivc2005real, treuille2006model], while an alternative set of works aims to find new and more efficient bases [teng2014simulating, teng2015subspace]. For clothing, Hahn et al. [hahn2014subspace] built a linear subspace using temporal bases distributed across pose space. Holden et al. [holden2019subspacephysics] also built a linear subspace using PCA and machine learning to model external forces and collisions at interactive rates. Finally, Fulton et al. [Fulton:LSD:2018] built a non-linear subspace using an auto-encoder on top of an initial PCA subspace to accelerate the solver.

Figure 1: Runtime pipeline of our approach. First, the temporal motion data is encoded in our novel disentangled pose descriptor. Then, the resulting low dimensional vector is concatenated with the skeleton root offsets to form the motion descriptor. This descriptor along with the desired shape parameters are passed through the soft-tissue regressor, which predicts the nonlinear dynamic behaviour of the soft-tissue deformation in a latent space. Finally, the deformation decoder recovers the original full space of deformation offsets for each vertex of the mesh.

3 Overview

Our animation model for soft-tissue dynamics takes as input descriptors of body shape and motion, and outputs surface deformations. These deformations are represented as per-vertex 3D displacements of a human body model, described in Section 4.1, and encoded in an efficient nonlinear subspace, described in Section 4.2. At runtime, given body and motion descriptors, we predict the soft-tissue deformations using a novel recurrent regressor proposed in Section 4.3. Figure 1 depicts the architecture of our runtime pipeline, including the motion descriptor, the regressor, and a soft-tissue decoder to generate the predicted deformations.

In addition to our novel subspace and regressor, our key observation to achieve highly expressive dynamics with unprecedented generalization capabilities is an effective disentanglement of the pose space. In Section 5, we argue and demonstrate that the standard pose space (i.e., the vector of joint angles) used in previous methods is entangled with subject-specific features. This causes learning-based methods to overfit the relationship between tissue deformation and pose. In Section 5.1 we identify static features, mostly due to the particular anatomy of each person, that are entangled in the pose space, and propose a deshaped representation to effectively disentangle them. Furthermore, in Section 5.2 we identify dynamic features that manifest across a sequence of poses (also known as style), and propose a strategy to eliminate them.

4 SoftSMPL

Figure 2: Architecture of the multi-modal pose autoencoder.

4.1 Human Model

We build our soft-tissue model on top of standard human body models (e.g., [feng2015reshaping, loper_SIGAsia2015]) controlled by shape parameters (e.g., principal components of a collection of body scans in rest pose) and pose parameters (e.g., joint angles). These works assume that a deformed body mesh is obtained by applying a skinning function (e.g., linear blend skinning or dual quaternion skinning), with its associated skinning weights, to an unposed body mesh.
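
As an illustration of the skinning step described above, the following is a minimal NumPy sketch of linear blend skinning, one of the skinning functions the text mentions. The function name, argument names, and the toy two-joint example are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def linear_blend_skinning(unposed_vertices, skinning_weights, joint_transforms):
    """Pose a mesh with linear blend skinning (illustrative sketch).

    unposed_vertices: (V, 3) vertex positions of the unposed mesh.
    skinning_weights: (V, J) per-vertex joint weights, each row summing to 1.
    joint_transforms: (J, 4, 4) rigid transform of each joint for the target pose.
    """
    # Blend the joint transforms per vertex: (V, J) x (J, 4, 4) -> (V, 4, 4).
    blended = np.einsum("vj,jab->vab", skinning_weights, joint_transforms)
    # Apply each blended transform to its vertex in homogeneous coordinates.
    ones = np.ones((len(unposed_vertices), 1))
    homogeneous = np.concatenate([unposed_vertices, ones], axis=1)
    posed = np.einsum("vab,vb->va", blended, homogeneous)
    return posed[:, :3]


# Toy example: two joints, the second one translated by one unit along x.
vertices = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, 0, 3] = 1.0
posed_vertices = linear_blend_skinning(vertices, weights, transforms)
```

In this toy case the first vertex follows only the static joint and stays in place, while the second vertex is split evenly between both joints and moves half a unit along x.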

Inspired by Loper et al. [loper_SIGAsia2015], who obtain the unposed mesh by deforming a body mesh template to incorporate changes in shape and pose corrective displacements, we propose to further deform the body mesh template to incorporate soft-tissue dynamics. More specifically, we define our unposed body mesh as the body mesh template deformed by the shape displacements, the pose corrective displacements, and the output of a soft-tissue regressor, which produces the per-vertex displacements required to reproduce skin dynamics given a shape parameter and a motion descriptor. Notice that, in contrast to previous model-based works that also predict soft-tissue displacements [ponsmoll2015dyna, loper_SIGAsia2015, casas_PACMCGIT2018], our key observation is that such a regression task cannot be formulated directly as a function of pose (and shape), because subject-specific information is entangled in that pose space. See Section 5 for a detailed description of our motion descriptor and full details on our novel pose disentanglement method.

4.2 Soft-Tissue Representation and Deformation Subspace

We represent soft-tissue deformations as per-vertex 3D offsets of a body mesh in an unposed state. This representation allows us to isolate the soft-tissue deformation component from other deformations, such as pose or shape.

Given the data-driven nature of our approach, in order to train our model it is crucial that we define a strategy to extract ground truth deformations from real-world data. To this end, given a dataset of 4D scans with temporally consistent topology, we extract the soft-tissue component of each mesh (Equation 3) by unposing it with the inverse of the skinning function and removing the corrective pose blendshape and the shape deformation blendshape (see [loper_SIGAsia2015] for details on how the latter two are computed). Solving Equation 3 requires estimating the pose and shape parameters of each mesh, which are a priori unknown (i.e., the dataset contains only 3D meshes, no shape or pose parameters). Similar to [ponsmoll2015dyna], we solve an optimization problem (Equation 4) to estimate the shape and pose parameters of each scan in the dataset.
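
The extraction step described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it uses linear blend skinning as the skinning function, treats the template, shape blendshape, and pose blendshape as additive per-vertex terms, and assumes the pose and shape of the scan are already known. All function and variable names are hypothetical.

```python
import numpy as np


def blend_transforms(skinning_weights, joint_transforms):
    # Per-vertex blended joint transforms: (V, J) x (J, 4, 4) -> (V, 4, 4).
    return np.einsum("vj,jab->vab", skinning_weights, joint_transforms)


def apply_transforms(per_vertex_transforms, vertices):
    # Apply one 4x4 transform per vertex using homogeneous coordinates.
    ones = np.ones((len(vertices), 1))
    homogeneous = np.concatenate([vertices, ones], axis=1)
    return np.einsum("vab,vb->va", per_vertex_transforms, homogeneous)[:, :3]


def extract_soft_tissue(scan_vertices, skinning_weights, joint_transforms,
                        template, shape_offsets, pose_offsets):
    """Unpose a registered scan and remove the non-dynamic components."""
    blended = blend_transforms(skinning_weights, joint_transforms)
    # Inverse skinning: undo the per-vertex blended transform.
    unposed = apply_transforms(np.linalg.inv(blended), scan_vertices)
    # Assumption: what remains after removing template, shape, and pose
    # corrective terms is the soft-tissue offset.
    return unposed - template - shape_offsets - pose_offsets


# Round-trip check on synthetic data (translations only, so the blended
# transforms are guaranteed to be invertible).
rng = np.random.default_rng(0)
num_vertices, num_joints = 5, 2
template = rng.normal(size=(num_vertices, 3))
shape_offsets = 0.1 * rng.normal(size=(num_vertices, 3))
pose_offsets = 0.1 * rng.normal(size=(num_vertices, 3))
true_soft_tissue = 0.01 * rng.normal(size=(num_vertices, 3))
weights = rng.random((num_vertices, num_joints))
weights /= weights.sum(axis=1, keepdims=True)
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, :3, 3] = [0.2, 0.0, 0.1]
scan = apply_transforms(
    blend_transforms(weights, transforms),
    template + shape_offsets + pose_offsets + true_soft_tissue)
recovered = extract_soft_tissue(
    scan, weights, transforms, template, shape_offsets, pose_offsets)
```

The round trip recovers the synthetic offsets because the scan was generated with the same skinning model; on real scans the result also absorbs any fitting error in the estimated pose and shape.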

Although encoding soft-tissue deformations as per-vertex 3D offsets is highly convenient, the resulting space is too high-dimensional for an efficient learning-based framework. Previous works [loper_SIGAsia2015, ponsmoll2015dyna] use linear dimensionality reduction techniques (e.g., Principal Component Analysis) to find a subspace capable of reproducing the deformations without significant loss of detail. However, soft-tissue deformations are highly nonlinear, which hinders the reconstruction capabilities of linear methods. We mitigate this by proposing a novel autoencoder to find an efficient nonlinear subspace to encode soft-tissue deformations of parametric humans.

Following the standard autoencoder pipeline, we define the reconstructed (i.e., encoded-decoded) soft-tissue deformation as the output of a decoder network applied to the soft-tissue displacements projected into the latent space by an encoder network. We train our deformation autoencoder with a loss function that minimizes both surface and normal errors between input and output displacements. The normal error is computed over the faces of the mesh template from the per-face normals, and the weight balancing the two terms is set to 1000. Notice that, during training, we use ground truth displacements from a variety of characters, which enables us to find a subspace that generalizes well to encode soft-tissue displacements of any human shape. This is in contrast to previous works [casas_PACMCGIT2018] that need to train shape-specific autoencoders.
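
A loss combining surface and normal errors, as described above, might be sketched in NumPy as follows. The exact form of each term is an assumption of this sketch (a mean squared vertex error plus a mean of one minus the cosine between corresponding face normals, evaluated on the template plus offsets); only the weight of 1000 comes from the text. The tetrahedron used as a template is a toy stand-in.

```python
import numpy as np


def face_normals(vertices, faces):
    # Unit normal of each triangle, from the cross product of two edges.
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    normals = np.cross(b - a, c - a)
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)


def reconstruction_loss(template, faces, offsets, reconstructed_offsets,
                        normal_weight=1000.0):
    """Surface error plus weighted face-normal error (illustrative sketch)."""
    surface_error = np.mean(
        np.sum((offsets - reconstructed_offsets) ** 2, axis=1))
    normals_in = face_normals(template + offsets, faces)
    normals_out = face_normals(template + reconstructed_offsets, faces)
    # 1 - cos(angle) between corresponding face normals, averaged over faces.
    normal_error = np.mean(1.0 - np.sum(normals_in * normals_out, axis=1))
    return surface_error + normal_weight * normal_error


# Toy template: a tetrahedron with four faces.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
offsets = np.zeros_like(template)
perturbed = offsets.copy()
perturbed[3] = [0.1, 0.0, 0.0]
loss_identical = reconstruction_loss(template, faces, offsets, offsets)
loss_perturbed = reconstruction_loss(template, faces, offsets, perturbed)
```

The normal term penalizes changes in surface orientation that a pure vertex-distance term would barely register, which is one plausible reason for giving it a large weight.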

We implement the encoder and decoder using a neural network architecture. Figure 1 (right) depicts the decoder; the encoder uses an analogous architecture.

4.3 Soft-Tissue Recurrent Regressor

In this section we describe the main component of our runtime pipeline: the soft-tissue regressor, illustrated in Figure 1 (center). Given a motion descriptor (which we discuss in detail in Section 5.1) and a shape descriptor, our regressor outputs the predicted soft-tissue displacements encoded in the latent space. These encoded displacements are subsequently fed into the decoder to generate the final per-vertex 3D displacements.


To learn the naturally nonlinear dynamic behavior of soft-tissue deformations, we implement the regressor using a recurrent architecture, the Gated Recurrent Unit (GRU) [cho2014learning]. Recurrent architectures learn which information from previous frames is relevant and which is not, resulting in a good approximation of the temporal dynamics. This is in contrast to modeling temporal dependencies by explicitly adding the output of one step as the input of the next step, which is prone to instabilities, especially in nonlinear models. Furthermore, our regressor also uses a residual shortcut connection to skip the GRU layer altogether, which improves the flow of information.
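
To make the idea of a GRU layer with a residual shortcut concrete, here is a minimal forward-pass sketch in plain NumPy. It is not the authors' network: the layer sizes, the input projection, the random untrained weights, and the exact placement of the shortcut are assumptions made for illustration, and the gate equations follow one common GRU convention.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ResidualGRURegressor:
    """Input projection -> GRU -> output layer, with a shortcut around the GRU."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return 0.1 * rng.normal(size=shape)

        self.hidden_size = hidden_size
        self.w_in = init(hidden_size, input_size)
        # Stacked weights for the update gate, reset gate, and candidate state.
        self.w_gates = init(3, hidden_size, hidden_size)
        self.u_gates = init(3, hidden_size, hidden_size)
        self.b_gates = np.zeros((3, hidden_size))
        self.w_out = init(output_size, hidden_size)

    def gru_step(self, x, h):
        wz, wr, wh = self.w_gates
        uz, ur, uh = self.u_gates
        bz, br, bh = self.b_gates
        update = sigmoid(wz @ x + uz @ h + bz)
        reset = sigmoid(wr @ x + ur @ h + br)
        candidate = np.tanh(wh @ x + uh @ (reset * h) + bh)
        return (1.0 - update) * h + update * candidate

    def __call__(self, inputs):
        # inputs: (T, input_size), one row per frame.
        h = np.zeros(self.hidden_size)
        outputs = []
        for frame in inputs:
            x = np.tanh(self.w_in @ frame)
            h = self.gru_step(x, h)
            # Residual shortcut: the GRU input bypasses the recurrent layer.
            outputs.append(self.w_out @ (x + h))
        return np.stack(outputs)


# Toy usage: 6 frames of an 8-dimensional input, 4-dimensional latent output.
rng = np.random.default_rng(1)
model = ResidualGRURegressor(input_size=8, hidden_size=16, output_size=4)
sequence = rng.normal(size=(6, 8))
prediction = model(sequence)
changed = sequence.copy()
changed[-1] += 1.0
prediction_changed = model(changed)
```

Because the hidden state only carries information forward in time, changing the last frame alters only the last prediction, which is the causal behavior expected from a runtime regressor.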

We train the regressor by minimizing a loss that enforces predicted vertex positions, velocities, and accelerations to match those of the latent-space deformations.
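
A loss of this kind can be sketched with finite differences over time. The following NumPy example is an assumption-laden illustration: it weights the three terms equally, uses mean squared errors, and operates on generic sequences of shape (frames, dimensions); none of these choices is confirmed by the text.

```python
import numpy as np


def regression_loss(predicted, target):
    """Position + velocity + acceleration error between two sequences.

    predicted, target: (T, D) arrays with one row per frame.
    """
    def mean_squared_error(a, b):
        return np.mean((a - b) ** 2)

    position = mean_squared_error(predicted, target)
    # First and second finite differences along time approximate
    # velocities and accelerations (up to a constant frame-rate factor).
    velocity = mean_squared_error(
        np.diff(predicted, n=1, axis=0), np.diff(target, n=1, axis=0))
    acceleration = mean_squared_error(
        np.diff(predicted, n=2, axis=0), np.diff(target, n=2, axis=0))
    return position + velocity + acceleration


rng = np.random.default_rng(0)
target_sequence = rng.normal(size=(5, 3))
# A constant offset changes positions but not velocities or accelerations.
offset_loss = regression_loss(target_sequence + 0.5, target_sequence)
```

The velocity and acceleration terms penalize temporal jitter that a position-only loss would tolerate, since two sequences can be close frame by frame and still differ in how they move.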


5 Disentangled Motion Descriptor

To efficiently train the soft-tissue regressor, described earlier in Section 4.3, we require a pose-disentangled and discriminative motion descriptor. To this end, in this section we propose a novel motion descriptor. It encompasses the velocity and acceleration of the body root in world space, a novel pose descriptor, and the velocity and acceleration of this novel pose descriptor.
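
One way to assemble such a descriptor from per-frame data is sketched below. The ordering of the components, the use of `np.gradient` for the time derivatives, and the 60 fps frame rate are assumptions of this sketch; the 10-dimensional pose descriptor matches the latent size reported in Section 5.1.

```python
import numpy as np


def time_derivatives(sequence, frame_rate):
    # Velocity and acceleration along the time axis via finite differences.
    dt = 1.0 / frame_rate
    velocity = np.gradient(sequence, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    return velocity, acceleration


def motion_descriptor(pose_descriptors, root_positions, frame_rate=60.0):
    """Concatenate pose descriptor, its derivatives, and root derivatives.

    pose_descriptors: (T, P) disentangled pose descriptor per frame.
    root_positions:   (T, 3) world-space position of the body root per frame.
    """
    pose_velocity, pose_acceleration = time_derivatives(
        pose_descriptors, frame_rate)
    root_velocity, root_acceleration = time_derivatives(
        root_positions, frame_rate)
    return np.concatenate(
        [pose_descriptors, pose_velocity, pose_acceleration,
         root_velocity, root_acceleration], axis=1)


# Toy usage: 8 frames, 10-dimensional pose descriptor, root moving at 1 unit/s.
rng = np.random.default_rng(0)
num_frames = 8
poses = rng.normal(size=(num_frames, 10))
roots = np.outer(np.arange(num_frames) / 60.0, [1.0, 0.0, 0.0])
descriptor = motion_descriptor(poses, roots)
```

Note that only derivatives of the root position enter the descriptor, so the result does not depend on where in the world the motion takes place.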


In the rest of this section we discuss the limitation of the pose descriptors used in state-of-the-art human models, and introduce a new disentangled space to remove static subject-specific features (Section 5.1). Moreover, we also propose a strategy to remove dynamic subject-specific features (Section 5.2) from sequences of poses.

5.1 Static Pose Space Disentanglement

The regressor proposed in Section 4.3 relates body motion and body shape to soft-tissue deformations. To represent body motion, a standard parameterization used across many human models [feng2015reshaping, anguelov2005scape, loper2014mosh, loper_SIGAsia2015] is the joint angles of the kinematic skeleton. However, our key observation is that this pose representation is entangled with shape- and subject-specific information that hinders the learning of a pose-dependent regressor. Additionally, Hahn et al. [hahn2014subspace] also found that using joint angles to represent pose leads to a high-dimensional space with redundancies, which makes the learning task harder and prone to overfitting. We hypothesize that existing data-driven parametric human models are less sensitive to this entanglement and overparameterization because they learn simpler deformations with much more data. In contrast, we model soft-tissue with a limited dataset of 4D scans, which requires a well disentangled and discriminative space to avoid overfitting tissue deformation and pose. Importantly, notice that removing these features manually is not feasible, not only because of the required time, but also because these features are not always apparent to a human observer.

We therefore propose a novel and effective approach to deshape the pose coefficients, i.e., to disentangle subject-specific anatomical features into a normalized and low-dimensional pose space.


We find this space by training a multi-modal encoder-decoder architecture, shown in Figure 2. In particular, having a mesh scan and its corresponding pose and shape parameters (found by solving Equation 4), we simultaneously train two encoders and one decoder by minimizing a loss defined on the surface vertices of a skinned mesh in the given pose and mean shape (i.e., the vector of shape coefficients is zero). The intuition behind this multi-modal autoencoder is the following: the mesh encoder takes as input skinned vertices to enforce the similarity of large deformations (e.g., lifting arms, where many vertices move) in the autoencoder loss. By using a significantly small latent space, we are able to simultaneously train it with the pose encoder such that the latter learns to remove undesired local pose articulations (and keep global deformations) directly in the pose vector. In contrast, notice that without the loss term that uses the skinned vertices we would not be able to distinguish between large and small deformations, because in the joint-angle pose parameterization all parameters (i.e., degrees of freedom) contribute equally.
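
The structure of such a two-encoder, one-decoder loss can be sketched as follows. The exact loss used by the authors is not recoverable from the text, so this NumPy example simply assumes that both encoders feed a shared decoder and that each reconstruction is compared against the skinned vertices with a mean squared error. The linear maps, their sizes, and all names are placeholders; only the latent size of 10 comes from the text.

```python
import numpy as np


def multimodal_autoencoder_loss(pose, skinned_vertices,
                                pose_encoder, mesh_encoder, decoder):
    """Both encoders must reconstruct the same skinned vertices via one decoder."""
    target = skinned_vertices.ravel()
    from_pose = decoder(pose_encoder(pose))
    from_mesh = decoder(mesh_encoder(target))
    pose_term = np.mean((target - from_pose) ** 2)
    mesh_term = np.mean((target - from_mesh) ** 2)
    return pose_term + mesh_term


# Toy stand-ins: random linear encoders and decoder (untrained).
rng = np.random.default_rng(0)
pose_size, num_vertices, latent_size = 12, 20, 10
w_pose = 0.1 * rng.normal(size=(latent_size, pose_size))
w_mesh = 0.1 * rng.normal(size=(latent_size, 3 * num_vertices))
w_decoder = 0.1 * rng.normal(size=(3 * num_vertices, latent_size))
pose = rng.normal(size=pose_size)
vertices = rng.normal(size=(num_vertices, 3))
loss = multimodal_autoencoder_loss(
    pose, vertices,
    pose_encoder=lambda p: w_pose @ p,
    mesh_encoder=lambda m: w_mesh @ m,
    decoder=lambda z: w_decoder @ z)
```

Because the error is measured on vertices rather than on joint angles, a pose change that moves many vertices costs more than one that moves few, which is the property the paragraph above relies on.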

The effect of the encoder is depicted in Figure 3, where subject- and shape-specific features are effectively removed, producing a normalized pose. In other words, we are disentangling features originally present in the pose descriptor (e.g., wrist articulation) that are related to that particular subject or shape, but we are keeping the overall pose (e.g., raising left leg).

We found 10 to be an appropriate size of the latent space for a trade-off between capturing subtle motions and removing subject-specific features.

Figure 3: Result after static pose disentanglement. Our approach effectively removes subject- and shape-dependent features, while retaining the main characteristics of the input pose. See supplementary material for a visualisation of the pose disentanglement across a sequence.

5.2 Avoiding Dynamic Pose Entanglement

The novel pose representation introduced earlier effectively disentangles static subject-specific features from the naive pose representation. However, our motion descriptor also takes temporal information (velocities and accelerations) into account. We observe that such temporal information can encode dynamic shape- and subject-specific features, causing an entanglement that can make our regressor prone to overfitting soft-tissue deformations to subject-specific pose dynamics.

We address this by extending our 4D dataset by transferring sequences (encoded using our motion descriptor) across the different subjects. In particular, given two sequences performed by two different subjects, A and B, each consisting of the meshes of that subject performing a given motion over time, we transfer the sequence of poses of subject A to subject B by training a subject-specific regressor. This process generates a new sequence with the shape identity of subject B performing a motion originally performed by subject A. By transferring all motions across all characters, we are enriching our dataset in a way that effectively avoids overfitting soft-tissue deformations to subject- and shape-specific dynamics (i.e., style).
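
The bookkeeping behind this augmentation can be sketched as a double loop over captured motions and target subjects. The data layout (dictionaries keyed by subject and motion name) and the callable subject-specific regressors are hypothetical; in the toy usage each "regressor" is just a scaling function standing in for a trained model.

```python
import numpy as np


def transfer_motions(motion_sequences, subject_regressors):
    """Synthesize soft-tissue data for every motion on every other subject.

    motion_sequences:   dict mapping (subject, motion_name) to a (T, D) array
                        of motion descriptors.
    subject_regressors: dict mapping subject to a callable that predicts
                        soft-tissue offsets from motion descriptors.
    """
    synthesized = []
    for (source_subject, motion_name), descriptors in motion_sequences.items():
        for target_subject, regressor in subject_regressors.items():
            if target_subject == source_subject:
                continue  # captured data already covers this pair
            synthesized.append({
                "subject": target_subject,
                "motion": motion_name,
                "source_subject": source_subject,
                "soft_tissue": regressor(descriptors),
            })
    return synthesized


# Toy stand-ins for two subjects and three captured motions.
motions = {
    ("A", "jump"): np.ones((4, 3)),
    ("A", "run"): np.ones((6, 3)),
    ("B", "twist"): np.ones((5, 3)),
}
regressors = {"A": lambda d: 1.0 * d, "B": lambda d: 2.0 * d}
augmented = transfer_motions(motions, regressors)
```

With S subjects, each captured motion yields S - 1 synthesized sequences, so the augmented set grows linearly with the number of subjects.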

In Section 7 we detail the number of sequences and frames that we transfer, and evaluate the impact of this strategy. Specifically, Figure 7 shows an ablation study on how the generalization capabilities of our method improve when applying the pose disentangling methods introduced in this section.

6 Datasets, Networks and Training

In this section we provide details about the datasets, network architectures, and parameters to train our models.

6.1 Soft-tissue Autoencoder and Regressor


Our soft-tissue autoencoder and soft-tissue regressor (Section 4.3) are trained using the 4D sequences provided in the Dyna dataset [ponsmoll2015dyna]. This dataset contains highly detailed deformations of registered meshes of 5 female subjects performing a total of 52 dynamic sequences captured at 60fps (42 used for training, 6 for testing). Notice that we do not use the Dyna provided meshes directly, but preprocess them to unpose the meshes. To this end, we solve Equation 4 for each mesh, and subsequently apply Equation 3 to find the ground truth displacements for all Dyna meshes.

Moreover, in addition to the motion transfer technique described in Section 5.2, we further synthetically augment the dataset by mirroring all the sequences.


We implement all networks in TensorFlow, including the encoder-decoder architecture of the soft-tissue autoencoder and the regressor. We also leverage TensorFlow and its automatic differentiation capabilities to solve Equation 4. In particular, we optimize the shape parameters using the first frame of a sequence and then optimize the pose parameters while leaving the shape constant. We use the Adam optimizer with a learning rate of 1e-4 for the autoencoder and 1e-3 for the regressor. The autoencoder is trained for 1000 epochs (around 3 hours) with a batch size of 256 and a dropout rate of 0.1. The regressor is trained for 100 epochs (around 25 minutes) with a batch size of 10 and no dropout. The details of the architecture are shown in Figure 


6.2 Pose Autoencoder


To train our pose autoencoder presented in Section 5.1 we are not restricted to the data of 4D scans because we do not need dynamics. We therefore leverage the SURREAL dataset [varol_CVPR2017], which contains a vast amount of Motion Capture (MoCap) sequences, from different actors, parameterized by the standard pose representation. Our training data consists of 76094 poses from a total of 298 sequences and 56 different subjects, including the 5 subjects of the soft-tissue dataset (excluding the sequences used for testing the soft-tissue networks).


We use the Adam optimizer with a learning rate of 1e-3 and a batch size of 256, for 20 epochs (20 min). The details of the architecture are shown in Figure 2.

7 Evaluation

In this section we provide qualitative and quantitative evaluation of both the reconstruction accuracy of our soft-tissue deformation subspace, described in Section 4.2, and the regressor proposed in Section 4.3.

7.1 Soft-tissue Autoencoder Evaluation

Quantitative Evaluation.

Figure 4 shows a quantitative evaluation of the reconstruction accuracy of the proposed nonlinear autoencoder (AE) for soft-tissue deformation, for a variety of subspace sizes. We compare it with linear approaches based on PCA used in previous works [loper_SIGAsia2015, ponsmoll2015dyna], in a test sequence (i.e., not used for training). These results demonstrate that our autoencoder consistently outperforms the reconstruction accuracy of the subspaces used in previous methods.

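Method   25D       50D       100D
PCA      3.82 mm   3.17 mm   2.38 mm
AE       3.02 mm   2.58 mm   2.09 mm

Figure 4: Soft-tissue autoencoder quantitative evaluation. Reconstruction error of PCA and of our autoencoder (AE) for subspace sizes of 25, 50, and 100 dimensions.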
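
To put the reported errors in relative terms, the short Python snippet below computes how much lower the autoencoder error is than the PCA error for each subspace size. The numbers are copied from the table; the relative reduction is a derived quantity computed here for illustration and is not reported in the source.

```python
# Reconstruction errors in millimeters, copied from the table.
pca_error_mm = {25: 3.82, 50: 3.17, 100: 2.38}
ae_error_mm = {25: 3.02, 50: 2.58, 100: 2.09}

# Relative error reduction of the autoencoder with respect to PCA.
relative_reduction = {
    size: (pca_error_mm[size] - ae_error_mm[size]) / pca_error_mm[size]
    for size in pca_error_mm
}

for size, reduction in relative_reduction.items():
    print(f"{size}D: {100 * reduction:.1f}% lower error than PCA")
# Prints:
# 25D: 20.9% lower error than PCA
# 50D: 18.6% lower error than PCA
# 100D: 12.2% lower error than PCA
```

The gap narrows as the subspace grows, which is consistent with a nonlinear subspace being most useful when the latent dimension is small.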

Qualitative Evaluation.

Figure 5 depicts a qualitative evaluation of the soft-tissue deformation autoencoder for a variety of subspace dimensions. Importantly, we also show that the reconstruction accuracy is attained across different shapes. The realism of the autoencoder is better appreciated in the supplementary video, which includes highly dynamic sequences reconstructed with our approach.

Figure 5: Reconstruction errors of our soft-tissue autoencoder and PCA, for two different body shapes. Notice that our subspace efficiently encodes soft-tissue displacements for parametric shapes, in contrast to previous works [casas_PACMCGIT2018] that required an autoencoder per subject.

7.2 Soft-tissue Regressor Evaluation

We follow an evaluation protocol similar to that of Dyna [ponsmoll2015dyna], and evaluate the following scenarios to exhaustively test our method. Additionally, we provide novel quantitative insights that demonstrate significantly better generalization capabilities of our regression approach with respect to existing methods.

Generalization to New Motions.

In Figure 6 and in the supplementary video we demonstrate the generalization capabilities of our method to unseen motions. In particular, at train time, we left out the sequence one_leg_jump of the Dyna dataset, and then use our regressor to predict soft-tissue displacements for this sequence, for the shape identity of the subject 50004. Leaving ground truth data out at train time allows us to quantitatively evaluate this scenario. To this end, we also show a visualization of the magnitude of soft-tissue displacement for both ground truth and regressed displacements, and conclude that the regressed values closely match the ground truth.

Additionally, in the supplementary video we show more test sequences of different subjects from the Dyna dataset animated with MoCap sequences from the CMU dataset [varol_CVPR2017]. Notice that for these sequences there is no ground truth soft-tissue available (i.e., actors were captured in a MoCap studio, only recording joint positions). Our animations show realistic and highly expressive soft-tissue dynamics that match the expected deformations for different body shapes.

Figure 6: Evaluation of generalization to new motions. The sequence one_leg_jump was left out at train time, and used only for testing, for subject 50004. We show ground truth meshes and vertex displacements (top), and the regressed deformations (bottom). Notice how the magnitude of the regressed displacement closely matches the ground truth.

Generalization to New Subjects.

We quantitatively evaluate the generalization capabilities to new subjects by looking at the magnitude of the predicted soft-tissue displacements for different body shapes. Intuitively, subjects with larger body mass (i.e., more fat), which map to smaller values of the shape parameter, should exhibit larger soft-tissue velocities. In contrast, thin subjects, which map to mostly positive values of the shape parameter, should exhibit much lower soft-tissue velocities due to the high rigidity of their body surface. We exhaustively evaluate this metric in Figure 7, where we show an ablation study comparing our full method, our method trained with each of the contributions alone, and Dyna. Our full model (in pink) regresses a higher dynamic range of deformations than Dyna, outputting larger deformations for small values of the shape parameter (i.e., fat subjects), and small surface velocities for larger values (i.e., thin subjects). Importantly, we show that each contribution of our model (the static and dynamic pose disentangling methods introduced in Section 5) contributes to our final results, and that together they produce the highest range of deformations.

In the supplementary video we further demonstrate our generalization capabilities. We also show an interactive demo where the user can change the shape parameters of an avatar in real time, and our method produces the corresponding soft-tissue deformation.

Figure 7: We quantitatively evaluate the generalization of our regressor to new shapes by looking at the mean vertex speed of the predicted soft-tissue offsets in the unposed state. Our model (pink) produces a higher range of dynamics, with large velocities for fat subjects (shape parameter -2.5) and small velocities for thin subjects (shape parameter 0.5). In contrast, previous works (Dyna, in dark blue) produce a much smaller range, resulting in limited generalization capabilities to new subjects. Furthermore, we also demonstrate that all components of our method contribute to achieving the best generalization capabilities.
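
The metric used in Figure 7 can be sketched as follows, assuming the mean vertex speed is computed by finite differences between consecutive frames of the predicted offsets and then averaged over vertices and frames. The finite-difference scheme, the function name, the frame rate, and the toy sequences are assumptions for illustration, not the authors' implementation.

```python
import math

def mean_vertex_speed(offsets, fps):
    """Mean speed of per-vertex offsets over consecutive frame pairs.

    offsets: list of frames, each a list of (x, y, z) soft-tissue offsets
             expressed in the unposed state.
    fps:     capture frame rate (assumed constant).
    """
    total, count = 0.0, 0
    for prev, curr in zip(offsets, offsets[1:]):
        for a, b in zip(prev, curr):
            total += math.dist(a, b) * fps  # distance per frame -> per second
            count += 1
    return total / count

# Hypothetical one-vertex sequences standing in for two body shapes.
soft_shape = [[(0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
stiff_shape = [[(0.0, 0.0, 0.0)], [(0.125, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]

print(mean_vertex_speed(soft_shape, fps=2))   # 1.0
print(mean_vertex_speed(stiff_shape, fps=2))  # 0.25
```

Evaluating this quantity for the same motion while sweeping the shape parameter gives one curve per method; a wider spread between the two ends of the sweep corresponds to the higher range of dynamics reported for the full model.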

Generalization to New Motion and New Subject.

Finally, we demonstrate the capabilities of our model to regress soft-tissue deformations for new body shapes and motions. To this end, we use MoCap data from the SURREAL and AMASS datasets [varol_CVPR2017, mahmood2019amass] and arbitrary body shape parameters. Figure 8 shows sample frames of sequences 01_01 and 09_10 for two different shapes. Colormaps on the 3D meshes depict the per-vertex magnitude of the offsets regressed to reproduce soft-tissue dynamics. As expected, frames with more dynamics exhibit larger deformations. Please see the supplementary video for more details.

Figure 8: Sample frames of soft-tissue regression on two test sequences and two test subjects. The colormap depicts the magnitude of the regressed deformation. Notice how our method successfully regresses larger deformations on highly dynamic poses, such as in the middle of a jump or when a foot steps on the ground. See the supplementary video for the full animation and more examples.

7.3 Runtime performance

We have implemented our method on a regular desktop PC equipped with an AMD Ryzen 7 2700 CPU, an NVIDIA GTX 1080 GPU, and 32GB of RAM. On average, a forward pass of the model takes . This cost is distributed across the components of the model as follows: the pose encoder, the soft-tissue regressor and the soft-tissue decoder.
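
A per-component cost breakdown of this kind can be measured with a small timing harness such as the sketch below. The three stage functions are trivial placeholders named after the components listed above, not the actual networks, and the helper `time_stages` is hypothetical. When the stages run on a GPU, the device would typically need to be synchronized before each clock reading, since kernel launches are asynchronous.

```python
import time

def time_stages(stages, x, repeats=100):
    """Run the pipeline `repeats` times and return mean seconds per stage."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeats):
        value = x
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)  # output of one stage feeds the next
            totals[name] += time.perf_counter() - start
    return {name: t / repeats for name, t in totals.items()}

# Placeholder stages (assumptions): stand-ins for the real model components.
stages = [
    ("pose_encoder", lambda v: [a * 0.5 for a in v]),
    ("soft_tissue_regressor", lambda v: [a + 1.0 for a in v]),
    ("soft_tissue_decoder", lambda v: [a * 2.0 for a in v]),
]

timings = time_stages(stages, [0.0] * 1000)
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.4f} ms ({100 * seconds / total:.1f}% of total)")
```

Averaging over many repetitions reduces the influence of one-off costs such as cache warm-up on the reported per-stage times.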

8 Conclusions

We have presented SoftSMPL, a data-driven method to model soft-tissue deformations of human bodies. Our method combines a novel motion descriptor and a recurrent regressor to generate per-vertex 3D displacements that reproduce highly expressive soft-tissue deformations. We have demonstrated that the generalization capabilities of our regressor to new shapes and motions significantly outperform existing methods. Key to our approach is the realization that traditional body pose representations rely on an entangled space that contains static and dynamic subject-specific features. By proposing a new disentangled motion descriptor, and a novel subspace and regressor, we are able to model soft-tissue deformations as a function of body shape and pose with unprecedented detail.

Despite the significant step forward towards modeling soft-tissue dynamics from data, our method suffers from the following limitations. With the current 4D datasets available, which contain very few subjects and motions, it is not feasible to learn a model for a high-dimensional shape space. Furthermore, subtle motions that introduce large deformations are also very difficult to reproduce. Finally, as in most data-driven methods, our model cannot interact with external objects