High-Quality Face Capture Using Anatomical Muscles

12/06/2018 ∙ by Michael Bao, et al. ∙ Industrial Light & Magic ∙ Stanford University

Muscle-based systems have the potential to provide both anatomical accuracy and semantic interpretability as compared to blendshape models; however, a lack of expressivity and differentiability has limited their impact. Thus, we propose modifying a recently developed, rather expressive muscle-based system in order to make it fully differentiable; in fact, our proposed modifications allow this physically robust and anatomically accurate muscle model to conveniently be driven by an underlying blendshape basis. Our formulation is intuitive, natural, and monolithically and fully coupled such that one can differentiate the model from end to end, which makes it viable for both optimization and learning-based approaches in a variety of applications. We illustrate this with a number of examples, including both shape matching of three-dimensional geometry as well as the automatic determination of a three-dimensional facial pose from a single two-dimensional RGB image without using markers or depth information.



1 Introduction

Muscle simulation-based animation systems are attractive due to their ability to preserve important physical properties such as volume conservation as well as their ability to handle contact and collision. Moreover, utilizing an anatomically motivated set of controls provides a straightforward way of extracting semantic meaning from the control values. Unfortunately, even though [sifakis2005automatic] was able to automatically compute muscle activation values given sparse motion capture data, muscle-based animation models have proven to be significantly less expressive and harder to control than their blendshape-based counterparts [lewis2014practice].

Recently, [cong2016art] introduced a novel method that significantly improved upon the expressiveness of muscle-based animation systems. They introduced the concept of “muscle tracks” to control the deformation of the underlying musculature. This concept gives the muscle simulation enough expressiveness to target arbitrary shapes, which allowed it to be used in high-quality movie productions such as Kong: Skull Island, where it was used both to aid in the creation of blendshapes and to offer physically-based corrections to artist-created animation sequences [cong2017muscle, lan2017lessons]. While [cong2016art] alleviates the problems of muscle-based simulation in regards to expressiveness and control, the method is geared towards generative computer graphics problems and is thus not amenable to estimating a facial pose from a two-dimensional image as is common for markerless performance capture. One could iterate between solving for a performance using blendshapes and then using a muscle-based solution to correct the blendshapes; however, this iterative method is lossy, as the muscle simulation does not have access to the raw data and may thus hallucinate or erase details of the performance.

In this paper, we extend [cong2016art] by combining the ease of use and differentiability of traditional blendshape models with expressive, physically-plausible muscle track simulations in order to create a differentiable simulation framework that can be used interchangeably with traditional blendshape models for facial performance capture and animation. Instead of relying on a non-differentiable per-frame volumetric morph to drive the muscle track deformation as in [cong2016art], we instead create a state-of-the-art blendshape model for each muscle, which is then used to drive its volumetric deformation. Our model maintains the expressiveness of [cong2016art] while preserving crucial physical properties. Furthermore, our new formulation is differentiable from end to end, which allows it to be used to target three-dimensional facial poses as well as two-dimensional RGB images. We demonstrate that our blendshape muscle tracks method shows significant improvements in anatomical plausibility and semantic interpretability when compared to state-of-the-art blendshape-based methods for targeting three-dimensional geometry and two-dimensional RGB images.

2 Related Work

Face Models: Although our work does not directly address the modeling part of the pipeline, it relies on having a pre-existing model of the face. For building a realistic digital double of an actor, multi-view stereo techniques can be used to collect high-quality geometry and texture information in a variety of poses [beeler2010high, beeler2011high, debevec2012light]. Artists can then use this data to create the final blendshape model. In state-of-the-art models, the deformation model will include non-linear skinning/enveloping in addition to linear blendshapes to achieve more plausible deformations [lewis2014practice]. On the other hand, more generalized digital face models are useful in cases where the target actor is not known beforehand. In that setting, one generally uses a 3D morphable model (3DMM), which can be created using statistical methods from a large database of scanned faces. Such models include the classic Blanz and Vetter model [blanz1999morphable], the Basel Face Model (BFM) [luthi2017gaussian, paysan20093d], FaceWarehouse [cao2014facewarehouse], and the Large Scale Facial Model (LSFM) [booth20163d]. Recent models such as the FLAME model [li2017learning] have begun to introduce non-linear deformations by using skinning and corrective blendshapes. These models tend to be geared towards real-time applications and as a result have a low number of vertices.

Face Capture: A more comprehensive review of facial performance capture techniques can be found in [zollhofer2018state]. To date, marker-based techniques have been the most popular for capturing facial performances for both real-time applications and feature films. Helmet-mounted cameras (HMCs) are often used to stereo-track a sparse set of markers on the face. These markers are then used as constraints in an optimization to find blendshape weights [bhat2013high]. In many real-time applications, pre-applied markers are generally not an option, so 2D features [cao20133d, chen2013accurate, wu2016anatomically], depth images [chen2013accurate, kazemi2014real, weise2011realtime], or low-resolution RGB images [thies2016face2face] are often used instead. Other methods have focused on using traditional computer vision techniques to track a facial performance with consistent topology [beeler2010high, beeler2011high, fyffe2017multi]. More recently, neural networks have been used to reconstruct face geometry [jackson2017large, sela2017unrestricted] and estimate facial control parameters [jourabloo2016large, kim2018inversefacenet]. Analysis-by-synthesis techniques have also been explored for capturing facial performances [pighin1999resynthesizing].

Face Simulation: [sifakis2005automatic] was one of the first to utilize quasistatic simulations to drive the deformation of a 3D face, especially for motion capture. There has also been interest in using quasistatic simulations to drive muscle deformations in the body [irving2004invertible, teran2003finite, teran2005creating]. However, in general, facial muscle simulations tend to be less expressive than their artist-driven blendshape counterparts. More recently, significant work has been done to make muscle simulations more expressive [cong2016art, ichim2017phace]. While these methods can be used to target data in the form of geometry, it is unclear how to cleanly transfer these methods to target non-geometry data such as two-dimensional RGB images. Other work has been done to try to introduce physical simulations into the blendshape models themselves [barrielle2018realtime, barrielle2016blendforces, ichim2016building, kozlov2017enriching]; however, these works do not focus on the inverse problem.

3 Blendshape Model

As discussed in Section 2, there are many different types of blendshape models, and we refer interested readers to [lewis2014practice] for a more thorough overview of the existing literature. We focus on the state-of-the-art hybrid blendshape deformation model that is the basis of our method introduced in Section 6. A hybrid blendshape model refers to a deformation model that uses both linear blendshapes and linear blend skinning to deform the vertices of the mesh. Our model contains a single 6-DOF joint for the jaw. We can succinctly write the model given the blendshape parameters $\mathbf{w}$ and joint parameters $\boldsymbol{\theta}$ as

$\mathbf{x}_s(\mathbf{w}, \boldsymbol{\theta}) = T(\boldsymbol{\theta})\left(\mathbf{x}_0 + B\mathbf{w}\right)$   (1)

where $\mathbf{x}_0$ is the neutral shape, $B$ is the blendshape deltas matrix, and $T(\boldsymbol{\theta})$ contains the linear blend skinning matrix, a transformation matrix due to a change in the jaw joint, for each vertex. Note that $\mathbf{x}_0 + B\mathbf{w}$ is often referred to as the pre-skinning shape and $B\mathbf{w}$ as the pre-skinning displacements. More complex animation systems include corrective shapes and intermediate controls, and thus we let $\mathbf{p}$ denote a broader set of animator controls which we treat as our independent variable, rewriting Equation 1 as

$\mathbf{x}_s(\mathbf{p}) = T(\boldsymbol{\theta}(\mathbf{p}))\left(\mathbf{x}_0 + B\mathbf{w}(\mathbf{p})\right)$   (2)

where $\mathbf{w}(\mathbf{p})$ and $\boldsymbol{\theta}(\mathbf{p})$ may include non-linearities such as non-linear corrective blendshapes.
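To make the notation concrete, the following is a minimal sketch of evaluating the hybrid blendshape model of Equation 1, assuming flattened vertex vectors and per-vertex 3x4 skinning transforms; the function and argument names are illustrative, not the paper's code:

```python
import numpy as np

def evaluate_blendshape(x0, B, w, T):
    """Hybrid blendshape model (Equation 1): skinning applied to the
    pre-skinning shape x0 + B @ w.

    x0 : (3n,) neutral vertex positions (flattened)
    B  : (3n, k) blendshape deltas matrix
    w  : (k,) blendshape weights
    T  : (n, 3, 4) per-vertex linear blend skinning transforms
         (rotation/translation induced by the 6-DOF jaw joint)
    """
    pre_skinning = (x0 + B @ w).reshape(-1, 3)  # pre-skinning shape
    homogeneous = np.concatenate(
        [pre_skinning, np.ones((pre_skinning.shape[0], 1))], axis=1)
    # Apply each vertex's own skinning transform to its homogeneous position.
    skinned = np.einsum('nij,nj->ni', T, homogeneous)
    return skinned.reshape(-1)
```

With all skinning transforms set to the identity, this reduces to the plain linear blendshape model.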

4 Muscle Model

We create an anatomical model of the face consisting of the cranium, jaw, and a tetrahedralized flesh mesh with embedded muscles for a given actor using the method of [cong2015fully]. Since we desire parity with the facial model used to deform the face surface, we define the jaw joint as a 6-DOF joint equivalent to the one used to skin the face surface in Section 3. Traditionally, face simulation models have been controlled using a vector of muscle activation parameters which we denote as $\mathbf{a}$. We use the same constitutive model for the muscle as [teran2003finite, teran2005creating], which consists of an isotropic Mooney-Rivlin term, a quasi-incompressibility term, and an anisotropic passive/active muscle response term. The finite-volume method [teran2003finite, teran2005robust] is used to compute the force on each vertex of the tetrahedralized flesh mesh given the current first Piola-Kirchhoff stress computed using the constitutive model and the current deformation gradient. Some vertices of the flesh mesh are constrained to kinematically follow along with the cranium/jaw, and the steady state is implicitly defined as the positions $\mathbf{x}$ of the unconstrained flesh mesh vertices which make the sum of all relevant forces identically zero, $\mathbf{f}(\mathbf{x}, \mathbf{a}) = \mathbf{0}$.

One can decompose the forces into a sum of the finite-volume forces and collision penalty forces

$\mathbf{f}(\mathbf{x}, \mathbf{a}) = \mathbf{f}_{fv}(\mathbf{x}, \mathbf{a}) + \mathbf{f}_{c}(\mathbf{x}) = \mathbf{0}$   (3)

One can further break down the finite-volume forces into the passive force $\mathbf{f}_{p}(\mathbf{x})$ and the active force. Then, using the fact that the active muscle response is scaled linearly by the muscle activation [zajac1989muscle], we can rewrite the finite-volume force as

$\mathbf{f}_{fv}(\mathbf{x}, \mathbf{a}) = \mathbf{f}_{p}(\mathbf{x}) + \sum_m a_m \mathbf{f}_{a}^{m}(\mathbf{x})$   (4)

where $\mathbf{f}_{a}^{m}$ is the active force of muscle $m$ at unit activation. We refer interested readers to [sifakis2005automatic, teran2003finite, teran2005creating, teran2005robust] for derivations of the aforementioned forces and their associated Jacobians with respect to the flesh mesh vertices. Given a vector of muscle activations $\mathbf{a}$ and cranium/jaw parameters, Equation 3 can be solved using the Newton-Raphson method to compute the unconstrained flesh mesh vertex positions $\mathbf{x}$.
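The quasistatic solve just described can be sketched as a plain Newton-Raphson iteration on the force balance of Equation 3; here `force` and `jacobian` stand in for the assembled finite-volume plus collision forces and their position Jacobian, and the sketch omits the line search and definiteness fixes a production solver would need:

```python
import numpy as np

def quasistatic_solve(force, jacobian, x_init, tol=1e-10, max_iters=50):
    """Newton-Raphson iteration for the steady state f(x) = 0.

    force    : callable returning the total force f(x)
    jacobian : callable returning df/dx at x
    """
    x = x_init.copy()
    for _ in range(max_iters):
        f = force(x)
        if np.linalg.norm(f) < tol:
            break  # forces balance: steady state reached
        # Solve J dx = -f and take the full Newton step.
        dx = np.linalg.solve(jacobian(x), -f)
        x = x + dx
    return x
```

For a linear (spring-like) force the iteration converges in a single step; the nonlinear constitutive model requires several.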

5 Muscle Tracks

The muscle tracks simulation introduced by [cong2016art] modifies the framework described in Section 4 such that the muscle deformations are primarily controlled by a volumetric morph [ali2013anatomy, cong2015fully] rather than directly using muscle activation values. [cong2016art] first creates a correspondence between the neutral pose of the blendshape system and the outer boundary surface of the tetrahedral mesh. Then, given a blendshape target expression with surface mesh displacements, [cong2016art] creates target displacements for the outer boundary of the tetrahedral mesh. Using these as Dirichlet boundary conditions, [cong2016art] solves a Poisson equation $\nabla^2 \mathbf{u} = \mathbf{0}$ for the interior displacements $\mathbf{u}$, measured relative to the rest-state vertex positions consistent with the neutral pose. Neumann boundary conditions are used on the inner boundary of the tetrahedral mesh. Afterwards, zero-length springs are attached between the tetrahedralized flesh mesh vertices interior to each muscle and their corresponding target locations resulting from the Poisson equation. The muscle track force resulting from the zero-length springs for each muscle $m$ has the form

$\mathbf{f}_{t}^{m}(\mathbf{x}) = S_m^T K_m \left(\mathbf{t}_m - S_m \mathbf{x}\right)$   (5)

where $K_m$ is the per-muscle spring stiffness matrix, $S_m$ is the selector matrix for the flesh mesh vertices interior to the muscle, and $\mathbf{t}_m$ are the target locations resulting from the volumetric morph. Thus the expanded quasistatics equation can be written as

$\mathbf{f}_{fv}(\mathbf{x}, \mathbf{a}) + \mathbf{f}_{c}(\mathbf{x}) + \mathbf{f}_{t}(\mathbf{x}) = \mathbf{0}$   (6)

where $\mathbf{f}_{t}$ includes Equation 5 for every muscle. Since the activation values are no longer specified manually, they must be computed automatically given the final post-morph shape of a muscle in order to reintroduce the effects of muscle tension into the simulation. [cong2016art] barycentrically embeds a piecewise linear curve into each muscle and uses the length of that curve to determine an appropriate activation value.
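A sketch of the zero-length spring force of Equation 5 for a single muscle, using a dense selector matrix for clarity (a production implementation would use sparse indexing; the symbol names follow the equation rather than any actual codebase):

```python
import numpy as np

def muscle_track_force(x, targets, S, K):
    """Zero-length spring force for one muscle (Equation 5):
    f = S^T K (t - S x), pulling the flesh-mesh vertices selected by S
    toward the morph targets t.

    x       : (N,) all flesh mesh degrees of freedom
    targets : (m,) target locations for the selected vertices
    S       : (m, N) selector matrix
    K       : (m, m) per-muscle spring stiffness matrix
    """
    return S.T @ (K @ (targets - S @ x))
```

Note that the force is zero exactly when the selected vertices sit on their targets, and that it contributes $S^T K S$ to the quasistatic coefficient matrix.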

6 Blendshape-Driven Muscle Tracks

The morph from Section 5 was designed in the spirit of the computer graphics pipeline, and as such, does not allow for the sort of full end-to-end coupling that facilitates differentiability, inversion, and other typical inverse problem methodologies. Thus, our key contribution is to replace the morphing step with a blendshape deformation in the form of Equation 1 to drive the muscle volumes and their center-line curves, thereby creating a direct functional connection between the animator controls and the muscle track target locations and activation values.

For each muscle $m$, we create a tetrahedralized volume and a piecewise linear center-line curve in the neutral pose, with vertex positions $\mathbf{v}_{m,0}$ and $\mathbf{c}_{m,0}$ respectively. Furthermore, for each blendshape in the face surface model, we use the morph from [cong2016art] to create a corresponding shape for each muscle's tetrahedralized volume and center-line curve, yielding per-blendshape deltas that we assemble into matrices $B_{V_m}$ and $B_{C_m}$, where the $i$th column corresponds to the $i$th blendshape. Alternatively, one could morph and subsequently simulate as in Section 5 using tracks in order to create these shapes. In addition, we assign skinning weights to each vertex of the muscle volumes and center-line curves and assemble them into linear blend skinning transformation matrices $T_{V_m}(\boldsymbol{\theta})$ and $T_{C_m}(\boldsymbol{\theta})$. This allows us to write

$\mathbf{v}_m(\mathbf{w}, \boldsymbol{\theta}) = T_{V_m}(\boldsymbol{\theta})\left(\mathbf{v}_{m,0} + B_{V_m}\mathbf{w}\right)$   (7)
$\mathbf{c}_m(\mathbf{w}, \boldsymbol{\theta}) = T_{C_m}(\boldsymbol{\theta})\left(\mathbf{c}_{m,0} + B_{C_m}\mathbf{w}\right)$   (8)

which parallel Equation 1. Notably, we are able to obtain Equations 7 and 8 in part because we solve the Poisson equation on the pre-skinning neutral as compared to [cong2016art], which uses the post-skinning neutral. In addition, this better prevents linearized rotation artifacts from diffusing into the tetrahedralized flesh mesh. Finally, we can write the length of each center-line curve as

$l_m = \sum_i \left\| \mathbf{c}_{m,i+1} - \mathbf{c}_{m,i} \right\|_2$   (9)

where $\mathbf{c}_{m,i}$ is the $i$th vertex of the piecewise linear center-line curve for the $m$th muscle.
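Equation 9 simply sums the Euclidean lengths of the curve segments; a minimal sketch:

```python
import numpy as np

def curve_length(c):
    """Length of a piecewise linear center-line curve (Equation 9):
    the sum of the Euclidean lengths of its segments.

    c : (n, 3) curve vertex positions in order along the curve
    """
    return np.linalg.norm(np.diff(c, axis=0), axis=1).sum()
```

This length is what the activation-length curve consumes to produce a muscle activation value.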

To justify our approach, we can write the linear system for the Poisson equation as $L_u \mathbf{u} = -L_c B \mathbf{w}$, where $L_u$ is the portion of the Laplacian matrix, discretized on the tetrahedralized volume at rest using the method of [zheng2015new], acting on the unconstrained vertices. Similarly, $L_c$ is the portion for the constrained vertices post-multiplied by the linear correspondence between the neutral pose of the blendshape system and the outer boundary of the tetrahedral mesh, so that $B\mathbf{w}$ are the pre-skinning surface displacements. Equivalently, we may write

$L_u \mathbf{u} = -L_c B \sum_i w_i \mathbf{e}_i = \sum_i w_i \left(-L_c B \mathbf{e}_i\right)$   (10)

(where the $\mathbf{e}_i$ are the standard basis vectors), which is equivalent to doing solves of the form

$L_u \mathbf{u}_i = -L_c B \mathbf{e}_i$   (11)

and then subsequently summing both sides to obtain $\mathbf{u} = \sum_i w_i \mathbf{u}_i$. That is, the linearity of the Poisson equation allows us to precompute its action for each blendshape and subsequently obtain the exact result on any combination of blendshapes by simply summing the results obtained on the individual blendshapes.

In summary, for each of the blendshapes, we solve a Poisson equation (Equation 11) to precompute the per-blendshape muscle volume and center-line curve shapes, and then, given animator controls that yield blendshape weights and jaw parameters, we obtain the muscle track target locations and center-line curve lengths (and hence activations) via Equations 7 and 8. This replaces the morphing step, allowing us to proceed with the quasistatic muscle simulation using tracks driven entirely by the animator parameters.
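The per-blendshape precomputation and the linearity argument of Equations 10 and 11 can be sketched as follows, with dense matrices for clarity (the sign convention on the boundary term and all symbol names are assumptions of this sketch):

```python
import numpy as np

def precompute_morphs(L_u, L_c, B):
    """Solve Equation 11 once per blendshape: u_i = L_u^{-1} (-L_c B e_i).

    Solving with the matrix right-hand side -L_c @ B handles all standard
    basis vectors e_i at once; column i of the result is u_i.
    """
    return np.linalg.solve(L_u, -L_c @ B)

def morph_displacements(U, w):
    """By linearity (Equation 10), blending the precomputed solutions
    matches solving the Poisson equation on the blended boundary data."""
    return U @ w
```

The test below checks the linearity claim directly: blending precomputed per-blendshape solves equals one solve on the blended boundary displacements.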

7 End-to-End Differentiability

In this section, we outline the derivative of the simulated tetrahedral mesh vertex positions $\mathbf{x}$ with respect to the blendshape parameters $\mathbf{w}$ and jaw controls $\boldsymbol{\theta}$ that parameterize the simulation results as per Section 6. The derivatives of $\mathbf{w}$ and $\boldsymbol{\theta}$ with respect to the animator controls $\mathbf{p}$ depend on the specific controls and can be post-multiplied. If one cares about the resulting vertices of a rendered mesh embedded in or constrained to the tetrahedral mesh, then this embedding, typically linear, can be pre-multiplied.

Although the constrained nodes typically only depend on the joint parameters $\boldsymbol{\theta}$, one may wish, at times, to simulate only a subset of the tetrahedral flesh mesh. In such instances, the constrained nodes can appear on the unsimulated boundary, which in turn can be driven by the blendshape parameters $\mathbf{w}$; thus, we write the constrained nodes as $\mathbf{x}_c(\mathbf{w}, \boldsymbol{\theta})$ and concatenate them with the unconstrained nodes $\mathbf{x}$ to obtain $\hat{\mathbf{x}}$ for the purposes of this section. The collision forces only depend on the nodal positions, and we may write $\mathbf{f}_c(\hat{\mathbf{x}})$. The finite-volume force depends on both the nodal positions and activations, and the activations are determined from an activation-length curve where the length $l_m$ is given in Equation 9. Our precomputation makes $l_m$ only a function of $\mathbf{w}$ and $\boldsymbol{\theta}$ and notably independent of $\hat{\mathbf{x}}$, and so we may write the activations as $\mathbf{a}(\mathbf{w}, \boldsymbol{\theta})$, combining the activation-length curve with Equations 8 and 9. We stress that the activations are independent of the positions, $\partial \mathbf{a} / \partial \hat{\mathbf{x}} = \mathbf{0}$. Thus, we may write $\mathbf{f}_{fv}(\hat{\mathbf{x}}, \mathbf{a}(\mathbf{w}, \boldsymbol{\theta}))$. Similarly, we may write the muscle track forces as $\mathbf{f}_t(\hat{\mathbf{x}}, \mathbf{t}(\mathbf{w}, \boldsymbol{\theta}))$ with targets $\mathbf{t}$. Therefore, all the forces in Equation 6 are a function of $\hat{\mathbf{x}}$, $\mathbf{a}$, and $\mathbf{t}$, which are in turn a function of $\mathbf{w}$ and $\boldsymbol{\theta}$.

Using the aforementioned dependencies, we can take the total derivative of the forces in Equation 3 with respect to a single blendshape parameter $w_i$ to obtain $\frac{\partial \mathbf{f}}{\partial \hat{\mathbf{x}}} \frac{d\hat{\mathbf{x}}}{dw_i} + \frac{\partial \mathbf{f}}{\partial \mathbf{a}} \frac{d\mathbf{a}}{dw_i} = \mathbf{0}$, which is equivalent to $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \frac{d\mathbf{x}}{dw_i} + \frac{\partial \mathbf{f}}{\partial \mathbf{a}} \frac{d\mathbf{a}}{dw_i} = \mathbf{0}$ since $\mathbf{x}_c$ is independent of $w_i$. Since our activations are still independent of $\hat{\mathbf{x}}$ just as they were in [sifakis2005automatic], $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ here is identical to that discussed in [sifakis2005automatic], and thus their quasistatic solve can be used to determine $\frac{d\mathbf{x}}{dw_i}$ by solving $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \frac{d\mathbf{x}}{dw_i} = -\frac{\partial \mathbf{f}}{\partial \mathbf{a}} \frac{d\mathbf{a}}{dw_i}$. To compute the right hand side, note that $\frac{d\mathbf{c}_m}{dw_i}$ can be obtained from Equation 8. To obtain $\frac{da_m}{dw_i}$, we compute $\frac{da_m}{dw_i} = \frac{da_m}{dl_m}\frac{dl_m}{dw_i}$. The $\frac{\partial \mathbf{f}}{\partial a_m}$ are simply the active forces in Equation 4, $\frac{da_m}{dl_m}$ is the local slope of the activation-length curve, and $\frac{dl_m}{dw_i}$ is readily computed from Equation 9. The $\frac{d\mathbf{x}}{d\theta_j}$ are determined similarly.

One may take a similar approach to Equation 6, obtaining $\frac{d\mathbf{x}}{dw_i}$ by solving $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \frac{d\mathbf{x}}{dw_i} = -\frac{\partial \mathbf{f}}{\partial \mathbf{a}} \frac{d\mathbf{a}}{dw_i} - \frac{\partial \mathbf{f}_t}{\partial \mathbf{t}} \frac{d\mathbf{t}}{dw_i}$. We stress that the coefficient matrix of the quasistatic solve is now augmented by the muscle track stiffness terms (see Equation 5) and is the same quasistatic coefficient matrix as in [cong2016art]. $\frac{\partial \mathbf{f}_t}{\partial \mathbf{t}}$ and $\frac{d\mathbf{t}}{dw_i}$ are obtained from Equations 5 and 7 respectively. Again, the $\frac{d\mathbf{x}}{d\theta_j}$ are found similarly. In summary, finding $\frac{d\mathbf{x}}{dw_i}$ and $\frac{d\mathbf{x}}{d\theta_j}$ involves solving the same quasistatics problem of [sifakis2005automatic] with the slight augmentation to the coefficient matrix from [cong2016art], merely with different right hand sides. Although this requires a quasistatic solve for each $w_i$ and $\theta_j$, they are all independent and can thus be done in parallel.
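A sketch of the sensitivity solves of this section: the same coefficient matrix serves every control, with one right-hand-side column per control, so a single factorization (or one batched solve) covers all of them. All names are illustrative; the true matrices come from the quasistatic system:

```python
import numpy as np

def shape_derivatives(df_dx, df_da, da_dw, dft_dt, dt_dw):
    """Sensitivities dx/dw of the quasistatic solution: solve
    df_dx @ (dx/dw) = -(df_da @ da_dw + dft_dt @ dt_dw).

    df_dx  : (N, N) augmented quasistatic coefficient matrix
    df_da  : (N, M) active force directions (Equation 4)
    da_dw  : (M, K) activation sensitivities via the activation-length curve
    dft_dt : (N, T) track force sensitivity to targets (Equation 5)
    dt_dw  : (T, K) target sensitivities (Equation 7)

    Column i of the result is dx/dw_i; the columns are independent and
    could be solved in parallel.
    """
    rhs = -(df_da @ da_dw + dft_dt @ dt_dw)
    return np.linalg.solve(df_dx, rhs)
```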

8 Experiments

We use the Dogleg optimization algorithm [lourakis2005levenberg] as implemented by the Chumpy autodifferentiation library [loperchumpy] in order to target our face model to both three-dimensional geometry and two-dimensional RGB images, demonstrating the efficacy of our end-to-end fully differentiable formulation. Other optimization algorithms and/or applications may similarly be pursued. Our nonlinear least squares optimization problems generally have the form

$\min_{\mathbf{p}, \mathbf{r}, \mathbf{t}} \left\| F\left(R(\mathbf{r})\,\mathbf{x}_s(\mathbf{p}) + \mathbf{t}\right) - F_0 \right\|_2^2 + \lambda \left\| \mathbf{p} \right\|_2^2$   (12)

where $\mathbf{p}$ are the animator controls that deform the face, $\mathbf{x}_s$ are the positions of the vertices on the surface of the face deformed using the full blendshape-driven muscle simulation system as described in Section 6, $F$ is a function of those vertex positions, and $F_0$ is the desired output of that function. $R(\mathbf{r})$ and $\mathbf{t}$ are an additional rigid rotation and translation, respectively, where $\mathbf{r}$ represents Euler angles. We use a standard L2 norm regularization $\lambda \left\| \mathbf{p} \right\|_2^2$ on the animator controls, where $\lambda$ is set experimentally to avoid overfitting.
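A minimal Gauss-Newton sketch of this regularized nonlinear least squares structure (the paper itself uses Dogleg via Chumpy; this stand-in only illustrates the objective of Equation 12, and the function names are ours):

```python
import numpy as np

def gauss_newton(residual, jacobian, p0, reg=1e-2, iters=20):
    """Minimize ||r(p)||^2 + reg * ||p||^2 by Gauss-Newton.

    residual : callable p -> r(p), the stacked data residuals
    jacobian : callable p -> dr/dp
    reg      : L2 (Tikhonov) regularization weight on the controls
    """
    p = p0.copy()
    for _ in range(iters):
        r, J = residual(p), jacobian(p)
        # Normal equations of the regularized problem.
        H = J.T @ J + reg * np.eye(p.size)
        g = J.T @ r + reg * p
        p = p - np.linalg.solve(H, g)
    return p
```

For a linear residual this converges in one step to the ridge-regression solution, which is a convenient sanity check on the update.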

8.1 Model Creation

Figure 1: The eight viewpoints used to reconstruct the facial geometry for a particular pose.
(a) Neutral
(b) Jaw Open
Figure 2: The underlying anatomical model of the face in the neutral pose as well as the jaw open pose using linear blend skinning.

The blendshape system is created from the neutral pose as well as FACS-based expressions [ekman2002facial] using the methods of [beeler2010high, beeler2011high]. Eight black and white cameras from varying viewpoints (see Figure 1) are used to reconstruct the geometry of the actor. Artists clean up these scans and use them as inspiration for a blendshape model and to calibrate the linear blend skinning matrices for the face surface (see Figure 3). Of course, any reasonable method could be used to create the blendshape system. A subset of the full face surface model is used in the optimization.

We use the neutral pose of the blendshape system and the method of [cong2015fully] to create the tetrahedral flesh mesh, tetrahedral muscle volumes, and muscle center-line curves by morphing them from a template asset (some tetrahedra are duplicated between muscles due to overlap). The linear blend skinning weights used on the face surface are propagated to the surface of the tetrahedral mesh and used as boundary conditions in a Poisson equation solve, again as in [ali2013anatomy, cong2015fully], to obtain linear blend skinning weights throughout the volumetric tetrahedral mesh as well as for the muscles and center-line curves, thus defining the skinning transformation matrices used in Equations 7 and 8. Figure 2 shows the muscles in the neutral pose as well as the result after skinning with the jaw open, i.e. Equation 7 with all blendshape weights identically zero.

Finally, for each shape in the blendshape system, we solve a Poisson equation (Equation 11) for the vertex displacements, which are then transferred to the muscle volumes and center-line curves to obtain their per-blendshape shapes. This allows us full use of Equations 7 and 8 parameterized by the blendshape weights. Figure 4 shows some examples of the muscles evaluated using Equation 7 for a variety of expressions.

Figure 3: The geometry reconstructed by applying the multi-view stereo algorithm described in [beeler2010high, beeler2011high] to the input images shown in Figure 1.
(a) Smile
(b) Pucker
(c) Funneler
Figure 4: The anatomical model of the face performing a variety of expressions using only the blendshape deformation from Equation 7.

8.2 Targeting 3D Geometry

(a) Blendshape
(b) Simulation
(c) Target
Figure 5: We target the geometry shown in (c) using purely blendshapes, shown in (a), versus the blendshape-driven muscle simulation model, shown in (b). While neither method exactly matches the target geometry, in general, we found that the simulation results preserve key physical properties such as volume, particularly around the lips. A close-up of the lips is shown in the bottom row, where it is more apparent that the pure blendshape inversion has significant volume loss around the lips.

Oftentimes, one has captured a facial pose in the form of three-dimensional mesh; however, this data is generally noisy, and it is desirable to convert this data into a lower dimensional representation. Using a lower dimensional representation facilitates editing, extracting semantic information, and performing statistical analysis. In our case, the lower dimensional representation is the parameter space of the blendshape or simulation model.

In general, extracting a lower dimensional representation from an arbitrary mesh requires extracting a set of correspondences between the mesh and the face model. However, for simplicity, we assume that the correspondence problem has been solved beforehand and that each vertex of the incoming mesh captured by a system using the methods of [beeler2010high, beeler2011high] has a corresponding vertex on our face surface. We can thus use an optimization problem in the form of Equation 12 to solve for the model parameters, where $F$ is the identity function and $F_0$ are the vertex positions of the target geometry.

While a rigid alignment between the target geometry and the neutral mesh is produced as a byproduct of [beeler2010high, beeler2011high], we generally found it to be inaccurate. As a result, we also allow the optimization to solve for the rigid rotation $\mathbf{r}$ and translation $\mathbf{t}$. Our optimization problem for targeting three-dimensional geometry thus has the form

$\min_{\mathbf{p}, \mathbf{r}, \mathbf{t}} \left\| R(\mathbf{r})\,\mathbf{x}_s(\mathbf{p}) + \mathbf{t} - F_0 \right\|_2^2 + \lambda \left\| \mathbf{p} \right\|_2^2$   (13)

where $\lambda$ is set experimentally.

(a) Blendshape Weights
(b) Muscle Activations
Figure 6: The blendshape solve results in blendshape weights that are dense, overdialed, and hard to decipher. The largest weights are related to closing the mouth (with magnitudes several times larger than what is shown in the figure), and a blendshape related to the pucker does not appear until well down the list of most dialed-in shapes. Whereas every blendshape used has a non-zero value (shapes for the neck, etc. were not used), only a small fraction of the available muscles have non-zero activation values. The top four most activated muscles are related to the frontalis, indicating that the eyebrows are raised [standring2015gray]. The activations of the incisivus labii superioris and orbicularis oris muscles are also among the top activated muscles, properly indicating a compression of the lips [hur2018anatomical, standring2015gray]. These muscle activations succinctly describe the performance of the actor in this frame.

We demonstrate the efficacy of our method on a pose where the actor has his mouth slightly open and is making a pucker shape; traditionally, pucker shapes have been difficult for activation-driven muscle simulations to hit. We compare the results of targeting three-dimensional geometry when the face is driven using simulation via the blendshape muscle tracks as described in Section 6 versus when it is driven using the pure blendshape model described in Section 3. See Figure 5. Although neither inversion quite captures the tightness of the mouth's pucker, the muscle simulation results demonstrate how the simulation's volume preservation property significantly improves upon the blendshape results, where the top and bottom lips seem to shrink. This property is also useful in preserving the general shape of the philtrum; the blendshape model's inversion causes the part of the philtrum near the nose to incorrectly bulge significantly. Furthermore, the resulting muscle activation values are easier to draw semantic meaning from due to their sparsity and anatomical meaning, as seen in Figure 6.

Note that errors in the multi-view reconstruction of [beeler2010high, beeler2011high] will cause the vertices of the target geometry to contain noise and potentially be in physically implausible locations. Additionally, errors in finding correspondences between the target geometry and the face surface will result in an inaccurate objective function. Furthermore, there is no guarantee that our deformation model is able to hit all physically attainable poses even when the capture and correspondence are perfect. These sources of error motivate the introduction of physically-based priors into the optimization. Additional comparisons and results are shown in the supplementary material and video.

8.3 Targeting Monocular RGB Images

(a) Plate
(b) Lighting/Albedo
Figure 7: Before estimating the facial pose, we first estimate lighting and albedo on a neutral or close to neutral pose.

To further demonstrate the efficacy of our approach, we consider facial reconstruction from monocular RGB images. The images were captured with an ARRI Alexa XT Studio running at 24 frames-per-second with a 180 degree shutter angle. We refer to images captured by the camera as the “plates,” and we downsample the original plates before use. The camera was calibrated using the method of [heikkila1997four], and the resulting distortion parameters are used to undistort the plate, yielding the target image $I$.

The render function $\mathcal{R}$ renders the face geometry in its current pose with a set of camera, lighting, and material parameters. We use a simple pinhole camera with extrinsic parameters determined by the camera calibration step. The rigid transformation of the face is determined by manually tracking features on the face in the plate. The face model is lit with a single spherical harmonics light with coefficients $\boldsymbol{\gamma}$ (see [ramamoorthi2001efficient]) and is shaded with Lambertian diffuse shading. Each vertex $v$ also has an RGB albedo color $\mathbf{c}_v$ associated with it. We solve for $\boldsymbol{\gamma}$ and all $\mathbf{c}_v$ using a non-linear least squares optimization of the form

$\min_{\boldsymbol{\gamma}, \mathbf{c}} \left\| \mathcal{R}(\boldsymbol{\gamma}, \mathbf{c}) - I \right\|_2^2 + \lambda_c \sum_v \sum_{v' \in N(v)} \left\| \mathbf{c}_v - \mathbf{c}_{v'} \right\|_2^2$   (14)

where the per-vertex colors are regularized using the smoothness term $\sum_v \sum_{v' \in N(v)} \left\| \mathbf{c}_v - \mathbf{c}_{v'} \right\|_2^2$, where $N(v)$ are the neighboring vertices of vertex $v$. This lighting and albedo solve is done as a preprocess on a neutral or close to neutral pose with $\lambda_c$ set experimentally. OpenDR [loper2014opendr] is used to differentiate $\mathcal{R}$ to solve Equation 14; however, any other differentiable renderer (e.g. [li2018differentiable]) can be used instead. We then assume that $\boldsymbol{\gamma}$ and $\mathbf{c}$ stay constant throughout the performance. See Figure 7.
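A sketch of Lambertian shading under a spherical harmonics light, as used in the lighting/albedo solve, restricted to the constant and linear SH bands for brevity; the paper's actual band count is not stated here, so treat the band constants and names as illustrative:

```python
import numpy as np

def sh_shade(albedo, normals, gamma):
    """Per-vertex color = albedo * SH irradiance evaluated at the normal.

    albedo  : (n, 3) per-vertex RGB albedo
    normals : (n, 3) unit vertex normals
    gamma   : (4, 3) RGB SH lighting coefficients for bands 0 and 1
    """
    n = normals
    # Real SH basis evaluated at each normal: [Y00, Y1-1, Y10, Y11].
    Y = np.stack([np.full(len(n), 0.282095),
                  0.488603 * n[:, 1],
                  0.488603 * n[:, 2],
                  0.488603 * n[:, 0]], axis=1)
    irradiance = Y @ gamma                     # (n, 3) RGB irradiance
    return albedo * np.clip(irradiance, 0.0, None)
```

In the actual solve, the gradient of this shading with respect to both the lighting coefficients and the albedo is what the differentiable renderer provides.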

We solve for the parameters $\mathbf{p}$ in two steps. Given curves around the eyes and lips on the three-dimensional neutral face mesh, a rotoscope artist draws corresponding curves on the two-dimensional film plate. Then, we solve for an initial guess $\mathbf{p}_0$ by solving an optimization problem of the form

$\min_{\mathbf{p}} E_{roto}(\mathbf{p}) + \lambda \left\| \mathbf{p} \right\|_2^2$   (15)

where $\lambda$ is set experimentally. $E_{roto}$ measures the two-dimensional Euclidean distance between the points on the rotoscoped curves on the plate and the corresponding points on the face surface projected into the image plane. See Figure 8. We then use $\mathbf{p}_0$ to initialize a shape from shading solve

$\min_{\mathbf{p}} \left\| \Pi\left(\mathcal{R}(\mathbf{p}) - I\right) \right\|_2^2 + \lambda_{roto} E_{roto}(\mathbf{p}) + \lambda \left\| \mathbf{p} \right\|_2^2$   (16)

to determine the final parameters $\mathbf{p}$, where $\lambda_{roto}$ and $\lambda$ are set experimentally. Here, $\Pi$ is a three-level Gaussian pyramid of the per-pixel differences between the plate and the synthetic render.
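A sketch of a three-level pyramid residual over a per-pixel difference image, as used in Equation 16, with a 2x2 box filter standing in for the Gaussian kernel (an assumption to keep the sketch short):

```python
import numpy as np

def pyramid_residual(diff, levels=3):
    """Stack a per-pixel difference image and its blurred/downsampled
    versions into one residual vector, so coarse levels guide the solve
    while the finest level retains detail."""
    residuals, img = [], diff
    for _ in range(levels):
        residuals.append(img.ravel())
        # Blur-and-downsample via 2x2 averaging (box filter stand-in).
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = 0.25 * (img[0:h:2, 0:w:2] + img[1:h:2, 0:w:2]
                      + img[0:h:2, 1:w:2] + img[1:h:2, 1:w:2])
    return np.concatenate(residuals)
```

Because every level is a linear function of the input difference, the pyramid is straightforward to differentiate through.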

(a) Blendshapes
(b) Simulation
(c) Roto Curves
Figure 8: We use rotoscoped curves on the plate to solve for an initial estimate of the face pose.

Frames 1112, 1134, 1160, 1170

(a) Blendshapes
(b) Simulation
(c) Plate
Figure 9: We target the raw image data using our face model using both simulation and blendshapes on a number of frames of an actor’s performance. Both sets of results suffer from some depth ambiguity due to only using monocular two-dimensional data in the optimization.

We demonstrate the efficacy of our approach on several frames of a facial performance. As in Section 8.2, we compare the results of solving Equations 15 and 16 using a face surface driven by the simulation model versus the blendshape model. In particular, we choose four frames (1112, 1134, 1160, and 1170) with particularly challenging facial expressions and capture conditions such as motion blur. We note that a significant portion of the facial expression is captured by the rotoscoped curves, and the shape-from-shading step primarily helps to refine the expression and the contours of the face. Both steps (Equations 15 and 16) require end-to-end differentiability through our blendshape-driven method. See Figure 9. While the general expressions are similar, we note that the simulation's surface geometry tends to be more physically plausible due to the simulation's ability to preserve volume, especially around the lips; this regularization is especially prominent in some of the frames. As shown in the supplementary material, the resulting muscle activation values are also comparatively sparser, which leads to an increased ability to extract semantic meaning out of the performance. Additional comparisons and results are shown in the supplementary material and video.

9 Conclusion and Future Work

Although promising anatomically based muscle simulation systems have existed for some time and have had the ability to target data as in [sifakis2005automatic], they have lacked the high-end efficacy required to produce compelling results. Although the recently proposed [cong2016art] does produce quite compelling results, it requires a full face shape as input and is not differentiable. In this paper, we alleviated both of the aforementioned difficulties, extending [cong2016art] with end-to-end differentiability and a morphing system driven by blendshape parameters. This blendshape-driven morph removes the need for a full face surface mesh as a pre-existing target. We demonstrate the efficacy of our approach by targeting three-dimensional geometry and two-dimensional RGB images. To the best of our knowledge, we are the first to use quasistatic simulation of a muscle model to target RGB images. We note that methods such as [sifakis2005automatic] could be used in the optimizations presented in this paper (as outlined in the second to last paragraph of Section 7); however, the resulting simulation results would be less expressive and would not be able to effectively reproduce the desired expressions.

Although the computer vision community expends great effort on identifying faces in images, segmenting them cleanly from their surroundings, and even recovering their shape, semantic understanding of what such faces are doing, intending, or feeling is still in its infancy, consisting mostly of preliminary image labeling and annotation. The ability to express a facial pose or image using a muscle activation basis provides an anatomically-motivated way to extract semantic information. Even without extensive model calibration, our anatomical model’s muscle activations have proven useful for extracting anatomically-based semantic information; this is a promising avenue for future work. Additionally, muscle activations could be used as a basis for statistical/deep learning in place of semantically meaningless combinations of blendshape weights.

Finally, one of the more philosophical questions in deep learning seems to revolve around what should or should not be considered a “learning crime” (drawing similarities to variational crimes [strang1972variational]). For example, in [bailey2018fast], the authors learn a perturbation of linear blend skinning as opposed to the whole shape, assuming that the perturbation is lower-dimensional, spatially correlated, and/or easier to learn. The authors in [feng2018joint, ranjan2018generating] use spatially correlated networks for spatially correlated information under the assumption, once again, that this leads to a network that is easier to train and generalizes better. It seems that adding strong priors, domain knowledge, informed procedural methods, etc. to generate as much of a function as possible before training a network to learn the rest is often considered prudent. Our anatomically-based physical simulation system incorporates physical properties such as volume preservation, contact, and collision so that a network would not need to learn or explain them; instead, the network only needs to learn whatever further perturbations are required to match the data.
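The idea of learning only a perturbation on top of a strong procedural prior can be illustrated with a toy example (all names and data here are hypothetical, and a small least-squares polynomial fit stands in for a network): the “prior” captures most of the signal, so the learner only has to explain a small, simple residual.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
truth = np.sin(2.0 * np.pi * x) + 0.05 * x**2   # ground-truth behavior
prior = np.sin(2.0 * np.pi * x)                  # procedural/physical model

# Fit a small polynomial "learner" to the residual only.
A = np.vander(x, 3)                              # features [x^2, x, 1]
coeffs, *_ = np.linalg.lstsq(A, truth - prior, rcond=None)
prediction = prior + A @ coeffs

residual_error = np.linalg.norm(truth - prediction)
naive_error = np.linalg.norm(truth - prior)      # error if the prior is used alone
print(residual_error < naive_error)
```

The residual here is exactly quadratic, so the tiny learner fits it essentially perfectly; analogously, a network stacked on our simulation need not re-learn volume preservation, contact, or collision.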

Acknowledgements

Research supported in part by ONR N00014-13-1-0346, ONR N00014-17-1-2174, ARL AHPCRC W911NF-07-0027, and generous gifts from Amazon and Toyota. In addition, we would like to thank both Reza and Behzad at ONR for supporting our efforts into computer vision and machine learning, as well as Cary Phillips, Kiran Bhat, and Industrial Light & Magic for supporting our efforts into facial performance capture. M. Bao was supported in part by The VMWare Fellowship in Honor of Ole Agesen. We would also like to thank Paul Huston for his acting and Jane Wu for her help in preparing the supplementary video.

Appendices

Appendix A Targeting 3D Geometry - Additional Results

We present additional comparisons between using blendshapes and simulations for targeting three-dimensional geometry in Figure 10. Our approach using muscle simulation results in facial expressions similar to those obtained via blendshapes while also introducing physical properties such as volume preservation. Our results can be improved by further calibrating and refining the anatomical model. As seen in Figure 11, the resulting muscle activation weights are sparser and less overdialed than their blendshape counterparts. In particular, note how the muscle activations generally track the magnitude of the expression. This is especially evident in the frame where the face is in a close-to-neutral pose; while the muscle activations are nearly all zero, the blendshape weights are still dialed in heavily to match the expression. The overdialing of blendshape weights could be alleviated by increasing the L2 regularization of the weights; however, this would also cause the captured result to become less representative of the original performance. Figure 12 shows that muscle activations yield anatomically and semantically meaningful information. Note that further calibration of the anatomical model will also lead to more accurate muscle activation weights.
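The L2 trade-off noted above can be made concrete with a small sketch (the basis and target here are hypothetical): solving min_w ||Bw − t||² + λ||w||² via the normal equations (BᵀB + λI)w = Bᵀt, increasing λ shrinks (“un-dials”) the weights but increases the fitting error, so the capture drifts away from the performance.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((40, 6))     # toy blendshape basis
t = rng.standard_normal(40)          # toy target displacement

def ridge_fit(lam):
    """Solve the L2-regularized least-squares problem via the normal equations."""
    w = np.linalg.solve(B.T @ B + lam * np.eye(6), B.T @ t)
    return w, np.linalg.norm(B @ w - t)

w_lo, err_lo = ridge_fit(0.01)       # weak regularization: overdialed weights
w_hi, err_hi = ridge_fit(100.0)      # strong regularization: sparse-ish weights
print(np.linalg.norm(w_hi) < np.linalg.norm(w_lo))  # weights shrink
print(err_hi > err_lo)                               # fit degrades
```

Both inequalities hold for any λ pair with λ_hi > λ_lo, which is exactly the tension described above: one cannot simply crank up the regularizer without sacrificing fidelity to the performance.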

Appendix B Targeting RGB Images - Additional Results

We show additional results for targeting monocular RGB images in Figure 13. Furthermore, we show the resulting geometry and plates for the same frames from another camera perspective in Figure 14. The corresponding blendshape weights and muscle activations are shown in Figure 15. A visualization of the muscles’ activations is shown in Figure 16. Currently, the muscle activations resulting from targeting RGB images do not permit as clean an interpretation as those obtained when targeting geometry, although the incisivus labii superioris muscles tend to become activated in conjunction with expressions involving the mouth. However, we note that the general magnitude of the activations tends to match the magnitude of the expression. Future work calibrating the muscle model will improve semantic interpretability.

Blendshapes Simulation Target

(a) 2536
(b) 2540
(c) 2560
(d) 2573
(e) 2590
Figure 10: Additional comparisons when targeting geometry viewed from one of the original camera viewpoints.

Blendshapes Simulation

(a) 2536
(b) 2540
(c) 2560
(d) 2573
(e) 2590
Figure 11: Additional comparisons between the resulting blendshape weights and muscle activations when targeting geometry.
(a) 2536
(b) 2540
(c) 2560
(d) 2573
(e) 2590
Figure 12: Muscle activations from Figure 11 visualized where activations greater than are colored white and activations at are colored red.

1115 1120 1130 1155

(a) Blendshapes
(b) Simulation
(c) Plate
Figure 13: Targeting the monocular RGB image using shape-from-shading and rotoscope curves with blendshapes and simulation from the main camera’s perspective.

1115 1120 1130 1155

(a) Blendshapes
(b) Simulation
(c) Plate
Figure 14: Targeting the monocular RGB image using shape-from-shading and rotoscope curves with blendshapes and simulation from an alternate camera’s perspective.

Blendshapes Simulation

(a) 1112
(b) 1115
(c) 1120
(d) 1130

Blendshapes Simulation

(e) 1134
(f) 1155
(g) 1160
(h) 1170
Figure 15: Comparisons between the blendshape weights and muscle activations for all the monocular shape-from-shading results. The corresponding geometry for frames 1115, 1120, 1130, and 1155 is shown in Figures 13 and 14; the corresponding geometry for frames 1112, 1134, 1160, and 1170 is shown in the main paper.
(a) 1112
(b) 1115
(c) 1120
(d) 1130
(e) 1134
(f) 1155
(g) 1160
(h) 1170
Figure 16: Muscle activations from Figure 15 visualized where activations greater than are colored white and activations at are colored red.

References