1 Introduction
For high-end face performance capture, either motion capture markers [bhat2013high] or markerless techniques such as shape-from-shading [aldrian2012inverse] or optical flow [wu2016anatomically] are typically used; however, these methods are generally unable to capture the intricacies of the performance, especially around the lips. To obtain high-end results, artists hand-draw rotoscope curves on the captured image; then, a variety of techniques are used to construct similar curves on the synthetic render of the estimated pose and to determine correspondences between the hand-drawn and synthetically generated curves. The simplest such approach would be to use a predefined contour on the three-dimensional face model, clip it for occlusions, and create correspondences in a length-proportional way; although this provides some consistency to the curves generated on the synthetic render, it is quite difficult for an artist to emulate these curves. Thus, practical systems implement a number of ad hoc methods in order to match the artist’s more subjective interpretation. The inability to embed the artist’s subjectivity into the optimization loop and onto the synthetic render, coupled with the artist’s inability to faithfully reproduce procedurally generated curves, leaves a gaping chasm in the uncanny valley.
Although one might debate what works best in order to align a three-dimensional virtual model with an image, it is clearly the case that a consistent metric should be applied to evaluate whether the synthetic render and image are aligned. This motivates the employment of a machine learning algorithm to draw the rotoscope curves on both the captured image and the synthetic render, hoping that accurately representing the real-world pose would lead to a negligible difference between the two curves. Although one might reasonably expect that the differences between real and synthetic camera behavior, albedo, lighting, etc. may lead to different rotoscope curves being generated by the deep learning algorithm, GAN-like approaches [goodfellow2014generative, li2017generative] could be used to rectify such issues by training a network to draw the curves such that a discriminator cannot tell which curves were generated on real images versus synthetic renders.

Recent advancements in using deep neural networks to detect face landmarks (see [bulat2017far]) make using machine learning to detect not only the lip curves but also other facial landmarks on both the captured image and the synthetic render an attractive option. A traditional optimization approach can then be used to minimize the difference between the outputs from the captured image and the synthetic render. This is feasible as long as one can backpropagate through the network to obtain the Jacobian of the output with respect to the synthetic render. Of course, this assumes that the synthetic render is fully differentiable with respect to the face pose parameters, and we use OpenDR [loper2014opendr] to satisfy this latter requirement; however, we note that any differentiable renderer could be used ([li2018differentiable]).

This approach is attractive as it replaces human subjectivity with a consistent, albeit sometimes consistently wrong, network to evaluate the semantic difference between two images. Furthermore, as networks improve their ability to detect facial features/descriptors on a wider variety of poses and lighting conditions, our approach will also benefit. More generally, we propose to use machine learning networks to embed various subjective, but consistent, “evaluations” into classical optimization approaches that estimate facial pose and expression.
2 Related Work
Rotoscope Lip/Mouth Curves: Motion capture markers are commonly used to estimate facial pose and expression for a facial performance [deng2006animating, ma2008facial, sifakis2005automatic]. The corresponding points on the triangulated facial mesh are either selected by hand or found by projecting two-dimensional markers from the image to the three-dimensional mesh when the face is in a neutral pose [bhat2013high]. Rotoscope curves are typically used to refine the results around the lips to produce higher fidelity [bhat2013high, dinev2018user]. For view-independent curves such as the outer lip, the corresponding contour on the mesh can be predefined [bhat2013high]; however, for view-dependent/silhouette curves such as the inner lip, the corresponding contour must be defined based on the current camera viewpoint and updated iteratively [bhat2013high, dinev2018user].
Face Alignment: The earliest approaches to regression-based face alignment trained a cascade of regressors to detect face landmarks [cao2014face, chen2014joint, feng2016dynamic, kazemi2014one, zhu2016unconstrained]. More recently, deep convolutional neural networks (CNNs) have been used for both 2D and 3D facial landmark detection from 2D images [jourabloo2017pose, wu2018look]. These methods are generally classified into coordinate regression models [jeni2017dense, jourabloo2017pose, toshev2014deeppose, xing2018towards], where a direct mapping is learned between the image and the landmark coordinates, and heatmap regression models [bulat2017far, deng2018cascade, zadeh2017convolutional], where prediction heatmaps are learned for each landmark. Heatmap-based architectures are generally derived from the stacked hourglass [bulat2017far, deng2018cascade, jackson2017large, newell2016stacked] or convolutional pose machine [wei2016convolutional] architectures used for human body pose estimation. Pixel coordinates can be obtained from the heatmaps by applying the argmax operation; however, [dong2018supervision, sun2017integral] use soft-argmax to achieve end-to-end differentiability. A more comprehensive overview of face alignment methods can be found in [jin2017face].

Optical Flow: Optical flow has had a long, successful history started in part by [black1996robust]. Variational methods such as the Lucas-Kanade method [lucas1981iterative] and the Brox method [brox2004high] are commonly used. Other correspondence finding strategies such as EpicFlow [revaud2015epicflow] and FlowFields [bailer2015flow] have also been used successfully. We focus instead on trained deep networks. End-to-end methods for learning optical flow using deep networks were first proposed by [dosovitskiy2015flownet] and later refined in [IMKDB17]. Other methods, such as DeepFlow [weinzaepfel2013deepflow], use deep networks to detect correspondences. These methods are generally evaluated on the Middlebury dataset [baker2011database], the KITTI dataset [geiger2013vision], and the MPI Sintel dataset [butler2012naturalistic].
Faces and Networks: Neural networks have been used for various other face image analysis tasks such as gender determination [golomb1990sexnet] and face detection [rowley1998face]. More recently, deep CNNs have been used to improve face detection results, especially in uncontrolled environments and with more extreme poses [haoxiangli2015cascade, kaipengzhang2016joint]. Additionally, CNNs have been employed for face segmentation [sifeiliu2015labeling, nirkin2018swapping, saito2016segmentation], facial pose and reflectance acquisition [laine2017production, sengupta2018sfsnet], and face recognition [schroff2015facenet, taigman2014deepface]. Using deep networks such as VGG-16 [simonyan2014very] for losses has been shown to be effective for training other deep networks for tasks such as style transfer and super-resolution [johnson2016perceptual]. Such techniques have also been used for image generation [dosovitskiy2016generating] and face swapping [korshunova2017fast]. Furthermore, deep networks have been used in energies for traditional optimization problems for style transfer [gatys2016image], texture synthesis [sendik2017deep], and image generation [mahendran2015understanding, ulyanov2018deep]. While [gatys2016image, sendik2017deep] use the L-BFGS [zhu1997algorithm] method to minimize the optimization problem, [mahendran2015understanding, ulyanov2018deep] use gradient descent methods [ruder2016overview].

3 Overview
In this paper, we advocate for a general strategy that uses classical optimization where the energy to be minimized is based on metrics ascertained from deep learning neural networks. In particular, this removes the rotoscope artist and the ad hoc contour drawing procedures meant to match the artist’s work, or vice versa, from the pipeline. This pins advancements in three-dimensional facial pose and expression estimation to those being made in machine learning, which are advancing at a fast pace. Generally, we take the following approach: First, we estimate an initial rigid alignment of the three-dimensional face model to the two-dimensional image using a facial alignment network. Then, we estimate an initial guess for the jaw and mouth expression using the same network. Finally, we temporally refine the results and insert/repair failed frames (if/when necessary) using an optical flow network.
We use a blendshape model hybridized with linear blend skinning for a six-degree-of-freedom jaw joint [lewis2014practice]; let $\theta$ denote the parameters that drive the face triangulated surface $\hat{x}(\theta)$. The resulting surface has a rigid frame given by Euler angles $\theta_R$, rotation matrix $R(\theta_R)$, and a translation $t$ such that the final vertex positions are

$x = R(\theta_R)\,\hat{x}(\theta) + t$.  (1)

We note that other geometry such as the teeth can be trivially handled by Equation 1 as well. The geometry is rendered using OpenDR [loper2014opendr], obtaining a rendered image $F(\theta, \theta_R, t)$. As a precomputation, we estimate the face’s albedo and nine coefficients for a spherical harmonics light [ramamoorthi2001efficient] on a frame where the face is close to neutral; however, we note that a texture captured using a light stage [debevec2000acquiring] (for example) would potentially work just as well if not better. Then, our goal is to determine the parameters $\theta$, $\theta_R$, and $t$ that best match a given captured image $I$.
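To make this concrete, the blendshape evaluation and rigid transform of Equation 1 can be sketched as follows (a minimal numpy sketch with hypothetical array shapes; the jaw's linear blend skinning component of the hybrid rig is omitted here):

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles (radians), composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def final_vertices(neutral, deltas, weights, euler, t):
    """Equation 1: world vertices = R(theta_R) xhat(theta) + t, where xhat is a
    linear blendshape model: neutral + sum_k w_k * delta_k.

    neutral: (V, 3) neutral vertices; deltas: (K, V, 3) blendshape offsets;
    weights: (K,) blendshape weights; euler: 3 Euler angles; t: (3,) translation."""
    xhat = neutral + np.tensordot(weights, deltas, axes=1)  # (V, 3) deformed surface
    R = euler_to_matrix(*euler)
    return xhat @ R.T + t  # rigid transform applied per vertex
```

In the actual pipeline these vertex positions are fed to the differentiable renderer, so the derivative of this expression with respect to the parameters supplies the $\partial x/\partial q$ term of the chain rule.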
Both the pixels of the captured image $I$ and the pixels of the rendered image $F(\theta, \theta_R, t)$ are fed through the same deep network $D$ to get two sets of landmark positions $D(I)$ and $D(F)$. See Figure 1. We use the L2 norm of the difference between them

$E(\theta, \theta_R, t) = \|D(I) - D(F(\theta, \theta_R, t))\|_2^2$  (2)

as the objective function to minimize via nonlinear least squares, which is solved using the Dogleg method [lourakis2005levenberg] as implemented by Chumpy [loperchumpy]. This requires computing the Jacobian of the energy function via the chain rule; $\partial E/\partial D(F)$, $\partial D(F)/\partial F$, and $\partial F/\partial q$, where $q$ is one of $\theta$, $\theta_R$, and $t$, all need to be evaluated. We use OpenDR to compute $\partial F/\partial x$, and Equation 1 yields $\partial x/\partial q$. $\partial D(F)/\partial F$ is computed by backpropagating through the trained network using one’s deep learning library of choice; in this paper, we use PyTorch [paszke2017automatic]. Note that for computational efficiency, we do not compute $\partial D(F)/\partial F$ explicitly, but instead compute the Jacobian of Equation 2 with respect to the rendered image pixels output by $F$.

4 Rigid Alignment
We first solve for the initial estimate of the rigid alignment of the face, $\theta_R$ and $t$, using the pretrained 3DFAN network [bulat2017far]. Note that 3DFAN, like most other facial alignment networks, requires a cropped and resized image of the face as input; we denote these two operations as $C$ and $S$ respectively. The cropping function requires the bounding box output of a face detector $B$; we use the CNN-based face detector implemented by Dlib [dlib09]. The resize function resizes the crop to the fixed input resolution of the network. The final image passed to the network is thus $S(C(I, B(I)))$, where we note that both the crop and resize functions depend on the output of the face detector; however, aggressively assuming $\partial B/\partial I = 0$ did not impede our ability to estimate a reasonable facial pose. Given $B(I)$, $C(I)$ is merely a subset of the pixels of $I$, so $\partial C/\partial I = 1$ for all the pixels within the detected bounding box and $\partial C/\partial I = 0$ for all pixels outside. $S$ resizes the crop using bilinear interpolation, so $\partial S/\partial C$ can be computed using the size of the detected bounding box.

3DFAN outputs a tensor of heatmaps, one for each landmark, specifying the likelihood of a particular pixel containing that landmark. While one might difference the heatmaps directly, it is unlikely that this would sufficiently capture correspondences. Instead, we follow the approach of [dong2018supervision, sun2017integral] and apply a differentiable soft-argmax function to the heatmaps, obtaining pixel coordinates for each of the landmarks. That is, given the marker position $p^*$ computed using the argmax function on heatmap $H$, we use a patch $\Omega$ of pixels around $p^*$ to compute the soft-argmax position as

$p_{\text{soft}} = \dfrac{\sum_{p \in \Omega} e^{\beta H(p)}\, p}{\sum_{p \in \Omega} e^{\beta H(p)}}$  (3)

where $\beta$ is set experimentally and $H(p)$ returns the heatmap value at a pixel coordinate $p$. We found that using a small patch around the argmax landmark positions gives better results than running the soft-argmax operation on the entire heatmap.
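A sketch of this patch-based soft-argmax (numpy; the patch radius and temperature defaults are hypothetical placeholders for the experimentally set values, and the real implementation operates on 3DFAN's heatmap tensors in PyTorch so that it remains differentiable end-to-end):

```python
import numpy as np

def soft_argmax_patch(heatmap, radius=1, beta=10.0):
    """Soft-argmax over a small patch centered on the argmax of a 2D heatmap.

    Returns a sub-pixel (row, col) estimate: the softmax-weighted average of
    pixel coordinates in the patch, with weights exp(beta * heatmap value)."""
    r0, c0 = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    rows = np.arange(max(r0 - radius, 0), min(r0 + radius + 1, heatmap.shape[0]))
    cols = np.arange(max(c0 - radius, 0), min(c0 + radius + 1, heatmap.shape[1]))
    patch = heatmap[np.ix_(rows, cols)]
    w = np.exp(beta * (patch - patch.max()))  # subtract max for numerical stability
    w /= w.sum()
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.array([np.sum(w * rr), np.sum(w * cc)])
```

Unlike a hard argmax, this weighted average varies smoothly as the heatmap values change, which is what makes the landmark positions usable inside a gradient-based solve.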
The soft-argmax function returns a coordinate on the cropped and resized image, and these image coordinates need to be remapped to the full-resolution image to capture translation between the synthetic face render and the captured image. Thus, we apply inverse resize and crop operations $S^{-1}$ and $C^{-1}$, along with a scalar multiplication that rescales from the heatmap resolution to the network's input resolution. To summarize, we treat these remapped landmark positions as the output of the network, and Equation 2 measures the L2 distance between the landmark positions on the captured image and the corresponding landmark positions on the synthetic render. We once again assume $\partial B/\partial I = 0$. $\partial C^{-1}$ is the identity matrix, and $\partial S^{-1}$ contains the scalar multipliers to resize the image from the network's input resolution to the original cropped size. We stress that this entire process is end-to-end differentiable. See Figure 2.

5 Expression Estimation
After the rigid alignment determines the rotation and translation, we solve for an initial estimate of the mouth and jaw blendshape parameters (a subset of the blendshape weights). Generally, one would use hand-drawn rotoscope curves around the lips to accomplish this as discussed in Section 2; however, given the multitude of problems associated with this method as discussed in Section 1, we instead turn to deep networks to accomplish the same goal. We use 3DFAN in the same manner as discussed in Section 4 to solve for a subset of the blendshape weights while keeping the rigid parameters fixed. It is sometimes beneficial or even preferred to also allow the rigid parameters to be modified somewhat at this stage, although a prior energy term that penalizes these deviations from the values computed during the rigid alignment stage is often useful.
We note that the ideal solution would be to instead create new network architectures and train new models designed specifically for the purpose of detecting lip/mouth contours, especially since the heatmaps generated by 3DFAN are generally too low-resolution to detect fine mouth movements such as when the lips pucker. However, since our goal in this paper is to show how to leverage existing architectures and pretrained networks, so that one can benefit from the plethora of existing literature, for now, we bootstrap the mouth and jaw estimation using the existing facial landmark detection in 3DFAN.
6 Optical Flow for Missing Frames
The face detector used in Section 4 can sometimes fail; on our test sequence, Dlib’s HOG-based detector failed on a number of frames while Dlib’s CNN-based detector succeeded on all frames. We thus propose using optical flow networks to infer the rigid and blendshape parameters for failed frames by “flowing” these parameters from surrounding frames where the face detector succeeded. This is accomplished by assuming that the optical flow of the synthetic render from one frame to the next should be identical to the corresponding optical flow of the captured image. That is, given the synthetic renders and captured images of two adjacent frames, we compute two optical flow fields, one between the renders and one between the captured images, using FlowNet2 [IMKDB17]. We resize the synthetic renders and captured images to the resolution expected by the optical flow network before feeding them through it. Assuming the face detector failed on the second of the two frames, we solve for that frame’s parameters, starting with an initial guess given by the parameters of the first frame, by minimizing the L2 difference between the flow field vectors. The required Jacobian can be computed by backpropagating through the network.

7 Temporal Refinement
Since we solve for the rigid alignment and expression for all captured images in parallel, adjacent frames may produce visually disjointed results, either because of noisy facial landmarks detected by 3DFAN or due to the nonlinear optimization converging to different local minima. Thus, we also use optical flow to refine temporal inconsistencies between adjacent frames. We adopt a method that can be run in parallel. Given three sequentially captured images, we compute the optical flow fields from the middle captured image to each of its two neighbors; similarly, we compute the corresponding two flow fields between the synthetic renders. Then, we solve for the parameters of the middle frame by minimizing the sum of the two L2 norms of the differences between corresponding flow fields. The details for computing the Jacobian follow those in Section 6. Optionally, one may also wish to add a prior penalizing the parameters from deviating too far from their initial values. Here, each smoothing step obtains a new set of parameters using the parameters from the last step; however, one could also use the updated parameter values whenever available in a Gauss-Seidel style approach.
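This refinement step can be sketched as follows, with `render_flow` a hypothetical stand-in that renders the middle frame under the given parameters and returns its two optical flow fields to the adjacent synthetic renders; the paper backpropagates through FlowNet2 and the renderer for the Jacobian, whereas this sketch uses finite-difference gradient descent purely for illustration:

```python
import numpy as np

def refine_frame(params, render_flow, capture_flow_prev, capture_flow_next,
                 step=1e-2, iters=50, eps=1e-4):
    """Minimize the sum of two L2 flow differences (render vs. capture, toward
    the previous and next frames) over the middle frame's parameters."""
    def energy(p):
        f_prev, f_next = render_flow(p)  # synthetic-render flow fields
        return (np.sum((f_prev - capture_flow_prev) ** 2)
                + np.sum((f_next - capture_flow_next) ** 2))

    p = np.array(params, dtype=float)  # copy so the caller's guess is untouched
    for _ in range(iters):
        grad = np.zeros_like(p)
        for i in range(p.size):  # central finite differences per parameter
            dp = np.zeros_like(p)
            dp[i] = eps
            grad[i] = (energy(p + dp) - energy(p - dp)) / (2 * eps)
        p -= step * grad
    return p
```

Because each frame's refinement only reads the (fixed) parameters of its neighbors from the previous smoothing step, all frames can be refined in parallel as described above.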
Alternatively, one could adopt a self-smoothing approach by ignoring the captured image’s optical flow and solving for the parameters that minimize the difference between the synthetic render’s flow fields to the two adjacent frames. Such an approach in effect minimizes the second derivative of the motion of the head in the image plane, causing any sudden motions to be smoothed out; however, since the energy function contains no knowledge of the data being targeted, it is possible for such a smoothing operation to cause the model to deviate from the captured image.
While we focus on exploring deep learning based techniques, more traditional smoothing/interpolation techniques can also be applied in place of, or in addition to, the proposed optical flow approaches. Such methods include spline fitting the rigid parameters and blendshape weights, smoothing the detected landmarks/bounding boxes on the captured images as a preprocess, smoothing each frame’s parameters using the adjacent frames’ estimations, etc.
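For instance, the last of these alternatives, smoothing each frame's parameters using the adjacent frames' estimations, might be sketched as follows (a minimal numpy sketch for purely linear parameters such as translation and blendshape weights; rotations would instead need quaternion averaging, as discussed in the appendix):

```python
import numpy as np

def smooth_params(params, w_center=0.5):
    """Weighted three-frame moving average over a (T, P) array of per-frame
    parameters, weighting the current frame more heavily; the two endpoint
    frames are left unchanged."""
    params = np.asarray(params, dtype=float)
    out = params.copy()
    w_side = (1.0 - w_center) / 2.0  # remaining weight split between neighbors
    out[1:-1] = (w_side * params[:-2]
                 + w_center * params[1:-1]
                 + w_side * params[2:])
    return out
```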
8 Results
We estimate the facial pose and expression on a moderately challenging performance captured by a single ARRI Alexa XT Studio running at 24 frames per second, where numerous captured images exhibit motion blur. The captured images are downsampled from their original resolution before being fed through our pipeline. We assume that the camera intrinsics and extrinsics have been precalibrated, the captured images have been undistorted, and that the face model described in Equation 1 has already been created. Furthermore, we assume that the face’s rigid transform has been set such that the rendered face is initially visible and forward-facing in all the captured viewpoints.
8.1 Rigid Alignment
We estimate the rigid alignment (rotation and translation) of the face using 3DFAN. We use an energy that measures the L2 difference between the image-space coordinates of corresponding facial landmarks, as described in Section 4, weighted by a per-landmark weighting matrix. Furthermore, we use an edge-preserving energy that compares the vectors between pairs of landmark positions on the captured image with the corresponding vectors on the synthetic renders to ensure that the face does not erroneously grow/shrink in projected size as it moves towards the target landmarks, which may prevent the face detector from working.
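One plausible form of these two energies can be sketched as follows (a numpy sketch under stated assumptions: landmark arrays are (N, 2) image-space coordinates, `weights` holds the per-landmark weights, and `edges` lists hypothetical landmark index pairs; the paper's exact edge-preserving formulation may differ):

```python
import numpy as np

def landmark_energy(m_capture, m_render, weights):
    """Weighted landmark energy: sum_i w_i * ||m_capture_i - m_render_i||^2."""
    d = m_capture - m_render
    return np.sum(weights[:, None] * d * d)

def edge_energy(m_capture, m_render, edges):
    """Edge-preserving energy: compares the vector between each pair of
    landmarks on the capture with the corresponding vector on the render,
    penalizing the projected face growing/shrinking relative to the capture."""
    i, j = edges[:, 0], edges[:, 1]
    e_c = m_capture[i] - m_capture[j]  # edge vectors on the captured image
    e_r = m_render[i] - m_render[j]    # edge vectors on the synthetic render
    return np.sum((e_c - e_r) ** 2)
```

Note that the edge energy is invariant to a global translation of the rendered landmarks, which is what lets it constrain scale without fighting the alignment term.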
First, we only solve for the translation using all the landmarks except for those around the jaw to bring the initial state of the face into the general area of the face on the captured image. See Figure 3. We prevent the optimization from overfitting to the landmarks by limiting the maximum number of iterations. Next, we solve for both the rotation and translation in three steps: using the non-jaw markers, using only the jaw markers, and using all markers. We perform these steps in stages as we generally found the non-jaw markers to be more reliable and use them to guide the face model to the approximate location before trying to fit to all existing markers. See Figure 4.
8.2 Expression Estimation
We run a similar multi-stage process to estimate facial expression using the detected 3DFAN landmarks. We use the same energy term as in Section 8.1, but also introduce L2 regularization on the blendshape weights with the regularization weight set experimentally. In the first stage, we weight the landmarks around the mouth and lips more heavily and estimate only the jaw-open parameter along with the rigid alignment. The next stage estimates all available jaw-related blendshape parameters using the same set of landmarks. The final stage estimates all available jaw and mouth-related blendshapes as well as the rigid alignment using all available landmarks. See Figure 5. This process will also generally correct any overfitting introduced during the rigid alignment due to not being able to fully match the markers along the mouth. See Figure 6.
Our approach naturally depends on the robustness of 3DFAN’s landmark detection on both the captured images and synthetic renders. As seen in Figure 8, the optimization will try to target the erroneous markers, producing inaccurate pose and expression parameters which overfit to the markers. Such frames should be considered a failure case and thus require using the optical flow approach described in Section 6 for infill. Alternatively, one could manually modify the multi-stage process for rigid alignment and expression estimation to remove the erroneous markers around the jaw; however, such an approach may then overfit to the potentially inaccurate mouth markers. We note that such concerns will gradually become less prominent as these networks improve.
8.3 Optical Flow Infill
Consider, for example, Figure 7, where the two endpoint frames were solved for successfully and we wish to fill the frames in between. We visualize the optical flow fields using the coloring scheme of [baker2011database]. We adopt our proposed approach from Section 6, whereby the parameters of the missing frames are first solved for sequentially starting from the earlier known frame. Then, the frames are solved again in reverse order starting from the later known frame. This back-and-forth process, which can be repeated multiple times, ensures that the infilled frames at the end of the sequence have not accumulated so much error that they no longer match the other known frame.
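The back-and-forth sweep can be sketched as follows, where `solve_from` is a hypothetical stand-in for the optical-flow solve of Section 6 (it takes an initial parameter guess from a neighboring frame and returns refined parameters for the given frame):

```python
def infill_sweep(params, missing, solve_from, passes=2):
    """Fill parameters for `missing` frame indices (a contiguous gap between
    two known frames) by sweeping forward then backward, so the infilled frames
    stay consistent with both known endpoints.

    `params` is a dict mapping frame index -> parameters, already containing
    the known frames; it is updated in place and returned."""
    order = sorted(missing)
    for p in range(passes):
        forward = (p % 2 == 0)
        seq = order if forward else list(reversed(order))
        for f in seq:
            # Initial guess flows in from the neighbor solved just before this
            # frame in the current sweep direction (a known frame at the ends).
            neighbor = f - 1 if forward else f + 1
            params[f] = solve_from(params[neighbor], f)
    return params
```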
Using optical flow information is preferable to using simple interpolation as it is able to more accurately capture any nonlinear motion in the captured images (e.g. the mouth staying open and then suddenly closing). We compare the results of our approach of using optical flow to using linear interpolation for the translation and blendshape parameters and spherical linear interpolation for the rotation in Figure 9.
8.4 Multi-Camera
Our approach can trivially be extended to multiple calibrated camera viewpoints as it only entails adding another duplicate set of energy terms to the nonlinear least squares objective function. We demonstrate the effectiveness of this extension by applying the methods of Sections 8.1 and 8.2 to the same performance captured using an identical ARRI Alexa XT Studio from another viewpoint. See Figure 10.
We also compare the rigid alignment estimated by our automatic method to the rigid alignment created by a skilled matchmove artist for the same performance. The manual rigid alignment was performed by tracking the painted black dots on the face along with other manually tracked facial features. In comparison, our rigid alignment was done using only the markers detected by 3DFAN on both the captured images and the synthetic renders. See Figure 11. Our approach using only features detected by 3DFAN produces visually comparable results. In Figure 13, we assume the manually done rigid alignment is the “ground truth” and quantitatively evaluate the rigid alignment computed by the monocular and stereo solves. Both the monocular and stereo solves are able to recover similar rotation parameters, and the stereo solve is able to much more accurately determine the rigid translation. We note, however, that it is unlikely that the manually done rigid alignment can be considered “ground truth” as it more than likely contains errors as well.
8.5 Temporal Refinement
As seen in the supplementary video, the facial pose and expression estimations are generally temporally inconsistent. We adopt our proposed approach from Section 7. This attempts to mimic the captured temporal performance which not only helps to better match the synthetic render to the captured image but also introduces temporal consistency between renders. While this is theoretically susceptible to noise in the optical flow field, we did not find this to be a problem. See Figure 12. We explore additional methods of performing temporal refinement in the supplementary material.
9 Conclusion and Future Work
We have proposed and demonstrated the efficacy of a fully automatic pipeline for estimating facial pose and expression using pretrained deep networks as the objective functions in traditional nonlinear optimization. Such an approach is advantageous as it removes the subjectivity and inconsistency of the artist. Our approach heavily depends upon the robustness of the face detector and the facial alignment networks, and any failures in those cause the optimization to fail. Currently, we use optical flow to fix such problematic frames, and we leave exploring methods to automatically avoid problematic areas of the search space for future work. Furthermore, as the quality of these networks improves, our proposed approach will similarly benefit, leading to higher-fidelity results. While we have only explored using pretrained facial alignment and optical flow networks, using other types of networks (face segmentation, face recognition, etc.) and using networks trained specifically on the vast repository of data from decades of visual effects work are exciting avenues for future work.
Acknowledgements
Research supported in part by ONR N000141310346, ONR N000141712174, ARL AHPCRC W911NF070027, and generous gifts from Amazon and Toyota. In addition, we would like to thank both Reza and Behzad at ONR for supporting our efforts into computer vision and machine learning, as well as Cary Phillips and Industrial Light & Magic for supporting our efforts into facial performance capture. M.B. was supported in part by The VMWare Fellowship in Honor of Ole Agesen. J.W. was supported in part by the Stanford School of Engineering Fellowship. We would also like to thank Paul Huston for his acting.
Appendix
Appendix A Temporal Smoothing Alternatives
Figure 14 (third row) shows the results obtained by matching the synthetic render’s optical flow to the captured image’s optical flow (denoted plate flow in Figure 14). Although this generally produces accurate results when looking at each frame in isolation, adjacent frames may still produce visually disjoint results (see the accompanying video). Thus, we explore additional temporal smoothing methods.
We first explore temporally smoothing the parameters (rigid rotation, translation, and blendshape weights) by computing a weighted average over a three-frame window centered at every frame. We weigh the current frame more heavily and use the method of [markley2007quaternion] to average the rigid rotation parameters. While this approach produces temporally smooth parameters, it generally causes the synthetic render to no longer match the captured image. This inaccuracy is demonstrated in Figure 14 (top row, denoted as averaging) and is especially apparent around the nose and around the lower right cheek.
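The eigenvector method of [markley2007quaternion] computes the weighted average rotation as the principal eigenvector of the weighted sum of quaternion outer products; a minimal numpy sketch (the (w, x, y, z) component layout is an assumption here, though the method itself is layout-agnostic):

```python
import numpy as np

def average_quaternions(quats, weights):
    """Weighted quaternion average via the eigenvector method of Markley et al.

    `quats` is an (N, 4) array of unit quaternions. The result is the unit
    eigenvector of M = sum_i w_i q_i q_i^T with the largest eigenvalue, which
    is invariant to the per-quaternion sign ambiguity q ~ -q (unlike a naive
    componentwise average)."""
    quats = np.asarray(quats, dtype=float)
    w = np.asarray(weights, dtype=float)
    M = np.einsum("i,ij,ik->jk", w, quats, quats)  # weighted outer-product sum
    eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    avg = eigvecs[:, -1]                           # principal eigenvector
    return avg / np.linalg.norm(avg)
```

For two equally weighted unit quaternions this reduces to the normalized sum (when their dot product is positive), i.e. the halfway rotation, which matches the intuitive behavior one wants from a smoothing window.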
One could also carry out averaging using an optical flow network. This can be accomplished by finding the parameters that minimize the difference in optical flow fields between the current frame’s synthetic render and the adjacent frames’ synthetic renders. See Figure 14 (second row, designated self flow). This aims to minimize the second derivative of the motion of the head in the image plane; however, in practice, we found this method to have little effect on temporal noise while still causing the synthetic render to deviate from the captured image. These inaccuracies are most noticeable around the right cheek and lips.
We found the most effective approach to temporal refinement to be a two-step process: First, we use averaging to produce temporally consistent parameter values. Then, starting from those values, we use the optical flow approach to make the synthetic render’s flow better target that of the plate. See Figure 14 (bottom row, denoted hybrid). This hybrid approach produces temporally consistent results with synthetic renders that still match the captured image. Figure 15 shows the rigid parameters before and after using this hybrid approach, along with those obtained manually by a matchmove artist for reference. Assuming the manual rigid alignment is the “ground truth,” Figure 16 compares how far the rigid parameters are from their manually solved for values both before and after the hybrid smoothing approach. Figure 17 compares all the proposed smoothing methods on this same example.
A.1 Expression Re-estimation
The expression estimation and temporal smoothing steps can be repeated multiple times until convergence to produce more accurate results. To demonstrate the potential of this approach, we re-estimate the facial expression by solving for the mouth and jaw blendshape parameters while keeping the rigid parameters fixed after temporal smoothing. As seen in Figure 18, the resulting facial expression is generally more accurate than the pre-temporal-smoothing result. Furthermore, in the case where temporal smoothing dampens the performance, performing expression re-estimation will once again capture the desired expression.