 # Improved Search Strategies for Determining Facial Expression

It is well known that popular optimization techniques can lead to overfitting or even a lack of convergence altogether; thus, practitioners often utilize ad hoc regularization terms added to the energy functional. When carefully crafted, these regularizations can produce compelling results. However, regularization changes both the energy landscape and the solution to the optimization problem, which can result in underfitting. Surprisingly, many practitioners both add regularization and claim that their model lacks the expressivity to fit the data. Motivated by a geometric interpretation of the linearized search space, we propose an approach that ameliorates overfitting without the need for regularization terms that restrict the expressiveness of the underlying model. We illustrate the efficacy of our approach on minimization problems related to three-dimensional facial expression estimation where overfitting clouds semantic understanding and regularization may lead to underfitting that misses or misinterprets subtle expressions.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

There exist many standard methods for finding the solution to an unconstrained nonlinear optimization problem; however, generally speaking, one typically approaches these problems by iteratively solving a sequence of linear problems aiming to progress towards the solution of the fully nonlinear problem. For example, the Gauss-Newton method [madsen2004methods] aims to minimize an energy function by iteratively solving a linear problem where is the Jacobian of evaluated at the current parameter state , is the current residual, and is the desired step in parameter values to move towards the solution. This process is repeated until convergence. Although this linear problem minimizes , it is unclear in practice that minimizing this L2 norm is the best way to make progress towards the value of that best minimizes . In fact, the Levenberg-Marquardt and the Dogleg trust-region methods blatantly modify the least squares problem claiming to make the Gauss-Newton method more robust [lourakis2005levenberg, marquardt1963algorithm].

Even when one can trust the linear problem to provide progress towards the solution or can modify the linear problem to be more trust-worthy, one still has to solve the linear problem dealing with its potential over- or under-determinedness. For example, practitioners may add regularizations such as the L2-norm of the parameters to ensure that is nonsingular [hoerl1970ridge] or add the L1-norm of the parameters to create sparsity [boyd2004convex] in the spirit of minimum-norm solutions to underdetermined problems.

A further complication is that accurately minimizing the energy functional may not even lead to the desired solution since the energy functional is often only suggestive of intent. This is the classical problem of overfitting, minimizing the loss while obtaining large generalization errors on holdout data. In the specific case of facial expression estimation, matching pre-drawn contours may minimize the energy functional at the cost of implausible deformations of the rest of the face. Thus, domain specific regularizers such as Laplacian smoothing of surface meshes [bhat2013high] or anatomical simulation of the underlying muscle model [cong2016art]

help to alleviate overfitting. In the case of training neural networks, one can utilize early stopping, dropout, and batch normalization to achieve models that better generalize to more data

[goodfellow2016deep].

To summarize, linear subproblems may not lead to the solution, often require modifications for solvability anyway, and may come from an energy functional which is only at best suggestive of intent. Thus, we take an aggressive approach to solving these linear subproblems by first pruning away any coordinates that are uncorrelated with the right hand side as motivated by least angle regression (LARS) [efron2004least]. The other coordinates are then estimated in a coordinate descent fashion [shi2016primer] eliminating the need to regularize for solvability. Our aim is to find a set of parameters that give a meaningful interpretation of the observed data; thus, using uncorrelated directions to make progress on linear subproblems seems unnecessary if not unwise. The notion that solving linear systems may not necessarily help to solve a nonlinear problem is not new; there are many nonlinear problems where one looks at linearizations only to motivate a strategy for tackling the nonlinear problem – a quite common example is eigensystem consideration for solving hyperbolic conservation laws, see [toro2013riemann] and the references therein.

We illustrate the efficacy of our approach on a number of problems in the context of facial expression estimation especially when targeting rotoscope curves. The end goal is to determine a set of model parameters that represent the correct semantic meanings including pose and expression while minimizing the excitation of spurious degrees of freedom.

## 2 Related Work

Regularization: When working with ill-posed problems, regularization is used to prevent the model from overfitting to the data [goodfellow2016deep, hastie2001elements]

[hoerl1970ridge] and LASSO [tibshirani1996regression] are two popular methods for regularizing linear least squares problems; they add constraints equivalent to introducing L2 and L1-norm regularization respectively on the parameters of the model [hastie2001elements]. Using L1-norm regularization has the additional benefit of keeping the solution sparse [boyd2004convex, donoho2006most]. Additionally, regularization will help ensure that the normal equations used to solve least squares are well-conditioned. However, regularization forces one to balance the trade-off between matching the data and having smaller parameters values.

Coordinate Descent: At each iteration, coordinate descent chooses a single coordinate direction to minimize. This allows the coordinate descent algorithm to avoid null spaces making it an attractive option for use on ill-posed, poorly conditioned problems. The column can be chosen stochastically [nesterov2012efficiency] or deterministically. Popular deterministic methods for choosing the next search direction include cyclic coordinate descent [krylov1966solution], the Gauss-Southwell and Gauss-Southwell-Lipschitz rule [nutini2015coordinate], and the maximum block improvement (MBI) rule [chen2012maximum]. Generally, either a line search is performed to determine the step size or a fixed step size/learning rate is used. However, the cost of a line search is prohibitive when the function takes a long time to evaluate (in the case of simulations) so we instead use coordinate descent to solve the linear problem found at every iteration of Gauss-Newton. Instead of looking at a single column at a time, block coordinate descent can be used to update multiple columns simultaneously [tseng2001convergence]; however, with block coordinate descent, regularization may still be needed to avoid ill-posed problems when the block of columns does not have full rank. A more complete overview of coordinate descent methods can be found in [shi2016primer, wright2015coordinate].

Correlation: Although coordinate descent algorithms can avoid the null space, they may still choose poorly correlated coordinates potentially resulting in many poorly correlated coordinates in place of fewer more strongly correlated coordinates. Using correlation to choose the next coordinate to add to the model can alleviate this problem and is the central idea behind MBI [chen2012maximum], forward and backward stepwise regression [derksen1992backward], and least angle regression (LARS) [efron2004least]. The latter statistical regression methods are often used to gain better prediction accuracy and interpretability of the model [hastie2001elements]. However, LARS converges to the least squares solution [efron2004least] which often means eventually using uncorrelated coordinates.

Faces: When solving for facial blendshape parameters [lewis2014practice], L2-norm regularization of the parameters is commonly used by practitioners to obtain more reasonable solutions due to its ease of use in commodity least squares solvers [blanz1999morphable, cao2014facewarehouse, li2010example, thies2016face2face]. Others instead choose to bound the blendshape weights between some minimum and maximum value (typically and or and ) [bhat2013high, hsieh2015unconstrained, li2013realtime]. The usage of the L1-norm regularization is relatively rare, even though it results in sparser solutions [bouaziz2013online, ichim2015dynamic, neumann2013sparse]. Other methods use Laplacian regularization [bhat2013high, huang2011leveraging] or anatomically motivated regularization [wu2016anatomically] to constrain the deformation of the face. Figure 1: (a) A visualization of the columns of A as well as b for the linear problem Ax=b from Section 3 when b=T. (b) The exact solution depicted by a1x1 and a2x2. (c) The regularized solution with λ=.2.

## 3 Motivation

Consider where and , and the two columns of get overdialed by the exact solution of and . That is, while is in fact in the range of , it is not what we refer to as “easily” in the range of . Since the columns of are mostly orthogonal to , the exact solution shown in Figure 1b overdials both columns producing erroneously large canceling in the horizontal direction in order to edge up vertically even a small amount. Ideally, one would consider these columns of unrepresentative of and instead find a control parameter that moves vertically. Practitioners often claim that regularization handles these issues. Regularized least squares of the form has the solution shown in Figure 1c where the non-zero aspect of the regularized solution is almost entirely in the spurious horizontal direction with only an infinitesimal component in the direction of . In fact, the best one can hope for is to drive the spurious horizontal component to zero with no hope of moving vertically towards the solution. Figure 2: (a) A visualization of the columns of A as well as b for the linear problem Ax=b from Section 3 when b=T. (b) The exact solution depicted by a1x1 and a2x2. (c) The regularized solution with λ=1. (d) The regularized solution with λ1=0 and λ2=1. (e) The solution when solving for x1 only. (f) The solution using coordinate descent using the MBI selection rule after a few iterations.

Next consider the case where and the component is more readily captured by the columns of . Here, the exact solution shown in Figure 2b is an improvement over Figure 1b since less of is wasted to cancel all of , and similarly, the regularized solution shown in Figure 2c makes some progress towards the actual solution as compared to Figure 1c. Still, one could do better by changing the regularization in the least squares problem to have the form with . Figure 2d shows the result for which is highly improved. One could do even better using only as shown in Figure 2e obtained using . This is equivalent to minimizing using coordinate descent after using only one column.

## 4 Pruning

We illustrate our approach by solving a generic nonlinear least squares optimization problem of the form using a Gauss-Newton based method, computing the Jacobian to utilize the first-order Taylor expansion at each iteration. The minimization problem at the current iteration is . This is a linear least squares problem and can be solved using to find the that makes progress towards the solution. The least squares solution can also be computed by solving using QR factorization; however, it is not clear that the least squares solution is necessary or desirable. Figure 3: Left: Here, a=b+c, but c is only valid for making progress towards a in conjunction with b. While using c may be desirable when trying to solve a linear system of equations, using c (which is perpendicular to a) to progress when solving a linearized high-dimensional nonlinear problem may not produce any meaningful progress. Right: In fact, it may be desirable to only progress in the direction of b for a while, stopping before the remaining residual starts to become uncorrelated. At that point, it would be better to find a new direction (in this case d) that leads back towards a.

We thus propose an alternative approach to solving which depends on computing the correlation between each column of and . Similarly to LARS [efron2004least] and MBI [chen2012maximum], we use the dot product magnitude to determine correlation, where . We first prune out columns that are poorly correlated with as these make no progress towards the solution in the linearized state of the problem. Furthermore, these columns can only act as corrections on actual progress in conjunction with other columns; however, these corrections are only valid when the problem is truly linear. In the case of a nonlinear problem, it is unclear whether these uncorrelated corrections help to minimize the nonlinear . See Figure 3. Motivated by the Gauss-Southwell rule, one might instead prune using the dot product computed using

to consider large decreases in residual with small steps; however, this may leave columns that are large and poorly correlated unpruned. Thus, we use the absolute dot product of the normalized vectors as the correlation measure for its geometric interpretation as the absolute cosine of the angle between

and .

Pruning columns of to get a reduced has the additional benefit of potentially eliminating portions of the null space of , as the pruned out columns may not have full rank or a combination of the pruned out columns with the remaining columns may not have full rank; this pruning may also improve the condition number. This is especially prudent when working with a large number of dimensions in which case the dimension of the null space of and the condition number of may be large. Moreover, this problem is exacerbated when regularization is not used. At this point, one could compute a modified step that uses the columns of to make progress towards . However, if contains a null space, some regularization would still be needed to obtain a reasonable solution.

## 5 Solving the Pruned System

To avoid regularization, we pursue a coordinate descent strategy to solve . At each iteration of the coordinate descent algorithm, a single column of the Jacobian is used to make progress towards . Since only one column is examined at any given iteration, we do not have to use regularization to avoid null spaces. There are a few strategies for choosing the column to update that follow our motivation of using correlated columns. For example, one could choose the column with the largest correlation with the residual or choose the column which results in the largest decrease in the residual similar to the MBI rule. Motivated by the Gauss-Southwell rule [nutini2015coordinate], we instead choose the column that maximizes the reduction in residual for a given step, which in effect balances correlation with residual reduction.

Assuming that we measure the size of the residual using the squared L2 norm, we choose the column that maximizes

 Δ(rTr)Δαi =||r(δxS)||22−||r(δxS)−αiji||22|αi| (1)

where is the residual for the linearized problem and is the step size when is the coordinate direction. Flipping all so that leads to , which in turn means that finding the column that maximizes Equation 1 is equivalent to the column that maximizes . Figure 4: Whereas MBI simply minimizes θ, Gauss-Southwell and our approach instead consider ^r⋅j. However, unlike Gauss-Southwell, our approach also prefers that the remaining residual r−αj still be correlated with j. This penalizes going too far in a search direction j, especially since going too far can lead to using uncorrelated directions as corrections as shown in Figure 3. Instead, we stop progress along j while the remaining residual is still correlated, which allows us to look for more correlated directions to continue making progress similarly to LARS [efron2004least].

In the case where is chosen to make as much progress as possible, or , we obtain which is standard for Gauss-Southwell. Similarly, as , . Geometrically, this maximizes the projection of into the direction of , , since does not vary when considering which to use. In comparison, the MBI rule chooses the that maximizes ignoring the magnitude of , which unfortunately allows for directionally correlated but small magnitude columns that may require large overdialed to make progress.

When more conservatively choosing less than the greedy value to make the maximal progress, and . Whereas the greedy choice eliminated the last two terms in giving a result in line with Gauss-Southwell, our approach augments the Gauss-Southwell like term with which compares the remaining residual to the search direction . Maximizing this latter term helps to prefer search directions that remain correlated with the new residual preventing uncorrelated gains that might later be removed using uncorrelated directions as in Figure 3 (left). See Figure 4. We note that it may be desirable to replace our proposed Heaviside pruning (in Section 4) with a differentiable soft-pruning penalty term in Equation 1.

We generally only execute a few iterations of coordinate descent using these update rules to mimic the regularization effects of early stopping [goodfellow2016deep] and truncated-Newton methods [nash2000survey] as it prevents overfitting to the current linear model. Furthermore, we also terminate the coordinate descent iterations early if the decrease in L2 error is low. Stopping early also prevents us from reaching the undesirable least squares solution as in LARS [efron2004least].

## 6 Experiments

Consider matching hand-drawn rotoscope curves on a captured image to the corresponding two-dimensional projections of similar curves drawn on a three-dimensional face model. The face surface is driven by blendshapes [lewis2014practice] and linear blend skinning for the jaw; let represent the full set of controls for the jaw angles, jaw translation, and face blendshapes. Assuming the pointwise correspondences between the curves are known, we solve a nonlinear least squares problem of the form

 minw∥C∗−C(x(w))∥22 (2)

to estimate , where are the points on the rotoscope curves and are the corresponding points from the face model projected into the image plane. Here, we assume that the camera parameters and the rigid alignment are precomputed.

To solve this problem, practitioners often add L2 regularization of the parameters (). Although this achieves reasonable results when looking at the geometry and/or synthetic render, the resulting parameter values are typically densely activated making it nearly impossible to utilize those values for editing or for extracting semantic information. See Figure 5.

### 6.1 Synthetic Tests

We manually set the two components of (left and right) corresponding to the smile expression to one to produce a face surface from which the barycentrically embedded contour points can be projected into the image plane generating a synthetic target . As a baseline, we then solve Equation 2 using the Dogleg method with no prior, the Dogleg method with a prior weight of , and the BFGS method [nocedal2006numerical] with a soft-L1 prior with a weight of (with an extra term [charbonnier1997deterministic]). For our approach, we firstly prune all columns of the Jacobian whose angle to the residual has an absolute cosine less than (determined experimentally). Then, is set to a fixed size with and the coordinate descent algorithm is run until the linear L2 error no longer sufficiently decreases or when more than coordinates have been used. We limit all four methods to take no more than Gauss-Newton iterations and show the results in Figure 5(a).

Next, we add an increasing amount of uniformly distributed noise to the image locations of the targeted contour points. As expected, although Dogleg with no regularization produces reasonable results when the noise is low, as the amount of noise increases, it begins to overfit to the errorneous data. See Figure

6, top row. Both of the regularized approaches as well as our approach, however, are able to target the noisy curves (without overfitting) producing more reasonable geometry. See Figure 6, bottom three rows. Also, as seen in Figure 7, our approach gives more accurate semantics even in the presence of noise.

### 6.2 Real Image Data

For each captured image, an artist draws eight curves around the mouth and eyes. Another artist hand-selects the corresponding contours on the three-dimensional face model which are projected into the image plane using camera extrinsic and intrinsic parameters calibrated using the method of [heikkila1997four]. Generally, these two sets of contours will contain a different number of control points; as a result, we uniformly resample both in the image plane. The uniformly sampled curves are then used in Equation 2.

Our approach and using regularization give the most reasonable geometric results – similar to the synthetic case. However, in certain frames our approach seems to over-regularize the result and cause the mouth to not quite hit the desired expression. This could be due, in part, to not having a fully accurate rigid alignment. However, a major benefit of our method is the resulting sparsity in the weights. While the Dogleg method with and without regularization will dial in nearly all the parameters to a non-zero value, our approach generally dials in a small number of weights. See Figure 11. Similarly, the soft L1 regularized solution also produces sparser solutions than the L2 regularized solution; however, due to approximations in the chosen optimization approach (BFGS, Soft L1), it produces many small () weights instead of zero values. While one could use a threshold to clamp values to zero, care must be taken to not accidentally threshold a small blendshape weight (or a combination of small blendshape weights) that contribute significantly to the overall performance. We note that one could potentially use iteratively reweighted least squares methods [chartrand2008iterative] to obtain sparse solutions.

### 6.3 Parameter Study

Column Choice: We compare Equation 1 for choosing the coordinate descent column to using Gauss-Southwell and MBI without pruning. Firstly, for each approach, we linearize and solve with no thresholding for the relative decrease in L2 error, an upper limit of unique coordinates used, and a fixed step size of . Figure 8 (top row) shows similar behavior in terms of geometry and choice of blendshape weights between our approach and Gauss-Southwell. The MBI rule, on the other hand, immediately overfits and dials in extraordinarily high weights, especially for blendshapes that affect the eyes, causing the face surface to explode; however, as can be seen in the supplementary material, solving without the eye rotoscope curves causes MBI to overfit to the mouth curves instead. Secondly, repeating the linearization and solve times as shown in Figure 8 (bottom row) gives similar results.

Step Size:

Recall are the parameters, blendshape weights, that generally vary between and ; thus we choose to compare fixed step sizes of and to the full, greedy step, . Again without pruning, we run Gauss-Newton iterations with no thresholding for the relative decrease in L2 error and an upper limit of unique coordinates used. As seen in Figure 9, the smaller step sizes achieve better overall facial shapes and less overdialed weights. In particular, the greedy step dials weights to be greater than while the and step sizes only dial in such weights. Removing the eye rotoscope curves causes the overdialed weights to disappear; however, as seen in the supplementary material, the greedy step causes the mouth to move unnaturally.

We also compare the effect of using fixed step sizes in Equation 1 versus the full, greedy step size equivalent to Gauss-Southwell without pruning. To isolate this variable, we run Gauss-Newton iterations with no thresholding for the relative decrease in L2 error and an upper limit of unique coordinates used. We vary in Equation 1 but set the actual step size taken to be fixed at . As shown in supplementary material, while the resulting geometry and weights are all similar, our approach of allowing the step size to influence the chosen coordinate allows the optimization to more quickly reduce the error in earlier Gauss-Newton iterations.

Pruning:

We use the correlation metric of from LARS and MBI to prune poorly correlated coordinate directions. We choose threshold values of (no pruning), , , , , and , and run Gauss-Newton iterations with a step size of with no thresholding for the relative decrease in L2 error. To emphasize the effect of pruning, we allow up to unique coordinates per linearization, and focus only on the rotoscope curves around the mouth. With little to no pruning the model overfits and the geometry around the mouth deforms unreasonably. As the pruning threshold increases, the geometry becomes more regularized and the blendshape weights are less overdialed and sparser as the optimization is forced to use only the most correlated directions. See Figure 10. However, we caution that too much pruning causes MBI style column choices.

## 7 Conclusion

Taking the linearization of the nonlinear problem as a mere suggestion for a search strategy, we prune uncorrelated directions, pursue a coordinate descent approach that does not need regularization, and choose search directions in order to maximize gains in reducing the residual with minimal parameter value increases. We stress that this is more generally an outline of an improved search strategy as opposed to a formal and precise new method; yet, our first attempts at such an approach led to highly improved results. In the case of estimating three-dimensional facial expression from a mere eight contours drawn on a single two-dimensional RGB image, we were able to robustly estimate clean, sparse parameter values with good semantic meaning in a highly underconstrained situation where one would typically need significant regularization. In fact, the standard approach without regularization was widly inaccurate, and while regularization helped to moderate the overall face shape, it excited almost every parameter in the model clouding semantic interpretation.

As future work, we would be interested in considering similar search strategies for training neural networks with limited data. Such situations are similar to our example where we only use sparse rotoscope curves that are only suggestive of the desired expression, instead of dense data for every triangle (such as when using shape-from-shading or optical flow).

## Acknowledgements

Research supported in part by ONR N00014-13-1-0346, ONR N00014-17-1-2174, ARL AHPCRC W911NF-07-0027, and generous gifts from Amazon and Toyota. In addition, we would like to thank both Reza and Behzad at ONR for supporting our efforts into computer vision and machine learning, as well as Cary Phillips and Industrial Light & Magic for supporting our efforts into facial performance capture. M.B. was supported in part by The VMWare Fellowship in Honor of Ole Agesen. We would also like to thank Paul Huston for his acting.

## Appendix A Extended Parameter Study

### a.1 Column Choice

Once again, we compare our approach for choosing the next coordinate descent column to using Gauss-Southwell and MBI. For each approach, we linearize and solve with no thresholding for the relative decrease in L2 error, an upper limit of unique coordinates used, and a fixed step size of ; however, in these examples, we remove the eye rotoscope curves from the energy function. Even without solving for the eye rotoscope curves, MBI will still overfit and overdial mouth blendshapes (the two most dialed in shapes have magnitudes of and ). On the other hand, Gauss-Southwell and our approach maintain the face surface geometry and keep the blendshape weights within a reasonable range while maintaining the sparsity of the solution. We note that with coordinate descent it is generally a matter of when, not if, the algorithm chooses a coordinate that will be overdialed; our examples demonstrate that MBI reaches those coordinates more quickly than Gauss-Southwell and our approach. See Figure 13.

### a.2 Step Size

Once again, we compare fixed step sizes of and to the full, greedy step, . We again note that the fixed step sizes for each coordinate are clamped to not exceed the full, greedy step size. Again without pruning, we run Gauss-Newton iterations with no thresholding for the relative decrease in L2 error and an upper limit of unique coordinates used; however, in these examples, we remove the eye rotoscope curves from the energy function. In this case, the step size has no significant effect on the sparsity of the blendshape weights; however, as seen in Figure 13, taking the greedy step versus taking a fixed size step results in a more unnaturally shaped mouth. This would seem to indicate that always taking the greedy step will result in some overfitting.

We also run tests with and without the eye rotoscope curves to isolate the effect the step size has on the column choice in our approach; while we vary the step size used to determine the next coordinate descent column, we fix the actual step size to have a magnitude of . Again without pruning, we run Gauss-Newton iterations with no thresholding for the relative decrease in L2 error and an upper limit of unique coordinates used. We note that when the full, greedy step size is used to choose the next coordinate direction, our approach produces the same results, chooses the same coordinates and takes the same steps, as Gauss-Southwell. As can be seen in Figure 15, the final geometry and blendshape weights when varying the step size are similar to the results produced by Gauss-Southwell, using the full, greedy step. However, our approach, as seen in Figure 14 allows the error to decrease more in earlier Gauss-Newton iterations than when using Gauss-Southwell. Therefore, it may be beneficial to use our approach with a fixed size step when only a few Gauss-Newton iterations are desired.

## Appendix B Parameter Limits and Trust Region

Using a coordinate descent solver for the linear problem at every Gauss-Newton iteration allows us to build the solution to incrementally without relying on a black box linear solver. Any approach that aims to prevent the solution from going past a certain point (parameter limits and a trust region method) are thus trivial to implement. Given min and max parameters and , at every iteration of the coordinate descent solve, we can clamp each such that . Additionally, given a trust region radius , we can simply terminate the coordinate descent solve when/if ; this is similar in spirit to how the trust region parameter is used in the Dogleg method. Furthermore, we can introduce limits on to ensure that each step will not cause the solver to go beyond the trust region. Figure 12: A comparison of the coordinates chosen by the MBI rule, the Gauss-Southwell (GS) rule, and our approach when solving without the eye rotoscope curves. The top row are the results after a single Gauss-Newton iteration, and the bottom row are the results after 10 Gauss-Newton iterations. (a) with eye rotoscope with eye rotoscope