 # Supervised Descent Method for Solving Nonlinear Least Squares Problems in Computer Vision

Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved with nonlinear optimization methods. It is generally accepted that second order descent methods are the most robust, fast, and reliable approaches for nonlinear optimization of a general smooth function. However, in the context of computer vision, second order descent methods have two main drawbacks: (1) the function might not be analytically differentiable and numerical approximations are impractical, and (2) the Hessian may be large and not positive definite. To address these issues, this paper proposes generic descent maps, which are average "descent directions" and rescaling factors learned in a supervised fashion. Using generic descent maps, we derive a practical algorithm - Supervised Descent Method (SDM) - for minimizing Nonlinear Least Squares (NLS) problems. During training, SDM learns a sequence of descent maps that minimize the NLS. In testing, SDM minimizes the NLS objective using the learned descent maps without computing the Jacobian or the Hessian. We prove the conditions under which the SDM is guaranteed to converge. We illustrate the effectiveness and accuracy of SDM in three computer vision problems: rigid image alignment, non-rigid image alignment, and 3D pose estimation. In particular, we show how SDM achieves state-of-the-art performance in the problem of facial feature detection. The code has been made available at www.humansensing.cs.cmu.edu/intraface.


## 1 Introduction

Mathematical optimization plays a fundamental role in solving many problems in computer vision. This is evidenced by the significant number of papers using optimization techniques in any major computer vision conference. Many important problems in computer vision, such as structure from motion, image alignment, optical flow, or camera calibration, can be posed as nonlinear optimization problems. There are a large number of different approaches to solve these continuous nonlinear optimization problems based on first and second order methods, such as gradient descent for dimensionality reduction, Gauss-Newton for image alignment [2, 3, 4], or Levenberg-Marquardt (LM) for structure from motion.

Despite many centuries of history, Newton’s method and its variants (e.g., Quasi-Newton methods [7, 8, 9]) are regarded as powerful optimization tools for finding local minima/maxima of smooth functions when second derivatives are available. Newton’s method makes the assumption that a smooth function can be well approximated by a quadratic function in a neighborhood of the minimum. If the Hessian is positive definite, the minimum can be found by solving a system of linear equations. Given an initial estimate x0, Newton’s method creates a sequence of updates as

 xk+1=xk−H−1(xk)J(xk), (1)

where H(xk) and J(xk) are the Hessian and Jacobian matrices evaluated at xk (see the notation in footnote 1). In the case of Quasi-Newton methods, the Hessian can be approximated by analyzing successive gradient vectors. Newton-type methods have two main advantages over competitors. First, they are guaranteed to converge provided that, in the neighborhood of the minimum, the Hessian is invertible and the minimized function is Lipschitz continuous. Second, the convergence rate is quadratic.

Footnote 1: Bold capital letters denote a matrix X; bold lower-case letters denote a column vector x. All non-bold letters represent scalars. xj represents the jth column of the matrix X. xij denotes the scalar in the ith row and jth column of the matrix X. xj denotes the scalar in the jth element of x. I is an identity matrix. ∥x∥2 denotes the Euclidean distance. ∥X∥F designates the Frobenius norm of a matrix.

Fig. 1: a) Newton’s method to minimize f(x). The z-axis is reversed for visualization purposes. b) SDM learns a sequence of generic descent maps {Rk} from the optimal optimization trajectories (indicated by the dotted lines). Each parameter update Δxi is the product of Rk and a sample-specific component (y−h(xik)).
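As a minimal numeric sketch of the Newton update in Eq. 1 (a hypothetical quadratic objective with an analytic gradient and Hessian; since the objective is quadratic, a single Newton step lands on the minimum):

```python
import numpy as np

# Toy smooth objective: f(x) = (x1 - 1)^2 + 10*(x2 + 2)^2, minimized at (1, -2).
def J(x):  # gradient (Jacobian of f)
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

def H(x):  # Hessian (constant and positive definite for this f)
    return np.diag([2.0, 20.0])

x = np.array([5.0, 5.0])  # initial estimate x0
for _ in range(10):
    # Newton update (Eq. 1): x_{k+1} = x_k - H^{-1}(x_k) J(x_k)
    x = x - np.linalg.solve(H(x), J(x))

print(x)  # converges to [1, -2]
```

Note that `np.linalg.solve` is used instead of explicitly inverting the Hessian, which is the standard numerically stable practice.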

However, when applying Newton’s method to computer vision problems, three main problems arise: (1) The Hessian is positive definite at the local minimum, but it may not be positive definite elsewhere; therefore, the Newton steps may not be in the descent direction. LM addressed this issue by adding a damping factor to the Hessian matrix. This increases the robustness of the algorithm but reduces the convergence rate. (2) Newton’s method requires the function to be twice differentiable. This is a strong requirement in many computer vision applications. For instance, the popular SIFT or HoG features are non-differentiable image operators. In these cases, we can estimate the gradient or the Hessian numerically, but this is typically computationally expensive. (3) The dimension of the Hessian matrix can be large; inverting the Hessian requires O(p3) operations and O(p2) in space, where p is the dimension of the parameter to estimate. Although explicit inversion of the Hessian is not needed using Quasi-Newton methods, it can still be computationally expensive to use these methods in computer vision problems. In order to address previous limitations, this paper proposes the idea of learning descent directions (and rescaling factors) in a supervised manner with a new method called Supervised Descent Method (SDM).

Consider Fig. 1, where the goal is to minimize a Nonlinear Least Squares (NLS) function, f(x)=(h(x)−y)2, where h(x) is a nonlinear function, x is the vector of parameters to optimize, and y is a known scalar. The z-axis has been reversed for visualization purposes. The top image shows the optimization trajectory with Newton’s method. The traditional Newton update has to compute the Hessian and the Jacobian at each step. The bottom image illustrates the main idea behind SDM. In training, SDM starts by sampling a set of initial parameters xi0 around the known minimum x∗. For each sample, the optimal parameter update is also known in training (plotted in dotted lines from xi0 to x∗). SDM will learn a sequence of updates to approximate the ideal trajectory. The SDM updates can be decomposed into two parts: a sample-specific component (e.g., y−h(xik)) and a generic Descent Map (DM) Rk. During training, SDM finds a series of Rk such that they gradually move all xi0 towards x∗. The updates in the first iteration using the learned DM are plotted in black lines. Then, the optimal parameter updates are re-computed (shown in dotted lines) and used for learning the next DM. In testing, f is minimized using the learned DMs without computing the Jacobian or the Hessian. SDM is particularly useful to minimize an NLS problem where the template y is the same in testing (e.g., tracking). However, this is not a limitation, and we will develop extensions of SDM for an unseen y.

We illustrate the benefits of SDM over traditional optimization methods in terms of speed and accuracy in three NLS problems in computer vision. First, we show how to extend the traditional Lucas-Kanade tracker using HoG/SIFT features, resulting in a more robust tracker. Second, we apply SDM to the problem of facial feature detection and show state-of-the-art results. The code is publicly available. Third, we illustrate how SDM improves current methods for 3D pose estimation.

## 2 Theory of SDM

This section provides the theoretical basis behind SDM. Section 2.1 reviews background mathematical definitions. Section 2.2 illustrates the ideas behind SDM and DMs using one-dimensional functions, and Section 2.3 extends it to high-dimensional functions. Section 2.4 and Section 2.5 derive practical algorithms for different situations.

Fig. 2: a) Experimental setup. The colon usage follows Matlab conventions. b,c,d) Convergence curves of xk using a generic DM on three functions. Different colors represent different initializations xi0.

### 2.1 Background (Definitions)

Before deriving SDM, we review two concepts: Lipschitz continuity and monotone operators.

###### Definition 2.1.

A function f(x) is called locally Lipschitz continuous anchored at x∗ if there exists a real constant K ≥ 0 such that

 ∥f(x)−f(x∗)∥2≤K∥x−x∗∥2, ∀x∈U(x∗),

where U(x∗) represents a neighborhood of x∗ and K is referred to as the Lipschitz constant.

Note that this is different from the traditional Lipschitz continuity definition, which is defined for any pair of x’s. The advantage of our definition is that it is applicable to other distance metrics besides the Euclidean distance.

###### Definition 2.2.

A function f(x) is called a locally monotone operator anchored at x∗ if

 ⟨x−x∗,f(x)−f(x∗)⟩≥0, ∀x∈U(x∗),

where ⟨·,·⟩ represents the inner product operator. f(x) is said to be a strictly locally monotone operator anchored at x∗ if

 ⟨x−x∗,f(x)−f(x∗)⟩>0, ∀x∈U(x∗).

### 2.2 One-dimensional Case

This section derives the theory for SDM in 1D functions. Given a 1D NLS problem,

 minxf(x)=minx(h(x)−y)2, (2)

where h(x) is a nonlinear function and y is a known scalar. Applying the chain rule to Eq. 2, the gradient descent update rule yields

 xk=xk−1−αh′(xk−1)(h(xk−1)−y). (3)

Gradient-based methods to optimize Eq. 2 follow Eq. 3 but have different ways to compute the step size α. For example, Newton’s method sets the step size using the inverse of the second derivative f″(xk−1). Computing the step size and gradient direction in high-dimensional spaces is computationally expensive and can be numerically unstable, especially in the case of non-differentiable functions, where finite differences are required to compute estimates of the Hessian and Jacobian. The main idea behind SDM is to avoid explicit computation of the Hessian and Jacobian and learn the “descent direction” (αh′(xk−1)) from training data. During training, SDM samples many different initial configurations in the parameter space and learns a constant r, which drives all samples towards the optimal solution x∗. We define r more formally below.
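A minimal sketch of the gradient descent rule in Eq. 3, assuming a hypothetical monotone function h and a fixed step size α (our own illustrative choices):

```python
import numpy as np

# 1-D NLS: f(x) = (h(x) - y)^2 with a hypothetical h(x) = 2x + sin(x).
h = lambda x: 2.0 * x + np.sin(x)
dh = lambda x: 2.0 + np.cos(x)           # analytic derivative h'(x)

y = h(0.5)                               # template generated at x* = 0.5
x = 3.0                                  # initial estimate
alpha = 0.05                             # fixed step size
for _ in range(200):
    # Eq. 3: x_k = x_{k-1} - alpha * h'(x_{k-1}) * (h(x_{k-1}) - y)
    x = x - alpha * dh(x) * (h(x) - y)

print(abs(x - 0.5))  # close to 0
```

Note that each step needs the derivative h′; SDM's point, developed next, is that this sample-dependent factor can be replaced by a learned constant.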

###### Definition 2.3.

A scalar r is called a generic DM if there exists a scalar 0 < c < 1 such that ∥x∗−xk∥2 ≤ c∥x∗−xk−1∥2, ∀x0 ∈ U(x∗), where x is updated using the following equation:

 xk=xk−1−r(h(xk−1)−h(x∗)). (4)

The existence of a generic DM is guaranteed when both of the following conditions hold:

1. h(x) is strictly monotonic around x∗.

2. h(x) is locally Lipschitz continuous anchored at x∗ with K as the Lipschitz constant.

A detailed proof is presented in Section 8 (Theorem 8.1). Interestingly, the above two conditions are closely related to the essence of a generic DM. The update rule (Eq. 4) can be split into two terms: (1) r (the generic DM), which is sample independent, and (2) h(xk−1)−h(x∗), which is sample dependent. r contains only part of the descent direction and needs to be multiplied by h(xk−1)−h(x∗) to produce a descent direction. Condition 1 ensures that h′(x) does not change sign around x∗, while condition 2 constrains the smoothness of the function, putting an upper bound on the step sizes.

In Fig. 2, we illustrate how to minimize three different functions using a generic DM. The table in Fig. 2a describes our experimental setup for three different functions h(x), with the template h(x∗), different initial values xi0, the optimal value x∗, and the generic DM r. According to Theorem 8.1, the DM is set using the Lipschitz constant K and a small positive number ε, and the Lipschitz constant is computed numerically in a neighborhood of x∗. Figs. 2b-d plot the convergence curves for each function, where xk is updated using Eq. 4. Note that SDM always converges to the optimal value x∗, regardless of the initial value xi0.
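An experiment of the same flavor as Fig. 2 can be sketched as follows; the function h, the minimum x∗, and the choice r = 1/K below are our own illustrative assumptions, not the exact setup of Fig. 2a:

```python
import numpy as np

# Minimize f(x) = (h(x) - h(x*))^2 with a fixed generic DM r (Eq. 4);
# no derivatives are needed at run time. The hypothetical h(x) = 2x + sin(x)
# is strictly monotonic and Lipschitz with constant K = 3 on the real line.
h = lambda x: 2.0 * x + np.sin(x)
x_star = 0.5
K = 3.0
r = 1.0 / K                              # a valid generic DM for this h

for x0 in [-10.0, -1.0, 4.0, 25.0]:      # several initializations x_i^0
    x = x0
    for _ in range(100):
        # Eq. 4: x_k = x_{k-1} - r * (h(x_{k-1}) - h(x*))
        x = x - r * (h(x) - h(x_star))
    assert abs(x - x_star) < 1e-6        # converges regardless of x0
```

For this h, the per-step contraction factor is 1 − r·h′(ξ) ∈ [0, 2/3], which is why the same constant r works for every initialization.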

### 2.3 Multi-dimensional Case

This section extends the concept of generic DM to multi-dimensional functions, where h(x) : Rn → Rm. For multi-dimensional NLS functions, the gradient descent update in Eq. 3 becomes

 xk=xk−1−αAJ⊤h(xk−1)(h(xk−1)−y) (5)

where Jh(xk−1) is the Jacobian matrix, A is the identity matrix I in first-order gradient methods and the inverse Hessian (or an approximation of it) in second-order methods, and α is the step size. A generic DM R exists if there exists a scalar 0 < c < 1 such that

 ∥x∗−xk∥2≤c∥x∗−xk−1∥2, ∀x0∈U(x∗).

The update rule of Eq. 4 becomes

 xk=xk−1−R(h(xk−1)−h(x∗)). (6)

In section 8, we prove the existence of a generic DM if both of the following conditions hold:

1. h(x) is a strictly locally monotone operator anchored at the optimal solution x∗.

2. h(x) is locally Lipschitz continuous anchored at x∗ with K as the Lipschitz constant.

### 2.4 Relaxed SDM

In the previous section, we stated the conditions that ensure the existence of a generic DM. However, in many computer vision applications, such conditions may not hold. In this section, we introduce a relaxed version of the generic DM and derive a practical algorithm. Previously, a single matrix R was used for all samples. In this section, we extend SDM to learn a sequence of matrices {Rk} that moves the expected value of xk towards the optimal solution x∗. The relaxed SDM is a sequential algorithm that learns such DMs from training data.

Let Xk denote a random variable representing the state of x at the kth iteration. In the first iteration (k = 0), we assume that X0 comes from a known distribution and use the lower case p0(x0) to represent its probability density function. The first DM R0 is computed by minimizing the expected loss between the true state and the predicted states, given by

 E∥x∗−X1∥22=E∥x∗−X0+R0(h(X0)−h(x∗))∥22 = ∫x0∥x∗−x0+R0(h(x0)−h(x∗))∥22p0(x0)dx0 ≈ ∑i∥x∗−xi0+R0(h(xi0)−h(x∗))∥22. (7)

Here we have used Monte Carlo sampling to approximate the integral, and each xi0 is drawn from the distribution p0(x0). Both x∗ and h(x∗) are known in training, and minimizing Eq. 7 is simply a linear least squares problem, which can be solved in closed form.

It is unlikely that the first generic DM can achieve the desired minimum in one step for all initial configurations. As in Newton’s method, after an update, we recompute a new generic map (analogous to re-computing the Jacobian and Hessian in Newton’s method). In iteration k, with Rk−1 estimated, each sample is updated to its new location as follows:

 xik=xik−1−Rk−1(h(xik−1)−h(x∗)). (8)

This is equivalent to drawing samples from the conditional distribution p(xk|xk−1). We can use these samples to approximate the expected loss as follows:

 E∥x∗−Xk+1∥22=E∥x∗−Xk+Rk(h(Xk)−h(x∗))∥22 = ∫xk∥x∗−xk+Rk(h(xk)−h(x∗))∥22p(xk|xk−1)dxk ≈ ∑i∥x∗−xik+Rk(h(xik)−h(x∗))∥22. (9)

Minimizing Eq. 9 yields the kth DM Rk. During learning, SDM alternates between minimizing Eq. 9 and updating with Eq. 8 until convergence. In testing, x is updated recursively using Eq. 8 with the learned DMs.
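The alternation between minimizing Eq. 9 (a linear least squares problem solved in closed form) and updating the samples with Eq. 8 can be sketched as follows, on a synthetic multi-dimensional h of our own choosing:

```python
import numpy as np

# Sketch of relaxed SDM training (Eqs. 8-9). h is a hypothetical monotone
# nonlinearity; each R_k is obtained by linear least squares over the samples.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = A @ A.T + 4 * np.eye(4)              # well-conditioned "mixing" matrix
h = lambda X: A @ X + 0.2 * np.tanh(X)   # applied column-wise to samples

x_star = rng.standard_normal((4, 1))     # known minimum in training
X = x_star + 0.5 * rng.standard_normal((4, 200))  # sampled initializations x_i^0

R_list = []
for k in range(5):
    Phi = h(X) - h(x_star)               # sample-dependent term h(x_i^k) - h(x*)
    D = X - x_star                       # Eq. 9 asks for R_k @ Phi ≈ x_i^k - x*
    R = np.linalg.lstsq(Phi.T, D.T, rcond=None)[0].T
    R_list.append(R)
    X = X - R @ Phi                      # Eq. 8: update all samples

err = np.linalg.norm(X - x_star, axis=0).mean()
print(err)  # average training error after 5 learned DMs (near 0 here)
```

In testing, one would simply replay the stored `R_list` with the Eq. 8 update, with no Jacobian or Hessian computation.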

### 2.5 Generalized SDM

Thus far, we have assumed that the same template y = h(x∗) is used in training and in testing (e.g., template tracking). This section presents generalized SDM, which addresses the case where y differs between training and testing. There are two possible scenarios: y is known, and y is unknown. In this section, for simplicity, we only discuss the case when y is known and its application to pose estimation. Section 4.2 discusses the case when y is unknown and its application to face alignment.

The goal of 3D pose estimation is to estimate the 3D rigid transformation x given a 3D model of the object and its projection y = h(x), where h is the projection function (see section 5). In training, the pairs of 3D transformations xi∗ and image projections yi are known. In testing, the projection (2D landmarks) is known, but different from those used in training. To address this problem, we reverse the order of learning. Unlike SDM, the reversed SDM during training always starts at the same initial point x0 and samples different ground truths xi∗ around it. At the same time, for each sample, we compute yi = h(xi∗). The training procedure remains the same as stated in the previous section, except we replace h(x∗) in Eq. 8 with yi,

 xik=xik−1−Rk−1(h(xik−1)−yi), (10)

and we replace x∗ in Eq. 9 with xi∗,

 ∑i∥xi∗−xik+Rk(h(xik)−yi)∥22. (11)

In testing, we start SDM with the same initial value that we used in training.

## 3 Rigid Tracking

This section illustrates how to apply SDM to the problem of tracking rigid objects. In particular, we show how we can extend the classical Lucas-Kanade (LK) method to efficiently track in HoG space. To the best of our knowledge, this is the first algorithm to perform alignment in HoG space.

### 3.1 The Lucas-Kanade Method

The Lucas-Kanade (LK) tracker is one of the earliest and most popular computer vision trackers due to its efficiency and simplicity. It formulates image alignment as an NLS problem. Alignment is achieved by finding the motion parameter p that minimizes

 minp ||d(f(x,p))−t(x)||22, (12)

where t(x) is the template, x is a vector containing the coordinates of the pixels to detect/track, and p is a vector with entries representing a geometric transformation. In this section, we limit the transformation to be affine. That is, p relates the template coordinates (xi, yi) to the image coordinates (ui, vi) by

 [ui; vi] = [p1, p2; p4, p5][xi; yi] + [p3; p6].

The ith entry of d(f(x,p)) is the pixel intensity of the image d at (ui, vi). Minimizing Eq. 12 is an NLS problem because the motion parameters are nonlinearly related to the pixel values (the objective is a composition of two functions).
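For concreteness, the affine map above can be applied to template coordinates as follows (`affine_warp` is a hypothetical helper written for this illustration, not code from the paper):

```python
import numpy as np

# The affine map from the equation above: p is the 6-vector (p1..p6);
# the identity transformation is p = [1, 0, 0, 0, 1, 0].
def affine_warp(points_xy, p):
    """points_xy: (n, 2) array of (x_i, y_i); returns (n, 2) array of (u_i, v_i)."""
    A = np.array([[p[0], p[1]],
                  [p[3], p[4]]])
    b = np.array([p[2], p[5]])
    return points_xy @ A.T + b

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p_identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(affine_warp(pts, p_identity))  # unchanged points
```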

Given a template (often the initial frame), the LK method uses Gauss-Newton to minimize Eq. 12 by linearizing the motion parameters around an initial estimate p0. The kth step of the LK update is

 Δp=H−1k(∂d∂f∂f∂pk)⊤(t(x)−d(f(x,pk))),

where Hk is the Gauss-Newton approximation of the Hessian. The motion parameter is then updated as pk+1 = pk + Δp.

### 3.2 An SDM Solution

The LK method provides a mathematically sound solution for image alignment problems. However, it is not robust to illumination changes. Robustness can be achieved by aligning images w.r.t. some image descriptors instead of pixel intensities, i.e.,

 minp ||h(d(f(x,p)))−h(t(x))||22, (13)

where h is some image descriptor function (HoG, in our case). The LK (Gauss-Newton) update for minimizing Eq. 13 can be derived as follows:

 Δp=H−1k(∂h∂pk)⊤(h(t(x))−h(d(f(x,pk)))).

However, the update can no longer be computed efficiently: HoG is a non-differentiable image operator, and thus the Jacobian ∂h/∂p has to be estimated numerically at each iteration.

In contrast, SDM minimizes Eq. 13 by replacing the computationally expensive term H−1k(∂h/∂pk)⊤ with a pre-trained DM Rk, yielding the following update:

 Δp=Rk(h(t(x))−h(d(f(x,pk)))). (14)

Each update step in SDM is very efficient, mainly consisting of one affine image warping and one HoG descriptor computation of the warped image. One may notice the inconsistency between Eq. 14 and the SDM update we previously introduced in Eq. 8. In the following, we will show the equivalence of the two.

The template can be seen as the HoG descriptors extracted from the template image under an identity transformation, h(t(x)) = h(t(f(x,p∗))), where

 p∗=[1,0,0,0,1,0]⊤.

Under the assumption that only affine deformation is involved, the image on which we perform tracking can be interpreted as the template image under an unknown affine transformation ~p:

 d(x)=t(f(x,~p)).

In Eq. 14, the image warped under the current parameter pk can be re-written as

 d(f(x,pk))=t(f(f(x,~p),pk)). (15)

The composition of two affine transformations remains affine, so Eq. 15 becomes

 d(f(x,pk))=t(f(x,ˆp)),

where ˆp is an unknown affine parameter that differs from p∗. Therefore, we can re-write Eq. 14 as

 Δp=R(h(t(f(x,p∗)))−h(t(f(x,ˆp)))). (16)

Eq. 16 can be further simplified to follow the same form of Eq. 8:

 Δp=R(g(p∗)−g(ˆp)),

where g(p) = h(t(f(x,p))).

In our implementation, we use Eq. 14 as the update rule instead of Eq. 16 because pk is a known parameter w.r.t. the image d, while ˆp is unknown w.r.t. the template image t. The descent maps are learned in the neighborhood of p∗, but as tracking continues, the motion parameter may deviate greatly from p∗. When tracking a new frame, before applying SDM, the image is first warped back using the motion parameter estimated in the previous frame so that the optimization happens within a neighborhood of p∗. Therefore, after SDM finishes, we warp back the estimated p using the same parameter.

The training of SDM involves sampling initial motion parameters and solving a sequence of linear systems (detailed in section 2.4). We sample around p∗, and those samples are used for approximating the expectation expressed in Eq. 9. The details of how we generate initial samples are described in section 6.2.

## 4 Nonrigid Detection and Tracking

In the previous section, we showed how SDM can be used for aligning regions of images that undergo an affine motion. This section extends SDM to detect and track nonrigid objects. In particular, we will show how SDM achieves state-of-the-art performance in facial feature detection and tracking.

### 4.1 Previous Work

This section reviews existing work on face alignment.

Parameterized Appearance Models (PAMs), such as Active Appearance Models [15, 4, 16], Morphable Models [17, 18], Eigentracking, and template tracking [2, 19] build an object appearance and shape representation by performing Principal Component Analysis (PCA) on a set of manually labeled data. Fig. 3a illustrates an image labeled with landmarks (66 in this case). After the images are aligned with Procrustes analysis, a shape model is learned by performing PCA on the registered shapes. A linear combination of shape bases can reconstruct (approximately) any aligned shape in the training set. Similarly, an appearance model is built by performing PCA on the texture. Alignment is achieved by finding the motion parameter p and appearance coefficients ca that best align the image w.r.t. the subspace Ua, i.e.,

 minca,p ||d(f(x,p))−Uaca||22, (17)

In the case of the LK tracker, ca is fixed to be 1 and Ua is a subspace that contains a single vector, the reference template. The notation follows that of Section 3.1; x contains the coordinates of the pixels to track, and p is a vector of parameters that now includes both affine and nonrigid transformations. That is, p relates (xsi, ysi) to (ui, vi) by

 [ui; vi] = [a1, a2; a4, a5][xsi; ysi] + [a3; a6].

Here

 [xs1,ys1,...,xsl,ysl]⊤=¯x+Uscs,

where ¯x is the mean face shape, a and cs are the affine and nonrigid motion parameters respectively, and p = [a; cs]. Similar to the LK method, PAMs algorithms [3, 16, 15, 4] optimize Eq. 17 using the Gauss-Newton method. A more robust formulation of Eq. 17 can be achieved either by replacing the L2 norm with a robust error function [3, 21] or by performing matching on robust features, such as gradient orientations.

Discriminative approaches learn a mapping from image features to motion parameters or landmarks. Cootes et al. proposed to fit AAMs by learning a linear regression between the increment of motion parameters Δp and the appearance differences Δd. The linear regressor is a numerical approximation of the Jacobian. Following this idea, several discriminative methods that learn a mapping from Δd to Δp have been proposed. Gradient Boosting, first introduced by Friedman, has become one of the most popular regressors in face alignment because of its efficiency and its ability to model nonlinearities. Saragih and Göcke and Tresadern et al. showed that using boosted regression for AAM discriminative fitting significantly improved over the original linear formulation. Dollár et al. incorporated “pose indexed features” into the boosting framework, where features are re-computed at the latest estimate of the landmark locations in addition to learning a new weak regressor at each iteration. Beyond gradient boosting, Rivera and Martinez explored kernel regression to map from image features directly to landmark locations, achieving surprising results for low-resolution images. Recently, Cootes et al. investigated Random Forest regressors in the context of face alignment. At the same time, Sánchez et al. proposed to learn a regression model in the continuous domain to efficiently and uniformly sample the motion space. In the context of tracking, Zimmermann et al. learned a set of independent linear predictors for different local motions and then chose a subset of them during tracking. Unlike PAMs, a major problem of discriminative approaches is that the cost function being minimized is unclear, making these algorithms difficult to analyze theoretically. This paper is the first to formulate a precise cost function for discriminative approaches.

Part-based deformable models perform alignment by maximizing the posterior likelihood of part locations given an image. The objective function is composed of the local likelihood of each part times a global shape prior. Different methods typically vary the optimization method or the shape prior. Constrained Local Models (CLM) model this prior similarly to AAMs, assuming all faces lie in a linear subspace spanned by PCA bases. Saragih et al. proposed a nonparametric representation to model the posterior likelihood, and the resulting optimization method is reminiscent of mean-shift. In , the shape prior was modeled nonparametrically from training data. Recently, Saragih derived a sample-specific prior to constrain the output space, providing significant improvements over the original PCA prior. Instead of using a global model, Huang et al. proposed to build separate Gaussian models for each part (e.g., mouth, eyes) to preserve more detailed local shape deformations. Zhu and Ramanan assumed that the face shape is a tree structure (for fast inference), and used a part-based model for face detection, pose estimation, and facial feature detection.

### 4.2 An SDM Solution

Fig. 3: a) Manually labeled image with 66 landmarks. Blue outline indicates face detector. b) Mean landmarks, x0, initialized using the face detector.

Similar to rigid tracking in section 3.2, we perform face alignment in the HoG space. Given an image d, the vector x indexes the landmark coordinates in the image. h is a nonlinear feature extraction function; in our case, h extracts HoG features computed around the landmarks. In this setting, face alignment can be framed as minimizing the following NLS function over landmark coordinates x:

 f(x)=∥h(d(x))−y∗∥22, (18)

where y∗ represents the HoG values computed on the local patches extracted from the manually labeled landmarks. During training, we assume that the correct landmarks (66 in our case) are known, and we will refer to them as x∗ (see Fig. 3a). Also, to simulate the testing scenario, we run the face detector on the training images to provide an initial configuration of the landmarks (x0), which corresponds to an average shape (see Fig. 3b).

Eq. 18 has several fundamental differences from previous work on PAMs (Eq. 17). First, in Eq. 18, we do not learn any model of shape or appearance beforehand from training data. Instead, we align the image w.r.t. a template y∗. For the shape, we optimize the landmark locations directly. Recall that in traditional PAMs, nonrigid motion is modeled as a linear combination of shape bases learned by performing PCA on a training set. Our shape formulation is able to generalize better to untrained situations (e.g., asymmetric facial gestures). Second, we use HoG features extracted from patches around the landmarks to achieve a representation robust to illumination changes.

Fig. 4: Three examples of SDM minimizing the reprojection errors through each step. Blue outlines represent the image projections h(pik) under the current parameter estimates pik. Green outlines are the projections rendered under the ground truth parameters pi∗.

Unknown y: In face alignment, the y used in testing is unknown and different from those used in training (i.e., the test subject is not one of the training subjects). Furthermore, the function h is parametrized not only by x, but also by the image d (i.e., different subjects or different imaging conditions of subjects). Therefore, SDM learns an additional bias term bk to represent the average of y during training. The training step minimizes the expected loss over all initializations and images, where the expected loss is given by

 ∑i,j∥x∗−xi,jk+Rkh(di(xi,jk))−bk∥22. (19)

We use i to index images and j to index initializations. The update of Eq. 8 is thus modified to be

 xi,jk=xi,jk−1−Rk−1h(di(xi,jk−1))+bk−1. (20)

Despite the modification, minimizing Eq. 19 is still a linear least squares problem. Note that we do not use y∗ in training, although they are available. In testing, given an unseen image and an initial guess of the landmarks x0, x is updated recursively using Eq. 20. If the test samples are drawn from the same distribution that produced the training data, each iteration is guaranteed to decrease the expected loss between x∗ and xk.
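One way to solve Eq. 19 for Rk and bk jointly is the standard trick of augmenting each feature vector with a constant 1; the sketch below uses synthetic features and targets in place of HoG and real landmark updates:

```python
import numpy as np

# Sketch of one training iteration of Eq. 19: jointly solving for R_k and the
# bias b_k in closed form by linear least squares on augmented features.
rng = np.random.default_rng(1)
n_land, n_feat, n_samp = 8, 32, 500

Phi = rng.standard_normal((n_feat, n_samp))        # h(d_i(x_k^{i,j})) per sample
R_true = rng.standard_normal((n_land, n_feat))
b_true = rng.standard_normal((n_land, 1))
D = R_true @ Phi - b_true                          # ideal updates x^{i,j}_k - x*

Phi_aug = np.vstack([Phi, np.ones((1, n_samp))])   # append 1 to each feature
# Solve min || D - [R, -b] Phi_aug ||_F^2 in closed form
W = np.linalg.lstsq(Phi_aug.T, D.T, rcond=None)[0].T
R, b = W[:, :-1], -W[:, -1:]

print(np.allclose(R, R_true), np.allclose(b, b_true))
```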

### 4.3 Online SDM

SDM may have poor performance on an unseen sample that is dramatically different from those in the training set. It would be desirable to incorporate this new sample into the existing model without re-training. This section describes such a procedure that updates an existing SDM model in an online fashion.

Assume that one is given a trained SDM model, represented by {Rk, Σk}, where Σk = ΦkΦ⊤k and Φk is the data matrix used in training the kth descent map. For a given new face image and labeled landmarks x∗, one can compute the initial landmark perturbation Δx0 and the feature vector ϕ0 extracted at x∗ + Δx0. Using the well-known recursive least squares algorithm, SDM can be re-trained by iterating the following three steps:

1. Update the inverse covariance matrix:

 Σ−1k←Σ−1k−Σ−1kϕk(w−1+ϕ⊤kΣ−1kϕk)−1ϕ⊤kΣ−1k. (21)

2. Update the generic descent direction:

 Rk←Rk+(Δxk−Rkϕk)wϕ⊤kΣ−1k.

3. Generate a new sample pair for re-training in the next iteration:

 Δxk+1←Δxk−Rkϕk,  ϕk+1←h(d(x∗+Δxk+1)).

Setting the weight w to be 1 treats every sample equally. For different applications, one may want the model to emphasize the more recent samples. For example, SDM with exponential forgetting can be implemented with a small modification of Eq. 21:

 Σ−1k←λ−1[Σ−1k−Σ−1kϕk(λ+ϕ⊤kΣ−1kϕk)−1ϕ⊤kΣ−1k],

where 0 < λ ≤ 1 is a discount parameter. Assuming data points come in order, the weight on the ith of n samples is λn−i. Above, we do not explain the update formula for the bias term bk, since it is often incorporated into Rk by augmenting the feature vector with 1. Note that in Eq. 21, the term in parentheses is a scalar. Since no matrices need to be inverted, our re-training scheme is very efficient, consisting of only a few matrix multiplications and feature extractions.
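The rank-one update in Eq. 21 (with w = 1) can be sketched as follows; synthetic features and targets stand in for HoG and landmark perturbations, and the sketch checks that the online estimate matches a batch least squares fit on the same data:

```python
import numpy as np

# Sketch of online SDM re-training via recursive least squares (Eq. 21, w = 1).
rng = np.random.default_rng(2)
n_out, n_feat, n_samp = 4, 10, 300

Phi = rng.standard_normal((n_feat, n_samp))
DX = rng.standard_normal((n_out, n_samp))          # target updates Δx

Sigma_inv = 1e6 * np.eye(n_feat)                   # large init ≈ uninformative prior
R = np.zeros((n_out, n_feat))
w = 1.0
for k in range(n_samp):
    phi = Phi[:, k:k + 1]
    dx = DX[:, k:k + 1]
    # Eq. 21: Sherman-Morrison rank-one update; the parenthesized term is a scalar
    s = 1.0 / (1.0 / w + (phi.T @ Sigma_inv @ phi).item())
    Sigma_inv = Sigma_inv - s * (Sigma_inv @ phi) @ (phi.T @ Sigma_inv)
    # Step 2: update the descent direction with the (already updated) Sigma_inv
    R = R + (dx - R @ phi) * w @ (phi.T @ Sigma_inv)

R_batch = np.linalg.lstsq(Phi.T, DX.T, rcond=None)[0].T
print(np.abs(R - R_batch).max())  # tiny (up to the prior regularization)
```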

## 5 3D Pose Estimation

In the two applications we have shown thus far, the optimization parameters lie in Euclidean space. In this section, we will show how SDM can also be used to optimize parameters such as a rotation matrix, which belongs to the SO(3) group.

The problem of 3D pose estimation can be described as follows. Given the 3D model of an object represented as 3D points M, its projection U on an image, and the intrinsic camera parameters K, the goal is to estimate the object pose (3D rotation Q and translation t).² This is also known as extrinsic camera parameter calibration.

² Q is used for the rotation matrix to avoid conflict with the DM R.

### 5.1 Previous Work

Object pose estimation is a well-studied problem. If the model and image projection points are perfectly measured, this problem can be solved in closed form by finding the perspective projection matrix [38, 39]. The projection matrix maps the 3D model points to image points in homogeneous coordinates. Since it has 11 unknowns, at least six correspondences are required. However, these approaches are very fragile to noise. Lowe and Yuan improved the robustness of the estimates by minimizing the reprojection error. Since the projection function is nonlinear, they used the Newton-Raphson method to optimize it. However, both algorithms require good initial values to converge, and each iteration is an expensive operation (requiring the pseudo-inverse of the Jacobian). DeMenthon and Davis proposed the accurate and efficient POSIT algorithm, which iteratively finds the object pose by assuming a scaled orthographic projection.

### 5.2 An SDM Solution

The 3D pose estimation problem can also be formulated as a constrained NLS problem that minimizes the reprojection error w.r.t. Q and t:

 minimizeQ,t ∥h(Q,t,M)−U∥F subject to Q⊤Q=I3 and det(Q)=1.

h can be seen as the composition of two functions g1 and g2, which can be written in closed form as follows:

 g1(Q,t,X) =K(QX+1⊤n⊗t), g2(X) =[x⊤1⊘x⊤3x⊤2⊘x⊤3],

where ⊗ represents the Kronecker product, ⊘ denotes element-wise division, and x⊤i is the ith row vector of X. We parameterize the rotation matrix as a function of the Euler angles (θx, θy, θz). Then, the objective function can be simplified into the following unconstrained optimization problem:

$$\min_{\mathbf{p}}\ \|h(\mathbf{p},M)-U\|_F,$$

where $\mathbf{p} = [\theta_x, \theta_y, \theta_z, t^\top]^\top$. We minimize the above function using the reversed SDM introduced in Section 2.5. For training SDM, we sample a set of poses and compute the image projections under each pose. Recall that the training of reversed SDM alternates between the minimization of Eq. 11 and the update of Eq. 10. We rewrite these equations in the context of pose estimation:

$$\mathbf{p}_k^i = \mathbf{p}_{k-1}^i - \mathbf{R}_{k-1}\big(h(\mathbf{p}_{k-1}^i, M) - U^i\big), \tag{22}$$
$$\min_{\mathbf{R}_k}\ \sum_i \big\|\mathbf{p}_*^i - \mathbf{p}_k^i + \mathbf{R}_k\big(h(\mathbf{p}_k^i, M) - U^i\big)\big\|_2^2.$$

In testing, given an unseen $U$, SDM recursively applies the update given by Eq. 22.
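To make the pipeline concrete, here is a minimal NumPy sketch of the Euler-angle parameterization, the projection $h = g_2 \circ g_1$, and the reversed-SDM training/testing loop of Eq. 22. The intrinsics, object points, sampling ranges, and iteration count are all illustrative placeholders, not the paper's settings.

```python
import numpy as np

def euler_to_Q(a, b, c):
    """Rotation matrix from Euler angles (rotations about x, y, z axes)."""
    ca, sa, cb, sb, cc, sc = np.cos(a), np.sin(a), np.cos(b), np.sin(b), np.cos(c), np.sin(c)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cc, -sc, 0], [sc, cc, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def h(p, M, K):
    """h = g2(g1(.)): project 3D points M (3 x n) under pose p = [angles, t]."""
    X = K @ (euler_to_Q(*p[:3]) @ M + p[3:, None])   # g1: rotate, translate, intrinsics
    return (X[:2] / X[2]).ravel()                    # g2: perspective division

rng = np.random.default_rng(0)
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])  # assumed intrinsics
M = rng.uniform(-100, 100, size=(3, 8))                           # toy 8-point "object"

# Training poses: small rotations, translations that keep the object in front.
P_star = np.column_stack([rng.uniform(-0.3, 0.3, (500, 3)),
                          rng.uniform(-50, 50, (500, 2)),
                          rng.uniform(900, 1100, 500)])
U = np.array([h(p, M, K) for p in P_star])

p0 = np.array([0., 0., 0., 0., 0., 1000.])   # fixed initialization
P = np.tile(p0, (500, 1))
maps = []                                    # maps[k] stores R_k transposed
for _ in range(5):
    E = np.array([h(p, M, K) for p in P]) - U     # residuals h(p_k^i, M) - U^i
    D = P_star - P                                # desired updates p_*^i - p_k^i
    X, *_ = np.linalg.lstsq(E, -D, rcond=None)    # least-squares fit of the descent map
    maps.append(X)
    P = P - E @ X                                 # Eq. 22 applied to all training samples

# Testing on an unseen pose.
p_true = np.array([0.1, -0.2, 0.15, 10., -5., 1000.])
u = h(p_true, M, K)
p = p0.copy()
for X in maps:
    p = p - (h(p, M, K) - u) @ X                  # Eq. 22
```

With a narrow training range, a handful of learned maps already pulls the estimate close to the true pose from the fixed initialization.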

Fig. 4 shows three examples of how the reprojection error decreases through each SDM update when performing head pose estimation. In these cases, SDM always starts from the same initialization (see iteration 0 in Fig. 4) and quickly converges to the optimal solution. More results can be found in Section 6.5, along with a comparison against the POSIT algorithm.

## 6 Experiments

This section illustrates the benefits of SDM for solving NLS problems in a synthetic example and three computer vision problems. First, we illustrate SDM on four analytic functions. Second, we show how SDM can be used to track planar objects that undergo an affine transformation in HoG space using an LK-type tracker. Third, we demonstrate how SDM achieves state-of-the-art facial feature detection results on two face-in-the-wild databases. Finally, we illustrate how SDM can be applied to pose estimation and compare it with POSIT.

Fig. 5: Experimental setup for reversed SDM on analytic functions. erf(x) is the error function.

### 6.1 Analytic scalar functions

This experiment compares the speed and accuracy performance of reversed SDM against Newton’s method on four analytic functions. The NLS problem that we optimize is:

$$\min_x f(x) = \big(h(x) - y_*\big)^2,$$

where $h(x)$ is a scalar function (see Fig. 5) and $y_*$ is a given constant. Observe that the first and second derivatives of these functions can be derived analytically. Assume that we have a fixed initialization $x_0$ and that we are given a set of training data $\{x_*^i\}$ and $\{y_*^i\}$. We trained reversed SDM as explained in Section 2.5.

The training and testing setup for each function is shown in Fig. 5 using Matlab notation. We have chosen only invertible functions; otherwise, for a given $y_*$, multiple solutions could be obtained. In the training data, the output variables $y_*^i$ are sampled uniformly in a local region, and their corresponding inputs $x_*^i$ are computed by evaluating the inverse function at $y_*^i$. The test data is generated at a finer resolution than the training data.
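This procedure can be reproduced in a few lines. As a stand-in we use $h(x) = \mathrm{erf}(x)$, which the caption of Fig. 5 lists among the test functions; the sampling ranges and iteration count below are placeholder choices, and the inverse is computed by bisection since `erfinv` is not in the Python standard library.

```python
import math
import numpy as np

h = np.vectorize(math.erf)   # monotonic, invertible test function

def h_inv(y, lo=-4.0, hi=4.0, steps=60):
    """Invert the monotonic h by bisection."""
    lo = np.full_like(y, lo)
    hi = np.full_like(y, hi)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        below = h(mid) < y
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    return 0.5 * (lo + hi)

x0 = 0.0                                  # fixed initialization
y_train = np.linspace(-0.9, 0.9, 181)     # outputs sampled around h(x0)
x_train = h_inv(y_train)                  # inputs via the inverse function

# Reversed SDM training: one scalar descent factor r_k per iteration.
x = np.full_like(y_train, x0)
rs = []
for _ in range(10):
    e = h(x) - y_train                    # residuals h(x_k) - y_*
    d = x_train - x                       # desired updates x_* - x_k
    rs.append(-(d @ e) / (e @ e))         # minimizes sum_i (d_i + r e_i)^2
    x = x - rs[-1] * e

# Testing at a finer resolution than training.
y_test = np.linspace(-0.89, 0.89, 357)
xt = np.full_like(y_test, x0)
for r in rs:
    xt = xt - r * (h(xt) - y_test)
max_err = float(np.max(np.abs(xt - h_inv(y_test))))
```

Each iteration is a closed-form scalar regression, so training is essentially free; the learned sequence of factors plays the role of the step sizes a second-order method would compute from derivatives.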

To measure the accuracy of both methods, we computed the normalized least-squares residuals at the first 10 steps. Fig. 6 shows the convergence comparison between SDM and Newton's method. Surprisingly, SDM converges in the same number of iterations as Newton's method, but each iteration is faster. Moreover, SDM is more robust against bad initializations and ill-conditioning. For example, for one of the functions, Newton's method starts from a saddle point and stays there in the following iterations (observe that in Fig. 6b, Newton's method stays at 1); for another, Newton's method diverges because the problem is ill-conditioned. Not surprisingly, when Newton's method converges, it provides a more accurate estimate than SDM, because SDM uses a generic descent map. If $f$ is quadratic (i.e., $h$ is a linear function of $x$), SDM converges in one iteration, because the average gradient evaluated at different locations is the same for a linear function. This coincides with the well-known fact that Newton's method converges in one iteration for quadratic functions.

Fig. 6: Normalized error versus iterations on four analytic functions using Newton's method and SDM.

### 6.2 Rigid tracking

This section presents tracking results comparing LK and SDM under an affine transformation. We used a publicly available implementation of the LK method for tracking a single template. The experiments are conducted on a public dataset featuring six different planar textures: mansion, sunset, Paris, wood, building, and bricks. Each texture comes with several videos, each corresponding to a different camera motion path or changing illumination conditions. In our experiment, we chose five of the motions: translation, dynamic lighting, in-plane rotation, out-of-plane rotation, and scaling.

Both trackers (SDM and LK) used the same template, extracted from a local region of the texture in the first frame. In our implementation of SDM, the motion parameters were sampled from an isotropic zero-mean Gaussian distribution. We used 300 samples and four iterations to train SDM. The tracker was considered lost if the difference between the template and the back-warped image exceeded a threshold; note that this difference was computed in HoG space.
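The lost-track test amounts to a relative feature-difference check, which can be sketched as follows; the threshold value is a placeholder, since the text does not state the one actually used.

```python
import numpy as np

def is_lost(template_feat, warped_feat, tau=0.3):
    """Declare the track lost when the relative difference between the
    template features and the back-warped image features exceeds tau.
    tau = 0.3 is a placeholder threshold, not the paper's value."""
    rel = (np.linalg.norm(template_feat - warped_feat)
           / max(np.linalg.norm(template_feat), 1e-12))
    return bool(rel > tau)
```

Computing the difference on HoG-like features rather than raw pixels makes the criterion less sensitive to illumination changes, matching the behavior reported below.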

Fig. 7 shows the number of frames successfully tracked by the LK tracker and by SDM. SDM performs better than or as well as the LK tracker in most of the sequences. We observe that SDM performs much better than LK under translation; one possible explanation is that HoG features are more robust to motion blur. Not surprisingly, SDM performs perfectly in the presence of dynamic lighting, because HoG is robust to illumination changes. In-plane rotation tends to be the most challenging motion for SDM, but even in this case its performance is very close to LK's.

Fig. 7: Comparison between SDM and LK on the rigid tracking experiments. Each entry in the table states the number of frames successfully tracked by each algorithm. The total number of frames is given by the number in parentheses in the first column.

### 6.3 Facial feature detection

This section reports experiments on facial feature detection in two “face-in-the-wild” datasets, and compares SDM with state-of-the-art methods. The two face databases are the LFPW dataset  and the LFW-A&C dataset .

The experimental setup is as follows. First, the face is detected using the OpenCV face detector. The evaluation is performed on those images in which a face can be detected; the face detection rates are 96.7% on LFPW and 98.7% on LFW-A&C, respectively. The initial shape estimate is given by centering the mean face in the normalized square. The translational and scaling differences between the initial and true landmark locations are also computed, and their means and variances are used for generating the Monte Carlo samples in Eq. 7. We generated 10 perturbed samples for each training image. HoG descriptors are computed on local patches around each landmark. To reduce the dimensionality of the data, we performed PCA, preserving a fixed fraction of the energy of the image features.
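The number of PCA components for a given energy fraction can be picked directly from the singular values. A minimal sketch, with 0.98 as a placeholder fraction since the exact value is elided here:

```python
import numpy as np

def pca_basis(F, energy=0.98):
    """PCA basis keeping a given fraction of the variance.
    F: n_samples x n_dims matrix of HoG-like features. energy = 0.98 is
    a placeholder; the fraction used in the paper is not stated here."""
    mu = F.mean(axis=0)
    _, S, Vt = np.linalg.svd(F - mu, full_matrices=False)
    var = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(var), energy)) + 1   # smallest k reaching `energy`
    return mu, Vt[:k]

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 200))  # rank-40, 200-D features
mu, B = pca_basis(F)
Z = (F - mu) @ B.T                                           # compressed features
```

The regressors are then trained on the compressed features `Z` instead of the raw descriptors, which keeps the linear systems small and well conditioned.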

The LFPW dataset contains images downloaded from the web that exhibit large variations in pose, illumination, and facial expression. Unfortunately, only image URLs are given and some are no longer valid. We downloaded 884 of the 1132 training images and 245 of the 300 test images. We followed the standard evaluation metric, where the error is measured as the average Euclidean distance between the 29 labeled and predicted landmarks, normalized by the inter-ocular distance.
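This metric is simple to compute; a sketch, with toy landmarks and eye indices passed in explicitly because they depend on the markup:

```python
import numpy as np

def normalized_error(pred, gt, left_eye, right_eye):
    """Average point-to-point Euclidean error divided by the inter-ocular
    distance. pred, gt: (n_landmarks, 2) arrays."""
    mean_dist = np.linalg.norm(pred - gt, axis=1).mean()
    iod = np.linalg.norm(gt[left_eye] - gt[right_eye])
    return mean_dist / iod

gt = np.array([[0., 0.], [10., 0.], [5., 5.]])   # toy markup: eyes at rows 0 and 1
pred = gt + np.array([1., 0.])                   # every prediction off by one pixel
err = normalized_error(pred, gt, left_eye=0, right_eye=1)
```

Normalizing by the inter-ocular distance makes errors comparable across faces of different sizes in the images.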

We compared our approach with two recently proposed methods [33, 46]. Fig. 8 shows the Cumulative Error Distribution (CED) curves of SDM, of Belhumeur et al., and of our method trained with only one linear regression. Note that SDM is different from an AAM trained discriminatively with linear regression, because we do not learn a shape or appearance model. The curves are computed from a subset of the points defined in [33], following the convention used in that work. Clearly, SDM outperforms both baselines. A completely fair comparison is not possible, since the baseline was trained and tested with some images that are no longer available; however, the average error is in favor of our method. The recently proposed method in [46] is based on boosted regression with pose-indexed features and, to the best of our knowledge, reported the state-of-the-art results on the LFPW dataset. No CED curve was given there; they reported a mean error of 3.43. SDM shows comparable performance.

The first two rows of Fig. 12 show our results on faces with large variations in pose and illumination, as well as faces that are partially occluded. The last row displays the 10 worst results as measured by the normalized mean error. Most errors were caused by the gradient features' inability to distinguish between similar facial parts and occluding objects (e.g., glasses frames and eyebrows).

LFW-A&C is a subset of the LFW dataset, consisting of 1116 images of people whose names begin with an 'A' or a 'C'. Each image is annotated with the same 66 landmarks shown in Fig. 3. We compared our method with the Principal Regression Analysis (PRA) method, which proposes a sample-specific prior to constrain the regression output and achieves state-of-the-art results on this dataset. Following that protocol, images of subjects whose names start with 'A' were used for training, giving us a total of 604 images; the remaining images were used for testing. Root mean squared error (RMSE) was used to measure the alignment accuracy. Each image has a fixed size, and the error was not normalized. PRA reported a median alignment error of 2.8 on the test set, while ours averages 2.7. The CED curves are compared in Fig. 8b; our method outperforms both PRA and linear regression. Qualitative results from SDM on the more challenging samples are shown in Fig. 13.

### 6.4 Facial feature tracking

This section evaluates the use of SDM for facial feature tracking. The main idea is to run SDM detection in each frame, but initialize each frame with the landmark estimates from the previous frame.
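The tracking-by-detection loop can be sketched as below. The detector here is a stand-in that simply contracts its initialization toward the (known) ground truth, playing the role of the learned descent updates; the drifting point set stands in for a video.

```python
import numpy as np

rng = np.random.default_rng(1)
truth = rng.uniform(0., 100., size=(5, 2))             # landmarks in frame 0
mean_shape = truth + rng.normal(0., 8., truth.shape)   # generic initialization

def detect(frame_truth, init, n_steps=4):
    """Stand-in for SDM detection: each step contracts the estimate toward
    the true landmarks, as a learned descent update would on real images."""
    x = init.copy()
    for _ in range(n_steps):
        x = x + 0.5 * (frame_truth - x)
    return x

est = mean_shape                                        # frame 0: start from the mean shape
errors = []
for _ in range(30):
    truth = truth + rng.normal(0., 1., truth.shape)     # the face drifts between frames
    est = detect(truth, est)                            # init = previous frame's estimate
    errors.append(np.linalg.norm(est - truth) / np.sqrt(len(truth)))
```

Because inter-frame motion is small, initializing from the previous estimate keeps the detector inside the neighborhood where its learned updates are valid, which is exactly why the perturbation ranges used in training (below) can be narrow.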

We trained our model with 66 landmarks on the MPIE and LFW-A&C datasets. The standard deviations of the scaling and translational perturbations were set to 0.05 and 10, respectively. This implies that the probability of a tracked face shifting more than 20 pixels or scaling significantly between two consecutive frames is very low. We evaluated SDM's tracking performance on two datasets: RU-FACS and Youtube Celebrities.

The RU-FACS dataset consists of 29 sequences of different subjects recorded in a constrained environment. Each sequence has an average of frames. The dataset is labeled with the same 66 landmarks as our trained model, except for the 17 jaw points, which are defined slightly differently (see Fig. 9b); we used the remaining landmarks for evaluation. The ground truth was given by person-specific AAMs. For each of the 29 sequences, the average RMS error and standard deviation are plotted in red in Fig. 9a. The blue lines were generated by a new model re-trained with the first image of each of the 29 sequences, following the online SDM algorithm introduced in Section 4.3; the tracking results improved for all sequences. To put the numerical results in perspective, Fig. 9b shows one tracking result overlaid with the ground truth; in this example, the RMS error is 5.03, and there are no obvious differences between the two labelings. Also, the person-specific AAM gives unreliable results when the subject's face is partially occluded, while SDM still provides a robust estimate (see Fig. 14). Across all frames of the RU-FACS videos, the SDM tracker never lost track, even in cases of partial occlusion.

Fig. 9: a) Average RMS errors and standard deviations on the 29 video sequences of the RU-FACS dataset. Red: original model. Blue: re-trained model. b) The RMS error between the SDM detection (green) and the ground truth (red) is 5.03.

Youtube Celebrities is a public “in-the-wild” dataset composed of videos of celebrities during interviews or TV shows. It contains many sequences of different subjects, but most are shorter than 3 seconds. It was released as a dataset for face tracking and recognition, so no labeled facial landmarks are given. See Fig. 15 for example tracking results on this dataset; tracked video sequences can be found at the link below. From the videos, we can see that SDM reliably tracks facial landmarks under large pose variation (yaw, roll, and pitch), occlusion, and illumination changes. All results are generated without re-initialization.

Both Matlab and C++ implementations of the facial feature detection and tracking can be found at the link below. The C++ implementation averages around 11 ms per frame, tested on an Intel i7 3752M processor. On a mobile platform (iPhone 5s), the tracker runs at 35 fps.

### 6.5 3D pose estimation

This section reports experimental results on 3D pose estimation using SDM, along with a comparison against the widely used POSIT method.

Fig. 10: 3D objects used in the pose estimation experiments. Units are in millimeters (mm).

The experiment is set up as follows. We selected meshes of three different 3D objects: a cube, a face, and a human body (see Fig. 10). We placed a virtual camera at the origin of the world coordinates and fixed the focal length and principal point (in pixels); the skew coefficient was set to zero. The training and testing data were generated by placing a 3D object at a fixed location and perturbing it with different 3D translations and rotations (the POSIT algorithm does not require labeled data). The three rotation angles were uniformly sampled over a fixed range, with different increments for training and for testing. The three translation values were uniformly sampled from -400mm to 400mm, with increments of 200mm in training and 170mm in testing. Then, for each combination of the six values, we computed the object's image projection using the above virtual camera and used it as the input for both algorithms. White noise was added to the projected points. In our implementation of SDM, to ensure numerical stability, the image coordinates of the projection were normalized.
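The exact normalization scheme is not reproduced here; a standard Hartley-style choice (zero mean, average distance $\sqrt{2}$) would look like:

```python
import numpy as np

def normalize_points(u):
    """Zero-mean, unit-average-distance normalization of 2D image points.
    The paper's exact scheme is not given here; this Hartley-style
    normalization is one standard choice for numerical stability."""
    u = np.asarray(u, dtype=float)
    mu = u.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(u - mu, axis=1))
    return (u - mu) * scale, mu, scale

rng = np.random.default_rng(0)
pts = rng.uniform(0., 640., size=(8, 2))   # projected points in pixel coordinates
v, mu, s = normalize_points(pts)
```

Keeping the regression inputs at a uniform scale prevents the large pixel magnitudes from dominating the least-squares fits of the descent maps.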

Fig. 11a shows the mean errors and standard deviations of the estimated rotations (in degrees) and translations (in mm) for both algorithms. Both algorithms achieved similar accuracy for rotation estimation, but SDM was more accurate for translation. This is because POSIT assumes a scaled orthographic projection, while the true image points are generated by a perspective projection. Fig. 11b reports the computational time in milliseconds (ms) for both methods. Both algorithms are efficient, although POSIT is slightly faster than SDM.

Fig. 11: Accuracy and speed comparison between the SDM and POSIT algorithms for estimating 3D object pose. a) Rotation (in degrees) and translation (in mm) errors and their standard deviations. b) Running times (in ms) of both algorithms.

## 7 Conclusions

This paper presents SDM, a method for solving NLS problems. SDM learns generic DMs in a supervised manner, and is able to overcome many drawbacks of second-order optimization schemes, such as non-differentiability and the expensive computation of Jacobians and Hessians. Moreover, it is extremely fast and accurate. We have illustrated the benefits of our approach on three important computer vision problems, achieving state-of-the-art performance in facial feature detection.

Beyond SDM, an important contribution of this work in the context of discriminative face alignment is the cost function for discriminative image alignment proposed in Eq. 18. Existing discriminative methods for facial alignment pose the problem as a regression, but lack a well-defined alignment error function, which makes theoretical analysis of these methods difficult. Eq. 18 establishes a direct connection with existing PAMs, and existing algorithms, such as Gauss-Newton or the supervised version proposed in this paper, can be used to minimize it. In addition, we were able to use robust feature representations (e.g., HoG) within commonly used alignment algorithms, greatly improving robustness and performance.

While SDM has been shown to be a promising alternative to second-order methods for solving NLS problems, according to Theorem 8.2 the generic DM only exists within a local neighborhood of the optimal parameters. When SDM is trained over a large range of parameters, the results may not be accurate. For instance, in the pose estimation experiment, when we trained SDM with a larger range of rotations, the performance dropped dramatically. A simple way to address this problem is to partition the output space into grids and train a different SDM for each grid cell; during testing, we can iterate over all the models and return the solution that gives the minimum reprojection error. We will explore the use of SDM for global optimization in future work.

## 8 Appendix

###### Theorem 8.1.

If the function $h(x)$ satisfies the following two conditions:

1. $h(x)$ is monotonic around the minimum $x_*$,

2. $h(x)$ is locally Lipschitz continuous anchored at $x_*$ with Lipschitz constant $K$,

then $r$ is a generic DM if $h$ is monotonically increasing and $0 < r < \frac{2}{K}$ (and $-\frac{2}{K} < r < 0$ when $h$ is monotonically decreasing).

###### Proof:

Without loss of generality, assume that $h$ is monotonically increasing and that $x_{k-1}^i \neq x_*$ (otherwise, the optimization has already reached the minimum). In the following, we use $\Delta x_{k-1}^i$ to denote $x_* - x_{k-1}^i$ and $\Delta h_{k-1}^i$ to denote $h(x_*) - h(x_{k-1}^i)$. We want to find $r$ such that

$$\frac{|\Delta x_k^i|}{|\Delta x_{k-1}^i|} < 1, \quad \text{if } x_* \neq x_{k-1}^i. \tag{23}$$

We replace $\Delta x_k^i$ with $\Delta x_{k-1}^i - r\,\Delta h_{k-1}^i$ using Eq. 4, and the left side of Eq. 23 becomes

$$\frac{|\Delta x_k^i|}{|\Delta x_{k-1}^i|} = \frac{|\Delta x_{k-1}^i - r\,\Delta h_{k-1}^i|}{|\Delta x_{k-1}^i|} = \left|1 - r\,\frac{\Delta h_{k-1}^i}{\Delta x_{k-1}^i}\right| = \left|1 - r\,\frac{|\Delta h_{k-1}^i|}{|\Delta x_{k-1}^i|}\right|. \tag{24}$$

The last step follows from condition 1. Denoting $\frac{|\Delta h_{k-1}^i|}{|\Delta x_{k-1}^i|}$ as $K_{k-1}^i$ and requiring Eq. 24 to be less than one gives us

$$-1 < 1 - rK_{k-1}^i < 1 \;\Rightarrow\; 0 < r < \frac{2}{K_{k-1}^i}. \tag{25}$$

From condition 2, we know that $K_{k-1}^i \leq K$. Any $r \in (0, \frac{2}{K})$ will satisfy the inequalities in Eq. 25, and therefore a generic DM exists. Similarly, one can show that $-r$ is a generic DM when $h$ is monotonically decreasing. ∎
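The scalar bound can be sanity-checked numerically; tanh is used here only as a convenient monotonically increasing test function whose Lipschitz constant anchored at any point is at most 1.

```python
import numpy as np

h = np.tanh        # monotonically increasing; |h(x) - h(x*)| <= |x - x*|, so K = 1
r = 1.5            # any r in (0, 2/K) = (0, 2) should act as a generic DM
x_star, x = 0.3, 2.0
errs = [abs(x_star - x)]
for _ in range(20):
    x = x + r * (h(x_star) - h(x))   # equivalently, Δx_k = Δx_{k-1} − r Δh_{k-1}
    errs.append(abs(x_star - x))
ratios = [b / a for a, b in zip(errs, errs[1:])]   # |Δx_k| / |Δx_{k-1}|, cf. Eq. 23
```

Every ratio stays strictly below one, as Eq. 25 predicts for $r$ inside the admissible interval.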

###### Theorem 8.2.

If the function $h(\mathbf{x})$ satisfies the following two conditions:

1. $h(\mathbf{x})$ is a locally monotone operator anchored at the minimum $\mathbf{x}_*$,

2. $h(\mathbf{x})$ is locally Lipschitz continuous anchored at $\mathbf{x}_*$ with Lipschitz constant $K$,

then there exists a generic DM $\mathbf{R}$.

###### Proof.

To simplify the notation, we denote $\mathbf{x}_* - \mathbf{x}_k^i$ as $\Delta\mathbf{x}_k^i$, $h(\mathbf{x}_*) - h(\mathbf{x}_k^i)$ as $\Delta\mathbf{h}_k^i$, and use $\|\cdot\|$ to represent the L-2 norm. We want to show that there exists $\mathbf{R}$ such that

$$\frac{\|\mathbf{x}_* - \mathbf{x}_k^i\|}{\|\mathbf{x}_* - \mathbf{x}_{k-1}^i\|} < 1, \quad \text{if } \mathbf{x}_* \neq \mathbf{x}_{k-1}^i. \tag{26}$$

We replace $\Delta\mathbf{x}_k^i$ with $\Delta\mathbf{x}_{k-1}^i - \mathbf{R}\,\Delta\mathbf{h}_{k-1}^i$ using Eq. 6, and squaring the left side of Eq. 26 gives us

$$\frac{\|\Delta\mathbf{x}_k^i\|^2}{\|\Delta\mathbf{x}_{k-1}^i\|^2} = \frac{\|\Delta\mathbf{x}_{k-1}^i - \mathbf{R}\,\Delta\mathbf{h}_{k-1}^i\|^2}{\|\Delta\mathbf{x}_{k-1}^i\|^2} = 1 + \frac{\|\mathbf{R}\,\Delta\mathbf{h}_{k-1}^i\|^2}{\|\Delta\mathbf{x}_{k-1}^i\|^2} - \frac{2\,\Delta\mathbf{x}_{k-1}^{i\top}\mathbf{R}\,\Delta\mathbf{h}_{k-1}^i}{\|\Delta\mathbf{x}_{k-1}^i\|^2}$$