1 Introduction
The Lucas & Kanade (LK) algorithm [1]
has been a popular approach for tackling dense alignment problems for images and objects. At the heart of the algorithm is the assumption that an approximate linear relationship exists between pixel appearance and geometric displacement. Such a relationship is seldom exactly linear, so a linearization process is typically repeated until convergence. Since pixel intensities are not deterministically differentiable with respect to geometric displacement, the linear relationship must instead be established stochastically through a learning process. One of the most notable properties of the LK algorithm is how efficiently this linear relationship can be estimated. This efficiency stems from the assumption of independence across pixel coordinates: the parameters describing this linear relationship are classically referred to as image gradients. In practice, these image gradients are estimated through finite differencing operations. Numerous extensions and variations upon the LK algorithm have subsequently been explored in the literature
[2], and recent work has also demonstrated the utility of the LK framework [3, 4, 5] using classical dense descriptors such as dense SIFT [6], HOG [7], and LBP [8].

A drawback to the LK algorithm and its variants, however, is their generative nature. Specifically, they attempt to synthesize, through a linear model, how appearance changes as a function of geometric displacement, even though the end goal is the inverse problem. Recently, Xiong & De la Torre [9, 10, 11] proposed a new approach to image alignment known as the Supervised Descent Method (SDM). SDM shares similar properties with the LK algorithm in that it also attempts to establish the relationship between appearance and geometric displacement using a sequence of linear models. One marked difference, however, is that SDM directly learns how geometric displacement changes as a function of appearance. This can be viewed as estimating the conditional likelihood function $p(\mathbf{y} \mid \mathbf{x})$, where $\mathbf{y}$ and $\mathbf{x}$ are geometric displacement and appearance respectively. As reported in the literature [12] (and also confirmed by our own experiments in this paper), this can lead to substantially improved performance over classical LK, as the learning algorithm is focused directly on the end goal (i.e. estimating geometric displacement from appearance).
Although it exhibits many favorable properties, SDM also comes with disadvantages. Specifically, due to its non-generative nature, SDM cannot take advantage of the pixel independence assumption enjoyed by classical LK (see Section 4 for a full treatment of this asymmetric property). Instead, it needs to model full dependence across all pixels, which requires: (i) a large amount of training data, and (ii) ad-hoc regularization strategies in order to avoid a poorly conditioned linear system. Furthermore, SDM does not utilize prior knowledge of the type of geometric warp function being employed (e.g. similarity, affine, homography, point distribution model, etc.), which further simplifies the learning problem in classical LK.
In this paper, we propose a novel approach which, like SDM, attempts to learn a linear relationship between geometric displacement directly as a function of appearance. However, unlike SDM, we enforce that the pseudo-inverse of this linear relationship enjoys the generative independence assumption across pixels while utilizing prior knowledge of the parametric form of the geometric warp. We refer to our proposed approach as the Conditional LK algorithm. Experiments demonstrate that our approach achieves performance comparable to, and in many cases better than, SDM across a myriad of tasks with substantially fewer training examples. We also show that our approach does not require any ad-hoc regularization term, and it exhibits the unique property of being able to “swap” the type of warp function being modeled (e.g. replace a homography with an affine warp function) without the need to retrain. Finally, our approach offers some unique theoretical insights into the redundancies that exist when attempting to learn efficient object/image aligners through a conditional paradigm.
Notations. We define our notations throughout the paper as follows: lowercase boldface symbols (e.g. $\mathbf{x}$) denote vectors, uppercase boldface symbols (e.g. $\mathbf{R}$) denote matrices, and uppercase calligraphic symbols (e.g. $\mathcal{I}$) denote functions. We treat images as a function of the warp parameters, and we use the notation $\mathcal{I}(\mathbf{p})$ to indicate sampling of the $K$-channel image representation at $D$ subpixel locations. Common examples of multichannel image representations include descriptors such as dense SIFT, HOG and LBP. We assume $K = 1$ when dealing with raw grayscale images.

2 The Lucas & Kanade Algorithm
At its heart, the Lucas & Kanade (LK) algorithm utilizes the assumption that,
\[ \mathcal{I}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathcal{I}(\mathbf{x}) + \nabla\mathcal{I}(\mathbf{x})\,\Delta\mathbf{x} \tag{1} \]
where $\mathcal{I}: \mathbb{R}^2 \rightarrow \mathbb{R}^K$ is the image function representation and $\nabla\mathcal{I}: \mathbb{R}^2 \rightarrow \mathbb{R}^{K \times 2}$ is the image gradient function at pixel coordinate $\mathbf{x}$. In most instances, a useful image gradient function
can be efficiently estimated through finite differencing operations. An alternative strategy is to treat the problem of gradient estimation as a per-pixel linear regression problem, where pixel intensities are sampled around a neighborhood in order to “learn” the image gradients
[4]. A focus of this paper is to explore this idea further by examining more sophisticated conditional learning objectives for learning image gradients.

For a given geometric warp function $\mathcal{W}(\mathbf{x}; \mathbf{p})$ parameterized by the warp parameters $\mathbf{p} \in \mathbb{R}^P$, one can thus express the classic LK algorithm as minimizing the sum of squared differences (SSD) objective,
\[ \min_{\Delta\mathbf{p}} \sum_{\mathbf{x}} \left\| \mathcal{I}(\mathcal{W}(\mathbf{x};\mathbf{p})) + \nabla\mathcal{I}(\mathcal{W}(\mathbf{x};\mathbf{p})) \frac{\partial \mathcal{W}(\mathbf{x};\mathbf{p})}{\partial \mathbf{p}} \Delta\mathbf{p} - \mathcal{T}(\mathbf{x}) \right\|_2^2 \tag{2} \]
which can be viewed as a quasi-Newton update. The parameter $\mathbf{p}$ is the initial warp estimate, $\Delta\mathbf{p}$ is the warp update being estimated, and $\mathcal{T}$ is the template image we desire to align the source image $\mathcal{I}$ against. The pixel coordinates $\mathbf{x}$ are taken with respect to the template image’s coordinate frame, and $\frac{\partial \mathcal{W}(\mathbf{x};\mathbf{p})}{\partial \mathbf{p}}$ is the warp Jacobian. After solving Equation 2, the current warp estimate has the following additive update,
\[ \mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p} \tag{3} \]
As the relationship between appearance and geometric deformation is not solely linear, Equations 2 and 3 must be applied iteratively until convergence is achieved.
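As noted above, the image gradients in Equation 1 are classically obtained through finite differencing. The following is a minimal illustrative sketch (our own; the function name and boundary handling are assumptions, not the paper's implementation) of central-difference gradient estimation for a grayscale image:

```python
import numpy as np

def image_gradients(img):
    """Estimate per-pixel image gradients with central finite differences.

    img: (H, W) grayscale image. Returns an (H, W, 2) array holding the
    horizontal and vertical derivative at every pixel -- the classical
    stand-in for the linear appearance/displacement relationship.
    """
    img = np.asarray(img, dtype=np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = 0.5 * (img[:, 2:] - img[:, :-2])  # d/dx (columns)
    gy[1:-1, :] = 0.5 * (img[2:, :] - img[:-2, :])  # d/dy (rows)
    return np.stack([gx, gy], axis=-1)
```

For a horizontal intensity ramp, the interior horizontal derivative is exactly 1 and the vertical derivative is 0, as expected.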
2.0.1 Inverse compositional fitting.
The canonical LK formulation presented in the previous section is sometimes referred to as the forwards additive (FA) algorithm [2]. A fundamental problem with the forwards additive approach is that it requires recomputing the image gradient and warp Jacobian in each iteration, greatly impacting computational efficiency. Baker and Matthews [2] devised a computationally efficient extension to forwards additive LK, which they refer to as the inverse compositional (IC) algorithm. The ICLK algorithm attempts to iteratively solve the objective
\[ \min_{\Delta\mathbf{p}} \sum_{\mathbf{x}} \left\| \mathcal{I}(\mathcal{W}(\mathbf{x};\mathbf{p})) - \mathcal{T}(\mathcal{W}(\mathbf{x};\Delta\mathbf{p})) \right\|_2^2 \tag{4} \]
followed by the inverse compositional update
\[ \mathbf{p} \leftarrow \mathbf{p} \circ (\Delta\mathbf{p})^{-1} \tag{5} \]
where we have abbreviated the notation $\mathbf{p} \circ \Delta\mathbf{p}$ to be the composition of the warp functions parametrized by $\mathbf{p}$ and $\Delta\mathbf{p}$, and $(\Delta\mathbf{p})^{-1}$ to be the parameters of the inverse warp function parametrized by $\Delta\mathbf{p}$. We can express Equation 4 in vector form as
\[ \min_{\Delta\mathbf{p}} \left\| \mathcal{I}(\mathbf{p}) - \mathcal{T}(\mathbf{0}) - \mathbf{W}\Delta\mathbf{p} \right\|_2^2 \tag{6} \]
where
\[ \mathbf{W} = \begin{bmatrix} \nabla\mathcal{T}(\mathbf{x}_1) & & \\ & \ddots & \\ & & \nabla\mathcal{T}(\mathbf{x}_D) \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{W}(\mathbf{x}_1;\mathbf{0})}{\partial \mathbf{p}} \\ \vdots \\ \frac{\partial \mathcal{W}(\mathbf{x}_D;\mathbf{0})}{\partial \mathbf{p}} \end{bmatrix} \quad \text{and} \quad \mathcal{I}(\mathbf{p}) = \begin{bmatrix} \mathcal{I}(\mathcal{W}(\mathbf{x}_1;\mathbf{p})) \\ \vdots \\ \mathcal{I}(\mathcal{W}(\mathbf{x}_D;\mathbf{p})) \end{bmatrix}. \]
Here, $\mathbf{0}$ is considered the identity warp (i.e. $\mathcal{W}(\mathbf{x};\mathbf{0}) = \mathbf{x}$). It is easy to show that the solution to Equation 6 is given by
\[ \Delta\mathbf{p} = \mathbf{R}\left[ \mathcal{I}(\mathbf{p}) - \mathcal{T}(\mathbf{0}) \right] \tag{7} \]
where $\mathbf{R} = \mathbf{W}^{\dagger}$. The superscript $\dagger$ denotes the Moore-Penrose pseudo-inverse operator. The IC form of the LK algorithm comes with a great advantage: the gradients $\nabla\mathcal{T}(\mathbf{x})$ and warp Jacobian $\frac{\partial \mathcal{W}(\mathbf{x};\mathbf{0})}{\partial \mathbf{p}}$ are evaluated at the identity warp $\mathbf{0}$, regardless of the iteration and the current state of $\mathbf{p}$. This means that $\mathbf{R}$ remains constant across all iterations, making it advantageous over other variants in terms of computational complexity. For the rest of this paper, we shall focus on the IC form of the LK algorithm.
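To make the precomputation concrete, below is a hedged sketch (entirely our own, not the paper's code) of the IC-LK loop for a pure-translation warp, where the warp Jacobian is the 2×2 identity, so the regressor $\mathbf{R} = \mathbf{W}^{\dagger}$ is formed once outside the loop:

```python
import numpy as np

def iclk_translation(template, grad_t, sample, p0, n_iters=100, tol=1e-8):
    """Sketch of inverse compositional LK for a pure-translation warp.

    template : (D,) vectorized template T(0)
    grad_t   : (D, 2) template gradients evaluated at the identity warp
    sample   : callable p -> (D,) source image sampled at warp parameters p
    p0       : initial translation estimate (2-vector)

    For translation the warp Jacobian is the identity, so W equals the
    template gradients and the regressor R = pinv(W) is precomputed once,
    outside the iteration loop.
    """
    W = np.asarray(grad_t, dtype=float)   # W = grad_T @ J, with J = I_2 here
    R = np.linalg.pinv(W)                 # fixed regressor, reused every iteration
    p = np.array(p0, dtype=float)
    for _ in range(n_iters):
        dp = R @ (sample(p) - template)   # linear solve, as in Equation 7
        p -= dp  # inverse compositional update (plain subtraction for translation)
        if np.linalg.norm(dp) < tol:
            break
    return p
```

On a smooth synthetic image the loop recovers a known translation from a perturbed initialization in a handful of iterations.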
3 Supervised Descent Method
Despite exhibiting good performance on many image alignment tasks, the LK algorithm can be problematic to use when there is no specific template image to align against. For many applications, one may be given only an ensemble of $N$ ground-truth images and warps $\{\mathcal{I}^{(n)}, \mathbf{p}^{(n)}\}_{n=1}^{N}$ of the object of interest. If one has prior knowledge of the distribution of warp displacements to be encountered, one can synthetically generate examples to form a much larger set $\mathcal{S} = \{\Delta\mathbf{p}^{(m)}, \mathcal{I}^{(m)}, \mathbf{p}^{(m)}\}_{m=1}^{M}$ to learn from, where $M \gg N$. In these circumstances, a strategy recently put forward known as the Supervised Descent Method (SDM) [9] has exhibited state-of-the-art performance across a number of alignment tasks, most notably facial landmark alignment. The approach attempts to directly learn a regression matrix $\mathbf{R}$ that minimizes the following SSD objective,
\[ \min_{\mathbf{R}} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \mathbf{R}\left[ \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0}) \right] \right\|_2^2 + \Omega(\mathbf{R}) \tag{8} \]
The template image $\mathcal{T}(\mathbf{0})$ can be learned either jointly with $\mathbf{R}$ or taken to be the average of the ground-truth images [11].
3.0.1 Regularization.
$\Omega$ is a regularization function used to ensure that the solution for $\mathbf{R}$ is unique. To understand the need for this regularization, one can rewrite Equation 8 in matrix form as
\[ \min_{\mathbf{R}} \left\| \mathbf{Y} - \mathbf{R}\mathbf{X} \right\|_F^2 + \Omega(\mathbf{R}) \tag{9} \]
where the $m$-th columns of $\mathbf{Y}$ and $\mathbf{X}$ are $\Delta\mathbf{p}^{(m)}$ and $\mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0})$, respectively.
Here, $\|\cdot\|_F$ indicates the matrix Frobenius norm. Without the regularization term $\Omega(\mathbf{R})$, the solution to Equation 9 is $\mathbf{R} = \mathbf{Y}\mathbf{X}^{\top}(\mathbf{X}\mathbf{X}^{\top})^{-1}$. It is understood in the literature that raw pixel representations of natural images stem from certain frequency spectra [13], leading to an auto-covariance matrix $\mathbf{X}\mathbf{X}^{\top}$ that is poorly conditioned in nearly all circumstances. It has been demonstrated [13] that this property stems from the fact that image intensities in natural images are highly correlated in close spatial proximity, but this dependence drops off as a function of spatial distance.
In our experiments, we have found that $\mathbf{X}\mathbf{X}^{\top}$ is always poorly conditioned, even when utilizing other image representations such as dense SIFT, HOG, and LBP descriptors. As such, it is clear that some sort of regularization term is crucial for effective SDM performance. As commonly advocated and practiced, we employed a weighted Tikhonov penalty term $\Omega(\mathbf{R}) = \lambda\|\mathbf{R}\|_F^2$, where $\lambda$ controls the weight of the regularizer. We found this choice to work well in our experiments.
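The Tikhonov-regularized solve of the matrix-form objective has the familiar ridge-regression closed form; the following is an illustrative sketch (our own, with column-wise sample layout assumed):

```python
import numpy as np

def sdm_regressor(Y, X, lam):
    """Closed-form solve of min_R ||Y - R X||_F^2 + lam ||R||_F^2.

    Y   : (P, M) warp perturbations, one column per synthetic sample
    X   : (D, M) appearance differences I(p o dp) - T(0)
    lam : Tikhonov weight; keeps X X^T invertible even when the raw-pixel
          auto-covariance matrix is poorly conditioned.
    """
    D = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(D))
```

With a negligible penalty and well-conditioned data the ground-truth linear map is recovered; increasing the penalty shrinks the regressor toward zero, which is the regularization effect discussed above.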
3.0.2 Iterationspecific Regressors.
Unlike the ICLK approach, which employs a single regressor/template pair applied iteratively until convergence, SDM learns a separate regressor/template pair for each iteration (sometimes referred to as layers). On the other hand, like the ICLK algorithm, these regressors are precomputed in advance and are thus independent of the current image and warp estimate. As a result, SDM is computationally efficient, just like ICLK. The regressor/template pair $\{\mathbf{R}^{(l)}, \mathcal{T}^{(l)}(\mathbf{0})\}$ of the $l$-th layer is learned from the synthetically generated set $\mathcal{S}^{(l)}$ within Equation 8, which we define to be
\[ \mathcal{S}^{(l)} = \{\Delta\mathbf{p}^{(m,l)}, \mathcal{I}^{(m)}, \mathbf{p}^{(m)}\}_{m=1}^{M} \tag{10} \]
where
\[ \Delta\mathbf{p}^{(m,l+1)} = \Delta\mathbf{p}^{(m,l)} - \mathbf{R}^{(l)}\left[ \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m,l)}) - \mathcal{T}^{(l)}(\mathbf{0}) \right] \tag{11} \]
For the first iteration ($l = 1$), the warp perturbations are generated from a predetermined random distribution; for every subsequent iteration, the warp perturbations are resampled from the same distribution to ensure that each iteration’s regressor does not overfit. Once learned, SDM is applied in practice by employing the update in Equation 11.
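One plausible reading of this layer-wise training, sketched under our own assumptions (residuals propagated between layers in the spirit of Equations 10 and 11; all names are illustrative):

```python
import numpy as np

def train_cascade(sample_appearance, template, draw_perturbations,
                  n_layers=3, n_samples=200, lam=1e-3):
    """Learn iteration-specific (layer-wise) regressors, SDM style.

    sample_appearance : callable dp -> (D,) appearance at the perturbed warp
    template          : (D,) vectorized template T(0)
    draw_perturbations: callable n -> (n, P) warp perturbations for layer 1
    Returns a list of (P, D) regressors, one per layer.
    """
    regressors = []
    dps = draw_perturbations(n_samples)
    for _ in range(n_layers):
        X = np.stack([sample_appearance(dp) - template for dp in dps], axis=1)
        Y = dps.T
        D = X.shape[0]
        R = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(D))  # ridge solve
        regressors.append(R)
        dps = dps - (R @ X).T   # next layer trains on the residual perturbations
    return regressors
```

When the appearance model is exactly linear in the perturbation, the first layer already recovers the displacement almost perfectly, and subsequent layers see near-zero residuals.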
3.0.3 Inverse Compositional Warps.
It should be noted that nothing in the original treatment [9] of SDM precludes the use of compositional warps. In fact, the original work on facial landmark alignment advocated an additive update strategy. In this paper, however, we have chosen to employ inverse compositional warp updates as: (i) we obtained better results for our experiments with planar warp functions, (ii) we observed almost no difference in performance for non-planar warp functions such as those involved in face alignment, and (iii) it is only through the employment of inverse compositional warps within the LK framework that a firm theoretical motivation for fixed regressors can be entertained. Furthermore, we have found that keeping a close mathematical relationship to the ICLK algorithm is essential for the motivation of our proposed approach.
4 The Conditional Lucas & Kanade Algorithm
Although enjoying impressive results across a myriad of image alignment tasks, SDM does have disadvantages when compared to ICLK. First, it requires large amounts of synthetically warped image data. Second, it requires an ad-hoc regularization strategy to ensure good conditioning of the linear system. Third, the mathematical properties of the warp function parameters being predicted are ignored. Finally, it reveals little about the actual degrees of freedom necessary in the set of regressor matrices being learned through the SDM process.
In this paper, we put forward an alternative strategy for directly learning a set of iterationspecific regressors,
\[ \min_{\nabla\mathcal{T}(\mathbf{0})} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \mathbf{R}\left[ \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0}) \right] \right\|_2^2 \tag{12} \]
\[ \text{s.t.} \quad \mathbf{R} = \mathbf{W}^{\dagger}, \qquad \mathbf{W} = \begin{bmatrix} \nabla\mathcal{T}(\mathbf{x}_1;\mathbf{0}) & & \\ & \ddots & \\ & & \nabla\mathcal{T}(\mathbf{x}_D;\mathbf{0}) \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{W}(\mathbf{x}_1;\mathbf{0})}{\partial \mathbf{p}} \\ \vdots \\ \frac{\partial \mathcal{W}(\mathbf{x}_D;\mathbf{0})}{\partial \mathbf{p}} \end{bmatrix} \tag{13} \]
At first glance, this objective may seem strange, as we are proposing to learn template “image gradients” $\nabla\mathcal{T}(\mathbf{0})$ within a conditional objective. As previously discussed in [4], this idea deviates from the traditional view of what image gradients are: parameters derived from heuristic finite differencing operations. In this paper, we prefer to subscribe to the alternate view that image gradients are simply weights that can be, and should be, learned from data. The central motivation for this objective is to enforce the parametric form of the generative ICLK formulation through a conditional objective.
An advantage of the Conditional LK approach is the reduced number of model parameters. Comparing the model parameters of Conditional LK against SDM, there is a reduction in the degrees of freedom needing to be learned for most warp functions, where $P > 2$. More fundamental, however, is the employment of the generative pixel independence assumption described originally in Equation 1. This independence assumption is useful as it ensures that a unique $\mathbf{R}$ can be found in our objective without any extra penalty terms such as Tikhonov regularization. In fact, we propose that the sparse matrix structure of image gradients within the pseudo-inverse of $\mathbf{W}$ acts as a much more principled form of regularization than those commonly employed within the SDM framework.
A further advantage of our approach is that, like the ICLK framework, it utilizes prior knowledge of the warp Jacobian function $\frac{\partial \mathcal{W}(\mathbf{x};\mathbf{0})}{\partial \mathbf{p}}$ during the estimation of the regression matrix $\mathbf{R}$. Our insight here is that the estimation of the regression matrix using a conditional learning objective should be simplified (in terms of the degrees of freedom to learn) if one has prior knowledge of the deterministic form of the geometric warp function.
A drawback to the approach, in comparison to both the SDM and ICLK frameworks, is the nonlinear form of the objective. This requires us to resort to nonlinear optimization methods, which are not as straightforward as linear regression solutions. However, as we discuss in more detail in the experimental portion of this paper, a Levenberg-Marquardt optimization strategy obtains good results in nearly all circumstances. Furthermore, compared to SDM, we demonstrate that good solutions can be obtained with significantly fewer training samples.
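To give a flavor of the optimization, below is a crude numpy-only sketch, entirely our own construction: it uses finite-difference Jacobians instead of the analytic derivatives derived in the appendix, stores the single-channel per-pixel gradients as a (D, 2) array, and all names are illustrative:

```python
import numpy as np

def build_W(g, J):
    # g: (D, 2) per-pixel gradients, J: (D, 2, P) per-pixel warp Jacobians
    return np.einsum('dk,dkp->dp', g, J)               # (D, P)

def conditional_loss(g, J, X, Y):
    """Conditional objective: sum_m ||dp_m - pinv(W) x_m||^2, with W = G J."""
    R = np.linalg.pinv(build_W(g, J))                  # (P, D) regressor
    return float(np.sum((Y - R @ X) ** 2))

def lm_refine(g, J, X, Y, n_steps=20):
    """Crude Levenberg-Marquardt on the flattened gradients."""
    g = g.copy()
    lam = 1e-2
    def residuals(gv):
        R = np.linalg.pinv(build_W(gv.reshape(g.shape), J))
        return (Y - R @ X).ravel()
    for _ in range(n_steps):
        gv = g.ravel()
        r = residuals(gv)
        eps = 1e-6                                      # finite-difference Jacobian
        Jr = np.stack([(residuals(gv + eps * e) - r) / eps
                       for e in np.eye(gv.size)], axis=1)
        while True:                                     # damped Gauss-Newton step
            step = np.linalg.solve(Jr.T @ Jr + lam * np.eye(gv.size), -Jr.T @ r)
            r_new = residuals(gv + step)
            if r_new @ r_new < r @ r:                   # accept only if loss drops
                g = (gv + step).reshape(g.shape)
                lam = max(lam * 0.5, 1e-8)
                break
            lam *= 10.0
            if lam > 1e8:
                return g
    return g
```

Starting from a perturbed gradient initialization on synthetic data, each accepted step strictly decreases the conditional loss.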
4.0.1 Iterationspecific Regressors.
As with SDM, we assume an ensemble of images and ground-truth warps $\{\mathcal{I}^{(n)}, \mathbf{p}^{(n)}\}_{n=1}^{N}$ from which a much larger set of synthetic examples $\mathcal{S}$ can be generated, where $M \gg N$. Like SDM, we attempt to learn a regressor/template pair for each iteration $l$. The set of training samples $\mathcal{S}^{(l)}$ is derived from Equations 10 and 11 for each iteration. Once learned, the application of these iteration-specific regressors is identical to SDM.
4.0.2 Pixel Independence Asymmetry.
A major advantage of the ICLK framework is that it assumes generative independence across pixel coordinates (see Equation 1). A natural question to ask is: could one not predict geometric displacement (instead of appearance) directly across independent pixel coordinates?
The major drawback to employing such a strategy is its ignorance of the well-known “aperture problem” [14] in computer vision (e.g. the motion of an image patch containing a sole edge cannot be uniquely determined due to the ambiguity of motion along the edge). As such, it is impossible to ask any predictor (linear or otherwise) to determine the geometric displacement of all pixels within an image while entertaining an independence assumption. The essence of our proposed approach is that it circumvents this issue by enforcing global knowledge of the template’s appearance across all pixel coordinates, while entertaining the generative pixel independence assumption that has served the LK algorithm so well over the last three decades.

4.0.3 Generative LK.
For completeness, we also entertain a generative form of our objective, where we instead learn “image gradients” that predict generative appearance as a function of geometric displacement, formulated as
\[ \min_{\nabla\mathcal{T}(\mathbf{0})} \sum_{m=1}^{M} \left\| \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0}) - \mathbf{W}\Delta\mathbf{p}^{(m)} \right\|_2^2 \tag{14} \]
\[ \text{s.t.} \quad \mathbf{W} = \begin{bmatrix} \nabla\mathcal{T}(\mathbf{x}_1;\mathbf{0}) & & \\ & \ddots & \\ & & \nabla\mathcal{T}(\mathbf{x}_D;\mathbf{0}) \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{W}(\mathbf{x}_1;\mathbf{0})}{\partial \mathbf{p}} \\ \vdots \\ \frac{\partial \mathcal{W}(\mathbf{x}_D;\mathbf{0})}{\partial \mathbf{p}} \end{bmatrix} \tag{15} \]
Unlike our proposed Conditional LK objective, the Generative LK objective above is linear and directly solvable. Furthermore, due to the generative pixel independence assumption, the problem can be broken down into $D$ independent subproblems. The Generative LK approach is trained in an identical way to SDM and Conditional LK, where iteration-specific regressors are learned from a set of synthetic examples $\mathcal{S}^{(l)}$.
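Because each pixel decouples, the Generative LK solve reduces to D tiny least-squares problems; a hedged numpy sketch (our own, with names and array layout assumed for illustration):

```python
import numpy as np

def generative_lk_gradients(J, X, Y):
    """Generative LK: per-pixel linear regression of appearance onto displacement.

    J : (D, 2, P) per-pixel warp Jacobians at the identity warp
    X : (D, N) appearance differences I(p o dp) - T(0)
    Y : (P, N) warp perturbations dp
    Returns (D, 2) learned "image gradients"; each pixel is an independent
    least-squares subproblem thanks to the generative independence assumption.
    """
    g = np.empty((J.shape[0], 2))
    for i in range(J.shape[0]):
        A = (J[i] @ Y).T                    # (N, 2): regressors for pixel i
        g[i] = np.linalg.lstsq(A, X[i], rcond=None)[0]
    return g
```

On noiseless synthetic data generated from known gradients, the per-pixel solves recover those gradients exactly.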
Figure 1 provides an example visualization of the gradients learned from the Conditional LK and Generative LK approaches. It is worth noting that the Conditional LK gradients become sharper over regression iterations, while this is not necessarily the case for Generative LK. The rationale for including the Generative LK form is to highlight the importance of a conditional learning approach, and thereby to justify the added nonlinear complexity of the Conditional LK objective.
5 Experiments
In this section, we present results for our approach across three diverse tasks: (i) planar image alignment, (ii) planar template tracking, and (iii) facial model fitting. We also investigate the utility of our approach across different image representations such as raw pixel intensities and dense LBP descriptors.
5.1 Planar Image Alignment
5.1.1 Experimental settings.
In this portion of our experiments, we utilize a subsection of the Multi-PIE [15] dataset. For each image, we take a region with ground-truth warp $\mathbf{p}$, rotated, scaled and translated around hand-labeled locations. For the ICLK approach, this region is then employed as the template $\mathcal{T}(\mathbf{0})$. For the SDM, Conditional LK and Generative LK methods, a synthetic set of geometrically perturbed samples $\mathcal{S}$ is generated.
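As an illustrative sketch of such sample generation (our own simplification: we jitter box corners and fit a 2D affine warp by least squares; the names and the affine choice are assumptions, not the paper's exact procedure):

```python
import numpy as np

def perturbed_affine_params(corners, sigma, rng):
    """Generate one synthetic training warp by jittering a bounding box.

    corners : (4, 2) ground-truth box corners
    sigma   : std. dev. (in pixels) of the i.i.d. corner noise; the same
              distribution also supplies a global translation
    Returns the 6 affine parameters of a least-squares fit mapping the
    original corners to the perturbed corners.
    """
    noisy = corners + rng.normal(0.0, sigma, corners.shape) \
                    + rng.normal(0.0, sigma, (1, 2))   # per-corner + translation noise
    A = np.hstack([corners, np.ones((4, 1))])          # (4, 3) design matrix
    M = np.linalg.lstsq(A, noisy, rcond=None)[0].T     # (2, 3) affine matrix
    return M.ravel()
```

With zero noise the fit returns the identity affine warp, which is a useful sanity check.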
We generate the perturbed samples by adding i.i.d. Gaussian noise of standard deviation $\sigma$
to the four corners of the ground-truth bounding box, as well as additional translational noise from the same distribution, and then finally fitting the perturbed box to the warp parameters. In our experiments, we fix $\sigma$ (in pixels). Figure 2 shows an example visualization of the training procedure as well as the generated samples. For SDM, a Tikhonov regularization term is added to the training objective as described in Section 3, and the penalty factor $\lambda$ is chosen by evaluating on a separate validation set; for Conditional LK, we use Levenberg-Marquardt to optimize the nonlinear objective, where the parameters are initialized through the Generative LK solution.

5.1.2 Frequency of Convergence.
We compare the alignment performance of the four types of aligners discussed in this paper: (i) ICLK, (ii) SDM, (iii) Generative LK, and (iv) Conditional LK. We state that convergence is reached when the point RMSE of the four corners of the bounding box is less than one pixel.
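This convergence criterion can be made explicit with a tiny helper (illustrative, not the paper's code):

```python
import numpy as np

def corner_rmse(est_corners, gt_corners):
    """Point RMSE over the four bounding-box corners (in pixels)."""
    d = np.linalg.norm(np.asarray(est_corners) - np.asarray(gt_corners), axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

def has_converged(est_corners, gt_corners, tol=1.0):
    """Convergence is declared when the corner RMSE drops below one pixel."""
    return corner_rmse(est_corners, gt_corners) < tol
```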
Figure 3 shows the frequency of convergence tested with both a 2D affine and a homography warp function. Irrespective of the planar warp function, our results indicate that Conditional LK has superior convergence properties over the others. This result holds even when the approach is initialized with a warp perturbation larger than the distribution it was trained under. The alignment performance of Conditional LK is consistently better in all circumstances, although the advantage is most noticeable when training with only a few samples.
Figure 4 provides another comparison with respect to the amount of training data learned from. It can be observed that SDM is highly dependent on the amount of available training data, yet is still not able to generalize as well as Conditional LK. This is further empirical evidence that incorporating principled priors in Conditional LK is more desirable than ad-hoc regularization in SDM.
5.1.3 Convergence Rate.
We also provide some analysis of convergence speed. To make a fair comparison, we average only over those test runs where all regressors converged. Figure 5 illustrates the convergence rates of regressors learned from different amounts of training data. The improvement of Conditional LK in convergence speed is clear, especially when little training data is provided. SDM starts to exhibit a faster convergence rate when learned from a larger number of examples per layer; however, Conditional LK still surpasses SDM in terms of the frequency of final convergence.
5.1.4 Swapping Warp Functions.
A unique property of Conditional LK in relation to SDM is its ability to interchange warp functions after training. Since we learn image gradients $\nabla\mathcal{T}(\mathbf{0})$ for the Conditional LK algorithm, one can choose which warp Jacobian to employ before forming the regressor $\mathbf{R}$. Figure 6 illustrates the effect of learning the Conditional LK gradients with one type of warp function and swapping it with another during testing. We see that whichever warp function Conditional LK is learned with, the learned conditional gradients are also effective on the other, and still outperform ICLK and SDM.
It is interesting to note that when we learn the Conditional LK gradients using either 2D planar similarity warps or homography warps, the performance on 2D planar affine warps is just as effective. This outcome leads to an important insight: it is possible to learn the conditional gradients with a simple warp function and replace it with a more complex one afterwards; this can be especially useful when training data for certain types of warp functions (e.g. 3D warp functions) is harder to come by.
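The swap itself can be sketched in a few lines (our own illustration; shapes are assumed): the learned gradients stay fixed while the Jacobian of a different warp family is substituted when forming the regressor:

```python
import numpy as np

def regressor_from_gradients(g, J):
    """Form a Conditional-LK-style regressor R = pinv(G J) from learned gradients.

    Because the gradients g are decoupled from the warp, the Jacobian J of a
    *different* warp family can be substituted here without retraining.
    g : (D, 2) learned template gradients
    J : (D, 2, P) per-pixel Jacobian of the chosen warp at the identity
    """
    W = np.einsum('dk,dkp->dp', g, J)     # (D, P)
    return np.linalg.pinv(W)              # (P, D)
```

For example, the same gradients yield a 2-parameter regressor with a translation Jacobian and a 6-parameter regressor with an affine Jacobian.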
5.2 Planar Tracking with LBP Features
In this section, we show how Conditional LK can be effectively employed with dense multichannel LBP descriptors, where the channel count $K$ is greater than one. First we analyze the convergence properties of Conditional LK on dense LBP descriptors, as in the previous section, and then we present an application to robust planar tracking. A full description of the multichannel LBP descriptors used in our approach can be found in [5].
Figure 7 provides a comparison of robustness by evaluating the frequency of convergence with respect to the scale of the test warps. This suggests that Conditional LK remains effective in the LK framework with multichannel descriptors: in addition to increased alignment robustness (already a well-understood property of descriptor-based image alignment), Conditional LK improves the sensitivity to initialization with larger warps.
Figure 8 illustrates alignment performance as a function of the number of samples used in training. We can see that Conditional LK requires as few as 20 examples per layer to train a better multichannel aligner than ICLK, whereas SDM needs more than 50 examples per iteration-specific regressor. This result again speaks to the efficiency of learning with Conditional LK.
5.2.1 Low Framerate Template Tracking.
In this experiment, we evaluate the advantage of our proposed approach for the task of low frame-rate template tracking. Specifically, we borrow a similar experimental setup to Bit-Planes [5]. LBP-style dense descriptors are ideal for this type of task as their computation is feasible in real-time across a number of computational platforms (unlike HOG or dense SIFT). Further computational speedups can be obtained by skipping frames during tracking.
We compare the performance of Conditional LK with ICLK and run the experiments on the videos collected in [5]. We train the Conditional LK tracker on the first frame with synthetic examples. During tracking, we skip frames to simulate low frame-rate videos. Figure 9 illustrates the percentage of successfully tracked frames as a function of the number of skipped frames. It is clear that the Conditional LK tracker is more stable and tolerant to larger displacements between frames.
Figure 10 shows some snapshots of the video, including frames where the ICLK tracker begins to fail but the Conditional LK tracker remains on track. This further demonstrates that the Conditional LK tracker maintains the same robustness to brightness variations by entertaining dense descriptors, while improving upon convergence. This enhanced robustness to both motion and brightness changes also suggests possible extensions to a wide variety of tracking applications.
5.3 Facial Model Fitting
In this experiment, we show that Conditional LK is applicable not only to 2D planar warps such as affine or homography, but also to more complex warps that require heavier parametrization. Specifically, we investigate the performance of our approach with a point distribution model (PDM) [16] on the IJAGS dataset [16], which contains an assortment of videos with hand-labeled facial landmarks. We utilize a pre-trained 2D PDM learned from all labeled data as the warp Jacobian and compare the Conditional LK approach against ICLK (it has been shown that there is an IC formulation to facial model fitting [16]). For Conditional LK, we learn a series of regressor/template pairs with a fixed number of examples per layer; for ICLK, the template image is taken to be the mean appearance.
Figure 11 shows the fitting accuracy and convergence rate of subject-specific alignment, measured in terms of the point-to-point RMSE of the facial landmarks; it is clear that Conditional LK outperforms ICLK in both convergence speed and fitting accuracy. This experiment highlights the possibility of extending our proposed Conditional LK to more sophisticated warps. We note that it is possible to take advantage of the Conditional LK warp-swapping property to incorporate a 3D PDM so as to introduce 3D shape modeling; this is beyond the scope of this paper.
6 Conclusion
In this paper, we discuss the advantages and drawbacks of the LK algorithm in comparison to SDM. We argue that by enforcing the pixel independence assumption within a conditional learning strategy, we can devise a method that: (i) utilizes substantially fewer training examples, (ii) offers a principled strategy for regularization, and (iii) offers unique properties for adapting and modifying the warp function after learning. Experimental results demonstrate that the Conditional LK algorithm outperforms both the LK and SDM algorithms in terms of convergence. We also demonstrate that Conditional LK can be integrated with a variety of applications, potentially leading to other exciting avenues for investigation.
References
 [1] Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI. Volume 81. (1981) 674–679
 [2] Baker, S., Matthews, I.: Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(3) (2004) 221–255
 [3] Antonakos, E., Alabort-i-Medina, J., Tzimiropoulos, G., Zafeiriou, S.P.: Feature-based Lucas-Kanade and active appearance models. IEEE Transactions on Image Processing 24(9) (2015) 2617–2632
 [4] Bristow, H., Lucey, S.: In defense of gradient-based alignment on densely sampled sparse features. In: Dense Correspondences in Computer Vision. Springer (2014)
 [5] Alismail, H., Browning, B., Lucey, S.: Bit-Planes: Dense subpixel alignment of binary descriptors. CoRR abs/1602.00307 (2016)
 [6] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004) 91–110

 [7] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR). Volume 1., IEEE (2005) 886–893
 [8] Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7) (2002) 971–987
 [9] Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE (2013) 532–539
 [10] Xiong, X., De la Torre, F.: Supervised descent method for solving nonlinear least squares problems in computer vision. CoRR abs/1405.0601 (2014)
 [11] Xiong, X., De la Torre, F.: Global supervised descent method. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 2664–2673

 [12] Jebara, T.: Discriminative, generative and imitative learning. PhD thesis, Massachusetts Institute of Technology (2001)
 [13] Simoncelli, E.P., Olshausen, B.A.: Natural image statistics and neural representation. Annual Review of Neuroscience 24(1) (2001) 1193–1216
 [14] Marr, D.: Vision: A computational investigation into the human representation and processing of visual information. Henry Holt and Co., Inc., New York, NY (1982)
 [15] Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. Image and Vision Computing 28(5) (2010) 807–813
 [16] Matthews, I., Baker, S.: Active appearance models revisited. International Journal of Computer Vision 60(2) (2004) 135–164
Appendix I: Math Derivations of the Conditional LK Algorithm
We describe the derivation and a few optimization details of the proposed Conditional LK algorithm. For convenience, we repeat the objective here,
\[ \min_{\nabla\mathcal{T}(\mathbf{0})} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \mathbf{R}\left[ \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0}) \right] \right\|_2^2 \quad \text{s.t.} \quad \mathbf{R} = \mathbf{W}^{\dagger} \]
where $\nabla\mathcal{T}(\mathbf{0})$
is the compact form of the template “image gradients” we want to learn. For simplicity, we further denote $\mathbf{g} = \mathrm{vec}(\nabla\mathcal{T}(\mathbf{0}))$ to be the vectorized form of $\nabla\mathcal{T}(\mathbf{0})$, and we write $\mathbf{W}(\mathbf{g})$ here instead of $\mathbf{W}$ to emphasize that it is a function of $\mathbf{g}$. Thus we can rewrite the objective as
\[ \min_{\mathbf{g}} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \mathbf{W}(\mathbf{g})^{\dagger}\,\mathbf{x}^{(m)} \right\|_2^2 \]
where $\mathbf{x}^{(m)} = \mathcal{I}^{(m)}(\mathbf{p}^{(m)} \circ \Delta\mathbf{p}^{(m)}) - \mathcal{T}(\mathbf{0})$.
We can expand the pseudo-inverse form of $\mathbf{W}^{\dagger}$ to be
\[ \mathbf{W}^{\dagger} = \mathbf{H}^{-1}\mathbf{W}^{\top} \tag{16} \]
where $\mathbf{H} = \mathbf{W}^{\top}\mathbf{W}$
is the pseudo-Hessian matrix. By the product rule, the derivative of $\mathbf{W}^{\dagger}$ with respect to the $i$-th element of $\mathbf{g}$, denoted as $g_i$, becomes
\[ \frac{\partial \mathbf{W}^{\dagger}}{\partial g_i} = \frac{\partial \mathbf{H}^{-1}}{\partial g_i}\mathbf{W}^{\top} + \mathbf{H}^{-1}\frac{\partial \mathbf{W}^{\top}}{\partial g_i}, \qquad \frac{\partial \mathbf{W}}{\partial g_i} = \mathbf{E}_i\,\mathbf{J} \tag{17} \]
where $\mathbf{E}_i$ is an indicator matrix with only the element in $\nabla\mathcal{T}(\mathbf{0})$ corresponding to $g_i$ being active, and $\mathbf{J}$ is the stacked warp Jacobian from the definition of $\mathbf{W}$. The derivative of $\mathbf{H}^{-1}$ with respect to $g_i$ is readily given as
\[ \frac{\partial \mathbf{H}^{-1}}{\partial g_i} = -\mathbf{H}^{-1}\frac{\partial \mathbf{H}}{\partial g_i}\mathbf{H}^{-1} \tag{18} \]
where
\[ \frac{\partial \mathbf{H}}{\partial g_i} = \frac{\partial \mathbf{W}^{\top}}{\partial g_i}\mathbf{W} + \mathbf{W}^{\top}\frac{\partial \mathbf{W}}{\partial g_i}. \tag{19} \]
Now that we have obtained an explicit expression for $\frac{\partial \mathbf{W}^{\dagger}}{\partial g_i}$, we can optimize $\mathbf{g}$ through gradient-based optimization methods by iteratively solving for $\Delta\mathbf{g}$, the update to $\mathbf{g}$. One can choose first-order methods (batch/stochastic gradient descent) or second-order methods (Gauss-Newton or Levenberg-Marquardt). In the second-order case, for example, we can first rewrite the objective in vectorized form as
\[ \min_{\mathbf{g}} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \left( \mathbf{x}^{(m)\top} \otimes \mathbf{I}_P \right) \mathrm{vec}\!\left( \mathbf{W}(\mathbf{g})^{\dagger} \right) \right\|_2^2 \tag{20} \]
where $\mathbf{I}_P$
is the identity matrix of size $P$. The iterative update $\Delta\mathbf{g}$ is then obtained by solving the least-squares problem
\[ \min_{\Delta\mathbf{g}} \sum_{m=1}^{M} \left\| \Delta\mathbf{p}^{(m)} - \left( \mathbf{x}^{(m)\top} \otimes \mathbf{I}_P \right) \mathrm{vec}\!\left( \mathbf{W}(\mathbf{g} + \Delta\mathbf{g})^{\dagger} \right) \right\|_2^2 \]
where $\mathrm{vec}\!\left( \mathbf{W}(\mathbf{g} + \Delta\mathbf{g})^{\dagger} \right)$ is linearized around $\mathbf{g}$ to be
\[ \mathrm{vec}\!\left( \mathbf{W}(\mathbf{g} + \Delta\mathbf{g})^{\dagger} \right) \approx \mathrm{vec}\!\left( \mathbf{W}(\mathbf{g})^{\dagger} \right) + \left[ \mathrm{vec}\!\left( \frac{\partial \mathbf{W}^{\dagger}}{\partial g_1} \right) \; \cdots \; \mathrm{vec}\!\left( \frac{\partial \mathbf{W}^{\dagger}}{\partial g_{2KD}} \right) \right] \Delta\mathbf{g}. \]
Finally, the Conditional LK regressors are formed as
\[ \mathbf{R} = \mathbf{W}(\mathbf{g})^{\dagger}. \tag{21} \]