Using Normalized Cross Correlation in Least Squares Optimizations

10/10/2018
by Oliver J. Woodford, et al.

Direct methods for vision have widely used photometric least squares minimizations since the seminal 1981 work of Lucas & Kanade, and have leveraged normalized cross correlation since at least 1972. However, no work to our knowledge has successfully combined photometric least squares minimizations and normalized cross correlation, despite the obvious complementary benefits of efficiency and accuracy on the one hand, and robustness to lighting changes on the other. This work shows that combining the two methods is not only possible, but also straightforward and efficient. The resulting minimization is shown to be superior to competing approaches, both in terms of convergence rate and computation time. Furthermore, a new, robust, sparse formulation is introduced to mitigate local intensity variations and partial occlusions.


1 Introduction

Least squares optimization is a standard optimization tool across many branches of science, due to the fact that Gauss’ approximation to the Hessian makes second-order Newtonian optimization computationally efficient. The use of least squares optimization in photometric costs dates back to at least 1981, when Lucas & Kanade applied it to image registration [1]. Since then, inverse approaches have further improved the efficiency [2, 3] and robustness [4] of least squares methods on certain registration problems. Applications of photometric least squares optimization have also increased, covering such diverse tasks as visual odometry [5] and active appearance models [6].

Similarly, cross correlation, a standard statistical tool across many branches of science, is used to measure the similarity between two signals. The normalized version (NCC), sometimes called zero-mean normalized cross correlation, has been used in image registration as far back as 1972 [7]. This version normalizes the means and variances of the data before applying cross correlation, making the measure robust to changes in gain and bias. Common applications of NCC today include multi-view stereo [8, 9] and 2-d keypoint tracking [10], both of which are instances of local image registration.

These two methods are complementary; the first offers efficiency and the second provides robustness to changes in intensity. However, they have not yet been successfully combined. Indeed, of the methods that use NCC, current tracking systems tend to employ an exhaustive search [10], which does not scale well to higher dimensional search spaces, while popular multi-view stereo pipelines either optimize using BFGS [11, §6.1] with numerical gradients [8] (computed using finite differences), or a sampling approach [9], both of which are inefficient.

Gradient-based methods, which include least squares optimization, scale well with dimensionality by traversing the state space one direction (of high gradient) at a time. While previous works have optimized NCC costs using such methods, this work is the first to incorporate NCC into a least squares optimization, demonstrating that it is simple, efficient, and converges well. In addition, we introduce a sparse formulation that both handles local variations in intensity and improves robustness to occlusions.

The following subsections review existing gradient-based NCC optimizers, and methods of registration with local intensity variations, particularly in least squares frameworks. Section 2 then presents the NCC least squares framework, section 3 presents our results, and we conclude in section 4.

1.1 Gradient-based optimization of NCC

NCC has been optimized using gradient-based methods with analytic gradients, particularly for image registration: Irani & Anandan [12] use a second-order Newtonian framework, requiring the computation of second derivatives, which increases computational cost and amplifies image noise. Brooks & Arbel [13] extend inverse compositional optimization to arbitrary cost functions (including NCC), using the BFGS optimizer: a first order optimizer which builds an approximation to the Hessian over time. Evangelidis & Psarakis [14] define a least squares formulation of NCC (eqn. (13), employed here), but then maximize the standard formulation (eqn. (12)) as follows: they compute the Jacobian for the zero-mean data, then at each iteration compute a scalar “perturbation” that is applied to the reference data to account for normalization. The resulting step approximates the least squares one. However, we note that none of these approaches has been widely adopted.

1.2 Registration with local intensity variations

Registration methods can be broadly split into direct and feature-based approaches. Feature-based approaches, such as that presented in Brown & Lowe’s Autostitch [15], pre-compute features in each image, match them across images, then minimize a geometric error on the correspondences. This approach has the benefit of broad convergence, but is computationally expensive, and does not minimize the true measurement errors, which occur in the domain of pixel intensities, not feature locations. Rather than requiring correspondences, direct methods minimize an error on pixel intensities or some transformed pixel space (see below). Therefore, these methods exploit more image data, including edges, and have been shown to be more accurate than feature-based methods for direct odometry [5]. Recent advances in efficient exhaustive search using fast Fourier transforms and ridge regression [16] have revolutionized 2-d tracking [17], but do not extend to higher-dimensional registration problems such as active appearance models [6] and direct odometry [5]. These latter problems use optimization-based approaches, which are efficient but require a good initialization. As NCC is a direct error metric, and non-linear least squares is an optimization approach, we limit further discussion to direct, optimization-based methods.

Several categories of direct registration methods already exist to handle data with local intensity variations. We enumerate these with regard to image-based alignments, where scene effects such as non-Lambertian reflectance and a changing viewpoint, lighting effects (e.g. shadows moving with the time of day), and camera effects (e.g. lens vignetting) all create local changes in intensity.

The first category is image transformation. Prior to least squares optimization, images are converted pixelwise into a lighting invariant space. Some methods compute distance transforms over edge images [18, 19], converting intensity into a form of geometric, rather than photometric, error. Other methods compute a multi-channel descriptor per pixel, based on the local image texture, some of which are hand designed [20, 21, 22], others of which are learned [23]. It should be noted that these descriptors are often not invariant to in-plane image rotations (invariance of a pixelwise transformation, ϕ, to an in-plane rotation, R, requires ϕ(R(I)) = R(ϕ(I)) for any image I), so registration in such cases will fail.

The second category models intensity changes generatively. Lucas & Kanade [1] originally proposed a two-parameter gain and bias model, which can only model a global change in intensity. Silveira & Malis [24] extend the gain model to independent regions, but retain the global bias. This allows for more local intensity variations, at the expense of a higher-dimensional search space.

The final category consists of lighting invariant scores between registered patches, such as NCC and mutual information [25, 26]. Mutual information supports a wider range of lighting changes than the linear-invariance of NCC, but nevertheless requires a globally consistent transformation of intensities. In addition, a least squares formulation does not exist.

2 NCC least squares

2.1 Formulation

Our task is formulated as that of finding a warp matrix, W, that defines the registration between two data samples, a source S and a target T, by minimizing an NCC least squares cost between the registered samples (symbols defined below):

  Ŵ = argmin_W ‖r(W)‖²   (1)
  r(W) = n(T̄(W)) − n(S̄)   (2)
  T̄(W) = T(w(X; W)),  S̄ = S(X)   (3)

We assume single channel data for brevity (least squares trivially generalizes to multiple channels [22]). Therefore, the data represents a scalar field of some dimensionality: 1-d (audio), 2-d (image), 3-d (volumetric), or higher. In this work, we only apply the formulation to 2-d image data.

Warp transform

The columns of X represent coordinates within the scalar field of the source data. These coordinates are the sample points whose error is to be minimized during registration. The function w(X; W) transforms the coordinates from the source to the target frame, as follows:

  w(X; W) = π(W X)   (4)

where the warp matrix W encodes the registration (W = I, the identity matrix, being the identity transformation), W is kept on the manifold of allowed registrations (see sec. 2.3.1), and π(·) applies any non-linearities present in the measurement process to each column of W X; in the case of images, this is a projection onto the image plane, and correction for camera calibration and lens distortion. Our evaluation tasks require only a simple projection: π([x, y, z]ᵀ) = [x/z, y/z]ᵀ. Finally, the warp function samples the target data, T, at the coordinates w(X; W), using bilinear interpolation, producing a sampled data vector, denoted by a bar: T̄(W).

NCC transform

The function n(x) = (x − μₓ𝟏)/‖x − μₓ𝟏‖ transforms a sampled vector, x, with mean μₓ, to one of zero mean and unit length (𝟏 is a vector of ones; similarly, 𝟎 is a vector of zeros). Though different from the standard one, this NCC formulation, previously given by Evangelidis & Psarakis [14], has an identical cost function shape (see Appendix A.1). Our first contribution is demonstrating that this change in formulation allows NCC to be straightforwardly and efficiently (see Appendix A.3) incorporated into a least squares framework.
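As a concrete illustration, the transform and its key properties (gain/bias invariance, and the relation to the standard NCC score shown in Appendix A.1) can be checked numerically. This is a minimal numpy sketch, not the paper’s implementation:

```python
import numpy as np

def ncc_transform(x):
    """NCC transform n(x): subtract the mean, then scale to unit length."""
    xbar = x - x.mean()
    return xbar / np.linalg.norm(xbar)

rng = np.random.default_rng(0)
x = rng.normal(size=64)

# Invariance to gain and bias: n(a*x + b) == n(x) for a > 0
assert np.allclose(ncc_transform(3.0 * x + 2.0), ncc_transform(x))

# Least squares cost vs. the standard NCC score: ||n(x) - n(y)||^2 = 2 - 2*NCC
y = rng.normal(size=64)
nx, ny = ncc_transform(x), ncc_transform(y)
cost = np.sum((nx - ny) ** 2)
score = nx @ ny
assert np.isclose(cost, 2.0 - 2.0 * score)
```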

2.2 Sparse formulation

Our second contribution is to introduce the following sparse, robust, locally normalized formulation, inspired by the success of sparse features for direct odometry [5]:

  E(W) = Σᵢ ρ( ‖n(T̄ᵢ(W)) − n(S̄ᵢ)‖² )   (5)

where T̄ᵢ and S̄ᵢ denote the target and source data sampled at the i-th block of coordinates, Xᵢ.

The cost function consists of a sum of local NCC costs, making it invariant to local (not just global) variations in intensity. Furthermore, each cost is robustified by a function, ρ(·), that downweights large errors, such that costs close to converging have more influence than costs far from converged. (Suitable robustification functions must satisfy ρ(0) = 0 and ρ′(0) = 1, and increase monotonically for positive values [27]. Note that robustifying a single cost term, such as equation (1), is redundant: since ρ increases monotonically, the minima occur at identical warps.) The robustification of costs against outlier data, such as occluded regions, is a standard tool in the vision literature [28, 27].

2.3 Optimization

2.3.1 Warp update parameterization

To support inverse [3] and ESM [4] approaches (sec. 2.3.3), we update the warp via a compositional approach [3]. Given a vector of the change in variables, Δ, which is computed at each iteration of the optimization (see below), the warp is updated as follows:

  W ← W · 𝒲(Δ)   (6)

where 𝒲(·) converts an update vector into a warp matrix, such that the set of warps is a group [3]. In particular, 𝒲(𝟎) must be the identity matrix, and, due to the projection from 3-d to 2-d in the case of images, the generators of each dimension must have zero trace [4]. We use the warp update parameterizations

  𝒲(Δ) = [ 1 0 δ₁ ; 0 1 δ₂ ; 0 0 1 ]   (7)
  𝒲(Δ) = exp( Σᵢ₌₁⁸ δᵢ Gᵢ )   (8)

in our experiments, where equation (7) encodes 2-d translations, and equation (8) encodes homographies via the zero-trace generators Gᵢ of sl(3) [29, eqn. 85–87].
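A minimal sketch of the translation parameterization (the homography case adds the remaining zero-trace generators), checking the group properties required above; the function name is my own:

```python
import numpy as np

def translation_update_warp(delta):
    """W(delta): map a 2-d translation update vector to a 3x3 warp matrix."""
    dx, dy = delta
    return np.array([[1.0, 0.0, dx],
                     [0.0, 1.0, dy],
                     [0.0, 0.0, 1.0]])

# W(0) must be the identity transformation
assert np.allclose(translation_update_warp([0.0, 0.0]), np.eye(3))

# The generators of each update dimension must have zero trace
G1 = translation_update_warp([1.0, 0.0]) - np.eye(3)
G2 = translation_update_warp([0.0, 1.0]) - np.eye(3)
assert np.trace(G1) == 0.0 and np.trace(G2) == 0.0

# The parameterized warps form a group under composition
W = translation_update_warp([1.0, 2.0]) @ translation_update_warp([3.0, -1.0])
assert np.allclose(W, translation_update_warp([4.0, 1.0]))
```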

2.3.2 Iterative update

Any standard non-linear least squares optimizer (Gauss-Newton, Levenberg-Marquardt, etc.) can be used to optimize equations (1) & (5); here we use Gauss-Newton. The per iteration update for equation (5) (of which equation (1) is a special case) is computed thus:

  rᵢ(Δ) = n(T̄ᵢ(W·𝒲(Δ))) − n(S̄ᵢ)   (9)
  Jᵢ = ∂rᵢ/∂Δ, evaluated at Δ = 𝟎   (10)
  Δ = −( Σᵢ Jᵢᵀ σᵢ Jᵢ )⁻¹ Σᵢ ρ′ᵢ Jᵢᵀ rᵢ,  with σᵢ = ρ′ᵢ I + 2ρ″ᵢ rᵢrᵢᵀ   (11)

where 𝟎 is an update vector of zeros. The scalar value ρ′ᵢ and matrix σᵢ are Triggs’ correction factors [27, eqn. 11], functions of the derivatives of ρ which account for robustification. In our implementation, this update is repeated until the step Δ becomes negligibly small, or the least squares cost fails to go below the minimum found for three consecutive iterations.
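To make the update concrete, here is a minimal 1-d analogue of NCC least squares registration: plain Gauss-Newton without robustification, a single translation parameter, and a finite-difference Jacobian for brevity (the paper uses analytic derivatives and higher-dimensional warps); all names and the test signal are my own:

```python
import numpy as np

def ncc_transform(x):
    xbar = x - x.mean()
    return xbar / np.linalg.norm(xbar)

def sample(signal, coords):
    """Linear interpolation of a 1-d signal at real-valued coordinates."""
    i0 = np.floor(coords).astype(int)
    f = coords - i0
    return (1.0 - f) * signal[i0] + f * signal[i0 + 1]

# Target: a smooth bump with a gain and bias change (NCC is invariant to these);
# source: the same bump, offset by a true shift of 2.5 pixels
t = np.arange(100.0)
target = 2.0 * np.exp(-0.5 * ((t - 52.5) / 8.0) ** 2) + 5.0
grid = np.arange(20.0, 80.0)                  # source sample coordinates
source = np.exp(-0.5 * ((grid - 50.0) / 8.0) ** 2)

n_src = ncc_transform(source)
shift = 0.0                                   # initial warp estimate
for _ in range(50):
    r = ncc_transform(sample(target, grid + shift)) - n_src   # residual
    eps = 1e-4                                # finite-difference Jacobian
    r_eps = ncc_transform(sample(target, grid + shift + eps)) - n_src
    J = (r_eps - r) / eps
    step = -(J @ r) / (J @ J)                 # Gauss-Newton step, 1 parameter
    shift += step
    if abs(step) < 1e-9:
        break

assert abs(shift - 2.5) < 0.01               # the true shift is recovered
```

The gain and bias applied to the target have no effect on the recovered shift, which is the point of using the NCC transform inside the residual.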

2.3.3 Jacobian computation

The following three approaches to Jacobian computation for compositional warp updates have been proposed in the literature.

Forwards compositional

The standard Jacobian [3] is given by a straightforward differentiation of the residual of equation (2), composed with the update of equation (6), with respect to Δ at Δ = 𝟎.

Inverse compositional

Jacobians can also be computed in the source image, at the identity warp [3, 2], such that they are constant. When Gauss-Newton is used, the pseudo-inverse can also be precomputed, resulting in a much faster update. However, Triggs’ second order correction for robust kernels cannot be used with a precomputed pseudo-inverse, so the optimization must use iteratively reweighted least squares [30] instead, which results in slower convergence [27]. Experiments here use the second order correction and therefore recompute the pseudo-inverse each iteration.

Efficient Second-order Minimization (ESM)

Taking the average of the above two Jacobians provides a more accurate estimate of the Hessian [4], improving both the rate and speed (number of iterations) of convergence.

Only the formula for the Jacobian of the NCC transform is given herein (see Appendix A.2); formulae for the other Jacobians are available in prior work [3]. We also note that modern auto-differentiation tools (employed in our implementation) can compute analytic derivatives automatically at runtime.

3 Evaluation

Our experiments validate the value of the two contributions of this work. The first is a new method to optimize the well-known NCC cost, so we evaluate its performance relative to other such optimizers. The second is a sparse, robust formulation which improves performance in situations with local intensity variations and partial occlusion, so we compare this approach against other methods invariant to local intensity changes, as well as on partly occluded image regions.

We run quantitative experiments on homography-based alignment. The reasonably high dimensionality of this problem’s state space (8-d) indicates performance on the kind of problems for which photometric least squares methods remain popular, such as active appearance models [6] and direct odometry [5].

(a)
(b)
Fig. 1: Graffiti2 dataset. We use, and make available [31], a dataset of 7 image sets (a), of 4 600×800×3 images each (b), of approximately planar regions of outdoor graffiti, taken from roughly the same position and under different lighting conditions, due to capture at different times of the day. All were captured with an iPhone 6s (with negligible lens distortion) as 12Mpixel JPEG images, then downsampled and saved losslessly.

3.1 Costs evaluated

We evaluate three costs in our experiments:

Dense, globally normalized cost

This cost is defined by equation (1), where the sample coordinates form a dense grid, sampled at the pixel corners within a region of interest. (Sampling (under bilinear interpolation) at pixel corners means that a sample must shift at least half a pixel in any direction before the image intensity changes non-linearly; in contrast, the intensity at a pixel center is non-differentiable.)

Sparse, locally normalized cost

This cost is defined by equation (5), without robustification (i.e. ρ is the identity function). Each block of coordinates has 8 sample points on a 2×4 grid, spaced 2 pixels apart, centered on and aligned perpendicular to features called edgelets. Regions of constant image gradient produce the same NCC cost; therefore, a change in gradient, i.e. an image edge, is required for registration. Edgelets are extracted at sub-pixel local maxima of gradient magnitude, in the direction of maximum gradient, as shown in Figure 2. Blocks have 8 samples, as vector operations on modern CPUs work well with this number. The height of 2 gives them some sensitivity to the orientation of the edge, while the width of 4 allows them to converge from a reasonable distance. The set of blocks consists of those on all edgelets extracted within a region of interest.

(a)
(b)
(c)
Fig. 2: Edgelet extraction. An input image (a) is first converted to grayscale, from which gradient magnitudes (b) and orientations are computed. Edgelet features (c) are instantiated on the sub-pixel local maxima of gradient magnitude, in the direction of maximum gradient.
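The edgelet extraction step (Fig. 2) can be sketched as follows. This is a simplified per-row version using only horizontal gradients, whereas the full method uses the 2-d gradient orientation field; all names are my own:

```python
import numpy as np

def edgelet_positions(img):
    """Sub-pixel local maxima of horizontal gradient magnitude along each row
    (a 1-d sketch of edgelet extraction; the full method also uses the
    gradient orientation)."""
    m = np.abs(np.gradient(img, axis=1))   # horizontal gradient magnitude
    pts = []
    for y in range(img.shape[0]):
        for x in range(1, img.shape[1] - 1):
            if m[y, x] > m[y, x - 1] and m[y, x] >= m[y, x + 1]:
                # quadratic fit to three samples for sub-pixel refinement
                a, b, c = m[y, x - 1], m[y, x], m[y, x + 1]
                denom = a - 2.0 * b + c
                dx = 0.5 * (a - c) / denom if denom != 0 else 0.0
                pts.append((y, x + dx))
    return pts

# A smooth vertical edge at column 8: all edgelets should sit near x = 8
xx = np.arange(16.0)
row = 1.0 / (1.0 + np.exp(-(xx - 8.0)))    # sigmoid step edge
img = np.tile(row, (4, 1))
pts = edgelet_positions(img)
assert len(pts) == 4
assert all(abs(x - 8.0) < 0.5 for _, x in pts)
```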
Sparse, robustified, locally normalized cost

This cost is also defined by equation (5), but robustified with the Geman-McClure function [28], using a fixed scale parameter; the sample coordinates are as above.
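For illustration, one common parameterization of the Geman-McClure robustifier that satisfies the conditions given in section 2.2 (the paper’s exact form and scale value are not reproduced here) is:

```python
import numpy as np

def geman_mcclure(s, sigma=1.0):
    """Geman-McClure robustifier applied to a squared error s.
    This parameterization (one of several in the literature) satisfies
    rho(0) = 0 and rho'(0) = 1, and saturates at sigma**2."""
    return sigma**2 * s / (sigma**2 + s)

s = np.linspace(0.0, 100.0, 1001)
rho = geman_mcclure(s, sigma=2.0)

assert rho[0] == 0.0                                 # rho(0) = 0
eps = 1e-6
assert abs(geman_mcclure(eps) / eps - 1.0) < 1e-5    # rho'(0) = 1
assert np.all(np.diff(rho) > 0)                      # monotonically increasing
assert np.all(rho < 4.0)                             # saturates below sigma**2
```

The saturation is what caps the influence of occluded or otherwise outlying blocks.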

3.2 Dataset

To ensure that results are broadly applicable, experiments should be run on a range of image textures, with real (not synthetic) data exhibiting locally varying intensities across images of the same scene. We therefore created a new dataset of 7 sets of 4 images, shown in Figure 1, with real-life changes in lighting due to capture at different times of day. We have also computed ground truth homographies between each pair of images within each set, using standard feature matching followed by RANSAC [15].

3.2.1 Test case generation

We randomly select 100 50×50 pixel regions from each image. All regions contain at least 50 edgelets, and are fully visible in the 3 other images of the same scene. We generate a perturbation for each region by applying a random shift, drawn from a normal distribution, to each corner separately, then scaling the perturbation so that the mean shift per corner is 0,..,10 pixels, creating starting distances of increasing magnitude for each region. Each perturbation of each region is warped into each image in the set (including the source image), using the ground truth homography, from which the initial warp from source to target is then computed [32, alg. 4.2]. This creates a total of 100 × 11 × 4 × 4 × 7 = 123,200 test cases, 92,400 of them between different images and 30,800 between identical images.

We have made the images, ground truth homographies and test cases publicly available [31].

3.3 NCC optimization comparison

Of the three pre-existing gradient-based NCC optimization methods [12, 13, 14], we compare our non-linear least squares (NLS) method against the two most recent of these: the first-order BFGS method [13], and the second-order approximation to NCC optimization [14], which the authors call enhanced correlation coefficient maximization (ECCM). The Newton-based method [12] employs a full second-order differentiation of the NCC cost (not given in the paper), requiring a significantly more complex implementation, which is not publicly available. For this reason, we omit comparison to that method here. However, we note that our least squares update is the Gauss-Newton approximation to the Newton update of the standard NCC cost. Therefore, we would expect our method to provide similar accuracy with a simpler, faster implementation.

We run all three optimizers on each of the three costs described above, using the warp update of eq. (8) (homography fitting) with each of the three Jacobians described in section 2.3.3: forwards (FWD), inverse (INV), and ESM, over every test case. We extend the baseline methods to support ESM, as well as our two sparse costs.

(a)
(b)
(c)
Fig. 3: Convergence rates and times. Graphs showing, on the vertical axis, (top row) the percentage of runs which successfully converged to within one pixel of the ground truth solution, and (bottom row) the mean computation time of the successful runs, against, on the horizontal axis, the mean starting distance of region corners from ground truth. These results are for the case that source and target images were different (Fig. 4).

3.3.1 Rate of convergence

The primary goal of registration optimization is to reach an alignment that matches the ground truth; we call this convergence. Since matching ground truth exactly is unlikely, we allow an error of up to 1 pixel in the position of each corner in the target image for an optimization to be considered converged. The proportion of tests which converged is plotted against the mean corner starting distance, as shown in Figure 3 (top row) for the case that source and target images differ. It should first be noted that the rate of convergence is not 100% for a starting distance of 0, and is also different between methods. The ground truth homography for some test cases does not coincide with the minimum of the NCC cost, and not all optimizers reach the optimum equally well. As a result, when the starting position is inside the converged region, worse optimizers actually get a higher rate of convergence. In the case that source and target images are identical, the ground truth and NCC optimum do coincide, giving all optimizers 100% convergence for a starting distance of 0, as shown in Figure 4.

The following trends are visible in both Figures 3(top row) and 4. In the dense case (a), NLS performs marginally better than ECCM and significantly better than BFGS; in the sparse cases (b) & (c), NLS performs significantly better than both ECCM and BFGS; ESM provides better convergence overall for both NLS and ECCM, though marginally in the dense case; for BFGS, the FWD scheme performs best; when the sparse cost is robustified (c), the rate of convergence improves at all starting distances.

3.3.2 Optimization time

All tested methods are implemented in the same language. As a result, optimization time is a strong overall indicator of the relative performance of the different methods, if not of the absolute achievable performance. Figure 3 (bottom row) shows the convergence time of each method, across different starting distances and cost functions. It is important to note that NLS methods are faster, often significantly, than their BFGS or ECCM counterparts. In this high dimensional space, BFGS requires many more function evaluations. Meanwhile, the perturbation factor computation of ECCM is not inexpensive, slowing, in particular, the INV variant significantly (beyond the ESM variant, which uses fewer iterations, for sparse costs). It should also be acknowledged that for NLS, ESM is faster than FWD and converges more often, but INV is considerably faster still. Consequently, there is a trade-off between speed and efficacy when choosing between INV and ESM.

(a)
(b)
(c)
Fig. 4: Convergence rates for identical source & target images. Graphs showing, on the vertical axis, the percentage of runs which successfully converged to within one pixel of the ground truth solution, against, on the horizontal axis, the mean starting distance of region corners from ground truth. These results are for the case that source and target images were the same.
(a)
(b)
(c)
Fig. 5: Relative least squares costs. Graphs showing, on the horizontal axis, the cost factor for each method, which is the achieved least squares cost divided by the minimum cost found across all methods, against, on the vertical axis, the proportion of tests with a cost factor less than the x-axis value (known as recall). Results are shown for a starting distance of 5, for the case that source and target images are different.

3.3.3 Final cost

The goal of image alignment is to converge to ground truth. However, when comparing optimizers, one should also consider the lowest cost achieved. As we have seen, the two don’t always coincide. Therefore, we also show the relative cost achieved by each method, compared to the lowest cost found across all methods, in Figure 5. The results are similar to the relative convergence rates: for the dense case, ECCM performs slightly worse than NLS, and for the sparse cases, NLS methods perform significantly better than all methods except BFGS FWD, which is still worse (and significantly slower).

(a)
(b)
(c)
Fig. 6: Cost surfaces of our cost functions. NCC cost surfaces (contour maps), one for each of the three costs tested, computed on a 50×50 pixel patch shifted over a 20×20 pixel region centered on itself (the patch from Fig. 1(a) was used). Each plot additionally shows the trajectory of 2-d translation optimizations for the ESM variant of each optimizer evaluated, starting from 8 different points (black dots).

3.3.4 Qualitative analysis of costs & optimizers

The causes of the relative performance of the costs and optimizers tested are illuminated by the qualitative plots shown in Figure 6, which visualize the cost surfaces generated by each of the above costs, for a 50×50 pixel region centered on the image shown in Fig. 2(a). Trajectories computed by the ESM variant of each optimizer tested are also shown, starting from 2-d shifted positions (black dots) and using the warp update of equation (7).

Fig. 6(a) indicates that the dense cost has a relatively wide basin of convergence. The optimization trajectories of BFGS show that it takes large steps in directions of maximum gradient, then makes abrupt changes in direction. ECCM and NLS both take smooth paths to the minimum.

The sparse, locally normalized cost surface (Fig. 6(b)) has a narrower basin of convergence, and more local minima. NLS and BFGS behave similarly to the dense case, though NLS takes shorter steps. ECCM behaves erratically, but in this 2-d space it nevertheless often converges in the end.

The robustified sparse cost surface (Fig. 6(c)) has an even narrower trough around the minimum, but flatter plateaus and fewer local minima on either side, compared to the unrobustified sparse cost. The optimizer trajectories are similar in nature to those of the unrobustified sparse cost.

The smoother behaviour of the NLS optimizer leads to an improved convergence rate in the higher dimensional space of homography fitting, where a greater number of local minima trap erratic optimizers.

3.4 Robustness to occlusion

Figure 7 shows the rate of convergence for each of our 3 NCC costs, using the ESM NLS optimizer, for the case that one quadrant of the source patch has been replaced with random texture, emulating occlusion of the scene by another object. The graph shows that the robustified sparse cost performs significantly better than the other costs, except at starting distances greater than 7 pixels, where the dense cost, with its broader convergence basin, still performs better.


Fig. 7: Convergence rates with occlusion. A graph showing, on the vertical axis, the percentage of runs which successfully converged to within one pixel of the ground truth solution, with one quadrant of the patch masked with random texture, against, on the horizontal axis, the mean starting distance of region corners from ground truth. Results are for our method using ESM, over the three tested cost functions, for the case that source and target images are different.

3.5 Lighting invariance compared to other costs

We compare our new sparse, robust cost formulation against two other locally lighting invariant photometric costs: descriptor fields [20] (first order version) and the census bit planes transform [22]. Both are examples of image transformation approaches, and showed state of the art results when published. (A deep learned transformation [23] has appeared since, but the proprietary model has not been released publicly.) We implemented and evaluated both within our framework, using Gauss-Newton ESM to optimize both, and compare to our NLS ESM method in Figure 8. The results show (Fig. 8(a)) that convergence rates for our method are best overall. However, these convergence rates are slightly worse than census bit planes at small starting distances (which then drop off rapidly), and slightly worse than descriptor fields at larger starting distances, possibly due to the latter’s image blurring. In terms of optimization time (Fig. 8(b)), census bit planes is significantly slower, using 8 channels compared to NCC’s 1. Descriptor fields uses 4 channels, but is comparable in speed to our method, implying convergence in fewer iterations, which could again be due to its image blurring.

(a)
(b)
Fig. 8: Comparison against other locally lighting invariant costs. Graphs showing, on the vertical axis, (a) the percentage of runs which successfully converged to within one pixel of the ground truth solution, and (b) the mean computation time of the successful runs, against, on the horizontal axis, the mean starting distance of region corners from ground truth, for our NLS on our new sparse, robustified cost and NLS on the descriptor fields [20] and census transform bit planes [22] costs.

4 Conclusion

This work shows that NCC optimization in a non-linear least squares framework is not only possible, but straightforward to implement, efficient, and effective at optimizing NCC. It also shows that the sparse, robust formulation introduced here registers images with local intensity variations better than both the standard NCC formulation and the other locally lighting invariant methods tested, as well as being more robust to occlusions.

Our experiments focused on optimizations at a single resolution, whereas in practice such methods are generally used in a multi-resolution framework in order to broaden convergence. However, since each level of a multi-resolution framework is an instance of a single resolution optimization, our approach and results are directly applicable to these frameworks also.

We note that our NCC least squares framework can be extended to the following scenarios: data fields of arbitrary dimension and/or multiple channels, and methods which don’t just solve for registration parameters, but also for other variables, such as the data coordinates themselves, as in direct odometry [5].

Appendix A NCC least squares optimization

A.1 Formulation equivalence

The standard formulation of NCC [33, eqn. 8.11] is the following score between two patches, x and y, which should be maximized:

  s(x, y) = (x − μₓ𝟏)ᵀ(y − μ_y𝟏) / ( ‖x − μₓ𝟏‖ ‖y − μ_y𝟏‖ )   (12)

We now show that it has the same shape (negated, up to scale and ignoring a scalar offset) as the least squares NCC cost between two patches [14, eqn. 4], which is minimized:

  E(x, y) = ‖n(x) − n(y)‖²   (13)
          = n(x)ᵀn(x) − 2n(x)ᵀn(y) + n(y)ᵀn(y)   (14)
          = 2 − 2s(x, y)   (15)

since n(x)ᵀn(x) = n(y)ᵀn(y) = 1 and n(x)ᵀn(y) = s(x, y). Therefore not only is the optimum of each score achieved at the same inputs, but the gradients of the score functions are the same (up to a scale factor of −2) for all inputs.

A.2 NCC Jacobian

The analytic Jacobian can be computed first by subtracting the mean, then by applying the length normalization using the quotient rule, as follows:

  ∂x̄/∂x = I − (1/N)𝟏𝟏ᵀ   (16)
  ∂n(x)/∂x̄ = ( I − n(x)n(x)ᵀ ) / ‖x̄‖   (17)

where x̄ = x − μₓ𝟏, and N is the number of samples. Applying the chain rule and some rearrangement (using n(x)ᵀ𝟏 = 0), this gives

  ∂n(x)/∂x = ( I − (1/N)𝟏𝟏ᵀ − n(x)n(x)ᵀ ) / ‖x̄‖   (18)
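This Jacobian is easy to validate against finite differences; a numpy sketch (my own code, not the paper’s implementation):

```python
import numpy as np

def ncc_transform(x):
    xbar = x - x.mean()
    return xbar / np.linalg.norm(xbar)

def ncc_jacobian(x):
    """Analytic Jacobian of n(x): (I - 11^T/N - n n^T) / ||x - mean(x) 1||."""
    N = len(x)
    xbar = x - x.mean()
    norm = np.linalg.norm(xbar)
    n = xbar / norm
    return (np.eye(N) - np.ones((N, N)) / N - np.outer(n, n)) / norm

rng = np.random.default_rng(1)
x = rng.normal(size=8)
J = ncc_jacobian(x)

# Check against central finite differences
eps = 1e-6
J_fd = np.empty((8, 8))
for j in range(8):
    e = np.zeros(8)
    e[j] = eps
    J_fd[:, j] = (ncc_transform(x + e) - ncc_transform(x - e)) / (2 * eps)
assert np.allclose(J, J_fd, atol=1e-6)

# Columns sum to zero, the property exploited by the one-pass scheme of A.3
assert np.allclose(J.sum(axis=0), 0.0)
```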

A.3 One-pass inverse compositional optimization

In an inverse compositional framework, the Jacobian is pre-computed. Therefore, only the patch (not its derivatives) requires NCC normalization at each iteration. When the patch has more samples than there are warp parameters, it is more efficient to apply this normalization after multiplication by the precomputed pseudo-inverse than to apply it to the patch itself. This section shows how this can be achieved for the formulation of equation (1).

First note that a matrix $M$ whose columns sum to zero, i.e. $\mathbf{1}^\top M = \mathbf{0}^\top$, can be post-multiplied by any matrix $B$ and still have zero-sum columns: $\mathbf{1}^\top (MB) = (\mathbf{1}^\top M)B = \mathbf{0}^\top$; the same goes for pre-multiplication of a matrix with zero-sum rows. The NCC Jacobian of equation (18) has zero-sum columns, therefore so does its product with the warp Jacobian, $J$, and the pseudo-inverse, $J^+ = (J^\top J)^{-1}J^\top$, has zero-sum rows. Therefore, for Gauss-Newton optimization,

$$\Delta\boldsymbol{\theta} = J^+\left(\hat{\mathbf{a}} - \hat{\mathbf{t}}\right) \tag{19}$$
$$= \frac{J^+\mathbf{a} - \mu_a J^+\mathbf{1}}{\|\mathbf{a} - \mu_a\mathbf{1}\|} - J^+\hat{\mathbf{t}} \tag{20}$$
$$= \frac{J^+\mathbf{a}}{\|\mathbf{a} - \mu_a\mathbf{1}\|} - J^+\hat{\mathbf{t}}, \tag{21}$$

where $\hat{\mathbf{t}}$ is the normalized template patch, $J^+\hat{\mathbf{t}}$ is pre-computed, and the last step follows from $J^+\mathbf{1} = \mathbf{0}$.

Both $J^+\mathbf{a}$ and $\|\mathbf{a} - \mu_a\mathbf{1}\|$ can be computed during the first pass over the patch samples, using the identity $\|\mathbf{a} - \mu_a\mathbf{1}\|^2 = \mathbf{a}^\top\mathbf{a} - P\mu_a^2$ for the latter. The division and subtraction of equation (21) are then applied to $N$ coefficients only. The result is that patch samples need not be stored and revisited in a second pass to apply normalization. This one-pass approach improves computational efficiency, especially for large patches, which might otherwise incur cache misses on a second pass.
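The saving relies on the zero-sum rows of the pre-computed pseudo-inverse. A NumPy sketch, with hypothetical sizes ($P = 64$ samples, $N = 6$ parameters) and a simulated zero-column-sum Jacobian, confirms that dividing the $N$ projected coefficients gives the same update as normalizing all $P$ samples first:

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 64, 6                     # hypothetical: P patch samples, N parameters
a = rng.standard_normal(P)       # current (un-normalized) patch samples

# A stand-in Jacobian with zero-sum columns, simulating the property of
# equation (18); here enforced by removing each column's mean.
J = rng.standard_normal((P, N))
J -= J.mean(axis=0)

J_pinv = np.linalg.pinv(J)       # pre-computed; has zero-sum rows

# Two-pass: normalize all P patch samples, then project.
mu = a.mean()
norm = np.linalg.norm(a - mu)
two_pass = J_pinv @ ((a - mu) / norm)

# One-pass: project the raw samples, then divide the N coefficients;
# the norm comes from the running sums via ||a||^2 - P * mu^2.
one_pass = (J_pinv @ a) / np.sqrt(a @ a - P * mu**2)

assert np.allclose(two_pass, one_pass)
```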

Note that this approach cannot be applied to the sparse formulation, since each cost must be normalized separately.

References

  • [1] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of the International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
  • [2] G. D. Hager and P. N. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025–1039, 1998.
  • [3] S. Baker and I. Matthews, “Lucas-Kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
  • [4] S. Benhimane and E. Malis, “Real-time image-based tracking of planes using efficient second-order minimization,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, 2004, pp. 943–948.
  • [5] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2018.
  • [6] I. Matthews and S. Baker, “Active appearance models revisited,” International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, 2004.
  • [7] D. I. Barnea and H. F. Silverman, “A class of algorithms for fast digital image registration,” IEEE Transactions on Computers, vol. C-21, no. 2, pp. 179–186, Feb 1972.
  • [8] Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiview stereopsis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
  • [9] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, “Pixelwise view selection for unstructured multi-view stereo,” in Proceedings of the European Conference on Computer Vision.   Springer, 2016, pp. 501–518.
  • [10] D. Wagner, D. Schmalstieg, and H. Bischof, “Multiple target detection and tracking with guaranteed framerates on mobile phones,” in Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, 2009, pp. 57–64.
  • [11] J. Nocedal and S. J. Wright, Numerical Optimization (second edition).   Springer, 2006.
  • [12] M. Irani and P. Anandan, “Robust multi-sensor image alignment,” in Proceedings of the IEEE International Conference on Computer Vision.   IEEE, 1998, pp. 959–966.
  • [13] R. Brooks and T. Arbel, “Generalizing inverse compositional image alignment,” in Proceedings of the International Conference on Pattern Recognition, vol. 2.   IEEE, 2006, pp. 1200–1203.
  • [14] G. D. Evangelidis and E. Z. Psarakis, “Parametric image alignment using enhanced correlation coefficient maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1858–1865, 2008.
  • [15] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” International Journal of Computer Vision, vol. 74, no. 1, pp. 59–73, 2007.
  • [16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
  • [17] M. Kristan et al., “The visual object tracking VOT2017 challenge results,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, Oct 2017, pp. 1949–1972.
  • [18] M. Kuse and S. Shen, “Robust camera motion estimation using direct edge alignment and sub-gradient method,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2016, pp. 573–579.
  • [19] X. Wang, W. Dong, M. Zhou, R. Li, and H. Zha, “Edge enhanced direct visual odometry.” in Proceedings of the British Machine Vision Conference, 2016.
  • [20] A. Crivellaro and V. Lepetit, “Robust 3d tracking with descriptor fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3414–3421.
  • [21] E. Antonakos, J. Alabort-i Medina, G. Tzimiropoulos, and S. P. Zafeiriou, “Feature-based Lucas–Kanade and active appearance models,” IEEE Transactions on Image Processing, vol. 24, no. 9, pp. 2617–2632, 2015.
  • [22] H. Alismail, B. Browning, and S. Lucey, “Robust tracking in low light and sudden illumination changes,” in Proceedings of the International Conference on 3D Vision.   IEEE, 2016, pp. 389–398.
  • [23] C.-H. Chang, C.-N. Chou, and E. Y. Chang, “CLKN: Cascaded Lucas-Kanade networks for image alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [24] G. Silveira and E. Malis, “Real-time visual tracking under arbitrary illumination changes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2007, pp. 1–6.
  • [25] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International Journal of Computer Vision, vol. 24, no. 2, pp. 137–154, 1997.
  • [26] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE Transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997.
  • [27] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms.   Springer, 1999, pp. 298–372.
  • [28] M. J. Black and A. Rangarajan, “On the unification of line processes, outlier rejection, and robust statistics with applications in early vision,” International Journal of Computer Vision, vol. 19, no. 1, pp. 57–91, 1996.
  • [29] E. Eade, “Lie groups for computer vision,” http://ethaneade.com/lie_groups.pdf.
  • [30] P. W. Holland and R. E. Welsch, “Robust regression using iteratively reweighted least-squares,” Communications in Statistics-theory and Methods, vol. 6, no. 9, pp. 813–827, 1977.
  • [31] O. Woodford, “Computer vision datasets,” https://sites.google.com/site/oliverwoodford/datasets.
  • [32] R. Hartley and A. Zisserman, Multiple view geometry in computer vision, 2nd ed.   Cambridge University Press, 2004.
  • [33] R. Szeliski, Computer vision: algorithms and applications.   Springer Science & Business Media, 2010.