Preconditioning in general and diagonal preconditioning in particular is a popular technique used in a wide range of disciplines and problems, sometimes under different names and in different forms. In optimization, the simplest form of preconditioning is used to speed up the convergence and improve the accuracy of iterative solvers of linear systems such as the Jacobi method (Saad (2003)) and conjugate gradient (CG), which is often used in the Newton step in interior point algorithms (Nocedal and Wright (2006)). Many other first order optimization methods also involve some form of (diagonal) preconditioning or benefit from preconditioning. These include the ADMM method (Lin et al. (2016, 2018), Takapoui and Javadi (2016), Boyd et al. (2011)) and steepest descent methods with preconditioned gradients such as mirror descent with quadratic distance functions (Beck and Teboulle (2003)), online Newton methods, and Quasi-Newton methods (Broyden (1970), Goldfarb (1970)). Diagonal preconditioners in particular are widely used in adaptive gradient methods such as AdaGrad (Duchi et al. (2011)) and Adam (Kingma and Ba (2014)), because they can adapt to the geometry of the problem and not only accelerate the rate of convergence, but also achieve low regret for problems with certain geometries (Levy and Duchi (2019)). In machine learning, it is widely acknowledged that adaptive normalization techniques such as batch normalization (Ioffe and Szegedy (2015)) and layer normalization (Ba et al. (2016)
) are essential for achieving good empirical performance in terms of convergence speed of training, as well as better generalization power on test sets. Such methods can be understood as a form of adaptive diagonal preconditioning on the outputs of intermediate layers of a neural network.
Intuitively, preconditioning changes the local landscape of the objective function by making it more “rounded” so that local minima are more accessible. To understand the benefit of preconditioning more precisely and to motivate the current paper, we start with the simple setting of solving linear systems with iterative methods. Consider the problem of solving
given a full-rank matrix and . For large and sparse systems iterative methods based on factorization are preferred to direct methods such as matrix inversion or Gaussian elimination. The following iterative methods are commonly used to solve large linear systems, and under appropriate conditions they converge linearly with rates of convergence depending explicitly on the condition number of
, defined as the ratio of the largest and smallest singular values of:
Jacobi Method The Jacobi method for solving linear system is the iteration
where and is the diagonal of . When is diagonally
dominant, the method converges linearly with a rate of ,
which is bounded above by Arioli and Romani (1985).
Gauss-Seidel Method The Gauss-Seidel method to solve the system is similar to the Jacobi method. Instead of using the decomposition , it uses the decomposition where is the strict upper triangular part of . The iteration then consists of
where the inverse operation can done through forward substitution
due to the triangular structure of . Like the Jacobi method,
two sufficient conditions for convergence are is symmetric positive
definite, or strictly diagonally dominant. In such cases, the linear
convergence rate is again given by .
Kaczmarz Method The Kaczmarz iteration to solve the linear system consists of the updates
where , and is the number of rows of , and is the th row of . A randomized version that chooses the update with
at each step with probability proportional tohas been shown to converge linearly in expectation with a rate of Strohmer and Vershynin (2009). Other variants of the Kaczmarc method such as the block Kaczmarz method and sketch-and-project methods have been shown to have similar convergence rates Needell and Rebrova (2019).
Gradient Methods First order optimization methods can also be applied to solve the square linear system , which is equivalent to minimizing the quadratic objective . Steepest descent with optimal step size has linear convergence given by
where Luenberger et al. (1984). Similarly,
conjugate gradient (CG) has a linear convergene rate of .
Quasi-Newton Methods Quasi-Newton methods such as BFGS and Anderson mixing accelerate the gradient method by premultiplying with an approximate inverse Hessian . The exact dependence of convergence rate on the condition number of is not well-studied, but it has been known that when is ill-conditioned, the algorithm suffers slowdown or divergence.
|Jacobi||Gauss-Seidel||Kaczmarz||Steepest Descent||Conjugate Gradient|
|linear convergence rates|
We see that the linear rates of convergence of iterative methods applied to solve are better when the matrix has a small condition number . This phenomenon is not limited to linear systems. For general optimization problems, it is well known that the condition number of the Hessian matrix of the objective plays a similar role in the rate of convergence of first order methods. For example, for general strongly convex objectives, gradient descent has the same asymptotic linear convergence rate of where is now the Hessian of the objective at optimal point , i.e. . Thus the larger the condition number of , the slower the convergence. In order to accelerate the convergence of iterative solvers and optimization methods, it is then essential to ensure that the condition number of the coefficient or Hessian matrix is small. Preconditioning achieves this by multiplying the a matrix by matrices and , such that the matrices , , or have a smaller condition number than . For example, the preconditioned conjugate gradient method solves the equivalent system
with given by the incomplete Cholesky factorization of , often resulting in .
A large class of preconditioning methods uses matrices
that are diagonal. Diagonal preconditioners have the advantage that they are easy to implement and there exist many heuristics for finding good diagonal preconditioners to achieve reductions in condition number. We mention the following commonly used diagonal preconditioning procedures:
Jacobi preconditioner: is the simplest diagonal preconditioner, and it works well when is diagonally dominant.
Column (row) normalization: where is the diagonal matrix of column standard deviations or column norms.
The third method is also commonly used as a normalization procedure for data augmentation/processing. Compared to matrix equilibriation, it is easier to compute. It also has an interesting connection with the method of batch normalization that is ubiquitous in machine learning. Even though these preconditioning methods are popular in practice, they are heuristics and do not always guarantee a reduction in the condition number of the transformed matrix. It is thus important for the practitioner to know when these heuristics work well. Moreover, when they cannot effectively reduce the condition number of a particular matrix, one may need an efficient algorithm that provides a diagonal preconditioner that guarantees that the preconditioned matrix has smaller condition number. Finally, in stochastic optimization settings, lack of access to the full matrix requires adaptive diagonal preconditioners that make use of data accessed so far. In this paper, we attempt to address these problems. We first study the problem of when the column normalization method is effective at reducing the condition number of the data matrix. We provide a concentration result for the condition number of sub-Gaussian matrices and show that the reduction in condition number using column nomalization is on the order of the ratio between the condition numbers of the population covariance matrix and the population correlation matrix, and then give sufficient conditions on when this ratio is large, i.e. diagonal preconditioning with column norm/standard deviation is effective. Then we study the problem of finding an optimal diagonal preconditioner given a fixed matrix. We develop bisection and interior point algorithms with iteration complexities that finds a diagonal scaling that guarantees a condition number within of the smallest possible condition number achievable by diagonal preconditioning. We also give results on the effect of reduction in condition number of the data matrix on the condition number of a class of strongly convex and smooth functions. Finally, we extend our optimal diagonal preconditioner to an adaptive setting, and compare it with batch normalization, a common adaptive preconditioning method used in machine learning, on regression and classification problems, and find that gradient methods with adaptive optimal diagonal preconditioning generally requires fewer iterations compared to batch normalization. The tradeoff is that the computation of the optimal diagonal conditioner requires using a full-rank matrix, and that its practical implementation based on the bisection algorithm may take a long time.
2 Diagonal Preconditioning and Reduction in Condition Number
In this section, we study the problem of when applying the diagonal preconditioning method with normalization by column norm is effective at reducing the condition number. When applying such methods, the matrix being transformed is often a data or design matrix with i.i.d. rows drawn from some common distribution. In this case the condition number can also be viewed as a random variable. We first use matrix concentration to show that with high probability, the reduction of condition number using column normalization is close to that achieved by transforming the population covariance matrix to the population correlation matrix. Then we give sufficient conditions on when the corresponding procedure on the population covariance matrix reduces condition number significantly.
2.1 Concentration of Condition Numbers
We consider the setting where each row of the design matrix is an i.i.d. observation of a
-dimensional sub-Gaussian vector. We are interested in how the procedure of normalizing by column norm/standard deviation affects the condition number of the design matrix. Because the condition number is a random variable in this setting, we would like to simplify the analysis by first proving that it concentrates around the condition number of population parameters with high probability, from which we can conclude how much the condition number can be reduced with this procedure.
For ease of presentation, we assume for now that the sub-Gaussian vector has mean zero and covariance matrix . One of the most common diagonal preconditioning procedures is to normalize each column of by its mean or standard deviation. Thus we first analyze the concentration behavior of the condition number under this particular diagonal preconditioning, which takes the form
where is the diagonal matrix with
corresponding to column variance or norm squared.
We recall the definition of sub-Gaussian random vectors and some results from matrix concentration and analysis that are useful in the proof of our main result. Define the sub-Gaussian norm of a random variable to be
and the sub-Gaussian norm of a random vector to be
Equivalently, the sub-Gaussian norm of is given by the smallest such that
(sub-Gaussian matrix concentration Rudelson and Vershynin (2010) ) Let be an random matrix with and i.i.d sub-Gaussian rows with sub-Gaussian norm and
its second moment matrix. Then with probability at least,
where and are universal constants. Moreover, the above also implies
(Weyl inequality ) Let be the -th eigenavlue of a self-adjoint matrix . Then
. In other words, the eigenvalue map is 1-Lipschitz on the space of self-adjoint matrices with respect to the metric induced by the operator norm. Our main result is that the condition number ofwhere concentrates around that of , where is the diagonal of , i.e. is the population correlation matrix. Since , this implies that is on the order of with high probability. Let be an random matrix with and i.i.d. sub-Gaussian rows with the sub-Gaussian norm of its rows, , and let be the population covariance matrix. Let be the diagonal matrix with , i.e. is the norm squared of the -th column of , and let be the matrix normalized with diagonal entries of . Then with probability at least , we have
Proof See the appendix.If we assume that the diagonal of is lower bounded, we can further simplify the bound. Same assumptions as the previous theorem, and additionally . Then
with probability at least
It is straightforward to extend these results to the case of sub-Gaussian matrices with independent rows and non-zero mean. The above results imply that with high probability, the “reduction” in condition number of a sub-Gaussian design matrix with column normalization is on the order of . Therefore we can shift the focus of our analysis to the problem of comparing the condition numbers of population covariance and correlation matrices.
Proof Strategy Because the scaling matrix is a function of sample statistics, namely the column-wise sample standard deviations or column norm, the analysis with directly is cumbersome. Our strategy for showing the main result is to start with the simplest possible case for and gradually increase the complexity of the scaling matrix. More precisely, we show a sequence of concentration results, each one built on top of the other, with the following scaling matrices:
is equal to the population covariance matrix , with general non-diagonal .
is the sample column-wise standard deviation, under the assumption that is diagonal.
is the diagonal of the population covariance matrix, with general non-diagonal .
is the sample standard deviation, with general non-diagonal .
For details of the proof, see the appendix.
The diagonal preconditioning considered in the above results is widely used in practice. It is common knowledge that when variables have different scales, the condition number of the design matrix (whose columns correspond to variables) tends to be large. This can be alleviated by rescaling the columns of the design matrix by its norm, so that each column has norm 1. This procedure does not guarantee, however, that there is always a significant reduction in condition number. In the next subsection, we study conditions under which such a procedure does result in a significant reduction in condition number, or equivalently that .
It is well-known that a matrix whose columns have different scales tends to have a large condition number. This is the reason behind normalization procedures used to process data in a variety of contexts such as PCA and first order optimization methods. However, the answer to our question about whether the correlation matrix has a much smaller condition number will depend on the structure of the correlation matrix as well.
As a first step to explore this dependence, we randomly generate some covariance matrices and look for those with small ratio . In Table 2 we present some examples of covariance matrices with , i.e. the correlation matrix is more ill-conditioned than the covariance matrix.
We see that the correlation matrices with large condition numbers all have off-diagonal entries that are close to 1, which is confirmed by other examples.
In this section, we show that if a covariance matrix has columns (and rows) that are of different scales, but its corresponding correlation matrix is diagonally dominant, then diagonal preconditioning can achieve a reduction in condition number on the order of with high probability, where is the matrix consisting of the diagonal entries of . To put the result in perspective, note that is always true. We are interested in sufficient conditions that guarantee a matching lower bound, so that , i.e. the reduction is on the order of with high probability for a class of matrices.
We shall use the above result to give a sufficient condition on when the ratio is large, i.e. when diagonal preconditioning with column standard deviation/norm achieves a significant reduction in condition number with high probability.
More specifically, we show that if the correlation matrix is diagonally dominant, then the ratio is bounded below by . For a covariance matrix , if , then we have
3 Optimal Diagonal Preconditioning
We have shown that when the population correlation matrix is diagonally dominant, diagonal preconditioning of design matrix by sample standard deviations yields a reduction in condition number on the order of with high probability. However, many population correlation matrices are not diagonally dominant, or we may not be able to easily verify this condition. In such cases, we need an efficient method to obtain a diagonal conditioner that does reduce the condition number of any fixed design matrix . We turn to this problem in this section.
Given a fixed design matrix , we are interested in the optimal diagonal preconditioning matrix , that gives the minimal condition number of
among all diagonal preconditioners. We first reformulate the problem to a quasi-convex SDP problem that can be solved with bisection, and then provide a faster potential reduction algorithm with guarantee on convergence. The purpose of studying the optimal diagonal preconditioning procedure is the following:
In settings where diagonal preconditioning helps to reduce condition number, we would like to compare the performances of diagonal preconditioning using column standard deviation/norm studied in the previous section with that of the optimal diagonal preconditioning, to see how much gap there is between the two.
In situations where diagonal preconditioning with sample standard deviation/norm does not actually reduce the condition number, we may instead use the optimal diagonal preconditioning procedure.
We are also interested in how the result of adaptive diagonal preconditioning procedures, such as batch normalization, compares with the optimal diagonal preconditioning. This may help us understand the nature of such adaptive preconditioning procedures.
3.1 Optimal Diagonal Preconditioning
Writing and rewriting as , the optimal diagonal preconditioning problem is
We can rewrite it as
The inequality constraints can be further simplified to . The problem can thus be reformulated as
where is a diagonal matrix with strictly positive entries.
We may solve this problem using bisection as follows. First find a large fixed such that the convex feasiblity SDP problem
has a solution. In practice, we can start with , or if we have some prior information that there exists some that satisfies
we can set as a warm start. Often the choice or yields a smaller upper bound.
Then, solve the feasibility problem
If it is feasible, then we know the optimal objective value is in , otherwise it is in . We can find an -optimal solution in iterations.
There is also a geometric interpretation of the feasibility problem with parameter : it looks for an ellipsoid with axes of symmetry the coordinate axes, such that it is inscribed in the ellipsoid defined by , and such that a multiple of includes the ellipsoid defined by . Such inscribing and containing ellipsoids have been used the ellipsoid method, where the ellipsoids are not constrained to be symmetric along coordinate axes. The geometric interpretation also makes it clear that the optimal value is achieved.
Step 0. Start with such that
is feasible. Set , .
Step 1. Solve the feasibility problem
If it is feasible, then set . Otherwise, set .
Step 2. If , return to Step 1.
We next develop an interior point method that finds an -optimal diagonal scaling in iterations, where each iteration consists of a Newton update step. This is in contrast with the bisection algorithm, which also has
iterations, but each iteration consists of solving a SDP feasibility problem. The formulation and proof borrows ideas from interior point methods for linear programming and semidefinite programming, in particularMizuno (1992), Ye (1992, 1995), Nesterov and Todd (1997), most notable of which are the work of Nesterov and Todd (Nesterov and Todd (1997)) on Newton updates in SDP and the work of Ye (Ye (1995)) on the “von Neumann economic growth problem”.
3.2 Interior Point Algorithms with Potential Reduction
Recall the reformulated optimal diagonal preconditioning problem
Following Ye (1995), we define the potential function as
where is the analytic center of the feasible region
which is simply the unique maximizer of the strictly concave problem
The geometric interpretation of the problem makes it clear that if the feasible region has a non-empty interior for some , then the feasible region has a non-empty interior for all , so that the analytic center is well defined. At the optimal objective value , however, the solution set has no interior because it must be the case that for any in the solution set, where are the eigenvalues of a matrix, and so perturbing in some direction by any amount will violate the constraint . For the feasible set of the feasibility problem
has bounded non-empty interior for all . Next we show that the potential function is monotonic in , and that it is bounded away from , for any large, so that it is a reasonable measure of the progress of the problem: we just need to be able to reduce the potential function by a constant amount at every step and terminate when the potential function is reduced below a certain threshold determined by accuracy . If , then .
Proof See the appendix. ∎
Let satisfy , then for any , we have
for some universal constant that only depends on .
Proof See the appendix. ∎
These two results imply that if , then we must have . Therefore if we can find a whose associated potential is below the threshold of , we have found an approximate solution that yields a condition number that is within of the optimal condition number. Our goal is then to design an algorithm that achieves constant reduction of the potential at every iteration, which would guarantee an iteration complexity of .
In the next two subsections, we develop potential reduction algorithms based on this idea, using exact and and approximate analytic centers, respetively. Before doing so, we show that the analytic center satisfies an equality condition that is useful in the proof of the reduction of potential function. This is essentially the first order condition of the convex optimziation problem, but because it also serves an important role in the Newton reduction step, we state it is a particular form in the next lemma. The analytic center of the feasible region satisfies
Proof Since solves the convex problem
The first order condition implies
so if we define and , and
then we have and moreover . ∎
3.3 Analysis of Potential Reduction with Exact Analytic Center
Let be the analytic centers of and . With a suitable choice of , we show that the analytic centers and are close, and that
so that we can achieve constant reduction of the potential function by going from to . To obtain from , we can use symmetrized versions of Newton iterations with a quadratic convergence rate. The exact form of the Newton iteration is deferred to the next subsection, where we propose a practical potential algorithm using approximate analytic centers and only one Newton step. Let be the solution of
and let be a known upper bound of , i.e.
Let and be the analytic centers of and for . If satisfies
for some sufficiently small that only depends on , then
for some constant and
Proof See the appendix. ∎
We have thus obtained a conceptual algorithm, which uses the exact analytic center, summarized in Algorithm 2.
Step 0. Let be analytic center of with . Set .
Step 1. Let
for some small constant and let . Then use Newton’s method described in the next section to generate the axact analytic center of .
Step 2. If , set and return to Step 1.
3.4 Algorithm with Approximate Analytic Center
Given step size , in order to apply the potential reduction algorithm, we still need to find the new analytic center corresponding to . This can be achieved using Newton iterations. However, it is not necessary to use exact analytic centers. Instead, we can use an approximate analytic center obtained from only one Newton update step, and show that the max-potential still reduces by a constant amount. Moreover, if we start from an approximate analytic center, one Newton step with the appropriate step size yields an approximate center that still achieves this constant reduction in potential. We develop these results in this subsection.
By an approximate analytic center, we mean a set of variables that satisfy the optimality conditions
but instead of requiring as is true for the exact analytic center, we allow the products to be close to the identity .
We define a measure of proximity of the products as follows
where the equality follows from the fact that the Frobenius norm is a spectral property for symmetric matrices. We may also check that
As we will see in the following results, this is the appropriate measure of proximity. The first result we show is that starting from an analytic center for , we can obtain a feasible point in that is an approximate analytic center. Let be the exact analytic center of . Define , and .
Further, let with where is a sufficiently small constant, and
Then, the new variables satisfy
and . Moreover, .
Proof We can readily verify that the new variables satisfy the desired equalities. When is appropriately chosen as in the previous theorem, we can guarantee that . Now
where we have used where and are the eigenvalues of . Therefore, . Similarly, and since , we have ∎
Next we show that this set of variables can be used as an initial set to generate a new approximate center for , using a Newton step. This also implies that starting from an approximate analytic center we can obtain an exact analytic center by applying the Newton iterations with quadratic convergence. Here to ensure the symmetry of updates in the Newton step, we use the Nesterov-Todd direction Todd et al. (1998), Nesterov and Todd (1997).
Let Let be any set that satisfy
and , with
for some .
Let be defined by
where the increments s are given by the Newton step
4 Diagonal Preconditioning and Stochastic Convex Optimization
So far we have focused on diagonal preconditioning of design matrices. We studied how preconditioning affects the condition number of the design matrices, and also developed algorithms to compute optimal diagonal preconditioners. These results have direct implications on the convergence rates of iterative solvers of linear systems and first order methods applied to quadratic objectives, since convergence rates depend explicitly on the condition number of the (constant) Hessian matrix .
For general convex objective functions, convergence rates of first order methods depend on the smoothness and convexity constants of the function. It is therefore more challenging to study the impact of preconditioning on the design matrix on the convergence of first order methods on general convex objective functions, as we need to investigate how the reduction in condition number of the design matrix translates to improvements in the smoothness and convexity constants of the objective function. In this section, we first give such a result for a class of objective functions that correspond to the (regularized) loss of an empirical risk minimization problem (Shalev-Shwartz and Zhang (2013)), and show that when is smooth and strongly convex, reductions in the condition number of the objective is on the order of where is the design matrix and is any preconditioned design matrix with smaller condition number. For other more general classes of convex objectives, we leave to future work to quantify theoretically the impact of condition number on convergence rates. However, we make the case that the optimal diagonal preconditioning algorithms developed in the previous section is not limited to being applied to fixed design matrices. We establish the connection of batch normalization, a popular technique in machine learning, to adaptive preconditioning, and develop an adaptive version of the optimal diagonal preconditioning algorithm. We then conduct numerical experiments that demonstrate the effectiveness of the optimal diagonal preconditioning algorithm at speeding up first order methods in regression and classification problems in the next section.
4.1 Convergence Rates for General Convex Problems
The relationship between the complexity of first order methods and the smoothness and convexity properties of the objective function has been well established in the optimization literature. Here the complexity is usually expressed in terms of the number of iterations required to obtain an -optimal solution. We summarize the corresponding iteration complexities of solving
|Lipschitz continuous||Smooth||Smooth&Strongly Convex|