Consider the following optimization problem
where are smooth functions and
is a convex constraint set. Many machine learning and scientific computing problems involve finding an approximation of the minimizer of the above optimization problem tohigh precision. For example, consider any machine learning application where each
is a loss function corresponding todata point and is the empirical risk. The goal of solving (1) is to obtain a solution with small generalization error, i.e., high predictive accuracy on “unseen” data. One can show that the generalization error of the empirical risk minimizer is to within additive error of that of the true population risk minimizer; e.g., see [bousquet2002stability, cesa2004generalization, vapnik1989inductive, vapnik2013nature]. As a result, in large scale regime that we consider in this paper where there are many data points available, i.e., , obtaining a solution to (1) to high precision would indeed guarantee a low generalization error. A second example is that in many problems in which the optimization variable contains specific meanings, a high-precision solution to (1) is necessary for interpreting the results. Examples of such settings arise frequently in machine learning such as sparse least squares [tibshirani1996regression], generalized linear models (GLMs) [friedman2001elements] and metric learning problems [kulis2012metric] among many more modern large scale problems.
There is a plethora of first-order optimization algorithms [bubeck2014theory, nocedal2006numerical] for solving (1). However, for ill-conditioned problems, it is often the case that first-order methods return a solution far from the minimizer, , albeit a low objective value. (See Figure 2 in Section 5 for example.) On the other hand, most second-order algorithms prove to be more robust to such ill conditioning. This is so since, using the curvature information, second-order methods properly rescale the gradient, such that it is a more appropriate direction to follow. For example, take the canonical second-order method, i.e., Newton’s method, which, in the unconstrained case, has updates of the form
where and denote the gradient and the Hessian of at , respectively. Classical results indicate that under certain assumptions, Newton’s method can achieve a locally super-linear convergence rate, which can be shown to be problem independent! Nevertheless, the cost of forming and inverting the Hessian is a major drawback in using Newton’s method in practice.
In this regard, there has been a long line of work that tries to provide sufficient second-order information with feasible computations. For example, among the class of quasi-Newton methods, the BFGS algorithm nocedal2006numerical and its limited memory version liu1989lbfgs are the most celebrated. However, the convergence guarantee of these methods can be much weaker than Newton’s methods. More recently, authors in pilanci2015newton,erdogdu2015convergence,roosta2016sub1 considered using sketching and sampling techniques to construct an approximate Hessian matrix and using it in the update rule (2). They showed that such algorithms inherit a local linear-quadratic convergence rate with a substantial computational gain.
where are readily available and is some positive semi-definite matrix. This situation arises very frequently in many applications such as machine learning. For example, take any problem where , is any convex loss function and ’s are data points, and . In such situations, is simply and .
First, we choose a sampling scheme that constructs an appropriate non-uniform sampling distribution over and samples terms from . The approximate Hessian, constructed as where denotes the set of sub-sampled indices, is then used to update the current iterate as
, e.g., Conjugate Gradient or Stochastic Gradient Descent, with a few iterations such that a high-quality approximate solution can be produced with a less complexity. Such inexact updates used in many second-order optimization algorithms have been well studied in[byrd2011use, dembo1982inexact].
Under certain conditions, it can be shown that this type of randomized Newton-type algorithm can achieve a local linear-quadratic convergence rate shown as below (formally stated in Lemma 7)
where are some constants that can be controlled by the Hessian approximation quality, i.e., choice of sampling scheme ,111To be more precise, by a sampling scheme here, we mean the way we construct the sampling distribution , e.g., uniform sampling distribution or leverage scores sampling distribution, and the value of sampling size . and the solution quality of (4), i.e., choice of solver .222By a solver here, we mean the choice of the specific algorithm with parameters specified, e.g., number of iterations, we use to obtain a high-quality approximation solution to the subproblem.
Different choices of sampling scheme and solver lead to different complexities in SSN. Below, we briefly discuss their effects. As can be seen, the total complexity of our algorithm can be characterized by the following three factors, each of which is affected by or , or both.
Number of total iterations determined by the convergence rate, i.e., and in (5) which is affected by sampling scheme and solver .
In each iteration, the time it needs to construct and sample terms, which is determined by sampling scheme .
In each iteration, the time it needs to (implicitly) form which is affected by sampling scheme and to (inexactly) solve the linear problem (4) which is affected by solver .
With these, the total complexity can be expressed as
where is the time it takes to compute the full gradient which is not affected by the choice of and and will not be discussed in the rest of this paper.
As discussed above, the choice of sampling scheme and solver plays an important role in our algorithm. Below, we focus on first and discuss a few concrete sampling schemes. A natural and simple approach is uniform sampling discussed in erdogdu2015convergence,roosta2016sub1, roosta2016sub2. The greatest advantage of uniform sampling is its simplicity of construction. However, in the presence of high non-uniformity among , the sampling size required to sufficiently capture the curvature information of the Hessian can be very large, which makes the resulting algorithm less beneficial.
In this paper, we consider two non-uniform sampling schemes, block norm squares and a new, and more general, notion of leverage scores named block partial leverage scores (Definition 6); see Section 3 for detailed construction. The motivation for using these schemes is that we can in fact view the sufficient conditions for achieving the local linear-quadratic convergence rate (5) as matrix approximation guarantees and these two sampling schemes can yield high-quality matrix approximations; see Section 4.1 for more details. Recall that, the choice of affects the three terms, namely, (manifested in and ), , , in (6). By leveraging and extending theories in randomized linear algebra, we can show how these terms may become when different sampling schemes are used as presented in Table 1. Detailed theory is elaborated in Section 4. As we can see, block norm squares sampling and leverage scores sampling require a smaller sampling size than uniform sampling does since . Furthermore, the dependence of and on the condition number reveals that leverage scores sampling is more robust to ill-conditioned problems; see Section 4.6.1 for more detailed discussions.
|SSN (leverage scores)||This paper|
|SSN (block norm squares)||This paper|
Next, we discuss the effect of the solver . Typically, a direct solver takes time to solve the subproblem (4) where is the sampling size. This becomes prohibitive when is large. However, iterative solvers allow one to obtain a high quality approximation solution with a few iterations which may drastically drive down the complexity. For example, Conjugate Gradient (CG) takes to return an approximate solution with relative error where is the condition number of the problem. In Lemma 7 we show that such inexactness will not deteriorate the performance of SSN too much.
Indeed, based on (5), it is possible to choose the parameters, e.g., sampling size in the sampling scheme and number of iterations to run in solver , so that SSN converges in a constant linear rate, e.g.,
By this way, the complexity per iteration of SSN can be explicitly given. In Table 2 we summarize these results with comparison to other stochastic second-order approaches such as LiSSA agarwal2016second. It can be seen from Table 2 that compared to Newton’s methods, these stochastic second-order methods trade the coefficient of the leading term with some lower order terms that only depend on and condition numbers. Although SSN with non-uniform sampling has a quadratic dependence on , its dependence on the condition number is better than the other methods. There are two main reasons. First, the total power of the condition number is lower, regardless of the versions of the condition number needed. Second, SSN (leverage scores) and SSN (block norm squares) only depend on which can be significantly lower than the other two definitions of condition number, i.e., and ; see Section 4.6.2 for more details.
|name||complexity per iteration||reference|
|SSN (leverage scores)||This paper|
|SSN (block norm squares)||This paper|
|Newton Sketch (SRHT)||[pilanci2015newton]|
As we shall see (in Section 5), our algorithms converge much faster than other competing methods with ridge logistic regression. In particular, on several real datasets with a moderately high condition number and large , our methods are at least twice as fast as Newton’s methods in finding a medium- or high-precision solution, while other methods including first-order methods converge slowly. Indeed, this phenomenon is well supported by our theoretical findings—the complexity of our algorithms has a lower dependence on the problem condition number and is immune to any non-uniformity among which may cause a factor of in the complexity (Table 2). In the following we present other prior work and details of our main contributions.
1.1 Related work
Recently, within the context of randomized second-order methods, many algorithms have been proposed that aim at reducing the computational costs involving pure Newton’s method. Among them, algorithms that employ uniform sub-sampling constitute a popular line of work [byrd2011use, erdogdu2015convergence, martens2010deep, vinyals2011krylov]. In particular roosta2016sub1,roosta2016sub2 consider a more general class of problems and, under a variety of conditions, thoroughly study the local and global convergence properties of sub-sampled Newton methods where the gradient and/or the Hessian are uniformly sub-sampled. Our work here, however, is more closely related to a recent work pilanci2015newton (Newton Sketch), which considers a similar class of problems and proposes sketching the Hessian using random sub-Gaussian matrices or randomized orthonormal systems. Furthermore, [agarwal2016second]
proposes a stochastic algorithm (LiSSA) that, for solving the sub-problems, employs some unbiased estimators of the inverse of the Hessian.
The main technique used by pilanci2015newton,roosta2016sub1,roosta2016sub2 and our work is sketching, which is a powerful technique in randomized linear algebra and many other applications woodruff2014sketching,mahoney2011randomized,yang2016review. As mentioned above, Hessian approximation can be viewed as a matrix approximation problem. In terms of this, error analysis of matrix approximation based on leverage scores sampling has been well studied drineas2008relative, drineas2012fast, cohen2015ridge. holodnak2015randomized show the lower bounds of sampling size for both block norm squares sampling and leverage scores sampling in approximating the Gram product matrix. Also cohen2015uniform implies that uniform sampling cannot guarantee spectral approximation when the sampling size is only dependent on the lower dimension.
1.2 Main contributions
The contributions of this paper can be summarized as follows.
For the class of problems considered here, unlike the uniform sampling used in [byrd2011use, erdogdu2015convergence, roosta2016sub1, roosta2016sub2], we employ two non-uniform sampling schemes based on block norm squares and a new, and more general, notion of leverage scores named block partial leverage scores (Definition 6). It can be shown that in the case of extreme non-uniformity among , uniform sampling might require samples to capture the Hessian information appropriately. However, we show that our non-uniform sampling schemes result in sample sizes completely independent of and are immune to such non-uniformity.
Within the context of sub-sampled Newton-type algorithms, [byrd2011use, roosta2016sub2] incorporate inexact updates where the sub-problems are solved only approximately and study global convergence properties of their algorithms. We extend the study of inexactness to the finer level of local convergence analysis.
We provide a general structural result (Lemma 7) showing that, as in [erdogdu2015convergence, pilanci2015newton, roosta2016sub1], our main algorithm exhibits a linear-quadratic solution error recursion. However, we show that by using our non-uniform sampling strategies, the factors appearing in such error recursion enjoy a much better dependence on problem specific quantities, e.g., such as the condition number (Table 1). For example, using block partial leverage score sampling, the factor for the linear term of the error recursion (14) is of order as opposed to for uniform sampling.
We show that to achieve a locally problem independent linear convergence rate, i.e., for some fixed constant , the per-iteration complexities of our algorithm with leverage scores sampling and block norm squares sampling are and , respectively, which have lower dependence on condition numbers compared to [agarwal2016second, pilanci2015newton, roosta2016sub2] (Table 2). In particular, in the presence of high non-uniformity among , factors and (see Definition (27)) which appear in SSN (uniform) [roosta2016sub1], and LiSSA [agarwal2016second], can potentially be as large as ; see Section 4.6 for detailed discussions.
We numerically demonstrate the effectiveness and robustness of our algorithms in recovering the minimizer of ridge logistic regression on several real datasets with a moderately large condition number (Figures 1, 2 and 3). In particular, our algorithms are at least twice as fast as Newton’s methods in finding a medium- or high-precision solution, while other methods including first-order methods converge slowly.
The remainder of the paper is organized as follows. We begin in Section 2 with notation and assumptions that will be used throughout the paper. In Section 3, we describe our main algorithm and propose two non-uniform sampling schemes. Section 4 provides theoretical analysis. Finally, we present our numerical experiments in Section 5.
Given a function , the gradient, the exact Hessian and the approximate Hessian are denoted by , , and , respectively. Iteration counter is denoted by subscript, e.g., . Unless stated specifically,
denotes the Euclidean norm for vectors and spectral norm for matrices. Frobenius norm of matrices is written as. Given a matrix , we let denote the number of non-zero elements in . By a matrix having blocks, we mean that has a block structure and can be viewed as , for appropriate size blocks .
Definition 1 (Tangent Cone).
Denote be the tangent cone of constraints at the optimum , i.e., .
Definition 2 (-restricted Maximum and Minimum Eigenvalues).
Given a symmetric matrix and a cone , we define the -restricted maximum and minimum eigenvalues as follows.
-restricted maximum and minimum eigenvalues as follows.
Definition 3 (Stable Rank).
Given a matrix , the stable rank of is defined as
Definition 4 (Leverage Scores).
Given , then for , the -th leverage scores of is defined as
Throughout the paper, we use the following assumptions regarding the properties of the problem.
Assumption 1 (Lipschitz Continuity).
is convex and twice differentiable. The Hessian is -Lipschitz continuous, i.e.
Assumption 2 (Local Regularity).
is locally strongly convex and smooth, i.e.,
Here we define the local condition number of the problem as .
Assumption 3 (Hessian Decomposition).
For each in (1), define . For simplicity, we assume and is independent of . Furthermore, we assume that given , computing , , and takes , , and time, respectively. We call the matrix the augmented matrix of . Note that .
3 Main Algorithm: SSN with Non-uniform Sampling
Our proposed SSN method with non-uniform sampling is given in Algorithm 1. The core of our algorithm is based on choosing a sampling scheme that, at every iteration, constructs a non-uniform sampling distribution over and then samples from to form the approximate Hessian, . The sampling sizes needed for different sampling distributions will be discussed in Sections 4.2 and 4.3. Since , the Hessian approximation essentially boils down to a matrix approximation problem. Here, we generalize the two popular non-uniform sampling strategies, i.e., leverage score sampling and block norm squares sampling, which are commonly used in the field of randomized linear algebra, particularly for matrix approximation problems [holodnak2015randomized, mahoney2011randomized]. With an approximate Hessian constructed via non-uniform sampling, we may choose an appropriate solver to the solve the sub-problem in Step 11 of Algorithm 1. Below we elaborate on the construction of the two non-uniform sampling schemes. Indeed, the sampling distribution is defined based on the matrix representation of — its augmented matrix defined as follows.
Definition 5 (Augmented Matrix).
Define the augmented matrix of as
For the ease of presentation, throughout the rest of this section and next section, we use and to denote and , respectively, as long as it is clear in the text.
Block Norm Squares Sampling
The first option is to construct a sampling distribution based on the magnitude of . That is, define
This is an extension to the row norm squares sampling in which the intuition is to capture the importance of the blocks based on the “magnitudes” of the sub-Hessians.
Block Partial Leverage Scores Sampling
The second option is to construct a sampling distribution based on leverage scores. Compared to the traditional matrix approximation problem, this problem has two major difficulties. First, here blocks are being sampled, not single rows. Second, the matrix being approximated involves not only but also .
To address the first difficulty, we follow the work by harvey_sparse in which a sparse sum of semi-definite matrices is found by sampling based on the trace of each semi-definite matrix after a proper transformation. By expressing , one can show that their approach is essentially sampling based on the sum of leverage scores that correspond to each block. For the second difficulty, inspired by the recently proposed ridge leverage scores [el2014fast, cohen2015ridge], we consider the leverage scores of a matrix that concatenates and . Combining these motivates us to define a new notion of leverage scores —- block partial leverage scores which is define as follows formally.
Definition 6 (Block Partial Leverage Scores).
Given a matrix with blocks and a matrix satisfying , let be the leverage scores of the matrix . Define the block partial leverage score for the -th block as
Then the sampling distribution is defined as
Remark. When each block of has only one row and , the partial block leverage scores are equivalent to the ordinary leverage scores defined in Definition 4.
4 Theoretical Results
In this section we provide detailed theoretical analysis to describe the complexity of our algorithm.333In this work, we only focus on local convergence guarantees for Algorithm 1. To ensure global convergence, one can incorporate an existing globally convergent method, e.g. [roosta2016sub2], as initial phase and switch to Algorithm 1 once the iterate is “close enough” to the optimum; see Lemma 7. Different choices of sampling scheme and the sub-problem solver lead to different complexities in SSN. More precisely, total complexity is characterized by the following four factors: (i) total number of iterations determined by the convergence rate which is affected by the choice of and ; (ii) the time, , it takes to compute the full gradient (Step 10 in Algorithm 1), (iii) the time , to construct the sampling distribution and sample terms at each iteration (Steps 4-8 in Algorithm 1), which is determined by ; and (iv) the time needed to (implicitly) form and (inexactly) solve the sub-problem at each iteration (Steps 9 and 11 in Algorithm 1) which is affected by the choices of both (manifested in the sampling size ) and . With these, the total complexity can be expressed as
Below we study these contributing factors. Lemma 7 in Section 4.1 gives a general structural lemma that characterize the convergence rate which determines . In Sections 4.2 and 4.3, Lemmas 8 and 10 discuss for the two sampling schemes respectively while Lemmas 9 and 11 give the required sampling size for the two sampling schemes respectively which directly affects the . Furthermore, is also affected by the choice of solver which will be discussed in Section 4.4. Finally, the complexity results are summarized in Section 4.5 and a comparison with other methods is provided in Section 4.6.
4.1 Sufficient conditions for local linear-quadratic convergence
Before diving into details of the complexity analysis, we state a structural lemma that characterizes the local convergence rate of our main algorithm, i.e., Algorithm 1. As discussed earlier, there are two layers of approximation in Algorithm 1, i.e., approximation of the Hessian by sub-sampling and inexactness of solving (8). For the first layer, we require the approximate Hessian to satisfy one of the following two conditions (in Sections 4.2 and 4.3 we shall see our construction of approximate Hessian via non-uniform sampling can achieve these conditions with a sampling size independent of ).
Note that (C1) and (C2) are two commonly seen guarantees for matrix approximation problems. In particular, (C2) is stronger in the sense that the spectrum of the approximated matrix is well preserved. Below in Lemma 7, we shall see such a stronger condition ensures a better dependence on the condition number in terms of the convergence rate. For the second layer of approximation, we require the solver to produce an -approximate solution satisfying
Remark. When the problem is unconstrained, i.e., , solving the subproblem (8) is equivalent to solving
Then requirement (12) is equivalent to finding an approximation solution such that
Lemma 7 (Structural Result).
Specifically, given any ,
From this it is not hard to see that (C2) is strictly stronger than (C1). Also, in this case the Hessian approximation problem boils down to a matrix approximation problem. That is, given and , we want to construct a sampling matrix efficiently such that the matrix is well preserved. As we mentioned, leverage scores sampling and block norm squares sampling are two popular ways for this task. In the next two subsections we will focus on the theoretical properties of these two schemes.
4.2 Results for block partial leverage scores sampling
Since the block partial leverage scores are defined as the standard leverage scores of some matrix, we can make use of the fast approximation algorithm for standard leverage scores. Specifically, apply a variant of the algorithm in drineas2012fast by using the sparse subspace embedding [clarkson13sparse] as the underlying sketching method to further speed up the computation.
4.2.2 Sampling size
The following theorem indicates that if we sample the blocks of based on block partial leverage scores with large enough sampling size, (18) holds with high probability.
Remark. When are the exact scores, since where , the above theorem indicates that less than blocks are needed for (18) to hold.
4.3 Results for block norm squares sampling
To sample based on block norm squares, one has first compute the Frobenius norm of every block in the augmented matrix . This requires time.
Given , under Assumption 3, it takes time to construct a block norm squares sampling distribution for where is the augmented matrix of .
4.3.2 Sampling size
The following theorem holodnak2015randomized show the approximation error bound for Gram matrix. Here we extend it to our augmented matrix setting as follows,
4.4 Discussion on the choice of solver
Here we discuss the effect of the choice of the solver in Algorithm 1. Specifically, after an approximate Hessian is constructed in Algorithm 1, we look at how various solvers affect in (11). Since the approximate Hessian is of the form where , the complexity for solving the subproblem (8) essentially depends on and . Given and , for ease of notation, we use
to denote the time it needs to solve the subproblem (8) using solver .
For example, when the problem is unconstrained, i.e., , the subproblem (8
) reduces to a linear regression problem with sizeby and direct solver costs . Alternatively, one can use an iterative solver such as Conjugate Gradient (CG) to obtain an approximate solution. In this case, the complexity for solving the subproblem becomes to produce an solution to (8) where is the condition number of the problem. It is not hard to see that CG is advantageous when the low dimension is large and the linear system is fairly well-conditioned.
Again, recall that in (11) the complexity of the sub-sampled Newton methods can be expressed as . Combining the results from the previous few subsections, we have the following lemma characterizing the total complexity.
Indeed, Lemma 7 implies that the sub-sampled Newton method inherits a local constant linear convergence rate. This can be shown by choosing specific values for and in Lemma 7. The results are presented in the following corollary.
if block partial leverage scores sampling is used, the complexity per iteration in the local phase is
if block norm squares sampling is used, the complexity per iteration in the local phase is
and the solution error satisfies
where is a constant.444In this paper, hides logarithmic factors of , and .
4.6.1 Comparison between different sampling schemes
As discussed above, the sampling scheme plays a crucial role in sub-sampled Newton methods. Here, we compare the two proposed non-uniform sampling schemes, namely, block partial leverage scores sampling and block norm squares sampling, with uniform sampling. SSN with uniform sampling was discussed in roosta2016sub1. For completeness, we state the sampling size bound for uniform sampling. Note that, this upper bound for is tighter than the original analysis in roosta2016sub1.
This result allows us to compare the three sampling schemes in terms of the three main complexities, i.e., , and (manifested in and ), as shown in Table 1 (identical to Table 1 in Section 1). Note here, to evaluate the effect of the sampling scheme only, we assume a direct solver is used for the subproblem (8) since in this case is directly controlled by the sampling size , independent of the solver . Also, for simplicity, we assume that . The analysis is similar for general cases. In Table 1, and are defined based on two problem properties and :
|SSN (leverage scores)|
|SSN (block norm squares)|
As can be seen in Table 1, the greatest advantage of uniform sampling scheme comes from its simplicity of construction. On the other hand, as discussed in Sections 4.2.1 and 4.3.1, it takes nearly input sparsity time to construct the leverage scores sampling distribution or the block norm squares sampling distribution. When it comes to the sampling size for achieving (17) or (18), as suggested in (24), the one for uniform sampling can become when is very non-uniform, i.e., . It can be shown that for a given , block norm squares sampling requires the smallest sampling size which leads to the smallest value of in Table 1.
It is worth pointing that, although either (17) or (18) is sufficient to yield a local linear-quadratic convergence rate, as (18) is essentially a stronger condition, it has better constants, i.e., and . This fact is reflected in Table 1. The constants and for leverage scores sampling have a better dependence on the local condition number than the other two schemes since leverage scores sampling yields a sampling matrix that satisfies the spectral approximation guarantee (18). In fact, this difference can dramatically affect the performance of the algorithm when dealing with ill-conditioned problems. This is verified by numerical experiments; see Figure 1 in Section 5 for details.
Remark. Note that all the analysis including error recursion (Lemma 7) and the required sampling size for different sampling schemes are provided as upper bounds. There will be cases that the sampling size bound indicates a large value for , in fact a much smaller sampling size
suffices to yield good performance. For example, when the leverage scores are equal or close to a uniform distribution, the actual required sampling size for uniform sampling scheme is much less than (24).
4.6.2 Comparison between various methods
Next, we compare our main algorithm with other stochastic second-order methods including [roosta2016sub1, agarwal2016second]. Since these essentially imply a constant linear convergence rate, i.e.,