 # Leverage Score Sampling for Faster Accelerated Regression and ERM

Given a matrix A∈R^n× d and a vector b ∈R^d, we show how to compute an ϵ-approximate solution to the regression problem _x∈R^d1/2A x - b_2^2 in time Õ ((n+√(d·κ_sum))· s·ϵ^-1) where κ_sum=tr(A^A)/λ_(A^TA) and s is the maximum number of non-zero entries in a row of A. Our algorithm improves upon the previous best running time of Õ ((n+√(n ·κ_sum))· s·ϵ^-1). We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given and , the regression problem

is one of the most fundamental problems in optimization and a prominent tool in machine learning. It is one of the simplest empirical risk minimization (ERM) problems and a prominent proving ground for developing efficient learning algorithms. Regression is long known to be solve-able directly by matrix multiplication in

time where [Wil12] is the matrix multiplication constant and recent work has improved the running time to ,111Through this paper, we use to hide factors polylogarithmic in ,, and (c.f. Definition 3 and Definition 6). i.e. linear time plus the time needed to solve a nearly square linear system. [CW13, LMP13, NN13, CLM15, Coh16] However, for sufficiently large even a super-quadratic running time of  can be prohibitively expensive. Consequently, over the past decade improving this running time under mild regularity assumptions on has been an incredibly active area of research. In this paper we improve the best known running time for solving regression under standard regularity assumptions. Formally the problem we consider is as follows.

###### Definition 1 (The Regression Problem).

Given with rows and , we consider the regression problem where

 fA,b(x)def=12∥Ax−b∥22=∑i∈[n]12(a⊤ix−bi)2.

The central problem of this paper is to get faster regression algorithms defined as follows.

###### Definition 2 (Regression Algorithm).

We call an algorithm a -time regression algorithm if for any , , and w.h.p in in time the algorithm outputs a vector such that

 fA,b(y)−minxfA,b(x)≤ϵ⋅(fA,b(x0)−minxfA,b(x)). (1.1)

Note that if is a minimizer of then the guarantee (1.1) is equivalent to the following

 ∥y−x∗∥2A⊤A≤ϵ∥x0−x∗∥2A⊤A (1.2)

where for a PSD matrix is defined as . The goal of this paper is to provide regression algorithms with improved running times depending on , , and the following regularity parameters.

###### Definition 3.

(Regularity Parameters) We let and

denote the smallest and largest eigenvalues of

. We let denote the condition number of and let denote the total condition number of . We let denote the maximum number of non-zero entries in a row of . Occasionally, we drop the terms in parenthesis when they are clear from context.

### 1.1 Previous Results

Standard classic iterative methods such as gradient descent and accelerated gradient descent [Nes83] solve the regression problem with running times of and respectively. While these running times are super-linear whenever is super constant there has been a flurry of recent papers showing that using sampling techniques faster running times can be achieved. These often yield nearly linear running times when is sufficiently larger than . [SSZ13, JZ13a, SZ16, All16]. Using recent advances in accelerated coordinate descent [AQRY16, NS17] coupled with proximal point methods [FGKS15a, LMH15] the previous fastest iterative algorithm is as follows:

###### Theorem 4 (Previous Best Regression Running Times).

Given , there is a -time regression algorithm with

 T(A)=˜O⎛⎜ ⎜⎝⎛⎜ ⎜⎝n+∑i∈[n]∥ai∥2√λmin(A⊤A)⎞⎟ ⎟⎠⋅s(A)⎞⎟ ⎟⎠=˜O((n+√n⋅κsum(A⊤A))⋅s(A)).

The inequality in this theorem follows directly from Cauchy Schwartz, as

 ∑i∈[n]∥ai∥2≤√n⋅tr(A⊤A)=√n⋅κsum(A⊤A)⋅λmin(A⊤A) .

For the rest of the proof, see Section 5, where we provide a proof of Theorem 13, a generalization of Theorem 4 which is more convenient for our analysis.

### 1.2 Our Results

The work in this paper is motivated by the natural question, can this running time of Theorem 4 be further improved? Despite the running time lower bound of shown in [WS16],222Their lower bound involves a function with . However, is more common as we explain. in this paper we give an affirmative answer improving the term in Theorem 4 to . The main result of this paper is the following:

###### Theorem 5 (Improved Regression Running Time).

Given , Algorithm LABEL:alg:solveUsingLS is a

-time regression algorithm that succeeds with high probability(w.h.p) in

where

 T(A)=˜O⎛⎜ ⎜⎝nnz(A)+⎛⎜ ⎜⎝d+∑i∈[n]∥ai∥2⋅σi(A)√λmin(ATA)⎞⎟ ⎟⎠⋅s(A)⎞⎟ ⎟⎠

and for all .

Up to polylogarithmic factors Theorem 5 is an improvement over Theorem 4 as . This improvement can be substantial as can be as small as , e.g. if is an entry-wise random Gaussian matrix. Compared to Theorem 4 whose second term in running time grows as , our second term is independent of due to the following:

 ∑i∈[n]∥ai∥2σi(A)≤√∑i∈[n]∥ai∥22∑i∈[n]∥ai∥2(A⊤A)−1=√tr(A⊤A)tr(A(A⊤A)−1AT)≤√dκ% sum. (1.3)

Therefore in Theorem 5 we have . This improvement from to can be significant as (the number of samples) is in some cases orders of magnitude larger than (the number of features). For example, in the LIBSVM dataset, in 87 out of 106 many non-text problems, we have , 50 of them have and in the UCI dataset, in 279 out of 301 many non-text problems, we have , 195 out of them have . Furthermore, in Section 7 we show how to extend our results to ERM problems more general then regression. In particular we consider the following ERM problem

###### Definition 6 (Erm).

Given with rows and functions such that each is twice differentiable and satisfies

 ∀x∈Rd1M≤ψ′′(x)≤M (1.4)

we wish to minimize over where

 F(x)def=∑i∈[n]fi(x)=∑i∈[n]ψi(aTix)

While the form of ERM considered by us is not the most general form considered in literature, we consider this problem, to provide a proof of concept that our techniques for regression are more broadly applicable to the case of ERM and can if fact provide a non-trivial improvement over existing methods. We leave it to future work to study the limitations of our techniques and possible improvements for more general ERM problems. The following is our main theorem regarding the ERM problem.

###### Theorem 7.

Given an ERM problem(Definition 6) and an initial point , there exists an algorithm that produces a point such that which succeeds w.h.p in n in total time

 ~O((nnz(A)+(dM5+n∑i=1∥ai∥2√σiM3√λmin(A⊤A))s(A))log(1ϵ))

where are the leverage scores with respect to .

Note that Theorem 7interpolates our regression results, i.e. it recovers our results for regression in the special case of . To better understand the bound in Theorem 7, note that following the derivation in Equation (1.3) we have that the running time in Theorem 7 is bounded by

 ~O((nnz(A)+(dM5+n∑i=1M3√dκsum(A⊤A))s(A))log(1ϵ)) .

The best known bound for the ERM problem as defined in Definition 6 given by [FGKS15a, LMH15, All16] is

 ~O((nnz(A)+n∑i=1M√nκsum(A⊤A)))s(A)log(1ϵ))

In this case Theorem 7 should be seen as implying that under Assumption (1.4) the effective dependence on the number of examples on the running time for ERM can be reduced to at most . Again, we remark that the running time bound of Theorem 7 should be viewed as a proof of concept that our regression machinery can be used to improve the running time of ERM. We leave it as future work to both improve Theorem 7’s dependence on and have it extend to a broader set of problems. For example, we believe the the running time can be immediately improved to

 ~O((nnz(A)+(dM4+n∑i=1∥ai∥2√σiM3√λmin(A⊤A))s(A))log(1ϵ))

simply by using a proximal version of Theorem 18, which is alluded to in the work of [AQRY16]. Note that this improvement leads to the effective number of examples being bounded by .

### 1.3 Our Approach

Our algorithm follows from careful combination and analysis of a recent suite of advances in numerical linear algebra. First, we use the previous fastest regression algorithm, Theorem 4, which is the combination of recent advances in accelerated coordinate descent [AQRY16, NS17] and proximal point methods [FGKS15a, LMH15] (See Section 5

.) Then, we show that if we have estimates of the leverage scores of the rows of

, a natural recently popularized measure of importance, [SS08, LMP13, CLM15] we can use concentration results on leverage score sampling and preconditioning to obtain a faster regression algorithm. (See Section 3.) Now, it is a powerful and somewhat well known fact that given an algorithm for regression one can compute leverage score estimates in nearly linear time plus the time needed to solve regression problems [SS08]. Consequently, to achieve the improved running time when we do not have leverage scores we are left with a chicken and egg problem. Fortunately, recent work has shown that such a problem can be solved in a several ways [LMP13, CLM15]. We show that the technique in [LMP13] carefully applied and analyzed can be used to obtain our improved running time for both estimating leverage scores and solving regression problems with little overhead. (See Section 4.) To generalize our results for a broader class of ERM problems we follow a similar procedure. Most parts generalize naturally, however perhaps the most complex ingredient is how to generalize preconditioning to the case when we are sampling non-quadratic functions. For this, we prove concentration results on sampling from ERM inspired from [FGKS15b] to show that it suffices to solve ERM on a sub-sampling of the components that may be of intrinsic interest. (See Section 7) In summary our algorithms are essentially a careful blend of accelerated coordinate descent and concentration results coupled with the iterative procedure in [LMP13] and the Johnson Lindenstrauss machinery of [SS08] to compute leverage scores. Ultimately the algorithms we provide are fairly straightforward, but it provides a substantial running time improvement that we think is of intrinsic interest. We hope this work may serve as a springboard for even faster iterative methods for regression and minimizing finite sums more broadly. Finally, we remark that there is another way to achieve the improvement over . One could simply use subspace embeddings [CW13, NN13, Coh16] and preconditioning to reduce the regression problem to a regression problem on a matrix and then apply Theorem 4 to solve the regression problem. While this works, it has three shortcomings relevant to our approach. First, this procedure would possibly lose sparsity, the matrix may be dense and thus the final running time would have an additional term our method does not. Second, it is unclear if this approach yields our more fine-grained running time dependence on leverage scores that appears in Theorem 5 which we believe to be significant. Thirdly it is unclear how to extend the approach to the ERM setting.

### 1.4 Paper Organization

After providing requisite notation in Section 2 we prove Theorem 5 in Sections 3 and 4. We first provide the algorithm for regression given leverage score estimates in Section 3 and further provide the algorithm to compute the estimates based in Section 4. Note that the algorithm for computing leverage scores makes use of the algorithm for regression given leverage scores as a sub-routine. In Section 5 we provide the proofs of deferred lemmas and theorems from Sections 3, 4. In Section 7 we provide the proof of Theorem 7. Further we collect some simple re-derivations and reductions from known results required through the paper in the Appendix.

## 2 Notation

For symmetric matrix and we let . For symmetric matrix we use to denote the condition that for all and we define , , and analogously. We use to denote the number of non-zero entries in and for a vector we let denote the number of nonzero entries in .

## 3 Regression Algorithm Given Leverage Score Estimates

The regression algorithm we provide in this paper involves two steps. First we find which rows of are important, where importance is measured in terms of leverage score. Second, we use these leverage scores to sample the matrix and solve the regression problem on the sampled matrix using Theorem 13. In this section we introduce leverage scores and provide and analyze the second step of our algorithm.

###### Definition 8 (Leverage Score).

For with rows we denote the leverage score of row by .

Leverage score have numerous applications and are well studied. It is well known that for all and . The critical fact we used about leverage scores is that sampling rows of according to any overestimate of leverage scores yields a good approximation to after appropriate re-scaling [CLM15, SS08]:

###### Lemma 9 (Leverage Score Sampling (Lemma 4 of [Clm+15]) ).

Let , let , and let be overestimates of the leverage scores of ; i.e. for all . Define for a sufficiently large absolute constant and let be a random diagonal matrix where independently with probability and otherwise. With high probability in , and .

algocf[h]

###### Theorem 10.

If satisfies for all then is a -time regression algorithm where

 T(A)=˜O⎛⎜ ⎜⎝nnz(A)+⎛⎜ ⎜⎝d+∑i∈[n]√σi(A)⋅∥ai∥2√λmin(A⊤A)⎞⎟ ⎟⎠⋅s(A)⎞⎟ ⎟⎠ .
###### Proof.

Let for . Applying Lemma 9 yields with high probability in that

 (56)A⊤A⪯A⊤HA⪯(65)A⊤A (3.1)

where and . Note that

 E⎡⎣∑i∈[n]∥bi∥2⎤⎦=∑i∈[n]pi√pi∥ai∥2≤∑i∈[n]√k′uilogn∥ai∥2 .

Consequently, by Markov’s inequality

 ∑i∈[n]∥bi∥2≤2⋅∑i∈[n]√k′⋅uilogn⋅∥ai∥2

with probability at least and the loop in the algorithm terminates with high probability in in iterations. Consequently, the loop takes only -time and since we only sampled many independent copies of , the guarantee (3.1) again holds with high probability in . Using the guarantee (3.1) and Theorem 13 on and , we can produce a we need in time times

 ˜O⎛⎝⎛⎝nnz(A)+⎛⎝dlogn+1√λ∑i∈[n]∥bi∥2⎞⎠⋅s(A)⎞⎠⎞⎠

where we used that has at most rows with high probability in . Since we know

 ∑i∈[n]∥bi∥2≤2∑i∈[n]√k′⋅uilogn⋅∥ai∥2,

all that remains is to bound . However, and therefore

 I⪯λmax(A⊤A)(A⊤A)−1 and ∥ai∥2≤√λmax(A⊤A)⋅σi(A) .

Consequently, Cauchy Schwartz and yields

 1√n∑i∈[n]∥ai∥2≤√∑i∈[n]∥ai∥22≤1√λmin(A⊤A)∑i∈[n]∥ai∥22≤√κ(A⊤A)∑i∈[n]√σi(A)⋅∥ai∥2.

Since this yields

 ∑i∈[n]√ui⋅∥ai∥2 ≤2∑i∈[n]√σi(A)⋅∥ai∥2+1√n⋅κ(A⊤A)∑i∈[n]∥ai∥2≤3∑i∈[n]√σi(A)⋅∥ai∥2

which in turn yields the result as hides factors poly-logarithmic in and . ∎

## 4 Regression Algorithm Without Leverage Score Estimates

In the previous section we showed that we can solve regression in our desired running time provided we have constant factor approximation to leverage scores. Here we show how to apply this procedure repeatedly to estimate leverage scores as well. We do this by first adding a large multiple of the identity to our matrix and then gradually decreasing this multiple while maintaining estimates for leverage scores along the way. This is a technique introduced in [LMP13] and we leverage it tailored to our setting. A key technical ingredient for this algorithm is the following well-known result on the reduction from leverage score computation to regression with little overhead. Formally, Lemma 11 states that you can compute constant multiplicative approximations to all leverage scores of a matrix in nearly linear time plus the time needed to solve regression problems. Algorithm LABEL:alg:computeLS details the procedure for computing leverage scores. algocf[h]

###### Lemma 11 (Computing Leverage Scores).

For let be a -time algorithm for regression on . For , in time where we set with high probability in , the algorithm outputs such that for all .

We defer the proof of Lemma 11 to the Appendix (Section LABEL:sec:computeleveragescores). Combining the algorithm for estimating leverage scores (Algorithm LABEL:alg:computeLS) with our regression algorithm given leverage scores (Theorem 10) yields our solver (Algorithm LABEL:alg:mainalgo). We first provide a technical lemma regarding invariants maintained by the algorithm, Lemma 12, and then we prove that this algorithm has the desired properties to prove Theorem 5. algocf[h]

###### Lemma 12.

In the algorithm (See Algorithm LABEL:alg:mainalgo) the following invariant is satisfied

 σi(Aη)≤ui≤4⋅σi(Aη)+[n⋅κ(A⊤ηAη)]−1. (4.1)
###### Proof.

Note that . Consequently, since initially we have that initially . Consequently, we have that initially and therefore satisfies the invariant (4.1). Now, suppose at the start of the repeat loop, satisfies the invariant (4.1). In this case the the assumptions needed to invoke by Theorem 10 are satisfied. Hence, after the line , by Lemma 11 we have that for all

Now, letting we see that and direct calculation shows that invariant (4.1) is still satisfied after changing to . All the remains is to consider the last step when we set . When this happens . and therefore is close enough to and the invariant (4.1) is satisfied. ∎

Using this we prove that our main algorithm works as desired.

###### Proof of Theorem 5.

Lemma 12 shows that is always a good estimate enough of throughout the algorithm to invoke with Theorem 10. In particular, this holds at the last step when is set to and thus the output of the algorithm is as desired by Theorem 10. During the whole algorithm, is called times. Each time is called, is called many times. All that remains is to bound the running time of . However, for and we have and since we . Furthermore, since we have that the running time follows from the following:

 ∑i∈[n+d]√σi(Aλ)⋅∥ai∥2√λmin(A⊤λAλ)≤∑i∈[n]√σi(A)⋅∥ai∥2√λmin(A⊤A)+∑i∈[d]√λ√λ .

## 5 Previous Best Running Time for Regression

Here we state Theorem 13, a generalization of Theorem 4 that is useful for our analysis. Theorem 13 follows by applying recent results on accelerated coordinate descent [AQRY16, NS17] to the dual of regression through recent results on approximate proximal point / Catalyst [FGKS15a, LMH15].

###### Theorem 13 (Previous Best Regression Running Time).

Let and be matrices with the same number of columns. Suppose that has rows and , then there is a -time regression algorithm with

 T(A)=˜O⎛⎜ ⎜⎝nnz(A)+⎛⎜ ⎜⎝n+∑i∈[n]∥bi∥2√λmin(B⊤B)⎞⎟ ⎟⎠⋅s(B)⎞⎟ ⎟⎠.

### 5.1 Proof of Theorem 13

First we give the theorems encapsulating the results we use and then use them to prove Theorem 4 in the case when . We then prove the case when . Theorem 14 describes the fastest coordinate descent algorithm known by [AQRY16]. Theorem 15 describes the reduction [FGKS15a] to from regression to coordinate decent via proximal point.

###### Theorem 14 (Corollary of Thm 5.1 of [Aqry16]).

Let be a twice differentiable -strongly convex function for . Further suppose that for all and it is the case that for and the partial derivative can be computed in time. Then there exists an algorithm which given any finds a such that

 f(y)−minxf(x)≤ϵ(f(x0)−minxf(x)).

in expected running time .

###### Theorem 15 (Corollary of Thm 4.3 of [FGKS15a]).

Given with rows and . Consider the function where are convex functions. Suppose that for all . Let . Let dual problem . Suppose that for any , any and any , we can compute in expected running time such that

 gs(y)−minygs(y)≤ϵ(gs(y0)−minygs(y)). (5.1)

Then, for any and any we can find such that

 p(x)−minxp(x)≤ϵ(p(x0)−minxp(x))

in time w.h.p. in where and includes logarithmic factors in .

We note that although the guarantees of Thm 5.1 of [AQRY16] and Thm 4.3 of [FGKS15a] are not stated in the form of Theorems 14 and 15. They can be easily converted to the form above by noticing that the expected running time of the procedure in Thm 4.3 of [FGKS15a] using Theorem 14 is which can then be boosted to high probability in using Lemma 16. We now give the proof of Theorem 13.

###### Proof of Theorem 13 when A=B.

Let

 p(x)=n∑i=1ϕi(a⊤ix) where ϕi(x)=12(x−bi)2 .

Then, we have that and hence

 gs(y)=n∑i=1ϕ∗i(yi)+12λ∥A⊤y∥22−s⊤A⊤y=12∥y∥22+b⊤y+12λ∥A⊤y∥22−sTATy.

Note that is 1 strongly convex and

Hence, Theorem 14 finds satisfying (5.1) in time

 O⎛⎝s(A)⋅∑i∈[n]√1+1λ∥ai∥22log(ϵ−1)⎞⎠=O⎛⎝⎛⎝n+1√λ∑i∈[n]∥ai∥2⎞⎠⋅s(A)⋅log(ϵ−1)⎞⎠.

Hence, this shows that the primal can be solved in time

 O⎛⎝⎛⎝n+1√λ∑i∈[n]∥bi∥2⎞⎠⋅s⋅log(n⋅κ)⋅log(κϵ−1)⎞⎠

where we used at the end. ∎

###### Proof of Theorem 13 for the case A≠B.

The proof involves two steps. First, we show that given any point , we can find a new point that is closer to the minimizer. Then, we bound how many steps it takes. To find , we consider the function

 fx0(x)=12∥Bx−Bx0∥22+⟨Ax0−c,Ax−Ax0⟩.

Let be the minimizer of and be the minimizer of . Note that

 z=x0−(B⊤B)−1A⊤η with η=Ax0−b, and x∗=(A⊤A)−1A⊤b.

Hence, we have that

 12∥Az−Ax∗∥22 =12ηA⊤(A⊤A)−1A⊤η−ηA⊤(B⊤B)−1A⊤η+12∥A(B⊤B)−1A⊤η∥22.

Using that , we have

 12∥Az−Ax∗∥22≤410ηA⊤(A⊤A)−1A⊤η=410∥Ax0−Ax∗∥22. (5.2)

However, it is difficult to reduce to the case when to minimize the function due to the extra linear term. To address this issue, we assume by appending an extra identity term. Note that this only adds a small matrix and hence we still have but with a slightly different constant which will not affect the proof for (5.2). Due to the extra identity term, reduces to an expression of the form for some vector and constant . We can now apply Theorem 13 for the case and get an such that

 fx0(x)−fx0(z)≤1200(fx0(x0)−fx0(z)). (5.3)

in time

Note that the extra terms in does not affect the minimum eigenvalue and it increases by atmost . Now, using the formula of , the guarantee (5.3) can be written as

 ∥Bx−Bz∥22≤1200∥Ax0−Ax∗∥22 .

Using that , we have

 ∥Ax−Az∥2≤110∥Ax0−Ax∗∥2 .

Combining this with (5.2), we have that

 ∥Ax−Ax∗∥2≤0.9∥Ax0−Ax∗∥2.

Hence, we get closer to by constant factor. Therefore, to achieve (1.2), we only need to repeat this process times. Hence, the total running time is

## 6 Reduction from High Probability Solvers to Expected Running Times

In this section we provide an auxiliary lemma that reduce the problem of achieving accuracy with high probability to the problem of achieving an accuracy with probability at least for some constants up to logarithmic factor blowups. Note that a naive reduction suffers an additional blowup which we avoid in the following reduction. The reduction although straightforward helps us provide a concise description of the algorithm for the ERM problem in the next section.

###### Lemma 16.

Consider being given a function and define . Let be an algorithm such that given any point the algorithm runs in time and produces a point such that

 F(x′)−F(x∗)≤c(F(x0)−F(x∗))

with probability at least for given universal constants . Further there exists a procedure which given a point can produce an estimate in time such that

 m/r≤F(x)−F(x∗)≤rm

for some given . Then there exists a procedure that given a point outputs a point such that

 F(x′)−F(x∗)≤ϵ(F(x0)−F(x∗))

and the expected running time of the procedure is bounded by

 O((T+T′)log(r)log(ϵ−1))

where hides constant factors in . Moreover for any we have a procedure that produces a point such that

 F(x′)−F(x∗)≤ϵ(F(