Given and , the regression problem
is one of the most fundamental problems in optimization and a prominent tool in machine learning. It is one of the simplest empirical risk minimization (ERM) problems and a prominent proving ground for developing efficient learning algorithms. Regression is long known to be solve-able directly by matrix multiplication intime where [Wil12] is the matrix multiplication constant and recent work has improved the running time to ,111Through this paper, we use to hide factors polylogarithmic in ,, and (c.f. Definition 3 and Definition 6). i.e. linear time plus the time needed to solve a nearly square linear system. [CW13, LMP13, NN13, CLM15, Coh16] However, for sufficiently large even a super-quadratic running time of can be prohibitively expensive. Consequently, over the past decade improving this running time under mild regularity assumptions on has been an incredibly active area of research. In this paper we improve the best known running time for solving regression under standard regularity assumptions. Formally the problem we consider is as follows.
Definition 1 (The Regression Problem).
Given with rows and , we consider the regression problem where
The central problem of this paper is to get faster regression algorithms defined as follows.
Definition 2 (Regression Algorithm).
We call an algorithm a -time regression algorithm if for any , , and w.h.p in in time the algorithm outputs a vector such that
Note that if is a minimizer of then the guarantee (1.1) is equivalent to the following
where for a PSD matrix is defined as . The goal of this paper is to provide regression algorithms with improved running times depending on , , and the following regularity parameters.
(Regularity Parameters) We let and
denote the smallest and largest eigenvalues of. We let denote the condition number of and let denote the total condition number of . We let denote the maximum number of non-zero entries in a row of . Occasionally, we drop the terms in parenthesis when they are clear from context.
1.1 Previous Results
Standard classic iterative methods such as gradient descent and accelerated gradient descent [Nes83] solve the regression problem with running times of and respectively. While these running times are super-linear whenever is super constant there has been a flurry of recent papers showing that using sampling techniques faster running times can be achieved. These often yield nearly linear running times when is sufficiently larger than . [SSZ13, JZ13a, SZ16, All16]. Using recent advances in accelerated coordinate descent [AQRY16, NS17] coupled with proximal point methods [FGKS15a, LMH15] the previous fastest iterative algorithm is as follows:
Theorem 4 (Previous Best Regression Running Times).
Given , there is a -time regression algorithm with
1.2 Our Results
The work in this paper is motivated by the natural question, can this running time of Theorem 4 be further improved? Despite the running time lower bound of shown in [WS16],222Their lower bound involves a function with . However, is more common as we explain. in this paper we give an affirmative answer improving the term in Theorem 4 to . The main result of this paper is the following:
Theorem 5 (Improved Regression Running Time).
Algorithm LABEL:alg:solveUsingLS is a -time
regression algorithm that succeeds with high probability(w.h.p) in
-time regression algorithm that succeeds with high probability(w.h.p) inwhere
and for all .
Up to polylogarithmic factors Theorem 5 is an improvement over Theorem 4 as . This improvement can be substantial as can be as small as , e.g. if is an entry-wise random Gaussian matrix. Compared to Theorem 4 whose second term in running time grows as , our second term is independent of due to the following:
Therefore in Theorem 5 we have . This improvement from to can be significant as (the number of samples) is in some cases orders of magnitude larger than (the number of features). For example, in the LIBSVM dataset333https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, in 87 out of 106 many non-text problems, we have , 50 of them have and in the UCI dataset,444http://archive.ics.uci.edu/ml/datasets.html in 279 out of 301 many non-text problems, we have , 195 out of them have . Furthermore, in Section 7 we show how to extend our results to ERM problems more general then regression. In particular we consider the following ERM problem
Definition 6 (Erm).
Given with rows and functions such that each is twice differentiable and satisfies
we wish to minimize over where
While the form of ERM considered by us is not the most general form considered in literature, we consider this problem, to provide a proof of concept that our techniques for regression are more broadly applicable to the case of ERM and can if fact provide a non-trivial improvement over existing methods. We leave it to future work to study the limitations of our techniques and possible improvements for more general ERM problems. The following is our main theorem regarding the ERM problem.
Given an ERM problem(Definition 6) and an initial point , there exists an algorithm that produces a point such that which succeeds w.h.p in n in total time
where are the leverage scores with respect to .
Note that Theorem 7interpolates our regression results, i.e. it recovers our results for regression in the special case of . To better understand the bound in Theorem 7, note that following the derivation in Equation (1.3) we have that the running time in Theorem 7 is bounded by
In this case Theorem 7 should be seen as implying that under Assumption (1.4) the effective dependence on the number of examples on the running time for ERM can be reduced to at most . Again, we remark that the running time bound of Theorem 7 should be viewed as a proof of concept that our regression machinery can be used to improve the running time of ERM. We leave it as future work to both improve Theorem 7’s dependence on and have it extend to a broader set of problems. For example, we believe the the running time can be immediately improved to
1.3 Our Approach
Our algorithm follows from careful combination and analysis of a recent suite of advances in numerical linear algebra. First, we use the previous fastest regression algorithm, Theorem 4, which is the combination of recent advances in accelerated coordinate descent [AQRY16, NS17] and proximal point methods [FGKS15a, LMH15] (See Section 5
.) Then, we show that if we have estimates of the leverage scores of the rows of, a natural recently popularized measure of importance, [SS08, LMP13, CLM15] we can use concentration results on leverage score sampling and preconditioning to obtain a faster regression algorithm. (See Section 3.) Now, it is a powerful and somewhat well known fact that given an algorithm for regression one can compute leverage score estimates in nearly linear time plus the time needed to solve regression problems [SS08]. Consequently, to achieve the improved running time when we do not have leverage scores we are left with a chicken and egg problem. Fortunately, recent work has shown that such a problem can be solved in a several ways [LMP13, CLM15]. We show that the technique in [LMP13] carefully applied and analyzed can be used to obtain our improved running time for both estimating leverage scores and solving regression problems with little overhead. (See Section 4.) To generalize our results for a broader class of ERM problems we follow a similar procedure. Most parts generalize naturally, however perhaps the most complex ingredient is how to generalize preconditioning to the case when we are sampling non-quadratic functions. For this, we prove concentration results on sampling from ERM inspired from [FGKS15b] to show that it suffices to solve ERM on a sub-sampling of the components that may be of intrinsic interest. (See Section 7) In summary our algorithms are essentially a careful blend of accelerated coordinate descent and concentration results coupled with the iterative procedure in [LMP13] and the Johnson Lindenstrauss machinery of [SS08] to compute leverage scores. Ultimately the algorithms we provide are fairly straightforward, but it provides a substantial running time improvement that we think is of intrinsic interest. We hope this work may serve as a springboard for even faster iterative methods for regression and minimizing finite sums more broadly. Finally, we remark that there is another way to achieve the improvement over . One could simply use subspace embeddings [CW13, NN13, Coh16] and preconditioning to reduce the regression problem to a regression problem on a matrix and then apply Theorem 4 to solve the regression problem. While this works, it has three shortcomings relevant to our approach. First, this procedure would possibly lose sparsity, the matrix may be dense and thus the final running time would have an additional term our method does not. Second, it is unclear if this approach yields our more fine-grained running time dependence on leverage scores that appears in Theorem 5 which we believe to be significant. Thirdly it is unclear how to extend the approach to the ERM setting.
1.4 Paper Organization
After providing requisite notation in Section 2 we prove Theorem 5 in Sections 3 and 4. We first provide the algorithm for regression given leverage score estimates in Section 3 and further provide the algorithm to compute the estimates based in Section 4. Note that the algorithm for computing leverage scores makes use of the algorithm for regression given leverage scores as a sub-routine. In Section 5 we provide the proofs of deferred lemmas and theorems from Sections 3, 4. In Section 7 we provide the proof of Theorem 7. Further we collect some simple re-derivations and reductions from known results required through the paper in the Appendix.
For symmetric matrix and we let . For symmetric matrix we use to denote the condition that for all and we define , , and analogously. We use to denote the number of non-zero entries in and for a vector we let denote the number of nonzero entries in .
3 Regression Algorithm Given Leverage Score Estimates
The regression algorithm we provide in this paper involves two steps. First we find which rows of are important, where importance is measured in terms of leverage score. Second, we use these leverage scores to sample the matrix and solve the regression problem on the sampled matrix using Theorem 13. In this section we introduce leverage scores and provide and analyze the second step of our algorithm.
Definition 8 (Leverage Score).
For with rows we denote the leverage score of row by .
Leverage score have numerous applications and are well studied. It is well known that for all and . The critical fact we used about leverage scores is that sampling rows of according to any overestimate of leverage scores yields a good approximation to after appropriate re-scaling [CLM15, SS08]:
Lemma 9 (Leverage Score Sampling (Lemma 4 of [Clm15]) ).
Let , let , and let be overestimates of the leverage scores of ; i.e. for all . Define for a sufficiently large absolute constant and let be a random diagonal matrix where independently with probability and otherwise. With high probability in , and .
If satisfies for all then is a -time regression algorithm where
Let for . Applying Lemma 9 yields with high probability in that
where and . Note that
Consequently, by Markov’s inequality
with probability at least and the loop in the algorithm terminates with high probability in in iterations. Consequently, the loop takes only -time and since we only sampled many independent copies of , the guarantee (3.1) again holds with high probability in . Using the guarantee (3.1) and Theorem 13 on and , we can produce a we need in time times
where we used that has at most rows with high probability in . Since we know
all that remains is to bound . However, and therefore
Consequently, Cauchy Schwartz and yields
Since this yields
which in turn yields the result as hides factors poly-logarithmic in and . ∎
4 Regression Algorithm Without Leverage Score Estimates
In the previous section we showed that we can solve regression in our desired running time provided we have constant factor approximation to leverage scores. Here we show how to apply this procedure repeatedly to estimate leverage scores as well. We do this by first adding a large multiple of the identity to our matrix and then gradually decreasing this multiple while maintaining estimates for leverage scores along the way. This is a technique introduced in [LMP13] and we leverage it tailored to our setting. A key technical ingredient for this algorithm is the following well-known result on the reduction from leverage score computation to regression with little overhead. Formally, Lemma 11 states that you can compute constant multiplicative approximations to all leverage scores of a matrix in nearly linear time plus the time needed to solve regression problems. Algorithm LABEL:alg:computeLS details the procedure for computing leverage scores. algocf[h]
Lemma 11 (Computing Leverage Scores).
For let be a -time algorithm for regression on . For , in time where we set with high probability in , the algorithm outputs such that for all .
We defer the proof of Lemma 11 to the Appendix (Section LABEL:sec:computeleveragescores). Combining the algorithm for estimating leverage scores (Algorithm LABEL:alg:computeLS) with our regression algorithm given leverage scores (Theorem 10) yields our solver (Algorithm LABEL:alg:mainalgo). We first provide a technical lemma regarding invariants maintained by the algorithm, Lemma 12, and then we prove that this algorithm has the desired properties to prove Theorem 5. algocf[h]
In the algorithm (See Algorithm LABEL:alg:mainalgo) the following invariant is satisfied
Note that . Consequently, since initially we have that initially . Consequently, we have that initially and therefore satisfies the invariant (4.1). Now, suppose at the start of the repeat loop, satisfies the invariant (4.1). In this case the the assumptions needed to invoke by Theorem 10 are satisfied. Hence, after the line , by Lemma 11 we have that for all
Now, letting we see that and direct calculation shows that invariant (4.1) is still satisfied after changing to . All the remains is to consider the last step when we set . When this happens . and therefore is close enough to and the invariant (4.1) is satisfied. ∎
Using this we prove that our main algorithm works as desired.
Proof of Theorem 5.
Lemma 12 shows that is always a good estimate enough of throughout the algorithm to invoke with Theorem 10. In particular, this holds at the last step when is set to and thus the output of the algorithm is as desired by Theorem 10. During the whole algorithm, is called times. Each time is called, is called many times. All that remains is to bound the running time of . However, for and we have and since we . Furthermore, since we have that the running time follows from the following:
5 Previous Best Running Time for Regression
Here we state Theorem 13, a generalization of Theorem 4 that is useful for our analysis. Theorem 13 follows by applying recent results on accelerated coordinate descent [AQRY16, NS17] to the dual of regression through recent results on approximate proximal point / Catalyst [FGKS15a, LMH15].
Theorem 13 (Previous Best Regression Running Time).
Let and be matrices with the same number of columns. Suppose that has rows and , then there is a -time regression algorithm with
5.1 Proof of Theorem 13
First we give the theorems encapsulating the results we use and then use them to prove Theorem 4 in the case when . We then prove the case when . Theorem 14 describes the fastest coordinate descent algorithm known by [AQRY16]. Theorem 15 describes the reduction [FGKS15a] to from regression to coordinate decent via proximal point.
Theorem 14 (Corollary of Thm 5.1 of [Aqry16]).
Let be a twice differentiable -strongly convex function for . Further suppose that for all and it is the case that for and the partial derivative can be computed in time. Then there exists an algorithm which given any finds a such that
in expected running time .
Theorem 15 (Corollary of Thm 4.3 of [FGKS15a]).
Given with rows and . Consider the function where are convex functions. Suppose that for all . Let . Let dual problem . Suppose that for any , any and any , we can compute in expected running time such that
Then, for any and any we can find such that
in time w.h.p. in where and includes logarithmic factors in .
We note that although the guarantees of Thm 5.1 of [AQRY16] and Thm 4.3 of [FGKS15a] are not stated in the form of Theorems 14 and 15. They can be easily converted to the form above by noticing that the expected running time of the procedure in Thm 4.3 of [FGKS15a] using Theorem 14 is which can then be boosted to high probability in using Lemma 16. We now give the proof of Theorem 13.
Proof of Theorem 13 when .
Proof of Theorem 13 for the case .
The proof involves two steps. First, we show that given any point , we can find a new point that is closer to the minimizer. Then, we bound how many steps it takes. To find , we consider the function
Let be the minimizer of and be the minimizer of . Note that
Hence, we have that
Using that , we have
However, it is difficult to reduce to the case when to minimize the function due to the extra linear term. To address this issue, we assume by appending an extra identity term. Note that this only adds a small matrix and hence we still have but with a slightly different constant which will not affect the proof for (5.2). Due to the extra identity term, reduces to an expression of the form for some vector and constant . We can now apply Theorem 13 for the case and get an such that
Note that the extra terms in does not affect the minimum eigenvalue and it increases by atmost . Now, using the formula of , the guarantee (5.3) can be written as
Using that , we have
Combining this with (5.2), we have that
Hence, we get closer to by constant factor. Therefore, to achieve (1.2), we only need to repeat this process times. Hence, the total running time is
6 Reduction from High Probability Solvers to Expected Running Times
In this section we provide an auxiliary lemma that reduce the problem of achieving accuracy with high probability to the problem of achieving an accuracy with probability at least for some constants up to logarithmic factor blowups. Note that a naive reduction suffers an additional blowup which we avoid in the following reduction. The reduction although straightforward helps us provide a concise description of the algorithm for the ERM problem in the next section.
Consider being given a function and define . Let be an algorithm such that given any point the algorithm runs in time and produces a point such that
with probability at least for given universal constants . Further there exists a procedure which given a point can produce an estimate in time such that
for some given . Then there exists a procedure that given a point outputs a point such that
and the expected running time of the procedure is bounded by
where hides constant factors in . Moreover for any we have a procedure that produces a point such that