The gradient complexity of linear regression

11/06/2019, by Mark Braverman et al.

We investigate the computational complexity of several basic linear algebra primitives, including largest eigenvector computation and linear regression, in the computational model that allows access to the data via a matrix-vector product oracle. We show that for polynomial accuracy, Θ(d) calls to the oracle are necessary and sufficient even for a randomized algorithm. Our lower bound is based on a reduction to estimating the least eigenvalue of a random Wishart matrix. This simple distribution enables a concise proof, leveraging a few key properties of the random Wishart ensemble.


1 Introduction

Solving linear systems and computing eigenvectors are fundamental problems in numerical linear algebra, and have widespread applications in numerous scientific, mathematical, and computational fields. Due to their simplicity, parallelizability, and limited computational overhead, first-order methods based on incremental gradient updates have become increasingly popular for solving these problems. Moreover, in many settings, the complexity of these methods is currently well understood: tight upper and lower bounds are known for gradient methods, accelerated gradient methods and related algorithms.

First order methods for regression and eigenvector computation.

As an example, consider the problem of computing the largest eigenvector for a given matrix in . The power method finds an -approximate solution in iterations, each involving a matrix-vector product that can be computed in time proportional to the number of non-zeros in the matrix. A variant of the Lanczos algorithm improves this complexity to (Kuczyński and Woźniakowski, 1992; Musco and Musco, 2015). Alternatively, if the matrix has an inverse-eigengap bounded by , the above running times can be improved to and . In a low accuracy regime, where , these upper bounds attained by Lanczos are known to be information-theoretically tight in the number of matrix-vector products required to compute a solution to the given precision (Simchowitz et al., 2018). The optimal complexities are nearly identical for solving linear systems, except that these do not incur a dependence on the ambient dimension (Simchowitz, 2018). More generally, these upper and lower bounds extend to convex optimization with first order methods more broadly (Nemirovskii et al., 1983).
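To make the matrix-vector product oracle concrete, the following is a minimal sketch (not taken from the paper) of power iteration accessing a symmetric PSD matrix only through such products; each loop iteration consumes exactly one oracle call. Function and variable names are illustrative.

```python
import numpy as np

def power_method_top_eigenvalue(matvec, dim, num_iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD matrix that is
    accessible only through a matrix-vector product oracle `matvec`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):      # one oracle call per iteration
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return v @ matvec(v)            # Rayleigh quotient (one extra oracle call)

# Example usage: a dense test matrix, but only matvec access is ever used.
A = np.random.default_rng(1).standard_normal((50, 50))
A = A @ A.T                         # symmetric PSD instance
print(power_method_top_eigenvalue(lambda x: A @ x, dim=50))
```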

The blessing of data sparsity.

One major advantage of first order methods is that they benefit from sparsity. Each iteration of a first order method computes one or more matrix-vector products, and if the matrix in question has nonzero entries, then these products can be performed in -time. This yields runtimes which scale with the sparsity of the problem instance, rather than the ambient dimension (which can be quadratically worse).
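As an illustration of the nnz-time claim (a sketch, not from the paper), a matrix-vector product with a sparse matrix stored in SciPy's CSR format touches only the stored nonzeros; the random sparse instance below is purely illustrative.

```python
import numpy as np
import scipy.sparse as sp

d = 10_000
A = sp.random(d, d, density=1e-3, format="csr", random_state=0)
A = A + A.T                      # symmetrize; still sparse
x = np.random.default_rng(0).standard_normal(d)

y = A @ x                        # one first-order query: O(nnz(A)) operations
print(A.nnz, y.shape)            # ~2e5 nonzeros vs. d^2 = 1e8 dense entries
```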

Regime Running time Method
Lanczos / CG/AGD
matrix inversion
Table 1: Methods for computing the largest eigenvector of , and solving linear systems with -data points, equivalent to computing for . Here denotes the number of nonzero entries of . For eigenvalue problems, the above runtimes suppress -dependence. Lanczos refers to the block Lanczos methods (Musco and Musco, 2015), CG to the conjugate gradient methods (see, e.g. Trefethen and Bau III (1997)), AGD to accelerated gradient descent.

The high accuracy regime.

What is the computational complexity of obtaining high, inverse polynomial precision using a randomized algorithm, without a bound on the condition number or eigengap of the matrix? This is exactly the question we investigate in this paper.

In this regime, our understanding of even these most basic problems is poor by comparison. The best known algorithms for -approximation in this regime scale as , where is the matrix inversion constant, currently around 2.37. These methods proceed by attempting to invert the matrix in question. Since the inverse of a sparse matrix is itself not necessarily sparse, these methods do not take advantage of data sparsity.

It is thus natural to ask if there is an optimization-based randomized algorithm that can exploit data sparsity, even for the simplest of linear algebra problems sketched above. We note that such faster algorithms would not necessarily require an inverse of the entire matrix, and therefore would not imply faster matrix inversion.

Lower bounds for randomized algorithms.

We focus in this work on lower bounds for randomized algorithms. These are more interesting than lower bounds for deterministic algorithms for several reasons. Of course, the former are stronger and more widely applicable than the latter. More importantly, there are problems for which randomized algorithms can enormously outperform deterministic algorithms; for instance, the only known polynomial time algorithms for volume computation are randomized (Lovász and Vempala, 2006). Finally, the linear algebraic problems we consider are of great use in machine learning, where randomized approaches are frequently employed in order to avoid poor dimensional dependencies.

Our results.

Our main result answers the above question in the negative. In a computation model where each iteration corresponds to one query of a matrix-vector product, we show that matrix-vector product oracle queries are necessary to obtain a polynomial approximation to the largest eigenvector. This is tight, as such queries are sufficient for the Lanczos method to obtain an exact solution (up to machine precision). Similarly, we show a lower bound of queries for solving linear systems, which nearly matches the -query upper bound of the conjugate gradient method. Our results also yield lower bounds in terms of eigengap and condition number parameters which are near-optimal up to logarithmic factors. In contrast to much existing work, our lower bounds are information theoretic, and apply to randomized algorithms, even those that do not satisfy Krylov restrictions. To our knowledge, this is the first work that provides lower bounds which apply to general randomized algorithms and attain optimal dimension-dependence. For a thorough discussion of the prior art, see our discussion of related work below.

Moreover, for instances with nonzero entries, we show a lower bound of queries for high-precision eigenvalue approximation, and for solving linear systems. This suggests an overall computational complexity of for first order methods. This in turn demonstrates that algebraic methods based on matrix inversion asymptotically outperform optimization-based approaches in the regime .

Related Work.

There is an extensive literature on algorithms for linear algebraic tasks such as solving linear systems and computing eigenvalues and eigenvectors, see for example the survey of Sachdeva et al. (2014). In the interest of brevity, we focus on the relevant lower bounds literature.

The seminal work of Nemirovskii et al. (1983) establishes lower bounds which apply only to deterministic algorithms. These first order lower bounds enjoy essentially optimal dependence on all relevant problem parameters, including dimension. However, these constructions are based on a so-called resisting oracle, and therefore do not extend to the randomized algorithms considered in this work.

For randomized algorithms, the lower bounds of Simchowitz et al. (2018) and Simchowitz (2018) yield optimal dependence on the eigengap and condition number parameters. However, these bounds require the dimension to be polynomially large in these parameters, which translates into a suboptimal dimension-dependent lower bound of .

A series of papers due to Woodworth and Srebro (2016, 2017) prove lower bounds for first order convex optimization algorithms which obtain optimal dependence on relevant parameters, but hold only in high dimensions. Furthermore, they are based on intricate, non-quadratic convex objectives which can effectively “hide” information in a way that linear algebraic instances cannot. Thus, they do not apply to the natural linear algebraic constructions that we consider. For high dimensional problems in the low accuracy regime, there are lower bounds even for randomized algorithms that use higher order derivatives, see e.g. (Agarwal and Hazan, 2017).

Finally, in concurrent, related work, Sun et al. (2019) study numerous other linear algebraic primitives in the same matrix-vector product oracle setting. They use a similar approach to proving lower bounds for other problems and randomized algorithms, but do not address the fundamental problems of maximum eigenvalue computation and linear regression as we do.

Proof Techniques.

One of the greatest strengths of our results is the simplicity of their proofs. In general, establishing query lower bounds which apply to randomized algorithms requires great care to control the amount of information accumulated by arbitrary, randomized, adaptive queries. Currently, the two dominant approaches are either (a) to construct complex problem instances that obfuscate information from any sequence of queries made (Woodworth and Srebro, 2016), or (b) to reduce the problem to estimating some hidden component (Simchowitz et al., 2018; Simchowitz, 2018). The constructions for approach (a) are typically quite intricate, require high dimensions, and do not extend to linear algebraic problems. Approach (b) requires sophisticated information theoretic tools to control the rate at which information is accumulated.

In contrast, our work leverages simple properties of a classical random matrix ensemble known as the Wishart distribution (Anderson et al., 2010). In particular, our lower bound for maximum eigenvalue computation is witnessed by a very natural instance in which the entries of are i.i.d. Gaussian. This is plausibly a very benign instance, as it is one of the simplest distributions over symmetric positive definite matrices that one might think of.

The simplicity of the problem instance, together with the existing understanding of the spectrum of Wishart matrices, allows for concise, straightforward proofs.

1.1 Notation

Let , and ; throughout, we shall use for symmetric matrices and for PSD matrices. For , we let , and for , we set . We adopt the conventional notation: suppresses universal constants independent of dimension and problem parameters, suppresses logarithmic factors, and denotes a term which is for a sufficiently small constant . We say a matrix is sparse if its number of nonzero entries is at most .

2 Main Results

We begin the section by stating a lower bound for the problem of eigenvalue estimation and eigenvector approximation via matrix-vector multiply queries. Via a standard reduction, this bound will imply a lower bound for solving linear systems via gradient-queries.

We stress that, unlike prior lower bounds, our bounds for eigenvalue problems (resp. linear systems) both apply to arbitrary, randomized algorithms, and capture the correct dependence on the eigengap (resp. condition number), all the way up to a (resp. ) worst-case lower bound in dimensions. This worst-case lower bound is matched by the Lanczos (Musco and Musco, 2015) and Conjugate Gradient methods (see, e.g. Trefethen and Bau III (1997)), which, assuming infinite precision, efficiently recover the exact optimal solutions in at most queries.

2.1 Eigenvalue Problems

Before introducing our results, we formalize the query model against which our lower bounds hold:

Definition 2.1 (Eigenvalue and Eigenvector Algorithms).

An eigenvalue approximation algorithm, or , is an algorithm which interacts with an unknown matrix via adaptive, randomized queries, , and returns an estimate of . An eigenvector approximation algorithm, or , operates in the same query model, but instead returns an estimate of . We call the query complexity of .
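The query model of Definition 2.1 can be pictured as the following minimal oracle wrapper (a sketch with illustrative names, not the paper's notation): the algorithm never sees the matrix itself, only the results of its adaptively chosen matrix-vector products, and the query complexity is the number of such calls.

```python
import numpy as np

class MatVecOracle:
    """Hides a symmetric matrix M; the algorithm may only request products M @ v."""
    def __init__(self, M):
        self._M = np.asarray(M)   # not visible to the algorithm
        self.num_queries = 0      # query complexity = number of calls to `query`

    def query(self, v):
        self.num_queries += 1
        return self._M @ v
```

An eigenvalue (or eigenvector) approximation algorithm in the sense of Definition 2.1 then interacts only with `query`, choosing each new vector as a possibly randomized function of all previous responses.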

We let denote the probability induced by running when the input is a random instance drawn from a distribution . We now state our main query lower bound for ’s, which we prove in Section 3:

Theorem 2.1 (Lower Bound for Eigenvalue Estimation).

There is a function such that the following holds. For any , ambient dimension , and every sparsity level , there exists a distribution supported on -sparse matrices in such that any with satisfies

Moreover, satisfies , and almost surely. Here, denotes a quantity lower bounded by a function of , but not of or .

In particular, any algorithm requires queries in ambient dimension to estimate up to error with constant probability, and in fact requires queries for a probability of error.

By setting the parameter , Theorem 2.1 implies that queries are necessary for -accuracy. Alternatively, choosing , we obtain a gap-dependent bound requiring queries for accuracy. Both bounds match the sharp lower bounds of Simchowitz et al. (2018) up to logarithmic factors, while also capturing the correct worst-case query complexity for ambient dimension , namely . Moreover, our proof is considerably simpler.

Implications for sparsity.

For , our lower bound says that first order methods require queries to approximate the top eigenvalue of matrices with . Therefore, implementations of first-order methods based on standard matrix-vector multiplies cannot have complexity better than in the worst case. On the other hand, matrix inversion has runtime . Hence, for , we see that matrix inversion outperforms optimization based methods.
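As a rough back-of-the-envelope companion to the comparison above (illustrative only, with the matrix inversion exponent taken as roughly 2.37, and not a precise statement from the paper), one can tabulate the two cost estimates as the sparsity varies:

```python
# Rough cost model: first-order methods cost about nnz * d in total,
# dense matrix inversion about d**omega.
omega = 2.37                                  # assumed matrix inversion exponent
for d in (10**3, 10**4, 10**5):
    for nnz in (10 * d, int(d**1.5), d * d):
        first_order = nnz * d
        inversion = d**omega
        winner = "first-order" if first_order < inversion else "matrix inversion"
        print(f"d={d:>6}  nnz={nnz:>12}  cheaper: {winner}")
```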

Approximating the top eigenvector.

As a corollary, we obtain the analogous lower bound for algorithms approximating the top eigenvector of a symmetric matrix, and, in particular, an query complexity lower bound for -precision approximations:

Corollary 2.2.

In the setting of Theorem 2.1, any with satisfies

Proof.

For ease, set . Let , and let , which can be computed using at most one additional query. Since , and since , we see that only if . Thus, recalling , we have , which is at least by Theorem 2.1.∎
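The one-extra-query step in the proof above amounts to forming a Rayleigh quotient from the returned eigenvector estimate; a minimal sketch (hypothetical names, not the paper's notation):

```python
import numpy as np

def eigenvalue_from_eigenvector(query, v_hat):
    """Given an approximate top eigenvector, spend one extra matrix-vector
    query to form the Rayleigh quotient estimate of the top eigenvalue."""
    v = v_hat / np.linalg.norm(v_hat)
    return v @ query(v)           # the single additional oracle call
```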

2.2 Lower Bounds for Solving Linear Systems

We now present our lower bounds for minimizing quadratic functions. We consider the following query model:

Definition 2.2 (Gradient Query model for Linear System Solvers).

We say that is an if is given initial point and linear term , and it interacts with an unknown symmetric matrix via adaptive, randomized queries, , and returns an estimate of . Again, we call the query complexity of .

Defining the objective function , we see that the query model of Definition 2.2 is equivalent to being given a gradient query at , , and making queries . We shall use to denote the probability induced by running the on the instance . Our lower bound in this model is as follows, stated in terms of the function suboptimality .
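The equivalence noted above is easy to make explicit under the standard quadratic form f(x) = ½ xᵀAx − bᵀx (an assumption about the stripped formula): a gradient query at x returns Ax − b, so a gradient oracle is a matrix-vector product oracle shifted by the fixed vector b. A minimal sketch with illustrative names:

```python
import numpy as np

def gradient_oracle(matvec, b):
    """Gradient oracle for f(x) = 0.5 * x.T @ A @ x - b.T @ x, built from a
    matrix-vector product oracle for A: grad f(x) = A @ x - b."""
    return lambda x: matvec(x) - b
```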

Theorem 2.3 (Lower Bound for Linear System Solvers).

Let be a universal constant. Then for all ambient dimensions , and all , any which satisfies the guarantee

(1)

for all -sparse matrices with , and all , must have query complexity at least

where .

In particular, any algorithm which ensures with probability requires -queries. Moreover, we see that obtaining a function suboptimality of requires queries, matching known upper bounds achieved by the conjugate gradient method up to logarithmic factors (Trefethen and Bau III, 1997), which in turn match information-theoretic lower bounds (Simchowitz, 2018).
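For intuition on the matching upper bound, here is a rough empirical check (not from the paper) that conjugate gradient reaches its default tolerance using a number of matrix-vector products governed by the square root of the condition number rather than by the dimension; the test matrix and the SciPy usage are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

d = 500
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = (Q * np.linspace(1.0, 100.0, d)) @ Q.T     # condition number about 100
b = rng.standard_normal(d)

num_matvecs = 0
def matvec(v):
    global num_matvecs
    num_matvecs += 1
    return A @ v

x, info = cg(LinearOperator((d, d), matvec=matvec, dtype=float), b)
print(info, num_matvecs)   # converges with far fewer than d products here
```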

We prove Theorem 2.3 by leveraging a well-known reduction from eigenvector approximation to minimizing quadratic functions, known as “shift-and-invert” (Saad, 2011; Garber et al., 2016). To state the result, we define a class of matrices to which the reduction applies:

The term corresponds to , whereas measures how close is to . Note that can be suitably rescaled and shifted by a multiple of the identity to ensure , and , and thus more generally corresponds to an approximate foreknowledge of (which is necessary to facilitate the reduction). We further note that the distribution from Theorem 2.1 satisfies, for some functions and ,

(2)

With this definition in hand, we provide a precise guarantee for the reduction, which we prove in Appendix A:

Proposition 2.4 (Eigenvector-to-Linear-System Reduction).

Let for a universal . Fix a , , and suppose that is a which satisfies (1) with for all with . Then, for any , there exists an , , which satisfies

with query complexity at most

where hides multiplicative and additive constants depending on .
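To make the shift-and-invert reduction concrete, the following is a minimal sketch under the assumptions that an upper estimate lam_hat slightly above the top eigenvalue of A is available and that `solve` is some linear-system solver of the kind bounded by (1); the paper's actual reduction and its error analysis are given in Appendix A, so this is only illustrative.

```python
import numpy as np

def shift_and_invert_top_eigenvector(A, lam_hat, solve, num_iters=20, seed=0):
    """Power iteration on (lam_hat * I - A)^{-1}: each iteration calls a
    linear-system solver once, amplifying the top eigenvector of A."""
    d = A.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    B = lam_hat * np.eye(d) - A      # positive definite when lam_hat > lambda_1(A)
    for _ in range(num_iters):
        v = solve(B, v)              # stand-in for an approximate gradient-query solver
        v /= np.linalg.norm(v)
    return v

# e.g. solve = np.linalg.solve as a placeholder for an iterative solver.
```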

We can now prove Theorem 2.3 by combining Proposition 2.4 and Theorem 2.1:

Proof.

We shall prove the theorem in the regime where . The general case is attained by embedding an instance of dimension into dimension , as in the proof of Theorem 2.1, and is deferred to Appendix B.

To begin, let (any constant in suffices); throughout, will be a universal constant, rather than a problem parameter. Next, fix an ambient dimension , where is from Theorem 2.1, and from Proposition 2.4.

Lastly, let and , where and are as in (2). Let . Then, (2) ensures with probability . For the sake of contradiction, suppose that is a which satisfies the guarantee of (1) for all

Then, for the universal constant , there exists an which satisfies, for all ,

whose query complexity is bounded by

where we use , and that depend on universal constants, and not on the choice of dimension . By Theorem 2.1, we must have , whence

3 Proof of Theorem 2.1

The proof of Theorem 2.1 follows by deriving a lower bound for the problem of estimating the least eigenvalue of a matrix drawn from a classical random matrix ensemble known as the (standard) Wishart distribution:

Definition 3.1 (Wishart Distribution).

We write to denote a random matrix with the distribution , where and .
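For intuition, a sampler for the square Wishart ensemble might look as follows (a sketch assuming the common normalization W = X Xᵀ with i.i.d. N(0, 1/d) entries; Definition 3.1 fixes the paper's exact scaling).

```python
import numpy as np

def sample_wishart(d, rng):
    X = rng.standard_normal((d, d)) / np.sqrt(d)   # assumed scaling: X_ij ~ N(0, 1/d)
    return X @ X.T                                  # symmetric PSD, d x d

rng = np.random.default_rng(0)
W = sample_wishart(200, rng)
print(np.linalg.eigvalsh(W)[[0, -1]])               # smallest and largest eigenvalues
```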

We now state our main technical contribution, which lower bounds the number of queries required for estimating the smallest eigenvalue of a matrix :

Theorem 3.1 (Lower Bound for Wishart Eigenvalue Estimation).

There exist a universal constant and function such that the following holds: for all , and all , we have that satisfies

  1. Any algorithm which makes adaptively chosen queries, and returns an estimate of satisfies

  2. There exist constants and such that

Note that, by taking , Theorem 3.1 in fact demonstrates that queries are required for an probability of failure, showing that no nontrivial improvements can be achieved. We now establish Theorem 2.1:

Proof of Theorem 2.1.

Fix , and let denote the function from Theorem 3.1. For , let , and define

By construction, has sparsity . Moreover, on the event of part (b) of Theorem 3.1, we have that the event that , that , and that (3) holds, occurs with probability at least .

Now consider an estimator of . By considering the induced estimator of , part (a) of Theorem 3.1 and a union bound imply that

Hence, letting denote the distribution of conditioned on , any with queries satisfies . ∎

3.1 Proof of Theorem 3.1

We begin the proof of Theorem 3.1 by collecting some useful facts from the literature regarding the asymptotic distribution of Wishart spectra.

Lemma 3.2 (Facts about Wishart Matrices).

Let denote random variables with the (joint) law of , where . The following are true:

  1. converge in distribution to distribution with (Ramírez and Rider (2009, Theorem 1)).

  2. has the density (e.g. Shen (2001, Page 3))

Moreover, for any , (e.g. Anderson et al. (2010, Exercise 2.1.18))
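As a quick numerical companion to Lemma 3.2 (illustrative only, assuming the same square ensemble with N(0, 1/d) entries as above; classical hard-edge results of Edelman type predict that the smallest eigenvalue lives at scale 1/d², so that d²·λ_min has a roughly d-independent distribution), one can check the scaling by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (50, 100, 200, 400):
    vals = []
    for _ in range(200):
        X = rng.standard_normal((d, d)) / np.sqrt(d)   # assumed N(0, 1/d) scaling
        vals.append(d**2 * np.linalg.eigvalsh(X @ X.T)[0])
    print(d, round(float(np.median(vals)), 3))          # medians roughly d-independent
```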

We then convert these asymptotic guarantees into more quantitative ones (proof in Section 3.2):

Corollary 3.3 (Non-Asymptotic Properties).

There exist maps , functions , and a universal constant such that the following holds: for any and ,

(3)

Moreover, for any and , , while .

We now establish that, in an appropriate basis, admits a useful block decomposition:

Lemma 3.4.

Let . Then, for any sequence of queries and responses , there exists a rotation matrix constructed solely as a function of such that the matrix can be written

where, conditioned on the event , follows a distribution.

The above lemma is proven in Section 3.3. The upshot of the lemma is that after queries, there is still a portion of that remains unknown to the query algorithm. We now show that this unknown portion exerts significant influence on the smallest eigenvalue of . Specifically, the following technical lemma implies that :

Lemma 3.5.

For , , and symmetric , let

then .

Proof.

Let such that and . Define . Then

Therefore, . ∎

With all the above ingredients in place, we are now ready to complete the proof of Theorem 3.1:

Proof of Theorem 3.1.

Let , and let , let encode the query-response information, and let denote the matrix from Lemma 3.4. Finally, define the error probability

We can now lower bound the probability of error by lower bounding the probability that the algorithm gives an estimate above a threshold , while the corner matrix has smallest eigenvalue below . We can then decouple the probability of these events using the independence of conditioned on the queries:

where uses Lemma 3.5, and use the fact that has a Wishart distribution conditioned on , and thus is independent of . On the other hand, we have

so that . This implies that . Performing some algebra, and setting , ,

Finally, since ,

Let , where is the function from Corollary 3.3. We then see that, for all for which , we have , and thus Corollary 3.3 yields the existence of a constant for which , and . Hence, setting , we conclude

3.2 Proof of Corollary 3.3

For the first point, fix a . Then by Lemma 3.2, the limiting normalized distributions of the eigenvalues satisfy for appropriate constants . By convergence in distribution, and the fact that corresponds to the event that lie in a closed set,