The gradient complexity of linear regression

We investigate the computational complexity of several basic linear algebra primitives, including largest eigenvector computation and linear regression, in the computational model that allows access to the data via a matrix-vector product oracle. We show that for polynomial accuracy, Θ(d) calls to the oracle are necessary and sufficient even for a randomized algorithm. Our lower bound is based on a reduction to estimating the least eigenvalue of a random Wishart matrix. This simple distribution enables a concise proof, leveraging a few key properties of the random Wishart ensemble.


1 Introduction

Solving linear systems and computing eigenvectors are fundamental problems in numerical linear algebra, and have widespread applications in numerous scientific, mathematical, and computational fields. Due to their simplicity, parallelizability, and limited computational overhead, first-order methods based on incremental gradient updates have become increasingly popular for solving these problems. Moreover, in many settings, the complexity of these methods is currently well understood: tight upper and lower bounds are known for gradient methods, accelerated gradient methods, and related algorithms.

First order methods for regression and eigenvector computation.

As an example, consider the problem of computing the largest eigenvector of a given symmetric matrix in ℝ^{d×d}. The power method finds an ε-approximate solution in O(ε⁻¹ log(d/ε)) iterations, each involving a matrix-vector product that can be computed in time proportional to the number of non-zeros in the matrix. A variant of the Lanczos algorithm improves this complexity to O(ε^{−1/2} log(d/ε)) (Kuczyński and Woźniakowski, 1992; Musco and Musco, 2015). Alternatively, if the matrix has an inverse-eigengap bounded by gap⁻¹, the above running times can be improved to O(gap⁻¹ log(d/ε)) and O(gap^{−1/2} log(d/ε)), respectively. In a low accuracy regime, where ε is not too small relative to the problem parameters, these upper bounds attained by Lanczos are known to be information-theoretically tight in the number of matrix-vector products required to compute a solution to the given precision (Simchowitz et al., 2018). The optimal complexities are nearly identical for solving linear systems, except that these do not incur a dependence on the ambient dimension (Simchowitz, 2018). More generally, these upper and lower bounds extend to convex optimization with first order methods more broadly (Nemirovskii et al., 1983).
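As a concrete illustration of the oracle model, the following sketch implements the power method using nothing but matrix-vector products. The helper names and the toy diagonal matrix are illustrative, not part of the formal development:

```python
import numpy as np

def power_method(matvec, d, iters, rng):
    """Estimate the top eigenvalue using only matrix-vector products.

    `matvec` is the sole means of access to the matrix, mirroring the
    matrix-vector product oracle model; each loop iteration (plus the
    final Rayleigh quotient) costs one oracle call.
    """
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)              # one oracle call
        v = w / np.linalg.norm(w)
    return v @ matvec(v)           # Rayleigh quotient estimate of lambda_1

rng = np.random.default_rng(0)
A = np.diag([1.0, 0.9, 0.5, 0.1])  # top eigenvalue 1.0, eigengap 0.1
lam = power_method(lambda v: A @ v, 4, 200, rng)
```

With the eigengap above, the iterates converge geometrically at rate 0.9 per step, so 200 iterations give an estimate of λ₁ = 1.0 to high precision.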

The blessing of data sparsity.

One major advantage of first order methods is that they benefit from sparsity. Each iteration of a first order method computes a constant number of matrix-vector multiplies, and if the matrix in question has nnz nonzero entries, then these multiplications can be performed in O(nnz) time. This yields runtimes which scale with the sparsity of the problem instance, rather than the ambient dimension (which can be quadratically worse).
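A minimal sketch of why sparsity helps: a matrix stored in compressed sparse row (CSR) form supports matrix-vector products in time linear in the number of nonzeros. The encoding and the example matrix below are illustrative:

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """Sparse matrix-vector product in O(nnz) time.

    (data, indices, indptr) is the standard CSR encoding: row i's
    nonzero values are data[indptr[i]:indptr[i+1]], located in columns
    indices[indptr[i]:indptr[i+1]].
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# The 3x3 matrix [[2,0,0],[0,0,3],[0,4,0]] has nnz = 3.
data = np.array([2.0, 3.0, 4.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 1, 2, 3])
y = csr_matvec(data, indices, indptr, np.array([1.0, 1.0, 1.0]))
```

The inner loop body executes exactly nnz times in total, independent of the ambient dimension.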

The high accuracy regime.

What is the computational complexity of obtaining high, inverse polynomial precision using a randomized algorithm, without a bound on the condition number or eigengap of the matrix? This is exactly the question we investigate in this paper.

In this regime, our understanding of even these most basic problems is poor by comparison. The best known algorithms for inverse-polynomial approximation in this regime scale as O(d^ω), where ω is the matrix multiplication exponent, currently known to be below 2.38. These methods proceed by attempting to invert the matrix in question. Since the inverse of a sparse matrix is itself not necessarily sparse, these methods do not take advantage of data sparsity.

It is thus natural to ask if there is an optimization-based randomized algorithm that can exploit data sparsity, even for the simplest of linear algebra problems sketched above. We note that such faster algorithms would not necessarily require an inverse of the entire matrix, and therefore would not imply faster matrix inversion.

Lower bounds for randomized algorithms

We focus in this work on lower bounds for randomized algorithms. These are more interesting than lower bounds for deterministic algorithms for several reasons. Of course, the former are stronger and more widely applicable than the latter. More importantly, there are problems for which randomized algorithms can outperform deterministic algorithms enormously; for instance, the only known polynomial time algorithms for volume computation are randomized (Lovász and Vempala, 2006). Finally, the linear algebraic problems we consider are of great use in machine learning, where randomized approaches are frequently employed in order to avoid poor dimensional dependencies.

Our results.

Our main result answers the above question in the negative. In a computational model where each iteration corresponds to one query of a matrix-vector product, we show that Ω(d) matrix-vector product oracle queries are necessary to obtain a polynomially accurate approximation to the largest eigenvector. This is tight, as d such queries are sufficient for the Lanczos method to obtain an exact solution (up to machine precision). Similarly, we show a lower bound of Ω(d/log d) queries for solving linear systems, which nearly matches the d-query upper bound of the conjugate gradient method. Our results also yield lower bounds in terms of eigengap and condition number parameters which are near-optimal up to logarithmic factors. In contrast to much existing work, our lower bounds are information theoretic, and apply to randomized algorithms, even those that do not satisfy Krylov restrictions. To our knowledge, this is the first work that provides lower bounds which apply to general randomized algorithms and attain optimal dimension-dependence. For a thorough discussion of the prior art, see our discussion of related work below.

Moreover, for instances with nnz nonzero entries, we show that Ω(√nnz) queries are necessary for high-precision eigenvalue approximation, and similarly for solving linear systems. Since each matrix-vector product costs O(nnz) time, this suggests an overall computational complexity of Ω(nnz^{3/2}) for first order methods. This in turn demonstrates that algebraic methods based on matrix inversion asymptotically outperform optimization-based approaches once nnz is sufficiently large relative to the dimension.

Related Work.

There is an extensive literature on algorithms for linear algebraic tasks such as solving linear systems and computing eigenvalues and eigenvectors, see for example the survey of Sachdeva et al. (2014). In the interest of brevity, we focus on the relevant lower bounds literature.

The seminal work of Nemirovskii et al. (1983) establishes lower bounds which apply only to deterministic algorithms. These first order lower bounds enjoy essentially optimal dependence on all relevant problem parameters, including dimension. However, these constructions are based on a so-called resisting oracle, and therefore do not extend to the randomized algorithms considered in this work.

For randomized algorithms, the lower bounds of Simchowitz et al. (2018) and Simchowitz (2018) yield optimal dependence on the eigengap and condition number parameters. However, these bounds require the dimension to be polynomially large in these parameters, which translates into a suboptimal dimension-dependent lower bound.

A series of papers due to Woodworth and Srebro (2016, 2017) prove lower bounds for first order convex optimization algorithms which obtain optimal dependence on relevant parameters, but hold only in high dimensions. Furthermore, they are based on intricate, non-quadratic convex objectives which can effectively “hide” information in a way that linear algebraic instances cannot. Thus, they do not apply to the natural linear algebraic constructions that we consider. For high dimensional problems in the low accuracy regime, there are lower bounds even for randomized algorithms that use higher order derivatives, see e.g. (Agarwal and Hazan, 2017).

Finally, in concurrent, related work, Sun et al. (2019) study numerous other linear algebraic primitives in the same matrix-vector product oracle setting. They use a similar approach to proving lower bounds for other problems and randomized algorithms, but do not address the fundamental problems of maximum eigenvalue computation and linear regression as we do.

Proof Techniques.

One of the greatest strengths of our results is the simplicity of their proofs. In general, establishing query lower bounds which apply to randomized algorithms requires great care to control the amount of information accumulated by arbitrary, randomized, adaptive queries. Currently, the two dominant approaches are either (a) to construct complex problem instances that obfuscate information from any sequence of queries made (Woodworth and Srebro, 2016), or (b) to reduce the problem to estimation of some hidden component (Simchowitz et al., 2018; Simchowitz, 2018). The constructions for approach (a) are typically quite intricate, require high dimensions, and do not extend to linear algebraic problems. Approach (b) requires sophisticated information theoretic tools to control the rate at which information is accumulated.

In contrast, our work leverages simple properties of a classic random matrix ensemble known as the Wishart distribution (Anderson et al., 2010). In particular, our lower bound for maximum eigenvalue computation is witnessed by a very natural instance built from a matrix with i.i.d. Gaussian entries. This is plausibly a very benign instance, as it is one of the simplest distributions over symmetric positive definite matrices that one might think of.

The simplicity of the problem instance, together with the existing understanding of the distribution of the spectrum of Wishart matrices, allows for concise, straightforward proofs.

1.1 Notation

Throughout, we shall use S^d for symmetric d×d matrices and S^d_{++} for positive definite matrices. For M ∈ S^d, we let λ1(M) ≥ ⋯ ≥ λd(M) denote its eigenvalues, and set λmin(M) := λd(M). We adopt the conventional notation O(·) and Ω(·) for suppressing universal constants independent of dimension and problem parameters, let Õ(·) suppress logarithmic factors, and let Ω(√β) denote a quantity bounded below by a positive function of β alone. We say a matrix is s-sparse if its number of nonzero entries is at most s².

2 Main Results

We begin the section by stating a lower bound for the problem of eigenvalue estimation and eigenvector approximation via matrix-vector multiply queries. Via a standard reduction, this bound will imply a lower bound for solving linear systems via gradient-queries.

We stress that, unlike prior lower bounds, our bounds for eigenvalue problems (resp. linear systems) both apply to arbitrary, randomized algorithms, and capture the correct dependence on the eigengap (resp. condition number), all the way up to an Ω(d) (resp. Ω(d/log d)) worst-case lower bound in d dimensions. This worst-case lower bound is matched by the Lanczos (Musco and Musco, 2015) and conjugate gradient methods (see, e.g., Trefethen and Bau III (1997)), which, assuming infinite precision, efficiently recover the exact optimal solutions in at most d queries.

2.1 Eigenvalue Problems

Before introducing our results, we formalize the query model against which our lower bounds hold:

Definition 2.1 (Eigenvalue and Eigenvector Algorithms).

An eigenvalue approximation algorithm is an algorithm Alg which interacts with an unknown matrix M via adaptive, randomized queries v(1), …, v(T), receiving the response M v(i) to each query v(i), and returns an estimate λ̂ of λ1(M). An eigenvector approximation algorithm operates in the same query model, but instead returns an estimate v̂ of the top eigenvector of M. We call T the query complexity of Alg.

We let P_{M∼D,Alg} denote the probability law induced by running Alg when the input M is a random instance drawn from a distribution D. We now state our main query lower bound for eigenvalue approximation algorithms, which we prove in Section 3:
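The query model above can be made concrete with a small wrapper that hides the matrix behind a matvec oracle and counts the interaction length; the class name and the two-query toy algorithm below are illustrative only:

```python
import numpy as np

class MatvecOracle:
    """Wraps a hidden matrix M; a caller may only request M @ v.

    `queries` records the query complexity T of the interaction.
    """
    def __init__(self, M):
        self._M = M
        self.queries = 0

    def __call__(self, v):
        self.queries += 1
        return self._M @ v

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # lambda_1 = 3
oracle = MatvecOracle(M)

# A toy eigenvalue approximation algorithm: two adaptive queries,
# then a Rayleigh-quotient estimate of lambda_1.
v = oracle(np.array([1.0, 0.0]))
v = v / np.linalg.norm(v)
lam_hat = v @ oracle(v)
```

Here the interaction used exactly two oracle calls, and the (crude) estimate is v⊤Mv = 2.8, below the true λ₁ = 3 as the Rayleigh quotient must be.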

Theorem 2.1 (Lower Bound for Eigenvalue Estimation).

There is a function d0 : (0, 1) → ℕ such that the following holds. For any β ∈ (0, 1), ambient dimension d, and every sparsity level s with d0(β) ≤ s ≤ d, there exists a distribution D supported on s-sparse symmetric matrices in ℝ^{d×d} such that any Alg with Query(Alg) ≤ (1 − β)s satisfies

 P_{M∼D,Alg}[ |λ̂ − λ1(M)| ≥ 1/(20s²) ] ≥ Ω(√β)

Moreover, M drawn from D satisfies M ⪰ 0 and λ1(M) ≤ 1 almost surely. Here, Ω(√β) denotes a quantity lower bounded by a function of β, but not of s or d.

In particular, any algorithm requires Ω(d) queries in ambient dimension d to estimate λ1(M) up to error 1/(20d²) with constant probability, and in fact requires (1 − o(1))·d queries for an o(1) probability of error.

By setting the parameter s = Θ(ε^{−1/2}), Theorem 2.1 implies that Ω(ε^{−1/2}) queries are necessary for ε-accuracy. Alternatively, choosing s = Θ(gap^{−1/2}), we obtain a gap-dependent bound requiring Ω(gap^{−1/2}) queries. Both bounds match the sharp lower bounds of Simchowitz et al. (2018) up to logarithmic factors, while also capturing the correct worst-case query complexity in ambient dimension d, namely Θ(d). Moreover, our proof is considerably simpler.

Implications for sparsity.

For s = d, our lower bound says that first order methods require Ω(√nnz) queries to approximate the top eigenvalue of matrices with nnz nonzero entries. Therefore, implementations of first-order methods based on standard matrix-vector multiplies cannot have complexity better than Ω(nnz^{3/2}) in the worst case. On the other hand, matrix inversion has runtime O(d^ω). Hence, for nnz sufficiently large relative to the dimension, we see that matrix inversion outperforms optimization-based methods.

Approximating the top eigenvector.

As a corollary, we obtain an analogous lower bound for algorithms approximating the top eigenvector of a symmetric matrix, and, in particular, an Ω(d) query complexity lower bound for inverse-polynomial precision approximations:

Corollary 2.2.

In the setting of Theorem 2.1, any eigenvector approximation algorithm Alg with Query(Alg) ≤ (1 − β)s − 1 satisfies

 P_{M∼D,Alg}[ v̂⊤Mv̂ ≤ λ1(M)(1 − 1/(20s²)) ] ≥ Ω(√β)
Proof.

For ease, set ε := 1/(20s²). Let v̂ denote the output of the algorithm, and let λ̂ := v̂⊤Mv̂, which can be computed using at most one additional query. Since λ̂ ≤ λ1(M), and since λ1(M) ≤ 1 for M drawn from D, we see that |λ̂ − λ1(M)| < ε only if v̂⊤Mv̂ > λ1(M)(1 − ε). Thus, P[v̂⊤Mv̂ ≤ λ1(M)(1 − ε)] ≥ P[|λ̂ − λ1(M)| ≥ ε], which is at least Ω(√β) by Theorem 2.1.∎

2.2 Lower Bounds for Solving Linear Systems

We now present our lower bounds for minimizing quadratic functions. We consider the following query model:

Definition 2.2 (Gradient Query model for Linear System Solvers).

We say that Alg is a linear system solver if Alg is given an initial point x0 and linear term b, interacts with an unknown symmetric matrix A via adaptive, randomized queries v(1), …, v(T), receiving the response A v(i) to each query, and returns an estimate x̂ of A⁻¹b. Again, we call T the query complexity of Alg.

Defining the objective function f_{A,b}(x) := ½ x⊤Ax − b⊤x, we see that the query model of Definition 2.2 is equivalent to being given a gradient query at x0, namely ∇f_{A,b}(x0) = Ax0 − b, and making subsequent gradient queries at the points v(1), …, v(T). We shall use P_{Alg,x0,b,A} to denote the probability law induced by running Alg on the instance (x0, b, A). Our lower bound in this model is as follows, stated in terms of the error ∥x̂ − A⁻¹b∥²_A, which is proportional to the function suboptimality of x̂.
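A quick numerical sanity check of this equivalence, using finite differences to confirm that the gradient of f_{A,b} at x is exactly Ax − b (the helper names are illustrative):

```python
import numpy as np

def grad_query(A, b, x):
    """Gradient of f(x) = 0.5 * x^T A x - b^T x for symmetric A."""
    return A @ x - b

rng = np.random.default_rng(1)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric
b = np.array([1.0, -1.0])
x = rng.standard_normal(2)

# Central finite differences of f, which should reproduce A @ x - b
# exactly (up to rounding) since f is quadratic.
f = lambda x: 0.5 * x @ A @ x - b @ x
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(2)])
```

In particular, a gradient query reveals exactly one matrix-vector product (plus the known shift b), so the two oracle models are interchangeable.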

Theorem 2.3 (Lower Bound for Linear System Solvers).

Let c1 be a universal constant. Then for all ambient dimensions d and sparsity levels s ≤ d, any linear system solver Alg which satisfies the guarantee

 P_{Alg,x0,b,A}[ ∥x̂ − A⁻¹b∥²_A ≤ λ1(A)∥x0∥²/(e·s²) ] ≥ 1 − 1/e (1)

for all s-sparse matrices A with cond(A) = O(s²), and all (x0, b), must have query complexity at least

 Query(Alg) ≥ Ω( s ⋅ (log 2 + log s)⁻¹ ),

where .

In particular, any algorithm which ensures the guarantee (1) with constant probability requires Ω̃(s) queries. Moreover, we see that obtaining a function suboptimality of ε requires Ω̃(ε^{−1/2}) queries, matching known upper bounds achieved by the conjugate gradient method up to logarithmic factors (Trefethen and Bau III, 1997), which in turn match information-theoretic lower bounds (Simchowitz, 2018).
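For intuition on the matching upper bound, the following is a standard conjugate gradient sketch: it makes one matrix-vector product per iteration and, in exact arithmetic, recovers the exact solution of a d-dimensional system within d iterations. The toy instance is illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, x0, iters):
    """Plain CG on f(x) = 0.5 x^T A x - b^T x; one A-multiply per iteration."""
    x = x0.copy()
    r = b - A @ x                    # residual = -gradient
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-12:   # converged early
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])     # positive definite, d = 3
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, np.zeros(3), 3)   # d iterations suffice
```

After d = 3 iterations the residual is at the level of machine precision, illustrating the d-query upper bound that our lower bound nearly matches.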

We prove Theorem 2.3 by leveraging a well-known reduction from eigenvector approximation to minimizing quadratic functions, known as “shift-and-invert” (Saad, 2011; Garber et al., 2016). To state the result, we define a class of matrices to which the reduction applies:

 M_d(gap, α) := { M ∈ S^d_{++} : gap(M) ≥ gap, |λ1(M) − 1| ≤ α·gap, λ1(M) ∈ [1/2, 2] },

The term gap corresponds to the eigengap of M, whereas α measures how close λ1(M) is to 1. Note that any M can be suitably rescaled and shifted by a multiple of the identity to ensure λ1(M) ∈ [1/2, 2] and |λ1(M) − 1| ≤ α·gap, and thus α more generally corresponds to approximate foreknowledge of λ1(M) (which is necessary to facilitate the reduction). We further note that the distribution D from Theorem 2.1 satisfies, for some functions c_gap(β) and c_eig(β),

 P_{M∼D(s,d,β)}[ M ∈ M_d( c_gap(β)/s², c_eig(β)/c_gap(β) ) ] = 1 (2)

With this definition in hand, we provide a precise guarantee for the reduction, which we prove in Appendix A:

Proposition 2.4 (Eigenvector-to-Linear-System Reduction).

Let c denote a universal constant. Fix gap ∈ (0, 1) and α > 0, and suppose that Alg is a linear system solver which satisfies (1) for all A with cond(A) ≤ 1 + (α + 1)/gap. Then, for any δ ∈ (0, 1), there exists an eigenvector approximation algorithm Alg_eig which satisfies

 P_{Alg_eig,M}[ v̂⊤Mv̂ ≥ (1 − c·gap)·λ1(M) ] ≥ 1 − δ,  ∀ M ∈ M_d(gap, α)

with query complexity at most

 Query(Alg_eig) ≤ Query(Alg) ⋅ O_α( log(1/δ) ⋅ (log 2 + log d)/min{c_gap, 1} ),

where O_α(·) hides multiplicative and additive constants depending on α.

We can now prove Theorem 2.3 by combining Proposition 2.4 and Theorem 2.1:

Proof.

We shall prove the theorem in the regime where the sparsity is proportional to the dimension. The general case is attained by embedding a lower-dimensional instance into dimension d, as in the proof of Theorem 2.1, and is deferred to Appendix B.

To begin, fix a constant β (any constant in (0, 1) suffices); throughout, c will denote a universal constant, rather than a problem parameter. Next, fix an ambient dimension d large enough to satisfy the requirements of Theorem 2.1 and Proposition 2.4.

Lastly, let gap := c_gap(β)/d² and α := c_eig(β)/c_gap(β), where c_gap and c_eig are as in (2), and let D := D(d, d, β). Then, (2) ensures that M ∈ M_d(gap, α) with probability 1. For the sake of contradiction, suppose that Alg is a linear system solver which satisfies the guarantee of (1) for all

 A : cond(A) ≤ 1 + (α + 1)/gap = O(d²).

Then, for the universal constant c from Proposition 2.4, there exists an Alg_eig which satisfies, for all M ∈ M_d(gap, α),

 P_{Alg_eig,M}[ v̂⊤Mv̂ ≥ (1 − 1/(20d²))·λ1(M) ] = P_{Alg_eig,M}[ v̂⊤Mv̂ ≥ (1 − c·gap)·λ1(M) ] ≥ 1 − Ω(√β),

whose query complexity is bounded by

 Query(Alg) ⋅ O_α( log(1/(Ω(√β) ∧ e)) ⋅ (log 2 + log d)/(1 ∧ c_gap) ) = Query(Alg) ⋅ O(log 2 + log d),

where we use that α, β, and c_gap depend only on universal constants, and not on the choice of dimension d. By Theorem 2.1, we must have Query(Alg_eig) ≥ Ω(d), whence

 Query(Alg) ≥ c1·gap^{−1/2} ⋅ Ω((log 2 + log d)⁻¹) = Ω( d·(log 2 + log d)⁻¹ ).

3 Proof of Theorem 2.1

The proof of Theorem 2.1 follows by deriving a lower bound for the problem of estimating the least eigenvalue of a classical random matrix ensemble known as the (standard) Wishart distribution:

Definition 3.1 (Wishart Distribution).

We write W ∼ Wishart(d) to denote a random matrix with the distribution W = (1/d)·XX⊤, where X ∈ ℝ^{d×d} and the entries X_{ij} are i.i.d. N(0, 1).
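The hard-edge scaling underlying our construction can be observed empirically: under this normalization, the least eigenvalue of a Wishart matrix concentrates at scale d⁻², so d²·λmin(W) remains of constant order as d grows. A small simulation (the dimension, trial count, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_wishart(d):
    # Normalization assumed from Definition 3.1: W = X X^T / d,
    # with X a d x d matrix of i.i.d. standard Gaussians.
    X = rng.standard_normal((d, d))
    return X @ X.T / d

d = 60
lam_mins = [np.linalg.eigvalsh(sample_wishart(d))[0] for _ in range(20)]
scaled = d**2 * np.mean(lam_mins)   # hard-edge scaling: O(1) in d
```

Repeating this for several values of d shows `scaled` hovering at a d-independent constant, consistent with the d⁻² normalization used throughout Section 3.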

We now state our main technical contribution, which lower bounds the number of queries required to estimate the smallest eigenvalue of a matrix W ∼ Wishart(d):

Theorem 3.1 (Lower Bound for Wishart Eigenvalue Estimation).

There exists a universal constant c_wish > 0 and a function d0(·) such that the following holds: for all β ∈ (0, 1) and all d ≥ d0(β), W ∼ Wishart(d) satisfies:

1. Any algorithm which makes at most (1 − β)d adaptively chosen queries, and returns an estimate λ̂min of λmin(W), satisfies

 P_{W,Alg}[ |λ̂min − λmin(W)| ≥ 1/(4d²) ] ≥ c_wish·√β.
2. There exist constants C1(β) and C2(β) such that

 P_W[ {λd(W) ≤ C1(β)·d⁻²} ∩ {λd−1(W) − λd(W) ≥ C2(β)·d⁻²} ∩ {∥W∥ < 5} ] ≥ 1 − (c_wish·√β)/2.

Note that, by taking β = o(1), Theorem 3.1 in fact demonstrates that (1 − o(1))·d queries are required to achieve an o(1) probability of failure, showing that no nontrivial improvements can be achieved. We now establish Theorem 2.1:

Proof of Theorem 2.1.

Fix β and a sparsity level s ≥ d0(β), where d0 denotes the function from Theorem 3.1. Let W ∼ Wishart(s), and define

 M = [ I_{s×s} − (1/5)W   0 ; 0   0 ] ∈ ℝ^{d×d}.

By construction, M has at most s² nonzero entries. Moreover, by part 2 of Theorem 3.1 (applied with dimension s), the event E that (1) λmax(M) = 1 − λmin(W)/5, (2) M ⪰ 0, and (3) λmax(M) is separated from the remaining eigenvalues of M, occurs with probability at least 1 − (c_wish·√β)/2.

Now consider an estimator λ̂ of λmax(M). By considering the induced estimator λ̂min := 5(1 − λ̂) of λmin(W), part 1 of Theorem 3.1 and a union bound imply that

 P_{M,Alg}[ |λ̂ − λmax(M)| ≥ 1/(20s²) ∣ E ] = P_{W,Alg}[ |λ̂min − λmin(W)| ≥ 1/(4s²) ∣ E ] ≥ P_{W,Alg}[ {|λ̂min − λmin(W)| ≥ 1/(4s²)} ∩ E ] ≥ c_wish·√β − (1/2)·c_wish·√β = Ω(√β).

Hence, letting D denote the distribution of M conditioned on E, any Alg with Query(Alg) ≤ (1 − β)s satisfies the conclusion of Theorem 2.1.∎

3.1 Proof of Theorem 3.1

We begin the proof of Theorem 3.1 by collecting some useful facts from the literature regarding the asymptotic distribution of Wishart spectra.

Lemma 3.2 (Facts about Wishart Matrices).

Let (z1(d), z2(d)) denote random variables with the (joint) law of (d²·λd(W), d²·λd−1(W)), where W ∼ Wishart(d). The following are true:

1. (z1(d), z2(d)) converge in distribution to a limiting pair (z1, z2) with z1 < z2 almost surely (Ramírez and Rider (2009, Theorem 1)).

2. The limit z1 has an explicit density which is strictly positive on its support (e.g. Shen (2001, page 3)). Moreover, for any t ≥ 0, the operator norm ∥W∥op satisfies an exponential tail bound (e.g. Anderson et al. (2010, Exercise 2.1.18)).

We then convert these asymptotic guarantees into more quantitative ones (proof in Section 3.2):

Corollary 3.3 (Non-Asymptotic Properties).

There exist functions C1(·) and C2(·) and a universal constant such that the following holds: for any δ ∈ (0, 1) and all d sufficiently large as a function of δ,

 P_{W∼Wishart(d)}[ {z2(d) − z1(d) ≥ C2(δ)} ∩ {z1(d) ≤ C1(δ)} ∩ {∥W∥op ≤ 5} ] ≥ 1 − δ (3)

Moreover, for any and , , while .

We now show that, in an appropriate basis, W admits a useful block decomposition:

Lemma 3.4.

Let W ∼ Wishart(d). Then, for any sequence of queries and responses (v(1), Wv(1)), …, (v(T), Wv(T)), there exists a rotation matrix V, constructed solely as a function of these queries and responses, such that the matrix VWV⊤ can be written

 VWV⊤ = [ Y1Y1⊤   Y1Y2⊤ ; Y2Y1⊤   Y2Y2⊤ + W̃ ]

where, conditioned on the queries and responses, W̃ has a (suitably rescaled) Wishart distribution of dimension d − T.

The above lemma is proven in Section 3.3. The upshot of the lemma is that after T queries, there is still a portion W̃ of W that remains unknown to the query algorithm. We now show that this unknown portion exerts significant influence on the smallest eigenvalue of W. Specifically, the following technical lemma implies that λmin(VWV⊤) ≤ λmin(W̃):

Lemma 3.5.

For matrices A and B of compatible dimensions, and symmetric positive semidefinite W, let

 M = [ AA⊤   AB⊤ ; BA⊤   BB⊤ + W ]

then λmin(M) ≤ λmin(W).

Proof.

Let v be a unit vector such that Wv = λmin(W)·v, and let u be chosen so that A⊤u = −B⊤v (such a u exists whenever B⊤v lies in the row space of A, as it does in our application). Define z := (u, v). Then

 z⊤Mz = ∥A⊤u + B⊤v∥² + v⊤Wv = v⊤Wv = λmin(W).

Since ∥z∥² ≥ ∥v∥² = 1 and λmin(W) ≥ 0, we conclude λmin(M) ≤ z⊤Mz/∥z∥² ≤ λmin(W). ∎
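Lemma 3.5 is easy to check numerically: with A chosen so that A⊤ has full row rank (so that A⊤u = −B⊤v is solvable) and W positive semidefinite, the block matrix M never has a larger least eigenvalue than W. The dimensions and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# A is 6x4, so A^T : R^6 -> R^4 is surjective almost surely,
# guaranteeing a solution u to A^T u = -B^T v.
A = rng.standard_normal((6, 4))
B = rng.standard_normal((3, 4))
G = rng.standard_normal((3, 3))
W = G @ G.T                         # symmetric positive semidefinite

M = np.block([[A @ A.T, A @ B.T],
              [B @ A.T, B @ B.T + W]])

lam_min_M = np.linalg.eigvalsh(M)[0]   # least eigenvalue of M
lam_min_W = np.linalg.eigvalsh(W)[0]   # least eigenvalue of W
```

In the proof's application, W plays the role of the hidden block W̃, so the least eigenvalue of the full matrix is controlled by the part of W the algorithm has not yet observed.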

With all the above ingredients in place, we are now ready to complete the proof of Theorem 3.1:

Proof of Theorem 3.1.

Let ε := 1/(4d²), fix a threshold t > ε to be chosen below, let Z encode the query-response information, and let W̃ denote the matrix from Lemma 3.4. Finally, define the error probability

 p_err := P[ |λ̂min − λmin(W)| ≥ ε ].

We can now lower bound the probability of error by lower bounding the probability that the algorithm gives an estimate above the threshold t, while the corner matrix W̃ has smallest eigenvalue below t − ε. We can then decouple the probability of these events using the conditional independence of W̃ from the queries:

 p_err ≥ P[ {λ̂min ≥ t} ∩ {λmin(W) ≤ t − ε} ] ≥(i) P[ {λ̂min ≥ t} ∩ {λmin(W̃) ≤ t − ε} ] = E[ P[ {λ̂min ≥ t} ∩ {λmin(W̃) ≤ t − ε} ∣ Z ] ] ≥(ii) E[ P[λ̂min ≥ t ∣ Z] ⋅ P[λmin(W̃) ≤ t − ε] ] = P[λmin(W̃) ≤ t − ε] ⋅ P[λ̂min ≥ t],

where (i) uses Lemma 3.5, and (ii) and the final equality use the fact that W̃ has a Wishart distribution conditioned on Z, and thus λmin(W̃) is independent of Z. On the other hand, we have

 p_err ≥ P[ {λ̂min < t} ∩ {λmin(W) ≥ t + ε} ] ≥ P[λmin(W) ≥ t + ε] − P[λ̂min ≥ t],

so that P[λ̂min ≥ t] ≥ P[λmin(W) ≥ t + ε] − p_err. This implies that p_err ≥ P[λmin(W̃) ≤ t − ε] ⋅ (P[λmin(W) ≥ t + ε] − p_err). Rearranging, and setting t := 3/(4d²), so that t − ε = 1/(2d²) and t + ε = 1/d², we obtain

 p_err ≥ P[λmin(W̃) ≤ t − ε] ⋅ P[λmin(W) ≥ t + ε] / (1 + P[λmin(W̃) ≤ t − ε]) ≥ (1/2) ⋅ P[λmin(W̃) ≤ 1/(2d²)] ⋅ P[λmin(W) ≥ 1/d²].

Finally, since W̃ is distributed as (d − T)/d times a Wishart(d − T) matrix, with d − T ≥ βd,

 P[λmin(W̃) ≤ 1/(2d²)] = P_{V∼Wishart(d−T)}[ (d−T)²·λmin(V) ≤ (d−T)/(2d) ] ≥ P_{V∼Wishart(d−T)}[ (d−T)²·λmin(V) ≤ β/2 ].

Invoking Corollary 3.3, for all d sufficiently large there exists a constant p0 > 0, depending only on universal constants, for which P_{V∼Wishart(d−T)}[ (d−T)²·λmin(V) ≤ β/2 ] ≥ √(β/2)·p0, and P[λmin(W) ≥ 1/d²] ≥ p0. Hence, setting c_wish := p0²/(2√2), we conclude

 p_err ≥ (1/2) ⋅ P[λmin(W̃) ≤ 1/(2d²)] ⋅ P[λmin(W) ≥ 1/d²] ≥ (1/2) ⋅ √(β/2)·p0 ⋅ p0 = (p0²/(2√2))·√β = c_wish·√β.

3.2 Proof of Corollary 3.3

For the first point, fix δ ∈ (0, 1). Then by Lemma 3.2, the limiting normalized distributions of the eigenvalues satisfy the stated bounds for appropriate constants C1(δ) and C2(δ). By convergence in distribution, and the fact that the bound corresponds to the event that the limiting variables lie in a closed set, we find that for all d sufficiently large as a function of δ,