The problem of minimizing convex, quadratic functions of the form for is a fundamental algorithmic primitive in machine learning and optimization. Many popular approaches for minimizing can be characterized as “first order” methods, or algorithms which proceed by querying the gradients at a sequence of iterates , in order to arrive at a final approximate minimum . Standard gradient descent, the heavy-ball method, Nesterov’s accelerated descent, and conjugate-gradient can all be expressed in this form.
The seminal work of Nemirovskii et al. (1983) established that for a class of deterministic, first order methods, the number of gradient queries required to achieve a solution which approximates has the following scaling:
Condition-Dependent Rate: To attain , one needs , where .
Condition-Free Rate: For any , there exists an such that to obtain , one needs queries.¹
¹ Note that is precisely the Lipschitz constant of , and corresponds to the Euclidean radius of the domain over which one is minimizing; see Remark 2.2.
It has long been wondered whether the above, worst-case lower bounds are reflective of the “average case” difficulty of minimizing quadratic functions, or if they are mere artifacts of uniquely adversarial constructions. For example, one may hope that randomness may allow a first order algorithm to avoid querying in worst-case, uninformative directions, at least for the initial few iterations. Furthermore, quadratic objectives have uniform curvature, and thus local gradient exploration can provide global information about the function.
In this work, we show that in fact randomness does not substantially improve the query complexity of first order algorithms. Specifically, we show that even for randomized algorithms, (a) to obtain a solution for a small but universal constant , one needs gradient queries, and, as a consequence, (b) for any , the condition-free lower bound of queries for an -approximate solution holds as well. These lower bounds are attained by explicit constructions of distributions over parameters and , which are derived from classical models in random matrix theory. Hence, not only do our lower bounds resolve the question of the complexity of quadratic minimization with randomized first-order queries; they also provide compelling evidence that the worst-case and “average-case” complexity of quadratic minimization coincide up to constant factors.
1.1 Proof Ideas and Organization
Our argument draws heavily upon a lower bound due to Simchowitz et al. (2018) for approximating the top eigenvector of a deformed Wigner model, given matrix-vector multiplication queries of the form . Here, is drawn from a Gaussian Orthogonal Ensemble (see Section 3.1),² and is a parameter controlling . That work showed that eigenvector approximation implies estimation of the so-called “plant” , and showed that one required queries to perform the estimation appropriately.
² In Simchowitz et al. (2018), was taken to be uniform on the sphere.
In this work, we show an analogous reduction: one can estimate if one can minimize the function , where for an appropriate , and is a Gaussian vector that is slightly correlated with . We also consider matrix vector multiply queries ; these are equivalent both to querying , and to querying (see Remark 2.1).
The intuition behind our reduction comes from the meta-algorithm introduced by Garber et al. (2016). For epochs and uniform on the sphere, calls a black-box quadratic solver to produce iterates . If the errors are sufficiently small and if is tuned appropriately one can show that (a) and (b) denoting the top eigenvector of by , the iterate satisfies
In other words, reduces approximating the eigenvector of to minimizing a sequence of convex quadratic functions with condition number . Applying the lower bound for estimating from Simchowitz et al. (2018), one should expect queries on average to minimize these functions.
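The reduction described above can be illustrated with a small shift-and-invert power iteration. The following is only a sketch under stated assumptions, not the scheme of Garber et al. (2016) itself: the toy matrix, the hand-picked shift of 3.5 (slightly above the toy matrix's top eigenvalue of 3), and the function names `conjugate_gradient` and `shift_and_invert` are all hypothetical, and conjugate gradient merely stands in for the black-box quadratic solver. Each epoch minimizes one convex quadratic whose Hessian is the shifted matrix, using only matrix-vector products.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(apply_A, b, iters):
    """Approximately minimize 0.5 x^T A x - b^T x using only products with A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x at the starting point x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < 1e-30:       # residual is numerically zero; stop early
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def shift_and_invert(M, shift, x0, epochs, solver_iters):
    """Power iteration on (shift * I - M)^{-1}; each epoch is one quadratic solve."""
    def apply_A(v):
        # A = shift * I - M is positive definite when shift exceeds the top eigenvalue of M
        return [shift * vi - mv for vi, mv in zip(v, matvec(M, v))]
    x = x0
    for _ in range(epochs):
        x = conjugate_gradient(apply_A, x, solver_iters)
        norm = math.sqrt(dot(x, x))
        x = [xi / norm for xi in x]
    return x

# Toy symmetric matrix (an assumption for illustration): eigenvalues 3, 1, 1,
# with top eigenvector (1, 1, 0) / sqrt(2).
M = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
v_top = shift_and_invert(M, shift=3.5, x0=[1.0, 0.0, 0.0], epochs=12, solver_iters=3)
```

In this toy case the shifted matrix has eigenvalues 0.5 and 2.5, so each epoch shrinks the component orthogonal to the top eigenvector by a factor of five; the closer the shift sits to the top eigenvalue, the faster the outer loop converges but the worse conditioned each quadratic becomes.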
Unfortunately, applying the reduction in a black-box fashion requires high accuracy approximations of ; this means that this reduction cannot be used to lower bound the query complexity required for constant levels of error , and thus cannot be used to deduce the minimax rate. Our analysis therefore departs from the black-box reduction in that (a) we warm start near the plant rather than from an isotropic distribution, (b) we effectively consider only the first iteration of the scheme, corresponding to finding , and (c) we directly analyze the overlap between and the plant , ; the reduction is sketched in Section 3.1. Moreover, we modify information-theoretic lower bounds for the estimation of from queries of to account for the additional information conveyed by the linear term (see Section 5). Altogether, our reduction affords us simpler proofs and an explicit construction of a “hard instance”. Most importantly, the reduction tolerates constant error between the approximate minimizer and the optimum , which enables us to establish a sharp lower bound.
In particular, to obtain a lower bound which matches known upper bounds up to constants, it is necessary to establish that the error cannot align too closely with . Otherwise, one could obtain a good approximation of , namely , which was not sufficiently aligned with . Since is independent of given and , we can bound their cosine in terms of the quantity
which corresponds to the largest alignment between , and any -measurable estimator of the direction of . We can relate this quantity to the minimum mean-squared error of estimating the plant in a deformed Wigner model. This can in turn be controlled by a recent result due to Lelarge and Miolane (2016), which rigorously establishes a general replica-symmetric formula for planted matrix models. With this tool in hand, we prove Proposition 3.3, which gives an order-optimal bound on in terms of relevant problem parameters, provided that the ambient dimension is sufficiently large. We remark that the result of Lelarge and Miolane (2016) had been proven under additional restrictions by Barbier et al. (2016); see Section 2.1 for related work and additional discussion.
We cannot simply apply the bounds of Lelarge and Miolane (2016) out of the box, because (a) the former result does not allow for side information , and (b) the former work considers a slightly different observation model where only the off diagonals of are observed. In Section 6.2, we show that we can effectively remove the side information and reduce to a case where for an appropriate mean and a random scaling . Then, in Appendix D.4, we carry out a careful interpolation argument in the spirit of the Wasserstein continuity of mutual information (see, e.g., Wu and Verdú (2012)) to transfer the results from Lelarge and Miolane (2016) to our observation model. This interpolation argument also lets us establish a version of uniform convergence, which is necessary to account for the random scaling .
Organization: In Section 2, we formally introduce our query model and state our results; Section 2.1 discusses related work. In Section 3, we sketch the main components of the proof. Section 3.1 formally introduces the distribution over which witnesses our lower bound; it also presents Proposition 3.3, which bounds the term , and gives the reduction from estimating the plant to approximately minimizing . Section 4 gives a more in-depth proof roadmap for the reduction from estimation to optimization, which relies on non-asymptotic computations of the Stieltjes transform of and its derivatives. Lastly, Section 5 fleshes out the proof of the lower bound for estimating , and Section 6 provides background information and a proof sketch for our bounds on .
We shall use bold upper case letters (e.g. ) to denote (typically random) matrices related to a given problem instance, bold lower case letters (e.g. ) to denote (typically random) vectors related to a problem instance, and serif-font () to denote quantities related to a given algorithm. We use the standard notation , , for the Euclidean 2-norm, matrix operator norm, and matrix Frobenius norm, respectively. We let denote the canonical basis vectors in , let denote the unit sphere, the set of symmetric matrices, and the set of positive definite matrices. For a matrix , let denote its eigenvalues. For and , we let , and . Given vectors , we let denote the orthogonal projection onto . Lastly, given , we let if , and .
2 Main Results
We begin by presenting a formal definition of our query model.
Definition 2.1 (Randomized Query Algorithm).
A randomized query algorithm (RQA) with query complexity is an algorithm which interacts with an instance via the following query scheme:
The algorithm receives an initial input from an oracle.
For rounds , queries an oracle with a vector , and receives a noiseless response .
At the end of rounds, the algorithm returns an estimate of .
The queries and output are allowed to be randomized and adaptive, in that there is a random seed such that is a function of , and is a function of .
We remark that the above query model is equivalent to querying the exact gradient of the objective . Indeed, , and . Thus, our query model encapsulates gradient descent, accelerated gradient descent, heavy-ball, and conjugate gradient methods. Crucially, our query model differs from existing lower bounds by allowing for randomized queries as in Agarwal and Bottou (2014), and by not requiring iterates to lie in the Krylov space spanned by past queries as in Nemirovskii et al. (1983).
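To make the equivalence between matrix-vector queries and gradient queries concrete, the following sketch runs plain gradient descent while touching the matrix only through a counted query oracle. It assumes the standard quadratic objective f(x) = 0.5 xᵀAx − bᵀx, whose gradient is Ax − b; the objective's exact form is elided in the text above, so this is an assumption, and the names `make_oracle`, `gradient`, and `gradient_descent` as well as the 2×2 instance are hypothetical.

```python
def make_oracle(A):
    """Return a query function v -> A v together with a call counter."""
    calls = {"n": 0}
    def query(v):
        calls["n"] += 1
        return [sum(a * x for a, x in zip(row, v)) for row in A]
    return query, calls

def gradient(query, b, x):
    # Assumed objective f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b.
    # One gradient evaluation costs exactly one matrix-vector query.
    Ax = query(x)
    return [ax - bi for ax, bi in zip(Ax, b)]

def gradient_descent(query, b, step, T):
    """Run T gradient steps from the origin; uses T oracle queries in total."""
    x = [0.0] * len(b)
    for _ in range(T):
        g = gradient(query, b, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical toy instance: A = diag(2, 1), b = (2, 3), so the minimizer is (1, 3).
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 3.0]
query, calls = make_oracle(A)
x_hat = gradient_descent(query, b, step=0.5, T=50)
```

Here the algorithm receives b up front and then spends its entire query budget on products with A, which mirrors the three-step protocol of Definition 2.1; a randomized method would differ only in how it chooses the query vectors.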
We now state our main result, which shows that there exists a distribution over instances which matches the lower bounds of Nemirovskii et al. (1983):
Theorem 2.1 (Main Theorem: Minimax Rate with Conjectured Polynomial Dimension).
There exist functions and universal constants such that the following holds. For and , there exists a joint distribution over instances such that (a) and (b) for any and any RQA with query complexity and output , we have that for ,
Typically, convex optimization lower bounds are stated in terms of a strong convexity , a smoothness parameter , and the radius of the domain, or distance between the first iterate and a global minimizer, (see e.g. Bubeck et al. (2015)). For quadratics, the strong convexity parameter is and the smoothness parameter is ; one can show that both these quantities concentrate sharply in our particular distribution over , and that is at most a universal constant. As we are considering unconstrained optimization, the radius of the domain corresponds to . Indeed, the distribution of is rotationally symmetric, so a priori, the best estimate of (before observing or querying ) is . Hence the event can be interpreted as . Since one needs to have , we have that, with high probability,
which is the standard presentation of lower bounds for convex optimization. Similarly, the complement of the event can be rendered as
where is an upper bound on condition number.
Remark 2.3 (Scalings of ).
In Theorem 2.1, the dimension corresponds to how large the ambient dimension needs to be in order for to have the appropriate condition number, for approximations of to have sufficient overlap with , assuming a bound on , and for the lower bounds on estimating to kick in. For the sake of brevity, we show that is an unspecified polynomial in ; characterizing the explicit dependence is possible, but would require great care, lengthier proofs, and would distract from the major ideas of the work.
The dimension captures how large must be in order to obtain the necessary bound on . Though is finite, we are only able to guarantee that the dependence on is polynomial under a plausible conjecture, Conjecture 6.1, which requires that either (a) the minimum mean squared error of the estimate of the planted solution in a deformed Wigner model, or (b) the mutual information between the deformed Wigner matrix and the planted solution, converge to their asymptotic values at a polynomial rate.
If non-conjectural bounds are desired which still guarantee that the dimension need only be polynomial in the condition number, we instead have the following theorem:
Theorem 2.2 (Main Theorem: Weaker Rate with Guaranteed Polynomial Dimension).
Let be as in Theorem 2.1, and let . Then for every , there exists a distribution such that and for any and any RQA with query complexity , we have that
Remark 2.4 (The distributions and ).
The distributions over from Theorem 2.1 and from Theorem 2.2 differ subtly. The form of the distribution over is given explicitly at the beginning of Section 3.1, and is specialized for Theorem 2.2 by appropriately tuning parameters and . The distribution over is obtained by conditioning on a constant-probability, -measurable event (see remarks following Proposition 3.2). If one prefers, one can express Theorem 2.1 as saying that, for the distribution as in Section 3.1 and Theorem 2.2, any algorithm with has a large error with constant probability. However, by distinguishing between and , we ensure that any algorithm incurs error with overwhelming, rather than just constant, probability.
2.1 Related Work
It is hard to do justice to the vast body of work on quadratic minimization and first order methods for optimization. We shall restrict the present survey to the lower bounds literature.
Lower Bounds for Convex Optimization: The seminal work of Nemirovskii et al. (1983) established tight lower bounds on the number of gradient queries required to minimize quadratic objectives, in a model where (a) the algorithm was required to be deterministic (and was analyzed for a worst-case initialization), and (b) the gradient queries were restricted to lie in the linear span of the previous queries, known as the Krylov space. Agarwal and Bottou (2014) showed that deterministic algorithms can be assumed to query in the Krylov space without loss of generality, but did not extend their analysis to randomized methods. Woodworth and Srebro (2016) proved lower bounds against randomized first-order algorithms for finite-sum optimization of convex functions, but their constructions require non-quadratic objectives. Subsequent works generalized these constructions to query models which allow for high-order derivatives (Agarwal and Hazan, 2017; Arjevani et al., 2017); these lower bounds are only relevant for non-quadratic functions, since a second order method can, by definition, minimize a quadratic function in one iteration.
All aforementioned lower bounds, as well as those presented in this paper, require the ambient problem dimension to be sufficiently large as a function of relevant problem parameters; another line of work due to Arjevani and Shamir (2016) attains dimension-free lower bounds, but at the expense of restricting the query model.
Lower Bounds for Stochastic Optimization: Lower bounds have also been established in stochastic convex optimization (Agarwal et al., 2009; Jamieson et al., 2012), where each gradient- or function-value oracle query is corrupted with i.i.d. noise, and Allen-Zhu and Li (2016) prove analogues of these bounds for streaming PCA. Other works have considered lower bounds which hold when the optimization algorithm is subject to memory constraints (Steinhardt et al., 2015; Steinhardt and Duchi, 2015; Shamir, 2014). While these stochastic lower bounds are information-theoretic, and thus unconditional, they are incomparable to the setting considered in this work, where we are allowed to make exact, noiseless queries.
Query Complexity: Our proof casts eigenvector computation as a sequential estimation problem. These have been studied at length in the context of sparse recovery and active adaptive compressed sensing (Arias-Castro et al., 2013; Price and Woodruff, 2013; Castro and Tánczos, 2017; Castro et al., 2014). Due to the noiseless oracle model, our setting is most similar to that of Price and Woodruff (2013), whereas other works (Arias-Castro et al., 2013; Castro and Tánczos, 2017; Castro et al., 2014) study measurements contaminated with noise. More broadly, query complexity has received much recent attention in the context of communication-complexity (Anshu et al., 2017; Nelson et al., 2017), in which lower bounds on query complexity imply corresponding bounds against communication via lifting theorems.
Estimation in the Deformed Wigner Model: As mentioned in Section 1.1, we require a result due to Lelarge and Miolane (2016) regarding the minimum mean squared error of estimation in a deformed Wigner model; this is achieved by establishing that the replica-symmetric formula for mutual information in the deformed Wigner model holds in broad generality. The replica-symmetric formula had been conjectured by the statistical physics community (see Lesieur et al. (2015)), and Barbier et al. (2016) and Krzakala et al. (2016) had rigorously proven this formula under the restriction that the entries of the plant have finite support. In our application, has Gaussian entries, which is why we need the slightly more general result of Lelarge and Miolane (2016). Later, Alaoui and Krzakala (2018) gave a concise proof of the replica-symmetric formula, again under the assumption that has finite support.
3 Proof Roadmap
3.1 Reduction from Estimation in the Deformed Wigner Model
Our random instances will be parameterized by the quantities , , and ; typically, one should think of as being on the order of , and of , which is on the order of . We say is a universal constant if it does not depend on the triple , and write as short hand for , for some unspecified universal constant . We shall also let denote a term which is at most for universal constants . Given an event , we note that writing allows us to encode constraints of the form (recall ), since otherwise and the probability statement is vacuously true. In particular, we shall assume is sufficiently large that .
For each and , consider the deformed Wigner model
where is called the plant, and is a matrix, with for , and for . With and defined above, we define our random instance as
and let denote the vector which exists almost surely, and when , is the unique minimizer of the quadratic objective . In this section, we give a high level sketch of the major technical building blocks which underlie our main results in Section 2.
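A small simulation can give some intuition for the deformed Wigner model on which the instance is built. The sketch below is illustrative only and rests on assumptions, because the exact entry distributions are elided in the text above: it takes the Gaussian Orthogonal Ensemble with off-diagonal variance 1/d and diagonal variance 2/d, a plant with i.i.d. N(0, 1/d) entries, and a deformation strength `lam` greater than one. It then compares the top eigenvalue, found by power iteration, with the classical spiked-model prediction θ + 1/θ, where θ is `lam` times the squared norm of the plant. All function names are hypothetical.

```python
import math
import random

def sample_goe(d, rng):
    """Symmetric Gaussian matrix. ASSUMED scaling: off-diagonal variance 1/d, diagonal 2/d."""
    W = [[0.0] * d for _ in range(d)]
    for i in range(d):
        W[i][i] = rng.gauss(0.0, math.sqrt(2.0 / d))
        for j in range(i + 1, d):
            W[i][j] = W[j][i] = rng.gauss(0.0, math.sqrt(1.0 / d))
    return W

def sample_deformed_wigner(d, lam, rng):
    """Return M = W + lam * u u^T and the plant u. ASSUMED plant law: u ~ N(0, I/d)."""
    u = [rng.gauss(0.0, math.sqrt(1.0 / d)) for _ in range(d)]
    W = sample_goe(d, rng)
    M = [[W[i][j] + lam * u[i] * u[j] for j in range(d)] for i in range(d)]
    return M, u

def top_eigenvalue(M, iters, rng):
    """Power iteration followed by a Rayleigh quotient; uses only matrix-vector products."""
    x = [rng.gauss(0.0, 1.0) for _ in range(len(M))]
    for _ in range(iters):
        y = [sum(m * xi for m, xi in zip(row, x)) for row in M]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    Mx = [sum(m * xi for m, xi in zip(row, x)) for row in M]
    return sum(a * b for a, b in zip(x, Mx))

rng = random.Random(0)
d, lam = 200, 2.0
M, u = sample_deformed_wigner(d, lam, rng)
theta = lam * sum(v * v for v in u)       # effective spike strength for this draw
predicted = theta + 1.0 / theta           # large-d prediction, valid when theta > 1
estimate = top_eigenvalue(M, 120, rng)
```

At this modest dimension the estimate and the prediction agree only up to finite-size fluctuations, which shrink as d grows. The separation of the top eigenvalue from the bulk edge near 2 is what lets a shift of the identity minus this matrix be positive definite with a controlled condition number.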
The first step is to provide a reduction from estimation to optimization. Specifically, we must show that if the approximate minimizer returned by any RQA is close to the true optimal , then has a large inner product with . We must also ensure that we retain control over the conditioning of . To this end, the parameter gives us a knob to control the condition number of , and gives us control over to what extent we “warm-start” the algorithm near the true planted solution . Specifically, Proposition 4.1 implies that will concentrate below
and standard arguments imply that concentrates around . In Proposition 4.2, we show that if is in some desired range, then satisfies
In other words, the solution is about -times more correlated with the plant than is . Consequently, if approximates up to sufficiently high accuracy, then, as we show in Section 4, one obtains a solution which is correlated with :
Proposition 3.1 (Reduction from Optimization to Estimation; First Attempt).
For all and , then as defined above satisfy
Proposition 3.1 allows , the parameter controlling the correlation between and , to be vanishingly small in the dimension. In fact, the condition can be replaced by for any , provided that the constants are amended accordingly. Thus, our lower bounds hold even when the linear term and the plant have little correlation, provided the solution accuracy is sufficiently high. Unfortunately, Proposition 3.1 also requires that be small. In fact, we can only take to be at most , yielding the bound
which only applies if can ensure . The minimax lower bounds, on the other hand, must apply as soon as is some (possibly small) constant.
To sharpen Proposition 3.1, we make the following observation: whereas (4) controls the overlap between and , we are more precisely interested in the overlap between and . If the error could align arbitrarily well with , then we would only be able to tolerate small errors to ensure large correlations . However, we observe that both and are conditionally independent of , given . Since conditioning on is equivalent to conditioning on , we can bound the alignment between and by viewing as an estimator , and bounding the quantity
Here, corresponds to the largest possible expected alignment between and any possible estimator depending on a total observation of . In particular, if is small, then the overlap between and is small in expectation. This idea leads to the following refinement of (5):
Proposition 3.2 (Reduction from Optimization to Estimation; Sharpened Version).
Let and set . Then, there exists a distribution of instances with such that, for
The distribution is obtained by conditioning the distribution over on a constant-probability event, described in Section 4.
The proofs of Proposition 3.2 and its coarser analogue, Proposition 3.1, are given in Section 4. The main idea is to relate quantities of interest to fundamental quantities in the study of deformed Wigner matrices, namely the Stieltjes transform and its derivatives. Leveraging the non-asymptotic convergence of the Stieltjes transform, we can establish non-asymptotic convergence of its derivatives via Lemma B.4 in the appendix, a quantitative analogue of a classical bound regarding the convergence of the derivatives of limits of convex functions.
Compared to (5), Proposition 3.2 increases the error tolerance by a factor of , up to multiplicative constants. In particular, if we can show , then the RQA need only output a solution satisfying . For sufficiently large, we can prove precisely this bound.
Suppose that . Then, there exists a such that for all , . Moreover, under Conjecture 6.1, .
The above result leverages a recent result regarding the asymptotic error of plant estimation in a deformed Wigner model (Lelarge and Miolane, 2016). The proof involves engaging with rather specialized material, and is deferred to Section 6. Specifically, the first statement is a consequence of Corollary 6.2, and the second statement follows from Corollary 6.4.
The last ingredient we need in our proof is to upper bound
Let and , and let , and be as in Section 3.1. Then for any RQA interacting with the instances , and any ,
where the probability is taken over the randomness of the algorithm, and over .
We prove Theorem 3.4 by modifying the arguments from Simchowitz et al. (2018); the proof is outlined in Section 5. The key intuition is to slightly modify ’s queries so that the inner product is bounded by the norm of the projection of onto -queries, and to show that this projection grows at a rate that is bounded by a geometric series on the order of , which is for . With the above results in place, we are now ready to prove our main theorems:
To prove Theorem 2.1, let denote the hidden universal constant on the left hand side of equation (6), and the universal constant on the right hand side. Then, for , , and , the distribution from the sharpened reduction in Proposition 3.2 satisfies
Rearranging, combining terms, and absorbing constants, we have that
Now recall that with probability one over , . Since is a decreasing bijection from to , we may reparameterize both and the above result in terms of . Recognizing that , we see that for possibly modified constants , it holds that for all , we have
where with probability , . We remark that if is polynomial in , as in Conjecture 6.1, then when parameterized in terms of . Also, for some , we can bound for a new universal constant . Lastly, we find , and thus the event entails . This concludes the proof of Theorem 2.1. The proof of Theorem 2.2 follows similarly by arguing from Equation (5) instead of from (6); in this case, we no longer need the requirement , and we work with the original distribution over instead of the conditional distribution . ∎
In this section, we shall focus on establishing Proposition 3.2; the proof of Proposition 3.1 uses a strictly simplified version of the same argument, and we defer its proof to the end of the section. In proving Proposition 3.2, our goal will be to define an event such that the desired distribution of is just the conditional distribution . We shall construct as the intersection of two events and , which ensure respectively that
is well conditioned; specifically, .
Any approximate minimizer of is well aligned with with constant probability.
Let’s begin with , which ensures the conditioning of . In what follows, we let denote a parameter representing a multiplicative error in our deviation bounds; we shall choose without affecting the scaling of the results, but taking will recover known asymptotic scalings in many (but not all) of our bounds.
Let . Then, for any fixed , the event
occurs with probability at least .