We consider first-order optimization methods for convex Lipschitz bounded functions. I.e., for the following optimization problem
$$\min_{x \in \mathbb{R}^d : \|x\| \le B} F(x) \qquad (1)$$
where $F$ is convex and $L$-Lipschitz, we consider using (exact) queries returning $F(x)$ and a subgradient $\nabla F(x)$, and ask the classical question of how many queries are required to ensure we find an $\epsilon$-suboptimal solution. The classical answer is that $O(d \log(LB/\epsilon))$ queries suffice, using the center-of-mass method, and that, when $d = O(L^2B^2/\epsilon^2)$, this is optimal (Nemirovsky and Yudin, 1983). However, the center-of-mass method is intractable, at least exactly. Other methods with $O(\mathrm{poly}(d) \log(LB/\epsilon))$, and even $O(d \log(LB/\epsilon))$, query complexity and polynomial runtime have been suggested, including the Ellipsoid method (Shor, 1970), Vaidya’s method (Atkinson and Vaidya, 1995), and approximate center of mass using sampling (Bertsimas and Vempala, 2002). But these are generally not used in practice, since the higher-order polynomial runtime dependence on the dimension is prohibitive. These methods also all require storing all returned gradients, or alternatively an ellipsoid in $\mathbb{R}^d$, and so $\Omega(d^2)$ memory. A simpler alternative is gradient descent, which requires $O(L^2B^2/\epsilon^2)$ queries, but only $O(d \log(LB/\epsilon))$ memory and $O(d)$ runtime per query.
One might ask: is it possible to achieve the optimal query complexity using a “simple” method? Since it is much harder to provide a runtime lower bound, we instead focus on the required memory, and ask: is it possible to achieve the optimal query complexity with $O(d \log(LB/\epsilon))$ memory? How does the first-order oracle complexity trade off with the memory needed by an optimization algorithm? This question is formalized in the following section.
2 Problem Formulation
We capture the class of first-order optimization algorithms that use $M$ bits of memory in terms of a set of “encoders” and “decoders.” In each iteration $t$, the decoder reads the $M$ bits of memory and determines a query point $x_t$. The encoder receives the function value $F(x_t)$ and a subgradient $\nabla F(x_t)$ at $x_t$; as is standard in oracle-based optimization, if $F$ is not differentiable at $x_t$, we require that the method work for any valid subgradient. The encoder then uses the current memory state together with $x_t$, $F(x_t)$, and $\nabla F(x_t)$ to update the memory state for the next iteration. At the end, the algorithm’s output is chosen as a function of the final memory state. To be clear, the encoding and decoding functions can require an arbitrary amount of memory to compute, and can compute using real numbers. However, between each access to the oracle, there is a “bottleneck” where the algorithm’s state must be compressed down to $M$ bits.
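Formally, we define $\mathcal{A}_{M,T}$, the class of all deterministic first-order optimization algorithms that use $M$ bits of memory and $T$ function value and gradient computations. (We focus on deterministic algorithms, but an analogous class of randomized algorithms could easily be specified. However, it seems unnecessary to complicate things in this way because there is evidence that there is little to be gained through randomization for solving problems of the form (1) (Woodworth and Srebro, 2017).) An algorithm $A \in \mathcal{A}_{M,T}$ is specified by a set of decoder functions $\{\phi^{\mathrm{dec}}_t : \{0,1\}^M \to \mathbb{R}^d\}_{t=1}^T$, a set of encoder functions $\{\phi^{\mathrm{enc}}_t : \{0,1\}^M \times \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}^d \to \{0,1\}^M\}_{t=1}^T$, and an output function $\phi^{\mathrm{out}} : \{0,1\}^M \to \mathbb{R}^d$. The algorithm’s memory is initially blank, $\mu_0 = 0^M$. The iterations are specified recursively by
$$x_t = \phi^{\mathrm{dec}}_t(\mu_{t-1}) \quad \text{and} \quad \mu_t = \phi^{\mathrm{enc}}_t\big(\mu_{t-1}, x_t, F(x_t), \nabla F(x_t)\big),$$
and the output of the algorithm, denoted $A(F)$, is given by:
$$A(F) = \phi^{\mathrm{out}}(\mu_T).$$
Let $\mathcal{F}_{L,B}$ be the set of all convex, $L$-Lipschitz functions $F : \mathbb{R}^d \to \mathbb{R}$ such that there exists $x^* \in \arg\min_x F(x)$ with $\|x^*\| \le B$. We define the minimax memory-bounded first-order oracle complexity as
$$T(\epsilon, d, M, L, B) = \min\Big\{ T : \exists A \in \mathcal{A}_{M,T} \text{ s.t. } \sup_{F \in \mathcal{F}_{L,B}} F(A(F)) - \min_x F(x) \le \epsilon \Big\},$$
where the supremum over functions should be interpreted also as a supremum over all valid subgradients used in the updates. Without loss of generality, we will fix $L = B = 1$ and write $T(\epsilon, d, M)$. We will further say that a query-memory tradeoff $(T, M)$ is possible for a problem specified by $(\epsilon, d)$ if $T \ge T(\epsilon, d, M)$ and impossible if $T < T(\epsilon, d, M)$.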
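The encoder/decoder protocol described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function `run_memory_bounded`, the representation of the memory state as an integer in $[0, 2^M)$, and the toy one-dimensional instance (a sign-based walk on a grid of $2^M$ points for $F(x) = |x - 0.3|$) are all assumptions made for illustration and are not part of the formal definition.

```python
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
Oracle = Callable[[Vector], Tuple[float, Vector]]  # x -> (F(x), a subgradient at x)


def run_memory_bounded(
    decoders: Sequence[Callable[[int], Vector]],
    encoders: Sequence[Callable[[int, Vector, float, Vector], int]],
    output: Callable[[int], Vector],
    oracle: Oracle,
    num_bits: int,
) -> Vector:
    """Run T = len(decoders) rounds; the M-bit memory is stored as an integer."""
    state = 0  # blank memory: all M bits are zero
    for decode, encode in zip(decoders, encoders):
        x = decode(state)            # the query point depends only on the M-bit state
        value, subgrad = oracle(x)   # exact first-order oracle answer
        state = encode(state, x, value, subgrad)
        # The bottleneck: whatever the encoder computed must fit in M bits.
        assert 0 <= state < 2 ** num_bits, "encoder exceeded the memory budget"
    return output(state)


# Toy instance (hypothetical): d = 1, F(x) = |x - 0.3|, M = 4 bits of memory.
M = 4
GRID = 2 ** M - 1


def decode(state: int) -> Vector:
    """Map the state to one of 2^M evenly spaced points in [-1, 1]."""
    return (-1.0 + 2.0 * state / GRID,)


def encode(state: int, x: Vector, value: float, subgrad: Vector) -> int:
    """Move one grid cell against the sign of the subgradient, clamped to the grid."""
    step = -1 if subgrad[0] > 0 else (1 if subgrad[0] < 0 else 0)
    return min(max(state + step, 0), GRID)


def abs_oracle(x: Vector) -> Tuple[float, Vector]:
    diff = x[0] - 0.3
    return abs(diff), (float((diff > 0) - (diff < 0)),)


T = 12
x_hat = run_memory_bounded([decode] * T, [encode] * T, decode, abs_oracle, M)
```

The encoder here may do arbitrary real-valued computation internally; only the value it returns is constrained, which mirrors the bottleneck between oracle accesses.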
3 Current Knowledge
In high dimensions, when $d = \Omega(1/\epsilon^2)$, gradient descent is optimal in terms of both query and memory complexity, and so we consider only $d = O(1/\epsilon^2)$.
We can describe the minimax complexity $T(\epsilon, d, M)$, and the query-memory tradeoff, in terms of the regions of possible and impossible $(T, M)$ pairs, as depicted in Figure 1. We currently understand only the extremes. With any amount of memory, $\Omega(d \log(1/\epsilon))$ queries are required, providing a lower bound for the possible region in terms of the query complexity (a horizontal lower bound in Figure 1). This is attained by the center of mass method, using $O(d^2 \log^2(1/\epsilon))$ bits of memory (see Appendix B for an analysis of Center of Mass with discrete memory), and so any $T = \Omega(d \log(1/\epsilon))$ and $M = \Omega(d^2 \log^2(1/\epsilon))$ is possible (the rectangle above and to the right of “Center of Mass” in Figure 1). At the other extreme, even just representing the answer requires $\Omega(d \log(1/\epsilon))$ bits (see Theorem C in Appendix C), providing a lower bound for the possible region in terms of memory (the vertical lower bound in Figure 1). This is attained by Gradient Descent using $O(1/\epsilon^2)$ queries (see Appendix A for an analysis of Gradient Descent with discrete memory), and so any $T = \Omega(1/\epsilon^2)$ and $M = \Omega(d \log(1/\epsilon))$ is possible (the rectangle above and to the left of “Gradient Descent” in Figure 1).
To the best of our knowledge, what happens inside the square bordered by these regions is completely unknown. Nothing we know would contradict the existence of an algorithm with $O(d \log(1/\epsilon))$ query and memory complexity, i.e. a single optimal method at the bottom left corner of the unknown square, making the entire square possible. It is also entirely possible, as far as we know, that improving over a query complexity of $1/\epsilon^2$ requires $\Omega(d^2 \log^2(1/\epsilon))$ memory, making the entire square impossible, and implying that no compromise is possible between the query requirement of Gradient Descent and the memory requirement of Center of Mass.
Ultimately, we would like to fully understand what is and is not possible:
Question 1 ($500 or Two Star Michelin Meal)
Provide a complete characterization of the minimax complexity $T(\epsilon, d, M)$ and the possible $(T, M)$ trade-off, preferably up to constant factors, and at most up to factors poly-logarithmic in $d$ and $1/\epsilon$.
The most interesting scaling of $d$ and $\epsilon$ is when the dimension is larger than poly-logarithmic but smaller than polynomial in $1/\epsilon$, so that memory is less than quadratic memory, but query complexity is not polynomial in .
Even without understanding the entire trade-off, it would be interesting to study what can be done on its boundary. Perhaps the most important regime is the case of linear memory . Therefore, as a starting point, we ask to characterize . In particular, is it possible to have query complexity polynomial in with memory?
Question 2 ($200 or One Star Michelin Meal)
Can we have when but for all ?
At the other extreme, we might ask whether quadratic memory is necessary in order to achieve optimal query complexity:
Question 3 ($200 or One Star Michelin Meal)
Can we have , for , when but for all ?
The above represent specific incursions into the unknown square in Figure 1. Any other such incursion would also be interesting, and would provide either a memory lower bound or a trade-off improving over Gradient Descent and Center of Mass in some regime.
Question 4 ($100 or Michelin Bib Gourmand Meal)
Resolve the possibility or impossibility of some trade-off polynomially inside the unknown square in Figure 1.
References
- Atkinson and Vaidya (1995). David S Atkinson and Pravin M Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1–43, 1995.
- Bertsimas and Vempala (2002). Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 109–115. ACM, 2002.
- Bubeck et al. (2015). Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- Grünbaum et al. (1960). Branko Grünbaum et al. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257–1261, 1960.
- Nemirovsky and Yudin (1983). Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
- Shor (1970). Naum Z Shor. Convergence rate of the gradient descent method with dilatation of the space. Cybernetics, 6(2):102–108, 1970.
- Woodworth and Srebro (2017). Blake Woodworth and Nathan Srebro. Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594, 2017.
Appendix A Analysis of Gradient Descent
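For any $L$-Lipschitz and convex function $F$ with $\|x^*\| \le B$, the gradient descent algorithm can find a point $\hat{x}$ with $F(\hat{x}) - F(x^*) \le \epsilon$ using $O(d \log(LB/\epsilon))$ bits of memory and $O(L^2B^2/\epsilon^2)$ function and gradient evaluations.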
For now, assume that in each iteration the perturbation of the gradient descent iterates resulting from the discretization is bounded in L2 norm, i.e. . Then, following the standard gradient descent analysis,
Rearranging this expression, we conclude
Choosing a fixed stepsize and averaging the iterates, we conclude achieves suboptimality
Since the averaged iterate achieves this suboptimality, the best iterate’s suboptimality is at least this good. For , this ensures that at least one of the iterates was -suboptimal. As long as is discretized to accuracy , then is at most -suboptimal. This discretization can be achieved using bits.
Discretizing the iterates up to accuracy can be achieved using the log of the L2 covering number of the radius- ball, which is upper bounded by bits. The discretization of is achieved using the same number of bits.
Therefore, the total number of bits of memory needed to implement gradient descent is at most
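The argument above can be illustrated with a small sketch of gradient descent whose iterates are rounded to a grid after every step. The details here are assumptions made for illustration rather than the exact scheme analyzed in this appendix: the function name `quantized_gradient_descent`, the coordinate-wise grid (instead of an optimal L2 cover), the projection onto the radius-$B$ ball, the stepsize $\eta = \epsilon/L^2$, and the test function $F(x) = \|x - x^*\|$ are all hypothetical choices, and the stored best function value is kept as a float rather than discretized.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]


def l2_norm(v: Vec) -> float:
    return math.sqrt(sum(c * c for c in v))


def quantized_gradient_descent(
    oracle: Callable[[Vec], Tuple[float, Vec]],
    d: int, L: float, B: float, eps: float, num_steps: int, delta: float,
) -> Tuple[Vec, float, int]:
    """Projected subgradient descent with iterates rounded to a grid.

    Rounding each coordinate to a multiple of h moves the point by at most
    h/2 per coordinate, i.e. by at most delta in L2 norm.
    """
    h = 2.0 * delta / math.sqrt(d)   # grid spacing
    eta = eps / L ** 2               # fixed stepsize (assumed choice)
    x = [0.0] * d
    best_x, best_val = list(x), float("inf")
    for _ in range(num_steps):
        val, g = oracle(x)
        if val < best_val:           # remember the best discretized iterate
            best_x, best_val = list(x), val
        y = [xi - eta * gi for xi, gi in zip(x, g)]
        r = l2_norm(y)
        if r > B:                    # project back onto the radius-B ball
            y = [B * yi / r for yi in y]
        x = [round(yi / h) * h for yi in y]   # discretize the new iterate
    # Grid levels per coordinate that suffice to index any rounded iterate.
    levels = 2 * math.ceil((B + delta) / h) + 1
    bits_per_iterate = d * math.ceil(math.log2(levels))
    return best_x, best_val, bits_per_iterate


# Hypothetical test problem: F(x) = ||x - x*||, which is convex and 1-Lipschitz.
X_STAR = [0.3, -0.2, 0.1]


def distance_oracle(x: Vec) -> Tuple[float, Vec]:
    diff = [xi - si for xi, si in zip(x, X_STAR)]
    r = l2_norm(diff)
    grad = [c / r for c in diff] if r > 0 else [0.0] * len(x)
    return r, grad


best_x, best_val, bits_per_iterate = quantized_gradient_descent(
    distance_oracle, d=3, L=1.0, B=1.0, eps=0.1, num_steps=100, delta=0.0025
)
```

The returned bit count covers a single grid iterate; an implementation following the proof would also store the best iterate and a discretized function value, which only changes the total by a constant factor and an additive logarithmic term.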
Appendix B Analysis of Center of Mass Algorithm
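Lemma (Grünbaum et al., 1960). For any convex set $K \subset \mathbb{R}^d$ with center of gravity $c$, and any halfspace $H$ whose bounding hyperplane passes through $c$,
$$\frac{1}{e} \le \frac{\mathrm{Vol}(K \cap H)}{\mathrm{Vol}(K)} \le 1 - \frac{1}{e}.$$
For any $L$-Lipschitz and convex function $F$ with $\|x^*\| \le B$, the center of mass algorithm can find a point $\hat{x}$ with $F(\hat{x}) - F(x^*) \le \epsilon$ using $O(d^2 \log^2(LB/\epsilon))$ bits of memory and $O(d \log(LB/\epsilon))$ function and gradient evaluations. This proof is quite similar to existing analyses of the center of mass algorithm (Bubeck et al., 2015); we simply take care to count the number of required bits.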
Consider the set , which has volume . By convexity,
By Grünbaum’s Lemma, . Thus, when
We conclude that there must be some iteration in which . Thus, and . We will now argue that the center of mass has small error:
Therefore, if we choose and discretize gradients with L2 error at most , this ensures that .
The gradients of an -Lipschitz function are contained in the Euclidean ball of radius . Therefore, the gradients can be discretized with error using the logarithm of the L2 covering number of the Euclidean ball of radius , which is upper bounded by bits. There are gradients in total, thus the total number of bits required to represent the gradients is
Since , this is upper bounded by bits.
Once iterations have been completed, we know that at least one of the centers must be an -approximate minimizer of the objective. Using the stored gradients, we can then recompute all centers and return a discretization of the best center. As long as for all , then the center chosen by the algorithm will be within of the best center. This discretization of the function values requires bits.
As long as the discretization of the chosen center has L2 error at most , then the output will be a -approximate minimizer. The number of bits needed for this discretization is at most . Therefore, the total number of bits needed is at most
Rescaling completes the proof.
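To make the cutting step concrete, consider the one-dimensional case, where the center of mass of an interval is its midpoint and each subgradient cut removes exactly half of the remaining interval. The sketch below is an illustration under that simplifying assumption only: the function name `center_of_mass_1d` and the test function $F(x) = |x - 0.3|$ are hypothetical, the interval endpoints are stored as floats, and no attempt is made to model the discretized memory counted in the proof.

```python
import math
from typing import Callable, Tuple


def center_of_mass_1d(
    oracle: Callable[[float], Tuple[float, float]],
    L: float, B: float, eps: float,
) -> Tuple[float, float, int]:
    """Center-of-mass method on [-B, B]; in 1D this reduces to bisection."""
    lo, hi = -B, B
    best_x, best_val = 0.0, float("inf")
    queries = 0
    while True:
        c = 0.5 * (lo + hi)          # center of mass of the current interval
        val, g = oracle(c)
        queries += 1
        if val < best_val:
            best_x, best_val = c, val
        # A minimizer lies in [lo, hi], so |c - x*| <= (hi - lo) / 2 and,
        # by Lipschitzness, F(c) - F(x*) <= L * (hi - lo) / 2.
        if g == 0 or L * (hi - lo) / 2 <= eps:
            break
        if g > 0:
            hi = c                   # minimizers lie to the left of c
        else:
            lo = c                   # minimizers lie to the right of c
    return best_x, best_val, queries


best_x, best_val, queries = center_of_mass_1d(
    lambda x: (abs(x - 0.3), float((x > 0.3) - (x < 0.3))),
    L=1.0, B=1.0, eps=0.01,
)
```

In this instance the number of queries grows like $\log_2(LB/\epsilon)$, matching the $d \log(LB/\epsilon)$ rate with $d = 1$; in higher dimensions computing the center of mass exactly is the intractable part.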
Appendix C Memory Lower Bound
The packing number of the Euclidean unit sphere in with distance is at least . Let be the largest possible packing of the unit sphere, with . Consider the set of points that are within of one of the points in the packing:
Therefore, there exists a point such that for all . The existence of such a point contradicts the assumption that is the largest possible packing. We conclude that the packing number is at least .
For any and any , any optimization algorithm that is guaranteed to return an -suboptimal point for any convex, -Lipschitz function with must use at least bits of memory. To begin, by Lemma C there exists a packing of the ball of size at least such that for all . We will associate a function with each point in the packing, let
These functions are convex and -Lipschitz, and their optimizers have norm less than .
Note that any point which is an approximate minimizer of some must have high function value on all other functions . Suppose , then . Consequently, for all , , thus .
Consider using a memory-bounded optimization algorithm to optimize one of these functions . After the algorithm has made all of its first-order oracle accesses, the output function must map from the final memory state to a solution . Suppose the final memory state uses $M$ bits; then there are at most $2^M$ outputs that the algorithm might give. However, as we just argued, there exist functions such that returning an accurate solution for any one of them requires returning an inaccurate solution for all the others. Consequently, any algorithm which can produce fewer distinct outputs than there are functions in this family will fail to optimize at least one of the functions .
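The counting step at the end of this argument is a pigeonhole bound: if $N$ functions each require a distinct output, then the final memory state must take at least $N$ values, so $2^M \ge N$. A minimal sketch of this calculation follows; the helper name `min_memory_bits` and the symbol $N$ for the size of the function family are introduced here for illustration only.

```python
def min_memory_bits(num_functions: int) -> int:
    """Smallest M such that 2**M >= num_functions (pigeonhole bound)."""
    if num_functions < 1:
        raise ValueError("need at least one function")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (num_functions - 1).bit_length()


# With 5 functions needing pairwise distinct outputs, 2 bits (4 outputs)
# are too few and 3 bits (8 outputs) suffice.
bits_for_five = min_memory_bits(5)
```

Applied to a packing whose size is exponential in $d$, this yields a memory lower bound that is linear in $d$.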