Tight Complexity Bounds for Optimizing Composite Objectives

05/25/2016 · Blake Woodworth, et al. · Toyota Technological Institute at Chicago

We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.


1 Introduction

We consider minimizing the average of $m$ convex functions:

$$\min_{x \in \mathcal{X}} F(x) := \frac{1}{m} \sum_{i=1}^{m} f_i(x) \tag{1}$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a closed, convex set, and where the algorithm is given access to the following gradient (or subgradient in the case of non-smooth functions) and prox oracle for the components:

$$h_F(x, i, \beta) = \Big( f_i(x),\; \nabla f_i(x),\; \operatorname{prox}_{f_i}(x, \beta) \Big) \tag{2}$$

where

$$\operatorname{prox}_{f_i}(x, \beta) = \arg\min_{u \in \mathcal{X}} \Big\{ f_i(u) + \frac{\beta}{2} \|x - u\|^2 \Big\}. \tag{3}$$

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather than local, information about the function. For example, for a single function ($m = 1$), one prox oracle call (with $\beta = 0$) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each component might be handled on a different machine (e.g. ADMM [7], DANE [18], DISCO [20]) and for functions that can be decomposed into several "easy" parts (e.g. PRISMA [13]). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required, even for the average of just two functions ($m = 2$).
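To make the oracle concrete, here is a minimal numerical sketch (our own illustration, not code from the paper) of evaluating $\operatorname{prox}_{f_i}(x, \beta)$ for a toy component; the function names are ours, and the generic solver stands in for whatever closed-form or specialized routine a real implementation would use. Note how driving $\beta \to 0$ recovers exact minimization of the component, as observed above.

```python
import numpy as np
from scipy.optimize import minimize

def prox(f, x, beta):
    """Numerically evaluate prox_f(x, beta) = argmin_u { f(u) + (beta/2)||x - u||^2 }."""
    obj = lambda u: f(u) + 0.5 * beta * np.sum((x - u) ** 2)
    return minimize(obj, x, method="Nelder-Mead").x  # warm-started at the query point x

# Toy component: f(u) = |u - 3|, which is 1-Lipschitz and non-smooth.
f = lambda u: abs(u[0] - 3.0)

x = np.zeros(1)
print(prox(f, x, beta=1.0))   # ~1.0: pulled partway toward the component's minimizer
print(prox(f, x, beta=1e-9))  # ~3.0: beta -> 0 gives (near-)exact minimization
```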

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over a finite training set. Recently, stochastic methods such as SDCA [16], SAG [14], SVRG [8], and other variants have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such as accelerated SDCA [17], accelerated SVRG, and Katyusha [3], have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this setting, but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective only through the oracle and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find $\epsilon$-suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1.

|  | $L$-Lipschitz, Convex | $L$-Lipschitz, $\lambda$-Strongly Convex | $\gamma$-Smooth, Convex | $\gamma$-Smooth, $\lambda$-Strongly Convex |
|---|---|---|---|---|
| Deterministic, Upper | $m\frac{LB}{\epsilon}$ (Section 3) | $m\frac{L}{\sqrt{\lambda\epsilon}}$ (Section 3) | $m\sqrt{\frac{\gamma B^2}{\epsilon}}$ (AGD) | $m\sqrt{\frac{\gamma}{\lambda}}\log\frac{\Delta}{\epsilon}$ (AGD) |
| Deterministic, Lower | $m\frac{LB}{\epsilon}$ (Section 4) | $m\frac{L}{\sqrt{\lambda\epsilon}}$ (Section 4) | $m\sqrt{\frac{\gamma B^2}{\epsilon}}$ (Section 4) | $m\sqrt{\frac{\gamma}{\lambda}}$ (Section 4) |
| Randomized, Upper | $m + \min\big(\frac{L^2B^2}{\epsilon^2}, \frac{\sqrt{m}LB}{\epsilon}\big)$ (SGD, A-SVRG) | $m + \min\big(\frac{L^2}{\lambda\epsilon}, \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$ (SGD, A-SVRG) | $m + \sqrt{\frac{m\gamma B^2}{\epsilon}}$ (A-SVRG) | $\big(m + \sqrt{\frac{m\gamma}{\lambda}}\big)\log\frac{\Delta}{\epsilon}$ (A-SVRG) |
| Randomized, Lower | $m + \frac{\sqrt{m}LB}{\epsilon}$ (Section 5) | $m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}$ (Section 5) | $m + \sqrt{\frac{m\gamma B^2}{\epsilon}}$ (Section 5) | $m + \sqrt{\frac{m\gamma}{\lambda}}$ (Section 5) |

Table 1: Upper and lower bounds on the number of grad-and-prox oracle accesses needed to find $\epsilon$-suboptimal solutions for each function class. These are exact up to constant factors except for the lower bounds for smooth and strongly convex functions, which hide extra logarithmic factors for deterministic and randomized algorithms. Here, $\Delta$ is the suboptimality of the point $0$, i.e. $\Delta = F(0) - F(x^*)$.

As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

On the power of gradient vs prox oracles

For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on $\epsilon$ from $1/\epsilon^2$ to $1/\epsilon$, or from $1/(\lambda\epsilon)$ to $1/\sqrt{\lambda\epsilon}$ for $\lambda$-strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access, using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization

We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in $m$ for any deterministic algorithm, but can be reduced to $\sqrt{m}$ (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm, not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform, but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).

Prior Lower Bounds

Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou [1] presented such a lower bound. However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.); we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan [9] shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, it must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG, nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and its potential variants are not covered by such lower bounds. We prove a similar lower bound to Lan's, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems, which were not considered by these previous attempts. Another recent observation [15] was that with access only to random component subgradients, without knowing the component's identity, an algorithm must make a substantially larger number of queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each component is held by a different machine and iterates lie in the span of certain permitted calculations [5]. Our construction is similar to theirs (though in a different setting), but their analysis considers neither scaling with $m$ (which has a different role in their setting) nor randomization.

Notation and Definitions

We use $\|\cdot\|$ to denote the standard Euclidean norm on $\mathbb{R}^d$. We say that a function $f$ is $L$-Lipschitz continuous on $\mathcal{X}$ if $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y \in \mathcal{X}$; $\gamma$-smooth on $\mathcal{X}$ if it is differentiable and its gradient is $\gamma$-Lipschitz on $\mathcal{X}$; and $\lambda$-strongly convex on $\mathcal{X}$ if $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$. We consider optimizing (1) under four combinations of assumptions: each component $f_i$ is either $L$-Lipschitz or $\gamma$-smooth, and either $F$ is $\lambda$-strongly convex or its domain is bounded, $\|x\| \le B$ for all $x \in \mathcal{X}$.

2 Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components $f_i$ are $\gamma$-smooth, yielding the upper bounds in the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.

We can obtain exact gradients of $F$ by computing all $m$ component gradients $\nabla f_i(x)$. Running accelerated gradient descent (AGD) [12] on $F$ using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

SAG [14], SVRG [8], and related methods use randomization to sample components, but also leverage the finite nature of the objective to control the variance of the gradient estimator used. Accelerating these methods using the Catalyst framework [10] ensures that for $\lambda$-strongly convex objectives an $\epsilon$-suboptimal solution is found after $\tilde{O}\big((m + \sqrt{m\gamma/\lambda})\log\frac{1}{\epsilon}\big)$ iterations, where $\tilde{O}$ hides an extraneous log-factor. Katyusha [3] is a more direct approach to accelerating SVRG which avoids extraneous log-factors, yielding the complexity indicated in Table 1.
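To ground the discussion, the following is a minimal sketch of basic (non-accelerated) SVRG on a toy least-squares finite sum. It is our own illustration rather than code from any of the cited papers, and the step size and epoch length are illustrative choices, not tuned recommendations.

```python
import numpy as np

def svrg(grad_i, x0, m, n_epochs=20, epoch_len=None, lr=0.05, rng=None):
    """Basic SVRG: take one full-gradient snapshot per epoch, then perform
    variance-reduced stochastic steps g_i(x) - g_i(x_snap) + full_grad(x_snap)."""
    rng = rng or np.random.default_rng(0)
    epoch_len = epoch_len or m
    x = x0.copy()
    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = np.mean([grad_i(i, x_snap) for i in range(m)], axis=0)
        for _ in range(epoch_len):
            i = rng.integers(m)
            x -= lr * (grad_i(i, x) - grad_i(i, x_snap) + full_grad)
    return x

# Toy problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2, so F is a least-squares objective.
rng = np.random.default_rng(1)
m, d = 50, 5
A, b = rng.normal(size=(m, d)), rng.normal(size=m)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
x_hat = svrg(grad_i, np.zeros(d), m)
print(np.linalg.norm(A @ x_hat - b))  # approaches the least-squares residual
```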

When $F$ is not strongly convex, adding a regularizer $\frac{\lambda}{2}\|x\|^2$ to the objective and instead optimizing the regularized problem with $\lambda = \Theta(\epsilon/B^2)$ results in an oracle complexity of $O\big((m + \sqrt{m\gamma B^2/\epsilon})\log\frac{1}{\epsilon}\big)$. The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan [4], which involves optimizing the regularized problem for progressively smaller values of $\lambda$, yielding the upper bound in the table.

Katyusha and Catalyst-accelerated SAG or SVRG use only gradients of the components. Accelerated SDCA [17] achieves a similar complexity using gradient and prox oracle access.

3 Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is -Lipschitz. This will be done by using the prox oracle to “smooth” each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply Katyusha [3] and accelerated SDCA [17] to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the $\beta$-Moreau envelope of a non-smooth function $f$, denoted $f^{(\beta)}$, defined as:

$$f^{(\beta)}(x) = \min_{u \in \mathcal{X}} \Big\{ f(u) + \frac{\beta}{2}\|x - u\|^2 \Big\} \tag{4}$$
Lemma 1 ([13, Lemma 2.2], [6, Proposition 12.29], following [11]).

Let $f$ be convex and $L$-Lipschitz continuous. For any $\beta > 0$,

  1. $f^{(\beta)}$ is $\beta$-smooth,
  2. $\nabla f^{(\beta)}(x) = \beta\big(x - \operatorname{prox}_f(x, \beta)\big)$, and
  3. $f^{(\beta)}(x) \le f(x) \le f^{(\beta)}(x) + \frac{L^2}{2\beta}$.

Consequently, we can consider the smoothed problem

$$\min_{x \in \mathcal{X}} F^{(\beta)}(x) = \frac{1}{m} \sum_{i=1}^{m} f_i^{(\beta)}(x). \tag{5}$$

While $F^{(\beta)}$ is not, in general, the $\beta$-Moreau envelope of $F$, it is $\beta$-smooth, we can calculate the gradient of its components using the oracle $h_F$, and $F^{(\beta)}(x) \le F(x) \le F^{(\beta)}(x) + \frac{L^2}{2\beta}$. Thus, to obtain an $\epsilon$-suboptimal solution to (1) using $h_F$, we set $\beta = \frac{L^2}{\epsilon}$ and apply any algorithm which can optimize (5) using gradients of the $\beta$-smooth components, to within $\frac{\epsilon}{2}$ accuracy. With the rates presented in Section 2, using AGD on (5) yields a complexity of $O\big(\frac{mLB}{\epsilon}\big)$ in the deterministic setting. When the functions are $\lambda$-strongly convex, smoothing with a fixed $\beta$ results in a spurious log-factor. To avoid this, we again apply the reduction of Allen-Zhu and Hazan [4], this time optimizing $F^{(\beta)}$ for increasingly large values of $\beta$. This leads to the upper bound of $O\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$ when used with AGD (see Appendix A for details).
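A minimal sketch of the resulting deterministic procedure (our illustration, with hypothetical helper names): each gradient of a smoothed component is obtained from one prox call via the identity $\nabla f^{(\beta)}(x) = \beta(x - \operatorname{prox}_f(x, \beta))$ from Lemma 1, and AGD is run on the average of these gradients. The toy components here are absolute losses, whose prox has a simple closed form; any prox oracle could be substituted.

```python
import numpy as np

def agd_on_smoothed(proxes, x0, beta, n_iters):
    """Nesterov's AGD on F_beta = (1/m) * sum_i f_i^(beta), where each gradient
    is grad f_i^(beta)(y) = beta * (y - prox_{f_i}(y, beta))  [Lemma 1]."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    lr = 1.0 / beta  # F_beta is beta-smooth, so 1/beta is a valid step size
    for _ in range(n_iters):
        grad = beta * np.mean([y - p(y, beta) for p in proxes], axis=0)
        x_next = y - lr * grad
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

# Toy components f_i(x) = |x - c_i|, whose prox moves x toward c_i by at most 1/beta.
def make_prox(c):
    def p(x, beta):
        step = np.minimum(np.abs(x - c), 1.0 / beta)
        return x - np.sign(x - c) * step
    return p

proxes = [make_prox(np.array([c])) for c in (0.0, 1.0, 5.0)]
x_hat = agd_on_smoothed(proxes, np.zeros(1), beta=100.0, n_iters=300)
print(x_hat)  # approaches the median of {0, 1, 5}, i.e. ~1, up to O(1/beta)
```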

Similarly, we can apply an accelerated randomized algorithm (such as Katyusha) to the smooth problem (5) to obtain complexities of $\tilde{O}\big(m + \frac{\sqrt{m}LB}{\epsilon}\big)$ and $\tilde{O}\big(m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$; this matches the presentation of Allen-Zhu [3] and is similar to that of Shalev-Shwartz and Zhang [17].

Finally, if $m \gtrsim \frac{L^2B^2}{\epsilon^2}$ or $m \gtrsim \frac{L^2}{\lambda\epsilon}$, stochastic gradient descent is a better randomized alternative, yielding complexities of $O\big(\frac{L^2B^2}{\epsilon^2}\big)$ or $O\big(\frac{L^2}{\lambda\epsilon}\big)$ respectively.

4 Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. We would like to show that for any deterministic optimization algorithm we can construct a "hard" function for which the algorithm cannot find an $\epsilon$-suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating its behavior. This can be viewed as a game in which an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet $(x, i, \beta)$ and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function that the algorithm is far from optimizing. The "hard" function is thus gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a "round" of queries as a series of queries in which $m$ distinct functions are queried. The first round begins with the first query and continues until exactly $m$ unique functions have been queried. The second round begins with the next query, and continues until exactly $m$ more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm's querying strategy.
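Since a round boundary depends only on the sequence of queried component indices, the definition is easy to state in code (a small sketch of our own, purely for concreteness):

```python
def split_into_rounds(query_indices, m):
    """Split a sequence of queried component indices into rounds: each round
    ends as soon as m distinct components have been queried within it."""
    rounds, current, seen = [], [], set()
    for i in query_indices:
        current.append(i)
        seen.add(i)
        if len(seen) == m:        # round complete
            rounds.append(current)
            current, seen = [], set()
    if current:
        rounds.append(current)    # trailing partial round
    return rounds

print(split_into_rounds([0, 0, 1, 1, 0, 2, 2, 1, 0], m=3))
# [[0, 0, 1, 1, 0, 2], [2, 1, 0]]
```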

4.1 Non-Smooth Components

We begin by presenting a lower bound for deterministic optimization of (1) when each component $f_i$ is convex and $L$-Lipschitz continuous, but $F$ is not necessarily strongly convex, on the domain $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$. Without loss of generality, we can consider $L = B = 1$ (the general case follows by rescaling). We will construct functions of the following form:

(6)

where the scalar parameters and the number of terms $k$ are fixed later, $\{v_1, \dots, v_k\}$ is an orthonormal set of vectors in $\mathbb{R}^d$ chosen according to the behavior of the algorithm such that $v_r$ is orthogonal to all points at which the algorithm queries before round $r$, and $\delta_{i,r} \in \{0, 1\}$ are indicators chosen so that $\delta_{i,r} = 1$ if the algorithm does not query component $i$ in round $r$ (and zero otherwise). To see how this is possible, consider the following truncations of (6):

(7)

During each round $r$, the adversary answers queries according to the truncations $f_i^r$ of (7), which depend only on the vectors $v_{r'}$ for $r' < r$, i.e. those from previous rounds. When the round is completed, $\delta_{i,r}$ is determined and $v_r$ is chosen to be orthogonal to the vectors $v_1, \dots, v_{r-1}$ as well as to every point queried by the algorithm so far, thus defining the truncations for the next round. In Appendix B.1 we prove that these responses based on the truncations are consistent with $F$.
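The adversary's key operation is producing a fresh unit vector orthogonal to all vectors chosen so far and to every query point seen so far. A minimal sketch (ours), using a QR factorization to project a random vector away from that span:

```python
import numpy as np

def orthogonal_unit_vector(previous, d, rng):
    """Return a unit vector orthogonal to all rows of `previous` (queries and
    already-chosen v's); requires d > len(previous). This is how the adversary
    can pick each v_r against a deterministic algorithm."""
    v = rng.normal(size=d)
    if len(previous) > 0:
        Q, _ = np.linalg.qr(np.asarray(previous).T)  # orthonormal basis of the span
        v -= Q @ (Q.T @ v)                           # remove the spanned component
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
queries = [rng.normal(size=10) for _ in range(4)]
v = orthogonal_unit_vector(queries, d=10, rng=rng)
print(max(abs(np.dot(v, q)) for q in queries))  # ~1e-16: exactly orthogonal
```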

The algorithm can only learn $v_r$ after it completes round $r$: until then, every iterate it produces is orthogonal to $v_r$ by construction. The average of these functions reaches its minimum only at a point with a non-zero inner product with every $v_r$, so we can view optimizing these functions as the task of discovering the vectors $v_r$; even if only the last one is missing, a suboptimality better than $\epsilon$ cannot be achieved. Therefore, the deterministic algorithm must complete at least $k$ rounds of optimization, each comprising at least $m$ queries to $h_F$, in order to optimize $F$. The key to this construction is that even though each term appears in many components, and hence has a strong effect on the average $F$, we can force a deterministic algorithm to make $m$ queries during each round before it finds the next relevant term. We obtain (for the complete proof see Appendix B.1):

Theorem 1.

For any $m$, any $L, B > 0$, any sufficiently small $\epsilon > 0$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $L$-Lipschitz continuous, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\frac{LB}{\epsilon}\big)$ queries to $h_F$.

Furthermore, we can always reduce optimizing a function over a bounded domain to optimizing a strongly convex function by adding the regularizer $\frac{\lambda}{2}\|x\|^2$ to each component, implying (see the complete proof in Appendix B.2):

Theorem 2.

For any $m$, any $L, \lambda > 0$, any sufficiently small $\epsilon > 0$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d$, and $m$ functions $f_i$ defined over $\mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\frac{L}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.

4.2 Smooth Components

When the components are required to be smooth, the lower bound construction is similar to (6), except that it is based on squared differences instead of absolute differences. We consider the functions:

(8)

where the vectors $v_r$ and indicators $\delta_{i,r}$ are as before. Again, we can answer queries at round $r$ based only on the vectors $v_{r'}$ for $r' < r$. This construction yields the following lower bounds (full details in Appendix B.3):

Theorem 3.

For any $m$, any $\gamma, B, \epsilon > 0$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $\gamma$-smooth, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\sqrt{\frac{\gamma B^2}{\epsilon}}\big)$ queries to $h_F$.

In the strongly convex case, we use a very similar construction, adding the term $\frac{\lambda}{2}\|x\|^2$, which gives the following bound (see Appendix B.4):

Theorem 4.

For any $m$, any $\gamma, \lambda, \epsilon > 0$ such that $\gamma/\lambda$ is sufficiently large, any $\Delta$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d$, and $m$ functions $f_i$ defined over $\mathbb{R}^d$, which are $\gamma$-smooth and $\lambda$-strongly convex and where $F(0) - F(x^*) \le \Delta$, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\sqrt{\gamma/\lambda}\big)$ queries to $h_F$.

5 Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set $v_r$ and $\delta_{i,r}$ based on the predictable behavior of the algorithm. This is impossible for randomized algorithms: we must choose the "hard" function before we know the random choices the algorithm will make, so the function must be "hard" more generally than before.

Previously, we chose vectors $v_r$ orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors randomly in a high dimensional space, they will be nearly orthogonal to the queries with high probability. Slightly modifying the absolute or squared difference from before makes near-orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.
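The concentration fact behind this step is easy to check empirically: independent random unit vectors in $\mathbb{R}^d$ have inner products of order $1/\sqrt{d}$, so in high dimension they are nearly orthogonal. A quick numerical check (ours):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 1_000, 100_000]:
    u = rng.normal(size=d); u /= np.linalg.norm(u)
    w = rng.normal(size=d); w /= np.linalg.norm(w)
    print(d, abs(u @ w))  # shrinks like 1/sqrt(d): near-orthogonal in high dimension
```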

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of the indicators $\delta_{i,r}$. In the deterministic setting, if a term revealing a new $v_r$ appeared in half of the components, we could ensure that the algorithm must make $m/2$ queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on $m$ in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need $\Omega(m)$ queries in expectation to find it, but that term's effect on the suboptimality of $F$ would be scaled down by a factor of $m$, again eliminating the dependence on $m$.

To establish a lower bound for randomized algorithms we must take a new approach. We define $m/2$ pairs of functions which operate on orthogonal subspaces of $\mathbb{R}^d$. Each pair of functions resembles the construction from the previous section, but since there are many such pairs, the algorithm must solve $m/2$ separate optimization problems in order to optimize $F$.

5.1 Lipschitz Continuous Components

First consider the non-smooth, non-strongly-convex setting, and assume for simplicity that $m$ is even (otherwise we simply let the last function be zero). We define a helper function, which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

(9)
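The display above is not reproduced in this version of the text. Purely as an orienting assumption on our part (consistent with the "insensitive differences" described in the conclusion, but not necessarily the paper's exact definition), a natural candidate is the $c$-insensitive absolute value

$$|z|_c = \max\big(0, |z| - c\big),$$

which is $1$-Lipschitz, agrees with $|z|$ up to an additive error of $c$, and is completely flat on $[-c, c]$, so inner products of magnitude at most $c$ with not-yet-discovered vectors reveal nothing.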

Next, we define $m/2$ pairs of functions, indexed by $i = 1, \dots, m/2$:

(10)

where the $v_{i,r}$ are random orthonormal vectors. With the insensitivity parameter sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors by alternately querying $f_{2i-1}$ and $f_{2i}$; so revealing all of a pair's vectors requires a number of queries proportional to the number of vectors in the pair. Until the last of a pair's vectors is revealed, an iterate remains suboptimal on that pair's subspace. From here, we show that an $\epsilon$-suboptimal solution to $F$ can be found only after sufficiently many queries are made to a constant fraction of the pairs, and balancing the parameters yields a total of $\Omega\big(\frac{\sqrt{m}LB}{\epsilon}\big)$ queries. The remaining $\Omega(m)$ term of the lower bound follows trivially, since every pair must be queried at least once (proofs in Appendix C.1):

Theorem 5.

For any $m$, any $L, B > 0$, any sufficiently small $\epsilon > 0$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $L$-Lipschitz continuous, such that to find a point $\hat{x}$ for which $\mathbb{E}\big[F(\hat{x}) - F(x^*)\big] < \epsilon$, $A$ must make $\Omega\big(m + \frac{\sqrt{m}LB}{\epsilon}\big)$ queries to $h_F$.

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

Theorem 6.

For any $m$, any $L, \lambda > 0$, any sufficiently small $\epsilon > 0$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d$, and $m$ functions $f_i$ defined over $\mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $\mathbb{E}\big[F(\hat{x}) - F(x^*)\big] < \epsilon$, $A$ must make $\Omega\big(m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm’s queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When $\epsilon$ is large relative to the other problem parameters, in either the non-strongly convex or the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter $p$, which takes one of two nearby values. Then, for each $i$, in the non-strongly convex case we make $f_i$ one of two fixed functions, with probability $p$ or $1 - p$ respectively. Optimizing $F$ to within $\epsilon$ accuracy then implies recovering the bias of this Bernoulli random variable, which requires $\Omega(m)$ queries based on a standard information theoretic result [2, 19]. An analogous choice of parameters gives a lower bound in the $\lambda$-strongly convex setting. This is formalized in Appendix C.5.
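The standard fact invoked here, stated generically (the paper's exact parameters are in Appendix C.5, which is not reproduced here), is that distinguishing two coins requires quadratically many flips in the inverse bias gap:

$$\text{distinguishing } \mathrm{Bernoulli}\big(\tfrac{1}{2} - \delta\big) \text{ from } \mathrm{Bernoulli}\big(\tfrac{1}{2} + \delta\big) \text{ with constant probability requires } \Omega\big(\delta^{-2}\big) \text{ samples.}$$

Each query to a randomly assigned component plays the role of one coin flip, so a bias gap on the order of $1/\sqrt{m}$ would force $\Omega(m)$ queries.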

5.2 Smooth Components

When the functions are smooth and not strongly convex, we define another helper function:

(11)

and the following pairs of functions, indexed by $i = 1, \dots, m/2$:

(12)

with the random orthonormal vectors $v_{i,r}$ as before. The same arguments apply, after replacing the absolute difference with the squared difference. A separate argument is required in this case for the $\Omega(m)$ term in the bound, which we show using a construction involving simple linear functions (see Appendix C.3).

Theorem 7.

For any $m$, any $\gamma, B, \epsilon > 0$, and any randomized algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d$ and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $\gamma$-smooth, such that to find a point $\hat{x}$ for which $\mathbb{E}\big[F(\hat{x}) - F(x^*)\big] < \epsilon$, $A$ must make $\Omega\big(m + \sqrt{\frac{m\gamma B^2}{\epsilon}}\big)$ queries to $h_F$.

In the strongly convex case, we add the regularizing term $\frac{\lambda}{2}\|x\|^2$ to each function in the pair (see Appendix C.4) to obtain:

Theorem 8.

For any $m$, any $\gamma, \lambda, \epsilon > 0$ such that $\gamma/\lambda$ is sufficiently large, any $\Delta$, and any randomized algorithm $A$, there exists a dimension $d$, a domain $\mathcal{X} \subseteq \mathbb{R}^d$, and $m$ functions $f_i$ defined on $\mathcal{X}$ which are $\gamma$-smooth and $\lambda$-strongly convex, such that $F(0) - F(x^*) \le \Delta$ and such that in order to find a point $\hat{x}$ for which $\mathbb{E}\big[F(\hat{x}) - F(x^*)\big] < \epsilon$, $A$ must make $\Omega\big(m + \sqrt{m\gamma/\lambda}\big)$ queries to $h_F$.

Remark:

We consider (1) as a constrained optimization problem, thus the minimizer of $F$ could be achieved on the boundary of $\mathcal{X}$, meaning that the gradient need not vanish there. If we make the additional assumption that the minimizer of $F$ lies in the interior of $\mathcal{X}$ (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on $\mathcal{X}$, the point $0$ is already within a bounded suboptimality determined by the problem parameters, and only values of $\epsilon$ in a bounded range result in a non-trivial lower bound (see the Remarks at the end of Appendices B.3 and C.3).

6 Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting [1, 9]. But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and Katyusha are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given “full” access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.

Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on $m$ and reduce it to the $\sqrt{m}$ dependence of the randomized algorithms.

The deterministic algorithm for non-smooth problems that we present in Section 3 is also of interest in its own right. Avoiding randomization is not usually important in itself, but this algorithm, unlike the optimal stochastic methods, is fully parallelizable. Consider, for example, a supervised learning problem where $f_i$ is the (non-smooth) loss on a single training example, and the data is distributed across machines. Calculating a prox oracle involves applying the Fenchel conjugate of the loss function, and even if a closed form is not available, this is often easy to compute numerically, and it is used in algorithms such as SDCA. But unlike SDCA, which is inherently sequential, we can calculate all prox operations in parallel on the different machines, average the resulting gradients of the smoothed function, and take an accelerated gradient step to implement our optimal deterministic algorithm. This method attains a recent lower bound for distributed optimization, resolving a question raised by Arjevani and Shamir [5], and when the number of machines is very large it improves over all other known distributed optimization methods for the problem.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle rather than the algorithm). Even for the classic problem of minimizing a single smooth function using a first order oracle, we could not locate a published lower bound proof that applies to randomized algorithms. We provide a simple construction using "insensitive" differences that allows us to easily obtain such lower bounds without assuming that the iterates are spanned by previous responses (as was done, e.g., in [9]), and it could potentially be useful for establishing randomized lower bounds in other settings as well.

Acknowledgements:

We thank Ohad Shamir for his helpful discussions and for pointing out [4].

References

Appendix A Upper bounds for non-smooth sums

Consider the case where the components $f_i$ are not strongly convex. As shown in Lemma 1, we can use a single call to the prox oracle to obtain the gradient of $f_i^{(\beta)}$, which is a $\beta$-smooth approximation to $f_i$. We then consider the new optimization problem:

$$\min_{x \in \mathcal{X}} F^{(\beta)}(x) = \frac{1}{m} \sum_{i=1}^{m} f_i^{(\beta)}(x) \tag{13}$$

Also by Lemma 1, setting $\beta = \frac{L^2}{\epsilon}$ ensures that $f_i^{(\beta)}(x) \le f_i(x) \le f_i^{(\beta)}(x) + \frac{\epsilon}{2}$ for all $x$. Consequently, any point which is $\frac{\epsilon}{2}$-suboptimal for $F^{(\beta)}$ will be $\epsilon$-suboptimal for $F$. This technique therefore reduces the task of optimizing an instance of an $L$-Lipschitz finite sum to that of optimizing an $\frac{L^2}{\epsilon}$-smooth finite sum.

Solving (13) to $\frac{\epsilon}{2}$-suboptimality using AGD requires $O\big(\frac{LB}{\epsilon}\big)$ gradients of $F^{(\beta)}$, each of which requires $m$ prox oracle accesses to $h_F$. Formally:
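For completeness, the arithmetic behind this count (a standard calculation from the textbook AGD rate; our own presentation):

$$\underbrace{O\Big(\sqrt{\beta B^2 / \epsilon}\Big)}_{\text{AGD iterations for a } \beta\text{-smooth objective}} \times \underbrace{m}_{\text{prox calls per full gradient}} \;=\; O\Big(m\sqrt{\tfrac{L^2 B^2}{\epsilon^2}}\Big) \;=\; O\Big(\tfrac{mLB}{\epsilon}\Big) \quad \text{at } \beta = \tfrac{L^2}{\epsilon}.$$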

Theorem 9.

For any $L, B, \epsilon > 0$, and any functions $f_i$ which are convex and $L$-Lipschitz continuous over the domain $\mathcal{X} = \{x : \|x\| \le B\}$, applying AGD to (13) with $\beta = \frac{L^2}{\epsilon}$ will result in a point $\hat{x}$ such that $F(\hat{x}) - F(x^*) \le \epsilon$ after $O\big(\frac{mLB}{\epsilon}\big)$ queries to $h_F$.

When the component functions are $\lambda$-strongly convex, a more sophisticated strategy is required to avoid an extra $\log\frac{1}{\epsilon}$ factor. The solution is the AdaptSmooth algorithm [4]. This involves solving a sequence of smooth and strongly convex subproblems, where the $k$-th subproblem is reducing the suboptimality of the $\beta_k$-smooth and $\lambda$-strongly convex function $F^{(\beta_k)}$ by a factor of four, with the smoothing parameter $\beta_k$ increasing over the subproblems and the initial suboptimality controlled by an a priori upper bound. Using this method results in an $\epsilon$-suboptimal solution for $F$ after $O\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.
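To see why the log-factor disappears, suppose (purely for illustration on our part; the precise schedule is specified in [4]) that the smoothing parameters grow geometrically, $\beta_k \propto 2^k$, up to a final value $\beta_K = L^2/\epsilon$. Each subproblem costs $O\big(m\sqrt{\beta_k/\lambda}\big)$ oracle accesses, and the geometric sum is dominated by its last term:

$$\sum_{k=0}^{K} O\Big(m\sqrt{\beta_k/\lambda}\Big) \;=\; O\Big(m\sqrt{\beta_K/\lambda}\Big) \;=\; O\Big(\tfrac{mL}{\sqrt{\lambda\epsilon}}\Big),$$

so no $\log\frac{1}{\epsilon}$ factor multiplies the total.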


Theorem 10.

For any $L, \lambda, \epsilon > 0$, and any functions $f_i$ which are $L$-Lipschitz continuous and $\lambda$-strongly convex on the domain $\mathcal{X}$, applying AdaptSmooth with AGD will find a point $\hat{x}$ such that $F(\hat{x}) - F(x^*) \le \epsilon$ after $O\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.

To conclude our presentation of upper bounds, we emphasize that the smoothing methods described in this section will only improve the oracle complexity when used with accelerated methods. For example, using non-accelerated gradient descent on $F^{(\beta)}$ in the not strongly convex case leads to an oracle complexity of $O\big(\frac{mL^2B^2}{\epsilon^2}\big)$, which is no better than applying subgradient descent directly to $F$.
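The claim follows from the same substitution as before (standard rates; our arithmetic):

$$\underbrace{O\Big(\beta B^2/\epsilon\Big)}_{\text{GD iterations for a } \beta\text{-smooth objective}} \times \; m \;=\; O\Big(\tfrac{mL^2B^2}{\epsilon^2}\Big) \quad \text{at } \beta = \tfrac{L^2}{\epsilon},$$

which equals the total cost of running subgradient descent directly on $F$ for $O\big(L^2B^2/\epsilon^2\big)$ iterations at $m$ component subgradient evaluations each.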

Appendix B Lower bounds for deterministic algorithms

B.1 Non-smooth and not strongly convex components

See Theorem 1.

Proof.

Without loss of generality, we can assume $L = B = 1$ (the general case follows by rescaling). For particular parameter values to be decided upon later, we use the functions (6). It is straightforward to confirm that each $f_i$ is both $1$-Lipschitz and convex (for orthonormal vectors $v_r$ and indicators $\delta_{i,r} \in \{0,1\}$). As explained in the main text, the orthonormal vectors and indicators are chosen according to the behavior of the algorithm $A$. At the end of each round $r$, we set $\delta_{i,r} = 1$ iff the algorithm did not query function $i$ during round $r$ (and zero otherwise), and we set $v_r$ to be orthogonal to the vectors $v_1, \dots, v_{r-1}$ as well as to every query made by the algorithm so far. Orthogonalizing the vectors in this way is possible as long as the dimension $d$ is at least as large as the number of oracle queries $A$ has made so far plus the number of vectors. We are allowed to construct $v_r$ and $\delta_{i,r}$ in this way as long as the algorithm's execution up until round $r$, and thus our choice of $v_r$ and $\delta_{i,r}$, depends only on $v_{r'}$ and $\delta_{i,r'}$ for $r' < r$. We can enforce this condition by answering the queries during round $r$ according to the truncations $f_i^r$ of (7).

For non-smooth functions, the subgradient oracle is not uniquely defined: many different subgradients might be valid responses. However, in order to say that an algorithm successfully optimizes a function, it must be able to do so no matter which subgradient it receives. Conversely, to show a lower bound, it is sufficient to show that for some valid subgradient the algorithm fails. And so, in constructing a "hard" instance to optimize, we are actually constructing both a function and a subgradient oracle for it, with specific subgradient responses. Therefore, answering the algorithm's queries during round $r$ according to the truncation is valid so long as the subgradient we return is a valid subgradient for $f_i$ (the converse need not be true) and the prox returned is exactly the prox of $f_i$. For now, assume that this query-answering strategy is consistent (we will prove this last).

Suppose $\hat{x}$ is an iterate generated both before $A$ completes round $k$ and before it makes enough queries to exhaust the dimension needed to orthogonalize each $v_r$ as described above. Then $\langle \hat{x}, v_k \rangle = 0$ by construction. Since $m$ distinct functions are queried during each round, this allows us to lower bound the suboptimality of $\hat{x}$: each term in (6) is non-negative, and with the construction's parameters chosen appropriately, $F$ achieves its minimum on $\mathcal{X}$ while any point orthogonal to $v_k$ remains more than $\epsilon$-suboptimal. Therefore, $A$ must either make more queries than the dimension accommodates or complete $k$ rounds in order to reach an $\epsilon$-suboptimal solution. Completing each round requires at least $m$ queries to $h_F$, and with $k$ on the order of $LB/\epsilon$ (after undoing the normalization $L = B = 1$), this implies a lower bound of $\Omega\big(\frac{mLB}{\epsilon}\big)$ queries.

To complete the proof, it remains to show that the subgradients and proxs of the truncations $f_i^r$ are consistent with those of $f_i$ at every query. Since every function operates on the low-dimensional subspace of $\mathbb{R}^d$ spanned by the vectors $v_r$, it will be convenient to decompose vectors into two components, $x = \tilde{x} + x^{\perp}$, where $\tilde{x}$ is the projection of $x$ onto that subspace and $x^{\perp}$ is the orthogonal remainder. Note that $\|x\|^2 = \|\tilde{x}\|^2 + \|x^{\perp}\|^2$.

Lemma 2.

For any $r$ and any $x$ such that $\langle x, v_{r'} \rangle = 0$ for all $r' \ge r$, if function $i$ is queried during round $r$, then $\partial f_i^r(x) = \partial f_i(x)$.

Proof.

All subgradients of $f_i$ at $x$ are sums of subgradients of the individual terms appearing in (6). Since function $i$ is queried during round $r$, we have $\delta_{i,r} = 0$, and since $\langle x, v_{r'} \rangle = 0$ for all $r' \ge r$, the terms from rounds beyond $r$ contribute the same subgradients to $f_i$ as they do to the truncation $f_i^r$. Hence $\partial f_i(x)$ contains all subgradients of the truncated function, which is exactly $\partial f_i^r(x)$. ∎

Lemma 3.

For any $r$ and any $x$, $\beta$ such that $\langle x, v_{r'} \rangle = 0$ for all $r' \ge r$, if function $i$ is queried during round $r$ then $\operatorname{prox}_{f_i}(x, \beta) = \operatorname{prox}_{f_i^r}(x, \beta)$.

Proof.

Consider the definition of the prox oracle from equation (3).

Next, we further decompose where