Robust stochastic optimization with the proximal point method

07/31/2019 ∙ by Damek Davis, et al. ∙ 0

Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. In this work, we show that a wide class of such algorithms on strongly convex problems can be augmented with sub-exponential confidence bounds at an overhead cost that is only polylogarithmic in the condition number and the confidence level. We discuss consequences both for streaming and offline algorithms.

1 Introduction

Stochastic convex optimization lies at the core of modern statistical and machine learning. Standard results in the subject bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rarer, and typically either rely on “light-tails” assumptions or exhibit worse sample complexity. To address this issue, we show that a wide class of stochastic algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. We discuss consequences both for streaming and offline algorithms. The procedure we propose, called proxBoost, is elementary and combines two well-known ingredients: robust distance estimation and the proximal point method.

To illustrate the proposed procedure, consider the optimization problem

$$\min_{x\in\mathbb{R}^d} f(x),$$

where $f\colon\mathbb{R}^d\to\mathbb{R}$ is a $\mu$-strongly convex function with $L$-Lipschitz continuous gradient. We will later consider the more general class of convex composite problems. We aim to develop generic procedures that equip stochastic algorithms with high confidence guarantees. Consequently, it will be convenient to treat such algorithms as black boxes. More formally, suppose that the function $f$ may only be accessed through a minimization oracle $\mathcal{M}(f,\epsilon)$, which on input $\epsilon>0$, returns a point $x_\epsilon$ satisfying the low confidence bound

$$\mathbb{P}\left[f(x_\epsilon)-\min f\le\epsilon\right]\ge\frac{2}{3}. \tag{1.1}$$

By Markov’s inequality, minimization oracles arise from any algorithm that can generate a point $x_\epsilon$ satisfying $\mathbb{E}\left[f(x_\epsilon)-\min f\right]\le\epsilon/3$. For example, oracles for minimizing an expectation may be constructed from streaming algorithms or from offline empirical risk minimization methods.

The procedure introduced in this paper executes a minimization oracle multiple times in order to boost its confidence. To quantify this overhead, let $\mathcal{C}_{\mathcal{M}}(f,\epsilon)$ denote the cost of the oracle call $\mathcal{M}(f,\epsilon)$. It is natural to assume that the cost is decreasing in $\epsilon$ and increasing in the condition number $\kappa := L/\mu$. The cost may also depend on other parameters, such as initialization quality and bounds on optimal value, but we ignore these for the moment. Given a minimization oracle and its cost, we investigate the following question:

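Is there a procedure within this oracle model of computation that returns a point $x$ satisfying the high confidence bound

$$\mathbb{P}\left[f(x)-\min f\le\epsilon\right]\ge 1-p \tag{1.2}$$

at a total cost that is only a “small” multiple of $\mathcal{C}_{\mathcal{M}}(f,\epsilon)$?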

We will see that the answer is yes, with the total cost on the order of

Thus, high probability bounds are achieved with a small cost increase, which depends only logarithmically on the inverse failure probability $1/p$ and polylogarithmically on the condition number $\kappa$.

Before introducing our approach, we discuss two techniques for boosting the confidence of a minimization oracle, both of which have limitations. As a first approach, one may query the oracle multiple times and pick the “best” iterate from the batch. This approach is flawed since often one cannot test which iterate is “best” without increasing sample complexity. To illustrate, consider estimating the expectation $f(x)=\mathbb{E}_z f(x,z)$ to $\epsilon$-accuracy for a fixed point $x$. This task amounts to approximate mean estimation, which may require on the order of $1/\epsilon^2$ samples, even under sub-Gaussian assumptions [8]. In this paper, the cost $\mathcal{C}_{\mathcal{M}}(f,\epsilon)$ will typically scale at worst as $1/\epsilon$, and therefore mean estimation would significantly degrade the overall sample complexity.

As the second approach, strong convexity immediately implies the distance estimate

$$\mathbb{P}\left[\|x_\epsilon-\bar x\|\le\sqrt{2\epsilon/\mu}\right]\ge\frac{2}{3},$$

where $\bar x$ is the minimizer of $f$. Given this bound, one may apply the robust distance estimation technique of [35, p. 243] and [19] to choose a point near $\bar x$: Run $m$ trials of $\mathcal{M}(f,\epsilon)$ and find one iterate $x_{i^*}$ around which the other points “cluster”. Then the point $x_{i^*}$ will be within a distance of $3\sqrt{2\epsilon/\mu}$ from $\bar x$ with probability $1-\exp(-m/18)$. The downside of this strategy is that when converting naively back to function values, the suboptimality gap becomes $f(x_{i^*})-\min f\le\frac{L}{2}\|x_{i^*}-\bar x\|^2\le 9\kappa\epsilon$. Thus the function gap at $x_{i^*}$ may be significantly larger than the expected function gap at $x_\epsilon$, by a factor of the condition number. Therefore, robust distance estimation exhibits a trade-off between robustness and efficiency.
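The loss of a factor of the condition number can be checked in one line. The derivation below is a sketch under assumed notation: $\mu$ and $L$ denote the strong convexity and smoothness constants of $f$, $\kappa=L/\mu$, $\bar x$ is the minimizer, $\hat x$ is the point selected by the clustering step, and $c\ge 1$ is a generic constant in the distance guarantee.

```latex
\begin{align*}
\|\hat x-\bar x\|\le c\sqrt{\frac{2\epsilon}{\mu}}
\quad\Longrightarrow\quad
f(\hat x)-f(\bar x)
\;\le\;\frac{L}{2}\,\|\hat x-\bar x\|^2
\;\le\;\frac{L}{2}\cdot c^2\cdot\frac{2\epsilon}{\mu}
\;=\;c^2\,\kappa\,\epsilon .
\end{align*}
```

The first inequality on the right is the standard smoothness upper bound at a minimizer. For $\kappa=1$ only the constant $c^2$ is lost, whereas for badly conditioned problems the guarantee degrades linearly in $\kappa$.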

The trade-off between robustness and efficiency disappears for perfectly conditioned losses. Therefore, it appears plausible that one might avoid the factor $\kappa$ through an iterative algorithm that solves a sequence of nearby, better conditioned problems. This is the strategy we explore here. The proxBoost procedure embeds the robust distance estimation technique inside a proximal point method. The algorithm begins by declaring the initial point $x_0$ to be the output of the robust distance estimator on $f$. Then the better conditioned function

$$f^i(x) := f(x)+\frac{\mu 2^i}{2}\|x-x_i\|^2$$

is formed and the next iterate $x_{i+1}$ is declared to be the output of the robust distance estimator on $f^i$. The procedure is effective since the conditioning of $f^i$ rapidly improves with $i$, which makes the robust distance estimator more efficient as the counter $i$ grows.

The proxBoost procedure can be applied to a wide class of stochastic minimization oracles, for example, streaming or empirical risk minimization (ERM) algorithms. For these problems, the loss takes the form

(1.3)

where the population data follows a fixed unknown distribution and the loss is convex and smooth for a.e. . The cost of streaming or ERM oracles is then measured by the number of samples drawn from . We now illustrate the consequences of proxBoost for these oracles.

1.1 Streaming Oracles

Stochastic gradient methods can be treated as minimization oracles with cost that is measured by the number of stochastic gradient estimates needed to reach functional accuracy $\epsilon$ in expectation. An algorithm with minimal such cost was proposed by Ghadimi and Lan [16]. It generates a point satisfying with

(1.4)

stochastic gradient evaluations, where the quantity

is an upper bound on the variance of the stochastic gradient estimator

and is a known upper bound on the initial function gap . A simpler algorithm with a similar efficiency estimate was recently presented by Kulunchakov and Mairal [24], and was based on estimate sequences. Aybat et al. [4] present an algorithm with similar efficiency, but in contrast to previous work, it does not require the variance and the initial gap as inputs.

It is intriguing to ask if one can equip the stochastic gradient method and its accelerated variant with high confidence guarantees. In their original work [16, 15], Ghadimi and Lan provide an affirmative answer under the additional assumption that the stochastic gradient estimator has light tails. The very recent preprint of Juditsky-Nazin-Nemirovsky-Tsybakov [22] shows that one can avoid the light tail assumption for the basic stochastic gradient method, and for mirror descent more generally, by truncating the gradient estimators. High confidence bounds for the accelerated method, without light tail assumptions, remain open.

In this work, the optimal method of [16] will be used as a minimization oracle within proxBoost, allowing us to nearly match the efficiency estimate (1.4) without “light-tail” assumptions. Equipped with this oracle, proxBoost returns a point satisfying

and the overall cost of the procedure is

Here, only suppresses logarithmic dependencies in ; see Section 5 for details. Thus for small , the sample complexity of the robust procedure is roughly times the efficiency estimate (1.4) of the low-confidence algorithm. In this paper, we also provide similar accelerated guarantees for additive convex composite problems, by using the routine of [22] in the last step of proxBoost.

1.2 Empirical Risk Minimization Oracles

An alternative approach to streaming algorithms, such as the stochastic gradient method, is based on empirical risk minimization (ERM). Namely, we may draw i.i.d. samples and minimize the empirical average

(1.5)

A key question is to determine the number of samples that would ensure that the minimizer of the empirical risk has low generalization error , with reasonably high probability. There is a vast literature on this subject; see for example [19, 5, 39, 38]. We build here on the work of Hsu-Sabato [19], who focused on high confidence guarantees for nonnegative losses . They showed that the empirical risk minimizer yields a robust distance estimator of the true minimizer of , by the aforementioned resampling technique. As a consequence they deduced that the ERM learning rule can find a point satisfying the relative error guarantee

with the sample complexity on the order of

Loosely speaking, here and are the condition numbers of and , respectively. By embedding empirical risk minimization within proxBoost, we obtain an algorithm with the much better sample complexity

where the symbol only suppresses polylogarithmic dependence on and .

Related literature

Our paper rests on two pillars: the proximal point method and robust distance estimation. Both techniques have been well studied in the optimization and statistics literature. The proximal point method was introduced by Martinet [31, 30] and further popularized by Rockafellar [37]. The construction is also closely related to the smoothing function of Moreau [33]. Recently, there has been a renewed interest in the proximal point method, most notably due to its uses in accelerating variance reduced methods for minimizing finite sums of convex functions [27, 13, 26, 40]. The proximal point method has also featured prominently as a guiding principle in nonconvex optimization, with the works of [3, 2, 12, 10, 11]. The stepsize schedule we use within the proximal point method is geometrically decaying, in contrast to the more conventional polynomially decaying schemes. Geometrically decaying schedules for subgradient methods were first used by Goffin [18] and have regained some attention recently due to their close connection to the popular step-decay schedule in stochastic optimization [14, 4, 41, 42].

Robust distance estimation has a long history. The estimator we use was first introduced in [35, p. 243], and can be viewed as a multivariate generalization of the median of means estimator [1, 20]. Robust distance estimation was further investigated in [19] with a focus on high probability guarantees for empirical risk minimization. A different generalization based on the geometric median was studied in [32]. Other recent articles related to the subject include median of means tournaments [28], robust multivariate mean estimators [21, 29], and bandits with heavy tails [7].

One of the main applications of our techniques is to streaming algorithms. Most currently available results that establish high confidence convergence guarantees make sub-Gaussian assumptions on the stochastic gradient estimator [34, 23, 15, 17]. More recently there has been renewed interest in obtaining robust guarantees without the light-tails assumption. For example, the two works [9, 43] make use of the geometric median of means technique to robustly estimate the gradient in distributed optimization. A different technique was recently developed by Juditsky et al. [22], where the authors establish high confidence guarantees for mirror descent type algorithms by truncating the gradient.


The outline of the paper is as follows. Section 2 presents the problem setting. Section 3 develops the proxBoost procedure. Section 4 presents consequences for empirical risk minimization, while Section 5 discusses consequences for streaming algorithms.

2 Problem setting

Throughout, we follow standard notation of convex optimization, as set out for example in the monographs [36, 6]. We let $\mathbb{R}^d$ denote a Euclidean space with inner product $\langle\cdot,\cdot\rangle$ and the induced norm $\|x\|=\sqrt{\langle x,x\rangle}$. The symbol $B_\epsilon(x)$ will stand for the closed ball around $x$ of radius $\epsilon>0$. We will use the shorthand interval notation $[1,m]:=\{1,\ldots,m\}$ for any number $m\in\mathbb{N}$.

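Consider a function $f\colon\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$. The effective domain of $f$, denoted $\operatorname{dom} f$, consists of all points where $f$ is finite. The function $f$ is called $\mu$-strongly convex if the perturbed function $f-\frac{\mu}{2}\|\cdot\|^2$ is convex. We say that $f$ is $L$-smooth if it is differentiable with $L$-Lipschitz continuous gradient. If $f$ is both $\mu$-strongly convex and $L$-smooth, then the two-sided bound holds:

$$\frac{\mu}{2}\|x-\bar x\|^2\le f(x)-f(\bar x)\le\frac{L}{2}\|x-\bar x\|^2\qquad\text{for all } x, \tag{2.1}$$

where $\bar x$ is the minimizer of $f$. We then define the condition number of $f$ to be $\kappa:=L/\mu$.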

Assumption 2.1.

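Throughout this work, we consider the optimization problem

$$\min_{x\in\mathbb{R}^d} f(x), \tag{2.2}$$

where the function $f\colon\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is $\mu$-strongly convex. We denote the minimizer of $f$ by $\bar x$ and its minimal value by $f^*:=\min f$.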

Let us suppose for the moment that the only access to $f$ is by querying a black-box procedure that estimates $\bar x$. Namely, following [19], we will call a procedure $\mathcal{D}(\epsilon)$ a weak distance oracle for the problem (2.2) if it returns a point $x$ satisfying

$$\mathbb{P}\left[\|x-\bar x\|\le\epsilon\right]\ge\frac{2}{3}. \tag{2.3}$$

We will moreover assume that when querying $\mathcal{D}(\epsilon)$ multiple times, the returned vectors are all statistically independent. Weak distance oracles arise naturally in stochastic optimization both in streaming and offline settings. We will discuss specific examples in Sections 4 and 5. The numerical value $2/3$ has no real significance and can be replaced by any fraction greater than a half.

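It is well known from [35, p. 243] and [19] that the low-confidence estimate (2.3) can be improved to a high confidence guarantee by a resampling trick. Following [19], we define the robust distance estimator $\mathcal{D}(\epsilon,m)$ to be the following procedure.

Input: trial count $m$, access to a weak distance oracle $\mathcal{D}(\epsilon)$.
Query the oracle $\mathcal{D}(\epsilon)$ $m$ times and let $Y=\{y_1,\ldots,y_m\}$ consist of the responses.
Step $i=1,\ldots,m$:
       Compute $r_i=\min\{r\ge 0 : |B_r(y_i)\cap Y|>m/2\}$.
Set $i^*=\operatorname{argmin}_{i\in[1,m]} r_i$.
Return $y_{i^*}$.
Algorithm 1 Robust Distance Estimation $\mathcal{D}(\epsilon,m)$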

Thus the robust distance estimator $\mathcal{D}(\epsilon,m)$ first generates $m$ statistically independent random points $y_1,\ldots,y_m$ by querying the weak distance oracle $\mathcal{D}(\epsilon)$ $m$ times. Then the procedure computes the smallest radius ball around each point $y_i$ that contains more than half of the generated points $Y$. Finally, the point $y_{i^*}$ corresponding to the smallest such ball is returned. The intuition behind this procedure is that by Chernoff’s bound, with high probability, the ball $B_\epsilon(\bar x)$ will contain more than half of the generated points. Therefore in this event, the estimate $r_{i^*}\le 2\epsilon$ holds. Moreover, since the two sets $B_\epsilon(\bar x)$ and $B_{r_{i^*}}(y_{i^*})$ intersect, it follows that $\bar x$ and $y_{i^*}$ are within a distance of $3\epsilon$ of each other. For a complete argument, see [19, Propositions 8,9].
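The selection rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `robust_distance_estimate` and the synthetic responses are hypothetical, and the weak oracle is replaced by a precomputed array in which most points lie near a target and a minority are gross outliers.

```python
import numpy as np

def robust_distance_estimate(points):
    """Return the response whose smallest ball containing more than half
    of all responses has the least radius, together with that radius."""
    pts = np.asarray(points, dtype=float)
    m = pts.shape[0]
    # Pairwise Euclidean distances between all responses.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # For each point, the sorted distance at index m // 2 is the smallest
    # radius whose ball holds floor(m/2) + 1 > m/2 points (the point itself included).
    radii = np.sort(dists, axis=1)[:, m // 2]
    best = int(np.argmin(radii))
    return pts[best], radii[best]

# Hypothetical data: 7 responses near the target, 4 far-away outliers.
rng = np.random.default_rng(0)
target = np.zeros(2)
good = target + 0.1 * rng.uniform(-1.0, 1.0, size=(7, 2))
bad = target + 50.0 + rng.normal(size=(4, 2))
responses = np.vstack([good, bad])

estimate, radius = robust_distance_estimate(responses)
```

Because more than half of the responses cluster near the target, the selected point comes from that cluster even though over a third of the responses are arbitrarily bad. The quadratic cost in the number of trials is harmless here, since the trial count only needs to grow logarithmically with the inverse failure probability.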

Lemma 2.2 (Robust Distance Estimator).

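The point $x$ returned by $\mathcal{D}(\epsilon,m)$ satisfies

$$\mathbb{P}\left[\|x-\bar x\|\le 3\epsilon\right]\ge 1-\exp\left(-\frac{m}{18}\right).$$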

We seek to understand how one may use a robust distance estimator to compute a point satisfying with high probability, where is a specified accuracy. As motivation, consider the case when is -smooth. Then one immediate approach is to appeal to the upper bound in (2.1). Hence the point , with , satisfies the guarantee

We will follow an alternative approach, which in concrete circumstances can significantly decrease the overall cost. The optimistic goal is to replace the accuracy used in the call to by the potentially much larger quantity . The strategy we propose will apply a robust distance estimator to a sequence of optimization problems that are better and better conditioned, thereby amortizing the overall cost. In the initial step, we will simply apply to with the low accuracy . In step , we will apply to a new function , which has condition number , with accuracy . Continuing this process for rounds, we arrive at accuracy and a function that is nearly perfectly conditioned with . In this way, the total cost is amortized over the sequence of optimization problems. The key of course is to control the error incurred by varying the optimization problems along the iterations.

3 Main result

The procedure outlined at the end of the previous section can be succinctly described within the framework of an inexact proximal point method. Henceforth fix an increasing sequence of penalties $\lambda_0,\ldots,\lambda_T$ and a sequence of centers $x_0,\ldots,x_T$. For each index $i=0,\ldots,T$, define the quadratically perturbed functions and their minimizers:

$$f^i(x):=f(x)+\frac{\lambda_i}{2}\|x-x_i\|^2,\qquad \bar x_{i+1}:=\operatorname*{argmin}_x f^i(x).$$

The exact proximal point method [31, 30, 37] proceeds by inductively declaring $x_i=\bar x_i$ for $i\ge 1$. Since computing $\bar x_i$ is in general impossible, we will instead monitor the error $\|\bar x_i-x_i\|$. The following elementary result will form the basis for the rest of the paper. To simplify notation, we will set $\bar x_0:=\operatorname{argmin} f$ and $\lambda_{-1}:=0$, throughout.

Theorem 3.1 (Inexact proximal point method).

For all , the estimates hold:

(3.1)

Consequently, we have the error decomposition:

(3.2)

Moreover, if is -smooth, then for all the estimate holds:

(3.3)
Proof.

We first establish (3.1) by induction. To see the base case , observe

As the inductive assumption, suppose the estimate (3.1) holds up to iteration . We then conclude

where the last inequality follows by the inductive assumption. This completes the proof of (3.1). To see (3.2), we observe using (3.1) the estimate

Inequality (3.3) for index follows from smoothness, while the general case follows from using the bound in (3.2). ∎

The main conclusion of Theorem 3.1 is the decomposition of the functional error described in (3.2). Namely, the estimate (3.2) upper bounds the error as the sum of the suboptimality in the last step and the errors incurred along the way. By choosing sufficiently large, we can be sure that the function is well-conditioned. Moreover in order to ensure that each term in the sum is of order , it suffices to guarantee for each index . Since is an increasing sequence, it follows that we may gradually decrease the tolerance on the errors , all the while improving the conditioning of the functions we encounter. With this intuition in mind, we introduce the proxBoost procedure (Algorithm 2).

Input: , ,
Set , Generate a point satisfying with probability . for   do
       Set Generate a point satisfying
(3.4)
where denotes the event .
end for
Generate a point satisfying
(3.5)
Return
Algorithm 2

Thus the proxBoost procedure consists of three stages, which we now examine in detail.

Stage I: Initialization.

Algorithm 2 begins by generating a point that is a distance of away from the minimizer of with probability . This task can be achieved by applying a robust distance estimator on , as discussed previously.

Stage II: Proximal iterations.

In each subsequent iteration, is defined to be a point that is within a radius of from the minimizer of with probability conditioned on the event . The event encodes that each previous iteration was successful in the sense that the point indeed lies inside for all . Thus can be determined by a procedure that within the event is a robust distance estimator on the function .

Stage III: Cleanup.

In the final step, the algorithm outputs a -minimizer of with probability conditioned on the event . In particular, if is -smooth then we may use a robust distance estimator on . Namely, taking into account the upper bound (2.1), we may declare to be any point satisfying

Notice that by choosing $\lambda_T$ sufficiently large, we may ensure that the condition number of $f^T$ is arbitrarily close to one. If $f$ is not smooth, such as when constraints are present, we cannot use a robust distance estimator in the cleanup stage. We will see in Section 5 a different approach, based on the robust stochastic gradient method of [22].
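The three stages can be sketched as a short loop. Everything specific in this example is an assumption made for illustration: the oracle signature `weak_oracle(center, lam, acc)`, the geometric penalties `mu * 2**i`, the accuracy schedule `sqrt(2 * delta / (mu + lam))` (used here for the cleanup stage as well), and the toy quadratic objective with a deterministic oracle that returns a gross outlier on every third call.

```python
import numpy as np

def robust_select(points):
    """Pick the response whose smallest ball holding more than half of the responses is tightest."""
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    radii = np.sort(dists, axis=1)[:, len(pts) // 2]
    return pts[int(np.argmin(radii))]

def prox_boost(weak_oracle, x_init, mu, T, delta, m):
    """Sketch of proxBoost; penalties and accuracies are assumptions of this example."""
    # Stage I: robust estimate of the minimizer of f itself (penalty 0).
    acc = np.sqrt(2.0 * delta / mu)
    x = robust_select([weak_oracle(x_init, 0.0, acc) for _ in range(m)])
    # Stages II and III: robustly solve f + (lam/2)*||. - x||^2 with growing lam.
    for i in range(T + 1):
        lam = mu * 2.0 ** i
        acc = np.sqrt(2.0 * delta / (mu + lam))
        x = robust_select([weak_oracle(x, lam, acc) for _ in range(m)])
    return x

# Hypothetical toy problem: f(x) = 0.5 * sum(d * x**2), so mu = 1, L = 100, minimizer 0.
d = np.array([1.0, 100.0])
f = lambda x: 0.5 * float(np.sum(d * x ** 2))
rng = np.random.default_rng(0)
calls = {"n": 0}

def toy_oracle(center, lam, acc):
    exact = lam * center / (d + lam)      # minimizer of f + (lam/2)*||. - center||^2
    calls["n"] += 1
    if calls["n"] % 3 == 0:               # every third response is a gross outlier
        return exact + 1e3
    u = rng.normal(size=d.size)
    return exact + 0.5 * acc * u / np.linalg.norm(u)   # error of norm acc / 2

delta, T = 1e-2, 7                        # T = ceil(log2(L / mu))
x_out = prox_boost(toy_oracle, np.array([5.0, -5.0]), mu=1.0, T=T, delta=delta, m=9)
```

In this toy run a third of all oracle responses are useless, yet each stage recovers a point close to the current proximal minimizer, and the final function gap stays within a small multiple of `delta` times the number of stages.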


The following theorem summarizes the guarantees of the proxBoost procedure.

Theorem 3.2 (Proximal Boost).

Fix a constant , a probability of failure and a natural number . Then with probability at least , the point satisfies

(3.6)
Proof.

We first prove by induction the estimate

(3.7)

The base case is immediate from the definition of . Suppose now that (3.7) holds for some index . Then the inductive assumption and the definition of yield

thereby completing the induction. Thus the inequalities (3.7) hold. Define the event

We therefore deduce

Suppose now that the event occurs. Then using the estimate (3.2), we conclude

where the last inequality uses the definitions of and . This completes the proof. ∎

Looking at the estimate (3.6), we see that the final error is controlled by the sum $\sum_{i=0}^{T}\frac{\lambda_i}{\mu+\lambda_{i-1}}$. A moment of thought yields an appealing choice $\lambda_i=\mu 2^i$ for the proximal parameters. Indeed, then every element in the sum is upper bounded by two. Moreover, if $f$ is $L$-smooth, then the condition number $\frac{L+\lambda_T}{\mu+\lambda_T}$ of $f^T$ is upper bounded by two after only $T=\lceil\log_2(L/\mu)\rceil$ rounds.
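Both claims about geometrically growing penalties are easy to check numerically. The sketch below assumes the penalties are `lam_i = mu * 2**i` with the convention `lam_{-1} = 0`, that the terms of the sum have the form `lam_i / (mu + lam_{i-1})`, and that the last proximal function has condition number `(L + lam_T) / (mu + lam_T)`; the helper name `geometric_penalties` and the values of `mu` and `L` are hypothetical.

```python
import math

def geometric_penalties(mu, L):
    """Return the round count T = ceil(log2(L / mu)) and the penalties mu * 2**i, i = 0..T."""
    T = max(0, math.ceil(math.log2(L / mu)))
    lams = [mu * 2 ** i for i in range(T + 1)]
    return T, lams

mu, L = 1.0, 1000.0
T, lams = geometric_penalties(mu, L)

# Terms lam_i / (mu + lam_{i-1}), using lam_{-1} = 0 for the first one.
prev = [0.0] + lams[:-1]
ratios = [lam / (mu + p) for lam, p in zip(lams, prev)]

# Condition number of f + (lam_T / 2) * ||. - x_T||^2 when f is L-smooth and mu-strongly convex.
kappa_T = (L + lams[-1]) / (mu + lams[-1])
```

With a condition number of 1000 only ten doublings are needed, which is the source of the polylogarithmic overhead: the number of proximal rounds grows like the logarithm of the condition number while each term of the error sum stays bounded.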

Corollary 3.3 (Proximal boost with geometric decay).

Fix an iteration count , a target accuracy , and a probability of failure . Define the algorithm parameters:

Then the point satisfies

In the next two sections, we seed the proxBoost procedure with (accelerated) stochastic gradient algorithms and methods based on empirical risk minimization. The reader, however, should keep in mind that proxBoost is entirely agnostic to the inner workings of the robust distance estimators it uses. The only point to be careful about is that some distance estimators (e.g. stochastic gradient) require auxiliary quantities as input, such as an upper estimate on the function gap at the initial point. Therefore, we may have to update such estimates along the iterations of Algorithm 2.

4 Consequences for empirical risk minimization

In this section, we explore the consequences of the proxBoost algorithm for empirical risk minimization. Setting the stage, fix a probability space and equip with the Borel -algebra. Consider the optimization problem

(4.1)

where is a measurable nonnegative function. A common approach to expectation minimization problems is based on empirical risk minimization. Namely, we may form an i.i.d. sample and minimize the empirical average

(4.2)

A central question is to determine the number of samples that would ensure that the minimizer of the empirical risk has low generalization error , with reasonably high probability. There is a vast literature on this subject; some representative works include [19, 5, 39, 38]. We build here on the work of Hsu-Sabato [19], who specifically focused on high confidence guarantees for smooth strongly convex minimization. As in the previous sections, we let be a minimizer of and define the shorthand .

Assumption 4.1.

Following [19], we make the following assumptions on the loss.

  1. (Strong convexity) There exist a real and a natural number such that:

    1. the population loss is -strongly convex,

    2. the empirical loss is -strongly convex with probability at least , whenever .

  2. (Smoothness) There exist constants such that:

    1. for a.e. , the loss is -smooth,

    2. the population objective is -smooth.

Define the empirical and population condition numbers, and , respectively.

The following result proved in [19, Theorem 15] shows that the empirical risk minimizer is a weak distance oracle for the problem (4.1).

Lemma 4.2.

Fix an i.i.d. sample of size . Then the minimizer of the empirical risk (4.2) satisfies the bound:

In particular, using Algorithm 1 we may turn empirical risk minimization into a robust distance estimator for the problem (4.1) using a total of samples. Let us estimate the function value at the generated point by a direct application of smoothness. Appealing to Lemma 2.2 and (2.1), we deduce that with probability the procedure will return a point satisfying

Observe that this is an estimate of relative error. In particular, let be some acceptable probability of failure and let be a desired level of relative accuracy. Then setting and , we conclude that satisfies

(4.3)

while the overall sample complexity of the procedure is

(4.4)

This is exactly the result [19, Corollary 16]. We will now see how to find a point satisfying (4.3) with significantly fewer samples by embedding empirical risk minimization within the proxBoost algorithm. Algorithm 3 encodes the empirical risk minimization process on a quadratically regularized problem. Algorithm 4 is the robust distance estimator induced by Algorithm 3. Finally, Algorithm 5 is the proxBoost algorithm specialized to empirical risk minimization.

Input: sample count , center , amplitude . Generate i.i.d. samples and compute the minimizer of
Return
Algorithm 3
Input: sample count , trial count , center , amplitude .
Query times ERM and let consist of the responses. Step :
       Compute . Set Return
Algorithm 4
Input: ,
Set , Step :
      
       Return
Algorithm 5
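The building block of these algorithms, empirical risk minimization on a quadratically regularized problem, can be sketched concretely. The example assumes a squared loss so that the regularized empirical minimizer has a closed form; the loss, the function name `regularized_erm`, and the synthetic data are hypothetical choices for illustration, not part of the original text.

```python
import numpy as np

def regularized_erm(A, b, center, lam):
    """Minimize (1/n) * sum_i 0.5 * (a_i @ y - b_i)**2 + (lam / 2) * ||y - center||^2 over y."""
    n, dim = A.shape
    # First-order optimality: (A^T A / n + lam * I) y = A^T b / n + lam * center.
    hessian = A.T @ A / n + lam * np.eye(dim)
    rhs = A.T @ b / n + lam * center
    return np.linalg.solve(hessian, rhs)

# Hypothetical i.i.d. sample from a linear model with noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.normal(size=200)

center = np.zeros(3)
lam = 0.5
y = regularized_erm(A, b, center, lam)

# Gradient of the regularized empirical risk at y; it should vanish.
grad = A.T @ (A @ y - b) / len(b) + lam * (y - center)
```

Larger amplitudes pull the minimizer toward the center and improve the conditioning of the subproblem, which is the effect the proximal iterations exploit. Repeating this call on independent samples and passing the results through the ball-based selection rule gives the robust variant.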

Using Theorem 3.2, we can now prove the following result.

Theorem 4.3 (Efficiency of BoostERM).

Fix a target relative accuracy and numbers . Then with probability at least , the point satisfies

Proof.

We will verify that Algorithm 5 is an instantiation of Algorithm 2 with and . More precisely, we will prove by induction that with this choice of and , the iterates satisfy (3.4) for each index and satisfies (3.5). As the base case, consider the evaluation . Then Lemma 2.2 and Lemma 4.2 guarantee

Taking into account the definition of , we deduce

as claimed. As an inductive hypothesis, suppose that (3.4) holds for the iterates . We will prove it holds for . To this end, suppose that the event occurs. Then by the same reasoning as in the base case, the point satisfies

(4.5)

Observe now, using (3.1) and the inductive assumption, the estimate:

Combining this inequality with (4.5), we conclude that conditioned on the event , we have with probability the guarantee

(4.6)

where the last equality follows from the definition of . Thus the estimate (3.4) holds for the iterate , as needed. Suppose now that the event occurs. Then by exactly the same reasoning that led to (4.6), we have the estimate

Using smoothness, we therefore deduce , as claimed. An application of Theorem 3.2 completes the proof. ∎

When using the proximal parameters , we obtain the following guarantee.

Corollary 4.4 (Efficiency of BoostERM with geometric decay).

Fix a target relative accuracy and a probability of failure . Define the algorithm parameters:

Then with probability of at least , the point satisfies . Moreover, the total number of samples used by the algorithm is

Notice that the sample complexity provided by Corollary 4.4 is an order of magnitude better than (4.4) in terms of the dependence on the condition numbers and .

5 Consequences for stochastic approximation

In this section, we investigate the consequences of the proxBoost algorithm for stochastic approximation. Namely, we will seed proxBoost with the robust distance estimator, induced by the stochastic proximal gradient method and its accelerated variant. An important point is that the sample complexity of stochastic gradient methods depends on the initialization quality . Consequently, in order to know how many iterations are needed to reach a desired accuracy , we must have available an upper bound on the initialization quality . Moreover, we will have to update the initialization estimate for each proximal subproblem along the iterations of proxBoost. The following assumption formalizes this idea.

Assumption 5.1.

Consider the proximal minimization problem

Let be a real number satisfying . We will let be a procedure that returns a point satisfying

Clearly, by strong convexity, we may turn into a robust distance estimator on the proximal problems as long as is indeed an upper bound on the initialization error. We record the robust distance estimator induced by as Algorithm 6.

Input: accuracy , amplitude , upper bound , center ,
       trial count .
Query times and let consist of the responses. Step :
       Compute . Set Return
Algorithm 6 Alg-R

When $f$ is $L$-smooth, it is straightforward to instantiate proxBoost with the robust distance estimator Alg-R. The situation is more nuanced when $f$ is nonsmooth for two reasons. First, it becomes less clear how to control the initialization quality for each proximal subproblem. Secondly, Alg-R cannot be used in the cleanup stage of proxBoost to generate the last iterate; instead, we will use a different algorithm [22] in this last stage. In the following two sections, we consider the smooth and nonsmooth settings in order.

5.1 Smooth Setting

Throughout this section, in addition to Assumptions 2.1 and 5.1, we assume that is -smooth and set . Algorithm 7 seeds the proxBoost procedure with Alg-R.

Input: accuracy , upper bound , center , iterations
Set , , Step :
              Return
Algorithm 7

We can now prove the following theorem on the efficiency of Algorithm 7. The proof is almost a direct application of Theorem 3.2. The only technical point is to verify that for all indices , the quantity is a valid upper bound on the initialization error in the event .

Theorem 5.2 (Efficiency of BoostAlg).

Fix an arbitrary point and let