Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

12/12/2017
by Benjamin Grimmer
Cornell University

We generalize the classic convergence rate theory for subgradient methods to apply to non-Lipschitz functions via a new measure of steepness. For the deterministic projected subgradient method, we derive a global O(1/√T) convergence rate for any function with at most exponential growth. Our approach implies generalizations of the standard convergence rates for gradient descent on functions with Lipschitz or Hölder continuous gradients. Further, we show an O(1/√T) convergence rate for the stochastic projected subgradient method on functions with at most quadratic growth, which improves to O(1/T) under strong convexity.


1 Introduction

We consider the nonsmooth, convex optimization problem given by

for some lower semicontinuous convex function and closed convex feasible region . We assume lies in the domain of and that this problem has a nonempty set of minimizers . Further, we assume orthogonal projection onto is computationally tractable (which we denote by ).

Since may be nondifferentiable, we weaken the notion of gradients to subgradients. The set of all subgradients at some (referred to as the subdifferential) is denoted by

We consider solving this problem via a (potentially stochastic) projected subgradient method. These methods have received much attention lately due to their simplicity and scalability; see [2, 13], as well as [7, 8, 9, 12, 14] for a sample of more recent works.

Deterministic and stochastic subgradient methods differ in the type of oracle used to access the subdifferential of . For deterministic methods, we consider an oracle , which returns an arbitrary subgradient at . For stochastic methods, we utilize a weaker, random oracle

, which is an unbiased estimator of a subgradient (i.e.,

for some easily sampled distribution ).

We analyze two classic subgradient methods, differing in their step size policy. Given a deterministic oracle, we consider the following normalized subgradient method

(1)

for some positive sequence . Note that since only if minimizes , this iteration is well-defined until a minimizer is found. Given a stochastic oracle, we consider the following method

(2)

for some positive sequence and i.i.d. sample sequence .
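To make the two update rules concrete, here is a minimal Python sketch of iterations (1) and (2). The names are assumptions introduced for illustration: `oracle(x)` returns a subgradient, `stoch_oracle(x, rng)` returns an unbiased stochastic subgradient, `proj` is the orthogonal projection onto the feasible region, and `steps` is the positive step size sequence. The updates follow the standard normalized and plain stochastic forms; treat this as a sketch rather than the paper's exact pseudocode.

```python
import numpy as np

def projected_subgradient(x0, oracle, proj, steps):
    """Normalized deterministic method, iteration (1):
    x_{k+1} = proj(x_k - alpha_k * g_k / ||g_k||)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for alpha in steps:
        g = oracle(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:              # a zero subgradient certifies optimality
            break
        x = proj(x - alpha * g / norm)
        iterates.append(x.copy())
    return iterates

def stochastic_projected_subgradient(x0, stoch_oracle, proj, steps, rng=None):
    """Stochastic method, iteration (2): x_{k+1} = proj(x_k - alpha_k * gtilde_k),
    where gtilde_k is an unbiased estimate of a subgradient at x_k."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for alpha in steps:
        gtilde = stoch_oracle(x, rng)
        x = proj(x - alpha * gtilde)
        iterates.append(x.copy())
    return iterates
```

For an unconstrained problem, `proj` can simply be the identity map; the classic analyses then bound the objective gap of an averaged or best iterate.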

Let denote the Euclidean distance from the initial iterate to a minimizer, and denote the objective gap at a point . The following three theorems establish the classic objective gap convergence rates of (1) and (2).

Theorem 1 (Classic Deterministic Rate).

Consider any convex function and subgradient oracle satisfying for all . Then for any positive sequence , the iteration (1) satisfies

For example, under the constant step size , the iteration (1) satisfies

Theorem 2 (Classic Stochastic Rate).

Consider any convex function and stochastic subgradient oracle satisfying for all . Then for any positive sequence , the iteration (2) satisfies

For example, under the constant step size , the iteration (2) satisfies

We say is -strongly convex on for some if for every and ,

If this holds for some positive modulus, the convergence of (2) can be improved to O(1/T) [7, 8, 14]. Below, we present one such bound from [8].

Theorem 3 (Classic Strongly Convex Stochastic Rate).

Consider any -strongly convex function and stochastic subgradient oracle satisfying for all . Then for the decreasing sequence of step sizes , the iteration (2) satisfies

Remarks on the Generality of Theorems 1-3.

The assumed subgradient norm bound for all is implied by being -Lipschitz continuous on some open convex set containing (which is often the assumption made). This assumption restricts the classic convergence results to functions with at most linear growth (at rate ). When is bounded, one can invoke a compactness argument to produce a uniform Lipschitz constant. However, such an approach may introduce large constants heavily dependent on the size of (and frankly, lacks the elegance that such a fundamental method deserves).

We also remark that Lipschitz continuity and strong convexity are fundamentally at odds: Lipschitz continuity allows at most linear growth, while strong convexity requires quadratic growth. The only way both can hold is when the feasible region is bounded.

Recently, Renegar [15] introduced a novel framework that allows first-order methods to be applied to general (non-Lipschitz) convex optimization problems via a radial transformation. Based on this framework, Grimmer [6] showed a simple radial subgradient method has convergence paralleling the classic rate without assuming Lipschitz continuity. This algorithm is applied to a transformed version of the original problem and replaces orthogonal projection by a line search at each iteration.

Lu [9] analyzes an interesting subgradient-type method (which is a variation of mirror descent) for non-Lipschitz problems that is customized for a particular problem via a reference function. This approach gives convergence guarantees based on a relative-continuity constant instead of a uniform Lipschitz constant.

Although the works of Renegar [15], Grimmer [6], and Lu [9] give convergence rates for specialized subgradient methods without assuming Lipschitz continuity, it is unclear what guarantees the classic subgradient methods (1) and (2) have for non-Lipschitz problems. In this paper, we propose a generalization of Lipschitz continuity, which greatly extends the applicability and quality of convergence rate bounds for these classic methods.

1.1 Our Contributions

We propose the following generalization of an absolute bound on subgradient norm.

Definition 4.

Consider any nonnegative-valued function .
We say a subgradient oracle is
-steep on if for all .
Similarly, a stochastic oracle is
-steep on if for all .

We say a function is -steep if every subgradient oracle for it is -steep. This definition allows subgradients to be large when the objective gap is large (with the exact relation between the two governed by ). Note that when is a constant function, steepness is identical to the classic model. In this case, our convergence rates stated below exactly match their classic counterparts. In Section 2, we discuss a number of examples of steepness and applications of our bounds. First, consider the deterministic subgradient method (1).

Theorem 5 (Extended Deterministic Rate).

Consider any convex function and -steep subgradient oracle on . Then for any positive sequence , the iteration (1) satisfies

For example, under the constant step size , the iteration (1) satisfies

Remarks on the Generality of Theorem 5.

When is at most linear (i.e., there exists some such that ), the supremum above approaches as and provides a meaningful rate of convergence. For reasonably simple , this supremum has a closed form. However, this bound may be vacuous when is superlinear since having implies the supremum above equals .

Having an -steep oracle can be viewed as allowing functions with at most exponential growth (see Proposition 14). Intuitively, this is reasonable as such steepness is roughly a differential inequality of the form , which has a classic exponential solution. This is a large improvement on the linear growth required by the classic theory. In Section 4, we discuss how more general convergence rates can be given.
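As a concrete illustration of this growth correspondence (an example chosen here for exposition, not taken from the paper): f(x) = 2cosh(x) = e^x + e^{-x} attains its minimum value 2 at x = 0 and grows exponentially, yet its derivative satisfies |f'(x)| = 2|sinh(x)| ≤ 2cosh(x) = u + 2, where u = f(x) − 2 is the objective gap. Hence f is steep with respect to the at-most-linear function M(u) = u + 2. A quick numerical check:

```python
import numpy as np

# Illustrative check: f(x) = 2*cosh(x) attains its minimum value 2 at x = 0
# and grows exponentially, yet |f'(x)| = 2*|sinh(x)| <= (f(x) - 2) + 2,
# so f is steep with respect to the at-most-linear function M(u) = u + 2.
xs = np.linspace(-10.0, 10.0, 2001)
gap = 2.0 * np.cosh(xs) - 2.0            # objective gap u = f(x) - min f
grad = 2.0 * np.sinh(xs)
assert np.all(np.abs(grad) <= gap + 2.0 + 1e-9)
print("steepness bound M(u) = u + 2 holds at all sampled points")
```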

Provided is at most linear, simple limiting arguments give the following eventual convergence rate of (1) based on Theorem 5: For any , there exists , such that all have

As a result, the asymptotic convergence rate of (1) is determined entirely by the size of subgradients around the set of minimizers, and conversely, steepness far from optimality plays no role in the asymptotic behavior.

As a surprising consequence of Theorem 5, we recover the classic convergence rate [13] of gradient descent on differentiable functions with an -Lipschitz continuous gradient. Any such function is -steep on (see Lemma 9). Then a convergence rate immediately follows from Theorem 5 (for simplicity, we consider a constant step size).

Corollary 6 (Generalizing Gradient Descent’s Convergence).

Consider any convex function and -steep subgradient oracle. Then under the constant step size , the iteration (1) satisfies

Thus a convergence rate of can be attained without any mention of smoothness or differentiability. Instead, the essential property is that the norms of gradients (or subgradients) go to zero sufficiently fast as the minimum function value is approached. In Section 2, we bound the steepness of any function with a Hölder continuous gradient (an extension of Lipschitz continuity) and state the resulting convergence bound. In general, for any at most linear with , Theorem 5 gives convergence at a rate of .

Now we consider the stochastic subgradient method defined by (2). Here we limit our analysis to -steepness for some (note the classic model restricts to ). We have the following guarantees for convex and strongly convex problems.

Theorem 7 (Extended Stochastic Rate).

Consider any convex function and -steep stochastic subgradient oracle on . Then for any positive sequence with , the iteration (2) satisfies

For example, under the constant step size , the iteration (2) satisfies

provided is large enough to have .

Theorem 8 (Extended Strongly Convex Stochastic Rate; a predecessor of this theorem was given by Davis and Grimmer in Proposition 3.2 of [3], where a convergence rate was shown for certain non-Lipschitz strongly convex problems).

Consider any -strongly convex function and -steep stochastic subgradient oracle on . Then for the decreasing sequence of step sizes

the iteration (2) satisfies

The following simpler averaging gives a bound weakened roughly by a factor of two:

Remarks on the Generality of Theorems 7 and 8.

Having a deterministic -steep oracle can be viewed as allowing functions with at most quadratic growth (see Proposition 14). Intuitively, this is reasonable since the corresponding differential inequality has a simple quadratic solution. As a result, we avoid the inherent conflict in Theorem 3 between Lipschitz continuity and strong convexity, since a function can globally have both square root steepness and strong convexity. In Section 4, we show that a weaker condition than strong convexity is sufficient to ensure an O(1/T) rate.

2 Examples and Applications of Steepness

In this section, we show several examples of problems that are steep with respect to simple functions , as well as provide an alternative characterization of steepness in terms of upper bounds on function growth. To establish a baseline for understanding steepness, Table 1 gives a variety of simple functions and corresponding steepness bounds.

Table 1: Numerous simple convex functions and bounds on their steepness on are given. Each steepness bound is pointwise as small as possible. All of the examples given, except , have at most linear steepness, and thus fall within the scope of Theorem 5.

2.1 Smooth Optimization

The standard analysis of gradient descent in smooth optimization assumes the gradient of the objective function is uniformly Lipschitz continuous, or more generally, uniformly Hölder continuous. A differentiable function has -Hölder continuous gradient on for some and if for all

Note this is exactly Lipschitz continuity of the gradient when . Below, we show the steepness of any such function admits a simple description.
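A simple one-dimensional example (chosen here for illustration, not taken from the paper): f(x) = |x|^{3/2} has gradient f'(x) = (3/2) sign(x) |x|^{1/2}, which is Hölder continuous with exponent ν = 1/2, and its steepness works out exactly to M(u) = (3/2) u^{1/3}. The following snippet checks both properties numerically on a grid.

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 1001)
f = np.abs(xs) ** 1.5                        # f(x) = |x|^{3/2}, min f = 0
grad = 1.5 * np.sign(xs) * np.abs(xs) ** 0.5

# Steepness: |f'(x)| = 1.5 * (f(x) - min f)^{1/3} holds with equality here.
assert np.allclose(np.abs(grad), 1.5 * f ** (1.0 / 3.0))

# Hölder continuity of the gradient with exponent 1/2 (constant ~ 1.5*sqrt(2)).
x, y = xs[:, None], xs[None, :]
gx, gy = grad[:, None], grad[None, :]
assert np.all(np.abs(gx - gy) <= 1.5 * np.sqrt(2) * np.abs(x - y) ** 0.5 + 1e-9)
print("Hölder-gradient and steepness relations hold on the sampled grid")
```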

Lemma 9.

Every with a -Hölder continuous gradient on is

Proof.

The following upper bound holds for each : For all ,

Selecting for minimizes this upper bound. Therefore

Rearranging this inequality gives the claimed bound. ∎
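For instance, assuming the ν = 1 case of Lemma 9 takes the familiar form ||∇f(x)|| ≤ √(2L(f(x) − min f)) for an L-Lipschitz gradient, the bound can be verified numerically on a convex quadratic f(x) = ½ xᵀAx with A positive semidefinite, where L is the largest eigenvalue of A and min f = 0. This is a sanity check, not a substitute for the proof above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex quadratic f(x) = 0.5 * x^T A x with A PSD; its gradient A x is
# L-Lipschitz with L = largest eigenvalue of A, and min f = 0.
B = rng.standard_normal((5, 5))
A = B.T @ B
L = np.linalg.eigvalsh(A).max()

for _ in range(1000):
    x = rng.standard_normal(5)
    fx = 0.5 * x @ A @ x
    gx = A @ x
    # Suggested steepness bound (nu = 1): ||grad f(x)|| <= sqrt(2 * L * (f(x) - min f)).
    assert np.linalg.norm(gx) <= np.sqrt(2.0 * L * fx) + 1e-9
print("bound holds at all sampled points")
```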

This lemma with implies any function with an -Lipschitz gradient is -steep. Then Theorem 5 gives our generalization of the classic gradient descent convergence rate claimed in Corollary 6. Further, for any function with a Hölderian gradient, we find the following convergence rate.

Corollary 10 (Generalizing Hölderian Gradient Descent’s Convergence).

Consider any convex function and -steep subgradient oracle. Then under the constant step size , the iteration (1) satisfies

2.2 Additive Composite Optimization

Often problems arise where the objective is to minimize a sum of smooth and nonsmooth functions. We consider the following general formulation of this problem

for any differentiable convex function with -Hölderian gradient and any -Lipschitz continuous convex function . Such problems occur when regularizing smooth optimization problems, where would be the sum of one or more nonsmooth regularizers (for example, to induce sparsity).

Additive composite problems can be solved by prox-gradient or splitting methods, which solve a subproblem involving the nonsmooth part at each iteration. However, this limits those methods to problems where the nonsmooth part is relatively simple. The subgradient method avoids this limitation by only requiring the computation of a subgradient at each iteration, which is given by . The classic convergence theory fails to give any guarantees for this problem since may be non-Lipschitz. In contrast, we find that this problem class has a simple steepness bound from which guarantees for the classic subgradient method follow directly.

Lemma 11.

For any oracle , the oracle is

Proof.

Consider any and . Define the following lower bound on

Notice that and both minimize at with . Further, since has a -Hölder continuous gradient, Lemma 9 bounds the size of for any as

The Lipschitz continuity of implies , and so the triangle inequality completes the proof. ∎

One could plug into Theorem 5 and evaluate the supremum to produce a convergence guarantee. For ease of presentation, we weaken our steepness bound to the following, which may be up to a factor of two larger,

Then Theorem 5 immediately gives the following convergence rate (for simplicity, we state the bound for constant step size).

Corollary 12 (Additive Composite Convergence).

For any deterministic subgradient oracle , under the constant step size , the iteration (1) satisfies

Up to small factors, the first term in the above maximum matches the convergence rate on functions with Hölderian gradient like (see Corollary 10) and the second term matches the convergence rate on Lipschitz continuous functions like (see Theorem 1). Thus the subgradient method on has convergence guarantees no worse than those of the subgradient method on or separately.
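As an illustration of the composite setting, here is a sketch with hypothetical problem data (not the paper's experiment): the smooth part ½‖Ax − b‖² has a Lipschitz gradient (the ν = 1 case), the nonsmooth part λ‖x‖₁ is Lipschitz continuous, and a valid oracle returns the gradient of the smooth part plus a subgradient of the ℓ₁ term. The normalized iteration (1) is run unconstrained (so projection is the identity) with an illustrative constant step size.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 60, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def oracle(x):
    # Subgradient of the composite objective: gradient of the smooth part
    # plus a subgradient of the l1 term (sign, taken as 0 at zero entries).
    return A.T @ (A @ x - b) + lam * np.sign(x)

x = np.zeros(n)
best = objective(x)
T = 5000
for _ in range(T):
    g = oracle(x)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:                       # zero subgradient: x is optimal
        break
    x = x - (1.0 / np.sqrt(T)) * g / gnorm  # normalized step, constant alpha (illustrative)
    best = min(best, objective(x))
print("best objective value found:", best)
```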

2.3 Quadratically Regularized, Stochastic Optimization

Another common class of optimization problems results from adding a quadratic regularization term to the objective function, for some parameter . Consider solving

for any Lipschitz continuous convex function . Suppose we have a stochastic subgradient oracle for denoted by satisfying . Although the function and its stochastic oracle meet the necessary conditions for the classic theory to be applied, the addition of a quadratic term violates Lipschitz continuity. Simple arguments give a steepness bound for this problem and the following convergence rate.

Corollary 13 (Quadratically Regularized Convergence).

For the decreasing step sizes

and stochastic subgradient oracle , the iteration (2) satisfies

Proof.

Consider any and . Since has -Lipschitz gradient, the same argument used in Lemma 11 shows . Applying Jensen’s inequality and the assumed subgradient bound implies

Thus our stochastic oracle is -steep. Noting is -strongly convex, our bound follows from Theorem 8. ∎

One common example of a problem of the form

is training a Support Vector Machine (SVM). Suppose one has

data points with feature vector labeled . Then one trains a model for some parameter by solving

Here, a stochastic subgradient oracle can be given by selecting a summand uniformly at random and then setting

which has .
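The oracle just described can be coded directly. The following is a sketch with hypothetical data, assuming the usual regularized hinge-loss objective (λ/2)‖w‖² + (1/n) Σᵢ max(0, 1 − yᵢ⟨aᵢ, w⟩); the names and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 500, 30, 0.01                        # hypothetical problem sizes
features = rng.standard_normal((n, d))            # feature vectors a_i
labels = np.where(features @ rng.standard_normal(d) > 0, 1.0, -1.0)  # labels y_i

def svm_stoch_oracle(w, rng):
    """Unbiased stochastic subgradient of
    (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <a_i, w>),
    obtained by sampling one summand uniformly at random."""
    i = rng.integers(n)
    g = lam * w
    if labels[i] * (features[i] @ w) < 1:          # hinge term active at sample i
        g = g - labels[i] * features[i]
    return g
```

With decreasing steps on the order of 1/(λk), the usual choice for λ-strongly convex problems, this oracle can be passed to the stochastic iteration (2) sketched in the introduction.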

Much work has previously been done to give guarantees for SVMs. If one adds the constraint that lies in some large ball (onto which the iterates are then projected at each iteration), the classic strongly convex rate can be applied [16]. A similar approach, utilized in [8], is to show that, in expectation, all of the iterates of a stochastic subgradient method lie in a large ball (provided the initial iterate does). The specialized mirror descent method proposed by Lu [9] gives convergence guarantees for SVMs at a rate of without needing a bounding ball. Splitting methods and quasi-Newton methods capable of solving this problem are given in [5] and [18], respectively, both of which avoid needing to assume subgradient bounds.

2.4 Alternative Characterizations of Deterministic Steepness

Here we give an alternative interpretation of bounding the size of subgradients, either absolutely or with steepness on some convex open set . First we consider the classic model. Suppose a convex function has for all and . This is equivalent to being -Lipschitz continuous on and can be restated as the following linear upper bound holding for each

where .

This characterization shows the limitation to linear growth of the classic model (i.e., constant steepness). In the following proposition, we give similar upper bound characterizations for linear and square root steepness, which can be seen as allowing up to exponential and quadratic growth, respectively.

Proposition 14.

A convex function is -steep on some open convex if and only if the following exponential upper bound holds for each

Similarly, a convex function is -steep on some open convex if and only if the following quadratic upper bound holds for each

Proof.

First we prove the forward direction of both claims. Consider any and subgradient oracle . Let denote the unit direction from to , and denote the restriction of to this line. Notice that and . The convexity of implies it is differentiable almost everywhere in the interval . Thus satisfies the following, for almost every ,

For any -steep function, this gives the differential inequality of

Standard calculus arguments imply which is equivalent to our claimed upper bound at .

For any -steep function, this gives the differential inequality of

Standard calculus arguments imply which is equivalent to our claimed upper bound at .

Now we prove the reverse direction of both claims. Let denote either of our upper bounds given by some . Further, let denote the directional derivative operator in some unit direction . Then for any subgradient ,

where the first inequality uses the definition of and the second uses the fact that upper bounds . A simple calculation shows our first upper bound has and our second upper bound has . Then both of our steepness bounds follow by taking . ∎
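For reference, the two differential inequalities and their standard solutions can be written out as follows. This is a sketch in generic notation, writing f* = min f, h for the restriction of f to the segment between y and x, and t for distance along that segment; the constants are illustrative rather than the paper's exact ones.

```latex
% Linear steepness M(u) = u: the inequality h'(t) <= h(t) - f^* gives, by a
% Gronwall-type argument, h(t) - f^* <= (h(0) - f^*) e^{t}, i.e. the
% exponential upper bound
\[
  f(x) - f^{*} \;\le\; \bigl(f(y) - f^{*}\bigr)\, e^{\|x - y\|}.
\]
% Square-root steepness M(u) = sqrt(u): h'(t) <= sqrt(h(t) - f^*) integrates to
% sqrt(h(t) - f^*) <= sqrt(h(0) - f^*) + t/2, i.e. the quadratic upper bound
\[
  f(x) - f^{*} \;\le\; \Bigl(\sqrt{f(y) - f^{*}} + \tfrac{\|x - y\|}{2}\Bigr)^{2}.
\]
```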

3 Convergence Proofs

Each of our extended convergence theorems follows from essentially the same proof as its classic counterpart. Only minor modification is needed to replace -Lipschitz continuity by -steepness. The central inequality in analyzing subgradient methods is the following.

Lemma 15.

Consider any -strongly convex function . For any and ,

Note that this holds for any convex function at .

Proof.

Since orthogonal projection onto a convex set is nonexpansive, we have

Taking the expectation over yields

Applying the definition of strong convexity completes the proof. ∎
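In standard notation, specialized to a minimizer x*, the nonexpansiveness argument yields a bound of the following familiar form (a hedged sketch, writing α_k for the step size and g̃_k for the stochastic subgradient at x_k; the paper's exact statement may differ in its constants):

```latex
\[
  \mathbb{E}\,\|x_{k+1} - x^{*}\|^{2}
  \;\le\;
  (1 - \mu \alpha_{k})\,\|x_{k} - x^{*}\|^{2}
  \;-\; 2\alpha_{k}\bigl(f(x_{k}) - f(x^{*})\bigr)
  \;+\; \alpha_{k}^{2}\,\mathbb{E}\,\|\tilde{g}_{k}\|^{2},
\]
% which at mu = 0 reduces to the inequality used for merely convex f.
```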

Let denote the expected distance from each iterate to the set of minimizers. Each of our proofs follows the same general outline: use Lemma 15 to set up a telescoping inequality on , then sum the telescope.

3.1 Proof of Theorem 5

From Lemma 15 with , , , and , it follows that

where the second inequality uses -steepness of . Inductively applying this implies

Thus

Applying the sup-inverse of completes the proof.

3.2 Proof of Theorem 7

From Lemma 15 with , , , and , it follows that

where the second inequality uses the steepness of . Inductively applying this implies

The convexity of gives

completing the proof.

3.3 Proof of Theorem 8

Our proof follows the style of [8]. Observe that our choice of step size satisfies the following pair of conditions. First, note that it is a solution to the recurrence

(3)

Second, note that for all since

(4)

From Lemma 15 with , , and , it follows that

where the second inequality uses the steepness of . Multiplying by yields

Notice that this inequality telescopes due to (3). Inductively applying this implies

Since and , we have

Observe that the coefficients of each above are positive due to (4). Then the convexity of gives our first convergence bound. From (4), we know for all . Then the previous inequality can be weakened to

The convexity of gives our second convergence bound.

4 Extensions of our Convergence Rates

4.1 Convergence Beyond Exponential Growth

Early in the development of subgradient methods, Shor [17] observed that the normalized subgradient method (1) enjoys some form of convergence guarantee for any convex function with a nonempty set of minimizers. Shor used the same elementary argument underlying Theorem 5 to show for any minimizer : either some has or

Thus, for any convex function, the subgradient method has convergence in terms of this inner product (which convexity implies is always nonnegative). This quantity can be interpreted as the distance from the hyperplane

to .

To turn this into an objective gap convergence rate for general convex problems, one needs to convert having small “subgradient hyperplane distance to a minimizer” into having small objective gap. The immediate convergence theorem based on this idea is the following.

Theorem 16.

Consider any convex and subgradient oracle. Fix some . If

for some function , then the iteration (1) satisfies

The primary difficulty in applying the above theorem to a particular problem lies in identifying a function where the necessary implication holds. However, this approach can circumvent the limitation of Theorem 5 to having at most exponential growth. For example, satisfies this implication with . Theorem 5 can be viewed as one particular way to construct a suitable : For any -steep oracle,

4.2 Improved Convergence Without Strong Convexity

Strong convexity is stronger than necessary to achieve many of the standard improvements in convergence rates for smooth optimization problems [1, 4, 10, 11]. Instead, the weaker condition of quadratic growth away from the set of minimizers suffices. We find that this weaker condition is also sufficient for (2) to have a convergence rate of O(1/T).

We say a function has -quadratic growth if all satisfy
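One standard normalization of this condition, assumed here to match the usual definition in the error-bound literature [1, 4, 10, 11] (the paper's exact constant may differ), is:

```latex
\[
  f(x) \;\ge\; \min f \;+\; \frac{\alpha}{2}\,\operatorname{dist}(x, X^{*})^{2}
  \qquad \text{for all feasible } x,
\]
% where X^* denotes the set of minimizers and dist the Euclidean distance to it.
```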

The proof of Theorem 8 only uses strong convexity once for the following inequality:

Having -quadratic growth is sufficient to give this inequality, weakened by a factor of :

Then simple modifications of the proof of Theorem 8 give an O(1/T) convergence rate.

Theorem 17.

Consider any convex function with -quadratic growth and -steep stochastic subgradient oracle on . Then for the decreasing sequence of step sizes

the iteration (2) satisfies

Acknowledgments.

The author thanks Jim Renegar for providing feedback on an early draft of this work.

References

  • [1] Jérôme Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, Oct 2017.
  • [2] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn., 8(3-4):231–357, November 2015.
  • [3] Damek Davis and Benjamin Grimmer. Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems. ArXiv e-prints, 1707.03505, July 2017.
  • [4] D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. ArXiv e-prints, February 2016.
  • [5] John Duchi and Yoram Singer. Efficient Online and Batch Learning Using Forward Backward Splitting. J. Mach. Learn. Res., 10:2899–2934, December 2009.
  • [6] Benjamin Grimmer. Radial Subgradient Method. To appear in SIAM Journal on Optimization.
  • [7] Elad Hazan and Satyen Kale. Beyond the Regret Minimization Barrier: Optimal Algorithms for Stochastic Strongly-convex Optimization. J. Mach. Learn. Res., 15(1):2489–2512, January 2014.
  • [8] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, 1212.2002, December 2012.
  • [9] Haihao Lu. “Relative-Continuity” for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent. ArXiv e-prints, 1710.04718, October 2017.
  • [10] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, Mar 1993.
  • [11] Ion Necoara, Yurii Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. To appear in Mathematical Programming.
  • [12] Angelia Nedić and Soomin Lee. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM Journal on Optimization, 24(1):84–107, 2014.
  • [13] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2004.
  • [14] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1571–1578, USA, 2012. Omnipress.
  • [15] James Renegar. “Efficient” Subgradient Methods for General Convex Optimization. SIAM Journal on Optimization, 26(4):2649–2676, 2016.
  • [16] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for svm. Mathematical Programming, 127(1):3–30, Mar 2011.
  • [17] Naun Zuselevich Shor. Minimization Methods for Non-Differentiable Functions, page 23. Springer Berlin Heidelberg, Berlin, Heidelberg, 1985.
  • [18] Jin Yu, S.V.N. Vishwanathan, Simon Günter, and Nicol N. Schraudolph. A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning. J. Mach. Learn. Res., 11:1145–1200, March 2010.