# Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions

We prove that the projected stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$.

## 1 Introduction

In this work, we consider the optimization problem

$$\min_{x\in\mathbb{R}^d}\ \varphi(x) := g(x) + r(x) \tag{1.1}$$

under the following assumptions on the functional components $g$ and $r$. Throughout, $r\colon\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is a closed convex function with a computable proximal map

$$\mathrm{prox}_{\alpha r}(x) := \operatorname*{argmin}_{y}\ \left\{ r(y) + \frac{1}{2\alpha}\|y-x\|^2 \right\},$$

while $g\colon\mathbb{R}^d\to\mathbb{R}$ is a $\rho$-weakly convex function, meaning that the assignment $x\mapsto g(x)+\frac{\rho}{2}\|x\|^2$ is convex. The above assumptions on $r$ are standard in the literature (see e.g. [18, 2, 21]), while those on $g$ deserve some commentary. The class of weakly convex functions, first introduced in English in [19], is broad. Indeed, it includes all convex functions and all smooth functions with Lipschitz continuous gradient. More generally, any function of the form $g = h\circ c$, with $h$ convex and Lipschitz and $c$ a smooth map with Lipschitz Jacobian, is weakly convex [7, Lemma 4.2]. Classical literature highlights the importance of weak convexity in optimization [26, 22, 23], while recent advances in statistical learning and signal processing have further reinvigorated the problem class (1.1). For a recent discussion on the role of weak convexity in large-scale optimization, see for example [5].

The proximal subgradient method is perhaps the simplest algorithm for the problem (1.1). Given a current iterate $x_t$, the method repeats the steps

$$\left\{\begin{aligned} &\text{Choose } \zeta_t\in\partial g(x_t)\\ &\text{Set } x_{t+1}=\mathrm{prox}_{\alpha_t r}(x_t-\alpha_t\zeta_t) \end{aligned}\right\},$$

where $\{\alpha_t\}_{t\ge 0}$ is an appropriately chosen control sequence. Here, the subdifferential $\partial g$ is meant in a standard variational analytic sense [27, Definition 8.3]; we will recall the precise definition in Section 2. When $r$ is the indicator function of a closed convex set $\mathcal{X}$, the algorithm reduces to the classical projected subgradient method. Indeed, the proximal map is then simply the nearest point projection $\mathrm{proj}_{\mathcal{X}}$.

The primary goal in nonsmooth nonconvex optimization is the search for stationary points. A point $x\in\mathbb{R}^d$ is called stationary for the problem (1.1) if the inclusion $0\in\partial\varphi(x)$ holds. In "primal terms", these are precisely the points where the directional derivative of $\varphi$ is nonnegative in every direction [27, Proposition 8.32]:

$$\mathrm{dist}(0;\partial\varphi(x)) = -\inf_{v:\,\|v\|\le 1}\varphi'(x;v). \tag{1.2}$$

It has been known since [20, 19] that the (stochastic) subgradient method with generates an iterate sequence that subsequentially converges to a stationary point of the problem. A long standing open question in this line of work is to determine the “rate of convergence” of the basic (stochastic) subgradient method and of its proximal extensions.

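An immediate difficulty in addressing this question is that it is not a priori clear how to measure the progress of the algorithm. Indeed, neither the functional suboptimality gap, $\varphi(x_t)-\min\varphi$, nor the stationarity measure, $\mathrm{dist}(0;\partial\varphi(x_t))$, necessarily tends to zero along the iterate sequence. Instead, recent literature [5, 7] has identified a different measure of complexity of minimizing weakly convex functions, based on smooth approximations. The key construction we use is the Moreau envelope:

$$\varphi_\lambda(x) := \min_{y}\ \left\{\varphi(y)+\frac{1}{2\lambda}\|y-x\|^2\right\},$$

where $\lambda>0$. Standard results show that as long as $\lambda<\rho^{-1}$, the envelope $\varphi_\lambda$ is $C^1$-smooth with the gradient given by

$$\nabla\varphi_\lambda(x) = \lambda^{-1}\left(x-\mathrm{prox}_{\lambda\varphi}(x)\right). \tag{1.3}$$

See for example [25, Theorem 31.5]. Moreover, the norm of the gradient $\|\nabla\varphi_\lambda(x)\|$ has an intuitive interpretation in terms of near-stationarity for the target problem (1.1). Namely, the definition of the Moreau envelope directly implies that for any $x\in\mathbb{R}^d$, the proximal point $\hat x := \mathrm{prox}_{\lambda\varphi}(x)$ satisfies

$$\|\hat x - x\| = \lambda\|\nabla\varphi_\lambda(x)\|,\qquad \varphi(\hat x)\le\varphi(x),\qquad \mathrm{dist}(0;\partial\varphi(\hat x))\le\|\nabla\varphi_\lambda(x)\|.$$

Thus a small gradient $\|\nabla\varphi_\lambda(x)\|$ implies that $x$ is near some point $\hat x$ that is nearly stationary for (1.1). For a longer discussion of near-stationarity, see [5] or [7, Section 4.1].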
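To make the envelope and the gradient formula (1.3) concrete, the following minimal sketch works through a one-dimensional example. The choice $\varphi(x)=|x|$ is an assumption made purely for illustration (it is convex, hence weakly convex), because its proximal map is the familiar soft-thresholding operator; the function names are hypothetical.

```python
def prox_abs(x, lam):
    """Proximal point of phi(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope phi_lam(x) = min_y { |y| + (y - x)^2 / (2 lam) }."""
    y = prox_abs(x, lam)
    return abs(y) + (y - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope via formula (1.3): (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


lam, x, h = 0.5, 0.3, 1e-6
x_hat = prox_abs(x, lam)
grad = moreau_grad_abs(x, lam)
# Central finite difference of the envelope, to compare against (1.3).
fd = (moreau_env_abs(x + h, lam) - moreau_env_abs(x - h, lam)) / (2 * h)
```

In this example the finite-difference slope of the envelope agrees with (1.3), and the distance from `x` to `x_hat` equals `lam` times the gradient norm, which is the first of the three near-stationarity relations.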

In this paper, we show that under an appropriate choice of the control sequence $\alpha_t$, the subgradient method will generate a point $x$ satisfying $\|\nabla\varphi_{1/(2\rho)}(x)\|\le\varepsilon$ after at most $O(\varepsilon^{-4})$ iterations. A similar guarantee was recently established for the proximally guided projected subgradient method [4]. This scheme proceeds by directly applying the gradient descent method to the Moreau envelope $\varphi_\lambda$, with each proximal point approximately evaluated by a convex subgradient method. In contrast, we show here that the basic subgradient method, without any modification or parameter tuning, already satisfies the desired convergence guarantees. This is perhaps surprising, since neither the Moreau envelope nor the proximal map explicitly appears in the definition of the subgradient method.

Though our results appear to be new even in this rudimentary deterministic setup, the argument we present applies much more broadly to stochastic proximal subgradient methods, in which only stochastic estimates of $\partial g$ are available. This is the setting of the paper. In this regard, we improve in two fundamental ways on the results in the seminal papers [9, 10, 28]: first, we allow $g$ to be nonsmooth, and second, we do not require the variance of our stochastic estimator for $\partial g(x_t)$ to decrease as a function of $t$. The second contribution removes the well-known "mini-batching" requirements common to [10, 28], while the first significantly expands the class of functions for which the rate of convergence of the stochastic proximal subgradient method is known. It is worthwhile to mention that our techniques crucially rely on convexity of $r$, while [28] makes no such assumption.

There is an extensive literature on stochastic subgradient methods in convex optimization, which we will not detail here; instead, we refer the interested reader to the seminal works [15, 16]. An in-depth summary of recent work for nonconvex problems appears in [4].

The outline of the paper is as follows. In Section 2.1, we present a simplified argument for the case in which $r$ is the indicator function of a closed convex set and the stochastic estimator has finite second moment. In this section, we also comment on improved rates in the convex setting. In Section 2.2, we prove convergence of the stochastic proximal subgradient method in full generality. In Section 2.3, we adapt the results of the previous section to the case in which $g$ is smooth and the stochastic estimator has finite variance.

## 2 Convergence guarantees

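Henceforth, we assume that the only access to $g$ is through a stochastic subgradient oracle. Formally, we fix a probability space $(\Omega,\mathcal{F},P)$ and equip $\mathbb{R}^d$ with the Borel $\sigma$-algebra. We make the following three standard assumptions:

1. (A1) It is possible to generate i.i.d. realizations $\xi_1,\xi_2,\ldots\sim dP$.

2. (A2) There is an open set $U$ containing $\mathrm{dom}\,r$ and a measurable mapping $G\colon U\times\Omega\to\mathbb{R}^d$ satisfying $\mathbb{E}_\xi[G(x,\xi)]\in\partial g(x)$ for all $x\in U$.

3. (A3) There is a real $L\ge 0$ such that the inequality $\mathbb{E}_\xi\left[\|G(x,\xi)\|^2\right]\le L^2$ holds for all $x\in\mathrm{dom}\,r$.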

Some comments are in order. First, the symbol $\partial g(x)$ refers to the subdifferential of $g$ at $x$. By definition, this is the set consisting of all vectors $v\in\mathbb{R}^d$ satisfying

$$g(y)\ge g(x)+\langle v,y-x\rangle+o(\|y-x\|)\quad\text{as } y\to x.$$

Weak convexity automatically guarantees that subgradients of $g$ satisfy the much stronger property [27, Theorem 12.17]:

$$g(y)\ge g(x)+\langle v,y-x\rangle-\frac{\rho}{2}\|y-x\|^2,\qquad\forall x,y\in\mathbb{R}^d,\ v\in\partial g(x). \tag{2.1}$$

One important consequence we will use is the hypo-monotonicity inequality:

$$\langle v-w,x-y\rangle\ge-\rho\|x-y\|^2,\qquad\forall x,y\in\mathbb{R}^d,\ v\in\partial g(x),\ w\in\partial g(y). \tag{2.2}$$

The three assumptions (A1), (A2), and (A3) are standard in the literature on stochastic subgradient methods. Indeed, assumptions (A1) and (A2) are identical to assumptions (A1) and (A2) in [15], while assumption (A3) is the same as the assumption listed in [15, Equation (2.5)].

In this work, we investigate the efficiency of the proximal stochastic subgradient method, described in Algorithm 1.
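The listing of Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of a proximal stochastic subgradient method consistent with the update from Section 1 and with the weighted averages appearing in Theorems 2.1 and 2.5. It assumes that the output index $t^*$ is sampled from $\{0,\ldots,T\}$ with probability proportional to $\alpha_t$. The test problem (the function $g(x)=|\|x\|^2-1|$, the ball constraint, the Gaussian noise) and all function names are hypothetical choices for illustration.

```python
import random


def proximal_stochastic_subgradient(x0, oracle, prox, alphas, rng):
    """Sketch of a proximal stochastic subgradient method.

    x0     : starting point in dom r (list of floats)
    oracle : oracle(x, rng) -> stochastic subgradient estimate G(x, xi)
    prox   : prox(v, alpha) -> prox_{alpha r}(v)
    alphas : stepsizes alpha_0, ..., alpha_T
    Returns (x_{t*}, t*), where t* is drawn with probability
    proportional to alpha_t (an assumed output rule).
    """
    x = list(x0)
    iterates = []
    for alpha in alphas:
        iterates.append(x)
        g = oracle(x, rng)
        x = prox([xi - alpha * gi for xi, gi in zip(x, g)], alpha)
    t_star = rng.choices(range(len(alphas)), weights=alphas, k=1)[0]
    return iterates[t_star], t_star


# Hypothetical test problem: g(x) = | ||x||^2 - 1 |, which is 2-weakly convex,
# with r the indicator of the Euclidean ball of radius 2.
def noisy_subgradient(x, rng, noise=0.1):
    sign = 1.0 if sum(v * v for v in x) >= 1.0 else -1.0
    return [2.0 * sign * v + rng.gauss(0.0, noise) for v in x]


def project_ball(v, alpha, radius=2.0):
    # For an indicator function the prox is a projection, independent of alpha.
    norm = sum(c * c for c in v) ** 0.5
    return list(v) if norm <= radius else [radius * c / norm for c in v]


T = 999
gamma = 0.1
alphas = [gamma / (T + 1) ** 0.5] * (T + 1)  # constant stepsize of order 1/sqrt(T+1)
x_out, t_star = proximal_stochastic_subgradient(
    [1.5, -0.5], noisy_subgradient, project_ball, alphas, random.Random(0)
)
```

With a constant stepsize the assumed output rule reduces to picking an iterate uniformly at random, which is why the guarantees are stated in expectation over both the oracle noise and the choice of $t^*$.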

Henceforth, the symbol $\mathbb{E}_t[\cdot]$ will denote the expectation conditioned on all the realizations $\xi_0,\xi_1,\ldots,\xi_{t-1}$.

### 2.1 Projected stochastic subgradient method

Our analysis of Algorithm 1 is shorter and more transparent when $r$ is the indicator function of a closed, convex set $\mathcal{X}$. This is not surprising, since projected subgradient methods are typically much easier to analyze than their proximal extensions (e.g. [8, 3]). Note that (1.1) then reduces to the constrained problem

$$\min_{x\in\mathcal{X}}\ g(x), \tag{2.3}$$

and the proximal map becomes the nearest point projection $\mathrm{proj}_{\mathcal{X}}$. Thus throughout Section 2.1, we suppose that Assumptions (A1), (A2), and (A3) hold and that $r$ is the indicator function of a closed convex set $\mathcal{X}$. The following is the main result of this section.

###### Theorem 2.1 (Stochastic projected subgradient method).

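Let $x_{t^*}$ be the point returned by Algorithm 1. Then in terms of any constant $\hat\rho>\rho$, the estimate holds:

$$\mathbb{E}\left[\|\nabla\varphi_{1/\hat\rho}(x_{t^*})\|^2\right]\le\frac{\hat\rho}{\hat\rho-\rho}\cdot\frac{\left(\varphi_{1/\hat\rho}(x_0)-\min\varphi\right)+\frac{\hat\rho L^2}{2}\sum_{t=0}^{T}\alpha_t^2}{\sum_{t=0}^{T}\alpha_t}.$$

###### Proof.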

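Let $x_0,\ldots,x_T$ denote the points generated by Algorithm 1. For each index $t$, define $\zeta_t:=\mathbb{E}_t[G(x_t,\xi_t)]\in\partial g(x_t)$ and set $\hat x_t:=\mathrm{prox}_{\varphi/\hat\rho}(x_t)$. We successively deduce

$$\begin{aligned}
\mathbb{E}_t\left[\varphi_{1/\hat\rho}(x_{t+1})\right] &\le \mathbb{E}_t\left[g(\hat x_t)+\frac{\hat\rho}{2}\|x_{t+1}-\hat x_t\|^2\right] && (2.4)\\
&\le g(\hat x_t)+\frac{\hat\rho}{2}\,\mathbb{E}_t\left[\|(x_t-\hat x_t)-\alpha_t G(x_t,\xi_t)\|^2\right] && (2.5)\\
&\le g(\hat x_t)+\frac{\hat\rho}{2}\|x_t-\hat x_t\|^2+\hat\rho\alpha_t\,\mathbb{E}_t\left[\langle\hat x_t-x_t,G(x_t,\xi_t)\rangle\right]+\frac{\alpha_t^2\hat\rho L^2}{2}\\
&\le \varphi_{1/\hat\rho}(x_t)+\hat\rho\alpha_t\langle\hat x_t-x_t,\zeta_t\rangle+\frac{\alpha_t^2\hat\rho L^2}{2}\\
&\le \varphi_{1/\hat\rho}(x_t)+\hat\rho\alpha_t\left(g(\hat x_t)-g(x_t)+\frac{\rho}{2}\|x_t-\hat x_t\|^2\right)+\frac{\alpha_t^2\hat\rho L^2}{2}, && (2.6)
\end{aligned}$$

where (2.4) follows directly from the definition of the proximal map, the inequality (2.5) uses that the projection $\mathrm{proj}_{\mathcal{X}}$ is $1$-Lipschitz, and (2.6) follows from weak convexity of $g$, namely the inequality (2.1).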

Using the law of total expectation to unfold this recursion yields:

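$$\mathbb{E}\left[\varphi_{1/\hat\rho}(x_{T+1})\right]\le\varphi_{1/\hat\rho}(x_0)+\frac{\hat\rho L^2}{2}\sum_{t=0}^{T}\alpha_t^2-\hat\rho\,\mathbb{E}\sum_{t=0}^{T}\alpha_t\left(g(x_t)-g(\hat x_t)-\frac{\rho}{2}\|x_t-\hat x_t\|^2\right).$$

Lower-bounding the left-hand side by $\min\varphi$ and rearranging, we obtain the bound:

$$\frac{1}{\sum_{t=0}^{T}\alpha_t}\sum_{t=0}^{T}\alpha_t\,\mathbb{E}\left[g(x_t)-g(\hat x_t)-\frac{\rho}{2}\|x_t-\hat x_t\|^2\right]\le\frac{\left(\varphi_{1/\hat\rho}(x_0)-\min\varphi\right)+\frac{\hat\rho L^2}{2}\sum_{t=0}^{T}\alpha_t^2}{\hat\rho\sum_{t=0}^{T}\alpha_t}. \tag{2.7}$$

Notice that the left-hand side of (2.7) is precisely $\mathbb{E}\left[g(x_{t^*})-g(\hat x_{t^*})-\frac{\rho}{2}\|x_{t^*}-\hat x_{t^*}\|^2\right]$. Next, observe that the function $x\mapsto g(x)+\frac{\hat\rho}{2}\|x-x_{t^*}\|^2$ is strongly convex with parameter $\hat\rho-\rho$ and is minimized over $\mathcal{X}$ at $\hat x_{t^*}$, and therefore

$$\begin{aligned}
g(x_{t^*})-g(\hat x_{t^*})-\frac{\rho}{2}\|x_{t^*}-\hat x_{t^*}\|^2 &= \left(g(x_{t^*})+\frac{\hat\rho}{2}\|x_{t^*}-x_{t^*}\|^2\right)-\left(g(\hat x_{t^*})+\frac{\hat\rho}{2}\|x_{t^*}-\hat x_{t^*}\|^2\right)+\frac{\hat\rho-\rho}{2}\|x_{t^*}-\hat x_{t^*}\|^2\\
&\ge(\hat\rho-\rho)\|x_{t^*}-\hat x_{t^*}\|^2=\frac{\hat\rho-\rho}{\hat\rho^2}\|\nabla\varphi_{1/\hat\rho}(x_{t^*})\|^2,
\end{aligned}$$

where the last equality follows from (1.3). Using this estimate to lower bound the left-hand side of (2.7) completes the proof. ∎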

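In particular, using a constant stepsize on the order of $1/\sqrt{T+1}$ yields the following complexity guarantee.

###### Corollary 2.2 (Complexity guarantee).

Fix an index $T>0$ and set the constant steplength $\alpha_t=\frac{\gamma}{\sqrt{T+1}}$ for some real $\gamma>0$. Then the point $x_{t^*}$ returned by Algorithm 1 satisfies:

$$\mathbb{E}\left[\|\nabla\varphi_{1/(2\rho)}(x_{t^*})\|^2\right]\le 2\cdot\frac{\left(\varphi_{1/(2\rho)}(x_0)-\min\varphi\right)+\rho L^2\gamma^2}{\gamma\sqrt{T+1}}. \tag{2.8}$$

###### Proof.

This follows immediately from Theorem 2.1 by setting $\hat\rho=2\rho$. ∎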

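Let us look closer at the guarantee of Corollary 2.2 by minimizing the bound in $\gamma$. Namely, suppose we have available some real $R>0$ satisfying $R\ge\varphi_{1/(2\rho)}(x_0)-\min\varphi$. We deduce from (2.8) the estimate $\mathbb{E}\left[\|\nabla\varphi_{1/(2\rho)}(x_{t^*})\|^2\right]\le 2\cdot\frac{R+\rho L^2\gamma^2}{\gamma\sqrt{T+1}}$. Minimizing the right-hand side in $\gamma$ yields the choice $\gamma=\sqrt{\frac{R}{\rho L^2}}$ and therefore the guarantee

$$\mathbb{E}\left[\|\nabla\varphi_{1/(2\rho)}(x_{t^*})\|^2\right]\le 4\cdot\sqrt{\frac{\rho RL^2}{T+1}}. \tag{2.9}$$

In particular, suppose that $g$ is $L$-Lipschitz and the diameter of $\mathcal{X}$ is bounded by some $D>0$. Then we may set $R:=\min\{\rho D^2,DL\}$, where the first term follows from the definition of the Moreau envelope and the second follows from Lipschitz continuity. Then the number of subgradient evaluations required to find a point $x$ satisfying $\mathbb{E}\|\nabla\varphi_{1/(2\rho)}(x)\|\le\varepsilon$ is at most

$$\left\lceil\frac{16\cdot(\rho LD)^2\cdot\min\left\{1,\frac{L}{\rho D}\right\}}{\varepsilon^4}\right\rceil. \tag{2.10}$$

This complexity in $\varepsilon$ matches the guarantees of the stochastic gradient method for finding an $\varepsilon$-stationary point of a smooth function [9, Corollary 2.2].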
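The arithmetic linking (2.8), (2.9), and (2.10) can be checked numerically. The sketch below simply evaluates the three formulas as stated; the function names and the sample values of $\rho$, $L$, $D$, and $\varepsilon$ are hypothetical and chosen only for illustration.

```python
import math


def bound_2_8(gap, rho, L, gamma, T):
    """Right-hand side of (2.8) for the stepsize gamma / sqrt(T + 1)."""
    return 2.0 * (gap + rho * L**2 * gamma**2) / (gamma * math.sqrt(T + 1))


def optimal_gamma(R, rho, L):
    """Minimizer in gamma of (2.8) with the gap replaced by its upper bound R."""
    return math.sqrt(R / (rho * L**2))


def bound_2_9(R, rho, L, T):
    """Right-hand side of (2.9)."""
    return 4.0 * math.sqrt(rho * R * L**2 / (T + 1))


def evaluations_2_10(rho, L, D, eps):
    """Number of subgradient evaluations given by (2.10)."""
    return math.ceil(16.0 * (rho * L * D) ** 2 * min(1.0, L / (rho * D)) / eps**4)


rho, L, D, eps = 1.0, 2.0, 3.0, 0.5
R = min(rho * D**2, D * L)           # upper bound on the initial gap
N = evaluations_2_10(rho, L, D, eps)  # N plays the role of T + 1
gamma_star = optimal_gamma(R, rho, L)
```

Plugging `gamma_star` into `bound_2_8` reproduces `bound_2_9`, and running `N` evaluations (that is, $T+1=N$) brings the bound (2.9) down to at most $\varepsilon^2$, from which $\mathbb{E}\|\nabla\varphi_{1/(2\rho)}(x_{t^*})\|\le\varepsilon$ follows by Jensen's inequality.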

It is intriguing to ask if the complexity (2.10) can be improved when $g$ is a convex function. The answer, unsurprisingly, is yes. Since $g$ is convex, here and for the rest of the section, we will let the constant $\rho>0$ be arbitrary. As a first attempt, one may follow the observation of Nesterov [17] for smooth minimization. The idea is that the right-hand side of the complexity bound (2.9) depends on the initial gap $\varphi_{1/(2\rho)}(x_0)-\min\varphi$. We can make this quantity as small as we wish by a separate subgradient method. Namely, we may simply run a subgradient method for iterations to decrease the gap to ; see for example [12, Proposition 5.5] for this basic guarantee. Then we run another round of a subgradient method for iterations using the optimal choice . A quick computation shows that the resulting scheme will find a point $x$ satisfying $\mathbb{E}\|\nabla\varphi_{1/(2\rho)}(x)\|\le\varepsilon$ after at most iterations.

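This complexity can be improved slightly by first regularizing the problem. We will only outline the procedure here, since the details are standard and easy to verify. Define the function $\hat\varphi(x):=\varphi(x)+\frac{\mu}{2}\|x-x_c\|^2$, for some $\mu>0$ and arbitrary $x_c\in\mathcal{X}$. We will apply optimization algorithms to $\hat\varphi$ instead of $\varphi$, and therefore we must relate their Moreau envelopes. Fixing an arbitrary $\lambda>0$, it is straightforward to verify the following equality by completing the square in the Moreau envelope:

$$\hat\varphi_{1/\lambda}(x)=\varphi_{1/(\lambda+\mu)}\left(\frac{\mu}{\mu+\lambda}x_c+\frac{\lambda}{\mu+\lambda}x\right)+\frac{\lambda\mu}{2(\mu+\lambda)}\|x-x_c\|^2.$$

Differentiating in $x$ and using $\|x-x_c\|\le D$ yields the bound

$$\left\|\nabla\varphi_{1/(\lambda+\mu)}\left(\frac{\mu}{\mu+\lambda}x_c+\frac{\lambda}{\mu+\lambda}x\right)\right\|\le\frac{\lambda+\mu}{\lambda}\left\|\nabla\hat\varphi_{1/\lambda}(x)\right\|+\mu D. \tag{2.11}$$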
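The completing-the-square identity above can be sanity-checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: it takes $\varphi(x)=|x|$ with no constraint, evaluates the left-hand side by brute-force minimization on a grid, and evaluates the right-hand side with the closed-form envelope of the absolute value. The function names and sample values are hypothetical.

```python
def env_abs(z, lam):
    """Envelope of |.| in the convention phi_{1/lam}(z) = min_y |y| + (lam/2)(y - z)^2."""
    if abs(z) <= 1.0 / lam:
        return 0.5 * lam * z * z
    return abs(z) - 0.5 / lam


def lhs_bruteforce(x, x_c, lam, mu, lo=-5.0, hi=5.0, n=200001):
    """Envelope of phi_hat(y) = |y| + (mu/2)(y - x_c)^2 at x, by grid search over y."""
    best = float("inf")
    for i in range(n):
        y = lo + (hi - lo) * i / (n - 1)
        val = abs(y) + 0.5 * mu * (y - x_c) ** 2 + 0.5 * lam * (y - x) ** 2
        best = min(best, val)
    return best


def rhs_identity(x, x_c, lam, mu):
    """Right-hand side of the identity relating the two envelopes."""
    z = (mu * x_c + lam * x) / (mu + lam)
    return env_abs(z, lam + mu) + lam * mu / (2.0 * (mu + lam)) * (x - x_c) ** 2


x, x_c, lam, mu = 1.7, -0.4, 1.5, 0.5
lhs = lhs_bruteforce(x, x_c, lam, mu)
rhs = rhs_identity(x, x_c, lam, mu)
```

The two values agree up to the grid resolution, which reflects that the sum of the two quadratics in $y$ is a single quadratic centered at the convex combination of $x_c$ and $x$, plus a term independent of $y$.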

Thus, supposing $\varepsilon\le 2\rho D$, we may set $\mu=\frac{\varepsilon}{2D}$ and $\lambda=2\rho-\mu$, obtaining the estimate

$$\left\|\nabla\varphi_{1/(2\rho)}\left(\frac{\mu}{\mu+\lambda}x_c+\frac{\lambda}{\mu+\lambda}x\right)\right\|\le 2\left\|\nabla\hat\varphi_{1/\lambda}(x)\right\|+\frac{\varepsilon}{2}.$$

Hence, if we find a point $x$ satisfying $\mathbb{E}\|\nabla\hat\varphi_{1/\lambda}(x)\|\le\varepsilon/4$, then the convex combination $\bar x:=\frac{\mu}{\mu+\lambda}x_c+\frac{\lambda}{\mu+\lambda}x$ would satisfy $\mathbb{E}\|\nabla\varphi_{1/(2\rho)}(\bar x)\|\le\varepsilon$, as desired. Let us now apply the two-stage procedure to the strongly convex function $\hat\varphi$. We first apply the projected stochastic subgradient method [14] for iterations yielding the estimate . Then we apply the subgradient method (Algorithm 1) for iterations on $\hat\varphi$ with an optimal step-size . Solving for in terms of , a quick computation shows that the resulting scheme will find a point $x$ satisfying $\mathbb{E}\|\nabla\hat\varphi_{1/\lambda}(x)\|\le\varepsilon/4$ after at most iterations. By following a completely different technique, introduced by Allen-Zhu [1] for smooth stochastic minimization, this complexity can be even further improved to by running logarithmically many rounds of the subgradient method. Since this procedure and its analysis are somewhat technical and independent of the rest of the material, we have placed them in a supplementary text that can be found at www.math.washington.edu/ddrusv/sms.pdf

### 2.2 Proximal stochastic subgradient method

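We next move on to convergence guarantees of Algorithm 1 in full generality, which is the main result of this work. To this end, in this section, in addition to assumptions (A1), (A2), and (A3), we will also assume that $g$ is $L$-Lipschitz.

We break up the analysis of Algorithm 1 into two lemmas. Henceforth, fix a real $\hat\rho>\rho$. Let $\{x_t\}_{t\ge 0}$ be the iterates produced by Algorithm 1 and let $\{\xi_t\}_{t\ge 0}$ be the i.i.d. realizations used. For each index $t$, define $\zeta_t:=\mathbb{E}_t[G(x_t,\xi_t)]\in\partial g(x_t)$ and set $\hat x_t:=\mathrm{prox}_{\varphi/\hat\rho}(x_t)$. Observe that by the optimality conditions of the proximal map, there exists a vector $\hat\zeta_t\in\partial g(\hat x_t)$ satisfying $\hat\rho(x_t-\hat x_t)\in\partial r(\hat x_t)+\hat\zeta_t$. The following lemma realizes $\hat x_t$ as a proximal point of $r$.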

###### Lemma 2.3.

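For each index $t\ge 0$, equality holds:

$$\hat x_t=\mathrm{prox}_{\alpha_t r}\left(\alpha_t\hat\rho x_t-\alpha_t\hat\zeta_t+(1-\alpha_t\hat\rho)\hat x_t\right).$$

###### Proof.

By the definition of $\hat\zeta_t$, we have

$$\begin{aligned}
\alpha_t\hat\rho(x_t-\hat x_t)\in\alpha_t\partial r(\hat x_t)+\alpha_t\hat\zeta_t &\iff \alpha_t\hat\rho x_t-\alpha_t\hat\zeta_t+(1-\alpha_t\hat\rho)\hat x_t\in\hat x_t+\alpha_t\partial r(\hat x_t)\\
&\iff \hat x_t=\mathrm{prox}_{\alpha_t r}\left(\alpha_t\hat\rho x_t-\alpha_t\hat\zeta_t+(1-\alpha_t\hat\rho)\hat x_t\right),
\end{aligned}$$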

where the last equivalence follows from the optimality conditions for the proximal subproblem. This completes the proof. ∎
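Lemma 2.3 can be checked by hand on a small example. The sketch below assumes a hypothetical one-dimensional instance, $g(x)=cx-\frac{\rho}{2}x^2$ (which is $\rho$-weakly convex and smooth) and $r(x)=\tau|x|$, for which both proximal maps have closed forms through soft-thresholding. All names and numerical values are illustrative assumptions.

```python
def soft_threshold(v, tau):
    """prox of tau * |.| evaluated at v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


# Hypothetical instance: g(x) = c*x - (rho/2)*x^2 and r(x) = tau*|x|.
rho, rho_hat, c, tau = 1.0, 2.0, 0.3, 0.5
x_t, alpha = 1.2, 0.1

# x_hat = prox_{phi/rho_hat}(x_t) minimizes
#   c*y - (rho/2)*y^2 + tau*|y| + (rho_hat/2)*(y - x_t)^2,
# a strongly convex problem whose solution is a scaled soft-threshold.
x_hat = soft_threshold(rho_hat * x_t - c, tau) / (rho_hat - rho)

# g is smooth here, so zeta_hat is the gradient of g at x_hat.
zeta_hat = c - rho * x_hat

# Right-hand side of Lemma 2.3: prox_{alpha r} of the shifted point.
shifted = alpha * rho_hat * x_t - alpha * zeta_hat + (1 - alpha * rho_hat) * x_hat
x_hat_again = soft_threshold(shifted, alpha * tau)
```

Here `x_hat_again` reproduces `x_hat`, so the proximal point of $\varphi$ is a fixed point of a proximal step on $r$ alone; this is what lets Lemma 2.4 compare $x_{t+1}$ and $\hat x_t$ through nonexpansiveness of a single proximal map.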

The next lemma establishes a crucial descent property for the iterates.

###### Lemma 2.4.

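Suppose $\hat\rho\in(\rho,2\rho]$ and we have $\alpha_t\le 1/\hat\rho$ for all indices $t\ge 0$. Then the inequality holds:

$$\mathbb{E}_t\|x_{t+1}-\hat x_t\|^2\le\|x_t-\hat x_t\|^2+2\alpha_t^2L^2-2\alpha_t(\hat\rho-\rho)\|x_t-\hat x_t\|^2.$$

###### Proof.

We successively deduce

$$\begin{aligned}
\mathbb{E}_t\|x_{t+1}-\hat x_t\|^2 &= \mathbb{E}_t\left\|\mathrm{prox}_{\alpha_t r}(x_t-\alpha_tG(x_t,\xi_t))-\mathrm{prox}_{\alpha_t r}\left(\alpha_t\hat\rho x_t-\alpha_t\hat\zeta_t+(1-\alpha_t\hat\rho)\hat x_t\right)\right\|^2 && (2.12)\\
&\le \mathbb{E}_t\left\|x_t-\alpha_tG(x_t,\xi_t)-\left(\alpha_t\hat\rho x_t-\alpha_t\hat\zeta_t+(1-\alpha_t\hat\rho)\hat x_t\right)\right\|^2 && (2.13)\\
&= \mathbb{E}_t\left\|(1-\alpha_t\hat\rho)(x_t-\hat x_t)-\alpha_t(G(x_t,\xi_t)-\hat\zeta_t)\right\|^2 && (2.14)\\
&= (1-\alpha_t\hat\rho)^2\|x_t-\hat x_t\|^2-2(1-\alpha_t\hat\rho)\alpha_t\,\mathbb{E}_t\left[\langle x_t-\hat x_t,G(x_t,\xi_t)-\hat\zeta_t\rangle\right]+\alpha_t^2\,\mathbb{E}_t\|G(x_t,\xi_t)-\hat\zeta_t\|^2\\
&= (1-\alpha_t\hat\rho)^2\|x_t-\hat x_t\|^2-2(1-\alpha_t\hat\rho)\alpha_t\langle x_t-\hat x_t,\zeta_t-\hat\zeta_t\rangle+\alpha_t^2\,\mathbb{E}_t\|G(x_t,\xi_t)-\hat\zeta_t\|^2\\
&\le (1-\alpha_t\hat\rho)^2\|x_t-\hat x_t\|^2+2(1-\alpha_t\hat\rho)\alpha_t\rho\|x_t-\hat x_t\|^2+2\alpha_t^2L^2 && (2.15)\\
&= \|x_t-\hat x_t\|^2+2\alpha_t^2L^2-\left(2\alpha_t(\hat\rho-\rho)+\alpha_t^2\hat\rho(2\rho-\hat\rho)\right)\|x_t-\hat x_t\|^2,
\end{aligned}$$

where (2.12) follows from Lemma 2.3, the inequality (2.13) uses that $\mathrm{prox}_{\alpha_t r}$ is $1$-Lipschitz [27, Proposition 12.19], and (2.15) follows from the inequality (2.2) together with $1-\alpha_t\hat\rho\ge 0$. The result now follows from the assumed inequality $\hat\rho\le 2\rho$. ∎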

With Lemma 2.4 proved, we can now establish convergence guarantees of Algorithm 1 in full generality.

###### Theorem 2.5 (Stochastic proximal subgradient method).

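Fix a real $\hat\rho\in(\rho,2\rho]$ and a stepsize sequence $\alpha_t\in(0,1/\hat\rho]$. Then the point $x_{t^*}$ returned by Algorithm 1 satisfies:

$$\mathbb{E}\left[\|\nabla\varphi_{1/\hat\rho}(x_{t^*})\|^2\right]\le\frac{\hat\rho}{\hat\rho-\rho}\cdot\frac{\left(\varphi_{1/\hat\rho}(x_0)-\min\varphi\right)+\hat\rho L^2\sum_{t=0}^{T}\alpha_t^2}{\sum_{t=0}^{T}\alpha_t}.$$

###### Proof.

We successively observe

$$\begin{aligned}
\mathbb{E}_t\left[\varphi_{1/\hat\rho}(x_{t+1})\right] &\le \mathbb{E}_t\left[\varphi(\hat x_t)+\frac{\hat\rho}{2}\|x_{t+1}-\hat x_t\|^2\right]\\
&\le \varphi(\hat x_t)+\frac{\hat\rho}{2}\left[\|x_t-\hat x_t\|^2+2\alpha_t^2L^2-2\alpha_t(\hat\rho-\rho)\|x_t-\hat x_t\|^2\right]\\
&= \varphi_{1/\hat\rho}(x_t)+\hat\rho\left[\alpha_t^2L^2-\alpha_t(\hat\rho-\rho)\|x_t-\hat x_t\|^2\right],
\end{aligned}$$

where the first inequality follows directly from the definition of the proximal map and the second follows from Lemma 2.4.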

Using the law of total expectation to unfold this recursion yields:

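$$\mathbb{E}\left[\varphi_{1/\hat\rho}(x_{T+1})\right]\le\varphi_{1/\hat\rho}(x_0)+\hat\rho L^2\sum_{t=0}^{T}\alpha_t^2-\hat\rho(\hat\rho-\rho)\,\mathbb{E}\sum_{t=0}^{T}\alpha_t\|x_t-\hat x_t\|^2.$$

Next, using the inequality $\varphi_{1/\hat\rho}(x_{T+1})\ge\min\varphi$ and rearranging, we obtain the bound:

$$\frac{1}{\sum_{t=0}^{T}\alpha_t}\sum_{t=0}^{T}\alpha_t\,\mathbb{E}\|x_t-\hat x_t\|^2\le\frac{\left(\varphi_{1/\hat\rho}(x_0)-\min\varphi\right)+\hat\rho L^2\sum_{t=0}^{T}\alpha_t^2}{\hat\rho(\hat\rho-\rho)\sum_{t=0}^{T}\alpha_t}. \tag{2.16}$$

To complete the proof, observe that the left-hand side is exactly $\mathbb{E}\|x_{t^*}-\hat x_{t^*}\|^2$, while from (1.3) we have the equality $\|x_{t^*}-\hat x_{t^*}\|=\hat\rho^{-1}\|\nabla\varphi_{1/\hat\rho}(x_{t^*})\|$. ∎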

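In particular, using a constant stepsize on the order of $1/\sqrt{T+1}$ yields the following complexity guarantee.

###### Corollary 2.6 (Complexity guarantee).

Fix a constant $\gamma\in\left(0,\frac{1}{2\rho}\right]$ and an index $T$, and set the constant steplength $\alpha_t=\frac{\gamma}{\sqrt{T+1}}$. Then the point $x_{t^*}$ returned by Algorithm 1 satisfies:

$$\mathbb{E}\left[\|\nabla\varphi_{1/(2\rho)}(x_{t^*})\|^2\right]\le 2\cdot\frac{\left(\varphi_{1/(2\rho)}(x_0)-\min\varphi\right)+2\rho L^2\gamma^2}{\gamma\sqrt{T+1}}.$$

###### Proof.

This follows immediately from Theorem 2.5 by setting $\hat\rho=2\rho$, since then $\sum_{t=0}^{T}\alpha_t^2=\gamma^2$ and $\sum_{t=0}^{T}\alpha_t=\gamma\sqrt{T+1}$. ∎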

### 2.3 Proximal stochastic gradient for smooth minimization

Let us now look at the consequences of our results in the setting when $g$ is $C^1$-smooth with $\rho$-Lipschitz gradient. Note that $g$ is then automatically $\rho$-weakly convex. In this smooth setting, it is common to replace assumption (A3) with the finite variance condition:

• There is a real $\sigma\ge 0$ such that the inequality $\mathbb{E}_\xi\left[\|G(x,\xi)-\nabla g(x)\|^2\right]\le\sigma^2$ holds for all $x\in\mathrm{dom}\,r$.

Henceforth, let us therefore assume that $g$ is $C^1$-smooth with $\rho$-Lipschitz gradient and that Assumptions (A1), (A2), and the finite variance condition hold.

All of the results in Section 2.2 can be easily modified to apply to this setting. In particular, Lemma 2.3 holds verbatim, while Lemma 2.4 extends as follows.

###### Lemma 2.7.

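Fix a real $\hat\rho>\rho$ and a sequence $\alpha_t\in(0,1/\hat\rho]$. Then the inequality holds:

$$\mathbb{E}_t\|x_{t+1}-\hat x_t\|^2\le\|x_t-\hat x_t\|^2+\alpha_t^2\sigma^2-\alpha_t(\hat\rho-\rho)\|x_t-\hat x_t\|^2.$$

###### Proof.

By the same argument as in Lemma 2.4, we arrive at the inequality (2.14) with $\hat\zeta_t=\nabla g(\hat x_t)$. Adding and subtracting $\nabla g(x_t)$, we successively deduce

$$\begin{aligned}
\mathbb{E}_t\|x_{t+1}-\hat x_t\|^2 &\le \mathbb{E}_t\left\|(1-\alpha_t\hat\rho)(x_t-\hat x_t)-\alpha_t(G(x_t,\xi_t)-\nabla g(\hat x_t))\right\|^2\\
&= \mathbb{E}_t\left\|(1-\alpha_t\hat\rho)(x_t-\hat x_t)-\alpha_t(\nabla g(x_t)-\nabla g(\hat x_t))-\alpha_t(G(x_t,\xi_t)-\nabla g(x_t))\right\|^2\\
&= \left\|(1-\alpha_t\hat\rho)(x_t-\hat x_t)-\alpha_t(\nabla g(x_t)-\nabla g(\hat x_t))\right\|^2+\alpha_t^2\,\mathbb{E}_t\|G(x_t,\xi_t)-\nabla g(x_t)\|^2 && (2.17)\\
&\le (1-\alpha_t\hat\rho)^2\|x_t-\hat x_t\|^2-2(1-\alpha_t\hat\rho)\alpha_t\langle x_t-\hat x_t,\nabla g(x_t)-\nabla g(\hat x_t)\rangle\\
&\qquad+\alpha_t^2\|\nabla g(x_t)-\nabla g(\hat x_t)\|^2+\alpha_t^2\sigma^2 && (2.18)\\
&\le (1-\alpha_t\hat\rho)^2\|x_t-\hat x_t\|^2+2(1-\alpha_t\hat\rho)\alpha_t\rho\|x_t-\hat x_t\|^2+\rho^2\alpha_t^2\|x_t-\hat x_t\|^2+\alpha_t^2\sigma^2 && (2.19)\\
&= \|x_t-\hat x_t\|^2+\alpha_t^2\sigma^2-\left(2\alpha_t(\hat\rho-\rho)+\alpha_t^2\hat\rho(2\rho-\hat\rho)-\rho^2\alpha_t^2\right)\|x_t-\hat x_t\|^2\\
&= \|x_t-\hat x_t\|^2+\alpha_t^2\sigma^2-\alpha_t(\hat\rho-\rho)\left(2-\alpha_t(\hat\rho-\rho)\right)\|x_t-\hat x_t\|^2,
\end{aligned}$$

where (2.17) follows from assumption (A2), namely $\mathbb{E}_t[G(x_t,\xi_t)]=\nabla g(x_t)$, inequality (2.18) follows by expanding the square and using the finite variance condition, and inequality (2.19) follows from (2.2) and Lipschitz continuity of $\nabla g$. The assumption $\alpha_t\le 1/\hat\rho$ guarantees $2-\alpha_t(\hat\rho-\rho)\ge 1$. The result follows. ∎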
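The last two equalities in the proof are pure algebra in $\alpha_t$, $\rho$, and $\hat\rho$, and can be spot-checked numerically. The sketch below is an illustration only: it samples hypothetical parameter values with $\hat\rho>\rho$ and $\alpha\le 1/\hat\rho$, and compares the coefficient of $\|x_t-\hat x_t\|^2$ in (2.19) with its factored form.

```python
import random

rng = random.Random(1)
max_gap = 0.0      # largest discrepancy between expanded and factored forms
worst_slack = 0.0  # largest violation of factored <= 1 - alpha*(rho_hat - rho)

for _ in range(1000):
    rho = rng.uniform(0.1, 5.0)
    rho_hat = rho + rng.uniform(0.01, 5.0)    # any rho_hat > rho
    alpha = rng.uniform(0.0, 1.0 / rho_hat)   # stepsize in [0, 1/rho_hat]
    d = rho_hat - rho

    # Coefficient of ||x_t - x_hat_t||^2 on the right-hand side of (2.19).
    expanded = (
        (1 - alpha * rho_hat) ** 2
        + 2 * (1 - alpha * rho_hat) * alpha * rho
        + (rho * alpha) ** 2
    )
    # Factored form from the final line of the proof.
    factored = 1 - alpha * d * (2 - alpha * d)

    max_gap = max(max_gap, abs(expanded - factored))
    worst_slack = max(worst_slack, factored - (1 - alpha * d))
```

Both quantities stay at the level of rounding error, consistent with the observation that the expanded coefficient is the perfect square $(1-\alpha_t(\hat\rho-\rho))^2$ and that $\alpha_t(\hat\rho-\rho)\le 1$ under the stepsize restriction.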

We can now state the convergence guarantees of the proximal stochastic gradient method. The proof is completely analogous to that of Theorem 2.5, with Lemma 2.7 playing the role of Lemma 2.4.

###### Corollary 2.8 (Stochastic proximal gradient method for smooth minimization).

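Fix a real $\hat\rho>\rho$ and a stepsize sequence $\alpha_t\in(0,1/\hat\rho]$. Then the point $x_{t^*}$ returned by Algorithm 1 satisfies:

$$\mathbb{E}\left[\|\nabla\varphi_{1/\hat\rho}(x_{t^*})\|^2\right]\le\frac{2\hat\rho}{\hat\rho-\rho}\cdot\frac{\left(\varphi_{1/\hat\rho}(x_0)-\min\varphi\right)+\frac{\hat\rho\sigma^2}{2}\sum_{t=0}^{T}\alpha_t^2}{\sum_{t=0}^{T}\alpha_t}.$$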

In particular, setting for some real yields the guarantee

 E[∥∇φ1/(2ρ)(xt∗)∥2