1 Introduction
Common stochastic optimization algorithms proceed as follows. Given an iterate , the method samples a model of the objective function formed at and declares the next iterate to be a minimizer of the model regularized by a proximal term. Stochastic proximal point, proximal subgradient, and Gauss–Newton-type methods are common examples. Let us formalize this viewpoint, following [15]. Namely, consider the optimization problem
(1.1) 
where the function is closed and convex and the only access to is by sampling a stochastic one-sided model. That is, for every point , there exists a family of models of
, indexed by a random variable
. This setup immediately motivates the following algorithm, analyzed in [15]:
(1.2) 
where is an appropriate control sequence that governs the stepsize of the algorithm.
Some thought shows that convergence guarantees of the method (1.2) should rely on at least two factors: control over the approximation quality, , and growth/stability properties of the individual models . With this in mind, the paper [15] isolates the following assumptions:
(1.3) 
and there exists a square integrable function satisfying
(1.4) 
Condition (1.3) simply says that in expectation, the model must globally lower bound up to a quadratic error, while agreeing with at the base point ; when (1.3) holds, the paper [15] calls the assignment a stochastic one-sided model of . Property (1.4), in contrast, asserts a Lipschitz-type property of the individual models .¹ (¹The stated assumption (A4) in [15] is stronger than (1.4); however, a quick look at the arguments shows that property (1.4) suffices to obtain essentially the same convergence guarantees.) The main result of [15] shows that under these assumptions, the scheme (1.2) drives a natural stationarity measure of the problem to zero at the rate . Indeed, the stationarity measure is simply the gradient of the Moreau envelope
(1.5) 
where is a smoothing parameter on the order of .
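For orientation, in the Euclidean setting of [15] the proximal term is a squared distance, and the update (1.2) together with the envelope (1.5) take the familiar form below; the symbols x_t, ξ_t, f_{x_t}(·, ξ_t), β_t, and λ are generic placeholders for the iterate, the sample, the sampled model, the stepsize control, and the smoothing parameter:
\[
x_{t+1} \in \operatorname*{argmin}_{y}\;\Big\{ f_{x_t}(y,\xi_t) + r(y) + \tfrac{\beta_t}{2}\,\|y-x_t\|^2 \Big\},
\qquad
F_{\lambda}(x) \;=\; \min_{y}\;\Big\{ f(y) + r(y) + \tfrac{1}{2\lambda}\,\|y-x\|^2 \Big\}.
\]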
The assumptions (1.3) and (1.4) are perfectly aligned with the existing literature. Indeed, common first-order algorithms rely on global Lipschitz continuity of the objective function or of its gradient; see for example the monographs [31, 33, 5]. Recent work [2, 30, 29, 26, 8], in contrast, has emphasized that global Lipschitz assumptions can easily fail for well-structured problems. Nonetheless, these papers show that it is indeed possible to develop efficient algorithms even without the global Lipschitz assumption. The key idea, originating in [2, 30, 29], is to model errors in approximation by a Bregman divergence, instead of a norm. The ability to deal with problems that are not globally Lipschitz is especially important in stochastic nonconvex settings, where line-search strategies that exploit local Lipschitz continuity are not well-developed.
Motivated by the recent work on relative continuity/smoothness [2, 30, 29], we extend the results of [15] to settings that are not globally Lipschitzian. Formally, we simply replace the squared norm in the displayed equations (1.2)–(1.5) by a Bregman divergence
generated by a Legendre function . With this modification and under mild technical conditions, we will show that algorithm (1.2) drives the gradient of the Bregman envelope (1.5) to zero at the rate , where the size of the gradient is measured in the local norm induced by . As a consequence, we obtain new convergence guarantees for stochastic proximal point, mirror descent,² and regularized Gauss–Newton methods, as well as for an elementary algorithm for stochastic saddle point problems. (²This work appears on arXiv a month after a preprint of Zhang and He [42], who provide similar convergence guarantees specifically for the stochastic mirror descent algorithm. The results of the two papers were obtained independently and are complementary to each other.) Perhaps the most important application arena is when the functional components of the problem grow at a polynomial rate. In this setting, we present a simple Legendre function that satisfies the necessary assumptions for the convergence guarantees to take hold. We also note that the stochastic mirror descent algorithm that we present here does not require mini-batching the gradients, in contrast to the previous seminal work [24].
When the stochastic models
are themselves convex and globally underestimate
in expectation, we prove that the scheme drives the expected functional error to zero at the rate . The rate improves to when the regularizer is strongly convex relative to in the sense of [30]. In the special case of mirror descent, these guarantees extend the results for convex unconstrained problems in [29] to the proximal setting. Even specializing to the proximal subgradient method, the convergence guarantees appear to be different from those available in the literature. Namely, previous complexity estimates [7, 20] depend on the largest norms of the subgradients of along the iterate sequence, whereas Theorems 7.2 and 7.4 replace this dependence by only the initial error .
The outline of the manuscript is as follows. Section 2 reviews the relevant concepts of convex analysis, focusing on Legendre functions and the Bregman divergence. Section 3 introduces the problem class and the algorithmic framework. This section also interprets the assumptions made for the stochastic proximal point, mirror descent, and regularized Gauss–Newton methods, as well as for a stochastic approximation algorithm for saddle point problems. Section 4 discusses the stationarity measure we use to quantify the rate of convergence. Section 5 contains the complete convergence analysis of the stochastic model-based algorithm. Section 6 presents a specialized analysis for the mirror descent algorithm when
is smooth and the stochastic gradient oracle has finite variance. Finally, in Section
7 we prove convergence rates in terms of function values for stochastic model-based algorithms under (relative strong) convexity assumptions.
2 Legendre functions and the Bregman divergence
Throughout, we follow standard notation from convex analysis, as set out for example by Rockafellar [37]. The symbol will denote a Euclidean space with inner product and the induced norm . For any set , we let and denote the interior and closure of , respectively. Whenever is convex, the set is the interior of relative to its affine hull. The effective domain of any function , denoted by , consists of all points where is finite. Abusing notation slightly, we will use the symbol to denote the set of all points where is differentiable.
This work analyzes stochastic model-based minimization algorithms, where the “errors” are controlled by a Bregman divergence. For wider uses of the Bregman divergence in first-order methods, we refer the interested reader to the expository articles of Bubeck [10], Juditsky–Nemirovski [27], and Teboulle [40].
Henceforth, we fix a Legendre function , meaning:

(Convexity) is proper, closed, and strictly convex.

(Essential smoothness) The domain of has nonempty interior, is differentiable on , and for any sequence converging to a boundary point of , it must be the case that .
Typical examples of Legendre functions are the squared Euclidean norm , the Shannon entropy with , and the Burg function with . For more examples, we refer the reader to the articles [1, 3, 22, 39] and the recent survey [40].
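To make these three examples concrete, the short Python sketch below evaluates the Bregman divergences they generate from the definition D_Φ(y, x) = Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩; the resulting closed forms (half the squared distance, the generalized Kullback–Leibler divergence, and the Itakura–Saito distance) are standard, and the function names are ours.

import numpy as np

def bregman(phi, grad_phi, y, x):
    # D_Phi(y, x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

# Squared Euclidean norm: Phi(x) = 0.5*||x||^2, divergence = 0.5*||y - x||^2.
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Shannon entropy on the positive orthant: Phi(x) = sum_i x_i log x_i,
# divergence = generalized Kullback-Leibler divergence.
ent = lambda x: np.sum(x * np.log(x))
ent_grad = lambda x: np.log(x) + 1.0

# Burg entropy on the positive orthant: Phi(x) = -sum_i log x_i,
# divergence = Itakura-Saito distance.
burg = lambda x: -np.sum(np.log(x))
burg_grad = lambda x: -1.0 / x

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
for name, f, g in [("euclidean", sq, sq_grad), ("shannon", ent, ent_grad), ("burg", burg, burg_grad)]:
    print(name, bregman(f, g, y, x))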
We will often use the observation that the subdifferential of a Legendre function is empty on the boundary of its domain [37, Theorem 26.1]:
The Legendre function induces the Bregman divergence
for all . Notice that since is strictly convex, equality holds for some if and only if . Analysis of algorithms based on the Bregman divergence typically relies on the following three point inequality; see e.g. [41, Property 1].
Lemma 2.1 (Three point inequality).
Consider a closed convex function satisfying . Then for any point , any minimizer of the problem
lies in , is unique, and satisfies the inequality:
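For reference, writing D_Φ(y, x) = Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩, the three point inequality in its standard form (see [41, Property 1]) states that the minimizer x̄ of g(·) + D_Φ(·, x) satisfies, for every y in the domain of Φ,
\[
g(y) + D_\Phi(y,x) \;\ge\; g(\bar x) + D_\Phi(\bar x, x) + D_\Phi(y, \bar x).
\]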
Recall that a function is called weakly convex if the perturbed function is convex [34]. By analogy, we will say that is weakly convex relative to if the perturbed function is convex. This notion is closely related to the relative smoothness condition introduced in [30, 2].
Relative weak convexity, like its classical counterpart, can be characterized through generalized derivatives. Recall that the Fréchet subdifferential of a function at a point , denoted
, consists of all vectors
satisfying
The limiting subdifferential of at , denoted , consists of all vectors such that there exist sequences and satisfying .
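In the standard notation of [38], these two constructions read as follows: a vector v belongs to the Fréchet subdifferential of f at x if and only if
\[
f(y) \;\ge\; f(x) + \langle v,\, y-x\rangle + o(\|y-x\|) \qquad \text{as } y \to x,
\]
while v belongs to the limiting subdifferential of f at x if and only if there exist sequences x_i → x and v_i ∈ \hat\partial f(x_i) with f(x_i) → f(x) and v_i → v.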
Lemma 2.2 (Subdifferential characterization).
The following are equivalent for any locally Lipschitz function .

The function is weakly convex relative to .

For any and any , the inequality holds:
(2.1) 
For any , and any , the inequality holds:
(2.2)
If and are smooth on , then the three properties above are all equivalent to
(2.3) 
Proof.
Define the perturbed function . We prove the implications in order. To this end, suppose 1 holds. Since is convex, the subgradient inequality holds:
(2.4) 
Taking into account that is differentiable on , we deduce for all ; see e.g. [38, Exercise 8.8]. Rewriting (2.4) with this in mind immediately yields 2. The implication is immediate since , whenever is differentiable at .
Suppose 3 holds. Fix an arbitrary point . Algebraic manipulation of inequality (2.2) yields the equivalent description
(2.5) 
It follows that the vector lies in the convex subdifferential of at . Since is locally Lipschitz continuous, Rademacher’s theorem shows that has full measure in . In particular, we deduce from (2.5) that the convex subdifferential of is nonempty on a dense subset of . Taking limits, it quickly follows that the convex subdifferential of is nonempty at every point . Using [9, Exercise 3.1.12(a)], we conclude that is convex on . Moreover, appealing to the sum rule [38, Exercise 10.10], we deduce that for all , since for all . Therefore is a globally monotone map. Appealing to [38, Theorem 12.17], we conclude that is a convex function. Thus item 1 holds. This completes the proof of the equivalences .
Finally suppose that and are smooth on . Clearly, if is weakly convex relative to , then the second-order characterization of convexity of the function directly implies (2.3). Conversely, (2.3) immediately implies that is convex on the interior of its domain. The same argument using [38, Theorem 12.17], as in the implication , shows that is convex on all of . ∎
Notice that the setup so far has not relied on any predefined norm. Let us for the moment make the common assumption that
is 1-strongly convex relative to some norm on , which implies
(2.6) 
Then using Lemma 2.2, we deduce that to check that is weakly convex relative to , it suffices to verify the inequality
Recall that a function is called smooth if it satisfies:
where is the dual norm. Thus any smooth function is automatically weakly convex relative to . Our main result will not require to be 1-strongly convex; however, we will impose this assumption in Section 6, where we augment our guarantees for the stochastic mirror descent algorithm under a differentiability assumption.
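To see the last claim concretely, here is the one-line argument, written with generic placeholder symbols f, L, and Φ, and under the assumption that (2.6) is the usual estimate D_Φ(y, x) ≥ ½‖y − x‖²: smoothness yields the quadratic lower bound, which (2.6) converts into a lower bound of exactly the form appearing in item 2 of Lemma 2.2 with constant L,
\[
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y-x\rangle - \tfrac{L}{2}\,\|y-x\|^2
      \;\ge\; f(x) + \langle \nabla f(x),\, y-x\rangle - L\, D_\Phi(y,x).
\]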
3 The problem class and the algorithm
We are now ready to introduce the problem class considered in this paper. We will be interested in the optimization problem
(3.1) 
where

is a locally Lipschitz function,

is a closed function having a convex domain,

is some Legendre function satisfying the compatibility conditions:
(3.2)
The first two items are standard and mild. The third stipulates that must be compatible with . In particular, the inclusion automatically implies (3.2), whenever is convex [37, Theorem 23.8], or more generally whenever a standard qualification condition holds.³ (³Qualification condition: , for all ; see [38, Proposition 8.12, Corollary 10.9].) To simplify notation, henceforth set .
3.1 Assumptions and the Algorithm
We now specify the modelbased algorithms we will analyze. Fix a probability space
and equip with the Borel σ-algebra. To each point and each random element , we associate a stochastic one-sided model of the function . Namely, we assume that there exist satisfying the following properties.

(Sampling) It is possible to generate i.i.d. realizations

(One-sided accuracy) There is a measurable function defined on satisfying both
and
(3.3) 
(Weak convexity of the models) The functions are weakly convex relative to for all , and a.e. .

(Lipschitzian property) There exists a square integrable function such that for all , the following inequalities hold:
(3.4)
Some comments are in order. Assumption 1 is standard and is necessary for all sampling-based algorithms. Assumption 2 specifies the accuracy of the models. That is, we require the model in expectation to agree with at the base point, and to globally lower-bound up to an error controlled by the Bregman divergence. Assumption 3 is very mild, since in most practical circumstances the function is convex, i.e. . The final Assumption 4 controls the order of growth of the individual models as the argument moves away from .
Notice that Assumptions 1–4 do not involve any norm on . However, when is 1-strongly convex relative to some norm, the properties (3.3) and (3.4) are implied by standard assumptions. Namely, (3.3) holds if the error in the model approximation satisfies
Similarly, (3.4) will hold as long as for every and a.e. the models are Lipschitz continuous on in the norm . The use of the Bregman divergence allows for much greater flexibility as it can, for example, model higher-order growth of the functions in question. To illustrate, let us look at the following example where the Lipschitz constant of the models is bounded by a polynomial.
Example 3.1 (Bregman divergence under polynomial growth).
Consider a degree univariate polynomial
with coefficients . Suppose now that the one-sided Lipschitz constants of the models satisfy the growth property:
Motivated by [29, Proposition 5.1], the following proposition constructs a Bregman divergence that is well-adapted to the polynomial . We defer its proof to Appendix A.1. In particular, with the choice of the Legendre function in (3.5), the required estimate (3.4) holds.
Proposition 3.2.
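As a purely illustrative numerical sketch (not necessarily the Legendre function constructed in Proposition 3.2), one may experiment with reference functions of the power form Φ(x) = ½‖x‖² + c/(r+2)·‖x‖^{r+2}, whose gradient grows like a degree-r polynomial of ‖x‖; the constant c and the degree r in the Python snippet below are placeholders.

import numpy as np

def make_power_phi(r, c=1.0):
    # Illustrative reference function Phi(x) = 0.5*||x||^2 + c/(r+2)*||x||^(r+2).
    def phi(x):
        n = np.linalg.norm(x)
        return 0.5 * n**2 + c / (r + 2) * n**(r + 2)
    def grad_phi(x):
        n = np.linalg.norm(x)
        return x + c * n**r * x      # gradient of the power term is c*||x||^r * x
    return phi, grad_phi

def bregman(phi, grad_phi, y, x):
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

phi, grad_phi = make_power_phi(r=3)
x, y = np.array([1.0, -2.0]), np.array([0.5, 0.5])
print(bregman(phi, grad_phi, y, x))  # nonnegative, since phi is convex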
The final ingredient we need before stating the algorithm is an estimate on the weak convexity constant of . The following simple lemma shows that Assumptions 2 and 3 imply that itself is weakly convex relative to .
Lemma 3.3.
The function is weakly convex relative to .
Proof.
We first show that the function is convex on . To this end, fix arbitrary points , and note the equality [37, Theorem 6.5]. Choose and set . Taking into account 3, we deduce
(3.6)  
Now observe
and similarly
Hence algebraic manipulation of the two equalities above yields the expression
Continuing with (3.6), we obtain
We have thus verified that is convex on . Appealing to (3.2) and the sum rule [38, Exercise 10.10], we deduce that the subdifferential is empty at every point in , and therefore is a globally monotone map. Using [38, Theorem 12.17], we conclude that is a convex function, as needed. ∎
In light of Lemma 3.3, we also make the following additional assumption on the solvability of the Bregman proximal subproblems.


(Solvability) The convex problems
admit a minimizer for any , any , and a.e. .⁴ (⁴Note that the minimizers are automatically unique by Lemma 2.1.) The minimizers vary measurably in .
Assumption (A5) is very mild. In particular, it holds automatically if is strongly convex with respect to some norm, or if the functions and are bounded from below and has bounded sublevel sets [40, Lemma 2.3].
We are now ready to state the stochastic model-based algorithm we analyze—Algorithm 1.
3.2 Examples
Before delving into the convergence analysis of Algorithm 1, in this section we illustrate the algorithmic framework on four examples. In all cases, Assumptions 1 and (A5) are self-explanatory. Therefore, we only focus on verifying 2–4. For simplicity, we also assume that is convex in all examples.
Stochastic Bregman-proximal point.
Suppose that the models satisfy
With this choice of the models, Algorithm 1 becomes the stochastic Bregman-proximal point method. Analysis of the deterministic version of the method for convex problems goes back to [14, 13, 22]. Observe that Assumption 2 holds trivially. Assumption 3 and Assumption 4 should be verified in particular circumstances, depending on how the models are generated. In particular, one can verify Assumption 4 under polynomial growth of the Lipschitz constant by appealing to Example 3.1.
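The following Python sketch runs a few iterations of this scheme on a toy one-dimensional problem, with the full sampled loss as the model, the squared Euclidean norm as the Legendre function for simplicity, and a generic numerical solver for the proximal subproblem; the loss f, the regularizer r, and the stepsize schedule are placeholders.

import numpy as np
from scipy.optimize import minimize

def f(y, xi):                      # sampled loss f(y; xi) (placeholder)
    return (y[0] - xi) ** 4

def r(y):                          # convex regularizer (placeholder)
    return np.abs(y[0])

def bregman(y, x):                 # D_Phi for Phi = 0.5*||.||^2
    return 0.5 * np.sum((y - x) ** 2)

rng = np.random.default_rng(0)
x = np.array([3.0])
for t in range(50):
    xi = rng.normal()              # draw the sample xi_t
    beta = np.sqrt(t + 1.0)        # placeholder stepsize control beta_t
    # stochastic Bregman-proximal point update: the model is the full sampled loss
    obj = lambda y: f(y, xi) + r(y) + beta * bregman(y, x)
    x = minimize(obj, x, method="Nelder-Mead").x
print(x)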
Stochastic mirror descent.
Suppose that the models are given by
for some measurable mapping satisfying for all . Algorithm 1 then becomes the stochastic mirror descent algorithm, classically studied in [31, 6] in the convex setting and more recently analyzed in [30, 2, 29] under convexity and relative continuity assumptions. Assumption 2 simply says that is weakly convex relative to , while Assumption 3 holds trivially with . Assumption 4 is directly implied by the relative continuity condition of Lu [29]. Namely it suffices to assume that there is a square integrable function satisfying
where is an arbitrary norm on , and is the dual norm. We refer to [29] for more details on this condition and examples.
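For a concrete special case, suppose the regularizer is the indicator of the probability simplex and the Legendre function is the Shannon entropy; the mirror descent subproblem then has the classical exponentiated-gradient closed form, which the Python sketch below implements with a placeholder stochastic gradient oracle and stepsize schedule.

import numpy as np

def mirror_descent_step(x, g, eta):
    # One mirror descent step on the probability simplex with the Shannon entropy:
    # solves min_y <g, y> + (1/eta) * D_KL(y, x) over the simplex (exponentiated gradient).
    w = x * np.exp(-eta * g)
    return w / np.sum(w)

rng = np.random.default_rng(0)
x = np.ones(5) / 5.0               # start at the uniform distribution
for t in range(100):
    g = rng.normal(size=5)         # placeholder stochastic gradient at x
    x = mirror_descent_step(x, g, eta=0.1 / np.sqrt(t + 1.0))
print(x, x.sum())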
Gauss–Newton method with Bregman regularization.
In the next example, suppose that has the composite form
for some measurable function that is convex in for a.e. and a measurable map that is smooth in for a.e. . We may then use the convex models
which automatically satisfy 3 with . Algorithm 1 then becomes a stochastic Gauss–Newton method with Bregman regularization.
In the Euclidean case , the method reduces to the stochastic prox-linear algorithm, introduced in [21] and further analyzed in [15]. The deterministic prox-linear method has classical roots, going back at least to [11, 23, 36], while a more modern complexity-theoretic perspective appears in [28, 18, 12, 32, 19]. Even in the deterministic setting, to make progress, one typically assumes that and are globally Lipschitz. More generally, and in line with our current work, one may introduce a different Legendre function . For example, in the case of polynomial growth, the following propositions construct Legendre functions that are compatible with Assumptions 2 and 4. We defer their proofs to Appendix A.3. In the two propositions, we assume that the outer functions are globally Lipschitz, while the inner maps may have a high order of growth. It is possible to also analyze the setting when has polynomial growth, but the resulting statements and assumptions become much more cumbersome; we therefore omit that discussion.
Proposition 3.4 (Satisfying 2).
Suppose there are square integrable functions and a univariate polynomial with nonnegative coefficients satisfying
Define the Legendre function
Then Assumption 2 holds with .
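For illustration, the Python sketch below performs a few iterations of the stochastic Gauss–Newton update on a toy composite problem, with placeholder choices of the outer convex function h, the inner smooth map c and its Jacobian, the squared Euclidean Legendre function, and r ≡ 0; the proximal subproblem is solved with a generic numerical routine rather than a specialized solver.

import numpy as np
from scipy.optimize import minimize

def h(z):                                  # outer convex function, e.g. an l1 penalty (placeholder)
    return np.sum(np.abs(z))

def c(x, xi):                              # inner smooth map c(x; xi) (placeholder)
    return np.array([x[0] * x[1] - xi, x[0] ** 2 - 1.0])

def jac_c(x, xi):                          # Jacobian of c with respect to x
    return np.array([[x[1], x[0]], [2.0 * x[0], 0.0]])

def bregman(y, x):                         # D_Phi for Phi = 0.5*||.||^2, for simplicity
    return 0.5 * np.sum((y - x) ** 2)

rng = np.random.default_rng(0)
x = np.array([2.0, -1.0])
for t in range(30):
    xi = rng.normal()
    beta = 2.0                             # placeholder proximal parameter
    # Gauss-Newton model of h(c(.)): linearize c at x, keep h exact
    model = lambda y: h(c(x, xi) + jac_c(x, xi) @ (y - x))
    x = minimize(lambda y: model(y) + beta * bregman(y, x), x, method="Nelder-Mead").x
print(x)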
Stochastic saddle point problems.
As the final example, suppose that is given in the stochastic conjugate form
where is some auxiliary set and is some function. Thus we are interested in solving the stochastic saddlepoint problem
(3.7) 
Such problems appear often in data science, where the variation of
in the “uncertainty set” makes the loss function robust. One popular example is adversarial training
[25]. In this setting, we have , where is a loss function, encodes the observed data, and varies over some uncertainty set , such as an ball.
In order to apply our algorithmic framework, we must have access to stochastic one-sided models of . It is quite natural to construct such models by using one-sided stochastic models of . Indeed, it is appealing to simply set
(3.8) 
All of the model types in the previous examples could now serve as the models , provided they meet the conditions outlined below.
Formally, to ensure that 1–(A5) hold for the models , we must make the following assumptions:

The mapping is measurable and has finite first moment for every fixed .

The function is weakly convex relative to , for every fixed , , and a.e. .

There exists a mapping satisfying
for all and a.e. with the property that the functions and are measurable.

For all , we have
and

There exists a square integrable function such that
Given these assumptions, let us define as in (3.8). We now verify properties 2–4. Property 2 follows from Property 4, which implies that and
Property 3 follows directly from Property 2. Finally, 4 follows from Property 5.
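The Python sketch below illustrates this construction on a toy robust regression problem: the sampled function g is a squared loss on an adversarially perturbed linear predictor, the uncertainty set is a small finite collection of perturbations (so that the inner maximum is exact), the models of g are its linearizations at the base point, and the resulting max-type model is plugged into the proximal update; all names and data are placeholders.

import numpy as np
from scipy.optimize import minimize

def g(x, w, a, b):                         # sampled inner function g(x, w; xi), xi = (a, b)
    return 0.5 * (np.dot(a + w, x) - b) ** 2

def grad_g(x, w, a, b):                    # gradient of g in x
    return (np.dot(a + w, x) - b) * (a + w)

# finite uncertainty set of perturbations (placeholder)
W = [np.array([e1, e2]) for e1 in (-0.1, 0.1) for e2 in (-0.1, 0.1)]

def model(y, x, a, b):
    # max over w in W of the linearization of g(., w; xi) at the base point x
    return max(g(x, w, a, b) + np.dot(grad_g(x, w, a, b), y - x) for w in W)

rng = np.random.default_rng(0)
x = np.zeros(2)
for t in range(50):
    a, b = rng.normal(size=2), rng.normal()      # draw a data sample xi_t = (a, b)
    beta = np.sqrt(t + 1.0)
    # proximal update built from the max-type model, with a squared Euclidean proximal term
    obj = lambda y: model(y, x, a, b) + 0.5 * beta * np.sum((y - x) ** 2)
    x = minimize(obj, x, method="Nelder-Mead").x
print(x)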
4 Stationarity measure
In this section, we introduce a natural stationarity measure that we will use to describe the convergence rate of Algorithm 1. The stationarity measure is simply the size of the gradient of an appropriate smooth approximation of the problem (3.1). This idea is completely analogous to the Euclidean setting [15, 16]. Setting the stage, for any , define the envelope
and the associated proximal map
Note that in the Euclidean setting , these two constructions reduce to the standard Moreau envelope and the proximity map; see for example the monographs [38, 35] or the note [17] for recent perspectives.
We will measure the convergence guarantees of Algorithm 1 based on the rate at which the quantity
(4.1) 
tends to zero for some fixed . The significance of this quantity becomes apparent after making slightly stronger assumptions on the Legendre function . In this section only, suppose that is 1-strongly convex with respect to some norm and that is twice differentiable at every point in . With these assumptions, the following result shows that the envelope is differentiable, with a meaningful gradient. Indeed, this result follows quickly from [4]. For the sake of completeness, we present a self-contained argument in Appendix A.4.
Theorem 4.1 (Smoothness of the envelope).
For any positive , the envelope is differentiable at any point with gradient given by
In light of Theorem 4.1, for any point , we may define the local norm
Then a quick computation shows that the dual norm is given by
Therefore appealing to Theorem 4.1, for any positive and we obtain the estimate
Thus the square root of the Bregman divergence, which we will show tends to zero along the iterate sequence at a controlled rate, bounds the local norm of the gradient .
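As a numerical illustration of the stationarity measure, the Python sketch below computes the Bregman-proximal point of a toy weakly convex function, using the squared Euclidean norm as the Legendre function so that the envelope is the classical Moreau envelope, and reports the divergence between the proximal point and the base point, the quantity whose decay, up to scaling by the smoothing parameter, is tracked in (4.1); the objective and the parameter values are placeholders.

import numpy as np
from scipy.optimize import minimize

def F(y):                                   # f(y) + r(y), a toy weakly convex objective
    return np.sum(np.abs(y ** 2 - 1.0))

def bregman(y, x):                          # D_Phi for Phi = 0.5*||.||^2
    return 0.5 * np.sum((y - x) ** 2)

def prox_point(x, lam):
    # Bregman-proximal point: argmin_y F(y) + (1/lam) * D_Phi(y, x)
    obj = lambda y: F(y) + bregman(y, x) / lam
    return minimize(obj, x, method="Nelder-Mead").x

x = np.array([0.3, -1.7])
lam = 0.1
xhat = prox_point(x, lam)
# Divergence between the proximal point and the base point; up to scaling by the
# smoothing parameter, this is the stationarity measure tracked in (4.1).
print(bregman(xhat, x))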
5 Convergence analysis
We now present the convergence analysis of Algorithm 1 under Assumptions 1–(A5). Henceforth, let be the iterates generated by Algorithm 1 and let be the corresponding samples used. For each index , define the Bregman-proximal point
To simplify notation, we will use the symbol to denote the expectation conditioned on all the realizations . The entire argument of Theorem 5.2—our main result—relies on the following lemma.
Lemma 5.1.
For each iteration , the iterates of Algorithm 1 satisfy