    # Algorithms for stochastic optimization with expectation constraints

This paper considers the problem of minimizing an expectation function over a closed convex set, coupled with an expectation constraint on either decision variables or problem parameters. We first present a new stochastic approximation (SA) type algorithm, namely the cooperative SA (CSA), to handle problems with the expectation constraint on decision variables. We show that this algorithm exhibits the optimal O(1/√N) rate of convergence, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex, where N denotes the number of iterations. Moreover, we show that this rate of convergence can be improved to O(1/N) if the objective and constraint functions are strongly convex. We then present a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) algorithm, to deal with the situation when the expectation constraint is defined over problem parameters, and show that it exhibits a rate of convergence similar to that of CSA. It is worth noting that CSA and CSPA are primal methods which require neither iterations in the dual space nor estimation of the size of the dual variables. To the best of our knowledge, this is the first time that such optimal SA methods for solving expectation constrained stochastic optimization have been presented in the literature.


## 1 Introduction

In this paper, we study two related stochastic programming (SP) problems with expectation constraints. The first one is a classical SP problem with the expectation constraint over the decision variables, formally defined as

min f(x) := E[F(x, ζ)]  s.t.  g(x) := E[G(x, ξ)] ≤ 0,  x ∈ X,   (1.1)

where X is a convex compact set, ζ and ξ are random vectors supported on their respective sample spaces, and F(·, ζ) and G(·, ξ) are closed convex functions w.r.t. x for a.e. ζ and ξ. Moreover, we assume that ζ and ξ are independent of x. Under these assumptions, (1.1) is a convex optimization problem.

Problem (1.1) has many applications in operations research, finance and data analysis. One motivating example is SP with the conditional value at risk (CVaR) constraint. In an important work, Rockafellar and Uryasev show that a class of asset allocation problems can be modeled as

min_{x,τ} −μᵀx  s.t.  τ + (1/β) E{[−ξᵀx − τ]₊} ≤ 0,  ∑_{i=1}^n x_i = 1,  x ≥ 0,   (1.2)

where ξ denotes the random return with mean μ. Expectation constraints also play an important role in providing tight convex approximations to chance constrained problems (e.g., Nemirovski and Shapiro). Some other important applications of (1.1) can be found in semi-supervised learning. For example, one can use the objective function to define the fidelity of the model for the labelled data, while using the constraint to enforce some other properties of the model for the unlabelled data (e.g., proximity for data with similar features).
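To make the CVaR constraint in (1.2) concrete, the following sketch estimates its left-hand side by Monte Carlo for a fixed allocation. The return distribution, sample size, and parameter values are illustrative assumptions, not part of the model.

```python
import numpy as np

def cvar_constraint_estimate(x, tau, beta, xi_samples):
    """Monte Carlo estimate of the CVaR constraint value
    tau + (1/beta) * E[(-xi^T x - tau)_+] from (1.2)."""
    losses = -xi_samples @ x            # per-sample portfolio loss
    excess = np.maximum(losses - tau, 0.0)
    return tau + excess.mean() / beta

rng = np.random.default_rng(0)
xi = rng.normal(loc=0.05, scale=0.1, size=(10000, 3))  # hypothetical returns
x = np.array([0.5, 0.3, 0.2])                          # a candidate allocation
val = cvar_constraint_estimate(x, tau=0.2, beta=0.05, xi_samples=xi)
```

A feasible allocation for (1.2) is one for which this estimate is nonpositive (up to sampling error).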

While problem (1.1) covers a wide class of problems with expectation constraints over the decision variables, in practice we often encounter the situation where the expectation constraint is defined over the problem parameters. Under these circumstances our goal is to find a pair of parameters x* and decision variables y*(x*) such that

y*(x*) ∈ Argmin_{y∈Y} { ϕ(x*, y) := E[Φ(x*, y, ζ)] },   (1.3)
x* ∈ { x ∈ X | g(x) := E[G(x, ξ)] ≤ 0 }.   (1.4)

Here Φ(x, ·, ζ) is convex w.r.t. y for a.e. ζ but possibly nonconvex w.r.t. (x, y) jointly, and G(·, ξ) is convex w.r.t. x for a.e. ξ. Moreover, we assume that ζ and ξ are independent of y and x, respectively, while ζ is not necessarily independent of x. Note that (1.3)-(1.4) defines a pair of optimization and feasibility problems coupled in the following ways: a) the solution to (1.4) defines an admissible parameter of (1.3); b) the random variables ζ and ξ can be dependent; and c) ζ can be a random variable with probability distribution parameterized by x.

Problem (1.3)-(1.4) also has many applications, especially in data analysis. One such example is to learn a classifier w with a certain metric Ā using the support vector machine model:

min_w E[l(w; (Ā^{1/2}u, v))] + (λ/2)‖w‖²,   (1.5)
Ā ∈ { A ⪰ 0 | E[|Tr(A(u_i − v_j)(u_i − v_j)ᵀ) − b_ij|] ≤ 0, Tr(A) ≤ C },   (1.6)

where l denotes the hinge loss function, u, v, u_i, v_j and b_ij are random variables satisfying certain probability distributions, and λ and C are certain given parameters. In this problem, (1.5) is used to learn the classifier w by using the metric Ā satisfying certain requirements in (1.6), including the low rank (or nuclear norm) assumption. Problem (1.3)-(1.4) can also be used in some data-driven applications, where one can use (1.4) to specify the parameters for the probabilistic models associated with the random variable ζ, as well as some other applications for multi-objective stochastic optimization.

In spite of its wide applicability, the study of efficient solution methods for expectation constrained optimization is still limited. For the sake of simplicity, suppose for now that ζ is given as a deterministic vector, so that the objective functions f and ϕ in (1.1) and (1.3) are easily computable. One popular method to solve stochastic optimization problems is the sample average approximation (SAA) approach ([35, 18, 38]). To apply SAA to (1.1) and (1.4), we first generate a random sample ξ_i, i = 1, …, n, for some n ≥ 1, and then approximate g by the sample average (1/n)∑_{i=1}^n G(·, ξ_i). The main issues associated with the SAA for solving (1.1) include: i) the deterministic SAA problem might not be feasible; ii) the resulting deterministic SAA problem is often difficult to solve, especially when n is large, requiring a pass through the whole sample at each iteration; and iii) it is not applicable to the on-line setting where one needs to update the decision variable whenever a new sample ξ_i is collected.
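The SAA construction above can be sketched in a few lines; note that every evaluation of the sample-average constraint sweeps the entire stored sample, which is precisely issue ii). The linear form of G and the sample size are hypothetical choices for illustration.

```python
import numpy as np

def saa_constraint(G, x, xi_samples):
    """Sample average approximation of g(x) = E[G(x, xi)]:
    each evaluation touches every one of the n stored samples."""
    return np.mean([G(x, xi) for xi in xi_samples])

# hypothetical linear constraint G(x, xi) = xi @ x - 1
G = lambda x, xi: xi @ x - 1.0
rng = np.random.default_rng(1)
samples = rng.normal(size=(5000, 2))
g_hat = saa_constraint(G, np.array([0.1, 0.2]), samples)
```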

A different approach to solve stochastic optimization problems is stochastic approximation (SA), which was initially proposed in a seminal 1951 paper by Robbins and Monro for solving strongly convex SP problems. This algorithm mimics the gradient descent method by using a stochastic gradient rather than the exact gradient for minimizing f in (1.1) over a simple convex set X (see also [4, 10, 11, 26, 32, 36]). An important improvement of this algorithm was developed by Polyak and Juditsky through using longer steps and then averaging the obtained iterates. Their method was shown to be more robust with respect to the choice of stepsize than the classical SA method for solving strongly convex SP problems. More recently, Nemirovski et al. presented a modified SA method, namely the mirror descent SA method, and demonstrated its superior numerical performance for solving a general class of nonsmooth convex SP problems. SA algorithms have been intensively studied over the past few years (see, e.g., [19, 12, 13, 9, 39, 14, 22, 33]). It should be noted, however, that none of these SA algorithms are applicable to expectation constrained problems, since each iteration of these algorithms requires a projection onto the feasible set {x ∈ X : g(x) ≤ 0}, which is computationally prohibitive as g is given in the form of an expectation.

In this paper, we intend to develop efficient solution methods for expectation constrained problems by properly addressing the aforementioned issues associated with existing SA methods. Our contributions mainly consist of the following aspects. Firstly, inspired by Polyak's subgradient method for constrained optimization [27, 25], we present a new SA algorithm, namely the cooperative SA (CSA) method, for solving the SP problem with expectation constraint in (1.1). At the k-th iteration, CSA performs a projected subgradient step along either F′(x_k, ζ_k) or G′(x_k, ξ_k) over the set X, depending on whether the stochastic condition G(x_k, ξ_k) ≤ η_k is satisfied or not. We introduce an index set B in order to compute the output solution as a weighted average of the iterates in B. By carefully bounding |B|, we show that the number of iterations performed by the CSA algorithm to find an ε-solution of (1.1), i.e., a point x̄ ∈ X s.t. E[f(x̄) − f(x*)] ≤ ε and E[g(x̄)] ≤ ε, can be bounded by O(1/ε²). Moreover, when both f and g are strongly convex, by using a different set of algorithmic parameters we show that the complexity of the CSA method can be significantly improved to O(1/ε). It is worth mentioning that this result is new even for solving deterministic strongly convex problems with functional constraints. We also establish the large-deviation properties of the CSA method under certain light-tail assumptions.

Secondly, we develop a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) method, for solving the SP problem with expectation constraints on problem parameters in (1.3)-(1.4). At the k-th iteration, CSPA performs either a projected subgradient step along Φ′ over the set Y, or one along G′ over the set X, depending on whether a stochastic condition on the constraint value is satisfied or not. We then define the output solution as a randomly selected iterate from the index set B (rather than the weighted average as in CSA), mainly due to the possible non-convexity of Φ w.r.t. (x, y). By carefully bounding |B|, we show that the number of iterations performed by the CSPA algorithm to find an ε-solution of (1.3)-(1.4), i.e., a pair of solutions satisfying the optimality criterion in (1.3) and the constraint in (1.4) up to expected error ε, can be bounded by O(1/ε²). Moreover, this bound can be significantly improved to O(1/ε) if Φ and G are strongly convex w.r.t. y and x, respectively. We also derive the large-deviation properties of the CSPA method and design a two-phase procedure to improve such properties of CSPA under certain light-tail assumptions.

To the best of our knowledge, all the aforementioned algorithmic developments are new in the stochastic optimization literature. It is also worth mentioning a few alternative or related methods for solving (1.1) and (1.3)-(1.4). First, without efficient methods to directly solve (1.1), current practice resorts to reformulating it as an unconstrained problem by moving the constraint into the objective with some penalty parameter. However, one then has to face the difficulty of properly specifying this parameter, since an optimal selection would depend on the unknown dual multiplier. As a consequence, we cannot assess the quality of the solutions obtained by solving the reformulated problem. Second, one alternative approach to solve (1.1) is the penalty-based or primal-dual approach. However, these methods would require either the estimation of the optimal dual variables or iterations performed in the dual space. Moreover, the rate of convergence of these methods for functional constrained problems has not been well understood beyond conic constraints, even in the deterministic setting. Third, Jiang and Shanbhag (and the references therein) developed a coupled SA method to solve a stochastic optimization problem with parameters given by another optimization problem, rather than by an expectation constrained feasibility problem, and hence their method is not applicable to problem (1.3)-(1.4). Moreover, each iteration of their method requires two stochastic subgradient projection steps and is hence more expensive than that of CSPA.

The remaining part of this paper is organized as follows. In Section 2, we present the CSA algorithm and establish its convergence properties under general convexity and strong convexity assumptions. Then in Section 3, we develop a variant of the CSA algorithm, namely the CSPA, for solving SP problems with the expectation constraint over problem parameters, and discuss its convergence properties. We then present some numerical results for these new SA methods in Section 4. Finally, some concluding remarks are provided in Section 5.

## 2 Expectation constraints over decision variables

In this section we present the cooperative SA (CSA) algorithm for solving convex stochastic optimization problems with the expectation constraint over decision variables. More specifically, we first briefly review the distance generating function and prox-mapping in Subsection 2.1. We then describe the CSA algorithm and discuss its convergence properties for solving general convex problems in Subsection 2.2, and show how to improve the convergence of this algorithm by imposing strong convexity assumptions in Subsection 2.3. Finally, we establish some large deviation properties for the CSA method in Subsection 2.4.

### 2.1 Preliminary: prox-mapping

Recall that a function ω_X is a distance generating function with parameter α, if ω_X is continuously differentiable and strongly convex with parameter α with respect to ‖·‖. Without loss of generality, we assume throughout this paper that α = 1, because we can always rescale ω_X to ω_X/α. Therefore, we have

⟨x − z, ∇ω_X(x) − ∇ω_X(z)⟩ ≥ ‖x − z‖²,  ∀ x, z ∈ X.

The prox-function associated with ω_X is given by

V_X(z, x) = ω_X(x) − ω_X(z) − ⟨∇ω_X(z), x − z⟩.

The function V_X(·, ·) is also called the Bregman distance, which was initially studied by Bregman and later by many others. In this paper we assume the prox-function is chosen such that, for a given x ∈ X, the prox-mapping defined as

P_{x,X}(ϕ) := argmin_{z∈X} { ⟨ϕ, z⟩ + V_X(x, z) }   (2.1)

is easily computed.
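For intuition, when ω_X(x) = ½‖x‖² (the Euclidean setup, one admissible choice of distance generating function), the prox-function becomes V_X(x, z) = ½‖z − x‖² and the prox-mapping (2.1) reduces to a Euclidean projection of x − ϕ onto X. The sketch below instantiates this for a box-shaped X, an assumption made purely for illustration.

```python
import numpy as np

def prox_mapping_box(x, phi, lo, hi):
    """Prox-mapping (2.1) with omega(x) = 0.5*||x||^2, so that
    V(x, z) = 0.5*||z - x||^2 and P_{x,X}(phi) is the Euclidean
    projection of x - phi onto the box X = [lo, hi]^n."""
    return np.clip(x - phi, lo, hi)

step = prox_mapping_box(np.array([0.5, 0.9]), np.array([0.2, -0.3]), 0.0, 1.0)
```

Other choices of ω_X (e.g., the entropy function on a simplex) lead to different closed-form prox-mappings; the analysis below only requires that (2.1) be easy to compute.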

It can be seen from the strong convexity of ω_X that

V_X(x, z) ≥ (1/2)‖x − z‖²,  ∀ x, z ∈ X.   (2.2)

Whenever the set X is bounded, the distance generating function also gives rise to the diameter of X that will be used frequently in our convergence analysis:

D_X ≡ D_{X,ω_X} := √(max_{x,z∈X} V_X(x, z)).   (2.3)

The following lemma follows from the optimality condition of (2.1) and the definition of the prox-function.

###### Lemma 1

For every u, x ∈ X and y, we have

V_X(P_{x,X}(y), u) ≤ V_X(x, u) + ⟨y, u − x⟩ + (1/2)‖y‖²_*,

where ‖·‖_* denotes the conjugate (dual) norm of ‖·‖, i.e., ‖y‖_* := max_{‖x‖≤1} ⟨y, x⟩.

### 2.2 The CSA method for general convex problems

In this section, we consider the constrained optimization problem in (1.1). We assume that the expectation functions f and g, in addition to being well-defined and finite-valued for every x ∈ X, are continuous and convex on X. Throughout this section, we make the following assumption on the stochastic subgradients.

###### Assumption 1

For any x ∈ X, a.e. ζ and a.e. ξ,

E[‖F′(x, ζ)‖²_*] ≤ M_F²  and  E[‖G′(x, ξ)‖²_*] ≤ M_G²,

where F′(x, ζ) ∈ ∂_x F(x, ζ) and G′(x, ξ) ∈ ∂_x G(x, ξ).

The CSA method can be viewed as a stochastic counterpart of Polyak's subgradient method, which was originally designed for solving deterministic nonsmooth convex optimization problems. At each iterate x_k, depending on whether the condition g(x_k) ≤ η_k holds for some tolerance η_k > 0, Polyak's method moves either along the subgradient direction f′(x_k) or g′(x_k), with an appropriately chosen stepsize which also depends on f(x_k) and g(x_k). However, Polyak's subgradient method cannot be applied to solve (1.1) because we do not have access to exact information about f, g and their subgradients. The CSA method differs from Polyak's subgradient method in the following three aspects. Firstly, the search direction h_k is defined in a stochastic manner: we first check whether the solution x_k we computed at iteration k violates the condition G(x_k, ξ_k) ≤ η_k for some η_k > 0 and a random realization ξ_k of ξ. If so, we set h_k = G′(x_k, ξ_k) in order to control the violation of the expectation constraint. Otherwise, we set h_k = F′(x_k, ζ_k). Secondly, for some 1 ≤ s ≤ N, we partition the indices {s, …, N} into two subsets: B, containing those k for which G(x_k, ξ_k) ≤ η_k, and N, containing the rest, and define the output x̄_{N,s} as an ergodic mean of the iterates x_k over B. This differs from Polyak's subgradient method, which defines the output solution as the best x_k, i.e., the one with the smallest objective value. Thirdly, while the original Polyak's subgradient method was developed only for general nonsmooth problems, we show that the CSA method also exhibits an optimal rate of convergence for solving strongly convex problems by properly choosing γ_k and η_k.

Notice that each iteration of the CSA algorithm only requires at most one realization of ξ and one of ζ. Hence, this algorithm can make progress in an on-line fashion as more and more samples of ξ and ζ are collected. Our goal in the remaining part of this section is to establish the rate of convergence of this algorithm, in terms of both the distance to the optimal value and the violation of the constraint. It should also be noted that Algorithm 1 is conceptual only, as we have not yet specified a few algorithmic parameters (e.g., γ_k and η_k). We will come back to this issue after establishing some general properties of this method.
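The CSA iteration described above can be sketched as follows, with a Euclidean projection standing in for the general prox-mapping (2.1). The toy objective, constraint, and parameter values are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def csa(F_grad, G_val, G_grad, project, x0, gammas, etas, s):
    """One pass of the cooperative SA (CSA) method, sketched with a
    Euclidean projection in place of the general prox-mapping (2.1).
    F_grad, G_grad return stochastic subgradients at x; G_val returns a
    stochastic constraint value G(x, xi_k)."""
    x = x0.copy()
    num, den = np.zeros_like(x0), 0.0
    for k in range(len(gammas)):
        if G_val(x) > etas[k]:          # constraint violated: move along G'
            h = G_grad(x)
        else:                           # k belongs to the index set B
            h = F_grad(x)
            if k >= s:                  # accumulate the ergodic mean over B
                num += gammas[k] * x
                den += gammas[k]
        x = project(x - gammas[k] * h)  # projected (prox) step
    return num / den if den > 0 else x

# toy instance: min E[(x - zeta)^2] s.t. E[x - xi] <= 0 over X = [-2, 2]
rng = np.random.default_rng(0)
F_grad = lambda x: 2.0 * (x - (1.0 + rng.normal(scale=0.1)))
G_val = lambda x: x[0] - rng.normal(scale=0.1)
G_grad = lambda x: np.ones_like(x)
project = lambda z: np.clip(z, -2.0, 2.0)
N = 2000
gammas = np.full(N, 0.5 / np.sqrt(N))
etas = np.full(N, 4.0 / np.sqrt(N))
xbar = csa(F_grad, G_val, G_grad, project, np.array([2.0]), gammas, etas, s=0)
```

On this toy instance the unconstrained minimizer is x = 1, while the constraint pushes the solution toward x = 0; the averaged output lands near the constraint boundary, within the tolerance η_k.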

The following result establishes a simple but important recursion about the CSA method for stochastic optimization with expectation constraints.

###### Proposition 2

For any N ≥ s ≥ 1, we have

∑_{k∈N} γ_k(η_k − G(x, ξ_k)) + ∑_{k∈B} γ_k⟨F′(x_k, ζ_k), x_k − x⟩ ≤ V(x_s, x) + (1/2)∑_{k∈B} γ_k²‖F′(x_k, ζ_k)‖²_* + (1/2)∑_{k∈N} γ_k²‖G′(x_k, ξ_k)‖²_*,   (2.7)

for all x ∈ X.

Proof. For any x ∈ X and s ≤ k ≤ N, using Lemma 1, we have

V(x_{k+1}, x) ≤ V(x_k, x) + γ_k⟨h_k, x − x_k⟩ + (1/2)γ_k²‖h_k‖²_*.   (2.8)

Observe that if k ∈ B, we have h_k = F′(x_k, ζ_k) and

⟨h_k, x_k − x⟩ = ⟨F′(x_k, ζ_k), x_k − x⟩.

Moreover, if k ∈ N, we have h_k = G′(x_k, ξ_k) and

⟨h_k, x_k − x⟩ = ⟨G′(x_k, ξ_k), x_k − x⟩ ≥ G(x_k, ξ_k) − G(x, ξ_k) ≥ η_k − G(x, ξ_k).

Summing up the inequalities in (2.8) from k = s to N and using the previous two observations, we obtain

V(x_{N+1}, x) ≤ V(x_s, x) − ∑_{k=s}^N γ_k⟨h_k, x_k − x⟩ + (1/2)∑_{k=s}^N γ_k²‖h_k‖²_*   (2.9)
≤ V(x_s, x) − [∑_{k∈N} γ_k⟨G′(x_k, ξ_k), x_k − x⟩ + ∑_{k∈B} γ_k⟨F′(x_k, ζ_k), x_k − x⟩] + (1/2)∑_{k=s}^N γ_k²‖h_k‖²_*
≤ V(x_s, x) − [∑_{k∈N} γ_k(η_k − G(x, ξ_k)) + ∑_{k∈B} γ_k⟨F′(x_k, ζ_k), x_k − x⟩] + (1/2)∑_{k∈B} γ_k²‖F′(x_k, ζ_k)‖²_* + (1/2)∑_{k∈N} γ_k²‖G′(x_k, ξ_k)‖²_*.

Rearranging the terms in the above inequality, we obtain (2.7).

Using Proposition 2, we present below a sufficient condition under which the output solution x̄_{N,s} is well-defined.

###### Lemma 3

Let x* be an optimal solution of (1.1). If

((N − s + 1)/2) min_{s≤k≤N} γ_k η_k > D_X² + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G²,   (2.10)

then B ≠ ∅, i.e., x̄_{N,s} is well-defined. Moreover, at least one of the following two statements holds:

a) |B| ≥ (N − s + 1)/2;

b) ∑_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0.

Proof. Taking expectation w.r.t. the random samples {ζ_k} and {ξ_k} on both sides of (2.7) and fixing x = x*, we have

∑_{k∈N} γ_k[η_k − g(x*)] + ∑_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ V(x_s, x*) + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G²   (2.11)
≤ D_X² + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G².

If ∑_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0, part b) holds. Otherwise, we have

∑_{k∈N} γ_k[η_k − g(x*)] ≤ V(x_s, x*) + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G²,

which, in view of g(x*) ≤ 0, implies that

∑_{k∈N} γ_k η_k ≤ V(x_s, x*) + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G².   (2.12)

Suppose that part a) does not hold, i.e., |B| < (N − s + 1)/2, so that |N| ≥ (N − s + 1)/2. Then,

∑_{k∈N} γ_k η_k ≥ ((N − s + 1)/2) min_{s≤k≤N} γ_k η_k > V(x_s, x*) + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G²,

which contradicts (2.12). Hence, part a) holds.

Now we are ready to establish the main convergence properties of the CSA method.

###### Theorem 4

Suppose that γ_k and η_k in the CSA algorithm are chosen such that (2.10) holds. Then, we have

E[f(x̄_{N,s}) − f(x*)] ≤ (2D_X² + max{M_F², M_G²} ∑_{s≤k≤N} γ_k²)/((N − s + 1) min_{s≤k≤N} γ_k),   (2.13)
E[g(x̄_{N,s})] ≤ max_{s≤k≤N} η_k.   (2.14)

Proof. We first show (2.13). If Lemma 3 part b) holds, then dividing both sides of the inequality in part b) by ∑_{k∈B} γ_k and taking expectation, we have

E[f(x̄_{N,s}) − f(x*)] ≤ 0.   (2.15)

If part a) holds, we have |B| ≥ (N − s + 1)/2. It follows from the definition of x̄_{N,s} in (2.6), the convexity of f and (2.11) that

∑_{k∈N} γ_k η_k + ∑_{k∈B} γ_k E[f(x̄_{N,s}) − f(x*)] ≤ ∑_{k∈N} γ_k η_k + ∑_{k∈B} E[γ_k(f(x_k) − f(x*))] ≤ D_X² + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G²,

which implies that

|N| min_{s≤k≤N} γ_k η_k + (∑_{k∈B} γ_k) E[f(x̄_{N,s}) − f(x*)] ≤ D_X² + (1/2)∑_{k∈B} γ_k² M_F² + (1/2)∑_{k∈N} γ_k² M_G².   (2.16)

Using this bound and the fact that ∑_{k∈B} γ_k ≥ ((N − s + 1)/2) min_{s≤k≤N} γ_k in (2.16), we have

E[f(x̄_{N,s}) − f(x*)] ≤ (2D_X² + ∑_{k∈B} γ_k² M_F² + ∑_{k∈N} γ_k² M_G²)/((N − s + 1) min_{s≤k≤N} γ_k)   (2.17)
≤ (2D_X² + max{M_F², M_G²} ∑_{s≤k≤N} γ_k²)/((N − s + 1) min_{s≤k≤N} γ_k).

Combining (2.15) and (2.17), we obtain (2.13). Now we show that (2.14) holds. For any k ∈ B, we have G(x_k, ξ_k) ≤ η_k. Taking expectations w.r.t. ξ on both sides, we have g(x_k) ≤ η_k, which, in view of the definition of x̄_{N,s} in (2.6) and the convexity of g, then implies that

E[g(x̄_{N,s})] ≤ E[∑_{k∈B} γ_k g(x_k)]/∑_{k∈B} γ_k ≤ ∑_{k∈B} γ_k η_k/∑_{k∈B} γ_k ≤ max_{s≤k≤N} η_k.   (2.18)

Below we provide a few specific selections of s, γ_k and η_k that lead to the optimal rate of convergence for the CSA method. In particular, we present a constant and a variable stepsize policy, respectively, in Corollaries 5 and 6.

###### Corollary 5

If s = 1 and, for k = 1, …, N, γ_k = D_X/((M_F + M_G)√N) and η_k = 4D_X(M_F + M_G)/√N, then

E[f(x̄_{N,s}) − f(x*)] ≤ 4D_X(M_F + M_G)/√N,
E[g(x̄_{N,s})] ≤ 4D_X(M_F + M_G)/√N.

Proof. First, observe that condition (2.10) holds by using the facts that

 N−s+12mink∈Nγkηk=N24D2XN=2D2X, D2X+12∑k∈Bγ2kM2F+12∑k∈Nγ2kM2G ≤D2X+12∑k∈BD2XM2FN(MF+MG)2+12∑k∈ND2XM2GN(MF+MG)2 ≤D2X+12∑Nk=1D2XN≤2D2X.

It then follows from Lemma 3 and Theorem 4 that

E[f(x̄_{N,s}) − f(x*)] ≤ [2D_X(M_F + M_G) + ∑_{k∈B} D_X M_F²/(N(M_F + M_G)) + ∑_{k∈N} D_X M_G²/(N(M_F + M_G))]/√N ≤ 4D_X(M_F + M_G)/√N,
E[g(x̄_{N,s})] ≤ max_{s≤k≤N} η_k = 4D_X(M_F + M_G)/√N.
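A small helper for the constant policy, under the assumed reading γ_k = D_X/((M_F + M_G)√N) and η_k = 4D_X(M_F + M_G)/√N, which is consistent with the identity γ_k η_k = 4D_X²/N used in the calculation above; the numeric inputs are placeholders.

```python
import math

def csa_constant_policy(D_X, M_F, M_G, N):
    """Constant stepsize/tolerance pair for CSA (an assumed reading of
    Corollary 5): gamma = D_X / ((M_F + M_G) * sqrt(N)) and
    eta = 4 * D_X * (M_F + M_G) / sqrt(N), used for k = 1, ..., N."""
    gamma = D_X / ((M_F + M_G) * math.sqrt(N))
    eta = 4.0 * D_X * (M_F + M_G) / math.sqrt(N)
    return gamma, eta

gamma, eta = csa_constant_policy(D_X=1.0, M_F=1.0, M_G=1.0, N=400)
# with this pairing, gamma * eta = 4 * D_X**2 / N
```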

###### Corollary 6

If s = ⌈N/2⌉ and, for k = 1, …, N, γ_k = D_X/((M_F + M_G)√k) and η_k = 4D_X(M_F + M_G)/√k, then

E[f(x̄_{N,s}) − f(x*)] ≤ 4D_X(1 + (1/2)log 2)(M_F + M_G)/√N,
E[g(x̄_{N,s})] ≤ 4√2 D_X(M_F + M_G)/√N.

Proof. The proof is similar to that of corollary 4 and hence the details are skipped.

In view of Corollaries 5 and 6, the CSA algorithm achieves an O(1/√N) rate of convergence for solving problem (1.1). This convergence rate appears unimprovable, as it matches the optimal rate of convergence for deterministic convex optimization problems with functional constraints. However, to the best of our knowledge, no such complexity bounds had been obtained before for solving stochastic optimization problems with functional expectation constraints.

### 2.3 Strongly convex objective and strongly convex constraints

In this subsection, we are interested in establishing the convergence of the CSA algorithm applied to strongly convex problems. More specifically, we assume that the objective function F(·, ζ) and the constraint function G(·, ξ) are both strongly convex, i.e., ∃ μ_F > 0 and μ_G > 0 s.t.

F(x₁, ζ) ≥ F(x₂, ζ) + ⟨F′(x₂, ζ), x₁ − x₂⟩ + (μ_F/2)‖x₁ − x₂‖²,  ∀ x₁, x₂ ∈ X,
G(x₁, ξ) ≥ G(x₂, ξ) + ⟨G′(x₂, ξ), x₁ − x₂⟩ + (μ_G/2)‖x₁ − x₂‖²,  ∀ x₁, x₂ ∈ X.

In order to estimate the convergence rate of the CSA algorithm for solving strongly convex problems, we need to assume that the prox-function satisfies a quadratic growth condition:

V_X(z, x) ≤ (Q/2)‖z − x‖²,  ∀ z, x ∈ X.   (2.19)

Moreover, letting {γ_k} be the stepsizes used in the CSA method, and denoting

a_k = μ_F γ_k/Q if k ∈ B,  a_k = μ_G γ_k/Q if k ∈ N;  A_1 = 1,  A_k = (1 − a_k)A_{k−1} for k ≥ 2,

we modify the output in Algorithm 1 to

x̄_{N,s} = ∑_{k∈B} ρ_k x_k / ∑_{k∈B} ρ_k,   (2.20)

where ρ_k := γ_k/A_k. The following simple result will be used in the convergence analysis of the CSA method.

###### Lemma 7

If a_k ∈ (0, 1), k = 1, 2, …, and the nonnegative sequence {Δ_k} satisfies

Δ_{k+1} ≤ (1 − a_k)Δ_k + B_k,  ∀ k ≥ 1,

then we have

Δ_{k+1}/A_k ≤ (1 − a_1)Δ_1 + ∑_{i=1}^k B_i/A_i.
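Lemma 7 is easy to sanity-check numerically: run the recursion at equality (its tightest admissible case) and compare against the stated bound. The helper function and the test sequences below are illustrative, not part of the paper.

```python
def lemma7_check(a, B, delta1, tol=1e-9):
    """Numerically verify Lemma 7: run the tightest admissible recursion
    Delta_{k+1} = (1 - a_k) Delta_k + B_k from Delta_1 = delta1 and check
    Delta_{k+1} / A_k <= (1 - a_1) * Delta_1 + sum_{i<=k} B_i / A_i,
    where A_1 = 1 and A_k = (1 - a_k) * A_{k-1} for k >= 2."""
    delta, A, rhs = delta1, 1.0, 0.0
    for k, (a_k, B_k) in enumerate(zip(a, B), start=1):
        if k >= 2:
            A *= (1.0 - a_k)            # A_k = (1 - a_k) * A_{k-1}
        delta = (1.0 - a_k) * delta + B_k  # equality case of the hypothesis
        rhs += B_k / A
        if delta / A > (1.0 - a[0]) * delta1 + rhs + tol:
            return False
    return True

ok = lemma7_check(a=[0.5, 0.4, 0.3, 0.2], B=[0.1, 0.1, 0.1, 0.1], delta1=2.0)
```

In the equality case the two sides of the conclusion match exactly (up to floating-point error), which mirrors the telescoping argument behind the lemma: dividing the hypothesis by A_k turns it into a sum of the B_i/A_i terms.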

Below we provide an important recursion for CSA applied to strongly convex problems. This result differs from Proposition 2 for the general convex case in that we use the different weights ρ_k rather than γ_k.

###### Proposition 8

For any x ∈ X, we have

 (2.21)

Proof. Consider the iteration k, s ≤ k ≤ N. If k ∈ B, by Lemma 1 and the strong convexity of F(·, ζ), we have

V(x_{k+1}, x) ≤ V(x_k, x) − γ_k⟨h_k, x_k − x⟩ + (1/2)γ_k²‖F′(x_k, ζ_k)‖²_*
= V(x_k, x) − γ_k⟨F′(x_k, ζ_k), x_k − x⟩ + (1/2)γ_k²‖F′(x_k, ζ_k)‖²_*
≤ V(x_k, x) − γ_k[F(x_k, ζ_k) − F(x, ζ