1 Introduction
In this paper, we study two related stochastic programming (SP) problems with expectation constraints. The first one is a classical SP problem with the expectation constraint over the decision variables, formally defined as
(1.1)  
where is a convex compact set, and
are random vectors supported on
and , respectively, and are closed convex functions w.r.t. for a.e. and . Moreover, we assume that and are independent of . Under these assumptions, (1.1) is a convex optimization problem.Problem (1.1) has many applications in operations research, finance and data analysis. One motivating example is SP with the conditional value at risk (CVaR) constraint. In an important work [31], Rockafellar and Uryasev shows that a class of asset allocation problem can be modeled as
(1.2) 
where denotes the random return with mean . Expectation constraints also play an important role in providing tight convex approximation to chance constrained problems (e.g., Nemirovksi and Shapiro [24]). Some other important applications of (1.1
) can be found in semisupervised learning (see, e.g.,
[6]). For example, one can use the objective function to define the fidelity of the model for the labelled data, while using the constraint to enforce some other properties of the model for the unlabelled data (e.g., proximity for data with similar features).While problem (1.1) covers a wide class of problems with expectation constraints over the decision variables, in practice we often encounter the situation where the expectation constraint is defined over the problem parameters. Under these circumstances our goal is to find a pair of parameters and decision variables such that
(1.3)  
(1.4) 
Here is convex w.r.t. for a.e. but possibly nonconvex w.r.t. jointly, and is convex w.r.t. for a.e. . Moreover, we assume that and are independent of and , respectively, while is not necessarily independent of . Note that (1.3)(1.4) defines a pair of optimization and feasibility problems coupled through the following ways: a) the solution to (1.4) defines an admissible parameter of (1.3
); b) the random variables
and can be dependent, e.g., ; and c)can be a random variable with probability distribution parameterized by
.) also has many applications, especially in data analysis. One such example is to learn a classifier
with a certain metricusing the support vector machine model:
(1.5)  
(1.6) 
where
denotes the hinge loss function,
, , and are the random variables satisfying certain probability distributions, and are certain given parameters. In this problem, (1.5) is used to learn the classifer by using the metric satisfying certain requirements in (1.6), including the low rank (or nuclear norm) assumption. Problem (1.3)(1.4) can also be used in some datadriven applications, where one can use (1.4) to specify the parameters for the probabilistic models associated with the random variable , as well as some other applications for multiobjective stochastic optimization.In spite of its wide applicability, the study on efficient solution methods for expectation constrained optimization is still limited. For the sake of simplicity, suppose for now that is given as a deterministic vector and hence that the objective functions and in (1.1) and (1.3) are easily computable. One popular method to solve stochastic optimization problems is called the sample average approximation (SAA) approach ([35, 18, 38]). To apply SAA for (1.1) and (1.4), we first generate a random sample , for some and then approximate by . The main issues associated with the SAA for solving (1.1) include: i) the deterministic SAA problem might not be feasible; ii) the resulting deterministic SAA problem is often difficult to solve especially when is large, requiring going through the whole sample at each iteration; and ii) it is not applicable to the online setting where one needs to update the decision variable whenever a new piece of sample , , is collected.
A different approach to solve stochastic optimization problems is called stochastic approximation (SA), which was initially proposed in a seminal paper by Robbins and Monro[30] in 1951 for solving strongly convex SP problems. This algorithm mimics the gradient descent method by using the stochastic gradient rather than the original gradient for minimizing in (1.1) over a simple convex set (see also [4, 10, 11, 26, 32, 36]). An important improvement of this algorithm was developed by Polyak and Juditsky([28],[29]) through using longer steps and then averaging the obtained iterates. Their method was shown to be more robust with respect to the choice of stepsize than classic SA method for solving strongly convex SP problems. More recently, Nemirovski et al. [23] presented a modified SA method, namely, the mirror descent SA method, and demonstrated its superior numerical performance for solving a general class of nonsmooth convex SP problems. The SA algorithms have been intensively studied over the past few years (see, e.g., [19, 12, 13, 9, 39, 14, 22, 33]). It should be noted, however, that none of these SA algorithms are applicable to expectation constrained problems, since each iteration of these algorithms requires the projection over the feasible set , which is computationally prohibitive as is given in the form of expectation.
In this paper, we intend to develop efficient solution methods for solving expectation constrained problems by properly addressing the aforementioned issues associated with existing SA methods. Our contribution mainly exists in the following several aspects. Firstly, inspired by Polayk’s subgradient method for constrained optimization [27, 25], we present a new SA algorithm, namely the cooperative SA (CSA) method for solving the SP problem with expectation constraint in (1.1). At the th iteration, CSA performs a projected subgradient step along either or over the set , depending on whether the stochastic condition is satisfied or not. We introduce an index set in order to compute the output solution as a weighted average of the iterates in . By carefully bounding , we show that the number of iterations performed by the CSA algorithm to find an solution of (1.1), i.e., a point s.t. and , can be bounded by . Moreover, when both and are strongly convex, by using a different set of algorithmic parameters we show that the complexity of the CSA method can be significantly improved to . It it is worth mentioning that this result is new even for solving deterministic strongly convex problems with functional constraints. We also established the largedeviation properties for the CSA method under certain lighttail assumptions.
Secondly, we develop a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) method for solving the SP problem with expectation constraints on problem parameters in (1.3)(1.4). At the the iteration, CSPA performs either a projected subgradient step along over the set , or along over the set , depending on whether the stochastic condition is satisfied or not. We then define the output solution as a randomly selected iterate from the index set (rather than the weighted average as in CSA) mainly due to the possible nonconvexity of w.r.t. . By carefully bounding , we show that the number of iterations performed by the CSPA algorithm to find an solution of (1.3)(1.4), i.e., a pair of solution s.t. and , can be bounded by . Moreover, this bound can be significantly improved to if and are strongly convex w.r.t. and , respectively. We also derive the largedeviation properties for the CSPA method and design a twophase procedure to improve such properties of CSPA under certain lighttail assumptions.
To the best of our knowledge, all the aforementioned algorithmic developments are new in the stochastic optimization literature. It is also worth mentioning a few alternative or related methods to solve (1.1) and (1.3)(1.4). First, without efficient methods to directly solve (1.1), current practice resorts to reformulate it as for some . However, one then has to face the difficulty of properly specifying , since an optimal selection would depend on the unknown dual multiplier. As a consequence, we cannot assess the quality of the solutions obtained by solving this reformulated problem. Second, one alternative approach to solve (1.1) is the penaltybased or primaldual approach. However these methods would require either the estimation of the optimal dual variables or iterations performed on the dual space (see [7], [23] and [20]). Moreover, the rate of convergence of these methods for functional constrained problems has not been wellunderstood other than conic constraints even for the deterministic setting. Third, in [17] (and see references therein), Jiang and Shanbhag developed a coupled SA method to solve a stochastic optimization problem with parameters given by another optimization problem, and hence is not applicable to problem (1.3)(1.4). Moreover, each iteration of their method requires two stochastic subgradient projection steps and hence is more expensive than that of CSPA.
The remaining part of this paper is organized as follows. In Section 2, we present the CSA algorithm and establish its convergence properties under general convexity and strong convexity assumptions. Then in Section 3, we develop a variant of the CSA algorithm, namely the CSPA for solving SP problems with the expectation constraint over problem parameters and discuss its convergence properties. We then present some numerical results for these new SA methods in section 4. Finally some concluding remarks are added in Section 5.
2 Expectation constraints over decision variables
In this section we present the cooperative SA (CSA) algorithm for solving convex stochastic optimization problems with the expectation constraint over decision variables. More specifically, we first briefly review the distance generating function and proxmapping in Subsection 2.1. We then describe the CSA algorithm and discuss its convergence properties for solving general convex problems in Subsection 2.2, and show how to improve the convergence of this algorithm by imposing strong convexity assumptions in Subsection 2.3. Finally, we establish some large deviation properties for the CSA method in Subsection 2.4.
2.1 Preliminary: proxmapping
Recall that a function is a distance generating function with parameter , if is continuously differentiable and strongly convex with parameter with respect to . Without loss of generality, we assume throughout this paper that , because we can always rescale to . Therefore, we have
The proxfunction associated with is given by
is also called the Bregman’s distance, which was initially studied by Bregman [5] and later by many others (see [1],[2] and [37]). In this paper we assume the proxfunction is chosen such that, for a given , the proxmapping defined as
(2.1) 
is easily computed.
It can be seen from the strong convexity of that
(2.2) 
Whenever the set is bounded, the distance generating function also gives rise to the diameter of that will be used frequently in our convergence analysis:
(2.3) 
The following lemma follows from the optimality condition of (2.1) and the definition of the proxfunction (see the proof in [23]).
Lemma 1
For every , and , we have
where the denotes the conjugate of , i.e., .
2.2 The CSA method for general convex problems
In this section, we consider the constrained optimization problem in (1.1). We assume the expectation functions and , in addition to being welldefined and finitevalued for every , are continuous and convex on . Throughout this section, we make the following assumption for the stochastic subgradients.
Assumption 1
For any , a.e. and a.e. ,
where and .
The CSA method can be viewed as a stochastic counterpart of Polayk’s subgradient method, which was originally designed for solving deterministic nonsmooth convex optimization problems (see [27] and a more recent generalization in [3]). At each iterate , , depending on whether for some tolerance , it moves either along the subgradient direction or , with an appropriately chosen stepsize which also depends on and . However, Polayk’s subgradient method cannot be applied to solve (1.1) because we do not have access to exact information about , and . The CSA method differs from Polyak’s subgradient method in the following three aspects. Firstly, the search direction is defined in a stochastic manner: we first check if the solution we computed at iteration k violates the condition for some and a random realization of . If so, we set the in order to control the violation of expectation constraint. Otherwise, we set . Secondly, for some , we partition the indices into two subsets: and , and define the output as an ergodic mean of over . This differs from the Polyak’s subgradient method that defines the output solution as the best , with the smallest objective value. Thirdly, while the original Polayk’s subgradient method were developed only for general nonsmooth problems, we show that the CSA method also exhibits an optimal rate of convergence for solving strongly convex problems by properly choosing and .
Input: initial point , stepsizes and tolerances .
(2.4)  
(2.5) 
(2.6) 
Notice that each iteration of the CSA algorithm only requires at most one realization and . Hence, this algorithm can make progress in an online fashion as more and more samples of and are collected. Our goal in the remaining part of this section is to establish the rate of convergence associated with this algorithm, in terms of both the distance to the optimal value and the violation of constraints. It should also be noted that Algorithm 1 is conceptional only as we have not specified a few algorithmic parameters (e.g. and ). We will come back to this issue after establishing some general properties about this method.
The following result establishes a simple but important recursion about the CSA method for stochastic optimization with expectation constraints.
Proposition 2
For any , we have
(2.7) 
for all .
Proof. For any , using Lemma 1, we have
(2.8) 
Observe that if , we have and
Moreover, if , we have and
Summing up the inequalities in (2.8) from to and using the previous two observations, we obtain
(2.9)  
Rearranging the terms in above inequality, we obtain (2.7)
Using Proposition 2, we present below a sufficient condition under which the output solution is welldefined.
Lemma 3
Let be an optimal solution of (1.1). If
(2.10) 
then , i.e., is welldefined. Moreover, we have one of the following two statements holds,
 a)

 b)

Proof. Taking expectation w.r.t. and on both sides of (2.7) and fixing , we have
(2.11)  
If , part b) holds. If , we have
which, in view of , implies that
(2.12) 
Suppose that , i.e., . Then,
which contradicts with (2.12). Hence, part a) holds.
Now we are ready to establish the main convergence properties of the CSA method.
Theorem 4
Suppose that and in the CSA algorithm are chosen such that (2.10) holds. Then for any , we have
(2.13)  
(2.14) 
Proof. We first show (2.13). By Lemma 3, if Lemma 3 part (b) holds, dividing both sides of (LABEL:lemma_b) by and taking expectation, we have
(2.15) 
If , we have It follows from the definition of in (2.6), the convexity of and (2.11) that
which implies that
(2.16) 
Using this bound and the fact in (2.16), we have
(2.17)  
Combining these two inequalities (2.15) and (2.17), we have (2.13). Now we show that (2.14) holds. For any , we have . Taking expectations w.r.t. on both sides, we have , which, in view of the definition of in (2.6) and the convexity of , then implies that
(2.18) 
Below we provide a few specific selections of , and that lead to the optimal rate of convergence for the CSA method. In particular, we will present a constant and variable stepsize policy, respectively, in Corollaries 5 and 6.
Corollary 5
If s=1, and , , then
Proof. First, observe that condition (2.10) holds by using the facts that
It then follows from Lemma 3 and Theorem 4 that
Corollary 6
If , and , then
Proof. The proof is similar to that of corollary 4 and hence the details are skipped.
In view of Corollaries 5 and 6, the CSA algorithm achieves an rate of convergence for solving problem (1.1). This convergence rate seems to be unimprovable as it matches the optimal rate of convergence for deterministic convex optimization problems with functional constraints [25]. However, to the best of our knowledge, no such complexity bounds have been obtained before for solving stochastic optimization problems with functional expectation constraints.
2.3 Strongly convex objective and strongly convex constraints
In this subsection, we are interested in establishing the convergence of the CSA algorithm applied to strongly convex problems. More specifically, we assume that the objective function and constraint function are both strongly convex w.r.t. , i.e., and s.t.
In order to estimate the convergent rate of the CSA algorithm for solving strongly convex problems, we need to assume that the proxfunction satisfies a quadratic growth condition
(2.19) 
Moreover, letting be the stepsizes used in the CSA method, and denoting
we modify the output in Algorithm 1 to
(2.20) 
where . The following simple result will be used in the convergence analysis of the CSA method.
Lemma 7
If , k = 0,1,2,…, , and satisfies
then we have
Below we provide an important recursion about CSA applied to strongly convex problems. This result differs from Proposition 2 for the general convex case in that we use different weight rather than .
Proposition 8
For any , we have
(2.21) 
Proof. Consider the iteration , . If , by Lemma 1 and the strong convexity of , we have