1 Introduction
In this paper, we focus on a stochastic optimization problem with multiple expectation constraints. Specifically, we are interested in solving a problem of the following form:
(1)  $\min_{x \in X} \; f(x) := \mathbb{E}\left[F(x, \xi_0)\right]$
subject to  $g_i(x) := \mathbb{E}\left[G_i(x, \xi_i)\right] \le 0, \quad i = 1, \ldots, m,$
where $X$ is a convex compact set,
$\xi_i$ are random variables for
$i = 0, 1, \ldots, m$. $F(\cdot, \xi_0)$ and $G_i(\cdot, \xi_i)$ for $i = 1, \ldots, m$ are closed convex functions with respect to $x$ for a.e. $\xi_i$, for $i = 0, \ldots, m$. There are several applications of the above problem formulation (1), especially in fields such as control theory [10], management science [3], finance [12], etc. Our specific motivation comes from a problem arising on a large-scale social network platform. Any such platform does extensive experimentation to identify which parameters should be applied to which members to get the largest metric gains. An example of such an influential parameter is the gap between successive ads in a newsfeed product (described in detail in Section 4).
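As a concrete toy instance of (1), the sketch below builds a two-dimensional problem with one expectation constraint and estimates both expectations by Monte Carlo sample averages. All distributions, constants, and the candidate point are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy instance of (1), with all distributions and constants illustrative:
# minimize f(x) = E||x - xi||^2 over the box X = [-1, 1]^2, subject to one
# expectation constraint g(x) = E[<xi, x>] + 0.5 <= 0, where xi ~ N(mu, I).
mu = np.array([0.3, -0.2])

def F(x, xi):
    return np.sum((x - xi) ** 2, axis=-1)

def G(x, xi):
    return xi @ x + 0.5

x = np.array([0.5, -0.5])                    # a candidate point in X

# The expectations are rarely available in closed form; here we estimate
# them by sample averages over i.i.d. draws of xi.
samples = rng.normal(mu, 1.0, size=(100_000, 2))
f_hat = F(x, samples).mean()
g_hat = G(x, samples).mean()
print(f_hat)   # approx ||x - mu||^2 + 2  = 2.13 for this x
print(g_hat)   # approx <mu, x> + 0.5     = 0.75, so this x is infeasible
```

For this instance the exact values are available in closed form, which makes it easy to check that the sample averages converge to the true expectations.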
Traditionally, stochastic optimization problems were solved either via sample average approximation (SAA) [6, 13, 17] or via stochastic approximation (SA) [11]. In SAA, each expectation function $f$ and $g_i$ for $i = 1, \ldots, m$ is approximated by a sample average, and the resulting deterministic problem is solved via traditional means. The SAA approach is computationally challenging, may not be applicable in the online setting, and the approximation might lead to an infeasible problem. In SA, the algorithm follows the usual gradient descent scheme, using a stochastic gradient in place of the exact gradient to solve (1) [2, 14]. The SA approach requires a projection onto the domain specified by the constraints, which may not be possible when the constraints are given as expectations.
There have been several papers showing improvement over the original SA method, especially for strongly convex problems [9] and for a general class of nonsmooth convex stochastic programming problems [8]. However, these methods may not be directly applicable to the case where each constraint is an expectation constraint. Very recently, [18] developed a method of stochastic online optimization with general stochastic constraints by generalizing Zinkevich's online convex optimization. Their approach is much more general and, as a result, may not be the optimal approach for this specific problem. For more related work, we refer to [7] and [18] and the references therein.
Lan et al. (2016) [7] introduced the cooperative stochastic approximation (CSA) algorithm to solve problem (1) with a single expectation constraint. In this paper, we extend their algorithm to multiple expectation constraints and also close several gaps in their proof of optimal convergence. Specifically, we prove that a point $\bar{x}$ satisfying
$\mathbb{E}\left[f(\bar{x}) - f(x^*)\right] = O(1/\sqrt{N})$ and $\mathbb{E}\left[g_i(\bar{x})\right] = O(1/\sqrt{N})$ for $i = 1, \ldots, m$
can be obtained with $N$ steps of our algorithm. These rates are optimal due to the lower bound results following from [1]. We primarily focus on the theoretical analysis of the algorithm in this paper. We use a completely novel proof technique and overcome several gaps in the proof of [7], which we highlight in Section 3. Furthermore, we run experiments on simulated data to show that this algorithm empirically outperforms the algorithm in [18]. For more practical results, we refer the reader to [15].
The rest of the paper is organized as follows. In Section 2, we introduce the multiple cooperative stochastic approximation (MCSA) algorithm for solving the problem stated in (1). In Section 3, we prove the convergence guarantee of our algorithm. We discuss some empirical results in Section 4 before concluding with a discussion in Section 5.
2 Multiple Cooperative Stochastic Approximation (MCSA)
We begin with the definition of a projection operator. Let $\omega: X \to \mathbb{R}$ be a 1-strongly convex proximal function, i.e., we have $\omega(y) \ge \omega(x) + \langle \nabla \omega(x), y - x \rangle + \tfrac{1}{2}\|y - x\|^2$ for all $x, y \in X$. The Bregman divergence [4] associated with a strongly convex and differentiable function $\omega$ is defined as $V(x, y) := \omega(y) - \omega(x) - \langle \nabla \omega(x), y - x \rangle$. Note that due to the strong convexity of $\omega$ we have $V(x, y) \ge \tfrac{1}{2}\|x - y\|^2$.
Definition 1.
Based on the function $\omega$, we define the proximal projection function as
(2)  $P_x(y) := \arg\min_{z \in X} \left\{ \langle y, z - x \rangle + V(x, z) \right\}.$
We fix $\omega$ and, in the rest of the paper, we denote the proximal projection function by $P$.
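For the common Euclidean choice of proximal function, the projection in (2) reduces to an ordinary Euclidean projection, which the following sketch illustrates on an assumed box-shaped feasible set (the set and the input values are illustrative only):

```python
import numpy as np

# Sketch of the proximal projection (2) for the common Euclidean choice
# omega(x) = 0.5 * ||x||^2, which gives V(x, z) = 0.5 * ||z - x||^2 and
#   P_x(y) = argmin_{z in X} { <y, z - x> + V(x, z) } = Proj_X(x - y).
# The feasible set here is an illustrative box X = [-1, 1]^2, for which the
# Euclidean projection is coordinate-wise clipping.
LO, HI = -1.0, 1.0

def prox_projection(x, y):
    """Bregman prox step P_x(y) under the Euclidean proximal function."""
    return np.clip(x - y, LO, HI)

x = np.array([0.8, -0.2])
d = np.array([0.5, -2.0])      # e.g. a step-size times a stochastic subgradient
print(prox_projection(x, d))   # [0.3, 1.0]: the second coordinate is clipped
```

Other choices of $\omega$ (e.g., the negative entropy on the simplex) give non-Euclidean prox steps while keeping the same definition (2).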
We assume that the objective function $f$ and each constraint function $g_i$ in (1) are well-defined, finite-valued, continuous and convex functions on $X$. Similar to the CSA algorithm [7], at each iterate $x_t$, we move along the subgradient direction $f'(x_t)$ if $\hat{g}_i(x_t) \le \eta_{i,t}$ for all $i = 1, \ldots, m$, for tolerance sequences $\{\eta_{i,t}\}$. Otherwise, we move along a subgradient direction $g'_{i^*}(x_t)$, where $i^*$ is chosen uniformly at random from the set of violated constraints $\{i : \hat{g}_i(x_t) > \eta_{i,t}\}$. This modification to the CSA algorithm allows us to work with multiple constraints. At each stage, we move along the chosen direction with a step-size $\gamma_t$. We will show in Section 3.2 how we can choose the tolerances so that we achieve the optimal convergence rates.
Since we do not have access to the exact functions $f$ and $g_i$ for $i = 1, \ldots, m$, we use an approximation $\hat{g}_i$ for each $g_i$, and we use the stochastic gradients $F'(x_t, \xi_{0,t})$ and $G'_i(x_t, \xi_{i,t})$ in place of $f'(x_t)$ and $g'_i(x_t)$ respectively, where $\xi_{i,t}$ are i.i.d. observations that we get to observe from the distribution of $\xi_i$ at iteration $t$. We run $N$ steps of our algorithm and choose our final solution $\bar{x}$ as the mean over a set of indices $B$ that is defined by
(3)  $B := \left\{ t \in \{s, \ldots, N\} : \hat{g}_i(x_t) \le \eta_{i,t} \text{ for all } i = 1, \ldots, m \right\}.$
Here $s$ denotes a burn-in period. Note that we need to get an approximation of $g_i(x_t)$ at every stage $t$ for all $i = 1, \ldots, m$. A consistent estimate of $g_i(x_t)$ is given by
(4)  $\hat{g}_i(x_t) := \frac{1}{L} \sum_{j=1}^{L} G_i\bigl(x_t, \xi_{i,t}^{(j)}\bigr),$
where $\xi_{i,t}^{(j)}$ for $j = 1, \ldots, L$ are i.i.d. observations from the distribution of $\xi_i$. Throughout this paper we assume that $L$ grows at the same rate as $N$. The full algorithm is written out as Algorithm 1.
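The iteration just described can be sketched in a few lines on a toy problem. Everything below (distributions, constants, the single constraint, and the step-size and tolerance choices) is illustrative and assumed for the sketch; it is not the tuned configuration from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch of the MCSA loop on a toy instance.  X = [-1, 1]^2.
# Objective f(x) = E[<xi, x>] with xi ~ N((-1, -0.5), I), so the unconstrained
# minimizer is (1, 1).  One constraint g(x) = E[zeta * x_1] - 0.2 with
# zeta ~ N(1, 1), i.e. g(x) = x_1 - 0.2 <= 0, which caps x_1 at 0.2.
c = np.array([-1.0, -0.5])

def sample_xi(size=None):
    return rng.normal(c, 1.0, size=size)

def sample_zeta(size=None):
    return rng.normal(1.0, 1.0, size=size)

N = 500                      # number of iterations
L = 500                      # samples per iteration for the constraint estimate (4)
gamma = 1.0 / np.sqrt(N)     # step-size
eta = 1.0 / np.sqrt(N)       # tolerance for the constraint check
s = N // 4                   # burn-in period
x = np.zeros(2)
kept = []                    # iterates whose indices fall in the set B of (3)

for t in range(N):
    zeta_batch = sample_zeta(L)
    g_hat = np.mean(zeta_batch * x[0]) - 0.2      # sample-average estimate of g(x_t)
    if g_hat <= eta:
        # constraint satisfied within tolerance: objective subgradient step
        h = sample_xi()                           # stochastic gradient of f
        if t >= s:
            kept.append(x.copy())
    else:
        # constraint violated: move along its stochastic subgradient
        # (with several constraints, one violated index would be picked at random)
        h = np.array([sample_zeta(), 0.0])
    x = np.clip(x - gamma * h, -1.0, 1.0)         # Euclidean prox step onto X

x_bar = np.mean(kept, axis=0)   # final solution: average over the kept iterates
print(x_bar)                     # close to the constrained optimum (0.2, 1.0)
```

The iterate is pushed toward the unconstrained minimizer until the estimated constraint value exceeds the tolerance, at which point the constraint's subgradient pulls it back, and the averaged kept iterates settle near the constrained optimum.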
3 Convergence Analysis
In this section, we study the convergence of the MCSA algorithm as described in Section 2. Specifically, we show that if we run $N$ iterations of Algorithm 1, then the expected error and the expected violation of the constraints are both $O(1/\sqrt{N})$. Note that the presence of interrelated stochastic components of the MCSA algorithm poses a novel challenge in deriving its rate of convergence. In particular, at every step MCSA makes a random decision for choosing the update direction depending on the current iterate $x_t$ and the estimates $\hat{g}_i(x_t)$, while $x_t$ is itself a random variable that depends on the observations up to step $t$ and on a similar random decision made at the previous step. A careful consideration of this intertwined stochasticity reveals several gaps in the convergence analysis of the CSA algorithm as presented in [7]. For this reason, we refrained from deriving the rate of convergence of MCSA by extending the convergence analysis of [7]. Our rigorous convergence analysis is the main contribution of this paper, and it also reinforces the convergence rate of the CSA algorithm [7].
Proof Overview: We begin with the identification of a sufficient condition on the threshold parameters such that the solution $\bar{x}$ is well-defined and the index set $B$ on average contains at least a constant fraction of the iterates (Theorem 1). To show that result, we use a concentration bound (Lemma 4) as well as the properties of a Donsker class of functions (Lemma 5). Finally, we prove the convergence rate of the MCSA algorithm in Theorem 2 using Theorem 1, an application of the FKG inequality (Lemma 7) and Lemma 5. We use the FKG inequality to untangle the dependency between the iterates and the random set $B$.
3.1 Supporting Results
Bregman Divergence: We begin by stating our first lemma, connecting the Bregman divergence and the proximal projection function (2). All proofs of lemmas in this subsection are deferred to the supplementary material for brevity.
Lemma 1.
For any and , we have where is the dual norm of .
Lemma 2.
For any , we have
(5) 
Assumption 1.
For any , the following holds
Lemma 3.
Concentration Bounds: We use a concentration result to achieve the optimal convergence. Towards that, we first define,
Now, we assume that the distribution of $G_i(x, \xi_i)$ has a light tail.
Assumption 2.
There exists a constant $\sigma > 0$ such that $\mathbb{E}\left[\exp\left(G_i(x, \xi_i)^2 / \sigma^2\right)\right] \le \exp(1)$ for all $x \in X$ and $i = 1, \ldots, m$.
Lemma 4.
Under Assumption 2, for any ,
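The behaviour that this kind of concentration bound quantifies can be seen in a quick simulation (illustrative numbers only, not the bound of Lemma 4 itself): for a light-tailed random variable, the RMS deviation of an $L$-sample average from its mean shrinks at the rate $1/\sqrt{L}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative check: for a light-tailed (here Gaussian) random variable,
# the RMS deviation of an L-sample average from its mean scales like
# 1/sqrt(L), which is the behaviour the concentration bound quantifies.
true_mean, reps = 0.3, 1000
rms = {}
for L in (100, 10_000):
    means = rng.normal(true_mean, 1.0, size=(reps, L)).mean(axis=1)
    rms[L] = np.sqrt(np.mean((means - true_mean) ** 2))
print(rms)   # RMS deviation roughly 0.1 for L=100 and 0.01 for L=10000
```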
Donsker Class: Recall that we approximate $g_i$ by $\hat{g}_i$ in Algorithm 1. Although the central limit theorem guarantees the convergence of $\sqrt{L}\left(\hat{g}_i(x) - g_i(x)\right)$ to a zero-mean Gaussian distribution for each fixed $x$, it does not ensure a "uniform convergence" over all $x \in X$ as in Lemma 5. To achieve that, we show that $\{G_i(x, \cdot) : x \in X\}$ is a Donsker class [16] under the following assumption.
Assumption 3.
For each $i$, the class of functions $\{G_i(x, \cdot) : x \in X\}$ satisfies the following Lipschitz condition:
$\left| G_i(x, \xi_i) - G_i(y, \xi_i) \right| \le M_i(\xi_i) \|x - y\|$
for all $x, y \in X$ and for some function $M_i$ satisfying $\mathbb{E}\left[M_i(\xi_i)^2\right] < \infty$.
The Lipschitz condition in Assumption 3 ensures that $G_i(\cdot, \xi_i)$ is sufficiently smooth for each $i$. It is easy to see that Assumption 3 holds if the derivative of $G_i(\cdot, \xi_i)$ is uniformly bounded by a constant. We use Assumption 3 to prove the following lemma.
Lemma 5.
FKG Inequality: The Fortuin-Kasteleyn-Ginibre (FKG) inequality [5] asserts positive correlation between two increasing functions on a finite partially ordered set.
Lemma 6.
Let $\Gamma$ be a partially ordered set that forms a finite distributive lattice, and let $\mu$ be a probability measure on $\Gamma$ satisfying the log-supermodularity condition $\mu(a \wedge b)\,\mu(a \vee b) \ge \mu(a)\,\mu(b)$ for all $a, b \in \Gamma$. If both $f$ and $g$ are increasing functions on $\Gamma$ with respect to the partial ordering of $\Gamma$, then $\mathbb{E}_\mu[fg] \ge \mathbb{E}_\mu[f]\,\mathbb{E}_\mu[g]$. The following result follows from Lemma 6.
Lemma 7.
Let , be as in Algorithm 1 and let . Then
3.2 Main Result
First, we present a sufficient condition under which the final solution is well-defined. We define,
(7) 
where $D_X$ denotes the diameter of $X$.
Theorem 1.
Let , , be as in Algorithm 1 and let . Let
(8) 
for some sufficiently large constant and . Then one of the following two conditions holds:

which implies for some

and almost surely.
Proof.
We show that if the second condition does not hold then the first condition must hold. Note that if the second condition does not hold,
where the equality holds if . Thus it follows from Lemma 3 with that
Moreover, since , , we have
(9) 
where is defined in (7). Note that, for , we have . Thus we have,
(10) 
Now consider the following two sets:
(11)  
(12) 
On the set , we have
(13) 
By combining (8), (10) and (13), we get
This implies on the set , we have . Thus, we get,
Next, we derive upper bounds for and . From straightforward calculations, it follows that
(14) 
where the last inequality follows from Lemma 5 for sufficiently large , since we have chosen . Now using Lemma 4 and choosing , we have
(15) 
where the last inequality follows from choosing suitably large . Thus, using (14) and (15) we get
∎
Proof.
First observe that the second condition of Theorem 1 implies that our algorithm has already converged. Specifically, if the second condition holds, our algorithm is well-defined and we have,
(17) 
where we have successively used the convexity of . Since , we get and hence the algorithm has already converged, i.e. . In this case, we have and . Thus, either our algorithm has already converged or the first condition of Theorem 1 holds. From the first condition of Theorem 1, we have for sufficiently large and for some . Now using the convexity of and the fact that for , we have
where we have used Lemma 5 to get the last equality. Similarly, using convexity of
(18) 
where the last inequality follows from Lemma 7. Note that applying Jensen’s inequality for the concave function we have,
(19) 
where we have used the definition of and the fact that . Moreover, using the fact that both and do not depend on we have,
(20) 
where the first inequality follows from the convexity of , the second inequality follows from Lemma 2 with and the definition of given in Lemma 3, and the third inequality follows from the definition of from (7) and the fact that . By plugging (19) and (20) into (18) we get,
Finally,
where the last inequality follows from the second part of Lemma 5 and the fact that . Thus, we have
where the last equality follows from the fact that . ∎
4 Experiments
In this section, we describe simulated experiments to showcase the efficacy of our algorithm. Throughout this section, we compare our algorithm to the state-of-the-art algorithm in online convex optimization with stochastic constraints [18]. We focus our experiments on the real-world problem that motivated our work.
4.1 Personalized Parameter Selection in Large-Scale Social Networks
Most social networks do extensive experimentation to identify which parameters give rise to the biggest online metric gains. However, in many cases choosing a single global parameter is not very prudent. That being said, experimenting to identify member-level parameters is an extremely challenging problem. A good middle path lies in identifying cohorts of members and estimating the optimal parameter for each cohort. [15] tries to solve this problem by framing it as (1).
Let us focus on the minimum-gap parameter in the ads ranking problem. Let us assume that the parameter can take $K$ possible values and that there are $m + 1$ metrics we are interested in: one primary metric (revenue) and $m$ guardrail metrics (ads click-through rate, organic click-through rate, etc.). Specifically, we can estimate the causal effect when a treatment is applied to a cohort for each of these metrics. This effect is a random variable, which is usually distributed as a Gaussian with some mean and variance. Our aim is to identify the optimal allocation which maximizes the expected revenue. Formally, let $x$ denote the probability of assigning the $k$th treatment to the