
# Optimal Convergence for Stochastic Optimization with Multiple Expectation Constraints

In this paper, we focus on the problem of stochastic optimization where the objective function can be written as an expectation function over a closed convex set. We also consider multiple expectation constraints which restrict the domain of the problem. We extend the cooperative stochastic approximation algorithm from Lan and Zhou [2016] to solve this problem. We close the gaps in the previous analysis and provide a novel proof technique to show that our algorithm attains the optimal rate of convergence for both the optimality gap and the constraint violation when the functions are generally convex. We also compare our algorithm empirically to the state of the art and show improved convergence in many situations.


## 1 Introduction

In this paper we focus on a stochastic optimization problem with multiple expectation constraints. Specifically, we are interested in solving a problem of the following form:

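$$
\begin{aligned}
\min_{x} \quad & f(x) := E_{\xi_0}\big(F(x,\xi_0)\big) \\
\text{subject to} \quad & g_j(x) := E_{\xi_j}\big(G_j(x,\xi_j)\big) \le 0 \quad \text{for } j = 1,\ldots,m, \\
& x \in X,
\end{aligned}
\tag{1}
$$

where $X$ is a convex compact set, $\xi_j$ are random variables for $j = 0,\ldots,m$, and $F(x,\xi_0)$ and $G_j(x,\xi_j)$ for $j = 1,\ldots,m$ are closed convex functions with respect to $x$ for a.e. $\xi_j$, $j = 0,\ldots,m$.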

There are several applications of the problem formulation (1), especially in fields such as control theory [10], management science [3], and finance [12]. Our specific motivation comes from a problem arising on a large-scale social network platform. Any such platform does extensive experimentation to identify which parameters should be applied to which members to obtain the largest metric gains. An example of such an influential parameter is the gap between successive ads on a newsfeed product (described in detail in Section 4).

Traditionally, stochastic optimization problems were solved either via sample average approximation (SAA) [6, 13, 17] or via stochastic approximation (SA) [11]. In SAA, each of the functions $f$ and $g_j$ for $j = 1,\ldots,m$ is approximated by a sample average, and the resulting problem is then solved via traditional means. The SAA solution is computationally challenging, may not be applicable to the online setting, and the approximation might lead to an infeasible problem. In SA, the algorithm follows the usual gradient descent algorithm, using the stochastic gradient $F'(x,\xi_0)$ rather than $f'(x)$ to solve (1) [2, 14]. The SA solution requires a projection onto the domain specified by the constraints $g_j(x) \le 0$, which may not be possible when we have the expectation formulation.

Several papers have shown improvements over the original SA method, especially for strongly convex problems [9] and for a general class of non-smooth convex stochastic programming problems [8]. However, these methods may not be directly applicable to the case where each $g_j$ is an expectation constraint. Very recently, [18] developed a method of stochastic online optimization with general stochastic constraints by generalizing Zinkevich's online convex optimization. Their approach is much more general and, as a result, may not be the optimal approach to solving this specific problem. For more related work, we refer to [7] and [18] and the references therein.

Lan and Zhou (2016) [7] introduced the cooperative stochastic approximation (CSA) algorithm to solve problem (1) with $m = 1$. In this paper, we extend their algorithm to multiple expectation constraints and also close several gaps in their proof of optimal convergence. Specifically, we prove that a solution $\hat{x}$ satisfying

$$
E\big(f(\hat{x}) - f(x^*)\big) \le \frac{c}{\sqrt{N}} \quad \text{and} \quad E\big(g_j(\hat{x})\big) \le \frac{C}{\sqrt{N}} \quad \forall j \in \{1,\ldots,m\}
$$

can be obtained with $N$ steps of our algorithm, where $x^*$ denotes an optimal solution of (1). These rates are optimal due to the lower bound results following from [1]. We primarily focus on the theoretical analysis of the algorithm in this paper. We use a completely novel proof technique and overcome several gaps in the proof of [7], which we highlight in Section 3. Furthermore, we run experiments on simulated data to show that this algorithm empirically outperforms the algorithm in [18]. For more practical results, we refer the reader to [15].

The rest of the paper is organized as follows. In Section 2, we introduce the multiple cooperative stochastic approximation (MCSA) algorithm for solving the problem stated in (1). In Section 3, we prove the convergence guarantee of our algorithm. We discuss some empirical results in Section 4 before concluding with a discussion in Section 5.

## 2 Multiple Cooperative Stochastic Approximation (MCSA)

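We begin with the definition of a projection operator. Let $\psi : X \to \mathbb{R}$ be a 1-strongly convex proximal function, i.e., we have $\psi(y) \ge \psi(x) + \langle \nabla\psi(x), y - x\rangle + \tfrac{1}{2}\|y - x\|^2$ for all $x, y \in X$. The Bregman divergence [4] associated with a strongly convex and differentiable function $\psi$ is defined as $B_\psi(y, x) := \psi(y) - \psi(x) - \langle \nabla\psi(x), y - x\rangle$. Note that due to the strong convexity of $\psi$ we have $B_\psi(y, x) \ge \tfrac{1}{2}\|y - x\|^2$.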

###### Definition 1.

Based on the function $\psi$ we define the proximal projection function as

$$
P^{\psi}_{x,X}(\cdot) := \operatorname*{argmin}_{z \in X} \big\{ \langle \cdot, z \rangle + B_\psi(z, x) \big\}. \tag{2}
$$

We fix $\psi$ and $X$ and, in the rest of the paper, denote the proximal projection function by $P_x(\cdot)$.
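To make the proximal projection in (2) concrete, the following minimal sketch works out one special case. It assumes the Euclidean choice $\psi(z) = \tfrac{1}{2}\|z\|^2$ and a box-shaped set $X$; neither choice is prescribed by the paper, and the function names are hypothetical. Under these assumptions the Bregman divergence is half the squared Euclidean distance, so the argmin reduces to clipping $x - y$ to the box.

```python
def prox_projection(x, y, lower, upper):
    """Proximal projection (2), assuming psi(z) = 0.5 * ||z||^2 and X = [lower, upper]^d.

    With this psi, B_psi(z, x) = 0.5 * ||z - x||^2, and the minimizer of
    <y, z> + 0.5 * ||z - x||^2 over the box is x - y clipped coordinatewise.
    """
    return [min(max(xi - yi, lower), upper) for xi, yi in zip(x, y)]


def prox_objective(z, x, y):
    """The quantity minimized in (2) for the Euclidean psi assumed above."""
    return sum(yi * zi + 0.5 * (zi - xi) ** 2 for xi, yi, zi in zip(x, y, z))


x = [0.2, 0.9]
y = [0.5, -0.5]  # for example, a step size times a stochastic subgradient
z = prox_projection(x, y, 0.0, 1.0)
```

Other strongly convex choices of $\psi$ (for instance an entropy function on the simplex) lead to different closed forms; the Euclidean case is used here only because it is the simplest to verify by hand.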

We assume that the objective function and each constraint function in (1) are well-defined, finite valued, continuous and convex functions for $x \in X$. Similar to the CSA algorithm [7], at each iterate $x_t$, we move along the subgradient direction $F'(x_t, \xi_{0,t})$ if $\hat{G}_j(x_t) \le \eta_{j,t}$ for all $j \in \{1,\ldots,m\}$, where $\{\eta_{j,t}\}$ is the tolerance sequence. Otherwise we move along a randomly chosen subgradient direction $G'_j(x_t, \xi_{j,t})$, where $j$ is chosen randomly from the set of violated constraints $\{j : \hat{G}_j(x_t) > \eta_{j,t}\}$. This modification to the CSA algorithm allows us to work with multiple constraints. At each stage, we move along the chosen direction with a stepsize $\gamma_t$. We will show in Section 3.2 how to choose the tolerances so that we achieve the optimal convergence rates.

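Since we do not have access to the exact functions $f$ and $g_j$ for $j = 1,\ldots,m$, we use an approximation $\hat{G}_j$ for $g_j$ and use the stochastic gradients $F'(x_t, \xi_{0,t})$ and $G'_j(x_t, \xi_{j,t})$ for $f'(x_t)$ and $g'_j(x_t)$ respectively, where $\xi_{j,t}$ are i.i.d. observations from the distribution of $\xi_j$ that we get to observe at iteration $t$. We run $N$ steps of our algorithm and choose our final solution $\hat{x}$ as the mean of the iterates $x_t$ over a set of indices $B$ that is defined by

$$
B = \big\{ s \le t \le N : \hat{G}_j(x_t) \le \eta_{j,t} \ \ \forall j \in \{1,\ldots,m\} \big\}. \tag{3}
$$

Here $s$ denotes a burn-in period. Note that we need an approximation of $g_j(x_t)$ at every stage $t$ for all $j$. A consistent estimate of $g_j(x_t)$ is given by

$$
\hat{G}_j(x_t) := \frac{1}{L} \sum_{\ell=1}^{L} G_j(x_t, \tilde{\xi}_{j,\ell}), \tag{4}
$$

where $\tilde{\xi}_{j,\ell}$ for $\ell = 1,\ldots,L$ are i.i.d. observations from the distribution of $\xi_j$. Throughout this paper we assume that $L$ grows at the same rate as $N$. The full algorithm is written out as Algorithm 1.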
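The steps described in this section can be sketched in code as follows. This is an illustrative sketch on a hypothetical two-dimensional toy problem, not a transcription of Algorithm 1. The objective, the constraints, the noise levels, the box $X = [-2, 2]^2$, the Euclidean proximal function, and all names are assumptions made for the example. It also assumes the $L$ samples behind (4) are drawn once and reused, and it takes the constant step size and tolerance of the form used later in (8).

```python
import math
import random

A = (1.0, 1.0)        # unconstrained minimizer of the toy objective (assumed)
B_LIM = (0.5, 0.25)   # toy constraints g_j(x) = x[j] - B_LIM[j] <= 0 (assumed)


def F_grad(x, xi0):
    """Stochastic gradient of F(x, xi0) = 0.5 * ||x - A - xi0||^2."""
    return [x[i] - A[i] - xi0[i] for i in range(2)]


def G(j, x, xi):
    """Noisy constraint value G_j(x, xi) = x[j] - B_LIM[j] + xi."""
    return x[j] - B_LIM[j] + xi


def G_grad(j):
    """Gradient of G_j in x (a unit vector for this toy constraint)."""
    return [1.0 if i == j else 0.0 for i in range(2)]


def mcsa(N=2000, L=200, s=1000, K1=1.0, K2=1.0, seed=0):
    rng = random.Random(seed)
    gamma = math.sqrt(2) * K1 / math.sqrt(N)   # constant step size
    eta = math.sqrt(2) * K2 / math.sqrt(N)     # constant tolerance
    # Assumption: the samples used in the estimate (4) are drawn once up front.
    xi_tilde = [[rng.gauss(0.0, 0.3) for _ in range(L)] for _ in range(2)]
    x, total, count = [0.0, 0.0], [0.0, 0.0], 0
    for t in range(1, N + 1):
        g_hat = [sum(G(j, x, xi) for xi in xi_tilde[j]) / L for j in range(2)]
        violated = [j for j in range(2) if g_hat[j] > eta]
        if not violated:
            if t >= s:                          # t is in the index set B of (3)
                total = [total[i] + x[i] for i in range(2)]
                count += 1
            xi0 = [rng.gauss(0.0, 0.5) for _ in range(2)]
            h = F_grad(x, xi0)                  # objective direction
        else:
            h = G_grad(rng.choice(violated))    # random violated constraint
        # Euclidean proximal step, clipped to the box X = [-2, 2]^2.
        x = [min(max(x[i] - gamma * h[i], -2.0), 2.0) for i in range(2)]
    return [v / count for v in total] if count else None


x_hat = mcsa()
```

Because the step size is constant here, the plain average over $B$ coincides with the $\gamma_t$-weighted average. For this toy problem the constrained minimizer is `B_LIM`, so `x_hat` can be compared against it directly.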

## 3 Convergence Analysis

In this section, we study the convergence of the MCSA algorithm described in Section 2. Specifically, we show that if we run $N$ iterations of Algorithm 1, then the expected error and the expected violation of the constraints are $O(1/\sqrt{N})$. The interrelated stochastic components of the MCSA algorithm pose a novel challenge in deriving its rate of convergence. In particular, at every step MCSA makes a random decision about the update direction that depends on the current iterate and the constraint estimates, while the current iterate is itself a random variable that depends on the observed samples and on a similar random decision made at the previous step. A careful consideration of this intertwined stochasticity reveals several gaps in the convergence analysis of the CSA algorithm as presented in [7]. For this reason, we refrained from deriving the rate of convergence of MCSA by extending the convergence analysis of [7]. Our rigorous convergence analysis is the main contribution of this paper, and it also reinforces the convergence rate of the CSA algorithm [7].

Proof Overview: We begin by identifying a sufficient condition on the threshold parameters $\{\eta_{j,t}\}$ such that the solution $\hat{x}$ is well-defined and on average consists of at least $cN$ of the $x_t$'s for some $c > 0$ (Theorem 1). To show that result, we use a concentration bound (Lemma 4) as well as the properties of a Donsker class of functions (Lemma 5). Finally, we prove the convergence rate of the MCSA algorithm in Theorem 2 using Theorem 1, an application of the FKG inequality (Lemma 7) and Lemma 5. We use the FKG inequality to untangle the dependency between the random variables $x_t$ and the random set $B$.

### 3.1 Supporting Results

Bregman Divergence: We begin by stating our first lemma, which connects the Bregman divergence and the proximal projection function (2). All proofs of lemmas in this subsection are deferred to the supplementary material for brevity.

###### Lemma 1.

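For any $x, z \in X$ and any $y$, we have $B_\psi\big(z, P^{\psi}_{x,X}(y)\big) \le B_\psi(z, x) + \langle y, z - x \rangle + \tfrac{1}{2}\|y\|_{\psi*}^2$, where $\|\cdot\|_{\psi*}$ is the dual norm of the norm $\|\cdot\|$ with respect to which $\psi$ is strongly convex.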

The following result follows from Lemma 1 and a careful analysis of Algorithm 1.

###### Lemma 2.

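For any $x \in X$, we have

$$
\begin{aligned}
& \sum_{t \in B} \gamma_t \big\langle F'(x_t, \xi_{0,t}), x_t - x \big\rangle + \sum_{j=1}^{m} \sum_{t \in N_j} \gamma_t \big( G_j(x_t, \xi_{j,t}) - G_j(x, \xi_{j,t}) \big) \\
& \qquad \le B_\psi(x, x_s) + \sum_{t \in B} \frac{\gamma_t^2}{2} \big\| F'(x_t, \xi_{0,t}) \big\|_{\psi*}^2 + \sum_{j=1}^{m} \sum_{t \in N_j} \frac{\gamma_t^2}{2} \big\| G'_j(x_t, \xi_{j,t}) \big\|_{\psi*}^2 \quad \text{a.s.}
\end{aligned}
\tag{5}
$$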

We bound the right-hand side of (5) in Lemma 2 by making the following assumption.

###### Assumption 1.

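For any $x \in X$, the following holds:

$$
E\big( \| F'(x, \xi_0) \|_{\psi*}^2 \big) \le M_F^2 \quad \text{and} \quad E\big( \| G'_j(x, \xi_j) \|_{\psi*}^2 \big) \le M_{G_j}^2 \quad \forall j \in \{1,\ldots,m\}.
$$

###### Lemma 3.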

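Under Assumption 1, for any $x \in X$, we have

$$
\begin{aligned}
& \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \, E\big[ (g_j(x_t) - g_j(x)) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] + \sum_{t=s}^{N} \gamma_t \, E\big[ \langle f'(x_t), x_t - x \rangle \, 1\{t \in B\} \,\big|\, \tilde{\xi} \big] \\
& \qquad \le E\big[ B_\psi(x, x_s) \,\big|\, \tilde{\xi} \big] + \frac{M^2}{2} \sum_{t=s}^{N} \gamma_t^2,
\end{aligned}
\tag{6}
$$

where $M := \max\{M_F, M_{G_1}, \ldots, M_{G_m}\}$, and $N_j$ denotes the random set of indices $t$ at which the $j$-th constraint is chosen as the update direction in Algorithm 1.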

Concentration Bounds: We use a concentration result to achieve the optimal convergence. To that end, we first define

$$
\zeta_t = \sum_{j=1}^{m} \gamma_t \Big( g_j(x_t) \, 1\{t \in N_j\} - E\big[ g_j(x_t) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \Big).
$$

Now, we assume that the distribution of $\zeta_t$ has a light tail.

###### Assumption 2.

There exists a such that

###### Lemma 4.

Under Assumption 2, for any ,

Donsker Class: Recall that we approximate $g_j(x)$ by $\hat{G}_j(x)$ in Algorithm 1. Although the central limit theorem guarantees the convergence of $\sqrt{L}\,(\hat{G}_j(x) - g_j(x))$ to a zero mean Gaussian distribution for each $x \in X$, it does not ensure a "uniform convergence" for all $x \in X$ as in Lemma 5. To achieve that, we show that $\{G_j(x, \cdot) : x \in X\}$ is a Donsker class [16] under the following assumption.

###### Assumption 3.

For each $j$, the class of functions $\{G_j(x, \cdot) : x \in X\}$ satisfies the following Lipschitz condition:

$$
|G_j(x, \xi) - G_j(y, \xi)| \le |x - y| \, \phi(\xi),
$$

for all $x, y \in X$ and for some function $\phi$ satisfying $E[\phi(\xi_j)^2] < \infty$.

The Lipschitz condition in Assumption 3 ensures that $G_j(\cdot, \xi)$ is sufficiently smooth for each $\xi$. It is easy to see that Assumption 3 holds if the derivative $G'_j(x, \xi)$ is uniformly bounded by a constant for all $x$ and $\xi$. We use Assumption 3 to prove the following lemma.

###### Lemma 5.

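Under Assumption 3,

$$
\sup_{x \in X} \sqrt{L} \, \big| \hat{G}_j(x) - g_j(x) \big| \xrightarrow{d} \sup_{x \in X} |\mathbb{G}(x)| \quad \text{and} \quad E\Big( \sup_{x \in X} \big| \hat{G}_j(x) - g_j(x) \big| \Big) = O\big(1/\sqrt{L}\big),
$$

where $\mathbb{G}$ is a zero mean Gaussian process with covariance function $\operatorname{Cov}\big(\mathbb{G}(x), \mathbb{G}(y)\big) = \operatorname{Cov}\big(G_j(x, \xi_j), G_j(y, \xi_j)\big)$.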
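The $O(1/\sqrt{L})$ rate in the second part of Lemma 5 can be illustrated numerically. The sketch below is a Monte Carlo check on a hypothetical one-dimensional example, with $G(x, \xi) = |x - \xi|$, $\xi \sim \mathrm{Uniform}(0,1)$ and $X = [0, 1]$, none of which comes from the paper. This integrand satisfies the Lipschitz condition with $\phi \equiv 1$, and its expectation has the closed form $g(x) = x^2 - x + \tfrac{1}{2}$. The supremum is approximated on a finite grid.

```python
import random


def G(x, xi):
    """Toy constraint integrand, Lipschitz in x with phi(xi) = 1."""
    return abs(x - xi)


def g(x):
    """Exact expectation E|x - xi| for xi ~ Uniform(0, 1) and x in [0, 1]."""
    return x * x - x + 0.5


GRID = [i / 20 for i in range(21)]  # finite stand-in for the sup over X = [0, 1]


def sup_error(L, rng):
    """Grid supremum of |G_hat(x) - g(x)| for one draw of L samples."""
    sample = [rng.random() for _ in range(L)]
    return max(abs(sum(G(x, xi) for xi in sample) / L - g(x)) for x in GRID)


def mean_sup_error(L, reps=20, seed=0):
    """Monte Carlo estimate of E[sup_x |G_hat(x) - g(x)|]."""
    rng = random.Random(seed)
    return sum(sup_error(L, rng) for _ in range(reps)) / reps


small, large = mean_sup_error(100), mean_sup_error(1600)
ratio = small / large  # a 1/sqrt(L) rate suggests a value in the vicinity of 4
```

Increasing $L$ by a factor of 16 should shrink the expected uniform error by roughly a factor of 4 if the rate holds; the estimate is noisy, so only the order of magnitude is meaningful.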

FKG Inequality: The Fortuin-Kasteleyn-Ginibre (FKG) [5] inequality asserts positive correlation between two increasing functions on a finite partially ordered set.

###### Lemma 6.

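Let $\mathcal{L}$ be a partially ordered set that is a finite distributive lattice. Further, let $\mu$ be a probability measure on $\mathcal{L}$ such that $\mu(a \wedge b)\,\mu(a \vee b) \ge \mu(a)\,\mu(b)$ for all $a, b \in \mathcal{L}$. If both $h_1$ and $h_2$ are increasing functions on $\mathcal{L}$ with respect to the partial ordering of $\mathcal{L}$, then

$$
\sum_{\ell \in \mathcal{L}} h_1(\ell) \, h_2(\ell) \, \mu(\ell) \ge \Big( \sum_{\ell \in \mathcal{L}} h_1(\ell) \, \mu(\ell) \Big) \Big( \sum_{\ell \in \mathcal{L}} h_2(\ell) \, \mu(\ell) \Big).
$$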
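As a sanity check of the inequality in Lemma 6, the sketch below enumerates a small lattice exactly. It assumes the lattice $\{0,1\}^3$ with the coordinatewise order, a product Bernoulli measure, and two particular increasing functions; these are illustrative choices only and are unrelated to the specific lattice used in the proof of Lemma 7.

```python
from itertools import product

n, p = 3, 0.3
lattice = list(product((0, 1), repeat=n))  # {0,1}^3, ordered coordinatewise


def mu(a):
    """Product Bernoulli(p) measure (an assumed example satisfying the lattice condition)."""
    k = sum(a)
    return p ** k * (1 - p) ** (n - k)


def meet(a, b):
    """Greatest lower bound: coordinatewise minimum."""
    return tuple(min(u, v) for u, v in zip(a, b))


def join(a, b):
    """Least upper bound: coordinatewise maximum."""
    return tuple(max(u, v) for u, v in zip(a, b))


def h1(a):
    """Number of ones; increasing in the coordinatewise order."""
    return sum(a)


def h2(a):
    """Indicator that at least one coordinate is one; also increasing."""
    return max(a)


lattice_condition = all(
    mu(meet(a, b)) * mu(join(a, b)) >= mu(a) * mu(b) - 1e-15
    for a in lattice
    for b in lattice
)
lhs = sum(h1(a) * h2(a) * mu(a) for a in lattice)
rhs = sum(h1(a) * mu(a) for a in lattice) * sum(h2(a) * mu(a) for a in lattice)
```

For a product measure the lattice condition holds with equality, and the two sides of the FKG inequality can be compared directly through `lhs` and `rhs`.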

The following result follows from Lemma 6 and the fact that .

###### Lemma 7.

Let , be as in Algorithm 1 and let . Then

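$$
E\left[ \frac{\sum_{t \in B} \gamma_t \big( f(x_t) - f(x^*) \big)}{\sum_{t \in B} \gamma_t} \right] \le E\Big[ \sum_{t \in B} \gamma_t \big( f(x_t) - f(x^*) \big) \Big] \, E\left[ \frac{1}{\sum_{t \in B} \gamma_t} \right].
$$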

### 3.2 Main Result

First, we present a sufficient condition under which . We define,

$$
\tau := D_X^2 + \frac{M^2}{2} \sum_{t=s}^{N} \gamma_t^2, \tag{7}
$$

where $D_X$ denotes the diameter of $X$.

###### Theorem 1.

Let , , be as in Algorithm 1 and let . Let

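$$
\gamma_t = \frac{\sqrt{2} K_1}{\sqrt{N}} \quad \text{and} \quad \eta_{j,t} = \frac{\sqrt{2} K_2}{\sqrt{N}} \quad \text{for all } t, \tag{8}
$$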

for some sufficiently large constant and . Then one of the following two conditions hold:

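1. $P\Big( \frac{|B|}{N} > \frac{1}{2} - \frac{s-1}{N} - \frac{\tau}{2 K_1 K_2} \Big) > \frac{1}{4}$, which implies $E\big[|B|/N\big] \ge c$ for some $c > 0$.

2. $B \neq \emptyset$ and $\sum_{t=s}^{N} \gamma_t \langle f'(x_t), x_t - x^* \rangle \, 1\{t \in B\} \le 0$ almost surely.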

###### Proof.

We show that if the second condition does not hold then the first condition must hold. Note that if the second condition does not hold,

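$$
\sum_{t=s}^{N} \gamma_t \langle f'(x_t), x_t - x^* \rangle \, 1\{t \in B\} \ge 0 \;\Rightarrow\; \sum_{t=s}^{N} \gamma_t \, E\big[ \langle f'(x_t), x_t - x^* \rangle \, 1\{t \in B\} \,\big|\, \tilde{\xi} \big] \ge 0,
$$

where the equality holds if $B = \emptyset$. Thus it follows from Lemma 3 with $x = x^*$ that

$$
\sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \, E\big[ (g_j(x_t) - g_j(x^*)) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \le E\big[ B_\psi(x^*, x_s) \,\big|\, \tilde{\xi} \big] + \frac{M^2}{2} \sum_{t=s}^{N} \gamma_t^2.
$$

Moreover, since $g_j(x^*) \le 0$ and $B_\psi(x^*, x_s) \le D_X^2$, we have

$$
\sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \, E\big[ g_j(x_t) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \le \tau, \tag{9}
$$

where $\tau$ is defined in (7). Note that, for $t \in N_j$, we have $\hat{G}_j(x_t) > \eta_{j,t}$. Thus we have

$$
\sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \Big( \hat{G}_j(x_t) \, 1\{t \in N_j\} - E\big[ g_j(x_t) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \Big) > \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \eta_{j,t} \, 1\{t \in N_j\} - \tau. \tag{10}
$$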

Now consider the following two sets:

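$$
A_1 := \Big\{ \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \big( \hat{G}_j(x_t) - g_j(x_t) \big) 1\{t \in N_j\} \le \frac{1}{2} \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \eta_{j,t} \, 1\{t \in N_j\} \Big\}; \tag{11}
$$

$$
A_2 := \Big\{ \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \Big( g_j(x_t) \, 1\{t \in N_j\} - E\big[ g_j(x_t) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \Big) \le \frac{K_1 K_2 - \tau}{2} \Big\}. \tag{12}
$$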

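On the set $A_1 \cap A_2$, adding the inequalities that define $A_1$ and $A_2$ gives

$$
\sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \Big( \hat{G}_j(x_t) \, 1\{t \in N_j\} - E\big[ g_j(x_t) \, 1\{t \in N_j\} \,\big|\, \tilde{\xi} \big] \Big) \le \frac{1}{2} \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \eta_{j,t} \, 1\{t \in N_j\} + \frac{K_1 K_2 - \tau}{2}. \tag{13}
$$

By combining (8), (10) and (13), we get

$$
\frac{2 K_1 K_2 \, (N - s + 1 - |B|)}{N} \le \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \eta_{j,t} \, 1\{t \in N_j\} < K_1 K_2 + \tau.
$$

This implies that on the set $A_1 \cap A_2$, we have $\frac{|B|}{N} > \frac{1}{2} - \frac{s-1}{N} - \frac{\tau}{2 K_1 K_2}$. Thus, we get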

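$$
P\Big( \frac{|B|}{N} > \frac{1}{2} - \frac{s-1}{N} - \frac{\tau}{2 K_1 K_2} \Big) \ge P(A_1 \cap A_2) \ge 1 - P(A_1^c) - P(A_2^c).
$$

Next, we derive upper bounds for $P(A_1^c)$ and $P(A_2^c)$. From straightforward calculations, it follows that

$$
\begin{aligned}
P(A_1^c) & \le P\Big( \sum_{t=s}^{N} \sum_{j=1}^{m} \big| \gamma_t \big( \hat{G}_j(x_t) - g_j(x_t) \big) 1\{t \in N_j\} \big| > \sum_{t=s}^{N} \sum_{j=1}^{m} \frac{K_1 K_2}{N} \Big) \\
& \le \sum_{j=1}^{m} P\Big( \sup_{x \in X} \big| \sqrt{L} \big( \hat{G}_j(x) - g_j(x) \big) \big| > \frac{\sqrt{L} K_2^2}{2N} \Big) < \frac{1}{2},
\end{aligned}
\tag{14}
$$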

where the last inequality follows from Lemma 5 for sufficiently large , since we have chosen . Now using Lemma 4 and choosing , we have

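$$
P(A_2^c) \le \exp\Big( - \frac{N (K_1 K_2 - \tau)^2}{12 \sigma^2 (N - s + 1) K_1^2} \Big) < \frac{1}{4}, \tag{15}
$$

where the last inequality follows from choosing $K_2$ suitably large. Thus, using (14) and (15) we get

$$
P\Big( \frac{|B|}{N} > \frac{1}{2} - \frac{s-1}{N} - \frac{\tau}{2 K_1 K_2} \Big) > \frac{1}{4}.
$$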

###### Theorem 2.

Let , , , be as in Theorem 1. Then under Assumptions 1 and 3, we have

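$$
E\big[ f(\hat{x}) - f(x^*) \big] = O\Big( \frac{1}{\sqrt{N}} \Big) \quad \text{and} \quad E\big[ g_j(\hat{x}) \big] = O\Big( \frac{1}{\sqrt{L}} + \frac{1}{\sqrt{N}} \Big) \quad \forall j. \tag{16}
$$

###### Proof.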

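First observe that the second condition of Theorem 1 implies that our algorithm has already converged. Specifically, if $B \neq \emptyset$, our algorithm is well-defined and we have

$$
0 \ge \sum_{t=s}^{N} \gamma_t \langle f'(x_t), x_t - x^* \rangle \, 1\{t \in B\} \ge \sum_{t=s}^{N} \gamma_t \big( f(x_t) - f(x^*) \big) 1\{t \in B\} \ge \Big( \sum_{t=s}^{N} \gamma_t \, 1\{t \in B\} \Big) \big( f(\hat{x}) - f(x^*) \big), \tag{17}
$$

where we have successively used the convexity of $f$. Since $\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\} > 0$ whenever $B \neq \emptyset$, (17) gives $f(\hat{x}) \le f(x^*)$, so the optimality gap is already nonpositive and the algorithm has converged. Thus, either our algorithm has already converged or the first condition of Theorem 1 holds. From the first condition of Theorem 1 we have $E\big[|B|/N\big] \ge c$ for sufficiently large $N$ and for some $c > 0$. Now using the convexity of $g_j$ and the fact that $\hat{G}_j(x_t) \le \eta_{j,t}$ for $t \in B$, we have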

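$$
\begin{aligned}
E\big( g_j(\hat{x}) \big) & \le E\left[ \frac{\sum_{t=s}^{N} \gamma_t \, g_j(x_t) \, 1\{t \in B\}}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right] \\
& = E\left[ \frac{\sum_{t=s}^{N} \gamma_t \big( g_j(x_t) - \hat{G}_j(x_t) \big) 1\{t \in B\}}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right] + E\left[ \frac{\sum_{t=s}^{N} \gamma_t \, \hat{G}_j(x_t) \, 1\{t \in B\}}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right] \\
& \le E\Big[ \sup_{x} \big( g_j(x) - \hat{G}_j(x) \big) \Big] + \frac{\sqrt{2} K_2}{\sqrt{N}} = O\Big( \frac{1}{\sqrt{L}} + \frac{1}{\sqrt{N}} \Big),
\end{aligned}
$$

where we have used Lemma 5 to get the last equality. Similarly, using the convexity of $f$,

$$
\begin{aligned}
E\big[ f(\hat{x}) - f(x^*) \big] & \le E\left[ \frac{\sum_{t=s}^{N} \gamma_t \big( f(x_t) - f(x^*) \big) 1\{t \in B\}}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right] \\
& \le E\Big[ \sum_{t=s}^{N} \gamma_t \big( f(x_t) - f(x^*) \big) 1\{t \in B\} \Big] \, E\left[ \frac{1}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right],
\end{aligned}
\tag{18}
$$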

where the last inequality follows from Lemma 7. Note that applying Jensen’s inequality for the concave function we have,

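$$
E\left[ \frac{1}{\sum_{t=s}^{N} \gamma_t \, 1\{t \in B\}} \right] \le \frac{1}{E\big[ \sum_{t=s}^{N} \gamma_t \, 1\{t \in B\} \big]} \le \frac{\sqrt{N}}{\sqrt{2} K_1 N \, E\big[ |B|/N \big]} \le \frac{1}{(\sqrt{2} c K_1) \sqrt{N}}, \tag{19}
$$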

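where we have used the definition of $\gamma_t$ and the fact that $E\big[|B|/N\big] \ge c$. Moreover, using the fact that both $x_t$ and $1\{t \in B\}$ do not depend on $\xi_{0,t}$, we have

$$
\begin{aligned}
E\Big[ \sum_{t=s}^{N} \gamma_t \big( f(x_t) - f(x^*) \big) 1\{t \in B\} \Big] & = \sum_{t=s}^{N} \gamma_t \, E\big[ \big( F(x_t, \xi_{0,t}) - F(x^*, \xi_{0,t}) \big) 1\{t \in B\} \big] \\
& \le \sum_{t=s}^{N} \gamma_t \, E\big[ \langle F'(x_t, \xi_{0,t}), x_t - x^* \rangle \, 1\{t \in B\} \big] \\
& \le E\big( B_\psi(x^*, x_s) \big) - \sum_{t=s}^{N} \sum_{j=1}^{m} E\big[ \gamma_t \big( G_j(x_t, \xi_{j,t}) - G_j(x^*, \xi_{j,t}) \big) 1\{t \in N_j\} \big] + \frac{M^2}{2} \sum_{t=s}^{N} \gamma_t^2 \\
& \le \tau - \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \, E\big[ g_j(x_t) \, 1\{t \in N_j\} \big],
\end{aligned}
\tag{20}
$$

where the first inequality follows from the convexity of $F$, the second inequality follows from Lemma 2 with $x = x^*$ and the definition of $M$ given in Lemma 3, and the third inequality follows from the definition of $\tau$ in (7) and the fact that $g_j(x^*) \le 0$. By plugging (19) and (20) into (18) we get

$$
E\big[ f(\hat{x}) - f(x^*) \big] \le \frac{1}{(\sqrt{2} c K_1) \sqrt{N}} \Big( \tau - \sum_{t=s}^{N} \sum_{j=1}^{m} \gamma_t \, E\big[ g_j(x_t) \, 1\{t \in N_j\} \big] \Big).
$$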

Finally,

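$$
\begin{aligned}
E\big[ g_j(x_t) \, 1\{t \in N_j\} \big] & = E\big[ \big( g_j(x_t) - \hat{G}_j(x_t) \big) 1\{t \in N_j\} \big] + E\big[ \hat{G}_j(x_t) \, 1\{t \in N_j\} \big] \\
& \ge - E\Big[ \sup_{x} \big| \hat{G}_j(x) - g_j(x) \big| \Big] + \frac{\sqrt{2} K_2}{\sqrt{N}} \ge - \frac{C_2}{\sqrt{L}} \quad \text{for some } C_2 > 0,
\end{aligned}
$$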

where the last inequality follows from the second part of Lemma 5 and the fact that . Thus, we have

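$$
E\big[ f(\hat{x}) - f(x^*) \big] \le \frac{1}{(\sqrt{2} c K_1) \sqrt{N}} \Big( \tau + \sum_{t=s}^{N} \sum_{j=1}^{m} \frac{C_2 \gamma_t}{\sqrt{L}} \Big) = O\Big( \frac{1}{\sqrt{N}} \Big),
$$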

where the last equality follows from the fact that . ∎

## 4 Experiments

In this section, we describe simulated experiments to showcase the efficacy of our algorithm. Throughout this section, we compare our algorithm to the state-of-the-art algorithm in online convex optimization with stochastic constraints [18]. We focus our experiments on a real-world problem which motivated our work.

### 4.1 Personalized Parameter Selection in Large-Scale Social Network

Most social networks do extensive experimentation to identify which parameter gives rise to the biggest online metric gains. However, in many cases choosing a single global parameter is not prudent. At the same time, experimenting to identify member-level parameters is an extremely challenging problem. A good middle path lies in identifying cohorts of members and estimating the optimal parameter for each cohort. [15] addresses this by framing the problem as (1).

Let us focus on the minimum gap parameter in the ranking ads problem. Let us assume that the parameter can take possible values and there are metrics we are interested in. One primary metric (revenue) and guardrail metrics (ads click-through-rate, organic click-through-rate, etc). Specifically, we can estimate the causal effect when treatment is applied to a cohort for each of these

metrics. This effect is a random variable which is usually distributed as a Gaussian with some mean and variance. Our aim is to identify the optimal allocation

which maximizes the expected revenue. Formally, let denote the probability of of assigning the -th treatment to the