Online convex optimization (OCO) is a multi-round learning process with arbitrarily-varying convex loss functions where the decision maker has to choose decision $x(t) \in \mathcal{X}$ before observing the corresponding loss function $f^t(\cdot)$. For a fixed time horizon $T$, define the regret of a learning algorithm with respect to the best fixed decision in hindsight (with full knowledge of all loss functions) as
$$\text{regret}(T) = \sum_{t=1}^{T} f^t(x(t)) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f^t(x).$$
The goal of OCO is to develop dynamic learning algorithms such that regret grows sub-linearly with respect to $T$. The setting of OCO is introduced in a series of works [1, 2, 3, 4] and is formalized in . OCO has attracted considerable research interest recently, with applications such as online regression, prediction with expert advice, online ranking, online shortest paths and portfolio selection. See [5, 6] for more applications and background.
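In , Zinkevich shows that using an online gradient descent (OGD) update with step size $\gamma > 0$ given by
$$x(t+1) = \mathcal{P}_{\mathcal{X}}\big[x(t) - \gamma \nabla f^t(x(t))\big] \quad (1)$$
where $\nabla f^t(\cdot)$ is a subgradient of $f^t(\cdot)$ and $\mathcal{P}_{\mathcal{X}}[\cdot]$ is the projection onto set $\mathcal{X}$ can achieve $O(\sqrt{T})$ regret. Hazan et al. in  show that better regret $O(\log T)$ is possible under the assumption that each loss function is strongly convex but $O(\sqrt{T})$ is the best possible if no additional assumption is imposed.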
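As an illustration of the OGD update in (1), the following is a minimal sketch, assuming linear losses $f^t(x) = c_t^{\mathsf{T}} x$ with $\|c_t\| \le 1$ and the unit Euclidean ball as the decision set. The function names `project_ball`, `ogd_linear` and `regret_linear` are hypothetical and chosen only for this example; they do not come from the cited works.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd_linear(costs, step, radius=1.0):
    """Projected OGD for linear losses f^t(x) = costs[t] @ x over a ball.

    Returns the decisions x(1), ..., x(T); x(t) is fixed before costs[t] is used.
    """
    x = np.zeros(costs.shape[1])
    decisions = []
    for c in costs:
        decisions.append(x.copy())              # commit x(t) first
        x = project_ball(x - step * c, radius)  # the gradient of c @ x is c
    return np.array(decisions)

def regret_linear(costs, decisions, radius=1.0):
    """Regret against the best fixed point of the ball in hindsight."""
    incurred = np.sum(costs * decisions)        # sum_t costs[t] @ x(t)
    # The minimum of C @ x over the ball is -radius * ||C||, with C the summed costs.
    best_fixed = -radius * np.linalg.norm(costs.sum(axis=0))
    return incurred - best_fixed
```

With step size $\gamma = 1/\sqrt{T}$, start point $0$ and cost vectors of norm at most one, the standard OGD analysis bounds the regret of this sketch by $\sqrt{T}$, which can be checked numerically on any cost sequence.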
Zinkevich’s OGD in (1) requires full knowledge of the set $\mathcal{X}$ and a low-complexity projection $\mathcal{P}_{\mathcal{X}}[\cdot]$. However, in practice, the constraint set $\mathcal{X}$, which is often described by many functional inequality constraints, can be time varying and may not be fully disclosed to the decision maker. In , Mannor et al. extend OCO by considering time-varying constraint functions $g^t(x)$ which can arbitrarily vary and are only disclosed to us after each $x(t)$ is chosen. In this setting, Mannor et al. in  explore the possibility of designing learning algorithms such that regret grows sub-linearly and $\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} g^t(x(t)) \le 0$, i.e., the (cumulative) constraint violation also grows sub-linearly. Unfortunately, Mannor et al. in  prove that this is impossible even when both $f^t(\cdot)$ and $g^t(\cdot)$ are simple linear functions.
Given the impossibility results shown by Mannor et al. in , this paper considers OCO where constraint functions $g^t(x)$ are not arbitrarily varying but independently and identically distributed (i.i.d.) generated from an unknown probability model. More specifically, this paper considers online convex optimization (OCO) with stochastic constraints $\mathcal{X} = \{x \in \mathcal{X}_0 : \mathbb{E}_{\omega}[g_k(x;\omega)] \le 0, k \in \{1,2,\ldots,m\}\}$ where $\mathcal{X}_0$ is a known fixed set; the expressions of stochastic constraints $\mathbb{E}_{\omega}[g_k(x;\omega)]$ (involving expectations with respect to $\omega$ from an unknown distribution) are unknown; and subscripts $k \in \{1,2,\ldots,m\}$ indicate the possibility of multiple functional constraints. In OCO with stochastic constraints, the decision maker receives loss function $f^t(x)$ and i.i.d. constraint function realizations $g_k^t(x) = g_k(x;\omega(t))$ at each round $t$. However, the expressions of $g_k^t(\cdot)$ and $f^t(\cdot)$ are disclosed to the decision maker only after decision $x(t) \in \mathcal{X}_0$ is chosen. This setting arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. For example, if we consider online routing (with link capacity constraints) in wireless networks , each link capacity is not a fixed constant (as in wireline networks) but an i.i.d. random variable since wireless channels are stochastically time-varying by nature. OCO with stochastic constraints also covers important special cases such as OCO with long term constraints [10, 11, 12], stochastic constrained convex optimization  and deterministic constrained convex optimization .
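Let $x^* = \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f^t(x)$ be the best fixed decision in hindsight (knowing all loss functions $f^t(\cdot)$ and the distribution of stochastic constraint functions $g_k(x;\omega)$). Thus, $x^*$ minimizes the $T$-round cumulative loss and satisfies all stochastic constraints in expectation, which also implies
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} g_k^t(x^*) \le 0, \quad \forall k \in \{1,2,\ldots,m\}$$
almost surely by the strong law of large numbers. Our goal is to develop dynamic learning algorithms that guarantee both regret $\sum_{t=1}^{T} f^t(x(t)) - \sum_{t=1}^{T} f^t(x^*)$ and constraint violations $\sum_{t=1}^{T} g_k^t(x(t))$ grow sub-linearly.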
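Note that Zinkevich’s algorithm in (1) is not applicable to OCO with stochastic constraints since $\mathcal{X}$ is unknown and it can happen that $\mathcal{X}(t) = \{x \in \mathcal{X}_0 : g_k(x;\omega(t)) \le 0, \forall k\} = \emptyset$ for certain realizations $\omega(t)$, such that projections $\mathcal{P}_{\mathcal{X}}[\cdot]$ or $\mathcal{P}_{\mathcal{X}(t)}[\cdot]$ required in (1) are not even well-defined.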
Our Contributions: This paper solves online convex optimization with stochastic constraints. In particular, we propose a new learning algorithm that is proven to achieve $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations. Along the way, we develop new techniques for stochastic analysis, e.g., Lemma 5, and improve upon state-of-the-art results in the following special cases.
OCO with long term constraints: This is a special case where each $g_k^t(x) \equiv g_k(x)$ is known and does not depend on time. Note that $\mathcal{X} = \{x \in \mathcal{X}_0 : g_k(x) \le 0, k \in \{1,2,\ldots,m\}\}$ can be complicated while $\mathcal{X}_0$ might be a simple hypercube. To avoid the high complexity of the projection onto $\mathcal{X}$ as in Zinkevich’s algorithm, work in [10, 11, 12] develops low complexity algorithms that use projections onto the simpler set $\mathcal{X}_0$ by allowing $g_k(x(t)) > 0$ for certain rounds but ensuring $\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} g_k(x(t)) \le 0$. The best existing performance is $O(T^{\max\{\beta, 1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violations where $\beta \in (0,1)$ is an algorithm parameter . This gives $O(\sqrt{T})$ regret with worse $O(T^{3/4})$ constraint violations or $O(\sqrt{T})$ constraint violations with worse $O(T)$ regret. In contrast, our algorithm, which only uses projections onto $\mathcal{X}_0$ as shown in Lemma 1, can achieve $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously. (Footnote 1: By adapting the methodology presented in this paper, our other report  developed a different algorithm that can only solve the special case problem “OCO with long term constraints” but can achieve $O(\sqrt{T})$ regret and $O(1)$ constraint violations. The current paper also relaxes a deterministic Slater condition assumption required in our other technical report  for OCO with time-varying constraints, which requires the existence of constant $\epsilon > 0$ and fixed point $\hat{x} \in \mathcal{X}_0$ such that $g_k(\hat{x};\omega(t)) \le -\epsilon$ for all $t$. By relaxing the deterministic Slater condition assumption to the stochastic Slater condition in Assumption 2, the current paper even allows the possibility that $g_k(x;\omega(t)) \le 0$ is infeasible for certain $\omega(t)$. However, under the deterministic Slater condition assumption, our technical report  shows that if the regret is defined as the cumulative loss difference between our algorithm and the best fixed point from set $\mathcal{A} = \{x \in \mathcal{X}_0 : g_k(x;\omega) \le 0, \forall k, \forall \omega\}$, which is called a common subset in , then our algorithm can achieve $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously even if the constraint functions are arbitrarily time-varying (not necessarily i.i.d.). That is, by imposing the additional deterministic Slater condition and restricting the regret to be defined over the common subset $\mathcal{A}$, our algorithm can escape the impossibility shown by Mannor et al. in . To the best of our knowledge, this is the first time that specific conditions are proposed to enable sublinear regret and constraint violations simultaneously for OCO with arbitrarily time-varying constraint functions. Since the current paper focuses on OCO with stochastic constraints, we refer interested readers to Section IV in  for results on OCO with arbitrarily time-varying constraints.)
Stochastic constrained convex optimization: This is a special case where each $f^t(x)$ is i.i.d. generated from an unknown distribution. This problem has many applications in operations research and machine learning such as Neyman-Pearson classification and risk-mean portfolio. The work develops a (batch) offline algorithm that produces a solution with high probability performance guarantees only after sampling the problems sufficiently many times. That is, during the process of sampling, there are no performance guarantees. The work  proposes a stochastic approximation based (batch) offline algorithm for stochastic convex optimization with one single stochastic functional inequality constraint. In contrast, our algorithm is an online algorithm with online performance guarantees. (Footnote 2: While the analysis of this paper assumes a Slater-type condition, note that our other work  shows that the Slater condition is not needed in the special case when both the objective and constraint functions vary i.i.d. over time. This also includes the case of deterministic constrained convex optimization, since processes that do not vary with time are indeed i.i.d. processes. In such scenarios, Section VI in our work  shows that our algorithm works more generally whenever a Lagrange multiplier vector attaining the strong duality exists.)
Deterministic constrained convex optimization: This is a special case where each $f^t(x)$ and $g_k(x;\omega(t))$ are known and do not depend on time. In this case, the goal is to develop a fast algorithm that converges to a good solution (with a small error) within a small number of iterations; and our algorithm with $O(\sqrt{T})$ regret and constraint violations is equivalent to an iterative numerical algorithm with $O(1/\sqrt{T})$ convergence rate. Our algorithm is subgradient based and does not require the smoothness or differentiability of the convex program. Recall that Nesterov in  shows that $O(1/\sqrt{T})$ is the best possible convergence rate of any subgradient/gradient based algorithm for non-smooth convex programs. Thus, our algorithm is optimal. The primal-dual subgradient method considered in  has the same convergence rate but requires an upper bound of optimal Lagrange multipliers, which is typically unknown in practice. Our algorithm does not require such bounds to be known.
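The equivalence between sublinear regret and a convergence rate in the deterministic case can be made explicit with an averaging argument. The following is a short sketch, assuming time-invariant convex $f$ and $g_k$ and writing $\bar{x}(T)$ for the averaged iterate (this notation is introduced here only for illustration); both inequalities are Jensen's inequality.

```latex
\bar{x}(T) = \frac{1}{T}\sum_{t=1}^{T} x(t), \qquad
f(\bar{x}(T)) - f(x^*) \le \frac{1}{T}\sum_{t=1}^{T}\big[f(x(t)) - f(x^*)\big]
  = \frac{\text{regret}(T)}{T} = O(1/\sqrt{T}),
\qquad
g_k(\bar{x}(T)) \le \frac{1}{T}\sum_{t=1}^{T} g_k(x(t)) = O(1/\sqrt{T}).
```

Since the decision set is convex, the averaged iterate $\bar{x}(T)$ remains in it, so an $O(\sqrt{T})$ bound on both regret and constraint violations translates into an $O(1/\sqrt{T})$ error in both objective value and constraint satisfaction.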
II Formulation and New Algorithm
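Let $\mathcal{X}_0 \subseteq \mathbb{R}^n$ be a known fixed compact convex set. Let $g_k^t(x) = g_k(x;\omega(t)), k \in \{1,2,\ldots,m\}$ be sequences of functions that are i.i.d. realizations of stochastic constraint functions $\tilde{g}_k(x) = \mathbb{E}_{\omega}[g_k(x;\omega)]$ with random variable $\omega \in \Omega$ from an unknown distribution. That is, $\omega(t)$ are i.i.d. samples of $\omega$. Let $f^t(x)$ be a sequence of convex functions that can arbitrarily vary as long as each $f^t(\cdot)$ is independent of all $\omega(\tau)$ with $\tau \ge t+1$ so that we are unable to predict future constraint functions based on the knowledge of the current loss function. For example, each $f^t(\cdot)$ can even be chosen adversarially based on $\omega(\tau), \tau \in \{1,\ldots,t\}$ and actions $x(\tau), \tau \in \{1,\ldots,t\}$. For each $\omega \in \Omega$, we assume $g_k(x;\omega)$ are convex with respect to $x \in \mathcal{X}_0$. At the beginning of each round $t$, neither the loss function $f^t(x)$ nor the constraint function realizations $g_k^t(x)$ are known to the decision maker. However, the decision maker still needs to make a decision $x(t) \in \mathcal{X}_0$ for round $t$; and after that $f^t(x)$ and $g_k^t(x)$ are disclosed to the decision maker at the end of round $t$.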
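For convenience, we often suppress the dependence of each $g_k^t(x)$ on $\omega(t)$ and write $g_k^t(x) = g_k(x;\omega(t))$. Recall $\tilde{g}_k(x) = \mathbb{E}_{\omega}[g_k(x;\omega)]$ where the expectation is with respect to $\omega$. Define $\mathcal{X} = \{x \in \mathcal{X}_0 : \tilde{g}_k(x) \le 0, \forall k \in \{1,2,\ldots,m\}\}$. We further define the stacked vector of multiple functions $g_1^t(x), \ldots, g_m^t(x)$ as $g^t(x) = [g_1^t(x), \ldots, g_m^t(x)]^{\mathsf{T}}$ and define $\tilde{g}(x) = [\tilde{g}_1(x), \ldots, \tilde{g}_m(x)]^{\mathsf{T}}$. We use $\|\cdot\|$ to denote the Euclidean norm for a vector. Throughout this paper, we have the following assumptions: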
Assumption 1 (Basic Assumptions).
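Loss functions $f^t(x)$ and constraint functions $g_k(x;\omega)$ have bounded subgradients on $\mathcal{X}_0$. That is, there exists $D_1 > 0$ and $D_2 > 0$ such that $\|\nabla f^t(x)\| \le D_1$ for all $x \in \mathcal{X}_0$ and all $t$ and $\|\nabla g_k(x;\omega)\| \le D_2$ for all $x \in \mathcal{X}_0$, all $\omega \in \Omega$ and all $k \in \{1,2,\ldots,m\}$. (Footnote 3: We use $\nabla h(x)$ to denote a subgradient of a convex function $h$ at the point $x$. If the gradient exists, then $\nabla h(x)$ is the gradient. Nothing in this paper requires gradients to exist: We only need the basic subgradient inequality $h(y) \ge h(x) + [\nabla h(x)]^{\mathsf{T}}[y - x]$ for all $x, y \in \mathcal{X}_0$.)
There exists constant $G > 0$ such that $\|g(x;\omega)\| \le G$ for all $x \in \mathcal{X}_0$ and all $\omega \in \Omega$.
There exists constant $R > 0$ such that $\|x - y\| \le R$ for all $x, y \in \mathcal{X}_0$.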
Assumption 2 (The Slater Condition).
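There exists $\epsilon > 0$ and $\hat{x} \in \mathcal{X}_0$ such that $\tilde{g}_k(\hat{x}) = \mathbb{E}_{\omega}[g_k(\hat{x};\omega)] \le -\epsilon$ for all $k \in \{1,2,\ldots,m\}$.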
II-A New Algorithm
Now consider the algorithm described in Algorithm 1. This algorithm chooses $x(t+1)$ as the decision for round $t+1$ based on $f^t(\cdot)$ and $g^t(\cdot)$ without requiring $f^{t+1}(\cdot)$ or $g^{t+1}(\cdot)$.
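Let $V > 0$ and $\alpha > 0$ be constant algorithm parameters. Choose $x(1) \in \mathcal{X}_0$ arbitrarily and let $Q_k(1) = 0, \forall k \in \{1,2,\ldots,m\}$. At the end of each round $t \in \{1,2,\ldots\}$, observe $f^t(\cdot)$ and $g^t(\cdot)$ and do the following:
Choose $x(t+1)$ that solves
$$\min_{x \in \mathcal{X}_0} \Big\{ V[\nabla f^t(x(t))]^{\mathsf{T}}[x - x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^{\mathsf{T}}[x - x(t)] + \alpha\|x - x(t)\|^2 \Big\} \quad (2)$$
as the decision for the next round $t+1$, where $\nabla f^t(x(t))$ is a subgradient of $f^t(x)$ at point $x = x(t)$ and $\nabla g_k^t(x(t))$ is a subgradient of $g_k^t(x)$ at point $x = x(t)$.
Update each virtual queue $Q_k(t+1), \forall k \in \{1,2,\ldots,m\}$ via
$$Q_k(t+1) = \max\big\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)],\ 0\big\} \quad (3)$$
where $\max\{\cdot,\cdot\}$ takes the larger one between two elements.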
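For each stochastic constraint function $g_k(x;\omega)$, we introduce $Q_k(t)$ and call it a virtual queue since its dynamic is similar to a queue dynamic. The next lemma shows that the $x(t+1)$ update in (2) can be implemented via a simple projection onto $\mathcal{X}_0$.
Lemma 1. The $x(t+1)$ update in (2) is given by $x(t+1) = \mathcal{P}_{\mathcal{X}_0}\big[x(t) - \frac{1}{2\alpha} d(t)\big]$, where $d(t) = V\nabla f^t(x(t)) + \sum_{k=1}^{m} Q_k(t)\nabla g_k^t(x(t))$ and $\mathcal{P}_{\mathcal{X}_0}[\cdot]$ is the projection onto convex set $\mathcal{X}_0$.
Proof. The projection by definition is $\min_{x \in \mathcal{X}_0} \big\|x - [x(t) - \frac{1}{2\alpha} d(t)]\big\|^2$ and is equivalent to (2), since expanding the square gives $\frac{1}{\alpha}\big\{[d(t)]^{\mathsf{T}}[x - x(t)] + \alpha\|x - x(t)\|^2\big\}$ plus a term that does not depend on $x$. ∎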
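One round of a virtual-queue algorithm of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes the decision set is a box (so the projection is a coordinate-wise clip), that the decision update is a projected step along the combined direction $V\nabla f^t(x(t)) + \sum_k Q_k(t)\nabla g_k^t(x(t))$ scaled by $1/(2\alpha)$, and that each queue accumulates the linearized constraint value and is clipped at zero. The names `algorithm1_round`, `surrogate` and `run`, and the toy losses and constraint inside `run`, are hypothetical.

```python
import numpy as np

def algorithm1_round(x, Q, grad_f, g_val, grad_g, V, alpha, lo, hi):
    """One round of the virtual-queue update on the box [lo, hi]^n.

    x: current decision, shape (n,);  Q: virtual queues, shape (m,)
    grad_f: subgradient of the round's loss at x, shape (n,)
    g_val:  constraint values at x, shape (m,)
    grad_g: constraint subgradients at x, shape (m, n)
    """
    d = V * grad_f + grad_g.T @ Q                     # combined direction d(t)
    x_next = np.clip(x - d / (2.0 * alpha), lo, hi)   # projection onto the box
    Q_next = np.maximum(Q + g_val + grad_g @ (x_next - x), 0.0)
    return x_next, Q_next

def surrogate(y, x, Q, grad_f, grad_g, V, alpha):
    """Linearized loss and constraints plus a proximal term, evaluated at y."""
    return (V * grad_f + grad_g.T @ Q) @ (y - x) + alpha * np.sum((y - x) ** 2)

def run(T, n=2, seed=0):
    """Toy run: linear losses c @ x and one noisy linear constraint a @ x <= 0.5."""
    rng = np.random.default_rng(seed)
    V, alpha = np.sqrt(T), float(T)
    x, Q = np.zeros(n), np.zeros(1)
    total_violation = 0.0
    for _ in range(T):
        c = rng.uniform(-1.0, 1.0, size=n)            # revealed after x is chosen
        a = 1.0 + rng.uniform(-0.1, 0.1, size=n)      # i.i.d. constraint sample
        g = a @ x - 0.5
        total_violation += g
        x, Q = algorithm1_round(x, Q, c, np.array([g]), a[None, :],
                                V, alpha, -1.0, 1.0)
    return x, Q, total_violation
```

The helper `surrogate` makes it easy to check numerically that the clipped step is a minimizer of the linearized objective over the box, which is the content of the projection lemma in this special case.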
II-B Intuitions of Algorithm 1
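If there are no stochastic constraints $g_k^t(x)$, then the terms involving $Q_k(t)$ in (2) vanish and Algorithm 1 becomes Zinkevich’s algorithm with $\gamma = \frac{V}{2\alpha}$ in (1) since
$$x(t+1) \overset{(a)}{=} \arg\min_{x \in \mathcal{X}_0} \Big\{ \underbrace{V[\nabla f^t(x(t))]^{\mathsf{T}}[x - x(t)] + \alpha\|x - x(t)\|^2}_{\text{penalty}} \Big\} \overset{(b)}{=} \mathcal{P}_{\mathcal{X}_0}\Big[x(t) - \frac{V}{2\alpha}\nabla f^t(x(t))\Big] \quad (4)$$
where (a) follows from (2); and (b) follows from Lemma 1 by noting that $d(t) = V\nabla f^t(x(t))$. Call the term marked by an underbrace in (4) the penalty. Thus, Zinkevich’s algorithm is to minimize the penalty term and is a special case of Algorithm 1 used to solve OCO over $\mathcal{X}_0$.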
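Let $Q(t) = [Q_1(t), \ldots, Q_m(t)]^{\mathsf{T}}$ be the vector of virtual queue backlogs. Let $L(t) = \frac{1}{2}\|Q(t)\|^2$ be a Lyapunov function and define Lyapunov drift
$$\Delta(t) = L(t+1) - L(t) = \frac{1}{2}\big[\|Q(t+1)\|^2 - \|Q(t)\|^2\big].$$
The intuition behind Algorithm 1 is to choose $x(t+1)$ to minimize an upper bound of the expression
$$\underbrace{\Delta(t)}_{\text{drift}} + \underbrace{V[\nabla f^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] + \alpha\|x(t+1) - x(t)\|^2}_{\text{penalty}}.$$
The intention to minimize penalty is natural since Zinkevich’s algorithm (for OCO without stochastic constraints) minimizes penalty, while the intention to minimize drift is motivated by observing that $g_k^t(x(t))$ is accumulated into queue $Q_k(t+1)$ introduced in (3) such that we intend to have small queue backlogs. The drift $\Delta(t)$ can be complicated and is in general non-convex. The next lemma provides a simple upper bound of $\Delta(t)$ and follows directly from (3).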
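Lemma 2. At each round $t \in \{1,2,\ldots\}$, Algorithm 1 guarantees
$$\Delta(t) \le \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)]\big] + \frac{1}{2}\big[G + \sqrt{m} D_2 R\big]^2$$
where $m$ is the number of constraint functions; and $D_2$, $G$ and $R$ are defined in Assumption 1.
Proof. Recall that for any $b \in \mathbb{R}$, if $a = \max\{b, 0\}$ then $a^2 \le b^2$. Fix $k \in \{1,2,\ldots,m\}$. The virtual queue update equation implies that
$$\frac{1}{2}[Q_k(t+1)]^2 \le \frac{1}{2}\big[Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)]\big]^2 \overset{(a)}{=} \frac{1}{2}[Q_k(t)]^2 + Q_k(t) h_k + \frac{1}{2} h_k^2$$
where (a) follows by defining $h_k = g_k^t(x(t)) + [\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)]$.
Define $s = [s_1, \ldots, s_m]^{\mathsf{T}}$, where $s_k = [\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)]$; and $h = [h_1, \ldots, h_m]^{\mathsf{T}} = g^t(x(t)) + s$. Then,
$$\|h\| \overset{(a)}{\le} \|g^t(x(t))\| + \|s\| \overset{(b)}{\le} G + \sqrt{m} D_2 R$$
where (a) follows from the triangle inequality; and (b) follows from the definition of Euclidean norm, the Cauchy-Schwarz inequality and Assumption 1. Summing the per-queue inequality over $k \in \{1,2,\ldots,m\}$ and using $\sum_{k=1}^{m} h_k^2 = \|h\|^2$ yields the claimed bound.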
II-C Preliminary Analysis and More Intuitions of Algorithm 1
The next lemma relates constraint violations and virtual queue values and follows directly from (3).
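Recall that function $h: \mathcal{X}_0 \to \mathbb{R}$ is said to be $c$-strongly convex if $h(x) - \frac{c}{2}\|x\|^2$ is convex over $x \in \mathcal{X}_0$. By the definition of strongly convex functions, it is easy to see that if $\phi: \mathcal{X}_0 \to \mathbb{R}$ is a convex function, then for any constant $c > 0$ and any constant vector $x_0$, the function $\phi(x) + c\|x - x_0\|^2$ is $2c$-strongly convex. Further, it is known that if $h: \mathcal{X}_0 \to \mathbb{R}$ is a $c$-strongly convex function and is minimized at point $x^{\min} \in \mathcal{X}_0$, then (see, for example, Corollary 1 in ):
$$h(x^{\min}) \le h(y) - \frac{c}{2}\|y - x^{\min}\|^2, \quad \forall y \in \mathcal{X}_0.$$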
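Lemma 4. Let $z \in \mathcal{X}_0$ be arbitrary. For all $t \ge 1$, Algorithm 1 guarantees
$$V[\nabla f^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] + \alpha\|x(t+1) - x(t)\|^2 \le V[\nabla f^t(x(t))]^{\mathsf{T}}[z - x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^{\mathsf{T}}[z - x(t)] + \alpha\|z - x(t)\|^2 - \alpha\|z - x(t+1)\|^2.$$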
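The next corollary follows by taking $z = x(t)$ in Lemma 4.
Corollary 1. For all $t \ge 1$, Algorithm 1 guarantees $\|x(t+1) - x(t)\| \le \frac{V D_1}{2\alpha} + \frac{\sqrt{m} D_2}{2\alpha}\|Q(t)\|$.
Proof. Fix $t \ge 1$. Note that $x(t) \in \mathcal{X}_0$. Taking $z = x(t)$ in Lemma 4 yields
$$V[\nabla f^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] + \alpha\|x(t+1) - x(t)\|^2 \le -\alpha\|x(t) - x(t+1)\|^2.$$
Rearranging terms and cancelling common terms yields
$$2\alpha\|x(t+1) - x(t)\|^2 \le -V[\nabla f^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] - \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^{\mathsf{T}}[x(t+1) - x(t)] \overset{(a)}{\le} V\|\nabla f^t(x(t))\|\|x(t+1) - x(t)\| + \|Q(t)\|\sqrt{\sum_{k=1}^{m}\|\nabla g_k^t(x(t))\|^2}\,\|x(t+1) - x(t)\| \overset{(b)}{\le} V D_1\|x(t+1) - x(t)\| + \sqrt{m} D_2\|Q(t)\|\|x(t+1) - x(t)\|$$
where (a) follows by the Cauchy-Schwarz inequality (note that the second term on the right side applies the Cauchy-Schwarz inequality twice); and (b) follows from Assumption 1.
Thus, we have $\|x(t+1) - x(t)\| \le \frac{V D_1}{2\alpha} + \frac{\sqrt{m} D_2}{2\alpha}\|Q(t)\|$.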
This corollary further justifies why Algorithm 1 intends to minimize drift $\Delta(t)$. Recall that controlled drift can often lead to boundedness of a stochastic process as illustrated in the next section. Thus, the intuition of minimizing drift is to yield small $\|Q(t)\|$ bounds.
III Expected Performance Analysis of Algorithm 1
This section shows that if we choose $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then both expected regret and expected constraint violations are $O(\sqrt{T})$.
III-A A Drift Lemma for Stochastic Processes
Let $\{Z(t), t \ge 0\}$ be a discrete time stochastic process adapted to a filtration $\{\mathcal{F}(t), t \ge 0\}$. (Footnote 4: Random variable $Y$ is said to be adapted to $\sigma$-algebra $\mathcal{F}$ if $Y$ is $\mathcal{F}$-measurable. In this case, we often write $Y \in \mathcal{F}$. Similarly, random process $\{Z(t)\}$ is adapted to filtration $\{\mathcal{F}(t)\}$ if $Z(t) \in \mathcal{F}(t), \forall t$. See e.g. .) For example, $Z(t)$ can be a random walk, a Markov chain or a martingale. Drift analysis is the method of deducing properties, e.g., recurrence, ergodicity, or boundedness, about $Z(t)$ from its drift $\mathbb{E}[Z(t+1) - Z(t) \mid \mathcal{F}(t)]$. See [21, 22] for more discussions or applications on drift analysis. This paper proposes a new drift analysis lemma for stochastic processes as follows:
Let be a discrete time stochastic process adapted to a filtration . Suppose there exists an integer , real constants , , and such that
hold for all . Then, the following holds
For any constant , we have where .
See Appendix A. ∎
The above lemma provides both expected and high probability bounds for stochastic processes based on a drift condition. It will be used to establish upper bounds of virtual queues $\|Q(t)\|$, which further leads to expected and high probability constraint performance bounds of our algorithm. For a given stochastic process $Z(t)$, it is possible to show the drift condition (12) holds for multiple $t_0$ with different $\zeta$ and $\theta$. In fact, we will show in Lemma 7 that $\|Q(t)\|$ yielded by Algorithm 1 satisfies (12) for any integer $t_0 > 0$ by selecting $\zeta$ and $\theta$ according to $t_0$. One-step drift conditions, corresponding to the special case $t_0 = 1$ of Lemma 5, have been previously considered in [22, 23]. However, Lemma 5 (with general $t_0 > 0$) allows us to choose the best $t_0$ in performance analysis such that sublinear regret and constraint violation bounds are possible.
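To build intuition for how a drift condition keeps a process bounded, the following toy simulation may help. It is a sketch under stated assumptions, not the process analyzed in this paper: it simulates a reflected random walk whose increments are bounded by one and whose expected one-step change equals $2p - 1 < 0$ whenever the current value is at least one, which is the kind of one-step negative drift discussed above. The name `reflected_walk` and the parameter `p_up` are hypothetical.

```python
import numpy as np

def reflected_walk(T, p_up=0.3, seed=0):
    """Simulate Z(t+1) = max(Z(t) + U(t), 0) with U(t) = +1 w.p. p_up, else -1.

    Increments are bounded by 1 in absolute value, and for Z(t) >= 1 the
    expected one-step change is 2 * p_up - 1, which is negative when p_up < 0.5.
    """
    rng = np.random.default_rng(seed)
    steps = np.where(rng.random(T) < p_up, 1.0, -1.0)
    Z = np.zeros(T + 1)
    for t in range(T):
        Z[t + 1] = max(Z[t] + steps[t], 0.0)  # reflect at zero, like a queue
    return Z
```

Plotting or summarizing `reflected_walk(T)` for increasing `T` shows values that stay small rather than growing with the horizon, in contrast with the same walk run with `p_up` above one half.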
III-B Expected Constraint Violation Analysis
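Define filtration $\{\mathcal{W}(t), t \ge 0\}$ with $\mathcal{W}(0) = \{\emptyset, \Omega\}$ and $\mathcal{W}(t) = \sigma(\omega(1), \ldots, \omega(t))$ being the $\sigma$-algebra generated by random samples $\{\omega(1), \ldots, \omega(t)\}$ up to round $t$. From the update rule in Algorithm 1, we observe that $x(t+1)$ is a deterministic function of $f^t(\cdot)$, $g(\cdot;\omega(t))$ and $Q(t)$ where $Q(t)$ is further a deterministic function of $Q(t-1)$, $g(\cdot;\omega(t-1))$, $x(t)$ and $x(t-1)$. By induction, it is easy to show that $\sigma(x(t)) \subseteq \mathcal{W}(t-1)$ and $\sigma(Q(t)) \subseteq \mathcal{W}(t-1)$ for all $t \ge 1$ where $\sigma(Y)$ denotes the $\sigma$-algebra generated by random variable $Y$. For fixed $t \ge 1$, since $Q(t)$ is fully determined by $\omega(\tau), \tau \in \{1,2,\ldots,t-1\}$ and $\omega(t)$ are i.i.d., we know $g^t(x)$ is independent of $Q(t)$. This is formally summarized in the next lemma.
Lemma 6. If $x^* \in \mathcal{X}_0$ satisfies $\tilde{g}(x^*) = \mathbb{E}_{\omega}[g(x^*;\omega)] \le 0$, then Algorithm 1 guarantees:
$$\mathbb{E}[Q_k(t) g_k^t(x^*)] \le 0, \quad \forall k \in \{1,2,\ldots,m\}, \forall t \ge 1.$$
Proof. Fix $k \in \{1,2,\ldots,m\}$ and $t \ge 1$. Since $g_k^t(x^*) = g_k(x^*;\omega(t))$ is independent of $Q_k(t)$, which is determined by $\{\omega(1), \ldots, \omega(t-1)\}$, it follows that $\mathbb{E}[Q_k(t) g_k^t(x^*)] = \mathbb{E}[Q_k(t)]\,\mathbb{E}[g_k^t(x^*)] \overset{(a)}{\le} 0$, where (a) follows from the fact that $\mathbb{E}[g_k^t(x^*)] \le 0$ and $Q_k(t) \ge 0$. ∎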
To establish a bound on constraint violations, by Corollary 2, it suffices to derive upper bounds for $\|Q(t)\|$. In this subsection, we derive upper bounds for $\|Q(t)\|$ by applying the drift lemma (Lemma 5) developed at the beginning of this section. The next lemma shows that random process $Z(t) = \|Q(t)\|$ satisfies the conditions in Lemma 5.
Let be an arbitrary integer. At each round in Algorithm 1, the following holds
See Appendix B. ∎
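Lemma 7 allows us to apply Lemma 5 to random process $Z(t) = \|Q(t)\|$ and obtain $\mathbb{E}[\|Q(t)\|] = O(\sqrt{T})$ by taking $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$, where $\lceil\sqrt{T}\rceil$ represents the smallest integer no less than $\sqrt{T}$. By Corollary 2, this further implies the expected constraint violation bound as summarized in the next theorem.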
Theorem 1 (Expected Constraint Violation Bound).
If and in Algorithm 1, then for all , we have
where the expectation is taken with respect to all .
Define random process and filtration