1 Introduction
This paper, motivated by solving robust optimization and (equivalently) adversarial training, revisits the concept of “adversary” in online learning. A significant part of the online learning literature focuses on the so-called “adversarial” setup, where a learner picks a sequence of solutions against a sequence of inputs “chosen adversarially” and achieves accumulated utility almost as good as that of the best fixed solution in hindsight. While those results are widely known and applied, we observe that the concept of “adversary” is often understood and applied incorrectly, partly because it lacks a rigorous definition, a gap that we address in this paper. Our observation is largely motivated by recent works applying online learning to solve robust optimization and adversarial training, where diminishing regret, contrary to the claim, is not guaranteed.
Robust optimization (see e.g., ben1998robust; BN00; ben2002robust; Bertsimas04; BGN09; bertsimas2011theory) is a powerful approach to account for data uncertainty when distributional information is not available. Taking a worst-case perspective, in Robust Optimization (RO) we are interested in finding a good solution against the worst-case data realization, which leads to problems of the form
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) \qquad (1)$$
where $\mathcal{X}$ is the decision set and $\mathcal{Y}$ is the uncertainty set, and it is well known that robust optimization can be expressed in this standard form via an appropriate choice of $f$. We thus look for a robust solution $x \in \mathcal{X}$ that minimizes the cost under the worst-case realization $y \in \mathcal{Y}$. In particular, when the function $f$ is convex in the decisions and concave in the uncertainty, the resulting problem is convex and there is a plethora of approaches to solve these problems. In fact, robust optimization is tractable in many cases and can often be solved via methods from convex optimization, as well as integer programming in the context of discrete decisions; we refer the interested reader to BGN09 for an introduction and to, e.g., bertsimas2003robust for robust optimization with discrete decisions.
Robust optimization is also closely related to adversarial training, a subject that has drawn significant attention in machine learning research, particularly in deep learning. It has been observed that a well-trained deep neural network can often be “fooled” by small perturbations. That is, if a data point (an image, for example) is perturbed by carefully designed adversarial noise, potentially imperceptible to humans (i.e., with a small norm in the case of images), the classification result can change completely. To mitigate this issue, adversarial training aims to train a neural network to be robust to such perturbations, which can be formulated as follows:
where is the weight vector of the neural network, are the th data point, is the prediction of the neural network on an input and is the loss function. Thus, adversarial training is essentially an attempt to solve an (often non-convex and hard) robust optimization problem. Due to the lack of convexity, solving the above formulation exactly is intractable, and numerous heuristics have been proposed to address the computational issue.
Recently, several works explored a general framework to solve robust optimization and adversarial training via online learning. The main idea is the following: instead of solving the robust problem in one shot by exploiting convex duality, a sequence of scenarios is generated using online learning, and optimal decisions (or approximate solutions for very complicated functions, in the adversarial training case) for each scenario are then averaged to form the final output. Using theorems from online learning, it is shown that the final output is close to optimal (or achieves the same approximation ratio, for adversarial training) for all scenarios in .
The framework outlined above can be very appealing computationally. However, a close examination of the argument shows that, because of the ambiguity of the concept “adversary” in online learning, some of the claimed results are invalid (see Section 2.2 for a concrete counterexample). Thus, we feel that it is necessary to characterize the concept “adversary” more rigorously to avoid future confusion. This also enables us to develop new methods for solving robust optimization using online learning.
Contribution.
We now summarize our contribution:
Clarification of concepts. The main contribution of this paper is to distinguish two types of adversaries in an online learning setup, which we term “anticipatory” and “non-anticipatory”. In a nutshell, anticipatory adversaries are those that have access to the inherent randomness of the online learning algorithm at the current step. The example that motivates this concept is when the adversary is chosen by solving an optimization problem whose parameters are the output of the online learning algorithm. Non-anticipatory adversaries do not have access to the inherent randomness of the current step (but can still be adaptive). Based on that, we further distinguish two types of online learning algorithms, both of which are known in the literature to achieve diminishing regret with respect to adversarial input. Depending on whether such adversarial input can be anticipatory or not, we call the two classes strong learners and weak learners.
One-sided minimax problems via imaginary play. We then apply our model to the special case of solving robust optimization problems. We show how to solve problems of the form (1) by means of online learning with two weak learners. Slightly simplified, two learners play against each other solving Problem (1). However, in contrast to general saddle point problems, only one of the players can extract a feasible solution, as we considerably weaken convexity requirements both for the domains and for the functions (or even drop them altogether). For this we present a general primal-dual setup that is later instantiated for specific applications by plugging in the desired learners.
Biased play with asymmetric learners. We then show how to further gain flexibility by allowing asymmetry between the learners. Here one learner is weakened (in terms of requirements) to an optimization oracle, and consequently the other player is strengthened to allow anticipatory inputs.
Applications. Finally, we demonstrate how our approach can be used to solve a large variety of robust optimization problems. For example, we show how to solve robust optimization problems with complicated feasible sets that involve integer programming. Another example is robust MDPs with non-rectangular uncertainty sets, where only the reward parameters are subject to uncertainty. Due to space constraints, we defer the applications to the appendix.
2 Preliminaries and motivation
In the following let denote the unit simplex in . We will use the shorthand to denote the set . For the sake of exposition we will differentiate between Maximize and max, where the former indicates that we maximize a function via an algorithm, whereas the latter simply denotes the maximum without any algorithmic reference. Moreover, we will denote both the decision vector and the uncertainty vector in bold letters. All other notation is standard if not defined otherwise.
2.1 Conventional wisdom
In this work we consider games between two players, and we will use robust optimization or adversarial training of the form (1) as our running example. For the sake of continuity, we adopt the notation of ben2015oracle; however, we stress that we will later selectively relax some of the assumptions. Consider:
where is the domain of feasible decisions and the with are convex functions in that are parametrized via vectors for some for . The problem above is parametrized by a fixed choice of vectors with and we will refer to a problem in this form as the nominal problem (with parameters ), which corresponds to the outer minimization problem given a realization of the adversary’s choice .
In robust optimization we robustify the nominal problem against the worstcase choice of via the formulation:
where the uncertainty set is the set of possible choices of parameter . Thus, denote and we have .
Recently there has been a line of work proposing methods to solve adversarial training and robust optimization via online learning methods or in an equivalent fashion (see e.g., ben2015oracle; madry2017towards; chen2017robust; sinha2017certifiable). In all cases the underlying meta-algorithm works as follows: the player takes as input and generates a sequence of according to an online learning algorithm which achieves diminishing regret against adversarial input. The player, on the other hand, computes by minimizing the loss function with as input; the interpretation of the roles of the players depends on the considered problem.
In particular, in ben2015oracle, the authors propose two methods along this line, using online convex optimization and Follow the Perturbed Leader (FPL) as the online learning algorithm, respectively. In chen2017robust, the authors consider the case where is a finite set and propose to use exponential weighting as the online learning algorithm (in the infinite case, they use online gradient descent), and then solve by minimizing the loss function for the distributional problem. While superficially similar, these two approaches are markedly different, as we will see.
2.2 A motivating counterexample
Unfortunately, the approach outlined above can easily fail, for reasons that will be made clear later. We start with the following counterexample, to which we apply the second method (i.e., the FPL-based one) proposed in ben2015oracle.
Consider the following robust feasibility problem: Let , and does there exist such that
The answer is clearly negative, as for any , at least one of and is less than or equal to . However, Theorem 2 of ben2015oracle asserts that if we update according to Follow the Perturbed Leader, compute via
and if for (which is true here as the objectives are linear and thus is a vertex of ), then is robust feasible, i.e.,
where ; this is clearly not true and we obtain the desired contradiction.
Furthermore, we also show that does not converge to the minimax solution: Since is obtained by FPL and the objective is a linear function as before we have that is a vertex of . Further notice that for we have . Therefore is on the line segment between and . One can easily check that for any on this line segment, we have
On the other hand, clearly for , we have
Thus, does not converge to the minimax solution.
Interestingly, the first method proposed in ben2015oracle turns out to be a valid method for this example. Also, the approach in chen2017robust does not suffer from this weakness, as the Bayesian optimization oracle is applied to the output distribution rather than to a sampled solution (which would be problematic). This is no coincidence: clearly explaining the different behaviors of the various methods proposed in the literature is the main motivation of this work.
3 Anticipatory and nonanticipatory adversaries
In this section we provide definitions of the main concepts that we introduce in this paper. There are two types of “adversaries” for online learning setups, namely anticipatory adversaries and non-anticipatory adversaries, and they need to be clearly distinguished. In a nutshell, and slightly simplifying, the distinction is whether the decision of the adversary (who is potentially computationally unbounded) is independent of the private randomness of the current round ; if not, the adversary might be able to anticipate the player’s decision .
Definition 3.1.
An online learning setup is as follows: for , the algorithm is given access to an external signal , and an exogenous random variable which is independent of everything else; furthermore, and are independent for . An online learning algorithm is a mapping:
We say an online learning algorithm is deterministic if it is a mapping
In other words, an online learning algorithm picks an action at time depending on past actions , past signals , and an exogenous randomness . A deterministic online learning algorithm is independent of the exogenous randomness.
Existing analyses of online learning algorithms focus on two cases: either the external signal is generated stochastically (typically in an i.i.d. fashion), or it is generated “adversarially”. However, as we will show later, the term “adversarially” is loosely defined and causes significant confusion. Instead, we now define two types of adversarial signals.
Definition 3.2.
Recall that an online learning algorithm is given access to exogenous random variables and its output may depend on , but is independent of . A sequence is called non-oblivious non-anticipatory (NONA) with respect to if may depend on , but is independent of , for all . A sequence is called anticipatory with respect to if may depend on , but is independent of , for all .
We now provide some examples to illustrate the concept.

If is a sequence chosen arbitrarily, independent of , then it is a NONA sequence.

If is chosen according to
for some function , then is a NONA sequence.

If is chosen according to
for some function , then is an anticipatory sequence. This is because is (potentially) dependent on , and so is . As a special case, suppose
then is an anticipatory sequence.

If is the output of a deterministic online learning algorithm, and is chosen according to
for some function , then is a NONA sequence. This is because is independent of since the online learning algorithm is deterministic. In this case, the following sequence is NONA as well:
The standard goal of online learning algorithms is to achieve diminishing regret vis-à-vis a sequence of external signals. Thus, depending on whether the external signal is anticipatory or not, we define two classes of learning algorithms.
Definition 3.3.
Suppose is the feasible set of actions, and for the action chosen at time , it is evaluated by , with a smaller value being more desirable.

We call an online learning algorithm for a weak learning algorithm with regret , if for any NONA sequence , the following holds with a probability (over the exogenous randomness of the algorithm), where are the output of :
We call an online learning algorithm for a strong learning algorithm with regret , if for any anticipatory sequence , the following holds with a probability (over the exogenous randomness of the algorithm), where are the output of :
To illustrate the subtle difference, let us consider the classical exponential weighting algorithm, where a set of experts is given, and the goal of the online learning algorithm is to predict the unseen according to the predictions of the experts, such that the algorithm does as well as the best among the experts. The algorithm maintains a weight vector over all experts depending on their performance in previous rounds, and outputs the weighted average of the predictions of the experts. Notice that this is a deterministic learning algorithm, i.e., the learning algorithm is independent of the exogenous randomness . Thus, whether has access to or not has no influence on the performance of the algorithm. As such, the exponential weighting algorithm (in this specific form) is a strong learning algorithm.
On the other hand, there is a variant of the exponential weighting algorithm where, instead of outputting the weighted average of the predictions, the algorithm outputs the prediction of a single expert, chosen with probability proportional to its weight. The common proof for this technique argues that, through randomization, the expected loss is upper bounded by the loss of the weighted average, and hence the regret of this variant is upper bounded by the regret of the vanilla version. Clearly, this argument implicitly assumes that the realized is independent of this randomness, and it breaks down otherwise. Hence, this form of the exponential weighting algorithm is a weak learning algorithm.
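The gap between the two variants can be checked numerically. The following is a minimal sketch and is not taken from the paper: it assumes two constant experts (always predicting 0 and always predicting 1), the absolute loss, and an adversary that observes the learner's current prediction before choosing the label. The function name `regret_vs_anticipatory` and all of its parameters are hypothetical.

```python
import math
import random


def regret_vs_anticipatory(T, randomized, seed=0):
    """Regret of exponential weighting against an adversary that sees the
    current prediction (illustrative setup: two constant experts, |p - y| loss)."""
    rng = random.Random(seed)
    eta = math.sqrt(8.0 * math.log(2) / T)  # standard tuning for N = 2 experts
    cum = [0.0, 0.0]  # cumulative losses of expert 0 (predicts 0) and expert 1 (predicts 1)
    learner_loss = 0.0
    for _ in range(T):
        # Exponential weights computed from past rounds only.
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        p1 = w[1] / (w[0] + w[1])  # weight on expert 1 = weighted-average prediction
        if randomized:
            # Weak variant: follow one expert sampled according to the weights.
            pred = 1.0 if rng.random() < p1 else 0.0
        else:
            # Deterministic variant: output the weighted average.
            pred = p1
        # Anticipatory adversary: picks the label farthest from the current prediction.
        y = 1.0 if pred <= 0.5 else 0.0
        learner_loss += abs(pred - y)
        cum[0] += abs(0.0 - y)
        cum[1] += abs(1.0 - y)
    return learner_loss - min(cum)


if __name__ == "__main__":
    T = 2000
    print("deterministic:", regret_vs_anticipatory(T, randomized=False))
    print("randomized:   ", regret_vs_anticipatory(T, randomized=True))
```

Under these assumptions the deterministic variant stays within the usual $\sqrt{T \ln N / 2}$ regret bound, whereas the sampled variant incurs loss 1 in every round, so its regret is at least $T/2$.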
As a rule of thumb, it appears that for online learning algorithms that “work in the adversarial case”, all deterministic algorithms (e.g., Online Gradient Descent) are strong learning algorithms; whereas all algorithms which inherently require randomness (e.g., Follow the Perturbed Leader) are weak learning algorithms.
We remark that the online learning literature contains the concept of adaptive adversaries, a related concept that helps to highlight the observation made in this paper. An adaptive adversary in online learning is allowed to adapt its choice at time to the output of the online learning algorithm until time , but is independent of the exogenous randomness at time . Thus, it generates a non-anticipatory sequence. An online learning algorithm that achieves diminishing regret against such an adversary is thus a weak learner, and not necessarily a strong learner.
4 Warm-up: Minimax problem via Online Learning
We will first consider the case where we have one function . In principle this function can be highly complex and could be, e.g., the maximum of a family of functions , however here the reader should be thinking of as a relatively simple function. This will be made more precise below, where we specify the learnability requirements for , which ultimately limits the complexity of the considered functions. In Section 5 we will then consider the more general case of a family of (simple) functions , which arises naturally in robust optimization.
Thus, we are solving the following optimization problem
(2) 
Assumption 4.1 (Problem structure).
We will make the following assumptions regarding the domains and function if not stated otherwise. Note that these assumptions only affect the player. (1) For any , the function is convex. (2) The set is convex.
4.1 Parallel Weak Learners
Our first framework solves Problem (2) via weak online learning algorithms and imaginary play (i.e., both players can have full knowledge about the function ) to update and in parallel. In the following we will always assume that the sequence is a sequence of elements in and the sequence is a sequence of elements in .
Assumption 4.2 (Weak Learnability).

There exists a weak online learning algorithm for with regret . That is, for any NONA sequence , the following holds with a probability , where is the output of :

There exists a weak online learning algorithm for with regret . That is, for any NONA sequence , the following holds with a probability , where is the output of :
As mentioned above, the learnability assumption constrains the complexity of the function . For example if for some family of functions that are convex in and concave in , then might not be concave in and the resulting Problem (2) might be intractable and the learnability assumption for might be violated.
We are now ready to present the meta-algorithm, which is given in Algorithm 1. We remark that we refer to these minimax problems as robust optimization because we only require an explicit (stationary) solution for the player.
Remark 4.3 (Dependence on ).
Note that in Algorithm 1 the function does not explicitly occur. In fact, is captured in Assumptions 4.1 and 4.2 and, in particular, we make a priori no distinction as to what type of feedback (full information, semi-bandit, bandit, etc.) the learner observes. In principle, since we are assuming imaginary play, both learners can have full knowledge of the function , while in actual applications the learners will only require limited information. For example, one learner might only require bandit feedback to ensure the learnability assumption with a given regret bound, while another might require full information, depending on the setup. In the formulation above, Algorithm 1 is completely agnostic to this; the situation is analogous in all other algorithms.
Observe that due to convexity of , we have . The theorem below shows that converges to , which achieves the best worst-case performance. Note that the guarantee is asymmetric, as a saddle point may not exist since no assumptions on or are made. If indeed is concave with respect to the second argument and is convex, then the theorem reduces to the well-known result on solving a zero-sum game via online learning in parallel (freund1999adaptive). The proof is similar and is included in the supplementary material for completeness.
Theorem 4.4.
With probability , Algorithm 1 returns a point satisfying
(3) 
We remark that the two-weak-learners framework superficially resembles the online learning based method for solving zero-sum games (freund1999adaptive), where both players run an online learning algorithm. Yet, our setup and results depart from those in freund1999adaptive, as we drop any requirement on the uncertainty and, in particular, is not necessarily concave with respect to . In short, the minimax problem we solve is not a saddle-point problem, and as such only the player is able to extract a near-optimal solution. Notice that the lack of concavity with respect to arises naturally in robust optimization formulations and adversarial training (see Section 5 for details).
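The parallel scheme can be illustrated on a toy instance. The sketch below is not the paper's Algorithm 1 verbatim; it makes the following assumptions for illustration only: the decision set is the interval $[-1, 1]$, the uncertainty set is a finite subset of $[-1, 1]$, the payoff is $f(x, y) = (x - y)^2$ (convex in $x$, not concave in $y$), the $x$ learner is projected online gradient descent, and the $y$ learner is exponential weighting with a sampled output (a weak learner). All names are hypothetical.

```python
import math
import random


def f(x, y):
    # Illustrative payoff: convex in x, but convex (not concave) in y.
    return (x - y) ** 2


def parallel_weak_learners(Y, T, seed=0):
    """Both learners update in parallel; returns the average x iterate."""
    rng = random.Random(seed)
    n = len(Y)
    # Payoff range on X = [-1, 1]: a convex function peaks at an endpoint.
    hi = max(f(x, y) for x in (-1.0, 1.0) for y in Y)
    eta_y = math.sqrt(8.0 * math.log(n) / T) / hi
    gains = [0.0] * n  # cumulative payoff of each fixed y against x_1..x_{t-1}
    x, x_sum = 0.0, 0.0
    for t in range(1, T + 1):
        # y learner: sample from exponential weights built from past rounds only,
        # so y_t does not depend on the randomness used in round t by the x learner.
        m = max(gains)
        w = [math.exp(eta_y * (g - m)) for g in gains]
        y = rng.choices(Y, weights=w)[0]
        x_sum += x
        # Imaginary play: the y learner sees the full payoff vector for x_t.
        for j, yj in enumerate(Y):
            gains[j] += f(x, yj)
        # x learner: projected gradient step with step size D / (G sqrt(t)),
        # where D = 2 is the diameter of X and G = 4 bounds |df/dx|.
        grad = 2.0 * (x - y)
        x = min(1.0, max(-1.0, x - 0.5 / math.sqrt(t) * grad))
    return x_sum / T


if __name__ == "__main__":
    Y = [-1.0, 0.5]
    x_bar = parallel_weak_learners(Y, T=20000, seed=1)
    print("x_bar =", x_bar, "worst case =", max(f(x_bar, y) for y in Y))
```

For the instance in the usage example, the minimax solution is $x = -0.25$ with value $0.5625$, so the worst-case payoff of the averaged iterate can be compared against that number. Only the averaged $x$ iterate is returned, in line with the asymmetric guarantee discussed above.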
4.2 Biased Play with a Strong Learner
In our second framework, the structure of the problem is “biased” toward one player. Here one of the learners is particularly strong, allowing the other to break the NONA property. We will consider the case where the learner is particularly strong. The case for is symmetric.
Assumption 4.5 (Strong Learnability of ).

There exists a strong online learning algorithm for with regret . That is, for any anticipatory sequence , the following holds with a probability , where is the output of :

Given , there is an optimization oracle that computes .
We now state a theorem similar to Theorem 4.4, for the case where is a strong learner; the proof can be found in Supplementary Material A.
Theorem 4.6.
With probability , Algorithm 2 returns a point satisfying
Some remarks are in order.

Note that since is updated via solving an optimization problem determined by , the sequence is an anticipatory sequence. As such, it is crucial that a strong learning algorithm is used to update . This explains the existence of the counterexample in Section 2.2: both FPL and exponential weighting (with randomly chosen output) are weak learners, as opposed to online gradient descent, which is a strong learner.

It is easy to extend the analysis to the case where an optimization oracle is replaced by an approximate optimization oracle, which given computes such that
for some . In such a case, the statement in Theorem 4.6 is replaced by
That is, the algorithm will return a approximate solution for the minimax problem.
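The biased scheme can be sketched in the same way. The code below is an illustration under assumptions, not the paper's Algorithm 2: it reuses the hypothetical toy payoff $f(x, y) = (x - y)^2$ on $[-1, 1]$ with a finite uncertainty set, takes projected online gradient descent as the strong learner, and implements the optimization oracle by brute-force enumeration.

```python
import math


def f(x, y):
    # Illustrative payoff: convex in x, no concavity in y required.
    return (x - y) ** 2


def best_response(Y, x):
    # Exact optimization oracle over a finite uncertainty set (brute force).
    return max(Y, key=lambda y: f(x, y))


def biased_play(Y, T):
    """Strong learner (projected OGD) for x against oracle best responses."""
    x, x_sum = 0.0, 0.0
    for t in range(1, T + 1):
        x_sum += x
        # The response depends on the current x_t, so the sequence of y_t is
        # anticipatory; this is harmless because OGD is deterministic.
        y = best_response(Y, x)
        grad = 2.0 * (x - y)
        # Step size D / (G sqrt(t)) with D = 2 and G = 4 on X = [-1, 1].
        x = min(1.0, max(-1.0, x - 0.5 / math.sqrt(t) * grad))
    return x_sum / T


if __name__ == "__main__":
    Y = [-1.0, 0.5]
    x_bar = biased_play(Y, T=10000)
    print("x_bar =", x_bar, "worst case =", max(f(x_bar, y) for y in Y))
```

On this instance the minimax value is $0.5625$ (attained at $x = -0.25$). Replacing the deterministic learner with one that samples its output would make the best response depend on the sample, which is the situation the first remark above warns against.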
Before concluding this section, we remark that the convexity requirement for the player can be relaxed if randomized actions are allowed; see Appendix B for details. Consequently, this allows us to solve robust optimization where both the feasible region and the uncertainty set are represented as feasible regions of Integer Programming problems (see Appendix D for an example).
5 Multiple Objectives: Online Learning for Robust Optimization
Our general approach can be readily extended to the case where the primal player needs to satisfy multiple objectives simultaneously. Multi-objective decision-making naturally arises in many setups where the preferences over decisions are multi-dimensional. In particular, multi-objective decision-making can model robust optimization, where typically the decision maker aims to find a decision such that a set of robust constraints is satisfied, i.e.,
We consider solving the following general case:
(4) 
where is a closed convex set. For example, if , the dimensional unit simplex, then Problem (4) reduces to
which corresponds to the aforementioned case of robust optimization. On the other hand, if is a singleton, then (4) is equivalent to
and we solve our problem for a specific preference or weighting among the different objectives.
We first consider solving Problem (4) via parallel weak learners. We present the following two approaches both of which are based on imaginary play. Due to space constraints, the biased case with a strong learner is deferred to the supplementary material.
Approach via Explicit Maximum
In the first approach we model the maximum over the different functions explicitly. To this end, let denote the concatenation of , i.e., and define the function
Thus, roughly speaking, the optimal is approachable if weak learnability holds for both and with respect to . Due to space constraints we defer the detailed results to the supplementary material.
Approach via Distributional Maximum
In this section we will present an alternative approach, where the maximum is only implicitly modeled via a distributional approach, which captures the maximum via a worstcase distribution.
In the following let
With this Problem (4) can be rewritten as
As before we specify the learnability requirement.
Assumption 5.1 (Learnability).
We make the following assumptions for the learners:

For every , there exists an online learning algorithm for for , i.e., for any NONA sequence of , the following holds with probability :
where is the output of .

There exists an online learning algorithm for for . That is, for any NONA sequences of and , the following holds with probability :
where is the output of .

There exists an online learning algorithm for for . That is, for any NONA sequences of and , the following holds with probability :
where is the output of .
Note that the learner and learner should be considered as the dual learners and as the primal learner. In fact, we show that the learner and learner together give rise to a learner, which allows us then to reuse previous methodology. Further observe that, is a linear function of and thus the second part of the assumption, for example, holds using the Follow the Perturbed Leader algorithm (see kalai2005efficient).
Proposition 5.2.
Suppose that Assumption 5.1 holds and that . Then running and simultaneously is an online learning algorithm for of function . That is, for any NONA sequence , let be the output of , and be the output of , then with probability , we have
By Proposition 5.2, there exist weak learners for both the primal and the dual player, and solving Problem (4) reduces to solving Problem (2). Below we present the formal algorithm and the corresponding theorem with performance guarantees.
Theorem 5.3.
To illustrate the result, let us consider the following example. Suppose all are bilinear with respect to and , as in the case of a robust linear program, and are subsets of the Euclidean space, and further suppose is a convex set (notice that we make no such assumptions on the uncertainty sets ). We say a set is equipped with a linear optimization oracle if, given any , we can compute
Thus, Assumption 5.1 holds if the following conditions are true:
is equipped with a linear optimization oracle for .

is equipped with a linear optimization oracle.

is equipped with a linear optimization oracle.
Indeed, each of the three conditions ensures the learnability in Assumption 5.1 for , , and respectively, via e.g., the Follow the Perturbed Leader algorithm. This is due to the fact that for each argument, its respective objective function is linear.
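To make the role of the linear optimization oracle concrete, the following is a minimal sketch of Follow the Perturbed Leader driven by such an oracle. It is an illustration under assumptions and not the paper's construction: the decision set is given as an explicit list of vertices, the oracle is a brute-force argmin (a stand-in for a real solver), perturbations are drawn uniformly from $[0, 1/\epsilon]^d$ in each round, and the names `linear_oracle` and `fpl_regret` are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_oracle(vertices, c):
    # Stand-in linear optimization oracle: returns a vertex minimizing <v, c>.
    return min(vertices, key=lambda v: dot(v, c))


def fpl_regret(vertices, costs, epsilon, seed=0):
    """Run Follow the Perturbed Leader on linear costs; return its regret."""
    rng = random.Random(seed)
    d = len(vertices[0])
    cum = [0.0] * d  # sum of the cost vectors seen so far
    total = 0.0
    for c in costs:
        # Fresh exogenous randomness in every round.
        noise = [rng.uniform(0.0, 1.0 / epsilon) for _ in range(d)]
        x = linear_oracle(vertices, [s + p for s, p in zip(cum, noise)])
        total += dot(x, c)
        cum = [s + ci for s, ci in zip(cum, c)]
    best = min(dot(v, cum) for v in vertices)
    return total - best


if __name__ == "__main__":
    env = random.Random(42)  # cost sequence fixed in advance (oblivious)
    T = 2000
    costs = [[env.random() for _ in range(3)] for _ in range(T)]
    simplex_vertices = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    print(fpl_regret(simplex_vertices, costs, epsilon=math.sqrt(2.0 / (3 * T)), seed=7))
```

The learner touches the decision set only through `linear_oracle`, which is why an oracle of this kind suffices when each objective is linear in the corresponding argument. Because the output depends on the fresh perturbation of the current round, this is a weak learner in the sense of Definition 3.3, and the cost sequence in the usage example is deliberately fixed in advance.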
References
Appendix A Proofs
A.1 Proofs from Section 4
Proof of Theorem 4.4.
Since and are obtained by and , we have that and are NONA. Thus, by Assumption 4.2,
(5) 
and
(6) 
hold simultaneously with probability . Summing up the two inequalities leads to
(7) 
By convexity of , we have for all , so that
(8) 
Since for all we also have for any that , which implies further leading to
(9) 
Combining Equations (8) and (9) we obtain
which together with (7) establishes the theorem. ∎
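For reference, the chain of inequalities used in this proof can be written out as follows. The notation is an assumption made for illustration, since the displayed formulas are not available here: $x_t, y_t$ denote the iterates of the two learners, $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$ is the averaged iterate, and $R_x, R_y$ denote the cumulative regrets of the two weak learners from Assumption 4.2. The comments indicate which step of the proof each line is meant to mirror.

```latex
\begin{align*}
% (5) and (6): regret guarantees of the x learner and the y learner
\frac{1}{T}\sum_{t=1}^{T} f(x_t, y_t) - \min_{x \in \mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} f(x, y_t) &\le \frac{R_x}{T}, \\
\max_{y \in \mathcal{Y}} \frac{1}{T}\sum_{t=1}^{T} f(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} f(x_t, y_t) &\le \frac{R_y}{T}, \\
% (7): sum of the two inequalities
\max_{y \in \mathcal{Y}} \frac{1}{T}\sum_{t=1}^{T} f(x_t, y) - \min_{x \in \mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} f(x, y_t) &\le \frac{R_x + R_y}{T}, \\
% (8): convexity of f(., y) applied to the average
\max_{y \in \mathcal{Y}} f(\bar{x}, y) &\le \max_{y \in \mathcal{Y}} \frac{1}{T}\sum_{t=1}^{T} f(x_t, y), \\
% (9): each f(x, y_t) is at most max_y f(x, y)
\min_{x \in \mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} f(x, y_t) &\le \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y), \\
% conclusion: combine (8) and (9) with (7)
\max_{y \in \mathcal{Y}} f(\bar{x}, y) - \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) &\le \frac{R_x + R_y}{T}.
\end{align*}
```

Note that only convexity in the first argument is used, which is consistent with the one-sided nature of the guarantee.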
Proof of Theorem 4.6.
We obtain an analogous statement for Randomized Robust Optimization via Strong Primal Learner whose proof is almost identical to the proof of Theorem 4.6 from above.
Theorem A.1.
With probability , Algorithm 4 returns a distribution satisfying