# Online Multiobjective Minimax Optimization and Applications

We introduce a simple but general online learning framework in which, at every round, an adaptive adversary introduces a new game, consisting of an action space for the learner, an action space for the adversary, and a vector-valued objective function that is convex-concave in every coordinate. The learner and the adversary then play in this game. The learner's goal is to play so as to minimize the maximum coordinate of the cumulative vector-valued loss. The resulting one-shot game is not convex-concave, and so the minimax theorem does not apply. Nevertheless, we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret. We demonstrate the power of our simple framework by using it to derive optimal bounds and algorithms across a variety of domains. This includes no-regret learning: we can recover optimal algorithms and bounds for minimizing external regret, internal regret, adaptive regret, multigroup regret, subsequence regret, and a notion of regret in the sleeping experts setting. Next, we use it to derive a variant of Blackwell's Approachability Theorem, which we term "Fast Polytope Approachability". Finally, we are able to recover recently derived algorithms and bounds for online adversarial multicalibration and related notions (mean-conditioned moment multicalibration, and prediction interval multivalidity).


## 1 Introduction

We introduce and study a simple but powerful framework for online adversarial multiobjective minimax optimization. At each round $t$, an adaptive adversary chooses an environment for the learner to play in, defined by a convex compact action set $\mathcal{X}^t$ for the learner, a convex compact action set $\mathcal{Y}^t$ for the adversary, and a $d$-dimensional continuous loss function $\ell^t : \mathcal{X}^t \times \mathcal{Y}^t \to [-C, C]^d$ that, in each coordinate, is convex in the learner's action and concave in the adversary's action. The learner then chooses an action $x^t \in \mathcal{X}^t$ (or a distribution over actions), and as a function of the learner's choice, the adversary chooses an action $y^t \in \mathcal{Y}^t$. This results in a loss vector $\ell^t(x^t, y^t)$, which accumulates over time. The goal of the learner is to minimize the maximum accumulated loss over each of the $d$ dimensions: $\max_{j \in [d]} \sum_{t=1}^T \ell^t_j(x^t, y^t)$.

When described this way, it is natural to view the environment chosen at each round as defining a zero-sum game between the learner and the adversary, in which the learner wishes to minimize the maximum coordinate of the resulting loss vector. The objective of the learner in the stage game in isolation can be written as:¹ A brief aside about the "inf max max" structure of $w^t_L$: since each $\ell^t_j$ is continuous, so is $\max_{j \in [d]} \ell^t_j$, and hence the inner maximum is attained on the compact set $\mathcal{Y}^t$; but the resulting objective need not be a continuous function of $x^t$, so the infimum over $\mathcal{X}^t$ need not be attained.

$$w^t_L = \inf_{x^t \in \mathcal{X}^t} \max_{y^t \in \mathcal{Y}^t} \left( \max_{j \in [d]} \ell^t_j(x^t, y^t) \right).$$

Unfortunately, although $\ell^t$ is convex-concave in each coordinate, the maximum over coordinates does not preserve concavity for the adversary. Thus the minimax theorem does not hold, and the value of the game in which the learner must move first (defined above) is larger than the value of the game in which the adversary is forced to move first; that is, $w^t_L \geq w^t_A$, where $w^t_A$ is defined as:² The reason for taking the supremum instead of the maximum over $\mathcal{Y}^t$ is the same as explained in Footnote 1 for $w^t_L$.

$$w^t_A = \sup_{y^t \in \mathcal{Y}^t} \min_{x^t \in \mathcal{X}^t} \left( \max_{j \in [d]} \ell^t_j(x^t, y^t) \right).$$

Nevertheless, fixing a series of environments chosen by the adversary, this defines in hindsight an aspirational quantity $W^T_A = \sum_{t=1}^T w^t_A$, summing the adversary-moves-first values of the constituent zero-sum games. Despite the fact that these values are not individually obtainable in the stage games, we show that they are approachable on average over a sequence of rounds in the following sense: there is an algorithm for the learner that guarantees, against any adversary,

$$\max_{j \in [d]} \left( \frac{1}{T} \sum_{t=1}^T \ell^t_j(x^t, y^t) \right) \leq \frac{1}{T} W^T_A + 4\sqrt{\frac{2 \ln d}{T}}.$$

Our derivation is elementary and based on a minimax argument. The generic algorithm plays actions at every round according to a minimax equilibrium strategy in a surrogate game that is derived both from the environment chosen by the adversary at round $t$ and from the history of play on previous rounds $1, \ldots, t-1$. The loss in the surrogate game is convex-concave (and so we may apply minimax arguments), and can be used to upper bound the loss in the original games.

We then show that this simple framework can be instantiated to derive a wide array of optimal bounds, and that the corresponding algorithms can be derived in closed form by solving for the minimax equilibrium of the corresponding surrogate game. Our applications fall into three categories:

1. Expert Learning: We can derive optimal regret bounds and algorithms for a wide variety of learning-with-experts settings. In these settings, there is a finite set of experts who each incur an adversarially selected loss in $[0,1]$ at each round. The learner must select an expert at each round before the losses are revealed, and incurs the loss of her chosen expert. We can recover algorithms and bounds in a large variety of settings; a non-exhaustive list includes:

1. External Regret: In the standard setting of regret to the best fixed expert, our framework recovers the multiplicative weights algorithm [MW1, MW2] and the corresponding optimal regret bound, which witnesses the optimality of our main theorem.

2. Internal Regret and Swap Regret: Internal and swap regret bound the learner's regret conditioned on the action that she plays. Minimizing these notions of regret in a multiplayer game corresponds to convergence to the set of correlated equilibria; see [fostervohra, HM00]. Our method derives an algorithm of [internalregret] from first principles, which explicates the fixed point calculation in that algorithm.

3. Adaptive Regret, studied by [MW2, adaptiveregret2, adaptiveregretvovk], asks for diminishing regret not just over the entire sequence of rounds, but also over each contiguous interval of rounds. This represents regret to the best expert in a setting in which the identity of the best expert may change over time.

4. Sleeping Experts: In the sleeping experts problem [sleeping, sleeping2, internalregret, kleinberg2010sleepingexperts], only an adversarially chosen subset of experts is available to the learner in each round. Blum and Mansour [internalregret] define the goal of obtaining diminishing regret to each expert on the subsequence of rounds on which that expert is available.

5. Multi-group Regret: Multi-group regret is a fairness-motivated notion (studied under a different name in [multigroup1] and in the batch setting in [multigroup2]) that associates each round with an individual, who may be a member of several of a large number of overlapping groups $\mathcal{G}$. It asks for diminishing regret on all subsequences identified by individuals from some group; i.e., simultaneously for all groups, we should do as well on a group as the best expert defined on that group in isolation.

2. Fast Polytope Blackwell Approachability: We give a variant of Blackwell's Approachability Theorem [blackwell] for the case in which the convex body to be approached is a polytope. Standard approachability algorithms approach the body in Euclidean distance, and have a convergence rate that is polynomial in the ambient dimension of the Blackwell game. In contrast, we give a dimension-independent approachability guarantee: we approximately satisfy all halfspace constraints defining the polytope after a number of rounds that is only logarithmic in the number of such constraints. This can be a significant improvement over a polynomial dependence on the dimension in many settings.

3. Multicalibration and Multivalidity: We can similarly derive state-of-the-art bounds and algorithms for notions of multivalidity as defined in [multivalid], including mean multicalibration, mean-conditioned moment multicalibration [momentmulti], and prediction interval multivalidity. Mean multicalibration asks for calibrated predictions not just overall, but simultaneously on each subsequence defined by membership in a large and overlapping collection of groups $\mathcal{G}$. We recover optimal convergence bounds depending only logarithmically on $|\mathcal{G}|$. Similarly, our techniques can be used to achieve bounds for moment prediction and prediction intervals, guaranteeing valid coverage over each of the groups simultaneously.

### 1.1 Additional Related Work

Two papers by Azar et al. [azar2014sequential] and Kesselheim and Singla [kesselheim2020online] study a related problem from a very different perspective. They study an online linear optimization problem with vector-valued outcomes, and just as in our work, their goal is to minimize the maximum coordinate of the accumulated loss vector (they also consider other $\ell_p$-norms). However, they study an incomparable benchmark, which in our notation would be written as $\min_{x \in \mathcal{X}} \max_{j \in [d]} \sum_{t=1}^T \ell_j(x, y^t)$ (well-defined in their setting, as they consider the loss functions and action sets to be fixed throughout the interaction). On the one hand, this benchmark is stronger than ours in the sense that the maximum over coordinates is taken outside of the sum over time, whereas our benchmark considers a "greedy" per-round maximum. On the other hand, since in our setting the game can be different at every round, our benchmark allows a comparison to a different action at each round rather than a single fixed action. In the setting of [kesselheim2020online], it is impossible to give any regret bound with respect to their benchmark, so they derive an algorithm obtaining a competitive ratio to this benchmark (i.e. a multiplicative rather than additive approximation). In contrast, our benchmark admits a regret bound. The consequence is that our results are quite different in kind despite the outward similarity of the settings: none of our applications follow from the theorems of [azar2014sequential, kesselheim2020online] (since in all of our applications, we derive regret bounds).

Our underlying technique is derived from a game-theoretic line of argument that originates from the calibration literature: specifically an argument of Hart (originally communicated in [fostervohra], and recently explicated in [Hart20]) and of Fudenberg and Levine [FL99]. This argument was extended in Gupta et al. [multivalid] to obtain fast rates and explicit algorithms in the context of multicalibration and multivalidity; in this paper we distill the argument to its core to obtain our general framework.

There is a substantial body of work related to each of our application areas. Algorithms obtaining diminishing "external regret" (i.e. regret to the best fixed action) date back to Hannan [hannan]. Foster and Vohra [fostervohra] introduced the notion of "internal regret", which corresponds to asymptotic performance that is competitive with the best sequence of actions that arises from applying an arbitrary strategy modification rule (i.e. a function that can map actions to arbitrary replacement actions) to the empirical choices of the algorithm; this notion of regret is closely connected to correlated equilibrium [fostervohra, HM00]. This notion of regret was then substantially generalized [widerangelehrer, phiregret]. Lehrer defines a very general notion of regret ("wide-range regret") that asks for diminishing regret to a set of subsequences of rounds defined by "time selection functions", on which arbitrary strategy modification rules can be applied. Blum and Mansour [internalregret] give explicit rates and algorithms for obtaining diminishing wide-range regret. Subsequence regret (as we define it in this paper) can be viewed as a different parametrization of wide-range regret; up to a polynomial change in the parameters, the two notions can be reduced to one another (see Appendix B for details).

Work on online calibrated prediction dates back to Dawid [dawid1982well]. Foster and Vohra [fostervohra] were the first to show that it is possible to obtain asymptotic calibration against an adversary. Lehrer and Sandroni et al. [lehrer2001any, sandroni2003calibration] generalized this result and showed that it was possible to extend these ideas in order to satisfy calibration not just overall, but on arbitrary computable subsequences of rounds. These later results were nonconstructive and did not derive explicit rates. In the algorithmic fairness literature, Hébert-Johnson et al. defined the notion of multicalibration and derived algorithms and explicit sample complexity bounds in the batch setting [hebert2018multicalibration]. Jung et al. [momentmulti] extended this notion from means to variances and other higher moments. Gupta et al. [multivalid] gave explicit online algorithms with optimal rates for mean and moment multicalibration, as well as a new notion of prediction interval multivalidity which they defined.

Blackwell originally proved his approachability theorem in [blackwell]. It has been known since [blackwellnoregret] that Blackwell approachability can be used to derive no-regret learning algorithms. Foster showed that calibrated forecasters could be derived from Blackwell approachability [Fos99]. Abernethy, Bartlett, and Hazan [abernethy2011blackwell] showed conversely how Blackwell approachability could be derived from no-regret learning algorithms. The standard Blackwell approachability theorem proves approachability in the Euclidean metric, and hence necessarily inherits a dependence on the ambient dimension in its convergence rate. The result is a polynomial rather than logarithmic dependence on the number of experts when used to derive no-regret learners. Chzhen, Giraud, and Stoltz [chzhen2021unified] use the standard Blackwell approachability theorem to study online learning under various fairness constraints like multicalibration and other multigroup notions of fairness [gerrymander], and similarly inherit a polynomial dependence on the number of groups rather than the optimal logarithmic dependence that our version of the approachability theorem yields. Perchet [perchet2015exponential] shows that the negative orthant is approachable in the $\ell_\infty$ metric with only a logarithmic dependence on the dimension in the convergence rate. This is equivalent to polytope Blackwell approachability as we define it. He uses this to derive several results about no-regret learning and calibration, including the optimal rate for internal regret (although not the algorithm).

A line of work initiated by Rakhlin, Sridharan, and Tewari [rakhlin1, rakhlin2] takes a very general minimax approach towards deriving bounds in online learning, including regret minimization, calibration, and approachability. Their approach is substantially more powerful than the framework we introduce here (e.g. it can be used to derive bounds for infinite-dimensional problems, and characterizes online learnability in the sense that it can also be used to prove lower bounds). However, it is also correspondingly more complex, and requires analyzing the continuation value of a $T$-round dynamic program, in contrast to the greedy one-round analysis needed in our framework. The result is that our framework is inherently constructive, in that the algorithm derives from solving a one-round stage game, which can always be done in time polynomial in the number of actions of the learner and adversary, whereas generically the results of [rakhlin1, rakhlin2] are nonconstructive (although in certain cases their framework can also be used to derive algorithms [rakhlin3]). Relative to this literature, we view our framework as a "user-friendly" power tool that can be used to derive a wide variety of algorithms and bounds without much additional work, at the cost of not being universally expressive.

## 2 General Framework and Extensions

We begin by defining our general setting in Section 2.1. We then introduce our generic algorithmic framework, along with our proof techniques, in Section 2.2. We close this section by discussing, in Section 2.3, some extensions of this framework (to randomized learners and learners who only solve the optimization problem defined in our generic algorithm approximately) that will be useful in Section 3, when we derive the applications of our general framework.

### 2.1 The Setting

Consider a learner (she) playing against an adversary (he) over discrete rounds $t = 1, \ldots, T$. Over these rounds, the learner accumulates a $d$-dimensional vector of losses, where $d$ is a positive integer. We assume that each round's loss vector lies in $[-C, C]^d$ for some constant $C > 0$.

At each round , the interaction between the learner and the adversary proceeds as follows:

1. At the beginning of each round $t$, the adversary selects an environment consisting of the following, and reveals it to the learner:

1. The learner's convex compact action set $\mathcal{X}^t$ and the adversary's convex compact action set $\mathcal{Y}^t$, where each of $\mathcal{X}^t, \mathcal{Y}^t$ is embedded into a finite-dimensional Euclidean space;

2. A continuous vector-valued loss function $\ell^t : \mathcal{X}^t \times \mathcal{Y}^t \to [-C, C]^d$. Every dimension $\ell^t_j$ (where $j \in [d]$) of the loss function must be convex in the first argument and concave in the second argument.

2. The learner selects some $x^t \in \mathcal{X}^t$.

3. The adversary observes the learner's selection $x^t$, and chooses some action $y^t \in \mathcal{Y}^t$ in response.

4. The learner suffers (and observes) the loss vector $\ell^t(x^t, y^t)$.

The learner’s objective is to minimize the value of the maximum dimension of the accumulated loss vector after rounds—in other words, to minimize:

$$\max_{j \in [d]} \sum_{t=1}^T \ell^t_j(x^t, y^t).$$
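As a concrete illustration, the objective is simply the largest coordinate of the summed loss vectors. A minimal sketch, using a hypothetical transcript with $T = 3$ rounds and $d = 2$ loss coordinates:

```python
import numpy as np

# Hypothetical transcript: row t holds the realized loss vector
# ell^t(x^t, y^t) for a d = 2 dimensional problem over T = 3 rounds.
losses = np.array([
    [0.5, -0.2],
    [-0.1, 0.4],
    [0.3, 0.3],
])

cumulative = losses.sum(axis=0)  # sum over rounds t, one entry per coordinate j
objective = cumulative.max()     # max over j in [d]: the learner's objective
print(objective)                 # max(0.7, 0.5) = 0.7
```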

We now define the benchmark against which we will compare the learner's performance. At any round $t$ (which fixes an environment), the following quantity will be key:

###### Definition 1 (The Adversary-Moves-First Value at Round t).

The adversary-moves-first value of the game defined by the environment at round $t$ is:

$$w^t_A := \sup_{y^t \in \mathcal{Y}^t} \min_{x^t \in \mathcal{X}^t} \max_{j \in [d]} \ell^t_j(x^t, y^t).$$

Observe that $w^t_A$ is the smallest value of the maximum coordinate of $\ell^t(x^t, y^t)$ that the learner could guarantee if the adversary were forced to reveal his strategy first and the learner were allowed to best respond. However, since the function $\max_{j \in [d]} \ell^t_j$ is not convex-concave (because the maximum does not preserve concavity), the minimax theorem does not hold; hence this value is unobtainable by the learner in each stage game, since the learner is the player who is obligated to reveal her strategy first.

However, we can define regret to a benchmark defined by the cumulative adversary-moves-first values of the stage games:

###### Definition 2 (Adversary-Moves-First (AMF) Regret).

Fixing a transcript $\pi^t$, we can define the learner's Adversary-Moves-First (AMF) regret for dimension $j \in [d]$ at time $t$ to be:

$$R^t_j(\pi^t) := \sum_{s=1}^t \ell^s_j(x^s, y^s) - \sum_{s=1}^t w^s_A.$$

The overall AMF regret is then defined to be:

$$R^t(\pi^t) = \max_{j \in [d]} R^t_j(\pi^t).$$

We will generally elide the dependence on the transcript and simply write $R^t_j$ and $R^t$ for notational economy.

If we were playing a convex-concave stage game at every round, the minimax theorem would imply that by playing the minimax optimal strategy at every round, we could guarantee $R^T \leq 0$. Although we are not, our goal will be to design algorithms that guarantee that, in the worst case over adaptive adversaries, the AMF regret grows sublinearly with $T$: $R^T = o(T)$.

### 2.2 General Algorithm

Our algorithmic framework will be based on a natural idea: instead of directly grappling with the maximum coordinate of the cumulative vector valued loss, we upper bound the AMF regret with a one-dimensional “soft-max” surrogate loss function, which the algorithm will then aim to minimize.

###### Definition 3 (Surrogate loss).

Fixing a parameter $\eta > 0$, and for any round $t \in [T]$, our surrogate loss function (which implicitly depends on the transcript through round $t$) is defined as

$$L^t := \sum_{j \in [d]} \exp(\eta R^t_j),$$

where $\eta$ is a small parameter to be chosen later. Additionally, it is natural to define $L^0 := d$.³ With the understanding that $R^0_j = 0$ for all $j \in [d]$.

We begin by showing that the surrogate loss gives rise to an upper bound on the AMF regret $R^T$.

###### Lemma 1.

The learner’s AMF Regret is upper bounded relative to the surrogate loss as follows:

$$R^T \leq \frac{\ln L^T}{\eta}.$$
###### Proof.

We may write:

$$\exp\left(\eta \max_{j \in [d]} R^T_j\right) = \exp\left(\max_{j \in [d]} \eta R^T_j\right) = \max_{j \in [d]} \exp(\eta R^T_j) \leq \sum_{j \in [d]} \exp(\eta R^T_j) = L^T.$$

Thus, $\exp(\eta R^T) \leq L^T$, and taking logarithms and dividing by $\eta$ gives the desired result. ∎
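Lemma 1 is the standard soft-max bound. A quick numerical sanity check of the inequality $\max_j R^T_j \leq \ln(L^T)/\eta$, using hypothetical per-coordinate regret values:

```python
import numpy as np

eta = 0.5
R = np.array([1.0, -2.0, 0.3])   # hypothetical per-coordinate AMF regrets R^T_j
L_T = np.exp(eta * R).sum()      # surrogate loss L^T = sum_j exp(eta * R^T_j)
bound = np.log(L_T) / eta        # Lemma 1's upper bound
assert R.max() <= bound          # max_j R^T_j <= ln(L^T) / eta
# The slack is at most ln(d)/eta, which is where the ln d in Theorem 1 arises.
assert bound <= R.max() + np.log(len(R)) / eta
```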

Next we observe a simple but important bound on the per-round increase in the surrogate loss.

###### Lemma 2.

For any $\eta \in \left(0, \frac{1}{2C}\right]$, any transcript through round $t-1$, and any $(x^t, y^t) \in \mathcal{X}^t \times \mathcal{Y}^t$, it holds that:

$$L^t \leq (4\eta^2 C^2 + 1)\, L^{t-1} + \eta \sum_{j \in [d]} \exp(\eta R^{t-1}_j) \cdot \left(\ell^t_j(x^t, y^t) - w^t_A\right).$$
###### Proof.

By definition of the surrogate loss, we have:

$$\begin{aligned}
L^t - L^{t-1} &= \sum_{j \in [d]} \exp(\eta R^t_j) - \sum_{j \in [d]} \exp(\eta R^{t-1}_j) \\
&= \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j + \eta\left(\ell^t_j(x^t, y^t) - w^t_A\right)\right) - \sum_{j \in [d]} \exp(\eta R^{t-1}_j) \\
&= \sum_{j \in [d]} \exp(\eta R^{t-1}_j)\left(\exp\left(\eta\left(\ell^t_j(x^t, y^t) - w^t_A\right)\right) - 1\right) \\
&\leq \sum_{j \in [d]} \exp(\eta R^{t-1}_j)\left(\eta\left(\ell^t_j(x^t, y^t) - w^t_A\right) + \eta^2\left(\ell^t_j(x^t, y^t) - w^t_A\right)^2\right) \\
&\leq \eta \sum_{j \in [d]} \exp(\eta R^{t-1}_j)\left(\ell^t_j(x^t, y^t) - w^t_A\right) + \eta^2 (2C)^2 L^{t-1},
\end{aligned}$$

where the first inequality uses the fact that $\exp(x) - 1 \leq x + x^2$ for $|x| \leq 1$ (which applies since $\eta \cdot 2C \leq 1$), and the second uses $|\ell^t_j(x^t, y^t) - w^t_A| \leq 2C$. ∎
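The key elementary inequality in this proof, $e^x - 1 \leq x + x^2$ for $|x| \leq 1$, can be checked numerically; it is also the reason for the restriction $\eta \cdot 2C \leq 1$:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 10001)
# exp(x) - 1 <= x + x^2 holds on [-1, 1]; the argument x = eta * (ell - w)
# ranges over [-2C*eta, 2C*eta], hence the requirement eta * 2C <= 1.
assert np.all(np.exp(xs) - 1.0 <= xs + xs**2 + 1e-12)
# The inequality genuinely fails outside this range, e.g. at x = 2:
assert np.exp(2.0) - 1.0 > 2.0 + 4.0
```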

A direct consequence of Lemma 2 is the existence of an algorithm for the learner that guarantees the following particularly nice telescoping bound on the surrogate loss. The proof proceeds by defining a convex-concave zero-sum game that reflects our per-round bound on the increase in the surrogate loss, and considering the algorithm that plays the minimax equilibrium of that game at every round.

###### Lemma 3.

For any $\eta \in \left(0, \frac{1}{2C}\right]$, the learner can ensure that the final surrogate loss is bounded as:

$$L^T \leq d\,(4\eta^2 C^2 + 1)^T.$$
###### Proof.

We begin by recalling that $L^0 = d$. Thus, the desired bound on $L^T$ follows via Lemma 2 and a telescoping argument, if only we can show that for every $t$ the learner has an action $x^t \in \mathcal{X}^t$ which guarantees that for any $y^t \in \mathcal{Y}^t$,

$$\eta \sum_{j \in [d]} \exp(\eta R^{t-1}_j)\left(\ell^t_j(x^t, y^t) - w^t_A\right) \leq 0.$$

To this end, we define a zero-sum game between the learner and the adversary, with action space $\mathcal{X}^t$ for the learner and $\mathcal{Y}^t$ for the adversary, and with the objective function (which the adversary wants to maximize and the learner wants to minimize):

$$u^t(x, y) := \sum_{j \in [d]} \exp(\eta R^{t-1}_j)\left(\ell^t_j(x, y) - w^t_A\right), \quad \text{for all } x \in \mathcal{X}^t,\ y \in \mathcal{Y}^t.$$

Recall from the definition of our framework that $\mathcal{X}^t, \mathcal{Y}^t$ are convex, compact and finite-dimensional, and that each $\ell^t_j$ is continuous, convex in the first argument, and concave in the second argument. Since $u^t$ is defined as an affine function (with nonnegative coefficients) of the individual coordinate functions $\ell^t_j$, $u^t$ is also convex-concave and continuous. This means that we may invoke Sion's Minimax Theorem:

###### Fact 1 (Sion’s Minimax Theorem).

Given finite-dimensional convex compact sets $\mathcal{X}, \mathcal{Y}$, and a continuous function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which is convex in the first argument and concave in the second argument, it holds that

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y).$$
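For intuition, here is a small numerical check of this identity in the simplest special case our framework uses: a bilinear game over probability simplices (a hypothetical 2x2 payoff matrix, with both values estimated by grid search over mixed strategies):

```python
import numpy as np

# Matching-pennies-style payoff to the maximizer; f(x, y) = x^T A y is
# bilinear, hence convex-concave, so min max should equal max min.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

ps = np.linspace(0.0, 1.0, 2001)
mixed = np.stack([ps, 1.0 - ps], axis=1)   # grid of mixed strategies

# Minimizer first: for each x, the adversary best-responds with a pure y.
v_minmax = (mixed @ A).max(axis=1).min()
# Maximizer first: for each y, the learner best-responds with a pure x.
v_maxmin = (mixed @ A.T).min(axis=1).max()

assert abs(v_minmax - v_maxmin) < 1e-9     # both equal the game value, 0
```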

Using Sion’s Theorem to switch the order of play (so that the adversary is compelled to move first), and then recalling the definition of (the value of the maximum coordinate value of that the learner can obtain when the adversary is compelled to move first), we obtain:444Note that in the third step, turns into . This is because after each is replaced with , the maximum over generally becomes unachievable (recall Footnote 1).

$$\begin{aligned}
\min_{x^t \in \mathcal{X}^t} \max_{y^t \in \mathcal{Y}^t} u^t(x^t, y^t) &= \max_{y^t \in \mathcal{Y}^t} \min_{x^t \in \mathcal{X}^t} u^t(x^t, y^t) \\
&= \max_{y^t \in \mathcal{Y}^t} \min_{x^t \in \mathcal{X}^t} \sum_{j' \in [d]} \exp(\eta R^{t-1}_{j'}) \cdot \left(\ell^t_{j'}(x^t, y^t) - w^t_A\right) \\
&\leq \sup_{y^t \in \mathcal{Y}^t} \min_{x^t \in \mathcal{X}^t} \sum_{j' \in [d]} \exp(\eta R^{t-1}_{j'}) \cdot \max_{j \in [d]} \left(\ell^t_j(x^t, y^t) - w^t_A\right) \\
&= 0.
\end{aligned}$$

Thus, the learner can ensure that $u^t(x^t, y^t) \leq 0$ by playing, at every round $t$,

$$x^t \in \operatorname*{argmin}_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} u^t(x, y).$$

This concludes the proof. ∎

Now we present our algorithm, which is implicit in the proof of Lemma 3, in pseudocode form. We observe that the learner's optimal action at each round, derived in the proof, can be expressed without any reference to the quantities $w^s_A$:

$$x^t \in \operatorname*{argmin}_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} \sum_{j \in [d]} \frac{\exp(\eta R^{t-1}_j)}{\sum_{j' \in [d]} \exp(\eta R^{t-1}_{j'})}\, \ell^t_j(x, y).$$

The weights placed on the loss coordinates $\ell^t_j$ in the final expression form a probability distribution which should remind the reader of the well-known Exponential Weights distribution. Observe that in our case, this expression sits inside a minimax optimization problem. However, in Section 3.1.1 we will show that this algorithm indeed reduces to the familiar Exponential Weights algorithm when our framework is instantiated to minimize external regret in the classic expert learning setting.
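As a preview of that reduction, here is a minimal sketch (not the paper's pseudocode) of the Exponential Weights learner that arises in the external regret setting, with the $O(\sqrt{T \ln d})$ regret bound checked empirically on random losses in $[0,1]$:

```python
import numpy as np

def exponential_weights(loss_matrix, eta):
    """Exponential Weights over d experts: at each round, play the
    distribution proportional to exp(-eta * cumulative loss so far)."""
    T, d = loss_matrix.shape
    cum = np.zeros(d)        # cumulative loss of each expert
    alg_loss = 0.0           # learner's expected cumulative loss
    for t in range(T):
        w = np.exp(-eta * cum)
        p = w / w.sum()
        alg_loss += p @ loss_matrix[t]
        cum += loss_matrix[t]
    return alg_loss, cum

rng = np.random.default_rng(0)
T, d = 200, 4
losses = rng.random((T, d))              # adversarial in general; random here
eta = np.sqrt(np.log(d) / T)
alg_loss, cum = exponential_weights(losses, eta)
regret = alg_loss - cum.min()            # regret to the best fixed expert
assert regret <= 2.0 * np.sqrt(T * np.log(d))   # the O(sqrt(T ln d)) bound
```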

Finally, we derive the guarantee of Algorithm 1.

###### Theorem 1.

Against any adversary, and given any $T \geq \ln d$, Algorithm 1 with learning rate $\eta = \sqrt{\frac{\ln d}{4TC^2}}$ obtains AMF regret bounded by:

$$R^T \leq 4C\sqrt{T \ln d}.$$
###### Proof.

By Lemma 3, the surrogate loss is bounded as $L^T \leq d\,(4\eta^2 C^2 + 1)^T$, and hence via Lemma 1, and using the fact that $1 + x \leq e^x$, we obtain that

$$R^T = \max_{j \in [d]} R^T_j \leq \frac{\ln\left(d\,(4\eta^2 C^2 + 1)^T\right)}{\eta} \leq \frac{\ln\left(d \exp(4T\eta^2 C^2)\right)}{\eta} = \frac{\ln d}{\eta} + 4TC^2\eta.$$

Setting $\eta = \sqrt{\frac{\ln d}{4TC^2}}$ (note that $\eta \leq \frac{1}{2C}$ precisely when $T \geq \ln d$) leads to

$$R^T \leq 4C\sqrt{T \ln d}.$$ ∎
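Concretely, the two terms $\ln d/\eta$ and $4TC^2\eta$ balance exactly at this choice of $\eta$. A quick check of the arithmetic with hypothetical parameter values:

```python
import numpy as np

def theorem1_bound(T, d, C):
    # Learning rate from Theorem 1; valid (eta <= 1/(2C)) whenever T >= ln d.
    eta = np.sqrt(np.log(d) / (4.0 * T * C**2))
    bound = np.log(d) / eta + 4.0 * T * C**2 * eta   # ln(d)/eta + 4TC^2*eta
    return eta, bound

T, d, C = 10_000, 32, 1.0
eta, bound = theorem1_bound(T, d, C)
assert eta <= 1.0 / (2.0 * C)                        # requires T >= ln d
assert np.isclose(bound, 4.0 * C * np.sqrt(T * np.log(d)))
```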

### 2.3 Extensions

Before presenting applications of our framework, we pause to discuss two natural extensions that are called for in some of our applications. Both extensions only require very minimal changes to the notation in Section 2.1 and to the general algorithmic framework in Section 2.2.

We begin by discussing, in Section 2.3.1, how to adapt our framework to the setting where the learner is allowed to randomize at each round amongst a finite set of actions, and wishes to obtain probabilistic guarantees for her AMF regret with respect to her randomness. This will be useful in all three of our applications.

We then proceed to show, in Section 2.3.2, that our AMF regret bounds are robust to the case in which at each round, the learner, who is playing according to the general Algorithm 1 given above, computes and plays according to an approximate (rather than exact) minimax strategy. This is useful for settings where it may be desirable (for computational or other reasons) to implement our algorithmic framework approximately, rather than exactly. In particular, in one of our applications — mean multicalibration, which is discussed in Section 3.3 — we will illustrate this point by deriving a multicalibration algorithm that has the learner play only extremely (computationally and structurally) simple strategies, at the cost of adding an arbitrarily small term to the multicalibration bounds, compared to the learner that plays the exact minimax equilibrium.

#### 2.3.1 Performance Bounds for a Probabilistic Learner

So far, we have described the interaction between the learner and the adversary as deterministic. In many applications, however, the convex action space for the learner is the simplex over some finite set of base actions, representing probability distributions over actions. In this case, the adversary chooses his action in response to the probability distribution over base actions chosen by the learner, at which point the learner samples a single base action from her chosen distribution.

We will use the following notation. The learner's pure action set at time $t$ is denoted $\mathcal{A}^t$. Before each round $t$, the adversary reveals a vector-valued loss function $\ell^t : \mathcal{A}^t \times \mathcal{Y}^t \to [-C, C]^d$. At the beginning of round $t$, the learner chooses a probabilistic mixture over her action set $\mathcal{A}^t$, which we will usually denote as $x^t \in \Delta \mathcal{A}^t$; after the adversary has made his move, the learner samples her pure action $a^t \sim x^t$ for the round, which is recorded into the transcript of the interaction.

The redefined vector-valued losses now take as their first argument a pure action $a \in \mathcal{A}^t$. We extend this notation to mixtures by setting $\ell^t_j(x, y) := \mathbb{E}_{a \sim x}\left[\ell^t_j(a, y)\right]$ for any $x \in \Delta \mathcal{A}^t$. In this notation, holding the second argument fixed, the loss function is linear (hence convex and continuous) in its first argument and has a convex, compact domain (the simplex $\Delta \mathcal{A}^t$). Using this extended notation, it is now easy to see how to define the probabilistic analog of the AMF value.

###### Definition 4 (Probabilistic AMF Value).

The probabilistic AMF value of the game at round $t$ is:

$$w^t_A := \sup_{y \in \mathcal{Y}^t}\, \min_{x \in \Delta \mathcal{A}^t}\, \max_{j \in [d]}\, \mathbb{E}_{a \sim x}\left[\ell^t_j(a, y)\right].$$

For a more detailed discussion of the probabilistic setting, please refer to Appendix A.

##### Adapting the algorithm to the probabilistic learner setting

Above, Algorithm 1 was given for the deterministic case of our framework. In the probabilistic setting, when computing the probability distribution for the current round, the learner should take into account the realized losses from the past rounds. We present the modified algorithm below.

##### Probabilistic performance guarantees

Algorithm 2 provides two crucial blackbox guarantees to the probabilistic learner. First, the guarantees on Algorithm 1 from Theorem 1 almost immediately translate into a bound on the expected AMF regret of the learner who uses Algorithm 2, over the randomness in her actions. Second, a high-probability AMF regret bound, also over the learner’s randomness, can be derived in a straightforward way.

###### Theorem 2 (In-Expectation Bound).

Given any $T \geq \ln d$, Algorithm 2 with learning rate $\eta = \sqrt{\frac{\ln d}{4TC^2}}$ guarantees that ex-ante, with respect to the randomness in the learner's realized outcomes, the expected AMF regret is bounded as:

$$\mathbb{E}\left[R^T\right] \leq 4C\sqrt{T \ln d}.$$
###### Proof Sketch.

Using Jensen’s inequality to switch expectations and exponentials, it is easy to modify the proof of Lemma 1 to obtain the following in-expectation bound:

$$\mathbb{E}\left[R^T\right] \leq \frac{\ln \mathbb{E}\left[L^T\right]}{\eta}.$$

The rest of the proof is similar to the proofs of Lemma 2 and Lemma 3. ∎

###### Theorem 3 (High-Probability Bound).

Fix any $\delta \in (0, 1)$. Given any $T \geq \ln d$, Algorithm 2 with learning rate $\eta = \sqrt{\frac{\ln d}{4TC^2}}$ guarantees that the AMF regret will satisfy, with ex-ante probability $1 - \delta$ over the randomness in the learner's realized outcomes,

$$R^T \leq 8C\sqrt{T \ln\left(\frac{d}{\delta}\right)}.$$
###### Proof Sketch.

The proof proceeds by constructing a martingale with bounded increments that tracks the increase in the surrogate loss , and then using Azuma’s inequality to conclude that the final surrogate loss (and hence the AMF regret) is bounded above with high probability. For a detailed proof, see Appendix A. ∎

#### 2.3.2 Performance Bounds for a Suboptimal Learner

Our general Algorithms 1 and 2 involve the learner solving a convex program at each round in order to identify her minimax optimal strategy. However, in some applications of our framework it may be necessary or desirable for the learner to restrict herself to playing approximately minimax optimal strategies instead of exactly optimal ones. This can happen for a variety of reasons:

1. Computational efficiency. While the convex program that the learner must solve at each round is polynomial-sized in the description of the environment, one may wish for a better running-time dependence, e.g. in settings in which the action space for the learner is exponential in some other relevant parameter of the problem. In such cases, we will want to trade off run-time for approximation error in the computation of the minimax equilibrium at each round.

2. Structural simplicity of strategies. One may wish to restrict the learner to only playing “simple” strategies (for example, distributions over actions with small support), or more generally, strategies belonging to a certain predefined strict subset of the learner’s strategy space. This subset may only contain approximately optimal minimax strategies.

3. Numerical precision. As the convex programs solved by the learner at each round generally have irrational coefficients (due to the exponents), using finite-precision arithmetic to solve these programs will lead to a corresponding precision error in the solution, making the computed strategy only approximately minimax optimal for the learner. This kind of approximation error can generally be driven to be arbitrarily small, but still necessitates being able to reason about approximate solutions.

Given a suboptimal instantiation of Algorithm 1 or 2, we thus want to know: how much worse will its achieved regret bound be, compared to the existential guarantee? We will now address this question for both the deterministic setting of Sections 2.1 and 2.2, and the probabilistic setting of Section 2.3.1.

Recall that at each round t, both Algorithm 1 and Algorithm 2 (with the weights defined accordingly) have the learner solve for the minimizer x^t of the function ψ^t defined as:

 ψ^t(x) := max_{y ∈ Y^t} ∑_{j∈[d]} χ^t_j ⋅ ℓ^t_j(x, y).

The range of ψ^t is contained in [−C, C], since it is a linear combination of loss coordinates ℓ^t_j(x, y) ∈ [−C, C], where the weights χ^t_j form a probability distribution over [d].
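To make this concrete, here is a minimal sketch of the two quantities just defined: the exponential weights χ^t (proportional to exp(η ⋅ cumulative loss of each coordinate)) and the objective ψ^t(x), evaluated by brute force over a finite grid of adversary actions. The function names, the grid Y, and the toy loss functions are all illustrative assumptions, not part of the paper's framework.

```python
import numpy as np

def chi_weights(cumulative_losses, eta):
    """chi^t_j proportional to exp(eta * sum_{s<t} ell^s_j): a probability
    distribution over the d loss coordinates (higher cumulative loss,
    higher weight). Shifted by the max for numerical stability."""
    z = eta * (cumulative_losses - cumulative_losses.max())
    w = np.exp(z)
    return w / w.sum()

def psi(x, Y, loss_fns, chi):
    """psi^t(x) = max_{y in Y} sum_j chi_j * ell^t_j(x, y), with the max
    taken over a finite set Y of adversary actions (an assumption made
    here purely so the max is computable by enumeration)."""
    return max(sum(c * f(x, y) for c, f in zip(chi, loss_fns)) for y in Y)
```

With zero cumulative losses the weights are uniform, and for the toy pair of opposite linear losses ℓ_1(x, y) = xy, ℓ_2(x, y) = −xy the weighted objective cancels, so ψ(x) = 0 for any x.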

Now suppose the learner ends up playing actions x^1, …, x^T which do not necessarily minimize the respective objectives ψ^1, …, ψ^T. The following definition helps capture the degree of suboptimality in the learner's play at each round.

###### Definition 5 (Achieved AMF Value Bound).

Consider any round t ∈ [T], and suppose the learner plays action x^t at round t. Then, any number

 w^t_bd ∈ [ψ^t(x^t), C]

is called an achieved AMF value bound for round t.

This definition has two aspects. Most importantly, w^t_bd upper bounds the learner's achieved objective function value ψ^t(x^t) at round t. Furthermore, we restrict w^t_bd to be at most C; otherwise it would be a meaningless bound, as the learner gets objective value at most C no matter what she plays.

We now formulate the desired bounds on the performance of a suboptimal learner. The upshot is that for a suboptimal learner, the bounds of Theorems 1, 2, and 3 hold with each AMF value w^t_A replaced with the corresponding achieved AMF value bound w^t_bd.

###### Theorem 4 (Bounds for a Suboptimal Learner).

Consider a learner who does not necessarily play optimally at all rounds, and a sequence w^1_bd, …, w^T_bd of achieved AMF value bounds.

In the deterministic setting, the learner achieves the following regret bound analogous to Theorem 1:

 max_{j∈[d]} ∑_{t=1}^T ℓ^t_j(x^t, y^t) ≤ ∑_{t=1}^T w^t_bd + 4C√(T ln d).

In the probabilistic setting, the learner achieves the following in-expectation regret bound analogous to Theorem 2:

 E[max_{j∈[d]} ∑_{t=1}^T ℓ^t_j(a^t, y^t)] ≤ ∑_{t=1}^T w^t_bd + 4C√(T ln d),

and the following high-probability bound analogous to Theorem 3:

 max_{j∈[d]} ∑_{t=1}^T ℓ^t_j(a^t, y^t) ≤ ∑_{t=1}^T w^t_bd + 8C√(T ln(d/δ)) with probability ≥ 1 − δ, for any δ ∈ (0, 1).
###### Proof Sketch.

We use the deterministic case for illustration. The main idea is to redefine the learner's regret to be relative to her achieved AMF value bounds w^t_bd rather than the AMF values w^t_A. Namely, we let (R^t_bd)_j := ∑_{s=1}^t ℓ^s_j(x^s, y^s) − ∑_{s=1}^t w^s_bd for each j ∈ [d]. The surrogate loss is defined in the same way as before, namely L^t := ∑_{j∈[d]} exp(η (R^t_bd)_j).

First, Lemma 1 still holds, with the same proof. Lemma 2 also holds after replacing each w^t_A with the corresponding w^t_bd. The proof is almost the same: where we formerly used ψ^t(x^t) ≤ w^t_A, we now use that ψ^t(x^t) ≤ w^t_bd, which holds by Definition 5.

Now, following the proofs of Lemma 3 and Theorem 1, to obtain the declared regret bound it suffices to show that at each round t, the learner's action x^t guarantees ∑_{j∈[d]} exp(η (R^{t−1}_bd)_j) ⋅ (ℓ^t_j(x^t, y^t) − w^t_bd) ≤ 0, no matter what y^t is played by the adversary. For any y^t ∈ Y^t, we can rewrite this objective as:

 ∑_{j∈[d]} exp(η (R^{t−1}_bd)_j) ⋅ (ℓ^t_j(x^t, y^t) − w^t_bd) = (∑_{i∈[d]} exp(η ∑_{s=1}^{t−1} ℓ^s_i(x^s, y^s)) / exp(η ∑_{s=1}^{t−1} w^s_bd)) ⋅ ∑_{j∈[d]} χ^t_j ⋅ (ℓ^t_j(x^t, y^t) − w^t_bd).

Since the first factor above is positive, it now follows that action x^t achieves the desired guarantee, from observing that:

 ∑_{j∈[d]} χ^t_j ⋅ (ℓ^t_j(x^t, y^t) − w^t_bd) = ∑_{j∈[d]} χ^t_j ⋅ ℓ^t_j(x^t, y^t) − w^t_bd ≤ ψ^t(x^t) − w^t_bd ≤ 0,

where the final inequality holds since the learner achieves the AMF value bound w^t_bd at round t. ∎
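The key step in the chain above is elementary: because the weights χ^t_j sum to one, subtracting the constant w^t_bd inside the weighted sum is the same as subtracting it outside. The following numeric sanity check, with made-up weights and losses chosen so that w_bd ≥ ψ(x), is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
chi = rng.random(5)
chi /= chi.sum()                     # probability weights chi^t_j
ell = rng.uniform(-1.0, 1.0, size=5) # realized losses ell^t_j(x^t, y^t)
avg = float(chi @ ell)               # chi-weighted realized loss
psi_x = avg + 0.05                   # psi^t(x^t) >= chi @ ell (max over y)
w_bd = psi_x + 0.10                  # achieved AMF value bound, w_bd >= psi(x)

# sum_j chi_j*(ell_j - w_bd) equals (sum_j chi_j*ell_j) - w_bd ...
lhs = float(chi @ (ell - w_bd))
assert abs(lhs - (avg - w_bd)) < 1e-9
# ... and is at most psi(x) - w_bd, which is at most 0
assert lhs <= psi_x - w_bd <= 0
```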

## 3 Applications

We now instantiate our framework to derive algorithms and bounds in a number of settings. In all cases, we first obtain existential bounds and then explicit algorithms. The bounds follow directly from our main Theorems 1, 2, and 3, and the algorithms are obtained by computing (exactly or approximately) minimax equilibria of the zero-sum games given in Algorithm 2 (which, as discussed above, is the appropriate specialization of Algorithm 1 to the probabilistic setting).

### 3.1 No Regret Learning Algorithms

As a warmup, we begin this subsection by carefully demonstrating how to use our framework to derive bounds and algorithms for the very fundamental external regret setting. Then, we derive the same types of existential guarantees in the much more general subsequence regret setting. We then specialize these subsequence regret bounds into tight bounds for various existing regret notions (such as internal, adaptive, sleeping experts, and multigroup regret). We conclude this subsection by deriving a general no-subsequence-regret algorithm which in turn specializes to an efficient algorithm in all of our applications.

#### 3.1.1 Simple Learning From Expert Advice: External Regret

In the classical experts learning setting [MW2], the learner has a finite set of pure actions ("experts") A. At the outset of each round t, the learner chooses a distribution π^t over the experts in A. The adversary then comes up with a vector r^t ∈ [0,1]^|A| of losses corresponding to each expert. Next, the learner samples a^t ∼ π^t, and experiences the loss corresponding to the expert she chose: r^t_{a^t}. The learner also gets to observe the entire vector of losses for that round. The goal of the learner is to achieve sublinear external regret, that is, to ensure that the difference between her cumulative loss and the loss of the best fixed expert in hindsight grows sublinearly with T:

 R^T_ext(π^T) := ∑_{t∈[T]} r^t_{a^t} − min_{j∈A} ∑_{t∈[T]} r^t_j = o(T).
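The quantity just defined is straightforward to compute for a realized transcript of play. The helper below (illustrative, not from the paper) takes the chosen expert indices and the full loss matrix and returns the external regret:

```python
import numpy as np

def external_regret(choices, losses):
    """External regret of a realized play sequence.

    choices: length-T sequence of chosen expert indices a^t.
    losses:  T x |A| array with losses[t, j] = r^t_j in [0, 1].
    Returns (cumulative loss incurred) - (loss of best fixed expert).
    """
    losses = np.asarray(losses, dtype=float)
    choices = np.asarray(choices)
    incurred = losses[np.arange(len(choices)), choices].sum()
    best_fixed = losses.sum(axis=0).min()
    return incurred - best_fixed
```

For example, always picking the worse of two experts on losses [[0,1],[1,0],[0,1]] via choices [1,0,1] incurs total loss 3, while the best fixed expert incurs 1, for regret 2.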
###### Theorem 5.

Fix a finite pure action set A for the learner and a time horizon T. Then, Algorithm 2 can be instantiated to guarantee that the learner's expected external regret is bounded as

 E_{π^T}[R^T_ext(π^T)] ≤ 4√(T ln |A|),

and furthermore that for any δ ∈ (0, 1), with ex-ante probability 1 − δ over the learner's randomness,

 R^T_ext(π^T) ≤ 8√(T ln(|A|/δ)).
###### Proof.

We instantiate our probabilistic framework (see Section 2.3.1).

###### Defining the strategy spaces.

We define the learner's pure action set at each round to be the set A, and the adversary's strategy space to be the convex and compact set [0,1]^|A|, from which the adversary chooses each round's collection of all actions' losses.

###### Defining the loss functions.

For each round t, we define an |A|-dimensional vector valued loss function ℓ^t, where for every action j ∈ A, the corresponding coordinate ℓ^t_j is given by

 ℓ^t_j(a, r^t) = r^t_a − r^t_j, for a ∈ A, r^t ∈ [0,1]^|A|.

It is easy to see that ℓ^t_j is continuous and concave (in fact, linear) in the second argument, for all t and j. Furthermore, its range is [−C, C] for C = 1. This verifies the technical conditions imposed by our framework on the loss functions.

###### Applying AMF regret bounds.

We may now invoke Theorem 2, which implies the following in-expectation AMF regret bound after round T for the instantiation of Algorithm 2 with the just defined vector losses ℓ^t (here d = |A| and C = 1):

 E[max_{j∈A} ∑_{t∈[T]} ℓ^t_j(a^t, r^t) − ∑_{t∈[T]} w^t_A] ≤ 4C√(T ln d) = 4√(T ln |A|),

where recall that w^t_A is the Adversary-Moves-First (AMF) value at round t. Connecting the instantiated AMF regret to the learner's external regret, we get:

 E[R^T_ext] = E[max_{j∈A} ∑_{t∈[T]} (r^t_{a^t} − r^t_j)] = E[max_{j∈A} ∑_{t∈[T]} ℓ^t_j(a^t, r^t)] ≤ 4√(T ln |A|) + ∑_{t∈[T]} w^t_A.
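The strategy that falls out of this instantiation places exponentially decaying weight on experts by their cumulative loss, in the style of multiplicative weights. The simulation below is a sketch under assumptions of our own (iid uniform losses, a standard learning rate of √(ln|A|/T), and a fixed seed), not the paper's exact algorithm; it checks empirically that the realized regret falls within the 4√(T ln|A|) bound of Theorem 5:

```python
import numpy as np

def exponential_weights(loss_matrix, eta, seed=1):
    """Play exponential weights over experts: at each round, sample an
    expert with probability proportional to exp(-eta * cumulative loss).

    loss_matrix: T x |A| array of losses r^t_j in [0, 1].
    Returns the length-T array of chosen expert indices.
    """
    T, A = loss_matrix.shape
    cum = np.zeros(A)
    rng = np.random.default_rng(seed)
    choices = []
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # stabilized weights
        choices.append(rng.choice(A, p=w / w.sum()))
        cum += loss_matrix[t]
    return np.array(choices)

T, A = 2000, 10
rng = np.random.default_rng(2)
R = rng.random((T, A))                         # assumed iid uniform losses
choices = exponential_weights(R, eta=np.sqrt(np.log(A) / T))
incurred = R[np.arange(T), choices].sum()
regret = incurred - R.sum(axis=0).min()
```

On this instance the realized regret is comfortably below the theoretical ceiling of 4√(T ln|A|) ≈ 271.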
###### Bounding the Adversary-Moves-First value.

To obtain the claimed in-expectation external regret bound, it suffices to show that the AMF value at each round satisfies w^t_A ≤ 0.