We introduce and study a simple but powerful framework for online adversarial multiobjective minimax optimization. At each round $t$, an adaptive adversary chooses an environment for the learner to play in, defined by a convex compact action set $\mathcal{X}^t$ for the learner, a convex compact action set $\mathcal{Y}^t$ for the adversary, and a $d$-dimensional continuous loss function $\ell^t : \mathcal{X}^t \times \mathcal{Y}^t \to [-C, C]^d$ that, in each coordinate, is convex in the learner’s action and concave in the adversary’s action. The learner then chooses an action or distribution over actions $x^t$, and as a function of the learner’s choice, the adversary chooses an action $y^t$. This results in a loss vector $\ell^t(x^t, y^t)$, which accumulates over time. The goal of the learner is to minimize the maximum accumulated loss over each of the $d$ dimensions: $\max_{j \in [d]} \sum_{t=1}^{T} \ell^t_j(x^t, y^t)$.
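When described this way, it is natural to view the environment chosen at each round $t$ as defining a zero sum game between the learner and the adversary in which the learner wishes to minimize the maximum coordinate of the resulting loss vector. The objective of the learner in the stage game in isolation can be written as:
$$w^t_L = \inf_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} \max_{j \in [d]} \ell^t_j(x, y).$$
Footnote 1: A brief aside about the “inf max max” structure of $w^t_L$: since each $\ell^t_j$ is continuous, so is $\max_{j \in [d]} \ell^t_j$, and hence $\max_{y \in \mathcal{Y}^t} \max_{j \in [d]} \ell^t_j(x, y)$ is attained on the compact set $\mathcal{Y}^t$; but as $\max_{y \in \mathcal{Y}^t} \max_{j \in [d]} \ell^t_j(x, y)$ is no longer a continuous function of $x$, the infimum over $x$ need not be attained.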
Unfortunately, although $\ell^t$ is convex-concave in each coordinate, the maximum over coordinates does not preserve concavity for the adversary. Thus the minimax theorem does not hold, and the value of the game in which the learner must move first (defined above) can be larger than the value of the game in which the adversary is forced to move first; that is, $w^t_L \ge w^t_A$, where $w^t_A$ is defined as:
$$w^t_A = \sup_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} \max_{j \in [d]} \ell^t_j(x, y).$$
Footnote 2: The reason for taking the supremum instead of maximum over $y$ is the same as explained in Footnote 1 for $w^t_L$.
Nevertheless, fixing a series of environments chosen by the adversary, this defines in hindsight an aspirational quantity , summing the adversary-moves-first value of the constituent zero sum games. Despite the fact that these values are not individually obtainable in the stage games, we show that they are approachable on average over a sequence of rounds in the following sense: there is an algorithm for the learner that guarantees that against any adversary
Our derivation is elementary and based on a minimax argument. The generic algorithm plays actions at every round $t$ according to a minimax equilibrium strategy in a surrogate game that is derived both from the environment chosen by the adversary at round $t$, as well as from the history of play so far on previous rounds $s < t$. The loss in the surrogate game is convex-concave (and so we may apply minimax arguments), and can be used to upper bound the loss in the original games.
We then show that this simple framework can be instantiated to derive a wide array of optimal bounds, and that the corresponding algorithms can be derived in closed form by solving for the minimax equilibrium of the corresponding surrogate game. Our applications fall into three categories:
Expert Learning: We can derive optimal regret bounds and algorithms for a wide variety of learning-with-experts settings. In these settings, there is a finite set of experts who each incur an adversarially selected loss in at each round. The learner must select an expert at each round before the losses are revealed, and incurs the loss of her chosen expert. We can recover algorithms and bounds in a large variety of settings—a non-exhaustive list includes:
External Regret: In the standard setting of regret to the best fixed expert out of , our framework recovers the multiplicative weights algorithm [MW1, MW2] and the corresponding regret bound. This bound is optimal and hence witnesses the optimality of our main theorem.
Internal Regret and Swap Regret: Internal and swap regret bound the Learner’s regret conditioned on the action that they play. Minimizing these notions of regret in a multiplayer game corresponds to convergence to the set of correlated equilibria; see [fostervohra, HM00]. Our method derives an algorithm of [internalregret] from first principles. This explicates the fixed point calculation in the algorithm of [internalregret].
Adaptive Regret, studied by [MW2, adaptiveregret2, adaptiveregretvovk], asks for diminishing regret not just over the entire sequence of rounds, but also over each interval for . This represents regret to the best expert in a setting in which the best expert may be defined as changing over time.
Sleeping Experts: In the sleeping experts problem [sleeping, sleeping2, internalregret, kleinberg2010sleepingexperts], only an adversarially chosen subset of experts is available to the learner in each round. Blum and Mansour [internalregret] define the goal of obtaining diminishing regret to each expert on the subsequence of rounds on which that expert is available.
Multi-group Regret: Multi-group regret is a fairness-motivated notion (studied under a different name in [multigroup1] and in the batch setting in [multigroup2]) that associates each round with an individual, who may be a member of a subset of a large number of overlapping groups . It asks for diminishing regret on all subsequences identified by individuals from some group — i.e. simultaneously for all groups, we should do as well on a group as the best expert defined on that group in isolation.
Fast Polytope Blackwell Approachability: We give a variant of Blackwell’s Approachability Theorem [blackwell] when the convex body to be approached is a polytope. Standard approachability algorithms approach the body in Euclidean distance, and have a convergence rate that is polynomial in the ambient dimension of the Blackwell game. In contrast, we give a dimension-independent approachability guarantee: we approximately satisfy all halfspace constraints defining the polytope, after logarithmically many rounds in the number of such constraints. This can be a significant improvement over a polynomial dependence on the dimension in many settings.
Multicalibration and Multivalidity: We can similarly derive state of the art bounds and algorithms for notions of multivalidity as defined in [multivalid], including mean multicalibration, mean-conditioned moment multicalibration [momentmulti], and prediction interval multivalidity. Mean multicalibration asks for calibrated predictions not just overall, but simultaneously on each subsequence defined by membership in a large and overlapping set of groups . We recover optimal convergence bounds depending only logarithmically on . Similarly, our techniques can be used to achieve bounds for moment prediction and prediction intervals, guaranteeing valid coverage over each of the groups simultaneously.
1.1 Additional Related Work
Two papers by Azar et al. [azar2014sequential] and Kesselheim and Singla [kesselheim2020online] study a related problem from a very different perspective. They study an online linear optimization problem with vector valued outcomes, and just as in our work, their goal is to minimize the maximum coordinate in the accumulated loss vector (they also consider other $\ell_p$-norms). However, they study an incomparable benchmark that in our notation would be written as $\min_{x^* \in \mathcal{X}} \max_{j \in [d]} \sum_{t=1}^{T} \ell_j(x^*, y^t)$ (which is well-defined in their setting as they consider the loss functions and action sets to be fixed throughout the interaction). On the one hand, this benchmark is stronger than ours in the sense that the maximum over coordinates is taken outside of the sum over time, whereas our benchmark considers a “greedy” per-round maximum. On the other hand, since in our setting the game can be different at every round, our benchmark allows a comparison to a different action at each round rather than a single fixed action. In the setting of [kesselheim2020online], it is impossible to give any regret bound to their benchmark, so they derive an algorithm obtaining a competitive ratio to this benchmark (i.e. a multiplicative rather than additive approximation). In contrast, our benchmark admits a regret bound. The consequence is that our results are quite different in kind despite the outward similarity of the settings: none of our applications follow from the theorems of [azar2014sequential, kesselheim2020online] (since in all of our applications, we derive regret bounds).
Our underlying technique is derived from a game-theoretic line of argument that originates from the calibration literature: specifically an argument of Hart (originally communicated in [fostervohra], and recently explicated in [Hart20]) and of Fudenberg and Levine [FL99]. This argument was extended in Gupta et al. [multivalid] to obtain fast rates and explicit algorithms in the context of multicalibration and multivalidity; in this paper we distill the argument to its core to obtain our general framework.
There is a substantial body of work related to each of our application areas. Algorithms obtaining diminishing “external regret” (i.e. regret to the best fixed action in a set ) date back to Hannan [hannan]. Foster and Vohra [fostervohra] introduced the notion of “internal regret”, which corresponds to asymptotic performance that is competitive with the best sequence of actions that arises from applying an arbitrary strategy modification rule (i.e. a function that can map actions to arbitrary replacement actions) to the empirical choices of the algorithm; this notion of regret is closely connected to correlated equilibrium [fostervohra, HM00]. This notion of regret was then substantially generalized [widerangelehrer, phiregret]. Lehrer defines a very general notion of regret (“wide-range regret”) that asks for diminishing regret to a set of subsequences of rounds defined by “time selection functions” on which arbitrary strategy modification rules can be applied. Blum and Mansour [internalregret] give explicit rates and algorithms for obtaining diminishing wide-range regret. Subsequence regret (as we define it in this paper) can be viewed as a different parametrization of wide-range regret; up to a polynomial change in the parameters, the two notions can be reduced to one another (see Appendix B for details).
Work on online calibrated prediction dates back to Dawid [dawid1982well]. Foster and Vohra [fostervohra] were the first to show that it is possible to obtain asymptotic calibration against an adversary. Lehrer and Sandroni et al. [lehrer2001any, sandroni2003calibration] generalized this result and showed that it was possible to extend these ideas in order to satisfy calibration not just overall, but on arbitrary computable subsequences of rounds. These later results were nonconstructive and did not derive explicit rates. In the algorithmic fairness literature, Hébert-Johnson et al. defined the notion of multicalibration and derived algorithms and explicit sample complexity bounds in the batch setting [hebert2018multicalibration]. Jung et al. [momentmulti] extended this notion from means to variances and other higher moments. Gupta et al. [multivalid] gave explicit online algorithms with optimal rates for mean and moment multicalibration, as well as a new notion of prediction interval multivalidity which they defined.
Blackwell originally proved his approachability theorem in [blackwell]. It has been known since [blackwellnoregret] that Blackwell approachability can be used to derive no-regret learning algorithms. Foster showed that calibrated forecasters could be derived from Blackwell approachability [Fos99]. Abernethy, Bartlett, and Hazan [abernethy2011blackwell] showed conversely how Blackwell approachability could be derived from no-regret learning algorithms. The standard Blackwell approachability theorem proves approachability in the $\ell_2$ metric, and hence necessarily inherits a dependence on the ambient dimension in its convergence rate. The result is a polynomial rather than logarithmic dependence on the number of experts when used to derive no-regret learners. Chzhen, Giraud, and Stoltz [chzhen2021unified] use (the standard) Blackwell approachability theorem to study online learning under various fairness constraints like multicalibration and other multigroup notions of fairness [gerrymander], and similarly inherit a polynomial dependence on the number of groups rather than the optimal logarithmic dependence that our version of the approachability theorem yields. Perchet [perchet2015exponential] shows that the negative orthant is approachable in the $\ell_\infty$ metric with a $\log(d)$ dependence in the convergence rate. This is equivalent to polytope Blackwell approachability as we define it. He uses this to derive several results about no-regret learning and calibration, including the optimal rate for internal regret (although not the algorithm).
A line of work initiated by Rakhlin, Sridharan, and Tewari [rakhlin1, rakhlin2] takes a very general minimax approach towards deriving bounds in online learning, including regret minimization, calibration, and approachability. Their approach is substantially more powerful than the framework we introduce here (e.g. it can be used to derive bounds for infinite dimensional problems, and characterizes online learnability in the sense that it can also be used to prove lower bounds). However, it is also correspondingly more complex, and requires analyzing the continuation value of a $T$-round dynamic program, in contrast to the greedy 1-round analysis needed in our framework. The result is that our framework is inherently constructive, in that the algorithm derives from solving a one-round stage game, which can always be done in time polynomial in the number of actions of the learner and adversary, whereas generically results from [rakhlin1, rakhlin2] are nonconstructive (although in certain cases their framework can also be used to derive algorithms [rakhlin3]). Relative to this literature, we view our framework as a “user-friendly” power tool that can be used to derive a wide variety of algorithms and bounds without much additional work, at the cost of not being universally expressive.
2 General Framework and Extensions
We begin by defining our general setting in Section 2.1. We then introduce our generic algorithmic framework, along with our proof techniques, in Section 2.2. We close this section by discussing, in Section 2.3, some extensions of this framework (to randomized learners and learners who only solve the optimization problem defined in our generic algorithm approximately) that will be useful in Section 3, when we derive the applications of our general framework.
2.1 The Setting
Consider a learner (she) playing against an adversary (he) over discrete rounds $t \in [T] := \{1, \ldots, T\}$. Over these rounds, the learner accumulates a $d$-dimensional vector of losses, where $d$ is a positive integer. We assume that each round’s loss vector lies in $[-C, C]^d$ for some constant $C > 0$.
At each round $t \in [T]$, the interaction between the learner and the adversary proceeds as follows:
At the beginning of each round $t$, the adversary selects an environment consisting of the following, and reveals it to the learner:
The learner’s convex compact action set $\mathcal{X}^t$ and the adversary’s convex compact action set $\mathcal{Y}^t$, where each of $\mathcal{X}^t, \mathcal{Y}^t$ is embedded into a finite-dimensional Euclidean space;
A continuous vector valued loss function $\ell^t : \mathcal{X}^t \times \mathcal{Y}^t \to [-C, C]^d$. Every dimension $\ell^t_j : \mathcal{X}^t \times \mathcal{Y}^t \to [-C, C]$ (where $j \in [d]$) of the loss function must be convex in the first argument and concave in the second argument.
The learner selects some $x^t \in \mathcal{X}^t$.
The adversary observes the learner’s selection $x^t$, and chooses some action $y^t \in \mathcal{Y}^t$ in response.
The learner suffers (and observes) the loss vector $\ell^t(x^t, y^t)$.
The learner’s objective is to minimize the value of the maximum dimension of the accumulated loss vector after $T$ rounds; in other words, to minimize:
$$\max_{j \in [d]} \sum_{t=1}^{T} \ell^t_j(x^t, y^t).$$
We now define the benchmark with which we will compare the learner’s performance. At any round $t$ (which fixes an environment), the following quantity will be key:
Definition 1 (The Adversary-Moves-First Value at Round $t$).
The adversary-moves-first value of the game defined by the environment at round $t$ is:
$$w^t_A := \sup_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} \max_{j \in [d]} \ell^t_j(x, y).$$
Observe that $w^t_A$ is the smallest value of the maximum coordinate of $\ell^t$ that the learner could guarantee if the adversary were forced to reveal his strategy first and the learner were allowed to best respond. However, since the function $\max_{j \in [d]} \ell^t_j(x, y)$ is not convex-concave (because the $\max$ does not preserve concavity), the minimax theorem does not hold, and hence this value is unobtainable by the learner at each stage game, since the learner is the player who is obligated to reveal her strategy first.
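To make the gap between the two orders of play concrete, here is a minimal numerical sketch. The toy stage game is an assumption chosen for illustration (it does not come from the paper): both players act on $[0, 1]$, and the two loss coordinates are $y - x$ and $x - y$, so the maximum coordinate equals $|x - y|$. Both values are approximated by brute force on a grid.

```python
# Toy stage game (hypothetical, for illustration only):
# X = Y = [0, 1], d = 2, l_1(x, y) = y - x, l_2(x, y) = x - y.
# Each coordinate is linear, hence convex in x and concave in y.

def loss(x, y):
    return (y - x, x - y)

def max_coord(x, y):
    # Maximum coordinate of the loss vector; equals |x - y| here.
    return max(loss(x, y))

GRID = [i / 100 for i in range(101)]  # discretisation of [0, 1]

def learner_moves_first_value():
    # inf over x of (max over y of the maximum coordinate)
    return min(max(max_coord(x, y) for y in GRID) for x in GRID)

def adversary_moves_first_value():
    # sup over y of (min over x of the maximum coordinate)
    return max(min(max_coord(x, y) for x in GRID) for y in GRID)

w_L = learner_moves_first_value()
w_A = adversary_moves_first_value()
print(w_L, w_A)
```

In this toy game the learner who moves first can do no better than $x = 1/2$, whereas a learner who moves second can always match the adversary's action exactly, so the two values differ by $1/2$.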
However, we can define regret to a benchmark defined by the cumulative adversary-moves-first values of the stage games:
Definition 2 (Adversary-Moves-First (AMF) Regret).
Fixing a transcript $\pi^t$ of play through round $t$, we can define the learner’s Adversary-Moves-First (AMF) regret for the $j$-th dimension at time $t$ to be:
$$R^t_j(\pi^t) := \sum_{s=1}^{t} \ell^s_j(x^s, y^s) - \sum_{s=1}^{t} w^s_A.$$
The overall AMF regret is then defined to be:
$$R^t(\pi^t) := \max_{j \in [d]} R^t_j(\pi^t).$$
We will generally elide the dependence on the transcript and simply write $R^t_j$ and $R^t$ for notational economy.
If we were playing a convex-concave stage game at every round, the minimax theorem would imply that by playing the minimax optimal strategy at every round, we could guarantee $R^T \le 0$. Although we are not, our goal will be to design algorithms that can guarantee that in the worst case over adaptive adversaries, the AMF regret grows sublinearly with $T$: $R^T = o(T)$.
2.2 General Algorithm
Our algorithmic framework will be based on a natural idea: instead of directly grappling with the maximum coordinate of the cumulative vector valued loss, we upper bound the AMF regret with a one-dimensional “soft-max” surrogate loss function, which the algorithm will then aim to minimize.
Definition 3 (Surrogate loss).
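Fixing a parameter $\eta > 0$, and for any round $t \in [T]$, our surrogate loss function $L^t$ (which implicitly depends on the transcript through round $t$) is defined as
$$L^t := \sum_{j \in [d]} \exp\left(\eta R^t_j\right),$$
where $\eta$ is a small parameter to be chosen later. Additionally, it is natural to define $L^0 := d$.
Footnote 3: With the understanding that $R^0_j = 0$ for all $j \in [d]$.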
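We begin by showing that the surrogate loss gives rise to an upper bound on the AMF regret $R^T$.
Lemma 1. The learner’s AMF regret is upper bounded relative to the surrogate loss as follows:
$$R^T \le \frac{\ln L^T}{\eta}.$$
Proof. We may write:
$$\exp\left(\eta R^T\right) = \exp\left(\eta \max_{j \in [d]} R^T_j\right) = \max_{j \in [d]} \exp\left(\eta R^T_j\right) \le \sum_{j \in [d]} \exp\left(\eta R^T_j\right) = L^T.$$
Thus, $\exp(\eta R^T) \le L^T$, and taking logs and dividing by $\eta$ gives the desired result. ∎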
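The soft-max argument above can be checked numerically. The sketch below assumes the surrogate loss is the sum of exponentiated per-coordinate regrets, as the proof suggests; the regret vector and the value of `eta` are made-up inputs. It also checks the standard companion fact that the soft-max overshoots the true maximum by at most $\ln(d)/\eta$.

```python
import math

def surrogate_loss(regrets, eta):
    # Sum of exponentiated per-coordinate regrets (assumed form of the surrogate loss).
    return sum(math.exp(eta * r) for r in regrets)

def softmax_upper_bound(regrets, eta):
    # ln(L) / eta, the quantity that upper bounds the maximum regret.
    return math.log(surrogate_loss(regrets, eta)) / eta

regrets = [3.0, -1.5, 0.25, 2.0]  # hypothetical per-coordinate regrets
eta = 0.1
bound = softmax_upper_bound(regrets, eta)
print(max(regrets), bound)
```

A smaller `eta` loosens the bound by up to `ln(d) / eta`, which is the trade-off the learning rate is later tuned against.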
Next we observe a simple but important bound on the per-round increase in the surrogate loss.
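Lemma 2. For any $\eta \in (0, \frac{1}{2C})$, any transcript $\pi^t$ through round $t$, and any $t \in [T]$, it holds that:
$$L^t \le \left(4\eta^2 C^2 + 1\right) L^{t-1} + \eta \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x^t, y^t) - w^t_A\right).$$
Proof. By definition of the surrogate loss, we have:
$$L^t = \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \exp\left(\eta \left(\ell^t_j(x^t, y^t) - w^t_A\right)\right).$$
Using the fact that $e^x \le 1 + x + x^2$ for $|x| \le 1$, we have, for $\eta \in (0, \frac{1}{2C})$,
$$L^t \le \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(1 + \eta \left(\ell^t_j(x^t, y^t) - w^t_A\right) + \eta^2 \left(\ell^t_j(x^t, y^t) - w^t_A\right)^2\right) \le \left(4\eta^2 C^2 + 1\right) L^{t-1} + \eta \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x^t, y^t) - w^t_A\right),$$
where the last step uses $|\ell^t_j(x^t, y^t) - w^t_A| \le 2C$. ∎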
A direct consequence of Lemma 2 is the existence of an algorithm for the learner that guarantees the following particularly nice telescoping bound on the surrogate loss. The proof proceeds by defining a convex-concave zero-sum game that reflects our per-round bound on the increase in the surrogate loss, and considering the algorithm that plays the minimax equilibrium of that game at every round.
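Lemma 3. For any $\eta \in (0, \frac{1}{2C})$, the learner can ensure that the final surrogate loss is bounded as:
$$L^T \le d \left(4\eta^2 C^2 + 1\right)^T.$$
Proof. We begin by recalling that $L^0 = d$. Thus, the desired bound on $L^T$ follows via Lemma 2 and a telescoping argument, if only we can show that for every $t \in [T]$ the learner has an action $x^t \in \mathcal{X}^t$ which guarantees that for any $y^t \in \mathcal{Y}^t$,
$$\sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x^t, y^t) - w^t_A\right) \le 0.$$
To this end, we define a zero-sum game between the learner and the adversary, with action space $\mathcal{X}^t$ for the learner and $\mathcal{Y}^t$ for the adversary, and with the objective function (which the adversary wants to maximize and the learner wants to minimize):
$$u^t(x, y) := \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x, y) - w^t_A\right).$$
Recall from the definition of our framework that $\mathcal{X}^t, \mathcal{Y}^t$ are convex, compact and finite-dimensional, as well as that each $\ell^t_j$ is continuous, convex in the first argument, and concave in the second argument. Since $u^t$ is defined as an affine function of the individual coordinate functions $\ell^t_j$ with nonnegative coefficients, $u^t$ is also convex-concave and continuous. This means that we may invoke Sion’s Minimax Theorem:
Fact 1 (Sion’s Minimax Theorem).
Given finite-dimensional convex compact sets $\mathcal{X}, \mathcal{Y}$, and a continuous function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which is convex in the first argument and concave in the second argument, it holds that
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y).$$
Using Sion’s Theorem to switch the order of play (so that the adversary is compelled to move first), and then recalling the definition of $w^t_A$ (the value of the maximum coordinate of $\ell^t$ that the learner can obtain when the adversary is compelled to move first), we obtain:
$$\min_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} u^t(x, y) = \max_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} u^t(x, y) = \max_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x, y) - w^t_A\right) \le \sup_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\max_{j' \in [d]} \ell^t_{j'}(x, y) - w^t_A\right) = \left(\sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right)\right) \left(\sup_{y \in \mathcal{Y}^t} \min_{x \in \mathcal{X}^t} \max_{j' \in [d]} \ell^t_{j'}(x, y) - w^t_A\right) = 0.$$
Footnote 4: Note that in the third step, $\max_{y}$ turns into $\sup_{y}$. This is because after each $\ell^t_j$ is replaced with $\max_{j'} \ell^t_{j'}$, the maximum over $y$ generally becomes unachievable (recall Footnote 1).
Thus, the learner can ensure that $u^t(x^t, y^t) \le 0$ by playing at every round $t$:
$$x^t \in \arg\min_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} u^t(x, y).$$
This concludes the proof. ∎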
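Now we present our Algorithm, which is implicit in the proof of Lemma 3, in pseudocode form. We observe that the learner’s optimal action at each round, derived in the proof, can be expressed without any reference to the quantities $w^t_A$:
$$x^t \in \arg\min_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} \sum_{j \in [d]} \exp\left(\eta R^{t-1}_j\right) \left(\ell^t_j(x, y) - w^t_A\right) = \arg\min_{x \in \mathcal{X}^t} \max_{y \in \mathcal{Y}^t} \sum_{j \in [d]} \frac{\exp\left(\eta \sum_{s=1}^{t-1} \ell^s_j(x^s, y^s)\right)}{\sum_{i \in [d]} \exp\left(\eta \sum_{s=1}^{t-1} \ell^s_i(x^s, y^s)\right)} \, \ell^t_j(x, y).$$
The weights placed on the loss coordinates $\ell^t_j(x, y)$ in the final expression form a probability distribution which should remind the reader of the well known Exponential Weights distribution. Observe that in our case, this expression is inside a minimax optimization problem. However, in Section 3.1.1, we will show that this algorithm indeed reduces to the familiar Exponential Weights algorithm when our framework is instantiated to minimize external regret in the classic expert learning setting.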
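The per-round computation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's pseudocode: the environment is the same fixed toy game at every round (coordinates $y - x$ and $x - y$ on $[0, 1]$, so $C = 1$ and the adversary-moves-first value is $0$), the convex program is replaced by a brute-force grid search, and the adversary is a hypothetical best-responder that maximizes the largest loss coordinate. The names `weights`, `learner_action`, `adversary_best_response` and `run` are illustrative.

```python
import math

ETA = 0.05                          # learning rate (illustrative choice)
GRID = [i / 20 for i in range(21)]  # discretised X = Y = [0, 1]

def loss(x, y):
    # Hypothetical 2-coordinate loss with values in [-1, 1].
    return (y - x, x - y)

def weights(cum_loss, eta):
    # Exponential weights over coordinates, computed stably by
    # subtracting the largest cumulative loss before exponentiating.
    m = max(cum_loss)
    e = [math.exp(eta * (c - m)) for c in cum_loss]
    z = sum(e)
    return [v / z for v in e]

def learner_action(cum_loss, eta):
    # Grid-search stand-in for the minimax problem:
    # pick x minimising the worst-case weighted loss over y.
    w = weights(cum_loss, eta)
    def worst_case(x):
        return max(sum(wj * lj for wj, lj in zip(w, loss(x, y))) for y in GRID)
    return min(GRID, key=worst_case)

def adversary_best_response(x):
    # Hypothetical adversary: maximise the largest loss coordinate.
    return max(GRID, key=lambda y: max(loss(x, y)))

def run(T, eta=ETA):
    cum = [0.0, 0.0]                # cumulative loss per coordinate
    for _ in range(T):
        x = learner_action(cum, eta)
        y = adversary_best_response(x)
        cum = [c + l for c, l in zip(cum, loss(x, y))]
    return cum

print(max(run(51)))
```

Because the adversary-moves-first value of this toy game is $0$, the largest cumulative coordinate returned by `run` is also the AMF regret. Against this particular adversary the learner alternates between the two endpoints and the regret never exceeds a single round's loss.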
Finally, we derive the guarantee of Algorithm 1.
Theorem 1. Against any adversary, and given any $T \ge \ln d$, Algorithm 1 with learning rate $\eta = \sqrt{\frac{\ln d}{4 T C^2}}$ obtains AMF regret bounded by: $R^T \le 4C\sqrt{T \ln d}$.
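The way the learning rate trades off the two error terms can be worked out explicitly. The following derivation is a sketch under two assumptions about the preceding lemmas: that the soft-max bound reads $R^T \le \ln(L^T)/\eta$, and that the telescoped surrogate loss satisfies $L^T \le d\,(1 + 4\eta^2 C^2)^T$. It uses only the elementary inequality $\ln(1 + z) \le z$.

```latex
\begin{align*}
R^T &\le \frac{\ln L^T}{\eta}
     \le \frac{\ln d + T \ln\!\left(1 + 4\eta^2 C^2\right)}{\eta}
     \le \frac{\ln d}{\eta} + 4\eta T C^2, \\
\eta &= \sqrt{\frac{\ln d}{4 T C^2}}
     \;\Longrightarrow\;
     R^T \le 2C\sqrt{T \ln d} + 2C\sqrt{T \ln d} = 4C\sqrt{T \ln d}.
\end{align*}
```

Under these assumptions, the requirement that $\eta$ stay below $\frac{1}{2C}$ is what forces the horizon $T$ to be at least of order $\ln d$.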
Before presenting applications of our framework, we pause to discuss two natural extensions that are called for in some of our applications. Both extensions only require very minimal changes to the notation in Section 2.1 and to the general algorithmic framework in Section 2.2.
We begin by discussing, in Section 2.3.1, how to adapt our framework to the setting where the learner is allowed to randomize at each round amongst a finite set of actions, and wishes to obtain probabilistic guarantees for her AMF regret with respect to her randomness. This will be useful in all three of our applications.
We then proceed to show, in Section 2.3.2, that our AMF regret bounds are robust to the case in which at each round, the learner, who is playing according to the general Algorithm 1 given above, computes and plays according to an approximate (rather than exact) minimax strategy. This is useful for settings where it may be desirable (for computational or other reasons) to implement our algorithmic framework approximately, rather than exactly. In particular, in one of our applications — mean multicalibration, which is discussed in Section 3.3 — we will illustrate this point by deriving a multicalibration algorithm that has the learner play only extremely (computationally and structurally) simple strategies, at the cost of adding an arbitrarily small term to the multicalibration bounds, compared to the learner that plays the exact minimax equilibrium.
2.3.1 Performance Bounds for a Probabilistic Learner
So far, we have described the interaction between the learner and the adversary as deterministic. In many applications, however, the convex action space for the learner is the simplex over some finite set of base actions, representing probability distributions over actions. In this case, the adversary chooses his action in response to the probability distribution over base actions chosen by the learner, at which point the learner samples a single base action from her chosen distribution.
We will use the following notation. The learner’s pure action set at time $t$ is denoted by $\mathcal{A}^t$. Before each round $t$, the adversary reveals a vector valued loss function $\ell^t : \mathcal{A}^t \times \mathcal{Y}^t \to [-C, C]^d$. At the beginning of round $t$, the learner chooses a probabilistic mixture over her action set $\mathcal{A}^t$, which we will usually denote as $x^t \in \Delta\mathcal{A}^t$; after the adversary has made his move, the learner samples her pure action $a^t \sim x^t$ for the round, which is recorded into the transcript of the interaction.
The redefined vector valued losses $\ell^t$ now take as their first argument a pure action $a \in \mathcal{A}^t$. We extend this to mixtures $x \in \Delta\mathcal{A}^t$ as $\ell^t(x, y) := \mathbb{E}_{a \sim x}\left[\ell^t(a, y)\right]$ for any $y \in \mathcal{Y}^t$. In this notation, holding the second argument fixed, the loss function is linear (hence convex and continuous) and has a convex, compact domain (the simplex $\Delta\mathcal{A}^t$). Using this extended notation, it is now easy to see how to define the probabilistic analog of the AMF value.
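The extension from pure actions to mixtures is just an expectation, which the following sketch makes explicit. The two-action loss function `pure_loss` and the helper `expected_loss` are hypothetical names introduced for illustration; the final lines check that the extended loss is linear in the mixture, which is the property the text relies on.

```python
def expected_loss(mixture, pure_loss, y):
    """Extend a loss on pure actions to a mixture by taking expectations.

    mixture:   dict mapping each pure action to its probability
    pure_loss: function (action, y) -> tuple of d loss coordinates
    """
    d = len(pure_loss(next(iter(mixture)), y))
    out = [0.0] * d
    for action, prob in mixture.items():
        for j, lj in enumerate(pure_loss(action, y)):
            out[j] += prob * lj
    return out

def pure_loss(action, y):
    # Hypothetical 2-coordinate loss for pure actions "a" and "b".
    base = {"a": (1.0, -1.0), "b": (-0.5, 0.5)}[action]
    return tuple(y * v for v in base)

p = {"a": 1.0, "b": 0.0}
q = {"a": 0.0, "b": 1.0}
lam = 0.25
mix = {k: lam * p[k] + (1 - lam) * q[k] for k in p}

lhs = expected_loss(mix, pure_loss, 0.8)
rhs = [lam * u + (1 - lam) * v
       for u, v in zip(expected_loss(p, pure_loss, 0.8),
                       expected_loss(q, pure_loss, 0.8))]
print(lhs, rhs)
```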
Definition 4 (Probabilistic AMF Value).
For a more detailed discussion of the probabilistic setting, please refer to Appendix A.
Adapting the algorithm to the probabilistic learner setting
Above, Algorithm 1 was given for the deterministic case of our framework. In the probabilistic setting, when computing the probability distribution for the current round, the learner should take into account the realized losses from the past rounds. We present the modified algorithm below.
Probabilistic performance guarantees
Algorithm 2 provides two crucial blackbox guarantees to the probabilistic learner. First, the guarantees on Algorithm 1 from Theorem 1 almost immediately translate into a bound on the expected AMF regret of the learner who uses Algorithm 2, over the randomness in her actions. Second, a high-probability AMF regret bound, also over the learner’s randomness, can be derived in a straightforward way.
Theorem 2 (In-Expectation Bound).
Given , Algorithm 2 with learning rate guarantees that ex-ante, with respect to the randomness in the learner’s realized outcomes, the expected AMF regret is bounded as:
Theorem 3 (High-Probability Bound).
Fix any . Given , Algorithm 2 with learning rate guarantees that the AMF regret will satisfy, with ex-ante probability over the randomness in the learner’s realized outcomes,
The proof proceeds by constructing a martingale with bounded increments that tracks the increase in the surrogate loss , and then using Azuma’s inequality to conclude that the final surrogate loss (and hence the AMF regret) is bounded above with high probability. For a detailed proof, see Appendix A. ∎
2.3.2 Performance Bounds for a Suboptimal Learner
Our general Algorithms 1 and 2 involve the learner solving a convex program at each round in order to identify her minimax optimal strategy. However, in some applications of our framework it may be necessary or desirable for the learner to restrict herself to playing approximately minimax optimal strategies instead of exactly optimal ones. This can happen for a variety of reasons:
Computational efficiency. While the convex program that the Learner must solve at each round is polynomial-sized in the description of the environment, one may wish for a better running time dependence — e.g. in settings in which the action space for the learner is exponential in some other relevant parameter of the problem. In such cases, we will want to trade off run-time for approximation error in the computation of the minimax equilibrium at each round.
Structural simplicity of strategies. One may wish to restrict the learner to only playing “simple” strategies (for example, distributions over actions with small support), or more generally, strategies belonging to a certain predefined strict subset of the learner’s strategy space. This subset may only contain approximately optimal minimax strategies.
Numerical precision. As the convex programs solved by the learner at each round generally have irrational coefficients (due to the exponents), using finite-precision arithmetic to solve these programs will lead to a corresponding precision error in the solution, making the computed strategy only approximately minimax optimal for the learner. This kind of approximation error can generally be driven to be arbitrarily small, but still necessitates being able to reason about approximate solutions.
Given a suboptimal instantiation of Algorithm 1 or 2, we thus want to know: how much worse will its achieved regret bound be, compared to the existential guarantee? We will now address this question for both the deterministic setting of Sections 2.1 and 2.2, and the probabilistic setting of Section 2.3.1.
The range of is as indicated, since it is a linear combination of loss coordinates , where the weights form a probability distribution over .
Now suppose the learner ends up playing actions which do not necessarily minimize the respective objectives . The following definition helps capture the degree of suboptimality in the learner’s play at each round.
Definition 5 (Achieved AMF Value Bound).
Consider any round , and suppose the learner plays action at round . Then, any number
is called an achieved AMF value bound for round .
This definition has two aspects. Most importantly, upper bounds the learner’s achieved objective function value at round . Furthermore, we restrict to be — otherwise it would be a meaningless bound as the learner gets objective value no matter what she plays.
We now formulate the desired bounds on the performance of a suboptimal learner. The upshot is that for a suboptimal learner, the bounds of Theorems 1, 2, 3 hold with each replaced with the corresponding achieved AMF bound .
Theorem 4 (Bounds for a Suboptimal Learner).
Consider a learner who does not necessarily play optimally at all rounds, and a sequence of achieved AMF value bounds.
In the deterministic setting, the learner achieves the following regret bound analogous to Theorem 1:
We use the deterministic case for illustration. The main idea is to redefine the learner’s regret to be relative to her achieved AMF value bounds rather than the AMF values . Namely, we let , where The surrogate loss is defined in the same way as before, namely .
Now, following the proofs of Lemma 3 and Theorem 1, to obtain the declared regret bound it suffices to show for that the learner’s action guarantees , no matter what is played by the adversary. For any , we can rewrite this objective as:
It now follows that action achieves , from observing that:
where the final inequality holds since the learner achieves AMF value bound at round . ∎
We now instantiate our framework to derive algorithms and bounds in a number of settings. In all cases, we first obtain existential bounds and then explicit algorithms. The bounds follow directly from our main Theorems 1, 2, and 3, and the algorithms are obtained by computing (exactly or approximately) minimax equilibria of the zero-sum games given in Algorithm 2 (which, as discussed above, is the appropriate specialization of Algorithm 1 to the probabilistic setting).
3.1 No Regret Learning Algorithms
As a warmup, we begin this subsection by carefully demonstrating how to use our framework to derive bounds and algorithms for the very fundamental external regret setting. Then, we derive the same types of existential guarantees in the much more general subsequence regret setting. We then specialize these subsequence regret bounds into tight bounds for various existing regret notions (such as internal, adaptive, sleeping experts, and multigroup regret). We conclude this subsection by deriving a general no-subsequence-regret algorithm which in turn specializes to an efficient algorithm in all of our applications.
3.1.1 Simple Learning From Expert Advice: External Regret
In the classical experts learning setting [MW2], the learner has a set of pure actions (“experts”) . At the outset of each round , the learner chooses a distribution over experts . The adversary then comes up with a vector of losses corresponding to each expert. Next, the learner samples , and experiences loss corresponding to the expert she chose: . The learner also gets to observe the entire vector of losses for that round. The goal of the learner is to achieve sublinear external regret — that is, to ensure that the difference between her cumulative loss and the loss of the best fixed expert in hindsight grows sublinearly with :
Fix a finite pure action set for the learner and a time horizon . Then, Algorithm 2 can be instantiated to guarantee that the learner’s expected external regret is bounded as
and furthermore that for any , with ex-ante probability over the learner’s randomness,
We instantiate our probabilistic framework (see Section 2.3.1).
Defining the strategy spaces.
We define the learner’s pure action set at each round to be the set , and the adversary’s strategy space to be the convex and compact set , from which the adversary chooses each round’s collection of all actions’ losses.
Defining the loss functions.
For , we define a -dimensional vector valued loss function , where for every action , the corresponding coordinate is given by
It is easy to see that is continuous and concave — in fact, linear — in the second argument for all and . Furthermore, its range is , for . This verifies the technical conditions imposed by our framework on the loss functions.
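For intuition about where this instantiation is heading, here is a minimal simulation of the classical exponential weights learner on a made-up loss sequence. It assumes (as an illustration, not as the paper's exact construction) that the coordinate for expert $j$ is the learner's loss minus expert $j$'s loss, so that the weight on coordinate $j$ is proportional to the exponential of minus $\eta$ times expert $j$'s cumulative loss. The losses are drawn at random in $[0, 1]$ from a seeded generator rather than by an adaptive adversary, and regret is measured with the learner's expected per-round loss.

```python
import math
import random

def exponential_weights(cum_losses, eta):
    # Distribution over experts proportional to exp(-eta * cumulative loss),
    # shifted by the minimum for numerical stability.
    m = min(cum_losses)
    e = [math.exp(-eta * (c - m)) for c in cum_losses]
    z = sum(e)
    return [v / z for v in e]

def run_experts(loss_rounds, eta):
    # Returns expected external regret against the best fixed expert.
    k = len(loss_rounds[0])
    cum = [0.0] * k
    learner_total = 0.0
    for r in loss_rounds:
        x = exponential_weights(cum, eta)
        learner_total += sum(p * l for p, l in zip(x, r))
        cum = [c + l for c, l in zip(cum, r)]
    return learner_total - min(cum)

rng = random.Random(0)              # seeded, oblivious loss sequence (assumption)
T, k = 2000, 5
rounds = [[rng.random() for _ in range(k)] for _ in range(T)]
eta = math.sqrt(math.log(k) / (4 * T))
regret = run_experts(rounds, eta)
print(regret, 4 * math.sqrt(T * math.log(k)))
```

With this learning rate, the standard exponential weights analysis guarantees that the printed regret stays below the printed reference value $4\sqrt{T \ln k}$ for any loss sequence in $[0, 1]$.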
Applying AMF regret bounds.
where recall that is the Adversary-Moves-First (AMF) value at round . Connecting the instantiated AMF regret to the learner’s external regret, we get:
Bounding the Adversary-Moves-First value.
To obtain the claimed in-expectation external regret bound, it suffices to show that the AMF value at each round satisfies