1. Introduction
Regularization is a fundamental and incisive method in optimization, and its present zeitgeist marks its entry into machine learning: through the introduction of a new component in the objective, regularization techniques overcome ill-conditioning and overfitting, and they yield algorithms that achieve sparsity and parsimony without sacrificing efficiency [5, 8, 2].
In the context of online optimization, these features are exemplified in the family of learning algorithms known as "Follow the Regularized Leader" (FoReL) [41]. FoReL represents an important archetype of adaptive behavior for several reasons: it provides optimal min-max regret guarantees in an adversarial setting, it offers significant flexibility with respect to the geometry of the problem at hand, and it captures numerous other dynamics as special cases (Hedge, multiplicative weights, gradient descent, etc.) [15, 8, 2]. As such, given that these regret guarantees hold without any further assumptions about how payoffs/costs are determined at each stage, the dynamics of FoReL have been the object of intense scrutiny and study in algorithmic game theory.
The standard way of analyzing such no-regret dynamics in games involves a two-step approach. The first step exploits the fact that the empirical frequency of play under a no-regret algorithm converges to the game's set of coarse correlated equilibria (CCE). The second involves proving some useful property of the game's coarse correlated equilibria: for instance, leveraging robustness [33] implies that the social welfare at a CCE lies within a small constant of the optimal social welfare; as another example, the product of the marginal distributions of a CCE in a zero-sum game is a Nash equilibrium. In this way, the no-regret properties of FoReL can be turned into convergence guarantees for the players' empirical frequency of play (that is, in a time-averaged, correlated sense).
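The first step of this approach can be seen in a small numerical sketch (all numbers hypothetical; a discrete-time stand-in for the dynamics studied here): two multiplicative-weights learners playing Matching Pennies against each other produce a time-averaged strategy close to the game's unique (interior) equilibrium, even though the instantaneous strategies keep cycling.

```python
import numpy as np

# Matching Pennies: player 1's payoff matrix (player 2 receives -A).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

def mw_time_average(steps=20000, eta=0.05):
    """Run multiplicative weights in self-play; return player 1's
    time-averaged mixed strategy."""
    y1 = np.array([1.0, 0.0])  # asymmetric initial scores (uniform is already Nash)
    y2 = np.zeros(2)
    avg = np.zeros(2)
    for _ in range(steps):
        x1 = np.exp(eta * y1); x1 /= x1.sum()
        x2 = np.exp(eta * y2); x2 /= x2.sum()
        avg += x1
        y1 = y1 + A @ x2       # cumulative payoffs of player 1
        y2 = y2 - A.T @ x1     # cumulative payoffs of player 2
    return avg / steps

avg = mw_time_average()
print(avg)  # close to the unique equilibrium (0.5, 0.5)
```

The time average equilibrates; as the rest of the paper shows, the day-to-day trajectory does not.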
Recently, several papers have moved beyond this "black-box" framework and focused instead on obtaining stronger regret/convergence guarantees for systems of learning algorithms coupled together in games with a specific structure. Along these lines, Daskalakis et al. [9] and Rakhlin and Sridharan [31] developed classes of dynamics that enjoy a faster regret minimization rate in two-player zero-sum games. Syrgkanis et al. [43] further analyzed a recency-biased variant of FoReL in more general multi-player games and showed that it is possible to achieve a faster regret minimization rate; the social welfare also converges at a faster rate, a result which was extended to standard versions of FoReL dynamics in [11].
While a regret-based analysis provides significant insights into these systems, it does not answer a fundamental behavioral question:
Does the system converge to a Nash equilibrium?
Does it even stabilize?
The dichotomy between a self-stabilizing, convergent system and a system with recurrent cycles is of obvious significance, but a regret-based analysis cannot distinguish between the two. Indeed, convergent, recurrent, and even chaotic [26] systems may exhibit equally strong regret minimization properties in general games, so the question remains: what does the long-run behavior of FoReL look like, really?
This question becomes particularly interesting and important under perfect competition (such as zero-sum games and variants thereof). Especially in practice, zero-sum games can capture optimization "duels" [18]: for example, two Internet search engines competing to maximize their market share can be modeled as players in a zero-sum game with a convex strategy space. In [18] it was shown that the time average of a regret-minimizing class of dynamics converges to an approximate equilibrium of the game. Zero-sum games have also been used quite recently as a model for deep learning optimization techniques in image generation and discrimination [14, 39]. In each of the above cases, min-max strategies are typically thought of as the axiomatically correct prediction, and the fact that the time average of the marginals of a FoReL procedure converges to such states is considered further evidence of the correctness of this prediction. However, the long-run behavior of the actual sequence of play (as opposed to its time averages) seems to be trickier, and a number of natural questions arise:

Does optimization-driven learning converge under perfect competition?

Does fast regret minimization necessarily imply (fast) equilibration in this case?
Our results
We settle these questions with a resounding "no". Specifically, we show that the behavior of FoReL in zero-sum games with an interior equilibrium (e.g. Matching Pennies) is Poincaré recurrent, implying that almost every trajectory revisits any (arbitrarily small) neighborhood of its starting point infinitely often. Importantly, the observed cycling behavior is robust to the agents' choice of regularization mechanism (each agent could be using a different regularizer), and it applies to any positive affine transformation of zero-sum games (and hence to all strictly competitive games [1]), even though these transformations lead to different trajectories of play. Finally, this cycling behavior also persists in the case of networked competition, i.e. for constant-sum polymatrix games [6, 7, 10].
Given that the no-regret guarantees of FoReL require a decreasing step-size or learning rate (a standard trick is to decrease step-sizes by a constant factor after a window of "doubling" length [40]), we focus on a smooth version of FoReL described by a dynamical system in continuous time. The resulting FoReL dynamics enjoy a particularly strong regret minimization rate and they capture as special cases the replicator dynamics [45, 44, 38] and the projection dynamics [12, 36, 24], arguably the most widely studied game dynamics in biology, evolutionary game theory, and transportation science [16, 48, 35]. In this way, our analysis unifies and generalizes many prior results on the cycling behavior of evolutionary dynamics [16, 29, 28, 37] and provides a new interpretation of these results through the lens of optimization and machine learning.
From a technical point of view, our analysis touches on several issues. Our first insight is to focus not on the simplex of the players' mixed strategies, but on a dual space of payoff differences. The reason for this is that the vector of cumulative payoff differences between two strategies fully determines a player's mixed strategy under FoReL, and it is precisely these differences that ultimately drive the players' learning process. Under this transformation, FoReL exhibits a striking property, incompressibility: the flow of the dynamics is volume-preserving, so a ball of initial conditions in this dual space can never collapse to a point. That being said, the evolution of such a ball in the space of payoffs could be transient, implying in particular that the players' mixed strategies could still converge (because the choice map that links payoff differences to strategies is nonlinear). To rule out such behaviors, we show that FoReL in zero-sum games with an interior Nash equilibrium has a further important property: it admits a constant of motion. Specifically, if we fix an interior equilibrium of the game and an arbitrary point in the payoff space of each player, this constant is given by the coupling function
where the coupling is expressed via the convex conjugate of the regularizer that generates the learning process of each player (for the details, see Sections 3 and 4). Coupled with the dynamics' incompressibility, this invariance can be used to show that FoReL is recurrent: after some finite time, almost every trajectory returns arbitrarily close to its initial state.
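This constant of motion can be checked numerically in a small sketch (hypothetical values, entropic regularizers): for the induced replicator dynamics in Matching Pennies, the coupling reduces to the sum of Kullback–Leibler divergences from the interior equilibrium, and it stays essentially constant along a numerically integrated trajectory.

```python
import numpy as np

# Matching Pennies; the unique Nash equilibrium is interior: ((1/2,1/2), (1/2,1/2)).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

def rhs(z):
    """Replicator dynamics (FoReL with the entropic regularizer) for both players."""
    x, y = z[:2], z[2:]
    u1, u2 = A @ y, -A.T @ x
    return np.concatenate([x * (u1 - x @ u1), y * (u2 - y @ u2)])

def rk4_step(z, dt):
    k1 = rhs(z); k2 = rhs(z + dt/2*k1); k3 = rhs(z + dt/2*k2); k4 = rhs(z + dt*k3)
    return z + dt/6*(k1 + 2*k2 + 2*k3 + k4)

def coupling(z):
    # With entropic regularizers the coupling is sum_i KL(x_i*, x_i(t)),
    # here with x_i* = (1/2, 1/2) for both players.
    return sum(0.5 * np.log(0.5 / p) for p in z)

z = np.array([0.8, 0.2, 0.6, 0.4])
c0 = coupling(z)
for _ in range(5000):       # integrate up to t = 50
    z = rk4_step(z, 0.01)
drift = abs(coupling(z) - c0)
print(drift)  # numerically negligible: the coupling is a constant of motion
```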
On the other hand, if the game does not admit an interior equilibrium, the coupling above is no longer a constant of motion. In this case, it decreases over time until the support of the players' mixed strategies matches that of a Nash equilibrium with maximal support: as this point in time is approached, the coupling essentially becomes constant. Thus, in general zero-sum games, FoReL wanders perpetually in the smallest face of the game's strategy space containing all of the game's equilibria; indeed, the only case in which FoReL converges is when the game admits a unique Nash equilibrium in pure strategies, a fairly restrictive requirement.
2. Definitions from game theory
2.1. Games in normal form
We begin with some basic definitions from game theory. A finite game in normal form consists of a finite set of players, each with a finite set of actions (or strategies). The preferences of each player for one action over another are determined by an associated payoff function which assigns a reward to the player under the strategy profile of all players' actions (in what follows, we use the standard shorthand for the profile of a player's opponents). Putting all this together, a game in normal form will be written as a tuple with players, actions, and payoffs defined as above.
Players can also use mixed strategies, i.e. probability distributions over their action sets. The resulting probability vector is called a mixed strategy, and we write for the mixed strategy space of each player. Aggregating over players, we also write for the game's strategy space, i.e. the space of all mixed strategy profiles. In this context (and in a slight abuse of notation), the expected payoff of each player in a mixed profile is
(2.1) 
To keep track of the payoff of each pure strategy, we also write for the payoff of strategy under the profile and for the resulting payoff vector of player . We then have
(2.2) 
where denotes the ordinary pairing between and .
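As a toy numerical illustration of the expected payoff (2.1) and the payoff vector of (2.2) (all payoff numbers hypothetical):

```python
import numpy as np

# Hypothetical 2x2 payoff matrix of player 1: rows are player 1's actions,
# columns are player 2's actions.
U1 = np.array([[3.0, 0.0], [5.0, 1.0]])

x1 = np.array([0.5, 0.5])    # player 1's mixed strategy
x2 = np.array([0.25, 0.75])  # player 2's mixed strategy

v1 = U1 @ x2   # payoff vector of (2.2): one entry per pure strategy of player 1
u1 = x1 @ v1   # expected payoff of (2.1): the ordinary pairing of v1 with x1
print(v1, u1)
```

The expected payoff is thus linear in each player's own mixed strategy, which is the structure the pairing in (2.2) captures.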
The most widely used solution concept in game theory is that of a Nash equilibrium (NE), defined here as a mixed strategy profile such that
(NE) 
for every deviation of player and all . Writing for the support of , a Nash equilibrium is called pure if for some and all . At the other end of the spectrum, is said to be interior (or fully mixed) if for all . Finally, a coarse correlated equilibrium (CCE) is a distribution over the set of action profiles such that, for every player and every action , we have , where is the marginal distribution of with respect to .
2.2. Zerosum games and zerosum polymatrix games
Perhaps the most widely studied class of finite games (and certainly the first to be considered) is that of two-player zero-sum games, i.e. games in which the two players' payoffs sum to zero. The value of a two-player zero-sum game is defined via the associated saddle-point problem
(2.3) 
with equality following from von Neumann's celebrated minimax theorem [47]. As is well known, the solutions of this saddle-point problem form a closed, convex set consisting precisely of the game's Nash equilibria; moreover, the players' equilibrium payoffs are simply the game's value and its negative, respectively. As a result, Nash equilibrium is the standard game-theoretic prediction in such games.
An important question that arises here is whether the straightforward equilibrium structure of zero-sum games extends to the case of a network of competitors. Following [10, 7, 6], a pairwise zero-/constant-sum polymatrix game consists of an (undirected) interaction graph whose set of nodes represents the competing players, with two nodes connected by an edge if and only if the corresponding players compete with each other in a two-player zero-/constant-sum game.
To formalize this, we assume that a) every player has a finite set of actions (as before); and b) to each edge is associated a two-player zero-/constant-sum game with the corresponding player set, action sets, and payoff functions (in a zero-sum game, the edge payoffs sum to zero by default). Since the underlying interaction graph is assumed undirected, we also assume that the labeling of the players' payoff functions is symmetric. At the expense of concision, our analysis extends to directed graphs, but we stick with the undirected case for clarity. The space of mixed strategies of each player is again the corresponding simplex, but the player's payoff is now determined by aggregating over all games involving that player, i.e.
(2.4) 
where denotes the set of "neighbors" of the player in question. In other words, the payoff to a player is simply the sum of the payoffs in all the zero-/constant-sum games that the player plays with their neighbors.
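The aggregation in (2.4) can be sketched numerically (a hypothetical 3-player example on a path graph, with made-up edge payoff matrices): each player sums their payoffs over incident edges, and pairwise zero-sum edge games force the aggregate payoffs to sum to zero.

```python
import numpy as np

# Hypothetical pairwise zero-sum polymatrix game on the path 1 - 2 - 3.
# Player i receives x_i @ (A_ij @ x_j) from the edge game with player j,
# and A_ji = -A_ij.T makes each edge game zero-sum.
A12 = np.array([[1.0, -1.0], [-1.0, 1.0]])
A23 = np.array([[0.0, 2.0], [-2.0, 0.0]])

x1 = np.array([0.5, 0.5])
x2 = np.array([0.3, 0.7])
x3 = np.array([0.9, 0.1])

u1 = x1 @ (A12 @ x2)                          # player 1: one neighbor
u2 = x2 @ (-A12.T @ x1) + x2 @ (A23 @ x3)     # player 2: two neighbors
u3 = x3 @ (-A23.T @ x2)                       # player 3: one neighbor
total = u1 + u2 + u3
print(total)  # pairwise zero-sum edges => aggregate payoffs sum to zero
```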
In what follows, we will also consider games which are payoff-equivalent to positive affine transformations of pairwise constant-sum polymatrix games. Formally, we will allow for games such that there exists a pairwise constant-sum polymatrix game and, for each player, a positive multiplicative constant and an additive constant such that the player's payoffs in the two games coincide up to this transformation for each outcome.
3. Noregret learning via regularization
Throughout this paper, our focus will be on repeated decision making in low-information environments where the players do not know the rules of the game (perhaps not even that they are playing a game). In this case, even if the game admits a unique Nash equilibrium, it is not reasonable to assume that players are able to precompute their component of an equilibrium strategy, let alone assume that all players are fully rational, that there is common knowledge of rationality, etc.
With this in mind, we only make the bare-bones assumption that every player seeks to at least minimize their "regret", i.e. the average payoff difference between the player's mixed strategy at each time and the player's best possible strategy in hindsight. Formally, assuming that play evolves in continuous time, the regret of a player along the sequence of play is defined as
(3.1) 
and we say that player has no regret under if .
The most widely used scheme to achieve this worst-case guarantee is known as "Follow the Regularized Leader" (FoReL), an exploitation/exploration class of policies that consists of playing a mixed strategy that maximizes the player's expected cumulative payoff (the exploitation part) minus a regularization term (the exploration part). In our continuous-time framework, this is described by the learning dynamics
(FoReL)  
where the socalled choice map is defined as
(3.2) 
In the above, the regularizer is a convex penalty term which smooths out the "hard" correspondence that maximizes the player's cumulative payoff over the simplex. As a result, the "regularized leader" is biased towards the prox-center of the player's strategy space. For most common regularizers, the prox-center is interior (and usually coincides with the barycenter), so the regularization in (3.2) encourages exploration by favoring mixed strategies with full support.
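The regularized maximization in (3.2) can be sketched for the entropic regularizer (hypothetical score values): the resulting choice map is the familiar softmax, and its output dominates the regularized objective over random feasible points of the simplex while keeping full support.

```python
import numpy as np

def logit_choice(y):
    """Choice map of (3.2) for the entropic regularizer:
    argmax_x { <y, x> - sum_a x_a log x_a } over the simplex (softmax)."""
    z = np.exp(y - y.max())   # shift for numerical stability
    return z / z.sum()

def regularized_value(y, x):
    return y @ x - np.sum(x * np.log(x))

y = np.array([2.0, 0.5, -1.0])   # hypothetical cumulative payoff scores
x_star = logit_choice(y)

# sanity check: the softmax point beats random feasible points
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=1000)
vals = [regularized_value(y, s) for s in samples]
print(x_star)  # full support: every strategy keeps positive probability
```

Note that even the worst action retains positive probability, which is exactly the exploration effect described above.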
In Appendix A, we present in detail two prototypical examples of (FoReL): (i) the multiplicative weights (MW) dynamics induced by the entropic regularizer (which lead to the replicator dynamics of evolutionary game theory); and (ii) the projection dynamics induced by the Euclidean regularizer. For concreteness, we will assume in what follows that the regularizer of every player satisfies the following minimal requirements:

is continuous and strictly convex on .

is smooth on the relative interior of every face of (including itself).
Under these basic assumptions, the "regularized leader" is well-defined in the sense that (3.2) admits a unique solution. More importantly, we have the following no-regret guarantee:
Theorem 3.1.
To streamline our discussion, we relegate the proof of Theorem 3.1 to Appendix C; we also refer to [20] for a similar regret bound for (FoReL) in the context of online convex optimization. Instead of discussing the proof, we close this section by noting that (3.3) represents a striking improvement over the worst-case bound for FoReL in discrete time [40]. In view of this, the continuous-time framework we consider here can be seen as particularly amenable to learning because it allows players to minimize their regret (and thus converge to coarse correlated equilibria) at the fastest possible rate.
4. Recurrence in adversarial regularized learning
In this section, our aim is to take a closer look at the ramifications of fast regret minimization under (FoReL) beyond convergence to the set of coarse correlated equilibria. Indeed, as is well known, this set is fairly large and may contain thoroughly non-rationalizable strategies: for instance, Viossat and Zapechelnyuk [46] recently showed that a coarse correlated equilibrium could assign positive selection probability only to strictly dominated strategies. Moreover, the time-averaging that is inherent in the definition of the players' regret leaves open the possibility of complex day-to-day behavior, e.g. periodicity, recurrence, limit cycles, or chaos [37, 29, 26, 27]. Motivated by this, we examine the long-run behavior of (FoReL) in the popular setting of zero-sum games (with or without interior equilibria) and several extensions thereof.
A key notion in our analysis is that of (Poincaré) recurrence. Intuitively, a dynamical system is recurrent if, after a sufficiently long (but finite) time, almost every state returns arbitrarily close to the system's initial state; here, "almost" means that the set of such states has full Lebesgue measure. More formally, given a dynamical system defined by means of a semiflow (a continuous map satisfying the identity and composition properties of a flow in forward time, which describes the trajectory of the dynamical system starting at each initial state), we have:
Definition 4.1.
A point is said to be recurrent under if, for every neighborhood of in , there exists an increasing sequence of times such that for all . Moreover, the flow is called (Poincaré) recurrent if, for every measurable subset of , the set of recurrent points in has full measure.
An immediate consequence of Definition 4.1 is that, if a point is recurrent, there exists an increasing sequence of times along which the trajectory returns to any given neighborhood of that point. On that account, recurrence can be seen as the flip side of convergence: under the latter, (almost) every initial state of the dynamics eventually reaches some well-defined end state; under the former, the system's orbits fill the entire state space and return arbitrarily close to their starting points infinitely often (so there is no possibility of convergence beyond trivial cases).
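Recurrence in the sense of Definition 4.1 can be observed directly in a numerical sketch (hypothetical initial condition; replicator dynamics, i.e. FoReL with entropic regularizers, in Matching Pennies): the integrated orbit first leaves a neighborhood of its starting point and later comes back arbitrarily close to it.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Matching Pennies

def rhs(z):
    """Replicator dynamics for both players' mixed strategies."""
    x, y = z[:2], z[2:]
    u1, u2 = A @ y, -A.T @ x
    return np.concatenate([x * (u1 - x @ u1), y * (u2 - y @ u2)])

def rk4_step(z, dt=0.02):
    k1 = rhs(z); k2 = rhs(z + dt/2*k1); k3 = rhs(z + dt/2*k2); k4 = rhs(z + dt*k3)
    return z + dt/6*(k1 + 2*k2 + 2*k3 + k4)

z0 = np.array([0.7, 0.3, 0.4, 0.6])   # hypothetical starting point
z, left, returned = z0.copy(), False, False
for _ in range(100000):
    z = rk4_step(z)
    d = np.linalg.norm(z - z0)
    if d > 0.1:
        left = True            # the orbit has left the neighborhood of z0 ...
    if left and d < 0.02:
        returned = True        # ... and later returns arbitrarily close to it
        break
print(returned)
```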
4.1. Zerosum games with an interior equilibrium
Our first result is that (FoReL) is recurrent (and hence nonconvergent) in zero-sum games with an interior Nash equilibrium:
Theorem 4.2.
Let the underlying game be a two-player zero-sum game that admits an interior Nash equilibrium. Then, almost every solution trajectory of (FoReL) is recurrent; specifically, for (Lebesgue) almost every initial condition, there exists an increasing sequence of times along which the trajectory of play returns arbitrarily close to its starting point.
The proof of Theorem 4.2 is fairly complicated, so we outline the basic steps below:

We first show that the dynamics of the score sequence are incompressible, i.e. the volume of a set of initial conditions remains invariant as the dynamics evolve over time. By Poincaré’s recurrence theorem (cf. Appendix B), if every solution orbit of (FoReL) remains in a compact set for all , incompressibility implies recurrence.

To counter the possibility of solutions escaping to infinity, we introduce a transformed system based on the differences between scores (as opposed to the scores themselves). To establish boundedness of these dynamics, we consider the "primal-dual" coupling
(4.1) where is an interior Nash equilibrium and denotes the convex conjugate of the player's regularizer (this coupling is closely related to the so-called Bregman divergence; for the details, see [19, 3, 40, 24]). The key property of this coupling is that it remains invariant under (FoReL); however, its level sets are not bounded, so, again, precompactness of solutions is not guaranteed.

Nevertheless, under the score transformation described above, the level sets of the coupling become compact. Since the coupling remains invariant along the transformed dynamics, Poincaré's theorem finally implies recurrence.
Proof of Theorem 4.2.
To make the above plan precise, fix some “benchmark” strategy for every player and, for all , consider the corresponding score differences
(4.2) 
Obviously, the score difference of the benchmark strategy with itself is identically zero, so we can ignore it in the above definition. In so doing, we obtain a linear map sending score vectors to score differences; aggregating over all players, we also write for the corresponding product map. For posterity, note that this map is surjective but not injective (two score vectors have the same image if and only if they differ by a per-player additive constant), so it does not allow us to recover the score vector from the score difference vector.
Now, under (FoReL), the score differences (4.2) evolve as
(4.3) 
However, since the right-hand side (RHS) of (4.3) depends on the scores and the score-difference map is not invertible (so the scores cannot be expressed as a function of the score differences), the above does not a priori constitute an autonomous dynamical system (as required to apply Poincaré's recurrence theorem). Our first step below is to show that (4.3) does in fact constitute a well-defined dynamical system on the space of score differences.
To do so, consider the reduced choice map defined as
(4.4) 
for some score vector with the prescribed image. That such a vector exists is a consequence of the score-difference map being surjective; furthermore, that the reduced choice map is well-defined is a consequence of the fact that the choice map is invariant on the fibers of the score-difference map. Indeed, by construction, two score vectors have the same image if and only if they differ by a per-player additive constant. Hence, by the definition of the choice map, we get
(4.5) 
where we used the fact that . The above shows that if and only if , so is welldefined.
Letting denote the aggregation of the players’ individual choice maps , it follows immediately that by construction. Hence, the dynamics (4.3) may be written as
(4.6) 
where
(4.7) 
These dynamics obviously constitute an autonomous system, so our goal will be to use Liouville’s formula and Poincaré’s theorem in order to establish recurrence and then conclude that the induced trajectory of play is recurrent by leveraging the properties of .
As a first step towards applying Liouville’s formula, we note that the dynamics (4.6) are incompressible. Indeed, we have
(4.8) 
because does not depend on . We thus obtain , i.e. the dynamics (4.6) are incompressible.
We now show that every solution orbit of (4.6) is precompact. To that end, note that the coupling defined in (4.1) remains invariant under (FoReL) when the underlying game is a two-player zero-sum game. Indeed, by Lemma C.1, we have
(4.9) 
where we used (C.2) in the first line, and the assumption that we are at an interior Nash equilibrium of a two-player zero-sum game in the last one. We conclude that the coupling remains constant under (FoReL), as claimed.
By Lemma D.2 in Appendix D, the invariance of under (FoReL) implies that the score differences also remain bounded for all . Hence, by Liouville’s formula and Poincaré’s recurrence theorem, the dynamics (4.6) are recurrent, i.e. for (Lebesgue) almost every initial condition and every neighborhood of , there exists some such that (cf. Definition 4.1). Thus, taking a shrinking net of balls and iterating the above, it follows that there exists an increasing sequence of times such that . Therefore, to prove the corresponding claim for the induced trajectories of play of (FoReL), fix an initial condition and take some such that . By taking as above, we have so, by continuity, . This shows that any solution orbit of (FoReL) is recurrent and our proof is complete. ∎
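The incompressibility step of the proof can be verified numerically in the simplest instance (a sketch with hypothetical score values, entropic regularizers, and a 2x2 zero-sum game): in score-difference coordinates, each player's drift depends only on the opponent's variable, so the divergence of the reduced field (4.6) vanishes identically.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Matching Pennies

def choice_pair(z):
    # logit choice in score-difference coordinates: x = (e^z, 1) / (e^z + 1)
    p = 1.0 / (1.0 + np.exp(-z))
    return np.array([p, 1.0 - p])

def field(w):
    """Reduced dynamics (4.6): the drift of each player's score difference."""
    z1, z2 = w
    x1, x2 = choice_pair(z1), choice_pair(z2)
    v1 = A @ x2       # payoff vector of player 1
    v2 = -A.T @ x1    # payoff vector of player 2
    return np.array([v1[0] - v1[1], v2[0] - v2[1]])

def divergence(w, h=1e-5):
    """Central-difference estimate of the divergence of the field at w."""
    d = 0.0
    for i in range(2):
        e = np.zeros(2); e[i] = h
        d += (field(w + e)[i] - field(w - e)[i]) / (2 * h)
    return d

div = abs(divergence(np.array([0.8, -1.3])))   # hypothetical score differences
print(div)  # numerically zero: the reduced dynamics are incompressible
```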
4.2. Zerosum games with no interior equilibria
At first sight, Theorem 4.2 suggests that cycling is ubiquitous in zerosum games; however, if the game does not admit an interior equilibrium, the behavior of (FoReL) turns out to be qualitatively different. To state our result for such games, it will be convenient to assume that the players’ regularizer functions are strongly convex, i.e. each can be bounded from below by a quadratic minorant:
(4.10) 
for all and for all . Under this technical assumption, we have:
Theorem 4.3.
Let the underlying game be a two-player zero-sum game that does not admit an interior Nash equilibrium. Then, for every initial condition of (FoReL), the induced trajectory of play converges to the boundary of the strategy space. Specifically, if we fix a Nash equilibrium with maximal support, the trajectory of play converges to the relative interior of the face of the strategy space spanned by that support.
Theorem 4.3 is our most comprehensive result for the behavior of (FoReL) in zero-sum games, so several remarks are in order. First, we note that Theorem 4.3 complements Theorem 4.2 in a very natural way: specifically, if the game admits an interior Nash equilibrium, Theorem 4.3 suggests that the solutions of (FoReL) will stay within the relative interior of the strategy space (since an interior equilibrium is supported on all actions). Of course, Theorem 4.2 provides a stronger result because it states that, within this region, (FoReL) is recurrent. Hence, applying both results in tandem, we obtain the following heuristic for the behavior of (FoReL) in zero-sum games:
In the long run, (FoReL) wanders in perpetuity
in the smallest face of containing the equilibrium set of .
This leads to two extremes: on the one hand, if the game admits an interior equilibrium, (FoReL) is recurrent and cycles in the level sets of the coupling function (4.1). At the other end of the spectrum, if the game admits only a single, pure equilibrium, then (FoReL) converges to it (since it has to wander in a singleton set). In all other, "in-between" cases, (FoReL) exhibits a hybrid behavior, converging to the face of the strategy space that is spanned by the maximal-support equilibria of the game, and then cycling in that face in perpetuity.
The reason for this behavior is that the coupling (4.1) is no longer a constant of motion of (FoReL) if the game does not admit an interior equilibrium. As we show in Appendix C, the coupling (4.1) is strictly decreasing when the support of is strictly greater than that of a Nash equilibrium with maximal support. When the two match, the rate of change of (4.1) drops to zero, and we fall back to a “constrained” version of Theorem 4.2. We make this argument precise in Appendix C (where we present the proof of Theorem 4.3).
4.3. Zerosum polymatrix games & positive affine payoff transformations
We close this section by showing that the recurrence properties of (FoReL) are not unique to "vanilla" zero-sum games, but also occur when there is a network of competitors, i.e. in zero-sum polymatrix games. In fact, the recurrence results carry over to any game which is isomorphic to a constant-sum polymatrix game with an interior equilibrium up to a positive affine payoff transformation (possibly a different transformation for each agent). For example, this class of games contains all strictly competitive games [1]. Such transformations do not affect the equilibrium structure of the game, but they can affect the geometry of the trajectories; nevertheless, the recurrent behavior persists, as shown by the following result:
Theorem 4.4.
Let the underlying game be a constant-sum polymatrix game (or a positive affine payoff transformation thereof). If the game admits an interior Nash equilibrium, almost every solution trajectory of (FoReL) is recurrent; specifically, for (Lebesgue) almost every initial condition, there exists an increasing sequence of times along which the trajectory of play returns arbitrarily close to its starting point.
We leave the case of zerosum polymatrix games with no interior equilibria to future work.
5. Conclusions
Our results show that the behavior of regularized learning in adversarial environments is considerably more intricate than the strong no-regret properties of FoReL might at first suggest. Even though the empirical frequency of play under FoReL converges to the set of coarse correlated equilibria (possibly at an increased rate, depending on the game's structure), the actual trajectory of play under FoReL is recurrent and exhibits cycles in zero-sum games. We find this property particularly interesting as it suggests that "black-box" guarantees are not the be-all/end-all of learning in games: the theory of dynamical systems is rife with complex phenomena and notions that arise naturally when examining the behavior of learning algorithms in finer detail.
Appendix A Examples of FoReL dynamics
Example A.1 (Multiplicative weights and the replicator dynamics).
Perhaps the most widely known example of a regularized choice map is the so-called logit choice map
(A.1) 
This choice model was first studied in the context of discrete choice theory by McFadden [22], and it leads to the multiplicative weights (MW) dynamics (the terminology "multiplicative weights" refers to the fact that (MW) is the continuous-time version of the discrete-time multiplicative weights update rule):
(MW)  
As is well known, the logit map above is obtained from the model (3.2) by considering the entropic regularizer
(A.2) \( h(x) = \sum_{\alpha} x_{\alpha} \log x_{\alpha} \)
i.e. the (negative) Gibbs–Shannon entropy function. A simple differentiation of (MW) then shows that the players’ mixed strategies evolve according to the dynamics
(RD) \( \dot x_{\alpha} = x_{\alpha} \big( v_{\alpha}(x) - \textstyle\sum_{\beta} x_{\beta} v_{\beta}(x) \big) \)
This equation describes the replicator dynamics of [45], the most widely studied model for evolution under natural selection in population biology and evolutionary game theory. The basic relation between (MW) and (RD) was first noted in a single-agent environment by [34] and was explored further in game theory by [17, 42, 23, 24] and many others.
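The relation between (MW) and (RD) can be sketched numerically (hypothetical strategy and payoff values): for a small step-size, a single discrete multiplicative-weights update moves the mixed strategy in the direction of the replicator vector field, up to a first-order error in the step-size.

```python
import numpy as np

def mw_step(x, v, eta):
    """Discrete multiplicative-weights update: x_a <- x_a e^{eta v_a} / Z."""
    w = x * np.exp(eta * v)
    return w / w.sum()

def replicator_field(x, v):
    """Right-hand side of (RD): x_a (v_a - <v, x>)."""
    return x * (v - x @ v)

x = np.array([0.2, 0.3, 0.5])     # hypothetical mixed strategy
v = np.array([1.0, -0.5, 0.25])   # hypothetical payoff vector
eta = 1e-4

finite_difference = (mw_step(x, v, eta) - x) / eta
gap = np.max(np.abs(finite_difference - replicator_field(x, v)))
print(gap)  # O(eta): the MW update is a discretization of (RD)
```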
Example A.2 (Euclidean regularization and the projection dynamics).
Another widely used example of regularization is given by the quadratic penalty
(A.3) \( h(x) = \tfrac{1}{2} \textstyle\sum_{\alpha} x_{\alpha}^{2} \)
The induced choice map (3.2) is the (Euclidean) projection map
(A.4) 
leading to the projected reinforcement learning process
(PL)  
The players’ mixed strategies are then known to follow the projection dynamics
(PD) 
over all intervals for which the support of remains constant [24]. The dynamics (PD) were introduced in game theory by [12] as a geometric model of the evolution of play in population games; for a closely related approach, see also [25, 21] and references therein.
Appendix B Liouville’s formula and Poincaré recurrence
Below we present for completeness some basic results from the theory of dynamical systems.
Liouville’s Formula
Liouville's formula applies to any system of autonomous differential equations with a continuously differentiable vector field on an open domain. The divergence of the field at a point is defined as the trace of the corresponding Jacobian at that point. Since the divergence is a continuous function, we can compute its integral over measurable sets. Given any such set, its image under the flow map at time t is measurable and its volume is well-defined. Liouville's formula states that the time derivative of this volume exists and is equal to the integral of the divergence over the evolved set:
A vector field is called divergence-free if its divergence is zero everywhere. Liouville's formula trivially implies that volume is preserved under such flows.
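Volume preservation under a divergence-free field can be checked numerically (a hypothetical planar example, not one of the game dynamics above): the Jacobian determinant of the time-T flow map, estimated by finite differences, equals one.

```python
import numpy as np

def field(w):
    """A divergence-free planar vector field: d/dx(-y^3) + d/dy(x) = 0."""
    x, y = w
    return np.array([-y**3, x])

def flow(w, steps=2000, dt=1e-3):
    """Time-T flow map (T = steps * dt) computed with RK4."""
    for _ in range(steps):
        k1 = field(w); k2 = field(w + dt/2*k1)
        k3 = field(w + dt/2*k2); k4 = field(w + dt*k3)
        w = w + dt/6*(k1 + 2*k2 + 2*k3 + k4)
    return w

# Jacobian of the flow map at a hypothetical base point, via central differences
w0 = np.array([0.5, 0.2]); h = 1e-5
J = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2); e[j] = h
    J[:, j] = (flow(w0 + e) - flow(w0 - e)) / (2 * h)
vol_err = abs(np.linalg.det(J) - 1.0)
print(vol_err)  # Liouville: the flow of a divergence-free field preserves volume
```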
Poincaré’s recurrence theorem
The notion of recurrence that we will be using in this paper goes back to Poincaré, and specifically to his study of the three-body problem. In 1890, in his celebrated work [30], he proved that whenever a dynamical system preserves volume, almost all trajectories return arbitrarily close to their initial position, and they do so an infinite number of times. More precisely, Poincaré established the following:
Appendix C Technical proofs
The first result that we prove in this appendix is a key technical lemma concerning the evolution of the coupling function (4.1):
Lemma C.1.
Proof.
We begin by recalling the “maximizing argument” identity
(C.2) 
which expresses the choice map as a function of the convex conjugate of [40, p. 149]. With this at hand, a simple differentiation gives
(C.3) 
where the last step follows from the fact that . ∎
Armed with this lemma, we proceed to prove the noregret guarantees of (FoReL):
Proof of Theorem 3.1.
Fix some base point and let . Then, by Lemma C.1, we have
(C.4) 
and hence, after integrating and rearranging, we get
(C.5) 
where we used the fact that – cf. Eq. 2.2 in Section 2. However, expanding the RHS of (C.5), we get
(C.6) 
where we used the defining property of convex conjugation in the second and third lines above – i.e. that for all , with equality if and only if . Thus, maximizing (C.5) over , we finally obtain
(C.7) 
as claimed. ∎
We now turn to two-player zero-sum games that do not admit interior equilibria. To describe such equilibria in more detail, we introduce the notion of essential and nonessential strategies:
Definition C.2.
A strategy of an agent in a zero-sum game is called essential if there exists a Nash equilibrium in which the agent plays that strategy with positive probability. A strategy that is not essential is called nonessential.
As it turns out, the Nash equilibria of a zerosum game admit a very useful characterization in terms of essential and nonessential strategies:
Lemma C.3.
Let the underlying game be a two-player zero-sum game that does not admit an interior Nash equilibrium. Then, there exists a mixed Nash equilibrium such that a) each agent plays each of their essential strategies with positive probability; and b) for each agent, deviating to a nonessential strategy results in strictly worse performance than the value of the game.
The key step in proving this characterization is Farkas' lemma; the version we employ here is due to Gale, Kuhn, and Tucker [13]:
Lemma C.4 (Farkas’ lemma).
Let and . Then exactly one of the following two statements is true:

There exists a such that and .

There exists a such that and .
With this lemma at hand, we have:
Proof of Lemma C.3.
Assume without loss of generality that the value of the zero-sum game is zero and that the first agent is the maximizing agent, so that if the first agent's payoff matrix is given, the second (minimizing) agent's payoff matrix is its negative. We will first show that, for any nonessential strategy of either agent, there exists a Nash equilibrium strategy of their opponent against which the expected performance of that nonessential strategy is strictly worse than the value of the game (i.e. zero).
It suffices to argue this for the first agent. Let a nonessential strategy of that agent be given; then, by definition, there does not exist any Nash equilibrium strategy of that agent that chooses it with positive probability. This is equivalent to the negation of the following statement:
There exists a such that and
where
(C.8) 
and , the standard basis vector of dimension that "chooses" the th strategy. By Farkas' lemma, there exists a such that and . It is convenient to express where and . Hence, for all and thus . Finally, for and thus . Hence is a Nash equilibrium strategy for the second player such that, when the first agent chooses the nonessential strategy, he receives a payoff which is strictly worse than his value (zero).
To complete the proof, note that for each essential strategy of the first agent there exists an equilibrium strategy of that agent that chooses it with positive probability (by definition). Similarly, for each nonessential strategy of the second agent there exists an equilibrium strategy of the first agent that makes the expected payoff of that nonessential strategy strictly worse than the value of the game. The barycenter of all the above equilibrium strategies is still an equilibrium strategy (by convexity) and has all the desired properties. ∎
With all this at hand, we are finally in a position to prove Theorem 4.3:
Proof of Theorem 4.3.
We first show that the coupling