1 Introduction
Learning in dynamically evolving environments can be described as a repeated game between a player, an online learning algorithm, and an adversary. At each round of the game, the player selects an action, e.g. invests in a specific stock, the adversary, which may be the stock market, chooses a utility function, and the player gains the utility value of its action. The player observes the utility value and uses it to update its strategy for subsequent rounds. The player's goal is to accumulate the largest possible utility over a finite number of rounds of play. (Such games can be equivalently described in terms of minimizing losses rather than maximizing utilities. All our results can be equivalently expressed in terms of losses instead of utilities.)
The standard measure of the performance of a player is its regret, that is, the difference between the utility achieved by the best offline solution from some restricted class and the utility obtained by the online player, when utilities are revealed incrementally. Formally, we can model learning as the following problem. Consider an action set $\mathcal{A}$. The player selects an action $a_t$ at round $t$, the adversary picks a utility function $u_t$, and the player gains the utility value $u_t(a_1, \ldots, a_t)$. While in a full observation setting the player observes the entire utility function $u_t(a_1, \ldots, a_{t-1}, \cdot)$, in a bandit setting the player only observes the utility value of its own action, $u_t(a_1, \ldots, a_t)$. We use the shorthand $a_{1:t}$ to denote the player's sequence of actions $(a_1, \ldots, a_t)$ and denote by $\mathcal{U}_t$ the family of utility functions $\{u_t : \mathcal{A}^t \to \mathbb{R}\}$. The objective of the player is to maximize its expected cumulative utility over $T$ rounds, i.e. maximize $\mathbb{E}\big[\sum_{t=1}^T u_t(a_{1:t})\big]$, where the expectation is over the player's (possible) internal randomization. Since this is clearly impossible to maximize without knowledge of the future, the algorithm instead seeks to achieve a performance comparable to that of the best fixed action in hindsight. Formally, external regret is defined as
$$\max_{a \in \mathcal{A}} \mathbb{E}\Big[\sum_{t=1}^T u_t(a_{1:t-1}, a) - \sum_{t=1}^T u_t(a_{1:t})\Big]. \tag{1}$$
A player is said to admit no external regret if the external regret is sublinear, that is $o(T)$. In contrast to statistical learning, online learning algorithms do not need to make stochastic assumptions about data generation: strong regret bounds are possible even if the utility functions are adversarial.
There are two main adversarial settings in online learning: the oblivious setting where the adversary ignores the player's actions and where the utility functions can be thought of as determined before the game starts (for instance, in weather prediction); and the adaptive setting where the adversary can react to the player's actions, thus seeking to throw the player off track (e.g., competing with other agents in the stock market). More generally, we define an $m$-memory bounded adversary as one that at any time $t$ selects a utility function based on the player's past $m$ actions: $u_t(a'_1, \ldots, a'_{t-m-1}, a_{t-m}, \ldots, a_t) = u_t(a_1, \ldots, a_t)$, for all $a'_1, \ldots, a'_{t-m-1}$ and all $a_1, \ldots, a_t$. An oblivious adversary can therefore be equivalently viewed as a $0$-memory bounded adversary and an adaptive adversary as an $\infty$-memory bounded adversary. For an oblivious adversary, external regret in Equation 1 reduces to $\max_{a \in \mathcal{A}} \mathbb{E}\big[\sum_{t=1}^T u_t(a) - u_t(a_t)\big]$, since the utility functions do not depend upon past actions. Thus, external regret is meaningful when the adversary is oblivious, but it does not admit any natural interpretation when the adversary is adaptive. The problem stems from the fact that in the definition of external regret, the benchmark is a function of the player's actions. Thus, if the adversary is adaptive, or even $m$-memory bounded for some $m > 0$, then external regret does not take into account how the adversary would react had the player selected some other action.
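To resolve this critical issue, Arora et al. (2012b) introduced an alternative measure of performance called policy regret for which the benchmark does not depend on the player's actions. Policy regret is defined as follows:
$$\max_{a \in \mathcal{A}} \sum_{t=1}^T \mathbb{E}\big[u_t(a, \ldots, a)\big] - \mathbb{E}\Big[\sum_{t=1}^T u_t(a_{1:t})\Big]. \tag{2}$$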
Arora et al. (2012b) further gave a reduction, using a mini-batch technique where the mini-batch size is larger than the memory $m$ of the adversary, that turns any algorithm with a sublinear external regret against an oblivious adversary into an algorithm with a sublinear policy regret against an $m$-memory bounded adversary, albeit at the price of a somewhat worse regret bound, which is still sublinear.
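The mini-batch idea can be sketched in a few lines. The code below is only an illustration under stated assumptions: the base learner is a textbook Exp3, each batch repeats one action and feeds the batch-average utility back to the base learner, and `one_memory_utility` is a hypothetical 1-memory adversary invented for the example. None of these names or parameter values come from Arora et al. (2012b).

```python
import math
import random


class Exp3:
    """Textbook Exp3 for utilities in [0, 1]; used here only as a base learner."""

    def __init__(self, n_actions, gamma, rng):
        self.n, self.gamma, self.rng = n_actions, gamma, rng
        self.log_w = [0.0] * n_actions  # log-weights, to avoid overflow
        self.probs = [1.0 / n_actions] * n_actions
        self.last = None

    def select(self):
        top = max(self.log_w)
        w = [math.exp(v - top) for v in self.log_w]
        total = sum(w)
        # Mix the exponential weights with uniform exploration.
        self.probs = [(1 - self.gamma) * x / total + self.gamma / self.n for x in w]
        self.last = self.rng.choices(range(self.n), weights=self.probs)[0]
        return self.last

    def update(self, utility):
        # Importance-weighted estimate for the action that was actually played.
        estimate = utility / self.probs[self.last]
        self.log_w[self.last] += self.gamma * estimate / self.n


def minibatch_play(base, utility_fn, horizon, batch_size):
    """Repeat each base action for a whole batch; feed back the batch average."""
    actions, utilities = [], []
    for start in range(0, horizon, batch_size):
        action = base.select()
        batch = []
        for _ in range(start, min(start + batch_size, horizon)):
            actions.append(action)
            batch.append(utility_fn(actions))  # adversary sees the full history
        base.update(sum(batch) / len(batch))
        utilities.extend(batch)
    return actions, utilities


def one_memory_utility(history):
    # Hypothetical adversary: pays only when the last two actions agree.
    if len(history) < 2 or history[-1] != history[-2]:
        return 0.0
    return 1.0 if history[-1] == 1 else 0.5


rng = random.Random(0)
actions, utilities = minibatch_play(Exp3(2, 0.1, rng), one_memory_utility, 2000, 20)
print(sum(utilities) / len(utilities))
```

Because the action is frozen within a batch, only the first few rounds of each batch are affected by the adversary's memory of the previous batch, which is why the batch size is taken larger than $m$.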
In this paper, we revisit the problem of online learning against adaptive adversaries. Since Arora et al. (2012b) showed that there exists an adaptive adversary against which any online learning algorithm admits linear policy regret, even when the external regret may be sublinear, we ask if no policy regret implies no external regret. One could expect this to be the case since policy regret seems to be a stronger notion than external regret. However, our first main result (Theorem 3.2) shows that this in fact is not the case and that the two notions of regret are incompatible: there exist adversaries (or sequence of utilities) on which action sequences with sublinear external regret admit linear policy regret and action sequences with sublinear policy regret incur linear external regret.
We argue, however, that such sequences may not arise in practical settings, that is, in settings where the adversary is a self-interested entity. In such settings, rather than considering a malicious opponent whose goal is to hurt the player by inflicting large regret, it seems more reasonable to consider an opponent whose goal is to maximize his own utility. In zero-sum games, maximizing one's utility comes at the expense of the other player's, but there is a subtle difference between an adversary who is seeking to maximize the player's regret and an adversary who is seeking to minimize the player's utility (or maximize his own utility). We show that in such strategic game settings there is indeed a strong relationship between external regret and policy regret. In particular, we show in Theorem 3.4 that a large class of stable online learning algorithms with sublinear external regret also benefit from sublinear policy regret.
Further, we consider a two-player game where each player is playing a no policy regret algorithm. It is known that no external regret play converges to a coarse correlated equilibrium (CCE) in such a game, but what happens when players are using no policy regret algorithms? We show in Theorem 4.8 that the average play in repeated games between no policy regret players converges to a policy equilibrium, a new notion of equilibrium that we introduce. Policy equilibria differ from more traditional notions of equilibria such as Nash equilibria or CCEs in a crucial way. Recall that a CCE is defined to be a recommended joint strategy for players in a game such that there is no incentive for any player to deviate unilaterally from the recommended strategy if other players do not deviate.
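The CCE condition recalled above can be checked mechanically for a finite two-player game. The sketch below is an illustration only: the dictionary representation, the function names, and the matching-style game are assumptions made for the example, not notation from the paper.

```python
def expected_utility(sigma, u):
    """Expected utility of a joint distribution sigma: {(a, b): probability}."""
    return sum(p * u[ab] for ab, p in sigma.items())


def is_cce(sigma, u1, u2, tol=1e-9):
    """Check that no player gains by committing to one fixed action
    while the other player keeps following the recommendation."""
    actions1 = sorted({a for a, _ in u1})
    actions2 = sorted({b for _, b in u1})
    v1 = expected_utility(sigma, u1)
    v2 = expected_utility(sigma, u2)
    for a_dev in actions1:
        # Player 1 deviates to a_dev; player 2's action still comes from sigma.
        if sum(p * u1[(a_dev, b)] for (_, b), p in sigma.items()) > v1 + tol:
            return False
    for b_dev in actions2:
        # Player 2 deviates to b_dev; player 1's action still comes from sigma.
        if sum(p * u2[(a, b_dev)] for (a, _), p in sigma.items()) > v2 + tol:
            return False
    return True


# Illustrative game with utilities in [0, 1]: player 1 wants to match, player 2 to mismatch.
u1 = {(a, b): 1.0 if a == b else 0.0 for a in (0, 1) for b in (0, 1)}
u2 = {ab: 1.0 - v for ab, v in u1.items()}
uniform = {ab: 0.25 for ab in u1}
point_mass = {(0, 0): 1.0}

print(is_cce(uniform, u1, u2))     # uniform play leaves no profitable fixed deviation
print(is_cce(point_mass, u1, u2))  # player 2 would rather switch to action 1
```

Note that the deviating player's opponent is held fixed at the recommended play; this is exactly the assumption that policy regret, and the policy equilibrium built on it, relaxes.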
What happens if the other players react to one player’s deviation by deviating themselves? This type of reasoning is not captured by external regret, but is essentially what is captured by policy regret. Thus, our notion of policy equilibrium must take into account these counterfactuals, and so the definition is significantly more complex. But, by considering functions rather than just actions, we can define such equilibria and prove that they exactly characterize no policy regret play.
Finally, it becomes natural to determine the relationship between policy equilibria (which characterize no policy regret play) and CCEs (which characterize no external regret play). We show in Theorems 4.9 and 4.10 that the set of CCEs is a strict subset of the set of policy equilibria. In other words, every CCE can be thought of as a policy equilibrium, but no policy regret play might not converge to a CCE.
2 Related work
The problem of minimizing policy regret in a fully adversarial setting was first studied by Merhav et al. (2002). Their work dealt specifically with the full observation setting and assumed that the utility (or loss) functions were $m$-memory bounded. They gave regret bounds in $O(T^{2/3})$. The follow-up work by Farias and Megiddo (2006) designed algorithms in a reactive bandit setting. However, their results were not in the form of regret bounds but rather introduced a new way to compare against acting according to a fixed expert strategy. Arora et al. (2012b) studied $m$-memory bounded adversaries both in the bandit and full information settings and provided extensions to more powerful competitor classes considered in swap regret and more general $\Phi$-regret. Dekel et al. (2014) provided a lower bound in the bandit setting for switching cost adversaries, which also leads to a tight lower bound for policy regret in the order $\tilde{\Omega}(T^{2/3})$. Their results were later extended by Koren et al. (2017a) and Koren et al. (2017b). More recently, Heidari et al. (2016) considered the multi-armed bandit problem where each arm's loss evolves with consecutive pulls. The process according to which the loss evolves was not assumed to be stochastic but it was not arbitrary either: in particular, the authors required either the losses to be concave, increasing and to satisfy a decreasing marginal returns property, or decreasing. The regret bounds given are in terms of the time required to distinguish the optimal arm from all others.

A large part of reinforcement learning is also aimed at studying sequential decision making problems. In particular, one can define a Markov Decision Process (MDP) by a set of states equipped with transition distributions, a set of actions and a set of reward or loss distributions associated with each state-action pair. The transition and reward distributions are assumed unknown and the goal is to play according to a strategy that minimizes loss or maximizes reward. We refer the reader to (Sutton and Barto, 1998; Kakade et al., 2003; Szepesvári, 2010) for general results in RL. MDPs in the online setting with bandit feedback or arbitrary payoff processes have been studied by Even-Dar et al. (2009a); Yu et al. (2009); Neu et al. (2010) and Arora et al. (2012a).

The tight connection between no-regret algorithms and correlated equilibria was established and studied by Foster and Vohra (1997); Fudenberg and Levine (1999); Hart and Mas-Colell (2000); Blum and Mansour (2007). A general extension to games with compact, convex strategy sets was given by Stoltz and Lugosi (2007). No external regret dynamics were studied in the context of socially concave games (Even-Dar et al., 2009b). More recently, Hazan and Kale (2008) considered more general notions of regret and established an equivalence result between fixed-point computation, the existence of certain no-regret algorithms, and the convergence to the corresponding equilibria. In follow-up work, Mohri and Yang (2014) and Mohri and Yang (2017) considered a more powerful set of competitors and showed that repeated play according to conditional swap-regret or transductive regret algorithms leads to a new set of equilibria.
3 Policy regret in reactive versus strategic environments
Often, distant actions in the past influence an adversary less than more recent ones. The definition of policy regret (2) models this influence decay by assuming that the adversary is $m$-memory bounded for some $m \in \mathbb{N}$. This assumption is somewhat stringent, however, since ideally we could model the current move of the adversary as a function of the entire past, even if actions taken further in the past have less significance. Thus, we extend the definition of Arora et al. (2012b) as follows.
Definition 3.1.
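The $m$-memory policy regret at time $T$ of a sequence of actions $(a_t)_{t=1}^T$ with respect to a fixed action $a$ in the action set $\mathcal{A}$ and the sequence of utilities $(u_t)_{t=1}^T$, where $u_t : \mathcal{A}^t \to \mathbb{R}$ and $m \in \mathbb{N} \cup \{\infty\}$, is
$$P(T, a) = \sum_{t=1}^T u_t(a_1, \ldots, a_{t-m-1}, a, \ldots, a) - \sum_{t=1}^T u_t(a_1, \ldots, a_t),$$
where the last $m+1$ arguments of the first term equal $a$.
We say that the sequence $(a_t)_{t=1}^T$ has sublinear policy regret (or no policy regret) if $P(T, a) = o(T)$, for all actions $a \in \mathcal{A}$.
Let us emphasize that this definition is just an extension of the standard policy regret definition and that, when the utility functions are $m$-memory bounded, the two definitions exactly coincide.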
While the motivation for policy regret suggests that it should be a stronger notion than external regret, we show not only that these notions are incomparable in the general adversarial setting, but also that they are incompatible in a strong sense.
Theorem 3.2.
There exists a sequence of $m$-memory bounded utility functions $(u_t)_{t=1}^T$, where $u_t : \mathcal{A}^t \to \mathbb{R}$, such that for any constant $m \geq 2$ (independent of $T$), any action sequence with sublinear policy regret will have linear external regret and any action sequence with sublinear external regret will have linear policy regret.
The proof of the above theorem constructs a sequence for which no reasonable play can attain sublinear external regret. In particular, the only way the learner can have sublinear external regret is if they choose to have very small utility. To achieve this, the utility functions chosen by the adversary are the following. At time $t$, if the player plays the same action as their past two actions, they receive a fixed positive utility. If the player's past two actions were equal but their current action is different, they receive a strictly larger utility, and if their past two actions differ, then no matter what their current action is they receive the lowest utility. It is easy to see that the maximum utility play for this sequence (and the lowest $m$-memory bounded policy regret strategy) is choosing the same action at every round. However, such an action sequence admits linear external regret, since at every round a one-step deviation would have earned strictly more. Moreover, every sublinear external regret strategy must then admit sublinear utility and thus linear policy regret.
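The tension between the two regret notions can be seen numerically on a utility sequence of this shape. The sketch below is an illustration under assumptions that are not taken from the text: a two-action set, the hypothetical utility values 0.5 (repeat), 1.0 (switch after two equal actions) and 0.0 (past two actions differ), and a history padded with two copies of the first action.

```python
def utility(prev2, prev1, cur, stay=0.5, switch=1.0, broken=0.0):
    """Utility of one round given the two previous actions and the current one.

    The numeric values are illustrative assumptions.
    """
    if prev2 != prev1:
        return broken
    return stay if cur == prev1 else switch


def padded(actions):
    # Assumption: before round 1 the player "played" their first action twice.
    return [actions[0], actions[0]] + list(actions)


def total_utility(actions):
    h = padded(actions)
    return sum(utility(h[t], h[t + 1], h[t + 2]) for t in range(len(actions)))


def external_regret(actions, action_set):
    # Benchmark keeps the realized history and swaps only the current action.
    h = padded(actions)
    best = max(
        sum(utility(h[t], h[t + 1], a) for t in range(len(actions)))
        for a in action_set
    )
    return best - total_utility(actions)


def policy_regret(actions, action_set):
    # Benchmark replays the whole game with one fixed action.
    best = max(total_utility([a] * len(actions)) for a in action_set)
    return best - total_utility(actions)


T = 1000
constant = [0] * T
alternating = [t % 2 for t in range(T)]
for name, seq in [("constant", constant), ("alternating", alternating)]:
    print(name, external_regret(seq, [0, 1]), policy_regret(seq, [0, 1]))
```

With these assumed values, constant play has zero policy regret but external regret $T/2$, while alternating play has constant external regret but policy regret close to $T/2$. The sketch only exhibits these two sequences; it does not verify the theorem for all sequences.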
As discussed in Section 1, in many realistic environments we can instead think of the adversary as a self-interested agent trying to maximize their own utility, rather than trying to maximize the regret of the player. This more strategic environment is better captured by the game theory setting, in particular a 2-player game where both players are trying to maximize their utility. Even though we have argued that external regret is not a good measure, our next result shows that minimizing policy regret in games can be done if both players choose their strategies according to certain no external regret algorithms. More generally, we adapt a classical notion of stability from the statistical machine learning setting and argue that if the players use no external regret algorithms that are stable, then the players will have no policy regret in expectation. To state the result formally we first need to introduce some notation.

Game definition: We consider a 2-player game $\mathcal{G}$, with players $1$ and $2$. The action set of player $i$ is denoted by $\mathcal{A}_i$, which we think of as being embedded into $\mathbb{R}^{|\mathcal{A}_i|}$ in the obvious way where each action corresponds to a standard basis vector. The corresponding simplex is $\Delta\mathcal{A}_i$. The action of player $1$ at time $t$ is $a_t$ and of player $2$ is $b_t$. The observed utility for player $i$ at time $t$ is $u_i(a_t, b_t)$ and this is a bilinear form with corresponding matrix $P_i$. We assume that the utilities are bounded in $[0, 1]$.

Algorithm of the player:
When discussing algorithms, we take the view of player $1$. Specifically, at time $t$, player $1$ plays according to an algorithm which can be described as $\mathrm{Alg}_t : (\mathcal{A}_1 \times \mathcal{A}_2)^{t-1} \to \Delta\mathcal{A}_1$. We distinguish between two settings: full information, in which the player observes the full utility function at time $t$ (i.e., $u_1(\cdot, b_t)$), and the bandit setting, in which the player only observes $u_1(a_t, b_t)$. In the full information setting, algorithms like multiplicative weight updates (MWU; Arora et al. (2012c)) depend only on the past $t-1$ utility functions $(u_1(\cdot, b_\ell))_{\ell=1}^{t-1}$, and thus we can think of $\mathrm{Alg}_t$ as a function $f_t : \mathcal{A}_2^{t-1} \to \Delta\mathcal{A}_1$. In the bandit setting, though, the output at time $t$ of the algorithm depends both on the previous $t-1$ actions $(a_\ell)_{\ell=1}^{t-1}$ and on the utility functions (i.e., the actions picked by the other player).
But even in the bandit setting, we would like to think of the player's algorithm as a function $f_t : \mathcal{A}_2^{t-1} \to \Delta\mathcal{A}_1$. We cannot quite do this; however, we can think of the player's algorithm as a distribution over such functions. So how do we remove the dependence on $\mathcal{A}_1^{t-1}$? Intuitively, if we fix the sequence of actions played by player $2$, we want to take the expectation of $\mathrm{Alg}_t$ over possible choices of the $t-1$ actions played by player $1$. In order to do this more formally, consider the distribution $\mu$ over $\mathcal{A}_1^{t-1} \times \mathcal{A}_2^{t-1}$ generated by simulating the play of the players for $t-1$ rounds. Then let $\mu_{b_{1:t-1}}$ be the distribution obtained by conditioning $\mu$ on the actions of player $2$ being $b_{1:t-1}$. Now we let $f_t(b_{1:t-1})$ be the distribution obtained by sampling $a_{1:t-1}$ from $\mu_{b_{1:t-1}}$ and using $\mathrm{Alg}_t(a_{1:t-1}, b_{1:t-1})$. When taking expectations over $f_t$, the expectation is taken with respect to the above distribution. We also refer to the output $f_t(b_{1:t-1})$ as the strategy of the player at time $t$.
Now that we can refer to algorithms simply as functions (or distributions over functions), we introduce the notion of a stable algorithm.
Definition 3.3.
Let be a sample from (as described above), mapping the past actions in to a distribution over the action set . Let the distribution returned at time be . We call this algorithm on average stable with respect to the norm , if for any such that , it holds that , where the expectation is taken with respect to the randomization in the algorithm.
Even though this definition of stability is given with respect to the game setting, it is not hard to see that it can be extended to the general online learning setting, and in fact this definition is similar in spirit to the one given in Saha et al. (2012). It turns out that most natural no external regret algorithms are stable. In particular we show, in the supplementary, that both Exp3 Auer et al. (2002) and MWU are on average stable with respect to norm for any . It is now possible to show that if each of the players are facing stable no external regret algorithms, they will also have bounded policy regret (so the incompatibility from Theorem 3.2 cannot occur in this case).
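To give a feel for why exponential-weights methods are stable in this sense, the sketch below compares the distributions produced by a full-information MWU (Hedge-style) update on two opponent sequences that differ in a single round. This is only an illustration: the exponential form of the update, the learning rates, and the 2x2 utility matrix are assumptions made for the example, and the bound checked at the end is the elementary one derived in the comment, not the stability parameter from the supplementary.

```python
import math


def mwu_distribution(opponent_actions, utility, eta):
    """Exponential-weights distribution of player 1 after seeing opponent_actions.

    utility[a][b] is player 1's utility in [0, 1]; full information is assumed,
    so the output is a deterministic function of the opponent's past actions.
    """
    n = len(utility)
    cumulative = [0.0] * n
    for b in opponent_actions:
        for a in range(n):
            cumulative[a] += utility[a][b]
    top = max(cumulative)
    weights = [math.exp(eta * (c - top)) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def l1_distance(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))


U1 = [[1.0, 0.0], [0.0, 1.0]]      # illustrative utility matrix for player 1
b = [t % 2 for t in range(50)]     # one opponent sequence
b_prime = list(b)
b_prime[10] = 1 - b_prime[10]      # same sequence with a single round changed

distances = {}
for eta in (0.5, 0.1, 0.01):
    d = l1_distance(mwu_distribution(b, U1, eta), mwu_distribution(b_prime, U1, eta))
    distances[eta] = d
    # Changing one round shifts each cumulative utility by at most 1, so every
    # probability changes by a factor in [exp(-2*eta), exp(2*eta)].
    print(eta, d, math.exp(2 * eta) - 1)
```

The distance shrinks with the learning rate, which matches the intuition that a slowly moving algorithm cannot react much to a small change in the opponent's past play.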
Theorem 3.4.
Let and be the action sequences of player and and suppose that they are coming from no external regret algorithms modeled by functions and , with regrets and respectively. Assume that the algorithms are on average stable with respect to the norm. Then
where in the definition of equals and similarly in the definition of , equals . The above holds for any fixed actions and . Here the matrix norm is the spectral norm.
4 Policy equilibrium
Recall that unlike external regret, policy regret captures how other players in a game might react if a player decides to deviate from their strategy. The story is similar when considering different notions of equilibria. In particular, Nash equilibria, correlated equilibria and CCEs can be interpreted in the following way: if player $i$ deviates from the equilibrium play, their utility will not increase no matter how they decide to switch, provided that all other players continue to play according to the equilibrium. This sentiment is a reflection of what no external and no swap regret algorithms guarantee. Equipped with the knowledge that no policy regret sequences are obtainable in the game setting under reasonable play from all parties, it is natural to ask how other players would react if player $i$ deviated and what the cost of deviation would be when taking into account possible reactions.
Let us again consider the 2-player game setup through the view of player $1$. The player believes their opponent might be $m$-memory bounded and decides to proceed by playing according to a no policy regret algorithm. After many rounds of the game, player $1$ has computed an empirical distribution of play $\hat\sigma$ over $\mathcal{A} := \mathcal{A}_1 \times \mathcal{A}_2$. The player is familiar with the guarantees of the algorithm and knows that, if instead they changed to playing any fixed action $a \in \mathcal{A}_1$, then the resulting empirical distribution of play $\hat\sigma_a$, where player $2$ has responded accordingly in a memory-bounded way, is such that $\mathbb{E}_{(a',b') \sim \hat\sigma}[u_1(a',b')] \geq \mathbb{E}_{(a',b') \sim \hat\sigma_a}[u_1(a',b')] - \epsilon$. This thought experiment suggests that if no policy regret play converges to an equilibrium, then the equilibrium is not only described by the deviations of player $1$, but also through the change in player $2$'s behavior, which is encoded in the distribution $\hat\sigma_a$. Thus, any equilibrium induced by no policy regret play can be described by tuples of distributions $\{(\sigma, \sigma_a, \sigma_b) : (a, b) \in \mathcal{A}\}$, where $\sigma_a$ is the distribution corresponding to player $1$'s deviation to the fixed action $a$ and $\sigma_b$ captures player $2$'s deviation to the fixed action $b$. Clearly $\sigma_a$ and $\sigma_b$ are not arbitrary, but we still need a formal way to describe how they arise.
For convenience, let us restrict the memory of player $2$ to be $1$. Thus, what player $1$ believes is that at each round $t$ of the game, they play an action $a_t$ and player $2$ plays a function $g_t : \mathcal{A}_1 \to \mathcal{A}_2$, mapping $a_{t-1}$ to $b_t = g_t(a_{t-1})$. Finally, the observed utility is $u_1(a_t, g_t(a_{t-1}))$. The empirical distribution of play, $\hat\sigma$, from the perspective of player $1$, is formed from the observed play $(a_t, g_t(a_{t-1}))_{t=1}^T$. Moreover, the distribution, $\hat\sigma_a$, that would have occurred if player $1$ chose to play action $a$ on every round is formed from the play $(a, g_t(a))_{t=1}^T$. In the view of the world of player $1$, the actions taken by player $2$ are actually functions rather than actions in $\mathcal{A}_2$. This suggests that the equilibrium induced by no policy regret play is a distribution over the functional space defined below.
Definition 4.1.
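Let $\mathcal{F}_1 := \{f : \mathcal{A}_2^{m_1} \to \mathcal{A}_1\}$ and $\mathcal{F}_2 := \{g : \mathcal{A}_1^{m_2} \to \mathcal{A}_2\}$ denote the functional spaces of play of players $1$ and $2$, respectively. Denote the product space by $\mathcal{F} := \mathcal{F}_1 \times \mathcal{F}_2$.
Note that when $m_1 = m_2 = 0$, $\mathcal{F}$ is in a one-to-one correspondence with $\mathcal{A}$, i.e. when players believe their opponents are oblivious, we recover the action set studied in standard equilibria. For simplicity, for the remainder of the paper we assume that $m_1 = m_2 = 1$. However, all of the definitions and results that follow can be extended to the fully general setting of arbitrary $m_1$ and $m_2$; see the supplementary for details.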
Let us now investigate how a distribution $\pi$ over $\mathcal{F}$ can give rise to a tuple of distributions $(\sigma, \sigma_a, \sigma_b)$. We begin by defining the utility of $\pi$ such that it equals the utility of a distribution $\sigma$ over $\mathcal{A}$, i.e., we want $\mathbb{E}_{(f,g) \sim \pi}[u_1(f,g)] = \mathbb{E}_{(a',b') \sim \sigma}[u_1(a',b')]$. Since utilities are not defined for functions, we need an interpretation of $\mathbb{E}_{(f,g) \sim \pi}[u_1(f,g)]$ which makes sense. We notice that $\pi$ induces a Markov chain with state space $\mathcal{A}$ in the following way.
Definition 4.2.
Let $\pi$ be any distribution over $\mathcal{F}$. Then $\pi$ induces a Markov process with transition probabilities
$$\mathbb{P}\big[(a_2, b_2) \mid (a_1, b_1)\big] = \sum_{(f,g) \in \mathcal{F} \,:\, f(b_1) = a_2,\ g(a_1) = b_2} \pi(f, g).$$
We associate with this Markov process the transition matrix $M \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{A}|}$, with $M_{x_1, x_2} = \mathbb{P}[x_2 \mid x_1]$ where $x_i = (a_i, b_i)$.
Since every Markov chain with a finite state space has a stationary distribution, we think of the utility of $\pi$ as the utility of a particular stationary distribution $\sigma$ of $M$. How we choose $\sigma$ among all stationary distributions is going to become clear later, but for now we can think about $\sigma$ as the distribution which maximizes the utilities of both players. Next, we need to construct $\sigma_a$ and $\sigma_b$, which capture the deviation in play when player $1$ switches to action $a$ and player $2$ switches to action $b$. The no policy regret guarantee can be interpreted as $\mathbb{E}_{(f,g) \sim \pi}[u_1(f,g)] \geq \mathbb{E}_{(f,g) \sim \pi}[u_1(a,g)]$, i.e., if player $1$ chose to switch to a fixed action (or equivalently, the constant function which maps everything to the action $a$), then their utility should not increase. Switching to a fixed action $a$ changes $\pi$ to a new distribution $\pi_a$ over $\mathcal{F}$. This turns out to be a product distribution which also induces a Markov chain.
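The Markov chain of Definition 4.2 is easy to build explicitly for small games. The sketch below is an illustration under assumptions: memory 1 for both players, so that a pair $(f, g)$ moves state $(a, b)$ to $(f(b), g(a))$; functions are stored as tuples indexed by the opponent's action; and the particular distribution `pi` is hypothetical. A stationary distribution is approximated by averaging the iterates of the chain, which also works for periodic chains.

```python
from itertools import product


def induced_chain(pi, n1, n2):
    """Transition matrix over joint actions induced by pi: {(f, g): probability}.

    f is a tuple with f[b] in range(n1); g is a tuple with g[a] in range(n2).
    """
    states = list(product(range(n1), range(n2)))
    index = {s: i for i, s in enumerate(states)}
    M = [[0.0] * len(states) for _ in states]
    for (f, g), p in pi.items():
        for (a, b) in states:
            # Each player reacts to the other's previous action.
            M[index[(a, b)]][index[(f[b], g[a])]] += p
    return states, M


def step(x, M):
    n = len(M)
    return [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]


def stationary(M, iters=5000):
    """Cesaro average of the iterates; its stationarity residual is at most 2/iters in L1."""
    n = len(M)
    x = [1.0 / n] * n
    acc = [0.0] * n
    for _ in range(iters):
        x = step(x, M)
        acc = [s + v for s, v in zip(acc, x)]
    return [v / iters for v in acc]


# Hypothetical distribution over pairs of reaction functions (two actions each).
pi = {
    ((0, 0), (0, 1)): 0.5,  # f always plays 0; g copies player 1's last action
    ((0, 1), (1, 1)): 0.5,  # f copies player 2's last action; g always plays 1
}
states, M = induced_chain(pi, 2, 2)
sigma = stationary(M)
residual = sum(abs(u - v) for u, v in zip(step(sigma, M), sigma))
print(dict(zip(states, sigma)), residual)
```

When the chain has several stationary distributions, the one returned here depends on the uniform starting point; the text leaves the choice among stationary distributions to the equilibrium definition.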
Definition 4.3.
Let $\pi$ be any distribution over $\mathcal{F}$. Let $\delta_a$ be the distribution over $\mathcal{F}_1$ putting all mass on the constant function mapping all actions to the fixed action $a \in \mathcal{A}_1$. Let $\pi_{\mathcal{F}_2}$ be the marginal of $\pi$ over $\mathcal{F}_2$. The distribution resulting from player $1$ switching to playing a fixed action $a$ is denoted as $\pi_a = \delta_a \times \pi_{\mathcal{F}_2}$. This distribution induces a Markov chain with transition probabilities $\mathbb{P}\big[(a, b_2) \mid (a_1, b_1)\big] = \sum_{(f,g) \,:\, g(a_1) = b_2} \pi(f, g)$, and the transition matrix of this Markov process is denoted by $M_a$. The distribution $\pi_b$ and matrix $M_b$ are defined similarly for player $2$.
Since the no policy regret algorithms we work with do not directly induce distributions over the functional space $\mathcal{F}$ but rather only distributions over the action space $\mathcal{A}$, we would like to state all of our utility inequalities in terms of distributions over $\mathcal{A}$. Thus, we would like to check if there is a stationary distribution $\sigma_a$ of $M_a$ such that $\mathbb{E}_{(f,g) \sim \pi}[u_1(a,g)] = \mathbb{E}_{(a',b') \sim \sigma_a}[u_1(a',b')]$. This is indeed the case, as verified by the following theorem.
Theorem 4.4.
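Let $\pi$ be a distribution over the product of function spaces $\mathcal{F}_1 \times \mathcal{F}_2$. For any fixed $a \in \mathcal{A}_1$, there exists a stationary distribution $\sigma_a$ of the Markov chain $M_a$ such that $\mathbb{E}_{(a',b') \sim \sigma_a}[u_1(a',b')] = \mathbb{E}_{(f,g) \sim \pi}[u_1(a,g)]$. Similarly, for every fixed action $b \in \mathcal{A}_2$, there exists a stationary distribution $\sigma_b$ of $M_b$ such that $\mathbb{E}_{(a',b') \sim \sigma_b}[u_2(a',b')] = \mathbb{E}_{(f,g) \sim \pi}[u_2(f,b)]$.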
The proof of this theorem is constructive and can be found in the supplementary. With all of this notation we are ready to formally describe what nopolicy regret play promises in the game setting in terms of an equilibrium.
Definition 4.5.
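A distribution $\pi$ over $\mathcal{F}$ is a policy equilibrium if for all fixed actions $a \in \mathcal{A}_1$ and $b \in \mathcal{A}_2$, which generate Markov chains $M_a$ and $M_b$ respectively, with stationary distributions $\sigma_a$ and $\sigma_b$ from Theorem 4.4, there exists a stationary distribution $\sigma$ of the Markov chain $M$ induced by $\pi$ such that:
$$\mathbb{E}_{(a',b') \sim \sigma}[u_1(a',b')] \geq \mathbb{E}_{(a',b') \sim \sigma_a}[u_1(a',b')], \qquad \mathbb{E}_{(a',b') \sim \sigma}[u_2(a',b')] \geq \mathbb{E}_{(a',b') \sim \sigma_b}[u_2(a',b')]. \tag{3}$$
In other words, $\pi$ is a policy equilibrium if there exists a stationary distribution $\sigma$ of the Markov chain corresponding to $\pi$, such that, when actions are drawn according to $\sigma$, no player has incentive to change their action. For a simple example of a policy equilibrium see Section E in the supplementary.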
4.1 Convergence to the set of policy equilibria
We have tried to formally capture the notion of equilibria in which player ’s deviation would lead to a reaction from player and vice versa in Definition 4.5. This definition is inspired by the counterfactual guarantees of no policy regret play and we would like to check that if players’ strategies yield sublinear policy regret then the play converges to a policy equilibrium. Since the definition of sublinear policy regret does not include a distribution over functional spaces but only works with empirical distributions of play, we would like to present our result in terms of distributions over the action space . Thus we begin by defining the set of all product distributions , induced by policy equilibria as described in the previous subsection. Here and represent the deviation in strategy if player changed to playing the fixed action and player changed to playing the fixed action respectively as constructed in Theorem 4.4.
Definition 4.6.
For a policy equilibrium , let be the set of all stationary distributions which satisfy the equilibrium inequalities (3), . Define , where is the set of all policy equilibria.
Our main result states that the sequence of empirical product distributions formed after rounds of the game is going to converge to . Here and denote the distributions of deviation in play, when player switches to the fixed action and player switches to the fixed action respectively. We now define these distributions formally.
Definition 4.7.
Suppose player is playing an algorithm with output at time given by i.e. . Similarly, suppose player is playing an algorithm with output at time given by . The empirical distribution at time is , where is the product distribution over at time . Further let denote the distribution at time , provided that player switched their strategy to the constant action . Let denote the distribution over which puts all the probability mass on action . Let be the product distribution over , corresponding to the change of play at time . Denote by the empirical distribution corresponding to the change of play. The distribution is defined similarly.
Suppose that and are nopolicy regret algorithms, then our main result states that the sequence converges to the set .
Theorem 4.8.
If the algorithms played by player in the form of and player in the form of give sublinear policy regret sequences, then the sequence of product distributions converges weakly to the set .
In particular if both players are playing MWU or Exp3, we know that they will have sublinear policy regret. Not surprisingly, we can show something slightly stronger as well. Let , and denote the empirical distributions of observed play corresponding to , and , i.e. , where denotes the Dirac distribution, putting all weight on the played actions at time . Then these empirical distributions also converge to almost surely.
4.2 Sketch of proof of the main result
The proof of Theorem 4.8 has three main steps. The first step defines the natural empirical Markov chains , and from the empirical play (see Definition B.2) and shows that the empirical distributions , and are stationary distributions of the respective Markov chains. The latter is done in Lemma B.3. The next step is to show that the empirical Markov chains converge to Markov chains , and induced by some distribution over . In particular, we construct an empirical distribution and distributions and corresponding to player’s deviations (see Definition B.5), and show that these induce the Markov chains , and respectively (Lemma B.7). The distribution we want is now the limit of the sequence . The final step is to show that is a policy equilibrium. The proof goes by contradiction. Assume is not a policy equilibrium, this implies that no stationary distribution of and corresponding stationary distributions of and can satisfy inequalities (3). Since the empirical distributions , and of the play satisfies inequalities (3) up to an additive factor, we can show, in Theorem B.8, that in the limit, the policy equilibrium inequalities are exactly satisfied. Combined with the convergence of , and to , and , respectively, this implies that stationary distributions of , and , satisfying (3), giving a contradiction.
We would like to emphasize that the convergence guarantee of Theorem 4.8 does not rely on there being a unique stationary distribution of the empirical Markov chains , and or their respective limits . Indeed, Theorem 4.8 shows that any limit point of satisfies the conditions of Definition 4.5. The proof does not require that any of the respective Markov chains have a unique stationary distribution, but rather requires only that has sublinear policy regret. We would also like to remark that need not have a unique limit and our convergence result only guarantees that the sequence is going to the set . This is standard when showing that any type of no regret play converges to an equilibrium, see for example Stoltz and Lugosi (2007).
4.3 Relation of policy equilibria to CCEs
So far we have defined a new class of equilibria and shown that they correspond to no policy regret play. Furthermore, we know that if both players in a 2-player game play stable no external regret algorithms, then their play also has sublinear policy regret. It is natural to ask if every CCE is also a policy equilibrium: if $\sigma$ is a CCE, is there a corresponding policy equilibrium $\pi$ which induces a Markov chain $M$ for which $\sigma$ is a stationary distribution satisfying (3)? We show that the answer to this question is positive:
Theorem 4.9.
For any CCE $\sigma$ of a 2-player game $\mathcal{G}$, there exists a policy equilibrium $\pi$ which induces a Markov chain $M$ with stationary distribution $\sigma$.
To prove this, we show that for any CCE we can construct stable no external regret algorithms which converge to it. Since stable no external regret algorithms also have sublinear policy regret (Theorem 3.4), and hence converge to policy equilibria, this implies that the CCE is also a policy equilibrium.
However, we show the converse is not true: policy equilibria can give rise to behavior which is not a CCE. Our proof appeals to a utility sequence which is similar in spirit to the one in Theorem 3.2, but is adapted to the game setting.
Theorem 4.10.
There exists a 2player game and product distributions (where is defined in Definition 4.6 as the possible distributions of play from policy equilibria), such that is not a CCE of .
In Section E of the supplementary we give a simple example of a policy equilibrium which is not a CCE.
5 Discussion
In this work we gave a new twist on policy regret by examining it in the game setting, where we introduced the notion of policy equilibrium and showed that it captures the behavior of no policy regret players. While our characterization is precise, we view this as only the first step towards truly understanding policy regret and its variants in the game setting. Many interesting open questions remain. Even with our current definitions, since we now have a broader class of equilibria to consider, it is natural to go back to the extensive literature in algorithmic game theory on the price of anarchy and price of stability and reconsider it in the context of policy equilibria. For example, Roughgarden [2015] showed that in “smooth games” the worst CCE is no worse than the worst Nash equilibrium. Since policy equilibria contain all CCEs (Theorem 4.9), is the same true for policy equilibria?
Even more interesting questions remain if we change our definitions to be more general. For example, what happens with more than 2 players? With three or more players, definitions of “reaction” by necessity become more complicated. Or what happens when is not a constant? No policy regret algorithms exist for super-constant , but our notion of equilibrium requires to be constant in order for the Markov chains to make sense. Finally, what if we compare against deviations that are more complicated than a single action, in the spirit of swap regret or regret?
From an online learning perspective, note that our notion of on average stable and the definition of memory boundedness are different notions of stability. Is there one unified definition of “stable” which would allow us to give no policy regret algorithms against stable adversaries even outside of the game setting?
Acknowledgments
This work was supported in part by NSF BIGDATA grant IIS-1546482, NSF BIGDATA grant IIS-1838139, NSF CCF-1535987, NSF IIS-1618662, NSF CCF-1464239, and NSF AITF CCF-1535887.
References
 Allen-Zhu and Li [2016] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016.
 Arora et al. [2012a] Raman Arora, Ofer Dekel, and Ambuj Tewari. Deterministic MDPs with adversarial rewards and bandit feedback. In Proceedings on Uncertainty in Artificial Intelligence (UAI), 2012a.
 Arora et al. [2012b] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of International Conference on Machine Learning (ICML), 2012b.
 Arora et al. [2012c] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012c.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
 Dekel et al. [2014] Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T^{2/3} regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459–467. ACM, 2014.
 Even-Dar et al. [2009a] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009a.
 Even-Dar et al. [2009b] Eyal Even-Dar, Yishay Mansour, and Uri Nadav. On the convergence of regret minimization dynamics in concave games. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 523–532. ACM, 2009b.
 Farias and Megiddo [2006] Daniela Pucci De Farias and Nimrod Megiddo. Combining expert advice in reactive environments. Journal of the ACM (JACM), 53(5):762–799, 2006.
 Foster and Vohra [1997] Dean P Foster and Rakesh V Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1–2):40, 1997.
 Fudenberg and Levine [1999] Drew Fudenberg and David K Levine. Conditional universal consistency. Games and Economic Behavior, 29(1–2):104–130, 1999.
 Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
 Hazan and Kale [2008] Elad Hazan and Satyen Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In Advances in Neural Information Processing Systems, pages 625–632, 2008.
 Heidari et al. [2016] Hoda Heidari, Michael Kearns, and Aaron Roth. Tight policy regret bounds for improving and decaying bandits. In IJCAI, pages 1562–1570, 2016.
 Kakade et al. [2003] Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
 Koren et al. [2017a] Tomer Koren, Roi Livni, and Yishay Mansour. Bandits with movement costs and adaptive pricing. In Proceedings of the 2017 Conference on Learning Theory, pages 1242–1268, 2017a.
 Koren et al. [2017b] Tomer Koren, Roi Livni, and Yishay Mansour. Multi-armed bandits with metric movement costs. In Advances in Neural Information Processing Systems, pages 4119–4128, 2017b.
 Merhav et al. [2002] Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947–1958, 2002.
 Mohri and Yang [2014] Mehryar Mohri and Scott Yang. Conditional swap regret and conditional correlated equilibrium. In Advances in Neural Information Processing Systems, pages 1314–1322, 2014.
 Mohri and Yang [2017] Mehryar Mohri and Scott Yang. Online learning with transductive regret. In Advances in Neural Information Processing Systems, pages 5220–5230, 2017.
 Neu et al. [2010] Gergely Neu, András Antos, András György, and Csaba Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812, 2010.
 Roughgarden [2015] Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM (JACM), 62(5):32, 2015.
 Saha et al. [2012] Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. arXiv preprint arXiv:1211.6158, 2012.
 Stoltz and Lugosi [2007] Gilles Stoltz and Gábor Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59(1):187–208, 2007.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

 Szepesvári [2010] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
 Yu et al. [2009] Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
Appendix A Proofs of results from Section 3
Theorem A.1.
Let the space of actions be and define the memory bounded utility functions as follows:
Assuming is a fixed constant (independent of ), any sequence with sublinear policy regret will have linear external regret, and every sequence with sublinear external regret will have linear policy regret.
Proof of Theorem 3.2.
Let be a sequence with sublinear policy regret. Then this sequence has utility at least and so there are at most terms in the sequence which are not equal to . Let the subsequence consisting of all be indexed by . Define and consider the subsequence of functions . This is precisely the sequence of functions which have utility with respect to the sequence of play . Notice that the length of this sequence is at least . The utility of this sequence is , however, this subsequence has linear external regret, since . Thus the external regret of is
where the last inequality follows from the fact that the cardinality of is at most and thus .
For the converse, assume that has sublinear external regret. From the above argument, it follows that the utility of the sequence is at most (otherwise, if the sequence has utility , we can repeat the previous argument and get a contradiction with the fact that the sequence has no external regret). This implies that the policy regret of the sequence is . ∎
Theorem A.2.
Let and be the action sequences of player and respectively and suppose that they are coming from no-regret algorithms with regrets and respectively. Assume that the algorithms are on average stable with respect to the norm. Then
for any fixed action . A similar inequality holds for any fixed action and the utility of player .
Proof of Theorem 3.4.
where the first inequality holds by Cauchy-Schwarz and the second inequality holds by the stability of the algorithm, together with the inequality between and norms. ∎
The next theorems show that MWU and Exp3 are stable algorithms.
Theorem A.3.
MWU is an on average stable algorithm with respect to , for any .
Proof of Theorem A.3.
We think of MWU as Exponentiated Gradient (EG) where the loss vector has th entry equal to the negative utility if the player decided to play action . Let the observed loss vector at time be and the output distribution be ; then the update of EG can be written as , where is the simplex of the set of possible actions and is the KL-divergence. Using Lemma 3 in Saha et al. [2012], with and the fact that the KL-divergence is 1-strongly convex over the simplex with respect to the norm, we have:
where the second inequality follows from Hölder’s inequality and the fact that . Thus with step size , we have . Using the triangle inequality, we get and induction shows that . Suppose that for the last iterations a fixed loss function was played instead and the resulting output of the algorithm becomes . Then using the same argument as above we have and thus . Summing over all rounds concludes the proof. ∎
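The per-round stability of MWU used in the proof above can be observed numerically. The sketch below is an illustration under stated assumptions: utilities lie in [0, 1], the step size is 1/sqrt(T), and the function names `mwu_distributions` and `max_l1_step` are hypothetical. It runs MWU and measures the largest change in the 1-norm between consecutive distributions. The comparison value exp(eta) - 1 is an elementary bound (each probability is rescaled by a factor between exp(-eta) and exp(eta)) that is O(eta) for small eta; it is a sanity check and not the bound derived in the proof.

```python
import numpy as np

def mwu_distributions(utilities, eta):
    """Run MWU on a (T, K) array of utilities in [0, 1].

    Returns the (T + 1, K) array of distributions p_1, ..., p_{T+1},
    starting from the uniform distribution.
    """
    utilities = np.asarray(utilities, dtype=float)
    num_actions = utilities.shape[1]
    p = np.full(num_actions, 1.0 / num_actions)
    history = [p]
    for u in utilities:
        # Multiplicative update followed by renormalization onto the simplex.
        w = p * np.exp(eta * u)
        p = w / w.sum()
        history.append(p)
    return np.array(history)

def max_l1_step(history):
    """Largest 1-norm distance between consecutive distributions."""
    return np.abs(np.diff(history, axis=0)).sum(axis=1).max()

# Assumed experiment: random utilities, T rounds, K actions.
rng = np.random.default_rng(0)
T, K = 1000, 5
eta = 1.0 / np.sqrt(T)
history = mwu_distributions(rng.random((T, K)), eta)
step = max_l1_step(history)
print(f"max one-round L1 change: {step:.5f}, exp(eta) - 1: {np.exp(eta) - 1:.5f}")
```

A smaller step size makes the iterates move less per round, which is the trade-off between stability and adaptivity that the on average stable definition is meant to capture.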
Theorem A.4.
Exp3 is an on average stable algorithm with respect to , for any .
Proof of Theorem A.4.
The update at time , conditioning on the draw being , is given by and for , , where is the utility vector at time , is the weight vector at time i.e. , and is the number of actions. We have the following bound:
where the first inequality uses the fact that and the second inequality uses the choice of together with for . Similarly for , we have: