1 Introduction
Game theory is a well-established tool for studying interactions among self-interested players. Under the assumption that each player has complete information about the game composition, the focal point of game-theoretic studies has been the Nash equilibrium (NE), used to analyze game outcomes and predict the strategic behavior of rational players.
The difficulty of obtaining complete information in real-world applications gives rise to the formulation of repeated unknown games, where each player has access to only local information such as his own actions and utilities, but is otherwise unaware of the game composition or even the existence of opponents. In such a setting, a rational player improves his decision-making through real-time interactions with the system and learns from past experiences [1]. The problem can be viewed through the lens of distributed online learning, where the central question is whether the learning dynamics of distributed players lead to a system-level equilibrium in some sense. Studies in the past few decades have revealed intriguing connections between various notions of no-regret learning at each player and certain relaxed versions of NE at the system level [1, 2].
While one step closer to real-world systems, repeated unknown games, in their canonical forms, often adopt idealistic assumptions in terms of the stationarity of the player population and their utilities, the availability of complete and perfect feedback, the full rationality of players with unbounded cognition and computation capacity, and homogeneity among players in their knowledge of the game. Many emerging multiagent systems, however, are inherently dynamic and heterogeneous, and inevitably limited in terms of available information and the cognition and computation capacity of the players. We give below two examples.
Example: adversarial machine learning.
Security issues are at the forefront of machine learning and deep learning research, especially in safety-critical and risk-sensitive applications. The interaction between the defender and the attacker can be modeled as a two-player game. While the player population may be small, the game is highly complex in terms of the action space, utilities, feedback models, and the available knowledge each player has about the other. In particular, the attacker is characterized by its knowledge—how much information it has for designing attacks—and power—how often a successful attack can be launched. Both can be dynamically changing and adaptive to the strategies of the defender. A full spectrum of attacker profiles has been considered, ranging from the so-called black-box model to the white-box model (i.e., an omniscient attacker). The attack process is also dynamic, often exhibiting bursty behaviors following a successful intrusion or a system malfunction. The action space of the attacker can be equally diverse, including poisoning attacks and perturbation attacks. The former targets the training phase by injecting corrupted labels and examples for the purpose of embedding wrong decision rules into the machine learning algorithm. The latter targets the blind spots of a fully trained artificial intelligence using strategically perturbed instances that trigger wrong outputs, even when the perturbation is so minute as to be indiscernible to humans. In terms of utilities, the attacker's goal may be to compromise the integrity of the system (i.e., to evade detection by causing false negatives) or the availability of the system (to flood the system with false positives). See [3] for a comprehensive taxonomy of attacks against machine learning systems.
Example: transportation systems.
Route selection in urban transportation is a typical example of a non-cooperative game repeated over time. The game is characterized by a large population of players that is both dynamic and heterogeneous, with vehicles leaving and joining the system and utilities varying across players and over time. The envisioned large-scale adoption of autonomous vehicles will further diversify the traffic composition. Autonomous vehicles are significantly different from human drivers in terms of decision-making rationality, access to and usage of system-level knowledge, and memory and computation power. Bounded rationality is more evident in human drivers: they are likely to select a familiar route and inclined to settle for satisficing yet suboptimal options.
Complex multiagent systems as in the above examples call for new game models, new concepts of regret, new designs of distributed learning algorithms, and new techniques for analyzing game outcomes. We present in this article representative results on distributed no-regret learning in multiagent systems. We start in Sec. 2 with a brief review of background knowledge on classical repeated unknown games. In the subsequent four sections, we explore four game characteristics—dynamicity, incomplete and imperfect feedback, bounded rationality, and heterogeneity—that challenge the classical game models. For each characteristic, we illuminate its implications and ramifications in game modeling, notions of regret, feasible game outcomes, and the design and analysis of distributed learning algorithms. Limited by our understanding of this expansive research field and constrained by the page limit, the coverage is inevitably incomplete. We hope the article nevertheless provides an informative glimpse of the current landscape of this field and stimulates future research interests.
2 Distributed Learning in Repeated Unknown Games
In this section, we review key concepts in game theory and highlight classical results on distributed learning in repeated unknown games.
2.1 Static Games and Equilibria
An $N$-player static game is represented by a tuple $(\mathcal{N}, \mathcal{A}, \mathcal{U})$, where $\mathcal{N} = \{1, \ldots, N\}$ is the set of players, $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ the Cartesian product of each player's action space $\mathcal{A}_i$, and $\mathcal{U} = \{u_i\}_{i=1}^{N}$ the utility functions that capture the interaction among players. Specifically, the utility function $u_i$ of player $i$ encodes his preference towards an action. It is a mapping from the action profile $\mathbf{a} = (a_1, \ldots, a_N) \in \mathcal{A}$ of all players to player $i$'s reward $u_i(\mathbf{a}) \in \mathbb{R}$.
A Nash equilibrium (NE) is an action profile $\mathbf{a}^*$ under which no player can increase his reward via a unilateral deviation. Specifically, $u_i(a_i^*, a_{-i}^*) \ge u_i(a_i, a_{-i}^*)$ for all $i \in \mathcal{N}$ and all $a_i \in \mathcal{A}_i$, where $a_{-i}$ denotes the action profile after excluding player $i$. Due to the focus on deterministic actions (also called pure strategies), the resulting equilibrium is a pure Nash equilibrium. A player may also adopt a mixed strategy $x_i$, which is a probability distribution over the action space $\mathcal{A}_i$. Correspondingly, a mixed Nash equilibrium is a product distribution $\mathbf{x}^* = x_1^* \times \cdots \times x_N^*$ under which the expected utility of every player is no smaller than that under a unilateral deviation in his own strategy. A game with a finite population and a finite action space has at least one mixed NE but may not have any pure NE [4].
NE is defined under the assumption that players adopt independent strategies (note the product form of $\mathbf{x}^*$). A more general equilibrium—correlated equilibrium (CE)—allows correlation across players' strategies. We note that for the equilibrium definitions introduced here, we focus on games with a finite action space. Specifically, a CE is a joint probability distribution $\pi$ over $\mathcal{A}$ (not necessarily in a product form) satisfying $\mathbb{E}_{\mathbf{a} \sim \pi}[u_i(a, a_{-i}) \mid a_i = a] \ge \mathbb{E}_{\mathbf{a} \sim \pi}[u_i(a', a_{-i}) \mid a_i = a]$ for all $i \in \mathcal{N}$, $a \in \mathcal{A}_i$, and $a' \in \mathcal{A}_i$, where the expectation is over the joint strategy $\pi$ conditioned on the realized action of player $i$ being $a$. The concept of CE can be interpreted by introducing a mediator, who draws an outcome $\mathbf{a}$ from $\pi$ and privately recommends action $a_i$ to player $i$. The equilibrium condition states that no player has the incentive to deviate from the outcome of the correlated draw from $\pi$ after his part is revealed. CE can be further relaxed to the so-called coarse correlated equilibrium (CCE), which is a joint distribution $\pi$ satisfying $\mathbb{E}_{\mathbf{a} \sim \pi}[u_i(\mathbf{a})] \ge \mathbb{E}_{\mathbf{a} \sim \pi}[u_i(a', a_{-i})]$ for all $i \in \mathcal{N}$ and all $a' \in \mathcal{A}_i$. Different from CE, CCE imposes an equilibrium condition that is realization-independent.
The four types of equilibria exhibit a sequential inclusion relation as illustrated in Fig. 1. The more general set of strategy profiles (i.e., allowing correlated strategies across players) in CE and CCE may lead to higher expected utilities summed over all players. CE and CCE can also be computed via linear programming, while pure NE and mixed NE are hard to compute [4]. More importantly, CE and CCE are learnable through certain learning dynamics of players when a game is played repeatedly, as discussed next. A caveat is that the set of CCE may contain highly non-rational strategies that choose only strictly dominated actions (actions that are suboptimal responses to all action profiles of the other players). See [6] for specific examples.
2.2 Repeated Unknown Games and No-Regret Learning
A repeated game consists of repetitions of a static game (referred to as the stage game in this context).^1 In a repeated unknown game, after taking an action $a_i^t$ (potentially randomized according to a mixed strategy) in the $t$th stage, player $i$ accrues a utility $u_i(a_i^t, a_{-i}^t)$ and observes the entire utility vector $\{u_i(a, a_{-i}^t)\}_{a \in \mathcal{A}_i}$ for all actions in his action space (we focus on a finite action space here) against the action profile $a_{-i}^t$ of the other players. The actions and utilities of the other players, however, are unknown and unobservable.
^1 In a general definition of a repeated game [7], the stage game is parameterized by a state, which affects the utility function. Two basic settings exist in the literature: (i) the state evolves over time following a Markov transition rule (the state in the next stage depends on the state and actions in the current stage); (ii) the state is fixed throughout all stages. We focus on the second setting in discussing classical results on repeated games.
From a single player's perspective, a repeated unknown game can be viewed as an online learning problem where the player chooses actions sequentially in time by learning from past experiences. A commonly adopted performance measure in online learning is regret, defined as the cumulative reward loss against a properly defined benchmark policy with hindsight vision and/or certain clairvoyant knowledge about the game. In other words, the benchmark policy defines the learning objective that an online algorithm aims to achieve over time. Different benchmark policies lead to different regret measures. Two classical regret notions are the external regret and the internal regret, as detailed below.
Let $\mathcal{P}_i$ denote the online learning algorithm adopted by player $i$. For a fixed action sequence $\{a_{-i}^t\}_{t=1}^{T}$ of the other players, the external regret of $\mathcal{P}_i$ is defined as:
$$R_{\mathrm{ext}}(T) = \max_{a \in \mathcal{A}_i} \sum_{t=1}^{T} u_i(a, a_{-i}^t) - \mathbb{E}\left[\sum_{t=1}^{T} u_i(a_i^t, a_{-i}^t)\right], \quad (1)$$
where $\mathbb{E}$ denotes the expectation over the random action process $\{a_i^t\}_{t=1}^{T}$ induced by $\mathcal{P}_i$. In other words, the benchmark policy in the external regret chooses the best fixed response to the other players' actions in hindsight. The internal regret of $\mathcal{P}_i$ is defined as:
$$R_{\mathrm{int}}(T) = \max_{a, a' \in \mathcal{A}_i} \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{a_i^t = a\}\left(u_i(a', a_{-i}^t) - u_i(a, a_{-i}^t)\right)\right], \quad (2)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. In this definition, the benchmark policy is the best hindsight modification of $\mathcal{P}_i$ obtained by swapping a single action with another throughout all stages.
An online learning algorithm $\mathcal{P}_i$ is said to achieve the no-regret condition if, against all action sequences of the other players, the cumulative regret has a sublinear growth rate with the time horizon $T$. In other words, $\mathcal{P}_i$ offers, asymptotically as $T \to \infty$, the same average reward per stage as the specific benchmark policy adopted in the corresponding regret measure. No-regret learning is also referred to as Hannan consistency due to the original work [8] as well as [9].
It is clear that the significance of no-regret learning depends on the benchmark policy against which the learning algorithm is measured. A benchmark policy with stronger performance leads to a stronger notion of regret. In particular, the internal regret is a stronger notion than the external regret: no-regret learning under the former implies no-regret learning under the latter, but not vice versa [10].
A number of no-regret learning algorithms exist in the literature. Representative algorithms achieving no-external-regret learning include Multiplicative Weights (MW) (also known as the Hedge algorithm) and Follow the Perturbed Leader [1]. Both are randomized policies, as randomization is necessary for achieving no-regret learning in an adversarial setting with general reward functions [1]. In particular, under the MW algorithm, each player maintains a weight for each action at every stage based on past rewards: $w_a^{t+1} = w_a^t \exp(\eta\, r_a^t)$, where $r_a^t$ is the reward received under action $a$ at stage $t$ and $\eta$ is the learning rate. The probability of choosing $a$ in the next stage is proportional to its weight, given by $p_a^{t+1} = w_a^{t+1} / \sum_{a' \in \mathcal{A}_i} w_{a'}^{t+1}$.
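As a concrete illustration, the MW (Hedge) update can be sketched in a few lines of Python. The reward oracle `reward_fn`, the horizon `T`, and rewards normalized to $[0,1]$ are assumptions of this sketch rather than part of the algorithm's specification:

```python
import numpy as np

def hedge(reward_fn, n_actions, T, eta=0.1, seed=0):
    """Multiplicative Weights (Hedge) with full-information feedback.

    reward_fn(t, a) returns the reward of action a at stage t, assumed in [0, 1].
    Returns the external regret against the best fixed action in hindsight.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(n_actions)          # one weight per action
    cum = np.zeros(n_actions)       # cumulative reward of each fixed action
    received = 0.0
    for t in range(T):
        p = w / w.sum()             # play a with probability proportional to w_a
        a = rng.choice(n_actions, p=p)
        r = np.array([reward_fn(t, b) for b in range(n_actions)])  # full feedback
        received += r[a]
        cum += r
        w *= np.exp(eta * r)        # w_a <- w_a * exp(eta * r_a)
    return cum.max() - received
```

With a suitably tuned learning rate, the returned regret grows only sublinearly in the horizon, matching the no-external-regret guarantee discussed above.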
For no-internal-regret learning, a representative algorithm is Regret Matching [11]. Let $R^t(a, a')$ denote the average gain per play obtained by switching from action $a$ to an alternative $a'$ in the past $t$ plays. In the $(t+1)$th stage, the probability of switching from the previous action $a^t$ to an alternative $a'$ is given by $\frac{1}{\mu}[R^t(a^t, a')]^+$, where $\mu$ is a normalization parameter chosen to ensure a positive probability of staying with action $a^t$. Regret Matching also offers no-external-regret learning by setting the probability of selecting action $a$ at the $(t+1)$th stage to the normalized average gain per play from playing action $a$ throughout the past $t$ plays, i.e., $p_a^{t+1} = [R^t(a)]^+ / \sum_{a' \in \mathcal{A}_i} [R^t(a')]^+$, where $R^t(a)$ is the average gain that would have been obtained by playing $a$ in all past stages [11].
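The external-regret variant described last can likewise be sketched as follows; the reward oracle and horizon are placeholders, and uniform play whenever no action has positive regret is one common convention, assumed here:

```python
import numpy as np

def regret_matching(reward_fn, n_actions, T, seed=0):
    """External-regret matching: play each action with probability
    proportional to the positive part of its cumulative regret."""
    rng = np.random.default_rng(seed)
    cum = np.zeros(n_actions)   # cumulative reward of each fixed action
    received = 0.0
    for t in range(T):
        pos = np.maximum(cum - received, 0.0)   # regret for not having played a
        if pos.sum() > 0:
            p = pos / pos.sum()
        else:
            p = np.full(n_actions, 1.0 / n_actions)  # no regret yet: play uniformly
        a = rng.choice(n_actions, p=p)
        r = np.array([reward_fn(t, b) for b in range(n_actions)])
        received += r[a]
        cum += r
    return cum.max() - received
```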
2.3 System-Level Performance under No-Regret Learning
Regret captures the learning objective of an individual player. At the system level, it is desirable to know whether the dynamical behaviors of distributed players converge to an equilibrium in some sense and whether self-interested regret minimization promises a certain level of optimality in terms of social welfare.
For the first question, it has been shown that if every player adopts a no-external-regret learning algorithm, the empirical distribution of the sequence of actions taken by all players converges to the set of CCE of the stage game [5]. No-regret learning under the internal regret measure guarantees convergence to the more restrictive set of CE [11]. Such convergence results are, however, in terms of the empirical frequency of the players' actions rather than the actual sequence of plays. The convergence is also only to the set of equilibria, rather than to a particular equilibrium in the corresponding set. In fact, by treating learning in games as a dynamical system, recent studies have shown that in the continuous-time setting, the actual plays under no-regret learning algorithms (such as Follow the Regularized Leader) may exhibit cycles rather than convergence [12]. In the discrete-time setting, it has been shown that in zero-sum games, the actual plays under the MW algorithm (starting from a non-equilibrium initial strategy) diverge from every fully mixed NE [13]. For games with special structures (e.g., potential games [14] with a finite action space and bilinear smooth games [15] with a continuum of actions), however, stronger results on the convergence of the actual plays to the more restrictive set of (mixed) NE have been established.
In addition to the convergence of learning dynamics, the social welfare resulting from the self-interested learning of individual players is of great interest in many applications. In (known) static games, the loss in social welfare (i.e., the system-level utility $W(\mathbf{x})$ under a strategy profile $\mathbf{x}$) due to the self-interested behaviors of players is quantified by the price of anarchy (POA). It is defined as the ratio of the optimal social welfare among all strategies to the smallest social welfare in the set of mixed NE. For repeated unknown games, a corresponding concept, the price of total anarchy (POTA), is defined as:
$$\mathrm{POTA}(T) = \frac{\max_{\mathbf{x}} W(\mathbf{x})}{\frac{1}{T}\sum_{t=1}^{T} W(\mathbf{x}^t)}, \quad (3)$$
where $\{\mathbf{x}^t\}_{t=1}^{T}$ is the sequence of strategy profiles in the no-regret dynamics of all players. It has been shown that in games with special structures (e.g., valid games and congestion games), no-regret learning guarantees a POTA that converges to the POA of the stage game even though the sequence of actual plays may not converge to a (mixed) NE [16]. The convergence of the POTA to the POA of the stage game implies that no-regret learning can fully negate the impact of the unknown nature of the game on social welfare. The result was later extended in [5] to a general class of games referred to as smooth games (which includes valid games and congestion games as special cases). To achieve higher social welfare, cooperation among players is necessary. For example, if every player agrees to follow a learning algorithm designed specifically for optimizing the system-level performance, the optimal action profile will be selected a high percentage of time [17].
3 Dynamicity
In a dynamic repeated game, the stage game is time-varying. The dynamicity may be in any of the three elements of the game composition: the set of players, the action space, and the utility functions.^2
^2 Note that the general definition of repeated games in [7] includes dynamicity in the utility function, as the state parameter may evolve over time following a Markov transition rule. The dynamic repeated game discussed in this section differs from the general repeated game in two aspects: (i) the set of players and the action space can also be time-varying; (ii) the utility functions are in general independent across stages.
3.1 Notions of Regret
Dynamic unknown games call for new notions of regret to provide meaningful performance measures for distributed online learning algorithms. Specifically, the benchmark policy of a fixed single best action used in the external regret, and that of a fixed single best action modification used in the internal regret, can be highly suboptimal in dynamic games. As a result, achieving no-regret learning under such regret measures can no longer serve as a stamp of good performance.
A rather immediate extension of the external regret is to consider every interval of the learning horizon and measure the cumulative loss against a single best action in hindsight that is specific to each interval. This leads to the notion of adaptive regret, under which no-regret learning requires a sublinear growth of the cumulative reward loss in every interval as the interval length tends to infinity. The adaptive regret is particularly suitable for piecewise-stationary systems where changes can be abrupt but infrequent. Classical learning algorithms such as MW can be extended to achieve no-adaptive-regret learning [18]. The key issue in algorithm design is a mechanism to discount experiences from the distant past.
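One simple way to realize such discounting is to decay past reward scores geometrically before each update, so that only a window of recent experience effectively drives the weights. The sketch below is illustrative rather than the specific construction of [18]; the discount factor `gamma`, the learning rate, and the reward oracle are all assumptions:

```python
import numpy as np

def discounted_hedge(reward_fn, n_actions, T, eta=0.2, gamma=0.99, seed=0):
    """Hedge with exponentially discounted past rewards, so that old
    experience is forgotten and abrupt but infrequent changes can be tracked."""
    rng = np.random.default_rng(seed)
    score = np.zeros(n_actions)     # discounted cumulative reward per action
    actions = []
    for t in range(T):
        w = np.exp(eta * score)
        p = w / w.sum()
        a = rng.choice(n_actions, p=p)
        actions.append(a)
        r = np.array([reward_fn(t, b) for b in range(n_actions)])
        score = gamma * score + r   # discount old rewards, then add new ones
    return actions
```

The effective memory is on the order of $1/(1-\gamma)$ stages, which governs how quickly the algorithm re-adapts after a change point.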
Another extension of the external regret is the so-called dynamic regret, in which the benchmark policy can be an arbitrary sequence of actions, as opposed to a fixed action throughout an interval of growing length. Achieving diminishing reward loss against all sequences of actions is, however, unattainable. Constraints on either the benchmark action sequence or the reward functions are necessary for defining a meaningful measure. On the variation of the benchmark action sequence, a commonly adopted constraint in the setting with finite actions is that the benchmark sequence $\{b^t\}_{t=1}^{T}$ is piecewise-stationary with at most $K$ changes (the thus-defined regret is also referred to as the $K$-shifting regret). In this case, the no-adaptive-regret condition directly implies no-dynamic-regret learning [18]. With a continuum of actions, the constraint is often imposed on the cumulative distance between every two consecutive actions in the sequence, i.e., $V_T = \sum_{t=2}^{T} \|b^t - b^{t-1}\|$. It has been shown that if the benchmark sequence is slow-varying, i.e., $V_T = o(T)$, no-dynamic-regret learning is achievable through well-designed restart procedures [19]. The variation constraint can also be applied to the reward functions. A typical example with a continuum of actions is the sublinear "variation budget" assumption. Specifically, the cumulative variation between the reward functions in two consecutive stage games grows sublinearly in $T$, i.e., $\sum_{t=2}^{T} \sup_{\mathbf{a}} |u^t(\mathbf{a}) - u^{t-1}(\mathbf{a})| = o(T)$. Similar constraints can be imposed on the gradient of the utility functions, with the variation measured by a suitable norm. See [20] and references therein for details and corresponding no-regret learning algorithms.
The external regret and its extensions are measured against an alternative strategy of a single player. A new notion of regret—Nash equilibrium regret—considers a benchmark policy that is jointly determined by the strategies of all players [21]. Consider a repeated game with time-varying utility functions $\{u_i^t\}_{t=1}^{T}$ for each player $i$. Let $\bar{u}_i = \frac{1}{T}\sum_{t=1}^{T} u_i^t$ be the average utility function and $\mathbf{x}^*$ a mixed NE of the static game defined by the average utility functions $\{\bar{u}_i\}$. The NE regret of player $i$ following a policy $\mathcal{P}_i$ is then given by $\big|\sum_{t=1}^{T} u_i^t(\mathbf{a}^t) - T\,\bar{u}_i(\mathbf{x}^*)\big|$, where $\mathbf{a}^t$ is the action profile selected by the policies of all players at stage $t$. No-regret learning under the NE regret ensures that each player's average reward asymptotically matches that promised by the mixed NE under the average utility functions. A centralized learning algorithm achieving no-NE-regret learning was developed in [21] for repeated two-player zero-sum games with arbitrarily varying utility functions. Achieving no-regret learning under the measure of NE regret in a distributed setting, however, remains open.
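For the two-player zero-sum case treated in [21], the mixed NE of the game defined by an (averaged) payoff matrix, which serves as the NE-regret benchmark, can be computed by linear programming. A minimal sketch, assuming SciPy is available and the row player maximizes:

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_ne(A):
    """Mixed NE (row player's strategy and game value) of a zero-sum game
    with payoff matrix A (rows maximize A, columns minimize it).

    Solves: max v  s.t.  x^T A[:, j] >= v for every column j,  x a distribution.
    """
    m, n = A.shape
    # variables: x_1..x_m, v ; linprog minimizes, so minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])   # v - x^T A[:, j] <= 0 for each j
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                           # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]   # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]
```

For matching pennies, for instance, the routine recovers the uniform strategy with game value zero.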
3.2 System-Level Performance
The two key measures—convergence to equilibria and POTA—for system-level performance also need to be modified to take game dynamics into account. The time-varying sequence of stage games defines a sequence of equilibria and a sequence of optimal social welfare. The desired relation between no-regret learning dynamics at individual players and the system-level equilibria is thus in terms of tracking rather than converging. For the definition of POTA, the optimal social welfare in the numerator in (3) needs to be replaced with the average optimal social welfare $\frac{1}{T}\sum_{t=1}^{T} \max_{\mathbf{x}} W^t(\mathbf{x})$.
An online learning algorithm is said to successfully track the sequence of (mixed) NE in a dynamic game if the average distance between the sequence of (mixed) action profiles resulting from the algorithm and the sequence of (mixed) NE vanishes as $T$ tends to infinity. A representative study in [19] considers a game with a continuum of actions and dynamicity manifesting only in the utility functions. Under the assumptions that the sequence of NE is slow-varying and the utility functions are monotonic, it was shown that learning algorithms with sublinear dynamic regret successfully track the sequence of NE. The monotonicity of the utility functions plays a key role in the analysis: it translates the closeness between the learning dynamics and the NE in terms of the cumulative reward (as in the regret measure) into closeness in terms of their distance in the action space (the concern of the tracking outcome).
The performance of no-regret learning in terms of social welfare was studied in [22] for games with a dynamic population of players. Specifically, in each stage, each player may independently exit with a fixed probability and is subsequently replaced with a new player with a potentially different utility function (the population size is therefore fixed and the player set is a stationary process over time). For structural games such as first-price auctions, bandwidth allocation, and congestion games, the relation between no-adaptive-regret learning and the average optimal social welfare was examined.
Game dynamics can take diverse forms, and a holistic understanding of the matching between regret notions and the underlying dynamics of the game is still lacking. Different forms of game dynamics demand different benchmark policies in order to arrive at a meaningful regret measure that lends significance to the stamp of "no-regret learning" yet at the same time is attainable. Viewed from a different angle, one may pose the fundamental question of what kinds of game dynamics are tamable through distributed online learning, making no-regret learning and approximately optimal social welfare feasible.
4 Incomplete and Imperfect Feedback
Learning and adaptation rely on feedback. The quality of feedback, in terms of completeness and accuracy, thus has significant implications for no-regret learning. We explore this issue in this section.
4.1 Incomplete Feedback
Incomplete feedback stands in contrast to full-information feedback, where the utilities of all actions a player could have taken are observed in each stage. Incompleteness can be spatial across the action space or temporal across decision stages. In the former case, a commonly studied model is the so-called bandit feedback, where only the utility of the chosen action is revealed. In the latter, the feedback model is referred to as lossy feedback, where there are decision stages with no feedback [23]. One can easily envision a more general model compounding bandit feedback with lossy feedback. Studies on this general model are lacking in the literature.
The term "bandit feedback" has its roots in the classical problem of the multi-armed bandit [24]. The name of the problem comes from likening an archetypal single-player online learning problem to playing a multi-armed slot machine (known as a bandit for its ability to empty the player's pocket). Each arm, when pulled, generates rewards according to an unknown stochastic model or in an adversarial fashion. Only the reward of the chosen arm is revealed after each play. Due to the incomplete feedback, the player faces the tradeoff between exploration (to gather information from less explored arms) and exploitation (to maximize immediate reward by favoring arms with a good reward history).
In a multi-player game setting with bandit feedback, no-regret learning from an individual player's perspective can be cast as a single-player non-stochastic/adversarial bandit model where the payoff of each arm/action is adversarially chosen and aggregates the interaction with the other players in the game. The concept of external regret in the game setting corresponds to the weak regret in the adversarial bandit model [25], which adopts the best single-arm policy in hindsight as the benchmark. The MW algorithm was modified in [25] to handle the change of the feedback model from full-information to bandit. Specifically, the weight of action $a$ at time $t$ is updated as $w_a^{t+1} = w_a^t \exp(\eta\, \hat{r}_a^t)$, where $\hat{r}_a^t = r_a^t / p_a^t$ if $a$ is selected, $p_a^t$ is the probability of selecting action $a$ at time $t$, and $\hat{r}_a^t = 0$ if $a$ is unselected. Dividing the observed reward by the corresponding probability of the chosen action ensures the unbiasedness of the observation. Quite intuitively, the price of not observing the rewards of all actions is a degradation of the regret order in the size of the action space, i.e., from $O(\sqrt{T \log |\mathcal{A}_i|})$ in the full-information setting [1] to $O(\sqrt{T\, |\mathcal{A}_i| \log |\mathcal{A}_i|})$ in the bandit setting [26].
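The resulting bandit variant of MW (the EXP3 algorithm of [25]) differs from the full-information sketch only in the importance-weighted reward estimate. A simplified sketch, omitting the explicit uniform-exploration mixing used in some formulations of EXP3; the reward oracle and horizon are placeholders:

```python
import numpy as np

def exp3(reward_fn, n_actions, T, eta=0.05, seed=0):
    """EXP3: multiplicative weights adapted to bandit feedback via the
    importance-weighted estimate r_hat = r / p for the chosen arm only."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_actions)
    received = 0.0
    for t in range(T):
        p = w / w.sum()
        a = rng.choice(n_actions, p=p)
        r = reward_fn(t, a)            # only the chosen arm's reward is observed
        received += r
        r_hat = np.zeros(n_actions)
        r_hat[a] = r / p[a]            # unbiased estimate of the full reward vector
        w *= np.exp(eta * r_hat)
    return received
```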
The multi-player bandit problem explicitly models the existence of $M$ players competing for $K$ ($M \le K$) arms [27]. Originally motivated by applications in wireless communication networks where distributed users compete for access to multiple channels, this specific game model is characterized by a special form of interaction among players: a collision occurs when multiple players select the same arm, which results in utility loss. The objective of this distributed learning problem is to minimize the system-level regret over all players against the optimal centralized (hence collision-free) allocation of the players to the best set of arms [27]. In addition to the exploration-exploitation tradeoff in the single-player setting, this distributed learning problem under a system-level objective also faces the tradeoff between selecting a good arm and avoiding collisions with competing players. A number of distributed learning algorithms have been developed to achieve a sublinear system-level regret with respect to $T$ [27]. Recent extensions of the multi-player bandit problem further consider the setting where each arm offers different payoffs across players [28].
The multi-player bandit problem is a special game model in that the players have an identical action space and their interaction takes only the form of "collisions" when choosing the same action. In a general game setting, the impact of incomplete feedback on no-regret learning and system-level performance is largely open. One quantitative measure of the impact is the regret order with respect to the size of the action space. As mentioned above, bandit feedback results in an additional $\sqrt{|\mathcal{A}_i|}$ factor in the regret order, which can be significant when the action space is large. Recent work [29, 30] has shown that local communications among neighboring players in a network setting can mitigate the negative impact of bandit feedback on the regret order in $|\mathcal{A}_i|$. In terms of the impact on the system-level performance, it has been shown under a game model with a continuum of actions that bandit feedback degrades the convergence rate of the learning dynamics to equilibria [31].
4.2 Imperfect Feedback
Imperfect feedback refers to the inaccuracy of the observed utilities in revealing the quality of the selected actions. Recall that mixed strategies are necessary for achieving no-regret learning in the adversarial setting. The quality of a mixed strategy is characterized by the expected utility, where the expectation is taken over the randomness of the strategies of all players. The feedback model assuming observations of the expected utility, referred to as expected feedback, can however be unrealistic. A more commonly adopted feedback model is the realized feedback, where only the utility of the realized action profile is revealed. The realized feedback can be viewed as a noisy unbiased estimate of the expected feedback, where the noise is due to the randomness of players' strategies.
The so-called noisy feedback assumes a different source of noise: it comes from the external environment and is additive to either the observed utility vectors in the so-called semi-bandit feedback [14] with a finite action space, or the gradient of the utility functions in the first-order feedback [32] with a continuum of actions. Under the assumptions of unbiasedness and bounded variance, the issue of additive noise can be addressed by rather standard estimation techniques and analysis. A more challenging setting is to consider non-stochastic noise due to adversarial attacks, especially in applications such as adversarial machine learning. This problem was recently studied in the single-player setting [33]. Studies in the multi-agent setting are still lacking.
5 Bounded Rationality
The concept of bounded rationality was first introduced in economics [34] to provide more realistic models than the often adopted perfect rationality, which assumes the decision-making of players is the result of a full optimization of their utilities. In reality, players often take reasoning shortcuts that may lead to suboptimal decisions. Such reasoning shortcuts may be a result of the limited cognition of human minds or necessitated by the available computation time and power relative to the complexity of action optimization.
Cognitive limitations include the limited ability to anticipate other decision-makers' strategic responses and certain psychological factors that interfere with the valuation of options. Various models exist for capturing the limitations in the players' valuation of options. For example, a player may be myopic, focusing only on the short-term reward [35]. Even with forward-thinking, a player may settle for suboptimal actions perceived as acceptable [34]. The limitation in a player's ability to anticipate other players' strategies can be modeled through a cognitive hierarchy, which groups players according to their cognitive abilities and characterizes them in an iterative fashion. Specifically, players with the lowest level of cognitive ability are grouped as level-0 players, who make decisions randomly. Level-k (k ≥ 1) players are then defined iteratively as those who assume they are playing against lower-level players and anticipate the opponents' strategies accordingly. Recent work draws an interesting connection between the cognitive hierarchy model and the Optimistic Mirror Descent (OMD) algorithm for solving the saddle-point problem, with applications in generative adversarial networks [36]. The saddle-point problem can be viewed as a two-player zero-sum game with a continuum of actions, whose solutions correspond to the set of NE. It has been shown that OMD guarantees convergence of the system dynamic to an NE in terms of the actual plays, while Gradient Descent (GD) may lead to cycles [36]. In the language of cognitive hierarchy, players adopting GD can be regarded as level-0 thinkers in the sense that they do not anticipate the strategies of their opponents. Players adopting OMD are level-1 thinkers, since they take advantage of the fact that their opponents are running similar gradient methods, which will not lead to abrupt gradient changes between two consecutive stages [36]. Consequently, an extra gradient update is applied in OMD to accelerate learning.
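The contrast can be reproduced on the bilinear saddle-point problem f(x, y) = xy, a two-player zero-sum game whose unique NE is the origin. The sketch below (step size and iteration count are illustrative choices, not values from [36]) runs plain gradient descent-ascent, which spirals away from the NE, against the optimistic variant with its extra gradient correction, which converges:

```python
import math

def gda(eta=0.1, steps=2000):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x  # descend in x, ascend in y
    return math.hypot(x, y)             # distance from the NE at (0, 0)

def ogda(eta=0.1, steps=2000):
    """Optimistic variant: corrects with the gradients of the previous iterate."""
    x, y = 1.0, 1.0
    px, py = x, y  # previous iterate, which supplies last round's gradients
    for _ in range(steps):
        nx = x - 2 * eta * y + eta * py
        ny = y + 2 * eta * x - eta * px
        px, py, x, y = x, y, nx, ny
    return math.hypot(x, y)

print("GDA distance from NE: ", gda())   # grows without bound (cycling outward)
print("OGDA distance from NE:", ogda())  # shrinks toward zero
```

The only difference between the two updates is the anticipation term: OGDA extrapolates assuming the opponent's gradient changes slowly between stages, which is exactly the level-1 reasoning described above.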
Besides cognitive limitations, players are also constrained in terms of physical resources such as memory and computation power. Acquiring, storing, and processing all relevant information for decision-making may be infeasible, especially in complex systems with a large action space. For example, players may only choose from strategies with bounded complexity [37], or use only recent observations in decision-making due to memory constraints [38].
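As a toy illustration of a memory-constrained strategy of the kind studied in [38], the sketch below pits tit-for-tat, a 1-memory strategy that conditions only on the opponent's most recent move, against unconditional defection in a repeated prisoner's dilemma (the payoff values are standard illustrative choices, not taken from [38]):

```python
# Row player's payoff in the prisoner's dilemma: (my_move, their_move) -> reward.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opp_history):
    """A 1-memory strategy: cooperate first, then repeat the opponent's last move."""
    return "C" if not opp_history else opp_history[-1]

def always_defect(opp_history):
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b = [], []  # moves played by a and by b, respectively
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)  # each sees the opponent's history
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, always_defect))  # (9, 14): one exploited round, then mutual defection
```

Despite remembering only a single observation, tit-for-tat limits its loss to the first round; this is the sense in which bounded memory need not imply badly degraded performance.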
While models for bounded rationality abound in economics, political science, and other related disciplines, incorporating such models into distributed online learning is still in its infancy. A holistic understanding of the implications of bounded rationality in distributed online learning is yet to be gained. An intriguing aspect of the problem is that bounded rationality does not necessarily imply degraded performance. For example, in dynamic games, bounded memory of past experiences may have little effect, since no-regret learning dictates that the distant past be forgotten (see discussions in Sec. 3).
6 Heterogeneity
The heterogeneity of complex multiagent systems characterizes the asymmetry across players in three aspects: the available information and knowledge about the system, available actions, and the level of adaptivity to opponents’ strategies. In the example of mixed traffic in urban transportation, autonomous vehicles, while likely to have greater computation power for solving complex decision problems, may have to obey an additional set of regulations on available actions.
In adversarial machine learning, in addition to the asymmetry in knowledge and power, the attacker and the defender may also have different levels of real-time adaptivity to the other player's strategy. Classical regret notions such as the external regret, which assumes fixed actions of the other players, remain applicable to oblivious attackers but are no longer valid under adaptive attacks. A partial solution is to adopt a new notion of policy regret, defined against an adaptive adversary who assigns reward functions based on the previous actions of the player [39]. Specifically, let $f_t(a_1, \ldots, a_t)$ denote the player's reward function determined by the adversary at time $t$, given the sequence $(a_1, \ldots, a_t)$ of actions taken by the player in the past. The policy regret over a horizon $T$ with reward functions $f_1, \ldots, f_T$ is defined as
$$
R_T = \max_{a} \sum_{t=1}^{T} f_t(a, \ldots, a) \;-\; \sum_{t=1}^{T} f_t(a_1, \ldots, a_t), \qquad (4)
$$
where $f_t(x_1, \ldots, x_t)$ denotes the reward function determined by the adversary if the player took actions $x_1, \ldots, x_t$ in the past. The $m$-memory policy regret is defined by assuming that each reward function depends only on the $m$ most recent actions of the player.
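The two regret notions can be computed directly from their definitions. The sketch below uses a hypothetical memory-1 adversary that rewards the player for repeating its previous action; against an alternating action sequence, the policy regret exceeds the external regret, because the counterfactual in the policy-regret definition replays the entire history with the fixed action:

```python
def policy_regret(actions, f, action_set):
    """Best fixed action with the history replayed, minus the realized reward."""
    T = len(actions)
    realized = sum(f(list(actions[: t + 1])) for t in range(T))
    best_fixed = max(sum(f([a] * (t + 1)) for t in range(T)) for a in action_set)
    return best_fixed - realized

def external_regret(actions, f, action_set):
    """Best fixed action with the realized history held fixed at each step."""
    T = len(actions)
    realized = sum(f(list(actions[: t + 1])) for t in range(T))
    best_fixed = max(
        sum(f(list(actions[:t]) + [a]) for t in range(T)) for a in action_set
    )
    return best_fixed - realized

# Hypothetical memory-1 adversary: reward 1 for repeating the previous action.
def f(history):
    if len(history) == 1:
        return 1.0
    return 1.0 if history[-1] == history[-2] else 0.0

alternating = [0, 1, 0, 1, 0, 1]
print(policy_regret(alternating, f, [0, 1]))    # 5.0
print(external_regret(alternating, f, [0, 1]))  # 3.0
```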
The difference between the external regret and the policy regret may not be crucial if the adversary and the player have homogeneous objectives (e.g., mixed traffic in transportation systems). It has been shown that a wide class of algorithms can ensure no-regret learning under both regret definitions, as long as the adversary is also using such an algorithm [40]. In applications such as adversarial machine learning, where the adversary may be a malicious opponent, the two notions of regret are incompatible: there exists an m-memory adaptive adversary that can make any action sequence of the player with sublinear regret in one notion suffer from linear regret in the other [40]. A general technique for developing no-policy-regret algorithms in the single-player setting was proposed in [39]. In terms of the system-level performance, it was shown in two-player games that no-policy-regret learning guarantees convergence of the system dynamic to a new notion of equilibrium called policy equilibrium [40]. However, the understanding of policy equilibrium is limited. In games with more than two players, even the definition of policy equilibrium is unclear.
References
 [1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
 [2] H. P. Young, Strategic Learning and its Limits. OUP Oxford, 2004.
 [3] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar, “The security of machine learning,” Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
 [4] N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory. Cambridge University Press, 2007.
 [5] T. Roughgarden, “Intrinsic robustness of the price of anarchy,” Journal of the ACM (JACM), vol. 62, no. 5, p. 32, 2015.
 [6] Y. Viossat and A. Zapechelnyuk, “No-regret dynamics and fictitious play,” Journal of Economic Theory, vol. 148, no. 2, pp. 825–842, 2013.
 [7] R. Laraki and S. Sorin, “Advances in zero-sum dynamic games,” in Handbook of Game Theory with Economic Applications. Elsevier, 2015, vol. 4, pp. 27–93.
 [8] J. Hannan, “Approximation to Bayes risk in repeated play,” Contributions to the Theory of Games, vol. 3, pp. 97–139, 1957.
 [9] D. Blackwell, “An analog of the minimax theorem for vector payoffs,” Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
 [10] G. Stoltz and G. Lugosi, “Internal regret in online portfolio selection,” Machine Learning, vol. 59, no. 1–2, pp. 125–159, 2005.
 [11] S. Hart and A. Mas-Colell, “A simple adaptive procedure leading to correlated equilibrium,” Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000.
 [12] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras, “Cycles in adversarial regularized learning,” in Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2018, pp. 2703–2717.
 [13] J. P. Bailey and G. Piliouras, “Multiplicative weights update in zero-sum games,” in Proceedings of the 2018 ACM Conference on Economics and Computation. ACM, 2018, pp. 321–338.
 [14] A. Heliou, J. Cohen, and P. Mertikopoulos, “Learning with bandit feedback in potential games,” in Advances in Neural Information Processing Systems, 2017, pp. 6369–6378.
 [15] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, “Negative momentum for improved game dynamics,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1802–1811.

 [16] A. Blum, M. Hajiaghayi, K. Ligett, and A. Roth, “Regret minimization and the price of total anarchy,” in Proceedings of The 40th Annual ACM Symposium on Theory of Computing. ACM, 2008, pp. 373–382.
 [17] J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving Pareto optimality through distributed learning,” SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014.
 [18] H. Luo and R. E. Schapire, “Achieving all with no parameters: AdaNormalHedge,” in Conference on Learning Theory, 2015, pp. 1286–1304.
 [19] B. Duvocelle, P. Mertikopoulos, M. Staudigl, and D. Vermeulen, “Learning in time-varying games,” arXiv preprint arXiv:1809.03066, 2018.
 [20] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 7195–7201.
 [21] A. R. Cardoso, J. Abernethy, H. Wang, and H. Xu, “Competing against Nash equilibria in adversarially changing zerosum games,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 2019, pp. 921–930.
 [22] T. Lykouris, V. Syrgkanis, and É. Tardos, “Learning and efficiency in games with dynamic population,” in Proceedings of The 27th Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 120–129.
 [23] Z. Zhou, P. Mertikopoulos, S. Athey, N. Bambos, P. W. Glynn, and Y. Ye, “Learning in games with lossy feedback,” in Advances in Neural Information Processing Systems, 2018, pp. 5140–5150.
 [24] Q. Zhao, Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.
 [25] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
 [26] J.-Y. Audibert and S. Bubeck, “Minimax policies for adversarial and stochastic bandits,” in Proceedings of the 22nd Annual Conference on Learning Theory, 2009, pp. 217–226.
 [27] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
 [28] I. Bistritz and A. Leshem, “Distributed multi-player bandits – a game of thrones approach,” in Advances in Neural Information Processing Systems, 2018, pp. 7222–7232.
 [29] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “Delay and cooperation in nonstochastic bandits,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 613–650, 2019.
 [30] Y. Bar-On and Y. Mansour, “Individual regret in cooperative nonstochastic multi-armed bandits,” in Advances in Neural Information Processing Systems, 2019, pp. 3110–3120.
 [31] M. Bravo, D. Leslie, and P. Mertikopoulos, “Bandit learning in concave N-person games,” in Advances in Neural Information Processing Systems, 2018, pp. 5661–5671.
 [32] P. Mertikopoulos and Z. Zhou, “Learning in games with continuous action sets and unknown payoff functions,” Mathematical Programming, vol. 173, no. 1–2, pp. 465–507, 2019.
 [33] K.S. Jun, L. Li, Y. Ma, and J. Zhu, “Adversarial attacks on stochastic bandits,” in Advances in Neural Information Processing Systems, 2018, pp. 3640–3649.
 [34] H. A. Simon, “A behavioral model of rational choice,” The Quarterly Journal of Economics, vol. 69, no. 1, pp. 99–118, 1955.
 [35] X. Gabaix and D. Laibson, “Bounded rationality and directed cognition,” Harvard University, 2005.
 [36] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training GANs with optimism.” in International Conference on Learning Representations, 2018.
 [37] M. Scarsini and T. Tomala, “Repeated congestion games with bounded rationality,” International Journal of Game Theory, vol. 41, no. 3, pp. 651–669, 2012.
 [38] L. Chen, F. Lin, P. Tang, K. Wang, R. Wang, and S. Wang, “K-memory strategies in repeated games,” in Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 1493–1498.
 [39] R. Arora, O. Dekel, and A. Tewari, “Online bandit learning against an adaptive adversary: from regret to policy regret,” Proceedings of the 29th International Conference on Machine Learning, pp. 1747–1754, 2012.
 [40] R. Arora, M. Dinitz, T. V. Marinov, and M. Mohri, “Policy regret in repeated games,” in Advances in Neural Information Processing Systems, 2018, pp. 6732–6741.