Game theory is a well-established tool for studying interactions among self-interested players. Under the assumption that each player has complete information about the game composition, game-theoretic studies have focused on the Nash equilibrium (NE) for analyzing game outcomes and predicting the strategic behavior of rational players.
The difficulty of obtaining complete information in real-world applications gives rise to the formulation of repeated unknown games, where each player has access to only local information such as his own actions and utilities, but is otherwise unaware of the game composition or even the existence of opponents. In such a setting, a rational player improves his decision-making through real-time interactions with the system and learns from past experiences. The problem can be viewed through the lens of distributed online learning, where the central question is whether the learning dynamics of distributed players lead to a system-level equilibrium in some sense. Studies in the past few decades have revealed intriguing connections between various notions of no-regret learning at each player and certain relaxed versions of NE at the system level [1, 2].
While one step closer to real-world systems, repeated unknown games, in their canonical forms, often adopt idealistic assumptions in terms of the stationarity of the player population and their utilities, availability of complete and perfect feedback, full rationality of players with unbounded cognition and computation capacity, and homogeneity among players in their knowledge of the game. Many emerging multi-agent systems, however, are inherently dynamic and heterogeneous, and inevitably limited in terms of available information and the cognition and computation capacity of the players. We give two examples below.
Example: adversarial machine learning.
Security issues are at the forefront of machine learning and deep learning research, especially in safety-critical and risk-sensitive applications. The interaction between the defender and the attacker can be modeled as a two-player game. While the player population may be small, the game is highly complex in terms of the action space, utilities, feedback models, and the available knowledge each player has about the other. In particular, the attacker is characterized by its knowledge (how much information it has for designing attacks) and its power (how often a successful attack can be launched). Both can be dynamically changing and adaptive to the strategies of the defender. A full spectrum of attacker profiles has been considered, ranging from the so-called black-box model to the white-box model (i.e., an omniscient attacker). The attack process is also dynamic, often exhibiting bursty behaviors following a successful intrusion or a system malfunction. The action space of the attacker can be equally diverse, including poisoning attacks and perturbation attacks. The former targets the training phase by injecting corrupted labels and examples for the purpose of embedding wrong decision rules into the machine learning algorithm. The latter targets the blind spots of a fully trained artificial intelligence using strategically perturbed instances that trigger wrong outputs, even when the perturbation is so minute as to be indiscernible to humans. In terms of utilities, the attacker's goal may be to compromise the integrity of the system (i.e., to evade detection by causing false negatives) or the availability of the system (to flood the system with false positives). A comprehensive taxonomy of attacks against machine learning systems can be found in the literature.
Example: transportation systems. Route selection in urban transportation is a typical example of a non-cooperative game repeated over time. The game is characterized by a large population of players that is both dynamic and heterogeneous, with vehicles leaving and joining the system and utilities varying across players and over time. The envisioned large-scale adoption of autonomous vehicles will further diversify the traffic composition. Autonomous vehicles are significantly different from human drivers in terms of decision-making rationality, access to and usage of system-level knowledge, and memory and computation power. Bounded rationality is more evident in human drivers: they are likely to select a familiar route and inclined to settle for satisficing yet suboptimal options.
Complex multi-agent systems as in the above examples call for new game models, new concepts of regret, new design of distributed learning algorithms, and new techniques for analyzing game outcomes. We present in this article representative results on distributed no-regret learning in multi-agent systems. We start in Sec. 2 with a brief review of background knowledge on classical repeated unknown games. In the subsequent four sections, we explore four game characteristics—dynamicity, incomplete and imperfect feedback, bounded rationality, and heterogeneity—that challenge the classical game models. For each characteristic, we illuminate its implications and ramifications in game modeling, notions of regret, feasible game outcomes, and the design and analysis of distributed learning algorithms. Limited by our understanding of this expansive research field and constrained by the page limit, the coverage is inevitably incomplete. We hope the article nevertheless provides an informative glimpse of the current landscape of this field and stimulates future research interest.
2 Distributed Learning in Repeated Unknown Games
In this section, we review key concepts in game theory and highlight classical results on distributed learning in repeated unknown games.
2.1 Static Games and Equilibria
An $N$-player static game is represented by a tuple $\mathcal{G} = (\mathcal{N}, \mathcal{A}, \{u_i\}_{i \in \mathcal{N}})$, where $\mathcal{N} = \{1, 2, \ldots, N\}$ is the set of players, $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ the Cartesian product of each player's action space $\mathcal{A}_i$, and $\{u_i\}_{i \in \mathcal{N}}$ the utility functions that capture the interaction among players. Specifically, the utility function $u_i$ of player $i$ encodes his preference towards an action. It is a mapping from the action profile $a = (a_1, \ldots, a_N) \in \mathcal{A}$ of all players to player $i$'s reward $u_i(a) \in \mathbb{R}$.
A Nash equilibrium (NE) is an action profile $a^* = (a_1^*, \ldots, a_N^*)$ under which no player can increase his reward via a unilateral deviation. Specifically, $u_i(a_i^*, a_{-i}^*) \ge u_i(a_i, a_{-i}^*)$ for all $i \in \mathcal{N}$ and all $a_i \in \mathcal{A}_i$, where $a_{-i}$ denotes the action profile after excluding player $i$. Due to the focus on deterministic actions (also called pure strategies), the resulting equilibrium is a pure Nash equilibrium. A player may also adopt a mixed strategy $x_i$, which is a probability distribution over the action space $\mathcal{A}_i$. Correspondingly, a mixed Nash equilibrium is a product distribution $x^* = x_1^* \times \cdots \times x_N^*$ under which the expected utility of every player is no smaller than that under any unilateral deviation in this player's strategy. A game with a finite population and a finite action space has at least one mixed NE but may not have any pure NE.
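As a concrete numerical illustration of these definitions, consider Matching Pennies, a two-player zero-sum game that has no pure NE but admits the uniform mixed NE. The following sketch (with the standard payoff matrices) verifies both claims:

```python
import numpy as np

# Matching Pennies: the row player wins (+1) on a match, loses (-1) otherwise.
U_row = np.array([[1, -1],
                  [-1, 1]])
U_col = -U_row  # zero-sum

# No pure NE: in every cell, some player gains by a unilateral deviation.
pure_ne = [(i, j) for i in range(2) for j in range(2)
           if U_row[i, j] >= U_row[:, j].max() and U_col[i, j] >= U_col[i, :].max()]

# Uniform mixed strategies form a mixed NE: against the opponent's uniform
# strategy, both of a player's pure actions earn the same expected utility,
# so no unilateral deviation can improve the expected reward.
q = np.array([0.5, 0.5])          # column player's mixed strategy
p = np.array([0.5, 0.5])          # row player's mixed strategy
row_payoffs = U_row @ q           # expected utility of each row action
col_payoffs = p @ U_col           # expected utility of each column action

print("pure NE:", pure_ne)        # empty list
print(row_payoffs, col_payoffs)   # all-equal entries
```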
NE is defined under the assumption that players adopt independent strategies (note the product form of $x^*$). A more general equilibrium, the correlated equilibrium (CE), allows correlation across players' strategies. We note that for the equilibrium definitions introduced here, we focus on games with a finite action space. Specifically, a CE is a joint probability distribution $x$ over $\mathcal{A}$ (not necessarily in a product form) satisfying $\mathbb{E}_{a \sim x}[u_i(a_i, a_{-i}) \mid a_i] \ge \mathbb{E}_{a \sim x}[u_i(a_i', a_{-i}) \mid a_i]$ for all $i \in \mathcal{N}$, $a_i \in \mathcal{A}_i$, and $a_i' \in \mathcal{A}_i$, where the expectation is over the joint strategy $x$ conditioned on the realized action of player $i$ being $a_i$. The concept of CE can be interpreted by introducing a mediator, who draws an outcome $a$ from $x$ and privately recommends action $a_i$ to player $i$. The equilibrium condition states that no player has the incentive to deviate from the outcome of the correlated draw from $x$ after his part is revealed. CE can be further relaxed to the so-called coarse correlated equilibrium (CCE), which is a joint distribution $x$ satisfying $\mathbb{E}_{a \sim x}[u_i(a)] \ge \mathbb{E}_{a \sim x}[u_i(a_i', a_{-i})]$ for all $i \in \mathcal{N}$ and all $a_i' \in \mathcal{A}_i$. Different from CE, CCE imposes an equilibrium condition that is realization independent.
The four types of equilibria exhibit a sequential inclusion relation as illustrated in Fig. 1. The more general set of strategy profiles (i.e., allowing correlated strategies across players) in CE and CCE may lead to higher expected utilities summed over all players. CE and CCE can also be computed via linear programming, while pure NE and mixed NE are hard to compute. More importantly, CE and CCE are learnable through certain learning dynamics of players when a game is played repeatedly, as discussed next. A caveat is that the set of CCE may contain highly non-rational strategies that choose only strictly dominated actions (actions that are suboptimal responses to all action profiles of the other players); specific examples can be found in the literature.
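The linear-programming computation of a CE mentioned above can be sketched as follows. The game of Chicken and its payoff numbers are illustrative choices, not taken from the article; the LP variables are the entries of the joint distribution, the constraints are the CE incentive conditions, and the objective maximizes social welfare:

```python
import numpy as np
from scipy.optimize import linprog

# Game of Chicken (illustrative payoffs); actions: 0 = Dare, 1 = Chicken.
U1 = np.array([[0, 7], [2, 6]])   # row player's utilities
U2 = np.array([[0, 2], [7, 6]])   # column player's utilities

n = 2                             # actions per player
idx = lambda a1, a2: a1 * n + a2  # joint action -> LP variable index

# CE incentive constraints: for each player, each recommended action a and
# each deviation b, the expected gain from deviating must be <= 0.
A_ub, b_ub = [], []
for a in range(n):
    for b in range(n):
        if a == b:
            continue
        row1, row2 = np.zeros(n * n), np.zeros(n * n)
        for o in range(n):        # opponent's recommended action
            row1[idx(a, o)] = U1[b, o] - U1[a, o]
            row2[idx(o, a)] = U2[o, b] - U2[o, a]
        A_ub += [row1, row2]
        b_ub += [0.0, 0.0]

c = -(U1 + U2).flatten()          # maximize total utility (social welfare)
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=[np.ones(n * n)], b_eq=[1.0], bounds=[(0, 1)] * (n * n))
x = res.x.reshape(n, n)           # welfare-maximizing CE
print(np.round(x, 3), "welfare:", -res.fun)
```

For these payoffs the optimal CE places weight on both miscoordinated outcomes and the mutual-Chicken outcome, achieving a social welfare strictly above what any NE of the game delivers.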
2.2 Repeated Unknown Games and No-Regret Learning
A repeated game consists of repetitions of a static game (referred to as the stage game in this context). In a general definition of a repeated game, the stage game is parameterized by a state, which affects the utility function. Two basic settings exist in the literature: (i) the state evolves over time following a Markov transition rule (the state in the next stage depends on the state and actions in the current stage); (ii) the state is fixed throughout all stages. We focus on the second setting in discussing classical results on repeated games. In a repeated unknown game, after taking an action $a_i(t)$ (potentially randomized according to a mixed strategy) in the $t$-th stage, player $i$ accrues a utility $u_i(a_i(t), a_{-i}(t))$ and observes the entire utility vector $\{u_i(a_i, a_{-i}(t))\}_{a_i \in \mathcal{A}_i}$ for all actions in his action space (we focus on a finite action space here) against the action profile $a_{-i}(t)$ of the other players. The actions and utilities of the other players, however, are unknown and unobservable.
From a single player’s perspective, a repeated unknown game can be viewed as an online learning problem where the player chooses actions sequentially in time by learning from past experiences. A commonly adopted performance measure in online learning is regret, defined as the cumulative reward loss against a properly defined benchmark policy with hindsight vision and/or certain clairvoyant knowledge about the game. In other words, the benchmark policy defines the learning objective that an online algorithm aims to achieve over time. Different benchmark policies lead to different regret measures. Two classical regret notions are the external regret and the internal regret as detailed below.
Let $\pi_i$ denote the online learning algorithm adopted by player $i$. For a fixed action sequence $\{a_{-i}(t)\}_{t=1}^{T}$ of the other players, the external regret of $\pi_i$ is defined as:
$$R_{\mathrm{ext}}(\pi_i; T) = \max_{a \in \mathcal{A}_i} \sum_{t=1}^{T} u_i(a, a_{-i}(t)) - \mathbb{E}\left[\sum_{t=1}^{T} u_i(a_i(t), a_{-i}(t))\right],$$
where $\mathbb{E}$ denotes the expectation over the random action process $\{a_i(t)\}_{t=1}^{T}$ induced by $\pi_i$. In other words, the benchmark policy in the external regret chooses the best fixed response to the other players' actions in hindsight. The internal regret of $\pi_i$ is defined as:
$$R_{\mathrm{int}}(\pi_i; T) = \max_{a, b \in \mathcal{A}_i} \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{a_i(t) = a\} \left(u_i(b, a_{-i}(t)) - u_i(a, a_{-i}(t))\right)\right],$$
where $\mathbb{I}\{\cdot\}$ is the indicator function. In this definition, the benchmark policy is the best hindsight modification of $\pi_i$ obtained by swapping a single action with another throughout all stages.
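In code, the two regret measures can be computed from a realized history as follows (a sketch; the array `rewards[t, a]` plays the role of the counterfactual utility of action `a` at stage `t` against the others' realized actions):

```python
import numpy as np

def external_regret(rewards, plays):
    """Cumulative external regret. rewards[t, a] is the utility action a
    would have earned at stage t against the others' realized actions;
    plays[t] is the action actually taken."""
    realized = rewards[np.arange(len(plays)), plays].sum()
    best_fixed = rewards.sum(axis=0).max()   # best single action in hindsight
    return best_fixed - realized

def internal_regret(rewards, plays):
    """Cumulative internal regret: the best hindsight gain from swapping one
    action a for an alternative b on every stage where a was played."""
    T, K = rewards.shape
    best = 0.0
    for a in range(K):
        mask = (plays == a)
        for b in range(K):
            best = max(best, (rewards[mask, b] - rewards[mask, a]).sum())
    return best

rng = np.random.default_rng(0)
R = rng.random((100, 3))                     # a random reward history
p = rng.integers(0, 3, size=100)             # an arbitrary play sequence
print(external_regret(R, p), internal_regret(R, p))
```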
An online learning algorithm $\pi_i$ is said to achieve the no-regret condition if, against all action sequences of the other players, the cumulative regret has a sublinear growth rate with the time horizon $T$. In other words, $\pi_i$ offers, asymptotically as $T \to \infty$, the same average reward per stage as the specific benchmark policy adopted in the corresponding regret measure. No-regret learning is also referred to as Hannan consistency after the original work of Hannan.
It is clear that the significance of no-regret learning depends on the benchmark policy against which the learning algorithm is measured. A benchmark policy with stronger performance leads to a stronger notion of regret. In particular, the internal regret is a stronger notion than the external regret: no-regret learning under the former implies no-regret learning under the latter, but not vice versa.
A number of no-regret learning algorithms exist in the literature. Representative algorithms achieving no-external-regret learning include Multiplicative Weights (MW) (also known as the Hedge algorithm) and Follow the Perturbed Leader. Both are randomized policies, as randomization is necessary for achieving no-regret learning in an adversarial setting with general reward functions. In particular, under the MW algorithm, each player maintains a weight $w_a(t)$ for each action $a$ at every stage $t$ based on past rewards: $w_a(t+1) = w_a(t)\, e^{\eta\, r_a(t)}$, where $r_a(t)$ is the reward received under $a$ at stage $t$ and $\eta > 0$ is the learning rate. The probability of choosing $a$ in the next stage is proportional to its weight: $p_a(t+1) = w_a(t+1) / \sum_{a' \in \mathcal{A}_i} w_{a'}(t+1)$.
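The MW update can be sketched in a few lines; the toy reward sequence and parameter values below are illustrative, not from the article:

```python
import numpy as np

def multiplicative_weights(reward_fn, K, T, eta=0.1, seed=0):
    """MW/Hedge with full-information feedback. reward_fn(t) returns the
    length-K utility vector of all actions at stage t (it aggregates the
    other players' actions, which this player never observes directly)."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    plays, total = [], 0.0
    for t in range(T):
        probs = weights / weights.sum()
        a = rng.choice(K, p=probs)           # sample from the mixed strategy
        r = reward_fn(t)                     # full utility vector observed
        plays.append(a)
        total += r[a]
        weights *= np.exp(eta * r)           # w_a <- w_a * exp(eta * r_a)
    return plays, total

# Toy stationary environment: action 1 is best (reward 0.6 per stage).
rewards = np.tile([0.4, 0.6, 0.3], (1000, 1))
plays, total = multiplicative_weights(lambda t: rewards[t], K=3, T=1000)
print(total)   # close to the best fixed action's cumulative reward of 600
```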
For no-internal-regret learning, a representative algorithm is Regret Matching. Let $\bar{g}_T(a, b) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}\{a_i(t) = a\} \left(u_i(b, a_{-i}(t)) - u_i(a, a_{-i}(t))\right)$ denote the average gain per play from switching from action $a$ to an alternative $b$ in the past $T$ plays. In the $(T+1)$-th stage, the probability of switching from the previous action $a$ to an alternative $b$ is given by $\frac{1}{\mu} [\bar{g}_T(a, b)]^+$, where $\mu$ is a normalization parameter chosen to ensure a positive probability of staying with action $a$. Regret Matching also offers no-external-regret learning by setting the probability of selecting action $a$ at the $(T+1)$-th stage to the normalized positive part of the average gain per play from playing action $a$ throughout the past $T$ plays, i.e., $p_a(T+1) = [\bar{g}_T(a)]^+ / \sum_{a'} [\bar{g}_T(a')]^+$, where $\bar{g}_T(a) = \frac{1}{T} \sum_{t=1}^{T} \left(u_i(a, a_{-i}(t)) - u_i(a_i(t), a_{-i}(t))\right)$.
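The external-regret (unconditional) form of Regret Matching admits an equally short sketch; the reward stream and the tie-breaking rule when no regret is positive are illustrative choices:

```python
import numpy as np

def regret_matching(reward_fn, K, T, seed=0):
    """Regret Matching in its external-regret (unconditional) form: play
    action a with probability proportional to the positive part of the
    cumulative gain from having always played a instead of the actual plays."""
    rng = np.random.default_rng(seed)
    cum_gain = np.zeros(K)                   # sum_t (r_t[a] - r_t[a_t])
    plays, total = [], 0.0
    for t in range(T):
        positive = np.maximum(cum_gain, 0.0)
        if positive.sum() > 0:
            probs = positive / positive.sum()
        else:
            probs = np.full(K, 1.0 / K)      # no positive regret: play uniformly
        a = rng.choice(K, p=probs)
        r = reward_fn(t)
        plays.append(a)
        total += r[a]
        cum_gain += r - r[a]
    return plays, total

# Toy stationary environment: action 1 is best (reward 0.6 per stage).
rewards = np.tile([0.4, 0.6, 0.3], (1000, 1))
plays, total = regret_matching(lambda t: rewards[t], K=3, T=1000)
```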
2.3 System-Level Performance under No-Regret Learning
Regret captures the learning objective of an individual player. At the system level, it is desirable to know whether the dynamical behaviors of distributed players converge to an equilibrium in some sense and whether the self-interested regret minimization promises a certain level of optimality in terms of social welfare.
For the first question, it has been shown that if every player adopts a no-external-regret learning algorithm, the empirical distribution of the sequence of actions taken by all players converges to the set of CCE of the stage game. No-regret learning under the internal regret measure guarantees convergence to the more restrictive set of CE. Such convergence results are, however, in terms of the empirical frequency of the players' actions rather than the actual sequence of plays. The convergence is also only to the set of equilibria, rather than to an equilibrium in the corresponding set. In fact, by treating learning in games as a dynamical system, recent studies have shown that in the continuous-time setting, the actual plays under no-regret learning algorithms (such as Follow the Regularized Leader) may exhibit cycles rather than convergence. In the discrete-time setting, it has been shown that in zero-sum games, the actual plays under the MW algorithm (starting from a non-equilibrium initial strategy) diverge from every fully mixed NE. For games with special structures (e.g., potential games with a finite action space and bilinear smooth games with a continuum of actions), however, stronger results on the convergence of the actual plays to the more restrictive set of (mixed) NE have been established.
In addition to the convergence of learning dynamics, the social welfare resulting from the self-interested learning of individual players is of great interest in many applications. In (known) static games, the loss in social welfare due to the self-interested behaviors of players is quantified by the price of anarchy (POA), where the social welfare $W(a) = \sum_{i \in \mathcal{N}} u_i(a)$ is the system-level utility under an action profile $a$. The POA is defined as the ratio of the optimal social welfare among all strategies to the smallest social welfare in the set of mixed NE. For repeated unknown games, a corresponding concept, the price of total anarchy (POTA), is defined as:
$$\mathrm{POTA} = \frac{\max_{a \in \mathcal{A}} W(a)}{\frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} W(a(t))\right]}, \qquad (3)$$
where $\{a(t)\}_{t=1}^{T}$ is the sequence of strategy profiles in the no-regret dynamics of all players. It has been shown that in games with special structures (e.g., valid games and congestion games), no-regret learning guarantees a POTA that converges to the POA of the stage game even though the sequence of actual plays may not converge to a (mixed) NE. The convergence of the POTA to the POA of the stage game implies that no-regret learning can fully negate the impact of the unknown nature of the game on social welfare. The result was later extended to a general class of games referred to as smooth games (which includes valid games and congestion games as special cases). To achieve higher social welfare, cooperation among players is necessary. For example, if every player agrees to follow a learning algorithm designed specifically for optimizing the system-level performance, the optimal action profile will be selected a high percentage of the time.
3 Dynamic Games
In a dynamic repeated game, the stage game is time-varying. The dynamicity may be in any of the three elements of the game composition: the set of players, the action space, and the utility functions. (Note that the general definition of repeated games includes dynamicity in the utility function, as the state parameter may evolve over time following a Markov transition rule. The dynamic repeated game discussed in this section differs from the general repeated game in two aspects: (i) the set of players and the action space can also be time-varying; (ii) the utility functions are in general independent across stages.)
3.1 Notions of Regret
Dynamic unknown games call for new notions of regret that provide meaningful performance measures for distributed online learning algorithms. Specifically, the benchmark policy of a fixed single best action used in the external regret, and that of a fixed single best action modification used in the internal regret, can be highly suboptimal in dynamic games. As a result, achieving no-regret learning under the thus-defined regret measures can no longer serve as a stamp of good performance.
A rather immediate extension of the external regret is to consider every interval of the learning horizon and measure the cumulative loss against a single best action in hindsight that is specific to each interval. This leads to the notion of adaptive regret, under which no-regret learning requires a sublinear growth of the cumulative reward loss in every interval as the interval length tends to infinity. The adaptive regret is particularly suitable for piecewise-stationary systems where changes can be abrupt but infrequent. Classical learning algorithms such as MW can be extended to achieve no-adaptive-regret learning. The key issue in algorithm design is a mechanism to discount experiences from the distant past.
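One simple instance of the "discount the distant past" mechanism is Hedge run on exponentially discounted reward scores. The sketch below is a heuristic illustration on a piecewise-stationary toy problem, not one of the exact no-adaptive-regret constructions from the literature:

```python
import numpy as np

def discounted_hedge(reward_fn, K, T, eta=0.2, gamma=0.99, seed=0):
    """Hedge on exponentially discounted reward scores: experience from the
    distant past decays by gamma per stage, so the algorithm can re-adapt
    after a change. (A heuristic sketch; the no-adaptive-regret algorithms
    in the literature use more refined constructions.)"""
    rng = np.random.default_rng(seed)
    score = np.zeros(K)                       # discounted cumulative reward
    plays = []
    for t in range(T):
        w = np.exp(eta * score)
        a = rng.choice(K, p=w / w.sum())
        plays.append(a)
        score = gamma * score + reward_fn(t)  # forget the distant past
    return plays

# Piecewise-stationary toy: the best action switches from 0 to 2 at t = 500.
def reward_fn(t):
    return np.array([0.9, 0.1, 0.1]) if t < 500 else np.array([0.1, 0.1, 0.9])

plays = discounted_hedge(reward_fn, K=3, T=1000)
print(plays[450:455], plays[-5:])   # mostly action 0, then mostly action 2
```

With vanilla MW (gamma = 1), the score lead accumulated by action 0 before the change would take far longer to overcome; discounting caps that lead at a constant, so the switch is detected within roughly the effective memory 1/(1 - gamma) stages.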
Another extension of the external regret is the so-called dynamic regret, in which the benchmark policy can be an arbitrary sequence of actions, as opposed to a fixed action throughout an interval of growing length. Achieving diminishing reward loss against all sequences of actions is, however, unattainable. Constraints on either the benchmark action sequence or the reward functions are necessary for defining a meaningful measure. On the variation of the benchmark action sequence, a commonly adopted constraint in the setting with finite actions is that the benchmark sequence $\{b_t\}_{t=1}^{T}$ is piecewise-stationary with at most $K$ changes (the thus-defined regret is also referred to as the $K$-shifting regret). In this case, the no-adaptive-regret condition directly implies no-dynamic-regret. With a continuum of actions, the constraint is often imposed on the cumulative distance between every two consecutive actions in the sequence, i.e., $V_T = \sum_{t=2}^{T} \|b_t - b_{t-1}\|$. It has been shown that if the benchmark sequence is slow-varying, i.e., $V_T = o(T)$, no-dynamic-regret is achievable through well-designed restart procedures. The variation constraint can also be applied to the reward functions. A typical example with a continuum of actions is the sublinear “variation budget” assumption. Specifically, the cumulative variation between the reward functions in two consecutive stage games grows sublinearly in $T$, i.e., $\sum_{t=2}^{T} \sup_{a \in \mathcal{A}} |u_t(a) - u_{t-1}(a)| = o(T)$. Similar constraints can be imposed on the gradient of the utility function, with the variation measured by an appropriate norm. Details and corresponding no-regret learning algorithms can be found in the references therein.
The external regret and its extensions are measured against an alternative strategy of a single player. A new notion of regret, the Nash equilibrium regret, considers a benchmark policy that is jointly determined by the strategies of all players. Consider a repeated game with time-varying utility functions $\{u_i^t\}_{t=1}^{T}$ for each player $i$. Let $\bar{u}_i = \frac{1}{T} \sum_{t=1}^{T} u_i^t$ be the average utility function and $x^*$ a mixed NE of the static game defined by the average utility functions $\{\bar{u}_i\}_{i \in \mathcal{N}}$. The NE regret of player $i$ following a policy $\pi_i$ is then given by $\bar{u}_i(x^*) - \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} u_i^t(a(t))\right]$, where $a(t)$ is the action profile selected by the policies of all players at stage $t$. No-regret learning under the NE regret ensures that each player's average reward asymptotically matches that promised by the mixed NE under the average utility functions. A centralized learning algorithm achieving no-NE-regret was developed for repeated two-player zero-sum games with arbitrarily varying utility functions. Achieving no-regret learning under the measure of NE regret in a distributed setting, however, remains open.
3.2 System-Level Performance
The two key measures for system-level performance, convergence to equilibria and the POTA, also need to be modified to take game dynamics into account. The time-varying sequence of stage games defines a sequence of equilibria and a sequence of optimal social welfare values. The desired relation between no-regret learning dynamics at individual players and the system-level equilibria is thus in terms of tracking rather than converging. For the definition of the POTA, the optimal social welfare in the numerator in (3) needs to be replaced with the average optimal social welfare $\frac{1}{T} \sum_{t=1}^{T} \max_{a \in \mathcal{A}} W_t(a)$, where $W_t$ denotes the social welfare function of the stage game at time $t$.
An online learning algorithm is said to successfully track the sequence of (mixed) NE in a dynamic game if the average distance between the sequence of (mixed) action profiles resulting from the algorithm and the sequence of (mixed) NE vanishes as $T$ tends to infinity. A representative study considers a game with a continuum of actions and dynamicity manifesting only in the utility functions. Under the assumptions that the sequence of NE is slow-varying and the utility functions are monotonic, it was shown that learning algorithms with sublinear dynamic regret successfully track the sequence of NE. The monotonicity of the utility functions plays a key role in the analysis: it translates the closeness between the learning dynamics and the NE in terms of the cumulative reward (the concern of the regret measure) into closeness in terms of their distance in the action space (the concern of the tracking outcome).
The performance of no-regret learning in terms of social welfare has been studied for games with a dynamic population of players. Specifically, in each stage, each player may independently exit with a fixed probability and is subsequently replaced with a new player with a potentially different utility function (the population size is therefore fixed and the player set is a stationary process over time). For structured games such as first-price auctions, bandwidth allocation, and congestion games, the relation between no-adaptive-regret learning and the average optimal social welfare was examined.
Game dynamics can take diverse forms, and a holistic understanding of the matching between regret notions and the underlying dynamics of the game is still lacking. Different forms of game dynamics demand different benchmark policies in order to arrive at a regret measure that lends significance to the stamp of “no-regret learning” yet at the same time is attainable. Viewed from a different angle, one may pose the fundamental question of which kinds of game dynamics are tamable through distributed online learning, making no-regret learning and approximately optimal social welfare feasible.
4 Incomplete and Imperfect Feedback
Learning and adaptation rely on feedback. The quality of the feedback in terms of completeness and accuracy thus has significant implications for no-regret learning. We explore this issue in this section.
4.1 Incomplete Feedback
Incomplete feedback stands in contrast to full-information feedback, where the utilities of all actions a player could have taken are observed in each stage. Incompleteness can be spatial across the action space or temporal across decision stages. In the former case, a commonly studied model is the so-called bandit feedback, where only the utility of the chosen action is revealed. In the latter, the feedback model is referred to as lossy feedback, where there are decision stages with no feedback at all. One can easily envision a more general model compounding bandit feedback with lossy feedback. Studies on this general model are lacking in the literature.
The term “bandit feedback” has its roots in the classical multi-armed bandit problem. The name of the problem comes from likening an archetypical single-player online learning problem to playing a multi-armed slot machine (known as a bandit for its ability to empty the player's pocket). Each arm, when pulled, generates rewards according to an unknown stochastic model or in an adversarial fashion. Only the reward of the chosen arm is revealed after each play. Due to the incomplete feedback, the player faces a tradeoff between exploration (gathering information from less explored arms) and exploitation (maximizing the immediate reward by favoring arms with a good reward history).
In a multi-player game setting with bandit feedback, no-regret learning from an individual player's perspective can be cast as a single-player non-stochastic/adversarial bandit model, where the payoff of each arm/action is adversarially chosen and aggregates the interaction with the other players in the game. The concept of external regret in the game setting corresponds to the weak regret in the adversarial bandit model, which adopts the best single-arm policy in hindsight as the benchmark. The MW algorithm has been modified to handle the change of the feedback model from full-information to bandit. Specifically, the weight of action $a$ at time $t$ is updated as $w_a(t+1) = w_a(t)\, e^{\eta\, \hat{r}_a(t)}$, where $\hat{r}_a(t) = r_a(t)/p_a(t)$ if $a$ is selected and $\hat{r}_a(t) = 0$ otherwise, with $p_a(t)$ denoting the probability of selecting action $a$ at time $t$. Dividing the observed reward by the probability of the chosen action ensures the unbiasedness of the reward estimate. Quite intuitively, the price for not observing the rewards of all actions is a degradation of the regret order in the size $K$ of the action space, i.e., from $O(\sqrt{T \log K})$ in the full-information setting to $O(\sqrt{T K \log K})$ in the bandit setting.
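The bandit variant of MW described above can be sketched as follows, in the spirit of Exp3; note that the original Exp3 additionally mixes in a small amount of uniform exploration, omitted here for brevity, and the toy environment is an illustrative choice:

```python
import numpy as np

def exp3_style(reward_fn, K, T, eta=0.05, seed=0):
    """Bandit variant of MW in the spirit of Exp3: only the chosen action's
    reward is observed; dividing it by the selection probability gives an
    unbiased estimate of the full reward vector."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    plays, total = [], 0.0
    for t in range(T):
        probs = weights / weights.sum()
        a = rng.choice(K, p=probs)
        r_a = reward_fn(t, a)                # bandit feedback: one entry only
        plays.append(a)
        total += r_a
        est = np.zeros(K)
        est[a] = r_a / probs[a]              # importance-weighted estimate
        weights *= np.exp(eta * est)
        weights /= weights.max()             # rescale to avoid overflow
    return plays, total

# Toy stationary environment: action 1 is best with reward 0.8 per stage.
plays, total = exp3_style(lambda t, a: [0.2, 0.8, 0.3][a], K=3, T=2000)
print(total)   # the best fixed action would earn 1600 in total
```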
The multi-player bandit problem explicitly models the existence of $M$ players competing for $K$ ($K \ge M$) arms. Originally motivated by applications in wireless communication networks where distributed users compete for access to multiple channels, this specific game model is characterized by a special form of interaction among players: a collision occurs when multiple players select the same arm, which results in utility loss. The objective of this distributed learning problem is to minimize the system-level regret over all players against the optimal centralized (hence collision-free) allocation of the players to the best set of arms. In addition to the exploration-exploitation tradeoff in the single-player setting, this distributed learning problem under a system-level objective also faces the tradeoff between selecting a good arm and avoiding collisions with competing players. A number of distributed learning algorithms have been developed to achieve a sublinear system-level regret with respect to $T$. Recent extensions of the multi-player bandit problem further consider the setting where each arm offers different payoffs across players.
The multi-player bandit problem is a special game model in that the players have an identical action space and their interaction takes only the form of “collisions” when choosing the same action. In a general game setting, the impact of incomplete feedback on no-regret learning and system-level performance is largely open. One quantitative measure of the impact is the regret order with respect to the size of the action space. As mentioned above, bandit feedback results in an additional $\sqrt{K}$ factor in the regret order, which can be significant when the action space is large. Recent work [29, 30] has shown that local communications among neighboring players in a network setting can mitigate the negative impact of bandit feedback on the regret order in $K$. In terms of the impact on system-level performance, it has been shown under a game model with a continuum of actions that bandit feedback degrades the convergence rate of the learning dynamics to equilibria.
4.2 Imperfect Feedback
Imperfect feedback refers to the inaccuracy of the observed utilities in revealing the quality of the selected actions. Recall that mixed strategies are necessary for achieving no-regret learning in the adversarial setting. The quality of a mixed strategy is characterized by the expected utility, where the expectation is taken over the randomness of the strategies of all players. The feedback model assuming observations of the expected utility, referred to as expected feedback, can however be unrealistic. A more commonly adopted model is the realized feedback, where only the utility of the realized action profile is revealed. The realized feedback can be viewed as a noisy unbiased estimate of the expected feedback, where the noise is due to the randomness of the players' strategies.
The so-called noisy feedback assumes a different source of noise: it comes from the external environment and is additive to either the observed utility vectors in the so-called semi-bandit feedback with a finite action space, or the gradient of the utility functions in the first-order feedback with a continuum of actions. Under the assumptions of unbiasedness and bounded variance, the issue of additive noise can be addressed by rather standard estimation techniques and analysis. A more challenging setting is to consider non-stochastic noise due to adversarial attacks, especially in applications such as adversarial machine learning. This problem was recently studied in the single-player setting. Studies in the multi-agent setting are still lacking.
5 Bounded Rationality
The concept of bounded rationality was first introduced in economics to provide more realistic models than the often adopted perfect rationality, which assumes that the decision-making of players is the result of a full optimization of their utilities. In reality, players often take reasoning shortcuts that may lead to suboptimal decisions. Such reasoning shortcuts may be a result of the limited cognition of human minds or necessitated by the available computation time and power relative to the complexity of action optimization.
Cognitive limitations include the limited ability to anticipate other decision-makers' strategic responses and certain psychological factors that interfere with the valuation of options. Various models exist for capturing the limitations in the players' valuation of options. For example, a player may be myopic, focusing only on the short-term reward. Even with forward-thinking, a player may settle for suboptimal actions perceived as acceptable. The limitation in a player's ability to anticipate other players' strategies can be modeled through a cognitive hierarchy, by grouping players according to their cognitive abilities and characterizing them in an iterative fashion. Specifically, players with the lowest level of cognitive ability are grouped as the level-$0$ players who make decisions randomly. Level-$k$ ($k \ge 1$) players are then defined iteratively as those who assume they are playing against lower-level players and anticipate the opponents' strategies accordingly. Recent work draws an interesting connection between the cognitive hierarchy model and the Optimistic Mirror Descent (OMD) algorithm for solving the saddle-point problem with applications in generative adversarial networks. The saddle-point problem can be viewed as a two-player zero-sum game with a continuum of actions, whose solutions correspond to the set of NE. It has been shown that the OMD algorithm guarantees that the system dynamics in terms of the actual plays converge to an NE, while Gradient Descent (GD) may lead to cycles. In the language of the cognitive hierarchy, players adopting GD can be regarded as level-$0$ thinkers in the sense that they do not anticipate the strategies of their opponents. Players adopting OMD are level-$1$ thinkers, since they take advantage of the fact that their opponents are taking similar gradient methods, which will not lead to abrupt gradient changes between two consecutive stages.
Consequently, an extra gradient update is applied in OMD to accelerate learning.
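The contrast between GD and OMD can be seen in a small numerical sketch (a toy illustration, not taken from the cited works) on the bilinear saddle-point problem $\min_x \max_y \, xy$, whose unique NE is the origin: simultaneous GD spirals away from the equilibrium, while the optimistic (extra-gradient style) update drives the iterates toward it.

```python
# Toy comparison on f(x, y) = x * y, where x minimizes and y maximizes.
# The unique saddle point (NE) is (0, 0).
eta = 0.1
steps = 2000

# Simultaneous Gradient Descent: each player follows its own gradient.
x, y = 1.0, 1.0
for _ in range(steps):
    x, y = x - eta * y, y + eta * x
gd_dist = (x**2 + y**2) ** 0.5   # distance to NE grows: GD spirals outward

# Optimistic Mirror Descent (optimistic gradient form): the update
# x_{t+1} = x_t - eta * (2*g_t - g_{t-1}) exploits the fact that the
# opponent's gradient changes slowly between consecutive stages.
x, y = 1.0, 1.0
gx_prev, gy_prev = y, x          # previous-round gradients
for _ in range(steps):
    gx, gy = y, x                # current gradients of f w.r.t. x and y
    x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
omd_dist = (x**2 + y**2) ** 0.5  # distance to NE shrinks toward zero
```

Under this sketch, `gd_dist` grows geometrically (each GD step scales the distance to the origin by $\sqrt{1+\eta^2}$), whereas `omd_dist` contracts toward the NE, matching the cycling-versus-convergence dichotomy described above.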
Besides cognitive limitations, players are also constrained in terms of physical resources such as memory and computation power. Acquiring, storing, and processing all relevant information for decision-making may be infeasible, especially in complex systems with a large action space. For example, players may only choose from strategies with bounded complexity, or use only recent observations in decision-making due to memory constraints.
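A familiar instance of a bounded-memory strategy is tit-for-tat in the iterated prisoner's dilemma (a standard toy example, with the usual payoff values assumed here): it is a 1-memory strategy that conditions only on the opponent's most recent action and discards the rest of the history.

```python
# Row player's payoffs in the prisoner's dilemma (C = cooperate, D = defect),
# using the standard values.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tit_for_tat(opp_history):
    """A 1-memory strategy: cooperate first, then copy the opponent's last move."""
    return 'C' if not opp_history else opp_history[-1]

def always_defect(opp_history):
    """A 0-memory strategy: ignore the history entirely."""
    return 'D'

def play(strat_a, strat_b, rounds):
    hist_a, hist_b = [], []      # each strategy sees only the opponent's history
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b
```

Despite remembering only one round, tit-for-tat sustains mutual cooperation against itself while limiting its losses against a constant defector, illustrating that a severely memory-constrained strategy can still perform well in a repeated game.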
While models for bounded rationality abound in economics, political science, and other related disciplines, incorporating such models into distributed online learning is still in its infancy. A holistic understanding of the implications of bounded rationality in distributed online learning is yet to be gained. An intriguing aspect of the problem is that bounded rationality does not necessarily imply degraded performance. For example, in dynamic games, bounded memory of past experiences may have little effect, since no-regret learning dictates that the distant past be forgotten (see discussions in Sec. 3).
The heterogeneity of complex multi-agent systems manifests as asymmetry across players in three aspects: the available information and knowledge about the system, the available actions, and the level of adaptivity to opponents’ strategies. In the example of mixed traffic in urban transportation, autonomous vehicles, while likely to have greater computation power for solving complex decision problems, may have to obey an additional set of regulations on available actions.
In adversarial machine learning, in addition to the asymmetry in knowledge and power, the attacker and the defender may also have different levels of real-time adaptivity to the other player’s strategy. Classical regret notions such as the external regret, which assume fixed actions of the other players, while applicable to oblivious attackers, are no longer valid under adaptive attacks. A partial solution is to adopt a new notion of policy regret, defined against an adaptive adversary who assigns reward vectors based on the previous actions of the player. Specifically, let $f_t(a_1, \ldots, a_t)$ denote the player’s reward at time $t$ as determined by the adversary, given the sequence $(a_1, \ldots, a_t)$ of actions taken by the player. The policy regret over horizon $T$ with reward functions $\{f_t\}_{t=1}^{T}$ is defined as
$$ \max_{a} \sum_{t=1}^{T} f_t(a, \ldots, a) \; - \; \sum_{t=1}^{T} f_t(a_1, \ldots, a_t), $$
where $f_t(a, \ldots, a)$ denotes the reward function determined by the adversary if the player had taken the fixed action $a$ in all past rounds. The $m$-memory policy regret is defined by assuming that $f_t$ depends only on the last $m$ actions of the player.
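The gap between the two regret notions can be made concrete with a small numerical sketch (a hypothetical 1-memory adversary constructed for illustration, not taken from the cited works): the adversary zeroes out all rewards whenever the player’s previous action was 1, so a player who always plays 0 incurs zero policy regret, yet its external regret, computed naively on the realized reward vectors, grows linearly in $T$.

```python
def reward(prev, a):
    """Hypothetical 1-memory reward: if the player's previous action was 1,
    the adversary zeroes out all rewards; otherwise action 1 pays 1, action 0 pays 0.5."""
    if prev == 1:
        return 0.0
    return 1.0 if a == 1 else 0.5

T = 100
actions = [0] * T                       # the player always plays action 0
prevs = [0] + actions[:-1]              # previous action, initialized to 0
earned = sum(reward(p, a) for p, a in zip(prevs, actions))

# External regret: best fixed action evaluated on the *realized* reward
# vectors, as if the adversary's functions would not have changed.
ext_best = max(sum(reward(p, a) for p in prevs) for a in (0, 1))
external_regret = ext_best - earned     # = 0.5 * T, linear in T

# Policy regret: the counterfactual accounts for the adversary's adaptivity --
# constant play of a would also have placed a in the adversary's memory.
def constant_play_total(a):
    total, prev = 0.0, 0                # same initial memory state
    for _ in range(T):
        total += reward(prev, a)
        prev = a
    return total

policy_regret = max(constant_play_total(a) for a in (0, 1)) - earned  # = 0
```

The realized reward vectors make the never-played action 1 look attractive in hindsight, but actually committing to it would have triggered the adversary’s punishment from the second round on; policy regret captures this counterfactual, while external regret does not.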
The difference between the external regret and the policy regret may not be crucial if the adversary and the player have homogeneous objectives (e.g., mixed traffic in transportation systems). It has been shown that a wide class of algorithms can ensure no-regret learning under both regret definitions, as long as the adversary is also using such an algorithm. In applications such as adversarial machine learning, where the adversary may be a malicious opponent, the two notions of regret are incompatible: there exists an $m$-memory adaptive adversary that can make any action sequence of the player with sublinear regret in one notion suffer from linear regret in the other. A general technique for developing no-policy-regret algorithms in the single-player setting has been proposed. In terms of system-level performance, it has been shown in two-player games that no-policy-regret learning guarantees convergence of the system dynamic to a new notion of equilibrium called policy equilibrium. The understanding of policy equilibrium, however, remains limited; in games with more than two players, even its definition is unclear.
-  N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge university press, 2006.
-  H. P. Young, Strategic Learning and its Limits. OUP Oxford, 2004.
-  M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar, “The security of machine learning,” Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
-  N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory. Cambridge university press, 2007.
-  T. Roughgarden, “Intrinsic robustness of the price of anarchy,” Journal of the ACM (JACM), vol. 62, no. 5, p. 32, 2015.
-  Y. Viossat and A. Zapechelnyuk, “No-regret dynamics and fictitious play,” Journal of Economic Theory, vol. 148, no. 2, pp. 825–842, 2013.
-  R. Laraki and S. Sorin, “Advances in zero-sum dynamic games,” in Handbook of Game Theory with Economic Applications. Elsevier, 2015, vol. 4, pp. 27–93.
-  J. Hannan, “Approximation to bayes risk in repeated play,” Contributions to the Theory of Games, vol. 3, pp. 97–139, 1957.
-  D. Blackwell et al., “An analog of the minimax theorem for vector payoffs.” Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
-  G. Stoltz and G. Lugosi, “Internal regret in on-line portfolio selection,” Machine Learning, vol. 59, no. 1-2, pp. 125–159, 2005.
-  S. Hart and A. Mas-Colell, “A simple adaptive procedure leading to correlated equilibrium,” Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000.
-  P. Mertikopoulos, C. Papadimitriou, and G. Piliouras, “Cycles in adversarial regularized learning,” in Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2018, pp. 2703–2717.
-  J. P. Bailey and G. Piliouras, “Multiplicative weights update in zero-sum games,” in Proceedings of the 2018 ACM Conference on Economics and Computation. ACM, 2018, pp. 321–338.
-  A. Heliou, J. Cohen, and P. Mertikopoulos, “Learning with bandit feedback in potential games,” in Advances in Neural Information Processing Systems, 2017, pp. 6369–6378.
-  G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, “Negative momentum for improved game dynamics,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1802–1811.
-  A. Blum, M. Hajiaghayi, K. Ligett, and A. Roth, “Regret minimization and the price of total anarchy,” in Proceedings of The 40th Annual ACM Symposium on Theory of Computing. ACM, 2008, pp. 373–382.
-  J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving pareto optimality through distributed learning,” SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014.
-  H. Luo and R. E. Schapire, “Achieving all with no parameters: AdaNormalHedge,” in Conference on Learning Theory, 2015, pp. 1286–1304.
-  B. Duvocelle, P. Mertikopoulos, M. Staudigl, and D. Vermeulen, “Learning in time-varying games,” arXiv preprint arXiv:1809.03066, 2018.
-  A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 7195–7201.
-  A. R. Cardoso, J. Abernethy, H. Wang, and H. Xu, “Competing against Nash equilibria in adversarially changing zero-sum games,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 2019, pp. 921–930.
-  T. Lykouris, V. Syrgkanis, and É. Tardos, “Learning and efficiency in games with dynamic population,” in Proceedings of The 27th Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 120–129.
-  Z. Zhou, P. Mertikopoulos, S. Athey, N. Bambos, P. W. Glynn, and Y. Ye, “Learning in games with lossy feedback,” in Advances in Neural Information Processing Systems, 2018, pp. 5140–5150.
-  Q. Zhao, Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.
-  P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
-  J.-Y. Audibert and S. Bubeck, “Minimax policies for adversarial and stochastic bandits,” in Proceedings of the 22nd Annual Conference on Learning Theory, 2009, pp. 217–226.
-  K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
-  I. Bistritz and A. Leshem, “Distributed multi-player bandits—a game of thrones approach,” in Advances in Neural Information Processing Systems, 2018, pp. 7222–7232.
-  N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “Delay and cooperation in nonstochastic bandits,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 613–650, 2019.
-  Y. Bar-On and Y. Mansour, “Individual regret in cooperative nonstochastic multi-armed bandits,” in Advances in Neural Information Processing Systems, 2019, pp. 3110–3120.
-  M. Bravo, D. Leslie, and P. Mertikopoulos, “Bandit learning in concave n-person games,” in Advances in Neural Information Processing Systems, 2018, pp. 5661–5671.
-  P. Mertikopoulos and Z. Zhou, “Learning in games with continuous action sets and unknown payoff functions,” Mathematical Programming, vol. 173, no. 1-2, pp. 465–507, 2019.
-  K.-S. Jun, L. Li, Y. Ma, and J. Zhu, “Adversarial attacks on stochastic bandits,” in Advances in Neural Information Processing Systems, 2018, pp. 3640–3649.
-  H. A. Simon, “A behavioral model of rational choice,” The Quarterly Journal of Economics, vol. 69, no. 1, pp. 99–118, 1955.
-  X. Gabaix and D. Laibson, “Bounded rationality and directed cognition,” Harvard University, 2005.
-  C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training GANs with optimism.” in International Conference on Learning Representations, 2018.
-  M. Scarsini and T. Tomala, “Repeated congestion games with bounded rationality,” International Journal of Game Theory, vol. 41, no. 3, pp. 651–669, 2012.
-  L. Chen, F. Lin, P. Tang, K. Wang, R. Wang, and S. Wang, “K-memory strategies in repeated games,” in Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 1493–1498.
-  R. Arora, O. Dekel, and A. Tewari, “Online bandit learning against an adaptive adversary: from regret to policy regret,” Proceedings of the 29th International Conference on Machine Learning, pp. 1747–1754, 2012.
-  R. Arora, M. Dinitz, T. V. Marinov, and M. Mohri, “Policy regret in repeated games,” in Advances in Neural Information Processing Systems, 2018, pp. 6732–6741.