Multi-agent systems are ubiquitous in many real-life applications such as self-driving cars, games, and computer networks. Agents acting in such systems are usually self-interested and aim to maximize their own individual utility. To achieve the best utility, agents face three fundamental questions: exploit, cooperate, or ensure safety? Learning how agents should behave when faced with these challenges is the subject of this paper. We focus on two-player repeated games (one player called the agent, the other the opponent), a setting which captures the key challenges of interacting in a multi-agent system: at each round, the players simultaneously select an action and each observes an individual numerical value called a reward. The goal of each player is to maximize the sum of accumulated rewards over many rounds. A key dilemma for learning in repeated games is the lack of a single optimal behavior that is satisfactory against all opponents, since the best strategy necessarily depends on the opponent.
Powers et al. (2007) tackle this dilemma and propose a rigorous criterion called guarded optimality, which for two players simplifies to three criteria: (1) Targeted Optimality: when the opponent is a member of the target set, the average reward is close to the best response against that opponent; (2) Safety: against any opponent, the average reward is close to a safety value; (3) Individual Rationality: in self-play, the average reward is Pareto efficient (i.e., it is impossible for one agent to change to a better policy without making the other agent worse off) and individually not below the safety value.
In this paper, we adopt those criteria and focus on the self-play setting. We pick the safety value to be the largest value one can guarantee against any opponent (also called the maximin value; see Definition 5). For the individual rationality criterion, we depart from previous works by considering the so-called egalitarian bargaining solution (EBS) (Kalai, 1977), in which both players bargain to get an equal amount above their maximin values. The EBS is a Nash equilibrium (NE) for the repeated game, a direct consequence of the folk theorems (Osborne and Rubinstein, 1994), and in many games (see Example 1) its value is typically no worse for both players than the values achievable by the single-stage (i.e., non-repeated) NE usually considered in the literature. We pick the EBS since, on top of the individual rationality criterion, it satisfies further desirable properties (Kalai, 1977) such as: independence of irrelevant alternatives (eliminating choices that were irrelevant does not change the choices of the agents), individual monotonicity (a player with better options should get a weakly better value), and, importantly, uniqueness. It is also connected to fairness and Rawls' (1971) theory of justice for human society (Kalai, 1977).
Our work is related to Munoz de Cote and Littman (2008), who provide an algorithm to find the same egalitarian solution for general-sum repeated stochastic games. When applied to general-sum repeated games, their algorithm finds an approximate solution using a (binary) search through the space of policies. Instead, our result finds the exact egalitarian solution with a simpler, more direct formula. Also, Munoz de Cote and Littman (2008) and many other works (Littman and Stone, 2003; Powers et al., 2007; Chakraborty and Stone, 2014) assume deterministic rewards known to both players. In this work, we consider stochastic rewards generated from a fixed distribution unknown to both players.
Another difference with many previous works is the type of solution considered in self-play. We consider an NE for the repeated game, whereas works such as Chakraborty and Stone (2014); Conitzer and Sandholm (2007); Banerjee and Peng (2004); Powers and Shoham (2005) consider the single-stage NE. The single-stage NE is typically undesirable in self-play, since equilibria with much higher values can be achieved, as illustrated by Example 1 in this paper. Other works, such as Powers et al. (2007), optimize the sum of rewards in self-play. However, as Example 1 also illustrates, maximizing the sum of rewards does not always guarantee individual rationality, since one player could receive less than its maximin value.
Crandall and Goodrich (2011) and Stimpson and Goodrich (2003) propose algorithms whose goal is to converge to an NE of the repeated game. However, Crandall and Goodrich (2011) only show asymptotic convergence empirically in a few games, while Stimpson and Goodrich (2003) only show that some parameter settings of their algorithm are more likely to converge asymptotically. Instead, we provide finite-time theoretical guarantees for our algorithm. Although in their setting players only observe their own rewards and not the other player's, they assume deterministic rewards.
Wei et al. (2017) and Brafman and Tennenholtz (2002) tackle online learning for a generalization of repeated games called stochastic games. However, they consider zero-sum games, where the sum of the rewards of both players for any joint-action is always 0. In our case, we look at the general-sum case, where no such restriction is placed on the rewards. In the learning setting, other single-stage equilibria have also been considered, such as the correlated equilibrium (Greenwald and Hall, 2003).
Our work is also related to multi-objective multi-armed bandits (Drugan and Nowe, 2013), by considering the joint-actions as arms controlled by a single player. Typical work on multi-objective multi-armed bandits tries to find any solution minimizing the distance to the Pareto frontier. However, not all Pareto-efficient solutions are acceptable, as illustrated by Example 1 in this paper. Instead, our work shows that a specific Pareto-efficient solution (the egalitarian one) is more desirable.
The paper is organized as follows: Section 2 formally presents our setting and assumptions, as well as the key definitions needed to understand the remainder of the paper. Section 3 describes our algorithm, while Section 4 contains its analysis as well as the lower bound. We conclude in Section 5 with directions for future work. Detailed proofs of our theorems are available in the appendix.
2 Background and Problem Statement
We focus on two-player general-sum repeated games. At each round $t$, both players simultaneously select and play a joint action $a_t$ from a finite set $\mathcal{A}$. Then, they receive rewards generated from a fixed but unknown bounded distribution depending on their joint action. The actions and rewards are then revealed to both players. We assume the first agent to be under our control and the second to be the opponent. We would like to design algorithms such that our agent's cumulative reward is as high as possible. The opponent can have one of two types, known to our agent: (1) self-player (another independently run copy of our algorithm) or (2) arbitrary (i.e., any possible opponent with no access to the agent's internal randomness).
To measure performance, we compare our agent to an oracle that has full knowledge of the distribution of rewards for all joint-actions. The oracle plays as follows: (1) in self-play, both players compute the egalitarian equilibrium before the game starts and play it; (2) against any other arbitrary opponent, the oracle plays the policy ensuring the maximin value.
Our goal is to design algorithms that have low expected regret against this oracle after any number of rounds, where regret is the difference between the value that the oracle would have obtained and the value that our algorithm actually obtained. Next, we formally define the terms that describe our problem setting.
Definition 1 (Policy).
A policy $\pi_i$ in a repeated game for player $i$ is a mapping from each possible history to a distribution over the player's actions. That is, $\pi_i : \mathcal{H}_t \to \Delta(\mathcal{A}_i)$, where $t$ is the current round and $\mathcal{H}_t$ is the set of all possible histories of joint-actions up to round $t$.
A policy is called stationary if it plays the same distribution at each round. It is called deterministic stationary if it plays the same action at each round.
Definition 2 (Joint-Policy).
A joint policy is a pair of policies, one for each player in the game. In particular, this means that the probability distributions over actions of both players are independent. When each component policy is stationary, we call the resulting joint policy stationary, and similarly for deterministic stationary.
Definition 3 (Correlated-Policy).
Any joint-policy where the players' actions are not independent is correlated (for example, through a public signal). A correlated policy specifies a probability distribution over joint-actions known by both players.
In this paper, when we refer to a policy without any qualifier, we mean a correlated-policy, which is required for the egalitarian solution. When we refer to $\pi_1$ and $\pi_2$, we mean the components of a non-correlated joint-policy.
2.1 Solution concepts
In this section, we explain the two solution concepts we aim to address: safety, selected as the maximin value, and individual rationality, selected as achieving the value of the EBS. We start with the definition of the value of a policy.
Definition 4 (Value of a policy).
The value of a policy $\pi$ for player $i$ in a repeated game is defined as the infinite-horizon undiscounted expected average reward:
$$V^{\pi}_i = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right],$$
where $r_{i,t}$ is the reward of player $i$ at round $t$.
We use $V^{\pi} = (V^{\pi}_1, V^{\pi}_2)$ to denote the values for both players and drop the superscript $\pi$ when clear from the context.
Definition 5 (Maximin value).
The maximin policy $\pi^{\text{safe}}_i$ for player $i$ and its value $v^{\text{safe}}_i$ are such that:
$$\pi^{\text{safe}}_i \in \operatorname*{arg\,max}_{\pi_i} \min_{\pi_{-i}} V_i(\pi_i, \pi_{-i}), \qquad v^{\text{safe}}_i = \max_{\pi_i} \min_{\pi_{-i}} V_i(\pi_i, \pi_{-i}),$$
where $V_i(\pi_i, \pi_{-i})$ is the value for player $i$ playing policy $\pi_i$ while all other players play $\pi_{-i}$.
Definition 6 (Advantage game and Advantage value).
Consider a repeated game between two players defined by the joint-actions $\mathcal{A}$ and the random rewards drawn from a distribution $\nu$. Let $v^{\text{safe}}_1, v^{\text{safe}}_2$ be the maximin values of the two players. The advantage game is the game with (random) rewards obtained by subtracting each player's maximin value from their rewards; more precisely, its rewards are given by $r^+_i(a) = r_i(a) - v^{\text{safe}}_i$. The value of any policy in this advantage game is called its advantage value.
Definition 7 (EBS in repeated games).
Consider a repeated game between two players with maximin values $v^{\text{safe}}_1, v^{\text{safe}}_2$. A policy is an EBS if it satisfies the following two conditions: (1) it belongs to the set of policies maximizing the minimum of the advantage values of the two players; (2) among those, it maximizes the value of the player with the higher advantage value.
More formally, for any vector $u \in \mathbb{R}^2$, let $\sigma_u$ be a permutation of $(1, 2)$ such that $u_{\sigma_u(1)} \le u_{\sigma_u(2)}$. We define a lexicographic maximin ordering $\succeq_{\text{lex}}$ on $\mathbb{R}^2$ by: $u \succeq_{\text{lex}} v$ if and only if $u_{\sigma_u(1)} > v_{\sigma_v(1)}$, or $u_{\sigma_u(1)} = v_{\sigma_v(1)}$ and $u_{\sigma_u(2)} \ge v_{\sigma_v(2)}$.
A policy $\pi^E$ is an EBS (this also corresponds to the leximin solution to the bargaining problem; Bossert and Tan (1995)) if its advantage value is maximal for the lexicographic maximin ordering over all policies.
We call the value of an EBS policy the EBS value, denoted $V^E$, and use the term egalitarian advantage for the corresponding advantage value.
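As a concrete illustration of this ordering, the comparison can be sketched in a few lines (an illustrative helper of ours, not code from the paper):

```python
def leximin_ge(u, v):
    """True iff u is at least as good as v in the lexicographic maximin
    order: compare the smallest entries first, then the next smallest."""
    su, sv = sorted(u), sorted(v)
    for a, b in zip(su, sv):
        if a != b:
            return a > b
    return True  # identical once sorted

# The EBS prefers (2, 2) to (1, 3): the worst-off player is better off.
assert leximin_ge([2, 2], [1, 3])
# Ties on the minimum are broken by the larger remaining entry.
assert leximin_ge([1, 4], [1, 3])
```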
2.2 Performance criteria
We can now define precisely the two criteria we aim to optimize.
Definition 8 (Safety Regret).
The safety regret for an algorithm $\mathbb{A}$ playing for $T$ rounds as player $i$ against an arbitrary opponent with no knowledge of the internal randomness of $\mathbb{A}$ is defined by:
$$\mathfrak{R}^{\text{safe}}_T = T \cdot v^{\text{safe}}_i - \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right].$$
Definition 9 (Individual Rational Regret).
The individual rational regret for an algorithm $\mathbb{A}$ playing for $T$ rounds as player $i$ against itself (identified as $-i$) is defined by:
$$\mathfrak{R}^{E}_{T,i} = T \cdot V^{E}_i - \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right],$$
where $V^E$ is the EBS value of Definition 7.
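Numerically, the individual rational regret of a realized run can be computed from a player's EBS value and its observed rewards (a minimal sketch of ours; the expectation is replaced by a single realization):

```python
def individual_rational_regret(ebs_value_i, rewards_i):
    """Regret of player i after T rounds against the oracle that plays an
    EBS policy from round one: T times the EBS value minus the realized
    cumulative reward."""
    T = len(rewards_i)
    return T * ebs_value_i - sum(rewards_i)

# A player with EBS value 0.75 that averaged 0.5 over 100 rounds
# accumulated a regret of 100 * 0.75 - 50 = 25.
```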
Example 1 (Comparison of the EBS value to other concepts).
In Table 1, we present a game together with the values achieved by the single-stage NE, the correlated equilibrium (Greenwald and Hall, 2003) (Correlated), maximizing the sum of rewards (Sum), and a Pareto-efficient solution (Pareto). Sum gives the first player a value much lower than its maximin value, and Pareto is similarly problematic. Consequently, converging to an arbitrary Pareto solution is not enough, since it does not guarantee rationality for both players. Both NE and Correlated fail to give the players a value higher than their maximin, while the EBS shows that a high value is achievable. A conclusion similar to this example can also be drawn for all non-trivial zero-sum games.
3 Method Description
Before we detail the safe and individually rational algorithms, we describe their general structure. The key challenge is how to deal with uncertainty, i.e., the fact that we do not know the rewards. To deal with this uncertainty, we use the standard principle of optimism in the face of uncertainty (Jaksch et al., 2010). It works by a) constructing a set of statistically plausible games containing the true game with high probability, via a confidence region around the estimated mean rewards, a step detailed in Section 3.1; b) finding within that set of plausible games the one whose EBS policy (called optimistic) has the highest value, a step detailed in Section 3.2; c) playing this optimistic policy until the start of a new artificial epoch, where a new epoch starts when the number of times any joint-action has been played has doubled (the doubling trick), a step described in Jaksch et al. (2010) and summarized by Algorithm 3 in Appendix G.
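The epoch schedule of step c) can be sketched as follows (our illustrative reimplementation of the doubling trick, not the paper's Algorithm 3): an epoch ends once some joint-action has been played within the current epoch as often as in all previous epochs combined (counting at least one).

```python
def epoch_starts(action_stream):
    """Return the rounds at which a new epoch begins under the doubling trick."""
    total = {}      # plays of each joint-action before the current epoch
    in_epoch = {}   # plays of each joint-action within the current epoch
    starts = [0]
    for t, a in enumerate(action_stream):
        if in_epoch.get(a, 0) >= max(1, total.get(a, 0)):
            starts.append(t)  # the doubling condition triggers a new epoch
            for b, c in in_epoch.items():
                total[b] = total.get(b, 0) + c
            in_epoch = {}
        in_epoch[a] = in_epoch.get(a, 0) + 1
    return starts

# With a single joint-action, epoch lengths double: starts at 0, 1, 2, 4, 8, 16, ...
```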
3.1 Construction of the plausible set
At epoch $k$, our construction is based on creating a set $\mathcal{M}_k$ containing all games with expected rewards $\tilde{r}$ such that, for each player $i$ and joint-action $a$:
$$\left| \tilde{r}_i(a) - \hat{r}_{t,i}(a) \right| \le C_t(a),$$
where $C_t(a)$ is a Hoeffding-style confidence radius of order $\sqrt{\log(t/\delta) / \max(1, N_t(a))}$, $N_t(a)$ is the number of times action $a$ has been played up to round $t$, $\hat{r}_{t,i}(a)$ is the empirical mean reward observed up to round $t$, and $\delta$ is an adjustable failure probability. The plausible set can be used to define upper and lower bounds on the rewards of the game: $\overline{r}_i(a) = \hat{r}_{t,i}(a) + C_t(a)$ and $\underline{r}_i(a) = \hat{r}_{t,i}(a) - C_t(a)$.
We denote by $\overline{M}$ the game with rewards $\overline{r}$ and by $\underline{M}$ the game with rewards $\underline{r}$. Values in those two games are denoted $\overline{V}$ and $\underline{V}$ respectively. We use $\overline{V}^{\pi}$, $\underline{V}^{\pi}$ to refer to the bounds obtained by a weighted (using $\pi$) average of the bounds for individual actions. When clear from context, the subscript $t$ is dropped.
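For concreteness, a Hoeffding-style interval of this kind can be computed as follows (the constants here are a generic choice and may differ from the paper's exact bound; the function name is ours):

```python
import math

def reward_interval(r_hat, n, t, delta=0.05):
    """Confidence interval for a mean reward in [0, 1]: with high
    probability the true mean lies within +/- radius of the empirical
    mean, where the radius shrinks as the action is played more often."""
    radius = math.sqrt(math.log(2 * t / delta) / (2 * max(1, n)))
    return max(0.0, r_hat - radius), min(1.0, r_hat + radius)

# The interval around an empirical mean of 0.6 tightens with more plays.
lo, hi = reward_interval(0.6, n=100, t=1000)
```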
3.2 Optimistic EBS policy
§1 Problem formulation
Our goal is to find a game and a policy whose EBS value is near-optimal simultaneously for both players. In particular, if we denote the true but unknown game by $M$, we want to find $\tilde{M} \in \mathcal{M}_k$ and $\tilde{\pi}$ such that, for each player $i$:
$$V^{\tilde{\pi}}_i(\tilde{M}) \ge V^{E}_i(M) - \epsilon, \qquad (2)$$
where $V^E$ is defined in Definition 7 and $\epsilon$ is a small configurable error.
Note that the condition in (2) is required (contrary to single-agent games, Jaksch et al. (2010)) since in general there might not exist a game in the plausible set that achieves the highest EBS value simultaneously for both players. For example, one can construct a plausible set containing two games whose EBS values favor different players by some $\epsilon > 0$ (see Table 2 in Appendix E). This makes the optimization problem (2) significantly more challenging than in single-agent games, since a small error in the rewards can lead to a large (linear) regret for one of the players. This is also the root cause of why the best possible regret becomes $\Omega(T^{2/3})$ rather than the $O(\sqrt{T})$ typical of single-agent games. We refer to this challenge as the small $\epsilon$-error large regret issue.
To solve (2): a) we set the optimistic game $\tilde{M}$ to be the game in the plausible set with the highest rewards for both players; indeed, for any policy and game in the set, one can always obtain a better value for both players by using the highest plausible rewards. b) We compute an advantage game corresponding to $\tilde{M}$ by estimating an optimistic maximin value for both players, a step detailed in paragraph §3. c) We compute in paragraph §4 an EBS policy $\tilde{\pi}$ using the advantage game. d) We set the policy to be $\tilde{\pi}$ unless one of the three conditions explained in paragraph §5 happens. Algorithm 2 details the steps to compute $\tilde{\pi}$; to correlate the policy, players play the joint-action minimizing their observed frequency of played actions compared to $\tilde{\pi}$ (see Algorithm 3 in Appendix G).
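The frequency-matching rule used to correlate play without a shared random signal can be sketched like this (an illustration of the idea; the function name and tie-breaking rule are our own, not the paper's exact routine):

```python
def next_joint_action(target, counts, t):
    """Both players deterministically pick the joint-action whose observed
    frequency falls furthest below its target probability; ties are broken
    by a fixed ordering. Two copies of this rule run on the shared history
    of play stay coordinated without any randomness."""
    def deficit(a):
        return target[a] - counts.get(a, 0) / max(1, t)
    return max(sorted(target), key=deficit)

# An equal mixture over two joint-actions is tracked exactly.
target = {('C', 'C'): 0.5, ('D', 'D'): 0.5}
counts = {}
for t in range(10):
    a = next_joint_action(target, counts, t)
    counts[a] = counts.get(a, 0) + 1
```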
§3 Optimistic Maximin Computation
Satisfying (2) implies that we need to find a value $\tilde{v}^{\text{safe}}_i$ with:
$$\tilde{v}^{\text{safe}}_i \ge v^{\text{safe}}_i(M), \qquad (3)$$
where $v^{\text{safe}}_i(M)$ is the maximin value of player $i$ in the true game $M$.
To do so, we return a lower bound on the value of the optimistic maximin policy of player $i$. We begin by computing in polynomial time (for example, by linear programming; Dantzig (1951); Adler (2013)) the (stationary) maximin policy for the game with the largest rewards. We then compute the (deterministic, stationary) best-response policy using the game with the lowest rewards. The detailed steps are given in Algorithm 1. This yields a lower bound on the maximin value satisfying (3), as proven in Lemma 1.
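To illustrate the quantity being computed, the following sketch brute-forces the row player's maximin mixed strategy of a two-action matrix game over a grid; the paper solves this exactly by linear programming, so this grid search is only an illustrative stand-in of ours.

```python
def maximin_grid(R, grid=1000):
    """Approximate maximin value and policy for the row player of a
    2-action matrix game; R[i][j] is the row player's reward when it
    plays i and the opponent plays j (the opponent best-responds)."""
    best_v, best_p = float("-inf"), None
    for k in range(grid + 1):
        p = k / grid
        # value of the mixed strategy (p, 1 - p) against a best response
        v = min(p * R[0][j] + (1 - p) * R[1][j] for j in range(len(R[0])))
        if v > best_v:
            best_v, best_p = v, (p, 1 - p)
    return best_v, best_p

# Matching pennies: the maximin value is 0, achieved by the uniform policy.
v, p = maximin_grid([[1.0, -1.0], [-1.0, 1.0]])
```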
§4 Computing an EBS policy.
Armed with the optimistic game and the optimistic maximin value, we can now easily compute the corresponding optimistic advantage game. An EBS policy is computed using this advantage game. The key insight is that an EBS either plays a single deterministic stationary policy or combines two deterministic stationary policies (Proposition 1). Since the number of joint-actions is finite, we can simply loop through all pairs of joint-actions and check which pair gives the best EBS score, namely the leximin advantage obtained by mixing the pair with the weight that equalizes the two players' advantages whenever such a weight exists (this score is justified in the proof of Proposition 2 in Appendix C.4). The resulting policy plays the best-scoring pair with the corresponding weights.
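Assuming the advantage value of every joint-action is known (a simplification of ours: the sketch below operates on given advantage vectors rather than on the optimistic game of Algorithm 2), the pair scan can be implemented as:

```python
from itertools import product

def ebs_from_advantages(adv, tol=1e-12):
    """Scan all pairs of joint-actions; for each pair, consider the pure
    endpoints and, when it lies in [0, 1], the mixing weight equalizing
    both players' advantages; keep the candidate that is maximal in the
    lexicographic maximin order. `adv` maps a joint-action to the pair
    of advantage values it yields."""
    def leximin_key(x):
        return tuple(sorted(x))
    best_val, best_policy = None, None
    for a, b in product(adv, repeat=2):
        u, v = adv[a], adv[b]
        weights = [0.0, 1.0]
        denom = (u[0] - u[1]) - (v[0] - v[1])
        if abs(denom) > tol:
            w = (v[1] - v[0]) / denom  # weight on `a` equalizing advantages
            if 0.0 <= w <= 1.0:
                weights.append(w)
        for w in weights:
            val = (w * u[0] + (1 - w) * v[0], w * u[1] + (1 - w) * v[1])
            if best_val is None or leximin_key(val) > leximin_key(best_val):
                best_val, best_policy = val, {a: w, b: 1.0 - w}
    return best_val, best_policy

# Mixing the two asymmetric joint-actions equally gives both players 2.
val, policy = ebs_from_advantages({'cc': (3.0, 1.0), 'dd': (1.0, 3.0)})
```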
§5 Policy Execution
We always play the optimistic EBS policy unless one of the following three events happens:
The probable error on the maximin value of one player is too large. Indeed, the error on the maximin value can become too large if the weighted bound on the actions played by the maximin policies is too large. In that case, we play the action causing the largest error.
The small $\epsilon$-error large regret issue is probable: Proposition 2 implies that this issue may only happen if the player with the lowest ideal advantage value (the maximum advantage under the condition that the advantage of the other player is non-negative) is receiving it when playing an EBS policy. This allows Algorithm 2 to check for this player and play the action corresponding to its ideal advantage, as long as the other player still receives a value $\epsilon$-close to its EBS value (Lines 5 to 15 in Algorithm 2).
The probable error on the EBS value of one player is too large. This only happens if we repeatedly avoid playing the EBS policy due to the small $\epsilon$-error large regret issue. In that case, the error on the EBS value used to detect that issue might itself become too large, making the check irrelevant. We then play the action of the EBS policy responsible for the largest error.
4 Theoretical analysis
Before we present theoretical analysis for the learning algorithm, we discuss the existence and uniqueness of the EBS value, as well as the type of policies that can achieve it.
Properties of the EBS
Fact 1 allows us to restrict our attention to stationary policies, since it means that any (optimal) achievable value can be achieved by a stationary (correlated-) policy, and Fact 2 means that the EBS value always exists and is unique, providing us with a good benchmark to compare against. Facts 1 and 2 are justified in Appendix C.1 and C.2, respectively.
Fact 1 (Achievable values for both players).
Any achievable value for the players can be achieved by a stationary correlated-policy.
Fact 2 (Existence and Uniqueness of the EBS value for stationary policies).
If we are restricted to the set of stationary policies, then the EBS value defined in Definition 7 exists and is unique.
The following Proposition 1 strengthens the observation in Fact 1 and establishes that a weighted combination of at most two joint-actions can achieve the EBS value. This allows for an efficient algorithm that simply loops through all possible pairs of joint-actions and checks for the best one. However, given any two joint-actions, one still needs to know how to combine them to obtain an EBS value. This question is answered by Proposition 2.
Proposition 1 (On the form of an EBS policy).
Given any two-player repeated game, the EBS value can always be achieved by a stationary policy with non-zero probability on at most two joint-actions.
Proposition 2 (Finding an EBS policy).
Let us call the ideal advantage value of a player the maximum advantage that this player can achieve under the restriction that the advantage value of the other player is non-negative; more formally, $A^{\text{ideal}}_i = \max_{\pi\,:\,A_{-i}(\pi) \ge 0} A_i(\pi)$, where $A_i(\pi)$ denotes the advantage value of player $i$ under $\pi$. The egalitarian advantage value is exactly the same for the two players unless there exists an EBS policy that is deterministic stationary in which at least one player (necessarily including the player with the lowest ideal advantage value) receives its ideal advantage value.
The following Theorem 1 gives a high-probability upper bound on the regret in self-play against the EBS value, a result achieved without knowledge of the horizon $T$.
Theorem 1 (Individual Rational Regret for Algorithm 3 in self-play).
The structure of the proof follows that of Jaksch et al. (2010). The key step is to prove that the value of the policy returned by Algorithm 2 in our plausible set is $\epsilon$-close to the EBS value in the true model (optimism). In our case, we cannot always guarantee this optimism. Our proof identifies the concerned cases and shows that they cannot happen too often (Lemma 4 in Appendix B.1). For the remaining cases, Lemma 3 shows that we can guarantee optimism up to a small controlled error. The step-by-step detailed proof is available in Appendix B.1. ∎
Theorem 2 (Safety Regret of policy in Algorithm 1).
Consider a safe algorithm for player $i$ obtained by playing the maximin policy computed by Algorithm 1. After any number of rounds $T$ against any opponent, with probability at least $1 - \delta$, the safety regret (Definition 8) of this policy is upper-bounded by a term of order $O\big(\sqrt{T \log(T/\delta)}\big)$.
Lower bounds for the individual rational regret
Here we establish a lower bound of $\Omega(T^{2/3})$ for any algorithm trying to learn the EBS value. This shows that our upper bound is optimal up to logarithmic factors. The key idea in proving this lower bound is the example illustrated by Table 2. In that example, the rewards of the first player are all identical, while the second player has some fixed ideal value. However, 50% of the time a player cannot realize its ideal value, due to an $\epsilon$-increase of a single joint-action's reward for both players. The main intuition behind the proof is that any algorithm that wants to minimize regret can only try two things: (a) detect whether there exists a joint-action with an $\epsilon$-increase or whether all rewards of the first player are equal; (b) always ensure the ideal value of the second player. To achieve (a), any algorithm needs to play all joint-actions on the order of $1/\epsilon^2$ times. Picking $\epsilon$ proportional to $T^{-1/3}$ ensures the desired lower bound. The same choice also yields the same lower bound for an algorithm targeting only (b). Appendix E formally proves this lower bound.
Theorem 3 (Lower bounds).
For any algorithm $\mathbb{A}$, any number of joint-actions $A$, and any sufficiently large horizon $T$, there is a general-sum game with $A$ joint-actions such that the expected individual rational regret of $\mathbb{A}$ after $T$ steps is at least $\Omega(T^{2/3})$.
5 Conclusion and Future Directions
In this paper, we illustrated a situation in which typical solutions for self-play in repeated games, such as single-stage equilibria or the sum of rewards, are not appropriate. We propose the use of the egalitarian bargaining solution (EBS), which guarantees each player a value no less than their maximin value. We analyze the properties of the EBS for repeated games with stochastic rewards and derive an algorithm that achieves a near-optimal finite-time regret of order $\tilde{O}(T^{2/3})$ with high probability. We are able to conclude that the proposed algorithm is near-optimal, since we prove a matching lower bound up to logarithmic factors. Although our results already imply a bound on the safety regret (i.e., compared to the maximin value), we also show that a component of our algorithm guarantees near-optimal safety regret against arbitrary opponents.
Our work illustrates an interesting property of the EBS which is: it can be achieved with sub-linear regret by two individually rational agents who are uncertain about their utility. We wonder if other solutions to the Bargaining Problem such as the Nash Bargaining Solution or the Kalai–Smorodinsky Solution also admit the same property. Since the EBS is an equilibrium, another intriguing question is whether one can design an algorithm that converges naturally to the EBS solution against some well-defined class of opponents.
Finally, a natural and interesting future direction for our work is its extension to stateful games such as Markov games.
- Adler (2013) Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
- Banerjee and Peng (2004) Banerjee, B. and Peng, J. (2004). Performance bounded reinforcement learning in strategic interactions. In AAAI, volume 4, pages 2–7.
- Bossert and Tan (1995) Bossert, W. and Tan, G. (1995). An arbitration game and the egalitarian bargaining solution. Social Choice and Welfare, 12(1):29–41.
- Brafman and Tennenholtz (2002) Brafman, R. I. and Tennenholtz, M. (2002). R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231.
- Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
- Chakraborty and Stone (2014) Chakraborty, D. and Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous agents and multi-agent systems, 28(2):182–213.
- Conitzer and Sandholm (2007) Conitzer, V. and Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43.
- Crandall and Goodrich (2011) Crandall, J. W. and Goodrich, M. A. (2011). Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3):281–314.
- Dantzig (1951) Dantzig, G. B. (1951). A proof of the equivalence of the programming problem and the game problem. Activity analysis of production and allocation, (13):330–338.
- Drugan and Nowe (2013) Drugan, M. M. and Nowe, A. (2013). Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
- Filippi et al. (2010) Filippi, S., Cappé, O., and Garivier, A. (2010). Optimism in reinforcement learning and Kullback–Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 115–122. IEEE.
- Greenwald and Hall (2003) Greenwald, A. and Hall, K. (2003). Correlated-q learning. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, pages 242–249. AAAI Press.
- Hoeffding (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
- Imai (1983) Imai, H. (1983). Individual monotonicity and lexicographic maxmin solution. Econometrica: Journal of the Econometric Society, pages 389–401.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
- Kalai (1977) Kalai, E. (1977). Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623–1630.
- Littman and Stone (2003) Littman, M. L. and Stone, P. (2003). A polynomial-time nash equilibrium algorithm for repeated games. In Proceedings of the 4th ACM Conference on Electronic Commerce, EC ’03, pages 48–54, New York, NY, USA. ACM.
- Munoz de Cote and Littman (2008) Munoz de Cote, E. and Littman, M. L. (2008). A polynomial-time Nash equilibrium algorithm for repeated stochastic games. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), pages 419–426, Corvallis, Oregon. AUAI Press.
- Nash (1951) Nash, J. (1951). Non-cooperative games. Annals of mathematics, pages 286–295.
- Nash Jr (1950) Nash Jr, J. F. (1950). The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162.
- Osborne and Rubinstein (1994) Osborne, M. J. and Rubinstein, A. (1994). A course in game theory. MIT Press.
- Powers and Shoham (2005) Powers, R. and Shoham, Y. (2005). Learning against opponents with bounded memory. In IJCAI, volume 5, pages 817–822.
- Powers et al. (2007) Powers, R., Shoham, Y., and Vu, T. (2007). A general criterion and an algorithmic framework for learning in multi-agent systems. Machine Learning, 67(1):45–76.
- Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Rawls (1971) Rawls, J. (1971). A theory of justice. Harvard university press.
- Stimpson and Goodrich (2003) Stimpson, J. L. and Goodrich, M. A. (2003). Learning to cooperate in a social dilemma: A satisficing approach to bargaining. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 728–735.
- Wei et al. (2017) Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4987–4997.
Appendix A Notations and terminology
We will use action to mean joint-action unless otherwise specified. We will denote the players as $i$ and $-i$; this is to be understood as follows: if the two players are $\{1, 2\}$, then $-i = 2$ when $i = 1$, and $-i = 1$ when $i = 2$. The true but unknown game will be denoted by $M$, whereas the plausible set of games we consider at epoch $k$ will be denoted by $\mathcal{M}_k$. An EBS policy in the true game will be denoted by $\pi^E$ and its value by $V^E$. If, for the EBS value in $M$, the player with the lowest ideal advantage value is receiving it, we will denote this player by $i^*$ while the other player will be $-i^*$. The EBS policy in this situation (guaranteed to be a single joint-action) will be denoted by $\pi^{E*}$.
$\hat{r}$ will be used to denote empirical mean rewards and, in general, a hat ($\hat{\cdot}$) indicates a value computed from empirical estimates. $\overline{r}$ will denote the rewards of the upper-limit game in our plausible set, while $\underline{r}$ will denote the rewards of the lower-limit game. Similarly, $\overline{V}$ will denote a value computed using $\overline{r}$ and $\underline{V}$ a value computed using $\underline{r}$.
$k$ will denote the current epoch; $n_k(a)$ the number of rounds action $a$ has been played in epoch $k$; $T_k$ the number of rounds epoch $k$ has lasted; $t_k$ the number of rounds played up to epoch $k$; $N_t(a)$ the number of rounds action $a$ has been played up to round $t$; and $\hat{r}_{t,i}(a)$ the empirical average reward of player $i$ for action $a$ at round $t$. $K_T$ will denote the total number of epochs up to round $T$.
Appendix B Proof of Theorem 1
B.1 Regret analysis for the egalitarian algorithm in self-play
The proof is similar to that of UCRL2 (Jaksch et al., 2010) and KL-UCRL (Filippi et al., 2010). As the algorithm is divided into epochs, we first show that the regret within an epoch is sub-linear. We then combine those per-epoch regret terms to obtain a regret bound for the whole horizon. Both of these regrets are computed under the assumption that the true game is within our plausible set. We then conclude by showing that this assumption indeed holds with high probability. Let us first decompose the regret.
Here we decompose the regret at each round $t$. We start by defining the following event $E$,
where the per-epoch regret is defined by
Regret when the event $E$ defined by (6) is false and the true game is in our plausible set
We now simplify the notation: all expressions below are conditioned on the event $E$ being false. We can thus bound the per-epoch regret:
where Equation (16) comes from the fact that, when $E$ is false, the optimism guarantee of Lemma 3 applies; Equation (17) comes from the assumption that the true game lies in our plausible set; and Equation (18) comes from the fact that the egalitarian solution involves playing one joint-action with some probability $w$ and another joint-action with probability $1 - w$: since by construction the players play as close as possible to these target frequencies, the resulting rounding error is bounded.
We are now ready to sum up the per-epoch regret over all epochs for which the event is false. We have:
Now assuming , we have:
Using Appendix C.3 in Jaksch et al. (2010), we can conclude that
Similarly Jaksch et al. (2010) Equation (20) shows that:
Furthermore Jaksch et al. (2010) shows that: