1 Introduction
Multiagent systems are ubiquitous in many real-life applications such as self-driving cars, games, and computer networks. Agents acting in such systems are usually self-interested and aim to maximize their own individual utility. To achieve the best utility, agents face three fundamental questions: should they exploit, cooperate, or ensure safety? Learning how agents should behave when faced with these challenges is the subject of this paper. We focus on two-player repeated games (one player is called the agent, the other the opponent), a setting that captures the key challenges of interacting in a multiagent system: at each round, the players simultaneously select an action and each observes an individual numerical value called a reward. The goal of each player in this game is to maximize the sum of accumulated rewards over many rounds. A key dilemma for learning in repeated games is the lack of a single optimal behavior that is satisfactory against all opponents, since the best strategy necessarily depends on the opponent.
Powers et al. (2007) tackle this dilemma and propose a rigorous criterion called guarded optimality, which for two players simplifies to three criteria: (1) Targeted Optimality: when the opponent is a member of the target set, the average reward is close to the best response against that opponent; (2) Safety: against any opponent, the average reward is close to a safety value; (3) Individual Rationality: in self-play, the average reward is Pareto efficient (i.e., it is impossible for one agent to change to a better policy without making the other agent worse off) and individually not below the safety value.
In this paper, we adopt those criteria and focus on the self-play setting. We pick the safety value to be the largest value one can guarantee against any opponent (also called the maximin value; see Definition 5). For the individual rationality criterion, we depart from previous works by considering the so-called egalitarian bargaining solution (EBS) (Kalai, 1977), in which both players bargain to get an equal amount above their maximin value. The EBS is a Nash equilibrium (NE) for the repeated game, a direct consequence of the folk theorems (Osborne and Rubinstein, 1994), and in many games (see Example 1) its value is typically no worse for both players than the values achievable by the single-stage (i.e., non-repeated) NE usually considered in the literature. We pick the EBS since, on top of the individual rationality criterion, it satisfies further desirable properties (Kalai, 1977) such as: independence of irrelevant alternatives (eliminating irrelevant choices does not change the agents' choices), individual monotonicity (a player with better options should get a weakly better value) and, importantly, uniqueness. It is also connected to fairness and to Rawls's (1971) theory of justice for human society (Kalai, 1977).
Related work
Our work is related to Munoz de Cote and Littman (2008), who provide an algorithm to find the same egalitarian solution for general-sum repeated stochastic games. When applied to general-sum repeated games, their algorithm finds an approximate solution using a (binary) search through the space of policies. Instead, our result finds the exact egalitarian solution with a more direct and simple formula. Also, Munoz de Cote and Littman (2008) and many other works such as (Littman and Stone, 2003; Powers et al., 2007; Chakraborty and Stone, 2014) assume deterministic rewards known to both players. In this work, we consider the case of stochastic rewards generated from a fixed distribution unknown to both players.
Another difference with many previous works is the type of solution considered in self-play. We consider an NE for the repeated game, whereas works such as (Chakraborty and Stone, 2014; Conitzer and Sandholm, 2007; Banerjee and Peng, 2004; Powers and Shoham, 2005) consider the single-stage NE. The single-stage NE is typically undesirable in self-play since equilibria with much higher values can be achieved, as illustrated by Example 1 in this paper. Other works such as Powers et al. (2007) optimize the sum of rewards in self-play. However, as Example 1 also illustrates, maximizing the sum of rewards does not always guarantee individual rationality, since some player could get less than their maximin value.
Crandall and Goodrich (2011) and Stimpson and Goodrich (2003) propose algorithms with the goal of converging to an NE of the repeated game. However, Crandall and Goodrich (2011) only show asymptotic convergence empirically in a few games, while Stimpson and Goodrich (2003) only show that some parameter settings of their algorithm are more likely to converge asymptotically. Instead, we provide finite-time theoretical guarantees for our algorithm. Although in their setting players only observe their own rewards and not the other player's, they assume deterministic rewards.
Wei et al. (2017) and Brafman and Tennenholtz (2002) tackle online learning for a generalization of repeated games called stochastic games. However, they consider zero-sum games, where the sum of the rewards of both players for any joint-action is always 0. In our case, we look at the general-sum case, where no such restriction is placed on the rewards. In the learning setting, other single-stage equilibria have also been considered, such as the correlated equilibrium (Greenwald and Hall, 2003).
Our work is also related to multi-objective multi-armed bandits (Drugan and Nowe, 2013), obtained by viewing the joint-actions as arms controlled by a single player. Typical work on multi-objective multi-armed bandits tries to find any solution minimizing the distance to the Pareto frontier. However, not all Pareto-efficient solutions are acceptable, as illustrated by Example 1 in this paper. Instead, our work shows that a specific Pareto-efficient solution (the egalitarian one) is more desirable.
Paper organization
The paper is organized as follows: Section 2 formally presents our setting and assumptions, as well as the key definitions needed to understand the remainder of the paper. Section 3 describes our algorithm, while Section 4 contains its analysis as well as the lower bound. We conclude in Section 5 with directions for future work. Detailed proofs of our theorems are available in the appendix.
2 Background and Problem Statement
We focus on two-player general-sum repeated games. At each round, both players select and play a joint action from a finite set. Then, they receive rewards generated from a fixed but unknown bounded distribution depending on their joint action. The actions and rewards are then revealed to both players. We assume the first agent to be under our control and the second agent to be the opponent. We would like to design algorithms such that our agent's cumulative rewards are as high as possible. The opponent can have one of two types, known to our agent: (1) self-player (another independently run instance of our algorithm) or (2) arbitrary (i.e., any possible opponent with no access to the agent's internal randomness).
To measure performance, we compare our agent to an oracle that has full knowledge of the distribution of rewards for all joint-actions. The oracle plays as follows: (1) in self-play, both players compute the egalitarian equilibrium before the game starts and play it; (2) against any other arbitrary opponent, the oracle plays the policy ensuring the maximin value.
Our goal is to design algorithms that have low expected regret against this oracle after any number of rounds, where the regret is the difference between the value the oracle would have obtained and the value our algorithm actually obtained. Next, we formally define the terms that describe our problem setting.
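The regret notion above can be sketched in a few lines (a simple illustration with hypothetical numbers, not the paper's formal definition, which compares expected values):

```python
def expected_regret(oracle_value, rewards):
    """Regret after T rounds: the oracle's total value minus what the
    learner actually obtained. `oracle_value` is the per-round value the
    oracle achieves against this opponent type."""
    T = len(rewards)
    return T * oracle_value - sum(rewards)

# Hypothetical run: the oracle earns 0.8 per round; the learner collected these.
r = expected_regret(0.8, [0.5, 0.9, 0.7, 0.8])  # close to 0.3
```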
Definition 1 (Policy).
A policy in a repeated game for a player is a mapping from each possible history to a distribution over its actions, where a history is any sequence of joint-actions played up to the current round.
A policy is called stationary if it plays the same distribution at each round. It is called deterministic stationary if it plays the same action at each round.
Definition 2 (Joint-Policy).
A joint policy is a pair of policies, one for each player in the game. In particular, this means that the probability distributions over actions of both players are independent. When each component policy is stationary, we call the resulting policy stationary, and similarly for deterministic stationary.

Definition 3 (Correlated-Policy).
Any joint-policy where the players' actions are not independent is correlated (for example through a public signal). A correlated policy specifies a probability distribution over joint-actions known by both players.

In this paper, when we refer to a policy without any qualifier, we mean a correlated-policy, which is required for the egalitarian solution. When we refer to the individual players' policies, we mean the components of a non-correlated joint-policy.
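The distinction between Definitions 2 and 3 can be made concrete: an independent joint-policy always induces a product distribution over joint-actions, while a correlated policy need not. A small sketch (action names are illustrative):

```python
from itertools import product

def independent_joint(pi1, pi2):
    """Joint distribution over joint-actions induced by two independent
    stationary policies (Definition 2): necessarily a product distribution."""
    return {(a1, a2): p1 * p2
            for (a1, p1), (a2, p2) in product(pi1.items(), pi2.items())}

# Two independent players each mixing 50/50:
joint = independent_joint({"C": 0.5, "D": 0.5}, {"C": 0.5, "D": 0.5})

# A correlated policy (Definition 3) can put mass only on (C,C) and (D,D),
# e.g. via a public coin flip -- impossible as a product distribution:
correlated = {("C", "C"): 0.5, ("C", "D"): 0.0,
              ("D", "C"): 0.0, ("D", "D"): 0.5}
```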
2.1 Solution concepts
In this section, we explain the two solution concepts we aim to address: safety, selected as the maximin value, and individual rationality, selected as achieving the value of the EBS. We start with the definition of the value of a policy.
Definition 4 (Value of a policy).
The value of a policy for player in a repeated game is defined as the infinite horizon undiscounted expected average reward given by:
We use to denote values for both players and drop when clear from the context.
Definition 5 (Maximin value).
The maximin policy for player and its value are such that:
where is the value for player playing policy while all other players play .
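In general, the maximin value of Definition 5 can be computed by linear programming. For a 2x2 game, the guaranteed value of a mixed strategy is the minimum of two linear functions of the mixing probability, so a direct sketch is possible (our own illustration, not the paper's Algorithm 1):

```python
def maximin_2x2(r):
    """Maximin (safety) value and strategy for the row player of a 2x2 game.
    r[i][j] is the row player's expected reward when row plays i and column
    plays j. The guaranteed value of mixing (p, 1-p) over the rows is
    min_j (p*r[0][j] + (1-p)*r[1][j]), a concave piecewise-linear function
    of p, so the maximum sits at an endpoint or where the two lines cross."""
    def guaranteed(p):
        return min(p * r[0][j] + (1 - p) * r[1][j] for j in (0, 1))
    candidates = [0.0, 1.0]
    denom = (r[0][0] - r[1][0]) - (r[0][1] - r[1][1])
    if denom != 0:
        p_star = (r[1][1] - r[1][0]) / denom  # crossing point of the two lines
        if 0.0 <= p_star <= 1.0:
            candidates.append(p_star)
    best = max(candidates, key=guaranteed)
    return guaranteed(best), best

# Matching-pennies payoffs for the row player: maximin value 0 at p = 1/2.
value, p = maximin_2x2([[1, -1], [-1, 1]])
```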
Definition 6 (Advantage game and Advantage value).
Consider a repeated game between two players, defined by the joint-actions and the random rewards drawn from a distribution, and let the maximin values of the two players be given. The advantage game is the game with (random) rewards obtained by subtracting each player's maximin value from its rewards. The value of any policy in this advantage game is called its advantage value.
Definition 7 (EBS in repeated games).
Consider a repeated game between two players with given maximin values. A policy is an EBS if it satisfies the following two conditions: (1) it belongs to the set of policies maximizing the minimum of the advantage values of the two players; (2) among those, it maximizes the value of the player with the highest advantage value.
More formally, for any advantage vector, let a permutation sort its components in nondecreasing order, and define a lexicographic maximin ordering by comparing the sorted vectors lexicographically. A policy is an EBS (this also corresponds to the leximin solution to the Bargaining problem; Bossert and Tan, 1995) if its advantage value is maximal for this ordering.
We call the value of an EBS policy the EBS value, and we refer to its advantage value as the egalitarian advantage.
2.2 Performance criteria
We can now define precisely the two criteria we aim to optimize.
Definition 8 (Safety Regret).
The safety regret of an algorithm playing for rounds as the agent against an arbitrary opponent with no knowledge of the algorithm's internal randomness is defined by:
Definition 9 (Individual Rational Regret).
The individual rational regret of an algorithm playing for rounds as the agent against an independently run copy of itself is defined by:
Example 1 (Comparison of the EBS value to other concepts).
In Table 1, we present a game and give the values achieved by the single-stage NE and the correlated equilibrium of Greenwald and Hall (2003) (Correlated), by maximizing the sum of rewards (Sum), and by a Pareto-efficient solution (Pareto). In this game, Sum leaves the first player with a value much lower than its maximin. Pareto is similarly problematic. Consequently, it is not enough to converge to just any Pareto solution, since that does not necessarily guarantee rationality for both players. Both NE and Correlated fail to give the players a value higher than their maximin, while the EBS shows that a high value is achievable. A conclusion similar to this example can also be made for all nontrivial zero-sum games.
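Since Table 1's exact payoffs are not reproduced here, the failure mode of Sum can be illustrated on a hypothetical 2x2 game of our own (all numbers below are assumptions, not the paper's):

```python
# Hypothetical game in the spirit of Example 1: rewards[(a1, a2)] = (r1, r2).
rewards = {("A", "A"): (0.0, 4.5), ("A", "B"): (1.0, 1.0),
           ("B", "A"): (1.0, 1.0), ("B", "B"): (2.0, 2.0)}

# Player 1's maximin (Definition 5): its reward rows are A -> [0, 1] and
# B -> [1, 0], so mixing A/B equally guarantees 0.5 whatever player 2 does.
maximin_1 = 0.5

# The joint-action maximizing the sum of rewards...
best_sum = max(rewards, key=lambda a: rewards[a][0] + rewards[a][1])
# ...gives player 1 less than its maximin, violating individual rationality:
assert rewards[best_sum][0] < maximin_1
```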


3 Methods Description
Generic structure
Before we detail the safe and individually rational algorithms, we describe their general structure. The key challenge is how to deal with uncertainty, i.e., the fact that we do not know the rewards. To deal with this uncertainty, we use the standard principle of optimism in the face of uncertainty (Jaksch et al., 2010). It works by a) constructing a set of statistically plausible games containing the true game with high probability through a confidence region around the estimated mean rewards, a step detailed in Section 3.1; b) finding within that set of plausible games the one whose EBS policy (called optimistic) has the highest value, a step detailed in Section 3.2; c) playing this optimistic policy until the start of a new artificial epoch, where a new epoch starts when the number of times some joint-action has been played is doubled (also known as the doubling trick), a step described in Jaksch et al. (2010) and summarized by Algorithm 3 in Appendix G.

3.1 Construction of the plausible set
At epoch , our construction is based on creating a set containing all possible games with expected rewards such that,
(1) 
where is the number of times action has been played up to round , is the empirical mean reward observed up to round and is an adjustable probability. The plausible set can be used to define the following upper and lower bounds on the rewards of the game:
We denote by separate symbols the game with the upper-bound rewards and the game with the lower-bound rewards, and values in those two games accordingly. We also use a weighted average (with the policy's distribution as weights) of the per-action bounds to bound the value of a policy. When clear from context, the subscript is dropped.
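A Hoeffding-style construction of the per-action bounds can be sketched as follows. The exact constant and logarithmic term of Equation (1) are not reproduced above, so the radius below is an assumption in the usual UCRL2 style, with rewards assumed bounded in [0, 1]:

```python
import math

def reward_bounds(emp_mean, n_plays, t, delta, n_actions):
    """Optimistic/pessimistic reward bounds for one joint-action from a
    Hoeffding-style confidence radius; the paper's Equation (1) may use a
    slightly different constant. Returns (lower, upper), clipped to [0, 1]."""
    radius = math.sqrt(math.log(2 * n_actions * t / delta)
                       / (2 * max(1, n_plays)))
    return max(0.0, emp_mean - radius), min(1.0, emp_mean + radius)

lo, up = reward_bounds(emp_mean=0.6, n_plays=100, t=1000,
                       delta=0.05, n_actions=4)
```

The upper bounds define the optimistic game used in Section 3.2, and the lower bounds its pessimistic counterpart.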
3.2 Optimistic EBS policy
§1 Problem formulation
Our goal is to find a game and a policy whose EBS value is near-optimal simultaneously for both players. In particular, relative to the true but unknown game, we want to find a plausible game and a policy such that:
(2) 
where the ordering is the one defined in Definition 7 and the tolerance is a small configurable error.
Note that the condition in (2) is required (contrary to single-agent games; Jaksch et al. (2010)) since in general there might not exist a game in the plausible set that achieves the highest EBS value simultaneously for both players. For example, one can construct a case where the plausible set contains two games whose EBS values each favor a different player (see Table 2 in Appendix E). This makes the optimization problem (2) significantly more challenging than in single-agent games, since a small error in the rewards can lead to a large (linear) regret for one of the players. This is also the root cause for why the best possible regret is worse than the rates typical for single-agent games. We refer to this challenge as the small error large regret issue.
§2 Solution
To solve (2): a) we set the optimistic game to be the game in the plausible set with the highest rewards for both players; indeed, for any policy and plausible game, one can always get a better value for both players by using the highest rewards; b) we compute the advantage game corresponding to this optimistic game by estimating an optimistic maximin value for both players, a step detailed in paragraph §3; c) we compute in paragraph §4 an EBS policy using this advantage game; d) we play this policy unless one of the three conditions explained in paragraph §5 happens. Algorithm 2 details the steps to compute the policy; to correlate the policy, the players play the joint-action minimizing their observed frequency of played actions compared to the target distribution (see Algorithm 3 in Appendix G).
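The frequency-matching correlation device mentioned in step d) can be sketched as follows (the function name and tie-breaking rule are our assumptions, not the paper's Algorithm 3):

```python
def next_correlated_action(target, counts, t):
    """Deterministic correlation device sketch: both players pick the
    joint-action whose observed play count lags furthest behind the target
    correlated distribution after t rounds, so the empirical frequencies
    track the target without any shared random signal."""
    # sorted() gives both players the same tie-breaking order
    return min(sorted(target), key=lambda a: counts[a] - t * target[a])

target = {("A", "A"): 0.6875, ("B", "B"): 0.3125}
counts = {("A", "A"): 6, ("B", "B"): 4}
a = next_correlated_action(target, counts, 10)  # ("A","A") lags most
```

Because the rule is deterministic given the shared history, two independent copies of the algorithm select the same joint-action each round.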
§3 Optimistic Maximin Computation
Satisfying (2) implies we need to find a value with:
(3) 
where the target is the maximin value of each player in the true game. To do so, we return a lower bound on the value of the optimistic maximin policy of each player. We begin by computing in polynomial time (for example by linear programming; Dantzig (1951); Adler (2013)) the (stationary) maximin policy for the game with the largest rewards. We then compute the (deterministic, stationary) best response policy using the game with the lowest rewards. The detailed steps are available in Algorithm 1. This results in a lower bound on the maximin value satisfying (3), as proven in Lemma 1.

§4 Computing an EBS policy
Armed with the optimistic game and the optimistic maximin value, we can now easily compute the corresponding optimistic advantage game. An EBS policy is computed using this advantage game. The key insight is that an EBS either plays a single deterministic stationary policy or combines two deterministic stationary policies (Proposition 1). Given that the number of actions is finite, we can simply loop through all pairs of joint-actions and check which one gives the best EBS score. The score (justified in the proof of Proposition 2 in Appendix C.4) for any two joint-actions uses a mixing weight as follows:
(4) 
And the policy is such that
(5) 
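Since Equations (4) and (5) are not reproduced above, here is a hedged sketch of the core computation for a candidate pair of joint-actions: the mixing weight that equalizes the two players' advantages (the paper's exact score also handles further cases):

```python
def egalitarian_mix(adv_i, adv_j):
    """Weight w on joint-action i (and 1-w on j) equalizing both players'
    advantage values; a sketch of the idea behind Equations (4)-(5), not the
    paper's exact formula. adv_i, adv_j are (player1, player2) advantage
    pairs. Returns (w, common_advantage), or None if no w in [0, 1] works.

    Solving w*adv_i[0] + (1-w)*adv_j[0] = w*adv_i[1] + (1-w)*adv_j[1]:"""
    denom = (adv_i[0] - adv_j[0]) - (adv_i[1] - adv_j[1])
    if denom == 0:
        return None
    w = (adv_j[1] - adv_j[0]) / denom
    if not 0.0 <= w <= 1.0:
        return None
    common = w * adv_i[0] + (1 - w) * adv_j[0]
    return w, common

# Advantages (1.5, 0.25) at one joint-action and (-0.5, 2.25) at another:
w, common = egalitarian_mix((1.5, 0.25), (-0.5, 2.25))  # w = 0.6875
```

Looping this over all pairs of joint-actions and keeping the best score yields the finite search described above.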
§5 Policy Execution
We always play the optimistic EBS policy unless one of the following three events happens:

The probable error on the maximin value of one player is too large. Indeed, the error on the maximin value can become too large if the weighted bound on the actions played by the maximin policies is too large. In that case, we play the action causing the largest error.

The small error large regret issue is probable: Proposition 2 implies that this issue may only happen if the player with the lowest ideal advantage value (the maximum advantage under the condition that the advantage of the other player is non-negative) is receiving it when playing an EBS policy. This allows Algorithm 2 to check for this player and play the action corresponding to its ideal advantage, as long as the other player still receives a value close to its EBS value (Lines 5 to 15 in Algorithm 2).

The probable error on the EBS value of one player is too large. This only happens if we repeatedly avoid playing the EBS policy due to the small error large regret issue. In that case, the error estimate used to detect the small error large regret issue might itself become too large, making that check irrelevant. We then play the action of the EBS policy responsible for the largest error.
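The three exceptional events above amount to a per-round dispatch; a minimal sketch with hypothetical names and thresholds (nothing below is the paper's notation):

```python
def choose_behaviour(errors, thresholds):
    """Hypothetical dispatch over the three exceptional events of §5.
    `errors` holds current error estimates and flags; `thresholds` the
    allowed levels. Returns a label for the behaviour followed this round."""
    if errors["maximin"] > thresholds["maximin"]:
        return "reduce-maximin-error"   # event 1: play the worst-error action
    if errors["regret_risk"]:
        return "play-ideal-advantage"   # event 2: guard against the issue
    if errors["ebs"] > thresholds["ebs"]:
        return "reduce-ebs-error"       # event 3: shrink the EBS error
    return "play-ebs-policy"            # default: optimistic EBS policy
```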
4 Theoretical analysis
Before we present theoretical analysis for the learning algorithm, we discuss the existence and uniqueness of the EBS value, as well as the type of policies that can achieve it.
Properties of the EBS
Fact 1 allows us to restrict our attention to stationary policies, since any (optimal) achievable value can be achieved by a stationary (correlated) policy, and Fact 2 shows that the EBS value always exists and is unique, providing us with a well-defined benchmark to compare against. Facts 1 and 2 are justified in Appendix C.1 and C.2, respectively.
Fact 1 (Achievable values for both players).
Any achievable value for the players can be achieved by a stationary correlated-policy.
Fact 2 (Existence and Uniqueness of the EBS value for stationary policies).
If we are restricted to the set of stationary policies, then the EBS value defined in Definition 7 exists and is unique.
The following Proposition 1 strengthens the observation in Fact 1 and establishes that a weighted combination of at most two joint-actions can achieve the EBS value. This allows for an efficient algorithm that simply loops through all possible pairs of joint-actions and checks for the best one. However, given any two joint-actions, one still needs to know how to combine them to get an EBS value. This question is answered by Proposition 2.
Proposition 1 (On the form of an EBS policy).
Given any two-player repeated game, the EBS value can always be achieved by a stationary policy with non-zero probability on at most two joint-actions.
Sketch.
Proposition 2 (Finding an EBS policy).
Let us call the ideal advantage value of a player the maximum advantage that this player can achieve under the restriction that the advantage value of the other player is non-negative. The egalitarian advantage values of the two players are exactly the same unless there exists an EBS policy that is deterministic stationary in which at least one player (necessarily including the player with the lowest ideal advantage value) receives its ideal advantage value.
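A simplified sketch of the ideal advantage value, restricted to pure joint-actions for readability (Proposition 2 optimizes over policies, so this restriction is an assumption of the illustration; the advantage numbers are hypothetical):

```python
def ideal_advantage(advantages, player):
    """Best advantage `player` can get over pure joint-actions whose other-
    player advantage is non-negative; a pure-action simplification of the
    ideal advantage value in Proposition 2. `advantages` maps joint-actions
    to (adv1, adv2) pairs."""
    other = 1 - player
    feasible = [a for a in advantages.values() if a[other] >= 0]
    return max(a[player] for a in feasible)

advs = {("A", "A"): (-0.5, 2.25), ("A", "B"): (0.5, -0.75),
        ("B", "A"): (0.5, -0.75), ("B", "B"): (1.5, 0.25)}
# For player 1 (index 0), only (A,A) and (B,B) keep player 2 non-negative,
# so its ideal advantage here is 1.5.
v = ideal_advantage(advs, 0)
```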
Regret Analysis
The following Theorem 1 gives a high-probability upper bound on the regret in self-play against the EBS value, a result achieved without prior knowledge of the horizon.
Theorem 1 (Individual Rational Regret for Algorithm 3 in selfplay).
Sketch.
The structure of the proof follows that of Jaksch et al. (2010). The key step is to prove that the value of the policy returned by Algorithm 2 in our plausible set is close to the EBS value in the true model (optimism). In our case, we cannot always guarantee this optimism. Our proof identifies the concerned cases and shows that they cannot happen too often (Lemma 4 in Appendix B.1). For the remaining cases, Lemma 3 shows that we can guarantee optimism up to a small error. The step-by-step detailed proof is available in Appendix B.1. ∎
By definition of the EBS, Theorem 1 also applies to the safety regret. However, in Theorem 2 we show that the optimistic maximin policy alone already enjoys a near-optimal safety regret.
Lower bounds for the individual rational regret
Here we establish a lower bound for any algorithm trying to learn the EBS value, showing that our upper bound is optimal up to logarithmic factors. The key idea in proving this lower bound is the example illustrated by Table 2. In that example, the rewards of the first player are all equal and the second player has a fixed ideal value. However, half of the time a player cannot realize its ideal value due to an increase in a single joint-action for both players. The main intuition behind the proof is that any algorithm that wants to minimize regret can only try two things: (a) detect whether there exists a joint-action with an increased reward or whether all rewards of the first player are equal; (b) always ensure the ideal value of the second player. To achieve (a), any algorithm needs to play all joint-actions sufficiently many times, and picking the reward gap appropriately ensures the desired lower bound. The same choice also yields the same lower bound for an algorithm targeting only (b). Appendix E formally proves this lower bound.
Theorem 3 (Lower bounds).
For any algorithm , any natural numbers , , , there is a general sum game with jointactions such that the expected individual rational regret of after steps is at least .
5 Conclusion and Future Directions
In this paper, we illustrated a situation in which typical solutions for self-play in repeated games, such as single-stage equilibria or the sum of rewards, are not appropriate. We propose the usage of the egalitarian bargaining solution (EBS), which guarantees each player a value no less than their maximin value. We analyze the properties of the EBS for repeated games with stochastic rewards and derive an algorithm that achieves a near-optimal finite-time regret with high probability. We conclude that the proposed algorithm is near-optimal, since we prove a matching lower bound up to logarithmic factors. Although our results imply a bound on the safety regret (i.e., regret compared to the maximin value), we also show that a component of our algorithm guarantees near-optimal safety regret against arbitrary opponents.
Our work illustrates an interesting property of the EBS: it can be achieved with sublinear regret by two individually rational agents who are uncertain about their utility. We wonder whether other solutions to the Bargaining Problem, such as the Nash Bargaining Solution or the Kalai–Smorodinsky Solution, admit the same property. Since the EBS is an equilibrium, another intriguing question is whether one can design an algorithm that converges naturally to the EBS against some well-defined class of opponents.
Finally, a natural and interesting future direction for our work is its extension to stateful games such as Markov games.
References

Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.

Banerjee, B. and Peng, J. (2004). Performance bounded reinforcement learning in strategic interactions. In AAAI, volume 4, pages 2–7.

Bossert, W. and Tan, G. (1995). An arbitration game and the egalitarian bargaining solution. Social Choice and Welfare, 12(1):29–41.

Brafman, R. I. and Tennenholtz, M. (2002). R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.

Chakraborty, D. and Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2):182–213.

Conitzer, V. and Sandholm, T. (2007). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43.

Crandall, J. W. and Goodrich, M. A. (2011). Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3):281–314.

Dantzig, G. B. (1951). A proof of the equivalence of the programming problem and the game problem. Activity Analysis of Production and Allocation, (13):330–338.

Drugan, M. M. and Nowe, A. (2013). Designing multi-objective multi-armed bandits algorithms: A study.

Filippi, S., Cappé, O., and Garivier, A. (2010). Optimism in reinforcement learning and Kullback–Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 115–122. IEEE.

Greenwald, A. and Hall, K. (2003). Correlated-Q learning. In Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, pages 242–249. AAAI Press.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.

Imai, H. (1983). Individual monotonicity and lexicographic maxmin solution. Econometrica: Journal of the Econometric Society, pages 389–401.

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.

Kalai, E. (1977). Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623–1630.

Littman, M. L. and Stone, P. (2003). A polynomial-time Nash equilibrium algorithm for repeated games. In Proceedings of the 4th ACM Conference on Electronic Commerce, EC '03, pages 48–54, New York, NY, USA. ACM.

Munoz de Cote, E. and Littman, M. L. (2008). A polynomial-time Nash equilibrium algorithm for repeated stochastic games. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), pages 419–426, Corvallis, Oregon. AUAI Press.

Nash, J. (1951). Non-cooperative games. Annals of Mathematics, pages 286–295.

Nash Jr, J. F. (1950). The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162.

Osborne, M. J. and Rubinstein, A. (1994). A Course in Game Theory.

Powers, R. and Shoham, Y. (2005). Learning against opponents with bounded memory. In IJCAI, volume 5, pages 817–822.

Powers, R., Shoham, Y., and Vu, T. (2007). A general criterion and an algorithmic framework for learning in multi-agent systems. Machine Learning, 67(1):45–76.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rawls, J. (1971). A Theory of Justice. Harvard University Press.

Stimpson, J. L. and Goodrich, M. A. (2003). Learning to cooperate in a social dilemma: A satisficing approach to bargaining. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 728–735.

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems 30, pages 4987–4997.
Appendix A Notations and terminology
We will use action to mean joint-action unless otherwise specified. We will denote the players by an index and its complement: if one index designates the first player, the complement designates the second, and vice versa. The true but unknown game is denoted separately from the plausible set of games we consider at each epoch. An EBS policy in the true game and its value receive dedicated notation. If, for the EBS value in the true game, the player with the lowest ideal advantage value is receiving it, we denote this player distinctly, while the other player takes the complementary notation. The EBS policy in this situation also receives dedicated notation (it is guaranteed to be a single joint-action).
will be used to denote empirical mean rewards and, in general, to mark a value computed from empirical estimates. Dedicated notation will be used for the rewards from the upper-limit game in our plausible set, and likewise for the rewards from the lower-limit game. Also, in general, matching notation will be used to mean a value computed using the upper limits or the lower limits, respectively.
will be used to denote the current epoch. the number of rounds action has been played in epoch — the number of rounds epoch has lasted — the number of rounds played up to epoch — the number of rounds action has been played up to round — the empirical average rewards of player for action at round . will be used to denote the total number of epochs up to round .
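The epoch counts above are driven by the doubling trick of the generic structure (Section 3); a minimal sketch of the schedule (our own illustration, not Algorithm 3 itself):

```python
def run_epochs(total_rounds, choose_action, actions):
    """Doubling-trick epoch schedule (in the spirit of Jaksch et al., 2010):
    a new epoch starts once some joint-action's in-epoch play count reaches
    its count before the epoch, i.e. its total count has doubled. Returns
    the number of epochs, which grows only logarithmically in the horizon."""
    counts = {a: 0 for a in actions}   # total plays of each joint-action
    epochs, t = 0, 0
    while t < total_rounds:
        epochs += 1
        start = dict(counts)           # counts at the start of this epoch
        in_epoch = {a: 0 for a in actions}
        while t < total_rounds and all(
                in_epoch[a] < max(1, start[a]) for a in actions):
            a = choose_action(t)       # the policy fixed for this epoch
            in_epoch[a] += 1
            counts[a] += 1
            t += 1
    return epochs

# With a single action, epoch lengths are 1, 1, 2, 4, 8, ...
n = run_epochs(16, lambda t: "a", ["a"])  # 5 epochs over 16 rounds
```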
Appendix B Proof of Theorem 1
Theorem (1).
b.1 Regret analysis for the egalitarian algorithm in selfplay
The proof is similar to that of UCRL2 (Jaksch et al., 2010) and KL-UCRL (Filippi et al., 2010). As the algorithm is divided into epochs, we first show that the regret within an epoch is sublinear. We then combine those per-epoch regret terms to get a regret bound for the whole horizon. Both of these bounds are derived under the assumption that the true game lies within our plausible set. We then conclude by showing that this is indeed true with high probability. Let us first start by decomposing the regret.
Regret decomposition
Here we decompose the regret in each round . We start by defining the following event ,
(6) 
(7)  
(8)  
(9)  
(10) 
We have:
(11)  
(12)  
(13) 
In the following, we will use Hoeffding’s inequality to bound the last term of Equation 13, similarly to Section 4.1 in Jaksch et al. (2010). In particular, with probability at least :
(14) 
where the per-epoch regret is defined by
(15) 
Regret when the event E defined by (6) is False and the true Model is in our plausible set
We will now simplify the notation by using a shorthand to mean that the expression is conditioned on the event being False. We can thus bound the per-epoch regret:
(16)  
(17)  
(18) 
where Equation (16) comes from the fact that optimism holds when the event is False (see Lemma 3). Equation (17) comes from the assumption that the true game lies in the plausible set. Equation (18) comes from the fact that the egalitarian solution involves playing one joint-action with some probability and another joint-action with the complementary probability; since the corresponding play counts can always be bounded using a non-negative integer, and by construction the players play as close as possible to the target mixture, the resulting error is bounded.
We are now ready to sum up the perepoch regret over all epochs for which the event is false. We have:
(19)  
(20) 
Now assuming , we have:
(21)  
(22) 
Using Appendix C.3 in Jaksch et al. (2010), we can conclude that
Similarly Jaksch et al. (2010) Equation (20) shows that:
Furthermore Jaksch et al. (2010) shows that: