Foolproof Cooperative Learning

06/24/2019 ∙ by Alexis Jacq, et al.

This paper extends the notion of equilibrium in game theory to learning algorithms in repeated stochastic games. We define a learning equilibrium as an algorithm used by a population of players, such that no player can individually use an alternative algorithm and increase its asymptotic score. We introduce Foolproof Cooperative Learning (FCL), an algorithm that converges to a Tit-for-Tat behavior. It allows cooperative strategies when played against itself while not being exploitable by selfish players. We prove that in repeated symmetric games, this algorithm is a learning equilibrium. We illustrate the behavior of FCL on symmetric matrix and grid games, and its robustness to selfish learners.




1 Introduction

In William Golding’s novel “Lord of the Flies”, a group of children who survived an airplane crash try to establish rules on a desert island in order to avoid chaos. Unfortunately, they fail at forcing a cooperative solution and some of them start defecting, which results in a demented group behaviour. In this paper, we prevent such tragedies in learning algorithms by constructing a safe way to learn cooperation in unknown environments, without being exploitable by potentially selfish agents.

In multi-agent learning settings, environments are usually modeled by stochastic games [shapley1953stochastic]. Multi-agent reinforcement learning (MARL) provides a framework for constructing algorithms that aim to solve stochastic games, where players individually or jointly search for an optimal decision-making policy that maximizes a reward function. Individualist approaches mostly aim at reaching an equilibrium, taking the best actions whatever the opponents' behaviors are [bowling2001rational, littman2001friend]. Joint approaches aim at optimizing a cooperative objective and can be viewed as a single-agent problem in a larger dimension [claus1998dynamics], but are easily exploited when one agent starts being individualist.

We focus on symmetric situations, making sure that no agent has an individual advantage. For example, this is the case on a desert island with a quantity of resources equally accessible to all agents. Moreover, we consider repeated games, modelling the recurrent possibility of restarting the situation from the beginning. In the island resource example, repetitions could represent successive days or, at a larger scale, four-season cycles.

In this context, we introduce Foolproof Cooperative Learning (FCL), a model-free learning algorithm that, by construction, converges to a Tit-for-Tat behaviour, cooperative against itself and retaliatory against selfish algorithms. We propose a definition for learning equilibrium, describing a class of learning algorithms such that the best way to play against them is to adopt the same behaviour. We demonstrate that FCL is an example of learning equilibrium that forces a cooperative behaviour, and we empirically verify this claim with two-agent matrix and grid-world repeated symmetric games.

The proofs of all stated results are provided in the appendix.

2 Definitions and Notations

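An N-player stochastic game can be written as a tuple $(S, A_1, \dots, A_N, P, \rho, R_1, \dots, R_N)$, where $S$ is the set of states, $A_i$ the set of actions for player $i$, $P$ the transition probability ($P : S \times A_1 \times \dots \times A_N \to \Delta(S)$, with $\Delta(S)$ the set of probability distributions over $S$), $\rho$ a distribution over states ($\rho \in \Delta(S)$), and $R_i$ the reward function for player $i$ ($R_i : S \times A_1 \times \dots \times A_N \to \mathbb{R}$).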

We also assume bounded, deterministic reward functions and finite state and action spaces.

In a repeated stochastic game, a stochastic game (the stage game) is played and, at each iteration, it continues with probability $\gamma$ or terminates and starts again according to the initial state distribution $\rho$. This is repeated an infinite number of times, and players have to maximize their average return during a stage game [de2012polynomial]. Terminating with probability $1-\gamma$ is equivalent to using a discount factor $\gamma$ while playing a stage game.
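The equivalence between random termination and discounting can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the reward sequence is finite, and the stage game is assumed to stop for sure after the last listed reward. It computes the expected undiscounted return by enumerating every possible termination time and compares it with the discounted sum.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def expected_return_with_termination(rewards, gamma):
    """Expected undiscounted return when the stage game continues
    with probability gamma after each step (assumption: it always
    stops after len(rewards) steps)."""
    n = len(rewards)
    expected = 0.0
    cumulative = 0.0
    for t, r in enumerate(rewards):
        cumulative += r
        # Probability that the stage game lasts exactly t + 1 steps.
        if t < n - 1:
            prob = (gamma ** t) * (1.0 - gamma)
        else:
            prob = gamma ** t
        expected += prob * cumulative
    return expected


rewards = [1.0, 2.0, 3.0]
print(discounted_return(rewards, 0.9))
print(expected_return_with_termination(rewards, 0.9))
```

Both calls print the same value up to floating-point error, which is the sense in which a continuation probability acts as a discount factor.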

A stationary strategy (or policy) for player $i$, $\pi_i$, maps a state to a probability distribution over its set of possible actions. We note $\pi_{-i}$ the product of all players' strategies but player $i$, and $\pi$ the product of all players' strategies, called the strategy profile. Given opponents' strategies $\pi_{-i}$, the goal for a rational player $i$ is to find a strategy $\pi_i^*$ that maximizes its average return during a stage game, noted $V_i$:

$$\pi_i^* \in \arg\max_{\pi_i} V_i(\pi_i, \pi_{-i}). \tag{1}$$

The policy $\pi_i^*$ depends on the opponents' strategies and is called the best response for player $i$ to $\pi_{-i}$. In general, we call strategy any process $(\pi_k)_k$ defining a stationary strategy $\pi_k$ for any stage $k$. The value of a player's non-stationary strategy is the average return over stage games, $\bar V_i = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} V_i(\pi_k)$.

In order to allow rewarding or retaliation strategies, we only consider games where all players are aware of all opponents' actions and rewards, and receive a signal each time the game is reset. We also allow players to share information with some opponents in order to organize joint retaliation actions or joint explorations. Moreover, we only consider Repeated Symmetric Games (RSG):

Definition 1 (Repeated Symmetric game (RSG)).

An N-player repeated stochastic game is symmetric if, for any stationary strategy profile $\pi = (\pi_1, \dots, \pi_N)$ and for any permutation $\sigma$ over players:

$$\forall i:\quad V_i(\pi_1, \dots, \pi_N) = V_{\sigma^{-1}(i)}\big(\pi_{\sigma(1)}, \dots, \pi_{\sigma(N)}\big), \tag{2}$$

where $V_i$ denotes player $i$'s average return during a stage game.

This generalizes the definition for symmetric N-player matrix games [dasgupta1986existence] to stochastic games where players' utilities are replaced by average returns. (Footnote 1: Actually, the definition initially given in [dasgupta1986existence] is incorrect in the sense that symmetries are not independent of player identities, which is not the case if the right-hand return is indexed with the inverse permutation instead [vester2012symmetric].) In this paper, we use the concept of N-cyclic permutations to construct specific strategies:

Definition 2 (N-cyclic permutation).

A permutation $\sigma$ over players is N-cyclic if for all players $i, j$, there is $k \le N$ such that $\sigma^k(i) = j$.
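Definitions 1 and 2 can be made concrete on small examples. The sketch below is illustrative only: both function names are hypothetical, players are indexed from 0, the symmetry check is restricted to two-player matrix games, and `payoff_k[a][b]` is assumed to be the reward of player k when player 1 plays `a` and player 2 plays `b`.

```python
def is_symmetric_two_player(payoff_1, payoff_2):
    """Two-player matrix game symmetry: swapping the players' roles
    swaps their payoffs, i.e. payoff_2[b][a] == payoff_1[a][b]."""
    n = len(payoff_1)
    if len(payoff_2) != n:
        return False
    if any(len(row) != n for row in payoff_1 + payoff_2):
        return False
    return all(payoff_1[a][b] == payoff_2[b][a]
               for a in range(n) for b in range(n))


def is_n_cyclic(perm):
    """perm[i] is the image of player i. The permutation is N-cyclic
    when the orbit of one player covers all N players."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(n)):
        return False
    visited = set()
    player = 0
    while player not in visited:
        visited.add(player)
        player = perm[player]
    return len(visited) == n


# Prisoner's dilemma payoffs (action 0 = cooperate, 1 = defect).
pd_player_1 = [[3, 0], [5, 1]]
pd_player_2 = [[3, 5], [0, 1]]
print(is_symmetric_two_player(pd_player_1, pd_player_2))
print(is_n_cyclic([1, 2, 0]), is_n_cyclic([1, 0, 2]))
```

A permutation such as `[1, 0, 2]` swaps two players but leaves the third fixed, so it is not N-cyclic: repeatedly applying it would never give the third player the roles of the other two.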

2.1 Nash equilibrium

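A Nash equilibrium describes a stationary strategy profile $\pi^* = (\pi_1^*, \dots, \pi_N^*)$ such that no player can individually deviate and increase its payoff [nash1951non]:

$$\forall i,\ \forall \pi_i:\quad V_i(\pi_i^*, \pi_{-i}^*) \ge V_i(\pi_i, \pi_{-i}^*), \tag{3}$$

where $V_i$ denotes player $i$'s average return during a stage game. Note that in a symmetric game, for any Nash equilibrium with returns $(V_1, \dots, V_N)$ and for any permutation $\sigma$ over players, there is another Nash equilibrium with returns $(V_{\sigma(1)}, \dots, V_{\sigma(N)})$. This definition can be extended to non-stationary policies using the expected return over stage games: no player can individually deviate from an equilibrium non-stationary strategy and increase its average return over stage games:

$$\forall i,\ \forall \pi_i:\quad \bar V_i(\pi_i^*, \pi_{-i}^*) \ge \bar V_i(\pi_i, \pi_{-i}^*), \tag{4}$$

where $\bar V_i$ denotes player $i$'s average return over stage games.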
As for stationary strategy profiles, any stationary strategy equilibrium is still an equilibrium among non-stationary processes.

3 Cooperative strategies

We call cooperative any strategy (not necessarily stationary) that maximizes a common quantity. Usual examples are strategies that maximize the sum, the product or the minimum of players' returns. In RSGs, the strategy that maximizes the minimum of players' returns is particularly interesting as it coincides with Kalai's [kalai1977proportional] solution to the bargaining problem [nash1950bargaining] and is easy to determine. In this paper, we refer to this strategy as the bestmin solution. An important property of RSGs is that bestmin solutions can always be obtained by repeatedly applying an N-cyclic permutation to a stationary strategy that maximizes the sum of players' returns.

Theorem 1.

Let $\pi^+$ be a stationary strategy profile that maximizes the sum of players' returns in an N-player RSG, $\sigma$ an N-cyclic permutation over players, and $k$ indexing the repeated stage games. Then, the strategy $\pi_k = \sigma^k(\pi^+)$ (where $\sigma^k = \sigma \circ \dots \circ \sigma$, $k$ times) is a bestmin strategy.
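The intuition behind Theorem 1 is that rotating roles with an N-cyclic permutation equalizes returns without reducing their sum. The sketch below is a toy illustration under stated assumptions, not the construction used in the proof: `returns[r]` is taken to be the stage return attached to role `r` under a sum-maximizing profile, and the helper names are hypothetical.

```python
def apply_permutation(perm, times, player):
    """Apply perm (perm[i] is the image of i) `times` times to player."""
    for _ in range(times):
        player = perm[player]
    return player


def average_returns_under_rotation(returns, perm, num_stages):
    """Average return of each player when, at stage k, player i takes
    the role obtained by applying perm k times to i."""
    n = len(returns)
    totals = [0.0] * n
    for k in range(num_stages):
        for player in range(n):
            role = apply_permutation(perm, k, player)
            totals[player] += returns[role]
    return [total / num_stages for total in totals]


# Two players, unequal sum-maximizing returns, roles swapped each stage.
print(average_returns_under_rotation([3, 1], [1, 0], 2))
# Three players rotating over three roles.
print(average_returns_under_rotation([6, 3, 0], [1, 2, 0], 3))
```

Over a full cycle of N stages every player visits every role once, so each one receives the sum of returns divided by N, which is the largest value the minimum can take when the sum is maximal.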

3.1 Tit-for-Tat

Given a stochastic game, one player can learn a strategy that retaliates when another player deviates from a target strategy. If a retaliation is smaller than the reward obtained by the player while deviating, the retaliation can be repeated until its total cost exceeds this reward. In that case, the target strategy is said to be enforceable: if all players agree to retaliate when a player deviates from a strategy profile and if the retaliation is strong enough, no player can improve its payoff by individually deviating from the strategy profile. If opponents' actions are part of the observable state and if the target strategy profile and the dynamics are deterministic, it becomes possible to construct a stationary strategy that retaliates when a player does not play according to the profile. If the retaliation lasts forever after the first deviation, the strategy is by construction a Nash equilibrium [osborne1994course]. However, we are more interested in finite retaliations, since they give a selfish learning agent a chance to learn the target strategy. Such processes are called Tit-for-Tat (TFT) and are known to induce cooperation in repeated social dilemmas. In an RSG, if the target is a bestmin strategy, there is always a stationary way to retaliate and one can always construct a TFT strategy.

Theorem 2.

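In an RSG, let $\pi^r_{-i} \in \arg\min_{\pi_{-i}} \max_{\pi_i} V_i(\pi_i, \pi_{-i})$, where $V_i$ denotes player $i$'s average return, and let $\pi^c$ be a bestmin strategy (not necessarily stationary). Then, $\pi^r_{-i}$ is a retaliation strategy with respect to $\pi^c$:

$$\forall i,\ \forall \pi_i:\quad V_i(\pi_i, \pi^r_{-i}) \le V_i(\pi^c). \tag{5}$$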

For a player $i$, we note $V_i^c$ its average return when all players cooperate, $V_i^r$ its best average return when others retaliate and $V_i^d$ its best average return by defecting. When a single retaliation is too small, so that defecting is still worthwhile for a selfish player, the retaliation must be repeated. The minimal number of retaliation repeats $k_i$ can be given by (see the proof of Thm. 3 in appendix):

$$k_i = \left\lceil \frac{V_i^d - V_i^c}{V_i^c - V_i^r} \right\rceil. \tag{6}$$

In the edge case where $V_i^c = V_i^r$, the retaliation strategy must be employed endlessly, but the cooperative objective is not affected (this is the case in Rock-paper-scissors). Let $\pi^{TFT}$ be the (non-stationary) strategy that follows the bestmin strategy if all players cooperate, or repeats the retaliation strategy over $k_i$ stage games if a player $i$ deviates from the bestmin strategy. By construction, $\pi^{TFT}$ is a Nash equilibrium.
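The retaliation count can be illustrated with a small computation. The sketch below rests on an assumption drawn from the argument in the proof of Thm. 3: retaliation is repeated until one defection followed by the retaliation stages averages no more than steady cooperation, which gives the smallest integer k with `v_defect + k * v_retaliate <= (k + 1) * v_coop`. The function and argument names are hypothetical.

```python
import math


def min_retaliation_repeats(v_coop, v_retaliate, v_defect):
    """Smallest number of retaliation stages making one defection
    unprofitable on average (assumed form, see lead-in)."""
    if v_defect <= v_coop:
        return 0  # defecting never pays, no retaliation needed
    if v_coop <= v_retaliate:
        return math.inf  # edge case: retaliation must last forever
    return math.ceil((v_defect - v_coop) / (v_coop - v_retaliate))


def deviation_cycle_average(v_retaliate, v_defect, repeats):
    """Average return of a player over one defection stage followed
    by `repeats` retaliation stages."""
    return (v_defect + repeats * v_retaliate) / (repeats + 1)


# Prisoner's dilemma values: cooperate 3, punished 1, temptation 5.
k = min_retaliation_repeats(3, 1, 5)
print(k, deviation_cycle_average(1, 5, k), deviation_cycle_average(1, 5, k + 1))
```

With these values a single retaliation only ties with cooperation on average, while one extra retaliation stage makes defection strictly worse.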

Theorem 3.

The TFT strategy constructed above is a Nash equilibrium.

4 Learning algorithm

In this work, a learning algorithm played by a player $i$ means any random process conditioned, at any time $t$, on the history of all states, actions and rewards up to time $t$. The algorithm profile is the set of all players' algorithms.

4.1 Multi-agent learning

Reinforcement learning provides a class of algorithms that aim at maximizing an agent's return. Among them, we focus on Q-learning approaches [watkins1992q] for three reasons: they are model-free, off-policy, and guaranteed to converge in finite state and action spaces. In a game $G$, for a player $i$ and given opponents' policy $\pi_{-i}$, the basic idea is to learn a Q-function that approximates, for all states and actions, the average return starting from playing this action at this point while using the best strategy. Ideally, the Q-function associated with player $i$'s policy that maximizes its return satisfies:

$$Q_i^*(s, a_i) = \mathbb{E}\Big[ R_i(s, a_i, a_{-i}) + \gamma \max_{a_i'} Q_i^*(s', a_i') \Big], \tag{7}$$

where the expectation is taken over $a_{-i} \sim \pi_{-i}(s)$ and $s' \sim P(\cdot \mid s, a_i, a_{-i})$.

Q-learning algorithms are constructed in order to progressively approximate the Q-function without approximating the problem dynamics $P$ and reward functions $R_i$, and without knowing the decision process that generated the history buffer (in contrast, for example, to policy gradient algorithms [williams1992simple]). In finite state and action spaces, the approximation is obtained by successively applying the updates:

$$Q_i(s, a_i) \leftarrow (1 - \alpha)\, Q_i(s, a_i) + \alpha \Big[ r_i + \gamma \max_{a_i'} Q_i(s', a_i') \Big], \tag{8}$$

where $\alpha$ is the learning rate, $r_i$ the observed reward and $s'$ the observed next state. However, when the opponent policy is not fixed, maximizing the Q-function with respect to actions is no longer an improvement of the policy (the response of the opponents to this deterministic policy can decrease the player's average return). MARL provides several alternative greedy improvements. For example, a defensive player can expect opponents to minimize its Q-function (minimax Q-learning). In that case, the Q-function is indexed by the joint action $(a_i, a_{-i})$, and a greedy improvement of the policy to evaluate the value of a new state $s'$ is obtained by solving the linear problem [littman1994markov]:

$$\pi_i(\cdot \mid s') \in \arg\max_{\pi_i(\cdot \mid s')} \min_{a_{-i}} \sum_{a_i} \pi_i(a_i \mid s')\, Q_i(s', a_i, a_{-i}), \tag{9}$$

and the corresponding Q-learning update becomes:

$$Q_i(s, a_i, a_{-i}) \leftarrow (1 - \alpha)\, Q_i(s, a_i, a_{-i}) + \alpha \Big[ r_i + \gamma \min_{a_{-i}'} \sum_{a_i'} \pi_i(a_i' \mid s')\, Q_i(s', a_i', a_{-i}') \Big]. \tag{10}$$
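The tabular update and the defensive (minimax) improvement described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the Q-table layout and both function names are hypothetical, and the minimax step is restricted to a player with exactly two actions so that the linear problem can be solved by enumerating breakpoints instead of calling an LP solver.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning step; q maps a state to a list of
    action values."""
    target = reward + gamma * max(q[next_state])
    q[state][action] = (1.0 - alpha) * q[state][action] + alpha * target
    return q[state][action]


def minimax_two_actions(q_matrix):
    """Maximin mixed strategy for a player with two actions.

    q_matrix[a_own][a_opp] is the player's value for a joint action.
    Returns (p, value), where p is the probability of own action 0.
    """
    row0, row1 = q_matrix

    def worst_case(p):
        return min(p * x + (1.0 - p) * y for x, y in zip(row0, row1))

    # The worst case is concave and piecewise linear in p, so its
    # maximum lies at an endpoint or where two lines intersect.
    candidates = [0.0, 1.0]
    for j in range(len(row0)):
        for k in range(j + 1, len(row0)):
            denom = (row0[j] - row1[j]) - (row0[k] - row1[k])
            if denom != 0:
                p = (row1[k] - row1[j]) / denom
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    best_p = max(candidates, key=worst_case)
    return best_p, worst_case(best_p)


# Matching pennies: the defensive strategy mixes uniformly.
print(minimax_two_actions([[1, -1], [-1, 1]]))
# Prisoner's dilemma (0 = cooperate, 1 = defect): always defect.
print(minimax_two_actions([[3, 0], [5, 1]]))
```

For larger action sets the same maximin problem is usually handed to a generic linear programming solver.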


4.2 Learning equilibrium

We define a learning equilibrium as follows.

Definition 3 (Learning equilibrium).

Let $\mathcal{G}$ be a set of stochastic games. An algorithm profile $\mathcal{A}^*$ is a learning equilibrium for $\mathcal{G}$ if, for any game $G \in \mathcal{G}$, there is a time $T$ such that, for any player $i$ and any learning algorithm $\mathcal{A}_i$:


Consequently, just like Nash equilibrium for the choice of a strategy, no player can individually follow an alternative algorithm and increase its asymptotic score. However, one important difference is the fact that a learning algorithm is not defined with respect to a particular game, but a set of games.

We may think that a process always playing a Nash equilibrium of the given game ($\pi_t = \pi^*$ for all $t$) is a learning equilibrium. However, such a process requires initial knowledge of the dynamics and the reward functions of the game and cannot be obtained from a process starting from an empty history. Therefore, it cannot be described as a learning algorithm. For the same reason, a TFT process is not a learning equilibrium. However, we may construct learning algorithms that asymptotically behave as a TFT or always play a Nash equilibrium. This is the key idea of FCL.

5 Foolproof cooperative learning

As we are interested in forced cooperation, we are looking for a learning algorithm profile that converges to a TFT process, retaliating if a player deviates from a cooperative strategy. Since the objective of a cooperative strategy is a common quantity and TFT processes are symmetric, such a convergence can be obtained if all players use the same algorithm. FCL, as described in Alg. 1 (for a player $i$), converges to such a behavior when played by all players. In an N-player game, FCL approximates several Q-functions: one associated with the cooperative policy that maximizes the sum of all players' returns, one per opponent associated with retaliation policies preventing any defection from other players, and one per opponent associated with that opponent's best response to the cooperative strategy. At each played stage game, FCL plays according to a bestmin cooperative strategy (learned through the cooperative Q-function) unless one of the opponents deviated from that strategy. In case of an opponent's defection, all FCL agents agree on a joint retaliation according to the minimax strategy (learned through the retaliation Q-function with Eq. (10)) for the number of stages given by Eq. (6).

In order to allow exploration, a deterministic process is used to decide, at each time $t$, between exploration and exploitation. We design it as a known realization of a random process such that explorations are endless, but become rare enough with time so that the probability of exploration tends to zero. This can be implemented using a pseudo-random process with a fixed seed, known by all FCL players. At exploration stages, all agents are allowed to perform any action without being accused of defection. In a way, this algorithm can be seen as a disentangled version of Friend-or-Foe Q-learning (FFQ) [littman2001friend], which learns with a single Q-function to play cooperatively if an opponent is cooperative, or defensively if the opponent defects. However, FFQ cannot learn a TFT behavior, as it is either always cooperative or always defensive.

0:  List of counters to repeat retaliations, exploration process , N-cyclic permutation , learning rate sequence , initial (arbitrary) functions , and , initial state .
1:  for stages to  do
2:     while stage continue do
3:        if  then
4:           if  then
5:              Explore with uniform probability
6:           else
7:              Take action
8:           end if
9:        else
10:           Randomly select an agent such that
11:           Take action
13:        end if
14:        Observe and new state , receive reward and observe
17:        for all other agents  do
23:           if not and  then
25:           end if
26:        end for
28:     end while
29:  end for
Algorithm 1: FCL for player $i$.

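The behavior that FCL is meant to converge to can be sketched for a single-state symmetric game. The code below is a simplified illustration, not a transcription of Alg. 1: it assumes the cooperative action and the retaliation action are already known (no Q-learning), a pure retaliation action, one opponent, and a retaliation length supplied by the caller. The class and method names are hypothetical.

```python
class TitForTatController:
    """Cooperate unless the opponent deviated outside an exploration
    step, then retaliate for a fixed number of stages."""

    def __init__(self, coop_action, retaliation_action, repeats):
        self.coop_action = coop_action
        self.retaliation_action = retaliation_action
        self.repeats = repeats
        self.remaining = 0  # retaliation stages still to play

    def act(self):
        if self.remaining > 0:
            return self.retaliation_action
        return self.coop_action

    def observe(self, opponent_action, exploring=False):
        if self.remaining > 0:
            # A retaliation stage was just played.
            self.remaining -= 1
        elif not exploring and opponent_action != self.coop_action:
            # Deviation outside exploration: start retaliating.
            self.remaining = self.repeats


controller = TitForTatController("C", "D", repeats=2)
history = []
for opponent_action in ["C", "D", "D", "C", "C"]:
    history.append(controller.act())
    controller.observe(opponent_action)
print(history)
```

Deviations observed during an exploration step are ignored, which mirrors the rule that agents cannot be accused of defection while exploring.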
Theorem 4.

Assume the state and action spaces are finite. Then, FCL converges to a TFT behavior forcing the bestmin cooperative strategy in RSGs.

Theorem 5.

FCL is a learning equilibrium for RSGs.

6 Experiments

Although our theoretical claims are established for any number of agents, we restrict our experiments to games involving two players. We first explore the case of three well-known repeated symmetric matrix games: Iterated Prisoner's Dilemma (IPD), Iterated Chicken (ICH) and Rock-Paper-Scissors (RPC). Then, we investigate larger state spaces with grid games inducing coordination problems and social dilemmas, as introduced in [de2012polynomial]. We added another grid game, closer to the concept of limited resource appropriation: the Temptation game. In Temptation, moving to the sides can be seen as taking the resource immediately, while moving to the bottom can be seen as waiting for the winter. All grid games are described in detail in Table 1. In order to verify that FCL is a learning equilibrium, we compare the score obtained by FCL with the scores obtained by selfish learning algorithms, Q-learning and policy gradient (PG), when playing against FCL.

6.1 Implementation details

We implemented FCL using a state-dependent learning rate that depends on the number of state visits, and an exploration process driven by a pseudo-random uniform sample between 0 and 1 with a fixed seed, an initial threshold and a decay parameter close to one. The closer the decay parameter is to one, the longer the exploration lasts. For selfish Q-learning, we used a similar learning rate and exploration process, however with different seeds and decay parameters. The policy gradient was implemented with tabular parameters and Adam gradient descent with learning rate 0.1. Since matrix games are not sequential and since grid games were automatically reset after 30 steps, we could use a discount factor to estimate value functions. In practice, we found that adding 1 to the minimal number of retaliation repeats given in Eq. (6) significantly improves the robustness to selfish learners.
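A shared exploration schedule of this kind can be sketched as follows. The exact formula is not recoverable from the text, so the form used here is an assumption: explore at step t when a seeded uniform sample falls below a threshold that starts at `initial_threshold` and is multiplied by `decay` after every step. All names are hypothetical.

```python
import random


def exploration_schedule(num_steps, seed, initial_threshold, decay):
    """Boolean exploration flags, identical for every agent that uses
    the same seed and parameters (assumed form, see lead-in)."""
    rng = random.Random(seed)
    threshold = initial_threshold
    schedule = []
    for _ in range(num_steps):
        schedule.append(rng.random() < threshold)
        threshold *= decay
    return schedule


# Two FCL agents sharing the seed agree on when to explore.
agent_a = exploration_schedule(1000, seed=0, initial_threshold=1.0, decay=0.99)
agent_b = exploration_schedule(1000, seed=0, initial_threshold=1.0, decay=0.99)
print(agent_a == agent_b, sum(agent_a))
```

Because the schedule is a fixed function of the seed, every FCL agent can tell whether an unexpected action happened during an agreed exploration step or counts as a defection.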

6.2 Results

An iterated matrix game can be seen as a repeated stochastic game with only one state. As it does not require large exploration, we used and for both selfish Q-learning and FCL. Figure 1 displays our results with the three matrix games IPD, ICH and RPC. In grid games, we used and . Figure 2 displays our results on grid games. As expected, FCL was never exploited by selfish learners, and successfully cooperated with another FCL. Except in RPC, defection led to less reward than cooperation because of retaliations. In RPC, FCL found the only way to retaliate, playing randomly forever against selfish learners, which results in an average reward of 0 for all players, equivalent to the reward for cooperation.

Figure 1: Matrix games: (a) iterated prisoner's dilemma, (b) iterated chicken, (c) rock-paper-scissors. Average scores over 20 runs obtained by two standard RL algorithms and FCL, playing against FCL. In IPD and ICH, after some iterations, selfish behaviours, as induced by Q-learning and PG, become sub-optimal because of FCL retaliations and accumulate less return than cooperative behaviours, as induced by FCL against itself. In RPC, FCL learns to play with a uniform distribution against selfish algorithms, so their average score is zero. The black dotted line represents the average score after convergence of two selfish agents playing against each other (the minimax solution).
Table 1: Grid games: (a) grid prisoner's dilemma, (b) compromise, (c) coordination, (d) temptation. is the starting position of one player, is the starting position of the other. At each turn, both players simultaneously select one action among going up, down, left, right or stay. When reward cells with symbol are reached by one player, the player obtains the corresponding reward and the game is immediately reset. means that only player gets the reward when reaching the cell, means that any player gets reward when reaching the cell, and means that the player who reaches the cell gets and the other gets (if the other player reaches another rewarding cell, the rewards are summed). Two players cannot be on the same cell at the same time and they cannot cross each other. In case of conflict, one player reaches the cell and the other stays, with probability 0.5.
Figure 2: Grid games: (a) prisoner's dilemma, (b) compromise, (c) coordination, (d) temptation. Average scores over 20 runs obtained by two standard RL algorithms and FCL, playing against FCL. After some iterations, selfish behaviours, as induced by Q-learning and PG, become sub-optimal because of FCL retaliations and accumulate less return than a cooperative behaviour, as induced by FCL against itself. The black dotted line represents the average score after convergence of two selfish agents playing against each other (the minimax solution).

7 Related work

Learning cooperative behaviours in a multi-agent setting is a vast field of research, and various approaches depend on assumptions about the type of games, the type and number of agents, the type of cooperation and the initial knowledge.

When the game's dynamics are initially known and in two-player settings, Kalai's bargaining solution can be obtained by mixing dynamic and linear programming. Therefore, a polynomial-time algorithm can be used to solve repeated matrix games [littman2005polynomial], as well as repeated stochastic games [de2012polynomial]. Since a bargaining solution is always better than a minimax strategy (the disagreement point) [osborne1994course], a cooperative equilibrium is immediately given. An alternative to our cooperate-or-retaliate architecture consists in choosing between maximizing one's own reward (being competitive) and maximizing a cooperative reward, for example by inferring opponents' intentions [kleiman2016coordinate].

In games inducing social dilemmas and when the dynamics are accessible as an oracle, cooperative solutions can also be obtained through self-play and then applied to define a TFT behaviour forcing cooperation [lerer2017maintaining], even when opponent actions are unknown, since in that case the reward function already brings sufficient information [peysakhovich2017consequentialist].

Closer to our setting, when the dynamics are unknown, online MARL can extract cooperative solutions in some non-cooperative games, and particularly in restricted resource appropriation [perolat2017multi]. Using alternative objectives based on all players' reward functions and their propensity to cooperate or defect improves and generalizes the emergence of cooperation in non-cooperative games and limits the risk of being exploited by purely selfish agents [hughes2018inequity].

A similar approach, called Learning with Opponent Learning Awareness (LOLA), consists in modelling the strategies and the learning dynamics of opponents as part of the environment's dynamics and deriving the gradient of the average return's expectation [foerster2018learning]. While LOLA has no convergence guarantee, a recent improvement of the gradient computation, which interpolates between first- and second-order derivations, is proved to converge to local optima [letcher2018stable]. Although such agents are purely selfish, empirical results show that they are able to shape each other's learning trajectories and to cooperate in the prisoner's dilemma. A limitation of this approach toward building a learning equilibrium is the strong assumption regarding the opponents' learning algorithms, which are supposed to perform policy gradient. Also, this approach differs from our goal, since LOLA is selfish and aims at shaping an opponent's behavior (in 2-player settings) while FCL is cooperative but retaliates in response to selfish agents (in N-player settings).

8 Conclusion

We introduced FCL, a model-free learning algorithm that, by construction, converges to a TFT behaviour, cooperative against itself and retaliating against selfish algorithms. We proposed a definition for learning equilibrium, describing a class of learning algorithms such that the best way to play against them is to adopt the same behaviour. We demonstrated that FCL is an example of learning equilibrium that forces a cooperative behaviour, and we empirically verified this claim with two-agent matrix and grid-world repeated symmetric games.

Our approach could be improved by facilitating the opponent's learning of the optimal cooperative response and by using faster learning approaches. It could also be adapted to larger dimensions such as continuous state spaces and partially observed settings with function approximation by replacing tabular Q-learning with deep Q-learning [mnih2015human].


9 Appendix

9.1 Proof of Thm. 1


Since is N-cyclic, any player receives the same average return every N stage games:


Consequently, maximizes the sum of returns at any and the average return of the strategy is the same for all players. Now, imagine there is a strategy such that:


In that case,


which is in contradiction with the fact that maximizes the sum of returns at any . ∎

9.2 Proof of Thm. 2


Assume that:


Then, the same is true in particular for ’s best response to , that we note :


Since in that case, minimizes ’s return:


In particular, we have


Besides, we have:


Indeed, if this was not the case, we would have


and is no longer a bestmin.

On the other hand, since the game is symmetric, one can apply the transposition that permutes players and strategies in Eq. (22):


which is in contradiction with Eq.(23). ∎

9.3 Proof of Thm. 3


Since and , we write:


which gives:


On the left, this is the average return over stages of an always cooperating player, on the right this is the average return over stages of any deviating player. Therefore, for any :


9.4 Proof of Thm. 4


We assume that all agents are playing FCL. Given the fact that is deterministic, there are no defection. Let’s focus on the endless sub-process corresponding to exploration times. Let . Clearly, for all , and , . Consequently, the convergence of -functions is given by the convergence of the classic -learning using  [melo2001convergence]. Let and be the corresponding points of convergence. By construction:

  • , maximizing , maximizes the sum of players returns.

  • maximizes the min of players returns.

  • , minimizing , retaliates when player deviates from .

FCL decision rule at line 7 in Alg.1 corresponds to playing according to when is close enough to (when the difference between the max value and the second-max value is larger than twice an update size). Similarly, decision rule at line 11 corresponds to playing according to when is close enough to . Let be the smallest time after which both and are close enough to and so lines 7 and 11 correspond to playing according to and . If explorations are stopped, FCL is a TFT strategy. Actually explorations never stop but the probability222Actually, is deterministic but is given by the realization of a random process verifying this property. One can also considerate that all FCL players are observing the same random process telling them to explore or not. of exploration times tends to zero, which translates to


where is any player’s TFT strategy induced by and as described in Thm. 3. ∎

9.5 Proof of Thm. 5


Let be a learning algorithm different from FCL and played by an agent while all other players use FCL (). We distinguish two situations:


In situation (a), conditions for the convergence of FCL agents are met and they converge to a TFT behavior:


Let the deviating player’s average return when other agents are doing TFT (with probability ), its average return when other agents are exploring during stages (whith probability and and the respective returns when no player deviates). Because of Thm. 3, we know that . If , then the average return of a deviating player is always smaller than if it does not deviate. Otherwise, we have and by taking:


we obtain, for all :




In situation (b), there is still a subset of state-action couples that will be explored an infinite number of times. If all other players restrict their states and actions to the same subset (using ) the induced sub-game is still symmetric and player is exploring the whole sub-game an infinite number of times. Consequently, FCL can at least learn a TFT strategy based on a retaliation strategy and a cooperative strategy defined on such that:


Since is necessarily sub-optimal to cooperate in the whole game, we have:


As a consequence, players can still retaliate and we can use the exact same argument as in (a) to obtain the desired statement. ∎