Adaptive Mechanism Design: Learning to Promote Cooperation

06/11/2018 ∙ by Tobias Baumann, et al.

In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule for automatically learning how to create the right incentives by considering the players' anticipated parameter updates. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. However, even in the latter case, the amount of necessary additional incentives decreases over time.




1 Introduction

Social dilemmas highlight conflicts between individual and collective interests. Cooperation allows for better outcomes for all participants, but individual participants are tempted to increase their own payoff at the expense of others. Selfish incentives can therefore destabilize the socially desirable outcome of mutual cooperation and often lead to outcomes that make everyone worse off (Van Lange et al., 2013).

Cooperation often emerges due to direct reciprocity (Trivers, 1971) or indirect reciprocity (Nowak and Sigmund, 2005). However, even if these mechanisms are not sufficient on their own, humans are often able to establish cooperation by changing the structure of the social dilemma. This is often referred to as mechanism design. For instance, institutions such as the police and the judicial system incentivize humans to cooperate in the social dilemma of peaceful coexistence, and have succeeded in dramatically reducing rates of violence (Pinker, 2011).

Studies of social dilemmas have traditionally focused on the context of human agents. However, in the future, artificial learning agents will likely be increasingly widespread in our society, and be employed in a variety of economically relevant tasks. In that case, they will interact both with other artificial agents and humans in complex and partially competitive settings.

This raises the question of how we can ensure that artificial agents will learn to navigate the resulting social dilemmas productively and safely. Failing to learn cooperative policies would lead to socially inefficient or even disastrous outcomes. In particular, the escalation of conflicts between artificial agents (or between artificial agents and humans) may pose a serious security risk in safety-critical systems. The behaviour of artificial agents in cooperation problems is thus of both theoretical and practical importance.

In this work, we examine how mechanism design can promote beneficial outcomes in social dilemmas among artificial learners. We consider a setting with $N$ agents in a social dilemma and an additional planning agent that can distribute (positive or negative) rewards to the players after observing their actions, and aims to guide the learners to a socially desirable outcome (as measured by the sum of rewards).

We derive a learning rule that allows the planning agent to learn how to set the additional incentives by looking ahead at how the agents will update their policy parameters in the next learning step. We also extend the method to settings in which the planning agent does not know what internal parameters the other agents use and does not have direct access to the opponents’ policies.

We evaluate the learning rule on several different matrix game social dilemmas. The planning agent learns to successfully guide the learners to cooperation with high social welfare in all games, while they learn to defect in the absence of a planning agent. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes. In other games, cooperation is unstable without continued intervention. However, even in the latter case, we show that the amount of necessary additional rewards decreases over time.

2 Related Work

The study of social dilemmas has a long tradition in game theory, theoretical social science, and biology. In particular, there is a substantial body of literature that fruitfully employs matrix games to study how stable mutual cooperation can emerge (Axelrod and Hamilton, 1981). Key mechanisms that can serve to stabilize the socially preferred outcome of mutual cooperation include direct reciprocity (Trivers, 1971), indirect reciprocity (Nowak and Sigmund, 2005), and norm enforcement (Axelrod, 1986). Bachrach et al. (2009) examine how cooperation can be stabilized via supplemental payments from an external party.

Our work is inspired by the field of mechanism design, pioneered by Vickrey (1961), which aims to design economic mechanisms and institutions to achieve certain goals, most notably social welfare or revenue maximization. Seabright (1993) studies how informal and formal incentives for cooperative behaviour can prevent a tragedy of the commons. Mechanism design has also been studied in the context of computerized agents (Varian, 1995) and combined with machine learning techniques (Narasimhan et al., 2016).

We also draw on the rich literature on multi-agent reinforcement learning. It is beyond the scope of this work to review all relevant methods in multi-agent reinforcement learning, so we refer the reader to existing surveys on the subject (Busoniu et al., 2008; Tuyls and Weiss, 2012). However, we note that most work in multi-agent reinforcement learning considers coordination or communication problems in the fully cooperative setting, where the agents share a common goal (Omidshafiei et al., 2017; Foerster et al., 2016).

As an exception, Leibo et al. (2017) study the learned behaviour of deep Q-networks in a fruit-gathering game and a Wolfpack hunting game that represent sequential social dilemmas. Tampuu et al. (2017) successfully train agents to play Pong with either a fully cooperative, a fully competitive, or a mixed cooperative-competitive objective. Crandall et al. (2018) introduce a learning algorithm that uses novel mechanisms for generating and acting on signals to learn to cooperate with humans and with other machines in iterated matrix games. Finally, Lowe et al. (2017) propose a centralized actor-critic architecture that is applicable to both the fully cooperative and the mixed cooperative-competitive setting.

However, these methods treat the opponents’ policies as given, in that they do not take into account how one’s actions affect the parameter updates of other agents. In contrast, Foerster et al. (2017) introduce Learning with Opponent-Learning Awareness (LOLA), an algorithm that explicitly attempts to shape the opponent’s anticipated learning. The LOLA learning rule includes an additional term that reflects the effect of the agent’s policy on the parameter update of the other agents, and it inspired the learning rule in this work. However, while LOLA leads to emergent cooperation in an iterated Prisoner’s Dilemma, the aim of LOLA agents is to shape the opponent’s learning to their own advantage, which does not always promote cooperation.

3 Background

3.1 Markov Games

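We consider partially observable Markov games (Littman, 1994) as a multi-agent extension of Markov decision processes (MDPs). An $N$-player Markov game $\mathcal{M}$ is defined by a set of states $\mathcal{S}$, an observation function $O: \mathcal{S} \times \{1, \dots, N\} \to \mathbb{R}^d$ specifying each player’s $d$-dimensional view, a set of actions $\mathcal{A}_1, \dots, \mathcal{A}_N$ for each player, a transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$, where $\mathcal{P}(X)$ denotes the set of probability distributions over $X$, and a reward function $r_i: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathbb{R}$ for each player $i$. To choose actions, each player uses a policy $\pi_i: \mathcal{O}_i \to \mathcal{P}(\mathcal{A}_i)$, where $\mathcal{O}_i = \{o_i \mid s \in \mathcal{S},\ o_i = O(s, i)\}$ is the observation space of player $i$. Each player in a Markov game aims to maximize its discounted expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$, where $\gamma$ is a discount factor and $T$ is the time horizon.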
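As a small illustration of the discounted return that each player maximizes, the sketch below sums one player's per-step rewards weighted by powers of the discount factor. The function name `discounted_return` and the example numbers are hypothetical and only serve to make the definition concrete.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over one player's reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


if __name__ == "__main__":
    # Three steps with reward 1 and discount factor 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With a discount factor of 1 the function reduces to the plain sum of rewards, which is the relevant case for a one-shot matrix game.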

3.2 Policy gradient methods

Policy gradient methods (Sutton and Barto, 1998) are a popular choice for a variety of reinforcement learning tasks. Suppose the policy $\pi_\theta$ of an agent is parametrized by $\theta$. Policy gradient methods aim to maximize the objective $J(\theta) = \mathbb{E}_{s \sim p^{\pi_\theta},\, a \sim \pi_\theta}[R]$ by updating the agent’s policy in steps in the direction of $\nabla_\theta J(\theta)$.

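Using the policy gradient theorem (Sutton et al., 2000), we can write the gradient as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

where $p^{\pi_\theta}$ is the state distribution and $Q^{\pi_\theta}(s, a) = \mathbb{E}[R \mid s^t = s, a^t = a]$.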
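To make the policy gradient theorem concrete, the following sketch checks it in the simplest possible case: a single state, two actions, and a one-parameter policy. The sigmoid parametrization, the function names, and the action values in the usage lines are assumptions chosen for illustration. The score-function form of the gradient is compared against the gradient obtained by differentiating the objective by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_function_gradient(theta, q_values):
    """E_a[ d/dtheta log pi(a) * Q(a) ] for a policy with pi(C) = sigmoid(theta)."""
    p = sigmoid(theta)
    probs = {"C": p, "D": 1.0 - p}
    # d/dtheta log pi(C) = 1 - p and d/dtheta log pi(D) = -p
    scores = {"C": 1.0 - p, "D": -p}
    return sum(probs[a] * scores[a] * q_values[a] for a in ("C", "D"))


def direct_gradient(theta, q_values):
    """d/dtheta of J(theta) = pi(C) Q(C) + pi(D) Q(D), differentiated by hand."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (q_values["C"] - q_values["D"])


if __name__ == "__main__":
    q = {"C": 3.0, "D": 4.0}  # hypothetical action values
    print(score_function_gradient(0.3, q), direct_gradient(0.3, q))
```

Both functions return the same number, which is the content of the theorem in this one-state setting.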

3.3 Matrix Game Social Dilemmas

A matrix game is the special case of two-player perfectly observable Markov games with $|\mathcal{S}| = 1$, $T = 1$ and $\mathcal{A}_1 = \mathcal{A}_2 = \{C, D\}$. That is, two actions are available to each player, which we will interpret as cooperation and defection.

        C       D
C     R, R    S, T
D     T, S    P, P
Table 1: Payoff matrix of a symmetric 2-player matrix game. A cell of $X, Y$ represents a utility of $X$ to the row player and $Y$ to the column player.

Table 1 shows the generic payoff structure of a (symmetric) matrix game. Players can receive four possible rewards: $R$ (reward for mutual cooperation), $P$ (punishment for mutual defection), $T$ (temptation of defecting against a cooperator), and $S$ (sucker outcome of cooperating against a defector).

A matrix game is considered a social dilemma if the following conditions hold (Macy and Flache, 2002):

  1. Mutual cooperation is preferable to mutual defection: $R > P$

  2. Mutual cooperation is preferable to being exploited: $R > S$

  3. Mutual cooperation is preferable to an equal probability of unilateral defection by either player: $2R > T + S$

  4. The players have some reason to defect because exploiting a cooperator is preferable to mutual cooperation ($T > R$) or because mutual defection is preferable to being exploited ($P > S$).

The last condition reflects the mixed incentive structure of matrix game social dilemmas. We will refer to the motivation to exploit a cooperator (quantified by $T - R$) as greed and to the motivation to avoid being exploited by a defector ($P - S$) as fear. As shown in Table 2, we can use the presence or absence of greed and fear to categorize matrix game social dilemmas.
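The four conditions and the greed/fear categorization can be checked mechanically. The sketch below is one way to do so, with hypothetical function names; the payoff values in the usage lines are the temptation and sucker payoffs listed in Table 3, together with $R = 3$ and $P = 1$, which is what the greed and fear levels in that table imply.

```python
def is_social_dilemma(R, P, T, S):
    """Check the four social dilemma conditions for a symmetric 2x2 game."""
    return (R > P) and (R > S) and (2 * R > T + S) and (T > R or P > S)


def classify(R, P, T, S):
    """Categorize a matrix game by the presence of greed (T - R) and fear (P - S)."""
    if not is_social_dilemma(R, P, T, S):
        return "not a social dilemma"
    has_greed = T - R > 0
    has_fear = P - S > 0
    if has_greed and has_fear:
        return "Prisoner's Dilemma"
    if has_greed:
        return "Chicken"
    return "Stag Hunt"


if __name__ == "__main__":
    print(classify(R=3, P=1, T=4, S=0))
    print(classify(R=3, P=1, T=3.5, S=2))
    print(classify(R=3, P=1, T=2, S=0))
```

Condition 4 guarantees that at least one of greed and fear is present, so the final branch is reached only when fear is the sole motive to defect.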

Payoff matrices (rows and columns C, D) for Chicken, Stag Hunt, and Prisoner’s Dilemma.
Table 2: The three canonical examples of matrix game social dilemmas with different reasons to defect. In Chicken, agents may defect out of greed, but not out of fear. In Stag Hunt, agents can never get more than the reward of mutual cooperation by defecting, but they may still defect out of fear of a non-cooperative partner. In Prisoner’s Dilemma (PD), agents are motivated by both greed and fear simultaneously.

4 Methods

4.1 Amended Markov game including the planning agent

Suppose $N$ agents play a Markov game described by $\mathcal{S}$, $\mathcal{A}_1, \dots, \mathcal{A}_N$, $r_1, \dots, r_N$, $O$ and $\mathcal{T}$. We introduce a planning agent that can hand out additional rewards and punishments to the players and aims to use this to ensure the socially preferred outcome of mutual cooperation.

To do this, the Markov game can be amended as follows. We add another action set $\mathcal{A}_p \subset \mathbb{R}^N$ that represents which additional rewards and punishments are available to the planning agent. Based on its observation $O_p$ and the other players’ actions $a_1, \dots, a_N$, the planning agent takes an action $a_p = (r_p^1, \dots, r_p^N) \in \mathcal{A}_p$. (Footnote 1: Technically, we could represent the dependence on the other players’ actions by introducing an extra step after the regular step in which the planning agent chooses additional rewards and punishments. However, for simplicity, we will discard this and treat the players’ actions and the planning action as a single step. Formally, we can justify this by letting the planning agent specify its action for every possible combination of player actions.) The new reward function of player $i$ is $r_i^{tot} = r_i + r_p^i$, i.e. the sum of the original reward and the additional reward, and we denote the corresponding value functions as $V_i^{tot} = V_i + V_i^p$. Finally, the transition function $\mathcal{T}$ formally receives $a_p$ as an additional argument, but does not depend on it ($\mathcal{T}(s, a_1, \dots, a_N, a_p) = \mathcal{T}(s, a_1, \dots, a_N)$).

4.2 The learning problem

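Let $\theta_1, \dots, \theta_N$ and $\theta_p$ be parametrizations of the players’ policies $\pi_1, \dots, \pi_N$ and the planning agent’s policy $\pi_p$.

The planning agent aims to maximize the total social welfare $V(\theta_1, \dots, \theta_N) := \sum_{i=1}^{N} V_i(\theta_1, \dots, \theta_N)$, which is a natural metric of how socially desirable an outcome is. Note that without restrictions on the set of possible additional rewards and punishments, i.e. $\mathcal{A}_p = \mathbb{R}^N$, the planning agent can always transform the game into a fully cooperative game by choosing $r_p^i = \sum_{j \neq i} r_j$.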
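The remark about turning the game into a fully cooperative one can be illustrated with a short sketch. It assumes that the transfer in question gives each player the sum of the other players' original rewards, so that every player's total reward equals the social welfare; the function names are hypothetical.

```python
def cooperative_transfer(rewards):
    """Additional reward for each player: the sum of all other players' rewards."""
    total = sum(rewards)
    return [total - r for r in rewards]


def total_rewards(rewards):
    """Original reward plus additional reward for each player."""
    extra = cooperative_transfer(rewards)
    return [r + e for r, e in zip(rewards, extra)]


if __name__ == "__main__":
    # Outcome (D, C) in a Prisoner's Dilemma with T = 4 and S = 0
    print(cooperative_transfer([4.0, 0.0]))
    print(total_rewards([4.0, 0.0]))
```

After the transfer, both players receive the same total, so each player's individual objective coincides with the planning agent's objective.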

However, it is difficult to learn how to set the right incentives using traditional reinforcement learning techniques. This is because $V$ does not depend directly on $\theta_p$. The planning agent’s actions only affect $V$ indirectly by changing the parameter updates of the learners. For this reason, it is vital to explicitly take into account how the other agents’ learning changes in response to additional incentives.

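This can be achieved by considering the next learning step of each player (cf. Foerster et al., 2017). We assume that the learners update their parameters by simple gradient ascent:

$$\Delta\theta_i = \eta_i \nabla_i V_i^{tot}(\theta_1, \dots, \theta_N, \theta_p)$$

where $\eta_i$ is the step size of player $i$ and $\nabla_i := \nabla_{\theta_i}$ is the gradient with respect to parameters $\theta_i$.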

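Instead of optimizing $V(\theta_1, \dots, \theta_N)$, the planning agent looks ahead one step and maximizes $V(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N)$. Assuming that the parameter updates $\Delta\theta_i$ are small, a first-order Taylor expansion yields

$$V(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N) \approx V(\theta_1, \dots, \theta_N) + \sum_{i=1}^{N} (\Delta\theta_i)^\top \nabla_i V(\theta_1, \dots, \theta_N)$$

We use a simple rule of the form $\Delta\theta_p = \eta_p \nabla_p V(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N)$ to update the planning agent’s policy, where $\eta_p$ is the learning step size of the planning agent. Exploiting the fact that $V(\theta_1, \dots, \theta_N)$ does not depend directly on $\theta_p$, i.e. $\nabla_p V(\theta_1, \dots, \theta_N) = 0$, we can calculate the gradient:

$$\nabla_p V(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N) \approx \sum_{i=1}^{N} \nabla_p \left[(\Delta\theta_i)^\top \nabla_i V\right] = \sum_{i=1}^{N} \eta_i \left(\nabla_p \nabla_i V_i^{tot}\right)^\top \nabla_i V = \sum_{i=1}^{N} \eta_i \left(\nabla_p \nabla_i V_i^{p}\right)^\top \nabla_i V \tag{4}$$

since $V_i$ does not depend on $\theta_p$ either.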
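The look-ahead rule can be sketched end to end for a one-shot Prisoner's Dilemma. This is a minimal sketch under several assumptions that are not taken from the text: each learner has a single parameter with a sigmoid cooperation probability, the planning agent is a plain table `theta_p[outcome][player]` of additional rewards rather than a neural network, all gradients are computed exactly by enumerating the four outcomes, and the step sizes (`eta=0.1`, `eta_p=10.0`) are chosen only so that the effect is visible within a few thousand plain gradient steps. All names are hypothetical.

```python
import math

# Prisoner's Dilemma: (action 1, action 2) -> (reward 1, reward 2), with R=3, P=1, T=4, S=0
PAYOFFS = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 4.0),
    ("D", "C"): (4.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}
OUTCOMES = list(PAYOFFS)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def outcome_prob_grads(thetas, i):
    """d P(outcome) / d theta_i for every outcome, with P(C) = sigmoid(theta)."""
    p = [sigmoid(t) for t in thetas]
    j = 1 - i
    grads = {}
    for o in OUTCOMES:
        own = p[i] * (1.0 - p[i]) * (1.0 if o[i] == "C" else -1.0)
        other = p[j] if o[j] == "C" else 1.0 - p[j]
        grads[o] = own * other
    return grads


def train(steps, eta=0.1, eta_p=10.0, use_planner=True, p0=0.25):
    thetas = [math.log(p0 / (1.0 - p0))] * 2
    # Tabular planner (an assumption): additional reward per outcome and player
    theta_p = {o: [0.0, 0.0] for o in OUTCOMES}
    for _ in range(steps):
        new_thetas = list(thetas)
        new_theta_p = {o: list(v) for o, v in theta_p.items()}
        for i in range(2):
            dP = outcome_prob_grads(thetas, i)
            # Gradient of player i's total value (own payoff plus planner reward)
            grad_tot = sum(dP[o] * (PAYOFFS[o][i] + theta_p[o][i]) for o in OUTCOMES)
            # Gradient of the social welfare V with respect to theta_i
            grad_welfare = sum(dP[o] * sum(PAYOFFS[o]) for o in OUTCOMES)
            new_thetas[i] += eta * grad_tot
            if use_planner:
                for o in OUTCOMES:
                    # For a tabular planner, the mixed derivative of V_i^p with
                    # respect to theta_i and theta_p[o][i] is simply dP[o]
                    new_theta_p[o][i] += eta_p * eta * dP[o] * grad_welfare
        thetas, theta_p = new_thetas, new_theta_p
    return [sigmoid(t) for t in thetas], theta_p


if __name__ == "__main__":
    print("with planner:   ", train(2000)[0])
    print("without planner:", train(2000, use_planner=False)[0])
```

In this sketch the learners drift toward defection when the planner is switched off and toward cooperation when it is on. The table-based planner makes the mixed derivative trivial; a neural-network planner would need automatic differentiation for that term.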

4.3 Policy gradient approximation

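If the planning agent does not have access to the exact gradients of $V_i^p$ and $V$, we use policy gradients as an approximation. Let $\tau = (s_0, \mathbf{a}_0, \mathbf{a}_p^0, \mathbf{r}_0, \dots, s_T, \mathbf{a}_T, \mathbf{a}_p^T, \mathbf{r}_T)$ be a state-action trajectory of horizon $T + 1$, where $\mathbf{a}_t = (a_{1,t}, \dots, a_{N,t})$, $\mathbf{a}_p^t = (r_{p,t}^1, \dots, r_{p,t}^N)$, and $\mathbf{r}_t = (r_{1,t}, \dots, r_{N,t})$ are the actions taken and rewards received in time step $t$. Then, the episodic return $R_i(\tau) = \sum_{t=0}^{T} \gamma^t r_{i,t}$ and $R_i^p(\tau) = \sum_{t=0}^{T} \gamma^t r_{p,t}^i$ approximate $V_i$ and $V_i^p$, respectively. Similarly, $R(\tau) = \sum_{i=1}^{N} R_i(\tau)$ approximates the social welfare $V$.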

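We can now calculate the gradients using the policy gradient theorem:

$$\nabla_i V_i^p \approx \nabla_i \mathbb{E}_\tau\left[R_i^p(\tau)\right] = \mathbb{E}_\tau\left[\nabla_i \log \pi_i(\tau)\, R_i^p(\tau)\right]$$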


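The other gradients $\nabla_i V$ and $\nabla_p \nabla_i V_i^p$ can be approximated in the same way. This yields the following rule for the parameter update of the planning agent:

$$\Delta\theta_p = \eta_p \sum_{i=1}^{N} \eta_i\, \mathbb{E}_\tau\left[\nabla_p \log \pi_p(\tau)\, \left(\nabla_i \log \pi_i(\tau)\right)^\top R_i^p(\tau)\right] \mathbb{E}_\tau\left[\nabla_i \log \pi_i(\tau)\, R(\tau)\right] \tag{6}$$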
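The sample-based approximation can be illustrated on the simplest building block, the gradient of one player's value with respect to its own parameter. The sketch below assumes a one-shot Prisoner's Dilemma with $R = 3$, $P = 1$, $T = 4$, $S = 0$, sigmoid policies with a single parameter per player, and a fixed random seed; all names are hypothetical. It compares a Monte Carlo score-function estimate with the exact gradient.

```python
import math
import random

# Player 1's payoff for each (action 1, action 2) in a Prisoner's Dilemma
PAYOFF_1 = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 4.0, ("D", "D"): 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_action(theta, rng):
    return "C" if rng.random() < sigmoid(theta) else "D"


def score(theta, action):
    """d/dtheta log pi(action) for pi(C) = sigmoid(theta)."""
    p = sigmoid(theta)
    return 1.0 - p if action == "C" else -p


def estimate_gradient(theta_1, theta_2, n_episodes, seed=0):
    """Monte Carlo estimate of d V_1 / d theta_1 from sampled one-shot episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        a1 = sample_action(theta_1, rng)
        a2 = sample_action(theta_2, rng)
        total += score(theta_1, a1) * PAYOFF_1[(a1, a2)]
    return total / n_episodes


def exact_gradient(theta_1, theta_2):
    """Exact d V_1 / d theta_1 obtained by enumerating the four outcomes."""
    p1, p2 = sigmoid(theta_1), sigmoid(theta_2)
    v_coop = p2 * PAYOFF_1[("C", "C")] + (1 - p2) * PAYOFF_1[("C", "D")]
    v_defect = p2 * PAYOFF_1[("D", "C")] + (1 - p2) * PAYOFF_1[("D", "D")]
    return p1 * (1 - p1) * (v_coop - v_defect)


if __name__ == "__main__":
    print(estimate_gradient(0.0, 0.0, 20000), exact_gradient(0.0, 0.0))
```

The estimate is noisy for small sample sizes, which is one plausible reason why an estimated value function could perform worse than exact gradients.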


4.4 Opponent modeling

Equations 4 and 6 assume that the planning agent has access to each agent’s internal policy parameters and gradients. This is a restrictive assumption. In particular, agents may have an incentive to conceal their inner workings in adversarial settings. However, if the assumption is not fulfilled, we can instead model the opponents’ policies using parameter vectors $\hat{\theta}_i$ and infer the value of these parameters from the players’ actions (Ross et al., 2010). A simple approach is to use a maximum likelihood estimate based on the observed trajectory:

$$\hat{\theta}_i = \arg\max_{\theta_i'} \sum_{t=0}^{T} \log \pi_{\theta_i'}(a_{i,t} \mid s_t) \tag{7}$$

Given this, we can substitute $\hat{\theta}_i$ for $\theta_i$ in equation 4.
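For a learner with a single parameter and a sigmoid cooperation probability (an assumption made here for illustration), the maximum likelihood estimate has a closed form: it is the logit of the observed cooperation frequency. The sketch below uses a hypothetical function name and clips the frequency so that the estimate stays finite when only one action was observed.

```python
import math


def mle_theta(actions, eps=1e-6):
    """Maximum likelihood estimate of theta for a policy with P(C) = sigmoid(theta)."""
    n_cooperate = sum(1 for a in actions if a == "C")
    freq = n_cooperate / len(actions)
    # Clip to avoid an infinite estimate when all observed actions are identical
    freq = min(max(freq, eps), 1.0 - eps)
    return math.log(freq / (1.0 - freq))


if __name__ == "__main__":
    observed = ["C", "C", "C", "D"]
    print(mle_theta(observed))  # logit of 0.75
```

With richer policy classes no closed form is available, and the log-likelihood would have to be maximized numerically.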

4.5 Cost of additional rewards

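In real-world examples, it may be costly to distribute additional rewards or punishment. We can model this cost by changing the planning agent’s objective to $V(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N) - \alpha \lVert V_p(\theta_1, \dots, \theta_N, \theta_p) \rVert_2$, where $\alpha$ is a cost parameter and $V_p = (V_1^p, \dots, V_N^p)$. The modified update rule is (using equation 4)

$$\Delta\theta_p = \eta_p \left( \sum_{i=1}^{N} \eta_i \left(\nabla_p \nabla_i V_i^p\right)^\top \nabla_i V - \alpha \nabla_p \lVert V_p \rVert_2 \right) \tag{8}$$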
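If the cost term is taken to be the Euclidean norm of the vector of planner-induced value functions, $V_p = (V_1^p, \dots, V_N^p)$ (an assumption about the exact form of the regularizer), its gradient with respect to the planning agent's parameters follows from the chain rule. The expansion below is a sketch of that step and holds wherever the norm is nonzero:

```latex
\nabla_p \lVert V_p \rVert_2
  = \nabla_p \Big( \sum_{i=1}^{N} (V_i^p)^2 \Big)^{1/2}
  = \frac{\sum_{i=1}^{N} V_i^p \, \nabla_p V_i^p}{\lVert V_p \rVert_2},
  \qquad \lVert V_p \rVert_2 \neq 0 .
```

Each $\nabla_p V_i^p$ can be computed exactly or estimated from sampled trajectories, in the same way as the other gradients in the update rule.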


5 Experimental setup

In our experiments, we consider learning agents playing a matrix game social dilemma (MGSD) as outlined in section 3.3. The learners are simple agents with a single policy parameter $\theta$ that controls the probability of cooperation and defection: $P(C) = \frac{\exp(\theta)}{1 + \exp(\theta)}$, $P(D) = \frac{1}{1 + \exp(\theta)}$. The agents use a centralized critic (Lowe et al., 2017) to learn their value function.

The agents play 4000 episodes of a matrix game social dilemma. We fix the payoffs $R = 3$ and $P = 1$, which allows us to describe the game using the level of greed and fear. We will consider three canonical matrix game social dilemmas as shown in Table 3.

Game                 Greed   Fear   T     S
Prisoner’s Dilemma   1       1      4     0
Chicken              0.5     -1     3.5   2
Stag Hunt            -1      1      2     0
Table 3: Levels of fear and greed and resulting temptation ($T$) and sucker ($S$) payoffs in three matrix games. Note that the level of greed in Chicken has to be smaller than 1 because it is otherwise not a social dilemma ($2R > T + S$ is not fulfilled).
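The relationship between the greed and fear levels and the payoffs in Table 3 is simple arithmetic: greed is $T - R$ and fear is $P - S$, so $T = R + \text{greed}$ and $S = P - \text{fear}$. The sketch below reproduces the table rows under the values $R = 3$ and $P = 1$ that these rows imply; the function name is hypothetical.

```python
def payoffs_from_greed_fear(greed, fear, R=3.0, P=1.0):
    """Return (T, S) given greed = T - R and fear = P - S."""
    return R + greed, P - fear


if __name__ == "__main__":
    games = {
        "Prisoner's Dilemma": (1.0, 1.0),
        "Chicken": (0.5, -1.0),
        "Stag Hunt": (-1.0, 1.0),
    }
    for name, (greed, fear) in games.items():
        print(name, payoffs_from_greed_fear(greed, fear))
```

Setting the greed level of Chicken to 1 would give $T = 4$ and $S = 2$, so that $T + S = 2R$ and the third social dilemma condition fails, which matches the note in the caption.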

The planning agent’s policy is parametrized by a single layer neural network. We limit the maximum amount of additional rewards or punishments (i.e. we restrict $\mathcal{A}_p$ to vectors that satisfy $\max_i |r_p^i| \le c$ for a given constant $c$). Unless specified otherwise, we use a step size of 0.01 for both the planning agent and the learners, use cost regularisation (Equation 8) with a cost parameter of 0.0002, set the maximum reward to 3, and use the exact value function. In some experiments, we also require that the planning agent can only redistribute rewards, but cannot change the total sum of rewards (i.e. $\mathcal{A}_p$ is restricted to vectors that satisfy $\sum_{i=1}^{N} r_p^i = 0$). We refer to this as the revenue-neutral setting.
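The two restrictions on the planning agent's actions can be sketched as simple post-processing of a raw reward vector. How the constraints are actually enforced is not stated in the text, so the following is only one possible approach under stated assumptions: the bound is applied per component by clipping, and revenue neutrality is obtained by subtracting the mean and rescaling if the bound is then exceeded. Function names are hypothetical.

```python
def clip_rewards(rewards, c):
    """Limit each additional reward to the interval [-c, c]."""
    return [max(-c, min(c, r)) for r in rewards]


def revenue_neutral(rewards, c):
    """Shift rewards to sum to zero, then rescale so no entry exceeds c in magnitude."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    largest = max(abs(r) for r in centered)
    if largest > c:
        # Scaling by a constant keeps the sum at zero
        centered = [r * c / largest for r in centered]
    return centered


if __name__ == "__main__":
    print(clip_rewards([5.0, -4.0, 1.0], 3.0))
    print(revenue_neutral([4.0, 1.0], 3.0))
```

Rescaling is used instead of clipping in the revenue-neutral case because clipping individual components after centering would generally break the zero-sum property.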

6 Results

In this section, we summarize the experimental results. (Footnote 2: Source code available at …) We aim to answer the following questions:

  • Does the introduction of the planning agent succeed in promoting significantly higher levels of cooperation?

  • What qualitative conclusions can be drawn about the amount of additional incentives needed to learn and maintain cooperation?

  • In which cases is it possible to achieve cooperation even when the planning agent is only active for a limited timespan?

  • How does a restriction to revenue-neutrality affect the effectiveness of mechanism design?

Figure 1: Mechanism design over 4000 episodes of a Prisoner’s Dilemma. The initial probability of cooperation is 0.25 for each player. Shown is (a) the probability of cooperation over time, (b) the additional reward for the first player in each of the four possible outcomes, (c) the resulting levels of fear and greed including additional rewards, and (d) the cumulative amount of distributed rewards.

Figure 1(a) illustrates that the players learn to cooperate with high probability if the planning agent is present, resulting in the socially preferred outcome of stable mutual cooperation. Thus the planning agent successfully learns how to distribute additional rewards to guide the players to a better outcome.

Figure 1(b) shows how the planning agent rewards or punishes the player conditional on each of the four possible outcomes. At first, the planning agent learns to reward cooperation, which creates a sufficient incentive to cause the players to learn to cooperate. In Figure 1(c) we show how this changes the level of fear and greed in the modified game. The levels of greed and fear soon drop below zero, which means that the modified game is no longer a social dilemma.

Note that rewarding cooperation is less costly than punishing defection if (and only if) cooperation is the less common action. After the players learn to cooperate with high probability, the planning agent learns that it is now less costly to punish defection and consequently stops handing out additional rewards in the case of mutual cooperation. As shown in Figure 1(d), the amount of necessary additional rewards converges to 0 over time as defection becomes increasingly rare.
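The cost comparison can be made explicit with a short worked calculation. Assume, purely for illustration, that the planning agent wants to widen the payoff gap between cooperating and defecting by a fixed amount $b > 0$ for a player who cooperates with probability $p$, and that cost is measured by the expected magnitude of the transfer per round:

```latex
\begin{aligned}
\text{reward cooperation by } b: &\quad \mathbb{E}[\text{cost}] = p \, b \\
\text{punish defection by } b:   &\quad \mathbb{E}[\text{cost}] = (1 - p) \, b \\
p \, b < (1 - p) \, b            &\iff p < \tfrac{1}{2}
\end{aligned}
```

Both options change the incentive to cooperate by the same amount $b$, so rewarding is the cheaper option exactly while $p < \tfrac{1}{2}$, and punishing becomes cheaper once cooperation is the more common action.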

Table 4 summarizes the results of all three canonical social dilemmas. Without adaptive mechanism design, the learners fail to achieve mutual cooperation in all cases. By contrast, if the planning agent is turned on, the learners learn to cooperate with high probability, resulting in a significantly higher level of social welfare.

Game                 Greed   Fear   No mech. design   With mech. design   Turning off
Prisoner’s Dilemma   1       1
Chicken              0.5     -1
Stag Hunt            -1      1
Table 4: Comparison of the resulting levels of cooperation after 4000 episodes, a) without mechanism design, b) with mechanism design, and c) when turning off the planning agent after 4000 episodes and running another 4000 episodes. Each cell shows the mean and standard deviation of ten training runs. $P(C,C)$ is the probability of mutual cooperation at the end of training and $V$ is the expected social welfare that results from the players’ final action probabilities. The initial probability of cooperation is 0.25 for each player.

The three games differ, however, in whether the cooperative outcome obtained through mechanism design is stable even when the planning agent is turned off. Without additional incentives, mutual cooperation is not a Nash equilibrium in the Prisoner’s Dilemma and in Chicken (Fudenberg and Tirole, 1991), which is why one or both players learn to defect again after the planning agent is turned off. These games thus require continued (but only occasional) intervention to maintain cooperation. By contrast, mutual cooperation is a stable equilibrium in Stag Hunt (Fudenberg and Tirole, 1991). As shown in Table 4, this means that long-term cooperation in Stag Hunt can be achieved even if the planning agent is only active over a limited timespan (and thus at limited cost).
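The stability claims can be verified directly from the payoffs: an action pair is a pure-strategy Nash equilibrium if neither player gains by unilaterally switching actions. The sketch below checks this for the three games, using the temptation and sucker payoffs from Table 3 together with $R = 3$ and $P = 1$ (the values implied by the greed and fear levels); function names are hypothetical.

```python
def payoff(own, other, R, P, T, S):
    """Payoff to a player choosing `own` against `other` in a symmetric 2x2 game."""
    return {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}[(own, other)]


def is_pure_nash(a1, a2, R, P, T, S):
    """True if neither player can gain by unilaterally switching actions."""
    flip = {"C": "D", "D": "C"}
    stay_1 = payoff(a1, a2, R, P, T, S) >= payoff(flip[a1], a2, R, P, T, S)
    stay_2 = payoff(a2, a1, R, P, T, S) >= payoff(flip[a2], a1, R, P, T, S)
    return stay_1 and stay_2


GAMES = {
    "Prisoner's Dilemma": dict(R=3, P=1, T=4, S=0),
    "Chicken": dict(R=3, P=1, T=3.5, S=2),
    "Stag Hunt": dict(R=3, P=1, T=2, S=0),
}

if __name__ == "__main__":
    for name, g in GAMES.items():
        print(name, "mutual cooperation is Nash:", is_pure_nash("C", "C", **g))
```

Mutual cooperation passes the check only when $T \le R$, i.e. when there is no greed, which holds for Stag Hunt but not for the Prisoner's Dilemma or Chicken.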

Game                 Greed   Fear
Prisoner’s Dilemma   1       1
Chicken              0.5     -1
Stag Hunt            -1      1
Table 5: Resulting levels of cooperation and average additional rewards (AAR) per round for different variants of the learning rule. The variants differ in whether they use the exact value function (Equation 4) or an estimate (Equation 6) and in whether the setting is revenue-neutral or unrestricted.

Table 5 compares the performance of different variants of the learning rule. Interestingly, restricting the possible planning actions to redistribution leads to lower probabilities of cooperation in Prisoner’s Dilemma and Stag Hunt, but not in Chicken. We hypothesize that this is because in Chicken, mutual defection is not in the individual interest of the players anyway. This means that the main task for the planning agent is to prevent (C,D) or (D,C) outcomes, which can be easily achieved by redistribution. By contrast, these outcomes are fairly unattractive (in terms of individual interests) in Stag Hunt, so the most effective intervention is to make (D,D) less attractive and (C,C) more attractive, which is not feasible by pure redistribution. Consequently, mechanism design by redistribution works best in Chicken and worst in Stag Hunt.

Using an estimate of the value function leads to inferior performance on all three games, both in terms of the resulting probability of mutual cooperation and with respect to the amount of distributed additional rewards. However, the effect is by far least pronounced in Stag Hunt. This may be because mutual cooperation is an equilibrium in Stag Hunt, which means that a beneficial outcome can more easily arise even if the incentive structure created by the planning agent is imperfect.

7 Conclusions and Future Work

We have presented a method for learning how to create the right incentives to ensure cooperation between artificial learners. Empirically, we have shown that a planning agent that uses the proposed learning rule is able to successfully guide the learners to the socially preferred outcome of mutual cooperation in several different matrix game social dilemmas, while they learn to defect with high probability in the absence of a planning agent. The resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require continued (but increasingly rare) intervention to maintain cooperation. We also showed that restricting the planning agent to redistribution leads to worse performance in Stag Hunt, but not in Chicken.

In the future, we would like to explore the limitations of adaptive mechanism design in more complex environments, particularly in games with more than two players, without full observability of the players’ actions, and using opponent modeling (cf. Equation 7). Future work could also consider settings in which the planning agent aims to ensure cooperation by altering the dynamics of the environment or the players’ action set (e.g. by introducing mechanisms that allow players to better punish defectors or reward cooperators).

Finally, under the assumption that artificial learners will play vital roles in future society, it is worthwhile to develop policy recommendations that would facilitate mechanism design for these agents (and the humans they interact with), thus contributing to a cooperative outcome in potential social dilemmas. For instance, it would be helpful if the agents were set up in a way that makes their intentions as transparent as possible and allows for simple ways to distribute additional rewards and punishments without incurring large costs.


References

  • Axelrod [1986] Robert Axelrod. An Evolutionary Approach to Norms. American Political Science Review, 1986.
  • Axelrod and Hamilton [1981] Robert Axelrod and William D. Hamilton. The Evolution of Cooperation. Evolution, 1981.
  • Bachrach et al. [2009] Yoram Bachrach, Edith Elkind, Reshef Meir, Dmitrii Pasechnik, Michael Zuckerman, Jörg Rothe, and Jeffrey S Rosenschein. The cost of stability in coalitional games. In International Symposium on Algorithmic Game Theory, pages 122–134. Springer, 2009.
  • Busoniu et al. [2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A Comprehensive Survey of Multiagent Reinforcement Learning. Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2008. ISSN 1094-6977.
  • Crandall et al. [2018] Jacob W. Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, et al. Cooperating with machines. Nature communications, 9(1):233, 2018.
  • Foerster et al. [2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. pages 2137–2145, 2016.
  • Foerster et al. [2017] Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with Opponent-Learning Awareness. 2017.
  • Fudenberg and Tirole [1991] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, Cambridge, MA, 1991.
  • Leibo et al. [2017] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 2017.
  • Littman [1994] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. 1994.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.
  • Macy and Flache [2002] Michael W. Macy and Andreas Flache. Learning Dynamics in Social Dilemmas. Proceedings of the National Academy of Sciences of the United States of America, 2002.
  • Narasimhan et al. [2016] Harikrishna Narasimhan, Shivani Brinda Agarwal, and David C Parkes. Automated mechanism design without money via machine learning. 2016.
  • Nowak and Sigmund [2005] Martin A. Nowak and Karl Sigmund. Evolution of Indirect Reciprocity, 2005. ISSN 14764687.
  • Omidshafiei et al. [2017] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability. 2017.
  • Pinker [2011] Steven Pinker. The Better Angels of Our Nature. 2011.
  • Ross et al. [2010] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. 2010.
  • Seabright [1993] Paul Seabright. Managing Local Commons: Theoretical Issues in Incentive Design. Journal of Economic Perspectives, 7(4):113–134, 1993.
  • Sutton et al. [2000] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. pages 1057–1063, 2000.
  • Sutton and Barto [1998] RS Sutton and AG Barto. Reinforcement learning: An introduction. 1998.
  • Tampuu et al. [2017] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 2017.
  • Trivers [1971] Robert L. Trivers. The Evolution of Reciprocal Altruism. The Quarterly Review of Biology, 1971.
  • Tuyls and Weiss [2012] Karl Tuyls and Gerhard Weiss. Multiagent Learning: Basics, Challenges, and Prospects. AI Magazine, 2012.
  • Van Lange et al. [2013] Paul A. M. Van Lange, Jeff Joireman, Craig D. Parks, and Eric Van Dijk. The Psychology of Social Dilemmas: A Review. Organizational Behavior and Human Decision Processes, 120(2):125–141, 2013.
  • Varian [1995] Hal R. Varian. Economic mechanism design for computerized agents. In USENIX workshop on Electronic Commerce, pages 13–21, 1995.
  • Vickrey [1961] William Vickrey. Counterspeculation, Auctions, and Competitive Sealed Tenders. The Journal of Finance, 16(1):8–37, 1961.