Cooperative Artificial Intelligence

02/20/2022
by Tobias Baumann, et al.

In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We argue that there is a need for research on the intersection between game theory and artificial intelligence, with the goal of achieving cooperative artificial intelligence that can navigate social dilemmas well. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the actions of the learners. We propose a rule for automatically learning how to create the right incentives by considering the anticipated parameter updates of each agent. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. Finally, we reflect on what the goals of multi-agent reinforcement learning should be in the first place, and discuss the necessary building blocks towards the goal of building cooperative AI.



2.1 Social dilemmas

Social dilemmas highlight conflicts between individual and collective interests. A social dilemma is a situation where cooperation allows for better outcomes for all participants, but individual participants are tempted to increase their own payoff at the expense of others. Selfish incentives can therefore destabilize the socially desirable outcome of mutual cooperation and often lead to outcomes that make everyone worse off [VanLange2013TheReview]. The study of social dilemmas has a long tradition in many disciplines, including game theory, social psychology, economics, and biology.

Social dilemmas can take many forms. One particularly well-known model is the Prisoner’s Dilemma [Poundstone1992], a simple game analysed in game theory. The Prisoner’s Dilemma involves two players who each choose whether to cooperate or defect. In this game, mutual cooperation results in the highest total payoff, but defection is a dominant strategy in the single-stage game. As a result, self-interested actors often end up in a suboptimal equilibrium of mutual defection.

The Prisoner’s Dilemma is but one example of a broader class of two-player matrix game social dilemmas. There is a substantial body of literature that fruitfully employs matrix games to study how stable mutual cooperation can emerge among self-interested actors [Axelrod1981TheCooperation].

Cooperation often emerges due to direct reciprocity [Trivers1971TheAltruism] in iterated interactions. The temptation to defect for a higher immediate payoff can be outweighed by the anticipation that the other player will retaliate by defecting in the future. Conversely, a player may choose to cooperate in the hope of eliciting future cooperation from the other player. This conditional retaliation or reward turns mutual cooperation into a stable equilibrium.

Indeed, this simple strategy of conditional retaliation, known as Tit-for-Tat, is often considered the best strategy in the iterated Prisoner’s Dilemma, going back to the famous tournaments by Axelrod in which it performed best [Axelrod1980, Axelrod1980a]. However, later research suggests that the full picture is more complicated [Rapoport2015] and that other strategies can also be competitive [Nowak1993, Wedekind1996]. Generous or forgiving variants of Tit-for-Tat can also outperform the non-generous variant, as they prevent escalating retaliation arising from a single defection [Rand2009].
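To make the strategy concrete, the following is a minimal sketch of Tit-for-Tat and a generous variant in an iterated matrix game; the payoff values, the forgiveness probability, and all function names are illustrative choices rather than anything specified in the text.

```python
import random

C, D = 0, 1  # action encoding: cooperate / defect

def tit_for_tat(opponent_history):
    """Cooperate on the first move, then mirror the opponent's last action."""
    return C if not opponent_history else opponent_history[-1]

def generous_tit_for_tat(opponent_history, forgiveness=0.1):
    """Like Tit-for-Tat, but forgive a defection with some probability,
    which avoids endless mutual retaliation after a single defection."""
    if not opponent_history or opponent_history[-1] == C:
        return C
    return C if random.random() < forgiveness else D

def play_iterated_game(strategy_a, strategy_b, rounds=200):
    """Play two strategies against each other; each strategy only sees
    the opponent's past actions."""
    payoffs = {(C, C): (3, 3), (C, D): (0, 4), (D, C): (4, 0), (D, D): (1, 1)}
    seen_by_a, seen_by_b = [], []  # opponent actions observed so far
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(seen_by_a), strategy_b(seen_by_b)
        pa, pb = payoffs[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        seen_by_a.append(b)
        seen_by_b.append(a)
    return score_a, score_b

print(play_iterated_game(tit_for_tat, generous_tit_for_tat))
```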

Despite this rich body of literature, matrix games are a very simple and therefore limited model of social dilemmas. Many real-world settings involve more than two agents, which gives rise to additional dynamics. [Schelling1973] explores the variety of possible multi-agent social dilemmas and proposes a classification based on their payoff structure. A particularly well-known multi-agent social dilemma is the tragedy of the commons, a situation in which self-interested users of a shared resource deplete or spoil that resource through their collective action [Hardin1968].

Indirect reciprocity [Nowak2005EvolutionReciprocity] has been proposed as a mechanism for how cooperation may evolve even in settings where direct reciprocity is not feasible. If one’s actions are observed by third parties, self-interested actors have an incentive to cooperate in order to build a reputation as a reliable and trustworthy partner. However, indirect reciprocity only fosters cooperation if reputations are sufficiently accurate and widely known, so that the reputational benefits of acting cooperatively outweigh its costs [Nowak1998].

Social norms are another powerful mechanism that can serve to stabilize the socially preferred outcome of mutual cooperation [Axelrod1986AnNorms]. Social norms are standards of behaviour that individual actors are expected to follow. There is a rich literature in the social sciences on how such norms are formed, how their specific content is determined, and how norms are maintained [Fehr2004]. In particular, the enforcement of norms often gives rise to a second-order free-riding problem: since punishing violators is itself costly, individuals are tempted to leave enforcement to others [Ozono2016].

Last, research in social psychology suggests that the behaviour of humans in social dilemmas is guided not only by dispassionate cost-benefit calculations, but also by emotional factors including trust [Parks1995] and affect [Tan2010]. Such emotions arguably evolved in humans (and possibly other animals) as a means to navigate social dilemmas [Turner2000].

2.2 Institutions and mechanism design

Reciprocity and norm enforcement are often not sufficient on their own to achieve socially beneficial outcomes. In these cases, it may still be possible to establish cooperation by changing the structure of the social dilemma. This is often referred to as mechanism design. For instance, institutions such as the police and the judicial system incentivize humans to cooperate in the social dilemma of peaceful coexistence, and have succeeded in dramatically reducing rates of violence [Pinker2011TheNature].

The field of mechanism design, pioneered by [Vickrey1961COUNTERSPECULATIONTENDERS], aims to design economic mechanisms and institutions to achieve certain goals, most notably social welfare or revenue maximization. [Seabright1993ManagingDesign] studies how informal and formal incentives for cooperative behaviour can prevent a tragedy of the commons. [monderer2004k] considers a setting in which an interested party can commit to non-negative monetary transfers, and studies the conditions under which desirable outcomes can be implemented with a given amount of payment. [bachrach2009cost] examine how cooperation can be stabilized via supplemental payments from an external party. Mechanism design has also been studied in the context of computerized agents [Varian1995EconomicAgents] and combined with machine learning techniques [narasimhan2016automated].

There is also a rich literature on the principal-agent problem [Grossman1983], which can be considered a special case of mechanism design. The principal-agent problem occurs when a person or entity (the agent) makes decisions or takes actions on behalf of another person or entity (the principal), resulting in a potential mismatch between the interests of the agent and the principal. This frequently occurs in organisations of various kinds [vaubel2006principal, jensen1976theory] and is related to mechanism design in that both aim to implement a coordination mechanism that aligns the interests of the agent with those of the principal [myerson1982optimal]. This is mirrored in feudal reinforcement learning, an approach in which a high-level manager learns to break down a task into subtasks that are carried out by sub-managers or workers [dayan1993feudal, vezhnevets2017feudal]. A common technique is reward shaping [ng1999policy], which aims to guide the learning process by augmenting the natural reward signal with additional rewards for progress towards a good solution.
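As a brief illustration of reward shaping in the spirit of [ng1999policy], the sketch below augments an environment reward with the discounted change in a potential function; the corridor potential and all names are made-up examples, not part of the original work.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Potential-based reward shaping: add the discounted change of a potential
    function Phi to the environment reward. Shaping of this particular form is
    known to leave the optimal policy unchanged (Ng et al., 1999)."""
    return reward + gamma * potential(next_state) - potential(state)

# Illustrative potential: negative distance to a goal position on a 1-D corridor.
goal = 10
potential = lambda position: -abs(goal - position)

# Moving from position 3 to 4 earns a small shaping bonus even with zero base reward.
print(shaped_reward(reward=0.0, state=3, next_state=4, potential=potential))
```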

[Ostrom1992] contains a comprehensive analysis of possible policies and institutions to solve the collective action problem of using common pool resources. According to this analysis, there is no universal solution to the problem, as neither state control nor privatization of resources have been uniformly successful in avoiding the tragedy of the commons. The most successful and sustainable forms of common pool resource governance emerge organically, are fitted to local conditions, impose graduated sanctions for rule violations, and define clear community boundaries.

2.3 Bargaining theory

So far, we have assumed that it is clear what cooperation means. However, in many situations, there are different possible ways to share the surplus that two or more agents can create compared to a disagreement point. This gives rise to a bargaining problem in which the agents negotiate which division of payoffs to choose.

The most well-known solution to the bargaining problem is the Nash bargaining solution [Society], which maximises the product of surplus utilities (also called the Nash welfare). This solution uniquely satisfies the properties of Pareto optimality, symmetry, invariance to affine transformations, and independence of irrelevant alternatives. The Nash bargaining solution can also be obtained as the subgame-perfect equilibrium of an alternating-offers bargaining model as the patience of the players goes to infinity [Binmore1986].

However, maximising the Nash welfare is not the only plausible bargaining solution. The Kalai-Smorodinsky bargaining solution [Kalai1975], which is based on different axioms, chooses the payoffs that equalise the ratios of maximal gains.

Another possibility is to maximise the sum of utilities (the utilitarian welfare function). This is not usually considered a bargaining solution because it violates individual rationality in some cases. However, maximising the utilitarian welfare function can be derived on different grounds [Harsanyi1955].

2.4 Multi-agent reinforcement learning

Reinforcement learning takes the perspective of an agent that learns to maximize its reward through trial-and-error interactions with its environment [Sutton1998ReinforcementIntroduction, Littman2015]. These methods have achieved substantial successes in classic board games such as Go [Silver2017] and in video games, including the Atari platform [Mnih] and StarCraft II [Vinyals2019]. Reinforcement learning has also been applied to robotics [Levine2015], management of power consumption [Tesauro2008], and indoor navigation [Zhu2016]. For a more comprehensive survey, we refer the reader to [Arulkumaran2017].

For purposes of this work, we are most interested in the rich literature on the subfield of multi-agent reinforcement learning [Busoniu2008ALearning, Tuyls2012MultiagentProspects, Hernandez-Leal2018]. While the artificial intelligence literature focuses on different aspects compared to the game theoretic literature, multi-agent learning is arguably one of the most fruitful interaction grounds between computer science and game theory (and the study of social dilemmas in particular).

Unlike single-agent learning algorithms, multi-agent reinforcement learning methods explicitly consider the presence of other agents in the environment. However, there has been some discussion on the precise nature of this distinction. [Shoham2007] argue that the multi-agent learning literature actually pursues several different agendas that are often left implicit or conflated, resulting in confusion.

From a computational perspective, the key difference between single and multi-agent learning is that in the latter, the learning processes of other agents render the environment non-stationary from the perspective of an individual agent. Hence, applying variations of the basic Q-learning algorithm to multi-agent settings [Sen1994] can fail when an opponent adapts its choice of actions based on the past history of the game. Various approaches have been proposed to address this problem, including the minimax-Q-learning algorithm [Littmana], joint-action learners [Claus1998], and the Friend-or-Foe Q-learning algorithm [Littman].

A common approach to learning in repeated games is fictitious play, a learning rule which assumes that the opponent follows a stationary strategy. At each round, the player aims to play the best response to the empirical distribution of opponent actions. It has been shown that this approach results in convergence to a Nash equilibrium under certain assumptions [Kalai1993].
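A minimal sketch of fictitious play in a two-player matrix game follows; the initial count smoothing and the matching-pennies example are our own illustrative choices.

```python
import numpy as np

def fictitious_play(payoff_row, payoff_col, rounds=2000):
    """Each player best-responds to the empirical distribution of the opponent's
    past actions, as if the opponent followed a stationary strategy."""
    n_row, n_col = payoff_row.shape
    counts_row, counts_col = np.ones(n_row), np.ones(n_col)  # smoothed action counts
    for _ in range(rounds):
        a = np.argmax(payoff_row @ (counts_col / counts_col.sum()))  # row best response
        b = np.argmax((counts_row / counts_row.sum()) @ payoff_col)  # column best response
        counts_row[a] += 1
        counts_col[b] += 1
    return counts_row / counts_row.sum(), counts_col / counts_col.sum()

# Matching pennies: the empirical frequencies approach the mixed Nash equilibrium (1/2, 1/2).
payoff_row = np.array([[1.0, -1.0], [-1.0, 1.0]])
payoff_col = -payoff_row
print(fictitious_play(payoff_row, payoff_col))
```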

2.5 Cooperation and competition in multi-agent reinforcement learning

Most work on multi-agent reinforcement learning considers coordination or communication problems in the fully cooperative setting, where the agents share a common goal [Omidshafiei2017DeepObservability, Foerster2016LearningLearning]. However, there has been less emphasis on the mixed cooperative-competitive case, i.e. the question of how we can ensure that artificial agents learn to navigate social dilemmas productively, without getting stuck in suboptimal equilibria. Studies of social dilemmas have traditionally focused on the context of human agents [Lange2014, Capraro2013], while the machine learning literature tends to focus more on computational aspects.

As an exception, [Leibo2017Multi-agentDilemmas] study the learned behaviour of deep Q-networks in a fruit-gathering game and a Wolfpack hunting game that represent sequential social dilemmas. [Tampuu2017MultiagentLearningb] successfully train agents to play Pong with either a fully cooperative, a fully competitive, or a mixed cooperative-competitive objective. [Crandall2018CooperatingMachines] introduce a learning algorithm that uses novel mechanisms for generating and acting on signals to learn to cooperate with humans and with other machines in iterated matrix games. [Anthony2020] use a variant of best response policy iteration to navigate social dilemmas arising in the multi-player board game Diplomacy. Finally, [Lowe2017Multi-AgentEnvironments] propose a centralized actor-critic architecture that is applicable to both the fully cooperative as well as the mixed cooperative-competitive setting.

However, these methods treat the opponents’ policies as given, in that they do not take into account how one’s actions affect the parameter updates of other agents. In contrast, [Foerster2017LearningAwareness] introduce Learning with Opponent-Learning Awareness (LOLA), an algorithm that explicitly attempts to shape the opponent’s anticipated learning. The LOLA learning rule includes an additional term that reflects the effect of the agent’s policy on the parameter updates of the other agents, and it inspired the learning rule in this work. However, while LOLA leads to emergent cooperation in an iterated Prisoner’s Dilemma, the aim of LOLA agents is to shape the opponent’s learning to their own advantage, which does not always promote cooperation.

Another approach, suggested by [Lerer2017], is to have reinforcement learning agents learn both a cooperative and a defecting policy. The idea is to cooperate as long as one’s opponent follows the cooperative policy, and to switch to defection when the opponent’s actions indicate that this is no longer the case. A variety of approaches have been suggested to address the key subproblem of detecting defection [Hernandez-Leala, Hernandez-Leal, Damer]. For instance, it is possible to switch when one’s rewards indicate that the other agent is not cooperating [Peysakhovich2017]. However, this approach is binary, as the agent only switches between two policies representing full cooperation or full defection. [Wang2018] instead suggest a trained defection-detection model that also considers degrees of cooperation.

3.1 Game-theoretic concepts

3.1.1 Nash equilibrium and Pareto-optimality

An $n$-person game is defined in terms of the strategy sets $S_1, \dots, S_n$ representing the actions available to the players and the utility functions $u_1, \dots, u_n$ which describe their payoffs. A tuple $s = (s_1, \dots, s_n)$ with $s_i \in S_i$ for all $i$ is called a strategy profile. We also use the notation $s = (s_i, s_{-i})$, where $s_{-i}$ represents all strategies of players other than $i$.

A Nash equilibrium is a strategy profile $s^* = (s_1^*, \dots, s_n^*)$ such that

$$u_i(s_i^*, s_{-i}^*) \geq u_i(s_i, s_{-i}^*) \qquad (3.1)$$

for all players $i$ and all strategies $s_i \in S_i$. In other words, a Nash equilibrium is a strategy profile in which each player plays a best response to the others’ strategies, and no player can improve by deviating unilaterally.

A strategy profile $s$ is Pareto-optimal if there is no strategy profile $s'$ such that $u_i(s') > u_i(s)$ for some player $i$ and $u_j(s') \geq u_j(s)$ for all players $j$. That is, in a Pareto-optimal profile it is not possible to make some players better off without making others worse off.
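As a concrete illustration of these definitions, the following naive sketch enumerates the pure-strategy Nash equilibria of a two-player matrix game; the helper function and the example payoffs are ours.

```python
import itertools
import numpy as np

def pure_nash_equilibria(payoff_row, payoff_col):
    """Return all pure-strategy profiles from which neither player can gain
    by deviating unilaterally (the condition in Equation 3.1, restricted to
    pure strategies)."""
    equilibria = []
    n_rows, n_cols = payoff_row.shape
    for r, c in itertools.product(range(n_rows), range(n_cols)):
        row_is_best = payoff_row[r, c] >= payoff_row[:, c].max()
        col_is_best = payoff_col[r, c] >= payoff_col[r, :].max()
        if row_is_best and col_is_best:
            equilibria.append((r, c))
    return equilibria

# Prisoner's Dilemma with actions (C, D): mutual defection (1, 1) is the only pure
# equilibrium, even though it is Pareto-dominated by mutual cooperation.
payoff_row = np.array([[3, 0], [4, 1]])
payoff_col = payoff_row.T
print(pure_nash_equilibria(payoff_row, payoff_col))
```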

3.1.2 Matrix game social dilemmas

A matrix game is a two-player game with only two actions available to each player, which we will interpret as cooperation and defection.

       C       D
C    R, R    S, T
D    T, S    P, P
Table 3.1: Payoff matrix of a symmetric 2-player matrix game. A cell of (a, b) represents a utility of a to the row player and b to the column player.

Table 3.1 shows the generic payoff structure of a (symmetric) matrix game. Players can receive four possible rewards: $R$ (reward for mutual cooperation), $P$ (punishment for mutual defection), $T$ (temptation of defecting against a cooperator), and $S$ (sucker outcome of cooperating against a defector).

A matrix game is considered a social dilemma if the following conditions hold [Macy2002LearningDilemmas.]:

  1. Mutual cooperation is preferable to mutual defection: $R > P$.

  2. Mutual cooperation is preferable to being exploited: $R > S$.

  3. Mutual cooperation is preferable to an equal probability of unilateral defection by either player: $2R > T + S$.

  4. The players have some reason to defect, because exploiting a cooperator is preferable to mutual cooperation ($T > R$) or because mutual defection is preferable to being exploited ($P > S$).

The last condition reflects the mixed incentive structure of matrix game social dilemmas. We will refer to the motivation to exploit a cooperator (quantified by $T - R$) as greed and to the motivation to avoid being exploited by a defector (quantified by $P - S$) as fear. As shown in Table 3.2, we can use the presence or absence of greed and fear to categorize matrix game social dilemmas.

[Payoff matrices for Chicken (greed: $T > R$, no fear: $S > P$), Stag Hunt (no greed: $R > T$, fear: $P > S$), and the Prisoner's Dilemma ($T > R$ and $P > S$)]
Table 3.2: The three canonical examples of matrix game social dilemmas with different reasons to defect. In Chicken, agents may defect out of greed, but not out of fear. In Stag Hunt, agents can never get more than the reward of mutual cooperation by defecting, but they may still defect out of fear of a non-cooperative partner. In Prisoner’s Dilemma (PD), agents are motivated by both greed and fear simultaneously.
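The conditions above, together with the greed and fear quantities, translate into a short check. The sketch below uses the payoff values that reappear in the experiments of Section 4.2 ($R = 3$, $P = 1$); the function and argument names are our own.

```python
def classify_social_dilemma(R, P, T, S):
    """Check the four social dilemma conditions and report greed (T - R) and fear (P - S)."""
    is_dilemma = (
        R > P                  # 1. mutual cooperation beats mutual defection
        and R > S              # 2. cooperation beats being exploited
        and 2 * R > T + S      # 3. cooperation beats alternating unilateral defection
        and (T > R or P > S)   # 4. some reason to defect: greed and/or fear
    )
    return {"social_dilemma": is_dilemma, "greed": T - R, "fear": P - S}

print(classify_social_dilemma(R=3, P=1, T=4, S=0))    # Prisoner's Dilemma: greed and fear
print(classify_social_dilemma(R=3, P=1, T=3.5, S=2))  # Chicken: greed only
print(classify_social_dilemma(R=3, P=1, T=2, S=0))    # Stag Hunt: fear only
```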

3.1.3 Bargaining

In many situations, it is not obvious what defection and cooperation mean, as there are many possible ways to share the surplus that two or more agents can generate. This gives rise to a bargaining problem over how to divide this surplus.

Formally, a (two-player) bargaining problem is defined by a feasibility set $F \subset \mathbb{R}^2$ that describes all possible agreements, and a disagreement point $d = (d_1, d_2)$ which represents the payoffs if no agreement can be reached. Payoffs are commonly normalised so that $d = (0, 0)$.

A bargaining solution selects an agreement point $x = (x_1, x_2)$ from $F$. Various solutions have been proposed based on slightly different criteria. The Nash bargaining solution [Society] maximises the product of surplus utilities, that is, it selects the point that maximises the Nash welfare function $(x_1 - d_1)(x_2 - d_2)$, or simply $x_1 x_2$ if $d = (0, 0)$. The Nash bargaining solution is the unique bargaining solution that results from the assumptions of Pareto-optimality, symmetry, scale-invariance, and independence of irrelevant alternatives.

An alternative is the Kalai-Smorodinsky bargaining solution, which drops the independence of irrelevant alternatives axiom in favor of a monotonicity requirement. The Kalai-Smorodinsky bargaining solution considers the best achievable utilities $b_1$ and $b_2$ and selects the point on the Pareto frontier that maintains the ratio of achievable gains, i.e. $(x_1 - d_1)/(x_2 - d_2) = (b_1 - d_1)/(b_2 - d_2)$.
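To make the two solution concepts concrete, the following sketch computes both solutions numerically on a made-up piecewise-linear Pareto frontier; on this asymmetric frontier the two solutions differ.

```python
import numpy as np

def nash_bargaining(frontier, d=(0.0, 0.0)):
    """Select the frontier point maximising the Nash welfare (x1 - d1)(x2 - d2)."""
    welfare = (frontier[:, 0] - d[0]) * (frontier[:, 1] - d[1])
    return frontier[np.argmax(welfare)]

def kalai_smorodinsky(frontier, d=(0.0, 0.0)):
    """Select the frontier point whose gains keep the ratio of the players'
    best achievable gains (b1 - d1) : (b2 - d2)."""
    b1, b2 = frontier[:, 0].max(), frontier[:, 1].max()
    # Signed deviation from the line through d with slope (b2 - d2) / (b1 - d1).
    deviation = (frontier[:, 1] - d[1]) * (b1 - d[0]) - (frontier[:, 0] - d[0]) * (b2 - d[1])
    return frontier[np.argmin(np.abs(deviation))]

# Asymmetric piecewise-linear Pareto frontier through (0, 2), (3, 1), (4, 0).
t = np.linspace(0.0, 1.0, 301)
segment1 = np.column_stack([3 * t, 2 - t])      # from (0, 2) to (3, 1)
segment2 = np.column_stack([3 + t, 1 - t])      # from (3, 1) to (4, 0)
frontier = np.vstack([segment1, segment2])

print(nash_bargaining(frontier))    # approximately (3.0, 1.0)
print(kalai_smorodinsky(frontier))  # approximately (2.4, 1.2)
```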

3.2 Reinforcement learning

Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of reward. At each time step $t$, the agent receives a representation $s_t$ of the environment’s state, and it selects an action $a_t$. Then, as a consequence of its action, the agent receives a reward $r_{t+1}$.

The agent follows a policy $\pi$, which is a mapping that describes the actions taken by the agent. That is, $\pi(a \mid s)$ represents the probability that the agent takes action $a$ when in state $s$.

The aim of the agent (at time step $t$) is to maximise its discounted accumulated reward

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (3.2)$$

for a given discount factor $\gamma \in [0, 1)$.

The value function

$$V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right] \qquad (3.3)$$

is the expected return in state $s$ when following policy $\pi$. Informally, it describes how good it is to be in a given state when following a certain policy $\pi$.

Alternatively, we can express the expected return in terms of state-action pairs using the $Q$-function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right] \qquad (3.4)$$

We seek to find the optimal policy $\pi^*$ which fulfils

$$V^{\pi^*}(s) = \max_\pi V^\pi(s) \qquad (3.5)$$

or

$$Q^{\pi^*}(s, a) = \max_\pi Q^\pi(s, a) \qquad (3.6)$$

for all states $s$ and actions $a$. Using this new notation, we can express $V^* := V^{\pi^*}$ using $Q^* := Q^{\pi^*}$:

$$V^*(s) = \max_a Q^*(s, a) \qquad (3.7)$$

That is, under the optimal policy, the value of a state is equal to the expected return of the best action from that state.

3.2.1 The Bellman equation

We can expand the value function to obtain the following recursive property:

$$V^\pi(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right] \qquad (3.8)$$

We can do the same for the $Q$-function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right] \qquad (3.9)$$

The same holds for the optimal value function $V^*$ and $Q$-function $Q^*$. This so-called Bellman equation can be solved using dynamic programming methods.
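As an illustration of solving the Bellman optimality equation by dynamic programming, here is a minimal value iteration sketch on a made-up two-state MDP; the transition and reward tensors are arbitrary examples.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V(s'))
    until the Bellman optimality equation holds up to the given tolerance."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = (P * (R + gamma * V)).sum(axis=2)    # Q[a, s]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)       # optimal values and a greedy policy
        V = V_new

# Two-state, two-action toy MDP: P[a, s, s'] are transition probabilities,
# R[a, s, s'] the corresponding rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 5.0], [0.0, 0.0]]])
print(value_iteration(P, R))
```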

3.2.2 Markov games

We consider partially observable Markov games [Littman1994MarkovLearning] as a multi-agent extension of Markov decision processes (MDPs). An $N$-player Markov game $\mathcal{M}$, sometimes also called a stochastic game [shapley1953stochastic], is defined by a set of states $S$, an observation function $O: S \times \{1, \dots, N\} \to \mathbb{R}^d$ specifying each player’s $d$-dimensional view, a set of actions $A_1, \dots, A_N$ for each player, a transition function $\mathcal{T}: S \times A_1 \times \dots \times A_N \to \Delta(S)$, where $\Delta(S)$ denotes the set of probability distributions over $S$, and a reward function $r_i: S \times A_1 \times \dots \times A_N \to \mathbb{R}$ for each player. To choose actions, each player uses a policy $\pi_i: O_i \to \Delta(A_i)$, where $O_i$ is the observation space of player $i$. Each player in a Markov game aims to maximize its discounted expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^{(t)}$, where $\gamma$ is a discount factor and $T$ is the time horizon.

A matrix game is the special case of a two-player perfectly observable Markov game with $|S| = 1$, $T = 1$, and $A_1 = A_2 = \{C, D\}$.

3.2.3 Policy gradient methods

Policy gradient methods [Sutton1998ReinforcementIntroduction] are a popular choice for a variety of reinforcement learning tasks. Suppose the policy $\pi_\theta$ of an agent is parametrized by $\theta$. Policy gradient methods aim to maximize the objective $J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}[R]$ by updating the agent’s policy parameters in steps in the direction of $\nabla_\theta J(\theta)$.

Using the policy gradient theorem [SuttonPolicyApproximation], we can write the gradient as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right] \qquad (3.10)$$

where $\rho^\pi$ is the state distribution under $\pi_\theta$ and $Q^\pi$ is the corresponding action-value function.

The policy gradient theorem has given rise to several practical algorithms, which often differ in how they estimate $Q^\pi$. For example, the REINFORCE algorithm [williams1992simple] uses a sample return $G_t$ to estimate $Q^\pi(s_t, a_t)$. Alternatively, one can learn an approximation of the true action-value function via temporal-difference learning [Sutton1998ReinforcementIntroduction], as is done in a variety of actor-critic algorithms [Sutton1998ReinforcementIntroduction].
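A minimal sketch of the REINFORCE estimator on a one-step, bandit-style problem follows; the softmax parametrisation, learning rate, and reward model are illustrative assumptions rather than details from the works cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_step(theta, reward_fn, lr=0.1):
    """One REINFORCE update: sample an action, observe the return, and move the
    parameters along grad log pi(a) scaled by the sampled return."""
    probs = softmax(theta)
    a = rng.choice(len(theta), p=probs)
    G = reward_fn(a)                 # sampled return, used in place of Q(s, a)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0            # gradient of log softmax(theta)[a]
    return theta + lr * G * grad_log_pi

# Illustrative 3-armed bandit with noisy rewards; arm 2 is best in expectation.
reward_fn = lambda a: rng.normal(loc=[0.0, 0.5, 1.0][a], scale=0.1)
theta = np.zeros(3)
for _ in range(2000):
    theta = reinforce_step(theta, reward_fn)
print(softmax(theta))  # most probability mass ends up on arm 2
```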

3.2.4 Multi-agent learning methods

Traditional reinforcement learning methods, such as Q-learning, are not always suitable for the multi-agent case. This is due to the challenge posed by the inherent non-stationarity of the environment. As a result, specialised techniques for multi-agent learning have been developed.

For example, [Lowe2017Multi-AgentEnvironments] present a multi-agent adaptation of actor-critic methods. Consider a game with $N$ players following policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$ parametrised by $\theta_1, \dots, \theta_N$. Then we can write the gradient of the expected reward $J(\theta_i) = \mathbb{E}[R_i]$ for agent $i$ as

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x, a}\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\, Q_i^{\pi}(x, a_1, \dots, a_N) \right] \qquad (3.11)$$

where $x = (o_1, \dots, o_N)$ is the joint observation and $a_i \sim \pi_{\theta_i}(\cdot \mid o_i)$. Here $Q_i^{\pi}(x, a_1, \dots, a_N)$ is a centralised action-value function that takes as input the actions of all agents, and is therefore stationary.

4.1 Methods

4.1.1 Amended Markov game including the planning agent

Suppose $N$ agents play a Markov game described by $S$, $A_1, \dots, A_N$, $\mathcal{T}$, and $r_1, \dots, r_N$. We introduce a planning agent that can hand out additional rewards and punishments to the players and aims to use this ability to ensure the socially preferred outcome of mutual cooperation.

To do this, the Markov game can be amended as follows. We add another action set $A_p \subset \mathbb{R}^N$ that represents which additional rewards and punishments are available to the planning agent. Based on its observation $o_p$ and the other players’ actions $a_1, \dots, a_N$, the planning agent takes an action $a_p = (a_p^1, \dots, a_p^N) \in A_p$. (Technically, we could represent the dependence on the other players’ actions by introducing an extra step after the regular step, in which the planning agent chooses additional rewards and punishments. However, for simplicity, we discard this and treat the players’ actions and the planning action as a single step. Formally, we can justify this by letting the planning agent specify its action for every possible combination of player actions.) The new reward function of player $i$ is $\tilde{r}_i = r_i + a_p^i$, i.e. the sum of the original reward and the additional reward, and we denote the corresponding value functions as $\tilde{V}_i$. Finally, the transition function formally receives $a_p$ as an additional argument, but does not depend on it ($\mathcal{T}(s, a_1, \dots, a_N, a_p) = \mathcal{T}(s, a_1, \dots, a_N)$).

4.1.2 The learning problem

Let $\theta_1, \dots, \theta_N$ and $\theta_p$ be parametrizations of the players’ policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$ and the planning agent’s policy $\pi_{\theta_p}$.

The planning agent aims to maximize the total social welfare $W(\theta_1, \dots, \theta_N) = \sum_{i=1}^{N} V_i(\theta_1, \dots, \theta_N)$, which is a natural metric of how socially desirable an outcome is. Note that without restrictions on the set of possible additional rewards and punishments, i.e. $A_p = \mathbb{R}^N$, the planning agent can always transform the game into a fully cooperative game by choosing $a_p^i = \sum_{j \neq i} r_j$, so that each player’s modified reward equals the total original reward.

However, it is difficult to learn how to set the right incentives using traditional reinforcement learning techniques. This is because $W$ does not depend directly on $\theta_p$. The planning agent’s actions only affect $W$ indirectly by changing the parameter updates of the learners. For this reason, it is vital to explicitly take into account how the other agents’ learning changes in response to additional incentives.

This can be achieved by considering the next learning step of each player (cf. [Foerster2017LearningAwareness]). We assume that the learners update their parameters by simple gradient ascent:

$$\Delta\theta_i = \eta_i \nabla_{\theta_i} \tilde{V}_i(\theta_1, \dots, \theta_N, \theta_p) \qquad (4.1)$$

where $\eta_i$ is the step size of player $i$ and $\nabla_{\theta_i}$ is the gradient with respect to the parameters $\theta_i$.

Instead of optimizing $W(\theta_1, \dots, \theta_N)$, the planning agent looks ahead one step and maximizes $W(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N)$. Assuming that the parameter updates $\Delta\theta_i$ are small, a first-order Taylor expansion yields

$$W(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N) \approx W(\theta_1, \dots, \theta_N) + \sum_{i=1}^{N} (\Delta\theta_i)^\top \nabla_{\theta_i} W(\theta_1, \dots, \theta_N) \qquad (4.2)$$

We use a simple rule of the form $\Delta\theta_p = \eta_p \nabla_{\theta_p} W(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N)$ to update the planning agent’s policy, where $\eta_p$ is the learning step size of the planning agent and the $\Delta\theta_i$ are given by Equation 4.1. Exploiting the fact that $W(\theta_1, \dots, \theta_N)$ does not depend directly on $\theta_p$, i.e. $\nabla_{\theta_p} W(\theta_1, \dots, \theta_N) = 0$, we can calculate the gradient:

$$\nabla_{\theta_p} W(\theta_1 + \Delta\theta_1, \dots, \theta_N + \Delta\theta_N) \approx \sum_{i=1}^{N} (\nabla_{\theta_p} \Delta\theta_i)^\top \nabla_{\theta_i} W = \sum_{i=1}^{N} \eta_i \left( \nabla_{\theta_p} \nabla_{\theta_i} \tilde{V}_i \right)^\top \nabla_{\theta_i} W \qquad (4.3)$$

since $\nabla_{\theta_i} W(\theta_1, \dots, \theta_N)$ does not depend on $\theta_p$ either.

4.1.3 Policy gradient approximation

If the planning agent does not have access to the exact gradients of $\tilde{V}_i$ and $W$, we use policy gradients as an approximation. Let $\tau = (s_0, a_0, r_1, \dots, s_{T-1}, a_{T-1}, r_T)$ be a state-action trajectory of horizon $T$, where $a_t = (a_t^1, \dots, a_t^N, a_t^p)$ and $r_t = (r_t^1, \dots, r_t^N)$ are the actions taken and rewards received in time step $t$. Then, the episodic returns $R_i(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}^i$ and $\tilde{R}_i(\tau) = \sum_{t=0}^{T-1} \gamma^t \tilde{r}_{t+1}^i$ approximate $V_i$ and $\tilde{V}_i$, respectively. Similarly, $R(\tau) = \sum_{i=1}^{N} R_i(\tau)$ approximates the social welfare $W$.

We can now calculate the gradients using the policy gradient theorem:

$$\nabla_{\theta_i} \tilde{V}_i \approx \tilde{R}_i(\tau) \sum_{t=0}^{T-1} \nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i \mid o_t^i) \qquad (4.4)$$

The other gradients $\nabla_{\theta_p} \nabla_{\theta_i} \tilde{V}_i$ and $\nabla_{\theta_i} W$ can be approximated in the same way. This yields the following rule for the parameter update of the planning agent:

$$\Delta\theta_p = \eta_p \sum_{i=1}^{N} \eta_i \left( \nabla_{\theta_p} \widehat{\nabla_{\theta_i} \tilde{V}_i} \right)^{\!\top} \widehat{\nabla_{\theta_i} W} \qquad (4.5)$$

where the hats denote the corresponding policy gradient estimates.

See algorithm 1 for an overview of the process for updating each agent’s parameters.

Initialise policies π_θ1, ..., π_θN and π_θp with parameters θ_1, ..., θ_N and θ_p
Initialise the environment state s
for episode = 1 to E do
       for i = 1 to N do
             Sample a_i according to π_θi(· | o_i)
       end for
       Sample a_p according to π_θp(· | o_p, a_1, ..., a_N)
       for i = 1 to N do
             Update θ_i according to Equation 4.1:
                   θ_i ← θ_i + η_i ∇_θi Ṽ_i
       end for
       Update the planning agent parameters according to Equation 4.3:
             θ_p ← θ_p + η_p ∇_θp W(θ_1 + Δθ_1, ..., θ_N + Δθ_N)
       Update the state of the environment:
             s ← s′, where s′ ∼ 𝒯(s, a_1, ..., a_N)
end for
Algorithm 1: Pseudocode
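The following is a heavily simplified sketch of this loop for a one-shot Prisoner's Dilemma with two sigmoid learners and exact value functions, in the spirit of Equations 4.1 and 4.3. It replaces the neural-network planning policy with a simple table of additional rewards per player and outcome, omits the cost term and the bound on additional rewards, and uses step sizes chosen so that the toy example converges; it is our own illustration, not the authors' implementation.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Prisoner's Dilemma payoffs, indexed by (action_1, action_2); 0 = cooperate, 1 = defect.
R1 = np.array([[3.0, 0.0],
               [4.0, 1.0]])
R2 = R1.T

def outcome_gradients(theta1, theta2):
    """Gradients of the joint outcome probabilities w.r.t. each learner's parameter."""
    p1, p2 = sigmoid(theta1), sigmoid(theta2)   # probability of cooperating
    dp1 = p1 * (1.0 - p1) * np.outer([1.0, -1.0], [p2, 1.0 - p2])
    dp2 = p2 * (1.0 - p2) * np.outer([p1, 1.0 - p1], [1.0, -1.0])
    return [dp1, dp2]

eta, eta_p = 0.01, 5.0          # step sizes chosen for this illustration
theta = np.zeros(2)             # learners start at 50% cooperation
theta_p = np.zeros((2, 2, 2))   # additional reward for each player and outcome
rewards = [R1, R2]

for episode in range(4000):
    dprobs = outcome_gradients(theta[0], theta[1])
    # Gradient of social welfare W (original rewards only) w.r.t. each learner parameter.
    dW = [np.sum(dp * (R1 + R2)) for dp in dprobs]
    # Gradient of each learner's modified value (original plus additional rewards).
    dV_tilde = [np.sum(dprobs[i] * (rewards[i] + theta_p[i])) for i in range(2)]
    # Learner updates (Equation 4.1).
    theta = theta + eta * np.array(dV_tilde)
    # Planning update (Equation 4.3): for this tabular planning policy,
    # the cross-gradient of dV_tilde_i w.r.t. the planning parameters is just dprobs[i].
    for i in range(2):
        theta_p[i] += eta_p * eta * dprobs[i] * dW[i]

print("final cooperation probabilities:", sigmoid(theta))  # both approach 1
```

In this simplified setting the planning parameters are themselves the additional rewards for each outcome, so the cross-gradient in Equation 4.3 has a closed form; with a neural-network planning policy, as used in the experiments below, the same term has to be computed by differentiating through the network.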

4.1.4 Opponent modeling

Equations 4.3 and 4.5 assume that the planning agent has access to each agent’s internal policy parameters and gradients. This is a restrictive assumption. In particular, agents may have an incentive to conceal their inner workings in adversarial settings. However, if the assumption is not fulfilled, we can instead model the opponents’ policies using parameter vectors $\hat{\theta}_1, \dots, \hat{\theta}_N$ and infer the value of these parameters from the players’ actions [Ross2010ALearning]. A simple approach is to use a maximum likelihood estimate based on the observed trajectory:

$$\hat{\theta}_i = \arg\max_{\theta_i} \sum_{t=0}^{T-1} \log \pi_{\theta_i}(a_t^i \mid o_t^i) \qquad (4.6)$$

Given this, we can substitute $\hat{\theta}_i$ for $\theta_i$ in equation 4.3.
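For a learner with a single sigmoid policy parameter (as used in the experiments of Section 4.2), the maximum likelihood estimate of Equation 4.6 can be found by a short gradient ascent on the Bernoulli log-likelihood; the observed trajectory below is made up for illustration.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def fit_opponent_parameter(actions, steps=500, lr=0.5):
    """Maximum likelihood estimate of a single policy parameter theta, assuming the
    opponent cooperates (action 0) with probability sigmoid(theta)."""
    cooperated = (np.asarray(actions) == 0).astype(float)
    theta = 0.0
    for _ in range(steps):
        # Gradient of the (mean) Bernoulli log-likelihood w.r.t. theta.
        theta += lr * np.mean(cooperated - sigmoid(theta))
    return theta

# Hypothetical observed trajectory: the opponent cooperated in 7 of 10 rounds.
observed_actions = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
theta_hat = fit_opponent_parameter(observed_actions)
print(theta_hat, sigmoid(theta_hat))  # sigmoid(theta_hat) is approximately 0.7
```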

4.1.5 Cost of additional rewards

In real-world examples, it may be costly to distribute additional rewards or punishments. We can model this cost by changing the planning agent’s objective to $W - c \|a_p\|_1$, where $c$ is a cost parameter and $\|a_p\|_1 = \sum_{i=1}^{N} |a_p^i|$ is the total amount of distributed rewards and punishments. The modified update rule is (using equation 4.3)

$$\Delta\theta_p = \eta_p \left( \sum_{i=1}^{N} \eta_i \left( \nabla_{\theta_p} \nabla_{\theta_i} \tilde{V}_i \right)^\top \nabla_{\theta_i} W - c\, \nabla_{\theta_p} \|a_p\|_1 \right) \qquad (4.7)$$

4.2 Experimental setup

In our experiments, we consider learning agents playing a matrix game social dilemma (MGSD) as outlined in section 3.1.2. The learners are simple agents with a single policy parameter $\theta_i$ that controls the probability of cooperation and defection: $\pi_{\theta_i}(C) = \sigma(\theta_i)$, $\pi_{\theta_i}(D) = 1 - \sigma(\theta_i)$, where $\sigma$ is the sigmoid function. The agents use a centralized critic [Lowe2017Multi-AgentEnvironments] to learn their value function.

The agents play 4000 episodes of a matrix game social dilemma. We fix the payoffs $R = 3$ and $P = 1$, which allows us to describe the game using the levels of greed and fear. We will consider the three canonical matrix game social dilemmas shown in Table 4.1.

Game                 Greed   Fear   T     S
Prisoner's Dilemma   1       1      4     0
Chicken              0.5     -1     3.5   2
Stag Hunt            -1      1      2     0
Table 4.1: Levels of fear and greed and the resulting temptation (T) and sucker (S) payoffs in three matrix games. Note that the level of greed in Chicken has to be smaller than 1 because the game is otherwise not a social dilemma ($2R > T + S$ is not fulfilled).

The planning agent's policy is parametrized by a single-layer neural network. We limit the maximum amount of additional rewards or punishments, i.e. we restrict $A_p$ to vectors $a_p$ that satisfy $|a_p^i| \leq c_{\max}$ for a given constant $c_{\max}$. Unless specified otherwise, we use a step size of 0.01 for both the planning agent and the learners, use cost regularisation (Equation 4.7) with a cost parameter of 0.0002, set the maximum reward $c_{\max}$ to 3, and use the exact value function. In some experiments, we also require that the planning agent can only redistribute rewards, but cannot change the total sum of rewards, i.e. $A_p$ is restricted to vectors that satisfy $\sum_{i=1}^{N} a_p^i = 0$. We refer to this as the revenue-neutral setting.

4.3 Results

In this section, we summarize the experimental results. (Source code is available at https://github.com/tobiasbaumann1/Adaptive_Mechanism_Design.) We aim to answer the following questions:

  • Does the introduction of the planning agent succeed in promoting significantly higher levels of cooperation?

  • What qualitative conclusions can be drawn about the amount of additional incentives needed to learn and maintain cooperation?

  • In which cases is it possible to achieve cooperation even when the planning agent is only active for a limited timespan?

  • How does a restriction to revenue-neutrality affect the effectiveness of mechanism design?

Figure 4.1: Mechanism design over 4000 episodes of a Prisoner's Dilemma. The initial probability of cooperation is 0.25 for each player. Shown are (a) the probability of cooperation over time, (b) the additional reward for the first player in each of the four possible outcomes, (c) the resulting levels of fear and greed including additional rewards, and (d) the cumulative amount of distributed rewards.

Figure 4.1a illustrates that the players learn to cooperate with high probability if the planning agent is present, resulting in the socially preferred outcome of stable mutual cooperation. The planning agent thus successfully learns how to distribute additional rewards to guide the players to a better outcome.

Figure 4.1b shows how the planning agent rewards or punishes the player conditional on each of the four possible outcomes. At first, the planning agent learns to reward cooperation, which creates a sufficient incentive for the players to learn to cooperate. Figure 4.1c shows how this changes the levels of fear and greed in the modified game. The levels of greed and fear soon drop below zero, which means that the modified game is no longer a social dilemma.

Note that rewarding cooperation is less costly than punishing defection if (and only if) cooperation is the less common action. After the players learn to cooperate with high probability, the planning agent learns that it is now less costly to punish defection and consequently stops handing out additional rewards in the case of mutual cooperation. As shown in Figure 4.1d, the amount of necessary additional rewards converges to 0 over time as defection becomes increasingly rare.

Table 4.2 summarizes the results of all three canonical social dilemmas. Without adaptive mechanism design, the learners fail to achieve mutual cooperation in all cases. By contrast, if the planning agent is turned on, the learners learn to cooperate with high probability, resulting in a significantly higher level of social welfare.

                       Prisoner's Dilemma   Chicken           Stag Hunt
Greed                  1                    0.5               -1
Fear                   1                    -1                1
No mech. design
  p_CC                 0.004% ± 0.001%      3.7% ± 1.3%       0.004% ± 0.002%
  W                    2.024 ± 0.003        5.44 ± 0.01       2.00 ± 0.00
With mech. design
  p_CC                 98.7% ± 0.1%         99.0% ± 0.1%      99.1% ± 0.1%
  W                    5.975 ± 0.002        5.995 ± 0.001     5.964 ± 0.005
Turning off
  p_CC                 0.48% ± 0.4%         53.8% ± 29.4%     99.6% ± 0.0%
  W                    2.60 ± 0.69          5.728 ± 0.174     5.986 ± 0.002

Table 4.2: Comparison of the resulting levels of cooperation after 4000 episodes, a) without mechanism design, b) with mechanism design, and c) when turning off the planning agent after 4000 episodes and running another 4000 episodes. Each cell shows the mean and standard deviation of ten training runs. p_CC is the probability of mutual cooperation at the end of training and W is the expected social welfare that results from the players' final action probabilities. The initial probability of cooperation is 0.25 for each player.

The three games differ, however, in whether the cooperative outcome obtained through mechanism design is stable even when the planning agent is turned off. Without additional incentives, mutual cooperation is not a Nash equilibrium in the Prisoner’s Dilemma and in Chicken [Fudenberg_Game_Theory], which is why one or both players learn to defect again after the planning agent is turned off. These games thus require continued (but only occasional) intervention to maintain cooperation. By contrast, mutual cooperation is a stable equilibrium in Stag Hunt [Fudenberg_Game_Theory]. As shown in Table 4.2, this means that long-term cooperation in Stag Hunt can be achieved even if the planning agent is only active over a limited timespan (and thus at limited cost).

                         Prisoner's Dilemma   Chicken           Stag Hunt
Greed                    1                    0.5               -1
Fear                     1                    -1                1
Exact
  p_CC                   98.7% ± 0.1%         99.0% ± 0.1%      99.1% ± 0.1%
  AAR                    0.77 ± 0.21          0.41 ± 0.02       0.45 ± 0.02
Exact, revenue-neutral
  p_CC                   91.4% ± 1.0%         98.9% ± 0.1%      69.2% ± 45.3%
  AAR                    0.61 ± 0.04          0.31 ± 0.02       0.19 ± 0.11
Estimated
  p_CC                   61.3% ± 20.0%        52.2% ± 18.6%     96.0% ± 1.2%
  AAR                    3.31 ± 0.63          2.65 ± 0.31       4.89 ± 0.39

Table 4.3: Resulting levels of cooperation (p_CC) and average additional rewards (AAR) per round for different variants of the learning rule. The variants differ in whether they use the exact value function (Equation 4.3) or an estimate (Equation 4.5), and in whether the setting is revenue-neutral or unrestricted.

Table 4.3 compares the performance of different variants of the learning rule. Interestingly, restricting the possible planning actions to redistribution leads to lower probabilities of cooperation in Prisoner’s Dilemma and Stag Hunt, but not in Chicken. We hypothesize that this is because in Chicken, mutual defection is not in the individual interest of the players anyway. This means that the main task for the planning agent is to prevent (C,D) or (D,C) outcomes, which can be easily achieved by redistribution. By contrast, these outcomes are fairly unattractive (in terms of individual interests) in Stag Hunt, so the most effective intervention is to make (D,D) less attractive and (C,C) more attractive, which is not feasible by pure redistribution. Consequently, mechanism design by redistribution works best in Chicken and worst in Stag Hunt.

Using an estimate of the value function leads to inferior performance in all three games, both in terms of the resulting probability of mutual cooperation and with respect to the amount of distributed additional rewards. However, the effect is by far least pronounced in Stag Hunt. This may be because mutual cooperation is an equilibrium in Stag Hunt, which means that a beneficial outcome can arise more easily even if the incentive structure created by the planning agent is imperfect.

Finally, we note that the presented approach is also applicable to settings with more than two players. (Source code for this setting is available in a separate repository at https://github.com/tobiasbaumann1/Mechanism_Design_Multi-Player.) We consider a multi-player Prisoner's Dilemma with $N$ agents. The payoffs are as follows: 3 if all players cooperate, 1 if all players defect, 4 if a player is the only one to defect, and 0 if a player is the only one to cooperate. Payoffs of intermediate outcomes, where some fraction of players cooperate, are obtained by linear interpolation.