SA-IGA: A Multiagent Reinforcement Learning Method Towards Socially Optimal Outcomes

03/08/2018 ∙ by Chengwei Zhang, et al. ∙ University of Liverpool, Tianjin University, Southwest University

In multiagent environments, the capability of learning is important for an agent to behave appropriately in the face of unknown opponents and dynamic environments. From the system designer's perspective, it is desirable if the agents can learn to coordinate towards socially optimal outcomes, while also avoiding being exploited by selfish opponents. To this end, we propose a novel gradient-ascent based algorithm (SA-IGA) which augments the basic gradient-ascent algorithm by incorporating social awareness into the policy update process. We theoretically analyze the learning dynamics of SA-IGA using dynamical system theory, and SA-IGA is shown to have linear dynamics for a wide range of games including symmetric games. The learning dynamics of two representative games (the prisoner's dilemma game and the coordination game) are analyzed in detail. Based on the idea of SA-IGA, we further propose a practical multiagent learning algorithm, called SA-PGA, based on the Q-learning update rule. Simulation results show that the SA-PGA agent achieves higher social welfare than the previous socially-oriented Conditional Joint Action Learner (CJAL) and is also robust against individually rational opponents by reaching Nash equilibrium solutions.


1 Introduction

In multiagent systems, the ability to learn is important for an agent to adaptively adjust its behavior in response to coexisting agents and unknown environments in order to optimize its performance. Multiagent learning algorithms have received extensive investigation in the literature, and many learning strategies busoniu2008comprehensive ; matignon2012independent ; bloembergen2015evolutionary ; Zhang2017FMRQ have been proposed to facilitate coordination among agents.

The multiagent learning criteria proposed in WOLF-PHC require that an agent should be able to converge to a stationary policy against some class of opponents (convergence) and to the best-response policy against any stationary opponent (rationality). If both agents adopt a rational learning strategy in the context of repeated games and their strategies also converge, then they converge to a Nash equilibrium of the stage game. Indeed, convergence to a Nash equilibrium has been the most commonly accepted goal to pursue in the multiagent learning literature. Until now, a number of gradient-ascent based multiagent learning algorithms singh2000nash ; WOLF-PHC ; Abdallah2008MRL ; zhang2010multi have been successively proposed towards converging to Nash equilibria with improved convergence performance and more relaxed assumptions (less information is required). Along the same direction, another well-studied family of multiagent learning strategies is based on reinforcement learning (e.g., Q-learning Q-learning ). Representative examples include distributed Q-learning in cooperative games lauerRiedmiller , minimax Q-learning in zero-sum games minmaxQ , Nash Q-learning in general-sum games hu2003nash , and other extensions littman2001friend ; busoniu2008comprehensive , to name just a few.

                          Agent 2's actions
                          C          D
Agent 1's     C           3/3        0/5
actions       D           5/0        1/1
(each cell shows agent 1's payoff / agent 2's payoff)
Table 1: The Prisoner's Dilemma Game

All the aforementioned learning strategies pursue convergence to Nash equilibria under self-play; however, a Nash equilibrium solution may be undesirable in many scenarios. One well-known example is the prisoner's dilemma (PD) game shown in Table 1. By converging to the Nash equilibrium (D, D), both agents obtain a payoff of 1, while they could have obtained a much higher payoff of 3 by coordinating on the non-equilibrium outcome (C, C). In situations like the PD game, converging to the socially optimal outcome, i.e., the outcome maximizing the total reward of all players, under self-play would be preferred. To address this issue, one natural modification for a gradient-ascent learner is to update its policy along the direction of maximizing the sum of all agents' expected payoffs instead of its own. However, in an open environment, the agents are usually designed by different parties and may not have the incentive to follow the strategy we design. The above way of updating the strategy would be easily exploited and taken advantage of by (equilibrium-driven) self-interested agents. Thus it would be highly desirable if an agent could converge to socially optimal outcomes under self-play and to Nash equilibria against self-interested agents, to avoid being exploited.

In this paper, we propose a new gradient-ascent based algorithm (SA-IGA) which augments the basic gradient-ascent algorithm by incorporating "social awareness" into the policy update process. Social awareness means that agents try to optimize the social outcome as well as their own outcomes. An SA-IGA agent holds a social attitude to reflect its socially-aware degree, which can be adjusted adaptively based on the relative performance between itself and its opponent. An SA-IGA agent seeks to update its policy in the direction of increasing its overall payoff, which is defined as the average of its individual payoff and the social payoff, weighted by its socially-aware degree. We theoretically show that for a wide range of games (e.g., symmetric games), the dynamics of SA-IGA under self-play exhibit linear characteristics. For general-sum games, SA-IGA may exhibit non-linear dynamics, which can still be analyzed numerically. The learning dynamics of two representative games (the prisoner's dilemma game and the coordination game, representing symmetric and asymmetric games, respectively) are analyzed in detail. Like previous theoretical multiagent learning algorithms, SA-IGA also requires the additional assumption of knowing the opponent's policy and the game structure.

To relax the above assumption, we then propose a practical gradient-ascent based multiagent learning strategy, called Socially-aware Policy Gradient Ascent (SA-PGA). SA-PGA relaxes the above assumptions by estimating its own performance and that of its opponent using Q-learning techniques. We empirically evaluate its performance in different types of benchmark games, and simulation results show that the SA-PGA agent outperforms previous learning strategies in terms of maximizing the social welfare and Nash product of the agents. Besides, SA-PGA is also shown to be robust against individually rational opponents and converges to Nash equilibrium solutions.

The remainder of the paper is organized as follows. Section 2 reviews related work on gradient-ascent reinforcement learning algorithms. Section 3 reviews normal-form games and the basic gradient-ascent approach. Section 4 introduces the SA-IGA algorithm and analyzes its learning dynamics theoretically. Section 5 presents the practical multiagent learning algorithm SA-PGA in detail. In Section 6, we extensively evaluate the performance of SA-PGA in various benchmark games. Lastly, we conclude the paper and point out future directions in Section 7.

2 Related Work

The first gradient-ascent multiagent reinforcement learning algorithm is Infinitesimal Gradient Ascent (IGA) singh2000nash , in which each learner updates its policy in the gradient direction of its expected payoff. The purpose of IGA is to promote convergence to a Nash equilibrium in two-player two-action normal-form games. It has been proved that IGA agents converge to a Nash equilibrium or, if their strategies do not converge, their average payoffs nevertheless converge to the average payoffs of a Nash equilibrium. Soon after, Zinkevich Zinkevich2003Online proposed an algorithm called Generalized Infinitesimal Gradient Ascent (GIGA), which extends IGA to games with an arbitrary number of actions.

Both IGA and GIGA can be combined with the Win or Learn Fast (WoLF) heuristic in order to improve convergence in stochastic games (WoLF-IGA WOLF-PHC , WoLF-GIGA Bowling2004Convergence ). The intuition behind the WoLF principle is that an agent should adapt quickly when it performs worse than expected, whereas it should maintain its current strategy when it receives a payoff better than expected. By altering the learning rate according to the WoLF principle, a rational algorithm can be made convergent. A shortcoming of WoLF-IGA and WoLF-GIGA is that they require a reference policy, i.e., they require an estimate of the Nash equilibrium strategies and the corresponding payoffs. To this end, Banerjee et al. Banerjee2003Adaptive propose an alternative criterion to WoLF, named Policy Dynamics based WoLF (PDWoLF), that can be accurately computed and guarantees convergence. The Weighted Policy Learner (WPL) Abdallah2008MRL is another variation of IGA that also modulates the learning rate, but it does not require a reference policy. Both WoLF and WPL are designed to guarantee convergence in stochastic repeated games.

Another direction for extending IGA is to modify the value function being followed. Zhang et al. zhang2010multi propose a gradient-based learning algorithm, named Gradient Ascent with Policy Prediction (IGA-PP), which adjusts the expected payoff function of IGA. The algorithm is designed for games with two agents. The key idea behind this algorithm is that a player adjusts its strategy in response to the forecasted strategy of the other player, instead of its current one. It has been proved that, in two-player two-action general-sum matrix games, IGA-PP in self-play or against IGA leads the players' strategies to converge to a Nash equilibrium. Like other MARL algorithms, besides the common assumptions, this algorithm additionally requires that a player knows the other player's strategy and current strategy gradient (or payoff matrix) so that it can forecast the other player's strategy.

All the aforementioned learning strategies pursue convergence to Nash equilibria. In contrast, in this work, we seek to incorporate social awareness into the GA-based strategy update and aim at improving the social welfare of the players under self-play rather than pursuing Nash equilibrium solutions. Meanwhile, individually rational behavior is employed when playing against a selfish agent. A similar idea of adaptively behaving differently against different opponents was also employed in previous algorithms littman2001friend ; conitzer2007awesome ; powers2005learning ; chakraborty2014multiagent . However, all the existing works focus on maximizing an agent's individual payoff against different opponents in different types of games, and do not directly take into consideration the goal of maximizing social welfare (e.g., cooperating in the prisoner's dilemma game).

3 Background

In this section we introduce the necessary background for our contribution. First, we give an overview of the relevant game-theoretic definitions. Then a brief review of gradient-ascent based MARL (GA-MARL) algorithms is given.

3.1 Game Theory

Game theory provides a framework for modeling agents' interactions, which has been used by previous researchers to analyze the convergence properties of MARL algorithms singh2000nash ; WOLF-PHC ; Abdallah2008MRL ; zhang2010multi . A game specifies, in a compact and simple manner, how the payoff of an agent depends on the other agents' actions. A (normal-form) game is defined by the tuple $(n, A_1, \ldots, A_n, R_1, \ldots, R_n)$, where $n$ is the number of players in the game, $A_i$ is the set of actions available to agent $i$, and $R_i$ is the reward (payoff) function of agent $i$, which is defined over the joint action executed by all agents. If the game has only two agents, then it is convenient to define their reward functions as payoff matrices as follows,

$$R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix},$$

where $R$ and $C$ are the payoff matrices of agent 1 (the row player) and agent 2 (the column player), respectively. Each element $r_{jk}$ (resp. $c_{jk}$) represents the payoff received by agent 1 (resp. agent 2) if agent 1 plays its $j$-th action and agent 2 plays its $k$-th action.

A policy (or a strategy) of an agent $i$ is denoted by $\pi_i: A_i \to [0, 1]$, which maps its actions to probabilities. The probability of choosing an action $a \in A_i$ according to policy $\pi_i$ is $\pi_i(a)$. A policy is deterministic or pure if the probability of playing one action is 1 while the probability of playing every other action is 0; otherwise the policy is stochastic or mixed. The joint policy of all agents is the collection of the individual agents' policies, which is defined as $\pi = (\pi_1, \ldots, \pi_n)$. For convenience, the joint policy is usually expressed as $\pi = (\pi_i, \pi_{-i})$, where $\pi_{-i}$ is the collection of the policies of all agents other than agent $i$.

The expected payoff of an agent is defined as its reward averaged over the joint policy. Let $A = A_1 \times \cdots \times A_n$ denote the joint action space; if the agents follow a joint policy $\pi$, then the expected payoff of agent $i$ is $V_i(\pi) = \sum_{a \in A} R_i(a) \prod_{j=1}^{n} \pi_j(a_j)$, where $a_j$ is agent $j$'s action in the joint action $a$.

The goal of each agent is to find a policy that maximizes its expected payoff. Ideally, we want all agents to reach the equilibrium that maximizes their individual payoffs. However, when agents do not communicate and/or are not cooperative, reaching a globally optimal equilibrium is not always attainable. An alternative goal is converging to a Nash Equilibrium (NE), which is by definition a local maximum across agents. A joint strategy is called a Nash equilibrium (NE) if no player can obtain a higher expected payoff by changing its current strategy unilaterally. Formally, $\pi^* = (\pi_i^*, \pi_{-i}^*)$ is an NE iff for every agent $i$ and every policy $\pi_i$: $V_i(\pi_i^*, \pi_{-i}^*) \ge V_i(\pi_i, \pi_{-i}^*)$. An NE is pure if all its constituting policies are pure; otherwise the NE is called mixed or stochastic. Every game has at least one Nash equilibrium, but may not have a pure one.
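To make these definitions concrete, the following minimal Python sketch (illustrative only; the matrices R1, R2 and the helper names are ours, not from the paper) computes expected payoffs of a two-player two-action game under mixed policies and checks whether a pure joint action is a Nash equilibrium.

import numpy as np

# Payoff matrices for the PD game of Table 1: rows index agent 1's action (C, D),
# columns index agent 2's action (C, D).
R1 = np.array([[3.0, 0.0], [5.0, 1.0]])  # agent 1's payoffs
R2 = np.array([[3.0, 5.0], [0.0, 1.0]])  # agent 2's payoffs

def expected_payoff(R, pi1, pi2):
    """Expected payoff V_i = sum_a R_i(a) * prod_j pi_j(a_j) for a 2x2 game."""
    return float(np.asarray(pi1) @ R @ np.asarray(pi2))

def is_pure_nash(a1, a2):
    """NE condition: no player gains by a unilateral deviation."""
    ok1 = all(R1[a1, a2] >= R1[b, a2] for b in range(2))
    ok2 = all(R2[a1, a2] >= R2[a1, b] for b in range(2))
    return ok1 and ok2

print(expected_payoff(R1, [0.5, 0.5], [0.5, 0.5]))  # agent 1's expected payoff: 2.25
print([(a1, a2) for a1 in range(2) for a2 in range(2) if is_pure_nash(a1, a2)])
# -> [(1, 1)], i.e. (D, D) is the only pure NE of the PD game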

In the next subsection, we introduce gradient-ascent based MARL (GA-MARL) algorithms, together with a brief review of their dynamics analysis.

3.2 Gradient Ascent (GA) MARL Algorithms

Gradient-ascent MARL (GA-MARL) algorithms learn a stochastic policy by directly following the gradient of the expected reward. The ability to learn a stochastic policy is particularly important when the world is not fully observable or has a competitive nature. The basic GA-MARL algorithm whose dynamics have been analyzed is Infinitesimal Gradient Ascent (IGA) singh2000nash . When a game is played repeatedly, an IGA player updates its strategy towards maximizing its expected payoff. A player employing a GA-based algorithm updates its policy in the direction of its expected reward gradient, as illustrated by the following equations,

(1)   $\tilde{\pi}_i^{k+1}(a) = \pi_i^k(a) + \eta\,\frac{\partial V_i(\pi^k)}{\partial \pi_i(a)},$
(2)   $\pi_i^{k+1} = \Pi\big(\tilde{\pi}_i^{k+1}\big),$

where the parameter $\eta$ is the gradient step size, and $\Pi$ is the projection function mapping the input value to the valid probability range, used to prevent the gradient from moving the strategy out of the valid probability space. Formally, we have,

(3)   $\Pi(x) = \operatorname*{argmin}_{z \in \text{valid policy space}} \|x - z\|.$

Singh, Kearns, and Mansour singh2000nash examined the dynamics of gradient ascent in two-player, two-action, iterated matrix games, which can be represented by the two payoff matrices $R$ and $C$ defined above.

We refer to the joint policy of the two players at time $t$ by the pair of probabilities of choosing the first action, $(\alpha^t, \beta^t)$, where $\alpha^t \in [0, 1]$ is the policy of player 1 and $\beta^t \in [0, 1]$ is the policy of player 2. The time superscript will be omitted when it does not affect clarity (for example, when we are considering only one point in time). Then, for the two-player two-action case, the GA-based update in Equations (1) and (2) can be simplified as follows,

(4)   $\alpha^{t+1} = \Pi_{[0,1]}\Big(\alpha^t + \eta\,\frac{\partial V_1(\alpha^t, \beta^t)}{\partial \alpha}\Big), \qquad \beta^{t+1} = \Pi_{[0,1]}\Big(\beta^t + \eta\,\frac{\partial V_2(\alpha^t, \beta^t)}{\partial \beta}\Big),$
with
$\frac{\partial V_1(\alpha, \beta)}{\partial \alpha} = u_r\,\beta + (r_{12} - r_{22}), \qquad \frac{\partial V_2(\alpha, \beta)}{\partial \beta} = u_c\,\alpha + (c_{21} - c_{22}),$

where $u_r = r_{11} + r_{22} - r_{12} - r_{21}$, $u_c = c_{11} + c_{22} - c_{12} - c_{21}$, and $\Pi_{[0,1]}(x) = \min(1, \max(0, x))$ is the projection onto $[0, 1]$.

In the case of an infinitesimal gradient step size ($\eta \to 0$), the learning dynamics of the players can be modeled as a system of differential equations, i.e., $\dot{\alpha} = \partial V_1(\alpha, \beta)/\partial \alpha$ and $\dot{\beta} = \partial V_2(\alpha, \beta)/\partial \beta$, which can be analyzed using dynamical system theory Coddington1955Theory . It has been proved that the agents converge to a Nash equilibrium or, if the strategies themselves do not converge, their average payoffs nevertheless converge to the average payoffs of a Nash equilibrium singh2000nash .
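As an illustration, the following Python sketch (our own, not from singh2000nash) implements the projected IGA update of Equation (4) for a two-player two-action game; the step size and iteration count are arbitrary illustrative choices.

import numpy as np

def clip01(x):
    """Projection onto the valid probability range [0, 1]."""
    return min(1.0, max(0.0, x))

def iga_step(alpha, beta, R, C, eta=0.001):
    """One IGA update of the probabilities of playing the first action.
    R, C are the 2x2 payoff matrices of the row and column player."""
    u_r = R[0, 0] + R[1, 1] - R[0, 1] - R[1, 0]
    u_c = C[0, 0] + C[1, 1] - C[0, 1] - C[1, 0]
    grad_alpha = beta * u_r + (R[0, 1] - R[1, 1])   # dV1/d(alpha)
    grad_beta = alpha * u_c + (C[1, 0] - C[1, 1])   # dV2/d(beta)
    return clip01(alpha + eta * grad_alpha), clip01(beta + eta * grad_beta)

# Example: iterate IGA in the PD game of Table 1; both players drift to defection.
R = np.array([[3.0, 0.0], [5.0, 1.0]])
C = np.array([[3.0, 5.0], [0.0, 1.0]])
alpha, beta = 0.9, 0.9          # initial probabilities of cooperating
for _ in range(20000):
    alpha, beta = iga_step(alpha, beta, R, C)
print(alpha, beta)              # -> approximately (0.0, 0.0)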

Combined with Q-learning Watkins1989Learning , researchers have proposed a practical learning algorithm, the policy hill-climbing (PHC) algorithm WOLF-PHC , which is a simple extension of IGA and is shown in Algorithm 1.

1:  Let $\delta_q, \delta_\pi$ be learning rates.
2:  Initialize $Q(a) \leftarrow 0$ and $\pi(a) \leftarrow 1/|A_i|$ for all $a \in A_i$.
3:  repeat
4:     Select action $a$ according to mixed strategy $\pi$ with suitable exploration.
5:     Observe reward $r$. Update $Q(a) \leftarrow (1 - \delta_q)\,Q(a) + \delta_q\, r$.
6:     Update $\pi$ according to the gradient ascent strategy: $\pi(a) \leftarrow \pi(a) + \delta_\pi$ if $a = \arg\max_{a'} Q(a')$, and $\pi(a) \leftarrow \pi(a) - \frac{\delta_\pi}{|A_i| - 1}$ otherwise; then project $\pi$ back to a valid probability distribution.
7:  until the repeated game ends
Algorithm 1 PHC for player $i$

The algorithm performs hill-climbing in the space of mixed policies, which is similar to gradient ascent but does not require as much knowledge. Q-values are maintained just as in normal Q-learning. In addition, the algorithm maintains the current mixed policy. The policy is improved by increasing the probability of selecting the highest-valued action according to a learning rate $\delta_\pi$. After that, the policy is mapped back to the valid probability space. This technique, like Q-learning, is rational and converges to an optimal policy if the other players play stationary strategies: with a suitable exploration policy the Q-values converge to their optimal values, $\pi$ converges to a policy that is greedy with respect to those values, and therefore converges to a best response. PHC is rational and places no limit on the number of agents or actions.
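A minimal stateless (repeated-game) Python sketch of a PHC player might look as follows; the class name, hyper-parameter defaults, and epsilon-greedy exploration are our own illustrative choices rather than part of the original algorithm specification.

import random

class PHCAgent:
    def __init__(self, n_actions, alpha=0.1, delta=0.01, epsilon=0.05):
        self.n = n_actions
        self.alpha = alpha          # Q-learning rate
        self.delta = delta          # policy (hill-climbing) learning rate
        self.epsilon = epsilon      # exploration rate
        self.Q = [0.0] * n_actions
        self.pi = [1.0 / n_actions] * n_actions

    def act(self):
        if random.random() < self.epsilon:          # suitable exploration
            return random.randrange(self.n)
        return random.choices(range(self.n), weights=self.pi)[0]

    def update(self, action, reward):
        # Q-learning update for the stateless repeated game.
        self.Q[action] += self.alpha * (reward - self.Q[action])
        # Hill-climbing: move probability mass towards the greedy action.
        greedy = max(range(self.n), key=lambda a: self.Q[a])
        for a in range(self.n):
            self.pi[a] += self.delta if a == greedy else -self.delta / (self.n - 1)
        # Project back onto the probability simplex.
        self.pi = [max(0.0, p) for p in self.pi]
        total = sum(self.pi)
        self.pi = [p / total for p in self.pi]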

4 Socially-aware Infinitesimal Gradient Ascent (SA-IGA)

In our daily life, people do not always behave as purely individually rational entities seeking Nash equilibrium solutions. For example, when two human subjects play a PD game, mutual cooperation is frequently observed. Similar phenomena have also been observed in extensive human-subject experiments in games such as the Public Goods game Hauert2003Prisone and the Ultimatum game Alvard2004The , in which human subjects usually obtain much higher payoffs through mutual cooperation rather than by pursuing Nash equilibrium solutions. Translated into a computational model, this indicates that an agent may not only update its policy in the direction of maximizing its own payoff, but also take others' payoffs into consideration. We call such agents socially-aware agents.

In this paper, we incorporate social awareness into the gradient-ascent based learning algorithm. In this way, apart from learning to maximize its individual payoff, an agent is also equipped with social awareness so that it can (1) reach mutually cooperative solutions when faced with other socially-aware agents (self-play); and (2) behave in a purely individually rational manner when the others are purely rational.

Specifically, each SA-IGA agent $i$ distinguishes two types of expected payoffs, namely its individual payoff $V_i$ and the social payoff $V_i^s$. The payoffs $V_i$ and $V_i^s$ represent the individual payoff and the social payoff (the average payoff of all agents) that agent $i$ perceives under the joint strategy $\pi$, respectively. The payoff $V_i$ follows the same definition as in IGA, and the payoff $V_i^s$ is defined as the average of the individual payoffs of all agents,

(5)   $V_i^s(\pi) = \frac{1}{n}\sum_{j=1}^{n} V_j(\pi).$

Each agent $i$ adopts a social attitude $w_i \in [0, 1]$ to reflect its socially-aware degree. The social attitude intuitively models an agent's degree of social friendliness towards others. Specifically, it is used as the weighting factor to adjust the relative importance between $V_i$ and $V_i^s$, and agent $i$'s overall expected payoff $V_i^o$ is defined as follows,

(6)   $V_i^o(\pi) = (1 - w_i)\,V_i(\pi) + w_i\,V_i^s(\pi).$

Each agent $i$ updates its strategy in the direction of maximizing the value of $V_i^o$. Formally, we have,

(7)   $\pi_i^{k+1}(a) = \Pi\Big(\pi_i^k(a) + \eta\,\frac{\partial V_i^o(\pi^k)}{\partial \pi_i(a)}\Big),$

where the parameter $\eta$ is the gradient step size of the policy update. If $w_i = 0$, the agent seeks to maximize its individual payoff only, which reduces to the case of traditional gradient-ascent updating; if $w_i = 1$, the agent seeks to maximize the average (equivalently, the sum) of the payoffs of both players.

Finally, each agent $i$'s socially-aware degree $w_i$ is adaptively adjusted in response to the relative value of $V_i$ and $V_i^s$ as follows. During each round, if player $i$'s own expected payoff $V_i$ exceeds the value of $V_i^s$, then player $i$ increases its social attitude $w_i$ (i.e., it becomes more socially friendly because it perceives itself to be earning more than the average). Conversely, if $V_i$ is less than $V_i^s$, then the agent tends to care more about its own interest by decreasing the value of $w_i$. Formally,

(8)   $w_i^{k+1} = \Pi_{[0,1]}\Big(w_i^k + \lambda\,\big(V_i(\pi^k) - V_i^s(\pi^k)\big)\Big),$

where the parameter $\lambda$ is the learning rate of $w_i$.
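The following Python sketch (ours, for illustration) puts Equations (5)-(8) together for the two-player two-action case, assuming the payoff matrices R and C are known as SA-IGA requires; the step sizes eta and lam are arbitrary.

import numpy as np

def expected_payoffs(alpha, beta, R, C):
    """Individual payoffs V1, V2 and the social payoff V_s = (V1 + V2) / 2."""
    p1 = np.array([alpha, 1.0 - alpha])
    p2 = np.array([beta, 1.0 - beta])
    V1, V2 = p1 @ R @ p2, p1 @ C @ p2
    return V1, V2, 0.5 * (V1 + V2)

def sa_iga_step(alpha, beta, w1, w2, R, C, eta=0.001, lam=0.001):
    """One SA-IGA update followed by projection onto [0, 1]."""
    u_r = R[0, 0] + R[1, 1] - R[0, 1] - R[1, 0]
    u_c = C[0, 0] + C[1, 1] - C[0, 1] - C[1, 0]
    dV1_da = beta * u_r + (R[0, 1] - R[1, 1])       # dV1/d(alpha)
    dV2_da = beta * u_c + (C[0, 1] - C[1, 1])       # dV2/d(alpha)
    dV1_db = alpha * u_r + (R[1, 0] - R[1, 1])      # dV1/d(beta)
    dV2_db = alpha * u_c + (C[1, 0] - C[1, 1])      # dV2/d(beta)
    # Gradients of the overall payoffs V_i^o = (1 - w_i) V_i + w_i (V1 + V2) / 2.
    g1 = (1.0 - 0.5 * w1) * dV1_da + 0.5 * w1 * dV2_da
    g2 = (1.0 - 0.5 * w2) * dV2_db + 0.5 * w2 * dV1_db
    V1, V2, Vs = expected_payoffs(alpha, beta, R, C)
    clip = lambda x: float(np.clip(x, 0.0, 1.0))
    return (clip(alpha + eta * g1), clip(beta + eta * g2),
            clip(w1 + lam * (V1 - Vs)), clip(w2 + lam * (V2 - Vs)))

# Example: PD game of Table 1 with high initial social attitudes.
R = np.array([[3.0, 0.0], [5.0, 1.0]])
C = np.array([[3.0, 5.0], [0.0, 1.0]])
state = (0.5, 0.5, 0.9, 0.9)
for _ in range(50000):
    state = sa_iga_step(*state, R, C)
print(state)   # with sufficiently high initial w, both policies converge to cooperation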

4.1 Theoretical Modeling and Analysis of SA-IGA

An important aspect of understanding the behavior of a multiagent learning algorithm is theoretically modeling and analyzing its underlying dynamics tuyls2003selection ; rodrigues2009dynamic ; bloembergen2015evolutionary . In this section, we first show that the learning dynamics of SA-IGA under self-play can be modeled as a system of differential equations. To simplify the analysis, we only consider two-player two-action games.

Based on the adjustment rules in Equations (7) and (8), the learning dynamics of an SA-IGA agent can be modeled as the set of equations in (9). For ease of exposition, we concentrate on the unconstrained update equations obtained by removing the policy projection function, which does not affect our qualitative analytical results: any trajectory with linear (non-linear) characteristics without the constraints remains linear (non-linear) when the boundary is enforced.

(9)   $\pi_i^{k+1}(a) = \pi_i^k(a) + \eta\,\frac{\partial V_i^o(\pi^k)}{\partial \pi_i(a)}, \qquad w_i^{k+1} = w_i^k + \lambda\,\big(V_i(\pi^k) - V_i^s(\pi^k)\big).$

Substituting $V_i^o$ and $V_i^s$ by their definitions (Equations (4)-(6)), the learning dynamics of two SA-IGA agents can be expressed as follows,

(10)
$\alpha^{k+1} = \alpha^k + \eta\Big[\big(1 - \tfrac{w_1^k}{2}\big)\big(u_r\,\beta^k + b_r\big) + \tfrac{w_1^k}{2}\big(u_c\,\beta^k + b_c\big)\Big],$
$\beta^{k+1} = \beta^k + \eta\Big[\big(1 - \tfrac{w_2^k}{2}\big)\big(u_c\,\alpha^k + b'_c\big) + \tfrac{w_2^k}{2}\big(u_r\,\alpha^k + b'_r\big)\Big],$
$w_1^{k+1} = w_1^k + \tfrac{\lambda}{2}\big(V_1(\alpha^k, \beta^k) - V_2(\alpha^k, \beta^k)\big),$
$w_2^{k+1} = w_2^k + \tfrac{\lambda}{2}\big(V_2(\alpha^k, \beta^k) - V_1(\alpha^k, \beta^k)\big),$

where $u_r$ and $u_c$ are as defined in Equation (4), $b_r = r_{12} - r_{22}$, $b_c = c_{12} - c_{22}$, $b'_r = r_{21} - r_{22}$, and $b'_c = c_{21} - c_{22}$.

In the limit of infinitesimal step sizes ($\eta \to 0$ and $\lambda \to 0$), the above difference equations become differential. Thus the unconstrained dynamics of the strategy pair and the social attitudes as a function of time are modeled by the following system of differential equations:

(11)
$\dot{\alpha} = \eta\Big[\big(1 - \tfrac{w_1}{2}\big)\big(u_r\,\beta + b_r\big) + \tfrac{w_1}{2}\big(u_c\,\beta + b_c\big)\Big],$
$\dot{\beta} = \eta\Big[\big(1 - \tfrac{w_2}{2}\big)\big(u_c\,\alpha + b'_c\big) + \tfrac{w_2}{2}\big(u_r\,\alpha + b'_r\big)\Big],$
$\dot{w}_1 = \tfrac{\lambda}{2}\big(V_1(\alpha, \beta) - V_2(\alpha, \beta)\big), \qquad \dot{w}_2 = \tfrac{\lambda}{2}\big(V_2(\alpha, \beta) - V_1(\alpha, \beta)\big),$

where $V_1(\alpha, \beta)$ and $V_2(\alpha, \beta)$ are the expected payoffs of the two players under the joint policy $(\alpha, \beta)$.

Based on the above theoretical modeling, next we analyze the learning dynamics of SA-IGA qualitatively as follows.

Theorem 4.1

SA-IGA has non-linear dynamics when $u_r \ne u_c$.

Proof

: From the differential equations in (11), it is straightforward to verify that the dynamics of SA-IGA learners are non-linear when $u_r \ne u_c$, due to the presence of the product terms $w_1\beta$, $w_2\alpha$ and $\alpha\beta$ (whose coefficients are proportional to $u_r - u_c$) in the equations.

Since SA-IGA's dynamics are non-linear when $u_r \ne u_c$, in general we cannot obtain a closed-form solution, but we can still solve the equations numerically to obtain useful insight into the system's dynamics. Moreover, a wide range of important games fall into the category of $u_r = u_c$, in which the system of equations becomes linear. This allows us to use dynamical system theory to systematically analyze the underlying dynamics of SA-IGA.
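For instance, the unconstrained dynamics (11) can be integrated numerically with an off-the-shelf ODE solver. The sketch below (ours, not from the paper) uses the PD payoffs of Table 1 and arbitrary initial conditions; since the projection is removed, trajectories may leave the unit square.

import numpy as np
from scipy.integrate import solve_ivp

# PD payoffs from Table 1 (row player R, column player C).
R = np.array([[3.0, 0.0], [5.0, 1.0]])
C = np.array([[3.0, 5.0], [0.0, 1.0]])

def sa_iga_rhs(t, y, eta=1.0, lam=1.0):
    """Right-hand side of the unconstrained SA-IGA dynamics (cf. Equation (11))."""
    a, b, w1, w2 = y
    u_r = R[0, 0] + R[1, 1] - R[0, 1] - R[1, 0]
    u_c = C[0, 0] + C[1, 1] - C[0, 1] - C[1, 0]
    dV1_da = b * u_r + (R[0, 1] - R[1, 1])
    dV2_da = b * u_c + (C[0, 1] - C[1, 1])
    dV1_db = a * u_r + (R[1, 0] - R[1, 1])
    dV2_db = a * u_c + (C[1, 0] - C[1, 1])
    p1, p2 = np.array([a, 1 - a]), np.array([b, 1 - b])
    V1, V2 = p1 @ R @ p2, p1 @ C @ p2
    da = eta * ((1 - w1 / 2) * dV1_da + (w1 / 2) * dV2_da)
    db = eta * ((1 - w2 / 2) * dV2_db + (w2 / 2) * dV1_db)
    dw1 = lam * 0.5 * (V1 - V2)
    dw2 = lam * 0.5 * (V2 - V1)
    return [da, db, dw1, dw2]

# Start from a fairly cooperative, fairly social state and integrate.
sol = solve_ivp(sa_iga_rhs, (0.0, 5.0), [0.6, 0.6, 0.9, 0.9], max_step=0.01)
print(sol.y[:, -1])   # endpoint of the unconstrained trajectory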

Theorem 4.2

SA-IGA has linear dynamics when the game itself is symmetric.

Proof

: A two-player two-action symmetric game can be represented in the general form shown in Table 2. It is straightforward to check that such a game satisfies the constraint $u_r = u_c$, given that $r_{11} = c_{11} = a$, $r_{22} = c_{22} = d$, $r_{12} = c_{21} = b$, and $r_{21} = c_{12} = c$. Thus the theorem holds.

                          Agent 2's actions
                          action 1     action 2
Agent 1's     action 1    a/a          b/c
actions       action 2    c/b          d/d
(each cell shows agent 1's payoff / agent 2's payoff)
Table 2: The General Form of a Symmetric Game

4.2 Dynamics Analysis of SA-IGA

The previous section analyzed the dynamics of SA-IGA in a qualitative manner. In this section, we provide a detailed analysis of SA-IGA's learning dynamics. We first give a general conclusion for symmetric games, and then analyze the symmetric circumstances in two representative games: the Prisoner's Dilemma game and the Symmetric Coordination game. For asymmetric circumstances, because of the complexity of analyzing the nonlinear dynamics, we only focus on the general coordination game (Table 3). Specifically, we analyze SA-IGA's learning dynamics in these games by identifying the existing equilibrium points, which provides useful insight into understanding SA-IGA's dynamics.

For symmetric games, we have the following conclusion,

Theorem 4.3

The dynamics of the SA-IGA algorithm under self-play in a symmetric game have three types of equilibrium points:

  1. $(\alpha^*, \beta^*) = (1, 1)$, if $w_i(c - b) \ge 2(c - a)$ for $i = 1, 2$;
     $(\alpha^*, \beta^*) = (0, 0)$, if $w_i(c - b) \le 2(d - b)$ for $i = 1, 2$;

  2. $(\alpha^*, \beta^*) = (1, 0)$, if the corresponding boundary conditions on $(w_1, w_2)$ hold;
     $(\alpha^*, \beta^*) = (0, 1)$, if the corresponding boundary conditions on $(w_1, w_2)$ hold;

  3. $(\alpha^*, \beta^*, w^*, w^*)$ with $\alpha^* = \beta^* = \frac{(d - b) - \frac{w^*}{2}(c - b)}{u} \in (0, 1)$ and $w_1^* = w_2^* = w^*$,

where $u = a + d - b - c$ and $a, b, c, d$ are the payoffs of the symmetric game in Table 2. The first and second types of equilibrium points are stable, while the last is not. We say an equilibrium point is stable if, once the strategy starts "close enough" to the equilibrium (within a distance $\delta$ from it), it remains "close enough" to the equilibrium point forever.

Proof

: Following the system of differential equations in Equation (11), we can express the dynamics of SA-IGA in a symmetric game as follows:

(12)
$\dot{\alpha} = \eta\big[u\,\beta + (b - d) + \tfrac{w_1}{2}(c - b)\big],$
$\dot{\beta} = \eta\big[u\,\alpha + (b - d) + \tfrac{w_2}{2}(c - b)\big],$
$\dot{w}_1 = \tfrac{\lambda}{2}(b - c)(\alpha - \beta), \qquad \dot{w}_2 = \tfrac{\lambda}{2}(b - c)(\beta - \alpha),$

where $u = a + d - b - c$ and $a, b, c, d$ are the payoffs in Table 2.

We start by proving the last type of equilibrium points. If there exists an interior equilibrium $(\alpha^*, \beta^*, w_1^*, w_2^*)$ with $\alpha^*, \beta^* \in (0, 1)$, then $\dot{\alpha} = \dot{\beta} = 0$ and $\dot{w}_1 = \dot{w}_2 = 0$ at that point. Solving the above equations gives $\alpha^* = \beta^*$ and $w_1^* = w_2^* = w^*$. Since $\alpha^*, \beta^* \in (0, 1)$, we obtain $\alpha^* = \beta^* = \frac{(d - b) - \frac{w^*}{2}(c - b)}{u}$, and such a point is an equilibrium. The stability of this equilibrium can be verified using the theory of non-linear dynamics shilnikov2001methods . By expressing the unconstrained update differential equations in the form $\dot{x} = Ax + \text{const}$ with $x = (\alpha, \beta, w_1, w_2)$ and calculating the eigenvalues of the matrix $A$, one finds that there always exists an eigenvalue with positive real part. Hence the equilibrium is not stable.

Next we turn to the cases where the equilibria lie on the boundary. In these cases, we need to put the projection function back. Consider $(\alpha^*, \beta^*) = (0, 0)$; according to the stated conditions, we have $w_i(c - b) \le 2(d - b)$ for $i = 1, 2$. Combined with the unconstrained update differential equations (12), we have $\dot{w}_1 = \dot{w}_2 = 0$ (since $\alpha = \beta$), so $w_1$ and $w_2$ remain unchanged. Moreover, the unconstrained gradients of $\alpha$ and $\beta$ are non-positive at this point, so after projection $\alpha$ and $\beta$ stay at $0$; hence the point is an equilibrium. Because the gradients are continuous, there exists a $\delta > 0$ such that, for any starting point within distance $\delta$ of the equilibrium, the gradients of $\alpha$ and $\beta$ remain non-positive; thus $\alpha$ and $\beta$ stabilize at $0$, the social attitudes stay close to their initial values, and the equilibrium is stable.

The case $(\alpha^*, \beta^*) = (1, 1)$ can be proved similarly, which is omitted here.

For the second type, consider the case $(\alpha^*, \beta^*) = (1, 0)$. If it is an equilibrium, then by Equations (12) $\dot{w}_1 = \tfrac{\lambda}{2}(b - c) \ne 0$ whenever $b \ne c$, which means $w_1$ keeps changing until it reaches the boundary $w_1 = 0$ or $w_1 = 1$ (and similarly for $w_2$). Substituting these boundary values into Equations (12) yields the conditions under which the projected gradients keep $(\alpha, \beta)$ at $(1, 0)$. The other cases are analogous, and we omit them here.

The stability of the second type of equilibria can be proved in the same way as for the first type, which is omitted here.

From Theorem 4.3, we know that there are three types of equilibria when both players play the SA-IGA policy, and only the first and second types of equilibrium points are stable. Besides, all equilibria of the first two types are pure strategies, i.e., the probability of selecting action 1 for each agent equals $0$ or $1$. Notably, the ranges of the social attitude $w$ in these three types of equilibria may overlap, so the final convergence of the algorithm also depends on the initial value of $w$. Next we concentrate on the details of two representative symmetric games: the Prisoner's Dilemma (PD) game and the Symmetric Coordination game.

The Prisoner's Dilemma (PD) game is a symmetric game whose parameters satisfy the conditions $c > a > d > b$ (in the notation of Table 2, with action 1 corresponding to cooperation). Combined with Theorem 4.3, we have the following conclusion,

Corollary 1

The dynamics of the SA-IGA algorithm in the Prisoner's Dilemma (PD) game have two types of stable equilibrium points:

  1. $(\alpha^*, \beta^*) = (1, 1)$ (mutual cooperation), if $w_1$ and $w_2$ exceed the threshold $\frac{2(c - a)}{c - b}$;

  2. $(\alpha^*, \beta^*) = (0, 0)$ (mutual defection), if $w_1$ and $w_2$ are below the threshold $\frac{2(d - b)}{c - b}$.

Proof

: Because the PD game is a symmetric game, we can use the conclusions of Theorem 4.3 directly. From Theorem 4.3, the candidate stable equilibrium points are those of the first and second types listed there.

For the first type, substituting the PD conditions $c > a > d > b$ into the above formulas, we have: if $w_1, w_2 \ge \frac{2(c - a)}{c - b}$, then $(1, 1)$ is a stable equilibrium; and if $w_1, w_2 \le \frac{2(d - b)}{c - b}$, then $(0, 0)$ is a stable equilibrium.

For the second type, substituting the PD conditions shows that the required conditions conflict with each other, which means that no equilibrium of this type exists in the Prisoner's Dilemma (PD) game.

Intuitively, for a PD game, Corollary 1 tells us that if both SA-IGA players are initially sufficiently socially friendly (the value of $w$ is larger than a certain threshold), then they will always converge to mutual cooperation $(C, C)$. In other words, given that the value of $w$ exceeds the threshold, the strategy point $(\alpha, \beta) = (1, 1)$ in the strategy space is asymptotically stable. If both players start with a low socially-aware degree ($w$ smaller than a certain threshold), then they will always converge to mutual defection $(D, D)$. For the remaining cases, there exist an infinite number of equilibrium points in between the above two extremes, none of which is stable, which means that the learning dynamics will never converge to those equilibrium points.

Next we turn to analyze the dynamics of SA-IGA in the Coordination game. The general form of a Coordination game is shown in Table 3. From the table, we can see that the Coordination game is asymmetric if any of the following conditions is met: $R \ne r$, $P \ne p$, $S \ne s$, or $T \ne t$. We analyze a simplified case first, i.e., the Symmetric Coordination game; the general coordination game will be analyzed later. Similar to the analysis of Theorem 4.3, we have,

                          Agent 2's actions
                          action 1     action 2
Agent 1's     action 1    R/r          S/t
actions       action 2    T/s          P/p
(each cell shows agent 1's payoff / agent 2's payoff)
Table 3: The General Form of a Coordination Game (where $R > T, P > S$ and $r > t, p > s$)
Corollary 2

The dynamics of the SA-IGA algorithm in a symmetric coordination game have two types of stable equilibrium points:

  1. $(\alpha^*, \beta^*) = (1, 1)$, with $(w_1, w_2)$ in the admissible range given by the first condition of Theorem 4.3;

  2. $(\alpha^*, \beta^*) = (0, 0)$, with $(w_1, w_2)$ in the admissible range given by the first condition of Theorem 4.3.

Proof

: The proof follows the same lines as the proof of Theorem 4.3, and is thus omitted here.

Intuitively, for a Symmetric Coordination game, Corollary 2 shows that there are two types of stable equilibria when both players play the SA-IGA policy, which means the players will eventually converge to $(1, 1)$ or $(0, 0)$, i.e., the Nash equilibria of the Symmetric Coordination game. Besides, because the final convergence of the algorithm depends on the combined effect of the initial policies and the initial social attitudes, we cannot give a theoretical condition under which the algorithm converges to the socially optimal outcome of a Symmetric Coordination game. In fact, the experimental simulations in the following section show that SA-IGA has a high probability of converging to the socially optimal outcome.

Now we turn to the asymmetric case. As mentioned before, SA-IGA in an asymmetric game may have non-linear dynamics (when $u_r \ne u_c$), which causes great difficulties for theoretical analysis. For this reason, we only analyze the general Coordination game, which is a typical asymmetric game.

Theorem 4.4

The dynamics of the SA-IGA algorithm in a general coordination game have three types of equilibrium points:

  1. $(\alpha^*, \beta^*) = (1, 1)$, with $(w_1^*, w_2^*)$ lying in one of three admissible ranges corresponding to the cases $R > r$, $R < r$, and $R = r$;

  2. $(\alpha^*, \beta^*) = (0, 0)$, with $(w_1^*, w_2^*)$ lying in one of three admissible ranges corresponding to the cases $P > p$, $P < p$, and $P = p$;

  3. all other (interior) equilibrium points.

The first and second types of equilibrium points are stable, while the non-boundary equilibrium points of the last type are not.

Proof

: Following the system of differential equations in Equation (11), we can express the dynamics of SA-IGA in the coordination game as follows:

(13)
$\dot{\alpha} = \eta\Big[\big(1 - \tfrac{w_1}{2}\big)\big(u_r\,\beta + S - P\big) + \tfrac{w_1}{2}\big(u_c\,\beta + t - p\big)\Big],$
$\dot{\beta} = \eta\Big[\big(1 - \tfrac{w_2}{2}\big)\big(u_c\,\alpha + s - p\big) + \tfrac{w_2}{2}\big(u_r\,\alpha + T - P\big)\Big],$
$\dot{w}_1 = \tfrac{\lambda}{2}\big(V_1(\alpha, \beta) - V_2(\alpha, \beta)\big), \qquad \dot{w}_2 = -\dot{w}_1,$

where $u_r = R + P - S - T$, $u_c = r + p - t - s$, and the payoff parameters are as given in Table 3. We can see that the dynamics of the coordination game are non-linear when $u_r \ne u_c$. We start by proving the last type of equilibrium points first:

If there exists an interior equilibrium $(\alpha^*, \beta^*, w_1^*, w_2^*)$, then $\dot{\alpha} = \dot{\beta} = 0$ and $\dot{w}_1 = \dot{w}_2 = 0$ at that point. Linearizing the unconstrained update differential equations around this point in the form $\dot{x} = Jx$, with $x = (\alpha, \beta, w_1, w_2)$, gives a Jacobian $J$ whose entries are functions of the payoff parameters and of $(\alpha^*, \beta^*, w_1^*, w_2^*)$.

Computing the eigenvalues of $J$ (numerically, e.g., in Matlab) shows that there always exists an eigenvalue with positive real part; hence the interior equilibrium is not stable shilnikov2001methods .

Next we turn to prove the first type of equilibria. In this case, we need to put the projection function back since we are dealing with boundary cases.

For the case $(\alpha^*, \beta^*) = (1, 1)$ with $(w_1^*, w_2^*)$ in the corresponding admissible range, the unconstrained gradients of $\alpha$ and $\beta$ are non-negative at this point, so after projection $\alpha$ and $\beta$ remain at $1$; likewise, the projected dynamics of $w_1$ and $w_2$ keep them within their admissible range. According to the continuity theorem of differential equations Coddington1955Theory , $(1, 1)$ is a stable equilibrium. The remaining sub-cases can be proved similarly and are omitted here. The stability of the second type of equilibrium points can also be proved similarly, which is omitted here.

From Theorem 4.4, we find that the conclusion of Corollary 2 is a special case of Theorem 4.4; this can be verified by substituting the symmetry conditions into Theorem 4.4.

5 A Practical Algorithm

In SA-IGA, each agent needs to know the policy of others and the payoff function, which are usually not available before a repeated game starts. Based on the idea of SA-IGA, we relax the above assumptions and propose a practical multiagent learning algorithm called Socially-Aware Policy Gradient Ascent (SA-PGA). The overall flow of SA-PGA is shown in Algorithm 2. In SA-PGA, each agent only needs to observe the payoffs of both agents by the end of each round.

1:  Let $\delta_q$, $\delta_\pi$ and $\delta_w$ be learning rates.
2:  Initialize $Q_i(a) \leftarrow 0$, $Q_s(a) \leftarrow 0$, $\pi(a) \leftarrow 1/|A_i|$ for all $a \in A_i$, and $w \leftarrow w_0$.
3:  repeat
4:     Select action $a$ as in Step 4 of Algorithm 1 (PHC).
5:     Observe the own reward $r_i$ and the average reward $r_s$ of all agents. Update $Q_i(a) \leftarrow (1 - \delta_q)\,Q_i(a) + \delta_q\, r_i$, $Q_s(a) \leftarrow (1 - \delta_q)\,Q_s(a) + \delta_q\, r_s$, and $Q(a) \leftarrow (1 - w)\,Q_i(a) + w\,Q_s(a)$.
6:     Update $\pi$ according to the gradient ascent strategy on $Q$, as in Step 6 of Algorithm 1 (PHC).
7:     Update $w \leftarrow \Pi_{[0,1]}\big(w + \delta_w(\hat{V}_i - \hat{V}_i^s)\big)$, where $\hat{V}_i = \sum_{a} \pi(a)\,Q_i(a)$ and $\hat{V}_i^s = \sum_{a} \pi(a)\,Q_s(a)$.
8:  until the repeated game ends
Algorithm 2 SA-PGA for player $i$

In SA-IGA, agent $i$'s policy (the probability of selecting each action) is updated based on the partial derivative of the overall expected value $V_i^o$, while the social attitude $w_i$ is adjusted according to the relative value of $V_i$ and $V_i^s$. Here, in SA-PGA, we first estimate the values of $V_i$ and $V_i^s$ using Q-values, which are updated based on the immediate payoffs received during repeated interactions. Specifically, each agent keeps a record of the Q-value of each action for both its own reward and the average reward of all agents ($Q_i$ and $Q_s$) (Step 5). Both Q-values are updated following the Q-learning update rule by the end of each round (Step 5). Then the overall Q-value of each action is calculated as the weighted average of $Q_i$ and $Q_s$, weighted by the agent's social attitude $w$ (Step 5). The policy update strategy is the same as in Step 6 of Algorithm 1. Finally, the social attitude of agent $i$ is updated in Step 7: the values of $V_i$ and $V_i^s$ are estimated from the current policy and the Q-values, and the updating direction of $w_i$ is given by the difference between these estimates. Note that an SA-PGA player only needs to know its own reward and the average reward of all agents in each interaction. Knowing the average reward of a group is a reasonable assumption in many realistic scenarios, such as elections and voting.
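A compact Python sketch of an SA-PGA player following Algorithm 2 is given below; the class name, default learning rates, and epsilon-greedy exploration are our own illustrative choices.

import random

class SAPGAAgent:
    def __init__(self, n_actions, w0=0.9, alpha=0.1, delta=0.01,
                 delta_w=0.01, epsilon=0.05):
        self.n = n_actions
        self.alpha, self.delta, self.delta_w = alpha, delta, delta_w
        self.epsilon = epsilon
        self.w = w0                                  # social attitude
        self.Qi = [0.0] * n_actions                  # own-reward Q-values
        self.Qs = [0.0] * n_actions                  # average-reward Q-values
        self.pi = [1.0 / n_actions] * n_actions

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(self.n)
        return random.choices(range(self.n), weights=self.pi)[0]

    def update(self, a, own_reward, avg_reward):
        # Step 5: Q-learning estimates of the individual and social payoffs.
        self.Qi[a] += self.alpha * (own_reward - self.Qi[a])
        self.Qs[a] += self.alpha * (avg_reward - self.Qs[a])
        Q = [(1 - self.w) * qi + self.w * qs for qi, qs in zip(self.Qi, self.Qs)]
        # Step 6: PHC-style hill-climbing on the blended Q-values.
        greedy = max(range(self.n), key=lambda x: Q[x])
        for b in range(self.n):
            self.pi[b] += self.delta if b == greedy else -self.delta / (self.n - 1)
        self.pi = [max(0.0, p) for p in self.pi]
        s = sum(self.pi)
        self.pi = [p / s for p in self.pi]
        # Step 7: adjust the social attitude by the estimated V_i - V_i^s.
        Vi = sum(p * q for p, q in zip(self.pi, self.Qi))
        Vs = sum(p * q for p, q in zip(self.pi, self.Qs))
        self.w = min(1.0, max(0.0, self.w + self.delta_w * (Vi - Vs)))

# Example: two SA-PGA agents repeatedly playing the PD game of Table 1.
payoff1 = [[3, 0], [5, 1]]
payoff2 = [[3, 5], [0, 1]]
p1, p2 = SAPGAAgent(2), SAPGAAgent(2)
for _ in range(20000):
    a1, a2 = p1.act(), p2.act()
    r1, r2 = payoff1[a1][a2], payoff2[a1][a2]
    avg = (r1 + r2) / 2
    p1.update(a1, r1, avg)
    p2.update(a2, r2, avg)
print(p1.pi, p2.pi, p1.w, p2.w)   # with a high initial w, the agents typically settle on cooperation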

6 Experimental Evaluation

This section is divided into three parts. Subsection 6.1 compares SA-IGA and SA-PGA through simulation in different types of two-player, two-action, general-sum games. Subsection 6.2 presents the experimental results for the 2x2 benchmark games, specifically the performance in terms of converging to socially optimal outcomes and of playing against selfish agents. Subsection 6.3 presents the experimental results for games with more than two agents, i.e., the public goods game Andreoni1998Partners .

6.1 Simulation comparison of SA-IGA and SA-PGA

We start the performance evaluation by analyzing the learning performance of SA-PGA in two-player two-action repeated games. In general, a two-player two-action game can be classified into three categories tuyls2006evolutionary :

  1. $\Delta^1_1 \Delta^1_2 > 0$ and $\Delta^2_1 \Delta^2_2 > 0$. In this case, each player has a dominant strategy and thus the game only has one pure strategy NE.

  2. $\Delta^1_1 \Delta^1_2 < 0$, $\Delta^2_1 \Delta^2_2 < 0$, and $\Delta^1_1 \Delta^2_1 > 0$. In this case, there are two pure strategy NEs and one mixed strategy NE.

  3. $\Delta^1_1 \Delta^1_2 < 0$, $\Delta^2_1 \Delta^2_2 < 0$, and $\Delta^1_1 \Delta^2_1 < 0$. In this case, there only exists one mixed strategy NE.

where $u^i_{jk}$ is the payoff of player $i$ when player $i$ takes action $j$ while its opponent takes action $k$, and $\Delta^i_k = u^i_{1k} - u^i_{2k}$. We select one representative game from each category for illustration (a simple classification helper is sketched below).
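The following helper (ours, for illustration; it assumes the category conditions as stated above) classifies a two-player two-action game into the three categories from the two payoff matrices.

def classify_2x2(U1, U2):
    """Classify a 2x2 game; U_i[j][k] is player i's payoff when it plays
    action j and its opponent plays action k (0-indexed)."""
    # Gain from playing action 0 instead of action 1 when the opponent plays k.
    d1 = [U1[0][k] - U1[1][k] for k in range(2)]
    d2 = [U2[0][k] - U2[1][k] for k in range(2)]
    if d1[0] * d1[1] > 0 and d2[0] * d2[1] > 0:
        return 1   # each player has a dominant strategy: one pure NE
    if d1[0] * d1[1] < 0 and d2[0] * d2[1] < 0 and d1[0] * d2[0] > 0:
        return 2   # two pure NEs and one mixed NE
    if d1[0] * d1[1] < 0 and d2[0] * d2[1] < 0 and d1[0] * d2[0] < 0:
        return 3   # only one mixed NE
    return None    # boundary/degenerate case

# PD from Table 1 (each player's payoffs indexed by its own action first):
print(classify_2x2([[3, 0], [5, 1]], [[3, 0], [5, 1]]))   # -> 1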

6.1.1 Category 1

For category 1, we consider the PD game as shown in Table 1. In this game, both players have one dominant strategy D, and (D, D) is the only pure strategy NE, while there also exists one socially optimal outcome (C, C) under which both players can obtain higher payoffs.

Figure 1: The learning dynamics of SA-IGA and SA-PGA in the PD game: (a) SA-PGA in the PD game; (b) SA-IGA in the PD game.

Figure 1(a) shows the learning dynamics of the practical SA-PGA algorithm playing the PD game. The x-axis represents player 1's probability of playing action C and the y-axis represents player 2's probability of playing action C. We randomly selected 20 initial policy points as the starting points for the SA-PGA agents. We can observe that the SA-PGA agents are able to converge to the mutual cooperation equilibrium point (C, C) starting from different initial policies.

Figure 1(b) illustrates the learning dynamics predicted by the theoretical SA-IGA approach. Similar to the setting of Figure 1(a), the same set of initial policy points is selected and we plot all the learning curves accordingly. We can see that, for each starting policy point, the learning dynamics predicted by the theoretical SA-IGA model are well consistent with the learning curves obtained from simulation. This indicates that we can better understand and predict the dynamics of the SA-PGA algorithm using its corresponding theoretical SA-IGA model.

6.1.2 Category 2

For category 2, we consider the Coordination Game (CG) shown in Table 4. In this game, there exist two pure strategy Nash equilibria, (C, C) and (D, D), and both of them are also socially optimal.

                          Agent 2's actions
                          C          D
Agent 1's     C           3/4        0/0
actions
(each cell shows agent 1's payoff / agent 2's payoff)