Shapley Q-value: A Local Reward Approach to Solve Global Reward Games

07/11/2019 ∙ by Jianhong Wang, et al. ∙ Imperial College London

Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). The global reward game is a subclass of cooperative games in which all agents aim to maximize cumulative global rewards. Credit assignment is an important problem studied in the global reward game. Most works adopt the non-cooperative-game theoretical framework with the shared reward approach, i.e., each agent is assigned the shared global reward directly. This, however, may give each agent inaccurate feedback on its contribution to the group. In this paper, we introduce a cooperative-game theoretical framework and extend it to the infinite-horizon case. We show that our proposed framework is a superset of the global reward game. Based on this framework, we propose a local reward approach called Shapley Q-value that distributes the cumulative global rewards fairly, reflecting each agent's own contribution, in contrast to the shared reward approach. Moreover, we derive an MARL algorithm called Shapley Q-value policy gradient (SQPG), using the Shapley Q-value as the critic. We evaluate SQPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction, compared with MADDPG, COMA, Independent A2C and Independent DDPG. In the experiments, SQPG shows better performance than the baselines. In addition, we plot the Shapley Q-value and validate its property of fairly distributing the global rewards.

1 Introduction

Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). Many real-life tasks can be modeled as cooperative games, e.g., the coordination of autonomous vehicles [14], autonomous distributed logistics [32] and search-and-rescue robots [17, 31]. The global reward game [5] is a subclass of cooperative games where agents aim to maximize cumulative global rewards. In a global reward game, credit assignment is an important problem, which aims at finding a method to distribute the global rewards. There are two categories of approaches to this problem, namely the shared reward approach (also known as the shared reward game or fully cooperative game) [37, 26, 15] and the local reward approach [28]. The shared reward approach directly assigns the global rewards to all agents. The local reward approach, on the other hand, distributes the global rewards according to each agent's contribution, and turns out to yield superior performance on many tasks [10, 25].

Whatever approach is adopted, a remaining open question is whether there is an underlying theory that explains credit assignment. Conventionally, a global reward game is built upon non-cooperative game theory, which primarily aims to find a Nash equilibrium as the stable solution [27, 3]. This formulation can be extended to a dynamic environment with infinite horizons via stochastic games [34]. However, a Nash equilibrium focuses on individual rewards and provides no explicit incentive for cooperation. As a result, a shared reward function has to be utilized to force cooperation, which can serve as a possible explanation for the shared reward approach, but not for the local reward approach.

In our work, we introduce and investigate cooperative game theory (or coalitional game theory) [4], in which the local reward approach becomes rationalized. In cooperative game theory, the objective is to divide the agents into coalitions and bind agreements among agents who belong to the same coalition. We focus on the convex game (CG), a typical game in cooperative game theory featuring the existence of a stable coalition structure with an efficient payoff distribution scheme, i.e., a local reward approach, called the core. This payoff distribution is equivalent to credit assignment, and thereby the core rationalizes and explains the local reward approach [30].

Referring to the concepts of stochastic games [34], we extend the CG to infinite-horizon scenarios, namely the extended CG (ECG). In addition, we show that a global reward game is equivalent to an ECG with a grand coalition and an efficient payoff distribution scheme. Furthermore, we propose the Shapley Q-value (i.e., an efficient payoff distribution scheme), extending the Shapley value for credit assignment in an ECG with a grand coalition. Therefore, it is apparent that the Shapley Q-value works in a global reward game. Finally, we derive an algorithm called Shapley Q-value policy gradient (SQPG) according to the actor-critic framework [18] to learn decentralized policies with centralized critics (i.e., Shapley Q-values). SQPG is evaluated on the environments Cooperative Navigation, Prey-and-Predator [20] and Traffic Junction [37], compared with the baselines MADDPG [20], COMA [10], Independent DDPG [19] and Independent A2C [38].

2 Related Work

2.1 Multi-agent Learning

Multi-agent learning refers to a category of methods that tackle games with multiple agents, such as cooperative games. Among these methods, we focus only on using reinforcement learning to deal with a cooperative game, which is called multi-agent reinforcement learning (MARL). Remarkable progress has recently been made in MARL. Some researchers [37, 15, 7] focus on distributed execution, which allows communication among agents. Others [5, 10, 25, 20, 12] consider decentralized execution, where no communication is permitted during execution. Nevertheless, all of them rely on a centralized critic, which means information can be shared through the value function during training. In our work, we pay attention to decentralized execution with a centralized critic.

2.2 Cooperative Game

As opposed to competing with others, agents in a cooperative game aim to cooperate to solve a joint task or maximize global payoffs [4]. Shapley [34] proposed a non-cooperative game theoretical framework called the stochastic game, which models the dynamics of multiple agents in a zero-sum game with infinite horizons. Hu et al. [11] introduced a general-sum stochastic game framework, which includes the zero-sum game as a special case. To force cooperation under this framework, a Potential function [22] is applied such that each agent shares the same objective, yielding a global reward game [5]. In this paper, we use cooperative game theory, whereas the existing cooperative game frameworks are built on non-cooperative game theory. Our framework gives a new view on global reward games and explains why credit assignment is important. We show that the global reward game is a subclass of our framework if the agents in a global reward game are interpreted as forming a grand coalition. Under our framework, it is more rational to use a local reward approach to distribute the global rewards.

2.3 Credit Assignment

Credit assignment is a significant problem that has been studied in cooperative games for a long time. There are two sorts of credit assignment schemes, i.e., the shared reward approach and the local reward approach. The shared reward approach directly assigns each agent the global reward [37, 15, 7, 20]. We show that this is actually equivalent to distributing the global reward equally among the individual agents. A global reward game with this credit assignment scheme is also called a shared reward game (or fully cooperative game) [28]. However, Wolpert and Tumer [41] claimed that the shared reward approach does not give each agent its accurate contribution, so it may not perform well on difficult problems. This motivates the study of the local reward approach, which distributes the global reward to agents according to their contributions. The remaining question is how to quantify the contributions. To answer this question, Chang et al. [5] attempted to use a Kalman filter to infer the contribution of each agent. Recently, Foerster et al. [10] and Nguyen et al. [25] modelled marginal contributions inspired by the reward difference [41]. Under our proposed framework, we propose a new algorithm to learn a local reward called the Shapley value [33], which is guaranteed to distribute the global reward fairly. Although the Shapley value can be regarded as the expectation of marginal contributions, it differs from the previous work: it considers all possible orders in which agents form a grand coalition, which has not been addressed in any of the prior work aforementioned.

3 Background

3.1 Non-cooperative Game Theory

To clarify why we are interested in the framework of cooperative game theory, let us begin with non-cooperative game theory. Non-cooperative game theory aims to solve problems in which agents are selfish and intelligent [27]. In other words, each agent merely considers maximizing his own rewards to approach a Nash equilibrium. Consequently, non-cooperative game theory is intuitively not suitable for modelling cooperative scenarios. One possible way to model cooperative scenarios is to construct a global reward function (which need not be convex), e.g., a Potential function [22], that replaces each individual reward function. As a result, each agent's objective is forced to be identical, i.e., a fully cooperative game. Even though each agent is still keen on maximizing his own rewards, coordination can be formed. Nonetheless, this approach has its limitations. Firstly, it lacks a clear explanation for credit assignment. Secondly, assigning credits to each agent equally may cause a slow learning rate [1, 2]. Thirdly, this framework is difficult to extend to more complex problems such as competition among different coalitions. Cooperative game theory (with transferable utility), which concentrates on credit assignment and dividing coalitions (or groups) [4], can solve these problems. This is why we introduce and investigate cooperative game theory.

3.2 Convex Game

Convex game (CG) is a typical transferable-utility game in cooperative game theory. The definitions below follow [4]. A CG is formally represented as ⟨N, v⟩, where N = {1, 2, ..., n} is the set of agents and v is the value function that measures the profits earned by a coalition C ⊆ N. N itself is called the grand coalition. The value function is a mapping from a coalition C to a real number v(C). In a CG, the value function satisfies two properties, i.e., 1) superadditivity: v(C ∪ D) ≥ v(C) + v(D) for any disjoint coalitions C, D ⊆ N; 2) the coalitions are independent. The solution of a CG is a tuple (CS, x), where CS = {C_1, ..., C_m} is a coalition structure and x = (x_1, ..., x_n) indicates the payoffs distributed to each agent, satisfying two conditions, i.e., 1) x_i ≥ 0 for every agent i ∈ N; 2) Σ_{i ∈ C} x_i ≤ v(C) for every C ∈ CS, where CS belongs to CS(N), the set of all possible coalition structures. The core is the stable solution set of a CG, which can be defined mathematically as core = { (CS, x) : Σ_{i ∈ C} x_i ≥ v(C), ∀ C ⊆ N }. The core of a CG ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL.
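To make the core condition concrete, here is a minimal Python sketch (not from the paper's codebase) that checks whether a payoff vector lies in the core of a small toy convex game; the three-agent value function is purely illustrative.

```python
from itertools import chain, combinations

def all_coalitions(agents):
    """Every non-empty subset of the agent set."""
    return chain.from_iterable(combinations(agents, r) for r in range(1, len(agents) + 1))

def in_core(payoff, v, agents):
    """x is in the core if no coalition C can earn more on its own than its
    members receive in total, i.e., sum_{i in C} x_i >= v(C) for every C."""
    return all(sum(payoff[i] for i in C) >= v(frozenset(C)) for C in all_coalitions(agents))

# Toy 3-agent convex game (illustrative only); both payoff vectors below are
# efficient, i.e., they sum to v(N) = 9.
values = {frozenset({1}): 0, frozenset({2}): 0, frozenset({3}): 0,
          frozenset({1, 2}): 4, frozenset({1, 3}): 4, frozenset({2, 3}): 4,
          frozenset({1, 2, 3}): 9}

def v(coalition):
    return values[coalition]

agents = [1, 2, 3]
print(in_core({1: 3, 2: 3, 3: 3}, v, agents))  # True: the equal split is stable
print(in_core({1: 7, 2: 1, 3: 1}, v, agents))  # False: coalition {2, 3} would deviate
```

The equal split of v(N) passes the check, while a vector that leaves coalition {2, 3} underpaid does not.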

3.3 Shapley Value

The Shapley value [33] is one of the most popular methods for solving the payoff distribution problem for a grand coalition [9, 21, 8]. Given a cooperative game ⟨N, v⟩, for any C ⊆ N\{i} let δ_i(C) = v(C ∪ {i}) − v(C); then the Shapley value of each agent i can be written as:

Sh_i(v) = Σ_{C ⊆ N\{i}} ( |C|! (|N| − |C| − 1)! / |N|! ) δ_i(C).   (1)

Literally, the Shapley value takes the average of the marginal contributions over all possible coalitions, so that it satisfies: 1) efficiency: Σ_{i ∈ N} Sh_i(v) = v(N); 2) fairness: if an agent i has no contribution, then Sh_i(v) = 0; if the i-th and j-th agents have the same contribution, then Sh_i(v) = Sh_j(v) [4]. As we can see from Eq.1, to calculate the Shapley value of an agent we have to consider 2^(|N|−1) possible coalitions that the agent could join to form the grand coalition, which causes a combinatorial explosion. We propose mitigating this issue in scenarios with infinite horizons.
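As a concrete illustration of Eq.1 and its cost, the sketch below computes the exact Shapley value by enumerating every coalition an agent could join; the toy value function is an assumption for illustration, not part of the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, v):
    """Exact Shapley value of Eq.1: weighted average of each agent's marginal
    contribution over every coalition it could join."""
    n = len(agents)
    shapley = {}
    for i in agents:
        others = [j for j in agents if j != i]
        total = 0.0
        for size in range(n):
            for coalition in combinations(others, size):
                c = frozenset(coalition)
                weight = factorial(len(c)) * factorial(n - len(c) - 1) / factorial(n)
                total += weight * (v(c | {i}) - v(c))   # weight * delta_i(C)
        shapley[i] = total
    return shapley

# Toy 3-agent value function (illustrative only): agents 1 and 2 are
# interchangeable and agent 3 contributes nothing.
def v(coalition):
    return 1.0 if {1, 2} <= coalition else 0.0

print(shapley_values([1, 2, 3], v))  # {1: 0.5, 2: 0.5, 3: 0.0}
```

Note that the inner loops enumerate 2^(|N|−1) coalitions per agent, which is exactly the combinatorial growth that motivates the approximation in Section 4.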

3.4 Multi-agent Actor-Critic

Different from value-based methods, i.e., Q-learning [39], policy gradient [40] directly learns the policy by maximizing J(θ) = E_{s∼ρ^π, a∼π_θ}[ r(s, a) ], where r(s, a) is the reward of an arbitrary state-action pair. Since the gradient of J(θ) w.r.t. θ cannot be calculated directly, the policy gradient theorem [38] is used to approximate the gradient such that ∇_θ J(θ) = E_{s∼ρ^π, a∼π_θ}[ ∇_θ log π_θ(a|s) Q^π(s, a) ]. In the actor-critic framework [18], based on the policy gradient theorem, π_θ(a|s) is called the actor and Q^π(s, a) is called the critic. Additionally, Q^π(s, a) = E[ Σ_{t=1}^∞ γ^{t-1} r_t | s_1 = s, a_1 = a ]. Extending to multi-agent scenarios, the gradient of each agent i can be represented as ∇_{θ_i} J(θ_i) = E[ ∇_{θ_i} log π_{θ_i}(a_i|s) Q_i^π(s, a) ]. Q_i^π(s, a) can be regarded as an estimation of the contribution of each agent i. If a deterministic policy [19] needs to be learned in MARL problems, we can reformulate the approximated gradient of each agent as ∇_{θ_i} J(θ_i) = E[ ∇_{θ_i} μ_{θ_i}(s) ∇_{a_i} Q_i^μ(s, a) |_{a_i = μ_{θ_i}(s)} ].
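The per-agent deterministic policy gradient described above can be sketched in PyTorch as follows; the network sizes, the batch of random states and the three-agent setting are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a 3-agent continuous-control problem.
n_agents, state_dim, act_dim = 3, 10, 2

# Decentralised actors mu_i(s) and centralised critics Q_i(s, a_1, ..., a_n).
actors = [nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
          for _ in range(n_agents)]
critics = [nn.Sequential(nn.Linear(state_dim + n_agents * act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
           for _ in range(n_agents)]

state = torch.randn(64, state_dim)              # a batch of global states
actions = [mu(state) for mu in actors]          # a_i = mu_i(s)

for i, (mu, q) in enumerate(zip(actors, critics)):
    # Deterministic policy gradient for agent i: ascend Q_i(s, a) w.r.t. theta_i
    # through a_i = mu_i(s), with the other agents' actions treated as constants.
    joint = [a if j == i else a.detach() for j, a in enumerate(actions)]
    actor_loss = -q(torch.cat([state] + joint, dim=-1)).mean()
    actor_loss.backward()                       # gradients reach actor i's parameters
    # (In a full algorithm only actor i's parameters would be stepped here;
    #  the critic is trained separately from a TD objective.)
```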

4 Our Work

In this section, we (i) extend the convex game (CG) to infinite horizons and decisions, namely the extended convex game (ECG), and show that a global reward game is equivalent to an ECG with a grand coalition and an efficient payoff distribution scheme; (ii) show that the shared reward approach is an efficient payoff distribution scheme in an ECG with a grand coalition; (iii) propose the Shapley Q-value by extending and approximating the Shapley value to distribute credits in a global reward game, because it can accelerate the convergence rate compared with the shared reward approach; and (iv) derive an MARL algorithm called Shapley Q-value policy gradient (SQPG), using the Shapley Q-value as each agent's critic.

4.1 Extended Convex Game

We now extend the CG to scenarios with infinite horizons and decisions, namely the extended CG (ECG). The set of joint actions of the agents is defined as A = A_1 × A_2 × ... × A_n, where A_i is the feasible action set of each agent i ∈ N. S is the set of possible states of the environment. The dynamics of the environment is defined by the transition probability Pr(s' | s, a), where s, s' ∈ S and a ∈ A. Inspired by Nash [24], we construct the ECG in two stages. In stage 1, an oracle arranges the coalition structure and contracts the cooperation agreements, i.e., the credit assigned to an agent for his optimal long-term contribution if he joins some coalition. We assume that this oracle can observe the whole environment and is familiar with each agent's features. In stage 2, after joining the allocated coalition, each agent makes decisions by a policy π_i(a_i | s) to maximize the social value of its coalition, so that the optimal social value of each coalition and the individual credit assignment x_i can be obtained, where i ∈ C and C ∈ CS. Mathematically, the optimal value of a coalition can be written as v(C) = max_{π_C} E[ Σ_{t=1}^∞ γ^{t-1} r_t(C) ], where γ ∈ (0, 1) is the discount factor and r_t(C) is the reward gained by coalition C at each time step. According to property (1) of the CG aforementioned, the formula v(C ∪ D) ≥ v(C) + v(D) holds for disjoint coalitions C and D. In this paper, we denote the joint policy of all agents as π and assume that each agent can observe the global state.

Lemma 1 (Shapley [35], Chalkiadakis et al. [4]).

1) Every convex game has a non-empty core. 2) If a solution (CS, x) is in the core of a characteristic function game ⟨N, v⟩ and the payoff distribution scheme x is efficient, then v(CS) ≥ v(CS') for every coalition structure CS' over N, where v(CS) = Σ_{C ∈ CS} v(C).

Theorem 1.

With an efficient payoff distribution scheme, for an extended convex game (ECG), one solution in the core must exist with the grand coalition N, and the objective becomes max_π Σ_{i ∈ N} x_i, which leads to the maximal social welfare, i.e., v({N}) ≥ v(CS) for every coalition structure CS.

Proof.

This theorem is proved based on Lemma 1. See Appendix for the complete proofs. ∎

Corollary 1.

For an extended convex game (ECG) with a grand coalition, the Shapley value must be in the core.

Proof.

Since the Shapley value must be in the core for a CG with a grand coalition [35] and an ECG conserves the properties of a CG, this statement holds. ∎

4.2 Comparing ECG with Global Reward Game

As seen from Theorem 1, with an efficient payoff distribution scheme, an ECG with a grand coalition is actually equivalent to a global reward game. Both of them aim to maximize the global value (i.e., the cumulative global rewards). Here, we assume that the agents in a global reward game are regarded as a grand coalition. Consequently, the local reward approach in the context of the ECG can be used in a global reward game.

4.3 Looking into the Shared Reward Approach from the View of ECG

The shared reward approach assigns each agent the global reward directly in a global reward game. Each agent i unilaterally maximizes the cumulative global rewards to seek his optimal policy such that

max_{π_i} E[ Σ_{t=1}^∞ γ^{t-1} r_t(s, a) ],   (2)

where r_t(s, a) is the global reward and γ ∈ (0, 1) is the discount factor [28]. If the objective of Eq.2 is multiplied by a normalizing factor 1/|N|, then the new optimization problem for each agent is equivalent to Eq.2. We can express it mathematically as

max_{π_i} E[ Σ_{t=1}^∞ γ^{t-1} (1/|N|) r_t(s, a) ].   (3)

Then, the credit assigned to each agent by the shared reward approach is actually r_t(s, a)/|N|, and the sum of all agents' credits is equal to the cumulative global rewards. This satisfies the condition of an efficient payoff distribution scheme. Therefore, we show that the shared reward approach is an efficient payoff distribution scheme in an ECG with a grand coalition, i.e., a global reward game. Nevertheless, from the view of the ECG, the shared reward approach cannot be guaranteed to find the optimal solution. By Corollary 1 and Theorem 1, we know that the Shapley value can theoretically promise convergence to the maximal global value. This is one of the reasons why we are interested in this local reward approach. To clarify the concepts mentioned before, we draw a Venn diagram shown in Fig.1.

Figure 1: Relationship between the concepts mentioned in this paper.

4.4 Shapley Q-value

Although the shared reward approach can solve a global reward game in practice, it has been shown that local reward approaches give faster convergence rates [1, 2]. For the two aforementioned reasons, we use the Shapley value, i.e., a local reward approach, for the credit assignment to each agent. Because v(C) represents the cumulative global rewards earned by a coalition C in an ECG, we can model it as a Q-value Q^{π_C}(s, a_C), where s represents the global state and a_C the joint action of coalition C. According to Eq.1, the Shapley Q-value of each agent i, denoted Q^Φ_i(s, a), can be written as

Φ_i(s, a_{C ∪ {i}}) = Q^{π_{C ∪ {i}}}(s, a_{C ∪ {i}}) − Q^{π_C}(s, a_C),   (4)
Q^Φ_i(s, a) = Σ_{C ⊆ N\{i}} ( |C|! (|N| − |C| − 1)! / |N|! ) Φ_i(s, a_{C ∪ {i}}).   (5)

4.5 Approximate Marginal Contribution

As seen from Eq.4, it is difficult and unstable to learn two Q-value functions (one representing the Q-value of coalition C ∪ {i} and the other the Q-value of coalition C) to estimate the marginal contribution (i.e., the difference between the two Q-values) for different coalitions. To mitigate this problem, we propose a method called Approximate Marginal Contribution (AMC) to directly estimate the marginal contribution of each agent to each coalition, i.e., Φ_i(s, a_{C ∪ {i}}).

In cooperative game theory, each agent is assumed to join the grand coalition sequentially. The coalition C ⊆ N\{i} in Eq.1, denoted as C_i, is interpreted as the existing coalition (which could be empty) that agent i randomly joins, with the subsequent agents completing the grand coalition [4]. According to this interpretation, we model a function Φ_i(s, a_{C_i ∪ {i}}) to approximate the marginal contribution directly such that

Φ_i(s, a_{C_i ∪ {i}}) ≈ Q^{π_{C_i ∪ {i}}}(s, a_{C_i ∪ {i}}) − Q^{π_{C_i}}(s, a_{C_i}),   (6)

where s ∈ S is the global state, C_i is the ordered coalition that agent i joins, and a_{C_i ∪ {i}} denotes the ordered joint action of C_i ∪ {i}. For example, if the order of a coalition C is (2, 1, 3), then a_C = (a_2, a_1, a_3). By such a formulation, we believe that the property of the marginal contribution (i.e., mapping from every possible combination of coalition and agent to a numerical value) can be maintained. Hence, it is reasonable for AMC to replace the actual marginal contribution. In practice, we represent a_{C_i ∪ {i}} by the concatenation of each agent's action vector. To keep the input size of Φ_i constant across coalitions, we fix the input as the concatenation of all agents' actions and mask the actions of the irrelevant agents (i.e., the agents not in the coalition) with zeros.
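A possible way to build the fixed-size masked input described above is sketched below; the function name and the list-of-tensors layout are assumptions, and the ordering bookkeeping of C_i is omitted for brevity.

```python
import torch

def masked_joint_action(actions, coalition, agent):
    """Fixed-size AMC input: concatenate every agent's action vector, masking
    the actions of agents outside coalition ∪ {agent} with zeros."""
    members = set(coalition) | {agent}
    masked = [a if j in members else torch.zeros_like(a)
              for j, a in enumerate(actions)]
    return torch.cat(masked, dim=-1)

# Example: 4 agents with 2-d actions; agent 3 evaluates its marginal
# contribution to the coalition {1, 2}.
acts = [torch.full((2,), float(i)) for i in range(4)]
print(masked_joint_action(acts, coalition=[1, 2], agent=3))
# tensor([0., 0., 1., 1., 2., 2., 3., 3.])
```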

4.6 Approximate Shapley Q-value

Following the interpretation above, the Shapley Q-value can be rewritten as

Q^Φ_i(s, a) = E_{C_i} [ Φ_i(s, a_{C_i ∪ {i}}) ],   (7)

where the expectation is taken over the ordered coalitions C_i induced by uniformly random orders of joining the grand coalition. To make Eq.7 tractable in practice, we sample M ordered coalitions. Combined with AMC, i.e., Φ_i in Eq.6, we can write the approximate Shapley Q-value (ASQ) as

Q̂^Φ_i(s, a) = (1/M) Σ_{k=1}^{M} Φ_i(s, a_{C_i^k ∪ {i}}).   (8)
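Eq.8 can then be realized by averaging AMC outputs over M randomly sampled join orders, as in the sketch below; amc_critic is a hypothetical module mapping the concatenated state and masked joint action to a scalar, and masked_joint_action refers to the sketch in Section 4.5.

```python
import random
import torch

def approx_shapley_q(state, actions, agent, amc_critic, n_agents, n_samples):
    """Approximate Shapley Q-value (Eq.8): average the AMC marginal-contribution
    estimates over M ordered coalitions drawn from random join orders."""
    estimates = []
    for _ in range(n_samples):
        order = list(range(n_agents))
        random.shuffle(order)                    # a random order of joining the grand coalition
        coalition = order[:order.index(agent)]   # the agents that joined before `agent`
        x = torch.cat([state, masked_joint_action(actions, coalition, agent)], dim=-1)
        estimates.append(amc_critic(x))          # Phi_i(s, a_{C_i ∪ {i}})
    return torch.stack(estimates).mean(dim=0)
```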

4.7 Shapley Q-value Policy Gradient

In an ECG with a grand coalition, under an efficient payoff distribution scheme each agent only needs to maximize his own credit so that the maximal global value max_π Σ_{i ∈ N} x_i can be achieved, i.e.,

max_{π_i} x_i,  ∀ i ∈ N.   (9)

Therefore, if we show that max_{π_i} x_i is approached for each agent i, then the maximal global value can be met. Now, the problem reduces to solving max_{π_i} x_i for each agent i. As mentioned before, a global reward game is identical to a potential game (i.e., a game for which a Potential function exists [22]). Additionally, Monderer and Shapley [22] showed that in a potential game there exists a pure Nash equilibrium, i.e., a deterministic optimal policy. For these reasons, we apply the deterministic policy gradient (DPG) [36] to search for a deterministic optimal policy. If we substitute the Shapley Q-value for Q_i^μ(s, a) in DPG, we can directly write the policy gradient of each agent as

∇_{θ_i} J(θ_i) = E[ ∇_{θ_i} μ_{θ_i}(s) ∇_{a_i} Q̂^Φ_i(s, a) |_{a_i = μ_{θ_i}(s)} ],   (10)

where Q̂^Φ_i(s, a) is the ASQ of agent i and μ_{θ_i} is agent i's deterministic policy, parameterized by θ_i. A global reward is received at each time step in a global reward game. Since each ASQ is correlated with the cumulative local rewards, we cannot update each Q̂^Φ_i directly by the global reward. However, thanks to the property of efficiency, we can solve it by the minimization problem

min_ψ E[ ( r + γ Σ_{i ∈ N} Q̂^Φ_i(s', a'; ψ_i) − Σ_{i ∈ N} Q̂^Φ_i(s, a; ψ_i) )^2 ],   (11)

where r is the global reward received from the environment at each time step and Σ_{i ∈ N} Q̂^Φ_i (i.e., a linear combination of the Q̂^Φ_i) is parameterized by ψ = (ψ_1, ..., ψ_n). Constrained by this objective function, the approximate Shapley Q-value satisfies the property of efficiency. Accordingly, the condition of the efficient payoff distribution scheme stated in Theorem 1 is guaranteed. Because the Shapley Q-value takes all agents' actions and the state as input, our algorithm actually uses a centralized critic. Nevertheless, the policies are decentralized in execution.
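One way the efficiency constraint of Eq.11 could be written as a PyTorch loss is sketched below; the function name, the done flag and the use of a detached bootstrap target are assumptions rather than details taken from the paper.

```python
import torch

def efficiency_td_loss(global_reward, gamma, asq_now, asq_next, done):
    """TD regression on the SUM of the agents' approximate Shapley Q-values (Eq.11):
    the global reward plus the discounted next-step sum is the target for the
    current-step sum, which ties the sum of credits to the cumulative global reward.

    asq_now, asq_next: tensors of shape (batch, n_agents) with each agent's ASQ."""
    target = global_reward + gamma * (1.0 - done) * asq_next.sum(dim=-1).detach()
    return (asq_now.sum(dim=-1) - target).pow(2).mean()

# Toy usage with random tensors standing in for the critics' outputs.
asq_now = torch.randn(32, 3, requires_grad=True)
loss = efficiency_td_loss(torch.randn(32), 0.99, asq_now, torch.randn(32, 3), torch.zeros(32))
loss.backward()
```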

Silver et al. [36] showed that DPG has the familiar machinery of policy gradient. Besides, Sutton et al. [38] emphasized that with a small learning rate, a policy gradient algorithm can converge to a local optimum. Consequently, we can conclude that with a small learning rate, each agent can find a local maximizer of its credit and the global value converges to a local maximum. The convexity of the objective cannot be guaranteed in applications, so the global maximum stated in Theorem 1 may not always be fulfilled. Since our algorithm aims to find optimal policies by Shapley Q-values, we call it Shapley Q-value policy gradient (SQPG).

4.8 Implementation

In implementation, for the sake of a better approximation of policy gradients by off-policy learning and the powerful function approximation of deep neural networks, we use the deep deterministic policy gradient (DDPG) method [19]. Additionally, we apply the reparameterization technique called the gumbel-softmax trick [13] to deal with discrete action spaces. The pseudo code of the SQPG algorithm is given in Appendix A.
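A minimal sketch of the gumbel-softmax (straight-through) trick for discrete actions, using PyTorch's built-in F.gumbel_softmax; the logits and the dummy critic score below are placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)   # 4 agents, 5 discrete actions (placeholder policy outputs)

# Straight-through Gumbel-Softmax: the forward pass emits a one-hot action that can be
# executed in the environment, while gradients flow through the underlying soft sample,
# so the critic's gradient w.r.t. the action reaches the policy parameters.
action = F.gumbel_softmax(logits, tau=1.0, hard=True)
score = (action * torch.randn(4, 5)).sum()       # dummy differentiable "critic" score
score.backward()
print(action)                                    # one-hot rows
print(logits.grad.shape)                         # torch.Size([4, 5])
```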

5 Experiments

We evaluate SQPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction. The environments of Cooperative Navigation and Prey-and-Predator are from Mordatch and Abbeel [23], and Traffic Junction is from Sukhbaatar et al. [37]. In the experiments, we compare SQPG with two independent algorithms (with decentralised critics), i.e., Independent DDPG [19] and Independent A2C [38], as well as two methods with centralised critics, i.e., MADDPG [20] and COMA [10]. The policy and critic networks of all MARL algorithms are parameterized by MLPs, except for COMA, whose policy is parameterized by a GRU [6] following the original paper. The details of the experimental setups are given in Appendix C. All models are trained by the Adam optimizer [16] with the same learning rate for each task. The code for the experiments is published at: https://github.com/hsvgbkhgbv/multi-agent-rl.

5.1 Cooperative Navigation

5.1.1 Environment Settings

In this task, there are 3 agents and 3 targets. Each agent aims to move to a target, with no prior allocation of targets to the agents. The state of each agent includes its current position and velocity, the displacements to the three targets, and the displacements to the other agents. The action space of each agent is move_up, move_down, move_right, move_left and stay. The global reward of this environment is the negative sum of the distances between each target and the nearest agent to it. Besides, if a collision happens, the global reward is reduced by 1.

5.1.2 Results

As seen from Fig.2(a), SQPG with different sample sizes (i.e., M in Eq.8) outperforms the baselines, in terms of both the final convergence performance and the convergence rate. As the sample size grows, the approximate Shapley Q-value estimate of Eq.7 becomes more accurate and converges more easily to the optimal value. This explains why the convergence rate of SQPG becomes faster as the sample size increases. Moreover, our result supports the previous argument that the local reward approach converges faster than the global reward approach [1, 2]. Since SQPG with a sample size of 1 finally obtains nearly the same performance as the other variants, we run only this variant in the rest of the experiments to reduce the computational cost.

(a) Mean rewards per episode during training in Cooperative Navigation.
(b) Turns to capture prey per episode during training in Prey-and-Predator.
Figure 2: Training performances in Cooperative Navigation and Prey-and-Predator.

5.2 Prey-and-Predator

5.2.1 Environment Settings

In this task, we control only the three predators, and the prey is a random agent. The aim of the predators is to coordinate to capture the prey in as few steps as possible. The state of each predator contains its current position and velocity, the respective displacements to the prey and the other predators, and the velocity of the prey. The action space is the same as that defined in Cooperative Navigation. The global reward is the negative minimal distance between any predator and the prey. In addition, if the prey is caught by any predator, the global reward is increased by 10 and the game terminates.

5.2.2 Results

As Fig.2(b) shows, SQPG with a sample size of 1 leads the board, finally needing about 30 turns to capture the prey, followed by MADDPG and Independent DDPG. To study and understand the credit assignment, we visualize the Q-values of each MARL algorithm on one randomly selected trajectory of states and actions from an expert policy. For convenient visualization, we normalize the Q-values by min-max normalization [29] for each MARL algorithm. We can see from Fig.3 that the credit assignment of SQPG is more explainable than that of the baselines. Specifically, it is intuitive that the credit assigned to each agent by SQPG is negatively correlated with its distance to the prey. However, the other MARL algorithms do not explicitly display such a property. To confirm this intuition, we also evaluate it quantitatively through the negative Pearson correlation coefficient (NPCC) on 1000 randomly selected transition samples: the greater the value, the more negatively correlated the credit assignment and the distance are. As Tab.1 shows, SQPG expresses the negative correlation significantly, with an NPCC of 0.3403 and a two-tailed p-value of 1.4423e-9. The contribution distribution of the Shapley Q-value is fair, since a predator closer to the prey is more likely to catch it, and thus its contribution is more significant.

Figure 3: Credit assignment to each predator according to a fixed trajectory. The leftmost figure records a trajectory sampled by an expert policy. The square represents the initial position whereas the circle indicates the final position of each agent. The dots on the trajectory indicates each agent’s temporary positions. The other figures show the normalized credit assignments generated by different MARL algorithms according to this trajectory.
Independent A2C Independent DDPG COMA MADDPG SQPG
negative coefficient -0.0441 -0.1880 -0.1398 0.1855 0.3403
two-tailed p-value 4.4694e-1 1.0676e-3 1.5381e-2 1.2496e-3 1.4423e-9
Table 1: Negative Pearson correlation coefficient between the credit assignment to each predator and its distance to the prey. This test is conducted by 1000 randomly selected episode samples.
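For reference, the NPCC and two-tailed p-value reported in Tab.1 can be computed with scipy as sketched below; the credit and distance arrays here are random placeholders rather than the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder arrays: per-transition credit assigned to a predator and its
# distance to the prey, collected over 1000 sampled transitions.
rng = np.random.default_rng(0)
credits = rng.random(1000)
distances = rng.random(1000)

r, p_two_tailed = pearsonr(credits, distances)
npcc = -r  # negative Pearson correlation coefficient, as reported in Tab.1
print(f"NPCC = {npcc:.4f}, two-tailed p-value = {p_two_tailed:.4e}")
```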
Difficulty Independent A2C Independent DDPG COMA MADDPG SQPG
Easy 65.01% 93.08% 93.01% 93.72% 93.26%
Medium 67.51% 84.16% 82.48% 87.92% 88.98%
Hard 60.89% 64.99% 85.33% 84.21% 87.04%
Table 2: Success rate on Traffic Junction, tested with 20, 40, and 60 steps per episode in easy, medium and hard versions respectively. The results are obtained by running each algorithm after training for 1000 episodes.

5.3 Traffic Junction

5.3.1 Environment Settings

In this task, cars move along predefined routes which intersect at one or more traffic junctions. At each time step, new cars are added to the environment with probability p_arrive, and the total number of cars is limited below N_max. After a car finishes its mission, it is removed from the environment and possibly sampled back onto a new route. Each car has a limited vision of 1, which means it can only observe the circumstances within the 3x3 region surrounding it. No communication between cars is permitted in our experiment, in contrast to other experiments on the same task [37, 7]. The action space of each car is gas and brake, and the global reward function penalizes each car in proportion to the number of time steps it has been continuously active on the road in one mission, summed over all cars. Additionally, if a collision occurs, the global reward is reduced by 10. We evaluate the performance by the success rate, i.e., the proportion of episodes in which no collision happens.

5.3.2 Results

We compare our method with the baselines on the easy, medium and hard versions of Traffic Junction. The easy version consists of one traffic junction of two one-way roads on a 7x7 grid with p_arrive = 0.3 and N_max = 5. The medium version consists of one traffic junction of two-way roads on a 14x14 grid with p_arrive = 0.2 and N_max = 10. The hard version consists of four connected traffic junctions of two-way roads on an 18x18 grid with p_arrive = 0.05 and N_max = 20. From Tab.2, we can see that on the easy version all algorithms except Independent A2C achieve a success rate over 93%, since this scenario is quite easy. On the medium and hard versions, SQPG outperforms the other baselines, with success rates of 88.98% and 87.04% respectively. Moreover, the performance of SQPG significantly exceeds that of the no-communication algorithms reported in [7]. This demonstrates that SQPG can scale to larger problems.

5.4 Discussion

In the experimental results, it is surprising that Independent DDPG achieves good performance. A possible reason could be that a potential game (i.e., a global reward game) can be solved by fictitious play [22], and DDPG is analogous to it, finding an optimal deterministic policy by attempting to fit the other agents' behaviours. However, the convergence rate is not guaranteed when the number of agents becomes large, as shown by the results on the hard version of Traffic Junction. The poor performance of COMA could be due to its comparatively complicated model, which makes convergence in continuous control problems, e.g., Cooperative Navigation and Prey-and-Predator, difficult. To deal with both competitive and cooperative games, MADDPG assigns each agent a centralised critic to estimate the global value. Theoretically the credits assigned to the agents should be identical, though this is not always observed in the experiments. A possible reason could be the bias in the learned Q-values.

6 Conclusion

We introduce cooperative game theory to extend the existing global reward game to a broader framework called the extended convex game (ECG). Under this framework, we propose an algorithm named Shapley Q-value policy gradient (SQPG), leveraging a local reward approach called the Shapley Q-value, which is theoretically guaranteed to find the optimal solution in an ECG with a grand coalition (i.e., a global reward game). We evaluate SQPG on three global reward games and show promising performance compared with the baselines. In future work, we plan to group the agents dynamically at each time step with theoretical guarantees and move beyond the restriction of the global reward game.

Jianhong Wang especially thanks Ms Yunlu Li for useful and patient explanations on mathematics. Additionally, the authors thank Ms Jing Li for helpful discussions on cooperative game theory. Jianhong Wang is sponsored by EPSRC-UKRI Innovation Fellowship EP/S000909/1.

References

  • [1] T. Balch et al. (1997) Learning roles: behavioral diversity in robot teams. College of Computing Technical Report GIT-CC-97-12, Georgia Institute of Technology, Atlanta, Georgia 73.
  • [2] T. Balch (1999) Reward and diversity in multirobot foraging. In IJCAI-99 Workshop on Agents Learning About, From and With other Agents.
  • [3] T. Basar and G. J. Olsder (1999) Dynamic noncooperative game theory. Vol. 23, SIAM.
  • [4] G. Chalkiadakis, E. Elkind, and M. Wooldridge (2011) Computational aspects of cooperative game theory. Synthesis Lectures on Artificial Intelligence and Machine Learning 5 (6), pp. 1–168.
  • [5] Y. Chang, T. Ho, and L. P. Kaelbling (2004) All learning is local: multi-agent learning in global reward games. In Advances in Neural Information Processing Systems, pp. 807–814.
  • [6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • [7] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2018) TarMAC: targeted multi-agent communication. arXiv preprint arXiv:1810.11187.
  • [8] U. Faigle and W. Kern (1992) The Shapley value for cooperative games under precedence constraints. International Journal of Game Theory 21 (3), pp. 249–266.
  • [9] S. S. Fatima, M. Wooldridge, and N. R. Jennings (2008) A linear approximation method for the Shapley value. Artificial Intelligence 172 (14), pp. 1673–1699.
  • [10] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [11] J. Hu, M. P. Wellman, et al. (1998) Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, Vol. 98, pp. 242–250.
  • [12] S. Iqbal and F. Sha (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
  • [13] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.
  • [14] T. Keviczky, F. Borrelli, K. Fregene, D. Godbole, and G. J. Balas (2007) Decentralized receding horizon control and coordination of autonomous vehicle formations. IEEE Transactions on Control Systems Technology 16 (1), pp. 19–33.
  • [15] D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi (2019) Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [17] M. Koes, I. Nourbakhsh, and K. Sycara (2006) Constraint optimization coordination architecture for search and rescue robotics. In Proceedings 2006 IEEE International Conference on Robotics and Automation, ICRA 2006, pp. 3977–3982.
  • [18] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
  • [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • [20] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
  • [21] T. P. Michalak, K. V. Aadithya, P. L. Szczepanski, B. Ravindran, and N. R. Jennings (2013) Efficient computation of the Shapley value for game-theoretic network centrality. Journal of Artificial Intelligence Research 46, pp. 607–650.
  • [22] D. Monderer and L. S. Shapley (1996) Potential games. Games and Economic Behavior 14 (1), pp. 124–143.
  • [23] I. Mordatch and P. Abbeel (2018) Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [24] J. Nash (1953) Two-person cooperative games. Econometrica: Journal of the Econometric Society, pp. 128–140.
  • [25] D. T. Nguyen, A. Kumar, and H. C. Lau (2018) Credit assignment for collective multiagent RL with global rewards. In Advances in Neural Information Processing Systems, pp. 8102–8113.
  • [26] S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How (2018) Learning to teach in cooperative multiagent reinforcement learning. arXiv preprint arXiv:1805.07830.
  • [27] M. J. Osborne and A. Rubinstein (1994) A course in game theory. MIT Press.
  • [28] L. Panait and S. Luke (2005) Cooperative multi-agent learning: the state of the art. Autonomous Agents and Multi-Agent Systems 11 (3), pp. 387–434.
  • [29] S. Patro and K. K. Sahu (2015) Normalization: a preprocessing stage. arXiv preprint arXiv:1503.06462.
  • [30] B. Peleg and P. Sudhölter (2007) Introduction to the theory of cooperative games. Vol. 34, Springer Science & Business Media.
  • [31] S. D. Ramchurn, A. Farinelli, K. S. Macarthur, and N. R. Jennings (2010) Decentralized coordination in RoboCup Rescue. The Computer Journal 53 (9), pp. 1447–1461.
  • [32] A. Schuldt (2012) Multiagent coordination enabling autonomous logistics. KI-Künstliche Intelligenz 26 (1), pp. 91–94.
  • [33] L. S. Shapley (1953) A value for n-person games. Contributions to the Theory of Games 2 (28), pp. 307–317.
  • [34] L. S. Shapley (1953) Stochastic games. Proceedings of the National Academy of Sciences 39 (10), pp. 1095–1100.
  • [35] L. S. Shapley (1971) Cores of convex games. International Journal of Game Theory 1 (1), pp. 11–26.
  • [36] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In ICML.
  • [37] S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252.
  • [38] R. S. Sutton, A. G. Barto, et al. (1998) Introduction to reinforcement learning. Vol. 135, MIT Press, Cambridge.
  • [39] C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3–4), pp. 279–292.
  • [40] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
  • [41] D. H. Wolpert and K. Tumer (2002) Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems, pp. 355–369.

Appendix

Appendix A Algorithm

In this section, we show the pseudo code of Shapley Q-value policy gradient (SQPG) in Algorithm 1.

1: Initialize the actor parameters θ_i and the critic (AMC) parameters ψ_i for each agent i
2: Initialize the target actor parameters θ'_i and the target critic parameters ψ'_i for each agent i
3: Initialize the sample size M for approximating the Shapley Q-value
4: Initialize the rate τ for updating the target networks
5: Initialize the discount rate γ
6: for episode = 1 to D do
7:     Observe the initial state s_1 from the environment
8:     for t = 1 to T do
9:         For each agent i, select the action a_i = μ_{θ_i}(s_t) + noise according to the current policy and exploration noise
10:        Execute the joint action a = (a_1, ..., a_n), observe the global reward r_t and the next state s_{t+1}
11:        Store (s_t, a, r_t, s_{t+1}) in the replay buffer B
12:        Sample a minibatch of G samples (s, a, r, s') from B
13:        Get the target joint action a' = (μ_{θ'_1}(s'), ..., μ_{θ'_n}(s')) for each sample
14:        Get the on-policy joint action ā = (μ_{θ_1}(s), ..., μ_{θ_n}(s)) for each sample
15:        for each agent i do   ▷ this procedure can be implemented in parallel
16:            Sample M ordered coalitions C_i^1, ..., C_i^M
17:            for each sampled coalition C_i^k do   ▷ this procedure can be implemented in parallel
18:                Order each a by C_i^k and mask the irrelevant agents' actions, storing it to a_{C_i^k ∪ {i}}
19:                Order each a' by C_i^k and mask the irrelevant agents' actions, storing it to a'_{C_i^k ∪ {i}}
20:                Order each ā by C_i^k and mask the irrelevant agents' actions, storing it to ā_{C_i^k ∪ {i}}
21:            Get Q̂^Φ_i(s, a) = (1/M) Σ_k Φ_i(s, a_{C_i^k ∪ {i}}; ψ_i) for each sample
22:            Get Q̂^Φ_i(s', a') = (1/M) Σ_k Φ_i(s', a'_{C_i^k ∪ {i}}; ψ'_i) for each sample
23:            Update θ_i by the deterministic policy gradient according to Eq.10, using Φ_i(s, ā_{C_i^k ∪ {i}}; ψ_i) with a_i = μ_{θ_i}(s)
24:        Set the target y = r + γ Σ_{i ∈ N} Q̂^Φ_i(s', a') for each sample
25:        Update ψ_i for each agent i by minimizing the optimization problem of Eq.11 with the target y
26:        Update the target network parameters for each agent i: θ'_i ← τ θ_i + (1 − τ) θ'_i,  ψ'_i ← τ ψ_i + (1 − τ) ψ'_i
Algorithm 1 Shapley Q-value Policy Gradient (SQPG)

Appendix B Proof of Theorem 1

Theorem 1.

With an efficient payoff distribution scheme, for an extended convex game (ECG), one solution in the core can certainly be found with the grand coalition N, and the objective becomes max_π Σ_{i ∈ N} x_i, which leads to the maximal social welfare, i.e., v({N}) ≥ v(CS) for every coalition structure CS.

Proof.

The proof is as follows.

As we defined before, in an ECG, after being allocated to coalitions, each agent will further maximize the cumulative rewards of his coalition by the optimal policy. Now, we denote the optimal value of an arbitrary coalition C ⊆ N as v(C) = max_{π_C} E[ Σ_{t=1}^∞ γ^{t-1} r_t(C) ]. Similarly, we can define the optimal value given an arbitrary coalition structure CS as v(CS) = Σ_{C ∈ CS} v(C). If we rewrite the value function defined above in this way, the ECG can be reformulated as a CG. For this reason, we can directly use the results of Lemma 1 here to complete the proof.

First, we aim to show that (i) in an ECG with an efficient payoff distribution scheme x, ({N}, x) is a solution in the core.

Suppose for the sake of contradiction that no solution with the grand coalition N is in the core. By statement (1) in Lemma 1, there must exist a coalition structure CS* other than {N} and a payoff vector x* such that (CS*, x*) is in the core. According to statement (2) in Lemma 1, since a CG is a subclass of characteristic function games, under an efficient payoff distribution scheme we get that

v(CS*) ≥ v(CS) for every coalition structure CS; in particular, v(CS*) ≥ v({N}) = v(N).   (12)

On the other hand, because of the superadditivity property of the ECG, i.e.,

v(C ∪ D) ≥ v(C) + v(D) for any disjoint coalitions C, D ⊆ N,   (13)

we have that

v(N) ≥ v(C_1) + v(N \ C_1) ≥ Σ_{C ∈ CS*} v(C) = v(CS*),   (14)

where C_1 ∈ CS* and we further expand the remaining terms similarly.

By Eq.12 and 14, we can get that

v(N) = v(CS*).   (15)

According to the condition of the efficient payoff distribution scheme, we can write:

Σ_{i ∈ C} x*_i = v(C) for every C ∈ CS*,   (16)
Σ_{i ∈ N} x*_i = Σ_{C ∈ CS*} Σ_{i ∈ C} x*_i = v(CS*).   (17)

By Eq.15, we can get that

Σ_{i ∈ N} x*_i = v(N).   (18)

By Eq.18, it is obvious that we can always find a payoff distribution scheme for the grand coalition N, namely x*. Since (CS*, x*) is presumed to be in the core, x* must satisfy the conditions of the core. As a result, we derive that ({N}, x*) is a solution in the core, which contradicts the presumption we made, and we show that proposition (i) holds.

Then, we aim to show that (ii) in an ECG with an efficient payoff distribution scheme, the objective is max_π Σ_{i ∈ N} x_i.

The objective of a CG is finding a solution in the core. According to (i), this is equivalent to finding a solution in the core corresponding to the grand coalition N. For this reason, we can write

max_π Σ_{i ∈ N} x_i = max_π v(N) = max_π E[ Σ_{t=1}^∞ γ^{t-1} r_t(N) ] = max_π E[ Σ_{t=1}^∞ γ^{t-1} r_t ],   (19)

since we write the reward r_t(N) of the grand coalition as the global reward r_t.

Therefore, we prove (ii).

According to (ii), we can conclude that in an ECG the objective is maximizing E[ Σ_{t=1}^∞ γ^{t-1} r_t ], i.e., the cumulative global rewards. However, an efficient payoff distribution scheme, e.g., the Shapley value, should be a precondition; otherwise the sum of the assigned credits Σ_{i ∈ N} x_i can fall below v(N), the theoretically optimal value that can be found with an efficient payoff distribution scheme. ∎

Appendix C Experimental Setups

As for the experimental setups, because different environments may involve different complexity and dynamics, we use different hyperparameters for each task. Except for COMA, which uses GRUs as the hidden layer, all other algorithms use MLPs as the hidden layer of the policy networks. All policy networks use only one hidden layer. As for the critic networks, every algorithm uses an MLP with one hidden layer. For each experiment, we keep the learning rate, entropy regularization coefficient, update frequency, batch size and number of hidden units identical across algorithms. In the experiments, each agent has its own state in execution, while in training the agents share the states. The remaining details of the experimental setups are introduced below. All models are trained by the Adam optimizer [16] with default hyperparameters.

C.1 Additional Details of Cooperative Navigation

The specific hyperparameters of each algorithm used in Cooperative Navigation are shown in Tab.3.

Hyperparameters # Description
hidden units 32 The number of hidden units for both policy and critic network
training episodes 2000 The number of training episodes
episode length 200 Maximum time steps per episode
discount factor 0.9 The importance of future rewards
update frequency for behaviour network 100 Behaviour network updates every # steps
learning rate for policy network 1e-3 Policy network learning rate
learning rate for critic network 1e-2 Critic network learning rate
update frequency for target network 200 Target network updates every # steps
target update rate 0.1 Target network update rate
entropy regularization coefficient 1e-2 Weight or regularization for exploration
batch size 32 The number of transitions for each update
Table 3: Table of hyperparameters for Cooperative Navigation.

C.2 Additional Details of Prey-and-Predator

The specific hyperparameters of each algorithm used in Prey-and-Predator are shown in Tab.4. Also, we provide more evidence to support the conclusion that the credit assignment of SQPG is negatively correlated with the distance to the prey, shown in Fig.4-6.

Figure 4: Credit assignment to each predator according to a fixed trajectory. The leftmost figure records a trajectory sampled by an expert policy. The square represents the initial position whereas the circle indicates the final position of each agent. The dots on the trajectory indicates each agent’s temporary positions. The other figures show the normalized credit assignments generated by different MARL algorithms according to this trajectory.
Figure 5: Credit assignment to each predator according to a fixed trajectory. The leftmost figure records a trajectory sampled by an expert policy. The square represents the initial position whereas the circle indicates the final position of each agent. The dots on the trajectory indicates each agent’s temporary positions. The other figures show the normalized credit assignments generated by different MARL algorithms according to this trajectory.
Figure 6: Credit assignment to each predator according to a fixed trajectory. The leftmost figure records a trajectory sampled by an expert policy. The square represents the initial position whereas the circle indicates the final position of each agent. The dots on the trajectory indicates each agent’s temporary positions. The other figures show the normalized credit assignments generated by different MARL algorithms according to this trajectory.
Hyperparameters # Description
hidden units 128 The number of hidden units for both policy and critic network
training episodes 4000 The number of training episodes
episode length 200 Maximum time steps per episode
discount factor 0.99 The importance of future rewards
update frequency for behaviour network 100 Behaviour network updates every # steps
learning rate for policy network 1e-4 Policy network learning rate
learning rate for critic network 1e-3 Critic network learning rate
update frequency for target network 200 Target network updates every # steps
target update rate 0.1 Target network update rate
entropy regularization coefficient 1e-3 Weight or regularization for exploration
batch size 128 The number of transitions for each update
Table 4: Table of hyperparameters for Prey-and-Predator.

C.3 Additional Details of Traffic Junction

To give the reader an intuitive understanding of the environment, we list the experimental settings of the different difficulty levels in Tab.5 and give illustrations in Fig.9. The specific hyperparameters of each algorithm used in Traffic Junction are shown in Tab.6. To exhibit the training procedure in more detail, we also display the figures of mean rewards, i.e., Fig.7(a)-7(c), and the figures of success rate, i.e., Fig.8(a)-8(c).

Difficulty p_arrive N_max Entry-Points # Routes # Two-way Junctions # Dimension
Easy 0.3 5 2 1 F 1 7x7
Medium 0.2 10 4 3 T 1 14x14
Hard 0.05 20 8 7 T 4 18x18
Table 5: The settings of Traffic Junction for different difficulty levels. p_arrive means the probability of adding an available car into the environment. N_max means the maximum number of cars existing in the environment. Entry-Points # means the number of possible entry points for each car. Routes # means the number of possible routes starting from every entry point.
Hyperparameters Easy Medium Hard Description
hidden units 128 128 128 The number of hidden units for both policy and critic network
training episodes 2000 5000 2000 The number of training episodes
episode length 50 50 100 Maximum time steps per episode
discount factor 0.99 0.99 0.99 The importance of future rewards
update frequency for behaviour network 25 25 25 Behaviour network updates every # steps
learning rate for policy network 1e-4 1e-4 1e-4 Policy network learning rate
learning rate for critic network 1e-3 1e-3 1e-3 Critic network learning rate
update frequency for target network 50 50 50 Target network updates every # steps
target update rate 0.1 0.1 0.1 Target network update rate
entropy regularization coefficient 1e-4 1e-4 1e-4 Weight or regularization for exploration
batch size 64 32 32 The number of transitions for each update
Table 6: Table of hyperparameters for Traffic Junction.
(a) Mean rewards per episode during training in Traffic Junction on easy version.
(b) Mean rewards per episode during training in Traffic Junction on medium version.
(c) Mean rewards per episode during training in Traffic Junction on hard version.
Figure 7: Mean rewards per episode during training in Traffic Junction on the versions of different difficulty levels.
(a) Success rate per episode during training in Traffic Junction on easy version.
(b) Success rate per episode during training in Traffic Junction on medium version.
(c) Success rate per episode during training in Traffic Junction on hard version.
Figure 8: Success rate per episode during training in Traffic Junction on the versions of different difficulty levels.
(a) Easy
(b) Medium
(c) Hard
Figure 9: Visualizations of traffic junction environment. The black points represent the available entry points. The orange arrows represent the available routes at each entry point. The green lines separate the two-way roads.

Appendix D Limitations of Extended Convex Game

In this paper, we propose a framework built on cooperative game theory called the extended convex game (ECG). Although the ECG extends the framework of the global reward game defined upon non-cooperative game theory to a broader scope, there exist some limitations to this model. Firstly, we have to assume that there is an oracle scheduling the coalitions initially; however, this oracle is difficult to realize in implementation. Even if the oracle can be implemented, this model still cannot solve problems with random perturbations. This is because the oracle assigns each agent to a coalition based on the environment that it knows; obviously, the perturbation exceeds its knowledge. To deal with this problem, we may investigate how to enable dynamic coalition construction in future work. The intuitive idea is to enable the oracle to learn a policy for scheduling the coalitions from the history information. At each step, it uses the learned policy to divide the coalitions. Then, each agent acts within its coalition to maximize the social value of the coalition. This process can be repeated indefinitely. Nonetheless, guaranteeing convergence under the cooperative game theoretical framework for this complicated process could be a challenge.