Rethink Global Reward Game and Credit Assignment in Multi-agent Reinforcement Learning

07/11/2019 · Jianhong Wang et al. · Imperial College London

Cooperative game is a critical research area in multi-agent reinforcement learning (MARL). The global reward game is a subclass of cooperative games in which all agents aim to maximize cumulative global rewards. Credit assignment is an important problem studied in the global reward game. Most works adopt a non-cooperative-game theoretical framework with the shared reward approach, i.e., each agent is directly assigned the shared global reward. This, however, may give each agent inaccurate feedback on its contribution to the group. In this paper, we introduce a cooperative-game theoretical framework and extend it to the finite-horizon case. We show that our proposed framework is a superset of the global reward game. Based on this framework, we propose an algorithm called Shapley Q-value policy gradient (SQPG) to learn a local reward approach that distributes the cumulative global reward fairly, reflecting each agent's own contribution in contrast to the shared reward approach. We evaluate our method on Cooperative Navigation, Prey-and-Predator and Traffic Junction, compared with MADDPG, COMA, Independent actor-critic and Independent DDPG. In the experiments, our algorithm shows better convergence than the baselines.

1 Introduction

Cooperative game is a critical research area in multi-agent reinforcement learning (MARL). Many real-life tasks can be modeled as cooperative games, e.g., the coordination of autonomous vehicles [1], autonomous distributed logistics [2] and search-and-rescue robots [3, 4]. The global reward game [5] is a subclass of cooperative games in which agents aim to maximize cumulative global rewards. In a global reward game, credit assignment is an important problem, which aims to find a method to distribute the global reward. There are two categories of approaches to this problem, namely the shared reward approach (also known as the shared reward game or fully cooperative game) [6, 7, 8] and the local reward approach [9]. The shared reward approach assigns a single global reward to all agents. The local reward approach, on the other hand, distributes the global reward according to each agent's contribution, and turns out to have superior performance on many tasks [10, 11, 12].

Whichever approach is adopted, a remaining open question is whether there is an underlying theory that explains credit assignment. Conventionally, a global reward game is built upon non-cooperative game theory, which primarily aims to find a Nash equilibrium as the stable solution [13, 14]. This formulation can be extended to a dynamic environment with finite horizons via the stochastic game [15]. However, Nash equilibrium focuses on individual rewards and gives no explicit incentive for cooperation. As a result, a shared reward function (equivalent to a potential function [16]) has to be used to force cooperation, which explains the shared reward approach but not the local reward approach.

In our work, we introduce the framework of cooperative game theory (or coalitional game theory) [17], under which the local reward approach becomes rationalized. In cooperative game theory, the objective is to divide agents into coalitions and bind agreements among agents who belong to the same coalition, which enables explicit cooperation. We focus on the convex game (CG), a typical game in cooperative game theory featuring a set of stable coalition structures and payoff distribution schemes called the core. The payoff distribution is equivalent to credit assignment, and the core thereby rationalizes the local reward approach [18].

Referring to the concepts of stochastic games [15], we extend the CG to finite-horizon scenarios. We propose the Shapley Q-value for credit assignment in the extended CG (ECG), and approximate it by sampling for computational efficiency. Finally, we use the actor-critic framework [19] to learn decentralized policies [20] with a centralized critic [10, 21, 22]. The advantage of the proposed method is evaluated on Cooperative Navigation, Prey-and-Predator [21], and Traffic Junction [6].

The main contributions of our work are: 1) we introduce CG from the framework of cooperative game theory to rationalize the motivation of the local reward approach; 2) we propose the ECG to extend CGs to dynamic environments with finite horizons; 3) we propose Shapley Q-value for credit assignment in ECG and use its sample-based approximation for computational efficiency.

2 Related Work

Multi-agent learning refers to a category of methods that tackle games with multiple agents, e.g., cooperative games. Among these methods, we focus on using reinforcement learning to deal with cooperative games, which is called multi-agent reinforcement learning (MARL). Considerable progress has recently been made in MARL. Some researchers [6, 8, 23] focus on distributed execution, which allows communication among agents. Others [5, 10, 12, 21, 22] consider decentralized execution, where no communication is permitted during execution. Nevertheless, all of them study a global reward game with a centralized critic, which means information can be shared through the value function during training. In our work, we focus on the setting with decentralized execution and a centralized critic.

As opposed to competing with others, agents in a cooperative game aim to cooperate to solve a joint task or to maximize global payoffs [17]. Shapley [15] proposed a non-cooperative game theoretical framework called the stochastic game, which models the dynamics of multiple agents in a zero-sum game with finite horizons. Hu et al. [24] introduced a general-sum stochastic game theoretical framework, which includes the zero-sum game as a special case. To force cooperation under this framework, a potential function [16] is applied such that each agent shares the same objective, namely a global reward game [5]. In this paper, we use cooperative game theory, whereas the existing cooperative game frameworks are built on non-cooperative game theory. Our framework gives a new view on global reward games and explains why credit assignment matters. We show that the global reward game is a subclass of our framework if the agents in the global reward game are interpreted as forming a grand coalition. Under our framework, it is more rational to use a local reward approach to distribute the global reward.

Credit assignment is a significant problem that has been studied in cooperative games for a long time. There are two sorts of credit assignment schemes, i.e., the shared reward approach and the local reward approach. The shared reward approach directly assigns each agent the global reward [6, 8, 23, 21]. We show that this is actually equivalent to distributing the global reward equally to individual agents. The global reward game with this credit assignment scheme is also called the shared reward game (or fully cooperative game) [9]. However, Wolpert and Tumer [25] argued that the shared reward approach does not give each agent its accurate contribution, so it may not perform well on difficult problems. This motivates the study of the local reward approach, which distributes the global reward to agents according to their contributions. The open question is how to quantify the contributions. To answer this question, Chang et al. [5] attempted to use a Kalman filter to infer the contribution of each agent. Recently, Foerster et al. [10], Zhang et al. [11] and Nguyen et al. [12] modelled marginal contributions inspired by the reward difference [25]. Under our proposed framework, we propose a new algorithm to learn a local reward called the Shapley value [26], which is guaranteed to distribute the global reward fairly. Although the Shapley value can be regarded as the expectation of marginal contributions, it differs from the previous work: it considers all possible orders in which agents join to form a grand coalition, which has not been addressed in any of the prior work mentioned above.

3 Background

Convex Game (CG).

Convex game is a typical game in cooperative game theory. The definitions below follow [17]. A CG is formally represented as $\Gamma = \langle \mathcal{N}, v \rangle$, where $\mathcal{N} = \{1, 2, ..., n\}$ is the set of agents and $v$ is the value function that measures the profits earned by a coalition $\mathcal{C} \subseteq \mathcal{N}$. $\mathcal{N}$ itself is called the grand coalition. The value function is a mapping from a coalition $\mathcal{C}$ to a real number $v(\mathcal{C}) \in \mathbb{R}$. In a CG, the value function satisfies two properties, i.e., 1) superadditivity: $v(\mathcal{C} \cup \mathcal{D}) \geq v(\mathcal{C}) + v(\mathcal{D})$ for any disjoint $\mathcal{C}, \mathcal{D} \subseteq \mathcal{N}$; 2) the coalitions are independent. The solution of a CG is a tuple $(\mathcal{CS}, \mathbf{x})$, where $\mathcal{CS} = \{\mathcal{C}_1, ..., \mathcal{C}_m\}$ is a coalition structure and $\mathbf{x} = (x_1, ..., x_n)$ indicates the payoffs distributed to each agent, which satisfies two conditions, i.e., 1) $x_i \geq v(\{i\})$ for each $i \in \mathcal{N}$; 2) $x(\mathcal{C}) = v(\mathcal{C})$ for each $\mathcal{C} \in \mathcal{CS}$, where $x(\mathcal{C}) = \sum_{i \in \mathcal{C}} x_i$. $\mathcal{CS}(\mathcal{N})$ denotes the set of all possible coalition structures. The core is the set of stable solutions of a CG, which can be defined mathematically as $core(\Gamma) = \{ (\mathcal{CS}, \mathbf{x}) : x(\mathcal{C}) \geq v(\mathcal{C}),\ \forall \mathcal{C} \subseteq \mathcal{N} \}$. The core of a CG ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL.
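To make these definitions concrete, here is a small illustrative sketch (not from the paper; the toy value function and the check of superadditivity only are our own simplifications). It tests superadditivity of a three-agent characteristic function and whether a payoff vector lies in the core of the grand coalition.

```python
from itertools import combinations, chain

agents = (1, 2, 3)

# Illustrative characteristic function v: coalition (frozenset) -> value.
v = {frozenset(c): 10 * len(c) ** 2
     for c in chain.from_iterable(combinations(agents, r) for r in range(len(agents) + 1))}

def superadditive(v, agents):
    """Check v(C u D) >= v(C) + v(D) for all disjoint non-empty coalitions C, D."""
    subsets = [frozenset(c) for r in range(1, len(agents) + 1)
               for c in combinations(agents, r)]
    return all(v[c | d] >= v[c] + v[d]
               for c in subsets for d in subsets if not (c & d))

def in_core(x, v, agents):
    """x lies in the core of the grand coalition iff it is efficient and no
    coalition C can profit by deviating: sum_{i in C} x_i >= v(C)."""
    grand = frozenset(agents)
    efficient = abs(sum(x.values()) - v[grand]) < 1e-9
    stable = all(sum(x[i] for i in c) >= v[frozenset(c)]
                 for r in range(1, len(agents) + 1)
                 for c in combinations(agents, r))
    return efficient and stable

print(superadditive(v, agents))                          # True for this toy v
print(in_core({1: 30.0, 2: 30.0, 3: 30.0}, v, agents))   # equal split of v(N) = 90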

Shapley Value.

Shapley value [26] is one of the most popular methods to solve the payoff distribution problem for a grand coalition [27, 28, 29]. Given a cooperative game $\langle \mathcal{N}, v \rangle$, for any $\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}$ let $\Delta_i(\mathcal{C}) = v(\mathcal{C} \cup \{i\}) - v(\mathcal{C})$, then the Shapley value of each agent $i$ can be written as:

$Sh_i(v) = \sum_{\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}} \frac{|\mathcal{C}|! \, (|\mathcal{N}| - |\mathcal{C}| - 1)!}{|\mathcal{N}|!} \, \Delta_i(\mathcal{C}).$   (1)

Literally, the Shapley value takes a weighted average of the marginal contributions over possible coalitions, so that it satisfies: 1) efficiency: $\sum_{i \in \mathcal{N}} Sh_i(v) = v(\mathcal{N})$; 2) fairness: if an agent has no contribution, then $Sh_i(v) = 0$; if the $i$-th and $j$-th agents have the same contribution, then $Sh_i(v) = Sh_j(v)$ [17]. As we can see from Eq.1, calculating the Shapley value for an agent requires considering all $2^{|\mathcal{N}|-1}$ coalitions that the agent could join to form a grand coalition, which is computationally prohibitive. We propose to mitigate this issue in scenarios with finite horizons.
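As a concrete illustration of Eq.1 (this snippet is ours, not from the paper; the three-agent value function is made up), the following sketch enumerates all coalitions to compute exact Shapley values and checks the efficiency property.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, v):
    """Exact Shapley value via Eq.1: weighted marginal contributions over all
    coalitions C contained in N \\ {i}."""
    n = len(agents)
    phi = {}
    for i in agents:
        others = [j for j in agents if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                c = frozenset(coalition)
                weight = factorial(len(c)) * factorial(n - len(c) - 1) / factorial(n)
                total += weight * (v[c | {i}] - v[c])
        phi[i] = total
    return phi

# Illustrative value function: agent 3 contributes nothing.
v = {frozenset(): 0, frozenset({1}): 4, frozenset({2}): 6, frozenset({3}): 0,
     frozenset({1, 2}): 12, frozenset({1, 3}): 4, frozenset({2, 3}): 6,
     frozenset({1, 2, 3}): 12}

phi = shapley_values([1, 2, 3], v)
print(phi)                                            # {1: 5.0, 2: 7.0, 3: 0.0}
print(sum(phi.values()) == v[frozenset({1, 2, 3})])   # efficiency holds
```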

Multi-agent Actor-Critic.

Different from value-based methods, e.g., Q-learning [30], policy gradient [31] directly learns the policy by maximizing $J(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta}}[ r(s, a) ]$, where $r(s, a)$ is the reward of an arbitrary state-action pair. Since the gradient of $J(\theta)$ w.r.t. $\theta$ cannot be directly calculated, the policy gradient theorem [32] is used to approximate the gradient such that $\nabla_{\theta} J(\theta) = \mathbb{E}[ \nabla_{\theta} \log \pi_{\theta}(a|s) \, Q^{\pi}(s, a) ]$. In the actor-critic framework [19], based on the policy gradient theorem, $\pi_{\theta}(a|s)$ is called the actor and $Q^{\pi}(s, a)$ is called the critic. Additionally, $Q^{\pi}(s_t, a_t) = \mathbb{E}[ \sum_{k=t}^{T} \gamma^{k-t} r_k ]$, where $T$ is a finite horizon. Extending to multi-agent scenarios, the gradient of each agent $i$ can be represented as $\nabla_{\theta_i} J(\theta_i) = \mathbb{E}[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i|s) \, Q_i^{\boldsymbol{\pi}}(s, \mathbf{a}) ]$. $Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})$ can be regarded as an estimate of the contribution of each agent $i$. If a deterministic policy [33] needs to be learned in MARL problems, we can reformulate the approximated gradient of each agent as $\nabla_{\theta_i} J(\theta_i) = \mathbb{E}[ \nabla_{\theta_i} \mu_{\theta_i}(s) \, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(s, \mathbf{a}) |_{a_i = \mu_{\theta_i}(s)} ]$.
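For readers less familiar with this machinery, a minimal single-agent actor-critic sketch in PyTorch (ours, with illustrative network sizes; not the paper's code) shows how the critic's Q-estimate weights the log-probability term of the policy gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 5

actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))

obs = torch.randn(16, obs_dim)                       # a fake batch of observations
dist = torch.distributions.Categorical(logits=actor(obs))
actions = dist.sample()

# Critic estimates Q(s, a); detach so only the actor is updated by this loss.
one_hot = F.one_hot(actions, act_dim).float()
q_values = critic(torch.cat([obs, one_hot], dim=-1)).squeeze(-1).detach()

# Policy-gradient loss: -E[ log pi(a|s) * Q(s, a) ]
actor_loss = -(dist.log_prob(actions) * q_values).mean()
actor_loss.backward()
```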

4 Our Work

4.1 Extended Convex Game

We now extend the CG to scenarios with finite horizons and sequential decisions, namely the extended CG (ECG). The set of joint actions of agents is defined as $\mathbf{A} = A_1 \times \dots \times A_n$, where $A_i$ is the feasible action set of each agent $i$. $\mathcal{S}$ is the set of possible states of the environment. The dynamics of the environment is defined by a transition function $T(s' | s, \mathbf{a})$, where $s, s' \in \mathcal{S}$ and $\mathbf{a} \in \mathbf{A}$. Inspired by Nash [34], we construct the ECG in two stages. In stage 1, an oracle arranges the coalition structure and contracts the cooperation agreements, i.e., the credit assigned to an agent for his optimal long-term contribution if he joins some coalition. We assume that this oracle can observe the whole environment and is familiar with each agent's features. In stage 2, after joining the allocated coalition, each agent further makes decisions by a policy $\pi_i(a_i|s)$ to maximize the social value of its coalition, so that the optimal social value of each coalition and the individual credit assignment can be obtained. Mathematically, the optimal value of a coalition $\mathcal{C}$ can be written as $\max_{\pi_{\mathcal{C}}} v^{\boldsymbol{\pi}}(\mathcal{C}) = \max_{\pi_{\mathcal{C}}} \mathbb{E}_{\pi_{\mathcal{C}}} [ \sum_{t=1}^{T} \gamma^{t-1} r_t(\mathcal{C}) ]$, where $\pi_{\mathcal{C}}$ denotes the joint policy of the agents in $\mathcal{C}$, $\gamma \in (0, 1]$ is the discount factor, $T$ is a finite horizon, and $r_t(\mathcal{C})$ is the reward gained by coalition $\mathcal{C}$ at each time step. Therefore, each coalition can be regarded as a stochastic game [15], but with a credit assignment approach other than the shared reward approach. We denote the joint policy of all agents as $\boldsymbol{\pi} = \langle \pi_1, ..., \pi_n \rangle$. Since constructing coalitions and binding agreements in stage 1 is independent of the decision process defined in stage 2, the two stages can be treated separately. Here, we assume that each agent can observe the global state.

Theorem 1.

With an efficient payoff distribution scheme, for an extended convex game (ECG), one solution in the core must exist with the grand coalition, and the objective is $\max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N})$, which leads to the maximal social welfare, i.e., $v^{*}(\mathcal{N}) \geq \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C})$ for every coalition structure $\mathcal{CS}$.

Proof.

See Appendix B. ∎

4.2 Compare ECG with Global Reward Game

As seen from Theorem 1, an ECG with a grand coalition and an efficient payoff distribution scheme is actually equivalent to a global reward game. Here, the agents in a global reward game are regarded as forming a grand coalition. We consequently conclude that the maximal global welfare can be achieved in a global reward game from the view of the ECG. This gives a theoretical justification for the motivation to solve a global reward game. Meanwhile, we can solve a global reward game instead of an ECG when we aim to find the stable solution with a grand coalition. In the rest of the paper, we focus on how to solve a global reward game with a local reward approach.

4.3 Look into Shared Reward Approach by the View of ECG

The shared reward approach assigns each agent the global reward directly in a global reward game. Each agent unilaterally maximizes the cumulative global rewards to seek his optimal policy such that

$\max_{\pi_i} \mathbb{E}_{\boldsymbol{\pi}} \Big[ \sum_{t=1}^{T} \gamma^{t-1} r_t(s_t, \mathbf{a}_t) \Big], \quad \forall i \in \mathcal{N},$   (2)

where $r_t(s_t, \mathbf{a}_t)$ is the global reward and $T$ is a finite horizon [9]. If the objective in Eq.2 is multiplied by a normalizing factor $\frac{1}{|\mathcal{N}|}$, then the new optimization problem for each agent is equivalent to Eq.2. We can express it mathematically as

$\max_{\pi_i} \frac{1}{|\mathcal{N}|} \, \mathbb{E}_{\boldsymbol{\pi}} \Big[ \sum_{t=1}^{T} \gamma^{t-1} r_t(s_t, \mathbf{a}_t) \Big], \quad \forall i \in \mathcal{N}.$   (3)

Then, the credit assigned to each agent under the shared reward approach is actually $\frac{1}{|\mathcal{N}|} \mathbb{E}_{\boldsymbol{\pi}} [ \sum_{t=1}^{T} \gamma^{t-1} r_t(s_t, \mathbf{a}_t) ]$, and the sum of all agents' credits is equal to the cumulative global rewards. This satisfies the condition of an efficient payoff distribution scheme. Therefore, we show that the shared reward approach is a special case of the ECG with a grand coalition and an efficient payoff distribution scheme. To clarify the concepts mentioned before, we draw a Venn diagram shown in Fig.1.

Figure 1: Relationship of the concepts mentioned in this paper.

4.4 Shapley Q-value

As discussed above, solving an ECG with an efficient payoff distribution scheme is actually equivalent to solving a global reward game. Although the shared reward approach is an efficient payoff distribution scheme, it has been shown that local reward approaches give faster convergence rates [35, 36]. For this reason, we use the Shapley value, a local reward approach, for credit assignment. Because $v^{\boldsymbol{\pi}}(\mathcal{C})$ represents the cumulative global rewards earned by coalition $\mathcal{C}$ in an ECG, we can model it as a Q-value, i.e., $v^{\boldsymbol{\pi}}(\mathcal{C}) = Q_{\mathcal{C}}(s, \mathbf{a}_{\mathcal{C}})$, where $s$ is the state and $\mathbf{a}_{\mathcal{C}}$ is the joint action of coalition $\mathcal{C}$. According to Eq.1, the Shapley Q-value of each agent $i$, i.e., $\Phi_i(s, \mathbf{a})$, can be written such that

$\Delta Q_i(s, \mathcal{C}) = Q_{\mathcal{C} \cup \{i\}}(s, \mathbf{a}_{\mathcal{C} \cup \{i\}}) - Q_{\mathcal{C}}(s, \mathbf{a}_{\mathcal{C}}),$   (4)

$\Phi_i(s, \mathbf{a}) = \sum_{\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}} \frac{|\mathcal{C}|! \, (|\mathcal{N}| - |\mathcal{C}| - 1)!}{|\mathcal{N}|!} \, \Delta Q_i(s, \mathcal{C}).$   (5)

4.5 Approximate Marginal Contribution

As seen from Eq.4 and 5, we would need to access the environment repeatedly at each time step to measure the value functions of different coalitions and to calculate the different marginal contributions $\Delta Q_i(s, \mathcal{C})$. To mitigate this problem, we propose a method called Approximate Marginal Contribution (AMC) to estimate the marginal contribution of each coalition directly, so that the environment only needs to be accessed once at each time step.

In cooperative game theory, each agent is assumed to join the grand coalition sequentially. The coefficient $\frac{|\mathcal{C}|! \, (|\mathcal{N}| - |\mathcal{C}| - 1)!}{|\mathcal{N}|!}$ in Eq.1 is interpreted as the probability that an agent randomly joins an existing coalition (which can be empty), after which the remaining agents complete the grand coalition [17]. With this interpretation, we model a function to approximate the marginal contribution directly such that

$\Delta Q_i(s, \mathcal{C}_i) \approx \hat{\Phi}_i(s, \mathbf{a}_{\mathcal{C}_i}, a_i ; \omega_i),$   (6)

where $s \in \mathcal{S}$ is the state; $\mathcal{C}_i$ is the ordered coalition that agent $i$ joins; $\mathbf{a}_{\mathcal{C}_i}$ denotes the actions of the agents in $\mathcal{C}_i$, and the actions are ordered. For example, if the order of a coalition is $\langle j, k \rangle$, then $\mathbf{a}_{\mathcal{C}} = \langle a_j, a_k \rangle$. In practice, we represent $\mathbf{a}_{\mathcal{C}_i}$ by the concatenation of each agent's action vector. To keep the input size of $\hat{\Phi}_i$ constant across different coalitions, we fix the input as the concatenation of all agents' actions and mask the irrelevant agents' actions with zeros.
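A sketch of how such a masked, fixed-size input could be built (our illustration with made-up dimensions; the authors' exact encoding may differ): the coalition selects which agents' action vectors are kept, and everything else is zeroed so the AMC network always sees the same input width.

```python
import numpy as np

n_agents, act_dim = 4, 5

def masked_coalition_actions(actions, ordered_coalition, agent_i):
    """Concatenate all agents' action vectors, zeroing the agents that are
    not in the ordered coalition C_i together with agent i."""
    keep = set(ordered_coalition) | {agent_i}
    masked = np.zeros((n_agents, act_dim), dtype=np.float32)
    for j in keep:
        masked[j] = actions[j]
    return masked.reshape(-1)          # fixed length n_agents * act_dim

# One-hot actions for 4 agents, then mask for agent 1 joining coalition [2, 0].
actions = np.eye(act_dim, dtype=np.float32)[np.random.randint(act_dim, size=n_agents)]
x = masked_coalition_actions(actions, ordered_coalition=[2, 0], agent_i=1)
print(x.shape)                         # (20,) regardless of coalition size
```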

4.6 Approximate Shapley Q-value

Following the interpretation above, the Shapley Q-value can be rewritten as an expectation over ordered coalitions, i.e.,

$\Phi_i(s, \mathbf{a}) = \mathbb{E}_{\mathcal{C}_i \sim \Pr(\mathcal{C}_i)} \big[ \Delta Q_i(s, \mathcal{C}_i) \big].$   (7)

Analogously to the derivation of Q-learning from the Bellman equation [32], we can also sample here. Combined with AMC, we can write the approximate Shapley Q-value (ASQ) as

$\hat{\Phi}_i(s, \mathbf{a}) \approx \frac{1}{M} \sum_{k=1}^{M} \hat{\Phi}_i(s, \mathbf{a}_{\mathcal{C}_i^k}, a_i ; \omega_i),$   (8)

where $\mathcal{C}_i^k$ is the $k$-th sampled ordered coalition and $M$ is the sample size.
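The Monte Carlo estimate in Eq.7 and 8 can be sketched as follows (our illustration, not the released implementation): ordered coalitions are drawn by shuffling the agents and truncating just before agent $i$, and the approximated marginal contributions are averaged.

```python
import random

def sample_ordered_coalition(agents, i):
    """Draw a random permutation of the agents and take the prefix that
    precedes agent i; this is the ordered coalition C_i that i joins."""
    order = list(agents)
    random.shuffle(order)
    return order[:order.index(i)]

def approx_shapley_q(agents, i, marginal_fn, num_samples):
    """Eq.8: average the approximated marginal contribution of agent i
    over num_samples sampled ordered coalitions."""
    total = 0.0
    for _ in range(num_samples):
        coalition = sample_ordered_coalition(agents, i)
        total += marginal_fn(i, coalition)    # e.g. the AMC network's output
    return total / num_samples

# Toy marginal contribution: agent i adds 1.0 plus 0.1 per agent already present.
toy_marginal = lambda i, c: 1.0 + 0.1 * len(c)
print(approx_shapley_q(agents=[0, 1, 2, 3], i=2, marginal_fn=toy_marginal, num_samples=100))
```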

4.7 Shapley Q-value Policy Gradient

In an ECG with an efficient payoff distribution scheme and a grand coalition, each agent only needs to maximize his own credit, so that $\max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N})$ can be achieved as

$\max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N}) = \sum_{i \in \mathcal{N}} \max_{\pi_i} x_i.$   (9)

Therefore, if we show that $\max_{\pi_i} x_i$ is approached for each agent $i$, then we can show that the maximal social welfare is met. Now, the problem reduces to how to solve $\max_{\pi_i} x_i$ for each agent $i$. We have shown that a global reward game is actually a potential game (a potential game is a game in which a potential function exists [16]), and that an ECG with an efficient payoff distribution scheme and a grand coalition is equivalent to a global reward game. Moreover, Monderer and Shapley [16] showed that a potential game admits a pure Nash equilibrium. For these reasons, we apply the deterministic policy gradient (DPG) [37] to find an optimal policy. If we substitute the Shapley Q-value for the critic in DPG, we can directly write the policy gradient of each agent such that

$\nabla_{\theta_i} J(\theta_i) = \mathbb{E} \big[ \nabla_{\theta_i} \mu_{\theta_i}(s) \, \nabla_{a_i} \hat{\Phi}_i(s, \mathbf{a} ; \omega_i) \big|_{a_i = \mu_{\theta_i}(s)} \big],$   (10)

where $\hat{\Phi}_i(s, \mathbf{a} ; \omega_i)$ is the ASQ of agent $i$ and $\mu_{\theta_i}$ is agent $i$'s deterministic policy, parameterized by $\theta_i$. Since in a global reward game only a global reward is received at each time step, we cannot use it to update the ASQs (consisting of AMCs) individually. However, thanks to the efficiency property, we can update the ASQs jointly according to the minimization problem

$\min_{\omega_1, ..., \omega_n} \mathbb{E} \Big[ \Big( r + \gamma \sum_{i \in \mathcal{N}} \hat{\Phi}_i(s', \mathbf{a}' ; \omega_i^{-}) - \sum_{i \in \mathcal{N}} \hat{\Phi}_i(s, \mathbf{a} ; \omega_i) \Big)^2 \Big],$   (11)

where each $\hat{\Phi}_i$ is parameterized by $\omega_i$ (with target parameters $\omega_i^{-}$) and $r$ is the global reward received from the environment at each time step. With this update, the approximate Shapley Q-value is constrained to be efficient, so the condition stated in Theorem 1 is guaranteed. Because the Shapley Q-value takes all agents' actions and the state as input, our algorithm actually uses a centralized critic. Nonetheless, the policy is decentralized in execution.
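The two updates can be condensed into the following PyTorch sketch (a rough reading of Eq.10 and 11 under our own simplifications: a plain per-agent critic stands in for the sampled ASQ, the online critics are reused as targets, and all sizes and names are placeholders rather than the paper's code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_agents, obs_dim, act_dim = 3, 6, 2
gamma, batch = 0.99, 32

# Decentralized deterministic actors and per-agent critics; each critic stands in
# for the ASQ of Eq.8 (coalition sampling and masking are omitted for brevity).
actors  = nn.ModuleList(nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                      nn.Linear(32, act_dim), nn.Tanh())
                        for _ in range(n_agents))
critics = nn.ModuleList(nn.Sequential(nn.Linear(obs_dim + n_agents * act_dim, 32),
                                      nn.ReLU(), nn.Linear(32, 1))
                        for _ in range(n_agents))
actor_opt  = torch.optim.Adam(actors.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critics.parameters(), lr=1e-3)

# One fake minibatch (in practice sampled from the replay buffer).
state, next_state = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
action, next_action = torch.randn(batch, n_agents * act_dim), torch.randn(batch, n_agents * act_dim)
reward = torch.randn(batch, 1)

# Eq.11: the sum of the agents' credits regresses onto the global TD target.
q_sum = sum(c(torch.cat([state, action], -1)) for c in critics)
with torch.no_grad():                      # target networks omitted for brevity
    target = reward + gamma * sum(c(torch.cat([next_state, next_action], -1))
                                  for c in critics)
critic_loss = F.mse_loss(q_sum, target)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Eq.10: each actor ascends its own credit (gradients into critics are discarded).
for i, (actor, critic) in enumerate(zip(actors, critics)):
    a = action.clone()
    a[:, i * act_dim:(i + 1) * act_dim] = actor(state)   # substitute agent i's action
    actor_loss = -critic(torch.cat([state, a], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```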

Silver et al. [37] showed that DPG has the familiar machinery of policy gradient. Besides, Sutton et al. [32] emphasized that with a small learning rate, a policy gradient algorithm can converge to a local optimum. Therefore, we can conclude that with a small learning rate, each agent can find a local maximizer and the global value converges to a local maximum. Since our algorithm aims to find optimal policies via Shapley Q-values, we call it Shapley Q-value policy gradient (SQPG).

4.8 Implementation

In the implementation, to obtain a better approximation of policy gradients via off-policy learning and the powerful function approximation of deep neural networks, we use the deep deterministic policy gradient (DDPG) method [33]. Additionally, we apply the reparameterization technique called the Gumbel-Softmax trick [38] to deal with discrete action spaces. The pseudo code of the SQPG algorithm is given in Appendix A.
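The Gumbel-Softmax trick can be applied roughly as follows (an illustrative sketch using PyTorch's built-in gumbel_softmax with arbitrary logits and temperature, not the paper's exact setup): the relaxed one-hot sample stays differentiable, so the deterministic policy gradient can flow through the sampled discrete action.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)   # batch of 4, 5 discrete actions

# Differentiable relaxed one-hot sample; hard=True returns a one-hot vector in the
# forward pass while keeping the soft gradient (straight-through estimator).
action = F.gumbel_softmax(logits, tau=1.0, hard=True)

# Gradients flow back to the logits through the relaxed sample.
fake_q = (action * torch.arange(5.0)).sum()
fake_q.backward()
print(action[0], logits.grad is not None)
```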

5 Experiments

We evaluate the performance of SQPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction. The environments of Cooperative Navigation and Prey-and-Predator are from Mordatch and Abbeel [39], and Traffic Junction is from Sukhbaatar et al. [6]. Cooperative Navigation is originally a global reward game. To turn Traffic Junction into a global reward game, we modify the reward to the sum of each agent's reward. Besides, we let the prey in Prey-and-Predator be a random agent, treated as part of the environment, so that we only control the predators in the game. Additionally, all predators receive the sum of each agent's reward in the original environment. Hence, Prey-and-Predator also becomes a global reward game. In the experiments, we compare our algorithm with two independent algorithms, DDPG [33] and actor-critic (AC) [19], as well as two methods with a centralized critic, MADDPG [21] and COMA [10]. The independent algorithms are implemented with decentralized execution and decentralized training.

5.1 Cooperative Navigation

In this task, there are 3 agents and 3 targets. Each agent aims to move to a target, with no prior allocation of targets to agents. The state of each agent in this environment includes his current position and velocity, the displacement to the three targets, and the displacement to the other agents. The action space of each agent is move up, move down, move right, move left and stay. In this environment, collisions are not allowed: if a collision happens, the global reward is reduced by 1. Additionally, the global reward includes the negative sum of the distances between each target and the nearest agent to it.
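Under our reading of this description (not the environment's exact source code; the collision radius is a made-up constant), the global reward could be computed as in the following sketch: the negative sum over targets of the distance to the nearest agent, minus 1 for every colliding pair.

```python
import numpy as np

def global_reward(agent_pos, target_pos, collision_radius=0.1):
    """agent_pos, target_pos: arrays of shape (n, 2)."""
    # Negative sum of each target's distance to its nearest agent.
    dists = np.linalg.norm(agent_pos[None, :, :] - target_pos[:, None, :], axis=-1)
    reward = -dists.min(axis=1).sum()
    # Penalty of 1 for every pair of colliding agents.
    n = len(agent_pos)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(agent_pos[i] - agent_pos[j]) < collision_radius:
                reward -= 1.0
    return reward

print(global_reward(np.random.rand(3, 2), np.random.rand(3, 2)))
```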

(a) Mean reward per episode during training.
(b) Agents’ movement dynamics of scenario 1.
(c) Agents’ movement dynamics of scenario 2.
Figure 2: Fig.2(a) shows the mean rewards per episode during the training procedure for Cooperative Navigation with 3 agents. Fig.2(b) and Fig.2(c) record the trajectories of agents towards targets in two randomly selected scenarios for SQPG with a sample size of 1. Each agent starts from one of the coloured circles and moves along the corresponding dashed line to one of the three targets, drawn as small black circles.

As seen from Fig.2(a), every variant of SQPG outperforms the baselines in terms of final convergence performance. Additionally, SQPG with different sample sizes for approximating the Shapley Q-value converges faster than the baselines at different time points. As the sample size becomes larger, the approximation theoretically approaches the accurate Shapley Q-value faster, which explains why convergence starts earlier when the sample size is larger. Therefore, our result supports the argument that the local reward approach converges faster [35, 36]. Since SQPG with a sample size of 1 finally obtains the same performance as the other variants, we only run this variant in the remaining experiments to reduce computational complexity.

Fig.2(b) and 2(c) show the movement dynamics of agents in two randomly selected scenarios for SQPG with a sample size of 1. In different scenarios, each agent plays a different strategy to move to a target and avoid collisions with others. As we mentioned above, each agent knows the position of each target and the displacement to the other agents from the initial state. Therefore, each of them can decide on a target to move to such that collisions with other agents are avoided as much as possible. We believe that this is because the accurate credit assignment by the Shapley Q-value during training enables agents to estimate well the contribution of each decision to the team given the other agents' information. Thus, each agent can decide on a strategy for its later movements that maximizes the profits of the team.

5.2 Prey-and-Predator

In this task, there are three predators that we can control, and the prey is a random agent. The aim of the predators is to capture the prey in the fewest steps. The state of each predator contains his current position and velocity, the displacement to the prey and the other predators, and the velocity of the prey. The action space is the same as that defined in Cooperative Navigation. The global reward of the predators is the negative minimal distance between any predator and the prey. In addition, if the prey is caught by any predator, the global reward is increased by 10 and the episode terminates.

(a) Turns to capture prey per episode during training.
(b) Agents’ movement dynamics of scenario 1.
(c) Agents’ movement dynamics of scenario 2.
Figure 3: Fig.3(a) records the turns to capture the prey per episode during the training procedure for Prey-and-Predator with 3 predators. Fig.3(b) and Fig.3(c) show the trajectories of the predators and the prey for SQPG with a sample size of 1. Predators start from the red rectangles and end at the red circles, whereas the prey starts from the green rectangle and stops at the green circle.

As seen from Fig.3(a), SQPG leads throughout the training procedure, finally needing about 30 turns to capture the prey. Similar to Cooperative Navigation, we also select two random cases to show the movement dynamics of the agents. Fig.3(b) and 3(c) demonstrate the movement dynamics in two randomly selected scenarios for SQPG with a sample size of 1. As shown in Fig.3(b), the predators learn to surround the prey. In Fig.3(c), the predators still attempt to surround the prey, but in this scenario the prey is captured midway. Comparing these two scenarios, it is quite clear that at least one predator is in charge of moving towards the start point of the prey, while another predator is responsible for predicting the possible movement of the prey and moving towards the estimated position where it can catch the prey. This validates the hypothesis we stated for Cooperative Navigation that the Shapley Q-value is advantageous for maximizing the profits of the team.

5.3 Traffic Junction

In this task, cars move along predefined routes which intersect at one or more traffic junctions. At each time step, new cars are added to the environment with probability $p_{\text{arrive}}$, and the total number of cars is limited below $N_{\max}$. After a car finishes its mission, it is removed from the environment and may be sampled back onto a new route. Each car has a limited vision of 1, which means it can only observe the circumstances within the 3x3 region surrounding it. No communication between cars is permitted in our experiment, in contrast to other experiments on the same task [6, 23]. The action space of each car is gas and brake, and the reward function contains a linear time penalty proportional to $\tau$, where $\tau$ is the number of time steps that a car has been continuously active on the road in one mission. Additionally, if a collision occurs, the reward is reduced by 10. We evaluate the performance by the success rate, i.e., the proportion of steps in which no collision happens.
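Under our reading of this metric (an illustrative computation, not the benchmark's official evaluation code), the success rate can be computed as in this small sketch: a step counts as a success when no collision occurs in it, and the rate is the fraction of such steps.

```python
def success_rate(collisions_per_step):
    """collisions_per_step: list with the number of collisions at each step."""
    successes = [1 if c == 0 else 0 for c in collisions_per_step]
    return sum(successes) / len(successes)

# Example: collisions in 2 of 10 steps -> 80% success rate.
print(success_rate([0, 0, 1, 0, 0, 0, 2, 0, 0, 0]))   # 0.8
```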

Difficulty | Independent AC | Independent DDPG | COMA | MADDPG | SQPG
Easy | 65.01% | 93.08% | 93.01% | 93.72% | 93.26%
Medium | 67.51% | 84.16% | 82.48% | 87.92% | 88.98%
Hard | 60.89% | 64.99% | 85.33% | 84.21% | 87.04%
Table 1: Success rates on Traffic Junction, tested with 20, 40, and 60 steps per episode in easy, medium and hard versions respectively. The results are obtained by running each algorithm after training for 1000 episodes.

We compare our method with the baselines on the easy, medium and hard versions of Traffic Junction. The easy version consists of one junction of two one-way roads on a 7x7 grid with $p_{\text{arrive}} = 0.3$ and $N_{\max} = 5$. The medium version consists of one junction of two-way roads on a 14x14 grid with $p_{\text{arrive}} = 0.2$ and $N_{\max} = 10$. The hard version consists of four connected junctions of two-way roads on an 18x18 grid with $p_{\text{arrive}} = 0.05$ and $N_{\max} = 20$. From Tab.1, we can see that on the easy version all algorithms except Independent AC reach a success rate over 93%, since this scenario is not challenging. On the medium and hard versions, SQPG outperforms the baselines with a success rate of 88.98% on the medium version and 87.04% on the hard version. Moreover, the performance of SQPG significantly exceeds the no-communication results reported in [23]. This demonstrates that SQPG can also handle large-scale problems.

5.4 Discussion

In the experimental results, it is surprising that Independent DDPG achieves a good performance. The reason could be that a potential game can be solved by fictitious play [16] and DDPG is analogous to it, finding an optimal deterministic policy. However, the convergence rate is not guaranteed when the number of agents becomes large. The weaker performance of COMA could be due to its more complicated model, which makes convergence difficult in continuous control problems such as Cooperative Navigation and Prey-and-Predator.

6 Conclusion

We introduce cooperative game theory to extend the existing global reward game to a broader framework called the extended convex game (ECG). Under this framework, we rationalize the local reward approach and propose an algorithm named Shapley Q-value policy gradient (SQPG) to learn a kind of local reward called the Shapley Q-value, which is guaranteed to distribute the global reward fairly to each agent. We evaluate SQPG in various global reward game scenarios and show promising performance compared with the baselines. In future work, we plan to dynamically group the agents at each time step with theoretical guarantees and go beyond the restriction of the global reward game.

The authors thank Ms Yunlu Li for useful explanations on mathematics and Ms Jing Li for helpful discussions on cooperative game theory. Jianhong Wang is sponsored by EPSRC-UKRI Innovation Fellowship EP/S000909/1.

References

  • Keviczky et al. [2007] T. Keviczky, F. Borrelli, K. Fregene, D. Godbole, and G. J. Balas. Decentralized receding horizon control and coordination of autonomous vehicle formations. IEEE Transactions on control systems technology, 16(1):19–33, 2007.
  • Schuldt [2012] A. Schuldt. Multiagent coordination enabling autonomous logistics. KI-Künstliche Intelligenz, 26(1):91–94, 2012.
  • Koes et al. [2006] M. Koes, I. Nourbakhsh, and K. Sycara. Constraint optimization coordination architecture for search and rescue robotics. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 3977–3982. IEEE, 2006.
  • Ramchurn et al. [2010] S. D. Ramchurn, A. Farinelli, K. S. Macarthur, and N. R. Jennings. Decentralized coordination in robocup rescue. The Computer Journal, 53(9):1447–1461, 2010.
  • Chang et al. [2004] Y.-H. Chang, T. Ho, and L. P. Kaelbling. All learning is local: Multi-agent learning in global reward games. In Advances in neural information processing systems, pages 807–814, 2004.
  • Sukhbaatar et al. [2016] S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
  • Omidshafiei et al. [2018] S. Omidshafiei, D.-K. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How. Learning to teach in cooperative multiagent reinforcement learning. arXiv preprint arXiv:1805.07830, 2018.
  • Kim et al. [2019] D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi. Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations, 2019.
  • Panait and Luke [2005] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387–434, 2005.
  • Foerster et al. [2018] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zhang et al. [2018] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the 35th International Conference on Machine Learning, pages 5872–5881. PMLR, 2018.
  • Nguyen et al. [2018] D. T. Nguyen, A. Kumar, and H. C. Lau. Credit assignment for collective multiagent rl with global rewards. In Advances in Neural Information Processing Systems, pages 8102–8113, 2018.
  • Osborne and Rubinstein [1994] M. J. Osborne and A. Rubinstein. A course in game theory. MIT press, 1994.
  • Basar and Olsder [1999] T. Basar and G. J. Olsder. Dynamic noncooperative game theory, volume 23. Siam, 1999.
  • Shapley [1953] L. S. Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • Monderer and Shapley [1996] D. Monderer and L. S. Shapley. Potential games. Games and economic behavior, 14(1):124–143, 1996.
  • Chalkiadakis et al. [2011] G. Chalkiadakis, E. Elkind, and M. Wooldridge. Computational aspects of cooperative game theory. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(6):1–168, 2011.
  • Peleg and Sudhölter [2007] B. Peleg and P. Sudhölter. Introduction to the theory of cooperative games, volume 34. Springer Science & Business Media, 2007.
  • Konda and Tsitsiklis [2000] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
  • Buşoniu et al. [2010] L. Buşoniu, R. Babuška, and B. De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1, pages 183–221. Springer, 2010.
  • Lowe et al. [2017] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
  • Iqbal and Sha [2018] S. Iqbal and F. Sha. Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912, 2018.
  • Das et al. [2018] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau. Tarmac: Targeted multi-agent communication. arXiv preprint arXiv:1810.11187, 2018.
  • Hu et al. [1998] J. Hu, M. P. Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250. Citeseer, 1998.
  • Wolpert and Tumer [2002] D. H. Wolpert and K. Tumer. Optimal payoff functions for members of collectives. In Modeling complexity in economic and social systems, pages 355–369. World Scientific, 2002.
  • Shapley [1953] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
  • Fatima et al. [2008] S. S. Fatima, M. Wooldridge, and N. R. Jennings. A linear approximation method for the shapley value. Artificial Intelligence, 172(14):1673–1699, 2008.
  • Michalak et al. [2013] T. P. Michalak, K. V. Aadithya, P. L. Szczepanski, B. Ravindran, and N. R. Jennings. Efficient computation of the shapley value for game-theoretic network centrality. Journal of Artificial Intelligence Research, 46:607–650, 2013.
  • Faigle and Kern [1992] U. Faigle and W. Kern. The shapley value for cooperative games under precedence constraints. International Journal of Game Theory, 21(3):249–266, 1992.
  • Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Williams [1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Sutton et al. [1998] R. S. Sutton, A. G. Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  • Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Nash [1953] J. Nash. Two-person cooperative games. Econometrica: Journal of the Econometric Society, pages 128–140, 1953.
  • Balch et al. [1997] T. Balch et al. Learning roles: Behavioral diversity in robot teams. College of Computing Technical Report GIT-CC-97-12, Georgia Institute of Technology, Atlanta, Georgia, 73, 1997.
  • Balch [1999] T. Balch. Reward and diversity in multirobot foraging. In IJCAI-99 Workshop on Agents Learning About, From and With other Agents, 1999.
  • Silver et al. [2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • Jang et al. [2017] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • Mordatch and Abbeel [2018] I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Shapley [1971] L. S. Shapley. Cores of convex games. International journal of game theory, 1(1):11–26, 1971.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix

Appendix A Algorithm

1: Initialize actor parameters $\theta_i$ and critic (AMC) parameters $\omega_i$ for each agent $i$
2: Initialize target actor parameters $\theta_i^{-}$ and target critic parameters $\omega_i^{-}$ for each agent $i$
3: Initialize the sample size $M$ for approximating the Shapley Q-value
4: Initialize the rate $\tau$ for updating the target networks
5: Initialize the discount rate $\gamma$
6: for episode = 1 to D do
7:     Observe the initial state $s^1$ from the environment
8:     for t = 1 to T do
9:         For each agent $i$, select action $a_i^t$ according to the current policy $\mu_{\theta_i}$ and exploration noise
10:        Execute the joint action $\mathbf{a}^t$ and observe the global reward $r^t$ and the next state $s^{t+1}$
11:        Store $(s^t, \mathbf{a}^t, r^t, s^{t+1})$ in the replay buffer
12:        Sample a minibatch of G samples from the replay buffer
13:        Get the target joint action $\mathbf{a}'$ from the target actors for each sample
14:        Get the current actors' joint action for each sample
15:        for each agent $i$ do (this procedure can be implemented in parallel)
16:            Sample $M$ ordered coalitions $\mathcal{C}_i$
17:            for each sampled coalition $\mathcal{C}_i$ do (this procedure can be implemented in parallel)
18:                Order the stored joint actions by $\mathcal{C}_i$ and mask the irrelevant agents' actions
19:                Order the target joint actions by $\mathcal{C}_i$ and mask the irrelevant agents' actions
20:                Order the current actors' joint actions by $\mathcal{C}_i$ and mask the irrelevant agents' actions
21:            Get the ASQ $\hat{\Phi}_i(s', \mathbf{a}' ; \omega_i^{-})$ for each sample
22:            Get the ASQ $\hat{\Phi}_i(s, \mathbf{a} ; \omega_i)$ for each sample
23:            Update $\theta_i$ by the deterministic policy gradient according to Eq.10
24:        Set the target $y = r + \gamma \sum_{i} \hat{\Phi}_i(s', \mathbf{a}' ; \omega_i^{-})$ for each sample
25:        Update $\omega_i$ for each agent $i$ by minimizing the loss according to Eq.11
26:        Update the target network parameters for each agent $i$: $\theta_i^{-} \leftarrow \tau \theta_i + (1 - \tau) \theta_i^{-}$, $\omega_i^{-} \leftarrow \tau \omega_i + (1 - \tau) \omega_i^{-}$
Algorithm 1 Shapley Q-value Policy Gradient (SQPG)
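Line 26 of Algorithm 1 is the standard soft target update; a minimal sketch, assuming PyTorch modules for the behaviour and target networks and an illustrative update rate:

```python
import torch

def soft_update(target_net, net, rate=0.1):
    """theta_target <- rate * theta + (1 - rate) * theta_target (Algorithm 1, line 26)."""
    with torch.no_grad():
        for tp, p in zip(target_net.parameters(), net.parameters()):
            tp.mul_(1.0 - rate).add_(rate * p)

net = torch.nn.Linear(4, 2)
target_net = torch.nn.Linear(4, 2)
target_net.load_state_dict(net.state_dict())   # start from identical weights
soft_update(target_net, net, rate=0.1)
```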

Appendix B Proof

Lemma 1 (Shapley [40], Chalkiadakis et al. [17]).

1) Every convex game has a non-empty core. 2) If a solution $(\mathcal{CS}, \mathbf{x})$ is in the core of a characteristic function game $\langle \mathcal{N}, v \rangle$ and the payoff distribution scheme is efficient (see Section 3), then $\sum_{\mathcal{C} \in \mathcal{CS}} v(\mathcal{C}) \geq \sum_{\mathcal{C}' \in \mathcal{CS}'} v(\mathcal{C}')$ for every coalition structure $\mathcal{CS}'$.

Theorem 1.

With an efficient payoff distribution scheme, for an extended convex game (ECG), one solution in the core can certainly be found with the grand coalition, and the objective is $\max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N})$, which can lead to the maximal social welfare, i.e., $v^{*}(\mathcal{N}) \geq \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C})$ for every coalition structure $\mathcal{CS}$.

Proof.

The proof is as follows.

As we defined before, in an ECG, after coalitions are allocated each agent further maximizes the cumulative rewards of his coalition by the optimal policy. Now, we denote the optimal value of an arbitrary coalition as $v^{*}(\mathcal{C}) = \max_{\pi_{\mathcal{C}}} v^{\boldsymbol{\pi}}(\mathcal{C})$, where $\mathcal{C} \subseteq \mathcal{N}$. Similarly, we can define the optimal social welfare given an arbitrary coalition structure $\mathcal{CS}$ as $\sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C})$. If we rewrite the value function defined above in this way, the ECG can be reformulated as a CG. For this reason, we can directly use the results in Lemma 1 to complete the proof.

First, we aim to show that (i) in an ECG with an efficient payoff distribution scheme $\mathbf{x}$, $v^{*}(\mathcal{N}) \geq \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C})$ for any coalition structure $\mathcal{CS}$, and $(\{\mathcal{N}\}, \mathbf{x})$ is a solution in the core.

Suppose for the sake of contradiction that $(\{\mathcal{N}\}, \mathbf{x})$ is not in the core. By statement (1) in Lemma 1, the core is non-empty, so there must exist a coalition structure $\mathcal{CS}$ other than $\{\mathcal{N}\}$ such that some $(\mathcal{CS}, \mathbf{x}')$ is in the core. According to statement (2) in Lemma 1, since the CG is a subclass of characteristic function games, under an efficient payoff distribution scheme we get that

$\sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C}) \geq v^{*}(\mathcal{N}).$   (12)

On the other hand, because of the superadditivity property of the ECG, i.e.,

$v^{*}(\mathcal{C} \cup \mathcal{D}) \geq v^{*}(\mathcal{C}) + v^{*}(\mathcal{D}) \quad \text{for disjoint } \mathcal{C}, \mathcal{D} \subseteq \mathcal{N},$   (13)

we have that

$v^{*}(\mathcal{N}) \geq \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C}).$   (14)

By Eq.12 and 14, we get that

$v^{*}(\mathcal{N}) = \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C}).$   (15)

According to the condition of the efficient payoff distribution scheme, we can write:

$\sum_{i \in \mathcal{C}} x_i' = v^{*}(\mathcal{C}), \quad \forall \mathcal{C} \in \mathcal{CS},$   (16)
$\sum_{i \in \mathcal{N}} x_i' = \sum_{\mathcal{C} \in \mathcal{CS}} v^{*}(\mathcal{C}).$   (17)

By Eq.15, we get that

$\sum_{i \in \mathcal{N}} x_i' = v^{*}(\mathcal{N}).$   (18)

By Eq.18, it is obvious that we can always find a payoff distribution scheme for the grand coalition $\mathcal{N}$, namely $\mathbf{x} = \mathbf{x}'$. Since $(\mathcal{CS}, \mathbf{x}')$ is presumed to be in the core, $\mathbf{x}'$ must satisfy the conditions of the core. As a result, we derive that $(\{\mathcal{N}\}, \mathbf{x})$ (where $\mathbf{x} = \mathbf{x}'$) is a solution in the core, which contradicts the presumption we made, and thus proposition (i) holds.

Then, we aim to show that (ii) in an ECG with an efficient payoff distribution scheme, the objective is $\max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N})$.

The objective of a CG is to find a solution in the core. According to (i), this is equivalent to finding a solution in the core corresponding to the grand coalition $\mathcal{N}$. For this reason, we can write

$\max_{\boldsymbol{\pi}} \sum_{i \in \mathcal{N}} x_i = \max_{\boldsymbol{\pi}} v^{\boldsymbol{\pi}}(\mathcal{N}) \quad \text{(since efficiency gives } \sum_{i \in \mathcal{N}} x_i = v^{\boldsymbol{\pi}}(\mathcal{N}) \text{)}.$   (19)

Therefore, we prove (ii).

According to (ii), we can conclude that in an ECG the objective is to maximize $v^{\boldsymbol{\pi}}(\mathcal{N})$, i.e., the cumulative global rewards. However, an efficient payoff distribution scheme, e.g. the Shapley value, should be a precondition; otherwise the achieved value may fall below $v^{*}(\mathcal{N})$, the theoretically optimal value that can be found with an efficient payoff distribution scheme. ∎

Appendix C Non-cooperative Game Theory

To clarify why we are interested in the framework of cooperative game theory, let us begin with non-cooperative game theory. Non-cooperative game theory aims to solve problems in which agents are assumed to be selfish and intelligent [13]. In other words, each agent only considers maximizing his own reward. A Nash equilibrium is a local optimum in which no agent can achieve a greater reward by unilaterally deviating. Therefore, non-cooperative game theory is not well suited to modelling cooperative scenarios. One possible way to model cooperative scenarios is to construct a global function, e.g. a potential function [16], that replaces each individual reward function, such that each agent's objective is forced to be identical, i.e., a fully cooperative game. Even if each agent only cares about maximizing his own reward, cooperation can still be formed. Nonetheless, this framework has its limitations. Firstly, although in theory the potential function can be decomposed into individual rewards for each agent, it is difficult to find a proper decomposition under the non-cooperative game framework. Secondly, it is difficult to explain how a coalition is formed if we assume that only agents within a coalition can cooperate. These two defects of the non-cooperative game theoretical framework motivate us to investigate and introduce the framework of cooperative game theory.

Appendix D Experimental Settings

As for the experimental settings, because different environments involve different complexity and dynamics, we set different hyperparameters for each task. Except for COMA, which uses a GRU as its hidden layer, all other algorithms use an MLP as the hidden layer of the policy network. All policy networks use only one hidden layer. For the critic networks, every algorithm uses an MLP with one hidden layer. For each experiment, we keep the learning rate, entropy regularization coefficient, update frequency, batch size and number of hidden units identical across models. In the experiments, each agent has his own state in execution, while in training agents share states. The remaining details of the experimental settings are shown below. All models are trained by the Adam optimizer [41] with default hyperparameters.

d.1 Details of Cooperative Navigation

As introduced in Sec.5.1, Cooperative Navigation is an environment with 3 agents and 3 targets. Each agent aims to move to a target with no pre-assigned target and no collision. In this section, we provide the training details and extra results.

Training Details.

As mentioned above, we keep the hyperparameters fixed in order to achieve a fair comparison between algorithms. Tab.2 lists part of the hyperparameters used in Cooperative Navigation. The remaining hyperparameters, e.g., replay buffer size and gradient clipping, can be tuned specifically to achieve the best results for the different models.

Hyperparameter Value Description
hidden units 32 The number of hidden units for both policy and critic network
training episodes 2000 The number of training episodes
episode length 200 Maximum time steps per episode
discount factor 0.9 The importance of future rewards
update frequency for behaviour network 100 Behaviour network updates every # steps
learning rate for policy network 1e-3 Policy network learning rate
learning rate for critic network 1e-2 Critic network learning rate
update frequency for target network 200 Target network updates every # steps
target update rate 0.1 Target network update rate
entropy regularization coefficient 1e-2 Weight or regularization for exploration
batch size 32 The number of transitions for each update
Table 2: Table of hyperparameters for Cooperative Navigation.

Additional Results.

In this section, we show more movement dynamics in Fig.4.

Figure 4: Movement Dynamics of extra three randomly selected scenarios for Cooperative Navigation.

d.2 Details of Prey-and-Predator

As introduced in Sec.5.2, Prey-and-Predator is an environment with 4 agents, where 3 predators in the same alliance chase the prey. In this game, we only control the predators, which cooperate to catch the prey controlled by a random policy. In this section, we provide the training details and extra results.

Training Details.

As mentioned above, we keep the hyperparameters fixed in order to achieve a fair comparison between algorithms. Tab.3 lists part of the hyperparameters used in Prey-and-Predator. The remaining hyperparameters, e.g., replay buffer size and gradient clipping, can be tuned specifically to achieve the best results for the different models.

Hyperparameter Value Description
hidden units 128 The number of hidden units for both policy and critic network
training episodes 4000 The number of training episodes
episode length 200 Maximum time steps per episode
discount factor 0.99 The importance of future rewards
update frequency for behaviour network 100 Behaviour network updates every # steps
learning rate for policy network 1e-4 Policy network learning rate
learning rate for critic network 1e-3 Critic network learning rate
update frequency for target network 200 Target network updates every # steps
target update rate 0.1 Target network update rate
entropy regularization coefficient 1e-3 Weight or regularization for exploration
batch size 128 The number of transitions for each update
Table 3: Table of hyperparameters for Prey-and-Predator.

Additional Results.

In this section, we show more movement dynamics in Fig.5.

Figure 5: Movement Dynamics of extra three randomly selected scenarios for Prey-and-Predator.

d.3 Details of Traffic Junction

As introduced in Sec.5.3, there are three difficulty levels in Traffic Junction. At a higher difficulty level, the number of cars, the arrival probability, the number of traffic junctions and the grid size increase. Moreover, more entry points are added, so the environment becomes more complex as the level gets harder. For each entry point, there are multiple choices of routes for cars. We list the details of the setting of this experiment in Tab.4. Moreover, to give readers a clearer picture, we show Fig.6 to visualize the environment.

Difficulty | $p_{\text{arrive}}$ | $N_{\max}$ | Entry-Points # | Routes # | Two-way | Junctions # | Dimension
Easy | 0.3 | 5 | 2 | 1 | F | 1 | 7x7
Medium | 0.2 | 10 | 4 | 3 | T | 1 | 14x14
Hard | 0.05 | 20 | 8 | 7 | T | 4 | 18x18
Table 4: The settings of Traffic Junction for different difficulty levels. $p_{\text{arrive}}$ is the probability of adding an available car into the environment. $N_{\max}$ is the limit on the number of existing cars. Entry-Points # is the number of possible entry points for each car. Routes # is the number of possible routes starting from every entry point.
(a) Easy
(b) Medium
(c) Hard
Figure 6: Visualizations of traffic junction environment. The black points represent the available entry points. The orange arrows represent the available routes at each entry point. The green lines separate the two-way roads.

Training Details.

As mentioned above, we keep the hyperparameters fixed in order to achieve a fair comparison between algorithms. Tab.5 lists part of the hyperparameters used in Traffic Junction. The remaining hyperparameters, e.g., replay buffer size and gradient clipping, can be tuned specifically to achieve the best results for the different models.

Hyperparameter Easy Medium Hard Description
hidden units 128 128 128 The number of hidden units for both policy and critic network
training episodes 2000 5000 2000 The number of training episodes
episode length 50 50 100 Maximum time steps per episode
discount factor 0.99 0.99 0.99 The importance of future rewards
update frequency for behaviour network 25 25 25 Behaviour network updates every # steps
learning rate for policy network 1e-4 1e-4 1e-4 Policy network learning rate
learning rate for critic network 1e-3 1e-3 1e-3 Critic network learning rate
update frequency for target network 50 50 50 Target network updates every # steps
target update rate 0.1 0.1 0.1 Target network update rate
entropy regularization coefficient 1e-4 1e-4 1e-4 Weight or regularization for exploration
batch size 64 32 32 The number of transitions for each update
Table 5: Table of hyperparameters for Traffic Junction.

Additional Results.

We run 1000 episodes to evaluate the trained models and report the mean success rate. The mean success rate is defined as the mean of the success frequencies: if there is no collision at a step, the success frequency of that step is 1, otherwise it is 0. The evaluation results are reported in Tab.1. We also report the mean reward per episode during the training procedure to understand the learning performance, shown in Fig.7. From this figure, we can see that SQPG has a faster convergence rate than the baselines.

(a) Easy.
(b) Medium.
(c) Hard.
Figure 7: Mean rewards per episode during training process in different difficulties of traffic junction environment.

Appendix E Limitations on Extended Convex Game

In this paper, we proposed a framework built on cooperative game theory called the extended convex game (ECG). Although the ECG extends the framework of the global reward game defined upon non-cooperative game theory to a broader scope, the model has some limitations. Firstly, we have to assume that there is an oracle scheduling the coalitions initially; however, this oracle is difficult to realize in practice. Even if such an oracle could be implemented, the model still cannot handle problems with random perturbations, because the oracle assigns each agent to a coalition based on the environment it knows, and a perturbation exceeds this knowledge. To deal with this problem, we may investigate how to construct coalitions dynamically in future work. The intuitive idea is to let the oracle learn a policy for scheduling the coalitions from historical information. At each step, it uses the learned policy to divide the coalitions; then each agent acts within its coalition to maximize the social value of the coalition. This process can be repeated indefinitely. Nonetheless, guaranteeing convergence under the cooperative game theoretical framework for this complicated process could be a challenge.