1 Introduction
A multiagent system refers to a group of agents interacting in a shared environment, in which the agents perceive the environment and form policies conditioned on each other to accomplish a task [1]. It is widely used in fields such as robotics [2], distributed control [3], and energy management [4]. From the point of view of game theory [5], these tasks can be divided into fully cooperative, fully competitive and mixed stochastic games. Their complexity makes it difficult to design a fixed pattern to control the agents. A natural idea is to let the agents learn on their own, which leads to research on multiagent reinforcement learning. Multiagent reinforcement learning (MARL) [6] is the combination of multiagent systems and reinforcement learning (RL) [7]. Agents interact with a common environment, perform actions, receive rewards, and learn a joint optimal policy by trial and error for multiagent sequential decision making problems. In addition to problems such as sparse rewards and sample efficiency inherited from reinforcement learning, multiagent reinforcement learning encounters new problems such as the curse of dimensionality, a nonstationary environment, and multiple equilibria. Moreover, current mainstream algorithms tend to use a separate network for each agent, which poses a huge challenge to computing resources.
For the problem of a nonstationary environment, the usual practice is to use a centralized policy with global observation, transforming the problem into one of centralized multiagent control. However, this method runs into the curse of dimensionality and cannot solve the many tasks that require distributed deployment [8][9]. An improved approach is to use centralized training to stabilize the environment and decentralized execution for distributed deployment [10], which largely mitigates the nonstationary environment. But usually all agents in the system learn simultaneously, so each agent actually faces a moving-target learning problem: its own optimal policy changes as the other agents' policies change. Based on the idea of GAN [11], we propose an iterative update method to stabilize the nonstationary environment. We divide the agents into the current learning agent and the agents waiting to learn, fix the policies of the agents in the waiting list, and train only the current agent, which transforms the problem into a single-agent case. The learning role is then switched regularly so that each agent improves in turn.
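The rotation described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the names `n_agents`, `env_steps` and `switch_period` are our own.

```python
# Illustrative sketch of the iterative update scheme (hypothetical API):
# only one agent learns at a time; the others' policies stay frozen,
# and the learning role rotates after a fixed number of steps.
def iterative_update(n_agents, env_steps, switch_period):
    learner = 0  # index of the agent currently allowed to learn
    for step in range(env_steps):
        # ... collect transitions with all agents acting;
        # only `learner` would call its parameter update here,
        # while agents in the waiting list keep their policies fixed ...
        if (step + 1) % switch_period == 0:
            learner = (learner + 1) % n_agents  # rotate the learning role
    return learner
```

For example, with 3 agents, 7 environment steps, and a switching period of 3, the learning role changes twice and ends at agent 2.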
Regarding computing resources, current mainstream algorithms assign each agent an independent network. Memory usage therefore grows linearly with the number of agents, and each agent's action must be computed by its network separately, which greatly slows down interaction with the environment. In this paper, all agents are represented by one network, which greatly reduces the demand for computing resources, and interaction with the environment is accelerated by batch computation. To implement the iterative update strategy under this unified representation, inspired by continual lifelong learning [12], we design a value fixing method based on the Bellman Equation. In this way, the policies of the agents in the waiting list are held fixed as much as possible.
We compare our method with MADDPG [13]. Results show that in both fully cooperative and mixed cooperative-competitive environments, our method effectively mitigates nonstationarity and improves performance, while the wall-clock time spent by the algorithm is greatly reduced. The advantages become more obvious as the number of agents increases.
The remainder of this paper is organized as follows. Section 2 reviews related work on multiagent reinforcement learning. Section 3 presents the details of our method. The experimental setup and results are presented in Section 4. Sections 5 and 6 discuss future research directions and conclude the paper.
2 Related work
Multiagent reinforcement learning can be divided into centralized methods and decentralized methods according to specific tasks. We are concerned with methods that can extract decentralized policies for an agent to make decisions based on its own observation. In these tasks, centralized decisions are unattainable because global state and joint policy are unavailable.
This problem can be described as a Partially Observable Markov Decision Process (POMDP) [14]. Consider a POMDP with $N$ agents: $S$ is the set of global states, $O_1, \dots, O_N$ are the sets of observations of the agents, $A_1, \dots, A_N$ are their sets of actions, and $P: S \times A_1 \times \dots \times A_N \to S$ denotes the state transition function. The reward functions $r_i: S \times A_1 \times \dots \times A_N \to \mathbb{R}$ depend on the joint actions. Each agent's policy $\pi_i: O_i \to A_i$ together forms the joint policy $\pi = (\pi_1, \dots, \pi_N)$. Because the rewards of the agents depend on the joint policy, we denote the return of each agent $i$ as

$$J_i(\pi) = \mathbb{E}_{s_0, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_t^1, \dots, a_t^N)\right],$$

where $\gamma \in [0, 1)$ is the discount factor and $s_0$ is the initial state.
The goal is to find an equilibrium strategy that maximizes each agent's return. That is, we want the optimal joint policy $\pi^* = (\pi_1^*, \dots, \pi_N^*)$ such that for all agents $i$ and all alternative policies $\pi_i$,

$$J_i(\pi_i^*, \pi_{-i}^*) \ge J_i(\pi_i, \pi_{-i}^*),$$

where $\pi_{-i}^*$ denotes the policies of all agents except $i$.
At this point, the problem can be considered a stochastic game among multiple agents. This definition covers fully cooperative, fully competitive and mixed cooperative-competitive cases. In particular, in a fully cooperative stochastic game all agents share the same reward function, $r_1 = r_2 = \dots = r_N$, so the returns are also identical, $J_1 = J_2 = \dots = J_N$. For fully competitive and mixed cooperative-competitive cases, the reward function of each agent depends on the objective function given by the environment.
Motivated by the good performance of Q-learning in single-agent cases, Tan [15] introduced independent Q-learning (IQL) for multiagent reinforcement learning. This algorithm does not take nonstationarity into consideration; each agent learns a Q function independently. The independent approach hardly converges, and unsurprisingly its performance falls short of Q-learning [16] in single-agent cases.
Oliehoek [17] introduced the paradigm of centralized training and decentralized execution, which has become the mainstream framework for decentralized tasks. This approach uses critics with global observation and decentralized actors for policy execution: the critic can minimize the estimation error caused by the other agents, while the actor can make decentralized decisions. MADDPG [13] is the most popular algorithm in this paradigm. It extends DDPG [18] to the multiagent case; each agent has a critic with global observation to direct its partially observed actor. MADDPG agents are able to learn coordination strategies in cooperative, competitive and mixed environments. VDN [19] (value decomposition networks) establishes a link between centralized and decentralized reinforcement learning: it decomposes the centralized Q function and learns each agent's Q value separately. Based on VDN, QMIX [20] uses a network to learn a nonlinear function for value function decomposition. These methods can learn an accurate Q value for each agent, which mitigates nonstationarity; however, they are only suitable for fully cooperative environments.
COMA [21] combines this framework with a counterfactual baseline. The baseline is obtained by fixing the other agents' actions, which is somewhat similar to our method. The difference is that we fix the agents' policies rather than specific actions, and COMA fixes the other agents' actions to compute a baseline, whereas our purpose is to learn the current agent's policy.
In terms of computing resources, current mainstream algorithms simply sidestep the problem, but it becomes more severe as task complexity increases. In some complex games such as Pommerman [22], Quake III [23], and StarCraft, a continual league is created and a large number of competitors are trained using parallel algorithms such as population based training (PBT) [24], which poses a huge challenge to computing resources.
Focusing on the above problems, our method can greatly alleviate the nonstationarity and save computing resources.
3 Method
A stable agent reduces the nonstationarity in the learning problem of the other agents, which makes it easier to solve. We fix the policies of the agents in the waiting list and transform the problem into a single-agent case. Let $i$ denote the current learning agent and let $W$ be the waiting list, whose agents' policies are fixed. The critic has global observation, so its action-value function is $Q_i(s, a_1, \dots, a_N)$, where $s$ is the global state and $a_j$ is agent $j$'s action. For the actor we have the policy $\pi_i(o_i)$, where $o_i$ is the observation from agent $i$'s perspective. While agent $i$ is under training, let $\pi_W$ denote the joint policy of all agents in $W$; specifically, $\pi_W$ is not changed.
Further, in order to improve computational efficiency and save wall-clock time, we represent all agents by a single network and add first-person perspective information to distinguish the role of each agent. We therefore do not need to label agents or allocate separate parameters for each of them, and every agent can still make optimal decisions from its own perspective. The action-value function of the critic becomes $Q(s, e_i, a_1, \dots, a_N)$, where $e_i$ is agent $i$'s perspective information. For each actor, the observation is unchanged, so the policy remains $\pi(o_i)$. Here we use a deterministic policy, as in MADDPG. In this way, memory usage does not increase with the number of agents in the environment, and the actions output by the network can be computed in one batch instead of sequentially.
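This batched action computation can be sketched roughly as follows. This is our own illustration: the one-hot role encoding and the single linear layer stand in for the actual perspective features and MLP, whose details the text does not specify.

```python
import numpy as np

def batch_actions(weights, observations):
    """Compute all agents' actions in one forward pass through a shared net.

    observations: array of shape (n_agents, obs_dim); a one-hot "perspective"
    vector is appended so the shared network can distinguish agent roles.
    """
    n_agents = observations.shape[0]
    perspective = np.eye(n_agents)               # hypothetical role encoding
    x = np.concatenate([observations, perspective], axis=1)
    return np.tanh(x @ weights)                  # one batched matrix multiply

obs = np.random.randn(4, 6)          # 4 agents, observation dim 6
w = np.random.randn(6 + 4, 2) * 0.1  # shared parameters, action dim 2
actions = batch_actions(w, obs)      # shape (4, 2): all actions at once
```

Because the shared weights are applied to the whole observation matrix at once, the cost of computing actions no longer scales with one network call per agent.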
A critical problem is how to realize the iterative update method with this unified representation network. If all agents share a common network, it is not easy to fix the agents in the waiting list: when we update the parameters to improve one agent's policy, the others' policies also change. The iterative update method then fails, which is exactly what we want to avoid.
We view this problem through the lens of continual lifelong learning [12], where there are usually three types of methods: a) retraining with regularization to prevent catastrophic forgetting, b) extending the network to represent new tasks, and c) selective retraining with possible expansion. Retraining with regularization is not suitable for reinforcement learning, because the policy update conflicts with the regularization, which impedes policy improvement. Network expansion obviously conflicts with our goal of saving memory. Still, inspired by the ideas of lifelong learning, we focus on the Bellman Equation [25] and propose a value fixing method. For the current learning agent, the Q value target is computed by the Bellman Equation; for the agents in the waiting list, the Q value target is output by the critic directly. Concretely, we denote the experience replay buffer, which stores global states, observations, actions and rewards, as

$$\mathcal{D} = \{(s, o_1, \dots, o_N, a_1, \dots, a_N, r_1, \dots, r_N, s')\}.$$
For the current learning agent $i$, the Q value target follows the Bellman Equation:

$$y_i = r_i + \gamma\, Q'(s', e_i, a_1', \dots, a_N')\big|_{a_j' = \pi_j'(o_j')}.$$
Let $-i$ denote the other agents except $i$, so that $W = \{-i\}$. For each agent $j \in W$, the Q value target is directly output from the target network:

$$y_j = Q'(s, e_j, a_1, \dots, a_N),$$

where $\pi'$ is the target policy network and $Q'$ the target critic. The action-value function is updated by minimizing

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}}\big[(y - Q(s, e, a_1, \dots, a_N))^2\big],$$

where the samples cover both the current learning agent $i$ and the agents in $W$. The deterministic policy $\pi_\phi$ is updated by gradient ascent:

$$\nabla_\phi J \approx \mathbb{E}_{\mathcal{D}}\big[\nabla_\phi \pi_\phi(o_i)\, \nabla_{a_i} Q(s, e_i, a_1, \dots, a_N)\big|_{a_i = \pi_\phi(o_i)}\big].$$
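The two target rules can be sketched together as below. This is a minimal illustration under our own naming, with `q_target` standing in for the target critic; it is not the authors' code.

```python
# Sketch of the value fixing rule (hypothetical interfaces). The current
# learning agent i gets a Bellman target, while each agent j in the
# waiting list gets the target critic's own output as its target, so its
# value estimate (and hence its policy gradient) barely moves.
def td_targets(batch, i, q_target, gamma=0.95):
    """batch: dict with global state s, joint action a, per-agent rewards r,
    next state s2 and next joint action a2 (from the target policies)."""
    targets = {}
    for j in range(len(batch["r"])):
        if j == i:   # current learner: standard Bellman target
            targets[j] = batch["r"][j] + gamma * q_target(batch["s2"], j, batch["a2"])
        else:        # waiting list: pin the value at the target network's estimate
            targets[j] = q_target(batch["s"], j, batch["a"])
    return targets
```

Regressing the critic onto these targets leaves the waiting-list agents' values (nearly) unchanged, which is what keeps their policies approximately fixed under the shared network.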
In theory, when the other agents' policies are already optimal with respect to the current Q function, we have

$$\nabla_\phi J_j = \mathbb{E}\big[\nabla_\phi \pi_\phi(o_j)\, \nabla_{a_j} Q(s, e_j, a_1, \dots, a_N)\big|_{a_j = \pi_\phi(o_j)}\big] = 0, \quad j \in W,$$

which means their policies will not change. In practice, however, the Q function is represented by a neural network, a nonlinear function. Given the limited samples and the error of the gradient ascent [26] method, the policy gradient [27] cannot be held exactly at zero. Fortunately, we can usually guarantee a small norm of the gradient:

$$\big\|\nabla_\phi J_j\big\| \le \epsilon, \quad j \in W.$$

By contrast, the standard update uses the Bellman Equation for the agents $j \in W$ as well,

$$y_j = r_j + \gamma\, Q'(s', e_j, a_1', \dots, a_N')\big|_{a_k' = \pi_k'(o_k')},$$

which is the same as for the current learning agent and lets their policies keep drifting.
Though we can’t hold the equation accurately, this method can substantially fix the other agents’ polices which mitigate the nonstationarity effectively in practice. The pseudo code of the algorithm is in .
4 Experiments and Results
We evaluate the algorithm in fully cooperative and mixed cooperative-competitive environments. In order to assess the influence of the number of agents on the algorithm, we designed control groups by increasing the number of agents. All environments have continuous state and action spaces and are designed as follows.
4.1 Environments
4.1.1 Fully cooperative environments: Spread
Agents perceive the environment from their own perspectives and cooperate with each other to reach different destinations (the black points). If agents collide during movement, they are punished; that is, the agents must learn to reach the specified destinations without colliding with one another. We set up a simple environment with three agents (Spread_3) and a complex one with ten (Spread_10), as shown in figure 2.
4.1.2 Mixed cooperative-competitive environments: Predator-Prey
Agents are divided into predators and preys. The predators need to cooperate with each other to catch the preys, while each prey needs to find a way to escape. Two obstacles (black circles) are rendered at random positions to block the way. If any predator collides with a prey, the predators win; otherwise the preys win. We set up three-chase-one as the simple scene and six-chase-two as the complex scene, denoted Predator_3prey_1 and Predator_6prey_2 respectively, as shown in figure 3.
4.2 Results
We compare IU (Iterative Update) and IUUR (Iterative Update and Unified Representation) against MADDPG. IU uses only the iterative update method, while IUUR adopts both iterative update and unified representation. The network structure is consistent with MADDPG: a two-layer ReLU MLP with 64 units per layer. Using the Adam optimizer, we set the soft update parameter and train each model for 100,000 episodes. We only fine-tuned the new hyperparameters that control the frequency of iterative update. In our experiments, we set . Experiments show that our algorithm not only achieves good performance but also improves computational efficiency. The source code of our implementation is available online (https://github.com/DreamChaser128/IUURforMultiAgentReinforcementLearning).

4.2.1 Performance
We run five random seeds for each environment and compare the performance among MADDPG, IU and IUUR.
The results for the fully cooperative environments (Spread_3 and Spread_10) are shown in figure 4. In Spread_3, IUUR converges quickly; after 20,000 episodes it exceeds MADDPG and keeps rising steadily. Surprisingly, IU is inferior to MADDPG, though it still shows an upward trend at the maximum number of training steps. This may indicate that the nonstationarity in the simple environment is not particularly serious, and that the new hyperparameters impede the learning efficiency of the agents in the waiting list. In Spread_10, IUUR and IU both clearly exceed the performance of MADDPG, which indicates that as the number of agents increases, the nonstationarity gets worse and iterative update can effectively alleviate the problem.
For the mixed cooperative-competitive Predator-Prey environments, we set MADDPG vs MADDPG as the baseline, then replace either the predators or the preys with our IU and IUUR agents to compete with MADDPG. By comparing the rewards of our algorithms with those of the MADDPG agents, we can clearly assess the performance of each method.
Results for Predator_3prey_1 are shown in figure 5. When we replace the predators, IUUR outperforms MADDPG by a large margin, while IU performs slightly worse than MADDPG, which was contrary to our expectation. In theory, IU fixes the strategies of the agents in the waiting list exactly, whereas IUUR can only guarantee a small gradient norm and thus only alleviates the nonstationarity to some extent; in this sense we expected IU to be better than IUUR, but the experimental result shows the opposite. The reason may be the new hyperparameters: although IU stabilizes the environment, it also impedes the learning efficiency of the agents in the waiting list, much as in Spread_3. When we replace the prey, both IU and IUUR obtain better rewards, with IUUR achieving the highest.
Results for Predator_6prey_2 are shown in figure 6. When we replace the predators, IU outperforms MADDPG by a large margin, while IUUR performs worse than MADDPG. The reason is that as the number of agents increases, the nonstationarity in multiagent reinforcement learning gets more serious. For IU, each agent has its own network, and iterative update alleviates the nonstationarity effectively. For IUUR, although iterative update alleviates the nonstationarity, the unified representation introduces update error, which leads to inaccurate values and poorer performance. When we replace the preys, both IU and IUUR obtain better rewards, with IU achieving the highest. Overall, although there is a slight gap between IUUR and MADDPG, considering that IUUR learns faster and uses less memory, the difference in performance is still within an acceptable range. Moreover, we currently control the frequency of iterative update only through an empirically chosen hyperparameter, which plays a key role in performance and can be improved in future work.
4.2.2 Computational efficiency
We compared the differences in training time and in the speed of interaction with the environment. Tables 1 and 2 show the details for the different environments. All results were generated on a machine running Ubuntu with a 2.20 GHz Intel Xeon E5-2630 and 2 GeForce GTX 1080Ti graphics cards.
Obviously, IUUR saves a lot of time in both training and interaction by a large margin especially with more agents. That is to say, IUUR can be used in large scale multiagent reinforcement learning without the linear growth of wallclock time.
environment: spread_3

algorithm             Baseline   IU       IUUR
training time (h)     5.76       4.25     3.47
interaction time (s)  0.0011     0.0011   0.0007

environment: spread_10

algorithm             Baseline   IU       IUUR
training time (h)     34         31       28
interaction time (s)  0.0037     0.0036   0.0012

environment: Predator_3prey_1

algorithm             Baseline   IU       IUUR
training time (h)     6.4        3.6      1.7
interaction time (s)  0.0014     0.0015   0.001

environment: Predator_6prey_2

algorithm             Baseline   IU       IUUR
training time (h)     8.7        3.53     2.8
interaction time (s)  0.003      0.0029   0.0011
5 Discussion and Future Work
This paper proposes iterative update and unified representation. Iterative update is used to stabilize the environment, and batch computation under the unified representation is used to save memory and speed up interaction, which largely solves these two problems. The method does not affect decentralized execution or distributed deployment. In addition, although our experiments are based on MADDPG, the method is suitable for most multiagent algorithms, such as IQL, VDN, and QMIX. When combined with PBT or other algorithms for parallel training, the unified representation is particularly efficient.
This method also has some drawbacks. Generative Adversarial Networks are notoriously hard to train; the main difficulty lies in the balance between generator and discriminator, and many articles have studied this further [28][29]. Our iterative update has an analogous problem of balancing the capabilities of the agents, especially in mixed cooperative-competitive environments, and the balance must be adjusted cautiously in order to obtain strong agents. At present, we control the frequency of iterative update only through an empirically chosen hyperparameter, which is a research direction for the future. Another open problem is how to realize the iterative update method exactly under the unified representation network: the value fixing method based on the Bellman Equation can only guarantee a small gradient norm and cannot strictly hold the equation, which can be further improved in future work. In addition, due to limited computing resources, we have only scaled the number of agents to a certain extent; the method can be further verified in more complex environments in the future.
6 Conclusion
This paper presents an iterative update and unified representation method to address environmental nonstationarity and computational efficiency. The method greatly alleviates nonstationarity and outperforms MADDPG in both fully cooperative and mixed cooperative-competitive tasks, especially as the number of agents increases. At the same time, unified representation and batch computation exploit the tensor computation of neural networks, which effectively improves computing efficiency and avoids linear growth of the interaction time with the environment. In addition, our method is applicable to most multiagent reinforcement learning algorithms. Our future work will focus on the stability of iterative update training under unified representation and on applying this method to more complex tasks.
References

[1] Nikos Vlassis. A concise introduction to multiagent systems and distributed artificial intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, 1(1):1–71, 2007.
 [2] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.
 [3] Gerhard Weiss. Multiagent systems: a modern approach to distributed artificial intelligence. MIT press, 1999.
 [4] Martin Riedmiller, Andrew Moore, and Jeff Schneider. Reinforcement learning for cooperating and communicating reactive agents in electrical power grids. In Workshop on Balancing Reactivity and Social Deliberation in MultiAgent Systems, pages 137–149. Springer, 2000.
 [5] Ann Nowé, Peter Vrancx, and YannMichaël De Hauwere. Game theory and multiagent reinforcement learning. In Reinforcement Learning, pages 441–470. Springer, 2012.
 [6] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications - 1, pages 183–221. Springer, 2010.
 [7] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [8] Michael L Littman. Markov games as a framework for multiagent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
 [9] Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
 [10] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
 [11] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [12] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
 [13] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
 [14] Jason D Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
 [15] Ming Tan. Multiagent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.
 [16] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 [17] Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.
 [18] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [19] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
 [20] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
 [21] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multiagent policy gradients. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [22] Peng Peng, Liang Pang, Yufeng Yuan, and Chao Gao. Continual match based training in pommerman: Technical report. arXiv preprint arXiv:1812.07297, 2018.
 [23] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
 [24] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[25] Shige Peng. A generalized dynamic programming principle and Hamilton-Jacobi-Bellman equation. Stochastics: An International Journal of Probability and Stochastic Processes, 38(2):119–134, 1992.
 [26] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
 [27] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [28] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [29] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.