Iterative Update and Unified Representation for Multi-Agent Reinforcement Learning

08/16/2019 ∙ by Jiancheng Long, et al. ∙ Nanchang University

Multi-agent systems have a wide range of applications in cooperative and competitive tasks. As the number of agents increases, nonstationarity becomes more severe in multi-agent reinforcement learning (MARL), which makes the learning process considerably harder. Moreover, current mainstream algorithms assign each agent an independent network, so memory usage grows linearly with the number of agents and interaction with the environment slows down accordingly. Inspired by Generative Adversarial Networks (GAN), this paper proposes an iterative update method (IU) to stabilize the nonstationary environment. Further, we add a first-person perspective and represent all agents with a single network, which turns the computation of agents' actions from sequential into batch computation. Borrowing ideas from continual lifelong learning, we realize the iterative update method in this unified representation network (IUUR). Iterative update greatly alleviates the nonstationarity of the environment, while unified representation speeds up interaction with the environment and avoids the linear growth of memory usage. Moreover, the method does not hinder decentralized execution or distributed deployment. Experiments show that compared with MADDPG, our algorithm achieves state-of-the-art performance and saves wall-clock time by a large margin, especially with more agents.

1 Introduction

A multi-agent system refers to a group of agents interacting in a shared environment, in which each agent perceives the environment and forms a policy conditioned on the others to accomplish a task[1]. It is widely used in different fields, such as robotics[2], distributed control[3], energy management[4], etc. From the point of view of game theory[5], these tasks can be divided into fully cooperative, fully competitive and mixed stochastic games. This complexity makes it difficult to design a fixed pattern to control the agents. A natural idea is to let the agents learn on their own, which leads to research on multi-agent reinforcement learning.

Multi-agent reinforcement learning (MARL)[6] is the combination of multi-agent systems and reinforcement learning (RL)[7]. Agents interact with a common environment, perform actions, receive rewards, and learn a joint optimal policy by trial and error for multi-agent sequential decision-making problems. In addition to problems such as sparse rewards and sample efficiency inherited from single-agent reinforcement learning, multi-agent reinforcement learning faces new problems such as the curse of dimensionality, nonstationary environments and multiple equilibria. Moreover, current mainstream algorithms tend to use a separate network for each agent, which poses a considerable challenge to computing resources.

For the problem of nonstationary environments, the usual practice is to use a centralized policy and global observation to transform the problem into a single centralized-control problem, but this approach suffers from the curse of dimensionality and cannot handle tasks that require distributed deployment[8][9]. An improved approach is centralized training with decentralized execution[10]: centralized training stabilizes the environment and decentralized execution allows distributed deployment, which largely mitigates nonstationarity. However, all agents in the system usually learn simultaneously, so each agent actually faces a moving-target learning problem: its own optimal policy changes as the other agents' policies change. Based on the idea of GAN[11], we propose an iterative update method to stabilize the nonstationary environment. We divide the agents into the currently learning agent and the agents waiting to learn, fix the policies of the agents in the waiting list, and train only the current agent, so the problem reduces to a single-agent case. The learning agent is then switched regularly, and all agents improve gradually.
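
A minimal sketch of this rotation scheme is shown below (the function and variable names are our own illustrative assumptions, not the authors' implementation): only one agent performs gradient updates at a time, and the learning slot is switched every fixed number of episodes.

  # Sketch of the iterative update schedule (illustrative names only):
  # one agent learns at a time, and the learning slot rotates
  # every `switch_every` episodes.

  def learning_agent(episode, num_agents, switch_every):
      """Index of the agent currently allowed to update its policy."""
      return (episode // switch_every) % num_agents

  num_agents, switch_every = 4, 100
  for episode in range(1000):
      i = learning_agent(episode, num_agents, switch_every)
      # ... collect a trajectory with all agents acting ...
      # only agent i performs gradient updates this episode;
      # the policies of the agents in the waiting list stay fixed.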

Regarding computing resources, current mainstream algorithms assign each agent an independent network. Memory usage therefore increases linearly with the number of agents, and each agent's action has to be computed by its own network in turn, which greatly slows down interaction with the environment. In this paper, all agents are represented by a single network, which greatly reduces the demand for computing resources, and interaction with the environment is accelerated by computing all agents' actions in one batch, as sketched below. To implement the iterative update strategy in this unified-representation setting, inspired by continual lifelong learning[12], we design a value fixing method based on the Bellman equation, so that the policies of the agents in the waiting list are kept as fixed as possible.
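
The following sketch illustrates the batch-computation idea (PyTorch; the layer sizes, dimensions and names are illustrative assumptions, not the authors' released code): a single shared actor evaluates all agents' first-person observations in one forward pass instead of N sequential calls.

  # Sketch of batch action selection with one shared actor network.
  import torch
  import torch.nn as nn

  obs_dim, act_dim, num_agents = 16, 2, 10

  actor = nn.Sequential(            # one network shared by all agents
      nn.Linear(obs_dim, 64), nn.ReLU(),
      nn.Linear(64, 64), nn.ReLU(),
      nn.Linear(64, act_dim), nn.Tanh(),
  )

  # Per-agent networks would need N sequential forward passes.
  # With a shared network, stack the first-person observations
  # and run a single batched pass.
  obs = torch.randn(num_agents, obs_dim)   # each row: one agent's observation
  actions = actor(obs)                     # (num_agents, act_dim) in one call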

We compare our method with MADDPG[13]. Results show that in both fully cooperative and mixed cooperative-competitive environments, our method effectively mitigates nonstationarity and improves performance. At the same time, the wall-clock time spent by the algorithm is greatly reduced, and the advantages become more pronounced as the number of agents increases.

The remainder of this paper is organized as follows. Section 2 reviews related work on multi-agent reinforcement learning. Section 3 presents the details of our method. The experimental setup and results are illustrated in Section 4. Sections 5 and 6 discuss future research directions and conclude the paper.

2 Related work

Multi-agent reinforcement learning can be divided into centralized and decentralized methods according to the task. We are concerned with methods that can extract decentralized policies, so that each agent makes decisions based on its own observation. In such tasks, centralized decisions are unattainable because the global state and the joint policy are unavailable during execution.

This problem can be described as a Partially Observable Markov Decision Process (POMDP)[14]. Consider a POMDP with $N$ agents: $S$ is the set of global states, $O_i$ is the set of observations of agent $i$, $A_i$ is its set of actions, and $T: S \times A_1 \times \dots \times A_N \to S$ denotes the state transition function. $r_i: S \times A_1 \times \dots \times A_N \to \mathbb{R}$ are the reward functions based on the joint actions. Each agent has a policy $\pi_i: O_i \to A_i$, and the joint policy is $\pi = (\pi_1, \dots, \pi_N)$. Because the rewards of the agents depend on the joint policy, we denote the return of each agent as

$$R_i(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t, a_1^t, \dots, a_N^t) \,\middle|\, s_0, \pi\right],$$

where $\gamma$ is the discount factor and $s_0$ is the initial state.

The goal is to obtain an equilibrium strategy that maximizes each agent's return. We want the optimal joint policy $\pi^* = (\pi_1^*, \dots, \pi_N^*)$ such that, for every agent $i$ and every policy $\pi_i$,

$$R_i(\pi_i^*, \pi_{-i}^*) \;\geq\; R_i(\pi_i, \pi_{-i}^*),$$

where $\pi_{-i}$ denotes the policies of all agents except $i$.

At this point, the problem can be considered a stochastic game among multiple agents. This definition covers fully cooperative, fully competitive and mixed cooperative-competitive cases. In particular, in a fully cooperative stochastic game all agents share the same reward function, $r_1 = \dots = r_N$, and hence the same return, $R_1 = \dots = R_N$. For fully competitive and mixed cooperative-competitive cases, the reward function of each agent depends on the objective defined by the environment.

Motivated by the good performance of Q-learning in single-agent settings, Tan[15] introduced independent Q-learning (IQL) for multi-agent reinforcement learning. This algorithm does not take nonstationarity into consideration: each agent learns a Q-function independently. The independent approach hardly converges, and unsurprisingly its performance is not as good as that of Q-learning[16] in single-agent cases.

Oliehoek[17] introduced the paradigm of centralized training with decentralized execution, which has become the mainstream framework for decentralized tasks. This approach uses centrally observed critics during training and decentralized actors for policy execution: the critic reduces the estimation error caused by other agents, while the actor makes decentralized decisions. MADDPG[13] is the most popular algorithm in this paradigm. It extends DDPG[18] to multi-agent settings; each agent has a critic with global observation that directs its partially observed actor. MADDPG agents are able to learn coordination strategies in cooperative, competitive and mixed environments.

VDN[19] (value decomposition networks) establishes a link between centralized and decentralized reinforcement learning. It decomposes the centralized Q function and learns each agent's Q value separately. Building on VDN, QMIX[20] uses a mixing network to learn a nonlinear value function decomposition. These methods can learn an accurate Q value for each agent, which mitigates nonstationarity, but they are only applicable to fully cooperative environments.
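
As a concrete illustration of the decomposition idea, the sketch below shows VDN-style additive mixing (a minimal sketch under our own assumptions about network sizes and discrete actions; it is not the cited implementations): the joint value is simply the sum of the per-agent utilities, trained end-to-end on the team reward.

  # Sketch of VDN-style additive value decomposition (illustrative only).
  import torch
  import torch.nn as nn

  class AgentQ(nn.Module):
      """Per-agent utility Q_i(o_i, .) over discrete actions."""
      def __init__(self, obs_dim, n_actions):
          super().__init__()
          self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                   nn.Linear(64, n_actions))

      def forward(self, obs):
          return self.net(obs)

  def q_total(agent_qs, observations, actions):
      # Q_tot(s, a) = sum_i Q_i(o_i, a_i): sum each agent's
      # chosen-action utility to obtain the joint value.
      per_agent = [q(o).gather(-1, a.unsqueeze(-1)).squeeze(-1)
                   for q, o, a in zip(agent_qs, observations, actions)]
      return torch.stack(per_agent, dim=0).sum(dim=0)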

COMA[21] combines this framework with a counterfactual baseline. The baseline is obtained by fixing the other agents' actions, which is somewhat similar to our method. The difference is that we fix the other agents' policies rather than their specific actions, and COMA fixes the actions only to compute the baseline, whereas our purpose is to learn the current agent's policy.

In terms of computing resources, current mainstream algorithms simply sidestep the problem, but it becomes more severe as task complexity increases. In complex games such as Pommerman[22], Quake III[23] and StarCraft, a continuous league is created and a large number of competitors are trained with parallel algorithms such as population based training (PBT)[24], which poses a huge challenge to computing resources.

Focusing on the above problems, our method can greatly alleviate the nonstationarity and save computing resources.

(a) common method
(b) IU
(c) IUUR
Figure 1: (a) Common method: all the agents are updated simultaneously. (b) IU: the agents are updated iteratively. (c) IUUR: all the agents are represented in a unified network and each agent is updated iteratively.

3 Method

A stable agent reduces the nonstationarity in the learning problem of the other agents, which makes it easier to solve. We fix the policies of the agents in the waiting list and transform the problem into a single-agent case. Let $W$ denote the waiting list; the policies of the agents in $W$ are fixed. The critic has global observation, so the action-value function of agent $i$ is $Q_i(s, a_1, \dots, a_N)$, where $s$ is the global state and $a_i$ is agent $i$'s action. For the actor we have the policy $\pi_i(o_i)$, where $o_i$ is agent $i$'s own observation (its first-person perspective). When agent $i$ is under training, let $\pi_W$ denote the joint policy of all agents in $W$; specifically, $\pi_W$ is kept unchanged.

Further, in order to improve computational efficiency and save wall-clock time, we represent all agents by a single network and add first-person perspective information to distinguish the role of each agent. We therefore do not need to label each agent or allocate separate parameters for it, and each agent can still make optimal decisions based on its own perspective. The action-value function of the shared critic becomes $Q(s, a_1, \dots, a_N, o_i)$, where $o_i$ is agent $i$'s perspective. For each actor the observation is unchanged, so the policy remains $\pi(o_i)$. As in MADDPG, we use deterministic policies. In this way, memory usage does not increase with the number of agents in the environment, and the actions output by the network can be computed in batch instead of sequentially.
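
To make the shared critic concrete, the following is a minimal sketch (PyTorch; the dimensions, concatenation order and module names are our own illustrative assumptions, not the authors' released code) of evaluating one critic for every agent in a single batched call by appending each agent's first-person observation to the global state and joint action.

  # Sketch of a shared critic evaluated for all agents in one batch.
  import torch
  import torch.nn as nn

  state_dim, obs_dim, act_dim, num_agents = 32, 16, 2, 10
  in_dim = state_dim + num_agents * act_dim + obs_dim

  critic = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))

  s = torch.randn(1, state_dim)                    # global state
  joint_a = torch.randn(1, num_agents * act_dim)   # concatenated joint action
  obs = torch.randn(num_agents, obs_dim)           # first-person observations

  # Tile the shared (state, joint action) part and append each agent's
  # perspective, so one forward pass yields Q(s, a, o_i) for every i.
  shared = torch.cat([s, joint_a], dim=-1).expand(num_agents, -1)
  q_values = critic(torch.cat([shared, obs], dim=-1))   # shape (num_agents, 1)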

A critical problem is how to realize the iterative update method in this unified representation network. If all agents share a common network, it is not easy to fix the agents in the waiting list: when we update the parameters to improve one agent's policy, the other agents' policies change as well. The iterative update method then fails, which is exactly what we want to avoid.

We view the problem from the perspective of continual lifelong learning[12], where there are usually three types of methods: a) retraining with regularization to prevent catastrophic forgetting, b) extending the network to represent new tasks, and c) selective retraining with possible expansion. Retraining with regularization is not suitable for reinforcement learning because the policy update conflicts with the regularization, which impedes policy improvement. Network expansion is obviously opposed to our goal of saving memory. Still, inspired by the idea of lifelong learning, we focus on the Bellman equation[25] and propose a value fixing method: for the currently learning agent, the target Q value is computed by the Bellman equation; for the agents in the waiting list, the target Q value is output by the critic directly. Concretely, the experience replay buffer $\mathcal{D}$ stores transitions containing the global state, the observations, the joint actions and the rewards:

$$(s,\, o_1, \dots, o_N,\, a_1, \dots, a_N,\, r_1, \dots, r_N,\, s',\, o'_1, \dots, o'_N).$$

For the currently learning agent $i$, the target Q value follows the Bellman equation:

$$y_i = r_i + \gamma\, Q'\big(s', a'_1, \dots, a'_N, o'_i\big)\Big|_{a'_j = \pi'(o'_j)},$$

where $\pi'$ is the target policy network. Let $W$ denote the other agents, i.e., all agents except $i$. For every agent $j \in W$, the target Q value is output by the target critic directly:

$$y_j = Q'\big(s, a_1, \dots, a_N, o_j\big).$$

The action-value function is updated by minimizing

$$\mathcal{L}(\theta^{Q}) = \mathbb{E}\big[\big(Q(s, a_1, \dots, a_N, o_k) - y_k\big)^2\big],$$

where the sampled perspectives $k$ cover both agent $i$ and the agents in $W$. The deterministic policy $\pi$ is updated by gradient ascent:

$$\nabla_{\theta^{\pi}} J \approx \mathbb{E}\big[\nabla_{\theta^{\pi}} \pi(o_i)\, \nabla_{a_i} Q(s, a_1, \dots, a_N, o_i)\big|_{a_i = \pi(o_i)}\big].$$
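
The sketch below illustrates this value-fixing target computation (PyTorch; the tensor layout, function names and the `target_actor`/`target_critic` callables are our own assumptions for illustration, not the authors' released code): the learning agent gets an ordinary Bellman target, while the waiting agents' targets are the target critic's own outputs.

  # Sketch of value-fixing targets: Bellman target for the learning agent,
  # the target critic's own output for the agents in the waiting list.
  import torch

  def compute_targets(batch, i, target_actor, target_critic, gamma=0.95):
      """batch holds s, obs, actions, rewards, next_s, next_obs for all agents."""
      s, obs, acts, rew, next_s, next_obs = batch
      num_agents = obs.shape[0]

      with torch.no_grad():
          next_acts = target_actor(next_obs)        # batched next actions
          targets = []
          for j in range(num_agents):
              if j == i:
                  # learning agent: ordinary Bellman backup
                  y = rew[j] + gamma * target_critic(next_s, next_acts, next_obs[j])
              else:
                  # waiting agent: target equals the target critic's current output,
                  # so training toward it leaves its value (and policy) nearly fixed
                  y = target_critic(s, acts, obs[j])
              targets.append(y)
      return torch.stack(targets)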

In theory, when the other agents' policies are already optimal with respect to the current Q function, we have

$$\nabla_{\theta^{\pi}} J_j = \mathbb{E}\big[\nabla_{\theta^{\pi}} \pi(o_j)\, \nabla_{a_j} Q(s, a_1, \dots, a_N, o_j)\big|_{a_j = \pi(o_j)}\big] = 0, \quad j \in W,$$

which means their policies will not change. In practice, however, the Q function is represented by a neural network, which is a nonlinear function approximator. Given the limited samples and the error of the gradient ascent[26] method, the policy gradient[27] cannot be held exactly at zero. Fortunately, value fixing usually keeps the norm of this gradient small:

$$\big\| \nabla_{\theta^{\pi}} J_j \big\| \approx 0, \quad j \in W.$$

By contrast, if the Bellman equation were also used for the agents $j \in W$, their update would be the same as that of the learning agent,

$$y_j = r_j + \gamma\, Q'\big(s', a'_1, \dots, a'_N, o'_j\big)\Big|_{a'_k = \pi'(o'_k)},$$

and their values, and hence their policies, would keep moving. Though the zero-gradient condition cannot be held exactly, value fixing substantially freezes the other agents' policies, which effectively mitigates nonstationarity in practice. The pseudocode of the algorithm is given in Algorithm 1.

  Initialize the critic Q and the actor π with random weights θ^Q and θ^π
  Initialize the target networks Q' and π' with weights θ^{Q'} ← θ^Q, θ^{π'} ← θ^π
  Initialize the replay buffer D
  Learning agent i ← 1
  for episode = 1, …, M do
     Initialize a random process N for action exploration
     Receive the initial state s and observations o_1, …, o_N
     if episode mod k = 0 then
        i ← (i mod N_agents) + 1   // switch the learning agent
     end if
     for t = 1, …, T do
        For each agent j, select action a_j = π(o_j) + N_t
        Execute actions a = (a_1, …, a_N)
        Get rewards r, new state s' and observations o'_1, …, o'_N
        Store (s, o, a, r, s', o') in the replay buffer D
        s ← s', o ← o'
        Sample a minibatch of transitions from D
        Set y_i = r_i + γ Q'(s', a'_1, …, a'_N, o'_i) with a'_j = π'(o'_j)
        Set y_j = Q'(s, a_1, …, a_N, o_j) for all j ≠ i   // value fixing
        Update the critic by minimizing Σ_j (Q(s, a, o_j) − y_j)^2
        Update the actor by the deterministic policy gradient on agent i's samples
        Soft update the target networks: θ' ← τθ + (1 − τ)θ'
     end for
  end for
Algorithm 1 IUUR

4 Experiments and Results

We evaluate the algorithm in fully cooperative and mixed cooperative-competitive environments. In order to examine the influence of the number of agents on the algorithm, we designed control groups with an increased number of agents. All environments have continuous state and action spaces and are described as follows.

4.1 Environments

4.1.1 Fully-cooperative environments: Spread

Agents perceive the environment from their own perspective and cooperate with each other to reach different destinations (the black points). Agents are punished if they collide during movement, so they must learn to reach the specified destinations without colliding with each other. We set up a simple environment with three agents (Spread_3) and a complex environment with ten agents (Spread_10), as shown in Figure 2:

(a) Spread_3

(b) Spread_10
Figure 2: Spread environments. Agents should reach different destinations (the black points) without colliding with each other.

4.1.2 Mixed cooperative-competitive environments: Predator-Prey

Agents are divided into predators and preys. The predators need to cooperate with each other to catch the preys, while each prey tries to escape. Two obstacles (black circles) are rendered at random positions to block the way. If any predator collides with a prey, the predators win; otherwise the prey wins. We set up a simple scenario with three predators chasing one prey and a complex scenario with six predators chasing two preys, denoted Predator_3-Prey_1 and Predator_6-Prey_2 respectively, as shown in Figure 3.

(a) Predator_3-Prey_1

(b) Predator_6-Prey_2
Figure 3: Predator-Prey environments. If any predator collides with a prey, the predators get the better return and win the game; otherwise the prey escapes successfully and gets the better return.

4.2 Results

We compare IU (Iterative Update) and IUUR (Iterative Update and Unified Representation) against MADDPG. IU uses only the iterative update method, while IUUR adopts both iterative update and unified representation. The network structure is consistent with MADDPG: a two-layer ReLU MLP with 64 units per layer. We use the Adam optimizer with soft target updates (parameter τ) and train each model for 100,000 episodes. We only fine-tuned the new hyperparameter k, which controls the frequency of the iterative update. Experiments show that our algorithm not only achieves good performance but also improves computational efficiency. The source code of our implementation is available online (https://github.com/DreamChaser128/IUUR-for-Multi-Agent-Reinforcement-Learning).
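
For reference, the sketch below reflects the network setup described above (PyTorch; the input/output dimensions, learning rate and the value of τ are illustrative assumptions, not values reported by the paper).

  # Sketch of the two-layer 64-unit ReLU MLP, Adam optimizer and soft update.
  import torch
  import torch.nn as nn

  def mlp(in_dim, out_dim):
      # two hidden layers of 64 ReLU units, as in MADDPG
      return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                           nn.Linear(64, 64), nn.ReLU(),
                           nn.Linear(64, out_dim))

  actor, critic = mlp(16, 2), mlp(68, 1)
  actor_target = mlp(16, 2)
  actor_target.load_state_dict(actor.state_dict())
  optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

  def soft_update(target, source, tau=0.01):
      # θ' ← τ θ + (1 − τ) θ'
      for tp, sp in zip(target.parameters(), source.parameters()):
          tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)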

4.2.1 Performance

We run five random seeds for each environment and compare the performance among MADDPG, IU and IUUR.

(a) Spread_3
(b) Spread_10
Figure 4: Agents reward curves on fully-cooperative environments Spread_3 and Spread_10.

The results for the fully cooperative environments (Spread_3 and Spread_10) are shown in Figure 4. In Spread_3, IUUR converges quickly; after 20,000 episodes it exceeds MADDPG and maintains a steady rise. Surprisingly, IU is inferior to MADDPG, though it still shows an upward trend at the end of training. This may indicate that nonstationarity is not particularly serious in the simple environment, while the new hyperparameter may impede the learning efficiency of the agents in the waiting list. In Spread_10, IUUR and IU both clearly exceed MADDPG, which indicates that as the number of agents increases, nonstationarity gets worse and iterative update effectively alleviates the problem.

(a) predator comparison
(b) prey comparison
Figure 5: Performance comparison in Predator_3-Prey_1. In (a), the predators are replaced by IU and IUUR; IUUR outperforms MADDPG by a large margin, while IU performs slightly worse than MADDPG. In (b), both IU and IUUR obtain a better reward than MADDPG.

For the mixed cooperative-competitive Predator-Prey environments, we set MADDPG vs MADDPG as the baseline and then replace the predators or the preys with IU and IUUR to compete against MADDPG. By comparing the rewards of our algorithms with those of the MADDPG agents, we can clearly assess the performance of each method.

The results for Predator_3-Prey_1 are shown in Figure 5. When we replace the predators, IUUR outperforms MADDPG by a large margin, while IU performs slightly worse than MADDPG, which is contrary to our expectation. In theory, IU fixes the policies of the agents in the waiting list exactly, whereas IUUR can only guarantee a small gradient norm and thus only alleviates nonstationarity to some extent; in this sense IU should be better than IUUR, but the experiments show the opposite. The reason may again be the new hyperparameter: although IU stabilizes the environment, it also impedes the learning efficiency of the agents in the waiting list, which is very similar to what we observed in Spread_3. When we replace the prey, both IU and IUUR obtain a better reward than MADDPG, and IUUR achieves the highest reward.

(a) predator comparison
(b) prey comparison
Figure 6: Performance comparison in Predator_6-Prey_2. In (a), the predators are replaced by IU and IUUR; IU outperforms MADDPG by a large margin, while IUUR performs worse than MADDPG. In (b), both IU and IUUR perform much better than MADDPG.

The results for Predator_6-Prey_2 are shown in Figure 6. When we replace the predators, IU outperforms MADDPG by a large margin, while IUUR performs worse than MADDPG. The reason is that as the number of agents increases, nonstationarity becomes more serious. For IU, each agent has its own network and iterative update alleviates the nonstationarity effectively. For IUUR, although iterative update also alleviates nonstationarity, unified representation introduces an additional update error that makes the value estimates less accurate and degrades performance. When we replace the preys, both IU and IUUR obtain a better reward, and IU achieves the highest reward. Overall, although there is a slight gap between IUUR and MADDPG, IUUR learns faster and uses less memory, so the small difference in performance is still acceptable. Moreover, we only control the learning frequency of the iterative update hyperparameter empirically; it plays a key role in performance and can be improved in future work.

4.2.2 Computational efficiency

We compared the differences in training time and in the speed of interaction with the environment. Tables 1 and 2 show the details for the different environments. All results were generated on a machine running Ubuntu with a 2.20 GHz Intel Xeon E5-2630 CPU and two GeForce GTX 1080 Ti graphics cards.

Obviously, IUUR saves a large amount of time in both training and interaction, especially as the number of agents grows. That is to say, IUUR can be used in large-scale multi-agent reinforcement learning without a linear growth of wall-clock time.

Environment             Spread_3                     Spread_10
Algorithm               Baseline   IU      IUUR      Baseline   IU      IUUR
Training time (h)       5.76       4.25    3.47      34         31      28
Interaction time (s)    0.0011     0.0011  0.0007    0.0037     0.0036  0.0012
Table 1: Computational efficiency in fully-cooperative environments.
Environment             Predator_3-Prey_1            Predator_6-Prey_2
Algorithm               Baseline   IU      IUUR      Baseline   IU      IUUR
Training time (h)       6.4        3.6     1.7       8.7        3.53    2.8
Interaction time (s)    0.0014     0.0015  0.001     0.003      0.0029  0.0011
Table 2: Computational efficiency in mixed cooperative-competitive environments.

5 Discussion and Future Work

This paper proposes iterative update and unified representation. Iterative update is used to stabilize the environment, and batch computation through a unified network is used to save memory and speed up interaction, which largely addresses these two problems. The method does not affect decentralized execution or distributed deployment. In addition, although our experiments are based on MADDPG, the method is applicable to most multi-agent algorithms such as IQL, VDN and QMIX. When combined with PBT or other parallel training schemes, the unified representation is particularly efficient.

This method also has some drawbacks. Generative Adversarial Networks are notoriously hard to train; the main difficulty lies in the balance between the generator and the discriminator, and many works have studied this further[28][29]. In our iterative update, there is an analogous problem of balancing the capabilities of the agents, especially in mixed cooperative-competitive environments, and it must be adjusted carefully to obtain strong agents. At present, we only control the learning frequency of the iterative update hyperparameter empirically, which is a direction for future research. Another problem is how to realize the iterative update method within a unified representation network: the value fixing method based on the Bellman equation can only guarantee a small gradient norm and cannot strictly hold the equation, which can be further improved in future work. In addition, due to limited computing resources, we only scaled the number of agents to a certain extent; the method can be further verified in more complex environments in the future.

6 Conclusion

This paper presents an iterative update and unified representation method to address environmental nonstationarity and computational efficiency. The method greatly alleviates nonstationarity and outperforms MADDPG in both fully cooperative and mixed cooperative-competitive tasks, especially as the number of agents increases. At the same time, unified representation and batch computation exploit the tensor computation of neural networks, which effectively improves computational efficiency and avoids linear growth of the interaction time with the environment. In addition, our method is applicable to most multi-agent reinforcement learning algorithms. Future work will focus on the stability of iterative update training under unified representation and on applying the method to more complex tasks.

References