I. Introduction
Model-free reinforcement learning (RL) has proven a promising technique for sequential decision making in a wide range of applications such as robotic arm control [1], autonomous driving [2], and wireless communication anti-jamming [3]. In particular, combined with deep learning (DL) as a value function approximator, deep reinforcement learning (DRL) exhibits powerful capabilities in real-world control problems such as Go [4] and Atari games [5]. However, inefficient exploration restricts the application of RL, as the agent requires high sample complexity to learn an acceptable behavior policy [6]. Large sample sizes and long exploration times are unacceptable in practical applications.

One possible solution is to exploit the joint exploration of multi-agent RL (MARL) to accelerate the convergence of policy iteration. The strategy obtained by confederate exploration depends to a large degree on the collaborative approach among the agents. In the early stage of RL, multiple agents should coordinate their concurrent exploration of the state-action space of the common environment to avoid all agents exploring the same or similar subspace simultaneously, which would result in insufficient sample diversity. Stochastic-policy-based RL algorithms such as TRPO [7], PPO [8], and A3C [9] need to generate samples online to execute gradient descent, causing low efficiency and high-cost exploration. In particular, the A3C algorithm supports multi-agent asynchronous exploration (which to some degree demonstrates the benefits brought by multi-agent coordination), but involves a risk of failing to converge when implemented in concurrent MARL.
Value-based Q-learning [10], DQN [4], and algorithms derived from it, such as Double DQN (DDQN) [11], Dueling DQN [12], and prioritized experience replay [13], all perform well on discrete control problems. DQN has achieved beyond-human-level performance on Atari games. DDQN uses a separate target network to calculate the expected Q-value, alleviating the overestimation problem in value function approximation and effectively improving algorithm stability. Dueling DQN and prioritized experience replay modify the network architecture and the sampling scheme, respectively, and achieve better policy evaluation. However, these methods cannot effectively handle the continuous control problems that exist widely in the real world. When the action dimension is particularly high, the above algorithms may lose effectiveness. A high-resolution quantization of the action space can partially address this challenge, but the computational complexity becomes unacceptable.
The policy-based deterministic policy gradient (DPG) algorithm [14] directly outputs actions instead of approximating state-action values, and is thus naturally suitable for continuous control problems. The actor-critic (AC) algorithm deep deterministic policy gradient (DDPG) [15], which combines DPG with Q-function approximation, has outperformed other algorithms. Nonetheless, DDPG is sensitive to hyperparameters. The improved multi-agent DDPG (MADDPG) algorithm [16] allows each agent to learn its own behavior strategy, which may overfit to the other agents.

In this work, we leverage a novel communication method to guide MARL agents to cooperate. Inspired by the observation that many soldiers can effectively conduct military operations under the organization of an army leader, we introduce a commander to coordinate the behaviors of multiple RL agents. The state-action evaluation network (the critic in the AC framework) is a qualified candidate for such a task: all agents share one Q-value network called the Commander, and each maintains its own action network.¹

¹This is similar to the centralized critic in MADDPG, but our Commander uses guidance to predict the distribution of the reward. Meanwhile, the action networks of the agents are associated to different degrees according to the similarity of their experience pool priorities, as described in Section 4.
We consider the problem of insufficient information in the RL reward, which plays a role similar to that of labels in supervised learning. In the MARL setting, all agents cooperate to achieve the same purpose; thus we can estimate the reward distribution of different actions under the progressively optimal policy, and an evaluated value reflecting this distribution can promote more efficient exploration. To achieve this goal with acceptable computational complexity, we propose a prediction network called the PrecoderNet to forecast the reward of each state-action (s, a) pair. The output of the Commander is regarded as the target of the PrecoderNet, and the prediction of the PrecoderNet, called guidance, can subsequently indicate which region to explore. The PrecoderNet used to fit the reward distribution is also shared by all agents. The experimental results show that guidance based on prior experience enables multiple agents to effectively improve the efficiency of exploration and convergence in the process of collaborative learning.

Furthermore, we use a prioritized replay buffer to reduce computational complexity and accelerate exploration. Prioritized experience replay introduces the temporal-difference error (TD-error), the difference between the target network's and the estimated network's values, as the priority of samples to improve gradient descent. Meanwhile, the total priority stored in the root node of the tree-structured prioritized replay buffer can reflect whether an agent learns well or not. According to the similarity of different agents' experience priorities, some agents share their experience replay to reduce the correlation of samples in the experience pool and to avoid the bad trajectories already tried by other agents, so that exploration can focus on the effective subregion rather than the entire space after only a short learning period.
The proposed prior-experience-based multi-agent coordinative exploration with prioritized guidance, called MAEPG, is an extension of the AC algorithm. We evaluate it in the simulated Gym environment, where multiple agents effectively and efficiently explore a common environment to speed up the exploration process. Experimental results show the benefits of the prioritized guidance for MARL compared with DQN and MADDPG in both single-agent and multi-agent settings, with performance improving as the number of agents increases.
The rest of this paper is organized as follows. Section 2 briefly reviews previous literature on MARL. After an introduction to the RL background (Section 3), we describe the proposed algorithm in Section 4. The experimental results are given in Section 5, and Section 6 concludes the paper with some discussion of the approach. More information on the algorithm is included in the Appendix.
II. Related Work
MARL is used to solve strategy-learning problems that require multiple agents to collaborate or compete to accomplish complex tasks. Although behavioral criteria can be predefined, agents need to continually learn to find approximately optimal solutions for current tasks in complex environments, especially when the environment is likely to change [17]. Multiple agents interact with the environment to obtain feedback values that advance the learning of better strategies, which is of crucial significance for the development of RL.
When multiple agents simultaneously interact with a dynamic environment and each treats the others as part of its surroundings, the environment is non-stationary from the perspective of the current agent, because the strategies of the other agents are constantly changing during learning [18]. This randomness of the environment has catastrophic impacts on RL, resulting in slow or even no convergence of the policy. In response to this problem, the use of a regret bound to constrain the multi-agent environment and ensure convergence was discussed in [19] and [20]. However, this theoretical analysis does not work well in practical applications: in many cases, agents suspend learning due to the instability of the environment while they are still far away from the regret bound. MARL was improved based on game theory in [21] by using empirical game-theoretic analysis to calculate which strategy would be chosen in the multi-agent case. However, this method can lead to inconsistent learning speeds among agents, as a faster agent will exploit a slower one, causing overfitting.

The lack of effective communication mechanisms for multi-agent joint exploration is another problem. In [22]
, a large feedforward neural network was used to map the inputs of all agents to the action space, where each agent occupies a part of the input units and broadcasts its own hidden-layer information to the other agents during communication. As the number of agents increases, the communication overhead becomes too large and the stability of the algorithm decreases. A centralized evaluation network was used in [16] to control the behavior of all agents and achieved a certain improvement in some cooperation and competition tasks. However, each agent actually learns a separate behavioral policy, which may lead to suboptimal solutions to a certain extent. The minimax concept from game theory, based on [23], was used in [16] to limit the learning step size of the agents and strengthen algorithm stability, which meanwhile sacrificed the speed advantage that multi-agent exploration should have. If multiple agents could interact with each other more efficiently during learning, they should be able to achieve certain gains. For example, directly calling out to warn someone of danger is more effective than lighting a fire as a warning signal.

Another undeniable problem is the low information content of the RL reward compared with the labels of supervised learning [6]. Although the use of reward instead of labels greatly improves the flexibility and practical applicability of RL, it costs more resources in exploration, especially for MARL. When most agents receive feedback with insufficient information, the entire learning process becomes difficult. A method for increasing the information in the reward based on the latent state is proposed in [24], using the experience stored from previous trajectories as a representation of the reward to train a network that predicts the reward of a new state-action pair and adds it to the feedback value in logarithmic form. However, this requires numerous repetitive simulations, which is too costly in complex environments. Although multi-head output layers were used to reduce the risk of overfitting, the prediction is easily limited by the prior experience, resulting in insufficient generalization and increased network complexity.
The algorithm proposed in this paper combines the advantages of the above work and attempts to remedy the problems mentioned. Based on the idea of a centralized critic in [16], we use a critic network called the Commander to coordinate the behavior of each agent and add guidance, a more informative reward term that fits the distribution of the reward function during exploration. Inspired by [24], one shared network called the PrecoderNet is used for prediction, which increases the information in the reward values while reducing complexity. Furthermore, following [13], we add priorities to the experience replay, and the actor networks of the agents are united with varied contribution weights according to the similarity of their tree-structured experience pool priorities stored in the root nodes, so that an agent can effectively utilize the knowledge learned by others when it encounters a new state, without exploring the entire state-action space; this is closer to the way humans behave.
III. Preliminaries
Markov Decision Process (MDP): We model the multi-agent RL problem as an MDP consisting of M agents, in which the interaction between an agent and the environment can be represented by a quintuple ⟨S, A, r, γ, π⟩. S represents the state space and A the action space. The reward r: S × A → R is the feedback from the environment measuring the chosen action under the current state. γ ∈ [0, 1) is a discount factor that converts an infinite-horizon problem into one with a finite upper bound, so that the MDP can converge within finite steps. π represents the policy on which the agent's action selection depends, and the chosen action is a = π(s).
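The role of the discount factor γ can be made concrete with a small numerical sketch (the function below is ours, not from the paper; it shows the finite-horizon form of the discounted return):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t, computed backwards for stability.

    With gamma < 1 and bounded rewards |r| <= r_max, the infinite-horizon sum is
    bounded by r_max / (1 - gamma), which is why the MDP objective stays finite.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three unit rewards with gamma = 0.5: G = 1 + 0.5 * (1 + 0.5 * 1) = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```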
Deep Q-networks (DQN): DQN [4] approximates the value-based Q-learning state-action value function Q(s, a) with a deep neural network with parameters θ, where Q(s, a) is the expected discounted return of the current state-action pair. The goal of DQN is to drive the Q-value toward the target y = r + γ max_{a'} Q'(s', a') of the (s, a) pair, updating the Q-value via the Bellman equation of dynamic programming. Gradient descent is carried out after random sampling from the experience replay, and the action with the largest Q-value is selected with probability 1 − ε, or an action is selected uniformly at random with probability ε.

Deep Deterministic Policy Gradient (DDPG): DDPG [15] is an AC algorithm that uses a policy-based deterministic policy network, parameterized by θ^{A}, to generate a deterministic action a = μ(s | θ^{A}). DDPG updates the learned actor policy network by gradient ascent, using the Q-network of DQN as the critic, so as to maximize the output Q-value.
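The DQN target and ε-greedy selection described above can be sketched as follows (a minimal illustration with hypothetical Q-value arrays; function names are ours):

```python
import numpy as np

def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target y = r + gamma * max_a' Q'(s', a'); no bootstrap at terminal states."""
    return reward + (0.0 if done else gamma * float(np.max(next_q_values)))

def epsilon_greedy(q_values, epsilon, rng):
    """Pick argmax Q with probability 1 - epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
# y = 1.0 + 0.9 * max(0.5, 2.0, -1.0) = 2.8
y = dqn_target(1.0, np.array([0.5, 2.0, -1.0]), gamma=0.9)
# epsilon = 0 is purely greedy, so the argmax (index 1) is chosen
a = epsilon_greedy(np.array([0.1, 0.7, 0.3]), epsilon=0.0, rng=rng)
```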
For convenience, we also offer a summary of symbols and notations in Table 1.
IV. Algorithm
The proposed MAEPG is an extension of DDPG to the field of MARL; it enables better cooperative behaviors between agents and improves the exploration policy by increasing the information in the reward, thereby improving the efficiency of concurrent RL. Our algorithm consists of a multi-agent joint actor network with an improved prioritized experience replay R, a centralized guidance network called the PrecoderNet, and a centralized critic network called the Commander, as shown in Figure 1.
RL agents can explore only a part of the whole exploration space when the state-action dimension is high, especially in continuous control problems; that is, the partially observed environment restricts the performance of MARL agents. The experience replay in DQN and DDPG makes it possible for agents to continuously optimize based on previous trajectories stored in the buffer. As the number of agents increases, each agent learns complementary action policies in different subregions. Inspired by the intuition that myriad soldiers can better defend their comrades under a globally optimal operational strategy set by a general, all the cooperative agents share a centralized critic network called the Commander and maintain their own action networks, called ActorNets. At the same time, the partial observation of each agent is fed into the PrecoderNet to generate the guidance, which increases the information in the reward function, provides a more accurate direction, and accelerates the exploration so that it focuses on the effective subspace. Each ActorNet selects an action to interact with the environment under the guidance, and obtains the feedback and the new state from the surroundings. In the meantime, the Commander assesses the chosen action via the approximate Q-value. The sample priority is given by the TD-error in (1).
p = \left| r + \gamma Q'\big(s', \mu'(s' \mid \theta^{A'}) \mid \theta^{C'}\big) - Q(s, a \mid \theta^{C}) \right| + \epsilon \qquad (1)
where the small positive constant ε ensures a strictly positive priority. The latest transition is then stored with this priority in the prioritized experience replay [13]. Since a larger TD-error means a greater contribution when conducting gradient descent, the information used in the learning process is effectively increased.
IV-A. Improving the prioritized experience replay
The prioritized experience replay R (the blue cylinder in Fig. 1) considers the importance of different samples, describing it as the difference between the value assessed by the Commander and the obtained reward, as shown in (1). Sampling by priority (with mini-batch size N) exploits the samples with large TD-error to make gradient descent faster, while ensuring that all samples remain likely to be used (the effect of ε). The experience replay is stored in the form of a sum-tree to improve sampling efficiency and save storage space: the lowest-level leaf nodes store the transitions and priorities, while the remaining nodes store only the sum of the priorities of their children. Inspired by [25], and considering that the number of times a sample has been visited can also reflect its importance, we store the visit count of each leaf node in the tree-structured experience replay and update the priority via (2)
(2) 
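The sum-tree storage described above can be sketched as a minimal data structure (this is our own illustrative implementation, not the paper's code; for simplicity it assumes the capacity is a power of two and stores only priorities, not transitions):

```python
import random

class SumTree:
    """Minimal sum-tree: leaves hold sample priorities, internal nodes hold the sum
    of their children, so the root stores the total priority and proportional
    sampling is O(log n). Capacity is assumed to be a power of two."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # tree[1] is the root; leaves start at index `capacity`

    def update(self, idx, priority):
        """Set the priority of leaf `idx` and propagate the change up to the root."""
        i = idx + self.capacity
        delta = priority - self.tree[i]
        while i >= 1:
            self.tree[i] += delta
            i //= 2

    def total(self):
        return self.tree[1]

    def sample(self, rng):
        """Walk down from the root: go left if the random mass fits, else subtract
        the left subtree's mass and go right. Returns a leaf index drawn with
        probability proportional to its priority."""
        mass = rng.uniform(0.0, self.total())
        i = 1
        while i < self.capacity:
            left = 2 * i
            if mass <= self.tree[left]:
                i = left
            else:
                mass -= self.tree[left]
                i = left + 1
        return i - self.capacity

tree = SumTree(4)
for idx, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(idx, p)
```

Updating a leaf touches only the O(log n) nodes on its root path, which is why priority updates after each gradient step stay cheap even for large buffers.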
A Boltzmann distribution [26] over the agents can be defined according to the total priority stored in the root node of each agent's experience replay, to further take the different agents' contributions into consideration, as shown in (3), where P_i denotes the total priority stored in the root node of agent i's replay buffer.
w_i = \frac{\exp(P_i)}{\sum_{j=1}^{M} \exp(P_j)} \qquad (3)
Each agent uses (3) as a weight to improve the Commander's evaluation. That is, the loss function of the Commander, taking the contributions of different agents into account, is given by (4) and (5).
L(\theta^{C}) = \mathbb{E}\left[ w_i \left( y_i - Q(s_i, a_i \mid \theta^{C}) \right)^2 \right] \qquad (4)

y_i = r_i + \gamma Q'\big(s'_i, \mu'(s'_i \mid \theta^{A'}) \mid \theta^{C'}\big) \qquad (5)
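The Boltzmann contribution weights over agents can be sketched as a plain softmax over the root-node priority sums (function name and temperature parameter are our own illustrative choices):

```python
import math

def contribution_weights(root_priorities, temperature=1.0):
    """Boltzmann weights over agents: an agent whose replay buffer holds a larger
    total TD-error priority (its sum-tree root value) contributes more to the
    Commander's weighted loss. Weights sum to 1."""
    exps = [math.exp(p / temperature) for p in root_priorities]
    z = sum(exps)
    return [e / z for e in exps]

# Three agents with root priority sums 1, 2, 3: the third agent gets the largest weight
w = contribution_weights([1.0, 2.0, 3.0])
```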
The policy gradient can be written as (6) and (7):
\nabla_{\theta^{A}} J \approx \mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{C}) \big|_{a = \mu(s)} \, \nabla_{\theta^{A}} \mu(s \mid \theta^{A}) \right] \qquad (6)
(7) 
where the evaluated network A and the target network A′, parameterized by θ^{A} and θ^{A'} respectively, are used to mitigate the overestimation problem, following [4]. Meanwhile, the Commander also contains an evaluated network C and a target network C′, parameterized by θ^{C} and θ^{C'} respectively, as shown in Fig. 2. We soft update all the target networks by θ′ ← τθ + (1 − τ)θ′. In this way, the agents with greater TD-errors provide more information for the Commander's decision making, which in turn makes the Commander more comprehensive and efficient. We remark that similar methods of using the TD-error as a priority can also be used in supervised learning to enhance training efficiency. Overall, the improved prioritized experience replay leads to more coordinated concurrent learning in MARL.
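The soft (Polyak) update of the target networks can be sketched on a flat list of parameters (a minimal illustration; real networks would apply this per tensor):

```python
def soft_update(target, source, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta' for every
    parameter, so the target network trails the learned network slowly and
    stabilizes the bootstrapped targets."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

# With tau = 0.5 the target moves halfway toward the source in one step
target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.5)
```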
IV-B. Precoding the reward
A scalar reward signal evaluates the quality of each transition, and the agent has to maximize the cumulative reward over the course of interaction. The RL feedback (the reward) is less informative than in supervised learning, where the agent would be given the correct actions to take. It is still unclear whether the rewards predefined in the Gym and Atari environments are optimal. In MARL, we should pay more attention to the form of the reward because of the environmental instability caused by multiple agents and their cooperative or competitive interactions. In general, a reward carrying more information and constraints has positive effects on RL and makes RL more attractive for multi-agent learning.
For example, suppose the goal of an MDP task is to make a virtual person move forward in a simulated environment. Setting the reward to the forward speed causes the figure to swing wildly while moving forward, as shown in Figure 3. In MARL, such useless actions not only waste exploration resources but also greatly interfere with the direction of the other agents [27]. To make the figure's walking closer to that of a normal person, the modified reward is set as the difference between the virtual figure's posture and a human walking posture, and better results are obtained, as shown in Fig. 4, which indicates that an informative reward is significant for MARL.
In [24], representations of the latent state are used to correct the reward. Storing all previous state-action-reward values from numerous pre-experiments to train a predictive network is feasible but costs too much. We instead propose a PrecoderNet, parameterized by θ^{P}, that uses the output of the Commander as a label to estimate the reward of the current state-action pair, collecting (s, a) pairs from all agents for real-time training without pre-experiments, as shown in Fig. 2.
After the interaction between the agents and the environment, we input (s, a) into the PrecoderNet to obtain the guidance as a correction term, and a discount factor λ determines how much of the correction is used, as shown in (8).
\tilde{r} = r + \lambda \, P(s, a \mid \theta^{P}) \qquad (8)
(9) 
Then we perform a gradient descent update on the PrecoderNet as shown in (10):
L(\theta^{P}) = \mathbb{E}\left[ \left( Q(s, a \mid \theta^{C}) - P(s, a \mid \theta^{P}) \right)^2 \right] \qquad (10)
The gradient flows from the Commander through both the ActorNet and the PrecoderNet, and the final gradient of the Commander is the sum of the two gradient contributions from the ActorNet and the PrecoderNet. The multi-agent joint updates of the PrecoderNet and the Commander are therefore more efficient and faster. To sum up, the more informative reward leads to more efficient and effective exploration by all agents.
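The reward-precoding step can be summarized in two scalar sketches: the corrected reward that the agents actually receive, and the squared error that trains the PrecoderNet toward the Commander's value (function names and arguments are our own illustrative choices):

```python
def corrected_reward(env_reward, guidance, lam=0.1):
    """Augment the environment reward with the PrecoderNet guidance, scaled by the
    discount factor lam that controls how much the correction is trusted."""
    return env_reward + lam * guidance

def precoder_loss(q_commander, guidance):
    """Squared error between the Commander's Q-value (treated as the label) and the
    PrecoderNet's prediction; minimizing it fits the guidance to the reward distribution."""
    return (q_commander - guidance) ** 2

# With lam = 0.2, a guidance of 0.5 adds 0.1 to a unit environment reward
r_tilde = corrected_reward(1.0, 0.5, lam=0.2)
```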
V. Experiments
The proposed algorithm is evaluated in the Gym RL environment Pendulum-v0. The task is to keep a center-fixed pendulum upright [28] by applying appropriate torque, as shown in Figure 5. In this environment, the closer the non-positive reward is to zero, the closer the current state is to the ideal position. Since the action is a torque, its value is continuous and the state-action space to be explored is infinite. We aim to finish the task quickly through collaborative MARL and compare the proposed MAEPG against DQN and MADDPG as benchmarks.
V-A. Hyperparameters
In all experiments, we used the Adam optimizer [29] with a learning rate of 1e-3. A discount factor γ is applied when calculating the target Q-value, and τ is set to 0.005 for the soft update of all target networks. The ActorNet, Commander, and PrecoderNet all use four-layer neural networks with two hidden layers. The number of hidden-layer neurons is 180 in the ActorNet and 300 in the Commander and PrecoderNet. In the Pendulum task the action dimension is 1, so the input size of the ActorNet is 3 while the input size of the Commander and PrecoderNet is 4, and their output sizes are all 1. In the experiments, the guidance from the PrecoderNet is added to the reward after multiplication by the discount factor λ, and the number of agents is M = 1, 2, 3. In addition, Gaussian white noise with mean 0 and variance 0.1 is used to increase the diversity of exploration.
V-B. Results
We first compare the convergence times of DQN, MADDPG, and MAEPG with a single agent. In this experiment, the DQN baseline quantizes the action between the maximum and minimum torque values into 100 discrete values with a step of 0.04. For multi-agent DQN we designed a multi-head network structure: the first three layers (one input layer and two hidden layers) are shared by all agents, like the Commander in MAEPG, and the action of each agent is determined by its own output layer.
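The discretization used by the DQN baseline can be sketched as follows (the torque bounds of ±2 come from the Pendulum-v0 specification; the exact grid construction is our assumption, not the paper's code):

```python
import numpy as np

# 100 evenly spaced torque levels over Pendulum-v0's action range [-2, 2];
# the resulting step of 4/99 ~ 0.04 matches the quantization described in the text.
actions = np.linspace(-2.0, 2.0, 100)
step = actions[1] - actions[0]

def nearest_action(torque):
    """Map a continuous torque to the index of the closest discrete action."""
    return int(np.argmin(np.abs(actions - torque)))
```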
As shown in Figure 6, DQN fails to converge with only one agent, while MADDPG and MAEPG both eventually converge, at different speeds: MAEPG converges within about 220 episodes, whereas MADDPG needs about 340 episodes to reach the same convergence level. As shown in Fig. 7 and Fig. 8, when the agent number M is 2 or more, the multi-agent DQN still cannot learn a feasible behavior policy.
We then compare the performance of MADDPG and MAEPG with agent numbers M = 2 and 3. The convergence speed of MAEPG improves greatly as M increases, and MAEPG significantly outperforms MADDPG in terms of stability. The average rewards and gains attained when the algorithms converge are shown in Table 2; the gain, expressed as a percentage, represents the degree to which the average reward of MAEPG exceeds that of MADDPG. By considering the importance of different agents and samples to the overall learning progress in the multi-agent scenario, MAEPG learns faster and keeps the pendulum in the ideal vertically upward position with a smaller swing than MADDPG.
The experimental results show that the proposed MAEPG outperforms the benchmarks under collaborative MARL exploration. Further discussion of the experiments is presented in the Appendix.
VI. Conclusion
In this paper, we proposed a novel cooperative algorithm called MAEPG for multi-agent RL that achieves coordinated, efficient, and effective exploration by using knowledge learned by a centralized Commander and guidance derived from previous experience. In particular, we help the multiple agents communicate better by improving the prioritized experience replay (Section 4.1), whose priorities help the agents explore more efficiently. We also proposed a centralized precoder network that enriches the information in the reward of RL tasks (Section 4.2) and thereby accelerates the learning process in MARL. Our experiments demonstrate that the proposed algorithm outperforms existing methods in cooperative multi-agent environments. We remark that this algorithm can also be extended to supervised learning to speed up its training.
VII. Appendix A
For completeness, we provide the MAEPG algorithm as below.
VIII. Appendix B
In the experiments, we observed the behavioral trajectories of the two agents under the MADDPG and MAEPG algorithms. Fig. 8 and Fig. 9 show the learning curves of the two agents under MADDPG and MAEPG, respectively. The abscissa is the exploration time slot, and the ordinate is the action value at that time slot (the torque magnitude and direction in the pendulum experiment).
Since the DQN algorithm cannot converge in the multi-agent collaborative exploration environment, only the learning curves of MAEPG and MADDPG are compared. It can be seen from the figures that the policy learned by MAEPG converges faster, reaching the steady state of the pendulum, after which the torque is maintained at a small value; MADDPG takes much longer to converge and its final outputs remain relatively unstable.
References
 [1] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in IEEE International Conference on Robotics and Automation, 2017.
 [2] E. Perot, M. Jaritz, M. Toromanoff, and R. D. Charette, “End-to-end driving in a realistic racing game with deep reinforcement learning,” in Computer Vision and Pattern Recognition Workshops, 2017.
 [3] G. Han, X. Liang, and H. V. Poor, “Two-dimensional anti-jamming communication based on deep reinforcement learning,” in IEEE International Conference on Acoustics, 2017.
 [4] M. Volodymyr, K. Koray, and S. David, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [5] M. Guzdial and M. Riedl, “Toward game level generation from gameplay videos,” arXiv preprint arXiv:1602.07721, 2016.
 [6] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017.
 [7] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” Computer Science, pp. 1889–1897, 2015.
 [8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [9] V. Mnih, A. P. Badia, and M. Mirza, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
 [10] C. J. C. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
 [11] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Computer Science, 2015.
 [12] Z. Wang, T. Schaul, M. Hessel, and V. Hasselt, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
 [13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
 [14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
 [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” Computer Science, vol. 8, no. 6, p. A187, 2015.
 [16] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
 [17] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” Machine Learning Proceedings, pp. 330–337, 1993.
 [18] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, pp. 746–752, 1998.
 [19] M. H. Bowling, “Convergence and no-regret in multiagent learning,” in International Conference on Neural Information Processing Systems, 2004.
 [20] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International Joint Conference on Artificial Intelligence, vol. 17, pp. 1021–1026, Lawrence Erlbaum Associates Ltd, 2001.
 [21] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel, “A unified game-theoretic approach to multiagent reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 4190–4203, 2017.
 [22] A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi, “Deep communicating agents for abstractive summarization,” arXiv preprint arXiv:1803.10357, 2018.
 [23] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, “Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.
 [24] G. Vezzani, A. Gupta, L. Natale, and P. Abbeel, “Learning latent state representation for speeding up exploration,” arXiv preprint arXiv:1905.12621, 2019.
 [25] C. Dai, L. Xiao, X. Wan, and Y. Chen, “Reinforcement learning with safe exploration for network security,” in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3057–3061, IEEE, 2019.
 [26] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” arXiv preprint arXiv:1511.06342, 2015.
 [27] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne, “DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 41, 2017.
 [28] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
 [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.