Prioritized Guidance for Efficient Multi-Agent Reinforcement Learning Exploration

07/18/2019 ∙ by Qisheng Wang, et al.

Exploration efficiency is a challenging problem in multi-agent reinforcement learning (MARL), as the policy learned by cooperative MARL depends on how the multiple agents collaborate. Another important problem is that the less informative reward restricts the learning speed of MARL compared with the informative labels in supervised learning. In this work, we leverage a novel communication method to guide MARL and accelerate exploration, and we propose a predictive network that forecasts the reward of the current state-action pair; the guidance learned by this predictive network is used to modify the reward function. An improved prioritized experience replay, which utilizes the temporal-difference (TD) error more effectively, is employed to better exploit the different knowledge learned by different agents. Experimental results demonstrate that the proposed algorithm outperforms existing methods in cooperative multi-agent environments. We remark that this algorithm can be extended to supervised learning to speed up its training.







I Introduction

Model-free reinforcement learning (RL) has proven to be a promising technique for sequential decision making in a wide range of applications such as robotic arm control [1], autopilot [2] and wireless communication anti-jamming [3]. In particular, combined with deep learning (DL) as the value function approximator, deep reinforcement learning (DRL) exhibits powerful capabilities in challenging control problems such as Go [4] and Atari games [5]. However, inefficient exploration restricts the application of RL, as the agent requires high sample complexity to learn an acceptable behavior policy [6]. Large numbers of samples and long exploration times are unacceptable in practical applications.

One possible solution is to exploit the joint exploration of multi-agent RL (MARL) to accelerate the convergence of policy iteration. The strategy obtained by cooperative exploration depends to a large degree on how the multiple agents collaborate. Multiple agents coordinate their concurrent exploration of the state-action space of the common environment in the early stage of RL, so that the agents do not all explore the same or similar subspaces simultaneously, which would result in insufficient sample diversity. Stochastic policy based RL algorithms such as TRPO [7], PPO [8] and A3C [9] need to generate samples online to execute gradient descent, which makes exploration inefficient and costly. In particular, the A3C algorithm supports multi-agent asynchronous exploration (which to some degree demonstrates the benefits of multi-agent coordination), but it risks failing to converge when implemented in concurrent MARL.

Value-based Q-learning [10], DQN [4] and the algorithms derived from them, such as Double DQN (DDQN) [11], Dueling DQN [12] and prioritized experience replay [13], all perform well on discrete control problems. DQN has achieved beyond-human level in both Go and Atari games. DDQN uses a separate target network to calculate the expected Q-value, which alleviates the overestimation problem in value function approximation and effectively improves the stability of the algorithm. Dueling DQN and prioritized experience replay modify the network architecture and the sampling scheme, respectively, and achieve better policy evaluation. However, these methods cannot effectively deal with the continuous control problems that exist widely in the real world. When the action dimension is particularly high, the above algorithms may lose effectiveness. A high-resolution quantization of the action space can partially address this challenge, but the computational complexity becomes unacceptable.

The policy-based deterministic policy gradient (DPG) algorithm [14] directly outputs an action instead of approximating the state-action value, and is thus naturally suitable for continuous control problems. The Actor-Critic (AC) algorithm deep deterministic policy gradient (DDPG) [15], which combines DPG with Q-function approximation, has outperformed other algorithms. Nonetheless, DDPG is sensitive to hyperparameters. The improved multi-agent DDPG (MADDPG) algorithm [16] allows each agent to learn its own behavior strategy, which may overfit to the other agents.

In this work, we leverage a novel communication method to guide MARL agents to cooperate. Inspired by the observation that many soldiers can effectively conduct military operations under the organization of an army leader, we introduce a commander to coordinate the behaviors of multiple RL agents. The state-action evaluation network (the Critic) in the AC algorithm is a qualified candidate for such a task, which means that all agents share one Q-value network called the Commander, and each agent maintains its own action network. (Footnote 1: This is similar to the centralized critic in MADDPG, but our Commander uses guidance to predict the distribution of reward. Meanwhile, the action networks of the agents are associated to different degrees according to the similarity of their experience pool priorities, as described in Section 4.)

We consider the problem of insufficient information in the RL reward, which plays a role similar to that of the label in supervised learning. In the MARL setting, all agents cooperate to achieve the same goal, so we can estimate the reward distribution of different actions under the progressively improving policy, and the estimated value reflecting this distribution can promote more efficient exploration. To achieve this goal with acceptable computational complexity, we propose a predictive network called PrecoderNet to forecast the reward of each state-action (s-a) pair. The output of the Commander is regarded as the target of the PrecoderNet, and the prediction of the PrecoderNet, called guidance, subsequently indicates which region to explore. The PrecoderNet used to fit the reward distribution is also shared by all agents. The experimental results show that guidance based on prior experience effectively improves the efficiency of exploration and the convergence of multiple agents during collaborative learning.

Furthermore, we use a prioritized replay buffer to reduce computational complexity and accelerate exploration. Prioritized experience replay uses Q-learning to approximate the value function and introduces the temporal-difference error (TD-error), the difference between the target network and the estimated network, as the priority of samples to improve gradient descent. Meanwhile, the priority stored in the root node of the tree-structured prioritized experience replay reflects whether the agent is learning well or not. According to the similarity of the experience priorities of different agents, some agents share their experience replay to reduce the correlation of samples in the experience pool and to avoid the bad trajectories already tried by other agents, so that after only a short learning period exploration can focus on the effective sub-region rather than the entire space.

The proposed prior-experience-based multi-agent coordinated exploration with prioritized guidance, called MAEPG, is an extension of the AC algorithm. We run it in the simulated environment Gym, where multiple agents explore a common environment effectively and efficiently to speed up the exploration process. Experimental results show the benefits of the prioritized guidance for MARL compared with DQN and MADDPG in both single-agent and multi-agent settings, with better performance as the number of agents increases.

The rest of this paper is organized as follows. Section 2 briefly reviews the previous literature on MARL. After introducing the RL background (Section 3), we describe the proposed algorithm in Section 4. The experimental results are given in Section 5, and Section 6 concludes the paper with some discussion of the approach. More information on the algorithm is included in the Appendix.

II Related Work

MARL is used to solve strategic learning problems that require multiple agents to collaborate or compete to accomplish complex tasks. Although behavioral criteria can be predefined, agents need to learn continually to find approximately optimal solutions for their current tasks in complex environments, especially when the environment is likely to change [17]. Multiple agents interact with the environment to obtain feedback values that drive the learning of better strategies, which is of crucial significance for the development of RL.

When multiple agents simultaneously interact with a dynamic environment and each treats the others as part of its surroundings, the environment is unstable from the perspective of the current agent because the strategies of the other agents are constantly changing during learning [18]. This randomness of the environment has a catastrophic impact on RL, resulting in slow convergence of the policy or even no convergence at all. In response to this problem, the use of a regret bound to constrain the multi-agent environment and ensure convergence was discussed in [19] and [20]. However, this theoretical analysis does not work well in practical applications. In many cases, agents stop learning due to the instability of the environment when they are still far away from the regret bound. MARL was improved based on game theory in [21] by using empirical game-theoretic analysis to calculate which strategy would be chosen in the multi-agent case. However, this method can lead to inconsistent learning speeds among agents, as the faster agent will exploit the slower one, thus causing overfitting.

The lack of effective communication mechanisms for multi-agent joint exploration is another problem. In [22], a large feed-forward neural network was used to map the inputs of all agents to the action space, where each agent occupies a part of the input units and broadcasts its own hidden-layer information to the other agents during communication. As the number of agents increases, the communication overhead becomes too large and the stability of the algorithm decreases. A centralized evaluation network was used in [16] to control the behavior of all agents and achieved a certain improvement in some cooperative and competitive tasks. However, that approach actually allows each agent to learn a separate behavioral policy, which may lead to suboptimal solutions to a certain extent. The minimax concept in game theory based on [23] was used in [16] to limit the learning step size of the agent and strengthen the stability of the algorithm, which at the same time sacrificed the speed advantage that multi-agent exploration should have. If multiple agents can interact with each other more efficiently during learning, they should be able to achieve certain gains. For example, directly calling someone to warn of danger is more effective than lighting a fire as a warning signal.

Another undeniable problem is that the reward in RL carries insufficient information compared with the label in supervised learning [6]. Although the use of rewards instead of labels greatly improves the flexibility and practical applicability of RL, it costs more resources in exploration, especially for MARL. When most agents receive feedback with insufficient information, the entire learning process becomes difficult to carry out. A method for increasing the information of the reward based on the latent state is proposed in [24], which uses the experience stored from previous trajectories as the representation of reward to train a network that predicts the reward of a new state-action pair and adds it to the feedback value in logarithmic form. However, this requires numerous repetitive simulations, which is too costly in complex environments. Although multi-head output layers were used to reduce the risk of overfitting, this prediction is easily limited by the prior experience, resulting in insufficient generalization and increased network complexity.

The algorithm proposed in this paper combines the advantages of the above work and attempts to address the problems mentioned. Based on the idea of the centralized critic in [16], we use a critic network called Commander to coordinate the behavior of each agent, and we add guidance for a more informative reward to fit the distribution of the reward function during exploration. Inspired by [24], one shared network called PrecoderNet is used for prediction, which increases the information of the reward values while reducing complexity. Furthermore, following [13], we add priority to the experience replay, and the actor networks of the agents are combined with contribution weights that vary according to the similarity of the priorities stored in the root nodes of their tree-structured experience pools. In this way, an agent that encounters a new state can effectively utilize the knowledge learned by the others without exploring the entire state-action space, which is closer to the way humans behave.

III Preliminaries

Markov Decision Process (MDP): We model the multi-agent RL problem as an MDP consisting of $M$ agents, in which the interaction between an agent and the environment can be represented by a quintuple $\langle S, A, r, \gamma, \pi \rangle$. $S$ represents the state space while $A$ is the action space. The reward $r: S \times A \rightarrow \mathbb{R}$ is the feedback from the environment measuring the chosen action under the current state. $\gamma$ is a discount factor that converts an infinite sequence problem into one with a finite upper bound, so that the MDP can converge within finite steps. $\pi$ represents the policy by which the agent selects actions, and the chosen action is $a = \pi(s)$.

Deep Q-networks (DQN): DQN [4] approximates the action-value function of value-based Q-learning, $Q(s,a) = \mathbb{E}[R \mid s, a]$, with a deep neural network with parameter $\theta$, where $R$ is the expected return of the current state-action pair discounted by $\gamma$. DQN fits the Q-value of the (s-a) pair to the target $y = r + \gamma \max_{a'} Q(s', a')$, which updates the Q-value according to the Bellman equation from dynamic programming. Gradient descent is then carried out on random samples drawn from the experience replay, and the action with the largest Q-value is selected with probability $1-\epsilon$, while a random action is selected with probability $\epsilon$.
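The target computation and the action selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default `gamma=0.99`, and the `done` flag for terminal states are assumptions introduced here.

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target y = r + gamma * max_a' Q(s', a').

    gamma=0.99 is an assumed default; `done` (assumed) drops the
    bootstrap term at terminal states.
    """
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))

def epsilon_greedy(q_values, epsilon, rng):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With `epsilon=0` the rule is purely greedy, and with `epsilon=1` it is purely random; intermediate values trade exploitation against exploration.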

Deep Deterministic Policy Gradient (DDPG): DDPG [15] is an AC algorithm that uses a policy-based deterministic policy network $\mu$ parameterized by $\theta^{\mu}$ to generate the deterministic action $a = \mu(s \mid \theta^{\mu})$. DDPG updates the learned actor policy network parameterized by $\theta^{\mu}$ with gradient steps, using the Q-network of DQN as the critic, so that it maximizes the output Q-value.

For convenience, we summarize the symbols and notations in Table 1.

Action that agent i chooses at time slot t
State that agent i reaches at time slot t
Reward of agent i at time slot t
Guidance to enrich the information of the reward
Improved prioritized experience replay
Sampling mini-batch
Visiting times of a certain sample
Discount factor of guidance
Discount factor of long-term reward
Soft update parameter
Bias to guarantee positive priority
Priority of the sample of agent i at time slot t
Contribution weight of agent i in exploration
Noise added to action for exploration
TABLE I: Summary of symbols and notations

IV Algorithm

The proposed MAEPG is an extension of DDPG to MARL. It enables better cooperative behavior between agents and improves the exploration policy by increasing the information of the reward, which raises the efficiency of concurrent RL. Our algorithm consists of a multi-agent joint actor network with an improved prioritized experience replay R, a centralized guidance network called PrecoderNet, and a centralized critic network called Commander, as shown in Figure 1.

Fig. 1: Architecture of MAEPG algorithm

RL agents can only explore a part of the whole exploration space when the state-action dimension is high, especially in continuous control problems. That is, the partially observed environment restricts the performance of MARL agents. The experience replay in DQN and DDPG makes it possible for agents to optimize continuously based on the previous trajectories stored in the buffer. As the number of agents increases, each agent learns complementary action policies in different sub-regions. Inspired by the intuition that a large number of soldiers can carry out a globally optimal operational strategy better under the leadership of a general, all the cooperative agents share a centralized critic network called Commander and each maintains its own action network called ActorNet. At the same time, the partial observations of the various agents are fed into the PrecoderNet to generate the guidance, which increases the information of the reward function, provides a more accurate direction, and accelerates exploration by focusing it on the effective subspace. ActorNet selects an action to interact with the environment under the guidance, then obtains the feedback reward and the new state from the environment. In the meantime, the Commander assesses the chosen action through the approximate Q-value. The sample priority is given by the TD-error as shown in (1).


where a small positive bias ensures a positive priority. The latest transition is then stored in the prioritized experience replay [13]. Since a larger TD-error means a greater contribution when conducting gradient descent, the information used in the learning process is effectively increased.
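As a rough illustration of a TD-error-based priority with a positive bias, the sketch below uses the standard one-step TD-error from prioritized experience replay. The exact form of (1) is not reproduced here, so the formula, the function names, and the defaults `gamma=0.99` and `eps=1e-2` are all assumptions.

```python
def td_error(reward, q_sa, q_next, gamma=0.99):
    """Assumed one-step TD-error: r + gamma * Q'(s', a') - Q(s, a)."""
    return reward + gamma * q_next - q_sa

def priority(delta, eps=1e-2):
    """Priority |delta| + eps; eps (assumed value) keeps every priority positive."""
    return abs(delta) + eps
```

Because `eps` is strictly positive, even a transition whose TD-error is exactly zero keeps a nonzero chance of being sampled.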

IV-A Improving priority experience replay

The prioritized experience replay R (the blue cylinder illustrated in Fig. 1) accounts for the importance of different samples, which it describes as the difference between the value assessed by the Commander and the obtained reward, as shown in (1). Sampling by priority (with mini-batch size N) uses the samples with large TD-error to make gradient descent faster, while the positive bias ensures that all samples can still be used. The experience replay is stored in the form of a sum-tree to improve sampling efficiency and save storage space. The lowest-level leaf nodes store the transitions and their priorities, while the remaining nodes only store the sum of the priorities of their child nodes. Inspired by [25], and considering that the number of times a sample has been visited also reflects its importance, we store the visiting times of each leaf node in the tree-structured experience replay and update the priority via (2).
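The sum-tree storage described above can be sketched as a small array-backed tree. This is an illustrative sketch under assumptions: the class and method names are hypothetical, and the visit counter is only recorded per leaf, since the way it enters the priority update (2) is not reproduced here.

```python
import numpy as np

class SumTree:
    """Array-backed sum-tree: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)       # internal nodes, then leaves
        self.data = [None] * capacity                # stored transitions
        self.visits = np.zeros(capacity, dtype=int)  # visiting times per leaf
        self.write = 0                               # next leaf to overwrite

    def total(self):
        """Sum of all priorities, stored in the root node."""
        return self.tree[0]

    def update(self, leaf, priority):
        """Set a leaf priority and propagate the change up to the root."""
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx > 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def add(self, transition, priority):
        """Store a transition in the next slot (oldest is overwritten)."""
        leaf = self.write
        self.data[leaf] = transition
        self.visits[leaf] = 0
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity
        return leaf

    def sample(self, value):
        """Find the leaf whose cumulative priority range contains `value`."""
        idx = 0
        while idx < self.capacity - 1:               # descend through internal nodes
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        leaf = idx - (self.capacity - 1)
        self.visits[leaf] += 1                       # record the visit
        return leaf, self.data[leaf], self.tree[idx]
```

Drawing `value` uniformly from `[0, total()]` selects each leaf with probability proportional to its priority, and both `update` and `sample` take time logarithmic in the capacity.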


To further take the contributions of the different agents into consideration, the Boltzmann distribution [26] of the i-th agent can be defined according to the sum of priorities stored in the root node of its experience replay, as shown in (3).
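A Boltzmann weighting over the root-node priority sums can be sketched as a softmax. The function name and the `temperature` parameter are assumptions made for illustration; the exact form of (3) may differ.

```python
import numpy as np

def contribution_weights(root_priorities, temperature=1.0):
    """Boltzmann (softmax) weights over per-agent root priority sums.

    `temperature` is an assumed parameter; 1.0 gives a plain softmax.
    """
    p = np.asarray(root_priorities, dtype=float) / temperature
    p -= p.max()            # shift for numerical stability; weights unchanged
    e = np.exp(p)
    return e / e.sum()
```

Agents whose replay buffers hold a larger total priority receive a larger weight, and the weights always sum to one.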


Each agent uses (3) as its weight to improve the evaluation of the Commander. That is, the loss function of the Commander that accounts for the contributions of the different agents is given by (4) and (5).


The policy gradient can be written as (6) and (7):


where the evaluation network A and its target network A', each with its own parameters, are used to mitigate the over-fitting problem according to [4]. Meanwhile, the Commander also contains an evaluation network C and a target network C', each with its own parameters, as shown in Fig. 2. We soft update all the target networks with the soft update parameter. In this way, the agents with greater TD-errors provide more information for the Commander's decision making, which in turn makes the Commander more comprehensive and efficient. We remark that similar methods of using the TD-error as priority can also be used in supervised learning to enhance training efficiency. Overall, the improved prioritized experience replay leads to more coordinated concurrent learning in MARL.
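The soft update of a target network can be sketched with the standard Polyak averaging rule used in DDPG. The rule and the function name are assumptions here; the default `tau=0.005` matches the soft update value reported in the experiments.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging (assumed rule): target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

With a small `tau`, the target network tracks the evaluation network slowly, which keeps the regression target stable between updates.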

Fig. 2: Network structure of MAEPG algorithm

IV-B Precoding the reward

A scalar reward signal evaluates the quality of each transition, and the agent has to maximize the cumulative reward over the course of interaction. The RL feedback (the reward) is less informative than in supervised learning, where the agent would be given the correct actions to take. It is still unclear whether the rewards predefined in the Gym and Atari environments are optimal or not. In MARL, we should pay more attention to the form of the reward because of the unstable environment caused by multiple agents and the cooperative or competitive interaction among them. In general, a reward with more information and constraints has positive effects on RL and makes RL attractive for multi-agent learning.

Fig. 3: The walking posture of a doll with inappropriate reward

Fig. 4: The walking posture of a doll with informative reward

For example, suppose the goal of an MDP task is to make a virtual person move forward in a virtual environment. Setting the reward to the forward speed causes the doll to swing wildly while moving forward, as shown in Figure 3. In MARL, such useless actions not only waste exploration resources but also greatly interfere with the direction of other agents [27]. To make the doll walk more like a normal person, the modified reward is set to the difference between the posture of the virtual doll and the walking posture of human beings, and better results are obtained, as shown in Fig. 4, which indicates that an informative reward is significant for MARL.

In [24], representations from the latent state are used to correct the reward. Storing all previous state-action-reward values from numerous pre-experiments to train a predictive network is feasible but too costly. We propose a PrecoderNet that uses the output of the Commander as the label to estimate the reward of the current state-action pair, collecting (s-a) pairs from all agents for real-time training without pre-experiments, as shown in Fig. 2.

After the interaction between the agents and the environment, we input the (s-a) pair to the PrecoderNet to obtain the guidance as a correction term, and a discount factor determines how much of the correction is used, as shown in (8).


Then we do the gradient descent update on PrecoderNet as shown in (10):


The gradient flows from the Commander through ActorNet and PrecoderNet, and the final gradient of the Commander is the sum of the two gradient calculations of ActorNet and PrecoderNet. Updating PrecoderNet and the Commander jointly across multiple agents is more efficient and faster. To sum up, more informative rewards lead to more efficient and effective exploration by all agents.
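The reward correction and the PrecoderNet update can be illustrated with a deliberately simplified stand-in. In the sketch below, a linear model replaces the neural PrecoderNet, the Commander's Q-value is used as the regression label, and the guidance is added to the reward after scaling by a discount factor. The class name, the squared-error loss, the additive form, and the default `beta=0.1` are all assumptions rather than the exact equations (8) to (10).

```python
import numpy as np

class LinearPrecoder:
    """Linear stand-in (assumption) for PrecoderNet on a state-action vector."""

    def __init__(self, input_dim, lr=1e-3):
        self.w = np.zeros(input_dim)
        self.b = 0.0
        self.lr = lr

    def guidance(self, sa):
        """Predicted guidance for one concatenated state-action vector."""
        return float(np.dot(self.w, sa) + self.b)

    def update(self, sa, commander_q):
        """One gradient step on 0.5 * (guidance - commander_q)^2."""
        err = self.guidance(sa) - commander_q
        self.w -= self.lr * err * sa
        self.b -= self.lr * err
        return 0.5 * err ** 2

def shaped_reward(reward, guidance, beta=0.1):
    """Assumed additive correction: r + beta * guidance (beta is hypothetical)."""
    return reward + beta * guidance
```

For the Pendulum task the state-action vector has four entries (three state components plus one action), matching the input size reported for PrecoderNet in the experiments.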

V Experiments

The proposed algorithm is evaluated in the RL environment Gym Pendulum-v0. The task is to keep a pendulum with a fixed pivot upright in the vertical direction [28] by applying an appropriate torque, as shown in Figure 5. In this environment the reward is non-positive, and the closer it is to zero, the closer the current state is to the ideal position. Since the action is a torque, its value is continuous, and the state-action space to be explored is infinite. We aim to finish the task quickly through collaborative MARL and compare the proposed MAEPG with DQN and MADDPG as benchmarks.

V-A Hyperparameters

In all experiments, we used an Adam optimizer [29] with a learning rate of 1e-3. A discount factor is applied when calculating the target Q-value, and the soft update parameter is set to 0.005 for all target networks. The ActorNet, Commander, and PrecoderNet all use a four-layer neural network including two hidden layers. The number of hidden-layer neurons is 180 in ActorNet and 300 in Commander and PrecoderNet. In the Pendulum task, the state dimension is 3 and the action dimension is 1, so the input size of ActorNet is 3 while the input size of Commander and PrecoderNet is 4, and their output sizes are all equal to 1. In the experiment, the guidance from PrecoderNet is multiplied by a discount factor and added to the reward, and the number of agents is M = 1, 2, 3. In addition, Gaussian white noise with mean 0 and variance 0.1 is added to the actions to increase the diversity of exploration.
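The exploration noise can be sketched as below. Note that a variance of 0.1 corresponds to a standard deviation of about 0.316. Clipping the noisy action to the torque range [-2, 2] of Pendulum-v0 is an assumption added here, as is the function name.

```python
import numpy as np

def noisy_action(action, rng, variance=0.1, low=-2.0, high=2.0):
    """Add zero-mean Gaussian noise, then clip to the (assumed) torque range."""
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance))  # std = sqrt(variance)
    return float(np.clip(action + noise, low, high))
```

Passing an explicit generator such as `np.random.default_rng(seed)` keeps the exploration reproducible across runs.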

V-B Results

We first compare the convergence time needed by DQN, MADDPG and MAEPG with a single agent. In the experiment, the DQN algorithm quantizes the action between the maximum and minimum values of the torque into 100 discrete values with a step of 0.04. We designed a multi-head network structure for multi-agent DQN, in which the first three layers (one input layer and two hidden layers) are shared by all agents, like the Commander in MAEPG, and the action of each agent is finally determined by its own output layer.
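The discretization used for the DQN baseline can be sketched as follows, assuming the Pendulum-v0 torque range of [-2, 2]. The function names and the choice to start the grid at the lower bound are assumptions; with 100 values and a step of 0.04 starting at -2, the grid ends at 1.96 rather than at the upper bound.

```python
import numpy as np

def discretize_torque(n_bins=100, step=0.04, low=-2.0):
    """Grid of n_bins torque values starting at `low` (assumed) with fixed step."""
    return low + step * np.arange(n_bins)

def nearest_bin(torque, grid):
    """Index of the grid value closest to a continuous torque."""
    return int(np.argmin(np.abs(grid - torque)))
```

A DQN output layer with one unit per grid entry then selects a torque by index, for example `grid[nearest_bin(0.37, grid)]` to map a continuous torque onto the grid.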

Fig. 5: Convergence speed comparison among single-agent MAEPG, DQN and MADDPG. The horizontal axis is the number of episodes (two thousand steps per episode), and the vertical axis is the accumulated reward calculated per 5 cycles. The larger the magnitude of the non-positive reward, the farther the current position is from the target position, which indicates a bad state.

As shown in Fig. 5, DQN has difficulty converging with only one agent, while MADDPG and MAEPG eventually converge at different speeds. MAEPG converges in 220 episodes, whereas MADDPG needs about 340 episodes to reach the same convergence level. As shown in Fig. 6 and Fig. 7, when the agent number M is 2 or 3, the multi-agent DQN still cannot learn a feasible behavior policy.

Fig. 6: Convergence speed of two agents

Fig. 7: Convergence speed of three agents

We then compare the performance of MADDPG and MAEPG with agent number M = 2 and 3. It can be seen that the convergence speed of MAEPG improves greatly as M increases, and that MAEPG significantly outperforms MADDPG in terms of stability. The average reward and the gain attained when the three algorithms converge are shown in Table 2. The gain, expressed as a percentage, represents the degree to which the average reward of MAEPG is better than that of MADDPG. By considering the importance of different agents and samples to the overall learning progress in a multi-agent scenario, MAEPG learns faster and keeps the pendulum in the ideal upright position with a smaller swing than MADDPG.

The experimental results show that the proposed MAEPG outperforms the benchmarks under collaborative MARL exploration. Further discussion of the experiments is presented in the appendix.

VI Conclusion

In this paper, we proposed a novel cooperative algorithm called MAEPG for multi-agent RL that achieves coordinated, efficient and effective exploration by using the knowledge learned by a centralized Commander and the guidance derived from previous experience. In particular, we help the multiple agents communicate better by improving the prioritized experience replay (Section 4.1), whose priorities help agents explore more efficiently. We also propose a centralized precoder network that enriches the information of the reward in RL tasks (Section 4.2) to accelerate the learning process in MARL. The experiment we carried out demonstrates that the proposed algorithm outperforms existing methods in cooperative multi-agent environments. We remark that this algorithm can be extended to supervised learning to speed up its training.

VII Appendix A

For completeness, we provide the MAEPG algorithm as below.

Initialize Commander, ActorNet and PrecoderNet
Initialize target networks
Initialize improved prioritized replay buffer R
for each episode do
   Reset the environment
   for each time slot t do
      for i in agents do
         Agent i chooses an action
         Execute the action, obtain the reward
         Observe the new state
         Get the guidance
         Get the Q-value
         Compute the TD-error and the priority via (2)
         Store the transition in R
      end for
      Sample a mini-batch of N transitions from R
      Calculate the visiting times of transitions in R
      Update the priorities via (2)
      Update the contribution weights via (3)
      Update the Commander via (4)(6)
      Update the PrecoderNet via (9)(10)
      Update the target networks
   end for
end for
Algorithm 1 MAEPG in MARL coordinate exploration

VIII Appendix B

In the experiment, we observed the behavioral trajectories of the two agents under the MADDPG and MAEPG algorithms. Fig. 8 and Fig. 9 show the learning curves of the two agents under MADDPG and the proposed MAEPG, respectively. The abscissa is the exploration time slot, and the ordinate is the action value corresponding to the time slot (the torque magnitude and direction in the vertical pendulum experiment).

Fig. 8: Two-agent MADDPG action learning curve.

Fig. 9: Two-agent MAEPG action learning curve. The abscissa is the time slot, where each time slot is learned 500 times, and the ordinate is the action finally learned for that time slot. The blue curve indicates a positive moment, that is, a moment applied to the pendulum toward the right, and the yellow curve indicates a negative moment.

Since the DQN algorithm cannot converge in the multi-agent collaborative exploration environment, only the learning curves of MAEPG and MADDPG are compared. It can be seen from the figures that the policy learned by MAEPG converges faster and reaches the steady state of the pendulum, after which the torque is maintained at a small value. In contrast, MADDPG takes a long time to converge, and its final output is still relatively unstable.