Reinforcement learning (RL) has achieved impressive results in challenging problems, from playing games (Clark and Storkey, 2014; Silver et al., 2016) to robotics (Levine et al., 2016; Chebotar et al., 2017). Today, the biggest obstacle to deploying RL in the real world is collecting the large amount of data needed to train a good policy for each new environment. Experience replay (ER) (Lin, 1992), a fundamental component of off-policy reinforcement learning, partially addresses this problem by storing experience in a memory buffer and reusing it randomly multiple times. It breaks the correlation in the stream of training data and improves data efficiency, which stabilizes the training process and leads to better convergence (Mnih et al., 2015). Currently, a majority of off-policy reinforcement learning algorithms, such as DDPG (Lillicrap et al., 2015), C51 (Bellemare et al., 2017), and SAC (Haarnoja et al., 2018), have adopted experience replay for its performance and simplicity.
However, in the multi-agent domain the problem is more complicated. Firstly, MARL algorithms using the centralized training framework (Lowe et al., 2017; Yang et al., 2018; Rashid et al., 2020) demand more exploration than their single-agent counterparts, since the state-action space expands exponentially as the number of agents grows, which is known as the curse of dimensionality. Secondly, the interaction between the RL agents and the environment, which is necessary to collect training data, is the most time-consuming part of training a reinforcement learning system (Schaul et al., 2015), and its computational cost grows rapidly with the number of agents. These two factors make fast exploration and time-efficient training of a multi-agent system a complicated but promising problem.
To improve training efficiency, previous work has made improvements in several aspects of experience replay, such as importance sampling (Schaul et al., 2015), setting sub-goals to address the sparse-reward problem (Andrychowicz et al., 2017), examining the effects of hyper-parameters (Zhang and Sutton, 2017), sharing experiences among distributed agents (Horgan et al., 2018), and utilizing on-policy experience (Schmitt et al., 2019). Our work, however, is based on an intrinsic property of the multi-agent task and can be combined with the above techniques for further improvement.
In this paper, we introduce a technique called Experience Augmentation, which provides fast, thorough, unbiased exploration by shuffling the order of agents and accelerates training by additionally updating the parameters on the generated experiences. Applied to MADDPG (Lowe et al., 2017), a classical MARL baseline, the performance of experience augmentation is demonstrated in three scenarios. In the best-performing scenario, the agent trained with experience augmentation reaches the convergence reward of vanilla MADDPG in 1/4 of the training time, and its final convergence beats the original model by a significant margin.
Before Deep Q-Learning (Mnih et al., 2015) demonstrated the effectiveness of Experience Replay (Lin, 1992), reinforcement learning suffered from instability caused mainly by correlated data. To perform experience replay, it is typical to store the agent's experience at each step in the replay buffer and to uniformly sample mini-batches of experience to update the parameters $K$ times every $T$ steps. For ease of explanation, we refer to $K$ as the update times and $T$ as the training interval. A large sliding-window replay memory of fixed size (such as 1 million transitions) is usually used as the container of the stored experiences.
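The sliding-window buffer and uniform sampling described above can be sketched in a few lines (class and method names are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding-window experience replay (Lin, 1992): a fixed-capacity
    deque that silently drops the oldest transition once full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs):
        self.storage.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of the stream.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

In the notation above, the learner would call `sample` for $K$ mini-batches every $T$ environment steps.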
There are mainly three benefits to the experience replay mechanism. Firstly, by breaking the correlations in the stream of training data, experience replay stabilizes training and induces a better convergence result (Schaul et al., 2015). Secondly, when learning on-policy, the current parameters of the policy determine the next experience to be collected, which is also the next training data used to update those same parameters. Under such a circumstance, a bad feedback loop easily occurs and the parameters frequently get stuck in a poor local minimum (Mnih et al., 2015). With experience replay, the distribution of training data is averaged over many of the policy's previous states, ensuring the breadth of the data distribution and thus allowing for relatively thorough exploration and a smooth learning process. Thirdly, experience replay also reduces training time: since each step of experience is reused for several weight updates, fewer interactions between the agents and the environment are required, which is usually the most time-consuming part of the learning process.
As an extension of the actor-critic policy gradient method to the multi-agent domain, MADDPG adopts the framework of centralized training with decentralized execution, where the critic is augmented with extra information about the policies of the other agents to help learn the policy effectively. In the execution phase, the actors act in a decentralized manner, yet are well trained to cooperate or compete with other agents.
The centralized critic of agent $i$ is represented by $Q_i^{\boldsymbol{\mu}}(\mathbf{o}, \mathbf{a})$, where $\boldsymbol{\mu} = \{\mu_1, \dots, \mu_N\}$ is the collection of all agents' deterministic policies, and $\mathbf{o} = (o_1, \dots, o_N)$ and $\mathbf{a} = (a_1, \dots, a_N)$ are the concatenations of every agent's observation and action. Each agent trains its Q-function by minimizing the loss function (Watkins and Dayan, 1992):

$$\mathcal{L}(\theta_i) = \frac{1}{S}\sum_{j}\Big(y^j - Q_i^{\boldsymbol{\mu}}(\mathbf{o}^j, a_1^j, \dots, a_N^j)\Big)^2,$$

where $y^j = r_i^j + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{o}'^j, a_1', \dots, a_N')\big|_{a_k' = \mu_k'(o_k'^j)}$, and $S$ is the size of the mini-batch.

The actor network of agent $i$ is trained by deterministic policy gradient (Silver et al., 2014) to maximize the objective function $J(\mu_i)$, whose gradient is given by:

$$\nabla_{\theta_i} J(\mu_i) = \frac{1}{S}\sum_{j} \nabla_{\theta_i}\mu_i(o_i^j)\, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{o}^j, a_1^j, \dots, a_N^j)\big|_{a_i = \mu_i(o_i^j)}.$$
In MADDPG, the update times $K$ is set to 1, and the training interval $T$ is 100.
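As a numerical illustration of the critic loss, the bootstrapped TD target and mean-squared error can be computed for a batch with plain NumPy (the network evaluations are assumed given; this is a sketch, not the paper's implementation):

```python
import numpy as np

def critic_loss(q_values, rewards, next_q_values, gamma=0.95):
    """Mean-squared TD error for agent i's centralized critic.

    q_values:      Q_i(o, a_1..a_N) for the sampled batch, shape (S,)
    rewards:       r_i for each transition, shape (S,)
    next_q_values: target critic Q_i'(o', a_1'..a_N'), shape (S,)
    """
    y = rewards + gamma * next_q_values  # bootstrapped TD target
    return float(np.mean((q_values - y) ** 2))
```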
3.1 Two properties of the MARL environment
As is fulfilled by a majority of multi-agent environments (Lowe et al., 2017; Mordatch and Abbeel, 2018; Vidal et al., 2002; Gupta et al., 2017), we assume that the reward function for any agent $i$ can be represented as:

$$r_i = R_{G_i}(\mathbf{o}, \mathbf{a}),$$

where $G_i$ represents the group of agent $i$.
We also assume that all agents in the same group are homogeneous, i.e., they share the same properties (such as size, shape, and mass) and the same reward function. These assumptions lead to two properties of the environment:
The reward function of agent $i$ is symmetric with respect to the other agents, group by group: exchanging the observations and actions of any two other agents in the same group influences neither the behavior of agent $i$ nor the result of the environmental step, and therefore makes no difference to the reward of agent $i$.
The reward information of any agent $j$ in the same group as agent $i$ can also be utilized to train agent $i$: exchanging the observations and actions of agent $i$ and any agent $j \in G_i$ exchanges their rewards as well.
Given the two properties of the environment, we can augment the original experience by shuffling the order of agents in a specific way, as shown in Sec. 3.2.
3.2 Shuffle Trick
In Sec. 3.1, we stated two properties of MARL environments, which form the intuition behind our method. In this section, we formally propose a technique called the Shuffle Trick, which augments the original dataset factorially and thus partially mitigates the curse of dimensionality caused by the exponentially expanding state-action space of MARL.
There are two steps to perform the shuffle trick: first, find the feasible permutation-matrix set $\mathcal{P}$, in which every permutation matrix shuffles agents only within their own groups; second, randomly select a permutation matrix $P \in \mathcal{P}$ and premultiply it to the original experience to shuffle the agents' order. In short, the shuffle trick can be expressed as:

$$(\mathbf{o}, \mathbf{a}, \mathbf{r}, \mathbf{o}') \;\mapsto\; (P\mathbf{o}, P\mathbf{a}, P\mathbf{r}, P\mathbf{o}'), \qquad P \in \mathcal{P}.$$
Consider an environment with 4 agents: 2 good agents (indexed 1 and 2) and 2 adversaries (indexed 3 and 4). The adversaries are rewarded for shortening the distance to their nearest good agent. As shown in Fig. 1, by shuffling the order of the good agents and the adversaries respectively, we generate $2! \times 2! - 1 = 3$ extra experiences from the original experience.
In practice, the rewards of other agents are useless for training agent $i$. The reward of agent $i$ in a generated experience is obtained by selecting the element of the permuted reward vector at agent $i$'s new position.
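The two steps of the shuffle trick can be sketched as follows, with permutations represented as index orderings rather than explicit matrices (function names are illustrative, and the sketch assumes each group occupies a contiguous block of indices, as in the 4-agent example above):

```python
import itertools
import random

import numpy as np

def feasible_permutations(groups):
    """Enumerate every ordering of the agents that only permutes agents
    within their own group (the feasible set of Sec. 3.2). Assumes each
    group is a contiguous block of indices, e.g. [[0, 1], [2, 3]]."""
    orderings = []
    for per_group in itertools.product(
            *(itertools.permutations(g) for g in groups)):
        orderings.append([i for block in per_group for i in block])
    return orderings

def shuffle_trick(obs, act, rew, next_obs, groups, rng):
    """Apply one randomly chosen within-group permutation to every
    per-agent array of a stored transition."""
    orderings = feasible_permutations(groups)
    order = orderings[rng.randrange(len(orderings))]
    return obs[order], act[order], rew[order], next_obs[order]
```

Note that the reward vector is permuted together with the observations and actions, so training agent $i$ simply reads slot $i$ of the permuted reward vector.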
3.3 Experience Augmentation
In this section, we describe the concrete method that combines the Shuffle Trick with MARL algorithms to accelerate training and provide better exploration. To this end, we choose MADDPG (Lowe et al., 2017), a classical MARL baseline, and analyze the factors that affect its training efficiency and performance.
In recent years, previous work has attempted to improve experience replay in multiple aspects (Schaul et al., 2015; Andrychowicz et al., 2017; He et al., 2020; Foerster et al., 2017). Unlike that work, we notice another factor that is significantly relevant to the performance of experience replay: the hyper-parameters training interval $T$ and update times $K$. As illustrated in Sec. 2, a typical ER-based RL algorithm updates the network parameters $K$ times whenever $T$ transitions are added to the replay buffer. An intuitive idea is that increasing the ratio of update times to training interval, $K/T$, would improve training speed, as it increases the number of updates per unit time.
However, our experiments show that convergence can suffer when one simply enlarges the update times or decreases the training interval (demonstrated in Sec. 4.4). This phenomenon is possibly caused by the fact that the replay buffer is an inaccurate subset of the ground-truth reward distribution, as the scale of the exploration space is far greater than the size of the replay buffer, especially in the multi-agent domain. When the ratio $K/T$ is tuned to a higher value, the number of times each experience in the buffer is revisited increases proportionally; the model may then be trained so many times on this subset that it is not robust to data that do not appear in the buffer. In other words, the model over-fits to the dataset in the replay buffer when $K/T$ is increased in vanilla MADDPG.
Given that the Shuffle Trick provides fast, thorough and symmetric exploration of the observation-action space, we find that extra updates on the generated dataset can effectively accelerate training without deteriorating convergence; on the contrary, in some environments it boosts the result by a significant margin. We refer to this training technique as Experience Augmentation. There are two steps to performing experience augmentation in off-policy MARL: first, use the shuffle trick to generate $E$ extra experiences; then, update the parameters on these experiences sequentially, where $E$ is a newly introduced hyper-parameter called EA-times, a shorthand for Experience Augmentation extra update Times. This hyper-parameter determines how many additional times the model trains on the generated dataset; its effects are examined in Sec. 4.4.
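The resulting update schedule can be sketched as a skeleton training loop, with the sampling, augmentation and gradient-update steps passed in as callables (names and signature are illustrative, not the paper's implementation):

```python
def train_with_ea(total_steps, T=100, K=1, ea_times=3,
                  sample_batch=None, augment=None, update=None):
    """Sketch of the Experience Augmentation schedule: every T
    environment steps, perform K updates on freshly sampled batches
    and, for each, ea_times extra updates on shuffled copies."""
    n_updates = 0
    for step in range(1, total_steps + 1):
        if step % T:
            continue  # only learn every T environment steps
        for _ in range(K):
            batch = sample_batch()
            update(batch)                  # update on the original batch
            n_updates += 1
            for _ in range(ea_times):
                update(augment(batch))     # EA-times extra updates
                n_updates += 1
    return n_updates
```

With the MADDPG defaults ($T=100$, $K=1$) and EA-times $=3$, each learning step performs 4 gradient updates instead of 1.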
The principle behind the effectiveness of EA and the whole process of experience augmentation are shown in Fig. 2. In vanilla MARL, the parameters are updated multiple times only on the experiences in the original buffer (a coarse approximation to the ground-truth reward distribution); with EA, the parameters are additionally trained for EA-times on the integrated buffer (a far more accurate approximation to the ground truth). This allows the Q-function to provide a more accurate estimation of the Q-value and helps the policy find a better local minimum.
The experiments section is organized as follows. In Sec. 4.1 we introduce the MARL environments used for the experiments. In Sec. 4.2 we present the experimental settings. In Sec. 4.3 we compare the performance of MADDPG with and without EA and analyze the accelerating effect of EA on the training process in wall-clock time. In Sec. 4.4 we conduct ablation studies to prove the necessity of the shuffle trick and to examine the effect of the newly introduced hyper-parameter EA-times.
Our experiments are mainly based on MPE (Lowe et al., 2017; Mordatch and Abbeel, 2018). We consider the following three tasks: Cooperative Navigation, UAV used as Mobile Base Station, and World with Communication. The first two are homogeneous environments, and the last is a heterogeneous environment with two groups of agents. The details of the environments are as follows.
Cooperative Navigation consists of $N$ agents and $L$ landmarks. The goal of this environment is to occupy all of the landmarks while avoiding collisions among agents. Through training, the agents learn an assignment strategy to cover the landmarks. Each agent receives a shared negative reward equal to the negative sum of the distances from the $L$ landmarks to their nearest agents; it also receives a negative reward if it collides with another agent. In this environment, we simulate one case.
UAV used for Mobile Base Station
The scenario in which a UAV swarm is used as a Mobile Base Station (MBS) was proposed by Liu et al. (2019a) and implemented by us using MPE (Lowe et al., 2017). The environment consists of UAVs and PoIs (points of interest), where the UAVs work as mobile base stations providing communication services to the public (abstracted as covering a set of PoIs). Note that the PoIs are invisible to the UAVs. The reward encourages the UAVs to cover more PoIs, while taking into account the fairness of the covered time across PoIs and the efficiency of energy consumption. Through training, the UAVs learn the latent distribution of the PoIs and the corresponding moving strategies. In this environment, we simulate one case.
World with Communication
World with Communication consists of slower predators that work together to chase faster-moving preys, plus inaccessible obstacles and accessible forests. The observation of each agent is the concatenation of its own position and velocity, the locations of the obstacles, the locations of the forests, and the locations of the other agents. One of the predators is the leader, who can see preys hiding in the forest and share this information with the other predators through a communication channel. The predators receive a positive reward when colliding with any prey; preys receive a negative reward when caught by (colliding with) any predator. In this environment, we simulate one case.
4.2 Experimental Setup
Following MADDPG, the actor policy and the critic are both parameterized by a two-layer MLP with 128 hidden units per layer and ReLU activations. Adam is used as the optimizer. The size of the replay buffer is one million. The batch size is 1024. The discount factor is 0.95. The training interval $T$ is 100 and the update times $K$ is 1, as suggested by MADDPG. The learning rate for each environment is decided by a coarse grid search: for UAV Mobile Base Station it is fixed to 0.01, and for the other scenarios it is fixed to 0.001. For consistency, we set EA-times to 3 in every environment. For the ablation studies, which examine the impact of EA-times, the Cooperative Navigation scenario was chosen and a range of EA-times values was tested.
For each case, we tried at least 10 random seeds. We train our models until convergence (either 80,000 or 160,000 episodes) and then evaluate them by averaging the metrics over the last 20,000 episodes. The mean value (solid line in the figures) and quantiles (translucent bands) are reported.
4.3 The performance and time-efficiency of EA
Does EA boost the convergence?
To verify whether EA improves performance, we evaluate MADDPG with and without EA on all tasks. Moreover, we compare MADDPG with EA (EA-MADDPG) against MADDPG with PER (PER-MADDPG), and the performance of DDPG is also tested. During each update, EA-MADDPG generates 3 extra experiences and updates the parameters on them sequentially.
We present the results in the two homogeneous environments in Table 1 and Table 2. EA-MADDPG clearly outperforms the other algorithms by a significant margin. To evaluate the approach in the heterogeneous environment, we pit EA-MADDPG agents against MADDPG and DDPG agents, and compare the results of the agents and adversaries in Table 3. It shows that agents with EA gain a greater advantage over their opponents without EA.
Does EA accelerate the training?
Given that EA improves the convergence of vanilla MADDPG, the next question is whether EA also accelerates the training process. Fig. 3(a) and 3(b), which present the learning curves in the two homogeneous environments, show that EA-MADDPG trains much faster than vanilla MADDPG and the other algorithms in terms of wall-clock training time. Note that in the UAV scenario, EA-MADDPG reaches the convergence reward of vanilla MADDPG in only 1/4 of the time. EA also accelerates the training of the heterogeneous task in its early stage: as shown in Fig. 3(c), the agents with EA gain the advantage over their opponents earlier than those without EA (see the first peak in the learning curve).
(Table 3 columns: agent, adversary, caught, agent reward, adversary reward, training time.)
4.4 Ablation Studies
Having verified the performance of the Experience Augmentation technique, we perform in-depth ablation studies with respect to its two key aspects: 1) the Shuffle Trick, and 2) the EA-times. In this section, we have two main purposes: 1. to demonstrate the necessity of the Shuffle Trick, the technique proposed in Sec. 3.2 to generate extra experiences; 2. to analyze the function of the EA-times.
The Necessity of Shuffle Trick
One may argue that the increased $K/T$ ratio caused by EA-times is the key factor behind the performance of EA. To demonstrate the necessity of the shuffle trick, we examine 3 cases that have the same $K/T$ as EA-MADDPG (EA-times=3): 1. MADDPG(t=25), which updates the network every 25 iterations; 2. MADDPG(1+1+1+1), which updates the network with 4 sampled batches of experience every 100 iterations; 3. MADDPG(1x4), which updates the network with the same batch of experience 4 times every 100 iterations. The environment is UAV for Mobile Base Station, and the learning rate is fixed to 0.001. At least 10 random seeds were used for each case. As expected, Fig. 4(a) shows that EA-MADDPG (EA-times=3) outperforms vanilla MADDPG in terms of learning speed and convergence. MADDPG(t=25), MADDPG(1+1+1+1) and MADDPG(1x4), which require per-episode training time close to that of EA-MADDPG (EA-times=3), show training speed close to EA-MADDPG in the first 10,000 episodes, yet converge to a worse result than even vanilla MADDPG. A possible reason for this phenomenon is that the model over-fits to the replay buffer, as discussed in Sec. 3.3.
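The three baselines above are constructed so that their update budget matches EA-MADDPG's. A quick sanity check of that equivalence (the function name is illustrative):

```python
def updates_per_100_steps(T, K, ea_times=0):
    """Gradient updates per 100 environment steps for a schedule with
    training interval T, update times K, and ea_times extra EA updates
    per sampled batch."""
    return (100 // T) * K * (1 + ea_times)

# The four configurations compared in the ablation share one budget:
budgets = {
    "EA-MADDPG (EA-times=3)": updates_per_100_steps(T=100, K=1, ea_times=3),
    "MADDPG (t=25)":          updates_per_100_steps(T=25,  K=1),
    "MADDPG (1+1+1+1)":       updates_per_100_steps(T=100, K=4),
    "MADDPG (1x4)":           updates_per_100_steps(T=100, K=4),  # reuses one batch
}
```

The configurations differ only in which data the updates are spent on, which is exactly what isolates the contribution of the shuffle trick.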
The Effect of EA-Times
To study the effect of EA-times, we compare the performance of EA-MADDPG with EA-times ranging from 0 to 31. The environment is Cooperative Navigation. Note that EA-MADDPG with EA-times=0 corresponds to vanilla MADDPG. At least 10 random seeds are used for each case. Table 4 shows the performance and per-episode training time of each case, to help select the best EA-times for the trade-off between performance and training speed. The learning curves of each case are also shown in Fig. 4(b) and 4(c). EA-MADDPG with any EA-times (except 31) significantly outperforms vanilla MADDPG. As expected, the learning speed (in terms of episodes) increases as EA-times increases. Within a suitable range (less than 31), the convergence result improves as EA-times increases, while the boost in both performance and training speed slows down once EA-times exceeds 3. Considering training speed in wall-clock time, we suggest that the most efficient value of EA-times in this environment lies within the range of 3 to 7.
5 Related Work
Experience Replay (Lin, 1992) has become a fundamental component of off-policy RL since its use in DQN (Mnih et al., 2015). Previous work has attempted to improve experience replay in multiple aspects, such as Prioritized Experience Replay (Schaul et al., 2015), which prioritizes experiences in the replay buffer to speed up training, and Hindsight Experience Replay (Andrychowicz et al., 2017), which addresses the sparse-reward problem by introducing a sub-goal and recomputing the reward of each transition with respect to that sub-goal. Some work focuses on the multi-agent domain; e.g., Foerster et al. (2017) use a multi-agent variant of importance sampling to naturally decay obsolete data and condition each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. All of these are orthogonal to our work and can easily be combined with it for further improvement.
Our approach, especially the Shuffle Trick, may be seen as a novel form of exploration strategy in the multi-agent domain. Simple exploration strategies, such as the action-noise exploration used in this paper and in MADDPG, may need exponentially many steps to find a (near-)optimal policy (Whitehead, 1991). With the shuffle trick, we factorially augment the original dataset, which greatly accelerates exploration. Many other exploration strategies have been successfully applied to deep reinforcement learning, such as those of Bellemare et al. (2016), Tang et al. (2017), and Zhang et al. (2019). All of them are orthogonal to our work and can be combined with it for faster exploration in the multi-agent domain.
Interestingly, a data augmentation method used in Liu et al. (2019b) is close to ours. They shuffle the order of agents' observations and actions (assuming that every agent in a group shares the reward value) and then train the parameters only on the generated experience. The differences between Liu et al. (2019b) and our work are that, first, our Shuffle Trick is applicable to both homogeneous and heterogeneous environments; second, we train on the original experience and then on the generated experiences for EA-times, making full use of the generated dataset and accelerating training.
In this paper, we presented a novel technique called the Shuffle Trick to perform a fast, thorough and symmetric exploration that factorially expands the original dataset in the multi-agent domain. Based on the shuffle trick, we introduced a time-efficient training method called Experience Augmentation, which accelerates training and boosts convergence in off-policy MARL. We demonstrated this experimentally with MADDPG in two homogeneous environments and one heterogeneous environment. In addition, we carried out in-depth ablation studies on the proposed algorithm, proved the necessity of the shuffle trick, and examined the effects of EA-times on training speed and convergence.
- Andrychowicz et al. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
- Bellemare et al. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 449–458.
- Bellemare et al. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479.
- Chebotar et al. (2017). Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381–3388.
- Clark and Storkey (2014). Teaching deep convolutional neural networks to play Go.
- Foerster et al. (2017). Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1146–1155.
- Gupta et al. (2017). Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83.
- Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML.
- He et al. (2020). Soft hindsight experience replay.
- Horgan et al. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
- Levine et al. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17(1), pp. 1334–1373.
- Lillicrap et al. (2015). Continuous control with deep reinforcement learning. CoRR.
- Lin (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4), pp. 293–321.
- Liu et al. (2019a). Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning. IEEE Transactions on Mobile Computing.
- Liu et al. (2019b). PIC: permutation invariant critic for multi-agent deep reinforcement learning. arXiv preprint arXiv:1911.00025.
- Lowe et al. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- Mordatch and Abbeel (2018). Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Rashid et al. (2020). Monotonic value function factorisation for deep multi-agent reinforcement learning.
- Schaul et al. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
- Schmitt et al. (2019). Off-policy actor-critic with shared experience replay.
- Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484.
- Silver et al. (2014). Deterministic policy gradient algorithms.
- Tang et al. (2017). #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762.
- Vidal et al. (2002). Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics and Automation 18(5), pp. 662–669.
- Watkins and Dayan (1992). Q-learning. Machine Learning 8(3-4), pp. 279–292.
- Whitehead (1991). A complexity analysis of cooperative mechanisms in reinforcement learning. In AAAI, pp. 607–613.
- Yang et al. (2018). Mean field multi-agent reinforcement learning.
- Zhang et al. (2019). Explicit planning for efficient exploration in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 7486–7495.
- Zhang and Sutton (2017). A deeper look at experience replay. arXiv preprint arXiv:1712.01275.