1 Introduction
In reinforcement learning, an agent seeks an optimal policy that maximizes the long-term reward by interacting with an unknown environment. Policy-based methods, e.g., DDPG (Lillicrap et al. (2016)) and TD3 (Fujimoto et al. (2018)), optimize parameterized policies by gradient ascent on the performance objective. Directly optimizing the policy with vanilla policy gradient methods may incur large policy changes, which can result in performance collapse due to unconstrained updates. To resolve this issue, Trust Region Policy Optimization (TRPO) (Schulman et al. (2015)) and Proximal Policy Optimization (PPO) (Schulman et al. (2017)) optimize a surrogate function in a conservative way; both are on-policy methods that perform policy updates based on samples collected by the current policy. These on-policy methods have the desirable feature that they generally achieve stable performance. This is usually not the case for off-policy learning, where the policy is updated according to samples drawn from a different policy, e.g., using an experience replay buffer (Haarnoja et al. (2018)). However, as on-policy methods learn only from what they collect, they can suffer from insufficient exploration, especially in sparse-reward environments (Colas et al. (2018)). Although TRPO and PPO start from a stochastic policy, the randomness in the policy decreases quickly during training. As a result, they can converge prematurely to bad local optima in high-dimensional or sparse-reward tasks.
Indeed, achieving efficient exploration is challenging in deep reinforcement learning. There has been recent progress in improving exploration, ranging from count-based exploration (Ostrovski et al. (2017); Tang et al. (2017)) and intrinsic motivation (Houthooft et al. (2016); Bellemare et al. (2016); Pathak et al. (2017)) to noisy networks (Fortunato et al. (2018); Plappert et al. (2018)). However, these methods either introduce sensitive parameters that require careful per-task tuning (Khadka and Tumer (2018)), or require learning additional complex structures to estimate the novelty of states. Hong et al. (2018) and Masood and Doshi-Velez (2019) propose to encourage exploration by augmenting the objective function with a diversity term that measures the distance between current and prior policies. Yet, the distance between the current policy and past policies can be small for trust-region methods, where the policy update is controlled, which limits the applicability of this kind of approach.
The ensemble technique has been widely applied in model-free reinforcement learning by using an ensemble of value functions or actors for off-policy algorithms, and has shown great power to improve exploration (Osband et al. (2016); Chen et al. (2017)), to better estimate value functions (Anschel et al. (2017)), and to escape from local optima (Zhang and Yao (2019)). In this paper, we aim to combine ensemble methods with on-policy algorithms to enable better exploration. Effectively utilizing ensembles for on-policy methods under limited samples is challenging, since interacting with the environment to obtain experiences is expensive (Buckman et al. (2018)). Simply training an ensemble of policies and picking the best one fails to make good use of the policy ensemble: due to the nature of on-policy algorithms, each policy can only be updated with samples collected by itself, so rolling out every policy at each iteration spreads the limited sample budget thin and wastes samples.
To tackle this problem, we propose a novel method, Multi-Path Policy Optimization (MPPO), which improves exploration for on-policy algorithms using the ensemble method. Here, a path refers to the sequence of policies generated during the course of policy optimization starting from a single policy. Figure 1 shows the high-level schematic of MPPO, which has four main components: pick and rollout, policy optimization, value network update, and policy buffer update. Specifically, MPPO starts with several randomly initialized policies in the policy buffer and a shared value network. At each iteration, a candidate policy is picked from the policy buffer according to a picking rule, defined as a weighted combination of performance and entropy, introduced to enable a trade-off between exploration and exploitation. Then, the picked policy interacts with the environment by rollouts to collect samples, and is updated by policy optimization based on the samples and the shared value network. These samples also contribute to updating the shared value network. Finally, the improved picked policy updates the policy buffer by replacing itself, which retains the diversity of the policy buffer.
With this scheme, MPPO maintains multiple policy paths, which increases the exploration ability during training. Different policy paths provide diverse experiences for the shared value network to enable a better estimation (Nachum et al. (2018)), which yields a better signal for how good each state is. With a better estimated value function, policies updated by policy optimization are better able to collect trajectories with higher rewards. Therefore, MPPO can provide better guidance for each picked policy to explore states and actions that were not previously known to have high rewards.
The picking rule, a critical component of MPPO, favors selecting the policy that is most desirable to roll out and to optimize at each iteration, i.e., the one with good performance that is simultaneously explorative. We prove that when MPPO switches to an explorative policy, the performance variation of the picked policies can be bounded and controlled. This is a useful feature that ensures smooth policy transitions. We also empirically validate that the potential variation is small, and that the picked policy converges to one single policy, which ensures stability. Moreover, since only one candidate policy is picked at each iteration, our method does not require additional samples compared with the base policy optimization method.
We apply MPPO to two widely adopted on-policy algorithms, TRPO and PPO, and conduct extensive experiments on several continuous control tasks based on MuJoCo (Todorov et al. (2012)). Experimental results demonstrate that our proposed algorithms, MP-TRPO and MP-PPO, provide significant improvements over state-of-the-art exploration methods in terms of sample efficiency and final performance, without incurring high computational cost. We also investigate the critical advantages of the proposed picking rule and policy buffer update strategy over other ensemble methods.
2 Preliminaries
A Markov decision process (MDP) is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$, $\mathcal{A}$ denote the sets of states and actions, $P(s'|s,a)$ the transition probability from state $s$ to state $s'$ under action $a$, $r(s,a)$ the corresponding immediate reward, and $\gamma \in (0,1)$ the discount factor. The agent interacts with the environment by its parameterized policy $\pi_\theta$, with the goal to learn the optimal policy that maximizes the expected discounted return $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$.

Trust Region Policy Optimization (TRPO) (Schulman et al. (2015)) learns the policy parameter $\theta$ by optimizing a surrogate function in a conservative way. Specifically, it limits the step size of the policy update using a trust-region constraint, i.e.,
(1) $\max_\theta\ \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\,\hat{A}(s_t,a_t)\Big]$

(2) $\text{s.t.}\ \hat{\mathbb{E}}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\big] \le \delta,$
where $\hat{\mathbb{E}}_t$ denotes the empirical average over a finite batch of samples, and $\hat{A}$ estimates the advantage function defined as

(3) $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).$
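In practice, the advantage in Eq. (3) is estimated from sampled trajectories; implementations of TRPO and PPO commonly use Generalized Advantage Estimation (GAE), which also appears in the hyperparameter tables in the appendix. A minimal sketch (the `values` layout with a trailing bootstrap entry is an assumption of this sketch, not prescribed by the text):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """Generalized Advantage Estimation:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    A_t = sum_l (gamma * lam)^l * delta_{t+l}.
    `values` has length T + 1 (including a bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With `gamma = lam = 1`, the estimate reduces to the plain sum of future temporal-difference errors.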
One desired feature of TRPO is that it guarantees monotonic policy improvement, i.e., each policy update step leads to a better-performing policy during training. However, it is not computationally efficient, as it involves solving a second-order optimization problem using the conjugate gradient method.
Proximal Policy Optimization (PPO) (Schulman et al. (2017)) is a simpler method involving only first-order optimization using stochastic gradient descent. PPO maximizes a KL-penalized or clipped version of the objective function to ensure stable policy updates; the clipped version is more common and is reported to perform better than the KL-penalized version. Specifically, the objective of the clipped version is to maximize

(4) $L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big],$

where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$ denotes the probability ratio, and $\epsilon$ is the clipping parameter.
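The clipped surrogate in Eq. (4) is simple to compute once probability ratios and advantage estimates are available. A minimal NumPy sketch (function name and array inputs are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Eq. (4): mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))
```

The elementwise `min` makes the objective pessimistic: a ratio far outside $[1-\epsilon, 1+\epsilon]$ yields no extra credit for positive advantages, while losses from negative advantages are not clipped away.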
3 Multi-Path Policy Optimization
In this section, we start with an example illustrating that TRPO lacks sufficient exploration ability, which motivates the development of our Multi-Path Policy Optimization (MPPO) method. Then, we describe the MPPO method and apply it to two widely adopted on-policy algorithms, TRPO and PPO.
3.1 A Motivating Example
Figure 2 shows a Maze environment with sparse rewards, where the black lines represent walls. The agent always starts at the lower left corner of the maze, with the goal of reaching the destination in the lower right corner. A reward is given only when the agent reaches the destination, and no reward is given otherwise.
We compare three schemes in this environment: TRPO, MP-TRPO (MPPO applied to TRPO), and Multi-TRPO (training an ensemble of policies and picking the best one). For a fair comparison, all methods use the same amount of samples during training. As shown in Figure 4, the agent can fail to reach the goal state under TRPO. Figure 5(a) shows the resulting state visitation density under TRPO after training for 1 million steps: the agent explores only a very limited area of the maze and mainly stays on the left side. It is also worth noting that simply training the policy ensemble and choosing the best, i.e., Multi-TRPO, also fails to consistently find the destination. Although it searches a larger region, it still mostly re-explores the left part, as shown in Figure 5(b). In contrast, MP-TRPO always successfully reaches the destination after 0.6 million steps while the others fail. As illustrated in Figure 5(c), it is capable of bypassing the wall and exploring both sides of the maze.
This is because TRPO suffers from insufficient exploration: the entropy of the policy trained with TRPO decreases quickly as the policy is optimized, as shown in Figure 4. Multi-TRPO maintains greater exploration ability with the policy ensemble. However, recall that all three schemes consume the same amount of samples for training. As Multi-TRPO rolls out all policies at each iteration, the performance improvement of any single policy in the ensemble is limited compared with MP-TRPO. Indeed, for Multi-TRPO, acquiring diverse samples from the policy ensemble comes at the expense of insufficient training of each policy under a limited number of samples, because on-policy algorithms cannot utilize experiences from other policies and can only update a policy based on samples it collected itself. On the other hand, MP-TRPO optimizes the policy while simultaneously maintaining enough exploration ability during training.
We next describe our proposed method systematically, together with the motivation behind it, to illustrate why MPPO helps to improve exploration.
3.2 Method
The main idea of MPPO is summarized as follows. The policy buffer is initialized with randomly initialized policies, and a shared value network is also randomly initialized. At each iteration $t$, a candidate policy is picked from the policy buffer, and is used as the rollout policy to interact with the environment and generate a set of samples. The candidate policy is optimized according to the collected samples based on the shared value network. The collected samples also contribute to updating the shared value network. Finally, the improved candidate policy updates the policy buffer by replacing itself.
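The iteration above can be sketched in a few lines. Everything here is a toy stand-in (`ToyPolicy`, the placeholder rollout string, and the weight value are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyPolicy:
    """Toy stand-in for a parameterized policy (hypothetical)."""
    def __init__(self):
        self.perf = rng.normal()      # current performance estimate J(pi)
        self.entropy = rng.uniform()  # current entropy estimate H(pi)

def pick(buffer, beta=0.1):
    """Picking rule: weighted min-max-normalized performance and entropy."""
    perf = np.array([p.perf for p in buffer])
    ent = np.array([p.entropy for p in buffer])
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return int(np.argmax((1 - beta) * norm(perf) + beta * norm(ent)))

def mppo_iteration(buffer, update_value_net, optimize):
    i = pick(buffer)                         # 1) pick a candidate policy
    samples = f"rollout of policy {i}"       # 2) rollout (placeholder for real samples)
    improved = optimize(buffer[i], samples)  # 3) policy optimization on the candidate
    update_value_net(samples)                # 4) shared value network update
    buffer[i] = improved                     # 5) the improved policy replaces itself
    return i
```

Note that only one policy is rolled out per iteration, so the sample budget matches that of the base single-path algorithm.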
Specifically, the key components of the Multi-Path Policy Optimization method are as follows:
3.2.1 Pick and rollout.
From the previous analysis of Multi-TRPO, although an ensemble of policies can bring diverse samples, policies in the ensemble cannot exploit each other's experiences. Therefore, we propose to pick a single candidate policy from the current policy buffer at each iteration. The picking rule of MPPO is to choose the policy with the highest score $f_i$, which takes into account both performance and entropy as defined in Eq. (5), i.e.,

(5) $f_i = (1-\beta)\,\bar{J}(\pi_i) + \beta\,\bar{H}(\pi_i),$

where $\bar{J}(\pi_i)$ and $\bar{H}(\pi_i)$ denote the normalized performance and entropy according to min-max normalization as in Eq. (6):

(6) $\bar{J}(\pi_i) = \dfrac{J(\pi_i) - \min_j J(\pi_j)}{\max_j J(\pi_j) - \min_j J(\pi_j)}, \qquad \bar{H}(\pi_i) = \dfrac{H(\pi_i) - \min_j H(\pi_j)}{\max_j H(\pi_j) - \min_j H(\pi_j)}.$

In Eq. (6), $J(\pi_i)$, $H(\pi_i)$ denote the performance and entropy of policy $\pi_i$ respectively, where we use the Shannon entropy defined by $H(\pi_i) = \mathbb{E}_{s, a \sim \pi_i}\big[-\log \pi_i(a|s)\big]$.
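The score computation can be sketched directly from the picking rule and min-max normalization above; the Monte-Carlo entropy estimate from sampled log-probabilities is an implementation assumption of this sketch:

```python
import numpy as np

def entropy_estimate(logps):
    """Monte-Carlo Shannon entropy: H(pi) ~= -mean of log pi(a|s) over samples."""
    return -float(np.mean(logps))

def minmax(x):
    """Min-max normalization to [0, 1]; constant inputs map to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def picking_scores(perfs, entropies, beta):
    """Weighted sum of normalized performance and normalized entropy."""
    return (1 - beta) * minmax(perfs) + beta * minmax(entropies)
```

With a small weight, a high-entropy but low-performing policy can still win the argmax only when performances are close, which is the intended trade-off.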
The picking rule favors the policy that is most desirable to roll out and to optimize, i.e., the one with good performance that is simultaneously explorative, which is a critical component of MPPO. In Eq. (5), the weight $\beta$ provides the trade-off between exploration and exploitation. Note that a criterion focusing only on performance cannot make good use of the policy buffer, as it tends to pick the policy updated in the last iteration; this leads to an optimization process similar to that of a single path, which also suffers from insufficient exploration. Considering the entropy term encourages exploring new behaviors. However, if one always picks the policy with the maximum entropy, the policy ensemble fails to exploit learned good behaviors. Our weighted rule is designed to strike a good trade-off between exploration and exploitation.
It is crucial to note that only the candidate policy then interacts with the environment for sample collection, based on which both the candidate policy and the shared value network are updated. Therefore, MPPO does not require more samples than the base policy optimization method.
3.2.2 Policy optimization.
At each iteration $t$, only the candidate policy is optimized using a base policy optimization method, according to samples collected by itself and the shared value network. The objective of policy optimization is to maximize the expected advantage over the policy distribution, where the estimated policy gradient is

(7) $\hat{g} = \frac{1}{|\mathcal{B}|} \sum_{(s,a) \in \mathcal{B}} \nabla_\theta \log \pi_\theta(a|s)\, \hat{A}(s,a),$
given a batch of samples $\mathcal{B}$. In fact, MPPO enables a better estimation of the advantage function through its mechanism of utilizing the policy ensemble. Therefore, policy optimization drives each picked policy to explore previously unseen good states and actions.
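As a concrete instance of the gradient estimator, consider a one-dimensional Gaussian policy $a \sim \mathcal{N}(\theta, 1)$, for which $\nabla_\theta \log \pi_\theta(a) = a - \theta$. The toy policy class is illustrative, not the paper's architecture:

```python
import numpy as np

def pg_estimate(theta, actions, advantages):
    """Policy-gradient estimate for a 1-D Gaussian policy a ~ N(theta, 1),
    where grad_theta log pi(a) = a - theta: mean of (a - theta) * A."""
    actions = np.asarray(actions, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    return float(np.mean((actions - theta) * advantages))
```

Actions with positive advantage pull the mean toward them; actions with zero advantage contribute nothing.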
3.2.3 Value network update.
Samples collected by the candidate policy also contribute to updating the shared value network $V_\phi$ by minimizing the mean-squared error

(8) $L(\phi) = \hat{\mathbb{E}}_t\Big[\big(V_\phi(s_t) - \hat{R}_t\big)^2\Big],$

where $\hat{R}_t$ denotes the empirical return from state $s_t$.
During the course of training with MPPO, the shared value network exploits diverse samples collected by the policies picked from the diverse policy buffer at each iteration. In this way, it can estimate the value function better than in single-path training. Therefore, it provides more information for the advantage function to distinguish good actions from bad ones, which is critical for policy optimization.
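A minimal sketch of the value update: fitting a linear value function to empirical returns by gradient descent on the mean-squared error (the linear features and learning rate are assumptions of this sketch; the paper uses a neural network):

```python
import numpy as np

def fit_value(states, returns, lr=0.1, steps=200):
    """Fit a linear value function V_w(s) = w . [s, 1] by gradient descent
    on the mean-squared error against empirical returns."""
    phi = np.stack([states, np.ones_like(states)], axis=1)  # features [s, 1]
    w = np.zeros(2)
    for _ in range(steps):
        err = phi @ w - returns
        w -= lr * phi.T @ err / len(returns)  # gradient of 0.5 * MSE
    return w
```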
3.2.4 Policy buffer update.
The updated policy $\pi'_{i_t}$ is added to the policy buffer $\mathcal{P}$ by replacing the candidate policy itself, i.e.,

(9) $\mathcal{P} \leftarrow \big(\mathcal{P} \setminus \{\pi_{i_t}\}\big) \cup \{\pi'_{i_t}\},$
which maintains the diversity of the policy buffer. A common alternative is to replace the worst policy in the buffer with the improved policy, as is usual in evolution-based methods (Khadka and Tumer (2018)). However, this updating scheme quickly loses the diversity of the policy buffer and ultimately leads to a set of very similar policies, which results in a low level of exploration.
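The two update strategies contrasted above can be sketched as follows; policies are represented abstractly, and `perf` is a hypothetical performance-lookup callback:

```python
import numpy as np

def replace_self(buffer, idx, improved):
    """MPPO update: the improved policy replaces the picked policy itself."""
    out = list(buffer)
    out[idx] = improved
    return out

def replace_worst(buffer, improved, perf):
    """Evolution-style update: the improved policy evicts the worst performer."""
    out = list(buffer)
    out[int(np.argmin([perf(p) for p in buffer]))] = improved
    return out
```

Under `replace_worst`, weak but diverse policies are evicted first, so repeated updates collapse the buffer toward copies of the strongest lineage; `replace_self` keeps every lineage alive.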
3.3 Multi-Path Trust Region Policy Optimization
We first apply our proposed MPPO method to the widely adopted on-policy algorithm TRPO. The resulting Multi-Path Trust Region Policy Optimization (MP-TRPO) algorithm is shown in Algorithm LABEL:alg:mp_trpo. Note that the performance of the updated policy is estimated from the samples collected at the current iteration; therefore, MP-TRPO does not require extra samples to evaluate the updated policy.
During the course of policy optimization, if the same policy is picked as in the last iteration, MP-TRPO guarantees a monotonic improvement of the policy picked in the current iteration over that in the last iteration by Schulman et al. (2015). On the other hand, if a policy is picked that is more explorative but performs worse than the one picked in the last iteration, there may be a temporary performance drop. In Theorem 2, we show that such a performance drop can be bounded, ensuring a smooth policy transition. The proof is deferred to the supplemental material.
Theorem 2
Let $i_t$, $i_{t+1}$ denote the indexes of the policies picked at timesteps $t$, $t+1$, respectively. Denote the improvement of the updated policy $\pi'_{i_{t+1}}$ over $\pi_{i_t}$ as $\delta_{t+1} = J(\pi'_{i_{t+1}}) - J(\pi_{i_t})$. Then, the following bound holds:
$$\delta_{t+1} \ge -\frac{\beta}{1-\beta}\Big(\max_j J(\pi_j) - \min_j J(\pi_j)\Big).$$
Theorem 2 shows that although there may be a temporary performance drop due to switching to a more explorative policy, the sacrifice is bounded by a factor depending on the weight $\beta$ times the difference between the performance of the best and the worst policies in the current policy buffer.
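The worst-case temporary drop is then easy to compute; this sketch assumes the bound takes the form $-\frac{\beta}{1-\beta}\,(\max_j J - \min_j J)$ as stated in Theorem 2:

```python
def drop_bound(beta, j_max, j_min):
    """Worst-case temporary performance drop when the picked policy switches,
    assuming the bound -(beta / (1 - beta)) * (J_max - J_min)."""
    return -(beta / (1.0 - beta)) * (j_max - j_min)
```

For example, with a small weight and a modest spread between the best and worst buffered policies, the admissible drop is a small fraction of that spread, and it vanishes entirely when the weight is zero.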
Figure 6 shows the course of policy picking of MP-TRPO during training on Maze (Figure 2) for a single seed. In the beginning, MP-TRPO may pick different policies to collect samples and to optimize according to the picking rule, during which the performance gap between the best and the worst policies does not grow large. Note that we use a fixed value of $\beta = 0.1$ in our experiments; therefore, the temporary performance drop is small by Theorem 2. In the end, MP-TRPO converges to picking a single policy, in which case the performance of the picked policy is monotonically increasing. This observation also holds for other seeds, and the full empirical result is deferred to the supplemental material. We remark that our method maintains good performance throughout the policy optimization process, while bringing the advantage of better exploration.
3.4 Multi-Path Proximal Policy Optimization
We also apply MPPO to another on-policy algorithm, PPO (Schulman et al. (2017)), and obtain the MP-PPO algorithm. Due to space limitations, the full algorithm is deferred to the supplemental material.
4 Experiments
We conduct extensive experiments to investigate the following key questions: (1) How does MPPO compare with single-path policy optimization and state-of-the-art exploration methods? (2) What is the effect of the number of paths and the weight $\beta$? (3) Which component of MPPO is critical for the improvement of the exploration ability? (4) Is MPPO generally applicable to a baseline on-policy reinforcement learning algorithm to encourage exploration?
4.1 Experimental Setup
We evaluate MPPO on several continuous control MuJoCo environments (Todorov et al. (2012)), including more challenging variants of the original environments with sparse rewards (Houthooft et al. (2016); Kang et al. (2018)), to examine the exploration ability of our algorithms. For example, in SparseDoublePendulum, a reward is given only when the agent reaches the goal of swinging the double pendulum upright, and no reward is given otherwise. Detailed descriptions of the benchmark environments are deferred to the supplemental material. Each algorithm is run with multiple random seeds, and the performance is evaluated over several episodes at fixed intervals of environment steps. Note that the performance of MPPO is evaluated by the picked policy. The average return in evaluation is reported as the solid line, with the shaded region denoting a confidence interval. For fair comparisons, the hyperparameters for all compared algorithms are set to the best set of hyperparameters reported in Henderson et al. (2018).
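The shaded regions can be produced with a standard normal-approximation interval over seeds; the 95% level (z = 1.96) is an assumption of this sketch, as the text does not state the exact level:

```python
import numpy as np

def mean_and_ci(returns_per_seed, z=1.96):
    """Mean return across seeds and a normal-approximation CI half-width
    (z = 1.96 corresponds to a 95% interval; the level is assumed here)."""
    x = np.asarray(returns_per_seed, dtype=float)
    half_width = z * x.std(ddof=1) / np.sqrt(len(x))
    return float(x.mean()), float(half_width)
```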
4.2 Baselines
To comprehensively study the MP-TRPO algorithm, we compare it with six baselines. For a fair comparison, all methods use the same amount of samples during the course of policy optimization. TRPO refers to vanilla single-path TRPO, while Multi-TRPO is an ensemble method that trains multiple single-path TRPO policies and chooses the best one. We further compare MP-TRPO against Div-TRPO (Hong et al. (2018)) and Curiosity-TRPO (Pathak et al. (2017)), both of which are state-of-the-art methods for exploration. Div-TRPO refers to the diversity-driven approach, which augments the loss function with the distance between current and prior policies. Curiosity-TRPO is a curiosity-driven approach, which improves exploration by augmenting the reward function with learned intrinsic rewards. To validate the effect of the shared value network, we further compare with Multi-TRPO (Independent), where the policy ensemble uses an independent value network for each policy. To evaluate the importance of the replacement strategy for updating the policy buffer, we compare MP-TRPO with a variant, ReplaceWorst, which updates the policy buffer by replacing the worst policy with the improved candidate policy. The effectiveness of the picking rule is verified using different values of the weight $\beta$. Then, we apply our proposed multi-path policy optimization mechanism to another baseline policy optimization method, PPO, to demonstrate the general applicability of the MPPO method, and conduct a similar evaluation.
4.3 Ablation Study
4.3.1 The effect of the number of paths.
Figure 7 shows the performance of MP-TRPO and MP-PPO with varying numbers of paths on SparseDoublePendulum. The number of paths trades off the diversity of the policy buffer against sample efficiency. A larger number maintains greater diversity, but may require more samples to learn, as there are more policies to be picked and optimized in early stages of learning. Indeed, an intermediate number of paths provides the best trade-off. We find that MP-TRPO achieves the best performance with a moderate number of paths, which we fix on all environments. For MP-PPO, a relatively smaller number is sufficient and performs best; this is because PPO itself exhibits greater exploration ability than TRPO, so we use a smaller number of paths in all environments for MP-PPO. Note that MPPO with all tested numbers of paths outperforms the corresponding baseline policy optimization method (TRPO or PPO). It is also worth noting that MPPO uses only 1.67% and 4.78% more memory for MP-TRPO and MP-PPO compared with TRPO and PPO, respectively. A summary of memory consumption for different numbers of paths is deferred to the supplemental material.
4.3.2 The effect of the weight $\beta$.
In the picking rule, $\beta$ controls the trade-off between exploration and exploitation. A larger $\beta$ puts more emphasis on the exploration ability of the picked policy, but may fail to utilize the result of policy optimization. In addition, according to Theorem 2, a large $\beta$ may lead to a temporary performance drop. On the other hand, a smaller $\beta$ focuses more on exploiting the current best-performing policy in the policy buffer, where $\beta = 0$ refers to always picking the best policy based on the current estimate. We vary $\beta$ for MP-TRPO on SparseDoublePendulum, and the result is shown in Figure 8. As expected, a small $\beta$ achieves the best performance, so we fix $\beta = 0.1$ in all environments for both MP-TRPO and MP-PPO.
4.4 Performance Comparison
Table 1: Final performance comparison of MP-TRPO (numbers in brackets denote the confidence interval).

Environment | MP-TRPO | MP-TRPO (ReplaceWorst) | Div-TRPO | Curiosity-TRPO | Multi-TRPO (Independent) | Multi-TRPO | TRPO
Ant | 3017.02 (116.405) | 470.05 (57.65) | 2635.00 (138.28) | 2744.79 (190.23) | 0.93 (1.99) | 3.90 (1.19) | 2558.33 (120.89)
Hopper | 3257.55 (62.68) | 76.25 (32.68) | 2957.31 (82.94) | 2786.99 (258.17) | 2878.34 (91.57) | 3138.41 (17.66) | 2947.19 (111.43)
Swimmer | 340.92 (4.39) | 179.67 (29.63) | 199.06 (20.58) | 232.89 (38.93) | 198.00 (27.47) | 339.17 (2.21) | 186.21 (32.63)
SparseCartPoleSwingup | 320.47 (14.48) | 49.10 (16.99) | 203.22 (33.56) | 244.23 (11.42) | 180.17 (11.50) | 213.07 (4.07) | 238.88 (11.08)
SparseDoublePendulum | 1.00 (0.00) | 0.33 (0.15) | 1.00 (0.00) | 1.00 (0.00) | 0.83 (0.12) | 0.98 (0.01) | 1.00 (0.00)
SparseHalfCheetah | 756.02 (14.02) | 24.52 (8.14) | 699.03 (57.15) | 656.98 (29.24) | 438.50 (66.98) | 676.22 (14.35) | 639.32 (38.40)
SparseHopper | 302.32 (52.48) | 0.00 (0.00) | 89.60 (45.09) | 188.47 (33.51) | 14.63 (10.29) | 38.33 (25.82) | 182.63 (60.13)
SparseWalker2d | 186.48 (43.15) | 0.00 (0.00) | 112.75 (33.29) | 87.33 (25.18) | 37.67 (12.57) | 56.18 (17.27) | 53.78 (15.57)
The comparative results of MP-TRPO are demonstrated in Figure 9. As shown, MP-TRPO is consistently more sample efficient than Div-TRPO in all environments. In addition, it outperforms Curiosity-TRPO in all but one environment in terms of sample efficiency, with an especially large margin in sparse environments. Table 1 summarizes the performance at the end of training, showing that MP-TRPO achieves the best final performance in all environments.
Div-TRPO augments the loss function with a measure of the distance between the current and past policies. As trust-region methods limit the update, the distance among prior policies is not large; thus, the diversity-driven technique does not enable a significant improvement of exploration on TRPO. Curiosity-TRPO augments the reward function with a curiosity term that measures how novel a state is. It encourages the agent to re-explore states that are known to be unfamiliar. However, it can be challenging for the agent to discover such states in the first place in sparse environments.
Regarding the shared value network, Multi-TRPO outperforms Multi-TRPO (Independent), as it enables a better estimation of the value function. Additionally, MP-TRPO outperforms Multi-TRPO significantly. As for the strategy for policy buffer updates, note that MP-TRPO (ReplaceWorst), where replacing the worst policy is a common strategy in evolutionary methods, performs poorly in all but one benchmark environment. After updating the picked policy, it replaces the worst-performing policy in the buffer with the improved policy. Under this updating scheme, the policy buffer quickly loses its diversity and soon stores only similar copies of a single policy; thus, MP-TRPO (ReplaceWorst) performs even worse than Multi-TRPO (Independent). In contrast, the replacement strategy of MP-TRPO best preserves the diversity of the policy buffer while ensuring policy optimization.
Our results provide empirical evidence that MPPO is an efficient mechanism to fully utilize an ensemble of policies without incurring high computational overhead.
To show that MPPO is readily applicable to other baseline on-policy algorithms, we build it upon PPO and evaluate the resulting MP-PPO algorithm by comparing it with the corresponding PPO, Multi-PPO, and Multi-PPO (Independent) algorithms. Results are shown in Figure 10, where MP-PPO outperforms the baseline method in all environments, demonstrating its effectiveness in encouraging exploration. The final performance is deferred to the supplemental material.
5 Related Work
Entropy-regularized reinforcement learning (RL) (Haarnoja et al. (2017, 2018); Nachum et al. (2018)) optimizes the standard objective augmented by an entropy regularizer, which can be viewed as the distance to the random policy, and learns a stochastic policy for better exploration. Our method differs in that MPPO still optimizes the standard objective; the entropy bonus enters only through the picking rule, alongside performance.
The ensemble technique has shown great potential in model-free RL. Liu et al. (2017) aim to learn a set of diverse policies by optimizing a distribution of policy parameters via solving a Bayesian inference problem, which trains multiple policies at each iteration with information sharing of gradients. However, it cannot ensure stable performance when applied to trust-region methods.
Zhang and Yao (2019) propose to escape from local maxima for an off-policy algorithm, DDPG (Lillicrap et al. (2016)), by utilizing an ensemble of actors. The critic is updated according to the best action proposed by all actors, i.e., the one with maximum Q-value, and all actors are trained in parallel. However, this cannot be applied to RL algorithms with stochastic policies. There have also been a number of approaches that improve exploration by combining evolutionary methods with deep reinforcement learning, maintaining a population of agents. Gangwani and Peng (2017) apply policy gradient methods to mutate the population. Khadka and Tumer (2018) utilize a population of evolutionary actors to collect samples, and an RL actor based on DDPG (Lillicrap et al. (2016)) is updated using these samples. Pourchot and Sigaud (2019) propose to combine the cross-entropy method and TD3 (Fujimoto et al. (2018)). Our work differs from previous works in several aspects. First, we train a single policy at each iteration instead of a population of policies, and only the picked policy interacts with the environment. Second, we use multiple paths to enable better exploration than a single path for on-policy algorithms, while previous works cannot be applied to on-policy algorithms.
6 Conclusion
We present MPPO, which uses an ensemble of policies to improve exploration for on-policy reinforcement learning algorithms. We apply the MPPO method to TRPO and PPO, and show that performance can be guaranteed during policy switching. We conduct extensive experiments on several MuJoCo tasks, including environments with sparse rewards, and show that MPPO outperforms baselines significantly in both sample efficiency and final performance.
Appendix A Theoretical Analysis for MP-TRPO
Theorem 2
Let $i_t$, $i_{t+1}$ denote the indexes of the policies picked at timesteps $t$, $t+1$, respectively. Denote the improvement of the updated policy $\pi'_{i_{t+1}}$ over $\pi_{i_t}$ as $\delta_{t+1} = J(\pi'_{i_{t+1}}) - J(\pi_{i_t})$. Then, the following bound holds:
$$\delta_{t+1} \ge -\frac{\beta}{1-\beta}\Big(\max_j J(\pi_j) - \min_j J(\pi_j)\Big).$$

Proof. As $\pi_{i_{t+1}}$ is the policy selected at timestep $t+1$, we have

(10) $(1-\beta)\,\bar{J}(\pi_{i_{t+1}}) + \beta\,\bar{H}(\pi_{i_{t+1}}) \ge (1-\beta)\,\bar{J}(\pi_{i_t}) + \beta\,\bar{H}(\pi_{i_t}).$

Thus,

(11) $\bar{J}(\pi_{i_{t+1}}) - \bar{J}(\pi_{i_t}) \ge \frac{\beta}{1-\beta}\big(\bar{H}(\pi_{i_t}) - \bar{H}(\pi_{i_{t+1}})\big) \ge -\frac{\beta}{1-\beta},$

since min-max-normalized entropies lie in $[0,1]$. According to the min-max normalization, we have

(12) $\bar{J}(\pi_i) = \dfrac{J(\pi_i) - \min_j J(\pi_j)}{\max_j J(\pi_j) - \min_j J(\pi_j)}.$

Then, we obtain

(13) $J(\pi_{i_{t+1}}) - J(\pi_{i_t}) \ge -\frac{\beta}{1-\beta}\big(\max_j J(\pi_j) - \min_j J(\pi_j)\big).$

According to the monotonic improvement theorem (Schulman et al. (2015)), $J(\pi'_{i_{t+1}}) \ge J(\pi_{i_{t+1}})$, and therefore

(14) $\delta_{t+1} = J(\pi'_{i_{t+1}}) - J(\pi_{i_t}) \ge -\frac{\beta}{1-\beta}\big(\max_j J(\pi_j) - \min_j J(\pi_j)\big). \qquad \blacksquare$
Appendix B Visualization of the Picking Rule
Figure 11 shows the policies chosen by the picking rule of MP-TRPO on Maze during the first 0.5 million steps (the total number of training steps is 1 million) for different random seeds (0–5). The x-axis and y-axis correspond to the training steps and the index of the picked policy. As shown, in the beginning of learning, different policies are picked according to the picking rule, which is a weighted objective of performance and entropy. Finally, MPPO converges to picking the same policy to optimize.
Appendix C Multi-Path Proximal Policy Optimization (MP-PPO) Algorithm
The MP-PPO algorithm is shown in Algorithm LABEL:alg:mp_ppo.
Appendix D Experimental Setup
D.1 Environments
The environments are all from OpenAI Gym (Brockman et al. (2016)), and the details of the sparse environments are summarized as follows:

- SparseCartPoleSwingup: a reward is given only when the pole angle indicates that the pole has been swung sufficiently upright, and no reward is given otherwise.
- SparseDoublePendulum: a reward is given only when the agent reaches the goal, i.e., swings the double pendulum upright, and no reward is given otherwise.
- SparseHalfCheetah: the agent receives a reward only when it runs a distance beyond the threshold, and no reward otherwise.
- SparseHopper: the agent receives a reward only when it hops a distance beyond the threshold, and no reward otherwise.
- SparseWalker2d: the agent receives a reward only when it walks a distance beyond the threshold, and no reward otherwise.
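Sparse variants like these can be derived from dense-reward environments with a thin wrapper. The sketch below is hypothetical (`SparseRewardWrapper`, the cumulative-progress rule, and the unit reward are assumptions, not the exact construction used in the paper):

```python
class SparseRewardWrapper:
    """Hypothetical wrapper: dense reward is accumulated as progress, and a
    reward of 1 is emitted only once progress passes a threshold, 0 otherwise."""
    def __init__(self, env, threshold):
        self.env = env
        self.threshold = threshold
        self.progress = 0.0

    def reset(self):
        self.progress = 0.0
        return self.env.reset()

    def step(self, action):
        obs, dense_reward, done, info = self.env.step(action)
        self.progress += dense_reward
        sparse = 1.0 if self.progress >= self.threshold else 0.0
        return obs, sparse, done, info
```

The wrapper leaves dynamics untouched and only rewrites the reward channel, so the same policy optimization code runs on both the dense and sparse variants.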
D.2 Hyperparameters

The hyperparameters for MP-TRPO and TRPO, and for MP-PPO and PPO, are shown in the tables below; they are set to be the same for fair comparison, following Henderson et al. (2018). For all algorithms, the policy network is (64, tanh, 64, tanh, linear), and the value network is (64, tanh, 64, tanh, linear). The size of the policy buffer is set for MP-TRPO and MP-PPO as described in the ablation study, and the weight $\beta$ is set to 0.1 in all environments.
Hyperparameter  Value 
Discount Factor  0.995 
GAE  0.97 
Batch Size  5000 
Iterations of Conjugate Gradient  20 
Damping of Conjugate Gradient  0.1 
Iterations of Value Function Update  5 
Batch Size of Value Function Update  64 
Step Size of Value Function Update  0.001 
Coefficient of Entropy  0.0 
max KL  0.01 
Table 2: Final performance comparison of MP-PPO (numbers in brackets denote the confidence interval).

Environment | MP-PPO | Multi-PPO (Independent) | Multi-PPO | PPO
Ant | 1992.21 (216.33) | 902.31 (101.00) | 1131.90 (110.94) | 1311.35 (220.93)
Hopper | 2264.54 (234.95) | 1449.33 (224.93) | 2073.46 (161.47) | 1457.18 (258.87)
Swimmer | 109.96 (1.93) | 100.28 (9.08) | 98.79 (6.99) | 106.12 (2.78)
SparseCartPoleSwingup | 352.12 (16.14) | 135.65 (37.47) | 238.02 (55.98) | 294.88 (43.42)
SparseDoublePendulum | 1.0 | 0.67 (0.15) | 0.83 (0.12) | 0.67 (0.15)
SparseHalfCheetah | 593.37 (40.60) | 456.65 (35.86) | 478.70 (59.02) | 463.07 (36.72)
SparseHopper | 131.35 (40.08) | 57.88 (19.86) | 63.75 (22.69) | 90.40 (37.61)
SparseWalker2d | 55.93 (31.98) | 1.67 (1.17) | 41.68 (19.91) | 10.20 (7.18)
Table 3: Hyperparameters of PPO and MP-PPO
Hyperparameter  Value
Discount Factor  0.995
GAE Parameter  0.97
Batch Size  2048
Clip Parameter  0.2
Epochs of Optimizer per Iteration  10
Step Size of Optimizer  0.0003
Batch Size of Optimizer  64
Coefficient of Entropy  0.0
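The clip parameter above enters PPO's clipped surrogate objective, which bounds how far the probability ratio can move the update. A per-sample sketch assuming the standard formulation (the function name is illustrative, not the paper's code):

```python
def ppo_clip_objective(ratio, advantage, clip=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-c, 1+c) * A).

    ratio is pi_new(a|s) / pi_old(a|s); PPO ascends the mean of this
    quantity over a batch.
    """
    clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
    # Taking the min makes the bound pessimistic: improvements from
    # ratios outside [1-c, 1+c] are not rewarded.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With clip = 0.2, a ratio of 1.5 on a positive advantage earns no more credit than a ratio of 1.2, which is what keeps each PPO update conservative.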
Appendix E Memory Consumption Comparisons
The policy ensemble incurs little additional memory. Comparison results for MP-TRPO and MP-PPO with a varying number of paths on SparseDoublePendulum, corresponding to our ablation experiments, are shown in Table 4 and Table 5.
Table 4: Memory consumption of TRPO and MP-TRPO on SparseDoublePendulum
Algorithm  GPU Memory  Memory
TRPO  359 M  404 M
MP-TRPO ()  365 M  410 M
MP-TRPO ()  365 M  410 M
MP-TRPO ()  365 M  410 M
MP-TRPO ()  365 M  411 M
Table 5: Memory consumption of PPO and MP-PPO on SparseDoublePendulum
Algorithm  GPU Memory  Memory
PPO  335 M  388 M
MP-PPO ()  351 M  407 M
MP-PPO ()  351 M  409 M
MP-PPO ()  351 M  409 M
MP-PPO ()  335 M  411 M
Appendix F Final Performance Comparison of MP-PPO
The final performance comparison of MP-PPO is summarized in Table 2, where the numbers in parentheses denote confidence intervals.
References

O. Anschel et al. Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185, 2017.
M. G. Bellemare et al. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
G. Brockman et al. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
J. Buckman et al. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234, 2018.
R. Y. Chen et al. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
C. Colas et al. GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning, pp. 1038–1047, 2018.
M. Fortunato et al. Noisy networks for exploration. In International Conference on Learning Representations, 2018.
S. Fujimoto et al. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1582–1591, 2018.
T. Gangwani and J. Peng. Policy optimization by genetic distillation. In International Conference on Learning Representations, 2018.
T. Haarnoja et al. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.
T. Haarnoja et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856–1865, 2018.
P. Henderson et al. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Z.-W. Hong et al. Diversity-driven exploration strategy for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10489–10500, 2018.
R. Houthooft et al. VIME: variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
B. Kang et al. Policy optimization with demonstrations. In International Conference on Machine Learning, pp. 2474–2483, 2018.
S. Khadka and K. Tumer. Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1188–1200, 2018.
T. P. Lillicrap et al. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
Y. Liu et al. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
M. A. Masood and F. Doshi-Velez. Diversity-inducing policy gradient: using maximum mean discrepancy to find a set of diverse policies. arXiv preprint arXiv:1906.00088, 2019.
O. Nachum et al. Trust-PCL: an off-policy trust region method for continuous control. In International Conference on Learning Representations, 2018.
I. Osband et al. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
G. Ostrovski et al. Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730, 2017.
D. Pathak et al. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
M. Plappert et al. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.
A. Pourchot and O. Sigaud. CEM-RL: combining evolutionary and gradient-based methods for policy search. In International Conference on Learning Representations, 2019.
J. Schulman et al. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
J. Schulman et al. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
H. Tang et al. #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762, 2017.
E. Todorov et al. MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
S. Zhang and H. Yao. ACE: an actor ensemble algorithm for continuous control with tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5789–5796, 2019.