Multi-Path Policy Optimization

11/11/2019, by Ling Pan, et al.

Recent years have witnessed tremendous improvements in deep reinforcement learning. However, a challenging problem is that an agent may suffer from inefficient exploration, particularly for on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or introduce sensitive hyper-parameters that cause instability. In this paper, we propose an efficient exploration method, Multi-Path Policy Optimization (MPPO), which does not incur high computation cost and ensures stability. MPPO maintains an efficient mechanism that effectively utilizes an ensemble of diverse policies to enable better exploration, especially in sparse-reward environments. We build our scheme upon two widely adopted on-policy methods, the Trust Region Policy Optimization (TRPO) algorithm and the Proximal Policy Optimization (PPO) algorithm. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. Results show that MPPO significantly outperforms state-of-the-art exploration methods and ensemble methods in terms of both sample efficiency and final performance.


1 Introduction

In reinforcement learning, an agent seeks an optimal policy that maximizes long-term rewards by interacting with an unknown environment. Policy-based methods, e.g., DDPG (Lillicrap et al. (2016)) and TD3 (Fujimoto et al. (2018)), optimize parameterized policies by gradient ascent on the performance objective. Directly optimizing the policy with vanilla policy gradient methods may incur large policy changes, which can result in performance collapse due to unconstrained updates. To resolve this issue, Trust Region Policy Optimization (TRPO) (Schulman et al. (2015)) and Proximal Policy Optimization (PPO) (Schulman et al. (2017)) optimize a surrogate function in a conservative way; both are on-policy methods that perform policy updates based on samples collected by the current policy. These on-policy methods have the desirable feature that they generally achieve stable performance. This is usually not the case for off-policy learning, where the policy is updated according to samples drawn from a different policy, e.g., using an experience replay buffer (Haarnoja et al. (2018)). However, as on-policy methods learn only from what they collect, they can suffer from insufficient exploration ability, especially in sparse-reward environments (Colas et al. (2018)). Although TRPO and PPO start from a stochastic policy, the randomness in the policy decreases quickly during training. As a result, they can converge prematurely to bad local optima in high-dimensional or sparse-reward tasks.

Indeed, achieving efficient exploration is challenging in deep reinforcement learning. There has been recent progress in improving exploration, ranging from count-based exploration (Ostrovski et al. (2017); Tang et al. (2017)) and intrinsic motivation (Houthooft et al. (2016); Bellemare et al. (2016); Pathak et al. (2017)) to noisy networks (Fortunato et al. (2018); Plappert et al. (2018)). However, these methods either introduce sensitive parameters that require careful tuning per task (Khadka and Tumer (2018)), or require learning additional complex structures to estimate the novelty of states. Hong et al. (2018) and Masood and Doshi-Velez (2019) propose to encourage exploration by augmenting the objective function with a diversity term that measures the distance between the current policy and prior policies. Yet, this distance can be small for trust-region methods, where the policy update is controlled, which limits the applicability of this kind of approach.

The ensemble technique has been widely applied in model-free reinforcement learning by using an ensemble of value functions or actors for off-policy algorithms, and has shown great power to improve exploration (Osband et al. (2016); Chen et al. (2017)), to better estimate value functions (Anschel et al. (2017)), or to escape from local optima (Zhang and Yao (2019)). In this paper, we aim to combine ensemble methods with on-policy algorithms to enable better exploration. However, effectively utilizing ensembles for on-policy methods under limited samples is challenging, as interaction with the environment to obtain experiences is expensive (Buckman et al. (2018)). Simply training an ensemble of policies and picking the best one fails to make good use of the policy ensemble: due to the nature of on-policy algorithms, each policy can only be updated with samples collected by itself. Thus, rolling out all policies is unnecessary and wastes samples.

To tackle this problem, we propose a novel method, Multi-Path Policy Optimization (MPPO), which improves exploration for on-policy algorithms using the ensemble method. Here, a path refers to a sequence of policies generated during the course of policy optimization starting from a single policy. Figure 1 shows the high-level schematic of MPPO, which has four main components, i.e., pick and rollout, policy optimization, value network update, and policy buffer update. Specifically, MPPO starts with several randomly initialized policies in the policy buffer and a shared value network. At each iteration, a candidate policy is picked from the policy buffer according to a picking rule, defined as a weighted combination of performance and entropy, introduced to enable a trade-off between exploration and exploitation. Then, the picked policy interacts with the environment by rollouts to collect samples. The picked policy is updated by policy optimization according to these samples and the shared value network. The samples also contribute to updating the shared value network. Finally, the improved picked policy updates the policy buffer by replacing its previous version, which retains the diversity of the policy buffer.

Figure 1: High-level schematic of MPPO.

With this scheme, MPPO maintains multiple policy paths, which increases exploration ability during training. Different policy paths provide diverse experiences for the shared value network to enable better estimation (Nachum et al. (2018)), which yields a better signal for how valuable each state is. With a better estimated value function, the policies updated by policy optimization are better able to collect trajectories with higher rewards. Therefore, MPPO provides better guidance for each picked policy to explore states and actions that were not previously known to yield high rewards.

The picking rule favors the policy that is most desirable to roll out and to optimize at each iteration, i.e., the one with good performance that is simultaneously explorative, and is a critical component of MPPO. We prove that when MPPO switches to a more explorative policy, the performance variation of the picked policies can be bounded and controlled. This is a useful feature that ensures smooth policy transition. We also empirically validate that the potential variation is small, and that the picked policy converges to a single policy, which ensures stability. Moreover, since only one candidate policy is picked at each iteration, our method does not require additional samples compared with the base policy optimization method.

We apply MPPO to two widely adopted on-policy algorithms, TRPO and PPO, and conduct extensive experiments on several continuous control tasks based on MuJoCo (Todorov et al. (2012)). Experimental results demonstrate that our proposed algorithms, MP-TRPO and MP-PPO, provide significant improvements over state-of-the-art exploration methods in terms of sample efficiency and final performance, without incurring high computational cost. We also investigate the critical advantages of the proposed picking rule and policy buffer update strategy over other ensemble methods.

2 Preliminaries

A Markov decision process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the sets of states and actions, $\mathcal{P}(s'|s,a)$ the transition probability from state $s$ to state $s'$ under action $a$, $r(s,a)$ the corresponding immediate reward, and $\gamma \in (0,1)$ the discount factor. The agent interacts with the environment by its parameterized policy $\pi_\theta$, with the goal of learning the optimal policy that maximizes the expected discounted return $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
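As a minimal illustration of the objective above, the discounted return of a single sampled trajectory can be computed as follows (a sketch; the default discount factor 0.995 mirrors the value used in our experiments, and the trajectory is hypothetical):

```python
def discounted_return(rewards, gamma=0.995):
    """Monte-Carlo estimate of the discounted return: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A sparse trajectory that only receives a reward at step 3:
assert abs(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9) - 0.9 ** 3) < 1e-12
```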

Trust Region Policy Optimization (TRPO) (Schulman et al. (2015)) learns the policy parameter $\theta$ by optimizing a surrogate function in a conservative way. Specifically, it limits the step size of the policy update using a trust-region constraint, i.e.,

$\max_{\theta} \ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t\right]$   (1)

$\text{s.t.} \ \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\right)\right] \le \delta_{\mathrm{KL}},$   (2)

where $\hat{\mathbb{E}}_t[\cdot]$ denotes the empirical average over a finite batch of samples, and $\hat{A}_t$ the advantage function defined as

$\hat{A}_t = Q(s_t, a_t) - V(s_t).$   (3)

One desired feature of TRPO is that it guarantees a monotonic policy improvement, i.e., the policy update step leads to a better-performing policy during training. However, it is not computationally efficient as it involves solving a second-order optimization problem using conjugate gradient.
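To make the quantities in Eqs. (1)-(2) concrete, the following sketch estimates the surrogate objective and the KL term from a batch of log-probabilities and advantages (hypothetical array inputs; the actual TRPO update then solves the constrained problem with conjugate gradient and a line search):

```python
import numpy as np

def trpo_surrogate_and_kl(logp_new, logp_old, advantages):
    """Sample-based estimates of the surrogate objective of Eq. (1) and of the
    KL divergence appearing in the trust-region constraint of Eq. (2)."""
    logp_new, logp_old = np.asarray(logp_new), np.asarray(logp_old)
    ratio = np.exp(logp_new - logp_old)                  # pi_theta / pi_theta_old
    surrogate = np.mean(ratio * np.asarray(advantages))  # Eq. (1)
    kl_estimate = np.mean(logp_old - logp_new)           # E_old[log pi_old - log pi_new]
    return surrogate, kl_estimate
```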

Proximal Policy Optimization (PPO) (Schulman et al. (2017)) is a simpler method involving only first-order optimization with stochastic gradient descent. PPO maximizes a KL-penalized or clipped version of the objective function to ensure stable policy updates, where the clipped version is more common and is reported to perform better than the KL-penalized version. Specifically, the objective of the clipped version is to maximize

$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t, \ \mathrm{clip}\!\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right],$   (4)

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ denotes the probability ratio, and $\epsilon$ is the clipping parameter.
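A minimal NumPy sketch of the clipped surrogate in Eq. (4), given hypothetical batches of log-probabilities and advantage estimates (in practice the objective is maximized with an automatic-differentiation framework; the default clip parameter 0.2 matches Table 4):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of Eq. (4), averaged over a batch."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_t(theta)
    advantages = np.asarray(advantages)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)               # clip(r_t, 1-eps, 1+eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```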

3 Multi-Path Policy Optimization

In this section, we start with an example illustrating that TRPO lacks sufficient exploration ability, which motivates the development of our Multi-Path Policy Optimization (MPPO) method. We then describe the MPPO method and apply it to two widely adopted on-policy algorithms, TRPO and PPO.

3.1 A Motivating Example

Figure 2 shows a Maze environment with sparse rewards, where the black lines represent walls. The agent always starts from a fixed state in the lower left corner of the maze, with the goal of reaching the destination in the lower right corner. A positive reward is given only when the agent reaches the destination, and no reward is given otherwise.

Figure 2: Maze.

We compare three schemes in this environment: TRPO, MP-TRPO (MPPO applied to TRPO), and Multi-TRPO (training an ensemble of policies and picking the best one). For a fair comparison, all methods use the same amount of samples during training. As shown in Figure 3, the agent can fail to reach the goal state under TRPO. Figure 5(a) shows the resulting state visitation density under TRPO after training for 1 million steps. It can be seen that the agent explores only a very limited area of the maze and mainly stays on the left side. It is also worth noting that simply training the policy ensemble and choosing the best, i.e., Multi-TRPO, also fails to consistently find the destination. Although it is able to search a larger region, it still mostly re-explores the left part, as shown in Figure 5(b). In contrast, MP-TRPO always successfully reaches the destination after 0.6 million steps while the others fail. As illustrated in Figure 5(c), it is capable of bypassing the wall and exploring both sides of the maze.

Figure 3: Return.
Figure 4: Entropy.
Figure 5: State visitation comparisons under (a) TRPO, (b) Multi-TRPO, and (c) MP-TRPO.

This is because TRPO suffers from insufficient exploration: the entropy of the policy trained with TRPO decreases quickly as the policy is optimized, as shown in Figure 4. Multi-TRPO maintains greater exploration ability with the policy ensemble. However, recall that all three schemes consume the same amount of samples for training. As Multi-TRPO rolls out all policies at each iteration, the performance improvement of any single policy in the ensemble is limited compared with MP-TRPO. Indeed, for Multi-TRPO, acquiring diverse samples from the policy ensemble comes at the expense of insufficient training of each policy under a limited number of samples: on-policy algorithms cannot utilize experiences from other policies, and can only update each policy with samples it collected itself. In contrast, during the training process, MP-TRPO optimizes the policy while simultaneously maintaining enough exploration ability.

We next systematically describe our proposed method and the motivation behind it, to illustrate why MPPO helps to improve exploration.

Algorithm 1: Multi-Path Trust Region Policy Optimization (MP-TRPO).

3.2 Method

The main idea of MPPO is summarized as follows. The policy buffer $\mathcal{P} = \{\pi_1, \ldots, \pi_N\}$ is initialized with $N$ random policies, and a shared value network $V_\phi$ is also randomly initialized. At each iteration $t$, a candidate policy $\pi_{i_t}$ is picked from the policy buffer $\mathcal{P}$ and used as the rollout policy to interact with the environment and generate a set of samples. The candidate policy is optimized according to the collected samples based on the shared value network. The collected samples also contribute to updating the shared value network. Finally, the improved candidate policy updates the policy buffer by replacing its previous version.
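This loop can be sketched as follows, with the four components passed in as callables; the helper names are placeholders for the steps detailed in Sections 3.2.1-3.2.4:

```python
def mppo_train(policy_buffer, pick, rollout, optimize, update_value, num_iterations):
    """High-level MPPO loop (sketch): pick, roll out, optimize, update value, replace."""
    for t in range(num_iterations):
        i = pick(policy_buffer)                          # pick by the rule of Eq. (5)
        samples = rollout(policy_buffer[i])              # only the picked policy acts
        improved = optimize(policy_buffer[i], samples)   # base TRPO/PPO update
        update_value(samples)                            # shared value network update
        policy_buffer[i] = improved                      # Eq. (9): replace its own slot
    return policy_buffer
```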

Specifically, the key components of the Multi-Path Policy Optimization method are as follows:

3.2.1 Pick and rollout.

From the previous analysis of Multi-TRPO, although an ensemble of policies can bring diverse samples, policies in the ensemble cannot exploit each other's experiences. Therefore, we propose to pick a single candidate policy from the current policy buffer at each iteration. The picking rule of MPPO is to choose the policy with the highest score $f(\pi_i)$, which takes into account both performance and entropy as defined in Eq. (5), i.e.,

$f(\pi_i) = (1-\alpha)\,\hat{J}(\pi_i) + \alpha\,\hat{H}(\pi_i),$   (5)

where $\hat{J}(\pi_i)$ and $\hat{H}(\pi_i)$ denote the normalized performance and entropy according to min-max normalization as in Eq. (6):

$\hat{J}(\pi_i) = \frac{J(\pi_i) - \min_j J(\pi_j)}{\max_j J(\pi_j) - \min_j J(\pi_j)}, \qquad \hat{H}(\pi_i) = \frac{H(\pi_i) - \min_j H(\pi_j)}{\max_j H(\pi_j) - \min_j H(\pi_j)}.$   (6)

In Eq. (6), $J(\pi_i)$ and $H(\pi_i)$ denote the performance and entropy of policy $\pi_i$ respectively, where we use the Shannon entropy defined by $H(\pi_i) = -\mathbb{E}_{s, a \sim \pi_i}\!\left[\log \pi_i(a|s)\right]$.

The picking rule favors the policy that is most desirable to roll out and to optimize, i.e., the one with good performance that is simultaneously explorative, and is a critical component of MPPO. In Eq. (5), $\alpha$ provides the trade-off between exploration and exploitation. Note that a criterion focusing only on performance cannot make good use of the policy buffer, as it tends to pick the policy updated in the last iteration; this leads to an optimization process similar to the single-path case, which also suffers from insufficient exploration. Considering the entropy term encourages exploring new behaviors; however, always picking the policy with the maximum entropy fails to exploit learned good behaviors. Our weighted rule is designed to strike a good trade-off between exploration and exploitation.

It is crucial to note that only the candidate policy then interacts with the environment for sample collection, based on which both the candidate policy and the shared value network are updated. Therefore, MPPO does not require more samples than the base policy optimization method.
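A minimal NumPy sketch of the picking rule in Eqs. (5)-(6), assuming the performance and entropy estimates of the policies in the buffer are given:

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization of Eq. (6); a constant vector maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def pick_policy(performances, entropies, alpha=0.1):
    """Index of the policy with the highest score of Eq. (5)."""
    scores = (1.0 - alpha) * min_max_normalize(performances) \
             + alpha * min_max_normalize(entropies)
    return int(np.argmax(scores))

# With alpha = 0.1 the best-performing policy is usually picked, unless a
# slightly worse policy is markedly more explorative:
print(pick_policy([10.0, 9.5, 2.0], [0.1, 0.9, 0.5], alpha=0.1))  # picks index 1
```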

3.2.2 Policy optimization.

At each iteration $t$, only the candidate policy $\pi_{i_t}$ is optimized using a base policy optimization method, according to samples collected by itself and the shared value network. The objective of policy optimization is to maximize the expected advantage over the policy distribution, where the estimated policy gradient is

$\hat{g} = \frac{1}{|\mathcal{D}|}\sum_{(s_t, a_t) \in \mathcal{D}} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}(s_t, a_t),$   (7)

given a batch of samples $\mathcal{D}$, with the advantage $\hat{A}$ estimated using the shared value network $V_\phi$. In fact, MPPO enables a better estimation of the advantage function through its mechanism of utilizing the policy ensemble. Therefore, policy optimization drives each picked policy to explore previously unvisited good states and actions.
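Given per-sample gradients of the log-probabilities (hypothetical inputs, produced by the policy network in practice), the estimator in Eq. (7) is a simple advantage-weighted average:

```python
import numpy as np

def estimate_policy_gradient(grad_log_probs, advantages):
    """Monte-Carlo estimate of Eq. (7).

    grad_log_probs: array of shape (batch, num_params) holding the gradients of
        log pi_theta(a_t|s_t) for each sample.
    advantages: array of shape (batch,) holding A_hat(s_t, a_t).
    """
    grad_log_probs = np.asarray(grad_log_probs)
    advantages = np.asarray(advantages)
    return np.mean(grad_log_probs * advantages[:, None], axis=0)
```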

3.2.3 Value network update.

Samples collected by the candidate policy also contribute to updating the shared value network $V_\phi$ by minimizing the mean-squared error

$L(\phi) = \frac{1}{|\mathcal{D}|}\sum_{s_t \in \mathcal{D}} \left(V_\phi(s_t) - \hat{R}_t\right)^2,$   (8)

where $\hat{R}_t$ denotes the empirical return from state $s_t$.

During the course of training with MPPO, the shared value network exploits diverse samples collected by the policies that are most desirable to be picked from the diverse policy buffer at each iteration. In this way, it can estimate the value function better than in the single-path case. Therefore, it provides more information for the advantage function to distinguish good from bad actions, which is critical for policy optimization.
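As an illustration of Eq. (8), the sketch below fits a hypothetical linear value model by gradient descent on the mean-squared error; the step size and number of update iterations mirror the value-function settings in Table 2, while the shared value network in our experiments is a two-layer MLP:

```python
import numpy as np

def fit_value_function(weights, states, returns, lr=1e-3, epochs=5):
    """Minimize the MSE of Eq. (8) for a linear value model V_phi(s) = phi^T s."""
    states, returns = np.asarray(states), np.asarray(returns)
    for _ in range(epochs):
        preds = states @ weights                              # V_phi(s_t)
        grad = 2.0 * states.T @ (preds - returns) / len(returns)
        weights = weights - lr * grad                         # gradient descent step
    return weights
```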

3.2.4 Policy buffer update.

The updated policy $\pi_{i_t}^{\mathrm{new}}$ is added to the policy buffer by replacing the candidate policy itself, i.e.,

$\mathcal{P} \leftarrow \left(\mathcal{P} \setminus \{\pi_{i_t}\}\right) \cup \{\pi_{i_t}^{\mathrm{new}}\},$   (9)

which maintains the diversity of the policy buffer. A common alternative is to replace the worst policy in the buffer with the improved policy, as is usual in evolution-based methods (Khadka and Tumer (2018)). However, that updating scheme quickly loses the diversity of the policy buffer and ultimately leads to a set of very similar policies, which results in a low level of exploration.
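The two buffer-update strategies can be contrasted in a few lines (a sketch with hypothetical arguments):

```python
def update_buffer_replace_self(buffer, picked_idx, improved_policy):
    """MPPO update of Eq. (9): the improved policy replaces only its own previous
    version, so the other N-1 policies in the buffer stay untouched and diverse."""
    buffer[picked_idx] = improved_policy
    return buffer

def update_buffer_replace_worst(buffer, improved_policy, performances):
    """'ReplaceWorst' baseline: overwrite the worst-performing policy, which
    gradually collapses the buffer onto near-copies of a single policy."""
    worst = min(range(len(buffer)), key=lambda j: performances[j])
    buffer[worst] = improved_policy
    return buffer
```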

3.3 Multi-Path Trust Region Policy Optimization

We first apply our proposed MPPO method to the widely adopted on-policy algorithm TRPO. The resulting Multi-Path Trust Region Policy Optimization (MP-TRPO) algorithm is shown in Algorithm 1. Note that the performance of the updated policy is estimated from the samples already collected during its rollout. Therefore, MP-TRPO does not require extra samples to evaluate the updated policy.

During the course of policy optimization, if the same policy is picked as in the last iteration, MP-TRPO guarantees a monotonic improvement of the policy picked in the current iteration over that in the last iteration (Schulman et al. (2015)). On the other hand, if a more explorative policy whose performance is not as good as that of the last iteration is picked, a temporary performance drop may occur. In Theorem 2, we show that such a performance drop can be bounded, ensuring a smooth policy transition. The proof is deferred to Appendix A.

Theorem 2

Let $i_t$, $i_{t+1}$ denote the indexes of the policies that are picked at timesteps $t$ and $t+1$, respectively. Denote the improvement of $\pi_{i_{t+1}}$ over $\pi_{i_t}$ as $\delta = J(\pi_{i_{t+1}}) - J(\pi_{i_t})$. Then, the following bound holds for $\delta$:

$\delta \ \ge\ -\frac{\alpha}{1-\alpha}\left(\max_{j} J(\pi_j) - \min_{j} J(\pi_j)\right),$

where the maximum and minimum are taken over the policies in the current policy buffer.

Theorem 2 shows that although there may be a temporary performance drop due to switching to a more explorative policy, such a sacrifice is bounded by an $\alpha$-related factor times the difference between the performance of the best and the worst policies in the current policy buffer.

Figure 6: Visualization of the course of policy picking.

Figure 6 shows the course of policy picking of MP-TRPO during training on Maze (Figure 2) for a single seed. In the beginning, MP-TRPO may pick different policies to collect samples and to optimize according to the picking rule, which means that the performance gap between the best and the worst policies will not be large. Note that we use a fixed value of $\alpha = 0.1$ in our experiments; therefore, the temporary performance drop is small by Theorem 2. In the end, MP-TRPO converges to picking a single policy, in which case the performance of the picked policy is monotonically increasing. This observation also holds for other seeds, and the full empirical results are given in Appendix B. We remark that our method maintains good performance throughout the policy optimization process, while bringing the advantage of better exploration.
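For concreteness, substituting $\alpha = 0.1$ into the bound of Theorem 2 caps any temporary drop at roughly one tenth of the performance gap within the buffer:

$\delta \ \ge\ -\frac{0.1}{1 - 0.1}\left(\max_j J(\pi_j) - \min_j J(\pi_j)\right) \ \approx\ -0.11\left(\max_j J(\pi_j) - \min_j J(\pi_j)\right).$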

3.4 Multi-Path Proximal Policy Optimization

We also apply MPPO to another on-policy algorithm, PPO (Schulman et al. (2017)), and obtain the MP-PPO algorithm. Due to space limitations, the full algorithm is deferred to Appendix C.

4 Experiments

We conduct extensive experiments to investigate the following key questions: (1) How does MPPO compare with single-path policy optimization and state-of-the-art exploration methods? (2) What is the effect of the number of paths $N$ and the weight $\alpha$? (3) Which component of MPPO is critical for the improvement of exploration ability? (4) Is MPPO generally applicable to a baseline on-policy reinforcement learning algorithm to encourage exploration?

4.1 Experimental Setup

We evaluate MPPO on several continuous control MuJoCo environments (Todorov et al. (2012)), including more challenging variants of the original environments with sparse rewards (Houthooft et al. (2016); Kang et al. (2018)), to examine the exploration ability of our algorithms. For example, in SparseDoublePendulum, a reward is given only when the agent swings the double pendulum upright to reach the goal, and no reward is given otherwise. Detailed descriptions of the benchmark environments are given in Appendix D. Each algorithm is run with several random seeds, and the performance is evaluated over multiple episodes at regular intervals of environment steps. Note that the performance of MPPO is evaluated by the picked policy. The average return in evaluation is reported as the solid line, with the shaded region denoting a confidence interval. For fair comparison, the hyper-parameters of all compared algorithms are set to the same best set of hyper-parameters reported in Henderson et al. (2018).

4.2 Baselines

To comprehensively study the MP-TRPO algorithm, we compare it with six baselines. For fair comparison, all methods use the same amount of samples during the course of policy optimization. TRPO refers to vanilla single-path TRPO, while Multi-TRPO is an ensemble method that trains $N$ single-path TRPO policies and chooses the best one. We further compare MP-TRPO against Div-TRPO (Hong et al. (2018)) and Curiosity-TRPO (Pathak et al. (2017)), both of which are state-of-the-art exploration methods. Div-TRPO refers to the diversity-driven approach, which augments the loss function with the distance between the current policy and prior policies. Curiosity-TRPO is a curiosity-driven approach, which improves exploration by augmenting the reward function with learned intrinsic rewards. To validate the effect of the shared value network, we further compare with Multi-TRPO (Independent), where each policy in the ensemble uses its own independent value network. To evaluate the importance of the replacement strategy for updating the policy buffer, we compare MP-TRPO with a variant, ReplaceWorst, which updates the policy buffer by replacing the worst policy with the improved candidate policy. The effectiveness of the picking rule is verified using different values of the weight $\alpha$.

Then, we apply our proposed multi-path policy optimization mechanism to another baseline policy optimization method, PPO, to demonstrate the general applicability of the MPPO method, and conduct a similar evaluation.

4.3 Ablation Study

Figure 7: Ablation study of varying $N$: (a) MP-TRPO, (b) MP-PPO.

4.3.1 The effect of the number of paths $N$.

Figure 7 shows the performance of MP-TRPO and MP-PPO with varying $N$ on SparseDoublePendulum. The value of $N$ trades off the diversity of the policy buffer against sample efficiency. A larger $N$ maintains greater diversity, but may require more samples to learn, as there are more policies to be picked and optimized in the early stage of learning. Indeed, an intermediate value of $N$ provides the best trade-off. We find that MP-TRPO achieves the best performance with such an intermediate $N$, and we fix that value in all environments. For MP-PPO, a relatively smaller $N$ is sufficient and performs best, because PPO itself exhibits greater exploration ability than TRPO; we therefore use this smaller value in all environments for MP-PPO. Note that MPPO with all tested values of $N$ outperforms the corresponding baseline policy optimization method (TRPO or PPO). It is also worth noting that, at the chosen values of $N$, MP-TRPO and MP-PPO use only 1.67% and 4.78% more memory than TRPO and PPO, respectively. A summary of memory consumption for different $N$ is given in Appendix E.

Figure 8: Ablation study of varying $\alpha$.

4.3.2 The effect of the weight $\alpha$.

In the picking rule, $\alpha$ controls the trade-off between exploration and exploitation. A larger $\alpha$ puts more emphasis on the exploration ability of the picked policy, but may fail to utilize the result of policy optimization; in addition, according to Theorem 2, a large $\alpha$ may lead to a larger temporary performance drop. On the other hand, a smaller $\alpha$ focuses more on exploiting the current best-performing policy in the policy buffer, where $\alpha = 0$ corresponds to always picking the best policy based on the current estimation. We vary $\alpha$ for MP-TRPO on SparseDoublePendulum, and the result is shown in Figure 8. As expected, a small $\alpha$ achieves the best performance, so we fix $\alpha$ to 0.1 in all environments for both MP-TRPO and MP-PPO.

4.4 Performance Comparison

Figure 9: Performance comparison of MP-TRPO.
Environment MP-TRPO MP-TRPO (ReplaceWorst) Div-TRPO Curiosity-TRPO Multi-TRPO (Independent) Multi-TRPO TRPO
Ant 3017.02 (116.405) -470.05 (57.65) 2635.00 (138.28) 2744.79 (190.23) 0.93 (1.99) 3.90 (1.19) 2558.33 (120.89)
Hopper 3257.55 (62.68) 76.25 (32.68) 2957.31 (82.94) 2786.99 (258.17) 2878.34 (91.57) 3138.41 (17.66) 2947.19 (111.43)
Swimmer 340.92 (4.39) 179.67 (29.63) 199.06 (20.58) 232.89 (38.93) 198.00 (27.47) 339.17 (2.21) 186.21 (32.63)
SparseCartPoleSwingup 320.47 (14.48) 49.10 (16.99) 203.22 (33.56) 244.23 (11.42) 180.17 (11.50) 213.07 (4.07) 238.88 (11.08)
SparseDoublePendulum 1.00 (0.00) 0.33 (0.15) 1.00 (0.00) 1.00 (0.00) 0.83 (0.12) 0.98 (0.01) 1.00 (0.00)
SparseHalfCheetah 756.02 (14.02) 24.52 (8.14) 699.03 (57.15) 656.98 (29.24) 438.50 (66.98) 676.22 (14.35) 639.32 (38.40)
SparseHopper 302.32 (52.48) 0.00 (0.00) 89.60 (45.09) 188.47 (33.51) 14.63 (10.29) 38.33 (25.82) 182.63 (60.13)
SparseWalker2d 186.48 (43.15) 0.00 (0.00) 112.75 (33.29) 87.33 (25.18) 37.67 (12.57) 56.18 (17.27) 53.78 (15.57)
Table 1: Comparison of MP-TRPO on final performance (mean, with the confidence interval in parentheses).

The comparative results for MP-TRPO are shown in Figure 9. MP-TRPO is consistently more sample efficient than Div-TRPO in all environments. In addition, it outperforms Curiosity-TRPO in all but one environment in terms of sample efficiency, and the margin is especially large in sparse environments. Table 1 summarizes the performance at the end of training, which shows that MP-TRPO achieves the best final performance in all environments.

Div-TRPO augments the loss function with a measure of the distance to past policies. As trust-region methods limit the update, the distance among prior policies is not large; thus, the diversity-driven technique does not enable a significant improvement of exploration on top of TRPO. Curiosity-TRPO augments the reward function with a curiosity term that measures how novel a state is. It encourages the agent to re-explore states that are known to be unfamiliar. However, it can be challenging for the agent to discover such states in the first place in sparse environments.

Figure 10: Performance comparison of MP-PPO.

Regarding the shared value network, Multi-TRPO outperforms Multi-TRPO (Independent), as the shared network enables a better estimation of the value function. Additionally, MP-TRPO outperforms Multi-TRPO significantly. As for the strategy for policy buffer updates, note that MP-TRPO (ReplaceWorst), where replacing the worst policy is a common strategy in evolutionary methods, performs poorly in all but one benchmark environment. After updating the picked policy, it replaces the worst-performing policy in the buffer with the improved policy. Under this updating scheme, the policy buffer quickly loses its diversity and soon stores only similar copies of a single policy. Thus, MP-TRPO (ReplaceWorst) performs worse than Multi-TRPO (Independent). In contrast, the replacement strategy of MP-TRPO best preserves the diversity of the policy buffer while ensuring policy optimization.

Our results provide empirical evidence that MPPO is an efficient mechanism for fully utilizing an ensemble of policies without incurring high computational overhead.

To show that MPPO is readily applicable to other baseline on-policy algorithms, we build it upon PPO and evaluate the resulting MP-PPO algorithm by comparing it with the corresponding PPO, Multi-PPO, and Multi-PPO (Independent) algorithms. Results are shown in Figure 10, where MP-PPO outperforms the baselines in all environments, demonstrating its effectiveness in encouraging exploration. The final performance comparison is given in Appendix F.

5 Related Work

Entropy-regularized reinforcement learning (RL) (Haarnoja et al. (2017, 2018); Nachum et al. (2018)) optimizes the standard objective augmented with an entropy regularizer, which can be viewed as penalizing the distance to a uniformly random policy, and learns a stochastic policy for better exploration. Our method differs in that MPPO still optimizes the standard objective; only the picking rule involves the performance and an entropy term.

The ensemble technique has shown great potential in model-free RL. Liu et al. (2017) aim to learn a set of diverse policies by optimizing a distribution over policy parameters via Bayesian inference, training multiple policies at each iteration with shared gradient information. However, this cannot ensure stable performance when applied to trust-region methods.

Zhang and Yao (2019) propose to escape from local maxima for an off-policy algorithm, DDPG (Lillicrap et al. (2016)), by utilizing an ensemble of actors. The critic is updated according to the action, among those proposed by all actors, that yields the maximum Q-value, and all actors are trained in parallel. However, this approach cannot be applied to RL algorithms with stochastic policies.

There have also been a number of approaches that improve exploration by combining evolutionary methods with deep reinforcement learning, maintaining a population of agents. Gangwani and Peng (2017) apply policy gradient methods to mutate the population. Khadka and Tumer (2018) utilize a population of evolutionary actors to collect samples, and an RL actor based on DDPG (Lillicrap et al. (2016)) is updated using these samples. Pourchot and Sigaud (2019) propose to combine the cross-entropy method and TD3 (Fujimoto et al. (2018)). Our work differs from these works in several aspects. First, we train a single policy at each iteration instead of a population of policies, and only the picked policy interacts with the environment. Second, we use multiple paths to enable better exploration than a single path for on-policy algorithms, while previous works cannot be applied to on-policy algorithms.

6 Conclusion

We present MPPO, which uses an ensemble of policies to improve exploration for on-policy reinforcement learning algorithms. We apply the MPPO method to TRPO and PPO, and show that the performance variation during policy switching is bounded. We conduct extensive experiments on several MuJoCo tasks, including environments with sparse rewards, and show that MPPO outperforms the baselines significantly in both sample efficiency and final performance.

Appendix A Theoretical Analysis for MP-TRPO

Theorem 2

Let $i_t$, $i_{t+1}$ denote the indexes of the policies that are picked at timesteps $t$ and $t+1$, respectively. Denote the improvement of $\pi_{i_{t+1}}$ over $\pi_{i_t}$ as $\delta = J(\pi_{i_{t+1}}) - J(\pi_{i_t})$. Then, the following bound holds for $\delta$:

$\delta \ \ge\ -\frac{\alpha}{1-\alpha}\left(\max_{j} J(\pi_j) - \min_{j} J(\pi_j)\right).$

Proof. As $\pi_{i_{t+1}}$ is the policy selected at timestep $t+1$, its score of Eq. (5) is the highest in the buffer, so we have

$(1-\alpha)\,\hat{J}(\pi_{i_{t+1}}) + \alpha\,\hat{H}(\pi_{i_{t+1}}) \ \ge\ (1-\alpha)\,\hat{J}(\pi_{i_t}) + \alpha\,\hat{H}(\pi_{i_t}).$   (10)

Thus, since the normalized entropies lie in $[0,1]$,

$\hat{J}(\pi_{i_{t+1}}) - \hat{J}(\pi_{i_t}) \ \ge\ \frac{\alpha}{1-\alpha}\left(\hat{H}(\pi_{i_t}) - \hat{H}(\pi_{i_{t+1}})\right) \ \ge\ -\frac{\alpha}{1-\alpha}.$   (11)

According to the min-max normalization, we have

$J(\pi_{i_{t+1}}) - J(\pi_{i_t}) \ =\ \left(\hat{J}(\pi_{i_{t+1}}) - \hat{J}(\pi_{i_t})\right)\left(\max_{j} J(\pi_j) - \min_{j} J(\pi_j)\right).$   (12)

Then, we obtain

$\delta \ =\ J(\pi_{i_{t+1}}) - J(\pi_{i_t}) \ \ge\ -\frac{\alpha}{1-\alpha}\left(\max_{j} J(\pi_j) - \min_{j} J(\pi_j)\right).$   (13)

According to the monotonic improvement theorem (Schulman et al. (2015)), we have

$J\!\left(\pi_{i_t}^{\mathrm{new}}\right) \ \ge\ J\!\left(\pi_{i_t}\right),$   (14)

so the policy update itself never decreases performance, and any temporary drop is caused only by switching policies and is bounded by Eq. (13). ∎

Appendix B Visualization of the Picking Rule

The policies chosen by the picking rule of MP-TRPO on Maze during the first 0.5 million steps (out of 1 million total training steps) for different random seeds (0-5) are shown in Figure 11. The x-axis and y-axis correspond to the training steps and the index of the picked policy, respectively. As shown, at the beginning of learning, different policies are picked according to the picking rule, which is a weighted objective of performance and entropy. Eventually, MPPO converges to picking the same policy to optimize.

Figure 11: Visualization of the picked policies of MP-TRPO on Maze during the first 0.5M steps.

Appendix C Multi-Path Proximal Policy Optimization (MP-PPO) Algorithm

The MP-PPO algorithm is shown in Algorithm 2.

Algorithm 2: Multi-Path Proximal Policy Optimization (MP-PPO).

Appendix D Experimental Setup

D.1 Environments

The environments are all from OpenAI Gym (Brockman et al. (2016)), and the details of the sparse environments are summarized as follows:

  • SparseCartPoleSwingup: a reward is given only when the pole is swung up close to the upright position, as determined by a threshold on the pole angle, and no reward is given otherwise

  • SparseDoublePendulum: a reward is given only when the agent reaches the goal, i.e., swings the double pendulum upright, and no reward is given otherwise

  • SparseHalfCheetah: the agent receives a reward only when it runs beyond a distance threshold, and no reward otherwise

  • SparseHopper: the agent receives a reward only when it hops beyond a distance threshold, and no reward otherwise

  • SparseWalker2d: the agent receives a reward only when it walks beyond a distance threshold, and no reward otherwise

D.2 Hyper-parameters

The hyper-parameters for MP-TRPO and TRPO, and for MP-PPO and PPO, are shown in Table 2 and Table 4 respectively; they are set to be the same for fair comparison, following Henderson et al. (2018). For all algorithms, the policy network is (64, tanh, 64, tanh, linear), and the value network is (64, tanh, 64, tanh, linear). The size of the policy buffer $N$ is set for MP-TRPO and MP-PPO as described in Section 4.3.1, and the weight parameter $\alpha$ is set to 0.1 in all environments.

Hyper-parameter Value
Discount Factor 0.995
GAE 0.97
Batch Size 5000
Iterations of Conjugate Gradient 20
Damping of Conjugate Gradient 0.1
Iterations of Value Function Update 5
Batch Size of Value Function Update 64
Step Size of Value Function Update 0.001
Coefficient of Entropy 0.0
max KL 0.01
Table 2: Hyper-parameters of MP-TRPO and TRPO.
Environment MP-PPO Multi-PPO (Independent) Multi-PPO PPO
Ant 1992.21 (216.33) 902.31 (101.00) 1131.90 (110.94) 1311.35 (220.93)
Hopper 2264.54 (234.95) 1449.33 (224.93) 2073.46 (161.47) 1457.18 (258.87)
Swimmer 109.96 (1.93) 100.28 (9.08) 98.79 (6.99) 106.12 (2.78)
SparseCartPoleSwingup 352.12 (16.14) 135.65 (37.47) 238.02 (55.98) 294.88 (43.42)
SparseDoublePendulum 1.0 0.67 (0.15) 0.83 (0.12) 0.67 (0.15)
SparseHalfCheetah 593.37 (40.60) 456.65 (35.86) 478.70 (59.02) 463.07 (36.72)
SparseHopper 131.35 (40.08) 57.88 (19.86) 63.75 (22.69) 90.40 (37.61)
SparseWalker2d 55.93 (31.98) 1.67 (1.17) 41.68 (19.91) 10.20 (7.18)
Table 3: Comparison of MP-PPO on final performance (mean and confidence interval).
Hyper-parameter Value
Discount Factor 0.995
GAE 0.97
Batch Size 2048
Clip Parameter 0.2
Epochs of Optimizer per Iteration 10
Step Size of Optimizer 0.0003
Batch Size of Optimizer 64
Coefficient of Entropy 0.0
Table 4: Hyper-parameters of MP-PPO and PPO.

Appendix E Memory Consumption Comparisons

The policy ensemble does not incur much additional memory consumption. Comparison results for MP-TRPO and MP-PPO with varying numbers of paths $N$ on SparseDoublePendulum, corresponding to our ablation experiments, are shown in Table 5 and Table 6.

GPU Memory Memory
TRPO 359 M 404 M
MP-TRPO () 365 M 410 M
MP-TRPO () 365 M 410 M
MP-TRPO () 365 M 410 M
MP-TRPO () 365 M 411 M
Table 5: Comparison results of memory consumption for MP-TRPO with different .
GPU Memory Memory
PPO 335 M 388 M
MP-PPO () 351 M 407 M
MP-PPO () 351 M 409 M
MP-PPO () 351 M 409 M
MP-PPO () 335 M 411 M
Table 6: Comparison results of memory consumption for MP-PPO with different .

Appendix F Final Performance Comparison of MP-PPO

The final performance comparison of MP-PPO is summarized in Table 3, where the numbers in brackets denote the confidence interval.

References

  • O. Anschel, N. Baram, and N. Shimkin (2017) Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185. Cited by: §1.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §D.1.
  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §1.
  • R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman (2017) UCB exploration via q-ensembles. arXiv preprint arXiv:1706.01502. Cited by: §1.
  • C. Colas, O. Sigaud, and P. Oudeyer (2018) GEP-pg: decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning, pp. 1038–1047. Cited by: §1.
  • M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, et al. (2018) Noisy networks for exploration. In International Conference on Learning Representations, Cited by: §1.
  • S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1582–1591. Cited by: §1, §5.
  • T. Gangwani and J. Peng (2017) Policy optimization by genetic distillation. In International Conference on Learning Representations, Cited by: §5.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361. Cited by: §5.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856–1865. Cited by: §1, §5.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §D.2, §4.1.
  • Z. Hong, T. Shann, S. Su, Y. Chang, T. Fu, and C. Lee (2018) Diversity-driven exploration strategy for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10489–10500. Cited by: §1, §4.2.
  • R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2016) Vime: variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117. Cited by: §1, §4.1.
  • B. Kang, Z. Jie, and J. Feng (2018) Policy optimization with demonstrations. In International Conference on Machine Learning, pp. 2474–2483. Cited by: §4.1.
  • S. Khadka and K. Tumer (2018) Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1188–1200. Cited by: §1, §3.2.4, §5.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §5, §5.
  • Y. Liu, P. Ramachandran, Q. Liu, and J. Peng (2017) Stein variational policy gradient. arXiv preprint arXiv:1704.02399. Cited by: §5.
  • M. A. Masood and F. Doshi-Velez (2019) Diversity-inducing policy gradient: using maximum mean discrepancy to find a set of diverse policies. arXiv preprint arXiv:1906.00088. Cited by: §1.
  • O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans (2018) Trust-pcl: an off-policy trust region method for continuous control. In International Conference on Learning Representations, Cited by: §1, §5.
  • I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034. Cited by: §1.
  • G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730. Cited by: §1.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: §1, §4.2.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2018) Parameter space noise for exploration. In International Conference on Learning Representations, Cited by: §1.
  • A. Pourchot and O. Sigaud (2019) CEM-rl: combining evolutionary and gradient-based methods for policy search. In International Conference on Learning Representations, Cited by: §5.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: Appendix A, §1, §2, §3.3.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §2, §3.4.
  • H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel (2017) # exploration: a study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pp. 2753–2762. Cited by: §1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §4.1.
  • S. Zhang and H. Yao (2019) Ace: an actor ensemble algorithm for continuous control with tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5789–5796. Cited by: §1, §5.