Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning

10/05/2019 · Che Wang et al. · NYU

The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that the entropy term in Soft Actor-Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme which allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple non-uniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks.


1 Introduction

Off-policy deep Reinforcement Learning (RL) algorithms aim to improve sample efficiency by reusing past experience. Recently a number of new off-policy Deep Reinforcement Learning algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, in particular, has been shown to be significantly more sample efficient than popular on-policy methods for a wide range of Mujoco benchmarks.

The field of Deep Reinforcement Learning (DRL) has also recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In particular, Soft Actor Critic (SAC), which combines off-policy learning with maximum-entropy RL, not only has many attractive theoretical properties, but can also give superior performance on a wide range of Mujoco environments, including on the high-dimensional environment Humanoid for which both DDPG and TD3 perform poorly (Haarnoja et al., 2018a, b; Langlois et al., 2019). The TD3 and SAC algorithms share many common features, including an actor-critic structure, off-policy learning, and the use of double Q-networks (Van Hasselt et al., 2016). The primary difference between the two approaches is that SAC employs maximum entropy reinforcement learning whereas TD3 does not.

In this paper, we first seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that when using the standard objective without entropy along with standard additive-noise exploration, there is often insufficient exploration due to the bounded nature of the action spaces. Specifically, the outputs of the policy network are often far outside the bounds of the action space, so that they need to be squashed to fit within the action space. The squashing results in actions persistently taking on their maximal values, so that there is insufficient exploration. In contrast, the entropy term in the SAC objective forces the outputs to have sensible values, so that even with squashing, exploration is maintained. We conclude that the entropy term in the objective for Soft Actor Critic principally addresses the bounded nature of the action spaces in the Mujoco environments.

With this insight, we propose Streamlined Off Policy (SOP), a streamlined algorithm using the standard objective without the entropy term. SOP employs a simple normalization scheme to address the bounded nature of the action spaces, thereby allowing for satisfactory exploration throughout training. Our experimental results show that SOP matches the sample-efficiency and robustness performance of SAC, including on the more challenging Ant and Humanoid environments. This demonstrates a need to revisit the benefits of entropy maximization in DRL.

Keeping with the theme of simplicity with the goal of meeting Occam's principle, we also propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. In vanilla SOP (as well as in DDPG, TD3, and SAC), samples from the replay buffer are chosen uniformly at random during training. Our method, called Emphasizing Recent Experience (ERE), samples recent experience more aggressively while not neglecting past experience. Unlike Priority Experience Replay (PER) (Schaul et al., 2015), a popular non-uniform sampling scheme for the Atari environments, ERE is only a few lines of code and does not rely on any sophisticated data structures. We show that SOP combined with ERE outperforms SAC and provides state-of-the-art performance. For example, for Ant and Humanoid, it improves over SAC by roughly 24% at one million samples. Furthermore, we also investigate combining SOP with PER, and show that SOP+ERE also outperforms the more complicated SOP+PER scheme.

The contributions of this paper are thus threefold. First, we uncover the primary contribution of the entropy term of maximum entropy RL algorithms when the environments have bounded action spaces. Second, we develop a new streamlined algorithm which does not employ entropy maximization but nevertheless matches the sample efficiency and robustness performance of SAC for the Mujoco benchmarks. And third, we combine our streamlined algorithm with a simple non-uniform sampling scheme to achieve state-of-the-art performance for the Mujoco benchmark. We provide anonymized code for reproducibility (https://anonymous.4open.science/r/e484a8c7-268a-4a66-a001-1e7676540237/).

2 Preliminaries

We represent an environment as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, r, p, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are continuous multi-dimensional state and action spaces, $r(s,a)$ is a bounded reward function, $p(s'|s,a)$ is a transition function, and $\gamma \in (0,1)$ is the discount factor. Let $s_t$ and $a_t$ respectively denote the state of the environment and the action chosen at time $t$. Let $\pi$ denote the policy. We further write $D$ for the dimension of the action space, and $a^k$ for the $k$th component of an action $a$, that is, $a = (a^1, \ldots, a^D)$.

The expected discounted return for policy $\pi$ beginning in state $s$ is given by:

$J_\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s \right]$    (1)

Standard MDP and reinforcement learning problem formulations seek to maximize (1) over policies $\pi$. For finite state and action spaces, and under suitable conditions for continuous state and action spaces, the optimal policy is deterministic (Puterman, 2014; Bertsekas & Tsitsiklis, 1996). In reinforcement learning with an unknown environment, exploration is required to learn a suitable policy.

In DRL with continuous action spaces, the policy is typically modeled by a parameterized policy network which takes as input a state $s$ and outputs a value $\mu_\theta(s)$, where $\theta$ represents the current parameters of the policy network (Schulman et al., 2015, 2017; Vuong et al., 2018; Lillicrap et al., 2015; Fujimoto et al., 2018). During training, the actual action taken when in state $s$ often takes the form $a = \mu_\theta(s) + \epsilon$, where $\epsilon$ is a random $D$-dimensional noise vector which is drawn independently at each time step and may, in some circumstances, also depend on $s$. During testing, $\epsilon$ is set to zero.

2.1 Entropy Maximization RL

Maximum entropy reinforcement learning takes a different approach than (1) by optimizing policies to maximize both the expected return and the expected entropy of the policy (Ziebart et al., 2008; Ziebart, 2010; Todorov, 2008; Rawlik et al., 2013; Levine & Koltun, 2013; Levine et al., 2016; Nachum et al., 2017; Haarnoja et al., 2017, 2018a, 2018b).

In particular, with maximum entropy RL, the objective is to maximize

$J_\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \big( r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \,\middle|\, s_0 = s \right]$    (2)

where $\mathcal{H}(\pi(\cdot \mid s))$ is the entropy of the policy when in state $s$, and the temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward.

For entropy maximization DRL, when given state $s$ the policy network will typically output a $D$-dimensional vector $\sigma_\theta(s)$ in addition to the vector $\mu_\theta(s)$. The action selected when in state $s$ is then modeled as $a = \mu_\theta(s) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_\theta(s))$.
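For concreteness, here is a toy numpy sketch of the two action-selection schemes just described. The noise scale of 0.3 and the stand-in network outputs are arbitrary illustrative values, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6                                     # arbitrary action dimension for illustration
mu = rng.normal(size=D)                   # stand-in for the policy-network output mu_theta(s)

# Standard objective (Eq. 1): deterministic output plus additive noise with a fixed scale.
a_standard = mu + rng.normal(0.0, 0.3, size=D)

# Maximum-entropy objective (Eq. 2): the network also outputs a state-dependent
# scale sigma_theta(s), and the noise is scaled per dimension by that output.
sigma = np.exp(rng.normal(size=D) - 1.0)  # stand-in for sigma_theta(s)
a_maxent = mu + sigma * rng.normal(size=D)

print(a_standard)
print(a_maxent)
```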

Maximum entropy RL has been touted to have a number of conceptual and practical advantages for DRL (Haarnoja et al., 2018a, b). For example, it has been argued that the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. It has also been argued that the policy can capture multiple modes of near-optimal behavior, that is, in problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. In this paper, we will highlight another advantage, namely, retaining sufficient exploration when facing bounded action spaces.

3 The Squashing Exploration Problem

3.1 Bounded Action Spaces

Continuous environments typically have bounded action spaces: along each action dimension $k$ there is a minimum possible action value $a^k_{\min}$ and a maximum possible action value $a^k_{\max}$. When selecting an action, the action needs to fit within these bounds before it can be taken. DRL algorithms often handle this by squashing the action so that it fits within the bounds. For example, if along any one dimension the value exceeds $a^k_{\max}$, the action is set (clipped) to $a^k_{\max}$. Alternatively, a smooth form of squashing can be employed. For example, suppose $a^k_{\min} = -M$ and $a^k_{\max} = M$ for some positive number $M$; then a smooth form of squashing could use $a = M \tanh(u)$, where $u$ is the policy-network output and $\tanh$ is applied to each component of the $D$-dimensional vector $u$. DDPG (Hou et al., 2017) and TD3 (Fujimoto et al., 2018) use clipping, and SAC (Haarnoja et al., 2018a, b) uses smooth squashing with the $\tanh$ function. For concreteness, henceforth we will assume that smooth squashing with $\tanh$ is employed.

We note that an environment may actually allow the agent to input actions that are outside the bounds. In this case, the environment will typically first clip the actions internally before passing them on to the “actual” environment (Fujita & Maeda, 2018).

We now make a simple but crucial observation: squashing actions so that they fit into a bounded action space can have a disastrous effect on additive-noise exploration strategies. To see this, let the output of the policy network be denoted by $u = (u^1, \ldots, u^D)$. Consider an action along one dimension $k$, and suppose $u^k \gg M$ and the additive noise is relatively small compared to $u^k$. Then the action component $a^k = M\tanh(u^k + \epsilon^k)$ will be very close (essentially equal) to $M$. If the condition $u^k \gg M$ persists over many consecutive states, then $a^k$ will remain pinned near $M$ for all these states, and consequently there will be essentially no exploration along the $k$th dimension. We will refer to this problem as the squashing exploration problem. We will argue that algorithms such as DDPG and TD3, which are based on the standard objective (1) with additive-noise exploration, can be greatly impaired by the squashing exploration problem.
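A minimal numpy illustration of this effect follows. The output value of 8, the noise scale of 0.3, and $M = 1$ are arbitrary illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the policy network persistently outputs a large value along one dimension.
u_k = 8.0                                  # hypothetical policy-network output, far outside [-1, 1]
noise = rng.normal(0.0, 0.3, size=1000)    # additive Gaussian exploration noise

# After tanh squashing, the noise has essentially no effect: the action is pinned at +1.
actions = np.tanh(u_k + noise)
print(actions.min(), actions.max())        # both are 0.9999... -> no exploration

# With a small output (as encouraged by the entropy term), the same noise still explores.
actions_small = np.tanh(0.5 + noise)
print(actions_small.std())                 # noticeably > 0 -> exploration is maintained
```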

3.2 What does entropy maximization bring to SAC for the Mujoco environments?

SAC is a maximum-entropy off-policy DRL algorithm which provides good performance across all of the Mujoco benchmark environments. To the best of our knowledge, it currently provides state-of-the-art performance for the Mujoco benchmark. In this section, we argue that the principal contribution of the entropy term in the SAC objective is to resolve the squashing exploration problem, thereby maintaining sufficient exploration when facing bounded action spaces. To argue this, we consider two DRL algorithms: SAC with adaptive temperature (Haarnoja et al., 2018b), and SAC with entropy removed altogether (temperature set to zero) but everything else kept the same. We refer to them as SAC and SAC without entropy, respectively. For SAC without entropy, we use additive zero-mean Gaussian noise with a fixed standard deviation for exploration. Both algorithms use squashing. We compare these two algorithms on two Mujoco environments: Humanoid-v2 and Walker2d-v2.

Figure 1 shows the performance of the two algorithms with 10 seeds. We see that for Humanoid, SAC with entropy maximization performs much better than SAC without entropy maximization. However, for Walker, SAC without entropy performs nearly as well as SAC, implying maximum entropy RL is not as critical for this environment.

(a) Humanoid-v2
(b) Walker2d-v2
Figure 1: SAC performance with and without entropy maximization

To understand why entropy maximization is important for one environment but less so for another, we examine the actions selected when training these two algorithms. Humanoid and Walker2d have action dimensions $D = 17$ and $D = 6$, respectively. Here we show representative results for one dimension $k$ for both environments, and provide the results for all the dimensions in the Appendix. The top and bottom rows of Figure 2 show results for Humanoid and Walker2d, respectively. The first column shows the policy-network outputs $u^k$ for an interval of 1,000 consecutive time steps, namely, for time steps 599,000 to 600,000. The second column shows the actual action values $a^k$ passed to the environment, again for time steps 599,000 to 600,000. The third and fourth columns show a concatenation of 10 such intervals of 1,000 time steps, with each interval coming from a larger interval of 100,000 time steps.

The first and third columns use a log scale on the y-axis.

The top and bottom rows of Figure 2 are strikingly different. For Humanoid using SAC (which uses entropy maximization), the $u^k$ values are small, mostly in the range [-1.5, 1.5], and fluctuate significantly. This allows the action values $a^k$ to also fluctuate significantly, providing exploration in the action space. On the other hand, for SAC without entropy the $u^k$ values are typically huge, most of them well outside the interval [-10, 10]. This causes the actions to be persistently clustered at either $M$ or $-M$, leading to essentially no exploration along that dimension. As shown in the Appendix, this lack of exploration for SAC without entropy maximization does not hold for just a few dimensions, but for all 17 dimensions.

For Walker2d, we see that for both algorithms the $u^k$ values are sensible, mostly in the range [-1, 1], and therefore the actions chosen by both algorithms exhibit exploration.

In conclusion, the principal benefit of maximum entropy RL in SAC for the Mujoco environments is that it resolves the squashing exploration problem. For some environments (such as Walker2d), the outputs of the policy network take on sensible values, so that sufficient exploration is maintained and overall good performance is achieved without the need for entropy maximization. For other environments (such as Humanoid), entropy maximization is needed to reduce the magnitudes of the outputs so that exploration is maintained and overall good performance is achieved.

(a) Humanoid-v2
(b) Walker2d-v2
Figure 2: $u^k$ and $a^k$ values from SAC and SAC without entropy maximization

4 Streamlined Off-Policy (SOP) Algorithm

Given the observations in the previous section, a natural question is: is it possible to design a streamlined off policy algorithm that does not employ entropy maximization but offers performance comparable to SAC (which has entropy maximization)?

As we observed in the previous section, without entropy maximization the policy-network output values $u^k$, $k = 1, \ldots, D$, can become persistently huge in some environments, which leads to insufficient exploration due to the squashing. A simple solution is to modify the outputs of the policy network by normalizing the output values when they collectively (across the action dimensions) become too large. To this end, for any $D$-dimensional vector $x$, let $\|x\|$ denote a norm of $x$ normalized by the dimension $D$ (the specific norm used is given in Section 4.1). Let $\beta$ be a constant (hyperparameter) close to 1. The normalization procedure is as follows. Let $u = (u^1, \ldots, u^D)$ be the output of the original policy network. If $\|u\| > \beta$, then we reset $u^k \leftarrow \beta\, u^k / \|u\|$ for all $k$; otherwise, we leave $u$ unchanged. With this normalization, we are assured that $\|u\|$ is never greater than $\beta$. Henceforth we assume the policy network has been modified with the simple normalization scheme just described.
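A minimal numpy sketch of this normalization step follows, assuming the per-dimension $L_1$ norm of Section 4.1. The helper name and the default $\beta = 1$ are illustrative, and the exact scaling in the released code may differ:

```python
import numpy as np

def normalize_policy_output(u, beta=1.0, p=1):
    """Rescale the policy-network output u so that its per-dimension Lp norm never exceeds beta.

    A sketch of the normalization described in Section 4; beta and the use of the
    dimension-averaged L1 norm follow the text, but treat the exact constants as assumptions.
    """
    u = np.asarray(u, dtype=np.float64)
    norm = np.mean(np.abs(u) ** p) ** (1.0 / p)   # Lp norm averaged over action dimensions
    if norm > beta:
        u = beta * u / norm                        # rescale so the norm equals beta
    return u

# Example: a huge output gets pulled back toward the unsquashed region,
# while a sensible output is left unchanged.
print(normalize_policy_output([12.0, -30.0, 4.0]))   # rescaled
print(normalize_policy_output([0.4, -0.7, 0.2]))     # unchanged
```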

Our Streamlined Off-Policy (SOP) algorithm is described in Algorithm 1. The algorithm is essentially DDPG plus the normalization described above, plus double Q-learning (Van Hasselt et al., 2016) and target policy smoothing (Fujimoto et al., 2018). Another way of looking at it is as TD3 plus the normalization described above, minus the delayed policy updates and the target policy parameters. SOP also uses $\tanh$ squashing instead of clipping, since it gives somewhat better performance in our experiments. The SOP algorithm is "streamlined" in that it has no entropy terms, temperature adaptation, target policy parameters, or delayed policy updates.

1: Input: initial policy parameters $\theta$, Q-function parameters $\phi_1$, $\phi_2$, empty replay buffer $\mathcal{D}$
2: Set target parameters equal to main parameters: $\phi_{\text{targ},i} \leftarrow \phi_i$ for $i = 1, 2$
3: repeat
4:     Generate an episode using actions $a = M \tanh(\mu_\theta(s) + \epsilon)$ where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
5:     for $j$ in range(however many updates) do
6:         Randomly sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$
7:         Compute targets for the Q functions:
               $y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}\big(s',\, M \tanh(\mu_\theta(s') + \epsilon')\big), \quad \epsilon' \sim \mathcal{N}(0, \sigma^2 I)$
8:         Update the Q-functions by one step of gradient descent using
               $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \big( Q_{\phi_i}(s, a) - y(r, s', d) \big)^2 \quad \text{for } i = 1, 2$
9:         Update the policy by one step of gradient ascent using
               $\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} \min_{i=1,2} Q_{\phi_i}\big(s,\, M \tanh(\mu_\theta(s))\big)$
10:        Update the target networks with
               $\phi_{\text{targ},i} \leftarrow (1 - \tau)\, \phi_{\text{targ},i} + \tau\, \phi_i \quad \text{for } i = 1, 2$
Algorithm 1: Streamlined Off-Policy
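To make the update step concrete, here is a self-contained PyTorch sketch of lines 6-10 of Algorithm 1 on a random mini-batch standing in for a replay-buffer sample. The network sizes, noise scales, the clipping of the smoothing noise, and the use of the minimum over the two critics in the policy loss are illustrative assumptions rather than the paper's exact choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, batch = 17, 6, 256
gamma, tau, sigma, noise_clip, M = 0.99, 0.005, 0.3, 0.5, 1.0   # illustrative values

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, out_dim))

policy = mlp(obs_dim, act_dim)                      # outputs u = mu_theta(s)
q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_targ, q2_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_targ.load_state_dict(q1.state_dict()); q2_targ.load_state_dict(q2.state_dict())

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def squash(u, beta=1.0):
    """Normalization from Section 4 followed by tanh squashing (sketch)."""
    g = u.abs().mean(dim=-1, keepdim=True)          # per-dimension L1 norm
    u = torch.where(g > beta, beta * u / g, u)
    return M * torch.tanh(u)

# A random mini-batch standing in for a sample from the replay buffer.
s = torch.randn(batch, obs_dim); a = torch.rand(batch, act_dim) * 2 - 1
r = torch.randn(batch, 1); s2 = torch.randn(batch, obs_dim); d = torch.zeros(batch, 1)

# Q targets with target policy smoothing (clipped noise on the target action,
# a TD3-style detail; treat it as an implementation choice).
with torch.no_grad():
    eps = (torch.randn(batch, act_dim) * sigma).clamp(-noise_clip, noise_clip)
    a2 = squash(policy(s2) + eps)
    q_targ = torch.min(q1_targ(torch.cat([s2, a2], -1)), q2_targ(torch.cat([s2, a2], -1)))
    y = r + gamma * (1 - d) * q_targ

# Q update: one gradient step on the mean squared Bellman error for both critics.
q_loss = F.mse_loss(q1(torch.cat([s, a], -1)), y) + F.mse_loss(q2(torch.cat([s, a], -1)), y)
q_opt.zero_grad(); q_loss.backward(); q_opt.step()

# Policy update: ascend the minimum of the two critics at the squashed policy action
# (TD3 instead uses only the first critic; either is a reasonable choice in a sketch).
a_pi = squash(policy(s))
pi_loss = -torch.min(q1(torch.cat([s, a_pi], -1)), q2(torch.cat([s, a_pi], -1))).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

# Polyak-average the target networks.
with torch.no_grad():
    for net, net_targ in [(q1, q1_targ), (q2, q2_targ)]:
        for p, p_t in zip(net.parameters(), net_targ.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```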

4.1 Experimental Results for SOP

Without performing a careful hyperparameter search, we found that a single choice of the noise standard deviation $\sigma$ and the normalization constant $\beta$ (given in Appendix C) works well for all environments. For the normalization in SOP, we use the $L_1$ norm, that is, $\|x\| = \frac{1}{D} \sum_k |x^k|$.

Figure 3 compares SAC (with temperature adaptation (Haarnoja et al., 2018a, b)) with SOP for five of the most challenging Mujoco environments. Using the same baseline code, we train with ten different random seeds for each of the two algorithms. Each algorithm performs five evaluation rollouts every 5,000 environment steps. The solid curves correspond to the mean, and the shaded regions to the standard deviation of the returns over the ten seeds.

Results show that SOP and SAC have essentially the same sample-efficiency performance and robustness across all environments. This result confirms that when using a simple output normalization in the policy network, the performance of SAC can be achieved without maximum entropy RL.

In the Appendix we provide an ablation study for SOP, which shows a major performance drop when removing either double Q-learning or normalization, whereas removing target policy smoothing (Fujimoto et al., 2018) results in only a small performance drop in some environments.

(a) Hopper-v2
(b) Walker2d-v2
(c) HalfCheetah-v2
(d) Ant-v2
(e) Humanoid-v2
Figure 3: Streamlined Off-Policy (SOP) versus SAC

5 Non-uniform sampling

We now show how a small change in the sampling scheme for SOP can achieve state-of-the-art performance for the Mujoco benchmark. We call this sampling scheme Emphasizing Recent Experience (ERE). ERE has three core features: (i) it is a general method applicable to any off-policy algorithm; (ii) it requires no special data structure, is very simple to implement, and has near-zero computational overhead; (iii) it introduces only one additional important hyperparameter.

The basic idea is the following: during the parameter update phase, the first mini-batch is sampled from the entire buffer, and for each subsequent mini-batch we gradually shrink the range of sampling so as to sample more aggressively from more recent data. Specifically, assume that in the current update phase we are to make $K$ mini-batch updates. Let $N$ be the maximum size of the buffer. Then for the $k$th update, we sample uniformly from the most recent $c_k$ data points, where $c_k = N \cdot \eta^{k \frac{1000}{K}}$ and $\eta \in (0, 1]$ is a hyperparameter that determines how much emphasis we put on recent data: $\eta = 1$ is uniform sampling, and when $\eta < 1$, $c_k$ decreases as we perform each update. $\eta$ can be made to adapt to the learning speed of the agent so that we do not have to tune it for each environment.
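A minimal numpy sketch of the sampling-range computation follows. The floor $c_{\min} = 5{,}000$ follows Appendix F; the $\eta$ value here is illustrative, not the paper's tuned setting:

```python
import numpy as np

def ere_sampling_ranges(N, K, eta, c_min=5000):
    """Return c_k, the number of most-recent buffer entries to sample from,
    for each of the K mini-batch updates in the current update phase."""
    k = np.arange(1, K + 1)
    c = N * eta ** (k * 1000.0 / K)
    return np.maximum(c, c_min).astype(int)

# Example: a full 1M buffer, 1000 updates, eta = 0.996 (illustrative value).
c_k = ere_sampling_ranges(N=1_000_000, K=1000, eta=0.996)
print(c_k[0], c_k[499], c_k[-1])   # the range shrinks from ~1M toward the most recent data

# Drawing the k-th mini-batch then amounts to sampling uniformly from the
# most recent c_k[k] transitions in the (FIFO) replay buffer.
```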

The effect of such a sampling scheme is twofold. The first is that recent data points have a higher chance of being sampled.

The second is that we do this in an ordered way: we first sample from all the data in the buffer, and gradually shrink the range of sampling to only sample from the most recent data. This scheme reduces the chance of over-writing parameter changes made by new data with parameter changes made by old data (French, 1999; McClelland et al., 1995; McCloskey & Cohen, 1989; Ratcliff, 1990; Robins, 1995). This process allows us to quickly obtain new information from recent data, and better approximate the value functions near recently-visited states, while still maintaining an acceptable approximation near states visited in the more distant past.

What is the effect of replacing uniform sampling with ERE? First note that if we do uniform sampling on a fixed buffer, the expected number of times a data point is sampled is the same for all data points. Now consider a scenario where we have a buffer of size 1,000 (a FIFO queue), we collect one data point at a time, and we then perform one update with a mini-batch size of one. If we start with an empty buffer and sample uniformly, then as data fills the buffer, each data point has less and less chance of being sampled. Specifically, over a period of 1,000 updates, the expected number of times the $i$th data point is sampled is $\sum_{j=i}^{1000} 1/j \approx \ln(1000/i)$.

Figure 3(f) shows the expected number of times a data point is sampled as a function of its position in the buffer. We see that older data points have a much higher expected number of samples than newer data points. This is undesirable: when the agent is improving and exploring new areas of the state space, the new data points may contain more interesting information than the old ones, which have already been used for many updates.
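As a quick sanity check of the expected-count formula above, a toy numpy sketch (the buffer size of 1,000 matches the scenario just described):

```python
import numpy as np

T = 1000  # buffer size, number of collected points, and number of single-sample updates

# Expected number of times the i-th collected point is sampled under uniform sampling:
# it is present for updates j = i, ..., T and is picked with probability 1/j at update j.
expected = np.array([np.sum(1.0 / np.arange(i, T + 1)) for i in range(1, T + 1)])

print(round(expected[0], 2))    # i = 1    (oldest point): ~7.49 expected samples
print(round(expected[-1], 4))   # i = 1000 (newest point): 0.001 expected samples
```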

When we apply the ERE scheme, we effectively skew the curve, assigning a higher expected number of samples to the newer data. This allows the newer data to be sampled frequently soon after being collected, which can accelerate the learning process. Further algorithmic details and analysis of ERE can be found in the Appendix.

5.1 Experimental results for SOP+ERE

Figure 4 compares the performance of SOP, SOP+ERE, and SAC. SOP+ERE learns faster than SAC and vanilla SOP in all Mujoco environments. SOP+ERE also greatly improves overall performance for the two most challenging environments, Ant and Humanoid. We found that fine-tuning hyperparameters for each environment can give further improvements in sample efficiency, but for fairness of comparison we use exactly the same hyperparameters for all environments. In Table 1 we show the mean test episode return and its standard deviation across 10 random seeds at one million timesteps for all environments. The last column displays the percentage improvement of SOP+ERE over SAC, showing that SOP+ERE achieves state-of-the-art performance. In both Ant and Humanoid, SOP+ERE improves average performance by roughly 24% over SAC at one million timesteps. As for the standard deviation, SOP+ERE gives lower values than SAC, except for Humanoid where it is higher.

(a) Hopper-v2
(b) Walker2d-v2
(c) HalfCheetah-v2
(d) Ant-v2
(e) Humanoid-v2
(f) Uniform and ERE sampling
Figure 4: (a) to (e) show Streamlined Off-Policy (SOP) with ERE sampling versus SAC. (f) shows, over a period of 1,000 updates, the expected number of times the $i$th data point is sampled, under uniform sampling and under ERE. ERE allows new data to be sampled many times soon after being collected.
Environment SAC Adaptive SOP SOP+ERE Improvement
Hopper 6.9%
Walker 10.2%
HalfCheetah 7.5%
Ant 23.9%
Humanoid 24.1%
Table 1: Performance comparison at one million samples. Last column shows percentage improvement of SOP+ERE over SAC.

6 Related work

In recent years, there has been significant progress in improving the sample efficiency of DRL for continuous robotic locomotion tasks with off-policy algorithms (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a, b). There is also a significant body of research on maximum entropy RL methods (Ziebart et al., 2008; Ziebart, 2010; Todorov, 2008; Rawlik et al., 2013; Levine & Koltun, 2013; Levine et al., 2016; Nachum et al., 2017; Haarnoja et al., 2017, 2018a, 2018b).

By taking the clipping in the Mujoco environments explicitly into account, Fujita & Maeda (2018) modified the policy gradient algorithm to reduce variance and provide superior performance among on-policy algorithms. Eisenach et al. (2018) extend the work of Fujita & Maeda (2018) to the case where an action is a direction. Hausknecht & Stone (2015) and Chou et al. (2017) also explore DRL in the context of bounded action spaces. Dalal et al. (2018) consider safe exploration in the context of constrained action spaces.

Uniform sampling is the most common way to sample from a replay buffer. One of the most well-known alternatives is prioritized experience replay (PER) (Schaul et al., 2015). PER uses the absolute TD error of a data point as the measure of priority, and data points with higher priority have a higher chance of being sampled. This method has been tested on DQN (Mnih et al., 2015) and double DQN (DDQN) (Van Hasselt et al., 2016) with significant improvement. PER has been combined with the dueling architecture (Wang et al., 2015) and with an ensemble of recurrent DQNs (Schulze & Schulze, 2018), and PER is one of the six crucial components in Rainbow (Hessel et al., 2018), which achieves state-of-the-art performance on the Atari game environments. PER has also been successfully applied to other algorithms such as DDPG (Hou et al., 2017) and can be implemented in a distributed manner (Horgan et al., 2018). There are other methods proposed to make better use of the replay buffer. In Sample Efficient Actor-Critic with Experience Replay (ACER), the algorithm has an on-policy part and an off-policy part, with a hyperparameter controlling the ratio of off-policy to on-policy updates (Wang et al., 2016). The RACER algorithm (Novati & Koumoutsakos, 2018) selectively removes data points from the buffer based on their degree of "off-policyness", as measured by their importance sampling weight, bringing improvements to DDPG (Lillicrap et al., 2015), NAF (Gu et al., 2016), and PPO (Schulman et al., 2017). In De Bruin et al. (2015), replay buffers of different sizes were tested on DDPG, and the results show that a large enough buffer with enough data diversity can lead to better performance. Finally, with Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), priority can be given to trajectories with lower density estimates (Zhao & Tresp, 2019) to tackle multi-goal, sparse-reward environments.

7 Conclusion

In this paper we first showed that the primary role of maximum entropy RL for the Mujoco benchmark is to maintain satisfactory exploration in the presence of bounded action spaces. We then developed a new streamlined algorithm which does not employ entropy maximization but nevertheless matches the sample efficiency and robustness performance of SAC for the Mujoco benchmarks. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. Finally, we combined our streamlined algorithm with a simple non-uniform sampling scheme to achieve state-of-the-art performance for the Mujoco benchmark.

References

Appendix A Ablation Study

In this ablation study we separately examine the importance of the normalization at the output of the policy network, the double Q-networks, and the randomization used in the computation of the Q-targets in Algorithm 1 (that is, target policy smoothing (Fujimoto et al., 2018)).

Figure 5 shows the results for the five environments considered in this paper. In Figure 5, "no normalization" is SOP without the normalization of the outputs of the policy network; "single Q" is SOP with one Q-network instead of two; and "no smoothing" is SOP without the randomness in the Q-target computation.

Figure 5 confirms that double Q-networks are critical for obtaining good performance (Van Hasselt et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018a). Figure 5 also shows that output normalization is also critical. Without output normalization, performance fluctuates wildly, and average performance can decrease dramatically, particularly for Humanoid and HalfCheetah. Target policy smoothing improves performance by a relatively small amount.

(a) Hopper-v2
(b) Walker2d-v2
(c) HalfCheetah-v2
(d) Ant-v2
(e) Humanoid-v2
Figure 5: Ablation Study

Appendix B Additional analysis and results comparing SAC with and without entropy

To understand why entropy maximization is important for one environment but less so for another, we examine the actions selected when training SAC with and without entropy. Humanoid and Walker2d have action dimensions $D = 17$ and $D = 6$, respectively. In addition to the representative results shown for one dimension for both environments in Section 3.2, the results for all the dimensions are provided here in Figures 6-9.

From Figures 6-8, we see that for Humanoid using SAC (which uses entropy maximization), the $u^k$ values are small and fluctuate significantly for all 17 dimensions. On the other hand, for SAC without entropy the $u^k$ values are typically huge, again for all 17 dimensions. This causes the actions to be persistently clustered at either $M$ or $-M$. As for Walker2d, the $u^k$ values are sensible for both algorithms for all 6 dimensions, as shown in Figure 9.

(a) k = 1
(b) k = 2
(c) k = 3
(d) k = 4
(e) k = 5
(f) k = 6
(g) k = 7
Figure 6: Humanoid-v2: $u^k$ and $a^k$ values from SAC and SAC without entropy maximization
(a) k = 8
(b) k = 9
(c) k = 10
(d) k = 11
(e) k = 12
(f) k = 13
(g) k = 14
Figure 7: Humanoid-v2: $u^k$ and $a^k$ values from SAC and SAC without entropy maximization
(a) k = 15
(b) k = 16
(c) k = 17
Figure 8: Humanoid-v2: $u^k$ and $a^k$ values from SAC and SAC without entropy maximization
(a) k = 1
(b) k = 2
(c) k = 3
(d) k = 4
(e) k = 5
(f) k = 6
Figure 9: Walker2d-v2: $u^k$ and $a^k$ values from SAC and SAC without entropy maximization

Appendix C Hyperparameters

Table 2 shows the hyperparameters used for SOP, SOP+ERE, and SOP+PER. For adaptive SAC, we use our own PyTorch implementation for the comparisons. Our implementation uses the same hyperparameters as the original paper (Haarnoja et al., 2018b). Our implementations of the SOP variants and adaptive SAC share most of the code base, to make comparisons fair.

Parameter Value
Shared
optimizer Adam (Kingma & Ba, 2014)
learning rate
discount ($\gamma$) 0.99
target smoothing coefficient ($\tau$) 0.005
target update interval 1
replay buffer size $10^6$
number of hidden layers for all networks 2
number of hidden units per layer 256
mini-batch size 256
nonlinearity ReLU
SAC adaptive
entropy target $-\dim(\mathcal{A})$ (e.g., -6 for HalfCheetah-v2)
SOP
Gaussian noise std ($\sigma$) 0.3
normalization constant ($\beta$)
ERE
initial $\eta_0$
PER
priority exponent $\alpha_{\text{PER}}$ ($\alpha$ in the PER paper)
IS exponent $\beta_{\text{PER}}$ ($\beta$ in the PER paper)
Table 2: Hyperparameters

Appendix D ERE Pseudocode

1: Input: initial policy parameters $\theta$, Q-function parameters $\phi_1$, $\phi_2$, empty replay buffer $\mathcal{D}$ of size $N$, initial $\eta_0$, recent and maximum performance improvements $I_{\text{recent}}$ and $I_{\text{max}}$.
2: Set target parameters equal to main parameters: $\phi_{\text{targ},i} \leftarrow \phi_i$ for $i = 1, 2$
3: repeat
4:     Generate an episode using actions $a = M \tanh(\mu_\theta(s) + \epsilon)$ where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
5:     Update $I_{\text{recent}}$ and $I_{\text{max}}$ with the training episode returns; let $K \leftarrow$ length of the episode
6:     Compute $\eta \leftarrow \eta_0 \cdot \frac{I_{\text{recent}}}{I_{\text{max}}} + 1 - \frac{I_{\text{recent}}}{I_{\text{max}}}$
7:     for $k$ in range($K$) do
8:         Compute $c_k = \max\big( N \cdot \eta^{k \frac{1000}{K}},\, c_{\min} \big)$
9:         Sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from the most recent $c_k$ data points in $\mathcal{D}$
10:        Compute targets for the Q functions:
               $y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}\big(s',\, M \tanh(\mu_\theta(s') + \epsilon')\big), \quad \epsilon' \sim \mathcal{N}(0, \sigma^2 I)$
11:        Update the Q-functions by one step of gradient descent using
               $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \big( Q_{\phi_i}(s, a) - y(r, s', d) \big)^2 \quad \text{for } i = 1, 2$
12:        Update the policy by one step of gradient ascent using
               $\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} \min_{i=1,2} Q_{\phi_i}\big(s,\, M \tanh(\mu_\theta(s))\big)$
13:        Update the target networks with
               $\phi_{\text{targ},i} \leftarrow (1 - \tau)\, \phi_{\text{targ},i} + \tau\, \phi_i \quad \text{for } i = 1, 2$
Algorithm 2: SOP with Emphasizing Recent Experience

Appendix E Additional ERE analysis and results

Figure 10 shows, for fixed $K$, how $\eta$ affects the data sampling process under the ERE sampling scheme. Recent data points have a much higher probability of being sampled compared to older data, and a smaller $\eta$ value gives more emphasis to recent data.

Different $\eta$ values are desirable depending on how fast the agent is learning and how fast past experiences become obsolete. So, to make ERE work well in different environments with different reward scales and learning speeds, we adapt $\eta$ to the speed of learning. To this end, define performance to be the training episode return. Define $I_{\text{recent}}$ to be how much performance has improved compared to $N/2$ timesteps ago, and $I_{\text{max}}$ to be the maximum such improvement throughout training, where $N$ is the buffer size. Let the hyperparameter $\eta_0$ be the initial value of $\eta$. We then adapt $\eta$ according to the formula: $\eta = \eta_0 \cdot \frac{I_{\text{recent}}}{I_{\text{max}}} + 1 - \frac{I_{\text{recent}}}{I_{\text{max}}}$.

Under such an adaptive scheme, when the agent learns quickly, $\eta$ is low in order to learn quickly from new data. When progress is slow, $\eta$ is closer to 1, to make use of the stabilizing effect of uniform sampling from the whole buffer.
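A minimal sketch of this adaptation rule follows. The linear-interpolation form and the example $\eta_0$ value are taken from the description above and should be treated as assumptions about the exact implementation:

```python
def adapt_eta(eta_0, improvement_recent, improvement_max):
    """Interpolate between eta_0 (fast learning -> emphasize recent data)
    and 1.0 (slow learning -> uniform sampling), as described in Appendix E."""
    if improvement_max <= 0:
        return eta_0
    ratio = min(max(improvement_recent / improvement_max, 0.0), 1.0)
    return eta_0 * ratio + 1.0 * (1.0 - ratio)

# Example with an illustrative eta_0 = 0.995:
print(adapt_eta(0.995, improvement_recent=80.0, improvement_max=100.0))  # close to eta_0
print(adapt_eta(0.995, improvement_recent=5.0,  improvement_max=100.0))  # close to 1.0
```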

(a)
(b)
Figure 10: Effect of different $\eta$ values. The plots assume a replay buffer with 1 million samples and 1,000 mini-batches of size 256 in an update phase. Figure 10(a) plots $c_k$ (ranging from 0 to 1 million) as a function of $k$ (ranging from 1 to 1,000). Figure 10(b) plots the expected number of times a data point in the buffer is sampled, with the data points ordered from most to least recent.

E.1 SOP with Prioritized Experience Replay

We also implement the proportional variant of Prioritized Experience Replay (Schaul et al., 2015) with SOP.

Since SOP has two Q-networks, we redefine the absolute TD error of a transition to be the average absolute TD error over the two Q-networks in the Q-network update:

$\delta_j = \frac{1}{2} \sum_{i=1,2} \big|\, y(r_j, s'_j, d_j) - Q_{\phi_i}(s_j, a_j) \,\big|$    (3)

Within the sum, the first term $y(r_j, s'_j, d_j)$ is simply the target for the Q-network, and the term $Q_{\phi_i}(s_j, a_j)$ is the current estimate of the $i$th Q-network. For the $j$th data point, the priority value is $p_j = \delta_j$. The probability of sampling a data point is computed as:

$P(j) = \frac{p_j^{\alpha_{\text{PER}}}}{\sum_l p_l^{\alpha_{\text{PER}}}}$    (4)

where $\alpha_{\text{PER}}$ is a hyperparameter that controls how much the priority value affects the sampling probability; it is denoted by $\alpha$ in Schaul et al. (2015), but to avoid confusion with the temperature $\alpha$ in SAC, we denote it as $\alpha_{\text{PER}}$. The importance sampling (IS) weight for a data point is computed as:

$w_j = \left( \frac{1}{N \cdot P(j)} \right)^{\beta_{\text{PER}}}$    (5)

where $\beta_{\text{PER}}$ is denoted as $\beta$ in Schaul et al. (2015).

Based on the SOP algorithm, we change the sampling method from uniform sampling to sampling with the probabilities $P(j)$, and for the Q-updates we apply the IS weights $w_j$. This gives SOP with Prioritized Experience Replay (SOP+PER). We note that, compared with SOP+PER, ERE does not require a special data structure and has negligible extra cost, while PER uses a sum-tree structure with some additional computational cost. We also tried several variants of SOP+PER, but preliminary results made it unclear whether they improve performance, so we kept the algorithm simple.
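A small numpy sketch of the proportional-PER quantities in equations (3)-(5), using the notation above; the function names, exponent values, and toy TD errors are illustrative, and the max-normalization of the IS weights follows Schaul et al. (2015):

```python
import numpy as np

def per_probabilities(td_errors, alpha_per):
    """Sampling probabilities from priorities p_j = |delta_j| (proportional variant, Eq. 4)."""
    p = np.abs(td_errors) ** alpha_per
    return p / p.sum()

def per_is_weights(probs, buffer_size, beta_per):
    """Importance-sampling weights (Eq. 5), normalized by their maximum as in Schaul et al."""
    w = (1.0 / (buffer_size * probs)) ** beta_per
    return w / w.max()

# Example: averaged absolute TD errors for five transitions (illustrative numbers).
delta = np.array([0.1, 0.5, 2.0, 0.05, 1.0])
probs = per_probabilities(delta, alpha_per=0.6)
weights = per_is_weights(probs, buffer_size=len(delta), beta_per=0.4)
print(probs.round(3), weights.round(3))
```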

E.2 PER experiment results

Figure 11 shows a performance comparison of adaptive SOP, SOP, SOP+ERE and SOP+PER. From these experiments, SOP+PER does not give a significant performance boost to SOP (if any boost at all). We also found that it is difficult to find hyperparameter settings for SOP+PER that work well for all environments. Some of the other hyperparameter settings actually reduce performance. It is unclear why PER does not work so well for SOP. A similar result has been found in another recent paper (Fu et al., 2019), showing that PER can significantly reduce performance on TD3. Further research is needed to understand how PER can be successfully adapted to environments with continuous action spaces and dense reward structure.

(a) Hopper-v2
(b) Walker2d-v2
(c) HalfCheetah-v2
(d) Ant-v2
(e) Humanoid-v2
Figure 11: Streamlined Off-Policy (SOP), with ERE and PER sampling schemes

Appendix F Additional implementation details

F.1 ERE implementation

In this section we discuss some programming details. These details are not necessary for understanding the algorithm, but they might help with reproducibility.

In the ERE scheme, the sampling range always starts with the entire buffer (1M data points) and then gradually shrinks. This is true even when the buffer is not full. So even if there are not many data points in the buffer, we compute $c_k$ as if there were 1M data points in the buffer. One can also modify the design slightly to obtain a variant that uses the current number of data points to compute $c_k$. In addition to the reported scheme, we also tried shrinking the sampling range linearly, but this gives less performance gain.

In our implementation, we set the number of updates after an episode to be the same as the number of timesteps in that episode. Since episodes do not always end at 1,000 timesteps, we can give a more general formula for $c_k$. Let $K$ be the number of mini-batch updates and let $N$ be the maximum size of the replay buffer; then:

$c_k = \max\left( N \cdot \eta^{k \frac{1000}{K}},\, c_{\min} \right)$    (6)

With this formulation, the range of sampling shrinks in more or less the same way for varying numbers of mini-batch updates. We always do uniform sampling in the first update, and in the last update we always have $c_K = \max(N \cdot \eta^{1000}, c_{\min})$, regardless of $K$.

When $K$ is small, $c_k$ can also become small for some of the mini-batches. To prevent getting a mini-batch with too many repeated data points, we set the minimum value $c_{\min}$ to 5,000. We did not find this value to be too important and did not find the need to tune it. It also has no effect for larger values of $\eta$, since the sampling range then cannot drop below roughly 6,000.

In the adaptive scheme with a buffer of size 1M, the recent performance improvement $I_{\text{recent}}$ is computed as the difference between the current episode return and the episode return 500,000 timesteps earlier. Before we reach 500,000 timesteps, we simply use the initial value $\eta_0$. The exact way of computing the performance improvement does not have a significant effect on performance, as long as it is reasonable.

F.2 Programming and computational complexity

In this section we analyze the additional programming and computational complexity introduced by ERE and PER.

In terms of programming complexity, ERE is a clear winner since it only requires a small adjustment to how we sample mini-batches. It does not modify how the buffer stores the data, and does not require a special data structure to make it work efficiently. Thus the implementation difficulty is minimal. PER (proportional variant) requires a sum-tree data structure to make it run efficiently. The implementation is not too complicated, but compared to ERE it is a lot more work.

In terms of computational complexity (not sample efficiency) and wall-clock time, ERE's extra computation is negligible. In practice we observe no difference in computation time between SOP and SOP+ERE. PER needs to constantly update the priorities of its data points and compute sampling probabilities for all the data points. The complexity of sampling and priority updates for the proportional variant is $O(\log N)$, and the rank-based variant is similar (Schaul et al., 2015). Although this is not too bad, it does impose a significant overhead on SOP: SOP+PER runs about twice as long as SOP. Also note that this overhead grows linearly with the size of the mini-batch. The overhead for the Mujoco environments is higher compared to Atari, possibly because the Mujoco environments have a smaller state-space dimension while a larger batch size is used, making PER take up a larger portion of the computation cost.