1 Introduction
Offpolicy deep Reinforcement Learning (RL) algorithms aim to improve sample efficiency by reusing past experience. Recently a number of new offpolicy Deep Reinforcement Learning algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, in particular, has been shown to be significantly more sample efficient than popular onpolicy methods for a wide range of Mujoco benchmarks.
The field of Deep Reinforcement Learning (DRL) has also recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In particular, Soft Actor Critic (SAC), which combines offpolicy learning with maximumentropy RL, not only has many attractive theoretical properties, but can also give superior performance on a widerange of Mujoco environments, including on the highdimensional environment Humanoid for which both DDPG and TD3 perform poorly (Haarnoja et al., 2018a, b; Langlois et al., 2019). The TD3 and SAC algorithms share many common features, including an actorcritic structure, offpolicy learning, and the use of double Qnetworks (Van Hasselt et al., 2016). The primary difference between the two approaches is that SAC employs maximum entropy reinforcement learning whereas TD3 does not.
In this paper, we first seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that when using the standard objective without entropy along with standard additive noise exploration, there is often insufficient exploration due to the bounded nature of the action spaces. Specifically, the outputs of the policy network are often way outside the bounds of the action space, so that they need to be squashed to fit within the action space. The squashing results in actions persistently taking on their maximal values, so that there is insufficient exploration. In contrast, the entropy term in the SAC objective forces the outputs to have sensible values, so that even with squashing, exploration is maintained. We conclude that the entropy term in the objective for Soft Actor Critic principally addresses the bounded nature of the action spaces in the Mujoco environments.
With this insight, we propose Streamlined Off Policy (SOP), a streamlined algorithm using the standard objective without the entropy term. SOP employs a simple normalization scheme to address the bounded nature of the action spaces, thereby allowing for satisfactory exploration throughout training. Our experimental results show that SOP matches the sampleefficiency and robustness performance of SAC, including on the more challenging Ant and Humanoid environments. This demonstrates a need to revisit the benefits of entropy maximization in DRL.
Keeping with the theme of simplicity with the goal of meeting Occam’s principle, we also propose a simple nonuniform sampling method for selecting transitions from the replay buffer during training. In vanilla SOP (as well as in DDPG, TD3, and SAC), samples from the replay buffer are chosen uniformly at random during training. Our method, called Emphasizing Recent Experience (ERE), samples more aggressively recent experience while not neglecting past experience. Unlike Priority Experience Replay (PER) (Schaul et al., 2015), a popular nonuniform sampling scheme for the Atari environments, ERE is only a few lines of code and does not rely on any sophisticated data structures. We show that SOP combined with ERE outperforms SAC and provides state of the art performance. For example, for Ant and Humanoid, it improves over SAC by with one million samples. Furthermore, we also investigate combining SOP with PER, and show SOP+ERE also outperforms the more complicated SOP+PER scheme.
The contributions of this paper are thus threefold. First, we uncover the primary contribution of the entropy term of maximum entropy RL algorithms when the environments have bounded action spaces. Second, we develop a new streamlined algorithm which does not employ entropy maximization but nevertheless matches the sampling efficiency and robustness performance of SAC for the Mujoco benchmarks. And third, we combine our streamlined algorithm with a simple nonuniform sampling scheme to achieve stateofthe art performance for the Mujoco benchmark. We provide anonymized code for reproducibility ^{1}^{1}1https://anonymous.4open.science/r/e484a8c7268a4a66a0011e7676540237/.
2 Preliminaries
We represent an environment as a Markov Decision Process (MDP) which is defined by the tuple
, where and are continuous multidimensional state and action spaces, is a bounded reward function, is a transition function, and is the discount factor. Let and respectively denote the state of the environment and the action chosen at time . Let denote the policy. We further denote for the dimension of the action space, and write for the th component of an action , that is, .The expected discounted return for policy beginning in state is given by:
(1) 
Standard MDP and reinforcement learning problem formulations seek to maximize over policies . For finite state and action spaces, under suitable conditions for continuous state and action spaces, the optimal policy is deterministic (Puterman, 2014; Bertsekas & Tsitsiklis, 1996). In reinforcement learning with unknown environment, exploration is required to learn a suitable policy.
In DRL with continuous action spaces, typically the policy is modeled by a parameterized policy network which takes as input a state and outputs a value , where represents the current parameters of the policy network (Schulman et al., 2015, 2017; Vuong et al., 2018; Lillicrap et al., 2015; Fujimoto et al., 2018). During training, the actual action taken when in state often takes the form where is a random
dimensional vector which is independently drawn at each time step and may, in some circumstances, also depend on
. During testing, is set to zero.2.1 Entropy Maximization RL
Maximum entropy reinforcement learning takes a different approach than (1) by optimizing policies to maximize both the expected return and the expected entropy of the policy (Ziebart et al., 2008; Ziebart, 2010; Todorov, 2008; Rawlik et al., 2013; Levine & Koltun, 2013; Levine et al., 2016; Nachum et al., 2017; Haarnoja et al., 2017, 2018a, 2018b).
In particular, with maximization entropy RL, the objective is to maximize
(2) 
where is the entropy of the policy when in state , and the temperature parameter determines the relative importance of the entropy term against the reward.
For entropy maximization DRL, when given state the policy network will typically output a dimensional vector in addition to the vector . The action selected when in state is then modeled as where .
Maximum entropy RL has been touted to have a number of conceptual and practical advantages for DRL (Haarnoja et al., 2018a, b)
. For example, it has been argued that the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. It has also been argued that the policy can capture multiple modes of nearoptimal behavior, that is, in problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. In this paper, we will highlight another advantage, namely, retaining sufficient exploration when facing bounded action spaces.
3 The Squashing Exploration Problem
3.1 Bounded Action Spaces
Continuous environments typically have bounded action spaces, that is, along each action dimension there is a minimum possible action value and a maximum possible action value . When selecting an action, the action needs to be selected within these bounds before the action can be taken. DRL algorithms often handle this by squashing the action so that it fits within the bounds. For example, if along any one dimension the value exceeds , the action is set (clipped) to . Alternatively, a smooth form of squashing can be employed. For example, suppose and for some positive number , then a smooth form of squashing could use in which is being applied to each component of the dimensional vector. DDPG (Hou et al., 2017) and TD3 (Fujimoto et al., 2018) use clipping, and SAC (Haarnoja et al., 2018a, b) uses smooth squashing with the function. For concreteness, henceforth we will assume that smooth squashing with the is employed.
We note that an environment may actually allow the agent to input actions that are outside the bounds. In this case, the environment will typically first clip the actions internally before passing them on to the “actual” environment (Fujita & Maeda, 2018).
We now make a simple but crucial observation: squashing actions so that they fit into a bounded action space can have a disastrous effect on additivenoise exploration strategies. To see this, let the output of the policy network be denoted by . Consider an action taken along one dimension , and suppose and is relatively small compared to . Then the action will be very close (essentially equal) to . If the condition persists over many consecutive states, then will remain close to 1 for all these states, and consequently there will be essentially no exploration along the th dimension. We will refer to this problem as the squashing exploration problem. We will argue that algorithms such as DDPG and TD3 based on the standard objective (1) with additive noise exploration can be greatly impaired by squashing exploration.
3.2 What does entropy maximization bring to SAC for the Mujuco environments?
SAC is a maximumentropy based offpolicy DRL algorithm which provides good performance across all of the Mujuco benchmark environments. To the best of our knowledge, it currently provides state of the art performance for the Mujoco benchmark. In this section, we argue that the principle contribution of the entropy term in the SAC objective is to resolve the squashing exploration problem, thereby maintaining sufficient exploration when facing bounded action spaces. To argue this, we consider two DRL algorithms: SAC with adaptive temperature (Haarnoja et al., 2018b), and SAC with entropy removed altogether (temperature set to zero) but everything else the same. We refer to them as SAC and as SAC without entropy. For SAC without entropy, for exploration we use additive zeromean Gaussian noise with fixed at . Both algorithms use squashing. We compare these two algorithms on two Mujoco environments: Humanoidv2 and Walkerv2.
Figure 1 shows the performance of the two algorithms with 10 seeds. We see that for Humanoid, SAC with entropy maximization performs much better than SAC without entropy maximization. However, for Walker, SAC without entropy performs nearly as well as SAC, implying maximum entropy RL is not as critical for this environment.
To understand why entropy maximization is important for one environment but less so for another, we examine the actions selected when training these two algorithms. Humanoid and Walker have action dimensions and , respectively. Here we show representative results for one dimension for both environments, and provide the results for all the dimensions in the Appendix. The top and bottom rows of Figure 2 shows results for Humanoid and Walker, respectively. The first column shows the values for an interval of 1,000 consecutive time steps, namely, for time steps 599,000 to 600,000. The second column shows the actual action values passed to the environment again for time steps 599,000 to 600,000. The third and fourth columns show a concatenation of 10 such intervals of 1000 time steps, with each interval coming from a larger interval of 100,000 time steps.
The first and third columns use a log scale on the yaxis.
The top and bottom rows of Figure 2 are strikingly different. For Humanoid using SAC (which uses entropy maximization), the values are small, mostly in the range [1.5,1.5], and fluctuate significantly. This allows the action values to also fluctuate significantly, providing exploration in the action space. On the other hand, for SAC without entropy the values are typically huge, most of which are well outside the interval [10,10]. This causes the actions to be persistently clustered at either or , leading to essentially no exploration along that dimension. As shown in the Appendix, this property (lack of exploration for SAC without entropy maximization) does not hold for just a few dimensions, but instead for all 17 dimensions.
For Walker, we see that for both algorithms, the values are sensible, mostly in the range [1,1] and therefore the actions chosen by both algorithms exhibit exploration.
In conclusion, the principle benefit of maximum entropy RL in SAC for the Mujuco environments is that it resolves the squashing exploration problem. For some environments (such as Walker), the outputs of the policy network take on sensible values, so that sufficient exploration is maintained and overall good performance is achieved without the need for entropy maximization. For other environments (such as Humanoid), entropy maximization is needed to reduce the magnitudes of the outputs so that exploration is maintained and overall good performance is achieved.


4 Streamlined OffPolicy (SOP) Algorithm
Given the observations in the previous section, a natural question is: is it possible to design a streamlined off policy algorithm that does not employ entropy maximization but offers performance comparable to SAC (which has entropy maximization)?
As we observed in the previous section, without entropy maximization, in some environments the policy network output values , can become persistently huge, which leads to insufficient exploration due to the squashing. A simple solution is to modify the outputs of the policy network by normalizing the output values when they collectively (across the action dimensions) become too large. To this end, for any dimensional vector let denote the norm of . Let be a constant (hyper parameter) close to 1. The normalization procedure is as follows. Let be the output of the original policy network. If , then we reset for all ; otherwise, we leave unchanged. With this normalization, we are assured that is never greater than . Henceforth we assume the policy network has been modified with the simple normalization scheme just described.
Our Streamlined Off Policy (SOP) algorithm is described in Algorithm 1. The algorithm is essentially DDPG plus the normalization described above, plus double Qlearning (Van Hasselt et al., 2016) and target policy smoothing (Fujimoto et al., 2018). Another way of looking at it is as TD3 plus the normalization described above, minus the delayed policy updates and the target policy parameters. SOP also uses squashing instead of clipping, since gives somewhat better performance in our experiments. The SOP algorithm is “streamlined” as it has no entropy terms, temperature adaptation, target policy parameters or delayed policy updates.
4.1 Experimental Results for SOP
Without performing a careful hyperparameter search, we found and works well for all environments. For the normalization for SOP, we use , that is, the norm.
Figure 3 compares SAC (with temperature adaptation (Haarnoja et al., 2018a, b)
) with SOP for five of the most challenging Mujuco environments. Using the same baseline code, we train with ten different random seeds for each of the two algorithms. Each algorithm performs five evaluation rollouts every 5000 environment steps. The solid curves correspond to the mean, and the shaded region to the standard deviation of the returns over the ten seeds.
Results show that SOP and SAC have essentially the same sampleefficiency performance and robustness across all environments. This result confirms that when using a simple output normalization in the policy network, the performance of SAC can be achieved without maximum entropy RL.
In the Appendix we provide an ablation study for SOP, which shows a major performance drop when removing either double Qlearning or normalization, whereas removing target policy smoothing (Fujimoto et al., 2018) results in only a small performance drop in some environments.
5 Nonuniform sampling
We now show how a small change in the sampling scheme for SOP can achieve state of the art performance for the Mujoco benchmark. We call this samping scheme Emphasizing Recent Experience (ERE). ERE has 3 core features: It is a general method applicable to any offpolicy algorithm; It requires no special data structure, is very simple to implement, and has nearzero computational overhead;
It only introduces one additional important hyperparameter.
The basic idea is: during the parameter update phase, the first minibatch is sampled from the entire buffer, then for each subsequent minibatch we gradually reduce our range of sampling to sample more aggressively from more recent data. Specifically, assume that in the current update phase we are to make minibatch updates. Let be the max size of the buffer. Then for the update, we sample uniformly from the most recent data points, where and
is a hyperparameter that determines how much emphasis we put on recent data. is uniform sampling. When , decreases as we perform each update. can made to adapt to the learning speed of the agent so that we do not have to tune it for each environment.
The effect of such a sampling formulation is twofold. The first is recent data have a higher chance of being sampled.
The second is that we do this in an ordered way: we first sample from all the data in the buffer, and gradually shrink the range of sampling to only sample from the most recent data. This scheme reduces the chance of overwriting parameter changes made by new data with parameter changes made by old data (French, 1999; McClelland et al., 1995; McCloskey & Cohen, 1989; Ratcliff, 1990; Robins, 1995). This process allows us to quickly obtain new information from recent data, and better approximate the value functions near recentlyvisited states, while still maintaining an acceptable approximation near states visited in the more distant past.
What is the effect of replacing uniform sampling with ERE? First note if we do uniform sampling on a fixed buffer, the expected number of times a data point is sampled is the same for all data points. Now consider a scenario where we have a buffer of size 1000 (FIFO queue), we collect one data point at a time, and we then perform one update with minibatch size of one. If we start with an empty buffer and sample uniformly, as data fills the buffer, each data point gets less and less chance of being sampled. Specifically, over a period of 1000 updates, the expected number of times the th data point is sampled is: .
Figure 3(f) shows the expected number of times a data point is sampled as a function of its position in the buffer. We see that older data points have a much higher expected number of times of being sampled compared to newer data points. This is undesirable because when the agent is improving and exploring new areas of the state space; the new data points may contain more interesting information than the old ones, which have already been updated many times.
When we apply the ERE scheme, we effectively skew the curve towards assigning higher expected number of samples for the newer data, allowing the newer data to be frequently sampled soon after being collected, which can accelerate the learning process. Further algorithmic detail and analysis on ERE can be found in the Appendix.
5.1 Experimental results for SOP+ERE
Figure 4 compares the performance of SOP, SOP+ERE and SAC. SOP+ERE learns faster than SAC and vanilla SOP in all Mujoco environments. SOP+ERE also greatly improves overall performance for the two most challenging environments, Ant and Humanoid. For SOP we found that fine tuning for each environment can give further improvement in sample efficiency, but for fairness of comparison, we use exactly the same hyperparameters for all environments. In table 1, we show the mean test episode return and std across 10 random seeds at 1M timesteps for all environments. The last column displays the percentage improvement of SOP+ERE over SAC, showing hat SOP+ERE achieves state of the art performance. In both Ant and Humanoid, SOP+ERE improves average performance by 24% over SAC at 1 million timesteps. As for the std, SOP+ERE gives lower values, and for Humanoid a higher value.
Environment  SAC Adaptive  SOP  SOP+ERE  Improvement 

Hopper  6.9%  
Walker  10.2%  
HalfCheetah  7.5%  
Ant  23.9%  
Humanoid  24.1% 
6 Related work
In recent years, there has been significant progress in improving the sample efficiency of DRL for continuous robotic locomotion tasks with offpolicy algorithms (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a, b). There is also a significant body of research on maximum entropy RL methods (Ziebart et al., 2008; Ziebart, 2010; Todorov, 2008; Rawlik et al., 2013; Levine & Koltun, 2013; Levine et al., 2016; Nachum et al., 2017; Haarnoja et al., 2017, 2018a, 2018b).
By taking clipping in the Mujoco environments explicitly into account, Fujita & Maeda (2018)
modified the policy gradient algorithm to reduce variance and provide superior performance among onpolicy algorithms.
Eisenach et al. (2018) extend the work of Fujita & Maeda (2018) for when an action may be direction. Hausknecht & Stone (2015) and Chou et al. (2017) also explores DRL in the context of bounded action spaces. Dalal et al. (2018) consider safe exploration in the context of constrained action spaces.Uniform sampling is the most common way to sample from a replay buffer. One of the most wellknown alternatives is prioritized experience replay (PER) (Schaul et al., 2015). PER uses the absolute TDerror of a data point as the measure for priority, and data points with higher priority will have a higher chance of being sampled. This method has been tested on DQN (Mnih et al., 2015) and double DQN (DDQN) (Van Hasselt et al., 2016) with significant improvement. PER has been combined with the dueling architecture (Wang et al., 2015), with an ensemble of recurrent DQN (Schulze & Schulze, 2018), and PER is one of six crucial components in Rainbow (Hessel et al., 2018), which achieves stateoftheart on the Atari game environments. PER has also been successfully applied to other algorithms such as DDPG (Hou et al., 2017) and can be implemented in a distributed manner (Horgan et al., 2018). There are other methods proposed to make better use of the replay buffer. In Sample Efficient ActorCritic with Experience Replay (ACER), the algorithm has an onpolicy part and an offpolicy part, with a hyperparameter controlling the ratio of offpolicy updates to onpolicy updates (Wang et al., 2016). The RACER algorithm (Novati & Koumoutsakos, 2018) selectively removes data points from the buffer, based on the degree of ”offpolicyness” which is measured by their importance sampling weight, bringing improvement to DDPG (Lillicrap et al., 2015), NAF (Gu et al., 2016) and PPO (Schulman et al., 2017). In De Bruin et al. (2015), replay buffers of different sizes were tested on DDPG, and result shows that a large enough buffer with enough data diversity can lead to better performance. Finally, with Hindsight Experience Replay (HER)(Andrychowicz et al., 2017)
, priority can be given to trajectories with lower density estimation
(Zhao & Tresp, 2019) to tackle multigoal, sparse reward environments.7 Conclusion
In this paper we first showed that the primary role of maximum entropy RL for the Mujoco benchmark is to maintain satisfactory exploration in the presence of bounded action spaces. We then developed a new streamlined algorithm which does not employ entropy maximization but nevertheless matches the sampling efficiency and robustness performance of SAC for the Mujoco benchmarks. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. Finally, we combined our streamlined algorithm with a simple nonuniform sampling scheme to achieve stateofthe art performance for the Mujoco benchmark.
References
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
 Bertsekas & Tsitsiklis (1996) Dimitri P Bertsekas and John N Tsitsiklis. Neurodynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.

Chou et al. (2017)
PoWei Chou, Daniel Maturana, and Sebastian Scherer.
Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution.
InProceedings of the 34th International Conference on Machine LearningVolume 70
, pp. 834–843. JMLR. org, 2017.  Dalal et al. (2018) Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
 De Bruin et al. (2015) Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. The importance of experience replay database composition in deep reinforcement learning. In Deep reinforcement learning workshop, NIPS, 2015.
 Eisenach et al. (2018) Carson Eisenach, Haichuan Yang, Ji Liu, and Han Liu. Marginal policy gradients: A unified family of estimators for bounded action spaces with applications. arXiv preprint arXiv:1806.05134, 2018.
 French (1999) Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 Fu et al. (2019) Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep qlearning algorithms. arXiv preprint arXiv:1902.10250, 2019.
 Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actorcritic methods. arXiv preprint arXiv:1802.09477, 2018.
 Fujita & Maeda (2018) Yasuhiro Fujita and Shinichi Maeda. Clipped action policy gradient. arXiv preprint arXiv:1802.07564, 2018.
 Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep qlearning with modelbased acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energybased policies. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1352–1361. JMLR. org, 2017.
 Haarnoja et al. (2018a) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018a.
 Haarnoja et al. (2018b) Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actorcritic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
 Hausknecht & Stone (2015) Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.

Hessel et al. (2018)
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski,
Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver.
Rainbow: Combining improvements in deep reinforcement learning.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, 2018.  Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel BarthMaron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
 Hou et al. (2017) Yuenan Hou, Lifeng Liu, Qing Wei, Xudong Xu, and Chunlin Chen. A novel ddpg method with prioritized experience replay. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 316–321. IEEE, 2017.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Langlois et al. (2019) Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking modelbased reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
 Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 McClelland et al. (1995) James L McClelland, Bruce L McNaughton, and Randall C O’reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
 McCloskey & Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
 Novati & Koumoutsakos (2018) Guido Novati and Petros Koumoutsakos. Remember and forget for experience replay. arXiv preprint arXiv:1807.05827, 2018.
 Puterman (2014) Martin L Puterman. Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
 Ratcliff (1990) Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
 Rawlik et al. (2013) Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In TwentyThird International Joint Conference on Artificial Intelligence, 2013.
 Robins (1995) Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
 Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Schulze & Schulze (2018) Christopher Schulze and Marcus Schulze. Vizdoom: Drqn with prioritized experience replay, doubleq learning and snapshot ensembling. In Proceedings of SAI Intelligent Systems Conference, pp. 1–17. Springer, 2018.
 Todorov (2008) Emanuel Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double qlearning. In AAAI, volume 2, pp. 5. Phoenix, AZ, 2016.
 Vuong et al. (2018) Quan Vuong, Yiming Zhang, and Keith W Ross. Supervised policy update for deep reinforcement learning. arXiv preprint arXiv:1805.11706, 2018.
 Wang et al. (2015) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
 Wang et al. (2016) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Zhao & Tresp (2019) Rui Zhao and Volker Tresp. Curiositydriven experience prioritization via density estimation. arXiv preprint arXiv:1902.08039, 2019.
 Ziebart (2010) Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, figshare, 2010.
 Ziebart et al. (2008) Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008.
Appendix A Ablation Study
In this ablation study we separately examine the importance of the normalization at the output of the policy network; the double Q networks; and randomization used in the line 8 of the SOP algorithm (that is, target policy smoothing (Fujimoto et al., 2018)).
Figure 5 shows the results for the five environments considered in this paper. In Figure 5, “no normalization” is SOP without the normalization of the outputs of the policy network; “single Q” is SOP with one Qnetwork instead of two; and “no smoothing” is SOP without the randomness in line 8 of the algorithm.
Figure 5 confirms that double Qnetworks are critical for obtaining good performance (Van Hasselt et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018a). Figure 5 also shows that output normalization is also critical. Without output normalization, performance fluctuates wildly, and average performance can decrease dramatically, particularly for Humanoid and HalfCheetah. Target policy smoothing improves performance by a relatively small amount.
Appendix B Additional analysis and results comparing SAC with and without entropy
To understand why entropy maximization is important for one environment but less so for another, we examine the actions selected when training SAC with and without entropy. Humanoid and Walker2d have action dimensions and , respectively. In addition to the representative results shown for one dimension for both environments in Section 3.2, the results for all the dimensions are provided here in Figures 6 and 9.
From Figure 6, we see that for Humanoid using SAC (which uses entropy maximization), the values are small and fluctuate significantly for all 17 dimensions. On the other hand, for SAC without entropy the values are typically huge, again for all 17 dimensions. This causes the actions to be persistently clustered at either or . As for Walker, the values are sensible for both algorithms for all 6 dimensions, as shown in figure 9.























Appendix C Hyperparameters
Table 2
shows hyperparameters used for SOP, SOP+ERE and SOP+PER. For adaptive SAC, we use our own PyTorch implementation for the comparisons. Our implementation uses the same hyperparameters as used in the original paper
(Haarnoja et al., 2018b). Our implementation of SOP variants and adaptive SAC share most of the code base, to make comparisons fair.Parameter  Value  

Shared  
optimizer  Adam (Kingma & Ba, 2014)  
learning rate  
discount ()  0.99  
target smoothing coefficient ()  0.005  
target update interval  1  
replay buffer size  
number of hidden layers for all networks  2  
number of hidden units per layer  256  
minibatch size  256  
nonlinearity  ReLU  
SAC adaptive  
entropy target  dim() (e.g., 6 for HalfCheetahv2)  
SOP  
gaussian noise std  0.3  
normalization constant  
ERE  
ERE initial  
PER  
PER ( in PER paper)  
PER ( in PER paper) 
Appendix D ERE Pseudocode
Appendix E Additional ERE analysis and results
Figure 10 shows, for fixed , how affects the data sampling process, under the ERE sampling scheme. Recent data points have a much higher probability of being sampled compared to older data, and a smaller value gives more emphasis to recent data.
Different values are desirable depending on how fast the agent is learning and how fast the past experiences become obsolete. So to make ERE work well in different environments with different reward scales and learning progress, we adapt to the the speed of learning. To this end, define performance to be the training episode return. Define to be how much performance improved from timesteps ago, and to be the maximum improvement throughout training, where is the buffer size. Let the hyperparameter be the initial value. We then adapt according to the formula: .
Under such an adaptive scheme, when the agent learns quickly, the value is low in order to learn quickly from new data. When progress is slow, is higher to make use of the stabilizing effect of uniform sampling from the whole buffer.
e.1 SOP with Prioritized Experience Replay
We also implement the proportional variant of Prioritized Experience Replay (Schaul et al., 2015) with SOP.
Since SOP has two Qnetworks, we redefine the absolute TD error of a transition to be the average absolute TD error in the Q network update:
(3) 
Within the sum, the first term , is simply the target for the Q network, and the term is the current estimate of the Q network. For the data point, the definition of the priority value is . The probability of sampling a data point is computed as:
(4) 
where is a hyperparameter that controls how much the priority value affects the sampling probability, which is denoted by in Schaul et al. (2015), but to avoid confusion with the in SAC, we denote it as . The importance sampling (IS) weight for a data point is computed as:
(5) 
where is denoted as in Schaul et al. (2015).
Based on the SOP algorithm, we change the sampling method from uniform sampling to sampling using the probabilities , and for the Q updates we apply the IS weight . This gives SOP with Prioritized Experience Replay (SOP+PER). We note that as compared with SOP+PER, ERE does not require a special data structure and has negligible extra cost, while PER uses a sumtree structure with some additional computational cost. We also tried several variants of SOP+PER, but preliminary results show that it is unclear whether there is improvement in performance, so we kept the algorithm simple.
e.2 PER experiment results
Figure 11 shows a performance comparison of adaptive SOP, SOP, SOP+ERE and SOP+PER. From these experiments, SOP+PER does not give a significant performance boost to SOP (if any boost at all). We also found that it is difficult to find hyperparameter settings for SOP+PER that work well for all environments. Some of the other hyperparameter settings actually reduce performance. It is unclear why PER does not work so well for SOP. A similar result has been found in another recent paper (Fu et al., 2019), showing that PER can significantly reduce performance on TD3. Further research is needed to understand how PER can be successfully adapted to environments with continuous action spaces and dense reward structure.
Appendix F Additional implementation details
f.1 ERE implementation
In this section we discuss some programming details. These details are not necessary for understanding the algorithm, but they might help with reproducibility.
In the ERE scheme, the sampling range always starts with the entire buffer (1M data) and then gradually shrinks. This is true even when the buffer is not full. So even if there are not many data points in the buffer, we compute based as if there are 1M data points in the buffer. One can also modify the design slightly to obtain a variant that uses the current amount of data points to compute . In addition to the reported scheme, we also tried shrinking the sampling range linearly, but it gives less performance gain.
In our implementation we set the number of updates after an episode to be the same as the number of timesteps in that episode. Since environments do not always end at 1000 timesteps, we can give a more general formula for . Let be the number of minibatch updates, let be the max size of the replay buffer, then:
(6) 
With this formulation, the range of sampling shrinks in more or less the same way with varying number of minibatch updates. We always do uniform sampling in the first update, and we always have in the last update.
When is small, can also become small for some of the minibatches. To prevent getting a minibatch with too many repeating data points, we set the minimum value for to 5000. We did not find this value to be too important and did not find the need to tune it. It also does not have any effect for any since the sampling range cannot be lower than 6000.
In the adaptive scheme with buffer of size 1M, the recent performance improvement is computed as the difference of the current episode return compared to the episode return 500,000 timesteps earlier. Before we reach 500,000 timesteps, we simply use . The exact way of computing performance improvement does not have a significant effect on performance as long as it is reasonable.
f.2 Programming and computation complexity
In this section we give analysis on the additional programming and computation complexity brought by ERE and PER.
In terms of programming complexity, ERE is a clear winner since it only requires a small adjustment to how we sample minibatches. It does not modify how the buffer stores the data, and does not require a special data structure to make it work efficiently. Thus the implementation difficulty is minimal. PER (proportional variant) requires a sumtree data structure to make it run efficiently. The implementation is not too complicated, but compared to ERE it is a lot more work.
In terms of computation complexity (not sample efficiency), and wallclock time, ERE’s extra computation is negligible. In practice we observe no difference in computation time between SOP and SOP+ERE. PER needs to update the priority of its data points constantly and compute sampling probabilities for all the data points. The complexity for sampling and updates is , and the rankbased variant is similar (Schaul et al., 2015). Although this is not too bad, it does impose a significant overhead on SOP: SOP+PER runs twice as long as SOP. Also note that this overhead grows linearly with the size of the minibatch. The overhead for the Mujoco environments is higher compared to Atari, possibly because the Mujoco environments have a smaller state space dimension while a larger batch size is used, making PER take up a larger portion of computation cost.
Comments
There are no comments yet.