Soft Actor-Critic (SAC) haarnoja2018soft ; haarnoja2018softapps is an off-policy actor-critic deep reinforcement learning (DRL) algorithm based on maximum entropy reinforcement learning. By combining off-policy updates with an actor-critic formulation, SAC achieves state-of-the-art performance on a range of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, SAC has been shown to be relatively robust, achieving similar performance across different initial random seeds.
SAC is an off-policy method which uses a buffer to store past experience for experience replay lin1992experiencereplay . SAC samples data uniformly from the buffer when performing parameter updates. A uniform sampling scheme implicitly assumes that all data in the replay buffer are of equal importance. However, intuitively it is more important to build relatively accurate function approximators in the regions of the state and action spaces in which the current policy is likely to operate. At the same time, the function approximators should also remain reasonably accurate in the regions the policy visits with lower probability.
To address this problem, we propose Emphasizing Recent Experience (ERE), a simple but powerful off-policy sampling technique, which emphasizes recently observed data while not forgetting the past. When performing updates, the ERE algorithm samples more aggressively from recent experience, and also orders the updates to ensure that updates from old data do not overwrite updates from new data. We compare vanilla SAC and SAC+ERE, and show that ERE provides significant performance improvements over SAC in terms of sample efficiency for continuous-action Mujoco tasks. It provides this improvement without degrading the excellent robustness of SAC.
We also consider combining SAC with Prioritized Experience Replay (PER) schaul2015prioritized , a scheme originally proposed for deep Q-learning which prioritizes the data based on the temporal-difference (TD) error. We show that SAC+PER can marginally improve the sample efficiency of SAC, but much less so than SAC+ERE. We also compare the programming and computational complexity of ERE with PER, and show that ERE is easier to implement, with no special data structure required, and has fewer hyper-parameters, which are also easier to optimize. Finally, we propose an algorithm which integrates ERE and PER and show that it gives the best results for some environments.
2 Overview of Experience Replay and Related Work
Experience replay lin1992experiencereplay is a simple yet powerful method for enhancing the performance of an off-policy DRL algorithm. Experience replay stores past experience in a replay buffer and reuses this past data when making updates. Experience replay achieved great successes in Deep Q-Networks (DQN) mnih2013dqn ; mnih2015dqn . In DQN, a large buffer of size 1 million is used to store past experience, and the algorithm samples data uniformly from this large buffer for each mini-batch update.
Experience replay schemes alternate between two phases: a data collection phase and a parameter update phase. In the data collection phase, the current policy interacts with the environment to generate transitions, which are added to a replay buffer $\mathcal{D}$. Each data point is a tuple $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the current state, $a_t$ is the action taken, $r_t$ is the resulting reward, and $s_{t+1}$ is the subsequent state. The replay buffer is fixed to a finite size $N$ (e.g., one million data points) so that very old data is dropped from the buffer. During the parameter update phase, the parameters of the neural networks are updated with samples drawn from the replay buffer. Typically, this phase consists of several iterations, with each iteration drawing a mini-batch of data from $\mathcal{D}$ and updating the parameters using the mini-batch. At the end of these iterations, the new parameters provide a new policy.
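The two phases operate on a simple fixed-size buffer. The following is a minimal sketch of such a buffer with uniform sampling (illustrative Python, not the exact implementation used in our experiments):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, max_size=1_000_000):
        # deque with maxlen automatically drops the oldest transition
        # once the buffer is full
        self.data = deque(maxlen=max_size)

    def add(self, transition):
        self.data.append(transition)

    def sample_uniform(self, batch_size):
        # Uniform sampling implicitly treats all stored transitions
        # as equally important.
        return random.sample(list(self.data), batch_size)
```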
When doing mini-batch update with data from the replay buffer, a straightforward method is to simply sample uniformly from the buffer. Many other sampling methods have been proposed in the past, and one of the most well-known methods is prioritized experience replay (PER) schaul2015prioritized . PER uses the absolute TD-error of a data point as the measure for priority, and data points with higher priority will have a higher chance of being sampled. This method has been tested on DQN mnih2015dqn and double DQN (DDQN) van2016ddqn , and results show significant improvement over using uniform sampling. PER has been combined with the dueling network architecture in wang2015dueling , with an ensemble of recurrent DQN in schulze2018vizdoom , and PER is one of six crucial components in Rainbow hessel2018rainbow , which achieves state-of-the-art on the Atari game environments. PER has also been successfully applied to other algorithms such as DDPG hou2017ddpgper and can be implemented in a distributed manner horgan2018distributed .
There are other methods proposed to make better use of the replay buffer. In Sample Efficient Actor-Critic with Experience Replay (ACER), the algorithm has an on-policy part and an off-policy part, with a hyper-parameter controlling the ratio of off-policy updates to on-policy updates wang2016acer . The RACER algorithm novati2018remember selectively removes data points from the buffer based on the degree of "off-policyness", measured by the importance sampling weight, bringing improvement to DDPG lillicrap2015ddpg , NAF gu2016naf and PPO schulman2017proximal . In de2015replaydatabase , replay buffers of different sizes were tested on DDPG, and the results show that a buffer large enough to maintain data diversity can lead to better performance. Finally, with Hindsight Experience Replay (HER) andrychowicz2017her , priority can be given to trajectories with lower density estimation zhao2019curiosity to tackle multi-goal, sparse-reward environments.
To our knowledge, this is the first paper that considers non-uniform data sampling techniques for SAC, and also the first paper to consider the ERE scheme for off-policy DRL algorithms.
3 Emphasizing Recent Experience
In this section we first give a brief review of the SAC algorithm. We then propose three SAC variants for enhancing experience replay. Pseudo-code for each variant can be found in the Appendix.
3.1 Soft Actor-Critic Algorithm
Soft Actor-Critic (SAC) haarnoja2018soft is a model-free off-policy deep reinforcement learning (DRL) algorithm based on maximum entropy reinforcement learning. By combining off-policy updates with an actor-critic formulation, SAC achieves state-of-the-art performance on a range of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods, including proximal policy optimization (PPO) schulman2017proximal , deep deterministic policy gradient (DDPG) lillicrap2015ddpg , soft Q-learning haarnoja2017sql , twin delayed deep deterministic policy gradient (TD3) fujimoto2018td3 , and trust region path consistency learning (Trust-PCL) nachum2017trustpcl . The experimental results show that SAC consistently outperforms the other RL algorithms on continuous-action benchmarks, both in terms of learning speed and robustness haarnoja2018soft .
Here we give a brief summary of Soft Actor-Critic (SAC); for more details please refer to the SAC paper haarnoja2018soft . SAC tries to maximize the expected sum of rewards and the entropy of a policy $\pi$:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
Here $\rho_\pi$ denotes the state-action marginals of the trajectory distribution induced by $\pi$. The hyper-parameter $\alpha$ balances exploitation and exploration, and affects the stochasticity of the optimal policy haarnoja2018soft .
SAC consists of five networks: a policy network $\pi_\phi$ that takes in the state and outputs the mean and standard deviation of an action distribution; two Q-networks, $Q_{\theta_1}$ and $Q_{\theta_2}$, that estimate the value of state-action pairs; a value network $V_\psi$ that estimates the value of a state; and a target value network $V_{\bar{\psi}}$, which is simply an exponentially moving average of the value network $V_\psi$. Since SAC is an off-policy scheme employing experience replay, it alternates between a data collection phase using the current policy, and a parameter update phase, in which mini-batches of data are uniformly sampled from the replay buffer to perform updates of the parameters. In the original SAC implementation, only one sample (one interaction with the environment) is collected during the data collection phase, and one mini-batch update is performed during the update phase. In our implementation, we first collect data for an episode until it terminates, either because of a bad action or because it reaches 1000 timesteps; we then set the number of mini-batch updates to be the same as the length of the episode. Both implementations give almost the same sample-efficiency and robustness performance.
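The collect-an-episode-then-update alternation just described can be sketched as follows (`env`, `policy`, and `update` are illustrative stand-ins for the environment interface and the SAC mini-batch update, not names from our codebase):

```python
def run_episode_and_update(env, policy, buffer, update, max_len=1000):
    """Collect one episode, then perform one mini-batch update per
    collected transition, as in our SAC implementation."""
    state = env.reset()
    ep_len = 0
    done = False
    while not done and ep_len < max_len:
        action = policy(state)
        next_state, reward, done = env.step(action)
        buffer.add((state, action, reward, next_state))
        state = next_state
        ep_len += 1
    # number of mini-batch updates equals the episode length
    for _ in range(ep_len):
        update(buffer)
    return ep_len
```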
In SAC, the maximum entropy formulation is a critical component that enhances its exploration and robustness ziebart2008maximum ; haarnoja2017sql . In a recently updated version of SAC haarnoja2018softapps , the entropy term is learned and adapted for each environment. The new version performs better than the earlier version in many but not all environments. In this paper, we use the original and simpler SAC haarnoja2018soft for constructing new variants using non-uniform sampling.
3.2 Soft Actor-Critic with Emphasizing Recent Experience
In this section we propose SAC with Emphasizing Recent Experience (SAC+ERE), a simple yet powerful method for replaying experience. The core idea is that during the parameter update phase, the first mini-batch is sampled from all the data in the replay buffer, and for each subsequent mini-batch we gradually reduce the range of sampling so as to sample more aggressively from more recent data points. There are two key points to this scheme: we sample recent data with higher frequency, and we order the updates so that updates computed from older data do not overwrite updates computed from fresher data.
Specifically, assume that in the current update phase we are to make $K$ mini-batch updates. Let $N$ be the max size of the replay buffer. Then for the $k$th update, $k = 1, \ldots, K$, we sample uniformly from the most recent $c_k$ data points, where
$$c_k = N \cdot \eta^{k \cdot \frac{1000}{K}}$$
and $\eta \in (0, 1]$ is a hyperparameter that determines how much emphasis we put on recent data. When $\eta = 1$, this is equivalent to uniform sampling. In our experiments we found $\eta = 0.996$ to be a good value for all environments. When $\eta < 1$, $c_k$ decreases as we perform each update. We also set a minimum allowable value $c_{\min}$ for $c_k$, i.e., $c_k = \max(N \cdot \eta^{k \cdot 1000/K},\, c_{\min})$; in our experiments we use $c_{\min} = 5000$. This helps prevent sampling from a very small amount of recent data, which may cause overfitting. We use the exponent $k \cdot 1000/K$ here instead of just $k$ because the length of an episode can vary greatly depending on the environment, and it is beneficial for the range of sampling to change in more or less the same way during a set of updates, even when the number of updates varies. The constant 1000 here can also be set differently, but that will change the best $\eta$ values. With this formulation, we always do uniform sampling in the first update, and we always have $c_K = N \cdot \eta^{1000}$ in the last update.
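The sampling schedule can be sketched in a few lines of Python (an illustrative sketch assuming the defaults $\eta = 0.996$ and $c_{\min} = 5000$; function and variable names are ours, not from any released code):

```python
import random

def ere_sampling_range(k, K, N, eta=0.996, c_min=5000):
    """Size c_k of the most-recent sampling window for the k-th of K updates."""
    c_k = int(N * eta ** (k * 1000.0 / K))
    # clamp to the buffer size above and to c_min below
    return max(min(c_k, N), c_min)

def ere_sample(buffer, batch_size, k, K, eta=0.996, c_min=5000):
    """Uniformly sample a mini-batch from the c_k most recent transitions."""
    c_k = ere_sampling_range(k, K, len(buffer), eta, c_min)
    recent = buffer[len(buffer) - c_k:]
    return random.sample(recent, min(batch_size, len(recent)))
```

As `k` runs from 1 to `K`, the window shrinks from (almost) the whole buffer toward the most recent data, which realizes the ordered updates described above.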
The effect of such a sampling formulation is twofold. The first effect is that the first mini-batch is sampled uniformly from the whole buffer, the second mini-batch is sampled uniformly from the whole buffer excluding a few of the oldest data points, and as $k$ grows, more of the older data gets excluded. Clearly, the more recent a data point is, the more likely it is to be sampled. The second effect is that we do this in an ordered way: we first sample from all the data in the buffer, and gradually shrink the range of sampling to only sample from the most recent data. This scheme reduces the chance of overwriting parameter changes made with new data by parameter changes made with old data. We hypothesize that this process allows us to better approximate the value functions near recently-visited states, while still maintaining an acceptable approximation near states visited in the more distant past.
Different $\eta$ values are desirable depending on how fast the agent is learning and how fast past experience becomes obsolete. When the agent is learning fast, we want $\eta$ to be lower so that we put more emphasis on newer data. When the agent is learning slowly, we want $\eta$ to be higher so that sampling becomes closer to uniform and the agent can make use of more data points from the past. A simple solution is to anneal $\eta$ during training. Let $T$ be the total number of timesteps in training, and let $\eta_0$ and $\eta_T$ be the initial and final values. We can set $\eta_T = 1$ so that $\eta$ anneals to uniform sampling. The $\eta$ we use at timestep $t$ is then $\eta_t = \eta_0 + (\eta_T - \eta_0) \cdot t / T$.
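The linear annealing schedule is straightforward to compute (illustrative sketch; $\eta_0 = 0.996$ is used as a default here):

```python
def anneal_eta(t, T, eta_0=0.996, eta_T=1.0):
    """Linearly interpolate eta from eta_0 toward eta_T (uniform sampling)
    over T total training timesteps."""
    return eta_0 + (eta_T - eta_0) * t / T
```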
Figure 1 shows how the $\eta$ and $c_{\min}$ values affect the data sampling process. Figure 1(a) shows that within an update phase, the sampling range shrinks with each new mini-batch. In general, we found $0.994 \le \eta \le 0.999$ to be a good range for $\eta$. Figure 1(b) shows that the expected number of times a data point is sampled decreases from the most recent to the least recent data points. When $\eta = 0.996$, the most recent data point has a sampling expectation about 10,000 times higher than the oldest data in the buffer. Figure 1(c) shows that, for a fixed $\eta$, a larger $c_{\min}$ value increases the expected number of times an older data point is sampled. When $c_{\min}$ equals the buffer size, we again obtain uniform sampling.
3.3 Soft Actor-Critic with Prioritized Experience Replay
We also implement the proportional variant of Prioritized Experience Replay schaul2015prioritized in SAC. Since SAC has two Q-networks, we redefine the absolute TD error of a transition to be the average absolute TD error of the two Q-networks:
$$\bar{\delta} = \frac{1}{2} \sum_{i=1}^{2} \left| r(s_t, a_t) + \gamma V_{\bar{\psi}}(s_{t+1}) - Q_{\theta_i}(s_t, a_t) \right|$$
Within the sum, the first two terms are simply the target for the Q-network, and the third term is the current estimate of the Q-network. For the $j$th data point, the priority value is defined as $p_j = \bar{\delta}_j + \epsilon$, where $\epsilon$ is a small positive constant. The probability of sampling a data point is computed as:
$$P(j) = \frac{p_j^{\beta_1}}{\sum_k p_k^{\beta_1}}$$
where $\beta_1$ is a hyperparameter that controls how much the priority value affects the sampling probability; it is denoted by $\alpha$ in schaul2015prioritized , but to avoid confusion with the entropy coefficient $\alpha$ in SAC, we denote it as $\beta_1$. The importance sampling (IS) weight for a data point is computed as:
$$w_j = \left( \frac{1}{N \cdot P(j)} \right)^{\beta_2}$$
where $\beta_2$ is denoted as $\beta$ in schaul2015prioritized .
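The priority, probability, and IS-weight computations can be sketched as follows (illustrative Python; the `beta1 = 0.6` and `beta2 = 0.4` defaults are the values suggested in the PER paper, used here only as placeholder assumptions):

```python
def per_probabilities(td_errors, beta1=0.6, eps=1e-6):
    """Proportional-variant sampling probabilities: P(j) proportional to
    p_j^beta1, with priority p_j = |delta_j| + eps."""
    prios = [(abs(d) + eps) ** beta1 for d in td_errors]
    total = sum(prios)
    return [p / total for p in prios]

def importance_weights(probs, beta2=0.4):
    """IS weights w_j = (1 / (N * P(j)))^beta2, normalized by the max
    weight for stability, as in the PER paper."""
    n = len(probs)
    raw = [(1.0 / (n * p)) ** beta2 for p in probs]
    m = max(raw)
    return [w / m for w in raw]
```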
Based on the original SAC algorithm, we change the sampling method from uniform sampling to sampling with the probabilities $P(j)$, and for the Q updates we apply the IS weights $w_j$. This gives SAC with Prioritized Experience Replay (SAC+PER). We note that, compared with SAC+PER, ERE does not require a special data structure and has negligible extra cost, while PER uses a sum-tree structure with some additional computational cost. We also tried several variants of SAC+PER, but preliminary results did not show a clear improvement in performance, so we kept the algorithm simple.
3.4 Soft Actor-Critic with Emphasizing Recent Experience and Prioritized Experience Replay
We also propose a method that combines the above two methods (SAC+ERE+PER). This method does two things: first, during a set of mini-batch updates, the sampling range gradually shrinks as before; second, within this sampling range, instead of sampling uniformly, we do priority sampling, where the sampling probability is proportional to the priority (based on the absolute TD error) of a data point.
Assume we make $K$ mini-batch updates after some amount of data collection. Let $N$ be the max size of the replay buffer, and define $c_k$ as before. Let $D_{c_k}$ denote the set of the $c_k$ most recent data points in the replay buffer. Then, for the $k$th mini-batch, the probability of sampling data point $j \in D_{c_k}$ is computed as:
$$P(j) = \frac{p_j^{\beta_1}}{\sum_{i \in D_{c_k}} p_i^{\beta_1}}$$
where $p_j$ is the priority value of data point $j$ and $\beta_1$ is the priority exponent used in SAC+PER.
The priority value and importance sampling weight computation are the same as in SAC+PER.
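Combining the two schemes, a mini-batch for the $k$th update can be drawn as follows (illustrative sketch; the stored TD errors, names, and the `beta1` default are placeholder assumptions, not from any released code):

```python
import random

def ere_per_sample(buffer_len, td_errors, batch_size, c_k, beta1=0.6, eps=1e-6):
    """Priority-sample batch_size indices from the c_k most recent positions.

    td_errors[i] holds the stored TD error for buffer position i; positions
    outside the most recent c_k are excluded entirely, as in ERE.
    """
    start = max(0, buffer_len - c_k)
    idx = list(range(start, buffer_len))
    # priorities p_j = |delta_j| + eps raised to the priority exponent
    prios = [(abs(td_errors[i]) + eps) ** beta1 for i in idx]
    # random.choices normalizes the relative weights internally
    return random.choices(idx, weights=prios, k=batch_size)
```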
4 Mujoco experiments
We perform experiments on a set of Mujoco todorov2012mujoco environments implemented in OpenAI Gym brockman2016openai . We aim to show how different experience replay schemes affect the performance of SAC. We are mainly concerned with four variants of SAC: vanilla SAC, SAC+ERE, SAC+PER and SAC+ERE+PER. For SAC+ERE, we pay special attention to how it affects learning speed, especially in the early stage. We perform additional experiments to show that the update order is important for SAC+ERE, and to show how different hyperparameters affect the performance of the SAC variants.
For a fair comparison, for each variant we use the same SAC code base that we implemented in PyTorch, mainly based on the minimal SAC implementation in OpenAI Spinning Up OpenAIspinup . We use the same neural net architecture, activation function, optimizer, replay buffer size, learning rate and other hyper-parameters as reported in the SAC paper haarnoja2018soft for the SAC baseline as well as for our three proposed enhancements to SAC. Note that in the original SAC, all environments except Humanoid use the same reward scale. All other hyper-parameters are the same across environments. We run each set of experiments with ten random seeds. We run five evaluation episodes every 5000 data points. During evaluation episodes, we run the SAC policy deterministically instead of sampling from the action distribution. In the plots, a solid line indicates the mean across the 10 random seeds and the shaded area shows the min and max values. Each point on the plot is smoothed over 50 evaluation episodes to make the figures easier to read. Additional implementation details can be found in the appendix. We will also post all code and data files online after proper cleaning and documentation.
4.1 SAC with smaller buffer size
As a motivating example, we first provide a set of experiments on SAC where the only difference is the buffer size. Figure 2 shows how different buffer sizes affect the performance of the original SAC algorithm. We tested buffer sizes of 1M (baseline), 0.5M, 0.2M and 0.1M. The results show that a smaller buffer generally makes learning faster in the early stage, but can reduce the late-stage performance of the algorithm. For instance, in Ant-v2 and Walker2d-v2, a buffer size of 0.1M leads to the fastest learning in the first 0.75M data points, but then its performance plateaus and the variants with larger buffer sizes perform better.
We hypothesize that a potential problem with a small buffer is that, with only a small amount of data, the neural networks in SAC might forget how to perform the task well in states visited earlier. This is a problem similar to catastrophic forgetting french1999catastrophic ; mcclelland1995there ; mccloskey1989catastrophic ; ratcliff1990connectionist ; robins1995catastrophic , a term often used to refer to the situation where an agent has to learn two tasks A and B in sequential order, and the knowledge about task A quickly gets forgotten as the agent starts to train on task B. We believe this issue also arises in the case of an RL agent learning a single highly-complex task. When using a small buffer, the agent might learn to act well in states that are stored in the buffer, but forget the states that have been removed from the buffer.
4.2 SAC with Emphasizing Recent Experience
Figure 3 shows the performance of the variants of SAC on 6 different Mujoco environments. We first focus our analysis on the performance of SAC+ERE (green) compared with the SAC baseline (blue).
For SAC+ERE we chose $\eta = 0.996$ for all environments. This hyperparameter value was obtained through a preliminary hyperparameter search on Ant-v2. For all other hyperparameters, we use exactly those in the original SAC paper haarnoja2018soft . The results show that SAC+ERE consistently outperforms the SAC baseline in all environments and at all stages of training. For instance, in Ant-v2, SAC+ERE reaches an average performance of 4500 three times faster than SAC, and it reaches 5500 at one million samples, while vanilla SAC never reaches 5500 in the first three million samples. In Hopper-v2, SAC+ERE reaches 2500 1.5 times faster than SAC; in Walker2d-v2, it reaches 3000 1.5 times faster; and in HalfCheetah-v2, it reaches 10,000 1.5 times faster. Note that for SAC+ERE, we anneal $\eta$ linearly to 1, which gives uniform sampling in the end; we therefore expect its performance to approach that of SAC when trained sufficiently long.
We also found that SAC+ERE is relatively robust to the hyperparameter $\eta$. We found that any value in the range $0.994 \le \eta \le 0.999$ consistently improves performance on all Mujoco environments, especially in the early stages. Figure 4(a) shows how different $\eta$ values affect the performance of SAC+ERE on Ant-v2. A large $\eta$ value is close to uniform sampling, so learning becomes slower, while a smaller $\eta$ value can lead to very fast learning in the beginning.
Figure 4(b) shows that annealing $\eta$ can improve the robustness and long-term performance of SAC+ERE. Note that, compared with the results in Figure 4(a), which anneals $\eta$, not annealing $\eta$ makes early-stage learning even faster, but gives worse results in the long run. For instance, with the same $\eta$ value, SAC+ERE with annealing can reach an average score of 6000 near 3M samples, while without annealing it fluctuates around 5500.
Our results also show that the update order is indeed critical to improved performance. Figure 4(c) shows how different update orders affect the performance of SAC+ERE on Ant-v2. We can see that SAC+ERE significantly outperforms SAC at all stages of training. But if we reverse the update order, although the performance is still better than SAC, the average performance is greatly reduced at all stages of training compared to the correct order. This shows that both key components of ERE are important for boosting performance.
SAC is well-known to have excellent robustness properties haarnoja2018soft , that is, the sample efficiency performance is not highly dependent on the initial seeds. Table 1 compares the robustness of SAC with SAC+ERE (as well as with other algorithms soon to be discussed). At 1.5 million samples, we see that SAC+ERE has lower standard deviation than vanilla SAC for four of the six environments. Similar robustness metrics are considered in the appendix. We can conclude that ERE boosts the sample efficiency of SAC without compromising its robustness.
4.3 SAC with Emphasizing Recent Experience and Prioritized Experience Replay
We now analyze the performance of the other two SAC variants. The hyperparameter values for SAC+PER ($\beta_1$ and $\beta_2$) were obtained through a preliminary hyperparameter search on Ant-v2. We found that although a wide range of $\beta_1$ and $\beta_2$ values give a performance gain on Ant-v2, they did not work on all environments. A more detailed analysis of the hyperparameters for SAC+PER is given in the appendix.
From the results in Figure 3 we see that SAC+PER (red) significantly outperforms SAC (blue) on Ant-v2, which is the environment used for the hyperparameter search, and does better than SAC near the end of training on HalfCheetah-v2, but it performs similarly to or worse than SAC in the other environments. It seems that a good hyperparameter combination for SAC+PER can vary greatly across environments.
SAC+ERE+PER (purple) can further boost early-stage learning speed beyond the SAC+ERE boost, and sometimes boosts overall performance too. For instance, SAC+ERE+PER outperforms all other SAC variants on HalfCheetah-v2 at all stages of training, but performs similarly to SAC+ERE, or somewhere in between SAC+ERE and SAC+PER, in the other environments.
5 Conclusion
We proposed Emphasizing Recent Experience, a new experience replay method that is simple but powerful. We showed it can significantly boost the learning speed of SAC, and in some environments it can also achieve better results in the long run. ERE is a general method that in theory can be applied to any off-policy DRL algorithm with a replay buffer.
We compared SAC+ERE with the popular Prioritized Experience Replay method and showed that ERE is easier to implement and does not require special data structures. With ERE the additional computation cost is negligible, and there is only one important hyperparameter, which we found easy to tune since a good value found for one environment ($\eta = 0.996$) also works well in all environments. We also showed empirically that in Mujoco environments, SAC+ERE has stronger performance than SAC+PER. However, it is possible that a more sophisticated formulation of SAC+PER can give better results. We believe the two methods each have their unique strengths; for example, when the reward is sparse, we expect PER to do well, since PER by design is strong at tackling sparse-reward situations, while ERE focuses on emphasizing recent data. We then proposed SAC+ERE+PER, a combination of ERE and PER, and showed that it achieves even better performance in some environments. However, this variant loses the simplicity of SAC+ERE and has some extra computation cost due to the PER part.
For future work, we plan to also test ERE on other off-policy DRL algorithms such as DQN and on other benchmarks such as the Atari games to see if the significant performance gains observed on Mujoco generalize to other algorithms and environments.
References
-  Josh Achiam. Openai spinning up documentation. https://spinningup.openai.com/en/latest/index.html. Accessed: 2018-12-20.
-  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
-  Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. The importance of experience replay database composition in deep reinforcement learning. In Deep reinforcement learning workshop, NIPS, 2015.
-  Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
-  Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
-  Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
-  Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
-  Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org, 2017.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
-  Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
-  Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
-  Yuenan Hou, Lifeng Liu, Qing Wei, Xudong Xu, and Chunlin Chen. A novel ddpg method with prioritized experience replay. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 316–321. IEEE, 2017.
-  Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
-  Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
-  James L McClelland, Bruce L McNaughton, and Randall C O’reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
-  Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-pcl: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017.
-  Guido Novati and Petros Koumoutsakos. Remember and forget for experience replay. arXiv preprint arXiv:1807.05827, 2018.
-  Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
-  Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
-  Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  Christopher Schulze and Marcus Schulze. Vizdoom: Drqn with prioritized experience replay, double-q learning and snapshot ensembling. In Proceedings of SAI Intelligent Systems Conference, pages 1–17. Springer, 2018.
-  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
-  Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
-  Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
-  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
-  Rui Zhao and Volker Tresp. Curiosity-driven experience prioritization via density estimation. arXiv preprint arXiv:1902.08039, 2019.
-  Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Pseudocode
In this section we give the pseudocode for the three SAC variants we proposed. For minimal SAC pseudocode please check section 4.2 in haarnoja2018soft , and for PER please check section 3.3 in schaul2015prioritized . Our pseudocode is mainly based on the original SAC and PER pseudocode. We have modified some of the code structure to make it more similar to our actual implementation, and we give a large number of comments in the pseudocode to make sure each step is clear. Algorithm 1 shows the code for SAC+ERE, and Algorithm 2 shows the code for SAC+ERE+PER. To obtain SAC+PER one can simply replace lines 14 and 15 in Algorithm 2 with uniform sampling. The following is a very short summary of the computation of the loss functions and gradients for the networks in SAC; please refer to the SAC paper haarnoja2018soft for theory and details. Note that in the pseudocode we use $\lambda$ to denote the learning rate; we use the same learning rate for every network, and for the gradient update steps we do not expand the gradient equations, to keep things simple.
The loss for training the value network $V_\psi$ is:
$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t|s_t)\right]\big)^2\Big]$$
An unbiased estimator of the gradient of the above loss function is:
$$\hat{\nabla}_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t)\big(V_\psi(s_t) - Q_\theta(s_t, a_t) + \log \pi_\phi(a_t|s_t)\big)$$
The loss for training the soft Q network $Q_\theta$ is:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big], \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V_{\bar{\psi}}(s_{t+1})\big]$$
The gradient of the above loss function can be computed as:
$$\hat{\nabla}_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t)\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1})\big)$$
where the target value network $V_{\bar{\psi}}$ is an exponentially moving average of the value network $V_\psi$.
The loss for the policy network $\pi_\phi$ is the expected KL-divergence:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{\mathrm{KL}}\Big(\pi_\phi(\cdot|s_t)\,\Big\|\, \frac{\exp(Q_\theta(s_t, \cdot))}{Z_\theta(s_t)}\Big)\Big]$$
After applying the reparameterization trick $a_t = f_\phi(\epsilon_t; s_t)$, where $\epsilon_t$ is an input noise vector, the loss becomes:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\big[\log \pi_\phi(f_\phi(\epsilon_t; s_t)|s_t) - Q_\theta(s_t, f_\phi(\epsilon_t; s_t))\big]$$
And the gradient can be computed as:
$$\hat{\nabla}_\phi J_\pi(\phi) = \nabla_\phi \log \pi_\phi(a_t|s_t) + \big(\nabla_{a_t} \log \pi_\phi(a_t|s_t) - \nabla_{a_t} Q_\theta(s_t, a_t)\big)\nabla_\phi f_\phi(\epsilon_t; s_t)$$
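As a concrete illustration of the value and Q targets above, the following is a minimal Python sketch of the target computations (the function names are ours, purely for illustration, and the inputs are assumed to be scalars sampled from the buffer):

```python
# Minimal sketch of the soft target computations in SAC.
# q_value: Q(s_t, a_t) with a_t sampled from the current policy;
# logp: log pi(a_t | s_t); v_next: target value network output at s_{t+1}.

def soft_value_target(q_value, logp):
    # V target: Q(s_t, a_t) - log pi(a_t | s_t), estimated with one sampled action
    return q_value - logp

def soft_q_target(reward, gamma, v_next, done):
    # Q target: r + gamma * V_target(s_{t+1}) for non-terminal transitions
    return reward + gamma * (1.0 - done) * v_next
```

In a real implementation these targets are computed over a mini-batch with the target value network held fixed (no gradient flows through them).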
Appendix B Implementation details and hyperparameters
Here we give implementation details and list the hyperparameters we used to run the experiments.
b.1 Details in SAC implementation
We first give a list of details of our codebase to facilitate reproduction. We implemented SAC using PyTorch, and our code structure mainly follows the clear explanation in . We use the reparameterization trick to generate actions from the policy, and for the log probability computation of the actions we use the technique described in the enforcing action bounds section of the SAC paper. Since our policy network gives actions in the range $[-1, 1]$, we obtain from each environment an action limit value (the maximum magnitude an action can take), and when our network outputs an action in $[-1, 1]$, the action is multiplied by the action limit to give an action within the action range of the environment.
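The action-bound handling described above can be sketched as follows. This is an illustrative stand-alone sketch, not an excerpt of our codebase; the function names and the `action_limit` argument are hypothetical, and `u` denotes a pre-squash sample from the Gaussian policy:

```python
import math

def squash_and_scale(u, action_limit):
    # tanh squashes the Gaussian sample into [-1, 1];
    # multiplying by the action limit maps it into the environment's range.
    return action_limit * math.tanh(u)

def squash_logprob_correction(u):
    # Per-dimension correction from the "enforcing action bounds" technique:
    # log pi(a|s) = log mu(u|s) - log(1 - tanh(u)^2); the small epsilon
    # guards against log(0) for saturated actions.
    return -math.log(1.0 - math.tanh(u) ** 2 + 1e-6)
```

The correction term is summed over action dimensions and subtracted from the pre-squash Gaussian log-likelihood.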
One important difference from the original SAC codebase is that in the original SAC, a data collection step is immediately followed by a mini-batch update, while in our case we first collect an episode of data points (anywhere between 1 and 1000 data points), and then do a number of mini-batch updates equal to the number of data points collected in that episode. Compared to original SAC, over the course of training we collect the same number of data points and take the same total number of mini-batch updates; for example, in Ant-v2 this is 3M data points and 3M updates. It is unclear which update scheme is more beneficial: we found our SAC baseline to be slightly stronger than the original SAC in some environments and slightly weaker in others, with very small differences. However, enforcing an update order makes more sense when we do a set of updates while the agent is not interacting with the environment. It might be possible to formulate a novel ordered update scheme in the one-data-point, one-update case, but we leave this as future work.
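The collect-then-update loop described above can be sketched as follows. The buffer and episode objects here are simple stand-ins for illustration, not our actual agent; a real agent would take a gradient step on each sampled batch:

```python
import random

def run_episode_then_update(buffer, episode, batch_size=4):
    # Store the whole episode first, then perform one mini-batch
    # update per data point collected in that episode.
    buffer.extend(episode)
    batches = []
    for _ in range(len(episode)):
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        batches.append(batch)  # a real agent would do a gradient update here
    return batches
```

This preserves the overall data-to-update ratio of the original SAC while grouping the updates between episodes, which is what makes an explicit update ordering (as in ERE) meaningful.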
b.2 SAC hyperparameters
All hyperparameters related to the original SAC are the same as in the original SAC paper. For the temperature $\alpha$, we set it to 0.05 for Humanoid-v2 and 0.2 for all other OpenAI Mujoco environments, as given in the original SAC paper. The hyperparameters are listed in Table 2.
| Hyperparameter | Value |
| --- | --- |
| replay buffer size | $10^6$ |
| number of hidden layers (all networks) | 2 |
| number of hidden units per layer | 256 |
| number of samples per minibatch | 256 |
| target smoothing coefficient ($\tau$) | 0.005 |
| target update interval | 1 |
b.3 Hyperparameters of SAC+ERE
The hyperparameter choice for SAC+ERE was decided with a hyperparameter search on Ant-v2. We first reasoned that we should search values of $\eta$ close to 1, since smaller values of $\eta$ will likely put too much emphasis on the most recent data and hurt performance. We then found that values of $\eta$ in this range give improvements on Ant-v2, and they also seem to work well in the other Mujoco environments. We did not fine-tune the SAC-related hyperparameters for SAC+ERE, to showcase the performance gain that can be obtained by simply changing the replay scheme to ERE.
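A minimal sketch of ERE sampling is shown below, using the schedule $c_k = \max(N \cdot \eta^{k \cdot 1000/K}, c_{\min})$ for the $k$-th of $K$ updates after an episode. The particular defaults (`eta=0.996`, `c_min=5000`) are illustrative placeholders, not a recommendation from the search above:

```python
import random

def ere_range(N, k, K, eta=0.996, c_min=5000):
    # Number of most-recent buffer entries to sample from for the
    # k-th of K updates; decays toward c_min as k grows, and never
    # exceeds the current buffer size N.
    c_k = int(N * eta ** (k * 1000.0 / K))
    return max(min(c_k, N), min(c_min, N))

def ere_sample(buffer, k, K, batch_size, eta=0.996, c_min=5000):
    # Sample a mini-batch uniformly from the c_k most recent points.
    N = len(buffer)
    c_k = ere_range(N, k, K, eta, c_min)
    recent = buffer[N - c_k:]
    return random.sample(recent, min(batch_size, len(recent)))
```

Early updates ($k$ small) draw from nearly the whole buffer, while later updates concentrate on recent data; this, combined with the update ordering, prevents updates on old data from overwriting updates on new data.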
b.4 Hyperparameters of SAC+PER
Figure 5 shows how different hyperparameter settings affect the training of SAC+PER in the Ant-v2 environment. We mainly look at the $\alpha$ and $\beta$ hyperparameters. When we compared all the results together, one setting was best on average, so we use these values across all experiments. Note that some other hyperparameter settings give better performance on some seeds, but not better on average. Although these values work well on Ant-v2, they do not seem to work well for the other environments.
Figures 6 and 7 show additional hyperparameter searches on Hopper-v2 and Walker2d-v2. We found that it can be relatively difficult to find a good hyperparameter combination for SAC+PER. Extensive fine-tuning on each environment can indeed improve performance, but the hyperparameter search is much more difficult than for SAC+ERE.
We also tried reducing the learning rate to a fraction of its original value, since this was done in the PER paper , but our preliminary results show no significant improvement in performance. It is possible that a more extensive hyperparameter search for SAC+PER would yield better results.
b.5 Hyperparameters of SAC+ERE+PER
For the hybrid algorithm, we did not fine tune its hyperparameters, but used the same values from SAC+ERE and SAC+PER.
Appendix C Robustness of SAC versus SAC+ERE
Appendix D Computing infrastructure
We run our experiments mainly on CPU nodes of a high-performance computing cluster; the specification of a single CPU node is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz. Each job is run on a single CPU node until completion.
Appendix E Programming and computation complexity
In this section we give a more detailed analysis of the additional programming and computational complexity that our proposed experience replay schemes add to SAC.
In terms of programming complexity, SAC+ERE is the clear winner, since it only requires a small adjustment to how the buffer samples mini-batches. It does not modify how the buffer stores the data, and it requires no special data structure to work efficiently, so the implementation burden is minimal. PER (the proportional variant) requires a sum-tree data structure to run efficiently; the implementation is not too complicated, but compared to ERE it is considerably more work.
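To make the sum-tree requirement concrete, the following is a minimal illustrative sketch (not our implementation; capacity is assumed to be a power of two). Leaves hold transition priorities and internal nodes hold sums, so both priority updates and proportional sampling walk one root-to-leaf path in $O(\log N)$:

```python
class SumTree:
    # Binary tree stored in an array: node i has children 2i and 2i+1;
    # leaves occupy indices [capacity, 2*capacity).
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, idx, priority):
        # Set the priority of leaf idx and refresh ancestor sums.
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def total(self):
        return self.tree[1]  # sum of all priorities

    def sample(self, mass):
        # Find the leaf containing the cumulative priority `mass`,
        # where 0 <= mass <= total(); used for proportional sampling.
        i = 1
        while i < self.capacity:
            left = 2 * i
            if mass <= self.tree[left]:
                i = left
            else:
                mass -= self.tree[left]
                i = left + 1
        return i - self.capacity
```

Drawing `mass` uniformly from `[0, total()]` then yields each transition with probability proportional to its priority.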
In terms of computational complexity (not sample efficiency) and wall-clock time, ERE's extra computation is negligible: for each mini-batch update we only need to compute one sampling-range value, and annealing $\eta$ is also a constant-cost operation. In practice we observe no difference in computation time between SAC and SAC+ERE; on Ant-v2 with 3M data points, SAC takes 25–30 hours to run, and SAC+ERE takes about the same time. PER needs to constantly update the priorities of its data points and compute sampling probabilities for them. The complexity of sampling and priority updates is $O(\log N)$, and the rank-based variant is similar. Although this is not too bad, it imposes a significant overhead on SAC, and note that this overhead grows linearly with the mini-batch size. In our experiments, SAC+PER on Ant-v2 with 3M data points can take up to 40 hours to run. SAC+ERE+PER runs with the same computation cost as SAC+PER.