Sparse reward settings are useful because the reward function engineering is easy. A simple reward of 1 for achieving the required goal and 0 otherwise is one example. Recent work on Hindsight Experience Replay HER has shown that learning from trajectories in which the agent does not succeed in achieving the goal can improve performance greatly. This alleviates the sparse reward problem by ensuring there are transitions with non-zero rewards in every rollout. A replay buffer is used to improve the sample efficiency of the procedure mnih2015human .
Prioritized Experience Replay PER , a variant of Experience Replay, samples transitions based on priority values assigned to them, unlike the uniform sampling followed in vanilla Experience Replay. Though in principle the priority values could be computed with any formulation, it is common to use the TD error as a proxy. This follows from the intuition that a large TD error indicates a shortcoming of the agent in learning about that part of the environment.
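As a minimal sketch of this proxy (the function names are illustrative, not from any released implementation), the one-step TD error is computed from a transition's reward and the critic's value estimates, and its magnitude serves as the priority; a small epsilon keeps zero-error transitions sampleable:

```python
def td_error(reward, gamma, q_next, q_current):
    # One-step temporal-difference error: r + gamma * Q(s', a') - Q(s, a).
    return reward + gamma * q_next - q_current

def priority(delta, eps=1e-6):
    # Priority is the magnitude of the TD error; eps ensures that
    # transitions with zero error can still be sampled occasionally.
    return abs(delta) + eps
```

A transition with reward 1.0, discount 0.9, next-state value 2.0, and current value 2.5 yields a TD error of 0.3 and hence a higher priority than a transition the critic already predicts well.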
In this paper, we propose Hindsight Prioritized Experience Replay, a variant which aims to leverage the best of both worlds. It can be used in settings with multiple goals, which are common in robotic tasks.
2 Background and Notation
This section introduces the concepts motivating Hindsight Learning and Prioritized Experience Replay.
2.1 Hindsight Experience Replay
Humans seem to learn from both successes and failures. In sparse reward settings, the agent does not gather much information from trajectories in which it failed to achieve the goal, because the total return (the sum of rewards) over all the steps is 0. To understand why the agent could benefit from learning from its mistakes, consider the example of an agent being trained to play soccer.
Though the agent is not able to achieve its goal, it can take away important information about how it needs to kick the ball, had it instead been told to kick to the center of the goal post.
This work is based on the idea of Universal Value Function Approximators schaul2015universal , where the reward function is conditioned on the goal, $r(s, a, g)$. A goal is chosen at the start of the episode and remains the same throughout. It has been shown that this approach can generalize to previously unseen state-goal-action pairs.
The strategy for learning from alternate goals is called future. We refer a reader seeking a detailed explanation to HER , but the key idea is to take a state that occurs later in the episode than a given state and use it as an alternate goal for that state. Consider the following episode, in which the agent failed to achieve the required goal.
If a future state is considered as an alternate goal, a reward of 1 is used to back up the values of all the states before it. Using this intuitive strategy, the pseudocode of the algorithm is as follows.
A key thing to note in the algorithm is that after the agent experiences an episode, it simply stores it in the buffer; the sampling of future states in the future strategy happens at learning time. This ensures that, in expectation, every future state of a given state has an equal probability of being picked. We refer to this as uniform sampling, and it plays a key role in our experiments.
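A minimal sketch of this uniform future-goal sampling at learning time (the helper names and the list-of-states episode representation are illustrative assumptions):

```python
import random

def sample_future_goal(episode_states, t, rng=random):
    """Uniformly pick one of the states strictly after index t as an
    alternate goal; every future state is equally likely, matching the
    uniform sampling described in the text."""
    assert t < len(episode_states) - 1, "no future states to sample from"
    future_index = rng.randrange(t + 1, len(episode_states))
    return episode_states[future_index]

def hindsight_reward(achieved_state, goal):
    # Sparse goal-conditioned reward: 1 on achieving the goal, 0 otherwise.
    return 1.0 if achieved_state == goal else 0.0
```

Because the goal is drawn only when a minibatch is formed, no extra copies of transitions need to be stored for the future strategy.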
2.2 Prioritized Experience Replay
A work on prioritizing transitions PER showed that prioritizing transitions by their TD error serves as a good proxy for the amount of learning that can happen with them. An ideal implementation would sort all the transitions by their TD error after each rollout, but this is infeasible in practice. Two strategies are considered as approximations, proportional and rank-based. Interested readers are referred to PER for more details, but it was observed that rank-based performed better.
This strategy uses a priority queue for storing the transitions, with the TD error as the key value. This reduces the insertion time to $O(\log N)$, where $N$ is the size of the priority queue.
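A minimal sketch of such a store using Python's binary heap (the function names are illustrative): each insertion costs $O(\log N)$, and the root is always the highest-priority transition.

```python
import heapq

def make_queue():
    return []  # heapq operates on a plain list

def insert(queue, td_error, transition):
    # Key on negative |TD error| so the min-heap root is the largest error.
    # The counter (insertion order) breaks ties without comparing payloads.
    heapq.heappush(queue, (-abs(td_error), len(queue), transition))

def peek_highest(queue):
    # The highest-priority transition sits at the heap root.
    return queue[0][2]
```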
2.3 Hindsight Policy Gradients
Hindsight Experience Replay relies on an off-policy reinforcement learning algorithm, as it uses a replay buffer to de-correlate the training samples. Another way to de-correlate training samples is to use asynchronous methods, as in A3C . The best performing algorithm based on asynchronous methods, the Asynchronous Advantage Actor Critic (A3C) algorithm A3C , learns both a value function and a policy. To learn a policy, one must compute the policy gradients, i.e., the gradients of the expected return with respect to the policy parameters. In a recent work HPG , policy gradients were derived to incorporate hindsight learning for goal-directed tasks. The expression for the policy gradient as given in HPG is

$$\nabla_\theta \eta(\theta) = \mathbb{E}\left[\sum_{g'} \sum_{t=1}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t, g') \sum_{t'=t+1}^{T} \left(\prod_{t''=1}^{t'-1} \frac{\pi_\theta(a_{t''} \mid s_{t''}, g')}{\pi_\theta(a_{t''} \mid s_{t''}, g)}\right) r(s_{t'}, g')\right]$$
where $g$ is the goal that is being pursued, $g'$ are the goals that were achieved in the process, and $\pi_\theta$ is the policy parametrized by $\theta$. It has been shown in HPG that this estimator can be used to train agents on goal-directed tasks, but the tasks were limited to bit flipping and an empty grid. This approach can be directly extended to continuous state and action spaces. In addition, the policy gradient estimate can be improved by using baseline corrections, which paves the way for A3C. The policy gradient in the case of A3C would be

$$\nabla_\theta \eta(\theta) = \mathbb{E}\left[\sum_{g'} \sum_{t=1}^{T-1} \left(\prod_{t''=1}^{t} \frac{\pi_\theta(a_{t''} \mid s_{t''}, g')}{\pi_\theta(a_{t''} \mid s_{t''}, g)}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t, g')\, A(s_t, a_t, g')\right]$$
3 Hindsight Prioritized Experience Replay
The intuition for this algorithm is to have the best of both worlds. The vanilla replay buffer is replaced with a priority queue to implement the rank-based strategy. There is one major difference though: using a priority queue means that we necessarily have to calculate the TD error before storing. The goals are therefore required to calculate the reward using the goal-conditioned function $r(s, a, g)$ discussed in Section 2.1. This forces us to sample the alternate goals before storing.
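A sketch of this storage-time relabelling, assuming an episode is a list of (state, action, next_state) tuples and a critic `q_fn(state, goal)` is available (both are illustrative, not the paper's interfaces): alternate goals are drawn, rewards recomputed, and TD errors attached before anything reaches the queue.

```python
import random

def relabel_and_store(buffer, episode, replay_k, q_fn, gamma=0.98):
    """For each transition, sample replay_k alternate goals from future
    states, recompute the sparse reward, and store (|TD error|, transition)
    so the priority queue has its key at insertion time."""
    # Full state sequence: each transition's state plus the final next_state.
    states = [s for s, _, _ in episode] + [episode[-1][2]]
    for t, (s, a, s_next) in enumerate(episode):
        future = states[t + 1:]
        for goal in random.choices(future, k=replay_k):
            r = 1.0 if s_next == goal else 0.0          # goal-conditioned reward
            delta = r + gamma * q_fn(s_next, goal) - q_fn(s, goal)
            buffer.append((abs(delta), (s, a, s_next, goal, r)))
```

With a zero-initialized critic, the priority of each relabelled transition reduces to its sparse reward, so transitions that achieve their alternate goal enter the queue with the highest priority.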
3.1 Controlling the ratio of Actual goals and Alternate goals
In the original work HER , the parameter replay_k ensures that exactly $\frac{1}{1 + \texttt{replay\_k}}$ of the sampled transitions carry actual goals and $\frac{\texttt{replay\_k}}{1 + \texttt{replay\_k}}$ carry alternate goals. In the modified method, since transitions are stored by their TD error and then sampled, there is no restriction on the ratio of actual goals and alternate goals in the sample. A method called two_queues is developed to enforce it. The idea is very simple and uses two priority queues instead of one.
-  Instantiate two priority queues.
-  Observe a trajectory.
-  For each transition, choose replay_k alternate goals.
-  Push the transitions with actual goals to the first priority queue.
-  Push the transitions with alternate goals to the second priority queue.
-  At sample time, keep the ratio of sampling from the two priority queues at 1 : replay_k.
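The steps above can be sketched as follows (names are illustrative, and uniform draws stand in for prioritized sampling within each queue for brevity):

```python
import random

def split_batch(batch_size, replay_k):
    """How many samples to draw from each queue so the batch keeps the
    actual : alternate ratio at roughly 1 : replay_k."""
    n_actual = max(1, round(batch_size / (1 + replay_k)))
    return n_actual, batch_size - n_actual

def sample_batch(actual_queue, alternate_queue, batch_size, replay_k, rng=random):
    n_actual, n_alt = split_batch(batch_size, replay_k)
    # Stand-in for per-queue prioritized sampling: uniform draws for brevity.
    return rng.sample(actual_queue, n_actual) + rng.sample(alternate_queue, n_alt)
```

With batch_size 64 and replay_k 3, a batch contains 16 actual-goal and 48 alternate-goal transitions, regardless of how the TD errors are distributed across the two queues.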
3.2 Caveats of uniform sampling
One caveat, both in the method developed so far and in HER , is the uniform sampling of goals discussed in Section 2.1. Alternate goals are sampled uniformly from future states: each later state in the trajectory has an equal chance of being picked. Consider a trajectory $(s_1, s_2, \ldots, s_T)$. If all future states have an equal probability of being picked, the probability of picking the state-goal pair $(s_{T-1}, s_T)$ is $T-1$ times that of picking $(s_1, s_T)$, because there are $T-1$ different options for $s_1$ whereas there is only one for $s_{T-1}$. This effect is observed in the tail of all the trajectories.
This problem has worse effects on our method because it stores replay_k alternate goals for each transition. There can thus be multiple copies of the same transition in the buffer, and updating the priority of one of them does not change the others. To alleviate this, we propose a method called non-uniform sampling: instead of always choosing replay_k goals per transition, for a transition observed at time $t$ in an episode of length $T$, the number of goals is capped by the number of distinct future states, $T - t$, so that no duplicate transition is stored.
As shown in the experiments later, this method gave the best performance.
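One concrete way to realize this cap (an illustrative sketch; the exact schedule in the implementation may differ) is to sample goals without replacement from the remaining future states, which guarantees no duplicate (transition, goal) pair:

```python
import random

def alternate_goals(future_states, replay_k, rng=random):
    """Sample at most replay_k alternate goals without replacement.
    Near the end of an episode, fewer future states exist, so fewer
    goals are drawn and no duplicates are ever stored."""
    k = min(replay_k, len(future_states))
    return rng.sample(future_states, k)
```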
3.3 Annealing replay_k
With the ratio fixed at 1 : replay_k, the agent is stuck with it throughout the training procedure. A very high value of replay_k means that the agent cares only about alternate goals, which is not desirable. However, a low value of replay_k means that hindsight learning is not being leveraged. As with other trade-offs, hyperparameter tuning over replay_k is required. But holding replay_k constant throughout learning may not be ideal. Consider the following graph of the sampled goal ratio over training, which has very high variance: the ratio seems to fall slowly on average. In the initial stages of learning, when the agent makes more mistakes, it might make sense for it to learn more from its mistakes; as it becomes more experienced, it might care only about the actual goals. There are two ways to give the agent this freedom.
3.4 Using a single priority queue
Of the two solutions discussed above, the second is implemented. Instead of maintaining two different queues, only one queue is maintained. But to ensure that actual goals are still being used to learn, replay_k is chosen accordingly: in the case when the TD error of all the transitions is the same, the ratio of actual goals to alternate goals is 1 : replay_k. This method is referred to as single_queue.
The following is the pseudocode for the algorithm. The main difference from the original HER algorithm is that goals are sampled at storage time.
4 Experiments
The following results illustrate the experiments that were performed. A more thorough analysis is deferred to the appendix.
There are visible oscillations in the plots, which decreased only marginally when the batch size was increased. The proposed algorithm reaches optimal performance on the FetchReach environment in the same amount of training time as HER , but it fails on more complex environments.
| Environment Name | Best success rate |
| --- | --- |
5 Future Work
The results of Prioritized Experience Replay were unsatisfactory. A major shortcoming of our approach is that the method is tied to calculating the TD error before storing, while the alternative is computationally intractable. Though prioritized experience replay used the TD error as a proxy, to the best of the authors' knowledge there is no method that uses heuristics or an algorithm to suggest which transition may be more important. Example heuristics could be "if a state-goal pair has a high TD error, other goals with the same state have a high error as well", "some goals are hard to achieve from a large number of states", and so on. An ideal case would be to realize a model which can suggest what transitions to train on plappert2018multi . This has a flavor of model-based learning and prioritized sweeping, and will be the focus of the authors in the future. The current interest in the community in making prioritized methods more efficient horgan2018distributed leaves a lot of promise.
The initial results of hindsight policy gradients in the bit flipping and empty grid environments show good promise. This leaves one to extend hindsight policy gradients beyond vanilla policy gradient methods to more advanced methods like A2C/A3C and TRPO/PPO.
With efficient and faster ways of generating experience comes more data, but uniform sampling of this data may not be the best way to harness it. Reinforcement learning setups have an inherent way of assigning weights to states based on their importance (state visitation frequencies on successful trajectories). If the TD error calculation becomes a bottleneck, most of the computational resources should be spent on important transitions. Prioritized Experience Replay offers one such method, but as analyzed in this paper, it does not work very well here. Unless there is a way to compute the TD error after sampling and yet have a priority assigned to a transition at storage time, the authors do not see much hope in the traditional methods.
Online versions of Hindsight Experience Replay can benefit from the many optimization techniques that have been developed for online learning. But at the moment, the variance reduction techniques used for hindsight policy gradients are not sufficient to get complex environments working, and performance even in simple environments like bit flipping is only satisfactory. There is a need for a more principled approach to multi-goal RL.
-  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
-  Christian Darken and John E Moody. Note on learning rate schedules for stochastic optimization. In Advances in neural information processing systems, pages 832–838, 1991.
-  Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
-  Yuenan Hou, Lifeng Liu, Qing Wei, Xudong Xu, and Chunlin Chen. A novel ddpg method with prioritized experience replay. In Systems, Man, and Cybernetics (SMC), 2017 IEEE International Conference on, pages 316–321. IEEE, 2017.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
-  Paulo Rauber, Filipe Mutz, and Juergen Schmidhuber. Hindsight policy gradients. arXiv preprint arXiv:1711.06006, 2017.
-  Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
-  Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
Appendix A Bias correction for PER sampling
Prioritized experience replay induces a bias, as it changes the distribution of the transitions used for updates; the loss minimization for learning the value function relies on the sampling being uniform. Hence, to make the updates unbiased, we must introduce an importance sampling ratio $\frac{1}{N \cdot P(i)}$, where $P(i)$ is the probability of picking transition $i$ and $N$ is the size of the buffer. In order to make the updates unbiased towards the end of training while still exploiting the benefits of prioritized sampling, we anneal the bias by using an importance sampling weight of the form $\left(\frac{1}{N \cdot P(i)}\right)^{\beta}$. The value of $\beta$ is annealed over time such that towards the end of training, $\beta = 1$. Following prior work, bias correction is used only for the value function updates and not for the policy updates. This showed better performance than using bias correction for both value function and policy updates.
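A minimal sketch of this correction (the linear schedule and the starting value $\beta_0 = 0.4$ are illustrative choices, as in the PER paper's experiments):

```python
def importance_weight(p_i, n, beta):
    # (1 / (N * P(i)))**beta: rarely-sampled transitions (small P(i))
    # receive larger weights to compensate for prioritized sampling.
    return (1.0 / (n * p_i)) ** beta

def annealed_beta(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 toward 1.0, so updates become
    fully unbiased by the end of training."""
    frac = min(1.0, step / total_steps)
    return beta0 + (1.0 - beta0) * frac
```

Under uniform sampling ($P(i) = 1/N$) every weight is exactly 1, so the correction vanishes and the usual update is recovered.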
Appendix B Evaluating Baselines
The baseline code provided by plappert2018multi is evaluated on all the MuJoCo environments, using the hyperparameters provided by their implementation. Though they recommend using more CPUs, we had to stick to a lower number because of the unavailability of computational resources. The following plots illustrate the success rates.
None of the hand environments reached optimal performance even with extended training. Barring HandManipulateEgg, the other environments had reached saturation. There is scope for future work on improving performance.
To put things into perspective, the authors report in a blog that HandManipulateBlock reaches a high success rate in just 25 epochs on their machines, whereas we observed saturation at a considerably lower rate. A reduction in the number of CPUs alone does not seem to explain this discrepancy. The authors also do not mention the hyperparameters for these environments.
Appendix C Implementation of Rank Based Prioritized Experience Replay
We use an implementation available online. There is no principled reason to choose rank-based prioritization over proportional prioritization, other than that rank-based is more robust to small changes in the TD error, while proportional prioritization takes the magnitude of the TD error into account.
Each transition is given a rank between 1 and $N$, and the probability of picking the transition with rank $i$ is proportional to $1/i$. An ideal implementation would use a sorted array, but since maintaining a sorted array is very expensive, a binary heap is used as an approximation.
Sampling is done using inverse transform sampling. If a batch of size $k$ is required, the distribution $P(i) \propto 1/i$ is used and transitions are sampled $k$ times. However, to make the sampling stratified, the ranks from 1 to $N$ are divided into $k$ buckets, where $k$ is usually the batch_size, and each bucket is defined to have equal probability mass. Sampling once from each bucket then gives a stratified sample.
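A sketch of this bucketed scheme (assuming $k \le N$; the greedy bucket boundaries are an illustrative approximation of equal probability mass):

```python
import random

def rank_buckets(n, k):
    """Split ranks 1..n into k contiguous buckets of roughly equal
    probability mass under P(i) proportional to 1/i. Assumes k <= n."""
    mass = [1.0 / i for i in range(1, n + 1)]
    total = sum(mass)
    buckets, current, acc = [], [], 0.0
    for rank, m in enumerate(mass, start=1):
        current.append(rank)
        acc += m
        # Close a bucket once its cumulative mass target is reached.
        if len(buckets) < k - 1 and acc >= total * (len(buckets) + 1) / k:
            buckets.append(current)
            current = []
    buckets.append(current)
    return buckets

def stratified_sample(n, k, rng=random):
    # One rank drawn uniformly from each bucket: a stratified batch.
    return [rng.choice(b) for b in rank_buckets(n, k)]
```

Because low ranks carry more probability mass, early buckets contain few ranks and late buckets many, so high-priority transitions are sampled far more often while every transition remains reachable.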
There seems to be only one open-source TensorFlow/Python implementation of rank-based prioritization, and it turned out to be very slow for our purposes. A modified version of the code will be made available soon, which gives a substantial speed-up with only small added amortized costs. This speed-up was obtained by building the required cumulative distributions incrementally rather than all at once at the start.
Appendix D TRPO Version of Hindsight Policy Gradients
As mentioned in section 5, Hindsight Policy Gradients can be extended to TRPO. In this section, we derive the TRPO version of the Hindsight Policy Gradients by following the same mathematical steps as in TRPO . Consider a trajectory denoted by $\tau$, obtained when an agent tries to achieve a goal $g$. The performance of a policy $\pi$ is defined as:

$$\eta(\pi) = \mathbb{E}_{\tau \sim \pi(\cdot \mid g)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, g)\right]$$
Assume that the agent was pursuing a goal $g$. Consider $\pi(a \mid s, g)$ as a goal-conditioned policy for pursuing goal $g$. Let $\pi$ be the old policy while $\tilde{\pi}$ is the current policy. Using the definitions of the value function and the advantage function, we arrive at

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}(\cdot \mid g)}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t, g)\right]$$
This can be simplified using the discounted steady-state distribution $\rho_{\tilde{\pi}}$ as

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s \mid g) \sum_{a} \tilde{\pi}(a \mid s, g)\, A_{\pi}(s, a, g)$$
A surrogate objective function is defined as

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s \mid g) \sum_{a} \tilde{\pi}(a \mid s, g)\, A_{\pi}(s, a, g)$$
It can be verified that, to first order, this objective gives the same policy gradient at $\tilde{\pi} = \pi$. The monotonic improvement is yet to be proven for this surrogate objective function; the way to prove it is to show a bound on the quantity $\left|\eta(\tilde{\pi}) - L_{\pi}(\tilde{\pi})\right|$.
Appendix E Effect of alternate goal sampling on performance
Alternate goals are sampled at storage time to append the transitions with these goals. As discussed before, uniform alternate-goal sampling allows multiple copies of the same transition to be stored in the buffer. This makes the agent sample these transitions more often than necessary, simply because updating the priority of one of the copies does not change the priority of the others; such a transition is sampled an unnecessarily large number of times. This effect can be seen in the following performance curves.
Appendix F Comparing single_queue and two_queues
As mentioned before, single_queue gives the agent the freedom to pick its own ratio of alternate-goal appended samples to actual-goal appended samples. This degree of freedom improves the performance of single_queue, as transitions are sampled for replay based solely on the TD error, whereas in two_queues there is a constraint on the ratio which does not allow such sampling. The performance curves for both single_queue and two_queues are as follows.