Improvements on Hindsight Learning

09/16/2018 · Ameet Deshpande, et al.

Sparse reward problems are one of the biggest challenges in Reinforcement Learning. Goal-directed tasks are one such class of sparse reward problems, where a reward signal is received only when the goal is reached. One promising way to train an agent to perform goal-directed tasks is to use Hindsight Learning approaches. In these approaches, even when an agent fails to reach the desired goal, it learns to reach the goal it achieved instead. By doing this over multiple trajectories and generalizing the policy learned from the achieved goals, the agent learns a goal-conditioned policy that can reach any goal. One such approach is Hindsight Experience Replay, which uses an off-policy Reinforcement Learning algorithm to learn a goal-conditioned policy. In this approach, past transitions are replayed in a uniformly random fashion. Another approach is to use a hindsight version of the policy gradient to directly learn a policy. In this work, we discuss different ways of replaying past transitions to improve learning in Hindsight Experience Replay, focusing in particular on prioritized variants. We also apply Hindsight Policy Gradient methods to robotic tasks.


1 Introduction

Sparse reward settings are attractive because reward function engineering is easy. A simple reward of 1 for achieving the required goal and 0 otherwise is one example. Recent work on Hindsight Learning HER has shown that learning from trajectories in which the agent does not succeed in achieving the goal can improve performance greatly. This alleviates the sparse reward problem by ensuring there are transitions with non-zero rewards in every rollout. A replay buffer is used to improve the sample efficiency of the procedure mnih2015human.

Prioritized Experience Replay PER, a variant of Experience Replay, samples transitions based on priority values assigned to them, unlike the uniform sampling followed in vanilla Experience Replay. Though in principle the priority values could be calculated using any formulation, it is common to use the TD error as a proxy. This follows from the intuition that a large TD error indicates a shortcoming of the agent in learning about that part of the environment.
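As a concrete illustration (not tied to any particular implementation), the priority can be taken as the magnitude of the one-step TD error. The sketch below assumes a generic action-value function q_fn and a particular transition layout; both are assumptions made for the sketch, not part of PER itself.

import numpy as np

def td_error_priority(q_fn, transition, gamma=0.98, eps=1e-6):
    """Priority proportional to the absolute one-step TD error.

    q_fn(state) is assumed to return a vector of action values. The small
    eps keeps every priority strictly positive so that no transition is
    starved of replay entirely.
    """
    state, action, reward, next_state, done = transition
    target = reward if done else reward + gamma * np.max(q_fn(next_state))
    td_error = target - q_fn(state)[action]
    return abs(td_error) + eps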

In this paper, we propose Hindsight Prioritized Experience Replay, a variant which aims to leverage the best of both worlds. It can be used in settings with multiple goals, which are common in robotic tasks.

2 Background and Notation

This section introduces the concepts motivating Hindsight Learning and Prioritized Experience Replay.

2.1 Hindsight Experience Replay

Humans seem to learn both from successes and failures. In sparse reward settings, the agent does not gather much information from trajectories in which it failed to achieve the goal, because the total return (the sum of rewards) over all the steps is 0. To see why the agent could benefit from learning from its mistakes, consider the example of an agent being trained to play soccer.

Suppose the agent is asked to shoot at one corner of the goal post but ends up kicking the ball to the center instead. Though the agent is not able to achieve its goal, it can take away important information about how it needs to kick the ball if it had instead been told to kick to the center of the goal post.

This work is based on the idea of Universal Value Function Approximators schaul2015universal, where the reward function is conditioned on the goal, r(s, a, g). A goal is chosen at the start of the episode and remains the same throughout. It has been shown that this approach can generalize to previously unseen state-goal-action pairs.
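As a minimal sketch (the distance threshold and the exact notion of goal achievement are assumptions, not taken from the environments used later), such a goal-conditioned reward can be written as:

import numpy as np

def goal_conditioned_reward(achieved_goal, desired_goal, threshold=0.05):
    """Sparse goal-conditioned reward r(s, a, g): 1 when the achieved goal is
    within `threshold` of the desired goal, 0 otherwise."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 1.0 if distance < threshold else 0.0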

The strategy for learning from alternate goals is called future. We refer the reader seeking a detailed explanation to HER, but the key idea is to consider states that occur later in the episode, relative to a given state, and use one of them as an alternate goal. Consider the following episode, in which the agent failed to achieve the required goal.

If a state from this episode is considered as an alternate goal, a reward of 1 is used to back up the values of all the states before it. Using this intuitive strategy, the pseudocode of the algorithm is as follows.

do
       for episode in max_episodes do
            Store trajectory in buffer
       end for
      Sample a minibatch of transitions from the buffer;
       Set the goal for each sampled transition;
       For a fraction of the transitions (determined by replay_k), pick a future state at random as an alternate goal;
       Use this modified sample to learn;
      
while not converged;
Algorithm 1 Hindsight Experience Replay
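The following is a minimal sketch of the learning-time relabelling used by the future strategy, under the assumption that whole episodes are stored and that each stored step carries the goal it actually achieved; the episode layout and reward_fn interface are assumptions made for the sketch.

import random

def sample_her_batch(episodes, batch_size, replay_k, reward_fn):
    """Sample transitions and relabel a fraction of them with future goals.

    With probability replay_k / (replay_k + 1), the goal of a sampled
    transition is replaced by the achieved goal of a later step in the same
    episode, and the reward is recomputed for this new goal.
    """
    future_p = replay_k / (replay_k + 1.0)
    batch = []
    for _ in range(batch_size):
        episode = random.choice(episodes)   # each step: (s, a, s_next, achieved_next, goal)
        t = random.randrange(len(episode))
        s, a, s_next, achieved_next, goal = episode[t]
        if t + 1 < len(episode) and random.random() < future_p:
            future_t = random.randrange(t + 1, len(episode))   # uniform over future steps
            goal = episode[future_t][3]                        # its achieved goal
        reward = reward_fn(achieved_next, goal)
        batch.append((s, a, reward, s_next, goal))
    return batch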

A key thing to note in the algorithm is that after the agent experiences an episode, it just stores the trajectory in the buffer; the sampling of future states for the future strategy happens at learning time. This ensures that, in expectation, any future state of a given state has an equal probability of getting picked. We refer to this as uniform sampling, and it plays a key role in our experiments.

2.2 Prioritized Experience Replay

Prior work on prioritizing transitions PER showed that a transition's TD-error serves as a good proxy for the amount of learning that can happen from it. An ideal implementation would involve sorting all the transitions by their TD-error after each rollout, but this is infeasible in practice. Two strategies are considered as approximations: proportional and rank-based. Interested readers are referred to PER for more details, but it was observed that the rank-based variant performed better.

This strategy uses a priority queue for storing the transitions, with the TD-error as the key. This reduces the insertion time to O(log N), where N is the size of the priority queue.
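A minimal sketch of such a priority store, built on Python's heapq module (a binary heap, so insertion is O(log N)); negating the key turns the min-heap into a max-heap on the TD error. This is an illustration of the data structure, not the exact implementation used in PER.

import heapq
import itertools

class PriorityStore:
    """Max-priority store keyed by |TD error| with O(log N) insertion."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker so transitions are never compared

    def push(self, td_error, transition):
        heapq.heappush(self._heap, (-abs(td_error), next(self._counter), transition))

    def pop_max(self):
        neg_priority, _, transition = heapq.heappop(self._heap)
        return -neg_priority, transition

    def __len__(self):
        return len(self._heap)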

2.3 Hindsight Policy Gradients

Hindsight Experience Replay relies on an off-policy Reinforcement Learning algorithm, since it uses a replay buffer to de-correlate the training samples. Another way to de-correlate training samples is to use asynchronous methods, as in A3C. The best performing of these, the Asynchronous Advantage Actor Critic (A3C) algorithm A3C, learns both a value function and a policy. To learn a policy, one must compute the policy gradient, i.e., the gradient of the average expected reward with respect to the policy parameters. In recent work HPG, policy gradients were derived to incorporate hindsight learning for goal-directed tasks. The expression for the policy gradient, as given in HPG, is

(1)

where g is the goal being pursued, g' ranges over the goals that were actually achieved in the process, and π_θ is the policy parametrized by θ. It has been shown in HPG that this estimator can be used to train agents to perform goal-directed tasks, but the tasks were limited to bit flipping and an empty grid world. The approach can be directly extended to continuous state and action spaces. In addition, the policy gradient estimate can be improved by using baseline corrections, which paves the way for an A3C-style variant. The policy gradient in the case of A3C would be

(2)
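As a rough PyTorch sketch of the idea, and not the exact estimator from HPG (which additionally applies importance weights to correct for evaluating the trajectory under substituted goals), a goal-conditioned policy gradient with a value baseline can be written as a surrogate loss. The interfaces of policy and value_fn, and the storage of per-goal rewards, are assumptions.

import torch

def hindsight_pg_loss(policy, value_fn, trajectory, achieved_goals, gamma=0.98):
    """Surrogate loss whose gradient is a goal-conditioned policy gradient
    with a value-function baseline, averaged over the achieved goals.

    trajectory: list of (state, action, rewards) where rewards[j] is the
    reward of that step recomputed for achieved_goals[j].
    policy(s, g) returns a torch.distributions.Distribution over actions and
    value_fn(s, g) returns a scalar tensor baseline.
    """
    loss = torch.zeros(())
    for j, goal in enumerate(achieved_goals):
        # rewards-to-go for this substituted goal
        ret, returns = 0.0, []
        for (_, _, rewards) in reversed(trajectory):
            ret = rewards[j] + gamma * ret
            returns.append(ret)
        returns.reverse()
        for (state, action, _), ret in zip(trajectory, returns):
            advantage = ret - value_fn(state, goal).detach()
            loss = loss - policy(state, goal).log_prob(action) * advantage
    return loss / max(len(achieved_goals), 1)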

3 Hindsight Prioritized Experience Replay

The intuition behind this algorithm is to have the best of both worlds. The vanilla replay buffer is replaced with a priority queue to implement the rank-based strategy. There is one major difference though: using a priority queue means that we necessarily have to calculate the TD-error before storing a transition. This in turn means that a goal is required to calculate the reward using the goal-conditioned reward function r(s, a, g) discussed in Section 2.1, which forces us to sample the alternate goals before storing.

3.1 Controlling the ratio of Actual goals and Alternate goals

In the original work HER, the replay_k parameter ensures that a fixed fraction of sampled transitions use actual goals and the remaining fraction use alternate goals. In the modified method, since we store transitions keyed by their TD-error and then sample by priority, there is no restriction on the ratio of actual goals to alternate goals in the sample. A method called two_queues is developed to enforce such a ratio. The idea is simple and uses two priority queues instead of one; the steps are listed below, followed by a sketch.

  1. Instantiate two priority queues, one for actual goals and one for alternate goals

  2. Observe a trajectory

  3. For each transition, choose replay_k alternate goals

  4. Push the transitions with actual goals to the first priority queue

  5. Push the transitions with alternate goals to the second priority queue

  6. At sample time, keep the ratio of samples drawn from the two priority queues at 1 : replay_k
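A minimal sketch of the two_queues scheme, reusing the PriorityStore sketch from Section 2.2; the episode layout, the priority_fn interface, and the exact 1 : replay_k split are assumptions. Re-inserting sampled transitions with updated priorities after the learning step is omitted for brevity.

import random

def store_episode_two_queues(actual_q, alternate_q, episode, replay_k,
                             reward_fn, priority_fn):
    """Push actual-goal and alternate-goal transitions into separate priority queues."""
    for t, (s, a, s_next, achieved_next, goal) in enumerate(episode):
        reward = reward_fn(achieved_next, goal)
        actual_q.push(priority_fn(s, a, reward, s_next, goal),
                      (s, a, reward, s_next, goal))
        future = episode[t + 1:]
        for _ in range(min(replay_k, len(future))):
            alt_goal = random.choice(future)[3]          # achieved goal of a future step
            alt_reward = reward_fn(achieved_next, alt_goal)
            alternate_q.push(priority_fn(s, a, alt_reward, s_next, alt_goal),
                             (s, a, alt_reward, s_next, alt_goal))

def sample_two_queues(actual_q, alternate_q, batch_size, replay_k):
    """Draw a batch so that actual : alternate stays roughly 1 : replay_k."""
    n_actual = max(1, batch_size // (replay_k + 1))
    batch = [actual_q.pop_max()[1] for _ in range(min(n_actual, len(actual_q)))]
    while len(batch) < batch_size and len(alternate_q) > 0:
        batch.append(alternate_q.pop_max()[1])
    return batch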

3.2 Caveats of uniform sampling

One caveat, both in the method developed so far and in HER, is the uniform sampling of goals discussed in Section 2.1. Alternate goals are sampled uniformly from the future states of a transition, so each later state in the trajectory has an equal chance of being picked. Consider a trajectory s_1, s_2, ..., s_T. If all future states are equally likely to be picked, the state-goal pair (s_{T-1}, s_T) is T-1 times more likely to be generated than the pair (s_1, s_T), because there are T-1 candidate goals for s_1 whereas there is only one for s_{T-1}. This effect will be observed in the tail of all trajectories.

This problem has worse effects on our method because it stores replay_k alternate goals for each transition. There are thus multiple copies of essentially the same transition in the buffer, and updating the priority of one of them does not change the others. To alleviate this, we propose a method called non-uniform sampling. Instead of choosing a fixed replay_k number of goals per transition, if the transition is observed at time t in an episode of length T, we choose a number characterized by the formula below.

As shown in the experiments later, this method gave the best performance.
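One plausible instantiation, stated here purely as an assumption for illustration and not necessarily the exact formula used in this work, caps and scales the number of alternate goals by the number of remaining future states so that tail states are not over-represented:

def num_alternate_goals(t, T, replay_k):
    """Hypothetical non-uniform rule: a transition at time t in an episode of
    length T gets fewer alternate goals when few future states remain, so the
    tail of the trajectory is not duplicated excessively in the buffer."""
    remaining = T - t - 1                       # future states available after time t
    scaled = round(replay_k * remaining / max(T - 1, 1))
    return min(remaining, max(0, scaled))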

3.3 Annealing replay_k

With replay_k set to a fixed value, the agent is stuck with that ratio throughout the training procedure. A very high value of replay_k means that the agent cares only about alternate goals, which is not desirable. However, a low value of replay_k means that Hindsight Learning is not being leveraged. Like other trade-offs, hyperparameter tuning over replay_k is required. But having a constant replay_k throughout learning may not be ideal. Consider the graph in Figure 1, which has very high variance.

Figure 1: Actual-to-alternate goal ratio

The ratio seems to fall slowly on average over the course of training. In the initial stages of learning, when the agent makes more mistakes, it might make sense for it to learn more from its mistakes; as it becomes more experienced, it might care mostly about the actual goals. There are two ways to give the agent this freedom.

  1. Anneal replay_k by following some schedule, like the schedules popularly used for learning rates darken1991note

  2. Use a single queue instead of two_queues and hence allow the TD-error to decide the ratio. This was the scheme used to generate Figure 1.

3.4 Using a single priority queue

The second solution discussed in Section 3.3 is implemented. Instead of maintaining two different queues, only one queue is maintained. To ensure that the ratio of actual to alternate goals is such that actual goals are still being used to learn, replay_k is set appropriately. In the case when the TD-error of all the transitions is the same, the ratio of actual goals to alternate goals reduces to 1 : replay_k. This method is referred to as single_queue.
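A minimal sketch of the first option; the exponential form and the constants below are assumptions made for illustration, not the schedule used in the experiments.

import math

def annealed_replay_k(epoch, k_initial=8.0, k_final=1.0, decay_rate=0.05):
    """Exponentially decay replay_k from k_initial towards k_final so that the
    agent leans on alternate (hindsight) goals early in training and shifts
    towards actual goals as it becomes more experienced."""
    return k_final + (k_initial - k_final) * math.exp(-decay_rate * epoch)

For example, annealed_replay_k(0) returns 8.0 and the value approaches 1.0 as the epoch count grows.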

3.5 Algorithm

The following is the pseudocode for the algorithm. The main difference from Algorithm 1 is that goals are sampled at storage time.

do
       for episode in max_episodes do
            Observe trajectory (s_1, a_1, ..., s_T);
             for t in 1..T do
                    Take the transition at time t;
                    Sample replay_k alternate goals from the future states;
                    Append the transition with each sampled goal and with the actual goal;
                    Compute the priority (TD-error) of each goal-appended transition;
                    Push each goal-appended transition to the priority queue;
                   
             end for
            Sample a batch of transitions from the priority queue;
             Use the sample to learn;
             
       end for
      
while not converged;
Algorithm 2 Hindsight Prioritized Experience Replay
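A compact Python rendering of the storage step of Algorithm 2 under the single_queue variant, with the same assumed interfaces as the earlier sketches (episode layout, reward_fn, priority_fn, and a priority queue with a push method):

import random

def store_episode_prioritized(queue, episode, replay_k, reward_fn, priority_fn):
    """Relabel at storage time: each transition is pushed once with its actual
    goal and up to replay_k times with future achieved goals, with a priority
    computed from its TD error before insertion."""
    for t, (s, a, s_next, achieved_next, goal) in enumerate(episode):
        future = episode[t + 1:]
        alt_goals = [random.choice(future)[3]
                     for _ in range(min(replay_k, len(future)))]
        for g in [goal] + alt_goals:
            reward = reward_fn(achieved_next, g)
            transition = (s, a, reward, s_next, g)
            queue.push(priority_fn(*transition), transition)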

4 Experiments

The following results illustrate the experiments that were performed. A more thorough analysis is deferred to the appendix.

Figure 2: Fetch Environments

There are oscillations visible in the plots. The oscillations decreased only marginally when the batch size was increased from 256 to 512. The proposed algorithm reaches optimal performance on the FetchReach environments in the same amount of training time as HER, but it fails on the more complex environments.

Environment Name       replay_k   Batch size   Epochs   Best success rate   Best epoch
FetchReach-v1          4          512          50       1                   1
FetchReach-v1          6          512          50       1                   1
FetchSlide-v1          4          512          50       0.0167              26
FetchPush-v1           8          256          50       0.09583             38
FetchPickAndPlace-v1   8          256          50       0.05833             18
FetchSlide-v1          8          512          50       0.12083             20
FetchReach-v1          6          512          40       1                   1
FetchReach-v1          8          512          40       1                   1
FetchPush-v1           4          256          40       0.09583             16
FetchPush-v1           6          256          40       0.1125              45
FetchPush-v1           8          256          40       0.083               47
FetchPush-v1           8          512          40       0.09583             1

5 Future Work

The results of Prioritized Experience Replay were unsatisfactory. A major shortcoming of our approach is that it is tied to calculating the TD-error before storing a transition, while the alternative of re-computing and re-sorting priorities for all stored transitions is computationally intractable. Though prioritized experience replay uses the TD-error as a proxy, to the best of the authors' knowledge, there is no method which uses heuristics or a learned algorithm to suggest which transitions may be more important. Some example heuristics could be: "If a state-goal pair has a high TD-error, other goals paired with the same state have a high error as well", "Some goals are hard to achieve for a large number of states", and so on. An ideal case would be a model which can suggest which transitions to train on plappert2018multi. This gives a flavor of Model-Based Learning and Prioritized Sweeping, and this will be the focus of the authors in the future. The current interest in the community in making prioritized methods more efficient horgan2018distributed leaves a lot of promise.

The initial results of hindsight policy gradients on the bit flipping and empty grid environments show good promise. A natural next step is to extend hindsight policy gradients beyond vanilla policy gradient methods to more advanced methods like A2C/A3C and TRPO/PPO.

6 Conclusion

With efficient and faster ways of collecting experience comes more data, but uniformly sampling this data may not be the best way to harness it. Reinforcement Learning setups have an inherent way of assigning weights to states based on their importance (state visitation frequencies on successful trajectories). If the TD-error calculation becomes a bottleneck, most of the computational resources should be spent on important transitions. Prioritized Experience Replay offers one such method, but as analyzed in this paper, it does not work very well here. Unless there is a way to compute the TD-error after sampling and yet have a priority assigned to a transition at storage time, the authors do not see much hope in the traditional methods.

Online versions of Hindsight Experience Replay can benefit from the many optimization techniques developed for such online methods. At the moment, however, the variance reduction techniques that have been used for Hindsight Policy Gradients are not sufficient to get complex environments working, and performance even in simple environments like bit-flipping is only satisfactory. There is a need for a more principled approach to multi-goal RL.

References

Appendix A Bias correction for PER sampling

Prioritized experience replay induces a bias, as it changes the distribution of the sampled transitions used for updates. The loss minimization for learning the value function relies on the sampling being uniform. Hence, to make the updates unbiased, we must introduce an importance sampling ratio given by w_i = 1/(N * P(i)), where P(i) is the probability of picking transition i and N is the size of the buffer. In order to make the updates unbiased towards the end of training while still exploiting the benefits of prioritized sampling, we anneal the bias by using an importance sampling weight of the form w_i = (1/(N * P(i)))^beta. The value of beta is annealed over time such that towards the end of training, beta = 1. As mentioned in [4], bias correction is used only for the value function updates and not for the policy updates. This showed better performance than using bias correction for both value function and policy updates.
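A minimal sketch of the annealed importance-sampling correction described above; normalising by the largest weight follows common PER practice, and the linear beta schedule with a 0.4 starting value is an assumption.

import numpy as np

def importance_weights(probabilities, buffer_size, beta):
    """Compute w_i = (1 / (N * P(i)))**beta, normalised by the maximum weight
    so that updates are only ever scaled down."""
    p = np.asarray(probabilities, dtype=np.float64)
    w = (buffer_size * p) ** (-beta)
    return w / w.max()

def annealed_beta(step, total_steps, beta_start=0.4):
    """Linearly anneal beta from beta_start to 1.0 over the course of training."""
    frac = min(1.0, step / float(total_steps))
    return beta_start + frac * (1.0 - beta_start)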

Appendix B Evaluating Baselines

The baseline code provided by [1] is evaluated on all the MuJoCo environments. The hyperparameters provided by their implementation were used, and the code was run on CPUs. Though they recommend using a larger number of CPUs, we had to stick to a lower number because of the unavailability of computational resources. The following plots illustrate the success rates.

Figure 3: Fetch Environments
Figure 4: Hand Environments

None of the hand environments reached optimal performance even when run for a large number of epochs. Barring handmanipulateegg, the other environments had reached saturation. There is scope for future work on improving this performance.

To put things into perspective, [1] report in a blog post that handmanipulateblock reaches a high success rate in just 25 epochs on their machines, whereas we observed saturation at a considerably lower value. A reduction in the number of CPUs alone does not seem to explain this discrepancy. The authors also do not mention the hyperparameters used for these environments.

Appendix C Implementation of Rank Based Prioritized Experience Replay

We use an existing open source implementation. There is no principled reason to choose rank-based prioritization over proportional prioritization, other than that rank-based is more robust to small changes in the TD error, while proportional prioritization takes the magnitude of the TD error into account.

Each transition is given a rank between 1 and N, and the probability of picking the transition with rank i is proportional to 1/i. An ideal implementation would use a sorted array, but since maintaining a sorted array is very expensive, a binary heap is used as an approximation.

Sampling is done using inverse transform sampling. If a batch size of k is required, the rank-based distribution above is used and transitions are sampled k times. However, to make sure the sampling is stratified, the ranks from 1 to N are divided into k buckets, where k is usually the batch_size, and each bucket is defined to have equal probability mass. Sampling once from each bucket then gives a stratified sample.
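A minimal sketch of the stratified sampling over ranks: the distribution P(i) proportional to (1/i)**alpha is split into batch_size buckets of equal probability mass and one rank is drawn uniformly from each bucket. The exponent alpha and the requirement that N is comfortably larger than the batch size are assumptions of the sketch.

import numpy as np

def stratified_rank_sample(n, batch_size, alpha=0.7, rng=np.random):
    """Return batch_size ranks in [1, n], one from each equal-probability bucket
    of the distribution P(i) proportional to (1/i)**alpha (assumes n >= batch_size)."""
    pdf = (1.0 / np.arange(1, n + 1)) ** alpha
    pdf /= pdf.sum()
    cdf = np.cumsum(pdf)
    ranks, prev = [], 0
    for k in range(batch_size):
        # smallest rank whose cumulative probability reaches (k + 1) / batch_size
        upper = int(np.searchsorted(cdf, (k + 1) / batch_size)) + 1
        upper = min(max(upper, prev + 1), n)        # keep buckets non-empty and in range
        ranks.append(int(rng.randint(prev + 1, upper + 1)))  # uniform within the bucket
        prev = upper
    return ranks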

There seems to be only one open source rank-based implementation for TensorFlow/Python, and it turned out to be very slow for our purposes. A modified version of the code, which gives a substantial speed up with only a small added amortized cost, will be made available soon. This speed up was obtained by building the required cumulative distributions incrementally rather than all at once at the start.
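One way to realise the incremental construction, stated purely as an assumption about how it could be done, is to memoise the stratification boundaries the first time a given buffer size and batch size are encountered rather than precomputing them for every possible buffer size at start-up:

import functools
import numpy as np

@functools.lru_cache(maxsize=None)
def bucket_boundaries(n, batch_size, alpha=0.7):
    """Compute and cache stratified bucket boundaries for a buffer of size n.

    Because the result is memoised, the cumulative distribution for a given
    (n, batch_size) pair is built only once, on first use, instead of being
    built for every possible buffer size when the replay buffer is created."""
    pdf = (1.0 / np.arange(1, n + 1)) ** alpha
    pdf /= pdf.sum()
    cdf = np.cumsum(pdf)
    bounds, prev = [], 0
    for k in range(batch_size):
        upper = min(max(int(np.searchsorted(cdf, (k + 1) / batch_size)) + 1, prev + 1), n)
        bounds.append((prev + 1, upper))
        prev = upper
    return tuple(bounds)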

Appendix D TRPO Version of Hindsight Policy Gradients

As mentioned in Section 5, Hindsight Policy Gradients can be extended to TRPO. In this section, we derive the TRPO version of Hindsight Policy Gradients by following the same mathematical steps as in [11]. Consider a trajectory τ, obtained when an agent tries to achieve a goal g. The performance of a policy is defined as:

(3)

Assume that the agent was pursuing a goal g. Consider π_θ(a | s, g') as a goal-conditioned policy for pursuing a goal g'. Let π_θ_old be the old policy and π_θ the current policy. Using the definitions of the value function and the advantage function, we arrive at

(4)

This can be simplified using the steady state probability distribution as

(5)

A surrogate objective function is defined as

(6)
(7)

It can be verified that, to first order, this objective gives the same policy gradient at θ = θ_old. The monotonic improvement guarantee is yet to be proven for this surrogate objective; the way to prove it is to show a bound on the following quantity

(8)
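For reference, the following is a hedged LaTeX reconstruction of the goal-conditioned TRPO quantities, obtained by conditioning the standard TRPO identities (Schulman et al., 2015) on a goal g'; it may differ in detail from the original derivation.

\begin{align}
\eta_{g'}(\pi_\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid g')}
  \Big[ \sum_{t \ge 0} \gamma^{t} \, r(s_t, a_t, g') \Big], \\
\eta_{g'}(\pi_\theta) &= \eta_{g'}(\pi_{\theta_{\mathrm{old}}})
  + \sum_{s} \rho_{\pi_\theta}(s \mid g') \sum_{a} \pi_\theta(a \mid s, g') \,
    A_{\pi_{\theta_{\mathrm{old}}}}(s, a, g'), \\
L_{\theta_{\mathrm{old}}}(\theta; g') &= \eta_{g'}(\pi_{\theta_{\mathrm{old}}})
  + \sum_{s} \rho_{\pi_{\theta_{\mathrm{old}}}}(s \mid g') \sum_{a} \pi_\theta(a \mid s, g') \,
    A_{\pi_{\theta_{\mathrm{old}}}}(s, a, g'), \\
\nabla_\theta L_{\theta_{\mathrm{old}}}(\theta; g') \big|_{\theta = \theta_{\mathrm{old}}}
  &= \nabla_\theta \eta_{g'}(\pi_\theta) \big|_{\theta = \theta_{\mathrm{old}}},
\end{align}

and the quantity to bound for monotonic improvement is |\eta_{g'}(\pi_\theta) - L_{\theta_{\mathrm{old}}}(\theta; g')|.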

Appendix E Effect of alternate goal sampling on performance

Alternate goals are sampled at storage time, and the transitions are appended with these alternate goals. As discussed before, uniform alternate goal sampling allows multiple copies of the same transition to be stored in the buffer. This makes the agent sample these transitions more often than necessary, simply because updating the priority of one of the copies does not change the priority of the others; such a transition will therefore be sampled an unnecessarily large number of times. This effect can be seen in the following performance curves.

Figure 5: The right hand side shows an agent learning from samples appended with alternate goals sampled uniformly from future states, while the left hand side shows an agent learning from samples appended with alternate goals as described in Section 3.2. As can be seen, it takes longer for the agent to reach a given success rate when uniform sampling of alternate goals is used for appending to transitions at storage time. This has been observed consistently over multiple environments at various hyperparameter values.

Appendix F Comparing single_queue and two_queues

As mentioned before, single_queue gives the agent the freedom to pick its own effective replay_k, i.e., the ratio of alternate-goal-appended samples to actual-goal-appended samples. This degree of freedom improves the performance of single_queue, since transitions are sampled for replay based solely on the TD error, whereas in two_queues there is a constraint on the ratio which does not allow such sampling. The performance curves for both single_queue and two_queues are as follows.

Figure 6: The right hand side shows an agent learning using the single_queue strategy, while the left hand side shows an agent learning using the two_queues strategy. As can be seen, it takes longer for the agent to reach a given success rate when it uses the two_queues strategy. This again has been observed over multiple experiments.