## I Introduction

Deep reinforcement learning (RL) has achieved remarkable successes in recent years, for example learning to play Atari computer games [mnih2015human] or defeating the world champion in the game of Dota 2 [berner2019dota]. However, these successes have generally relied on the availability of vast amounts of training data and a dense/well-shaped reward function to guide exploration, neither of which is easily obtained in practical robotic scenarios. Thus, while RL has impressed in virtual worlds, successes in the real world have been relatively limited.

The data-hungry nature of RL is a fundamental barrier to its application in robotics; unlike in video games, gathering data in the real world is prohibitively time-consuming and expensive. However, model-based RL has shown promise in improving sample efficiency and tackling this issue. Unlike imitation learning [hussein2017imitation] or simulation to real transfer (sim-to-real) [andrychowicz2020learning], these algorithms do not come with a hard requirement for expert domain knowledge or significant engineering effort. Unlike model-free RL methods, model-based RL learns to explicitly model the transition dynamics of the environment, using this model to aid the learning of a control policy. Since the model can be learned in an entirely supervised manner, the primary advantage of model-based RL is a significant improvement in sample efficiency. In fact, recent advances in the state-of-the-art have allowed model-based methods to match the asymptotic performance of model-free methods, while requiring orders of magnitude less data [janner2019trust, hafner2020mastering, schrittwieser2020mastering]. However, data-efficiency issues aside, these state-of-the-art techniques generally rely on dense reward signals, and in this respect are still somewhat impractical for robotics applications.

In robotic manipulation tasks, significant expertise can be required to design and implement a dense reward that can successfully guide learning. For example, Popov et al. [popov2017data] require a highly complex reward function to learn to stack blocks, while Andrychowicz et al. [andrychowicz2017hindsight] demonstrate that an over-simplified dense reward function can be detrimental to learning performance. Suitable shaped rewards are not only difficult to design, but also difficult to implement in the real world; unlike in simulation, computer vision systems must often be employed to track reward-relevant information, such as object positions [nagabandi2020deep].

To bypass the challenge of hand-crafting dense reward signals, methods have been developed which can learn from sparse rewards; rewards with little information that are generally only received once the task has been successfully completed. These sparse rewards are much easier to design and implement, but are conversely much more difficult to learn from. Deep RL methods such as curiosity-driven exploration [pathak2017curiosity, burda2018exploration] and Hindsight Experience Replay (HER) [andrychowicz2017hindsight] have made significant progress in the sparse-reward setting; however, few works have successfully integrated these techniques with model-based methods to obtain the benefits of both in difficult robotic tasks.

In this paper, a novel model-based RL method specifically tailored for sparse-reward tasks is proposed to jointly address the data-efficiency and reward engineering issues inherent in the application of RL to robotics. The method trains the policy with HER to ensure maximum efficacy in the multi-goal tasks commonly found in robotics, and incorporates imaginary/model-generated data into policy updates to reduce sample complexity. To ensure the learned policy does not overfit to model inaccuracies, imagined data is generated by an ensemble of models [kurutach2018model], and is regenerated each time the model is updated. To improve environment exploration, the ensemble also provides the agent with intrinsic rewards based on the disagreement between its predictions [pathak2019self]. To allow the agent to adapt its behaviors specifically for the real environment, real and imaginary data are distinguished from each other when input to the policy. We term this technique Imaginary Hindsight Experience Replay (I-HER).

I-HER is evaluated in the challenging OpenAI Gym Fetch Robotics tasks [brockman2016openai, plappert2018multi], and converges using, on average, an order of magnitude less data than the state-of-the-art model-free algorithm, matching its asymptotic performance in all but one task. The diversity of the Fetch tasks is found to expose various modes in which model-based RL can struggle, but an ablation study demonstrates how I-HER’s components help to overcome these issues. Compared to state-of-the-art model-free and model-based methods, I-HER offers increased feasibility for robotic applications due to its combined data-efficiency and ability to effectively tackle sparse-reward tasks. Moreover, I-HER’s empirical results outperform those presented by other model-based methods also tailored for multi-goal sparse-reward tasks.

## II Related Work

### II-A Model-Based Reinforcement Learning

The data-efficiency of model-based RL methods makes them promising for real-world applications, and many variants have been proposed. Learned dynamics models have been used for model predictive control (MPC), but these methods have struggled to match the asymptotic performance of model-free methods [nagabandi2018neural], and generally rely on a dense reward function to identify good trajectories [nagabandi2020deep, hafner2019learning, chua2018deep]. While MPC is purely model-based, often model-based and model-free components are combined to yield both high data efficiency and high asymptotic performance. In one such approach, a dynamics model can be used to predict into the future to improve the value estimates of a model-free algorithm [feinberg2018model, buckman2018sample]. Another approach is to train a model-free policy with cheap model-generated data; a concept first introduced by Dyna-Q [sutton1990integrated].

Like any model-based method, Dyna-Q-type approaches are limited by inaccuracies in the learned dynamics model [gu2016continuous, kaiser2019model]. Kalweit and Boedecker [kalweit2017uncertainty] prevent an inaccurate model from limiting asymptotic performance by sampling less imaginary data as uncertainty in the Q-function decreases. Other works employ an ensemble of dynamics models to prevent the policy from overfitting to the inaccuracies of a single model [kurutach2018model, clavera2018model]. SLBO [luo2018algorithmic] minimises overfitting to any single iteration of the model by updating the model regularly. MBPO [janner2019trust] optimises the length of model-generated rollouts to avoid compounding model errors, while Ha and Schmidhuber [ha2018world] optimise stochasticity in the imagined environment to prevent exploitation of a deterministic model.

This paper extends these Dyna-Q approaches, accounting for model inaccuracies by (i) employing an ensemble of models, (ii) introducing a step which regenerates the data in the imaginary replay buffer each time the model is updated so as to eradicate inaccuracies produced by older iterations of the model (unlike in [kalweit2017uncertainty]), and (iii) introducing an explicit distinguishing of real and imaginary transitions to allow the policy to account for differences in their dynamics.

### II-B Curiosity-Driven Exploration

An agent can be endowed with curiosity by encouraging it to visit novel states, which can greatly improve exploration of the state space in the absence of dense extrinsic rewards. In large continuous state spaces, the prediction error of a learned dynamics model can act as a measure of novelty, and thus be leveraged as an intrinsic reward signal to create a curious agent [burda2018large, li2019curiosity]. However, this approach can lead to the agent becoming ‘obsessed’ with inherently unpredictable elements of the environment [schmidhuber2010formal]. To counter this issue, Pathak et al. [pathak2017curiosity] make predictions in a feature space learned by an inverse dynamics model, while Burda et al. [burda2018exploration] employ a randomly initialised neural network to remove any stochasticity in the production of prediction targets. Intrinsic rewards have also been generated based on the disagreement between the next-state predictions of an ensemble of dynamics models [pathak2019self, henaff2019explicit]. Such disagreement is more pronounced in regions of the state-space where training data is limited, and thus acts as a good measure of novelty. Moreover, when repeatedly exposed to stochastic aspects of the environment, models will converge to predicting the mean and disagreement will diminish.

### II-C Model-Based Learning with Sparse Rewards

Few proposed model-based methods have been well-equipped to handle difficult sparse-reward tasks. Indeed, Yang et al. [yang2021mher] find that MBPO and MVE [feinberg2018model] perform poorly in goal-based sparse-reward tasks. Multi-DEX [kaushik2018multi], a method tailored for sparse rewards, uses a curious model-based policy search algorithm which (as part of its multi-objective) seeks maximally novel state trajectories. Orthogonally, exploration policies have been learned efficiently in imagination via ensemble disagreement intrinsic rewards, aiding quick adaptation to downstream tasks [shyam2019model, sekar2020planning]. However, these methods do not leverage the learning efficiencies that HER can provide in multi-goal tasks.

MHER [yang2021mher] increases the efficiency of HER by generating imagined future achieved goals via a learned dynamics model, but this is only demonstrated in relatively trivial reaching tasks. Taking inspiration from HER, PlanGAN [charlesworth2020plangan] trains an ensemble of generative models to generate trajectories that lead from the current state towards a specified goal, and uses these trajectories for MPC. This improves efficiency versus HER, but entails a highly computationally expensive online planning procedure.

Unlike these competing methods ([yang2021mher, charlesworth2020plangan]), ours is demonstrated to be robust across the full set of benchmark Fetch tasks, providing roughly an order of magnitude increase in data-efficiency versus HER. This is achieved via a novel combination of curiosity-driven exploration, model-based learning, and HER, yielding an algorithm well suited to sparse-reward multi-goal tasks.

## III Background

### III-A Multi-Goal Reinforcement Learning

Our reinforcement learning agent interacts with a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{G}, p, \rho_0, r, \gamma)$. $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{G}$ are the state, action and goal spaces, respectively, and $\gamma$ is the discount factor. The state transition distribution is denoted as $p(s_{t+1} \mid s_t, a_t)$, the initial state distribution as $\rho_0(s_0)$, and the reward function as $r(s_t, a_t, g)$. The goal of the RL agent is to find the optimal policy $\pi^*$ that maximizes the expected sum of discounted rewards in this MDP: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t} \gamma^{t} r(s_t, a_t, g)\right]$.

The reward function is assumed to be known but the dynamics are unknown. Model-based reinforcement learning constructs a model of the dynamics, $\hat{p}(s_{t+1} \mid s_t, a_t)$, using data collected from the MDP and supervised learning.

### III-B Hindsight Experience Replay (HER)

HER can be used with any off-policy RL algorithm in multi-goal tasks, proving highly effective in these settings [andrychowicz2017hindsight]. In these tasks, transition tuples collected in the MDP take the form of $(s_t, a_t, r_t, s_{t+1}, g)$, where $g$ is the goal of an episode, and the reward function is generally binary:

$$
r_t = \begin{cases} 0, & \text{if the goal } g \text{ is achieved} \\ -1, & \text{otherwise} \end{cases} \tag{1}
$$

Now, if the goal is never reached in an episode in this MDP, all rewards will be negative and the collected transitions will provide minimal information for policy optimization. To solve this, HER employs a simple trick: a proportion of sampled transitions have their goal altered to $g'$, where $g'$ is a goal achieved later in the episode. The rewards of these altered transitions are then recalculated with respect to $g'$, leaving the altered transition tuples as $(s_t, a_t, r'_t, s_{t+1}, g')$. Even if the original episode was unsuccessful, the altered transitions will teach the agent how to achieve $g'$, thus accelerating its acquisition of skills.
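The relabelling trick can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary keys, the `sparse_reward` helper, its distance tolerance `tol`, and the choice to sample the substitute goal uniformly from the remaining steps of the episode are all assumptions made for the example.

```python
import random


def sparse_reward(achieved_goal, goal, tol=0.05):
    """Binary sparse reward: 0 if the goal is achieved, -1 otherwise.

    `tol` is an assumed distance tolerance, not a value from the paper.
    """
    dist = sum((a - b) ** 2 for a, b in zip(achieved_goal, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def her_relabel(episode, t, rng=random):
    """Return a copy of transition t whose goal is one achieved later on.

    Each episode entry is a dict with the (hypothetical) keys: obs, action,
    next_obs, goal, and achieved_goal (the goal achieved in next_obs).
    """
    # Pick the outcome of step t or of any later step as the new goal.
    future = rng.randint(t, len(episode) - 1)
    new_goal = episode[future]["achieved_goal"]

    relabelled = dict(episode[t])  # leave the stored transition untouched
    relabelled["goal"] = new_goal
    # Recompute the reward with respect to the substituted goal.
    relabelled["reward"] = sparse_reward(relabelled["achieved_goal"], new_goal)
    return relabelled
```

Relabelling the final transition of a failed episode always yields a reward of 0, since the substituted goal is exactly what that transition achieved.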

## IV Imaginary Hindsight Experience Replay (I-HER)

Excellent data-efficiency and the ability to learn from sparse rewards are appealing qualities for an RL algorithm being deployed in a robotic scenario. While HER has excelled in sparse-reward tasks where standard model-free RL algorithms have tended to fail [andrychowicz2017hindsight, dai2020episodic], it is apparent that its data efficiency could be improved through the use of a model-based component. In light of this, Imaginary Hindsight Experience Replay (I-HER) is now presented as a technique that incorporates imaginary/model-generated data into the HER learning procedure to improve its data efficiency. We follow the original HER paper by employing Deep Deterministic Policy Gradients (DDPG) [lillicrap2015continuous] as the base off-policy RL algorithm. See Algorithm 1 for the I-HER pseudo-code and refer to the appendix for comprehensive hyperparameter details.

### IV-A I-HER Overview

I-HER consists of a continuous cycle of: (i) collecting real experience and updating the ensemble of dynamics models, then; (ii) generating imaginary experience and updating the policy. Collected real and imaginary MDP transitions are stored in their respective replay buffers, the real replay buffer and the imaginary replay buffer. The dynamics model is trained on the real replay buffer. The RL policy is trained on data sampled from both buffers, combined into a single batch for the mini-batch gradient descent step. The sampling ratio here is based on the ratio of real to imaginary data collected so far: $p_{\text{imag}} = N_{\text{imag}} / (N_{\text{imag}} + N_{\text{real}})$, where $p_{\text{imag}}$ is the proportion of sampled transitions that are imaginary, and $N_{\text{imag}}$ and $N_{\text{real}}$ are the respective number of imaginary and real transitions collected so far. This unbiased approach aims to avoid any overfitting that may occur from oversampling real experiences.
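As a rough illustration of this sampling rule, the sketch below builds one mixed mini-batch in which the share of imaginary transitions matches their share of all transitions collected so far. The function name, the use of plain Python lists as buffers, and the rounding of the imaginary count to the nearest integer are assumptions for the example, not details taken from the paper.

```python
import random


def sample_mixed_batch(real_buffer, imag_buffer, batch_size, rng=random):
    """Sample one mini-batch drawn from both buffers.

    The proportion of imaginary samples equals the proportion of imaginary
    transitions among all stored transitions. Assumes at least one buffer
    is non-empty.
    """
    n_real, n_imag = len(real_buffer), len(imag_buffer)
    p_imag = n_imag / (n_imag + n_real)

    k_imag = round(batch_size * p_imag)  # assumed rounding rule
    k_real = batch_size - k_imag

    batch = rng.choices(imag_buffer, k=k_imag) + rng.choices(real_buffer, k=k_real)
    rng.shuffle(batch)  # mix real and imaginary samples within the batch
    return batch, p_imag
```

With 100 real and 300 imaginary transitions stored, three quarters of each batch would be imaginary under this rule.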

### IV-B Generating Imaginary Experience

In I-HER, an ensemble of dynamics models is maintained. Following [nagabandi2018neural, kurutach2018model], each model is a feed-forward neural network trained on data from the real replay buffer to minimize the one-step prediction loss, with normalized inputs and output targets. To generate imaginary experience, at each time step a model is sampled uniformly from the ensemble to predict the next state. This serves to minimise policy exploitation of any single model's inaccuracies and thus improve policy robustness in the real environment [kurutach2018model]. Diversity is maintained in the ensemble by varying model weight initializations and training input sequences [kurutach2018model, clavera2018model]. Imaginary rollouts are of equal length to the real rollouts (50 steps in this case).

### IV-C Regenerating Imaginary Experience

Off-policy RL algorithms learn from all previously collected experience maintained in the replay buffer. This presents an issue when learning from imaginary transitions: older imaginary transitions are likely more inaccurate than those generated by the latest dynamics model (which has been trained with the most data). Ideally, all imaginary experience should at least be consistent with the latest dynamics model, and I-HER ensures this is the case. Recognising that imaginary experience can be generated rapidly via batch parallelization, the imaginary replay buffer is simply emptied each time the dynamics model is updated and refilled to its previous level with data generated by the up-to-date model (on an Intel Xeon Gold 6252 CPU, it takes less than 30 seconds to refill a full imaginary replay buffer with 1 million transitions). To ensure the buffer is still filled with a diverse set of experiences (an important factor for off-policy algorithms [fedus2020revisiting, zhang2017deeper]), and to maintain learning stability, the new imaginary transitions are collected with the current and many older versions of the policy (see appendix for more details).

### IV-D Intrinsic Rewards

In I-HER, the agent is provided with a useful intrinsic reward $r^{i}_t$, added to the extrinsic reward $r^{e}_t$ during DDPG updates in the same manner as [li2019curiosity], to give a total reward of: $r_t = r^{e}_t + r^{i}_t$. Following [pathak2019self], this intrinsic reward is based on the variance between the ensemble predictions of $s_{t+1}$. Importantly, these ensemble-generated intrinsic rewards can be applied to imaginary data [sekar2020planning], and thus are applied to both real and imaginary transitions during policy updates. To weight the intrinsic rewards, and to ensure they do not overshadow the extrinsic rewards, they are scaled and clipped [li2019curiosity]. Thus, the intrinsic reward is calculated as: $r^{i}_t = \min(\lambda \sigma^{2}_t, r^{i}_{\max})$, where $\sigma^{2}_t$ is the variance between ensemble predictions, $\lambda$ is the scaling factor, and $r^{i}_{\max}$ is the maximum value $r^{i}_t$ can take.

This intrinsic reward signal serves two important purposes: (i) to strengthen the dynamics model by encouraging the collection of data in areas where it experiences epistemic uncertainty (uncertainty due to a lack of data), and (ii) to improve general exploration of the state-space.
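A minimal sketch of a disagreement-based intrinsic reward of this kind is given below. The default scale (0.5) and clip (0.8) are the values listed in the appendix; how the per-dimension variances are reduced to a single scalar is not stated in the text, so averaging over state dimensions is an assumption here, as are the function names.

```python
from statistics import pvariance


def intrinsic_reward(ensemble_predictions, scale=0.5, clip=0.8):
    """Scaled and clipped ensemble-disagreement reward.

    ensemble_predictions: one next-state prediction per model, each a
    sequence of floats of equal length.
    """
    n_dims = len(ensemble_predictions[0])
    # Variance across models for each state dimension.
    per_dim = [
        pvariance([pred[d] for pred in ensemble_predictions])
        for d in range(n_dims)
    ]
    # Assumed reduction: mean variance over state dimensions.
    disagreement = sum(per_dim) / n_dims
    # Variance is non-negative, so clipping only needs an upper bound.
    return min(scale * disagreement, clip)


def total_reward(extrinsic, ensemble_predictions):
    """Extrinsic reward plus the intrinsic bonus."""
    return extrinsic + intrinsic_reward(ensemble_predictions)
```

When all models agree the bonus is zero, so in well-explored regions the agent is driven by the sparse extrinsic reward alone.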

### IV-E Distinguishing Real from Imaginary Experiences

When training in an imaginary environment, a model-based RL algorithm can overfit to inaccuracies in the imaginary dynamics at the expense of its real-world performance. Now, to account for differences between a training environment and a target environment, meta-learning is commonly employed. For example, a recurrent policy can be learned whose memory allows it to discern the current dynamics and alter its behaviour appropriately [andrychowicz2020learning]. Alternatively, a policy capable of quickly adapting to slight environmental differences in one gradient step can be learned [clavera2018model]. A simpler solution is used in I-HER, albeit one that requires the use of some real data to train the policy: the policy is told whether input observations are from the real or imaginary environment by appending to them a binary variable; a 1 if the observation is real, a 0 otherwise. This is done to allow the policy to detect and account for noticeable differences between the real and imaginary environments, and thus learn a set of behaviours fine-tuned for the real world.
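In code, this amounts to one extra input dimension. The sketch below is illustrative only: the function name is hypothetical, and concatenating the goal with the observation before appending the flag is an assumption based on common practice in goal-conditioned policies.

```python
def policy_input(observation, goal, env_is_real):
    """Build the policy input vector with a trailing real/imaginary flag.

    The flag is 1.0 for observations from the real environment and 0.0 for
    observations generated by the learned dynamics model.
    """
    flag = 1.0 if env_is_real else 0.0
    return list(observation) + list(goal) + [flag]
```

For example, `policy_input([0.1, 0.2], [0.5], env_is_real=True)` returns `[0.1, 0.2, 0.5, 1.0]`.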

## V Experiments

### V-A Environments

I-HER is evaluated in the OpenAI Gym Fetch Robotics environments [plappert2018multi], which consist of four diverse and challenging continuous control tasks: Reach, Push, PickAndPlace, and Slide (see Fig. 1). Rewards in these multi-goal tasks are sparse and binary (as in Eq. (1)). Observations include relevant state information.

In these tasks, HER is known to outperform standard model-free algorithms, which struggle due to the sparse reward function [andrychowicz2017hindsight, dai2020episodic]. Although there have been several iterations on the original HER (e.g. [he2020soft, fang2019curriculum]), these have generally provided incremental improvements in performance, and could be integrated into the I-HER algorithm if desired. As such, we regard HER as the state-of-the-art model-free method in the benchmark Fetch tasks for the purposes of comparison to I-HER.

### V-B Benchmark Results

The performance of I-HER is now compared to HER (all presented experiments were performed over 3 random seeds). Both methods were implemented with DDPG as the base RL algorithm. To enable fair comparison, shared hyperparameters were maintained at equal values, largely based on those employed by [plappert2018multi]. The results of this comparative evaluation are shown in Fig. 2.

The model-based I-HER matches the asymptotic performance of the model-free HER in all tasks except Slide, requiring on average an order of magnitude less real data to learn. Roughly 30×, 15×, 7×, and 4× less data is required in Reach, Push, PickAndPlace, and Slide, respectively (notably, gains diminish as task difficulty increases). To solve the non-trivial Push task, I-HER requires 40 minutes of experience versus the 10 hours required by HER. Interestingly, I-HER achieves a slightly higher asymptotic performance in the PickAndPlace task, perhaps due to the improved exploration encouraged by its intrinsic rewards.

We now turn to the reported results of competing model-based methods. Of the Fetch tasks, MHER [yang2021mher] is only tested in the trivial Reach task, where it does not match our 30 times improvement in data-efficiency versus HER. PlanGAN [charlesworth2020plangan] displays similar improvements in PickAndPlace but does not match our 15 times improvement in efficiency in Push.

### V-C Ablation Study

#### Intrinsic rewards

The intrinsic rewards prove very useful in Push and PickAndPlace (see Fig. 3), serving to encourage interaction with the cube, whose dynamics are complex and difficult to model. Without curiosity, the agent struggles to learn to pick up the cube in PickAndPlace. Here, when faced with a goal it was unable to reach, it was observed that the curious agent would continue to ‘play’ with the cube while the non-curious agent would remain relatively inactive. This ‘play’ was crucial for the agent to quickly learn to pick up the cube and move it to goals above the table.

#### Distinguishing between reality and imagination

In contrast to the intrinsic rewards, distinguishing real and imaginary experiences only provides notable benefits in Slide (see Fig. 3). This can be explained as follows. In Push and PickAndPlace, the bottleneck in learning the task is exploration; once the agent has adequately explored the cube, the success rate quickly converges to 100%. In Slide, however, highly precise control is required (as demonstrated by the inability of I-HER or HER to reach a 100% success rate in Fig. 2), and this becomes the bottleneck rather than exploration; the Slide policy is still being fine-tuned long after it has learned to interact with the puck.

Since highly precise control over the dynamics is required, small modelling errors can hurt policy performance significantly. Distinguishing real and imaginary data allows the policy to account for any noticeable differences in the real and imaginary dynamics, and thus learn a set of behaviors fine-tuned for the real world. This helps minimise the adverse effect of model errors, providing a significant boost in I-HER’s final Slide performance.

#### Regenerating imaginary experiences

The results of Fig. 3 suggest that the regeneration of imaginary experiences only provides minor benefits in the final version of I-HER (with an ensemble of five models). However, an analysis of the effect of the ensemble (see Figure 4) demonstrates that this regeneration step is highly beneficial when only a single dynamics model is employed.

With a single model, model inaccuracies can be easily exploited in the imaginary environment. For example, at one stage the policy was observed to be capable of ‘magically’ moving the imaginary cube without touching it, due to a poor dynamics model. Such inaccuracies are generally corrected once the policy collects further data in the real environment; however, if the inaccurate imaginary data remains in the buffer, the policy will continue to try to exploit it. Thus, regenerating the imaginary experience in the buffer is crucial when only a single dynamics model is employed. With an ensemble of five models, the policy cannot overfit to model inaccuracies as easily and maintaining older imaginary experiences in the buffer is not as detrimental to performance.

## VI Challenges in Model-Based RL

The diversity of the Fetch tasks exposes two modes in which model-based RL methods can struggle: (a) where significant exploratory barriers must be passed; (b) where highly precise control over complex dynamics is necessary. These are now discussed in the context of the I-HER results.

#### Exploration

In model-based RL, the policy and model are co-dependent. The model depends on the policy to provide it with data, while the policy requires an adequately accurate model to aid its learning of the task. When exploration is difficult, this co-dependency becomes an issue. This is evident in PickAndPlace, where learning to pick up the cube is a major exploratory barrier. While the intrinsic rewards are highly beneficial here, I-HER still struggles to breach this barrier, as demonstrated by the learning plateaus around the 50-65% success rate in the PickAndPlace results of Fig. 3 (half of the goals are on the table while the remainder are above). Analysis found that before and during this plateau it was impossible to properly pick up the cube in the imaginary environment; the model had not yet seen any cube-picking data, so its predictions were biased towards the cube falling back to the table. Hence, the policy struggled to learn to pick up the cube as it was impossible to do so in the imaginary environment at this stage. This in turn prevented the policy from providing the model with the cube-picking data it needed to improve its predictions, which in turn continued to hurt the policy's ability to learn cube-picking, and so on. This unfortunate cycle was difficult to break without the use of intrinsic rewards.

#### Precision

Slide requires highly precise control over complex dynamics. The ‘slide’ must be perfectly weighted and directed, and once the puck moves out of reach any misjudgements cannot be corrected. Thus, small biases in the imaginary environment can hurt policy performance substantially. For example, if the imagined friction is slightly too low, the policy will tend to leave its shots too short in the real environment. Ultimately, the requirement for high precision in both the dynamical modelling and the control policy leads to suboptimal I-HER results in Slide.

These exploration and precision issues are resolved to some extent by I-HER’s intrinsic rewards and distinguishing of real and imaginary experiences, respectively. However, there is room for improvement. An exploration mechanism better able to identify completely unvisited areas of the state space would be beneficial; in PickAndPlace the policy can consistently receive easy intrinsic rewards by aggressively fumbling the cube on the table, thus reducing its incentive to explore picking up the cube. Learning a separate Q-function for intrinsic and extrinsic rewards, similar to [lanier2019curiosity], could be beneficial. Enhancing model capacity to improve predictions would likely help with both issues. This could potentially be achieved by increasing the model size, using probabilistic dynamics models [chua2018deep] (particularly useful if dealing with the stochasticity of the real world), or by introducing a multi-step prediction objective [hafner2019learning].

## VII Conclusion

Solving difficult robotic tasks via reinforcement learning can require the collection of excessive amounts of data, while designing and implementing a suitably shaped reward function can require expert knowledge and complex engineering. To address these issues, this paper proposes Imaginary Hindsight Experience Replay (I-HER); a model-based method specifically tailored for sparse-reward multi-goal tasks. Empirical results demonstrate that this method significantly reduces sample complexity versus the state-of-the-art model-free method in challenging robotic tasks, matching its asymptotic performance in all but one task.

An important direction for future work will be to extend this approach to visual observations; acquiring state observations in the real world is challenging and does not scale well to increasingly unstructured settings. This would require the adaptation of the concept of HER to handle image observations, as attempted in [nair2018visual, sahni2019addressing]. Finally, the benefits of I-HER should be verified via its deployment in real-world robotics scenarios, further testing the limits of model-based reinforcement learning.

## Appendix

#### Dynamics model

An ensemble of 5 models is maintained. Each model is a feed-forward neural network with 2 hidden layers of size 512 and ReLU nonlinearities. Models are trained with the Adam optimizer [kingma2014adam] with a learning rate of 0.001 and batch size of 512 (each MPI worker samples its own batch of size 512 and gradients are averaged for the update step). Each epoch, each model is trained for K steps, jumpstarted off the previous epoch's model. In the first epoch, the model is trained for 5K steps.

To attempt to improve the efficiency of the update steps, a bias is introduced when sampling from the real replay buffer to update the dynamics models. Each transition is sampled with a probability that depends on the epoch in which it was collected relative to the current epoch, with a parameter that decides how much to bias towards newer transitions. The same value of this parameter is used in all experiments.

#### Hyperparameters

The following hyperparameters are equivalent across all tasks: , buffer sizes = ; imaginary rollouts per MPI worker (I) = 2; batches per cycle (N) = 40; intrinsic reward scale () = 0.5; intrinsic reward clip () = 0.8. The following are different across Reach, Push, PickAndPlace, Slide; MPI workers = 1, 8, 8, 8; dynamics update steps (K) = 50, 1250, 2500, 2500; real rollouts per MPI worker (R) = 2, 32, 64, 64; cycles (C) = 50, 250, 250, 250.

#### Refilling imaginary replay buffer (line 7 of Algorithm 1)

Iterating through the stored policies from newest to oldest, each policy collects imaginary rollouts which are added to the imaginary replay buffer. This process stops once the imaginary replay buffer is refilled to its previous level. A copy of the current policy is added to the policy store every 50 cycles (C).
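The refilling procedure can be sketched as follows. This is a simplified, single-process illustration under several assumptions: policies and models are plain Python callables, `sample_start` is a hypothetical function returning an initial state and goal for each imagined rollout, and `rollouts_per_policy` is a placeholder value. Only the 50-step horizon, the uniform sampling of a model at each step, and the newest-to-oldest ordering come from the text.

```python
import random


def imagined_rollout(policy, ensemble, start_state, goal, horizon=50, rng=random):
    """Roll a policy out in imagination for `horizon` steps."""
    state, transitions = start_state, []
    for _ in range(horizon):
        action = policy(state, goal)
        model = rng.choice(ensemble)  # sample a model uniformly at each step
        next_state = model(state, action)
        transitions.append((state, action, next_state, goal))
        state = next_state
    return transitions


def refill_imaginary_buffer(policies, ensemble, sample_start, target_size,
                            rollouts_per_policy=2):
    """Rebuild the imaginary buffer using the up-to-date ensemble.

    `policies` is ordered oldest to newest; the newest is used first, and
    collection stops once the buffer is back at `target_size`, its level
    before it was emptied.
    """
    buffer = []
    for policy in reversed(policies):
        for _ in range(rollouts_per_policy):
            start_state, goal = sample_start()
            buffer.extend(imagined_rollout(policy, ensemble, start_state, goal))
            if len(buffer) >= target_size:
                return buffer[:target_size]
    return buffer  # fewer stored policies than needed to reach target_size
```

Starting from the newest policy means that, when only a small buffer needs refilling, the regenerated data comes mostly from recent behaviour, while larger buffers draw on progressively older policies for diversity.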

#### Distinguishing between real and imaginary data

During updates, the policy is always truthfully told whether a transition is real or imaginary (via the binary variable env_is_real). However, when collecting experience, 10% of real rollouts are collected with env_is_real = 0 as input, and 10% of imaginary rollouts are collected with env_is_real = 1 as input. This mixing is to (i) encourage the real and imaginary behaviours to learn from each other (beyond their shared weights), and (ii) to ensure exploiting imaginary behaviours are corrected by testing them in the real environment.
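The flag handling described above might look like the sketch below. The text states that 10% of rollouts are collected with the flipped flag but not how those rollouts are chosen, so flipping each rollout independently with probability 0.1 is an assumption, as are the function names.

```python
import random


def collection_flag(rollout_is_real, flip_prob=0.1, rng=random):
    """env_is_real value fed to the policy while collecting one rollout.

    With probability `flip_prob` the flag is deliberately inverted, so that
    real rollouts are occasionally collected with the 'imaginary' behaviour
    and vice versa.
    """
    flag = 1 if rollout_is_real else 0
    if rng.random() < flip_prob:
        flag = 1 - flag
    return flag


def update_flag(transition_is_real):
    """env_is_real value used during policy updates: always truthful."""
    return 1 if transition_is_real else 0
```

Because the flag is only inverted at collection time, the replay buffers still record the true origin of every transition.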

#### HER experiments

1 MPI worker is used in Reach and 8 are used in the remaining tasks (our HER implementation is taken from https://github.com/TianhongDai/hindsight-experience-replay). Besides this, all relevant hyperparameters are kept equivalent to [plappert2018multi].
