Visual Hindsight Experience Replay

01/31/2019 ∙ by Himanshu Sahni, et al. ∙ 2

Reinforcement Learning algorithms typically require millions of environment interactions to learn successful policies in sparse reward settings. Hindsight Experience Replay (HER) was introduced as a technique to increase sample efficiency through re-imagining unsuccessful trajectories as successful ones by replacing the originally intended goals. However, this method is not applicable to visual domains where the goal configuration is unknown and must be inferred from observation. In this work, we show how unsuccessful visual trajectories can be hallucinated to be successful using a generative model trained on relatively few snapshots of the goal. As far as we are aware, this is the first work that does so with the agent policy conditioned solely on its state. We then apply this model to training reinforcement learning agents in discrete and continuous settings. We show results on a navigation and pick-and-place task in a 3D environment and on a simulated robotics application. Our method shows marked improvement over standard RL algorithms and baselines derived from prior work.

1 Introduction

Deep reinforcement learning (RL) has recently demonstrated success in a range of previously unsolved tasks, from playing Atari and Go on a superhuman level (Mnih et al., 2015; Silver et al., 2017) to learning control policies for real robotics tasks (Levine et al., 2016; OpenAI, 2018; Pinto et al., 2017). But deep RL algorithms are highly sample inefficient for complex tasks, and learning from sparse rewards can be challenging. In these settings, millions of steps are wasted exploring trajectories that yield no learning signal. On the other hand, providing dense rewards along these trajectories is a tedious job that requires substantial domain knowledge and RL expertise. Ill-specified shaping rewards can also lead to unexpected ‘hacking’ behaviour (Ng et al., 1999; Randløv & Alstrøm, 1998). Therefore, an important direction for RL research is towards more sample efficient methods that minimize the number of environment interactions, yet can be trained using only sparse rewards.

To this end, Andrychowicz et al. (2017) introduced the idea of Hindsight Experience Replay (HER), which can rapidly train a goal-conditioned policy by retroactively imagining failed trajectories as successful ones. By making use of failed attempts to increase sample efficiency, HER was able to learn a range of robotics tasks that traditional RL methods were unable to solve. But HER was only shown to work in non-visual environments, where the precise goal configuration is provided to the agent’s policy throughout training and where it is straightforward to find a goal that is satisfied in any state. It is not directly applicable to challenging visual domains resembling real world applications, where the goal location is not explicitly known and must be searched for within the environment.

Figure 1: VHER works by using a generative model to hallucinate the presence of goals at the end of unsuccessful trajectories. The agent’s task is to search for a pebble randomly placed in its surroundings and collect it by approaching and centering it in its view. The top row shows a failed trajectory which ends in the agent not finding the pebble. The bottom row replays the same trajectory with a hallucinated visual goal inserted by HALGAN at every state such that a pebble appears to be collected.

Yet, we desire for RL agents to quickly learn to operate in the high-dimensional visual environments that humans inhabit. In HER, Andrychowicz et al. (2017) employed a goal conditioned policy using universal value function approximators (UVFAs) (Schaul et al., 2015) to generalize over multiple goals. Some recent work has extended that to visual goal conditioned policies (Nair et al., 2018) where goals are sampled from the set of possible agent states. But there is a wide range of visual tasks where we do not have an explicit representation of a goal beforehand and where a state may not easily map to a goal. Thus, we would like the agent to be able to perform visual tasks without providing it an exact specification of the goal during execution and instead have it search for the goal in its environment. For this, the agent must be able to infer the presence of goals from the state image itself. Without a direct goal specification, the agent must also learn to generalize over multiple goals just from its state.

To address high sample complexity of RL in such visual environments, we introduce Visual Hindsight Experience Replay (VHER), which combines a hallucinatory generative model with HER to rapidly solve tasks using only raw pixels in the state as input to the agent policy. The hallucinatory generative model, HALGAN, minimally alters images in snippets of failed trajectories to appear as if the desired goal is achieved at the end. In order to retroactively hallucinate success in a visual environment, it is necessary to alter the state images along the failed trajectory to make it appear as if the goal was present throughout (see figure 1). HALGAN is trained using a few snapshots of near goal images, where the relative location of the agent to the goal is known. It is then combined with HER during the reinforcement learning loop to hallucinate goals along unsuccessful trajectories. The RL policy is trained solely on images and without knowledge of relative goal configuration.

The key contributions of this work are to expand the applicability of HER to visual domains by providing a way to retroactively transform failed trajectories into successful ones and hence allow the agent to rapidly generalize across multiple goals using only the state as input to its policy. In this work, we aim to minimize the amount of direct goal specification required and learn RL policies conditioned solely on the agent state image. We believe that the sample complexity reduction that VHER provides is an important step towards being able to train RL policies directly in the real world.

2 Background

Below, we lay out some preliminary information on reinforcement learning and generative models.

2.1 Reinforcement Learning

In reinforcement learning, the agent is tasked with the maximization of some notion of a long term expected reward (Sutton & Barto, 2018). The problem is typically modeled as a Markov decision process (MDP). An MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ is the set of states the agent can exist in, $\mathcal{A}$ is the set of environment actions, $R$ is the function mapping states and actions to a scalar reward, $P$ is the transition function, and $\gamma$ is a discount factor that weighs how important future rewards are versus immediate ones. Stochasticity in the environment can be present in the form of uncertainties in transition or reward.

The agent must learn a policy, $\pi : \mathcal{S} \to \mathcal{A}$, mapping every state to an action. The optimal policy, $\pi^*$, is often the goal of learning. It informs the agent of an action that maximizes the expected value of the sum of future discounted rewards, $\sum_{t} \gamma^t r_t$, starting from any state $s$. This expectation, known as the state value ($V^{\pi}(s)$), is over trajectories experienced under the current policy and environment dynamics. UVFAs (Schaul et al., 2015) approximate value functions with respect to a goal in addition to the state, $V(s, g)$. Goals are drawn from the space $\mathcal{G}$ and are typically represented as desired agent states or configurations of objects in the environment or as desired state images. The optimal policy, $\pi^*$, in this case maximizes the probability of achieving a particular goal, $g$, from any state.

Off-policy RL algorithms can learn an optimal policy using experiences from a behavior policy separate from the optimal policy. In particular, off-policy algorithms can make use of samples collected in the past, leading to more sample efficient learning. An experience replay (Lin, 1992) is typically employed to store past transitions as tuples of $(s_t, a_t, r_t, s_{t+1})$. At every step of training, a minibatch of transitions is sampled from the replay at random and a loss on future expected return is minimized. The off-policy algorithms employing an experience replay we use in this work are Double Deep Q-Networks (DDQN) (Van Hasselt et al., 2016) and Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015).
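As an illustration of the experience replay described above, the following is a minimal sketch rather than the implementation used in this work. The class name `ReplayBuffer`, the `Transition` field names, and the eviction of the oldest transition when the buffer is full are all assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One stored transition (s_t, a_t, r_t, s_{t+1}); field names are illustrative.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity store of past transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently evicts the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._storage.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement within a single minibatch.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

An off-policy learner would call `store` after every environment step and `sample` before every optimization step.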

2.2 Hindsight Experience Replay

HER was shown to achieve speedups in learning in environments where the goal configuration is provided along with the agent state to the policy. The essential idea is to store each trajectory, $s_1, \dots, s_T$, with a number of additional goals along with the originally specified one. An off-policy algorithm employing an experience replay is used to train a UVFA, which learns a policy that generalizes across multiple goals. During replay, the original goals are changed to states that have actually been achieved by the agent in the past.

The reward is also modified retroactively to reflect the new goal being replayed. In particular, HER assumes that every goal, $g \in \mathcal{G}$, can be expressed as a predicate $f_g : \mathcal{S} \to \{0, 1\}$. That is to say, all states can be judged as to whether or not a goal has been achieved in them. Thus, while replaying the trajectory with a surrogate goal $g'$, one can easily reassign rewards along the entire trajectory by evaluating the predicate $f_{g'}$ at every state.

Andrychowicz et al. (2017) report that selecting $g'$ to be a future state from within the same (failed) episode leads to the best results. This training approach forms a sort of implicit curriculum for the agent. In the beginning, it encourages the agent to explore further outwards along trajectories it has visited before. Since the surrogate goal, $g'$, is also explicitly provided to the UVFA policy, it soon learns to also generalize this curriculum over unseen goals. Over time, the agent is able to achieve any goal in $\mathcal{G}$, including the real ones.
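The future-state relabeling described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original HER code: goals are taken to be states themselves, the reward is assumed to be 1 when the predicate holds and 0 otherwise, and the function name, the `goals_per_step` parameter, and the tuple layout are hypothetical.

```python
import random


def relabel_with_future_goals(trajectory, goal_predicate, goals_per_step=4, rng=None):
    """Replay a trajectory with surrogate goals taken from its own future.

    trajectory: list of (state, action, next_state) tuples from one episode.
    goal_predicate(state, goal): True when `goal` is achieved in `state`.
    Returns a list of (state, action, reward, next_state, goal) tuples.
    """
    rng = rng or random.Random()
    relabeled = []
    last_index = len(trajectory) - 1
    for t, (state, action, next_state) in enumerate(trajectory):
        for _ in range(goals_per_step):
            # Surrogate goal g': the outcome of a transition at or after step t.
            future_index = rng.randint(t, last_index)
            surrogate_goal = trajectory[future_index][2]
            # Assumed sparse reward: 1 when the predicate holds, else 0.
            reward = 1.0 if goal_predicate(next_state, surrogate_goal) else 0.0
            relabeled.append((state, action, reward, next_state, surrogate_goal))
    return relabeled
```

Because each surrogate goal is a state the agent actually reached, every replayed trajectory contains at least the possibility of a rewarded transition, which is what makes failed episodes informative.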

2.3 Wasserstein GANs

We employ an improved Wasserstein ACGAN (Gulrajani et al., 2017; Odena et al., 2017) as our generative model because of its stability, realistic looking outputs, and ability to condition the generated images on a desired class. A typical W-ACGAN has a generator, $G$, that takes as input a class variable and a latent vector of random noise. It then generates an image which is fed into the discriminator, $D$. $D$ rates the image on its fidelity to the training data and, as an auxiliary task, predicts class membership. The Earth-Mover distance between the distributions of real, $P_r$, and generated, $P_g$, images is used as a loss to train the combined model. A standard practice in Wasserstein GANs is to train the discriminator multiple times for each generator update. The discriminator begins to act as a critic that rates images on their fidelity to the training data.

Here, it is important to point out that the motivation behind using a GAN in this work is to produce realistic looking hallucinations that will allow the agent to easily generalize from imagined goals to real ones. Realistic insertion of goals was not an issue in HER because a new goal could directly be substituted in a replayed transition without any modification to the states.

3 Related Work

Generative Models in RL. In recent years, generative models have demonstrated significant improvements in the areas of image generation, data compression, denoising, and latent-space representations, among others (Goodfellow et al., 2014; Chen et al., 2016; Vincent et al., 2008). Reinforcement learning has also benefited from incorporating generative models in the training process. Ha & Schmidhuber (2018) synthesize a lot of prior work in the area by proposing a Recurrent Neural Network (RNN) based generative dynamics model (Schmidhuber, 1990) of popular OpenAI gym (Brockman et al., 2016) and VizDoom (Kempka et al., 2016) environments. They employ a fairly common procedure of encoding high dimensional visual inputs from the environment into lower dimensional embedding vectors using a Variational Auto Encoder (VAE) (Kingma & Welling, 2013) before passing them on to the RNN model. Held et al. (2017) use a GAN to generate goals matched in difficulty to an agent’s skill on a task. Called GoalGAN, it generates an automatic curriculum of incrementally harder to reach goals. But it assumes that goals can easily be set in the environment by the agent and does not make efficient use of trajectories that failed to achieve these objectives.

Generative models have also been used in the closely related field of imitation learning to learn from human demonstrations or observation sequences (Ho & Ermon, 2016; Edwards et al., 2018b; Schroecker et al., 2019). In our approach, we do not require demonstrations of the task, or even a sequence of observations, but random snapshots of the goal which we use to speed up reinforcement learning.

Goal Based RL. Some recent work has focused on leveraging information on the goal or surrounding states to speed up reinforcement learning. Edwards et al. (2018a) and Goyal et al. (2018) learn a reverse dynamics model to generate states backwards from the goal which are then added to the agent’s replay buffer. The former work assumes that the goal configuration is known and backtracks from there, whereas in the latter, high-value states are picked from the replay buffer or a GoalGAN is used to generate goals. The latter work also learns an inverse policy, to generate plausible actions leading back from goal states. In contrast, we focus on minimally altering states in existing failed trajectories already in the replay buffer to appear as if a goal has been completed in them. This avoids having to generate entirely new trajectories and allows us to make full use of the environment dynamics already present in previous state transitions.

Others have focused on learning goal-conditioned policies in visual domains using a single or few images of the goal (Xie et al., 2018; Zhu et al., 2017). Nair et al. (2018) train a $\beta$-VAE (Burgess et al., 2018) on state images for a threefold purpose: (1) to sample new goals during training, (2) to use the Euclidean distance between feature encodings of current and goal images as a dense reward, and (3) to retroactively alter goals with VAE generated images and reassign rewards appropriately. The set of goals is assumed to be the same as the set of states and hence they are easy to swap back and forth. This works well for domains where the goal is separately provided to the policy along with the agent state, and where states do not have to be modified for changing goals. In this work, we attempt learning in domains where the goal image is not known beforehand and thus cannot be provided to the agent’s policy, and where the goal may or may not be present in a particular agent state.

4 The missing component in HER

First, we will more formally discuss what is missing from the original HER formulation that does not allow it to readily extend to visual domains. Then, in the next section, we will describe in detail how the use of hallucinatory generative models can help bridge the gap.

HER makes an assumption on the domain that “given a state we can easily find a goal which is satisfied in this state” (Andrychowicz et al., 2017). It requires a mapping, $m : \mathcal{S} \to \mathcal{G}$, that maps every state to a goal that is achieved in that state. While this mapping may be relatively straightforward to hand design for real-valued state spaces, its analog for visual states cannot be constructed easily. For example, if the state space of the agent lies on the plane of real values in $\mathbb{R}^2$, the goal may be to achieve a particular $x$-coordinate. So in the agent state $s = (x, y)$, a goal that is satisfied is simply $m(s) = x$. Now imagine if the agent must instead navigate to a beacon on a 2D plane using camera images as state inputs. In order to convert any arbitrary state into one in which a goal is satisfied, the beacon must be visually inserted into the image itself. We call these goal hallucinations (see figure 2).

In order to fully utilize the power of HER, not only should the agent be able to hallucinate goals in arbitrary states, but also consistently in the same absolute position throughout the failed trajectory. Note that with each step along the trajectory, the position of the goal (a beacon) changes relative to the agent’s, and thus the agent’s observation must be correctly updated to reflect this change. The goal must appear to have been solved in a future state along every step of the trajectory (see figure 1). Only then can we make use of the existing transitions along the entire trajectory for replay with hallucinated as well as original goals. Thus, visual settings require the mapping $m$ to be extended along the entire trajectory, so that it acts on whole trajectories of up to $T$ states, where $T$ is the maximum length of a trajectory, drawn from the space $\mathcal{F}$ of failed trajectories. Every state along a trajectory from $\mathcal{F}$ must be modified by the mapping into a near-goal state that is consistent with the final goal state of that trajectory. This is where this work’s main contribution lies.

It is apparent that the use of UVFAs to generalize over multiple goals, as in HER, does not extend to visual settings where the goal location is unknown and must be identified within the environment. Hence, in this work, the agent’s policy is solely conditioned on its state.

5 Approach

To address the shortcomings of HER, we adopt a two part approach. First, a generative adversarial network (GAN) is trained to modify any existing state from a failed trajectory into a goal or near goal state. We call this model HALGAN. HALGAN generates goal hallucinations conditioned on the configuration of the robot in the current state relative to its configuration in a future state from the same episode. Note that we make use of the assumption that in realistic robotic applications, while it may be difficult to obtain the explicit location of the goal throughout reinforcement learning, one can easily obtain the configuration of the robot relative to itself. This can be done using SLAM or other state tracking techniques (Montemerlo et al., 2002).

Then, during reinforcement learning, random snippets of past failed trajectories are replayed with the final state in the snippet set as the target goal location. The trained HALGAN modifies pairs of states that constitute the transitions along the trajectory to appear as if the goal was indeed achieved by the end of it. Details of the entire hallucinating process are provided in the next few subsections.

5.1 Hallucinating Visual Goals

Figure 2: Hallucinated images generated by our model. The original, failed, image is on the top left. All others include goals generated by HALGAN. The goal distance increases from top to bottom and the angle from left to right. This image demonstrates that, using our training approach, goal hallucinations can be generated with high fidelity in any relative configuration.

HALGAN is trained on a dataset of observations of the goal where its relative location to the agent is explicitly known (the near-goal dataset). These snapshots of the goal can be collected beforehand and are only used once to train the generative model. HALGAN then generalizes to create thousands of hallucinations along failed trajectories during reinforcement learning. These failed trajectories are ones the agent has taken in the past and are stored in its experience replay.

In order to ‘fool’ the agent into thinking that it has indeed achieved a goal, one has to insert the goal into the final image of that trajectory snippet. Thus, the state at the end of a trajectory has to be modified such that it appears as if the goal were achieved in it. This is in contrast to the regular HER approach or the approach by Nair et al. (2018), where the state can be directly mapped to a goal using the hand designed mapping $m$.

During learning, a snippet of a failed trajectory in the agent’s experience replay is sampled randomly. Along with the final state of the snippet, $s_T$, the other states in the trajectory leading up to it, $s_t$ for $t < T$, must also be modified to appear as if the goal were indeed accomplished in $s_T$. For this, the hallucinated goal location must remain consistent throughout the replayed trajectory. In the following subsections, we describe each component of HALGAN and then show how it fits together to generate consistent hallucinations of the goal.

Figure 3: A conditioning vector, $c$, informs the generator, $G$, of the desired relative location of the goal. $z$ is a randomly drawn noise vector. The generated goal image is added to a failed state and then passed through a renormalizing function. This is the final hallucinated state with the goal positioned as desired. $G$ is trained adversarially along with the discriminator, $D$, which is learning to rate the fakes and real near goal images from the near-goal dataset. $D$ also predicts relative goal configurations in real and fake images, which in turn incentivizes $G$ to hallucinate goals in the correct relative locations.

5.2 Minimal Hallucinations

One of our aims is to minimally alter a failed trajectory in order to turn its states into goal or near-goal states. This makes full use of existing trajectories and does not require HALGAN to re-imagine the environment dynamics or unnecessary details about the goal state such as the background.

To this end, we train an additive model, such that the generator, $G$, has to produce only differences to the state image that add in the goal. To obtain a hallucinated image, $\tilde{s}$, with the goal at the final state of the trajectory, we compute,

$$\tilde{s} = \Phi\big(s + G(z, c)\big) \qquad (1)$$

where $s$ is the original failed state image, $G$ is the generative model function, $c$ is the relative configuration of the robot to a desired goal state, and $z$ is a random latent conditioning vector. $\Phi$ is used to re-normalize the hallucinated state image to the valid image range. Any differentiable bounded function can be used for this purpose. The hallucinated state, $\tilde{s}$, along with a state sampled from the near-goal dataset, is then fed to the discriminator, $D$, to compute the discriminative loss,

$$L_D = \mathbb{E}_{\tilde{s} \sim P_g}\big[D(\tilde{s})\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] \qquad (2)$$

where $P_g$ and $P_r$ are the hallucinated and real near goal image distributions.

In addition to the discriminator image loss, a gradient penalty is employed, as in the improved training of Wasserstein GANs (see Gulrajani et al. (2017) for more details),

$$L_{gp} = \mathbb{E}_{\hat{s}}\Big[\big(\lVert \nabla_{\hat{s}} D(\hat{s}) \rVert_2 - 1\big)^2\Big] \qquad (3)$$

where $\hat{s}$ is sampled uniformly along straight lines between pairs of real and hallucinated images.

As a result of generating only image differences, the trained hallucinatory model is invariant to some kinds of visual variations, such as background, presence of other objects, etc. Note that the generator is conditioned neither on the current failed state nor on the end state of the trajectory. It is conditioned only on the agent’s relative configuration to the desired goal state. While this may lead to some awkward goal hallucinations, we found that in practice it did not influence the learning noticeably.
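The additive hallucination step of this section can be illustrated with a small sketch. It assumes images are flat lists of pixel intensities and uses `tanh` as the bounded re-normalizing function; both choices, along with the function name `hallucinate_state`, are assumptions for the example and are not taken from the text. The difference image stands in for the generator output.

```python
import math


def hallucinate_state(failed_image, goal_difference):
    """Add a generated difference image to a failed state image.

    Both arguments are flat lists of pixel intensities of equal length.
    """
    if len(failed_image) != len(goal_difference):
        raise ValueError("image and difference must have the same size")
    # tanh is one differentiable bounded choice; it maps every sum into (-1, 1).
    # Note that it also compresses pixels the generator left unchanged.
    return [math.tanh(pixel + delta) for pixel, delta in zip(failed_image, goal_difference)]
```

Because the generator output is only added to the existing image, pixels where the difference is zero depend solely on the original failed state, which is why the background does not need to be modeled.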

To encourage the model to generate minimal modifications to the original failed image, we also add a norm loss on the output of the generator, $G$. In our experiments, this helped in discouraging the generator from focusing on unnecessary elements of goals such as background information or extra objects in the environment.

$$L_{norm} = \lVert G(z, c) \rVert \qquad (4)$$
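The discriminator objective of this section, a Wasserstein term plus a gradient penalty, can be sketched numerically as below. This is a toy illustration assuming the standard form from Gulrajani et al. (2017), not the training code of this work: the critic is a hypothetical one-layer `tanh` model so that its input gradient can be written analytically, and `penalty_weight=10.0` is only an illustrative default.

```python
import math
import random


def critic(weights, image):
    """Toy critic D(x) = tanh(w . x) on a flat image."""
    return math.tanh(sum(w * x for w, x in zip(weights, image)))


def input_gradient_norm(weights, image):
    """L2 norm of dD/dx for the toy critic, derived analytically."""
    score = critic(weights, image)
    return (1.0 - score ** 2) * math.sqrt(sum(w * w for w in weights))


def critic_loss(weights, real_batch, fake_batch, penalty_weight=10.0, rng=None):
    """Wasserstein critic loss with gradient penalty for equally sized batches."""
    if len(real_batch) != len(fake_batch):
        raise ValueError("batches must have the same size")
    rng = rng or random.Random()
    n = len(real_batch)
    # Critic prefers low scores on hallucinated images and high scores on real ones.
    wasserstein = (sum(critic(weights, x) for x in fake_batch)
                   - sum(critic(weights, x) for x in real_batch)) / n
    penalty = 0.0
    for real, fake in zip(real_batch, fake_batch):
        # Random point on the straight line between a real and a fake image.
        eps = rng.random()
        mixed = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
        penalty += (input_gradient_norm(weights, mixed) - 1.0) ** 2
    return wasserstein + penalty_weight * penalty / n
```

In practice the gradient would come from automatic differentiation in a deep learning framework; the analytic gradient here only keeps the sketch dependency-free.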

5.3 Regression Auxiliary Task

Typical ACGANs are conditioned on a discrete set of classes, such as flower, dog, etc. (Odena et al., 2017). But to be useful for reinforcement learning along the failed trajectory, the generator must be conditioned on the relative configuration of the agent from the desired goal state, which is a real-valued vector $c$. The auxiliary task for the discriminator then is to regress the real valued relative location of the goal seen in a training image. To train this regression based auxiliary task, we use a mean squared error loss,

$$L_{aux} = \mathbb{E}\big[\lVert c - \hat{c} \rVert_2^2\big] \qquad (5)$$

where $c$ is the true relative configuration and $\hat{c}$ is the relative configuration predicted by the discriminator, $D$. We found it helpful to add a small amount of Gaussian noise to our auxiliary inputs for robust training, especially on smaller datasets.
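The regression auxiliary task and the noise added to the auxiliary inputs can be sketched as follows. The function names and the representation of a batch as a list of configuration lists are assumptions for the example, and averaging over all components is one reasonable reading of a mean squared error.

```python
import random


def add_gaussian_noise(configs, std, rng=None):
    """Perturb each component of each relative configuration with N(0, std^2) noise."""
    rng = rng or random.Random()
    return [[value + rng.gauss(0.0, std) for value in config] for config in configs]


def auxiliary_regression_loss(target_configs, predicted_configs):
    """Mean squared error over all components of all configurations in the batch."""
    squared_errors = [
        (target - predicted) ** 2
        for target_config, predicted_config in zip(target_configs, predicted_configs)
        for target, predicted in zip(target_config, predicted_config)
    ]
    return sum(squared_errors) / len(squared_errors)
```

A usage pattern would be to pass `add_gaussian_noise(configs, std)` as the conditioning input while still regressing against the clean `configs`; whether the targets are also perturbed is not specified here.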

5.4 HALGAN

Our final loss for the combined HALGAN is a weighted sum of the losses described above (equation 6), where the three weights are hyperparameters whose values are kept fixed in all our experiments.

To summarize, the training process is as follows. The generator, conditioned on a randomly drawn relative goal location, produces a difference image which is then added to a randomly selected image from a failed trajectory to create a goal hallucination. The discriminator is provided with these hallucinated images as well as ground truth images from the near-goal dataset and has to score the images on their authenticity and also predict the auxiliary variable. See figure 3 for a representation of the HALGAN training process and the appendix for more details on the network architectures and training procedure.

For the purposes of our experiments, we collect the training data for HALGAN by using the last few states of a successful rollout, in this case, a demonstration. Note that the exact data required are randomly selected snapshots from near the goal, together with the final agent configuration in which the goal is achieved, which is used to calculate relative poses. Only observations, including the state image and agent configuration, are used; no actions have to be provided or demonstrated. This alleviates the data collection burden, as the human does not have to demonstrate the optimal completion of the task and snapshots can be collected in any order. For example, it is significantly simpler to record the desired final configuration of objects on a table than to record a full, optimal demonstration of a robot arm arranging them. It also allows the generative model to be independent of the agent and demonstrator action spaces. We also collect a dataset of failed trajectories using random exploration. These are used during HALGAN training to add to the output of the generator and create hallucinated near goal states. Most off-policy RL methods that employ an experience replay have a replay warmup period where actions are taken randomly to fill the replay to a minimum before training begins. This dataset of failed trajectories can be the same as the replay warmup, so no extra exploration is required.

5.5 Visual HER

During reinforcement learning, the agent explores its environment as normal. Every time a batch is sampled for training, a few of the data points from it are augmented with goal hallucinations. The detailed process is explained in algorithm 1. The result is that the agent encounters hallucinated near goal states with a much higher frequency than if it were randomly exploring. This in turn encourages the agent to explore further from near goal states.

Algorithm 1 Visual Hindsight Experience Replay
Given: trained hallucinatory model $H$, reward reassignment strategy $r_h$.
Initialize off-policy algorithm $\mathbb{A}$. {e.g. DDQN, DDPG}
Initialize experience replay $R$ by random exploration.
for step $= 1, \dots, N$ do
    Sample an action $a_t$ according to the behavior policy in the current state $s_t$.
    Execute $a_t$ in the environment and observe state $s_{t+1}$, reward $r_t$.
    Store tuple $(s_t, a_t, r_t, s_{t+1})$ in $R$.
    Sample minibatch $B$ from $R$ for training.
    for $(s_i, a_i, r_i, s_{i+1})$ in $B$ do
        Sample $p \sim U(0, 1)$ {hallucination prob.}
        if $p$ is below the hallucination threshold then
            Sample $k$ {distance to goal state}
            Compute relative configurations $c_i$ and $c_{i+1}$. {Setting $s_{i+k}$ as the goal state}
            Hallucinate the goal in $s_i$ using $H$ conditioned on $c_i$.
            Hallucinate the goal in $s_{i+1}$ using $H$ conditioned on $c_{i+1}$.
            Reassign the reward $r_i$ using $r_h$.
        end if
    end for
    Perform one step of optimization using $\mathbb{A}$ on the modified minibatch $B$.
end for
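The inner loop of algorithm 1 can be sketched in code as follows. This is a simplified illustration under several assumptions: episodes are lists of step dictionaries with hypothetical keys `image`, `config`, `action`, and `reward`; the relative configuration is taken to be a plain component-wise difference; and `hallucinate_fn` and `reward_fn` are stand-ins for the trained HALGAN and the reward reassignment strategy.

```python
import random


def relative_config(goal_config, agent_config):
    # Assumed form: component-wise difference between goal and agent configuration.
    return tuple(g - a for g, a in zip(goal_config, agent_config))


def augment_minibatch(minibatch, episodes, hallucinate_fn, reward_fn,
                      hallucination_prob, max_lookahead, rng=None):
    """Replace some sampled transitions with hallucinated near-goal versions.

    minibatch: list of (episode_id, t) pairs; transition t goes from step t to t + 1.
    Returns a list of (state, action, reward, next_state) tuples.
    """
    rng = rng or random.Random()
    augmented = []
    for episode_id, t in minibatch:
        episode = episodes[episode_id]
        step, next_step = episode[t], episode[t + 1]
        state, next_state = step["image"], next_step["image"]
        reward = step["reward"]
        if rng.random() < hallucination_prob:
            # Pick a future state of the same episode as the surrogate goal state.
            k = rng.randint(1, max_lookahead)
            goal_step = episode[min(t + k, len(episode) - 1)]
            rel_now = relative_config(goal_step["config"], step["config"])
            rel_next = relative_config(goal_step["config"], next_step["config"])
            # Hallucinate the goal consistently in both states of the transition.
            state = hallucinate_fn(state, rel_now)
            next_state = hallucinate_fn(next_state, rel_next)
            reward = reward_fn(rel_next)
        augmented.append((state, step["action"], reward, next_state))
    return augmented
```

Conditioning both hallucinations on configurations relative to the same future state is what keeps the goal at a consistent absolute position across the transition.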

An important consideration is the retroactive reassignment of rewards. As a reminder, HER uses a manually defined function, $f_g$, which decides if the goal is satisfied in a state, to designate rewards during hindsight replay. This sort of retroactive reward function is hard to hand design in visual environments. Comparing state and goal images pixel by pixel is typically ineffective. Fortunately, for the purposes of reward reassignment during hindsight replay, one need only compare the agent state to a future one in the same episode. Hence, we assume the existence of a similar function, $f(s, s')$, which decides whether a pair of states are the same for the purpose of goal completion. This sort of function is also difficult to hand specify for visual states because of the above mentioned difficulties in pixel-by-pixel comparisons. As mentioned in section 3, Nair et al. (2018) use a trained $\beta$-VAE as this function to reassign rewards in a dense manner. Here, we make use of the access to the robot’s own configuration to design a similar function over configurations, where $c_s$ is the robot configuration at a particular state $s$. We then assume that any goal satisfied in $s$ must also be satisfied in any other state with a similar enough configuration. During retroactive reward reassignment, we compare the relative configurations in the current and future goal state, and hallucinate a reward if they are similar. We also compare against the distance metric employed by Nair et al. (2018), which did not perform as well as using the agent configuration in our experiments.
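A configuration-based reward reassignment of the kind described above might look like the following sketch. The Euclidean distance, the `tolerance` threshold, and the reward values are assumptions for illustration; the text only states that a reward is hallucinated when the configurations are similar enough.

```python
import math


def hindsight_reward(agent_config, goal_config, tolerance=0.1, success_reward=1.0):
    """Return a sparse reward when two robot configurations are close enough.

    tolerance and success_reward are illustrative values, not those of the paper.
    """
    distance = math.dist(agent_config, goal_config)
    return success_reward if distance <= tolerance else 0.0
```

For example, `hindsight_reward((0.0, 0.0), (0.05, 0.0))` rewards the transition, while a configuration one unit away does not.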

6 Experiments

We test our method on two first person visual environments. In a modified version of MiniWorld (Chevalier-Boisvert, 2018), we design two tasks. The first one is to navigate to a red box located in an enclosed room (figure 4). The second task is a pick-and-place variant for first person 3D environments, where the agent must navigate to the red box, visually center it to pick it up and then carry it to a green box somewhere else in the room (see figure 4).

The second environment is a more visually realistic simulated robotics domain, where a TurtleBot2 (Wise & Foote, ) equipped with an RGB camera is simulated within Gazebo (Koenig & Howard, 2004). We use gym-gazebo (Zamora et al., 2016) to interface with Gazebo. In this environment, the agent must collect a pebble scattered randomly on a road by approaching and centering it in its visual field (figure 4). The episode ends and the agent is reset to the starting location if it wanders too far. Episodes also end after 400 steps or upon completion of the goal.

Figure 4: Example of a near goal state in Turtlebot (left) and MiniWorld navigate (center) and pick-and-place (right) environments.

Figure 4 depicts near goal states in all of our tasks. The goal is randomly spawned a small distance away from the agent. Encountering the goal is extremely rare and standard RL is sample inefficient or completely ineffective. The size of the near goal dataset for the Turtlebot, navigation and pick-and-place tasks is 6840, 2000, and 6419 images with relative goal configurations respectively. However, we show that reducing the number of near goal states leads to little performance degradation in the Turtlebot environment (figure 6).

In the Turtlebot and MiniWorld navigation tasks, the configuration of the agent is simply its pose. In pick-and-place, an additional binary field indicates whether the red box is held by the agent. The agent’s relative configuration is calculated with respect to the red box before it is picked up, and the green box afterwards. Hallucinations are generated for the agent approaching both boxes. In these tasks, we found it helpful to anneal the amount of hallucinations in a batch over time as the agent starts filling the replay with real reward. Details of the annealing rate and other experimental hyperparameters are provided in the appendix.
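One simple way to anneal the amount of hallucinations is a linear schedule, sketched below. The linear shape, the function name, and all parameter names are assumptions for illustration; the actual annealing rate is given in the appendix and is not reproduced here.

```python
def hallucination_fraction(step, start_fraction, end_fraction, anneal_steps):
    """Linearly interpolate the fraction of hallucinated samples in a batch.

    The schedule is clamped, so it stays at end_fraction after anneal_steps.
    """
    progress = min(max(step / anneal_steps, 0.0), 1.0)
    return start_fraction + progress * (end_fraction - start_fraction)
```

The returned fraction could be used directly as the per-sample hallucination probability when augmenting a minibatch.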

Comparisons. We compare our approach against a few extensions of prior work into the visual domain where goals are not provided to the policy explicitly. We also compare against standard model-free RL baselines. A naive extension of HER into the visual domain, naive-HER, simply rewards the agent for states at the end of failed trajectories during replay without hallucinating. Hence, the agent receives hindsight rewards, but the sampled trajectories still seem to end in failures. This is an ablation of our approach where the effect of removing HALGAN from the training procedure is tested.

A second baseline is derived from Nair et al. (2018)’s work (RIG) in training goal-conditioned policies with a dense reward based on the distance between the embedding of the sampled state and that of a goal image. RIG’s retroactive reassignment of goals relies on the use of UVFAs, which is not possible for our domains where the goal image is unknown. Therefore we test two variants of this baseline where we attempt to find a suitable comparison. We first train a VAE on the exact data available to HALGAN, i.e. near goal images in and failed state images collected by random exploration. Then, during RL, vae-her simply sets the final image in a failed trajectory, without any hallucinations, as the goal and uses the trained VAE to compute reward for a transition along that trajectory. This baseline evaluates the effectiveness of dense reward reassignment in our domains without the use of hallucinations from HALGAN.

rig- follows a similar dense reward reassignment strategy, but computes the distance of a state to a randomly sampled goal image in . Goal images are identified in by filtering for the relative configuration of the agent from the goal being zero. Hence, rig- rewards the agent for being in states that look similar to goal states in retrospect, without employing any hallucinations. For the distance-based rewards provided by the VAE in rig- to be of the same order of magnitude as the environment rewards, it was necessary to re-scale them. The scaling factor in all our experiments was set to .
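The difference between the two dense-reward baselines is only the choice of goal embedding, which a short sketch makes concrete. It assumes the reward is the negative Euclidean distance between VAE embeddings multiplied by a scaling factor; the distance metric, the function names, and the `scale` argument are assumptions, and embeddings are plain lists of floats standing in for encoder outputs.

```python
import math
import random


def dense_reward(z_state, z_goal, scale):
    """Negative scaled distance between two embeddings (assumed Euclidean)."""
    return -scale * math.dist(z_state, z_goal)


def vae_her_rewards(trajectory_embeddings, scale):
    """vae-her: the goal is the final state of the failed trajectory itself."""
    z_goal = trajectory_embeddings[-1]
    return [dense_reward(z, z_goal, scale) for z in trajectory_embeddings]


def rig_rewards(trajectory_embeddings, goal_embeddings, scale, rng=random):
    """rig-: the goal is a randomly sampled real goal image's embedding."""
    z_goal = rng.choice(goal_embeddings)
    return [dense_reward(z, z_goal, scale) for z in trajectory_embeddings]
```

In both cases the state images are left as observed; only the reward signal changes.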

Figure 5: In all tasks, VHER starts learning immediately, whereas the baselines need to explore far more to randomly encounter positive rewards. In the TurtleBot pebble collection task (left), all algorithms eventually learn an optimal policy, but VHER begins learning immediately and converges quickly. In the harder, continuous control MiniWorld navigate task (middle), neither DDPG nor naive-HER is able to learn to complete the task. Only the rig- baseline eventually learns the task to some extent, on three of the five random seeds. In the final pick-and-place task, only VHER learns the optimal policy, in four out of five random seeds.

Discrete and Continuous Control. An advantage of our method is that HALGAN is agnostic to the agent’s action space. As a result of conditioning on the relative location of the robot to a state in the future, we are freeing the model of any assumptions of how the robot actually gets there.

In the discrete TurtleBot environment, only a sparse reward is used to indicate completion of a goal. The action space is back and forth movement and turning (4 actions). The base off-policy algorithm used is Double DQN (Van Hasselt et al., 2016). For the continuous MiniWorld environments, a penalty on the norm of the output actions is applied at each step to simulate an energy cost per step. Otherwise, the agent is only provided the sparse task completion reward. The output actions are the linear and rotational velocities of the agent at the next step, capped at a fixed amount. The base algorithm used in this setting is DDPG (Lillicrap et al., 2015). We employ deep convolutional neural networks as function approximators that take in the state image as input and output the desired control actions or values.
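The reward structure of the continuous setting can be sketched as follows. This is a hedged illustration: the use of the Euclidean norm, the penalty coefficient `penalty_coef`, the cap `max_magnitude`, and the goal reward of 1.0 are all assumptions, since the exact values are not given in this section.

```python
import math


def clip_action(action, max_magnitude):
    """Cap each velocity component (linear, rotational) at a fixed magnitude."""
    return tuple(max(-max_magnitude, min(max_magnitude, a)) for a in action)


def step_reward(action, reached_goal, penalty_coef, goal_reward=1.0):
    """Sparse task reward minus an energy cost proportional to the action norm."""
    linear, rotational = action
    energy_cost = penalty_coef * math.hypot(linear, rotational)
    return (goal_reward if reached_goal else 0.0) - energy_cost
```

Under this structure an agent that never reaches a goal maximizes its return by outputting zero actions, which is the kind of degenerate policy a sparse reward with a movement penalty can produce.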

7 Results

In all of our experiments, VHER begins learning immediately (figure 5). This is because the realistic-looking hallucinated goals are quickly identified as desirable states. Standard RL, by contrast, rarely encounters reward and must explore at length before randomly encountering the rewards that start the learning process, if it encounters them at all.

In the discrete TurtleBot pebble collection domain (figure 5), the naive HER strategy provides a good enough exploration bonus for the agent to explore further and quicker than standard DDQN. It begins learning by 100K steps. VHER, by contrast, starts learning to navigate to real goals immediately.

For the continuous control experiments in MiniWorld (figure 5, middle and right), only VHER is able to learn to complete the task. Note that achieving a reward of in this environment is relatively easy; only positive rewards indicate that the goal has been achieved. DDPG never encounters any reward during exploration and hence learns to simply minimize its actions in order to avoid the movement penalty. Naive her initially explores heavily and hence incurs a heavy penalty, but does not learn to associate the rewards it receives with the presence of a goal. Some of the random seeds eventually converge to the same degenerate policy as DDPG. vae-her, the augmentation of her with dense rewards from a trained VAE, also proves unsuccessful on both tasks, demonstrating that dense rewards without hallucinated or real goals in failed trajectories are also ineffective for learning in these domains. Only the rig- strategy of providing dense rewards relative to random goal images eventually learns to complete the navigation task for some of the seeds. For the pick-and-place task, rig- only learns a working policy on a single seed, and the other baselines perform similarly or worse. Interestingly, rig’s dense reward reassignment can be readily combined with our approach of state modification by hallucination, providing directions for future work.

Figure 6: Reinforcement learning using VHER in the TurtleBot task with varying size of the training dataset for HALGAN. The curves being similar is a positive result that shows only minor variance of RL agent performance with the training data available for HALGAN, from 6800 (original) down to 1000 near-goal training samples.

Finally, in figure 6, we show the change in performance on the TurtleBot pebble collection task due to using fewer training samples in . The effect is only slightly slower learning, even for the much smaller dataset of only 1000 images. The minimalistic hallucinations created by HALGAN require a relatively small amount of data to train well enough to provide a significant boost in reinforcement learning.

8 Discussion

A major impediment to training RL agents in the real world is the amount of data an agent must collect and process before it can start drawing inference on which actions lead to rewards and which ones are to be avoided. High sample complexity makes problems such as fragility of physical systems, energy consumption, speed of robots and sensor errors manifest themselves acutely when one attempts running the reinforcement learning process in the real world.

In this work, we have shown that Hindsight Experience Replay can be extended to visual scenarios where the goal location is not explicitly known beforehand, as is common in many realistic applications. We show empirically that by hallucinating goals along failed trajectories, the agent can begin learning to solve tasks immediately. VHER converges faster than standard RL techniques, which flounder before encountering rewards and, in complex tasks, fail to find a working policy at all. VHER requires relatively few snapshots of near-goal images with known goal configurations. In certain environments, this dataset could be generated online as the agent learns, or supplied from orthogonal techniques such as GoalGAN (Held et al., 2017). We leave this as an avenue for future work.

9 Acknowledgements

We would like to thank the entire OffWorld team for their enthusiastic support of this work. Special thanks to Ashish Kumar for help in the setup of experiments and for many hours of fruitful discussions.

References

Appendix A Experimental Hyperparameters

Refer to the table below for environment specific hyperparameters.

| Hyperparameter | TurtleBot | MiniWorld Navigate | MiniWorld Pick-and-Place |
|---|---|---|---|
| Replay Warmup | 10,000 | 10,000 | 10,000 |
| Replay Capacity | 100,000 | 100,000 | 100,000 |
| Initial Exploration | 1.0 | 1.0 | 1.0 |
| Final Exploration | 0.5 | 0.5 | 0.5 |
| Anneal Steps | 100,000 | 100,000 | 250,000 |
| Discount (γ) | 0.99 | 0.99 | 0.99 |
| Off-Policy Algorithm | DDQN | DDPG | DDPG |
| Policy Optimizer | ADAM | ADAM | ADAM |
| Learning Rate | | (Actor), (Critic) | (Actor), (Critic) |
| Size of for HALGAN | 6,840 | 2,000 | 6,419 |
| Hallucination Start % | 20% | 30% | 30% |
| Hallucination End % | 0% | 0% | 0% |
| Max Failed Trajectory Length | 16 | 32 | 16 |
| Image size | 64x64 | 64x64 | 64x64 |
| Random Seeds | 75839, 69045, 47040 | 75839, 69045, 47040, 60489, 11798 | 75839, 69045, 47040, 60489, 11798 |

Table 1: Environment Specific Hyperparameters

Refer to the table below for HALGAN specific hyperparameters.

| Hyperparameter | Value |
|---|---|
| Latent Vector Size | 128 |
| Latent Sampling Distribution | |
| Auxiliary Task Weight | 10 |
| Gradient Penalty Weight | 10 |
| loss on Weight | 1 |
| Optimizer | ADAM |
| Learning Rate | |
| Adam β1 | 0.5 |
| Adam β2 | 0.9 |
| Iters per Iter | 5 |

Table 2: Hyperparameters involved in training HALGAN

Appendix B Network Architectures

Refer to the table below for details on the network architecture for DDQN. LeakyReLU activations were used throughout, except for the output layer, where no activation was used.

| Layer | Shape | Filters | #params |
|---|---|---|---|
| Image Input | 64x64 | 3 | 0 |
| Conv 1 | 5x5 | 4 | 304 |
| Conv 2 | 5x5 | 8 | 808 |
| Conv 3 | 5x5 | 16 | 3216 |
| Conv 4 | 5x5 | 32 | 12832 |
| Dense 1 | 32 | - | 16416 |
| Dense 2 | 4 () | - | 132 |
| Total | - | - | 33708 |

Table 3: Network Architecture for DDQN Agent
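The parameter counts in Table 3 follow from the standard formulas for convolutional and dense layers with biases, as the short check below illustrates. The flattened feature size of 512 fed into Dense 1 is an inference from the listed count (16416 = 512 × 32 + 32), not a value stated in the table; the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square convolution kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


# Channel progression from Table 3: RGB input, then 4, 8, 16, 32 filters.
channels = [3, 4, 8, 16, 32]
conv_counts = [conv_params(5, c_in, c_out)
               for c_in, c_out in zip(channels, channels[1:])]

flattened = 512  # assumption inferred from the Dense 1 parameter count
dense_counts = [dense_params(flattened, 32), dense_params(32, 4)]

total = sum(conv_counts) + sum(dense_counts)
```

With these inputs, `conv_counts`, `dense_counts`, and `total` reproduce the Conv 1 to Conv 4, Dense 1 to Dense 2, and Total rows of Table 3.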

Refer to the table below for details on the network architecture of the DDPG actor. LeakyReLU activations were used throughout, except for the output layer, where a tanh was used.

| Layer | Shape | Filters | #params |
|---|---|---|---|
| Image Input | 64x64 | 3 | 0 |
| Conv 1 | 5x5 | 4 | 304 |
| Conv 2 | 5x5 | 8 | 808 |
| Conv 3 | 5x5 | 16 | 3216 |
| Conv 4 | 5x5 | 32 | 12832 |
| Dense 1 | 32 | - | 16416 |
| Dense 2 | 2 () | - | 66 |
| Total | - | - | 33642 |

Table 4: Network Architecture for DDPG Actor

Refer to the table below for details on the network architecture of the DDPG critic. LeakyReLU activations were used throughout, except for the output layer, where no activation was used.

| Layer | Shape | Filters | #params |
|---|---|---|---|
| Image Input | 64x64 | 3 | 0 |
| Conv 1 | 5x5 | 4 | 304 |
| Conv 2 | 5x5 | 8 | 808 |
| Conv 3 | 5x5 | 16 | 3216 |
| Conv 4 | 5x5 | 32 | 12832 |
| Dense 1 | 32 | - | 16416 |
| Dense 2 | 1 | - | 33 |
| Total | - | - | 33673 |

Table 5: Network Architecture for DDPG Critic

Refer to the table below for details on the network architecture of the generator in HALGAN. LeakyReLU activations were used throughout, except immediately after the conditioning layer, where no activation was used, and at the output, where tanh was used.

| Layer | Shape | Filters | #params |
|---|---|---|---|
| Config Input | 3 | - | 0 |
| Dense 1 | 128 | - | 384 |
| Conditioning Input | 128 | - | 0 |
| Multiply | 128 | - | 0 |
| Reshape | 1x1 | 128 | 0 |
| UpSample + Conv 1 | 4x4 | 64 | 131136 |
| BatchNorm | 2x2 | 64 | 256 |
| UpSample + Conv 2 | 4x4 | 64 | 65600 |
| BatchNorm | 4x4 | 64 | 256 |
| UpSample + Conv 3 | 4x4 | 64 | 65600 |
| BatchNorm | 8x8 | 64 | 256 |
| UpSample + Conv 4 | 4x4 | 32 | 32800 |
| BatchNorm | 16x16 | 32 | 128 |
| UpSample + Conv 5 | 4x4 | 32 | 16416 |
| BatchNorm | 32x32 | 32 | 128 |
| UpSample + Conv 6 | 4x4 | 16 | 8208 |
| BatchNorm | 64x64 | 16 | 64 |
| Conv 7 | 4x4 | 8 | 2056 |
| BatchNorm | 64x64 | 8 | 32 |
| Conv 8 | 4x4 | 3 | 387 |
| Total | - | - | 323707 |

Table 6: Network Architecture for HALGAN Generator

Refer to the table below for details on the network architecture of the discriminator in HALGAN. LeakyReLU activations were used throughout, except at the output, where no activation was used.

| Layer | Shape | Filters | #params |
|---|---|---|---|
| Image Input | 64x64 | 3 | 0 |
| Conv 1 | 4x4 | 32 | 1568 |
| Conv 2 | 4x4 | 32 | 16416 |
| Conv 3 | 4x4 | 32 | 16416 |
| Conv 4 | 4x4 | 64 | 32832 |
| Conv 5 | 4x4 | 64 | 65600 |
| Conv 6 | 4x4 | 64 | 65600 |
| Conv 7 | 4x4 | 128 | 131200 |
| Dense (Aux) | 2 | - | 129 |
| Dense (real/fake) | 1 | - | 258 |
| Total | - | - | 330019 |

Table 7: Network Architecture for HALGAN Discriminator