Deep reinforcement learning (RL) has recently demonstrated success in a range of previously unsolved tasks, from playing Atari and Go on a superhuman level (Mnih et al., 2015; Silver et al., 2017) to learning control policies for real robotics tasks (Levine et al., 2016; OpenAI, 2018; Pinto et al., 2017). However, deep RL algorithms are highly sample inefficient on complex tasks, and learning from sparse rewards can be challenging. In these settings, millions of steps are wasted exploring trajectories that yield no learning signal. On the other hand, providing dense rewards along these trajectories is a tedious job that requires substantial domain knowledge and RL expertise. Ill-specified shaping rewards can also lead to unexpected ‘hacking’ behaviour (Ng et al., 1999; Randløv & Alstrøm, 1998).
Therefore, an important direction for RL research is towards more sample efficient methods that minimize the number of environment interactions, yet can be trained using only sparse rewards. To this end, Andrychowicz et al. (2017) introduced the idea of Hindsight Experience Replay (HER), which can rapidly train a goal-conditioned policy by retroactively imagining failed trajectories as successful ones. By making use of failed attempts to increase sample efficiency, HER was able to learn a range of robotics tasks that traditional RL methods were unable to solve. But HER was only shown to work in non-visual environments, where the precise goal configuration is provided to the agent’s policy throughout training and where it is straightforward to find a goal that is satisfied in any state. It is not directly applicable to challenging visual domains resembling real world applications, where the goal location is not explicitly known and must be searched for within the environment.
Yet, we desire for RL agents to quickly learn to operate in the high-dimensional visual environments that humans inhabit. In HER, Andrychowicz et al. (2017) employed a goal conditioned policy using universal value function approximators (UVFAs) (Schaul et al., 2015) to generalize over multiple goals. Some recent work has extended that to visual goal conditioned policies (Nair et al., 2018) where goals are sampled from the set of possible agent states. But there is a wide range of visual tasks where we do not have an explicit representation of a goal beforehand and where a state may not easily map to a goal. Thus, we would like the agent to be able to perform visual tasks without providing it an exact specification of the goal during execution and instead have it search for the goal in its environment. For this, the agent must be able to infer the presence of goals from the state image itself. Without a direct goal specification, the agent must also learn to generalize over multiple goals just from its state.
To address high sample complexity of RL in such visual environments, we introduce Visual Hindsight Experience Replay (VHER), which combines a hallucinatory generative model with HER to rapidly solve tasks using only raw pixels in the state as input to the agent policy. The hallucinatory generative model, HALGAN, minimally alters images in snippets of failed trajectories to appear as if the desired goal is achieved at the end. In order to retroactively hallucinate success in a visual environment, it is necessary to alter the state images along the failed trajectory to make it appear as if the goal was present throughout (see figure 1). HALGAN is trained using a few snapshots of near goal images, where the relative location of the agent to the goal is known. It is then combined with HER during the reinforcement learning loop to hallucinate goals along unsuccessful trajectories. The RL policy is trained solely on images and without knowledge of relative goal configuration.
The key contributions of this work are to expand the applicability of HER to visual domains by providing a way to retroactively transform failed trajectories into successful ones and hence allow the agent to rapidly generalize across multiple goals using only the state as input to its policy. In this work, we aim to minimize the amount of direct goal specification required and learn RL policies conditioned solely on the agent state image. We believe that the sample complexity reduction that VHER provides is an important step towards being able to train RL policies directly in the real world.
Below, we lay out some preliminary information on reinforcement learning and generative models.
2.1 Reinforcement Learning
In reinforcement learning, the agent is tasked with the maximization of some notion of a long term expected reward (Sutton & Barto, 2018). The problem is typically modeled as a Markov decision process (MDP). An MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, where $\mathcal{S}$ is the set of states the agent can exist in, $\mathcal{A}$ is the set of environment actions, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the function mapping states and actions to a scalar reward, $T$ is the transition function, and $\gamma$ is a discount factor that weighs how important future rewards are versus immediate ones. Stochasticity in the environment can be present in the form of uncertainties in transition or reward.
The agent must learn a policy, $\pi : \mathcal{S} \to \mathcal{A}$, mapping every state to an action. The optimal policy, $\pi^*$, is often the goal of learning. It informs the agent on an action that typically maximizes the expected value of the sum of future discounted rewards, $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$, starting from any state $s$. This expectation, known as the state value ($V(s)$), is over trajectories experienced under the current policy and environment dynamics. UVFAs (Schaul et al., 2015) approximate value functions with respect to a goal in addition to the state, $V(s, g)$. Goals are drawn from the space $\mathcal{G}$ and are typically represented as desired agent states or configurations of objects in the environment or as desired state images. The optimal policy, $\pi^*(s, g)$, in this case maximizes the probability of achieving a particular goal, $g$, from any state.
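As a small illustration of the discounted sum of rewards discussed above, the following sketch computes the return of a finite reward sequence. The function name and the example values are illustrative assumptions and are not taken from this work.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards using the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # A sparse-reward episode: the only reward arrives at the last step.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

With a sparse reward, the return seen from early states shrinks geometrically with the distance to the rewarding step, which is one way to see why long unrewarded trajectories carry so little learning signal.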
Off-policy RL algorithms can learn an optimal policy using experiences from a behavior policy separate from the optimal policy. In particular, off-policy algorithms can make use of samples collected in the past, leading to more sample efficient learning. An experience replay (Lin, 1992) is typically employed to store past transitions as tuples of $(s_t, a_t, r_t, s_{t+1})$. At every step of training, a minibatch of transitions is sampled from the replay at random and a loss on future expected return is minimized. The off-policy algorithms employing an experience replay that we use in this work are Double Deep Q-Networks (DDQN) (Van Hasselt et al., 2016) and Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015).
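A minimal sketch of such an experience replay is given below, assuming a fixed capacity with oldest-first eviction and uniform random sampling. The class and field names are hypothetical and do not correspond to any particular implementation used in this work.

```python
import random
from collections import deque, namedtuple

# One stored transition: state, action, reward, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity store of past transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._storage.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a minibatch without replacement.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Because sampling is decoupled from the policy that generated the data, stored transitions can be reused, or modified in hindsight, many times during training.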
2.2 Hindsight Experience Replay
HER was shown to achieve speedups in learning in environments where the goal configuration is provided along with the agent state to the policy. The essential idea is to store each trajectory, $s_1, \dots, s_T$, with a number of additional goals along with the originally specified one. An off-policy algorithm employing an experience replay is used to train a UVFA, which learns a policy that generalizes across multiple goals. During replay, the original goals are changed to states that have actually been achieved by the agent in the past.
The reward is also modified retroactively to reflect the new goal being replayed. In particular, HER assumes that every goal, $g \in \mathcal{G}$, can be expressed as a predicate $f_g : \mathcal{S} \to \{0, 1\}$. That is to say, all states can be judged as to whether or not a goal has been achieved in them. Thus, while replaying the trajectory with a surrogate goal $g'$, one can easily reassign rewards along the entire trajectory as,
$$r_t = r(s_t, a_t, g'), \quad t = 1, \dots, T.$$
Andrychowicz et al. (2017) report that selecting $g'$ to be a future state from within the same (failed) episode leads to the best results. This training approach forms a sort of implicit curriculum for the agent. In the beginning, it encourages the agent to explore further outwards along trajectories it has visited before. Since the surrogate goal, $g'$, is also explicitly provided to the UVFA policy, it soon learns to also generalize this curriculum over unseen goals. Over time, the agent is able to achieve any goal in $\mathcal{G}$, including the real ones.
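The hindsight relabeling described above, with surrogate goals drawn from future states of the same episode, can be sketched as follows. This is an illustrative sketch only: the function names, the number of surrogate goals per transition `k`, and the 1/0 sparse reward convention are assumptions, not details taken from Andrychowicz et al. (2017) or from this work.

```python
import random


def relabel_with_future_goals(trajectory, achieved_goal, is_achieved, k=4, rng=None):
    """Relabel a trajectory with surrogate goals taken from its own future.

    trajectory:    list of (state, action, next_state) tuples.
    achieved_goal: maps a state to the goal that is satisfied in it.
    is_achieved:   predicate deciding whether a goal holds in a state.
    """
    rng = rng or random.Random()
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        for _ in range(k):
            # Pick a step at or after t and use its outcome as the goal.
            future = rng.randrange(t, len(trajectory))
            g = achieved_goal(trajectory[future][2])
            # Reassign the reward with respect to the surrogate goal
            # (assumed convention: 1 on success, 0 otherwise).
            r = 1.0 if is_achieved(s_next, g) else 0.0
            relabeled.append((s, a, r, s_next, g))
    return relabeled
```

For example, with integer states, `achieved_goal=lambda s: s` and `is_achieved=lambda s, g: s == g`, every transition whose next state equals the sampled future state receives the success reward.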
2.3 Wasserstein GANs
We employ an improved Wasserstein ACGAN (Gulrajani et al., 2017; Odena et al., 2017) as our generative model because of its stability, realistic looking outputs, and ability to condition the generated images on a desired class. A typical W-ACGAN has a generator, $G$, that takes as input a class variable $c$ and a latent vector of random noise $z$. It then generates an image which is fed into the discriminator, $D$. $D$ rates the image on its fidelity to the training data and, as an auxiliary task, predicts class membership. The Earth-Mover distance between the distributions of real, $P_r$, and generated, $P_g$, images is used as a loss to train the combined model. A standard practice in Wasserstein GANs is to train the discriminator multiple times for each generator update. The discriminator begins to act as a critic that rates images on their fidelity to the training data.
Here, it is important to point out that the motivation behind using a GAN in this work is to produce realistic looking hallucinations that will allow the agent to easily generalize from imagined goals to real ones. Realistic insertion of goals was not an issue in HER because a new goal could directly be substituted in a replayed transition without any modification to the states.
3 Related Work
Generative Models in RL. In recent years, generative models have demonstrated significant improvements in the areas of image generation, data compression, denoising, and latent-space representations, among others (Goodfellow et al., 2014; Chen et al., 2016; Vincent et al., 2008). Reinforcement learning has also benefited from incorporating generative models in the training process. Ha & Schmidhuber (2018) synthesize a lot of prior work in the area by proposing a Recurrent Neural Network (RNN) based generative dynamics model (Schmidhuber, 1990) of popular OpenAI gym (Brockman et al., 2016) and VizDoom (Kempka et al., 2016) environments. They employ a fairly common procedure of encoding high dimensional visual inputs from the environment into lower dimensional embedding vectors using a Variational Auto Encoder (VAE) (Kingma & Welling, 2013) before passing them on to the RNN model. Held et al. (2017) use a GAN to generate goals matched in difficulty to an agent’s skill on a task. Called GoalGAN, it generates an automatic curriculum of incrementally harder to reach goals. But it assumes that goals can easily be set in the environment by the agent and does not make efficient use of trajectories that failed to achieve these objectives.
Generative models have also been used in the closely related field of imitation learning to learn from human demonstrations or observation sequences (Ho & Ermon, 2016; Edwards et al., 2018b; Schroecker et al., 2019). In our approach, we do not require demonstrations of the task, or even a sequence of observations, but random snapshots of the goal which we use to speed up reinforcement learning.
Goal Based RL. Some recent work has focused on leveraging information on the goal or surrounding states to speed up reinforcement learning. Edwards et al. (2018a) and Goyal et al. (2018) learn a reverse dynamics model to generate states backwards from the goal which are then added to the agent’s replay buffer. The former work assumes that the goal configuration is known and backtracks from there, whereas in the latter, high-value states are picked from the replay buffer or a GoalGAN is used to generate goals. The latter work also learns an inverse policy, to generate plausible actions leading back from goal states. In contrast, we focus on minimally altering states in existing failed trajectories already in the replay buffer to appear as if a goal has been completed in them. This avoids having to generate entirely new trajectories and allows us to make full use of the environment dynamics already present in previous state transitions.
Others have focused on learning goal-conditioned policies in visual domains using a single or few images of the goal (Xie et al., 2018; Zhu et al., 2017). Nair et al. (2018) train a $\beta$-VAE (Burgess et al., 2018) on state images for a threefold purpose: (1) to sample new goals during training, (2) to use the Euclidean distance between feature encodings of current and goal images as a dense reward, and (3) to retroactively alter goals with VAE generated images and reassign rewards appropriately. The set of goals is assumed to be the same as the set of states and hence they are easy to swap back and forth. This works well for domains where the goal is separately provided to the policy along with the agent state, and where states do not have to be modified for changing goals. In this work, we attempt learning in domains where the goal image is not known beforehand and thus cannot be provided to the agent’s policy, and where the goal may or may not be present in a particular agent state.
4 The missing component in HER
First, we will more formally discuss what is missing from the original HER formulation that does not allow it to readily extend to visual domains. Then, in the next section, we will describe in detail how the use of hallucinatory generative models can help bridge the gap.
HER makes an assumption on the domain that “given a state we can easily find a goal which is satisfied in this state” (Andrychowicz et al., 2017). It requires a mapping, $m : \mathcal{S} \to \mathcal{G}$, that maps every state to a goal that is achieved in that state. While this mapping may be relatively straightforward to hand design for real-valued state spaces, its analog for visual states cannot be constructed easily. For example, if the state space of the agent lies on the plane of real values in $\mathbb{R}^2$, the goal may be to achieve a particular $x$-coordinate. So in the agent state $s = (x_1, y_1)$, a goal that is satisfied is simply $m(s) = x_1$. Now imagine if the agent must instead navigate to a beacon on a 2D plane using camera images as state inputs. In order to convert any arbitrary state into one in which a goal is satisfied, the beacon must be visually inserted into the image itself. We call these goal hallucinations (see figure 2).
In order to fully utilize the power of HER, not only should the agent be able to hallucinate goals in arbitrary states, but it should also do so consistently in the same absolute position throughout the failed trajectory. Note that with each step along the trajectory, the position of the goal (a beacon) changes relative to the agent’s and thus the agent’s observation must be correctly updated to reflect this change. The goal must appear to have been solved in a future state along every step of the trajectory (see figure 1). Only then can we make use of the existing transitions along the entire trajectory for replay with hallucinated as well as original goals. Thus, visual settings require the mapping $m$ to be extended along the entire trajectory, so that it becomes $m : \mathcal{F} \to \mathcal{S}^{T}$, where $T$ is the maximum length of a trajectory and $\mathcal{F}$ is the space of failed trajectories. Every state along a trajectory from $\mathcal{F}$ must be modified by the mapping into a near-goal state that is consistent with the final goal state of that trajectory. This is where this work’s main contribution lies.
It is apparent that the use of UVFAs to generalize over multiple goals, as in HER, does not extend to visual settings where the goal location is unknown and must be identified within the environment. Hence, in this work, the agent’s policy is solely conditioned on its state.
To address the shortcomings of HER, we adopt a two part approach. First, a generative adversarial network (GAN) is trained to modify any existing state from a failed trajectory into a goal or near goal state. We call this model HALGAN. HALGAN generates goal hallucinations conditioned on the configuration of the robot in the current state relative to its configuration in a future state from the same episode. Note that we make use of the assumption that in realistic robotic applications, while it may be difficult to obtain the explicit location of the goal throughout reinforcement learning, one can easily obtain the configuration of the robot relative to itself. This can be done using SLAM or other state tracking techniques (Montemerlo et al., 2002).
Then, during reinforcement learning, random snippets of past failed trajectories are replayed with the final state in the snippet set as the target goal location. The trained HALGAN modifies pairs of states that constitute the transitions along the trajectory to appear as if the goal was indeed achieved by the end of it. Details of the entire hallucinating process are provided in the next few subsections.
5.1 Hallucinating Visual Goals
HALGAN is trained on a dataset, $\mathcal{R}$, of observations of the goal where its relative location to the agent is explicitly known. These snapshots of the goal can be collected beforehand and are only used once to train the generative model. HALGAN then generalizes to create thousands of hallucinations along failed trajectories during reinforcement learning. These failed trajectories are ones the agent has taken in the past and are stored in its experience replay.
In order to ‘fool’ the agent into thinking that it has indeed achieved a goal, one has to insert the goal into the final image of that trajectory snippet. Thus, the state at the end of a trajectory, $s_T$, has to be modified to a hallucinated state, $\hat{s}_T$, such that it appears as if the goal were achieved in it. This is in contrast to the regular HER approach or the approach by Nair et al. (2018), where the state can be directly mapped to a goal using the hand designed mapping $m$.
During learning, a snippet of a failed trajectory in the agent’s experience replay is sampled randomly. Along with the final state of the snippet, $s_T$, other states in the trajectory leading up to it, $s_t$ with $t < T$, must also be modified to appear as if the goal were indeed accomplished in $s_T$. For this, the hallucinated goal location must remain consistent throughout the replayed trajectory. In the following subsections, we describe each component of HALGAN and then show how it fits together to generate consistent hallucinations of the goal.
5.2 Minimal Hallucinations
One of our aims is to minimally alter a failed trajectory in order to turn its states into goal or near-goal states. This makes full use of existing trajectories and does not require HALGAN to re-imagine the environment dynamics or unnecessary details about the goal state such as the background.
To this end, we train an additive model, such that the generator, $H$, has to produce only differences to the state image that add in the goal. To obtain a hallucinated image with the goal at the final state of the trajectory, $s_T$, we compute,
$$\hat{s}_t = \Theta\left(s_t + H(z, c_t)\right),$$
where $H$ is the generative model function, $c_t$ is the relative configuration of the robot to a desired goal state and $z$ is a random latent conditioning vector. $\Theta$ is used to re-normalize the hallucinated state image to the valid pixel range. Any differentiable bounded function can be used for this purpose. The hallucinated state, $\hat{s}_t$, along with a state sampled from the near-goal dataset $\mathcal{R}$, is then fed to the discriminator, $D$, to compute the discriminative loss,
$$L_D = \mathbb{E}_{\hat{s} \sim P_h}\left[D(\hat{s})\right] - \mathbb{E}_{s \sim P_r}\left[D(s)\right],$$
where $P_h$ and $P_r$ are the hallucinated and real near goal image distributions.
In addition to the discriminator image loss, a gradient penalty is employed in the improved training of Wasserstein GANs (see Gulrajani et al. (2017) for more details).
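For reference, the critic objective with gradient penalty from Gulrajani et al. (2017) can be written as below, with the discriminator denoted $D$ and the hallucinated and real near goal image distributions denoted $P_h$ and $P_r$. The symbol $\lambda$ for the penalty weight and the notation $\bar{s}$ for interpolated samples are assumptions made for this illustration; $\bar{s}$ is drawn uniformly along straight lines between pairs of real and hallucinated images.

$$L = \mathbb{E}_{\hat{s} \sim P_h}\left[D(\hat{s})\right] - \mathbb{E}_{s \sim P_r}\left[D(s)\right] + \lambda \, \mathbb{E}_{\bar{s}}\left[\left(\lVert \nabla_{\bar{s}} D(\bar{s}) \rVert_2 - 1\right)^2\right]$$

The penalty term softly enforces the 1-Lipschitz constraint on $D$ that the Wasserstein formulation requires.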
As a result of generating only image differences, the trained hallucinatory model is invariant to some kinds of visual variations, such as background, presence of other objects, etc. Note that we do not condition $H$ on the current failed state, $s_t$, nor on the end state in the trajectory, $s_T$. It is only conditioned on the agent’s relative configuration to the desired goal state. While this may lead to some awkward goal hallucinations, we found that in practice it did not influence the learning noticeably.
To encourage the model to generate minimal modifications to the original failed image, we also add a norm loss on the output of the generator $H$. In our experiments, this helped in discouraging the generator from focusing on unnecessary elements of goals such as background information or extra objects in the environment.
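The additive composition and the norm penalty can be illustrated with the sketch below. It assumes images flattened to lists of pixel values scaled to roughly [-1, 1], uses tanh as one possible differentiable bounded function, and uses an L1 norm for the penalty; none of these choices is confirmed by the text above, and the function names are hypothetical.

```python
import math


def hallucinate(state, diff):
    """Add a generator difference image to a failed-state image.

    Both arguments are flat lists of pixel values. tanh squashes the sum
    back into the open interval (-1, 1), standing in for the bounded
    re-normalization function.
    """
    return [math.tanh(s + d) for s, d in zip(state, diff)]


def diff_norm(diff):
    """L1 norm of the difference image, used here as the sparsity penalty."""
    return sum(abs(d) for d in diff)
```

Penalizing the norm of the difference image, rather than of the hallucinated image, leaves the original background free and only charges the generator for the pixels it changes. Note that with tanh as the squashing function, even a zero difference image rescales nonzero pixels slightly.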
5.3 Regression Auxiliary Task
Typical ACGANs are conditioned on a discrete set of classes, such as flower, dog, etc. (Odena et al., 2017). But to be useful for reinforcement learning along the failed trajectory, the generator must be conditioned on the relative configuration of the agent from the desired goal state, which is a real-valued vector $c$. The auxiliary task for the discriminator, $D$, then is to regress the real valued relative location of the goal seen in a training image. To train this regression based auxiliary task, we use a mean squared error loss,
$$L_{aux} = \mathbb{E}\left[\lVert c - \hat{c} \rVert_2^2\right],$$
where $\hat{c}$ is the relative configuration predicted by $D$. We found it helpful to add a small amount of Gaussian noise to our auxiliary inputs for robust training, especially on smaller datasets.
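A minimal sketch of the regression auxiliary loss and the input noise is shown below. The noise scale `sigma` and the function names are illustrative assumptions; the text above does not specify the amount of noise used.

```python
import random


def add_gaussian_noise(config, sigma, rng):
    """Perturb each component of a relative configuration with N(0, sigma^2) noise."""
    return [c + rng.gauss(0.0, sigma) for c in config]


def mse(predicted, target):
    """Mean squared error between predicted and true relative configurations."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    true_config = [1.0, -0.5]
    noisy_input = add_gaussian_noise(true_config, sigma=0.05, rng=rng)
    # The discriminator would regress the configuration from an image;
    # here the noisy input simply stands in for such a prediction.
    print(mse(noisy_input, true_config))
```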
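Our final loss to the combined HALGAN is,
$$L_{HALGAN} = L_D + \lambda_{gp} L_{gp} + \lambda_{aux} L_{aux} + \lambda_{norm} \lVert H(z, c) \rVert,$$
where $L_D$ is the discriminative loss, $L_{gp}$ is the gradient penalty, $L_{aux}$ is the auxiliary regression loss, $\lVert H(z, c) \rVert$ is the norm of the generator output, and $\lambda_{gp}$, $\lambda_{aux}$, and $\lambda_{norm}$ are weighting hyperparameters, which we hold fixed in all our experiments.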
To summarize, the training process is as follows. The generator, conditioned on a randomly drawn relative goal location, produces a difference image which is then added to a randomly selected image from a failed trajectory to create a goal hallucination. The discriminator is provided with these hallucinated images as well as ground truth images from the near-goal dataset $\mathcal{R}$ and has to score the images on their authenticity and also predict the auxiliary variable. See figure 3 for a representation of the HALGAN training process and the appendix for more details on the network architectures and training procedure.
For the purposes of our experiments, we collect the training data for HALGAN, $\mathcal{R}$, by using the last few states of a successful rollout, in this case, a demonstration. Note that the exact data required in $\mathcal{R}$ are randomly selected snapshots from near the goal, together with the final agent configuration in which the goal is achieved, which is used to calculate relative poses. Note that only observations, including the state image and agent configuration, are used; no actions have to be provided or demonstrated. This alleviates the data collection burden as the human does not have to demonstrate the optimal completion of the task and snapshots can be collected in any order. For example, it is significantly simpler to record the desired final configuration of objects on a table than to record a full, optimal demonstration of a robot arm arranging them. It also allows the generative model to be independent of the agent and demonstrator action spaces. We also collect a dataset of failed trajectories using random exploration. These are used during HALGAN training to add to the output of the generator and create hallucinated near goal states. Most off-policy RL methods that employ an experience replay have a replay warmup period where actions are taken randomly to fill the replay to a minimum before training begins. This dataset of failed trajectories can be the same as the replay warmup and no extra exploration is required.
5.5 Visual HER
During reinforcement learning, the agent explores its environment as normal. Every time a batch is sampled for training, a few of the data points from it are augmented with goal hallucinations. The detailed process is explained in algorithm 1. The result is that the agent encounters hallucinated near goal states with a much higher frequency than if it were randomly exploring. This in turn encourages the agent to explore further from near goal states.
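The batch augmentation step can be sketched as follows. This is not a reproduction of algorithm 1; it is a simplified illustration that assumes a hypothetical callable `hallucinate_fn`, which takes a stored transition and returns its hallucinated counterpart, and a hypothetical `fraction` parameter controlling how much of each batch is altered.

```python
import random


def augment_batch(batch, hallucinate_fn, fraction, rng):
    """Replace a random fraction of a sampled batch with hallucinated transitions."""
    n = int(round(fraction * len(batch)))
    # Choose which positions in the batch to alter, without repetition.
    chosen = set(rng.sample(range(len(batch)), n))
    return [hallucinate_fn(tr) if i in chosen else tr for i, tr in enumerate(batch)]
```

Keeping the remaining transitions unaltered means the agent continues to train on real experience alongside the hallucinated near goal states.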
An important consideration is the retroactive reassignment of rewards. As a reminder, HER uses a manually defined function which decides if the goal is satisfied in a state to designate rewards during hindsight replay. This sort of retroactive reward function is hard to hand design in visual environments. Comparing state and goal images pixel by pixel is typically ineffective. Fortunately, for the purposes of reward reassignment during hindsight replay, one need only compare the agent state to a future one in the same episode. Hence, we assume the existence of a similar function, $f(s_t, s_T)$, which decides whether a pair of states are the same for the purpose of goal completion. This sort of function is also difficult to hand specify for visual states because of the above mentioned difficulties in pixel-by-pixel comparisons. As mentioned in section 3, Nair et al. (2018) use a trained $\beta$-VAE as $f$ to reassign rewards in a dense manner. Here, we make use of the access to the robot’s own configuration to design a similar function, $f(c_t, c_T)$, where $c_t$ is the robot configuration at a particular state $s_t$. We then assume that any goal satisfied in $s_T$ must also be satisfied in any other state with a similar enough configuration. During retroactive reward reassignment, we compare the relative configurations in the current and future goal state, and hallucinate a reward if they are similar. We also compare against the distance metric employed by Nair et al. (2018), which did not perform as well as using the agent configuration in our experiments.
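A configuration-based reward reassignment of this kind might look like the sketch below. The Euclidean distance, the `threshold` value, and the reward magnitude are illustrative assumptions; the text above only states that a reward is hallucinated when the two configurations are similar.

```python
import math


def relative_configuration(config, goal_config):
    """Component-wise offset from the agent's configuration to the goal configuration."""
    return tuple(g - c for c, g in zip(config, goal_config))


def hindsight_reward(config, future_config, threshold=0.1, success_reward=1.0):
    """Hallucinate a reward when two robot configurations are close enough.

    Assumed similarity test: Euclidean distance at or below `threshold`.
    """
    distance = math.dist(config, future_config)
    return success_reward if distance <= threshold else 0.0
```

If the configuration includes an orientation angle, a plain Euclidean distance would need to account for angle wrap-around; that detail is omitted here.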
We test our method on two first person visual environments. In a modified version of MiniWorld (Chevalier-Boisvert, 2018), we design two tasks. The first one is to navigate to a red box located in an enclosed room (figure 4). The second task is a pick-and-place variant for first person 3D environments, where the agent must navigate to the red box, visually center it to pick it up and then carry it to a green box somewhere else in the room (see figure 4).
The second environment is a more visually realistic simulated robotics domain, where a TurtleBot2 (Wise & Foote, ) equipped with an RGB camera is simulated within Gazebo (Koenig & Howard, 2004). We use gym-gazebo (Zamora et al., 2016) to interface with Gazebo. In this environment, the agent must collect a pebble scattered randomly on a road by approaching and centering it in its visual field (figure 4). The episode ends and the agent is reset to the starting location if it wanders too far. Episodes also end after 400 steps or upon completion of the goal.
Figure 4 depicts near goal states in all of our tasks. The goal is randomly spawned a small distance away from the agent. Encountering the goal is extremely rare and standard RL is sample inefficient or completely ineffective. The size of the near goal dataset, $\mathcal{R}$, for the TurtleBot, navigation and pick-and-place tasks is 6840, 2000, and 6419 images with relative goal configurations, respectively. We show, however, that reducing the number of near goal states leads to little performance degradation in the TurtleBot environment (figure 6).
In the TurtleBot and MiniWorld navigation tasks, the configuration of the agent is simply its pose. In pick-and-place, an additional binary field indicates whether the red box is held by the agent. The agent’s relative configuration is calculated with respect to the red box before it is picked up, and the green box afterwards. Hallucinations are generated for the agent approaching both boxes. In these tasks, we found it helpful to anneal the number of hallucinations in a batch over time as the agent starts filling the replay with real reward. Details of the annealing rate and other experimental hyperparameters are provided in the appendix.
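One simple way to anneal the share of hallucinated transitions in a batch is a linear schedule, sketched below. The linear form and all default values are placeholders chosen for illustration; the actual annealing rate used in the experiments is given in the appendix and is not reproduced here.

```python
def hallucination_fraction(step, start=0.5, end=0.0, anneal_steps=100_000):
    """Linearly interpolate the hallucinated fraction of a batch over training.

    All default values are hypothetical placeholders.
    """
    # Clamp progress to [0, 1] so the fraction stays at `end` after annealing.
    progress = min(max(step / anneal_steps, 0.0), 1.0)
    return start + (end - start) * progress
```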
Comparisons. We compare our approach against a few extensions of prior work into the visual domain where goals are not provided to the policy explicitly. We also compare against standard model-free RL baselines. A naive extension of HER into the visual domain, her, simply rewards the agent for states at the end of failed trajectories during replay without hallucinating. Hence, the agent receives hindsight rewards, but the sampled trajectories still seem to end in failures. This is an ablation of our approach where the effect of removing HALGAN from the training procedure is tested.
A second baseline is derived from the work of Nair et al. (2018) (RIG) on training goal-conditioned policies with a dense reward based on the distance between the embedding of the sampled state and that of a goal image. RIG’s retroactive reassignment of goals relies on the use of UVFAs, which is not possible for our domains where the goal image is unknown. Therefore we test two variants of this baseline where we attempt to find a suitable comparison. We first train a VAE on the exact data available to HALGAN, i.e. near goal images in the dataset $\mathcal{R}$ and failed state images collected by random exploration. Then, during RL, vae-her simply sets the final image in a failed trajectory, without any hallucinations, as the goal and uses the trained VAE to compute reward for a transition along that trajectory. This baseline evaluates the effectiveness of dense reward reassignment in our domains without the use of hallucinations from HALGAN.
rig- follows a similar dense reward reassignment strategy, but computes distance of a state to a randomly sampled goal image in $\mathcal{R}$. Goal images are identified in $\mathcal{R}$ by filtering for the relative configuration of the agent from the goal being zero. Hence, rig- rewards the agent for being in states that look similar to goal states in retrospect, without employing any hallucinations. For the distance based rewards provided by the VAE in rig- to be the same order of magnitude as the environment rewards, it was necessary to re-scale them. The scaling factor in all our experiments was set to .
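The rescaled dense reward used by these VAE-based baselines can be sketched as a negative Euclidean distance between latent encodings. The function name and the `scale` argument are illustrative; the scaling value used in the experiments is not reproduced here, and the encodings are assumed to come from an already trained VAE encoder.

```python
import math


def latent_distance_reward(z_state, z_goal, scale):
    """Dense reward: negative scaled Euclidean distance between two latent codes.

    z_state and z_goal are encodings of the current and goal images.
    """
    return -scale * math.dist(z_state, z_goal)
```

The reward is zero when the two encodings coincide and becomes more negative as they move apart, so `scale` directly controls how the dense term compares in magnitude with the sparse environment reward.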
Discrete and Continuous Control. An advantage of our method is that HALGAN is agnostic to the agent’s action space. As a result of conditioning on the relative location of the robot to a state in the future, we are freeing the model of any assumptions of how the robot actually gets there.
In the discrete TurtleBot environment, only a sparse reward is used to indicate completion of a goal. The action space is back and forth movement and turning (4 actions). The base off-policy algorithm used is Double DQN (Van Hasselt et al., 2016). For the continuous MiniWorld environments, a penalty on the norm of the output actions is applied at each step to simulate energy step cost. Otherwise, the agent is only provided the sparse task completion reward. The output actions are the linear and rotational velocities of the agent at the next step, capped at a fixed amount. The base algorithm used in this setting is DDPG (Lillicrap et al., 2015). We employ deep convolutional neural networks as function approximators that take in the state image as input and output the desired control actions or values.
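For readers unfamiliar with the discrete-action base algorithm, the Double DQN bootstrap target (Van Hasselt et al., 2016) can be sketched as below: the online network selects the next action and the target network evaluates it. The function signature is an illustrative assumption, with Q-values passed in as plain lists rather than produced by convolutional networks.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma):
    """Compute the Double DQN target for a single transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        # No bootstrapping past the end of an episode.
        return reward
    # Select the greedy action with the online network...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ...but evaluate it with the target network.
    return reward + gamma * next_q_target[best_action]
```

Because the target depends only on the stored transition and the two networks, it applies unchanged to transitions whose images and rewards were altered in hindsight.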
In all of our experiments, VHER begins learning immediately (figure 5). This is due to the realistic looking hallucinated goals being quickly identified as desirable states. This is in contrast to standard RL which rarely encounters reward and must explore at length to encounter random rewards in order to begin the learning process, if at all.
In the discrete TurtleBot pebble collection domain (figure 5), the naive HER strategy provides a good enough exploration bonus for the agent to explore further and quicker than standard DDQN. It begins learning by 100K steps. VHER, by contrast, starts learning to navigate to real goals immediately.
For the continuous control experiments in MiniWorld (figure 5), only VHER is able to learn to complete the task. Note that avoiding the movement penalty in this environment is relatively easy; only positive rewards indicate achievement of the goal. DDPG never encounters any reward during exploration and hence learns to simply minimize its actions in order to avoid the movement penalty. Naive her initially explores heavily and hence incurs a heavy penalty, but doesn’t learn to associate the rewards it receives with the presence of a goal. Some of the random seeds eventually converge to the same degenerate policy as DDPG. vae-her, the augmentation of her with dense rewards from a trained VAE, also proves unsuccessful for either task, demonstrating that dense rewards without hallucinated or real goals in failed trajectories are also ineffective for learning in these domains. Only the rig- strategy of providing dense rewards relative to random goal images eventually learns to complete the navigation task for some of the seeds. For the pick-and-place task, rig- only learns a working policy on a single seed and the other baselines perform similarly or worse. Interestingly, rig’s dense reward reassignment can be readily combined with our approach of state modification by hallucination, providing directions for future work.
Finally, figure 6 shows how performance on the TurtleBot pebble collection task changes when HALGAN is trained on fewer samples. Learning is only slightly slower, even for the greatly reduced dataset of only 1000 images. The minimalistic hallucinations created by HALGAN require a relatively small amount of data to train well enough to provide a significant boost to reinforcement learning.
A major impediment to training RL agents in the real world is the amount of data an agent must collect and process before it can start inferring which actions lead to rewards and which ones are to be avoided. High sample complexity makes problems such as the fragility of physical systems, energy consumption, robot speed and sensor errors manifest themselves acutely when one attempts to run the reinforcement learning process in the real world.
In this work, we have shown that Hindsight Experience Replay can be extended to visual scenarios where the goal location is not explicitly known beforehand, as is common in many realistic applications. We show empirically that by hallucinating goals along failed trajectories, the agent can begin learning to solve tasks immediately. VHER converges faster than standard RL techniques, which flounder fruitlessly before encountering rewards and, in complex tasks, fail to find a working policy at all. VHER requires relatively few snapshots of near-goal images with known goal configurations. In certain environments, this dataset could be generated online as the agent learns, or supplied by orthogonal techniques such as GoalGAN (Held et al., 2017). We leave this as an avenue for future work.
We would like to thank the entire OffWorld team for their enthusiastic support of this work. Special thanks to Ashish Kumar for help in the setup of experiments and for many hours of fruitful discussions.
- Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pp. 5048–5058. Curran Associates, Inc., 2017.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
- Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
- Chevalier-Boisvert (2018) Chevalier-Boisvert, M. gym-miniworld environment for openai gym. https://github.com/maximecb/gym-miniworld, 2018.
- Edwards et al. (2018a) Edwards, A. D., Downs, L., and Davidson, J. C. Forward-backward reinforcement learning. CoRR, abs/1803.10227, 2018a. URL http://arxiv.org/abs/1803.10227.
- Edwards et al. (2018b) Edwards, A. D., Sahni, H., Schroecker, Y., and Isbell, C. L. Imitating latent policies from observation. CoRR, abs/1805.07914, 2018b. URL http://arxiv.org/abs/1805.07914.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Goyal et al. (2018) Goyal, A., Brakel, P., Fedus, W., Lillicrap, T. P., Levine, S., Larochelle, H., and Bengio, Y. Recall traces: Backtracking models for efficient reinforcement learning. CoRR, abs/1804.00379, 2018. URL http://arxiv.org/abs/1804.00379.
- Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30, pp. 5767–5777. Curran Associates, Inc., 2017.
- Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018.
- Held et al. (2017) Held, D., Geng, X., Florensa, C., and Abbeel, P. Automatic goal generation for reinforcement learning agents. CoRR, abs/1705.06366, 2017. URL http://arxiv.org/abs/1705.06366.
- Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
- Kempka et al. (2016) Kempka, M., Wydmuch, M., Runc, G., Toczek, J., and Jaśkowski, W. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pp. 1–8. IEEE, 2016.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Koenig & Howard (2004) Koenig, N. P. and Howard, A. Design and use paradigms for gazebo, an open-source multi-robot simulator. In IROS, volume 4, pp. 2149–2154. Citeseer, 2004.
- Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Lin (1992) Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Montemerlo et al. (2002) Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al. Fastslam: A factored solution to the simultaneous localization and mapping problem. In AAAI/IAAI, pp. 593–598, 2002.
- Nair et al. (2018) Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9209–9220, 2018.
- Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
- Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. JMLR.org, 2017.
- OpenAI (2018) OpenAI. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018. URL http://arxiv.org/abs/1808.00177.
- Pinto et al. (2017) Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., and Abbeel, P. Asymmetric actor critic for image-based robot learning. CoRR, abs/1710.06542, 2017. URL http://arxiv.org/abs/1710.06542.
- Randløv & Alstrøm (1998) Randløv, J. and Alstrøm, P. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, volume 98, pp. 463–471. Citeseer, 1998.
- Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1312–1320, Lille, France, 07–09 Jul 2015. PMLR.
- Schmidhuber (1990) Schmidhuber, J. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. 1990.
- Schroecker et al. (2019) Schroecker, Y., Vecerik, M., and Scholz, J. Generative predecessor models for sample-efficient imitation learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkeVsiAcYm.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, volume 2, pp. 5. Phoenix, AZ, 2016.
- Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. ACM, 2008.
- Wise & Foote (2011) Wise, M. and Foote, T. REP 119: Specification for TurtleBot compatible platforms, Dec. 2011.
- Xie et al. (2018) Xie, A., Singh, A., Levine, S., and Finn, C. Few-shot goal inference for visuomotor learning and planning. CoRR, abs/1810.00482, 2018. URL http://arxiv.org/abs/1810.00482.
- Zamora et al. (2016) Zamora, I., Lopez, N. G., Vilches, V. M., and Cordero, A. H. Extending the openai gym for robotics: a toolkit for reinforcement learning using ros and gazebo. arXiv preprint arXiv:1608.05742, 2016.
- Zhu et al. (2017) Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., and Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3357–3364. IEEE, 2017.
Appendix A Experimental Hyperparameters
Refer to the table below for environment-specific hyperparameters.
|Hyperparameter||TurtleBot||MiniWorld Navigate||MiniWorld Pick-and-Place|
|Learning Rate||(Actor), (Critic)||(Actor), (Critic)|
|Size of for HALGAN||6,840||2,000||6,419|
|Hallucination Start %||20%||30%||30%|
|Hallucination End %||0%||0%||0%|
|Max Failed Trajectory Length||16||32||16|
|Random Seeds||75839, 69045, 47040||75839, 69045, 47040, 60489, 11798||75839, 69045, 47040, 60489, 11798|
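One plausible reading of the "Hallucination Start %" and "Hallucination End %" rows is that the share of hallucinated samples is annealed from the start value to the end value over training. The sketch below assumes a linear schedule; both the schedule shape and the function name `hallucination_fraction` are assumptions, since the table gives only the two endpoints.

```python
def hallucination_fraction(
    step: int,
    total_steps: int,
    start: float = 0.20,  # TurtleBot start value from the table
    end: float = 0.0,     # end value from the table
) -> float:
    """Linearly interpolate from `start` to `end` as training progresses.

    The linear shape is an assumption made for illustration only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the range stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress
```

For the MiniWorld tasks, the same function would be called with `start=0.30`.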
Refer to the table below for HALGAN-specific hyperparameters.
|Latent Vector Size||128|
|Latent Sampling Distribution|
|Auxiliary Task Weight||10|
|Gradient Penalty Weight||10|
|loss on Weight||1|
|Iters per Iter||5|
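For context on the gradient penalty weight, the critic loss of the improved Wasserstein GAN (Gulrajani et al., 2017) is reproduced below. It is given on the assumption that HALGAN's adversarial term follows this standard form; the auxiliary-task and other weighted terms listed in the table are not reproduced here.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad \lambda = 10
```

Here $P_r$ is the real data distribution, $P_g$ the generator distribution, and $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples. The value $\lambda = 10$ is the default used by Gulrajani et al. and matches the gradient penalty weight in the table.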
Appendix B Network Architectures
Refer to the table below for details on the network architecture for DDQN. LeakyReLU activations were used throughout, except for the output layer, where no activation was used.
|Dense 2||4 ()||-||132|
Refer to the table below for details on the network architecture of the DDPG actor. LeakyReLU activations were used throughout, except for the output layer, where tanh was used.
|Dense 2||2 ()||-||66|
Refer to the table below for details on the network architecture of the DDPG critic. LeakyReLU activations were used throughout, except for the output layer, where no activation was used.
Refer to the table below for details on the network architecture of the generator in HALGAN. LeakyReLU activations were used throughout, except immediately after the conditioning layer, where no activation was used, and at the output, where tanh was used.
|UpSample + Conv 1||4x4||64||131136|
|UpSample + Conv 2||4x4||64||65600|
|UpSample + Conv 3||4x4||64||65600|
|UpSample + Conv 4||4x4||32||32800|
|UpSample + Conv 5||4x4||32||16416|
|UpSample + Conv 6||4x4||16||8028|
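The parameter counts in the last column can be cross-checked with the usual formulas for convolutional and dense layers with biases. In the sketch below, the input channel counts (128, 64, 64, 64, 32, 32) and the 32 input units of the "Dense 2" rows are assumptions inferred from the listed counts; they are not stated in the tables. Under these assumptions the first five generator rows match exactly, while the formula gives 8208 for "UpSample + Conv 6" rather than the listed 8028. This may indicate a transposed digit or a layer configuration that differs from the assumed one.

```python
def conv2d_params(kernel_h: int, kernel_w: int, in_ch: int, out_ch: int) -> int:
    """Weights plus one bias per output channel for a 2D convolution."""
    return kernel_h * kernel_w * in_ch * out_ch + out_ch


def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_features * out_features + out_features


# (assumed input channels, filters from the table, listed parameter count)
generator_convs = [
    (128, 64, 131136),
    (64, 64, 65600),
    (64, 64, 65600),
    (64, 32, 32800),
    (32, 32, 16416),
    (32, 16, 8028),
]

# Collect (layer index, computed count, listed count) wherever they disagree.
mismatches = [
    (i, conv2d_params(4, 4, c_in, c_out), listed)
    for i, (c_in, c_out, listed) in enumerate(generator_convs, start=1)
    if conv2d_params(4, 4, c_in, c_out) != listed
]

# "Dense 2" rows of the DDQN and DDPG actor tables, assuming 32 input units.
ddqn_dense2 = dense_params(32, 4)
actor_dense2 = dense_params(32, 2)
```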
Refer to the table below for details on the network architecture of the discriminator in HALGAN. LeakyReLU activations were used throughout, except at the output, where no activation was used.