1 Introduction
Reinforcement learning (RL) algorithms hold the promise of allowing autonomous agents, such as robots, to learn to accomplish arbitrary tasks. However, the standard RL framework involves learning policies that are specific to individual tasks, which are defined by hand-specified reward functions. Agents that exist persistently in the world can prepare to solve diverse tasks by setting their own goals, practicing complex behaviors, and learning about the world around them. In fact, humans are very proficient at setting abstract goals for themselves, and evidence shows that this behavior is already present from early infancy (Smith and Gasser, 2005), albeit with simple goals such as reaching. The behavior and representation of goals grow more complex over time as infants learn how to manipulate objects and locomote. How can we begin to devise a reinforcement learning system that sets its own goals and learns from experience with minimal outside intervention and manual engineering?
In this paper, we take a step toward this goal by designing an RL framework that jointly learns representations of raw sensory inputs and policies that achieve arbitrary goals under this representation by practicing to reach self-specified random goals during training. To provide for automated and flexible goal-setting, we must first choose how a general goal can be specified for an agent interacting with a complex and highly variable environment. Even providing the state of such an environment to a policy is a challenge. For instance, a task that requires a robot to manipulate various objects would require a combinatorial representation, reflecting variability in the number and type of objects in the current scene. Directly using raw sensory signals, such as images, avoids this challenge, but learning from raw images is substantially harder. In particular, pixel-wise Euclidean distance is not an effective reward function for visual tasks, since distances between images do not correspond to meaningful distances between states (Ponomarenko et al., 2015; Zhang et al., 2018). Furthermore, although end-to-end model-free reinforcement learning can handle image observations, this comes at a high cost in sample complexity, making it difficult to use in the real world.
We propose to address both challenges by incorporating unsupervised representation learning into goal-conditioned policies. In our method, which is illustrated in Figure 1, a representation of raw sensory inputs is learned by means of a latent variable model, which in our case is based on the variational autoencoder (VAE) (Kingma and Welling, 2014). This model serves three complementary purposes. First, it provides a more structured representation of sensory inputs for RL, making it feasible to learn from images even in the real world. Second, it allows for sampling of new states, which can be used to set synthetic goals during training to allow the goal-conditioned policy to practice diverse behaviors. We can also more efficiently utilize samples from the environment by relabeling synthetic goals in an off-policy RL algorithm, which makes our algorithm substantially more efficient. Third, the learned representation provides a space where distances are more meaningful than in the original space of observations, and can therefore provide well-shaped reward functions for RL. By learning to reach random goals sampled from the latent variable model, the goal-conditioned policy learns about the world and can be used to achieve new, user-specified goals at test time.

The main contribution of our work is a framework for learning general-purpose goal-conditioned policies that can achieve goals specified with target observations. We call our method reinforcement learning with imagined goals (RIG). RIG combines sample-efficient off-policy goal-conditioned reinforcement learning with unsupervised representation learning. We use representation learning to acquire a latent distribution that can be used to sample goals for unsupervised practice and data augmentation, to provide a well-shaped distance function for reinforcement learning, and to provide a more structured representation for the value function and policy. While several prior methods, discussed in the following section, have sought to learn goal-conditioned policies, we can do so with image goals and observations without a manually specified reward signal.
Our experimental evaluation illustrates that our method substantially improves the performance of image-based reinforcement learning, can effectively learn policies for complex image-based tasks, and can be used to learn real-world robotic manipulation skills with raw image inputs. Videos of our method in simulated and real-world environments can be found at https://sites.google.com/site/visualrlwithimaginedgoals/.
2 Related Work
While prior works on vision-based deep reinforcement learning for robotics can efficiently learn a variety of behaviors such as grasping (Pinto et al., 2017; Pinto and Gupta, 2016; Levine et al., 2017), pushing (Agrawal et al., 2016; Ebert et al., 2017; Finn and Levine, 2016), navigation (Pathak et al., 2018; Lange et al., 2012), and other manipulation tasks (Lillicrap et al., 2016; Levine et al., 2016; Pathak et al., 2018), they each make assumptions that limit their applicability to training general-purpose robots. Levine et al. (2016) use time-varying models, which require an episodic setup that makes them difficult to extend to non-episodic and continual learning scenarios. Pinto et al. (2017) proposed a similar approach that uses goal images, but requires instrumented training in simulation. Lillicrap et al. (2016) use fully model-free training, but do not learn goal-conditioned skills. As we show in our experiments, this approach is very difficult to extend to the goal-conditioned setting with image inputs. Model-based methods that predict images (Watter et al., 2015; Finn and Levine, 2016; Ebert et al., 2017; Oh et al., 2015) or learn inverse models (Agrawal et al., 2016) can also accommodate various goals, but tend to limit the horizon length due to model drift. To our knowledge, no prior method uses model-free RL to learn policies conditioned on a single goal image with sufficient efficiency to train directly on real-world robotic systems, without access to ground-truth state or reward information during training.
Our method uses a goal-conditioned value function (Schaul et al., 2015) in order to solve more general tasks (Sutton et al., 2011; Kaelbling, 1993). To improve the sample efficiency of our method during off-policy training, we retroactively relabel samples in the replay buffer with goals sampled from the latent representation. Goal relabeling has been explored in prior work (Kaelbling, 1993; Andrychowicz et al., 2017; Rauber et al., 2017; Levy et al., 2017; Pong et al., 2018). Andrychowicz et al. (2017) and Levy et al. (2017) use goal relabeling for sparse-reward problems with known goal spaces, restricting the resampled goals to states encountered along the same trajectory, since almost any other goal will provide no reward signal. We instead sample random goals from our learned latent space to use as replay goals for off-policy Q-learning, rather than restricting ourselves to states seen along the sampled trajectory, enabling substantially more efficient learning. We use the same goal sampling mechanism for exploration in RL. Goal setting for policy learning has previously been discussed (Baranes and Oudeyer, 2012), and recently Péré et al. (2018) have also proposed using unsupervised learning to set goals for exploration. However, we use a model-free Q-learning method that operates on raw state observations and actions, allowing us to solve visually and dynamically complex tasks.
A number of prior works have used unsupervised learning to acquire better representations for RL. These methods use the learned representation as a substitute for the state for the policy, but require additional information, such as access to a ground-truth reward function based on the true state during training time (Higgins et al., 2017b; Ha and Schmidhuber, 2018; Watter et al., 2015; Finn et al., 2016; Lange et al., 2012; Jonschkowski et al., 2017), expert trajectories (Srinivas et al., 2018), human demonstrations (Sermanet et al., 2017), or pretrained object-detection features (Lee et al., 2017). In contrast, we learn to generate goals and use the learned representation to obtain a reward function for those goals without any of these extra sources of supervision. Finn et al. (2016) combine unsupervised representation learning with reinforcement learning, but in a framework that trains a policy to reach a single goal. Many prior works have also focused on learning controllable and disentangled representations (Schmidhuber, 1992; Chen et al., 2016; Cheung et al., 2014; Reed et al., 2014; Desjardins et al., 2012; Thomas et al., 2017). We use a method based on variational autoencoders, but these prior techniques are complementary to ours and could be incorporated into our method.
3 Background
Our method combines reinforcement learning with goal-conditioned value functions and unsupervised representation learning. Here, we briefly review the techniques that we build on in our method.
Goal-conditioned reinforcement learning.
In reinforcement learning, the goal is to learn a policy $\pi(s_t)$ that maximizes expected return, which we denote as $R_t = \mathbb{E}\left[\sum_{i=t}^{\infty} \gamma^{(i-t)} r_i\right]$, where $r_i = r(s_i, a_i, s_{i+1})$ and the expectation is under the current policy and environment dynamics. Here, $s_t \in \mathcal{S}$ is a state observation, $a_t \in \mathcal{A}$ is an action, and $\gamma$ is a discount factor. Standard model-free RL learns policies that achieve a single task. If our aim is instead to obtain a policy that can accomplish a variety of tasks, we can construct a goal-conditioned policy $\pi(s_t, g)$ and reward $r(s_i, a_i, s_{i+1}, g)$, and optimize the expected return with respect to a goal distribution: $\mathbb{E}_{g \sim \mathcal{G}}[R_t]$, where $\mathcal{G}$ is the set of goals and the reward is also a function of $g$. A variety of algorithms can learn goal-conditioned policies, but to enable sample-efficient learning, we focus on algorithms that acquire goal-conditioned Q-functions, which can be trained off-policy. A goal-conditioned Q-function $Q(s, a, g)$ learns the expected return for the goal $g$ starting from state $s$ and taking action $a$. Given a state $s$, action $a$, next state $s'$, goal $g$, and corresponding reward $r$, one can train an approximate Q-function parameterized by $w$ by minimizing the following Bellman error:

$\mathcal{E}(w) = \frac{1}{2} \left\| Q_w(s, a, g) - \left( r + \gamma \max_{a'} Q_{\bar{w}}(s', a', g) \right) \right\|^2$   (1)

where $\bar{w}$ indicates that $\bar{w}$ is treated as a constant. Crucially, one can optimize this loss using off-policy data with a standard actor-critic algorithm (Lillicrap et al., 2016; Fujimoto et al., 2018; Mnih et al., 2016).
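As a concrete illustration of the Bellman error in Equation (1), the following minimal Python sketch computes the loss for a single transition. The names `q_w`, `q_w_target`, and the toy negative-distance Q-function are our own illustrative stand-ins; the paper's implementation uses neural network function approximators trained with an actor-critic algorithm rather than an explicit max over a discrete action set.

```python
def bellman_error(q_w, q_w_target, s, a, s_next, g, r, actions, gamma=0.99):
    """0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_wbar(s',a',g)))^2.

    q_w_target plays the role of Q_wbar in Equation (1): it is held fixed
    (treated as a constant) when differentiating with respect to w.
    """
    td_target = r + gamma * max(q_w_target(s_next, a2, g) for a2 in actions)
    return 0.5 * (q_w(s, a, g) - td_target) ** 2

# Toy usage: a hand-written Q equal to the negative distance to the goal,
# evaluated on a single transition that happens to be Bellman-consistent.
q = lambda s, a, g: -abs(s + a - g)
err = bellman_error(q, q, s=0.0, a=1.0, s_next=1.0, g=2.0, r=-1.0,
                    actions=[-1.0, 0.0, 1.0])
```

For this transition the TD target and the Q-value coincide, so the error is zero; in practice the target network `q_w_target` is a lagged copy of `q_w`.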
Variational Autoencoders.
Variational autoencoders (VAEs) have been demonstrated to learn structured latent representations of high-dimensional data (Kingma and Welling, 2014). The VAE consists of an encoder $q_\phi(z \mid s)$, which maps states to latent distributions, and a decoder $p_\psi(s \mid z)$, which maps latents to distributions over states. The encoder and decoder parameters, $\phi$ and $\psi$ respectively, are jointly trained to maximize

$\mathcal{L} = \mathbb{E}_{z \sim q_\phi(z \mid s)}\left[ \log p_\psi(s \mid z) \right] - \beta D_{\mathrm{KL}}\left( q_\phi(z \mid s) \,\|\, p(z) \right)$   (2)

where $p(z)$ is some prior, which we take to be the unit Gaussian, $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence, and $\beta$ is a hyperparameter that balances the two terms. The use of $\beta$ values other than one is sometimes referred to as a $\beta$-VAE (Higgins et al., 2017a). The encoder $q_\phi$ parameterizes the mean and log-variance diagonal of a Gaussian distribution, $q_\phi(z \mid s) = \mathcal{N}(\mu_\phi(s), \sigma^2_\phi(s))$. The decoder $p_\psi$ parameterizes a Bernoulli distribution for each pixel value. This parameterization corresponds to training the decoder with a cross-entropy loss on normalized pixel values. Full details of the hyperparameters are in the Supplementary Material.
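A minimal numpy sketch of this $\beta$-VAE objective for a single image, assuming a Bernoulli pixel likelihood and a diagonal Gaussian encoder as described above. The stand-in linear `decode` map, weight `W`, and the toy shapes are illustrative only; the paper uses convolutional networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_unit_gaussian(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def neg_elbo(s, mu, logvar, decode, beta=1.0):
    # One-sample estimate of -L in Equation (2):
    # Bernoulli reconstruction cross-entropy plus beta times the KL term.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparam.
    p = decode(z)                                # Bernoulli mean per pixel
    recon = -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
    return recon + beta * kl_to_unit_gaussian(mu, logvar)

# Toy usage: 4 "pixels", 2 latent dims, sigmoid decoder with random weights.
W = rng.standard_normal((4, 2))
decode = lambda z: 1.0 / (1.0 + np.exp(-W @ z))
s = np.array([0.0, 1.0, 1.0, 0.0])
loss = neg_elbo(s, mu=np.zeros(2), logvar=np.zeros(2), decode=decode, beta=5.0)
```

Minimizing `neg_elbo` over the encoder outputs and decoder weights maximizes Equation (2); with `beta > 1` the KL term is weighted more heavily, as in a $\beta$-VAE.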
4 Goal-Conditioned Policies with Unsupervised Representation Learning
To devise a practical algorithm based on goal-conditioned value functions, we must choose a suitable goal representation. In the absence of domain knowledge and instrumentation, a general-purpose choice is to set the goal space $\mathcal{G}$ to be the same as the state observation space $\mathcal{S}$. This choice is fully general, as it can be applied to any task, and still permits considerable user control, since the user can choose a "goal state" to set a desired goal for a trained goal-conditioned policy. But when the state space corresponds to high-dimensional sensory inputs such as images,¹ learning a goal-conditioned Q-function and policy becomes exceedingly difficult, as we illustrate empirically in Section 5.

¹We make the simplifying assumption that the system is Markovian with respect to the sensory input. One could incorporate memory into the state for partially observed tasks.
Our method jointly addresses a number of problems that arise when working with high-dimensional inputs such as images: sample-efficient learning, reward specification, and automated goal-setting. We address these problems by learning a latent embedding using a VAE. We use this latent space to represent the goal and state, and we retroactively relabel data with latent goals sampled from the VAE prior to improve sample efficiency. We also show that distances in the latent space give us a well-shaped reward function for images. Lastly, we sample from the prior to allow an agent to set and "practice" reaching its own goals, removing the need for humans to specify new goals during training time. We next describe the specific components of our method, and summarize the complete algorithm in Section 4.5.
4.1 Sample-Efficient RL with Learned Representations
One challenging problem with end-to-end approaches for visual RL tasks is that the resulting policy needs to learn both perception and control. Rather than operating directly on observations, we embed the state $s$ and goal $g$ into a latent space $\mathcal{Z}$ using an encoder $e$ to obtain a latent state $z = e(s)$ and latent goal $z_g = e(g)$. To learn a representation of the state and goal space, we train a VAE by executing a random policy and collecting state observations, and optimize Equation (2). We then use the mean of the encoder as the state encoding, i.e. $z = \mu_\phi(s)$.
After training the VAE, we train a goal-conditioned Q-function $Q(z, a, z_g)$ and corresponding policy $\pi(z, z_g)$ in this latent space. The policy is trained to reach a goal $z_g$ using the reward function discussed in Section 4.2. For the underlying RL algorithm, we use twin delayed deep deterministic policy gradients (TD3) (Fujimoto et al., 2018), though any value-based RL algorithm could be used. Note that the policy (and Q-function) operates completely in the latent space. At test time, to reach a specific goal state $g$, we encode the goal $z_g = e(g)$ and input this latent goal to the policy.
As the policy improves, it may visit parts of the state space that the VAE was never trained on, resulting in arbitrary encodings that may not make learning easier. Therefore, in addition to the procedure described above, we fine-tune the VAE using both the randomly generated state observations and the state observations collected during exploration. We show in Section 9.3 that this additional training improves the performance of the algorithm.
4.2 Reward Specification
Training the goal-conditioned value function requires defining a goal-conditioned reward $r(s, g)$. Using Euclidean distance in the space of image pixels provides a poor metric, since similar configurations in the world can be extremely different in image space. In addition to compactly representing high-dimensional observations, we can utilize our representation to obtain a reward function based on a metric that better reflects the similarity between the state and the goal. One choice for such a reward is the negative Mahalanobis distance in the latent space:

$r(s, g) = -\| e(s) - e(g) \|_A$

where the matrix $A$ weights different dimensions in the latent space. This approach has an appealing interpretation when we set $A$ to be the precision matrix of the VAE encoder, $q_\phi$. Since we use a Gaussian encoder, we have that

$r(s, g) = -\| e(s) - e(g) \|_A \propto \sqrt{\log q_\phi(z_g \mid s)}$   (3)

In other words, minimizing this squared distance in the latent space is equivalent to rewarding reaching states that maximize the probability of the latent goal $z_g$. In practice, we found that setting $A = I$, corresponding to Euclidean distance, performed better than the Mahalanobis distance, though its effect is the same: to bring $e(s)$ close to $e(g)$ and maximize the probability of the latent goal given the observation. This interpretation would not be possible with standard autoencoders, since their distances are not trained to have any probabilistic meaning. Indeed, we show in Section 5 that using distances in a standard autoencoder representation often does not result in meaningful behavior.

4.3 Improving Sample Efficiency with Latent Goal Relabeling
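The two latent-space rewards discussed above can be sketched in a few lines of numpy. The function name `latent_reward` and the toy latent vectors are illustrative; the inputs stand in for the encoder means $e(s)$ and $e(g)$.

```python
import numpy as np

def latent_reward(z, z_g, precision=None):
    """Negative (Mahalanobis) distance between latent state and latent goal.

    With precision=None (i.e. A = I), this is the negative Euclidean
    distance used by RIG; otherwise it is the negative Mahalanobis
    distance -||e(s) - e(g)||_A for the given weighting matrix A.
    """
    d = z - z_g
    if precision is None:
        return -np.sqrt(d @ d)
    return -np.sqrt(d @ precision @ d)

# Toy usage: 2-D latents one unit apart along the first dimension.
z, z_g = np.array([1.0, 0.0]), np.array([0.0, 0.0])
r_euclid = latent_reward(z, z_g)                     # A = I
r_mahal = latent_reward(z, z_g, np.diag([4.0, 1.0])) # weighted dimensions
```

With a diagonal precision of 4 on the first latent dimension, the same displacement is penalized twice as heavily as under the Euclidean metric, which is how a small encoder variance inflates the reward magnitude.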
To further enable sample-efficient learning in the real world, we use the VAE to relabel goals. Note that we can optimize Equation (1) using any valid $(s, a, s', g, r)$ tuple. If we could artificially generate these tuples, then we could train our entire RL algorithm without collecting any data. Unfortunately, we do not know the system dynamics, and therefore have to sample transitions $(s, a, s')$ by interacting with the world. However, we have the freedom to relabel the goal and reward synthetically. So if we have a mechanism for generating goals and computing rewards, then given $(s, a, s')$, we can generate a new goal $g'$ and new reward $r' = r(s, a, s', g')$ to produce a new tuple $(s, a, s', g', r')$. By artificially generating goals and recomputing rewards, we can convert a single transition into potentially infinitely many valid training data points.
For image-based tasks, this procedure would require generating goal images, an onerous task on its own. However, our reinforcement learning algorithm operates directly in the latent space for goals and rewards. So rather than generating goals $g$, we generate latent goals $z_g$ by sampling from the VAE prior $p(z)$. We then recompute rewards using Equation (3). By retroactively relabeling the goals and rewards, we obtain much more data to train our value function. This sampling procedure is made possible by our use of a latent variable model, which is explicitly trained so that sampling from the latent distribution is straightforward.
In practice, the distribution of latents will not exactly match the prior. To mitigate this distribution mismatch, we use a fitted prior when sampling from the prior: we fit a diagonal Gaussian to the latent encodings of the VAE training data, and use this fitted prior in place of the unit Gaussian prior.
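The fitted prior described above amounts to estimating a diagonal Gaussian from the encoder outputs on the training set. A minimal numpy sketch, where the function names and the toy shifted latents are our own illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_diagonal_gaussian(latents):
    # latents: (num_examples, latent_dim) array of encoder means.
    return latents.mean(axis=0), latents.std(axis=0)

def sample_latent_goals(mu, sigma, n):
    # Draw n latent goals from the fitted diagonal Gaussian.
    return mu + sigma * rng.standard_normal((n, mu.shape[0]))

# Toy latents whose distribution is shifted away from the unit Gaussian,
# mimicking the mismatch between encodings and the N(0, I) prior.
latents = 3.0 + 0.1 * rng.standard_normal((1000, 4))
mu, sigma = fit_diagonal_gaussian(latents)
goals = sample_latent_goals(mu, sigma, n=32)  # plausible goals near the data
```

Sampling from the unit Gaussian here would place most goals far from any encoding the agent has seen; the fitted prior keeps sampled goals near the data distribution.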
Retroactively generating goals is also explored in tabular domains by Kaelbling (1993) and in continuous domains by Andrychowicz et al. (2017) using hindsight experience replay (HER). However, HER is limited to sampling goals seen along a trajectory, which greatly limits the number and diversity of goals with which one can relabel a given transition. Our final method uses a mixture of the two strategies: half of the relabeled goals are sampled from the prior, and half are drawn using the "future" strategy described in Andrychowicz et al. (2017). We show in Section 5 that relabeling goals with samples from the VAE prior results in significantly better sample efficiency.
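This mixed relabeling strategy can be sketched as follows. The function `relabel_goal`, the toy 1-D latent trajectory, and the diagonal-Gaussian prior parameters are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel_goal(t, trajectory_latents, prior_mu, prior_sigma, p_vae=0.5):
    """Return a relabeled latent goal for the transition at index t.

    With probability p_vae, draw a fresh latent goal from the (fitted)
    diagonal-Gaussian prior; otherwise pick a future state of the same
    trajectory, as in hindsight experience replay's "future" strategy.
    """
    if rng.random() < p_vae:                                   # "VAE" goal
        return prior_mu + prior_sigma * rng.standard_normal(prior_mu.shape)
    future = rng.integers(t + 1, len(trajectory_latents))      # "future" goal
    return trajectory_latents[future]

# Toy usage: a 1-D latent trajectory whose value equals its time index.
traj = np.arange(10.0).reshape(10, 1)
g = relabel_goal(3, traj, prior_mu=np.zeros(1), prior_sigma=np.ones(1))
```

After relabeling, the reward for the stored transition is recomputed with Equation (3) under the new goal before the tuple is used for the off-policy update.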
4.4 Automated Goal-Generation for Exploration
If we do not know which particular goals will be provided at test time, we would like our RL agent to carry out a self-supervised "practice" phase during training, in which the algorithm proposes its own goals and then practices reaching them. Since the VAE prior $p(z)$ represents a distribution over latent goals and state observations, we again sample from this distribution to obtain plausible goals. After sampling a goal latent $z_g \sim p(z)$, we give this latent goal to our policy to collect data.
4.5 Algorithm Summary
We call the complete algorithm reinforcement learning with imagined goals (RIG) and summarize it in Algorithm 1. We first collect data with a simple exploration policy, though any exploration strategy could be used for this stage, including off-the-shelf exploration bonuses (Pathak et al., 2017; Bellemare et al., 2016) or unsupervised reinforcement learning methods (Eysenbach et al., 2018; Florensa et al., 2017). Then, we train a VAE latent variable model on the state observations and fine-tune it over the course of training. We use this latent variable model for multiple purposes: we sample a latent goal from the model and condition the policy on this goal; we embed all states and goals using the model's encoder; and, when we train our goal-conditioned value function, we resample goals from the prior and compute rewards in the latent space using Equation (3). Any RL algorithm that trains Q-functions could be used, and we use TD3 (Fujimoto et al., 2018) in our implementation.
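To show how the pieces of Algorithm 1 fit together (sample an imagined goal, roll out, relabel, update), here is a deliberately simplified, runnable sketch. The stub `encode`, `reward`, environment, and policy are toy 1-D stand-ins; the real algorithm uses a VAE encoder, the 50/50 relabeling mixture, and TD3 updates in place of the pieces marked below.

```python
import numpy as np

rng = np.random.default_rng(0)

encode = lambda s: s                      # stub for the VAE encoder mean e(s)
reward = lambda z, z_g: -np.abs(z - z_g)  # latent Euclidean reward (1-D case)

def rig_episode(policy_step, env_step, s0, prior_mu, prior_sigma, horizon=5):
    z_g = prior_mu + prior_sigma * rng.standard_normal()  # imagined goal
    s, replay = s0, []
    for _ in range(horizon):
        a = policy_step(encode(s), z_g)
        s_next = env_step(s, a)
        replay.append((encode(s), a, encode(s_next), z_g,
                       reward(encode(s_next), z_g)))
        s = s_next
    # Relabel: here every stored goal is simply resampled from the prior;
    # RIG mixes prior goals with future-state goals, then runs an
    # off-policy Q-function update on the combined data.
    relabeled = [(z, a, z2, g2, reward(z2, g2))
                 for (z, a, z2, _, _) in replay
                 for g2 in [prior_mu + prior_sigma * rng.standard_normal()]]
    return replay + relabeled

policy = lambda z, z_g: np.clip(z_g - z, -1.0, 1.0)  # greedy toy policy
env = lambda s, a: s + a                             # trivial dynamics
buffer = rig_episode(policy, env, s0=0.0, prior_mu=2.0, prior_sigma=0.5)
```

Each collected transition yields two training tuples here (the original and one relabeled copy), illustrating how relabeling multiplies the effective amount of training data.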
5 Experiments
Our experiments address the following questions:

How does our method compare to prior modelfree RL algorithms in terms of sample efficiency and performance, when learning continuous control tasks from images?

How critical is each component of our algorithm for efficient learning?

Does our method work on tasks where the state space cannot be easily specified ahead of time, such as tasks that require interaction with variable numbers of objects?

Can our method scale to real world visionbased robotic control tasks?
For the first two questions, we evaluate our method against a number of prior algorithms and ablated versions of our approach on the following suite of simulated tasks. Visual Reacher: a MuJoCo (Todorov et al., 2012) environment with a 7-DoF Sawyer arm reaching goal positions. The arm is shown on the left of Figure 2. The end-effector (EE) is constrained to a 2-dimensional rectangle parallel to a table. The action controls EE velocity up to a maximum speed. Visual Pusher: a MuJoCo environment with a 7-DoF Sawyer arm and a small puck on a table that the arm must push to a target position. Visual Multi-Object Pusher: a copy of the Visual Pusher environment with two pucks. Visual Door: a Sawyer arm with a door it can attempt to open by latching onto the handle. Visual Pick and Place: a Sawyer arm with a small ball and an additional dimension of control for opening and closing the gripper. Detailed descriptions of the environments are provided in the Supplementary Material. The environments and algorithm implementation are publicly available.
Solving these tasks directly from images poses a challenge since the controller must learn both perception and control. The evaluation metric is the distance of objects (including the arm) to their respective goals. To evaluate our policy, we set the environment to a sampled goal position, capture an image, and encode the image to use as the goal. Although we use the groundtruth positions for evaluation,
we do not use the ground-truth positions for training the policies. The only inputs from the environment that our algorithm receives are the image observations. For Visual Reacher, we pretrained the VAE with 100 images. For the other tasks, we used 10,000 images. In all our simulation results, each plot shows a 95% confidence interval of the mean across 5 seeds.
Figure 3: RIG (red) consistently outperforms the baselines, except for the oracle, which uses ground-truth object state for observations and rewards. On the hardest tasks, only our method and the oracle discover viable solutions.

We compare our method with the following prior works. L&R: Lange and Riedmiller (2010) trains an autoencoder to handle images. DSAE: Deep spatial autoencoders (Finn et al., 2016) learn a spatial autoencoder and use guided policy search (Levine et al., 2016) to achieve a single goal image. HER: Hindsight experience replay (Andrychowicz et al., 2017) utilizes a sparse reward signal and relabels trajectories with achieved goals. Oracle: RL with direct access to state information for observations and rewards.
To our knowledge, no prior work demonstrates policies that can reach a variety of goal images without access to a true-state reward function, so we made modifications to make the comparisons feasible. L&R assumes a reward function from the environment. Since we have no state-based reward function, we specify the reward function as distance in the autoencoder latent space. HER does not embed inputs into a latent space but instead operates directly on the input, so we use pixel-wise mean squared error (MSE) as the metric. DSAE is trained only for a single goal, so we allow the method to generalize to a variety of test goal images by using a goal-conditioned Q-function. To make the implementations comparable, we use the same off-policy algorithm, TD3 (Fujimoto et al., 2018), to train L&R, HER, and our method. Unlike our method, the prior methods do not specify how to select goals during training, so we favorably give them real images as goals for rollouts, sampled from the same distribution that we use at test time.
We see in Figure 3 that our method can efficiently learn policies from visual inputs to perform simulated reaching and pushing, without access to the object state. Our approach substantially outperforms the prior methods, for which the use of image goals and observations poses a major challenge. HER struggles because pixel-wise MSE is hard to optimize. Our latent-space rewards are much better shaped and allow us to learn more complex tasks. Finally, our method is close to the state-based "oracle" method in terms of sample efficiency and performance, without having any access to object state. Notably, in the multi-object environment, our method actually outperforms the oracle, likely because the state-based reward contains local minima. Overall, these results show that our method is capable of handling raw image observations much more effectively than previously proposed goal-conditioned RL methods. Next, we perform ablations to evaluate our contributions in isolation. Results on Visual Pusher are shown here; see the Supplementary Material (Section 9) for experiments on all three simulated environments.
Reward Specification Comparison
We evaluate how effective distance in the VAE latent space is for the Visual Pusher task. We keep our method the same and change only the reward function used to train the goal-conditioned value function. We include the following methods for comparison: Latent Distance, which uses the reward used in RIG, i.e. $A = I$ in Equation (3); Log Probability, which uses the Mahalanobis distance in Equation (3), where $A$ is the precision matrix of the encoder; and Pixel MSE, which uses mean-squared error (MSE) between state and goal in pixel space. (To compute the pixel MSE for a sampled latent goal, we decode the goal latent using the VAE decoder, $p_\psi$, to generate the corresponding goal image.) In Figure 4, we see that latent distance significantly outperforms log probability. We suspect that the small variances of the VAE encoder result in drastically large rewards, making learning more difficult. We also see that latent distance results in faster learning compared to pixel MSE.
Relabeling Strategy Comparison
As described in Section 4.3, our method uses a novel goal relabeling method based on sampling from the generative model. To isolate how much this relabeling method contributes to our algorithm, we vary the resampling strategy while fixing the other components of our algorithm. The resampling strategies that we consider are: Future, relabeling the goal for a transition by sampling uniformly from future states in the trajectory, as done in Andrychowicz et al. (2017); VAE, sampling goals from the VAE only; RIG, relabeling goals with probability 0.5 from the VAE and probability 0.5 using the Future strategy; and None, no relabeling. In Figure 5, we see that sampling from the VAE and Future is significantly better than not relabeling at all. In RIG, we use an equal mixture of the VAE and Future sampling strategies, which performs best by a large margin. Appendix Section 9.1 contains results on all simulated environments, and Section 9.4 considers relabeling strategies with a known goal distribution.
Learning with Variable Numbers of Objects
A major advantage of working directly from pixels is that the policy input can easily represent combinatorial structure in the environment, which would be difficult to encode into a fixed-length state vector even if a perfect perception system were available. For example, if a robot has to interact with different combinations and numbers of objects, picking a single MDP state representation would be challenging, even with access to object poses. By directly processing images for both the state and the goal, no modification is needed to handle the combinatorial structure: the number of pixels always remains the same, regardless of how many objects are in the scene.
We demonstrate that our method can handle this difficult scenario by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects in each episode during testing. During training, each episode still always starts with both objects in the scene, so the experiment tests whether a trained policy can handle a variable number of objects at test time. Figure 6 shows that our method can learn to solve this task successfully, without a decrease in performance from the base setting where both objects are present (Figure 3). Developing and demonstrating algorithms that solve tasks with varied underlying structure is an important step toward creating autonomous agents that can handle the diversity of tasks present "in the wild."
Figure 8: (Left) The learning curve for real-world pushing. (Middle) Our robot pushing setup is pictured, with frames from test rollouts of our learned policy. (Right) Our method compared to the HER baseline on the real-world visual pushing task. We evaluated the performance of each method by manually measuring the distance between the goal position of the puck and the final position of the puck (in cm) for 15 test rollouts, reporting the mean and standard deviation.
5.1 Visual RL with Physical Robots
RIG is a practical and straightforward algorithm to apply to real physical systems: the efficiency of off-policy learning with goal relabeling makes training times manageable, while the use of image-based rewards through the learned representation frees us from the burden of manually designing reward functions, which can itself require hand-engineered perception systems (Rusu et al., 2017). We trained policies for visual reaching and pushing on a real-world Sawyer robotic arm, shown in Figure 7. The control setup matches Visual Reacher and Visual Pusher respectively, meaning that the only input from the environment consists of camera images.
We see in Figure 7 that our method is applicable to real-world robotic tasks, almost matching the state-based oracle method and far exceeding the baseline method on the reaching task. Our method needs just 10,000 samples, or about an hour of real-world interaction time, to solve visual reaching.
Real-world pushing results are shown in Figure 8. To solve Visual Pusher, which is more visually complicated and requires reasoning about the contact between the arm and the object, our method requires about 25,000 samples, which is still a reasonable amount of real-world training time. Note that, unlike in the previous results, we do not have access to the true puck position during training, so for the learning curve we report test-episode returns under the VAE latent distance reward. We see RIG making steady progress at optimizing the latent distance as learning proceeds.
6 Discussion and Future Work
In this paper, we present a new RL algorithm that can efficiently solve goal-conditioned, vision-based tasks without access to any ground-truth state or reward functions. Our method trains a generative model that is used for multiple purposes: we embed the state and goals using the encoder; we sample from the prior to generate goals for exploration; we sample latents to retroactively relabel goals and rewards; and we use distances in the latent space as rewards to train a goal-conditioned value function. We show that these components culminate in a sample-efficient algorithm that works directly from vision. As a result, we are able to apply our method to a variety of simulated visual tasks, including a variable-object task that cannot be easily represented with a fixed-length vector, as well as to real-world robotic tasks. Algorithms that can learn in the real world and directly use raw images can allow a single policy to solve a large and diverse set of tasks, even when these tasks require distinct internal representations.
The method we presented can be extended in a number of ways. First, an exciting line of future work would be to combine our method with existing work on exploration and intrinsic motivation. In particular, our method already provides a natural mechanism for autonomously generating goals by sampling from the prior. Modifying this procedure to be not only goal-oriented but also, e.g., information-seeking or uncertainty-aware could provide better and safer exploration. Second, since our method operates directly from images, a single policy could potentially solve a large and diverse set of visual tasks, even if those tasks have different underlying state representations. Combining these ideas with methods from multi-task learning and meta-learning is a promising path toward general-purpose agents that can continuously and efficiently acquire skills. Lastly, while RIG uses goal images, extending the method to allow goals specified by demonstrations, or by more abstract representations such as language, would make our system much more flexible in interfacing with humans and therefore more practical.
7 Code
The environments are available publicly at https://github.com/vitchyr/multiworld and the algorithm implementation is available at https://github.com/vitchyr/rlkit.
8 Acknowledgements
We would like to thank Aravind Srinivas and Pulkit Agrawal for useful discussions, and Alex Lee for helpful feedback on an initial draft of the paper. We would also like to thank Carlos Florensa for making multiple useful suggestions on a later version of the draft. This work was supported by the National Science Foundation grants IIS-1651843 and IIS-1614653, a Huawei Fellowship, Berkeley DeepDrive, Siemens, and support from NVIDIA.
References
 Agrawal et al. (2016) Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.
 Baranes and Oudeyer (2012) A Baranes and PY Oudeyer. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):49–73, 2012. doi: 10.1016/j.robot.2012.05.008.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pages 1471–1479, 2016.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2172–2180, 2016.
 Cheung et al. (2014) Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
 Desjardins et al. (2012) Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. CoRR, abs/1210.5, 2012.
 Ebert et al. (2017) Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-Supervised Visual Planning with Temporal Skip Connections. In Conference on Robot Learning (CoRL), 2017.
 Eysenbach et al. (2018) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function. arXiv preprint arXiv:1802.06070, 2018.
 Finn and Levine (2016) Chelsea Finn and Sergey Levine. Deep Visual Foresight for Planning Robot Motion. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Finn et al. (2016) Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
 Florensa et al. (2017) Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
 Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in ActorCritic Methods. arXiv preprint arXiv:1802.09477, 2018.
 Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World Models. arXiv preprint arXiv:1803.10122, 2018.
 Higgins et al. (2017a) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR), 2017a.
 Higgins et al. (2017b) Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017b.
 Jonschkowski et al. (2017) Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.
 Kaelbling (1993) L P Kaelbling. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), volume 2, pages 1094–1098, 1993.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
 Lange and Riedmiller (2010) Sascha Lange and Martin A Riedmiller. Deep learning of visual control policies. In European Symposium on Artificial Neural Networks (ESANN), 2010.
 Lange et al. (2012) Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2012.
 Lee et al. (2017) Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016.
 Levine et al. (2017) Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. International Journal of Robotics Research, 2017.
 Levy et al. (2017) Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical Actor-Critic. arXiv preprint arXiv:1712.00948, 2017.
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
 Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), 2016.
 Oh et al. (2015) Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In Advances in Neural Information Processing Systems (NIPS), 2015.
 Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-Driven Exploration by Self-Supervised Prediction. In International Conference on Machine Learning (ICML), 2017.
 Pathak et al. (2018) Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.
 Péré et al. (2018) Alexandre Péré, Sebastien Forestier, Olivier Sigaud, and Pierre-Yves Oudeyer. Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration. In International Conference on Learning Representations (ICLR), 2018.
 Pinto and Gupta (2016) Lerrel Pinto and Abhinav Gupta. Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. IEEE International Conference on Robotics and Automation (ICRA), 2016.
 Pinto et al. (2017) Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.
 Plappert et al. (2018) Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research. arXiv preprint arXiv:1802.09464, 2018.
 Pong et al. (2018) Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal Difference Models: Model-Free Deep RL for Model-Based Control. In International Conference on Learning Representations (ICLR), 2018.
 Ponomarenko et al. (2015) Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, and Others. Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
 Rauber et al. (2017) Paulo Rauber, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In CoRR, volume abs/1711.0, 2017.
 Reed et al. (2014) Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pages 1431–1439, 2014.
 Rusu et al. (2017) Andrei A Rusu, Matej Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. Conference on Robot Learning (CoRL), 2017.
 Schaul et al. (2015) Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators. In International Conference on Machine Learning (ICML), pages 1312–1320, 2015.
 Schmidhuber (1992) Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 Sermanet et al. (2017) Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.
 Smith and Gasser (2005) Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.
 Srinivas et al. (2018) Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.
 Sutton et al. (2011) Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A Scalable Realtime Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 10:761–768, 2011.
 Thomas et al. (2017) Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, MarieJean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently Controllable Factors. In NIPS Workshop, 2017.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In The IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
 Watter et al. (2015) Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Advances in Neural Information Processing Systems (NIPS), pages 2728–2736, 2015.
 Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv preprint arXiv:1801.03924, 2018.
9 Complete Ablative Results
9.1 Relabeling strategy ablation
In this experiment, we compare different goal resampling strategies for training the Q-function. We consider: Future, relabeling the goal for a transition by sampling uniformly from future states in the trajectory, as done in Andrychowicz et al. (2017); VAE, sampling goals from the VAE only; RIG, a mixture that relabels goals by sampling from the VAE with some probability and otherwise using the future strategy; and None, no relabeling. Figure 9 shows the effect of different relabeling strategies with our method.
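The four strategies above can be sketched in one function. This is an illustrative reimplementation, not our exact code: the function name, the `sample_from_vae` callable, and the `p_vae` mixture weight are placeholders.

```python
import random

def relabel_goal(t, trajectory, strategy, sample_from_vae, p_vae=0.5):
    """Pick a relabeled goal for the transition at index t of a stored trajectory.

    trajectory: list of latent states; sample_from_vae: callable drawing a latent
    goal from the VAE prior. Returns None to keep the original goal.
    """
    if strategy == "none":
        return None                                   # keep the original goal
    if strategy == "vae":
        return sample_from_vae()                      # goal from the generative model
    if strategy == "future":
        # HER-style: uniformly pick a state observed later in the same trajectory.
        return trajectory[random.randint(t, len(trajectory) - 1)]
    if strategy == "rig":
        # Mixture: VAE prior with probability p_vae, otherwise the future strategy.
        if random.random() < p_vae:
            return sample_from_vae()
        return trajectory[random.randint(t, len(trajectory) - 1)]
    raise ValueError(f"unknown strategy: {strategy}")
```

In an off-policy setup, the relabeled goal replaces the stored one in each sampled minibatch transition before the Q-function update.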
9.2 Reward type ablation
In this experiment, we change only the reward function used to train the goal-conditioned value function, in order to show the effect of the latent distance reward. We include the following methods for comparison: Latent Distance, which is the reward used in RIG, i.e. in Equation (3); Log Probability, which uses the Mahalanobis distance in Equation (3) under the precision matrix of the encoder; and Pixel MSE, which computes mean-squared error (MSE) between state and goal in pixel space. To compute the pixel MSE for a sampled latent goal, we decode the goal latent using the VAE decoder to generate the corresponding goal image. Figure 10 shows the effect of different rewards with our method.
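The three reward types can be sketched as follows. This is an illustrative reconstruction: the function names are placeholders, the diagonal form of the encoder precision is an assumption, and `decode` stands in for the VAE decoder.

```python
import numpy as np

def latent_distance_reward(z, z_goal):
    """RIG's reward: negative Euclidean distance between latents."""
    return -float(np.linalg.norm(z - z_goal))

def log_prob_reward(z, z_goal, precision_diag):
    """Negative Mahalanobis distance under a (here diagonal) encoder precision."""
    d = z - z_goal
    return -float(np.sqrt(np.sum(precision_diag * d * d)))

def pixel_mse_reward(image, z_goal, decode):
    """Decode the latent goal back to pixels, then compare in image space."""
    goal_image = decode(z_goal)
    return -float(np.mean((image - goal_image) ** 2))
```

With identity precision, the log-probability reward reduces to the latent-distance reward, which makes the comparison between the two a test of how much the encoder's uncertainty weighting matters.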
9.3 Online training ablation
Rather than pretraining the VAE on a set of images collected by a random policy, here we train the VAE in an online manner: the VAE is not trained when we initially collect data with our policy. After every 3000 environment steps, we train the VAE on all of the images observed by the policy. We show in Figure 11 that this online training results in a good policy and is substantially better than leaving the VAE untrained. These results show that representation learning can be done concurrently with the reinforcement learning portion of RIG, eliminating the need for a predefined set of images to train the VAE.
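The online schedule amounts to interleaving RL updates with periodic VAE fits. A minimal sketch, with `collect_image`, `rl_update`, and `fit_vae` as hypothetical stand-ins for the corresponding subroutines:

```python
def train_online(total_steps, collect_image, rl_update, fit_vae, period=3000):
    """Interleave policy learning with VAE fitting on all images seen so far."""
    images = []
    n_fits = 0
    for step in range(1, total_steps + 1):
        images.append(collect_image())  # act with the current policy, store image
        rl_update()                     # off-policy RL step on the replay buffer
        if step % period == 0:
            fit_vae(list(images))       # refit the VAE on everything observed
            n_fits += 1
    return n_fits
```

Because the policy, reward, and relabeled goals all depend on the encoder, each VAE refit changes the latent space the RL components operate in; the ablation shows this nonstationarity is tolerable in practice.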
The Visual Pusher experiment for this ablation is performed on a slightly easier version of the Visual Pusher used for the main results. In particular, the goal space is reduced to be three quarters of its original size in the lateral dimension.
9.4 Comparison to Hindsight Experience Replay
In this section, we study in isolation the effect of sampling goals from the goal space directly for Q-learning, as covered in Section 4.3. As in hindsight experience replay (Andrychowicz et al., 2017), in this section we assume access to state information and the goal space, so we do not use a VAE.
To match the original work as closely as possible, this comparison was based on the OpenAI Baselines code (Plappert et al., 2018), and we compare on the same Fetch robotics tasks. To minimize sample complexity and due to computational constraints, we use single-threaded training with rollout_batch_size=1, n_cycles=1, batch_size=256. For testing, n_test_rollouts=1 and the results are averaged over the last 100 test episodes. The number of updates per cycle corresponds to n_batches.
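For reference, the flags above can be collected in one place. The flag names follow the OpenAI Baselines HER implementation cited above; the dict itself is only an illustrative summary, not a file from our codebase.

```python
# Single-threaded HER configuration used for this comparison.
her_config = {
    "rollout_batch_size": 1,  # one rollout worker (single-threaded training)
    "n_cycles": 1,            # one training cycle per epoch
    "batch_size": 256,        # minibatch size per gradient update
    "n_test_rollouts": 1,     # one test rollout per epoch; average last 100 episodes
    # "n_batches" (gradient updates per cycle) is the quantity varied below
}
```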
On the plots, "Future" indicates the future strategy as presented in Andrychowicz et al. (2017). "Ours" indicates resampling goals with probability 0.5 from the future strategy and with probability 0.5 uniformly from the environment goal space. Each method is shown with dense and sparse rewards.
Results are shown in Figure 12. Our resampling strategy with sparse rewards consistently performs best on the three tasks. Furthermore, it performs reasonably well with dense rewards, unlike HER alone, which often fails with dense rewards. While the evaluation metric used here, success rate, favors the sparse reward setting, learning with dense rewards is usually more sample-efficient on most tasks, and being able to do off-policy goal relabeling with dense rewards is important for RIG.
Finally, as the number of gradient updates per training cycle is increased, the performance of our strategy improves, while HER does not improve and sometimes performs worse. As we apply reinforcement learning to real-world tasks, reducing the number of samples required on hardware is one of the key bottlenecks. Increasing the number of gradient updates costs more compute but reduces the number of samples required to learn the tasks.
10 Hyperparameters
Table 1 lists the hyperparameters used for the experiments.
Hyperparameter  Value  Comments 

Mixture coefficient  See relabeling strategy ablation  
# training batches per time step  Marginal improvements after  
Exploration Policy  OU  Outperformed Gaussian and ε-greedy 
β for VAE  Values around were effective  
Critic Learning Rate  Did not tune  
Critic Regularization  None  Did not tune 
Actor Learning Rate  Did not tune  
Actor Regularization  None  Did not tune 
Optimizer  Adam  Did not tune 
Target Update Rate  Did not tune  
Target Update Period  time steps  Did not tune 
Target Policy Noise  Did not tune  
Target Policy Noise Clip  Did not tune  
Batch Size  Did not tune  
Discount Factor  Did not tune  
Reward Scaling  Did not tune  
Normalized Observations  False  Did not tune 
Gradient Clipping  False  Did not tune 
11 Environment Details
Below we provide a more detailed description of the simulated environments.
Visual Reacher: A MuJoCo environment with a 7-DoF Sawyer arm reaching goal positions. The arm is shown on the left of Figure 2, with two extra objects for the Visual Multi-Object Pusher environment (see below). The end-effector (EE) is constrained to a 2-dimensional rectangle parallel to a table. The action controls the EE velocity, up to a maximum speed. The underlying state is the EE position, and the underlying goal is to reach a desired EE position.
Visual Pusher: A MuJoCo environment with a 7-DoF Sawyer arm and a small puck on a table that the arm must push to a target position. Control is the same as in Visual Reacher. The underlying state is the EE position and the puck position. The underlying goal is for the EE to reach a desired position and for the puck to reach a desired position.
Visual Multi-Object Pusher: A copy of the Visual Pusher environment with two pucks. The underlying state is the EE position and the two puck positions. The underlying goal is for the EE to reach a desired position and for each puck to reach a desired position in its respective half of the workspace. Each puck and its respective goal are initialized in one half of the workspace.
Videos of our method in simulated and realworld environments can be found at https://sites.google.com/site/visualrlwithimaginedgoals/.