Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning

07/05/2019, by Srinivas Venkattaramanujam et al., McGill University

Goal-conditioned policies are used to break down complex reinforcement learning (RL) problems by using subgoals, which can be defined either in state space or in a latent feature space. This can increase the efficiency of learning by using a curriculum, and also enables simultaneous learning and generalization across goals. A crucial requirement of goal-conditioned policies is to be able to determine whether the goal has been achieved, so a notion of distance to a goal is a crucial component of this approach. However, it is not straightforward to come up with an appropriate distance, and in some tasks the goal space may not even be known a priori. In this work we learn, in a self-supervised manner, a distance-to-goal estimate defined in terms of the number of actions required to reach the goal. Our method solves complex tasks without prior domain knowledge in the online setting in three different scenarios in the context of goal-conditioned policies: a) the goal space is the same as the state space, b) the goal space is given but an appropriate distance is unknown, and c) the state space is accessible, but only a subset of the state space represents desired goals, and this subset is known a priori. We also propose a goal-generation mechanism as a secondary contribution.


1 Introduction

Reinforcement Learning (RL) is a framework for training agents to interact optimally with an environment. Recent advances in RL have led to algorithms that are capable of succeeding in a variety of environments, ranging from video games with high-dimensional image observations Mnih et al. (2013, 2015) to continuous control in complex robotic tasks Lillicrap et al. (2016); Schulman et al. (2015). Meanwhile, innovations in training powerful function approximators have all but removed the need for hand-crafted state representations, thus enabling RL methods to work with minimal human oversight or domain knowledge. However, one component of the RL workflow that still requires significant human input is the design of the reward function that the agent optimizes.

One way of alleviating this reliance on human input is by allowing the agent to condition its behavior on a provided goal Kaelbling (1993); Schaul et al. (2015), and training the agent to achieve (some approximation of) all possible goals afforded by the environment. A number of algorithms have recently been proposed along these lines, often making use of curriculum learning techniques to discover goals and train agents to achieve them in a structured way Narvekar et al. (2017); Florensa et al. (2018). At the end of this process, the agent is expected to be able to achieve any desired goal.

An important component of this class of algorithm is a distance function, used to determine whether the agent has reached its goal; this can also require human input and domain knowledge. In past work, it has been common to assume that the goal space is known and to use the Euclidean distance between the current state and the goal. However, this straightforward choice is not satisfactory for general environments, as it does not take environment dynamics into account. For example, it is possible for a state to be close to a goal in terms of Euclidean distance, and yet be far from satisfying it in terms of environment dynamics.

We propose a self-supervised method for learning a distance between a state and a goal which accurately reflects the dynamics of the environment. We begin by defining the distance between two states as the average number of time steps required to move from the first state to the second under some policy $\pi$. To make this distance usable as part of a goal-conditioned reward function, we train a neural network to approximate this quantity from data. The distance network is trained online, in conjunction with the training of the goal-conditioned policy.

The contributions of this work are as follows. i) We propose a self-supervised approach to learning a distance estimate, by learning an embedding with the property that the distance between the embeddings of two states approximates the average temporal distance between the states according to a policy $\pi$; ii) we demonstrate that the learned distance estimate can be used in the online setting in goal-conditioned policies; iii) we develop an automatic curriculum generation mechanism that takes advantage of our distance learning algorithm; and iv) we explain a phenomenon that arises due to learning the distance using samples from the behavior policy.

2 Related Work

Goal-conditioned RL aims to train agents that can reach any goal provided to them. Automatic goal generation approaches such as Florensa et al. (2018, 2017) focus on automatically generating goals of appropriate difficulty for the agent in order to facilitate efficient learning. These methods utilize domain knowledge to learn a goal space and use the Euclidean distance as the distance function in the goal space.

The simplest approach of using Euclidean distance in the goal space is applicable only when the goal space is known a priori. In most tasks, however, the goal space is inaccessible or an appropriate distance function in the goal space is unknown. For such scenarios there has been a recent focus on learning an embedding space for goals and using the distance in the embedding space as the distance function in the goal space Péré et al. (2018). Typically the embedding for the goal space is learned in an unsupervised fashion, such as training an autoencoder to minimize the reconstruction loss in the state space and using the representation learned by the autoencoder as the goal space Nair et al. (2018); Warde-Farley et al. (2018); Sukhbaatar et al. (2018). The main drawback of such unsupervised learning of the goal space is that it does not capture the environment dynamics. For example, it is possible for two states to appear similar in the state space, and thus be given similar embeddings by the autoencoder approach, while being far apart in terms of action distance.

Andrychowicz et al. (2017) and Rauber et al. (2019) focus on improving the sample efficiency of goal-conditioned policies by relabeling or reweighting the reward from the goal on which the trajectory was conditioned to a different goal that was a part of the trajectory. Our method is complementary to these approaches, since they rely on prior knowledge of the goal space and use the $\ell_1$ or $\ell_2$ distance in the goal space to determine whether the goal has been reached.

Similar to our work, Savinov et al. (2019) trained a network to predict whether the distance in actions between two states is smaller than some fixed hyperparameter. However, this was done in the context of intrinsic motivation, in contrast to our work. The network was used to provide agents with an exploration bonus for visiting novel states: given a state visited by the agent, an exploration bonus was provided if the network judged the state to be far from the states in a buffer storing a representative sample of states previously visited by the agent.

Ghosh et al. (2018) define the actionable distance between states $s_1$ and $s_2$ in terms of the expected Jensen-Shannon divergence between $\pi(a \mid s_1, g)$ and $\pi(a \mid s_2, g)$, where $\pi$ is a fully trained goal-conditioned policy and the expectation is taken over goals $g$. They then train an embedding such that the distance between the embeddings of $s_1$ and $s_2$ is equal to the actionable distance between $s_1$ and $s_2$. This differs from our approach in that we use a different objective for training the distance function, and, more importantly, we do not assume availability of a pre-trained goal-conditioned policy; rather, in our work the distance function is trained online, in conjunction with the policy.

3 Background

3.1 Goal-Conditioned Reinforcement Learning

Reinforcement Learning (RL) provides a framework for sequential decision making under uncertainty by modeling the decision making problem as a Markov Decision Process (MDP). An MDP is a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \rho_0)$ for state set $\mathcal{S}$, action set $\mathcal{A}$, transition kernel $\mathcal{P}$, reward function $r$, and distribution over initial states $\rho_0$. An agent is formalized as a policy $\pi(a \mid s)$, a function which maps from a state to a distribution over actions. A trajectory is a sequence $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ arising from interaction between the agent and environment, for horizon $T$, $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, $r_t = r(s_t, a_t)$, and $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$. We define the discounted return for trajectory $\tau$ as $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ for discount factor $\gamma \in [0, 1)$.

The value function $V^\pi(s)$ gives the expected discounted return starting from $s$ and following the policy $\pi$:

$$V^\pi(s) = \sum_{\tau} P(\tau \mid \pi, s_0 = s)\, R(\tau),$$

where $\tau$ is such that $s_0 = s$, and the sum may be replaced by an integral in continuous state/action spaces. An RL algorithm's objective is to learn a policy maximizing the expectation of $R(\tau)$:

$$\pi^* = \arg\max_\pi \mathbb{E}_{s_0 \sim \rho_0}\big[ V^\pi(s_0) \big].$$

Optimizing this objective can be achieved using any of a wide range of RL algorithms invented in recent years; a detailed introduction to these methods can be found in Sutton & Barto (1998).

In the standard RL framework outlined above, the agent is trained to solve a single task, represented by the reward function. Goal-conditioned reinforcement learning generalizes this to allow agents capable of solving multiple tasks Schaul et al. (2015). We assume a goal space $\mathcal{G}$, which may be identical to the state space or related to it in some other way, and introduce the goal-augmented state space $\mathcal{S} \times \mathcal{G}$. Given some goal $g \in \mathcal{G}$, the policy $\pi(a \mid s, g)$, reward function $r(s, a, g)$ and value function $V^\pi(s, g)$ all now condition on $g$ in addition to the current state. The intent is to train the agent to achieve all goals afforded by the environment.

Throughout this work we assume a setting similar to that explored in Florensa et al. (2018). In particular, we assume the goal space is either identical to or a subspace of the state space, that all trajectories begin from a single start state $s_0$, and that the environment does not provide a means of sampling over all possible goals (instead, goals must be discovered through experience). Moreover, we require a distance function $d(s, g)$; agents are given a reward of 0 at all timesteps until $d(s_t, g) < \epsilon$, for threshold hyperparameter $\epsilon$, at which point a reward of 1 is provided and the episode terminates.
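To make the reward structure concrete, the following is a minimal sketch (not the authors' code; `distance_fn`, `epsilon`, and the state/goal encodings are placeholders) of the sparse indicator reward used throughout this setting:

```python
import numpy as np

def goal_conditioned_reward(state, goal, distance_fn, epsilon):
    """Sparse indicator reward: 0 until the (hand-coded or learned) distance
    to the goal drops below epsilon, then reward 1 and terminate."""
    reached = distance_fn(state, goal) < epsilon
    reward = 1.0 if reached else 0.0
    return reward, bool(reached)  # (reward, episode_done)

# Example with the hand-coded Euclidean baseline distance:
euclidean = lambda s, g: np.linalg.norm(np.asarray(s) - np.asarray(g))
r, done = goal_conditioned_reward([0.1, 0.2], [0.1, 0.25], euclidean, epsilon=0.1)
```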

3.2 Goal Generation and Curriculum Learning

In order to train agents to achieve all goals in this setting, it is necessary to have a way of systematically exploring the state space in order to discover as many goals as possible, as well as a means of tailoring the difficulty of goals to the current abilities of the agent (a form of goal-based curriculum learning). An algorithmic framework satisfying both of these requirements was proposed in Florensa et al. (2018). Under this framework, one maintains a working set of goals, and alternates between two phases. In the policy-learning phase, the agent is trained to achieve goals sampled uniformly from the working set using an off-the-shelf RL algorithm. In the goal-selection phase, the working set of goals is adapted to the current abilities of the agent in order to enable efficient learning in the next policy-learning stage. In particular, the aim is to have the working set consist of goals that are of intermediate difficulty for the agent; goals that are too hard yield little reward to learn from, while goals that are too easy leave little room for improvement. Formally, given hyperparameters $R_{\min}$ and $R_{\max}$, a goal $g$ is considered to be a Goal of Intermediate Difficulty (GOID) if $R_{\min} \le \bar{R}(g) \le R_{\max}$, where $\bar{R}(g)$ is the undiscounted return obtained for goal $g$.
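The GOID criterion can be sketched as a simple filter; `estimate_return` is an assumed helper that empirically averages undiscounted success over a few rollouts, and the default bounds are illustrative rather than the paper's values:

```python
def is_goid(goal, estimate_return, r_min=0.1, r_max=0.9):
    """Goal of Intermediate Difficulty: the empirical undiscounted return
    for this goal lies between r_min and r_max."""
    r_bar = estimate_return(goal)  # e.g. mean success over a few rollouts
    return r_min <= r_bar <= r_max

def filter_goid(goals, estimate_return, r_min=0.1, r_max=0.9):
    return [g for g in goals if is_goid(g, estimate_return, r_min, r_max)]
```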

As an instantiation of this framework, Florensa et al. (2018) proposed the GoalGAN algorithm. There, the Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) algorithm is used for the policy-learning phase. The goal-selection phase is achieved by training a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) on GOID goals from the working set, and then sampling from the GAN’s generator. This process is expected to generate a diverse set of GOID goals for use in the next policy-learning phase.
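At a high level, the alternation between the two phases can be sketched as the loop below; `train_policy`, `evaluate_goals`, and `resample_goals` are stand-ins for the RL algorithm, the GOID filter, and the goal-generation mechanism (GAN-based or otherwise), not specific APIs:

```python
def curriculum_loop(working_goals, num_iterations,
                    train_policy, evaluate_goals, resample_goals):
    """Alternate between a policy-learning phase on the current working set
    of goals and a goal-selection phase that adapts the working set."""
    for _ in range(num_iterations):
        # Policy-learning phase: train against goals sampled from the working set.
        trajectories = train_policy(working_goals)
        # Goal-selection phase: keep goals of intermediate difficulty and
        # propose new candidates (e.g. via a GAN or action-space noise).
        goid_goals = evaluate_goals(working_goals, trajectories)
        working_goals = resample_goals(goid_goals, trajectories)
    return working_goals
```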

4 Method

In this section we introduce an action-based distance measure for use in trajectory-based reinforcement learning which faithfully captures the dynamics of the environment. We then present a method for automatically learning an estimator of that distance using samples generated by a policy $\pi$. Additionally, we present a simple action-based method for generating new goals for goal-based curriculum learning.

4.1 Learned Action Distance

We propose to learn a task-specific distance function, where the distance between states $s_1$ and $s_2$ is defined as the expected number of actions required, under a policy $\pi$, to reach $s_2$ from $s_1$; we call this the action distance. Defining the distance in terms of the reachability of states captures the environment dynamics as experienced by the agent under the policy $\pi$. In order to learn a distance estimator, we propose to learn an embedding such that the distance between the embeddings of two states is equal to the action distance between the states, where the distance function used in the embedding space is a hyperparameter.

Formally, let $s_i$ and $s_j$ be two states occurring in a trajectory $\tau$, and define the action distance $d^\pi(s_i, s_j)$ as:

$$d^\pi(s_i, s_j) = \mathbb{E}_{\tau \sim \pi}\big[\, |t_\tau(s_i) - t_\tau(s_j)| \,\big], \tag{1}$$

where $t_\tau(s)$ is a function giving the temporal index of $s$ in $\tau$, and the expectation is taken over trajectories $\tau$ sampled from $\pi$ such that $s_i$ and $s_j$ both occur in $\tau$. If $\pi$ is a goal-conditioned policy we also average over goals provided to $\pi$.
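In practice the expectation in Eq. (1) is approximated with samples: every pair of states occurring in the same trajectory yields a regression target equal to the gap between their time indices. A minimal sketch with illustrative names:

```python
import random

def sample_distance_pairs(trajectory, num_pairs=64):
    """Given a trajectory (a list of states), sample (s_i, s_j, |i - j|)
    triples used as regression targets for the distance predictor."""
    T = len(trajectory)
    pairs = []
    for _ in range(num_pairs):
        i, j = random.randrange(T), random.randrange(T)
        pairs.append((trajectory[i], trajectory[j], abs(i - j)))
    return pairs
```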

In general, $d^\pi$ is difficult to compute; this is problematic, since this distance is intended to be used for detecting when goals have been achieved and will be called frequently. Therefore, we propose to train a neural network to estimate it. Specifically, we learn an embedding function $e_\theta$ of the state space, parameterized by a vector $\theta$, such that the norm between a pair of embeddings is close to the action distance between the corresponding states. The objective function used to train the embeddings is:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \pi}\Big[\, \big(\, \| e_\theta(s_i) - e_\theta(s_j) \| - |t_\tau(s_i) - t_\tau(s_j)| \,\big)^2 \,\Big], \tag{2}$$

where the expectation is over pairs of states $s_i, s_j$ occurring in trajectories sampled from $\pi$.

Learning an embedding with Equation (2) as the objective function is an instance of metric multidimensional scaling. Multidimensional scaling Cox & Cox (2008) is a well-known technique which takes as input a set of pairwise distances between objects and produces an embedding of the objects such that the distances between the embeddings best preserve the specified pairwise distances.
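A minimal PyTorch sketch of such a distance predictor, assuming illustrative layer sizes and an $\ell_2$ embedding norm rather than the paper's exact settings, is:

```python
import torch
import torch.nn as nn

class DistancePredictor(nn.Module):
    """Embedding MLP; predicted action distance is the p-norm between the
    embeddings of two states."""
    def __init__(self, state_dim, hidden=64, embed_dim=16, p=2):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        self.p = p

    def forward(self, s1, s2):
        return torch.norm(self.embed(s1) - self.embed(s2), p=self.p, dim=-1)

def train_step(model, optimizer, s1, s2, temporal_gap):
    """One regression step of Eq. (2): squared error between the predicted
    distance and the sampled temporal-gap target."""
    pred = model(s1, s2)
    loss = ((pred - temporal_gap) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch (batches of states and float targets):
# model = DistancePredictor(state_dim=4)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# train_step(model, opt, s1_batch, s2_batch, gap_batch)
```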

When used as part of a GoalGAN-style algorithm, the distance predictor can be trained using the trajectories collected by the behavior policy during the policy-learning phase. We call this case the on-policy distance predictor. We emphasize that the on-policy nature of this distance predictor is independent of whether the policies or value functions are on-policy. While such an on-policy distance possesses desirable properties, such as a simple training scheme, it also has drawbacks. Both the behavior policy and the goal distribution change over time, and thus the distance function is non-stationary. This can create a number of issues, the most important of which is difficulty in setting the threshold $\epsilon$. Recall that in our setting, the goal is considered to be achieved and the episode terminated once $d(s_t, g) < \epsilon$, where $\epsilon$ is a threshold hyperparameter. This thresholding creates an $\epsilon$-sphere around the goal, with the episode terminating whenever the agent enters this sphere. The interaction between the $\epsilon$-sphere and the non-stationarity of the on-policy distance function causes a subtle issue that we dub the expanding $\epsilon$-sphere phenomenon, discussed in detail in Section 5.2.

An alternative approach to learning the distance function from the trajectories generated by the behavior policy is to apply a random policy for a fixed number of timesteps at the end of each episode. The states visited under the random policy are then used to train the distance function. Specifically, we take the trajectories generated by the random policy for obtaining new goals, and re-use them to train the distance function. Since the random policy is independent of the behavior policy, we describe the learned distance function as off-policy in this case. The stationarity of the random policy helps in overcoming the expanding $\epsilon$-sphere phenomenon of the on-policy distance predictor.

Finally, we remark that the action distance is not, in general, a metric. For example, if the environment allows states to be visited multiple times in a trajectory, then the action distance between a state and itself can be non-zero. Hence, the term distance is used rather loosely, appealing to intuition, and not in the strict mathematical sense.

4.2 Action Space Goal Generation

Our algorithm maintains a working set of goals for the policy to train against. The central challenge in designing a curriculum is finding a way to ensure that the working set contains as many GOID goals as possible. The most straightforward way of generating new GOID goals from old ones is by applying perturbations to the old goals. The downside of this simple approach is that the noise must be carefully tailored to the environment of interest, which demands significant domain knowledge about the nature of the environment's state space. Consider, for instance, that in an environment where states are represented by images it would be difficult to come up with any kind of noise such that the newly generated states are feasible (i.e. in $\mathcal{S}$). Another option is to train a generative neural network to produce new GOID goals, as proposed by GoalGAN; however, this introduces significant additional complexity.

A simple alternative is to employ action space noise. That is, to generate new goals from an old goal, reset the environment to the old goal and take a series of actions using a random policy; take a random subset of the encountered states as the new goals. The states generated in this way are guaranteed to be both feasible and near the agent’s current capability. Moreover, applying this approach requires only knowledge of the environment’s action space, which is typically required anyway in order to interact with the environment. A similar approach was used in Florensa et al. (2017), but in the context of generating a curriculum of start states growing outward from a fixed goal state.

If implemented without care, action space noise has its own significant drawback: it requires the ability to arbitrarily reset the environment to a state of interest in order to start taking random actions, a strong assumption which is not satisfied for many real-world tasks. Fortunately, we can avoid this requirement as follows. Whenever a goal is successfully achieved during the policy optimization phase, rather than terminating the trajectory immediately, we instead continue for a fixed number of timesteps using the random policy. During the goal-selection phase, we can take the states generated in this way from GOID goals as new candidate goals. The part of the trajectory generated under the random policy is not used for policy optimization. Note that this combines nicely with the off-policy method for training the distance predictor, as the distance predictor can be trained on these trajectories; this results in a curriculum learning procedure for goal-conditioned policies that requires minimal domain knowledge.
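A sketch of this reset-free variant, assuming a gym-style environment interface (names and the number of random steps are illustrative):

```python
def extend_with_random_policy(env, last_state, num_random_steps=20):
    """After the agent reaches its goal, keep stepping with uniformly random
    actions; the visited states serve both as candidate goals and as
    off-policy training data for the distance predictor."""
    state = last_state
    visited = []
    for _ in range(num_random_steps):
        action = env.action_space.sample()        # random policy
        state, _, done, _ = env.step(action)      # reward from these steps is unused
        visited.append(state)
        if done:
            break
    return visited  # candidate goals / distance-predictor training states
```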

5 Experiments

As a test bed we use a set of 3 Mujoco environments in which agents control simulated robots with continuous state/action spaces and complex dynamics. The first environment is called Point Mass, wherein an agent controls a sphere constrained to a 2-dimensional horizontal plane which moves through a figure-eight-shaped room. The state space is 4-dimensional, specifying position and velocity in each direction, while the 2-dimensional action space governs the sphere's acceleration. In the other two environments, the agent controls a quadrupedal robot with two joints per leg, vaguely resembling an ant. The 41-dimensional state space includes the center-of-mass of the ant's torso as well as the angle and angular velocity of the joints, while the action space controls torques for the joints. The Ant is significantly more difficult than Point Mass from a control perspective, since a complex gait must be learned in order to navigate through space. We experiment with this robot in two different room layouts: a simple rectangular room (Free Ant) and a U-shaped maze (Maze Ant). Further discussion of our environments can be found in Florensa et al. (2018) and Duan et al. (2016).

In the first set of experiments we seek to determine whether our online distance learning approach can replace the hand-coded distance function in the GoalGAN algorithm, thereby eliminating the need for a human to choose and/or design the distance function for each new environment. We experiment with different choices for the goal space, beginning with the simplest case in which the goal space is the $(x, y)$ coordinates of the robot's center-of-mass (i.e. a subspace of the state space) before proceeding to the more difficult case in which the goal and state spaces are identical. Next, we empirically demonstrate the expanding $\epsilon$-sphere phenomenon mentioned in Section 4.1, which results from training the distance predictor with on-policy samples. Finally, we show that the goal generation approach proposed in Section 4.2 yields performance that is on par with GoalGAN while requiring significantly less domain knowledge.

5.1 GoalGAN with Learned Action Distance

Here we test whether our proposed method can be used to learn a distance function for use in GoalGAN, in place of the hard-coded L2 distance. We explore two methods for learning the distance function: 1) the on-policy approach, in which the distance function is trained using states from the trajectories sampled during GoalGAN's policy-learning phase, and 2) the off-policy approach, in which the distance function is trained on states from random trajectories sampled at the end of controlled trajectories during the policy-learning phase. For the embedding network we use a multi-layer perceptron with a single hidden layer and a fixed-size embedding. As we are interested in an algorithm's ability to learn to accomplish all goals in an environment, our evaluation measure is a quantity called coverage: the probability of goal completion, averaged over all goals in the environment, with Euclidean distance used to determine whether a goal has been reached. Since the goal spaces are large and real-valued, in practice we approximate coverage by partitioning the maze into a fine grid and averaging over goals placed at grid cell centers. Completion probability for an individual goal is taken as an empirical average over a small number of rollouts.
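A sketch of how coverage might be approximated, assuming a grid of goal positions and a `rollout_success(policy, goal)` helper that returns 1 on success (all names illustrative):

```python
import numpy as np

def coverage(policy, goal_grid, rollout_success, rollouts_per_goal=5):
    """Approximate coverage: goal-completion probability averaged over a fine
    grid of goals, each estimated from a small number of rollouts."""
    per_goal = []
    for goal in goal_grid:
        successes = [rollout_success(policy, goal) for _ in range(rollouts_per_goal)]
        per_goal.append(np.mean(successes))
    return float(np.mean(per_goal))
```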

5.1.1 XY Goal Space

In this experiment, the goal space is the $(x, y)$ coordinates of the robot's center-of-mass (i.e. a subspace of the state space). We compare our approach with the baseline where the Euclidean distance is used as the distance metric in the goal space. In this setting we are able to achieve performance comparable to the baseline without using any domain knowledge, as shown in Fig. 1.

Figure 1: Coverage plots in the $(x, y)$ goal space. Panels: (a) Point Mass, (b) Maze Ant, (c) Free Ant.
Figure 2: Coverage plots in the full goal space. Panels: (a) Point Mass, (b) Maze Ant, (c) Free Ant.

5.1.2 Full State Space as Goal Space

In this setting the goal space is the entire state space, and the objective of the agent is to learn to reach all feasible configurations of the state space. This is straightforward for the Point Mass environment, as its 4-dimensional state space is a reasonable size for a goal space, though harder than the case explored in the previous section.

For the Ant environment, the 41-dimensional state space is quite large, making it difficult for any policy to learn to reach every state even with a perfect distance function. Consequently, for Ant tasks we modify the setup to consider only the stable positions as the goals. In contrast, the distance network is trained on the original unprojected states. This modification makes policy learning tractable while preserving the difficulty for learning the distance predictor.

Figure 3: (a) & (b): Visualizing the predicted action distance between a goal and states along a trajectory, for (a) Point Mass and (b) Maze Ant. (c) Visualizing the predicted action distance of stable ant positions with respect to reference states in the Maze Ant environment (full-state distance predictor).

Results for these experiments are shown in Fig. 2, where we can see that the agents are able to make progress in all the tasks. For the Point Mass agent, progress is slow compared to the $(x, y)$ case, since now to achieve a goal the agent has to reach a specified position with a specified velocity. The learning progress of the Ant agent trained to reach only stable positions suggests that our approach to learning the distance function can be robust to unseen states, since the distance estimator can generalize to states not seen or only rarely seen during training. Fig. 3 visualizes the distance estimates of our predictor on states visited during a sample trajectory and on a set of reference states, in the Point Mass and Maze Ant environments with the $(x, y)$ and full goal spaces respectively.

We observe that in all experiments, including those with the $(x, y)$ goal space, the on-policy method for training the distance predictor performed at least as well as the off-policy method, and often significantly better. In the next section we study whether the on-policy method is objectively superior.

5.2 Expanding $\epsilon$-sphere

In this section we show that there are qualitative differences between using on-policy and off-policy samples to train the distance predictor. Since the goal is considered achieved when the agent is within the $\epsilon$-sphere of the goal, the episode is terminated when the agent reaches the boundary of the $\epsilon$-sphere. Thus, as learning progresses, the agent only learns a shortest path to a state on the boundary of the $\epsilon$-sphere of the corresponding goal. In this scenario, the path to the goal $g$ from any state within the $\epsilon$-sphere of $g$, under the policy conditioned on $g$, need not be optimal, since such trajectories are not seen by the policy conditioned on that specific goal $g$. However, the number of actions required to reach the goal from states outside the $\epsilon$-sphere along the path to the goal decreases as a shorter path is learned through policy improvement. Therefore, the number of states from which the goal can be reached within a fixed number of actions increases as learning progresses, until an optimal policy is learned for that goal. When on-policy samples are used to learn the distance predictor, this results in the $\epsilon$-sphere centered on the goal growing in volume for a fixed action distance.

This phenomenon is illustrated in the top row of Fig. 4. For a fixed state $s$ near the starting position, the predicted distance from all other states to $s$ is plotted. The evolution of the distance function over iterations shows that as the policy improves, states that are farther in terms of Euclidean distance get closer in terms of action distance. The orientation of the thin dark patch traces the optimal path to reach the points around the corner. As a result, the $\epsilon$-sphere centered on $s$ increases in volume for a fixed action distance. The bottom row of Fig. 4 illustrates the predictions of the off-policy distance estimator. In the off-policy case, the dark region remains densely concentrated near $s$, and the volume of the $\epsilon$-sphere exhibits significantly less growth.

As explained earlier, the policy conditioned on a goal $g$ does not learn an optimal path from states within the $\epsilon$-sphere of $g$, since the episode is terminated upon reaching the sphere boundary. Hence, it is desirable to keep the $\epsilon$-sphere as small as possible. In the on-policy case, $\epsilon$ would have to be decreased as learning progresses for a goal $g$ in order to ensure that the $\epsilon$-sphere centered on $g$ remains small. Therefore, for the sake of practical simplicity and ease of use we favor the off-policy distance predictor.

Figure 4: Predictions of the distance estimator trained with on-policy (top) and off-policy (bottom) samples in Maze Ant with the $(x, y)$ goal space, shown at iterations 10, 20, 50, 75, 100, 120, 150, and 200, illustrating how the predictions evolve over time. Darker colors indicate smaller predicted distance and the small blue dot indicates the reference state.

5.3 Generating Goals in Action Space

In this section we compare the performance of the goal generation strategy proposed in Section 4.2 against GoalGAN. We perform this comparison in both the Maze Ant and Free Ant environments, using the $(x, y)$ coordinates as the goal space and the Euclidean distance. The results, shown in Fig. 5, demonstrate that the performance of our approach is comparable to that of GoalGAN while not requiring the additional complexity introduced by the GAN. Even though our approach requires additional environment interactions, it does not necessarily have a higher sample complexity than GoalGAN in the indicator-reward setting: goals generated by GoalGAN can be infeasible, in which case the reward for the trajectory will be 0, contributing nothing to learning. The evolution of the working set of goals maintained by our algorithm for Maze Ant is visualized in Fig. 6.

Figure 5: Comparing the proposed goal generation algorithm against GoalGAN, for (a) Maze Ant and (b) Free Ant. All experiments are averaged over multiple random seeds.


Figure 6: Evolution of the goals generated by our goal generation approach (top), and a sample of the goals encountered so far (bottom), color-coded according to estimated difficulty: green are easy, blue are GOID, and red are hard.

6 Conclusion

We have presented an approach to automatically learn a task-specific distance function without requiring domain knowledge, and demonstrated that our approach is effective in the online setting, where the distance function is learned alongside a goal-conditioned policy while also playing a role in training that policy. We then discussed and empirically demonstrated the expanding $\epsilon$-sphere phenomenon, which arises when using the on-policy method for training the distance predictor. This can cause difficulty in setting the $\epsilon$ hyperparameter, particularly when the final performance has to be evaluated using the learned distance function instead of a proxy evaluation metric like the Euclidean distance. This indicates that off-policy distance predictor training should be preferred in general. Finally, we introduced an action space goal generation scheme which plays well with off-policy distance predictor training and avoids the complexity introduced by the GAN in the GoalGAN algorithm, and showed that this approach achieves performance on par with GoalGAN. We believe that our contributions represent a promising step towards making goal-conditioned policies applicable in a wider variety of environments (e.g. visual domains), and towards automating the design of distance functions that take environment dynamics into account.

References

Appendix A Training the distance predictor

The distance predictor has to be trained before it is used to determine whether the goal has been reached, in order to produce meaningful estimates. To produce an initial set of training samples, we use a randomly initialized policy to generate trajectories. This provides a meaningful initialization for the distance predictor, since these are the states most likely to occur under the initial policy.

The distance predictor is an MLP with a single hidden layer and ReLU activations. We initialize the distance predictor by training it on samples collected by the randomly initialized policy for several epochs; in subsequent iterations of our training procedure the MLP is trained for one epoch, with a fixed learning rate and mini-batch size. The distance predictor is trained after every policy optimization step, using either the off-policy or the on-policy samples. The embeddings have a fixed dimensionality, and the norm between embeddings is used as the predicted distance.

Appendix B Goal Generation Buffer

To generate goals according to the proposed approach, we store the states visited under the random policy (after the goal has been reached) in a specialized buffer, and sample uniformly from this buffer to generate the goals for each iteration. The simplest approach of storing the goals in a list suffers from two issues: i) all the states visited under the random policy since the beginning of training would be considered as potential goals at every iteration, and ii) the goals would be sampled according to the state visitation distribution under the random policy. Issue i) is problematic because the goal generation procedure has to adapt to the current ability of the agent and avoid goals that have already been mastered; issue ii) would bias the agent towards states that are more likely under the random policy. To overcome these issues we use a fixed-size queue and ensure that the goals in the buffer are unique. To avoid replacing the entire queue after each iteration, only a fixed fraction of the states in the queue are replaced in each iteration. In our experiments, the queue size and the fraction of goals replaced per iteration were fixed.
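A minimal sketch of such a buffer (capacity, replacement fraction, and the rounding-based de-duplication are illustrative choices, not the paper's exact implementation):

```python
import random
from collections import deque

class GoalBuffer:
    """Fixed-size queue of unique candidate goals; only a fraction of the
    buffer is replaced each iteration so the goals track the agent's current
    ability rather than the full visitation history."""
    def __init__(self, capacity=500, replace_fraction=0.2):
        self.capacity = capacity
        self.replace_fraction = replace_fraction
        self.queue = deque(maxlen=capacity)

    def _key(self, goal):
        # Coarse rounding, used only to detect (near-)duplicate goals.
        return tuple(round(float(x), 3) for x in goal)

    def update(self, new_goals):
        existing = {self._key(g) for g in self.queue}
        budget = max(1, int(self.replace_fraction * self.capacity))
        for g in new_goals:
            if budget == 0:
                break
            if self._key(g) not in existing:
                self.queue.append(g)   # oldest goal is dropped when the queue is full
                existing.add(self._key(g))
                budget -= 1

    def sample(self, n):
        return random.sample(list(self.queue), min(n, len(self.queue)))
```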

We note that our goal generation scheme and the off-policy distance predictor are agnostic to the policy used being a random policy; hence the random policy can be replaced with any other policy if desired.

Appendix C Hyperparameters

The GoalGAN architecture, its training procedure, and the policy optimization procedure in our experiments follow Florensa et al. (2018). As with the distance predictor, GoalGAN is trained initially with samples generated by a random policy. The GAN generator and discriminator are multi-layer perceptrons, and the GAN is trained for 200 iterations after every few policy optimization iterations. Component-wise Gaussian noise with zero mean is added to the output of the GAN, as in Florensa et al. (2018). The policy network is a multi-layer perceptron optimized using TRPO (Schulman et al., 2015) with a fixed discount factor and GAE parameter. The threshold $\epsilon$ was set separately for the Euclidean and the learned distance in the Ant environments, and likewise for the Point Mass environments.

The hyperparameter $\epsilon$ and the learning rate were determined by performing a grid search over a small set of candidate values for each environment (Maze Ant and Point Mass) in the off-policy setting. For the sake of simplicity we use the same $\epsilon$ for the on-policy and off-policy distance predictors in our experiments. All plots show the mean and a confidence interval across runs. Our implementation is based on the github repository of Florensa et al. (2018).