1 Introduction
Reinforcement Learning (RL) studies how agents can learn a desired behaviour by simply using interactions with an environment and a reinforcement signal. Central to RL is the longstanding problem of balancing exploration and exploitation. Agents must first sufficiently explore the environment to identify highreward behaviours, before this knowledge can be exploited and refined to maximize longterm rewards. Many recent RL successes have been obtained by relying on wellformed reward signals, that provide rich gradient information to guide policy learning. However, designing such informative rewards is challenging, and rewards are often highly specific to the particular task being solved. Sparse rewards, which carry little or no information besides binary success or failure, are much easier to design. This simplicity comes at a cost; most rewards are identical, so that there is little gradient information to guide policy learning. In this setting, the sample complexity of simple exploration strategies was shown to grow exponentially with state dimension in some cases (osband2016generalization). Intuition behind this phenomenon can be gained by inspecting Figure 0(a): exploration in regions where the return surface is flat leads to a random walk type search. This inefficient search continues until nonzero gradients are found, which can then be followed to a local optimum.
Planning algorithms can achieve much better exploration performance than random walk by taking search history into account (Lavalle98rapidlyexploringrandom). These techniques are also often guaranteed to find a solution in finite time if one exists (Karaman2011). In order to leverage the advantages of these methods, we formulate RL exploration as a planning problem in the state space. Solutions found by search algorithms are then used as demonstrations for RL algorithms, initializing them in regions of policy parameter space where the return surface is not flat. Figure 0(b) shows the importance of such good initialization; surface gradients can be followed, which greatly facilitates learning.
This paper brings the following contributions. We first formulate RL exploration as a planning problem. This yields a simple and effective method for automatically generating demonstrations without the need for an external expert, solving the planning problem by adapting the classic Rapidlyexploring Random Tree algorithm (RRT) (kuffner2000rrt). The demonstrations are then used to initialize an RL policy, which can be refined with a classic RL method such as TRPO (schulman2015trust). We call the proposed method Rapidly Randomlyexploring Reinforcement Learning (R3L)^{1}^{1}1Code will be made available on Github.
, provide theoretical guarantees for finding successful solutions and derive bounds for its sampling complexity. Experimentally, we demonstrate R3L improves exploration and outperforms classic and recent exploration techniques, and requires only a fraction of the samples while achieving better asymptotic performance. Lastly, we show that R3L lowers the variance of policy gradient methods such as TRPO, and verify that initializing policies in regions with rich gradient information makes them less sensitive to initial conditions and random seed.
The paper is structured as follows: Section 2 analyzes the limitations of classic RL exploration. Section 3 describes R3L and provides theoretical exploration guarantees. Related work is discussed in Section 4, followed by experimental results and comments in Section 5. Finally, Section 6 concludes and gives directions for future work.
2 SparseReward RL as Random Walk
Many recent RL methods are based on a policy gradient optimization scheme. This approach optimizes policy parameters
with gradient descent, using a loss function
(e.g. expected return) and gradient . Since computingexactly is intractable, it is common to use unbiased empirical estimators
and , estimated from samples acquired by executing the policy. Optimization of then follows the common stochastic gradient descent (SGD) updaterule (Bottou2010_SGD; robbins1951stochastic): , where is the learning rate.The SGD update rule defines a discretetime stochastic process (mandt2017stochastic). Note that is the mean of
i.i.d. samples. Following the central limit theorem, the distribution over
is approximately , meaningis an unbiased estimator of
with covariance . Consequently, the update rule can be rewritten as (mandt2017stochastic):(1) 
Here we assume that , i.e. approximately constant w.r.t. , and factorizes as .
SGD is efficient for highdimensional problems as it offers almost dimension independent convergence rates (Nesterov2018). However, SGD requires nonzero gradients to guide the search towards the optimum , i.e. , , . In the case of sparsereward RL problems, such as in Figure 1, much of the loss surface is flat. This leads to inefficient exploration of parameter space , as the drift component in Eq. (1) , turning the SGD to a random walk in ; . Random walk is guaranteed to wander to infinity when dimensionality (polya1921; Kakutani1944)
. However, the probability of it reaching a desired region in
, e.g. where , depends heavily on problem specific parameters. The probability of ever reaching a sphere of radius centered at such that is (dvoretzky1951):(2) 
In sparse RL problems , thus the probability of reaching a desired region by random walk is smaller than 1, i.e. there are no guarantees of finding any solution, even in infinite time. This is in stark contrast with the R3L exploration paradigm, as discussed in Section 3.5.
3 R3L: Rapidly and Randomlyexploring Reinforcement Learning
R3L adapts RRT to the RL setting by formulating exploration as a planning problem in state space. Unlike random walk, RRT encourages uniform coverage of the search space and is probabilistically complete, i.e. guaranteed to find a solution (kleinbort2019).
R3L is decomposed into three main steps: (i) exploration is first achieved using RRT to generate a dataset of successful trajectories, described in Sections 3.2 and 3.3, (ii) successful trajectories are converted to a policy using learning from demonstrations (Section 3.4), and (iii) the policy is refined using classic RL methods.
3.1 Definitions
This work is based on the Markov Decision Process (MDP) framework, defined as a tuple
. and are spaces of states and actions respectively.is a transition probability distribution so that
, where the subscript indicates the discrete timestep. is a reward function defining rewards associated with transitions . is a discount factor. Solving a MDP is equivalent to finding the optimal policy maximizing the expected return for some time horizon , where actions are chosen according to . Lastly, let be a Euclidean space, equipped with the standard Euclidean distance metric with an associated norm denoted by . The space of valid states is denoted by .3.2 Exploration as a planning problem with RRT
The RRT algorithm (LaValle2001) provides a principled approach for planning in problems that cannot be solved directly (e.g. using inverse kinematics), but where it is possible to sample transitions. RRT builds a tree of valid transitions between states in , grown from a root . As such, the tree maintains information over valid trajectories. The exploration problem is defined by the pair . In RL environments with a known goal set (e.g. MountainCar), the exploration problem is defined by .
The RRT algorithm starts by sampling a random state , used to encourage exploration in a specific direction in the current iteration. This necessitates the first of two assumptions.
Assumption 1.
Random states can be sampled uniformly from the MDP state space .
Sampled states are not required to be valid, thus sampling a random state is typically equivalent to trivially sampling a hyperrectangle.
Then, the vertex is found. RRT attempts to expand the tree from toward by sampling an action according to a steering function . In many planning scenarios, samples randomly from . A forward step is then simulated from using action , where is defined by the transition dynamics. Being able to expand the tree from arbitrary relies on another assumption.
Assumption 2.
The environment state can be set to a previously visited state .
Although this assumption largely limits the algorithm to simulators, it has previously been used in florensa2017reverse; nair2018overcoming; ecoffet2019go for example; see discussion in Section 6 on overcoming the limitation to simulated environments.
is added as a new vertex of , alongside an edge with edge information . This process repeats until a sampling budget is exhausted or the goal set is reached (i.e. ).
Definition 1.
A valid trajectory is a sequence such that is a valid transition and , . Whenever a goal set is defined, a successful valid trajectory end state satisfies .
Once planning is finished, a successful valid trajectory can easily be generated from by retrieving all nodes between the leaf and the root. Because tree is grown from valid transitions between states in , these trajectories are valid by construction.
3.3 R3L exploration: Adapting RRT to RL
RRT is not directly applicable to RL problems. This subsection presents the necessary adaptations to the exploration phase of R3L, summed up in Algorithm 1. Figure 2 shows R3L’s typical exploration behaviour.
Local policy learning: In classic planning problems, selecting actions extending towards is easy, as these have a geometric interpretation or there is a known steering function. RL state spaces do not benefit from the same geometric properties, and properly selecting actions can be challenging. We solve this problem by defining a local policy , which models the distribution of actions to transition from a state to a neighbouring goal state. Actions to extend a tree branch are sampled as . We formulate the problem of learning
as supervised learning, where inputs are starting states
augmented with the difference, and targets are actions. The model is learned using transition data collected from previous tree expansion iterations. Results in this paper use Bayesian linear regression to represent
, but any supervised learning model can applied instead.Unknown dynamics: RRT was designed for problems with known continuous dynamics , but RL features unknown discrete transition dynamics. In R3L, is replaced with an environment interaction from , with selected action , resulting in a new state . Since is a real transition, it must be valid, and can be added to the tree .
Biasing search with : Better exploration efficiency can be achieved if goal information is available. Indeed, the RRT framework allows for biasing exploration towards , often resulting in faster exploration. This is achieved by sampling from instead of with low probability , while the rest of the iteration remains unchanged.
Since R3L uses RRT to explore, the algorithm is most suitable for RL problems that are fully observable, exhibit sparse rewards and have continuous controls. R3L is applicable to other RL problems, but may not perform as well as methods tailored to specific cases.
3.4 Policy initialization from R3L demonstrations
Upon completion, R3L exploration yields a successful trajectory , which may not be robust to various starting conditions and/or stochastic transitions. Converting successful trajectories into a policy is crucial to achieve robustness and enable further refinement with RL.
Policy initialization is applied to a set of successful trajectories generated using runs of R3L exploration with different starting conditions. An imitation policy is learned by supervised learning on transitions from . Policy is then refined using traditional RL algorithms like TRPO. As shown in Figure 1, initializing policy parameters in the vicinity of a local optimum is crucial.
3.5 Exploration guarantees
The RL planning environment defines differential constraints of the form:
(3) 
Therefore, starting at , the trajectory can be generated by forward integrating Eq. (3) with the applied actions. As with many RL problems, is timediscretized resulting in a piecewise constant control function. This means is constructed of segments of fixed time duration such that the overall trajectory duration . Thus, is defined as where and . Furthermore, as all transitions between states in are known, the trajectory return can be defined as .
R3L explores in stateaction/trajectory space instead of policy parameter space. Furthermore, it is an effective exploration framework which provides probabilistic completeness (PC):
Definition 2.
A probabilistically complete planner finds a feasible solution (if one exists) with a probability approaching 1 in the limit of infinite samples.
With the aforementioned dynamic characteristics, we prove that R3L exploration under the RL setting is PC. This is in stark contrast to the random walk exploration process, discussed in section 2, which is not PC. We begin with the following theorem, a modification of Theorem 2 from kleinbort2019, which is applied to kinodynamic RRT where a goal set is defined.
Theorem 1.
Suppose that there exists a valid trajectory from to as defined in definition 1, with a corresponding piecewise constant control. The probability that R3L exploration fails to reach from after iterations is bounded by , for some constants .
The proof, which is a modification of Theorem 2 from kleinbort2019, can be found in Appendix S2. It should be noted that R3L exploration does not require an explicit definition for in order to explore the space. While in some path planning variants of RRT, is used to bias sampling, the main purpose of is to indicate that a solution has been found. Therefore, can be replaced by another implicit success criterion. In the RL setting, this can be replaced by a returnrelated criterion.
Theorem 2.
Suppose that there exists a trajectory with a return . The probability that R3L exploration fails to find a valid trajectory from with after iterations is bounded by , for some constants .
Proof.
The proof is straightforward. We augment each state in with the return to reach it from :
(4) 
where . For consistency we modify the distance metric by simply adding a reward distance metric. With the above change in notation, we modify the goal set to , such that there is an explicit criterion for minimal return as a goal. Consequently, the exploration problem can be written for the augmented representation as , where . Theorem 1 satisfies that R3L exploration can find a feasible solution to this problem within finite time, i.e. PC, and therefore the probability of not reaching after iterations is upperbounded by the exponential term , for some constants ∎
We can now state our main result on the sampling complexity of the exploration process.
Theorem 3.
If trajectory exploration is probabilistically complete and satisfies an exponential convergence bound, the expected sampling complexity is finite and bounded such that
(5) 
where .
Proof.
Theorem 2 provides an exponential bound for the probability the planner will fail in finding a feasible path. Hence, we can compute a bound for the expected number of iterations needed to find a solution, i.e. sampling complexity:
(6) 
where we used the relation . ∎
It is worth noting that while the sample complexity is bounded, the above result implies that the bound varies according to problemspecific properties, which are encapsulated in the value of and . Intuitively, depends on the scale of the problem. It grows as becomes smaller or as the length of the solution trajectory becomes longer. depends on the probability of sampling states that will expand the tree in the right direction. It therefore shrinks as the dimensionality of increases. We refer the reader to Appendix S2 for more details on the meaning of and the derivation of the tail bound in Theorem 1.
4 Related work
Exploration in RL has been extensively studied. Classic techniques typically rely on adding noise to actions (mnih2015human; schulman2015trust) or to policy parameters (plappert2018parameter). However, these methods perform very poorly in settings with sparse rewards.
Intrinsic motivation tackles this problem by defining a new reward to direct exploration. Many intrinsic reward definitions were proposed, based on information theory (oudeyer2008can), state visit count (lopes2012exploration; bellemare2016unifying; szita2008many; fox2018dora), value function posterior variance (osband2016deep; morere2018bayesian), or model prediction error (stadie2015incentivizing; pmlrv70pathak17a). Methods extending intrinsic motivation to continuous state and action spaces were recently proposed (houthooft2016vime; morere2018bayesian). However, these approaches are less interpretable and offer no guarantees for the exploration of the state space.
Offering exploration guarantees, Bayesian optimization was adapted to RL in wilson2014using, to search over the space of policy parameters in problems with very few parameters. Recent work extends the method to functional policy representations (vien2018bayesian), but results are still limited to toy problems and specific policy model classes.
Motion planning in robotics is predominantly addressed with samplingbased methods. This type of approach offers a variety of methodologies for exploration and solution space representation (e.g., Probabilistic roadmaps (PRM) (Kavraki1996), Expansive space trees (ESP) (Hsu1997) and Rapidlyexploring random tree (RRT) (kuffner2000rrt)), which have shown excellent performance in path planning in highdimensional spaces under dynamic constraints (LaValle2001; Hsu2002; Kavraki1996).
RL was previously combined with samplingbased planning to replace core elements of planning algorithms, such as PRM’s pointtopoint connection (faust2018prm), local RRT steering function (chiang2019rl) or RRT expansion policy (chen2019learning). In contrast, the proposed method bridges the gap in the opposite direction, employing a samplingbased planner to generate demonstrations that kickstart RL algorithms and enhance their performance.
Accelerating RL by learning from demonstration was investigated in niekum2015learning; bojarski2016end; torabi2018behavioral. However, these techniques rely on usergenerated demonstrations or apriori knowledge of environment parameters. In contrast, R3L automatically generates demonstrations, with no need of an external expert.
5 Experiments
Goal bias ()  Unbiased ()  

Learned  Random  Learned  Random  
MountainCar  
timesteps  
Pendulum  
timesteps  
Acrobot  
timesteps  
Cartpole Swingup  
timesteps  
Reacher  
timesteps 
on R3L exploration. Results show the mean and standard deviation of successful trajectory length
and number of timesteps required, computed over runs.In this section, we investigate (i) how learning a local policy and biasing search towards with probability affects R3L exploration, (ii) whether separating exploration from policy refinement is a viable and robust methodology in RL, (iii) whether R3L reduces the number of exploration samples needed to find good policies, compared with methods using classic and intrinsic exploration, and (iv) how R3L exploration can reduce the variance associated with policy gradient methods. All experiments make use of the Garage (duan2016benchmarking) and Gym (gym) frameworks. The experimental setup features the following tasks with sparse rewards: Cartpole Swingup (), MountainCar (), Acrobot (), Pendulum (), Reacher () Fetch Reach (), and Hand Reach (). The exact environment and reward definitions are described in Appendix S3.
R3L exploration analysis We first analyze the exploration performance of R3L in a limited set of RL environments, to determine the impact that learning policy has on exploration speed. We also investigate whether R3L exploration is viable in environments where no goal information is available. Table 1 shows the results of this analysis. Learning seems to greatly decrease the number of exploration timesteps needed on most environments. However, it significantly increases the number of timesteps on the acrobot and reacher environments. Results also suggest that learning helps R3L to find shorter trajectories on the same environments, which is a desirable property in many RL problems. Biasing R3L exploration towards the goal set helps finding successful trajectories faster, as well as reducing their length. However, R3L exploration without goal bias is still viable in all cases. Although goal information is not given in the classic MDP framework, it is often available in realworld problems and can be easily utilized by R3L. Lastly, successful trajectory lengths have low variance, which suggests R3L finds consistent solutions.
Comparison to classic and intrinsic exploration on RL benchmarks We examine the rates at which R3L learns to solve several RL benchmarks, and compare them with stateoftheart RL algorithms. Performance is measured in terms of undiscounted returns and aggregated over 10 random seeds, sampled at random for each environment. We focus on domains with sparse rewards, which are notoriously difficult to explore for traditional RL methods. Our experiments focus on the widelyused methods TRPO (schulman2015trust) and DDPG (lillicrap2015continuous). R3LTRPO and R3LDDPG are compared to the baseline algorithms with Gaussian action noise. As an additional baseline we include VIMETRPO (houthooft2016vime). VIME is an exploration strategy based on maximizing information gain about the agent’s belief of the environment dynamics. It is included to show that R3L can improve on stateoftheart exploration methods as well as naive ones, even though the return surface for VIMETRPO is no longer flat, unlike Figure 1. The exact experimental setup is described in Appendix S3.2. The R3L exploration phase is first run to generate training trajectories for all environments. The number of environment interactions during this phase is accounted for in the results, displayed as an offset with a vertical dashed black line. The average performance achieved by these trajectories is also reported as a guideline, with the exception of Cartpole Swingup where doing so does not make sense. RL is then used to refine a policy pretrained with these trajectories.









Figure 3
shows the median and interquartile range for all methods. R3L is very competitive with both vanilla and VIME baselines. It converges faster and achieves higher performance at the end of the experiment. In most cases, the upper quartile for our method begins well above the minimum return, indicating that R3L exploration and pretraining are able to produce successful though not optimal policies. R3LDDPG performance, for the majority of problems, starts significantly above the minimum return, plunges due to the inherent instability of DDPG, but eventually recovers, indicating that R3L pretraining can help mitigate the instability. It is worth noting that R3L’s lower quartile is considerably higher than that of baselines. Indeed, for many of the baselines the lower quartile takes a long time to improve on the minimum return, and in some cases it never manages to do so at all. This is a common problem in sparse reward RL, where there is no alternative but to search the parameter space randomly until the first successful trajectory is found, as explained in Section
2. While a few random seeds will by chance find a successful trajectory quickly (represented by the quickly rising upper quartile), others take a long time (represented by the much slower rise of the median and lower quartile). In other words, R3LTRPO/DDPG is much more robust to random policy initialization and to the random seed than standard RL methods. This is because R3L is able to use automatically generated demonstrations to initialize RL policy parameters to a region with informative return gradients.6 Conclusion
We proposed Rapidly Randomlyexploring Reinforcement Learning (R3L), an exploration paradigm for leveraging planing algorithms to automatically generate successful demonstrations, which can be converted to policies then refined by classic RL methods. We provided theoretical guarantees of R3L finding solutions, as well as sampling complexity bounds. Empirical results show that R3L outperforms classic and intrinsic exploration techniques, requiring only a fraction of exploration samples and achieving better asymptotic performance.
As future work, R3L could be extended to realworld problems by leveraging recent advances on bridging the gap between simulation and reality (peng2018sim). Respecting Assumption 2, a policy would first be trained on a simulator and then transferred to the realworld. Exploration in highdimensional tasks is also challenging as stated in Theorem 3 and confirmed experimentally by increased R3L exploration timesteps. Exploiting external prior knowledge and/or the structure of the problem can benefit exploration in highdimensional tasks, and help make R3L practical for problems such as Atari games. Lastly, recent advances in RRT (chiang2019rl) and learning from demonstration (torabi2018behavioral) could also improve R3L.
References
S1 Appendix A: RRT algorithm psuedocode
In this section, we provide pseudocode of the classic RRT algorithm.
S2 Appendix B: Proof of Theorem 1
This appendix proves Theorem 1 which shows that planning using RRT under differential constraints is probabilistically complete. The following proof is a modification of Theorem 2 from kleinbort2019, where completeness of RRT in the RL setting is maintained without the need to explicitly sample a duration for every action.
Equation 3 defines the environment’s differential constraints. In practice, Eq. (3) is approximated by an Euler integration step. With the interval divided into equal time intervals of duration with . Eq. (3) is then approximated by an Euler integration step, where the transition between consecutive time steps is given by:
(S1)  
Furthermore, we define as a ball with a radius centered at for any given state .
We assume that the planning environment is Lipschitz continuous in both state and action, constraining the rate of change of Eq. (S1). Formally, there exists two positive constants , such that
(S2)  
(S3) 
Lemma 1.
For two trajectories , , where and such that with a positive constant. Suppose that for each trajectory a piecewise constant action is applied, so that and is fixed during a time period . Then .
The proof for Lemma 1 is given in Lemma 2 of (kleinbort2019). Intuitively, this bound is derived from compounding worstcase divergence between and at every Euler step along which leads to an overall exponential dependence.
Using Lemma 1, we want to provide a lower bound on the probability of choosing an action that will expand the tree successfully. We note that this scenario assumes that actions are drawn uniformly from , i.e. there is no steering function ^{2}^{2}2The function returns an action given two states and such that , for a prespecified (Karaman2011).. When better estimations of the steering function are available, as described in 3.3, the performance of RRT significantly improves.
Definition 3.
A trajectory is defined as clear if for , for all
Lemma 2.
Suppose that is a valid trajectory from to with a duration of and a clearance of . Without loss of generality, we assume that actions are fixed for all , such that .
Suppose that RRT expands the tree from a state to a state , for any and we can define the following bound:
Here, is the Lebesgue measure for a unit circle in .
Proof.
We denote a trajectory that starts from and is expanded with an applied random action . According to Lemma 1,
where since . Now, we want to find such that the distance between the goal points of these trajectories, i.e. in the worstcase scenario, is bounded:
After rearranging this formula, we can obtain a bound for :
Assuming that
is drawn out of a uniform distribution, the probability of choosing the proper action is
(S4) 
where is used to account for the degeneracy in action selection due to the dimensionality of . We note that guarantees , thus a valid probability> ∎
Equation S4 provides a lower bound for the probability of choosing the suitable action. The following lemma provides a bound on the probability of randomly drawing a state that will expand the tree toward the goal.
Lemma 3.
Let be a state with clearance , i.e. . Suppose that for an RRT tree there exist a vertex such that . Following the definition in Section 3.2, we denote as the closest vertex to . Then, the probability that is at least .
Proof.
Let . Therefore the distance between and is upperbounded by . If there exists a vertex such that and , then . Hence, by choosing , we are guaranteed . As is drawn uniformly, the probability for is . ∎
We can now prove the main theorem.
Theorem 1.
Suppose that there exists a valid trajectory from to as defined in definition 1, with a corresponding piecewise constant control. The probability that RRT fails to reach from after iterations is bounded by , for some constants .
Proof.
Lemma 2 puts bound on the probability to find actions that expand the tree from one state to another in a given time. As we assume that a valid trajectory exists, we can assume that the probability defined in Lemma 2 is nonzero, i.e. , hence:
(S5) 
where we set and as was also done in (kleinbort2019). We additionally require that , which is typically defined as an RL environment parameter, is chosen accordingly so to ensure that Eq. (S5) holds, i.e. .
We cover with balls of radius , where . The balls are spaced equally in time with the center of the ball is in , where . Therefore, and . We now examine the probability of RRT propagating along . Suppose that there exists a vertex , we need to bound the probability that by taking a random sample , there will be a vertex such that . Lemma 3 provides a lower bound for the probability that , given that there exists a vertex , of . Lemma 2 provide a lower bound for the probability of choosing an an action from to of , where we have substituted with . Consequently, .
For RRT to recover , the transition between consecutive circles must be repeated
times. This stochastic process can be described as a binomial distribution, where we perform
trials (randomly choosing ), with successes (transition between circles) and a transition success probability . The probability mass function of a binomial distribution is. We use the cumulative distribution function (CDF) to represent the upper bound for failure, i.e. the process was unable to perform
steps, which can be expressed as:(S6) 
Using Chernoff’s inequality we derive the tail bounds of the CDF when :
(S7)  
(S8)  
(S9) 
In the other case, where , the upper bound is given by (Arratia1989):
(S10) 
where is the relative entropy such that
Rearranging , we can rewrite S10 as follows:
(S11)  
(S12)  
(S13)  
(S14)  
(S15)  
(S16) 
where (S14) is justified for worstcase scenario where , (S15) uses the fact that , hence . The last step, (S16) is derived from the first term of the Taylor expansion of .
As and are fixed and independent of , we show that the expression for decays to zero exponentially with , therefore RRT is probabilistically complete. ∎
It worth noting that as expected the failure probability depends on problemspecific properties, which give rise to the values of and . Intuitively, depends on the scale of the problem such as volume of the goal set and how complex and long the solution needs to be, as evident in Eq. (S9). More importantly, depends on the probability . Therefore, it is a function of the dimensionality of (through the probability of sampling ) and other environment parameters such as clearance (defined by ) and dynamics (via , ), as specified in Eq. (S4).
S3 Appendix C: experimental setup
All experiments were run using a single GHz core and a GeForce GTX 1080 Ti GPU.
s3.1 Environments
All environments are made available in supplementary code. Environments are based on Gym (gym), with modified sparse reward functions and state spaces. All environments emit a reward per timestep unless noted otherwise. The environments have been further changed from Gym as follows:

Cartpole Swingup The state space consists of states where is cart position, is cart linear velocity, is pole angle (measuring from the axis) and pole angular velocity. Actions are force applied on the cart along the axis. The goal space is . Note that reaching the goal space does not terminate an episode, but yields a reward of . Time horizon is . Reaching the bounds of the rail does not cause failure but arrests the linear movement of the cart.

MountainCar The state space consists of states where is car position and is car velocity. Actions are force applied by the car engine. The goal space is . Time horizon is .

Acrobot The state space consists of states where are the angles of the joints (measuring from the
axis and from the vector parallel to the
link, respectively) and are their angular velocities. Actions are torque applied on the joint. The goal space is . In other words, the set of states where the end of the second link is at a height . Time horizon is . 
Pendulum The state space consists of states where is the joint angle (measured from the axis) and is the joint angular velocity. Actions are torque applied on the joint. The goal space is . Note that reaching the goal space does not terminate an episode, but yields a reward of . Time horizon is .

Reacher The state space consists of states where are the angles of the joints, are the coordinates of the target and are the joint angular velocities. Actions are torques applied at the joints. The goal space is the set of states where the endeffector is within a distance of from the target. Time horizon is .

Fetch Reach A highdimensional robotic task where the state space consists of states where are the Cartesian coordinates and velocities of the Fetch robot’s gripper, are the twodimensional position and velocity of the gripper fingers, and are the Cartesian coordinates of the goal. Actions are relative target positions of the gripper and fingers, which the MuJoCo controller will try to achieve. The goal space is the set of states where the endeffector is within a distance of from . Time horizon is . Note that this problem is harder than the original version in OpenAI Gym, as we only sample that are far from the gripper’s initial position.

Hand Reach A highdimensional robotic task where the state space consists of states where are the angles and angular velocities of the Shadow hand’s joints, are the Cartesian coordinates of the fingertips, and are the Cartesian coordinates of the goal positions for each fingertip. Actions are absolute target angles of the controllable joints, which the MuJoCo controller will try to achieve. The goal space is the set of states where all fingertips are simultaneously within a distance of from their respective goals. Time horizon is .
s3.2 Experimental setup and hyperparameter choices
All experiments feature a policy with fullyconnected hidden layers of units each with tanh activation, with the exception of Reacher, for which a policy network of fullyconnected hidden layers of
units each with relu activation is used. For all environments we use a linear feature baseline for TRPO.
Default values are used for most hyperparameters. A discount factor of
is used in all environments. For VIME, hyperparameters values reported in the original paper are used, and the implementation published by the authors was used.For TRPO, default hyperparameter values and implementation from Garage are used: KL divergence constraint , and Gaussian action noise .
In comparisons with VIMETRPO and vanilla TRPO, the R3L goal sampling probability