Planning to Explore via Self-Supervised World Models

05/12/2020 ∙ by Ramanan Sekar, et al. ∙ 2

Reinforcement learning allows solving complex tasks, however, the learning tends to be task-specific and the sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero or a few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performances oracle which has access to rewards. Videos and code at



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The dominant approach in sensorimotor control is to train the agent on one or more pre-specified tasks either via rewards in reinforcement learning, or via demonstrations in imitation learning. However, learning each task from scratch is often inefficient, requiring a large amount of task-specific environment interaction for solving each task. How can an agent quickly generalize to unseen tasks it has never experienced before in a zero or few-shot manner?

Task-agnostic RL

Because data collection is often expensive, it would be ideal to not keep collecting data for each new task. In this work, we explore the environment once without reward to collect a diverse dataset for later solving any downstream task, as shown in Figure 1. After the task-agnostic exploration phase, the agent is provided with downstream reward functions and needs to solve the tasks with limited or no further environment interaction. Such a self-supervised approach would allow solving various tasks without having to repeat the expensive process of data collection for each task.

Intrinsic motivation

To explore complex environments in the absence of rewards, the agent needs to follow a form of intrinsic motivation that is computed from inputs that could be high-dimensional images. For example, an agent could seek inputs that it cannot yet predict accurately (schmidhuber1991possibility; oudeyer2007intrinsic; pathakICMl17curiosity), maximally influence its inputs (klyubin2005empowerment; eysenbach2018diversity), or visit rare states (poupart2006analytic; lehman2011evolving; bellemare2016unifying; burda2018rnd). However, most of these prior methods learn a model-free exploration policy to collect diverse environment interactions which needs large amounts of sample for finetuning or adaptation when presented with rewards for downstream tasks.

Figure 1: The agent first leverages planning to explore in a self-supervised manner, without task-specific rewards, to efficiently learn a global world model. After the exploration phase, it receives reward functions to adapt to multiple downstream tasks, such as standing, walking, running, and flipping using either zero or few tasks-specific interactions.
Figure 2: Overview of Plan2Explore. Each observation is first encoded into features which are then used at each time step to infer a recurrent latent state . At each training step, the agent leverages planning to explore by imagining the consequences of the actions of policy using the current world model. The planning objective is to maximize expected novelty computed as the disagreement in the predicted next image embedding from an ensemble of learned transition dynamics

. This planning objective is backpropagated all the way through the imagined rollout states to improve the exploration policy

. The learned model is used for planning to explore in latent space, and the data collected during exploration is in turn used to improve the model. This world model is later used to plan for novel tasks at test time.

Retrospective novelty

Model-free exploration methods not only require large amounts of experience to adapt to downstream tasks, they can also be inefficient during exploration. These agents usually first act in the environment, collect trajectories, and then calculate an intrinsic reward as the agent’s current estimate of novelty. This approach misses out on efficiency by operating retrospectively, that is, the novelty of inputs is computed after the agent has already reached them. Hence, it seeks out previously novel inputs that have already been visited and would not be novel anymore. Instead, one should directly seek out future inputs that are expected to be novel.

Planning to explore

We address both of these challenges — quick adaptation and expected future novelty — within a common framework, while learning directly from high-dimensional image inputs. Instead of maximizing intrinsic rewards in retrospect, we learn a world model to plan ahead and seek out expected novelty of future situations. This lets us learn the exploration policy purely from imagined model states, without causing additional environment interaction (sun2011planning; shyam2019max). The exploration policy is optimized purely from trajectories imagined under the model to maximize the intrinsic rewards computed by the model itself. After the exploration phase, the learned world model is used to train downstream task policies in imagination via offline reinforcement learning, without any further environment interaction.


The key challenges for planning to explore are to train an accurate world model from high-dimensional inputs and to define an effective exploration objective. We focus on world models that predict ahead in a compact latent space, and have recently been shown to solve challenging control tasks from images (hafner2018planet; zhang2018solar). Predicting future compact representations facilitates accurate long-term predictions and lets us efficiently predict thousands of future sequences in parallel for policy learning.

An ideal exploration objective should seek out inputs that the agent can learn the most from (epistemic uncertainty), while being robust to stochastic parts of the environment that cannot be learned accurately (aleatoric uncertainty). This is formalized in the expected information gain (lindley1956expinfogain), that we approximate as the disagreement in predictions of an ensemble of one-step models. These one-step models are trained alongside the world model and mimic it’s transition function. The disagreement is positive for novel states, but given enough samples, it eventually reduces to zero even for stochastic environments because all one-step predictions approach the mean value of next input (pathak2019self).


We introduce Plan2Explore, a self-supervised reinforcement learning agent that leverages planning to efficiently explore visual environments without rewards. Across 20 challenging control tasks without access to proprioceptive states or rewards, Plan2Explore achieves state-of-the-art zero-shot and adaptation performance. Moreover, we empirically study the questions:

  • [noitemsep,topsep=0pt]

  • How does planning to explore via latent disagreement compare to a supervised oracle and other model-free and model-based intrinsic reward objectives?

  • How much task-specific experience is enough to finetune a self-supervised model to reach the task performance of atask-specific agent?

  • To what degree does a self-supervised model generalize to unseen tasks compared to a task-specific model trained on a different task in the same environment?

  • What is the advantage of maximizing expected future novelty in comparison to retrospective novelty?

2 Control with Latent Dynamics Models

World models summarize past experience into a representation of the environment that enables predicting imagined future sequences (sutton1991dyna; watter2015embed; ha2018world). When sensory inputs are high-dimensional observations, predicting compact latent states lets us predict many future sequences in parallel due to memory efficiency.111The latent model states are not to be confused with the unknown true environment states. Specifically, we use the latent dynamics model of PlaNet (hafner2018planet), that consists of the following key components that are illustrated in Figure 2,


The image encoder is implemented as a CNN, and the posterior and prior dynamics share an RSSM (hafner2018planet)

. The temporal prior predicts forward without access to the corresponding image. The reward predictor and image decoder provide a rich learning signal to the dynamics. The distributions are parameterized as diagonal Gaussians. All model components are trained jointly similar to a variational autoencoder (VAE) 

(kingma2013vae; rezende2014vae) by maximizing the evidence lower bound (ELBO).

Given this learned world model, we need to derive behaviors from it. Instead of online planning, we use Dreamer (hafner2019dreamer)

to efficiently learn a parametric policy inside the world model that considers long-term rewards. Specifically, we learn two neural networks that operate on latent states of the model. The state-value estimates the sum of future rewards and the actor tries to maximize these predicted values,


The learned world model is used to predict the sequences of future latent states under the current actor starting from the latent states obtained by encoding images from the replay buffer. The value function is computed at each latent state and the actor policy is trained to maximize the predicted values by propagating their gradients through the neural network dynamics model as shown in Figure 2.

3 Planning to Explore

We consider a learning setup with two phases, as illustrated in Figure 1. During self-supervised exploration, the agent gathers information about the environment and summarizes this past experience in the form of a parametric world model. After exploration, the agent is given a downstream task in the form of a reward function that it should adapt to with no or limited additional environment interaction.

During exploration, the agent begins by learning a global world model using data collected so far and then this model is in turn used to direct agent’s exploration to collect more data, as described in Algorithm 1. This is achieved by training an exploration policy inside of the world model to seek out novel states. Novelty is estimated by ensemble disagreement in latent predictions made by 1-step transition models trained alongside the global recurrent world model. More details to follow in Section 3.1.

During adaptation, we can efficiently optimize a task policy by imagination inside of the world model, as shown in Algorithm 2. Since our self-supervised model is trained without being biased toward a specific task, a single trained model can be used to solve multiple downstream tasks.

3.1 Latent Disagreement

1:  initialize: Dataset D from a few random episodes.
2:   World model M.
3:   Latent disagreement ensemble E.
4:   Exploration actor-critic .
5:  while exploring do
6:     Train M on D.
7:     Train E on D.
8:     Train on LD reward in imagination of M.
9:     Execute in the environment to expand D.
10:  end while
11:  return Task-agnostic D and M.
Algorithm 1 Planning to Explore via Latent Disagreement
1:  input: World model M.
2:   Dataset D without rewards.
3:   Reward function R.
4:  initialize: Latent-space reward predictor R̂.
5:   Task actor-critic .
6:  while adapting do
7:     Distill R into R̂ for sequences in D.
8:     Train on R̂ in imagination of M.
9:     Execute for the task and report performance.
10:     Optionally, add task-specific episode to D and repeat.
11:  end while
12:  return Task actor-critic .
Algorithm 2 Zero and Few-Shot Task Adaptation

To efficiently learn a world model of an unknown environment, a successful strategy should explore the environment such as to collect new experience that improves the model the most. For this, we quantify the model’s uncertainty about its predictions for different latent states. An exploration policy then seeks out states with high uncertainty. The model is then trained on the newly acquired trajectories and reduces its uncertainty in these and the process is repeated.

Quantifying uncertainty is a long standing open challenge in deep learning 

(mackay1992infogain; gal2016uncertainty). In this paper, we use ensemble disagreement as an empirically successful method for quantifying uncertainty (lakshminarayanan2017simple; osband2018randomized). As shown in Figure 2, we train a bootstrap ensemble (breiman1996bagging)

to predict, from each model state, the next encoder features. The variance of the ensemble serves as the estimate of uncertainty.

Intuitively, because the ensemble models have different initialization and observe data in a different order, their predictions differ for unseen inputs. Once the data is added to the training set, however, the models will converge towards more similar predictions and the disagreement decreases. Eventually, once the whole environment is explored, the models should converge to identical predictions.

Formally, we define a bootstrap ensemble of one-step predictive models with parameters . Each of these models takes a model state and action as input and predicts the next image embedding . The models are trained with the mean squared error, which is equivalent to Gaussian log-likelihood,


We quantify model uncertainty as the variance over predicted means of the different ensemble members and use this disagreement as the intrinsic reward to train the exploration policy,


The intrinsic reward is non-stationary because the world model and the ensemble predictors change throughout exploration. Indeed, once certain states are visited by the agent and the model gets trained on them, these states will become less interesting for the agent and the intrinsic reward for visiting them will decrease.

We learn the exploration policy using Dreamer (Section 2). Since the intrinsic reward is computed in the compact representation space of the latent dynamics model, we can optimize the learned actor and value from imagined latent trajectories without generating images. This lets us efficiently optimize the intrinsic reward without additional environment interaction. Furthermore, the ensemble of lightweight 1-step models adds little computational overhead as they are trained together efficiently in parallel across all time steps.

3.2 Expected Information Gain

Figure 3: Zero-shot RL performance from raw pixels. After training the agent without reward supervision, we provide it with a task by specifying the reward function. Throughout the exploration, we take snapshots of the agent to train a task policy on the final task and plot its zero-shot performance. We see that Plan2Explore achieves state-of-the-art zero-shot task performance on a range of tasks, and even demonstrates competitive performance to Dreamer (hafner2019dreamer), a state-of-the-art supervised reinforcement learning agent. This indicates that Plan2Explore is able to explore and learn a global model of the environment that is useful for adapting to new tasks, demonstrating the potential of self-supervised reinforcement learning. Results on all 20 tasks is in the appendix Figure 6 and videos on the website.

Latent disagreement has an information-theoretic interpretation. This subsection derives our method from the amount of information gained by interacting with the environment, which has its roots in optimal Bayesian experiment design (lindley1956expinfogain; mackay1992infogain).

Because the true dynamics are unknown, the agent treats the optimal dynamics parameters as a random variable

. To explore the environment as efficiently as possible, the agent should seek out future states that are informative of our belief over the parameters.

Mutual information formalizes the amount of bits that a future trajectory provides about the optimal model parameters on average. We aim to find a policy that shapes the distribution over future states to maximize the mutual information between the image embeddings and parameters ,


We operate on latent image embeddings to save computation. To select the most promising data during exploration, the agent maximizes the expected information gain,


This expected information gain can be rewritten as conditional entropy of trajectories subtracted from marginal entropy of trajectories, which correspond to, respectively, the aleatoric and the total uncertainty of the model,


We see that the information gain corresponds to the epistemic uncertainty, i.e. the reducible uncertainty of the model that is left after subtracting the expected amount of data noise from the total uncertainty.

Trained via squared error, our ensemble members are conditional Gaussians with means produced by neural networks and fixed variances. The ensemble can be seen as a mixture distribution of parameter point masses,


Because the variance is fixed, the conditional entropy does not depend on the state or action in our case ( is the dimensionality of the predicted embedding),


Maximizing information gain then means to simply maximize the marginal entropy of the ensemble prediction. For this, we make the following observation: the marginal entropy is maximized when the ensemble means are far apart (disagreement) so the modes overlap the least, maximally spreading out probability mass. As the marginal entropy has no closed-form expression suitable for optimization, we instead use the empirical variance over ensemble means to measure how far apart they are,


To summarize, our exploration objective defined in Section 3.1, which maximizes the variance of ensemble means, approximates the information gain and thus should find trajectories that will efficiently reduce the model uncertainty.

4 Experimental Setup

Figure 4: Performance on few-shot adaptation from raw pixels without state-space input. After the exploration phase of 1M steps (white background), during which the agent does not observe the reward and thus does not solve the task, we let the agent collect a small amount of data from the environment (shaded background). We see that Plan2Explore is able to explore the environment efficiently in only 1000 episodes, and then adapt its behaviour immediately after observing the reward. Plan2Explore adapts rapidly, producing effective behavior competitive to state-of-the-art supervised reinforcement learning in just a few collected episodes.

Environment Details

We use the DM Control Suite (deepmindcontrolsuite2018), a standard benchmark for continuous control. All experiments use visual observations only, of size pixels. The episode length is 1000 steps and we apply an action repeat of

for all the tasks. We run every experiment with three different random seeds with standard deviation shown in shaded region. Further details are in the appendix.


We use (hafner2019dreamer)

with the original hyperparameters unless specified otherwise to optimize both exploration and task policies of Plan2Explore. We found that additional capacity provided by increasing the hidden size of the GRU in the latent dynamics model to

and the deterministic and stochastic components of the latent space to helped performance. For a fair comparison, we maintain this model size for Dreamer and other baselines. For latent disagreement, we use an ensemble of one-step prediction models implemented as hidden-layer MLP. Full details are in the supplementary material.


We compare our agent to a state-of-the-art task-oriented agent that receives rewards throughout training, Dreamer  (hafner2019dreamer). We also compare to state-of-the art unsupervised agents: Curiosity (pathakICMl17curiosity) and Model-based Active Exploration (shyam2019max, MAX). Because Curiosity is inefficient during fine-tuning and would not be able to solve a task in a zero-shot way, we adapt it into the model-based setting. We further adapt MAX to work with image observations as (shyam2019max) only addresses learning from low-dimensional states. We use (hafner2019dreamer) as the base agent for all methods to provide a fair comparison. We additionally compare to a random data collection policy that uniformly samples from the action space of the environment. All methods share the same model hyperparameters to provide a fair comparison.

5 Results and Analysis

Our experiments focus on evaluating whether our proposed Plan2Explore agent efficiently explores and builds a model of the world that allows quick adaptation to solve tasks in zero or few-shot manner. The rest of the subsections are organized in terms of the key scientific questions we would like to investigate as discussed in the introduction.

5.1 Does the model transfer to solve tasks zero-shot?

To test whether Plan2Explore has learned a global model of the environment that can be used to solve new tasks, we evaluate the zero-shot performance of our agent. Our agent learns a model without using any task-specific information. After that, a separate downstream agent is trained, which optimizes the task reward using only the self-supervised world model and no new interaction with the world. To specify the task, we provide the agent with the reward function that is used to label its replay buffer with rewards and train a reward predictor. This process is described in the Algorithm 2, with the step 10 omitted.

In Figure 3, we compare the zero-shot performance of our downstream agent with respect to the amount of exploration data. This is done by training the downstream agent continuously. We see that Plan2Explore overall performs better than prior state-of-the-art exploration strategies from high dimensional pixel input, sometimes being the only successful unsupervised method. Moreover, the zero-shot performance of Plan2Explore is competitive to Dreamer, even outperforming it in the hopper hop task.

Plan2Explore was able to successfully learn a good model of the environment and efficiently derive task-oriented behaviors from this model. We emphasize that Plan2Explore explores without task rewards, and Dreamer is the oracle as it is given task rewards during exploration. Yet, Plan2Explore almost matches the performance of this oracle.

Figure 5: Do task-specific models generalize? We test Plan2Explore on zero-shot performance on four different tasks in the cheetah environment from raw pixels without state-space input. Throughout the exploration, we take snapshots of policy to plot its zero-shot performance. In addition to random exploration, we compare to an oracle agent, Dreamer, that uses the data collected when trained on the run forward task with rewards. Although Dreamer trained on ’run forward’ is able to solve the task it is trained on, it struggles on the other tasks, indicating that it has not learned a global world model.

5.2 How much task-specific interaction is needed for finetuning to reach the supervised oracle?

While zero-shot learning might suffice for some tasks, in general we will want to adapt our model of the world to task-specific information. In this section, we test whether few-shot adaptation of the model to a particular task is competitive to training a fully supervised task-specific model. To adapt our model, we only add supervised episodes which falls under ‘few-shot’ adaptation. Furthermore, in this setup, to evaluate the data efficiency of Plan2Explore we set the number of exploratory episodes to only 1000.

In the exploration phase of Figure 4, i.e., left of the vertical line, our agent does not aim to solve the task, as it is still unknown, however we expect that during some period of exploration it will coincidentally achieve higher rewards as it explores the parts of the state space relevant for the task. The performance of unsupervised methods is coincidental until 1000 episodes and then it switches to task-oriented behaviour for remaining 150 episodes, while for supervised, it is task-oriented throughout. That’s why we see a big jump for unsupervised methods where the shaded region begins.

In the few-shot learning setting, Plan2Explore eventually performs competitively to Dreamer on all tasks, significantly outperforming it on the hopper task. Plan2Explore is also able to adapt quicker or similar to other unsupervised agents on all tasks. These results show that a self-supervised agent, when presented with a task specification, should be able to rapidly adapt its model to the task information, matching or outperforming the fully supervised agent trained only for that task. Moreover, Plan2Explore is able to learn this general model with a small amount of samples, matching Dreamer, which is fully task-specific, in data efficiency. This shows the potential of an unsupervised pre-training in reinforcement learning. Please refer to appendix for detailed quantitative results.

5.3 Do self-supervised models generalize better than supervised task-specific models?

If the quality of our learned model is good, it should be transferable to multiple tasks. In this section, we test the quality of the learned model on generalization to multiple tasks in the same environment. We devise a set of three new tasks for the Cheetah environment, specifically, running backward, flipping forward, and flipping backward. We evaluate the zero-shot performance of Plan2Explore, and additionally compare to a Dreamer agent that is only allowed to collect data on the running forward task and then tested on zero-shot performance on the three other tasks.

Figure 5 shows that while Dreamer performs well on the task it is trained on, running forward, it fails to solve all other tasks, performing comparably to random exploration. It even fails to generalize to the running backward task. In contrast, Plan2Explore performs well across all tasks, outperforming Dreamer on the other three tasks. This indicates that the model learned by Plan2Explore is indeed global, while the model learned by Dreamer, which is task-oriented, fails to generalize to different tasks.

5.4 What is the advantage of maximizing expected novelty in comparison to retrospective novelty?

Our Plan2Explore agent is able to measure expected novelty by imagining future states that have not been visited yet. A model-free agent, in contrast, is only trained on the states from the replay buffer, and only gets to see the novelty in retrospect, after the state has been visited. Here, we evaluate the advantages of computing expected versus retrospective novelty by comparing Plan2Explore to a one-step planning agent. The one-step planning agent is not able to plan to visit states that are more than one step away from the replay buffer, and is somewhat similar to a Q-learning agent with a particular parametrization of the Q-function. We refer to this approach as Retrospective Disagreement. Figures 4 and 3 show the performance of this approach. Our agent achieves superior performance, which is consistent with our intuition about importance of computing expected novelty.

6 Related Work


Efficient exploration is a crucial component of an effective reinforcement learning agent (kakade2002approximately). In tabular settings, it is efficiently addressed with exploration bonuses based on state visitation counts (strehl08; jaksch2010near) or fully Bayesian approaches (duff2002optimal; poupart2006analytic), however these approaches are hard to generalize to high-dimensional states, such as images. Recently, several methods were proposed based on generalization of these early approaches, such as using pseudo-count measures of state visitation (bellemare2016unifying; ostrovski2017count). osband2016deep

derived an efficient approximation to the Thompson sampling procedure via ensembles of Q-functions.

osband2018randomized; lowrey2018plan use ensembles of Q-functions to track the posterior of the value functions. In contrast to these task-oriented methods, our approach uses neither reward nor state at training time.

Intrinsic motivation

A different line of work on intrinsic motivation considered exploration as an objective on its own (oudeyer2007intrinsic; oudeyer2009intrinsic). Practical examples of such approaches focus on maximizing prediction error as the intrinsic reinforcement learning objective (pathakICMl17curiosity; burda2018large; haber2018learning). These approaches can also be understood as maximizing the agent’s surprise (schmidhuber1991curious; josh_surprise). Similar to our work, other recent approaches use the notion of model disagreement to encourage visiting states with the highest potential to improve the model (burda2018rnd; pathak2019self)

, motivated by the active learning literature 

(Seung1992; mccallumzy1998employing). However, these approaches are model-free and are very hard to fine-tune to a new task, requiring millions of environment steps for fine-tuning.

Model-based control

Model-free agents are often data-inefficient (kaelbling1996reinforcement) and hard to adapt to different tasks, although one promising avenue for adapting these agents is goal-conditioned reinforcement learning (kaelbling1993learning; pathak2018zero; pong2019skewfit). Model-based agents, which do not suffer from these issues, are then a natural choice for learning in self-supervised manner. Early work on model-based reinforcement learning used Gaussian processes and time-varying linear dynamical systems and has shown significant improvements in data efficiency over model-free agents (deisenroth2011pilco; levine2013guided) when low-dimensional state information is available. Recent work on latent dynamics models has shown that model-based agents can achieve performance competitive with model-free agents while attaining much higher data efficiency, and even scale to high-dimensional observations (chua2018deep; buesing2018learning; ebert2018visual; ha2018world; hafner2018planet; nagabandi2019deep). We base our agent on a state-of-the-art model-based agent, Dreamer (hafner2019dreamer), and use it to perform self-supervised exploration in order to solve tasks in few-shot manner.

The idea of actively exploring to collect the most informative data goes back to the formulation of the information gain (lindley1956expinfogain). mackay1992infogain described how a learning system might optimize Bayesian objectives for active data selection based on the information gain. sun2011planning derived a model-based reinforcement learning agent that can optimize the infinite-horizon information gain and experimented with it in tabular settings. (amos2018awareness) proposes a method for model-based active learning for proprioceptive continuous control based on maximizing entropy of a single mixture density network model. The closest works to ours are shyam2019max; henaff2019explicit, which use a measurement of disagreement or information gain through ensembles of neural networks in order to incentivize exploration. However, these approaches are restricted to setups where low-dimensional states are available, whereas we design a latent state approach that scales to high-dimensional observations. Moreover, we provide a theoretical connection between information gain and model disagreement. Concurrently with us, (ball2020ready) discusses the connection between information gain and model disagreement in the context of task-specific exploration from low-dimensional state information.

7 Discussion

We presented Plan2Explore, a self-supervised reinforcement learning method that learns a world model of its environment through unsupervised exploration and uses this model to solve tasks in a zero-shot or few-shot manner. We derived connections of our method to the expected information gain, a principled objective for exploration. Building on recent work on learning dynamics models and behaviors from images, we constructed a model-based zero-shot reinforcement learning agent that was able to achieve state-of-the-art zero-shot task performance on the DeepMind Control Suite. Moreover, the agent’s zero-shot performance was competitive to Dreamer, a state-of-the-art supervised reinforcement learning agent on some tasks, with the few-shot performance eventually matching or outperforming the supervised agent. By presenting a method that can learn effective behavior for many different tasks in a scalable and data-efficient manner, we hope this work constitutes a step toward building scalable real-world reinforcement learning systems.


We thank Rowan McAllister, Aviral Kumar, and the members of GRASP for fruitful discussions. This work was supported in part by Curious Minded Machines grant from Honda Research and DARPA MCS.


Appendix A Appendix

Figure 6: We evaluate the zero-shot performance of the self-supervised agents as well as supervised performance of Dreamer on all tasks from the DM control suite. All agents operate from raw pixels. The experimental protocol is the same as in Figure 3 of the main paper. To produce this plot, we take snapshots of the agent throughout exploration to train a task policy on the downstream task and plot its zero-shot performance. We use the same hyperparameters for all environments. We see that Plan2Explore achieves state-of-the-art zero-shot task performance on a range of tasks. Moreover, even though Plan2Explore is a self-supervised agent, it demonstrates competitive performance to Dreamer (hafner2019dreamer), a state-of-the-art supervised reinforcement learning agent. This shows that self-supervised exploration is competitive to task-specific approaches in these continuous control tasks.

Results DM Control Suite

In Figure 6, we show the performance of our agent on all 20 DM Control Suite tasks from pixels. In addition, we show videos corresponding to all the plots on the project website:

Convention for plots

We run every experiment with three different random seeds. The shaded area of the graphs shows the standard deviation in performance. All plot curves are smoothed with a moving mean that takes into account a window of the past 20 data points. Only Figure 5 was smoothed with a window of past 5 data points so as to provide cleaner looking plots that indicate the general trend. Low variance in all the curves consistently across all figures suggests that our approach is very reproducible.

Rewards of new tasks

To test the generalization performance of the our agent, we define three new tasks in the Cheetah environment:

  • Cheetah Run Backward Analogous to the forward running task, the reward is linearly proportional to the backward velocity up to a maximum of , which means , where and is the forward velocity of the Cheetah.

  • Cheetah Flip Backward The reward is linearly proportional to the backward angular velocity up to a maximum of , which means , where and is the angular velocity about the positive Z-axis, as defined in DeepMind Control Suite.

  • Cheetah Flip Forward The reward is linearly proportional to the forward angular velocity up to a maximum of , which means .


We use the DeepMind Control Suite (deepmindcontrolsuite2018) tasks, a standard benchmark of tasks for continuous control agents. All experiments are performed with only visual observations. We use RGB visual observations with resolution. We have selected a diverse set of 8 tasks that feature sparse rewards, high dimensional action spaces, and environments with unstable equilibria and environments that require a long planning horizon. We use episode length of 1000 steps and a fixed action repeat of for all the tasks.

Agent implementation

For implementing latent disagreement, we use an ensemble of one-step prediction models with a hidden-layer MLP, which takes in the RNN-state of RSSM and the action as inputs, and predicts the encoder features, which have a dimension of . We scale the disagreement of the predictions by for the final intrinsic reward, this was found to increase performance on some environments. We do not normalize the rewards, both extrinsic and intrinsic. This setup for the one-step model was chosen over other variants, in which we tried predicting the deterministic, stochastic and the combined features of RSSM respectively. The performance benefits of this ensemble over the variants potentially come from the large parametrization that comes with predicting the large encoder features.


We note that while Curiosity (pathakICMl17curiosity) uses loss to train the model, the RSSM loss is different (see (hafner2018planet)); we use use the full RSSM loss as the intrinsic reward for the Curiosity comparison, as we found it produces best performance. Note that this reward can only be computed when ground truth data is available, and needs a separate reward predictor to optimize it in a model-based fashion.

Zero-shot performance Plan2Explore Curiosity Random MAX Retrospective Dreamer
Task-agnostic experience 3.5M 3.5M 3.5M 3.5M 3.5M
Task-specific experience 3.5M
Acrobot Swingup 280.23 219.55 107.38 64.30 110.84 408.27
Cartpole Balance 950.97 917.10 963.40 970.28
Cartpole Balance Sparse 860.38 695.83 764.48 926.9
Cartpole Swingup 759.65 747.488 516.04 144.05 700.59 855.55
Cartpole Swingup Sparse 602.71 324.5 94.89 9.23 180.85 789.79
Cheetah Run 784.45 495.55 0.78 0.76 9.11 888.84
Cup Catch 962.81 963.13 660.35 963.4
Finger Spin 655.4 661.96 676.5 333.73
Finger Turn Easy 401.64 266.96 495.21 551.31
Finger Turn Hard 270.83 289.65 464.01 435.56
Hopper Hop 432.58 389.64 12.11 17.39 41.32 336.57
Hopper Stand 841.53 889.87 180.86 923.74
Pendulum Swingup 792.71 56.80 16.96 748.53 1.383 829.21
Quadruped Run 223.96 164.02 139.53 373.25
Quadruped Walk 182.87 368.45 129.73 921.25
Reacher Easy 530.56 416.31 229.23 242.13 230.68 544.15
Reacher Hard 66.76 123.5 4.10 438.34
Walker Run 429.30 446.45 318.61 783.95
Walker Stand 331.20 459.29 301.65 655.80
Walker Walk 911.04 889.17 766.41 148.02 538.84 965.51
Task Average 563.58 489.26 342.11 694.77
Table 1: Zero-shot performance at 3.5 million environment steps (corresponding to 1.75 agent steps times 2 for action repeat). We report the average performance of the last 20 episodes before the 3.5 million steps point. The performance is computed by executing the mode of the actor without action noise. Among the agents that receive no task rewards, the highest performance of each task is highlighted. The corresponding training curves are visualized in Figure 6.
Adaptation performance Plan2Explore Curiosity Random MAX Retrospective Dreamer
Task-agnostic experience 1M 1M 1M 1M 1M
Task-specific experience 150K 150K 150K 150K 150K 1.15M
Acrobot Swingup 312.03 163.71 27.54 108.39 76.92 345.51
Cartpole Swingup 803.53 747.10 416.82 501.93 725.81 826.07
Cartpole Swingup Sparse 516.56 456.8 104.88 82.06 211.81 758.45
Cheetah Run 697.80 572.67 18.91 0.76 79.90 852.03
Hopper Hop 307.16 159.45 5.21 64.95 29.97 163.32
Pendulum Swingup 771.51 377.51 1.45 284.53 21.23 781.36
Reacher Easy 848.65 894.29 358.56 611.65 104.03 918.86
Walker Walk 892.63 932.03 308.51 29.39 820.54 956.53
Task Average 643.73 537.95 155.23 210.46 258.78 700.27
Table 2: Adaptation performance after 1M task-agnostic environment steps, followed by 150K task-specific environment steps (agent steps are half as much due to the action repeat of 2). We report the average performance of last 20 episodes before the 1.15M steps point. The performance is computed by executing the mode of the actor without action noise. Among the self-supervised agents, the highest performance of each task is highlighted. The corresponding training curves are visualized in Figure 4.