1 Introduction
Deep Reinforcement Learning (RL) has driven a number of breakthroughs in autonomous control over the last five years. RL algorithms have been used to train agents to play Atari video games directly from pixels mnih2015human; mnih2016a2c, learn robotic locomotion trpo; schulman2015high; ppo and manipulation akkaya2019solving policies from raw sensory input, master the game of Go alphago; alphazero, and play large-scale multiplayer video games berner2019dota; alphastar. While these results were significant advances in autonomous decision making, a deeper look reveals a fundamental limitation: the above algorithms produced agents capable of solving only the single task they were trained on. As a result, current RL approaches produce brittle policies with poor generalization capabilities cobbe2020leveraging, which limits their applicability to many problems of interest Gleave2020Adversarial. It is therefore important to move beyond today's powerful but narrow RL systems toward generalist systems capable of quickly adapting to new downstream tasks.
In contrast, in the fields of Computer Vision (CV) and Natural Language Processing (NLP), large-scale unsupervised pre-training has enabled sample-efficient few-shot adaptation. In NLP, unsupervised sequential modeling has produced powerful few-shot learners brown2020language; devlin2018bert; radford2019language. In CV, unsupervised representation learning techniques such as contrastive learning have produced algorithms that are dramatically more label-efficient than their supervised counterparts chen2020simclr; he2020moco; henaff2019cpcv2; grill2020byol and more capable of adapting to a host of downstream supervised tasks such as classification, segmentation, and object detection. While these advances in unsupervised learning have also benefited RL, both in terms of learning efficiently from images laskin2020reinforcement; laskin2020curl; schwarzer2021dataefficient; stooke2020decoupling; yarats2021image and in terms of introducing new architectures for RL chen_lu2021dt; janner2021tto, the resulting agents have remained narrow since they still optimize a single extrinsic reward as before.
Fully unsupervised training of RL algorithms requires not only learning self-supervised representations but also learning policies without access to extrinsic rewards. Recently, unsupervised RL algorithms have begun to show progress toward more generalist systems by training policies without extrinsic rewards. Exploration with self-supervised prediction has enabled agents to explore video games from pixels pathak2017curiosity; Pathak19disagreement, mutual information-based approaches have demonstrated self-supervised skill discovery and generalization to downstream tasks in continuous control domains EysenbachGIL19diayn; hansen20visr; liu21aps; SharmaGLKH20dads, and maximal entropy RL has yielded policies capable of diverse exploration liu2021unsupervised; seo21re3; yarats21protorl. However, comparing and developing new algorithms has been challenging due to a lack of a unified evaluation benchmark. Reward-free RL algorithms often use different optimization schemes, different tasks for evaluation, and have different evaluation procedures. Additionally, unlike more mature supervised RL algorithms sac; hessel18rainbow; ppo, there does not exist a unified codebase for unsupervised RL that can be used to develop new methods quickly.
To make benchmarking and developing new unsupervised RL approaches easier, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). Built on top of the widely adopted DeepMind Control Suite tassa2018deepmind, URLB provides a suite of domains of varying difficulty for unsupervised pre-training with diverse downstream evaluation tasks. URLB standardizes evaluation of unsupervised RL algorithms by defining fixed pre-training and fine-tuning procedures across all baselines. Perhaps most importantly, we open-source code for URLB environments as well as 8 leading baselines that represent the main approaches taken towards unsupervised pre-training in RL to date. Unlike prior code releases for unsupervised RL, URLB uses the same exact optimization algorithm for each baseline which enables transparent benchmarking and lowers the barrier to entry for developing new algorithms. We summarize the main contributions of this paper below:
-
We introduce URLB, a new benchmark for evaluating unsupervised RL algorithms, which consists of three domains and twelve continuous control tasks of varying difficulty to evaluate the adaptation efficiency of unsupervised RL algorithms.
-
We open-source a unified codebase for eight leading unsupervised RL algorithms. Each algorithm is trained with the same optimization backbone for fairness of comparison.
-
We find that while the implemented baselines make progress on the proposed benchmark, no existing unsupervised RL algorithm can solve URLB, and consequently identify promising research directions to progress unsupervised RL.
The benchmark environments, algorithmic baselines, and pre-training and evaluation scripts are available at https://github.com/rll-research/url_benchmark. We believe that URLB will make the development of unsupervised RL agents easier and more transparent by providing a unified set of evaluation environments, systematic procedures for pre-training and evaluation, and algorithmic baselines that share the same optimization backbone.
2 Preliminaries and Notation
Markov Decision Process:
We consider the typical Reinforcement Learning setting where an agent’s interaction with the environment is modeled through a Markov Decision Process (MDP)
sutton2018reinforcement. In this work, we benchmark unsupervised RL algorithms both in fully observable MDPs, where the agent learns from coordinate state, and in partially observable MDPs (POMDPs), where the agent learns from partially observable image observations. For simplicity we refer to both image and state-based observations as $o_t$. At every timestep $t$, the agent sees an observation $o_t$ and selects an action $a_t$ based on its policy $\pi(a_t|o_t)$. The agent then sees the next observation $o_{t+1}$ and a reward: an extrinsic reward provided by the environment (supervised RL) or an intrinsic reward defined through a self-supervised objective (unsupervised RL). In this work, we pre-train agents with intrinsic rewards and fine-tune them to downstream tasks with extrinsic rewards. Some algorithms considered in this work condition the agent on a learned task vector, which we denote as $w$.

Learning from pixels vs states: We benchmark unsupervised RL where environment observations can be either proprioceptive states or RGB images. When learning from pixels, rather than defining the self-supervised task directly as a function of image observations, it is usually more convenient to first embed the image and compute the intrinsic reward as a function of these lower-dimensional features burda2018exploration; liu2021unsupervised; liu21aps; pathak2017curiosity. We therefore define an embedding $z_t = f_\theta(o_t)$, where $f_\theta$ is an encoder function. We employ different encoder architectures depending on whether the algorithm receives pixel or state-based input. For pixel-based inputs we use the convolutional encoder architecture from SAC-AE (yarats2019improving), while for state-based inputs we use the identity function by default unless the unsupervised RL algorithm explicitly specifies a different encoding. The intrinsic reward can be a function of any and all of $(z_t, a_t, z_{t+1})$ depending on the algorithm. Finally, note that the encoder may or may not be shared with components of the base RL algorithm such as the actor and critic.
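To make this setup concrete, the following minimal PyTorch sketch shows the two encoder choices described above; the layer sizes are illustrative assumptions rather than the exact SAC-AE architecture.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Convolutional encoder f that maps image observations o_t to embeddings z_t."""
    def __init__(self, obs_shape=(9, 84, 84), feature_dim=50):
        super().__init__()
        channels = obs_shape[0]
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            flat_dim = self.conv(torch.zeros(1, *obs_shape)).shape[1]
        self.proj = nn.Linear(flat_dim, feature_dim)

    def forward(self, obs):
        # Normalize pixel values before convolving, then project to a low-dim embedding.
        return self.proj(self.conv(obs / 255.0 - 0.5))

# For state-based inputs the encoder defaults to the identity, i.e., z_t = o_t.
state_encoder = nn.Identity()
```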
3 URLB: Evaluation and Environments

3.1 Standardized Pre-training and Fine-tuning Procedures
One reason why unsupervised RL has been hard to benchmark to date is that there is no agreed-upon procedure for training and evaluating unsupervised RL agents. To this end, we standardize pre-training, fine-tuning, and evaluation in URLB. We split training into two phases, a pre-training phase and a fine-tuning phase, each with a fixed budget of environment steps. During pre-training, we checkpoint agents at 100k, 500k, 1M, and 2M steps in order to evaluate downstream performance as a function of pre-training steps. For adapting the pre-trained policy to downstream tasks, we evaluate in the data-efficient regime where the fine-tuning budget is 100k steps, since we are interested in agents that adapt quickly.
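The protocol can be summarized with the following sketch; the callables are placeholders for agent-specific logic, not the actual URLB API.

```python
def run_urlb_protocol(pretrain_update, save_snapshot, finetune, downstream_tasks,
                      snapshots=(100_000, 500_000, 1_000_000, 2_000_000),
                      finetune_steps=100_000):
    """Reward-free pre-training with periodic snapshots, then short fine-tuning runs.

    pretrain_update() -- one environment step + agent update using only the intrinsic reward
    save_snapshot()   -- returns a copy of the current agent parameters
    finetune(params, task, steps) -- fine-tunes from `params` with the extrinsic reward
                                     and returns the downstream return on `task`
    """
    saved = {}
    for step in range(1, max(snapshots) + 1):
        pretrain_update()
        if step in snapshots:
            saved[step] = save_snapshot()

    scores = {}
    for step, params in saved.items():
        for task in downstream_tasks:
            scores[(step, task)] = finetune(params, task, finetune_steps)
    return scores
```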
3.2 Evaluation
We evaluate the performance of an unsupervised RL algorithm by measuring how quickly it adapts to a downstream task. For each fine-tuning task, we initialize the agent with the pre-trained network parameters, fine-tune the agent for 100k steps and measure its performance on the downstream task. This evaluation procedure is similar to how pre-trained networks in CV and NLP are fine-tuned to downstream tasks such as classification, object detection, and summarization. There exist other means of evaluating the quality of pre-trained RL agents such as measuring the diversity of data collected during exploration or zero-shot generalization of goal-conditioned agents. However, it is challenging to produce a general method to measure data diversity, and while zero-shot generalization with goal-conditioned agents can be powerful such a benchmark would be limited to goal-conditioned RL. For these reasons, data diversity and goal-conditioned zero-shot generalization are less common evaluation metrics. In an effort to provide a general benchmark, we focus on the fine-tuning efficiency of the agent after pre-training which allows us to evaluate a diverse set of baselines.
Unlike unsupervised methods in CV and NLP which focus solely on representation learning, unsupervised pre-training in RL requires both representation learning and behavior learning. For this reason, URLB benchmarks performance for both state-based and pixel-based agents. Benchmarking state and pixel-based RL separately is important because it allows us to decouple unsupervised behavior learning from unsupervised representation learning. In state-based RL, the agent receives a near-optimal representation of the world through coordinate states. Evaluating state-based unsupervised RL agents allows us to isolate unsupervised behavior discovery without worrying about representation learning as a confounding factor. Evaluating pixel-based unsupervised RL agents provides insight into how representations and behaviors can be learned jointly.
3.3 URLB Environments
We release a set of domains and downstream tasks for URLB that are based on the DeepMind Control Suite (DMC) tassa2018deepmind. The three reasons for building URLB on top of DMC are (i) DMC is already widely adopted and familiar to RL practitioners; (ii) DMC environments can be used with both state and pixel-based inputs; (iii) DMC features environments of varying difficulty which is useful for designing a benchmark that contains both challenging and feasible tasks. URLB evaluates performance on 12 continuous control tasks (3 domains with 4 downstream tasks per domain). From easiest to hardest, the URLB domains and tasks are:
Walker (Stand, Walk, Flip, Run): A biped constrained to a 2D vertical plane. Walker is a challenging introductory domain for unsupervised RL because it requires the unsupervised agent to learn balancing and locomotion skills in order to fine-tune efficiently.

Quadruped (Stand, Walk, Jump, Run): A quadruped in 3D space. Like Walker, Quadruped requires the agent to learn to balance and move, but it is harder due to higher-dimensional state and action spaces and the 3D environment.

Jaco Arm (Reach top left, Reach top right, Reach bottom left, Reach bottom right): Jaco Arm is a 6-DOF robotic arm with a three-finger gripper. This environment tests the unsupervised RL agent's ability to control the robot arm without locking and to perform simple manipulation tasks. It was recently shown that this environment is particularly challenging for unsupervised RL yarats21protorl.
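As a rough illustration of the environment interface, the snippet below loads a standard DMC task with dm_control and runs a reward-free interaction loop; URLB's custom downstream tasks (e.g., Walker Flip, Quadruped Jump, Jaco reaching) are provided in the custom_dmc_tasks folder of the released codebase. This is a sketch of general dm_control usage, not the URLB loading API.

```python
import numpy as np
from dm_control import suite

# Load a standard DMC task; URLB adds custom tasks in its custom_dmc_tasks folder.
env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
for _ in range(1000):
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    # During unsupervised pre-training the extrinsic reward is ignored;
    # an intrinsic reward is computed from the observation instead.
    _ = time_step.reward
```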
4 URLB: Algorithmic Baselines for Unsupervised RL
In addition to introducing URLB, the other primary contribution of this work is open-sourcing a unified codebase for eight leading unsupervised RL algorithms. To date, unsupervised RL algorithms have been hard to compare due to confounding factors such as different evaluation procedures and optimization schemes. While URLB provides standardized pre-training, fine-tuning, and evaluation procedures, current algorithms are hard to compare since they rely on different optimization algorithms. For instance, Curiosity pathak2017curiosity utilizes PPO ppo while APT liu2021unsupervised uses SAC sac for optimization. Moreover, even if two unsupervised RL methods use the same optimization algorithm, small differences in implementation can result in large performance differences that are independent of the pre-training algorithm. For this reason, it is important to provide a unified codebase with identical implementations of the optimization algorithm for each baseline. Providing such a unified codebase is one of the main contributions of this benchmark.
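One way to picture this design, as a sketch of the idea rather than the actual URLB class hierarchy, is a single DDPG-based agent whose subclasses override only the intrinsic reward computation.

```python
class UnsupervisedAgent:
    """Shared optimization backbone; baselines differ only in the intrinsic reward."""

    def compute_intrinsic_reward(self, obs, action, next_obs):
        raise NotImplementedError  # ICM, RND, APT, DIAYN, etc. each override this

    def update(self, batch, finetuning=False):
        obs, action, extrinsic_reward, next_obs = batch
        # Pre-training replaces the extrinsic reward with the algorithm-specific
        # intrinsic reward; fine-tuning uses the extrinsic reward as-is.
        reward = extrinsic_reward if finetuning else self.compute_intrinsic_reward(
            obs, action, next_obs)
        self.update_critic(obs, action, reward, next_obs)
        self.update_actor(obs)

    def update_critic(self, obs, action, reward, next_obs):
        ...  # DDPG critic update (Eq. 1 below); identical for every baseline

    def update_actor(self, obs):
        ...  # DDPG actor update (Eq. 2 below); identical for every baseline
```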
4.1 Backbone RL Algorithm
Since most of the algorithms considered in this work rely on off-policy optimization (and some cannot be optimized on-policy at all), we opt for a state-of-the-art off-policy optimization algorithm. While SAC sac has been the de facto off-policy RL algorithm for many RL methods over the last few years, it is prone to suffering from policy entropy collapse. DrQ-v2 (yarats2021drqv2) recently showed that using DDPG lillicrap15ddpg instead of SAC as the learning algorithm leads to more robust performance on tasks from DMC. For this reason, we opt for DrQ-v2 (yarats2021drqv2) as our base optimization algorithm when learning from images, and DDPG, as implemented in DrQ-v2, when learning from states. DDPG is an actor-critic off-policy algorithm for continuous control tasks. The critic minimizes the Bellman error
$$\mathcal{L}_Q(\phi) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim \mathcal{D}} \Big[ \big( Q_\phi(o_t, a_t) - r_t - \gamma Q_{\bar{\phi}}(o_{t+1}, \pi_\theta(o_{t+1})) \big)^2 \Big], \qquad (1)$$

where $\bar{\phi}$ is an exponential moving average of the critic weights. The deterministic actor $\pi_\theta$ is learned by maximizing the expected returns

$$\max_{\theta} \; \mathbb{E}_{o_t \sim \mathcal{D}} \big[ Q_\phi(o_t, \pi_\theta(o_t)) \big]. \qquad (2)$$
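A minimal PyTorch sketch of these two objectives, using a single critic and single-step targets for clarity (the DrQ-v2 implementation adds refinements such as n-step returns and data augmentation):

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, actor, batch, gamma=0.99):
    """Bellman error of Eq. (1); `target_critic` holds the EMA weights of the critic."""
    obs, action, reward, next_obs = batch
    with torch.no_grad():
        next_action = actor(next_obs)
        target_q = reward + gamma * target_critic(next_obs, next_action)
    return F.mse_loss(critic(obs, action), target_q)

def actor_loss(critic, actor, obs):
    """Deterministic policy-gradient objective of Eq. (2), written as a loss to minimize."""
    return -critic(obs, actor(obs)).mean()
```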
4.2 Unsupervised RL Algorithms
As part of URLB, we open-source code for eight leading or well-known algorithms spanning the three categories described below, all of which utilize the same optimization backbone. The algorithms provided with URLB differ only in their intrinsic reward; all other parts of the RL architecture are kept the same. We list all implemented baselines in Table 1 and provide a brief overview of the algorithms considered, which are binned into three categories: knowledge-based, data-based, and competence-based algorithms. (We borrow this terminology from the unsupervised RL tutorial srinivas_abbeel_2021_icml_tutorial.) For detailed descriptions of each method we refer the reader to Appendix A.
Name | Algo. Type | Intrinsic Reward
---|---|---
ICM pathak2017curiosity | Knowledge | Prediction error of a learned dynamics model
Disagreement Pathak19disagreement | Knowledge | Variance over an ensemble of dynamics models
RND burda2018exploration | Knowledge | Error in predicting a frozen random network's output
APT liu2021unsupervised | Data | Particle-based (kNN) entropy of embeddings
ProtoRL yarats21protorl | Data | Particle-based entropy estimated with learned prototypes
SMM lee2019smm | Competence | State marginal matching with a skill discriminator
DIAYN EysenbachGIL19diayn | Competence | Skill discriminator log-likelihood
APS liu21aps | Competence | Particle-based entropy plus successor-feature skill term
Knowledge-based Baselines: Knowledge-based methods aim to increase knowledge about the world by maximizing prediction error. As part of the knowledge-based suite, we implement the Intrinsic Curiosity Module (ICM) pathak2017curiosity, Disagreement Pathak19disagreement, and Random Network Distillation (RND) burda2018exploration. All three methods utilize a learned function $g$ that either predicts the dynamics (ICM, Disagreement) or predicts the output of a random network (RND) from the embedding $z_t = f_\theta(o_t)$ of the observation. ICM and RND maximize prediction error while Disagreement maximizes prediction uncertainty.
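The three intrinsic rewards can be sketched as follows; `forward_model`, `forward_models` (an ensemble), `predictor_net`, and `target_net` are assumed stand-ins for the learned and frozen networks described above.

```python
import torch

def icm_reward(forward_model, z, action, next_z):
    # ICM: squared error of a learned forward-dynamics prediction.
    return (forward_model(z, action) - next_z).pow(2).mean(dim=-1)

def disagreement_reward(forward_models, z, action):
    # Disagreement: variance across an ensemble of forward-dynamics models.
    preds = torch.stack([m(z, action) for m in forward_models])  # [ensemble, batch, dim]
    return preds.var(dim=0).mean(dim=-1)

def rnd_reward(predictor_net, target_net, z):
    # RND: error in predicting the output of a frozen, randomly initialized network.
    with torch.no_grad():
        target = target_net(z)
    return (predictor_net(z) - target).pow(2).mean(dim=-1)
```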
Data-based Baselines: Data-based methods aim to achieve data diversity by maximizing state entropy. We implement APT liu2021unsupervised and ProtoRL yarats21protorl, both of which maximize entropy in different ways. Both methods use a particle estimator singh03entropy that maximizes entropy by maximizing the distance between each state or observation embedding $z_t$ and its k-nearest neighbors (kNN); a code sketch of this estimator is given at the end of this subsection. Since computing kNN over the entire replay buffer is expensive, APT estimates entropy across transitions in a randomly sampled minibatch. ProtoRL improves on APT by clustering the replay buffer with the contrastive deep clustering algorithm SwAV caron20swav. The centroids of the clusters are called prototypes, which ProtoRL uses to estimate entropy.

Competence-based Baselines: Competence-based algorithms learn an explicit skill vector $w$ by maximizing the mutual information $I(z;w)$ between the encoded observation $z$ and the skill $w$. This mutual information can be decomposed in two ways: $I(z;w) = H(w) - H(w|z) = H(z) - H(z|w)$. We provide baselines for both decompositions. The former decomposition is utilized in skill discovery algorithms such as DIAYN EysenbachGIL19diayn, VIC GregorRW17vic, and VALOR achiam2018valor, which are conceptually similar; for URLB, we implement DIAYN. The latter decomposition, though less common, is implemented in APS liu21aps, which uses a particle estimator for the entropy term and successor features to represent the conditional entropy hansen20visr. Lastly, we implement SMM lee2019smm, which combines both decompositions into one objective. Note that the SMM paper describes both skill-based and skill-free variants, so it can be categorized as both competence-based and data-based.
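Below is the promised sketch of the particle-based entropy reward used by APT-style data-based methods, computed over a minibatch of embeddings; the constants and exact averaging differ across published implementations, so treat this as illustrative.

```python
import torch

def knn_entropy_reward(z, k=12):
    """Particle-based entropy estimate: the reward for each embedding grows with the
    distance to its k nearest neighbors within the sampled minibatch."""
    dists = torch.cdist(z, z)                        # [batch, batch] pairwise distances
    knn_dists, _ = dists.topk(k + 1, largest=False)  # smallest distances, incl. self (0)
    knn_dists = knn_dists[:, 1:]                     # drop the zero self-distance
    return torch.log(1.0 + knn_dists).mean(dim=-1)   # average log-distance as reward
```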
5 Experiments

Figure: Scores are normalized by the asymptotic performance on each task (i.e., DrQ-v2 and DDPG performance after training for 2M steps on pixels and states, respectively), and we show the mean and standard error of each category. Each algorithm is evaluated across ten random seeds. To provide an aggregate view of each algorithm category, the scores are averaged over individual tasks and methods (see Appendix C for detailed results for each algorithm and downstream task). The Random Init baseline represents DrQ-v2 and DDPG trained from a random initialization for 100k steps. Full results can be found in Appendix C.

We evaluate the algorithms listed in Table 1 by pre-training with the intrinsic reward objective and fine-tuning on the downstream task as described in Section 3.2. For DrQ-v2 optimization we fix the hyper-parameters from yarats2021drqv2, and for algorithm-specific hyper-parameters we perform a grid sweep and pick the best-performing values. We benchmark both state and pixel-based experiments and keep all non-algorithm-specific architectural details the same, with a full description available in Appendix B. Performance on each downstream task is evaluated over ten random seeds, and we report mean scores and standard errors. We summarize the main results of our evaluation in Figures 3 and 6, which show evaluation scores grouped by algorithm category (Section 4.2) and environment (Section 3.3). An extensive list of results across all algorithms considered in this work can be found in Appendix C.
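For concreteness, the aggregation used in the figures can be computed as below (NumPy sketch; `asymptotic` denotes the expert DDPG or DrQ-v2 score on the corresponding task).

```python
import numpy as np

def normalized_score(finetune_returns, asymptotic):
    """finetune_returns: per-seed returns for one method on one task (shape [num_seeds])."""
    scores = np.asarray(finetune_returns, dtype=float) / asymptotic
    mean = scores.mean()
    std_err = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error over seeds
    return mean, std_err
```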
By benchmarking a wide array of exploration algorithms on both state and pixel-based tasks we are able to get perspective on the current state of unsupervised RL. Overall, we find that while unsupervised RL shows promise, it is still far from solving the proposed benchmark and many open questions need to be addressed to make progress toward unsupervised pre-training for RL. We note that solving the benchmark means matching the asymptotic DrQ-v2 (for pixels) and DDPG (for states) performance within 100k steps of fine-tuning. The motivation for this definition is that unsupervised RL agents get access to unlimited reward-free environment interactions. After pre-training, we seek to develop agents that adapt quickly to the desired downstream task. We list our observations below:
O1: None of the implemented unsupervised RL algorithms solve the benchmark. Despite access to up to 2M pre-training steps, after 100k steps of fine-tuning no method matches asymptotic performance on most tasks. The best-performing benchmarked algorithms reach only a fraction of the asymptotic normalized return, whereas the benchmark is considered solved when the agent achieves normalized returns close to 1. This suggests that, as a community, we are still far from efficient generalization in deep RL.
O2: Unsupervised RL is not universally better than random initialization. We also observe that fine-tuning an unsupervised RL baseline is not always preferable to fine-tuning from a random initialization. In particular, when learning from states, a random initialization is competitive with most baselines. However, when learning from pixels, fine-tuning from a random initialization degrades, suggesting that representation learning is an important component of unsupervised pre-training.
O3: There exists a large gap in performance between exploring from states and exploring from pixels. Another observation that supports representation learning as an important aspect of exploration is that exploration algorithms degrade substantially when learning from pixels compared to learning from states. As shown in Figure 3, most algorithms lose a large fraction of their normalized return when moving from states to pixels, especially on the harder environments (Quadruped, Jaco Arm). These results suggest that better representation learning during pre-training is an important research direction.
O4: In aggregate, competence-based approaches underperform knowledge-based and data-based approaches. While knowledge-based and data-based approaches both perform competitively across URLB, we find that competence-based approaches are lagging behind. Specifically, there is no competence-based approach that achieves state-of-the-art mean performance on any of the URLB tasks, which points to competence-based unsupervised RL as an impactful research direction with significant room for improvement.
O5: There is not a single leading unsupervised RL algorithm for both states and pixels. We observe that there is no single state-of-the-art algorithm for unsupervised RL. At 2M pre-training steps, APT liu2021unsupervised and ProtoRL yarats21protorl are the leading algorithms for state-based URLB while ICM pathak2017curiosity achieves leading performance on pixel-based URLB despite the existence of more sophisticated knowledge-based methods Pathak19disagreement; burda2018exploration (see Figure 7).
O6: For many unsupervised RL algorithms, performance decays as a function of pre-training steps rather than improving monotonically. We would expect the fine-tuning efficiency of unsupervised RL algorithms to improve as a function of pre-training steps. Surprisingly, we find that for 9 out of 18 experiments shown in Figure 6, performance either does not improve or even degrades as a function of pre-training steps. We see this as potentially the biggest drawback of current unsupervised RL approaches: they do not scale with the number of environment interactions. Developing algorithms that improve monotonically as a function of pre-training steps is an open and impactful line of research.
O7: New fine-tuning strategies will likely be needed for fast adaptation. While not investigated in depth in this benchmark, new fine-tuning strategies could play a large role in the adoption of unsupervised RL. Perhaps part of the issue raised in O6 could be addressed with better fine-tuning. The algorithms in URLB are all fine-tuned by initializing the actor-critic with the pre-trained weights and fine-tuning with an extrinsic reward. There are likely better strategies for fine-tuning, particularly for competence-based approaches that are conditioned on the skill vector $w$.
6 Related work
Deep Reinforcement Learning Benchmarks. Part of the accelerated progress in deep RL over the last few years has been due to the existence of stable benchmarks. Specifically, the Atari Arcade Learning Environment bellemare2013arcade, the OpenAI gym brockman2016openai, and more recently the DeepMind Control (DMC) Suite tassa2018deepmind have become standard benchmarks for evaluating supervised RL agents across both state and pixel-based observation spaces and discrete and continuous action spaces. Open-sourcing code for algorithms has been another factor that accelerated progress in deep RL. For instance, duan2016benchmarking not only presented a benchmark for continuous control but also provided baselines for common supervised RL algorithms, which led to the development of the widely used OpenAI gym benchmark brockman2016openai and baselines baselines. The combination of challenging yet feasible benchmarks and open-sourced code was an important component in the discovery of many widely adopted RL algorithms sac; mnih2015human; trpo; schulman2015high; ppo.
In addition to Atari, OpenAI gym, and DeepMind Control, there have been many other benchmarks designed to study different aspects of supervised RL. DeepMind Lab beattie2016deepmind benchmarks 3D navigation from pixels, ProcGen cobbe2019quantifying; cobbe2020leveraging measures generalization of supervised agents in procedurally generated environments, D4RL fu2020d4rl and RL Unplugged gulcehre2020rl benchmark the performance of offline RL methods, B-Pref lee2021bpref benchmarks the performance of preference-based RL methods, Metaworld yu2020meta measures the performance of multi-task and meta-RL algorithms, and SafetyGym ray2019benchmarking measures how RL agents can achieve tasks under safety constraints. However, while the existing benchmarks are suitable for supervised RL algorithms, there is no comparable benchmark or collection of easy-to-use baseline algorithms for unsupervised RL, which is our primary motivation for introducing URLB to accelerate progress in unsupervised RL.
Unsupervised Reinforcement Learning. While investigations into unsupervised deep RL appeared shortly after the landmark DQN mnih2015human, the field has experienced accelerated progress over the last year, in part due to advances in unsupervised representation learning in CV chen2020simclr; he2020moco; henaff2019cpcv2 and NLP brown2020language; devlin2018bert; radford2019language as well as the development of stable RL optimization algorithms sac; hessel18rainbow; lillicrap15ddpg; ppo. However, unlike CV and NLP, which focus solely on unsupervised representation learning, unsupervised RL requires both unsupervised representation learning and unsupervised behavioral learning.
Unsupervised Representation Learning for Deep RL: In order for an RL algorithm to learn a policy it must first have a good representation of the state. When working with coordinate state, the representation is supplied by the human task designer, but when operating from image observations $o_t$, we must first transform the observations into latent vectors $z_t$. This transformation comprises the study of representation learning for RL. One of the first seminal works on unsupervised representation learning for RL showed that unsupervised auxiliary tasks improve the performance of supervised RL jaderberg17unreal. Over the last two years, a series of works on unsupervised representation learning for RL with world models hafner2018learning; hafner2019dream, contrastive learning laskin2020curl; stooke2020decoupling; yarats21protorl; yarats2019improving, and data augmentation laskin2020reinforcement; yarats2021drqv2; yarats2021image have dramatically improved learning efficiency from pixels. On many tasks from the DMC suite, learning from pixels is now as data-efficient as learning from state laskin2020curl.

Unsupervised Behavioral Learning for Deep RL: One caveat is that the above algorithms are not fully unsupervised, since they still optimize an extrinsic reward, only with an auxiliary unsupervised loss. Fully unsupervised RL also requires unsupervised learning of behaviors, which is typically achieved by optimizing an intrinsic reward oudeyer2007intrinsic. Given that representation learning is already heavily benchmarked for RL hafner2018learning; laskin2020curl; yarats2021image, URLB focuses mostly on unsupervised behavior learning. Many recent algorithms have been proposed for intrinsic behavioral learning, including prediction methods burda2018largescale; pathak2017curiosity; Pathak19disagreement, maximal entropy-based methods campos2021beyond; liu2021unsupervised; liu21aps; mutti2020policy; seo21re3; yarats21protorl, and maximal mutual information-based methods EysenbachGIL19diayn; hansen20visr; liu21aps; SharmaGLKH20dads. However, these methods use different pre-training and evaluation procedures, different optimization algorithms, and different environments. To make fully unsupervised RL algorithm comparisons transparent and easier to develop, we introduce URLB.
7 Conclusion
We presented URLB, a benchmark designed to measure the performance of unsupervised RL algorithms. URLB consists of a suite of twelve evaluation tasks of varying difficulty from three domains and standardized procedures for pre-training and evaluation. We’ve open-sourced implementations and evaluation scores for eight leading unsupervised RL algorithms from all major algorithm categories. To minimize confounding factors, we utilized the same optimization method across all baselines. While none of the implemented baselines solve URLB, many make substantial progress suggesting a number of fruitful directions for unsupervised RL research. We hope that this benchmark makes the development and comparison of unsupervised RL algorithms easier and clearer.
Limitations. There are a number of limitations of both URLB and unsupervised RL methods in general. While URLB tasks are designed to be challenging, they are far from the visual and combinatorial complexity of real-world robotics. However, existing algorithms are unable to solve the benchmark, meaning there is substantial room for improvement on the URLB tasks before moving on to even more challenging ones. While we present standardized pre-training and evaluation procedures, there can be many other ways of measuring the quality of an exploration algorithm. For instance, the quality of pre-training can be evaluated not only through policy adaptation but also through dataset diversity, which we do not consider in this paper. In this work, similar to the Atari bellemare2013arcade and DMC tassa2018deepmind benchmarks for supervised RL, we do not consider goal-conditioned RL, which can be quite powerful for exploration ecoffet2020goexplore. For generality, we chose the currently most common evaluation procedure, which allowed us to benchmark a diverse set of leading exploration algorithms, but of course other choices are available and would be interesting to investigate in future work.
Potential negative impacts. Unsupervised RL has the benefit of requiring zero extrinsic-reward interactions during pre-training, but as a consequence the resulting agents may develop policies that are not aligned with human intent. This could be problematic in the long term if not addressed early and carefully, because as unsupervised robotic agents become more capable they may inadvertently inflict harm on themselves or their environment. Developing methods for constraining exploration within a broad set of human preferences (e.g., exploring without harming the environment) is an interesting and important direction for future research in order to produce safe agents.
Acknowledgements
This work was partially supported by Berkeley DeepDrive, BAIR, the Berkeley Center for Human-Compatible AI, the Office of Naval Research grant N00014-21-1-2769, and DARPA through the Machine Common Sense Program.
References
Checklist
- For all authors...
  - Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
  - Did you describe the limitations of your work? See Section 7.
  - Did you discuss any potential negative societal impacts of your work? See Section 7.
  - Have you read the ethics review guidelines and ensured that your paper conforms to them?
- If you are including theoretical results...
  - Did you state the full set of assumptions of all theoretical results?
  - Did you include complete proofs of all theoretical results?
- If you ran experiments (e.g., for benchmarks)...
  - Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See Section 5 and the appendix for hyperparameters. You can access the code with full instructions in the supplementary materials or using this link: https://github.com/rll-research/url_benchmark.
  - Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 5 and the supplementary material.
  - Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Section F.
- If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
  - If your work uses existing assets, did you cite the creators?
  - Did you mention the license of the assets?
  - Did you include any new assets either in the supplemental material or as a URL? See the custom_dmc_tasks folder in the supplementary materials codebase.
  - Did you discuss whether and how consent was obtained from people whose data you're using/curating?
  - Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
- If you used crowdsourcing or conducted research with human subjects...
  - Did you include the full text of instructions given to participants and screenshots, if applicable?
  - Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
  - Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Appendix:
Unsupervised Reinforcement Learning Benchmark
Appendix A Unsupervised Reinforcement Learning Baselines
A.1 Knowledge-based Baselines
Prediction methods train a forward dynamics model and define a self-supervised task based on the outputs of the model prediction.
Curiosity pathak2017curiosity: The Intrinsic Curiosity Module (ICM) defines the self-supervised task as the error between the next-state prediction of a learned dynamics model and the observed next embedding. The intuition is that parts of the state space that are hard to predict are good to explore because they are likely not to have been seen before. An issue with Curiosity is that it is susceptible to the noisy TV problem, wherein stochastic elements of the environment always cause high prediction error while not being informative for exploration. The intrinsic reward is the forward prediction error
$$r_t = \| g(z_t, a_t) - z_{t+1} \|_2^2,$$
where $g$ is the learned dynamics model.
Disagreement Pathak19disagreement: Disagreement is similar to ICM but instead trains an ensemble of forward models and defines the intrinsic reward as the variance (or disagreement) among the models. Disagreement has the favorable property of not being susceptible to the noisy TV problem, since high stochasticity in the environment results in high prediction error but low variance once it has been thoroughly explored. The intrinsic reward is
$$r_t = \mathrm{Var}\big\{ g_i(z_t, a_t) \big\}_{i=1}^{N},$$
the variance over an ensemble of $N$ forward models $g_i$.
RND burda2018exploration: Random Network Distillation (RND) defines the self-supervised task as predicting the output of a frozen, randomly initialized neural network $\tilde{g}$. This differs from ICM only in that, instead of predicting the next state, which is effectively an environment-defined function, it tries to predict the vector output of a randomly defined function. Similar to ICM, RND can suffer from the noisy TV problem. The intrinsic reward is
$$r_t = \| g(z_t) - \tilde{g}(z_t) \|_2^2,$$
where $g$ is the learned predictor network.

A.2 Data-based Baselines
Recently, exploration through state entropy maximization has resulted in simple yet effective algorithms for unsupervised pre-training. We implement two leading variants of this approach for URLB.
APT liu2021unsupervised: Active Pre-training (APT) utilizes a particle-based estimator singh03entropy that uses k nearest neighbors to estimate entropy for a given state or image embedding. Since APT does not itself perform representation learning, it requires an auxiliary representation learning loss to provide latent vectors for entropy estimation, although it is also possible to use random network embeddings seo21re3. We provide implementations of APT with forward and inverse dynamics representation learning losses. The intrinsic reward is proportional to the log-distance of each embedding to its k nearest neighbors in a minibatch,
$$r_t \propto \sum_{z_j \in \mathrm{kNN}(z_t)} \log \| z_t - z_j \|_2.$$
ProtoRL yarats21protorl: ProtoRL devises a self-supervised pre-training scheme that decouples representation learning and exploration to enable efficient downstream generalization to previously unseen tasks. For this, ProtoRL uses the contrastive clustering assignment loss from SwAV (caron20swav) and learns latent representations together with a set of prototypes that form the basis of the latent space. The prototypes are then used for a more accurate estimate of the entropy of the state-visitation distribution via the same kNN particle-based estimator as in APT.
A.3 Competence-based Baselines
Competence-based approaches learn skills $w$ that maximize the mutual information $I(z;w)$ between encoded observations (or states) $z$ and skills $w$. The mutual information has two decompositions: $I(z;w) = H(w) - H(w|z) = H(z) - H(z|w)$. We provide baselines for both decompositions.
SMM lee2019smm: SMM minimizes the KL divergence $D_{\mathrm{KL}}(p_\pi(s)\,\|\,p^*(s))$ between the policy's state distribution and a target state distribution, which maximizes the state entropy while minimizing the cross entropy from the state distribution to the target distribution. When using skills, the state entropy can be rewritten as $H[s] = H[s|w] + I(s;w)$. The conditional entropy $H[s|w]$ is maximized by optimizing the reward $-\log q_\theta(s|w)$, where the density $q_\theta$ is estimated using a VAE kingma2013auto that models the distribution of states visited while executing skill $w$. Similar to other mutual information methods that decompose $I(s;w) = H(w) - H(w|s)$, SMM learns a discriminator $d(w|s)$ over a set of discrete skills, with a uniform prior over $w$ that maximizes $H(w)$.
DIAYN EysenbachGIL19diayn: DIAYN and similar algorithms such as VIC GregorRW17vic and VALOR achiam2018valor are perhaps the best-known competence-based exploration algorithms. These methods estimate the mutual information through the first decomposition, $I(z;w) = H(w) - H(w|z)$. $H(w)$ is kept maximal by drawing $w$ from a discrete uniform prior distribution, and the density $p(w|z)$ is estimated with a learned discriminator $q(w|z)$, which defines the intrinsic reward.
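A minimal sketch of a DIAYN-style skill reward with discrete one-hot skills; the discriminator network and the choice to subtract the constant uniform-prior term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def diayn_reward(discriminator, z, skill_idx, num_skills=16):
    """Reward = log q(w | z) - log p(w), with a uniform prior over discrete skills."""
    logits = discriminator(z)                               # [batch, num_skills]
    log_q = F.log_softmax(logits, dim=-1)
    log_q_w = log_q.gather(1, skill_idx.unsqueeze(-1)).squeeze(-1)
    log_p_w = -torch.log(torch.tensor(float(num_skills)))   # log of the uniform prior
    return log_q_w - log_p_w
```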
APS liu21aps: APS is a recent leading mutual information exploration method that uses the second decomposition, $I(z;w) = H(z) - H(z|w)$. The entropy term $H(z)$ is estimated with a particle estimator as in APT liu2021unsupervised, while the conditional term $H(z|w)$ is estimated with successor features as in VISR hansen20visr. (In this benchmark, the generalized policy improvement (GPI) barreto2018transfer that is used in Atari experiments for APS and VISR is not implemented for the continuous control experiments.)
Appendix B Hyper-parameters
In Table 2 we present a common set of hyper-parameters used in our experiments, while in Table 3 we list individual hyper-parameters for each method.
Common hyper-parameter | Value |
---|---|
Replay buffer capacity | |
Action repeat | states-based and for pixels-based |
Seed frames | |
n-step returns |
Mini-batch size | states-based and for pixels-based |
Seed frames | |
Discount () | |
Optimizer | Adam |
Learning rate | |
Agent update frequency | |
Critic target EMA rate () | |
Features dim. | states-based and for pixels-based |
Hidden dim. | |
Exploration stddev clip | |
Exploration stddev value | |
Number pre-training frames | up to |
Number fine-tuning frames
ICM hyper-parameter | Value |
---|---|
Representation dim. | |
Reward transformation | |
Forward net arch. | ReLU MLP |
Inverse net arch. | ReLU MLP |
Disagreement hyper-parameter | Value |
Ensemble size | |
Forward net arch: | ReLU MLP |
RND hyper-parameter | Value |
Representation dim. | |
Predictor & target net arch. | ReLU MLP |
Normalized observation clipping | 5 |
APT hyper-parameter | Value |
Representation dim. | |
Reward transformation | |
Forward net arch. | ReLU MLP |
Inverse net arch. | ReLU MLP |
in | |
Avg top in | True |
ProtoRL hyper-parameter | Value |
Predictor dim. | |
Projector dim. | |
Number of prototypes | |
Softmax temperature | |
in | |
Number of candidates per prototype | |
Encoder target EMA rate () | |
SMM hyper-parameter | Value |
Skill dim. | |
Skill discrim lr | |
VAE lr | |
DIAYN hyper-parameter | Value |
Skill dim | 16 |
Skill sampling frequency (steps) | 50 |
Discriminator net arch. | ReLU MLP |
APS hyper-parameter | Value |
Representation dim. | |
Reward transformation | |
Successor feature dim. | |
Successor feature net arch. | ReLU MLP |
in | |
Avg top in | True |
Least square batch size |
Appendix C Per-domain Individual Results
Individual fine-tuning results for each method are shown in Figure 7. Furthermore, Figures 8 and 9 show individual results of state- and pixel-based fine-tuning performance as a function of pre-training steps for each considered method and task.



Appendix D Finetuning Learning Curves
We provide finetuning learning curves for agents pre-trained for 2M steps with intrinsic rewards.

Appendix E Individual Numerical Results
The individual numerical results of fine-tuning for each task and each method are presented in Table 4 for states-based learning, and in Table 5 for pixels-based learning.
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG (DrQ-v2) | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 53827 | 53519 | 55920 | 58121 | 58028 | 52316 | 38836 | 35222 | 63833 |
Run | 32525 | 38424 | 43724 | 43733 | 42428 | 25034 | 24428 | 25926 | 42826 | |
Stand | 89923 | 9445 | 9376 | 9476 | 9259 | 92624 | 73861 | 78468 | 87234 | |
Walk | 74847 | 80546 | 91114 | 85732 | 88819 | 83131 | 59253 | 58425 | 73170 | |
Quadruped | Jump | 23648 | 29135 | 26145 | 38357 | 33448 | 22033 | 38656 | 26734 | 58957 |
Run | 15731 | 19531 | 19840 | 20320 | 16127 | 13821 | 22433 | 17926 | 42049 | |
Stand | 39273 | 39059 | 42076 | 44617 | 55957 | 42582 | 43086 | 35055 | 66264 | |
Walk | 22957 | 18520 | 26545 | 22926 | 17329 | 14123 | 22747 | 19329 | 66456 | |
Jaco | Reach bottom left | 7222 | 11724 | 10020 | 12119 | 12419 | 8620 | 6413 | 6415 | 1569 |
Reach bottom right | 11718 | 15510 | 1796 | 1618 | 14114 | 8221 | 6817 | 448 | 16410 | |
Reach top left | 11622 | 15224 | 14314 | 14115 | 13623 | 11019 | 336 | 266 | 15313 | |
Reach top right | 9418 | 15915 | 15915 | 1689 | 1756 | 11622 | 4712 | 5912 | 1867 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 53827 | 55427 | 56842 | 68546 | 59429 | 50119 | 47230 | 38024 | 63736 |
Run | 32525 | 41618 | 48525 | 49921 | 41027 | 22818 | 32819 | 24119 | 33725 | |
Stand | 89923 | 9307 | 9408 | 9465 | 9305 | 92517 | 90618 | 76244 | 86927 | |
Walk | 74847 | 84630 | 9235 | 86923 | 82640 | 86516 | 79143 | 63234 | 77858 | |
Quadruped | Jump | 23648 | 25241 | 45245 | 54251 | 28248 | 22532 | 38756 | 35059 | 49355 |
Run | 15731 | 18442 | 36828 | 37728 | 18222 | 15329 | 20526 | 25824 | 34739 | |
Stand | 39273 | 42249 | 64953 | 72249 | 47080 | 43363 | 49978 | 45954 | 74346 | |
Walk | 22957 | 23743 | 41267 | 49868 | 21729 | 20947 | 23827 | 21824 | 55374 | |
Jaco | Reach bottom left | 7222 | 9416 | 14513 | 11311 | 12315 | 10618 | 6110 | 389 | 1346 |
Reach bottom right | 11718 | 11915 | 13615 | 1446 | 13613 | 11519 | 8210 | 6311 | 1318 | |
Reach top left | 11622 | 12518 | 1659 | 12116 | 11818 | 12223 | 579 | 2910 | 12411 | |
Reach top right | 9418 | 1518 | 1817 | 1508 | 1708 | 12022 | 6010 | 436 | 10610 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 53827 | 52420 | 58636 | 61027 | 50522 | 48017 | 48617 | 33823 | 53124 |
Run | 32525 | 34430 | 48823 | 48223 | 37322 | 25429 | 33240 | 24924 | 35231 | |
Stand | 89923 | 92215 | 91911 | 9465 | 91620 | 90525 | 90318 | 87034 | 84628 | |
Walk | 74847 | 84527 | 88019 | 87324 | 82135 | 84829 | 74651 | 55333 | 80861 | |
Quadruped | Jump | 23648 | 30642 | 59539 | 61553 | 40053 | 28752 | 34963 | 36515 | 41546 |
Run | 15731 | 15724 | 44439 | 44442 | 23735 | 20632 | 28040 | 34325 | 40048 | |
Stand | 39273 | 42879 | 73651 | 76379 | 52658 | 43662 | 39136 | 52952 | 71257 | |
Walk | 22957 | 14024 | 72946 | 64470 | 24633 | 26664 | 31263 | 52576 | 50584 | |
Jaco | Reach bottom left | 7222 | 1149 | 1449 | 11410 | 12510 | 12222 | 588 | 4313 | 878 |
Reach bottom right | 11718 | 12610 | 12910 | 10613 | 12812 | 11320 | 629 | 346 | 1099 | |
Reach top left | 11622 | 14611 | 15610 | 13613 | 1105 | 11419 | 617 | 122 | 10813 | |
Reach top right | 9418 | 14310 | 15910 | 13210 | 14911 | 12321 | 619 | 319 | 1019 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 53827 | 51425 | 49121 | 51517 | 47716 | 48023 | 50526 | 38117 | 46124 |
Run | 32525 | 38830 | 44421 | 43934 | 34428 | 20015 | 43026 | 24211 | 25727 | |
Stand | 89923 | 91312 | 90715 | 9239 | 9148 | 87023 | 87734 | 86026 | 83554 | |
Walk | 74847 | 71331 | 78233 | 82829 | 75935 | 77733 | 82136 | 66126 | 71168 | |
Quadruped | Jump | 23648 | 20533 | 66824 | 59033 | 46248 | 42563 | 29839 | 57846 | 53842 |
Run | 15731 | 13320 | 46112 | 46223 | 33940 | 31636 | 22037 | 41528 | 46537 | |
Stand | 39273 | 32958 | 84033 | 80450 | 62257 | 56071 | 36742 | 70648 | 71450 | |
Walk | 22957 | 14331 | 72156 | 82619 | 43464 | 40391 | 18426 | 40664 | 60286 | |
Jaco | Reach bottom left | 7222 | 1068 | 1348 | 10112 | 8812 | 12122 | 409 | 175 | 9613 |
Reach bottom right | 11718 | 1199 | 1224 | 10010 | 11512 | 11316 | 509 | 314 | 939 | |
Reach top left | 11622 | 11912 | 11714 | 11110 | 11211 | 12420 | 507 | 113 | 6510 | |
Reach top right | 9418 | 1379 | 1407 | 14010 | 1365 | 13519 | 378 | 194 | 8111 |
Pre-training for frames | | | | | | | | | |
Domain | Task | DrQ-v2 | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 8123 | 25254 | 8033 | 21438 | 251 | 29333 | 241 | 13224 | 387 |
Run | 4111 | 11021 | 5714 | 7813 | 251 | 13513 | 221 | 506 | 261 | |
Stand | 21228 | 31567 | 25062 | 26134 | 1629 | 35367 | 1338 | 23322 | 1629 | |
Walk | 14153 | 30245 | 19268 | 26343 | 4316 | 32052 | 231 | 13825 | 292 | |
Quadruped | Jump | 27835 | 22640 | 17315 | 22330 | 16024 | 24633 | 21125 | 20424 | 18232 |
Run | 15621 | 15613 | 11212 | 14517 | 13421 | 15627 | 14818 | 17323 | 13324 | |
Stand | 30947 | 32949 | 25931 | 35043 | 26637 | 34235 | 29736 | 35048 | 26548 | |
Walk | 15131 | 16010 | 13424 | 15416 | 11917 | 16824 | 14918 | 15723 | 16127 | |
Jaco | Reach bottom left | 2310 | 187 | 124 | 417 | 00 | 3811 | 11 | 124 | 00 |
Reach bottom right | 238 | 3012 | 238 | 578 | 00 | 379 | 10 | 103 | 00 | |
Reach top left | 409 | 3111 | 309 | 669 | 00 | 5914 | 21 | 194 | 21 | |
Reach top right | 379 | 3713 | 228 | 487 | 32 | 4516 | 43 | 248 | 42 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 8123 | 26039 | 36016 | 22237 | 282 | 21044 | 251 | 11718 | 322 |
Run | 4111 | 11015 | 13119 | 10313 | 261 | 8516 | 231 | 474 | 271 | |
Stand | 21228 | 49962 | 39865 | 28926 | 15511 | 35561 | 1399 | 24316 | 1619 | |
Walk | 14153 | 30551 | 34846 | 25833 | 3710 | 25054 | 231 | 12519 | 4819 | |
Quadruped | Jump | 27835 | 28650 | 21424 | 36644 | 14726 | 22942 | 20120 | 24828 | 21227 |
Run | 15621 | 19829 | 15319 | 26138 | 11220 | 14427 | 13816 | 19724 | 17823 | |
Stand | 30947 | 39869 | 29835 | 45347 | 22942 | 35567 | 27926 | 31331 | 28151 | |
Walk | 15131 | 19331 | 12920 | 20630 | 11120 | 15725 | 13913 | 14019 | 14124 | |
Jaco | Reach bottom left | 2310 | 6520 | 308 | 478 | 00 | 3114 | 11 | 123 | 00 |
Reach bottom right | 238 | 5616 | 3410 | 526 | 00 | 3511 | 10 | 72 | 00 | |
Reach top left | 409 | 8722 | 4711 | 558 | 11 | 3515 | 21 | 204 | 21 | |
Reach top right | 379 | 6817 | 335 | 618 | 21 | 4214 | 43 | 215 | 11 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 8123 | 25649 | 30534 | 25029 | 4617 | 24437 | 261 | 12626 | 4212 |
Run | 4111 | 11616 | 15512 | 9613 | 261 | 8411 | 241 | 474 | 303 | |
Stand | 21228 | 53473 | 56537 | 37437 | 15013 | 48063 | 1456 | 25119 | 17010 | |
Walk | 14153 | 28545 | 43329 | 31641 | 369 | 25839 | 251 | 13725 | 5425 | |
Quadruped | Jump | 27835 | 34547 | 19931 | 36838 | 15627 | 23750 | 20120 | 31938 | 18429 |
Run | 15621 | 17911 | 14024 | 29736 | 11521 | 10816 | 13913 | 16517 | 15522 | |
Stand | 30947 | 43044 | 25735 | 55951 | 22942 | 33871 | 27926 | 31927 | 27548 | |
Walk | 15131 | 25125 | 10419 | 27419 | 11521 | 15228 | 13913 | 21320 | 14623 | |
Jaco | Reach bottom left | 2310 | 6519 | 4210 | 469 | 00 | 3613 | 11 | 83 | 00 |
Reach bottom right | 238 | 8823 | 5811 | 449 | 00 | 4310 | 10 | 41 | 00 | |
Reach top left | 409 | 7619 | 8919 | 597 | 64 | 4110 | 21 | 204 | 20 | |
Reach top right | 379 | 8724 | 4912 | 477 | 21 | 4712 | 43 | 226 | 51 | |
Pre-training for frames | | | | | | | | | |
Domain | Task | DDPG | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS |
Walker | Flip | 8123 | 23134 | 33916 | 28031 | 282 | 22327 | 261 | 11421 | 389 |
Run | 4111 | 9811 | 1549 | 13315 | 252 | 8718 | 241 | 453 | 303 | |
Stand | 21228 | 40140 | 55292 | 38949 | 15511 | 46769 | 1456 | 29871 | 17210 | |
Walk | 14153 | 27444 | 42436 | 32143 | 358 | 29748 | 251 | 13219 | 375 | |
Quadruped | Jump | 27835 | 31218 | 19421 | 38328 | 16427 | 19735 | 20120 | 26222 | 19929 |
Run | 15621 | 24921 | 14325 | 28418 | 12120 | 13735 | 13913 | 19018 | 15624 | |
Stand | 30947 | 50640 | 30535 | 56143 | 24341 | 29056 | 27926 | 42640 | 33143 | |
Walk | 15131 | 23115 | 14510 | 29420 | 12221 | 13835 | 13913 | 18423 | 14624 | |
Jaco | Reach bottom left | 2310 | 7220 | 10618 | 393 | 00 | 215 | 11 | 72 | 10 |
Reach bottom right | 238 | 5819 | 9015 | 479 | 00 | 287 | 10 | 93 | 11 | |
Reach top left | 409 | 8922 | 12721 | 606 | 00 | 4716 | 21 | 112 | 21 | |
Reach top right | 379 | 6918 | 11823 | 7611 | 11 | 5212 | 43 | 163 | 103 |
Appendix F Compute Resources
URLB is designed to be accessible to the RL research community. Both state and pixel-based algorithms are implemented such that each algorithm requires a single GPU. For local debugging experiments we used NVIDIA RTX GPUs. For the large-scale runs used to generate all results in this manuscript, we used NVIDIA Tesla V100 GPU instances. All experiments were run on internal clusters. Each algorithm trains in roughly 30 minutes to 12 hours depending on the snapshot (100k, 500k, 1M, 2M) and input (states, pixels). Since this benchmark required roughly 8k experiments (2 input modalities (states/pixels), 12 tasks, 8 algorithms, 10 seeds, 4 snapshots), a total of 100 V100 GPUs were used to produce the results in this benchmark. Researchers who wish to build on URLB will, of course, not need to run this many experiments, since they can utilize the results presented in this benchmark.
Appendix G Intuition on Why Competence-based Approaches Underperform on URLB
Across the three categories of methods (data-based, knowledge-based, and competence-based), the best data-based and knowledge-based methods are competitive with one another. For instance, RND (a leading knowledge-based method) and ProtoRL (a leading data-based method) achieve similar fine-tuning scores. Both maximize data diversity in different ways: one through maximizing prediction error and the other through entropy maximization.
On the other hand, competence-based methods as a whole do much worse than data-based and knowledge-based ones. We hypothesize that this is due to current competence-based methods only supporting small skill spaces. Competence-based methods maximize a variational lower bound to the mutual information of the form
$$I(s;w) \geq H(w) + \mathbb{E}_{p(s,w)}\big[\log q_\theta(w \mid s)\big],$$
where $q_\theta(w|s)$ is called the discriminator (a short derivation of this bound is given at the end of this appendix). The discriminator can be interpreted as a classifier from states $s$ to skills $w$ (or vice versa, depending on how you decompose $I(s;w)$). In order to have an accurate discriminator, the skill space is chosen to be small in practice (DIAYN: $w$ is a 16-dimensional one-hot vector; SMM: $w$ is 4-dimensional continuous; APS: $w$ is 10-dimensional continuous).

OpenAI gym environments for continuous control mask this limitation because they terminate if the agent falls over and hence leak extrinsic signal about the downstream task into the environment. This means that the agent learns only useful behaviors that keep it balanced, and therefore a small skill vector is sufficient for classifying these behaviors. However, in DeepMind Control (and hence URLB) the episodes have a fixed length, and therefore the set of possible behaviors is much larger. If the skill space is too small, the most likely skills to be classified are different configurations of the agent lying on the ground. We hypothesize that building more powerful discriminators would improve competence-based exploration.
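As referenced above, the variational lower bound follows from the non-negativity of the KL divergence between the true posterior $p(w|s)$ and the discriminator $q_\theta(w|s)$:

```latex
\begin{aligned}
I(s;w) &= H(w) - H(w \mid s) \\
       &= H(w) + \mathbb{E}_{p(s,w)}\big[\log p(w \mid s)\big] \\
       &= H(w) + \mathbb{E}_{p(s,w)}\big[\log q_\theta(w \mid s)\big]
              + \mathbb{E}_{p(s)}\Big[D_{\mathrm{KL}}\big(p(w \mid s)\,\|\,q_\theta(w \mid s)\big)\Big] \\
       &\geq H(w) + \mathbb{E}_{p(s,w)}\big[\log q_\theta(w \mid s)\big].
\end{aligned}
```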