URLB: Unsupervised Reinforcement Learning Benchmark

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Yet training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result in efficient adaptation. However, these algorithms have been hard to compare and develop due to the lack of a unified benchmark. To this end, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards. Building on the DeepMind Control Suite, we provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods. We find that the implemented baselines make progress but are not able to solve URLB and propose directions for future research.


1 Introduction

Deep Reinforcement Learning (RL) has been at the source of a number of breakthroughs in autonomous control over the last five years. RL algorithms have been used to train agents to play Atari video games directly from pixels mnih2015human; mnih2016a2c, learn robotic locomotion trpo; schulman2015high; ppo and manipulation akkaya2019solving policies from raw sensory input, master the game of Go alphago; alphazero, and play large-scale multiplayer video games berner2019dota; alphastar. While these results were significant advances in autonomous decision making, a deeper look reveals a fundamental limitation. The above algorithms produced agents capable of only solving the single task they were trained to solve. As a result, current RL approaches produce brittle policies with poor generalization capabilities cobbe2020leveraging, which limits their applicability to many problems of interest Gleave2020Adversarial. It is therefore important to move beyond today’s powerful but narrow RL systems toward generalist systems capable of quickly adapting to new downstream tasks.

In contrast, in the fields of Computer Vision (CV) and Natural Language Processing (NLP), large-scale unsupervised pre-training has enabled sample-efficient few-shot adaptation. In NLP, unsupervised sequence modeling has produced powerful few-shot learners brown2020language; devlin2018bert; radford2019language. In CV, unsupervised representation learning techniques such as contrastive learning have produced algorithms that are dramatically more label-efficient than their supervised counterparts chen2020simclr; he2020moco; henaff2019cpcv2; grill2020byol and more capable of adapting to a host of downstream supervised tasks such as classification, segmentation, and object detection. While these advances in unsupervised learning have also benefited RL in terms of learning efficiently from images laskin2020reinforcement; laskin2020curl; schwarzer2021dataefficient; stooke2020decoupling; yarats2021image as well as introducing new architectures for RL chen_lu2021dt; janner2021tto, the resulting agents have remained narrow since they still optimize a single extrinsic reward as before.

Figure 1: Unlike supervised RL, which requires reward interaction at every step, unsupervised RL has two phases: (i) reward-free pre-training and (ii) fine-tuning to an extrinsic reward. During phase (i) the agent explores through reward-free interaction with the environment; the quality of exploration depends on the intrinsic reward that the agent sets for itself. During phase (ii) the quality of pre-training is evaluated by its adaptation efficiency to a downstream task.

Fully unsupervised training of RL algorithms requires not only learning self-supervised representations but also learning policies without access to extrinsic rewards. Recently, unsupervised RL algorithms have begun to show progress toward more generalist systems by training policies without extrinsic rewards. Exploration with self-supervised prediction has enabled agents to explore video games from pixels pathak2017curiosity; Pathak19disagreement, mutual information-based approaches have demonstrated self-supervised skill discovery and generalization to downstream tasks in continuous control domains EysenbachGIL19diayn; hansen20visr; liu21aps; SharmaGLKH20dads, and maximal entropy RL has yielded policies capable of diverse exploration liu2021unsupervised; seo21re3; yarats21protorl. However, comparing and developing new algorithms has been challenging due to the lack of a unified evaluation benchmark. Reward-free RL algorithms often use different optimization schemes, evaluate on different tasks, and follow different evaluation procedures. Additionally, unlike for the more mature supervised RL algorithms sac; hessel18rainbow; ppo, there does not exist a unified codebase for unsupervised RL that can be used to develop new methods quickly.

To make benchmarking and developing new unsupervised RL approaches easier, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). Built on top of the widely adopted DeepMind Control Suite tassa2018deepmind, URLB provides a suite of domains of varying difficulty for unsupervised pre-training with diverse downstream evaluation tasks. URLB standardizes the evaluation of unsupervised RL algorithms by defining fixed pre-training and fine-tuning procedures across all baselines. Perhaps most importantly, we open-source code for the URLB environments as well as eight leading baselines that represent the main approaches taken towards unsupervised pre-training in RL to date. Unlike prior code releases for unsupervised RL, URLB uses the exact same optimization algorithm for each baseline, which enables transparent benchmarking and lowers the barrier to entry for developing new algorithms. We summarize the main contributions of this paper below:

  1. We introduce URLB, a new benchmark for evaluating unsupervised RL algorithms, which consists of three domains and twelve continuous control tasks of varying difficulty to evaluate the adaptation efficiency of unsupervised RL algorithms.

  2. We open-source a unified codebase for eight leading unsupervised RL algorithms. Each algorithm is trained with the same optimization backbone for fairness of comparison.

  3. We find that while the implemented baselines make progress on the proposed benchmark, no existing unsupervised RL algorithm can solve URLB, and consequently identify promising research directions to progress unsupervised RL.

The benchmark environments, algorithmic baselines, and pre-training and evaluation scripts are available at https://github.com/rll-research/url_benchmark. We believe that URLB will make the development of unsupervised RL agents easier and more transparent by providing a unified set of evaluation environments, systematic procedures for pre-training and evaluation, and algorithmic baselines that share the same optimization backbone.

2 Preliminaries and Notation

Markov Decision Process:

We consider the typical Reinforcement Learning setting where an agent's interaction with the environment is modeled through a Markov Decision Process (MDP) sutton2018reinforcement. In this work, we benchmark unsupervised RL algorithms in both fully observable MDPs, where the agent learns from coordinate state, and partially observable MDPs (POMDPs), where the agent learns from partially observable image observations. For simplicity we refer to both image and state-based observations as $o_t$. At every timestep $t$, the agent sees an observation $o_t$ and selects an action $a_t$ based on its policy $\pi(a_t|o_t)$. The agent then sees the next observation $o_{t+1}$ and a reward: either an extrinsic reward $r^{\text{ext}}_t$ provided by the environment (supervised RL) or an intrinsic reward $r^{\text{int}}_t$ defined through a self-supervised objective (unsupervised RL). In this work, we pre-train agents with intrinsic rewards and fine-tune them to downstream tasks with extrinsic rewards. Some algorithms considered in this work additionally condition the agent on a learned task or skill vector, which we denote as $w$, i.e. $\pi(a_t|o_t, w)$.

Learning from pixels vs. states: We benchmark unsupervised RL where environment observations can be either proprioceptive states or RGB images. When learning from pixels, rather than defining the self-supervised task directly as a function of image observations, it is usually more convenient to first embed the image and compute the intrinsic reward as a function of these lower-dimensional features burda2018exploration; liu2021unsupervised; liu21aps; pathak2017curiosity. We therefore define an embedding $z_t = f_\xi(o_t)$, where $f_\xi$ is an encoder function. We employ different encoder architectures depending on whether the algorithm receives pixel or state-based input. For pixel-based inputs we use the convolutional encoder architecture from SAC-AE (yarats2019improving), while for state-based inputs we use the identity function by default, unless the unsupervised RL algorithm explicitly specifies a different encoding. The intrinsic reward can be a function of any and all of these quantities depending on the algorithm. Finally, note that the encoder may or may not be shared with components of the base RL algorithm such as the actor and critic.
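To make this convention concrete, the following sketch shows one way the encoder choice could look in code. It is a minimal illustration assuming a PyTorch implementation; the layer sizes, input resolution, and function names are illustrative assumptions and are not taken from the released SAC-AE encoder.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Illustrative convolutional encoder for stacked RGB frames (sizes are assumptions)."""
    def __init__(self, obs_shape, feature_dim=50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        with torch.no_grad():
            n_flat = self.convs(torch.zeros(1, *obs_shape)).flatten(1).shape[1]
        self.proj = nn.Sequential(nn.Linear(n_flat, feature_dim),
                                  nn.LayerNorm(feature_dim), nn.Tanh())

    def forward(self, obs):
        # obs: (B, C, H, W) image tensor scaled to [0, 1]; returns the embedding z_t = f(o_t)
        return self.proj(self.convs(obs).flatten(1))

def make_encoder(obs_shape, from_pixels):
    # State-based inputs use the identity encoding by default.
    return PixelEncoder(obs_shape) if from_pixels else nn.Identity()
```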

3 URLB: Evaluation and Environments

Figure 2: The three domains (walker, quadruped, jaco arm) and twelve downstream tasks considered in URLB. The environments include tasks of varying complexity and require an agent pre-trained on a given domain to adapt efficiently to the downstream tasks within that domain.

3.1 Standardized Pre-training and Fine-tuning Procedures

One reason why unsupervised RL has been hard to benchmark to date is that there is no agreed-upon procedure for training and evaluating unsupervised RL agents. To this end, we standardize pre-training, fine-tuning, and evaluation in URLB. Training is split into two phases: a reward-free pre-training phase and a downstream fine-tuning phase. During pre-training, we checkpoint agents at 100k, 500k, 1M, and 2M environment steps in order to evaluate downstream performance as a function of pre-training steps. For adapting the pre-trained policy to downstream tasks, we evaluate in the data-efficient regime with a fine-tuning budget of 100k environment steps, since we are interested in agents that adapt quickly.

3.2 Evaluation

We evaluate the performance of an unsupervised RL algorithm by measuring how quickly it adapts to a downstream task. For each fine-tuning task, we initialize the agent with the pre-trained network parameters, fine-tune the agent for 100k steps and measure its performance on the downstream task. This evaluation procedure is similar to how pre-trained networks in CV and NLP are fine-tuned to downstream tasks such as classification, object detection, and summarization. There exist other means of evaluating the quality of pre-trained RL agents such as measuring the diversity of data collected during exploration or zero-shot generalization of goal-conditioned agents. However, it is challenging to produce a general method to measure data diversity, and while zero-shot generalization with goal-conditioned agents can be powerful such a benchmark would be limited to goal-conditioned RL. For these reasons, data diversity and goal-conditioned zero-shot generalization are less common evaluation metrics. In an effort to provide a general benchmark, we focus on the fine-tuning efficiency of the agent after pre-training which allows us to evaluate a diverse set of baselines.

Unlike unsupervised methods in CV and NLP, which focus solely on representation learning, unsupervised pre-training in RL requires both representation learning and behavior learning. For this reason, URLB benchmarks performance for both state-based and pixel-based agents. Benchmarking state and pixel-based RL separately is important because it allows us to decouple unsupervised behavior learning from unsupervised representation learning. In state-based RL, the agent receives a near-optimal representation of the world through coordinate states. Evaluating state-based unsupervised RL agents therefore allows us to isolate unsupervised behavior discovery without representation learning as a confounding factor, while evaluating pixel-based unsupervised RL agents provides insight into how representations and behaviors can be learned jointly.

3.3 URLB Environments

We release a set of domains and downstream tasks for URLB that are based on the DeepMind Control Suite (DMC) tassa2018deepmind. The three reasons for building URLB on top of DMC are (i) DMC is already widely adopted and familiar to RL practitioners; (ii) DMC environments can be used with both state and pixel-based inputs; (iii) DMC features environments of varying difficulty which is useful for designing a benchmark that contains both challenging and feasible tasks. URLB evaluates performance on 12 continuous control tasks (3 domains with 4 downstream tasks per domain). From easiest to hardest, the URLB domains and tasks are:

Walker (Stand, Walk, Flip, Run): A biped constrained to a 2D vertical plane. Walker is a challenging introductory domain for unsupervised RL because it requires the unsupervised agent to learn balancing and locomotion skills in order to fine-tune efficiently.

Quadruped (Stand, Walk, Jump, Run): A quadruped within a 3D space. Like Walker, Quadruped requires the agent to learn to balance and move, but it is harder due to higher-dimensional state and action spaces and the 3D environment.

Jaco Arm (Reach top left, Reach top right, Reach bottom left, Reach bottom right): Jaco Arm is a 6-DOF robotic arm with a three-finger gripper. This domain tests the unsupervised RL agent's ability to control the robot arm without locking and to perform simple manipulation tasks. It was recently shown that this environment is particularly challenging for unsupervised RL yarats21protorl.
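For readers unfamiliar with DMC, the snippet below shows a reward-free interaction loop on a standard DMC task using the dm_control API. The additional URLB tasks (e.g., Walker Flip or the Jaco reaching variants) are provided in the custom_dmc_tasks folder of the released code rather than in dm_control itself, so loading them through this exact call is not implied.

```python
from dm_control import suite

# Load a standard DMC task from the Walker domain.
env = suite.load(domain_name="walker", task_name="walk")

timestep = env.reset()
while not timestep.last():
    # Placeholder action conforming to the action spec; an unsupervised agent
    # would instead act from its policy and ignore timestep.reward.
    action = env.action_spec().generate_value()
    timestep = env.step(action)
```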

Require: randomly initialized actor $\pi_\phi$, critic $Q_\theta$, and encoder $f_\xi$ networks; replay buffer $\mathcal{D}$.
Require: intrinsic and extrinsic reward functions $r^{\text{int}}$ and $r^{\text{ext}}$; discount factor $\gamma$.
Require: environment (env) and downstream tasks $\{\mathcal{T}_k\}$.
Require: number of pre-training steps $N_{\text{PT}}$ and fine-tuning steps $N_{\text{FT}}$.
for $t = 1, \dots, N_{\text{PT}}$ do   (Part 1: Unsupervised Pre-training)
      sample $a_t \sim \pi_\phi(\cdot|o_t)$ and observe $o_{t+1} \sim \text{env}(o_t, a_t)$
      compute the intrinsic reward $r^{\text{int}}_t$
      add the transition $(o_t, a_t, r^{\text{int}}_t, o_{t+1})$ to $\mathcal{D}$
      update $\pi_\phi$, $Q_\theta$, and $f_\xi$ using minibatches from $\mathcal{D}$ and the intrinsic reward according to Eqs. 1 and 2
end for
Output: pre-trained parameters $\phi$, $\theta$, and $\xi$
for each downstream task $\mathcal{T}_k$ do   (Part 2: Supervised Fine-tuning)
      initialize $\pi_\phi$, $Q_\theta$, $f_\xi$ with the pre-trained parameters and reset $\mathcal{D}$
      for $t = 1, \dots, N_{\text{FT}}$ do
            sample $a_t \sim \pi_\phi(\cdot|o_t)$ and observe $o_{t+1}$ and extrinsic reward $r^{\text{ext}}_t$ from $\mathcal{T}_k$
            add the transition $(o_t, a_t, r^{\text{ext}}_t, o_{t+1})$ to $\mathcal{D}$
            update $\pi_\phi$, $Q_\theta$, and $f_\xi$ using minibatches from $\mathcal{D}$ according to Eqs. 1 and 2
      end for
      evaluate the performance of the RL agent on task $\mathcal{T}_k$
end for
Algorithm 1 Unsupervised RL: Unsupervised Pre-training and Supervised Fine-tuning
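A minimal Python rendering of Algorithm 1 is given below for concreteness. The agent and environment interfaces (act, update, state_dict, a gym-style step returning (obs, reward, done, info), and the pretrain_env / make_task_env constructors) are hypothetical placeholders for illustration, not the released training script.

```python
import copy

def pretrain_then_finetune(agent, pretrain_env, make_task_env, tasks, intrinsic_reward,
                           num_pretrain_steps=2_000_000, num_finetune_steps=100_000):
    # Part 1: reward-free pre-training with an intrinsic reward.
    buffer, obs = [], pretrain_env.reset()
    for _ in range(num_pretrain_steps):
        action = agent.act(obs)
        next_obs, _, done, _ = pretrain_env.step(action)   # extrinsic reward is discarded
        buffer.append((obs, action, intrinsic_reward(obs, action, next_obs), next_obs))
        agent.update(buffer)                                # actor/critic/encoder updates (Eqs. 1 and 2)
        obs = pretrain_env.reset() if done else next_obs

    snapshot = copy.deepcopy(agent.state_dict())            # pre-trained parameters

    # Part 2: supervised fine-tuning and evaluation on each downstream task.
    scores = {}
    for task in tasks:
        agent.load_state_dict(snapshot)                     # initialize from the pre-trained snapshot
        env, buffer = make_task_env(task), []               # fresh replay buffer per task
        obs = env.reset()
        for _ in range(num_finetune_steps):
            action = agent.act(obs)
            next_obs, extrinsic_reward, done, _ = env.step(action)
            buffer.append((obs, action, extrinsic_reward, next_obs))
            agent.update(buffer)
            obs = env.reset() if done else next_obs
        # Evaluate average return on the downstream task after fine-tuning.
        returns = []
        for _ in range(10):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(agent.act(obs))
                total += reward
            returns.append(total)
        scores[task] = sum(returns) / len(returns)
    return scores
```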

4 URLB: Algorithmic Baselines for Unsupervised RL

In addition to introducing URLB, the other primary contribution of this work is open-sourcing a unified codebase for eight leading unsupervised RL algorithms. To date, unsupervised RL algorithms have been hard to compare due to confounding factors such as different evaluation procedures and optimization schemes. While URLB provides standardized pre-training, fine-tuning, and evaluation procedures, current algorithms are hard to compare since they rely on different optimization algorithms. For instance, Curiosity pathak2017curiosity utilizes PPO ppo while APT liu2021unsupervised uses SAC sac for optimization. Moreover, even if two unsupervised RL methods use the same optimization algorithm, small differences in implementation can result in large performance differences that are independent of the pre-training algorithm. For this reason, it is important to provide a unified codebase with identical implementations of the optimization algorithm for each baseline. Providing such a unified codebase is one of the main contributions of this benchmark.

4.1 Backbone RL Algorithm

Since most of the above algorithms rely on off-policy optimization (and some cannot be optimized on-policy at all), we opt for a state-of-the-art off-policy optimization algorithm. While SAC sac has been the de facto off-policy RL algorithm for many RL methods over the last few years, it is prone to policy entropy collapse. DrQ-v2 (yarats2021drqv2) recently showed that using DDPG lillicrap15ddpg instead of SAC as the learning algorithm leads to more robust performance on tasks from DMC. For this reason, we opt for DrQ-v2 (yarats2021drqv2) as our base optimization algorithm to learn from images, and DDPG, as implemented in DrQ-v2, to learn from states. DDPG is an actor-critic off-policy algorithm for continuous control tasks. The critic minimizes the Bellman error

$$\mathcal{L}_Q(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim \mathcal{D}} \Big[ \big( Q_\theta(o_t, a_t) - r_t - \gamma Q_{\bar{\theta}}(o_{t+1}, \pi_\phi(o_{t+1})) \big)^2 \Big], \qquad (1)$$

where $r_t$ is the intrinsic reward during pre-training and the extrinsic reward during fine-tuning, and $\bar{\theta}$ is an exponential moving average of the critic weights. The deterministic actor $\pi_\phi$ is learned by maximizing the expected returns, i.e., by minimizing

$$\mathcal{L}_\pi(\phi) = -\mathbb{E}_{o_t \sim \mathcal{D}} \big[ Q_\theta(o_t, \pi_\phi(o_t)) \big]. \qquad (2)$$
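The sketch below expresses Eqs. 1 and 2 as PyTorch losses for a single deterministic actor-critic pair. It omits the n-step returns, twin critics, image augmentation, and exploration-noise schedule used by DrQ-v2, so it should be read as a simplified illustration rather than the benchmark's optimizer.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, actor, batch, gamma=0.99):
    obs, action, reward, next_obs = batch
    with torch.no_grad():
        next_action = actor(next_obs)
        target_q = reward + gamma * target_critic(next_obs, next_action)  # TD target of Eq. 1
    return F.mse_loss(critic(obs, action), target_q)                      # Bellman error (Eq. 1)

def actor_loss(critic, actor, batch):
    obs, _, _, _ = batch
    return -critic(obs, actor(obs)).mean()                                # maximize expected Q (Eq. 2)

@torch.no_grad()
def update_target(target_critic, critic, tau=0.01):
    # Target weights are an exponential moving average of the critic weights.
    for p_target, p in zip(target_critic.parameters(), critic.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```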

4.2 Unsupervised RL Algorithms

As part of URLB, we open-source code for eight leading or well-known unsupervised RL algorithms, all of which utilize the same optimization backbone. The algorithms provided with URLB differ only in their intrinsic reward while keeping all other parts of the RL architecture the same. We list all implemented baselines in Table 1 and provide a brief overview of the algorithms considered, which are binned into three categories – knowledge-based, data-based, and competence-based (we borrow this terminology from the unsupervised RL tutorial srinivas_abbeel_2021_icml_tutorial). For detailed descriptions of each method we refer the reader to Appendix A.

Name | Type | Intrinsic reward (informal)
ICM pathak2017curiosity | Knowledge | prediction error of a learned forward dynamics model
Disagreement Pathak19disagreement | Knowledge | variance across an ensemble of forward dynamics models
RND burda2018exploration | Knowledge | prediction error against a frozen random network
APT liu2021unsupervised | Data | particle-based (kNN) entropy of state/observation embeddings
ProtoRL yarats21protorl | Data | particle-based entropy estimated with learned prototypes
SMM lee2019smm | Competence | state marginal matching (entropy plus skill discriminability)
DIAYN EysenbachGIL19diayn | Competence | skill discriminability under a uniform skill prior
APS liu21aps | Competence | particle-based entropy plus a successor-feature skill term
Table 1: Unsupervised RL algorithms implemented in URLB. The intrinsic-reward column is an informal summary; see Appendix A for details.

Knowledge-based Baselines: Knowledge-based methods aim to increase knowledge about the world by maximizing prediction error. As part of the knowledge-based suite, we implement the Intrinsic Curiosity Module (ICM) pathak2017curiosity, Disagreement Pathak19disagreement, and Random Network Distillation (RND) burda2018exploration. All three methods train a predictor network on top of the observation embedding $z_t = f_\xi(o_t)$, either predicting the dynamics (ICM, Disagreement) or the output of a random network (RND). ICM and RND maximize the prediction error while Disagreement maximizes the prediction uncertainty (ensemble variance).
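As a concrete instance of a knowledge-based reward, the sketch below computes an RND-style intrinsic reward as the prediction error against a frozen, randomly initialized target network; the network sizes and class name are illustrative assumptions.

```python
import torch.nn as nn

class RNDReward(nn.Module):
    """Intrinsic reward = squared error between a trained predictor and a frozen random target."""
    def __init__(self, input_dim, rep_dim=512):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(input_dim, 1024), nn.ReLU(), nn.Linear(1024, rep_dim))
        self.target = nn.Sequential(nn.Linear(input_dim, 1024), nn.ReLU(), nn.Linear(1024, rep_dim))
        for p in self.target.parameters():
            p.requires_grad = False  # the target network stays randomly initialized

    def forward(self, z):
        # z: (B, D) embeddings; returns a (B,) reward, also usable as the predictor's training loss.
        return (self.predictor(z) - self.target(z)).pow(2).mean(dim=-1)
```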

Data-based Baselines: Data-based methods aim to achieve data diversity by maximizing the entropy of the visited states. We implement APT liu2021unsupervised and ProtoRL yarats21protorl, both of which maximize entropy in different ways. Both methods utilize a particle estimator singh03entropy that maximizes entropy by maximizing the distance between each state or observation embedding and its k nearest neighbors (kNN). Since computing kNN over the entire replay buffer is expensive, APT estimates entropy across transitions in a randomly sampled minibatch. ProtoRL improves on APT by clustering the replay buffer with the contrastive deep clustering algorithm SwAV caron20swav. The centroids of the clusters are called prototypes, which ProtoRL uses to estimate entropy.
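The particle-based entropy reward common to these data-based methods reduces to a few lines; below is a minimal sketch in which the reward for each embedding grows with its distance to its k nearest neighbors inside the sampled minibatch. The log transform and the value of k are assumptions, not the exact APT or ProtoRL estimator.

```python
import torch

def knn_entropy_reward(z, k=12):
    """z: (B, D) batch of state/observation embeddings; returns a (B,) intrinsic reward."""
    dists = torch.cdist(z, z)                         # pairwise Euclidean distances in the minibatch
    knn_dists, _ = dists.topk(k + 1, largest=False)   # k+1 because each point is its own nearest neighbor
    knn_dists = knn_dists[:, 1:]                      # drop the zero self-distance
    return torch.log(1.0 + knn_dists).mean(dim=1)     # reward grows with kNN distances
```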

Competence-based Baselines: Competence-based algorithms learn an explicit skill vector $w$ by maximizing the mutual information $I(z; w)$ between the encoded observation $z$ and the skill $w$. This mutual information can be decomposed in two ways: $I(z; w) = H(w) - H(w|z) = H(z) - H(z|w)$. We provide baselines for both decompositions. The former decomposition is utilized in skill discovery algorithms such as DIAYN EysenbachGIL19diayn, VIC GregorRW17vic, and VALOR achiam2018valor, which are conceptually similar; for URLB, we implement DIAYN. The latter decomposition, though less common, is implemented in APS liu21aps, which uses a particle estimator for the entropy term and successor features for the conditional entropy hansen20visr. Lastly, we implement SMM lee2019smm, which combines both decompositions into one objective. Note that the SMM paper describes both skill-based and skill-free variants, so it can be categorized as both competence-based and data-based.
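To illustrate the first decomposition, the sketch below implements a DIAYN-style discriminator reward: skills are drawn from a discrete uniform prior and the reward is log q(w|z) - log p(w). The dimensions and class name are illustrative, and the same log-likelihood would also serve as the discriminator's training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminatorReward(nn.Module):
    def __init__(self, feature_dim, num_skills=16):
        super().__init__()
        self.num_skills = num_skills
        self.discriminator = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, num_skills))

    def sample_skills(self, batch_size):
        # Discrete uniform prior p(w) over skills.
        return torch.randint(self.num_skills, (batch_size,))

    def forward(self, z, skills):
        # Reward r = log q(w|z) - log p(w) for each (embedding, skill) pair in the batch.
        log_q = F.log_softmax(self.discriminator(z), dim=-1)
        log_q_w = log_q.gather(1, skills.unsqueeze(1)).squeeze(1)
        log_p_w = -torch.log(torch.tensor(float(self.num_skills)))
        return log_q_w - log_p_w
```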

5 Experiments

Figure 3: Aggregate results for each algorithm category after pre-training the agent with intrinsic rewards for 2M environment steps and fine-tuning with extrinsic rewards for 100k steps, as described in Sec. 3.2. Scores are normalized by the asymptotic performance on each task (i.e., DrQ-v2 and DDPG performance after training for 2M steps on pixels and states, respectively) and we show the mean and standard error of each category. Each algorithm is evaluated across ten random seeds. To provide an aggregate view of each algorithm category, the scores are averaged over individual tasks and methods (see Appendix C for detailed results for each algorithm and downstream task). The Random Init baseline represents DrQ-v2 and DDPG trained from a random initialization for 100k steps. Full results can be found in Appendix C.

We evaluate the algorithms listed in Table 1 by pre-training with the intrinsic reward objective and fine-tuning on the downstream task as described in Section 3.2. For DrQ-v2 optimization we fix the hyper-parameters from yarats2021drqv2 and for algorithm-specific hyper-parameters we perform a grid sweep and pick the best performing parameters. We benchmark both state and pixel-based experiments and keep all non-algorithm-specific architectural details the same with a full description available in Appendix B. Performance on each downstream task is evaluated over ten random seeds and we display the mean scores and standard errors. We summarize the main results of our evaluation in Figures 3 and 6, which show evaluation scores grouped by algorithm category, described in Section 4.2, and environment, described in Section 3.3. An extensive list of results across all algorithms considered in this work can be found in Appendix C.

(a) State-based learning.
(b) Pixel-based learning.
Figure 6: We display the fine-tuning efficiency as a function of pre-training steps. As in Fig. 3, scores are normalized by asymptotic performance, averaged across tasks and algorithms on a per-category basis, and evaluated over ten seeds. Our expectation is that a longer pre-training phase should lead to more efficient fine-tuning. However, in several cases the empirical evidence goes against this intuition, demonstrating that longer pre-training is not always beneficial. Understanding this shortcoming of current methods is an important direction for future research. Detailed results can be found in Figures 8 and 9.

By benchmarking a wide array of exploration algorithms on both state and pixel-based tasks we are able to get perspective on the current state of unsupervised RL. Overall, we find that while unsupervised RL shows promise, it is still far from solving the proposed benchmark and many open questions need to be addressed to make progress toward unsupervised pre-training for RL. We note that solving the benchmark means matching the asymptotic DrQ-v2 (for pixels) and DDPG (for states) performance within 100k steps of fine-tuning. The motivation for this definition is that unsupervised RL agents get access to unlimited reward-free environment interactions. After pre-training, we seek to develop agents that adapt quickly to the desired downstream task. We list our observations below:
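For reference, the score aggregation behind this definition is straightforward: per-task returns after 100k fine-tuning steps are divided by the asymptotic supervised return and averaged with a standard error over seeds. The numbers in the example below are placeholders, not results from the paper.

```python
import numpy as np

def normalized_score(finetune_returns, asymptotic_return):
    """finetune_returns: per-seed returns after 100k fine-tuning steps on one task."""
    scores = np.asarray(finetune_returns, dtype=float) / asymptotic_return
    mean = scores.mean()
    stderr = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error over seeds
    return mean, stderr

# Placeholder returns for ten seeds on a single hypothetical task.
mean, stderr = normalized_score([612, 588, 570, 640, 605, 598, 577, 630, 615, 590],
                                asymptotic_return=950.0)
print(f"normalized return: {mean:.2f} +/- {stderr:.2f}")
```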

O1: None of the implemented unsupervised RL algorithms solve the benchmark. Despite access to up to 2M pre-training steps, after 100k steps of fine-tuning no method matches asymptotic performance on most tasks. The best-performing benchmarked algorithms still fall short of the normalized return required for the benchmark to be considered solved, namely achieving close to the asymptotic supervised performance. This suggests that, as a community, we are still far from efficient generalization in deep RL.

O2: Unsupervised RL is not universally better than random initialization. We also observe that fine-tuning an unsupervised RL baseline is not always preferable to fine-tuning from a random initialization. In particular, when learning from states, a random initialization is competitive with most baselines. However, when learning from pixels, fine-tuning from a random initialization falls behind the pre-trained baselines, suggesting that representation learning is an important component of unsupervised pre-training.

O3: There exists a large gap in performance between exploring from states and exploring from pixels. Another observation that supports representation learning as an important aspect of exploration is that exploration algorithms degrade substantially when learning from pixels compared to learning from states. As shown in Figure 3, most algorithms lose a large fraction of their normalized return when moving from states to pixels, especially on the harder domains (Quadruped, Jaco Arm). These results suggest that better representation learning during pre-training is an important research direction.

O4: In aggregate, competence-based approaches underperform knowledge-based and data-based approaches. While knowledge-based and data-based approaches both perform competitively across URLB, we find that competence-based approaches are lagging behind. Specifically, there is no competence-based approach that achieves state-of-the-art mean performance on any of the URLB tasks, which points to competence-based unsupervised RL as an impactful research direction with significant room for improvement.

O5: There is not a single leading unsupervised RL algorithm for both states and pixels. We observe that there is no single state-of-the-art algorithm for unsupervised RL. At 2M pre-training steps, APT liu2021unsupervised and ProtoRL yarats21protorl are the leading algorithms for state-based URLB while ICM pathak2017curiosity achieves leading performance on pixel-based URLB despite the existence of more sophisticated knowledge-based methods Pathak19disagreement; burda2018exploration (see Figure 7).

O6: For many unsupervised RL algorithms, rather than monotonically improving, performance decays as a function of pre-training steps. We would expect the fine-tuning efficiency of unsupervised RL algorithms to improve as a function of pre-training steps. Surprisingly, we find that for 9 out of 18 experiments shown in Figure 6, performance either does not improve or even degrades with more pre-training. We see this as potentially the biggest drawback of current unsupervised RL approaches – they do not scale with the number of environment interactions. Developing algorithms that improve monotonically as a function of pre-training steps is an open and impactful line of research.

O7: New fine-tuning strategies will likely be needed for fast adaptation. While not investigated in depth in this benchmark, new fine-tuning strategies could play a large role in the adoption of unsupervised RL. Perhaps part of the issue raised in O6 could be addressed with better fine-tuning. The algorithms in URLB are all fine-tuned by initializing the actor-critic with the pre-trained weights and fine-tuning with an extrinsic reward. There are likely better strategies for fine-tuning, particularly for competence-based approaches that are conditioned on the skill vector $w$.

6 Related work

Deep Reinforcement Learning Benchmarks. Part of the accelerated progress in deep RL over the last few years has been due to the existence of stable benchmarks. Specifically, the Atari Arcade Learning Environment bellemare2013arcade, the OpenAI gym brockman2016openai, and more recently the DeepMind Control (DMC) Suite tassa2018deepmind have become standard benchmarks for evaluating supervised RL agents in both state and pixel-based observation spaces and discrete and continuous action spaces. Open-sourcing code for algorithms has been another aspect that accelerated progress in deep RL. For instance, duan2016benchmarking not only presented a benchmark for continuous control but also provided baselines for common supervised RL algorithms, which led to the development of the widely used OpenAI gym benchmark brockman2016openai and baselines baselines. The combination of challenging yet feasible benchmarks and open-sourced code were important components in the discovery of many widely adopted RL algorithms sac; mnih2015human; trpo; schulman2015high; ppo.

In addition to Atari, OpenAI Gym, and DeepMind Control, there have been many other benchmarks designed to study different aspects of supervised RL. DeepMind Lab beattie2016deepmind benchmarks 3D navigation from pixels, ProcGen cobbe2019quantifying; cobbe2020leveraging measures generalization of supervised agents in procedurally generated environments, D4RL fu2020d4rl and RL Unplugged gulcehre2020rl benchmark the performance of offline RL methods, B-Pref lee2021bpref benchmarks the performance of preference-based RL methods, Metaworld yu2020meta measures the performance of multi-task and meta-RL algorithms, and SafetyGym ray2019benchmarking measures how RL agents achieve tasks under safety constraints. However, while the existing benchmarks are suitable for supervised RL algorithms, there is no such benchmark or collection of easy-to-use baseline algorithms for unsupervised RL; filling that gap to accelerate progress in unsupervised RL is our primary motivation for URLB.

Unsupervised Reinforcement Learning. While investigations into unsupervised deep RL appeared shortly after the landmark DQN mnih2015human, the field has experienced accelerated progress over the last year, due in part to advances in unsupervised representation learning in CV chen2020simclr; he2020moco; henaff2019cpcv2 and NLP brown2020language; devlin2018bert; radford2019language as well as the development of stable RL optimization algorithms sac; hessel18rainbow; lillicrap15ddpg; ppo. However, unlike CV and NLP, which focus solely on unsupervised representation learning, unsupervised RL requires both unsupervised representation learning and unsupervised behavioral learning.

Unsupervised Representation Learning for Deep RL: In order for an RL algorithm to learn a policy it must first have a good representation of the state. When working with coordinate state, the representation is supplied by the human task designer, but when operating from image observations, we must first transform the observations into latent vectors. This transformation comprises the study of representation learning for RL. One of the first seminal works on unsupervised representation learning for RL showed that unsupervised auxiliary tasks improve the performance of supervised RL jaderberg17unreal. Over the last two years, a series of works on unsupervised representation learning for RL with world models hafner2018learning; hafner2019dream, contrastive learning laskin2020curl; stooke2020decoupling; yarats21protorl, autoencoders yarats2019improving, and data augmentation laskin2020reinforcement; yarats2021drqv2; yarats2021image has dramatically improved learning efficiency from pixels. On many tasks from the DMC suite, learning from pixels is now as data-efficient as learning from state laskin2020curl.

Unsupervised Behavioral Learning for Deep RL: One caveat is that the above algorithms are not fully unsupervised since they still optimize for an extrinsic reward but with an auxiliary unsupervised loss. Fully unsupervised RL also requires unsupervised learning of behaviors, which is typically achieved by optimizing for an intrinsic reward oudeyer2007intrinsic. Given that representation learning is already heavily benchmarked for RL hafner2018learning; laskin2020curl; yarats2021image, URLB focuses mostly on unsupervised behavior learning. Many recent algorithms have been proposed for intrinsic behavioral learning, which include prediction methods burda2018largescale; pathak2017curiosity; Pathak19disagreement, maximal entropy-based methods campos2021beyond; liu2021unsupervised; liu21aps; mutti2020policy; seo21re3; yarats21protorl, and maximal mutual information-based methods EysenbachGIL19diayn; hansen20visr; liu21aps; SharmaGLKH20dads. However, these methods use different pre-training and evaluation procedures, different optimization algorithms, and different environments. To make fully unsupervised RL algorithm comparisons transparent and easier to develop, we introduce URLB.

7 Conclusion

We presented URLB, a benchmark designed to measure the performance of unsupervised RL algorithms. URLB consists of a suite of twelve evaluation tasks of varying difficulty from three domains and standardized procedures for pre-training and evaluation. We’ve open-sourced implementations and evaluation scores for eight leading unsupervised RL algorithms from all major algorithm categories. To minimize confounding factors, we utilized the same optimization method across all baselines. While none of the implemented baselines solve URLB, many make substantial progress suggesting a number of fruitful directions for unsupervised RL research. We hope that this benchmark makes the development and comparison of unsupervised RL algorithms easier and clearer.

Limitations. There are a number of limitations for both URLB and unsupervised RL methods in general. While URLB tasks are designed to be challenging, they are far from the visual and combinatorial complexity of real-world robotics. However, existing algorithms are unable to solve the benchmark, meaning there is substantial room for improvement on the URLB tasks before moving on to even more challenging ones. While we present standardized pre-training and evaluation procedures, there can be many other ways of measuring the quality of an exploration algorithm. For instance, the quality of pre-training can be evaluated not only through policy adaptation but also through dataset diversity, which we do not consider in this paper. In this work, similar to the Atari bellemare2013arcade and DMC tassa2018deepmind benchmarks for supervised RL, we do not consider goal-conditioned RL, which can be quite powerful for exploration ecoffet2020goexplore. For generality, we chose the currently most common evaluation procedure, which allowed us to benchmark a diverse set of leading exploration algorithms, but of course other choices are available and would be interesting to investigate in future work.

Potential negative impacts. Unsupervised RL has the benefit of requiring zero extrinsic-reward interactions during pre-training, but as a consequence the resulting agents may develop policies that are not aligned with human intent. This could be problematic in the long term if not addressed early and carefully, because as unsupervised robotic agents become more capable they may inadvertently inflict harm on themselves or their environment. Methods for constraining exploration within a broad set of human preferences (e.g., exploring without harming the environment) are an interesting and important direction for future research in order to produce safe agents.

Acknowledgements

This work was partially supported by Berkeley DeepDrive, BAIR, the Berkeley Center for Human-Compatible AI, the Office of Naval Research grant N00014-21-1-2769, and DARPA through the Machine Common Sense Program.

References

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Section 7.

    3. Did you discuss any potential negative societal impacts of your work? See Section 7.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments (e.g. for benchmarks)…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See Section 5

      , the appendix for hyperparameters. You can access the code with full instructions in the supplementary materials or using this link

      https://github.com/rll-research/url_benchmark.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 5 and  3.2 and the supplementary material.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 5 and the supplementary material.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Section F.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL? See custom_dmc_tasks folder in supplementary materials codebase.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix: Unsupervised Reinforcement Learning Benchmark

Appendix A Unsupervised Reinforcement Learning Baselines

A.1 Knowledge-based Baselines

Prediction methods train a forward dynamics model and define a self-supervised task based on the outputs of the model prediction.

Curiosity pathak2017curiosity: The Intrinsic Curiosity Module (ICM) defines the self-supervised task as the error between the prediction of a learned dynamics model and the observed next-state embedding, i.e. $r^{\text{int}}_t \propto \| g(z_t, a_t) - z_{t+1} \|^2_2$. The intuition is that parts of the state space that are hard to predict are good to explore because they are likely to have been unseen before. An issue with Curiosity is that it is susceptible to the noisy-TV problem, wherein stochastic elements of the environment always cause high prediction error while not being informative for exploration.

Disagreement Pathak19disagreement: Disagreement is similar to ICM but instead trains an ensemble of forward models and defines the intrinsic reward as the variance (or disagreement) among the ensemble's predictions, $r^{\text{int}}_t \propto \mathrm{Var}\{ g_i(z_t, a_t) \}_{i=1}^{N}$. Disagreement has the favorable property of not being susceptible to the noisy-TV problem: high stochasticity in the environment results in high prediction error but low variance once it has been thoroughly explored.

RND burda2018exploration: Random Network Distillation (RND) defines the self-supervised task as predicting the output of a frozen, randomly initialized neural network, $r^{\text{int}}_t \propto \| g(z_t) - \tilde{g}(z_t) \|^2_2$, where $\tilde{g}$ is the frozen random network. This differs from ICM only in that instead of predicting the next state, which is effectively an environment-defined function, it tries to predict the vector output of a randomly defined function. Similar to ICM, RND can suffer from the noisy-TV problem.

A.2 Data-based Baselines

Recently, exploration through state entropy maximization has resulted in simple yet effective algorithms for unsupervised pre-training. We implement two leading variants of this approach for URLB.

APT liu2021unsupervised: Active Pre-training (APT) utilizes a particle-based estimator singh03entropy that uses k nearest neighbors to estimate the entropy of a given state or image embedding, yielding an intrinsic reward that grows with the distance between an embedding and its nearest neighbors in a sampled minibatch. Since APT does not itself perform representation learning, it requires an auxiliary representation learning loss to provide latent vectors for entropy estimation, although it is also possible to use random network embeddings seo21re3. We provide implementations of APT with forward and inverse dynamics representation learning losses.

ProtoRL yarats21protorl: ProtoRL devises a self-supervised pre-training scheme that decouples representation learning and exploration to enable efficient downstream generalization to previously unseen tasks. For this, ProtoRL uses the contrastive clustering assignment loss from SwAV (caron20swav) and learns latent representations along with a set of prototypes that form the basis of the latent space. The prototypes are then used for more accurate estimation of the entropy of the state-visitation distribution via a kNN particle-based estimator.

A.3 Competence-based Baselines

Competence-based approaches learn skills $w$ that maximize the mutual information between encoded observations (or states) $z$ and skills $w$. The mutual information has two decompositions, $I(z; w) = H(w) - H(w|z) = H(z) - H(z|w)$. We provide baselines for both decompositions.

SMM lee2019smm: SMM minimizes the KL divergence between the policy's state distribution and a target state distribution, which corresponds to maximizing the state entropy while minimizing the cross entropy from the state distribution to the target distribution. This objective is optimized through a reward based on the state density, which is estimated using a VAE kingma2013auto that models the density of states visited while executing skill $w$. Similar to other mutual information methods that use the decomposition $H(w) - H(w|z)$, SMM also learns a discriminator over a set of discrete skills with a uniform prior that maximizes $\log q(w|z)$.

DIAYN EysenbachGIL19diayn: DIAYN and similar algorithms such as VIC GregorRW17vic and VALOR achiam2018valor are perhaps the best-known competence-based exploration algorithms. These methods estimate the mutual information through the first decomposition, $I(z; w) = H(w) - H(w|z)$. The skill entropy $H(w)$ is kept maximal by drawing $w$ from a discrete uniform prior distribution, and the conditional density is estimated with a learned discriminator $q(w|z)$.

APS liu21aps: APS is a recent leading mutual information exploration method that uses the second decomposition, $I(z; w) = H(z) - H(z|w)$. The entropy term $H(z)$ is estimated with a particle estimator as in APT liu2021unsupervised, while the conditional term $H(z|w)$ is estimated with successor features as in VISR hansen20visr. (In this benchmark, the generalized policy improvement (GPI) barreto2018transfer used for APS and VISR in Atari games is not implemented for the continuous control experiments.)
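A rough sketch of how the two APS terms could be combined is shown below, assuming a successor-feature network and a particle-entropy estimator are available (e.g., the kNN reward sketched in Section 4.2). The inner-product competence term and the unit normalization are illustrative assumptions rather than the exact released implementation.

```python
import torch.nn.functional as F

def aps_style_reward(z, w, successor_features, particle_entropy_reward):
    """z: (B, D) embeddings; w: (B, K) skill vectors sampled from the prior.

    successor_features: assumed network mapping embeddings to (B, K) features.
    particle_entropy_reward: assumed kNN entropy estimator as in APT.
    """
    phi = F.normalize(successor_features(z), dim=-1)   # successor features on the unit sphere
    skill = F.normalize(w, dim=-1)
    competence = (phi * skill).sum(dim=-1)             # proxy for the -H(z|w) term via w^T phi(z)
    exploration = particle_entropy_reward(z)           # particle estimate of H(z)
    return competence + exploration
```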

Appendix B Hyper-parameters

In Table 2 we present the common set of hyper-parameters used in our experiments, while in Table 3 we list individual hyper-parameters for each method.

Common hyper-parameter Value
Replay buffer capacity
Action repeat states-based and for pixels-based
Seed frames
n-step returns
Mini-batch size states-based and for pixels-based
Seed frames
Discount ()
Optimizer Adam
Learning rate
Agent update frequency
Critic target EMA rate ()
Features dim. states-based and for pixels-based
Hidden dim.
Exploration stddev clip
Exploration stddev value
Number of pre-training frames up to 2M
Number of fine-tuning frames 100k
Table 2: A common set of hyper-parameters used in our experiments.
ICM hyper-parameter Value
Representation dim.
Reward transformation
Forward net arch. ReLU MLP
Inverse net arch. ReLU MLP
Disagreement hyper-parameter Value
Ensemble size
Forward net arch: ReLU MLP
RND hyper-parameter Value
Representation dim.
Predictor & target net arch. ReLU MLP
Normalized observation clipping 5
APT hyper-parameter Value
Representation dim.
Reward transformation
Forward net arch. ReLU MLP
Inverse net arch. ReLU MLP
in
Avg top in True
ProtoRL hyper-parameter Value
Predictor dim.
Projector dim.
Number of prototypes
Softmax temperature
in
Number of candidates per prototype
Encoder target EMA rate ()
SMM hyper-parameter Value
Skill dim.
Skill discrim lr
VAE lr
DIAYN hyper-parameter Value
Skill dim 16
Skill sampling frequency (steps) 50
Discriminator net arch. ReLU MLP
APS hyper-parameter Value
Representation dim.
Reward transformation
Successor feature dim.
Successor feature net arch. ReLU MLP
in
Avg top in True
Least square batch size
Table 3: Per algorithm sets of hyper-parameters used in our experiments.

Appendix C Per-domain Individual Results

Individual fine-tuning results for each method are shown in Figure 7. Furthermore, Figures 8 and 9 show individual results of state- and pixel-based fine-tuning performance as a function of pre-training steps for each considered method and task.

Figure 7: Individual results of fine-tuning for 100k steps after different degrees of pre-training for each considered method. The performance is aggregated across all the tasks within a domain and normalized with respect to the optimal performance.
Figure 8: Individual results of fine-tuning efficiency as a function of pre-training steps for states-based learning.
Figure 9: Individual results of fine-tuning efficiency as a function of pre-training steps for pixels-based learning.

Appendix D Finetuning Learning Curves

We provide finetuning learning curves for agents pre-trained for 2M steps with intrinsic rewards.

Figure 10: Finetuning curves for each evaluated unsupervised algorithm for each task considered in this benchmark after the agent has been pre-trained with intrinsic rewards.

Appendix E Individual Numerical Results

The individual numerical results of fine-tuning for each task and each method are presented in Table 4 for states-based learning, and in Table 5 for pixels-based learning.

Pre-training for 100k frames
Domain Task DDPG (DrQ-v2) ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 53827 53519 55920 58121 58028 52316 38836 35222 63833
Run 32525 38424 43724 43733 42428 25034 24428 25926 42826
Stand 89923 9445 9376 9476 9259 92624 73861 78468 87234
Walk 74847 80546 91114 85732 88819 83131 59253 58425 73170
Quadruped Jump 23648 29135 26145 38357 33448 22033 38656 26734 58957
Run 15731 19531 19840 20320 16127 13821 22433 17926 42049
Stand 39273 39059 42076 44617 55957 42582 43086 35055 66264
Walk 22957 18520 26545 22926 17329 14123 22747 19329 66456
Jaco Reach bottom left 7222 11724 10020 12119 12419 8620 6413 6415 1569
Reach bottom right 11718 15510 1796 1618 14114 8221 6817 448 16410
Reach top left 11622 15224 14314 14115 13623 11019 336 266 15313
Reach top right 9418 15915 15915 1689 1756 11622 4712 5912 1867
Pre-training for 500k frames
Domain Task DDPG ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 53827 55427 56842 68546 59429 50119 47230 38024 63736
Run 32525 41618 48525 49921 41027 22818 32819 24119 33725
Stand 89923 9307 9408 9465 9305 92517 90618 76244 86927
Walk 74847 84630 9235 86923 82640 86516 79143 63234 77858
Quadruped Jump 23648 25241 45245 54251 28248 22532 38756 35059 49355
Run 15731 18442 36828 37728 18222 15329 20526 25824 34739
Stand 39273 42249 64953 72249 47080 43363 49978 45954 74346
Walk 22957 23743 41267 49868 21729 20947 23827 21824 55374
Jaco Reach bottom left 7222 9416 14513 11311 12315 10618 6110 389 1346
Reach bottom right 11718 11915 13615 1446 13613 11519 8210 6311 1318
Reach top left 11622 12518 1659 12116 11818 12223 579 2910 12411
Reach top right 9418 1518 1817 1508 1708 12022 6010 436 10610
Pre-training for 1M frames
Domain Task DDPG ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 53827 52420 58636 61027 50522 48017 48617 33823 53124
Run 32525 34430 48823 48223 37322 25429 33240 24924 35231
Stand 89923 92215 91911 9465 91620 90525 90318 87034 84628
Walk 74847 84527 88019 87324 82135 84829 74651 55333 80861
Quadruped Jump 23648 30642 59539 61553 40053 28752 34963 36515 41546
Run 15731 15724 44439 44442 23735 20632 28040 34325 40048
Stand 39273 42879 73651 76379 52658 43662 39136 52952 71257
Walk 22957 14024 72946 64470 24633 26664 31263 52576 50584
Jaco Reach bottom left 7222 1149 1449 11410 12510 12222 588 4313 878
Reach bottom right 11718 12610 12910 10613 12812 11320 629 346 1099
Reach top left 11622 14611 15610 13613 1105 11419 617 122 10813
Reach top right 9418 14310 15910 13210 14911 12321 619 319 1019
Pre-training for 2M frames
Domain Task DDPG ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 53827 51425 49121 51517 47716 48023 50526 38117 46124
Run 32525 38830 44421 43934 34428 20015 43026 24211 25727
Stand 89923 91312 90715 9239 9148 87023 87734 86026 83554
Walk 74847 71331 78233 82829 75935 77733 82136 66126 71168
Quadruped Jump 23648 20533 66824 59033 46248 42563 29839 57846 53842
Run 15731 13320 46112 46223 33940 31636 22037 41528 46537
Stand 39273 32958 84033 80450 62257 56071 36742 70648 71450
Walk 22957 14331 72156 82619 43464 40391 18426 40664 60286
Jaco Reach bottom left 7222 1068 1348 10112 8812 12122 409 175 9613
Reach bottom right 11718 1199 1224 10010 11512 11316 509 314 939
Reach top left 11622 11912 11714 11110 11211 12420 507 113 6510
Reach top right 9418 1379 1407 14010 1365 13519 378 194 8111
Table 4: Individual results of fine-tuning for 100k frames after different levels of pre-training in the states-based settings.
Pre-training for 100k frames
Domain Task DrQ-v2 ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 8123 25254 8033 21438 251 29333 241 13224 387
Run 4111 11021 5714 7813 251 13513 221 506 261
Stand 21228 31567 25062 26134 1629 35367 1338 23322 1629
Walk 14153 30245 19268 26343 4316 32052 231 13825 292
Quadruped Jump 27835 22640 17315 22330 16024 24633 21125 20424 18232
Run 15621 15613 11212 14517 13421 15627 14818 17323 13324
Stand 30947 32949 25931 35043 26637 34235 29736 35048 26548
Walk 15131 16010 13424 15416 11917 16824 14918 15723 16127
Jaco Reach bottom left 2310 187 124 417 00 3811 11 124 00
Reach bottom right 238 3012 238 578 00 379 10 103 00
Reach top left 409 3111 309 669 00 5914 21 194 21
Reach top right 379 3713 228 487 32 4516 43 248 42
Pre-training for 500k frames
Domain Task DrQ-v2 ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 8123 26039 36016 22237 282 21044 251 11718 322
Run 4111 11015 13119 10313 261 8516 231 474 271
Stand 21228 49962 39865 28926 15511 35561 1399 24316 1619
Walk 14153 30551 34846 25833 3710 25054 231 12519 4819
Quadruped Jump 27835 28650 21424 36644 14726 22942 20120 24828 21227
Run 15621 19829 15319 26138 11220 14427 13816 19724 17823
Stand 30947 39869 29835 45347 22942 35567 27926 31331 28151
Walk 15131 19331 12920 20630 11120 15725 13913 14019 14124
Jaco Reach bottom left 2310 6520 308 478 00 3114 11 123 00
Reach bottom right 238 5616 3410 526 00 3511 10 72 00
Reach top left 409 8722 4711 558 11 3515 21 204 21
Reach top right 379 6817 335 618 21 4214 43 215 11
Pre-training for 1M frames
Domain Task DrQ-v2 ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 8123 25649 30534 25029 4617 24437 261 12626 4212
Run 4111 11616 15512 9613 261 8411 241 474 303
Stand 21228 53473 56537 37437 15013 48063 1456 25119 17010
Walk 14153 28545 43329 31641 369 25839 251 13725 5425
Quadruped Jump 27835 34547 19931 36838 15627 23750 20120 31938 18429
Run 15621 17911 14024 29736 11521 10816 13913 16517 15522
Stand 30947 43044 25735 55951 22942 33871 27926 31927 27548
Walk 15131 25125 10419 27419 11521 15228 13913 21320 14623
Jaco Reach bottom left 2310 6519 4210 469 00 3613 11 83 00
Reach bottom right 238 8823 5811 449 00 4310 10 41 00
Reach top left 409 7619 8919 597 64 4110 21 204 20
Reach top right 379 8724 4912 477 21 4712 43 226 51
Pre-training for 2M frames
Domain Task DrQ-v2 ICM Disagreement RND APT ProtoRL SMM DIAYN APS
Walker Flip 8123 23134 33916 28031 282 22327 261 11421 389
Run 4111 9811 1549 13315 252 8718 241 453 303
Stand 21228 40140 55292 38949 15511 46769 1456 29871 17210
Walk 14153 27444 42436 32143 358 29748 251 13219 375
Quadruped Jump 27835 31218 19421 38328 16427 19735 20120 26222 19929
Run 15621 24921 14325 28418 12120 13735 13913 19018 15624
Stand 30947 50640 30535 56143 24341 29056 27926 42640 33143
Walk 15131 23115 14510 29420 12221 13835 13913 18423 14624
Jaco Reach bottom left 2310 7220 10618 393 00 215 11 72 10
Reach bottom right 238 5819 9015 479 00 287 10 93 11
Reach top left 409 8922 12721 606 00 4716 21 112 21
Reach top right 379 6918 11823 7611 11 5212 43 163 103
Table 5: Individual results of fine-tuning for 100k frames after different levels of pre-training in the pixels-based settings.

Appendix F Compute Resources

URLB is designed to be accessible to the RL research community. Both state and pixel-based algorithms are implemented such that each algorithm requires a single GPU. For local debugging experiments we used NVIDIA RTX GPUs. For the large-scale runs used to generate all results in this manuscript, we used NVIDIA Tesla V100 GPU instances. All experiments were run on internal clusters. Each algorithm trains in roughly 30 minutes to 12 hours depending on the snapshot (100k, 500k, 1M, 2M) and input (states, pixels). Since this benchmark required roughly 8k experiments (2 observation types (states/pixels), 12 tasks, 8 algorithms, 10 seeds, 4 snapshots), a total of 100 V100 GPUs were used to produce the results in this benchmark. Researchers who wish to build on URLB will, of course, not need to run this many experiments since they can utilize the results presented in this benchmark.

Appendix G Intuition on Why Competence-based Approaches Underperform on URLB

Across the three categories of methods - data-based, knowledge-based, and competence-based - the best data-based and knowledge-based methods are competitive with one another. For instance, RND (a leading knowledge-based method) and ProtoRL (a leading data-based method) achieve similar fine-tuning scores. Both maximize data diversity in different ways - one through maximizing prediction error and the other through entropy maximization.

On the other hand, competence-based methods as a whole do much worse than data-based and knowledge-based ones. We hypothesize that this is due to current competence-based methods only supporting small skill spaces. Competence-based methods maximize a variational lower bound to the mutual information of the form

$$I(z; w) \geq \mathbb{E}_{z, w} \big[ \log q_\phi(w|z) \big] - \mathbb{E}_{w} \big[ \log p(w) \big],$$

where $q_\phi(w|z)$ is called the discriminator. The discriminator can be interpreted as a classifier from encoded observations $z$ to skills $w$ (or vice versa, depending on how you decompose the mutual information). In order to have an accurate discriminator, the skill space is chosen to be small in practice (DIAYN: $w$ is a 16-dimensional one-hot vector; SMM: $w$ is 4-dimensional and continuous; APS: $w$ is 10-dimensional and continuous).

OpenAI gym environments for continuous control mask this limitation because they terminate if the agent falls over and hence leak extrinsic signal about the downstream task into the environment. This means that the agent learns only useful behaviors that keep it balanced and therefore a small skill vector is sufficient for classifying these behaviors. However, in DeepMind control (and hence URLB) the episodes have fixed length and therefore the set of possible behaviors is much larger. If the skill space is too small, the most likely skills to be classified are different configurations of the agent lying on the ground. We hypothesize that building more powerful discriminators would improve competence-based exploration.