Reinforcement Learning (RL) has driven progress in areas like game playing (Silver et al., 2016; Badia et al., 2020), robot manipulation (Lee et al., 2020), traffic control (Arel et al., 2010), chemistry (Zhou et al., 2017) and logistics (Li et al., 2019). At the same time, RL has shown little to no success in real-world deployment in important areas such as healthcare or autonomous driving. This is largely explained by the fact that modern RL agents are often not primarily designed for generalization, making them brittle when faced with even slight variations in their environment (Yu et al., 2019; Lu et al., 2020; Meng and Khushi, 2019). Since we cannot assume that RL agents will be able to observe all kinds of states and transitions for varying instances of tasks, these agents need to become more adaptable and robust.
To address this limitation, there is increased interest in Meta-RL approaches that aim to improve learning across different tasks (Finn et al., 2017; Schulman et al., 2016; Duan et al., 2016; Wang et al., 2017; Matiisen et al., 2020; Klink et al., 2020; Nguyen et al., 2021; Eimer et al., 2021). The focus mostly lies on increasing the sample efficiency of agents, on few-shot transfer of policies to new tasks or on solving harder tasks. Similarly, Robust-RL addresses generalization to smaller variations in the environment by ensuring stable performance under task modelling errors or noisy observations (Morimoto and Doya, 2000; Pinto et al., 2017; Zhang et al., 2021b). While these directions are important in making RL more broadly and robustly applicable, with CARL we aim to provide the foundations for more general RL agents. Ideally, these agents should be capable of zero-shot transfer to previously unseen environments and changes in transition dynamics while interacting with an environment (Zhang et al., 2021a; Fu et al., 2021b; Yarats et al., 2021; Abdolshah et al., 2021; Sodhani et al., 2021b).
Unfortunately, there is a lack of established benchmarks for studying the notion of generalization. In fact, we often observed that researchers employed hand-crafted modifications to commonly accepted tasks to enable benchmarking of Meta-RL. For example, multiple different modifications to the well known CartPole task, in which an agent needs to learn to balance a pole on top of a movable cart, have been used in different publications to show generalization abilities of agents (Seo et al., 2020; Kaddour et al., 2020; Eimer et al., 2021). In particular, different pole lengths are used to study whether a general agent can balance poles that it has not seen during training. However, the pole lengths or distributions over pole lengths vary in different publications, hindering comparisons and reproducibility. With CARL we provide such distributions to facilitate better comparability and reproducibility for further research.
Our goal is to address all these issues by proposing CARL, a benchmark library that allows general RL agents to be studied reliably and reproducibly. To this end, CARL has well-defined distributions and bounds over the space of environments to generalize to and poses a low barrier of entry in terms of compute. To build on a sound theoretical foundation, we make use of the contextual Reinforcement Learning paradigm (cRL) (Hallak et al., 2015) and build contextual extensions to environments from the literature, including OpenAI Gym (Brockman et al., 2016) and the Brax physics engine (Freeman et al., 2021). The notion of context in the environment enables us to define a variety of tasks and distributions of tasks which an agent can encounter during training. Changes in tasks can be as simple as defining different goal states, more complex by changing the transition dynamics (as with the changes in pole length for the CartPole environment mentioned above) or a combination thereof, leading to varying levels of difficulty. Most importantly and wherever possible, we base the notion of context on real-world physical properties, such as gravity, friction or mass of an object; see Figure 0(a) for an example of a contextually extended environment. Those properties are intuitive to understand and individually adjustable.
The proposed benchmark enables research on generalization capabilities of RL agents in cases where agents are explicitly or implicitly exposed to the context at hand but also in cases where the context is hidden and potentially has to be learned. In particular, we demonstrate the usefulness of our proposed CARL benchmark library by evaluating and discussing:
- The influence of varying context and the importance of contexts in deep RL by increasing learning efficiency via available knowledge of task variations,
- The generalization capability of trained agents to in-distribution environments,
- The generalization capability of trained agents to out-of-distribution environments, and
- The big next challenges for general RL which can be studied in a principled way with CARL.
2 CARL’s Theoretical Foundation: Contextual RL (cRL)
A basic MDP is a 4-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$ consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $\mathcal{T}$ and a reward function $\mathcal{R}$. This abstraction, however, views reward and transition function as fixed, constraining the environment to a single instantiation without room for the variations we would potentially see in real-world applications. In the following we will refer to such a particular instantiation of an environment as an instance.
CARL’s theoretical foundation is built upon contextual RL. It extends the MDP formulation of RL problems to allow for the notion of several instances. For example, an instance sampled from a distribution of instances could determine a different goal position in a maze problem (e.g., Figure 0(b)) or different gravity conditions (e.g., moon vs. earth) for an airborne navigation task. We refer to the information defining the instance at hand as the instance’s context $c$. We note that $c$ is sometimes known to the agent (e.g., a broken leg), sometimes measured with noise (e.g., friction of floor), or maybe even completely unobservable (e.g., mass of an external object). With CARL we provide a variety of such contexts that can influence an agent’s learning and generalization capabilities. We further provide bounds and distributions of these contexts to facilitate better reproducibility and comparability for future research.
In order to incorporate context into the MDP definition, we use contextual MDPs (cMDPs) (Hallak et al., 2015; Modi et al., 2018; Biedenkapp et al., 2020). A contextual MDP $\mathcal{M}_{\mathcal{I}}$ is a collection of MDPs $\mathcal{M}_{\mathcal{I}} = \{\mathcal{M}_i\}_{i \in \mathcal{I}}$ with $\mathcal{M}_i = (\mathcal{S}, \mathcal{A}, \mathcal{T}_i, \mathcal{R}_i)$. This formulation assumes a common state and action space as the underlying task stays the same; however, in each $\mathcal{M}_i$ an agent will potentially be able to reach only a subset of states $\mathcal{S}_i \subseteq \mathcal{S}$. Transition and reward functions may vary between instances.
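To make the cMDP definition more concrete, the following minimal sketch represents a cMDP as a mapping from instance identifiers to MDPs that share state and action spaces but differ in transition and reward functions. The point-mass dynamics, the `MDP` and `make_instance` names and the context keys `gravity` and `goal` are hypothetical and chosen for illustration only; they are not part of CARL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

State = Tuple[float, float]  # (position, velocity), shared by all instances
Action = int                 # 0 = no thrust, 1 = thrust, shared by all instances


@dataclass(frozen=True)
class MDP:
    """One instance M_i with instance-specific transition and reward."""
    transition: Callable[[State, Action], State]
    reward: Callable[[State, Action], float]


def make_instance(context: Dict[str, float]) -> MDP:
    """Build a toy 1-D point-mass instance from a context (hypothetical)."""
    gravity = context["gravity"]
    goal = context["goal"]

    def transition(state: State, action: Action) -> State:
        position, velocity = state
        # Thrust pushes up, gravity pulls down; gravity comes from the context.
        velocity = velocity + (1.0 if action == 1 else 0.0) - 0.1 * gravity
        return (position + velocity, velocity)

    def reward(state: State, action: Action) -> float:
        # The goal position comes from the context as well.
        return -abs(state[0] - goal)

    return MDP(transition=transition, reward=reward)


# The cMDP: a collection of MDPs indexed by instance.
cmdp: Dict[Hashable, MDP] = {
    "moon": make_instance({"gravity": 1.62, "goal": 0.0}),
    "earth": make_instance({"gravity": 9.81, "goal": 0.0}),
}
```

Applying the same action in the same state yields different successor states on `"moon"` and `"earth"`, which is exactly the variation between instances that the context describes.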
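There are several different tasks of interest concerning cMDPs, all of which define an optimal policy $\pi^*$ for a given cMDP in different ways. An example would be the focus on generalization performance, where $\pi^*$ maximizes the expected return across a test set of instances $\mathcal{I}_{test}$ drawn from the same distribution as the training set $\mathcal{I}_{train}$, with discount factor $\gamma$ (Biedenkapp et al., 2020):
$$\pi^* \in \arg\max_{\pi \in \Pi} \frac{1}{|\mathcal{I}_{test}|} \sum_{i \in \mathcal{I}_{test}} \sum_{t=0}^{T} \gamma^t \mathcal{R}_i(s_t, \pi(s_t)),$$
where $\mathcal{R}_i$ is the reward function of instance $i$, $s_t$ is the state at step $t$ and $T$ corresponds to the maximal episode length. In policy transfer, the focus is on the performance across a set of transfer instances specifically, which is often relatively small but rather heterogeneous. Here, the test instance set $\mathcal{I}_{test}$ can largely differ from $\mathcal{I}_{train}$, but the optimal policy would still maximize the mean reward across it just as above. On the other hand, only the final performance on a single, very hard instance $i_{hard}$ might be important and all other instances are only used to work towards that goal. That could be the case in, e.g., curriculum learning. In that case, we use the available training set $\mathcal{I}_{train}$ to find a $\pi^*$ such that:
$$\pi^* \in \arg\max_{\pi \in \Pi} \sum_{t=0}^{T} \gamma^t \mathcal{R}_{i_{hard}}(s_t, \pi(s_t)).$$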
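As a rough illustration of the generalization objective described above, the sketch below evaluates a fixed policy by averaging its discounted return over a set of test instances. It assumes deterministic toy instances given as (transition, reward) pairs over scalar states; all function names are hypothetical and the sketch only scores a policy, it does not search for the optimal one.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Callable[[float, float], float]  # (state, action) -> next state
Reward = Callable[[float, float], float]      # (state, action) -> reward
Instance = Tuple[Transition, Reward]
Policy = Callable[[float], float]             # state -> action


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def episode_rewards(instance: Instance, policy: Policy,
                    start: float, horizon: int) -> List[float]:
    """Roll out the policy for steps t = 0, ..., horizon on one instance."""
    transition, reward = instance
    state = start
    rewards = []
    for _ in range(horizon + 1):
        action = policy(state)
        rewards.append(reward(state, action))
        state = transition(state, action)
    return rewards


def generalization_score(instances: Sequence[Instance], policy: Policy,
                         start: float, horizon: int, gamma: float) -> float:
    """Mean discounted return of the policy across all given instances."""
    returns = [
        discounted_return(episode_rewards(inst, policy, start, horizon), gamma)
        for inst in instances
    ]
    return sum(returns) / len(returns)
```

For the single-hard-instance case, the same `generalization_score` can be called with a one-element instance list, so that only the return on that instance counts.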
cRL in CARL subsumes several other related formulations. For example, Goal-based RL (Florensa et al., 2018) uses the same idea of conditioning the reward function of each task on its specific goal, but is more limited in scope as the environment dynamics stay static throughout. Block MDPs (Du et al., 2019), on the other hand, focus on state representations for generalization. The task here is to learn a representation of the observable space of a family of environments that enables generalization across that family. Just as in cRL, reward functions and transition dynamics both may vary within the family, but the focus is shifted away from learning a policy towards learning a meaningful representation. While the original block MDP did not include specifics about how reward and transition functions differ within the environment family, contextual block MDPs provide the context as additional information (Zhang et al., 2021a). As we have direct access to the context information on all CARL benchmarks, the base case provides context as in a cMDP. However, users are free to switch to a hidden context version that requires representation learning as in a block MDP.
3 Related Work
| Benchmark | Open Source | Explicit Context | Cheap Training¹ | Diverse Tasks | Varying |
|---|---|---|---|---|---|
| MDPP (Rajan et al., 2019) | ⚫ | ⚫ | ✔ | ✔ | ✔ |
| bsuite (Osband et al., 2020) | ✔ | ✘ | ✔ | ✔ | ✘ |
| ALE (Machado et al., 2018) | ✔ | ✘ | ⚫ | ⚫ | ✘ |
| ProcGen (Cobbe et al., 2019) | ✔ | ✘ | ✘ | ✔ | ✘ |
| Alchemy (Wang et al., 2021) | ✔ | ⚫ | ✘ | ⚫ | ✔ |
| Meta-world (Yu et al., 2019) | ✘ | ✔ | ✘ | ⚫ | ⚫ |
| MTEnv (Sodhani et al., 2021a) | ✘ | ✔ | ✘ | ⚫ | ⚫ |
| Safety Gym (Ray et al., 2019) | ✘ | ✘ | ✘ | ⚫ | ✔ |
| TMA (Romac et al., 2021) | ✔ | ⚫ | ✘ | ⚫ | ✔ |
| MiniGrid (Chevalier-Boisvert et al., 2018) | ✔ | ✘ | ✔ | ✔ | ⚫ |
| NetHack (Küttler et al., 2020) | ✔ | ⚫ | ✘ | ✔ | ✘ |
| MiniHack (Samvelyan et al., 2021) | ✔ | ⚫ | ⚫ | ✔ | ✔ |

¹ “Cheap Training” here refers to the total runtime of one agent. This takes into account both the computational cost of the environment itself and the number of training steps necessary to expect results.
Benchmarks for generalization exist in different sub-fields of RL, each with its own focus. MDP Playground (Rajan et al., 2019) and bsuite (Osband et al., 2020) both contain small-scale benchmarks intended to test specific qualities in RL algorithms (e.g., resistance to noise), both for the purpose of development and comparison between different algorithms. In contrast, the focus of CARL is less on assessing RL algorithms against each other on fixed MDPs and more on their generalization capabilities across variations of MDPs. Benchmarks such as those in MDP Playground and bsuite provide valuable feedback for researchers in development before they tackle more complex and opaque problems like the ones we provide.
In game simulations, the Arcade Learning Environment (ALE) has made an effort to include some challenges geared towards policy transfer and generalization in their “flavours” (Machado et al., 2018). However, the bigger challenge in this field is ProcGen (Cobbe et al., 2019). It contains several arcade-style games with procedurally generated level structures. In a similar way, Alchemy (Wang et al., 2021) also provides a procedurally generated benchmark. Even though it only contains a single task, this task is very complex compared to the individual games in ProcGen. Both are challenging benchmarks that require generalization from state observations only. We believe that this approach is less valuable in many applications outside of games, because additional information is most often available. Additionally, while it is possible to specify levels with certain attributes in Alchemy, these procedurally generated benchmarks provide far less fine-grained control over their context than the diverse set of benchmarks in CARL, where users can directly specify their instances and control the similarity of their sampled contexts. CARL’s flexibility allows for a better characterization of agents’ generalization capabilities as well as the possibility of adding custom curricula for each environment.
Multi-task learning requires some amount of generalization, although here the focus is on accelerating the acquisition of skills on completely new tasks. For example, Meta-world (Yu et al., 2019) focuses on skill transfer in a few-shot setting, providing standardized test sets of different sizes. Its tasks are based on MuJoCo, however, which requires a paid license for large-scale experiments and is much more expensive to run than the Brax physics simulator (used as part of CARL), which limits the accessibility of the benchmark. Meta-world is also integrated in MTEnv (Sodhani et al., 2021a). MTEnv provides a strong benchmark for multi-task learning as well as representation learning. CARL can accommodate multi-task learning as well, but the focus is on the multitude of context options available in each of our environments and therefore on generalization across different transition dynamics.
There also exist further related benchmarks in specialized subfields of RL. Safety Gym (Ray et al., 2019) is targeted towards developing and testing algorithms for risk-sensitive domains. TeachMyAgent (Romac et al., 2021) is a benchmark for teacher-student based curriculum learning. Both are well suited to the needs of their communities, but also narrow in their scope. While CARL currently does not provide explicit contexts for either safety or curriculum learning, it could be extended to cover both domains and will be especially relevant for any curriculum learning algorithms not using the teacher-student framework.
Overall, CARL is the only benchmark library that is completely open-source and allows for fine-grained control of context on a diverse set of benchmarks, and thus makes it possible to study the next generation of general RL agents in a reliable and reproducible way. We summarize this comparison in Table 1.
4 The Role of Context in Deep RL and CARL
One important distinction that needs to be made in contextual RL concerns the ease of identifiability of context information. Here we broadly distinguish between explicit contexts, i.e. directly available information provided by the environment, and implicit contexts, i.e. abstract information hidden in the available state. While explicit contexts can directly be used by agents to infer the underlying transition dynamics, implicit contexts potentially need to be disentangled from the provided state. In particular, we argue that deep RL research commonly already makes use of the notion of contexts. This context however is only present in an implicit form in the state, thereby entangling representation learning capabilities with generalization capabilities of a policy.
For example, in the ProcGen maze environment (see Figure 0(b)), an agent can observe the whole maze and is tasked with guiding a mouse from the bottom left corner to the cheese. The maze structure, texture of the walkable tiles and the location of the goal (i.e. cheese) are randomly generated for each new instance. Note that the wall texture never changes and that observations are only available as images. Thus, a capable RL agent could learn to directly extract the location of the mouse and cheese as well as classify which tiles are (not) walkable. This extracted information then allows the agent to perform contextual RL. In particular, Eimer et al. (2021) showed that providing an agent with the coordinates of the agent and goal states as well as a flattened vector representation of a maze allows agents to make use of this context information to transfer behaviours between similar mazes.
A similar argument can be made for more complex environments where a “level” might not be fully observable. For example, in the game Super Mario Bros., see Figure 1(d), Mario needs to reach the goal on the right side of the screen while avoiding enemies. If an agent is made aware of the enemy types appearing in the level through the use of context, this information can be used downstream in the policy net to learn appropriate offensive or defensive behaviour. Another direct context feature could be an indicator of which special ability an agent can use, potentially leading to different behaviour when Mario has picked up a power-up.
We argue that benchmarks using procedural content generation are more suitable for evaluating the representation learning capabilities of agents than their ability to generalize. In fact, the authors of ProcGen (Cobbe et al., 2019) used it to determine that the IMPALA-CNN (Espeholt et al., 2018) architecture is more capable than the Nature-CNN (Mnih et al., 2015a) architecture for their considered setup. Here, we propose benchmarks that provide a ground truth for the changes in underlying transition dynamics to study generalization while also containing more complex environments that can be used to study representation learning. Disentangling these two important tasks will enable researchers to target each more efficiently and ultimately facilitate the development of new RL algorithms targeted towards generalization. We use CARL to demonstrate that an agent making use of context information during training can learn to solve instances more quickly and generalize better than agents that have to infer this information themselves (see Section 6.2). This gives additional evidence that disentangling learning of such contextual features from learning the behaviour policy can improve the generalization capabilities of RL agents (see Appendix D.1 for further discussion).
5 The CARL Benchmarks
In order to gain insight into how the context and its augmentation influence the agent’s learning and behavior, we provide several benchmarks in CARL. As first benchmarks we include and contextually extend classic control and box2d environments from OpenAI Gym (Brockman et al., 2016), Google Brax’ walkers (Freeman et al., 2021), an RNA folding environment (Runge et al., 2019) as well as Super Mario levels (Awiszus et al., 2020; Schubert et al., 2021). See Figure 2 for an overview of included environments. Although each environment has different tasks, goals and mechanics, the behavior of the dynamics and the rewards is influenced by physical properties. A more detailed description of the environments is given in Appendix A. In the following we will discuss the properties of the CARL benchmarks, which are summarized in Figure 3.
Most of our benchmarks have vector-based state spaces that can either be extended to include the context information or not. The notable exceptions here are CARLVehicleRacing and CARLToadGAN, which exclusively use pixel-based observations. The size of the vector-based spaces ranges from only two state variables in the CARLMountainCar environments to 299 for the CARLHumanoid environment.
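One simple way to picture the difference between a hidden and an explicit context on a vector-based state space is to append selected context features to the state vector. The sketch below does exactly that; the function name `build_observation` and the concatenation scheme are assumptions for illustration and may differ from how CARL exposes context internally.

```python
from typing import Dict, List, Optional, Sequence


def build_observation(state: Sequence[float],
                      context: Dict[str, float],
                      visible_features: Optional[Sequence[str]] = None) -> List[float]:
    """Return the state alone (hidden context) or the state followed by
    the selected context features (explicit context)."""
    observation = list(state)
    if visible_features:
        # Keep the order given by the caller so the layout stays fixed.
        observation.extend(context[name] for name in visible_features)
    return observation


# Hypothetical Pendulum-like context with gravity g and pole length l.
context = {"g": 9.81, "l": 1.0}
hidden = build_observation([0.1, 0.2], context)
explicit = build_observation([0.1, 0.2], context, visible_features=["g"])
```

With this layout, an agent trained on `explicit` observations sees the changed feature directly, while an agent trained on `hidden` observations has to infer it from the state transitions.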
We provide both discrete and continuous environments, with six requiring discrete actions and the other ten continuous ones. The actions range from a single dimension to 19.
Quality of Reward
We cover different kinds of reward signals with our benchmarks, ranging from relatively sparse step-penalty style rewards, where the agent only receives a fixed penalty at each step, to complex composite reward functions in e.g. the Brax-based environments. The latter version is also quite informative, providing updates on factors like movement economy and progress towards the goal, whereas the former does not let the agents distinguish between transitions without looking at the whole episode. Further examples of sparse rewards are the CARLCartPoleEnv and CARLVehicleRacingEnv.
While the full details of all possible context configurations can be seen in Appendix G, for brevity we only discuss here the differences between context spaces and the configuration possibilities they provide. Depending on the environment, the context features have different influence on the dynamics and the reward. Of all registered context features, influence the dynamics. This means that if a context feature is changed, the transition from one state into the other is changed as well. Only of the context features shape the reward. Most context features () are continuous; the rest are categorical or discrete. With the explicit availability of context features, CARL lends itself to studying the robustness of agents by adding noise on top of the specific context features. Further, the bounds and sampling distributions of the context spaces provided as part of CARL enable better comparability and reproducibility for future research efforts in the realm of general RL agents.
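The robustness setting mentioned above, in which noise is added on top of specific context features, could be sketched as follows. The Gaussian noise model, the relative noise scale and the helper name `perturb_context` are assumptions made for this illustration, not a description of a CARL function.

```python
import random
from typing import Dict, Optional, Sequence


def perturb_context(context: Dict[str, float],
                    rel_noise: float,
                    rng: random.Random,
                    noisy_features: Optional[Sequence[str]] = None) -> Dict[str, float]:
    """Return a copy of the context in which the selected features carry
    zero-mean Gaussian noise scaled relative to their true value."""
    if noisy_features is None:
        noisy_features = list(context)
    noisy = dict(context)
    for name in noisy_features:
        value = context[name]
        noisy[name] = rng.gauss(value, rel_noise * abs(value))
    return noisy


# The environment would keep using the true context for its dynamics,
# while the agent only observes the noisy measurement.
true_context = {"g": 9.81, "l": 1.0}
measured = perturb_context(true_context, rel_noise=0.1,
                           rng=random.Random(0), noisy_features=["g"])
```

Keeping the true context for the dynamics and handing only `measured` to the agent mirrors the case of a context that is measured with noise.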
Comparing our benchmarks along these attributes, we see a wide spread in most of them (Figure 3). For the first iteration of CARL, we focused on fairly cheap-to-run problems to lower the barrier of entry as much as possible. Nevertheless, as CARL will further grow over time, the diversity of benchmarks will further increase and we will also include harder benchmarks. Already now, CARL provides a benchmarking collection that tasks agents with generalizing in addition to solving the tasks most common in modern RL while providing a platform for reproducible research.
Having discussed CARL’s theoretical foundation as well as its initial set of benchmarks, we now study several first research questions regarding the effects of context. Our experiments are designed to demonstrate that we can use CARL to gain meaningful insights into the Meta-RL setting even on simple environments. Details about the hyperparameter settings and hardware used for all experiments are listed in Appendix B. In each of them, we train and evaluate on different random seeds and a set of sampled contexts. All experiments can be reproduced using the scripts we provide with the benchmark library at https://www.github.com/automl/CARL.
$\sigma_{rel}$ is the relative standard deviation used for sampling the context. The context feature dt refers to the observation interval length, g to gravity, l to the pole length, m to the pole mass and max_speed to the maximal speed of CARLPendulumEnv.
6.1 Q1: How do Context Features Influence an Agent’s Training Performance?
In order to gain an intuition of how context features influence an agent’s training performance, we evaluate a DDPG (Lillicrap et al., 2016) agent on the well-known Pendulum task from OpenAI Gym (Brockman et al., 2016). Through CARL, we can vary the context features gravity (g), integration time step (dt), pendulum mass (m) and length (l) as well as the maximal speed (see Figure 4). In Equation A.1 in the appendix, we show the dynamic system of Pendulum.
To understand how context features influence an agent’s performance, for each considered context feature we sample a set of instances for each task within the ranges provided by the environment specification while keeping the other context features fixed to their default. Each context feature $c_f$ of an instance is sampled from a normal distribution centered around its default value such that $c_f \sim \mathcal{N}(c_{def}, \sigma_{rel} \cdot c_{def})$, where $c_{def}$ is the default value defined in the original environment and $\sigma_{rel}$ is the relative standard deviation, so that the sampling distribution has standard deviation $\sigma_{rel} \cdot c_{def}$. Here we evaluate three relative standard deviations to show the impact of varying similarities of the instance distribution.
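The sampling procedure described above can be sketched in a few lines: one context feature is drawn from a normal distribution centered on its default with a relative standard deviation, and all other features stay at their defaults. The default values below follow the Gym Pendulum parameters, while the bounds, the clipping to those bounds and the function name `sample_contexts` are illustrative assumptions rather than CARL's actual settings.

```python
import random
from typing import Dict, List, Tuple

# Pendulum-like defaults; the bounds are hypothetical placeholders.
DEFAULTS: Dict[str, float] = {"g": 10.0, "m": 1.0, "l": 1.0, "dt": 0.05, "max_speed": 8.0}
BOUNDS: Dict[str, Tuple[float, float]] = {
    "g": (0.0, 50.0), "m": (0.1, 10.0), "l": (0.1, 10.0),
    "dt": (0.001, 1.0), "max_speed": (0.1, 100.0),
}


def sample_contexts(feature: str, rel_std: float, n: int, seed: int) -> List[Dict[str, float]]:
    """Sample n contexts that vary a single feature around its default."""
    rng = random.Random(seed)
    default = DEFAULTS[feature]
    low, high = BOUNDS[feature]
    contexts = []
    for _ in range(n):
        value = rng.gauss(default, rel_std * abs(default))
        # Assumption: out-of-range samples are clipped to the bounds.
        value = min(max(value, low), high)
        context = dict(DEFAULTS)  # all other features keep their defaults
        context[feature] = value
        contexts.append(context)
    return contexts


train_contexts = sample_contexts("g", rel_std=0.5, n=20, seed=0)
```

Clipping is only one way to respect the bounds; rejection sampling would avoid the probability mass that clipping places on the boundary values.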
Further, we treat the context as hidden, only implicitly noticeable to the agent through the observation of the state features. While small changes in the context barely have an impact on the training performance of the agent, large variations of a single context feature can make the learning task challenging (see Figure 3(a)). This gives evidence that even simple, cheap-to-run environments can provide an agent with challenging learning tasks, depending on the level of generalization required. Note that this style of training with implicit contexts is currently the default setting for training on vision-based environments such as ProcGen (see Section 4). We refer to Appendix C for a first impression of the influence of context on a vision-based environment, CARLMarioEnv, showing similar insights as for Pendulum.
6.2 Q2: Are Explicit Context Features Necessary to Learn General Agents?
To answer this question we first use the same agent and environment setup, i.e. DDPG with the same hyperparameters on Pendulum and the widest context distribution (). Our results (see Figure 3(b)) suggest that explicitly making agents aware of the change in transition dynamics generally results in better performance when generalization over strong deviations in context features is required. This is clearly observable by comparing results for context features that have a higher impact on the final performance, such as gravity (g), pole length (l) and pole mass (m). For the fairly low-impact context feature integration time step (dt), making the context visible results in a lower standard deviation and a slightly higher final reward. Still, varying dt led to a minor loss of reward compared to the original Pendulum task (black curve in Figure 3(a)). For the max_speed context, both training variants struggle to achieve as high a reward. In the early training stages, the agent trained with access to the context achieved a lower reward than its counterpart. However, in the latter half it could catch up and slightly improve over the context-oblivious agent.
One context feature that heavily influences the dynamics in the CARLMarioEnv environment is the inertia of Mario. In Figure 5, we see that a higher variation of the inertia improves the performance of the PPO agent and leads to faster training. This effect can be explained by the influence of Mario’s inertia on the exploration of the agent (i.e. a lower inertia makes it easier for the agent to move).
An interesting question for future work is how different context features change the learning behavior of agents and to which degree generalization is impacted by it.
6.3 Q3: To What Extent Can We Transfer a Learned Policy to a New Context?
To gain insight into the extent to which the agent is able to transfer a learned policy to a new context, we create the “Landing in Space” scenario based on the well-known LunarLander environment. To this end, a DQN (Mnih et al., 2015b) agent is trained on a rather narrow context distribution. For testing, we then place this agent in a new context which might not have been observed during training.
Landing in Space
In this task, the agent is challenged to land a spacecraft on seen as well as unseen planets. We model the different planets by only adjusting the gravities by a well-defined normal distribution and train the LunarLander to land on smaller planets. The train distribution is centered on Mars () and the standard deviation () is chosen such that Mars and Moon are considered as in-distribution whilst Pluto, Earth, Neptune and Jupiter are considered as unseen and out-of-distribution, see Figure 5(a). Here we deem planets as in-distribution if their gravity is within the -interval of the training distribution. For training, we sample gravities from this distribution. We use random seeds for training and testing and collect episodes on each planet for both cases where the context is hidden and where the context feature gravity is visible to the agent. Note that although for each test episode on a planet the gravity is fixed and the same, the LunarLander environment generates different initial starting conditions and landscapes to land on. For this reason the lander might still fail to safely land and crash in some cases. To capture crashes and to distinguish them from successful but suboptimal landings, we increase the game over penalty from to during testing.
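The in-distribution criterion used here, namely whether a planet's gravity lies within an interval around the mean of the training distribution, can be sketched as below. The planet gravities are approximate surface values in m/s²; the standard deviation `STD` and the interval width `K` are illustrative assumptions chosen so that the split matches the one described in the text, not the values used in the experiments.

```python
from typing import Dict, List, Tuple

# Approximate surface gravities in m/s^2.
PLANET_GRAVITY: Dict[str, float] = {
    "Pluto": 0.62, "Moon": 1.62, "Mars": 3.72,
    "Earth": 9.81, "Neptune": 11.15, "Jupiter": 24.79,
}

MEAN = PLANET_GRAVITY["Mars"]  # training distribution is centered on Mars
STD = 1.45                     # assumed standard deviation (hypothetical)
K = 1.96                       # assumed interval width in standard deviations


def is_in_distribution(gravity: float, mean: float, std: float, k: float) -> bool:
    """True if the gravity lies within k standard deviations of the mean."""
    return abs(gravity - mean) <= k * std


def split_planets(planets: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split planet names into in-distribution and out-of-distribution."""
    inside, outside = [], []
    for name, gravity in planets.items():
        target = inside if is_in_distribution(gravity, MEAN, STD, K) else outside
        target.append(name)
    return inside, outside


in_dist, out_of_dist = split_planets(PLANET_GRAVITY)
```

Under these assumed values, Moon and Mars fall inside the interval while Pluto lies just outside it, which is consistent with Pluto being described as out-of-distribution yet still close to the training distribution.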
In Distribution Generalization
As expected, the test performances for landing on Mars and Moon are most similar to the Mars-centered training distribution. Agents trained with access to the gravity feature receive higher rewards and crash less often on in-distribution planets than their context-oblivious counterparts, as shown at the bottom of Figure 5(b).
Out of Distribution Generalization
A more interesting and understudied question in RL is the extent to which agents are capable of generalizing to out-of-distribution tasks. With CARL’s ability to define particular distributions over context features, we can study this question in detail. The agent fails least often on Pluto since (a) its gravity is still close to the training distribution and (b) its gravity is low. The lower the gravity, the longer the timeframe to land, which creates easier landing scenarios. We can further note that fewer runs on Pluto lead to rewards as high as on the in-distribution planets Moon and Mars. This is likely due to agents anticipating a harder landing and thus burning more fuel than necessary to counteract. Interestingly, even for more difficult out-of-distribution planets such as Earth and Neptune we can observe positive landing results for both agents trained with and those without access to the context. However, test performance deteriorates due to more frequent crashing on higher-gravity planets. While we have seen some capability of trained agents to transfer even to out-of-distribution environments, we do not expect vanilla agents to generalize to highly different environments.
7 Further Open Challenges Enabled by CARL
Although these experiments only give a first impression of how CARL can be used to gain novel insights into RL agents, we see many more possibilities for future research involving CARL. We discuss six open challenges and how CARL could be used to tackle these. (I) As CARL provides ground truths for all considered context features, it is suitable for studying novel agents that separate representation learning from policy learning. (II) CARL will be useful in studying RL agents for uncertain dynamics, by easily perturbing context features. (III) It is particularly suitable for training and evaluation of continual RL methods by continuously adapting context distributions over time. (IV) The ground truth on contexts can also be used to study explainability and interpretability methods of deep RL. (V) With the complexity of modern RL methods, they have become very sensitive to their hyperparameters. CARL’s flexibility and focus on generalization enables research into AutoRL methods that optimize agents for generality. (VI) Finally, it is an open question for safe RL whether context information could help to decide whether policies are applicable to unseen instances of an environment. Please find a detailed discussion in Appendix D.
8 Limitations and Societal and Ethical Implications
Although in principle some environments of CARL make it possible to study the impact of context on vision-based agents, our analysis focuses on featurized environments. Thus, we did not study different ways of directly exposing context information to vision-based agents, which would require novel architectures to handle this context. We see such experiments and the design of novel agents as future work that can follow from using CARL.
Our experimentation limits itself to static contexts and does not consider learning with dynamic contexts or continual learning. We leave this for future studies since generalization to fixed contexts already poses a major challenge. Lastly, we limited our experiments to varying individual context features. Off-the-shelf agents are not yet designed to be adaptive to contexts, and varying individual features already made learning challenging for the considered agents. With progress in the field we hope that agents will become more flexible and can handle ever more changes in environments.
We foresee no new direct societal and ethical implications other than the known concerns regarding autonomous agents and RL (e.g., in a military context). However, by trying to lower the barrier of entry for Meta-RL research we hope to i) reduce the required compute for future research, ii) facilitate novel designs of RL agents and iii) reach a more diverse research community.
We introduced CARL, a highly flexible benchmark library for enabling studies on generalizable RL via task variations and context features. By employing contextual RL, CARL extends common RL environments by making the context configurable and potentially visible. Besides providing a ready-to-use benchmark library and discussing the role of context in general RL, we ran first experiments to analyse its aspects. Our main insights are that (i) the more the context is varied, the more difficult learning becomes and (ii) making the agent context-aware can facilitate training and increase generalization. In addition, CARL is suitable to study generalization in detail by being able to carefully set instance and context distributions. We provide empirical evidence that current agents can generalize well on in-distribution test instances but fail to do so on out-of-distribution settings. In conclusion, we believe that CARL will be a valuable benchmark to advance on open challenges like generalizing RL, representation and continual learning, safe RL and AutoRL.
Carolin Benjamins, Theresa Eimer and Marius Lindauer acknowledge funding by the German Research Foundation under LI 2801/4-1. André Biedenkapp and Frank Hutter acknowledge funding by the Robert Bosch GmbH.
- Abdolshah et al. (2021) Abdolshah, M., Le, H., George, T. K., Gupta, S., Rana, S., and Venkatesh, S. (2021). A new representation of successor features for transfer across. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1–9. PMLR.
- Arel et al. (2010) Arel, I., Liu, C., Urbanik, T., and Kohls, A. (2010). Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135.
- Awiszus et al. (2020) Awiszus, M., Schubert, F., and Rosenhahn, B. (2020). TOAD-GAN: Coherent style level generation from a single example. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 16.
- Badia et al. (2020) Badia, A., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z., and Blundell, C. (2020). Agent57: Outperforming the atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 507–517. PMLR.
- Biedenkapp et al. (2020) Biedenkapp, A., Bozkurt, H. F., Eimer, T., Hutter, F., and Lindauer, M. (2020). Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework. In Lang, J., Giacomo, G. D., Dilkina, B., and Milano, M., editors, Proceedings of the Twenty-fourth European Conference on Artificial Intelligence (ECAI’20), pages 427–434.
- Biedenkapp et al. (2018) Biedenkapp, A., Marben, J., Lindauer, M., and Hutter, F. (2018). CAVE: Configuration assessment, visualization and evaluation. In Proceedings of the International Conference on Learning and Intelligent Optimization (LION), Lecture Notes in Computer Science. Springer.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. CoRR, abs/1606.01540.
- Chevalier-Boisvert et al. (2018) Chevalier-Boisvert, M., Willems, L., and Pal, S. (2018). Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid.
- Co-Reyes et al. (2021) Co-Reyes, J. D., Miao, Y., Peng, D., Real, E., Le, Q. V., Levine, S., Lee, H., and Faust, A. (2021). Evolving reinforcement learning algorithms. In Proceedings of the International Conference on Learning Representations (ICLR’21). OpenReview.net. Published online: iclr.cc.
- Cobbe et al. (2019) Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. (2019). Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588.
- Du et al. (2019) Du, S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudík, M., and Langford, J. (2019). Provably efficient RL with rich observations via latent state decoding. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1665–1674. PMLR.
- Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P., Sutskever, I., and Abbeel, P. (2016). Rl$^2$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779.
- Eimer et al. (2021) Eimer, T., Biedenkapp, A., Hutter, F., and Lindauer, M. (2021). Self-paced context evaluation for contextual reinforcement learning. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning (ICML’21), volume 139 of Proceedings of Machine Learning Research, pages 2948–2958. PMLR.
- Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. (2018). IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning (ICML’18), volume 80, pages 1406–1415. PMLR.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y., editors, Proceedings of the 34th International Conference on Machine Learning (ICML’17), volume 70, pages 1126–1135. Proceedings of Machine Learning Research.
- Florensa et al. (2018) Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 1514–1523. PMLR.
- Freeman et al. (2021) Freeman, C. D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. (2021). Brax - A differentiable physics engine for large scale rigid body simulation. CoRR, abs/2106.13281.
- Fu et al. (2021a) Fu, H., Tang, H., Hao, J., Chen, C., Feng, X., Li, D., and Liu, W. (2021a). Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the Conference on Artificial Intelligence (AAAI’21), pages 7457–7465. AAAI Press.
- Fu et al. (2021b) Fu, X., Yang, G., Agrawal, P., and Jaakkola, T. (2021b). Learning task informed abstractions. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3480–3491. PMLR.
- Hallak et al. (2015) Hallak, A., Castro, D. D., and Mannor, S. (2015). Contextual markov decision processes. arXiv:1502.02259 [stat.ML].
- J. Parker-Holder et al. (2020) Parker-Holder, J., Nguyen, V., and Roberts, S. J. (2020). Provably efficient online hyperparameter optimization with population-based bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Proceedings of the 33rd International Conference on Advances in Neural Information Processing Systems (NeurIPS’20), volume 33, pages 17200–17211.
- Jaderberg et al. (2017) Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., and Kavukcuoglu, K. (2017). Population based training of neural networks. arXiv:1711.09846 [cs.LG].
- Kaddour et al. (2020) Kaddour, J., Saemundsson, S., and Deisenroth, M. (2020). Probabilistic active meta-learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 20813–20822. Curran Associates, Inc.
- Klink et al. (2020) Klink, P., D’Eramo, C., Peters, J., and Pajarinen, J. (2020). Self-paced deep reinforcement learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Kostas et al. (2021) Kostas, J., Chandak, Y., Jordan, S. M., Theocharous, G., and Thomas, P. (2021). High confidence generalization for reinforcement learning. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5764–5773. PMLR.
- Küttler et al. (2020) Küttler, H., Nardelli, N., Miller, A., Raileanu, R., Selvatici, M., Grefenstette, E., and Rocktäschel, T. (2020). The nethack learning environment. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020.
- Lee et al. (2020) Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5.
- Li et al. (2019) Li, X., Zhang, J., Bian, J., Tong, Y., and Liu, T. (2019). A cooperative multi-agent reinforcement learning framework for resource balancing in complex logistics network. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS, pages 980–988. International Foundation for Autonomous Agents and Multiagent Systems.
- Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Bengio, Y. and LeCun, Y., editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- Lu et al. (2020) Lu, M., Shahn, Z., Sow, D., Doshi-Velez, F., and Lehman, L. H. (2020). Is deep reinforcement learning ready for practical applications in healthcare? A sensitivity analysis of duel-ddqn for hemodynamic management in sepsis patients. In AMIA 2020, American Medical Informatics Association Annual Symposium, Virtual Event, USA, November 14-18, 2020. AMIA.
- Machado et al. (2018) Machado, M., Bellemare, M., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562.
- Matiisen et al. (2020) Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. (2020). Teacher-student curriculum learning. IEEE Trans. Neural Networks Learn. Syst., 31(9):3732–3740.
- Meng and Khushi (2019) Meng, T. and Khushi, M. (2019). Reinforcement learning in financial markets. Data, 4(3):110.
- Mnih et al. (2015a) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015a). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
- Mnih et al. (2015b) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015b). Human-level control through deep reinforcement learning. Nat., 518(7540):529–533.
- Modi et al. (2018) Modi, A., Jiang, N., Singh, S. P., and Tewari, A. (2018). Markov decision processes with continuous side information. In Algorithmic Learning Theory (ALT’18), volume 83, pages 597–618.
- Morimoto and Doya (2000) Morimoto, J. and Doya, K. (2000). Robust reinforcement learning. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, USA, pages 1061–1067. MIT Press.
- Nguyen et al. (2021) Nguyen, S., Duminy, N., Manoury, A., Duhaut, D., and Buche, C. (2021). Robots learn increasingly complex tasks with intrinsic motivation and automatic curriculum learning. Künstliche Intell., 35(1):81–90.
- Osband et al. (2020) Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C., Singh, S., Roy, B. V., Sutton, R. S., Silver, D., and van Hasselt, H. (2020). Behaviour suite for reinforcement learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. (2017). Robust adversarial reinforcement learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2817–2826. PMLR.
- Raffin (2020) Raffin, A. (2020). Rl baselines3 zoo. https://github.com/DLR-RM/rl-baselines3-zoo.
- Raffin et al. (2019) Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., and Dormann, N. (2019). Stable baselines3. https://github.com/DLR-RM/stable-baselines3.
- Rajan et al. (2019) Rajan, R., Diaz, J. L. B., Guttikonda, S., Ferreira, F., Biedenkapp, A., and Hutter, F. (2019). MDP Playground: Controlling dimensions of hardness in reinforcement learning. CoRR, abs/1909.07750.
- Rakelly et al. (2019) Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning (ICML’19), volume 97, pages 5331–5340. PMLR.
- Ray et al. (2019) Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning.
- Rice (1976) Rice, J. (1976). The algorithm selection problem. Advances in Computers, 15:65–118.
- Romac et al. (2021) Romac, C., Portelas, R., Hofmann, K., and Oudeyer, P. (2021). Teachmyagent: a benchmark for automatic curriculum learning in deep RL. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 9052–9063. PMLR.
- Runge et al. (2019) Runge, F., Stoll, D., Falkner, S., and Hutter, F. (2019). Learning to Design RNA. In Proceedings of the International Conference on Learning Representations (ICLR’19). Published online: iclr.cc.
- Samvelyan et al. (2021) Samvelyan, M., Kirk, R., Kurin, V., Parker-Holder, J., Jiang, M., Hambro, E., Petroni, F., Kuttler, H., Grefenstette, E., and Rocktäschel, T. (2021). Minihack the planet: A sandbox for open-ended reinforcement learning research.
- Schubert et al. (2021) Schubert, F., Awiszus, M., and Rosenhahn, B. (2021). Toad-gan: a flexible framework for few-shot level generation in token-based games. IEEE Transactions on Games, pages 1–1.
- Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Bengio, Y. and LeCun, Y., editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- Seo et al. (2020) Seo, Y., Lee, K., Clavera, I., Kurutach, T., Shin, J., and Abbeel, P. (2020). Trajectory-wise multiple choice learning for dynamics generalization in reinforcement learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 12968–12979. Curran Associates, Inc.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Sodhani et al. (2021a) Sodhani, S., Denoyer, L., Kamienny, P., and Delalleau, O. (2021a). Mtenv - environment interface for mulit-task reinforcement learning. Github.
- Sodhani et al. (2021b) Sodhani, S., Zhang, A., and Pineau, J. (2021b). Multi-task reinforcement learning with context-based representations. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9767–9779. PMLR.
- van Rijn and Hutter (2018) van Rijn, J. and Hutter, F. (2018). Hyperparameter importance across datasets. In Guo, Y. and Farooq, F., editors, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 2367–2376. ACM Press.
- Wang et al. (2021) Wang, J., King, M., Porcel, N., Kurth-Nelson, Z., Zhu, T., Deck, C., Choy, P., Cassin, M., Reynolds, M., Song, H., Buttimore, G., Reichert, D., Rabinowitz, N., Matthey, L., Hassabis, D., Lerchner, A., and Botvinick, M. (2021). Alchemy: A structured task distribution for meta-reinforcement learning. CoRR, abs/2102.02926.
- Wang et al. (2017) Wang, J., Kurth-Nelson, Z., Soyer, H., Leibo, J., Tirumala, D., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2017). Learning to reinforcement learn. In Gunzelmann, G., Howes, A., Tenbrink, T., and Davelaar, E., editors, Proceedings of the 39th Annual Meeting of the Cognitive Science Society. cognitivesciencesociety.org.
- Xu et al. (2010) Xu, L., Hoos, H., and Leyton-Brown, K. (2010). Hydra: Automatically configuring algorithms for portfolio-based selection. In Fox, M. and Poole, D., editors, Proceedings of the Twenty-fourth National Conference on Artificial Intelligence (AAAI’10), pages 210–216. AAAI Press.
- Yarats et al. (2021) Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. (2021). Reinforcement learning with prototypical representations. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11920–11931. PMLR.
- Yu et al. (2019) Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. (2019). Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL).
- Zhang et al. (2021a) Zhang, A., Sodhani, S., Khetarpal, K., and Pineau, J. (2021a). Learning robust state abstractions for hidden-parameter block mdps. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net.
- Zhang et al. (2021b) Zhang, H., Chen, H., Boning, D., and Hsieh, C. (2021b). Robust reinforcement learning on state observations with learned optimal adversary. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Zhou et al. (2019) Zhou, W., Pinto, L., and Gupta, A. (2019). Environment probing interaction policies. In Proceedings of the International Conference on Learning Representations (ICLR’19). Published online: iclr.cc.
- Zhou et al. (2017) Zhou, Z., Li, X., and Zare, R. (2017). Optimizing chemical reactions with deep reinforcement learning. ACS central science, 3(12):1337–1344.
Appendix A Benchmark Categories
To encourage generalization in RL, we chose a wide variety of common task characteristics as well as well-known environments as the basis of CARL.
The physical simulation environments (Brax, box2d and classic control) defining a dynamic body in a static world have similar context features like gravity, geometry of the moving body, position and velocity, mass, friction and joint stiffness. For brevity, we only detail the context features of CARLFetch and list all other environments’ context features in Section G of the appendix.
CARLFetch embeds Brax’s Fetch Freeman et al. (2021) as a cMDP, see Figure 0(a). The goal of Fetch is to move the agent to the target area. The context is defined by the following features: joint stiffness, gravity, friction, (joint) angular damping, actuator strength, torso mass, target radius and target distance. The defaults of the context features are copied from the original environment. Furthermore, appropriate bounds must be set for the specific application. We set the bounds such that the environment’s purpose is not violated, e.g., by restricting gravity to always point towards the ground (otherwise the agent would fly up and it would be impossible to act).
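The interplay of defaults and bounds described above can be sketched as follows. This is a minimal illustration and not CARL's actual interface: the class `ContextFeature`, the function `make_context`, the feature names, all numeric values and the sign convention (negative gravity points towards the ground) are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ContextFeature:
    """One configurable context feature with a default and admissible bounds."""

    default: float
    lower: float
    upper: float

    def validate(self, value: float) -> float:
        # Reject values that would violate the purpose of the environment.
        if not (self.lower <= value <= self.upper):
            raise ValueError(f"value {value} outside [{self.lower}, {self.upper}]")
        return value


# Hypothetical defaults and bounds, for illustration only.
# Assumed convention: negative gravity points towards the ground.
FETCH_CONTEXT: Dict[str, ContextFeature] = {
    "gravity": ContextFeature(default=-9.8, lower=-100.0, upper=-0.1),
    "friction": ContextFeature(default=0.6, lower=0.0, upper=10.0),
    "target_radius": ContextFeature(default=2.0, lower=0.1, upper=20.0),
}


def make_context(**overrides: float) -> Dict[str, float]:
    """Start from the defaults and override selected features after validation."""
    context = {name: feature.default for name, feature in FETCH_CONTEXT.items()}
    for name, value in overrides.items():
        context[name] = FETCH_CONTEXT[name].validate(value)
    return context
```

Calling `make_context(gravity=-5.0)` would yield a context that differs from the defaults only in gravity, whereas a positive gravity value would be rejected by the bounds.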
Besides physical simulation environments, CARL provides two more specific, challenging environments. The first is the CARLMarioEnv environment built on top of the TOAD-GAN level generator Awiszus et al. (2020); Schubert et al. (2021). It provides a procedurally generated game playing environment (similarly to the ones discussed in Section 4) that allows customization of the generation process. This environment is therefore especially interesting for exploring representation learning for the purpose of learning to better generalize. Secondly, we move closer to real-world application by including the CARLRNADesignEnvironment Runge et al. (2019). The task here is to design RNA sequences given structural constraints. As two different datasets of structures and their instances are used in this benchmark, it is ideally suited for testing policy transfer between RNA structures.
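A.1 Pendulum’s Dynamic Equations
Because we use gym’s Pendulum Brockman et al. (2016) for our experiments Q1 and Q2 (see Section 6), we provide the dynamic equations to show the simplicity of the system. The state consists of the angular position $\theta$ and the angular velocity $\dot{\theta}$ of the pendulum. The discrete equations defining the behavior of the environment are as follows:
$$\dot{\theta}_{k+1} = \dot{\theta}_k + \left(-\frac{3g}{2l}\sin(\theta_k + \pi) + \frac{3}{m l^2}\,u_k\right)\Delta t, \qquad \theta_{k+1} = \theta_k + \dot{\theta}_{k+1}\,\Delta t$$
Here, $k$ is the index of the iteration/step, $g$ the gravity, $l$ the length of the pendulum, $m$ its mass, $u$ the control input and $\Delta t$ (or dt) the timestep.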
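The discrete update can also be sketched in Python. This is a minimal sketch that mirrors the step function of gym’s Pendulum as we understand it. The function name `pendulum_step`, the variable name `omega` for the angular velocity, the mass `m`, the speed clipping and all default values are assumptions that are not stated in the text above.

```python
import math
from typing import Tuple


def pendulum_step(
    theta: float,
    omega: float,
    u: float,
    g: float = 10.0,
    l: float = 1.0,
    m: float = 1.0,
    dt: float = 0.05,
    max_speed: float = 8.0,
) -> Tuple[float, float]:
    """One discrete update of (theta, omega); theta = 0 is the upright position."""
    # Angular acceleration caused by gravity and by the applied torque u.
    alpha = -3.0 * g / (2.0 * l) * math.sin(theta + math.pi) + 3.0 / (m * l**2) * u
    new_omega = omega + alpha * dt
    # The position update uses the already updated velocity.
    new_theta = theta + new_omega * dt
    # Assumed detail: the velocity is clipped after the position update.
    new_omega = max(-max_speed, min(max_speed, new_omega))
    return new_theta, new_omega
```

Gravity, length, mass and timestep all enter this single update, which is why each of them can serve as a context feature that changes the transition dynamics.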
Appendix B Hardware and Hyperparameters
All experiments on all benchmarks were conducted on a Slurm CPU cluster if not stated otherwise (see Table 2). The experiments for CARLMarioEnv were replicated on a Slurm GPU cluster consisting of 6 nodes with eight RTX 2080 Ti GPUs each.
| Machine no. | CPU model | Cores | RAM |
|---|---|---|---|
| 1 | Xeon E5-2670 | 16 | 188 GB |
| 2 | Xeon E5-2680 v3 | 24 | 251 GB |
| 3-6 | Xeon E5-2690 v2 | 20 | 125 GB |
| 7-10 | Xeon Gold 5120 | 28 | 187 GB |
Hyperparameters and Training Details
We used agents from Stable Baselines3 Raffin et al. (2019) (version 1.1.0) for all of our experiments. For the DQN agent (used in CARLLunarLanderEnv, Section 6.3) and the PPO agent (used in CARLMarioEnv in Section C) we employ the hyperparameters from the RL Baselines3 Zoo Raffin (2020), see Table 3. For the DDPG agent (used for CARLPendulumEnv in Sections 6.1 and 6.2) we use the defaults with an MLP policy. We train each agent for steps. Every steps we evaluate one episode on each train instance and report the mean reward across instances. All experiments can be reproduced using the scripts we provide with the benchmark library at https://www.github.com/automl/CARL.
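The evaluation protocol (one episode per training instance, mean reward across instances, repeated at a fixed interval) can be sketched as follows. This is an illustrative sketch and not the script shipped with the library: `run_episode` and `train_chunk` are hypothetical callbacks standing in for the agent and its training routine, and the step counts are left as parameters.

```python
from statistics import mean
from typing import Callable, List, Mapping, Sequence, Tuple

Context = Mapping[str, float]


def evaluate_on_instances(
    run_episode: Callable[[Context], float],
    train_contexts: Sequence[Context],
) -> float:
    """Run one episode per training instance and average the episode returns."""
    returns = [run_episode(context) for context in train_contexts]
    return mean(returns)


def train_with_periodic_eval(
    train_chunk: Callable[[int], None],
    run_episode: Callable[[Context], float],
    train_contexts: Sequence[Context],
    total_steps: int,
    eval_every: int,
) -> List[Tuple[int, float]]:
    """Alternate training chunks with evaluations; return (step, mean return) pairs."""
    history = []
    for step in range(eval_every, total_steps + 1, eval_every):
        train_chunk(eval_every)  # hypothetical callback: train for `eval_every` steps
        history.append((step, evaluate_on_instances(run_episode, train_contexts)))
    return history
```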
Appendix C Additional Experimental Results
To further illustrate the influence of varying context, we show experimental results for a PPO agent trained on the CARLMarioEnv. Again, the agent is trained for different random seeds and with training instances. In CARLMarioEnv, different instances (Mario levels) are created using TOAD-GAN Schubert et al. (2021). By varying the noise input vector for TOAD-GAN, we can generate different levels; the greater the noise, the greater the differences from the original level. Because CARLMarioEnv is pixel-based, the context is implicitly encoded in the state, and we hide the context from the agent. As we can see in Figure 7, a diverse training distribution () increases the performance and facilitates generalization. On the other hand, if the noise becomes too large (), the performance decreases again. A reason might be that the levels for this noise level are very diverse and thus the current setup, with only implicit context, might not be suitable.
In the main paper we analyzed acting on in-distribution and out-of-distribution instances. For this we vary the context feature ’gravity’ for CARLLunarLanderEnv, which extends gym’s LunarLander Brockman et al. (2016). In the first experiment we defined our training distribution as a Gaussian distribution, see Figure 7(a) for the actual gravity values used for training. Now, we trained our agent on 5 random seeds on 100 contexts distributed on two gravity intervals, and (see Figure 7(b)). In one case we hid the context from the agent, in the other case the context was visible. For the latter, we only added the changing context feature, i.e., the gravity, to the state. In general, providing the context shows a clear benefit: the agent reaches a higher reward in all cases, see Figure 7(c). Furthermore, we can observe that the difficulty increases with a higher magnitude of gravity. This is because the agent has less time to act before reaching the ground if the gravity increases. In addition, if we compare the performance on the gravities (in-distribution) to the performance on (out-of-distribution), we notice more crashes in the out-of-distribution case.
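The difference between a hidden and a visible context can be sketched as a small observation-building step. This is a minimal sketch under our own naming: `build_observation` and the feature name `gravity` are hypothetical and do not refer to the library's interface.

```python
from typing import List, Mapping, Optional, Sequence


def build_observation(
    state: Sequence[float],
    context: Mapping[str, float],
    visible_features: Optional[Sequence[str]] = None,
) -> List[float]:
    """Return the state, optionally extended by selected context features."""
    observation = list(state)
    if visible_features:
        # Visible context: append only the features that actually vary.
        observation.extend(context[name] for name in visible_features)
    # Hidden context: the observation is just the original state.
    return observation
```

With `visible_features=["gravity"]` the agent receives the state plus one extra entry, which corresponds to adding only the changing context feature to the state.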
Appendix D Open Challenges Enabled by CARL
We used CARL to demonstrate the usefulness of a benchmark that can provide the ground truth of available context information. Based on that, we showed that making such information about the environment explicitly available to the agent enables faster training and transfer of agents (see Section 6). While this already provides valuable insights to a community that increasingly cares about learning agents capable of generalization (see Sections 1 & 3), CARL also enables the study of further open challenges for general RL.
D.1 Challenge I: Representation Learning
Our experiments using CARL demonstrated that an agent that is given access to context information is capable of learning better than an agent that has to learn behaviours given only an implicit context via state observations. This provides evidence that disentangling the representation learning aspect from the policy learning task reduces complexity. As CARL provides a ground truth for representations of environment properties, we envision future work on principled studies of novel RL algorithms that, by design, disentangle representation learning and policy learning (see, e.g., Rakelly et al. (2019); Fu et al. (2021a); Zhang et al. (2021a) as first works along this line of research). The ground truth given by the context would allow us to measure the quality of learned representations and to relate it to the true physical properties of an environment.
Another use case of CARL we envision under the umbrella of representation learning follows the work on environment probing policies Zhou et al. (2019). There, exploratory policies are learned that make it possible to identify which environment type an agent encounters. This is complementary to the previously discussed approaches, as representations are not learned jointly with the behaviour policies but rather in a separate offline phase. Based on CARL, large amounts of meta-data could be collected that would enable the community to make use of classical meta-algorithmic approaches such as algorithm selection Rice (1976) for selecting previously learned policies or learning approaches.
D.2 Challenge II: Uncertainty of RL Agents
With access to context information, CARL enables studying the influence of noise on RL agents in a novel way. While prior environments enabled studies of the behaviour of agents when they could not be certain about their true state in a particular environment, CARL further allows studying agents’ behaviour in scenarios with uncertainty about their current context, e.g., because of noise on the context features. In practical deployment of RL, this is a reasonable concern since context features have to be measured by potentially noisy sensors. As this setting affects the overall transition dynamics, CARL provides a unique test-bed for studying the influence of such uncertainty and how RL agents can deal with it.
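One way to emulate noisy context measurements is to keep the true context for the environment dynamics and hand the agent a perturbed copy. The following is a minimal sketch under our own assumptions: the function `observe_noisy_context`, the Gaussian noise model and the relative noise scale `rel_std` are illustrative choices, not part of the benchmark.

```python
import random
from typing import Dict, Mapping


def observe_noisy_context(
    context: Mapping[str, float], rel_std: float, rng: random.Random
) -> Dict[str, float]:
    """Return the context as a noisy sensor might report it.

    The true context is left untouched, so the environment dynamics can
    still be driven by the exact values.
    """
    return {
        # Assumed noise model: zero-mean Gaussian, scaled by the feature magnitude.
        name: value + rng.gauss(0.0, rel_std * abs(value))
        for name, value in context.items()
    }
```

Passing a seeded `random.Random` instance keeps the perturbations reproducible across runs.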
D.3 Challenge III: Continual Learning
With the flexibility and easy modifiability of CARL’s provided contexts, CARL is suitable for studying continual reinforcement learning agents. In this setting, the distributions provided by CARL could be modified, e.g., gradually shifted, during the training procedure. For example, CARL could be used to evaluate the behaviour of an agent in the Brax environments where one or more joints become stiffer over time. A learning agent would need to be able to handle this and adapt its gait accordingly. In particular, one could at some point “repair” the agent and reset the joints to their original stiffness. This would then allow evaluating whether the agent has “unlearned” the original gait. In the same way, CARL also allows studying how agents react to spontaneous, drastic changes, e.g., broken legs, or to changes of the environment such as changing weather conditions.
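The stiffening-joint scenario can be sketched as a schedule that maps the training step to a context value. This is a hypothetical sketch: the function `stiffness_schedule`, the linear drift and the numeric range are assumptions made for illustration.

```python
from typing import Optional


def stiffness_schedule(
    step: int,
    total_steps: int,
    start: float = 1.0,
    end: float = 3.0,
    repair_at: Optional[int] = None,
) -> float:
    """Joint-stiffness value at a training step: linear drift with optional reset."""
    if repair_at is not None and step >= repair_at:
        # The joint is "repaired" and back at its original stiffness.
        return start
    # Fraction of training completed, kept within [0, 1].
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)
```

Setting the context from such a schedule at the start of each episode would produce a gradually shifting task, and `repair_at` marks the point from which one could test whether the original gait has been retained.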
D.4 Challenge IV: Interpretable and Explainable Deep RL
Trust is a crucial factor, for which interpretability or explainability is often mandatory. With the ground truth provided through the explicit use of context features, CARL could be the basis for studying interpretability and explainability of (deep) RL. By enabling AutoRL studies and different representation learning approaches, CARL will contribute to a better interpretation of training procedures.
CARL further allows studying explainability on the level of learned policies. We propose to study the sensitivity of particular policies to different types of context. Thus, the value and variability of a context might serve as a proxy to explain the resulting learned behavior. Such insights might then be used to predict what policies might look like or how they might act (e.g., in terms of frequency of action usage) in novel environments, solely based on the provided context features.
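A very simple form of such a sensitivity study sweeps one context feature while keeping all others fixed and records how much the performance of a fixed policy changes. The sketch below is only one possible measure: `context_sensitivity` and the callback `evaluate`, which is assumed to return the mean return of a policy for a given context, are hypothetical names.

```python
from typing import Callable, Mapping, Sequence


def context_sensitivity(
    evaluate: Callable[[Mapping[str, float]], float],
    base_context: Mapping[str, float],
    feature: str,
    values: Sequence[float],
) -> float:
    """Spread of the return when sweeping one context feature, others fixed."""
    returns = []
    for value in values:
        context = dict(base_context)  # copy, so the base context stays unchanged
        context[feature] = value
        returns.append(evaluate(context))
    return max(returns) - min(returns)
```

Comparing this spread across features would give a first indication of which context features a learned policy reacts to most strongly.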
D.5 Challenge V: AutoRL
AutoRL (e.g., Jaderberg et al. (2017); Runge et al. (2019); J. Parker-Holder et al. (2020); Co-Reyes et al. (2021)) addresses the optimization of the RL learning process. To this end, hyperparameters, architectures or both of agents are adapted either on the fly Jaderberg et al. (2017) or once at the beginning of a run Runge et al. (2019). However, as AutoRL typically requires large compute resources for this procedure, optimization is most often done only on a per-environment basis. It is reasonable to assume that such hyperparameters might not transfer well to unseen environments, as the learning procedures were not optimized to be robust or to facilitate generalization, but only to improve the reward on a particular instance.
As CARL provides easy-to-use contextual extensions of a diverse set of RL problems, it could be used to drive research in this open challenge of AutoRL. First of all, it enables a large-scale study to understand how static and dynamic configuration approaches complement each other and when one approach is to be preferred over another. Such a study will most likely also lead to novel default hyperparameter configurations that are more robust and tailored to fast learning and good generalization. In addition, it will open up the possibility to study whether it is reasonable to use a single hyperparameter configuration or whether a mix of configurations for different instances is required Xu et al. (2010). Furthermore, with the flexibility of defining a broad variety of instance distributions for a large set of provided context features, experiments with CARL would allow researchers to study which hyperparameters play a crucial role in learning general agents, similar to studies done for supervised machine learning van Rijn and Hutter (2018) or AI algorithms Biedenkapp et al. (2018).
D.6 Challenge VI: High Confidence Generalization
The explicit context of the CARL benchmark enables tackling another challenge in the field of safe RL. High Confidence Generalization algorithms (HCGAs) Kostas et al. (2021) provide safety guarantees for the generalization of agents in testing environments. Given a worst-case performance bound, the agent can be tasked to decide whether a policy is applicable in an out-of-distribution context or not. This setting is especially important for the deployment of RL algorithms in the real world where policy failures can be costly and the context of an environment is often prone to change. CARL has the potential to facilitate the development of HCGAs that base their confidence estimates on the context of an environment.
Appendix E Future Maintenance
As our benchmark draws from several different RL environments as dependencies, we realize that it will need regular maintenance and updating. Furthermore, we would like to include more benchmarks and options that are closer to real-world applications. In part, we of course hope that the community will embrace CARL and work with us to extend it in order to match the needs of researchers working in cRL. We acknowledge, however, that relying on community driven progress only is infeasible. Therefore we commit to updating the current benchmark version including its dependencies at least twice a year or whenever critical updates in dependencies are released. As we plan to continue using GitHub for hosting, versioning as well as providing continued access to previous versions is feasible. We also aim to fix any issues that are brought to our attention in a reasonable timeframe. In case community-driven benchmarks are added, we will ensure the continued functionality of the benchmark as a whole (as far as our resources will allow). As we are researching solution methods in the field of cRL ourselves, we expect to contribute further benchmarks of our own as well.
Appendix F Statement
The authors acknowledge that they bear all responsibility in case of violation of rights, etc., and confirm the data license.
Appendix G Context Features for Each Environment