CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning

by Ossama Ahmed, et al.
Max Planck Society
McGill University

Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.



1 Introduction

Figure 1: Example of do-interventions on exposed variables in CausalWorld.

Benchmarks have played a crucial role in advancing entire research fields, for instance computer vision with the introduction of CIFAR-10 and ImageNet (Krizhevsky et al., 2009, 2012). When it comes to the field of reinforcement learning (RL), similar breakthroughs have been achieved in domains such as game playing (Mnih et al., 2013; Silver et al., 2017), learning motor control for high-dimensional simulated robots (Akkaya et al., 2019), multi-agent settings (Baker et al., 2019; Berner et al., 2019) and for studying transfer in the context of meta-learning (Yu et al., 2019). Nevertheless, trained agents often fail to transfer the knowledge about the learned skills from a training environment to a different but related environment sharing part of the underlying task structure. This can be attributed to the fact that it is quite common to evaluate an agent on the training environments themselves, which leads to overfitting on these narrowly defined environments (Whiteson et al., 2011), or that algorithms are compared using highly engineered and biased reward functions which may result in learning suboptimal policies with respect to the desired behaviour; this is particularly evident in robotics.

In existing benchmarks (Yu et al., 2019; Goyal et al., 2019a; Cobbe et al., 2018; Bellemare et al., 2013; James et al., 2020) the amount of shared causal structure between the different environments is mostly unknown. For instance, in the Atari Arcade Learning environments, it is unclear how to quantify the underlying similarities between different Atari games and we generally do not know to which degree an agent can be expected to generalize. To overcome these limitations, we introduce a novel benchmark in a robotic manipulation environment which we call CausalWorld. It features a diverse set of environments which, in contrast to previous designs, share a large set of parameters and parts of the causal structure. Being able to intervene on these parameters (individually or collectively) permits the experimenter to evaluate agents’ generalization abilities with respect to different types and extents of changes in the environment. These parameters can be varied gradually, which yields a continuum of similar environments. This allows for fine-grained control of training and test distributions and the design of learning curricula.

Figure 2: Example tasks from the task generators provided in the benchmark (panels include Pick and Place and Stacked Blocks). The goal shape is visualized in opaque red and the blocks in blue.

A remarkable skill that humans learn to master relatively early on in their life is building complex structures using their spatial-reasoning and dexterous manipulation abilities (Casey et al., 2008; Caldera et al., 1999; Kamii et al., 2004). Playing with toy blocks constitutes a natural environment for children to develop important visual-spatial skills, helping them ‘generalize’ in building complex composition designs from presented or imagined goal structures (Verdine et al., 2017; Nath and Szücs, 2014; Dewar, 2018; Richardson et al., 2014). Inspired by this, CausalWorld is designed to aid in learning and investigating these skills in a corresponding simulated robotics manipulation environment of the open-source TriFinger robot platform from Wüthrich et al. (2020), which can be built in the real world. Tasks are formulated as building 3D goal shapes by manipulating a set of available blocks, as seen in Fig. 1. This yields a diverse family of tasks, ranging from relatively simple (e.g. pushing a single object) to extremely hard (e.g. building a complex structure from a large number of objects).

CausalWorld improves upon previous benchmarks by exposing a large set of parameters in the causal generative model of the environments, such as weight, shape and appearance of the building blocks and the robot itself. The possibility of intervening on any of these properties at any point in time allows one to set up training curricula or to evaluate an agent’s generalization capability with respect to different parameters. Furthermore, in contrast to previous benchmarks (Chevalier-Boisvert et al., 2018; Cobbe et al., 2018), researchers may build their own real-world platform of this simulator at low cost, as detailed in Wüthrich et al. (2020), and transfer their trained policies to the real world.

Finally, by releasing this benchmark we hope to facilitate research in causal structure learning, i.e. learning the causal graph (or certain aspects of it) as we operate in a complex real-world environment whose dynamics follow the laws of physics which induce causal relations between the variables. Changes to the variables we expose can be considered do-interventions on the underlying structural causal model (SCM). Consequently, we believe that this benchmark offers an exciting opportunity to investigate causality and its connection to RL and robotics.
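To make the notion of a do-intervention concrete, the following is a minimal sketch of a toy structural causal model. The variable names (`block_mass`, `push_force`, `acceleration`), their ranges and the frictionless mechanism are illustrative assumptions and are not taken from CausalWorld itself.

```python
import random
from typing import Dict, Optional


def sample_scm(rng: random.Random,
               do: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Sample a toy SCM; any variable listed in `do` has its mechanism replaced by a constant."""
    do = do if do is not None else {}
    values: Dict[str, float] = {}
    # Exogenous causes (hypothetical ranges, chosen only for illustration).
    values["block_mass"] = do.get("block_mass", rng.uniform(0.02, 0.10))
    values["push_force"] = do.get("push_force", rng.uniform(0.5, 1.0))
    # Effect: acceleration is caused by force and mass (friction ignored in this sketch).
    values["acceleration"] = do.get(
        "acceleration", values["push_force"] / values["block_mass"]
    )
    return values


rng = random.Random(0)
observational = sample_scm(rng)
# do(block_mass = 0.05, push_force = 1.0): the effect follows from the fixed causes.
on_causes = sample_scm(rng, do={"block_mass": 0.05, "push_force": 1.0})
# do(acceleration = 0.0): fixing the effect leaves its causes untouched.
on_effect = sample_scm(rng, do={"acceleration": 0.0})
```

Intervening on a cause propagates to its effect, while intervening on the effect leaves the causes distributed as before. This asymmetry is what distinguishes an intervention from conditioning on an observed value.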

Our main contributions can be summarized as follows:

  • We propose CausalWorld, a new benchmark comprising a parametrized family of robotic manipulation environments for advancing out-of-distribution generalization and causal structure learning in RL.

  • We provide a systematic way of defining curricula and disentangling generalization abilities of trained agents by allowing do-interventions to be performed on all environment variables (parameters and states).

  • We establish baseline results for some of the available tasks under different learning algorithms, thus verifying the feasibility of the tasks.

  • We show how different learning curricula affect generalization across different axes by reporting some of the in-distribution and out-of-distribution generalization capabilities of the trained agents.

2 CausalWorld Benchmark

Table 1: Comparison of CausalWorld with RLBench (James et al., 2020), MetaWorld (Yu et al., 2019), IKEA (Lee et al., 2019), BabyAI (Chevalier-Boisvert et al., 2018), CoinRun (Cobbe et al., 2018), AtariArcade (Bellemare et al., 2013) and MuJoBan etc. (Mirza et al., 2020). Each benchmark is marked (✓/✗) on the following criteria: do-interventions interface; procedurally generated environments; online distribution of tasks; setup custom curricula; disentangle generalization ability; real-world similarity; open-source robot; low-level motor control; long-term planning; unified success metric.

Here we make the desiderata outlined in the introduction more precise:

  1. The set of environments should be sufficiently diverse to allow for the design of challenging transfer tasks.

  2. We need to be able to intervene on different properties (e.g. masses, colors) individually, such that we can investigate different types of generalization.

  3. It should be possible to convert any environment to any other environment by gradually changing its properties through interventions; this requirement is important for evaluating different levels of transfer and for defining curricula.

  4. The environments should share some causal structure to allow algorithms to transfer the learned causal knowledge from one environment to another.

  5. There should be a unified measure of success, such that an objective comparison can be made between different learning algorithms.

  6. The benchmark should make it easy for users to define meaningful distributions of environments for training and evaluation. In particular, it should facilitate evaluation of in-distribution and out-of-distribution performance.

  7. The simulated benchmark should have a real-world counterpart to allow for sim2real.

In light of these desiderata, we propose a setup in which a robot must build goal shapes using a set of available objects. It is worth noting that similar setups were proposed previously in less realistic settings (Janner et al., 2018; Bapst et al., 2019; McCarthy et al.; Akkaya et al., 2019; Fahlman, 1974; Winston, 1970; Winograd, 1972). Specifically, a task is formulated as follows: given a set of available objects, the agent needs to build a specific goal structure (see Fig. 1 for an example). The vast number of possible target shapes and environment properties (e.g. mass, shape and appearance of objects and the robot itself) makes this a diverse and challenging setting to evaluate different generalization aspects. CausalWorld is a simulated version (using the Bullet physics engine (Coumans and others, 2013)) of the open-source TriFinger robot platform from Wüthrich et al. (2020). Each environment is defined by a set of variables such as gravity, floor friction, stage color, floor color, joint positions, various block parameters (e.g. size, color, mass, position, orientation), link colors, link masses and the goal shape. See Table 3 in the Appendix for a subset of these variables.

Desideratum 1 is satisfied since different environment properties and goal shapes give rise to very different tasks, ranging from relatively easy (e.g. re-positioning a single cube) to extremely hard (e.g. building a complex structure). Desideratum 2 is satisfied because we allow for arbitrary interventions on these properties, hence users or agents may change parameters individually or jointly. Desideratum 3 is satisfied because the parameters can be changed gradually. Desideratum 4 is satisfied because all the environments share the causal structure of the robot, and one may also use subsets of environments which share even more causal structure. We satisfy desideratum 5 by defining the measure of success for all environments as the volumetric overlap of the goal shape with available objects. Further, by splitting the set of parameters into a set A, intended for training and in-distribution evaluation, and a set B, intended for out-of-distribution evaluation, we satisfy desideratum 6. Finally, since the TriFinger robot (Wüthrich et al., 2020) can be built in the real world, we satisfy desideratum 7. Desiderata 7 and 2 are in partial conflict, since sim2real is only possible for tasks that are constrained to the variables on which the robot can physically act.
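Desideratum 3 (gradually converting one environment into another) can be illustrated with a small sketch that linearly interpolates between two parameter settings. The function name `interpolate_tasks`, the variable names and the numeric values are hypothetical; this is not the benchmark's own curriculum mechanism.

```python
from typing import Dict, List


def interpolate_tasks(start: Dict[str, float],
                      target: Dict[str, float],
                      num_steps: int) -> List[Dict[str, float]]:
    """Return num_steps + 1 parameter settings moving linearly from start to target."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if set(start) != set(target):
        raise ValueError("both tasks must expose the same variables")
    curriculum = []
    for step in range(num_steps + 1):
        alpha = step / num_steps  # 0.0 at the initial task, 1.0 at the target task
        curriculum.append({
            name: (1.0 - alpha) * start[name] + alpha * target[name]
            for name in start
        })
    return curriculum


# Hypothetical variable names and values, chosen only for illustration.
easy_task = {"block_mass": 0.02, "floor_friction": 0.5}
hard_task = {"block_mass": 0.10, "floor_friction": 0.3}
stages = interpolate_tasks(easy_task, hard_task, num_steps=4)
```

Each element of `stages` is one point on a straight line between the two tasks in parameter space, so applying them in order as interventions would give one possible curriculum from the easy task to the hard one.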

Task generators:

To generate meaningful families of similar goal shapes, CausalWorld allows for defining task generators which can generate a variety of different goal shapes in an environment. For instance, one task generator may generate pushing tasks, while another one may generate tower-building tasks (see Fig. 2). Each task generator is initialized with a default goal shape from its corresponding family and comes with a sampler to sample new goal shapes from the same family. Additionally, upon construction, one can specify the environments’ initial state and initial goal shape structure when deviating from the default. The maximum episode time to build a given shape is seconds. CausalWorld comes with eight pre-defined task generators (see Fig. 2).

  • Three generators create goal shapes with a single block: Pushing with the goal shape on the floor, Picking having the goal shape defined above the floor and Pick and Place where a fixed obstacle is placed between the initial block and goal pose.

  • Stacking2 involves a goal shape of two stacked blocks, which can also be considered one instance of the Towers generator.

  • The remaining generators use a variable number of blocks to generate much more complex and challenging target shapes with details in the appendix: Towers, Stacked Blocks, Creative Stacked Blocks and General.

Given that building new environments using current physics simulators is often tedious, we provide a simple API for users who wish to create new task generators, for new challenging shape families which may be added to CausalWorld’s task generators repository.

Action and Observation Spaces:

The robot’s action space can be chosen to operate in either joint position control mode, joint torque control mode, end-effector position control mode, or the delta of each. To address the different challenges in using high-dimensional visual observations as well as using a structured representation, we provide two observation modes: structured and pixel. In the structured mode, the low-dimensional observation vector follows a common rule for the ordering of the relevant variables, such as joint positions, joint velocities, block linear velocities and the time left for the task. Thus, the observation space size depends on the number of blocks, which could potentially change with every new goal sampled, e.g. in Towers, (Creative) Stacked Blocks and General; the dimensionality of the observation vector therefore varies across different environments. In contrast, in the pixel mode, the agent receives six different RGB images: the first three images are rendered from different cameras mounted around the TriFinger robot, and the last three images specify the goal image of the target shape rendered from the same cameras. This mode can be mirrored on the real robotic platform and aids in investigating object-based learning approaches from pixel data as well as learning visual goal-conditioned policies. Additionally, CausalWorld allows for setting up a fully customized observation space, if needed.


The reward function is defined uniformly across all possible goal shapes as the fractional volumetric overlap of the blocks with the goal shape, which ranges between 0 (no overlap) and 1 (complete overlap). This shared success metric can be returned at each time step where its scale is independent of the goal shape itself. Thus, an agent that learned this shared reward function from several different tasks could in principle use it to solve unseen goal structures. There is also the possibility of modifying the reward function to 1) sparsify the reward further by returning a binary reward signal instead, or 2) add a dense reward function in order to introduce inductive biases via domain knowledge and solution guidance. We hope that the considerable complexity and diversity of goal shapes motivate and accelerate the development of algorithms that are not dependent on highly tuned reward functions anymore.
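As an illustration of a fractional volumetric overlap, the sketch below computes the fraction of the goal volume covered by the blocks. It assumes axis-aligned cuboids, blocks that do not overlap each other, and goal cuboids that do not overlap each other; the normalization by goal volume and all names (`Box`, `fractional_overlap`) are assumptions made for this example, and the benchmark's actual computation may differ.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner) of an axis-aligned cuboid


def box_volume(box: Box) -> float:
    lo, hi = box
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, hi[axis] - lo[axis])
    return volume


def intersection_volume(a: Box, b: Box) -> float:
    volume = 1.0
    for axis in range(3):
        lo = max(a[0][axis], b[0][axis])
        hi = min(a[1][axis], b[1][axis])
        if hi <= lo:
            return 0.0  # no overlap along this axis means no overlap at all
        volume *= hi - lo
    return volume


def fractional_overlap(blocks: Sequence[Box], goal: Sequence[Box]) -> float:
    """Fraction of the goal volume covered by blocks, in [0, 1]."""
    goal_volume = sum(box_volume(g) for g in goal)
    if goal_volume == 0.0:
        return 0.0
    covered = sum(intersection_volume(b, g) for b in blocks for g in goal)
    return min(1.0, covered / goal_volume)


# Goal: two unit cubes stacked on top of each other.
goal = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), ((0.0, 0.0, 1.0), (1.0, 1.0, 2.0))]
# Only the lower cube is in place; the second block lies elsewhere on the floor.
blocks = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), ((3.0, 0.0, 0.0), (4.0, 1.0, 1.0))]
score = fractional_overlap(blocks, goal)
```

Under these assumptions, placing only the lower of two stacked cubes yields a score of 0.5, and the score is independent of the absolute size of the goal shape.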

Figure 3: Key components for generic training and evaluation of RL agents. Left: A learning curriculum which is composed of various intervention actors that decide which variables to intervene on (for a valid intervention, values need to be in the allowed training space (ATS)). Right: Evaluation protocols are shown which may intervene on variables at episode resets or within episodes (for a valid intervention, values need to be in the evaluation space (ES)). Middle: We represent the ATS and ES, where each intervention results in one point in the spaces. As shown, ATS and ES may intersect, e.g. if the protocols are meant to evaluate in-distribution generalization. A learning curriculum is represented by subsequent interventions navigating the ATS, resulting in the corresponding points in the space.

Training and evaluation spaces:

In this benchmark, a learning setting consists of an allowed training space (ATS) and an evaluation space (ES), both of which are subspaces of the full parameter space. During training, in the simplest setting, parameters are sampled iid from the ATS. However, unlike existing benchmarks, CausalWorld allows for curricula within the ATS as well as settings where the agent itself intervenes on the parameters within an episode (see Fig. 3). Similarly, during evaluation, parameters may be sampled iid from the evaluation space at each episode reset, or there can be interventions within an episode. Moreover, in order to retrieve the setting considered in most RL benchmarks, we could set the ATS and the ES to be identical and intervene only on object and robot states (and keep other environment properties constant) at each episode reset. However, to evaluate out-of-distribution generalization, one should set the two spaces (ATS and ES) to be different; possibly even disjoint. Additionally, to evaluate robustness with respect to a specific parameter (e.g. object mass), one may define the training and evaluation spaces to only differ in that particular parameter. In order to facilitate the definition of appropriate training and evaluation settings, we pre-define two disjoint sets, $A_i$ and $B_i$, for each parameter $i$. Through this, one can for instance define the training space to be $A_1 \times A_2 \times \dots$ and the evaluation space to be $B_1 \times B_2 \times \dots$ to assess generalization with respect to all parameters simultaneously. Alternatively, the evaluation space could be defined as $A_1 \times \dots \times B_i \times \dots$ to assess generalization with respect to parameter $i$ only. Lastly, users may also define their own spaces which could then be integrated into the benchmark to give rise to new learning settings.
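The construction of training and evaluation spaces from two disjoint per-parameter sets (one intended for training, one for out-of-distribution evaluation) can be sketched as follows. The parameter names, the numeric ranges and the helpers `make_space` and `sample` are hypothetical and serve only to illustrate the idea.

```python
import random
from typing import Dict, Iterable, Tuple

Range = Tuple[float, float]

# Hypothetical disjoint ranges per parameter: set A for training, set B for evaluation.
SPACE_A: Dict[str, Range] = {"block_mass": (0.01, 0.04), "floor_friction": (0.3, 0.6)}
SPACE_B: Dict[str, Range] = {"block_mass": (0.05, 0.10), "floor_friction": (0.7, 0.9)}


def make_space(shift_to_b: Iterable[str] = ()) -> Dict[str, Range]:
    """Product of the A ranges, with the B range swapped in for the named parameters."""
    shifted = set(shift_to_b)
    unknown = shifted - set(SPACE_A)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {
        name: (SPACE_B[name] if name in shifted else SPACE_A[name])
        for name in SPACE_A
    }


def sample(space: Dict[str, Range], rng: random.Random) -> Dict[str, float]:
    """Draw each parameter independently and uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}


rng = random.Random(0)
training_space = make_space()                     # all parameters from A
eval_all = make_space(SPACE_A.keys())             # all parameters from B
eval_mass_only = make_space(["block_mass"])       # only block mass is out of distribution
episode_parameters = sample(eval_mass_only, rng)  # one iid draw at an episode reset
```

Evaluating on `eval_mass_only` isolates robustness to a change in mass, whereas `eval_all` shifts every parameter at once.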

Intervention actors:

To provide a convenient way of specifying learning curricula, we introduce intervention actors. At each time step, such an actor takes all the exposed variables of the environment as inputs and may intervene on them. To encourage modularity, one may combine multiple actors in a learning curriculum. Each actor is defined by the episode number at which to start intervening, the episode number at which to stop intervening, the timestep within the episode at which it should intervene and the episode periodicity of interventions. We provide a set of predefined intervention actors, including an actor which samples parameters randomly at each episode reset, which corresponds to domain randomization. It is also easy to define custom intervention actors; we hope that this facilitates investigation into optimal learning curricula (see Fig. 3).
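The scheduling of an intervention actor by start episode, stop episode, timestep and episode periodicity can be sketched as below. The class `PeriodicInterventionActor` and its method `act` are hypothetical names for this illustration and do not reproduce the CausalWorld API; treating the stop episode as exclusive is likewise an assumption.

```python
from typing import Callable, Dict, Optional

Variables = Dict[str, float]


class PeriodicInterventionActor:
    """Hypothetical actor that proposes interventions on a fixed episode schedule."""

    def __init__(self, sampler: Callable[[Variables], Variables],
                 start_episode: int, end_episode: int,
                 timestep: int = 0, periodicity: int = 1) -> None:
        if periodicity < 1:
            raise ValueError("periodicity must be at least 1")
        self.sampler = sampler
        self.start_episode = start_episode
        self.end_episode = end_episode  # exclusive in this sketch
        self.timestep = timestep
        self.periodicity = periodicity

    def act(self, episode: int, timestep: int,
            variables: Variables) -> Optional[Variables]:
        """Return the variables to intervene on, or None if no intervention is due."""
        in_window = self.start_episode <= episode < self.end_episode
        on_period = in_window and (episode - self.start_episode) % self.periodicity == 0
        if on_period and timestep == self.timestep:
            return self.sampler(variables)
        return None


def double_mass(variables: Variables) -> Variables:
    # Illustrative sampler: reads the exposed variables and proposes a new block mass.
    return {"block_mass": variables["block_mass"] * 2.0}


actor = PeriodicInterventionActor(double_mass, start_episode=2, end_episode=8,
                                  timestep=0, periodicity=3)
env_variables = {"block_mass": 0.02}
intervened_episodes = [ep for ep in range(10)
                       if actor.act(ep, 0, env_variables) is not None]
```

Several such actors, each with its own sampler and schedule, could be queried in turn at every time step to form a modular curriculum.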

3 Related Work

Previous benchmarks proposed for RL mostly focused on the single-task learning setting, such as OpenAI Gym and DM control suite (Tassa et al., 2018; Brockman et al., 2016). Although a recent line of work, e.g. Meta-World and RLBench (Yu et al., 2019; James et al., 2020), aims at studying multi-task learning as well as meta-learning, the respective benchmarks mostly exhibit non-parametric hand-designed task variations, which makes it ambiguous how much structure is shared between them. For instance, it is not clear how different “opening a door” is from “opening a drawer”. To address the ambiguity in the shared structure between the tasks, CausalWorld was designed to allow interventions to be performed on many environment variables, giving rise to a large space of tasks with well-defined relations between them, which we believe is a missing key component to address generalization in RL.

Similar parametric formulations of different environments were used in experiments in the generalization for RL literature, which have played an important role in advancing the field (Packer et al., 2018; Rajeswaran et al., 2017; Pinto et al., 2017; Yu et al., 2017; Henderson et al., 2017a; Dulac-Arnold et al., 2020; Chevalier-Boisvert et al., 2018). In these previous works, variables were mostly assigned randomly as opposed to the full control over the variables in CausalWorld by allowing do-interventions.

Another important remaining challenge for the RL community is the standardization of the reported learning curves and results. RL methods have been shown to be sensitive to a range of different factors (Henderson et al., 2017b). Thus it is crucial to devise a set of metrics that measure the reliability of RL algorithms and ensure their reproducibility. Chan et al. (2019) distinguish between several evaluation modes like “evaluation during training” and “evaluation after learning”. Osband et al. (2019) recently proposed a benchmarking suite that disentangles the ability of an algorithm to deal with different types of challenges. Its main components are: enforcing a specific methodology for an agent’s evaluation beyond the environment definition and isolating core capabilities with targeted ‘unit tests’ rather than integrating the general learning ability.

Moreover, causality has been historically studied from the perspective of probabilistic and causal reasoning (Pearl, 2009), cognitive psychology (Griffiths and Tenenbaum, 2005), and more recently in the context of machine learning (Goyal et al., 2019b; Schölkopf, 2019; Baradel et al., 2019; Bakhtin et al., 2019). On the contrary, we believe its link to robotics is not yet drawn systematically. To bridge this gap, one of the main motivations of CausalWorld was to facilitate research in causal learning for robotics, such as the capacity for observational discovery of causal effects in physical reality, counterfactual reasoning and causal structure learning.

4 Experiments

To illustrate the usage of this benchmark and to verify the feasibility of some basic tasks, we evaluate current state-of-the-art model-free (MF-RL) algorithms on a subset of the goal shape families described in Section 2 and depicted in Fig. 2: (a) Pushing, (b) Picking, (c) Pick and Place, and (d) Stacking2. These goal shapes reflect basic skills that are required to solve more complex construction tasks.


The idea here is to investigate how well an agent will perform on different evaluation distributions, depending on the curriculum it has been trained with. We train each method under the following curricula:

  • Curriculum 0: no environment changes; each episode is initialized from the default task lying in space A. Note that here the initial state never changes (i.e. no interventions).

  • Curriculum 1: goal shape randomization; at the beginning of each episode a new goal shape is sampled from space A (i.e. interventions on goal position and orientation).

  • Curriculum 2: full randomization w.r.t. the task variables; in every episode a simultaneous intervention on all variables is sampled from space A (i.e. this can be seen as equivalent to extreme domain randomization in one space). Note that each task generator can suppress interventions that would yield goal shapes outside its family.

The curriculum will, as expected, affect the generalization capabilities of the trained agents. With CausalWorld’s formulation, these generalization capabilities can easily be disentangled and benchmarked quantitatively, as explained in Section 2. For each of the goal shape families (a, b, c, d from Fig. 2), we train agents under the three described curricula using the following MF-RL algorithms: The original Proximal Policy Optimization (PPO) from Schulman et al. (2017), Soft Actor-Critic (SAC) from Haarnoja et al. (2018) and the Twin Delayed DDPG (TD3) from Fujimoto et al. (2018). We provided these methods with a hand-designed dense reward function as we did not observe any success with the sparse reward only. Each of the mentioned setups is trained for five different random seeds, resulting in 180 trained agents.
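The count of 180 trained agents follows from the full grid of four goal shape families, three curricula, three algorithms and five seeds. A quick enumeration confirms it (the seed values 0 to 4 are placeholders, as the actual seeds are not stated):

```python
from itertools import product

tasks = ["Pushing", "Picking", "Pick and Place", "Stacking2"]
curricula = [0, 1, 2]
algorithms = ["PPO", "SAC", "TD3"]
seeds = range(5)  # placeholder seed values

# One training run per (task, curriculum, algorithm, seed) combination.
runs = list(product(tasks, curricula, algorithms, seeds))
num_agents = len(runs)  # 4 * 3 * 3 * 5
```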

Figure 4: Fractional success curves averaged over five random seeds for the tasks and learning algorithms specified above, under three different training curricula: (0) no curriculum, (1) goal position and orientation randomization in space A every episode and (2) a curriculum where we intervene on all variables in space A simultaneously every episode.

Training model-free RL methods:

We report the training curves averaged over the random seeds in Fig. 4. As can be seen from these fractional success training curves, MF-RL methods are capable of solving the single-block goal shapes (pushing, picking, pick and place) seen during training time given enough experience. However, we observe that none of the methods studied here managed to solve stacking two blocks. A score below 0.5 indicates that the agent only learns to push the lower cube into the goal shape. This shows that multi-object target shapes can become nontrivial quickly and that there is a need for better methods making use of the modular structure of object-based environments. Unsurprisingly, the training curriculum has a major effect on learning, but the interpretation of generalization capabilities becomes much more explicit in the following subsection. For example, methods rarely manage to pick up any significant success signal under full extreme domain randomization as in curriculum 2, even after 100 million timesteps. Note that these curves represent the scores under the shapes and conditions of the actual training environments. Therefore, we need the capability of setting different protocols, i.e. standardized sets of evaluation environments, that allow one to benchmark the learned skills of different agents.

Figure 5: Evaluation scores for pushing baselines. Each protocol was evaluated for 200 episodes and each bar is averaged over five models with different random seeds. The variables listed under each protocol are sampled from the specified space at the start of every episode while all other variables remain fixed [bp block pose, bm block mass, bs block size, gp goal pose, ff floor friction].

Benchmarking generalization capabilities along various axes:

For each of the four goal shape families, we define a set of 12 evaluation protocols that we consider meaningful and representative for benchmarking the different algorithms. In the protocols presented here, we sample the values from a protocol-specific set of variables at the start of each episode while keeping all other variables fixed to their default values. After evaluating an agent on 200 episodes, we compute the fractional success score at the last time step of each episode and report the mean. These evaluation protocols allow one to disentangle generalization abilities, as they show robustness with respect to different types of interventions (see Fig. 5). The following are some of the observations we made for pushing:

  • Agents that were trained on the default pushing task environment (curriculum 0) do well (as expected) on the default task (P0). Interestingly, we likewise see a generalization capability to initial poses from variable space A (P4). This can be explained by a substantial exploration of the block positions via manipulation during training. In contrast, the agents exhibit weaknesses regarding goal poses (P5), since they overfit to their training settings.

  • For agents trained with goal pose randomization (curriculum 1) we see similar results as with curriculum 0, with the difference that agents under this curriculum generalize robustly to different goal poses (P5), as one would expect.

  • Finally, agents that experience extreme domain randomization (curriculum 2) at training time fail to learn any relevant skill, as shown by the flat training curve in Fig. 4. An explanation for this behavior could be that the agent might need more data and optimization steps to handle this much more challenging setting. Another possibility is that it may simply not be possible to find a strategy which simultaneously works for all parameters (note that the agent does not have access to the randomized parameters and hence must be robust to them). This poses an interesting question for future work.

As expected, we observe that an agent’s generalization capabilities are related to the experience gathered under its training curriculum. CausalWorld allows us to explore this relationship in a differentiated manner, assessing which curricula lead to which generalization abilities. This will not only help uncover an agent’s shortcomings but may likewise aid in investigating novel learning curricula and approaches for robustness in RL. Lastly, we note that this benchmark comprises extremely challenging tasks that appear to be out of reach of current model free methods without any additional inductive bias.
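The scoring rule used for the evaluation protocols (run 200 episodes, take the fractional success at the last time step of each, report the mean) can be sketched as follows. The function `evaluate_protocol` and the stand-in `fake_episode` runner are hypothetical; a real runner would reset the environment with the protocol's interventions and roll out the trained policy.

```python
from statistics import mean
from typing import Callable, Sequence


def evaluate_protocol(run_episode: Callable[[int], Sequence[float]],
                      num_episodes: int = 200) -> float:
    """Mean over episodes of the fractional success at the final time step."""
    final_scores = []
    for episode in range(num_episodes):
        scores = run_episode(episode)    # per-time-step fractional success
        final_scores.append(scores[-1])  # only the last time step counts
    return mean(final_scores)


def fake_episode(episode: int) -> Sequence[float]:
    # Stand-in for a real rollout: even episodes succeed fully, odd ones reach 0.5.
    return [0.0, 0.5, 1.0] if episode % 2 == 0 else [0.0, 0.2, 0.5]


protocol_score = evaluate_protocol(fake_episode)
```

With the stand-in runner, half of the episodes end at 1.0 and half at 0.5, so the reported protocol score is 0.75.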

5 Conclusion

We have introduced a new benchmark - CausalWorld - to accelerate research in causal structure and transfer learning using a simulated environment of an open-source robot, where learned skills could potentially be transferred to the real world. We showed how allowing for interventions on the environment’s properties yields a diverse family of tasks with a natural way of defining learning curricula and evaluation protocols that can disentangle different generalization capabilities.

We hope that the flexibility and modularity of CausalWorld will allow researchers to easily define appropriate benchmarks of increasing difficulty as the field progresses, thereby coordinating research efforts towards ever new goals.

6 Acknowledgments

The authors would like to thank Felix Widmaier, Vaibhav Agrawal and Shruti Joshi for the useful discussions and for the development of the TriFinger robot’s simulator (Joshi et al., 2020), which served as a starting point for the work presented in this paper. AG is also grateful to Alex Lamb and Rosemary Nan Ke for useful discussions. The authors are grateful for the support from CIFAR.


  • I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. (2019) Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §1, §2.
  • B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch (2019) Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528. Cited by: §1.
  • A. Bakhtin, L. van der Maaten, J. Johnson, L. Gustafson, and R. Girshick (2019) Phyre: a new benchmark for physical reasoning. In Advances in Neural Information Processing Systems, pp. 5082–5093. Cited by: §3.
  • V. Bapst, A. Sanchez-Gonzalez, C. Doersch, K. L. Stachenfeld, P. Kohli, P. W. Battaglia, and J. B. Hamrick (2019) Structured agents for physical construction. arXiv preprint arXiv:1904.03177. Cited by: §2.
  • F. Baradel, N. Neverova, J. Mille, G. Mori, and C. Wolf (2019) Cophy: counterfactual learning of physical dynamics. arXiv preprint arXiv:1909.12000. Cited by: §3.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §1, Table 1.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §3.
  • Y. M. Caldera, A. M. Culp, M. O’Brien, R. T. Truglio, M. Alvarez, and A. C. Huston (1999) Children’s play preferences, construction play with blocks, and visual-spatial skills: are they related?. International Journal of Behavioral Development 23 (4), pp. 855–872. Cited by: §1.
  • B. M. Casey, N. Andrews, H. Schindler, J. E. Kersh, A. Samper, and J. Copley (2008) The development of spatial skills through interventions involving block building activities. Cognition and Instruction 26 (3), pp. 269–309. Cited by: §1.
  • S. C. Chan, S. Fishman, J. Canny, A. Korattikara, and S. Guadarrama (2019) Measuring the reliability of reinforcement learning algorithms. arXiv preprint arXiv:1912.05663. Cited by: §3.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) BabyAI: a platform to study the sample efficiency of grounded language learning. In International Conference on Learning Representations, Cited by: §1, Table 1, §3.
  • K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman (2018) Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341. Cited by: §1, §1, Table 1.
  • E. Coumans et al. (2013) Bullet real-time physics simulation. URL http://bulletphysics.org. Cited by: §2.
  • G. Dewar (2018) The benefits of toy blocks: the science of construction play. Parenting Science. Cited by: §1.
  • G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester (2020) An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881. Cited by: §3.
  • S. E. Fahlman (1974) A planning system for robot construction tasks. Artificial intelligence 5 (1), pp. 1–49. Cited by: §2.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. External Links: 1802.09477 Cited by: §4.
  • A. Goyal, R. Islam, D. Strouse, Z. Ahmed, M. Botvinick, H. Larochelle, Y. Bengio, and S. Levine (2019a) Infobot: transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902. Cited by: §1.
  • A. Goyal, A. Lamb, J. Hoffmann, S. Sodhani, S. Levine, Y. Bengio, and B. Schölkopf (2019b) Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893. Cited by: §3.
  • T. L. Griffiths and J. B. Tenenbaum (2005) Structure and strength in causal induction. Cognitive psychology 51 (4), pp. 334–384. Cited by: §3.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. External Links: 1801.01290 Cited by: §4.
  • P. Henderson, W. Chang, F. Shkurti, J. Hansen, D. Meger, and G. Dudek (2017a) Benchmark environments for multitask learning in continuous domains. arXiv preprint arXiv:1708.04352. Cited by: §3.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017b) Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560. Cited by: §3.
  • S. James, Z. Ma, D. R. Arrojo, and A. J. Davison (2020) Rlbench: the robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5 (2), pp. 3019–3026. Cited by: §1, Table 1, §3.
  • M. Janner, S. Levine, W. T. Freeman, J. B. Tenenbaum, C. Finn, and J. Wu (2018) Reasoning about physical interactions with object-oriented prediction and planning. arXiv preprint arXiv:1812.10972. Cited by: §2.
  • S. Joshi, F. Widmaier, V. Agrawal, and M. Wüthrich (2020) GitHub. Note: Cited by: §6.
  • C. Kamii, Y. Miyakawa, and Y. Kato (2004) The development of logico-mathematical knowledge in a block-building activity at ages 1–4. Journal of Research in Childhood Education 19 (1), pp. 44–57. Cited by: §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • Y. Lee, E. S. Hu, Z. Yang, A. Yin, and J. J. Lim (2019) IKEA furniture assembly environment for long-horizon complex manipulation tasks. External Links: 1911.07246 Cited by: Table 1.
  • W. McCarthy, D. Kirsh, and J. Fan Learning to build physical structures better over time. Cited by: §2.
  • M. Mirza, A. Jaegle, J. J. Hunt, A. Guez, S. Tunyasuvunakool, A. Muldal, T. Weber, P. Karkus, S. Racanière, L. Buesing, T. Lillicrap, and N. Heess (2020) Physically embedded planning problems: new challenges for reinforcement learning. External Links: 2009.05524 Cited by: Table 1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • S. Nath and D. Szücs (2014) Construction play and cognitive skills associated with the development of mathematical abilities in 7-year-old children. Learning and Instruction 32, pp. 73–80. Cited by: §1.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepezvari, S. Singh, et al. (2019) Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568. Cited by: §3.
  • C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song (2018) Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282. Cited by: §3.
  • J. Pearl (2009) Causality. Cambridge university press. Cited by: §3.
  • L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702. Cited by: §3.
  • A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade (2017) Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561. Cited by: §3.
  • M. Richardson, T. E. Hunt, and C. Richardson (2014) Children’s construction task performance and spatial ability: controlling task complexity and predicting mathematics performance. Perceptual and motor skills 119 (3), pp. 741–757. Cited by: §1.
  • B. Schölkopf (2019) Causality for machine learning. arXiv preprint arXiv:1911.10500. Cited by: §3.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. External Links: 1707.06347 Cited by: §4.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. nature 550 (7676), pp. 354–359. Cited by: §1.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §3.
  • B. N. Verdine, R. M. Golinkoff, K. Hirsh-Pasek, and N. Newcombe (2017) Links between spatial and mathematical skills across the preschool years. Wiley. Cited by: §1.
  • S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone (2011) Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp. 120–127. Cited by: §1.
  • T. Winograd (1972) Understanding natural language. Cognitive psychology 3 (1), pp. 1–191. Cited by: §2.
  • P. H. Winston (1970) Learning structural descriptions from examples. Cited by: §2.
  • M. Wüthrich, F. Widmaier, F. Grimminger, J. Akpo, S. Joshi, V. Agrawal, B. Hammoud, M. Khadiv, M. Bogdanovic, V. Berenz, et al. (2020) TriFinger: an open-source robot for learning dexterity. arXiv preprint arXiv:2008.03596. Cited by: Appendix B, §1, §1, §2, §2.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2019) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. arXiv preprint arXiv:1910.10897. Cited by: §1, §1, Table 1, §3.
  • W. Yu, J. Tan, C. K. Liu, and G. Turk (2017) Preparing for the unknown: learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453. Cited by: §3.

7 Appendix

Appendix A Observations

Observations in CausalWorld have two modes, “structured” and “pixel”. In “pixel” mode, 6 images are returned: 3 images of the current state of the environment, rendered from 3 different views on top of the TriFinger platform, and the 3 corresponding goal images rendered from the same points of view, showing the goal shape that the robot has to build by the end of the episode.

Figure 6: Example “pixel” mode observations returned at each step of the environment. Panels show the current view and the goal view for each of the three views (View60, View120, View300).
Figure 7: Structured observation description. For the scene features, the feature vectors of all blocks are concatenated first. The feature vectors of the partial goals are then concatenated in the same order. Lastly, if there are any obstacles or fixed blocks, their feature vectors are concatenated at the end, following the same description as the partial goal features.
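
The concatenation order described in the Figure 7 caption can be sketched as follows. This is a minimal illustration, not CausalWorld's implementation: the function name `build_scene_features` and the toy feature vectors are hypothetical, and the contents of each per-object feature vector are left unspecified.

```python
from typing import List, Sequence


def build_scene_features(
    block_features: Sequence[Sequence[float]],
    goal_features: Sequence[Sequence[float]],
    fixed_block_features: Sequence[Sequence[float]] = (),
) -> List[float]:
    """Flatten per-object feature vectors in the order given for Figure 7.

    Hypothetical helper: blocks first, then partial goals in the same
    order, then any obstacles / fixed blocks at the end.
    """
    scene: List[float] = []
    for group in (block_features, goal_features, fixed_block_features):
        for vec in group:
            scene.extend(vec)
    return scene


# Toy example with two blocks, two partial goals and one fixed block.
blocks = [[1.0, 2.0], [3.0, 4.0]]
goals = [[5.0], [6.0]]
fixed = [[7.0]]
scene = build_scene_features(blocks, goals, fixed)
```

Keeping the goal vectors in the same order as the block vectors means that the i-th block and the i-th partial goal can be matched by index alone.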

Appendix B TriFinger Platform

The robot from Wüthrich et al. (2020), shown in Figure 8, is open-source and can be reproduced and built in any research lab. Since it is inexpensive (about $5000), it can help speed up sim2real research.

Figure 8: The TriFinger platform.

Appendix C Task Generators

  1. Pushing: task where the goal is to push one block towards a goal position with a specific orientation; restricted to goals on the floor level.

  2. Picking: task where the goal is to lift one block to a goal height above the center of the arena; restricted to goals above the floor level.

  3. Pick And Place: task where the arena is divided by a fixed long block and the goal is to pick one block from one side of the arena and place it at a goal position with a variable orientation on the other side of the fixed block.

  4. Stacking2: task where the goal is to stack two blocks above each other in a specific goal position and orientation.

  5. Towers: task where the goal is to stack n blocks exactly above each other in a specific goal position and orientation, creating a tower of blocks.

  6. Stacked Blocks: task where the goal is to stack n blocks above each other in an arbitrary way to create a stable structure. The blocks do not have to be exactly above each other, which makes this more challenging than the ordinary Towers task, since it is harder to come up with a stable structure that covers the goal shape volume.

  7. Creative Stacked Blocks: identical to the Stacked Blocks task, except that only the first and last levels of the goal are shown or “imposed”. The rest of the structure is not explicitly specified and is left to the imagination of the agent. This is considered the most challenging task, since the agent needs to understand how to build stable structures and imagine what can be filled in the middle to connect the two levels in a stable way.

  8. General: the goal shape is an arbitrary shape created by initially dropping an arbitrary number of blocks from above the ground and waiting until all blocks come to rest; the resulting configuration becomes the goal shape that the agent needs to fill afterwards.
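
As a small illustration of the Towers goal geometry in the list above, the sketch below computes the center heights of n blocks stacked exactly above each other. It assumes identical block heights and a floor at z = 0; the function name `tower_center_heights` and these simplifications are illustrative assumptions, not part of the CausalWorld API.

```python
from typing import List


def tower_center_heights(n_blocks: int, block_height: float) -> List[float]:
    """Center height of each block in a single-column tower (hypothetical helper).

    Assumes all blocks share the same height and the floor is at z = 0,
    so block i (counting from 0) is centered at (i + 0.5) * block_height.
    """
    if n_blocks < 1 or block_height <= 0:
        raise ValueError("need at least one block with positive height")
    return [(i + 0.5) * block_height for i in range(n_blocks)]


# Three blocks of height 2.0 (arbitrary units chosen for readability).
heights = tower_center_heights(3, 2.0)
```

For the Stacked Blocks task this simple formula no longer applies, since blocks on one level may be offset horizontally from those below.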

Variable | Sub Variable | Space | Space
gravity[z] | - | |
floor friction | - | |
stage friction | - | |
stage color [rgb] | - | |
floor color [rgb] | - | |
joint positions | - | |
block size | | |
block color | | |
block mass | | |
block position (cylindrical) | | |
goal cuboid size | | |
goal cuboid color | | |
link color | | |
link mass | | |
Table 2: Description of a subset of the high-level variables, exposed in CausalWorld, and their corresponding spaces, refers to the height of the block.
Task Generator | Variable | Space | Space
Picking | goal height | |
Towers | tower dims | |
Table 3: Example of task generators’ specific high-level variables, exposed in CausalWorld, and their corresponding spaces. For a full list of each task generator’s variables and their corresponding spaces, please refer to the documentation at (
Task generators Dense reward
Pick and Place
Table 4: Description of the dense rewards applied in our experiments. The following notation was applied: joint velocities, i-th end-effector positions, i-th block position, i-th goal block position, the distance between end-effectors and the block, the distance difference w.r.t. the previous timestep. The target height parameter for pick and place is 0.15 if block and goal are of different heights; otherwise, it is half the goal height.

Appendix D Training Details

The experiments were carried out using the Stable Baselines implementations of PPO, SAC and TD3. We used a 2-layer MLP policy [256, 256] for all agents. PPO was trained with 20 parallel workers for up to 100 million timesteps, while SAC and TD3 were trained serially for 10 million timesteps.

PPO
discount 0.99
batch size 120000
learning rate 2.5e-4
entropy coef. 0.01
value function coef. 0.5
gradient clipping (max) 0.5
n minibatches per update 40
n training epochs

SAC
discount 0.95
entropy coeff 1e-3
batch size 256
learning rate 1e-4
target entropy auto
buffer size 1000000
tau 0.001

TD3
discount 0.96
batch size 128
learning rate 1e-4
buffer size 500000
tau 0.02
Table 5: Hyperparameters of the learning algorithms used in the baseline experiments.
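
A quick back-of-the-envelope check of what the PPO settings imply. The sketch assumes that the batch size of 120000 is the total number of transitions collected across all 20 workers per update, and that it is split evenly into the 40 minibatches; this interpretation is an assumption, not something stated above.

```python
# PPO settings quoted in this appendix.
total_timesteps = 100_000_000
batch_size = 120_000
n_workers = 20
n_minibatches = 40

# Assumption: batch_size counts transitions from all workers combined.
steps_per_worker = batch_size // n_workers      # rollout length per worker
minibatch_size = batch_size // n_minibatches    # transitions per minibatch
n_updates = total_timesteps // batch_size       # full batches in the budget

print(steps_per_worker, minibatch_size, n_updates)  # prints: 6000 3000 833
```

Under this reading, each worker contributes 6000 steps per update, each minibatch holds 3000 transitions, and the 100 million timestep budget covers about 833 batches.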
Figure 9: An example of model selection in CausalWorld by evaluating generalization across the various axes using the previously mentioned protocols. Here we compare two agents trained on different curricula using PPO.
Figure 10: Evaluation scores for the pushing, picking, pick and place, and stacking2 baselines, from top to bottom respectively. Each protocol was evaluated for 200 episodes and each bar is averaged over five models with different random seeds (bp: block pose, bm: block mass, bs: block size, gp: goal pose, ff: floor friction).
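
The aggregation behind each bar in Figure 10 can be sketched as below: average the per-episode scores of one protocol for each seed, then average across seeds. The function name `protocol_bar` and the toy scores are hypothetical, and the per-episode score is treated as an opaque number in this sketch.

```python
from statistics import mean
from typing import Sequence


def protocol_bar(scores_per_seed: Sequence[Sequence[float]]) -> float:
    """Average per-episode scores within each seed, then across seeds.

    Hypothetical helper: scores_per_seed[s][e] is the score of episode e
    for the model trained with seed s on a single evaluation protocol.
    """
    if not scores_per_seed or any(len(s) == 0 for s in scores_per_seed):
        raise ValueError("every seed needs at least one episode score")
    per_seed = [mean(episodes) for episodes in scores_per_seed]
    return mean(per_seed)


# Toy data: 2 seeds with 2 episodes each (the setup above uses 5 seeds
# and 200 episodes per protocol).
bar = protocol_bar([[0.5, 1.0], [0.0, 0.5]])
```

When every seed is evaluated on the same number of episodes, this equals the mean over all episodes pooled together; keeping the per-seed means separate also makes it easy to report the spread across seeds.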