Self-supervised Reinforcement Learning with Independently Controllable Subgoals

09/09/2021 ∙ by Andrii Zadaianchuk, et al. ∙ Max Planck Society

To successfully tackle challenging manipulation tasks, autonomous agents must learn a diverse set of skills and how to combine them. Recently, self-supervised agents that set their own abstract goals by exploiting the discovered structure in the environment were shown to perform well on many different tasks. In particular, some of them were applied to learn basic manipulation skills in compositional multi-object environments. However, these methods learn skills without taking the dependencies between objects into account. Thus, the learned skills are difficult to combine in realistic environments. We propose a novel self-supervised agent that estimates relations between environment components and uses them to independently control different parts of the environment state. In addition, the estimated relations between objects can be used to decompose a complex goal into a compatible sequence of subgoals. We show that, by using this framework, an agent can efficiently and automatically learn manipulation tasks in multi-object environments with different relations between objects.






1 Introduction

Autonomous agents that need to solve manipulation tasks in environments with many objects have to master a variety of skills. In addition, such agents should be able to properly combine these skills to solve complex tasks. In modular environments, the agent must explore many different ways in which it can control the environment [10]. Self-supervised agents that imagine their own goals can automate this process and learn many skills without external reward signals [10, 34, 36, 33, 15, 3, 1]. One of the main challenges for goal-based autonomous agents is the choice of a suitable goal space and the corresponding reward function [11]. As this choice determines the difficulty of learning, it is crucial to exploit all available structure in the environment state when constructing the goal space.

One natural way to represent the state in modular environments is to use an object-centric representation: the environment state is represented as a set of components, with each component corresponding to the state of an individual object [41, 46, 12]. Such representations can be learned in an unsupervised fashion from high-dimensional observations such as images [45, 21, 29, 41, 17, 7, 18]. Therefore, methods that use object-centric representations can be readily extended to take high-dimensional data as input. A simple approach to using object-centric representations in autonomous learning is to first learn how to control each object individually (using the objects’ representations as subgoals), and then combine learned skills to control multiple objects [46]. However, in an environment where different objects interact with each other, this method might learn an incompatible sequence of skills, i.e. achieving one of the subgoals can destroy another previously achieved subgoal. For example, moving one object from a stack of objects may change the position of the others.

One line of work that aims at learning sequences of skills that are compatible is Hierarchical Reinforcement Learning (HRL) [42, 32, 25]. In principle, hierarchical agents should be able to transform a task into a sequence of subtasks that they solve sequentially. However, to date, existing hierarchical agents have mostly been applied to learn navigation or reaching tasks where learned skills do not interact with each other. It is unclear how sensitive hierarchical agents are to possible interactions between learned skills. In this paper, we investigate another approach by reformulating the agent’s subtasks and the corresponding reward signals. Similar to Thomas et al. [38], we train an agent such that it is motivated to control a particular component of the environment state representation while minimally affecting other components. Such an agent can learn to control components independently from other components, thus making the learned skills compatible with each other.

As the environment state representation is not necessarily disentangled as in Thomas et al. [38], our method should additionally account for possible relations between components. We propose a novel selectivity reward signal that uses an interaction graph to determine a set of components that can be selectively controlled without interacting with the remaining scene. The interaction graph can be inferred without supervision from observed object dynamics collected by a random policy [5, 22]. Thus, we combine learning of such interaction graphs with a goal-conditional reinforcement learning (RL) method that operates on object-centric representations [46] and uses the selectivity reward signal. During training (schematically depicted in Fig. 1), our SRICS agent (for Self-supervised Relational RL with Independently Controllable Subgoals) learns how to efficiently achieve different subgoals (and control the corresponding subspaces) while being incentivized to minimize its effects on other parts of the environment.

Figure 1: Our SRICS method. First, the interaction graph, containing links from each cause to the entity it affects, is inferred from observed environment dynamics. This gives rise to subspaces that can be independently controlled, each corresponding to a subgoal. Next, the subgoals are used to construct a selectivity reward signal. The selectivity reward incentivizes the agent to control only the main entity of each subgoal towards its target, without affecting entities outside the subgoal. SRICS learns to solve an external goal by decomposing it into an ordered list of subgoals and solving each using SAC [19] with a goal-conditioned policy. As a result, the agent attempts to solve all the discovered subgoals one-by-one, without destroying previously solved subgoals.

Our main contributions are as follows:

  • We show that the global interaction graph can be estimated from data using a recurrent graph neural network (GNN) dynamical model combined with a sparsity prior.

  • We propose a goal-directed selectivity reward function that allows an agent to learn how to control environment components independently from one another.

  • We develop SRICS, an algorithm that uses the inferred interaction graph to learn simple and independently controllable subtasks and decompose a complex goal into a compatible sequence of subgoals.

2 Modular Goal-conditional Reinforcement Learning

We are interested in an agent that can solve multiple tasks in an environment. In particular, we consider goal-based task encodings where each task corresponds to an environment state the agent has to reach, denoted as the goal state. The task is then given to the agent by conditioning its policy on the goal, and the agent’s objective is to maximize the expected goal-conditional return.

The expectation is taken over an unknown dynamics distribution, the state marginal distribution induced by the agent’s policy, and a distribution over the space of goals that the agent receives for training. A common approach to defining the reward function in this setting is to use the negative distance of the current state to the goal. In general, however, the goal space does not need to be equal to the state space, but can be any task embedding space with potentially different dimensionality [10, 2]. As some tasks cannot be expressed as desired regions of the state space, the goal can parameterize a more general objective that the agent should maximize. Many environments are modular, in the sense that an agent’s overall goal (e.g. manipulating many objects) can be decomposed into different subgoals (e.g. manipulating individual objects) that can be sequentially achieved.
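The negative-distance reward mentioned above can be sketched in a few lines. This is a minimal illustration that assumes states and goals are plain coordinate vectors and that the distance is Euclidean; the function name is hypothetical and not taken from the paper.

```python
import math


def negative_distance_reward(state, goal):
    """Reward that is 0 at the goal and decreases with Euclidean distance to it.

    Assumption: state and goal are sequences of floats of equal length.
    """
    return -math.dist(state, goal)


# A state that is 5 units away from the goal receives a reward of -5.
print(negative_distance_reward([0.0, 0.0], [3.0, 4.0]))
```

With a compositional goal, the same function could be applied to each object component separately, which is the view taken in the following sections.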

2.1 Object-Centric Representations

We use object-centric representations for the state space. That is, the state space is a direct product of all object subspaces, where each subspace corresponds to the state of an entity. The state of an entity is encoded by its position and an identifier. The semantics behind an entity are unknown to the agent, i.e. the states of both the agent and the objects to manipulate are encoded identically. The agent has no information about which objects are controllable and how they are related. Object-centric representations can be learned in an unsupervised way from sequential image data [41, 45, 21]; learning them is an orthogonal line of research. In this work, we focus on using object-centric representations for decomposing the goal into subgoals that are compatible with each other and can be achieved sequentially to solve the original goal.
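As an illustration of this representation, the sketch below stores each entity as a position plus an identifier and treats the state as a list of such entities. The class and field names, the integer identifiers, and the two-dimensional positions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """One component of the state: an identifier and a position (names are illustrative)."""
    identifier: int
    position: Tuple[float, float]


def component(state: List[Entity], identifier: int) -> Entity:
    """Return the entity with the given identifier."""
    for entity in state:
        if entity.identifier == identifier:
            return entity
    raise KeyError(identifier)


# The arm and the pucks are encoded identically; nothing marks entity 0 as the arm.
state = [Entity(0, (0.1, 0.2)), Entity(1, (0.5, 0.5)), Entity(2, (0.3, 0.9))]
print(component(state, 1).position)
```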

The choice of the goal space plays a crucial role in determining the difficulty of the learning task. If the environment state consists of independent parts, it is easiest to learn to control these components independently [46]. However, in the case of interactions between these components, learning to achieve subgoals in such environments and combining learned skills could be harmful to achieving the original compositional goal. For example, assume that the subgoals consist of moving objects to different positions. By solving a single subgoal without taking the other subgoals into consideration, the agent might unintentionally rearrange objects from previous subtasks, resulting in an overall deterioration instead of an improvement. In the next section, we present a method that accounts for such dependencies by learning an interaction graph to decompose a goal into independently controllable subgoals and introduce a corresponding reward function.

3 Self-Supervised Relational RL with Independently Controllable Subgoals

In the setting we consider, at the training stage, the agent only receives a single compositional goal from the environment. The agent could try to solve the goal using the usual negative distance to the goal as a reward signal. However, achieving the compositional goal is quite a complex task by itself. This challenge can be addressed by discovering simple skills and combining them to solve the compositional goal. To achieve this, the agent needs to rely on self-supervision in the form of splitting the goal into subgoals and internally constructing the reward signal connected to each subgoal.

The agent uses data collected from the environment to discover how different parts of the environment are related, including the agent itself, and then uses the discovered relations for the construction of subtasks that are solvable and can be easily combined. First, we describe how to use object-centric representations to estimate a graph of relations between objects, and then show how to utilize the learned graph during agent training, and for goal decomposition during evaluation.

3.1 Estimation of the Latent Interaction Graph with a GNN Dynamical Model

Figure 2: The dynamical model. For a given object, the interaction function computes the effect of each of the other objects on that object using their hidden states. The effects from all the other objects are aggregated in the interaction effect vector. Next, the action function computes the action’s effect on the object. Both effects are combined in the GRU. Finally, the object’s state is estimated from the hidden state using the prediction function.

Relational information in the environment can help the autonomous agent to gain control over different parts of the environment. For example, if some parts of the environment cannot be affected, the agent will be more efficient by not trying to control them. Recently, several methods to estimate this relational information in an unsupervised way were proposed [22, 40, 23, 28, 6]. Most of them assume that the relations are static [22, 28, 30]. As this is not the case in many robotic manipulation applications, we propose to use a similar approach suitable for constantly changing relations. For this, we use a graph neural network (GNN) [5, 27] to model the forward dynamics of the objects.

Because states could be non-Markovian, we use a recurrent dynamical model. Specifically, we incorporate recurrence in the GNN model by adding a gated recurrent unit (GRU) [8] to the GNN message passing operation (see Fig. 2a). We use two separate functions to model the object-object interaction effect and the action’s effect, respectively. Both effects are then combined in the GRU: it receives the vectors representing the interaction and action effects and updates the hidden state of each object at each time step.

To model dynamics with sparse interactions between objects, we model the object-object interaction effect as the product of an interaction weight and an interaction effect function.

The interaction weight represents the belief in the absence or presence of an interaction between a pair of objects at a given time step. We model the distribution of each weight using an interaction presence function. As we are interested in estimating only the connections that are necessary for prediction, we additionally encourage the distribution of the interaction weights to be close to a sparsity prior. In our case, the sparsity prior is a Bernoulli distribution with a large probability for zero (see the appendix).

Finally, we use a prediction function to predict the change in coordinates (see Fig. 2b).

All functions introduced above are modeled by small MLPs.
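The recurrent update described in this section can be summarized in equation form. The following is a sketch under assumed notation: the symbol names ($h_i^t$ for the hidden state of object $i$, $f_{\mathrm{effect}}$, $f_{\mathrm{act}}$, $f_{\mathrm{presence}}$, $f_{\mathrm{out}}$), the exact arguments of each function, and the sigmoid parameterization of the Bernoulli distribution are illustrative choices and not necessarily the authors' formulation.

```latex
\begin{align}
e_{\mathrm{inter},i}^{t} &= \sum_{j \neq i} w_{ij}^{t}\, f_{\mathrm{effect}}\big(h_i^{t}, h_j^{t}\big), \\
e_{\mathrm{act},i}^{t} &= f_{\mathrm{act}}\big(h_i^{t}, a^{t}\big), \\
h_i^{t+1} &= \mathrm{GRU}\big([\,e_{\mathrm{inter},i}^{t},\, e_{\mathrm{act},i}^{t}\,],\; h_i^{t}\big), \\
w_{ij}^{t} &\sim \mathrm{Bernoulli}\Big(\sigma\big(f_{\mathrm{presence}}(h_i^{t}, h_j^{t})\big)\Big), \\
\hat{s}_i^{t+1} &= s_i^{t} + f_{\mathrm{out}}\big(h_i^{t+1}\big).
\end{align}
```

In this reading, a weight $w_{ij}^{t}$ close to zero removes the contribution of object $j$ to the update of object $i$, which is what allows the weights to be interpreted as edges of an interaction graph.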

Having defined all parts of the GNN dynamical model, we now describe how to estimate the interaction graph using a variational approach. First, similar to [22], we train our model by minimizing the negative ELBO loss. The loss combines a prediction term, which compares the predicted position of each object with its observed position using a fixed variance parameter, and the Kullback-Leibler divergence between the distribution of the interaction weights and the sparsity prior. After training, we predict interaction weights for each time step independently and then average them across the whole dataset. Next, we estimate the global interaction graph by thresholding the average interaction weights to find the most active relations. Finally, we identify which object is directly controlled by the actions by finding the node that is most correlated with the action variable. In the graph of our running example in Figure 1, we denote this node as "arm", since in all experiments the identified node corresponds to a simulated robot arm. We add an action node and the corresponding edge to the most correlated object to the graph (see App. G for the details and the graph learning results).
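The averaging and thresholding step can be sketched as follows. This is a simplified illustration: it assumes the per-step interaction weights are already available as dictionaries keyed by (cause, affected) pairs, and the threshold value of 0.5 is a hypothetical choice, not one reported in the paper.

```python
def estimate_global_graph(weights_per_step, threshold=0.5):
    """Average per-step interaction weights and keep edges above a threshold.

    weights_per_step: list of dicts mapping (cause, affected) -> weight in [0, 1].
    An edge that is missing from a step contributes 0 for that step.
    Returns the set of (cause, affected) edges whose average weight exceeds
    the (assumed) threshold.
    """
    totals = {}
    for step_weights in weights_per_step:
        for edge, weight in step_weights.items():
            totals[edge] = totals.get(edge, 0.0) + weight
    n_steps = len(weights_per_step)
    return {edge for edge, total in totals.items() if total / n_steps > threshold}


# Two time steps of predicted weights for a scene with an arm and one puck.
steps = [
    {("arm", "puck"): 0.9, ("puck", "arm"): 0.1},
    {("arm", "puck"): 0.7, ("puck", "arm"): 0.3},
]
print(estimate_global_graph(steps))
```

Identifying the node most correlated with the action variable is a separate step that is not shown here.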

3.2 Learning to Independently Control Objects using the Interaction Graph

In this section, we show how the agent can use the learned interaction graph to solve a compositional goal that consists of goals for individual objects. The SRICS agent sequentially gains control over the objects without affecting the previously moved objects. To achieve this, the SRICS method first identifies the set of objects that could be used to actively control a given object by analyzing the discovered relations in the interaction graph. For each object node, we find the set of all nodes that lie on a path from the action node to that object node. These ancestral nodes are the objects that could be used by the agent to control the object. All the other nodes are not required and thus should not be affected during the manipulation of the object.
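One way to compute the ancestral nodes is to intersect the nodes reachable from the action node with the nodes from which the target can be reached. The sketch below assumes the graph is given as a list of (cause, affected) edges and excludes the action node and the target itself from the result; both the function names and this exclusion convention are assumptions made for illustration.

```python
def reachable(adjacency, start):
    """Return all nodes reachable from start by following edges in adjacency."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def ancestral_nodes(edges, action_node, target):
    """Nodes lying on some directed path from action_node to target."""
    forward, backward = {}, {}
    for cause, affected in edges:
        forward.setdefault(cause, set()).add(affected)
        backward.setdefault(affected, set()).add(cause)
    from_action = reachable(forward, action_node)  # descendants of the action node
    to_target = reachable(backward, target)        # ancestors of the target
    return (from_action & to_target) - {action_node, target}


edges = [("action", "arm"), ("arm", "puck1"), ("arm", "puck2"), ("puck1", "puck2")]
print(ancestral_nodes(edges, "action", "puck2"))
```

In this toy graph the arm and the first puck can both be used to move the second puck, while only the arm can be used to move the first puck.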

Interaction graph (left) and the independently controllable subgoal for one of the objects (right).

Next, we introduce the reward signal that uses the set of ancestral nodes to incentivize the agent to learn to control an object without moving others (line 8 of Alg. 1). To achieve this, we propose to replace the original subgoal by a novel independently controllable subgoal that consists of the original subgoal together with the ancestral nodes of the corresponding object.

In contrast to the original notion of a subgoal, which only specifies a state component that the agent should reach, an independently controllable subgoal also includes information about which objects should not be interacted with to reach the target state component.

We now formulate the goal-directed selectivity reward signal that explicitly incentivizes the agent to leave all objects except the target object and its ancestral nodes untouched. As opposed to the usual reward signal, it depends on the independently controllable subgoal and consists of two terms.

The first term is the usual goal-based negative distance to the goal, which is needed to learn directed control over the target object. The second term includes the selectivity.

The selectivity incentivizes the agent to maximize its influence [37, 47, 24] on the target object while having a minimal effect on the remaining objects (corresponding to non-ancestral nodes in the interaction graph) until the subgoal corresponding to the target object is solved. Selectivity reaches its maximum value when the agent changes only the state of the target object without affecting any of the non-ancestral objects. In App. F, we show that selectivity naturally increases during learning to control the environment and that using it as a reward signal increases efficiency and stability.
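To make the two-term structure concrete, the sketch below uses a ratio-style selectivity in the spirit of Thomas et al. [38]: the displacement of the target object divided by the total displacement of the target and the non-ancestral objects. This particular ratio, the weighting factor `alpha`, and all function names are assumptions for illustration; the sketch also omits the gating that switches the selectivity term off once the subgoal is solved.

```python
import math


def selectivity(prev, curr, target, protected):
    """Fraction of total movement that belongs to the target object.

    prev, curr: dicts mapping object name -> position tuple.
    protected: names of non-ancestral objects that should stay untouched.
    Returns 1.0 when only the target moved and 0.0 when nothing moved.
    """
    moved_target = math.dist(prev[target], curr[target])
    moved_others = sum(math.dist(prev[name], curr[name]) for name in protected)
    total = moved_target + moved_others
    return moved_target / total if total > 0 else 0.0


def selectivity_reward(prev, curr, target, goal, protected, alpha=1.0):
    """Negative distance to the goal plus a weighted selectivity bonus (alpha is hypothetical)."""
    distance_term = -math.dist(curr[target], goal)
    return distance_term + alpha * selectivity(prev, curr, target, protected)


prev = {"puck1": (0.0, 0.0), "puck2": (1.0, 1.0)}
curr = {"puck1": (0.5, 0.0), "puck2": (1.0, 1.0)}
print(selectivity(prev, curr, "puck1", ["puck2"]))
print(selectivity_reward(prev, curr, "puck1", (1.0, 0.0), ["puck2"]))
```

Under this definition, pushing a protected object along with the target lowers the bonus, so two trajectories that end equally close to the goal are ranked by how little they disturbed the rest of the scene.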

3.3 SRICS Policy Architecture and Training

Similar to the SMORL agent [46], we use a goal-conditioned attention policy for achieving subgoals. This kind of policy receives a set of object-centric representations as input together with the current subgoal representation. This approach allows us to learn several different skills using only one policy. In addition, it is compatible with a varying number of objects as inputs, allowing the agent to be used in novel situations with a different number of objects. For more details on the goal-conditioned attention policy, we refer to App. E.
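The reason such a policy can handle a variable number of objects is that attention pools a set of any size into a fixed-size vector. The sketch below shows plain softmax dot-product attention with the subgoal representation as the query. It is a generic illustration without learned projections, and it does not reproduce the architecture described in App. E.

```python
import math


def attend(goal_query, objects):
    """Pool a variable-size set of object vectors using the goal as the query.

    goal_query: list of floats; objects: non-empty list of vectors of the same length.
    Returns the attention-weighted average of the objects and the weights.
    """
    scale = math.sqrt(len(goal_query))
    scores = [sum(q * x for q, x in zip(goal_query, obj)) / scale for obj in objects]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    normalizer = sum(exps)
    weights = [value / normalizer for value in exps]
    dim = len(objects[0])
    pooled = [sum(w * obj[d] for w, obj in zip(weights, objects)) for d in range(dim)]
    return pooled, weights


# The pooled vector has the same size for two or three input objects.
pooled_two, _ = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
pooled_three, _ = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(pooled_two), len(pooled_three))
```

In a full policy, the pooled vector would be passed together with the subgoal to a network that outputs the action; that part is omitted here.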

SRICS can be trained with any off-policy goal-conditioned RL algorithm. In particular, we use Soft Actor-Critic (SAC) [19] with Hindsight Experience Replay (HER) [2] as a method to improve sample efficiency. The training of SRICS is presented in Alg. 1.

3.4 Subgoal Ordering during Evaluation

After training, the agent can be applied to more complex tasks than the simple subtasks it was trained on. During the evaluation stage (Fig. 13 in App. I), SRICS encodes the compositional goal given by the environment into a set of independently controllable subgoals. Subsequently, it orders them by the depth of the corresponding nodes in the interaction graph. Due to this order, subgoals that have a large number of dependencies are attempted first, and subgoals that have only a few dependencies, like the robotic arm itself, are attempted later. The order of the independently controllable subgoals makes them compatible with each other. For example, the agent has to first rearrange all objects that need to be manipulated and then try to “solve” the arm subgoal, without destroying the already rearranged objects. More details can be found in App. I.
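The ordering step can be sketched as a breadth-first traversal from the action node followed by a sort. The sketch assumes that depth means the shortest directed distance from the action node and that deeper nodes are attempted first; nodes that cannot be reached from the action node are placed last. These conventions and the function names are assumptions, with the exact procedure given in App. I.

```python
from collections import deque


def node_depths(edges, action_node):
    """Shortest directed distance from the action node to every reachable node."""
    children = {}
    for cause, affected in edges:
        children.setdefault(cause, []).append(affected)
    depths = {action_node: 0}
    queue = deque([action_node])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depths:
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths


def order_subgoals(subgoal_nodes, edges, action_node):
    """Sort subgoals so that the deepest nodes come first and the arm comes late."""
    depths = node_depths(edges, action_node)
    return sorted(subgoal_nodes, key=lambda node: depths.get(node, -1), reverse=True)


edges = [("action", "arm"), ("arm", "puck1"), ("puck1", "puck2")]
print(order_subgoals(["arm", "puck1", "puck2"], edges, "action"))
```

For the chain in the example, the second puck is moved first, then the first puck, and the arm subgoal is attempted last so that reaching it does not require touching the pucks again.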

4 Related Work

In self-supervised reinforcement learning, self-supervision refers to the agent constructing its own goals together with the corresponding reward signal and using them to learn to solve self-proposed goals [10, 6, 16, 9, 34, 36, 3, 4, 35, 20, 31, 44, 33, 46]. Self-supervised agents can acquire a diverse set of general-purpose robotic skills. In the case of complex tasks, it is often beneficial to discover simpler subgoals and learn to solve them [25]. From this point of view, recent hierarchical RL (HRL) agents [25, 32, 43, 26, 42, 14] that try to solve external tasks by proposing several levels of internal subgoals are also self-supervised agents.

Levy et al. [25], Nachum et al. [32] and Wang et al. [43] propose to learn several goal-conditioned policies. In the HIRO agent [32], lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. In contrast, the HAC agent [25] trains each level of the hierarchy independently of the lower levels. The IHRL agent [43] additionally allows bi-directional communication among HRL levels and influence-based exploration to make training more stable and efficient. As such agents need to discover all the structure in the environment while learning on several levels, such approaches struggle to solve complex tasks in modular environments [13]. Next, we review agents operating in environments where some structure is given.

The SMORL agent [46] exploits learned object-centric representations for gaining control over different objects in a self-supervised way and combines the learned skills for solving more complex compositional tasks. However, Zadaianchuk et al. [46] assume independence of different objects, restricting the use of the SMORL agent to settings where objects rarely interact with each other. CURIOUS [11] and CWYC [6] exploit the modular structure of the goal space for efficient exploration in a given goal space. Colas et al. [11] use a policy that obtains the goal module identifier together with the goal value. Blaes et al. [6] also learn a relational graph between tasks. Both agents use a given modular structure for a learning curriculum [16]; however, the discovered subtasks are evaluated independently.

In realistic applications, autonomous agents usually do not have any well-structured representation. Nevertheless, agents can potentially infer it from data. We cover several directions that could be useful for such structure discovery. The first line of works [21, 45, 41, 29, 7] learns object-centric representations from images or videos. Such representations could be potentially used in combination with the SRICS agent. The second line [22, 23, 40, 28, 30] studies how object relations can be discovered from data. The improvements in both of these lines could lead to more general self-supervised agents that use a discovered structure for the generation of goals.

5 Experimental Results

Figure 3: Average distance of objects and arm to the goal positions, comparing SRICS to SMORL, SAC+HER and HAC baselines. For all the experiments, results are averaged over 5 random seeds; shaded regions indicate one standard deviation.

In this section, we present our experiments that address the following questions:

  • How does SRICS perform compared to prior goal-conditioned RL methods on multi-object continuous control manipulation tasks?

  • What is the performance gain obtained from the goal-directed selectivity reward and subgoal ordering during evaluation?

  • How does our agent perform in an environment with an unseen combination of objects?

We run SRICS and the baseline algorithms in the Multi-Object Rearrange environment from Zadaianchuk et al. [46] and in the novel Multi-Object Relational Rearrange environment. The latter environment incorporates additional physical connections between objects such as spring connections. Both environments are based on the multiworld package for continuous control tasks introduced by Nair et al. [34] and use MuJoCo [39] as a realistic simulator. They contain a 7-DoF Sawyer arm, and the agent needs to manipulate a variable number of pucks on a table. In the first environment, the task is to rearrange the objects from random starting positions to random target positions. In the second environment, we add a spring connection between some of the objects and constrain other objects to be static (see App. C). This makes the resulting interaction graph more challenging and thus provides additional insights on the sensitivity of the agent to different interactions between objects. For both environments, we measure the performance of the algorithms as the average distance of all objects (including the robotic arm) to their goal positions (computed on the last step of the episode).
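The evaluation metric described above can be written down directly. The sketch assumes that positions are stored per entity in dictionaries and that the distance is Euclidean; the function name is illustrative.

```python
import math


def final_average_distance(final_positions, goal_positions):
    """Average Euclidean distance of all entities to their goals at the last step.

    Both arguments map an entity name (pucks and the arm alike) to a position tuple.
    """
    distances = [
        math.dist(final_positions[name], goal_positions[name])
        for name in goal_positions
    ]
    return sum(distances) / len(distances)


final = {"arm": (0.0, 0.0), "puck": (3.0, 4.0)}
goals = {"arm": (0.0, 0.0), "puck": (0.0, 0.0)}
print(final_average_distance(final, goals))
```

Because the arm is included in the average, an agent that places all pucks correctly but leaves the arm far from its target still receives a non-zero score.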

5.1 Comparative Analysis

Figure 4: Subtask success rate for SRICS and SMORL for each subtask individually during evaluation in the Relational Rearrange environment. Both methods can solve the Arm reaching subgoal, whereas on the other subtasks SRICS performs better than SMORL.

As manipulation tasks in compositional environments can be approached from different perspectives, we provide a comparison with a state-of-the-art method from each perspective. In terms of problem assumptions, our work is closest to that of SMORL [46], which uses object-centric representations for subgoals and reward construction. In contrast to SRICS, SMORL executes subgoals in a random order and thus can potentially destroy previously solved subgoals. In addition, the SMORL agent does not have the incentive to influence the subgoal object during training. Another approach to learning a goal-conditioned policy with coherent behavior is to use Soft Actor-Critic (SAC) [19] with Hindsight Experience Replay (HER) [2] relabeling. This method tries to achieve the overall goal without splitting it into subgoals. Finally, we consider the Hierarchical Actor-Critic (HAC) [25] method that tries to solve compositional tasks on several levels and is state of the art on several continuous control tasks.

We show the results in Fig. 3 and Fig. 4. The performance of SRICS is significantly better than that of all other algorithms in both environments. SMORL is able to partially rearrange pucks on a table in the simpler Multi-Object Rearrange environment. However, its random subgoal ordering is inefficient for arranging all the objects including the arm. In addition, even when evaluating only based on the puck subtasks (see App. H), SRICS outperforms SMORL, which further demonstrates the benefits of using a goal-directed selectivity reward signal. Moreover, in the more complex Multi-Object Relational Rearrange environment, the gap between SRICS’s and SMORL’s performance is even larger. Furthermore, in all environments SAC is only able to solve the Arm subtasks, whereas HAC performance is close to that of a random agent. We present a further comparison in more challenging environments with different objects and velocity-based state representations in App. D.

5.2 Ablative Analysis

Here, we study the importance of different ingredients of our method for the overall performance of the agent. First, we ablate the selectivity term in our reward signal, using only the negative distance between the object and the desired position as a reward signal. We then additionally ablate the ordering of subgoals described in Sec. 3.4, using instead a random ordering of all subgoals. The results of the ablations are presented in Fig. 5. Both ablations significantly deteriorate the performance of SRICS, showing the importance of both the goal-directed selectivity reward signal and the correct ordering in the goal decomposition for object manipulation in multi-object environments.

Figure 5: Average distance of objects and arm to the goal positions, comparing our method and two ablated variants on the 3- and 4-object Rearrange environments.

5.3 Generalization to Unseen Object Combinations

As SRICS can be used with different sets of objects as inputs, we investigate its performance on unseen combinations of objects. We train SRICS on the Multi-Object Rearrange environment, where different combinations of objects are presented. We leave out one of the combinations for evaluation and use the other combinations for training. The performance on this modified environment is indistinguishable from SRICS’s performance on Multi-Object Rearrange with objects, showing that SRICS can operate on novel combinations of objects (details in App. B).

6 Conclusion and Future Work

In this work, we introduce SRICS, a self-supervised RL method that learns the relational structure of the environment and exploits this structure to learn a compatible sequence of skills to solve a difficult compositional goal. In a range of experiments in multi-object environments with robotic arm manipulation tasks, we demonstrate that SRICS is effective at discovering the most active dynamic relations between objects and can successfully rearrange multiple objects even in the presence of object interactions.

There are several interesting directions for future work. First, one can extend SRICS to image-based object-centric representations, making it more applicable to realistic robotic settings where only high-dimensional sensory information is provided as input to the agent. Moreover, we expect that SRICS can be combined with different modular curriculum learning and exploration strategies [11, 6]. Finally, we expect that active training of the dynamic interaction graph (i.e. when the data for training is collected by the agent that actively explores the environment) could further improve the discovery of important structures in the environment.

Andrii Zadaianchuk is supported by the Max Planck ETH Center for Learning Systems. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). We are grateful to Maximilian Seitzer and Christian Gumbsch for their fruitful comments and corrections.


  • [1] A. Akakzia, C. Colas, P. Oudeyer, M. Chetouani, and O. Sigaud (2021) Grounding language to autonomously-acquired skills via goal generation. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • [2] M. Andrychowicz, D. Crow, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2, §3.3, §5.1.
  • [3] A. Aubret, L. Matignon, and S. Hassas (2021) DisTop: discovering a topological representation to learn diverse and rewarding skills. arXiv preprint arXiv:2106.03853. Cited by: §1, §4.
  • [4] A. Baranes and P. Oudeyer (2013) Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics Auton. Syst.. Cited by: §4.
  • [5] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §3.1.
  • [6] S. Blaes, M. V. Pogancic, J. Zhu, and G. Martius (2019) Control what you can: intrinsically motivated task-planning agent. In Advances in Neural Information Processing Systems (NIPS), Cited by: §3.1, §4, §4, §6.
  • [7] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) MONet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §1, §4.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, Cited by: §3.1.
  • [9] C. Colas, T. Karch, N. Lair, J. Dussoux, C. Moulin-Frier, F. P. Dominey, and P. Oudeyer (2020) Language as a cognitive tool to imagine goals in curiosity driven exploration. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.
  • [10] C. Colas, T. Karch, O. Sigaud, and P. Oudeyer (2021) Intrinsically motivated goal-conditioned reinforcement learning: a short survey. arXiv preprint arXiv:2012.09830. Cited by: §1, §2, §4.
  • [11] C. Colas, P. Oudeyer, O. Sigaud, P. Fournier, and M. Chetouani (2019) CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §1, §4, §6.
  • [12] C. Devin, P. Abbeel, T. Darrell, and S. Levine (2018) Deep object-centric representations for generalizable robot learning. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §1.
  • [13] Z. Dwiel, M. Candadai, M. Phielipp, and A. K. Bansal (2019) Hierarchical policy learning is sensitive to goal space design. arXiv preprint arXiv:1905.01537. Cited by: §4.
  • [14] C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §4.
  • [15] C. Florensa, D. Held, X. Geng, and P. Abbeel (2018) Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning (ICML), Cited by: §1.
  • [16] S. Forestier, R. Portelas, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §4, §4.
  • [17] K. Greff, R. L. Kaufman, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning (ICML), Cited by: §1.
  • [18] K. Greff, S. van Steenkiste, and J. Schmidhuber (2017) Neural expectation maximization. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
  • [19] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), Cited by: Figure 1, §3.3, §5.1.
  • [20] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. A. Riedmiller (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR), Cited by: §4.
  • [21] J. Jiang, S. Janghorbani, G. D. Melo, and S. Ahn (2020) SCALOR: generative world models with scalable object representations. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1, §4.
  • [22] T. N. Kipf, E. Fetaya, K. Wang, M. Welling, and R. S. Zemel (2018) Neural relational inference for interacting systems. In International Conference on Machine Learning (ICML), Cited by: §1, §3.1, §3.1, §4.
  • [23] T. N. Kipf, E. van der Pol, and M. Welling (2020) Contrastive learning of structured world models. In International Conference on Learning Representations (ICLR), Cited by: §3.1, §4.
  • [24] A. S. Klyubin, D. Polani, and C. L. Nehaniv (2005) Empowerment: a universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, Cited by: §3.2.
  • [25] A. Levy, G. D. Konidaris, R. Platt Jr., and K. Saenko (2019) Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations (ICLR), Cited by: §1, §4, §4, §5.1.
  • [26] S. Li, L. Zheng, J. Wang, and C. Zhang (2021) Learning subgoal representations with slow dynamics. In International Conference on Learning Representations (ICLR), Cited by: §4.
  • [27] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel (2016) Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), Cited by: §3.1.
  • [28] Y. Li, A. Torralba, A. Anandkumar, D. Fox, and A. Garg (2020) Causal discovery in physical systems from videos. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.1, §4.
  • [29] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf (2020) Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §4.
  • [30] S. Löwe, D. Madras, R. Zemel, and M. Welling (2020) Amortized causal discovery: learning to infer causal graphs from time-series data. arXiv preprint arXiv:2006.10833. Cited by: §3.1, §4.
  • [31] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet (2019) Learning latent plans from play. In Conference on Robot Learning (CoRL), Cited by: §4.
  • [32] O. Nachum, S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §4, §4.
  • [33] A. Nair, S. Bahl, A. Khazatsky, V. H. Pong, G. Berseth, and S. Levine (2019) Contextual imagined goals for self-supervised robotic learning. In Conference on Robot Learning (CoRL), Cited by: §1, §4.
  • [34] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §4, §5.
  • [35] A. Péré, S. Forestier, O. Sigaud, and P. Oudeyer (2018) Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In International Conference on Learning Representations (ICLR), Cited by: §4.
  • [36] V. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2020) Skew-fit: state-covering self-supervised reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §1, §4.
  • [37] M. Seitzer, B. Schölkopf, and G. Martius (2021) Causal influence detection for improving efficiency in reinforcement learning. arXiv preprint arXiv:2106.03443. Cited by: §3.2.
  • [38] V. Thomas, E. Bengio, W. Fedus, J. Pondard, P. Beaudoin, H. Larochelle, J. Pineau, D. Precup, and Y. Bengio (2018) Disentangling the independently controllable factors of variation by interacting with the world. arXiv preprint arXiv:1802.09484. Cited by: §1, §1.
  • [39] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §5.
  • [40] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In International Conference on Learning Representations (ICLR), Cited by: §3.1, §4.
  • [41] R. Veerapaneni, J. D. Co-Reyes, M. Chang, M. Janner, C. Finn, J. Wu, J. B. Tenenbaum, and S. Levine (2020) Entity abstraction in visual model-based reinforcement learning. In Conference on Robot Learning (CoRL), Cited by: §1, §2.1, §4.
  • [42] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal networks for hierarchical reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §1, §4.
  • [43] R. Wang, R. Yu, B. An, and Z. Rabinovich (2020) I2HRL: interactive influence-based hierarchical reinforcement learning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, pp. 3131–3138. Cited by: §4, §4.
  • [44] Y. Wang, G. N. Narasimhan, X. Lin, B. Okorn, and D. Held (2020) ROLL: visual self-supervised reinforcement learning with object reasoning. In Conference on Robot Learning (CoRL), Cited by: §4.
  • [45] Y. Wu, O. P. Jones, M. Engelcke, and I. Posner (2021) APEX: unsupervised, object-centric scene segmentation and tracking for robot manipulation. arXiv preprint arXiv:2105.14895. Cited by: §1, §2.1, §4.
  • [46] A. Zadaianchuk, M. Seitzer, and G. Martius (2021) Self-supervised visual reinforcement learning with object-centric representations. In International Conference on Learning Representations (ICLR), Cited by: §J.2, Appendix E, Appendix H, §1, §1, §2.1, §3.3, §4, §4, §5.1, §5.
  • [47] R. Zhao, Y. Gao, P. Abbeel, V. Tresp, and W. Xu (2021) Mutual information state intrinsic control. arXiv preprint arXiv:2103.08107. Cited by: §3.2.

Appendix A SRICS pseudocode

Require: GNN dynamical model, goal-conditioned attention policy, goal-conditioned SAC trainer, number of training episodes.
1:  Train the GNN dynamical model on uniformly sampled sequences using the loss described in Eq. 5, and estimate the interaction graph.
2:  for each training episode do
3:      Sample a goal and construct a subgoal from it (see Eq. 6).
4:      Collect episode data with the policy.
5:      Store the transitions in the replay buffer.
6:      Sample transitions from the replay buffer.
7:      Relabel the goal components with a combination of future states and samples from the goal sampling distribution.
8:      Compute the selectivity reward signal (see Eq. 7).
9:      Update the policy on the relabeled transitions with the SAC trainer.
10:  end for
Algorithm 1 SRICS: Self-Supervised Relational RL with Independently Controllable Subgoals
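
The relabeling in step 7 of Algorithm 1 can be illustrated with a short sketch. It is not the authors' implementation: the helper name `relabel_goals`, the array shapes, and the two fraction arguments are assumptions made for illustration, and the actual fractions are those listed in Table 1.

```python
import numpy as np

def relabel_goals(goals, future_states, sampled_goals,
                  frac_future, frac_sampled, rng):
    """Relabel a batch of goals (hypothetical helper, illustration only).

    Each row either keeps its rollout goal, is replaced by a state reached
    later in the same episode, or is replaced by a freshly sampled goal.
    All three arrays are assumed to share the shape (batch, goal_dim).
    """
    if frac_future + frac_sampled > 1.0:
        raise ValueError("fractions must sum to at most 1")
    u = rng.random(goals.shape[0])
    use_future = u < frac_future
    use_sampled = (u >= frac_future) & (u < frac_future + frac_sampled)
    relabeled = goals.copy()
    relabeled[use_future] = future_states[use_future]
    relabeled[use_sampled] = sampled_goals[use_sampled]
    return relabeled
```

Rows that fall into neither group keep the goal that was used during the rollout, which corresponds to the "Rollout" fraction in Table 1.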

Appendix B Generalization to Unseen Combination of Objects

Figure 6: Generalization to unseen combinations of objects. a) Training and evaluation environments. b) Average distance to the goal positions, comparing SRICS performance on seen and unseen combinations of 3 objects.

To test our agent on a novel combination of objects, we modify the Multi-Object Rearrange environment by deleting one of the objects from the table. We split the possible combinations of three objects into training and evaluation combinations, as shown in Fig. 6a. We train the GNN dynamical model on the training combinations and then average all interaction weights to estimate the global interaction graph. Next, we train our SRICS agent on the training combinations and evaluate it on an unseen combination. The performance of SRICS on the unseen combination is close to its performance on Multi-Object Rearrange with 3 objects (see Fig. 6b). This demonstrates that agents equipped with object-centric representations and a compatible policy are not restricted to the particular combinations of objects they were trained on. Such agents should be able to learn how to control many objects and to reuse the learned skills for manipulation in different scenes that contain only a random subset of the objects.
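
The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `global_interaction_graph`, the array layout, and the binarization threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def global_interaction_graph(weights, threshold=0.5):
    """Average per-sequence interaction weights and binarize them.

    weights: array of shape (num_sequences, num_nodes, num_nodes) with
        entries in [0, 1]; weights[b, i, j] is the estimated weight of
        the relation i -> j in sequence b (assumed layout).
    threshold: hypothetical cut-off used only for this illustration.
    """
    mean_weights = np.asarray(weights, dtype=float).mean(axis=0)
    adjacency = mean_weights > threshold
    np.fill_diagonal(adjacency, False)  # ignore self-relations
    return mean_weights, adjacency
```

Averaging over sequences drawn from all training combinations yields a single graph that covers every object, even though each individual scene contains only a subset of them.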

Appendix C Multi-object Rearrange Environments

We implement several modifications of the original Multi-object Rearrange environment to study how our agent performs in more challenging settings. First, we implement the Multi-object Relational Rearrange environment by incorporating additional constraints into the Multi-object Rearrange environment. In particular, we add one spring connection and make one object static. The goals are sampled from random arrangements of the objects in which the constraints above are fulfilled. Next, we implement Multi-object Rearrange with different objects by varying the objects' shape (cube and cylinder), size, and mass. The manipulation of such different objects is more challenging, so the agent has to learn a more complex policy (see Fig. 7 for a visualization of the environments).

Figure 7: Visualization of the Multi-object Rearrange environment with a) 4 objects, b) 6 different objects and c) Multi-object Relational Rearrange environment.

Appendix D Additional Experimental Results

D.1 Larger Number of Different Objects

We conducted several additional experiments to study how the SRICS method performs in a more challenging environment with a larger number of different objects. In particular, we trained the SRICS agent in the Multi-object Rearrange environment with 6 different objects (see Fig. 7b). We compared the performance of SRICS to the SMORL and SAC baselines, which were shown to work in the simpler Multi-object Rearrange environment. For these experiments, we perform no additional hyperparameter optimization and reuse the optimal parameters found for the simpler Multi-object Rearrange environment.

We show the results in Fig. 8a. The SRICS agent makes progress in this environment, while the SMORL agent performs close to a random agent and SAC consistently solves only the easiest "arm" subgoal. This shows that the SRICS agent can learn and efficiently combine many subtasks even when the subtasks are different (e.g. manipulation of objects with different shapes).

D.2 State Representation Extended with Object's Velocity

In addition to more challenging environments, it is also important to show that the SRICS method is not restricted to coordinate-based object-centric representations. For this, we studied the performance of the SRICS agent when the state representation also includes the object's velocity. In the modified environment, each object is represented by its position vector, its velocity vector, and its identifier. For all methods, we use only the positions to calculate the distance to the goal in the reward signal. The results are presented in Fig. 8b. The SRICS agent outperforms both baselines and reaches a performance comparable to the one it achieves with the coordinate-based state representation.
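
A minimal sketch of such an extended representation is given below. The one-hot encoding of the identifier, the field order, and both function names are assumptions made for illustration; the source only states that position, velocity, and identifier are part of the representation and that the distance uses positions.

```python
import numpy as np

def object_representation(position, velocity, object_id, num_objects):
    """Concatenate position, velocity and a one-hot identifier (assumed layout)."""
    one_hot = np.zeros(num_objects)
    one_hot[object_id] = 1.0
    return np.concatenate([position, velocity, one_hot])

def goal_distance(representation, goal_representation, pos_dim):
    """Distance to the goal computed from the position entries only."""
    diff = representation[:pos_dim] - goal_representation[:pos_dim]
    return float(np.linalg.norm(diff))
```

Because `goal_distance` slices out the first `pos_dim` entries, the velocity and identifier entries do not influence the distance term of the reward.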

Figure 8: Average distance to the goal positions, comparing our method to the SAC and SMORL baselines on a) the Rearrange environment with different objects and b) the Rearrange environment with a state representation that contains both coordinates and velocities.

Appendix E Goal-Conditioned Attention Policy

We train one policy that incorporates all learned skills. For this purpose, we use a goal-conditioned attention policy [46]. This policy needs to vary its behavior based on the goal at hand (e.g. one goal can be reaching a particular position with the robotic arm, whereas another goal can be moving an object to a particular position). To allow this flexibility, we use an attention module with a goal-dependent query $Q = g W^q$. Each object is allowed to match with the query via an object-dependent key $K = S W^k$ and contribute to the attention's output through the value $V = S W^v$, which is weighted by the similarity between $Q$ and $K$. The attention head $A_k$ is computed as

$$A_k = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_e}}\right) V,$$

where $S$ is a packed matrix of all $s^i$'s; the parameters $W^q$, $W^k$, $W^v$ constitute learned linear transformations and $d_e$ is the common dimensionality of the key, value and query. The attention output $A$ is a concatenation of all attention heads $A_k$.

Finally, the attention output is combined with the subgoal and processed by a fully connected neural network $f$:

$$\pi(a \mid s, g) = f([A; g]).$$

The parameters used for training of the goal-conditioned attention policy are presented in App. J.1.
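
The attention mechanism described in this appendix can be sketched in NumPy. The sketch assumes standard scaled dot-product attention with a single goal-dependent query; the function names, the array shapes, and the way heads are passed in are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(goal, objects, w_q, w_k, w_v):
    """One attention head with a goal-dependent query.

    goal: (d_g,) subgoal vector; objects: (n, d_s) packed object matrix.
    w_q: (d_g, d_e), w_k and w_v: (d_s, d_e) learned linear maps.
    """
    query = goal @ w_q                                  # (d_e,)
    keys = objects @ w_k                                # (n, d_e)
    values = objects @ w_v                              # (n, d_e)
    scores = keys @ query / np.sqrt(query.shape[-1])    # (n,)
    weights = softmax(scores)                           # one weight per object
    return weights @ values, weights

def policy_input(goal, objects, heads):
    """Concatenate all head outputs with the subgoal (input to the MLP)."""
    outputs = [attention_head(goal, objects, *h)[0] for h in heads]
    return np.concatenate(outputs + [goal])
```

Since the query depends only on the subgoal, the same set of objects can receive very different attention weights for an "arm" subgoal and for an object subgoal, which is what lets one policy cover all skills.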

Appendix F Subgoals Selectivity as an Evaluation Metric

The selectivity (as defined in Eq. 7) is a measure of the agent's selective influence on the components of the environment. In Sec. 3.2, we show that it can be used as an additional reward signal to motivate the agent to selectively control different components of the environment. Additionally, the selectivity can be used as an evaluation metric that quantifies how selective the agent's influence on the components of the environment is. Here, we compute the selectivity measure for the SRICS and SMORL agents, both of which learn to control the components of the representation separately from each other. As seen in Fig. 9, the selectivity measure increases for both agents during goal-conditioned training. Consistent with the objective that the SRICS agent is trained on, its selectivity increases much faster and with smaller variance than that of the SMORL agent. Therefore, the selectivity is an important objective for autonomous control that can make training more stable and efficient.

Figure 9: The selectivity part of the reward signal for both SRICS and SMORL agents, averaged over all entities. Although the SMORL agent is not optimized for being selective, its selectivity increases during training because the agent gains control over the objects. For the SRICS agent, however, the selectivity increases much faster, as the agent is incentivized to be selective.

Appendix G Estimation of the Global Interaction Graph

To learn sparse interaction weights, we use a sparsity prior (see Eq. 5). Specifically, the sparsity prior is a Bernoulli distribution over the presence of each relation. For all experiments, we use the same prior probability of relation presence (see Table 2). The interaction weights deviate from this prior only when the relation is required to improve the predictions of the dynamical model. As can be seen in Fig. 10, this approach successfully reconstructs most of the relations for both the Multi-Object Rearrange and the Multi-Object Relational Rearrange environments.
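
One common way to impose a Bernoulli sparsity prior is a KL penalty between the estimated relation probability and the prior probability. The sketch below assumes this form; Eq. 5 is not reproduced here, so both the KL formulation and the prior value used in the usage note are assumptions for illustration only.

```python
import numpy as np

def bernoulli_kl(weights, prior, eps=1e-8):
    """Elementwise KL( Bernoulli(w) || Bernoulli(prior) ).

    weights: estimated probabilities that a relation is present.
    prior: prior probability of relation presence (hypothetical value
        supplied by the caller; the paper's value is given in Table 2).
    """
    w = np.clip(np.asarray(weights, dtype=float), eps, 1.0 - eps)
    return (w * np.log(w / prior)
            + (1.0 - w) * np.log((1.0 - w) / (1.0 - prior)))
```

With a small prior, for example `bernoulli_kl(0.9, 0.1)`, a confident relation pays a large penalty, so it is kept only if it lowers the prediction error of the dynamical model by more than that penalty.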

Figure 10: Average interaction weights obtained from the GNN dynamical model.
Figure 11: Ordering of the independently controllable subgoals according to the depth of the corresponding nodes in the interaction graph. When the interaction graph is a DAG, such ordering corresponds to the reversed topological ordering.

Appendix H Evaluation on the Average Objects Distance

We additionally evaluate our method on the average object distance metric, similarly to the SMORL paper [46]. This metric is calculated as the average of the distances from all objects on the table (excluding the arm) to their goal positions. Thus, it is biased towards controlling the external objects (which are mostly independent of each other). As can be seen in Fig. 12, SRICS outperforms SMORL on this metric, whereas SAC performs similarly to a passive policy. This result shows the benefit of using the goal-directed selectivity reward signal for the control of external objects. In contrast to the average object distance metric, the average distance metric presented in this paper also reveals the importance of decomposing the goal into a sequence of compatible subgoals.
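
The difference between the two metrics can be made concrete with a small sketch. The function names and the convention that the arm is stored at index 0 are assumptions made for illustration.

```python
import numpy as np

def average_distance(positions, goal_positions):
    """Mean distance to the goal over all entities, including the arm."""
    diff = np.asarray(positions, float) - np.asarray(goal_positions, float)
    return float(np.linalg.norm(diff, axis=-1).mean())

def average_object_distance(positions, goal_positions, arm_index=0):
    """Mean distance to the goal over the objects only (arm excluded)."""
    diff = np.asarray(positions, float) - np.asarray(goal_positions, float)
    dists = np.linalg.norm(diff, axis=-1)
    is_object = np.ones(len(dists), dtype=bool)
    is_object[arm_index] = False  # assumed position of the arm entity
    return float(dists[is_object].mean())
```

An agent that places every object correctly but leaves the arm far from its goal scores well on `average_object_distance` and poorly on `average_distance`, which is why only the latter reflects how well the subgoal sequence fits together.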

Figure 12: Average object distance to the goal positions, comparing SRICS to SMORL and SAC+HER.

Appendix I Ordering of the Subgoals

Figure 13: SRICS pipeline during evaluation.

As reaching one subgoal can affect the results of reaching other subgoals, it is necessary to order the subgoals such that the resulting sequence of skills is compatible. Intuitively, for each compositional goal, we first want to manipulate those objects whose manipulation requires moving other objects. For example, in the case of a robotic arm and objects on the table, we first want to control the objects using the robotic arm and then control the arm itself. As the robotic arm has no dependencies in the interaction graph, the corresponding selectivity reward signal should incentivize the agent to control the arm while not affecting any other objects, thus making the arm subgoal compatible with the objects' subgoals (if solved perfectly).

Generally, we order all subgoals by their depth in the interaction graph, executing subgoals with larger depth first (as illustrated in Fig. 11). The depth of a node is defined as the length of the longest path without loops from the action variable to the node. The order of subgoals with the same depth is random. When the learned interaction graph is a directed acyclic graph (DAG), this ordering corresponds to the reversed topological ordering. The nodes of a DAG are topologically ordered if, for every edge in the graph, the target node comes after the source node. Due to the reversed ordering, only the subgoals whose nodes come later in the topological ordering are executed before the subgoal of a given node. These subgoals should not be affected when the selectivity part (Eq. 7) of the reward signal is maximized. Thus, this ordering of the subgoals guarantees the compatibility of the subgoal sequence when each of the subgoals is solved with a maximal reward signal.
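
The depth-based ordering can be sketched as follows. The graph encoding (a dictionary from each node to its successors), the node names, and the function names are assumptions made for illustration, and the sketch assumes that the interaction graph is a DAG.

```python
import random

def node_depths(successors, action="action"):
    """Longest-path depth of every node reachable from `action`.

    successors: dict mapping a node to the list of nodes it influences.
    Assumes a DAG, so repeated relaxation converges within |V| passes.
    """
    nodes = set(successors) | {v for vs in successors.values() for v in vs}
    depths = {action: 0}
    for _ in range(len(nodes)):
        changed = False
        for u, vs in successors.items():
            if u not in depths:
                continue
            for v in vs:
                if depths[u] + 1 > depths.get(v, -1):
                    depths[v] = depths[u] + 1
                    changed = True
        if not changed:
            break
    return depths

def order_subgoals(successors, action="action", rng=None):
    """Order entities by decreasing depth; ties are broken randomly."""
    rng = rng or random.Random()
    depths = node_depths(successors, action)
    entities = [n for n in depths if n != action]
    rng.shuffle(entities)  # random order among nodes of equal depth
    return sorted(entities, key=lambda n: -depths[n])  # stable sort keeps ties
```

For a hypothetical graph `{"action": ["arm"], "arm": ["cube_a", "cube_b"], "cube_a": ["cube_c"]}`, the deepest node `cube_c` is scheduled first, `cube_a` and `cube_b` follow in random order, and `arm` comes last.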

Appendix J Implementation Details

J.1 SRICS

We refer to Table 1 and Table 2 for the hyperparameters of SRICS for all environments. We use the same number of subgoal-solving attempts as in SMORL. During evaluation, the number of attempts depends on the number of objects in the environment. As the number of attempts is larger than the number of entities, we order only the last subgoals.

J.2 Prior Work

For both SMORL and SAC, we use the previously optimized settings for Multi-Object Rearrange with 3 and 4 objects from Zadaianchuk et al. [46]. In addition, we run a hyperparameter search to find the best HAC hyperparameters. Specifically, we evaluate the performance of HAC while varying the number of steps for each subgoal, the number of levels, and the action noise. For the Multi-Object Relational Rearrange environment, we use, for all algorithms, the same parameters as in the corresponding Multi-Object Rearrange environment.

Hyperparameter Value
Selectivity parameter
Optimizer Adam with default settings
RL Batch Size
Reward Scaling
Automatic SAC entropy tuning yes
SAC Soft Update Rate
# Training Batches per Time Step
Hidden Activation ReLU
Network Initialization Xavier uniform
Separate Attention for Policy & Q-Function yes
Replay Buffer Size
Relabeling Fractions Rollout/Future/Sampled Goals / /
Number of Initial Random Samples
Number of Heads
Discount Factor
Learning Rate
Policy Hidden Sizes
Q-Function Hidden Sizes
Training Path Length
Table 1: General hyperparameters used by SRICS for all environments.
Hyperparameter Value
Sparsity prior
Number of episodes
Episode length
Sequence size during RNN modeling
Number of updates
Learning Rate
Batch Size
Table 2: Hyperparameters for the interaction graph estimation for all environments.