Learning to Compose Hierarchical Object-Centric Controllers for Robotic Manipulation

by Mohit Sharma, et al.
Carnegie Mellon University

Manipulation tasks can often be decomposed into multiple subtasks performed in parallel, e.g., sliding an object to a goal pose while maintaining contact with a table. Individual subtasks can be achieved by task-axis controllers defined relative to the objects being manipulated, and a set of object-centric controllers can be combined in a hierarchy. In prior works, such combinations are defined manually or learned from demonstrations. By contrast, we propose using reinforcement learning to dynamically compose hierarchical object-centric controllers for manipulation tasks. Experiments in both simulation and the real world show that the proposed approach leads to improved sample efficiency, zero-shot generalization to novel test environments, and simulation-to-reality transfer without fine-tuning.




1 Introduction

Manipulation tasks are inherently object-centric and often require a robot to perform multiple subtasks in parallel, such as pressing on a sponge while wiping across a surface, balancing a saucer while serving tea, or maintaining alignment of a screwdriver while unscrewing a screw. The individual subtasks need to be performed in parallel to accomplish the overall task. As the above examples illustrate, subtasks usually correspond to goals and constraints associated with objects in the robot’s environment. Thus, manipulation skills are often defined as 3D motions of the end effector in object-centric coordinate frames, implemented as simple position or force controllers.

One drawback of such an approach is that it results in monolithic controllers for each task, i.e. controllers which act specifically with respect to some fixed coordinate frame. In addition, for many tasks it is not always necessary to control all axes of a given object-centric coordinate frame. For instance, for the wiping task in Figure 1, the sponge needs to use the table surface normal to make contact with the surface, while it is free to move with respect to any other object (wall, corners, dirt) on the surface. Based on this insight, we adopt a modular approach by defining task-axis controllers for each potential subtask. Importantly, the controllers are associated with object-centric axes, such as the normal of a surface or the direction from the end-effector to an object.

We focus on learning a hierarchy of such object-centric task-axis controllers, or object-axis controllers (Figure 1). This hierarchy is especially important since many tasks require performing multiple subtasks in parallel. Previous works use pre-defined sets of task frames attached to objects or the robot, and they often learn a fixed task-frame hierarchy from human demonstrations. Instead, we use Reinforcement Learning (RL) to learn a policy that outputs an ordered list of controllers, which are then composed to be executed on the robot. To ensure different object-axis controllers do not interfere with each other, we compose controllers via nullspace projections [1], where the control signals of lower-priority controllers are projected onto the nullspace of higher-priority ones.

In addition to modularity, our approach provides several other benefits. First, the object-axis controllers are not task-specific, so they can be reused across multiple tasks. Second, composing controllers across multiple different objects makes the learned policies invariant to certain object properties; e.g., a controller that reaches toward an object is invariant to object size. Such invariances are useful for generalizing learned policies beyond the set of objects the policies are trained on. Finally, the use of a structured action space introduces meaningful inductive biases by ensuring robot actions are performed in relation to objects in the scene. We evaluated our approach on four different manipulation tasks: two 2D tasks of fitting and pushing a block, and two real-robot tasks of screwing and door-opening. Experiments show that the proposed approach leads to improved sample efficiency, zero-shot generalization to novel environment configurations, and simulation-to-reality transfer without further fine-tuning. See videos and supplementary materials at https://sites.google.com/view/compositional-object-control/.

Figure 1: Controller Selection and Composition Pipeline. Given the current observations and a list of low-level controllers, an RL policy chooses an ordered list of controllers to use. These controllers are composed via nullspace projection, where the controls of lower-priority controllers are projected onto the nullspace of higher-priority ones. The combined control signals are used to actuate a robot via task-space impedance control. The controller combination runs for a fixed number of time steps before the RL policy is queried again.

2 Related Works

Task Frames: Our use of task-axes is related to the notion of 6D task frames [4, 3, 23]. One of the first works to formalize task frames is [23]. There, the authors referred to different task-axes as compliant or non-compliant based on the type of desired motion along each axis. The authors of [32] proposed hybrid force-position control, which selects different axes of the constraint frame for either position or force control. Simultaneously, the authors of [4, 3] proposed task frames to define robotic manipulation primitives; they noted that the geometric level of task frames can serve as a good middle ground between symbolic actions and the motor control input. Since then, task frames in the form of task spaces have been used extensively in robotics [34]. Prior works treat task-frames as fixed coordinate frames which are either attached to objects of interest or generated from constraints in the environment. By contrast, our approach is more modular and dynamic, as it enables an RL policy to combine task-axes across different objects and dynamically synthesize task-frames.

Task Frame Selection: Although the use of task frames and spaces is widespread in robotics [27, 6, 15, 16, 40, 25, 21], only a few works have explored using learning to select which task frames are appropriate for the given task [27, 16, 40, 31, 9]. However, most of these works use imitation learning, i.e., they learn task frame selection from human demonstrations [27, 16, 40, 31]. The criterion for task-frame selection is typically manually defined using properties such as inter-trial variance or convergence behavior of demonstrations. In our work, we set task-axes selection as the action space for an RL agent, so we do not require demonstrations. Moreover, the RL agent chooses a hierarchy of task-axis controllers, which are composed together for execution.

Hierarchical Controllers: Combining multiple task-axis controllers is related to works in hierarchical control. Hierarchical control is often used in robots with redundant degrees of freedom or bi-manual robot setups where multiple tasks or objectives can be executed in parallel [14, 28, 11, 13]. To combine different controllers, these works project the control signals of lower-priority controllers onto the nullspace of higher-priority controllers. However, most of these works assume a fixed priority order for the tasks/objectives being considered, while some recent works [13] learn the priorities from human demonstrations. Similar to these works, our approach also uses nullspace projections to combine multiple task-axis controllers together. However, instead of using a fixed priority order, our method learns to prioritize controllers by directly interacting with the environment.

Reinforcement Learning: Finally, our approach is related to works on structured action spaces for reinforcement learning (RL) for contact-rich manipulation tasks. Recent works have studied how the choice of action space affects robot learning performance [22, 7, 5]. However, these methods focus only on the final controller output, i.e., comparing fixed with variable impedance control [22, 7] or with hybrid force-position control [5] in joint and task spaces. Our work provides additional structure to the action space by composing hierarchical object-centric controllers.

Hierarchical RL: Composing task-axes controllers for performing tasks is also related to hierarchical RL (HRL) [39, 8, 17]. HRL uses the notion of options, which are temporally-extended actions, and learns to combine them to accomplish a given task. There has been a large body of work which aims to extract the underlying options [38, 2, 37], using techniques such as bottleneck states [38], policy sketches [36], or expert demonstrations [18, 35, 10]. Similarly, there have also been works that use predefined option policies and compose them to learn a “meta-policy” [20]. However, these option policies are defined specifically with respect to the underlying task, and hence it is not clear how reusable these policies are. By contrast, our proposed task-axes controllers are reusable across multiple different manipulation tasks. This is desirable for efficient learning of new manipulation tasks [26]. Additionally, task-axes controllers are different from options since they can be composed both hierarchically and temporally.

3 Learning Hierarchical Compositions of Object-Centric Controllers

Figure 2: Force-Position Controller Composition. Here, the agent controls the green block to push the red block up along the vertical gray wall. A) The agent is given a set of controllers to choose from, each corresponding to a point of interest in the scene. B) The agent chooses controllers, with the force controller into the red block at the higher priority and the position controller toward the wall corner at the lower priority. C) The error of the lower-priority position controller is projected onto the null space of the higher-priority force controller (purple dashed line). D) The projected errors are combined to form the desired position target.

We propose training an RL policy to perform manipulation tasks by using a structured action space consisting of hierarchical compositions of object-centric controllers. Each object in the scene is associated with a fixed set of task-axes, positioned either at object centers or other object key points. For each axis, we define a set of controllers that perform force, position, and rotation controls. This gives a set of pre-defined object-centric task-axis controllers, or object-axis controllers, which define our structured action space. With this action space, instead of directly commanding the end-effector, the RL policy selects multiple object-axis controllers in a prioritized order, which are composed together using null-space projections. Figure 1 shows an overview of the overall proposed approach.

In the next subsections, we first define the different types of object-centric low-level controllers we use, including how their object-centric axes are defined. We then discuss how to combine different object-axis controllers together using null-space projections. Finally, we discuss different RL approaches for learning the high-level policy that selects multiple controllers.

3.1 Controller Types

In this work, we use three different types of controllers: position, force, and rotation. These controllers are object-centric, i.e. their control targets and axes correspond to objects in the scene. For example, position controllers could be attractors that lead the end-effector (EE) close to an object of interest, force controllers could be applying forces perpendicular to object surfaces, and rotation controllers could be aligning an axis of the EE with an axis of the object. Currently, these controllers are manually specified (see details in Section 4), but they could also be autonomously inferred from visual observations of objects in the environment. Figure 2 illustrates force and position controllers and their composition, and Figure 3 shows the rotation controllers.

All quantities below, including the current end-effector position, orientation, and forces, are expressed in the robot’s base frame.

Position and Force Controllers: A position controller consists of a target position and an axis along which the controller moves the robot’s end-effector toward the target. The axis can be a fixed direction, like the normal direction of a surface, or it can be adapted to the current end-effector position, for instance by pointing from the end-effector toward the target. The translation error that a position controller produces is the position error projected onto this axis using the projection matrix of the axis. The force controller is analogous: given a force target and an axis direction, the force error it produces is the force error projected onto that axis.
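The axis-projected errors described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the projection matrix of a unit axis is its outer product with itself, and that the error is the projected difference between target and current value. The function names (`axis_projection`, `axis_error`, `direction_to_target`) and the example numbers are hypothetical.

```python
import numpy as np

def axis_projection(u):
    """Projection matrix onto the line spanned by axis u (assumed form: u u^T)."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return np.outer(u, u)

def axis_error(target, current, u):
    """Difference between target and current value, restricted to axis u.

    Works for positions and for forces alike, since both controllers
    are described as projecting their error onto a task axis.
    """
    diff = np.asarray(target, dtype=float) - np.asarray(current, dtype=float)
    return axis_projection(u) @ diff

def direction_to_target(target, current):
    """Adaptive axis: unit vector from the current position toward the target."""
    d = np.asarray(target, dtype=float) - np.asarray(current, dtype=float)
    return d / np.linalg.norm(d)

# Illustrative values only: a fixed axis along a surface normal (z).
x_current = np.array([0.0, 0.0, 0.5])
x_target = np.array([0.3, 0.2, 0.0])
delta_fixed = axis_error(x_target, x_current, u=[0.0, 0.0, 1.0])

# With the adaptive axis, the whole position error lies along the axis.
delta_adaptive = axis_error(x_target, x_current,
                            u=direction_to_target(x_target, x_current))
```

With a fixed axis, only the error component along that axis survives, which is what lets other controllers use the remaining directions.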

Rotation Controller: The rotation controller attempts to align one axis of the end-effector orientation with a target axis. The end-effector axis is chosen by a unit vector that performs axis selection. For example, to align the X-axis of the end-effector frame with the target axis, the selection vector is (1, 0, 0). The rotation controller produces a delta rotation target in the end-effector frame, which we compute via the angle-axis representation.
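One common way to obtain such an angle-axis delta rotation is to rotate about the cross product of the current and target axes by the angle between them. The sketch below follows that textbook construction and is an assumption about the form of the computation, not a transcription of the paper's equation. The helper names and the convention that the target axis is given in the base frame are hypothetical.

```python
import numpy as np

def rotvec_to_matrix(rv):
    """Rodrigues' formula: rotation vector (axis * angle) to a 3x3 rotation matrix."""
    angle = np.linalg.norm(rv)
    if angle < 1e-12:
        return np.eye(3)
    k = rv / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def align_axis_rotvec(R_current, e, target_axis):
    """Delta rotation (as a rotation vector in the end-effector frame) that
    turns the selected end-effector axis onto the target axis.

    R_current:   3x3 end-effector orientation in the base frame
    e:           axis-selection unit vector in the end-effector frame
    target_axis: target direction in the base frame (assumed convention)
    """
    e = np.asarray(e, dtype=float)
    t = np.asarray(target_axis, dtype=float)
    t_ee = R_current.T @ (t / np.linalg.norm(t))  # target seen from the EE frame
    axis = np.cross(e, t_ee)
    s = np.linalg.norm(axis)
    c = np.dot(e, t_ee)
    if s < 1e-9:
        # Already aligned, or exactly opposite (ambiguous axis; not handled here).
        return np.zeros(3)
    return axis / s * np.arctan2(s, c)

# Illustrative check: align the EE z-axis with the base z-axis.
R_current = rotvec_to_matrix(np.array([0.3, 0.0, 0.0]))
e_z = np.array([0.0, 0.0, 1.0])
delta = align_axis_rotvec(R_current, e_z, target_axis=[0.0, 0.0, 1.0])
R_new = R_current @ rotvec_to_matrix(delta)
```

Because the delta rotation is expressed in the end-effector frame, it is applied by right-multiplying the current orientation.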

Null Controllers: The high-level policy also has the option to choose a null controller, which contributes zero error to both the translation and rotation targets. While other controllers can be chosen at most once, the null controllers can be chosen multiple times, giving the high-level policy more flexibility.

3.2 Controller Composition

Force-Position Composition: The RL policy selects at most three force and position controllers to compose. No more than three can execute concurrently, because there are only three position dimensions. The RL policy outputs a priority order for these controllers, from highest to lowest priority. The final position target is computed by projecting the lower-priority targets onto the nullspaces of the higher-priority controllers, then summing them. For each controller, the nullspace projection matrix is formed from the pseudoinverse of the matrix whose rows are the concatenated axes of all higher-priority controllers, and each projected error is scaled by the position controller gain or the force gain.

Although the composition is described here with all controllers as position controllers, in our implementation we combine multiple position and force controllers together. For a force controller, the position terms are swapped with the corresponding force terms. Figure 2 illustrates the force-position controller composition.
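The prioritized composition can be illustrated with a short NumPy sketch. It assumes the standard nullspace projector, the identity minus the pseudoinverse of A times A, where the rows of A are the axes of all higher-priority controllers, and it assumes each controller contributes its gain-scaled error after projection. The function names and the 2D numbers, loosely modeled on the Figure 2 scenario, are hypothetical.

```python
import numpy as np

def nullspace_projector(higher_axes, dim):
    """N = I - pinv(A) @ A, with the higher-priority axes stacked as rows of A."""
    if len(higher_axes) == 0:
        return np.eye(dim)
    A = np.vstack(higher_axes)
    return np.eye(dim) - np.linalg.pinv(A) @ A

def compose_targets(controllers):
    """Sum gain-scaled errors, each projected onto the nullspace of all
    higher-priority axes.

    controllers: list of (axis, error, gain) tuples in decreasing priority.
    """
    dim = len(controllers[0][0])
    total = np.zeros(dim)
    higher_axes = []
    for axis, error, gain in controllers:
        N = nullspace_projector(higher_axes, dim)
        total += N @ (gain * np.asarray(error, dtype=float))
        higher_axes.append(np.asarray(axis, dtype=float))
    return total

# Higher priority: push along +x. Lower priority: move along a diagonal axis.
push = (np.array([1.0, 0.0]), np.array([0.2, 0.0]), 1.0)
diag_axis = np.array([0.6, 0.8])
move = (diag_axis, 0.5 * diag_axis, 1.0)

target = compose_targets([push, move])
```

In this example the lower-priority controller loses its x-component, so it cannot disturb the push, while its y-component passes through unchanged.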

Figure 3: Rotation Controller Composition. Here, the agent rotates the Franka robot’s gripper from the initial pose (A) to the final pose (E), so the gripper aligns with a door handle. A) The agent is given a set of rotation controllers to choose from, aligning various axes of the gripper with different target axes of the handle. B) Two controllers are chosen, one at the higher priority and one at the lower priority. C) Both the current and target axes of the lower-priority controller (green arrows) are projected down to the null-space (green planes) of the current axis of the higher-priority controller (gripper’s blue axis). D) The desired rotation target is formed by combining the higher-priority rotation in the blue plane with the projected lower-priority rotation in the green plane. Note that the lower-priority rotation does not interfere with the higher-priority rotation.

Rotation Composition: The RL policy selects at most two rotation controllers to compose. This is because when the highest-priority controller fixes one axis of a rotation frame, there is only one degree of freedom left, which is a rotation in the 2D nullspace of the fixed axis. Similar to force-position controller compositions, we project the errors of lower-priority controllers onto the nullspace of higher-priority controllers, compose the resulting rotations, and scale the rotation error by a rotation error gain. This procedure ensures the higher-priority rotation controller always reaches its goal, and the trajectory of that axis is not affected by the lower-priority controller (see Figure 3 for an illustration).
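The geometric idea can be sketched as follows, under several assumptions that are not taken from the paper: all axes are unit vectors in one common frame, the lower-priority rotation is a rotation about the higher-priority controller's current axis, its angle is the signed angle between the projected current and target axes, the two rotations are composed by applying the lower-priority one first, and gains are omitted. All function names are hypothetical.

```python
import numpy as np

def rotvec_to_matrix(rv):
    """Rodrigues' formula: rotation vector (axis * angle) to a 3x3 rotation matrix."""
    angle = np.linalg.norm(rv)
    if angle < 1e-12:
        return np.eye(3)
    k = rv / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rotation_between(a, b):
    """Rotation vector turning unit vector a onto unit vector b."""
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), np.dot(a, b)
    if s < 1e-9:
        return np.zeros(3)  # aligned or opposite; the latter is not handled here
    return axis / s * np.arctan2(s, c)

def projected_rotation(a_low, t_low, a_high):
    """Lower-priority rotation restricted to the plane orthogonal to a_high."""
    N = np.eye(3) - np.outer(a_high, a_high)  # nullspace of the fixed axis
    p, q = N @ a_low, N @ t_low
    if np.linalg.norm(p) < 1e-9 or np.linalg.norm(q) < 1e-9:
        return np.zeros(3)
    p, q = p / np.linalg.norm(p), q / np.linalg.norm(q)
    angle = np.arctan2(np.dot(a_high, np.cross(p, q)), np.dot(p, q))
    return a_high * angle

def compose_rotations(a_high, t_high, a_low, t_low):
    """Apply the projected lower-priority rotation, then the higher-priority one."""
    R_low = rotvec_to_matrix(projected_rotation(a_low, t_low, a_high))
    R_high = rotvec_to_matrix(rotation_between(a_high, t_high))
    return R_high @ R_low, R_low

a_high, t_high = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])
a_low, t_low = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
R_total, R_low = compose_rotations(a_high, t_high, a_low, t_low)
```

Since the lower-priority rotation is about the higher-priority axis, it leaves that axis fixed, which is why the higher-priority controller still reaches its target after composition.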

Controlling the Robot: We use task-space impedance control to convert translation and rotation targets to configuration-space targets via the Jacobian transpose, and we actuate the robot via joint torques. We first concatenate the translation target with the axis-angle representation of the rotation target to form the final 6D delta end-effector target. Then, the robot joint-torque commands are computed from this target using diagonal stiffness and damping matrices and the analytic Jacobian. Terms for compensating gravity and Coriolis forces are omitted for brevity. In practice, we cap the magnitude of the target to limit maximum control effort, and we add an integral term to the force controllers for better convergence. Once a set of controllers is selected, their combination runs for a fixed number of timesteps before the RL policy is queried again for a new set of controllers.
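A minimal sketch of a Jacobian-transpose impedance law of this kind is shown below. The exact expression used in the paper is not recoverable from the text, so the form here (stiffness times the 6D delta target, minus damping times the end-effector velocity, mapped through the Jacobian transpose) is an assumed textbook variant. The function name, the `max_delta_norm` parameter, and all numbers are hypothetical, and gravity and Coriolis compensation are left out.

```python
import numpy as np

def impedance_torques(J, delta, qdot, stiffness, damping, max_delta_norm=None):
    """Joint torques from a 6D delta end-effector target.

    J:          6 x n Jacobian
    delta:      6D target (translation error stacked with axis-angle rotation error)
    qdot:       n joint velocities
    stiffness:  6 diagonal stiffness entries
    damping:    6 diagonal damping entries
    """
    delta = np.asarray(delta, dtype=float)
    if max_delta_norm is not None:
        norm = np.linalg.norm(delta)
        if norm > max_delta_norm:
            delta = delta * (max_delta_norm / norm)  # cap the control effort
    ee_velocity = J @ qdot
    wrench = np.diag(stiffness) @ delta - np.diag(damping) @ ee_velocity
    return J.T @ wrench

# Toy 7-joint example where the first six joints map one-to-one to task axes.
J = np.zeros((6, 7))
J[:, :6] = np.eye(6)
delta = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
tau = impedance_torques(J, delta, np.zeros(7),
                        stiffness=np.full(6, 100.0), damping=np.full(6, 10.0))
tau_capped = impedance_torques(J, delta, np.zeros(7),
                               stiffness=np.full(6, 100.0),
                               damping=np.full(6, 10.0), max_delta_norm=0.5)
```

Capping the norm of the delta target bounds the stiffness term, which is one simple way to limit the commanded torque when the target is far away.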

3.3 RL with Object-Axis Controllers

We use RL to learn a policy that composes object-axis controllers to perform the underlying task. The policy outputs an ordered list of controllers, which are composed together to output the final control signal to move the robot. The combination of controllers is run for a fixed number of timesteps before the RL policy is queried again. Note that the controllers do not have to converge before the RL policy switches to the next combination. We next discuss multiple ways in which the RL policy can output the ordered list of controllers.

Discrete Combinatorial Actions: Consider the total number of available controllers and the number of controllers that can be executed simultaneously. One simple way to output an ordered list of controllers is to use a discrete action space, where the policy selects an action from all available controller permutations. Such an action space grows combinatorially with the number of available controllers and is not scalable for environments with many controllers.

Continuous Priority Scores: A continuous-space alternative is to let the policy output a priority score for every controller. These scores are then used to order the controllers, and the controllers with the highest priorities are executed at each step. Although the dimension of this action space grows only linearly with the number of controllers, it can often lead to sub-optimal performance since the agent now needs to explore a much larger action space than before.
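Turning priority scores into an ordered controller list amounts to a top-k sort. The snippet below is a small sketch of that step. The function name, the score values, and the tie-breaking rule (lower index first, via a stable sort) are assumptions.

```python
import numpy as np

def order_by_priority(scores, num_selected):
    """Indices of the num_selected highest-scoring controllers, highest first."""
    scores = np.asarray(scores, dtype=float)
    # A stable sort on the negated scores keeps ties in index order.
    order = np.argsort(-scores, kind="stable")
    return order[:num_selected].tolist()

# Four available controllers, three executed at a time.
selected = order_by_priority([0.1, 0.9, 0.4, 0.7], num_selected=3)
```

Here `selected` lists controller indices in decreasing priority, which is the form the composition step expects.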

Expanded-MDP: To avoid the sub-optimal performance of the above methods, we propose an expanded-MDP formulation that still uses a discrete action space while avoiding combinatorial expansion. Here, we expand each environment-execution step of the MDP into as many intermediate controller-selection steps as there are controllers to compose, with the original environment-execution step occurring after the last intermediate step. At each intermediate step, the policy selects one controller from the available choices. Once all controllers are selected, the robot takes an actual environment step. The reward function is modified such that no reward is given for the controller-selection steps before the last one. Similar MDP transformations have been suggested previously to solve continuous-action MDPs using discrete-action-space RL algorithms [30, 24].
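To make the scaling of the three action-space variants concrete, the sketch below counts their sizes for a given number of available and simultaneously executed controllers. The counts assume that the combinatorial variant enumerates ordered selections without repetition and that the expanded MDP offers every controller at each intermediate step. The function name and the example numbers are hypothetical.

```python
import math

def action_space_summary(num_controllers, num_selected):
    """Sizes of the three action-space variants under the stated assumptions."""
    return {
        # Ordered selections without repetition: N! / (N - k)!
        "combinatorial_actions": math.perm(num_controllers, num_selected),
        # One continuous score per controller.
        "priority_score_dims": num_controllers,
        # One discrete choice per intermediate step, k steps per environment step.
        "expanded_mdp_actions_per_step": num_controllers,
        "expanded_mdp_steps_per_env_step": num_selected,
    }

summary = action_space_summary(num_controllers=10, num_selected=3)
```

For ten controllers and three slots, the combinatorial variant already has 720 discrete actions, while the expanded MDP keeps ten actions per step at the cost of three decision steps per environment step.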

To use the Expanded-MDP formulation, at each controller-selection step the policy needs to know its previous controller selections. One approach is to represent each controller with a 1-hot encoding and append the 1-hot encodings of the previously selected controllers to the observations. This expands the observation space by one 1-hot vector per previously selected controller, and we refer to this representation as multi-1-hot. However, in many cases it might not be necessary to know the order of the previously selected controllers, i.e., it is sufficient to know which controllers have been selected but not their order. So, for the second representation, we merge the 1-hot encodings of multiple previous controllers into one binary vector. This increases the observation space by only a single vector with one entry per available controller, and can lead to faster learning. We refer to this representation as single-1-hot.
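The two observation encodings can be sketched as follows. The exact vector sizes used in the paper are not stated in the text, so the layout here (one slot per earlier selection for multi-1-hot, one shared binary vector for single-1-hot) is an assumption, as are the function names and the `max_previous` parameter.

```python
import numpy as np

def multi_one_hot(previous, num_controllers, max_previous):
    """One 1-hot slot per earlier selection, in selection order.

    Slots that have not been filled yet stay all-zero.
    """
    enc = np.zeros((max_previous, num_controllers))
    for slot, idx in enumerate(previous):
        enc[slot, idx] = 1.0
    return enc.ravel()

def single_one_hot(previous, num_controllers):
    """One binary vector marking which controllers were selected, order dropped."""
    enc = np.zeros(num_controllers)
    for idx in previous:
        enc[idx] = 1.0
    return enc

# Four controllers; controllers 2 and 0 were picked in that order.
ordered = multi_one_hot([2, 0], num_controllers=4, max_previous=2)
unordered = single_one_hot([2, 0], num_controllers=4)
```

The multi-1-hot vector distinguishes the order of earlier picks, while the single-1-hot vector maps both orderings to the same, shorter observation.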

4 Experiment Tasks and Setup

With our experiments, we aim to evaluate 1) how useful the proposed object-axis controllers are for task learning, 2) how important controller composition is for task learning, and 3) how well our proposed approach generalizes to different test configurations.

Figure 4: Experiment Tasks. From left to right: Block Fit, Block Push, Franka Hex-Screw, Franka Door-Opening tasks implemented in simulation, and Franka tasks in the real world.

Figure 4 visualizes the tasks used to evaluate our approach. There are two 2D tasks, Block Fit and Block Push, and two real robot tasks, screwing hex-screws and opening doors with the 7 DoF Franka Emika Panda arm. We compare both learning performance of the proposed approach against baselines, as well as their ability to generalize to novel environment configurations. To study generalization, we train policies on a small set of training environment configurations and test them on a novel test set. Training over multiple environments is important to avoid overfitting. Details of each task, including controller specifications, task variations, observation and action spaces, and the reward functions can be found in the Appendix.

Figure 5: Example environment configurations for Block Push (left) and Block Fit (right) environments. Top row shows some examples of train configurations, and the bottom row shows some examples of test configurations. The orange wall shows the goal wall to reach.

Block Fit: In this task, a 2D block robot needs to navigate to a 2D goal pose in the scene. There are multiple walls or obstacles in the scene, so the robot cannot directly proceed towards the goal. Figure 5 (Left) shows some of the different train and test configurations. The low-level controllers are wall-centric. Different environment configurations have different wall lengths and angles between walls. The training set has different environment configurations, while the test set has .

Block Push: In this task, a 2D block robot needs to push another block along a vertical wall over a ledge to a desired goal pose. Figure 5 (right) visualizes some train and test configurations. Controllers and environment wall configurations are similar to those of Block Fit. The environment samples the initial pose of the block robot and the target block. The training set has different environment configurations, and the test set has .

Franka Hex-Screw: In this task, a 7-DoF Franka Panda arm is used to insert a hex-key into a screw, and turn the screw to a desired angle while applying a downward force and maintaining vertical orientation. The screw will not turn unless a sufficient pre-defined () downward force is applied. Different environment configurations have different wrench and screw sizes. The training set uses size scale multipliers of , and the test set uses .

Franka Door-Opening: In this task, the Franka robot needs to open a door by first turning its door handle and then pulling the door beyond an opening threshold. To avoid trivial policy solutions, the door will not open unless the handle is first turned to a desired angle. The environment samples the initial relative pose between the EE and the door, and different configurations have different locations of the door handle on the door. The training and test set contain and configurations.

Compared Approaches: We allow the policy to compose three controllers at a time across all experiments, which we found to be sufficient. To evaluate the utility of our proposed object-axis controllers, we compare against an RL agent that controls the robot directly via end-effector delta-poses. We call this approach EE-Space. We also evaluate the need for executing multiple controllers in parallel by comparing against a baseline which only chooses one controller at each timestep. We call this 1-Ctrlr. To show the efficacy of our proposed Expanded-MDP formulation, we compare against both the discrete combinatorial (3-Combo) and the continuous priority-score (3-Priority) action spaces. Both of these approaches naively combine all possible controller combinations, and we show how this can lead to sub-optimal performance.

RL Training: We use Proximal Policy Optimization (PPO) [33] implemented in stable-baselines [12] across all tasks and action space variants. Given the high variance in policy-gradient RL algorithms, we run all methods with multiple different seeds, sampled uniformly from a fixed range. All tasks are simulated with NVIDIA Isaac Gym (https://developer.nvidia.com/isaac-gym), a GPU-accelerated robotics simulator [19].

Metrics: We report the success rates of the learned policies separately for train and test environment configurations. Performance on the train set indicates whether or not the approach can robustly solve a task, and performance on the test set evaluates generalization abilities. Test set is split into two subsets, one with small deviations from the train configurations, and another with larger deviations. We report additional results including more fine-grained analysis for each task in the Appendix.

5 Experiment Results and Discussion

Figure 6: Success rates for all tasks on training environment configurations.

Block Tasks: Figure 6 (left) plots the success ratios averaged over all train environment configurations for Block Fit and Block Push. The Expanded-MDP methods are able to successfully learn both tasks. While EE-Space also makes progress on both tasks, it has a lower success rate, and this is due to its inability to robustly solve a few challenging configurations (see Appendix). Both 1-Ctrlr and 3-Priority perform well on Block Fit but poorly on Block Push. We attribute this difference to how there is a greater need to use multiple controllers in the right order for Block Push. For instance, the policy needs to choose a force/position controller that pushes into the wall and then another controller to move up. In addition, robustly pushing the block around the edge of the vertical wall also requires multiple controllers. Although it is feasible to achieve this by quickly switching between controllers, such a strategy is not robust. 1-Ctrlr is unable to use multiple controllers at the same time, and using the high-dimensional priority score action space is challenging.

Table 1 shows success rates for both tasks on two sets of test configurations. Both EE-Space and Expanded-MDP methods perform well when test configurations have small deviations from train configurations, with EE-Space performing slightly worse. However, for large deviations, EE-Space performs poorly, achieving success rates of 0.371 for Block Fit and 0.518 for Block Push. By contrast, Expanded-MDP methods perform much better, achieving 0.953 to 0.974 for Block Fit and 0.751 to 0.788 for Block Push, and 3-Priority also outperforms EE-Space for the Block Fit task. In addition, 1-Ctrlr sees greater performance degradation going from small to large deviations in test configurations. Together, these results indicate that using a structured action space of multiple object-centric controllers leads to better generalization than using one controller or directly learning in the EE-space.

| Task | Variation | EE-Space | 1-Ctrlr | 3-Priority | 3-Combo | 3-Exp-Single | 3-Exp-Multi |
|---|---|---|---|---|---|---|---|
| Block Fit | Train | 0.87 (0.213) | 0.778 (0.38) | 0.936 (0.032) | 0.294 (0.18) | 0.998 (0.002) | 1.00 (0.0) |
| Block Fit | Test-Small | 0.87 (0.10) | 0.916 (0.14) | 0.99 (0.001) | 0.184 (0.12) | 0.99 (0.001) | 0.99 (0.01) |
| Block Fit | Test-Large | 0.371 (0.246) | 0.396 (0.423) | 0.877 (0.141) | 0.165 (0.23) | 0.974 (0.048) | 0.953 (0.087) |
| Block Push | Train | 0.966 (0.046) | 0.594 (0.087) | 0.548 (0.129) | 0.0 (0.0) | 0.974 (0.025) | 0.978 (0.022) |
| Block Push | Test-Small | 0.912 (0.045) | 0.577 (0.193) | 0.396 (0.041) | 0.0 (0.0) | 0.945 (0.045) | 0.960 (0.030) |
| Block Push | Test-Large | 0.518 (0.185) | 0.152 (0.137) | 0.376 (0.032) | 0.0 (0.0) | 0.751 (0.103) | 0.788 (0.132) |

Table 1: Mean (SD) success rates for Block Fit and Block Push tasks on different environment configurations.
| Task | Variation | EE-Space | 1-Ctrlr | 3-Priority | 3-Combo | 3-Exp-Single | 3-Exp-Multi |
|---|---|---|---|---|---|---|---|
| Hex-Screw | Train | 0.002 (0.002) | 0.183 (0.303) | 0.960 (0.048) | 0.774 (0.194) | 0.984 (0.01) | 0.980 (0.016) |
| Hex-Screw | Test-Small | 0.00 (0.00) | 0.13 (0.072) | 0.62 (0.045) | 0.429 (0.430) | 0.963 (0.01) | 0.966 (0.015) |
| Hex-Screw | Test-Large | 0.00 (0.00) | 0.026 (0.025) | 0.633 (0.081) | 0.34 (0.057) | 0.936 (0.028) | 0.936 (0.035) |
| Hex-Screw | Real-World | n/a | 0.0 | 0.5 | 0.0 | 0.9 | 0.6 |
| Door-Open | Train | 0.002 (0.006) | 0.947 (0.021) | 0.982 (0.007) | 0.984 (0.013) | 0.987 (0.009) | 0.984 (0.015) |
| Door-Open | Test-Small | 0.066 (0.063) | 0.922 (0.043) | 0.965 (0.046) | 0.975 (0.011) | 0.997 (0.006) | 0.992 (0.015) |
| Door-Open | Test-Large | 0.000 (0.001) | 0.936 (0.032) | 0.983 (0.006) | 0.985 (0.007) | 0.996 (0.005) | 0.994 (0.013) |
| Door-Open | Real-World | n/a | 0.0 | 1.0 | 0.9 | 1.0 | 1.0 |

Table 2: Success rates for Franka Hex-Screw and Door-Open tasks on train and test environment configurations, averaged across seeds. Parentheses denote standard deviation. Real-world results are evaluated over repeated trials for each method. We did not run EE-Space policies in the real world as they were unable to learn the tasks in simulation.

Franka Tasks: Figure 6 (right) shows training results for both Franka Hex-Screw and Door-Open tasks. The Expanded-MDP methods perform well on both the tasks, while EE-Space does not make progress on either task. For Hex-Screw, the EE-Space policy is able to reach the screw, but is unable to learn to simultaneously rotate the screw and apply sufficient downward force. For Door-Open, the EE-Space policy reaches the door handle, but fails to grasp and completely rotate the door handle in a robust manner to open the door. One reason for these EE-Space failures is that exploration in both tasks is difficult in the end-effector space. To aid EE-Space exploration, we evaluated the approach from [29], which gives the agent additional exploration rewards. While doing so leads the agent to cover a larger region in the state space, the explored states do not always correspond with meaningful behaviors for task completion, so we did not observe any gains using this method.

Unlike with the Block 2D tasks, 3-Priority is able to learn both Franka tasks. This is because the Franka tasks have fewer possible controllers, which results in lower-dimensional priority-score action spaces. The reduced action-space dimensions of the Franka tasks also allowed us to evaluate 3-Combo, which is able to learn both tasks, although it achieves worse performance on Hex-Screw. Similarly, 1-Ctrlr is able to make progress on Door-Open but not Hex-Screw, which suggests that Hex-Screw requires more precise coordination of multiple controllers than Door-Open. Table 2 (Test-Small and Test-Large rows) shows the success rates for both tasks on test configurations with small and large deviations. All methods that use hierarchical combination of multiple object-axis controllers generalize well to both small and large test deviations. Methods that performed poorly during training, namely EE-Space for both tasks and 1-Ctrlr for Hex-Screw, do not generalize well.

To evaluate the Franka tasks in the real world, we performed trials of each method on the real robot, each trial with a different sampled initial state. For the Hex-Screw task, we further tested on different screw and key sizes. All methods that used the proposed composition of hierarchical controllers were able to robustly perform Door-Open in the real world, while only 3-Exp-Single was able to do so for Hex-Screw. Hex-Screw is more challenging than Door-Open because it requires more precise movements for alignment and insertion. As a result, the sim-to-real gap in the robot dynamics and controller responses leads to greater performance degradation for Hex-Screw than for Door-Open.

6 Conclusion and Future Work

In this work, we propose using RL to learn how to compose hierarchical object-centric controllers for manipulation tasks. Our approach has several advantages. First, the object-centric controllers can be reused across multiple tasks. Second, controller compositions are invariant to certain object properties. Finally, the use of a structured action space introduces meaningful inductive biases for manipulation. Our experiments show that the proposed approach leads to more guided exploration and consequently improved sample efficiency, and that it enables zero-shot generalization to test environments and simulation-to-reality transfer without fine-tuning. In future work, we will tackle the main limitation of the current approach: the set of controllers is fixed and manually defined.

This work was supported by NSF Award No. CMMI-1925130, NSF Graduate Research Fellowship Program Grant No. DGE 1745016, Office of Naval Research Grant No. N00014-18-1-2775, ARL grant W911NF-18-2-0218 as part of the A2I2 program, and Nvidia NVAIL.


  • [1] F. J. Abu-Dakka, B. Nemec, J. A. Jørgensen, T. R. Savarimuthu, N. Krüger, and A. Ude (2015) Adaptation of manipulation skills in physical contact with the environment to reference force profiles. Autonomous Robots 39 (2), pp. 199–217. Cited by: §1.
  • [2] P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [3] D. H. Ballard and L. Hartman (1986) Task frames: primitives for sensory-motor coordination. Computer Vision, Graphics, and Image Processing 36 (2-3), pp. 274–297. Cited by: §2.
  • [4] D. H. Ballard (1984) Task frames in robot manipulation.. In AAAI, Vol. 19, pp. 109. Cited by: §2.
  • [5] C. C. Beltran-Hernandez, D. Petit, I. G. Ramirez-Alpizar, T. Nishi, S. Kikuchi, T. Matsubara, and K. Harada (2020) Learning contact-rich manipulation tasks with rigid position-controlled robots: learning to force control. arXiv preprint arXiv:2003.00628. Cited by: §2.
  • [6] D. Berenson, S. Srinivasa, and J. Kuffner (2011) Task space regions: a framework for pose-constrained manipulation planning. The International Journal of Robotics Research 30 (12), pp. 1435–1460. Cited by: §2.
  • [7] M. Bogdanovic, M. Khadiv, and L. Righetti (2019) Learning variable impedance control for contact sensitive tasks. arXiv preprint arXiv:1907.07500. Cited by: §2.
  • [8] G. Comanici and D. Precup (2010) Optimal policy switching algorithms for reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, pp. 709–714. Cited by: §2.
  • [9] A. Conkey and T. Hermans (2019) Learning task constraints from demonstration for hybrid force/position control. In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pp. 162–169. Cited by: §2.
  • [10] C. Daniel, H. Van Hoof, J. Peters, and G. Neumann (2016) Probabilistic inference for determining options in reinforcement learning. Machine Learning 104 (2-3), pp. 337–357. Cited by: §2.
  • [11] A. Dietrich, C. Ott, and A. Albu-Schäffer (2015) An overview of null space projections for redundant, torque-controlled robots. The International Journal of Robotics Research. Cited by: §2.
  • [12] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018) Stable baselines. GitHub. Note: https://github.com/hill-a/stable-baselines Cited by: §4.
  • [13] A. Karami, H. Sadeghian, M. Keshmiri, and G. Oriolo (2018) Hierarchical tracking task control in redundant manipulators with compliance control in the null-space. Mechatronics. Cited by: §2.
  • [14] O. Khatib (1987) A unified approach for motion and force control of robot manipulators: the operational space formulation. IEEE Journal on Robotics and Automation 3 (1), pp. 43–53. Cited by: §2.
  • [15] J. E. King, M. Cognetti, and S. S. Srinivasa (2016) Rearrangement planning using object-centric and robot-centric action spaces. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3940–3947. Cited by: §2.
  • [16] J. Kober, M. Gienger, and J. J. Steil (2015) Learning movement primitives for force interaction tasks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3192–3199. Cited by: §2.
  • [17] G. D. Konidaris and A. G. Barto (2009) Efficient skill learning using abstraction selection.. Cited by: §2.
  • [18] S. Krishnan, R. Fox, I. Stoica, and K. Goldberg (2017) Ddco: discovery of deep continuous options for robot learning from demonstrations. arXiv preprint arXiv:1710.05421. Cited by: §2.
  • [19] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox (2018) GPU-accelerated robotic simulation for distributed reinforcement learning. Conference on Robot Learning (CoRL). Cited by: §4.
  • [20] R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez, and K. Goldberg (2017) Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. External Links: 1711.01503 Cited by: §2.
  • [21] S. Manschitz, M. Gienger, J. Kober, and J. Peters (2020) Learning sequential force interaction skills. Robotics 9 (2), pp. 45. Cited by: §2.
  • [22] R. Martín-Martín, M. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg (2019) Variable impedance control in end-effector space. an action space for reinforcement learning in contact rich tasks. In Proceedings of the International Conference of Intelligent Robots and Systems (IROS), Cited by: §2.
  • [23] M. T. Mason (1981) Compliance and force control for computer controlled manipulators. IEEE Transactions on Systems, Man, and Cybernetics 11 (6), pp. 418–432. Cited by: §2.
  • [24] L. Metz, J. Ibarz, N. Jaitly, and J. Davidson (2017) Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035. Cited by: §3.3.
  • [25] T. Migimatsu and J. Bohg (2020) Object-centric task and motion planning in dynamic environments. IEEE Robotics and Automation Letters 5 (2), pp. 844–851. Cited by: §2.
  • [26] J. D. Morrow and P. K. Khosla (1997) Manipulation task primitives for composing robot skills. In Proceedings of International Conference on Robotics and Automation, Vol. 4, pp. 3354–3359 vol.4. External Links: Document Cited by: §2.
  • [27] M. Mühlig, M. Gienger, J. J. Steil, and C. Goerick (2009) Automatic selection of task spaces for imitation learning. In International Conference on Intelligent Robots and Systems, pp. 4996–5002. Cited by: §2.
  • [28] Y. Nakamura, H. Hanafusa, and T. Yoshikawa (1987) Task-priority based redundancy control of robot manipulators. The International Journal of Robotics Research 6 (2), pp. 3–15. Cited by: §2.
  • [29] D. Pathak, D. Gandhi, and A. Gupta (2019) Self-supervised exploration via disagreement. In ICML, Cited by: §5.
  • [30] J. Pazis and M. G. Lagoudakis (2011) Reinforcement learning in multidimensional continuous action spaces. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 97–104. Cited by: §3.3.
  • [31] L. Peternel, L. Rozo, D. Caldwell, and A. Ajoudani (2017) A method for derivation of robot task-frame control authority from repeated sensory observations. IEEE Robotics and Automation Letters 2 (2), pp. 719–726. Cited by: §2.
  • [32] M. H. Raibert and J. J. Craig (1981) Hybrid position/force control of manipulators. Cited by: §2.
  • [33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.
  • [34] L. Sciavicco and B. Siciliano (2012) Modelling and control of robot manipulators. Springer Science & Business Media. Cited by: §2.
  • [35] M. Sharma, A. Sharma, N. Rhinehart, and K. M. Kitani (2019) Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. In International Conference on Learning Representations, Cited by: §2.
  • [36] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner (2018) Taco: learning task decomposition via temporal alignment for control. arXiv preprint arXiv:1803.01840. Cited by: §2.
  • [37] Ö. Şimşek, A. P. Wolfe, and A. G. Barto (2005) Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd international conference on Machine learning, pp. 816–823. Cited by: §2.
  • [38] M. Stolle and D. Precup (2002) Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pp. 212–223. Cited by: §2.
  • [39] R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §2.
  • [40] A. L. P. Ureche, K. Umezawa, Y. Nakamura, and A. Billard (2015) Task parameterization using continuous constraints extracted from human demonstrations. IEEE Transactions on Robotics 31 (6), pp. 1458–1471. Cited by: §2.
  • [41] K. Zhang, M. Sharma, J. Liang, and O. Kroemer (2020) A modular robotic arm control stack for research: franka-interface and frankapy. External Links: 2011.02398 Cited by: §A.4.

Appendix A Controller Implementation Details

A.1 Specific Controllers for each Task

Block Fit. A set of controllers is associated with each wall in the environment. For a wall, let be the unit vector pointing in the wall’s normal direction, be the coordinate of the middle of the wall. The set of controllers associated with each wall includes:

  1. Position attractor along normal direction. ,

  2. Position attractor along error direction. ,

  3. Force attractor along the normal direction. ,

  4. Rotation attractor aligning the block’s x-axis to the normal. ,

  5. Rotation attractor aligning the block’s y-axis to the normal. ,

Block Push. In addition to all the per-wall controllers of the Block Fit task, Block Push has the following per-wall controllers:

  1. Position controller along the side of a wall. Let be a unit vector orthogonal to . Since there are such possible directions, we pick the one that gives the direction pointing up along the vertical wall in the scene.

    Let be the coordinate of a wall corner. Since walls form a corner-connect chain in this task, using one of the two corners per wall covers all corners in the scene except the last corner in the chain, which we ignore.

    With these, this controller has and .

  2. Position curl controller around a wall corner. This controller rotates the end-effector in a fixed-radius circle around a point until it reaches the target position which also lies on the circle. The attractor target is , and the direction is , where gives a 2D rotation matrix with the angle .
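
The position curl controller above can be sketched as follows. This is a minimal illustration that assumes the motion direction is the radial vector from the circle center to the current position, rotated by 90 degrees; the function names and the choice of a 90-degree angle are assumptions for the example.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix for an angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def curl_direction(position, center, clockwise=False):
    """Unit tangent of the circle through `position` centered at `center`.

    Assumption: the tangent is obtained by rotating the unit radial
    vector by +90 degrees (counter-clockwise) or -90 degrees (clockwise).
    """
    radial = np.asarray(position, dtype=float) - np.asarray(center, dtype=float)
    norm = np.linalg.norm(radial)
    if norm < 1e-9:
        raise ValueError("position coincides with the circle center")
    angle = -np.pi / 2 if clockwise else np.pi / 2
    return rot2d(angle) @ (radial / norm)
```

Moving a small step along this direction at every timestep keeps the end-effector approximately on a fixed-radius circle around the corner.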

Block Push has one more position controller that attracts the robot block toward the target block. Let be the current location of the center of the target block. This position controller has and .

Franka Hex-Screw. Let be the location of the screw, and be a point cm above the screw (the -axis is vertical in our coordinate frame). Position attractor controllers use as the target, instead of , because attracting the hex-key tip toward the inside of the screw directly can result in collisions with the side of the screw and prevent the key from being properly inserted.

  1. Position attractor along vertical direction. ,

  2. Position attractor along error direction. ,

  3. Position controller that prevents motion in the vertical direction , . This controller does not attract the end-effector toward a goal. Instead, its utility is solely in its nullspace projection, which ensures lower-priority controllers cannot move the end-effector outside of a horizontal plane. This controller is useful for preventing prematurely inserting the hex-key.

  4. Force controller that pushes downward toward the hex screw. , .

  5. Rotation controller that maintains the verticality of the end-effector. , . The positive -axis of the end-effector frame corresponds to the direction that the hex-key points towards.

  6. Rotation controller that rotates the hex-key counter-clockwise. , , where gives a rotation matrix that rotates around the -axis with the angle .

  7. Rotation controller that rotates the hex-key clockwise. ,
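
To illustrate how a controller such as item 3 constrains lower-priority controllers through its nullspace, the sketch below composes prioritized translation deltas. It is a simplified illustration restricted to 3D position deltas; the function name and the Gram-Schmidt style update of the projector are assumptions for this example, not the exact implementation.

```python
import numpy as np

def compose_position_deltas(axes, deltas):
    """Combine prioritized task-axis translation deltas (illustrative sketch).

    axes:   list of 3D unit vectors, highest priority first.
    deltas: list of scalar displacements along each axis.
    Each contribution is projected into the nullspace of all
    higher-priority axes before it is added to the total.
    """
    total = np.zeros(3)
    nullspace = np.eye(3)  # projector onto directions not yet constrained
    for axis, delta in zip(axes, deltas):
        a = np.asarray(axis, dtype=float)
        total += nullspace @ (delta * a)
        # Remove the (projected) axis direction from the free subspace.
        residual = nullspace @ a
        norm = np.linalg.norm(residual)
        if norm > 1e-9:
            r = residual / norm
            nullspace = nullspace - np.outer(r, r)
    return total
```

For example, placing a zero-delta controller on the vertical axis at the highest priority removes the vertical component of any lower-priority attractor, so the end-effector stays in a horizontal plane.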

Figure 7: Axes Visualization for Franka End-Effector and Door Handle for the Door-Open Task. RGB corresponds to XYZ.

Franka Door-Open. See Figure 7 for a visualization of both the Franka end-effector and door handle axes. Let correspond to the axes of the door handle. Let be a grasp point on the door handle, be the center of the handle axle (dark gray cylinder in Figure 7). The set of controllers includes:

  1. Position attractor to door handle along error direction. ,

  2. Position curl attractor for rotating around the handle in the plane of the door panel (the nullspace of ). Let . Then ,

  3. Force controller to pull the handle. ,

  4. Rotation controller to align the x-axes of the gripper and the handle. ,

  5. Rotation controller to align the y-axes of the gripper and the handle. ,

  6. Rotation controller to align the z-axes of the gripper and the handle. ,

A.2 Integral Term for Force Controllers

Using an integral term for force controllers can help reduce the force error and improve stability. Let be the accumulated force errors for the force controller used at the th priority. Then, the corresponding delta position target is computed as:


where .
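
As a hedged illustration of a force controller with an integral term, the sketch below accumulates the force error along the controller axis and converts it into a delta position target. The proportional-plus-integral form, the class name, and the gain names `kp` and `ki` are assumptions for this example; they are not taken from the controller definition above.

```python
import numpy as np

class IntegralForceController:
    """Force controller along one axis with an integral term (sketch)."""

    def __init__(self, axis, kp, ki):
        axis = np.asarray(axis, dtype=float)
        self.axis = axis / np.linalg.norm(axis)
        self.kp = kp  # hypothetical proportional gain
        self.ki = ki  # hypothetical integral gain
        self.accumulated_error = 0.0

    def delta_target(self, desired_force, measured_force):
        """Return a delta position target along the controller axis."""
        measured = float(np.dot(self.axis, measured_force))
        error = desired_force - measured
        self.accumulated_error += error
        gain = self.kp * error + self.ki * self.accumulated_error
        return gain * self.axis
```

With a persistent force error, the accumulated term grows at every step, which increases the commanded displacement until the desired contact force is reached.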

A.3 Delta Target Magnitude Clipping

To ensure safety and limit the maximum speed at which our controllers can drive the robot, we clip the magnitude of delta position and rotation targets.

Let be the maximum delta translation magnitude corresponding to a position controller, and for a force controller. The clipping for force and position controllers is computed as follows:


Note that can be or , depending on whether the th controller is a position or force controller.

Similarly, let be the maximum delta rotation angle for rotation controllers:
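
A minimal sketch of this kind of magnitude clipping is given below. It assumes that rotation deltas are represented as axis-angle (rotation) vectors, whose norm is the rotation angle; the function names and this representation are assumptions for the example.

```python
import numpy as np

def clip_translation(delta, max_norm):
    """Scale a delta translation so that its norm is at most max_norm."""
    delta = np.asarray(delta, dtype=float)
    norm = np.linalg.norm(delta)
    if norm <= max_norm or norm == 0.0:
        return delta
    return delta * (max_norm / norm)

def clip_rotation(rotvec, max_angle):
    """Clip an axis-angle delta rotation to at most max_angle radians.

    The norm of a rotation vector is its angle, so the same rescaling
    used for translations applies.
    """
    return clip_translation(rotvec, max_angle)
```

Rescaling preserves the direction of the commanded motion and only limits its speed.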


A.4 Controller Hyperparameters

Table 3 lists the different hyperparameters used for the object-axis controllers for each task. We list the gains used for each controller as well as the clipping used while executing each controller. Table 4 lists the task-space impedance parameters used for simulation and real-world experiments. We use [41] to implement each controller for real-world experiments.

Block Fit Block Push Franka Hex-Screw Franka Door-Open
Table 3: Controller Gains and Magnitude Clips Across Tasks.
Simulation Real World
Table 4: Task-Space Impedance Control Parameters. is stiffness, is damping, and is how many timesteps a controller combination runs before the RL policy is queried again. The simulation and real-world values are not the same due to differences in control frequencies and Franka dynamics between real-world and simulation. We tune the real-world values to ensure that the resultant controller behaviors are similar to those in simulation. This tuning was done prior to task evaluations.

Appendix B Task Details

B.1 Block Fit


  1. 2D pose of block robot

  2. 2D contact force direction and magnitude experienced by the block robot in the world frame.

  3. 2D coordinates of centers and wall corners

Reward Function: Let be the previous distance between the block translation and the goal translation, be the previous absolute angle difference between the block rotation and the goal rotation, and let , be their current counterparts. The reward function rewards making progress towards the goal with a small alive penalty and a large task completion bonus:


The goal translation threshold is about half the width of the block.
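
The structure of such a progress-based reward can be sketched as follows. All weights, the penalty, and the bonus are placeholder values chosen for illustration; they are assumptions and not the values used in our experiments.

```python
def block_fit_reward(prev_dist, dist, prev_angle, angle, reached_goal,
                     w_trans=1.0, w_rot=1.0, alive_penalty=0.01, bonus=10.0):
    """Progress reward with an alive penalty and a completion bonus (sketch).

    prev_dist, dist:   previous and current translation distance to the goal.
    prev_angle, angle: previous and current absolute angle difference.
    All keyword defaults are hypothetical placeholders.
    """
    reward = w_trans * (prev_dist - dist) + w_rot * (prev_angle - angle)
    reward -= alive_penalty
    if reached_goal:
        reward += bonus
    return reward
```

Because the progress terms are differences between consecutive steps, they sum to the total reduction in distance and angle over an episode, while the alive penalty favors faster completion.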

B.2 Block Push


  1. 2D pose of block robot

  2. 2D contact force direction and magnitude experienced by the block robot in the world frame.

  3. 2D coordinates of centers and wall corners

  4. 2D pose of the target block

Reward Function: The reward function is similar to that of Block Fit, except the progress rewards are computed w.r.t. the target block, not the block robot, and there is an additional reward term for approaching the target block:


where is the previous distance between the robot block and the target block, and is the current counterpart. The goal translation threshold of is about half the width of the target block.

B.3 Franka Hex Screw

Figure 8: Different hex screw and key sizes used for testing in the real world. The middle size represents scale factor, while the left is , and the right .

See Figure 8 for the different screw and key sizes used in real-world experiments.


  1. 7-dimension robot arm joint angles

  2. Gripper width

  3. 6D pose of the tip of the hex-screw. Rotations are represented via quaternions

  4. End-effector contact forces

  5. Position of the hex screw relative to the robot base

Reward Function: Let be the previous distance from the hex-key tip to the hex screw, be the previous absolute angle difference between the screw angle and its target angle, and let , be their respective current counterparts. Let be the absolute angle difference between the negative -axis (pointing downwards) and the -axis of the end-effector. The reward function rewards approaching the hex-screw, making progress in turning the screw, maintaining a vertical hex key orientation, plus a small alive penalty and a large task bonus:


The target screw rotation angle (at which point ) is .

B.4 Franka Door Opening


  1. 7-dimension robot arm joint angles

  2. Gripper width

  3. 6D pose of the tip of the hex-screw. Rotations are represented via quaternions

  4. End-effector contact forces

  5. Door panel angle (How much the door has opened, not the angle of the door handle)

  6. Position of the door handle relative to the robot base

Reward Function: Let be the previous distance from the end-effector to the door handle grasp point, be the previous absolute angle difference between the door handle angle and the target handle turning angle, be the previous absolute angle difference between the door panel angle and the target door opening angle, and let , , be their respective current counterparts. Let denote the current end-effector contact forces. The reward function rewards approaching the door handle, turning the handle, turning the door, plus small alive penalties and excessive force penalties, plus a large task bonus:


The target door handle turning angle (at which point ) is , and the target door panel opening angle (at which point ) is .

Appendix C RL Training Details

C.1 PPO Hyperparameters

Block Fit Block Push Franka Hex-Screw Franka Door-Open
num steps
discount factor
entropy coefficient
learning rate
value loss coefficient
max gradient norm
num minibatches

num opt epochs

clip range
Table 5: PPO Hyperparameters Across All Tasks.

Table 5 lists the hyperparameters used for each of the experiments. In addition to the above parameters, we also decay the clip range using a linear schedule with a decay rate of 0.99 after every epoch. We set the minimum clip range value to be . Also, for the Franka Hex-Screw and Franka Door-Open tasks, we evaluated a range of entropy coefficient values for the end-effector action space.
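
One possible reading of this clip-range decay is sketched below: the clip range is multiplied by the decay rate after every epoch and floored at a minimum value. The multiplicative interpretation, the function name, and the default `min_clip` are assumptions for this example; the minimum value is left as a parameter.

```python
def decayed_clip_range(initial_clip, epoch, decay_rate=0.99, min_clip=0.0):
    """PPO clip range after `epoch` decay steps, floored at min_clip (sketch).

    Assumes the clip range is multiplied by decay_rate once per epoch.
    min_clip defaults to a placeholder value of 0.0.
    """
    return max(min_clip, initial_clip * decay_rate ** epoch)
```

A shrinking clip range makes policy updates more conservative as training progresses.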

C.2 Network Architecture

For all tasks and methods, we use the same network architectures for both the policy and value function networks. The network consists of 2 hidden layers with hidden units each.

C.3 Controller Features in the Observation Space

Figure 9: Policy Architecture for 3-Exp-Features.

We experimented with giving features of each individual controller to the RL policy in the Expanded-MDP approach. These features may allow the policy to better reason about the effects of individual controllers, and they also allow the policy to operate on a variable number of controllers. Controller features include a 1-hot encoding of the controller type (position, force or rotation), the controller target, axis, and the current error. We refer to this method as 3-Exp-Features.
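
A minimal sketch of how such a controller feature vector could be assembled is shown below. The ordering of the fields, the function name, and the dimensions used in the example are assumptions for illustration.

```python
import numpy as np

CONTROLLER_TYPES = ("position", "force", "rotation")

def controller_features(ctrl_type, target, axis, error):
    """Concatenate a 1-hot controller type with target, axis, and error."""
    one_hot = np.zeros(len(CONTROLLER_TYPES))
    one_hot[CONTROLLER_TYPES.index(ctrl_type)] = 1.0
    parts = [one_hot, np.ravel(target), np.ravel(axis), np.ravel(error)]
    return np.concatenate(parts).astype(float)
```

For instance, `controller_features("force", [0, 0, 1], [0, 0, -1], [2.0])` yields a 10-dimensional vector whose first three entries encode the controller type.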

See Figure 9 for an illustration of a policy architecture using controller features. Each controller feature vector of the th controller (there are in total) is processed by a shared controller feature encoder. These controller embeddings are then each concatenated with embeddings of the original observations, which include environment observations and encodings of previously selected controllers (there are in total). For , instead of single-1-hot or multi-1-hot embeddings, we also use the controller features, which are first processed by the shared controller encoder. Finally, the concatenated embeddings are separately processed and scored by the policy. The normalized scores ’s form the discrete distribution from which we sample the next controller selection.

Both the observation and controller feature encoders contain two hidden layers of hidden units each. The controller scorer has one hidden layer of hidden units. For the value function, we remove the last linear layer (size ) of the controller scorer in the policy, add the -dimensional outputs from the hidden layer across all intermediate outputs, and pass the sum through one last linear layer (also of size ) to obtain one scalar value.
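
The scoring scheme described above can be sketched with plain NumPy as follows. The layer sizes, the random initialization, and all function names are assumptions for this example; the sketch only illustrates how a shared encoder and scorer produce a distribution over a variable number of controllers.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Create (weight, bias) pairs for an MLP with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    """Apply a ReLU MLP; the last layer is linear."""
    for i, (weight, bias) in enumerate(layers):
        x = x @ weight + bias
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def controller_distribution(obs, ctrl_feats, obs_enc, ctrl_enc, scorer):
    """Score each controller and return a categorical distribution."""
    obs_emb = mlp(obs, obs_enc)            # shape: (d_obs,)
    ctrl_emb = mlp(ctrl_feats, ctrl_enc)   # shape: (n_ctrl, d_ctrl)
    tiled = np.broadcast_to(obs_emb, (ctrl_emb.shape[0], obs_emb.shape[0]))
    scores = mlp(np.concatenate([ctrl_emb, tiled], axis=1), scorer)[:, 0]
    scores = scores - scores.max()         # numerically stable softmax
    probs = np.exp(scores)
    return probs / probs.sum()
```

Because the same encoder and scorer weights are applied to every controller, the number of rows in `ctrl_feats` can change between scenes without changing the network.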

While 3-Exp-Features policies are invariant to the number of controllers in the scene, we did not explicitly test for this capability. However, we still ran this method alongside the other approaches, and it achieves comparable performance to the other 3-Exp variants (see detailed results below).

Appendix D Detailed Experiment Results

We discuss results for each task in more detail in the following sections. Video results for all the different tasks and methods can be seen at https://sites.google.com/view/compositional-object-control/.

D.1 Block Fit

Figure 10: Different environment configurations used to train the Block Fit task. The plot below each environment configuration shows how the trained policy performed on each particular configuration.
Figure 11: Test configurations for the Block Fit task. Table 6 shows results on each environment configuration.

| Test-Cfg | EE-Space | 1-Ctrlr | 3-Priority | 3-Combo | 3-Exp-Features | 3-Exp-Single | 3-Exp-Multi |
| A | 0.86 (0.07) | 0.98 (0.01) | 1.0 (0.0) | 0.19 (0.12) | 1.0 (0.0) | 1.0 (0.0) | 1.0 (0.0) |
| B | 1.0 (0.0) | 1.0 (0.0) | 1.0 (0.0) | 0.20 (0.14) | 1.0 (0.0) | 1.0 (0.0) | 1.0 (0.0) |
| C | 0.85 (0.27) | 0.96 (0.03) | 0.97 (0.03) | 0.18 (0.08) | 0.98 (0.01) | 0.99 (0.01) | 0.96 (0.01) |
| D | 0.89 (0.14) | 0.71 (0.31) | 1.0 (0.03) | 0.165 (0.06) | 1.0 (0.0) | 1.0 (0.0) | 1.0 (0.01) |
| E | 0.16 (0.23) | 0.64 (0.16) | 0.76 (0.16) | 0.0375 (0.05) | 0.97 (0.02) | 0.99 (0.01) | 0.99 (0.02) |
| F | 0.75 (0.39) | 0.87 (0.16) | 1.0 (0.0) | 0.20 (0.14) | 0.98 (0.01) | 1.0 (0.0) | 1.0 (0.0) |
| G | 0.17 (0.30) | 0.08 (0.23) | 0.75 (0.45) | 0.02 (0.03) | 0.89 (0.13) | 0.82 (0.23) | 0.90 (0.15) |
| H | 0.50 (0.53) | 0.0 (0.0) | 1.0 (0.0) | 0.27 (0.18) | 1.0 (0.0) | 1.0 (0.0) | 1.0 (0.0) |
Table 6: Block Fit mean success on test environment configurations. Parentheses denote standard deviation across 8 seeds.

Figure 10 plots the training results (success rate) for all training configurations. For all but two configurations (Figure 10(b), left two plots), all the methods perform quite well during training. The poor performance in those two training environments is due to their more challenging configurations: in both, the slot to the target wall is oriented in a different direction, so a robust policy needs to reason about this change. Our proposed Expanded-MDP methods perform well on these configurations. However, the end-effector action space shows large variance, i.e., many of the seeds fail to learn a robust policy for these configurations. We also observe that 1-Ctrlr is unable to solve this task robustly, which shows the advantage of using multiple controllers in parallel.

Figure 11 shows the different test configurations we evaluated. Each test configuration is a small change in the wall lengths or angles from the train configurations. Table 6 shows the results on each of these test configurations. On every configuration, the best Expanded-MDP variant matches or outperforms all other methods. 3-Priority performs well on most test configurations except the slightly harder ones (E, G). This indicates that Expanded-MDP methods learn more robust policies than methods using a continuous priority score. Additionally, EE-Space performs poorly, especially on the test configurations that differ more from training (E, F, G, H). Qualitatively, we observe that EE-Space policies often completely fail to generalize to the test configurations. 1-Ctrlr performs poorly on the D, E, G, and H configurations. For the first two (D, E), we observe that the learned policy can often get stuck around wall corners, which prevents it from completing the episode within the given time. For the latter two (G, H), the learned policies across all seeds perform quite poorly during training, so they are not able to generalize to either of the test configurations.

D.2 Block Push

Figure 12 shows the success rate for all the different environment configurations used in the Block Push task. Both 1-Ctrlr and 3-Priority show large variance in training performance because both methods fail to learn the task for some seeds. All the Expanded-MDP methods are able to successfully complete the task without large variance. One reason for this is that a robust task strategy requires the use of multiple object-axis controllers to move the object along the vertical wall as well as to move it around the corner of the top wall. Although it is feasible to accomplish the task by quickly switching between controllers, it is hard to find a robust policy relying on such a switching mechanism, especially when controllers are run for a fixed number of steps. Additionally, while EE-Space also solves all the different environment configurations, its sample complexity is worse than that of the proposed Expanded-MDP methods.

Table 7 evaluates the learned policies on different test configurations, which are shown in Figure 13. These test configurations involve small perturbations in either the wall length or the wall angles (or both) from the train configurations. Specifically, we limit small perturbations to be a max change in vertical wall angle of (A, C, D, F), while larger perturbations are sampled from (B, E, G, H). We observe that EE-Space is usually robust to small perturbations of the environment, while slightly larger perturbations can significantly affect its performance. However, even with small perturbations our Expanded-MDP methods are able to outperform the end-effector space. This shows the advantage of using a structured action space for learning, as Expanded-MDP methods perform well across both sets of configurations. Notably, the proposed approach only performs poorly on the B and E configurations. For both configurations, as the policy pushes the red block up the middle wall, the agent block (green block) can sometimes end up under the red block, which leads to the green block falling and ends the episode. Additionally, both 1-Ctrlr and 3-Priority perform poorly on the test configurations. This is due to the poor performance of some of the learned policies (across a few seeds) on the train configurations. However, good train performance does not necessarily lead to good test performance. Specifically, 1-Ctrlr can move the green block upwards (by using the position controller for the top wall), but it is often not able to robustly push it around the corner. This shows the advantage of being able to choose multiple object-axis controllers at each step.

Figure 12: Different environment configurations used to train the Block Push task. The plot below each environment configuration shows how the trained policy performed on each particular configuration.
Figure 13: Test environment configurations for the Block Push task. Table 7 shows results on each environment config.

| Config | EE-Space | 1-Ctrlr | 3-Priority | 3-Combo | 3-Exp-Features | 3-Exp-Single | 3-Exp-Multi |
| A | 0.94 (0.06) | 0.62 (0.47) | 0.43 (0.47) | 0.0 (0.0) | 0.97 (0.02) | 0.98 (0.01) | 0.99 (0.00) |
| B | 0.27 (0.28) | 0.27 (0.30) | 0.38 (0.43) | 0.0 (0.0) | 0.50 (0.27) | 0.72 (0.18) | 0.65 (0.30) |
| C | 0.86 (0.23) | 0.30 (0.40) | 0.43 (0.37) | 0.0 (0.0) | 0.91 (0.10) | 0.97 (0.01) | 0.97 (0.04) |
| D | 0.70 (0.28) | 0.07 (0.13) | 0.42 (0.46) | 0.0 (0.0) | 0.89 (0.06) | 0.93 (0.07) | 0.88 (0.12) |
| E | 0.48 (0.31) | 0.01 (0.01) | 0.36 (0.39) | 0.0 (0.0) | 0.79 (0.17) | 0.69 (0.23) | 0.67 (0.17) |
| F | 0.96 (0.03) | 0.73 (0.41) | 0.38 (0.42) | 0.0 (0.0) | 0.88 (0.11) | 0.95 (0.06) | 0.96 (0.03) |
| G | 0.89 (0.10) | 0.67 (0.49) | 0.35 (0.39) | 0.0 (0.0) | 0.97 (0.03) | 0.92 (0.06) | 0.89 (0.07) |
| H | 0.61 (0.26) | 0.27 (0.41) | 0.34 (0.38) | 0.0 (0.0) | 0.79 (0.11) | 0.78 (0.10) | 0.88 (0.07) |
Table 7: Block Push mean success on test environment configurations. Parentheses denote standard deviation across 8 seeds.

D.3 Franka Hex-Screw

Figure 14: Franka tasks success ratios on training environment configurations during training. (a) Franka Hex-Screw task. (b) Franka Door-Opening task.
| Config (scale) | EE-Space | 1-Ctrlr | 3-Priority | 3-Combo | 3-Exp-Features | 3-Exp-Single | 3-Exp-Multi |
| 0.7 | 0.0 (0.00) | 0.05 (0.14) | 0.69 (0.44) | 0.45 (0.43) | 0.95 (0.04) | 0.97 (0.02) | 0.96 (0.035) |
| 0.8 | 0.0 (0.00) | 0.11 (0.17) | 0.66 (0.49) | 0.43 (0.43) | 0.97 (0.05) | 0.96 (0.03) | 0.98 (0.01) |
| 1.1 | 0.0 (0.00) | 0.21 (0.37) | 0.63 (0.49) | 0.43 (0.36) | 0.96 (0.03) | 0.98 (0.03) | 0.97 (0.03) |
| 1.2 | 0.0 (0.00) | 0.07 (0.15) | 0.57 (0.42) | 0.44 (0.42) | 0.97 (0.04) | 0.95 (0.03) | 0.95 (0.04) |
| 1.4 | 0.0 (0.00) | 0.0 (0.00) | 0.67 (0.48) | 0.34 (0.36) | 0.92 (0.03) | 0.94 (0.04) | 0.95 (0.03) |
| 1.5 | 0.0 (0.00) | 0.03 (0.14) | 0.54 (0.46) | 0.24 (0.36) | 0.92 (0.04) | 0.90 (0.05) | 0.92 (0.03) |
Table 8: Franka Hex-Screw mean success across all test environment configurations. Parentheses denote standard deviation across 8 seeds.

Figure 14(a) plots the mean success rates for all the different approaches (including using controller features) during training. Since performance on all three train configurations (wrench and screw sizes) is very similar, we report one plot that averages the results over all the configurations. As seen in Figure 14(a), EE-Space is not able to learn the task. While EE-Space policies can bring the wrench close to the screw, they do not achieve proper alignment and insertion, nor do they apply sufficient downward force, all of which are necessary to accomplish the task. Similarly, 1-Ctrlr also performs poorly. This is expected, since the task requires multiple controllers at once, i.e., a force or position controller pressing into the screw while a rotation controller turns the wrench. Among the approaches that use multiple object-axis controllers together, we find that the Expanded-MDP approaches perform the best, robustly learning the task each time. All the other approaches suffer from large variance in task performance.

Table 8 reports the results for each of the different test configurations. Each test configuration uses a different wrench and screw scale. Our proposed approach is able to generalize to the different test configurations, achieving a mean success rate of at least 0.90 on all configurations (Table 8). Although 3-Priority performs well in training, its test performance is poorer. This is because some of the learned policies (seeds) fail to generalize well to any of the test configurations, while the remaining seeds perform as well as our Expanded-MDP approaches. This variance in performance of the learned policies leads to a lower mean success rate for 3-Priority. 1-Ctrlr fails to work well on any of the test configurations, which is expected given its poor training performance.

d.4 Franka Door-Opening

Figure 13(b) shows the average success rate in all train environments for the Door-Opening task. All methods except EE-Space are able to learn this task. One reason is that object-centric controllers make exploration in this task much more efficient than acting directly in the end-effector space. Although the EE-Space policy is able to grasp the handle, it fails to turn and pull. Table 9 shows quantitative results on test environments. Methods that use 3 controllers have very similar performance and perform better than 1-Ctrlr. Using multiple controllers is beneficial for this task: when turning the handle, the robot needs to learn to rotate the gripper and press down at the same time; when opening the door, the robot needs to simultaneously press the handle and pull it open. Since the reward function contains separate rewards for approaching the handle, turning the handle, and opening the door, the performance differences are due to the complexity of the task and not to a lack of informative reward signals.

Config   EE-Space      1-Ctrlr       3-Priority    3-Combo       3-Exp-Feat    3-Exp-Single  3-Exp-Multi
A        0.13 (0.13)   0.87 (0.06)   0.93 (0.08)   0.97 (0.01)   0.99 (0.01)   0.99 (0.01)   0.99 (0.01)
B        0.00 (0.00)   0.96 (0.02)   0.99 (0.01)   0.99 (0.01)   0.99 (0.01)   0.99 (0.01)   0.99 (0.02)
C        0.00 (0.00)   0.93 (0.03)   0.99 (0.01)   0.99 (0.01)   1.00 (0.00)   0.99 (0.01)   0.99 (0.01)
Table 9: Franka Door-Opening mean success on test environment configurations. Parentheses denote standard deviation across 8 seeds.

Appendix E Controller Selection Analyses

We perform an ablation study to better understand the effects of algorithmic choices in our proposed approach. First, we analyze the effect of the controller selection frequency, i.e., the number of steps for which the object-axes controllers are run before the RL policy is queried again. Second, we qualitatively evaluate the learned controller selection policy by visualizing the controllers it selects. For both of these analyses we use the Block Fit task.

Figure 15: Controller Selection Frequency: Success rate for the Block Fit task when the object-axes controllers are run for a larger number of steps between policy queries. Results are averaged over 4 seeds (instead of the usual 8).

e.1 Controller Selection Frequency

We evaluate how the controller selection frequency affects learning performance. In all previous experiments, the object-axes controllers are run for only a few (10) steps before the RL policy is queried again. Although switching controllers frequently makes the RL policy more expressive, it comes at the cost of higher sample complexity. In this experiment we evaluate learning performance when controllers are allowed to run for many more steps. To keep the overall simulation time fixed, we simultaneously reduce the maximum number of steps for which the RL policy is run, i.e., we reduce the episode length of the MDP. This is important because running both the controllers and the RL policy for a large number of steps is computationally prohibitive: the total number of simulator steps is the product of the episode length and the number of controller steps per policy query. We set for this experiment.
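The trade-off between episode length and controller run length can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and apart from the 10 controller steps mentioned in the text, all numbers (an episode length of 20 and a longer run of 50 controller steps) are illustrative assumptions rather than the values used in the experiment.

```python
def total_simulator_steps(policy_steps: int, controller_steps: int) -> int:
    """Simulator steps per episode: every policy query runs the controllers
    for `controller_steps` simulator steps."""
    return policy_steps * controller_steps


def policy_steps_for_budget(sim_budget: int, controller_steps: int) -> int:
    """Largest episode length (number of policy queries) within a fixed
    simulator-step budget."""
    return sim_budget // controller_steps


def run_episode(policy_steps: int, controller_steps: int):
    """Count policy queries and simulator steps in the nested control loop."""
    queries = 0
    sim_steps = 0
    for _ in range(policy_steps):
        queries += 1  # the RL policy selects the controllers here
        for _ in range(controller_steps):
            sim_steps += 1  # the selected controllers run for one simulator step
    return queries, sim_steps


# 10 controller steps per query (from the text) with an assumed episode length of 20.
budget = total_simulator_steps(policy_steps=20, controller_steps=10)
# Assumed longer controller run of 50 steps under the same simulator budget.
shorter_episode = policy_steps_for_budget(budget, controller_steps=50)
print(budget, shorter_episode, run_episode(shorter_episode, 50))
```

Under a fixed simulator budget, running controllers longer leaves the policy fewer decisions per episode, which is why a policy that relies on rapid switching is penalized in this setting.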

Figure 15 plots the average success rate for all train configurations on the Block Fit task in this setting. As the figure shows, our proposed Expanded-MDP approaches perform quite well. In contrast, selecting only one controller (1-Controller) at each time step performs poorly compared to the earlier setting with few controller steps (Figure 10). With a small number of controller steps, the 1-Controller policy is able to complete tasks by quickly switching between different controllers. Since this is not possible with a larger value, its performance decreases. This emphasizes the importance of using multiple object-axes controllers in parallel. Additionally, Figure 15 shows that the Expanded-MDP approaches learn to perform the task in significantly fewer steps than are required in the setting of Figure 10.

e.2 Controller Selection Visualization

Figure 16: Controller Selection Visualization for Block Fit during Task Execution. The thick blue lines show the different walls in the environment. The dots represent the block position at each step, while the arrows indicate the wall object used by the selected controller. The leftmost plot shows the top-priority (priority: 0) controller being selected, while the rightmost plot shows the controllers with the lowest priority (priority: 2). Top and bottom rows are two different train configurations (A and B from Figure 10(a)).

Figure 16 plots the controllers the policy selects along the block trajectory for two different train configurations of Block Fit. For the highest-priority controller, the policy tends to select the one that attracts the block toward the target wall. Interestingly, the second-priority controllers are associated with a different wall, i.e., the leftmost wall. This shows that the RL policy learns to combine controllers across different objects (walls). For the initial part of the trajectory, the RL policy learns to rotate (priority 0) and move (priority 1) the block simultaneously. This composition of different behaviors is important for the policy to accomplish the task as fast as possible. In addition, the policy chooses from a small set of controllers for both priority 0 and priority 1, while it chooses from a large set of controllers for priority 2. This is because many different choices for the priority 2 controller often have little to no effect, e.g., if both the priority 0 and priority 1 controllers are position or force controllers, then choosing an additional position or force controller for priority 2 will likely have no effect. Thus, it is hard for the policy to learn which lower-priority controller is appropriate.
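Why a priority 2 controller often has no effect can be illustrated with a small sketch of prioritized composition. This assumes a planar (2D) position space and uses null-space projection, a common way to combine prioritized commands; it is not necessarily the exact composition rule used in this work, and the function name and numeric values are hypothetical.

```python
import numpy as np


def compose_prioritized(commands, dim=2):
    """Combine (axis, magnitude) commands given in priority order.

    Each command acts only along the part of its axis that is not already
    claimed by a higher-priority command (null-space projection).
    """
    free = np.eye(dim)  # projector onto the directions that are still free
    total = np.zeros(dim)
    for axis, magnitude in commands:
        direction = free @ np.asarray(axis, dtype=float)
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue  # no free direction left: this command has no effect
        direction = direction / norm
        total += magnitude * direction
        free = free - np.outer(direction, direction)  # claim this direction
    return total


# Hypothetical planar example: two position commands already span the plane.
priority_0 = ([1.0, 0.0], 1.0)
priority_1 = ([0.0, 1.0], 0.5)
priority_2 = ([1.0, 1.0], 2.0)  # any further position command is projected out

two_controllers = compose_prioritized([priority_0, priority_1])
three_controllers = compose_prioritized([priority_0, priority_1, priority_2])
print(two_controllers, three_controllers)
```

In this sketch the combined command is identical with and without the priority 2 controller, so the choice at priority 2 yields no difference in outcome for the policy to learn from.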