UMPNet: Universal Manipulation Policy Network for Articulated Objects

09/13/2021 ∙ by Zhenjia Xu, et al. ∙ 0

We introduce the Universal Manipulation Policy Network (UMPNet), a single image-based policy network that infers closed-loop action sequences for manipulating arbitrary articulated objects. To infer a wide range of action trajectories, the policy supports 6DoF action representation and varying trajectory length. To handle a diverse set of objects, the policy learns from objects with different articulation structures and generalizes to unseen objects or categories. The policy is trained with self-guided exploration without any human demonstrations, scripted policy, or pre-defined goal conditions. To support effective multi-step interaction, we introduce a novel Arrow-of-Time action attribute that indicates whether an action will change the object state back to the past or forward into the future. With the Arrow-of-Time inference at each interaction step, the learned policy is able to select actions that consistently lead towards or away from a given state, thereby enabling both effective state exploration and goal-conditioned manipulation. Video is available at




I Introduction

The ability to effectively interact with and manipulate unknown articulated objects is critical for many robotics tasks. However, due to the large variance in the objects’ kinematic structure and 3D geometry, the actual action trajectories can vary drastically across different object instances and categories. Fig. 1 shows examples of action trajectories conditioned on different objects for opening a door, turning a switch, or opening a drawer. Extensive prior works have studied how to manually design or learn an object-specific policy for each type of interaction (e.g., opening doors). However, such policies are often time-consuming to design and fail to generalize across objects with different articulation structures.

While these interaction sequences are drastically different in their low-level geometric trajectories, many of them can be summarized by a similar high-level function conditioned on the objects’ underlying geometric and kinematic structure. For example, the motion trajectory of a door opening can be represented by a function conditioned on its frame size and its rotation axis, and a similar function can also be used for opening a fridge, a microwave, or even a laptop. By learning to interact with a diverse set of articulated objects, the system is able to acquire generalizable knowledge about objects’ articulation structures and how these structures react to different actions. Such knowledge goes beyond a specific object instance or category, allowing a universal interaction policy for any articulated object.

Fig. 1: Universal Manipulation Policy for Articulated Objects. Instead of predicting a single-step action, UMPNet predicts complex closed-loop 6DoF action sequences with varying trajectory length. As a result, the same policy network is able to handle a diverse set of objects regardless of their joint types or number of links.

Can we enable a robot to automatically acquire these basic concepts about the object structure through self-supervised interactions and use them to infer the corresponding manipulation policies? In this paper, we introduce the Universal Manipulation Policy Network (UMPNet) – a single policy network that discovers possible manipulation policies for an arbitrary articulated object from visual observations (i.e., RGB-D images). The action trajectories inferred by the policy network (shown in Fig. 1) highlight the following attributes:

  • General action representation: In order to model all possible actions for any articulated object, the network should be able to represent a general action space with few constraints: it should be able to represent continuous actions in SE(3) with arbitrary trajectory length. To achieve this goal, we formulate an action trajectory by its initial 3D position and a sequence of action directions, which allows the network to describe complex motion trajectories with varying sequence lengths.

  • Closed-loop action sequence: Instead of predicting a single step action (e.g., push or pull), we are interested in predicting long-horizon sequential actions that could describe a complex motion trajectory. However, due to error accumulation and partial observation, directly predicting the full trajectory from the initial state can be challenging. To address this issue, we use a closed-loop formulation where the network continues to predict the next action conditioned on the object’s initial and current state, allowing the network to adjust its action prediction based on its continuous visual observation of the object.

  • Arrow-of-Time awareness: Most of the action trajectories are bi-directional in time (i.e., they are valid in either direction). Hence, conditioning on a single state can result in multiple effective next actions that would change the object’s state with the same magnitude. However, to avoid the back-and-forth actions, the network takes the history state as input and infers an additional “Arrow-of-Time (AoT)” attribute for each action. This AoT label indicates whether this action will change the object state back to the past or forward into the future. Apart from encouraging exploring new states, this Arrow-of-Time inference also allows us to directly apply the network in “goal conditioned manipulation”, where we can simply swap out the initial state with the goal state and choose the actions using a reversed Arrow-of-Time.

In summary, we present a unified framework that discovers possible manipulation policies for an arbitrary articulated object from visual observations. By using self-guided exploration, the policy network is able to learn a wide range of action trajectories for a diverse set of objects and generalize to unseen objects and categories. The training does not require any human demonstrations or pre-defined goal conditions. We validate our approach on two manipulation tasks: (1) open-ended state exploration and (2) goal-conditioned manipulation. The experiments demonstrate that UMPNet significantly outperforms alternative approaches in both tasks.

II Related work

Open-loop manipulation with pose estimation. Many works have focused on learning task-specific manipulation primitives, such as grasping [zeng2018robotic, song2020grasping], pushing [li2018push] and tossing [zeng2019tossingbot]. For articulated objects, methods have focused on handling doors and drawers [klingbeil2010learning, abbatematteo2019learning, jainscrewnet20, mittal2021articulated, jain2020learning, ruhr2012generalized, harada2019service, schmid2008opening, kessens2010utilizing]. These prior works typically start with the object’s pose estimation [gadre2021act, li2019articulated-pose] and then use the object pose to compute an open-loop motion trajectory. However, the action trajectory designed for one task (e.g., opening doors) may be too specific to be applied to other objects or tasks (e.g., pushing a button). Moreover, performing pose estimation for arbitrary articulated objects with unknown category and kinematic structure is an extremely challenging task. In contrast, our model does not require any object detection, pose estimation or part segmentation, and demonstrates that explicit pose estimation is in fact not necessary to perform effective manipulations.

Learning action trajectories from demonstrations. Another popular method for robots to acquire new manipulation skills is learning from demonstrations. This approach has been explored extensively in the reinforcement learning literature [niekum2013incremental]. Researchers have tried using behavioral cloning to learn from human demonstration data captured by various methods, for example, motion capture [kober2010imitate, peter2009generalization], videos [sermanet2017contrastive, huang2018neural, huang2020motion] and virtual reality [zhang2017imitate, lynch2020learning]. However, these works require collecting a large amount of human demonstrations, which is time-consuming and expensive. In contrast, our framework generates its own training data by allowing the agent to actively interact with objects and explore the environment.

Single-step action affordance. Action affordance describes the possibility of an action to be applied to a given location in the environment. The task of affordance prediction is not limited to a specific kind of object or action primitive. Building on the well-studied image segmentation problems, many existing methods have been developed to learn object affordance through passive observations, such as learning human-object interaction hotspots from video [interaction-hotspots, nagarajan2020ego-topo] and contact heatmaps from RGB-D images [brahmbhatt2019contactdb]. The work most related to ours is “Where2Act” by Mo et al. [mo2021where2act], where the algorithm can infer single-step action affordance for different articulated objects. However, limited by its single-step formulation, this approach fails to generate long-horizon motion trajectories for goal-conditioned manipulation tasks, which is the focus of our approach.

III Approach

Fig. 2: Approach overview. UMPNet takes visual observations (i.e., RGB-D images) of an articulated object as input and generates a sequence of actions in SE(3) space to explore novel object states. (Left) A grasp position is selected in the first interaction step. (Right) In the following steps, the outcomes for each action candidate (a distance prediction and an Arrow-of-Time prediction) are inferred and then used for action direction selection. The distance prediction infers the potential moving distance of the joint after applying the action. The Arrow-of-Time prediction infers whether or not the action will move the object towards a novel state. The action direction with the largest distance prediction and a positive Arrow-of-Time prediction will be selected.

The goal of the manipulation policy is to generate a sequence of actions to interact with a random articulated object which would result in novel states that haven’t been visited before. Taking Fig. 2 as an example, to effectively explore novel states of the object (i.e., a toilet), the algorithm should be able to (a) choose the right position on the object to interact with (i.e., interacting with the cover instead of the base), (b) select a proper action direction (i.e., pulling up instead of pushing down), and (c) consistently select actions in the following steps to explore novel states (i.e., continuing to pull up the cover instead of moving it up and down). These three requirements directly correspond to the three key components of our algorithm, which are action position selection (a), action distance (b) and Arrow-of-Time inference (c) for action direction selection. As a result, the final system is able to learn through a self-guided exploration process, without explicit human demonstrations [lynch2020learning], scripted policy [mo2021where2act], or pre-defined goal conditions [nasiriany2019planning].

III-A Problem formulation

The task is defined as follows: given visual observations of an articulated object in the form of RGB-D images at the initial and current states, the agent's policy generates an action at each step that satisfies the aforementioned requirements. The action is represented in SE(3) space, parameterized by the end-effector (i.e., a suction-based gripper) position and moving direction, where the position is a 3D coordinate and the direction is a unit vector in 3D indicating the end-effector moving direction.

In the first interaction step, the policy selects a 3D position to apply the action (i.e., an immobilizing grasp via suction). To execute the action, the agent moves its end-effector to this position, with an orientation perpendicular to the object surface. In each following step, the agent selects a 3D direction and moves its end-effector 0.18 m along that direction, while the grasp position stays fixed relative to the object's surface. The suction behavior is implemented as a force constraint between the suction cup and the selected 3D position on the object. The orientation of the end-effector is always aligned with the surface normal during the interaction.

III-B Position inference

To start, the policy needs to determine a suitable position on the object's 3D surface to apply the action (i.e., an immobilizing grasp via suction). To do so, the algorithm needs to select a pixel from the observation image. The selected pixel will then be projected back to 3D space using the depth value provided in the RGB-D image.

We formulate this problem as an image labeling task, where the position network (Fig. 2a) takes in an RGB-D image and predicts a per-pixel position affordance score. The affordance score indicates the likelihood that the object part moves when an action is applied at this position. We use a U-Net architecture for this task. The network is supervised by the outcome of the executed action (a single pixel out of the image). The ground truth label is positive if and only if the object state is changed in any of the future steps. The network is trained with Binary Cross-Entropy loss.

Note that simply selecting a position belonging to a movable link is a necessary but not sufficient criterion. For example, if the selected position is very close to the joint axis, the agent will not be able to apply enough force to move the object part. Furthermore, the label is affected by the quality of direction selection. A correct position can still be labeled as a negative case if the object state is not changed due to wrong direction predictions in the following steps.
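The pixel selection and back-projection described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a greedy argmax over the affordance map and a pinhole camera model, and the intrinsics `fx`, `fy`, `cx`, `cy` as well as both function names are hypothetical.

```python
from typing import List, Tuple


def select_pixel(affordance: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the pixel with the highest affordance score."""
    best = (0, 0)
    for v, row in enumerate(affordance):
        for u, score in enumerate(row):
            if score > affordance[best[0]][best[1]]:
                best = (v, u)
    return best


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float
                 ) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with its depth into camera-frame 3D.

    Assumption: a pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy); the source only states that the depth
    value is used for the projection.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

During training the selection would additionally be randomized by the exploration scheme; the sketch only shows the greedy case.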

III-C Direction inference

At this point, the end-effector has grasped the object link at the selected position, which is visible to the camera. Conditioned on this information, the policy then needs to select a 3D direction. The outcome of an action is measured by two quantities: the moving distance of the object joint and the Arrow-of-Time attribute. Both are defined from the joint state of the object in each step, with a threshold that determines whether the state is effectively changed.

The goal of the policy is then to infer the outcome for all action candidates generated by a direction sampler. The direction with the largest distance prediction and a positive AoT prediction will be selected.
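To make the two outcome quantities concrete, the sketch below computes them for a single scalar joint. The function name, the scalar-joint simplification, and the exact AoT rule (comparing the distance to the initial state before and after the action against the threshold) are assumptions chosen to match the description, not the paper's stated formula.

```python
from typing import Tuple


def action_outcome(s_init: float, s_prev: float, s_next: float,
                   threshold: float) -> Tuple[float, int]:
    """Return (distance, aot) for one executed action.

    s_init: joint state at the start of the sequence
    s_prev: joint state before the action
    s_next: joint state after the action
    aot is +1 (moves away from the initial state), -1 (moves back
    towards it) or 0 (no effective change). This rule is an assumed
    reading of the text.
    """
    distance = abs(s_next - s_prev)
    progress = abs(s_next - s_init) - abs(s_prev - s_init)
    if progress > threshold:
        aot = 1
    elif progress < -threshold:
        aot = -1
    else:
        aot = 0
    return distance, aot
```

The three AoT values correspond to the positive, negative, and not-moving cases used when training the direction module.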

Direction sampling. To generate direction candidates, one naive method would be uniformly sampling in the SO(3) space. However, limited by the number of samples, the sampled directions can only cover a small portion of the continuous action space, which may not include the optimal directions. To address this issue, we use a heuristic approach, the iterative cross-entropy method (CEM), to reduce the sampling space and achieve efficient direction sampling. The algorithm starts by uniformly sampling a fixed number of directions from the SO(3) space. Then, it evaluates the sampled actions based on the predicted action scores. In the next iteration, the algorithm re-samples the action candidates with probability correlated to their scores, scaled by a temperature value. After random noise is added, they are considered as candidates in the second iteration. In this way, the samples in the second iteration will concentrate on the region that has more “potential”, leading to better performance with the same number of samples. Detailed comparisons are listed in the appendix. Our final model uses CEM sampling with 64 samples.
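The two-iteration sampling scheme can be sketched as below. This is an illustrative approximation under several assumptions: directions are treated as unit vectors, the re-sampling probability is a softmax of the score divided by the temperature, the noise is Gaussian, and `score_fn`, `temperature`, and `noise_std` are hypothetical names and values not given in the text.

```python
import math
import random
from typing import Callable, List


def _normalize(v: List[float]) -> List[float]:
    n = math.sqrt(sum(c * c for c in v))
    n = max(n, 1e-12)  # guard against a degenerate zero vector
    return [c / n for c in v]


def _random_unit_vector(rng: random.Random) -> List[float]:
    # Normalizing an isotropic Gaussian sample gives a uniform direction.
    return _normalize([rng.gauss(0.0, 1.0) for _ in range(3)])


def cem_directions(score_fn: Callable[[List[float]], float],
                   n_samples: int = 64, temperature: float = 1.0,
                   noise_std: float = 0.1, seed: int = 0
                   ) -> List[List[float]]:
    """Return second-iteration direction candidates."""
    rng = random.Random(seed)
    # Iteration 1: uniform candidates, scored by the (assumed) score_fn.
    first = [_random_unit_vector(rng) for _ in range(n_samples)]
    scores = [score_fn(d) for d in first]
    # Softmax weights with temperature (assumed form of the
    # "probability correlated to its score").
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    # Iteration 2: re-sample with replacement, then perturb and renormalize.
    picked = rng.choices(first, weights=weights, k=n_samples)
    return [_normalize([c + rng.gauss(0.0, noise_std) for c in d])
            for d in picked]
```

In the described system the score would come from the network's predictions for each candidate; here any callable that maps a direction to a number can be plugged in.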

Distance inference. To infer the moving distance of the joint for an action, the network only needs to consider the current object state and grasp position, which are both encoded in the current image observation. Taking in the RGB-D image of the current state, DistNet (Fig. 2b) outputs an embedding vector. Then DistDecoder (Fig. 2d) takes both the embedding vector and the action as input, and outputs a scalar as the distance prediction. DistNet is a convolutional neural network whose output is flattened to an embedding vector. DistDecoder is a fully-connected neural network. The model is trained using MSE loss for the executed action.

Arrow-of-Time inference. For single-step interaction, any action that changes the object’s state would result in a novel state. However, this is not true for multi-step interactions: the policy can move the object link back and forth without exploring any new states. To address this issue, we propose an “Arrow-of-Time” (AoT) action attribute that indicates whether the action will change the object state back to the initial state or forward into the future (i.e., a novel state). Specifically, AoTNet (Fig. 2c) takes the current and initial observations as input and outputs another embedding vector. This embedding vector is then combined with the action embedding to infer the final AoT label for this action. The network architecture of the AoT branch is similar to that of the Dist branch; the only differences are the input dimensions of DistNet and AoTNet and the output dimensions of the AoT decoder and the Dist decoder. The model is trained as a three-way classification with Cross-Entropy loss. The final loss for direction inference combines the distance loss and the AoT loss with a weighting coefficient that is fixed in our experiments.

III-D Training

All training data comes from interaction trials executed by the policy trained from scratch. A FIFO replay buffer is used to store training data. To collect data with both positive and negative AoT labels, we employ contradictory policies for direction inference within a sequence. In the first half of each sequence, we select actions with positive AoT predictions for execution to move the object away from its initial state. In the second half, actions with negative AoT predictions are executed to encourage the object to move back. 16 trajectories are collected in each epoch. The sequence length is 4 at the beginning. After 1000 epochs, it increases by 2 every 400 epochs, until reaching 20.
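The sequence-length curriculum and the contradictory policy within a sequence can be sketched as follows. Both function names are hypothetical, and the sketch assumes 0-indexed epochs with the first increase applied exactly at epoch 1000, which the text does not specify.

```python
def sequence_length(epoch: int) -> int:
    """Interaction sequence length used at a given training epoch."""
    if epoch < 1000:
        return 4
    # Assumption: the first +2 increase happens at epoch 1000,
    # then one more every 400 epochs, capped at 20.
    increases = (epoch - 1000) // 400 + 1
    return min(20, 4 + 2 * increases)


def target_aot(step: int, length: int) -> int:
    """AoT sign to execute at a step: +1 in the first half, -1 after."""
    return 1 if step < length // 2 else -1
```

Under these assumptions the length reaches its cap of 20 at epoch 3800.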

ε-greedy exploration is used during training, where ε decreases linearly from a starting value to a final value within a fixed number of epochs. Position inference and direction inference each use their own schedule parameters.
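A linearly decaying exploration schedule of this kind can be sketched as below. The function names and all numeric values passed to them are placeholders for illustration only; the actual start value, end value, and decay length are not reproduced here.

```python
import random


def linear_epsilon(epoch: int, eps_start: float, eps_end: float,
                   decay_epochs: int) -> float:
    """Linearly interpolate epsilon, then hold it at eps_end."""
    if epoch >= decay_epochs:
        return eps_end
    return eps_start + (eps_end - eps_start) * epoch / decay_epochs


def epsilon_greedy(rng: random.Random, epsilon: float,
                   greedy_index: int, num_candidates: int) -> int:
    """Pick a random candidate with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(num_candidates)
    return greedy_index
```

The same two helpers could serve both modules, with `greedy_index` being the best pixel for position inference or the best direction candidate for direction inference.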

The position module and the direction module are each trained for 8 iterations in each epoch. In each position training iteration, we sample a batch of examples from the replay buffer with a 1:1 positive to negative ratio. In each direction training iteration, a batch is formed from positive, negative, and not-moving data in a 1:1:1 ratio.
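Balanced batches of this kind can be drawn by grouping the replay buffer by label and sampling the same number of examples from each group. The sketch below is an assumption about how this might be done: the dictionary-based buffer entries, the `label` key, and sampling with replacement are all hypothetical choices.

```python
import random
from typing import Any, Dict, List


def balanced_batch(buffer: List[Dict[str, Any]], batch_size: int,
                   rng: random.Random) -> List[Dict[str, Any]]:
    """Sample a batch with an equal share of every label in the buffer."""
    groups: Dict[Any, List[Dict[str, Any]]] = {}
    for example in buffer:
        groups.setdefault(example["label"], []).append(example)
    per_group = batch_size // len(groups)
    batch: List[Dict[str, Any]] = []
    for items in groups.values():
        # Sampling with replacement so that rare labels can still fill
        # their share of the batch (an assumption).
        batch.extend(rng.choices(items, k=per_group))
    rng.shuffle(batch)
    return batch
```

With labels +1, -1 and 0 this yields the 1:1:1 mix for direction training; with two labels it yields the 1:1 mix for position training.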

Fig. 3: Goal conditioned manipulation.
TABLE I: Effective state exploration. Columns, left to right: fridge, folding chair, laptop, stapler, trashcan, microwave, toilet, window, cabinet, switch, kettle, toy (novel instances in training categories) | box, phone, dish washer, safe, oven, washing machine, table, kitchen pot, bucket, door (testing categories).

Single action effects
Where2Act 0.94 2.08 1.10 0.79 0.92 1.24 1.05 1.06 0.80 0.74 0.96 0.57 | 0.96 1.48 1.01 1.17 1.17 1.95 0.82 1.02 1.38 0.81
AoTOnly 0.99 1.42 1.05 0.63 0.62 1.01 0.76 0.62 0.61 0.54 0.57 0.51 | 0.75 1.10 1.06 1.10 1.14 1.46 0.49 0.86 1.21 0.80
SignedDist 0.84 1.68 1.04 0.53 0.91 1.25 1.23 0.69 0.73 0.43 0.65 0.51 | 0.75 1.10 1.06 1.10 1.14 1.46 0.49 0.86 1.21 0.80
UMPNet 1.02 2.08 1.37 0.73 0.92 1.29 1.26 1.03 0.81 0.70 0.90 0.66 | 1.10 1.50 1.14 1.18 1.32 1.87 0.77 1.05 1.69 0.90

Ratio of unique states visited
Where2Act 0.38 0.45 0.34 0.25 0.52 0.56 0.49 0.56 0.45 0.50 0.58 0.26 | 0.39 0.39 0.45 0.42 0.51 0.53 0.50 0.66 0.24 0.34
Where2Act+HP 0.72 0.85 0.89 0.48 0.60 0.83 0.85 0.72 0.62 0.63 0.73 0.50 | 0.75 0.87 0.79 0.84 0.81 0.89 0.54 0.86 0.91 0.65
SingleStep 0.31 0.42 0.39 0.26 0.47 0.51 0.48 0.49 0.44 0.47 0.57 0.24 | 0.44 0.38 0.39 0.41 0.45 0.45 0.47 0.78 0.29 0.31
AoTOnly 0.58 0.77 0.69 0.42 0.47 0.68 0.62 0.67 0.50 0.44 0.59 0.44 | 0.70 0.76 0.65 0.82 0.61 0.81 0.44 0.80 0.83 0.50
SignedDist 0.43 0.59 0.66 0.38 0.47 0.54 0.58 0.58 0.46 0.38 0.48 0.38 | 0.60 0.57 0.51 0.58 0.57 0.65 0.36 0.55 0.68 0.47
UMPNet 0.70 0.85 0.90 0.52 0.60 0.87 0.81 0.74 0.64 0.55 0.74 0.52 | 0.77 0.85 0.76 0.85 0.80 0.92 0.56 0.86 0.93 0.68
UMPNet+HP 0.71 0.86 0.90 0.57 0.64 0.88 0.83 0.74 0.65 0.60 0.74 0.55 | 0.77 0.88 0.78 0.86 0.83 0.92 0.56 0.88 0.93 0.70
Fig. 4: Open-ended state exploration. Arrow length indicates the inferred distance value, and color indicates the inferred AoT label. We visualize the uniform samples to better illustrate the AoT distribution. (Left) Qualitative comparisons. All methods are able to choose a suitable position. However, both SingleStep and Where2Act cannot distinguish between actions that move away from or back to the initial state (all directions are red), leading to inefficient exploration. In contrast, UMPNet is able to infer the correct AoT labels and hence select the correct action to explore novel states. (Right) Number of unique states visited up to each step using different exploration strategies (laptop testing instances). Error bars are measured over five random seeds.

III-E Goal conditioned manipulation with reversed AoT

While open-ended interaction is useful for exploring and collecting information about the environment, most manipulation tasks are goal conditioned – the policy needs to generate actions that would lead towards a given goal state instead of a random novel state. Although the policy is trained with only open-ended exploration, the learned policy can be directly applied to perform goal conditioned manipulation without additional training.

Action selection with AoT label. The key idea to perform this task is to swap out the initial observation with the goal state observation as the input to the policy. Then by executing the actions with a reversed Arrow-of-Time (i.e., negative AoT), the policy tries to move the object back to the “past”, which will effectively move the object towards the goal. If the AoT predictions of all direction candidates are non-negative (no blue arrows in Fig. 3), the trajectory will terminate.
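The selection rule, including the termination condition, can be sketched as follows. The tuple layout of a candidate and the function name are hypothetical; the AoT predictions are assumed to be encoded as +1, 0 and -1.

```python
from typing import List, Optional, Sequence, Tuple

# (direction, predicted distance, predicted AoT label in {+1, 0, -1})
Candidate = Tuple[Sequence[float], float, int]


def select_direction(candidates: List[Candidate],
                     wanted_aot: int) -> Optional[Sequence[float]]:
    """Pick the direction with the largest predicted distance among the
    candidates whose AoT prediction matches wanted_aot.

    wanted_aot = +1 corresponds to open-ended exploration, and
    wanted_aot = -1 to goal-conditioned manipulation, where the goal
    observation takes the place of the initial observation.
    Returns None when no candidate matches, i.e. the trajectory ends.
    """
    valid = [c for c in candidates if c[2] == wanted_aot]
    if not valid:
        return None
    return max(valid, key=lambda c: c[1])[0]
```

The same function covers both tasks; only the observation fed to the network and the requested AoT sign change.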

Apart from choosing the right action direction, another unique challenge for goal-conditioned manipulation is how to choose the correct link to interact with when there are multiple movable links on the object (e.g., a fridge with double doors in Fig. 3). While the position heatmap predicted by the network covers all movable links, only interacting with the right one can lead to the goal. Therefore, to choose a proper position, we first compute a difference mask between the initial and target observations. Then, we multiply the raw position heatmap by the mask to get the filtered position affordance (removing the pixels that are not changed). The final position is selected from the filtered heatmap. The algorithm for goal-conditioned manipulation is illustrated in Fig. 3.
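The mask-and-multiply step can be sketched as below. How the difference mask is computed is not specified in the text; this sketch assumes a per-pixel depth difference compared against a hypothetical tolerance `tol`, and the function name is likewise made up for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]


def select_goal_position(heatmap: Image, initial_depth: Image,
                         goal_depth: Image, tol: float = 0.01
                         ) -> Optional[Tuple[int, int]]:
    """Return the (row, col) with the highest affordance among pixels
    that differ between the initial and goal observations."""
    best_pixel, best_score = None, 0.0
    for v, row in enumerate(heatmap):
        for u, score in enumerate(row):
            # Binary difference mask (assumed: depth change above tol).
            changed = abs(initial_depth[v][u] - goal_depth[v][u]) > tol
            filtered = score if changed else 0.0
            if filtered > best_score:
                best_pixel, best_score = (v, u), filtered
    return best_pixel
```

When the two observations are identical everywhere, the filtered heatmap is all zeros and the function returns `None`.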

IV Evaluation

TABLE II: Goal conditioned manipulation (normalized distance to target; lower is better).
Novel instances in train categories | Test categories
Inverse [agrawal2016learning] 0.30 0.21 0.32 0.31 0.27 0.17 0.28 0.09 0.27 0.25 0.09 0.34 | 0.25 0.32 0.09 0.17 0.27 0.15 0.21 0.00 0.51 0.27
AoTOnly 0.23 0.18 0.12 0.22 0.32 0.18 0.15 0.16 0.32 0.38 0.12 0.08 | 0.30 0.05 0.07 0.18 0.31 0.18 0.27 0.00 0.31 0.18
SignedDist 0.26 0.24 0.11 0.20 0.35 0.19 0.22 0.15 0.41 0.44 0.13 0.12 | 0.32 0.09 0.11 0.20 0.34 0.22 0.31 0.00 0.30 0.22
UMPNet 0.20 0.19 0.05 0.19 0.23 0.16 0.12 0.13 0.28 0.21 0.11 0.04 | 0.26 0.03 0.06 0.15 0.21 0.16 0.22 0.00 0.22 0.17
Fig. 5: Goal conditioned manipulation results. At the beginning or in the middle of a trajectory, the action candidates have positive (red) and negative (blue) AoT labels. To move toward the goal, the policy selects the action with the largest distance prediction and a negative AoT label (the longest blue arrow) to execute. When reaching the goal state (current and goal states are similar), the AoT labels turn non-negative for all actions, since all actions will either make no change or move further away from the goal state. The [Inverse] model (right-most column) often chooses sub-optimal action directions (highlighted by red dashed circles) at the beginning of the interaction sequence, where the current observation is far away from the goal state.

Our simulation environment uses objects from PartNet-Mobility [xiang2020sapien] and the PyBullet physics engine [coumans2017pybullet]. We use 12 categories for training and 10 categories for testing. There are 504 training object instances, 132 testing object instances from training categories, and 261 object instances in the testing categories. For each interaction session, we load a random articulated object into the simulation with a randomly initialized pose and joint configuration.

IV-A Open-ended state exploration

We first evaluate UMPNet’s effectiveness in exploring novel states of an articulated object. Being able to effectively explore the possible states of an object without a specific goal is a critical first step for many robot learning algorithms since it is often used to collect the initial observation about the environment to initiate the training. While random explorations can be used for simple environments, they are often not sufficient for tasks involving high-dimensional action space, where the majority of the actions will not change the object joint state in a meaningful way.

Instead, an effective state exploration policy should be able to choose actions that can (1) significantly change the joint state of an object and (2) lead to novel states that have not been visited before. The first property requires the system to understand the object structure, and the second property requires the system to be aware of the interaction history.

Metrics. We use two metrics to evaluate the effectiveness of state exploration: (1) Single action effects: measures the joint state difference before and after each interaction step. The threshold of significant state change is 0.15 m for prismatic joints and 8.6° for revolute joints. This metric evaluates whether the algorithm can choose the action that would change the state of the object most significantly. (2) Novel states visited: measures the ratio of the number of unique states visited to the number of interaction steps. Two states are considered the “same” when the object’s joint difference is less than a threshold. This metric evaluates whether the algorithm is aware of the interaction history and chooses actions leading to novel states that have not been visited before.
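The second metric can be illustrated with a short sketch for a single scalar joint. The greedy way of counting unique states (a state is new if it differs from every previously counted state by at least the threshold) and the function name are assumptions; whether the initial state is included in the count is also not specified in the text.

```python
from typing import List


def unique_state_ratio(states: List[float], threshold: float) -> float:
    """Ratio of unique joint states to the number of recorded steps.

    states: joint state recorded after each interaction step
    threshold: two states closer than this are treated as the same
    """
    if not states:
        raise ValueError("states must not be empty")
    unique: List[float] = []
    for s in states:
        if all(abs(s - u) >= threshold for u in unique):
            unique.append(s)
    return len(unique) / len(states)
```

A policy that moves a link back and forth between two states over four steps scores 0.5, while one that keeps moving in a consistent direction scores 1.0.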

Algorithm comparisons. We compare our final model with the following alternative approaches:

•  Where2Act [mo2021where2act]: This algorithm takes the current observation as input and selects a single-step action. The model is trained with a binary classification loss, where an action is positive only if the moving distance is larger than a threshold.
•  Where2Act+HP: An additional heuristic filters out actions that form an angle larger than 90° with the last-step action. This heuristic helps to avoid back-and-forth actions; however, it cannot be applied to goal-conditioned manipulation.
•  SingleStep: Single-step version of our method that only takes the current observation as input.
•  AoTOnly: This method only outputs an AoT label for each action, without the distance inference.
•  SignedDist: Instead of inferring AoT and distance as separate outputs, this method infers a signed distance by multiplying the AoT and distance values.
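The 90° heuristic used by the +HP variants amounts to a sign check on the dot product between each candidate and the previous action direction. A minimal sketch, with a hypothetical function name:

```python
from typing import List, Sequence


def heuristic_filter(candidates: List[Sequence[float]],
                     last_direction: Sequence[float]
                     ) -> List[Sequence[float]]:
    """Drop candidates that form an angle larger than 90 degrees with
    the last executed direction (i.e. a negative dot product)."""
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return [c for c in candidates if dot(c, last_direction) >= 0.0]
```

Because the filter depends only on the previous action, one poor action constrains every later step, which is consistent with the sensitivity to error propagation discussed in the results.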

Results and analysis. Quantitative and qualitative results are summarized in Tab. I and Fig. 4.

Effect of the AoT prediction. Both [ Where2Act ] and [ SingleStep ] only take the current observation as input and infer actions for one step; hence, they do not need to understand the interaction history. From Tab. I we can see that [ Where2Act ] is able to achieve similar performance in “single action effects”. However, neither [ Where2Act ] nor [ SingleStep ] can effectively explore novel states with more interaction steps. Since both algorithms are not aware of the interaction history, we observe that the policy often selects actions that would manipulate the object link back and forth instead of exploring new possible object states. When combined with the heuristic, the algorithm [ Where2Act+HP ] can avoid back-and-forth actions. However, it is sensitive to error propagation, where one sub-optimal action would affect all following steps through the filtering process, resulting in worse performance. Fig. 4 shows examples of action prediction results for [ UMPNet ]. With just the Arrow-of-Time prediction, [ UMPNet ] is able to identify the actions that would always move the object away from the past states (i.e., red arrows); therefore, it is able to visit novel states much more frequently. When [ UMPNet ] is combined with the heuristic filter, the performance improves slightly.

Effect of the distance prediction. Compared to [ AoTOnly ], we can observe that by explicitly predicting the distance value for each action candidate, [ UMPNet ] can better differentiate between different action directions and choose the optimal action direction that would introduce larger state changes. As a result, [ UMPNet ] can achieve a better “single action effect” for all object categories, leading to more efficient state exploration when considering the entire sequence.

Effect of decomposing AoT and distance prediction. Different from [ SignedDist ], which directly predicts a signed distance value that combines the AoT and distance, [ UMPNet ] decomposes its output into an AoT label (trained with classification) and a distance value (trained with regression). This decomposition helps the algorithm better disentangle these two concepts, allowing the algorithm to achieve more accurate predictions for both. As a result, [ UMPNet ] can achieve better performance in both metrics.

IV-B Goal conditioned manipulation

In this experiment, we evaluate UMPNet’s performance in the task of goal-conditioned manipulation. Given a target state in the form of an RGB-D image, the task is to infer a sequence of actions that manipulate the object toward the target state and halts when the object reaches the target state.

Metrics. The performance for this task is measured by the normalized distance to the target state after interaction, computed from the vector of the object’s joint state. To make the task more challenging, the initial and goal states are selected from the upper and lower limits of the joint. The initial state may be moved to ensure the task can be accomplished in 15 steps.

Algorithm comparisons. We compare with the [ Inverse ] model proposed by Agrawal et al. [agrawal2016learning], a single-step inverse model for goal-conditioned manipulation. Each step takes the current and goal observations as input and predicts the action that would change the state from the current state to the goal state. This model is trained on the same state-action pairs as our method, and the action output is trained with a direct regression loss.

Results and analysis. Tab. II shows that, compared to the prior work [ Inverse ] and other alternative approaches, [ UMPNet ] is able to achieve more precise goal-conditioned manipulations by moving the object to a state that is closer to the target (lower value). From the qualitative comparisons in Fig. 5 we can observe that the performance of the [ Inverse ] model is much worse at the beginning of the interaction, where the algorithm often selects sub-optimal action directions that make less progress towards the goal (actions highlighted in red dashed circles). Since the [ Inverse ] model only takes consecutive observations as input during training, it struggles to handle long-horizon manipulation tasks, where the current observation is far away from the goal state. Similar to the exploration experiments, we observe that [ AoTOnly ] often chooses sub-optimal action directions as it is unaware of the actual magnitudes (i.e., distances) of different action effects.

Fig. 6: Action Articulation. The joint axes (red) are inferred from the actions selected by the learned policy (green), which indicates the system’s implicit understanding of the objects’ articulation structure.
Fig. 7: Real-world experiment. We test the model trained in simulation on a real-world platform. (a) We use an RGB-D camera to capture visual observations and a UR5 with a suction gripper for manipulation. (b) Action trajectory. (c) For each object, we visualize the inferred action position and direction for two different target states. To move toward the goal, the policy selects and executes the action with the largest distance prediction and a negative AoT label (the longest blue arrow).
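The selection rule stated in the Fig. 7 caption can be sketched as follows. This is an illustration only: the `Candidate` container, the integer encoding of the AoT label (-1 for back to the past, 0 for no change, +1 for forward), and the `toward_goal` switch are assumptions made for the example, not the authors' interface.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    direction: Tuple[float, float, float]  # candidate 3D action direction
    aot: int          # assumed encoding: -1 back, 0 no change, +1 forward
    distance: float   # predicted magnitude of the resulting state change


def select_action(candidates: Sequence[Candidate],
                  toward_goal: bool = True) -> Optional[Candidate]:
    """Pick the candidate with the largest predicted distance among those
    whose AoT label matches the desired direction in time.

    toward_goal=True keeps negative-AoT candidates (goal-conditioned use);
    toward_goal=False keeps positive-AoT candidates (exploration).
    Returns None when no candidate carries the wanted label.
    """
    wanted = -1 if toward_goal else 1
    matching = [c for c in candidates if c.aot == wanted]
    if not matching:
        return None
    return max(matching, key=lambda c: c.distance)
```

Filtering by AoT label before ranking by distance reflects the point made about [ AoTOnly ]: the label alone fixes the direction in time, while the distance prediction separates large from small state changes.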

IV-C Inferring objects’ articulation structure from interactions

We hypothesize that one of the requirements for learning a universal policy for any articulated object is the ability to understand the object’s underlying articulation structure and how this structure reacts to different actions. Hence, the actions selected by the policy should, in turn, reflect its belief about the object’s structure. For example, we often apply forces along the axis for prismatic joints and perpendicular to the rotation axis for revolute joints.

To visualize the policy’s implicit belief about the object’s structure, we compute the joint parameters inferred from the actions selected by the policy. For a prismatic joint, we simply take the average of the action directions. For a revolute joint, we first compute a common action plane in 3D space (brown plane in Fig. 6), whose normal direction is chosen based on the action directions of all interaction steps. Within this plane, we draw the line perpendicular to each action (blue lines in Fig. 6) and compute the intersection point of each pair of these lines. The final axis position is then voted among these intersection points. Fig. 6 shows examples of inferred joint parameters for objects with different articulation structures (red lines).
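The procedure above can be sketched in plain Python. Two parts are assumptions rather than the paper's method: the plane normal is estimated by summing sign-aligned cross products of action-direction pairs (the text does not give the exact criterion), and the vote over intersection points is replaced by a simple mean. The function names are hypothetical.

```python
import math
from itertools import combinations


def _add(a, b): return tuple(x + y for x, y in zip(a, b))
def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _scale(a, k): return tuple(x * k for x in a)
def _dot(a, b): return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(_dot(a, a))
    if norm < 1e-12:
        raise ValueError("degenerate input: cannot normalize a zero vector")
    return _scale(a, 1.0 / norm)


def prismatic_axis(directions):
    """Prismatic joint: normalized average of the action directions."""
    total = (0.0, 0.0, 0.0)
    for d in directions:
        total = _add(total, d)
    return _unit(total)


def revolute_axis(positions, directions):
    """Revolute joint: returns (axis direction, point on the axis).

    Needs at least two non-parallel action directions and roughly
    coplanar action positions.
    """
    # Assumed normal estimate: sum of sign-aligned pairwise cross products.
    normal = (0.0, 0.0, 0.0)
    for a, b in combinations(directions, 2):
        c = _cross(a, b)
        if _dot(c, normal) < 0:
            c = _scale(c, -1.0)
        normal = _add(normal, c)
    normal = _unit(normal)
    # In-plane line perpendicular to each action: through p along n x a.
    lines = [(p, _unit(_cross(normal, a)))
             for p, a in zip(positions, directions)]
    points = []
    for (p1, d1), (p2, d2) in combinations(lines, 2):
        denom = _dot(_cross(d1, d2), normal)
        if abs(denom) < 1e-9:
            continue  # parallel lines have no intersection
        t = _dot(_cross(_sub(p2, p1), d2), normal) / denom
        points.append(_add(p1, _scale(d1, t)))
    if not points:
        raise ValueError("no intersecting pair of perpendicular lines")
    # Simplification: average the intersections instead of voting.
    center = _scale(tuple(sum(p[i] for p in points) for i in range(3)),
                    1.0 / len(points))
    return normal, center
```

For a door rotating about a vertical hinge, actions tangent to the door's arc yield perpendicular lines that all pass through the hinge, so every pairwise intersection agrees on the same point.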

We also quantitatively evaluate the inferred joint parameters. Although the algorithm has never been supervised on any joint parameters, it is able to estimate the joint axis orientation for both revolute and prismatic joints. The average error is higher for prismatic joints, since these objects often have a higher tolerance for sub-optimal action directions.

IV-D Real-world experiment

Finally, we validate our method on a real-world platform with a calibrated RGB-D camera (Intel RealSense D415), a UR5 robot, and a suction gripper. Fig. 7 (a) shows the real-world setup. Because large-scale real-world training is very challenging, in this experiment we directly test UMPNet trained in simulation on four different objects: box, laptop, microwave, and stapler. The inferred action trajectories to open and close the microwave are shown in Fig. 7 (b). The qualitative results of goal-conditioned manipulation in Fig. 7 (c) demonstrate that the trained model is able to infer proper grasping positions and action directions for different objects and goal conditions. While large-scale real-world training for UMPNet remains challenging, we believe these results demonstrate the promise of the proposed method in real-world applications.

V Conclusion

We introduce the Universal Manipulation Policy Network (UMPNet), a single image-based policy network that infers closed-loop action sequences for manipulating arbitrary articulated objects. The policy is trained with self-guided exploration without human demonstrations, scripted policies, or pre-defined goal conditions. Our experimental results demonstrate that the learned policy performs well in both open-ended exploration and goal-conditioned manipulation and outperforms alternative approaches in both tasks.


Appendix A Additional details: object dataset

Our experiments use 11 training categories and 10 testing categories from the PartNet-Mobility dataset [xiang2020sapien]. Most of the objects in PartNet-Mobility have relatively simple joint configurations where the desired action trajectories often have a smooth change of direction. To test whether our algorithm is able to produce more complex action trajectories, we add an additional object category, “Toy” (see Fig. A3), which requires the policy to constantly change the output action direction at every interaction step. Within this Toy category, we create two types of object instances using Blender, where the wave-shaped instance is used for training and the zig-zag-shaped instance is used for testing.

Detailed instance statistics and joint types for each object category are listed in Tab. A1. Note that many object instances contain both revolute and prismatic joints. For each category, the maximum number of interaction steps is determined from the average joint range. Fig. A3 presents example instances from each object category.

Appendix B Additional details: network structure

PositionNet. Given a visual observation captured by an RGB-D camera, we first calculate world coordinates for each pixel using the depth value. Surface normals are then estimated via KD-tree search. Next, the RGB-D image, world coordinates, and surface normals are concatenated (size 10 × 480 × 640) and fed into PositionNet, which has a U-Net architecture. PositionNet applies four down-sample blocks with 32, 64, 128, and 256 channels, followed by four up-sample blocks with 128, 64, 32, and 2 channels. Each down-sample (or up-sample) block includes a max-pooling (or bilinear interpolation) layer and two 3 × 3 convolution layers with ReLU. Finally, pixel-wise softmax is applied, and the output tensor is the position affordance with a size of 2 × 480 × 640.
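The first preprocessing step, computing world coordinates for each pixel from its depth value, can be sketched with a standard pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the camera pose (`rotation`, `translation`) are hypothetical parameters introduced for this example; the text does not specify how the calibration is represented.

```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates,
    assuming a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply the camera-to-world pose: rotation is a 3x3 nested list,
    translation a length-3 sequence."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Channel count of the concatenated input: RGB-D (4) + world
# coordinates (3) + surface normals (3).
INPUT_CHANNELS = 4 + 3 + 3
```

The channel arithmetic accounts for the leading dimension of the concatenated input: four RGB-D channels, three coordinate channels, and three normal channels give ten.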


DirectionNet. DistNet takes the current observation as input and applies seven 3 × 3 convolution layers with 32, 64, 128, 256, 512, 512, and 512 channels. Max pooling is applied after every layer except the first. The output tensor, with a size of 512 × 7 × 10, is then flattened into an embedding vector. DistDecoder takes the embedding vector and an action as input. A two-layer MLP with 256 dimensions in both layers is applied to the embedding vector. The action is also encoded via a two-layer MLP with 128 dimensions in both layers. These two vectors are then concatenated and fed into a four-layer MLP with dimensions of 1024, 1024, 1024, and 1. Finally, the network outputs a scalar value as the distance prediction. The network architecture of the AoT branch is similar to that of the Dist branch, with only two differences. First, the number of input channels of AoTNet is doubled, since the current observation and initial observation are concatenated and fed into the network. Second, the output dimension of AoTDecoder is three, followed by softmax to perform a three-way classification.
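The 512 × 7 × 10 output size can be checked with simple shape arithmetic. The sketch below assumes a 480 × 640 input (as for PositionNet), size-preserving 3 × 3 convolutions, and 2 × 2 max pooling with floor rounding on every layer except the first; none of these details are stated explicitly for DistNet, and the function name is hypothetical.

```python
def distnet_feature_shape(height=480, width=640,
                          channels=(32, 64, 128, 256, 512, 512, 512)):
    """Trace the feature-map shape through the convolution stack.

    Convolutions are assumed to preserve spatial size; every layer except
    the first is assumed to halve height and width (floor division).
    """
    for index in range(len(channels)):
        if index > 0:
            height //= 2
            width //= 2
    return channels[-1], height, width


def flattened_size(shape):
    """Length of the embedding vector after flattening (C, H, W)."""
    c, h, w = shape
    return c * h * w
```

Six halvings take 480 to 7 (the last step floors 7.5) and 640 to 10, so under these assumptions the flattened embedding has 512 · 7 · 10 = 35840 entries.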

Appendix C Effect of direction sampling strategy

Fig. A1: Action sample comparison
Fig. A2: Failure cases.

In this experiment, we evaluate the effect of the direction sampling strategy by comparing Uniform and CEM direction sampling with different numbers of samples. Fig. A1 (left) visualizes the actions sampled by each strategy. With the same number of samples, CEM provides denser candidates in the region of interest, making it more likely for the model to select a higher-quality direction. Fig. A1 (right) shows the algorithm’s performance in the goal-conditioned manipulation task with different action samples. We observe that performance improves with more action samples and that CEM sampling consistently outperforms Uniform sampling for the same number of samples. Therefore, our final model uses CEM sampling with 64 action samples.
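A generic cross-entropy-method (CEM) sampler over unit directions illustrates why the candidates concentrate in the region of interest. This is a sketch under assumptions: the number of rounds, the elite fraction, the per-coordinate Gaussian refit, and the `score` callback are all illustrative choices, since the text only states that CEM is used with 64 samples.

```python
import math
import random


def _sample_direction(rng, mean, std):
    """Draw a Gaussian sample and project it onto the unit sphere."""
    while True:
        v = [rng.gauss(m, s) for m, s in zip(mean, std)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:
            return tuple(x / norm for x in v)


def cem_best_direction(score, n_samples=64, n_rounds=4,
                       elite_frac=0.25, seed=0):
    """Return the highest-scoring unit direction found by CEM.

    The first round uses a zero-mean isotropic Gaussian, which gives
    uniformly distributed directions; later rounds sample around the
    elites of the previous round.
    """
    rng = random.Random(seed)
    per_round = n_samples // n_rounds
    n_elite = max(2, int(per_round * elite_frac))
    mean, std = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    best = None
    for _ in range(n_rounds):
        batch = [_sample_direction(rng, mean, std) for _ in range(per_round)]
        batch.sort(key=score, reverse=True)
        if best is None or score(batch[0]) > score(best):
            best = batch[0]
        elites = batch[:n_elite]
        # Refit the sampling distribution to the elite candidates.
        mean = tuple(sum(e[i] for e in elites) / n_elite for i in range(3))
        std = tuple(
            max(1e-3, math.sqrt(
                sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite))
            for i in range(3)
        )
    return best
```

In the setting described here, `score` would be supplied by the learned direction-scoring networks; uniform sampling corresponds to running a single round with all 64 samples.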

Training categories
Name            # train   # test   Revolute   Prismatic   Max steps
Fridge               33        9                                 12
FoldingChair         16        4                                  8
Laptop               34        9                                 12
Stapler              17        5                                 15
TrashCan             36       10                                  9
Microwave             8        2                                  8
Toilet               25        7                                  7
Window               41       11                                  6
Cabinet             266       67                                  9
Switch                8        2                                  7
Kettle               19        5                                  3
Toy                   1        1                                 10
Total               504      132

Testing categories
Name              # test   Revolute   Prismatic   Max steps
Box                   13                                 10
Phone                  3                                 12
Dishwasher            41                                 10
Safe                  28                                 10
Oven                  24                                  9
WashingMachine        17                                  9
Table                 77                                  7
KitchenPot            25                                  3
Bucket                 6                                 13
Door                  27                                 10
Total                261
TABLE A1: Statistics of the data splits.
Fig. A3: Training and testing categories.

Appendix D Limitations and failure cases

To allow goal-conditioned manipulation with reversed AoT actions, we assume the action trajectories are bi-directional in time (i.e., they are valid in either direction). While this assumption is true for most articulated objects, it does not apply to irreversible actions such as gluing or locking. In addition, our system assumes the agent uses a suction-based end-effector, which can provide robust grasps for a large variety of objects and is widely used in many real-world robotics systems. However, the policy cannot generalize to other grippers that require more precise grasp poses. For example, while the suction gripper can grasp the door frame at almost any position, parallel jaw grippers can only grasp the door by its handle.

Fig. A2 shows examples of failure cases. Case (a) is ambiguous in position selection since the door could be opened from either side, and the policy chooses to drag the middle of the door. In case (b), the selected action cannot change the object state since the microwave’s door has reached its maximum boundary. However, the joint range cannot be easily inferred from observation since some microwaves can in fact be opened up to 180°. In case (c), the policy infers actions that cause collisions between the end-effector and the object. In case (d), the end-effector is occluded after interactions. While a human is able to change the viewpoint for better observation, our agent uses a fixed camera position and is therefore not robust to occlusion. Both cases (c) and (d) could be addressed by better modeling the agent’s hardware embodiment, including end-effector shape and camera placement.

Appendix E Additional results: open-ended state exploration

Fig. A4 plots the number of unique states visited during exploration.

Fig. A4: Number of unique states visited up to each step. UMPNet achieves significantly better performance when handling long-sequence interactions, since it chooses actions with a consistent direction and maximum state change instead of moving the object link back and forth.

Appendix F Additional results: goal-conditioned manipulation

Fig. A5 shows the normalized distance to the target in the goal-conditioned manipulation task.

Fig. A5: Normalized distance to target up to each step. UMPNet outperforms other baselines when handling long-horizon goal-conditioned manipulation tasks.