We live in a 3D world composed of a plethora of 3D objects. To help humans perform everyday tasks, future home-assistant robots need to gain the capabilities of perceiving and manipulating a wide range of 3D objects in human environments. Articulated objects that contain functionally important and semantically interesting articulated parts (e.g., cabinets with drawers and doors) require especially close attention, as they are more often interacted with by humans and artificial intelligent agents. With much higher degree-of-freedom (DoF) state spaces, however, articulated objects are generally more difficult to understand and subsequently to interact with than 3D rigid objects, which have only 6 DoF for their global poses.
There has been a long line of research studying the perception and manipulation of 3D articulated objects in computer vision and robotics. On the perception side, researchers have developed various successful visual systems for estimating kinematic structures [1, 2], articulated part poses [3, 4, 5], and joint parameters [6, 7]. Then, with these estimated visual articulation models, robotic manipulation planners and controllers can be leveraged to produce action trajectories for robot executions [8, 9, 10, 11, 12]. The commonly used two-stage solution underlying most of these systems reasonably breaks the whole system into two phases and thus allows bringing together well-developed techniques from the vision and robotics communities. However, the current handshaking point, namely the standardized visual articulation models (i.e., kinematic structure, articulated part poses, and joint parameters), may not be the best choice, since essential geometric and semantic features for robotic manipulation tasks, such as interaction hotspots (e.g., edges, holes, bars) and part functionality (e.g., handles, doors), are inadvertently abstracted away in these canonical representations.
We propose a new type of actionable visual representations [13, 14, 15] exploring a more geometry-aware, interaction-aware, and task-aware perception-interaction handshaking point for manipulating 3D articulated objects. Concretely, we train the perception system to predict action possibility and visual action trajectory proposals at every point over parts of 3D articulated objects (see Figure 1). In contrast to previous works that use standardized visual articulation models as visual representations, our framework VAT-Mart predicts per-point dense action trajectories that are adaptive to the change of geometric context (e.g., handles, door edges), interactions (e.g., pushing, pulling), and tasks (e.g., opening a door by a given angle, closing a drawer by a given length). Abstracting away from concrete external manipulation environments, such as robot arm configurations, robot base locations, and scene contexts, we aim to learn unified object-centric visual priors with a dense and diverse superset of visual proposals that can potentially be applied to different manipulation setups, avoiding the need to learn separate manipulation representations under different circumstances.
The proposed actionable visual priors, as a "preparation for future tasks" [16] or "visually-guided plans" [17, 18], can provide informative guidance for downstream robotic planning and control. Sharing a similar spirit with [14, 15], we formulate our visual action possibility predictions as per-point affordance maps, on which the downstream robotic planners may sample a position to interact according to the predicted likelihood of success. Then, for a chosen point of interaction, the discrete task planner may search for applicable interaction modes (e.g., whether to attempt a grasp) within a much smaller space formed by the visual action trajectory distribution, instead of searching in the entire solution space. Next, considering the robot kinematic constraints and physical collisions, the continuous motion planner can further select an open-loop trajectory from the set of proposed visual action trajectory candidates as an initial value for optimization, and finally pass it to the robot controller for execution. More recent reinforcement learning (RL) based planners and controllers can also benefit from our proposed solution spaces for more efficient exploration.
To obtain such actionable visual priors, we design an interaction-for-perception learning framework VAT-Mart, as shown in Figure 2. By conducting trial-and-error manipulation with a set of diverse 3D articulated objects, we train an RL policy to learn successful interaction trajectories for accomplishing various manipulation tasks (e.g., opening a door by a given angle, closing a drawer by a given length). In the meantime, the perception networks are simultaneously trained to summarize the RL discoveries and generalize the knowledge across points over the same shape and among various shapes. For discovering diverse trajectories, we leverage curiosity feedback [19] so that the learning of the perception networks in turn affects the learning of the RL policy.
We conduct experiments using SAPIEN [20] over the large-scale PartNet-Mobility [21, 22] dataset of 3D articulated objects. We use 590 shapes in 7 object categories to perform our experiments and show that our VAT-Mart framework can successfully learn the desired actionable visual priors. We also observe reasonably good generalization capabilities over unseen shapes, novel object categories, and real-world data, thanks to large-scale training over diverse textureless geometry.
In summary, we make the following contributions in this work:
- We formulate a novel kind of actionable visual priors, making one more step towards bridging the perception-interaction gap for manipulating 3D articulated objects;
- We propose an interaction-for-perception framework VAT-Mart to learn such priors, with novel designs on the joint learning between exploratory RL and perception networks;
- Experiments conducted over the PartNet-Mobility dataset in SAPIEN demonstrate that our system works at a large scale and learns representations that generalize over unseen test shapes, across object categories, and even to real-world data.
2 Related Work
Perceiving and Manipulating 3D Articulated Objects has been a long-lasting research topic in computer vision and robotics. A vast literature [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41] has demonstrated successful systems, powered by visual feature trackers, motion segmentation predictors, and probabilistic estimators, for obtaining accurate link poses, joint parameters, kinematic structures, and even system dynamics of 3D articulated objects. Previous works [42, 43, 44, 45] have also explored various robotic planning and control methods for manipulating 3D articulated objects. More recent works further leveraged learning techniques for better predicting articulated part configurations, parameters, and states [6, 7, 4, 46, 3, 5, 47], estimating kinematic structures [1, 2], as well as manipulating 3D articulated objects with the learned visual knowledge [8, 9, 10, 11, 12]. While most of these works represented visual data with link poses, joint parameters, and kinematic structures, such standardized abstractions may be insufficient if fine-grained part geometry, such as drawer handles and faucet switches that exhibit rich geometric diversity among different shapes, matters for downstream robotic tasks and motion planning.
Learning Actionable Visual Representations aims to learn visual representations that are strongly aware of downstream robotic manipulation tasks and directly indicative of action probabilities for robotic executions, in contrast to predicting standardized visual semantics, such as category labels [48, 49], segmentation masks [50, 22], and object poses [51, 52], which are usually defined independently from any specific robotic manipulation task. Grasping [53, 54, 55, 56, 57, 58, 59, 60, 61] or manipulation affordance [62, 13, 63, 64, 14, 65, 66, 67, 15] is one major kind of actionable visual representation, while many other types have also been explored recently (e.g., spatial maps [68, 69], keypoints [70, 71], contact points, etc.). Following the recent work Where2Act [15], we employ dense affordance maps as the actionable visual representations to suggest action possibility at every point on 3D articulated objects. Extending beyond Where2Act, which considers task-less short-term manipulation, we further augment the per-point action predictions with task-aware distributions of trajectory proposals, providing more actionable information for downstream executions.
Learning Perception from Interaction augments the tremendously successful learning paradigm using offline curated datasets [48, 50, 21, 22] by allowing learning agents to collect online active data samples, which are more task-aware and learning-efficient, during navigation [73, 16], recognition [74, 75, 76], segmentation [77, 78], and manipulation [79, 80]. Many works have also demonstrated the usefulness of simulated interactions for learning perception [81, 55, 82, 16, 65, 83, 15] and promising generalizability to the real world [84, 85, 86, 87, 88, 89, 90]. Our method follows the route of learning perception from interaction by using the action trajectories discovered by an RL interaction policy to supervise a jointly trained perception system, which in turn produces curiosity feedback [19] to encourage the RL policy to explore diverse action proposals.
3 Actionable Visual Priors: Action Affordance and Trajectory Proposals
We propose novel actionable visual representations for manipulating 3D articulated objects (see Fig. 1). For each articulated object, we learn object-centric actionable visual priors, which comprise: 1) an actionability map over articulated parts indicating where to interact; 2) per-point distributions of visual action trajectory proposals suggesting how to interact; and 3) estimated success likelihood scores rating the outcomes of the interaction. All predictions are interaction-conditioned (e.g., pushing, pulling) and task-aware (e.g., opening a door by a given angle, closing a drawer by a given length).
Concretely, given a 3D articulated object with its articulated parts, an interaction type, and a manipulation task, we train a perception system that makes dense predictions at each point over each articulated part: 1) an actionability score indicating how likely it is that there exists an action trajectory of the given interaction type at that point that can successfully accomplish the task; 2) a distribution of visual action trajectories, from which we can sample diverse action trajectories of the given interaction type to accomplish the task at that point; and 3) a per-trajectory success likelihood score.
We represent the input 3D articulated object as a partial point cloud. We consider two typical interaction types: pushing and pulling. A pushing trajectory maintains a closed gripper and performs the push with 6-DoF motion, whereas a pulling trajectory first performs a grasping operation at the point of interaction by closing an initially opened gripper and then follows the same kind of 6-DoF motion during the pull. For the articulated objects used in this work, we only consider 1-DoF part articulation and thus restrict the task specification to a single scalar. For example, a cabinet drawer has a 1-DoF prismatic translation joint and a refrigerator door is modeled by a 1-DoF revolute hinge joint. We specify the task as an absolute angle in radians for revolute joints and in units of length relative to the global shape scale for prismatic joints.
Both the actionability score and the per-trajectory success likelihood score are scalars within the interval [0, 1], where larger values indicate higher likelihood. One can use a threshold of 0.5 to obtain binary decisions if needed. Every action trajectory is a sequence of 6-DoF end-effector waypoints with variable trajectory length (at most five steps). In our implementation, we adopt a residual representation for the action trajectory, as it empirically yields better performance. Each 6-DoF waypoint comprises a 3-DoF robot hand center and a 3-DoF orientation. We use the 6D-rotation representation for the orientation of the first waypoint and predict 3-DoF Euler angles for subsequent orientation changes.
4 VAT-Mart: an Interaction-for-perception Learning Framework
The VAT-Mart system (Fig. 2) consists of two parts: an RL policy exploring diverse action trajectories and a perception system learning the proposed actionable visual priors. While the RL policy collects interaction trajectories for supervising the perception networks, the perception system provides curiosity feedback [19] for encouraging the RL policy to further explore diverse solutions. In our implementation, we first pretrain the RL policy, then train the perception network with RL-collected data, and finally finetune the two parts jointly with curiosity feedback enabled. We describe the key system designs below.
4.1 The RL Policy for Interactive Trajectory Exploration
For every interaction type and part articulation type, we train a single conditional RL policy using TD3 to collect trajectories that accomplish the interaction under varying task specifications across all shapes and contact points. Since the RL policy is trained in simulation solely to collect training data for supervising the perception networks, it can access the ground-truth state information of the simulation environment, such as the part poses, joint axes, and gripper poses. At test time, we discard the RL network and only use the learned perception networks to predict the proposed actionable visual priors.
Fig. 2 (left) illustrates the RL training scheme and examples of the diverse trajectory proposals explored for the task of pulling open the microwave door by 30 degrees. Below, we describe the RL specifications.
For the shape to interact with, we first randomly sample a shape category with equal probability, alleviating the potential category imbalance issue, and then uniformly sample a shape from the selected category for training. For the task specification, we randomly sample a target value within a predefined range for revolute parts and another predefined range for prismatic parts. We also randomly sample a starting part pose with the guarantee that the task can be accomplished. For the gripper, we initialize it with its fingertip 0.02 unit lengths away from the contact point and pointing within a cone of 45 degrees around the negative direction of the surface normal at the contact point.
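The gripper initialization described above can be sketched as follows. This is only an illustrative sketch under assumptions that the text does not state: the 45 degrees are treated as the cone half-angle, directions are sampled uniformly over the spherical cap, and the fingertip is placed by backing off from the contact point along the sampled approach direction. The function name `sample_gripper_init` and its parameters are hypothetical.

```python
import math
import random


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sample_gripper_init(contact, normal, rng, offset=0.02, max_angle_deg=45.0):
    """Sample a fingertip position and an approach direction (hypothetical helper).

    The direction lies within `max_angle_deg` (assumed half-angle) of the
    negative surface normal; the fingertip sits `offset` away from the contact.
    """
    axis = normalize(tuple(-c for c in normal))  # negative surface normal
    # Build an orthonormal basis (u, v, axis) around the cone axis.
    helper = (1.0, 0.0, 0.0) if abs(axis[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(axis, helper))
    v = cross(axis, u)
    # Uniform sampling over the spherical cap: uniform in cos(theta) and in phi.
    cos_t = rng.uniform(math.cos(math.radians(max_angle_deg)), 1.0)
    sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    phi = rng.uniform(0.0, 2.0 * math.pi)
    direction = tuple(
        cos_t * axis[i] + sin_t * (math.cos(phi) * u[i] + math.sin(phi) * v[i])
        for i in range(3)
    )
    # Assumption: the fingertip starts behind the contact point along the approach.
    fingertip = tuple(contact[i] - offset * direction[i] for i in range(3))
    return fingertip, direction
```

A call such as `sample_gripper_init((0.1, 0.2, 0.3), (0.0, 0.0, 1.0), random.Random(0))` returns one candidate start pose; the actual simulator setup may place and orient the gripper differently.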
The RL state includes the 1-DoF part pose change at the current timestep, the target task, the difference between the two, the gripper pose at the first timestep of interaction, the current gripper pose, the local positions of the gripper fingers, the current contact point location, a normalized direction of the articulated part joint axis, the articulated part joint location (defined as the closest point on the joint axis to the contact point), the closest distance from the contact point to the joint axis, and a directional vector from the joint location to the contact point. We concatenate all of this information into a 33-dimensional state vector that is fed to the RL networks.
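The joint-related state entries follow from a point-to-line projection. The sketch below computes them for a joint axis given by a point on the axis and a direction; the function name, the argument layout, and the choice to normalize the final direction vector are assumptions made for illustration.

```python
import math


def joint_geometry_features(contact, axis_origin, axis_direction):
    """Compute joint-axis features relative to a contact point (hypothetical helper).

    Returns (unit axis direction, joint location, distance, unit direction from
    the joint location to the contact point).
    """
    norm = math.sqrt(sum(c * c for c in axis_direction))
    d = [c / norm for c in axis_direction]
    rel = [contact[i] - axis_origin[i] for i in range(3)]
    # Signed length of the projection of the contact point onto the axis.
    t = sum(rel[i] * d[i] for i in range(3))
    # Joint location: the closest point on the joint axis to the contact point.
    joint_location = [axis_origin[i] + t * d[i] for i in range(3)]
    offset = [contact[i] - joint_location[i] for i in range(3)]
    distance = math.sqrt(sum(c * c for c in offset))
    # Assumption: the directional vector is normalized; zero if the point is on the axis.
    if distance > 1e-12:
        direction = [c / distance for c in offset]
    else:
        direction = [0.0, 0.0, 0.0]
    return d, joint_location, distance, direction
```

For a vertical hinge through the origin and a contact point at (3, 4, 5), the joint location is (0, 0, 5), the distance is 5, and the direction is (0.6, 0.8, 0).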
At each timestep, the RL networks output a residual gripper pose that determines the next-step waypoint. This action consists of a center offset and a Euler angle difference.
There are two kinds of rewards: extrinsic task rewards and intrinsic curiosity feedback. For the extrinsic task rewards, we use: 1) a final-step success reward for task completion, granted when the current part pose reaches the target within a 15% relative tolerance range; 2) a step-wise guidance reward encouraging the current part pose to get closer to the target than the previous part pose; and 3) a distance penalty that discourages the gripper from flying away from the intended contact point, computed from the distance between the contact point and the current fingertip position together with a zero-or-one indicator function of a boolean predicate. We describe the curiosity rewards in Sec. 4.3.
We stop an interaction trial upon task success or after a maximum of five steps.
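A minimal sketch of how the three extrinsic terms could be combined is given below. All weights, the distance threshold used in the indicator, and the exact functional forms are placeholders chosen for illustration; they are not the values or formulas used in our experiments. Only the 15% relative tolerance comes from the description above.

```python
def extrinsic_reward(target, prev_change, curr_change, fingertip_dist,
                     w_success=1.0, w_guidance=1.0, w_distance=1.0,
                     w_far=1.0, far_threshold=0.1, tolerance=0.15):
    """Return (reward, success) for one step (hypothetical weights and forms).

    target:         desired 1-DoF part pose change
    prev_change:    part pose change before this step
    curr_change:    part pose change after this step
    fingertip_dist: distance from the contact point to the current fingertip
    """
    prev_err = abs(target - prev_change)
    curr_err = abs(target - curr_change)
    # Success: within a 15% relative tolerance of the target (ends the trial).
    success = curr_err <= tolerance * abs(target)
    reward = w_success if success else 0.0
    # Guidance: positive when this step moved the part closer to the target.
    reward += w_guidance * (prev_err - curr_err)
    # Distance penalty: assumed linear term plus an indicator for drifting too far.
    reward -= w_distance * fingertip_dist
    reward -= w_far * float(fingertip_dist > far_threshold)
    return reward, success
```

With unit weights, moving a part from 0.2 to 0.45 toward a target of 0.5 while the fingertip stays 0.05 away yields a reward of about 1.2 and marks the trial as successful.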
Implementation and Training.
We implement the TD3 networks using multilayer perceptrons (MLPs). We use a replay buffer of size 2048. To improve the rate of positive data for efficient learning, we leverage Hindsight Experience Replay: an interaction trial may fail to accomplish the desired task, but it still accomplishes the task defined by the part pose change it actually achieved, and can therefore be relabeled as a success for that task. See supplementary for more details.
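The hindsight relabeling idea can be illustrated with a small sketch. The `Trial` record, its fields, and the `min_change` guard against trials in which the part barely moved are hypothetical and only serve the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Trial:
    """One interaction trial (hypothetical record layout)."""
    task: float                               # desired 1-DoF part pose change
    achieved: float                           # part pose change actually reached
    waypoints: Tuple[Tuple[float, ...], ...]  # executed gripper waypoints
    success: bool


def hindsight_relabel(trial: Trial, min_change: float = 1e-3) -> Optional[Trial]:
    """Turn a failed trial into a positive sample for the task it did achieve."""
    if trial.success:
        return trial
    # Assumed guard: a trial that did not move the part carries no useful task.
    if abs(trial.achieved) < min_change:
        return None
    # Same trajectory, but labeled with the achieved pose change as its task.
    return replace(trial, task=trial.achieved, success=True)
```

For instance, a trial that aimed for a pose change of 0.5 but reached 0.3 becomes a successful sample for the task 0.3, with its waypoints unchanged.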
4.2 The Perception Networks for Actionable Visual Priors
The perception system learns from the interaction trajectories collected by the RL exploration policy and predicts the desired per-point actionable visual priors. Besides several information encoding modules, there are three decoding heads: 1) an actionability prediction module that outputs the actionability score, 2) a trajectory proposal module that models the per-point distribution of diverse visual action trajectories, and 3) a trajectory scoring module that rates the per-trajectory success likelihood. Fig. 2 (right) presents an overview of the system. Below we describe detailed designs and training strategies.
The perception networks require four input entities: a partial object point cloud, a contact point, a trajectory, and a task. For the point cloud input, we use a segmentation-version PointNet++ to extract per-point features. We employ three MLP networks that respectively encode the contact point, the trajectory, and the task into feature vectors. We serialize each trajectory as a fixed-length vector after flattening all waypoint information. We augment trajectories that are shorter than five steps simply by zero padding.
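The flatten-and-pad step can be sketched as below. The per-waypoint dimension of 6 is an assumption for illustration (the actual layout depends on how positions and orientations are encoded), and `serialize_trajectory` is a hypothetical name.

```python
MAX_STEPS = 5      # trajectories have at most five steps
WAYPOINT_DIM = 6   # assumed per-waypoint dimension, for illustration only


def serialize_trajectory(waypoints, max_steps=MAX_STEPS, dim=WAYPOINT_DIM):
    """Flatten a variable-length trajectory into a fixed-length, zero-padded list."""
    if len(waypoints) > max_steps:
        raise ValueError("trajectory is longer than max_steps")
    flat = []
    for waypoint in waypoints:
        if len(waypoint) != dim:
            raise ValueError("unexpected waypoint dimension")
        flat.extend(float(x) for x in waypoint)
    # Zero-pad the missing steps so every trajectory has the same length.
    flat.extend([0.0] * (dim * (max_steps - len(waypoints))))
    return flat
```

A two-step trajectory is thus mapped to a vector of length `MAX_STEPS * WAYPOINT_DIM` whose last three waypoint slots are all zeros.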
Actionability Prediction Module.
The actionability prediction network, similar to Where2Act [15], is implemented as a simple MLP that takes as input a concatenation of the per-point feature, the contact point feature, and the task feature, and predicts a per-point actionability score. Aggregating over all contact points, one can obtain an actionability map over the input partial scan, from which one can sample an interaction point at test time according to a normalized actionability distribution.
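Sampling an interaction point from the normalized actionability map amounts to categorical sampling over the points. The sketch below assumes the map is available as a plain list of non-negative per-point scores; the uniform fallback for an all-zero map is an added assumption.

```python
import random


def normalize_scores(scores):
    """Turn non-negative per-point scores into a probability distribution."""
    total = sum(scores)
    if total <= 0:
        # Assumed fallback: uniform distribution when no point is actionable.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def sample_contact_index(scores, rng):
    """Sample the index of a contact point proportionally to its actionability."""
    probs = normalize_scores(scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Points with higher predicted actionability are chosen more often, while points with zero score are never selected unless the whole map is zero.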
Trajectory Proposal Module.
The trajectory proposal module is implemented as a conditional variational autoencoder (cVAE), composed of a trajectory encoder that maps the input trajectory into Gaussian noise and a trajectory decoder that reconstructs the trajectory input from the noise vector. Both networks take the per-point, contact point, and task features as additional conditioning inputs. We use MLPs to realize the two networks. We regularize the resultant noise vectors to be close to a standard Gaussian distribution so that one can sample diverse trajectory proposals by feeding random Gaussian noise to the decoder together with the conditional features.
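The regularizer toward a standard Gaussian is commonly realized with the closed-form KL divergence of a diagonal Gaussian. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension, which is the usual VAE parameterization but is not stated explicitly in the text; the latent dimension is arbitrary.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def sample_latent(dim, rng):
    """Draw a latent noise vector from N(0, I) for proposing a trajectory."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

The divergence is zero exactly when the encoder outputs zero mean and unit variance, which is what allows decoding from `sample_latent(...)` at test time.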
Trajectory Scoring Module.
The trajectory scoring module, implemented as another MLP, takes as input the per-point, contact point, and task features, as well as the trajectory feature, and learns to predict the success likelihood. One can use a success threshold of 0.5 to obtain a binary decision.
Data Collection for Training.
We collect interaction data from the RL exploration to supervise the training of the perception system. We randomly pick shapes, tasks, and starting part poses in the same way as in the RL task initialization. For positive data, we sample 5000 successful interaction trajectories output by the RL policy. We sample the same amount of negative data, produced by offsetting the desired task of a successful trajectory by a random value drawn from a range specific to revolute or prismatic parts. For the pulling experiments, we consider another type of negative data in which the first grasping attempt fails.
Implementation and Training.
We find it beneficial to first train the trajectory scoring module and then jointly train all three modules. We use the standard binary cross-entropy loss to train the trajectory scoring module. To train the trajectory proposal cVAE (encoder and decoder), besides the KL divergence loss for regularizing the Gaussian bottleneck noise, we use a regression loss on the trajectory waypoint positions and a 6D-rotation loss for the waypoint orientations. For training the actionability prediction module, we sample 100 random trajectories proposed by the trajectory decoder, estimate their success likelihood scores using the trajectory scoring module, and regress the prediction to the mean score of the top-5 rated trajectories. See supplementary for more details.
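The regression target for the actionability module can be written in a few lines. The sketch assumes the 100 success likelihood scores for one contact point are already available as a list; `actionability_target` is a hypothetical name.

```python
def actionability_target(scores, top_k=5):
    """Mean success likelihood of the top_k highest-rated trajectory proposals."""
    if not scores:
        raise ValueError("need at least one score")
    top = sorted(scores, reverse=True)[:top_k]
    return sum(top) / len(top)
```

Using only the best few proposals makes the target reflect whether some good trajectory exists at the point, rather than the average quality of all sampled proposals.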
4.3 Curiosity-driven Exploration
We build a bidirectional supervisory mechanism between the RL policy and the perception system. While the RL policy collects data to supervise the perception networks, we also add curiosity feedback [19] from the perception networks that in turn affects the RL policy learning, pushing it to explore more diverse and novel interaction trajectories, which will eventually diversify the per-point trajectory distributions produced by the trajectory proposal decoder. The intuitive idea is to encourage the RL network to explore novel trajectories to which the perception system currently gives low success scores. In our implementation, during the joint training of the RL and perception networks, we generate an additional intrinsic curiosity reward for each trajectory proposed by the RL policy, which is larger when the predicted success score is lower, and use the novel interaction data to continue supervising the perception system. We make sure that the buffer contains training trajectory proposals generated by the RL network at different epochs to avoid mode chasing when training the generative model.
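One simple way to realize such a curiosity term is to penalize trajectories in proportion to the success score that the perception system already assigns them. The linear form and the weight below are assumptions for illustration and are not the exact reward used in our implementation.

```python
def curiosity_reward(predicted_success, weight=1.0):
    """Intrinsic reward that favors trajectories the scorer currently rates low.

    predicted_success: success likelihood in [0, 1] from the trajectory scorer.
    weight:            hypothetical scaling factor.
    """
    if not 0.0 <= predicted_success <= 1.0:
        raise ValueError("predicted_success must lie in [0, 1]")
    return -weight * predicted_success


def total_reward(extrinsic, predicted_success, weight=1.0):
    """Combine the extrinsic task reward with the curiosity term."""
    return extrinsic + curiosity_reward(predicted_success, weight)
```

Under this form, two equally successful trajectories receive different total rewards if the perception system is already confident about one of them, which nudges the policy toward the less familiar one.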
5 Experiments
We perform large-scale learning and evaluation using the SAPIEN physics simulator [20] and the PartNet-Mobility dataset [21, 22]. We evaluate the prediction quality of the learned visual priors and compare them to several baselines for downstream manipulation tasks. Qualitative results over novel shapes and real-world data show promising generalization capability of our approach.
Data and Settings.
In total, we use 590 shapes from 7 categories in the PartNet-Mobility dataset. We conduct experiments over two commonly seen part articulation types: doors and drawers. For each experiment, we randomly split the applicable object categories into training and test categories. We further split the data from the training categories into training and test shapes. We train all methods over training shapes from the training categories and report performance over test shapes from the training categories and over shapes from the test categories, in order to test the generalization capabilities over novel shapes and unseen categories. We use the Panda flying gripper as the robot actuator and employ a velocity-based PID controller to realize the actuation torques between consecutive trajectory waypoints. We use an RGB-D camera and randomly sample viewpoints in front of the objects. See supplementary for more data statistics and implementation details.
5.1 Actionable Visual Priors
We present qualitative and quantitative evaluations of our learned actionable visual priors.
Table 1: Quantitative evaluation of the learned actionable visual priors.

| Part | Interaction | Accuracy (%) | Precision (%) | Recall (%) | F-score (%) | Coverage (%) |
|---|---|---|---|---|---|---|
| door | pushing | 80.28 / 76.60 | 76.73 / 75.48 | 86.83 / 78.69 | 81.47 / 77.05 | 82.63 / 78.88 |
| door | pulling | 75.82 / 76.89 | 69.76 / 74.62 | 88.52 / 81.51 | 78.02 / 77.91 | 55.26 / 50.76 |
| drawer | pushing | 84.36 / 84.48 | 77.78 / 79.29 | 96.08 / 93.28 | 85.96 / 85.71 | 95.49 / 95.46 |
| drawer | pulling | 87.90 / 90.48 | 88.60 / 90.25 | 87.11 / 90.76 | 87.85 / 90.50 | 93.05 / 91.62 |
To quantitatively evaluate the trajectory scoring module, we use standard metrics for binary classification problems: accuracy, precision, recall, and F-score. Since there is no ground-truth annotation of successful and failed interaction trajectories, we run our learned RL policy over test shapes and collect test interaction trajectories. In our implementation, we gather 350 positive and 350 negative trajectories for each experiment. To evaluate whether the learned trajectory proposal module can propose diverse trajectories that cover the collected ground-truth ones, we compute a coverage score measuring the percentage of ground-truth trajectories that are similar enough to the closest predicted trajectory. See supplementary for detailed metric definitions.
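The metrics above can be computed as in the following sketch. The classification metrics are standard; for coverage, the Euclidean distance between flattened trajectories and the `max_dist` threshold are assumptions standing in for the similarity criterion defined in the supplementary.

```python
import math


def classification_metrics(labels, scores, threshold=0.5):
    """Return (accuracy, precision, recall, f_score) for thresholded scores."""
    preds = [s > threshold for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    tn = sum(1 for p, l in zip(preds, labels) if not p and not l)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_score


def coverage(ground_truth, proposals, max_dist):
    """Fraction of ground-truth trajectories with a proposal within max_dist.

    Trajectories are flattened vectors of equal length; Euclidean distance
    is an assumed similarity measure.
    """
    covered = sum(
        1 for gt in ground_truth
        if min(math.dist(gt, p) for p in proposals) <= max_dist
    )
    return covered / len(ground_truth)
```

Multiplying the returned fractions by 100 gives percentages in the same format as the table.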
Results and Analysis.
Table 1 presents quantitative results which demonstrate that we successfully learn the desired representations and that our learned model also generalizes to shapes from totally unseen object categories. Since we are the first to propose such actionable visual priors, there is no baseline to compare against. Fig. 3 presents qualitative results of the actionability prediction and trajectory proposal modules, from which we can observe that, to accomplish the desired tasks: 1) the actionability heatmap highlights where to interact (e.g., to pull open a drawer, the gripper can either grasp the handle and pull it open from outside, or push outwards from inside), and 2) the trajectory proposals suggest diverse solutions for how to interact (e.g., to push a door closed, various trajectories may succeed). In Fig. 4, we additionally illustrate the trajectory scoring predictions using heatmap visualization, where we observe interesting learned patterns indicating which points to interact with when executing the input trajectories. In Fig. 5 (left), we further visualize the trajectory scoring predictions over the same input shape and observe different visual patterns given different trajectories and tasks.
5.2 Downstream Manipulation
We can easily use our learned actionable visual priors to accomplish the downstream manipulation of 3D articulated objects. To this end, we first sample a contact point according to the estimated actionability heatmap and then execute the top-rated trajectory among 100 random trajectory proposals.
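The test-time procedure just described can be sketched as a short selection loop. The `propose` and `score` callables stand in for the trajectory proposal decoder and the trajectory scoring module; their signatures here are hypothetical.

```python
import random


def plan_interaction(actionability, propose, score, rng, num_proposals=100):
    """Pick a contact point and the top-rated trajectory proposal for it.

    actionability: list of non-negative per-point actionability scores
    propose(idx, rng) -> trajectory   (stand-in for the proposal decoder)
    score(idx, trajectory) -> float   (stand-in for the scoring module)
    """
    # Sample a contact point proportionally to the actionability heatmap.
    idx = rng.choices(range(len(actionability)), weights=actionability, k=1)[0]
    # Draw random trajectory proposals for that point.
    candidates = [propose(idx, rng) for _ in range(num_proposals)]
    # Execute the one with the highest predicted success likelihood.
    best = max(candidates, key=lambda trajectory: score(idx, trajectory))
    return idx, best
```

In practice the selected trajectory would then be passed to the controller for execution; that step is outside the scope of this sketch.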
Table 2: Success rates of downstream manipulation.

| Method | pushing door | pulling door | pushing drawer | pulling drawer |
|---|---|---|---|---|
| RL-baseline | 1.43 / 3.72 | 0.0 / 0.0 | 0.43 / 0.43 | 0.0 / 0.0 |
| Heuristic | 15.00 / 11.48 | 1.14 / 0.0 | 13.45 / 12.54 | 21.04 / 18.00 |
| Ours | 17.65 / 21.65 | 7.76 / 6.02 | 24.37 / 18.77 | 23.88 / 14.57 |
We compare to two baselines: 1) a naive TD3 RL policy that takes the shape point cloud together with the desired task as input and directly outputs trajectory waypoints for accomplishing the task, and 2) a heuristic approach, in which we hand-engineer a set of rules for different tasks (e.g., to pull open a drawer, we grasp the handle and pull straight backward). Note that we use ground-truth handle masks and joint parameters for the heuristic baseline. See supplementary for more details.
We run interaction trials in simulation and report success rates for quantitative evaluation.
Results and Analysis.
Table 2 presents the quantitative comparisons. Our method outperforms the baselines on most comparisons. The naive RL baseline largely fails, since we find it extremely difficult to train an end-to-end RL policy from scratch over highly diverse shapes for all tasks, which in fact implies the necessity of certain intermediate visual abstractions. Against the heuristic baseline, our method wins every comparison except one in the drawer-pulling task. Knowing the ground-truth handle position and prismatic joint axis, it is quite easy for the rule-based heuristic to pull a drawer along a straight-line trajectory, whereas our method does not use the ground-truth information and achieves slightly worse results there since it has to predict such information. However, our method achieves better performance in the other three experiments, since we find that the heuristic method fails severely when 1) there are no handle parts for pulling, 2) some grasps over intricate handle geometry slip and fail, or 3) some locally subtle geometric patterns affect the intended pushing interaction. We provide more results and analysis in the supplementary material. Furthermore, our method provides a unified and automatic solution to various tasks instead of requiring hand-designed rules.
5.3 Real-world Experiments
Fig. 5 (middle) presents qualitative results from directly testing our learned perception model on real-world data: a microwave model from the RBO dataset, one cabinet from Google Scanned Objects, and one real-world 3D cabinet scan that we captured using a ZED MINI RGB-D camera. We observe that our model trained on synthetic textureless data can generalize to real-world depth scans to some degree. We also show a real-robot experiment in Fig. 5 (right) and in the supplementary video.
6 Conclusion
In this paper, we propose a novel perception-interaction handshaking point, object-centric actionable visual priors, for manipulating 3D articulated objects, which contains dense action affordance predictions and diverse visual trajectory proposals. We formulate a novel interaction-for-perception framework VAT-Mart to learn such representations. Experiments conducted on the large-scale PartNet-Mobility dataset and real-world data demonstrate the effectiveness of our approach.
Limitations and Future Works.
First, our work is only a first attempt at learning such representations and future works can further improve the performance. Besides, the current open-loop trajectory prediction is based on a single-frame input. One may obtain better results considering multiple frames during an interaction. Lastly, future works may study more interaction and articulation types.
We would like to thank Yourong Zhang for setting up the ROS environment and helping with the real-robot experiments.
- Abbatematteo et al.  B. Abbatematteo, S. Tellex, and G. Konidaris. Learning to generalize kinematic models to novel objects. In Proceedings of the 3rd Conference on Robot Learning, 2019.
- Staszak et al.  R. Staszak, M. Molska, K. Młodzikowski, J. Ataman, and D. Belter. Kinematic structures estimation on the rgb-d images. In 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), volume 1, pages 675–681. IEEE, 2020.
Li et al. 
X. Li, H. Wang, L. Yi, L. J. Guibas, A. L. Abbott, and S. Song.
Category-level articulated object pose estimation.In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3706–3715, 2020.
- Jain et al.  A. Jain, R. Lioutikov, C. Chuck, and S. Niekum. Screwnet: Category-independent articulation model estimation from depth images using screw theory. arXiv preprint arXiv:2008.10518, 2020.
- Liu et al.  Q. Liu, W. Qiu, W. Wang, G. D. Hager, and A. L. Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020.
- Wang et al.  X. Wang, B. Zhou, Y. Shi, X. Chen, Q. Zhao, and K. Xu. Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019.
- Yan et al.  Z. Yan, R. Hu, X. Yan, L. Chen, O. van Kaick, H. Zhang, and H. Huang. RPM-NET: Recurrent prediction of motion and parts from point cloud. ACM Trans. on Graphics, 38(6):Article 240, 2019.
- Klingbeil et al.  E. Klingbeil, A. Saxena, and A. Y. Ng. Learning to open new doors. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2751–2757. IEEE, 2010.
- Arduengo et al.  M. Arduengo, C. Torras, and L. Sentis. Robust and adaptive door operation with a mobile robot. arXiv e-prints, pages arXiv–1902, 2019.
- Florence et al.  P. Florence, L. Manuelli, and R. Tedrake. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 2019. ISSN 2377-3766.
- Urakami et al.  Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel. Doorgym: A scalable door opening environment and baseline agent. Deep RL workshop at NeurIPS 2019, 2019.
- Mittal et al.  M. Mittal, D. Hoeller, F. Farshidian, M. Hutter, and A. Garg. Articulated object interaction in unknown scenes with whole-body mobile manipulation. arXiv preprint arXiv:2103.10534, 2021.
Do et al. 
T.-T. Do, A. Nguyen, and I. Reid.
Affordancenet: An end-to-end deep learning approach for object affordance detection.In 2018 IEEE international conference on robotics and automation (ICRA), pages 5882–5889. IEEE, 2018.
- Nagarajan et al.  T. Nagarajan, C. Feichtenhofer, and K. Grauman. Grounded human-object interaction hotspots from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8688–8697, 2019.
- Mo et al.  K. Mo, L. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani. Where2act: From pixels to actions for articulated 3d objects. arXiv preprint arXiv:2101.02692, 2021.
- Ramakrishnan et al.  S. K. Ramakrishnan, D. Jayaraman, and K. Grauman. An exploration of embodied visual exploration. International Journal of Computer Vision, pages 1–34, 2021.
- Wang et al.  A. Wang, T. Kurutach, K. Liu, P. Abbeel, and A. Tamar. Learning robotic manipulation through visual planning and acting. In Robotics: science and systems, 2019.
- Karamcheti et al.  S. Karamcheti, A. J. Zhai, D. P. Losey, and D. Sadigh. Learning visually guided latent actions for assistive teleoperation. In 3rd Annual Learning for Dynamics & Control Conference (L4DC), 6 2021.
- Pathak et al.  D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787. PMLR, 2017.
- Xiang et al.  F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
- Chang et al.  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- Mo et al.  K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
- Yan and Pollefeys  J. Yan and M. Pollefeys. Automatic kinematic chain building from feature trajectories of articulated objects. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 712–719. IEEE, 2006.
- Katz et al.  D. Katz, Y. Pyuro, and O. Brock. Learning to manipulate articulated objects in unstructured environments using a grounded relational representation. In Robotics: Science and Systems. Citeseer, 2008.
- Sturm et al.  J. Sturm, V. Pradeep, C. Stachniss, C. Plagemann, K. Konolige, and W. Burgard. Learning kinematic models for articulated objects. In IJCAI, pages 1851–1856, 2009.
- Sturm et al.  J. Sturm, C. Stachniss, and W. Burgard. A probabilistic framework for learning kinematic models of articulated objects. Journal of Artificial Intelligence Research, 41:477–526, 2011.
- Huang et al.  X. Huang, I. Walker, and S. Birchfield. Occlusion-aware reconstruction and manipulation of 3d articulated objects. In 2012 IEEE International Conference on Robotics and Automation, pages 1365–1371. IEEE, 2012.
- Katz et al.  D. Katz, M. Kazemi, J. A. Bagnell, and A. Stentz. Interactive segmentation, tracking, and kinematic modeling of unknown 3d articulated objects. In 2013 IEEE International Conference on Robotics and Automation, pages 5003–5010. IEEE, 2013.
- Martin and Brock  R. M. Martin and O. Brock. Online interactive perception of articulated objects with multi-level recursive estimation based on task-specific priors. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2494–2501. IEEE, 2014.
- Höfer et al.  S. Höfer, T. Lang, and O. Brock. Extracting kinematic background knowledge from interactions using task-sensitive relational learning. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 4342–4347. IEEE, 2014.
- Katz et al.  D. Katz, A. Orthey, and O. Brock. Interactive perception of articulated objects. In Experimental Robotics, pages 301–315. Springer, 2014.
- Schmidt et al.  T. Schmidt, R. A. Newcombe, and D. Fox. Dart: Dense articulated real-time tracking. In Robotics: Science and Systems, volume 2. Berkeley, CA, 2014.
- Hausman et al.  K. Hausman, S. Niekum, S. Osentoski, and G. S. Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312. IEEE, 2015.
- Martín-Martín et al.  R. Martín-Martín, S. Höfer, and O. Brock. An integrated approach to visual perception of articulated objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5091–5097. IEEE, 2016.
- Tzionas and Gall  D. Tzionas and J. Gall. Reconstructing articulated rigged models from rgb-d videos. In European Conference on Computer Vision, pages 620–633. Springer, 2016.
- Paolillo et al.  A. Paolillo, A. Bolotnikova, K. Chappellet, and A. Kheddar. Visual estimation of articulated objects configuration during manipulation with a humanoid. In 2017 IEEE/SICE International Symposium on System Integration (SII), pages 330–335. IEEE, 2017.
- Martín-Martín and Brock  R. Martín-Martín and O. Brock. Building kinematic and dynamic models of articulated objects with multi-modal interactive perception. In AAAI Symposium on Interactive Multi-Sensory Object Perception for Embodied Agents, AAAI, Ed, 2017.
- Paolillo et al.  A. Paolillo, K. Chappellet, A. Bolotnikova, and A. Kheddar. Interlinked visual tracking and robotic manipulation of articulated objects. IEEE Robotics and Automation Letters, 3(4):2746–2753, 2018.
- Martín-Martín and Brock  R. Martín-Martín and O. Brock. Coupled recursive estimation for online interactive perception of articulated objects. The International Journal of Robotics Research, page 0278364919848850, 2019.
- Desingh et al.  K. Desingh, S. Lu, A. Opipari, and O. C. Jenkins. Factored pose estimation of articulated objects using efficient nonparametric belief propagation. In 2019 International Conference on Robotics and Automation (ICRA), pages 7221–7227. IEEE, 2019.
- Nunes and Demiris  U. M. Nunes and Y. Demiris. Online unsupervised learning of the 3d kinematic structure of arbitrary rigid bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3809–3817, 2019.
- Peterson et al.  L. Peterson, D. Austin, and D. Kragic. High-level control of a mobile manipulator for door opening. In Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000)(Cat. No. 00CH37113), volume 3, pages 2333–2338. IEEE, 2000.
- Jain and Kemp  A. Jain and C. C. Kemp. Pulling open novel doors and drawers with equilibrium point control. In 2009 9th IEEE-RAS International Conference on Humanoid Robots, pages 498–505. IEEE, 2009.
- Chitta et al.  S. Chitta, B. Cohen, and M. Likhachev. Planning for autonomous door opening with a mobile manipulator. In 2010 IEEE International Conference on Robotics and Automation, pages 1799–1806. IEEE, 2010.
- Burget et al.  F. Burget, A. Hornung, and M. Bennewitz. Whole-body motion planning for manipulation of articulated objects. In 2013 IEEE International Conference on Robotics and Automation, pages 1656–1662. IEEE, 2013.
- Zeng et al.  V. Zeng, T. E. Lee, J. Liang, and O. Kroemer. Visual identification of articulated object parts. arXiv preprint arXiv:2012.00284, 2020.
- Mu et al.  J. Mu, W. Qiu, A. Kortylewski, A. Yuille, N. Vasconcelos, and X. Wang. A-sdf: Learning disentangled signed distance functions for articulated shape representation. arXiv preprint arXiv: 2104.07645, 2021.
- Russakovsky et al.  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- Wu et al.  Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- Lin et al.  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- Hinterstoisser et al.  S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, pages 858–865. IEEE, 2011.
- Xiang et al.  Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European conference on computer vision, pages 160–176. Springer, 2016.
- Montesano and Lopes  L. Montesano and M. Lopes. Learning grasping affordances from local visual descriptors. In 2009 IEEE 8th international conference on development and learning, pages 1–6. IEEE, 2009.
- Lenz et al.  I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- Mahler et al.  J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. 2017.
- Fang et al.  K. Fang, Y. Zhu, A. Garg, A. Kurenkov, V. Mehta, L. Fei-Fei, and S. Savarese. Learning task-oriented grasping for tool manipulation from simulated self-supervision. The International Journal of Robotics Research, 39(2-3):202–216, 2020.
- Mandikal and Grauman  P. Mandikal and K. Grauman. Learning dexterous grasping with object-centric visual affordances. In IEEE International Conference on Robotics and Automation (ICRA), 2021.
- Corona et al.  E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez. Ganhand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5031–5041, 2020.
- Kokic et al.  M. Kokic, D. Kragic, and J. Bohg. Learning task-oriented grasping from human activity datasets. IEEE Robotics and Automation Letters, 5(2):3352–3359, 2020.
- Yang et al.  L. Yang, X. Zhan, K. Li, W. Xu, J. Li, and C. Lu. Cpf: Learning a contact potential field to model the hand-object interaction. arXiv preprint arXiv:2012.00924, 2020.
- Jiang et al.  Z. Jiang, Y. Zhu, M. Svetlik, K. Fang, and Y. Zhu. Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542, 2021.
- Kjellström et al.  H. Kjellström, J. Romero, and D. Kragić. Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115(1):81–90, 2011.
- Fang et al.  K. Fang, T.-L. Wu, D. Yang, S. Savarese, and J. J. Lim. Demo2vec: Reasoning object affordances from online videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2139–2147, 2018.
- Goff et al.  L. K. L. Goff, O. Yaakoubi, A. Coninx, and S. Doncieux. Building an affordances map with interactive perception. arXiv preprint arXiv:1903.04413, 2019.
- Nagarajan and Grauman  T. Nagarajan and K. Grauman. Learning affordance landscapes for interaction exploration in 3d environments. In NeurIPS, 2020.
- Nagarajan et al.  T. Nagarajan, Y. Li, C. Feichtenhofer, and K. Grauman. Ego-topo: Environment affordances from egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 163–172, 2020.
- Xu et al.  D. Xu, A. Mandlekar, R. Martín-Martín, Y. Zhu, S. Savarese, and L. Fei-Fei. Deep affordance foresight: Planning through what can be done in the future. IEEE International Conference on Robotics and Automation (ICRA), 2021.
- Wu et al.  J. Wu, X. Sun, A. Zeng, S. Song, J. Lee, S. Rusinkiewicz, and T. Funkhouser. Spatial action maps for mobile manipulation. In Proceedings of Robotics: Science and Systems (RSS), 2020.
- Wu et al.  J. Wu, X. Sun, A. Zeng, S. Song, S. Rusinkiewicz, and T. Funkhouser. Spatial intention maps for multi-agent mobile manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2021.
- Wang et al.  J. Wang, S. Lin, C. Hu, Y. Zhu, and L. Zhu. Learning semantic keypoint representations for door opening manipulation. IEEE Robotics and Automation Letters, 5(4):6980–6987, 2020.
- Qin et al.  Z. Qin, K. Fang, Y. Zhu, L. Fei-Fei, and S. Savarese. Keto: Learning keypoint representations for tool manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7278–7285. IEEE, 2020.
- You et al.  Y. You, L. Shao, T. Migimatsu, and J. Bohg. Omnihang: Learning to hang arbitrary objects using contact point correspondences and neural collision estimation. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
- Anderson et al.  P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- Wilkes and Tsotsos  D. Wilkes and J. Tsotsos. Active object recognition. In Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 136–141. IEEE, 1992.
- Yang et al.  J. Yang, Z. Ren, M. Xu, X. Chen, D. Crandall, D. Parikh, and D. Batra. Embodied visual recognition. arXiv preprint arXiv:1904.04404, 2019.
- Jayaraman and Grauman  D. Jayaraman and K. Grauman. End-to-end policy learning for active visual categorization. IEEE transactions on pattern analysis and machine intelligence, 41(7):1601–1614, 2018.
- Pathak et al.  D. Pathak, Y. Shentu, D. Chen, P. Agrawal, T. Darrell, S. Levine, and J. Malik. Learning instance segmentation by interaction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2042–2045, 2018.
- Gadre et al.  S. Y. Gadre, K. Ehsani, and S. Song. Act the part: Learning interaction strategies for articulated object part discovery, 2021.
- Pinto et al.  L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pages 3–18. Springer, 2016.
- Bohg et al.  J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic, S. Schaal, and G. S. Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33(6):1273–1291, 2017.
- Wu et al.  J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in neural information processing systems, 28:127–135, 2015.
- Xu et al.  Z. Xu, J. Wu, A. Zeng, J. B. Tenenbaum, and S. Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. In Robotics: Science and Systems (RSS), 2019.
- Lohmann et al.  M. Lohmann, J. Salvador, A. Kembhavi, and R. Mottaghi. Learning about objects by learning to interact with them. Advances in Neural Information Processing Systems (NeurIPS), 2020.
- James et al.  S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12627–12637, 2019.
- Chebotar et al.  Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. 2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979, 2019.
- Hundt et al.  A. Hundt, B. Killeen, N. Greene, H. Wu, H. Kwon, C. Paxton, and G. D. Hager. “good robot!”: Efficient reinforcement learning for multi-step visual tasks with sim to real transfer. IEEE Robotics and Automation Letters (RA-L), 2019.
- Liang et al.  J. Liang, S. Saxena, and O. Kroemer. Learning active task-oriented exploration policies for bridging the sim-to-real gap. In RSS 2020, 2020.
- Kadian et al.  A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra. Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters (RA-L), 2020.
- Anderson et al.  P. Anderson, A. Shrivastava, J. Truong, A. Majumdar, D. Parikh, D. Batra, and S. Lee. Sim-to-real transfer for vision-and-language navigation. Conference on Robotic Learning (CoRL), 2020.
- Rao et al.  K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari. Rl-cyclegan: Reinforcement learning aware simulation-to-real. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11157–11166, 2020.
- Zhou et al.  Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
- Fujimoto et al.  S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
- Andrychowicz et al.  M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf.
- Qi et al.  C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
- Martín-Martín et al.  R. Martín-Martín, C. Eppner, and O. Brock. The rbo dataset of articulated objects and interactions. The International Journal of Robotics Research, 38(9):1013–1019, 2019.
-  OpenRobotics. Drawer, June . URL https://fuel.ignitionrobotics.org/1.0/OpenRobotics/models/Drawer.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. The 3rd International Conference for Learning Representations, 2015.
- Quigley et al.  M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, A. Y. Ng, et al. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.
- Chitta et al.  S. Chitta, I. Sucan, and S. Cousins. Moveit! IEEE Robotics & Automation Magazine, 19(1):18–19, 2012.
Appendix A More Details for the RL Policy (Sec. 4.1)
A.1 Network Architecture Details
We leverage TD3 (Fujimoto et al.) to train the RL policy. It consists of a policy network and a Q-value network, both implemented as multi-layer perceptrons (MLPs). The policy network receives the state as input and predicts the residual gripper pose as the action. It is implemented as a 4-layer MLP followed by two separate fully connected layers that estimate the mean and variance of the action probabilities, respectively. The Q-value network receives both the state and the action as input and predicts a single Q value. It is implemented as a 4-layer MLP.
A.2 More Implementation and Training Details
We use the following hyper-parameters to train the RL policy: 2048 (buffer size); 512 (batch size); 0.0001 (learning rate of the Adam optimizer for both the policy and Q-value networks); 0.1 (initial exploration range, decayed by a factor of 0.5 every 500 epochs).
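The exploration schedule above can be illustrated with a short sketch. It assumes a step-wise schedule, in which the range is multiplied by 0.5 once every 500 epochs; the text does not specify the exact form of the decay, and the function name `exploration_range` is hypothetical.

```python
def exploration_range(epoch, initial=0.1, decay=0.5, period=500):
    """Exploration range at a given epoch.

    Assumption: a step schedule that multiplies the range by `decay`
    once every `period` epochs. The defaults mirror the values quoted
    in the text; the step-wise form itself is an illustrative choice.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial * decay ** (epoch // period)


if __name__ == "__main__":
    for epoch in (0, 499, 500, 1000, 1500):
        print(epoch, exploration_range(epoch))
```

Under this assumption the range stays at 0.1 for the first 500 epochs and is halved at each subsequent 500-epoch boundary.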
Appendix B More Details for the Perception Networks (Sec. 4.2)
B.1 Network Architecture Details
The perception networks consist of four input encoders that extract high-dimensional features from the input partial point cloud, contact point, trajectory, and task, and three parallel output modules that estimate the actionability, trajectory proposals, and scores from the features extracted by the input encoders.
Regarding the four input encoders:
- The input point cloud, with per-point coordinates, is fed through PointNet++ (Qi et al.) with the segmentation head and single-scale grouping (SSG) to extract per-point features.
- The input contact point is fed through a fully connected layer to extract a high-dimensional contact point feature.
- The input trajectory is composed of five waypoints, each of which contains 3D rotation and 3D translation information. The trajectory is flattened and fed through a 3-layer MLP to extract a high-dimensional trajectory feature.
- The input task is fed through a fully connected layer to extract a high-dimensional task feature.
Regarding the three output modules:
- The actionability prediction module concatenates the extracted point cloud, contact point, and task features into a high-dimensional vector, which is fed through a 5-layer MLP to predict a per-point actionability score.
- The trajectory proposal module is implemented as a conditional variational auto-encoder (cVAE), which learns to generate diverse trajectories from a sampled Gaussian value and the given conditions (point cloud, contact point, and task features). The encoder is implemented as a 3-layer MLP that takes the concatenation of the trajectory feature and the given conditions as input and estimates a Gaussian distribution. The decoder is also implemented as a 3-layer MLP that takes the concatenation of the sampled Gaussian value and the given conditions as input to recover the trajectory.
- The trajectory scoring module concatenates the extracted point cloud, contact point, trajectory, and task features into a high-dimensional vector, which is fed through a 2-layer MLP to predict the success likelihood.
B.2 More Implementation and Training Details
The interaction data collected by the RL policy provide the ground-truth labels for training the perception networks. We start by training the trajectory scoring module, followed by joint training of all three modules.
To train the trajectory scoring module, we first collect 5000 successful and 5000 unsuccessful interaction trajectories produced by the RL policy, then supervise the estimated success likelihood of each trajectory with a binary cross-entropy loss.
To train the trajectory proposal module, we supervise the output rotation with the 6D-rotation loss (Zhou et al.) and the output translation with a regression loss, using the 5000 successful interaction trajectories collected above. Following the common cVAE training process, we add a KL divergence loss on the estimated Gaussian distribution for regularization.
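For readers unfamiliar with the 6D rotation representation of Zhou et al., the following sketch shows how a 6D vector is mapped to a rotation matrix by Gram-Schmidt orthogonalization. It illustrates the representation only; the exact form of the loss used for training is not specified in the text, and the function name `rotation_from_6d` is hypothetical.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(r6):
    """Map a 6D vector (two stacked 3D vectors) to a 3x3 rotation matrix.

    The first vector is normalized, the second is orthogonalized against
    it and normalized, and the third axis is their cross product. The
    three axes form the columns of the returned matrix.
    """
    if len(r6) != 6:
        raise ValueError("expected six values")
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    b3 = _cross(b1, b2)
    # Row i of the matrix collects component i of each column axis.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


if __name__ == "__main__":
    print(rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0]))
```

Because any two non-parallel 3D vectors yield a valid rotation under this mapping, a network can regress the six values directly without additional constraints.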
To train the actionability prediction module, we score 100 sampled trajectories using the trajectory proposal and scoring modules, then take the mean of the top-5 scores as the ground-truth label for supervising the actionability predictions. The predictions are supervised with a regression loss.
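The label computation described above is a simple top-k average. A minimal sketch follows, assuming the scores of the sampled trajectories are already available as a list of floats; the function name `actionability_label` is hypothetical.

```python
def actionability_label(scores, k=5):
    """Mean of the k highest scores among the sampled trajectories.

    `scores` holds the success likelihoods predicted by the scoring
    module for trajectories sampled at one contact point (100 in the
    text); k = 5 follows the text.
    """
    if len(scores) < k:
        raise ValueError("need at least k scores")
    top_k = sorted(scores, reverse=True)[:k]
    return sum(top_k) / k


if __name__ == "__main__":
    print(actionability_label([0.1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.2]))
```

Averaging only the best few scores means a contact point counts as actionable when at least a handful of good trajectories exist there, even if most sampled trajectories fail.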
We use the following hyper-parameters to train the perception networks: 32 (batch size); 0.001 (learning rate of the Adam optimizer for all three modules).
Appendix C More Details for the Curiosity-driven Exploration (Sec. 4.3)
The curiosity-driven exploration alternates between training the RL policy and the perception networks at odd and even epochs, respectively. It aims to explore, with the RL policy, more diverse trajectories that the perception networks are not yet confident about; these trajectories are then used to diversify the trajectory proposals generated by the perception networks.
When training the RL policy, we reward the policy with an intrinsic curiosity reward in addition to the extrinsic task rewards. The curiosity reward is the weighted negative success likelihood estimated by the trajectory scoring module. It penalizes trajectories that the trajectory scoring module is already confident about and thereby encourages the RL policy to generate more diverse trajectories.
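The curiosity reward can be sketched as follows. The weight value is not given in the text, so `weight` is a hypothetical parameter, and summing the extrinsic and intrinsic terms is an assumption about how the two rewards are combined.

```python
def curiosity_reward(success_likelihood, weight=1.0):
    """Intrinsic reward: the weighted negative success likelihood.

    `success_likelihood` is the scoring module's estimate in [0, 1].
    `weight` is a hypothetical hyper-parameter; its value is not
    stated in the text.
    """
    if not 0.0 <= success_likelihood <= 1.0:
        raise ValueError("success likelihood must lie in [0, 1]")
    return -weight * success_likelihood


def total_reward(extrinsic, success_likelihood, weight=1.0):
    """Assumption: extrinsic and intrinsic rewards are simply added."""
    return extrinsic + curiosity_reward(success_likelihood, weight)


if __name__ == "__main__":
    # A trajectory the scoring module already rates highly earns less.
    print(total_reward(1.0, 0.9), total_reward(1.0, 0.1))
```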
When training the perception networks, we collect equal numbers of successful and unsuccessful trajectories produced by the RL policy. To encourage the perception networks to learn more diverse trajectories, we draw the successful trajectories half from those with high (>0.5) and half from those with low (<0.5) success likelihood, as estimated by the trajectory scoring module. We then use these data to train the perception networks.
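A minimal sketch of this balanced sampling is given below. The data layout (successful trajectories paired with their estimated likelihoods) and the function name `sample_training_batch` are assumptions made for illustration.

```python
import random


def sample_training_batch(successes, failures, n_per_class,
                          rng=None, threshold=0.5):
    """Draw equally many successful and unsuccessful trajectories.

    `successes` is a list of (trajectory, estimated_likelihood) pairs,
    `failures` is a list of trajectories. Half of the successful
    samples come from the high-likelihood pool and half from the
    low-likelihood pool, as far as the pools allow.
    """
    rng = rng or random.Random()
    high = [t for t, s in successes if s > threshold]
    low = [t for t, s in successes if s < threshold]
    half = n_per_class // 2
    n_high = min(half, len(high))
    n_low = min(n_per_class - half, len(low))
    positives = rng.sample(high, n_high) + rng.sample(low, n_low)
    negatives = rng.sample(failures, min(len(positives), len(failures)))
    return positives, negatives


if __name__ == "__main__":
    succ = [("h%d" % i, 0.9) for i in range(10)]
    succ += [("l%d" % i, 0.1) for i in range(10)]
    fail = ["f%d" % i for i in range(20)]
    print(sample_training_batch(succ, fail, 6, rng=random.Random(0)))
```

Including successful trajectories that the scoring module currently rates low exposes the perception networks to exactly the cases where their predictions are wrong.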
Appendix D More Detailed Data Statistics
Table D.3 summarizes the data statistics and splits.
Fig. D.6 presents a visualization of some example shapes from our dataset.
Appendix E SAPIEN Simulation Environment Settings
We follow the virtual environment settings of prior work exactly, except that we randomly sample camera viewpoints only in front of the target shape, since most articulated parts are located on the front side of the PartNet-Mobility shapes and robot agents are very unlikely to need to manipulate object parts from the back. By default, the camera looks at the center of the target shape, and its location is sampled on a unit sphere centered at the origin. For the spherical sampling of the camera location, we set the range of the azimuth angle to [-90°, 90°] and the range of the elevation angle to [30°, 60°]. Please refer to Section A and Section B in the appendix of that work for more details on the virtual environment.
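The camera sampling can be sketched as below. The sketch assumes angles in degrees, a coordinate frame in which +x points out of the front of the shape and +z points up, and uniform sampling of the two angles; none of these conventions is stated in the text, and the function name `sample_camera_position` is hypothetical.

```python
import math
import random


def sample_camera_position(rng=None,
                           azimuth_range=(-90.0, 90.0),
                           elevation_range=(30.0, 60.0)):
    """Sample a camera location on the unit sphere around the origin.

    Assumptions: angles are in degrees, +x is the front of the shape,
    +z is up, and both angles are drawn uniformly from their ranges.
    The camera is meant to look at the origin from this location.
    """
    rng = rng or random.Random()
    azimuth = math.radians(rng.uniform(*azimuth_range))
    elevation = math.radians(rng.uniform(*elevation_range))
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)
    return (x, y, z)


if __name__ == "__main__":
    print(sample_camera_position(random.Random(0)))
```

With these ranges every sampled location lies on the front hemisphere and above the horizontal plane through the shape center.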
Appendix F Computational Costs and Timing
Training the proposed system took around 3 days per experiment on a single Nvidia V100 GPU. More specifically, the initial RL policy training took around 18 hours, followed by 15 hours of training the perception module on the RL-collected data. The subsequent curiosity-driven exploration stage took around 10 hours to jointly train the RL policy and the perception networks. Finally, the three prediction branches in the perception module were fine-tuned for around 20 hours.
At inference time, a forward pass takes around 15 ms, 8 ms, and 8 ms for the actionability map prediction, diverse trajectory proposals, and success rate predictions, respectively. At test time, our model consumes only 1,063 MB of GPU memory.
Appendix G More Metric Definition Details
G.1 Evaluating Actionable Visual Priors (Sec. 5.1)
We adopt the following evaluation metrics for quantitative evaluation: precision, recall, and F-score for the positive data, and the average accuracy, which equally balances the negative and positive recalls. Formally, let $FP$, $FN$, $TP$, and $TN$ denote the data counts for false positives, false negatives, true positives, and true negatives, respectively. The F-score is the harmonic mean of precision and recall, and the other metrics are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Average accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right).$$
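As an illustration, the four metrics can be computed from the confusion counts as in the sketch below. These are the standard definitions implied by the descriptions above; the function name `classification_metrics` and the convention of returning 0 for an undefined ratio are illustrative choices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F-score, and balanced average accuracy.

    Ratios with a zero denominator are reported as 0.0, which is a
    convention chosen for this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    negative_recall = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_score = 2.0 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "average_accuracy": 0.5 * (recall + negative_recall),
    }


if __name__ == "__main__":
    print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```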
We also compute a coverage score, defined as the percentage of ground-truth trajectories that are matched closely enough by some predicted trajectory. To this end, we need a distance between two trajectories that accounts for both the position difference and the orientation difference at every waypoint. Concretely, the position distance is the L1 distance between the waypoint positions, and the orientation distance is the 6D-rotation distance between the waypoint orientations. The overall trajectory distance combines the two terms with weights that balance their scales. We consider a ground-truth trajectory covered if its distance to a predicted trajectory is below a threshold (10 in all our experiments), and report the percentage of ground-truth trajectories covered by the predictions. To compensate for stochastic error, all reported quantitative results are averaged over 10 test runs.
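The coverage computation can be sketched independently of the exact trajectory distance. In the sketch below, the distance is passed in as a function; `l1_position_distance` is a position-only stand-in used for the demonstration and is not the combined position and orientation distance described in the text. Both function names are hypothetical.

```python
def coverage(gt_trajectories, predicted_trajectories, distance, threshold):
    """Percentage of ground-truth trajectories matched by a prediction.

    A ground-truth trajectory is covered if at least one predicted
    trajectory lies at a distance below `threshold`.
    """
    if not gt_trajectories:
        return 0.0
    covered = sum(
        1
        for gt in gt_trajectories
        if any(distance(gt, pred) < threshold
               for pred in predicted_trajectories)
    )
    return 100.0 * covered / len(gt_trajectories)


def l1_position_distance(traj_a, traj_b):
    """Stand-in distance: summed L1 distance between waypoint positions.

    Each trajectory is a list of (x, y, z) waypoints of equal length.
    Orientation terms are omitted in this illustration.
    """
    return sum(
        abs(a - b)
        for wa, wb in zip(traj_a, traj_b)
        for a, b in zip(wa, wb)
    )


if __name__ == "__main__":
    gt = [[(0.0, 0.0, 0.0)], [(5.0, 5.0, 5.0)]]
    pred = [[(0.1, 0.0, 0.0)]]
    print(coverage(gt, pred, l1_position_distance, threshold=1.0))
```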
G.2 Evaluating Downstream Manipulation (Sec. 5.2)
For each downstream manipulation task, we locate the contact point sample with the highest actionability score over the surface on each test shape, and then generate 100 diverse trajectories at this contact point, and pick the trajectory with the highest success rate for executing the downstream task. This results in 350 trajectories for each of the door pushing, door pulling, drawer pushing, and drawer pulling tasks.
In practice, generating trajectories that perfectly fulfill the given task specification (e.g., push to open a door by 30 degrees) is very hard. Hence, we consider a trajectory successful if the specification is fulfilled within a relative tolerance threshold. For example, if the task is to push a door open by 10 degrees, a trajectory that opens the door by between 8.5 and 11.5 degrees is counted as a success. We then report the success rate, in percent, over all generated trajectories for each task.
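The success criterion can be sketched as a relative tolerance check. The default tolerance of 15% is inferred from the example above (a target of 10 with an accepted range of 8.5 to 11.5) and should be read as an assumption; the function names are hypothetical.

```python
def is_task_success(achieved, target, tolerance=0.15):
    """True if the achieved motion is within a relative tolerance.

    The default of 0.15 is inferred from the worked example in the
    text, not from an explicitly stated value.
    """
    return abs(achieved - target) <= tolerance * abs(target)


def success_rate(achieved_values, target, tolerance=0.15):
    """Success rate in percent over a list of achieved motions."""
    if not achieved_values:
        return 0.0
    hits = sum(1 for a in achieved_values
               if is_task_success(a, target, tolerance))
    return 100.0 * hits / len(achieved_values)


if __name__ == "__main__":
    print(success_rate([9.0, 10.5, 12.0, 7.0], target=10.0))
```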
Appendix H More Details of Baselines in Sec. 5.2
In the following, we present more details about the baseline methods that we compare against in Section 5.2.
H.1 Naive RL Baseline
The naive RL baseline comprises two sub-RL modules, denoted sub-RL1 and sub-RL2. The sub-RL1 module takes as input the initial state of the target shape (i.e., the partial scan of the initial shape and the mask of the target part) and the task specification, and outputs the initial position and orientation of the gripper (i.e., the parameters of the first waypoint). The sub-RL2 module takes as input the state of the initial scene (i.e., the partial scan of the shape, the mask of the target part, and the position and orientation of the gripper at the initial state), the state of the current scene (i.e., the partial scan of the shape, the mask of the target part, and the position and orientation of the gripper at the current state), and the task specification, and predicts the parameters of the next waypoint using the residual representation.
The network architectures of sub-RL1 and sub-RL2 are based on our RL networks as described in Section A.1, with minor adaptations for the different input and output dimensionalities. We use a buffer size of 2048 and a batch size of 64 for training both modules. We use the Adam optimizer with a learning rate of 0.0001 for the policy and Q-value networks in both modules. We set the initial exploration range to 0.1, which is linearly decayed by a factor of 0.5 every 500 epochs during training.
H.2 Rule-based Heuristic Baseline
For the rule-based heuristic baseline, we hand-craft a set of rules for the different tasks:
- Door pushing: we randomly sample a point on the door surface, snap the gripper position to that point, and set the gripper's forward orientation along the negative direction of the surface normal at that point. Given the distance between the sampled point and the rotation shaft of the door and the target opening angle, we simply move the gripper along its forward direction by the distance computed from these two quantities.
- Door pulling: we initialize the gripper at a random sample as in door pushing, and move the gripper backward by the correspondingly computed distance. Unlike pushing, we also perform a grasp at the contact point for pulling.
- Drawer pushing: similarly, we randomly sample a point on the drawer, snap the gripper position to that point, and set the gripper's forward orientation along the negative direction of the surface normal at that point. We then simply move the gripper by the push distance given in the task specification along the slide-to-close direction of the drawer.
- Drawer pulling: the random point is sampled specifically on the handles of the drawer. Note that we count the trial as a failure if the drawer does not have handles. We then initialize the gripper as in drawer pushing, and move the gripper by the pull distance given in the task specification along the slide-to-open direction.
In Fig. H.7, we present some failure cases of the rule-based heuristic baseline to explain why such seemingly intuitive and simple heuristics often fail. See the figure caption for detailed explanations. Such rule-based heuristics require careful hand-engineering for each task semantic: one needs to hand-design rules for every task, and it is sometimes difficult to enumerate all possible rules. Our approach instead provides a unified system that automatically discovers useful knowledge for different tasks without human effort, and learns a rich collection of data-driven priors from training over diverse shapes.
Appendix I Ablation Study on Curiosity Feedback
We compare our final approach to an ablated version that removes the curiosity feedback (Sec. 5.3), in order to validate that the proposed curiosity-feedback mechanism is beneficial.
In Table I.4, we evaluate the prediction quality of the proposed actionable visual priors. In Table I.5, we present the quantitative comparisons for downstream manipulation. Both tables show that the curiosity-feedback mechanism (Sec. 5.3) is beneficial, as it improves performance in most entries.
Table I.4: Prediction quality of the actionable visual priors without (w/o) and with (w) curiosity feedback.

| Task | Curiosity | Accuracy (%) | Precision (%) | Recall (%) | F-score (%) | Coverage (%) |
|---|---|---|---|---|---|---|
| pushing door | w/o | 76.22 / 73.62 | 72.86 / 70.75 | 83.47 / 80.40 | 77.81 / 75.27 | 80.03 / 76.95 |
| | w | 80.28 / 76.60 | 76.73 / 75.48 | 86.83 / 78.69 | 81.47 / 77.05 | 82.63 / 78.88 |
| pulling door | w/o | 71.33 / 66.39 | 70.51 / 73.68 | 70.31 / 50.98 | 70.41 / 60.26 | 43.17 / 48.01 |
| | w | 75.82 / 76.89 | 69.76 / 74.62 | 88.52 / 81.51 | 78.02 / 77.91 | 55.26 / 50.76 |
| pushing drawer | w/o | 77.09 / 81.54 | 73.83 / 78.34 | 83.75 / 87.11 | 78.48 / 82.49 | 94.96 / 94.62 |
| | w | 84.36 / 84.48 | 77.78 / 79.29 | 96.08 / 93.28 | 85.96 / 85.71 | 95.49 / 95.46 |
| pulling drawer | w/o | 86.27 / 87.68 | 84.53 / 87.05 | 88.80 / 88.52 | 86.61 / 87.78 | 93.31 / 89.08 |
| | w | 87.96 / 90.48 | 88.60 / 90.25 | 87.11 / 90.76 | 87.85 / 90.50 | 93.05 / 91.62 |
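
Table I.5: Downstream manipulation results without (Ours-w/o) and with (Ours-w) curiosity feedback.

| Method | pushing door | pulling door | pushing drawer | pulling drawer |
|---|---|---|---|---|
| Ours-w/o | 16.71 / 18.36 | 2.54 / 3.65 | 15.40 / 19.33 | 22.41 / 17.98 |
| Ours-w | 17.65 / 21.65 | 7.76 / 6.02 | 24.37 / 18.77 | 23.88 / 14.57 |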
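
As a quick sanity check on the prediction-quality table, the F-score column can be recomputed from the precision and recall columns. The sketch below assumes the F-score is the standard F1 measure (the harmonic mean of precision and recall); the text does not state this explicitly, but the reported numbers are consistent with it up to rounding. The function name `f_score` and the selection of rows are illustrative, and only the first value of each "a / b" pair is used.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 measure: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %), first value of each pair
# copied from the prediction-quality table.
rows = {
    ("pushing door", "w/o"): (72.86, 83.47, 77.81),
    ("pushing door", "w"): (76.73, 86.83, 81.47),
    ("pulling door", "w"): (69.76, 88.52, 78.02),
}

for (task, curiosity), (p, r, reported) in rows.items():
    recomputed = f_score(p, r)
    # Inputs are rounded to two decimals, so allow a small tolerance.
    print(f"{task:>13} {curiosity:>3}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

Because the tabulated precision and recall are themselves rounded, the recomputed values can differ from the reported ones in the last digit.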
Appendix J More Results and Analysis
In Fig. J.8, we show additional qualitative results of the actionability prediction and trajectory proposal modules to augment Fig. 3 in the main paper.
In Fig. J.9, we present an additional qualitative analysis of the learned trajectory scoring module to augment Fig. 4 in the main paper.
In addition, we provide another set of result analyses in Fig. J.10, where we show that, to accomplish the same task, interacting at different contact points over the shape yields different trajectories.
Appendix K Real-robot Settings and Experiments
For real-robot experiments, we use one real cabinet and set up one Franka Panda robot facing the front of the cabinet. One ZED MINI RGB-D camera is mounted at the front right of the cabinet. The camera captures 3D point cloud data as inputs to our learned models.
We control the robot using the Robot Operating System (ROS). The robot is programmed to execute each waypoint in the predicted trajectory step by step. We use MoveIt! for motion planning between each pair of adjacent waypoints.
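The step-by-step execution described above amounts to a loop over adjacent waypoint pairs. The following is a minimal sketch under stated assumptions rather than the actual robot code: `execute_trajectory`, the `Waypoint` type, and the `move_between` callback are hypothetical, and the callback stands in for whatever motion-planning and execution call is used on the real robot (no ROS or MoveIt! API is invoked here).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical waypoint type, e.g. an end-effector position plus orientation.
Waypoint = Tuple[float, ...]


def execute_trajectory(
    waypoints: Sequence[Waypoint],
    move_between: Callable[[Waypoint, Waypoint], bool],
) -> int:
    """Execute a predicted trajectory one segment at a time.

    `move_between(start, goal)` is a placeholder for planning and executing
    the motion between two adjacent waypoints; it returns True on success.
    Returns the number of segments that were completed.
    """
    completed = 0
    for start, goal in zip(waypoints, waypoints[1:]):
        if not move_between(start, goal):
            break  # stop at the first segment that cannot be executed
        completed += 1
    return completed
```

Stopping at the first failed segment is a design choice of this sketch; the text does not say how planning failures are handled on the real robot.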
We demonstrate various tasks on the cabinet, including pulling the door open at the edge or handle, pushing the door closed at different contact points, and pulling the drawer open by grasping the edge or handle. Please see our results in the supplementary video.