For robots to assist people in unstructured environments such as homes and hospitals, simple pick-and-place actions are not sufficient: robots must exploit the affordances [gibson1966senses] of objects when executing common tasks, such as grasping a container to pour, as shown in Figure 1. Object affordances provide an agent-centric way of describing the actions that an object offers given the capabilities of the agent. To interact with common household objects, a robot must make use of the actions afforded by the component parts of the object. In this work, we exploit the notion of affordance and represent an object class as a composition of functional parts, which we call affordance parts. As shown in Figure 2, the concept of a mug can be generalized to the composition of two affordance parts, namely the body and the handle.
Recently, the notion of affordance for manipulation has garnered increasing interest in the robotics community [Jamone2018a]. Previous work on object affordances for robotic manipulation focuses on recognizing the affordance labels of novel objects [do2018affordancenet], which does not directly translate to specific manipulation poses for use by a planner towards task completion. Recent advances have enabled pose estimation of novel objects at the category level [wang2019normalized, chen2020learning]. However, robots often need fine-grained pose information about specific object parts (e.g. a container handle) for manipulation actions. For such tasks, the category-level object pose is not sufficient to perform manipulation, due to large intra-category variance. Generalizable representations for object localization using keypoints have been proposed [manuelli2019kpam, qin2019keto], but these methods rely heavily on user-specified constraints in order to define manipulation actions over the keypoints.
In this paper, we aim to develop a generalized object representation, the Affordance Coordinate Frame (ACF), that seamlessly links object part affordances and robot manipulation poses. Our key insight is that each object class is composed of a collection of functional parts, where each part is strongly associated with affordances (see Figure 2). For robots to exploit the affordances of object parts, we learn a category-level pose for each object part, similar to the notion of category-level pose for objects [wang2019normalized]. In particular, the category-level pose for each object part serves as a coordinate frame (hence the name Affordance Coordinate Frame) in which to pre-define manipulation poses, such as for pre-grasp and grasp actions.
Building on the insights from Zeng et al. [zeng2019unsupervised], our work proposes a deep learning based pipeline for estimating the ACFs of object parts given an RGB-D observation; the robot can then attach pre-defined manipulation poses to the estimated ACFs when executing tasks. By defining ACFs with respect to object parts, our method allows for better generalization and robustness under occlusion, because the robot can still act on one part of the object when other parts are occluded. For example, the handle of a mug can be grasped when the body is not visible. In contrast to other works aimed at pose estimation for robotic manipulation [tremblay2018deep, pavlasek2020parts], we do not require the full 6D object pose.
In our experiments, we show the accuracy and generalization of our method in detecting novel object instances, as well as in estimating ACFs for novel object parts. We demonstrate that the proposed method outperforms state-of-the-art methods for object detection, as well as for category-level pose estimation of object parts. We further demonstrate a robot manipulating objects through ACF estimation in a simulated cluttered environment.
II Related Work
II-A Affordances for Robotic Manipulation
The concept of affordance has been explored in many aspects of robot manipulation. For visual perception, Do et al. [do2018affordancenet] proposed AffordanceNet, which achieves state-of-the-art accuracy on affordance detection and segmentation on affordance datasets [nguyen2017object, myers2015affordance]. Affordance detection together with segmentation can reduce the search space of grasp poses for manipulation, as explored in [detry2017task]. However, affordance detection does not directly translate into manipulation poses that guide robots towards task completion.
To learn manipulation trajectories, previous works have approached manipulation tasks by learning from human demonstration or simulation using supervised, self-supervised or reinforcement learning [hamalainen2019affordance, sung2018robobarista, fang2020learning, gualtieri2018pick]. These approaches lack explicit modeling of manipulation constraints, such as physically infeasible configurations. Affordance templates [hart2015affordance] were developed for highly constrained manipulation tasks but require user input to manually register the affordance template in robot observations. The affordance wayfield [mcmahon2018affordance] extended a similar idea to a gradient field based representation, but also requires explicit registration of the wayfield in robot observations.
Recent works that connect visual perception of object affordances with robot manipulation are most relevant to our work. kPAM [manuelli2019kpam] uses 3D semantic keypoints as the object representation for category-level object manipulation. However, kPAM relies heavily on user-specified constraints with respect to the keypoints in order to define manipulation actions. Specific to tool manipulation, Qin et al. [qin2019keto] also proposed a keypoint-based representation, which specifies the grasp point and function point (the hammer head, for example) on the tool and the effect point on the target object. The robot motion is solved through optimization over the keypoint-based constraints. Compared to 3D keypoints registered on whole objects, our ACF representation decomposes an object into parts based on their functionality, and associates each part with a category-level pose.
II-B Object Pose Estimation for Manipulation
Object pose estimation offers a formal way to perform manipulation with known object geometry models. Instance-level pose estimators have been extensively studied in the computer vision community [tremblay2018deep, wang2019densefusion, he2020pvn3d]. In the robotics community, work has focused on the problem of object localization in clutter for the purpose of robotic manipulation [zeng2018srp, sui2015axiomatic, chenGrip]. These methods are limited to pick-and-place actions. The success of parts-based methods has been demonstrated in highly cluttered scenes [Desingheaaw4523, pavlasek2020parts]. These methods allude to the versatility of such parts-based models for manipulation tasks, which is a key motivation for this work. However, they rely on known mesh models and do not extensively validate the representations for manipulation.
Recently, several works have aimed to extend this to category-level pose estimation, which relaxes the assumption of known object geometry models. Wang et al. [wang2019normalized] proposed the Normalized Object Coordinate Space (NOCS) as a dense reconstruction of an object category in canonical space, under the assumption that the intra-category shape variance can be well approximated by a representative 3D shape. At the inference stage, the 6D pose and size are solved by least squares between the reconstructed depth map and the depth sensor observation. Chen et al. [chen2020learning] gave another category-level representation named Canonical Shape Space, modeled by a latent space vector generated from a variational autoencoder. This representation does not assume dense point correspondence and claims full 3D reconstruction. A recent work [li2020category] focused on pose estimation of articulated objects without object geometry models; it solves for the articulation axes and parameters after estimating the poses of individual parts using the same module as [wang2019normalized]. Compared to these representations of whole objects, the ACF representation focuses on functional parts of objects for manipulation tasks rather than whole object pose estimation.
II-C Task-Oriented Grasping
Task-oriented grasping incorporates semantics into classic grasping to provide richer affordance-based manipulation. Antanas et al. [antanas2019semantic] combine object part segments and task semantics in probabilistic logic descriptions to find good grasping regions. Detry et al. [detry2017task] train a CNN model to detect suitable grasp regions and combine it with task-agnostic grasp detection to locate suitable grasping regions. Kokic et al. [kokic2017affordance] employ affordances to learn contact and orientation from visual perception and model the constraints in fingertip grasp planning. Fang et al. [fang2020learning] jointly train a task-oriented grasping neural network and a manipulation policy module through self-supervision in simulation. Liu et al. [liu2019cage] propose context-aware grasping that considers the task action, the affording object part, material and state in a wide-deep learning model, and achieve generalization across objects and classes. These methods focus on grasp quality and do not deal with perceptual uncertainty arising from sensor noise and clutter.
III Affordance Coordinate Frames
Given a predefined task, our goal is to detect and localize objects in a scene in terms of their parts such that manipulation actions can be executed towards task completion. The problem is divided into two subtasks: the first is to perceive affordances from RGB-D observations from a robot, and the second is to generate manipulation actions from the perceived affordances. The Affordance Coordinate Frame (ACF) representation connects the visual perception and robot action into a unified manipulation pipeline. In this paper, we focus on the representation and the visual perception stage.
We define affordances as the functional interaction between an object and an agent. For common household objects, such affordances are associated with object parts, rather than the whole object, and we therefore associate ACFs with object parts. An ACF is formally defined as a 3D keypoint with a directed axis whose origin is at the keypoint location. This representation is an intermediate between full 6D poses [wang2019densefusion, xiang2018posecnn], which do not generalize between objects, and 3D keypoints [manuelli2019kpam], from which it is challenging to generate robot end-effector poses directly. The ACF allows the affordance to be perceived from visual data while also acting as a concrete and generalizable parametrization of a robot action. The advantages of ACFs are two-fold:
They can generalize to novel objects with the same functionality and similar part geometry, since parts have smaller geometric variations within a category compared to full objects.
They provide a sparse representation while providing enough information for robotic manipulation by exploiting the symmetric nature of object parts.
For instance, a mug is divided into a container body and a handle. The container body is rotationally symmetric about its central axis, and we define the ACF axis as pointing upright to indicate the direction in which to pour liquid and place the mug. The 3D keypoint is defined as the center of the 3D bounding box enclosing the mug body. For the mug handle, we define the ACF keypoint to be at the hole of the handle, with the axis pointing from the handle to the mug body. Figure 2 illustrates the geometric relation of the defined axes.
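As a minimal sketch of the representation just described, an ACF can be stored as a part label, a 3D keypoint and a unit axis. The class name `ACF`, the method `point_along_axis` and all numeric values below are illustrative assumptions, not part of the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ACF:
    """Illustrative container: a part label, a 3D keypoint and a directed axis."""
    part: str
    keypoint: tuple  # (x, y, z) in metres, in some fixed reference frame
    axis: tuple      # any non-zero direction; stored as a unit vector

    def __post_init__(self):
        n = math.sqrt(sum(c * c for c in self.axis))
        if n == 0.0:
            raise ValueError("axis must be non-zero")
        # Frozen dataclass, so normalise through object.__setattr__.
        object.__setattr__(self, "axis", tuple(c / n for c in self.axis))

    def point_along_axis(self, distance):
        """Point at a signed distance from the keypoint along the axis."""
        return tuple(k + distance * a for k, a in zip(self.keypoint, self.axis))


# Hypothetical mug: body axis upright, handle axis pointing from handle to body.
mug_body = ACF("body", (0.40, 0.0, 0.05), (0.0, 0.0, 2.0))
mug_handle = ACF("handle", (0.46, 0.0, 0.05), (-1.0, 0.0, 0.0))

# A pre-defined pose can be expressed relative to the ACF, e.g. 10 cm behind
# the handle keypoint, on the side facing away from the body.
pre_grasp = mug_handle.point_along_axis(-0.10)
```

Because poses are expressed relative to the keypoint and axis, the same offset can be reused for any instance whose handle ACF has been estimated.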
IV ACF Detection Pipeline
We estimate instance-level object part ACFs using a network similar to Mask R-CNN [he2017mask]. The overall framework is illustrated in Figure 3. To incorporate depth information, we add an additional channel to the first backbone layer to accept 4-channel RGB-D input. We add three network heads for 3D keypoint, axis and part affinity field estimation. Similar to the mask head, estimates are made in every proposed region of interest (ROI) with a detected bounding box and corresponding ROI feature. Similar to [wang2019densefusion], we concatenate each ROI feature with a global feature before passing it to the keypoint and axis estimation modules. The global feature is computed from all ROI features in the same image through convolution and average pooling layers. Voting was proposed in early work as a way to describe 2D and 3D information [medioni2000tensor]. Following this idea, all three heads use convolutional layers, and we employ deep Hough voting [he2020pvn3d, qi2019deep] to estimate the 3D keypoint and axis.
IV-1 3D Keypoint Estimation
We sample 3D seed points and estimate the keypoint position by learning the position offset between each seed and the target keypoint. Each ROI consists of a set of superpixels. The image pixel coordinates and depth values of the seeds are computed through bilinear interpolation, similar to RoIAlign [he2017mask]. Their 3D positions are restored through the inverse camera intrinsic transformation.
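The seed recovery described above can be sketched in two steps: bilinear interpolation of depth at a sub-pixel location, followed by inversion of the pinhole projection. The function names, the intrinsics `fx, fy, cx, cy` and the toy depth grid are assumptions made for illustration; the paper does not give the camera parameters.

```python
import math


def bilinear(img, u, v):
    """Interpolate a 2D grid (list of rows) at sub-pixel column u and row v.

    Assumes (u, v) lies inside the grid; indices are clamped at the border.
    """
    h, w = len(img), len(img[0])
    u0 = min(max(int(math.floor(u)), 0), w - 1)
    v0 = min(max(int(math.floor(v)), 0), h - 1)
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = img[v0][u0] * (1 - du) + img[v0][u1] * du
    bottom = img[v1][u0] * (1 - du) + img[v1][u1] * du
    return top * (1 - dv) + bottom * dv


def backproject(u, v, depth, fx, fy, cx, cy):
    """Invert the pinhole projection: pixel (u, v) at metric depth -> camera-frame point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Toy 2x2 depth image (metres) and a seed at its centre.
depth_image = [[1.0, 2.0], [3.0, 4.0]]
z = bilinear(depth_image, 0.5, 0.5)
seed_xyz = backproject(0.5, 0.5, z, fx=500.0, fy=500.0, cx=0.5, cy=0.5)
```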
The keypoint loss is defined as the L1 loss between the estimated and ground truth 3D offsets, filtered by the ground truth mask:

$$L_{kpt} = \sum_{i} m_i \sum_{d \in \{x, y, z\}} \left| \Delta d_i - \Delta d_i^{*} \right|$$

where the output $\Delta d_i$ denotes the offset of seed $i$ along direction $d$, $\Delta d_i^{*}$ denotes the ground truth offset, and $m_i$ is the confidence score that seed region $i$ belongs to the mask. During training, only the seeds within the target mask are used to calculate the L1 loss; the ground truth masks are projected onto each ROI with the same method used by Mask R-CNN.
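The masked L1 loss and the voting step can be illustrated with a small sketch. It assumes a binary ground truth mask and averages over the seeds inside the mask; the normalisation and the function names are assumptions, since the text above does not state them.

```python
def keypoint_loss(pred_offsets, gt_offsets, gt_mask):
    """Masked L1 loss over per-seed (dx, dy, dz) offsets.

    gt_mask holds 1 for seeds inside the ground truth mask and 0 otherwise.
    Averaging over the valid seeds is an assumption made for this sketch.
    """
    total, count = 0.0, 0
    for pred, gt, inside in zip(pred_offsets, gt_offsets, gt_mask):
        if inside:
            total += sum(abs(p - g) for p, g in zip(pred, gt))
            count += 1
    return total / count if count else 0.0


def voters(seeds, offsets):
    """Each seed votes for the keypoint at seed + estimated offset."""
    return [tuple(s + o for s, o in zip(seed, off)) for seed, off in zip(seeds, offsets)]


pred = [(0.5, 0.0, -0.25), (1.0, 1.0, 1.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
loss = keypoint_loss(pred, gt, gt_mask=[1, 0])  # second seed lies outside the mask
votes = voters([(1.0, 2.0, 3.0)], [(0.5, 0.0, -0.25)])
```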
IV-2 Directed Axis Estimation
We represent the directed axis using two 3D endpoints. This choice of representation is not unique and is chosen based on experimental results. Other representations and detailed comparison results can be found in Sec. V-A. We learn two separate 3D keypoints to form the axes using a network structure similar to the 3D keypoint estimation head.
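The loss function is composed of three parts. The first loss, $L_{ept}$, is similar to the keypoint loss $L_{kpt}$ and is the sum of the losses for both endpoints of the axis. The second loss, $L_{dist}$, encourages the voter points to be near the axis by penalizing the distance from the voter points to the ground truth axis:

$$L_{dist} = \sum_{i} m_i \left\| \left( \Delta_i - t_i^{*} \right) \times a^{*} \right\|$$

where $\Delta_i$ is the estimated 3D offset of seed $i$, $m_i$ is the confidence score that seed $i$ belongs to the mask, $t_i^{*}$ is the ground truth translation from the seed to the axis endpoint, and $a^{*}$ is the ground truth normalized 3D axis. The third loss, $L_{dir}$, corrects the direction of the estimated axis, which is significant for robot manipulation. Since the connection of the two 3D keypoints determines the axis direction, the direction of the connection between corresponding voters of the two keypoints should be close to the ground truth axis:

$$L_{dir} = \sum_{i} m_i \left\| \frac{\Delta_i^{(2)} - \Delta_i^{(1)}}{\left\| \Delta_i^{(2)} - \Delta_i^{(1)} \right\|} - a^{*} \right\|$$

where $\Delta_i^{(1)}$ and $\Delta_i^{(2)}$ are the offsets estimated from seed $i$ to the first and second endpoint, so that their difference is the vector connecting the two corresponding voters.

The final loss for axis estimation is the sum of $L_{ept}$, $L_{dist}$ and $L_{dir}$. During inference, the two endpoints are clustered with the same method as for 3D keypoint estimation. The final output axis is then the vector from one endpoint to the other.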
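The two geometric quantities used here, the distance from a voter to an axis and the direction given by two endpoints, can be sketched as follows. The helper names are hypothetical, and the snippet illustrates the geometry only, not the training code.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def unit(a):
    n = norm(a)
    if n == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(x / n for x in a)


def point_to_axis_distance(voter, endpoint, axis_dir):
    """Distance from a voter to the line through `endpoint` with direction `axis_dir`."""
    return norm(cross(sub(voter, endpoint), unit(axis_dir)))


def axis_from_endpoints(start, end):
    """Directed unit axis pointing from the first endpoint to the second."""
    return unit(sub(end, start))


d = point_to_axis_distance((1.0, 0.0, 5.0), (0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
axis = axis_from_endpoints((0.0, 0.0, 0.0), (0.0, 3.0, 0.0))
```

The order of the two endpoints fixes the sign of the axis, which is why a direction term is needed in addition to the distance term.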
IV-3 Part Affinity Field Estimation
In multi-object environments, we also need to determine which parts belong to the same object. In the human pose estimation community, the Part Affinity Field (PAF) was proposed in [cao2017realtime] to represent the association between body parts. A PAF is a set of 2D unit vectors that indicate the connecting directions between neighboring body limbs. Inspired by this, our network is designed to predict a set of 2D unit vectors that determine the direction of the potential associated part for each ROI. The network structure is similar to the mask head, estimating for each seed a 2D vector that points from the corresponding part keypoint to its associated part keypoint. As before, only seeds within the ground truth mask (during training) or the estimated mask (during testing) are used to measure the association between candidate parts. The loss function is defined as:

$$L_{paf} = \sum_{i} m_i \left\| v_i - v_i^{*} \right\|$$

where $v_i$ is the 2D vector estimated at seed $i$, $m_i$ indicates whether seed $i$ lies within the mask, and $v_i^{*}$ is the ground truth 2D unit vector pointing toward the target part keypoint. During inference, the mean direction of the affinity field within the estimated mask is used as the final field direction.
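A small sketch of how the mean field direction could be used to pick an associated part is given below. The matching rule (highest cosine between the field direction and the direction to each candidate keypoint) and the function names are assumptions; the text above only states that the mean direction is used.

```python
import math


def mean_direction(vectors):
    """Mean of 2D vectors, renormalised to unit length."""
    sx = sum(v[0] for v in vectors)
    sy = sum(v[1] for v in vectors)
    n = math.hypot(sx, sy)
    return (sx / n, sy / n) if n > 0.0 else (0.0, 0.0)


def associate(part_kp, field_dir, candidate_kps):
    """Index of the candidate keypoint best aligned with the field direction.

    Assumed matching rule for this sketch; returns None if there are no candidates.
    """
    best_idx, best_score = None, -math.inf
    for idx, cand in enumerate(candidate_kps):
        dx, dy = cand[0] - part_kp[0], cand[1] - part_kp[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue
        score = (dx * field_dir[0] + dy * field_dir[1]) / dist
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Vectors predicted inside a handle mask, in image coordinates.
field = mean_direction([(1.0, 0.0), (0.8, 0.6)])
match = associate((100.0, 100.0), field, [(100.0, 40.0), (160.0, 110.0)])
```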
In the experimental section, we evaluate our method from the following perspectives: (1) How does our method compare to other category-level pose estimation methods? (2) Are there different ways to learn the ACF, and how do they perform compared to our method? (3) Is our method robust and efficient enough for robotic manipulation tasks?
To evaluate part-based pose estimation of hand-scale objects, we created a dataset of images that includes RGB, depth and instance-level segmentation for multiple object parts in cluttered indoor environments. To answer the first question, we compare our method on our dataset to NOCS [wang2019normalized], a state-of-the-art method that supports part-based object representation. To answer the second question, we implement and compare different ways of learning ACFs. Finally, we deploy our model on a simulated robot in CoppeliaSim [rohmer2013coppeliasim] and evaluate the performance of a specific robot drink-serving task that uses the predictions from our model.
Dataset. We synthesized data using a data generation plugin named NDDS [to2018ndds] in Unreal Engine 4 (UE4). We selected two virtual indoor environments within UE4 and chose different locations inside each environment to capture images. We picked several hand-scale objects that support robot grasping and manipulation to complete the drink-serving task. The object models include bottle, mug and bowl models from the ShapeNet dataset [shapenet2015], and spoon, spatula and hammer models from the internet. We divide objects into four predefined parts: body, handle, stir and head. The object-part-action relation is detailed in Table I. The keypoints for body and head are defined as the center of their geometries. The keypoints for head and handle are the tangent points on the line parallel to the upright orientation. The axes are labeled as follows: body, from bottom to top; handle, from the external tangent point toward the body; stir, from the tail endpoint to the head endpoint; head, from the center point of the concave surface to the point directly above it. Examples are shown in Fig. 2. During rendering, all object models are randomly translated and rotated. Scene backgrounds and object textures are randomized. In total, we render 20k images, 600 of which are withheld for validation.
We use PyTorch to implement the deep network modules. We use ResNet50 with a feature pyramid network as the backbone, initialized from a model pretrained on the COCO dataset. The initial weight for the input depth channel is set to the average of the three RGB channels at the first layer of the backbone network. The output features of each ROI have 256 dimensions. The DenseFusion block extracts global features through a shared MLP from each ROI and concatenates them with local ROI features in the same way as in [wang2019densefusion]. The fused ROI features then have 1792 dimensions. The axis endpoint head and the 3D keypoint head consist of shared MLPs. During inference, we sample the seeds from each ROI, and only the seeds whose projected coordinates fall within the estimated mask are considered valid. The estimated offsets are added to these seeds to obtain the voters. Then, the Mean Shift algorithm [comaniciu2002mean] is applied to cluster the voters, and the cluster center is regarded as the final estimate. The implementation of the Mean Shift algorithm for deep Hough voting has the same structure as in [he2020pvn3d]. We trained our network on an NVIDIA 2080 Super GPU with a batch size of 2 for 60 epochs, with an initial learning rate of 0.001 and the AdamW optimizer with a weight decay of 0.001.
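The clustering of voters can be illustrated with a plain flat-kernel mean shift. This is a simplified stand-in written for illustration, not the batched implementation of [he2020pvn3d]; the function name, the bandwidth and the toy voters are assumptions.

```python
import math


def mean_shift_center(points, bandwidth=0.05, iters=50, tol=1e-6):
    """Flat-kernel mean shift over 3D voters; returns the mode with most support."""
    def neighbours(center):
        return [p for p in points if math.dist(p, center) <= bandwidth]

    best_center, best_support = None, -1
    for start in points:
        center = start
        for _ in range(iters):
            near = neighbours(center)
            if not near:
                break
            shifted = tuple(sum(p[k] for p in near) / len(near) for k in range(3))
            converged = math.dist(shifted, center) < tol
            center = shifted
            if converged:
                break
        support = len(neighbours(center))
        if support > best_support:
            best_center, best_support = center, support
    return best_center


# Three consistent voters and one outlier (metres, camera frame).
toy_voters = [(0.0, 0.0, 1.0), (0.01, 0.0, 1.0), (0.0, 0.01, 1.0), (0.5, 0.5, 1.0)]
keypoint = mean_shift_center(toy_voters, bandwidth=0.05)
```

Taking the densest mode rather than the plain mean keeps a few outlying votes, for example from seeds near the mask boundary, from dragging the estimate away.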
V-A Ablation Studies on Axis Estimation
Apart from using two 3D keypoints as the representation (we call this method ACF-Endpoints), we tested two other ways to represent a 3D axis. One representation is a 3D unit vector; we call this method ACF-Vector. During training, the loss is defined as the difference between the estimated vector and the ground truth direction vector. During inference, the estimated vector is directly regarded as the axis. The other representation is a 3D point set along the axis, with a binary label indicating the closer endpoint for each point in the set; we call this method ACF-ScatterLine. During training, the loss is defined as the sum of three terms. One of them pushes the points so that their offsets are perpendicular to the axis direction, so that the points spread out along the axis direction, which benefits the linear regression during inference. Another corrects the order of the two endpoints through a binary cross entropy with logits loss between the estimated and ground truth closer endpoint indices. During inference, the axis direction is solved through linear regression with RANSAC on the estimated point set, and the axis endpoint index is determined by the point labels.
We compare the three variants of learning the ACF axis. The angular distance between the estimated and ground truth axes is defined as $\arccos(\hat{a} \cdot a^{*})$, where $\hat{a}$ and $a^{*}$ are the estimated and ground truth unit axes. From the results shown in Table II and the first row of Figure 4, we can draw several conclusions. Overall, ACF-Endpoints and ACF-Vector achieve similar mean average precision (mAP) values, and ACF-Endpoints is better except for head parts. Considering axis direction errors below 15 degrees, ACF-Endpoints performs better on the handle part and similarly to ACF-Vector on the head part. The reason might be that the head parts do not span a long distance along the direction of the defined ACF axis, so the two endpoints cannot be learned to be as distinct as for other parts. ACF-ScatterLine performs much better on ‘stir’ and ‘body’ parts than on ‘head’ and ‘handle’. We believe this is because a line-shaped scatter is better suited to straight-line or cylindrical shapes with a clear center axis or a distinct primary direction, and worse suited to asymmetric or irregular shapes. Overall, ACF-Endpoints performs robustly across parts with different shapes and achieves 60.2% mAP within the 15°, 2 cm threshold.
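The angular distance and the joint threshold check can be sketched as follows. The function names are hypothetical; the default thresholds mirror the 15 degree and 2 cm values quoted above, and how per-part detections are aggregated into mAP is not covered here.

```python
import math


def angular_distance_deg(a, b):
    """Angle in degrees between two directed axes (not necessarily unit length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cosine = max(-1.0, min(1.0, dot / (na * nb)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def within_threshold(pred_kp, gt_kp, pred_axis, gt_axis, max_deg=15.0, max_m=0.02):
    """True if both the keypoint error and the axis error are under the thresholds."""
    return (math.dist(pred_kp, gt_kp) <= max_m
            and angular_distance_deg(pred_axis, gt_axis) <= max_deg)


right_angle = angular_distance_deg((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
flipped = angular_distance_deg((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
ok = within_threshold((0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.1, 1.0), (0.0, 0.0, 1.0))
```

Because the axis is directed, a flipped estimate counts as a 180 degree error rather than a perfect match.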
V-B Comparison with NOCS
There are no affordance perception methods based on 3D keypoint and axis estimation for objects or parts, so we compare our method with NOCS [wang2019normalized], a state-of-the-art category-level 6D object pose estimation method. This method estimates the Normalized Object Coordinate Space (NOCS) map as a view-dependent shape reconstruction for each object category and predicts the 6D pose of a specific object by fitting the NOCS map to the observed depth image. We trained the NOCS network from its open source implementation on our dataset and compared the two methods with respect to object part keypoint position and axis direction. The results are shown in Table II and Figure 4. NOCS does not perform well on our task, and we think there are two main reasons. (1) The intra-class variance. The original NOCS paper considers object-level classes; for instance, objects are separated into mug, bottle and bowl classes based on their geometry. In contrast, we group the parts of all of these containers into body and handle classes based on their functionality. The body parts of mugs, bottles and bowls can vary more than the shapes within one object class. (2) The amount of data versus the information to be learned. Compared to the training dataset used in [wang2019normalized], our dataset is about 20× smaller and we train for 4× fewer iterations. Moreover, the NOCS network learns a full geometry reconstruction, which requires more data to learn the shape features and more iterations to converge. Our method only learns the keypoint and axis required for manipulation, which means it can learn more effectively on a smaller dataset.
V-C Object-Level Detection Accuracy
Besides part-level keypoint and axis estimation for manipulation, we also evaluate object-level detection. The intuition is that, compared to whole object detection, a part-based representation might be more robust in highly cluttered scenes where objects are partly occluded. We choose mugs as test objects and compare our method with a Mask R-CNN baseline. The training set includes 6K images of mugs, and we create the test dataset by putting different body and handle parts together to form novel mugs and by adding some new instances. Mask R-CNN achieves 93.2 and 70.6 mAP on uncluttered and cluttered scenes respectively, while ACF-Endpoints achieves 93.3 and 90.2 mAP. Both methods perform similarly in uncluttered environments, but ACF-Endpoints performs better in heavily cluttered environments. This result supports the reliability of the proposed part-level perception method for robotic manipulation in cluttered environments.
V-D Application to Robotic Manipulation Task
To test the ACFs perceived from RGB-D observations, we created a robotic manipulation scenario in the CoppeliaSim simulation platform in which the robot completes a drink-serving task. As shown in Figure 1, random object clutter including containers and spoons is generated on a tabletop, and a KUKA arm equipped with a parallel gripper performs actions step by step to finish the task. An RGB-D camera is mounted on the side and provides input to the ACF perception system, which estimates the ACF keypoints and axes used to generate actions. Two containers, holding water and lemon tea powder respectively (we use particles to simulate the liquid), are placed on the tabletop at random positions within the robot workspace, and the task is to pour the two types of liquid into a bowl and mix them together using a spoon. After the random object sets are generated, the drink-serving task is separated into several steps:
Take an image and use ACF perception module to get keypoints, axes and associated parts in the scene.
Find the containers with water and lemon tea powder respectively, and a bowl to contain the drink.
Grasp the container with water and pour water into the bowl.
Grasp the container with lemon tea powder and pour the powder into the bowl.
Find a spoon in the view, grasp it and use it to stir and mix water and lemon tea powder in the bowl.
The actions in each step are implemented as follows. We first determine simple trajectories or key poses based on the part keypoints and axes. We assume that the transformation between two associated parts within an object is unchanged since grasping, and that the transformation between the robot gripper and the grasped part is the one obtained from ACF perception. The trajectory of the robot gripper can then be solved from the part poses and these transformations, and the whole arm motion is generated by an off-the-shelf motion planner.
To grasp a mug, the gripper first tries to grasp its handle by approaching the 3D keypoint of the handle part from the direction that is orthogonal to the axes of both the body and handle parts. If no motion plan can be generated because of collisions or workspace limits, it tries to approach the mug body from the side opposite the handle, towards the body keypoint. To grasp a bottle, the gripper approaches the object from a random direction orthogonal to its axis, toward its keypoint. For example, if the bottle is placed on the tabletop with its opening upright, the approach direction lies in a horizontal plane. To grasp a spoon, the gripper approaches directly along the axis of the head part, towards the keypoint of the stir part.
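For the handle grasp, a direction orthogonal to both the body axis and the handle axis can be obtained with a cross product. The sketch below uses hypothetical function names and a 10 cm standoff chosen for illustration; either sign of the cross product satisfies the orthogonality condition, and the choice between them is not specified above.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-9:
        raise ValueError("axes are parallel; no unique orthogonal direction")
    return tuple(c / n for c in v)


def handle_approach(body_axis, handle_axis):
    """Unit direction orthogonal to both part axes (sign chosen arbitrarily)."""
    return unit(cross(body_axis, handle_axis))


def pre_grasp_position(keypoint, approach, standoff):
    """Start `standoff` metres before the keypoint along the approach direction."""
    return tuple(k - standoff * a for k, a in zip(keypoint, approach))


body_axis = (0.0, 0.0, 1.0)      # upright mug body
handle_axis = (-1.0, 0.0, 0.0)   # from handle toward body
approach = handle_approach(body_axis, handle_axis)
start = pre_grasp_position((0.46, 0.0, 0.05), approach, standoff=0.10)
```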
For pour actions, we assume a container A is already grasped by the gripper and there is another receiver container B with its opening upright (body axis upright). The trajectory is interpolated from two key poses:
container A with its body axis upright and its keypoint at a small offset from container B’s body axis.
container A with its body axis in the horizontal direction, with its keypoint position fixed.
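The interpolation between the two key poses can be sketched by rotating the body axis of container A from upright to horizontal about a horizontal tilt axis while keeping its keypoint fixed. The function names, the choice of tilt axis, the number of steps and the numeric values are assumptions for illustration.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(v, k, angle):
    """Rodrigues' formula: rotate v about the unit axis k by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(k, v)
    kdv = sum(ki * vi for ki, vi in zip(k, v))
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3))


def pour_waypoints(keypoint, tilt_axis, steps=5):
    """(keypoint, body axis) pairs from upright to horizontal, keypoint fixed."""
    if steps < 2:
        raise ValueError("need at least the two key poses")
    upright = (0.0, 0.0, 1.0)
    return [(keypoint, rotate(upright, tilt_axis, 0.5 * math.pi * i / (steps - 1)))
            for i in range(steps)]


# Container A held slightly beside container B's axis, tilting about the y axis.
waypoints = pour_waypoints((0.30, 0.0, 0.25), tilt_axis=(0.0, 1.0, 0.0), steps=3)
first_axis, last_axis = waypoints[0][1], waypoints[-1][1]
```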
For stir actions, we also assume that the container has its body axis upright and that a spoon is grasped by the gripper. We first move the gripper above the container, and then move the head part of the spoon in a circle around the container keypoint, in the plane orthogonal to the container axis.
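The circular stirring path can be sketched by building two unit vectors orthogonal to the container axis and sampling a circle in the plane they span. The function name, radius and number of waypoints are illustrative assumptions.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def stir_circle(center, axis, radius, n=8):
    """Waypoints on a circle around `center`, in the plane orthogonal to `axis`."""
    a = unit(axis)
    # Pick a helper vector that is not parallel to the axis.
    helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = unit(cross(a, helper))
    v = cross(a, u)  # already unit length because a and u are orthonormal
    points = []
    for i in range(n):
        t = 2.0 * math.pi * i / n
        points.append(tuple(center[k] + radius * (math.cos(t) * u[k] + math.sin(t) * v[k])
                            for k in range(3)))
    return points


container_kp = (0.30, 0.0, 0.20)
path = stir_circle(container_kp, axis=(0.0, 0.0, 1.0), radius=0.03, n=8)
```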
Figure 6 shows screenshots of the robot grasping an unseen bottle and stirring liquid particles.
We present a novel representation of object affordances called the Affordance Coordinate Frame (ACF), and we propose a deep learning based perception method for estimating ACFs given an RGB-D image. We demonstrate that our proposed perception method outperforms a state-of-the-art method (NOCS [wang2019normalized]) in estimating the ACFs of novel objects. Our method also outperforms Mask R-CNN in detecting novel objects, especially in cluttered environments. We further demonstrate the applicability of ACFs to robot manipulation tasks in a simulated environment. As a future direction, more types of affordances and object part categories can be incorporated to broaden the application scenarios.