
Detecting Robotic Affordances on Novel Objects with Regional Attention and Attributes

This paper presents a framework for predicting affordances of object parts of unseen categories, with application to robot manipulation. The framework generates affordance maps of novel objects within an image via region-based affordance segmentation. Earlier work used category priors while jointly optimizing detection and segmentation to boost accuracy with limited ability to generalize to unknown categories. This work integrates a category-agnostic region proposal network for proposing instance regions of an image across categories. A self-attention mechanism trained to interpret each proposal learns to capture rich contextual dependencies through the region. To further guide affordance learning in the absence of category priors, an auxiliary task of object attribute inference improves local feature learning. Experimental results show that the trained deep network architecture achieves state-of-the-art performance on affordance segmentation of novel objects and outperforms several baselines. An ablation study quantifies the effectiveness and contributions of each proposed component. Experiments demonstrate the use of affordance detection on novel objects for vision tasks and for manipulation.



I Introduction

Identifying the functionalities, or affordances, of object parts aids task completion by informing robot manipulators on how to use or interact with an object. Adult humans possess rich prior knowledge for recognizing the affordances of object parts. The affordance knowledge supports identification of potential interactions with nearby objects, and contributes to planning manipulation sequences with these objects towards achievement of a defined task. Endowing a robot with the same capabilities is crucial for assistive robots operating in human environments on a daily basis.

Affordance detection of object parts in images is frequently cast as a segmentation problem [1, 2, 3, 4, 5, 6]. Object parts sharing the same functionality are segmented and grouped at the pixel level, then assigned the corresponding affordance label. This problem formulation admits the use of state-of-the-art semantic segmentation architectures based on convolutional neural networks (CNNs) derived by the computer vision community [7]. However, affordance identification differs from conventional semantic segmentation based on visual cues or physical properties. Understanding functionalities of an object part requires learning the concept of potential interactions with, or use by, humans. State-of-the-art affordance detection architectures [5, 6] improve segmentation performance by jointly optimizing object detection and affordance prediction. The object prior and instance features inferred from the detection process improve pixel-wise affordance map predictions. Obtaining per-object bounding boxes and per-pixel labels for training purposes involves labor-intensive annotation efforts. Therefore, existing affordance datasets contain few object categories when compared to commonly seen classification datasets [8, 9, 10] in the vision community. Consequently, only a limited number of categories can be learned with these datasets, while the open world involves more diverse categories.


Fig. 1: Illustration of the affordance detection framework with the proposed attribute and attention modules for improving physical robotic manipulation. (a) The goal is to identify affordances of novel objects in order to execute robotic manipulations. Pixel-wise prediction of each object part indicates corresponding functionalities such as the grasp affordance and the pound affordance of the hammer’s handle and head, respectively; (b) The proposed attribute and attention modules improve pixel-wise prediction of object part affordance. The attribute module (upper branch) predicts existing affordances of a region of interest as shareable attributes across categories. This proposed auxiliary task guides the local region feature learning. The attention module (bottom branch) learns dependencies across pixels. For example, the two plus marks on the hammer’s handle with high correlation should have the same predicted affordance labels.

To fully utilize the annotated datasets, generalizing the learned affordances across novel categories is essential for open-world affordance detection. Object priors benefit the prediction of affordance segmentation but limit generalizability. Decoupling the object prior from the instance feature space potentially bridges the gap for novel categories while maintaining the benefits of utilizing local features for predicting object part affordances. With instance features agnostic to object label, the segmentation branch may learn features attuned to pixel-level affordances for each object proposal. Urban street segmentation [11, 12] has some similarity to affordance segmentation, but there are some critical differences. As opposed to multiple affordances across the entire image, affordance segmentation of object instances involves segmenting a small region with a small subset of affordances. For example, an instance proposal containing a knife is related to cut and grasp, and a pot is related to contain and wrap-grasp. Narrowing the potential subset of affordances based on the object instance context aids segmentation. Fig. 1 illustrates the framework of proposed affordance detection modules for real-world manipulation. The proposed architecture incorporates an intra-region processing module to learn long-range or non-local affordance relationships across instance contexts which benefits high-level understanding of correlations between affordances and improves segmentation accuracy. Furthermore, affordances as object attributes are shared across categories. Replacing the object prior with recognized category attributes guides the instance regions to learn shareable features across novel categories. The proposed CNN architecture incorporates an attribute detection branch for regional features.

The main contributions of the paper are:
(1) A deep network architecture to perform (object) category-agnostic affordance segmentation through a region-based self-attention mechanism. The network, trained with object attributes instead of object labels, achieves state-of-the-art performance on the UMD benchmark.
(2) An extended UMD dataset with attribute annotations. The proposed attribute learning approach guides the region-level feature learning to improve performance. An ablation study shows the contribution of the attention mechanism and the attribute learning branch.
(3) Real-world manipulation experiments with a 7-DoF manipulator and an RGB-D camera, showing that the proposed approach effectively predicts the affordances and informs subsequent manipulation tasks.

II Related Work

Due to its primacy and role in robotic manipulation, the most commonly studied affordance has been the grasp affordance [13, 14, 15]. Machine learning is increasingly being used for grasping based on the ability to learn to grasp objects [16] for which generative or model-based methods would be difficult. Recent literature in this vein includes using a cascaded network [17] to encode grasp features, and using deep networks to learn graspable areas [18] and graspable regions [19, 20, 21]. In addition to learning grasp hypotheses, mapping from vision input to manipulation actions can be learned through a large collection of physical demonstrations or interactions, with [22] initially taking a supervised approach that was later extended to reinforcement learning [23]. Robotics research on general affordances, which include and go beyond grasping, studies the potential interactions of robots with their surrounding objects and/or environments [24, 25, 26]. Detecting the affordances of object parts in images is cast as a pixel-wise labelling problem to exploit advances in computer vision approaches to segmentation. Object parts sharing the same functionality are grouped with the corresponding ground truth affordance label. Detecting affordances involves identifying applicable actions on object parts and segmenting the detected parts at the pixel level.

Geometric cues with manually designed features were utilized for pixel-wise affordance prediction in [1]. The need for feature engineering was shifted to affordance-specific feature learning using an encoder-decoder CNN architecture [3]. Later, the image-based approach was extended to a two-stage method [5] where regional features were obtained by applying object detection for object proposals on an image. Assisted by the object detection, the network predicted object affordance on selected proposals from an entire input image. Building on [27], the two-stage method was then improved [6] by jointly optimizing object detection and affordance segmentation end-to-end. The object category priors from the object detection branch enhance the pixel-level affordance estimates. For objects with significant intra-object variation, the addition of key-point feature recognition improves task-based grasping. Despite the state-of-the-art performance, annotations for detection and segmentation are labor-intensive. To reduce annotation demands in supervised learning, weakly supervised approaches instead rely on sparse key point annotations [29, 30]. Since our work focuses on zero-order affordance [31], where affordances are functionalities found on objects and are irrespective of current states in the world, self-supervised approaches such as [32, 33] lie outside the scope of the investigation.

Optimizing detection and segmentation boosts the performance with object and localization priors but limits transference of learned affordance labels to unseen categories. One solution is to detect objectness of a region proposal instead of predicting object class, thereby enabling category-agnostic affordance segmentation on (object) instance features. The loss of object label constraints on processing requires alternative mechanisms to induce region-driven learning or spatial aggregation. Recent studies on semantic segmentation enhance contextual aggregation by atrous spatial pyramid pooling and dilated convolutions [34, 35], merging information at various scales [36], and fusing semantic features between levels [37]. To model long-range pixel or channel dependencies in a feature map, attention modules [38] are applied in semantic segmentation for learning global dependencies [39, 40, 41]. Region-based contextual aggregation in semantic segmentation or object-based affordance segmentation remains unexplored, which this paper aims to address.

One means to induce contextual aggregation is to rely on object attributes. Attributes, as human describable properties, are known to assist vision tasks, such as face detection [42, 43], object classification [43, 44], activity recognition [45], and fashion prediction [46]. Affordances should also serve as shareable features with semantic meaning whose use could benefit feature learning for object instances. Attribute categories replace the discarded object categories during training to guide feature learning, with the aim of improving generalizability across novel object categories with recognized affordances. We extend [6] by employing attribute prediction to guide instance feature learning while removing object category supervision. The attributes include affordance as a semantic label and additional self-annotated visual attributes. Following [39, 41], we propose to incorporate a self-attention mechanism to model long-range intraregional dependencies. Different from previous works, the proposed architecture adopts the attention mechanism in the affordance branch and operates on object-based feature maps. Decision dependencies are built upon local regions instead of whole images. The objective of the proposed framework is to guide the instance features with attribute learning, and to model the dependencies within the instance feature map for affordance prediction. The affordance knowledge serves to aid real-world robotic manipulation.





Fig. 2: Network structure of the proposed detector with self-attention and attribute learning. The network predicts affordances of object parts for each object in the view. Blue blocks indicate network layers and gray blocks indicate images and feature maps. (a) RG-D images are input of the network; (b) Category-agnostic proposals with objectness are forced to predict attributes during training as an auxiliary task; (c) Deconvolutional layers lead to a fine-grained feature map for learning long-range dependencies; (d) Self-attention mechanism operation is incorporated in affordance branch on the intermediate feature (); (e) The final output includes bounding boxes and multiple layers indicating confidences for affordances on a single pixel.

III Problem Statement

Given a corresponding pair of color and depth images, the objective is to identify pixel-wise affordances of object parts for seen and unseen object categories for robotic manipulation. A region-based framework for foreground proposals focuses processing on local regions containing object features. To obtain features with non-local dependencies across a local region, an attention module selectively aggregates features within a proposal. In addition to learning pixel-wise affordance segmentation, shareable attributes across object categories guide local region feature learning.

IV Approach

IV-A Overview

This section describes the proposed framework for jointly predicting object regions and their affordance segmentations across novel categories. The general framework of the design, depicted in Fig. 2, adopts a two-stage architecture [27, 6] with VGG-16 [47] as a backbone. A set of region proposals is collected and input to the detection and segmentation branches for predicting object regions and affordance maps, respectively. To be specific, the shared feature map ( feature) from the intermediate convolutional layers (layer 13 of VGG-16) is sent to the region proposal network for region proposals; the two ROI align [27] layers feed the collected instances to the task branches. To generalize the segmentation branch (bottom branch) to novel categories, the detection branch performs binary classification to separate foreground objects from background. The segmentation branch takes in category-agnostic object regions for predicting the affordance map within each region. To address the contextual dependencies and the non-local feature learning within a region proposal, we introduce two improvements, described next, to enhance the associations among local features and to guide feature learning to be object aware but not object specific.

IV-B Region-based Self-Attention

The goal of object part affordance segmentation is to group pixels sharing the same functionality and to assign them the correct affordance labels. In urban street semantic segmentation [11, 12], the entire scene usually corresponds to a large subset of possible ground truth labels. In contrast, affordance segmentation assigns labels only to object regions, with the assigned labels being a small subset relative to the set of known affordance labels. The semantic context of the image (e.g. a cup) narrows the set of relevant affordances and thus reduces the search space [40].

Fig. 3: Details of the attention module for regional features in the affordance branch. $K$, $Q$ and $V$ refer to key, query and value, respectively.

To aggregate non-local contextual information, the proposed architecture explicitly creates (and consequently, learns) associations between local features of pixels within a region proposal to compensate for the small receptive field of convolutional operations. A self-attention mechanism, as depicted in Fig. 3, on the segmentation branch adapts long-range contextual information. Given an instance feature map $F$, a triplet of key $K$, query $Q$ and value $V$ feature maps are predicted. We model the contextual relationship with a spatial attention module for features at pixel positions $i$ and $j$ within the instance feature map:

$$ w_{ji} = \frac{1}{Z_j} \exp\left(K_i^{\top} Q_j\right) $$

where $w_{ji}$ indicates the degree to which the $i$-th representation feature impacts the $j$-th feature. $Z_j$ is the normalization term where:

$$ Z_j = \sum_{i=1}^{N} \exp\left(K_i^{\top} Q_j\right) $$

with $N$ the number of pixel positions in the instance feature map. To aggregate the predicted correlation between features at different positions within a region proposal, the value feature map $V$ is weighted by the contextual relationship and a residual function is learned with the original feature map $F$ for the final output $O$. For features at pixel position $j$:

$$ O_j = \alpha \sum_{i=1}^{N} w_{ji} V_i + F_j $$

where $\alpha$ is a learnable scale parameter balancing the weighting of global contextual information.
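
The aggregation described in this subsection can be illustrated with a small, dependency-free sketch. It assumes the common softmax dot-product form of self-attention used in [39, 41]; the function name `region_self_attention`, the list-based data layout, and the fixed `alpha` argument are illustrative choices, not the authors' implementation (in a trained network the key, query and value maps come from learned layers and `alpha` is learned).

```python
import math


def region_self_attention(keys, queries, values, feats, alpha):
    """Self-attention over the N pixel positions of one region proposal.

    keys, queries: N vectors (lists of floats) of equal length.
    values, feats: N vectors of equal length (value map and input features).
    alpha: scalar weight on the aggregated context (assumed fixed here).
    """
    n = len(feats)
    out = []
    for j in range(n):
        # Dot-product similarity between every position i and position j.
        scores = [sum(k * q for k, q in zip(keys[i], queries[j])) for i in range(n)]
        # Softmax normalisation; subtracting the max keeps exp() stable.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Context = attention-weighted sum of the value vectors.
        context = [
            sum(weights[i] * values[i][c] for i in range(n))
            for c in range(len(values[0]))
        ]
        # Residual connection back to the original feature at position j.
        out.append([alpha * ctx + f for ctx, f in zip(context, feats[j])])
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    flat = [[0.0], [0.0]]  # identical keys/queries give uniform attention
    print(region_self_attention(flat, flat, feats, feats, alpha=1.0))
```

With `alpha = 0` the module reduces to the identity, which is why a learnable scale of this kind lets a network start from purely local features and add regional context gradually.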

IV-C Attribute-guided Affordance Learning

Having an object detection branch with binary classification removes contextual information provided by the object category. While it does provide for category-agnostic processing, it does not leverage potential attributes that may be transferable to unseen object instances. Thus, in addition to predicting objectness through object detection, we augment the detection pathway with an attribute recognition module. The network is trained to predict attributes shared across categories, which should transfer to unseen categories during deployment. Attribute learning is commonly applied to zero-shot and one-shot learning scenarios, which bear resemblance to the problem of applying learned knowledge to unseen elements. Attribute learning enhances object classification when suitable attributes are available [48, 49, 50], thus we anticipate that it can enhance affordance segmentation. Learning a visual representation of attributes across categories can be used to predict attributes of an unseen category for detection, given that the attributes of the unseen categories are recognizable.

To guide the feature learning of each region proposal, a task sub-branch parallel to objectness detection and bounding box regression is added for attribute prediction with $N_{att}$ outputs, where $N_{att}$ is the number of attributes defined. Each attribute output is a binary classification predicting whether a specific attribute is found in the region proposal, based on the instance feature shared with objectness detection and regression.

The objectness branch separates foreground from background and hence the class number is $2$. Let $p$ denote the probability of an instance being foreground, $b$ denote the corresponding bounding box, and $a = (a_1, \dots, a_{N_{att}})$ denote the corresponding probabilities of attributes within the instance region, with $u$, $b^{*}$ and $a^{*}$ the respective ground truth values. Define the loss function of the complete detection branch ($L_{det}$) to be:

$$ L_{det} = L_{cls}(p, u) + \lambda_{1}\, \delta(u, 1)\, L_{bb}(b, b^{*}) + \lambda_{2}\, \delta(u, 1) \sum_{k=1}^{N_{att}} L_{att}(a_k, a_k^{*}) $$

where $L_{cls}$ denotes the cross entropy loss for objectness classification (cls), $L_{bb}$ denotes the loss for bounding box (bb) regression with the ground truth annotation, and $L_{att}$ denotes the binary cross entropy loss for each attribute. The scalars $\lambda_{1}$ and $\lambda_{2}$ are optimization weight factors, and $\delta$ is the Kronecker delta function.
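
As a rough illustration of how the three terms of the detection loss combine for a single proposal, the sketch below uses binary cross entropy for objectness and for each attribute, and a smooth-L1 penalty for the box. The smooth-L1 choice, the gating of the box and attribute terms on foreground proposals, and the names `detection_loss`, `w_bb` and `w_att` are assumptions made for this example rather than details confirmed by the text.

```python
import math


def bce(p, t, eps=1e-12):
    """Binary cross entropy for a probability p and a target t in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))


def smooth_l1(pred, target):
    """Smooth-L1 box regression penalty (an assumed choice of box loss)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def detection_loss(p_fg, is_fg, box, box_gt, attr_probs, attr_gt,
                   w_bb=1.0, w_att=1.0):
    """Objectness + box + attribute loss for one region proposal."""
    loss = bce(p_fg, is_fg)
    if is_fg == 1:  # plays the role of the Kronecker delta
        loss += w_bb * smooth_l1(box, box_gt)
        loss += w_att * sum(bce(p, t) for p, t in zip(attr_probs, attr_gt))
    return loss


if __name__ == "__main__":
    # Foreground proposal with a perfect box and two uncertain attributes.
    print(detection_loss(0.5, 1, [0, 0, 0, 0], [0, 0, 0, 0], [0.5, 0.5], [1, 0]))
```

Because the attribute term only shapes the shared instance feature during training, a sub-branch like this can be dropped at inference without changing the objectness or box outputs.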

IV-D Regional Attention and Attribute Embedding

Both the regional attention module and attribute learning are aggregated in the final network. The attribute learning parallel to the foreground detection works as an auxiliary task to guide feature learning during training; it is discarded during inference. The attention module in the segmentation branch learns a representation of a region proposal gathering rich contextual information; it is applied during inference. To have a higher resolution affordance mask to learn long-range dependencies, deconvolutional layers initially upsample the feature map of the segmentation branch (bottom in Fig. 2). Attention is applied after the first deconvolutional operation on the feature map, followed by two deconvolutional operations for the final affordance map. To compute the loss for the affordance, let $m_i^c$ denote the predicted affordance mask on a RoI-based feature map, where $i$ is the pixel in a region proposal and $c$ denotes the affordance on the pixel. The affordance loss is defined as the multinomial cross entropy loss:

$$ L_{aff} = -\frac{1}{N} \sum_{i \in \mathrm{RoI}} \sum_{c=1}^{C} \hat{m}_i^c \log m_i^c $$

where $N$ is the total area of the region of interest and $\hat{m}$ is the ground truth of the corresponding affordance mask with $C$ channels.
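
A minimal sketch of a per-region multinomial cross entropy of this kind follows. It assumes the prediction is already a per-pixel probability distribution over affordance channels and that the ground truth is given as one channel index per pixel; the name `affordance_loss` and the flat pixel list are illustrative.

```python
import math


def affordance_loss(pred, gt_labels, eps=1e-12):
    """Mean multinomial cross entropy over the pixels of one region.

    pred: list of per-pixel probability lists, one entry per affordance channel.
    gt_labels: list of ground-truth channel indices, one per pixel.
    """
    total = 0.0
    for probs, c in zip(pred, gt_labels):
        # Only the ground-truth channel contributes for a one-hot target.
        total -= math.log(max(probs[c], eps))
    return total / len(pred)  # normalise by the region area


if __name__ == "__main__":
    pred = [[0.5, 0.5], [0.25, 0.75]]
    print(affordance_loss(pred, [0, 1]))
```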

The overall network inherits Faster-RCNN [51] on a VGG16 [47] backbone with modified detection and segmentation branches while keeping the region proposal network (RPN) intact. Let $L_{rpn}$ denote the RPN loss from the original network; with $L_{det}$ and $L_{aff}$ the detection and affordance losses defined above, the loss for the entire network is:

$$ L = L_{rpn} + L_{det} + L_{aff} $$

V Experiments and Evaluation

V-A UMD Dataset

The UMD dataset [1] covers 17 categories with 7 affordances. The object categories come from kitchen, workshop, and garden settings. The dataset contains 28k+ RGB-D images captured by a Kinect sensor with the object on a rotating table for data collection. The segmentation ground truth is provided at the pixel level, annotating the affordance of each object part. The additional ground truth of object bounding boxes is obtained by filtering out the background table from the foreground objects and fitting a tight rectangular bounding box around the foreground boundary. The UMD dataset has two benchmarking approaches, image split and category split. The category split is the benchmark that tests unseen categories and is used here for evaluation.
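
The bounding box step can be sketched as follows, assuming the table has already been filtered out so that a binary foreground mask is available. The function name `tight_bbox` and the inclusive `(x_min, y_min, x_max, y_max)` convention are choices made for this example.

```python
def tight_bbox(mask):
    """Tight axis-aligned box around the nonzero pixels of a 2D mask.

    mask: list of rows, each a list of 0/1 values.
    Returns (x_min, y_min, x_max, y_max), inclusive, or None if empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return (min(cols), rows[0], max(cols), rows[-1])


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(tight_bbox(mask))
```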

V-B Data Preprocessing and Training

The proposed framework reuses the weights of VGG-16 [47] pre-trained on ImageNet [9] for initialization. The layers for attribute prediction and the affordance branch including the proposed attention module are trained from scratch. To incorporate RGB-D images for geometric information with pre-trained weights, the blue channel is substituted with the depth channel as proposed in [19, 21]. Ideally, any channel can be replaced with the depth channel. The value of depth channel is normalized to the range with as the mean value. Missing value or in depth channel is filled with . The whole network is trained end-to-end for epochs. The training starts with initial learning rate which is divided by for every epochs. The training time is around days with a single NVIDIA GTX 1080 Ti.
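
The channel substitution can be sketched as below. The normalization range, mean and fill value used by the authors are not reproduced here; the parameters `d_min`, `d_max`, `out_max` and `fill` are hypothetical, caller-supplied values, and plain min-max rescaling is an assumed stand-in for the normalization described above.

```python
import math


def to_rgd(rgb, depth, d_min, d_max, out_max, fill):
    """Replace the blue channel of an RGB image with rescaled depth.

    rgb: H x W nested list of (r, g, b) tuples.
    depth: H x W nested list of floats; None or NaN marks a missing reading.
    d_min, d_max, out_max, fill: hypothetical preprocessing parameters.
    """
    scale = out_max / (d_max - d_min)
    rgd = []
    for rgb_row, depth_row in zip(rgb, depth):
        row = []
        for (r, g, _blue), d in zip(rgb_row, depth_row):
            if d is None or math.isnan(d):
                d_out = fill  # missing depth gets the fill value
            else:
                clipped = min(max(d, d_min), d_max)
                d_out = (clipped - d_min) * scale
            row.append((r, g, d_out))
        rgd.append(row)
    return rgd


if __name__ == "__main__":
    rgb = [[(10, 20, 30), (40, 50, 60)]]
    depth = [[1.0, None]]
    print(to_rgd(rgb, depth, d_min=0.0, d_max=2.0, out_max=255.0, fill=0.0))
```

Keeping three input channels is what allows the ImageNet pre-trained first layer to be reused unchanged.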

The baseline approach is labeled Obj-wise since it is the deep network structure trained with objectness detection only, i.e., without regional attention and without attribute embedding. It is a modified version of AffordanceNet [6]. A second baseline approach, denoted KL-divergence, uses the Obj-wise network but replaces the cross-entropy affordance loss with KL-divergence to permit ranked affordance output. Likewise, we attempt the same KL-divergence replacement for the affordance branch of the proposed method. Due to the network model size, only the regional attention mechanism is incorporated (no attribute embedding). We label it Ours.
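
For reference, a per-pixel KL-divergence between a target distribution over affordances and the predicted distribution can be computed as in this sketch. How the ranked ground truth is turned into a target distribution is not specified in the text, so the example targets below are purely illustrative.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions over affordances."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, pred)
        if t > 0  # terms with zero target mass contribute nothing
    )


if __name__ == "__main__":
    # A pixel whose primary affordance carries more target mass than the secondary.
    print(kl_divergence([0.7, 0.3, 0.0], [0.6, 0.3, 0.1]))
```

Unlike a one-hot cross entropy, this objective rewards the network for reproducing the relative weights of several affordances on the same pixel, which is what makes a ranked output possible.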

Fig. 4: Attributes collected from the UMD dataset.

TABLE I: Images per Attribute

| attribute | images |
| grasp | 18235 |
| cut | 5542 |
| scoop | 2869 |
| contain | 7889 |
| pound | 2257 |
| support | 2317 |
| wrap | 5250 |

Since no appropriate dataset exists for analyzing attributes of tools in the commonly used UMD affordance dataset, we treat the original UMD affordance labels as attributes across categories, see Fig. 4. In total, seven attributes are defined for representing the UMD tool dataset, as summarized in Table I. An attempt was made to augment these attributes with additional object attributes defined in ImageNet and to augment the UMD dataset annotations. Shape and Texture were selected for their potential relevance (the others are Color and Pattern). However, no obvious improvements were observed. The resulting weighted F-measures score was 0.67 versus 0.69 with affordance attributes alone (all experiments use affordance attributes only).

V-C Evaluation Metric

To evaluate the affordance segmentation responses, derived from probability outputs over affordance classes, against ground truth labels for each affordance, we adopt the weighted F-measures metric, $F^{w}_{\beta}$, for the predicted masks:

$$ F^{w}_{\beta} = (1 + \beta^{2}) \frac{Pr^{w} \cdot Rc^{w}}{\beta^{2} \cdot Pr^{w} + Rc^{w}} $$

where $Pr^{w}$ and $Rc^{w}$ are the weighted precision and recall values, respectively [52]. Higher weights are assigned to pixels closer to foreground ground truth. A second metric evaluates the prediction performance of the rankings for multiple affordances on object parts and applies to the KL-divergence trained network. It is the ranked weighted F-measures metric, $R^{w}_{\beta}$,

$$ R^{w}_{\beta} = \sum_{r} w_{r} \, F^{w}_{\beta}(r) $$

where $w_{r}$ are the ranked weights contributing to the weighted sum over the corresponding affordances [52]. The top-ranked affordance receives the most weight, the second-ranked the next most, and so on.
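
The two scores can be combined as in the sketch below once weighted precision and recall are available; computing those pixel weightings follows [52] and is not reproduced here. The function names, the default `beta = 1`, and the normalization of the rank weights to sum to one are assumptions of this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure from (weighted) precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def ranked_f(scores_by_rank, weights):
    """Weighted sum of per-rank F scores; weights are normalised to sum to 1."""
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores_by_rank))


if __name__ == "__main__":
    print(f_beta(1.0, 0.5))
    # The top-ranked affordance counts twice as much as the second.
    print(ranked_f([0.8, 0.4], [2, 1]))
```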

V-D Benchmarking on UMD Novel Objects

Qualitative results of the proposed method are depicted in Fig. 5, where the bottom row shows affordance prediction of the baseline approach (Obj-wise), and the upper row shows the corresponding result with the proposed attention module and attribute learning. The consistency of the affordance segmentation is improved, with less oversegmentation of the individual continuous affordance segments. Quantitative evaluation employs the F-measures described earlier (whose outputs lie in the range [0, 1]). Relative to image-split, category-split is a harder problem with fewer reported evaluations. Amongst published works, only [53] reports both weighted and ranked weighted F-measures, while early works [1] and [29] report only ranked weighted F-measures scores. The weighted F-measures score of our multi-affordance KL-divergence baseline is computed by taking the top affordance only. As a strong baseline for multi-affordance segmentation, a modified DeepLab [34] is tested. Performance outcomes and comparisons for the UMD benchmark are shown in Table II for weighted F-measures. The proposed approach has the strongest performance, with our Obj-wise baseline next (it is a custom, modified implementation of [6] for the category-split benchmark). Compared to the best published result, the proposed approach achieves a 43% improvement (0.48 to 0.69). Compared to our strong baselines, it achieves 7% and 18% improvements over Obj-wise and KL-divergence, respectively. Moreover, the improvement on affordance ranking with ranked weighted F-measures is reported in Table III. This test is harder, due to the metric heavily penalizing incorrect rankings, which compresses the score output values. The improvement over our strong baseline KL-divergence is 4.3%. Compared to the most recent reported result [53] and [34], the proposed approach improves by 14% and 20%, respectively. The results show that the proposed method outperforms the existing approaches and strong baselines for novel object categories on both metrics.
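
The percentages quoted in this subsection appear to be relative gains in the average score. A short sketch makes the arithmetic explicit, using the average values reported in the benchmark tables; the helper name `rel_improvement` is illustrative.

```python
def rel_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    pairs = {
        "DeepLab [34] -> Ours (weighted F)": (0.48, 0.69),
        "Obj-wise -> Ours (weighted F)": (0.64, 0.69),
        "KL-divergence -> Ours (weighted F)": (0.58, 0.69),
        "KL-divergence -> Ours (ranked)": (0.23, 0.24),
        "Lakani [53] -> Ours (ranked)": (0.21, 0.24),
        "DeepLab [34] -> Ours (ranked)": (0.20, 0.24),
    }
    for name, (base, new) in pairs.items():
        print(f"{name}: {rel_improvement(base, new):.1f}%")
```

Since the table averages are rounded to two decimals, gains computed this way are only accurate to roughly a percentage point.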

Fig. 5: Comparison of detection results on UMD benchmark, where color overlays represent affordance labels, red: grasp; yellow: scoop; green: cut; dark blue: contain; blue: wrap-grasp; orange: support; purple: pound. Top: results using the proposed method; Bottom: results using only binary objectness (no self-attention mechanism nor attribute learning). The predictions are more consistent locally and are less oversegmented.

TABLE II: Performance On UMD Dataset (novel category), weighted F-measures

| method | grasp | cut | scoop | contain | pound | support | w-grasp | average |
| HMP [1] | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| SRF [1] | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Lakani [53] | 0.46 | 0.30 | 0.22 | 0.47 | 0.08 | 0.03 | 0.47 | 0.29 |
| DeepLab [34] | 0.55 | 0.30 | 0.36 | 0.58 | 0.42 | 0.22 | 0.93 | 0.48 |
| KL-divergence (ours) | 0.54 | 0.31 | 0.39 | 0.63 | 0.55 | 0.75 | 0.92 | 0.58 |
| Obj-wise (ours) | 0.61 | 0.37 | 0.60 | 0.61 | 0.81 | 0.59 | 0.94 | 0.64 |
| Ours | 0.60 | 0.37 | 0.60 | 0.61 | 0.80 | 0.88 | 0.94 | 0.69 |

TABLE III: Ranking Performance On UMD Dataset (novel category), ranked weighted F-measures

| method | grasp | cut | scoop | contain | pound | support | w-grasp | average |
| HMP [1] | 0.16 | 0.02 | 0.15 | 0.18 | 0.02 | 0.05 | 0.10 | 0.10 |
| SRF [1] | 0.05 | 0.01 | 0.04 | 0.07 | 0.02 | 0.01 | 0.07 | 0.04 |
| VGG [29] | 0.18 | 0.05 | 0.18 | 0.20 | 0.03 | 0.07 | 0.11 | 0.12 |
| ResNet [29] | 0.16 | 0.05 | 0.18 | 0.19 | 0.02 | 0.06 | 0.11 | 0.11 |
| Lakani [53] | 0.19 | 0.18 | 0.28 | 0.32 | 0.08 | 0.11 | 0.32 | 0.21 |
| DeepLab [34] | 0.30 | 0.17 | 0.10 | 0.22 | 0.06 | 0.04 | 0.53 | 0.20 |
| KL-divergence (ours) | 0.32 | 0.18 | 0.18 | 0.23 | 0.09 | 0.10 | 0.52 | 0.23 |
| Ours | 0.33 | 0.18 | 0.18 | 0.25 | 0.10 | 0.10 | 0.53 | 0.24 |

TABLE IV: Ablation Study, weighted F-measures (attent: regional attention module; attri: attribute learning)

| base | attent | attri | grasp | cut | scoop | contain | pound | support | w-grasp | average |
| Obj-wise | | | 0.61 | 0.37 | 0.60 | 0.61 | 0.81 | 0.59 | 0.94 | 0.64 |
| Obj-wise | ✓ | | 0.62 | 0.33 | 0.51 | 0.61 | 0.81 | 0.79 | 0.94 | 0.66 |
| Obj-wise | | ✓ | 0.60 | 0.40 | 0.67 | 0.60 | 0.78 | 0.75 | 0.94 | 0.68 |
| Obj-wise | ✓ | ✓ | 0.60 | 0.37 | 0.60 | 0.61 | 0.80 | 0.88 | 0.94 | 0.69 |

V-E Real-world Manipulation with Affordance

Physical grasping based on affordances is tested on a 7-DoF robotic manipulator and a Microsoft Kinect to examine the grasp and contain affordances as in [6]. The real-world robotic arm setup is shown in Fig. 6. The proposed method is deployed and compared with a state-of-the-art grasp detector [21] and an affordance detector [6], with results summarized in Table V. Unseen objects are placed in the visible and reachable area. For the experiment involving the grasp affordance, the grasp center is obtained by averaging graspable pixels, with the grasping orientation determined by fitting a line to the predicted pixels. The proposed method matches the performance of [21]. The failure cases for the proposed method with the screwdriver are due to the rounded edge of the screwdriver's handle, which leads to it slipping from the parallel grippers. For the contain affordance, a ball is placed by the robotic arm into an object predicted as contain-able. As with the grasp affordance, the location is determined by averaging the pixels predicted as contain-able. The performance is competitive with [6]. The failure case for the mug is due to incorrect affordance prediction of some pixels and therefore a shifted placement location. The cup has a similar failure mode, but since it has a smaller opening, performance drops more. Overall, compared to [21], the proposed approach applies to more manipulation tasks using a single network. Compared to [6], the generalizability enables the learned affordances to apply to novel categories with competitive performance. Thus, the proposed deep network applies to more situations than either of the two deep network methods compared (the – marks in Table V are effectively zero scores), and matches their performance on the applicable test case subsets.
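
The grasp-point computation can be sketched as follows: the center is the mean of the pixels predicted as graspable, and the orientation is the principal axis of those pixels. Using the principal axis (a total least squares line fit) is an assumed reading of "fitting a line", and the function name `grasp_from_pixels` is illustrative.

```python
import math


def grasp_from_pixels(pixels):
    """Grasp center and line orientation from graspable pixel coordinates.

    pixels: non-empty list of (x, y) image coordinates.
    Returns ((cx, cy), angle), with angle in radians measured from the x axis.
    """
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments of the pixel cloud.
    sxx = sum((x - cx) ** 2 for x, _ in pixels)
    syy = sum((y - cy) ** 2 for _, y in pixels)
    sxy = sum((x - cx) * (y - cy) for x, y in pixels)
    # Direction of the principal axis (total least squares line fit).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), angle


if __name__ == "__main__":
    handle = [(0, 0), (1, 1), (2, 2)]  # a diagonal handle
    print(grasp_from_pixels(handle))
```

For a parallel gripper, the closing direction would typically be taken perpendicular to the fitted line; that step, and the mapping from image to robot coordinates, is outside this sketch.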

Fig. 6: Experimental setup for physical manipulation, with a 7 DoF manipulator and a Microsoft Kinect RGB-D sensor.

V-F Ablation Study

Table IV shows the ablation study of the proposed approach, where the baseline network is the Obj-wise network (first row). The second and third rows quantify the improvements gained from regional attention and attribute learning, respectively. The last row reports the performance with both attention and attribute learning together, which achieves the best result. Each design is independently trained with ImageNet pre-trained weights, instead of fine-tuning one-by-one. While a 3.7% (0.770 to 0.799) improvement in [6] was regarded as a reasonable achievement on the UMD image-split benchmark, the ablation study shows 3.1% and 6.2% improvements by introducing ROI-based attention and attribute learning individually.

Fig. 7: Comparison on Cornell dataset. Top: detection results of proposed method. Bottom: Obj-wise model.

| object | DeepGrasp [21] | AffNet [6] | Ours | affordance |
|---|---|---|---|---|
| knife | 10/10 | 10/10 | 10/10 | grasp |
| screwdriver | 7/10 | –/10 | 7/10 | grasp |
| mug | –/10 | 8/10 | 9/10 | contain |
| cup | –/10 | –/10 | 8/10 | contain |
| average | 8.5/10 | 9.0/10 | 8.5/10 | |

TABLE V: Physical manipulation
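The average row of Table V can be reproduced from the per-object counts. The sketch below is illustrative only, with hypothetical helper names. It computes each method's average over the objects it was applied to, which matches the table, and, for contrast, the average when each "–" entry is counted as a zero score.

```python
# Successes out of 10 trials per object, copied from Table V.
# None stands for a "–" entry (method not applicable to that test case).
results = {
    "DeepGrasp [21]": {"knife": 10, "screwdriver": 7, "mug": None, "cup": None},
    "AffNet [6]": {"knife": 10, "screwdriver": None, "mug": 8, "cup": None},
    "Ours": {"knife": 10, "screwdriver": 7, "mug": 9, "cup": 8},
}

def average_applicable(scores):
    """Mean success count over the objects the method was applied to."""
    vals = [v for v in scores.values() if v is not None]
    return sum(vals) / len(vals)

def average_all(scores):
    """Mean success count over all objects, counting "–" entries as zero."""
    return sum(v or 0 for v in scores.values()) / len(scores)

for name, scores in results.items():
    print(f"{name}: applicable {average_applicable(scores):.2f}/10, "
          f"all objects {average_all(scores):.2f}/10")
```

Under the second convention only the proposed method keeps its 8.5/10 average, which is the sense in which it covers more situations than either baseline.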

V-G Affordance Detection across Datasets

To demonstrate generalizability across datasets, the proposed method (trained on the novel-category split of the UMD dataset) is evaluated on the Cornell dataset [54]. The Cornell dataset consists of 885 images of 244 different objects for learning robotic grasping. Each image is labelled with multiple ground truth grasps. Although affordance masks are not available for quantitative evaluation with the F-measure and ranked metrics, visualizations of qualitative results and comparisons are presented in Fig. 7 for the reader's reference. Similar to Fig. 5, the proposed method produces more consistent and continuous affordance segmentations than the baseline method.

VI Conclusion

This paper described a novel framework to predict affordances of object parts in an image. The deep network framework learns to generalize affordance segmentation across unseen categories in support of robotic manipulation. Unlike previous approaches, the proposed framework performs affordance segmentation within predicted foreground object proposals. It learns a self-attention mechanism within each proposed foreground region and selectively adapts contextual dependencies within each instance region. To compensate for the absence of object category priors, category attributes are incorporated to guide the feature learning. Evaluation on the UMD dataset used the novel-category split for comparison to state-of-the-art methods, including several image-based and region-based baselines. Experiments with physical manipulation demonstrated the effectiveness of the proposed framework for manipulating unseen object categories in the real world. All code and data will be publicly released.


  • [1] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos, “Affordance detection of tool parts from geometric features.” in IEEE International Conference on Robotics and Automation, 2015, pp. 1374–1381.
  • [2] A. Srikantha and J. Gall, “Weakly supervised learning of affordances,” arXiv preprint arXiv:1605.02964, 2016.
  • [3] A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Detecting object affordances with convolutional neural networks,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016, pp. 2765–2770.
  • [4] A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conference on Computer Vision.   Springer, 2016, pp. 186–201.
  • [5] A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Object-based affordances detection with convolutional neural networks and dense conditional random fields,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
  • [6] T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An end-to-end deep learning approach for object affordance detection,” in IEEE International Conference on Robotics and Automation, 2018. [Online]. Available:
  • [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [8] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [10] I. Krasin, T. Duerig, N. Alldrin, A. Veit, S. Abu-El-Haija, S. Belongie, D. Cai, Z. Feng, V. Ferrari, V. Gomes, A. Gupta, D. Narayanan, C. Sun, G. Chechik, and K. Murphy, “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” Dataset available from, 2016.
  • [11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE conference on Computer Vision and Pattern Recognition, 2016.
  • [12] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision.   Springer, 2016, pp. 102–118.
  • [13] K. B. Shimoga, “Robot grasp synthesis algorithms: A survey,” The International Journal of Robotics Research, vol. 15, no. 3, pp. 230–266, 1996.
  • [14] A. Bicchi and V. Kumar, “Robotic grasping and contact: A review,” in IEEE International Conference on Robotics and Automation, 2000, pp. 348–353.
  • [15] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2014.
  • [16] A. Saxena, J. Driemeyer, and A. Y. Ng, “Robotic grasping of novel objects using vision,” The International Journal of Robotics Research, vol. 27, no. 2, pp. 157–173, 2008.
  • [17] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
  • [18] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of International Conference on Machine Learning, 2011, pp. 689–696.
  • [19] J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in IEEE International Conference on Robotics and Automation, 2015, pp. 1316–1322.
  • [20] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and N. Xi, “A hybrid deep architecture for robotic grasp detection,” in IEEE International Conference on Robotics and Automation, 2017, pp. 1609–1614.
  • [21] F. Chu, R. Xu, and P. A. Vela, “Real-world multiobject, multigrasp detection,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3355–3362, Oct 2018. [Online]. Available: _multiObject_multiGrasp
  • [22] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in International Symposium on Experimental Robotics, 2016, pp. 173–184.
  • [23] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning, vol. 87, pp. 651–673, 29–31 Oct 2018.
  • [24] E. Ugur and J. Piater, “Bottom-up learning of object categories, action effects and logical rules: From continuous manipulative exploration to symbolic planning,” in IEEE International Conference on Robotics and Automation, 2015, pp. 2627–2633.
  • [25] A. Dehban, L. Jamone, A. R. Kampff, and J. Santos-Victor, “Denoising auto-encoders for learning of objects and tools affordances in continuous space,” in IEEE International Conference on Robotics and Automation, 2016, pp. 4866–4871.
  • [26] A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Preparatory object reorientation for task-oriented grasping,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016, pp. 893–899.
  • [27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in International Conference on Computer Vision.   IEEE, 2017, pp. 2980–2988.
  • [28] L. Manuelli, W. Gao, P. Florence, and R. Tedrake, “kpam: Keypoint affordances for category-level robotic manipulation,” in International Symposium on Robotics Research, 2019.
  • [29] J. Sawatzky, A. Srikantha, and J. Gall, “Weakly supervised affordance detection,” in IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 5197–5206.
  • [30] J. Sawatzky and J. Gall, “Adaptive binarization for weakly supervised affordance segmentation,” arXiv preprint arXiv:1707.02850, 2017.
  • [31] A. Aldoma, F. Tombari, and M. Vincze, “Supervised learning of hidden and non-hidden 0-order affordances and detection in real scenes,” in IEEE International Conference on Robotics and Automation, 2012, pp. 1732–1739.
  • [32] P. R. Florence, L. Manuelli, and R. Tedrake, “Dense object nets: Learning dense visual object descriptors by and for robotic manipulation,” in Conference on Robot Learning, 2018, pp. 373–385.
  • [33] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, “Learning synergies between pushing and grasping with self-supervised deep reinforcement learning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 4238–4245.
  • [34] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [35] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [36] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
  • [37] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang, “Context contrasted feature and gated multi-scale aggregation for scene segmentation,” in IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 2393–2402.
  • [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [39] Y. Yuan and J. Wang, “Ocnet: Object context network for scene parsing,” arXiv preprint arXiv:1809.00916, 2018.
  • [40] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
  • [41] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in IEEE conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
  • [42] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in International Conference on Computer Vision.   IEEE, 2009, pp. 365–372.
  • [43] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar, “Describable visual attributes for face verification and image search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1962–1977, 2011.
  • [44] K. Duan, D. Parikh, D. Crandall, and K. Grauman, “Discovering localized attributes for fine-grained recognition,” in IEEE conference on Computer Vision and Pattern Recognition, 2012, pp. 3474–3481.
  • [45] H.-T. Cheng, F.-T. Sun, M. Griss, P. Davis, J. Li, and D. You, “Nuactiv: Recognizing unseen new activities using semantic attribute-based learning,” in Proceeding of International Conference on Mobile Systems, Applications, and Services.   ACM, 2013, pp. 361–374.
  • [46] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 1096–1104.
  • [47] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
  • [48] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in IEEE conference on Computer Vision and Pattern Recognition, 2009, pp. 1778–1785.
  • [49] L.-J. Li, H. Su, Y. Lim, and L. Fei-Fei, “Objects as attributes for scene classification,” in European Conference on Computer Vision.   Springer, 2010, pp. 57–69.
  • [50] Y. Sun, L. Bo, and D. Fox, “Attribute based object identification,” in IEEE International Conference on Robotics and Automation, 2013, pp. 2096–2103.
  • [51] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [52] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 248–255.
  • [53] S. Rezapour Lakani, A. J. Rodríguez-Sánchez, and J. Piater, “Towards affordance detection for robot manipulation using affordance for parts and parts for affordance,” Autonomous Robots, Jul 2018.
  • [54] R. L. Lab, “Cornell grasping dataset,” _data/data.php, 2013, accessed: 2019-09-01.