One-Shot Object Affordance Detection in the Wild

by   Wei Zhai, et al.

Affordance detection refers to identifying the potential action possibilities of objects in an image, which is a crucial ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we first study the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection Network (OSAD-Net) that firstly estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images. Through collaboration learning, OSAD-Net can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a large-scale Purpose-driven Affordance Dataset v2 (PADv2) by collecting and labeling 30k images from 39 affordance and 103 object categories. With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods and may also facilitate downstream vision tasks, such as scene understanding, action recognition, and robot manipulation. Specifically, we conducted comprehensive experiments on PADv2 dataset by including 11 advanced models from several related research fields. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objective metrics and visual quality. The benchmark suite is available at Net.








1 Introduction

The concept of affordance was proposed by the ecological psychologist Gibson (gibson1977theory). It describes how the inherent “value” and “meanings” of objects in an environment are directly perceived and explains how this information can be linked to the action possibilities offered to an organism by the environment (hassanin2018visual). In particular, perceiving the affordance of objects has applications in a wide range of fields (zhang2020empowering; DBLP:journals/corr/abs-1807-06775), such as action recognition (DBLP:journals/cviu/KjellstromRK11; qi2017predicting), robot grasping (yamanobe2017brief), autonomous driving, and scene understanding (liu2019auto; vu2014predicting).

Compared to the semantics of the object itself, affordance is an uncertain and dynamic property (e.g., a cell phone can both make phone calls and take pictures) that is closely related to the environment context and the possible interactions between the object and actors (fang2020learning; hassan2015attribute; chuang2018learning; Zhu_2015_CVPR; wang2017binge). Therefore, approaches that rely only on the construction of mapping relationships between the object structure and affordance labels of images in a fixed dataset do not have strong generalization capabilities, resulting in inferior affordance detection performance when the environment context or actor interactions change. Thus, it is expected that the model has the ability to adapt to unseen scenarios using a few samples or only one sample. Moreover, the ability of the model to rapidly adapt to changing scenes and generalize to unseen objects is of great significance in practical applications (nagarajan2020learning; demo2vec2018cvpr; interaction-hotspots; chen2021deep).

To learn such a capability of perceiving affordance, we consider the challenging one-shot affordance detection task in this paper (here, “detection” refers to the pixel-wise detection task, a usage also adopted in salient object detection), i.e., given a support image that depicts the human action purpose, all objects in a scene with the common affordance should be detected (as shown in Fig. 1). Unlike the object detection/segmentation problem (shaban2017one), affordance and semantic categories of objects are highly inter-correlated but do not imply each other. An object may have multiple affordances (as shown in Fig. 2 (A)), e.g., a sofa can be used to sit or lie down. An affordance category may cover more than one object category (as shown in Fig. 2 (B)), e.g., whisk, chopsticks, and spoon all belong to the affordance category of “mix”. The possible affordance depends on the human action purpose in real-world application scenarios. Directly learning the affordance from a single image without the guidance of action purpose makes the model focus on the statistically dominant affordances while ignoring other visual affordances that also coincide with the same action purpose.

To address this problem: 1) We try to find clear hints about the action purpose (i.e., via the subject and object locations and human poses (chen2020recursive; wei2017inferring)) from a single support image, which implicitly defines the object affordance and thus can be used to reduce the affordance ambiguity of objects in candidate images. 2) We adopt collaboration learning to capture the inherent relationship between different objects to counteract the interference caused by visual appearance differences and improve generalization. Specifically, we devise a novel One-Shot Affordance Detection Network (OSAD-Net) to solve the problem. We take an image as support and a set of images ( images in this paper) as a query, and the network first captures the human-object interactions from the support image via an action purpose learning (APL) module to encode the action purpose. Then, a mixture purpose transfer (MPT) module is devised to use the encoding of the action purpose to activate the features in query images that have common affordance. Finally, a densely collaborative enhancement (DCE) module is introduced to capture the intrinsic relationships between objects with the same affordance and suppress backgrounds irrelevant to the action purpose. In this way, our OSAD-Net can learn a good adaptation capability for perceiving unseen affordances.

Moreover, there is a gap between existing datasets and real-world application scenarios due to their limited data diversity. An affordance detection model for scene understanding and general applications should be able to learn from human-object interaction when the robot arrives at a new environment and search for suitable objects as tools to complete specific tasks in that environment, rather than just finding objects of the same category or similar appearance. To fill this gap, we propose the Purpose-driven Affordance Dataset v2 (PADv2), which contains 30k diverse images covering 39 affordance categories and 103 object categories from different scenes, and is much larger than the preliminary version (PAD) (Ours). We provide rich affordance mask annotations, depth information annotations, bounding box annotations of humans/objects, and human pose annotations in support images. Besides, we train several representative models in related fields on the PADv2 dataset and compare them with our OSAD-Net in terms of both objective evaluation metrics and visual quality for one-shot affordance detection. Our main contributions are summarized as follows:

  • We introduce a new challenging one-shot affordance detection problem along with a large-scale benchmark to facilitate the research for empowering robots with the ability to perceive unseen affordances in real-world scenarios.

  • We propose a novel OSAD-Net that can efficiently learn the action purpose and use it to detect the common affordance of all objects in a scene via collaboration learning, resulting in a good adaptation capability that can deal with unseen affordances.

  • We establish a challenging PADv2 dataset containing 30k images, covering 39 affordance categories and 103 object categories with more complex scenes. We provide rich annotations for the dataset, including pixel-level affordance mask labels, depth information annotations, bounding box annotations of humans/objects, and human pose annotations in support images, which could greatly benefit various visual affordance perception tasks.

  • Experiments on the PADv2 and PAD datasets show that our OSAD-Net outperforms the state-of-the-art models and can serve as a strong baseline for future research.

No. | Dataset | Pub. | Year | Format | Pixel | HQ | BG | Obj. | Aff. | Img.
1 | 2011 ICRA (hermans2011affordance) | ICRA | 2011 | RGB-D | - | - | Fixed | - | 7 | 375
2 | UMD (myers2015affordance) | ICRA | 2015 | RGB-D | ✓ | ✓ | Fixed | - | 17 | 30,000
3 | 2016 TASE (song2015learning) | T-ASE | 2016 | RGB | - | - | Fixed | 8 | 1 | 10,360
4 | IIT-AFF (nguyen2017object) | IROS | 2017 | RGB-D | ✓ | ✓ | General | 10 | 9 | 8,835
5 | CERTH-SOR3D (thermos2017deep) | CVPR | 2017 | RGB-D | - | - | Fixed | 14 | 13 | 20,800
6 | ADE-Aff (chuang2018learning) | CVPR | 2018 | RGB | ✓ | ✓ | General | 150 | 7 | 10,000
7 | PAD (Ours) | IJCAI | 2021 | RGB | ✓ | ✓ | General | 72 | 31 | 4,002
8 | PADv2 | - | - | RGB-D | ✓ | ✓ | General | 103 | 39 | 30,000

Footnoted dataset pages: amyers/part-affordance-dataset/, cychuang/learning2act/

Table 1: Statistics of existing image-based affordance datasets and the proposed PADv2 dataset. PADv2 dataset provides higher-quality annotations and covers much richer affordance categories. Pub.: Publication venue. Pixel: whether or not pixel-wise labels are provided. HQ: high-quality annotation. BG: the background is fixed or from general scenarios. Obj: number of object categories. Aff.: number of affordance categories. Img: number of images.

A preliminary version of this work was presented in Ours. In this paper, we extend the previous study by introducing three major improvements:

  • We introduce a novel One-Shot Affordance Detection Network (OSAD-Net). Compared to the conference-version OSAD-Net model in Ours, we redesign all three modules. Specifically, we consider the influence of human pose on learning action purpose, and introduce a probabilistic model for better purpose transfer and a dense comparison approach for feature enhancement of objects with the same affordance. In this way, we achieve better results than the conference-version model for one-shot affordance detection with fewer parameters and query images.

  • We extend the Purpose-driven Affordance Dataset (PAD) further by collecting more images (up to 30k) and enlarging the diversity of the affordance categories (up to 39) and object categories (up to 103). We also provide posture annotations of support images and depth annotations of all images. Further, we evaluate more state-of-the-art methods from five related fields to comprehensively demonstrate the superiority of the proposed model.

  • We carefully re-organize the dataset and describe it in detail, including a complete statistical and attribute analysis, a clear definition of the problem, a comprehensive analysis of the experimental results from a variety of different perspectives, as well as complete ablation studies.

The remainder of the paper is organized as follows. Section 2 describes existing works related to one-shot affordance detection. We introduce our OSAD-Net in Section 3 and describe the benchmark dataset PADv2 in Section 4. Section 5 presents the experimental results and analysis on both PADv2 and PAD datasets. We conclude the paper and discuss potential applications and future research directions in Section 6.

2 Related Work

2.1 Visual Affordance Learning

Visual affordance is a branch of affordance research that treats affordance as an image- or video-based computer vision problem and uses machine learning techniques to address the challenges (hassanin2018visual). In recent decades, a significant number of scholars have explored object affordance from a vision perspective, along several main directions: affordance categorization, affordance detection, and affordance reasoning.

Affordance categorization aims to predict the affordance category of an input image. stark2008functional propose an algorithm to acquire, learn, and detect functional object categories based on affordance clues. They use human interaction videos to obtain a visual feature representation of affordance. ugur2014bootstrapping use affordance cues of individual objects to guide the learning of complex affordance features of pairs of objects. Complex affordance learning is guided by using pre-learned basic visual features as additional input to complex affordance predictor variables or as cues to the next target object to be explored.

The task of affordance detection is to divide the object into regions and assign an affordance label to all pixels in each region. Earlier work proposes a deep learning-based object detector to improve affordance detection results, and subsequently, do2018affordancenet improve this method and propose an end-to-end AffordanceNet. Unlike previous works that rely on separate and intermediate object detection steps, zhao2020object propose a novel relationship-aware network to directly generate pixel-wise affordance maps from an input image in an end-to-end manner. In addition, to avoid extensive pixel-level annotation, sawatzky2017weakly and sawatzky2017adaptive propose weakly supervised affordance detection methods, which can accomplish affordance detection using only a small number of keypoint annotations.

Affordance reasoning refers to a more complex understanding of affordance, which requires higher-order contextual modeling, and the primary purpose of such reasoning is to infer hidden variables. zhu2014reasoning build a knowledge base to represent target objects and their descriptive properties (visual, physical, and category properties) to infer affordance labels, human poses, or relative position, and learn the model using Markov logic networks (richardson2006markov). demo2vec2018cvpr design the Demo2Vec model for extracting feature representations of demonstration videos and predicting the human interaction on the same target image regions and action labels on the same object image.

However, according to Gibson’s definition of affordance (gibson1977theory), “it implies the complementarity of the animal and the environment”. There exist multiple potential complementarities between animal and environment, which leads to multiple possibilities of particular affordance, i.e., an object may have multiple affordances, and the same affordance may cover multiple different object categories. This study attempts to establish a relationship between human action purpose and affordance and leverage a collaborative learning strategy to address the affordance ambiguity issue.

2.2 Visual Affordance Dataset

In the era of deep learning, the study of visual affordance usually follows a data-driven manner that requires annotated affordance datasets. hermans2011affordance collect data from an autonomous mobile robot with a Pan-Tilt-Zoom (PTZ) camera, resulting in a total of 375 images from a set of object categories. myers2015affordance introduce a large-scale RGB-D dataset containing pixel-level affordance labels and their ranks, which is the first pixel-wise labeled affordance dataset. song2015learning propose a novel dataset for evaluating visual grasp affordance estimation. All images in the dataset contain grasp affordance annotations, including grasp region and scale attributes. Since most of the previous datasets have simple backgrounds and are difficult to apply to real-world robot scenes due to their limited diversity, nguyen2017object select a subset of object categories from ImageNet (russakovsky2015imagenet) and collect RGB-D images from cluttered scenes to construct the IIT-AFF dataset. thermos2017deep propose an RGB-D sensorimotor dataset for the sensorimotor object recognition task. However, the affordances of objects do not simply correspond to appearance; they shift in response to the state of interactions between objects and humans. Therefore, chuang2018learning consider the problem of affordance reasoning in the real world by taking into account both the physical world and the social norms imposed by society, and construct the ADE-Affordance dataset based on ADE20k (zhou2017scene). Different from these works, Ours construct a Purpose-driven Affordance Dataset (PAD) considering the relationship between human purpose and affordance, which involves more complex scenarios and thus potentially benefits practical robot applications.

In addition to the image-based affordance datasets presented above, some datasets consider other aspects of affordance. wang2017binge extract a diverse set of scenes and how actors interact with different objects in the scenes from seven sitcoms. Subsequently, li2019putting extend this work to 3D indoor scenes and construct a 3D pose synthesizer that fuses semantic knowledge from 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. Recently, deng20213d propose a 3D AffordanceNet dataset containing k shapes from semantic object categories annotated with visual affordance categories.

In this paper, we focus on the visual affordance detection task and try to construct a benchmark to facilitate the research in this area. Specifically, we expand the size and diversity of the Purpose-driven Affordance Dataset in our preliminary work (Ours) and establish a large-scale PADv2 dataset with more complex scenes and richer affordance and object categories. Our dataset contains pixel-level and image-level labels and the depth information of the image to provide more comprehensive information for future study of affordance detection. Statistics about the existing image-based affordance datasets are summarized in Table 1.

2.3 One-Shot Learning

Few-shot learning refers to learning a model that can recognize new sample classes given a few reference images, and primarily concerns the model’s generalization ability. Existing works focus on metric-based, meta-based, and augmentation-based methods. The core idea of metric-based methods (cai2014attribute; snell2017prototypical; sung2018learning; vinyals2016matching) is to optimize the distance/similarity between images or regions. Meta-based approaches (finn2017model; he2020progressive; rusu2018meta; zhu2020self; ravi2016optimization; wang2016learning) mainly define a specific objective or loss function to guide the model toward a fast learning capability. Augmentation-based approaches (li2020adversarial) mainly consider synthesizing more data in different ways from new categories to facilitate the learning stage.

Few-shot segmentation is a more challenging task that predicts a label for each pixel instead of one for the whole image. shaban2017one propose a typical two-branch network. Later, dong2018few introduce the idea of prototypes. zhu2019one propose a one-shot texture retrieval (OS-TR) network. Given an example of a new reference texture, the network detects and segments all the pixels of the same texture category within an arbitrary image. CANet (zhang2019canet) averages the object features in the support image, extends them to the size of the query feature, concatenates them together, and leverages an iterative optimization module to refine the segmentation results. To cope with the problem of representing prototypes with ambiguity and intra-class variations due to the lack of training samples, wang2021variational leverage probabilistic hidden variables to represent prototype distributions, converting the discriminative model into a probabilistic model that allows for a more expressive representation of object categorical concepts. In addition, they formulate the optimization as a variational inference problem. johnander2021deep propose a few-shot learner formulation based on Gaussian process regression, enabling the network to model complex object appearance distributions in deep feature space. li2021adaptive utilize superpixel-guided clustering to generate multiple prototypes and allocate them to query features.

Our one-shot affordance detection task is quite different from one-shot segmentation. (1) One-shot affordance detection focuses on the affordance property of the object, which is not completely equivalent to its semantics, since the same object may have multiple affordances and different objects may also share the same affordance. (2) The inputs required for the two tasks are different. One-shot segmentation requires a support image and a pixel-wise object mask, while we only need the bounding boxes of the human and object and the human pose in the one-shot affordance detection task, which are easier to obtain by off-the-shelf object detectors and human pose detection networks.


Figure 3: The framework of the proposed OSAD-Net. OSAD-Net first uses a Resnet50 (he2016deep) to extract the features of support image and query images. Subsequently, the support feature, the bounding boxes of the person and object, and the pose of the person are fed into the APL module (see Section 3.3 for details) to obtain the human action purpose feature. Next, the human action purpose feature and query image feature are sent to the MPT module (see Section 3.4 for details) to transfer the human action purpose to query images and activate the object regions having the target affordance in the query images. Then, the output of the MPT is fed into a DCE module (see Section 3.5) to learn the commonality among objects of the same affordance and suppress the irrelevant background regions using a collaborative learning strategy. Finally, the enhanced features are fed into the decoder (see Section 3.6) to obtain the detection results.

3 Method

3.1 Problem Description

The one-shot affordance detection task involves two main sets, i.e., the query set and the support set. Given a support image from the support set and a set of query images from the query set, the goal of the task is to segment all objects in the query images that share the affordance indicated by the support image. Models are trained on base classes and tested on previously unseen (novel) classes in episodes. Each episode is formed by a support set and a query set of the same class. The support set contains only one sample of that class, which consists of the support image, the human bounding box, the object bounding box, and the human pose. The query set consists of query samples, each pairing a query image with its affordance mask label. In each batch, the input is one support sample and a set of query images.
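The episode structure described above can be sketched as plain data containers. This is only an illustrative sketch: the class names, field names, box layout, and the helper functions `split_classes` and `make_episode` are hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Keypoint = Tuple[float, float]           # assumed layout: (x, y) in pixels


@dataclass
class SupportSample:
    image: str            # path or identifier of the support image
    human_box: Box        # bounding box of the person
    object_box: Box       # bounding box of the interacted object
    pose: List[Keypoint]  # human pose keypoints


@dataclass
class QuerySample:
    image: str  # path or identifier of a query image
    mask: str   # path or identifier of its affordance mask label


@dataclass
class Episode:
    affordance: str             # class shared by support and queries
    support: SupportSample      # exactly one support sample (one-shot)
    queries: List[QuerySample]  # query samples of the same class


def split_classes(classes: Sequence[str], novel: Sequence[str]):
    """Split classes into disjoint base (training) and novel (test) lists."""
    novel_set = set(novel)
    base = [c for c in classes if c not in novel_set]
    held_out = [c for c in classes if c in novel_set]
    return base, held_out


def make_episode(affordance: str, support: SupportSample,
                 queries: Sequence[QuerySample], num_queries: int) -> Episode:
    """Build one episode from a support sample and the first num_queries queries."""
    if len(queries) < num_queries:
        raise ValueError("not enough query samples for this episode")
    return Episode(affordance, support, list(queries[:num_queries]))
```

Keeping base and novel classes disjoint is what makes the test episodes measure adaptation to unseen affordances rather than memorization.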

3.2 Pipeline

In this section, we briefly introduce the One-Shot Affordance Detection Network (OSAD-Net), as shown in Fig. 3. We first feed both the image in the support set and the images in the query set into a backbone network to extract features. In this paper, we use Resnet50 (he2016deep) as the backbone. Then, the feature maps of support image, the bounding boxes of human and object, and the pose of human are fed into the APL module (as shown in Fig. 4) to estimate the human action purpose feature, which implicitly defines the affordance in the current state. By leveraging the human-object interaction and the human pose information, APL can mitigate the affordance ambiguity issue brought by the fact that an object may have multiple affordances, i.e., multiple affordance possibilities collapse to an explicit affordance given the action purpose. Subsequently, we input the action purpose feature and the query image features into the MPT module to transfer the human action purpose into query images and activate the object regions belonging to the same affordance. After that, we feed the features of the activated query images into a DCE module (as shown in Fig. 6) to obtain enhanced features by mining the commonality among objects of the same affordance and effectively eliminate the influence of the appearance differences between object classes. Finally, the enhanced features are fed into the decoder to obtain the final detection results.


Figure 4: The action purpose learning module. It mainly considers the object appearance, the relative position relationship between the human and object, and the human’s pose to reason about the action purpose jointly.

3.3 Action Purpose Learning

The goal of the APL module is to infer the purpose of human action from the support image. To this end, we try to discover the human action purpose from three clues: human pose, the relative position of human and object, and the properties of object appearance (xu2019interact; zhong2021polysemy; wei2017inferring). For example, the bicycle wheel allows us to associate it with “Push”, while the structure of the bicycle provides the affordance of “Ride”, and the relative position of the person and the object, e.g., up and down or left and right, can be used to determine whether it is a bicycle or a cart. Besides, the human pose also provides essential clues for reasoning about human action purpose, e.g., the fact that a person’s leg is bent or straight can determine whether the action is cycling or pushing.

As shown in Fig. 4, the APL module receives four inputs: the support feature, the bounding box of the human, the bounding box of the object, and the pose of the human. Inspired by kipf2016semi; yan2018spatial, we use a Graph Convolutional Network (GCN) to process the human pose. The input to the GCN is the coordinates of the keypoints’ positions, and each layer applies a learnable weight matrix. Meanwhile, we introduce a second learnable weight matrix, initialized as an all-ones matrix, which is multiplied with the adjacency matrix of the skeleton graph to measure the importance of edges.

We feed the skeletal data through four layers of GCN and add a residual connection to the output of each layer. Afterward, a global average pooling layer after the final GCN layer produces the output of the skeleton branch. We then expand this output to the same size as the support feature, concatenate the two, and pass them through a convolution layer to obtain the relevant region feature activated by the intrinsic linkage provided by pose, as illustrated in Fig. 4.
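The skeleton branch can be illustrated with a small pure-Python sketch. It assumes the common symmetric normalization of the adjacency matrix with self-loops (as in Kipf and Welling's GCN), a ReLU non-linearity, and square weight matrices so that the residual connection is well defined; these choices, and all function names, are assumptions for illustration rather than the authors' exact formulation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(x, a_hat, edge_weight, w):
    """One layer: ReLU((a_hat * edge_weight) @ x @ w) plus a residual connection."""
    weighted = [[a * m for a, m in zip(ra, rm)]
                for ra, rm in zip(a_hat, edge_weight)]
    out = matmul(matmul(weighted, x), w)
    return [[max(0.0, o) + r for o, r in zip(ro, rx)]
            for ro, rx in zip(out, x)]


def pose_branch(keypoints, edges, weights):
    """Stack GCN layers over the keypoints, then average over joints."""
    n = len(keypoints)
    a_hat = normalized_adjacency(edges, n)
    # Edge-importance matrix at its initial value (all ones); it would be
    # updated by gradient descent during training.
    edge_weight = [[1.0] * n for _ in range(n)]
    h = [list(k) for k in keypoints]
    for w in weights:  # one weight matrix per layer (four in the text)
        h = gcn_layer(h, a_hat, edge_weight, w)
    return [sum(col) / n for col in zip(*h)]  # global average pooling
```

In a real model the first layer would lift the 2-D coordinates to a wider channel dimension; the sketch keeps the dimension fixed so that every layer can use the residual connection directly.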

Figure 5: The mixture purpose transfer module. Transferring the human action purpose to query images amounts to activating object regions with the same affordance, which can be modeled as a Gaussian mixture model, and the action purpose can be represented in the form of a compact set of bases.

Figure 6: The densely collaborative enhancement module. We perform a dense comparison between two query images and calculate their correlation, so as to suppress the background unrelated to affordance and obtain better segmentation results.

At the same time, we leverage object features for action purpose inference, i.e., we use the object features to activate the relevant regions of human-object interaction. As shown in Fig. 4, we first multiply the object bounding box with the support feature to mask the object, and then feed the result into a convolution layer and a global average pooling layer to obtain the feature representation of the object. Next, we calculate the correlation coefficient between this object representation and each location in the support feature. After normalizing the correlation coefficients with a Softmax function, we obtain the attention weights, which are then multiplied with the support feature to obtain the output of the object branch.

The relative position of the person and the object also provides critical information for reasoning about the human action purpose. Therefore, we concatenate the bounding boxes of the person and the object, feed them into a convolution layer to obtain the relative position feature, and then concatenate it with the output of the object branch. The final human-object interaction representation is obtained after a convolution layer.

Finally, we sum the pose-activated feature and the human-object interaction representation to get the feature representation of the human action purpose.
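The object-guided attention step can be sketched as follows. The sketch treats the support feature as a flat list of per-position vectors, represents the object bounding box as a 0/1 mask over positions, uses a plain dot product as the correlation, and omits the convolution layers; all of these are simplifying assumptions, and the function name is hypothetical.

```python
import math


def object_attention(features, box_mask):
    """Attend to support-feature positions that correlate with the object.

    features: list of C-dimensional vectors, one per spatial position.
    box_mask: list of 0/1 values, 1 where the position lies in the object box.
    Returns (attention weights, attended features).
    """
    n, c = len(features), len(features[0])
    # Mask the support feature with the object box.
    masked = [[v * m for v in f] for f, m in zip(features, box_mask)]
    # Global average pooling gives one vector describing the object.
    obj = [sum(f[k] for f in masked) / n for k in range(c)]
    # Correlation between the object vector and every position.
    scores = [sum(a * b for a, b in zip(obj, f)) for f in features]
    # Softmax over positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Re-weight each position of the support feature.
    attended = [[w * v for v in f] for f, w in zip(features, weights)]
    return weights, attended
```

Positions whose features resemble the object receive larger weights, which is how the object appearance helps highlight the interaction region.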

3.4 Mixture Purpose Transfer

After inferring the action purpose from the support image with the APL module, we introduce the MPT module to transfer the action purpose to the query images, activating all object regions that can accomplish that purpose. Since different categories of objects may share the same affordance, objects with the same affordance can differ significantly in appearance. In addition, poses vary greatly for the same action purpose, which makes the transfer process challenging. Inspired by probabilistic models (johnander2021deep; wang2021variational; dempster1977maximum; sun2021attentional), we use a Gaussian mixture model (richardson1997bayesian) to account for the differences within the same affordance category and use multiple affordance-related purpose prototypes to jointly represent an action purpose. As shown in Fig. 5, we use a compact set of bases to encode the action purpose, which is obtained by Expectation-Maximization (EM) iterations.

In the MPT module, we run the “E-step” and “M-step” alternately on the action purpose feature to obtain a compact set of bases and then use them to reconstruct the query features. A set of bases is first randomly initialized. The E-step then estimates the latent variables, i.e., the weight of each base at each pixel of the action purpose feature, using an exponential kernel function between the feature at that pixel and the base. In the M-step, each base is updated as the weighted average of the pixel features under these weights.

After several EM iterations, we obtain a set of bases, which is then used to reconstruct the query image features. For the input set of query image features, we calculate the attention weights of each query pixel with respect to the bases, and use these weights and the bases to reconstruct the query image feature. Finally, we concatenate the reconstructed feature and the query feature to obtain the output of the MPT module after a convolution layer.
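The EM procedure can be sketched in a few lines of pure Python. The sketch assumes the exponential kernel is the exponential of an inner product, normalizes the E-step weights over the bases, and uses a seeded Gaussian initialization; the function names, the number of iterations, and these details are assumptions for illustration and may differ from the authors' implementation.

```python
import math
import random


def responsibilities(feats, bases):
    """E-step: weight of each base at each pixel, normalized over the bases."""
    out = []
    for x in feats:
        logits = [sum(a * b for a, b in zip(x, mu)) for mu in bases]
        top = max(logits)  # shift for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def update_bases(feats, resp, k):
    """M-step: each base becomes the weighted average of the pixel features."""
    dim = len(feats[0])
    bases = []
    for j in range(k):
        total = sum(r[j] for r in resp)
        bases.append([sum(r[j] * x[d] for r, x in zip(resp, feats)) / total
                      for d in range(dim)])
    return bases


def fit_bases(feats, k, iters=3, seed=0):
    """Alternate E-step and M-step from a random initialization."""
    rng = random.Random(seed)
    dim = len(feats[0])
    bases = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    for _ in range(iters):
        resp = responsibilities(feats, bases)
        bases = update_bases(feats, resp, k)
    return bases


def reconstruct(query_feats, bases):
    """Rebuild each query pixel as an attention-weighted mix of the bases."""
    resp = responsibilities(query_feats, bases)
    dim = len(bases[0])
    return [[sum(r[j] * bases[j][d] for j in range(len(bases)))
             for d in range(dim)] for r in resp]
```

Because every reconstructed query pixel is a convex combination of the purpose bases, query regions that match the action purpose are expressed in terms of it, while unrelated content is only weakly represented.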

3.5 Densely Collaborative Enhancement Module

After transferring the human action purpose from the support image to the query images, the object regions activated by the MPT module are coarse due to the limited representation ability of purpose feature given the fact that different categories of objects may have the same affordance and they have different appearances. In order to obtain more accurate affordance object regions, we introduce a DCE module to enhance the feature representation by capturing the common relationship between the objects having the same affordance.

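As shown in Fig. 6, for the query feature $F_i$, we compute the correlation mask between $F_i$ and $F_j$ ($j \neq i$) by dense comparison. We first compute the pixel-level cosine similarity $s_{ij}(p, q)$ between $F_i$ and $F_j$ as follows:

$$s_{ij}(p, q) = \frac{F_i(p)^{\top} F_j(q)}{\|F_i(p)\| \, \|F_j(q)\|}, \quad p, q \in \{1, \dots, HW\}.$$

For each $F_i(p)$, we take the maximum similarity among all pixels of $F_j$ as the correspondence value:

$$c_{ij}(p) = \max_{q} s_{ij}(p, q).$$

Then, we reshape $c_{ij}$ from $HW \times 1$ to $H \times W$ and perform min-max normalization to obtain the correlation mask $M_{ij}$ of $F_j$ for query $F_i$. For query $F_i$, $n-1$ correlation masks can be obtained in this way, where $n$ is the number of query images. Then we calculate the average of the correlation masks to obtain the output $M_i$:

$$M_i = \frac{1}{n-1} \sum_{j \neq i} M_{ij}.$$

Finally, $M_i$ is multiplied with $F_i$ to obtain the enhanced feature, which is then concatenated with $F_i$ and fed into a convolution layer to get the $i$-th query feature $F_i'$.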
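
The dense comparison can be sketched in a few lines. The code below is an illustrative assumption rather than the released implementation: each feature map is represented as a flat list of per-pixel vectors (so the reshape step is skipped), the function names are hypothetical, and the final multiplication, concatenation, and convolution are left out.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two pixel feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)


def correlation_mask(query, other, eps=1e-8):
    """Max similarity of each query pixel to any pixel of `other`,
    followed by min-max normalization to the range [0, 1]."""
    best = [max(cosine(q, o) for o in other) for q in query]
    low, high = min(best), max(best)
    return [(b - low) / (high - low + eps) for b in best]


def averaged_mask(features, index):
    """Average the correlation masks of query `index` against all others."""
    query = features[index]
    masks = [correlation_mask(query, other)
             for j, other in enumerate(features) if j != index]
    return [sum(values) / len(masks) for values in zip(*masks)]
```

Taking the maximum over the other image's pixels means a query pixel is highlighted as soon as it resembles any region of another query, and averaging over all other queries favors regions that are shared across the whole group.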

Figure 7: Some examples from our PADv2 dataset. We visualize the foreground objects from some representative affordance categories. Panel labels: pitcher, milk can, bowl (Contain-2); spatula, whisk (Mix); bicycle, motorbike (Ride); banjo, guitar, zither (Play-2).
Figure 8: Images and their annotations from our PADv2 dataset. PADv2 dataset has rich annotations such as affordance masks (the second row in each group) as well as depth information (the third row in each group). Group labels: Jump, Throw, Swing, Sit.

3.6 Decoder

For the - query image, the output of the - decoder layer is denoted as , , where . Subsequently, a convolutional prediction layer is appended to each to get the side output, i.e., , . The cross-entropy loss is used as the training objective. For , we calculate the loss as:


where denotes that the prediction map in which each pixel denotes the affordance confidence. and denote the pixel sets of the affordance regions and non-affordance regions, respectively. The final training objective is defined as:


4 The PADv2 Dataset

In this section, we describe the details of the proposed Purpose-driven Affordance Dataset v2 (PADv2), including the process of collecting images, annotations and statistical analysis of the dataset. Some examples from our PADv2 dataset are shown in Fig. 7, and affordance masks and depth information annotations are shown in Fig. 8.

Figure 9: The properties of our dataset. (a) The distribution of categories in the PADv2 dataset, which includes 39 affordance categories and 103 object categories. (b) The word cloud distribution of affordances in the PADv2 dataset. (c) Statistics of the number of images in each of the 103 object classes in the PADv2 dataset. (d) Confusion matrix between the affordance category and the object category in the PADv2 dataset, where the horizontal axis denotes the object category and the vertical axis denotes the affordance category. (e) Visualization of the spatial distribution of affordance masks from specific affordance categories and the average mask of the PADv2 dataset. (f) The distribution of co-occurring attributes of the PADv2 dataset. The number in each grid denotes the total number of images having a pair of specific attributes, which are described in detail in Table 3.

Class | Description | Object Class
Play-1 | Objects that make a sound when played by a person through a string as a medium. | violin, cello, erhu fiddle, viola
Play-2 | Objects that can interact directly with the hand to produce sound. | guitar, banjo, harp, pipa
Play-3 | Objects that can produce sound by a person playing a keyboard. | piano, accordion, electric piano
Play-4 | Objects that a person blows through their mouth to make a sound. | flute, trumpet, french horn, harmonica, cucurbit flute
Take photo | Objects that can take pictures of people. | camera
Contain-1 | Containers that can hold a variety of household items and miscellaneous goods. | backpack, gift box, handbag, purse, shopping trolley, suitcase, storage box, pushchair
Contain-2 | Objects that are capable of containing a variety of liquids. | cup, bowl, beaker, pitcher, milk can, beer bottle, vase, soap dispenser, watering can
Contain-3 | Containers that can be used to hold food. | bowl, frying pan, plate
Scoop | Objects that have the specific ability to scoop food up from the bottom. | spatula, spoon, soup ladle
Wear-1 | Objects that are worn on a person’s head for decoration or protection. | hat, helmet
Wear-2 | Objects with a combination of lenses and frames used to improve vision or protect the eyes. | glasses
Wear-3 | Objects that are worn on the feet. | high heels, boots, slippers, sports shoes
Wear-4 | Objects that are worn on the hand for decoration, warmth or protection. | gloves
Sit | Objects that can be used to sit on. | stool, sofa, bench, swing, wheelchair, rocking chair
Cut | Objects that have the ability to cut other objects. | knife, scissors
Pick up | Objects that have the function of holding food up. | chopsticks
Brush | Objects that can remove dirt or apply cosmetics. | toothbrush, broom, brushes, cosmetic brushes
Ride | Vehicles that can be used for riding. | bicycle, motorbike
Kick | Objects that can be kicked in direct contact with the foot. | soccer ball, punching bag, rugby ball
Hit | Tools that can be used to strike other objects. | axe, hammer, baseball bat, hoe, rolling pin
Beat | Objects that can be played by beating a surface to produce a sound. | drum
Jump | Objects that allow rapid movement by allowing people to jump on surfaces. | snowboard, surfboard, skis, skateboard
Swing | Objects that a person interacts with by swinging their arm. | baseball bat, tennis racket, table tennis bat
Lie | Objects with a large surface space that allow a person to lie down. | sofa, baby bed, bench, pushchair, rocking chair
Bounce | Objects that can be slapped directly by a person’s hand and bounced. | basketball, volleyball
Mix | Objects that can be used for mixing. | whisk, chopsticks, spoon, soup ladle, spatula
Look out | Objects that can be used for seeing at a distance. | binoculars
Fork | Tools used to fork up food. | fork
Shelter | Objects that provide shade from the sun or rain. | umbrella
Roll dough | Objects that can be used to roll out dough. | rolling pin
Rolling | Objects that can be made to roll by hitting them through an intermediate medium. | table tennis ball, croquet ball, tennis ball, golf ball, baseball
Lift | Objects that are lifted and lowered for fitness purposes. | dumbbell
Throw | Objects that a person uses to throw. | frisbee, bowling, dart, javelin, basketball, weight throw, baseball, rugby ball
Boxing | Objects hit in boxing sports. | punching bag
Push&Pull | Objects with wheels below that can be pushed or pulled. | wheelchair, bicycle, motorbike, suitcase, pushchair
Crutches | Objects that can assist people in walking or standing. | crutch, umbrella
Standing | Objects that can be stepped on to reach a higher position. | stool, desk, bench, sofa
Support | Objects with a smooth surface that can hold various items. | desk
Write | Objects that can be used for writing. | pen, writing brush
Table 2: Definition of the affordance categories in the PADv2 dataset and the object categories contained in each affordance category.
Attr. | Description
AC | Appearance Change. Significant lighting changes appear in the object area of the image.
BO | Big Object. The ratio of object area to image area is greater than 0.5.
HO | Heterogeneous Object. Refers to objects that are composed of visually distinct or dissimilar parts.
OV | Out-of-View. The object is partially clipped by the image boundaries.
SC | Shape Complexity. The object has complex boundaries such as thin parts and holes.
SO | Small Object. The ratio of object area to image area is less than 0.1.
Table 3: The attribute list and associated description of the affordance object image. The choice of these attributes is inspired by perazzi2016benchmark.
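
Three of the attributes in Table 3 (BO, SO, OV) are defined purely geometrically, so they can be derived from a binary object mask. The following is a minimal sketch under stated assumptions, not the annotation tool used for PADv2: the function name is hypothetical, the mask is a 2D list of 0/1 values, and "clipped by the image boundaries" is approximated by "the mask touches the image border". The thresholds 0.5 and 0.1 are the ones given in Table 3. AC, HO, and SC need appearance or boundary analysis and are not covered.

```python
def mask_attributes(mask):
    """Return the subset of {"BO", "SO", "OV"} that applies to a binary mask."""
    height, width = len(mask), len(mask[0])
    area = sum(sum(row) for row in mask)
    if area == 0:
        return set()  # no object, so no attribute applies
    ratio = area / (height * width)
    attributes = set()
    if ratio > 0.5:
        attributes.add("BO")  # Big Object
    if ratio < 0.1:
        attributes.add("SO")  # Small Object
    # Assumption: an object touching the border is treated as out-of-view.
    touches_border = (any(mask[0]) or any(mask[-1])
                      or any(row[0] or row[-1] for row in mask))
    if touches_border:
        attributes.add("OV")
    return attributes
```

Counting how often two attributes appear in the same image then gives a co-occurrence grid of the kind shown in Fig. 9 (f).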

4.1 Dataset Collection and Annotation

In this section, we describe the collection and annotation process of our PADv2 dataset. The annotations cover the query images (masks), the support images (object masks, bounding boxes of the person and the object, and human pose), and the depth map of each image.

Data Collection. We construct the PADv2 dataset by collecting images mainly from ILSVRC (russakovsky2015imagenet), COCO (lin2014microsoft), etc. We retrieve and collect the images based on the keywords of object categories. The images in these datasets are from different scenes and have different object appearances, making the affordance detection task more challenging. In addition, to increase the diversity of the dataset, we also collect some images from the Internet. Finally, our dataset contains 30k images, covering 39 affordance categories and 103 object categories. The affordance and the object categories are shown in Fig. 9 (a). The word cloud statistics of all affordances are shown in Fig. 9 (b) and the statistics of objects in each affordance category are shown in Fig. 9 (c). Compared to the PAD dataset (Ours), the PADv2 dataset is richer in content and more complex in scenarios. Furthermore, the multiple possibilities of object affordance in our PADv2 dataset make it more challenging and realistic for real-world application scenarios.

Category Annotation. We build a hierarchy for the PADv2 dataset by selecting common categories (e.g., “bicycle”, “stool”, “tennis racket”, “basketball”, “binoculars”, “chopsticks”, “sofa”, “soccer ball”) and assigning affordance category labels to each of them. The description of each affordance category of our PADv2 dataset and the object categories it contains are shown in Table 2. An affordance category may cover multiple object categories, e.g., objects having the “Swing” affordance label include “tennis racket”, “golf club”, “baseball bat”, “badminton racket”, etc., and the appearances of these objects vary greatly. Furthermore, an object may have multiple affordances. For example, objects belonging to the “chopsticks” category have both the “Pick Up” and “Mix” affordances. Fig. 9 (d) shows the confusion matrix between the affordance category and the object category.
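
The many-to-many relation between objects and affordances can be made concrete with a small lookup structure. The sketch below is only an illustration, not the dataset's actual tooling: it copies a handful of rows from Table 2 into a dictionary (the variable and function names are hypothetical) and inverts it to list every affordance of an object.

```python
from collections import defaultdict

# A few rows copied from Table 2 (affordance -> object categories).
AFFORDANCE_TO_OBJECTS = {
    "Pick up": ["chopsticks"],
    "Mix": ["whisk", "chopsticks", "spoon", "soup ladle", "spatula"],
    "Scoop": ["spatula", "spoon", "soup ladle"],
    "Ride": ["bicycle", "motorbike"],
    "Push&Pull": ["wheelchair", "bicycle", "motorbike", "suitcase", "pushchair"],
}


def invert(affordance_to_objects):
    """Build the reverse index: object category -> sorted affordance labels."""
    object_to_affordances = defaultdict(set)
    for affordance, objects in affordance_to_objects.items():
        for obj in objects:
            object_to_affordances[obj].add(affordance)
    return {obj: sorted(labels) for obj, labels in object_to_affordances.items()}


OBJECT_TO_AFFORDANCES = invert(AFFORDANCE_TO_OBJECTS)
```

With this subset, `OBJECT_TO_AFFORDANCES["chopsticks"]` lists both "Mix" and "Pick up", matching the example in the text.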

Query Image Annotation. Part of the images from COCO (lin2014microsoft) are already labeled with masks. However, objects having the same affordance may not all have been labeled in that dataset (for example, the cups and bowls are not labeled in the category of cups), so we filter these images and label them manually. For the images downloaded from the Internet, we also manually label the objects with the defined affordance categories. Some affordance masks from the PADv2 dataset are shown in Fig. 8.

Support Image Annotation. Most of the support images come from the Internet, and we annotate the human-object interaction within each support image, i.e., the bounding box of the human and the bounding box of the interacting object. To facilitate the comparison with the one-shot segmentation methods, we also provide mask labels for the objects in the support images. Furthermore, we extract the human pose using zhang2021towards, which can provide more information for inferring human action purpose.

Depth Information. As depth information provides a wealth of spatial structure, layout information, and geometric cues, it can facilitate the research of affordance detection. Therefore, we also provide the depth map of each image in the PADv2 dataset. We use the depth estimation network (Ranftl2020) to extract the depth map. Some examples of depth maps are shown in Fig. 8.

4.2 Dataset Features and Statistics

To get deeper insights into the PADv2 dataset, we show its important features from the following aspects.

Category Diversity of Object. The PADv2 dataset contains 39 affordance categories and 103 object categories, including possible human-object interactions in outdoor, kitchen, living room, and other scenes. Each affordance category may cover multiple object categories. Together, the affordance categories cover most general objects in human life, supporting research towards a comprehensive understanding of real-world scenes.

The Multiple Possibilities of Affordance. PADv2 dataset reflects the multiple possibility property of affordance, i.e., the same object may have multiple affordances and different categories of objects may have the same affordance. The confusion matrix between the affordance category and the object category is shown in Fig. 9 (d), from which we can see the pairwise relationship between object and affordance. In this sense, our dataset is of great research value in affordance perception, scene understanding, and robotics.

Spatial distribution of affordance masks. Fig. 9 (e) shows the average affordance mask of specific affordance categories and the average mask of all affordance categories in the PADv2 dataset. It can be observed that some categories have distinctive locations and shapes. For example, “Sit” is mainly distributed in the lower part of the image, while “Shelter” is mainly in the upper part of the image and has a distinctive shape. In contrast, “Mix”, “Write”, and “Pick Up” show no clear shape or position bias. The average mask of all categories is centered in the image with a circular shape.
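
An average mask of this kind is simply a pixel-wise mean over binary masks. The sketch below shows one way to compute it and to summarize the vertical position bias in a single number. It assumes all masks have already been resized to a common resolution (the text does not state how this was done), and both function names are hypothetical.

```python
def average_mask(masks):
    """Pixel-wise mean of equally sized binary masks given as 2D lists."""
    count = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / count for c in range(width)]
            for r in range(height)]


def vertical_center(avg):
    """Normalized vertical center of mass: 0.0 is the top row, 1.0 the bottom."""
    total = sum(sum(row) for row in avg)
    if total == 0 or len(avg) == 1:
        return 0.0
    weighted = sum(r * sum(row) for r, row in enumerate(avg))
    return weighted / total / (len(avg) - 1)
```

Under this summary, a category such as “Sit” would be expected to yield a value closer to 1.0 and “Shelter” a value closer to 0.0.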

Property Analysis. The attribute information of the images in the PADv2 dataset facilitates future research on evaluating model performance regarding different types of parameters. Inspired by perazzi2016benchmark, we define a set of attributes to represent specific situations faced in real scenarios. Table 3 summarizes the list of attributes and their descriptions, and Fig. 9 (f) shows the distribution of image attributes in the dataset. Since real-world scenes consist of materials with different visual properties, “HO” types occupy a large portion. There are also many “SO” samples, which indicates that the dataset contains quite a few small objects and underlines its challenging nature for affordance detection. It can also be seen from the figure that “HO” and “SO” have a strong dependency. In addition, images with the “OV” attribute also make up a large portion of the dataset, which poses a further challenge for one-shot affordance detection.

5 Experiments

In this section, we present the experimental settings, results and analysis. Section 5.1 provides details of our benchmark setting. Section 5.2 describes state-of-the-art methods from five relevant areas for comparison. In Section 5.3, we introduce the implementation details of our model. Section 5.4 shows the experimental results and analyses. Specifically, Section 5.4.1 presents the results of different models on the PADv2 dataset and the analysis from multiple perspectives. Section 5.4.2 presents the results of multiple models on the PAD dataset. In Section 5.4.3, we investigate the impact of different modules in the proposed OSAD-Net and the hyper-parameter settings on the performance.

Fold | Affordance Classes in the Test Set
1 | Bounce, Boxing, Contain-1, Crutches, Kick, Lie, Play-1, Push&Pull, Ride, Shelter, Sit, Throw, Wear-1
2 | Brush, Contain-2, Fork, Hit, Jump, Lift, Mix, Pick Up, Play-2, Roll Dough, Scoop, Swing, Wear-2
3 | Beat, Contain-3, Cut, Look Out, Play-3, Play-4, Rolling, Standing, Support, Take Photo, Wear-3, Wear-4
Table 4: The division details of the PADv2 dataset for 3-fold evaluation. The table shows the affordance categories in each test set and the remaining part is the training set.

5.1 Benchmark Setting

To comprehensively evaluate different methods, we choose five widely used metrics for the One-Shot Affordance Detection task, i.e., IoU (long2015fully), F-measure ($F_\beta$) (arbelaez2010contour), E-measure ($E_\phi$) (18IJCAI-Emeasure), Pearson’s Correlation Coefficient (CC) (le2007predicting), and Mean Absolute Error (MAE) (perazzi2012saliency). The evaluation code is released at

  • IoU (long2015fully). It is a critical metric commonly used to measure the results of pixel-level predictions and is calculated as follows:

    $$IoU = \frac{|P \cap G|}{|P \cup G|},$$

    where $P$ and $G$ represent the prediction result and ground truth, respectively, $\cap$ is the intersection operation, and $\cup$ is the union operation.

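  • F-measure ($F_\beta$) (arbelaez2010contour). It is a broadly used evaluation metric that takes into account both recall and precision:

    $$F_\beta = \frac{(1 + \beta^2) \, Precision \times Recall}{\beta^2 \, Precision + Recall}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN},$$

    where TP indicates True Positives, FP represents False Positives, and FN means False Negatives. $\beta$ is a hyper-parameter that balances recall and precision and is set to a fixed value in this paper.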

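  • E-measure ($E_\phi$) (18IJCAI-Emeasure). It jointly evaluates the difference between the prediction result and ground truth from a local and global perspective:

    $$E_\phi = \frac{1}{W \times H} \sum_{i=1}^{W \times H} \phi(i),$$

    where $\phi$ refers to the enhanced alignment matrix, $i$ is the index of each pixel, and $W$ and $H$ are the width and height of the map.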

  • Pearson’s Correlation Coefficient (CC) (le2007predicting). It is a statistical metric commonly used to measure the correlation and dependence between two variables. In this paper, we use it to evaluate the correlation between the predicted map $P$ and ground truth $G$:

    $$CC = \frac{cov(P, G)}{\sigma(P) \, \sigma(G)},$$

    where $cov(P, G)$ is the covariance of $P$ and $G$, and $\sigma(P)$ and $\sigma(G)$ are the standard deviations of $P$ and $G$, respectively. CC is symmetric and penalizes false positives and negatives equally.

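  • Mean Absolute Error (MAE) (perazzi2012saliency). It measures the average absolute distance between the normalized predicted map $P$ and the ground truth $G$:

    $$MAE = \frac{1}{W \times H} \sum_{i=1}^{W \times H} |P(i) - G(i)|,$$

    where $W$ and $H$ are the width and height of the map and $i$ is the index of each pixel.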
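
Four of the metrics listed above are short enough to sketch directly. The code below is not the released evaluation code: it is a dependency-free illustration with hypothetical function names that operates on flattened maps (lists of numbers). IoU and the F-measure assume the prediction has already been binarized (the thresholding strategy is not specified here), `beta2` must be supplied by the caller because its value is not reproduced in this text, and the E-measure is omitted since it requires the full enhanced alignment matrix.

```python
import math


def iou(pred, gt):
    """Intersection over union of two binary masks (flat 0/1 lists)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def f_measure(pred, gt, beta2):
    """F-measure of a binary prediction; beta2 weights precision vs. recall."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


def pearson_cc(pred, gt):
    """Pearson correlation between a predicted map and the ground truth."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    dev_g = math.sqrt(sum((g - mean_g) ** 2 for g in gt))
    return cov / (dev_p * dev_g) if dev_p and dev_g else 0.0


def mae(pred, gt):
    """Mean absolute error between a map in [0, 1] and the ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
```

Higher is better for IoU, the F-measure, and CC, while lower is better for MAE, which matters when the metrics are later combined into a ranking.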


To evaluate different models comprehensively, we follow the $k$-fold evaluation protocol, where $k$ is 3 in this paper. To this end, the dataset is divided into three parts with non-overlapping categories, where any two of them are used for training while the remaining part is used for testing. The affordance categories included in each fold are shown in Table 4. In each fold, the affordance categories listed in Table 4 form the test set, and all remaining categories form the training set.
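
The category-disjoint split can be expressed as a small helper. This is a sketch, not the official data loader: the fold indices 1 to 3 and all names are assumptions, and the category lists are copied from Table 4 exactly as it appears in this text.

```python
# Test-set affordance categories per fold, copied from Table 4.
FOLDS = {
    1: ["Bounce", "Boxing", "Contain-1", "Crutches", "Kick", "Lie", "Play-1",
        "Push&Pull", "Ride", "Shelter", "Sit", "Throw", "Wear-1"],
    2: ["Brush", "Contain-2", "Fork", "Hit", "Jump", "Lift", "Mix", "Pick Up",
        "Play-2", "Roll Dough", "Scoop", "Swing", "Wear-2"],
    3: ["Beat", "Contain-3", "Cut", "Look Out", "Play-3", "Play-4", "Rolling",
        "Standing", "Support", "Take Photo", "Wear-3", "Wear-4"],
}


def split_for_fold(test_fold, folds=FOLDS):
    """Return (train_categories, test_categories) for one evaluation fold."""
    test = sorted(folds[test_fold])
    train = sorted(category
                   for fold, categories in folds.items() if fold != test_fold
                   for category in categories)
    return train, test
```

Because the split is made over affordance categories rather than images, every test affordance is unseen during training, which is what makes the protocol a test of one-shot generalization.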

5.2 Comparison Methods

To demonstrate the superiority of our model, we select three segmentation models (UNet, PSPNet, and DeepLabV3+), three salient object detection models (CPD, BASNet, and CSNet), one co-saliency model (CoEGNet), two few-shot segmentation models (CANet and PFENet), and two affordance detection models (RANet and OSAD-Net) for comparison.

  • UNet (10.1007/978-3-319-24574-4_28): U-Net contains a contracting path to capture the semantic context and a symmetric expanding path to enable fine-grained localization.

  • PSPNet (zhao2017pspnet): Pyramid Scene Parsing Network leverages the pyramid pooling module to aggregate contextual information from different receptive fields, thus improving the ability for semantic segmentation tasks.

  • DeepLabV3+ (deeplabv3plus2018): DeepLabV3+ encodes multi-scale features at different receptive fields by introducing an atrous spatial pyramid pooling (ASPP) module. Furthermore, it extends DeepLabV3 (chen2017rethinking) by adding a decoding module to refine the segmentation results, especially for object boundaries.

  • CPD (Wu_2019_CVPR): Cascaded Partial Decoder discards the use of shallow layer information and proposes a cascaded encoding-decoding structure.

  • BASNet (Qin_2019_CVPR): Boundary-Aware Salient object detection Network obtains a coarse result from an encoder-decoder structure and then refines it using a residual refinement module. In addition, it also utilizes a new loss, i.e., a mixture of the cross-entropy loss, the structural similarity loss, and the IoU loss.

  • CSNet (GaoEccv20Sal100K): Cross-Stages Network uses a generalized OctConv that effectively exploits multi-scale features within and across levels while reducing feature redundancy through a novel dynamic weight decay scheme.

    Method UNet PSPNet DLabV3+ CPD BASNet CSNet CoEGNet CANet PFENet RANet OSAD-Net Ours
    Params (M)
    IoU 0.483
    () 0.723
    CC 0.587
    MAE 0.123
    Table 5: The experimental results of models (UNet (10.1007/978-3-319-24574-4_28), PSPNet (zhao2017pspnet), DeeplabV3+ (DLabV3+) (deeplabv3plus2018), CPD (Wu_2019_CVPR), BASNet (Qin_2019_CVPR), CSNet (GaoEccv20Sal100K), CoEGNet (deng2021re), CANet (zhang2019canet), PFENet (tian2020prior), RANet (zhao2020object), OSAD-Net (Ours)) on the PADv2 dataset in terms of five metrics (IoU  (long2015fully),   (arbelaez2010contour),   (18IJCAI-Emeasure), CC  (le2007predicting), and MAE  (perazzi2012saliency)). Bold and underline indicate the best and the second-best scores, respectively.
  • CoEGNet (deng2021re): Co-Edge Guidance Network uses a co-saliency mapping strategy, i.e., the PCA technique, to identify the main components of common objects, helping to retain common objects and remove interference, thus enhancing the performance of EGNet (zhao2019egnet) for co-salient object detection.

  • CANet (zhang2019canet): Class-Agnostic segmentation Network mainly contains a two-branch dense comparison module to compare the multi-scale features of support image and query image, and an iterative optimization module to refine the prediction results.

  • PFENet (tian2020prior): Prior guided Feature Enrichment Network uses a training-free prior generation to improve the segmentation accuracy and generalization performance, and a feature enrichment module to address spatial inconsistency.

  • RANet (zhao2020object): Relationship-Aware Network improves the prediction results of affordance detection by exploring the relationship between affordance and objectness.

  • OSAD-Net (Ours): One-Shot Affordance Detection Network first learns the human action purpose, then transfers it to query images to obtain object regions with the same affordance, and finally obtains results by collaborative learning.

5.3 Implementation Details

Our method is implemented in PyTorch. We choose the ResNet50 network pre-trained on ImageNet (russakovsky2015imagenet) as the backbone, where the first three blocks are fixed. We randomly clip the input images from to with random horizontal flipping. We train the model for epochs using the Adam optimizer (kingma2014adam). The learning rate is initialized as and reduced by after epochs. It takes about one day for training on a single NVIDIA TitanXP GPU. We set the batch size to and set the number of query images in each batch to . For the action purpose learning module, we choose a four-layer GCN to process the skeletal data. The number of bases in the MPT module is =. The number of EM iteration steps is .

5.4 Analysis of Experimental Results

In this section, we provide experimental results of all the comparison models on the PADv2/PAD datasets and analyze the experimental results. Then, we perform ablation studies to investigate the impact of different modules and hyper-parameter settings.

Figure 10: Rank list. We rank the metrics of different methods in the 3-fold evaluation on the PADv2 dataset (see Table 5 for the experimental results), where each entry denotes how many metrics rank the model at the corresponding position. The red number on the left denotes the average rank (1.33 for Ours).
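
A rank list of this kind can be derived from a table of per-metric scores. The sketch below shows one plausible way to do so. It is an assumption about the procedure rather than the authors' script: the function names are hypothetical, the caller states for each metric whether higher is better (MAE is the only lower-is-better metric here), and ties are broken by the original dictionary order.

```python
def rank_methods(scores, higher_is_better):
    """scores: {metric: {method: value}} -> {method: [rank under each metric]}."""
    ranks = {}
    for metric, per_method in scores.items():
        ordered = sorted(per_method, key=per_method.get,
                         reverse=higher_is_better[metric])
        for position, method in enumerate(ordered, start=1):
            ranks.setdefault(method, []).append(position)
    return ranks


def summarize(ranks):
    """Return {method: (average rank, {rank: number of metrics at that rank})}."""
    summary = {}
    for method, positions in ranks.items():
        counts = {}
        for position in positions:
            counts[position] = counts.get(position, 0) + 1
        summary[method] = (sum(positions) / len(positions), counts)
    return summary
```

The per-rank counts correspond to the entries of the matrix in the figure, and the first element of each tuple corresponds to the average rank printed next to the method name.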
Classes UNet PSPNet DLabV3+ CPD BASNet CSNet CoEGNet CANet PFENet RANet OSAD-Net Ours
Play-1 0.544
Play-2 0.648
Play-3 0.552
Play-4 0.583
Take Photo 0.517
Contain-1 0.534
Contain-2 0.566
Contain-3 0.549
Scoop 0.381
Wear-1 0.448
Wear-2 0.325
Wear-3 0.630
Wear-4 0.357
Sit 0.409
Cut 0.380
Pick Up 0.442
Brush 0.429
Ride 0.441
Kick 0.665
Hit 0.486
Beat 0.650
Jump 0.272
Swing 0.379
Lie 0.552
Bounce 0.473
Mix 0.346 0.346
Look Out 0.603
Fork 0.110
Shelter 0.355
Roll dough 0.553
Rolling 0.530
Lift 0.691
Throw 0.464
Boxing 0.747
Push and Pull 0.493
Crutches 0.377
Standing 0.504
Support 0.529 0.529
Write 0.651
Table 6: The results of the different methods on the PADv2 for each affordance category. We use IoU as the evaluation metric. Bold and underline indicate the best and the second-best scores, respectively.

Figure 11: Visual affordance maps obtained by different methods on the PADv2 dataset. Panel labels: Image, GT, Ours, OSAD-Net, PFENet, CoEGNet, CPD, PSPNet.

Figure 12: F-measure curves and PR curves of models on the PADv2 dataset. The first and second rows are the F-measure and PR curves of the 3-fold evaluation, respectively.

5.4.1 Performance on the PADv2 dataset

We compare our method with the representative methods of semantic segmentation, salient object detection, co-salient object detection, one-shot segmentation, and affordance detection as described in Section 5.2. The experimental results are summarized in Table 5. We also calculate the mean values of all metrics of the 3-fold evaluation. As can be seen, our model outperforms all other methods in terms of all metrics. Taking the IoU metric as an example, our model outperforms the best affordance detection method by %, the best one-shot segmentation method by %, the co-salient object detection method by %, the best salient object detection method by %, and the best segmentation method by %. Furthermore, to compare the ranking of different methods in terms of all metrics, we rank all the evaluation metrics of the 3-fold evaluation and summarize the result as a matrix as shown in Fig. 10, where the element indicates how many metrics the