Affordances have attained new relevance in robotics over the last decade [10, 15]. Affordance refers to the possibility of performing different tasks with an object. As an example, grasping a pair of scissors from the tip affords a handover task, but not a cutting task. Analogously, not every region on a mug’s handle affords comfortably pouring liquid from it. Current grasp affordance solutions successfully detect the parts of an object that afford different tasks [13, 3, 4, 2, 5, 1]. This allows agents to contextualise the grasp according to the target task and to generalise to novel object instances. Nonetheless, these approaches lack an insight into the level of suitability that the grasp offers to accomplish the task. As a consequence, current literature on grasp affordance cannot guarantee any level of performance when executing the task and, in fact, not even a successful task completion.
On the grounds of the limitations mentioned above, a system should consider the expected task performance when deciding a grasp affordance. However, this is a challenging problem, given that the grasp and the task performance are codefining and conditional on each other. Recent research in robot affordances proposes to learn this relation via trial and error of the task [6, 14, 11]. Nevertheless, given the extensive amount of required data, these methods can only learn a single task at a time and only perform in known scenarios. In contrast, an autonomous agent is expected to be capable of dealing with multiple task affordance problems even when those involve unfamiliar objects and new scenarios.
In this paper, we present a novel experience-based pipeline for self-assessment of grasp affordance transfer (SAGAT) that seeks to overcome the lack of deployment reliability of current state-of-the-art methods of grasp affordance detection. The proposed approach, depicted in Fig. 1, starts by extracting multiple grasp configuration candidates from a given grasp affordance region. The outcome of executing a task from the different grasp candidates is estimated via forward simulation. These estimates are employed to evaluate and rank the relation of task performance and grasp configuration candidates via a heuristic confidence function. Such information is stored in a library of task affordances. The library serves as a basis for one-shot transfer to identify grasp affordance configurations similar to those previously experienced, with the insight that similar regions lead to similar deployments of the task. We evaluate the method’s efficacy in addressing novel task affordance problems by training on one single object and testing on multiple new ones. We observe a significant performance improvement in the considered tasks when using our proposal in comparison to state-of-the-art approaches on grasp affordance detection. Experimental evaluation on a PR2 robotic platform demonstrates highly reliable deployability of the proposed method in real-world task affordance problems.
II Related Work
Understanding grasp affordances for objects has been an active area of research for robotic manipulation tasks. Ideally, an autonomous agent should be able to identify all the tasks that an object can afford, and infer the grasp configuration that leads to a successful completion of each task. A common approach to tackle this challenge is via visual features, e.g. [13, 3, 4, 2]. Methods based on visual grasp affordance detection identify candidate grasps either via deep learning architectures that detect grasp areas on an object [13, 3, 4], or via supervised learning techniques that obtain grasping configurations based on an object’s shape. While these techniques offer robust grasp candidates, they uniquely seek grasp stability. Consequently, these methods cannot guarantee any level of performance when executing a task and, in fact, not even a successful task completion. In order to move towards reliable task deployment on autonomous agents, there is a need to bridge the gap between grasp affordance detection and task-oriented grasping.
Work on grasp affordances aims at robust interactions between objects and the autonomous agent. However, it is typically limited to a single grasp affordance detection per object, thus reducing its deployment in real-world scenarios. Some works focus on relating abstractions of sensory-motor processes with object structures (e.g., object-action complexes (OACs)) to extract the best grasp candidate given an object affordance. Others use purely visual input to learn affordances using deep learning [5, 4] or supervised learning techniques to relate objects and actions [22, 18, 16, 1]. Although these works are successful in detecting grasp affordance regions, they hypothesise suitable grasp configurations based on visual features, rather than on indicators of those proposals’ suitability to accomplish an affordance task.
The end goal of grasping is to manipulate an object to fulfil a goal-directed task. When the grasping problem is contextualised into tasks, solely satisfying the grasp stability constraints is no longer sufficient. Nonetheless, codefining grasp configurations with task success is still an open problem. Along this line, some works focus entirely on learning tasks where the object category does not influence the outcome, such as pushing or pulling [22, 16]. Hence, reliable extraction of grasp configurations is neglected. Another approach is to learn grasp quality measures for task performance via trial and error [6, 14, 11]. Based on these experiences, these studies build semantic constraints to specify which object regions to hold or avoid. Nonetheless, their dependency on great amounts of prior experience and the lack of generalisation between object instances remain the main hurdles of these methods.
Our work seeks to bridge the gap between grasp affordances and task performance present in prior work. The proposed approach unifies grasp affordance reasoning and task deployment in a self-assessed system that, without the need for extensive prior experience, is able to transfer grasp affordance configurations to novel object instances.
III Proposed Method
An autonomous agent must be able to perform a task affordance in different scenarios. Given a particular object and task to perform, the robot must select a suitable grasp affordance configuration that allows executing the task’s policy successfully. Only the correct choice of both grasp configuration and task policy leads to the robot being successful at addressing the task affordance problem. Despite the strong correlation between the grasp configuration and the execution performance, current approaches in the literature consider these elements to be independent. This results in grasping configurations that are not suitable for completing the task.
In this section, we introduce our approach to self-assess the selection of a suitable grasp affordance configuration according to an estimate of the task performance. Fig. 2 illustrates the proposed pipeline, which (i) detects from visual information a set of grasping candidates lying in the object’s grasp affordance space (Section III-A), (ii) exploits a learnt library of task affordance policies to forward simulate the outcome of executing the task from the grasping candidates (Section III-B), and then (iii) evaluates the grasp configuration candidates subject to a heuristic confidence metric (Section III-C) which allows for one-shot transfer of the grasp proposal (Section III-D). Finally, in Section III-E, we detail how these components fit in the scheme of a robotic agent dealing with task affordance problems autonomously.
III-A Prediction of Grasp Affordance Configurations
The overall goal of this work is, given an object’s grasp affordance region, to find a grasp configuration that allows the robot to successfully employ an object for a particular task. In the grasp affordance literature, it is common to visually detect and segment the grasp affordance region by mapping visual features to affordance labels [5, 1, 4]. While these methods all predict grasp configurations via visual detection hypotheses, none estimates the configuration proposals based on an insight into task performance. This relational gap endangers a successful task execution. Instead, an autonomous agent should be capable of discerning the most suitable grasp that benefits the execution of a task.
To bridge this gap, in our method we consider a grasp affordance region in a generic form, such as a bounding box (see Fig. (a)). We are interested in pruning this region by finding multiple grasp proposal candidates. With this aim, we use the pre-trained DeepGrasp model, a deep CNN that computes reliable grasp configurations on objects. The output grasp proposals from DeepGrasp, which do not account for the affordance relation, are shown in Fig. (b). The pruned region (see Fig. (c)) provides a set of grasp configuration candidates that accounts for both reliability and affordability.
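The pruning step above can be sketched as follows. This is a minimal illustration under assumed representations that the paper does not prescribe: grasp proposals as `(x, y, angle, width)` rectangles and the affordance region as an axis-aligned bounding box; all names are hypothetical.

```python
# Sketch of pruning grasp proposals by a grasp affordance region.
# Assumption: proposals are (x, y, angle, width) tuples; the region is an
# axis-aligned bounding box (x_min, y_min, x_max, y_max).

def inside(box, x, y):
    """Check whether point (x, y) lies inside the bounding box."""
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def prune_candidates(proposals, affordance_box):
    """Keep only the grasp proposals whose centre falls in the affordance region."""
    return [p for p in proposals if inside(affordance_box, p[0], p[1])]

proposals = [(10, 12, 0.0, 30), (55, 60, 1.2, 25), (90, 15, 0.4, 20)]
region = (40, 40, 100, 100)  # detected grasp affordance bounding box
print(prune_candidates(proposals, region))  # [(55, 60, 1.2, 25)]
```

The surviving candidates inherit reliability from the grasp detector and affordability from the region they lie in, matching the intersection described above.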
III-B Library of Task Affordances
The success of an affordance task lies in executing the corresponding task policy from a suitable grasp configuration. This is a difficult problem given that the two are codefining. Namely, the task’s requirements constrain the set of possibly suitable grasp configurations, at the same time that the choice of grasp conditions the outcome of executing the task’s policy. Additionally, determining whether the execution of a task is successful requires a performance indicator. To cope with this challenge, we build on our previous work to learn a library of task affordances from human demonstrations. The library aims at simultaneously guiding the robot in the search of a suitable task policy while informing about its expected outcome when successful. All these elements serve as the basis of the method described in Section III-C to determine a suitable grasp via self-assessment of the candidates.
In this work, we build the library of task affordances as a collection of task policies paired with the set of successful outcomes observed for each task. The policies are encoded as dynamic movement primitives (DMPs): differential equations encoding behaviour towards a goal attractor. We initialise the policies via imitation learning, and use them to reproduce an observed motion while generalising to different start and goal locations, as well as task durations.
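The goal-attractor dynamics of such a policy can be sketched as a minimal 1-D DMP transformation system. The learned forcing term is omitted for brevity, so this roll-out is a pure attractor; parameter values are illustrative, not the paper's.

```python
import numpy as np

# Minimal 1-D DMP transformation system (forcing term omitted): a critically
# damped second-order system pulling the state y towards the goal attractor.
# Gains follow the common choice beta = alpha / 4 for critical damping.

def dmp_rollout(y0, goal, duration=1.0, dt=0.01, alpha=25.0, beta=25.0 / 4):
    y, dy = float(y0), 0.0
    traj = [y]
    for _ in range(int(duration / dt)):
        ddy = alpha * (beta * (goal - y) - dy)  # spring-damper acceleration
        dy += ddy * dt                          # Euler integration
        y += dy * dt
        traj.append(y)
    return np.array(traj)

traj = dmp_rollout(y0=0.0, goal=1.0)
print(traj[-1])  # converges close to the goal attractor
```

Changing `y0`, `goal`, or `duration` reparametrises the same motion, which is the generalisation property the library relies on.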
Regarding the set of possible successful outcomes, we provide the robot with multiple experiences. We define the outcome as the state evolution of the object’s action region through the execution of the task. We employ Mask R-CNN (M-RCNN) to train a model that detects object subparts as action regions. As exemplified in Fig. 4, the action region state provides a meaningful indicator of the task. This information is used as the basis for our confidence metric, which evaluates the level of success of an affordance task for a grasping proposal.
III-C Search-Based Self-Assessment of Task Affordances
The task policies learnt in Section III-B allow a previously experienced task to be performed from any candidate grasp. Nonetheless, executing the policy from any grasp configuration may not always lead to suitable performance. For example, Fig. 4 depicts the case where grasping the mug from one candidate configuration prevents the robot from performing a pouring task as adequately as when grasping it from another.
We propose to self-assess the outcome of executing the task’s policy from each candidate before deciding the most suitable grasp configuration on a new object. This is efficiently done by forward simulation of the DMP-encoded policies. From each roll-out, we look at the state of the object’s action region as a task performance indicator. To this aim, we compare the demonstrated successful task outcomes and the simulated outcome in the form of the Kullback-Leibler divergence, which results in a low penalisation when the forward simulated outcome is similar to a previously experienced outcome, and a high penalisation otherwise. Then, we propose to rank the grasping candidates according to a confidence metric which estimates the suitability of each candidate for the given task.
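A possible instantiation of this penalisation is sketched below, assuming task outcomes are summarised as normalised histograms over the action-region state. The exponential mapping from divergence to confidence is an assumption of this sketch; the paper only describes its confidence metric as heuristic.

```python
import numpy as np

# Sketch of the KL-based outcome comparison. Assumption: each outcome is a
# normalised histogram over the action-region state; a small epsilon avoids
# division by zero in the divergence.

def kl_divergence(p, q, eps=1e-10):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def confidence(simulated, demonstrated_outcomes):
    """Low divergence to the closest demonstrated outcome -> high confidence."""
    d = min(kl_divergence(demo, simulated) for demo in demonstrated_outcomes)
    return float(np.exp(-d))  # assumed monotone mapping into (0, 1]

demos = [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]    # demonstrated successful outcomes
print(confidence([0.1, 0.8, 0.1], demos))     # ~1.0: matches a demonstration
print(confidence([0.8, 0.1, 0.1], demos))     # much lower confidence
```

A roll-out that reproduces a demonstrated outcome is penalised close to zero and so ranks highly, while a divergent roll-out is pushed down the ranking.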
Finally, among all grasping candidates, we select the grasp configuration with the highest confidence of successfully completing the task. This assessment is subject to a minimum user-defined confidence level that rejects under-performing grasp configuration proposals. As explained in the experimental setup, such a threshold is adjusted from demonstrations by a binary classifier.
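The selection rule with the rejection threshold can be sketched as follows; candidate labels, scores, and the threshold value are illustrative.

```python
# Sketch of the final selection: take the candidate with the highest
# confidence, but reject it if no candidate reaches the user-defined
# minimum confidence level.

def select_grasp(candidates, scores, threshold=0.5):
    """Return the highest-confidence candidate, or None if all are rejected."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None  # no candidate is confident enough for this task
    return candidates[best]

candidates = ["handle", "rim", "body"]
print(select_grasp(candidates, [0.9, 0.3, 0.6]))  # "handle"
print(select_grasp(candidates, [0.2, 0.3, 0.1]))  # None: all rejected
```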
III-D One-Shot Self-Assessment of Task Affordances
The search-based strategy presented in Section III-C can be time- and resource-consuming if performed over the grasp affordance region for every single task affordance problem. Alternatively, we propose to augment the library in (1) with an approximation of the prior experienced outcomes per grasp configuration, such that it allows for one-shot assessment. Namely, we extract the spatial transform of all experienced grasps with respect to the detected grasp affordance region. The relevance of these transforms is ranked in a list according to their confidence score computed following (3), yielding the augmented library.
At deployment time, we look for the spatial transform among the new grasping candidates that most resembles the best-ranked transforms in the augmented library. This allows us to hierarchically self-assess the candidates in order of prospective success.
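The one-shot lookup above can be sketched as follows. As a simplification, 2-D offsets from the affordance region stand in for full spatial transforms; all names are illustrative.

```python
import numpy as np

# Sketch of the one-shot lookup: order new candidates by how closely their
# pose relative to the affordance region matches the transforms of
# previously successful grasps stored in the library.
# Simplification: 2-D offsets stand in for full spatial transforms.

def rank_candidates(candidates, region_origin, library_transforms):
    """Sort candidates by distance to the closest library transform."""
    def score(c):
        t = np.asarray(c, dtype=float) - np.asarray(region_origin, dtype=float)
        return min(np.linalg.norm(t - np.asarray(lt, dtype=float))
                   for lt in library_transforms)
    return sorted(candidates, key=score)

library = [(0.0, 0.05)]              # transform of a previously successful grasp
cands = [(0.3, 0.3), (0.11, 0.14)]   # grasp candidates on the new object
print(rank_candidates(cands, region_origin=(0.1, 0.1), library_transforms=library))
```

The first element of the returned list is the candidate to try first, realising the "most similar region first" heuristic without re-simulating every candidate.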
III-E Deployment on Autonomous Agent
Algorithm 1 presents the outline of SAGAT’s end-to-end deployment, which aims at improving the success of an autonomous agent when performing a task. Given visual perception of the environment, the desired affordance, the pre-trained model to extract the grasp affordance relation (see Section III-A), the model to detect the action region, and the learnt library of task affordances (see Sections III-B to III-D), the end-to-end execution is as follows. First, the visual data is processed to extract the grasp affordance region and the object’s action region. The resulting grasp affordance region, along with the desired affordance, is used to estimate the grasp configuration proposals on the new object using the library of task affordances as prior experience. The retrieved set of grasp configuration candidates is analysed in order of decreasing prospective success until either exhausting all candidates or finding a suitable grasp for the affordance task. Importantly, the hierarchy of the proposed self-assessment analysis allows for one-shot transfer of the grasp configuration proposals, i.e. finding, on the first trial, a suitable grasp affordance by analysing the top-ranked grasp candidate. Nonetheless, the method also considers the case where exhaustive exploration of all candidates might be required, thus ensuring algorithmic completeness.
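The control flow of this end-to-end deployment can be sketched as below, with the perception, simulation, and confidence components stubbed out; everything except the loop structure is a hypothetical stand-in for the paper's modules.

```python
# High-level sketch of the Algorithm 1 control flow: try ranked candidates
# in order of prospective success, stopping at the first grasp whose
# simulated outcome is confident enough. The simulate/confidence arguments
# are placeholders for forward simulation and the heuristic metric.

def sagat(candidates_ranked, simulate, confidence, threshold=0.5):
    """Return the first suitable grasp, or None after exhaustive search."""
    for grasp in candidates_ranked:      # one-shot if the top candidate works
        outcome = simulate(grasp)        # forward simulate the task policy
        if confidence(outcome) >= threshold:
            return grasp
    return None                          # ensures algorithmic completeness

# Toy stand-ins: "simulation" echoes the grasp, and confidence is high
# only for the grasp labelled "handle".
result = sagat(["rim", "handle", "body"],
               simulate=lambda g: g,
               confidence=lambda o: 0.9 if o == "handle" else 0.1)
print(result)  # "handle": found on the second trial
```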
Notably, the proposed method is not dependent on a particular grasp affordance or action region description. This modularity allows the usage of the proposed method in a wide range of setups. We demonstrate the generality of the proposed method by first using multiple state-of-the-art approaches for grasp affordance detection, and then determining the improvement in task performance and deployability when they are used together with our approach.
IV Experimental Evaluation and Discussion
The proposed methodology endows a robot with the ability to determine a suitable grasp configuration to succeed in an affordance task. Importantly, such a challenge is addressed without the need for extensive prior trial and error. We demonstrate the potential of our method following the experimental setup described in Section IV-A and a thorough evaluation based on the following tests: (i) the spatial similarity between learnt and computed configurations across objects (Section IV-B), (ii) the accuracy of the task affordance deployment when transferred to new objects (Section IV-C), and (iii) the performance of our proposal when compared to other methodologies (Section IV-D).
IV-A Experimental Setup
The end-to-end execution framework presented in Algorithm 1 is deployed on a PR2 robotic platform, in both simulated and real-world scenarios. We use a Kinect mounted on the PR2’s head as our visual sensor and the position sensors on the right arm joints to encode the end-effector state pose for learning the task policies in the library.
We evaluate the proposed approach with an experimental setup that considers objects with varied affordable actions and suitable grasping configurations. Particularly, the library of task affordances is built uniquely using the blue mug depicted in Fig. 5, but evaluated with the objects depicted in Fig. 6. As can be observed, the training and testing sets present a challenging and significant variability in the grasp affordance relation. Our experimental setup also considers multiple affordances, namely: pouring, handover and shaking. The choice of these affordances is determined by them being both common among the considered objects and socially acceptable.
The task policy and its expected effect corresponding to each affordance are taught to the robot via kinaesthetic demonstration. The end-effector state evolution is used to learn the task policy in the form of a set of DMPs, and the state evolution of the container’s action region, segmented on the 2-D camera frame, is used to learn the expected effect. As depicted in Fig. 5 for the pouring task, the learnt policy is replicated several times from different grasping candidates, including suitable grasp affordances (blue) and undesired deployments (red).
The collected demonstrations are used to adjust the confidence threshold via a binary classifier, where the confidence level computed following (3) is the support, and the success label is the target. Only successful deployments are included in the library.
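One possible way to adjust this threshold from labelled demonstrations is sketched below. Since the paper does not name the classifier, a simple 1-D decision-stump search stands in; the scores and labels are illustrative.

```python
# Sketch of threshold adjustment from labelled demonstrations, using a 1-D
# decision stump (the paper's binary classifier is unspecified): pick the
# cut on the confidence score that best separates successful deployments
# (label 1) from failed ones (label 0).

def fit_threshold(scores, labels):
    """Return the score cut that maximises classification accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(scores):
        acc = sum((s >= t) == bool(l) for s, l in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

scores = [0.2, 0.3, 0.7, 0.8, 0.9]   # confidence of each demonstrated deployment
labels = [0, 0, 1, 1, 1]             # 1 = successful task execution
print(fit_threshold(scores, labels)) # 0.7: separates the two groups perfectly
```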
IV-B Spatial Similarity of Grasp Configurations
Our method enables one-shot transfer of grasp configurations to new objects. As explained in Section III-D, we rank the grasp candidates on new objects as those that closely resemble the experiences stored in the library of task affordances. This approximation is based on the expectation that similar spatial configurations should offer similar performance when dealing with the same task. In this set of experiments, we demonstrate the validity of such a hypothesis by evaluating the spatial similarity between the proposals estimated on new objects and the ones previously identified as suitable and stored in the library.
For each object, we calculate the Euclidean distance between the segmented action region and the obtained grasp configuration. Fig. 7 shows the obtained distances. The blue horizontal line represents the mean distance obtained during the demonstrations. Overall, we observe similar distances from action regions to grasp configurations across objects. For dissimilar cases such as the ashtray and the bowl, the difference arises because the obtained grasping region for most of the tasks lies on the edges of the object’s compartment. Even though these grasping configurations are relatively close to the action region, Table I shows that the average performance of the tasks is preserved.
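The similarity measure used in this test can be sketched as follows; the action region is reduced to its centroid as an assumption of this sketch, and the coordinates are illustrative.

```python
import numpy as np

# Sketch of the spatial similarity measure: the Euclidean distance between
# the centroid of the segmented action region and the selected grasp
# configuration, which is then compared against the mean distance observed
# in the demonstrations.

def action_to_grasp_distance(action_region_pts, grasp_xy):
    """Distance from the action-region centroid to the grasp position."""
    centroid = np.mean(np.asarray(action_region_pts, dtype=float), axis=0)
    return float(np.linalg.norm(centroid - np.asarray(grasp_xy, dtype=float)))

region = [(0, 0), (2, 0), (2, 2), (0, 2)]        # segmented action-region outline
print(action_to_grasp_distance(region, (4, 1)))  # 3.0: centroid (1, 1) to (4, 1)
```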
To further evaluate similarity across obtained grasping configurations, we are also interested in how much the system prunes the grasping space based on the information stored in the library. As defined in (4), we use a confidence threshold for the pruning process of the grasping space. Thus, based on the prior of well-performing grasp configurations, highly dissimilar proposals are not considered in the self-assessed transfer process. Fig. 8 depicts the rejection rate of grasp configuration proposals per task affordance. From the plot, we see that the pouring task shows the highest rejection rate, especially for objects that have handles. This hints that for this task the grasping choice is more critical.
IV-C One-Shot Transfer of Task Affordances
The second experimental test analyses the performance of our method when addressing task affordances on new objects. The goal of this evaluation is to determine whether the chosen grasp configuration enables the robot to perform the task affordance on new objects as successfully as the prior experiences stored in the library. Fig. 9 depicts the mean and variance (green scale) of the prior experiences in the library for the tasks pour, shake and handover. Each task was performed with three real objects with notably different features: a travel mug (dark blue), a measurement spoon (magenta) and a glass (blue). The resulting effect when performing the tasks from the computed grasping configuration is colour-coded on top of the prior experience distribution.
Subject to the task affordance, the three objects show different grasp affordance regions. After the one-shot self-assessment procedure, the computed grasp configurations are the most spatially similar to the most successful grasp configuration in the experience dataset. Importantly, as illustrated in Fig. 9, this strategy is invariant to different initial and final states of the task. This is reflected in the obtained task affordance effect, which falls inside the variance of the demonstrations.
IV-D Comparison of Task Deployment Reliability
To conduct this evaluation, we use the open-source implementations of [5, 4, 1] on all objects illustrated in Fig. 6, in both the real and the simulated robotic platform. The obtained grasp regions are used to execute the task in two different ways: (i) in a stand-alone fashion, i.e. as originally proposed, and (ii) as input to our SAGAT approach to determine the most suitable grasp candidate. Fig. 10 shows some examples of the grasp affordance detected with the previously mentioned methods and our approach.
We use the policies in the learnt library of task affordances to replicate the pour, shake and handover tasks on each object, for each grasp affordance, and for each method when used stand-alone and combined with SAGAT. This results in a large number of task deployments on the robotic platform (a compilation of experiments can be found at: https://youtu.be/nCCc3_Rk8Ks). Table I summarises the obtained results. As can be observed, deploying a task using state-of-the-art methods for grasp affordance detection alone provides a limited average success rate across tasks. With our approach, the deployment success is enhanced for all the tasks. Interestingly, the improvement is not equally distributed across tasks; more challenging tasks experience a higher improvement. This is notably the case for the pouring task.
V Conclusions and Future Work
In this paper, we presented a novel experience-based pipeline for self-assessment of grasp affordance transfer (SAGAT). Our approach enhances the deployment reliability of current state-of-the-art methods for grasp affordance detection by extracting multiple grasp configuration candidates from a given grasp affordance region. The outcome of executing a task from the different grasp candidates is estimated via forward simulation. These estimates are evaluated and ranked via a heuristic confidence function relating task performance and grasp configuration candidates. Such information is stored in a library of task affordances, which serves as a basis for one-shot transfer to identify grasp affordance configurations similar to those previously experienced, with the insight that similar regions lead to similar deployments of the task. We evaluated the method’s efficacy on novel task affordance problems by training on a single object and testing on multiple new ones. We observed a significant performance improvement in our experiments when using our proposal in comparison to state-of-the-art approaches for grasp affordance detection. Experimental evaluation on a PR2 robotic platform demonstrates highly reliable deployability of the proposed method in real-world task affordance problems.
This work encourages multiple interesting directions for future work. Our follow-up work will study a unified probabilistic framework to infer the most suitable grasp affordance candidate. We envision that this will allow sets of actions and grasps to be predicted when dealing with multiple correlated objects in the scene. Another interesting extension is the assessment of the end-state comfort-effect for grasping in human-robot collaboration tasks, such that the robot’s grasp affordance considers the human’s grasp capabilities.
- (2019) Learning grasp affordance reasoning through semantic relations. IEEE Robotics and Automation Letters 4 (4), pp. 4571–4578.
- (2010) Learning grasping points with shape context. Robotics and Autonomous Systems 58 (4), pp. 362–377.
- (2018) Real-world multiobject, multigrasp detection. IEEE Robotics and Automation Letters 3 (4), pp. 3355–3362.
- (2019) Learning affordance segmentation for real-world robotic manipulation via synthetic images. IEEE Robotics and Automation Letters 4 (2), pp. 1140–1147.
- (2018) AffordanceNet: an end-to-end deep learning approach for object affordance detection. In IEEE International Conference on Robotics and Automation (ICRA).
- (2019) Learning task-oriented grasping for tool manipulation from simulated self-supervision. The International Journal of Robotics Research.
- (1977) The theory of affordances. In Perceiving, Acting, and Knowing: Toward an Ecological Psychology, R. Shaw and J. Bransford (Eds.), pp. 62–82.
- Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2013) Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation 25 (2), pp. 328–373.
- (2018) Affordances in psychology, neuroscience, and robotics: a survey. IEEE Transactions on Cognitive and Developmental Systems 10 (1), pp. 4–25.
- (2012) A kernel-based approach to direct action perception. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2605–2610.
- (2011) Object–action complexes: grounded abstractions of sensory–motor processes. Robotics and Autonomous Systems 59 (10), pp. 740–757.
- (2015) Deep learning for detecting robotic grasps. International Journal of Robotics Research 34 (4-5), pp. 705–724.
- (2018) ROBOTURK: a crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879–893.
- (2016) Affordance research in developmental robotics: a survey. IEEE Transactions on Cognitive and Developmental Systems 8 (4), pp. 237–255.
- (2012) Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pp. 4373–4378.
- (2008) Learning object affordances: from sensory–motor coordination to imitation. IEEE Transactions on Robotics 24, pp. 15–26.
- (2009) Learning grasping affordances from local visual descriptors. In IEEE 8th International Conference on Development and Learning (ICDL), pp. 1–6.
- (2019) Learning and composing primitive skills for dual-arm manipulation. In Annual Conference Towards Autonomous Robotic Systems, pp. 65–77.
- (2019) Learning generalizable coupling terms for obstacle avoidance via low-dimensional geometric descriptors. IEEE Robotics and Automation Letters 4 (4), pp. 3979–3986.
- (2008) Kullback-Leibler divergence estimation of continuous distributions. In IEEE International Symposium on Information Theory, pp. 1666–1670.
- (2010) Learning task constraints for robot grasping using graphical models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1579–1585.