Object affordance as a guide for grasp-type recognition

by Naoki Wake, et al.

Recognizing human grasping strategies is an important factor in robot teaching as these strategies contain the implicit knowledge necessary to perform a series of manipulations smoothly. This study analyzed the effects of object affordance (a prior distribution of grasp types for each object) on convolutional neural network (CNN)-based grasp-type recognition. To this end, we created datasets of first-person grasping-hand images labeled with grasp types and object names, and tested a recognition pipeline leveraging object affordance. We evaluated scenarios with real and illusory objects to be grasped, to consider a teaching condition in mixed reality where the lack of visual object information can make the CNN recognition challenging. The results show that object affordance guided the CNN in both scenarios, increasing the accuracy by 1) excluding unlikely grasp types from the candidates and 2) enhancing likely grasp types. In addition, the "enhancing effect" was more pronounced with high degrees of grasp-type heterogeneity. These results indicate the effectiveness of object affordance for guiding grasp-type recognition in robot teaching applications.




1 Introduction

Robot grasping has been a major issue in robot teaching for decades [11, 27]. As robot grasping determines the relationship between a robot’s hand and an object, grasping an object in a manner suitable for the given environment is critical for efficient and successful manipulations.

Recent research has trended towards learning-based end-to-end robot grasping [33, 37, 26, 24, 3, 39, 50], where contact points or motor commands are estimated from visual input. However, the desired grasp can differ depending on the manipulation to be achieved, even for the same target object. Therefore, a robot teaching framework should benefit from how a demonstrator grasps an object (i.e., their grasp type).

We recently began developing a platform to teach a robot “how-to-manipulate an object” through human demonstrations in mixed reality (MR) or the physical world [42, 46, 48, 43] (Fig. 1). The demonstration is accompanied by verbal instructions and captured by a head-mounted device. This teaching framework is based on our assumption that verbal instructions and demonstrations can be efficiently employed by novice users to teach the name of the manipulated object and the grasping strategy, respectively. Thus, this study supposed that demonstrations are recorded by a head-mounted device, and the object name is available from verbal instructions. Here, the key issue is classifying grasp types based on the name of an object and first-person images at the time of grasping.

Figure 1: Conceptual diagram of robot teaching. (Top) Head-mounted device provides first-person images during a demonstration with verbal instructions (modified version of image from [42]). The demonstrations are transferred to a robot in the form of a skill set, which includes grasp type. (Bottom) Proposed pipeline for grasp-type recognition leveraging object affordance. The pipeline estimates the grasp type from the pairing of an object name and an image of a hand grasping that object. Object affordance is searched from an affordance database using text matching (modified image in [47]).

In most cases, an object name is associated with possible grasp types [22, 14, 9, 10]. To further extend this association, we proposed a pipeline that leverages a prior distribution of grasp types to improve the learning-based classification with a convolutional neural network (CNN), as shown in Fig. 1 [47]. We refer to the prior distribution as object affordance, a concept proposed by Gibson [18]. In the proposed pipeline, appropriate object affordance was searched from an affordance database using text matching. Although our preliminary experiments revealed the effectiveness of object affordance for grasp-type recognition, they were limited in several ways: 1) they focused on a limited number of grasp types and target objects, 2) the pipeline was tested with a small dataset (only fifty images for each grasp type), yielding possible underestimation in CNN recognition, and 3) the role of object affordance in guiding the CNN recognition was unclear.

This study aimed to investigate the role of object affordance in the above-described robot-teaching application using a large first-person grasping image dataset containing a wider range of labeled grasp types and household objects. We tested the pipeline with two types of affordance, reflecting one or both of likelihood and impossibility for each grasp type. The experiments showed that object affordance guides CNN recognition in two ways: 1) it excludes unlikely grasp types from the candidates and 2) enhances likely grasp types among the candidates. In addition, the “enhancing effect” was more pronounced for high degrees of grasp-type heterogeneity. Further, we tested the pipeline for recognizing mimed grasping images (i.e., images of a hand grasping an illusory object), assuming that a real object may be absent in some situations (e.g., teaching in MR). Similar to the experiment with real grasping images, object affordance proved to be effective for mimed grasping images. In addition, the CNN recognition for mimed images showed lower performance compared with its recognition for real grasping, indicating the importance of real objects being present for image-based recognition.

The contributions of this study are as follows: 1) it demonstrates the effectiveness of object affordance in guiding grasp-type recognition as well as the conditions under which the merits of object affordance are pronounced, 2) demonstrates the importance of real objects being present in grasp-type recognition, and 3) provides a dataset of first-person grasping images labeled with possible grasp types for each object.

The remainder of this paper is organized as follows. Section 2 gives an overview of the proposed pipeline alongside some related works. Section 3 describes the experiments conducted with and without real objects. Finally, Section 4 summarizes the results of the study and describes future work.

2 System overview

2.1 Grasp taxonomy and dataset

There are two main approaches to analyzing human grasping from a single image: 1) using hand poses of grasping [30, 29, 20] and 2) using a specific grasp taxonomy [14, 40, 7, 6, 23, 44, 15]. Each approach has its own advantages. Hand pose analysis in 3D space enables the state of an object, such as posture [29] and grasping area [30], to be measured. Meanwhile, taxonomy analysis enables human grasps to be represented as discrete intermediate states that focus on the pattern of fingers in contact.

This study aimed to recognize grasp types from human behavior as an extension of taxonomy-based studies. We employed the taxonomy by Feix et al., which contains 33 grasp types [16].

Building a realistic dataset of human hand shapes while manipulating objects will contribute to the study of human grasping. Some studies collected joint positions using wired sensors [17], a data glove [34], and model fittings [20, 19]. Another study created a dataset of hand–object contact maps obtained using thermography [4]. Taxonomy-based studies have also created datasets annotated with grasp types [40, 7, 44, 5]. For example, Bullock et al. collected a dataset containing first-person images of four workers [5].

Despite the variety of datasets available for grasp-type recognition, they could not be directly applied to our study because they do not aim to cover possible grasp types associated with an object. Although there exists a pseudo-image dataset focusing on object-specific affordance [10], there exists no dataset that provides actual grasping images.

The dataset created in this study covers several common household objects and provides RGB images of real human grasps, considering possible grasp types for each object (see Section 3.1.1 for details).

2.2 Object affordance

We introduce object affordance obtained by searching a database by object name (Fig. 1). Although several studies have reported the effectiveness of using multi-modal cues for grasp-type recognition [7, 49], the effectiveness of linguistically-driven object affordance is still poorly understood in the context of learning-based recognition.

Predicting affordance has become an active research topic in the cross-domain of robotics and computer vision. Affordance, which is generally regarded as an opportunity for interaction in a scene, has been defined in different ways depending on the problem to be solved. For example, in computer vision research using deep learning, affordances have been formulated as a type of label in semantic segmentation tasks [4, 12, 38, 32, 41]. In robotics research, affordance prediction is a factor in the task-dependent object grasping problem of task-oriented grasping (TOG) [30]. In the context of TOG, affordance is defined as possible tasks (e.g., cut and poke) allowed for an object [31, 13, 45, 2]. Similarly, this study considered affordances as object-specific entities and defined them as possible grasp types for an object.

The experiments in Section 3 evaluated the role of object affordance using sub-datasets that were sampled from the created dataset, which was labeled with possible grasp types for each object (see Section 3.1.1 for details). In testing the proposed pipeline (Fig. 1), an affordance database was created for each sub-dataset by referring to the grasp-type labels found in the sub-dataset. We prepared two types of affordances for each object (Fig. 2):

  • Varied affordance was calculated as a normalized histogram of the labeled grasp types for each object.

  • Uniform affordance was calculated by flattening the non-zero values in the histogram.

While the varied affordance contains information about the likeliness and unlikeliness of grasping, the uniform affordance only contains information about the unlikeliness of grasping.
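The two affordance types can be illustrated with a minimal Python sketch. The function names, the example labels, and the choice to renormalize the flattened values so that they sum to one are assumptions made for illustration; the text only states that the non-zero histogram values are flattened.

```python
from collections import Counter

def varied_affordance(labels, grasp_types):
    """Normalized histogram of the grasp-type labels observed for one object."""
    counts = Counter(labels)
    total = sum(counts[g] for g in grasp_types)
    if total == 0:
        raise ValueError("no labels for this object")
    return [counts[g] / total for g in grasp_types]

def uniform_affordance(varied):
    """Flatten the non-zero entries to equal values (assumed to sum to one)."""
    support = sum(1 for v in varied if v > 0)
    return [1.0 / support if v > 0 else 0.0 for v in varied]

# Hypothetical labels for a single object (illustrative, not from the dataset).
GRASP_TYPES = ["power sphere", "precision sphere", "lateral"]
labels = ["power sphere"] * 3 + ["lateral"] * 1

varied = varied_affordance(labels, GRASP_TYPES)
uniform = uniform_affordance(varied)
print(varied)   # [0.75, 0.0, 0.25]
print(uniform)  # [0.5, 0.0, 0.5]
```

In this toy case, both vectors assign zero to "precision sphere" (unlikeliness), but only the varied affordance records that "power sphere" is three times as frequent as "lateral" (likeliness).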

2.3 Convolutional neural network with object affordance

We formulated grasp detection by fusing a CNN with object affordance (Fig. 1) as follows. The image, object name, and grasp type are denoted as $i$, $o$, and $g$, respectively. We can assume that the output of a CNN and an affordance reflect the conditional probability distributions $p(g \mid i)$ and $p(g \mid o)$, respectively (Fig. 1). Further, assuming that $i$ and $o$ are independent, the following equation holds:

$$p(g \mid i, o) = \frac{p(g \mid i)\, p(g \mid o)}{p(g)}$$

Hence, the conditional probability distribution $p(g \mid i, o)$ can be estimated from the available distributions $p(g \mid i)$, $p(g \mid o)$, and $p(g)$. Finally, the grasp type can be determined as the one that maximizes $p(g \mid i, o)$.
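As a rough sketch of this fusion step, the Python snippet below scores each grasp type by the product of the CNN output and the affordance divided by a grasp-type prior, then takes the maximum. The function names `fuse` and `recognize`, the default uniform prior, and the example numbers are assumptions for illustration and are not taken from the paper.

```python
def fuse(cnn_probs, affordance, prior=None):
    """Score each grasp type by p(g|i) * p(g|o) / p(g) and renormalize."""
    n = len(cnn_probs)
    if prior is None:
        prior = [1.0 / n] * n  # assumption: uniform prior over grasp types
    scores = [c * a / p if p > 0 else 0.0
              for c, a, p in zip(cnn_probs, affordance, prior)]
    total = sum(scores)
    if total == 0:
        raise ValueError("no grasp type is supported by both CNN and affordance")
    return [s / total for s in scores]

def recognize(cnn_probs, affordance, prior=None):
    """Return the index of the grasp type with the highest fused score."""
    fused = fuse(cnn_probs, affordance, prior)
    return max(range(len(fused)), key=fused.__getitem__)

# Hypothetical values: the CNN slightly prefers index 1,
# but the affordance of the named object assigns it zero probability.
cnn_probs = [0.40, 0.45, 0.15]
affordance = [0.5, 0.0, 0.5]
print(recognize(cnn_probs, affordance))  # 0
```

Because only the maximizing grasp type is needed, the renormalization in `fuse` is optional; it is kept here so that the fused scores can be read as a distribution.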

The CNN was obtained by fine-tuning ResNet-101 [21]. To avoid overfitting, we applied random reflection and translation to images, and randomly shifted the image color in the HSV (hue, saturation, value) space after every training epoch. The learning was conducted using the Adam optimizer [28] and continued until the validation accuracy stopped increasing. The number of training images was changed for each experiment.

Figure 2: Examples of object affordance calculated from a sub-dataset: (a) example of uniform affordance and (b) example of varied affordance. Refer to Fig. 3 for the order of grasp types and object classes.

3 Experiments

3.1 Scenario 1: with real objects

In this scenario, the demonstration of grasping a real object was given as a first-person image using a head-mounted device. We assumed that the system could retrieve object affordance from the affordance database using the name of the object mentioned through verbal instructions (e.g., “Pick up the apple.”). This section evaluates the performance of the pipeline under Scenario 1, i.e., first-person images with object affordance.
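The retrieval step could look like the following Python sketch. The database contents, the name `lookup_affordance`, and the matching rule (whole-word match, longest object name first) are hypothetical; the paper states only that text matching is used.

```python
import re

# Hypothetical affordance database: object name -> prior over grasp types.
AFFORDANCE_DB = {
    "apple": {"power sphere": 0.6, "precision sphere": 0.4},
    "wine glass": {"precision sphere": 1.0},
}

def lookup_affordance(instruction, database):
    """Return (object name, affordance) for the longest name found in the text."""
    tokens = re.findall(r"[a-z0-9]+", instruction.lower())
    text = " " + " ".join(tokens) + " "
    # Longest names first, so a multi-word name wins over a shorter one.
    for name in sorted(database, key=len, reverse=True):
        if f" {name} " in text:
            return name, database[name]
    return None  # no registered object was mentioned

print(lookup_affordance("Pick up the apple.", AFFORDANCE_DB))
```

Matching on whole words avoids false hits such as "apple" inside "pineapple"; word variations would still need a thesaurus or embedding-based lookup.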

Figure 3: Grasp types assigned to Yale-CMU-Berkeley (YCB) objects. Images were selected from the database to demonstrate examples of grasping.

3.1.1 Data preparation

Demonstrations are often recorded by a head-mounted device in MR-based robot teaching. Even for robot teaching in the physical world, first-person images given by the demonstrator are preferred over third-person images due to the ability to avoid self-occlusion. Therefore, we required a dataset of first-person images of possible grasp types for each object. Because we were not able to find any existing dataset meeting these requirements, we created one.

The images were captured by a HoloLens2 sensor [36]. We used this sensor because it is a commercially-available sensor that can capture first-person images without the use of hand-made attachments. The type of target object was chosen from the Yale-CMU-Berkeley (YCB) object set [8], which covers common household items. We used this object set because it has been used as a benchmark for many robotic studies. We selected eight items from the food category and 13 items from the kitchen category: chip can, cracker box, gelatin box, potted meat can, apple, banana, peach, and pear; and pitcher, bleach cleanser, glass cleaner, wine glass, metal bowl, mug, abrasive sponge, cooking skillet, plate, fork, spoon, knife, and spatula, respectively. We selected these items to cover a variety of sizes. We prepared two datasets to avoid the overestimation of the performance of the network due to CNN overfitting:

  • YCB dataset: Training dataset containing exactly the same items as the YCB object set.

  • YCB-mimic dataset: Testing dataset containing objects that are the same as those in the YCB dataset but different in color, texture, or shape (e.g., a cracker box from another manufacturer).

The datasets were prepared through the following pipeline. Before collecting the images, we manually assigned a set of plausible grasps according to the taxonomy in [16] (Fig. 3). Based on a previous study [35], we focused on 13 grasp types that we believed were possible for common robot hands. For each object and grasp type, we captured images of a human grasping the object with their right hand. We captured more than 1500 grasp images by varying the arm orientation and rotation as much as possible. A third-party hand detector [25] was then applied offline to crop the hand regions from the captured images. After manually filtering out detection errors, 1000 images were randomly collected for each object and grasp type. The following experiments were conducted with sub-datasets that were sampled from the YCB or YCB-mimic dataset.

3.1.2 Evaluation of dataset size

Because small datasets lead to underestimation in CNN recognition, we validated the performances of CNNs trained with different-sized sub-datasets of the YCB dataset. We prepared five sub-datasets containing 10, 50, 100, 500, and 1000 images per grasp type. The images were randomly sampled such that each sub-dataset included all images from the smaller sub-datasets. The CNNs were tested with sub-datasets of the YCB-mimic dataset. We refer to these sub-datasets as the test datasets. The test datasets were created by randomly sampling 100 images per grasp type. The performances of the CNNs were validated ten times using different test datasets.
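One simple way to obtain such nested sub-datasets is to shuffle the images of a grasp type once and take prefixes of increasing length. The sketch below assumes this approach; the function name, the seed, and the file names are hypothetical, and the procedure would be repeated for each grasp type.

```python
import random

def nested_subsets(images, sizes, seed=0):
    """Sample sub-datasets so that each one contains all smaller ones."""
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    # Prefixes of a single shuffled list are nested by construction.
    return {n: shuffled[:n] for n in sorted(sizes)}

# Hypothetical file names standing in for the images of one grasp type.
images = [f"img_{k:04d}.png" for k in range(1000)]
subsets = nested_subsets(images, sizes=[10, 50, 100, 500, 1000])
print({n: len(s) for n, s in subsets.items()})
# {10: 10, 50: 50, 100: 100, 500: 500, 1000: 1000}
```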

Fig. 4 shows the results. The CNN performance tended to increase with the dataset size and converged above 500 images per grasp type. This indicates that the YCB dataset is sufficiently large to avoid underestimation due to insufficient images.

3.1.3 Effect of affordance on recognition

We evaluated the effectiveness of the proposed pipeline by comparing five methods: the proposed pipeline using varied affordance ($p(g \mid i, o)$), the proposed pipeline using uniform affordance, only varied affordance ($p(g \mid o)$), only uniform affordance, and only the CNN ($p(g \mid i)$), where $i$, $o$, and $g$ denote the image, object name, and grasp type, respectively. For a fair comparison, the same CNN was used for each method. The grasp type that maximizes the probability distribution was chosen. In the case of using only uniform affordance, the grasp type was randomly selected from the possible grasp types.

The CNN was trained with a sub-dataset of the YCB dataset. Based on the evaluation of dataset size in Section 3.1.2, the sub-dataset was prepared by randomly sampling 1000 images per grasp type. The comparison was tested 100 times using different test datasets from the YCB-mimic dataset. Each test dataset was created by randomly sampling 100 images per object.

Fig. 5 shows the result. The pipelines combining the CNN and affordance performed better than the CNN-only and affordance-only pipelines. While the proposed pipeline using varied affordance performed best, the proposed pipeline using uniform affordance was comparable. This indicates the effectiveness of using affordance for guiding grasp-type recognition.

To elucidate the role of affordance, we examined cases where the CNN failed, such as in Fig. 6. In these cases, the best candidate of the CNN was not the correct grasp type, possibly due to finger occlusion. However, that candidate had a small affordance value, so it had a small likelihood of being the output of the proposed method. As a result, the correct grasp type was chosen as the final output. Therefore, it seems that object affordance contributed to excluding unlikely grasp types from the candidates of the CNN.

To investigate the advantage of varied affordance over uniform affordance, we examined cases where the proposed pipeline using uniform affordance failed, as shown in Fig. 7. In these cases, the pipeline outputted the correct grasping by employing varied affordance. Therefore, it seems that varied affordance contributed to enhancing the grasps that were likely for an object.
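The two effects described above can be reproduced with a small numeric example. All numbers below are hypothetical and chosen only to show the mechanism; the score is the product of the CNN output and the affordance, which assumes a uniform prior over grasp types.

```python
def argmax_fused(cnn_probs, affordance):
    """Index maximizing p(g|i) * p(g|o); a uniform p(g) leaves the argmax unchanged."""
    scores = [c * a for c, a in zip(cnn_probs, affordance)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical numbers; the correct grasp type is index 0 in both cases.
uniform = [0.5, 0.5, 0.0]        # index 2 is impossible for this object
varied = [0.8, 0.2, 0.0]         # index 0 is the usual grasp for this object

cnn_probs = [0.30, 0.25, 0.45]   # the CNN alone picks index 2 (wrong)
cnn_close = [0.28, 0.32, 0.40]   # the CNN also ranks index 1 above index 0

print(argmax_fused(cnn_probs, uniform))  # 0: excluding index 2 is sufficient
print(argmax_fused(cnn_close, uniform))  # 1: exclusion alone is not enough
print(argmax_fused(cnn_close, varied))   # 0: the likely grasp type is enhanced
```

The first call corresponds to the excluding effect, which both affordance types provide. The last two calls correspond to the enhancing effect, which requires the likelihood information carried only by the varied affordance.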

Figure 4: Performances of CNNs trained with different dataset sizes.
Figure 5: Performance of grasp-type recognition with different pipelines: CNN only (only the CNN), Uni. Aff. only (only uniform affordance), Var. Aff. only (only varied affordance), Uni. Aff. (proposed pipeline using uniform affordance), and Var. Aff. (proposed pipeline using varied affordance).
Figure 6: Example where the CNN failed. The order of grasp types is the same as in Fig. 3.
Figure 7: Example where the proposed pipeline using uniform affordance failed. The order of grasp types is the same as in Fig. 3.

3.1.4 Enhancing effect of varied affordance

After observing the enhancing effect of affordance, based on information theory, we hypothesized that the effect would be stronger with higher grasp-type heterogeneity. To test this hypothesis, we evaluated the effect of grasp-type heterogeneity on recognition.

We used the same 100 test datasets that were prepared for the comparison experiment. The degree of grasp-type heterogeneity, $H$, was defined for each test dataset by the following equation:

$$H = \frac{1}{N} \sum_{k=1}^{N} \sigma(\mathbf{a}_k)$$

where $N$, $\mathbf{a}_k$, and $\sigma(\cdot)$ represent the number of object classes, the vector of varied affordance of the $k$-th object (i.e., each column in Fig. 2(b)), and an operation to calculate the standard deviation of non-zero values of a vector, respectively.
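A possible computation of this measure is sketched below in Python. It assumes that the heterogeneity is the mean, over object classes, of the standard deviation of the non-zero affordance values, and that the population standard deviation is meant; both choices, as well as the function name and example vectors, are assumptions rather than details confirmed by the text.

```python
from statistics import pstdev

def heterogeneity(affordances):
    """Mean over objects of the std. dev. of each object's non-zero affordance values."""
    per_object = []
    for vector in affordances:
        nonzero = [v for v in vector if v > 0]
        per_object.append(pstdev(nonzero))  # assumption: population std. dev.
    return sum(per_object) / len(per_object)

# Hypothetical varied affordances of two objects over three grasp types.
affordances = [
    [0.5, 0.5, 0.0],  # both possible grasp types equally frequent
    [0.8, 0.2, 0.0],  # one grasp type dominates
]
print(heterogeneity(affordances))  # approximately 0.15
```

Under this definition, a dataset in which every object has equally frequent grasp types yields zero heterogeneity, in which case the varied and uniform affordances carry the same ranking information.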

The sub-datasets were tested with the proposed pipeline using varied affordance and uniform affordance. The same CNN as in the comparison experiments was used. Fig. 8 shows the performance difference of the pipelines plotted against the grasp-type heterogeneity. As hypothesized, the improvement when using varied affordance increased as the grasp-type heterogeneity increased. This indicates that the enhancing effect is more pronounced when the degree of grasp-type heterogeneity is higher.

Figure 8: Performance difference between pipelines plotted against grasp-type heterogeneity. Each plot represents a test dataset.

3.2 Scenario 2: without real objects

In the previous section, we evaluated the pipeline for images of grasping real objects. However, robot teaching may not require real objects to be grasped in some situations (e.g., teaching in MR). In such situations, captured images do not include real objects; however, a user can interact with an illusory object in MR (i.e., an MR object). Because such “mimed” images lack visual object information, image-based grasp-type recognition can become challenging. This section evaluates the performance of the proposed pipeline when mimed images and object affordance are available.

3.2.1 Data preparation

To obtain the CNN for recognizing grasp types, we prepared a dataset of mimed images captured by a HoloLens2 sensor [36]. We used the texture-mapped 3D mesh models of the YCB objects described in Section 3.1.1 as MR objects. Grasp achievement was determined by the type and number of fingers in contact, following the definition in [16]. The positions of the hand joints were estimated via the HoloLens2 API. When collecting the images, a user grasped one of the rendered MR objects guided by visual cues that represent the contact state between the user’s hand and the MR object [42]. Although we captured images based on the grasp-type mapping in Fig. 3, the glass cleaner and the wine glass were ignored because corresponding 3D models were not available among those provided by [8], and the abrasive sponge was ignored owing to the inability to express soft materials in MR. Further, “small diameter” grasping was ignored because of the difficulty in measuring the joint positions with corresponding accuracy (i.e., within 1 cm [16]). We also excluded objects with only one type of grasp (i.e., the pitcher and cooking skillet).

Following the same recording and post-processing protocol as described in Section 3.1.1, we collected a dataset containing 1000 mimed images for each object and grasp type (note that the MR objects were not captured in the images). We created two datasets under different lighting conditions and used one for training the CNN and the other for testing the pipeline. Fig. 9 shows examples of the images.

Figure 9: Examples of the mimed images captured by the HoloLens2 sensor. Although the grasped YCB objects were not captured, they were presented to the user in MR.

3.2.2 Effect of affordance on recognition

We compared the same five methods as in Section 3.1. The protocols to obtain the CNN and affordance database were the same as described in Section 2.3. Fig. 10 shows the comparison results. Similar to the results in Section 3.1, the proposed pipeline showed the highest performance. Although the CNN recognition for mimed images was worse than that for real grasping images (see Fig. 5), the use of affordance proved to be effective.

We also observed the two functions of object affordance (i.e., excluding unlikely grasp types from the candidates and enhancing likely grasp types among the candidates), similar to Section 3.1. For example, Fig. 11 shows cases where the CNN failed to discriminate between “power sphere” and “precision sphere,” which appeared similar in mimed grasping. Despite such similarity, the proposed pipeline using varied affordance succeeded by excluding one of them from the candidates.

Figure 10: Performances of grasp-type recognition with different pipelines. The captions are the same as in Fig. 5.
Figure 11: Examples where the CNN failed in recognizing mimed images. (a) Recognition of “power sphere” grasping of an illusory apple. (b) Recognition of “precision sphere” grasping of an illusory mug. The order of grasp types is the same as in Fig. 3, excluding “small diameter” grasping.

4 Conclusion, discussion, and future studies

This study investigated the role of object affordance for guiding grasp-type recognition. To this end, we created two first-person image datasets containing images with and without grasped objects, respectively. The results revealed the effects of object affordance in guiding CNN recognition: 1) it excludes unlikely grasp types from the candidates and 2) enhances likely grasp types among the candidates. The enhancing effect was stronger when there was more heterogeneity between grasp types. These findings suggest that object affordance can be effective in improving grasp-type recognition.

The advantage of our proposed pipeline (Fig. 1) is that it can be updated independently of the CNN. For example, if a user experiences a grasp type that is not assigned for an object, the pipeline can be updated by simply modifying the object affordance according to the user’s feedback. As another example, if a user wants to deal with objects that are not registered in the affordance database, the pipeline can be updated by adding object affordances manually. In the case of using uniform affordances, which showed promising results (Fig. 5), object affordances can be readily added by manually assigning possible grasp types. Such an approach is less expensive than updating a CNN by collecting a large number of grasp images depending on the use case.

This study assumed that the pipeline could access the name of the grasped object and retrieve the affordance using the object name. For practical robot teaching applications, separate solutions to these requirements are needed. To access the name of the grasped object, general object recognition or user input information can be used. For example, our robot teaching platform is designed to extract the name of the grasped object from human instructions [48]. While this study used text matching to retrieve affordance using object names, we could also employ a thesaurus or word embedding methods to cover word variations.

Recognition from mimed images seems to be more difficult than that from images of grasping real objects (Figs. 5 and 10), indicating the importance of real objects being present for image-based recognition. Although this result is reasonable considering the lack of visual information about the objects, the use of real objects may limit the advantage of MR in representing a wide variety of objects. To overcome the difficulty of recognizing grasp types in MR, a previous study proposed combining other information, such as contact points and contact normals, which can be easily calculated for MR objects [1]. As explained in Section 2.3, the proposed pipeline can be applied to an arbitrary learning-based classification technique beyond an image-based CNN as long as the output can be considered as a probability distribution. Therefore, we believe that our research will benefit studies based on MR objects as well as real objects.

As future research, the proposed pipeline could be employed in a learning-from-observation (LfO) framework, where object names can be estimated from verbal instructions. We are currently testing this hypothesis by integrating the pipeline with an LfO system that we developed in-house [48, 46].


  • [1] J. Aleotti and S. Caselli (2006) Grasp recognition in virtual reality for robot pregrasp planning by demonstration. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2801–2806. Cited by: §4.
  • [2] J. Bohg, A. Morales, T. Asfour, and D. Kragic (2013) Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics 30 (2), pp. 289–309. Cited by: §2.2.
  • [3] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §1.
  • [4] S. Brahmbhatt, C. Ham, C. C. Kemp, and J. Hays (2019) Contactdb: analyzing and predicting grasp contact via thermal imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8709–8719. Cited by: §2.1, §2.2.
  • [5] I. M. Bullock, T. Feix, and A. M. Dollar (2015) The yale human grasping dataset: grasp, object, and task data in household and machine shop environments. The International Journal of Robotics Research 34 (3), pp. 251–255. Cited by: §2.1.
  • [6] M. Cai, K. M. Kitani, and Y. Sato (2015) A scalable approach for understanding the visual structures of hand grasps. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 1360–1366. Cited by: §2.1.
  • [7] M. Cai, K. Kitani, and Y. Sato (2018) Understanding hand-object manipulation by modeling the contextual relationship between actions, grasp types and object attributes. arXiv preprint arXiv:1807.08254. Cited by: §2.1, §2.1, §2.2.
  • [8] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) The ycb object and model set: towards common benchmarks for manipulation research. In Proceedings of the International Conference on Advanced Robotics (ICAR), pp. 510–517. Cited by: §3.1.1, §3.2.1.
  • [9] F. Cini, V. Ortenzi, P. Corke, and M. Controzzi (2019) On the choice of grasp type and location when handing over an object. Science Robotics 4 (27). Cited by: §1.
  • [10] E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez (2020) Ganhand: predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5031–5041. Cited by: §1, §2.1.
  • [11] M. R. Cutkosky and R. D. Howe (1990) Human grasp choice and robotic grasp analysis. In Dextrous robot hands, pp. 5–31. Cited by: §1.
  • [12] T. Do, A. Nguyen, and I. Reid (2018) Affordancenet: an end-to-end deep learning approach for object affordance detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 5882–5889. Cited by: §2.2.
  • [13] K. Fang, Y. Zhu, A. Garg, A. Kurenkov, V. Mehta, L. Fei-Fei, and S. Savarese (2020) Learning task-oriented grasping for tool manipulation from simulated self-supervision. The International Journal of Robotics Research 39 (2-3), pp. 202–216. Cited by: §2.2.
  • [14] T. Feix, I. M. Bullock, and A. M. Dollar (2014) Analysis of human grasping behavior: correlating tasks, objects and grasps. IEEE Transactions on Haptics 7 (4), pp. 430–441. Cited by: §1, §2.1.
  • [15] T. Feix, I. M. Bullock, and A. M. Dollar (2014) Analysis of human grasping behavior: object characteristics and grasp type. IEEE Transactions on Haptics 7 (3), pp. 311–323. Cited by: §2.1.
  • [16] T. Feix, J. Romero, H. Schmiedmayer, A. M. Dollar, and D. Kragic (2015) The grasp taxonomy of human grasp types. IEEE Transactions on Human-Machine Systems 46 (1), pp. 66–77. Cited by: §2.1, §3.1.1, §3.2.1.
  • [17] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim (2018) First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 409–419. Cited by: §2.1.
  • [18] J. J. Gibson and L. Carmichael (1966) The senses considered as perceptual systems. Vol. 2, Houghton Mifflin Boston. Cited by: §1.
  • [19] S. Hampali, M. Rad, M. Oberweger, and V. Lepetit (2020) Honnotate: a method for 3d annotation of hand and object poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3196–3206. Cited by: §2.1.
  • [20] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid (2019) Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11807–11816. Cited by: §2.1, §2.1.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §2.3.
  • [22] H. B. Helbig, J. Steinwender, M. Graf, and M. Kiefer (2010) Action observation can prime visual object recognition. Experimental Brain Research 200 (3), pp. 251–258. Cited by: §1.
  • [23] D. Huang, M. Ma, W. Ma, and K. M. Kitani (2015) How do we use our hands? discovering a diverse set of common grasps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 666–675. Cited by: §2.1.
  • [24] Y. Jiang, S. Moseson, and A. Saxena (2011) Efficient grasping from RGBD images: learning using a new rectangle representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3304–3311. Cited by: §1.
  • [25] Jsk-ros-pkg Ssd_object_detector. Note: https://github.com/jsk-ros-pkg/jsk_recognition (accessed: 2020-Nov-16). Cited by: §3.1.1.
  • [26] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §1.
  • [27] S. B. Kang and K. Ikeuchi (1997) Toward automatic robot instruction from perception-mapping human grasps to manipulator grasps. IEEE Transactions on Robotics and Automation 13 (1), pp. 81–95. Cited by: §1.
  • [28] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.3.
  • [29] M. Kokic, D. Kragic, and J. Bohg (2019) Learning to estimate pose and shape of hand-held objects from RGB images. arXiv preprint arXiv:1903.03340. Cited by: §2.1.
  • [30] M. Kokic, D. Kragic, and J. Bohg (2020) Learning task-oriented grasping from human activity datasets. IEEE Robotics and Automation Letters 5 (2), pp. 3352–3359. Cited by: §2.1, §2.2.
  • [31] M. Kokic, J. A. Stork, J. A. Haustein, and D. Kragic (2017) Affordance detection for task-specific grasping using deep learning. In Proceedings of the IEEE-RAS International Conference on Humanoid Robotics (Humanoids), pp. 91–98. Cited by: §2.2.
  • [32] M. Lau, K. Dev, W. Shi, J. Dorsey, and H. Rushmeier (2016) Tactile mesh saliency. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–11. Cited by: §2.2.
  • [33] I. Lenz, H. Lee, and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Cited by: §1.
  • [34] Y. Lin and Y. Sun (2014) Grasp planning based on strategy extracted from demonstration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4458–4463. Cited by: §2.1.
  • [35] Y. Lin and Y. Sun (2015) Robot grasp planning based on demonstrated grasp strategies. The International Journal of Robotics Research 34 (1), pp. 26–42. Cited by: §3.1.1.
  • [36] Microsoft Microsoft HoloLens. Note: https://www.microsoft.com/en-us/hololens (accessed: 2020-Nov-16). Cited by: §3.1.1, §3.2.1.
  • [37] D. Morrison, P. Corke, and J. Leitner (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172. Cited by: §1.
  • [38] L. Porzi, S. R. Bulo, A. Penate-Sanchez, E. Ricci, and F. Moreno-Noguer (2016) Learning depth-aware deep representations for robotic perception. IEEE Robotics and Automation Letters 2 (2), pp. 468–475. Cited by: §2.2.
  • [39] J. Redmon and A. Angelova (2015) Real-time grasp detection using convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 1316–1322. Cited by: §1.
  • [40] G. Rogez, J. S. Supancic, and D. Ramanan (2015) Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3889–3897. Cited by: §2.1, §2.1.
  • [41] A. Roy and S. Todorovic (2016) A multi-scale CNN for affordance segmentation in RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 186–201. Cited by: §2.2.
  • [42] D. Saito, N. Wake, K. Sasabuchi, H. Koike, and K. Ikeuchi (2021) Contact web status presentation for freehand grasping in MR-based robot-teaching. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI). Note: Accepted. Cited by: Figure 1, §1, §3.2.1.
  • [43] K. Sasabuchi, N. Wake, and K. Ikeuchi (2020) Task-oriented motion mapping on robots of various configuration using body role division. IEEE Robotics and Automation Letters 6 (2), pp. 413–420. Cited by: §1.
  • [44] A. Saudabayev, Z. Rysbek, R. Khassenova, and H. A. Varol (2018) Human grasping database for activities of daily living with depth, color and kinematic data streams. Scientific Data 5 (1), pp. 1–13. Cited by: §2.1, §2.1.
  • [45] D. Song, C. H. Ek, K. Huebner, and D. Kragic (2015) Task-based robot grasp planning using probabilistic inference. IEEE Transactions on Robotics 31 (3), pp. 546–561. Cited by: §2.2.
  • [46] N. Wake, R. Arakawa, I. Yanokura, T. Kiyokawa, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi (2021) A learning-from-observation framework: one-shot robot teaching for grasp-manipulation-release household operations. In Proceedings of the IEEE/SICE International Symposium on System Integration (SII). Note: Accepted. Cited by: §1, §4.
  • [47] N. Wake, K. Sasabuchi, and K. Ikeuchi (2020) Grasp-type recognition leveraging object affordance. In HOBI Workshop, IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). Cited by: Figure 1, §1.
  • [48] N. Wake, I. Yanokura, K. Sasabuchi, and K. Ikeuchi (2020) Verbal focus-of-attention system for learning-from-demonstration. arXiv preprint arXiv:2007.08705. Cited by: §1, §4, §4.
  • [49] Y. Yang, Y. Li, C. Fermuller, and Y. Aloimonos (2015) Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §2.2.
  • [50] Q. Yu, W. Shang, Z. Zhao, S. Cong, and Y. Lou (2018) Robotic grasping of novel objects from RGB-D images by using multi-level convolutional neural networks. In Proceedings of the IEEE International Conference on Information and Automation (ICIA), pp. 341–346. Cited by: §1.