Attention based visual analysis for fast grasp planning with multi-fingered robotic hand

by   Zhen Deng, et al.
University of Hamburg

We present an attention based visual analysis framework for computing grasp-relevant information to guide grasp planning with a multi-fingered robotic hand. Our approach uses a computational visual attention model to locate regions of interest in a scene, and a deep convolutional neural network to detect the grasp type and grasp point for the sub-region of the object presented in a region of interest. We demonstrate the proposed framework in object grasping tasks, in which its output is used as prior information to guide grasp planning. Results show that the proposed framework not only speeds up grasp planning and yields more stable configurations, but is also able to handle unknown objects and cluttered scenarios. A new Grasp Type Dataset (GTD), which considers 6 commonly used grasp types and covers 12 household objects, is also presented.







1 Introduction

Imagine a toddler in front of a tabletop with several objects: very likely he or she would interact with those objects, trying to pick up the red mug either by the handle or the rim, or trying to grasp the green ball. The ability to rapidly extract relevant information from visual input is an important mechanism and a natural behavior that humans use to conduct various activities. The majority of visual analysis approaches for grasp planning with multi-fingered robotic hands follow a pipeline of object localization, object representation and recognition [1]. However, reliable object detectors such as deep-learning based approaches require huge amounts of training data, as well as good hardware, to achieve reasonable time performance for robotic applications, while handcrafted-feature based approaches cannot handle the dynamics of real-life scenarios.

This paper proposes an attention based visual analysis framework which directly locates sub-regions of objects as regions of interest (ROIs), and generates grasp-relevant information from the visual data inside the ROIs for grasp planning with a multi-fingered robotic hand. The proposed learning framework is inspired by psychological studies which demonstrated that humans combine an early bottom-up processing with a later top-down processing to visually analyze a scene [2, 3]. The bottom-up process starts with sensory input data and is completely stimulus-driven, while the top-down process extracts relevant information and may be influenced by prior experience and semantics. In particular, a computational attention model processes the visual data and outputs a pixel-precise saliency map, from which salient regions are selected for further processing. Meanwhile, the grasp type and point on the object segments presented in salient regions are predicted by a network. Finally, this information is used to guide grasp planning with a multi-fingered robotic hand.

Grasp type and point convey useful information for planning the configuration of a robotic hand. However, many previous works on grasp type detection mainly focus on the analysis of human hand behavior [4, 5], while few approaches integrate grasp type detection into robotic grasp planning. Furthermore, those works that consider grasp types for robotic hands normally divide grasps into power and precision types [6] and decide the grasp type manually during planning, which is not sufficient for exploring the potential of multi-fingered robotic hands. In terms of visual analysis, the approaches in [7, 8, 9] use visual analysis to define heuristics or constraints for grasp planning. In comparison to those approaches, there are two main differences: 1) our approach learns features directly from raw sensory data, while most previous approaches use handcrafted features; 2) six grasp types are considered, while the previous approaches only consider two. To the best of our knowledge, this is the first approach that integrates grasp type detection into grasp planning for multi-fingered robotic hands.

In this paper, we address the problem of visual analysis of natural scenes for grasping by multi-fingered robotic hands. The objective is to compute grasp-relevant information from visual data, which is used to guide grasp planning. A visual analysis framework which combines a computational visual attention model and a grasp type detection model is proposed. A new Grasp Type Dataset (GTD) which considers six commonly used grasp types and contains 12 household objects is also presented.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 introduces the architecture and main components of the proposed visual analysis framework. Grasp planning is described in Section 4. Experimental results are presented in Section 5. Finally, the conclusion and future work are discussed in Section 6.

2 Related Work

Information extracted from visual analysis can be used to define heuristics or constraints for grasp planning. Previous grasp planning methods can be divided into geometric-based grasping and similarity-based grasping. In geometric-based grasping [7, 10, 9], geometric information about the object is obtained from color or depth images and is used to define a set of heuristics to guide grasp planning. Hsiao et al. [7] proposed a heuristic which maps partial shape information of objects to grasp configurations. A direct mapping from object geometry to candidate grasps is also used in [11, 9]. Aleotti and Caselli [8] proposed a 3D shape segmentation algorithm which first oversegments the target object; candidate grasps are then chosen based on the shape of the resulting segments [10]. In similarity-based approaches [12, 13, 14], a similarity measure is calculated between the target object and corresponding object models from human demonstrations or simulation, and the candidate grasp is queried from datasets based on this measure. Herzog et al. [12] defined an object shape template as the similarity measure; the template encodes heightmaps of the object observed from various viewpoints. Object properties can also be represented with semantic affordance maps [13] or probability models [14, 15]. Geometric-based approaches usually require a multi-stage pipeline to gather handcrafted features through visual data analysis, and due to vision sensor noise their performance is often unstable. Meanwhile, similarity-based methods are limited to known objects and cannot handle unknown ones. In contrast to previous methods, our method increases grasping stability by learning more reliable features, while also being able to handle unknown objects.

Many saliency approaches have been proposed in the last two decades. Traditional models are usually based on the feature integration theory (FIT) [16] and compute several handcrafted features which are fused into a saliency map, e.g. the iNVT [17, 18] or the VOCUS system [19]. Frintrop et al. [20] proposed a simple and efficient system which computes multi-scale feature maps using Difference-of-Gaussian (DoG) filters for center-surround contrast and produces a pixel-precise saliency map. Deep-learning based saliency detection mostly relies on high-level features pretrained for object detection tasks, due to the requirement of massive amounts of training data [21, 22, 23]. Kümmerer et al. [24] used an AlexNet [25] pretrained on ImageNet [26] for object recognition; the resulting high-dimensional feature space is used for fixation prediction and saliency map generation. Since most deep-learning based approaches have a central photographer bias, which is not desired in robotic applications, we choose a handcrafted-feature based approach which gathers local visual attributes by combining low-level visual features [20].

3 Attention based visual analysis

The proposed framework contains two main components, a computational visual attention model which gathers low-level visual features and selects ROIs for further processing, and a grasp type detection model which learns higher level features and produces grasp-relevant information in the ROIs. Figure 1 illustrates an overview of the proposed attention based visual analysis framework.

Figure 1: The proposed attention based visual analysis framework. With an input RGB image, a ROI is selected using the saliency map produced by a Saliency detection model. Inside the ROI, grasp type and point are computed based on the six probability maps produced by the Grasp type detection network. The obtained information containing grasp type and point is then used as a prior to guide grasp planning. The planned grasp is executed in a physical simulator to verify its quality.

3.1 Computational visual attention model

The pixel-level saliency map is computed using the computational visual saliency method VOCUS2 [20]. In principle, any saliency system that has real-time capability and no center bias could be used. Center bias gives preference to the center of an image, which is not desired in robotic applications; unfortunately, this excludes most deep-learning based approaches, since they are usually trained on large datasets of Internet images, which mostly have a central photographer bias. Therefore, the VOCUS2 system was chosen, which belongs to the traditional saliency systems and performs well on several benchmarks. In VOCUS2, an RGB input image is converted into an opponent-color space consisting of intensity, red-green and blue-yellow color channels. DoG contrasts are computed with twin pyramids, i.e., two Gaussian pyramids (one for the center and one for the surround of a region) which are subtracted to obtain the DoG contrast. Finally, the contrast maps are fused across multiple scales using the arithmetic mean to produce the saliency map. Salient regions with high contrast to their surroundings are clustered, selected, and passed to the next stage for further processing [2].
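The center-surround scheme described above can be sketched in a few lines. The following is a simplified illustration, not the VOCUS2 implementation: the channel weights, sigmas, scale factors and the use of simple subsampling instead of proper Gaussian pyramids are all illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def saliency_map(image, scales=(1, 2, 4), sigma_center=2.0, sigma_surround=8.0):
    """Center-surround DoG saliency, a simplified VOCUS2-style sketch.

    `image` is an (H, W, 3) float RGB array in [0, 1]. Sigmas and scales
    are illustrative values, not the VOCUS2 defaults.
    """
    # Opponent-color space: intensity, red-green and blue-yellow channels.
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    channels = [(r + g + b) / 3.0, r - g, b - (r + g) / 2.0]

    h, w = image.shape[:2]
    acc = np.zeros((h, w))
    n = 0
    for ch in channels:
        for s in scales:
            small = ch[::s, ::s]  # coarse pyramid level (plain subsampling)
            center = gaussian_filter(small, sigma_center)
            surround = gaussian_filter(small, sigma_surround)
            contrast = np.abs(center - surround)  # DoG center-surround contrast
            # Upsample back to full resolution and fuse by arithmetic mean.
            up = zoom(contrast, (h / contrast.shape[0], w / contrast.shape[1]), order=1)
            acc += up[:h, :w]
            n += 1
    return acc / n
```

A bright region on a dark background then receives clearly higher saliency than the homogeneous surroundings, which is what the ROI selection stage relies on.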

3.2 Grasp type detection

Grasp type is a way of representing how a hand handles objects. Typically, robotic grasps are divided into power and precision grasps [6]: a power grasp uses the fingers and palm to hold the object firmly, while a precision grasp only uses the fingertips to stabilize the object. However, this two-category grasp taxonomy is not sufficient to convey information about hand configuration. Feix et al. [27] introduced a GRASP taxonomy which presents the different grasp types used by humans. Considering the kinematic limitations of the robotic hand as well as Feix's GRASP taxonomy, we extend the two-category taxonomy into 6 commonly used grasp types: large wrap, small wrap, power, pinch, precision and tripod. Figure 2 illustrates the proposed grasp taxonomy.

Figure 2: The proposed 6 commonly used grasp types.

In order to detect grasp types directly from visual data, we refer to the architecture proposed by Chen et al. [28]. Since an object may have multiple feasible grasp types, grasp type detection is a multi-label detection problem. Hence, we modify the output layer of the network and do not use the additional fully connected Conditional Random Field (CRF). Corresponding to the 6 grasp types, the modified network predicts 6 pixel-level probability maps with the same resolution as the input image. In order to train the modified network for grasp type detection, this paper introduces a grasp type detection (GTD) dataset, in which 12 household objects are used and all instances are annotated with the proposed 6 grasp types. The details of the GTD dataset and the model training are provided in Section 5.1.
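The multi-label modification can be made concrete with a minimal sketch. We assume an independent per-channel sigmoid (so several grasp types can score high at the same pixel), rather than a softmax across channels; the paper only states that the output layer was modified for multi-label detection, so the activation choice here is our assumption.

```python
import numpy as np

def grasp_type_maps(logits):
    """Turn raw network outputs into per-type probability maps.

    `logits` has shape (6, H, W), one channel per grasp type. An
    independent sigmoid per channel (multi-label) is applied instead of a
    softmax across channels (multi-class), so an object region may admit
    several grasp types at once.
    """
    return 1.0 / (1.0 + np.exp(-logits))  # each pixel scored per type
```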

Given an RGB image I of height H and width W as input, our network outputs a pixel-level probability map M_g for each grasp type g ∈ {1, …, 6}, with the same resolution as the input; the predicted probability of pixel p belonging to grasp type g is denoted M_g(p). With these pixel-level probability maps, the best grasp type for a ROI R is selected by summing the predicted probabilities of all pixels inside R, which gives the probability P_g(R) of grasp type g for R, as shown in Eq. 1. The grasp type g* with the highest probability is chosen:

    P_g(R) = Σ_{p ∈ R} M_g(p),    g* = argmax_g P_g(R).    (1)
After determining the best grasp type g*, we need to localize the grasp point for this grasp type inside R. In order to find a stable grasp point, subregions with higher predicted probabilities are clustered: Mean Shift [29] is applied to the map M_g* inside R, which produces multiple clusters with multiple centers, and the cluster center with the highest probability is selected as the grasp point x_g. Finally, the grasp-relevant information, i.e., the ROI R, the grasp type g* and the grasp point x_g, is generated by the proposed visual analysis framework.
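The type-selection and point-localization steps above can be sketched as follows. The bandwidth, iteration count and starting point are illustrative choices, and the loop substitutes a simple weighted mean-shift iteration for the full Mean Shift clustering of [29].

```python
import numpy as np

def select_grasp(prob_maps, roi):
    """Pick a grasp type and grasp point from pixel-level probability maps.

    `prob_maps` is (6, H, W); `roi` is (top, left, bottom, right).
    Type selection sums probabilities inside the ROI; the grasp point is
    found by weighted mean-shift iterations over pixel positions, weighted
    by the winning probability map.
    """
    top, left, bottom, right = roi
    window = prob_maps[:, top:bottom, left:right]
    scores = window.sum(axis=(1, 2))   # summed probability per grasp type
    g_star = int(np.argmax(scores))    # best grasp type for the ROI

    # Mean shift towards the densest high-probability region.
    w = window[g_star]
    ys, xs = np.mgrid[top:bottom, left:right]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    weights = w.ravel()
    mode = pts[np.argmax(weights)].copy()  # start at the peak pixel
    bandwidth = 5.0                        # illustrative bandwidth
    for _ in range(20):
        d2 = ((pts - mode) ** 2).sum(axis=1)
        kernel = weights * np.exp(-d2 / (2 * bandwidth ** 2))
        mode = (pts * kernel[:, None]).sum(axis=0) / kernel.sum()
    return g_star, mode                    # grasp type index, (y, x) point
```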

4 Grasp planning with prior information

Information generated from the proposed visual analysis framework is used as a prior for guiding grasp planning. The objective of grasp planning is to find a feasible grasp configuration for stable grasping. Using a multi-fingered robotic hand to grasp objects requires computing: a) a joint configuration of the robotic hand, b) contact points on the object surface, and c) the relative pose between the object and the robotic hand. Due to the high dimensionality of the robotic hand, finding the best configuration is challenging. In this paper, we take advantage of the grasp-relevant information to define a pre-grasp configuration, and use a local search to find the grasp configuration with the highest quality.

In order to infer the relative pose between the target object and the robotic hand, we estimate the pose of the object segment presented in the ROI. The pre-grasp configuration of the robotic hand is then defined as follows:

  1. The palm center c is set to a point along the normal vector at the 3D grasp position x_g on the object surface, with an offset d between c and x_g.

  2. The hand palm is perpendicular to the surface normal at the grasp point.

  3. The number of fingers used is selected according to the detected grasp type.

Due to uncertainties, the defined pre-grasp configuration may fail to grasp the object. Hence, a local search is used to find the grasp configuration with the highest quality. During the search, we apply a local translation and rotation to sample a set of candidate grasps. The search space is thus a 4-dimensional space (d, α, β, γ), where d is the offset between the hand palm center c and the 3D grasp position x_g, and α, β, γ denote the rotation angles about the x, y and z axes of the hand coordinate frame, respectively. The search process is implemented in a simulator, and all candidates are evaluated. While executing a candidate grasp, all fingers move to contact the object surface and stop when the contact force exceeds a threshold. A grasp is considered feasible if the robotic hand can grasp and lift the object to a certain height above the tabletop. Multiple feasible grasps are found and evaluated with the quality measure introduced by Liu et al. [30], which measures the normal component of the contact forces. Finally, the grasp configuration with the highest quality is chosen. Algorithm 1 summarizes the grasp planning procedure.
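Sampling the local search space can be sketched as below. The nominal offset, perturbation ranges and sample count are illustrative values, not taken from the paper; in the real pipeline each candidate would be executed and scored in the simulator.

```python
import random, math

def sample_candidates(n, d0=0.05, d_range=0.02, ang_range=math.radians(15), seed=0):
    """Sample local perturbations of the pre-grasp configuration.

    Each candidate is a tuple (d, alpha, beta, gamma): the palm offset
    along the surface normal plus small rotations about the hand's x, y
    and z axes. All ranges here are illustrative assumptions.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        d = d0 + rng.uniform(-d_range, d_range)
        alpha, beta, gamma = (rng.uniform(-ang_range, ang_range) for _ in range(3))
        candidates.append((d, alpha, beta, gamma))
    return candidates
```

Each sampled tuple would then be turned into a hand pose, executed in the simulator, and kept only if the grasp lifts the object.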

1: Require: a computational saliency model, a grasp type detection model
2: Acquire an RGB image of the table scene.
3: The visual analysis framework returns the grasp-relevant information.
4: Use this information to initialize the pre-grasp configuration of the hand.
5: Use a local search to find a list of feasible candidate grasps.
6: Rank all feasible candidate grasps, select the best one and execute it in the physical simulator.
Algorithm 1: Attention based visual analysis for grasp planning

5 Experimental Results

5.1 Dataset and implementation

Existing datasets, such as the Yale human grasping dataset [31] and the UT grasp dataset [32], target the analysis of human grasping behavior; a similar dataset for robotic hands is not available. Hence, we introduce a new grasp type detection (GTD) dataset that contains RGB-D images and ground-truth grasp type labels (we use only the RGB data in this paper and plan to exploit the depth data in future work). In this dataset, 6 commonly used grasp types were considered and 12 household objects with various shape attributes were chosen, as shown in Figure 3.a. A MATLAB GUI was designed to manually annotate grasp types on the collected data; object parts in the images were labeled with different grasp types, which enables multi-label detection, as shown in Figure 3(b-c). The GTD dataset was split randomly into a training set and a testing set. The training parameters of the grasp type detection model were set as follows: the initial learning rate was 0.00001, and a step-decay policy was used to lower the learning rate as training progressed; stochastic gradient descent (SGD) with a momentum of 0.9 was used.
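The optimizer setup can be written out as a small sketch. The momentum update follows the standard SGD-with-momentum rule; the decay factor and step interval in `step_decay` are illustrative assumptions, since the paper only names the step policy.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, momentum=0.9):
    """One SGD-with-momentum update; returns updated weights and velocity."""
    v = momentum * v - lr * grad
    return w + v, v

def step_decay(base_lr, epoch, drop=0.1, every=20):
    """Step-decay policy: multiply the rate by `drop` every `every` epochs.

    `drop` and `every` are illustrative values, not from the paper.
    """
    return base_lr * (drop ** (epoch // every))
```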

Figure 3: Illustration of the GTD dataset. (a) The 12 household objects contained in GTD. (b) An original image. (c) An image labeled with large wrap. (d) An image labeled with precision. Pixels that belong to a grasp type are marked in color; the rest are background.

5.2 Evaluation of grasp type detection

We first evaluated the accuracy of grasp type detection on the proposed GTD dataset. For comparison, another network based on the Segnet architecture [33] was trained and evaluated. Segnet, which is widely used for image segmentation, has an encoder-decoder architecture; for pixel-level multi-label detection, we modified its output layer as introduced in subsection 3.2. The same training and testing procedures described in 5.1 were used for both networks. Table 1 shows the intersection-over-union (IoU) of the two networks: our approach achieves a higher average detection accuracy, outperforming the Segnet-based network by 0.10 IoU.

L-wrap S-wrap Power Pinch Precision Tripod Average
Ours 0.63 0.58 0.71 0.56 0.61 0.52 0.60
Segnet-based 0.51 0.56 0.41 0.61 0.46 0.48 0.50
Table 1: Performance over GTD dataset (IoU).
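The per-type IoU scores in Table 1 compare binary pixel masks for each grasp type. A minimal sketch of the metric (the assumption that masks come from thresholding the probability maps is ours):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union between two binary pixel masks.

    Returns 1.0 for two empty masks by convention.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```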

A confusion matrix (Figure 4) is used to evaluate the overall quality of the detected grasp types. Since the network predicts 6 labels corresponding to the 6 grasp types for each pixel, each row of the matrix shows the predicted probabilities of each grasp type for one ground-truth label. The diagonal elements have the highest values, showing that the proposed method predicts the correct grasp type with the highest probability. It is worth mentioning that several off-diagonal elements also have rather high values; for example, the predictions for the Power type also show a high probability for Precision, meaning these two grasp types are easily mislabeled by the proposed method. The reason is that the two types are highly correlated and share many similar characteristics. The confusion matrix thus gives insights into the relationships between grasp types.
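The row interpretation described above can be computed directly from pixel-level predictions. This sketch assumes flat arrays of per-pixel ground-truth labels and predicted probabilities; the exact aggregation the paper used is not specified, so averaging probabilities per ground-truth row is our assumption.

```python
import numpy as np

def row_normalized_confusion(gt_labels, pred_probs, n_types=6):
    """Row-normalized confusion matrix over pixel predictions.

    `gt_labels` is a flat array of ground-truth type indices per pixel;
    `pred_probs` is an (N, n_types) array of predicted probabilities.
    Row i holds the mean predicted probability of each type for pixels
    whose ground truth is type i, so high off-diagonal entries flag
    easily confused types (e.g. Power vs Precision).
    """
    cm = np.zeros((n_types, n_types))
    for i in range(n_types):
        mask = gt_labels == i
        if mask.any():
            cm[i] = pred_probs[mask].mean(axis=0)
    return cm
```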

Figure 4: The confusion matrix of the six grasp types.

We introduced random objects from the YCB object set [34] to test whether our system can handle unknown objects. Figure 5 shows the visual analysis process. Given an input RGB image, the ROI, denoted by a rectangle in the saliency map, is first selected by the attention model. Meanwhile, six pixel-level probability maps are obtained from the grasp type detection model; the possible grasp points, denoted by colored dots in each probability map, are obtained by clustering. Finally, the grasp type with the highest probability in the ROI is selected. As shown in Figure 5, our system is also able to produce grasp types and points for unknown objects. We measured the per-frame runtime of ROI localization, grasp type detection and the complete process; the proposed framework is implemented in Python and runs on a 2.50 GHz Intel i5 CPU.

Figure 5: Example of the visual analysis on various objects. First column is the input RGB image. Second column is the pixel-level saliency map, in which the red rectangle denotes the selected ROI. Third column is six pixel-level probability maps. The color dots in the probability maps denote the cluster centers which are considered as candidate grasp points. Last column is the output.

5.3 Grasp planning in simulator

The proposed visual analysis framework was further evaluated in object grasping tasks. We built a grasping simulator based on V-REP, a physics simulator that supports rapid verification, to conduct this experiment. The grasping experiments were performed with a Shadow hand, a five-fingered robotic hand which approximates a human hand. During simulation, the hand configuration and the contact forces between the Shadow hand and the objects were simulated in real time and used to measure the quality of candidate grasps.

To evaluate the performance of the visual analysis framework for grasp planning, we compared the proposed planning method with the method of Veres et al. [35], which randomly samples a set of candidate grasps based on the normals of the object surface and then ranks all candidates to find the best one. Since this method provides no grasp type, we use the commonly used power type for the Shadow hand to grasp objects. In this comparison, six objects were selected, as shown in Figure 6, and ten trials were run for each object. For each trial, an object is placed on the tabletop and a Kinect sensor captures an RGB image of the table scene; the grasp configuration of the Shadow hand is then planned in the simulator. The maximum number of search attempts for both methods is limited to 40. For each object, the success rate of grasping and the average number of search attempts needed to find a feasible grasp are shown in Table 2.

object          | Ours: success rate | Ours: search attempts | Veres et al. [35]: success rate | Veres et al. [35]: search attempts
tomato soup can |  8/10 | 2.5 |  8/10 | 20
tuna fish can   |  9/10 | 8.7 |  5/10 | 23.6
banana          |  9/10 | 2.1 |  5/10 | 21.6
apple           |  9/10 | 2.5 |  8/10 | 27.5
orange          |  8/10 | 2.8 |  7/10 | 19.4
chips can       | 10/10 | 2.7 | 10/10 | 11.4
Average         |       | 3.5 |       | 20.5
Table 2: Performance of the proposed grasp planning.
Figure 6: Examples of object grasping by the shadow hand in the simulator

It can be seen that the proposed method achieves a higher grasping success rate than the random search method. Moreover, the proposed planning method needs only 3.5 search attempts on average, compared to 20.5 for the random search method. This shows that the generated grasp-relevant information reduces the search time needed for grasp planning and locates feasible grasp configurations more accurately in the search space. It is worth mentioning that the random search method with a power type easily fails at grasping small objects, such as the banana and the tuna fish can; this limitation does not occur with the proposed planning method, since a feasible grasp type is predicted before grasping. Hence, for multi-fingered robotic hands, objects with different shape attributes should be handled with different grasp types.

We also noticed several failures of the proposed planning method. The main reason is that the predicted grasp point on the object surface is sometimes too close to the tabletop; since environmental constraints are not considered in this work, the Shadow hand then collides with the table and fails to grasp the object. In future work, it will be beneficial to also consider environment and task constraints.

Figure 7: Examples of object grasping by the Barrett hand and the Baxter gripper

Another state-of-the-art method was proposed by Ciocarlie and Allen [36], in which a hand posture subspace is defined for dexterous robotic hands and simulated annealing is used to find a solution in this subspace. Their grasp planner only produces a power type, which means it may fail to grasp small objects. Another limitation is the long search time needed to find a feasible solution: over 70,000 attempts per plan, with an average running time of 158 seconds [36]. Compared with their work, our method requires far fewer search attempts. We also tested our framework with a 3-fingered Barrett hand and a 2-fingered Baxter gripper; Figure 7 shows some results. On average, the Barrett hand needed 4 search attempts to find a feasible grasp, and the Baxter gripper also achieved successful grasps.

6 Conclusion

This paper proposes an attention based visual analysis framework which computes grasp-relevant information directly from visual data for multi-fingered robotic grasping. In this framework, a ROI is first localized by a computational attention model; the grasp type and point on the object segment presented in the ROI are then computed by a grasp type detection model and used as prior information to guide grasp planning. We demonstrated that the proposed method gives good predictions of grasp type and point, even in cluttered environments. Furthermore, the performance of the framework was evaluated in object grasping tasks: compared to previous methods without priors, the information generated by the visual analysis significantly speeds up grasp planning, and by using a feasible grasp type the grasping success rate is also improved. The results show that the proposed framework helps robotic systems to know how and where to grasp objects according to the attributes of object sub-regions. Since our method does not rely on object detection, it can also handle unseen objects.

For future work, several aspects are considered: first, the current framework is goal-driven and only learns how to grasp an object, so it will be interesting to extend it into a task-driven framework, e.g. grasping in a human-robot handover task; second, the choice of grasp type and point currently depends only on the attributes of object sub-regions, but since grasp planning is also affected by environment and task constraints, those constraints will be taken into consideration; finally, we plan to evaluate the proposed framework on real-world hardware.



  • Schwarz et al. [2017] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schüller, and S. Behnke. Nimbro picking: Versatile part handling for warehouse automation. In IEEE International Conference on Robotics and Automation (ICRA), pages 3032–3039. IEEE, 2017.
  • Theeuwes [2010] J. Theeuwes. Top–down and bottom–up control of visual selection. Acta psychologica, 135(2):77–99, 2010.
  • Awh et al. [2012] E. Awh, A. V. Belopolsky, and J. Theeuwes. Top-down versus bottom-up attentional control: A failed theoretical dichotomy. Trends in cognitive sciences, 16(8):437–443, 2012.
  • Rogez et al. [2015] G. Rogez, J. S. Supancic, and D. Ramanan. Understanding everyday hands in action from rgb-d images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3889–3897, 2015.
  • Cai et al. [2017] M. Cai, K. M. Kitani, and Y. Sato. An ego-vision system for hand grasp analysis. IEEE Transactions on Human-Machine Systems, 47(4):524–535, 2017.
  • Napier [1956] J. R. Napier. The prehensile movements of the human hand. The Journal of bone and joint surgery. British volume, 38(4):902–913, 1956.
  • Hsiao et al. [2010] K. Hsiao, S. Chitta, M. Ciocarlie, and E. G. Jones. Contact-reactive grasping of objects with partial shape information. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1228–1235. IEEE, 2010.
  • Aleotti and Caselli [2012] J. Aleotti and S. Caselli. A 3d shape segmentation approach for robot grasping by parts. Robotics and Autonomous Systems, 60(3):358–366, 2012.
  • Vahrenkamp et al. [2018] N. Vahrenkamp, E. Koch, M. Waechter, and T. Asfour. Planning high-quality grasps using mean curvature object skeletons. IEEE Robotics and Automation Letters, 2018.
  • Laga et al. [2013] H. Laga, M. Mortara, and M. Spagnuolo. Geometry and context for semantic correspondences and functionality recognition in man-made 3d shapes. ACM Transactions on Graphics (TOG), 32(5):150, 2013.
  • Harada et al. [2008] K. Harada, K. Kaneko, and F. Kanehiro. Fast grasp planning for hand/arm systems based on convex model. In IEEE International Conference on Robotics and Automation (ICRA), pages 1162–1168. IEEE, 2008.
  • Herzog et al. [2014] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour, and S. Schaal. Learning of grasp selection based on shape-templates. Autonomous Robots, 36(1-2):51–65, 2014.
  • Dang and Allen [2014] H. Dang and P. K. Allen. Semantic grasping: planning task-specific stable robotic grasps. Autonomous Robots, 37(3):301–316, 2014.
  • Kopicki et al. [2016] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt. One-shot learning and generation of dexterous grasps for novel objects. The International Journal of Robotics Research, 35(8):959–976, 2016.
  • Kroemer and Peters [2014] O. Kroemer and J. Peters. Predicting object interactions from contact distributions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3361–3367. IEEE, 2014.
  • Treisman and Gelade [1980] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97 – 136, 1980.
  • Itti et al. [1998] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov 1998.
  • Walther and Koch [2006] D. Walther and C. Koch. Modeling attention to salient proto-objects. Neural Networks, 19(9):1395–1407, 2006. ISSN 0893-6080.
  • Frintrop [2006] S. Frintrop. VOCUS: A visual attention system for object detection and goal-directed search. Lecture Notes in Artificial Intelligence, 3899:1–197, 2006.
  • Frintrop et al. [2015] S. Frintrop, T. Werner, and G. M. García. Traditional saliency reloaded: A good old model in new shape. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • Huang et al. [2015] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE International Conference on Computer Vision (ICCV), Dec 2015.
  • Li et al. [2016] X. Li, L. Zhao, L. Wei, M. H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE Transactions on Image Processing, 25(8):3919–3930, Aug 2016.
  • Liu and Han [2016] N. Liu and J. Han. Dhsnet: Deep hierarchical saliency network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • Kümmerer et al. [2015] M. Kümmerer, L. Theis, and M. Bethge. Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. In ICLR Workshop, May 2015.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Feix et al. [2016] T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic. The grasp taxonomy of human grasp types. IEEE Transactions on Human-Machine Systems, 46(1):66–77, 2016.
  • Chen et al. [2018] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • Comaniciu and Meer [2002] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, 24(5):603–619, 2002.
  • Liu et al. [2004] G. Liu, J. Xu, X. Wang, and Z. Li. On quality functions for grasp synthesis, fixture planning, and coordinated manipulation. IEEE Transactions on Automation Science and Engineering, 1(2):146–162, 2004.
  • Bullock et al. [2015] I. M. Bullock, T. Feix, and A. M. Dollar. The yale human grasping dataset: Grasp, object, and task data in household and machine shop environments. The International Journal of Robotics Research, 34(3):251–255, 2015.
  • Cai et al. [2015] M. Cai, K. M. Kitani, and Y. Sato. A scalable approach for understanding the visual structures of hand grasps. In IEEE International Conference on Robotics and Automation (ICRA), pages 1360–1366. IEEE, 2015.
  • Badrinarayanan et al. [2017] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
  • Calli et al. [2015] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar. Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36–52, 2015.
  • Veres et al. [2017] M. Veres, M. Moussa, and G. W. Taylor. An integrated simulator and dataset that combines grasping and vision for deep learning. arXiv preprint arXiv:1702.02103, 2017.
  • Ciocarlie and Allen [2009] M. T. Ciocarlie and P. K. Allen. Hand posture subspaces for dexterous robotic grasping. The International Journal of Robotics Research, 28(7):851–867, 2009.