I Introduction
In recent years, robots have become increasingly prevalent in everyday life, where environments are unstructured and partially observable. When retrieving a target object in such real-world scenarios, we frequently encounter a plethora of identical objects (e.g., grocery carts, tool boxes, warehouse storage, etc.). Humans understand the spatial relations between scene objects and plan grasp poses accordingly for the task [feix2014analysis]. Moreover, our fingers occasionally knock away surrounding objects; such minor collisions can even facilitate grasping. When a robot faces the same situation, as illustrated in Fig. 1, we argue that an efficient grasping system should reason about the spatial relations between objects to grasp the more accessible target while exploiting minor collisions.
Recent methods commonly solve this problem by cascading an object-centric grasp pose predictor with an analytic [mousavian2019graspnet] or data-driven collision checking module [murali2020clutteredgrasping]. While these two-stage approaches transform the problem into the well-studied single-object grasping, they frequently struggle in the collision checking stage with imperfect and partial camera observations (i.e., they fail to predict collisions with occluded parts). Furthermore, because the two stages are carried out independently, these approaches overlook the fact that the spatial relations between scene objects intrinsically affect the grasping success probability. In the aforementioned scenario, the absence of knowledge of the spatial relations between objects prevents the robot from efficiently grasping a target object.
To fill the gap between object-centric reasoning and the collision checking module, an end-to-end approach is desired for natural and coherent learning of the spatial relations between scene objects. Due to the vast 6-DoF action space and the infinite variety of object shapes and arrangements, learning directly from unstructured scene observations with traditional deep learning architectures (e.g., FCNs [long2015fully], CNNs) does not generalize well. Graphs, on the other hand, have superior expressive capabilities since they explicitly model entities (e.g., objects, books, individuals, etc.) as nodes (containing descriptive features) and their relations as edges. Operating on graphs, Graph Neural Networks (GNNs) [4700287] are capable of efficiently capturing the hidden correlations between nodes [ZHOU202057]. In light of the recent advances in utilizing GNNs in robotics [garcia2019tactilegcn, wilson20a, murali2020taskgrasp], we propose a 6-DoF target-driven grasping system with a Grasp Graph Neural Network (G2N2). A 3D Autoencoder first learns scene representations and encodes a 3D observation in a grasp coordinate frame into a graph, with nodes representing object geometries and edges indicating their spatial relationships. Given a set of sampled poses, the G2N2 directly predicts the grasping success probabilities from the generated graphs, accounting for both grasp stability and object relations. Since the quality of sampled grasp candidates is critical to the efficiency of our grasping system, we additionally propose a shape-completion-assisted sampling method that significantly improves sampling quality for partial geometries.
Despite sim-to-real gaps, our grasping system, which is trained entirely on synthetic datasets, generalizes well to the real world. We performed experiments in both simulated and real environments, in which our approach achieved 75.25% and 77.78% grasping accuracy, respectively, for densely cluttered novel targets, outperforming the baselines by large margins. Our work is one of the early attempts to utilize GNNs for a challenging 6-DoF target-driven robotic grasping task. The core technical contributions are:

A novel Grasp Graph Neural Network (G2N2) that builds upon Graph Convolutional Networks [kipf2016gcn]. It learns to understand the spatial relations between scene objects and predicts feasible 6-DoF grasp poses for challenging robotic tasks such as retrieving a target object from several identical ones in dense clutter.

A 6-DoF target-driven grasping system that contains a shape-completion-assisted grasp pose sampling algorithm. The method improves sample efficiency for partially observable objects and leads to more effective data collection and training.
II Related Work
Robotic grasping is a fundamental skill that enables further manipulation tasks [sahbani2012overview, bohg2013data]. Recently, data-driven approaches [lenz2015deep, pinto2016supersizing] have applied Convolutional Neural Networks (CNNs) to learn grasping in the 3-DoF action space [zeng2018learning, liang2019knowledge, yang2021attribute]. Meanwhile, other works learn the more flexible 6-DoF grasping [gualtieri2016high, ten2017grasp, liang2019pointnetgpd]. Choi et al. employed a 3D CNN to classify 24 predefined orientations from object voxel occupancy grids [choi2018learning], and Lou et al. extended the work to full 6-DoF and introduced reachability awareness [lou2020learning]. Based on the PointNet++ [qi2017pointnet] architecture, Mousavian et al. learned a variational sampling method and a grasp pose evaluator to detect stable 6-DoF poses [mousavian2019graspnet]. When grasping a target from clutter, these approaches often rely on an analytical [mousavian2019graspnet, liang2019knowledge] or data-driven collision checking module [murali2020clutteredgrasping, lou2021collision]. Consequently, they overlook the importance of the spatial relations between objects to grasping results. Unlike previous target-driven methods, we focus on a more demanding task, retrieving a predefined target from several identical objects in dense clutter, where knowledge of object relationships is essential. We use graphs to learn the deep relations between numerous objects because they clearly define the entities and their relations in a compact and effective form [xu2018powerful].

Recently, Graph Neural Networks (GNNs) [4700287] have been applied to multiple disciplines that necessitate processing rich relational information, including social networks [wu2018socialgcn], citation systems [velivckovic2017graph]
[yao2019graph, yao2019graph], and particle dynamics [sanchez2020learning]. Following the success of GNNs in these fields, the robotics community has shown growing interest. By structuring a neural network to pass information between adjacent nodes, Murali et al. used Graph Convolutional Networks (GCNs) [kipf2016gcn] to reason about the relationships between object classes and tasks [murali2020taskgrasp]. Wilson et al. learned representations that enable a GNN to infer how to manipulate a group of objects [wilson20a]. Garcia-Garcia et al. proposed TactileGCN [garcia2019tactilegcn], which predicts grasping stability from a graph of tactile sensor readings. In contrast, we propose a target-driven approach that employs a Grasp Graph Neural Network (G2N2) to capture the relations between scene objects.

Aside from data-driven models, sampling methods are equally critical to grasp pose detection [ten2017grasp, liang2019pointnetgpd, lou2020learning]. Given a single-view observation, reconstructing the missing geometry naturally enhances sampling efficiency. 3D shape completion from partial observations has been studied in computer vision and graphics [shapenet2015, FirmanCVPR2016, 7298801] and applied to robotics. Varley et al. trained a shape completion module with 3D CNNs [varley2017shape] and performed analytical grasp planning on the output shape. Rather than solely relying on the shape completion module, we use it only for grasp pose sampling. Hence, errors in shape completion that lead to unsuccessful grasps will be learned by the G2N2 and penalized during evaluation.

III Problem Formulation
The goal of our work is to investigate target-driven robotic grasping in dense clutter, where several identical targets may be present. Besides grasp pose stability, this task also demands intensive reasoning about the spatial relations between scene objects (e.g., relative locations, occlusions, etc.) when predicting the most feasible grasp pose. We define our problem as follows:
Definition 1.
Given a query image, the robot needs to retrieve the same item from the dense clutter where several identical target objects may be present.
With a single-view camera observation of the open world, the challenge arises from:
Assumption 1.
The objects are possibly unknown (i.e., novel objects) and partially observable (e.g., object occlusions and imperfect sensors).
The partial observations inevitably increase collisions during grasping. A robot that understands the spatial relations between objects can adapt to such collisions. To learn these relationships with GNNs, we formulate a graph representation:
Definition 2.
A grasp graph includes grasp pose information, geometric features of objects, and the spatial relations between these objects.
However, objects far away from the target are less relevant to our task; therefore, we assume:
Assumption 2.
The objects that are distant from a target object have limited influence on the grasping success rate.
Given a set of uniformly sampled grasp candidates $\mathcal{C} = \{c_1, \ldots, c_n\}$, a grasp graph $\mathcal{G}_i$ is constructed from each grasp candidate $c_i$ and the scene observation $\mathcal{O}$. The grasp pose is subject to a binary-valued metric $s \in \{0, 1\}$, where $s = 1$ indicates that the grasp is successful. The goal is to predict the grasping success probability $p(s = 1 \mid \mathcal{G}_i)$.
IV Proposed Approach
The proposed 6-DoF grasping system efficiently retrieves one of several identical targets from dense clutter. This task is challenging, as it demands intensive reasoning about the spatial relations between scene objects. To incorporate the grasp pose information into the latent features of the scene, the proposed grasping system first transforms the object point clouds into the grasp coordinate frame and extracts the geometric features of each object with a 3D Convolutional Autoencoder. Using these features as node embeddings, we construct a grasp graph representing the grasp pose, the object geometries, and the spatial relations between them. Next, we train an end-to-end model, the Grasp Graph Neural Network (G2N2), on a synthetic dataset of grasp graphs to predict the grasping success probability. During testing, we employ a shape-completion-assisted sampling method to propose diverse grasp candidates; the G2N2 then evaluates all grasp graphs, and the highest-scored grasp pose is executed. The system is depicted in Fig. 2 and delineated in Algorithm 1.
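The evaluate-and-select loop described above can be sketched as follows. This is an illustrative pure-Python sketch, not the paper's actual implementation: `candidates`, `build_graph`, and `score_graph` are placeholders for the shape-completion-assisted sampler, the pose-to-graph transformation, and the G2N2 forward pass, respectively.

```python
def select_best_grasp(candidates, build_graph, score_graph):
    """Candidate-evaluation loop: build a grasp graph per sampled pose,
    score it with the learned model, and return the highest-scoring pose."""
    best_pose, best_score = None, float("-inf")
    for pose in candidates:
        graph = build_graph(pose)        # pose-conditioned grasp graph
        score = score_graph(graph)       # estimated p(s=1 | graph)
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose
```

Because every candidate is scored by the same graph-level predictor, no separate collision checking stage is needed in this flow.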
IV-A Pose-to-Graph Transformation
Learning an effective graph representation is critical to the proposed G2N2. Our grasping system formulates the observation and a grasp candidate into a grasp graph representation containing spatial features, as defined in Sec. III. For a scene of $n$ objects, a grasp graph comprises a tuple of two sets $(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. The graph is directed, and $e_{ij} \in \mathcal{E}$ is an edge from node $v_i$ to node $v_j$. A node $v_i$ is associated with node features $\mathbf{h}_i$ and an edge $e_{ij}$ with edge features $\mathbf{w}_{ij}$, each with a fixed initial dimension.
We adopt UOIS [xie2019uois] for segmentation and the Siamese-network-based target matching in [lou2020learning] for localizing targets in the scene observation. The object point clouds are then transformed by the 6-DoF grasp pose into the grasp coordinate frame (i.e., the axes are aligned with the gripper reference frame and the tool center point becomes the origin). This transformation encodes the grasp pose within the point clouds. Since distant objects have little impact on the prediction, we set an effective workspace around the grasping point. The valid object point clouds are voxelized to a set of voxel grids that represent the nodes in the grasp graph. Next, the 128-dimensional node features $\mathbf{h}_i$ are extracted using a 3D Convolutional Autoencoder: Conv3D(1, 32, 5) → ELU → Maxpool(2) → Conv3D(32, 32, 3) → ELU → Maxpool(2) → FC(·, 128). The grasp graph is fully connected, and each edge carries the attribute $\mathbf{a}_{ij} = [\mathbf{d}_{ij}, \lVert \mathbf{d}_{ij} \rVert]$, where $\mathbf{d}_{ij}$ is a 3D vector pointing from source node $v_i$ to target node $v_j$ and $\lVert \mathbf{d}_{ij} \rVert$ is their Euclidean distance. The grasp graph is thus constructed such that each node carries the latent geometric feature $\mathbf{h}_i$ and each edge carries the spatial feature $\mathbf{w}_{ij} = g(\mathbf{a}_{ij})$, where $g$ is a Multi-Layer Perceptron (MLP) that extracts edge features from the edge attributes.
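The fully connected edge construction can be sketched in a few lines. This is a minimal illustrative sketch assuming each object is reduced to a centroid plus a precomputed latent feature vector; the function name and dictionary layout are ours, not the paper's.

```python
import math

def build_grasp_graph(centroids, node_features):
    """Build a fully connected directed graph: every ordered pair (i, j),
    i != j, gets an edge attribute [dx, dy, dz, distance], the 3D vector
    from node i to node j followed by its Euclidean norm."""
    n = len(centroids)
    edges, edge_attrs = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [cj - ci for ci, cj in zip(centroids[i], centroids[j])]
            dist = math.sqrt(sum(x * x for x in d))
            edges.append((i, j))
            edge_attrs.append(d + [dist])
    return {"nodes": node_features, "edges": edges, "edge_attrs": edge_attrs}
```

In the full system, each raw attribute would additionally be passed through the edge MLP to obtain the learned spatial feature.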
IV-B Grasp Graph Neural Network
GNNs specialize in discovering the underlying relationships between nodes. In graph learning, a nonlinear function $f$ takes a graph representation $\mathcal{G}$ as input and maps it to $\mathcal{G}'$ with updated node features [li2020deepergcn]. Graph updates in GNNs compose an arbitrary number of these nonlinear functions and produce node features containing learned knowledge about their neighborhoods. We define the GNN structure by finding a set of operators $\{f_1, \ldots, f_L\}$ parameterized by the grasp graph dataset such that the output graph representation is useful for our grasp pose prediction task. The graphs are updated as follows:
$$\mathcal{G}' = (f_L \circ f_{L-1} \circ \cdots \circ f_1)(\mathcal{G}) \quad (1)$$
Given a grasp graph, we aggregate the node features following the intuitions in [li2020deepergcn] by incorporating a generalized SoftMax($\beta$) aggregator $\mathrm{agg}_{\beta}$. As experimentally shown in [li2020deepergcn], the learned parameter $\beta$ defines a more effective aggregation scheme than mean() or max() alone for the learning task. Eq. 2 describes the message-passing function for updating the latent features of node $v_i$ with its neighbors $\mathcal{N}(i)$:
$$\mathbf{h}_i' = \phi\big(\mathbf{h}_i,\ \mathrm{agg}_{\beta}(\{\mathbf{h}_j + \mathbf{w}_{ij} : j \in \mathcal{N}(i)\})\big) \quad (2)$$
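The generalized SoftMax($\beta$) aggregator of [li2020deepergcn] can be illustrated on scalar messages. This is a simplified sketch (the real aggregator operates channel-wise on feature vectors with a learnable $\beta$):

```python
import math

def softmax_aggregate(messages, beta=1.0):
    """Generalized SoftMax(beta) aggregation over neighbor messages: a
    softmax-weighted sum that interpolates between the mean (beta -> 0)
    and the max (beta -> +inf) of the messages."""
    weights = [math.exp(beta * m) for m in messages]
    total = sum(weights)
    return sum((w / total) * m for w, m in zip(weights, messages))
```

With $\beta = 0$ all weights are uniform, recovering mean(); as $\beta$ grows, the largest message dominates, approaching max(). Learning $\beta$ lets the network pick the aggregation sharpness best suited to the data.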
Our network sequentially connects three GNN layers ($L = 3$ in Eq. 1), so the update intuitively convolves up to the 3rd-order neighbors of every node, capturing copious spatial relations. After the graph operations, the graph-level output features are extracted from the final graph with a global mean pooling function, and the grasping success probability is computed by a post-message-passing MLP. The detailed network structure is shown in Fig. 3. At a high level, each grasp graph represents a candidate; the G2N2 consumes the grasp graphs and estimates the grasping success probabilities in an end-to-end fashion. Finally, we select the highest-scored grasp pose.

IV-C Grasp Sampling
Geometry-based sampling generates grasp poses constrained by the surface normal and the local direction of the principal curvature [ten2017grasp, liang2019pointnetgpd]. Uniform random sampling, on the other hand, covers a larger action space [lou2020learning]; however, the sample quality often suffers under partial observation because grasp candidates may collide with occluded geometries. To reduce these low-quality samples, we first infer the occluded geometry of the object with a 3D CNN shape completion module: Conv3D(1, 32, 5) → ELU → Maxpool(2) → Conv3D(32, 32, 3) → ELU → Maxpool(2) → FC(·, 128) → ReLU → FC(128, ·) → Maxunpool(2) → ConvTranspose3D(32, 32, 3) → Maxunpool(2) → ConvTranspose3D(32, 1, 5). The sampled grasp poses are constrained such that the angle between their approach directions and the object surface normal stays below a threshold (i.e., the gripper will not slide over the object). The process is depicted in Fig. 4. The key differences between our approach and other comparable works [varley2017shape, ten2017grasp, liang2019pointnetgpd] are: 1) the diversity of the sampled grasp candidates is not compromised, while more plausible grasp samples contribute to a more balanced dataset; 2) grasp candidates of higher quality also improve the grasping success rate, especially under heavy occlusion (i.e., single-view observation, object-object occlusions); and 3) our system uses shape completion only in the sampling stage, so errors in the completed shape have negligible impact on grasping prediction.
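The approach-direction constraint above amounts to an angle test between each candidate's approach vector and the local surface normal. A minimal sketch, with function names and the candidate tuple layout chosen here for illustration (the paper does not specify the threshold value):

```python
import math

def angle_between(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def filter_candidates(candidates, max_angle):
    """Keep only candidates whose approach direction is within max_angle
    of the surface normal at the grasp point; each candidate is a tuple
    (approach_direction, surface_normal)."""
    return [c for c in candidates if angle_between(c[0], c[1]) <= max_angle]
```

Candidates sampled on the completed shape that approach at grazing angles are discarded, which is what prevents the gripper from sliding over the object.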
IV-D Data Collection and Training
We first collect 3,000 depth images of randomly positioned training objects (toy blocks of various geometric shapes) and voxelize the scene point clouds into occupancy grids. This dataset is prepared for representation learning with the 3D Convolutional Autoencoder described in Sec. IV-A. Using the same dataset, the shape completion module for sampling is trained by regressing the input partial voxel grid against the complete ground-truth voxel grid, obtained by transforming the full mesh from the canonical pose.
The grasping simulation environment is built in CoppeliaSim v4.0 with the Bullet physics engine v2.83. We randomly place training objects on the tabletop to form dense clutter and iteratively attempt 32 grasps for each arrangement. The grasp graph dataset contains 3,000 scenes and 96,000 grasps in total. After transforming the scene observations into the lower-dimensional grasp graph dataset, we implemented and trained the G2N2 in PyTorch Geometric with the binary cross-entropy loss and the Adam optimizer until the validation accuracy plateaued.
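The binary cross-entropy objective used here scores a predicted success probability against the recorded {0, 1} grasp outcome. A minimal stdlib sketch of the per-sample loss (the actual training uses PyTorch's batched implementation):

```python
import math

def bce_loss(p, label):
    """Binary cross-entropy between a predicted success probability p
    in (0, 1) and the observed grasp outcome label in {0, 1}."""
    eps = 1e-7                       # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```

Confident correct predictions (e.g., p near 1 for a successful grasp) incur low loss, while confident mistakes are penalized sharply, which is what drives the G2N2 toward calibrated success probabilities.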
V Experiments
We experiment in diverse settings in both simulated and real-world environments. The experiments are designed to address the following questions: i) How effective is the proposed approach at retrieving one of several identical targets in dense clutter? ii) How does the proposed approach perform compared to other two-stage methods? iii) Does the proposed approach generalize to more challenging and novel scenarios?
To answer those questions, we test the proposed approach and three comparable works:

GSP stands for Grasp Stability Predictor [lou2020learning], an object-centric 6-DoF grasp pose detection method that uniformly samples grasp candidates on the object point clouds and predicts their grasping success probabilities with 3D CNNs. (The acronym GSP refers to the 3D CNN grasping module in [lou2020learning] and is defined here for concise reference.)

6GN is 6-DoF GraspNet [mousavian2019graspnet], which learns to sample diverse 6-DoF poses from the latent variables of a Variational Autoencoder. The candidates are evaluated by a PointNet++-based network [qi2017pointnet] and further enhanced by gradient-based refinement.

GSP+AS enhances GSP with an analytical search algorithm. It measures the number of voxels around each target object and assigns clutteredness scores proportional to the voxel counts.
For a fair comparison, the same target object masks are provided to each method, and 100 grasp candidates per object are generated following their respective sampling methods. Our approach considers scene objects when evaluating grasp candidates; therefore, it does not require additional collision checking. For the other object-centric methods, the Flexible Collision Library (FCL) [6225337] is used to determine whether the robot mesh is in collision with the scene mesh, which is reconstructed using the Marching Cubes algorithm [marchingcubes]. The highest-scored collision-free grasp pose is executed. We define the grasping accuracy as the number of successful grasps divided by the total number of grasp attempts. A grasp is successful only if the robot gripper reaches the object and lifts it by 15 cm. If no grasp pose is found among the sampled candidates, the attempt is also counted as unsuccessful.
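The evaluation metric above reduces to a simple ratio, where attempts with no collision-free pose found are recorded as failures (function name ours, for illustration):

```python
def grasping_accuracy(outcomes):
    """outcomes: one boolean per grasp attempt (True = object lifted 15 cm;
    attempts where no grasp pose was found count as False).
    Returns accuracy as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)
```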
V-A Simulation Experiments
Our simulated environment includes a Franka Emika Panda robot arm and various test objects. A single-view RGB-D observation is taken with a simulated camera. The first set of experiments, shown in Fig. 5, tests the approaches with block clutter of increasing density. One set of toy blocks includes seven objects of different shapes. BlockNormal is similar to the training settings, where two sets of objects are randomly placed on the tabletop; BlockMedium increases the clutter density by adding another set of objects; BlockDense places 5 sets of objects (35 objects in total) in the same workspace, creating dense clutter. Each method runs 101 times, and the results are summarized in Table I. 6GN occasionally fails for smaller objects, but all methods are effective in BlockNormal since the objects are lightly cluttered. As the density increases, GSP and 6GN perform poorly, as they often try to grasp less accessible targets. They are also inevitably more prone to disastrous collisions due to occluded geometries. Although GSP+AS is able to select less cluttered objects analytically, it struggles in BlockDense, showing that the analytical clutteredness measure is largely limited under heavy occlusions. The proposed approach with G2N2 (Ours) performs best in all categories since it takes the spatial relations between objects into account. We achieve an 84.16% success rate in the BlockDense scenario, outperforming the next-best baseline by 19.72% (relative).
GSP  6GN  GSP+AS  Ours  
BlockNormal  91.09  87.13  92.08  93.06 
BlockMedium  84.16  76.24  87.13  90.10 
BlockDense  61.39  57.43  70.30  84.16 
GSP  6GN  GSP+AS  Ours  
NovelSingle  76.24  83.16  76.24  87.13 
NovelMedium  66.34  74.25  71.29  84.16 
NovelDense  54.45  63.37  70.30  81.19 
We further experiment in more demanding settings. NovelSingle includes 8 objects from the YCB dataset [7251504] that are distinct from the training objects, focusing on testing generalizability to novel objects. The friction coefficients heavily influence the outcome, and we use common values according to each object's material. NovelMedium and NovelDense double and triple the object numbers, respectively, increasing the density of the scenes. The scenes are illustrated in Fig. 6. Each method is tested 101 times, and the results are compiled in Table II. Our approach achieves the highest grasping accuracy of 81.19% in NovelDense, suggesting that it generalizes to completely different settings that include novel objects.
The ablation studies are carried out in the BlockDense scenario to closely examine each component; the results are summarized in Table III. All other modules and parameters are held fixed except for the following alterations. noSC uses the random sampling previously employed in [lou2020learning] instead of our shape-completion-assisted method and shows discouraging performance. noMsg stops message passing in our network (equivalent to not performing any graph operations) and results in lower performance, suggesting that the G2N2 indeed learns to aggregate node features effectively. FCN attempts to learn from the encoded features with fully connected layers but generalizes poorly. Compared to GCN, which uses Graph Convolutional Networks [kipf2016gcn] to learn the graph representations, the G2N2 is more effective.
noSC  noMsg  FCN  GCN  Ours  
Grasp Acc.  75.25  65.35  51.49  80.20  84.16 
V-B Real-Robot Experiments
GSP  6GN  GSP+AS  Ours  
BlockNormal  83.33  77.78  83.33  88.89 
BlockMedium  77.78  72.22  83.33  83.33 
BlockDense  66.67  61.11  72.22  80.65 
GSP  6GN  GSP+AS  Ours  
NovelSingle  72.22  83.33  72.22  83.33 
NovelMedium  44.44  77.78  66.67  83.33 
NovelDense  38.89  61.11  66.67  77.78 
We experiment in the real world to further examine the effectiveness of the proposed grasping system. The setup consists of a Franka Emika Panda robot arm and an Intel RealSense D415 RGB-D camera externally mounted on a tripod facing the robot. A single-view RGB-D image is taken as the input for all approaches. Given multiple instances of a target object, the goal is to retrieve one of them from the scene. For each round of testing, we randomly initialize three different scenes, as shown in Fig. 7. Six target objects are grasped sequentially per scene over 3 rounds, for a total of 18 grasps per method. The first testing scenario includes wooden blocks similar to our training objects in simulation. The experimental results are summarized in Table IV. The performance of the baseline approaches drops significantly with increasing object density, underscoring the need for knowledge of object relations. Our approach achieves the highest grasping accuracy in all three scenarios, suggesting that it is the most efficient and the most sim-to-real transferable.
Following the same procedure, we test the generalizability to unseen objects with the arrangements shown in Fig. 8. Note that all testing objects are distinct from the training ones in size, color, and shape, largely increasing the inference challenge. Table V compiles the results of the tests. In NovelSingle, GSP+AS is equivalent to GSP since only one instance of the target object is present. Their narrow object-centric focus struggles with larger objects. Although 6GN is trained with YCB objects and manages to excel in NovelSingle, it only achieves a 61.11% grasping accuracy in NovelDense since it often attempts to grasp more cluttered targets. By analytically finding the less cluttered target, GSP+AS improves the grasping accuracy of GSP in the NovelMedium and NovelDense scenarios by 50% and 71% (relative), respectively. However, it heavily relies on collision detection, which is vulnerable to occlusions. Overall, the results are consistent with their simulation counterparts. With the help of the G2N2, we reason about the spatial relations between objects; our 6-DoF grasping system achieves a 77.78% grasping accuracy in NovelDense, outperforming the other baselines by large margins.
VI Conclusion
We presented a 6-DoF grasping system that addresses a complex target-driven task: grasping one of several identical objects from dense clutter. Our approach uses the Grasp Graph Neural Network (G2N2) to harness the spatial relations between objects. Additionally, we designed a more efficient shape-completion-assisted 6-DoF grasp pose sampling method that facilitates both training and grasping. Through exhaustive evaluation in both simulated and real settings, the proposed grasping system achieved an 88.89% grasping accuracy in scenarios with densely cluttered blocks and a 77.78% grasping accuracy in scenarios with densely cluttered novel objects.