Precise Object Placement with Pose Distance Estimations for Different Objects and Grippers

10/03/2021
by Kilian Kleeberger, et al., Fraunhofer

This paper introduces a novel approach for the grasping and precise placement of various known rigid objects using multiple grippers within highly cluttered scenes. Using a single depth image of the scene, our method estimates multiple 6D object poses together with an object class, a pose distance for object pose estimation, and a pose distance from a target pose for object placement for each automatically obtained grasp pose with a single forward pass of a neural network. By incorporating model knowledge into the system, our approach has higher success rates for grasping than state-of-the-art model-free approaches. Furthermore, our method chooses grasps that result in significantly more precise object placements than prior model-based work.



I Introduction

For robots to reliably grasp and manipulate objects in undefined poses, they have to perceive their environment by means of sensors and plan their actions accordingly. In this work, we focus on robotic bin-picking, where multiple rigid objects of different types are stored chaotically in a bin and the robot has to pick the objects and place them at a given target pose, as exemplarily visualized in Fig. 1.

The task is challenging due to a high amount of clutter and occlusion in the scene, various types of objects that have to be differentiated, object symmetries that result in pose ambiguities, varying lighting conditions, and missing, incorrect, and noisy depth information from the real-world sensor. Furthermore, collisions with the bin and other objects in the scene have to be avoided. During grasping, the targeted object can easily move relative to the gripper under the weight of any objects that are on top of it as shown in Fig. 2 (a) or when the grasp is too far off from the center of mass of the object. Additionally, some object geometries can potentially cause entanglements as illustrated in Fig. 2 (b). This can lead to collisions when placing the picked object at the defined target pose or other objects dropping during placement.

Often, the bins cannot be emptied completely because some objects cannot be reached without collisions [1, 2]. An object at the border of a bin may not be graspable with a parallel jaw gripper because one finger would collide with the bin wall, as exemplarily visualized in Fig. 2 (c). The same holds when, e.g., bars or cylinders lie very close to each other, as illustrated in Fig. 2 (d). In these cases, the candidates can be picked with an eccentric suction gripper, where the suction cup can still be placed on top of the objects. Furthermore, different object geometries are generally challenging to handle with a single gripper type.

Fig. 1: (left) Real-world robot cell for picking chaotically stored objects of different categories from bins. Objects can be placed on the bar in the background (see green circle). (right) 3D point cloud (colored) with pose estimates of different object types (light gray) of our approach including the bin (gray) on real-world data without ICP refinement. The gripper (blue) is visualized at the top-ranked grasp pose.

In this paper, we tackle these challenges by providing a multi-gripper approach which executes grasping trials in simulation and transfers the gained experience to the real world in order to select highly robust grasps on different object types. Our approach jointly solves the 6D object pose estimation (OPE), object classification, and grasp quality prediction tasks while using common gripper types such as parallel jaw and suction grippers. Our method autonomously decides which object, gripper, and grasp pose (or gripping point) are best suited for execution. Furthermore, it works for all possible kinds of object symmetries for both OPE and grasping with placement.

Current robotic bin-picking solutions [3, 4, 5, 6] require manual parameterization and tuning of the object localization and gripping algorithms until a satisfactory system performance is reached [7, 8, 2]. Through the use of object models (CAD or previously scanned [9] models), our approach can adapt itself autonomously to new objects without any human intervention. Our approach is entirely trained on synthetic images and annotations and can, therefore, be easily applied to new objects by triggering a new data generation and neural network training.

To the best of our knowledge, we are the first to use pose distance estimations for object placement based on grasping and placement trials in simulation for different objects and grippers simultaneously. We provide challenging mixed bins datasets that can also be used by other researchers for benchmarking. All datasets and videos of real-world experiments are available at https://www.bin-picking.ai/en/dataset.html. Although implemented for bin-picking, our approach can be used for other pick-and-place tasks such as shelf picking, depalletizing, and conveyor belt picking; the latter in particular is an attractive application due to the high speed and real-time capability of our method.

In summary, the main contributions of this work are:

  • Novel approach PQ-Net++ for grasping and placing objects of different categories using a multi-gripper policy.

  • Estimation of the pose distance for object pose estimation to select the most promising pose estimates.

  • Estimation of the pose distance for object placement to favor grasps resulting in precise placements.

  • Two novel challenging benchmark datasets and performance evaluation on these datasets.

Fig. 2: Failure cases of robotic bin-picking tasks: (a) The robot picks an object (IPACylinder) that moves relative to the gripper during lifting due to overlaps with other objects. The object cannot be placed precisely anymore. (b) The robot picks an object (IPAConnectingRod) that is entangled with another object (IPAUBolt). This can result in a dropping of objects during placement or collisions with the environment at the defined target pose for placement. (c) An object (Bar) cannot be picked due to collisions with the bin. (d) An object (Bar) cannot be picked with a parallel jaw gripper due to collisions with other objects in the scene. Using a suction gripper allows picking these objects.

II Related Work

Approaches to robotic grasping and manipulation can be categorized along multiple criteria [7]. In the following, we differentiate between model-free and model-based approaches based on whether model knowledge of the object to pick is used or not.

II-A Model-free Approaches

Model-free grasping is a dominant direction in robotics research, motivated by its ability to generalize to novel objects [7]. Dex-Net [10, 11] makes use of synthetic data and locally analyzes the point cloud to find robust parallel jaw grasps. The approach samples grasp candidates and ranks them by means of a neural network, which receives an aligned depth image crop and a grasp candidate as input and outputs a grasp quality using a sigmoid output neuron. Dex-Net 3.0 [12] extends this framework to suction grippers and Dex-Net 4.0 [13] uses a multi-gripper policy, which automatically infers whether to use a parallel jaw or suction gripper for the next grasp execution. For the latter, two neural networks rank grasps and an argmax policy chooses the highest quality grasp for execution.

Levine et al. [14] use 14 robots to execute 800,000 grasps over the course of two months and train a deep neural network for learning hand-eye coordination. QT-Opt [15] collects 560,000 grasp attempts with seven robots over several weeks and demonstrates robust grasping and manipulation of a diverse set of objects based on reinforcement learning. Due to the immense amount of real-world data required, works such as GraspGAN [16] and RCAN [17] focus on reducing or avoiding the need for expensive and time-consuming data collection on real-world systems.

All approaches presented above solve pick-and-drop tasks and do not propose a solution for precisely placing the picked candidates, even for rigid objects.

Fig. 3: Overview of our approach: (a) 3D object models with automatically generated grasp poses for parallel jaw (a, top) and suction grippers (a, bottom). (b) Physics simulation for scene generation by dropping random objects from a set of available objects into a bin. (c) Physics simulation for grasping and object placement with a robot. (d) The input of our model is a perspective depth image (d, left) that is processed by a fully convolutional network architecture (d, middle). The output of our neural network is a 3D tensor (d, right) comprising estimates for the presence of an object origin, the position and Euler angles, the relative pose distance for OPE, a vector for the object classes, and the relative pose distance for object placement for each grasp pose.

II-B Model-based Approaches

Analytic approaches typically match an object model to the sensor data [18, 19], and learning-based approaches map sensor data to pose estimates [20, 21, 22]. These outputs are then used to plan a collision-free and kinematically feasible grasp towards the object for picking [1, 2, 4, 5]. This is not sufficient for complex scenarios where jamming or entanglements [23] with other objects in the bin can occur or where objects move relative to the gripper because other objects lie on top of them (for examples see Fig. 2). Learning-based approaches make it possible to approximate these challenging correlations, which are difficult to describe analytically. PQ-Net (Placement Quality Network) [24] estimates object poses with graspability measures and qualities for predefined grasps based on grasping trials in simulation for a single object type. The approach specifies a pose distance threshold and classifies grasp poses based on whether the grasp results in an object placement below the threshold or not, i.e., all successful grasp poses are ranked equally.

In this work, we directly estimate the pose distance for object placement for different object types instead of specifying a distance threshold from the target pose and regressing multiple candidates towards the same quality measure. This allows our extension of the Placement Quality Network (PQ-Net++) to select the most promising grasp for a precise object placement on a global level. Additionally, our approach estimates the pose distance from the ground truth for the proposed pose estimates to select highly promising candidates for picking. Furthermore, our approach autonomously decides which gripper to use and thus allows maximizing the probability of a successful and precise object placement.

III Problem Statement

Given the robot kinematics as well as a discrete and finite set of grippers with known gripper parameters (see Section IV-A), the task of the robot is to pick known rigid objects from a set of categories out of a chaotic scene and place them at a given target pose. Using a depth image of the scene from a single view, our goal is to localize objects, classify them, and identify a suitable grasp pose on the objects that allows a precise object placement. The objects are localized by estimating their translation vector and rotation matrix relative to the sensor coordinate system. Collisions of the manipulator with other objects in the scene have to be avoided. Additionally, picking objects that jam, get displaced relative to the gripper, or become entangled with other objects in the scene has to be avoided.

IV Placement Quality Network (PQ-Net++)

This section describes the techniques to automatically generate grasp poses for different gripper types, the synthetic data generation procedure, the required orientation unification step, the pose distance definition, the parameterization of the output of the neural network, the multi-task loss function, the neural network architecture, the training procedure, the technique for a robust sim-to-real transfer, and the policy to infer robust grasps from the neural network output. Fig. 3 illustrates an overview of our approach.

IV-A Automatic Grasp Pose Generation

To avoid the need for manually specifying grasp poses on the objects, we employ routines to automatically generate them. Based on a given gripper and object, the goal is to identify a discrete set of suitable grasp poses on the object. The grasps are defined relative to the object coordinate system. In this paper, we focus on parallel jaw and suction grippers. For the former, the gripper stroke and closing force need to be known; for the latter, the suction cup dimensions as well as the force and moment limits.

To generate suction grasp candidates, points are sampled on the surface of the object. A grasp candidate consists of its position and the surface normal of the object at that position. Those candidates are then tested for a good seal formation by using a quasi-static spring model of a suction cup [12, 25]. If the grasp point is considered feasible, it is tested in simulation against collision of the gripper with the object. For parallel jaw grippers, we use an automatic routine that samples grasp poses on the object and filters them based on the gripper stroke, normal information, and a simple collision check [26, 24]. Fig. 4 exemplarily shows automatically generated grasp poses for parallel jaw and suction grippers by our routines.
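As a rough illustration of such a routine, the following Python sketch samples suction candidates on an object surface and keeps those whose local patch is approximately planar. The planarity test stands in for the quasi-static spring model of [12, 25] and the simulated collision check, and all parameter names and values are assumptions made for this sketch, not the actual implementation.

```python
import numpy as np

def generate_suction_grasps(vertices, normals, num_samples=500, cup_radius=0.01,
                            planarity_tol=0.002, seed=0):
    """Sample suction grasp candidates (point + normal) on an object surface and
    keep those where the local patch is flat enough to form a seal.

    This replaces the quasi-static spring model of the suction cup and the
    simulated gripper collision check with a simple planarity heuristic, so it
    only sketches the overall flow of the routine."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vertices), size=min(num_samples, len(vertices)), replace=False)
    grasps = []
    for i in idx:
        p, n = vertices[i], normals[i] / np.linalg.norm(normals[i])
        # Vertices inside the suction cup footprint around the candidate point.
        patch = vertices[np.linalg.norm(vertices - p, axis=1) < cup_radius]
        # Distance of each patch vertex from the tangent plane at the candidate.
        deviation = np.abs((patch - p) @ n)
        if deviation.max(initial=0.0) < planarity_tol:
            grasps.append((p, n))  # grasp pose in the object coordinate system
    return grasps
```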

Fig. 4: Automatically generated grasp poses exemplarily visualized for the (a) Stanford Bunny and (b) IPAConnectingRod [23] for a suction gripper and (c) for the IPAUBolt for a parallel jaw gripper. Unsuitable grasp poses are shown in red and feasible grasp poses are highlighted in green.

IV-B Physics Simulation for Synthetic Data Generation

For data generation, we use the physics simulation CoppeliaSim (formerly V-REP) [27] with the built-in Bullet physics engine to chaotically fill bins and execute grasping and placement trials in these scenes.

IV-B1 Scene Generation

Similar to the Siléane [28] and Fraunhofer IPA [29] datasets, we drop objects in random poses into a bin to generate chaotic scenes typical for bin-picking. The object type to drop is randomly selected from the set of available objects. The simulation records a perspective depth image of the scene together with the ID, class, pose relative to the sensor, and visibility of all objects in the scene. The number of objects dropped in each scene is increased until a certain drop limit is reached, forming one cycle. This procedure gives a uniform distribution over different fill levels of the bin with uncorrelated images, as the bins are completely refilled for each scene.

IV-B2 Grasping and Placement

In each generated scene, we loop over all grasp poses for each gripper for each object and check the collision-free reachability and kinematic feasibility of the grasp pose with the given gripper and robot kinematics. If the grasp pose can be reached without collisions, we execute the grasp and place the object at the defined target pose. Afterwards, we log the pose distance (see Section IV-D) between the actual object placement and the defined target pose while taking the proper symmetry class [30, 28] of the picked object into account. We also log whether an entanglement occurred, i.e., whether the lifted object is in contact with other objects from the scene, as exemplarily visualized in Fig. 2 (b).

The grasping of the objects is physically simulated, i.e., the object to pick is not simply rigidly attached to the gripper. For instance, when other objects lie on top of the targeted object, the picked candidate may become displaced relative to the parallel jaw gripper, as exemplarily illustrated in Fig. 2 (a), or simply peel off from the suction gripper when the maximum pull force, shear force, or peel torque is exceeded.
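The sketch below outlines how such an annotation loop could be organized around the simulator. The `sim` interface (drop_objects_into_bin, render_depth, reachable_without_collision, try_grasp_and_place) is purely hypothetical shorthand for the CoppeliaSim/Bullet setup described above, not a real API.

```python
# Hypothetical sketch of the data-generation loop around the physics simulation.
def generate_annotations(sim, object_set, grasp_db, target_pose, drop_limit=20):
    samples = []
    num_objects = 1
    while num_objects <= drop_limit:                      # one "cycle" over fill levels
        scene = sim.drop_objects_into_bin(object_set, num_objects)
        depth = sim.render_depth(scene)
        labels = []
        for obj in scene.objects:                         # ID, class, pose, visibility
            placements = {}
            for gripper, grasp in grasp_db[obj.category]:
                if not sim.reachable_without_collision(obj, gripper, grasp):
                    continue                              # grasp not executed
                result = sim.try_grasp_and_place(obj, gripper, grasp, target_pose)
                placements[(gripper, grasp)] = {
                    "rel_pose_distance": result.relative_pose_distance,
                    "entangled": result.entangled,
                }
            labels.append({"id": obj.id, "category": obj.category,
                           "pose": obj.pose, "visibility": obj.visibility,
                           "placements": placements})
        samples.append({"depth": depth, "labels": labels})
        num_objects += 1
    return samples
```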

IV-C Orientation Unification

Since the grasp poses are defined relative to the object coordinate system, the orientation of the object has to be unified during data generation for grasping and placement. This enables proper learning of the grasp pose correlations and avoids convergence issues during model training caused by pose ambiguities due to object symmetries. The orientation unification step also has to be applied at test time.

In this work, we assume that the z-axis of the object is the axis of symmetry. However, it can be any other axis as long as it is known. The orientation of all objects in the scene is unified relative to the camera coordinate system as follows, depending on the proper symmetry class [30, 28] of the object. For objects with no proper symmetry (e.g., Stanford bunny from Fig. 4 (a)), no unification has to be applied as the orientation is already unique. For spherical symmetries, we set the rotation matrix to the identity matrix. For objects with a revolution symmetry (e.g., cone), we apply a rotation around the z-axis of the object coordinate system which minimizes the z-component of the object's x-axis. For revolution objects with rotoreflection invariance (e.g., cylinder), we additionally rotate around the x-axis by π if the z-value of the z-axis can be increased. For objects with finite symmetries around the z-axis only (e.g., IPAConnectingRod and IPAUBolt from Fig. 4 (b) and (c), respectively), we set the orientation with minimal z-value of the x-axis. For objects with discrete symmetries around multiple axes (e.g., bar or cube), we choose, among the symmetry-equivalent rotation matrices R G with G ∈ 𝒢, the one with the smallest angle

θ(R G) = arccos((tr(R G) − 1) / 2)    (1)

of the axis–angle representation relative to the camera coordinate system. 𝒢 is the discrete and finite set of rigid transformations that have no effect on the static state of the object.
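For objects with a finite proper symmetry group, the unification step therefore reduces to choosing the symmetry-equivalent rotation with the smallest angle to the camera frame, as in Equation (1). A minimal sketch, assuming the symmetry group is given as a list of rotation matrices:

```python
import numpy as np

def rotation_angle(R):
    """Angle of the axis-angle representation of a rotation matrix, cf. Eq. (1)."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def unify_discrete_symmetry(R_cam_obj, symmetry_rotations):
    """For an object with a finite proper symmetry group, pick the
    symmetry-equivalent orientation with the smallest angle to the camera frame.

    R_cam_obj: 3x3 rotation of the object relative to the camera.
    symmetry_rotations: list of 3x3 matrices G that leave the object unchanged."""
    candidates = [R_cam_obj @ G for G in symmetry_rotations]
    return min(candidates, key=rotation_angle)
```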

IV-D Pose Distance

Brégier et al. [30, 28] introduced a pose distance that allows an efficient distance computation for all possible kinds of object symmetries by introducing pose representatives for all proper symmetry classes. The set of representatives R(𝒫) of a pose 𝒫 depends on the proper symmetry class of the object.

Utilizing this representation framework, the distance between a pair of poses 𝒫₁ and 𝒫₂ can be expressed as the minimum Euclidean distance between their respective pose representatives:

d(𝒫₁, 𝒫₂) = min_{p₁ ∈ R(𝒫₁), p₂ ∈ R(𝒫₂)} ‖p₁ − p₂‖₂.    (2)

This pose distance is compared against the commonly used threshold of 0.1 times the diameter Ø of the smallest bounding sphere of the object,

d(𝒫₁, 𝒫₂) < 0.1 · Ø,    (3)

to identify whether a pose estimate is considered as a true or false positive [28, 31].

In this work, we consider the pose distance relative to 0.1 times the object diameter to deal with objects of different sizes and, therefore, use the relative pose distance

d_rel(𝒫₁, 𝒫₂) = d(𝒫₁, 𝒫₂) / (0.1 · Ø).    (4)

If d_rel is smaller than 1, the distance is below the distance threshold and the object pose estimate or object placement is considered as correct or precise enough [28, 24].
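The sketch below computes a relative pose distance in this spirit for an object with a finite proper symmetry group. It uses a simplified combination of translation and rotation terms weighted by a length scale rho rather than the exact representative construction of [30], so it should be read as an approximation of Equations (2) and (4) under those stated assumptions.

```python
import numpy as np

def pose_distance(t1, R1, t2, R2, sym_rotations, rho):
    """Simplified symmetry-aware pose distance.

    sym_rotations: list of 3x3 rotation matrices that leave the object unchanged
    (its proper symmetry group). rho is a length scale of the object weighting
    the rotation term against the translation term; the exact representative
    construction in Bregier et al. differs in detail."""
    best = np.inf
    for G in sym_rotations:
        rot_term = rho * np.linalg.norm(R1 @ G - R2, ord="fro")
        trans_term = np.linalg.norm(t1 - t2)
        best = min(best, float(np.hypot(trans_term, rot_term)))
    return best

def relative_pose_distance(t1, R1, t2, R2, sym_rotations, rho, diameter):
    """Relative pose distance in the sense of Eq. (4): distance / (0.1 * diameter)."""
    return pose_distance(t1, R1, t2, R2, sym_rotations, rho) / (0.1 * diameter)
```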

IV-E Relative Pose Distance Estimation for Object Pose Estimation

Our model estimates the relative pose distance between the predicted and ground truth pose of the OPE for each predicted object pose while taking the proper symmetry class of the currently considered object into account. The ground truth pose distance is determined from the pose annotation from simulation and the current pose output of the neural network; it is therefore computed on the fly, i.e., the ground truth changes as the pose predictions evolve during training.

IV-F Parameterization of the Output

Similar to [20, 21] and [24], we employ a spatial discretization to split the measurement volume of the sensor into volume elements. Each object is described by a feature vector comprising the presence of an object origin, the position (x, y, z) and Euler angles relative to the camera coordinate system, the relative pose distance of the proposed object pose estimate (whose ground truth is computed on the fly), the one-hot encoded object class vector, and a vector with the relative pose distance for object placement for each grasp pose defined for the given object class, padded to the maximum number of grasp poses over all objects.

The ground truth output tensor of the neural network is initialized with zeros. The feature vectors of the objects are assigned to the tensor entries based on the position of the object origin in the scene. In case multiple objects fall into the same spatial location, the feature vector of the object with higher visibility is used as ground truth. The output tensor is visualized in Fig. 3 (d, right).

To generate the ground truth vector with the object placement annotations, we concatenate the vectors for the different grippers. As the objects generally do not have the same number of grasp poses, we fill the remaining entries of the padded placement vector with a default value. If an entanglement has occurred, we also set the placement distance for that grasp pose to the default value to prevent our policy from selecting entangled grasps. Furthermore, we clip the relative pose distance for object placement for all executed grasp poses because the pose distance is unpredictably high for dropped objects. Moreover, we set the ground truth pose distance to the default value for grasp poses which were not executed (due to missing collision-free reachability and/or kinematic feasibility). At test time, we only consider the entries in the relative pose distance vector for object placement that correspond to the grasp poses of the predicted object class.
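A compact sketch of how such a ground truth tensor could be assembled is given below. The channel layout, the default value, and the dictionary-style annotations are assumptions made for illustration only.

```python
import numpy as np

def build_target_tensor(objects, grid_shape, num_classes, max_grasps, default_dist=1.0):
    """Assemble a ground truth output tensor from per-object annotations.

    Assumed channel layout: [origin | x, y, z, three Euler angles | OPE distance |
    class one-hot | placement distance per grasp pose]. grid_shape is a tuple of
    volume-element counts."""
    channels = 8 + num_classes + max_grasps
    target = np.zeros(grid_shape + (channels,), dtype=np.float32)
    best_visibility = np.zeros(grid_shape, dtype=np.float32)
    for obj in objects:
        cell = tuple(obj["grid_index"])            # volume element of the object origin
        if obj["visibility"] <= best_visibility[cell]:
            continue                               # keep the more visible object
        best_visibility[cell] = obj["visibility"]
        vec = np.zeros(channels, dtype=np.float32)
        vec[0] = 1.0                               # object origin present
        vec[1:7] = obj["pose_6d"]                  # position + Euler angles
        vec[7] = 0.0                               # OPE distance ground truth is computed on the fly
        vec[8 + obj["class_id"]] = 1.0             # one-hot object class
        place = np.full(max_grasps, default_dist, dtype=np.float32)
        for g, d in obj["placements"].items():     # executed, non-entangled grasps only
            place[g] = min(d, default_dist)        # clip the placement distance
        vec[8 + num_classes:] = place
        target[cell] = vec
    return target
```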

IV-G Loss Function

To train the neural network, the multi-task loss function (5), a weighted sum of the loss terms described below, is optimized with manually tuned weights. The weighting with the ground truth visibility causes the neural network to prioritize the more visible objects.

The first loss term reflects the presence of object origins in the volume elements and is defined via the binary cross-entropy. For the pose loss, we use Equation (2), which properly considers all possible kinds of object symmetries. For the object classification loss, we use the categorical cross-entropy loss. For the losses of the relative pose distance for OPE (dynamic ground truth) and for object placement for each grasp pose (static ground truth), we use the squared L2 norm.
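The following PyTorch-style sketch shows one possible way to combine these terms. The channel shapes, the exact placement of the visibility weighting, and the weight dictionary are assumptions of this sketch; the symmetry-aware pose loss of Equation (2) is passed in as a function (pose_loss_fn).

```python
import torch
import torch.nn.functional as F

def multi_task_loss(pred, target, visibility, weights, pose_loss_fn, has_grasp_labels):
    """Sketch of a weighted multi-task loss in the spirit of Eq. (5).

    Assumed shapes: grids (B, H, W), class logits (B, C, H, W),
    placement distances (B, H, W, G)."""
    mask = target["origin"]                                    # 1 where a feature vector is assigned
    l_origin = F.binary_cross_entropy(pred["origin"], target["origin"])
    per_cell_pose = pose_loss_fn(pred["pose"], target["pose"])  # symmetry-aware distance per cell
    l_pose = (visibility * mask * per_cell_pose).sum()
    l_class = (visibility * mask *
               F.cross_entropy(pred["class_logits"], target["class_id"], reduction="none")).sum()
    l_dist_ope = (visibility * mask * (pred["dist_ope"] - target["dist_ope"]).pow(2)).sum()
    l_dist_place = torch.zeros((), device=mask.device)
    if has_grasp_labels:                                       # only annotated samples contribute
        l_dist_place = (visibility.unsqueeze(-1) * mask.unsqueeze(-1) *
                        (pred["dist_place"] - target["dist_place"]).pow(2)).sum()
    return (weights["origin"] * l_origin + weights["pose"] * l_pose +
            weights["class"] * l_class + weights["dist_ope"] * l_dist_ope +
            weights["dist_place"] * l_dist_place)
```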

IV-H Neural Network Architecture

Our model gets a single normalized perspective depth image, with depth values bounded by the near and far clipping plane of the sensor, as input and outputs the tensor visualized in Fig. 3 (d). In our experiments, we use a DenseNet-BC [32, 33] with 40 layers and a growth rate of 50 as function approximator for the mapping from inputs to outputs. The growth rate specifies the number of feature maps added per layer within a dense block. We use a fixed-size input depth image and a spatial discretization of the measurement volume in the x, y, and z directions, while downsampling is performed via average pooling. ReLU activation functions are employed in the hidden layers. For the 3D output tensor, we use sigmoid functions for the presence of object origins and the relative pose distance estimation for object placement, linear functions for the pose information channels, ReLU functions for the relative pose distance estimation for OPE, and vector-wise softmax functions for object classification. With this network architecture, forward passes run at an average frame rate of 92 fps on an Nvidia Tesla V100 or 44 fps on a GTX 1080 Ti.
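The mapping from raw network outputs to these activation groups might look as follows; the channel ordering is an assumption of this sketch.

```python
import torch

def apply_output_activations(raw, num_classes, num_grasps):
    """Map the raw network output (B, H, W, C) to the channel groups described above.

    Assumed channel layout: [origin | pose (6) | dist_OPE | class logits | dist_place per grasp]."""
    origin = torch.sigmoid(raw[..., 0])                          # presence of an object origin
    pose = raw[..., 1:7]                                         # linear: position + Euler angles
    dist_ope = torch.relu(raw[..., 7])                           # unbounded relative distance for OPE
    cls = torch.softmax(raw[..., 8:8 + num_classes], dim=-1)     # vector-wise class probabilities
    dist_place = torch.sigmoid(raw[..., 8 + num_classes:8 + num_classes + num_grasps])
    return origin, pose, dist_ope, cls, dist_place
```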

IV-I Training

During training, the loss of the entire channel for object origin presence estimation is backpropagated. For all other channels, we only backpropagate the error of the tensor entries with an assigned feature vector by multiplying channel-wise with the ground truth object origin presence channel. Due to the time-consuming process of grasping and placement, not all generated samples obtain annotations for the grasp poses, and we only backpropagate the placement distance loss if annotations are given.

Fig. 5: Exemplary augmented training samples for transferring our model from simulation to the real world.

IV-J Sim-to-Real Transfer

We use the sim-to-real transfer technique domain randomization [34] to successfully deploy our model from simulation to the real world. During scene generation, we randomize the bin pose as well as the position and orientation of the objects above the bin before dropping.

During training, we insert a random background image from the “Fraunhofer IPA Bin-Picking dataset” [29] with a probability of 0.5 (see Fig. 5, right). Otherwise, the default background from the setup is used (see Fig. 5, left). Additionally, we apply image augmentations such as blurring, elastic transformations, and adding noise to the synthetic depth images. As real-world data may contain not only noisy but also incorrect measurements due to, e.g., reflections on shiny metal parts (as is the case for the IPAUBolt and IPARingScrew; see Fig. 1, left), we change the value of random single pixels and larger regions in the image to a depth value larger than that of the actual object (drop noise). All image augmentations are applied with varying intensity and in a random order during training. Fig. 5 exemplarily visualizes images used for training, while the flawless object annotations from simulation are kept.
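A simple version of the drop-noise augmentation could look like the following sketch; the region counts, sizes, and the use of the maximum depth value as replacement are illustrative assumptions.

```python
import numpy as np

def add_drop_noise(depth, max_regions=5, max_size=12, rng=None):
    """Simulate missing or incorrect measurements (e.g., from reflections) by pushing
    random pixels and small rectangular regions to a larger depth value."""
    rng = np.random.default_rng() if rng is None else rng
    augmented = depth.copy()
    h, w = depth.shape
    far_value = depth.max()                                  # stand-in for the far clipping plane
    # Single dropped pixels.
    num_pixels = rng.integers(0, max(1, h * w // 200))
    ys, xs = rng.integers(0, h, num_pixels), rng.integers(0, w, num_pixels)
    augmented[ys, xs] = far_value
    # Larger dropped regions.
    for _ in range(rng.integers(0, max_regions + 1)):
        rh, rw = rng.integers(2, max_size, size=2)
        y, x = rng.integers(0, h - rh), rng.integers(0, w - rw)
        augmented[y:y + rh, x:x + rw] = far_value
    return augmented
```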

IV-K Policy

The deep neural network with trained weights gets a single normalized perspective depth image with global scene information as input and outputs a 3D tensor. A scoring function transforms the relative pose distance estimate for OPE (6) and the relative pose distance estimates for object placement for each grasp pose (7) in the output tensor into scores. Since the relative pose distance estimate for object placement is bounded by its sigmoid output (see Section IV-H), we do not use an exponential function for it. Our policy (8) selects the highest quality object and grasp pose from all volume elements for execution, where the grasp pose index ranges over the predefined grasp poses of the predicted object class for all grippers. For each grasp pose, the corresponding gripper is known.
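The sketch below illustrates one plausible realization of this policy. The concrete score functions (an exponential of the negative OPE distance and one minus the placement distance) and their multiplicative combination are assumptions standing in for Equations (6)-(8), which are not reproduced here.

```python
import numpy as np

def select_grasp(origin, dist_ope, dist_place, class_probs, grasp_counts):
    """Hypothetical argmax policy over volume elements and grasp poses.

    Assumed shapes: origin, dist_ope (H, W); class_probs (H, W, C);
    dist_place (H, W, G_max); grasp_counts is an array with the number of
    grasp poses defined for each object class."""
    h, w, g_max = dist_place.shape
    s_ope = np.exp(-dist_ope)                      # assumed score for the OPE distance estimate
    s_place = 1.0 - dist_place                     # assumed score for the bounded placement distance
    score = origin[..., None] * s_ope[..., None] * s_place
    # Only the grasp poses of the predicted object class are valid candidates.
    pred_class = class_probs.argmax(axis=-1)
    valid = np.arange(g_max)[None, None, :] < grasp_counts[pred_class][..., None]
    score = np.where(valid, score, -np.inf)
    y, x, g = np.unravel_index(score.argmax(), score.shape)
    return (y, x), g                               # chosen volume element and grasp pose index
```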

V Experiments

For performance evaluation, we introduce two novel synthetic benchmark datasets: mixed bins symmetries and mixed bins entanglements. Both datasets comprise objects from multiple categories and of different sizes and shapes.

The symmetries dataset comprises seven symmetric objects covering all possible proper symmetry classes (see Fig. 3 (b) and (d, left) as well as Fig. 2 (a), (c), and (d)). Using this dataset, we demonstrate that our method works for all possible kinds of object symmetries. Furthermore, the dataset comprises very elongated objects (cylinder and bar). When attempting to pick these objects, they can easily move relative to the gripper. Thus, a proper picking order of the objects has to be considered, and simply checking the collision-free reachability of the grasp poses is not sufficient.

The entanglements dataset is very challenging because its three objects can potentially become entangled (see Fig. 2 (b)), i.e., no precise placement is possible anymore due to collisions of the additionally picked object with the environment. Exemplary images from this dataset are visualized in Fig. 5 and the application to real-world data in Fig. 1.

In the following, we benchmark the performance of our approach for OPE, grasping, and precise object placement in simulation on the challenging mixed bins symmetries and entanglements datasets. Since the baseline approaches OP-Net [20] and PQ-Net [24] are not originally designed to handle different object classes, we extended them to object classification by adding the corresponding output feature maps and backpropagating the classification loss during training, while using the same network architecture as for our approach. All performance numbers are recorded on test datasets of 10 cycles each (500 samples). For the experiments in Sections V-B and V-C, each approach executes exactly one grasp per scene. This allows a fair comparison because all approaches observe the exact same scenarios.

V-A Benchmarking 6D Object Pose Estimation

PQ-Net [24] used grasping and placement annotations for only around 13% of the cycles and reported an average precision (AP) for OPE slightly worse than OP-Net [20] across all datasets. In our datasets, 50% of the samples are annotated with grasps. Table I reports the AP values based on the metric from Brégier et al. [30, 28] for the two challenging mixed bins datasets. The AP values are determined independently for each object class and averaged afterwards. With enough training samples with grasping annotations, PQ-Net++ is a better pose estimator than OP-Net [20]. Adding additional outputs has positive effects on OPE due to multi-task learning, even though the output size of the neural network increases substantially by adding a feature map for each grasp pose. Additionally, we report very high success rates for object classification for sufficiently visible objects.

TABLE I: Performance evaluation for 6D object pose estimation and object classification. Higher is better.

                                                              mixed bins    mixed bins
                                                              symmetries    entanglements
AP average, OP-Net [20] extended to object classification        0.80          0.81
AP average, PQ-Net++ (ours)                                       0.88          0.83
Success rate object classification, OP-Net [20]                   0.99          0.98
Success rate object classification, PQ-Net++ (ours)               0.99          0.98

TABLE II: Success rate for grasping. Higher is better.

                                                              mixed bins    mixed bins
                                                              symmetries    entanglements
Success rate grasping, Dex-Net 4.0 [13]                           0.77          0.59
Success rate grasping, PQ-Net++ (ours)                            0.98          0.94

TABLE III: Results for object placement: success rate (higher is better) and average relative pose distance for object placement at a defined target pose (lower is better). Evaluation of the 6D object pose estimation (OPE) of the chosen objects for picking: success rate for OPE based on commonly used evaluation metrics (higher is better) and average relative pose distance for OPE (lower is better).

                                                                          mixed bins    mixed bins
                                                                          symmetries    entanglements
Success rate object placement
  PQ-Net [24] extended to object classification and multiple grippers        0.96          0.82
  PQ-Net++ without OPE distance estimate (ours)                              0.96          0.83
  PQ-Net++ with OPE distance estimate (ours)                                 0.94          0.82
Avg. relative pose distance object placement
  PQ-Net [24] extended to object classification and multiple grippers        0.86          0.47
  PQ-Net++ without OPE distance estimate (ours)                              0.16          0.35
  PQ-Net++ with OPE distance estimate (ours)                                 0.19          0.36
Success rate OPE based on the [30, 28] metric
  PQ-Net [24] extended to object classification and multiple grippers        0.99          0.89
  PQ-Net++ without OPE distance estimate (ours)                              0.99          0.87
  PQ-Net++ with OPE distance estimate (ours)                                 0.99          0.90
Success rate OPE based on the ADI metric [31]
  PQ-Net [24] extended to object classification and multiple grippers        1.0           0.98
  PQ-Net++ without OPE distance estimate (ours)                              1.0           0.99
  PQ-Net++ with OPE distance estimate (ours)                                 1.0           0.99
Avg. relative pose distance OPE
  PQ-Net [24] extended to object classification and multiple grippers        0.18          0.59
  PQ-Net++ without OPE distance estimate (ours)                              0.15          0.59
  PQ-Net++ with OPE distance estimate (ours)                                 0.13          0.56

V-B Benchmarking Grasping

Since, to the best of our knowledge, there are no other model-based approaches explicitly designed for mixed bins with multiple grippers, we compare our approach with Dex-Net 4.0 [13], which also uses a multi-gripper policy to pick objects from chaotic bins. Table II reports the success rates for grasping in simulation. A grasp is considered successful if an object is in the gripper after moving out of the bin. Dex-Net is not explicitly trained on the considered objects. The approach analyzes the point cloud locally and, contrary to our approach, does not reason on a global level. Thus, it does not explicitly attempt grasps close to the center of mass of very elongated objects (e.g., bar and cylinder) or prefer objects that are not occluded. Moreover, Dex-Net operates in 4D only (top-down grasps) for the parallel jaw gripper, which may result in less stable grasps, whereas our approach operates in 6D. Our approach performs better because it uses model knowledge and is explicitly trained on the considered scenarios. Our experiments demonstrate that the system performance can be improved by incorporating model knowledge.

V-C Benchmarking Object Placement

PQ-Net [24] specifies a distance threshold of 0.1 times the object diameter for object placement and performs a binary classification for each grasp pose based on whether the placement trial resulted in a placement within the threshold or not, i.e., all grasp poses within the threshold are treated equally. We compare this baseline with our policy with and without the relative pose distance estimation for OPE.

Table III reports the results of our object placement experiments in simulation. The object placement success rate identifies whether the placement is within the distance threshold, evaluated over all samples; here, both approaches perform very similarly. In our experiments, we log the relative pose distance for object placement for each trial and average it over all samples from the test dataset. As we directly estimate the pose distance for object placement for each grasp pose, we are able to select grasp poses for more precise object placements. Thus, our approach chooses grasp poses that result in significantly more precise object placements than PQ-Net.

Additionally, we evaluate the correctness of the OPE of the candidate chosen by the policy (without ICP refinement). Table III reports the success rates for OPE based on the metric from Brégier et al. [30, 28] and the ADI metric [31]. For both metrics, very high success rates are reported. Furthermore, we determine the average relative pose distance for OPE over all samples. Our policy with the pose distance estimate for OPE provides better results for OPE because it explicitly focuses on more correct pose estimates.

Our method learns to avoid the failure cases presented in Fig. 2 (a) and (b) and can also pick objects close to the bin wall (see Fig. 2 (c)) or very close together (see Fig. 2 (d)) due to employing grasps from multiple grippers.

VI Conclusions

In this paper, we introduced a novel approach for grasping and placing known rigid objects of different categories from highly cluttered scenes using multi-gripper policies. Based on a depth image of the scene, our approach estimates the 6D object pose together with a pose distance estimate for object pose estimation and a quality estimate for each automatically generated and predefined grasp pose on the object for multiple objects simultaneously in a single forward pass of the network.

Due to incorporating model knowledge into the system, our approach reports higher success rates for grasping than state-of-the-art model-free systems and additionally allows precisely placing the picked objects. By estimating the pose distance for object placement for each grasp pose, our approach selects the most promising grasp for a precise object placement on a global level and places objects significantly more precisely than prior work. Our experiments demonstrate that our approach scales to a high level of complexity, operating on mixed bins with all possible kinds of object symmetries and multiple grippers in combination.

Acknowledgment

This work was partially supported by the Federal Ministry of Education and Research (Deep Picking – Grant No. 01IS20005C), the State Ministry of Baden-Württemberg for Economic Affairs, Labour and Housing Construction (Center for Cognitive Robotics – Grant No. 017-180004 and Center for Cyber Cognitive Intelligence (CCI) – Grant No. 017-192996), and the Fraunhofer lighthouse project SWAP. We would like to thank our colleagues for helpful discussions and Lucas Doust Alba for the support with the Dex-Net experiments.

References

  • [1] F. Spenrath, A. Spiller, and A. Verl, “Gripping point determination and collision prevention in a bin-picking application,” in German Conference on Robotics (ROBOTIK), 2012.
  • [2] F. Spenrath and A. Pott, “Gripping point determination for bin picking using heuristic search,” in CIRP Conference on Intelligent Computation in Manufacturing Engineering (CIRP ICME), 2016.
  • [3] ISRA VISION AG, “Powerpick3d: The compact solution for reliably and fast picking of unsorted small components with complex geometries,” 2021. [Online]. Available: https://www.isravision.com/en/ready-to-use/robot-vision/bin-picking/powerpick3d/
  • [4] F. Spenrath, M. Palzkill, A. Pott, and A. Verl, “Object recognition: Bin-picking for industrial use,” in IEEE International Symposium on Robotics (ISR), 2013.
  • [5] F. Spenrath and A. Pott, “Statistical analysis of influencing factors for heuristic grip determination in random bin picking,” in IEEE International Conference on Advanced Intelligent Mechatronics (AIM), 2017.
  • [6] M. El-Shamouty, K. Kleeberger, A. Lämmle, and M. Huber, “Simulation-driven machine learning for robotics and automation,” tm – Technisches Messen, vol. 86, no. 11, 2019.
  • [7] K. Kleeberger, R. Bormann, W. Kraus, and M. F. Huber, “A survey on learning-based robotic grasping,” Current Robotics Reports, vol. 1, no. 4, 2020.
  • [8] D. Kraft, L.-P. Ellekilde, and J. A. Jørgensen, “Automatic grasp generation and improvement for industrial bin-picking,” in Gearing up and accelerating cross–fertilization between academic and industrial robotics research in Europe, 2014.
  • [9] R. Bormann, B. F. de Brito, J. Lindermayr, M. Omainska, and M. Patel, “Towards automated order picking robots for warehouses and retail,” in International Conference on Computer Vision Systems (ICVS), 2019.
  • [10] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg, “Dex-Net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • [11] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017.
  • [12] J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, and K. Goldberg, “Dex-Net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [13] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, 2019.
  • [14] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research (IJRR), 2018.
  • [15] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning (CoRL), 2018.
  • [16] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [17] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [18] Y. Konishi, K. Hattori, and M. Hashimoto, “Real-time 6d object pose estimation on cpu,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
  • [19] M. Palzkill, “Heuristisches Suchverfahren zur Objektlageerkennung aus Punktewolken für industrielle Zuführsysteme,” Dissertation, University of Stuttgart, 2014.
  • [20] K. Kleeberger and M. F. Huber, “Single shot 6d object pose estimation,” in IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [21] K. Kleeberger, M. Völk, R. Bormann, and M. F. Huber, “Investigations on output parameterizations of neural networks for single shot 6d object pose estimation,” in IEEE International Conference on Robotics and Automation (ICRA), 2021.
  • [22] Z. Dong, S. Liu, T. Zhou, H. Cheng, L. Zeng, X. Yu, and H. Liu, “PPR-Net: Point-wise pose regression network for instance segmentation and 6d pose estimation in bin-picking scenarios,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
  • [23] M. Moosmann, F. Spenrath, K. Kleeberger, M. U. Khalid, M. Mönnig, J. Rosport, and R. Bormann, “Increasing the robustness of random bin picking by avoiding grasps of entangled workpieces,” in CIRP Conference on Manufacturing Systems (CIRP CMS), 2020.
  • [24] K. Kleeberger, M. Völk, M. Moosmann, E. Thiessenhusen, F. Roth, R. Bormann, and M. F. Huber, “Transferring experience from simulation to the real world for precise pick-and-place tasks in highly cluttered scenes,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [25] J. Schnitzler, “Flexibler Einsatz verschiedener Endeffektoren beim modellbasierten Griff-in-die-Kiste mit Deep Learning,” master’s thesis, University of Stuttgart, Stuttgart, 2021.
  • [26] K. Kleeberger, F. Roth, R. Bormann, and M. F. Huber, “Automatic grasp pose generation for parallel jaw grippers,” in 16th International Conference on Intelligent Autonomous Systems (IAS-16), 2021.
  • [27] E. Rohmer, S. P. N. Singh, and M. Freese, “V-REP: A versatile and scalable robot simulation framework,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013.
  • [28] R. Brégier, F. Devernay, L. Leyrit, and J. L. Crowley, “Symmetry aware evaluation of 3d object detection and pose estimation in scenes of many parts in bulk,” in IEEE International Conference on Computer Vision (ICCV), 2017.
  • [29] K. Kleeberger, C. Landgraf, and M. F. Huber, “Large-scale 6d object pose estimation dataset for industrial bin-picking,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
  • [30] R. Brégier, F. Devernay, L. Leyrit, and J. L. Crowley, “Defining the pose of any 3d rigid object and an associated distance,” International Journal of Computer Vision (IJCV), vol. 126, no. 6, 2018.
  • [31] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Asian Conference on Computer Vision (ACCV), 2012.
  • [32] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [33] G. Huang, Z. Liu, G. Pleiss, L. van der Maaten, and K. Q. Weinberger, “Convolutional networks with dense connectivity,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
  • [34] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.