I. Introduction
Many robot tasks, for example industrial assembly and bin picking, require manipulation of objects. Object manipulation, in turn, benefits from vision-based grasping, where the 3D geometric shape of a known object is used to generate stable 6D grasp poses with the help of pose estimation methods. A typical performance measure in pose estimation benchmarks [1, 2, 3, 4] is the alignment error between an estimated and a ground truth pose, but this is not sufficient for robot manipulation, where success depends on many other factors: i) the object and its properties (material, weight, dimensions), ii) the selected gripper, iii) the selected grasp point, and iv) the task itself to be completed (bin picking vs. precision wrenching). Several prior works [5, 6, 7, 8] also report success rates for their specific setups and tasks, but for comparison the whole setup would have to be rebuilt.
In this work, we formulate the problem in a probabilistic manner: what is the conditional probability of a successful grasp (S = 1) given the pose residual r, i.e. P(S = 1 | r)? To compute the probability we need to sample the residual in 6D space using a real setup. Finally, using the probabilistic model we can generate a large number of grasping samples (see Fig. 1) and benchmark 6D pose estimation methods using realistic and practical performance measures: the successful grasp probability and the proportion of successful grasps at a predefined probability level (e.g., 0.90).
The main contributions of this work are:

A formulation of the conditional probability of a successful grasp, P(S = 1 | r), given a 6D pose residual r. For practical realization, the formulation uses nonparametric kernel regression on random samples in the 6D residual space ℝ³ × 𝕊³, where 𝕊³ denotes the 3D sphere parametrized by hyperspherical coordinates.

An experimental setup where random residual samples are generated using a robot arm and two assembly tasks with fixed grippers, objects and grasps. The setup automatically evaluates the success of each grasp, i.e. whether S = 1 or S = 0, and is able to generate hundreds of samples in one day.

A 6D pose performance measure using the probabilistic formulation, and an evaluation of several state-of-the-art 3D point cloud based pose estimation methods using the proposed measure.
Essentially, to adopt the proposed benchmark one only needs the 3D point cloud models, the test images and the grasp probability function, which will all be made publicly available to facilitate fair comparisons.
II. Related Work
II-A Benchmarking 6D pose estimation
The first complete 6D pose estimation benchmark with a dataset and an evaluation procedure was created by Hinterstoisser et al. [4] and later improved in [9]. Despite its relatively easy test images, the dataset is still frequently used as the main benchmark. Hodan et al. [1] recently proposed a 6D pose estimation benchmark comparing 15 state-of-the-art methods covering all the main research branches: local-feature-based, template-matching and learning-based approaches. The benchmark uses 8 different datasets containing different types of images, ranging from typical household objects [10] to industry-relevant objects with no significant texture [11]. For evaluation the authors propose a variant of their previously introduced Visible Surface Discrepancy (VSD) [3] error metric. In addition, the authors introduce an online evaluation system that provides an updated leaderboard. However, it remains unclear how well these benchmarks measure pose estimation for practical applications such as robotic assembly and disassembly.
II-B Performance measures
There are several commonly used performance metrics for pose estimation. The most popular is the alignment error proposed by Hinterstoisser et al. [4], which calculates the average distance between model points transformed by the ground truth pose and by the estimated pose. Another popular metric is to report the 3D translation and 3D rotation errors separately [12].
Recently, Hodan et al. [3] proposed the VSD metric, where the error is calculated only over the visible part of the surface model. The main contribution of the metric is invariance to pose ambiguity, i.e. due to object symmetry there can be multiple poses that are indistinguishable. However, the metric requires an additional ground truth visibility mask to be estimated.
All the above metrics measure the pose error in terms of misalignment between the ground truth and estimated object surface points. This requirement is important, for example, in augmented reality applications where the perceived virtual object must align well with the real environment. However, in robotic manipulation, such as industrial assembly and disassembly, a suitable performance metric must reflect whether the programmed task can be completed or not. Finding such a metric is the main objective of this work.
II-C 6D pose estimation methods
Many recent state-of-the-art methods divide 6D pose estimation into two distinct steps [13, 14]: i) RGB-based object detection and ii) point cloud based 6D pose estimation. Since object detection is out of the scope of our work, we focus only on recent 6D pose estimation methods. In the following we briefly review the baselines and state-of-the-art methods that are included in the experimental part. All methods operate directly on point clouds captured by depth sensors.
Nearest Neighbor Similarity Ratio (NNSR) [15] is a popular yet very efficient correspondence grouping method utilizing only feature matches. It flags a correspondence as an outlier if the closest and the second closest match are too similar, i.e. the descriptor cannot be matched distinctively:

d(f_q, f_1) / d(f_q, f_2) > τ, (1)

where f_q is a query descriptor, f_1 and f_2 are its nearest and second nearest target descriptors, d(·, ·) is the descriptor distance and τ ∈ [0, 1] is the similarity ratio threshold.
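For concreteness, the ratio test of Eq. (1) can be sketched in a few lines. The function name, the brute-force distance computation and the default τ = 0.8 are our illustrative choices, not part of the original method description:

```python
import numpy as np

def nnsr_filter(query_desc, target_desc, tau=0.8):
    """Keep only correspondences that pass the ratio test of Eq. (1):
    the nearest target descriptor must be clearly closer than the
    second nearest one (distance ratio below tau)."""
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(target_desc - q, axis=1)  # distances to all targets
        j1, j2 = np.argsort(d)[:2]                   # nearest and second nearest
        if d[j1] / d[j2] < tau:                      # distinctive match -> inlier
            matches.append((i, int(j1)))
    return matches
```

A query descriptor lying roughly midway between two target descriptors yields a ratio near one and is rejected, which is exactly the ambiguity the test is meant to filter out.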
Random Sample Consensus (RANSAC) [16] is another widely used method in the 2D and 3D domains. It is an iterative process that uses random sampling to generate candidate solutions for a model (transformation) that aligns two surfaces with a minimum point-wise error. The free parameter of the method is the number of iterations, i.e. how many times the algorithm samples matches from the correspondence set. The samples are used to generate candidate transformations whose fitness is evaluated by transforming all query keypoints to the target surface and calculating the Euclidean distance between the corresponding points. All keypoints having a smaller distance than an inlier threshold are counted as inliers for the specific transformation candidate, and finally the candidate having the most inliers is selected as the estimated pose transformation.
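The RANSAC loop described above can be sketched as follows. The minimal sample size of three point pairs, the Kabsch least-squares solver and all names are our assumptions for a rigid 3D alignment, not the implementation used in the paper:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # reflection correction keeps R a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def ransac_pose(src, dst, n_iter=200, eps=0.01, rng=None):
    """RANSAC over putative correspondences (src[i] <-> dst[i]):
    repeatedly fit a rigid transform to 3 random pairs and keep the
    hypothesis with the most inliers (pairs aligned within eps)."""
    rng = rng or np.random.default_rng(0)
    best = (0, None, None)
    for _ in range(n_iter):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        d = np.linalg.norm(src @ R.T + t - dst, axis=1)
        n_in = int((d < eps).sum())
        if n_in > best[0]:
            best = (n_in, R, t)
    return best
```

With one corrupted correspondence among twenty, a clean minimal sample recovers the true transform and the outlier simply fails the inlier test.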
Search of Inliers (SI) [17] is a recent method that provides state-of-the-art accuracy on several benchmarks. It uses two consecutive processing stages, local voting and global voting. The first stage performs local voting, where correspondence pairs are selected locally from the query and the target, and a score is computed from their pairwise similarity. At the global voting stage, the algorithm samples point correspondences, estimates a transformation and gives a global score to the points correctly aligned outside the estimation point set. The final score is computed by integrating the local and global scores, which are finally thresholded into inliers and outliers using Otsu's bimodal distribution thresholding.
Geometric Consistency Grouping (GC) [18] is a popular baseline which is implemented in several point cloud libraries. In contrast to NNSR, the GC grouping algorithm operates only in the spatial domain and evaluates the consistency of two correspondences c_i = (m_i, s_i) and c_j = (m_j, s_j) by

| ‖m_i − m_j‖ − ‖s_i − s_j‖ | < ε, (2)

where m and s denote the matched model and scene points and ε is a consistency threshold. The algorithm is initialized with a fixed number of clusters and in principle GC can return more than one correspondence cluster. In our work we select the cluster having the largest number of correspondences.
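A minimal sketch of the pairwise consistency test of Eq. (2) combined with a greedy clustering step; the greedy growth strategy and all names are our illustrative choices, and library implementations differ in how the clusters are formed:

```python
import numpy as np

def geometric_consistency(corrs, model_pts, scene_pts, eps=0.02):
    """Greedy geometric-consistency grouping: two correspondences
    (i1, j1) and (i2, j2) are consistent (Eq. 2) if the model-side
    distance |m_i1 - m_i2| matches the scene-side distance
    |s_j1 - s_j2| within eps. Returns the largest consistent cluster."""
    n = len(corrs)
    ok = np.zeros((n, n), dtype=bool)
    for a in range(n):
        for b in range(n):
            dm = np.linalg.norm(model_pts[corrs[a][0]] - model_pts[corrs[b][0]])
            ds = np.linalg.norm(scene_pts[corrs[a][1]] - scene_pts[corrs[b][1]])
            ok[a, b] = abs(dm - ds) < eps
    # grow a cluster greedily around each seed and keep the biggest one
    best = []
    for seed in range(n):
        cluster = [seed]
        for c in range(n):
            if c != seed and all(ok[c, m] for m in cluster):
                cluster.append(c)
        if len(cluster) > len(best):
            best = cluster
    return [corrs[i] for i in best]
```

A wrong match distorts the inter-point distances on one side only, so it cannot join the cluster formed by the correct matches.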
Hough Grouping (HG) [19] extends the original Hough transform [20] to 3D correspondence grouping, where the key idea is to iteratively cast votes for object location and pose bins in the Hough parameter space. At the end of the process the bins with the highest accumulated votes represent the most likely pose candidates, and the correspondences that contributed to those bins are accepted. The method requires a unique reference point in the model, typically the model centroid, and each bin represents a single pose instance candidate; therefore, correct correspondences vote for the same bin, which gets quickly accumulated. To make correspondence points invariant to the rotation and translation between the model and the scene, every point is associated with a local Reference Frame (RF) [21]. In the voting stage each correspondence between the captured scene and the model casts a vote to one or multiple bins in the 3D translation Hough accumulator space, and the pose is stored in the local reference frame. Finally, correspondences contributing to bins having more votes than a set threshold are accepted; the threshold is set adaptively since it depends on the number of available points, and the most important parameter is the Hough accumulator bin size.
III. Object Pose for Robot Manipulation
In this section we propose a novel approach to evaluate and benchmark 6D object pose estimation algorithms for robotics. The approach is based on a probabilistic formulation of a successful grasp (Section III-A) and sampling in the residual space to estimate the probability model (Section III-B).
III-A Measuring the probability of a successful grasp
The success of a grasp is a binary random variable S ∈ {0, 1}, where S = 1 denotes a successful grasp and S = 0 denotes an unsuccessful grasp attempt. Therefore, S follows a Bernoulli distribution with complementary probabilities of success and failure: P(S = 1) = E[S] = p and P(S = 0) = 1 − p, where E denotes the mathematical expectation. The fundamental problem in our case is the fact that the probability of success depends on the pose error r = (r_x, r_y, r_z, r_α, r_β, r_γ), where the vector elements denote the error with respect to each pose parameter (the translation error 3D vector and the rotation error 3D vector). Thus, we should focus on the conditional distributions, conditional probabilities and conditional expectations as functions of r. For any given error r, the conditional probability of a successful grasp attempt is

p(r) = P(S = 1 | r) = E[S | r]. (3)
The maximum likelihood estimate of the Bernoulli parameter p from N homogeneous samples is the sample average

p̂ = (1/N) Σ_{n=1}^{N} s_n, (4)

where homogeneity means that all samples s_n, n = 1, …, N, are realizations of a common Bernoulli random variable with a unique underlying parameter p. However, guaranteeing homogeneity would require that the samples were either all collected at the same residual r, or at different residuals that nonetheless yield the same probability p(r); i.e. it would require us either to collect multiple samples for each r or to know p(r) beforehand over ℝ³ × 𝕊³ (which is what we are trying to estimate). This means that in practice p(r) must be estimated from non-homogeneous samples, i.e. from s_n sampled at residuals r_n which can be different and have different underlying probabilities p(r_n).
The actual form of p(r) over ℝ³ × 𝕊³ is unknown and depends on many factors, e.g., the shape of the object and the properties of the gripper. Therefore it is not meaningful to assume any parametric shape such as a Gaussian or uniform distribution. Instead, we adopt the Nadaraya–Watson nonparametric estimator, which gives the probability of a successful grasp as

p̂(r) = Σ_{n=1}^{N} s_n K_h(r − r_n) / Σ_{n=1}^{N} K_h(r − r_n), (5)

where r_n denotes the pose error at which s_n has been sampled and K_h is a nonnegative multivariate kernel with vector scale h.
In this work, K_h is the multivariate Gaussian kernel

K_h(r − r_n) = Π_{d=1}^{3} φ((r_d − r_{n,d}) / h_d) Π_{d=4}^{6} Σ_{k∈ℤ} φ((r_d − r_{n,d} + 2πk) / h_d), (6)

where φ is the standard Gaussian bell, φ(x) = (2π)^{−1/2} exp(−x²/2). The sums over k in (6) realize the 2π-modulo periodicity of the angular dimensions.
The performance of the estimator (5) is heavily affected by the choice of h, which determines the influence of the samples s_n on p̂(r) based on the difference between the pose errors r and r_n. Indeed, h can be interpreted as the bandwidth of the estimator: a too large h results in excessive smoothing whereas a too small h results in localized spikes.
To find an optimal h, we use leave-one-out (LOO) cross-validation. Specifically, we construct the estimator p̂_{−n} on the basis of the N − 1 training examples obtained by leaving out the n-th sample. The likelihood of s_n given p̂_{−n}(r_n) is either p̂_{−n}(r_n) if s_n = 1, or 1 − p̂_{−n}(r_n) if s_n = 0. We then select the h that maximizes the total LOO log-likelihood over the whole sample set:

h* = argmax_h Σ_{n=1}^{N} [ s_n log p̂_{−n}(r_n) + (1 − s_n) log(1 − p̂_{−n}(r_n)) ].
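The estimator of Eq. (5) and the LOO bandwidth selection can be sketched as follows. For brevity the sketch uses a plain product Gaussian kernel without the angular wrapping of Eq. (6), and all names are illustrative:

```python
import numpy as np

def nw_prob(r, R, S, h):
    """Nadaraya-Watson estimate of P(S=1 | r) (Eq. 5) from sampled
    residuals R (N x d) and binary outcomes S (N,), using a product
    Gaussian kernel with per-dimension bandwidth h (d,).
    Angular wrapping of the rotational dimensions is omitted here."""
    u = (r - R) / h
    w = np.exp(-0.5 * (u ** 2).sum(axis=1))  # unnormalized Gaussian weights
    return float(w @ S / w.sum())

def loo_loglik(R, S, h, eps=1e-12):
    """Leave-one-out log-likelihood of a candidate bandwidth h."""
    ll = 0.0
    for n in range(len(S)):
        mask = np.arange(len(S)) != n
        p = nw_prob(R[n], R[mask], S[mask], h)
        ll += np.log(p + eps) if S[n] == 1 else np.log(1 - p + eps)
    return ll
```

In practice one would evaluate `loo_loglik` on a grid of candidate bandwidths and keep the maximizer; on data where success depends sharply on the residual, a small bandwidth scores higher than a heavily oversmoothed one.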
III-B Sampling the residual space
Evaluation of the success of a grasp is based on 2D markers attached to the manipulated objects, which can be accurately detected and localized with their 6D pose by the ArUco library [22]. In this work the marker is detected using the RGB channels of the RGB-D sensor. A grasp is manually programmed and the grasp point is defined with respect to the marker by a 6D transformation (3D translation and 3D rotation) represented as a 4 × 4 matrix in homogeneous coordinates. The final pose of the gripper defines the grasp. However, the object pose estimate is given in the world frame, which corresponds to the robot frame in our case. The coordinate transformation from the grasp to the world (robot) frame is defined as the transformation chain
T^world_grasp = T^world_ee T^ee_sensor T^sensor_marker T^marker_grasp, (7)

where the transformations are:

T^marker_grasp – a constant transformation that is measured for a closed gripper with respect to the 2D marker;

T^sensor_marker – the transformation of the detected marker, computed from the sensor values (RGB-D), to the sensor frame;

T^ee_sensor – a constant transformation from the sensor frame to the robot end-effector frame (note that the RGB-D sensor is attached to the end-effector);

T^world_ee – the robot end-effector position and orientation with respect to the base (world) frame, calculated using the joint values and the known kinematic equations.
Using the robot forward kinematics and a printed chessboard pattern we compute T^ee_sensor using the standard procedure for hand-eye calibration [23]. For a calibrated camera the ArUco library provides a real-time pose of the marker with respect to the world frame, T^world_marker. The constant offset from the marker to the actual grasp pose is estimated from T^marker_grasp = (T^world_marker)⁻¹ T^world_grasp, where T^world_grasp is obtained by manually hand-guiding the end-effector to the predefined grasp pose and using the forward kinematics engine.
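The chain of Eq. (7) is a plain product of 4 × 4 homogeneous transforms. The sketch below illustrates the composition; all numbers are made-up placeholders, not the calibration values of the actual setup:

```python
import numpy as np

def transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Chain of Eq. (7): grasp -> marker -> sensor -> end-effector -> world.
# All values below are illustrative placeholders.
T_ee_world      = transform(rot_z(np.pi / 2), [0.4, 0.0, 0.5])   # forward kinematics
T_sensor_ee     = transform(np.eye(3),        [0.0, 0.05, 0.1])  # hand-eye calibration
T_marker_sensor = transform(np.eye(3),        [0.0, 0.0, 0.3])   # ArUco detection
T_grasp_marker  = transform(np.eye(3),        [0.02, 0.0, 0.0])  # taught grasp offset

T_grasp_world = T_ee_world @ T_sensor_ee @ T_marker_sensor @ T_grasp_marker
```

Note the right-to-left reading order: the rightmost factor maps grasp coordinates into the marker frame, and each successive factor lifts the result one frame closer to the world frame.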
6D residual
A residual r is generated by sampling the translation offsets (Δx, Δy, Δz) and the rotation offsets (Δα, Δβ, Δγ), and forming a 3D isometry transformation T_r which is applied to the final pose:

T'^world_grasp = T^world_grasp T_r. (8)

Our sampling procedure starts by beam searching, one parameter at a time, the minimum and maximum values of each of the six residual parameters at which the grasp fails. The found limits are then used to generate uniform samples in the residual space (see Table I).
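Sampling a residual within the Table I limits and turning it into the perturbation T_r of Eq. (8) can be sketched as follows; the z-y-x Euler convention and all names are our assumptions:

```python
import numpy as np

def euler_to_rot(a, b, g):
    """Rotation matrix from z-y-x Euler angles (an assumed convention)."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cg, sg = np.cos(g), np.sin(g)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cg, -sg], [0.0, sg, cg]])
    return Rz @ Ry @ Rx

def sample_residual_transform(limits, rng):
    """Uniformly sample r = (dx, dy, dz, da, db, dg) inside the
    per-parameter [lo, hi] limits and build the perturbation T_r."""
    r = np.array([rng.uniform(lo, hi) for lo, hi in limits])
    T = np.eye(4)
    T[:3, :3] = euler_to_rot(*r[3:])
    T[:3, 3] = r[:3]
    return r, T
```

The perturbed grasp pose is then `T_grasp_world @ T_r`, i.e. the residual is applied in the grasp frame as in Eq. (8).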
Grasp success validation
Our tasks of interest contain two important stages, grasp and placement, which both need to succeed. We define success automatically by two different triggers: the error in the final pose of the manipulated object and the wrench (force-torque) at the end-effector during execution. The final pose error measures success in task completion while the wrench measures collisions during the task.
For the pose we manually define thresholds for the translation error e_t and the orientation error e_θ. These are computed using the markers attached to the manipulated objects: given the target pose rotation R* and translation t*, and the measured pose (R, t),

e_t = ‖t − t*‖ and e_θ = ∠(R*ᵀ R),

where ∠(·) denotes the rotation angle of the relative rotation.
The wrench is used to detect whether the robot collides with its environment while grasping and moving the object. In addition, if the robot places the object in the correct position but with a too high wrench, the whole operation is considered an unsuccessful attempt. The external wrench is computed from the error between the joint torques required to stay on the programmed trajectory and the expected joint torques. Using the robot internal sensors we obtain the force measurements (F_x, F_y, F_z), where F_x, F_y and F_z are the forces along the axes of the robot frame coordinates, measured in Newtons. We manually set force limits for each operation stage and trajectory, and measurements violating the limits are considered a failure.
IV. Experiments
In the following, we explain the experimental setup, the data and the evaluation procedure used to benchmark pose estimation methods. We report results for five point cloud based pose estimation methods.
IV-A Setup
Figure 2 illustrates the setup used in our experiments. The experiments were conducted using a Universal Robots UR5 arm with a Schunk PGN-100 gripper. The gripper operates pneumatically and was configured to have a high gripping force (approximately 600 N) to prevent object slippage. In addition, the gripper had custom 3D-printed fingers coated with rubber. For perception data, an Intel RealSense D415 RGB-D sensor was secured on a 3D-printed flange statically mounted between the gripper and the robot end-effector. All the in-house 3D prints were made of nylon reinforced with carbon fiber to tolerate the external forces during the experiments. All computation was performed on a single laptop running Ubuntu 18.04.
IV-B Data
Work object 6D residual
Every work object used in the experiments was first tested 100 times on the pipeline described in Section III-B with zero residual to confirm that the robot can perform the task automatically without problems. On average, a successful work operation took 45–55 seconds to execute, and in 24 hours the robot could automatically perform the task approximately 1,100 times depending on the work operation. The setup can automatically recover from most of the failure cases (dropping the object, object collision, etc.); however, if the marker was occluded by the environment or if the manipulated object got jammed against the internal parts of the motor, the pipeline was restarted by a human operator. For both tasks we generated approximately 3,300 valid samples.
TABLE I: The [min, max] limits of the six residual parameters (translations x, y, z and rotations α, β, γ) used for uniform sampling.

Parameter   Motor cap        Motor frame
x           [−0.90, 0.90]    [−0.60, 0.60]
y           [−0.10, 0.10]    [−0.30, 0.25]
z           [−0.10, 0.50]    [−0.20, 0.40]
α           [−0.11, 0.11]    [−0.11, 0.11]
β           [−0.87, 0.87]    [−0.44, 0.17]
γ           [−0.87, 0.87]    [−0.26, 0.26]
Object model and test images
In our dataset we include two real work objects from the local automotive industry: a motor top cap and a motor frame. The task is to grasp the parts and assemble them to the engine body. The point cloud models of the work pieces were semi-automatically created using the robot arm and the depth sensor attached to the end-effector. The robot arm was moved around the work piece and the sensor measurements from different viewpoints were merged into a dense point cloud model using the transformation chain in Eq. (7). Finally, artifacts and redundant parts of the reconstructed point cloud were removed manually.
The test dataset was generated in a similar manner by moving the arm around the work pieces. For each object we collected 150 test images in three different settings: 1) a single target object present, 2) multiple objects present and 3) a single target object present with partial occlusion. The dataset also contains the ground truth information to align the model to the test images.
IV-C Data preparation
All pose estimation algorithms used in the experiments take point clouds as input (Section II). The input model and scene point clouds were first downsampled to a fixed resolution using a regular voxel grid to limit the amount of data for processing; the voxel size was chosen depending on the density of the cloud. For the resulting object surfaces we estimated the local surface normals by least squares plane fitting on the points in a small neighborhood. The object surfaces were downsampled to a fixed resolution, so the resulting number of points depends on the object. On these points, local descriptors for point matching were computed using the local point neighborhoods. The SHOT [21] feature descriptor was selected since it performed the best in our preliminary experiments. The descriptor support radius was computed as a fraction of the object model bounding box diagonal. During the experiments the most similar descriptors between the model and the scene were found using a randomized tree similarity search, and the found matches formed the initial set of correspondences.
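The voxel-grid downsampling step can be sketched as follows; this is a simple centroid-per-voxel scheme with illustrative names, and point cloud libraries offer equivalent built-ins:

```python
import numpy as np

def voxel_downsample(points, voxel):
    """Voxel-grid downsampling: replace all points falling into the
    same cubic voxel of edge length `voxel` by their centroid."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    n = inv.max() + 1
    sums = np.zeros((n, points.shape[1]))
    np.add.at(sums, inv, points)                      # accumulate per voxel
    counts = np.bincount(inv, minlength=n).reshape(-1, 1)
    return sums / counts                              # voxel centroids
```

Two points 1 cm apart collapse into one centroid under a 10 cm voxel, while a distant point survives untouched, which is exactly the resolution-limiting behavior wanted here.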
IV-D Error measurement
In this work we use two different metrics to compare the performance of the pose estimation methods. The first is the proposed probabilistic approach described in Section III-A. The 6D pose residual of an estimated pose T_est is computed from

T_r = T_gt⁻¹ T_est, (9)

where T_gt is the ground truth pose transform. The grasp success probability is then calculated using Eq. (5). In addition to the average of the computed grasp probabilities we report the proportion of the estimates for which the success probability is above a threshold value.
To put the proposed error function into perspective with the current literature we also use the surface MSE [4], where the pose error is calculated as the average distance between the transformed model points:

e_MSE = (1 / |M|) Σ_{x∈M} ‖T_gt x − T_est x‖, (10)

where M is the set of 3D model points. Since a complete failure of a method on certain test images might influence the result too much, we also report the top-25% MSE error, which is less affected by estimation failures producing large errors.
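Both error measures are straightforward to compute from the two pose matrices; a sketch with illustrative names, following Eq. (9) and the average-distance form of Eq. (10):

```python
import numpy as np

def pose_residual(T_est, T_gt):
    """Residual transform of Eq. (9): identity when the estimate is perfect."""
    return np.linalg.inv(T_gt) @ T_est

def add_error(T_est, T_gt, model_pts):
    """Average distance between model points under the two poses (Eq. 10)."""
    h = np.hstack([model_pts, np.ones((len(model_pts), 1))])  # homogeneous coords
    d = (h @ T_gt.T - h @ T_est.T)[:, :3]
    return float(np.linalg.norm(d, axis=1).mean())
```

For a pure translation error the average distance equals that translation magnitude regardless of the model points, which is a useful sanity check.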
IV-E Results
The results for all the methods are shown in Table II. Rather strikingly, the best 3D pose estimate in all cases and under all metrics is provided by the Geometric Consistency (GC) method, which is clearly superior to HG and SI that represent the more recent state-of-the-art. GC has the lowest MSE error for both objects, and in the Motor Cap pose estimation its error is substantially lower than that of the other methods. SI performs slightly better than HG and, surprisingly, RANSAC falls behind NNSR, which is considered the baseline.
Observing the grasp success rates, again the GC method outperforms all the other methods with a clear margin, although its success rate is relatively low. This is due to the fact that for both test objects the estimated grasp pose has to be very close to the reference (Table I).
TABLE II: Results for the two parts (gripper: Schunk, fingers: custom made). For each part we report the MSE, the top-25% MSE, the average grasp success probability and the proportion of estimates above the success probability threshold.

              Motor cap                          Motor frame
Method        MSE    Top-25%  Prob.  Prop.      MSE    Top-25%  Prob.  Prop.
GC [18]       0.10   0.27     0.35   27%        0.02   0.35     0.38   23%
HG [19]       0.46   0.86     0.09   7%         0.38   0.98     0.15   10%
SI [17]       0.40   0.52     0.10   7%         0.20   0.50     0.19   13%
NNSR [15]     0.20   0.23     0.00   0%         0.33   0.37     0.00   0%
RANSAC [16]   0.31   0.37     0.00   0%         0.49   0.49     0.00   0%
V. Conclusions
The main outcome of this work is a novel evaluation metric for benchmarking 3D object pose estimation methods: the average grasp success probability. The metric measures the true probability of succeeding in a given task using a given setup and therefore has clear practical relevance. Other research groups can benchmark their pose estimation methods without rebuilding the physical setup, using the provided test images and the precomputed probability model. In contrast to the popular MSE metric, our metric provides a direct interpretation of the performance, i.e. whether the task can be solved using the estimated poses. We provide all benchmark images and tools through a public web page to facilitate fair comparison and to promote more practical research on grasping and 3D object pose estimation.

References
 [1] T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, et al., “Bop: benchmark for 6d object pose estimation,” in ECCV, pp. 19–34, 2018.
 [2] J. Yang, K. Xian, Y. Xiao, and Z. Cao, “Performance evaluation of 3d correspondence grouping algorithms,” in 3DV, pp. 467–476, IEEE, 2017.
 [3] T. Hodaň, J. Matas, and Š. Obdržálek, “On evaluation of 6d object pose estimation,” in ECCV, pp. 606–619, Springer, 2016.
 [4] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of textureless 3d objects in heavily cluttered scenes,” in ACCV, pp. 548–562, Springer, 2012.
 [5] U. Viereck, A. t. Pas, K. Saenko, and R. Platt, “Learning a visuomotor controller for real world robotic grasping using simulated depth images,” arXiv preprint arXiv:1706.04652, 2017.
 [6] M. Gualtieri, A. Ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in IROS, pp. 598–605, IEEE, 2016.

 [7] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in ICRA, pp. 3406–3413, IEEE, 2016.
 [8] A. Saxena, L. Wong, M. Quigley, and A. Y. Ng, "A vision-based system for grasping novel objects in cluttered environments," in Robotics research, pp. 337–348, Springer, 2010.
 [9] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6d object pose estimation using 3d object coordinates,” in ECCV, pp. 536–551, Springer, 2014.
 [10] A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.K. Kim, “Recovering 6d object pose and predicting nextbestview in the crowd,” in CVPR, pp. 3583–3592, 2016.
 [11] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, “Tless: An rgbd dataset for 6d pose estimation of textureless objects,” in WACV, pp. 880–888, IEEE, 2017.
 [12] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgbd images,” in CVPR, pp. 2930–2937, 2013.
 [13] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab, “Ssd6d: Making rgbbased 3d detection and 6d pose estimation great again,” in ICCV, Oct 2017.
 [14] F. Manhardt, W. Kehl, N. Navab, and F. Tombari, “Deep modelbased 6d pose refinement in rgb,” in ECCV, September 2018.

 [15] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [16] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
 [17] A. Glent Buch, Y. Yang, N. Kruger, and H. Gordon Petersen, “In search of inliers: 3d correspondence by local and global voting,” in CVPR, pp. 2067–2074, 2014.
 [18] H. Chen and B. Bhanu, “3d freeform object recognition in range images using local surface patches,” Pattern Recognition Letters, vol. 28, no. 10, pp. 1252–1262, 2007.
 [19] F. Tombari and L. Di Stefano, “Object recognition in 3d scenes with occlusions and clutter by hough voting,” in PSIVT, pp. 349–355, IEEE, 2010.
 [20] P. V. Hough, “Method and means for recognizing complex patterns,” Dec. 18 1962. US Patent 3,069,654.
 [21] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in ECCV, pp. 356–369, Springer, 2010.
 [22] S. GarridoJurado, R. MuñozSalinas, F. J. MadridCuevas, and M. J. MarínJiménez, “Automatic generation and detection of highly reliable fiducial markers under occlusion,” Pattern Recognition, vol. 47, no. 6, pp. 2280–2292, 2014.
 [23] F. C. Park and B. J. Martin, “Robot sensor calibration: solving ax= xb on the euclidean group,” IEEE Transactions on Robotics and Automation, vol. 10, no. 5, pp. 717–721, 1994.