The bin-picking application is concerned with a robot that has to grasp single instances of rigid objects from a chaotically filled bin (see Fig. 1). In industrial production, it aims to replace the manual extraction of parts from storage boxes. In this setting, object pose estimation is a challenging task due to cluttered scenes with heavy occlusions and many identical objects. Typically, one is forced to predict 6D poses from a single depth image or point cloud.
Since the beginning of the deep learning era, the performance on many computer vision tasks has increased drastically. This is closely related to the availability of large real-world datasets for tasks such as image classification, object detection and segmentation, or autonomous driving [3, 4]. However, annotating vast amounts of real-world data is time-consuming and tedious, especially for 3D data.
Previous approaches towards datasets for 6D object pose estimation such as [5, 6, 7, 8] either lack sufficient data or scene variety, or do not fit the bin-picking scenario. According to the experiments in , the currently leading method is based on point-pair features, in contrast to many other computer vision tasks, which are dominated by deep learning approaches.
This paper aims to support machine learning methods and to advance 6D object pose estimation in bin-picking scenarios. For this purpose, we provide a new large-scale benchmark dataset, referred to as the “Fraunhofer IPA Bin-Picking dataset”, including 520 fully annotated point clouds and corresponding depth images of real-world scenes and about 206,000 synthetic scenes. It comprises eight objects from
as well as two newly introduced ones. We present an effective way to semi-automatically produce ground truth data of real-world bin-picking scenes using the iterative closest point (ICP) algorithm and a reconstruction of the scenes in a physics simulation. This comprises not only a translation vector and a rotation matrix, but also a visibility score and a segmentation mask for each object. The synthetic data consists of approximately 198,000 annotated scenes for training and 8,000 scenes for testing. In contrast to most other public datasets for 6D object pose estimation, our contribution can additionally be used for instance segmentation and contains a visibility score, which is of great value in bin-picking scenarios.
II Related Work
Particularly since the emergence of affordable sensors capable of recording 3D data, numerous corresponding datasets have appeared, as listed in . Nevertheless, the vast majority of them are unsuitable for machine learning methods for 6D object pose estimation due to missing ground truth information. Mian et al.  provided point clouds of scenes with different objects, but these contain neither a high amount of clutter nor multiple instances of the same object type.
The LINEMOD dataset  is a popular and commonly used benchmark containing about 18,000 RGB-D images of 15 texture-less objects. The work was augmented by  such that ground truth poses are available for all objects depicted in the images, enabling evaluation under a higher degree of occlusion.
Datasets sharing similar properties were presented in [12, 13, 6, 14]. All of these datasets have limited pose variability and data redundancy, since the very same scene is merely recorded from different angles (see Table I). Additionally, only Doumanoglou et al.  provided homogeneous scenes as is the case in industrial bin-picking, i.e., scenes with multiple instances of the same object type in one image.
In the Rutgers APC dataset , a cluttered warehouse scenario with occlusion and 6,000 real-world test images of 24 objects is introduced. However, it includes only non-rigid, textured objects and is not targeted at bin-picking.
The T-Less dataset  provides 38,000 real training images of 30 industrial texture-less objects plus 10,000 test images of 20 scenes, recorded by systematically sampling a sphere. Again, it lacks homogeneity, has limited pose variability, and exhibits data redundancy.
Due to the time-consuming and difficult process of annotation, most approaches use markers either on the objects themselves or placed relative to the objects to automatically produce ground truth data. The same scene is recorded multiple times, causing data redundancy and pose inflexibility. However, after removing redundant scenes, the datasets become too small for machine learning methods such as deep neural networks.
BOP  attempts to standardize and integrate the presented datasets in one novel benchmark for 6D object pose estimation. In addition, two new scenarios with varying lighting conditions were included, but those are not related to bin-picking. Similar to our approach, an ICP algorithm is used to refine the manually created ground truth. The work includes results from the SIXD Challenge 2017 (http://cmp.felk.cvut.cz/sixd/challenge_2017/, last accessed on July 31, 2019), which focused on 6D object pose estimation of single instances of one object.
Furthermore, the common standard in industrial bin-picking is a top-view 3D sensor, which is not the case in the aforementioned works. In the Siléane dataset , this common practice is recognized and a procedure to automatically annotate real-world images is presented. However, this dataset provides at most 325 images per object, which is usually far too little for advanced machine learning methods. In this work, we extend the Siléane dataset to be large enough for learning-based methods and introduce two new industrial objects together with real-world data.
III Fraunhofer IPA Bin-Picking Dataset
In this section, we give details on the new dataset for 6D object pose estimation for industrial bin-picking.
III-A Sensor Setup
| object | resolution | clip start and clip end | orthogonal size | perspective angle | drop limit | number of training cycles | number of test cycles | number of scenes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Siléane Stanford bunny | | 876–1,706 mm | 659 mm | | 80 | 250 | 10 | 21,060 |
| Siléane candlestick | | 584–1,105 mm | 534 mm | | 60 | 250 | 10 | 15,860 |
| Siléane pepper | | 920–1,606 mm | 724 mm | | 90 | 250 | 10 | 23,660 |
| Siléane brick | | 878–1,031 mm | 117 mm | | 150 | 250 | 10 | 39,260 |
| Siléane gear | | 1,639–2,082 mm | 478 mm | | 60 | 250 | 10 | 15,860 |
| Siléane T-Less 20 | | 584–1,105 mm | 534 mm | | 99 | 250 | 10 | 26,000 |
| Siléane T-Less 22 | | 584–1,105 mm | 534 mm | | 100 | 250 | 10 | 26,260 |
| Siléane T-Less 29 | | 584–1,105 mm | 534 mm | | 79 | 250 | 10 | 20,800 |
| Fraunhofer IPA gear shaft | | 750–1,750 mm | 600 mm | | 30 | 250 | 10 | 8,060 |
| Fraunhofer IPA gear shaft (real-world) | | 750–1,750 mm | 600 mm | | 22 | 0 | 10 | 230 |
| Fraunhofer IPA ring screw | | 1,250–1,750 mm | 600 mm | | 35 | 250 | 10 | 9,360 |
| Fraunhofer IPA ring screw (real-world) | | 1,250–1,750 mm | 600 mm | | 28 | 0 | 10 | 290 |
We collected the real-world data using an Ensenso N20-1202-16-BL stereo camera with a minimum working distance of 1,000 mm, a maximum working distance of 2,400 mm, and an optimal working distance of 1,400 mm. The sensor produces images with a resolution of pixels and is mounted above the bin.
For the collection of synthetic data, we use the same parameter settings in our physics simulation as in . The detailed settings for each object, i.e., the clipping planes, the perspective angle of the perspective projection, the orthogonal size of the orthogonal projection, and the image resolution, are listed in Table II.
III-B Dataset Description
We use eight objects with different symmetries from , which in turn uses three objects originally published by . Moreover, we introduce two novel industrial objects: a gear shaft with a revolution symmetry and a ring screw with a cyclic symmetry. An overview is depicted in Fig. 2.
The dataset is separated into a training and a test dataset. For each object in the test dataset, data is generated by iteratively filling an initially empty bin. This iterative procedure consists of ten cycles, each of which ends when a particular drop limit is reached. For details of the iterative procedure, see Section III-C. The training dataset comprises 250 cycles generated in the same way, but contains no real-world scenes due to the time-consuming, poorly scaling process of annotation.
However, the usage of sim-to-real transfer techniques such as domain randomization  has proven successful, as demonstrated for example by . In this way, the synthetic training dataset is sufficient for techniques based on deep learning to achieve high-quality results on the test set, including the real-world data.
The synthetic scenes are independently filled, which means that there are no dependencies between the images. The number of objects in one scene is increased one by one, but after each scene is recorded, the bin is cleared and we start from scratch. In contrast, each real-world scene in one cycle depends on the previous scenes of this cycle due to our data collection procedure (Section III-C2).
The ground truth data comprises a translation vector and a rotation matrix relative to the coordinate system of the 3D sensor, a visibility score, and a segmentation image labeled by the object ID for perspective and orthogonal projection.
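For concreteness, one annotated instance can be pictured as a record like the following minimal sketch; the field names are illustrative assumptions, not the dataset's actual file schema:

```python
import numpy as np

# Illustrative ground-truth record for one object instance (field names
# are hypothetical; the actual dataset files may use a different layout).
instance = {
    "id": 1,                     # object ID, also used in the segmentation image
    "class": "gear_shaft",       # object type
    "translation": np.array([120.0, -35.5, 1400.0]),  # mm, 3D sensor frame
    "rotation": np.eye(3),       # rotation matrix, 3D sensor frame
    "visibility": 0.85,          # fraction of the object's pixels visible
}

# Sanity checks: valid rotation matrix and visibility in [0, 1].
R = instance["rotation"]
assert np.allclose(R @ R.T, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
assert 0.0 <= instance["visibility"] <= 1.0
```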
III-C Data Collection Procedure
III-C1 Synthetic Data
To generate scenes typical for bin-picking, we use the physics simulation V-REP . We import a CAD model of each object into the simulation and drop the objects from varying positions and with random orientations into the bin (see Fig. 3). To handle the dynamics and collisions, we use the built-in Bullet physics engine (https://pybullet.org/wordpress/, last accessed on July 31, 2019). To increase realism for the newly introduced objects, we slightly shift the bin pose from image to image, whereas the settings for the objects from  remain unchanged. Starting with an empty bin, we iteratively raise the number of objects dropped into the bin: we drop one object in the first run, record the scene, clear the bin, drop two objects in the second run, and so on. After each run, we record a depth image, an RGB image, and a segmentation image in both orthogonal and perspective projection, together with the poses of all objects in the scene. This procedure is repeated until a predefined drop limit is reached; the collected data forms one cycle. Fig. 4 depicts example scenes.
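The relation between drop limit, cycle counts, and total scene count can be cross-checked in a few lines. One assumption here (ours, consistent with the totals in Table II but not stated explicitly) is that each cycle also records the initial empty bin, giving drop limit + 1 scenes per cycle:

```python
def scenes_per_object(drop_limit, training_cycles, test_cycles):
    # Each cycle records one scene per filling level; the totals in Table II
    # are consistent with the empty bin being recorded as well, giving
    # drop_limit + 1 scenes per cycle (our assumption, not stated in the text).
    return (drop_limit + 1) * (training_cycles + test_cycles)

# Cross-check against Table II:
assert scenes_per_object(80, 250, 10) == 21060    # Siléane Stanford bunny
assert scenes_per_object(150, 250, 10) == 39260   # Siléane brick
assert scenes_per_object(22, 0, 10) == 230        # IPA gear shaft (real-world)
```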
The depth images are saved in 16-bit unsigned integer format (uint16). The segmentation image is created by assigning the ID of the respective object to each pixel, i.e., zero for the bin, one for the first object, etc. If a pixel belongs to the background, the maximum uint8 value of 255 is assigned. For each item, we additionally save a segmentation image containing only that individual object in order to determine the total number of pixels forming it; all other objects are made invisible for this single-object image. The final visibility score is then calculated as the ratio between the number of visible pixels in the full segmentation image and this total number of pixels.
If the object is partly outside of the original image, we further save a larger segmentation image containing the full object. The resolution of this image is increased such that the region depicting the original scene keeps the same number of pixels as in the original image. This large image is then used as the reference to calculate the visibility score. The resulting value is computed for both the orthogonal and the perspective version and attached to the aforementioned ground truth file containing the ID, class, translation vector, and rotation matrix of each object instance in the scene.
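The visibility computation described above can be sketched as follows, assuming uint8 segmentation images with background label 255 as stated; the function name is ours:

```python
import numpy as np

BACKGROUND = 255  # uint8 background label, as in the dataset

def visibility_score(seg_full, seg_single, obj_id):
    """Ratio between the object's visible pixels in the full scene
    segmentation and its total pixels in the single-object rendering."""
    visible = np.count_nonzero(seg_full == obj_id)
    total = np.count_nonzero(seg_single == obj_id)
    return visible / total if total else 0.0

# Toy 4x4 example: object 1 covers 4 pixels when rendered alone,
# but one of them is occluded by object 2 in the full scene.
seg_single = np.full((4, 4), BACKGROUND, dtype=np.uint8)
seg_single[1:3, 1:3] = 1
seg_full = seg_single.copy()
seg_full[1, 1] = 2
score = visibility_score(seg_full, seg_single, 1)  # 3 / 4 = 0.75
```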
III-C2 Real-World Data
Starting with a filled bin, we carefully remove the objects one by one without changing the poses of the remaining ones. After each removal, we record the 3D sensor data; this continues until the bin is empty. For annotation, we reverse the order of the scenes. In each step, we fit a point cloud representation of the object’s CAD model to the newly added object by means of the ICP algorithm to obtain the precise 6D pose (see Fig. 5).
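A minimal point-to-point ICP in the spirit of this annotation step can be sketched as follows; this is an illustrative NumPy implementation (brute-force correspondences, Kabsch update), not the exact algorithm used for the dataset:

```python
import numpy as np

def icp(source, target, iters=50):
    """Minimal point-to-point ICP (a sketch, not the dataset's exact tooling).
    Aligns `source` (N, 3) to `target` (M, 3); returns R, t such that
    source @ R.T + t approximates target. Assumes a reasonable initial pose."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # 1) Brute-force nearest-neighbour correspondences.
        dists = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        corr = target[np.argmin(dists, axis=1)]
        # 2) Optimal rigid transform for these correspondences (Kabsch/SVD).
        mu_s, mu_c = src.mean(axis=0), corr.mean(axis=0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (corr - mu_c))
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_c - R @ mu_s
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Demo: recover a small, known misalignment of a random point cloud.
rng = np.random.default_rng(0)
model = rng.random((30, 3)) * 100                  # stand-in for CAD model points
a = 0.01                                           # small rotation about z (rad)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
scene = model @ R_true.T + np.array([0.2, -0.1, 0.1])
R, t = icp(model, scene)
```

Because each recorded scene differs from the previous one by only a single added object, the previous pose estimates provide good initializations and ICP only has to refine the pose of the newly added object.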
With this result, we rebuild each real-world scene in our physics simulation and determine the segmentation mask and the visibility score for each object as described in Section III-C1. We further provide overlays of the real-world depth images and the ground truth segmentation for each individual object to demonstrate the quality of our annotation process.
Along with the dataset, we provide CAD models, Python tools for converting between point clouds, depth images, and ground truth files, and scripts to facilitate working with our dataset.
As demonstrated in , our synthetic dataset along with domain randomization  can be used to get robust and accurate 6D pose estimates on our real-world scenes. By applying various augmentations to the synthetic images during training, the deep neural network is able to generalize to real-world data despite being entirely trained on synthetic data.
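Typical augmentations of this kind can be sketched as follows; the noise level and dropout rate are illustrative assumptions, not the exact training pipeline:

```python
import numpy as np

def augment_depth(depth, rng, noise_mm=2.0, dropout=0.05):
    """Example domain-randomization-style augmentations for synthetic uint16
    depth images: additive Gaussian noise and random pixel dropout (0 marks
    an invalid measurement), mimicking a real stereo sensor. The parameter
    values are illustrative assumptions."""
    d = depth.astype(np.float32)
    d += rng.normal(0.0, noise_mm, size=d.shape)   # sensor noise (mm)
    d[rng.random(d.shape) < dropout] = 0.0         # missing measurements
    return np.clip(d, 0, np.iinfo(np.uint16).max).astype(np.uint16)

rng = np.random.default_rng(42)
depth = np.full((4, 4), 1400, dtype=np.uint16)     # flat surface at 1,400 mm
aug = augment_depth(depth, rng)
```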
IV-A Evaluation Metric
A common evaluation metric for 6D object pose estimation is ADD by Hinterstoisser et al., which accepts a pose hypothesis if the average distance of model points between the ground truth and the estimated pose is less than 0.1 times the diameter of the smallest bounding sphere of the object. Because this metric cannot handle symmetric objects, ADI  was introduced for this purpose. The ADI metric is widely used, but can fail to reject false positives as demonstrated in . Therefore, we use the metric provided by Brégier et al. [21, 8], which is suitable for rigid objects and scenes of many parts in bulk, and properly considers cyclic and revolution object symmetries. A pose representative comprises a translation vector and the relevant axis vectors of the rotation matrix, depending on the object’s proper symmetry group. The distance between a pair of poses is defined as the minimum Euclidean distance between their respective pose representatives.
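The displayed equation defining this distance was lost in extraction; in our notation, writing $\mathcal{R}(\mathcal{P})$ for the set of pose representatives of a pose $\mathcal{P}$, it presumably has the form:

```latex
d(\mathcal{P}_1, \mathcal{P}_2) \;=\;
\min_{\substack{r_1 \in \mathcal{R}(\mathcal{P}_1) \\ r_2 \in \mathcal{R}(\mathcal{P}_2)}}
\left\lVert r_1 - r_2 \right\rVert_2
```

The minimum over representative sets is what makes the distance invariant to the object's proper symmetry group: symmetric poses share representatives and therefore have distance zero.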
A pose hypothesis is accepted (considered a true positive) if its minimum distance to the ground truth is less than 0.1 times the object’s diameter. Following , only the poses of objects that are less than 50% occluded are relevant for retrieval. The metric condenses the performance of a method into a single scalar value, the average precision (AP), by taking the area under the precision-recall curve.
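The AP computation can be sketched as follows for hypotheses already sorted by decreasing confidence; this is a simplified, non-interpolated version, and the function and argument names are ours:

```python
import numpy as np

def average_precision(correct, n_relevant):
    """AP for pose hypotheses sorted by decreasing confidence.
    `correct[i]` is True if hypothesis i matches a relevant ground-truth
    pose; `n_relevant` is the number of relevant (less than 50% occluded)
    ground-truth poses in the test set."""
    correct = np.asarray(correct, dtype=float)
    tp = np.cumsum(correct)                             # true positives so far
    precision = tp / np.arange(1, len(correct) + 1)     # precision at each rank
    # Sum precision at each recall step (non-interpolated area under PR curve).
    return float((precision * correct).sum() / n_relevant)

# Three hypotheses, the first and third correct, two relevant objects:
ap = average_precision([True, False, True], n_relevant=2)  # (1 + 2/3) / 2
```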
IV-B Object Pose Estimation Challenge for Bin-Picking
We are going to offer a competition for 6D object pose estimation for bin-picking at IROS 2019, where the proposed dataset will serve as training and test data. With our contribution, we hope to advance the state of the art in object pose estimation and to promote the performance of industrial bin-picking robots. Further information regarding the competition is available at http://www.bin-picking.ai/en/competition.html.
V Conclusions and Future Work
To the best of our knowledge, we presented the first 6D object pose estimation benchmark dataset for industrial bin-picking that is large enough for advanced machine learning methods. It is composed of both synthetic and real-world scenes. In future work, we plan to publish data of more objects.
This work was partially supported by the Baden-Württemberg Stiftung gGmbH (Deep Grasping – Grant No. NEU016/1) and the Ministry of Economic Affairs of the state Baden-Württemberg (Center for Cyber Cognitive Intelligence (CCI) – Grant No. 017-192996). We would like to thank our colleagues for helpful discussions and comments.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F. F. Li, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research (IJRR), 2013.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Asian Conference on Computer Vision (ACCV), 2012.
-  A. Tejani, D. Tang, R. Kouskouridas, and T.-K. Kim, “Latent-class hough forests for 3d object detection and pose estimation,” in European Conference on Computer Vision (ECCV), 2014.
-  T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, “T-LESS: An rgb-d dataset for 6d pose estimation of texture-less objects,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
-  R. Brégier, F. Devernay, L. Leyrit, and J. L. Crowley, “Symmetry aware evaluation of 3d object detection and pose estimation in scenes of many parts in bulk,” in IEEE International Conference on Computer Vision (ICCV), 2017.
-  T. Hodaň, F. Michel, E. Brachmann, W. Kehl, A. G. Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, C. Sahin, F. Manhardt, F. Tombari, T.-K. Kim, J. Matas, and C. Rother, “BOP: Benchmark for 6d object pose estimation,” in European Conference on Computer Vision (ECCV), 2018.
-  A. S. Mian, M. Bennamoun, and R. Owens, “Three-dimensional model-based object recognition and segmentation in cluttered scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 28, no. 10, 2006.
-  E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6d object pose estimation using 3d object coordinates,” in European Conference on Computer Vision (ECCV), 2014.
-  U. Bonde, V. Badrinarayanan, and R. Cipolla, “Robust instance recognition in presence of occlusion and clutter,” in European Conference on Computer Vision (ECCV), 2014.
-  Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes,” in Robotics: Science and Systems (RSS), 2018.
-  A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim, “Recovering 6d object pose and predicting next-best-view in the crowd,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  C. Rennie, R. Shome, K. E. Bekris, and A. F. De Souza, “A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place,” IEEE Robotics and Automation Letters (RA-L), vol. 1, no. 2, 2016.
-  M. Firman, “Rgbd datasets: Past, present and future,” in CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis, 2016.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
-  OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, “Learning dexterous in-hand manipulation,” CoRR, vol. abs/1808.00177, 2018.
-  E. Rohmer, S. P. Singh, and M. Freese, “V-REP: A versatile and scalable robot simulation framework,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013.
-  M. El-Shamouty, K. Kleeberger, A. Lämmle, and M. F. Huber, “Simulation-driven machine learning for robotics and automation,” tm – Technisches Messen, 2019.
-  R. Brégier, F. Devernay, L. Leyrit, and J. L. Crowley, “Defining the pose of any 3d rigid object and an associated distance,” International Journal of Computer Vision (IJCV), vol. 126, no. 6, 2018.