
LabelFusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes

by   Pat Marion, et al.

Deep neural network (DNN) architectures have been shown to outperform traditional pipelines for object segmentation and pose estimation using RGBD data, but the performance of these DNN pipelines is directly tied to how representative the training data is of the true data. Hence a key requirement for employing these methods in practice is to have a large set of labeled data for your specific robotic manipulation task, a requirement that is not generally satisfied by existing datasets. In this paper we develop a pipeline to rapidly generate high quality RGBD data with pixelwise labels and object poses. We use an RGBD camera to collect video of a scene from multiple viewpoints and leverage existing reconstruction techniques to produce a 3D dense reconstruction. We label the 3D reconstruction using a human assisted ICP-fitting of object meshes. By reprojecting the results of labeling the 3D scene we can produce labels for each RGBD image of the scene. This pipeline enabled us to collect over 1,000,000 labeled object instances in just a few days. We use this dataset to answer questions related to how much training data is required, and of what quality the data must be, to achieve high performance from a DNN architecture.



I Introduction

Advances in neural network architectures for deep learning have made significant impacts on perception for robotic manipulation tasks. State of the art networks are able to produce high quality pixelwise segmentations of RGB images, which can be used as a key component for 6DOF object pose estimation in cluttered environments [1, 2]. However, for a network to be useful in practice it must be fine tuned on labeled scenes of the specific objects targeted by the manipulation task, and these networks can require tens to hundreds of thousands of labeled training examples to achieve adequate performance. To acquire sufficient data for each specific robotics application using once-per-image human labeling would be prohibitive, either in time or money. While some work has investigated closing the gap with simulated data [3, 4, 5, 6], our method can scale to these magnitudes with real data.

In this paper we tackle this problem by developing an open-source pipeline that vastly reduces the amount of human annotation time needed to produce labeled RGBD datasets for training image segmentation neural networks. The pipeline produces ground truth segmentations and ground truth 6DOF poses for multiple objects in scenes with clutter, occlusions, and varied lighting conditions. The key components of the pipeline are: leveraging dense RGBD reconstruction to fuse together RGBD images taken from a variety of viewpoints, labeling with ICP-assisted fitting of object meshes, and automatically rendering labels using projected object meshes. These techniques allow us to label once per scene, with each scene containing thousands of images, rather than having to annotate images individually. This reduces human annotation time by several orders of magnitude over traditional techniques. We optimize our pipeline to both collect many views of a scene and to collect many scenes with varied object arrangements. Our goal is to enable manipulation researchers and practitioners to generate customized datasets, which for example can be used to train any of the available state-of-the-art image segmentation neural network architectures. Using this method we have collected over 1,000,000 labeled object instances in multi-object scenes, with only a few days of data collection and without using any crowd sourcing platforms for human annotation.

Our primary contribution is the pipeline to rapidly generate labeled data, which researchers can use to build their own datasets, with the only hardware requirement being the RGBD sensor itself. We also have made available our own dataset, which is the largest available RGBD dataset with object-pose labels (352,000 labeled images, 1,000,000+ object instances). Additionally, we contribute a number of empirical results concerning the use of large datasets for practical deep-learning-based pixelwise segmentation of manipulation-relevant scenes in clutter – specifically, we empirically quantify the generalization value of varying aspects of the training data: (i) multi-object vs single object scenes, (ii) the number of background environments, and (iii) the number of views per scene.

II Related Work

We review three areas of related work. First, we review pipelines for generating labeled RGBD data. Second, we review applications of this type of labeled data to 6DOF object pose estimation in the context of robotic manipulation tasks. Third, we review work related to our empirical evaluations, concerning questions of scale and generalization for practical learning in robotics-relevant contexts.

II-A Methods for Generating Labeled RGBD Datasets

Rather than evaluate RGBD datasets based on the specific dataset they provide, we evaluate the methods used to generate them, and how well they scale. Firman [7] provides an extensive overview of over 100 available RGBD datasets. Only a few of the methods used are capable of generating labels for 6DOF object poses, and none of these associated datasets also provide per-pixel labeling of objects. One of the most related methods to ours is that used to create the T-LESS dataset [8], which contains approximately 49K RGBD images of textureless objects labeled with the 6DOF pose of each object. Compared to our approach, [8] requires highly calibrated data collection equipment. They employ fiducials for camera pose tracking which limits the ability of their method to operate in arbitrary environments. Additionally the alignment of the object models to the pointcloud is a completely manual process with no algorithmic assistance. Similarly, [1] describes a high-precision motion-capture-based approach, which does have the benefit of generating high-fidelity ground-truth pose, but its ability to scale to large scale data generation is limited by: the confines of the motion capture studio, motion capture markers on objects interfering with the data collection, and time-intensive setup for each object.

Although the approach is not capable of generating the 6 DOF poses of objects, a relevant method for per-pixel labeling is described in [2]. They employ an automated data collection pipeline in which the key idea is to use background subtraction. Two images are taken with the camera at the exact same location – in the first, no object is present, while it is in the second. Background subtraction automatically yields a pixelwise segmentation of the object. Using this approach they generate 130,000 labeled images for their 39 objects. As a pixelwise labeling method, there are a few drawbacks to this approach. The first is that in order to apply the background subtraction method, they only have a single object present in each scene. In particular there are no training images with occlusions. They could in theory extend their method to support multi-object scenes by adding objects to the scene one-by-one, but this presents practical challenges. Secondly the approach requires an accurately calibrated robot arm to move the camera in a repeatable way. A benefit of the method, however, is that it does enable pixelwise labeling of even deformable objects.

The SceneNN [9] and ScanNet [10] data generation pipelines share some features with our method. They both use an RGBD sensor to produce a dense 3D reconstruction and then perform annotations in 3D. However, since SceneNN and ScanNet are focused on producing datasets for RGBD scene understanding tasks, the type of annotation that is needed is quite different. In particular, their methods provide pixelwise segmentation into generic object classes (floor, wall, couch, etc.). Neither SceneNN nor ScanNet has geometric models for the specific objects in a scene, and thus neither can provide 6DOF object poses. Whereas ScanNet and SceneNN focus on producing datasets for benchmarking scene understanding algorithms, we provide a pipeline to enable rapid generation of labeled data for your particular application and object set.

II-B Object-Specific Pose Estimation in Clutter for Robotic Manipulation

There have been a wide variety of methods to estimate object poses for manipulation. A challenge is object specificity. [1] and [2] are both state of the art pipelines for estimating object poses from RGBD images in clutter – both approaches use RGB pixelwise segmentation neural networks (trained on their datasets described in the previous section) to crop point clouds which are then fed into ICP-based algorithms to estimate object poses by registering against prior known meshes. Another approach is to directly learn pose estimation [11]. The upcoming SIXD Challenge 2017 [12] will provide a comparison of state of the art methods for 6DOF pose estimation on a common dataset. The challenge dataset contains RGBD images annotated with ground truth 6DOF object poses. This is exactly the type of data produced by our pipeline and we aim to submit our dataset to the 2018 challenge. There is also a trend in manipulation research to bypass object pose estimation and work directly with the raw sensor data [13, 14, 15]. Making these methods object-specific in clutter could be aided by using the pipeline presented here to train segmentation networks.

II-C Empirical Evaluations of Data Requirements for Image Segmentation Generalization

While the research community is more familiar with the scale and variety of data needed for images in the style of ImageNet [16], the type of visual data that robots have available is much different than ImageNet-style images. Additionally, higher object specificity may be desired. In robotics contexts, there has been recent work in trying to identify data requirements for achieving practical performance for deep visual models trained on simulation data [3, 4, 5, 6], and specifically augmenting small datasets of real data with large datasets of simulation data [3, 4, 5, 6]. We do not know of prior studies that have performed generalization experiments with the scale of real data used here.

III Data Generation Pipeline

One of the main contributions of this paper is an efficient pipeline for generating labeled RGBD training data. The steps of the pipeline are described in the following sections: RGBD data collection, dense 3D reconstruction, object mesh generation, human assisted annotation, and rendering of labeled images.

Fig. 1: Overview of the data generation pipeline. (a) Xtion RGBD sensor mounted on Kuka IIWA arm for raw data collection. (b) RGBD data processed by ElasticFusion into reconstructed pointcloud. (c) User annotation tool that allows for easy alignment using 3 clicks. User clicks are shown as red and blue spheres. The transform mapping the red spheres to the green spheres is then the user specified guess. (d) Cropped pointcloud coming from user specified pose estimate is shown in green. The mesh model shown in grey is then finely aligned using ICP on the cropped pointcloud and starting from the user provided guess. (e) All the aligned meshes shown in reconstructed pointcloud. (f) The aligned meshes are rendered as masks in the RGB image, producing pixelwise labeled RGBD images for each view.

III-A RGBD Data Collection

A feature of our approach is that the RGBD sensor can either be mounted on an automated arm, as in Figure 1(a), or simply be hand-carried. The benefit of the former option is a reduced human workload, while the benefit of the latter option is that no sophisticated equipment (e.g. motion capture, external markers, heavy robot arm) is required, enabling data collection in a wide variety of environments. We captured 112 scenes using the handheld approach. For the remaining 26 scenes we mounted the sensor on a Kuka IIWA, as shown in Figure 1(a). The IIWA was programmed to perform a scanning pattern in both orientation and azimuth. Note that the arm-automated method does not require one to know the transform between the robot and the camera; everything is done in camera frame. Our typical logs averaged 120 seconds in duration with data captured at 30 Hz by the Asus Xtion Pro.

III-B Dense 3D Reconstruction

The next step is to extract a dense 3D reconstruction of the scene, shown in Figure 1(b), from the raw RGBD data. For this step we used the open source implementation of ElasticFusion [17] with the default parameter settings, which runs in realtime on our desktop with an NVIDIA GTX 1080 GPU. ElasticFusion also provides camera pose tracking relative to the local reconstruction frame, a fact that we take advantage of when rendering labeled images. Reconstruction performance can be affected by the amount of geometric features and RGB texture in the scene. Most natural indoor scenes provide sufficient texture, but large, flat surfaces with no RGB or depth texture can occasionally incur failure modes. Our pipeline is designed in a modular fashion so that any 3D reconstruction method that provides camera pose tracking can be used in place of ElasticFusion.

III-C Object Mesh Generation

A pre-processing step for the pipeline is to obtain meshes for each object. Once obtained, meshes speed annotation by enabling alignment of the mesh model rather than manually intensive pixelwise segmentation of the 3D reconstruction. Using meshes necessitates rigid objects, but imposes no other restrictions on the objects themselves. We tested several different mesh construction techniques when building our dataset. In total there are twelve objects. Four object meshes were generated using an Artec Space Spider handheld scanner. One object was scanned using a Next Engine turntable scanner. For the four objects which are part of the YCB dataset [18] we used the provided meshes. One of our objects, a tissue box, was modeled using primitive box geometry. In addition, our pipeline provides a volumetric meshing method using the VTK implementation of [19] that operates directly on the data already produced by ElasticFusion. Finally, there exist several relatively low cost all-in-one solutions [20], [21], [22] which use RGBD sensors such as the Asus Xtion, Intel RealSense R300 and Occipital Structure Sensor to generate object meshes. The only requirement is that the mesh be of sufficiently high quality to enable the ICP-based alignment (see Section III-D). RGB textures of meshes are not necessary.

III-D Human Assisted Annotation

One of the key contributions of the paper is in reducing the amount of human annotation time needed to generate labeled per-pixel and pose data of objects in clutter. We evaluated several global registration methods [23, 24, 25] to try to automatically align our known objects to the 3D reconstruction, but none of them came close to providing satisfactory results. This is due to a variety of reasons, but a principal one is that many scene points did not belong to any of the objects.

To circumvent this problem we developed a novel user interface that utilizes human input to assist traditional registration techniques. The user interface was developed using Director [26], a robotics interface and visualization framework. Typically the objects of interest are on a table or another flat surface – if so, a single click from the user segments out the table. The user identifies each object in the scene and then selects the corresponding mesh from the mesh library. They then perform a 3-click-based initialization of the object pose. Our insight for the alignment stage was that if the user provides a rough initial pose for the object, then traditional ICP-based techniques can successfully provide the fine alignment. The human provides the rough initial alignment by clicking three points on the object in the reconstructed pointcloud, and then clicking roughly the same three points in the object mesh, see Figure 1(c). The transform that best aligns the three model points, shown in red, with the three scene points, shown in blue, in a least squares sense is found using the vtkLandmarkTransform function. The resulting transform then specifies an initial alignment of the object mesh to the scene, and a cropped pointcloud is taken from the points within 1 cm of the roughly aligned model, as shown in green in Figure 1(d). Finally, we perform ICP to align this cropped pointcloud to the model, using the rough alignment of the model as the initial seed. In practice this results in very good alignments even for cluttered scenes such as Figure 1(e).
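The three-click initialization and the 1 cm crop can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation: it swaps in a NumPy SVD (Kabsch) solve for the least-squares rigid fit that the text delegates to vtkLandmarkTransform, measures distance to the model vertices rather than to the mesh surface, and leaves out the ICP refinement. The function names and the sample clicks are hypothetical.

```python
import numpy as np

def fit_rigid_transform(model_pts, scene_pts):
    """Least-squares rigid transform (R, t) with scene ~= R @ model + t."""
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    model_centroid = model_pts.mean(axis=0)
    scene_centroid = scene_pts.mean(axis=0)
    # 3x3 cross-covariance of the centered correspondences
    H = (model_pts - model_centroid).T @ (scene_pts - scene_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = scene_centroid - R @ model_centroid
    return R, t

def crop_near_model(scene_cloud, model_vertices, R, t, radius=0.01):
    """Keep scene points within `radius` meters of the roughly placed model.

    Assumption: distance is taken to the nearest model vertex (brute force).
    """
    scene_cloud = np.asarray(scene_cloud, dtype=float)
    placed = np.asarray(model_vertices, dtype=float) @ R.T + t
    diffs = scene_cloud[:, None, :] - placed[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return scene_cloud[nearest <= radius]

# Hypothetical clicks: three points on the mesh and the same three in the scene
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 0.8])
model_clicks = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.2, 0.05]])
scene_clicks = model_clicks @ R_true.T + t_true

R_est, t_est = fit_rigid_transform(model_clicks, scene_clicks)

# Toy scene: three points 2 mm off the object plus three points far away
scene_cloud = np.vstack([scene_clicks + 0.002, scene_clicks + 0.5])
cropped = crop_near_model(scene_cloud, model_clicks, R_est, t_est)
```

In a real annotation session the clicks are only roughly corresponding, so the fitted pose is approximate; the cropped cloud is what the subsequent ICP step would refine against.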

The entire human annotation process takes approximately 30 seconds per object. This is much faster than aligning the full object meshes by hand without using the 3-click technique which can take several minutes per object and results in less accurate object poses. We also compared our method with human labeling (polygon-drawing) each image, and found intersection over union (IoU) above 80%, with approximately four orders of magnitude less human effort per image (supplementary figures on our website).

III-E Rendering of Labeled Images and Object Poses

After the human annotation step of Section III-D, the rest of the pipeline is automated. Given the previous steps it is easy to generate per-pixel object labels by projecting the 3D object poses back into the 2D RGB images. Since our reconstruction method, ElasticFusion, provides camera poses relative to the local reconstruction frame, and we have already aligned our object models to the reconstructed pointcloud, we also have object poses in each camera frame, for each image frame in the log. Given object poses in camera frame it is easy to get the pixelwise labels by projecting the object meshes into the rendered images. An RGB image with projected object meshes is displayed in Figure 1(f).
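The pose bookkeeping in this step amounts to one matrix product per frame followed by a pinhole projection. The sketch below assumes 4x4 homogeneous transforms, a pinhole camera model, and made-up intrinsics; it marks only the pixels hit by projected mesh vertices, whereas a full renderer would rasterize mesh faces with depth ordering. All names are hypothetical.

```python
import numpy as np

def object_pose_in_camera(T_world_cam, T_world_obj):
    """Object pose in the camera frame from two poses in the reconstruction frame."""
    return np.linalg.inv(T_world_cam) @ T_world_obj

def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points (N, 3) to pixel coordinates (N, 2)."""
    points_cam = np.asarray(points_cam, dtype=float)
    u = fx * points_cam[:, 0] / points_cam[:, 2] + cx
    v = fy * points_cam[:, 1] / points_cam[:, 2] + cy
    return np.stack([u, v], axis=1)

def splat_label_mask(vertices_obj, T_cam_obj, intrinsics, image_shape, label):
    """Write `label` at every pixel hit by a projected vertex (no face rasterization)."""
    vertices_obj = np.asarray(vertices_obj, dtype=float)
    homogeneous = np.hstack([vertices_obj, np.ones((len(vertices_obj), 1))])
    points_cam = (T_cam_obj @ homogeneous.T).T[:, :3]
    points_cam = points_cam[points_cam[:, 2] > 0]  # drop points behind the camera
    pixels = np.floor(project_points(points_cam, *intrinsics)).astype(int)
    mask = np.zeros(image_shape, dtype=np.uint8)
    height, width = image_shape
    for u, v in pixels:
        if 0 <= u < width and 0 <= v < height:
            mask[v, u] = label
    return mask

# Hypothetical setup: camera 1 m behind the reconstruction origin, looking along +z
T_world_cam = np.eye(4)
T_world_cam[2, 3] = -1.0
T_world_obj = np.eye(4)                  # object pose from the annotation step
intrinsics = (525.0, 525.0, 320.0, 240.0)  # assumed fx, fy, cx, cy

T_cam_obj = object_pose_in_camera(T_world_cam, T_world_obj)
vertices = np.array([[0.0, 0.0, 0.0], [0.25, 0.0, 0.0]])
mask = splat_label_mask(vertices, T_cam_obj, intrinsics, (480, 640), label=3)
```

Because the object pose is fixed in the reconstruction frame, only `T_world_cam` changes from frame to frame, which is why a single annotation propagates to every image in the log.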

III-F Discussion

As compared to existing methods such as [8, 27, 1] our method requires no sophisticated calibration, works for arbitrary rigid objects in general environments, and requires only 30 seconds of human annotation time per object per scene. Since the human annotation is done on the full 3D reconstruction, one labeling effort automatically labels thousands of RGBD images of the same scene from different viewpoints.

IV Results

We first analyze the effectiveness of the LabelFusion data generation pipeline (Section IV-A). We then use data generated from our pipeline to perform practical empirical experiments to quantify the generalization value of different aspects of training data (Section IV-B).

Fig. 2: Examples of labeled data generated by our pipeline: (a) heavily cluttered multi-object, (b) low light conditions, (c) motion blur, (d) distance from object, (e) 25 different environments. All of these scenes were collected by hand-carrying the RGBD sensor.

IV-A Evaluation of Data Generation Pipeline

TABLE I: Dataset Description

# objects:                           12
# distinct scenes:                   105 single/double object; 33 with 6+ objects
# unique object instances aligned:   339
avg duration of single scene:        120 seconds, 3600 RGBD frames
# labeled RGBD frames:               352,000
# labeled object instances:          1,000,000+

LabelFusion has the capability to rapidly produce large amounts of labeled data, with minimal human annotation time. In total we generated over 352,000 labeled RGBD images, of which over 200,000 were generated in approximately one day by two people. Because many of our images are multi-object, this amounts to over 1,000,000 labeled object instances. Detailed statistics are provided in Table I. The pipeline is open-source and intended for use. We were able to create training data in a wide variety of scenarios; examples are provided in Figure 2. In particular, we highlight the wide diversity of environments enabled by hand-carried data collection, the wide variety of lighting conditions, and the heavy clutter both of backgrounds and of multi-labeled object scenes.

Fig. 3: Time required for each step of pipeline.
Fig. 4: Example segmentation performance (alpha-blended with RGB image) of network on a multi-object test scene.

For scaling to large scale data collection, the time required to generate data is critical. Our pipeline is highly automated and most components run at approximately real-time, as shown in Figure 3. The amount of human time required is approximately 30 seconds per object per scene, which for a typical single-object scene is less than real-time. Post-processing runtime is several times greater than real-time, but is easily parallelizable – in practice, a small cluster of 2-4 modern desktop machines (quad-core Intel i7 and Nvidia GTX 900 series or higher) can be made to post-process the data from a single sensor at real-time rates. With a reasonable amount of resources (one to two people and a handful of computers), it would be possible to keep up with the real-time rate of the sensor (generating labeled data at 30 Hz).
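The real-time claim follows from simple arithmetic on the figures quoted above (120-second logs, 30 Hz capture, about 30 seconds of annotation per object per scene). The short sketch below works it through; the helper names are hypothetical, and the six-object case is only an illustration of where annotation starts to exceed capture time.

```python
SENSOR_HZ = 30                      # native frame rate of the sensor
SCENE_SECONDS = 120                 # typical log duration
ANNOTATION_SECONDS_PER_OBJECT = 30  # human time per object per scene

def frames_per_scene(duration_s=SCENE_SECONDS, hz=SENSOR_HZ):
    """Number of RGBD frames labeled by one scene-level annotation."""
    return duration_s * hz

def annotation_seconds(num_objects):
    """Total human annotation time for one scene."""
    return num_objects * ANNOTATION_SECONDS_PER_OBJECT

def annotation_faster_than_capture(num_objects, duration_s=SCENE_SECONDS):
    """True when labeling a scene takes less time than recording it."""
    return annotation_seconds(num_objects) < duration_s

def human_seconds_per_frame(num_objects):
    """Human effort amortized over every labeled frame of the scene."""
    return annotation_seconds(num_objects) / frames_per_scene()

single_object_frames = frames_per_scene()
single_object_ok = annotation_faster_than_capture(1)
six_object_ok = annotation_faster_than_capture(6)
per_frame_cost = human_seconds_per_frame(1)
```

For a single-object scene this gives 3600 labeled frames for 30 seconds of human time, i.e. under a hundredth of a second of effort per frame.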

IV-B Empirical Evaluations: How Much Data Is Needed For Practical Object-Specific Segmentation?

With the capability to rapidly generate a vast sum of labeled real RGBD data, questions of “how much data is needed?” and “which types of data are most valuable?” are accessible. We explore practical generalization performance while varying three axes of the training data: (i) whether the training set includes multi-object scenes with occlusions or only single-object scenes, (ii) the number of background environments, and (iii) the number of views used per scene. For each, we train a state-of-the-art ResNet segmentation network [28] with different subsets of training data, and evaluate each network’s generalization performance on common test sets. Further experimental details are provided in our supplementary material; due to space constraints we can only summarize results here.

Fig. 5: Comparisons of training on single-object vs. multi-object scenes and testing on single-object (left) and multi-object (right) scenes.

First, we investigate whether there is a benefit of using training data with heavily occluded and cluttered multi-object scenes, compared to training with only single-object scenes. Although they encounter difficulties with heavy occlusions in multi-object scenes, [2] uses purely single-object scenes for training. We trained five different networks to enable comparison of segmentation performance on novel scenes (different placements of the objects) for a single background environment. Results of segmentation performance on novel scenes (measured using the mean IoU, intersection over union, per object) show an advantage given multi-object occluded scenes compared to single-object scenes (Figure 5, right). In particular, the average IoU per object increases 190% when training on a multi-object set instead of a single-object set (Figure 5, right), even though the multi-object set has strictly fewer labeled pixels than the single-object set, due to occlusions. This implies that multi-object training data is more valuable per pixel than single-object training data. When the multi-object training set contains the same number of scenes as the single-object training set, the increase in IoU performance averaged across objects is 369%. Once the network has been trained on 18 multi-object scenes, an additional 18 single-object training scenes have no noticeable effect on multi-object generalization. For generalization performance on single-object scenes (Figure 5, left), this effect is not observed; single-object training scenes are sufficient for IoU performance above 60%.
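For reference, the per-object mean IoU used in these comparisons can be computed from a predicted label image and a ground-truth label image as sketched below. This uses the standard definition of intersection over union and assumes integer label images where 0 is background; the exact evaluation protocol is in the supplementary material, so treat the details here (function names, skipping objects absent from both images) as assumptions.

```python
import numpy as np

def per_object_iou(pred, truth, object_ids):
    """IoU for each object id, skipping ids absent from both images."""
    ious = {}
    for oid in object_ids:
        pred_mask = pred == oid
        truth_mask = truth == oid
        union = np.logical_or(pred_mask, truth_mask).sum()
        if union == 0:
            continue  # object neither predicted nor present
        intersection = np.logical_and(pred_mask, truth_mask).sum()
        ious[oid] = intersection / union
    return ious

def mean_iou(pred, truth, object_ids):
    """Average of the per-object IoU values."""
    values = list(per_object_iou(pred, truth, object_ids).values())
    return float(np.mean(values)) if values else float("nan")

# Toy 4x4 label images with two objects
truth = np.zeros((4, 4), dtype=int)
truth[0:2, 0:2] = 1
truth[2:4, 2:4] = 2

pred = truth.copy()
pred[0:2, 2] = 1  # object 1 is over-segmented by two pixels

ious = per_object_iou(pred, truth, object_ids=[1, 2])
score = mean_iou(pred, truth, object_ids=[1, 2])
```

Here object 1 scores 4/6 and object 2 scores 1.0, so the mean is 5/6. Averaging per object, rather than per pixel, keeps small objects from being swamped by large ones.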

Fig. 6: Comparison of segmentation performance on novel multi-object test scenes. Networks are either trained on (a) single object scenes only, (b, d) multi-object test scenes only, or (c, e) a mixture.

Second, we ask: how does the performance curve grow as more and more training data is added from different background environments? To test this, we train different networks respectively on 1, 2, 5, 10, 25, and 50 scenes each labeled with a single drill object. The smaller datasets are subsets of the larger datasets; this directly allows us to measure the value of providing more data. The test set is comprised of 11 background environments which none of the networks have seen. We observe a steady increase in segmentation performance that is approximately logarithmic with the number of training scene backgrounds used (Figure 7, left). We also took our multi-object networks trained on a single background and tested them on the 11 novel environments with the drill. We observe an advantage of the multi-object training data with occlusions over the single-object training data in generalizing to novel background environments (Figure 7, right).
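The nested construction of the training sets (each smaller set contained in every larger one) is what makes the curve a clean measurement of added data. A minimal sketch of one way to build such subsets is shown below; the scene identifiers, the seeded shuffle, and the function name are hypothetical and not taken from the released dataset.

```python
import random

def nested_training_subsets(scene_ids, sizes, seed=0):
    """Return {size: scenes} where each smaller subset is a prefix of the larger ones."""
    ordered = list(scene_ids)
    random.Random(seed).shuffle(ordered)  # fixed order so subsets are reproducible
    return {n: ordered[:n] for n in sorted(sizes)}

scene_ids = [f"scene_{i:02d}" for i in range(50)]
sizes = [1, 2, 5, 10, 25, 50]
subsets = nested_training_subsets(scene_ids, sizes)
```

Taking prefixes of a single shuffled list guarantees the subset property by construction, so any change in test performance between two networks can be attributed to the scenes that were added.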

Fig. 7: (left) Generalization performance as a function of the number of environments provided at training time, for a set of six networks trained on 50 different scenes or some subset ({1, 2, 5, 10, 25}) of those scenes. (right) Performance on the same test set of unknown scenes, but measured for the 5 training configurations for the multi-object, single-environment-only setup described previously.
Fig. 8: Comparison of segmentation performance on novel background environments. Networks were trained on {1, 2, 5, 10, 25, 50} background environments.

Third, we investigate whether 30 Hz data is necessary, or whether significantly less data suffices (Figure 9). We perform experiments with downsampling the effective sensor rate for both the robot-arm-mounted multi-object single-background training set and the hand-carried many-environments dataset with either 10 or 50 scenes. For each, we train four different networks, where one has all data available and the others have downsampled data at respectively 0.03, 0.3, and 3 Hz. We observe a monotonic increase in segmentation performance as the effective sensor rate is increased, but with heavily diminished returns after 0.3 Hz for the slower robot-arm-mounted data (0.03 m/s camera motion velocity). The hand-carried data (0.05 - 0.17 m/s) shows more gains with higher rates.
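Downsampling the effective sensor rate can be done by keeping every k-th frame of a 30 Hz log. The sketch below assumes uniform striding (the text does not say how frames were selected) and also computes how far the camera travels between kept frames at a given speed, which is one way to read the velocity figures quoted above. Function names are hypothetical.

```python
def downsample_indices(num_frames, native_hz, target_hz):
    """Indices of the frames kept when reducing native_hz to roughly target_hz."""
    if target_hz >= native_hz:
        return list(range(num_frames))
    stride = round(native_hz / target_hz)  # keep one frame out of every `stride`
    return list(range(0, num_frames, stride))

def spacing_between_views_m(camera_speed_m_s, target_hz):
    """Distance the camera travels between consecutive kept frames."""
    return camera_speed_m_s / target_hz

NUM_FRAMES = 3600  # one typical 120 s log at 30 Hz
kept = {hz: len(downsample_indices(NUM_FRAMES, 30, hz)) for hz in (0.03, 0.3, 3, 30)}

# Robot-arm data: 0.03 m/s camera motion, sampled at 0.3 Hz
arm_spacing = spacing_between_views_m(0.03, 0.3)
```

For a typical log this keeps 4, 36, 360, and 3600 frames respectively. At 0.3 Hz the arm-mounted camera moves about 10 cm between kept views, so frames added beyond that rate are closely spaced and largely redundant, while the faster hand-carried camera covers more ground per frame.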

Fig. 9: Pixelwise segmentation performance as a function of the number of views per scene, reduced by downsampling the native 30 Hz sensor to {0.03, 0.3, 3.0} Hz.

V Conclusion

This paper introduces LabelFusion, our pipeline for efficiently generating RGBD data annotated with per-pixel labels and ground truth object poses. Specifically only a few minutes of human time are required for labeling a scene containing thousands of RGBD images. LabelFusion is open source and available for community use, and we also supply an example dataset generated by our pipeline [29].

The capability to produce a large, labeled dataset enabled us to answer several questions related to the type and quantity of training data needed for practical deep learning segmentation networks in a robotic manipulation context. Specifically we found that networks trained on multi-object scenes performed significantly better than those trained on single object scenes, both on novel multi-object scenes with the same background, and on single-object scenes with new backgrounds. Increasing the variety of backgrounds in the training data for single-object scenes also improved generalization performance for new backgrounds, with approximately 50 different backgrounds breaking into above-50% IoU on entirely novel scenes. Our recommendation is to focus on multi-object data collection in a variety of backgrounds for the most gains in generalization performance.

We hope that our pipeline lowers the barrier to entry for using deep learning approaches for perception in support of robotic manipulation tasks by reducing the amount of human time needed to generate vast quantities of labeled data for your specific environment and set of objects. It is also our hope that our analysis of segmentation network performance provides guidance on the type and quantity of data that needs to be collected to achieve desired levels of generalization performance.


The authors thank Matthew O’Kelly for his guidance with segmentation networks and manuscript feedback. We also thank Allison Fastman and Sammy Creasey of Toyota Research Institute for their help with hardware, including object scanning and robot arm automation. David Johnson of Draper Laboratory and Shuran Song of Princeton University provided valuable input on training. We are grateful we were able to use the robot arm testing facility from Toyota Research Institute. This work was supported by the Air Force/Lincoln Laboratory award no. 7000374874, by the Defense Advanced Research Projects Agency via Air Force Research Laboratory award FA8750-12-1-0321, and by NSF Contract IIS-1427050. The views expressed are not endorsed by the sponsors.