BusyHands: A Hand-Tool Interaction Database for Assembly Tasks Semantic Segmentation

02/19/2019 · Roy Shilkrot et al., Stony Brook University

Visual segmentation has seen tremendous advancement recently, with ready solutions for a wide variety of scene types, including human hands and other body parts. However, the segmentation of human hands performing complex tasks, such as manual assembly, has received far less attention. Segmenting hands from tools, workpieces, background, and other body parts is extremely difficult because of self-occlusions and intricate hand grips and poses. In this paper we introduce BusyHands, a large open dataset of pixel-level annotated images of hands performing 13 different tool-based assembly tasks, from both real-world captures and virtual-world renderings. A total of 7906 samples are included in our first-of-its-kind dataset, with both RGB and depth images as obtained from a Kinect V2 camera and Blender. We evaluate several state-of-the-art semantic segmentation methods on our dataset as a proposed performance benchmark.


1 Introduction

“Idle hands are the devil’s playthings” — Benjamin Franklin

Computer vision is now used throughout the manufacturing and fabrication fields. Manufacturers use high-end machine vision for part inspection and verification, as well as to track workers and workpieces to gain crucial insight into the efficiency of their assembly lines. Small-scale fabrication, on the other hand, happens virtually anywhere: at home, at school, or in personal fabrication shops. Still, all kinds of fabrication, mass- or small-scale, share a commonality: manual assembly tasks performed by humans. This stands in stark contrast to the meager offering of computer vision methods for understanding manual assembly scenes. To this end we offer a first-of-its-kind dataset of fully annotated images of assembly tasks with manual tools, named BusyHands. The first offering, described in this paper, includes both real-world and virtual-world samples for semantic segmentation tasks. Later iterations of BusyHands will include articulated arm and hand poses (skeletons) as well as multi-part tool 6DOF poses. We believe an open dataset such as BusyHands can drive research toward a deeper understanding of imagery of manual assembly tasks, which will in turn help increase efficiency and error tolerance in industrial pipelines or at home.

Semantic segmentation – finding contiguous areas in the image that share a semantic context – is one of the most fundamental tasks in scene understanding. Given a segmentation of the image, further break-down of parts into smaller parts, or analysis of the interactions between parts, can proceed. There are numerous popular large-scale standard datasets to assist in segmentation algorithm development, e.g. ImageNet [1], COCO [2], SUN [3], PASCAL [4], and ADE20K [5]. Further, hand image analysis datasets [6, 7, 8, 9, 10] were proposed for segmentation, with a focus on hands but not on hand interactions. Bambach et al. [11] created a dataset for complex interactions, but it does not involve handheld tools. Therefore, we find most existing open collections unsuitable for studying interactions between hands and handheld tools, which are essential for understanding assembly.

Name # frames Depth Method
EgoHands [11] 4,800 No Manual
Handseg [7] 210,000 Yes Automatic
NYUHands [12] 6,736 Yes Automatic
[8] 43,986 Yes Synthetic
HandNet [13] 212,928 Yes Automatic
GTEA [14] 663 No Manual
[10] 1,590 No Manual
Ours 7,905 Yes Man. & Syn.
Table 1: Comparison of hand segmentation datasets. ‘Depth’ indicates whether an aligned depth image is provided for each RGB image.
Tool COCO [2] SUN [3] ADE20K [5] BusyHands (Ours)
Screwdriver 0 1 2 1616
Wrench 0 64 2 2051
Pliers 0 1 1 1586
Pencil 0 7 8 2320
Scissors 975 3 16 864
Cutter 4507 63 161 2021
Hammer 0 4 5 1066
Ratchet 0 0 0 967
Tape 0 0 1 796
Saw 0 1 1 1183
Eraser 0 0 0 846
Glue 0 0 0 650
Ruler 0 2 4 2428
Table 2: Comparison of the number of pixel-level annotated object instances among prominent segmentation datasets and our own. “Knife” is counted as “Cutter” for the other datasets.
Figure 1: Qualitative comparison of annotation quality in our dataset vs. ADE20K [5] and MS COCO [2]. Our annotation is more precise in terms of polygon quality, and our dataset also contains depth information. Additionally, the other datasets have far fewer instances in most object categories (see Table 2).

Naive methods for segmenting human hands from backgrounds, such as recognizing skin-colored pixels in RGB, are being replaced with supervised machine learning algorithms with far higher perception capabilities, such as deep convolutional networks or deep randomized decision forests. The advent of cheap imaging technology, such as the Kinect [15] depth camera, enriched the fundamental features used in perception tasks, helping them approach (and in some cases surpass) human-level performance. However, adding more feature dimensions to these highly parametric models requires orders of magnitude more training data to achieve generalizable results. Consequently, this led to the construction of the aforementioned large annotated datasets and others, which are now in high demand.

Figure 2: Frame-by-frame outputs captured from the Kinect V2; both color and depth frames are depicted.

Manually annotating distinct semantic parts in images is tedious and error-prone, and therefore may be prohibitively expensive. To cope with this problem, [16, 17] adopted synthetic data, which can be generated with professional 3D modeling software. Ground-truth annotation for semantic segmentation is easily achieved in 3D software, since the objects are precisely defined (by a triangulated mesh) and photorealistic rendering is readily at hand. A 3D model can also be parameterized to augment the data with a multitude of novel situations and camera angles. On the other hand, synthetic scenes need careful human staging to achieve realism that generalizes to successful real-world data analysis. All told, synthetic datasets are now an advancing reality for many vision tasks, especially in the autonomous driving domain [16, 17]. Therefore, we created BusyHands with both real-world captures and synthetic renderings produced in Blender. We provide a comparative evaluation between the real-world and synthetic parts in this paper.

To the best of our knowledge, ours is the first real- or virtual-world segmentation dataset that focuses on small-scale assembly work. A small sample of our annotated dataset is presented in the teaser figure on the first page. We will release all parts of our dataset for open download, as well as all pre-trained segmentation models (see §4.2). A small excerpt from the dataset is included in the supplementary material.

Real data
  Advantages: Simple to collect with commodity cameras. Data is as close as possible to the target input, thus more attractive to external practitioners. Image capture is immediate. High data randomness assists in generalization.
  Disadvantages: Annotating is expensive in terms of time and resources. Objects might not be labeled correctly due to occlusion or ambiguity. Segmentation may be subjective, because of a single annotator or disagreement between annotators. RGB-depth registration has artifacts.

Synthetic data
  Advantages: All images are annotated accurately and instantly in an automatic manner. The dataset can easily be grown by adding more texture, pose, or camera variables. RGB and depth streams are perfectly aligned, coming from the virtual camera’s z-buffer.
  Disadvantages: Creating the 3D models and staging the scenes is difficult up front. Realistic animation is hard to achieve without expertise and resources. The synthetic images are not as realistic as real images and lack sensor noise. Rendering images at high resolution with multiple passes (RGB, depth map) is time-consuming.

Table 3: Advantages and disadvantages of real vs. synthetic data.

The rest of the paper is organized as follows. In Section 2, we discuss semantic segmentation and existing datasets in the literature. Section 3 provides details on how we created the BusyHands dataset. Section 4 covers the existing semantic segmentation methods we used for evaluation on our dataset. Section 5 offers conclusions about this work and future directions.

Figure 3: The collection of tools used in BusyHands. Left: Synthetic tool models, Right: Real tools.

2 Related Work

Semantic segmentation has long been a central pursuit of the computer vision research agenda, driven by compelling applications in autonomous navigation, security, image-based search, and manufacturing, to name a few. In recent years, semantic segmentation research has seen a tremendous boost in offerings of deep convolutional network architectures, with Long et al.’s Fully Convolutional Networks (FCN) work [18] roughly marking the start of the new era of semantic segmentation. The key insight behind that early work, which still resonates in most of today’s state-of-the-art contributions, is to use a visual feature-extracting network (such as VGG [19], ResNet [20], or a standalone one) and layer on top of it a decoding and unpooling mechanism that predicts a class for each pixel at the original resolution. In this pattern, one can utilize a rich pre-trained subnetwork with a powerful visual representation, proven, for example, on large-scale image classification problems. Recent works, such as the flavors of DeepLab [21, 22, 23], PSPNet [24], and DenseASPP [25], utilize a specialized unpooling device such as Atrous Spatial Pyramid Pooling (ASPP).

2.1 Related Segmentation Datasets

The burst of creativity in semantic segmentation algorithms could not have occurred if not for the equally sharp rise in very large pixel-annotated segmentation datasets. With an abundance of data, such as PASCAL VOC [4], MS COCO [2], Cityscapes [26], or ADE20K [5], researchers could build deeper and more influential work, which makes a strong case for building and sharing datasets openly. Our dataset, on the other hand, offers far more comprehensive coverage of work tools than any of the aforementioned datasets. In Table 2 we compare the number of pixel-level annotated instances of the objects in our dataset.

Insofar as hands are a key element of many useful applications of computer vision, such as egocentric augmented reality or manufacturing, many datasets for segmenting hands in images have been contributed. We list a few recent instances in Table 1. However, all of the above-mentioned datasets only provide annotation for the hand (up to the wrist), whereas our annotation also covers the arm, on top of annotating the tools in use, while taking great care to mark where the tools occlude the hands.

# Tool Assembly Task Mask RGB
1. screwdriver Tighten or loosen screws (1, 1, 0)
2. wrench Tighten or loosen nuts (0, 1, 1)
3. pliers Cut wires (1, 0, 1)
4. pencil Sketch on paper (0.75, 0.75, 0.75)
5. eraser Erase a sketch on paper (0, 0, 0.5)
6. scissors Cut paper (0.5, 0.5, 0.5)
7. cutter Cut paper (0.5, 0, 0)
8. hammer Drive nail into wood (0.5, 0.5, 0)
9. ratchet Tighten or loosen nuts (0, 0.5, 0)
10. tape measure Measure objects (0.5, 0, 0.5)
11. saw Saw a wooden board (0, 0.5, 0.5)
12. glue Glue papers (0.8, 0.52, 0.25)
13. ruler Draw line with pencil (0.27, 0.5, 0.7)
hand (1, 0, 0)
arm (0, 1, 0)
Table 4: Selected tools, their tasks, and their mask RGB values (as seen in the teaser figure and Figs. 4, 5, and 7).

3 Constructing the BusyHands Dataset

We chose to deliver two types of image data in BusyHands, real-world and synthetic, so that together they provide a generalized and practical database for semantic segmentation of small-scale assembly work. Real and synthetic data complement each other in a number of ways, which we detail in Table 3.

The structure of the dataset follows PASCAL [4], with color images and segmentation class labels (see Fig. 4). The pixel value of the segments in the label image ranges from 0 (background) to the number of classes (the 13 tools plus hand and arm listed in Table 4). In addition, we include depth images in our dataset to provide extra information. The work of [27, 12] showed that depth images can be extremely useful for understanding human body parts. RGB information alone is also very hard to generalize from: in real-world situations there is immense color variability, for example in shirt, tool, background, or skin colors, let alone variation in lighting. Depth images circumvent these problems, while the added cost of obtaining them is low.
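To make the intended layout concrete, the following minimal Python sketch loads one sample as a (color, depth, label) triplet. The directory names, file extensions, and frame identifier here are illustrative assumptions, not the dataset's actual file layout.

```python
# Minimal sketch of reading one BusyHands-style sample.
# Folder names, extensions, and frame IDs are hypothetical, for illustration only.
import numpy as np
from PIL import Image

def load_sample(root, frame_id):
    """Return (rgb, depth, label) arrays for one frame."""
    rgb = np.array(Image.open(f"{root}/JPEGImages/{frame_id}.jpg"))           # H x W x 3, uint8
    depth = np.array(Image.open(f"{root}/DepthImages/{frame_id}.png"))        # H x W, depth values
    label = np.array(Image.open(f"{root}/SegmentationClass/{frame_id}.png"))  # H x W, class indices
    return rgb, depth, label

if __name__ == "__main__":
    rgb, depth, label = load_sample("BusyHands", "screwdriver_0001")
    print(rgb.shape, depth.shape, label.shape)
    print("classes present:", np.unique(label))
```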

Figure 4: Real vs. synthetic segmentation annotation comparison.

3.1 Tools and Tasks Selection

We aim to create a dataset covering most small-scale assembly work. However, assembly is a widely diverse activity with many goals and a large class of tools. We chose to focus on common tools that exist in most households and manual assembly pipelines. We used a pre-selected collection of handheld tools (a kit from an established brand) from a home improvement store. Out of the available tools in the kit, we chose the 13 common handheld tools listed in Table 4. Pictures of the collection of tools used in our recordings can be seen in Fig. 3.

The manual task to perform with each tool is derived from the standard function of the tool itself. We staged a small workstation with wooden and paper craft pieces to be used as workpieces, and instructed the “workers” to perform simple assembly tasks (see Table 4).

3.2 Real-world Data in BusyHands

Data was captured using a standard Kinect V2 camera, recording at 1920×1080 resolution for RGB and 512×424 for depth, at 7 FPS. The depth and RGB streams are pixel-aligned using the provided SDK and the camera’s intrinsic and extrinsic parameters. The frame-by-frame outputs are demonstrated in Fig. 2. The camera is mounted above the desk to approximate a first-person perspective. This was done to allow our data to be used both for segmenting images from head-mounted gear and from top-view cameras over a workbench, which are becoming ever more ubiquitous in the manufacturing world. During recording, the real-time video output was displayed so that the workers could adjust their postures to avoid excessive occlusion. Given the instructions shown in Table 4, three volunteers were recruited (one female, two male; one Caucasian, two Asian). Workers were allowed to use multiple tools in one task to help complete the work. For each task, the camera started capturing images after the worker began and stopped automatically after recording 150 frames. A total of 39 clips were captured, of which 26 were fully annotated with segmentation information.
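The alignment itself is performed by the Kinect SDK; purely as an illustration of the underlying math, the sketch below registers a depth frame into the color camera using pinhole intrinsics and the depth-to-color extrinsics. The matrices K_d, K_c, R, and t are placeholders for calibration values, not the parameters used in our capture.

```python
# Hedged sketch of depth-to-color registration with pinhole intrinsics.
# K_d, K_c, R, t are placeholders; real values come from the camera calibration.
import numpy as np

def register_depth_to_color(depth_mm, K_d, K_c, R, t, color_shape):
    """Map a depth image (millimeters) into the color camera's pixel grid."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0                # meters
    valid = z > 0

    # Back-project depth pixels to 3D points in the depth camera frame.
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x[valid], y[valid], z[valid]], axis=0)  # 3 x N

    # Transform into the color camera frame and re-project.
    pts_c = R @ pts + t.reshape(3, 1)
    uc = (K_c[0, 0] * pts_c[0] / pts_c[2] + K_c[0, 2]).round().astype(int)
    vc = (K_c[1, 1] * pts_c[1] / pts_c[2] + K_c[1, 2]).round().astype(int)

    # Scatter depth values (meters) into the color image grid (nearest-pixel splat).
    out = np.zeros(color_shape[:2], dtype=np.float32)
    inside = (uc >= 0) & (uc < color_shape[1]) & (vc >= 0) & (vc < color_shape[0])
    out[vc[inside], uc[inside]] = pts_c[2][inside]
    return out
```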

Annotating the semantic parts in images is a tedious task. We employed Python LabelMe (https://github.com/wkentaro/labelme), an open-source image annotation tool based on the original LabelMe project from MIT [28], to annotate the different semantic parts and assign appropriate labels to them. The results can be seen in the teaser figure. We also show preprocessed data samples in Fig. 2. Each sample contains a color image, a depth image, and a ground-truth label image.

Figure 5: The 3D rendering environment for the synthetic part of the BusyHands dataset. For augmentation, we provide 5 camera viewpoints: (a) Center, (b) Up-shift, (c) Down-shift, (d) Left-shift, and (e) Right-shift.
Figure 6: Top left: Number of class instances in the BusyHands dataset. Top right: Average number of pixels per instance of each class (e.g. hand instances cover roughly 13,300 pixels on average); note the logarithmic scale. Bottom: Heatmap illustration of the pixel positions of a few classes in the Real part of the dataset.

3.3 Synthetic Data in BusyHands

As mentioned before, to enrich the selection of available data in our dataset and obtain a large number of samples, we adopted synthetic data. To generate realistic data on par with the real data, we purchased high-quality 3D models of tools (see Fig. 3) as well as a highly realistic pair of hands, and loaded them into the Blender software (https://www.blender.org/). All the manual tasks (or instructions) were simulated by creating realistic key-frame animations that mimic observed human motion.

To increase the generality of the dataset, so it can be applied in various physical environments, we use five camera perspectives in the synthetic dataset. As demonstrated in Figure 5, the cones in the first two rows represent the five camera positions (first-person center view, shifted up, down, left, and right), rendered from the front view (first row) and side view (second row). The corresponding color image, depth image, and ground truth are given in the bottom three rows.
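A viewpoint augmentation of this kind can be scripted directly in Blender; the sketch below is a simplified illustration in which the camera object name, shift magnitudes, and output paths are assumptions rather than our exact production setup.

```python
# Hedged Blender (bpy) sketch: render five shifted camera viewpoints.
# The "Camera" object name, offsets, and output paths are illustrative assumptions.
import bpy
from mathutils import Vector

cam = bpy.data.objects["Camera"]
base = cam.location.copy()

# Center view plus up/down/left/right shifts (in Blender units).
offsets = {
    "center": Vector((0.0, 0.0, 0.0)),
    "up":     Vector((0.0, 0.0, 0.2)),
    "down":   Vector((0.0, 0.0, -0.2)),
    "left":   Vector((-0.2, 0.0, 0.0)),
    "right":  Vector((0.2, 0.0, 0.0)),
}

for name, off in offsets.items():
    cam.location = base + off
    bpy.context.scene.render.filepath = f"//renders/{name}_"
    bpy.ops.render.render(animation=True)   # render all key-framed frames for this viewpoint

cam.location = base  # restore the original camera position
```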

Unlike real-world captures, annotating semantic parts in a virtual environment is very straightforward. In Blender, we unwrapped the meshes of the tools, hands, and arms to 2D UV maps, then painted the UV maps with solid colors. Each color maps one-to-one to a class label in our dataset according to the RGB-code dictionary (see Table 4); we later use these colors to retrieve the corresponding label numbers. Given the mapped textures, Blender automatically outputs rendered RGB images and semantic labels for all the designed animation frames. A depth map for each frame is easily obtained from Blender by outputting the virtual camera’s z-buffer, and it is pixel-aligned to the other streams.
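The sketch below illustrates this color-to-label conversion; the colors follow Table 4, while the background color, class ordering, and file handling are assumptions made for the example.

```python
# Hedged sketch: convert a color-coded label render into an integer label map.
# Colors follow Table 4 (values in [0, 1]); background color and class ordering are assumptions.
import numpy as np
from PIL import Image

CLASS_COLORS = {
    0:  (0.0, 0.0, 0.0),      # background (assumed)
    1:  (1.0, 1.0, 0.0),      # screwdriver
    2:  (0.0, 1.0, 1.0),      # wrench
    3:  (1.0, 0.0, 1.0),      # pliers
    4:  (0.75, 0.75, 0.75),   # pencil
    5:  (0.0, 0.0, 0.5),      # eraser
    6:  (0.5, 0.5, 0.5),      # scissors
    7:  (0.5, 0.0, 0.0),      # cutter
    8:  (0.5, 0.5, 0.0),      # hammer
    9:  (0.0, 0.5, 0.0),      # ratchet
    10: (0.5, 0.0, 0.5),      # tape measure
    11: (0.0, 0.5, 0.5),      # saw
    12: (0.8, 0.52, 0.25),    # glue
    13: (0.27, 0.5, 0.7),     # ruler
    14: (1.0, 0.0, 0.0),      # hand
    15: (0.0, 1.0, 0.0),      # arm
}

def colors_to_labels(render_path):
    """Map each pixel to the class whose Table 4 color is nearest (handles 8-bit rounding).
    Simple but memory-hungry for full-resolution frames."""
    rgb = np.asarray(Image.open(render_path).convert("RGB"), dtype=np.float32) / 255.0
    palette = np.array([CLASS_COLORS[i] for i in sorted(CLASS_COLORS)], dtype=np.float32)
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None, :, :], axis=-1)
    return dists.argmin(axis=-1).astype(np.uint8)
```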

3.4 Dataset Analysis and Comparison

The real-world part of the dataset has 3695 labeled images, while the synthetic part has 4170. Instance-wise, we have 9505 tool instances in the real part and 4170 tool instances in the synthetic part. The proportions of each tool instance for both the real and synthetic data are shown in Fig. 6.

4 Semantic Labeling Evaluation

The BusyHands task involves predicting a pixel-level semantic labeling of the image, without considering higher-level object instance or boundary information.

4.1 Metrics

We use a standard metric to evaluate labeling performance. The most widely adopted is the intersection-over-union metric, IoU = TP / (TP + FP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively [4]. We employ the customary averaging mechanism, over all classes and then over samples, to obtain the mean intersection over union (mIOU).
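A minimal NumPy sketch of this metric is shown below; how absent classes are skipped is our own choice for the example and may differ from specific benchmark protocols.

```python
# Hedged sketch: per-class IoU and mIoU from integer label maps.
import numpy as np

def mean_iou(pred, gt, num_classes):
    """IoU = TP / (TP + FP + FN), averaged over classes present in gt or pred."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom == 0:          # class absent from both: skip rather than count as 0 or 1
            continue
        ious.append(tp / denom)
    return float(np.mean(ious)) if ious else float("nan")
```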

Train Test AdapNet DeepLabV3 DeepLabV3+ SegNet SegNet-Sk FRRN-A FRRN-B M-UNet M-UNet-Sk
Rl. Rl. 0.174 0.113 0.139 0.257 0.336 0.316 0.283 0.234 0.22
Syn. Syn. 0.714 0.532 0.584 0.782 0.856 0.856 0.858 0.759 0.842
Syn.+Rl. Rl. 0.291 0.212 0.227 0.328 0.494 0.502 0.589 0.216 0.388
Syn.+Rl. Syn. 0.623 0.367 0.313 0.591 0.641 0.776 0.763 0.547 0.713
Table 5: Results of the baseline methods on the BusyHands dataset, in terms of mIOU. The first two columns mark the training and testing sets, e.g. ‘Syn.+Rl. Rl.’ means training on both synthetic and real images (training set) and testing only on real images (held-out test set). ‘Sk’ indicates the use of skip connections in the network. ‘M-UNet’ is the MobileUNet architecture [29].
Figure 7: Results of running FRRN-B [30] and SegNet-Skip [31] on a number of samples from the Real test dataset. The top row is the ground truth annotation.

4.2 Evaluated Segmentation Methods

We experimented with the following semantic segmentation algorithms, from the latest literature:

  • Encoder-Decoder SegNet [31]. This network uses a VGG-style encoder-decoder, where the upsampling in the decoder is done with transposed convolutions. In addition, we also used a version that employs additive skip connections from encoder to decoder (a toy sketch of such an encoder-decoder appears after this list).

  • Mobile UNet for Semantic Segmentation [29]. Combining MobileNets’ depthwise separable convolutions with UNet results in a low-parameter semantic segmentation model. For this architecture we also evaluate a flavor with skip connections.

  • Full-Resolution Residual Networks (FRRN) [30]. Combines multi-scale context with pixel-level accuracy by using two processing streams within the network. The residual stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The pooling stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals.

  • AdapNet [32]. Modifies the ResNet50 architecture by performing the lower resolution processing using a multi-scale strategy with atrous convolutions. We use a slightly modified version using bilinear upscaling instead of transposed convolutions.

  • DeepLabV3 [23] and DeepLabV3+ [33]. These use Atrous Spatial Pyramid Pooling with multiple atrous rates to capture multi-scale context, which creates a large receptive field. The DeepLabV3+ network adds a decoder module on top of the regular DeepLabV3 model.

All algorithms were implemented with the TensorFlow package [34], forking the Semantic Segmentation Suite project [35], to which we made several adjustments.
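For orientation, the toy Keras sketch below shows an encoder-decoder with additive skip connections in the spirit of the SegNet-Skip baseline; it is not one of the evaluated architectures, and the input size and class count are assumptions.

```python
# Hedged sketch: toy encoder-decoder with additive skip connections (Keras).
# Not the evaluated networks or settings; input size and class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def toy_segnet_skip(input_shape=(424, 512, 3), num_classes=16):
    inp = layers.Input(shape=input_shape)

    # Encoder: two conv/pool stages.
    e1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D(2)(e1)
    e2 = layers.Conv2D(128, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D(2)(e2)

    # Decoder: transposed convolutions with additive skips from the encoder.
    d2 = layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(p2)
    d2 = layers.Add()([d2, e2])
    d1 = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(d2)
    d1 = layers.Add()([d1, e1])

    out = layers.Conv2D(num_classes, 1, activation="softmax")(d1)  # per-pixel class scores
    return tf.keras.Model(inp, out)

model = toy_segnet_skip()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```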

4.3 Evaluation Results

The results of training and testing with the selected evaluation methods (listed in §4.2) are given in Table 5. We notice that the full-resolution residual networks (FRRNs) are superior in most categories, followed by SegNet with skip connections. In Figure 7 we show example results on the Real test set with FRRN-B and SegNet-Skip (additional results are available in the supplementary material). The results indicate that while the arms, hands, and tools are segmented quite well, there is a significant amount of noise from random objects on the table that are classified as tools. Post-processing cleanup of the segmentation result, in particular blob geometry analysis (which we did not attempt), could potentially reduce this noise.
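As an illustration of what such blob-geometry cleanup might look like (again, not applied to the numbers in Table 5), small connected components of tool classes can be suppressed with scipy.ndimage; the area threshold here is arbitrary.

```python
# Hedged sketch: suppress small spurious "tool" blobs in a predicted label map.
# Not used for the reported results; the area threshold is arbitrary.
import numpy as np
from scipy import ndimage

def remove_small_blobs(label_map, class_ids, min_pixels=500, background=0):
    """Set connected components of the given classes smaller than min_pixels to background."""
    out = label_map.copy()
    for c in class_ids:
        mask = label_map == c
        components, n = ndimage.label(mask)   # connected-component labeling (4-connected by default)
        for comp_id in range(1, n + 1):
            comp = components == comp_id
            if comp.sum() < min_pixels:
                out[comp] = background
    return out
```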

Another insight is that adding synthetic data dramatically increases the learners’ accuracy on Real data. In the case of FRRN-A, for example, mIOU over the Real test set rose from 0.316 when training with Real images only to 0.502 when also using synthetic data for training. In fact, only for MobileUNet did performance drop when including synthetic data; for all other methods it increased performance, by up to 80%.

5 Conclusions

We contribute BusyHands, a high-quality, fully annotated dataset for semantic segmentation with both real and synthetic image data. We also present an evaluation of numerous leading segmentation algorithms on our dataset as a baseline for other researchers. We release all of the data for general access by the computer vision community at http://hi.cs.stonybrook.edu/busyhands. This, we hope, will enable better image segmentation algorithms, which will further advance computer vision research on scenes of manual assembly operations.

Acknowledgments

We would like to thank Nvidia for their generous donation of the Titan Xp and Quadro P5000 GPUs used in this project. We thank the dataset annotators: Sirisha Mandali and Venkata Divya Kootagaram, as well as Fan Wang and Xiaoling Hu.

References