“Idle hands are the devil’s playthings” — Benjamin Franklin
Computer vision is now used in many manufacturing and fabrication fields. Manufacturers use high-end machine vision for part inspection and verification, as well as a means to track workers and work pieces to gain crucial insight into the efficiency of their assembly lines. Small-scale fabrication, on the other hand, happens virtually anywhere, even at home, at school, or in personal fabrication shops. Still, all kinds of fabrication, mass- or small-scale, share a commonality: manual assembly tasks performed by humans. This stands in stark contrast to the scarcity of computer vision methods for understanding manual assembly scenes. To this end we offer a first-of-its-kind dataset of fully annotated images of assembly tasks with manual tools, named BusyHands. The first offering, described in this paper, includes both real-world and virtual-world samples for semantic segmentation tasks. Later iterations of BusyHands will include articulated arm and hand poses (skeleton) as well as multi-part tool 6DOF pose. We believe an open dataset such as BusyHands can drive research into a deeper understanding of manual assembly task imaging, which will in turn help increase efficiency and error-tolerance in industrial pipelines or at home.
Semantic segmentation – finding contiguous areas in the image with a similar semantic context – is one of the most fundamental tasks in scene understanding. Given a segmentation of the image, further breakdown of the parts into smaller parts, or analysis of the interaction between parts, can proceed. There are numerous popular large-scale standard datasets to assist in segmentation algorithm development, e.g. ImageNet, COCO, SUN, PASCAL, and ADE20K. Further, hand image analysis datasets [6, 7, 8, 9, 10] were proposed for segmentation, with a focus on hands, but not hand interactions. Bambach et al. created a dataset for complex interactions, but it does not involve handheld tools. Therefore, we find most existing open collections unsuitable for studying interactions between hands and handheld tools, which are essential for understanding assembly.
|Ours||7,905||Yes||Man. & Syn.|
|Tool||COCO ||SUN ||ADE20K ||BusyHands (Ours)|
Naive methods for human hand segmentation from backgrounds, such as recognizing skin-colored pixels in RGB, are being replaced with supervised machine learning algorithms with far higher perception capabilities, such as deep convolutional networks or deep randomized decision forests. The advent of new cheap imaging technology, such as the Kinect depth camera, allowed enriching the fundamental features used in perception tasks to reach (and even surpass) human-level cognitive capabilities. However, adding more feature dimensions to these highly parametric models requires orders of magnitude more training data to achieve generalizable results. Consequently, this led to the construction of the aforementioned large annotated datasets and others, which are now in high demand.
Manually annotating distinct semantic parts in images is tedious and error-prone, and therefore it may be prohibitively expensive. To cope with this problem, [16, 17] adopted synthetic data, which can be generated through professional 3D modeling software. Ground truth annotation for semantic segmentation can be achieved easily in 3D software, since the objects are precisely defined (by a triangulated mesh) and photorealistic rendering is ready at hand. A 3D model can also be parameterized to augment the data with a multitude of novel situations and camera angles. On the other hand, synthetic scenes also need careful human staging to achieve a realism that generalizes to successful real-world data analysis. All told, synthetic datasets are now an advancing reality for many vision tasks, especially in the autonomous driving domain [16, 17]. Therefore, we created BusyHands to have both real-world captures and synthetic renderings made with Blender. We provide a comparative evaluation between the real-world and synthetic parts in this paper.
To the best of our knowledge, ours is the first real- or virtual-world segmentation dataset that focuses on small-scale assembly works. A small sample of our annotated dataset is presented in the opening figure. We will release for open download all parts of our dataset, as well as all pre-trained segmentation models (see §4.2). A small excerpt from the dataset is included in the supplementary material.
|Data type||Advantages||Disadvantages|
|Real-world||Simple to collect data with commodity cameras. Data is as close as possible to the target input, thus more attractive to external practitioners. Image capture is immediate. High data randomness, assists in generalization.||Annotating is expensive in terms of time and resources. Objects might not be labeled correctly due to occlusion or ambiguity. Segmentation may be subjective, because of a single annotator or disagreement. RGB-Depth registration has artifacts.|
|Synthetic||All the images are annotated accurately and instantly in an automatic manner. The dataset can be easily grown by adding more texture, pose or camera variables. RGB and Depth streams are perfectly aligned, from the virtual camera’s z-buffer.||The creation of 3D models and scene staging is difficult in the earlier stage. Realistic animatronics is hard to achieve without expertise and resources. The synthetic images are not as realistic as real images, lack noise. Image rendering at high resolution and multiple passes (RGB, Depth map) is time consuming.|
The rest of the paper is organized as follows. In Section 2, we discuss semantic segmentation and existing datasets in the literature. Section 3 provides details on how we created the BusyHands dataset. Section 4 covers the existing semantic segmentation methods that we evaluated on our dataset. Section 5 offers conclusions about this work and future directions.
2 Related Work
Semantic segmentation has long been a central pursuit of the computer vision research agenda, driven by compelling applications in autonomous navigation, security, image-based search and manufacturing, to name a few. In recent years, semantic segmentation research has seen a tremendous boost in offerings of deep convolutional network architectures, with Long et al.'s Fully-Convolutional Networks (FCN) work roughly marking the start of the new era of semantic segmentation. The key insight behind that early work, which still resonates in most of today's state-of-the-art contributions, is to use a visual feature-extracting network (such as VGG, ResNet, or a standalone one) and layer on top of it a decoding and unpooling mechanism to predict a class for each pixel at the original resolution. With this pattern, we can utilize a rich pre-trained subnetwork with a powerful visual representation, proven to work, for example, on large-scale image classification problems. Recent works, such as the flavors of DeepLab [21, 22, 23], PSPNet and DenseASPP, utilize a specialized unpooling device such as the Atrous Spatial Pyramid Pooling (ASPP) feature.
2.1 Related Segmentation Datasets
The burst of creativity in semantic segmentation algorithms could not have occurred if not for the equally sharp rise in very large pixel-annotated datasets for segmentation. With an abundance of data, such as PASCAL VOC, MS COCO, Cityscapes or ADE20K, researchers could build deeper and more influential work, which makes a strong case for building and sharing datasets openly. Our dataset, on the other hand, offers far more comprehensive coverage of work tools than any of the aforementioned datasets. In Table 2 we compare the number of pixel-level annotated instances of the objects in our dataset.
Since hands are a key element in many useful applications of computer vision, such as egocentric augmented reality or manufacturing, many datasets for segmenting hands in images have been contributed. We list a few recent instances in Table 1. However, all of the above-mentioned datasets only provide annotation for the hand (up to the wrist), whereas our annotation also covers the arm in addition to the tools in use, while taking great care to mark the occlusion of the hand by the tools.
|No.||Tool||Task||Label color (RGB)|
|1.||screwdriver||Tighten or loosen screws||(1, 1, 0)|
|2.||wrench||Tighten or loosen nuts||(0, 1, 1)|
|4.||pencil||Sketch on paper||(0.75, 0.75, 0.75)|
|5.||eraser||Erase a sketch on paper||(0, 0, 0.5)|
|6.||scissors||Cut paper||(0.5, 0.5, 0.5)|
|8.||hammer||Drive nail into wood||(0.5, 0.5, 0)|
|9.||ratchet||Tighten or loosen nuts||(0, 0.5, 0)|
|10.||tape measure||Measure objects||(0.5, 0, 0.5)|
|11.||saw||Saw a wooden board||(0, 0.5, 0.5)|
|13.||ruler||Draw line with pencil||(0.27, 0.5, 0.7)|
3 Constructing the BusyHands Dataset
We chose to deliver two types of image data in BusyHands, real-world and synthetic, so that together they can provide a generalized and practical database for semantic segmentation of small-scale assembly works. Real and synthetic data complement each other in a number of ways, which we detail in Table 3.
The structure of the dataset follows that of PASCAL, which includes color images and segmentation class labels (see Fig. 4). The pixel-value of the segments in the label image ranges from 0 to (where ). In addition, we include depth images in our dataset to provide extra information. The works of [27, 12] showed that depth images can be extremely useful for understanding human body parts. RGB information is also very hard to generalize properly: in real-world situations there is immense color variability, for example in shirt, tool, background or skin colors, let alone variation in lighting. Depth images circumvent these problems, while the added cost of obtaining them is not high.
3.1 Tools and Tasks Selection
We aim to create a dataset for most small-scale assembly works. However, assembly is a widely diverse activity with many goals that uses a large class of tools. We chose to focus on common tools that exist in most households and manual assembly pipelines. We used a pre-selected collection of handheld tools (a kit from an established brand) from a home improvement store. Out of the available tools in the kit, we chose 13 common handheld tools, listed in Table 4. Pictures of the collection of tools used in our recordings can be seen in Fig. 3.
The manual tasks to perform with each tool are derived from the standard function of the tool itself. We staged a small workstation with wooden and paper craft pieces to be used for work pieces, and instructed the “workers” to perform simple assembly tasks (see Table 4).
3.2 Real-world Data in BusyHands
Data was captured using a standard Kinect V2 camera, at 1920 × 1080 resolution for RGB and 512 × 424 for depth, at 7 FPS. Depth and RGB streams are pixel-aligned using the provided SDK and the camera's intrinsic and extrinsic parameters. The frame-by-frame outputs are shown in Fig. 2. The camera was mounted above the desk to provide a first-person perspective. This allows our data to be used for segmenting images both from head-mounted gear and from top-view cameras over a workbench, which are becoming increasingly common in the manufacturing world. During recording, the real-time video output was displayed so that the workers could adjust their postures to avoid excessive occlusion. Three volunteers were recruited (one female, two male; skin complexion: one Caucasian, two Asian) and given the instructions shown in Table 4. Multiple tools could be used in a single task to help complete the work. For each task, the camera started capturing images after the worker began working, and stopped automatically after recording 150 frames. A total of 39 recordings were captured, of which 26 were fully annotated with segmentation information.
Annotating the semantic parts in images is a tedious task. We employed Python-LabelMe (https://github.com/wkentaro/labelme), an open source image annotation software based on the original LabelMe project from MIT, to annotate the different semantic parts and assign appropriate labels to them. The results can be seen in the opening figure. We also show the preprocessed data samples in Fig. 2. Each sample contains a color image, a depth image and the ground truth.
3.3 Synthetic Data in BusyHands
As mentioned before, to enrich the selection of available data in our dataset and obtain a large number of samples, we adopted synthetic data. To generate realistic data on a par with real data, we purchased high-quality 3D models of tools (see Fig. 3) as well as a highly realistic pair of hands, and loaded them into the Blender software (https://www.blender.org/). All the manual tasks (or instructions) were simulated by creating realistic key-frame animations that mimic observed human motion.
To increase the generality of the dataset, so that it can be applied in various physical environments, we use five camera perspectives in the synthetic dataset. In Figure 5, the cones in the first two rows represent the five camera positions (from left to right: first-person perspective, moved up, moved down, moved left, moved right), rendered in front view (first row) and side view (second row). The corresponding color image, depth image and ground truth are given in the bottom three rows.
Unlike real-world captures, annotating semantic parts in a virtual environment is very straightforward. In Blender, we unwrapped the meshes of tools, hands, and arms to 2D UV maps, then painted the UV maps using solid colors. Each color is one-to-one mapped to one class label in our dataset according to the RGB-codes dictionary (see Table 4). Later, we utilize these colors to retrieve corresponding label numbers. Given a mapped texture in Blender, the software will output rendered images of RGB and semantic labels for all the designed animation frames automatically. A depth map for each frame is easily obtained from Blender by outputting the virtual camera’s z-buffer, and is pixel-aligned to the other streams.
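The color-to-label lookup described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the label numbers are assumed to equal the tool numbers in Table 4, the [0, 1] color codes are assumed to be scaled to 8-bit values, and unmatched pixels are assumed to be background (label 0).

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]


def to_8bit(rgb: Tuple[float, float, float]) -> Color:
    """Convert a color with channels in [0, 1] to 8-bit integer channels."""
    r, g, b = (int(round(c * 255)) for c in rgb)
    return (r, g, b)


# Hypothetical dictionary built from three rows of Table 4.
# Assumption: the label number equals the tool number in the table.
COLOR_TO_LABEL: Dict[Color, int] = {
    to_8bit((1.0, 1.0, 0.0)): 1,  # screwdriver
    to_8bit((0.0, 1.0, 1.0)): 2,  # wrench
    to_8bit((0.5, 0.5, 0.0)): 8,  # hammer
}

BACKGROUND = 0  # assumption: any color not in the dictionary is background


def colors_to_labels(image: List[List[Color]]) -> List[List[int]]:
    """Map every pixel of a rendered solid-color image to its class label."""
    return [[COLOR_TO_LABEL.get(px, BACKGROUND) for px in row] for row in image]


if __name__ == "__main__":
    rendered = [
        [to_8bit((1.0, 1.0, 0.0)), (0, 0, 0)],
        [to_8bit((0.0, 1.0, 1.0)), to_8bit((0.5, 0.5, 0.0))],
    ]
    print(colors_to_labels(rendered))
```

Exact matching only works if the label pass is rendered without anti-aliasing or shading; otherwise a nearest-color lookup would be needed instead.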
3.4 Dataset Analysis and Comparison
The real-world part of the dataset has 3695 labeled images, while the synthetic part has 4170 images. In terms of instances, we have 9505 tool instances in the real part and 4170 tool instances in the synthetic part. The proportions of each tool instance for both real and synthetic data are shown in Fig. 6.
4 Semantic Labeling Evaluation
The BusyHands task involves predicting a pixel-level semantic labeling of the image, without considering higher-level object instance or boundary information.
We use a standard metric to evaluate labeling performance. The most widely adopted is the intersection-over-union metric, $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$, where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. As is customary, we average over all classes and then over samples to obtain the mean intersection over union (mIOU).
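As a minimal sketch of this metric, the code below computes the per-class IoU as TP / (TP + FP + FN) on flattened label lists, averages over classes within each sample, and then averages over samples. The function names are illustrative, and skipping classes that appear in neither the prediction nor the ground truth of a sample is an assumption, since the text does not say how such classes are handled.

```python
from typing import List, Optional, Sequence


def class_iou(pred: Sequence[int], truth: Sequence[int], cls: int) -> Optional[float]:
    """IoU = TP / (TP + FP + FN) for one class over one flattened image."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    denom = tp + fp + fn
    # Assumption: a class absent from both prediction and truth is skipped.
    return None if denom == 0 else tp / denom


def mean_iou(preds: List[Sequence[int]], truths: List[Sequence[int]],
             num_classes: int) -> float:
    """Average IoU over classes within each sample, then over samples."""
    per_sample = []
    for pred, truth in zip(preds, truths):
        ious = [class_iou(pred, truth, c) for c in range(num_classes)]
        valid = [v for v in ious if v is not None]
        if valid:
            per_sample.append(sum(valid) / len(valid))
    return sum(per_sample) / len(per_sample)


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou([prediction], [ground_truth], num_classes=2))
```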
4.2 Evaluated Segmentation Methods
We experimented with the following semantic segmentation algorithms, from the latest literature:
Encoder-Decoder SegNet. This network uses a VGG-style encoder-decoder, where the upsampling in the decoder is done using transposed convolutions. In addition, we also used a version that employs additive skip connections from encoder to decoder.
Mobile UNet for Semantic Segmentation. Combining the depthwise separable convolutions of MobileNets with UNet results in a low-parameter semantic segmentation model. For this architecture we also use a flavor with skip connections.
Full-Resolution Residual Networks (FRRN). Combines multi-scale context with pixel-level accuracy by using two processing streams within the network. The residual stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The pooling stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals.
AdapNet. Modifies the ResNet50 architecture by performing the lower-resolution processing using a multi-scale strategy with atrous convolutions. We use a slightly modified version that uses bilinear upscaling instead of transposed convolutions.
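To illustrate why the depthwise separable convolutions used in Mobile UNet yield a low-parameter model, the sketch below compares weight counts for a single layer. The layer sizes (a 3×3 kernel mapping 64 to 128 channels) are arbitrary illustrative values, not taken from the evaluated network, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 128  # illustrative layer sizes (assumption)
    standard = standard_conv_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable form needs roughly one eighth of the weights of the standard convolution.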
4.3 Evaluation Results
The results of training and testing with the selected evaluation methods (listed in §4.2) are given in Table 5. We notice that the full-resolution residual networks (FRRNs) are superior in most categories, followed by SegNet with skip connections. In Figure 7 we show example results on the Real test set with FRRN-B and SegNet-Skip (additional results are available as supplementary material). The results indicate that while the arms, hands and tools are segmented quite well, there is a significant amount of noise from random objects on the table that are classified as tools. Some post-processing cleanup of the segmentation result, in particular blob geometry analysis (which we did not attempt), could potentially reduce the level of noise.
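One simple form of blob-based cleanup, which the evaluation above did not attempt, is to discard small connected components from the predicted label map. The sketch below is only an illustration under stated assumptions: 4-connectivity, background label 0, and a hypothetical `min_area` threshold that would have to be tuned on real data.

```python
from collections import deque
from typing import List


def remove_small_blobs(labels: List[List[int]], min_area: int,
                       background: int = 0) -> List[List[int]]:
    """Relabel as background every 4-connected same-class blob smaller than min_area."""
    height = len(labels)
    width = len(labels[0]) if height else 0
    cleaned = [row[:] for row in labels]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if seen[y][x] or labels[y][x] == background:
                continue
            cls = labels[y][x]
            seen[y][x] = True
            queue = deque([(y, x)])
            blob = []
            # Breadth-first flood fill over pixels that share the same class.
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and labels[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) < min_area:
                for by, bx in blob:
                    cleaned[by][bx] = background
    return cleaned


if __name__ == "__main__":
    prediction = [
        [1, 1, 0, 0],
        [1, 1, 0, 2],
        [0, 0, 0, 0],
    ]
    print(remove_small_blobs(prediction, min_area=2))
```

An area threshold is the crudest blob descriptor; shape measures such as elongation could be added in the same loop.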
Another insight is that adding synthetic data to the training set dramatically increases accuracy on Real data. In the case of FRRN-A, for example, mIOU over the Real test set rose from 0.336 when training with Real images only to 0.502 when synthetic data was also used for training. In fact, performance dropped only in the case of MobileUNet when synthetic data was included; for the other methods it increased by up to 80%.
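For reference, the relative gain implied by the two FRRN-A scores quoted above can be computed as below. This is only a worked calculation on those two numbers; the helper name is illustrative.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # FRRN-A mIOU on the Real test set: Real-only training vs. Real + synthetic.
    gain = relative_gain(0.336, 0.502)
    print(f"{gain:.1%}")  # prints 49.4%
```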
We contribute BusyHands, a high-quality, fully annotated dataset for semantic segmentation with both real and synthetic image data. We also present an evaluation of numerous leading segmentation algorithms on our dataset as a baseline for other researchers. We release all of the data for general access by the computer vision community at http://hi.cs.stonybrook.edu/busyhands. This, we hope, will enable the creation of better image segmentation algorithms, which will further advance computer vision research on scenes of manual assembly operations.
We would like to thank Nvidia for their generous donation of Titan Xp and Quadro P5000 GPUs, which were used in this project. We thank the dataset annotators: Sirisha Mandali and Venkata Divya Kootagaram, as well as Fan Wang and Xiaoling Hu.
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. (2009) 248–255
-  Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
-  Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. (2010) 3485–3492
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2010) 303–338
-  Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017)
-  Mittal, A., Zisserman, A., Torr, P.H.S.: Hand detection using multiple proposals. In: British Machine Vision Conference. (2011)
-  Malireddi, S.R., Mueller, F., Oberweger, M., Bojja, A.K., Lepetit, V., Theobalt, C., Tagliasacchi, A.: Handseg: A dataset for hand segmentation from depth images. CoRR abs/1711.05944 (2017)
-  Zimmermann, C., Brox, T.: Learning to estimate 3d hand pose from single rgb images. In: IEEE International Conference on Computer Vision (ICCV). (2017)
-  Afifi, M.: Gender recognition and biometric identification using a large dataset of hand images. CoRR abs/1711.04322 (2017)
-  Khan, A.U., Borji, A.: Analysis of hand segmentation in the wild. CoRR abs/1803.03317 (2018)
-  Bambach, S., Lee, S., Crandall, D.J., Yu, C.: Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In: The IEEE International Conference on Computer Vision (ICCV). (2015)
-  Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33 (2014) 169:1–169:10
-  Wetzler, A., Slossberg, R., Kimmel, R.: Rule of thumb: Deep derotation for improved fingertip detection. arXiv preprint arXiv:1507.05726 (2015)
-  Li, Y., Ye, Z., Rehg, J.M.: Delving into egocentric actions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 287–295
-  Corp, M.: Kinect for xbox 360. Technical report, (Redmond WA)
-  Riegler, G., Ferstl, D., Rüther, M., Bischof, H.: A framework for articulated hand pose estimation and evaluation. In: Scandinavian Conference on Image Analysis. (2015)
-  Simon, T., Joo, H., Matthews, I.A., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. CoRR abs/1704.07809 (2017)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 3431–3440
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR abs/1412.7062 (2014)
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR abs/1606.00915 (2016)
-  Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587 (2017)
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. CoRR abs/1612.01105 (2016)
-  Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation in street scenes. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. CoRR abs/1604.01685 (2016)
-  Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR 2011. (2011) 1297–1304
-  Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: Labelme: a database and web-based tool for image annotation. International journal of computer vision 77 (2008) 157–173
-  Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
-  Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. CoRR abs/1611.08323 (2016)
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR abs/1511.00561 (2015)
-  Valada, A., Vertens, J., Dhall, A., Burgard, W.: Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 4644–4651
-  Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR abs/1802.02611 (2018)
-  Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI. Volume 16. (2016) 265–283
-  Seif, G.: Semantic Segmentation Suite. https://github.com/GeorgeSeif/Semantic-Segmentation-Suite (2018)