ALET (Automated Labeling of Equipment and Tools): A Dataset, a Baseline and a Usecase for Tool Detection in the Wild

10/25/2019
by Fatih Can Kurnaz, et al.

Robots collaborating with humans in realistic environments will need to be able to detect the tools that can be used and manipulated. However, there is no available dataset or study that addresses this challenge in real settings. In this paper, we fill this gap by providing an extensive dataset (METU-ALET) for detecting farming, gardening, office, stonemasonry, vehicle, woodworking and workshop tools. The scenes correspond to sophisticated environments with or without humans using the tools. The scenes we consider introduce several challenges for object detection, including the small scale of the tools, their articulated nature, occlusion, high inter-class similarity, etc. Moreover, we train and compare several state-of-the-art deep object detectors (including Faster R-CNN, YOLO and RetinaNet) on our dataset. We observe that the detectors have difficulty especially in detecting small-scale tools or tools that are visually similar to parts of other tools. This, in turn, underlines the importance of our dataset and study. With the dataset, the code and the trained models, our work provides a basis for further research into tools and their use in robotics applications.


I Introduction

The near future will see a cohabitation of robots and humans, where they will work together for performing tasks that are especially challenging, tiring or unergonomic for humans. For such collaborative tasks, robots are expected to be able to reason about the states of the humans, the task and the environment. This, in turn, requires the robot to have abilities for perceiving the humans, the task at hand, and the environment. An essential perceptual component for these abilities is the detection of objects.

Fig. 1: Some snapshots from the METU-ALET dataset, illustrating the wide range of scenes and tools that a robot is expected to recognize in clutter, possibly with human co-workers using the tools. Since the annotations are too dense, only a small subset is displayed. [Best viewed in color]

With advances in deep learning, it has become possible to detect objects in challenging scenes with impressive performance. With pioneering models such as R-CNN [9], Fast R-CNN [11], Faster R-CNN [24] and RetinaNet [16], object detection has become an easy-to-integrate functionality for robotics applications.

The robotics community has paid only marginal attention to the tools that are used by humans. For example, there are studies focusing on affordances of tools, or on the detection and transfer of these affordances [22, 21, 19]. However, these studies considered such tools mostly in isolated, unrealistic table-top environments. Moreover, they considered only a limited set of tools (see Table I). What is more, the literature has not studied the detection of tools, nor is there a dataset available for it.

In this paper, we focus on the detection of tools in realistic, cluttered environments (e.g., like those in Figure 1) where collaboration between humans and robots is expected. To be more specific, we study the detection of tools in real work environments that are composed of many objects (tools) that look alike and occlude each other. To this end, we first construct an extensive tool detection dataset covering 53 tool categories. Then, we compare widely used state-of-the-art object detectors on our dataset, showcasing that detecting tools is very challenging, mainly because tools are small and articulated and exhibit high inter-class similarity. Finally, we introduce a “wearing helmet?” usecase derived from our dataset and train a novel CNN network for this task.

The necessity for a dataset for tool detection: Tool detection requires a dataset of its own since it bears novel challenges of its own: (i) Many tools are small objects, which pose a problem for standard object detectors that are tuned for detecting moderately larger objects. (ii) Many tools are articulated, and in addition to viewpoint, scale and illumination changes, object detectors need to address invariance to articulation. (iii) Tools are generally used in highly cluttered environments, posing challenges owing to clutter, occlusion, and appearance and illumination variations – see Figure 1 for some samples. (iv) Many tools exhibit low inter-class differences (e.g., between screwdrivers, chisels and files, or between putty knives and scrapers).

Dataset | Tool Categories | Tool Classes | # of Images | # Instances per Tool | Modality | Dense Bounding Boxes?
RGB-D Part Aff. [22] | Kitchen, Workshop, Garden | 17 | 3 (SO*: 102) | 6.17 | RGB-D | No
ToolWeb [1] | Kitchen, Office, Workshop | 23 | 0 (SO*: 116) | 5.03 | 3D model | No
Visual Aff. of Tools [5] | Toy | 3 | 0 (SO*: 5280) | 377 | RGB stereo (inc. semantic map) | No
Epic-Kitchens [4] | Kitchen | 60+ | NA (Videos) | 200+ | RGB stereo (inc. semantic map) | Yes
METU-ALET (Ours) | Farm, Garden, Office, Stonemasonry, Vehicles, Woodwork, Workshop | 53 | 2345 (SO*: 0) | 200+ | RGB | Yes
TABLE I:

A comparison of the datasets that include tools. The Epic-Kitchens dataset provides videos, which makes analysis of scenes difficult. Moreover, the figures for the Epic-Kitchens dataset are estimated based on the provided data. *SO denotes the number of scenes that only include a single object.

I-A Related Work

In this section, we review the related work, and provide a list of our contributions and a comparison with the literature.

Object Detection:

Object detection is one of the most studied problems in Computer Vision, with many practical applications in robotics scenarios. Object detection generally follows a two-stage approach: (i) Region selection, which pertains to the selection of image regions that are likely to contain an object. (ii) Object classification, which deals with the classification of a selected region into one of the object categories. With advances in deep learning, both stages have seen a tremendous boost in performance in many challenging settings, e.g., [9, 11, 24]. Faster R-CNN [24] is a state-of-the-art representative of such two-stage detectors.

Recently, it has been shown that the two stages can be combined and objects can be detected in one stage. Models such as [18, 16, 23] assume a fixed set of localized image regions for each object category, and estimate objectness for each category and for each fixed image region. Among these, RetinaNet [16] processes features and makes classifications at multiple scales (using a so-called feature pyramid network) and combines the results, yielding state-of-the-art results among one-stage detectors. Among the one-stage detectors, YOLO [23] provides a good balance between accuracy and running time by considering a smaller number of candidate image regions per class.

The current trend in deep object detectors is to take a pre-trained object classification network as the feature extractor (called the backbone network), then perform two-stage or one-stage object detection from those features. Alternatives for the backbone network include networks such as VGG [25], ResNet [12] and ResNext [28].

Tools in Robotics: The robotics community has extensively studied how tools can be grasped and manipulated, and how such affordances can be transferred across tools. For example, Kemp and Edsinger [14] focused on the detection and 3D localization of tool tips from optical flow. For the same goal, Mar et al. [19] proposed using a CNN (AlexNet) to first classify a blob into one of the three tool labels they considered, and then used 3D geometric features (from point clouds estimated from stereo cameras) to identify tool tips.

Another study [22] addressed the problem of estimating the grasping positions, scoops and supports of tools from RGB-D data. To this end, they used methods such as Random Forests and Hierarchical Matching Pursuit. In a similar line of reasoning, Mar et al. [20, 21] studied the prediction of affordances for different categories of tools separately. To this end, they first clustered tools using their 3D geometric descriptors, and then applied Generalized Regression Neural Networks to estimate the affordances of each cluster separately.

Another crucial problem concerning tools is the transfer of learned affordances across tools. For this, Abelha and Guerin [2] proposed a geometry-based reasoning approach where tools are segmented into semantically meaningful parts using 3D geometric features, and the affordances of these parts are transferred between similar parts of different tools.

Related datasets: One of the major contributions of our work is the dataset on tool detection. Comparing our dataset to the ones used in the robotics literature (e.g., [22, 1, 5] – see also Table I), we see that they are limited in the number of categories and the instances that they consider. Moreover, since they mainly focus on detection of tool affordances, they are not suitable for training a deep object detector.

For related problems, there are numerous object datasets in the robotics literature, e.g., for 3D pose estimation and robot manipulation (LINEMOD [13], YCB Objects [3], Table-top Objects [26], Object Recognition Challenge [27]). These datasets generally include table-top objects with 3D models and do not include tool categories.

For object detection, there are large-scale datasets such as PASCAL [8], MS-COCO [17] and ImageNet [6]. These datasets do include some tool categories (e.g., hammer, scissors); however, they are designed for general-purpose objects and scenes, unlike the ones we expect to see in environments where tools are used (such as the ones in Figure 1). Therefore, they do not provide a sufficient number of training instances for training a detector with reasonable performance.

I-B Contributions of Our Work

The main contributions of our work can be summarized as follows:

  • A Tool Detection Dataset: To the best of our knowledge, ours is the first work to provide a dataset on the detection of tools in the wild. As reviewed above, the robotics literature has focused on affordances of tools and on 3D detection and pose estimation of objects, and neither line of work has provided a dataset on tools. The computer vision community, on the other hand, has extensive datasets for objects (e.g., PASCAL VOC, MS-COCO), which do not focus on tools. The Epic-Kitchens dataset focuses on object detection and action recognition in kitchen contexts, including also the tools that may be used in a kitchen.

  • A Baseline for Tool Detection: On our dataset, we train and analyze several state-of-the-art deep object detectors. Together with the dataset, the code and the trained models, our work can serve as a basis for robotics applications that require the detection of tools in challenging, realistic work environments with humans.

  • A Usecase for Checking the Safety of Human Coworkers: Our dataset includes humans performing tasks with and without wearing a safety helmet. We form a subset of positive and negative instances of these and train a deep CNN network that checks whether a human coworker is wearing a helmet or not. This usecase reinforces the practicality of our dataset.

II METU-ALET: An Extensive Tool Detection Dataset

In this section, we present and describe the details of METU-ALET and how the dataset was collected.

II-A Tools and Their Categories

In ALET, we consider 53 different tools that are used in seven broad contexts or for seven broad purposes: farming, gardening, office, stonemasonry, vehicle, woodworking and workshop tools. (In fact, some of these objects would be better called equipment; however, since they provide similar functionality, namely being used by a human or a robot while performing a task, we will use the term tool to refer to all such objects for the sake of simplicity.) The 20 most frequent tools in our dataset are listed below:

    Chisel, Clamp, Drill, File, Gloves, Hammer,
    Mallet, Meter, Pen, Pencil, Plane, Pliers,
    Safety glass, Safety helmet, Saw, Screwdriver,
    Spade, Tape, Trowel, Wrench

We excluded kitchen tools since there is already a dedicated dataset for them [4]. Moreover, we limited ourselves to tools that can ultimately be grasped, pushed or manipulated easily by a robot. Therefore, we did not consider tools such as ladders, forklifts, and power tools bigger than a hand-sized drill.

II-B Dataset Collection

Our dataset is composed of three groups of images:

  • Images collected from the web: Using keywords and usage descriptions that describe the tools listed in Section II-A, we crawled and collected royalty-free images from the following websites: Creativecommons, Wikicommons, Flickr, Pexels, Unsplash, Shopify, Pixabay, Everystock, Imfree.

  • Images photographed by ourselves: We captured photos of office and workshop environments from our campus.

  • Synthetic images: For some tools, such as grinders and rakes, the images collected from the web yielded too few examples, so we also formed a synthetic set. For this, we first photographed 10 background images that belonged to different contexts and manually labeled their table-tops. Then, for each tool class for which we needed more samples, we downloaded royalty-free transparent images and applied the following set of transformations (a minimal composition sketch is given after this list):

    • Rotation: Rotation around the center of the patch with an angle sampled from .

    • Scaling: Scaling such that the longest dimension of the patch is between 12.5% and 50% of the area of the tabletop.

    • Shear transform with a shear factor selected randomly from .

    The number of tools added to an image is sampled from .
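The composition step above can be sketched as follows. This is a minimal illustration, assuming the Pillow library and placeholder sampling ranges for the rotation angle, scale and shear factor; the exact ranges used for the dataset are not reproduced here.

```python
# Illustrative sketch of composing a synthetic image: paste a transparent tool
# image onto a manually labeled table-top region. The sampling ranges below are
# placeholder assumptions, not the exact values used for the dataset.
import random
from PIL import Image  # requires Pillow >= 9.1 for Image.Transform.AFFINE

def paste_tool(background: Image.Image, tool_png: str, table_box):
    """Paste one transparent tool image onto the table-top given by table_box."""
    x0, y0, x1, y1 = table_box                        # manually labeled table-top
    tool = Image.open(tool_png).convert("RGBA")

    # Rotation around the patch center (placeholder angle range).
    tool = tool.rotate(random.uniform(-180, 180), expand=True)

    # Scaling relative to the table-top size (placeholder fraction range).
    frac = random.uniform(0.125, 0.5)
    scale = frac * (x1 - x0) / max(tool.size)
    tool = tool.resize((max(1, int(tool.width * scale)),
                        max(1, int(tool.height * scale))))

    # Shear transform (placeholder shear-factor range).
    shear = random.uniform(-0.2, 0.2)
    tool = tool.transform(tool.size, Image.Transform.AFFINE, (1, shear, 0, 0, 1, 0))

    # Random placement inside the table-top; the alpha channel acts as the mask.
    px = random.randint(x0, max(x0, x1 - tool.width))
    py = random.randint(y0, max(y0, y1 - tool.height))
    background.paste(tool, (px, py), tool)
    return (px, py, px + tool.width, py + tool.height)  # bounding-box annotation
```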

For annotating the tools in the downloaded and photographed images, we used the VGG Image Annotator (VIA) tool [7]. Annotation was performed by us, the authors of the paper.

II-C Dataset Statistics

In this section, we provide some descriptive statistics about the METU-ALET dataset.

Cardinality and Sizes of BBs: The METU-ALET dataset includes 15,612 bounding boxes (BBs). As displayed in Figure 2, for each tool category there are more than 200 BBs, which is on an order similar to widely used object detection datasets such as PASCAL [8]. As shown in Table II, METU-ALET includes tools that appear small (area < 32²), medium (32² ≤ area < 96²) and large (area ≥ 96²), following the naming convention from MS-COCO [17]. Although this is expected, as we will see in Section IV, deep networks have difficulty especially in detecting small tools.
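As a reference for these size groups, the following small helper buckets a bounding box by area following the MS-COCO convention (a sketch, assuming pixel-space widths and heights).

```python
# Bucket a bounding box into MS-COCO-style size groups by its pixel area.
def size_bucket(width: float, height: float) -> str:
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```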

Fig. 2: The distribution of bounding boxes across classes. With the photographed and synthesized photos, each tool category has more than 200 bounding boxes. [Best viewed in color]
Subset | Small BBs | Medium BBs | Large BBs | Total
Downloaded | 1614 | 4658 | 2990 | 9262
Photographed | 74 | 507 | 239 | 820
Synthesized | 145 | 3302 | 2083 | 5530
Total | 1833 | 8377 | 5312 | 15612
TABLE II: The sizes of the bounding boxes (BBs) of the annotated tools in METU-ALET. For calculating these statistics, we considered the scaled versions of the images that were fed to the networks, i.e., with the smaller dimension equal to 600 pixels and the larger dimension less than 800 pixels.

Cardinality and Sizes of the Images: Our dataset is composed of 2345 images in total; see Table III for more details. Although the number of images may appear low, the number of bounding boxes is reasonable (15,612) since the average number of BBs per image is considerably larger than in PASCAL 2012.

Subset | Cardinality | Avg. Resolution
Downloaded | 1912 | –
Photographed | 89 | –
Synthesized | 344 | –
Total/Avg | 2345 | –
TABLE III: The cardinality and the average resolution of the images in METU-ALET.

III Methodology

In this section, we briefly describe the deep object detectors that we evaluated as baselines, and the “wearing helmet?” task as a straightforward usecase of the ALET dataset.

III-A Deep Object Detectors

As stated in Section I-A, deep object detectors can be broadly analyzed in two categories: (i) single-stage detectors, and (ii) two-stage detectors. To form a baseline, we evaluated strong representatives of both single-stage (YOLO, RetinaNet) and two-stage (Faster R-CNN) detectors.

III-A1 Faster R-CNN

Faster R-CNN [24] is one of the first networks to use end-to-end learning for object detection. It feeds features extracted from a backbone network into a region proposal network, which estimates an objectness score and the (relative) coordinates of a set of anchor boxes for each position on a regular grid. For each such box with an objectness score above a threshold, the object classification network (Fast R-CNN) is executed to classify the box into one of the object categories.

For training the network, a classification loss and a box-regression loss (penalizing the spatial mismatch between the detected box and the ground-truth box) are combined. The box-regression loss is weighted with a constant, which we set to 1.0 as suggested by the paper.
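As a rough illustration of this weighted multi-task objective (a sketch in PyTorch style, not the Detectron implementation used in our experiments), the combination can be written as follows, with box_weight standing for the constant mentioned above.

```python
# Minimal sketch of a combined detection loss: cross-entropy for classification
# plus a weighted smooth-L1 term for box regression (illustrative only).
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, box_weight=1.0):
    cls_loss = F.cross_entropy(cls_logits, cls_targets)  # classification term
    reg_loss = F.smooth_l1_loss(box_preds, box_targets)  # box-regression term
    return cls_loss + box_weight * reg_loss               # weighted combination
```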

III-A2 YOLO: You Only Look Once

YOLO [23] combines the region-proposal and classification stages by performing classification for a fixed set of default bounding boxes per position per object category. It is designed to be a real-time object detector, and to achieve this, it assumes a coarser set of default boxes than similar networks such as SSD [18], leading to 9 default boxes per object category.

For training the network, YOLO uses a squared-error loss for both classification and box regression to make learning easier to optimize, and it introduces several hyperparameters to weight the contributions of the squared-error terms corresponding to position, width, height, and classification. In our experiments, we set these factors as suggested by the paper.

III-A3 RetinaNet

RetinaNet [16] is a one-stage detector which forms a multi-scale pyramid from the features obtained from the backbone network and performs classification and bounding box estimation in parallel for each layer (scale) of the pyramid. In order to address the data imbalance problem that affects single-stage detectors owing to the abundance of background regions, RetinaNet proposes a focal loss that decreases the contribution of the “easy” examples to the overall loss. Compared to other single-stage detectors, RetinaNet considers a denser set of bounding boxes to classify.
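For reference, a minimal sketch of the focal loss for binary (object vs. background) classification is given below; alpha = 0.25 and gamma = 2 are the defaults suggested in the RetinaNet paper, and the code is illustrative rather than the exact implementation used in our experiments.

```python
# Minimal focal-loss sketch for binary classification with logits (illustrative).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-element binary cross-entropy, kept unreduced so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # down-weights easy examples
```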

III-B A Usecase for ALET: “Wearing Helmet?”

In this section, we illustrate how the ALET dataset can be used to address a critical issue in human-robot collaborative environments: checking whether a human worker is conforming to the safety guidelines and wearing a safety helmet. The ALET dataset contains a good number of “safety helmet” instances, which suffice for training a deep network (note that widely used object detection datasets such as PASCAL provide around 200 samples per object category).

To address this task, we use the network by Kocabas et al. [15] to both detect the humans in an image and estimate their poses. The bounding box provided by this network is taken as a positive instance if an annotated helmet overlaps the head of the detected person, and as a negative instance otherwise.
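The labeling rule can be sketched as follows; the function names and the head-keypoint representation are illustrative assumptions rather than the exact implementation.

```python
# A detected person is a positive "wearing helmet?" example if an annotated
# safety-helmet box overlaps the head location of the estimated pose (sketch).
def box_contains(box, point) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1

def is_wearing_helmet(head_keypoint, helmet_boxes) -> bool:
    """head_keypoint: (x, y) of the head joint returned by the pose estimator;
    helmet_boxes: annotated safety-helmet boxes (x0, y0, x1, y1) in the image."""
    return any(box_contains(box, head_keypoint) for box in helmet_boxes)
```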

We construct a CNN architecture (Table IV) and train it on this “wearing-helmet” subset.

Layer | Non-linearity
Input | –
Conv1 | ReLU
Conv2 | ReLU
Max-pool | –
Conv3 | ReLU
Max-pool | –
FC (512) | ReLU
FC (128) | ReLU
Output (1) | –
TABLE IV: The CNN architecture used for the “wearing helmet?” usecase. Batch normalization is added after each Conv layer.
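A PyTorch sketch of this classifier is given below. The input resolution, channel counts and kernel sizes are assumptions (they are not specified in Table IV above); only the layer sequence, i.e., three Conv + batch norm + ReLU blocks with two max-pools, FC-512, FC-128 and a single-unit output, follows the table.

```python
# Sketch of the "wearing helmet?" classifier; channel counts and kernel sizes
# are assumed values, while the layer sequence follows Table IV.
import torch.nn as nn

class HelmetNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),    # Conv1
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),   # Conv2
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), # Conv3
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),   # FC (512); lazy since the crop size is not fixed here
            nn.Linear(512, 128), nn.ReLU(),  # FC (128)
            nn.Linear(128, 1),               # single output: "wearing helmet?" score
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```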

IV Experiments

In this section, we provide a baseline for the ALET dataset and show a practical usecase for such a dataset.

IV-A Training and Implementation Details

We split the ALET dataset into training, validation and testing subsets. To make the comparison fair, we use the same split for training and evaluating each network. For training each network, we used the following libraries and settings (in all networks, the classification layer is replaced to match the tool categories, and the whole network except for the feature-extracting backbone is updated during training; an illustrative sketch of this fine-tuning setup follows the list below):

  • Faster R-CNN: The pre-trained network from Detectron [10] is used with backbone ResNet-50-FPN.

  • YOLO: The pre-trained network from Yun [29] is used with the Darknet backbone.

  • RetinaNet: The pre-trained network from Detectron [10] is used with backbone ResNet-50-FPN.
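The following sketch illustrates the class-layer replacement mentioned above, using torchvision's Faster R-CNN with a ResNet-50-FPN backbone as a stand-in; it is not the Detectron or PyTorch-YOLOv3 code actually used in our experiments.

```python
# Illustrative fine-tuning setup (torchvision stand-in, not the Detectron code):
# load a pre-trained Faster R-CNN (ResNet-50-FPN) and replace its class layer
# so that it predicts the ALET tool categories.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 53 + 1  # 53 ALET tool categories + background

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Optionally freeze the feature-extracting backbone and train the rest.
for param in model.backbone.parameters():
    param.requires_grad = False
```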

The “wearing helmet?” dataset we collected included 941 instances, 563 of which were reserved for training, 94 for validation and 284 for testing. For training the “wearing helmet?” CNN in Table IV, we used RMSProp and trained the network with early stopping.

IV-B Quantitative Results for Tool Detection

On the testing subset of ALET, we compare the performance of the detectors trained on the training subset of ALET. We use mean average precision (mAP), as is customary in object detection. AP is a measure of the area under the precision-recall curve, generally calculated in the literature by averaging the precision values at a discrete set of recall values [8]:

AP = \frac{1}{|R|} \sum_{r \in R} p(r),     (1)

where R is the set of recall values considered, and p(r) is the precision value at recall r. In our evaluations, we followed the PASCAL [8] style and considered 11 different recall values for calculating average precision. mAP is then the mean of AP across categories.
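A minimal sketch of this 11-point AP computation, assuming precision and recall arrays obtained from ranked detections and using the PASCAL-style interpolated precision, is given below.

```python
# 11-point interpolated average precision (PASCAL style), illustrative sketch.
import numpy as np

def average_precision_11pt(recall: np.ndarray, precision: np.ndarray) -> float:
    """recall, precision: arrays of the same length computed over ranked detections."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):                     # R = {0, 0.1, ..., 1.0}
        mask = recall >= r
        p_r = precision[mask].max() if mask.any() else 0.0  # interpolated p(r)
        ap += p_r / 11.0
    return ap
```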

Table V lists the AP and mAP values of the baseline networks on our dataset. We notice that the baseline detectors have trouble detecting the tools in METU-ALET. From the table, we see that the networks classify tools such as safety helmets and anvils better than chisels, files or nutdrivers. One of the reasons for this is that tools such as screwdrivers, nutdrivers, chisels and files look very similar to each other. Moreover, these tools appear very similar to parts of other tools from a side view or from afar (e.g., the front part of a drill is likely to be classified as a screwdriver or a nutdriver, and in many cases, one half of a pair of pliers is detected as a chisel). These observations suggest that tool detection is indeed a very challenging problem, especially owing to viewpoint challenges and tools having very similar appearances to other tools.

Category | F. R-CNN | YOLO | RetinaNet
Anvil | 4.99 | 21.52 | 6.51
Aviation-snip | 2.81 | 10.69 | 0.36
Axe | 0.62 | 6.44 | 0.87
Bench-vise | 4.05 | 9.46 | 1.11
Brush | 0.22 | 6.65 | 1.19
Caliper | 2.60 | 10.24 | 1.59
Caulk-Gun | 3.80 | 15.96 | 0.25
Chisel | 0.78 | 12.17 | 0.00
Clamp | 0.51 | 3.67 | 1.67
Climper | 2.33 | 6.83 | 0.57
Drill | 1.35 | 6.77 | 5.99
File | 0.79 | 4.45 | 4.45
Flashlight | 2.21 | 15.06 | 0.43
Gloves | 4.12 | 15.70 | 0.00
Grinder | 7.13 | 42.65 | 4.64
Hammer | 3.83 | 9.86 | 1.45
Hex-keys | 1.47 | 3.92 | 7.38
Hole-punch | 4.58 | 3.96 | 0.14
Knife | 1.72 | 6.07 | 0.78
Level | 0.67 | 3.79 | 0.74
Mallet | 0.80 | 6.21 | 0.00
Marker | 3.44 | 1.54 | 1.38
Meter | 1.24 | 4.91 | 1.64
Nutdriver | 0.33 | 11.34 | 1.42
Pen | 0.90 | 3.92 | 0.39
Pencil | 4.74 | 4.01 | 1.69
Pencil-sharpener | 5.24 | 4.34 | 5.67
Plane | 0.20 | 19.10 | 3.57
Pliers | 7.38 | 19.53 | 2.60
Putty-knives | 3.19 | 12.98 | 7.32
Rake | 1.41 | 7.53 | 2.24
Ratchet | 0.37 | 3.63 | 1.46
Riveter | 3.56 | 8.41 | 0.00
Ruler | 1.10 | 7.92 | 0.07
Safety-glass | 3.74 | 3.54 | 6.30
Safety-headphone | 5.22 | 10.88 | 12.15
Safety-helmet | 8.95 | 16.17 | 13.96
Safety-mask | 2.86 | 0.00 | 4.17
Sanders | 3.22 | 12.90 | 0.00
Saw | 0.76 | 8.52 | 0.56
Scissors | 1.50 | 2.28 | 2.04
Scraper | 1.00 | 7.31 | 0.41
Screwdriver | 2.29 | 6.93 | 3.79
Soldering-Iron | 3.70 | 13.12 | 0.00
Spade | 1.27 | 6.22 | 2.58
Square | 3.24 | 3.47 | 0.00
Staple-gun | 5.09 | 4.26 | 0.00
Stapler | 4.65 | 18.16 | 3.35
Tape | 1.35 | 9.29 | 5.91
Tape-dispenser | 5.22 | 14.38 | 4.55
Trowel | 0.54 | 16.87 | 1.01
Wrecker-Pry-Molding-bar | 2.15 | 13.45 | 0.00
Wrench | 4.98 | 14.53 | 9.48
mAP | 2.76 | 9.59 | 2.64

TABLE V: Category-wise AP and mAP (as commonly used in analyzing the performance of object detection methods) values of the baseline networks.

IV-C Sample Tool Detection Results

In Figure 3, we display detection results for some of the challenging scenes from Figure 1. We see that although many tools are detected, many are missed as well.

Fig. 3: Sample detections on a few of the challenging scenes in Figure 1.

IV-D Usecase Results: “Wearing Helmet?”

In this usecase, we analyze the method proposed in Section III-B. As described in Section IV-A, we have selected a subset from the METU-ALET dataset for this. The CNN architecture in Table IV obtained 88% accuracy on the test set, and as illustrated with some examples in Figure 4, we see that a tool detector and a human detector & pose estimator can be used to easily identify whether a human is wearing a helmet or not.

Fig. 4: The detection results of the “wearing helmet?” CNN network. [Best viewed in color]

V Conclusion

In this paper, we have introduced an extensive dataset for tool detection in the wild. Moreover, we formed a baseline by training and testing three widely used state-of-the-art deep object detectors, namely Faster R-CNN [24], YOLO [23], and RetinaNet [16]. We demonstrated that such detectors especially have trouble finding tools whose appearance is highly affected by viewpoint changes and tools that resemble parts of other tools.

Moreover, we have provided a very practical yet critical usecase for human-robot collaborative scenarios. Combining the detected “helmet” categories with the detection results of a human detector & pose estimator, we have demonstrated how our dataset can be used for practical applications other than merely detecting tools in an environment.

Acknowledgment

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) through the project “CIRAK: Compliant robot manipulator support for montage workers in factories” (project no. 117E002). The numerical calculations reported in this paper were partially performed at TÜBİTAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources). We would like to thank Erfan Khalaji for his contributions to an earlier version of this work.

References

  • [1] P. Abelha and F. Guerin (2017) Learning how a tool affords by simulating 3d models from the web. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4923–4929. Cited by: §I-A, TABLE I.
  • [2] P. Abelha and F. Guerin (2017) Transfer of tool affordance and manipulation cues with 3d vision data. arXiv preprint arXiv:1710.04970. Cited by: §I-A.
  • [3] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) Benchmarking in manipulation research: the ycb object and model set and benchmarking protocols. arXiv preprint arXiv:1502.03143. Cited by: §I-A.
  • [4] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In European Conference on Computer Vision (ECCV), Cited by: TABLE I, §II-A.
  • [5] A. Dehban, L. Jamone, A. R. Kampff, and J. Santos-Victor (2016) A moderately large size dataset to learn visual affordances of objects and tools using icub humanoid robot. In ECCV Workshop on Action and Anticipation for Visual Learning, Cited by: §I-A, TABLE I.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    ,
    Cited by: §I-A.
  • [7] A. Dutta, A. Gupta, and A. Zisserman (2016) VGG image annotator (VIA). Note: http://www.robots.ox.ac.uk/~vgg/software/via/, Version: 2.0.5, Accessed: 27 Feb 2019. Cited by: §II-B.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §I-A, §II-C, §IV-B.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587. Cited by: §I-A, §I.
  • [10] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Note: https://github.com/facebookresearch/detectron Cited by: 1st item, 3rd item.
  • [11] R. Girshick (2015) Fast r-cnn. In IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §I-A, §I.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §I-A.
  • [13] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision (ACCV), pp. 548–562. Cited by: §I-A.
  • [14] C. C. Kemp and A. Edsinger (2006) Robot manipulation of human tools: autonomous detection and control of task relevant features. In Intl. Conference on Development and Learning, Vol. 42. Cited by: §I-A.
  • [15] M. Kocabas, S. Karagoz, and E. Akbas (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In European Conference on Computer Vision (ECCV), pp. 417–433. Cited by: §III-B.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: §I-A, §I, §III-A3, §V.
  • [17] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §I-A, §II-C.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §I-A, §III-A2.
  • [19] T. Mar, L. Natale, and V. Tikhanoff (2018) A framework for fast, autonomous and reliable tool incorporation on icub. Frontiers in Robotics and AI 5, pp. 98. Cited by: §I-A, §I.
  • [20] T. Mar, V. Tikhanoff, G. Metta, and L. Natale (2015) Multi-model approach based on 3d functional features for tool affordance learning in robotics. In IEEE-RAS International Conference on Humanoid Robots, pp. 482–489. Cited by: §I-A.
  • [21] T. Mar, V. Tikhanoff, and L. Natale (2018)

    What can i do with this tool? self-supervised learning of tool affordances from their 3-d geometry

    .
    IEEE Transactions on Cognitive and Developmental Systems 10 (3), pp. 595–610. Cited by: §I-A, §I.
  • [22] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos (2015) Affordance detection of tool parts from geometric features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1374–1381. Cited by: §I-A, §I-A, TABLE I, §I.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Cited by: §I-A, §III-A2, §V.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pp. 91–99. Cited by: §I-A, §I, §III-A1, §V.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I-A.
  • [26] M. Sun, G. Bradski, B. Xu, and S. Savarese (2010) Depth-encoded hough voting for joint object detection and shape recovery. In European Conference on Computer Vision (ECCV), pp. 658–671. Cited by: §I-A.
  • [27] N. Vaskevicius, K. Pathak, A. Ichim, and A. Birk (2012) The jacobs robotics approach to object recognition and localization in the context of the icra’11 solutions in perception challenge. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3475–3481. Cited by: §I-A.
  • [28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: §I-A.
  • [29] A. Yun (2019) YOLO v3. Note: https://github.com/andy-yun/pytorch-0.4-yolov3 Cited by: 2nd item.