Instance Segmentation of Visible and Occluded Regions for Finding and Picking Target from a Pile of Objects

by Kentaro Wada, et al.
The University of Tokyo

We present a robotic system for picking a target from a pile of objects that is capable of finding and grasping the target object by removing obstacles in the appropriate order. The fundamental idea is to segment instances with both visible and occluded masks, which we call `instance occlusion segmentation'. To achieve this, we extend an existing instance segmentation model with a novel `relook' architecture, in which the model explicitly learns the inter-instance relationship. Also, by using image synthesis, we make the system capable of handling new objects without human annotations. The experimental results show the effectiveness of the relook architecture when compared with a conventional model and of the image synthesis when compared to a human-annotated dataset. We also demonstrate the capability of our system to achieve picking a target in a cluttered environment with a real robot.


I Introduction

With recent progress in deep learning, especially convolutional neural networks, the robotics community has improved the ability of robots to find and pick various target objects in clutter [11, 9, 20, 25, 23], even when novel objects are included [24]. However, these works restrict the environment to one with few occluded target objects. Our goal in this work is to develop a framework that enables a robot to pick various target objects in the appropriate order in an environment with heavy occlusion (e.g., a pile of objects in a bin).

Picking target objects from a pile is especially difficult when the pile contains a vast variety of shapes and arrangements of deformable objects. Estimating the state of stacked objects is challenging because of 1) partial observation for mesh model fitting; 2) the infinite number of possible stacking patterns. For example, in Figure 2, the target object (Dumbbell) is located on the highest object (Binder) and under the obstacle object (Tennis Ball Holder). In this scenario, we expect the robot to pick the Tennis Ball Holder first and the Dumbbell afterward, without touching any other objects. However, detecting only the visible part is not enough to plan such a path to grasp the target object. Typical failure cases occur when the robot tries to pick the wrong object first: 1) the Binder, because it is the highest object; 2) the Dumbbell, because it is the target object (Figure 2).

In order to plan the picking order correctly in this situation, it is necessary to understand the scene as “target object Dumbbell is occluded by obstacle object Tennis Ball Holder”. This motivates us to introduce instance occlusion segmentation, in which the occluded region of each instance (i.e., object) is segmented as well as the visible one. The segmentation task is aimed at predicting invisible information (e.g., the occluded region) from visible information using some external knowledge (e.g., by learning from a dataset). This naturally leads us to apply a learning-based method as the solution to this problem.

Fig. 1: Our system recognizes occlusion status of objects via instance occlusion segmentation, and plans an appropriate picking order to pick the target object from the stacked objects.
Fig. 2: Typical case where occlusion understanding is necessary.
The target object (Dumbbell) is located on top of the highest object (Binder) and occluded by the obstacle object (Tennis Ball Holder).
In this case, we expect the robot to pick the Tennis Ball Holder first and the target object afterward. However, the robot mistakenly tries to pick the Binder first in height-order picking, or the Dumbbell in greedy picking.

In this paper, we propose a vision system that detects objects and recognizes their visible and occluded regions simultaneously. The system is designed to handle novel objects without requiring any task-specific training data (e.g., data collection and human annotation for objects in the bin). To achieve this, our system consists of two components: 1) Image synthesis of a stack of objects composed solely of instance images of these objects, which generates various patterns of stacking and occlusion with ground truth visible and occluded region masks for each instance; 2) An instance segmentation model with a novel ‘relook’ architecture designed for occlusion segmentation, in which we extend the recent works [8, 2] to recognize and use the density of instances for multi-class segmentation (visible and occluded) of each instance. We also propose a metric for instance occlusion segmentation, which extends the metric for common instance segmentation, where only the visible mask is segmented. In the experiments, we provide evaluation results of instance occlusion segmentation using various objects that were used in the Amazon Robotics Challenge (ARC), demonstrating the ability of our system to find and pick the target from a pile of objects.

Our main contributions are:

  • Image synthesis framework for learning instance segmentation including occlusion, which is a straightforward extension to recent works [5, 3];

  • A novel instance segmentation network model that uses the instance density to segment multi-class masks by extending recent works [8, 2];

  • A new metric for instance segmentation of multi-class masks extending recent work of instance segmentation of a single mask [12];

  • The integrated system of the above components and demonstration in the picking task from a pile of objects.

II Related Works

II-A Instance Segmentation

Instance segmentation is aimed at predicting the object region mask and its label at the same time. Since instance segmentation is a compound task of bounding box detection and pixel-wise semantic segmentation, previous works propose models that solve the two tasks sequentially or concurrently. In the sequential approach, past works [17, 18, 1] propose models that first generate mask proposals and classify them afterward. On the other hand, models for concurrent segmentation prediction and object detection have recently been proposed [13, 8, 2]. These models simultaneously predict object classes, boxes, and masks. These concurrent approaches have proved to be faster and more accurate than sequential ones.

In this paper, we extend the state-of-the-art works [8, 2] for instance occlusion segmentation by treating it as the multi-class (visible, occluded) extension of conventional instance segmentation (visible only). Compared to visible region segmentation, the relationship between nearby objects is more crucial when dealing with occlusion segmentation, since the occluded region of an instance is caused by the visible regions of other instances. This motivates us to extend the previous models in order to learn the connection between predicted object instances. Although previous models concurrently predict visible region masks for each box, there is no connection between them.

Specifically, multi-class extension of Mask R-CNN [8] is proposed as AffordanceNet [2]. It replaces sigmoid cross entropy loss with softmax to output multi-class region masks, so we refer to the AffordanceNet as Mask R-CNN Softmax later in this paper.

II-B Image Synthesis for Object Detection

Recently, deep learning based methods have improved many machine vision tasks, but since deep learning requires a significant amount of data, researchers are motivated to acquire training data through synthesis rather than human annotation. A naive approach to synthesizing training data is to use 3D mesh models to render 2D images. Past works use mesh models to learn viewpoint estimation [22], eye gaze direction estimation [21] and object detection [10] for objects in a 2D image. These works focus on developing ways to make synthetic images closer to real images. On the other hand, it has recently been found that synthesizing from only 2D instance images of objects is also effective for training detection models of object bounding boxes [5, 3]. The basic idea is that if we could generate infinitely many synthetic images at random and train a model with them, the model would generalize to real images. Our approach is closer to the latter, and we extend the past works to generate ground truth object masks (visible, occluded) in addition to the bounding box. Since it is impractical to gather realistic mesh models for various objects, 2D synthesis is more practical than 3D synthesis. We also show that a small number of instance images is enough to achieve human annotation-level detection performance, while past works [5, 3] use a huge number of instance images for each object.

III System Overview

Fig. 3: System overview.
At train time, the system receives instance images as input and trains the segmentation model.
At test time, the system receives a real image as input and outputs the scene occlusion status for the image.

Our proposed system (Figure 3) is composed of two components:

  • Instance occlusion segmentation neural networks trained using the generated images (IV);

  • 2D image synthesis of a cluttered scene generated from object instance images to handle various objects (V).

At training time, the system receives instance images of objects as the input and trains the instance occlusion segmentation model on synthetic images generated from them. At testing time, a real image is the input, and the occlusion status of the scene is predicted and output. Since our proposed system only requires instance images of the objects of interest, we can quickly gather them from the web or with a standard camera. This is important for enabling the vision system to handle various objects without human labeling, which is especially hard for instance occlusion segmentation, and for applications such as warehouse picking in e-commerce services.

The network output is the set of visible and occluded masks of each object instance and the occlusion status. The hierarchy of stacking, such as “Plate is on Box”, is interpreted automatically by scanning each pixel to see which object is visible and which objects are occluded at that pixel.
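The pixel scan described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `stacking_relations`, the dictionary layout, and the rule "any overlapping pixel counts" are assumptions made for the example.

```python
import numpy as np

def stacking_relations(visible, occluded):
    """Return the set of (upper, lower) pairs where `upper` hides part of `lower`.

    visible, occluded: dicts mapping an instance name to a boolean HxW mask.
    (Hypothetical data layout, chosen for this sketch.)
    """
    relations = set()
    for lower, occ_mask in occluded.items():
        for upper, vis_mask in visible.items():
            if upper == lower:
                continue
            # A pixel where `lower` is occluded and `upper` is visible means
            # that `upper` lies on top of `lower` at that pixel.
            if np.any(occ_mask & vis_mask):
                relations.add((upper, lower))
    return relations

# Toy 1x4 scene: the Plate covers the right half of the Box.
visible = {
    "Plate": np.array([[False, False, True, True]]),
    "Box": np.array([[True, True, False, False]]),
}
occluded = {
    "Plate": np.zeros((1, 4), dtype=bool),
    "Box": np.array([[False, False, True, True]]),
}
relations = stacking_relations(visible, occluded)
```

Here `relations` contains the single pair `("Plate", "Box")`, read as "Plate is on Box".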

IV Instance Occlusion Segmentation Neural Networks

IV-A Network Architecture

IV-A1 Mask R-CNN

We begin by reviewing the network architecture of Mask R-CNN [8]. Mask R-CNN is an extension of Faster R-CNN [19], which was previously proposed as a model for bounding box based object detection. To extend the model for instance segmentation, another branch that predicts the instance mask was added to the existing class and bounding box prediction branches. In the mask branch, the instance mask is predicted without predicting the object class, relying instead on the classification branch. This is achieved by predicting one instance mask per class in the mask branch and afterward extracting the mask of the class predicted by the classification branch. The object class, box, and mask branches predict in parallel after features are extracted by ROIAlign (an extension of ROIPooling [19]) at the Regions of Interest (ROIs) proposed by the preceding Region Proposal Network (RPN) [19].
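As a small illustration of the class-agnostic mask branch, the sketch below picks one mask out of the per-class masks using the classification branch's scores. The function name, array shapes, and the 0.5 threshold are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def select_class_mask(mask_logits, class_scores):
    """Pick the mask that belongs to the class chosen by the classifier.

    mask_logits: (K, H, W) array, one mask per object class.
    class_scores: (K,) array from the classification branch.
    """
    label = int(np.argmax(class_scores))
    # Per-pixel sigmoid, as in the original (binary) mask branch.
    prob = 1.0 / (1.0 + np.exp(-mask_logits[label]))
    return label, prob > 0.5

# Toy ROI with K=3 classes and a 2x2 mask resolution.
mask_logits = np.stack([
    np.full((2, 2), -2.0),
    np.full((2, 2), 2.0),
    np.full((2, 2), -2.0),
])
class_scores = np.array([0.1, 0.7, 0.2])
label, mask = select_class_mask(mask_logits, class_scores)
```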

Mask R-CNN is fast because the prediction for each instance is made in parallel and independently, using the features extracted by ROIAlign for each proposed ROI. However, since the prediction for each instance after ROI feature extraction is independent of other ROIs/instances, this model is not suited for learning the relationship among instances. Therefore, although effective for instance segmentation of the visible region, in which the relationship among instances is not so important, Mask R-CNN is not expected to be suitable for instance occlusion segmentation, in which the relationship between masks is crucial for correctly inferring the occlusion state of the instances. This leads us to introduce the inter-instance connection, which is described in IV-A2.


IV-A2 Relook Architecture: Inter-Instance Connection

In order to learn the relationship and dependency between instances, it is necessary to have connections among the representations of each instance in the neural network. To do this, we convert the instance masks predicted in the first stage (left in Figure 4) to a density map (middle in Figure 4) and use it to predict instance masks in the second stage. The instance masks of the first and second stages are added pixel-wise (fused) to give the final result, and the segmentation loss is computed for this fused result. The first stage is Mask R-CNN (softmax) (II-A). The second stage can be interpreted as a “relook” architecture for learning and predicting inter-instance connections, so we call this model Mask R-CNN (relook).

Fig. 4: Instance occlusion segmentation network with inter-instance connection.
The network architecture can be interpreted as a pipeline of two components:
1) Mask R-CNN (softmax) for per-instance prediction; 2) Instance density map for inter-instance connection.

The tensor shape of the layer output is shown in the bottom with (height, width, channel).

The final layer predicts three masks: visible, occluded and other (not part of the object), and the density map is also generated for each of the three. For learning the inter-instance connection, we concatenate the density map with the features extracted by the feature extractor. We use ResNet50-C4 and ResNet101-C4 (fourth stage output of ResNetX [16]) as in Mask R-CNN [8], and apply a convolutional layer. After that, ROIAlign, ‘res5’ (5th stage of ResNetX) and a deconvolutional layer are applied, sharing parameters with the first stage. On top of that, the final convolutional layer is applied to predict three masks for each instance. We use ReLU as the activation function for the hidden layers, and the kernel size of the convolutional layer is 3 for the hidden layers and 1 for the output layer.
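To make the density map concrete, the following sketch accumulates first-stage mask probabilities into an image-sized map and concatenates it with a feature map. It rests on several assumptions that the text does not state: masks are already resized to their ROI, accumulation is a plain sum, channel 0 stands for "visible", and the feature map has the same spatial size as the density map. The name `density_map` is hypothetical.

```python
import numpy as np

def density_map(rois, roi_masks, image_shape, n_mask_classes=3):
    """Accumulate per-ROI mask probabilities into an image-sized map.

    rois: list of (y1, x1, y2, x2) integer boxes in image coordinates.
    roi_masks: list of arrays shaped (y2 - y1, x2 - x1, n_mask_classes).
    """
    height, width = image_shape
    density = np.zeros((height, width, n_mask_classes), dtype=np.float32)
    for (y1, x1, y2, x2), mask in zip(rois, roi_masks):
        # Overlapping instances add up, so crowded regions get high density.
        density[y1:y2, x1:x2] += mask
    return density

# Two overlapping 2x2 ROIs on a 4x4 image.
rois = [(0, 0, 2, 2), (1, 1, 3, 3)]
roi_masks = [np.zeros((2, 2, 3), dtype=np.float32) for _ in rois]
for m in roi_masks:
    m[..., 0] = 1.0  # channel 0: assumed to be the "visible" probability

density = density_map(rois, roi_masks, image_shape=(4, 4))

# Stand-in for backbone features with matching spatial size (an assumption).
features = np.zeros((4, 4, 8), dtype=np.float32)
fused = np.concatenate([features, density], axis=2)
```

The pixel shared by both ROIs ends up with density 2 in the first channel, which is the kind of inter-instance signal the second stage can read.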

IV-B Implementation Detail

For the most part, we followed the implementation of the original Mask R-CNN [8] and Faster R-CNN [19] papers. For the input image size, we set 600 as the minimum axis size and 1000 as the maximum axis size, following [19]. We trained the RPN using three ratios and four anchor scales, with no threshold on ROI proposal size, which was set to 16 in [19]. We use 512 as the hidden channel size of the RPN, a value that was set to 1024 in [8], since we noticed that 1024 was too large for small datasets.

At the training time of Mask R-CNN, 512 ROI proposals, which have ratio, are used to train the classification and box regression branches, and the foreground ROIs are used to train the mask branch. At testing time, the model first regresses the class and bounding box of the instances and applies non-maximum suppression (NMS); it then uses the fewer and more accurate bounding boxes for ROI feature extraction by ROIAlign and the subsequent mask prediction. This causes a slight difference in the prediction process between training and testing; however, predicting masks from the more accurate bounding boxes gives better results. This difference could also have adverse effects because of the density map: at training time NMS is not applied, so the density map contains more instances than at testing time. However, somewhat surprisingly, this did not cause adverse effects; rather, we obtained better results with this prediction process than with the same prediction process at both training and testing time.

V Image Synthesis for Learning Instance Occlusion Segmentation

V-A Instance Image Gathering

We use the instance images distributed at the Amazon Robotics Challenge as ItemData for both known and novel objects that will appear during the task. There are only a few images of each object, taken from different viewpoints against a black background, as shown in the top images of Figure 5.

Fig. 5: Foreground mask extraction for instance image.
Sometimes foreground mask extraction through pixel value threshold fails even for images with simple background (middle images).
In this work we counteract this problem by training a small convolutional neural network model (bottom images) [15].

Although the background of the instance image is static, we found that it is difficult to extract the foreground mask for each instance through pixel value thresholding. This is especially true for objects with various colors (e.g., Robot Book) or transparent (e.g., Wine Glass), as shown in the middle of Figure 5.

A similar problem is also pointed out in previous work [3]: they had objects for which it is difficult to extract foreground mask by thresholding of the depth image, in particular with transparent objects such as Cola Bottle. To overcome this difficulty, they trained a small convolutional neural network (ConvNet) model [15] using the mask acquired from thresholding of the depth image as the ground truth. We applied the same approach, using the mask acquired from pixel value thresholding using instance images of 112 objects. The ground truth foreground mask is not perfect because it is not annotated by a human; however, the segmentation result in Figure 5 shows that the network successfully generalized the foreground segmentation. In this case, the automatically generated mask is usually bigger than the perfect mask, while in [3] the generated mask from depth image is usually smaller than the perfect one.

V-B Blending

Blending is necessary for 2D image synthesis to remove boundary artifacts when we put the instance images onto the background image. We apply Gaussian blurring following [3] with a random sigma ranging from 0 to 1.
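A minimal sketch of such blending is given below. It assumes that the Gaussian blur is applied to the foreground mask, which is then used as a soft alpha channel; the text only states that Gaussian blurring with a random sigma is applied, so this placement of the blur, and all function names, are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def blur_mask(mask, sigma):
    """Separable Gaussian blur of a 2-D mask (image assumed larger than the kernel)."""
    soft = mask.astype(np.float64)
    if sigma <= 0:
        return soft  # sigma of 0 means no blurring
    kernel = gaussian_kernel(sigma)
    for axis in (0, 1):
        soft = np.apply_along_axis(np.convolve, axis, soft, kernel, mode="same")
    return soft

def blend(background, instance, mask, sigma):
    """Paste `instance` onto `background` using the blurred mask as alpha."""
    alpha = blur_mask(mask, sigma)[..., np.newaxis]
    return alpha * instance + (1.0 - alpha) * background

rng = np.random.default_rng(0)
sigma = rng.uniform(0.0, 1.0)  # random sigma between 0 and 1, as in the text
background = np.zeros((16, 16, 3))
instance = np.ones((16, 16, 3))
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True
blended = blend(background, instance, mask, sigma)
```

Pixels well inside the mask keep the instance color, pixels far outside keep the background, and only a thin band around the boundary is mixed.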

V-C Data Augmentation

Data augmentation is crucial for generating synthetic data to train detection models that will generalize to real images, especially when we only have a few instance images, compared to the 600 images found in [3]. We applied color data augmentation in addition to the geometric augmentation present in past work [3]. To make the system more robust to changes in brightness and light reflection, we multiplied the S and V channels, after converting the RGB image into HSV color space, by a randomly selected scale between 0.5 and 2.0. For geometric augmentation, we applied an affine transformation with a random scale in the range from 0.5 to 1.0, translation from -16 to 16 pixels, rotation from -180 to 180 degrees, and shear from -16 to 16 degrees, which we believe is similar to the augmentation of 2D/3D rotation described in [3]. Figure 6 shows some examples of the augmented result.
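The parameter ranges above can be sampled as in the sketch below. The ranges come from the text; everything else is an assumption made for illustration: the S and V multipliers are drawn independently, the affine matrix composes shear, rotation and scale in that order, and saturated values are clipped to 1.

```python
import colorsys
import math
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the ranges in the text."""
    return {
        "scale": rng.uniform(0.5, 1.0),
        "translate_px": (rng.uniform(-16, 16), rng.uniform(-16, 16)),
        "rotate_deg": rng.uniform(-180, 180),
        "shear_deg": rng.uniform(-16, 16),
        "s_mult": rng.uniform(0.5, 2.0),  # assumed independent of v_mult
        "v_mult": rng.uniform(0.5, 2.0),
    }

def affine_matrix(params):
    """2x3 affine matrix: scale * rotation @ shear, plus translation."""
    theta = math.radians(params["rotate_deg"])
    shear = math.tan(math.radians(params["shear_deg"]))
    s = params["scale"]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    tx, ty = params["translate_px"]
    return [
        [s * cos_t, s * (cos_t * shear - sin_t), tx],
        [s * sin_t, s * (sin_t * shear + cos_t), ty],
    ]

def jitter_pixel(rgb, s_mult, v_mult):
    """Scale the S and V channels of one RGB pixel with values in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb(h, min(1.0, s * s_mult), min(1.0, v * v_mult))

rng = random.Random(0)
params = sample_augmentation(rng)
matrix = affine_matrix(params)
jittered = jitter_pixel((0.2, 0.4, 0.6), params["s_mult"], params["v_mult"])
```

In practice the matrix would be handed to an image-warping routine and the color jitter applied to the whole image; both are left out to keep the sketch dependency-free.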

Fig. 6: Data augmentation example.

V-D Image Synthesis of a Pile of Objects with Ground Truth

We generate the ground truth labels and masks in addition to the synthetic image of stacked objects. Figure 7 shows an example of the synthesized image and a visualization of the ground truth. While stacking the instance images onto the background image, we apply image blending (V-B) and data augmentation (V-C) to each instance image at random. The foreground mask of each instance image is acquired by the ConvNet (V-A), and since the filled region of already stacked instances is known, we can get masks of both visible and occluded regions (Figure 7(c) and 7(d)).
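The bookkeeping behind these ground truth masks can be sketched as follows, assuming each instance is given as a boolean foreground mask already placed in image coordinates and that instances are pasted in list order, so later ones land on top. The function name `stack_masks` is hypothetical.

```python
import numpy as np

def stack_masks(foregrounds):
    """Compute visible and occluded masks for instances pasted in order.

    foregrounds: list of boolean HxW masks; the last one ends up on top.
    Returns (visible, occluded), two lists aligned with `foregrounds`.
    """
    visible = [fg.copy() for fg in foregrounds]
    occluded = [np.zeros_like(fg) for fg in foregrounds]
    for top in range(1, len(foregrounds)):
        for below in range(top):
            # Pixels of `below` that were visible until `top` was pasted.
            hidden = visible[below] & foregrounds[top]
            visible[below] &= ~foregrounds[top]
            occluded[below] |= hidden
    return visible, occluded

# Toy 1x4 image: the second instance covers one pixel of the first.
first = np.array([[True, True, True, False]])
second = np.array([[False, False, True, True]])
visible, occluded = stack_masks([first, second])
```

For the first instance this yields a visible mask of `[T, T, F, F]` and an occluded mask of `[F, F, T, F]`; the second instance stays fully visible.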

(a) Synthetic Image.
(b) Instance Labels.
(c) Visible Mask of an Instance.
(d) Occluded Mask of an Instance.
Fig. 7: Image synthesis example.

VI A Metric for Instance Occlusion Segmentation

The instance segmentation model of multiple masks (affordance masks) in [2] was evaluated as a pixel-wise segmentation task to compare with other state-of-the-art semantic segmentation models. This was possible because their model segments only the visible region of each instance and there was no overlap among the masks. Also, their evaluation ignores the accuracy of the predicted number of instances: even if 3 instances are predicted for an image with 2 instances in the ground truth, there is no penalty for over-counting by one instance. This happens because the predicted bounding boxes and affordance masks are converted to an image of pixel-wise affordance labels in order to evaluate them in the manner of semantic segmentation. In order to attain the goal of picking targets from clutter, however, detecting the correct number of instances is also necessary, as well as the segmentation of visible and occluded masks. This motivated us to find another metric to evaluate instance occlusion segmentation.

Recently, the metric Panoptic Quality ($PQ$) was proposed to evaluate the accuracy of both detection and segmentation in instance segmentation [12]. $PQ$ is represented as the multiplication of Detection Quality ($DQ$) and Segmentation Quality ($SQ$):

$$PQ = \underbrace{\frac{\sum_{(p, g) \in TP} IoU(p, g)}{|TP|}}_{SQ} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{DQ}$$

where $TP$, $FP$ and $FN$ represent the sets of true positive, false positive and false negative instances, and $p$, $g$ represent predicted and ground truth instances, respectively. $IoU$ is the intersection over union of the predicted and ground truth masks for a single mask-class:

$$IoU(p, g) = \frac{|M_p \cap M_g|}{|M_p \cup M_g|}$$

where $M_p$ and $M_g$ are the predicted and ground truth masks for a single mask-class. $PQ$ is computed for each object class, and the $PQ$s for all classes are averaged as “mean of PQ” ($mPQ$). For the computation of $TP$, we use the visible masks of the predicted and ground truth instances to find the matched instances, with an $IoU$ threshold of 0.5 between these masks.

Since $PQ$ was proposed as the metric of instance segmentation, which is a single mask segmentation for each object, we extended the metric to be able to evaluate multi-mask segmentation for each instance as $PQ^{\mathrm{multi}}$:

$$PQ^{\mathrm{multi}} = \frac{\sum_{(p, g) \in TP} mIoU(p, g)}{|TP|} \times \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}, \qquad mIoU(p, g) = \frac{1}{|C|} \sum_{c \in C} IoU_c(p, g)$$

$C$ represents the set of possible instance masks, of which we have three in instance occlusion segmentation: background, visible and occluded, and $IoU_c$ is the $IoU$ for mask class $c$. $mIoU$ (mean of $IoU$) is the $IoU$ averaged over the mask classes $C$. $IoU$ and $mIoU$ have been used in previous works on semantic segmentation [4, 14] to evaluate the accuracy of predicted masks by comparing them with the ground truth, so we believe $PQ^{\mathrm{multi}}$ is an appropriate metric for instance occlusion segmentation. In the following, we refer to $PQ^{\mathrm{multi}}$ as $PQ$ for simplicity.
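The metric can be sketched for a single object class as below: instances are matched on their visible masks, and the segmentation term averages the IoU over the mask classes. The dictionary layout, the greedy matching loop, the strict "greater than 0.5" comparison, and the convention that two empty masks have an IoU of 1 are assumptions made for this example.

```python
import numpy as np

MASK_CLASSES = ("background", "visible", "occluded")

def iou(a, b):
    """Intersection over union of two boolean masks (empty vs. empty -> 1.0, by assumption)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 1.0

def multi_mask_pq(preds, gts):
    """Return (PQ, SQ, DQ) for one object class.

    preds, gts: lists of dicts mapping each mask class to a boolean mask.
    """
    matched, miou_sum, tp = set(), 0.0, 0
    for p in preds:
        for j, g in enumerate(gts):
            # Matching uses the visible masks only, with an IoU threshold of 0.5.
            if j not in matched and iou(p["visible"], g["visible"]) > 0.5:
                miou_sum += np.mean([iou(p[c], g[c]) for c in MASK_CLASSES])
                matched.add(j)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(gts) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = miou_sum / tp if tp else 0.0
    dq = tp / denom
    return sq * dq, sq, dq

def instance(visible, occluded):
    """Build one instance; background is everything not covered by it."""
    vis, occ = np.array([visible], dtype=bool), np.array([occluded], dtype=bool)
    return {"visible": vis, "occluded": occ, "background": ~(vis | occ)}

gt = instance([1, 1, 0, 0], [0, 0, 1, 0])
perfect = instance([1, 1, 0, 0], [0, 0, 1, 0])
spurious = instance([0, 0, 0, 1], [0, 0, 0, 0])
pq, sq, dq = multi_mask_pq([perfect, spurious], [gt])
```

With one perfect match and one spurious prediction, the segmentation term is 1 while the detection term drops to 1 / 1.5, so over-counting is penalized. Averaging the per-class values over all object classes would give the mean scores.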

VII Experiments

VII-A Instance Occlusion Segmentation of ARC2017 Objects

VII-A1 Objects for Evaluation

We evaluate our system with the 40 objects used in the Amazon Robotics Challenge 2017 (ARC2017) shown in Figure 8, each of which had 4-6 instance images distributed at the competition. We believe these objects have a broad diversity and are sufficient to demonstrate picking from a pile of objects.

In the following experiments, we use the instance images distributed at ARC2017 to evaluate our image synthesis framework (V), comparing it with human-annotated data that we created using the real objects. The instance occlusion segmentation model (IV) is trained with the human-annotated or synthetic training dataset and is evaluated with the human-annotated testing dataset.

Fig. 8: 40 objects used at ARC2017.

VII-A2 Human-annotated Dataset for Evaluation

For the evaluation of both the model and the image synthesis, we created a dataset for instance occlusion segmentation, shown in Figure 9. Since annotating an occluded region is challenging, we created a set of sequential camera frames in which a pile of objects is cleared from the top. By annotating the visible masks of objects in all frames of a video captured from a fixed camera, the visible masks are backtracked to acquire the occluded masks. We created 21 videos (split into train and test sets) in which the 40 objects appear 7 times each (3-5 times in the train split).
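One way to picture the backtracking is sketched below for a single object. It assumes the object does not move while the pile is cleared and approximates its full extent by the union of its visible masks over the frames in which it is still present; the actual annotation procedure is not detailed in the text, and the function name is hypothetical.

```python
import numpy as np

def backtrack_occluded(visible_per_frame):
    """Recover per-frame occluded masks of ONE object from its visible masks.

    visible_per_frame: list of boolean HxW masks, one per frame in which
    the object is still in the pile (fixed camera, static object assumed).
    """
    # The object becomes more visible as obstacles are removed, so the
    # union over frames approximates its full extent.
    full = np.logical_or.reduce(visible_per_frame)
    return [full & ~vis for vis in visible_per_frame]

# Frame 0: mostly covered. Frame 1: the obstacle has been removed.
frames = [
    np.array([[True, False, False]]),
    np.array([[True, True, True]]),
]
occluded_per_frame = backtrack_occluded(frames)
```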

(a) Visible Masks that includes Baby Wipes, Mouse Trap and Socks.
(b) Occluded mask of Baby Wipes.
(c) Occluded mask of Mouse Trap.
(d) Occluded mask of Socks.
Fig. 9: Annotations of instance occlusion segmentation.
Occluded mask of objects is visualized separately. Note that its color (Baby Wipes, Mouse Trap, Socks) corresponds to the mask of visualization of visible masks.
TABLE I: Instance occlusion segmentation on test data with ResNet101 backbone.

VII-A3 Model Evaluation: Softmax vs. Relook

We evaluated the proposed relook architecture with a human-annotated dataset for both training and testing, in comparison with the softmax extension of Mask R-CNN [2]. Table II shows the quantitative results, in which “Softmax” denotes Mask R-CNN Softmax, “Softmax_x2” denotes Softmax whose mask loss is scaled by 2, and “Relook” denotes Mask R-CNN with the relook architecture. We added the “Softmax_x2” experiment because our proposed model has two mask losses, at the first and second stages, which could make a comparison with “Softmax”, which has no scaling factor on the mask loss, unfair. Since the learning result of the RPN has a lot of noise caused by randomness, we show results averaged over 10 to 15 runs for each model. For a fair comparison, we use the ResNet50 feature extractor with a learning rate of 0.0375 on 3 GPUs; unless otherwise noted, the learning rate in the following experiments is scaled by the number of GPUs, as proposed in [6].
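The scaling of the learning rate with the number of GPUs can be written as a one-line rule. The sketch assumes the linear scaling rule of [6], and the per-GPU base rate of 0.0125 is only inferred from 0.0375 on 3 GPUs; it is not stated in the text.

```python
def scaled_learning_rate(n_gpus, base_lr_per_gpu=0.0375 / 3):
    """Linear scaling: the learning rate grows in proportion to the GPU count."""
    return base_lr_per_gpu * n_gpus

lr_three_gpus = scaled_learning_rate(3)
lr_one_gpu = scaled_learning_rate(1)
```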

The mPQ results in Table II show that the proposed relook architecture surpasses the existing models and is effective in performing instance occlusion segmentation. We also show results for mAP (mean average precision), which was used as the metric for instance segmentation of the visible mask in the VOC [4] and COCO [14] competitions. The mAP and mDQ results show that our model surpasses previous models in both instance segmentation of visible masks and detection quality.

Model            mPQ    mSQ    mDQ    mAP
Softmax [2]      13.4   24.7   40.7   46.1
Softmax_x2 [2]   13.8   25.2   41.9   47.1
Relook (Ours)    14.4   26.0   42.8   48.6
TABLE II: Softmax vs. Relook.

We also trained the network with a different backbone, ResNet101. Its qualitative results are shown in Table I.

VII-A4 Dataset Evaluation: Human-annotated vs. Synthetic

We evaluated our image synthesis framework by training the proposed model with either synthetic or human-annotated data. After training, the model performance is evaluated using the test split of the human-annotated dataset. Table III shows the results averaged over 3 experiments using the image synthesis, alongside the results from VII-A3. It shows that our image synthesis using 4 to 6 instance images (mPQ 14.2) is comparable to the result using a small human-annotated dataset (mPQ 14.4) for learning instance occlusion segmentation.

Model           Dataset     mPQ
Softmax [2]     Annotated   13.4
Softmax [2]     Synthetic   13.5
Relook (ours)   Annotated   14.4
Relook (ours)   Synthetic   14.2
TABLE III: Human-annotated vs. Synthetic.

VII-B Application to Warehouse Picking

As an application of our system, we demonstrate the task of picking a target object from a pile of objects, as shown in Figure 10(a). Figure 10(b) shows the workspace configuration: there is a bin that contains the target object, a bin into which obstacle objects are placed, and a cardboard box in which the robot places the target object. For this demonstration, we have extended the picking task we developed in previous works [23, 7]. Figure 11 shows the sequential frames of the recognition result (Figure 11(a)-(c)), the workspace overview (Figure 11(d)-(i)) and a close-up of the robot hand (Figure 11(j)-(o)). The recognition result shows the input image, the predicted visible and occluded masks, and the target object based on the occlusion understanding.

In this experiment, we set the threshold of the occlusion ratio to 0.3; the occlusion ratio is the ratio of occluded pixels to the total pixels of an instance. We set the threshold of the inter-instance occlusion ratio to 0.1 to judge that an instance is occluded by another. For the model in this demonstration, we use the ResNet101 feature extractor as the backbone of our proposed model to obtain better recognition accuracy. The successful clearing of obstacle objects and picking of targets in the demonstration shows that our proposed system is effective and applicable to the real-world picking task.
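A possible decision rule built on these two thresholds is sketched below. Only the threshold values (0.3 and 0.1) come from the text; the policy itself (follow the strongest occluder until reaching an object that is not occluded), the "greater than or equal" comparisons, and the data layout are assumptions for illustration.

```python
def next_pick(target, occlusion_ratio, inter_ratio,
              occ_thresh=0.3, inter_thresh=0.1):
    """Choose the object to pick next in order to reach `target`.

    occlusion_ratio: {name: occluded pixels / total pixels of that instance}
    inter_ratio: {(lower, upper): ratio by which `upper` occludes `lower`}
    """
    current, seen = target, set()
    while occlusion_ratio.get(current, 0.0) >= occ_thresh and current not in seen:
        seen.add(current)
        occluders = [
            (ratio, upper)
            for (lower, upper), ratio in inter_ratio.items()
            if lower == current and ratio >= inter_thresh and upper not in seen
        ]
        if not occluders:
            break
        # Move on to the object that hides the largest share of `current`.
        current = max(occluders)[1]
    return current

# Scene from Figure 2: the Tennis Ball Holder lies on the Dumbbell.
occlusion_ratio = {"Dumbbell": 0.4, "Tennis Ball Holder": 0.0, "Binder": 0.5}
inter_ratio = {
    ("Dumbbell", "Tennis Ball Holder"): 0.4,
    ("Binder", "Dumbbell"): 0.35,
}
first_pick = next_pick("Dumbbell", occlusion_ratio, inter_ratio)

# After the obstacle is removed, the target itself becomes pickable.
second_pick = next_pick("Dumbbell", {"Dumbbell": 0.0, "Binder": 0.5},
                        {("Binder", "Dumbbell"): 0.35})
```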

(a) The Scene of Stacked Objects.
(b) Workspace.
Fig. 10: Picking Demo Configuration.
Fig. 11: Picking Task Demonstration of a Target from a Pile of Objects.

VIII Conclusions

We presented a vision system that only requires a few instance images of objects to learn instance occlusion segmentation. The system consists of 1) Image synthesis with ground truth of occluded region mask of each instance; 2) Instance segmentation networks that learn inter-instance relationship, which is essential for the segmentation of occluded regions. We evaluated the proposed image synthesis and segmentation model via the ablation studies and presented the effectiveness of the proposed system in the real picking task from a pile of objects.


We thank Shun Hasegawa and Yuto Uchimi for contributions to the software and hardware development of the picking system for demonstration with real-world robot experiments.


  • [1] J. Dai, K. He, Y. Li, S. Ren, and J. Sun (2016) Instance-sensitive Fully Convolutional Networks. pp. 1–15. External Links: Document, 1603.08678, ISBN 978-3-319-46466-4, ISSN 16113349, Link Cited by: §II-A.
  • [2] T. Do, A. Nguyen, I. Reid, D. G. Caldwell, and N. G. Tsagarakis (2017) AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection. External Links: Link Cited by: 2nd item, §I, §II-A, §II-A, §II-A, §VI, §VII-A3, TABLE II, TABLE III.
  • [3] D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. External Links: Link Cited by: 1st item, §II-B, §V-A, §V-B, §V-C.
  • [4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2014) The Pascal Visual Object Classes Challenge: A Retrospective.

    International Journal of Computer Vision

    111 (1), pp. 98–136.
    External Links: Document, ISBN 0920-5691, ISSN 15731405 Cited by: §VI, §VII-A3.
  • [5] G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka (2017) Synthesizing Training Data for Object Detection in Indoor Scenes. External Links: 1702.07836, Link Cited by: 1st item, §II-B.
  • [6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017)

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    External Links: Document, 1706.02677, ISBN 9781601987167, ISSN 2167-3888, Link Cited by: §VII-A3.
  • [7] S. Hasegawa, K. Wada, Y. Niitani, K. Okada, and M. Inaba (2017) A Three-Fingered Hand with a Suction Gripping System for Picking Various Objects in Cluttered Narrow Space. pp. 1164–1171. External Links: Document, ISBN 9781538626818, ISSN 21530866 Cited by: §VII-B.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. External Links: 1703.06870 Cited by: 2nd item, §I, §II-A, §II-A, §II-A, §IV-A1, §IV-A2, §IV-B.
  • [9] C. Hernandez, M. Bharatheesha, W. Ko, H. Gaiser, J. Tan, K. van Deurzen, M. de Vries, B. Van Mil, J. van Egmond, R. Burger, M. Morariu, J. Ju, X. Gerrmann, R. Ensing, J. Van Frankenhuyzen, and M. Wisse (2017) Team delft’s robot winner of the amazon picking challenge 2016. Lecture Notes in Computer Science 9776 LNAI, pp. 613–624. External Links: Document, 1610.05514, ISBN 9783319687919, ISSN 16113349 Cited by: §I.
  • [10] S. Hinterstoisser, V. Lepetit, P. Wohlhart, and K. Konolige (2017) On Pre-Trained Image Features and Synthetic Images for Deep Learning. External Links: 1710.10710, Link Cited by: §II-B.
  • [11] R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock (2016) Probabilistic multi-class segmentation for the Amazon picking challenge. IEEE International Conference on Intelligent Robots and Systems (i), pp. 1–7. External Links: Document, ISBN 9781509037629, ISSN 21530866 Cited by: §I.
  • [12] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2018) Panoptic Segmentation. pp. 1–9. External Links: 1801.00868, Link Cited by: 3rd item, §VI.
  • [13] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2016) Fully Convolutional Instance-aware Semantic Segmentation. External Links: Link Cited by: §II-A.
  • [14] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: Common Objects in Context. External Links: Link Cited by: §VI, §VII-A3.
  • [15] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June, pp. 3431–3440. External Links: Document, 1411.4038, ISBN 9781467369640, ISSN 10636919 Cited by: Fig. 5, §V-A.
  • [16] V. Nair and G. E. Hinton (2010) Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning (3), pp. 807–814. External Links: Document, 1111.6189v1, ISBN 9781605589077, ISSN 1935-8237 Cited by: §IV-A2.
  • [17] P. O. Pinheiro, R. Collobert, and P. Dollár (2015) Learning to Segment Object Candidates. pp. 1–10. External Links: Document, 1506.06204, ISSN 10495258, Link Cited by: §II-A.
  • [18] P. O. Pinheiro, T. Y. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. Lecture Notes in Computer Science 9905 LNCS, pp. 75–91. External Links: Document, 1603.08695, ISBN 9783319464473, ISSN 16113349 Cited by: §II-A.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Document, 1506.01497, ISBN 0162-8828, ISSN 01628828 Cited by: §IV-A1, §IV-B.
  • [20] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schuller, and S. Behnke (2017) NimbRo picking: Versatile part handling for warehouse automation. Proceedings - IEEE International Conference on Robotics and Automation (May), pp. 3032–3039. External Links: Document, ISBN 9781509046331, ISSN 10504729 Cited by: §I.
  • [21] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2016) Learning from Simulated and Unsupervised Images through Adversarial Training. External Links: Document, 1612.07828, ISBN 978-1-5386-0457-1, ISSN 1063-6919, Link Cited by: §II-B.
  • [22] H. Su, C. R. Qi, Y. Li, and L. J. Guibas (2015) Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), pp. 2686–2694. External Links: Document, 1505.05641, ISBN 9781467383912, ISSN 15505499 Cited by: §II-B.
  • [23] K. Wada, K. Okada, and M. Inaba (2017) Probabilistic 3D multilabel real-time mapping for multi-object manipulation. In Proceedings of the International Conference on Robotics and Systems. Cited by: §I, §VII-B.
  • [24] A. Zeng, S. Song, K. Yu, E. Donlon, F. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N. Chavan Dafle, R. Holladay, I. Morona, P. Q. Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, and A. Rodriguez (2018) Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching. IEEE International Conference on Robotics and Automation (ICRA), in review. External Links: 1710.01330 Cited by: §I.
  • [25] A. Zeng, K. T. Yu, S. Song, D. Suo, E. Walker, A. Rodriguez, and J. Xiao (2017) Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. Proceedings - IEEE International Conference on Robotics and Automation, pp. 1386–1393. External Links: Document, 1609.09475, ISBN 9781509046331, ISSN 10504729 Cited by: §I.