
Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling

by   Seunghyeok Back, et al.
Gwangju Institute of Science and Technology

Instance-aware segmentation of unseen objects is essential for a robotic system in an unstructured environment. Although previous works achieved encouraging results, they were limited to segmenting only the visible regions of unseen objects. For robotic manipulation in a cluttered scene, amodal perception is required to handle objects that are occluded by others. This paper addresses Unseen Object Amodal Instance Segmentation (UOAIS) to detect 1) visible masks, 2) amodal masks, and 3) occlusions of unseen object instances. For this, we propose a Hierarchical Occlusion Modeling (HOM) scheme designed to reason about occlusion by assigning a hierarchy to the feature fusion and prediction order. We evaluated our method on three benchmarks (tabletop, indoor, and bin environments) and achieved state-of-the-art (SOTA) performance. Robot demos for picking up occluded objects, code, and datasets are available at





I Introduction

Segmentation of unseen objects is an essential skill for robotic manipulation in an unstructured environment. Recently, unseen object instance segmentation (UOIS) methods [danielczuk2019segmenting, xie2020best, xie2021unseen, xiang2020learning, durner2021unknown, back2020segmenting] have been proposed to detect unseen objects via category-agnostic instance segmentation by learning a concept of object-ness from large-scale synthetic data. However, these methods focus on perceiving only visible regions, while humans have the ability to infer the entire structure of occluded objects based on the visible structure [palmer1999vision, zhu2017semantic]. This capability, called amodal perception, can allow a robot to straightforwardly handle occlusions in a cluttered scene. Although recent works [wada2019joint, wada2018instance, inagaki2019detecting, price2019inferring, narasimhan2020seeing, qin2020s4g, li2021robotic] have shown the usefulness of amodal perception in robotics, they have been limited to perceiving bounded object sets, where prior knowledge about the manipulation target is given (e.g., labeling a task-specific dataset and training for specific objects and environments).

Fig. 1: Comparison of UOAIS and existing tasks. While other segmentation methods are limited to perceiving known object sets or detecting only visible masks of unseen objects, UOAIS aims to segment both visible and amodal masks of unseen object instances in unstructured clutter.

This work proposes unseen object amodal instance segmentation (UOAIS) to detect visible masks, amodal masks, and occlusion of unseen object instances (Fig. 1). Similar to UOIS, it performs category-agnostic instance segmentation to distinguish the visible regions of unseen objects. Meanwhile, UOAIS jointly performs two additional tasks: amodal segmentation and occlusion classification of the detected object instances. For this, we propose UOAIS-Net, which reasons about object occlusion via hierarchical occlusion modeling (HOM). The hierarchical fusion (HF) module in our model combines the features of the prediction heads according to their hierarchy, thereby allowing the model to consider the relationship between the visible mask, amodal mask, and occlusion. We trained the model on 50,000 photo-realistic RGB-D images to learn diverse object geometries and occlusion scenes and evaluated its performance in various environments. The experiments demonstrated that visible masks, amodal masks, and the occlusion of unseen objects could be detected in a single framework with state-of-the-art (SOTA) performance. The ablation studies demonstrated the effectiveness of the proposed HOM for UOAIS.

The contributions of this work are summarized as follows:

  • We propose a new task, UOAIS, to detect category-agnostic visible masks, amodal masks, and occlusion of arbitrary object instances in a cluttered environment.

  • We propose a HOM scheme to reason about the occlusion of objects by assigning the hierarchy to feature fusion and prediction order.

  • We introduce a large-scale photorealistic synthetic dataset named UOAIS-SIM and amodal annotations for the existing UOIS benchmark, OSD [richtsfeld2012segmentation].

  • We validated our UOAIS-Net on three benchmarks and showed the effectiveness of HOM by achieving state-of-the-art performance in both UOAIS and UOIS tasks.

  • We demonstrated a robotic application of UOAIS. Using UOAIS-Net, the object grasping order for retrieving the occluded objects in clutter can be easily planned.

II Related Work

Unseen Object Instance Segmentation. UOIS aims to segment the visible regions of arbitrary object instances in an image [xie2020best, xie2021unseen], and is useful for robotic tasks such as grasping [sundermeyer2021contact] and manipulating unseen objects [murali20206]. Many segmentation methods [felzenszwalb2004efficient, rusu2010semantic, richtsfeld2012segmentation, koo2014unsupervised, potapova2014incremental, pham2018scenecut] have been proposed to distinguish objects but segmentation in cluttered scenes is challenging, especially for objects with complex textures or that are under occlusion [suchi2019easylabel, pham2018scenecut]. To generalize over unseen objects and clutter scenes, recent UOIS methods [danielczuk2019segmenting, xie2020best, xie2021unseen, xiang2020learning, back2020segmenting, durner2021unknown] have trained category-agnostic instance segmentation models to learn the concept of object-ness from a large amount of domain-randomized [tobin2017domain] synthetic data. Although these methods are promising, they focus on segmenting only the visible part of objects. On the other hand, our method can jointly perform visible segmentation, amodal segmentation, and occlusion classification for unseen object instances.

Amodal Instance Segmentation. When humans perceive an occluded object, they can guess the entire structure even though part of it is invisible [palmer1999vision, zhu2017semantic]. To mimic this amodal perception ability, amodal instance segmentation [li2016amodal] has been proposed, in which the goal is to segment both the amodal and visible masks of each object instance in an image. The SOTA approaches are mainly built on visible instance segmentation [he2017mask, tian2019fcos] and perform amodal segmentation through the addition of an amodal module, such as amodal branch and invisible mask loss [follmann2019learning], multi-level coding (MLC) [qi2019amodal], refinement layers [xiao2020amodal], and occluder segmentation [ke2021deep]. They have demonstrated that it is possible to segment the amodal masks of occluded objects on various datasets [qi2019amodal, follmann2019learning]. However, these methods can detect only a particular set of trained objects and require additional training data to deal with unseen objects. In contrast, UOAIS learns to segment the category-agnostic amodal mask, reducing the need for task-specific datasets and model re-training.

Amodal Perception in Robotics. The amodal concept is useful for occlusion handling and recent studies have utilized amodal instance segmentation for robotic picking systems [wada2018instance, wada2019joint, inagaki2019detecting] to decide the proper picking order for target object retrieval. Amodal perception has also been applied to various robotics tasks including object search [price2019inferring, danielczuk2020x], grasping [qin2020s4g, wada2018instance, wada2019joint, inagaki2019detecting], and active perception [yang2019embodied, li2021robotic], but it is often limited to perceiving the amodality of specific object sets. The works most related to our method are [price2019inferring] and [agnew2020amodal]. These studies trained a 3D shape completion network that infers the occluded geometry of unseen objects based on visible observation for robotics manipulation in a cluttered scene. Although the reconstruction of the occluded 3D structure is valuable, their amodal reconstruction network requires an additional object instance segmentation module [pham2018scenecut, xie2020best] and only a single instance can be reconstructed on a single forward pass. Whereas their methods require a high computational cost, our method can directly predict the occluded regions of multiple object instances in a single forward pass; thus, it can be easily extended to various amodal robotic manipulations in the real world.

III Problem Statement

We formalize our problem using the following definitions.

  1. Scene: A simulated or real scene containing foreground object instances, background object instances, and a camera.

  2. State: The ground truth state of the foreground object instances in the scene, which is the set of the bounding box $B$, visible mask $V \in \{0,1\}^{W \times H}$, amodal mask $A \in \{0,1\}^{W \times H}$, occlusion $O$, and class $C$ captured by the camera. $W$ and $H$ are the width and height of the image. The occlusion $O$ and class $C$ denote whether the instance is occluded and whether the instance is a foreground object or not, respectively.

  3. Observation: An RGB-D image of the scene captured by the camera at a given pose, consisting of an RGB image and a depth image.

  4. Dataset and Object Models: A dataset, together with the set of object models used for the foreground and background object instances in that dataset.

  5. Known and Unseen Object: Consider a training set and a test set, and a function that is trained on the training set and then tested on the test set. If the object models of the test set do not appear among the object models of the training set, the objects in the test set are unseen objects, and the objects in the training set are known objects for that function.

The objective of UOAIS is to detect a category-agnostic visible mask, an amodal mask, and the occlusion of arbitrary objects. Thus, our paper aims to find a function that predicts the state of unseen objects from a given RGB-D observation.

Fig. 2: Architecture of our proposed UOAIS-Net.

UOAIS-Net consists of (1) an RGB-D fusion backbone for RoI RGB-D feature extraction and (2) an HOM Head for hierarchical prediction of the bounding box, visible mask, amodal mask, and occlusion. It was trained on synthetic RGB-D images and then tested on real cluttered scenes. Through occlusion modeling with an HF module, segmentation and occlusion prediction of unseen objects can be significantly enhanced.

IV Unseen Object Amodal Instance Segmentation

IV-A Motivation

To design an architecture for UOAIS, we first made the following observations based on the relationship of bounding box $B$, visible mask $V$, amodal mask $A$, and occlusion $O$.

  1. $B \rightarrow V, A, O$: Most SOTA instance segmentation methods [ren2015faster, he2017mask, lee2020centermask, qi2019amodal, chen2019hybrid, tian2019fcos] take a detect-then-segment approach: they detect the object region of interest (RoI) bounding box first, then segment the instance mask in the given RoI bounding box. We followed this paradigm by adopting Mask R-CNN [he2017mask] for UOAIS, and thus the visible, amodal, and occlusion predictions strongly depend on the bounding box prediction.

  2. $V \rightarrow A$: The amodal mask is the union of the visible mask and the invisible mask ($A = V \cup IV$). The visible mask is more obvious than the amodal and invisible masks; thus, segment the visible mask first, and then infer the amodal mask based on the segmented visible mask.

  3. $V, A \rightarrow O$: The occlusion is defined by the ratio of the visible mask to the amodal mask, i.e., if the visible mask equals the amodal mask ($V = A$), the object is not occluded ($O = 0$); thus, segment the visible and amodal masks first, and then classify the occlusion.

Based on these observations, we propose an HOM scheme for UOAIS: (1) detect the RoI bounding box $B$, (2) segment the visible mask $V$, (3) segment the amodal mask $A$, and (4) classify the occlusion $O$ ($B \rightarrow V \rightarrow A \rightarrow O$). For this, we propose UOAIS-Net, which employs the HOM scheme on top of Mask R-CNN [he2017mask] with an HF module. The HF module explicitly assigns a hierarchy to the feature fusion and prediction order and improves the overall performance.
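The mask relationships in observations 2 and 3 can be illustrated with a small NumPy sketch. It only shows the set relations between the masks; it is not the learned occlusion branch of UOAIS-Net. The function names and the `ratio_threshold` value are hypothetical choices for illustration.

```python
import numpy as np

def invisible_mask(visible: np.ndarray, amodal: np.ndarray) -> np.ndarray:
    """Hidden part of the object: pixels in the amodal mask but not in the visible mask."""
    return np.logical_and(amodal, np.logical_not(visible))

def is_occluded(visible: np.ndarray, amodal: np.ndarray, ratio_threshold: float = 0.95) -> bool:
    """Rule-based occlusion flag from the visible-to-amodal area ratio.

    ratio_threshold is an assumed value for this sketch, not one taken from the text.
    """
    amodal_area = int(amodal.sum())
    if amodal_area == 0:
        return False  # nothing detected, so nothing can be occluded
    ratio = float(np.logical_and(visible, amodal).sum()) / amodal_area
    return ratio < ratio_threshold

# Toy example: a 2x4 object whose right half is hidden behind another object.
amodal = np.zeros((4, 6), dtype=bool)
amodal[1:3, 1:5] = True
visible = amodal.copy()
visible[:, 3:] = False

hidden = invisible_mask(visible, amodal)
print(int(hidden.sum()), is_occluded(visible, amodal), is_occluded(amodal, amodal))
```

When the visible mask equals the amodal mask, the ratio is 1 and the object is reported as unoccluded, which matches observation 3.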


Overview. UOAIS-Net consists of (1) an RGB-D fusion backbone and (2) an HOM Head (Fig. 2). From an input RGB-D image , the RGB-D fusion backbone with a feature pyramid network (FPN) [seferbekov2018feature] extracts an RGB-D FPN feature. Next, the region proposal network (RPN) [ren2015faster] proposes possible object regions and the RoIAlign layer crops an RoI RGB-D FPN feature () size of . Then, the bounding box branch in the HOM Head performs a box regression and foreground classification. For positive RoIs, the HOM Head extracts an RoI feature () with dimensions of from the RGB-D FPN feature and predicts a visible mask , amodal mask , and occlusion hierarchically for each RoI via the HF module. is the number of channels.

Sim2Real Transfer. To generalize over unseen objects of various shapes and textures, our model is trained via Sim2Real transfer: a model is trained on large-scale synthetic data and then applied to real cluttered scenes. In this scheme, the domain gap between simulation and real scenes greatly affects the performance of the model [kim2020acceleration, xie2021unseen]. While depth transfers reasonably well from simulation to real scenes [mahler2019learning, danielczuk2019segmenting, back2020segmenting], training with only non-photorealistic RGB is often insufficient to segment real object instances [xie2020best, xie2021unseen]. To address this issue, we trained UOAIS-Net using photo-realistic synthetic images (UOAIS-Sim in Section IV-D) so that the Sim2Real gap could be significantly reduced.

RGB-D Fusion. Depth provides useful 3D geometric information for recognizing objects [xie2020best, xie2021unseen, back2020segmenting], but the joint use of RGB with depth is required to produce a precise segmentation boundary [hazirbas2016fusenet, park2017rdfnet, xie2020best], as depth is noisy, especially for transparent and reflective surfaces. To effectively learn discriminative RGB-D features, the RGB-D fusion backbone extracts RGB and depth features with a separate ResNet-50 [he2016deep] for each modality. Then, the RGB and depth features are fused into RGB-D features at multiple levels (C3, C4, C5) through concatenation and convolution, thereby reducing the number of channels to . Finally, the RGB-D features are fed into the FPN [seferbekov2018feature] and the RoIAlign layer [he2017mask] to form an RGB-D FPN feature.
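A minimal NumPy sketch of fusion by concatenation and convolution at several backbone levels is given below. It assumes a 1x1 convolution (a per-pixel linear map over channels); the kernel size, channel counts, spatial sizes, and the helper name `fuse_rgbd` are illustrative assumptions, and random arrays stand in for the ResNet-50 features.

```python
import numpy as np

def fuse_rgbd(rgb_feat: np.ndarray, depth_feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Concatenate two (C, H, W) feature maps along channels, then mix them
    with a 1x1 convolution, i.e. the same linear map applied at every pixel."""
    stacked = np.concatenate([rgb_feat, depth_feat], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)          # (C_out, H, W)

rng = np.random.default_rng(0)
out_channels = 8  # toy value, not the channel count used in the paper

# Toy stand-ins for the C3, C4, C5 outputs of the RGB and depth branches.
level_shapes = {"C3": (16, 32, 32), "C4": (32, 16, 16), "C5": (64, 8, 8)}

fused = {}
for name, (c, h, w) in level_shapes.items():
    rgb = rng.standard_normal((c, h, w))
    depth = rng.standard_normal((c, h, w))
    # Random weights stand in for learned convolution parameters.
    weight = rng.standard_normal((out_channels, 2 * c)) / np.sqrt(2 * c)
    fused[name] = fuse_rgbd(rgb, depth, weight)

for name, feat in fused.items():
    print(name, feat.shape)
```

Every level ends up with the same number of channels while keeping its own spatial resolution, which is the form an FPN expects as input.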

IV-C Hierarchical Occlusion Modeling

The HOM Head is designed to reason about the occlusion in an RoI by predicting the outputs in the order of bounding box, visible mask, amodal mask, and occlusion ($B \rightarrow V \rightarrow A \rightarrow O$). The HOM Head consists of (1) a bounding box branch, (2) a visible mask branch, (3) an amodal mask branch, and (4) an occlusion prediction branch. The HF module for each branch fuses all prior features into that branch.

Bounding Box Branch ($B$, $C$). The HOM Head first predicts the bounding box $B$ and the class $C$ from the RoI feature given by the RPN. The structure of the bounding box branch follows a standard localization layer in [he2017mask]. The RoI feature is fed into two fully connected (FC) layers, and then $B$ and $C$ are predicted by the following FC layer. For category-agnostic segmentation, the class $C$ is binary (foreground or background), so that all foreground instances are detected.

HF module. For each positive RoI, the HOM Head performs visible segmentation, amodal segmentation, and occlusion classification sequentially with the HF module. Following the first observation in Section IV-A, we first provide a box feature to all subsequent branches so that their predictions are conditioned on the bounding box. This is similar to MLC [qi2019amodal], which provides a box feature to the mask predictions, allowing the mask layers to exploit global information and enhancing the segmentation. The RoI feature is fed into a deconvolution layer, and then the upsampled RoI feature with a size of is forwarded to three convolutional layers. The output of this operation is used as the box feature.

Following the hierarchy of the HOM scheme, the HF module for each branch fuses the RoI feature $F_{RoI}$ with the box feature $F_B$ and the features from the prior branches as follows:

$$F_V = \mathcal{H}_V(F_{RoI}, F_B), \qquad F_A = \mathcal{H}_A(F_{RoI}, F_B, F_V), \qquad F_O = \mathcal{H}_O(F_{RoI}, F_B, F_V, F_A),$$

where $F_V$, $F_A$, and $F_O$ are the visible, amodal, and occlusion features, respectively, and $\mathcal{H}_V$, $\mathcal{H}_A$, and $\mathcal{H}_O$ are the HF modules for the visible, amodal, and occlusion branches. Specifically, the HF module fuses all inputs by concatenating them along the channel dimension and passing the result to three 3x3 convolutional layers to reduce the number of channels to . The result is then forwarded to three convolutional layers to extract the task-relevant feature in each branch. Finally, each branch of the HOM Head outputs its final prediction with its prediction layer. The loss for HOM Head is


where , , and are the prediction layers for the visible mask, amodal mask, and occlusion, respectively. We used a deconvolutional layer for the visible and amodal mask prediction layers and an FC layer for the occlusion prediction layer. The total loss for UOAIS-Net is , where the regression and classification losses are from [he2017mask].
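The hierarchical data flow of the HF module can be sketched as follows. This is only a shape-level illustration under stated assumptions: a single random 1x1 channel-mixing step replaces the learned stacks of 3x3 convolutions described above, and the channel count, spatial size, and the name `hf_module` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
CHANNELS, SIZE = 4, 14  # toy values chosen for this sketch

def hf_module(inputs):
    """Fuse features by channel concatenation followed by channel reduction.

    A random 1x1 linear map plus ReLU stands in for the learned convolution
    layers; only the dependency structure between branches is meaningful here.
    """
    stacked = np.concatenate(inputs, axis=0)                    # (k * C, H, W)
    weight = rng.standard_normal((CHANNELS, stacked.shape[0]))
    reduced = np.einsum("oc,chw->ohw", weight, stacked)         # back to (C, H, W)
    return np.maximum(reduced, 0.0), stacked.shape[0]

roi_feat = rng.standard_normal((CHANNELS, SIZE, SIZE))  # RoI feature
box_feat = rng.standard_normal((CHANNELS, SIZE, SIZE))  # box feature

# Each branch sees the RoI feature, the box feature, and all prior branch features.
visible_feat, n_visible_in = hf_module([roi_feat, box_feat])
amodal_feat, n_amodal_in = hf_module([roi_feat, box_feat, visible_feat])
occlusion_feat, n_occlusion_in = hf_module([roi_feat, box_feat, visible_feat, amodal_feat])

print(n_visible_in, n_amodal_in, n_occlusion_in)
```

The growing number of input channels per branch reflects the hierarchy: later branches are conditioned on everything predicted before them.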

Architecture Comparison. We compare the prediction heads of amodal instance segmentation methods in Fig. 3. From the cropped RoI feature, Amodal MRCNN [follmann2019learning] outputs all the predictions with separate branches, without considering the relationship between them. ORCNN [follmann2019learning] constrains the invisible mask to be the amodal mask minus the visible mask, but their features are not interwoven. ASN [qi2019amodal] fuses the box features into the mask branches via multi-level coding, but the visible and amodal features are extracted in parallel. In contrast, all the branches in UOAIS-Net are hierarchically connected via the HF module, so that the predictions follow the order of occlusion reasoning.

Fig. 3: Prediction head comparison. a) Amodal MRCNN [follmann2019learning], b) ORCNN [follmann2019learning], c) ASN [qi2019amodal], d) Ours. UOAIS-Net hierarchically fuses the box, visible, amodal and occlusion features (), while the visible and amodal features in other methods [follmann2019learning, qi2019amodal] are indirectly related to each other. denotes an RoI feature after the RoIAlign.

Implementation Details. The model was trained on UOAIS-Sim dataset following the standard schedule in [wu2019detectron2] for iterations with SGD [zinkevich2010parallelized] using the learning rate of . We applied color [liu2016ssd], depth [zakharov2018keep], and crop augmentation. Training took about h on a single Tesla A100, and the inference took s per image (1,000 iterations) on a Titan XP. For it to serve as a general object instance detector, only the plane and bin were set to background objects and all other objects are set to foreground objects ; thus, it detected all instances in the image except for the plane and bin. For the cases requiring the task-specific foreground object selection [richtsfeld2012segmentation], we trained a binary segmentation model [wu2020cgnet] on the TOD dataset [xie2020best], thereby enabling real-time foreground segmentation ( ms for single forward pass) with less than 0.5 M parameters.

Fig. 4: Photo-realistic synthetic RGB images of UOAIS-Sim in plane surfaces (top row) and bin (bottom row).

| Method | Hierarchy Order | A: F (OV) | A: F (BO) | A: F@.75 | IV: F (OV) | IV: F (BO) | IV: F@.75 | O: F | O: ACC | V: F (OV) | V: F (BO) | V: F@.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amodal MRCNN [follmann2019learning] | | 82.5 | 66.7 | 82.7 | 51.1 | 28.2 | 35.5 | 72.3 | 75.9 | 83.3 | 69.7 | 75.1 |
| ORCNN [follmann2019learning] | | 83.1 | 67.2 | 84.2 | 48.7 | 23.7 | 29.1 | 63.6 | 70.0 | 83.1 | 70.0 | 75.9 |
| ASN [qi2019amodal] | | 80.6 | 66.9 | 84.6 | 41.9 | 20.8 | 30.5 | 59.6 | 66.3 | 85.0 | 72.8 | 78.9 |
| UOAIS-Net (Ours) | $B \rightarrow V \rightarrow A \rightarrow O$ | 82.2 | 68.7 | 84.1 | 55.3 | 32.3 | 42.9 | 82.1 | 90.8 | 85.2 | 73.1 | 79.1 |

TABLE I: UOAIS performances on OSD [richtsfeld2012segmentation]-Amodal for the amodal mask (A), invisible mask (IV), occlusion (O), and visible mask (V). All methods are trained with RGB-D UOAIS-Sim. { } denotes that they are predicted in parallel. $\rightarrow$ refers to the hierarchy in prediction heads. OV: Overlap, BO: Boundary.
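
| Method | Input | OSD Overlap P | OSD Overlap R | OSD Overlap F | OSD Boundary P | OSD Boundary R | OSD Boundary F | OSD F@.75 | OCID Overlap P | OCID Overlap R | OCID Overlap F | OCID Boundary P | OCID Boundary R | OCID Boundary F | OCID F@.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UOIS-Net-2D [xie2020best] | RGB-D | 80.7 | 80.5 | 79.9 | 66.0 | 67.1 | 65.6 | 71.9 | 88.3 | 78.9 | 81.7 | 82.0 | 65.9 | 71.4 | 69.1 |
| UOIS-Net-3D [xie2021unseen] | RGB-D | 85.7 | 82.5 | 83.3 | 75.7 | 68.9 | 71.2 | 73.8 | 86.5 | 86.6 | 86.4 | 80.0 | 73.4 | 76.2 | 77.2 |
| UCN [xiang2020learning] | RGB-D | 84.3 | 88.3 | 86.2 | 67.5 | 67.5 | 67.1 | 79.3 | 86.0 | 92.3 | 88.5 | 80.4 | 78.3 | 78.8 | 82.2 |
| Mask R-CNN [he2017mask] | RGB-D | 83.8 | 84.3 | 83.9 | 70.3 | 72.7 | 71.0 | 77.8 | 67.6 | 83.7 | 69.0 | 63.3 | 72.6 | 64.1 | 72.7 |
| UOAIS-Net (Ours) | RGB | 84.2 | 83.7 | 83.8 | 72.2 | 72.8 | 72.1 | 76.7 | 66.5 | 83.1 | 67.9 | 62.1 | 70.2 | 62.3 | 73.1 |
| UOAIS-Net (Ours) | Depth | 85.0 | 86.6 | 85.6 | 68.2 | 66.0 | 66.8 | 81.3 | 89.8 | 90.8 | 89.7 | 86.7 | 84.0 | 84.6 | 87.0 |
| UOAIS-Net (Ours) | RGB-D | 85.3 | 85.4 | 85.2 | 72.7 | 74.3 | 73.1 | 79.1 | 70.7 | 86.7 | 71.9 | 68.2 | 78.5 | 68.8 | 78.7 |
| UCN + Zoom-in [xiang2020learning] | RGB-D | 87.4 | 87.4 | 87.4 | 69.1 | 70.8 | 69.4 | 83.2 | 91.6 | 92.5 | 91.6 | 86.5 | 87.1 | 86.1 | 89.3 |

TABLE II: UOIS (visible mask) performances of UOAIS-Net and SOTA UOIS methods on OSD [richtsfeld2012segmentation] and OCID [suchi2019easylabel]. P: precision, R: recall, F: F-measure.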

IV-D Amodal Datasets

UOAIS-Sim. We generated 50,000 RGB-D images of 1,000 cluttered scenes with amodal annotations (Fig. 4). We used BlenderProc [denninger2019blenderproc] for photo-realistic rendering. A total of 375 3D textured object models from [kasper2012kit, singh2014bigbird, hodavn2020bop] were used, including household (e.g., cereal box, bottle) and industrial objects (e.g., bracket, screw) of various geometries. One to 40 objects were randomly dropped on randomly textured bin or plane surfaces. Images were captured at random camera poses. We split the images and the object models into training and validation sets with ratios of 9:1 and 4:1, respectively. The YCB object models [calli2015ycb] in BOP [hodavn2020bop] were excluded from the simulation so that they could be utilized as test objects in the real world.

OSD-Amodal. We introduced amodal annotations for the OSD [richtsfeld2012segmentation] dataset to benchmark the UOAIS performance. Several amodal instance datasets have been proposed, but [zhu2017semantic] and [qi2019amodal] consist mainly of outdoor scenes, and [follmann2019learning] does not contain depth. OSD is a UOIS benchmark consisting of RGB-D tabletop clutter scenes, but it does not provide amodal annotations. We manually annotated the amodal instance masks. To ensure consistent and precise annotation, three annotators cross-checked each annotation. We also annotated the relative occlusion order of the objects for future research.

V Experiments

Datasets. We compared our method with SOTA methods on three benchmarks. OSD [richtsfeld2012segmentation], consisting of 111 tabletop clutter images, was used to evaluate the UOAIS ($V$, $A$, $IV$, $O$) and UOIS ($V$) performances. Further, OCID and WISDOM were used to compare the UOIS performance. OCID provides 2,346 indoor images and WISDOM includes 300 top-view test images in bin picking. The image size of input was set to in OSD and OCID, and in WISDOM.

Metrics. In OSD and OCID, we measured the Overlap $P$/$R$/$F$, Boundary $P$/$R$/$F$, and $F@.75$ for the amodal, visible, and invisible masks [dave2019towards, xie2020best]. Overlap and Boundary evaluate the whole area and the sharpness of the prediction, respectively, where $P$, $R$, and $F$ are the precision, recall, and F-measure of instance masks after the Hungarian matching. $F@.75$ is the percentage of segmented objects with an Overlap F-measure greater than 0.75. For details, refer to [xie2020best] and [dave2019towards]. We also reported the accuracy ($ACC_O$) and F-measure ($F_O$) of occlusion classification, where $ACC_O = \delta / \alpha$, $F_O = 2 P_O R_O / (P_O + R_O)$, $P_O = \delta / \beta$, and $R_O = \delta / \gamma$. $\alpha$ is the number of matched instances after the Hungarian matching. $\beta$, $\gamma$, and $\delta$ are the numbers of occlusion predictions, ground truths, and correct predictions, respectively. In WISDOM, we measured the mask AP, AP50, and AR [lin2014microsoft]. All the metrics of the model trained on UOAIS-Sim are the average scores of three different random seeds. We used a light-weight network [wu2020cgnet] on OSD to filter out instances that are not on the table, as described in Section IV-C.
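As a rough illustration of these metrics, the sketch below computes the overlap precision, recall, and F-measure for one already-matched pair of masks, and occlusion classification scores over matched instances. It makes several assumptions: the Hungarian matching is taken as already done, the per-pair overlap is shown rather than the dataset-level aggregation of the cited works, "correct" for precision and recall means correctly predicted occluded instances, and accuracy counts all label agreements. The function names are hypothetical.

```python
import numpy as np

def overlap_prf(pred: np.ndarray, gt: np.ndarray):
    """Overlap precision, recall, and F-measure for one matched mask pair."""
    inter = float(np.logical_and(pred, gt).sum())
    precision = inter / float(pred.sum()) if pred.sum() else 0.0
    recall = inter / float(gt.sum()) if gt.sum() else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def occlusion_scores(pred_occ, gt_occ):
    """Occlusion accuracy and F-measure over matched instances (boolean lists)."""
    matched = len(gt_occ)
    n_pred = sum(pred_occ)                                  # predicted as occluded
    n_gt = sum(gt_occ)                                      # occluded in ground truth
    n_hit = sum(p and g for p, g in zip(pred_occ, gt_occ))  # occluded and predicted so
    precision = n_hit / n_pred if n_pred else 0.0
    recall = n_hit / n_gt if n_gt else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    # Assumption: accuracy is the share of matched instances with the right label.
    accuracy = sum(p == g for p, g in zip(pred_occ, gt_occ)) / matched if matched else 0.0
    return accuracy, f_measure

gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:4] = True      # 8 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool)
pred[0:3, 2:4] = True    # 6 predicted pixels, 4 of them inside the ground truth

p, r, f = overlap_prf(pred, gt)
acc_o, f_o = occlusion_scores([True, True, False, False, True],
                              [True, False, False, True, True])
print(round(p, 3), round(r, 3), round(f, 3), round(acc_o, 3), round(f_o, 3))
```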

Fig. 5: Comparison of UOAIS-Net and SOTA methods (left) and predictions of our method in various environments (right). Red region: pixels predicted as an invisible mask ($IV$). Red border: object predicted as occluded ($O = 1$).

Comparison with SOTA in UOAIS ($V$, $A$, $IV$, $O$). Table I compares UOAIS-Net and the SOTA amodal instance segmentation methods [follmann2019learning, qi2019amodal] on OSD-Amodal. The SOTA methods are trained on UOAIS-Sim RGB-D images with the same hyper-parameters as UOAIS-Net. The only difference between the models is the prediction heads (Fig. 3). The occlusion predictions for Amodal MRCNN and ASN are decided using the ratio of the visible mask to the amodal mask ( if ). Our proposed UOAIS-Net outperforms the others on the invisible mask, occlusion, and visible mask metrics and achieves performance similar to ASN on the amodal mask, which shows the effectiveness of the HOM Head.

| Method | Input | Train Dataset | AP | AP50 | AR |
|---|---|---|---|---|---|
| Mask R-CNN [danielczuk2019segmenting] | RGB | WISDOM-Sim | 38.4 | - | 60.8 |
| Mask R-CNN [danielczuk2019segmenting] | Depth | WISDOM-Sim | 51.6 | - | 64.7 |
| Mask R-CNN [danielczuk2019segmenting] | RGB | WISDOM-Real | 40.1 | 76.4 | - |
| D-SOLO [wang2020solo] | RGB | WISDOM-Real | 42.0 | 75.1 | - |
| PPIS [ito2020point] | RGB | WISDOM-Real | 52.3 | 82.8 | - |
| Mask R-CNN | RGB | UOAIS-Sim | 60.6 | 90.3 | 68.8 |
| Mask R-CNN | Depth | UOAIS-Sim | 56.5 | 85.9 | 65.4 |
| Mask R-CNN | RGB-D | UOAIS-Sim | 63.8 | 91.8 | 70.5 |
| UOAIS-Net (Ours) | RGB-D | UOAIS-Sim | 65.9 | 93.3 | 72.2 |

TABLE III: UOIS (visible mask) performances in WISDOM [danielczuk2019segmenting]

Comparison with SOTA in UOIS ($V$). Table II shows the visible segmentation performances on OSD and OCID. The metrics of [xie2020best, xiang2020learning, xie2021unseen] are taken from their papers, where non-photorealistic images are used. Mask R-CNN [he2017mask] trained on UOAIS-Sim was also used as a baseline. UOAIS-Net achieved performance on par with the SOTA UOIS methods. Note that our method can detect the amodal masks and occlusions, while the others are limited to segmenting the visible masks only. UCN with Zoom-in refinement [xiang2020learning] achieved better performance than ours on OCID, but it requires more computation ( s total, : number of instances) than our method ( s total). Also, the loss function of UCN considers only visible regions by pushing the pixels from the same object to be close in the feature embedding space; thus, an extra clustering network and loss function would have to be added for amodal perception. UOAIS-Net with depth performs better than the RGB-D model on OCID, as OCID includes much more complex textured backgrounds than OSD. Table III compares the visible segmentation performance on WISDOM, where the metrics of the other methods are from [danielczuk2019segmenting, ito2020point], and the backbone for [danielczuk2019segmenting] is ResNet-32 [he2016deep]. UOAIS-Net outperforms all other methods, which shows that it also generalizes well to the bin environment. The qualitative comparisons of the models trained on RGB-D images are shown in Fig. 5.

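| Overall score | A: F (OV) | A: F@.75 | IV: F (OV) | IV: F@.75 | O: ACC | V: F (OV) | V: F@.75 |
|---|---|---|---|---|---|---|---|
| 0.41 | 80.8 | 84.9 | 54.1 | 43.3 | 89.0 | 85.2 | 79.8 |
| 0.40 | 80.7 | 83.7 | 54.3 | 44.1 | 91.1 | 85.3 | 78.7 |
| 0.49 | 81.4 | 83.2 | 54.6 | 42.0 | 90.8 | 85.6 | 79.7 |
| 0.48 | 81.9 | 84.0 | 56.3 | 44.6 | 89.4 | 84.9 | 78.6 |
| 0.49 | 82.4 | 84.0 | 55.5 | 42.1 | 90.4 | 85.2 | 78.8 |
| 0.57 | 82.2 | 84.1 | 55.3 | 42.9 | 90.8 | 85.2 | 79.1 |

TABLE IV: Ablation of hierarchy order on OSD-Amodal

| Overall score | A: F (OV) | A: F@.75 | IV: F (OV) | IV: F@.75 | O: ACC | V: F (OV) | V: F@.75 |
|---|---|---|---|---|---|---|---|
| 0.17 | 81.9 | 84.3 | 51.3 | 38.1 | 87.5 | 83.6 | 76.2 |
| 0.47 | 81.1 | 84.4 | 53.0 | 41.6 | 89.1 | 84.1 | 77.9 |
| 0.61 | 80.5 | 84.7 | 53.1 | 40.0 | 90.2 | 84.6 | 79.1 |
| 0.85 | 82.2 | 84.1 | 55.3 | 42.9 | 90.8 | 85.2 | 79.1 |

TABLE V: Ablation of feature fusion on OSD-Amodal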

Ablation of Hierarchy Order. Table IV shows the ablation of hierarchy orders in UOAIS-Net on OSD-Amodal with RGB-D input. The overall score was computed by averaging the normalized values of each column. Consistent with our motivation, the model with the HOM scheme ($B \rightarrow V \rightarrow A \rightarrow O$) achieved the best overall score, while the model with the reverse scheme showed the worst performance. This indicates that the proposed HOM scheme is effective in modeling the occlusion of objects.
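The overall score can be approximated from the metric columns of Table IV as sketched below. The sketch assumes min-max normalization per column, which is one plausible reading of "normalized value"; the exact normalization is not stated in the text. Because the table entries are rounded, the recomputed scores come out close to the reported ones rather than identical.

```python
import numpy as np

# Metric columns of Table IV, one row per hierarchy order (overall score excluded).
metrics = np.array([
    [80.8, 84.9, 54.1, 43.3, 89.0, 85.2, 79.8],
    [80.7, 83.7, 54.3, 44.1, 91.1, 85.3, 78.7],
    [81.4, 83.2, 54.6, 42.0, 90.8, 85.6, 79.7],
    [81.9, 84.0, 56.3, 44.6, 89.4, 84.9, 78.6],
    [82.4, 84.0, 55.5, 42.1, 90.4, 85.2, 78.8],
    [82.2, 84.1, 55.3, 42.9, 90.8, 85.2, 79.1],
])

# Assumed normalization: rescale every column to [0, 1] across the rows.
col_min = metrics.min(axis=0)
col_max = metrics.max(axis=0)
normalized = (metrics - col_min) / (col_max - col_min)

# Overall score: mean of the normalized columns for each row.
overall = normalized.mean(axis=1)
print(np.round(overall, 2))
print("best row:", int(overall.argmax()), "worst row:", int(overall.argmin()))
```

Under this assumption the last row (the HOM order) obtains the highest score and the second row the lowest, in line with the ranking described in the text.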

Ablation of Feature Fusion. Table V shows the ablation of and in the HOM Head of UOAIS-Net on OSD-Amodal, where the input is RGB-D. The model with both and fusion achieved the best performance, which highlights the importance of dense feature fusion in the prediction heads.

Occluded Object Retrieval. We demonstrated an amodal robotic manipulation by combining UOAIS-Net with the 6-DOF grasp generator [sundermeyer2021contact] (Fig. 6). When the target object is occluded, grasping it directly is often infeasible due to collisions between the objects and the robot. Using UOAIS-Net, the grasping order to retrieve the target object can be easily determined: if the target object is occluded ($O = 1$), remove the nearest unoccluded objects ($O = 0$) until the target becomes unoccluded ($O = 0$). Then the target object can be easily grasped. We simply used a ResNet-34 [he2016deep] trained on 25 images per object to classify the objects.
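The retrieval rule above can be sketched as a simple planning loop. The scene model here is a toy stand-in: each object lists which other objects occlude it, and `perceive_occlusion` is a hypothetical placeholder for running UOAIS-Net on a new image after every grasp. Object names and positions are invented for illustration.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    xy: tuple                                      # planar position, toy units
    occluders: set = field(default_factory=set)    # names of objects covering this one

def perceive_occlusion(scene):
    """Placeholder for perception: an object is occluded while any occluder remains."""
    present = set(scene)
    return {name: bool(obj.occluders & present) for name, obj in scene.items()}

def plan_retrieval(scene, target):
    """Remove the nearest unoccluded object until the target itself is unoccluded."""
    scene = dict(scene)  # work on a copy so the caller's scene is untouched
    order = []
    while True:
        occluded = perceive_occlusion(scene)
        if not occluded[target]:
            order.append(target)
            return order
        candidates = [n for n, occ in occluded.items() if not occ and n != target]
        if not candidates:
            raise RuntimeError("no unoccluded object left to remove")
        nearest = min(candidates, key=lambda n: math.dist(scene[n].xy, scene[target].xy))
        order.append(nearest)
        del scene[nearest]  # simulate grasping the object and moving it away

scene = {
    "cup": SceneObject("cup", (0.0, 0.0), {"bowl"}),
    "bowl": SceneObject("bowl", (0.1, 0.0), {"box"}),
    "box": SceneObject("box", (0.2, 0.0)),
}
grasp_order = plan_retrieval(scene, "cup")
print(grasp_order)
```

In a real system the occlusion flags would be re-estimated from a fresh RGB-D frame after each grasp rather than read from a fixed occluder list.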

Fig. 6: Amodal robotic manipulation using UOAIS. To retrieve the target object (cup in (a)) in a cluttered scene (b), grasp the unoccluded objects sequentially (box in (c) and bowl in (d)), then the target object can be easily retrieved (e). The demonstration video is available at our project website:

VI Conclusion

This paper proposed a novel task, UOAIS, to jointly detect the visible masks, amodal masks, and occlusions of unseen objects in a single framework. Under the HOM scheme, UOAIS-Net could successfully detect and reason about the occlusion of unseen objects with SOTA performance on the three datasets. We also demonstrated occluded object retrieval with a robot. We hope UOAIS can serve as a simple and effective baseline for amodal robotic manipulation.

VII Acknowledgement

We give special thanks to Sungho Shin and Yeonguk Yu for their helpful comments and technical advice. This work was fully supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (Project Name: Shared autonomy based on deep reinforcement learning for responding intelligently to unfixed environments such as robotic assembly tasks, Project Number: 20008613). This work was also partially supported by the HPC Support project of the Korea Ministry of Science and ICT and NIPA.